Remove unnecessary files. by robertnishihara · Pull Request #4 · ray-project/ray

robertnishihara · 2016-10-26T05:47:18Z

No description provided.

Tuple keys in dicts can be serialized

(redo of #4 which was dropped)

* changes to min/max dataframes functions * max/min now return a Series - fixed tests to check equality in pandas series objects * added error checking for axis in min and max * updated error checking for axis in min/max

#4504) * Fixes Inconsistent weight assignment operations in DQNPolicyGraph (#4502) * Update dqn_policy_graph.py

� This is the 1st commit message: Rounding error fixes, removing cpu addition in cython and test fixes. � This is the commit message ray-project#2: Plumbing and support for dynamic custom resources. � This is the commit message ray-project#3: update cluster utils to use EntryType. � This is the commit message ray-project#4: Update node_manager to use new zero resource == deletion semantics.

[Hosted Dashboard] Persist data to S3

We encountered SIGSEGV when running Python test `python/ray/tests/test_failure_2.py::test_list_named_actors_timeout`. The stack is: ``` #0 0x00007fffed30f393 in std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::string const&) () from /lib64/libstdc++.so.6 #1 0x00007fffee707649 in ray::RayLog::GetLoggerName() () from /home/admin/dev/Arc/merge/ray/python/ray/_raylet.so #2 0x00007fffee70aa90 in ray::SpdLogMessage::Flush() () from /home/admin/dev/Arc/merge/ray/python/ray/_raylet.so #3 0x00007fffee70af28 in ray::RayLog::~RayLog() () from /home/admin/dev/Arc/merge/ray/python/ray/_raylet.so #4 0x00007fffee2b570d in ray::asio::testing::(anonymous namespace)::DelayManager::Init() [clone .constprop.0] () from /home/admin/dev/Arc/merge/ray/python/ray/_raylet.so #5 0x00007fffedd0d95a in _GLOBAL__sub_I_asio_chaos.cc () from /home/admin/dev/Arc/merge/ray/python/ray/_raylet.so #6 0x00007ffff7fe282a in call_init.part () from /lib64/ld-linux-x86-64.so.2 #7 0x00007ffff7fe2931 in _dl_init () from /lib64/ld-linux-x86-64.so.2 #8 0x00007ffff7fe674c in dl_open_worker () from /lib64/ld-linux-x86-64.so.2 #9 0x00007ffff7b82e79 in _dl_catch_exception () from /lib64/libc.so.6 #10 0x00007ffff7fe5ffe in _dl_open () from /lib64/ld-linux-x86-64.so.2 #11 0x00007ffff7d5f39c in dlopen_doit () from /lib64/libdl.so.2 #12 0x00007ffff7b82e79 in _dl_catch_exception () from /lib64/libc.so.6 #13 0x00007ffff7b82f13 in _dl_catch_error () from /lib64/libc.so.6 #14 0x00007ffff7d5fb09 in _dlerror_run () from /lib64/libdl.so.2 #15 0x00007ffff7d5f42a in dlopen@@GLIBC_2.2.5 () from /lib64/libdl.so.2 #16 0x00007fffef04d330 in py_dl_open (self=<optimized out>, args=<optimized out>) at /tmp/python-build.20220507135524.257789/Python-3.7.11/Modules/_ctypes/callproc.c:1369 ``` The root cause is that when loading `_raylet.so`, `static DelayManager _delay_manager` is initialized and `RAY_LOG(ERROR) << "RAY_testing_asio_delay_us is set to " << delay_env;` is executed. However, the static variables declared in `logging.cc` are not initialized yet (in this case, `std::string RayLog::logger_name_ = "ray_log_sink"`). It's better not to rely on the initialization order of static variables in different compilation units because it's not guaranteed. I propose to change all `RAY_LOG`s to `std::cerr` in `DelayManager::Init()`. The crash happens in Ant's internal codebase. Not sure why this test case passes in the community version though. BTW, I've tried different approaches: 1. Using a static local variable in `get_delay_us` and remove the global variable. This doesn't work because `init()` needs to access the variable as well. 2. Defining the global variable as type `std::unique_ptr<DelayManager>` and initialize it in `get_delay_us`. This works but it requires a lock to be thread-safe.

We encountered SIGSEGV when running Python test `python/ray/tests/test_failure_2.py::test_list_named_actors_timeout`. The stack is: ``` #0 0x00007fffed30f393 in std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::string const&) () from /lib64/libstdc++.so.6 ray-project#1 0x00007fffee707649 in ray::RayLog::GetLoggerName() () from /home/admin/dev/Arc/merge/ray/python/ray/_raylet.so ray-project#2 0x00007fffee70aa90 in ray::SpdLogMessage::Flush() () from /home/admin/dev/Arc/merge/ray/python/ray/_raylet.so ray-project#3 0x00007fffee70af28 in ray::RayLog::~RayLog() () from /home/admin/dev/Arc/merge/ray/python/ray/_raylet.so ray-project#4 0x00007fffee2b570d in ray::asio::testing::(anonymous namespace)::DelayManager::Init() [clone .constprop.0] () from /home/admin/dev/Arc/merge/ray/python/ray/_raylet.so ray-project#5 0x00007fffedd0d95a in _GLOBAL__sub_I_asio_chaos.cc () from /home/admin/dev/Arc/merge/ray/python/ray/_raylet.so ray-project#6 0x00007ffff7fe282a in call_init.part () from /lib64/ld-linux-x86-64.so.2 ray-project#7 0x00007ffff7fe2931 in _dl_init () from /lib64/ld-linux-x86-64.so.2 ray-project#8 0x00007ffff7fe674c in dl_open_worker () from /lib64/ld-linux-x86-64.so.2 ray-project#9 0x00007ffff7b82e79 in _dl_catch_exception () from /lib64/libc.so.6 ray-project#10 0x00007ffff7fe5ffe in _dl_open () from /lib64/ld-linux-x86-64.so.2 ray-project#11 0x00007ffff7d5f39c in dlopen_doit () from /lib64/libdl.so.2 ray-project#12 0x00007ffff7b82e79 in _dl_catch_exception () from /lib64/libc.so.6 ray-project#13 0x00007ffff7b82f13 in _dl_catch_error () from /lib64/libc.so.6 ray-project#14 0x00007ffff7d5fb09 in _dlerror_run () from /lib64/libdl.so.2 ray-project#15 0x00007ffff7d5f42a in dlopen@@GLIBC_2.2.5 () from /lib64/libdl.so.2 ray-project#16 0x00007fffef04d330 in py_dl_open (self=<optimized out>, args=<optimized out>) at /tmp/python-build.20220507135524.257789/Python-3.7.11/Modules/_ctypes/callproc.c:1369 ``` The root cause is that when loading `_raylet.so`, `static DelayManager _delay_manager` is initialized and `RAY_LOG(ERROR) << "RAY_testing_asio_delay_us is set to " << delay_env;` is executed. However, the static variables declared in `logging.cc` are not initialized yet (in this case, `std::string RayLog::logger_name_ = "ray_log_sink"`). It's better not to rely on the initialization order of static variables in different compilation units because it's not guaranteed. I propose to change all `RAY_LOG`s to `std::cerr` in `DelayManager::Init()`. The crash happens in Ant's internal codebase. Not sure why this test case passes in the community version though. BTW, I've tried different approaches: 1. Using a static local variable in `get_delay_us` and remove the global variable. This doesn't work because `init()` needs to access the variable as well. 2. Defining the global variable as type `std::unique_ptr<DelayManager>` and initialize it in `get_delay_us`. This works but it requires a lock to be thread-safe. Signed-off-by: Rohan138 <rapotdar@purdue.edu>

We encountered SIGSEGV when running Python test `python/ray/tests/test_failure_2.py::test_list_named_actors_timeout`. The stack is: ``` #0 0x00007fffed30f393 in std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::string const&) () from /lib64/libstdc++.so.6 ray-project#1 0x00007fffee707649 in ray::RayLog::GetLoggerName() () from /home/admin/dev/Arc/merge/ray/python/ray/_raylet.so ray-project#2 0x00007fffee70aa90 in ray::SpdLogMessage::Flush() () from /home/admin/dev/Arc/merge/ray/python/ray/_raylet.so ray-project#3 0x00007fffee70af28 in ray::RayLog::~RayLog() () from /home/admin/dev/Arc/merge/ray/python/ray/_raylet.so ray-project#4 0x00007fffee2b570d in ray::asio::testing::(anonymous namespace)::DelayManager::Init() [clone .constprop.0] () from /home/admin/dev/Arc/merge/ray/python/ray/_raylet.so ray-project#5 0x00007fffedd0d95a in _GLOBAL__sub_I_asio_chaos.cc () from /home/admin/dev/Arc/merge/ray/python/ray/_raylet.so ray-project#6 0x00007ffff7fe282a in call_init.part () from /lib64/ld-linux-x86-64.so.2 ray-project#7 0x00007ffff7fe2931 in _dl_init () from /lib64/ld-linux-x86-64.so.2 ray-project#8 0x00007ffff7fe674c in dl_open_worker () from /lib64/ld-linux-x86-64.so.2 ray-project#9 0x00007ffff7b82e79 in _dl_catch_exception () from /lib64/libc.so.6 ray-project#10 0x00007ffff7fe5ffe in _dl_open () from /lib64/ld-linux-x86-64.so.2 ray-project#11 0x00007ffff7d5f39c in dlopen_doit () from /lib64/libdl.so.2 ray-project#12 0x00007ffff7b82e79 in _dl_catch_exception () from /lib64/libc.so.6 ray-project#13 0x00007ffff7b82f13 in _dl_catch_error () from /lib64/libc.so.6 ray-project#14 0x00007ffff7d5fb09 in _dlerror_run () from /lib64/libdl.so.2 ray-project#15 0x00007ffff7d5f42a in dlopen@@GLIBC_2.2.5 () from /lib64/libdl.so.2 ray-project#16 0x00007fffef04d330 in py_dl_open (self=<optimized out>, args=<optimized out>) at /tmp/python-build.20220507135524.257789/Python-3.7.11/Modules/_ctypes/callproc.c:1369 ``` The root cause is that when loading `_raylet.so`, `static DelayManager _delay_manager` is initialized and `RAY_LOG(ERROR) << "RAY_testing_asio_delay_us is set to " << delay_env;` is executed. However, the static variables declared in `logging.cc` are not initialized yet (in this case, `std::string RayLog::logger_name_ = "ray_log_sink"`). It's better not to rely on the initialization order of static variables in different compilation units because it's not guaranteed. I propose to change all `RAY_LOG`s to `std::cerr` in `DelayManager::Init()`. The crash happens in Ant's internal codebase. Not sure why this test case passes in the community version though. BTW, I've tried different approaches: 1. Using a static local variable in `get_delay_us` and remove the global variable. This doesn't work because `init()` needs to access the variable as well. 2. Defining the global variable as type `std::unique_ptr<DelayManager>` and initialize it in `get_delay_us`. This works but it requires a lock to be thread-safe. Signed-off-by: Stefan van der Kleij <s.vanderkleij@viroteq.com>

Signed-off-by: Jack He <jackhe2345@gmail.com>

Signed-off-by: elliottower <elliot@elliottower.com>

#61034) Currently, there is a chance that a worker can crash on the `getenv` syscall from the otel lazy initialization. We found the race is between `setenv` on the user thread (`setenv(RBLN_DEVICES)`) and `getenv` on the worker internal thread. However, we can't forbid `setenv` on a user's thread; the only thing we can do is not call `getenv` once the user's thread starts. Here is the backtrace of the crash we found by intercepting the `getenv`: ``` [getenv_preload] setenv name=RBLN_DEVICES value= overwrite=1 [getenv_preload] setenv backtrace: #0 /home/ray/getenv_trace_preload.so(setenv+0x73) [0x748a77ea870b] #1 ray::IDLE(+0x224d5b) [0x59f10aeead5b] #2 ray::IDLE(+0x13dfc3) [0x59f10ae03fc3] #3 ray::IDLE(_PyEval_EvalFrameDefault+0x313) [0x59f10adf3703] #4 ray::IDLE(+0x184bfd) [0x59f10ae4abfd] #5 ray::IDLE(+0x19da04) [0x59f10ae63a04] #6 ray::IDLE(_PyEval_EvalFrameDefault+0x115a) [0x59f10adf454a] #7 ray::IDLE(_PyFunction_Vectorcall+0x6c) [0x59f10ae03dfc] #8 ray::IDLE(_PyEval_EvalFrameDefault+0x49ae) [0x59f10adf7d9e] #9 ray::IDLE(_PyFunction_Vectorcall+0x6c) [0x59f10ae03dfc] #10 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0x9a9333) [0x748a76270333] #11 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(_ZNSt17_Function_handlerIFN3ray6StatusERKNS0_3rpc7AddressENS2_8TaskTypeESsRKNS0_4core11RayFunctionERKSt13unordered_mapISsdSt4hashISsESt8equal_toISsESaISt4pairIKSsdEEERKSt6vectorISt10shared_ptrINS0_9RayObjectEESaISQ_EERKSN_INS2_15ObjectReferenceESaISV_EERSH_S10_PSN_ISG_INS0_8ObjectIDESQ_ESaIS12_EES15_PSN_ISG_IS11_bESaIS16_EERSO_INS0_17LocalMemoryBufferEEPbPSsS1E_RKSN_INS0_16ConcurrencyGroupESaIS1F_EESsbbblRKSt8optionalISsEEPFS1_S5_S6_SsSA_SM_SU_SZ_SsSsS15_S15_S19_S1C_S1D_S1E_S1E_S1J_SsbbblS1L_EE9_M_invokeERKSt9_Any_dataS5_OS6_OSsSA_SM_SU_SZ_S10_S10_OS15_S1X_OS19_S1C_OS1D_OS1E_S20_S1J_S1W_ObS21_S21_OlS1N_+0x1ab) [0x748a761786ab] #12 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(_ZN3ray4core10CoreWorker11ExecuteTaskERKNS_17TaskSpecificationESt8optionalISt13unordered_mapISsSt6vectorISt4pairIldESaIS9_EESt4hashISsESt8equal_toISsESaIS8_IKSsSB_EEEEPS7_IS8_INS_8ObjectIDESt10shared_ptrINS_9RayObjectEEESaISP_EESS_PS7_IS8_ISL_bESaIST_EEPN6google8protobuf16RepeatedPtrFieldINS_3rpc20ObjectReferenceCountEEEPbPSsS15_+0x1166) [0x748a76320a96] #13 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(_ZNSt17_Function_handlerIFN3ray6StatusERKNS0_17TaskSpecificationESt8optionalISt13unordered_mapISsSt6vectorISt4pairIldESaIS9_EESt4hashISsESt8equal_toISsESaIS8_IKSsSB_EEEEPS7_IS8_INS0_8ObjectIDESt10shared_ptrINS0_9RayObjectEEESaISP_EESS_PS7_IS8_ISL_bESaIST_EEPN6google8protobuf16RepeatedPtrFieldINS0_3rpc20ObjectReferenceCountEEEPbPSsS15_ESt5_BindIFMNS0_4core10CoreWorkerEFS1_S4_SK_SS_SS_SW_S13_S14_S15_S15_EPS19_St12_PlaceholderILi1EES1D_ILi2EES1D_ILi3EES1D_ILi4EES1D_ILi5EES1D_ILi6EES1D_ILi7EES1D_ILi8EES1D_ILi9EEEEE9_M_invokeERKSt9_Any_dataS4_OSK_OSS_S1U_OSW_OS13_OS14_OS15_S1Y_+0x87) [0x748a762e8647] #14 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0xb5186d) [0x748a7641886d] #15 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0xb557c5) [0x748a7641c7c5] #16 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0x103e3eb) [0x748a769053eb] #17 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0x1034f0b) [0x748a768fbf0b] #18 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0xb6f21b) [0x748a7643621b] #19 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0x15893cb) [0x748a76e503cb] #20 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0x158ad69) [0x748a76e51d69] #21 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0x158b472) [0x748a76e52472] #22 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(_ZN3ray4core10CoreWorker20RunTaskExecutionLoopEv+0x132) [0x748a762e4252] #23 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(_ZN3ray4core21CoreWorkerProcessImpl26RunWorkerTaskExecutionLoopEv+0x41) [0x748a76336bd1] #24 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0x8a45c1) [0x748a7616b5c1] #25 ray::IDLE(_PyEval_EvalFrameDefault+0x6fb) [0x59f10adf3aeb] #26 ray::IDLE(_PyFunction_Vectorcall+0x6c) [0x59f10ae03dfc] #27 ray::IDLE(_PyEval_EvalFrameDefault+0x6fb) [0x59f10adf3aeb] #28 ray::IDLE(+0x1d5cac) [0x59f10ae9bcac] #29 ray::IDLE(PyEval_EvalCode+0x85) [0x59f10ae9bbf5] #30 ray::IDLE(+0x20732a) [0x59f10aecd32a] #31 ray::IDLE(+0x201d13) [0x59f10aec7d13] #32 ray::IDLE(+0x976be) [0x59f10ad5d6be] #33 ray::IDLE(_PyRun_SimpleFileObject+0x1bb) [0x59f10aec23db] #34 ray::IDLE(_PyRun_AnyFileObject+0x44) [0x59f10aec1f74] #35 ray::IDLE(Py_RunMain+0x371) [0x59f10aebf3e1] #36 ray::IDLE(Py_BytesMain+0x37) [0x59f10ae8f447] #37 /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x748a77baad90] #38 /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x748a77baae40] #39 ray::IDLE(+0x1c930e) [0x59f10ae8f30e] [getenv_preload] getenv name=OTEL_CPP_EXPORTER_OTLP_METRICS_RETRY_BACKOFF_MULTIPLIER [getenv_preload] backtrace: #0 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0x10a9d17) [0x7321ce3c9d17] #1 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0x10abe2b) [0x7321ce3cbe2b] #2 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0x1050ffc) [0x7321ce370ffc] #3 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0x104f4d7) [0x7321ce36f4d7] #4 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0x1045833) [0x7321ce365833] #5 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0xa6c760) [0x7321cdd8c760] #6 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0xe69d9a) [0x7321ce189d9a] #7 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(_ZN3ray3rpc14ClientCallImplINS0_16HealthCheckReplyEE15OnReplyReceivedEv+0x165) [0x7321ce18c005] #8 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(_ZNSt17_Function_handlerIFvvEZN3ray3rpc17ClientCallManager29PollEventsFromCompletionQueueEiEUlvE_E9_M_invokeERKSt9_Any_data+0x15) [0x7321cdd8e475] #9 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0x103e3eb) [0x7321ce35e3eb] #10 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0x1034f0b) [0x7321ce354f0b] #11 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0xb6f21b) [0x7321cde8f21b] #12 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0x15893cb) [0x7321ce8a93cb] #13 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0x158ad69) [0x7321ce8aad69] #14 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0x158b472) [0x7321ce8ab472] #15 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0xa6bb54) [0x7321cdd8bb54] #16 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0xba2250) [0x7321cdec2250] #17 /lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7321cf66eac3] #18 /lib/x86_64-linux-gnu/libc.so.6(+0x1268d0) [0x7321cf7008d0] *** SIGSEGV received at time=1770862205 on cpu 1 *** PC: @ 0x748a77bc5c1d (unknown) getenv @ 0x748a77bc3520 (unknown) (unknown) {"asctime":"2026-02-11 18:10:05,910","levelname":"E","message":"*** SIGSEGV received at time=1770862205 on cpu 1 ***","filename":"logging.cc","lineno":474} {"asctime":"2026-02-11 18:10:05,910","levelname":"E","message":"PC: @ 0x748a77bc5c1d (unknown) getenv","filename":"logging.cc","lineno":474} {"asctime":"2026-02-11 18:10:05,910","levelname":"E","message":" @ 0x748a77bc3520 (unknown) (unknown)","filename":"logging.cc","lineno":474} Fatal Python error: Segmentation fault ``` According to the backtrace, we can identify that it is the `OtlpGrpcMetricExporterOptions`, [which called `getenv(OTEL_CPP_EXPORTER_OTLP_METRICS_RETRY_BACKOFF_MULTIPLIER)`](https://github.com/open-telemetry/opentelemetry-cpp/blob/13ad05a6f431efb76995cffb1225d26b45374749/exporters/otlp/src/otlp_grpc_metric_exporter_options.cc#L47), getting initialized by calling `InitOpenTelemetryExporter` in the `metrics_agent_client_->WaitForServerReady()` callback, that causes the issue. This PR moves `OtlpGrpcMetricExporterOptions` into `OpenTelemetryMetricRecorder` (so that we keep otel details encapsulated) and moves its initialization early to `stats::Init()`, to force the `OtlpGrpcMetricExporterOptions` to be initialized early, so that we don't call `getenv` afterward. --------- Signed-off-by: Rueian Huang <rueiancsie@gmail.com>

ray-project#61034) Currently, there is a chance that a worker can crash on the `getenv` syscall from the otel lazy initialization. We found the race is between `setenv` on the user thread (`setenv(RBLN_DEVICES)`) and `getenv` on the worker internal thread. However, we can't forbid `setenv` on a user's thread; the only thing we can do is not call `getenv` once the user's thread starts. Here is the backtrace of the crash we found by intercepting the `getenv`: ``` [getenv_preload] setenv name=RBLN_DEVICES value= overwrite=1 [getenv_preload] setenv backtrace: #0 /home/ray/getenv_trace_preload.so(setenv+0x73) [0x748a77ea870b] ray-project#1 ray::IDLE(+0x224d5b) [0x59f10aeead5b] ray-project#2 ray::IDLE(+0x13dfc3) [0x59f10ae03fc3] ray-project#3 ray::IDLE(_PyEval_EvalFrameDefault+0x313) [0x59f10adf3703] ray-project#4 ray::IDLE(+0x184bfd) [0x59f10ae4abfd] ray-project#5 ray::IDLE(+0x19da04) [0x59f10ae63a04] ray-project#6 ray::IDLE(_PyEval_EvalFrameDefault+0x115a) [0x59f10adf454a] ray-project#7 ray::IDLE(_PyFunction_Vectorcall+0x6c) [0x59f10ae03dfc] ray-project#8 ray::IDLE(_PyEval_EvalFrameDefault+0x49ae) [0x59f10adf7d9e] ray-project#9 ray::IDLE(_PyFunction_Vectorcall+0x6c) [0x59f10ae03dfc] ray-project#10 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0x9a9333) [0x748a76270333] ray-project#11 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(_ZNSt17_Function_handlerIFN3ray6StatusERKNS0_3rpc7AddressENS2_8TaskTypeESsRKNS0_4core11RayFunctionERKSt13unordered_mapISsdSt4hashISsESt8equal_toISsESaISt4pairIKSsdEEERKSt6vectorISt10shared_ptrINS0_9RayObjectEESaISQ_EERKSN_INS2_15ObjectReferenceESaISV_EERSH_S10_PSN_ISG_INS0_8ObjectIDESQ_ESaIS12_EES15_PSN_ISG_IS11_bESaIS16_EERSO_INS0_17LocalMemoryBufferEEPbPSsS1E_RKSN_INS0_16ConcurrencyGroupESaIS1F_EESsbbblRKSt8optionalISsEEPFS1_S5_S6_SsSA_SM_SU_SZ_SsSsS15_S15_S19_S1C_S1D_S1E_S1E_S1J_SsbbblS1L_EE9_M_invokeERKSt9_Any_dataS5_OS6_OSsSA_SM_SU_SZ_S10_S10_OS15_S1X_OS19_S1C_OS1D_OS1E_S20_S1J_S1W_ObS21_S21_OlS1N_+0x1ab) [0x748a761786ab] ray-project#12 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(_ZN3ray4core10CoreWorker11ExecuteTaskERKNS_17TaskSpecificationESt8optionalISt13unordered_mapISsSt6vectorISt4pairIldESaIS9_EESt4hashISsESt8equal_toISsESaIS8_IKSsSB_EEEEPS7_IS8_INS_8ObjectIDESt10shared_ptrINS_9RayObjectEEESaISP_EESS_PS7_IS8_ISL_bESaIST_EEPN6google8protobuf16RepeatedPtrFieldINS_3rpc20ObjectReferenceCountEEEPbPSsS15_+0x1166) [0x748a76320a96] ray-project#13 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(_ZNSt17_Function_handlerIFN3ray6StatusERKNS0_17TaskSpecificationESt8optionalISt13unordered_mapISsSt6vectorISt4pairIldESaIS9_EESt4hashISsESt8equal_toISsESaIS8_IKSsSB_EEEEPS7_IS8_INS0_8ObjectIDESt10shared_ptrINS0_9RayObjectEEESaISP_EESS_PS7_IS8_ISL_bESaIST_EEPN6google8protobuf16RepeatedPtrFieldINS0_3rpc20ObjectReferenceCountEEEPbPSsS15_ESt5_BindIFMNS0_4core10CoreWorkerEFS1_S4_SK_SS_SS_SW_S13_S14_S15_S15_EPS19_St12_PlaceholderILi1EES1D_ILi2EES1D_ILi3EES1D_ILi4EES1D_ILi5EES1D_ILi6EES1D_ILi7EES1D_ILi8EES1D_ILi9EEEEE9_M_invokeERKSt9_Any_dataS4_OSK_OSS_S1U_OSW_OS13_OS14_OS15_S1Y_+0x87) [0x748a762e8647] ray-project#14 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0xb5186d) [0x748a7641886d] ray-project#15 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0xb557c5) [0x748a7641c7c5] ray-project#16 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0x103e3eb) [0x748a769053eb] ray-project#17 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0x1034f0b) [0x748a768fbf0b] ray-project#18 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0xb6f21b) [0x748a7643621b] ray-project#19 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0x15893cb) [0x748a76e503cb] ray-project#20 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0x158ad69) [0x748a76e51d69] ray-project#21 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0x158b472) [0x748a76e52472] ray-project#22 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(_ZN3ray4core10CoreWorker20RunTaskExecutionLoopEv+0x132) [0x748a762e4252] ray-project#23 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(_ZN3ray4core21CoreWorkerProcessImpl26RunWorkerTaskExecutionLoopEv+0x41) [0x748a76336bd1] ray-project#24 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0x8a45c1) [0x748a7616b5c1] ray-project#25 ray::IDLE(_PyEval_EvalFrameDefault+0x6fb) [0x59f10adf3aeb] ray-project#26 ray::IDLE(_PyFunction_Vectorcall+0x6c) [0x59f10ae03dfc] ray-project#27 ray::IDLE(_PyEval_EvalFrameDefault+0x6fb) [0x59f10adf3aeb] ray-project#28 ray::IDLE(+0x1d5cac) [0x59f10ae9bcac] ray-project#29 ray::IDLE(PyEval_EvalCode+0x85) [0x59f10ae9bbf5] ray-project#30 ray::IDLE(+0x20732a) [0x59f10aecd32a] ray-project#31 ray::IDLE(+0x201d13) [0x59f10aec7d13] ray-project#32 ray::IDLE(+0x976be) [0x59f10ad5d6be] ray-project#33 ray::IDLE(_PyRun_SimpleFileObject+0x1bb) [0x59f10aec23db] ray-project#34 ray::IDLE(_PyRun_AnyFileObject+0x44) [0x59f10aec1f74] ray-project#35 ray::IDLE(Py_RunMain+0x371) [0x59f10aebf3e1] ray-project#36 ray::IDLE(Py_BytesMain+0x37) [0x59f10ae8f447] ray-project#37 /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x748a77baad90] ray-project#38 /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x748a77baae40] ray-project#39 ray::IDLE(+0x1c930e) [0x59f10ae8f30e] [getenv_preload] getenv name=OTEL_CPP_EXPORTER_OTLP_METRICS_RETRY_BACKOFF_MULTIPLIER [getenv_preload] backtrace: #0 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0x10a9d17) [0x7321ce3c9d17] ray-project#1 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0x10abe2b) [0x7321ce3cbe2b] ray-project#2 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0x1050ffc) [0x7321ce370ffc] ray-project#3 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0x104f4d7) [0x7321ce36f4d7] ray-project#4 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0x1045833) [0x7321ce365833] ray-project#5 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0xa6c760) [0x7321cdd8c760] ray-project#6 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0xe69d9a) [0x7321ce189d9a] ray-project#7 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(_ZN3ray3rpc14ClientCallImplINS0_16HealthCheckReplyEE15OnReplyReceivedEv+0x165) [0x7321ce18c005] ray-project#8 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(_ZNSt17_Function_handlerIFvvEZN3ray3rpc17ClientCallManager29PollEventsFromCompletionQueueEiEUlvE_E9_M_invokeERKSt9_Any_data+0x15) [0x7321cdd8e475] ray-project#9 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0x103e3eb) [0x7321ce35e3eb] ray-project#10 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0x1034f0b) [0x7321ce354f0b] ray-project#11 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0xb6f21b) [0x7321cde8f21b] ray-project#12 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0x15893cb) [0x7321ce8a93cb] ray-project#13 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0x158ad69) [0x7321ce8aad69] ray-project#14 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0x158b472) [0x7321ce8ab472] ray-project#15 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0xa6bb54) [0x7321cdd8bb54] ray-project#16 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0xba2250) [0x7321cdec2250] ray-project#17 /lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7321cf66eac3] ray-project#18 /lib/x86_64-linux-gnu/libc.so.6(+0x1268d0) [0x7321cf7008d0] *** SIGSEGV received at time=1770862205 on cpu 1 *** PC: @ 0x748a77bc5c1d (unknown) getenv @ 0x748a77bc3520 (unknown) (unknown) {"asctime":"2026-02-11 18:10:05,910","levelname":"E","message":"*** SIGSEGV received at time=1770862205 on cpu 1 ***","filename":"logging.cc","lineno":474} {"asctime":"2026-02-11 18:10:05,910","levelname":"E","message":"PC: @ 0x748a77bc5c1d (unknown) getenv","filename":"logging.cc","lineno":474} {"asctime":"2026-02-11 18:10:05,910","levelname":"E","message":" @ 0x748a77bc3520 (unknown) (unknown)","filename":"logging.cc","lineno":474} Fatal Python error: Segmentation fault ``` According to the backtrace, we can identify that it is the `OtlpGrpcMetricExporterOptions`, [which called `getenv(OTEL_CPP_EXPORTER_OTLP_METRICS_RETRY_BACKOFF_MULTIPLIER)`](https://github.com/open-telemetry/opentelemetry-cpp/blob/13ad05a6f431efb76995cffb1225d26b45374749/exporters/otlp/src/otlp_grpc_metric_exporter_options.cc#L47), getting initialized by calling `InitOpenTelemetryExporter` in the `metrics_agent_client_->WaitForServerReady()` callback, that causes the issue. This PR moves `OtlpGrpcMetricExporterOptions` into `OpenTelemetryMetricRecorder` (so that we keep otel details encapsulated) and moves its initialization early to `stats::Init()`, to force the `OtlpGrpcMetricExporterOptions` to be initialized early, so that we don't call `getenv` afterward. --------- Signed-off-by: Rueian Huang <rueiancsie@gmail.com> Signed-off-by: Adel Nour <ans9868@nyu.edu>

ray-project#61034) Currently, there is a chance that a worker can crash on the `getenv` syscall from the otel lazy initialization. We found the race is between `setenv` on the user thread (`setenv(RBLN_DEVICES)`) and `getenv` on the worker internal thread. However, we can't forbid `setenv` on a user's thread; the only thing we can do is not call `getenv` once the user's thread starts. Here is the backtrace of the crash we found by intercepting the `getenv`: ``` [getenv_preload] setenv name=RBLN_DEVICES value= overwrite=1 [getenv_preload] setenv backtrace: #0 /home/ray/getenv_trace_preload.so(setenv+0x73) [0x748a77ea870b] ray-project#1 ray::IDLE(+0x224d5b) [0x59f10aeead5b] ray-project#2 ray::IDLE(+0x13dfc3) [0x59f10ae03fc3] ray-project#3 ray::IDLE(_PyEval_EvalFrameDefault+0x313) [0x59f10adf3703] ray-project#4 ray::IDLE(+0x184bfd) [0x59f10ae4abfd] ray-project#5 ray::IDLE(+0x19da04) [0x59f10ae63a04] ray-project#6 ray::IDLE(_PyEval_EvalFrameDefault+0x115a) [0x59f10adf454a] ray-project#7 ray::IDLE(_PyFunction_Vectorcall+0x6c) [0x59f10ae03dfc] ray-project#8 ray::IDLE(_PyEval_EvalFrameDefault+0x49ae) [0x59f10adf7d9e] ray-project#9 ray::IDLE(_PyFunction_Vectorcall+0x6c) [0x59f10ae03dfc] ray-project#10 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0x9a9333) [0x748a76270333] ray-project#11 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(_ZNSt17_Function_handlerIFN3ray6StatusERKNS0_3rpc7AddressENS2_8TaskTypeESsRKNS0_4core11RayFunctionERKSt13unordered_mapISsdSt4hashISsESt8equal_toISsESaISt4pairIKSsdEEERKSt6vectorISt10shared_ptrINS0_9RayObjectEESaISQ_EERKSN_INS2_15ObjectReferenceESaISV_EERSH_S10_PSN_ISG_INS0_8ObjectIDESQ_ESaIS12_EES15_PSN_ISG_IS11_bESaIS16_EERSO_INS0_17LocalMemoryBufferEEPbPSsS1E_RKSN_INS0_16ConcurrencyGroupESaIS1F_EESsbbblRKSt8optionalISsEEPFS1_S5_S6_SsSA_SM_SU_SZ_SsSsS15_S15_S19_S1C_S1D_S1E_S1E_S1J_SsbbblS1L_EE9_M_invokeERKSt9_Any_dataS5_OS6_OSsSA_SM_SU_SZ_S10_S10_OS15_S1X_OS19_S1C_OS1D_OS1E_S20_S1J_S1W_ObS21_S21_OlS1N_+0x1ab) [0x748a761786ab] ray-project#12 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(_ZN3ray4core10CoreWorker11ExecuteTaskERKNS_17TaskSpecificationESt8optionalISt13unordered_mapISsSt6vectorISt4pairIldESaIS9_EESt4hashISsESt8equal_toISsESaIS8_IKSsSB_EEEEPS7_IS8_INS_8ObjectIDESt10shared_ptrINS_9RayObjectEEESaISP_EESS_PS7_IS8_ISL_bESaIST_EEPN6google8protobuf16RepeatedPtrFieldINS_3rpc20ObjectReferenceCountEEEPbPSsS15_+0x1166) [0x748a76320a96] ray-project#13 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(_ZNSt17_Function_handlerIFN3ray6StatusERKNS0_17TaskSpecificationESt8optionalISt13unordered_mapISsSt6vectorISt4pairIldESaIS9_EESt4hashISsESt8equal_toISsESaIS8_IKSsSB_EEEEPS7_IS8_INS0_8ObjectIDESt10shared_ptrINS0_9RayObjectEEESaISP_EESS_PS7_IS8_ISL_bESaIST_EEPN6google8protobuf16RepeatedPtrFieldINS0_3rpc20ObjectReferenceCountEEEPbPSsS15_ESt5_BindIFMNS0_4core10CoreWorkerEFS1_S4_SK_SS_SS_SW_S13_S14_S15_S15_EPS19_St12_PlaceholderILi1EES1D_ILi2EES1D_ILi3EES1D_ILi4EES1D_ILi5EES1D_ILi6EES1D_ILi7EES1D_ILi8EES1D_ILi9EEEEE9_M_invokeERKSt9_Any_dataS4_OSK_OSS_S1U_OSW_OS13_OS14_OS15_S1Y_+0x87) [0x748a762e8647] ray-project#14 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0xb5186d) [0x748a7641886d] ray-project#15 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0xb557c5) [0x748a7641c7c5] ray-project#16 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0x103e3eb) [0x748a769053eb] ray-project#17 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0x1034f0b) [0x748a768fbf0b] ray-project#18 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0xb6f21b) [0x748a7643621b] ray-project#19 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0x15893cb) [0x748a76e503cb] ray-project#20 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0x158ad69) [0x748a76e51d69] ray-project#21 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0x158b472) [0x748a76e52472] ray-project#22 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(_ZN3ray4core10CoreWorker20RunTaskExecutionLoopEv+0x132) [0x748a762e4252] ray-project#23 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(_ZN3ray4core21CoreWorkerProcessImpl26RunWorkerTaskExecutionLoopEv+0x41) [0x748a76336bd1] ray-project#24 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0x8a45c1) [0x748a7616b5c1] ray-project#25 ray::IDLE(_PyEval_EvalFrameDefault+0x6fb) [0x59f10adf3aeb] ray-project#26 ray::IDLE(_PyFunction_Vectorcall+0x6c) [0x59f10ae03dfc] ray-project#27 ray::IDLE(_PyEval_EvalFrameDefault+0x6fb) [0x59f10adf3aeb] ray-project#28 ray::IDLE(+0x1d5cac) [0x59f10ae9bcac] ray-project#29 ray::IDLE(PyEval_EvalCode+0x85) [0x59f10ae9bbf5] ray-project#30 ray::IDLE(+0x20732a) [0x59f10aecd32a] ray-project#31 ray::IDLE(+0x201d13) [0x59f10aec7d13] ray-project#32 ray::IDLE(+0x976be) [0x59f10ad5d6be] ray-project#33 ray::IDLE(_PyRun_SimpleFileObject+0x1bb) [0x59f10aec23db] ray-project#34 ray::IDLE(_PyRun_AnyFileObject+0x44) [0x59f10aec1f74] ray-project#35 ray::IDLE(Py_RunMain+0x371) [0x59f10aebf3e1] ray-project#36 ray::IDLE(Py_BytesMain+0x37) [0x59f10ae8f447] ray-project#37 /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x748a77baad90] ray-project#38 /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x748a77baae40] ray-project#39 ray::IDLE(+0x1c930e) [0x59f10ae8f30e] [getenv_preload] getenv name=OTEL_CPP_EXPORTER_OTLP_METRICS_RETRY_BACKOFF_MULTIPLIER [getenv_preload] backtrace: #0 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0x10a9d17) [0x7321ce3c9d17] ray-project#1 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0x10abe2b) [0x7321ce3cbe2b] ray-project#2 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0x1050ffc) [0x7321ce370ffc] ray-project#3 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0x104f4d7) [0x7321ce36f4d7] ray-project#4 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0x1045833) [0x7321ce365833] ray-project#5 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0xa6c760) [0x7321cdd8c760] ray-project#6 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0xe69d9a) [0x7321ce189d9a] ray-project#7 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(_ZN3ray3rpc14ClientCallImplINS0_16HealthCheckReplyEE15OnReplyReceivedEv+0x165) [0x7321ce18c005] ray-project#8 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(_ZNSt17_Function_handlerIFvvEZN3ray3rpc17ClientCallManager29PollEventsFromCompletionQueueEiEUlvE_E9_M_invokeERKSt9_Any_data+0x15) [0x7321cdd8e475] ray-project#9 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0x103e3eb) [0x7321ce35e3eb] ray-project#10 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0x1034f0b) [0x7321ce354f0b] ray-project#11 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0xb6f21b) [0x7321cde8f21b] ray-project#12 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0x15893cb) [0x7321ce8a93cb] ray-project#13 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0x158ad69) [0x7321ce8aad69] ray-project#14 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0x158b472) [0x7321ce8ab472] ray-project#15 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0xa6bb54) [0x7321cdd8bb54] ray-project#16 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0xba2250) [0x7321cdec2250] ray-project#17 /lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7321cf66eac3] ray-project#18 /lib/x86_64-linux-gnu/libc.so.6(+0x1268d0) [0x7321cf7008d0] *** SIGSEGV received at time=1770862205 on cpu 1 *** PC: @ 0x748a77bc5c1d (unknown) getenv @ 0x748a77bc3520 (unknown) (unknown) {"asctime":"2026-02-11 18:10:05,910","levelname":"E","message":"*** SIGSEGV received at time=1770862205 on cpu 1 ***","filename":"logging.cc","lineno":474} {"asctime":"2026-02-11 18:10:05,910","levelname":"E","message":"PC: @ 0x748a77bc5c1d (unknown) getenv","filename":"logging.cc","lineno":474} {"asctime":"2026-02-11 18:10:05,910","levelname":"E","message":" @ 0x748a77bc3520 (unknown) (unknown)","filename":"logging.cc","lineno":474} Fatal Python error: Segmentation fault ``` According to the backtrace, we can identify that it is the `OtlpGrpcMetricExporterOptions`, [which called `getenv(OTEL_CPP_EXPORTER_OTLP_METRICS_RETRY_BACKOFF_MULTIPLIER)`](https://github.com/open-telemetry/opentelemetry-cpp/blob/13ad05a6f431efb76995cffb1225d26b45374749/exporters/otlp/src/otlp_grpc_metric_exporter_options.cc#L47), getting initialized by calling `InitOpenTelemetryExporter` in the `metrics_agent_client_->WaitForServerReady()` callback, that causes the issue. This PR moves `OtlpGrpcMetricExporterOptions` into `OpenTelemetryMetricRecorder` (so that we keep otel details encapsulated) and moves its initialization early to `stats::Init()`, to force the `OtlpGrpcMetricExporterOptions` to be initialized early, so that we don't call `getenv` afterward. --------- Signed-off-by: Rueian Huang <rueiancsie@gmail.com>

…opy_me__upb_internal_use_only) (#61147) We found that if the version of python profobuf library mismatches with the raylet's, the ray client python server will hit segment fault with this trace: ``` * thread #12, name = 'python3.11', stop reason = signal SIGSEGV: address not mapped to object (fault address=0x16) * frame #0: 0x0000733c1ea38a59 _raylet.so`_upb_Arena_SlowMalloc_dont_copy_me__upb_internal_use_only + 41 frame #1: 0x0000733c1ea366ad _raylet.so`_upb_Array_Realloc_dont_copy_me__upb_internal_use_only + 285 frame #2: 0x0000733c1c54517b _message.cpython-311-x86_64-linux-gnu.so`_upb_Decoder_DecodeMessage + 3835 frame #3: 0x0000733c1c545f0c _message.cpython-311-x86_64-linux-gnu.so`upb_Decoder_Decode + 108 frame #4: 0x0000733c1c543ff9 _message.cpython-311-x86_64-linux-gnu.so`upb_Decode + 201 frame #5: 0x0000733c1c52907d _message.cpython-311-x86_64-linux-gnu.so`PyUpb_Message_MergeFromString + 237 frame #6: 0x0000733c1c5293c4 _message.cpython-311-x86_64-linux-gnu.so`PyUpb_Message_FromString + 36 frame #7: 0x0000568134f2d81a python3.11`cfunction_vectorcall_O(func=0x0000733c15c0b560, args=0x0000733c20200480, nargsf=<unavailable>, kwnames=<unavailable>) at methodobject.c:514:24 frame #8: 0x0000568135270620 python3.11 ``` We can see from the trace that the python profobuf library (`message.cpython-311-x86_64-linux-gnu.so`) tried to decode a message with a function `_upb_Array_Realloc_dont_copy_me__upb_internal_use_only` from `_raylet.so`, which is apparently not ideal. Ideally, the python profobuf library should not use a function from `_raylet.so`. That happens because the current exporting rule `*ray*internal*` accidentally matches `_upb_Array_Realloc_dont_copy_me__upb_internal_use_only`, so we have it exposed globally from raylet: <img width="1162" height="169" alt="image" src="https://github.com/user-attachments/assets/f40ae524-9675-454d-8cce-f6c43d2d901c" /> The problematic rule `*ray*internal*` aims to export `ray::internal` only, so this PR makes the pattern strict and does not expose _upb_Arena_SlowMalloc_dont_copy_me__upb_internal_use_only. Signed-off-by: Rueian Huang <rueiancsie@gmail.com>

## Description grpc 1.57.1 will call `GetEnv("GRPC_EXPERIMENTAL_PICKFIRST_LB_CONFIG")` on every grpc channel establishment for parsing load-balancing policy. This causes race conditions between user tasks as they are allowed to do setenv at anytime. This PR upgrades the grpc lib to 1.58.0 to get rid of the `GetEnv("GRPC_EXPERIMENTAL_PICKFIRST_LB_CONFIG")`. ``` (gdb) bt #0 __pthread_kill_implementation (no_tid=0, signo=11, threadid=129183804413504) at ./nptl/pthread_kill.c:44 #1 __pthread_kill_internal (signo=11, threadid=129183804413504) at ./nptl/pthread_kill.c:78 #2 __GI___pthread_kill (threadid=129183804413504, signo=signo@entry=11) at ./nptl/pthread_kill.c:89 #3 0x00007580a7545476 in __GI_raise (sig=11) at ../sysdeps/posix/raise.c:26 #4 <signal handler called> #5 __pthread_kill_implementation (no_tid=0, signo=11, threadid=129183804413504) at ./nptl/pthread_kill.c:44 #6 __pthread_kill_internal (signo=11, threadid=129183804413504) at ./nptl/pthread_kill.c:78 #7 __GI___pthread_kill (threadid=129183804413504, signo=signo@entry=11) at ./nptl/pthread_kill.c:89 #8 0x00007580a7545476 in __GI_raise (sig=11) at ../sysdeps/posix/raise.c:26 #9 <signal handler called> #10 __GI_getenv (name=0x7580a6a078c2 "PC_EXPERIMENTAL_PICKFIRST_LB_CONFIG") at ./stdlib/getenv.c:84 #11 0x00007580a67e8b8a in grpc_core::GetEnv(char const*) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so #12 0x00007580a649601f in grpc_core::ShufflePickFirstEnabled() () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so #13 0x00007580a64960ed in grpc_core::json_detail::FinishedJsonObjectLoader<grpc_core::(anonymous namespace)::PickFirstConfig, 1ul, void>::LoadInto(grpc_core::experimental::Json const&, grpc_core::JsonArgs const&, void*, grpc_core::ValidationErrors*) const () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so #14 0x00007580a6787384 in grpc_core::json_detail::LoadWrapped::LoadInto(grpc_core::experimental::Json const&, grpc_core::JsonArgs const&, void*, grpc_core::ValidationErrors*) const () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so #15 0x00007580a6497b07 in grpc_core::(anonymous namespace)::PickFirstFactory::ParseLoadBalancingConfig(grpc_core::experimental::Json const&) const () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so #16 0x00007580a67c18a7 in grpc_core::LoadBalancingPolicyRegistry::ParseLoadBalancingConfig(grpc_core::experimental::Json const&) const () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so #17 0x00007580a66ad9b8 in grpc_core::ClientChannel::OnResolverResultChangedLocked(grpc_core::Resolver::Result) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so #18 0x00007580a66ae452 in grpc_core::ClientChannel::ResolverResultHandler::ReportResult(grpc_core::Resolver::Result) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so #19 0x00007580a63bc603 in grpc_core::PollingResolver::OnRequestCompleteLocked(grpc_core::Resolver::Result) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so #20 0x00007580a63bcb2d in std::_Function_handler<void (), grpc_core::PollingResolver::OnRequestComplete(grpc_core::Resolver::Result)::{lambda()#1}>::_M_invoke(std::_Any_data const&) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so #21 0x00007580a67cbf46 in grpc_core::WorkSerializer::WorkSerializerImpl::Run(std::function<void ()>, grpc_core::DebugLocation const&) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so #22 0x00007580a67cc0ea in grpc_core::WorkSerializer::Run(std::function<void ()>, grpc_core::DebugLocation const&) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so #23 0x00007580a63bd117 in grpc_core::PollingResolver::OnRequestComplete(grpc_core::Resolver::Result) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so #24 0x00007580a63b3f86 in grpc_core::(anonymous namespace)::AresClientChannelDNSResolver::AresRequestWrapper::OnHostnameResolved(void*, absl::lts_20230802::Status) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so #25 0x00007580a67c44c4 in grpc_core::ExecCtx::Flush() () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so #26 0x00007580a63408a2 in grpc_core::ExecCtx::~ExecCtx() () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so #27 0x00007580a6740343 in grpc_call_start_batch () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so #28 0x00007580a5e281e9 in grpc::internal::CallOpSet<grpc::internal::CallOpSendInitialMetadata, grpc::internal::CallOpSendMessage, grpc::internal::CallOpRecvInitialMetadata, grpc::internal::CallOpRecvMessage<google::protobuf::MessageLite>, grpc::internal::CallOpClientSendClose, grpc::internal::CallOpClientRecvStatus>::ContinueFillOpsAfterInterception() () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so #29 0x00007580a5e2d809 in grpc::internal::BlockingUnaryCallImpl<google::protobuf::MessageLite, google::protobuf::MessageLite>::BlockingUnaryCallImpl(grpc::ChannelInterface*, grpc::inte--Type <RET> for more, q to quit, c to continue without paging--c rnal::RpcMethod const&, grpc::ClientContext*, google::protobuf::MessageLite const&, google::protobuf::MessageLite*) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so #30 0x00007580a62d76ea in opentelemetry::proto::collector::metrics::v1::MetricsService::Stub::Export(grpc::ClientContext*, opentelemetry::proto::collector::metrics::v1::ExportMetricsServiceRequest const&, opentelemetry::proto::collector::metrics::v1::ExportMetricsServiceResponse*) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so #31 0x00007580a62ca40c in opentelemetry::v1::exporter::otlp::OtlpGrpcClient::DelegateExport(opentelemetry::proto::collector::metrics::v1::MetricsService::StubInterface*, std::unique_ptr<grpc::ClientContext, std::default_delete<grpc::ClientContext> >&&, std::unique_ptr<google::protobuf::Arena, std::default_delete<google::protobuf::Arena> >&&, opentelemetry::proto::collector::metrics::v1::ExportMetricsServiceRequest&&, opentelemetry::proto::collector::metrics::v1::ExportMetricsServiceResponse*) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so #32 0x00007580a62c23ed in opentelemetry::v1::exporter::otlp::OtlpGrpcMetricExporter::Export(opentelemetry::v1::sdk::metrics::ResourceMetrics const&) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so #33 0x00007580a62c0334 in (anonymous namespace)::OpenTelemetryMetricExporter::Export(opentelemetry::v1::sdk::metrics::ResourceMetrics const&) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so #34 0x00007580a62e5fdf in opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::{lambda()#1}::operator()() const::{lambda(opentelemetry::v1::sdk::metrics::ResourceMetrics&)#1}::operator()(opentelemetry::v1::sdk::metrics::ResourceMetrics&) const () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so #35 0x00007580a62ee7a6 in opentelemetry::v1::sdk::metrics::MetricReader::Collect(opentelemetry::v1::nostd::function_ref<bool (opentelemetry::v1::sdk::metrics::ResourceMetrics&)>) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so #36 0x00007580a62e5085 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::{lambda()#1}> > >::_M_run() () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so #37 0x00007580a6997be0 in execute_native_thread_routine () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so #38 0x00007580a7597ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442 #39 0x00007580a76298d0 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 ``` Signed-off-by: Rueian Huang <rueiancsie@gmail.com>

## Description grpc 1.57.1 will call `GetEnv("GRPC_EXPERIMENTAL_PICKFIRST_LB_CONFIG")` on every grpc channel establishment for parsing load-balancing policy. This causes race conditions between user tasks as they are allowed to do setenv at anytime. This PR upgrades the grpc lib to 1.58.0 to get rid of the `GetEnv("GRPC_EXPERIMENTAL_PICKFIRST_LB_CONFIG")`. ``` (gdb) bt #0 __pthread_kill_implementation (no_tid=0, signo=11, threadid=129183804413504) at ./nptl/pthread_kill.c:44 #1 __pthread_kill_internal (signo=11, threadid=129183804413504) at ./nptl/pthread_kill.c:78 #2 __GI___pthread_kill (threadid=129183804413504, signo=signo@entry=11) at ./nptl/pthread_kill.c:89 #3 0x00007580a7545476 in __GI_raise (sig=11) at ../sysdeps/posix/raise.c:26 #4 <signal handler called> #5 __pthread_kill_implementation (no_tid=0, signo=11, threadid=129183804413504) at ./nptl/pthread_kill.c:44 #6 __pthread_kill_internal (signo=11, threadid=129183804413504) at ./nptl/pthread_kill.c:78 #7 __GI___pthread_kill (threadid=129183804413504, signo=signo@entry=11) at ./nptl/pthread_kill.c:89 #8 0x00007580a7545476 in __GI_raise (sig=11) at ../sysdeps/posix/raise.c:26 #9 <signal handler called> #10 __GI_getenv (name=0x7580a6a078c2 "PC_EXPERIMENTAL_PICKFIRST_LB_CONFIG") at ./stdlib/getenv.c:84 #11 0x00007580a67e8b8a in grpc_core::GetEnv(char const*) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so #12 0x00007580a649601f in grpc_core::ShufflePickFirstEnabled() () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so #13 0x00007580a64960ed in grpc_core::json_detail::FinishedJsonObjectLoader<grpc_core::(anonymous namespace)::PickFirstConfig, 1ul, void>::LoadInto(grpc_core::experimental::Json const&, grpc_core::JsonArgs const&, void*, grpc_core::ValidationErrors*) const () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so #14 0x00007580a6787384 in grpc_core::json_detail::LoadWrapped::LoadInto(grpc_core::experimental::Json const&, grpc_core::JsonArgs const&, void*, grpc_core::ValidationErrors*) const () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so #15 0x00007580a6497b07 in grpc_core::(anonymous namespace)::PickFirstFactory::ParseLoadBalancingConfig(grpc_core::experimental::Json const&) const () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so #16 0x00007580a67c18a7 in grpc_core::LoadBalancingPolicyRegistry::ParseLoadBalancingConfig(grpc_core::experimental::Json const&) const () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so #17 0x00007580a66ad9b8 in grpc_core::ClientChannel::OnResolverResultChangedLocked(grpc_core::Resolver::Result) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so #18 0x00007580a66ae452 in grpc_core::ClientChannel::ResolverResultHandler::ReportResult(grpc_core::Resolver::Result) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so #19 0x00007580a63bc603 in grpc_core::PollingResolver::OnRequestCompleteLocked(grpc_core::Resolver::Result) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so #20 0x00007580a63bcb2d in std::_Function_handler<void (), grpc_core::PollingResolver::OnRequestComplete(grpc_core::Resolver::Result)::{lambda()#1}>::_M_invoke(std::_Any_data const&) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so #21 0x00007580a67cbf46 in grpc_core::WorkSerializer::WorkSerializerImpl::Run(std::function<void ()>, grpc_core::DebugLocation const&) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so #22 0x00007580a67cc0ea in grpc_core::WorkSerializer::Run(std::function<void ()>, grpc_core::DebugLocation const&) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so #23 0x00007580a63bd117 in grpc_core::PollingResolver::OnRequestComplete(grpc_core::Resolver::Result) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so #24 0x00007580a63b3f86 in grpc_core::(anonymous namespace)::AresClientChannelDNSResolver::AresRequestWrapper::OnHostnameResolved(void*, absl::lts_20230802::Status) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so #25 0x00007580a67c44c4 in grpc_core::ExecCtx::Flush() () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so #26 0x00007580a63408a2 in grpc_core::ExecCtx::~ExecCtx() () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so #27 0x00007580a6740343 in grpc_call_start_batch () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so #28 0x00007580a5e281e9 in grpc::internal::CallOpSet<grpc::internal::CallOpSendInitialMetadata, grpc::internal::CallOpSendMessage, grpc::internal::CallOpRecvInitialMetadata, grpc::internal::CallOpRecvMessage<google::protobuf::MessageLite>, grpc::internal::CallOpClientSendClose, grpc::internal::CallOpClientRecvStatus>::ContinueFillOpsAfterInterception() () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so #29 0x00007580a5e2d809 in grpc::internal::BlockingUnaryCallImpl<google::protobuf::MessageLite, google::protobuf::MessageLite>::BlockingUnaryCallImpl(grpc::ChannelInterface*, grpc::inte--Type <RET> for more, q to quit, c to continue without paging--c rnal::RpcMethod const&, grpc::ClientContext*, google::protobuf::MessageLite const&, google::protobuf::MessageLite*) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so #30 0x00007580a62d76ea in opentelemetry::proto::collector::metrics::v1::MetricsService::Stub::Export(grpc::ClientContext*, opentelemetry::proto::collector::metrics::v1::ExportMetricsServiceRequest const&, opentelemetry::proto::collector::metrics::v1::ExportMetricsServiceResponse*) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so #31 0x00007580a62ca40c in opentelemetry::v1::exporter::otlp::OtlpGrpcClient::DelegateExport(opentelemetry::proto::collector::metrics::v1::MetricsService::StubInterface*, std::unique_ptr<grpc::ClientContext, std::default_delete<grpc::ClientContext> >&&, std::unique_ptr<google::protobuf::Arena, std::default_delete<google::protobuf::Arena> >&&, opentelemetry::proto::collector::metrics::v1::ExportMetricsServiceRequest&&, opentelemetry::proto::collector::metrics::v1::ExportMetricsServiceResponse*) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so #32 0x00007580a62c23ed in opentelemetry::v1::exporter::otlp::OtlpGrpcMetricExporter::Export(opentelemetry::v1::sdk::metrics::ResourceMetrics const&) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so #33 0x00007580a62c0334 in (anonymous namespace)::OpenTelemetryMetricExporter::Export(opentelemetry::v1::sdk::metrics::ResourceMetrics const&) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so #34 0x00007580a62e5fdf in opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::{lambda()#1}::operator()() const::{lambda(opentelemetry::v1::sdk::metrics::ResourceMetrics&)#1}::operator()(opentelemetry::v1::sdk::metrics::ResourceMetrics&) const () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so #35 0x00007580a62ee7a6 in opentelemetry::v1::sdk::metrics::MetricReader::Collect(opentelemetry::v1::nostd::function_ref<bool (opentelemetry::v1::sdk::metrics::ResourceMetrics&)>) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so #36 0x00007580a62e5085 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::{lambda()#1}> > >::_M_run() () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so #37 0x00007580a6997be0 in execute_native_thread_routine () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so #38 0x00007580a7597ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442 #39 0x00007580a76298d0 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 ``` Signed-off-by: Rueian Huang <rueiancsie@gmail.com> Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>

Adds a self-contained doc + patch pair so other macOS contributors can reproduce the build/test workflow used for this branch: rep-64-poc/MACOS_DEV_SETUP.md — narrative: what changed, why, how to apply, how to verify rep-64-poc/macos-dev.patch — git diff of the six local changes, apply with `git apply`, revert before pushing The code changes themselves are intentionally not committed: Ray's supported build target is Linux, and four of the six are general macOS portability fixes that belong in a separate upstream-facing PR rather than this POC branch. Updates EVIDENCE.md (executive summary finding ray-project#4, substrate matrix, follow-up ray-project#6) to reflect that the macOS dev workflow is now green — build, editable install, and `pytest python/ray/tests/test_basic.py` all succeed on macOS arm64. Substrate-evidence runs (Phases 1/4/7/8) on APFS / F_FULLFSYNC remain pending; this work unblocks them but does not perform them. Signed-off-by: Santosh Jha <santosh.m.jha@gmail.com>

Specifies a two-tier (kind + external cluster) reproducible test harness covering COLLABORATORS.md items #1, ray-project#4, ray-project#5, ray-project#6, ray-project#8. All cluster-specific values live in env-file profiles; scripts are cluster-agnostic. Default remote image is the public ghcr.io/jhasm/ray-rocksdb-poc:rep-64-poc. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Santosh Jha <santosh.m.jha@gmail.com>

Phases A-D in detail (kind tier, end-to-end on a developer machine); Phases E-F outlined as a follow-on plan once kind tier is green. Covers COLLABORATORS.md items #1, ray-project#4 (partial — fsync probe), ray-project#5 (with inmem baseline), ray-project#6. Item ray-project#8 stubbed pending image-bundling of storage_microbench. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Santosh Jha <santosh.m.jha@gmail.com>

- Methodology: add the corrected throughput model — shuffle barrier makes it total ≈ T(read+map) + T(reduce+write) (not one overlapped min()); include WRITE as a stage; ceiling = current binding stage, so speeding up any binding stage raises GB/s by ~its time-share. 512GB shares: phase1 ~51% (read-bound), phase2 ~47% (reduce restore ~10-16% + write). - Methodology: add "Reduce restore path — inefficiencies & fixes" section (no fetch/decode overlap, batch=16 pull window, head-of-line blocking, max_io_workers=4, write/read fusing asymmetry, tiny shards) with file:line refs and the ray-project#1+ray-project#4 high-impact combo. - design plot: annotate shuffle barrier + PHASE 1/2 brackets and shares. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Confirmed this cluster (m5.2xlarge, root nvme0n1 = "Amazon Elastic Block Store", no instance store) spills to EBS. Measured ~600 MB/s/node = m5.2xlarge instance EBS bandwidth cap (~593 MB/s), not NVMe. So fix ray-project#4 "raise max_io_workers" does NOT apply here: 4 workers already saturate the EBS pipe. Make the reduce-restore fix backend-aware (NVMe = raise concurrency; EBS = bandwidth-capped, use instance-store NVMe / spill less; S3 = latency-bound, raise concurrency + batch restore). Note the page-cache spill_throughput_mb metric hides the real EBS rate. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Fix ray-project#2 (limit pushdown): the read fns now slice their batch to per_task_row_limit, so a downstream limit(K) reads ~K chunks instead of the whole batch's I/O. Previously ReadTask only truncated the already-built block (_iter_sliced_blocks), so every chunk in the batch was still fetched. Fix ray-project#4 (retries): chunk reads are wrapped in iterate_with_retry(match=DataContext.retried_io_errors) -- the same mechanism FileBasedDatasource uses -- so zarr reads now honor Ray Data's retry config. The underlying filesystem's own retry still applies underneath. Tests: per_task_row_limit caps the number of _read_chunk calls (not just the output row count); _read_chunk retries a transient error then succeeds. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

## Why are these changes needed? Ray's NIXL tensor transport currently hardcodes the UCX backend (`nixl_agent_config(backends=["UCX"])`). On AWS EFA instances UCX cannot drive the EFA NICs (they are libfabric-only), so cross-node RDT transfers silently degrade to **software-emulated RMA over TCP (~0.3 GB/s)** — correct data, but ~600× slower than the hardware is capable of, with no warning emitted. NIXL itself logs a hint recommending LIBFABRIC. This change makes `NixlTensorTransport` select the **LIBFABRIC** backend on EFA hardware (falling back to UCX everywhere else), unlocking genuine GPUDirect RDMA over EFA. Builds on the initial draft (`joshlee/use-libfabric-nixl-backend`) and hardens it for the environments where naive device-presence selection fails: containers/K8s, hosts without GPUDirect, and CUDA VMM allocators. ### What's included 1. **Backend auto-selection** (`select_backend()`): `LIBFABRIC` when EFA is present, else `UCX`. `get_nixl_agent()` uses it instead of the hardcoded UCX. 2. **Container/K8s-aware EFA detection.** Device presence alone (`glob /sys/class/net/efa*`) is **empty inside pods** — netdevs are network- namespaced — so on K8s the naive check reports "no EFA" and silently selects the slow UCX/TCP path. Detection also accepts the rdma-verbs devices the EFA k8s device-plugin mounts (`/dev/infiniband/uverbs*` + `/sys/class/infiniband/*`). 3. **Validate before committing to LIBFABRIC.** EFA-present ≠ GPUDirect-works: on hosts/drivers without `nvidia-peermem`/dmabuf, LIBFABRIC instantiates and registers *small* CUDA buffers via EFA's ~32 MiB host bounce pool (looks healthy), but `fi_mr_reg` of a real-size buffer fails at transfer time with `NIXL_ERR_BACKEND`. We register/deregister a CUDA buffer **at realistic transfer size** before committing, and fall back to UCX on failure. Size is tunable via `RAY_NIXL_VALIDATE_BYTES`. 4. **VMM (`expandable_segments`) safety.** CUDA VMM memory (`PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`, standard in large-model training) cannot be RDMA-registered. The validation probe allocates non-expandable memory so it mirrors a real registrable buffer rather than false-failing on a VMM probe. 5. **`RAY_NIXL_BACKEND` override.** Per-process auto-detection can diverge between the two endpoints of a transfer when they run in different processes / Ray jobs with different allocator state (observed: sender → UCX, receiver → LIBFABRIC → every read fails `NIXL_ERR_BACKEND`). `RAY_NIXL_BACKEND=UCX| LIBFABRIC` lets an orchestrator pin both sides; the size-matched probe makes auto-selection deterministic across same-shaped hosts as the default. 6. **Per-transfer stats + backend-specific diagnostics** (`NixlTransferStats`: bytes, xfer time, throughput, backend) and backend-aware error hints (`FI_LOG_LEVEL=Debug` for LIBFABRIC, `UCX_LOG_LEVEL=debug` for UCX) on registration failure. ## Benchmark results Validated in a downstream RL trainer (miles) weight-sync workload (Megatron trainer → sglang rollout engines) on Anyscale, p5.48xlarge (8×H100-80GB, 32×EFA), Ray 2.55.1 + this change, cross-node (every transfer crosses the node boundary). This addresses the draft's "TODO: benchmarking". | Backend | path | effective bandwidth | |---|---|---| | **LIBFABRIC (this PR)** | GPUDirect RDMA over EFA | **~23–30 GiB/s per source** | | UCX (status quo on EFA) | software-emulated RMA over TCP | ~0.3 GB/s | End-to-end weight-sync time (`pull_wait` = pure transfer; full model bf16): | Model | size | shard/src | `pull_wait` (transfer) | full sync | |---|---|---|---|---| | Qwen3-30B-A3B | 57 GiB | 14.24 GiB | 0.61 s | 2.7 s | | GLM-4.7-Flash (MLA+MoE) | — | 14.13 GiB | 0.61 s | 1.9 s | | Moonlight-16B (MLA) | — | 7.48 GiB | 0.33 s | 1.0 s | - LIBFABRIC over EFA was **~8–9× faster than NCCL broadcast** for the same cross-node sync (~3 s vs ~26 s for the 57 GiB model). - Flat across 2→4 nodes (transfer is per-source-shard bound, not node-count bound). - All transfers byte-correct end-to-end (downstream `--check-weight-update-equal` byte-compares the received weights, incl. MLA and fused-MoE tensors). Without fix #2 (K8s detection) the change is a silent no-op on pods (stays on UCX/TCP). Without #3/#4 it selects LIBFABRIC and then fails at transfer time on no-GPUDirect/VMM hosts. With all of them, the LIBFABRIC path is correct and fast across dense, MoE, and MLA models. ## Related issue number Builds on / supersedes the draft on `joshlee/use-libfabric-nixl-backend`. ## Checks - Testing strategy: exercised via a real cross-node RDT weight-sync workload on EFA hardware (numbers above), plus the fallback paths — UCX selection on non-EFA, and LIBFABRIC→UCX fallback when the transfer-size CUDA registration fails (no-GPUDirect host). Suggested unit coverage: `select_backend()` honors `RAY_NIXL_BACKEND` and the EFA globs; `get_nixl_agent()` falls back to UCX when the validation registration raises. - [ ] Signed off commits - [ ] `scripts/format.sh` - [ ] Doc update (`doc/source/ray-core/direct-transport.rst` — backend selection section) - [ ] Tests passing ## Notes for reviewers - The `NixlTransferStats` backend field reads `list(nixl_agent.backends.keys())`; confirm `backends` is always a dict on the supported NIXL version (an earlier review flagged a possible `AttributeError`) and consider guarding it, since the stats path runs on the hot recv path. - Detection globs are cached (`functools.lru_cache`) — fine since hardware doesn't change within a process. --------- Signed-off-by: Joshua Lee <joshlee@anyscale.com> Signed-off-by: xyuzh <xinyzng@gmail.com> Co-authored-by: xyuzh <xinyzng@gmail.com>

## Why are these changes needed? Ray's NIXL tensor transport currently hardcodes the UCX backend (`nixl_agent_config(backends=["UCX"])`). On AWS EFA instances UCX cannot drive the EFA NICs (they are libfabric-only), so cross-node RDT transfers silently degrade to **software-emulated RMA over TCP (~0.3 GB/s)** — correct data, but ~600× slower than the hardware is capable of, with no warning emitted. NIXL itself logs a hint recommending LIBFABRIC. This change makes `NixlTensorTransport` select the **LIBFABRIC** backend on EFA hardware (falling back to UCX everywhere else), unlocking genuine GPUDirect RDMA over EFA. Builds on the initial draft (`joshlee/use-libfabric-nixl-backend`) and hardens it for the environments where naive device-presence selection fails: containers/K8s, hosts without GPUDirect, and CUDA VMM allocators. ### What's included 1. **Backend auto-selection** (`select_backend()`): `LIBFABRIC` when EFA is present, else `UCX`. `get_nixl_agent()` uses it instead of the hardcoded UCX. 2. **Container/K8s-aware EFA detection.** Device presence alone (`glob /sys/class/net/efa*`) is **empty inside pods** — netdevs are network- namespaced — so on K8s the naive check reports "no EFA" and silently selects the slow UCX/TCP path. Detection also accepts the rdma-verbs devices the EFA k8s device-plugin mounts (`/dev/infiniband/uverbs*` + `/sys/class/infiniband/*`). 3. **Validate before committing to LIBFABRIC.** EFA-present ≠ GPUDirect-works: on hosts/drivers without `nvidia-peermem`/dmabuf, LIBFABRIC instantiates and registers *small* CUDA buffers via EFA's ~32 MiB host bounce pool (looks healthy), but `fi_mr_reg` of a real-size buffer fails at transfer time with `NIXL_ERR_BACKEND`. We register/deregister a CUDA buffer **at realistic transfer size** before committing, and fall back to UCX on failure. Size is tunable via `RAY_NIXL_VALIDATE_BYTES`. 4. **VMM (`expandable_segments`) safety.** CUDA VMM memory (`PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`, standard in large-model training) cannot be RDMA-registered. The validation probe allocates non-expandable memory so it mirrors a real registrable buffer rather than false-failing on a VMM probe. 5. **`RAY_NIXL_BACKEND` override.** Per-process auto-detection can diverge between the two endpoints of a transfer when they run in different processes / Ray jobs with different allocator state (observed: sender → UCX, receiver → LIBFABRIC → every read fails `NIXL_ERR_BACKEND`). `RAY_NIXL_BACKEND=UCX| LIBFABRIC` lets an orchestrator pin both sides; the size-matched probe makes auto-selection deterministic across same-shaped hosts as the default. 6. **Per-transfer stats + backend-specific diagnostics** (`NixlTransferStats`: bytes, xfer time, throughput, backend) and backend-aware error hints (`FI_LOG_LEVEL=Debug` for LIBFABRIC, `UCX_LOG_LEVEL=debug` for UCX) on registration failure. ## Benchmark results Validated in a downstream RL trainer (miles) weight-sync workload (Megatron trainer → sglang rollout engines) on Anyscale, p5.48xlarge (8×H100-80GB, 32×EFA), Ray 2.55.1 + this change, cross-node (every transfer crosses the node boundary). This addresses the draft's "TODO: benchmarking". | Backend | path | effective bandwidth | |---|---|---| | **LIBFABRIC (this PR)** | GPUDirect RDMA over EFA | **~23–30 GiB/s per source** | | UCX (status quo on EFA) | software-emulated RMA over TCP | ~0.3 GB/s | End-to-end weight-sync time (`pull_wait` = pure transfer; full model bf16): | Model | size | shard/src | `pull_wait` (transfer) | full sync | |---|---|---|---|---| | Qwen3-30B-A3B | 57 GiB | 14.24 GiB | 0.61 s | 2.7 s | | GLM-4.7-Flash (MLA+MoE) | — | 14.13 GiB | 0.61 s | 1.9 s | | Moonlight-16B (MLA) | — | 7.48 GiB | 0.33 s | 1.0 s | - LIBFABRIC over EFA was **~8–9× faster than NCCL broadcast** for the same cross-node sync (~3 s vs ~26 s for the 57 GiB model). - Flat across 2→4 nodes (transfer is per-source-shard bound, not node-count bound). - All transfers byte-correct end-to-end (downstream `--check-weight-update-equal` byte-compares the received weights, incl. MLA and fused-MoE tensors). Without fix ray-project#2 (K8s detection) the change is a silent no-op on pods (stays on UCX/TCP). Without ray-project#3/ray-project#4 it selects LIBFABRIC and then fails at transfer time on no-GPUDirect/VMM hosts. With all of them, the LIBFABRIC path is correct and fast across dense, MoE, and MLA models. ## Related issue number Builds on / supersedes the draft on `joshlee/use-libfabric-nixl-backend`. ## Checks - Testing strategy: exercised via a real cross-node RDT weight-sync workload on EFA hardware (numbers above), plus the fallback paths — UCX selection on non-EFA, and LIBFABRIC→UCX fallback when the transfer-size CUDA registration fails (no-GPUDirect host). Suggested unit coverage: `select_backend()` honors `RAY_NIXL_BACKEND` and the EFA globs; `get_nixl_agent()` falls back to UCX when the validation registration raises. - [ ] Signed off commits - [ ] `scripts/format.sh` - [ ] Doc update (`doc/source/ray-core/direct-transport.rst` — backend selection section) - [ ] Tests passing ## Notes for reviewers - The `NixlTransferStats` backend field reads `list(nixl_agent.backends.keys())`; confirm `backends` is always a dict on the supported NIXL version (an earlier review flagged a possible `AttributeError`) and consider guarding it, since the stats path runs on the hot recv path. - Detection globs are cached (`functools.lru_cache`) — fine since hardware doesn't change within a process. --------- Signed-off-by: Joshua Lee <joshlee@anyscale.com> Signed-off-by: xyuzh <xinyzng@gmail.com> Co-authored-by: xyuzh <xinyzng@gmail.com>

Three real drifts (unrelated to v3's disk-transport design) closed: 1. Map input bundles were pinned across the entire shuffle, defeating v3's whole point of reducing Plasma pressure. v2 destroys them the moment a mapper task finishes (shuffle_map_operator.py:306); v3 now does the same. FT limit drops to v2 parity (no retry after task success), which is the accepted tradeoff already documented for v2. * Drop ``_completed_input_bundles`` field + shutdown loop. * ``_handle_map_done``: ``input_bundle.destroy_if_owned()`` in place of the .append. 2. ShuffleReduceOpV3 lacked v2's ``fused_output_map_target_max_block_size_override`` equivalent. Downstream Write that wants a specific output block size was silently ignored when fused. * New ctor kwarg ``downstream_map_target_max_block_size_override``. * v3_reduce_task takes it as a new trailing param and threads it into the TaskContext it constructs around ``downstream_map_transformer``. * operator_fusion._get_fused_map_into_shuffle_reduce_operator V3 branch propagates ``down_op.target_max_block_size_override`` into it, matching the v2 branch semantics exactly. 3. ShuffleReduceOpV3 lacked v2's ``disallow_block_splitting`` kwarg. Callers that want "N partitions -> N unsplit blocks" had no op-level knob; ``coalesce_output=True`` only coalesces post-reduce_fn, it doesn't drop the BlockOutputBuffer reshape. * New ctor kwarg ``disallow_block_splitting``. * ``_add_input_inner`` mirrors v2 (shuffle_reduce_operator.py:161-165): when True, target_max_block_size becomes None so the reshape step is a no-op. * operator_fusion V3 branch propagates ``up_op._disallow_block_splitting``. Not touched: DRIFT ray-project#4 (_emit_empty_partition) — v3_reduce_task already emits a typed empty block from the handle-carried schema when a partition has no jobs (hash_shuffle_v3.py:1501-1510), so the N-block contract holds; v3 just pays a wasted task dispatch relative to v2's driver-side short-circuit. Efficiency, not correctness. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Yuanzhuo Yang <yuanzhuoyang@gmail.com>

Remove unnecessary files.

a9692a1

pcmoritz merged commit 6ed6411 into master Oct 27, 2016

pcmoritz deleted the cleanups branch October 27, 2016 06:24

pcmoritz pushed a commit that referenced this pull request Nov 19, 2016

Merge pull request #4 from pcmoritz/key_tuples

74406e9

Tuple keys in dicts can be serialized

richardliaw referenced this pull request in richardliaw/ray Oct 4, 2017

refactoring experiment code (#5)

2514d74

(redo of #4 which was dropped)

arvindc95 mentioned this pull request Oct 12, 2017

Ray issue on Odroid-XU4 board #1008

Closed

surehb mentioned this pull request Jul 21, 2018

Use different serialization context for each driver. #2406

Merged

ericl pushed a commit that referenced this pull request Mar 29, 2019

Fixes Inconsistent weight assignment operations in DQNPolicyGraph (#4… (

798944f

#4504) * Fixes Inconsistent weight assignment operations in DQNPolicyGraph (#4502) * Update dqn_policy_graph.py

mitar mentioned this pull request May 10, 2019

Using plasma on Kubernetes #1315

Closed

rkooo567 referenced this pull request in rkooo567/ray Mar 24, 2020

Merge pull request #4 from anyscale/persist_data

a73b950

[Hosted Dashboard] Persist data to S3

bentzinir pushed a commit to bentzinir/ray that referenced this pull request Jul 29, 2020

removing unnecessary changes ray-project#4

e668406

gshimansky mentioned this pull request May 25, 2021

Ray 1.3.0 CHECK fail during normal task scheduling #15990

Closed

2 tasks

Peter-P779 mentioned this pull request May 19, 2022

ray[RLlib]: Windows fatal exception: access violation #24955

Closed

xwjiang2010 mentioned this pull request Nov 8, 2022

[air] [rfc] Re-introduce session.checkpoint_dir() API #30070

Closed

jianoaix added a commit to jianoaix/ray that referenced this pull request Feb 28, 2023

Streaming executor fixes ray-project#4

846cc65

ericl pushed a commit that referenced this pull request Mar 1, 2023

Streaming executor fixes #4 (#32882)

2875357

rickyyx mentioned this pull request Mar 7, 2023

[core][regression] stress_test_many_tasks stage 2 iteration time regressed #33109

Closed

ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this pull request Mar 21, 2023

Streaming executor fixes ray-project#4 (ray-project#32882)

016aba2

Signed-off-by: Jack He <jackhe2345@gmail.com>

peytondmurray referenced this pull request in peytondmurray/ray Mar 22, 2023

Streaming executor fixes #4 (ray-project#32882)

7c80f9f

elliottower pushed a commit to elliottower/ray that referenced this pull request Apr 22, 2023

Streaming executor fixes ray-project#4 (ray-project#32882)

78e2266

Signed-off-by: elliottower <elliot@elliottower.com>

gemini-code-assist Bot mentioned this pull request Jan 28, 2026

[data] Skip upscaling validation warning for fixed-size actor pools #60569

Merged

marwan116 mentioned this pull request Apr 10, 2026

[Core] Permanently stuck objects/tasks #62512

Closed

SisPiao mentioned this pull request Apr 30, 2026

[RFC][Serve][LLM] Prefill-decode disaggregation support #62953

Open

xyuzh mentioned this pull request Jun 15, 2026

[core][rdt] Enable LIBFABRIC backend for NIXL #62339

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Remove unnecessary files.#4

Remove unnecessary files.#4
pcmoritz merged 1 commit into
masterfrom
cleanups

robertnishihara commented Oct 26, 2016

Labels

2 participants

Uh oh!

Conversation

robertnishihara commented Oct 26, 2016

Labels

2 participants