[xray] Implement Actor Reconstruction by raulchen · Pull Request #3332 · ray-project/ray

raulchen · 2018-11-15T12:24:29Z

What do these changes do?

This PR implements actor reconstruction for raylet mode.

When an actor dies unexpectedly (because either its process or its whole node dies), the raylet backend automatically reconstructs the actor by replaying its creation task.
Reconstruction is off by default. Users can enable it by passing the max_reconstructions option to @ray.remote(), which specifies the maximum number of times the actor should be reconstructed.
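
As a rough illustration of the behavior described above, here is a minimal pure-Python sketch of "reconstruct by replaying the creation task, up to max_reconstructions times". It is a toy model, not Ray code: ReconstructableActor, ActorDiedError, kill, and call are hypothetical names invented for this example, and the sketch only replays the creation task. It makes no claim about whether or how the real backend replays later method calls.

```python
class ActorDiedError(Exception):
    """Raised when a dead actor has no reconstructions left (hypothetical)."""


class ReconstructableActor:
    """Toy stand-in for a backend-managed actor; not part of Ray."""

    def __init__(self, creation_task, max_reconstructions=0):
        self._creation_task = creation_task    # callable that builds fresh actor state
        self._remaining = max_reconstructions  # 0 means reconstruction is off
        self._state = creation_task()
        self.alive = True

    def kill(self):
        """Simulate an unexpected death of the actor process or its node."""
        self._state = None
        if self._remaining > 0:
            # Replay the creation task to rebuild the actor from scratch.
            self._remaining -= 1
            self._state = self._creation_task()
        else:
            self.alive = False

    def call(self, method, *args):
        if not self.alive:
            raise ActorDiedError("actor is dead and out of reconstructions")
        return getattr(self._state, method)(*args)


class Counter:
    def __init__(self):
        self.value = 0

    def increment(self):
        self.value += 1
        return self.value


actor = ReconstructableActor(Counter, max_reconstructions=1)
actor.call("increment")                # counter state is now 1
actor.kill()                           # first death: the creation task is replayed
after_first = actor.call("increment")  # state was rebuilt, so counting restarts
actor.kill()                           # second death: the budget is exhausted
```

With max_reconstructions left at its default of 0, the first kill() in this sketch would already leave the actor dead, which mirrors "off by default".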

Related issue number

See #3063 for previous discussions of this PR.

AmplabJenkins · 2018-11-15T12:46:26Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9379/
Test FAILed.

AmplabJenkins · 2018-11-19T06:41:55Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9441/
Test FAILed.

AmplabJenkins · 2018-11-19T08:59:55Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9442/
Test PASSed.

AmplabJenkins · 2018-11-20T06:05:51Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9471/
Test FAILed.

AmplabJenkins · 2018-11-20T11:45:43Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9478/
Test FAILed.

raulchen · 2018-11-20T14:54:24Z

@stephanie-wang actor_test.py/test_local_scheduler_dying hangs only on Python 2. I debugged this issue locally and got the following error. The weird thing is that the task that triggered this error actually belongs to the previous test, test_actor_init_error_propagated. Any idea why? (If I run test_local_scheduler_dying alone, it passes.)

I1120 22:47:12.093858 139269568 node_manager.cc:1595] Reconstructing task 0000000015bc38724a7713829c8b219a21ca722c on client b4875fd41191d75aa234b3175420d472719de0ff
F1120 22:47:12.094072 139269568 node_manager.cc:1612]  Check failed: lineage_cache_.ContainsTask(task_id)
*** Check failure stack trace: ***
    @        0x1003508fd  google::LogMessage::Fail()
    @        0x10034e71e  google::LogMessage::SendToLog()
    @        0x10034f59f  google::LogMessage::Flush()
    @        0x10034f3d9  google::LogMessage::~LogMessage()
    @        0x10034f695  google::LogMessage::~LogMessage()
    @        0x1002a0a95  ray::RayLog::~RayLog()
    @        0x10030d182  std::__1::__function::__func<>::operator()()
    @        0x100286c50  _ZZN3ray3gcs5TableINS_8UniqueIDENS_8protocol4TaskEE6LookupERKS2_S7_RKNSt3__18functionIFvPNS0_14AsyncGcsClientES7_RKNS3_5TaskTEEEERKNS9_IFvSB_S7_EEEENKUlSB_S7_RKNS8_6vectorISC_NS8_9allocatorISC_EEEEE_clESB_S7_SS_
    @        0x100284dc5  _ZZN3ray3gcs3LogINS_8UniqueIDENS_8protocol4TaskEE6LookupERKS2_S7_RKNSt3__18functionIFvPNS0_14AsyncGcsClientES7_RKNS8_6vectorINS3_5TaskTENS8_9allocatorISD_EEEEEEEENKUlRKNS8_12basic_stringIcNS8_11char_traitsIcEENSE_IcEEEEE_clEST_
    @        0x10029b6f7  ray::gcs::GlobalRedisCallback()
    @        0x100316ded  redisProcessCallbacks
    @        0x10029fa3c  RedisAsioClient::handle_read()
    @        0x1002a008c  boost::asio::detail::reactive_null_buffers_op<>::do_complete()
    @        0x100241848  boost::asio::detail::task_io_service::do_run_one()
    @        0x100241391  boost::asio::detail::task_io_service::run()
    @        0x10023b1fe  main

AmplabJenkins · 2018-11-20T16:25:44Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9479/
Test FAILed.

AmplabJenkins · 2018-11-20T16:32:39Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9481/
Test FAILed.

AmplabJenkins · 2018-11-20T16:32:39Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9480/
Test FAILed.

AmplabJenkins · 2018-11-20T17:23:05Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9482/
Test FAILed.

stephanie-wang · 2018-11-21T03:07:44Z

Hmm, I'm not sure about that error... Did you figure out anything more from it?

Also, FYI we're going to merge a subset of this PR soon in #3359.

raulchen · 2018-11-21T03:18:25Z

@stephanie-wang It's very weird. It seems that something is left over after test_exception_raised_when_actor_node_dies. I haven't figured out the reason, but I changed test_exception_raised_when_actor_node_dies to use cluster_utils, and it works now.

AmplabJenkins · 2018-11-21T05:53:45Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9499/
Test FAILed.

AmplabJenkins · 2018-11-21T06:33:54Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9501/
Test FAILed.

AmplabJenkins · 2018-11-21T07:32:30Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9509/
Test FAILed.

AmplabJenkins · 2018-11-21T09:05:53Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9508/
Test FAILed.

AmplabJenkins · 2018-11-21T14:44:05Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9514/
Test FAILed.

AmplabJenkins · 2018-12-04T08:48:07Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9738/
Test FAILed.

AmplabJenkins · 2018-12-04T17:19:15Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9741/
Test FAILed.

raulchen · 2018-12-05T03:24:32Z

This bug is caused by actor handle GC. Right now, we send a __ray_terminate__ message to the actor when its handle is GC'ed by the Python interpreter. However, the timing of GC is unreliable. In the unit tests, we run multiple test cases sequentially in a single process, and it turns out that, in Python 2, an actor handle created in the previous test case can be GC'ed during the next test case. When that happens, the driver sends __ray_terminate__ to the new raylet, which triggers reconstruction of the actor creation task. That task doesn't exist in the new GCS, so the check fails.
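
To make the "timing of GC is unreliable" point concrete, here is a small standalone sketch showing one way a finalizer can run much later than the del statement: an object caught in a reference cycle is only finalized when the cyclic garbage collector runs. This assumes CPython 3.4 or later (on Python 2, objects with __del__ in a cycle are not collected this way), and the Handle class is a hypothetical stand-in, not Ray's actor handle. It is not a claim about the exact mechanism in the failing test.

```python
import gc

finalized = []  # records which handles have been finalized


class Handle:
    """Hypothetical handle whose finalizer has a visible side effect."""

    def __init__(self, name):
        self.name = name
        self.cycle = self  # self-reference: reference counting alone cannot free this

    def __del__(self):
        finalized.append(self.name)


gc.disable()                         # keep automatic collection from interfering
h = Handle("from_previous_test")
del h                                # the name is gone, but the cycle keeps the object alive
ran_at_del = list(finalized)         # snapshot right after del
gc.collect()                         # the finalizer runs only during this collection
ran_after_collect = list(finalized)  # snapshot after the collector has run
gc.enable()
```

Any side effect placed in __del__, such as sending a message, therefore happens at whatever point the collector runs, which may be well after the code that created the object has finished.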

raulchen · 2018-12-05T03:27:40Z

To fix this issue, I added a check in __del__ to prevent sending __ray_terminate__ in this case. But I think it makes more sense to deprecate __del__, because:

There's no guarantee whether or when __del__ will be called; see https://stackoverflow.com/questions/1481488/what-is-the-del-method-how-to-call-it.
It doesn't handle the case of forked handles.
We can already clean up the actors when the driver exits.

If users do want to terminate an actor before the driver exits, they can use actor.__ray_terminate__.remote().

robertnishihara · 2018-12-05T03:30:26Z

Can we actually log a warning in this case, e.g., logger.warn(...)?

This should be pretty rare and it'd be useful to see that this code path is actually getting hit.

robertnishihara · 2018-12-05T03:30:42Z

The worker.mode == ray.worker.SCRIPT_MODE check isn't really necessary, right?

I think it's needed, because in a worker, self._ray_actor_driver_id.id() != worker.worker_id is always true.
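
The guard discussed above can be sketched roughly as follows. This is a standalone illustration under stated assumptions, not the code from this PR: Worker, ActorHandle, the SCRIPT_MODE and WORKER_MODE constants, and the sent list are hypothetical stand-ins. The idea is that a handle finalized in driver mode skips the terminate message when the current driver is not the one that created the actor, while handles finalized inside a worker are unaffected by the comparison.

```python
SCRIPT_MODE = 0  # stand-in for a "running as driver" mode constant
WORKER_MODE = 1  # stand-in for a "running as worker" mode constant


class Worker:
    """Hypothetical stand-in for the global worker object."""

    def __init__(self, mode, worker_id):
        self.mode = mode
        self.worker_id = worker_id


class ActorHandle:
    """Hypothetical handle that sends a terminate message when finalized."""

    def __init__(self, actor_id, driver_id, get_worker, sent):
        self.actor_id = actor_id
        self.driver_id = driver_id     # driver that created the actor
        self._get_worker = get_worker  # callable returning the current worker
        self._sent = sent              # list standing in for the message channel

    def __del__(self):
        worker = self._get_worker()
        if worker.mode == SCRIPT_MODE and self.driver_id != worker.worker_id:
            # The handle outlived the driver that created it: skip the message.
            return
        self._sent.append(("__ray_terminate__", self.actor_id))


current = {"worker": Worker(SCRIPT_MODE, "driver-1")}
sent = []

stale = ActorHandle("actor-a", "driver-1", lambda: current["worker"], sent)
current["worker"] = Worker(SCRIPT_MODE, "driver-2")  # the next test starts a new driver
del stale  # finalized under driver-2, so no message is sent

fresh = ActorHandle("actor-b", "driver-2", lambda: current["worker"], sent)
del fresh  # finalized under its own driver, so the message is sent
```

The mode check matters in this sketch for the reason given in the reply above: inside a worker the id comparison is always unequal, so without the mode check a worker would never send the message.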

AmplabJenkins · 2018-12-05T05:05:55Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9762/
Test FAILed.

stephanie-wang · 2018-12-10T18:11:15Z

Why was the RAY_CHECK removed?

Oops, it's a mistake.

stephanie-wang · 2018-12-10T18:12:02Z

This gets called inside TreatTaskAsFailed, so I don't think we need this call anymore.

stephanie-wang · 2018-12-10T18:13:18Z

I would move this to DEBUG since actors dying is an expected error.

Yep, an actor dying is expected, but it should be infrequent. I think this message is worth noticing, because if it shows up in large numbers, we'll know something is wrong. I prefer keeping INFO. What do you think?

Hmm, how about we only log if !intentional_disconnect?

I changed it to DEBUG. That's fine as well.

stephanie-wang · 2018-12-10T18:22:30Z

Previously, we actually let this case continue and resubmit the task, since the SubmitTask codepath would eventually treat the task as failed if the actor is DEAD. I think we should probably keep it that way to reduce duplicate logic.

raulchen · 2018-12-12T07:33:51Z

@stephanie-wang comments addressed.

AmplabJenkins · 2018-12-12T09:41:32Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9986/
Test FAILed.

AmplabJenkins · 2018-12-12T10:12:56Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9987/
Test FAILed.

stephanie-wang

Great, thanks! I'll merge assuming the tests pass.

AmplabJenkins · 2018-12-13T04:18:41Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/10002/
Test FAILed.

AmplabJenkins · 2018-12-13T14:21:24Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/10020/
Test FAILed.

AmplabJenkins · 2018-12-14T03:09:55Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/10030/
Test FAILed.

raulchen · 2018-12-14T03:23:27Z

Jenkins, retest this please

AmplabJenkins · 2018-12-14T03:56:37Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/10035/
Test FAILed.

raulchen · 2018-12-14T04:08:16Z

Hmm, Jenkins failed on a Ray Tune test. It doesn't look related to this PR. Jenkins, retest this please.

raulchen · 2018-12-14T04:32:49Z

@stephanie-wang Travis CI all passed, and the Jenkins failure should be unrelated. Should be okay to merge now.

stephanie-wang · 2018-12-14T05:29:02Z

Great, thanks!

AmplabJenkins · 2018-12-14T06:28:38Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/10039/
Test FAILed.

raulchen mentioned this pull request Nov 15, 2018

[xray] Implement actor reconstruction. #3063

Closed

4 tasks

raulchen force-pushed the reconstruct_actor2 branch from b0c7d4d to cc24b6c Compare November 19, 2018 06:26

raulchen force-pushed the reconstruct_actor2 branch from 0d15f41 to 3935bb3 Compare November 20, 2018 14:19

raulchen force-pushed the reconstruct_actor2 branch from 1490f24 to ec456a9 Compare November 21, 2018 03:30

raulchen mentioned this pull request Nov 21, 2018

Fix failure handling for actor death #3359

Merged

raulchen force-pushed the reconstruct_actor2 branch from 3b5655e to 679c5b4 Compare December 4, 2018 07:27

robertnishihara reviewed Dec 5, 2018

stephanie-wang reviewed Dec 10, 2018

stephanie-wang approved these changes Dec 13, 2018

raulchen added 12 commits December 13, 2018 19:48

Implement Actor Reconstruction

2aeaeda

fix

2771e20

fix actor handle __del__

f66ca5e

fix lint

84c8147

add comment

6a63917

Remove actorCreationDummyObjectId

6a3c2b5

address comments

b9774a8

fix

646b1f3

address comments

094a47b

avoid copy

390c671

change log to debug

45eeadc

fix error name

202240b

raulchen force-pushed the reconstruct_actor2 branch from 9865fcd to 202240b Compare December 13, 2018 11:52

Merge branch 'master' into reconstruct_actor2

a0fecb8

stephanie-wang merged commit e7b51cb into ray-project:master Dec 14, 2018

raulchen deleted the reconstruct_actor2 branch December 14, 2018 05:51
