Skip to content

[xray] Implement Actor Reconstruction#3332

Merged
stephanie-wang merged 13 commits into
ray-project:masterfrom
antgroup:reconstruct_actor2
Dec 14, 2018
Merged

[xray] Implement Actor Reconstruction#3332
stephanie-wang merged 13 commits into
ray-project:masterfrom
antgroup:reconstruct_actor2

Conversation

@raulchen

Copy link
Copy Markdown
Contributor

What do these changes do?

This PR implements actor reconstruction for raylet mode.

  • When an actor dies accidentally (either because the process dies or because the whole node dies), raylet backend will automatically reconstruct the actor by replaying its creation task.
  • Reconstruction is turned off by default, users can enable it by specifying a max_reconstructions option in @ray.remote(), which indicates how many times this actor should be reconstructed.

Related issue number

#2868

See #3063 for previous discussions of this PR.

@AmplabJenkins

Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9379/
Test FAILed.

@AmplabJenkins

Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9441/
Test FAILed.

@AmplabJenkins

Copy link
Copy Markdown

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9442/
Test PASSed.

@AmplabJenkins

Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9471/
Test FAILed.

@AmplabJenkins

Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9478/
Test FAILed.

@raulchen

Copy link
Copy Markdown
Contributor Author

@stephanie-wang actor_test.py/test_local_scheduler_dying hangs only on Python 2. I debugged this issue locally and got the following error. The weird thing is that the task that triggered this error actually belongs to the previous test test_actor_init_error_propagated. Any idea why? (If I run test_local_scheduler_dying alone, it can pass.)

139137 I1120 22:47:12.093858 139269568 node_manager.cc:1595] Reconstructing task 0000000015bc38724a7713829c8b219a21ca722c on client b4875fd41191d75aa234b3175420d472719de0ff
139138 F1120 22:47:12.094072 139269568 node_manager.cc:1612]  Check failed: lineage_cache_.ContainsTask(task_id)
139139 *** Check failure stack trace: ***
139140     @        0x1003508fd  google::LogMessage::Fail()
139141     @        0x10034e71e  google::LogMessage::SendToLog()
139142     @        0x10034f59f  google::LogMessage::Flush()
139143     @        0x10034f3d9  google::LogMessage::~LogMessage()
139144     @        0x10034f695  google::LogMessage::~LogMessage()
139145     @        0x1002a0a95  ray::RayLog::~RayLog()
139146     @        0x10030d182  std::__1::__function::__func<>::operator()()
139147     @        0x100286c50                                                                                                                                                                                            _ZZN3ray3gcs5TableINS_8UniqueIDENS_8protocol4TaskEE6LookupERKS2_S7_RKNSt3__18functionIFvPNS0_14AsyncGcsClientES7_RKNS3_5TaskTEEEERKNS9_IFvSB_S7_EEEENKUlSB_S7_RKNS8_6vectorISC_NS8_9allocatorISC_EEEEE_clESB_       S7_SS_
139148     @        0x100284dc5                                                                                                                                                                                            _ZZN3ray3gcs3LogINS_8UniqueIDENS_8protocol4TaskEE6LookupERKS2_S7_RKNSt3__18functionIFvPNS0_14AsyncGcsClientES7_RKNS8_6vectorINS3_5TaskTENS8_9allocatorISD_EEEEEEEENKUlRKNS8_12basic_stringIcNS8_11char_traits       IcEENSE_IcEEEEE_clEST_
139149     @        0x10029b6f7  ray::gcs::GlobalRedisCallback()
139150     @        0x100316ded  redisProcessCallbacks
139151     @        0x10029fa3c  RedisAsioClient::handle_read()
139152     @        0x1002a008c  boost::asio::detail::reactive_null_buffers_op<>::do_complete()
139153     @        0x100241848  boost::asio::detail::task_io_service::do_run_one()
139154     @        0x100241391  boost::asio::detail::task_io_service::run()
139155     @        0x10023b1fe  main
@AmplabJenkins

Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9479/
Test FAILed.

@AmplabJenkins

Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9481/
Test FAILed.

@AmplabJenkins

Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9480/
Test FAILed.

@AmplabJenkins

Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9482/
Test FAILed.

@stephanie-wang

Copy link
Copy Markdown
Contributor

Hmm I'm not sure about that error...did you figure out anything more from it?

Also, FYI we're going to merge a subset of this PR soon in #3359.

@raulchen

Copy link
Copy Markdown
Contributor Author

@stephanie-wang it's very weird. It seems that something is left over after test_exception_raised_when_actor_node_dies. I haven't figured out the reason. But I changed test_exception_raised_when_actor_node_dies to use cluster_utils. It works right now.

@AmplabJenkins

Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9499/
Test FAILed.

@AmplabJenkins

Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9501/
Test FAILed.

@AmplabJenkins

Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9509/
Test FAILed.

@AmplabJenkins

Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9508/
Test FAILed.

@AmplabJenkins

Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9514/
Test FAILed.

@AmplabJenkins

Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9738/
Test FAILed.

@AmplabJenkins

Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9741/
Test FAILed.

@raulchen

raulchen commented Dec 5, 2018

Copy link
Copy Markdown
Contributor Author

@stephanie-wang actor_test.py/test_local_scheduler_dying hangs only on Python 2. I debugged this issue locally and got the following error. The weird thing is that the task that triggered this error actually belongs to the previous test test_actor_init_error_propagated. Any idea why? (If I run test_local_scheduler_dying alone, it can pass.)

139137 I1120 22:47:12.093858 139269568 node_manager.cc:1595] Reconstructing task 0000000015bc38724a7713829c8b219a21ca722c on client b4875fd41191d75aa234b3175420d472719de0ff
139138 F1120 22:47:12.094072 139269568 node_manager.cc:1612]  Check failed: lineage_cache_.ContainsTask(task_id)
139139 *** Check failure stack trace: ***
139140     @        0x1003508fd  google::LogMessage::Fail()
139141     @        0x10034e71e  google::LogMessage::SendToLog()
139142     @        0x10034f59f  google::LogMessage::Flush()
139143     @        0x10034f3d9  google::LogMessage::~LogMessage()
139144     @        0x10034f695  google::LogMessage::~LogMessage()
139145     @        0x1002a0a95  ray::RayLog::~RayLog()
139146     @        0x10030d182  std::__1::__function::__func<>::operator()()
139147     @        0x100286c50                                                                                                                                                                                            _ZZN3ray3gcs5TableINS_8UniqueIDENS_8protocol4TaskEE6LookupERKS2_S7_RKNSt3__18functionIFvPNS0_14AsyncGcsClientES7_RKNS3_5TaskTEEEERKNS9_IFvSB_S7_EEEENKUlSB_S7_RKNS8_6vectorISC_NS8_9allocatorISC_EEEEE_clESB_       S7_SS_
139148     @        0x100284dc5                                                                                                                                                                                            _ZZN3ray3gcs3LogINS_8UniqueIDENS_8protocol4TaskEE6LookupERKS2_S7_RKNSt3__18functionIFvPNS0_14AsyncGcsClientES7_RKNS8_6vectorINS3_5TaskTENS8_9allocatorISD_EEEEEEEENKUlRKNS8_12basic_stringIcNS8_11char_traits       IcEENSE_IcEEEEE_clEST_
139149     @        0x10029b6f7  ray::gcs::GlobalRedisCallback()
139150     @        0x100316ded  redisProcessCallbacks
139151     @        0x10029fa3c  RedisAsioClient::handle_read()
139152     @        0x1002a008c  boost::asio::detail::reactive_null_buffers_op<>::do_complete()
139153     @        0x100241848  boost::asio::detail::task_io_service::do_run_one()
139154     @        0x100241391  boost::asio::detail::task_io_service::run()
139155     @        0x10023b1fe  main

The reason of this bug is because of actor handle GC. right now, we send a __ray_terminiate__ message to actor when the actor handle is GC'ed by Python interpreter. However, the timing of GC is unreliable. In unit test, we run multiple test cases sequentially in one single process. And it turns out that, in Python 2, an actor handle that was created in the previous test case would be GC'ed in the next test case. In this case, driver will send the __ray_terminate__ to the new raylet, and trigger reconstructing the actor creation task. However, that task doesn't exist in the new GCS, then this check fails.

@raulchen

raulchen commented Dec 5, 2018

Copy link
Copy Markdown
Contributor Author

To fix this issue, I add a check in __del__ to prevent sending __ray_terminate__ in this case. But I think it makes more sense to deprecate __del__, because:

  1. there's no guarantee whether or when __del__ will be called, see https://stackoverflow.com/questions/1481488/what-is-the-del-method-how-to-call-it.
  2. it doesn't handle the case of forked handles.
  3. we can already clean up the actors when driver exits.

If users do want to terminate an actor before driver exits, they can use actor.__ray_terminate__.remote()

Comment thread python/ray/actor.py Outdated

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we actually log a warning in this case, e.g., logger.warn(...)?

This should be pretty rare and it'd be useful to see that this code path is actually getting hit.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment thread python/ray/actor.py Outdated

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The worker.mode == ray.worker.SCRIPT_MODE check isn't really necessary, right?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's needed. because in a worker, self._ray_actor_driver_id.id() != worker.worker_id is always true

@AmplabJenkins

Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9762/
Test FAILed.

Comment thread src/ray/raylet/node_manager.cc Outdated

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was the RAY_CHECK removed?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, it's a mistake.

Comment thread src/ray/raylet/node_manager.cc Outdated

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This gets called inside TreatTaskAsFailed, so I don't think we need this call anymore.

Comment thread src/ray/raylet/node_manager.cc Outdated

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would move this to DEBUG since actors dying is an expected error.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, actor dying is expected, but should be infrequent. I think this message is worth noticing, because if it become massive, we can know something is wrong. I prefer keeping INFO, what do you think?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm about how we only log if !intentional_disconnect?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed it to debug. that's fine as well.

Comment thread src/ray/raylet/node_manager.cc Outdated

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously, we actually let this case continue and resubmit the task, since the SubmitTask codepath would eventually treat the task as failed if the actor is DEAD. I think we should probably keep it that way to reduce duplicate logic.

@raulchen

Copy link
Copy Markdown
Contributor Author

@stephanie-wang comments addressed.

@AmplabJenkins

Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9986/
Test FAILed.

@AmplabJenkins

Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9987/
Test FAILed.

@stephanie-wang stephanie-wang left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, thanks! I'll merge assuming the tests pass.

@AmplabJenkins

Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/10002/
Test FAILed.

@AmplabJenkins

Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/10020/
Test FAILed.

@AmplabJenkins

Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/10030/
Test FAILed.

@raulchen

Copy link
Copy Markdown
Contributor Author

Jenkins, retest this please

@AmplabJenkins

Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/10035/
Test FAILed.

@raulchen

Copy link
Copy Markdown
Contributor Author

Hmm, Jenkins failed at a Ray Tune test. Doesn't look like related to this PR. Jenkins, retest this please.

@raulchen

Copy link
Copy Markdown
Contributor Author

@stephanie-wang Travis CI all passed, and Jenkins should be unrelated. Should be okay for merge now.

@stephanie-wang stephanie-wang merged commit e7b51cb into ray-project:master Dec 14, 2018
@stephanie-wang

Copy link
Copy Markdown
Contributor

Great, thanks!

@raulchen raulchen deleted the reconstruct_actor2 branch December 14, 2018 05:51
@AmplabJenkins

Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/10039/
Test FAILed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

5 participants