[xray] Implement Actor Reconstruction#3332
Conversation
|
Test FAILed. |
b0c7d4d to
cc24b6c
Compare
|
Test FAILed. |
|
Test PASSed. |
|
Test FAILed. |
|
Test FAILed. |
0d15f41 to
3935bb3
Compare
|
@stephanie-wang |
|
Test FAILed. |
|
Test FAILed. |
|
Test FAILed. |
|
Test FAILed. |
|
Hmm I'm not sure about that error...did you figure out anything more from it? Also, FYI we're going to merge a subset of this PR soon in #3359. |
|
@stephanie-wang it's very weird. It seems that something is left over after |
1490f24 to
ec456a9
Compare
|
Test FAILed. |
|
Test FAILed. |
|
Test FAILed. |
|
Test FAILed. |
|
Test FAILed. |
3b5655e to
679c5b4
Compare
|
Test FAILed. |
|
Test FAILed. |
The reason of this bug is because of actor handle GC. right now, we send a |
|
To fix this issue, I add a check in
If users do want to terminate an actor before driver exits, they can use |
There was a problem hiding this comment.
Can we actually log a warning in this case, e.g., logger.warn(...)?
This should be pretty rare and it'd be useful to see that this code path is actually getting hit.
There was a problem hiding this comment.
The worker.mode == ray.worker.SCRIPT_MODE check isn't really necessary, right?
There was a problem hiding this comment.
I think it's needed. because in a worker, self._ray_actor_driver_id.id() != worker.worker_id is always true
|
Test FAILed. |
There was a problem hiding this comment.
Why was the RAY_CHECK removed?
There was a problem hiding this comment.
Oops, it's a mistake.
There was a problem hiding this comment.
This gets called inside TreatTaskAsFailed, so I don't think we need this call anymore.
There was a problem hiding this comment.
I would move this to DEBUG since actors dying is an expected error.
There was a problem hiding this comment.
Yep, actor dying is expected, but should be infrequent. I think this message is worth noticing, because if it become massive, we can know something is wrong. I prefer keeping INFO, what do you think?
There was a problem hiding this comment.
Hmm about how we only log if !intentional_disconnect?
There was a problem hiding this comment.
I changed it to debug. that's fine as well.
There was a problem hiding this comment.
Previously, we actually let this case continue and resubmit the task, since the SubmitTask codepath would eventually treat the task as failed if the actor is DEAD. I think we should probably keep it that way to reduce duplicate logic.
|
@stephanie-wang comments addressed. |
|
Test FAILed. |
|
Test FAILed. |
stephanie-wang
left a comment
There was a problem hiding this comment.
Great, thanks! I'll merge assuming the tests pass.
|
Test FAILed. |
9865fcd to
202240b
Compare
|
Test FAILed. |
|
Test FAILed. |
|
Jenkins, retest this please |
|
Test FAILed. |
|
Hmm, Jenkins failed at a Ray Tune test. Doesn't look like related to this PR. Jenkins, retest this please. |
|
@stephanie-wang Travis CI all passed, and Jenkins should be unrelated. Should be okay for merge now. |
|
Great, thanks! |
|
Test FAILed. |
What do these changes do?
This PR implements actor reconstruction for raylet mode.
max_reconstructionsoption in@ray.remote(), which indicates how many times this actor should be reconstructed.Related issue number
#2868
See #3063 for previous discussions of this PR.