Skip to content

Don't check fail on missing lineage cache entry#3861

Merged
ericl merged 7 commits into
ray-project:masterfrom
ericl:fix-3813
Feb 5, 2019
Merged

Don't check fail on missing lineage cache entry#3861
ericl merged 7 commits into
ray-project:masterfrom
ericl:fix-3813

Conversation

@ericl

@ericl ericl commented Jan 25, 2019

Copy link
Copy Markdown
Contributor

What do these changes do?

Under some race conditions with slow actor creation tasks, it seems like we hit this check. Be defensive and just return empty lineage in this case.

Related issue number

#3813

@AmplabJenkins

Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11161/
Test FAILed.

Comment thread src/ray/raylet/lineage_cache.cc Outdated
if (entry) {
RAY_CHECK(uncommitted_lineage.SetEntry(entry->TaskData(), entry->GetStatus()));
} else {
RAY_LOG(ERROR) << "No lineage cache entry found for task " << task_id;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to add a comment on under what conditions this entry doesn't exist.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Added a pointer back to the issue.

It's not super clear what is evicting the lineage: perhaps some race condition on a task getting rescheduled, or the task succeeding but the node failing after that.

@AmplabJenkins

Copy link
Copy Markdown

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11174/
Test PASSed.

@robertnishihara

Copy link
Copy Markdown
Collaborator

I strongly think we should not do this. The assertion is failing because there is a bug that we don't understand. The solution is not to get rid of the assertion but rather to fix the bug. If we remove assertions whenever we don't understand what is going on, technical debt will accumulate.

You've mentioned that we shouldn't have fatal checks in production. And I'm ok with shipping wheels where DCHECKS just log errors or something like that, but this is turning it off even in the case where we're trying to do development and debugging.

@ericl

ericl commented Jan 26, 2019

Copy link
Copy Markdown
Contributor Author

What do we need to do to make RAY_DCHECK work?

#define RAY_DCHECK(condition) \

I modified it to a error log followed by a dcheck, but I don't know if this DCHECK is enabled in the right environments.

But overall, I would prefer we move to a more defensively progammed, fault-tolerant implementation of the backend. By fault-tolerant, I mean that we keep running successfully even if certain components hit bugs. A bug in lineage cache should not take down other parts of Ray.

A stronger condition yet would be to be able to survive raylet crashes. That way, as long as raylets can remain alive for long enough, the application keeps making progress.

@AmplabJenkins

Copy link
Copy Markdown

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11177/
Test PASSed.

@ericl

ericl commented Jan 27, 2019

Copy link
Copy Markdown
Contributor Author

What's the resolution here, is DCHECK the way to go?

@stephanie-wang

Copy link
Copy Markdown
Contributor

DCHECK makes sense, but I would keep the assertion (and the log message) in the same place as before so that we can actually catch it.

@stephanie-wang

Copy link
Copy Markdown
Contributor

Ah ignore the above comment, I read the code wrong.

The method that you changed here is supposed to return a lineage with the requested task in it, but it won't always do that now. You'll have to modify the place where this gets called in the raylet so that the caller adds the correct task to the lineage.

Comment thread src/ray/raylet/node_manager.cc Outdated

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have to actually add the task to the uncommitted lineage if it is not already there. The receiving node manager expects the forwarded task to be in the lineage here.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I guess it makes more sense to handle this all in this function.

@ericl ericl force-pushed the fix-3813 branch 2 times, most recently from 790ffa6 to e680553 Compare January 28, 2019 00:19

@ericl ericl left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did some renaming to make check semantics more clear.

<< "ray.init(redis_max_memory=<max_memory_bytes>).";
// Use a copy of the cached task spec to re-execute the task.
const Task task = lineage_cache_.GetTask(task_id);
const Task task = lineage_cache_.GetTaskOrDie(task_id);

@ericl ericl Jan 28, 2019

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't check fail in this function. However, the issue now is that you can't fail a task without the TaskSpec.

Is there a function from TaskID -> ReturnIDs?

@stephanie-wang stephanie-wang Jan 28, 2019

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use ComputeReturnId. Unfortunately, we can't know how many return values to put the error for without the task spec...I guess the safest option for now is to just put one return value.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if I returned like 10? Could that cause an issue?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For actor tasks, the last return value is the dummy object, which isn't supposed to have any value in the object store...this could potentially break other parts of the raylet, but I'm not totally sure.

@ericl

ericl commented Jan 28, 2019

Copy link
Copy Markdown
Contributor Author

Note: I don't think NDEBUG is actually enabled in our prod builds, but that's probably fine for now. The proximate cause of the original issue is likely fixed by #3860

@AmplabJenkins

Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11220/
Test FAILed.

@AmplabJenkins

Copy link
Copy Markdown

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11216/
Test PASSed.

@AmplabJenkins

Copy link
Copy Markdown

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11219/
Test PASSed.

@ericl

ericl commented Feb 3, 2019

Copy link
Copy Markdown
Contributor Author

Perhaps we should punt on the return value issue since it's more complicated? Any other issues here?

@stephanie-wang

Copy link
Copy Markdown
Contributor

Perhaps we should punt on the return value issue since it's more complicated? Any other issues here?

Sure, I approved.

@ericl

ericl commented Feb 4, 2019

Copy link
Copy Markdown
Contributor Author

jenkins retest this please

@AmplabJenkins

Copy link
Copy Markdown

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11540/
Test PASSed.

@ericl ericl merged commit 5fb813f into ray-project:master Feb 5, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

5 participants