Skip to content

Prepare socket file when start ray#3925

Merged
raulchen merged 3 commits into
ray-project:masterfrom
antgroup:fixSocketFile
Feb 2, 2019
Merged

Prepare socket file when start ray#3925
raulchen merged 3 commits into
ray-project:masterfrom
antgroup:fixSocketFile

Conversation

@guoyuhong

Copy link
Copy Markdown
Contributor

What do these changes do?

Sometimes, users want to specify their own socket file name. However, there are two cases that they could fail.

  1. The directory specified does not exist. In this case, ray should make the directory.
  2. The socket file exists before they start the cluster, which may due to that we are test failover functionality and send SIGKILL to raylet. In this case, ray should clean the uncleaned file.

For the first case, raylet will crash with:

libc++abi.dylib: terminating with uncaught exception of type boost::exception_detail::clone_impl<boost::exception_detail::  error_info_injector<boost::system::system_error> >: bind: No such file or directory
*** Aborted at 1549001100 (unix time) try "date -d @1549001100" if you are using GNU date ***
PC: @                0x0 (unknown)
*** SIGABRT (@0x7fff6972ab86) received by PID 76085 (TID 0x10cef95c0) stack trace: ***
    @     0x7fff697d5b3d _sigtramp

For the second case, raylet will crash with:

libc++abi.dylib: terminating with uncaught exception of type boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<boost::system::system_error> >: bind: Address already in use
*** Aborted at 1549001712 (unix time) try "date -d @1549001712" if you are using GNU date ***
PC: @                0x0 (unknown)
*** SIGABRT (@0x7fff6972ab86) received by PID 77036 (TID 0x1161875c0) stack trace: ***
    @     0x7fff697d5b3d _sigtramp

Related issue number

@AmplabJenkins

Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11390/
Test FAILed.

@jovany-wang jovany-wang left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great!

Comment thread python/ray/node.py Outdated
Comment thread python/ray/node.py Outdated
Args:
socket_path (string): the socket file to prepare.
"""
if not os.path.isfile(socket_path):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if not os.path.exist(socket_path) will be more explicit?

Comment thread test/runtest.py Outdated
shutil.rmtree(tempdir)


def test_socket_directory_none_existant(shutdown_only):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo, existent. or maybe shorter test_socket_dir_not_existing

Comment thread test/runtest.py Outdated
def test_socket_directory_none_existant(shutdown_only):
level1_name = ray.ObjectID(_random_string()).hex()
level2_name = ray.ObjectID(_random_string()).hex()
temp_raylet_socket_dir = "/tmp/{}/{}".format(level1_name, level2_name)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, let's put the socket under something like /tmp/ray/tests to avoid creating too many files in /tmp

Comment thread python/ray/node.py Outdated
if not os.path.isdir(path):
try_to_create_directory(path)
else:
os.remove(socket_path)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if it's a good idea to remove the existing socket file. If a raylet is still running and user starts ray again, this will break the previous raylet.
We can check if this socket file is still used by any processes before removing it. Or maybe simply raise an error if it already exists.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because this function is used in raylet socket and plasma socket, it not easy to connect to it to check whether it is in use. I raise exception here. It is better than nothing. If we don't do this and start cluster using ray start, there will be no error after this command. We need to check it in raylet.err. With this exception, ray start will stop immediately.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, raising the error earlier is better.
BTW, we can detect if a socket file is in use with the lsof command. But raising an error is okay as well.

@AmplabJenkins

Copy link
Copy Markdown

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11402/
Test PASSed.

@AmplabJenkins

Copy link
Copy Markdown

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11403/
Test PASSed.

@guoyuhong

Copy link
Copy Markdown
Contributor Author

@robertnishihara The Linux Wheel test fails in all PRs. Shall we change the number of expecting wheels to 4 to unblock the test?

@raulchen raulchen merged commit 54cbb43 into ray-project:master Feb 2, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

4 participants