Multi-model training, tuning, and serving are common tasks in machine learning. They involve training and tuning multiple models on the same or different data segments. The data segments typically correspond to different locations, products, or groups of locations or products. Using distributed compute to train hundreds or thousands of models takes less time than serial, single-process Python because the data and the model training, tuning, and inferencing can be split into batches and run in parallel!
These notebooks demonstrate how to use Ray v2 for quick and easy distributed forecasting - a special case of multi-model training, tuning, inferencing, and prediction. You will learn how to convert existing code so it can run in parallel on multiple compute nodes. The compute can be cores on your laptop or clusters in the cloud.
Ray can be used with any AI/ML Python library! But, in these notebooks, we will demo:
These notebooks use the public NYC Taxi rides dataset.
- Raw data original source: https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
- Raw data hosted publicly on AWS: s3://anonymous@air-example-data/ursa-labs-taxi-data/by_year/
- 8 months of cleaned data in this repo under folder data/
We recommend running Ray on Anyscale to take full advantage of developing on a personal laptop, then quickly spinning up resources in a cloud to run your same laptop code on bigger compute resources.
To configure an Anyscale cluster environment, use the latest Ray (v2.2 at the time of writing) on a Python 3.8 ML Docker image, for example `anyscale/ray-ml:2.2.0-py38-gpu`. If you don't need an expensive GPU, you can remove it from the cluster's node types just before you spin the cluster up. The 'ml' Docker image comes with standard ML libraries, e.g. pandas and matplotlib, already installed. Python 3.8 is important because, at the time of writing, Prophet still has this dependency.
- In your browser, open `console.anyscale.com`.
- Click on `Configurations` > `Create a new environment`.
- Give the configuration a name, for example `myname-forecasting`.
- Select a base Docker image, for example `anyscale/ray-ml:2.2.0-py38-gpu`.
- Specify `Pip packages` in this order:
  - `protobuf==3.19.*`
  - `Cython`
  - `numba`
  - `numpy==1.21.6`
  - `pystan==2.19.1.1`
  - `cmdstanpy==0.9.68`
  - `prophet==1.0`
  - `plotly`
  - `statsforecast==1.3.1`
  - `scikit-learn`
  - `pyarrow==10.0.0`
  - `statsmodels`
  - `ax-platform`
  - `gpytorch`
  - `scipy`
  - `seaborn`
  - `torch`
  - `kats`
  - For PyTorch Forecasting add these:
    - `ray_lightning`
    - `pytorch-forecasting`
    - `mlflow`
- For PyTorch Forecasting specify `Conda packages` in this order:
  - `tqdm`
  - `grpcio-tools`
  - `tensorflow`
  - `tensorboard`
  - `tensorboardx`
- Put your GitHub repo in the `Post build commands` section:
  - If you have a project name:
    - `git clone your-git-repo-url ../your-project-name/`
  - Otherwise, if you do not have a project:
    - `git clone your-git-repo-url`
- Click `Create`.
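Because several of the pip packages are pinned to exact versions, it can help to confirm what actually got installed once the environment is built. The following is a minimal sketch using only the standard library. The helper names `installed_version` and `check_pins` are hypothetical, the pins are copied from the package list above, and the comparison is a simple wildcard string match rather than full pip version semantics.

```python
from fnmatch import fnmatch
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the pip package list; "*" is treated as a wildcard.
PINNED = {
    "protobuf": "3.19.*",
    "numpy": "1.21.6",
    "pystan": "2.19.1.1",
    "cmdstanpy": "0.9.68",
    "prophet": "1.0",
    "statsforecast": "1.3.1",
    "pyarrow": "10.0.0",
}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def check_pins(pinned, lookup=installed_version):
    """Return {package: found_version} for every pin that is not satisfied."""
    problems = {}
    for package, pattern in pinned.items():
        found = lookup(package)
        if found is None or not fnmatch(found, pattern):
            problems[package] = found
    return problems


if __name__ == "__main__":
    for package, found in check_pins(PINNED).items():
        print(f"{package}: expected {PINNED[package]}, found {found}")
```

Running this in a notebook cell on the cluster prints one line per mismatched or missing package, and prints nothing when every pin is satisfied.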
- In your browser, open `console.anyscale.com`.
- Click on `Clusters` > `Create`.
- Give the cluster a name.
- Select a project that the cluster belongs to.
- Select the cluster environment that you just created, for example `myname-forecasting`, and choose its latest version.
- Leave the default radio button on `Compute config` = `Create a one-off configuration`.
- Select a default cloud config from your organization, e.g. AWS, region=us-west-2, zones=any.
- Node types: this is where you can delete the GPU if you are not going to use it, for example remove `g4dn.4xlarge`. You can also specify the min/max number of worker nodes, memory, and the AWS spot instances option here.
- Click `Start`.
- Wait until the cluster is ready, then click the `Jupyter` button.
Anyscale by default will automatically shut down your cluster for you after 2 hours of inactivity. That way you don't have to worry about accidentally leaving it running over a weekend.
- In your browser, open `console.anyscale.com`.
- Click on `Clusters` > `Created by me`.
- Click on the cluster.
- Click `Start`.
- Wait until the cluster is ready, then click the `Jupyter` button.
🎓 To further speed up your development process (especially convenient if you are contributing to open-source Ray), use Anyscale Workspaces to develop and save your code directly on a cloud instead of on your laptop!
Let's have fun 😜 and Thank you 🙏.


