ray on X:

ByteScale is a new LLM training framework - Evaluated 7B to 141B param models - 256K to 2048K context lengths - 12,000 GPUs - Optimized for mixed long and short sequences The crux of it is a much more dynamic parallelism strategy (as opposed to a static mesh) to account for heterogeneity in sequence length. They call this strategy Hybrid Data Parallelism (HDP), which combines regular data parallelism with context parallelism in a dynamic manner. Their data loading strategy is very network and CPU-memory intensive and requires global coordination across workers (as opposed to each worker doing its own thing). They use Ray actor for this coordination. There are - Servers to fetch and preprocess raw data from HDFS and generate metadata - A scheduler to collect global metadata from all servers, figure out the the loading plan, and broadcast the plan to clients - Clients (on GPUs), which read the partial data from servers based on the loading plan

9:23 PM · Mar 7, 202518.4KViews

Post

Post