Post

Log inSign up

Post

user avatar
ray
@raydistributed
ByteScale is a new LLM training framework - Evaluated 7B to 141B param models - 256K to 2048K context lengths - 12,000 GPUs - Optimized for mixed long and short sequences The crux of it is a much more dynamic parallelism strategy (as opposed to a static mesh) to account for heterogeneity in sequence length. They call this strategy Hybrid Data Parallelism (HDP), which combines regular data parallelism with context parallelism in a dynamic manner. Their data loading strategy is very network and CPU-memory intensive and requires global coordination across workers (as opposed to each worker doing its own thing). They use Ray actor for this coordination. There are - Servers to fetch and preprocess raw data from HDFS and generate metadata - A scheduler to collect global metadata from all servers, figure out the the loading plan, and broadcast the plan to clients - Clients (on GPUs), which read the partial data from servers based on the loading plan
9:23 PM · Mar 7, 202518.4KViews

New to X?

Sign up now to get your own personalized timeline!

Create account

By signing up, you agree to the Terms of Service and Privacy Policy, including Cookie Use.

Relevant people

user avatar
ray@raydistributedFollow

Trending now

Terms·Privacy·Cookies·Accessibility·Ads Info·© 2026 X Corp.
Don't miss what's happening
People on X are the first to know.
Log inSign up