vLLM (@vllm_project) / X

vLLM

1,091 posts

@vllm_project

A high-throughput and memory-efficient inference and serving engine for LLMs. Join slack.vllm.ai to discuss together with the community!

vllm.ai

Joined March 2024

Following

43.1K

Followers

vLLM
@vllm_project
Oct 20, 2025
🚀 DeepSeek-OCR — the new frontier of OCR from @deepseek_ai , exploring optical context compression for LLMs, is running blazingly fast on vLLM ⚡ (~2500 tokens/s on A100-40G) — powered by vllm==0.8.5 for day-0 model support. 🧠 Compresses visual contexts up to 20× while keeping
2M
vLLM
@vllm_project
Apr 14, 2025
🙏 @deepseek_ai's highly performant inference engine is built on top of vLLM. Now they are open-sourcing the engine the right way: instead of a separate repo, they are bringing changes to the open source community so everyone can immediately benefit!
open-infra-index/OpenSourcing_DeepSeek_Inference_Engine/README.md at main · deepseek-ai/open-infr...
From github.com
203K
vLLM
@vllm_project
Nov 3, 2025
Wow excited to see PewDiePie using vLLM to serve language models locally 😃 vLLM brings easy, fast, and cheap LLM serving for everyone 🥰
Yuchen Jin
@Yuchenj_UW
Oct 31, 2025
PewDiePie in 2025: – built a 10×4090 rig – runs Llama 70B, gpt-oss-120B & Qwen 245B locally via vLLM – built a custom web UI (chat, RAG, search, TTS) – ran protein-folding simulations for charity – created an AI “council”, a swarm of 64 models – now fine-tuning his own model
164K
vLLM
@vllm_project
Sep 18, 2025
Congrats to @deepseek_ai ! DeepSeek-R1 was published in Nature yesterday as the cover article, and vLLM is proud to have supported its RL training and inference🥰
213K
vLLM
@vllm_project
Aug 17, 2025
🚀 Amazing community project! vLLM CLI — a command-line tool for serving LLMs with vLLM: ✅ Interactive menu-driven UI & scripting-friendly CLI ✅ Local + HuggingFace Hub model management ✅ Config profiles for perf/memory tuning ✅ Real-time server & GPU monitoring ✅ Error
71K
vLLM
@vllm_project
Feb 21, 2025
We're excited to receive our first #NVIDIADGX B200 system which we'll use for vLLM research and development! Thank you @nvidia!
119K
vLLM
@vllm_project
Oct 16, 2025
Announcing the completely reimagined vLLM TPU! In collaboration with @Google, we've launched a new high-performance TPU backend unifying @PyTorch and JAX under a single lowering path for amazing performance and flexibility. 🚀 What's New? - JAX + PyTorch: Run PyTorch models on
157K
vLLM
@vllm_project
Apr 17, 2025
vLLM🤝🤗! You can now deploy any @huggingface language model with vLLM's speed. This integration makes it possible for one consistent implementation of the model in HF for both training and inference. 🧵
Transformers modeling backend integration in vLLM
From vllm.ai
73K
vLLM
@vllm_project
Feb 1, 2025
We landed the 1st batch of enhancements to the @deepseek_ai models, starting MLA and cutlass fp8 kernels. Compared to v0.7.0, we offer ~3x the generation throughput, ~10x the memory capacity for tokens, and horizontal context scalability with pipeline parallelism.
90K
vLLM
@vllm_project
Sep 29, 2025
How does @deepseek_ai Sparse Attention (DSA) work? It has 2 components: the Lightning Indexer and Sparse Multi-Latent Attention (MLA). The indexer keeps a small key cache of 128 per token (vs. 512 for MLA). It scores incoming queries, and the top-2048 tokens are passed to Sparse MLA.
This post is unavailable.
103K
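The two-stage flow described in the post above (a cheap indexer scores cached tokens, then attention runs only over the top-scoring ones) can be illustrated with a toy sketch. This is not DeepSeek's or vLLM's implementation: the plain dot-product scoring, the dense softmax attention standing in for MLA, and every name here (`sparse_attention`, `top_k_indices`, `index_keys`) are assumptions made for illustration. The dimensions and `k` are tiny compared with the 128, 512, and 2048 figures quoted in the post.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def top_k_indices(scores, k):
    """Indices of the k highest scores, returned in original token order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def sparse_attention(query, index_query, keys, index_keys, values, k):
    # Stage 1 (stand-in for the indexer): score every cached token
    # using only the small per-token index keys.
    scores = [dot(index_query, ik) for ik in index_keys]
    selected = top_k_indices(scores, k)

    # Stage 2 (stand-in for sparse MLA): ordinary softmax attention,
    # restricted to the selected tokens.
    scale = math.sqrt(len(query))
    logits = [dot(query, keys[i]) / scale for i in selected]
    peak = max(logits)
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    dim = len(values[0])
    output = [
        sum((w / total) * values[i][d] for w, i in zip(weights, selected))
        for d in range(dim)
    ]
    return selected, output


# Toy cache of 4 tokens: 1-dim index keys, 2-dim keys and values.
index_keys = [[0.1], [0.9], [0.5], [0.8]]
keys = [[0.0, 0.0]] * 4  # identical keys, so attention weights are uniform
values = [[1.0, 0.0], [0.0, 2.0], [5.0, 5.0], [0.0, 4.0]]

selected, output = sparse_attention(
    query=[1.0, 1.0], index_query=[1.0],
    keys=keys, index_keys=index_keys, values=values, k=2,
)
print(selected, output)
```

In this sketch the full-attention cost is paid only for `k` tokens, while the per-token work over the whole cache uses the much smaller index keys.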
vLLM
@vllm_project
Sep 28, 2025
🚀 New in vLLM: dots.ocr 🔥 A powerful multilingual OCR model from @xiaohongshu hi lab is now officially supported in vLLM! 📝 Single end-to-end parser for text, tables (HTML), formulas (LaTeX), and layouts (Markdown) 🌍 Supports 100 languages with robust performance on
merve
@mervenoyann
Aug 5, 2025
we're all sleeping on this OCR model 🔥 dots.ocr is a new 3B model with sota performance, support for 100 languages & allowing commercial use! 🤯 single e2e model to extract image, convert tables, formula, and more into markdown 📝
69K
vLLM
@vllm_project
Oct 22, 2025
it’s tokenization again! 🤯 did you know tokenize(detokenize(token_ids)) ≠ token_ids? RL researchers from Agent Lightning coined the term Retokenization Drift — a subtle mismatch between what your model generated and what your trainer thinks it generated. why? because most
170K
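A minimal sketch of why tokenize(detokenize(token_ids)) can differ from token_ids, as the post above points out. The three-entry vocabulary and the greedy longest-match tokenizer are hypothetical stand-ins, not any real model's tokenizer. The point is only that several token sequences can decode to the same string, while re-encoding picks just one of them.

```python
# Hypothetical toy vocabulary: "ab" exists as a single token alongside "a" and "b".
VOCAB = {0: "a", 1: "b", 2: "ab"}
TOKEN_TO_ID = {text: token_id for token_id, text in VOCAB.items()}
MAX_TOKEN_LEN = max(len(text) for text in TOKEN_TO_ID)


def detokenize(token_ids):
    """Concatenate the text of each token."""
    return "".join(VOCAB[t] for t in token_ids)


def tokenize(text):
    """Greedy longest-match tokenization over the toy vocabulary."""
    ids = []
    i = 0
    while i < len(text):
        for length in range(min(MAX_TOKEN_LEN, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in TOKEN_TO_ID:
                ids.append(TOKEN_TO_ID[piece])
                i += length
                break
        else:
            raise ValueError(f"cannot tokenize {text[i:]!r}")
    return ids


generated = [0, 1]               # the model sampled "a", then "b"
text = detokenize(generated)     # both tokens collapse into the string "ab"
retokenized = tokenize(text)     # the tokenizer prefers the single token "ab"

print(generated, retokenized, generated == retokenized)
```

Here the sampled sequence has two tokens but the re-encoded one has a single token, so anything computed per token on the re-encoded sequence would no longer line up with what was actually sampled.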
vLLM
@vllm_project
Jan 27, 2025
🚀 With the v0.7.0 release today, we are excited to announce the alpha release of vLLM V1: A major architectural upgrade with 1.7x speedup! Clean code, optimized execution loop, zero-overhead prefix caching, enhanced multimodal support, and more.
95K
vLLM
@vllm_project
Sep 9, 2025
The amazing blog post from @gordic_aleksa is now live on vLLM's blog at blog.vllm.ai/2025/09/05/ana… (after more proofreading and clarifications)! Looking forward to a future series of technical deep-dive blog posts😍
Aleksa Gordić (水平问题)
@gordic_aleksa
Sep 1, 2025
New in-depth blog post - "Inside vLLM: Anatomy of a High-Throughput LLM Inference System". Probably the most in depth explanation of how LLM inference engines and vLLM in particular work! Took me a while to get this level of understanding of the codebase and then to write up
47K