vLLM
1,091 posts
vLLM
@vllm_project
A high-throughput and memory-efficient inference and serving engine for LLMs. Join slack.vllm.ai to discuss together with the community!
vllm.ai
Joined March 2024
36 Following
43.1K Followers

  • vLLM
    @vllm_project
    Oct 20, 2025
    🚀 DeepSeek-OCR — the new frontier of OCR from @deepseek_ai , exploring optical context compression for LLMs, is running blazingly fast on vLLM ⚡ (~2500 tokens/s on A100-40G) — powered by vllm==0.8.5 for day-0 model support. 🧠 Compresses visual contexts up to 20× while keeping
    2M
  • vLLM
    @vllm_project
    Apr 14, 2025
    🙏 @deepseek_ai's highly performant inference engine is built on top of vLLM. Now they are open-sourcing the engine the right way: instead of a separate repo, they are bringing changes to the open source community so everyone can immediately benefit!
    open-infra-index/OpenSourcing_DeepSeek_Inference_Engine/README.md at main · deepseek-ai/open-infr...
    From github.com
    203K
  • vLLM
    @vllm_project
    Nov 3, 2025
    Wow excited to see PewDiePie using vLLM to serve language models locally 😃 vLLM brings easy, fast, and cheap LLM serving for everyone 🥰
    Yuchen Jin
    @Yuchenj_UW
    Oct 31, 2025
    PewDiePie in 2025: – built a 10×4090 rig – runs Llama 70B, gpt-oss-120B & Qwen 245B locally via vLLM – built a custom web UI (chat, RAG, search, TTS) – ran protein-folding simulations for charity – created an AI “council”, a swarm of 64 models – now fine-tuning his own model
    164K
  • vLLM
    @vllm_project
    Sep 18, 2025
    Congrats to @deepseek_ai! DeepSeek-R1 was published in Nature yesterday as the cover article, and vLLM is proud to have supported its RL training and inference 🥰
    213K
  • vLLM
    @vllm_project
    Aug 17, 2025
    🚀 Amazing community project! vLLM CLI — a command-line tool for serving LLMs with vLLM: ✅ Interactive menu-driven UI & scripting-friendly CLI ✅ Local + HuggingFace Hub model management ✅ Config profiles for perf/memory tuning ✅ Real-time server & GPU monitoring ✅ Error
    71.2K
  • vLLM
    @vllm_project
    Feb 21, 2025
    We're excited to receive our first #NVIDIADGX B200 system which we'll use for vLLM research and development! Thank you @nvidia!
    119K
  • vLLM
    @vllm_project
    Oct 16, 2025
    Announcing the completely reimagined vLLM TPU! In collaboration with @Google, we've launched a new high-performance TPU backend unifying @PyTorch and JAX under a single lowering path for amazing performance and flexibility. 🚀 What's New? - JAX + PyTorch: Run PyTorch models on
    157K
  • vLLM
    @vllm_project
    Apr 17, 2025
    vLLM🤝🤗! You can now deploy any @huggingface language model with vLLM's speed. This integration makes it possible for one consistent implementation of the model in HF for both training and inference. 🧵
    Transformers modeling backend integration in vLLM
    From vllm.ai
    72.6K
  • vLLM
    @vllm_project
    Feb 1, 2025
    We landed the first batch of enhancements to the @deepseek_ai models, starting with MLA and CUTLASS FP8 kernels. Compared to v0.7.0, we offer ~3x the generation throughput, ~10x the memory capacity for tokens, and horizontal context scalability with pipeline parallelism.
    90.4K
  • vLLM
    @vllm_project
    Sep 29, 2025
    How does @deepseek_ai Sparse Attention (DSA) work? It has 2 components: the Lightning Indexer and Sparse Multi-head Latent Attention (MLA). The indexer keeps a small key cache of 128 per token (vs. 512 for MLA). It scores incoming queries, and the top-2048 tokens are passed to Sparse MLA.
    103K
  • vLLM
    @vllm_project
    Sep 28, 2025
    🚀 New in vLLM: dots.ocr 🔥 A powerful multilingual OCR model from @xiaohongshu hi lab is now officially supported in vLLM! 📝 Single end-to-end parser for text, tables (HTML), formulas (LaTeX), and layouts (Markdown) 🌍 Supports 100 languages with robust performance on
    merve
    @mervenoyann
    Aug 5, 2025
    we're all sleeping on this OCR model 🔥 dots.ocr is a new 3B model with sota performance, support for 100 languages & allowing commercial use! 🤯 single e2e model to extract image, convert tables, formula, and more into markdown 📝
    69.5K
  • vLLM
    @vllm_project
    Oct 22, 2025
    it’s tokenization again! 🤯 did you know tokenize(detokenize(token_ids)) ≠ token_ids? RL researchers from Agent Lightning coined the term Retokenization Drift — a subtle mismatch between what your model generated and what your trainer thinks it generated. why? because most
    170.4K
  • vLLM
    @vllm_project
    Jan 27, 2025
    🚀 With the v0.7.0 release today, we are excited to announce the alpha release of vLLM V1: A major architectural upgrade with 1.7x speedup! Clean code, optimized execution loop, zero-overhead prefix caching, enhanced multimodal support, and more.
    95.3K
  • vLLM
    @vllm_project
    Sep 9, 2025
    The amazing blog post from @gordic_aleksa is now live on the vLLM blog at blog.vllm.ai/2025/09/05/ana… (after more proofreading and clarifications)! Looking forward to a future series of technical deep-dive blog posts 😍
    Aleksa Gordić (水平问题)
    @gordic_aleksa
    Sep 1, 2025
    New in-depth blog post - "Inside vLLM: Anatomy of a High-Throughput LLM Inference System". Probably the most in depth explanation of how LLM inference engines and vLLM in particular work! Took me a while to get this level of understanding of the codebase and then to write up
    47.1K
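
A note on the Oct 22, 2025 post above, which states that tokenize(detokenize(token_ids)) ≠ token_ids: the post is cut off before it gives its own explanation, so the following is only a minimal sketch of one general way such a mismatch can arise, not necessarily the cause the post describes. It assumes a hypothetical three-token vocabulary and a greedy longest-match tokenizer; `VOCAB`, `tokenize`, and `detokenize` are illustrative names, not vLLM or Agent Lightning APIs.

```python
# Hypothetical toy vocabulary (an assumption for illustration only).
# The string "ab" can be spelled either as [0, 1] or as the single token [2].
VOCAB = {0: "a", 1: "b", 2: "ab"}
TOKEN_TO_ID = {text: token_id for token_id, text in VOCAB.items()}
MAX_TOKEN_LEN = max(len(text) for text in TOKEN_TO_ID)


def detokenize(token_ids):
    """Concatenate the text of each token id."""
    return "".join(VOCAB[token_id] for token_id in token_ids)


def tokenize(text):
    """Greedy longest-match tokenization over the toy vocabulary."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest candidate piece first, then shorter ones.
        for length in range(min(MAX_TOKEN_LEN, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in TOKEN_TO_ID:
                ids.append(TOKEN_TO_ID[piece])
                i += length
                break
        else:
            raise ValueError(f"cannot tokenize {text[i:]!r}")
    return ids


# Suppose a model sampled "a" and "b" as two separate tokens.
generated = [0, 1]
round_tripped = tokenize(detokenize(generated))

print(generated, round_tripped)
```

In this toy setup the text survives the round trip but the token ids do not: the tokenizer merges the two sampled tokens into one. A trainer that re-tokenizes the decoded text would therefore see a different id sequence from the one that was sampled.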