
[ASR] Optimize CPU preproc to get 2.5x RTFx via multi-threading#44612

Merged
vllm-bot merged 17 commits into
vllm-project:mainfrom
ekagra-ranjan:er-asr-multi-thread
Jun 12, 2026

Conversation

@ekagra-ranjan

@ekagra-ranjan ekagra-ranjan commented Jun 5, 2026

Contributor

Currently, vLLM ASR's CPU preprocessing, such as audio loading and RMS chunking, is synchronous code that blocks the main event loop of the FastAPI server, which leads to serial execution instead of batched processing. This bottleneck becomes much more painful when the system is processing long audio. Under high concurrency, it also leads to timeouts on the health/ endpoint, which makes it difficult to identify faulty nodes and makes autoscaling aggressive.

This PR unblocks the synchronous CPU preprocessing by offloading it to a ThreadPool executor, which frees the main asyncio event loop to keep running the server while background threads load the audio arrays and do the RMS chunk split.

  • This optimization leads to 2.5x higher RTFx throughput for openai/whisper-large-v3-turbo
  • It also unblocks the health/ endpoint and we see the latency go from 25s -> 0.06ms

Adds VLLM_MAX_AUDIO_PREPROCESS_WORKERS as an env variable.
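
As a rough illustration of the offloading pattern described above (not the code in this PR), here is a minimal sketch. `preprocess_audio`, `handle_request`, and `heartbeat` are hypothetical stand-ins: the first simulates the blocking audio load + RMS chunking with a sleep, and the heartbeat plays the role of a health check that should keep ticking while preprocessing runs. The env variable name and the default of at most 2 workers are taken from this PR; how vLLM actually parses the variable is an assumption.

```python
import asyncio
import os
import time
from concurrent.futures import ThreadPoolExecutor


def get_max_workers() -> int:
    # Default mirrors the PR snippet: at most 2 workers, at least 1.
    default_workers = max(1, min(os.cpu_count() or 1, 2))
    # Assumption: the env variable is read as a plain integer override.
    raw = os.environ.get("VLLM_MAX_AUDIO_PREPROCESS_WORKERS")
    return max(1, int(raw)) if raw else default_workers


def preprocess_audio(request_id: int) -> int:
    # Hypothetical stand-in for blocking audio load + RMS chunking.
    time.sleep(0.05)
    return request_id


async def handle_request(executor: ThreadPoolExecutor, request_id: int) -> int:
    # Run the blocking function on a worker thread; the event loop stays free.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(executor, preprocess_audio, request_id)


async def heartbeat(ticks: list[float], stop: asyncio.Event) -> None:
    # Stand-in for a health check: records a tick whenever the loop is free.
    while not stop.is_set():
        ticks.append(time.perf_counter())
        await asyncio.sleep(0.01)


async def main() -> tuple[list[int], int]:
    executor = ThreadPoolExecutor(
        max_workers=get_max_workers(), thread_name_prefix="stt-preproc"
    )
    ticks: list[float] = []
    stop = asyncio.Event()
    hb = asyncio.create_task(heartbeat(ticks, stop))
    try:
        results = await asyncio.gather(
            *(handle_request(executor, i) for i in range(4))
        )
    finally:
        stop.set()
        await hb
        executor.shutdown(wait=True)
    return list(results), len(ticks)


if __name__ == "__main__":
    results, n_ticks = asyncio.run(main())
    print(results, n_ticks)
```

If `preprocess_audio` were called directly inside the coroutine instead, the heartbeat would not tick until all four requests had finished, which is the toy version of the health/ timeouts described above.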

Cmd for running benchmark (after #44587 is merged)

MODEL_ID=openai/whisper-large-v3-turbo
vllm serve $MODEL_ID --trust-remote-code

Long audio

MAX_CONCURRENCY=128
NUM_PROMPTS=128

time vllm bench serve \
  --backend openai-audio \
  --endpoint /v1/audio/transcriptions \
  --dataset-name hf \
  --dataset-path ArtificialAnalysis/Earnings22-Cleaned-AA \
  --hf-split test \
  --num-prompts ${NUM_PROMPTS} \
  --ready-check-timeout-sec 600 \
  --save-result \
  --save-detailed \
  --result-filename asr-bench.json \
  --max-concurrency ${MAX_CONCURRENCY}

On 1xH100:

  • w/o this PR -> RTFx: 735, Benchmark duration 201.75
  • this PR
    • 2 threads RTFx: 1890, Benchmark duration 78.45
    • 1 thread RTFx: 1335.05
    • 4 threads RTFx: 1739.89
    • 8 threads RTFx: 1272.91
    • 16 threads RTFx: 1292.97
  • ~2.5x RTFx (1890 vs 735) with 2 threads
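
For reference, the speedup arithmetic behind the long-audio numbers above can be reproduced with a few lines. The RTFx values and benchmark durations are copied from this description; the helper names are just for illustration.

```python
# RTFx numbers reported above for long audio on 1xH100.
BASELINE_RTFX = 735.0
RTFX_BY_THREADS = {1: 1335.05, 2: 1890.0, 4: 1739.89, 8: 1272.91, 16: 1292.97}


def speedups(baseline: float, results: dict[int, float]) -> dict[int, float]:
    # Ratio of each configuration's RTFx to the baseline without this PR.
    return {threads: rtfx / baseline for threads, rtfx in results.items()}


def best_thread_count(results: dict[int, float]) -> int:
    # Thread count with the highest reported RTFx.
    return max(results, key=results.get)


if __name__ == "__main__":
    for threads, ratio in sorted(speedups(BASELINE_RTFX, RTFX_BY_THREADS).items()):
        print(f"{threads:>2} thread(s): {ratio:.2f}x")
    # The benchmark durations give the same ratio for the 2-thread run.
    print(f"duration ratio: {201.75 / 78.45:.2f}x")
```

The RTFx ratio for 2 threads (1890 / 735) and the duration ratio (201.75 / 78.45) both come out at about 2.57x, so the two measurements agree.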

Short audio (>80% samples are <10s)

MAX_CONCURRENCY=500
NUM_PROMPTS=128

time vllm bench serve \
  --backend openai-audio \
  --endpoint /v1/audio/transcriptions \
  --dataset-name hf \
  --dataset-path D4nt3/esb-datasets-earnings22-validation-tiny-filtered \
  --hf-split validation \
  --num-prompts ${NUM_PROMPTS} \
  --ready-check-timeout-sec 600 \
  --save-result \
  --save-detailed \
  --result-filename asr-bench.json \
  --max-concurrency ${MAX_CONCURRENCY}

On 1xH100:

  • w/o this PR -> RTFx: 600.885
  • this PR -> RTFx: 591.7725
  • 1.5% drop

However, if we start the vLLM server with VLLM_MAX_AUDIO_PREPROCESS_WORKERS=1, we get 1.7% higher RTFx.

Test

test_long_audio_wer_correctness, introduced in #44587, passes and will gate this codepath on CI. That test covers long audio + BS>1 + multi-threading + RMS chunking, and it is fast since the dataset is just 37MB, so it is a good fit for CI.

Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
@mergify mergify Bot added the frontend label Jun 5, 2026
@ekagra-ranjan ekagra-ranjan changed the title [ASR][Perf] Optimize CPU preproc to get 2.5x RTFx via multi-threading Jun 5, 2026
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
@ekagra-ranjan ekagra-ranjan marked this pull request as ready for review June 5, 2026 04:17

@claude claude Bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.



def _get_stt_preprocess_max_workers() -> int:
    default_workers = max(1, min(os.cpu_count() or 1, 2))

Contributor Author


Max threads = 2 was found to be the optimal number of threads on this dataset after tuning over [1, 2, 4, 8, 16]. The audio libs themselves would be using multi-threading, so we don't need a very high thread count here.

Even though we chose 2 threads here, we observe 2.5x, which is more than 2x. This is because earlier only the event loop was doing the work, which is just 1 thread, but now the system has 1 event loop thread + 2 new threads, so 3 threads in total.

Contributor Author


2 threads RTFx: 1890 , Benchmark duration 78.45
1 thread RTFx: 1335.05
4 thread RTFx: 1739.89
8 thread RTFx: 1272.91
16 thread RTFx: 1292.97

@NickLucche NickLucche left a comment

Member


Thanks for looking into this optimization @ekagra-ranjan !
I think this approach can work, although it would be nice to align methodology with how the rest of the items are preprocessed with Renderer @DarkLight1337

Also I wonder, did you measure any overhead for short audios?

Comment thread vllm/entrypoints/speech_to_text/base/serving.py Outdated
        return cast(type[SupportsTranscription], model_cls)

    def shutdown(self) -> None:
        if (executor := getattr(self, "_preprocess_executor", None)) is not None:

Member


when is the self._preprocess_executor attr not present?

@ekagra-ranjan ekagra-ranjan Jun 5, 2026

Contributor Author


I was being too conservative in handling it defensively, e.g., what if __init__() fails for some reason and shutdown still needs to happen later?

Member


mm yeah I think we could avoid getattr if possible

Contributor Author


sure, done!
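
For readers following this thread, here is a minimal sketch of what a shutdown without getattr could look like. It assumes the executor is always created in __init__, so the attribute exists whenever shutdown is reachable. `ToySpeechToTextServing` is a hypothetical class for illustration, not the class in serving.py, and the shutdown arguments are an assumption rather than what the PR uses.

```python
from concurrent.futures import ThreadPoolExecutor


class ToySpeechToTextServing:
    """Hypothetical stand-in for the serving class discussed in this thread."""

    def __init__(self, max_workers: int = 2) -> None:
        # Created unconditionally, so the attribute always exists after __init__.
        self._preprocess_executor = ThreadPoolExecutor(
            max_workers=max_workers, thread_name_prefix="stt-preproc"
        )

    def shutdown(self) -> None:
        # No getattr needed: the executor is guaranteed by __init__.
        # Assumption: do not wait for in-flight work and drop queued tasks.
        self._preprocess_executor.shutdown(wait=False, cancel_futures=True)


if __name__ == "__main__":
    serving = ToySpeechToTextServing()
    print(serving._preprocess_executor.submit(sum, [1, 2, 3]).result())
    serving.shutdown()
```

If __init__ raised before the executor was assigned, no instance would be returned to the caller, so there would be nothing to call shutdown on in the usual construction path.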

Comment thread vllm/entrypoints/speech_to_text/base/serving.py Outdated
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
@ekagra-ranjan

ekagra-ranjan commented Jun 5, 2026

Contributor Author

Thanks @NickLucche for the reviews! All of them make sense and I have updated the PR.

> although it would be nice to align methodology with how the rest of the items are preprocessed with Renderer

I'll have a look at the Renderer's methodology and report back.

> Also I wonder, did you measure any overhead for short audios?

Just checked on D4nt3/esb-datasets-earnings22-validation-tiny-filtered, where >80% of samples are <10s: there is a 1.5% drop in RTFx (600.885 vs 591.7725) with the default of 2 threads. However, if we start the vLLM server with VLLM_STT_PREPROCESS_MAX_WORKERS=1, we get 1.7% higher RTFx.

@ekagra-ranjan

ekagra-ranjan commented Jun 5, 2026

Contributor Author

> although it would be nice to align methodology with how the rest of the items are preprocessed with Renderer

@NickLucche @DarkLight1337 - regarding the aligning to Renderer comment:

  1. Is it specific to how input preparation is done in STT, and NOT related to the changes in this PR? This PR only moves the audio load + RMS chunking from the sync path to a thread pool and doesn't touch prompt/token prep. The audio load + chunk part is not reused by other endpoints, so I don't think that logic should go into the Renderer.
  2. If so, should the alignment of the STT input prep be a separate PR, since the misalignment exists even w/o this PR?
@ekagra-ranjan ekagra-ranjan requested a review from NickLucche June 5, 2026 19:34
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
@mergify

mergify Bot commented Jun 9, 2026

Contributor

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ekagra-ranjan.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Jun 9, 2026
@ekagra-ranjan

Contributor Author

Sharing the offline discussion on whether or not to reuse the renderer worker pool here, for posterity (will link this comment from a comment in the code).

I tried to use the thread pool from the Renderer but got ~15% lower throughput than with a separate pool on the frontend:

  • separate pool on frontend -> 1890 RTFx
  • reusing renderer pool
    • --renderer-num-workers 2: 1408 RTFx
    • --renderer-num-workers 3: 1648 RTFx
    • --renderer-num-workers 4: 1638 RTFx
    • --renderer-num-workers 5: 1633 RTFx

I think this is because --renderer-num-workers was added in an earlier PR mainly to unblock the asyncio main event loop and improve server responsiveness, while giving some improvement for vision tasks. Parallelizing renderer work doesn't help much, but it can block ASR frontend work if they share the same pool. That PR mentions this: "The key improvement comes from offloading preprocessing off the event loop (so /health stays responsive), not from parallelizing it. Default of workers=1 is sufficient."

For ASR, splitting the thread allocation unequally between frontend and renderer work is 15% better than merging them into 1 queue.

The last commit of my PR has the changes to reuse the renderer thread pool in the frontend (just in case you wanted to see it).

Given this, should we keep a separate thread pool in the frontend for ASR, since it allows splitting thread resources unequally, which appears to be beneficial for ASR workloads?
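
To make the shared-queue concern above concrete, here is a toy sketch of head-of-line blocking. It is only an illustration of the hypothesis, not a measurement of vLLM: `slow_renderer_task` and `asr_preprocess` are hypothetical stand-ins, and each pool has a single worker to make the effect obvious.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_renderer_task() -> None:
    # Hypothetical stand-in for renderer work occupying a worker thread.
    time.sleep(0.2)


def asr_preprocess() -> float:
    # Hypothetical stand-in for ASR preprocessing; returns its start time.
    return time.perf_counter()


def asr_latency(asr_pool: ThreadPoolExecutor, renderer_pool: ThreadPoolExecutor) -> float:
    # Queue renderer work first, then time how long the ASR task waits to run.
    renderer_pool.submit(slow_renderer_task)
    submitted = time.perf_counter()
    started = asr_pool.submit(asr_preprocess).result()
    return started - submitted


def compare() -> tuple[float, float]:
    # Shared pool: the ASR task queues behind the renderer task.
    with ThreadPoolExecutor(max_workers=1) as shared:
        shared_latency = asr_latency(shared, shared)
    # Separate pools: the ASR task gets its own worker right away.
    with ThreadPoolExecutor(max_workers=1) as asr_pool, \
            ThreadPoolExecutor(max_workers=1) as renderer_pool:
        separate_latency = asr_latency(asr_pool, renderer_pool)
    return shared_latency, separate_latency


if __name__ == "__main__":
    shared_latency, separate_latency = compare()
    print(f"shared: {shared_latency:.3f}s, separate: {separate_latency:.3f}s")
```

In the shared case the ASR task cannot start until the renderer task releases the only worker, whereas with separate pools it starts almost immediately. Real workloads have more workers and mixed task sizes, so the effect would be smaller than in this toy setup.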

@mergify mergify Bot added ci/build deepseek Related to DeepSeek models rust llama Related to Llama models multi-modality Related to multi-modality (#4194) mistral Related to Mistral models performance Performance-related issues qwen Related to Qwen models gpt-oss Related to GPT-OSS models nvidia labels Jun 9, 2026
@mergify mergify Bot added the rocm Related to AMD ROCm label Jun 9, 2026
This reverts commit c572b15.

Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
Comment thread vllm/entrypoints/speech_to_text/base/serving.py Outdated
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
Comment thread vllm/entrypoints/speech_to_text/base/serving.py Outdated
Comment thread vllm/entrypoints/speech_to_text/base/serving.py Outdated
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>

@DarkLight1337 DarkLight1337 left a comment

Member


Thanks for your patience!

@mergify

mergify Bot commented Jun 11, 2026

Contributor

Hi @ekagra-ranjan, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>

Labels

frontend multi-modality Related to multi-modality (#4194) performance Performance-related issues ready ONLY add when PR is ready to merge/full CI is needed verified Run pre-commit for new contributors without triggering other tests

5 participants