[ASR] Optimize CPU preproc to get 2.5x RTFx via multi-threading by ekagra-ranjan · Pull Request #44612 · vllm-project/vllm

ekagra-ranjan · 2026-06-05T04:12:51Z

Currently, vLLM ASR's CPU preprocessing (audio loading and RMS chunking) is synchronous code that blocks the main event loop of the FastAPI server, so requests are processed serially instead of being batched. The bottleneck becomes much more painful when the system is processing long audio. Under high concurrency, it also causes timeouts on the health/ endpoint, which makes it hard to identify faulty nodes and makes autoscaling aggressive.

This PR offloads the synchronous CPU preprocessing to a ThreadPool executor. That frees the main asyncio event loop to keep running the server while background threads load the audio arrays and do the RMS chunk split.
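A minimal sketch of the offloading pattern, not the code in this PR: `load_and_chunk` and `heartbeat` are hypothetical stand-ins for the blocking audio load + RMS chunking and for a health check, and the sleep durations are arbitrary. It only illustrates that the event loop keeps getting scheduled while worker threads do the blocking work.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def load_and_chunk(audio_id: int) -> str:
    """Hypothetical stand-in for the blocking audio load + RMS chunking."""
    time.sleep(0.05)  # simulates blocking work
    return f"chunks-{audio_id}"


async def preprocess(executor: ThreadPoolExecutor, audio_id: int) -> str:
    loop = asyncio.get_running_loop()
    # Run the blocking function on a worker thread; the event loop stays free.
    return await loop.run_in_executor(executor, load_and_chunk, audio_id)


async def heartbeat(ticks: list, stop: asyncio.Event) -> None:
    """Hypothetical stand-in for a health check: records each time the loop runs it."""
    while not stop.is_set():
        ticks.append(time.perf_counter())
        await asyncio.sleep(0.01)


async def main():
    ticks = []
    stop = asyncio.Event()
    hb = asyncio.create_task(heartbeat(ticks, stop))
    with ThreadPoolExecutor(max_workers=2) as executor:
        results = await asyncio.gather(
            *(preprocess(executor, i) for i in range(4))
        )
    stop.set()
    await hb
    return results, len(ticks)


results, n_ticks = asyncio.run(main())
print(results, n_ticks)
```

If `load_and_chunk` were called directly inside the coroutine instead, the heartbeat would not run until all preprocessing had finished, which is the behaviour described above for the health/ endpoint.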

This optimization gives 2.5x higher RTFx throughput for openai/whisper-large-v3-turbo.
It also unblocks the health/ endpoint, whose latency goes from 25s -> 0.06ms.

Adds VLLM_MAX_AUDIO_PREPROCESS_WORKERS as an env variable.

Cmd for running benchmark (after #44587 is merged)

MODEL_ID=openai/whisper-large-v3-turbo
vllm serve $MODEL_ID --trust-remote-code

Long audio

MAX_CONCURRENCY=128
NUM_PROMPTS=128

time vllm bench serve \
  --backend openai-audio \
  --endpoint /v1/audio/transcriptions \
  --dataset-name hf \
  --dataset-path ArtificialAnalysis/Earnings22-Cleaned-AA \
  --hf-split test \
  --num-prompts ${NUM_PROMPTS} \
  --ready-check-timeout-sec 600 \
  --save-result \
  --save-detailed \
  --result-filename asr-bench.json \
  --max-concurrency ${MAX_CONCURRENCY}

On 1xH100:

w/o this PR -> RTFx: 735, Benchmark duration 201.75
this PR
- 2 threads RTFx: 1890, Benchmark duration 78.45
- 1 thread RTFx: 1335.05
- 4 threads RTFx: 1739.89
- 8 threads RTFx: 1272.91
- 16 threads RTFx: 1292.97
~2.5x RTFx (1890 vs 735) with 2 threads
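The speedup per thread count can be recomputed from the long-audio figures above. This is only arithmetic on the reported numbers; the variable names are illustrative.

```python
# Reported long-audio results on 1xH100 (from the numbers above).
baseline_rtfx = 735.0
rtfx_by_threads = {1: 1335.05, 2: 1890.0, 4: 1739.89, 8: 1272.91, 16: 1292.97}

# Speedup of each thread count relative to the run without this PR.
speedups = {n: round(v / baseline_rtfx, 2) for n, v in rtfx_by_threads.items()}

# Thread count with the highest RTFx.
best_threads = max(rtfx_by_threads, key=rtfx_by_threads.get)

# Benchmark duration ratio for the 2-thread run (201.75 vs 78.45).
duration_ratio = round(201.75 / 78.45, 2)

print(speedups, best_threads, duration_ratio)
```

The RTFx ratio and the benchmark duration ratio for the 2-thread run agree, which is what one would expect if both runs transcribe the same audio.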

Short audio (>80% of samples are <10s)

MAX_CONCURRENCY=500
NUM_PROMPTS=128

time vllm bench serve \
  --backend openai-audio \
  --endpoint /v1/audio/transcriptions \
  --dataset-name hf \
  --dataset-path D4nt3/esb-datasets-earnings22-validation-tiny-filtered \
  --hf-split validation \
  --num-prompts ${NUM_PROMPTS} \
  --ready-check-timeout-sec 600 \
  --save-result \
  --save-detailed \
  --result-filename asr-bench.json \
  --max-concurrency ${MAX_CONCURRENCY}

On 1xH100:

w/o this PR -> RTFx: 600.885
this PR -> RTFx: 591.7725
1.5% drop

However, if we start the vLLM server with VLLM_MAX_AUDIO_PREPROCESS_WORKERS=1, we get 1.7% higher RTFx.

Test

test_long_audio_wer_correctness, introduced in #44587, passes and will gate this code path on CI. The test covers long audio + BS>1 + multi-threading + RMS chunking, and it is fast because the dataset is just 37MB, which makes it a good fit for CI.

Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>

claude

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

ekagra-ranjan · 2026-06-05T04:28:15Z



+def _get_stt_preprocess_max_workers() -> int:
+    default_workers = max(1, min(os.cpu_count() or 1, 2))


max threads = 2 was found to be the optimal number of threads on this dataset after tuning over [1, 2, 4, 8, 16]. The audio libs themselves would be using multi-threading, so we don't need a very high thread count here.

Even though we chose 2 threads here, we observe a 2.5x speedup, which is more than 2x. This is because earlier only the event loop was doing the work, which is just 1 thread, whereas now the system has 1 event loop thread + 2 new threads, so 3 threads in total.

2 threads RTFx: 1890 , Benchmark duration 78.45
1 thread RTFx: 1335.05
4 thread RTFx: 1739.89
8 thread RTFx: 1272.91
16 thread RTFx: 1292.97
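A sketch of how the default above could combine with the env variable. Only the `default_workers` expression and the variable name `VLLM_MAX_AUDIO_PREPROCESS_WORKERS` come from this PR; the function name, the `environ` parameter, and the parsing/clamping of the override are assumptions for illustration.

```python
import os

# Name taken from the PR description.
ENV_VAR = "VLLM_MAX_AUDIO_PREPROCESS_WORKERS"


def get_preprocess_max_workers(environ=os.environ) -> int:
    """Hypothetical helper: env override if set, else the capped default."""
    # Default from the diff: at most 2 threads, at least 1.
    default_workers = max(1, min(os.cpu_count() or 1, 2))
    raw = environ.get(ENV_VAR)
    if raw is None:
        return default_workers
    # Assumption: the override is a positive integer; clamp to at least 1.
    return max(1, int(raw))


print(get_preprocess_max_workers({}))
print(get_preprocess_max_workers({ENV_VAR: "4"}))
```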

NickLucche

Thanks for looking into this optimization @ekagra-ranjan !
I think this approach can work, although it would be nice to align methodology with how the rest of the items are preprocessed with Renderer @DarkLight1337

Also I wonder, did you measure any overhead for short audios?

NickLucche · 2026-06-05T12:57:24Z

        return cast(type[SupportsTranscription], model_cls)

+    def shutdown(self) -> None:
+        if (executor := getattr(self, "_preprocess_executor", None)) is not None:


When is the self._preprocess_executor attr not present?

I was being overly defensive here, in case __init__() fails for some reason and shutdown() still needs to run later?

mm yeah I think we could avoid getattr if possible

sure, done!
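For reference, a minimal sketch of the shape discussed in this thread: the executor is created unconditionally in `__init__`, so `shutdown` can access the attribute directly. The class name, the constructor argument, and the `shutdown(wait=True)` choice are assumptions, not the merged code.

```python
from concurrent.futures import ThreadPoolExecutor


class SpeechToTextServing:  # hypothetical class name
    def __init__(self, max_workers: int = 2) -> None:
        # Always created here, so the attribute exists on every instance.
        self._preprocess_executor = ThreadPoolExecutor(
            max_workers=max_workers,
            thread_name_prefix="stt-preprocess",
        )

    def shutdown(self) -> None:
        # No getattr needed; wait=True is an assumption for this sketch.
        self._preprocess_executor.shutdown(wait=True)


serving = SpeechToTextServing()
print(serving._preprocess_executor.submit(pow, 2, 3).result())
serving.shutdown()
```

After `shutdown()`, submitting new work to the executor raises `RuntimeError`, so callers need to stop sending requests first.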

ekagra-ranjan · 2026-06-05T18:31:15Z

Thanks @NickLucche for the reviews! All of them make sense and I have updated the PR.

> although it would be nice to align methodology with how the rest of the items are preprocessed with Renderer

I'll have a look at the Renderer's methodology and report back.

> Also I wonder, did you measure any overhead for short audios?

Just checked on D4nt3/esb-datasets-earnings22-validation-tiny-filtered, where >80% of samples are <10s: there is a 1.5% drop in RTFx (600.885 vs 591.7725) with the default of 2 threads. However, if we start the vLLM server with VLLM_STT_PREPROCESS_MAX_WORKERS=1, we get 1.7% higher RTFx.

ekagra-ranjan · 2026-06-05T19:34:01Z

> although it would be nice to align methodology with how the rest of the items are preprocessed with Renderer

@NickLucche @DarkLight1337 - regarding the comment about aligning with Renderer:

Is it specific to how input preparation is done in STT, and NOT related to the changes in this PR? This PR only moves the audio load + RMS chunking from the sync path to a thread pool and doesn't touch prompt/token prep. The audio load + chunk logic is not reused by other endpoints, so I don't think it should go into the renderer.
If so, should the alignment of the STT input prep be a separate PR, since the misalignment exists even w/o this PR?

mergify · 2026-06-09T17:09:27Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ekagra-ranjan.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

ekagra-ranjan · 2026-06-09T18:25:02Z

Sharing the offline discussion on whether to reuse the renderer worker pool here, for posterity (will link this comment from a comment in the code).

I tried to use the thread pool from Renderer but got ~15% lower throughput than with a separate pool on the frontend:

separate pool on frontend -> 1890 RTFx
reusing renderer pool
  * --renderer-num-workers 2: 1408 RTFx
  * --renderer-num-workers 3: 1648 RTFx
  * --renderer-num-workers 4: 1638 RTFx
  * --renderer-num-workers 5: 1633 RTFx
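The gap between the shared renderer pool and the separate frontend pool can be recomputed from the figures above. This is only arithmetic on the reported RTFx values; the variable names are illustrative.

```python
# Reported RTFx values from the comparison above.
separate_pool_rtfx = 1890.0
shared_pool_rtfx = {2: 1408.0, 3: 1648.0, 4: 1638.0, 5: 1633.0}

# Percentage drop versus the separate frontend pool, per --renderer-num-workers.
drop_pct = {
    workers: round((1 - rtfx / separate_pool_rtfx) * 100, 1)
    for workers, rtfx in shared_pool_rtfx.items()
}

print(drop_pct)
```

With 3 to 5 renderer workers the drop is in the 13-14% range, and with 2 workers it is noticeably larger.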

I think this is because renderer-num-workers was added in an earlier PR mainly to unblock the asyncio main event loop and improve server responsiveness, while giving some improvement for vision tasks. Parallelizing renderer work doesn't help much, but it can block ASR frontend work if both share the same pool. That PR mentions this: "The key improvement comes from offloading preprocessing off the event loop (so /health stays responsive), not from parallelizing it. Default of workers=1 is sufficient."

For ASR, splitting the thread allocation unequally between frontend and renderer work is 15% better than merging them into 1 queue.

The last commit of my PR has the changes to reuse the renderer thread pool in the frontend (just in case you want to see it).

Given this, should we keep a separate thread pool in the frontend for ASR, since it allows splitting thread resources unequally, which appears to be beneficial for ASR workloads?

DarkLight1337

Thanks for your patience!

mergify · 2026-06-11T15:23:31Z

Hi @ekagra-ranjan, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

NickLucche · 2026-06-11T15:54:51Z

        return cast(type[SupportsTranscription], model_cls)

+    def shutdown(self) -> None:
+        if (executor := getattr(self, "_preprocess_executor", None)) is not None:


mm yeah I think we could avoid getattr if possible

multi threading cpu perf optimization ASR

f719328

Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>

mergify Bot added the frontend label Jun 5, 2026

ekagra-ranjan changed the title ~~[ASR][Perf] Optimize CPU preproc to get 2.5x RTFx via multi-threading~~ Jun 5, 2026

clean

23f2908

Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>

ekagra-ranjan marked this pull request as ready for review June 5, 2026 04:17

ekagra-ranjan requested review from NickLucche and njhill as code owners June 5, 2026 04:17

claude Bot reviewed Jun 5, 2026

ekagra-ranjan commented Jun 5, 2026

NickLucche reviewed Jun 5, 2026

ekagra-ranjan added 2 commits June 5, 2026 18:02

remove pedantic code

1c3f44b

Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>

add comment

e18211f

Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>

ekagra-ranjan requested a review from NickLucche June 5, 2026 19:34

try reusing rendered worker

c572b15

Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>

ekagra-ranjan requested a review from DarkLight1337 as a code owner June 8, 2026 18:06

DarkLight1337 approved these changes Jun 9, 2026

mergify Bot added the needs-rebase label Jun 9, 2026

mergify Bot removed the needs-rebase label Jun 9, 2026

ekagra-ranjan force-pushed the er-asr-multi-thread branch from d80ef14 to 4861687 Compare June 9, 2026 21:05

ekagra-ranjan requested review from AndreasKaratzas, dllehr-amd, luccafong, noooop, patrickvonplaten, tjtanaa and zyongye as code owners June 9, 2026 21:05

github-project-automation Bot added this to gpt-oss Issues & Enhancements Jun 9, 2026

mergify Bot added the rocm Related to AMD ROCm label Jun 9, 2026

ekagra-ranjan added 5 commits June 9, 2026 21:24

Revert "try reusing rendered worker"

76dcc30

This reverts commit c572b15. Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>

add comment about threadpool

607df52

Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>

fix conflict

4b42339

Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>

move to envs.py

8b1fd3e

Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>

clean

8e1d1a6

Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>

DarkLight1337 reviewed Jun 10, 2026

Comment thread vllm/entrypoints/speech_to_text/base/serving.py Outdated

ekagra-ranjan added 3 commits June 10, 2026 23:39

use make_async

300d63a

Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>

add make_async_with_semaphore

1e7f032

Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>

add make_async_with_semaphore

af95c57

Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>

DarkLight1337 reviewed Jun 11, 2026

Comment thread vllm/entrypoints/speech_to_text/base/serving.py Outdated

DarkLight1337 reviewed Jun 11, 2026

Comment thread vllm/entrypoints/speech_to_text/base/serving.py Outdated

remove log

af993ac

Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>

DarkLight1337 approved these changes Jun 11, 2026

Merge branch 'main' into er-asr-multi-thread

76cb193

NickLucche approved these changes Jun 11, 2026

ekagra-ranjan added 2 commits June 11, 2026 16:21

avoid getaattr

0fc2adc

Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>

Merge branch 'main' into er-asr-multi-thread

3cae93c

vllm-bot merged 17 commits into vllm-project:main from ekagra-ranjan:er-asr-multi-thread
