[Security] Fix DoS via audio decompression bomb in speech-to-text endpoint #44970

Merged
Isotr0py merged 2 commits into vllm-project:main from jperezdealgaba:fix/audio-decompression-bomb-dos
Jun 9, 2026

@jperezdealgaba jperezdealgaba commented Jun 9, 2026

Contributor

Purpose

Fix a denial-of-service vulnerability where the /v1/audio/transcriptions endpoint limits compressed upload size (default 25MB) but not decoded PCM output. A 25MB OPUS file at 6kbps encodes ~8.7 hours of audio, which expands to ~5.7GB of float32 PCM at decode time (232x amplification). Three concurrent requests are enough to OOM-kill the server.
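The amplification can be checked with a back-of-the-envelope calculation. The sketch below is not from the PR: the function name is hypothetical, and it assumes a constant-bitrate file decoded to 48 kHz mono float32 with no container overhead. Under those assumptions it lands slightly above the figures quoted above, presumably because part of a real upload is container overhead rather than audio payload.

```python
def opus_bomb_estimate(
    upload_bytes: int,
    bitrate_bps: int,
    sample_rate_hz: int = 48_000,  # assumption: decoder outputs 48 kHz
    channels: int = 1,             # assumption: mono
    bytes_per_sample: int = 4,     # float32 PCM
) -> tuple[float, int, float]:
    """Return (duration_s, decoded_bytes, amplification) for a CBR upload.

    Hypothetical helper for illustration; ignores container overhead.
    """
    # Seconds of audio that fit in the upload at the given bitrate.
    duration_s = upload_bytes * 8 / bitrate_bps
    # Integer arithmetic keeps the decoded size exact.
    decoded_bytes = (
        upload_bytes * 8 * sample_rate_hz * channels * bytes_per_sample
    ) // bitrate_bps
    return duration_s, decoded_bytes, decoded_bytes / upload_bytes


duration_s, decoded_bytes, ratio = opus_bomb_estimate(25_000_000, 6_000)
print(f"{duration_s / 3600:.1f} h, {decoded_bytes / 1e9:.1f} GB, {ratio:.0f}x")
```

Note that the ratio depends only on the bitrate and the decoded sample format, not on the upload size, so lowering the upload limit alone does not reduce the amplification factor.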

The root cause is that load_audio() fully decodes compressed audio into memory before max_audio_clip_s (default 30s) is checked, meaning the giant allocation has already happened by the time the duration guard runs.

This patch enforces a decoded audio duration limit during decoding — before np.concatenate allocates the contiguous float32 array — by adding a max_duration_s parameter to load_audio_pyav (sample-counting in the decode loop) and load_audio_soundfile (frame-count metadata check before read). A new environment variable VLLM_MAX_AUDIO_DECODE_DURATION_S (default 600s / 10 minutes) controls the limit and is wired into the speech-to-text serving layer.
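The sample-counting idea can be sketched without PyAV. The code below is an illustration under assumptions, not the patch itself: `decode_with_limit` and its `frames` argument are hypothetical stand-ins for the decode loop in `load_audio_pyav`, and the standard-library `array` type stands in for the NumPy float32 buffer. What it shows is the ordering: the limit is checked per decoded chunk, so the input is rejected before the single large contiguous buffer is built.

```python
from array import array
from typing import Iterable, Optional, Sequence


def decode_with_limit(
    frames: Iterable[Sequence[float]],
    sample_rate: int,
    max_duration_s: Optional[float] = None,
) -> array:
    """Collect decoded chunks, rejecting input that exceeds the duration cap.

    Hypothetical sketch; `frames` stands in for a decoder's frame iterator.
    """
    max_samples = (
        None if max_duration_s is None else int(max_duration_s * sample_rate)
    )
    chunks: list[Sequence[float]] = []
    total_samples = 0
    for frame in frames:
        total_samples += len(frame)
        # Check the running total before keeping the chunk.
        if max_samples is not None and total_samples > max_samples:
            raise ValueError(
                f"Audio exceeds maximum allowed duration of {max_duration_s}s"
            )
        chunks.append(frame)
    # Only input that passed the check reaches the contiguous allocation.
    out = array("f")
    for chunk in chunks:
        out.extend(chunk)
    return out
```

With `max_duration_s=None` the loop behaves as an unbounded decode, which matches the description's point that existing callers are unaffected by the default.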

Test Plan

python -m pytest tests/multimodal/media/test_audio.py::test_load_audio_max_duration_respected -v
python -m pytest tests/multimodal/media/test_audio.py::test_load_audio_max_duration_rejected -v

Test Result

Audio within the duration limit loads successfully (no regression).
Audio exceeding the duration limit is rejected during decode with "ValueError: Audio exceeds maximum allowed duration", before the large contiguous allocation occurs.
Existing callers (AudioMediaIO, assets/audio.py, voxtral.py) are unaffected, since max_duration_s defaults to None (no limit).

@mergify mergify Bot added the frontend and multi-modality labels Jun 9, 2026
@jperezdealgaba jperezdealgaba changed the title Fix DoS via audio decompression bomb in speech-to-text endpoint Jun 9, 2026
Comment thread vllm/envs.py
"VLLM_MAX_AUDIO_CLIP_FILESIZE_MB": lambda: int(
    os.getenv("VLLM_MAX_AUDIO_CLIP_FILESIZE_MB", "25")
),
# Maximum decoded audio duration in seconds. Compressed audio files

Member

cc @NickLucche @Isotr0py I haven't used audio models much. Is 10 minutes enough for the vast majority of cases?

@Isotr0py Isotr0py Jun 9, 2026

Member

"Novel and strong forced alignment Solution: We introduce Qwen3-ForcedAligner-0.6B, which supports timestamp prediction for arbitrary units within up to 5 minutes of speech in 11 languages. Evaluations show its timestamp accuracy surpasses E2E based forced-alignment models."

Seems Qwen3-ASR's audio limitation is up to 5min (https://huggingface.co/Qwen/Qwen3-ASR-0.6B#introduction), so I think 10min per chunk should be fine. In practice, a long audio is usually chunked into several small chunks for transcriptions.

@ekagra-ranjan ekagra-ranjan Jun 9, 2026

Contributor

The 5 min limit above is for Qwen3-ForcedAligner-0.6B, which is a forced-alignment model for timestamps, not the core ASR model. Below are the maximum duration and file size limits from different providers:

| Provider / model | Official API limit | Source URL |
|---|---|---|
| Qwen3-ASR-Flash | 5 minutes per audio | https://docs.qwencloud.com/developer-guides/speech/asr |
| Qwen3-ASR-Flash-Filetrans | 12 hours, 2 GB per file | https://docs.qwencloud.com/developer-guides/speech/asr |
| Fun-ASR | 12 hours, 2 GB per file | https://docs.qwencloud.com/api-reference/speech-recognition/fun-asr-recording/restful-api |
| OpenAI whisper-1 | 25 MB per upload | https://developers.openai.com/api/docs/guides/speech-to-text |
| Cohere Transcribe | 25 MB per upload | https://docs.cohere.com/v2/docs/transcribe |
| Mistral STT / Voxtral | 60 minutes, 500 MB per file | https://docs.mistral.ai/resources/known-limitations |

"a long audio is usually chunked into several small chunks for transcriptions."

Chunking should be handled by the server, not by the client. In fact, the vLLM server already does this here

I think this change might break a lot of downstream tasks that use vLLM Whisper for long-audio transcription, since it adds a new variable whose default is quite low. Coincidentally, I have a PR that adds a long-audio test to CI to catch changes like this.

Comment thread vllm/multimodal/media/audio.py
Enforce a decoded audio duration limit *during* decoding, before
np.concatenate allocates the contiguous float32 array.  A 25MB OPUS
file at 6kbps encodes ~8.7h of audio which expands to ~5.7GB of
float32 PCM (232x amplification ratio); three concurrent requests
are enough to OOM-kill the server.

The fix adds a `max_duration_s` parameter to `load_audio_pyav` and
`load_audio_soundfile` that caps decoded samples in the decode loop
(pyav) or via frame-count metadata (soundfile).  A new env var
`VLLM_MAX_AUDIO_DECODE_DURATION_S` (default 600s) controls the limit
and is wired into the speech-to-text serving layer.

Signed-off-by: Juan Pérez de Algaba <jperezde@redhat.com>

Signed-off-by: jperezde <jperezde@redhat.com>
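For the soundfile path, the commit message describes a check on frame-count metadata before any samples are read. A minimal sketch of that check follows. It is an illustration under assumptions: both function names are hypothetical, the parsing of the environment variable as a float is a guess (only the variable name and the 600s default come from this PR), and the frame count and sample rate are passed in as plain integers rather than read from a real file header.

```python
import os
from typing import Optional


def max_decode_duration_from_env() -> float:
    """Read the limit in seconds (hypothetical helper; parsing is assumed)."""
    return float(os.getenv("VLLM_MAX_AUDIO_DECODE_DURATION_S", "600"))


def check_metadata_duration(
    num_frames: int,
    sample_rate: int,
    max_duration_s: Optional[float],
) -> None:
    """Reject a file from header metadata alone, before reading samples."""
    if max_duration_s is None:
        return  # no limit, matching the default for existing callers
    metadata_duration_s = num_frames / sample_rate
    if metadata_duration_s > max_duration_s:
        raise ValueError(
            f"Audio exceeds maximum allowed duration of "
            f"{max_duration_s}s (metadata reports "
            f"{metadata_duration_s:.1f}s). This limit "
            f"prevents decompression-bomb attacks."
        )


# A 30 s clip at 16 kHz passes with the default limit.
check_metadata_duration(30 * 16_000, 16_000, max_decode_duration_from_env())
```

A metadata check is cheap but trusts the header, which is presumably why the PyAV path counts samples in the decode loop instead.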
@jperezdealgaba jperezdealgaba force-pushed the fix/audio-decompression-bomb-dos branch from 3e1436a to cd4296c Compare June 9, 2026 09:36
@DarkLight1337 DarkLight1337 added the ready label Jun 9, 2026
@Isotr0py Isotr0py merged commit 1b1359c into vllm-project:main Jun 9, 2026
70 checks passed
Comment on lines +92 to +97
raise ValueError(
    f"Audio exceeds maximum allowed duration of "
    f"{max_duration_s}s (metadata reports "
    f"{metadata_duration_s:.1f}s). This limit "
    f"prevents decompression-bomb attacks."
)

@ekagra-ranjan ekagra-ranjan Jun 9, 2026

Contributor

I think the error message should mention that VLLM_MAX_AUDIO_DECODE_DURATION_S changes the limit. Since this is a new setting, users need to know they can increase it if needed.

Also, when this error is raised, it reaches the user as ValueError("Invalid or unsupported audio file.") from exc, because of the try/except around load_audio() in vllm/entrypoints/speech_to_text/base/serving.py. That message is misleading.

@jperezdealgaba - is this something you can fix?

Contributor Author

I opened a PR for this: #45113

ekagra-ranjan pushed a commit to ekagra-ranjan/vllm that referenced this pull request Jun 9, 2026
waqahmed-amd-fi pushed a commit to waqahmed-amd-fi/vllm that referenced this pull request Jun 10, 2026
…point (vllm-project#44970)

Signed-off-by: jperezde <jperezde@redhat.com>
Signed-off-by: Waqar Ahmed <waqar.ahmed@amd.com>
jperezdealgaba added a commit to jperezdealgaba/vllm that referenced this pull request Jun 12, 2026
Let the descriptive ValueError from load_audio propagate directly
instead of catching it and re-raising a generic "Invalid or
unsupported audio file" message.  Also update the error messages to
mention the VLLM_MAX_AUDIO_DECODE_DURATION_S env var so users know
how to increase the limit.

Follow-up to vllm-project#44970.

Signed-off-by: jperezde <jperezde@redhat.com>
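The difference between masking and propagating the error can be sketched as follows. Every name here is a hypothetical stand-in: `load_audio_stub` replaces the real `load_audio`, the two `serve_*` functions replace the serving-layer call site, and the wording of the env-var hint is invented for illustration. Only the generic message and the variable name come from this thread.

```python
def load_audio_stub(duration_s: float, max_duration_s: float) -> list[float]:
    """Stand-in for load_audio: raises a descriptive error over the limit."""
    if duration_s > max_duration_s:
        raise ValueError(
            f"Audio exceeds maximum allowed duration of {max_duration_s}s. "
            # Assumed wording; the real message may differ.
            "Set VLLM_MAX_AUDIO_DECODE_DURATION_S to raise the limit."
        )
    return [0.0]


def serve_masking(duration_s: float, max_duration_s: float) -> list[float]:
    """Before the follow-up: the cause is hidden behind a generic message."""
    try:
        return load_audio_stub(duration_s, max_duration_s)
    except ValueError as exc:
        raise ValueError("Invalid or unsupported audio file.") from exc


def serve_propagating(duration_s: float, max_duration_s: float) -> list[float]:
    """After the follow-up: the descriptive ValueError reaches the caller."""
    return load_audio_stub(duration_s, max_duration_s)


for serve in (serve_masking, serve_propagating):
    try:
        serve(3600.0, 600.0)
    except ValueError as err:
        print(f"{serve.__name__}: {err}")
```

In the masking variant the original error survives only as `__cause__`, which a client reading the HTTP response would not normally see.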
Saddss pushed a commit to Saddss/vllm that referenced this pull request Jun 14, 2026
vivek8123 pushed a commit to odh-on-pz/vllm-upstream that referenced this pull request Jun 18, 2026
divineearthly pushed a commit to divineearthly/vllm that referenced this pull request Jun 19, 2026
…point (vllm-project#44970)

Signed-off-by: jperezde <jperezde@redhat.com>
Signed-off-by: divineearthly <divineearthly@gmail.com>
tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request Jun 22, 2026
nkzhenhua pushed a commit to nkzhenhua/vllm that referenced this pull request Jun 24, 2026
ohsono pushed a commit to ohsono/vllm that referenced this pull request Jul 3, 2026
Labels

frontend multi-modality Related to multi-modality (#4194) ready ONLY add when PR is ready to merge/full CI is needed

4 participants