[Security] Fix DoS via audio decompression bomb in speech-to-text endpoint by jperezdealgaba · Pull Request #44970 · vllm-project/vllm

jperezdealgaba · 2026-06-09T07:01:37Z

Purpose

Fix a denial-of-service vulnerability where the /v1/audio/transcriptions endpoint limits compressed upload size (default 25MB) but not decoded PCM output. A 25MB OPUS file at 6kbps encodes ~8.7 hours of audio, which expands to ~5.7GB of float32 PCM at decode time (232x amplification). Three concurrent requests are enough to OOM-kill the server.
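The figures above can be sanity-checked with a few lines of arithmetic. This is only a rough sketch: it assumes the audio is decoded to 48 kHz mono float32, which the description does not state, so the results land near the quoted ~5.7GB and 232x rather than matching them exactly.

```python
# Back-of-envelope check of the amplification figures quoted above.
# ASSUMPTION (not stated in the PR): decode target is 48 kHz mono float32.
SAMPLE_RATE_HZ = 48_000                  # assumed decode sample rate
BYTES_PER_SAMPLE = 4                     # float32
UPLOAD_LIMIT_BYTES = 25 * 1024 * 1024    # default 25MB compressed upload cap
DURATION_S = 8.7 * 3600                  # ~8.7 hours, as quoted


def decoded_pcm_bytes(duration_s, sample_rate_hz=SAMPLE_RATE_HZ, channels=1):
    """Size in bytes of the fully decoded float32 PCM buffer."""
    return int(duration_s * sample_rate_hz * channels * BYTES_PER_SAMPLE)


pcm_bytes = decoded_pcm_bytes(DURATION_S)
amplification = pcm_bytes / UPLOAD_LIMIT_BYTES

print(f"decoded PCM for ~8.7h: {pcm_bytes / 2**30:.1f} GiB")
print(f"amplification vs upload cap: {amplification:.0f}x")
print(f"decoded PCM at a 600s cap: {decoded_pcm_bytes(600) / 2**20:.0f} MiB")
```

Under the same assumption, a 600s cap bounds each request's decoded buffer at roughly a hundred megabytes instead of several gigabytes.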

The root cause is that load_audio() fully decodes compressed audio into memory before max_audio_clip_s (default 30s) is checked, meaning the giant allocation has already happened by the time the duration guard runs.

This patch enforces a decoded audio duration limit during decoding — before np.concatenate allocates the contiguous float32 array — by adding a max_duration_s parameter to load_audio_pyav (sample-counting in the decode loop) and load_audio_soundfile (frame-count metadata check before read). A new environment variable VLLM_MAX_AUDIO_DECODE_DURATION_S (default 600s / 10 minutes) controls the limit and is wired into the speech-to-text serving layer.
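The sample-counting idea can be illustrated without any audio library. The sketch below is not the PR's code: `collect_capped` and `fake_decoder` are hypothetical names, and stdlib `array` buffers stand in for decoded frames and for the final contiguous allocation. It only shows the ordering that matters, which is checking the running sample count inside the loop so the large buffer is never built for an over-limit input.

```python
from array import array
from typing import Iterable, Optional


def collect_capped(
    chunks: Iterable[array],
    sample_rate: int,
    max_duration_s: Optional[float] = None,
) -> array:
    """Accumulate decoded chunks, failing as soon as the cap is exceeded."""
    max_samples = (
        None if max_duration_s is None else int(max_duration_s * sample_rate)
    )
    kept = []
    total_samples = 0
    for chunk in chunks:
        total_samples += len(chunk)
        # Check inside the loop: decoding stops after the first chunk that
        # crosses the limit, long before the whole stream is in memory.
        if max_samples is not None and total_samples > max_samples:
            raise ValueError(
                f"Audio exceeds maximum allowed duration of {max_duration_s}s"
            )
        kept.append(chunk)

    # Only now build the single contiguous buffer (the step that
    # np.concatenate performs in the real loader).
    out = array("f")
    for chunk in kept:
        out.extend(chunk)
    return out


def fake_decoder(n_chunks: int, chunk_len: int = 1600):
    """Hypothetical stand-in for a decoder yielding fixed-size frames."""
    for _ in range(n_chunks):
        yield array("f", [0.0] * chunk_len)


# One second of audio at 16 kHz fits a 1.0s cap exactly.
samples = collect_capped(fake_decoder(10), sample_rate=16_000, max_duration_s=1.0)
print(len(samples))
```

Leaving `max_duration_s` as `None` keeps the unlimited behaviour, which is why callers that do not pass the argument are unaffected.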

Test Plan

python -m pytest tests/multimodal/media/test_audio.py::test_load_audio_max_duration_respected -v
python -m pytest tests/multimodal/media/test_audio.py::test_load_audio_max_duration_rejected -v

Test Result

Audio within the duration limit loads successfully (no regression)
Audio exceeding the duration limit is rejected with ValueError: Audio exceeds maximum allowed duration during decode, before the large contiguous allocation occurs
Existing callers (AudioMediaIO, assets/audio.py, voxtral.py) are unaffected since max_duration_s defaults to None (no limit)
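For the metadata-based path, the same check can be sketched with the standard library's `wave` module as a stand-in for soundfile. This is an illustration under that substitution, not the PR's implementation, and `check_wav_duration` and `make_silent_wav` are hypothetical helpers. The point is that the frame count comes from the header, so an over-limit file is rejected before any sample data is read.

```python
import io
import wave
from typing import Optional


def check_wav_duration(data: bytes, max_duration_s: Optional[float] = None) -> float:
    """Return the duration reported by the header, enforcing an optional cap."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        # getnframes()/getframerate() come from the header; no frames are read.
        duration_s = wav.getnframes() / wav.getframerate()
    if max_duration_s is not None and duration_s > max_duration_s:
        raise ValueError(
            f"Audio exceeds maximum allowed duration of {max_duration_s}s "
            f"(metadata reports {duration_s:.1f}s)."
        )
    return duration_s


def make_silent_wav(duration_s: float, sample_rate: int = 8000) -> bytes:
    """Build a small 16-bit mono WAV of silence for the demonstration."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(duration_s * sample_rate))
    return buf.getvalue()


short_clip = make_silent_wav(2.0)
print(check_wav_duration(short_clip, max_duration_s=600))
```

Header metadata can be wrong or forged, so a check like this complements a decode-time count rather than replacing it.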

DarkLight1337 · 2026-06-09T07:49:28Z

    "VLLM_MAX_AUDIO_CLIP_FILESIZE_MB": lambda: int(
        os.getenv("VLLM_MAX_AUDIO_CLIP_FILESIZE_MB", "25")
    ),
+    # Maximum decoded audio duration in seconds.  Compressed audio files


cc @NickLucche @Isotr0py I haven't used audio models much; is 10 minutes enough for the vast majority of cases?

Novel and strong forced alignment Solution: We introduce Qwen3-ForcedAligner-0.6B, which supports timestamp prediction for arbitrary units within up to 5 minutes of speech in 11 languages. Evaluations show its timestamp accuracy surpasses E2E based forced-alignment models.

Seems Qwen3-ASR's audio limitation is up to 5min (https://huggingface.co/Qwen/Qwen3-ASR-0.6B#introduction), so I think 10min per chunk should be fine. In practice, a long audio is usually chunked into several small chunks for transcriptions.

The above 5min is for Qwen3-ForcedAligner-0.6B which is a forced aligner model for timestamp and not the core ASR model. Below is the max duration and file limit from different providers:

| Provider / model | Official API limit | Source URL |
| --- | --- | --- |
| Qwen3-ASR-Flash | 5 minutes per audio | https://docs.qwencloud.com/developer-guides/speech/asr |
| Qwen3-ASR-Flash-Filetrans | 12 hours, 2 GB per file | https://docs.qwencloud.com/developer-guides/speech/asr |
| Fun-ASR | 12 hours, 2 GB per file | https://docs.qwencloud.com/api-reference/speech-recognition/fun-asr-recording/restful-api |
| OpenAI whisper-1 | 25 MB per upload | https://developers.openai.com/api/docs/guides/speech-to-text |
| Cohere Transcribe | 25 MB per upload | https://docs.cohere.com/v2/docs/transcribe |
| Mistral STT / Voxtral | 60 minutes, 500 MB per file | https://docs.mistral.ai/resources/known-limitations |

> a long audio is usually chunked into several small chunks for transcriptions.

Chunking is something that should be handled by the server, not by the client. In fact, we already do this in the vLLM server here

I think this change might break a lot of downstream tasks that use vLLM Whisper for long audio transcription, since it adds a new variable whose default is quite low. Coincidentally, I have a PR that adds a long-audio test to CI to catch changes like this.

Enforce a decoded audio duration limit *during* decoding, before np.concatenate allocates the contiguous float32 array. A 25MB OPUS file at 6kbps encodes ~8.7h of audio which expands to ~5.7GB of float32 PCM (232x amplification ratio); three concurrent requests are enough to OOM-kill the server. The fix adds a `max_duration_s` parameter to `load_audio_pyav` and `load_audio_soundfile` that caps decoded samples in the decode loop (pyav) or via frame-count metadata (soundfile). A new env var `VLLM_MAX_AUDIO_DECODE_DURATION_S` (default 600s) controls the limit and is wired into the speech-to-text serving layer. Signed-off-by: Juan Pérez de Algaba <jperezde@redhat.com> Signed-off-by: jperezde <jperezde@redhat.com>

ekagra-ranjan · 2026-06-09T20:43:28Z

+                    raise ValueError(
+                        f"Audio exceeds maximum allowed duration of "
+                        f"{max_duration_s}s (metadata reports "
+                        f"{metadata_duration_s:.1f}s). This limit "
+                        f"prevents decompression-bomb attacks."
+                    )


I think the error message should mention that VLLM_MAX_AUDIO_DECODE_DURATION_S changes the limit. Since this is a new setting, users need to know they can increase it if needed.

Also, when this error is raised, it will be surfaced to the user as ValueError("Invalid or unsupported audio file.") from exc because of the try/except around load_audio() in vllm/entrypoints/speech_to_text/base/serving.py, which would be misleading.

@jperezdealgaba - is this something you can fix?

I did a MR for this: #45113

…point (vllm-project#44970) Signed-off-by: jperezde <jperezde@redhat.com>

…point (vllm-project#44970) Signed-off-by: jperezde <jperezde@redhat.com> Signed-off-by: Waqar Ahmed <waqar.ahmed@amd.com>

Let the descriptive ValueError from load_audio propagate directly instead of catching it and re-raising a generic "Invalid or unsupported audio file" message. Also update the error messages to mention the VLLM_MAX_AUDIO_DECODE_DURATION_S env var so users know how to increase the limit. Follow-up to vllm-project#44970. Signed-off-by: jperezde <jperezde@redhat.com>
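The difference between the two behaviours can be shown with a small sketch. Everything here is hypothetical scaffolding (`load_audio_stub`, `handle_wrapped`, `handle_propagating`), and how the follow-up actually restructures the handler is not shown in this thread. The sketch only contrasts a blanket re-raise, which hides the cause, with letting the descriptive error reach the caller.

```python
ENV_VAR = "VLLM_MAX_AUDIO_DECODE_DURATION_S"


def load_audio_stub(duration_s, max_duration_s):
    """Hypothetical stand-in for load_audio with a duration limit."""
    if max_duration_s is not None and duration_s > max_duration_s:
        raise ValueError(
            f"Audio exceeds maximum allowed duration of {max_duration_s}s. "
            f"Set {ENV_VAR} to raise the limit."
        )
    return [0.0]  # placeholder for decoded samples


def handle_wrapped(duration_s, max_duration_s):
    """Behaviour flagged in review: every failure gets the same generic text."""
    try:
        return load_audio_stub(duration_s, max_duration_s)
    except Exception as exc:
        raise ValueError("Invalid or unsupported audio file.") from exc


def handle_propagating(duration_s, max_duration_s):
    """Behaviour the follow-up describes: the descriptive error propagates."""
    return load_audio_stub(duration_s, max_duration_s)


for handler in (handle_wrapped, handle_propagating):
    try:
        handler(700.0, 600.0)
    except ValueError as exc:
        print(f"{handler.__name__}: {exc}")
```

With the wrapped handler, a user who hits the limit is told the file is invalid; with the propagating one, the message names the limit and the variable that controls it.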

…point (vllm-project#44970) Signed-off-by: jperezde <jperezde@redhat.com> Signed-off-by: divineearthly <divineearthly@gmail.com>

jperezdealgaba requested review from DarkLight1337, NickLucche, tjtanaa and ywang96 as code owners June 9, 2026 07:01

mergify Bot added the frontend and multi-modality labels Jun 9, 2026

jperezdealgaba changed the title from ~~Fix DoS via audio decompression bomb in speech-to-text endpoint~~ to [Security] Fix DoS via audio decompression bomb in speech-to-text endpoint Jun 9, 2026

DarkLight1337 reviewed Jun 9, 2026

Isotr0py reviewed Jun 9, 2026

Comment thread vllm/multimodal/media/audio.py

jperezdealgaba force-pushed the fix/audio-decompression-bomb-dos branch from 3e1436a to cd4296c Compare June 9, 2026 09:36

Merge branch 'main' into fix/audio-decompression-bomb-dos

6403038

DarkLight1337 added the ready label Jun 9, 2026

DarkLight1337 approved these changes Jun 9, 2026

Isotr0py merged commit 1b1359c into vllm-project:main Jun 9, 2026
70 checks passed

ekagra-ranjan reviewed Jun 9, 2026

ekagra-ranjan pushed a commit to ekagra-ranjan/vllm that referenced this pull request Jun 9, 2026

[Security] Fix DoS via audio decompression bomb in speech-to-text end…

013a182

…point (vllm-project#44970) Signed-off-by: jperezde <jperezde@redhat.com>

Saddss pushed a commit to Saddss/vllm that referenced this pull request Jun 14, 2026

[Security] Fix DoS via audio decompression bomb in speech-to-text end…

ff21d69

…point (vllm-project#44970) Signed-off-by: jperezde <jperezde@redhat.com>

DRL-NextGen mentioned this pull request Jun 18, 2026

feat(cli): Add nexus validate benchmarks command IBM/algorithm-nexus#136

Merged

vivek8123 pushed a commit to odh-on-pz/vllm-upstream that referenced this pull request Jun 18, 2026

[Security] Fix DoS via audio decompression bomb in speech-to-text end…

3bc209d

…point (vllm-project#44970) Signed-off-by: jperezde <jperezde@redhat.com>

This was referenced Jun 19, 2026

Cp benchmark test pr 2 IBM/algorithm-nexus#142

Open

build(deps): update vLLM to 0.23.0 in candidate and 0.21.0 in product IBM/algorithm-nexus#143

Merged

This was referenced Jun 22, 2026

build(deps): update dependencies IBM/algorithm-nexus#146

Closed

build(hooks): update pre-commit hooks IBM/algorithm-nexus#147

Merged

tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request Jun 22, 2026

[Security] Fix DoS via audio decompression bomb in speech-to-text end…

40f8f06

…point (vllm-project#44970) Signed-off-by: jperezde <jperezde@redhat.com>

nixpkgs-security-tracker Bot mentioned this pull request Jun 23, 2026

vLLM: security issues < 0.23.1rc0 NixOS/nixpkgs#534486

Open

nkzhenhua pushed a commit to nkzhenhua/vllm that referenced this pull request Jun 24, 2026

[Security] Fix DoS via audio decompression bomb in speech-to-text end…

20953a0

…point (vllm-project#44970) Signed-off-by: jperezde <jperezde@redhat.com>

ohsono pushed a commit to ohsono/vllm that referenced this pull request Jul 3, 2026

[Security] Fix DoS via audio decompression bomb in speech-to-text end…

e7b9aba

…point (vllm-project#44970) Signed-off-by: jperezde <jperezde@redhat.com>

[Security] Fix DoS via audio decompression bomb in speech-to-text endpoint #44970
Isotr0py merged 2 commits into vllm-project:main from jperezdealgaba:fix/audio-decompression-bomb-dos

4 participants
