[Security] Fix DoS via audio decompression bomb in speech-to-text endpoint#44970
Conversation
"VLLM_MAX_AUDIO_CLIP_FILESIZE_MB": lambda: int(
    os.getenv("VLLM_MAX_AUDIO_CLIP_FILESIZE_MB", "25")
),
# Maximum decoded audio duration in seconds. Compressed audio files
cc @NickLucche @Isotr0py I haven't used audio models much. Is 10 minutes enough for the vast majority of cases?
Novel and strong forced alignment Solution: We introduce Qwen3-ForcedAligner-0.6B, which supports timestamp prediction for arbitrary units within up to 5 minutes of speech in 11 languages. Evaluations show its timestamp accuracy surpasses E2E based forced-alignment models.
Seems Qwen3-ASR's audio limitation is up to 5min (https://huggingface.co/Qwen/Qwen3-ASR-0.6B#introduction), so I think 10min per chunk should be fine. In practice, a long audio is usually chunked into several small chunks for transcriptions.
The 5-minute figure above is for Qwen3-ForcedAligner-0.6B, which is a forced-alignment model for timestamps, not the core ASR model. Below are the maximum duration and file-size limits from different providers:
| Provider / model | Official API limit | Source URL |
|---|---|---|
| Qwen3-ASR-Flash | 5 minutes per audio | https://docs.qwencloud.com/developer-guides/speech/asr |
| Qwen3-ASR-Flash-Filetrans | 12 hours, 2 GB per file | https://docs.qwencloud.com/developer-guides/speech/asr |
| Fun-ASR | 12 hours, 2 GB per file | https://docs.qwencloud.com/api-reference/speech-recognition/fun-asr-recording/restful-api |
| OpenAI whisper-1 | 25 MB per upload | https://developers.openai.com/api/docs/guides/speech-to-text |
| Cohere Transcribe | 25 MB per upload | https://docs.cohere.com/v2/docs/transcribe |
| Mistral STT / Voxtral | 60 minutes, 500 MB per file | https://docs.mistral.ai/resources/known-limitations |
> a long audio is usually chunked into several small chunks for transcriptions.
Chunking should be handled by the server, not by the client. In fact, the vLLM server already does this here
I think this change might break a lot of downstream tasks that use vLLM Whisper for long-audio transcription, since it adds a new variable whose default is quite low. Coincidentally, I have a PR that adds a long-audio test to CI to catch changes like this.
Enforce a decoded audio duration limit *during* decoding, before np.concatenate allocates the contiguous float32 array. A 25MB OPUS file at 6kbps encodes ~8.7h of audio which expands to ~5.7GB of float32 PCM (232x amplification ratio); three concurrent requests are enough to OOM-kill the server. The fix adds a `max_duration_s` parameter to `load_audio_pyav` and `load_audio_soundfile` that caps decoded samples in the decode loop (pyav) or via frame-count metadata (soundfile). A new env var `VLLM_MAX_AUDIO_DECODE_DURATION_S` (default 600s) controls the limit and is wired into the speech-to-text serving layer. Signed-off-by: Juan Pérez de Algaba <jperezde@redhat.com> Signed-off-by: jperezde <jperezde@redhat.com>
3e1436a to cd4296c
raise ValueError(
    f"Audio exceeds maximum allowed duration of "
    f"{max_duration_s}s (metadata reports "
    f"{metadata_duration_s:.1f}s). This limit "
    f"prevents decompression-bomb attacks."
)
I think the error message should mention that VLLM_MAX_AUDIO_DECODE_DURATION_S changes the limit. It is a new variable, so users need to know how to increase it if needed.
Also, when this error is raised, it is surfaced to the user as ValueError("Invalid or unsupported audio file.") from exc because of the try/except around load_audio() in vllm/entrypoints/speech_to_text/base/serving.py, which is misleading.
@jperezdealgaba - is this something you can fix?
…point (vllm-project#44970) Signed-off-by: jperezde <jperezde@redhat.com>
…point (vllm-project#44970) Signed-off-by: jperezde <jperezde@redhat.com> Signed-off-by: Waqar Ahmed <waqar.ahmed@amd.com>
Let the descriptive ValueError from load_audio propagate directly instead of catching it and re-raising a generic "Invalid or unsupported audio file" message. Also update the error messages to mention the VLLM_MAX_AUDIO_DECODE_DURATION_S env var so users know how to increase the limit. Follow-up to vllm-project#44970. Signed-off-by: jperezde <jperezde@redhat.com>
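The difference between the two behaviours described in this commit message can be illustrated with a small sketch. Everything here is hypothetical: `load_audio_stub`, `serve_wrapping`, and `serve_propagating` are stand-ins invented for illustration, and the error text is not the exact wording used in vLLM.

```python
GENERIC_MESSAGE = "Invalid or unsupported audio file."


def load_audio_stub(duration_s: float, max_duration_s: float) -> float:
    """Hypothetical stand-in for load_audio(): rejects over-long audio."""
    if duration_s > max_duration_s:
        # Illustrative wording only; it names the env var so the user
        # knows which knob raises the limit.
        raise ValueError(
            f"Audio exceeds maximum allowed duration of {max_duration_s}s. "
            "Increase VLLM_MAX_AUDIO_DECODE_DURATION_S to raise the limit."
        )
    return duration_s


def serve_wrapping(duration_s: float, max_duration_s: float) -> float:
    """Earlier behaviour: every ValueError collapses into a generic message."""
    try:
        return load_audio_stub(duration_s, max_duration_s)
    except ValueError as exc:
        raise ValueError(GENERIC_MESSAGE) from exc


def serve_propagating(duration_s: float, max_duration_s: float) -> float:
    """Follow-up behaviour: the loader's own message reaches the caller."""
    return load_audio_stub(duration_s, max_duration_s)
```

With the wrapping variant, a caller who sends over-long audio sees only the generic message; with the propagating variant, the same request returns the message that names the environment variable.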
…point (vllm-project#44970) Signed-off-by: jperezde <jperezde@redhat.com> Signed-off-by: divineearthly <divineearthly@gmail.com>
Purpose
Fix a denial-of-service vulnerability where the /v1/audio/transcriptions endpoint limits compressed upload size (default 25MB) but not decoded PCM output. A 25MB OPUS file at 6kbps encodes ~8.7 hours of audio, which expands to ~5.7GB of float32 PCM at decode time (232x amplification). Three concurrent requests are enough to OOM-kill the server.
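The memory figures above can be sanity-checked with a few lines of arithmetic. This sketch assumes 48 kHz mono output (the rate Opus decoders typically produce) and 4 bytes per float32 sample; neither value is stated in this PR, and `decoded_pcm_bytes` is a hypothetical helper.

```python
BYTES_PER_FLOAT32 = 4


def decoded_pcm_bytes(
    duration_s: float, sample_rate: int = 48_000, channels: int = 1
) -> int:
    """Size of the decoded float32 buffer for a clip of the given length."""
    # Assumption: 48 kHz mono by default; adjust for other decoders.
    return int(duration_s * sample_rate * channels) * BYTES_PER_FLOAT32


bomb_bytes = decoded_pcm_bytes(8.7 * 3600)  # the ~8.7 h case from this PR
capped_bytes = decoded_pcm_bytes(600)       # the 600 s default cap

print(f"8.7 h decoded: {bomb_bytes / 2**30:.1f} GiB")
print(f"600 s decoded: {capped_bytes / 2**20:.1f} MiB")
```

Under those assumptions, a single decode at the 10-minute default is bounded to roughly 110 MiB, against several GiB for the 8.7-hour case.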
The root cause is that load_audio() fully decodes compressed audio into memory before max_audio_clip_s (default 30s) is checked, meaning the giant allocation has already happened by the time the duration guard runs.
This patch enforces a decoded audio duration limit during decoding — before np.concatenate allocates the contiguous float32 array — by adding a max_duration_s parameter to load_audio_pyav (sample-counting in the decode loop) and load_audio_soundfile (frame-count metadata check before read). A new environment variable VLLM_MAX_AUDIO_DECODE_DURATION_S (default 600s / 10 minutes) controls the limit and is wired into the speech-to-text serving layer.
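The sample-counting idea can be sketched without PyAV. The code below is a minimal illustration, not the code in this PR: `collect_with_duration_cap` is a hypothetical name, the chunks are plain Python sequences rather than decoded frames, and building an `array("f")` stands in for the `np.concatenate` call.

```python
from array import array
from typing import Iterable, List, Optional, Sequence


def collect_with_duration_cap(
    chunks: Iterable[Sequence[float]],
    sample_rate: int,
    max_duration_s: Optional[float] = None,
) -> array:
    """Accumulate decoded chunks, failing as soon as the cap is crossed."""
    max_samples = (
        None if max_duration_s is None else int(max_duration_s * sample_rate)
    )
    kept: List[Sequence[float]] = []
    total_samples = 0
    for chunk in chunks:
        total_samples += len(chunk)
        # Check inside the loop so decoding stops early, before the
        # contiguous output buffer is ever built.
        if max_samples is not None and total_samples > max_samples:
            raise ValueError(
                f"Audio exceeds maximum allowed duration of {max_duration_s}s"
            )
        kept.append(chunk)

    # Stand-in for np.concatenate: one contiguous float32 buffer.
    out = array("f")
    for chunk in kept:
        out.extend(chunk)
    return out
```

Because the check runs per chunk, a lazily decoded stream is abandoned shortly after the cap is crossed, so peak memory stays close to the cap rather than the full decoded length. Passing `max_duration_s=None` keeps the unlimited behaviour.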
Test Plan
python -m pytest tests/multimodal/media/test_audio.py::test_load_audio_max_duration_respected -v
python -m pytest tests/multimodal/media/test_audio.py::test_load_audio_max_duration_rejected -v
Test Result
Audio within the duration limit loads successfully (no regression)
Audio exceeding the duration limit is rejected with ValueError: Audio exceeds maximum allowed duration during decode, before the large contiguous allocation occurs
Existing callers (AudioMediaIO, assets/audio.py, voxtral.py) are unaffected since max_duration_s defaults to None (no limit)
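The metadata-based check described for `load_audio_soundfile` (compare the frame count to the limit before reading) can be illustrated with the standard-library `wave` module. This swaps `wave` in for soundfile purely to keep the sketch dependency-free; `read_wav_with_duration_cap` and `make_silent_wav` are hypothetical helpers, not functions from vLLM.

```python
import io
import wave
from typing import Optional


def read_wav_with_duration_cap(
    data: bytes, max_duration_s: Optional[float] = None
) -> bytes:
    """Read PCM frames only if the header-reported duration is within the cap."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        # Duration comes from header metadata; no sample data is read yet.
        duration_s = wav.getnframes() / wav.getframerate()
        if max_duration_s is not None and duration_s > max_duration_s:
            raise ValueError(
                f"Audio exceeds maximum allowed duration of {max_duration_s}s "
                f"(metadata reports {duration_s:.1f}s)."
            )
        return wav.readframes(wav.getnframes())


def make_silent_wav(duration_s: float, sample_rate: int = 8000) -> bytes:
    """Build a mono 16-bit silent WAV in memory for the demonstration."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(duration_s * sample_rate))
    return buf.getvalue()
```

For example, `read_wav_with_duration_cap(make_silent_wav(2.0), max_duration_s=5)` returns the frames, while the same clip with `max_duration_s=1` raises `ValueError` before any frames are read.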