[ASR] Optimize CPU preproc to get 2.5x RTFx via multi-threading#44612
Conversation
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
|
|
||
|
|
||
| def _get_stt_preprocess_max_workers() -> int: | ||
| default_workers = max(1, min(os.cpu_count() or 1, 2)) |
There was a problem hiding this comment.
max thread = 2 was found optimal num of thread on this dataset after tuning on [1, 2, 4, 8, 16]. The audio libs themselves would be using multi-threading so we dont need very high thread here.
Even though we chose 2 thread here, we observe 2.5x which is > 2 thread. This is because earlier only the event loop was doing work which is just 1 thread but now the system has 1 event loop thread + 2 new thread so total 3 threads.
There was a problem hiding this comment.
2 threads RTFx: 1890 , Benchmark duration 78.45
1 thread RTFx: 1335.05
4 thread RTFx: 1739.89
8 thread RTFx: 1272.91
16 thread RTFx: 1292.97
NickLucche
left a comment
There was a problem hiding this comment.
Thanks for looking into this optimization @ekagra-ranjan !
I think this approach can work, although it would be nice to align methodology with how the rest of the items are preprocessed with Renderer @DarkLight1337
Also I wonder, did you measure any overhead for short audios?
| return cast(type[SupportsTranscription], model_cls) | ||
|
|
||
| def shutdown(self) -> None: | ||
| if (executor := getattr(self, "_preprocess_executor", None)) is not None: |
There was a problem hiding this comment.
when is the self._preprocess_executor attr not present?
There was a problem hiding this comment.
I was being too conservative in handling it defensively as in maybe the __init__() fails due to some reason and then later shutdown needs to happen?
There was a problem hiding this comment.
mm yeah I think we could avoid getattr if possibly
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
|
Thanks @NickLucche for the reviews! All of them make sense and I have updated the PR.
I'll have a look at the Render's methodology and report back.
Just checked on |
@NickLucche @DarkLight1337 - regarding the aligning to Renderer comment:
|
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
|
This pull request has merge conflicts that must be resolved before it can be |
|
Sharing the offline discussion on whether to reuse renderer worker pool here or not for posterity (will link this comment to a comment in code) I tried to use the thread pool from Renderer but get ~15% lower throughput than having a separate pool on front end:
I think this is because the renderer-num-worker was added mainly to unblock the asyncio main event loop to improve server responsiveness in earlier PR while giving some improvement for vision tasks. Parallelizing renderer work doenst help much but can block ASR front end work if they share the same pool. That PR mentions this: "The key improvement comes from offloading preprocessing off the event loop (so /health stays responsive), not from parallelizing it. Default of workers=1 is sufficient." For ASR, splitting the threads allocation unequally bw frontend and renderer work is 15% better than merging them into 1 queue. The last commit of my PR has the changes to reuse rendered threadpool in frontend (just in case you wanted to see it) Given this, should we keep a separate thread pool in frontend for ASR since it allows splitting thread resources unequally which appears to be beneficial for ASR workloads. |
d80ef14 to
4861687
Compare
This reverts commit c572b15. Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
DarkLight1337
left a comment
There was a problem hiding this comment.
Thanks for your patience!
|
Hi @ekagra-ranjan, the pre-commit checks have failed. Please run: uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-filesThen, commit the changes and push to your branch. For future commits, |
| return cast(type[SupportsTranscription], model_cls) | ||
|
|
||
| def shutdown(self) -> None: | ||
| if (executor := getattr(self, "_preprocess_executor", None)) is not None: |
There was a problem hiding this comment.
mm yeah I think we could avoid getattr if possibly
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
Currently, vLLM ASR's CPU preprocessing like audio loading and RMS chunking is a synchronous code which blocks the main eventloop of FastAPI server which leads to serial execution instead of batched processing. This bottleneck becomes much more painful when the system is processing long audio . Under high concurrency, it also leads to timeouts in
health/endpoint which makes it difficult to identify faulty nodes and makes auto scaling aggressive.This PR unblocks the CPU preproc sync processing by offloading them to a ThreadPool executor which frees the main eventloop of asyncio to keep running the server while the background thread load audio arrays and do the RMS chunk split.
openai/whisper-large-v3-turbohealth/endpoint and we see the latency go from 25s -> 0.06msAdds
VLLM_MAX_AUDIO_PREPROCESS_WORKERSas an env variable.Cmd for running benchmark (after #44587 is merged)
Long audio
On 1xH100:
Short audio (>80% samples are <10s)
On 1xH100:
However if we start vllm server with
VLLM_MAX_AUDIO_PREPROCESS_WORKERS=1then we get 1.7% higher RTFxTest
test_long_audio_wer_correctnessintroduced in #44587 passes and will gate this codepath on CI. That test is long audio + BS>1 + multi-threading + RMS chunk and is fast since that dataset is just 37MB so good for CI.