[Migration] Migrate GGUF quantization support to plugin#39612
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Documentation preview: https://vllm--39612.org.readthedocs.build/en/39612/
Code Review
This pull request removes hardcoded GGUF support from the core vLLM codebase and replaces it with a more extensible ModelFormatHandler architecture. The changes involve deleting GGUF-specific CUDA kernels, documentation, and tests, while refactoring model loaders and layers (Linear, MoE, Embedding) to use generic quantization configuration hooks. I have no feedback to provide.
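To illustrate the kind of extensibility the review summary describes, here is a minimal sketch of a handler registry. Only the name `ModelFormatHandler` comes from this PR; the `matches` method, `register_handler`, `resolve_handler`, and `GGUFHandler` are hypothetical and are not taken from the vLLM codebase.

```python
from typing import Dict, Optional


class ModelFormatHandler:
    """Hypothetical interface; the method name below is an assumption."""

    name = "base"

    def matches(self, model_path: str) -> bool:
        raise NotImplementedError


# Registry that core code could consult instead of hardcoding formats.
_HANDLERS: Dict[str, ModelFormatHandler] = {}


def register_handler(handler: ModelFormatHandler) -> None:
    """Called by a plugin at import time (hypothetical entry point)."""
    _HANDLERS[handler.name] = handler


def resolve_handler(model_path: str) -> Optional[ModelFormatHandler]:
    """Return the first registered handler that claims this path."""
    for handler in _HANDLERS.values():
        if handler.matches(model_path):
            return handler
    return None


class GGUFHandler(ModelFormatHandler):
    """Illustrative plugin-side handler keyed on the file extension."""

    name = "gguf"

    def matches(self, model_path: str) -> bool:
        return model_path.endswith(".gguf")


register_handler(GGUFHandler())
```

With a registry like this, the core package needs no format-specific code: a format is only recognized once a plugin has registered its handler.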
This pull request has merge conflicts that must be resolved before it can be
@mgoin The GGUF plugin tests pass now: https://buildkite.com/vllm/ci/builds/71683#019ebaa4-74dd-4462-9b0b-24cff7c5bbf3. Perhaps it's safe to merge this PR now, in time for the v0.23 release?
…ddings After vllm-project#39612, ParallelLMHead.tie_weights delegates to quant_method.tie_weights and QuantizeMethodBase.tie_weights raised NotImplementedError. Only UnquantizedEmbeddingMethod implemented it, so a model with tied word embeddings quantized via ModelOpt (NVFP4/FP8) -- or whose lm_head is excluded and gets UnquantizedLinearMethod -- crashed at load with NotImplementedError (e.g. the Gemma family). Default the base to sharing the weight tensor, matching the behavior ParallelLMHead.tie_weights had directly before vllm-project#39612; methods needing special handling (e.g. GGUF) still override. Fixes vllm-project#45543. Signed-off-by: Mike G <180722391+mikekg@users.noreply.github.com> (cherry picked from commit 9752054)
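The delegation described in this commit message can be sketched as follows. This is a simplified illustration in plain Python, not the vLLM implementation: the class names `QuantizeMethodBase` and the `tie_weights` method come from the message, while `Layer`, `SpecialFormatMethod`, the method signatures, and the use of plain lists in place of tensors are assumptions.

```python
class QuantizeMethodBase:
    """Stand-in for the base class named in the commit message."""

    def tie_weights(self, layer, embed_tokens):
        # Default behavior after the fix: share the embedding's weight
        # object with the LM head instead of raising NotImplementedError.
        layer.weight = embed_tokens.weight
        return layer


class SpecialFormatMethod(QuantizeMethodBase):
    """Hypothetical method that needs special handling, so it overrides."""

    def tie_weights(self, layer, embed_tokens):
        # Illustrative only: reuse the embedding layer itself.
        return embed_tokens


class Layer:
    """Hypothetical stand-in for an embedding or LM head layer."""

    def __init__(self, weight, quant_method):
        self.weight = weight
        self.quant_method = quant_method

    def tie_weights(self, embed_tokens):
        # Delegate to the quantization method, as the message describes.
        return self.quant_method.tie_weights(self, embed_tokens)
```

Because the base class supplies a working default, any quantization method that does not override `tie_weights` still ties the weights, and only methods with special requirements need their own implementation.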
…#39612) Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Purpose
After this PR, GGUF support is migrated to https://github.com/vllm-project/vllm-gguf-plugin. You can still use GGUF models normally after installing the plugin.
This is a draft PR used to assist with GGUF plugin development.
Test Plan
Test Result
All tests pass.
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.