BitsAndBytes

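vLLM now supports BitsAndBytes for more efficient model inference. BitsAndBytes quantizes models to reduce memory usage and improve performance without significantly sacrificing accuracy. Unlike other quantization methods, BitsAndBytes does not require calibrating the quantized model with input data.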

The steps below show how to use BitsAndBytes with vLLM.

$ pip install "bitsandbytes>=0.42.0"

vLLM reads the model's config file and supports both in-flight quantization and pre-quantized checkpoints.

You can find bitsandbytes-quantized models at https://huggingface.co/models?other=bitsandbytes. These repositories usually contain a config.json file that includes a quantization_config section.
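
As a rough illustration of what that section looks like, the sketch below parses a config.json string and reports whether it declares a bitsandbytes quantization_config. The helper name `bnb_quantization_config`, the sample config, and the `quant_method` and `load_in_4bit` keys are assumptions made for illustration; they follow the layout commonly seen in such repositories, and this is not how vLLM itself performs the check.

```python
import json


def bnb_quantization_config(config_text: str):
    """Return the quantization_config dict if it declares bitsandbytes, else None.

    Hypothetical helper: the "quant_method" key is an assumption about the
    config.json layout, not something this page specifies.
    """
    config = json.loads(config_text)
    quant = config.get("quantization_config")
    if isinstance(quant, dict) and quant.get("quant_method") == "bitsandbytes":
        return quant
    return None


# Made-up config.json content, for illustration only.
example_config = json.dumps({
    "model_type": "llama",
    "quantization_config": {
        "quant_method": "bitsandbytes",
        "load_in_4bit": True,
    },
})

quant = bnb_quantization_config(example_config)
print("pre-quantized checkpoint" if quant else "no bitsandbytes section found")
```

A checkpoint without this section would be a candidate for in-flight quantization instead.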

Reading a pre-quantized checkpoint

from vllm import LLM
import torch
# unsloth/tinyllama-bnb-4bit is a pre-quantized checkpoint.
model_id = "unsloth/tinyllama-bnb-4bit"
llm = LLM(model=model_id, dtype=torch.bfloat16, trust_remote_code=True,
          quantization="bitsandbytes", load_format="bitsandbytes")

In-flight quantization: loading with 4-bit quantization

from vllm import LLM
import torch
# huggyllama/llama-7b is not pre-quantized; it is quantized to 4 bits while loading.
model_id = "huggyllama/llama-7b"
llm = LLM(model=model_id, dtype=torch.bfloat16, trust_remote_code=True,
          quantization="bitsandbytes", load_format="bitsandbytes")
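
To give a feel for the memory reduction, here is a back-of-the-envelope estimate of weight storage at 16 bits versus 4 bits per parameter. It is only a sketch: the parameter count is an assumed round figure of 7 billion, the helper `weight_memory_gib` is hypothetical, and the estimate ignores quantization metadata (such as per-block scaling constants), activations, and the KV cache, so real usage will be higher.

```python
def weight_memory_gib(num_params: int, bits_per_param: int) -> float:
    """Approximate weight storage in GiB, counting only the raw parameter bits."""
    return num_params * bits_per_param / 8 / 2**30


# Assumed round figure for a "7B" model; the real count differs somewhat.
NUM_PARAMS = 7_000_000_000

bf16_gib = weight_memory_gib(NUM_PARAMS, 16)  # unquantized bfloat16 weights
int4_gib = weight_memory_gib(NUM_PARAMS, 4)   # 4-bit quantized weights

print(f"bfloat16 weights: {bf16_gib:.1f} GiB")
print(f"4-bit weights:    {int4_gib:.1f} GiB")
print(f"reduction factor: {bf16_gib / int4_gib:.1f}x")
```

Under these assumptions the weights alone shrink by a factor of four, which is the main source of the memory savings.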