LMDeploy 安装

pip install packaging
pip install flash-attn==2.4.2
# pip install /root/share/wheels/flash_attn-2.4.2+cu118torch2.0cxx11abiTRUE-cp310-cp310-linux_x86_64.whl
pip install 'lmdeploy[all]==v0.1.0'

LMDeploy架构

截屏2024-01-15 05.43.34.png

模型推理/服务：模型本身的推理
API服务：前端的后端，前端调用接口，返回json结果。
客户端：前端

基于TurboMind模型转换

TurboMind 是一款关于 LLM 推理的高效推理引擎，基于英伟达的 FasterTransformer 研发而成。它的主要功能包括：

LLaMa 结构模型的支持
persistent batch 推理模式
可扩展的 KV 缓存管理器。

在线转换

从huggingface.co直接转换，例如：

# 需要能访问 Huggingface 的网络环境
lmdeploy chat turbomind internlm/internlm-chat-20b-4bit --model-name internlm-chat-20b
lmdeploy chat turbomind Qwen/Qwen-7B-Chat --model-name qwen-7b

# huggingface走国内镜像
export HF_ENDPOINT=https://hf-mirror.com
export HF_HUB_ENABLE_HF_TRANSFER=1

运行 lmdeploy chat turbomind Qwen/Qwen-7B-Chat --model-name qwen-7b

首先会下载模型
然后如下：

model_source: hf_model
model_config:
{
  "model_name": "qwen-7b",
  "tensor_para_size": 1,
  "head_num": 32,
  "kv_head_num": 32,
  "vocab_size": 151936,
  "num_layer": 32,
  "inter_size": 11008,
  "norm_eps": 1e-06,
  "attn_bias": 1,
  "start_id": 0,
  "end_id": 151643,
  "session_len": 8200,
  "weight_type": "bf16",
  "rotary_embedding": 128,
  "rope_theta": 10000.0,
  "size_per_head": 128,
  "group_size": 0,
  "max_batch_size": 64,
  "max_context_token_num": 1,
  "step_length": 1,
  "cache_max_entry_count": 0.5,
  "cache_block_seq_len": 128,
  "cache_chunk_size": 1,
  "use_context_fmha": 1,
  "quant_policy": 0,
  "max_position_embeddings": 8192,
  "rope_scaling_factor": 0.0,
  "use_logn_attn": 1
}
get 323 model params
[WARNING] gemm_config.in is not found; using default GEMM algo                                                          
session 1
double enter to end input >>> 你是谁？

<|im_start|>user
你是谁？<|im_end|>
<|im_start|>assistant
 我是来自阿里云的大规模语言模型，我叫通义千问。

double enter to end input >>>