pip install packaging
pip install flash-attn==2.4.2
# alternatively, install from a local prebuilt wheel:
# pip install /root/share/wheels/flash_attn-2.4.2+cu118torch2.0cxx11abiTRUE-cp310-cp310-linux_x86_64.whl
pip install 'lmdeploy[all]==v0.1.0'
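After installing, a quick sanity check is to probe whether the packages are importable (a minimal sketch; note that flash-attn's importable module is named `flash_attn`):

```python
import importlib.util

def is_installed(name: str) -> bool:
    """Return True if a module with this name can be found on sys.path."""
    return importlib.util.find_spec(name) is not None

# flash-attn installs its module as flash_attn
for mod in ("lmdeploy", "flash_attn"):
    print(mod, "OK" if is_installed(mod) else "missing")
```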
TurboMind is an efficient inference engine for LLM inference, developed on top of NVIDIA's FasterTransformer. Its main capabilities include:
Converting models directly from huggingface.co, for example:
# requires network access to Hugging Face
lmdeploy chat turbomind internlm/internlm-chat-20b-4bit --model-name internlm-chat-20b
lmdeploy chat turbomind Qwen/Qwen-7B-Chat --model-name qwen-7b
# route Hugging Face traffic through a mirror in China
export HF_ENDPOINT=https://hf-mirror.com
export HF_HUB_ENABLE_HF_TRANSFER=1
Then run: lmdeploy chat turbomind Qwen/Qwen-7B-Chat --model-name qwen-7b
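The same mirror settings can also be applied from Python instead of the shell; they just need to be set before any Hugging Face library is imported, since `huggingface_hub` reads them at import time (a minimal sketch):

```python
import os

# HF_ENDPOINT redirects huggingface_hub downloads to the mirror;
# HF_HUB_ENABLE_HF_TRANSFER enables the faster hf_transfer download backend.
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

print(os.environ["HF_ENDPOINT"])  # https://hf-mirror.com
```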
model_source: hf_model
model_config:
{
"model_name": "qwen-7b",
"tensor_para_size": 1,
"head_num": 32,
"kv_head_num": 32,
"vocab_size": 151936,
"num_layer": 32,
"inter_size": 11008,
"norm_eps": 1e-06,
"attn_bias": 1,
"start_id": 0,
"end_id": 151643,
"session_len": 8200,
"weight_type": "bf16",
"rotary_embedding": 128,
"rope_theta": 10000.0,
"size_per_head": 128,
"group_size": 0,
"max_batch_size": 64,
"max_context_token_num": 1,
"step_length": 1,
"cache_max_entry_count": 0.5,
"cache_block_seq_len": 128,
"cache_chunk_size": 1,
"use_context_fmha": 1,
"quant_policy": 0,
"max_position_embeddings": 8192,
"rope_scaling_factor": 0.0,
"use_logn_attn": 1
}
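A few of the fields above determine the model geometry: the hidden size is head_num × size_per_head, and head_num == kv_head_num indicates standard multi-head attention rather than grouped-query attention. A quick check, using values copied from the config above:

```python
import json

# Subset of the model_config printed by the converter above
cfg = json.loads("""{
    "head_num": 32,
    "kv_head_num": 32,
    "size_per_head": 128,
    "num_layer": 32,
    "vocab_size": 151936
}""")

hidden_size = cfg["head_num"] * cfg["size_per_head"]
print("hidden size:", hidden_size)                        # 32 * 128 = 4096
print("plain MHA:", cfg["head_num"] == cfg["kv_head_num"])
```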
get 323 model params
[WARNING] gemm_config.in is not found; using default GEMM algo
session 1
double enter to end input >>> Who are you?
<|im_start|>user
Who are you?<|im_end|>
<|im_start|>assistant
I am a large-scale language model from Alibaba Cloud; my name is Tongyi Qianwen.
double enter to end input >>>