Download the model: https://huggingface.co/inclusionAI/Ling-mini-2.0-GGUF
Download the dedicated llama.cpp build: https://github.com/im0qianqian/llama.cpp/releases
load_backend: loaded RPC backend from /root/build/bin/libggml-rpc.so
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: none
load_backend: loaded Vulkan backend from /root/build/bin/libggml-vulkan.so
load_backend: loaded CPU backend from /root/build/bin/libggml-cpu-haswell.so
| model | size | params | backend | ngl | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | --------------: | -------------------: |
| bailingmoe2 16B.A1B Q4_K - Medium | 9.22 GiB | 16.26 B | RPC,Vulkan | 99 | 1024 | pp512 | 956.02 ± 10.17 |
| bailingmoe2 16B.A1B Q4_K - Medium | 9.22 GiB | 16.26 B | RPC,Vulkan | 99 | 1024 | tg128 | 187.39 ± 0.84 |
build: 58fb8dfc (6570)
Test with a longer generation length:
load_backend: loaded RPC backend from /root/build/bin/libggml-rpc.so
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: none
load_backend: loaded Vulkan backend from /root/build/bin/libggml-vulkan.so
load_backend: loaded CPU backend from /root/build/bin/libggml-cpu-haswell.so
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| bailingmoe2 16B.A1B Q4_K - Medium | 9.22 GiB | 16.26 B | RPC,Vulkan | 99 | pp512 | 956.25 ± 10.76 |
| bailingmoe2 16B.A1B Q4_K - Medium | 9.22 GiB | 16.26 B | RPC,Vulkan | 99 | tg1024 | 181.47 ± 1.52 |
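To compare the two runs numerically, the table rows can be parsed with a few lines of Python. This is only a minimal sketch: the rows are pasted in by hand from the output above, and the helper name `parse_rows` is made up for illustration and is not part of llama.cpp.

```python
import re

# Data rows copied by hand from the two llama-bench runs above.
BENCH_OUTPUT = """
| bailingmoe2 16B.A1B Q4_K - Medium | 9.22 GiB | 16.26 B | RPC,Vulkan | 99 | 1024 | pp512 | 956.02 ± 10.17 |
| bailingmoe2 16B.A1B Q4_K - Medium | 9.22 GiB | 16.26 B | RPC,Vulkan | 99 | 1024 | tg128 | 187.39 ± 0.84 |
| bailingmoe2 16B.A1B Q4_K - Medium | 9.22 GiB | 16.26 B | RPC,Vulkan | 99 | pp512 | 956.25 ± 10.76 |
| bailingmoe2 16B.A1B Q4_K - Medium | 9.22 GiB | 16.26 B | RPC,Vulkan | 99 | tg1024 | 181.47 ± 1.52 |
"""


def parse_rows(text):
    """Return (test, mean_tps, stddev) tuples from llama-bench table rows.

    Hypothetical helper: it assumes the last two columns are `test` and
    `t/s`, as in the tables above.
    """
    results = []
    for line in text.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) < 2:
            continue
        # The t/s cell looks like "187.39 ± 0.84"; header and separator rows do not match.
        match = re.fullmatch(r"([\d.]+)\s*±\s*([\d.]+)", cells[-1])
        if match:
            results.append((cells[-2], float(match.group(1)), float(match.group(2))))
    return results


results = parse_rows(BENCH_OUTPUT)
tps = {test: mean for test, mean, _ in results}

# Relative drop in generation speed when going from 128 to 1024 output tokens.
drop_pct = (tps["tg128"] - tps["tg1024"]) / tps["tg128"] * 100
print(f"tg128 -> tg1024 slowdown: {drop_pct:.1f}%")  # prints: tg128 -> tg1024 slowdown: 3.2%
```

In these two runs, generation is only about 3% slower at 1024 output tokens than at 128, while prompt processing (pp512) is essentially unchanged.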
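Generation reaches about 180 tokens/s, which is very fast. Ling-mini-2.0 is compact yet capable: it has 16 billion total parameters but activates only 1.4 billion per input token (789 million excluding embeddings).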
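As a rough back-of-the-envelope check, the figures above can be turned into an active-parameter fraction and a per-token latency. The sketch below only uses numbers quoted in this post (16.26 B total parameters from the table, 1.4 B active parameters, and the tg1024 result); the variable names are illustrative.

```python
# Figures taken from the benchmark table and the model description above.
TOTAL_PARAMS = 16.26e9     # "params" column of the llama-bench table
ACTIVE_PARAMS = 1.4e9      # parameters activated per input token
TG_TOKENS_PER_S = 181.47   # tg1024 generation speed

# Share of the model's weights that take part in each token.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

# Average wall-clock time spent per generated token.
ms_per_token = 1000.0 / TG_TOKENS_PER_S

print(f"active fraction: {active_fraction:.1%}")
print(f"time per token: {ms_per_token:.2f} ms")
```

So only about 9% of the weights are used for each token, and one generated token takes roughly 5.5 ms in this run.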
Original article. If you repost it, please credit the source: 诺德美地科技
Permalink: Ling-mini-2.0-GGUF speed test on the MI50