Versions:
llama: b6550
ROCm: 6.3.4
The Vulkan build was downloaded from the official GitHub releases; the ROCm build was compiled locally.
Model loaded:
unsloth_Qwen3-Coder-30B-A3B-Instruct-GGUF_Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL
Vulkan build:
load_backend: loaded RPC backend from /root/build/bin/libggml-rpc.so
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: none
load_backend: loaded Vulkan backend from /root/build/bin/libggml-vulkan.so
load_backend: loaded CPU backend from /root/build/bin/libggml-cpu-haswell.so
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q4_K - Medium | 16.45 GiB | 30.53 B | RPC,Vulkan | 99 | pp512 | 380.79 ± 1.39 |
| qwen3moe 30B.A3B Q4_K - Medium | 16.45 GiB | 30.53 B | RPC,Vulkan | 99 | tg128 | 89.86 ± 0.46 |
ROCm build:
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon Graphics, gfx906:sramecc-:xnack- (0x906), VMM: no, Wave Size: 64
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q4_K - Medium | 16.45 GiB | 30.53 B | ROCm | 99 | pp512 | 1041.98 ± 4.16 |
| qwen3moe 30B.A3B Q4_K - Medium | 16.45 GiB | 30.53 B | ROCm | 99 | tg128 | 68.19 ± 0.03 |
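To compare the two result tables above without reading the numbers by eye, the markdown that llama-bench prints can be parsed directly. The following is a minimal sketch, not part of the original test setup: the helper name `parse_bench_table` is hypothetical, and it assumes the last two columns are always `test` and `t/s` as in the output shown here.

```python
import re

def parse_bench_table(text):
    """Return {test_name: (mean_tps, stddev_tps)} from llama-bench markdown output.

    Assumption: the last column is "t/s" in the form "mean ± stddev" and the
    column before it is the test name, as in the tables in this post.
    """
    results = {}
    for line in text.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) < 2:
            continue  # blank lines and log lines without table cells
        m = re.fullmatch(r"([0-9.]+)\s*±\s*([0-9.]+)", cells[-1])
        if m:  # header and separator rows do not match and are skipped
            results[cells[-2]] = (float(m.group(1)), float(m.group(2)))
    return results

vulkan = parse_bench_table("""
| model | size | params | backend | ngl | test | t/s |
| --- | ---: | ---: | --- | --: | ---: | ---: |
| qwen3moe 30B.A3B Q4_K - Medium | 16.45 GiB | 30.53 B | RPC,Vulkan | 99 | pp512 | 380.79 ± 1.39 |
| qwen3moe 30B.A3B Q4_K - Medium | 16.45 GiB | 30.53 B | RPC,Vulkan | 99 | tg128 | 89.86 ± 0.46 |
""")

rocm = parse_bench_table("""
| model | size | params | backend | ngl | test | t/s |
| --- | ---: | ---: | --- | --: | ---: | ---: |
| qwen3moe 30B.A3B Q4_K - Medium | 16.45 GiB | 30.53 B | ROCm | 99 | pp512 | 1041.98 ± 4.16 |
| qwen3moe 30B.A3B Q4_K - Medium | 16.45 GiB | 30.53 B | ROCm | 99 | tg128 | 68.19 ± 0.03 |
""")

# Ratio > 1 means ROCm is faster on that test, < 1 means Vulkan is faster.
ratios = {test: rocm[test][0] / vulkan[test][0] for test in vulkan}
for test, ratio in ratios.items():
    print(f"{test}: ROCm/Vulkan = {ratio:.2f}x")
```

With the figures above, the ratio comes out well above 1 for pp512 and below 1 for tg128, which is the split the rest of this post looks at.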
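In the tables above:
| Test | Meaning | What it measures | Related metrics | What the user experiences |
| --- | --- | --- | --- | --- |
| pp512 | Prompt Processing (512 tokens) | Speed of processing the input prompt | Input-side latency, throughput | Time waiting for the model to "start thinking" |
| tg128 | Token Generation (128 tokens) | Speed of generating output text | Output-side throughput, latency | Time waiting for the answer to finish printing |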
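Because AI coding prompts with a Coder model call for longer outputs, I raised the generation length and re-ran the benchmark to measure sustained output throughput (tg1024 instead of tg128):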
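The post does not show the exact command line, so the following is only a sketch of how such a run could be launched. It assumes llama-bench's `-m`, `-ngl`, `-p`, and `-n` options (model path, GPU layers, prompt tokens, generated tokens); the helper name `llama_bench_cmd` and the model path are placeholders, not values from the original runs.

```python
import shlex

def llama_bench_cmd(model_path, n_prompt=512, n_gen=1024, n_gpu_layers=99,
                    binary="llama-bench"):
    """Build an argument list for a pp<n_prompt> / tg<n_gen> benchmark run."""
    return [
        binary,
        "-m", model_path,            # GGUF model file (placeholder path below)
        "-ngl", str(n_gpu_layers),   # offload all layers, matching ngl = 99 in the tables
        "-p", str(n_prompt),         # prompt-processing test size (pp512)
        "-n", str(n_gen),            # token-generation test size (tg1024)
    ]

cmd = llama_bench_cmd("/path/to/model.gguf")
print(shlex.join(cmd))
```

The printed line can be pasted into a shell, or the list can be passed to `subprocess.run(cmd, check=True)` on a machine where the binary is on `PATH`.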
Vulkan:
load_backend: loaded RPC backend from /root/build/bin/libggml-rpc.so
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: none
load_backend: loaded Vulkan backend from /root/build/bin/libggml-vulkan.so
load_backend: loaded CPU backend from /root/build/bin/libggml-cpu-haswell.so
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q4_K - Medium | 16.45 GiB | 30.53 B | RPC,Vulkan | 99 | pp512 | 380.60 ± 1.46 |
| qwen3moe 30B.A3B Q4_K - Medium | 16.45 GiB | 30.53 B | RPC,Vulkan | 99 | tg1024 | 83.47 ± 0.19 |
ROCm:
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon Graphics, gfx906:sramecc-:xnack- (0x906), VMM: no, Wave Size: 64
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q4_K - Medium | 16.45 GiB | 30.53 B | ROCm | 99 | pp512 | 1041.47 ± 3.92 |
| qwen3moe 30B.A3B Q4_K - Medium | 16.45 GiB | 30.53 B | ROCm | 99 | tg1024 | 59.49 ± 0.06 |
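After repeated runs, the result was unexpected: the Vulkan backend is actually faster at token generation (89.86 vs 68.19 t/s at tg128, 83.47 vs 59.49 t/s at tg1024), even though ROCm is far faster at prompt processing (about 1041 vs 381 t/s at pp512).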
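To see what these numbers mean for a whole request, a rough estimate is total time ≈ prompt tokens / pp rate + output tokens / tg rate. The sketch below applies that to the pp512/tg1024 figures measured above. It is a back-of-the-envelope model only: it assumes the rates stay constant regardless of context length, which real workloads will not match exactly, and the function names are illustrative.

```python
def request_seconds(n_prompt, n_gen, pp_tps, tg_tps):
    """Estimated wall time: prompt processing plus token generation."""
    return n_prompt / pp_tps + n_gen / tg_tps

# (pp512 t/s, tg1024 t/s) taken from the second pair of tables in this post.
VULKAN = (380.60, 83.47)
ROCM = (1041.47, 59.49)

vulkan_total = request_seconds(512, 1024, *VULKAN)
rocm_total = request_seconds(512, 1024, *ROCM)
print(f"Vulkan: {vulkan_total:.1f} s, ROCm: {rocm_total:.1f} s")

def break_even_output_ratio(fast_tg, fast_pp):
    """Output/prompt token ratio above which the faster-generating backend wins.

    fast_tg: (pp, tg) of the backend with higher tg; fast_pp: (pp, tg) of the
    backend with higher pp. Derived by setting both total times equal.
    """
    pp_a, tg_a = fast_tg
    pp_b, tg_b = fast_pp
    return (1 / pp_a - 1 / pp_b) / (1 / tg_b - 1 / tg_a)

ratio = break_even_output_ratio(VULKAN, ROCM)
print(f"Vulkan is faster overall when output tokens > {ratio:.2f} x prompt tokens")
```

Under this simple model, Vulkan finishes a 512-in / 1024-out request sooner, while prompt-heavy requests with short answers would favor ROCm.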
Original article. If you repost it, please credit the source: 诺德美地科技