Abstract
This field note consolidates public AMD Instinct MI50 benchmark observations with a local 3×MI50 llama.cpp run. The objective is to document practical performance behavior, compare power-limit and clock effects under the V420.rom VBIOS, and establish a baseline for future local AI workload testing.
The external reference comes from a public MI50 32GB VBIOS note that compares V420.rom benchmark behavior across power caps, SCLK/MCLK settings, and LLM throughput. The local benchmark was performed on a 3×MI50/gfx906 ROCm system using Qwen2.5-Coder-32B-Instruct in Q4_K_M GGUF format.
This is not a strict apples-to-apples comparison. The external reference uses 4×MI50 with gpt-oss:120b through Ollama, while the local run uses 3×MI50 with Qwen2.5-Coder-32B-Instruct through llama.cpp. The value of the comparison is operational: it highlights what to measure, what configuration variables matter, and where future controlled tests should focus.
Benchmark Context
The public benchmark table was designed to check stability under different overclocks and power caps. The source also notes that power limits and thermals can hold the card back. All listed tests were performed with `rocm-smi --setperflevel high`, after warm-up, on a Ryzen 9 5950X test system.
The external LLM test used the following configuration:
| Item | External Reference |
|---|---|
| GPU | 4× AMD Instinct MI50 |
| ROM | V420.rom |
| Workload | gpt-oss:120b |
| Runtime | Ollama ROCm container |
| Context | 32768 |
| KV cache | q8_0 |
| Flash attention | Enabled |
| Output metrics | Prompt processing rate and token generation rate |
Extracted V420.rom Results
The table below extracts the average values from the public benchmark data.
| ROM / Power Cap | SCLK / MCLK | Avg FPS | Avg pp/s | Avg tg/s |
|---|---|---|---|---|
| V420.rom @ 178W | 1800 / 1000 | 72.65 | 2562.27 | 31.52 |
| V420.rom @ 178W | 1800 / 1180 | 72.98 | 2623.45 | 33.01 |
| V420.rom @ 178W | 2000 / 1000 | 73.46 | 2679.17 | 33.20 |
| V420.rom @ 178W | 2000 / 1180 | 74.01 | 2810.14 | 34.86 |
| V420.rom @ 225W | 2000 / 1000 | N/A | 2683.52 | 33.29 |
| V420.rom @ 225W | 2000 / 1180 | 78.51 | 2796.71 | 34.78 |
| V420.rom @ 300W | 2000 / 1000 | N/A | 2682.30 | 33.24 |
| V420.rom @ 300W | 2000 / 1150 | 80.73 | 2764.77 | 34.60 |
| V420.rom @ 300W | 2000 / 1180 | 80.85 | 2802.37 | 34.85 |
Key Observations from the External Benchmark
The strongest LLM generation result in the extracted table was 34.86 tg/s at 178W with 2000/1180 MHz clocks. The 300W run at the same clocks reached 34.85 tg/s, a difference of well under 0.1%. This suggests that, for this specific workload, increasing the power cap alone did not materially improve LLM token generation once clock and memory settings were already favorable.
Memory frequency appears more important than raw power cap in several rows. At 178W and 2000 MHz SCLK, increasing MCLK from 1000 to 1180 MHz improved average generation throughput from 33.20 tg/s to 34.86 tg/s, an improvement of roughly 5%.
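Game FPS behaved differently. The Cyberpunk 2077 result improved from 74.01 FPS at 178W / 2000/1180 to 80.85 FPS at 300W / 2000/1180, a gain of roughly 9%. This indicates that game rendering and LLM token generation do not stress the card in the same way: the game benefits from the higher power cap, while LLM generation throughput stays essentially flat.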
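The percentages quoted in these observations can be re-derived from the extracted table. The following is a minimal sketch in Python; the helper name `pct_change` is illustrative and is not part of any tool used by the source.

```python
def pct_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


# Values copied from the extracted V420.rom table (2000 MHz SCLK rows).
tg_178w_mclk1000 = 33.20   # tg/s at 178W, 2000/1000
tg_178w_mclk1180 = 34.86   # tg/s at 178W, 2000/1180
tg_300w_mclk1180 = 34.85   # tg/s at 300W, 2000/1180
fps_178w_mclk1180 = 74.01  # FPS at 178W, 2000/1180
fps_300w_mclk1180 = 80.85  # FPS at 300W, 2000/1180

# Effect of raising MCLK at a fixed 178W power cap.
mclk_gain_tg = pct_change(tg_178w_mclk1000, tg_178w_mclk1180)

# Effect of raising the power cap at fixed 2000/1180 clocks.
power_gain_tg = pct_change(tg_178w_mclk1180, tg_300w_mclk1180)
power_gain_fps = pct_change(fps_178w_mclk1180, fps_300w_mclk1180)

print(f"MCLK 1000 -> 1180 at 178W: {mclk_gain_tg:+.2f}% tg/s")
print(f"178W -> 300W at 2000/1180: {power_gain_tg:+.2f}% tg/s")
print(f"178W -> 300W at 2000/1180: {power_gain_fps:+.2f}% FPS")
```

Keeping the comparison to rows that differ in exactly one setting is what makes the MCLK and power-cap effects separable.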
Local Benchmark Baseline
The local benchmark was performed using a 3×MI50 ROCm setup and llama.cpp.
| Item | Local Run |
|---|---|
| GPUs | 3× AMD Instinct MI50 / gfx906 |
| Runtime | llama.cpp / llama-server |
| Model | Qwen2.5-Coder-32B-Instruct |
| Quantization | Q4_K_M GGUF |
| Context | 4096 |
| Model size | 18.48 GiB |
| Parameters | 32.76B |
| GPU offload | 65 / 65 layers |
| Prompt eval | 313.68 ms / 35 tokens |
| Prompt throughput | 111.58 tokens/s |
| Decode eval | 183.36 ms / 4 tokens |
| Decode throughput | 21.82 tokens/s |
| Total | 497.03 ms / 39 tokens |
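The throughput rows follow directly from the reported timings: tokens divided by elapsed seconds. A minimal sketch of that arithmetic, assuming the millisecond and token counts in the table are the raw values (the function name `tokens_per_second` is illustrative):

```python
def tokens_per_second(elapsed_ms: float, tokens: int) -> float:
    """Convert an elapsed time in milliseconds and a token count to tokens/s."""
    return tokens / (elapsed_ms / 1000.0)


# Timings copied from the local run table.
prompt_tps = tokens_per_second(313.68, 35)  # prompt evaluation
decode_tps = tokens_per_second(183.36, 4)   # decode (generation)

print(f"prompt: {prompt_tps:.2f} tokens/s")
print(f"decode: {decode_tps:.2f} tokens/s")
```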
Interpretation
The local run confirms that the model loads correctly, that all 65 of 65 layers are offloaded to GPU, and that the stack is operational. The prompt evaluation rate of 111.58 tokens/s is a useful positive signal for prompt ingestion on the current setup, although it was measured on a short 35-token prompt.
The decode result of 21.82 tokens/s should be treated carefully because it was measured over only four generated tokens. This is too short to represent stable long-generation throughput. A better future benchmark should use longer output, repeated runs, warm-up, and consistent logging of GPU clocks, power draw, temperature, and memory usage.
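One way to turn repeated runs into a reportable figure is to discard the warm-up runs and summarize the rest with a mean and a standard deviation. The sketch below assumes the per-run tg/s values have already been collected by hand or by a script; `summarize_runs` is a hypothetical helper, and the sample values are placeholders, not measurements.

```python
import statistics


def summarize_runs(tg_rates: list[float], warmup: int = 1) -> dict:
    """Summarize per-run throughput after dropping the first `warmup` runs."""
    measured = tg_rates[warmup:]
    if len(measured) < 2:
        raise ValueError("need at least two measured runs after warm-up")
    mean = statistics.mean(measured)
    stdev = statistics.stdev(measured)  # sample standard deviation
    return {
        "runs": len(measured),
        "mean": mean,
        "stdev": stdev,
        "cv_pct": stdev / mean * 100.0,  # run-to-run spread in percent
    }


# Placeholder tg/s values for illustration only (NOT measured data):
# one warm-up run followed by five measured runs.
sample = [19.0, 21.0, 22.0, 21.0, 22.0, 21.0]
summary = summarize_runs(sample, warmup=1)
print(summary)
```

Reporting the spread alongside the mean makes it clear whether a later change in clocks or power cap is larger than normal run-to-run variation.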
Practical Benefits
This benchmark note provides four practical benefits:
- It documents a working MI50 local inference baseline.
- It separates external reference behavior from local results.
- It shows that memory clock may matter more for LLM throughput than simply raising the power cap.
- It gives a repeatable structure for future tests on the same server.
Recommended Next Benchmark Plan
A stronger next run should use the same model and runtime across all tests:
| Test Area | Recommended Method |
|---|---|
| Runtime | llama.cpp only |
| Model | Same GGUF across all tests |
| Prompt | Fixed prompt with fixed token length |
| Output | At least 256 generated tokens |
| Runs | 5 runs after warm-up |
| Metrics | pp/s, tg/s, total time, GPU temperature, SCLK, MCLK, power |
| Comparison | Same VBIOS and same GPU count before changing one variable |
The most important rule is to change only one variable at a time: model, quantization, context size, GPU count, clocks, or VBIOS.
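The one-variable-at-a-time rule can be enforced mechanically by generating each test configuration from a fixed baseline. The following is a sketch under stated assumptions: the baseline values come from the local run table, while the candidate values and the helper name `one_factor_variants` are hypothetical.

```python
def one_factor_variants(baseline: dict, candidates: dict) -> list[tuple[str, dict]]:
    """Return (changed_key, config) pairs that differ from baseline in exactly one key."""
    variants = []
    for key, values in candidates.items():
        for value in values:
            if value == baseline[key]:
                continue  # identical to the baseline, nothing to test
            config = dict(baseline)  # copy so the baseline stays untouched
            config[key] = value
            variants.append((key, config))
    return variants


# Baseline taken from the local run table.
baseline = {
    "model": "Qwen2.5-Coder-32B-Instruct",
    "quantization": "Q4_K_M",
    "context": 4096,
    "gpu_count": 3,
}

# Hypothetical candidate values, chosen only to illustrate the structure.
candidates = {
    "context": [4096, 8192],
    "gpu_count": [2, 3],
}

variants = one_factor_variants(baseline, candidates)
for changed, config in variants:
    print(changed, config)
```

Each generated configuration can then be run against the same prompt and output length, so any throughput difference is attributable to the single changed setting.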
Conclusion
The AMD Instinct MI50 remains a useful low-cost accelerator for local AI workloads when VBIOS, ReBAR, ROCm, cooling, and runtime configuration are handled carefully. The public V420.rom data shows that LLM throughput can remain nearly flat across 178W to 300W when clocks are already favorable, while memory frequency improvements can have a clearer effect.
The local 3×MI50 llama.cpp run confirms functional multi-GPU offload with Qwen2.5-Coder-32B-Instruct Q4_K_M, reaching 111.58 tokens/s in prompt evaluation and 21.82 tokens/s in a short decode sample. The result is a valid operational baseline, not a final performance ceiling.
Future testing should use longer generations, multiple runs, and consistent telemetry. That will make it possible to determine whether the next improvement should come from VBIOS tuning, clock configuration, quantization choice, model selection, or runtime parameters.
References
- evilJazz, AMD Instinct MI50 32GB VBIOS, GitHub Gist.
- Local benchmark notes from the current 3×MI50 llama.cpp setup.