AMD Instinct MI50 Benchmark Notes: VBIOS, Power Limits, and Local LLM Throughput

AMD Instinct MI50 benchmark summary

Abstract

This field note consolidates public AMD Instinct MI50 benchmark observations with a local 3×MI50 llama.cpp run. The objective is to document practical performance behavior, compare VBIOS and power-limit effects, and establish a baseline for future local AI workload testing.

The external reference comes from a public MI50 32GB VBIOS note that compares V420.rom benchmark behavior across power caps, SCLK/MCLK settings, and LLM throughput. The local benchmark was performed on a 3×MI50/gfx906 ROCm system using Qwen2.5-Coder-32B-Instruct in Q4_K_M GGUF format.

This is not a strict apples-to-apples comparison. The external reference uses 4×MI50 with gpt-oss:120b through Ollama, while the local run uses 3×MI50 with Qwen2.5-Coder-32B-Instruct through llama.cpp. The value of the comparison is operational: it highlights what to measure, what configuration variables matter, and where future controlled tests should focus.

Benchmark Context

The public benchmark table was designed to check stability under different overclocks and power caps. The source also notes that power limits and thermals can hold the card back. All listed tests were performed with `rocm-smi --setperflevel high`, after warm-up, on a Ryzen 9 5950X test system.

The external LLM test used the following configuration:

Item	External Reference
GPU	4× AMD Instinct MI50
ROM	V420.rom
Workload	`gpt-oss:120b`
Runtime	Ollama ROCm container
Context	32768
KV cache	`q8_0`
Flash attention	Enabled
Output metrics	Prompt processing rate and token generation rate

Extracted V420.rom Results

The table below extracts the average values from the public benchmark data.

ROM / Power Cap	SCLK / MCLK (MHz)	Avg FPS	Avg pp/s	Avg tg/s
V420.rom @ 178W	1800 / 1000	72.65	2562.27	31.52
V420.rom @ 178W	1800 / 1180	72.98	2623.45	33.01
V420.rom @ 178W	2000 / 1000	73.46	2679.17	33.20
V420.rom @ 178W	2000 / 1180	74.01	2810.14	34.86
V420.rom @ 225W	2000 / 1000	N/A	2683.52	33.29
V420.rom @ 225W	2000 / 1180	78.51	2796.71	34.78
V420.rom @ 300W	2000 / 1000	N/A	2682.30	33.24
V420.rom @ 300W	2000 / 1150	80.73	2764.77	34.60
V420.rom @ 300W	2000 / 1180	80.85	2802.37	34.85

V420 generation throughput chart

Key Observations from the External Benchmark

The strongest LLM generation result in the extracted table was 34.86 tg/s at 178W with 2000/1180 MHz (SCLK/MCLK). A near-identical 34.85 tg/s appeared at 300W with the same clocks, and the 225W row reached 34.78 tg/s. This suggests that, for this specific workload, raising the power cap alone did not materially improve token generation once core and memory clocks were already favorable.

Memory frequency appears more important than raw power cap in several rows. At 178W and 2000 MHz SCLK, increasing MCLK from 1000 to 1180 MHz improved average generation throughput from 33.20 tg/s to 34.86 tg/s, an improvement of roughly 5%.
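The same MCLK comparison can be repeated for every row pair in the extracted table. The following is a minimal sketch, assuming the table values are transcribed correctly; `ROWS` and `mclk_uplift` are hypothetical names introduced here for illustration and are not part of the source or of any ROCm tool.

```python
# Rows copied from the extracted V420.rom table:
# (power cap W, SCLK MHz, MCLK MHz, avg pp/s, avg tg/s).
# Only the 1000 MHz and 1180 MHz MCLK rows are included.
ROWS = [
    (178, 1800, 1000, 2562.27, 31.52),
    (178, 1800, 1180, 2623.45, 33.01),
    (178, 2000, 1000, 2679.17, 33.20),
    (178, 2000, 1180, 2810.14, 34.86),
    (225, 2000, 1000, 2683.52, 33.29),
    (225, 2000, 1180, 2796.71, 34.78),
    (300, 2000, 1000, 2682.30, 33.24),
    (300, 2000, 1180, 2802.37, 34.85),
]


def mclk_uplift(rows, low=1000, high=1180):
    """Return {(cap, sclk): (pp uplift %, tg uplift %)} for low -> high MCLK."""
    by_key = {(cap, sclk, mclk): (pp, tg) for cap, sclk, mclk, pp, tg in rows}
    result = {}
    for (cap, sclk, mclk), (pp_lo, tg_lo) in by_key.items():
        if mclk != low:
            continue
        hi = by_key.get((cap, sclk, high))
        if hi is None:
            continue  # no matching high-MCLK row for this cap and SCLK
        pp_hi, tg_hi = hi
        result[(cap, sclk)] = (
            (pp_hi - pp_lo) / pp_lo * 100.0,
            (tg_hi - tg_lo) / tg_lo * 100.0,
        )
    return result


uplift = mclk_uplift(ROWS)
for (cap, sclk), (pp_pct, tg_pct) in sorted(uplift.items()):
    print(f"{cap}W, SCLK {sclk}: pp/s {pp_pct:+.2f}%, tg/s {tg_pct:+.2f}%")
```

Grouping by power cap and SCLK keeps each comparison to a single changed variable, so the printed percentages isolate the memory clock effect within the limits of the published averages.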

Game FPS behaved differently. The Cyberpunk 2077 result improved from 74.01 FPS at 178W / 2000/1180 to 80.85 FPS at 300W / 2000/1180, a gain of roughly 9%, while token generation stayed essentially flat between the same two rows. This indicates that game and LLM workloads do not stress the card in the same way.
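To make the contrast between FPS and token generation concrete, the sketch below compares both metrics against the 178W row at 2000/1180 MHz. It is an illustration only, assuming the table values are transcribed correctly; `AT_2000_1180` and `pct_change` are hypothetical names.

```python
# Avg FPS and avg tg/s at SCLK/MCLK 2000/1180 MHz, keyed by power cap in watts.
# Values are copied from the extracted V420.rom table.
AT_2000_1180 = {
    178: {"fps": 74.01, "tg": 34.86},
    225: {"fps": 78.51, "tg": 34.78},
    300: {"fps": 80.85, "tg": 34.85},
}


def pct_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


baseline = AT_2000_1180[178]
power_cap_effect = {}
for cap in (225, 300):
    row = AT_2000_1180[cap]
    power_cap_effect[cap] = {
        "fps": pct_change(baseline["fps"], row["fps"]),
        "tg": pct_change(baseline["tg"], row["tg"]),
    }
    print(
        f"178W -> {cap}W: FPS {power_cap_effect[cap]['fps']:+.2f}%, "
        f"tg/s {power_cap_effect[cap]['tg']:+.2f}%"
    )
```

With these inputs, the FPS change is several percent at each step while the tg/s change stays within a fraction of a percent, which is the pattern described above.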

Local Benchmark Baseline

The local benchmark was performed using a 3×MI50 ROCm setup and llama.cpp.

Local llama.cpp run

Item	Local Run
GPUs	3× AMD Instinct MI50 / gfx906
Runtime	`llama.cpp` / `llama-server`
Model	`Qwen2.5-Coder-32B-Instruct`
Quantization	`Q4_K_M` GGUF
Context	4096
Model size	18.48 GiB
Parameters	32.76B
GPU offload	65 / 65 layers
Prompt eval	313.68 ms / 35 tokens
Prompt throughput	111.58 tokens/s
Decode eval	183.36 ms / 4 tokens
Decode throughput	21.82 tokens/s
Total	497.03 ms / 39 tokens
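The throughput rows in the table above follow directly from the timing rows. As a cross-check, the sketch below recomputes tokens per second from strings shaped like `313.68 ms / 35 tokens`. The function name `tokens_per_second` and the exact string format are assumptions based on how the timings are written in this note, not a documented llama.cpp interface.

```python
import re

# Matches timings written as "<milliseconds> ms / <count> tokens".
_TIMING = re.compile(r"(?P<ms>\d+(?:\.\d+)?)\s*ms\s*/\s*(?P<tokens>\d+)\s*tokens")


def tokens_per_second(timing):
    """Convert a '<ms> ms / <n> tokens' string into tokens per second."""
    match = _TIMING.search(timing)
    if match is None:
        raise ValueError(f"unrecognized timing string: {timing!r}")
    elapsed_s = float(match["ms"]) / 1000.0
    if elapsed_s <= 0:
        raise ValueError("elapsed time must be positive")
    return int(match["tokens"]) / elapsed_s


prompt_rate = tokens_per_second("313.68 ms / 35 tokens")
decode_rate = tokens_per_second("183.36 ms / 4 tokens")
print(f"prompt: {prompt_rate:.2f} tokens/s")
print(f"decode: {decode_rate:.2f} tokens/s")
```

Both recomputed rates agree with the reported 111.58 and 21.82 tokens/s after rounding to two decimals.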

Interpretation

The local run confirms that the model loads correctly, that all 65 of 65 layers are offloaded to GPU, and that the stack is operational. The prompt evaluation rate of 111.58 tokens/s is an encouraging sign for prompt ingestion on the current setup, although it was measured on a 35-token prompt and may not reflect throughput on longer prompts.

The decode result of 21.82 tokens/s should be treated carefully because it was measured over only four generated tokens. This is too short to represent stable long-generation throughput. A better future benchmark should use longer output, repeated runs, warm-up, and consistent logging of GPU clocks, power draw, temperature, and memory usage.
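A rough way to see why four tokens is too short is to ask how much a small fixed timing error would distort the reported rate at different output lengths. The sketch below is a simplified model, not a measurement: it assumes a constant per-token decode time taken from the local run and a hypothetical 10 ms fixed overhead, and `measured_rate` is an illustrative helper.

```python
# Per-token decode time implied by the local run: 183.36 ms over 4 tokens.
PER_TOKEN_S = 0.18336 / 4

# Hypothetical fixed overhead per measurement (assumption, not measured).
OVERHEAD_S = 0.010


def measured_rate(n_tokens, per_token_s, overhead_s):
    """Tokens per second when a fixed overhead is added to the decode time."""
    return n_tokens / (n_tokens * per_token_s + overhead_s)


true_rate = 1.0 / PER_TOKEN_S
error_pct = {}
for n in (4, 64, 256):
    rate = measured_rate(n, PER_TOKEN_S, OVERHEAD_S)
    error_pct[n] = (rate - true_rate) / true_rate * 100.0
    print(f"{n:>3} tokens: {rate:.2f} tokens/s ({error_pct[n]:+.2f}% vs. no overhead)")
```

Under these assumptions the same 10 ms shifts the four-token result by about 5% but the 256-token result by less than 0.1%, which is why longer generations give a steadier figure.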

Practical Benefits

This benchmark note provides four practical benefits:

It documents a working MI50 local inference baseline.
It separates external reference behavior from local results.
It shows that memory clock may matter more for LLM throughput than simply raising the power cap; VBIOS effects remain to be tested, since the extracted data covers only V420.rom.
It gives a repeatable structure for future tests on the same server.

Recommended Next Benchmark Plan

A stronger next run should use the same model and runtime across all tests:

Test Area	Recommended Method
Runtime	`llama.cpp` only
Model	Same GGUF across all tests
Prompt	Fixed prompt with fixed token length
Output	At least 256 generated tokens
Runs	5 runs after warm-up
Metrics	pp/s, tg/s, total time, GPU temperature, SCLK, MCLK, power
Comparison	Same VBIOS and same GPU count before changing one variable
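The "5 runs after warm-up" row can be turned into a small aggregation step. The sketch below is one possible way to summarize repeated tg/s samples while discarding warm-up runs; `summarize_runs` is a hypothetical helper, and the sample values are made-up placeholders for illustration, not measured results.

```python
from statistics import mean, stdev


def summarize_runs(samples, warmup=1):
    """Summarize throughput samples after dropping the first `warmup` runs."""
    kept = list(samples)[warmup:]
    if len(kept) < 2:
        raise ValueError("need at least two runs after warm-up")
    return {
        "runs": len(kept),
        "mean": mean(kept),
        "stdev": stdev(kept),  # sample standard deviation
        "min": min(kept),
        "max": max(kept),
    }


# Placeholder tg/s values (NOT real measurements): one warm-up run, then five runs.
placeholder_tg = [18.0, 21.0, 22.0, 23.0, 22.0, 22.0]

summary = summarize_runs(placeholder_tg, warmup=1)
print(
    f"{summary['runs']} runs: mean {summary['mean']:.2f} tg/s, "
    f"stdev {summary['stdev']:.2f}, range {summary['min']:.2f} to {summary['max']:.2f}"
)
```

Reporting the spread alongside the mean makes it easier to tell whether a later change in clocks or power cap exceeds run-to-run noise.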

The most important rule is to change only one variable at a time: model, quantization, context size, GPU count, clocks, or VBIOS.
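One way to enforce this rule is to record each test configuration as a small dictionary and check how many fields differ before comparing results. The sketch below is illustrative: the field names and the helpers `changed_keys` and `is_controlled` are assumptions, the baseline values are taken from the local run, and the candidate values are hypothetical.

```python
def changed_keys(baseline, candidate):
    """Return the sorted list of configuration keys whose values differ."""
    keys = set(baseline) | set(candidate)
    return sorted(k for k in keys if baseline.get(k) != candidate.get(k))


def is_controlled(baseline, candidate):
    """True when exactly one variable changed between the two configurations."""
    return len(changed_keys(baseline, candidate)) == 1


# Baseline fields taken from the local run described in this note.
baseline = {
    "model": "Qwen2.5-Coder-32B-Instruct",
    "quantization": "Q4_K_M",
    "context": 4096,
    "gpu_count": 3,
}

# Hypothetical candidates for illustration only.
longer_context = {**baseline, "context": 8192}
two_changes = {**baseline, "context": 8192, "gpu_count": 2}

for name, config in (("longer_context", longer_context), ("two_changes", two_changes)):
    print(name, changed_keys(baseline, config), is_controlled(baseline, config))
```

Clocks, power cap, and VBIOS can be added as further keys once they are recorded for each run.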

Conclusion

The AMD Instinct MI50 remains a useful low-cost accelerator for local AI workloads when VBIOS, ReBAR, ROCm, cooling, and runtime configuration are handled carefully. The public V420.rom data shows that LLM throughput can remain nearly flat across 178W to 300W when clocks are already favorable, while memory frequency improvements can have a clearer effect.

The local 3×MI50 llama.cpp run confirms functional multi-GPU offload with Qwen2.5-Coder-32B-Instruct Q4_K_M, reaching 111.58 tokens/s in prompt evaluation and 21.82 tokens/s in a short decode sample. The result is a valid operational baseline, not a final performance ceiling.

Future testing should use longer generations, multiple runs, and consistent telemetry. That will make it possible to determine whether the next improvement should come from VBIOS tuning, clock configuration, quantization choice, model selection, or runtime parameters.

References

evilJazz, AMD Instinct MI50 32GB VBIOS, GitHub Gist.
Local benchmark notes from the current 3×MI50 llama.cpp setup.