Benchmarking llama.cpp on legacy hardware
Inspired by the llama.cpp benchmarking results for Apple Silicon published here, I used the llama-bench tool to produce comparable results on some older and less powerful devices, as well as a few newer ones.
This gives an idea of the rate of progress in LLM inference performance over a longer time span (reaching back to a time before anyone thought of running DNNs of the size we take for granted today on consumer hardware). The key metric is "tokens per second" (t/s) for prompt processing (pp512) and next-token generation (tg128). Prompt processing usually benefits more from better parallelization, since all prompt tokens can be processed in parallel, whereas token generation is sequential and tends to be limited by memory bandwidth.
These numbers are not intended as a scientifically rigorous study, but rather as quick order-of-magnitude estimates across different platforms. Note that most of them are for CPU inference, even when a GPU was available.
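The ± figures in the tables below are the spread over llama-bench's repeated runs of each test. A minimal sketch of how such a throughput summary can be computed (the per-run timings here are hypothetical, not measured values):

```python
import statistics

def summarize_throughput(n_tokens: int, run_times_s: list[float]) -> tuple[float, float]:
    """Return (mean, stdev) of tokens/second over repeated runs."""
    rates = [n_tokens / t for t in run_times_s]
    return statistics.mean(rates), statistics.stdev(rates)

# Hypothetical timings for a pp512 test (512 tokens processed per run):
mean_tps, std_tps = summarize_throughput(512, [16.0, 15.8, 16.2, 16.1, 15.9])
print(f"{mean_tps:.2f} ± {std_tps:.2f} t/s")  # prints "32.00 ± 0.32 t/s"
```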
Most tests were run on Windows 10/11, with binaries compiled with MSVC 2022.
Typical command line:

```
llama-bench.exe -m models\llama2\llama-2-7b.Q4_0.gguf -p 512 -n 128 -t 4
```
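For the cross-compiled builds noted below, flags such as -DGGML_NATIVE=0 and -DGGML_AVX2=1 pin the SIMD level at build time instead of auto-detecting the build host's CPU. A sketch of such a CMake build (directory layout and target name follow the standard llama.cpp repository; the exact invocation used here is an assumption):

```shell
# Sketch: CPU-only llama.cpp build with an explicit SIMD level,
# so the binary can run on a machine other than the build host.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_NATIVE=0 -DGGML_AVX2=1
cmake --build build --config Release --target llama-bench
```

For the AVX-512 machine further down, -DGGML_AVX512=1 replaces -DGGML_AVX2=1.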
Dell Latitude E6420
- Intel(R) Core(TM) i7-2640M CPU @ 2.80GHz
- 8 GB RAM
- Release date: 2011
- SIMD extensions active: AVX
Build 80f19b4
| model | size | params | backend | threads | test | t/s |
|---|---|---|---|---|---|---|
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU | 4 | pp512 | 1.98 ± 0.02 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU | 4 | tg128 | 1.90 ± 0.10 |
ZOTAC ZBOX ID89 Plus
- Intel(R) Core(TM) i5-3470T CPU @ 2.90GHz
- 8 GB RAM
- Release date: 2013
- SIMD extensions active: AVX
Build e3af5563 (6878)
| model | size | params | backend | threads | test | t/s |
|---|---|---|---|---|---|---|
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU | 4 | pp512 | 2.31 ± 0.02 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU | 4 | tg128 | 2.46 ± 0.01 |
Lenovo Yoga 730-15IKB
- Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
- 16 GB RAM
- NVIDIA GeForce GTX 1050 (not used)
- Compile options (for cross-compiling): -DGGML_AVX2=1 -DGGML_NATIVE=0
- Release date: 2018
Build e3af5563 (6878)
| model | size | params | backend | threads | test | t/s |
|---|---|---|---|---|---|---|
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU | 4 | pp512 | 13.40 ± 0.08 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU | 4 | tg128 | 5.64 ± 0.14 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU | 8 | pp512 | 16.29 ± 0.29 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU | 8 | tg128 | 6.37 ± 0.08 |
Samsung S22 Ultra
- Android 12 smartphone with Exynos 2200 S5E9925 CPU
- 8 GB RAM (hence the smaller quantization)
- Release date: 2022
Build 80f19b4
| model | size | params | backend | threads | test | t/s |
|---|---|---|---|---|---|---|
| llama 7B Q3_K - Small | 2.75 GiB | 6.74 B | CPU | 8 | pp512 | 2.53 ± 0.04 |
| llama 7B Q3_K - Small | 2.75 GiB | 6.74 B | CPU | 8 | tg128 | 1.73 ± 0.53 |
Lenovo Yoga 7 2-in-1 Gen 10 (16AKP10)
- AMD Ryzen AI 7 350 w/ Radeon 860M (GPU not used)
- 32 GB RAM
- Compile options (for cross-compiling): -DGGML_AVX512=1 -DGGML_NATIVE=0
- Release date: 2025
Build e3af5563 (6878)
| model | size | params | backend | threads | test | t/s |
|---|---|---|---|---|---|---|
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU | 4 | pp512 | 31.37 ± 0.16 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU | 4 | tg128 | 10.49 ± 0.08 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU | 8 | pp512 | 35.84 ± 0.25 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU | 8 | tg128 | 10.27 ± 0.02 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU | 16 | pp512 | 36.16 ± 0.16 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU | 16 | tg128 | 10.06 ± 0.10 |
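Taken together, the tables span roughly an 18x improvement in pp512 throughput from the 2011 laptop to the 2025 one, and about a 5.5x improvement in tg128. A trivial check using the numbers reported above:

```python
# Speedup ratios computed from the pp512/tg128 results reported above
# (best multi-threaded result on each machine).
pp512_2011, pp512_2025 = 1.98, 36.16
tg128_2011, tg128_2025 = 1.90, 10.49

print(f"pp512 speedup: {pp512_2025 / pp512_2011:.1f}x")  # prints "pp512 speedup: 18.3x"
print(f"tg128 speedup: {tg128_2025 / tg128_2011:.1f}x")  # prints "tg128 speedup: 5.5x"
```

That generation improves much less than prompt processing is consistent with tg128 being bound by memory bandwidth rather than compute.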