Benchmarking llama.cpp on legacy hardware

Inspired by the benchmarking results for llama.cpp on Apple Silicon published here, I used the llama-bench tool to produce comparable results on some older and less powerful devices, as well as a few newer ones.

This gives an idea of the rate of progress in LLM inference performance over a longer time span, reaching back to a time before anyone thought of running DNNs of the size we take for granted today on consumer hardware. The key metric is "tokens per second", reported separately for prompt processing (pp512) and next-token generation (tg128). Prompt processing usually benefits more from parallelization, since all prompt tokens can be processed in a batch, while generation produces one token at a time.
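
Concretely, both metrics are just tokens divided by wall-clock time. A toy illustration (the timings below are invented, not measured):

```python
# Toy illustration of the two metrics reported by llama-bench.
# The wall-clock timings here are hypothetical examples.
prompt_tokens = 512      # pp512: the whole prompt is processed as one batch
gen_tokens = 128         # tg128: tokens are generated one at a time

prompt_seconds = 16.0    # assumed time to process the prompt
gen_seconds = 24.0       # assumed time to generate 128 tokens

pp_rate = prompt_tokens / prompt_seconds   # 32.0 t/s
tg_rate = gen_tokens / gen_seconds         # ~5.33 t/s
print(f"pp512: {pp_rate:.2f} t/s, tg128: {tg_rate:.2f} t/s")
```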

These numbers are not intended as a scientifically rigorous study; they are quick order-of-magnitude estimates across different platforms. Note that most of them are for CPU inference, even when a GPU was available.

Most tests run on Windows 10/11, binaries compiled with MSVC 2022.

Typical command line (-m selects the model file, -p 512 the prompt length in tokens, -n 128 the number of generated tokens, -t 4 the thread count):

llama-bench.exe -m models\llama2\llama-2-7b.Q4_0.gguf -p 512 -n 128 -t 4

Dell Latitude E6420

  • Intel(R) Core(TM) i7-2640M CPU @ 2.80GHz
  • 8 GB RAM
  • Release date: 2011
  • SIMD extensions active: AVX

Build 80f19b4

| model         |     size |  params | backend | threads | test  |         t/s |
| ------------- | -------: | ------: | ------- | ------: | ----- | ----------: |
| llama 7B Q4_0 | 3.56 GiB |  6.74 B | CPU     |       4 | pp512 | 1.98 ± 0.02 |
| llama 7B Q4_0 | 3.56 GiB |  6.74 B | CPU     |       4 | tg128 | 1.90 ± 0.10 |

ZOTAC ZBOX ID89 Plus

  • Intel(R) Core(TM) i5-3470T CPU @ 2.90GHz
  • 8 GB RAM
  • Release date: 2013
  • SIMD extensions active: AVX

Build e3af5563 (6878)

| model         |     size |  params | backend | threads | test  |         t/s |
| ------------- | -------: | ------: | ------- | ------: | ----- | ----------: |
| llama 7B Q4_0 | 3.56 GiB |  6.74 B | CPU     |       4 | pp512 | 2.31 ± 0.02 |
| llama 7B Q4_0 | 3.56 GiB |  6.74 B | CPU     |       4 | tg128 | 2.46 ± 0.01 |

Lenovo Yoga 730-15IKB

  • Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
  • 16 GB RAM
  • NVIDIA GeForce GTX 1050 (not used)
  • Compile options (for cross-compiling): -DGGML_AVX2=1 -DGGML_NATIVE=0
  • Release date: 2018
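
The cross-compile flags above are passed to CMake when configuring llama.cpp. A minimal sketch, assuming a checkout of the llama.cpp repository in the current directory (the AVX-512 build later in this post is analogous, with -DGGML_AVX512=1):

```shell
# -DGGML_NATIVE=0 stops CMake from probing the build machine's ISA;
# -DGGML_AVX2=1 then enables AVX2 explicitly for the target machine.
cmake -B build -DGGML_NATIVE=0 -DGGML_AVX2=1
cmake --build build --config Release --target llama-bench
```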

Build e3af5563 (6878)

| model         |     size |  params | backend | threads | test  |          t/s |
| ------------- | -------: | ------: | ------- | ------: | ----- | -----------: |
| llama 7B Q4_0 | 3.56 GiB |  6.74 B | CPU     |       4 | pp512 | 13.40 ± 0.08 |
| llama 7B Q4_0 | 3.56 GiB |  6.74 B | CPU     |       4 | tg128 |  5.64 ± 0.14 |
| llama 7B Q4_0 | 3.56 GiB |  6.74 B | CPU     |       8 | pp512 | 16.29 ± 0.29 |
| llama 7B Q4_0 | 3.56 GiB |  6.74 B | CPU     |       8 | tg128 |  6.37 ± 0.08 |

Samsung S22 Ultra

  • Android 12 smartphone with Exynos 2200 S5E9925 CPU
  • 8 GB RAM (hence a smaller quantization was used)
  • Release date: 2022

Build 80f19b4

| model                 |     size |  params | backend | threads | test  |         t/s |
| --------------------- | -------: | ------: | ------- | ------: | ----- | ----------: |
| llama 7B Q3_K - Small | 2.75 GiB |  6.74 B | CPU     |       8 | pp512 | 2.53 ± 0.04 |
| llama 7B Q3_K - Small | 2.75 GiB |  6.74 B | CPU     |       8 | tg128 | 1.73 ± 0.53 |

Lenovo Yoga 7 2-in-1 Gen 10 (16AKP10)

  • AMD Ryzen AI 7 350 w/ Radeon 860M (GPU not used)
  • 32 GB RAM
  • Compile options (for cross-compiling): -DGGML_AVX512=1 -DGGML_NATIVE=0
  • Release date: 2025

Build e3af5563 (6878)

| model         |     size |  params | backend | threads | test  |          t/s |
| ------------- | -------: | ------: | ------- | ------: | ----- | -----------: |
| llama 7B Q4_0 | 3.56 GiB |  6.74 B | CPU     |       4 | pp512 | 31.37 ± 0.16 |
| llama 7B Q4_0 | 3.56 GiB |  6.74 B | CPU     |       4 | tg128 | 10.49 ± 0.08 |
| llama 7B Q4_0 | 3.56 GiB |  6.74 B | CPU     |       8 | pp512 | 35.84 ± 0.25 |
| llama 7B Q4_0 | 3.56 GiB |  6.74 B | CPU     |       8 | tg128 | 10.27 ± 0.02 |
| llama 7B Q4_0 | 3.56 GiB |  6.74 B | CPU     |      16 | pp512 | 36.16 ± 0.16 |
| llama 7B Q4_0 | 3.56 GiB |  6.74 B | CPU     |      16 | tg128 | 10.06 ± 0.10 |
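
As a rough summary of the span covered above, the overall speedup can be read straight off the tables by comparing the 4-thread CPU results of the oldest and newest machines:

```python
# Speedup from the 2011 i7-2640M to the 2025 Ryzen AI 7 350,
# using the 4-thread CPU numbers reported above.
pp512 = {"i7-2640M (2011)": 1.98, "Ryzen AI 7 350 (2025)": 31.37}
tg128 = {"i7-2640M (2011)": 1.90, "Ryzen AI 7 350 (2025)": 10.49}

pp_speedup = pp512["Ryzen AI 7 350 (2025)"] / pp512["i7-2640M (2011)"]
tg_speedup = tg128["Ryzen AI 7 350 (2025)"] / tg128["i7-2640M (2011)"]
# Roughly 15.8x for prompt processing and 5.5x for generation.
print(f"prompt processing: {pp_speedup:.1f}x, generation: {tg_speedup:.1f}x")
```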