CPU Microbenchmark Suite

Characterises memory hierarchy latency, bandwidth, and peak SIMD throughput on x86-64 (AVX2/FMA3) and AArch64 (NEON) platforms, then uses those measurements to build a roofline model and predict optimal GEMM tile sizes analytically.

Built on top of Google Benchmark. No external benchmark library is needed beyond that -- CMake fetches it automatically.

What it measures

Latency -- pointer-chase microbenchmarks that defeat the hardware prefetchers sweep the working-set size from 4 KiB to 512 MiB, exposing the L1, L2, L3, and DRAM latency plateaus.

Bandwidth -- STREAM-style Copy / Scale / Add / Triad kernels in three variants: scalar, compiler-vectorised, and hand-written AVX2 with non-temporal stores (for DRAM-level measurement).

Peak FLOPS -- unrolled FMA chains with 10 independent accumulator registers to saturate FMA execution ports on AVX2 and NEON cores.

Roofline model -- Python post-processing takes the measured peak GFLOPS and per-level bandwidth numbers, plots the roofline, and places kernels on the graph.

GEMM tile-size prediction -- applies the BLIS analytical cache-blocking model to the measured hardware parameters to recommend (Mc, Kc, Nc) panel sizes for a blocked FP32 GEMM, then cross-validates against a naive blocked implementation.
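
As a rough illustration of the cache-blocking idea, the sketch below sizes the three blocks from cache capacities alone. It is a simplified capacity heuristic, not the full BLIS analytical model (which also accounts for associativity and replacement policy). The function name `predict_tiles`, the micro-tile shape (mr, nr) = (6, 16), and the `fill` fraction are assumptions made for illustration; they are not taken from scripts/tile_predict.py.

```python
def predict_tiles(l1_bytes, l2_bytes, l3_bytes,
                  mr=6, nr=16, elem_bytes=4, fill=0.5):
    """Capacity-only estimate of (Mc, Kc, Nc) for a blocked FP32 GEMM.

    mr, nr and fill are illustrative assumptions, not measured values.
    """
    # Kc: an mr x Kc sliver of A and a Kc x nr sliver of B should
    # together occupy about `fill` of the L1 data cache.
    kc = int(fill * l1_bytes / ((mr + nr) * elem_bytes))
    kc -= kc % 4                      # round down to a multiple of 4

    # Mc: the packed Mc x Kc block of A should occupy about `fill` of L2.
    mc = int(fill * l2_bytes / (kc * elem_bytes))
    mc -= mc % mr                     # whole number of micro-tile rows

    # Nc: the packed Kc x Nc panel of B should occupy about `fill` of L3.
    nc = int(fill * l3_bytes / (kc * elem_bytes))
    nc -= nc % nr                     # whole number of micro-tile columns

    return mc, kc, nc


if __name__ == "__main__":
    # Hypothetical cache sizes: 32 KiB L1d, 256 KiB L2, 8 MiB L3.
    print(predict_tiles(32 * 1024, 256 * 1024, 8 * 1024 * 1024))
```

In practice the cache sizes would come from the measured latency plateaus rather than from hard-coded constants.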

Repository layout

cpu-microbench/
- CMakeLists.txt
- Makefile                    # convenience wrapper around cmake
- LICENSE
- CONTRIBUTING.md
- requirements.txt            # numpy, matplotlib, scipy
- .gitignore
- cmake/
  - DetectArch.cmake          # ISA feature detection (AVX2, FMA, NEON)
  - PerfCounters.cmake        # perf_event_open availability check
  - Sanitizers.cmake          # ASan + UBSan (for tests only)
  - toolchain-aarch64.cmake   # cross-compilation for AArch64
- include/
  - bench_utils.hpp           # timing helpers and cache-level classifier
  - perf_event.hpp            # RAII wrapper around perf_event_open(2)
  - topology.hpp              # /proc/cpuinfo and sysfs reader
  - numa.hpp                  # libnuma wrappers with graceful fallback
- src/
  - pointer_chase.cpp         # L1/L2/L3/DRAM latency benchmarks
  - stream.cpp                # Copy/Scale/Add/Triad bandwidth benchmarks
  - peak_flops.cpp            # AVX2 and NEON peak GFLOPS benchmarks
  - gemm_tile_predict.cpp     # blocked GEMM tile-size validation
- scripts/
  - check_system.sh           # pre-flight system checker
  - run_suite.sh              # orchestration + perf stat wrapper
  - parse_perf.py             # JSON aggregator -> summary.json
  - roofline.py               # roofline + latency + bandwidth plots
  - plot_bandwidth_scaling.py # thread-count vs bandwidth scaling plot
  - tile_predict.py           # BLIS tile-size calculator
- tests/
  - test_permutation.cpp
  - test_stream_verify.cpp
- docs/
  - methodology.md
  - reproducibility.md
  - hardware_notes/
- .github/workflows/ci.yml
- .vscode/                    # local editor tasks/launch configs
- results/                    # auto-generated; gitignored
  - raw/
  - plots/

Project docs

docs/methodology.md: schema, analysis assumptions, and extension method.
docs/reproducibility.md: publication-grade experimental protocol.
CONTRIBUTING.md: contribution and code-quality standards.
CODE_OF_CONDUCT.md: community conduct expectations.
SECURITY.md: responsible vulnerability disclosure workflow.
CITATION.cff: citation metadata for papers and reports.
CHANGELOG.md: release and repository hardening history.

Quick start

# 1. Verify your system has everything needed
bash scripts/check_system.sh

# 2. Install Python dependencies
pip3 install -r requirements.txt

# 3. Build (make handles cmake configure + build)
make

# 4. Run correctness tests
make test

# 5. Run all benchmarks and generate plots
make run-all

# 6. View results
ls results/raw/        # timestamped JSON files
ls results/plots/      # roofline.pdf, latency_vs_size.png, bandwidth_by_level.png

make help lists all available targets.

Prerequisites

System

Requirement	Minimum	Notes
Linux kernel	5.15	`perf_event_open` with full PMU access
GCC	12	or Clang 15+
CMake	3.25
Python	3.10	for post-processing scripts
Ninja	any	recommended; Make also works

Optional

Package	Purpose
`libnuma-dev`	NUMA-local allocation (auto-detected by CMake)
`linux-tools-generic`	`perf` command
`cpupower`	locking CPU frequency
`numactl`	NUMA binding
`likwid`	cross-validation of hardware counter readings

Install on Ubuntu/Debian:

sudo apt update
sudo apt install build-essential cmake ninja-build python3 python3-pip \
                 libnuma-dev linux-tools-generic cpuidtool
pip3 install -r requirements.txt

Building

# Configure (CMake downloads Google Benchmark automatically)
cmake -B build \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_CXX_COMPILER=g++ \
  -DBENCH_ENABLE_AVX2=ON \
  -DBENCH_ENABLE_FMA=ON \
  -DBENCH_ENABLE_PERF=ON \
  -DBENCHMARK_DOWNLOAD_DEPENDENCIES=ON \
  -G Ninja

# Build all benchmarks and tests
cmake --build build -j$(nproc)

CMake options:

Option	Default	Description
`BENCH_ENABLE_AVX2`	`AUTO`	Enable `-mavx2`
`BENCH_ENABLE_FMA`	`AUTO`	Enable `-mfma`
`BENCH_ENABLE_AVX512`	`OFF`	Enable `-mavx512f` (Skylake-X+)
`BENCH_ENABLE_PERF`	`ON`	Compile perf_event_open wrappers
`BENCH_WITH_NUMA`	`AUTO`	Link libnuma
`BENCH_SANITIZE`	`OFF`	ASan + UBSan (tests only, never benchmarks)
`BENCH_LTO`	`OFF`	Link-time optimisation (intentionally discouraged)

Never enable LTO for benchmarks. The linker may constant-fold or eliminate the loops being measured.

Running

Full suite (recommended)

# Lock CPU frequency and disable turbo (requires sudo)
sudo cpupower frequency-set -g performance
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo  # Intel
sudo sysctl -w kernel.perf_event_paranoid=1

# Run on a single NUMA node
numactl --cpunodebind=0 --membind=0 bash scripts/run_suite.sh

# View results
ls results/raw/        # timestamped directory with JSON files
ls results/plots/      # roofline.pdf, latency_vs_size.png, bandwidth_by_level.png

Expected runtime on a modern desktop CPU: around 8 minutes for the full sweep.

Individual benchmarks

# Latency sweep
./build/pointer_chase --benchmark_format=console --benchmark_repetitions=3

# Bandwidth sweep
./build/stream_bench  --benchmark_format=console

# Peak FLOPS
./build/peak_flops    --benchmark_format=console --benchmark_repetitions=5

# GEMM tile validation
./build/gemm_tile_bench --benchmark_format=console

Correctness tests

ctest --test-dir build -V

VS Code setup

Open the folder directly in VS Code. The .vscode/ directory includes:

settings.json -- CMake Tools integration, C++ IntelliSense, Python formatter
tasks.json -- build tasks (Ctrl+Shift+B triggers CMake: build)
launch.json -- run/debug configurations for each benchmark binary
extensions.json -- recommended extension list

Install recommended extensions when prompted (or via Extensions: Show Recommended Extensions). The most important ones are:

Extension	ID
C/C++	`ms-vscode.cpptools`
CMake Tools	`ms-vscode.cmake-tools`
CMake	`twxs.cmake`
Python	`ms-python.python`

After installing CMake Tools, press Ctrl+Shift+P and run CMake: Configure, then CMake: Build. The integrated terminal then lets you run any benchmark directly.

Interpreting results

Latency plateaus

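The pointer-chase sweep exposes distinct latency levels corresponding to each cache. Sharp transitions between plateaus indicate the cache capacity boundaries. If all levels collapse to a single latency, the chase is most likely not exercising the memory hierarchy as intended: check that the permutation forms a single random cycle (otherwise the prefetcher can follow it) and that the compiler has not optimised away the dependent loads. Turbo boost does not cause this collapse; it only changes the cycle-to-nanosecond conversion, so disable it to keep the plateaus comparable across runs.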
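
One simple way to locate the transitions is to look for adjacent sweep points whose latency ratio exceeds a threshold. The sketch below assumes the sweep is already available as two parallel lists; the function name and the 1.5x threshold are illustrative choices, and real sweeps with gradual transitions may need smoothing first.

```python
def find_plateau_boundaries(sizes_bytes, latencies_ns, jump_ratio=1.5):
    """Return the last working-set size before each latency jump.

    A jump is any step where latency grows by more than jump_ratio
    relative to the previous sweep point.
    """
    boundaries = []
    for i in range(1, len(sizes_bytes)):
        if latencies_ns[i] > jump_ratio * latencies_ns[i - 1]:
            boundaries.append(sizes_bytes[i - 1])
    return boundaries


if __name__ == "__main__":
    KiB, MiB = 1024, 1024 * 1024
    # Synthetic, idealised data for illustration only.
    sizes = [4 * KiB, 32 * KiB, 256 * KiB, 1 * MiB, 16 * MiB, 256 * MiB]
    lats = [1.4, 1.4, 4.0, 4.0, 14.0, 70.0]
    print(find_plateau_boundaries(sizes, lats))
```

An empty result means no jump was found, which corresponds to the collapsed-plateau case described above.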

Bandwidth numbers

The AVX2 non-temporal Triad result (BM_Triad_AVX2_NT) gives the most accurate picture of sustainable DRAM bandwidth because non-temporal stores bypass the cache on write, eliminating read-for-ownership (RFO) traffic.
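
The effect of RFO traffic on the reported number can be worked out by counting memory streams. Triad computes a[i] = b[i] + s * c[i], so STREAM counts three streams per element (two reads, one write); with ordinary write-allocate stores the hardware also reads a[] before writing it, so four streams actually cross the memory bus. The helper names below are hypothetical and serve only to make the arithmetic explicit.

```python
def triad_bytes(n_elements, elem_bytes=8, write_allocate=True):
    """Bytes that cross the memory bus for one Triad pass."""
    streams = 3                # read b[], read c[], write a[]
    if write_allocate:
        streams += 1           # RFO: a[] is read before it is written
    return streams * n_elements * elem_bytes


def bandwidth_gb_s(bytes_moved, seconds):
    """Bandwidth in GB/s (10^9 bytes per second)."""
    return bytes_moved / seconds / 1e9


if __name__ == "__main__":
    n = 64 * 1024 * 1024       # illustrative array length
    counted = triad_bytes(n, write_allocate=False)
    actual = triad_bytes(n, write_allocate=True)
    # With regular stores, the STREAM-counted figure covers only
    # 3/4 of the real traffic.
    print(counted / actual)
```

With non-temporal stores the counted and actual traffic coincide, which is why that variant is the better estimate of DRAM bandwidth.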

Roofline classification

A kernel left of the ridge point is memory-bound and can at best reach the bandwidth roof. A kernel right of the ridge point is compute-bound and can at best reach the compute roof. A kernel well below both roofs has a bottleneck not captured by the two-roof model -- typically instruction latency, branch misprediction, or lock contention.
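
The classification follows directly from the roofline formula: attainable performance is the smaller of the compute peak and arithmetic intensity times bandwidth, and the ridge point is the intensity at which the two are equal. The function names in this sketch are illustrative and are not taken from scripts/roofline.py.

```python
def ridge_point(peak_gflops, bandwidth_gb_s):
    """Arithmetic intensity (FLOP/byte) where the two roofs meet."""
    return peak_gflops / bandwidth_gb_s


def attainable_gflops(intensity, peak_gflops, bandwidth_gb_s):
    """Roofline bound for a kernel with the given arithmetic intensity."""
    return min(peak_gflops, intensity * bandwidth_gb_s)


def classify(intensity, peak_gflops, bandwidth_gb_s):
    """Label a kernel by which roof limits it."""
    if intensity < ridge_point(peak_gflops, bandwidth_gb_s):
        return "memory-bound"
    return "compute-bound"


if __name__ == "__main__":
    peak, bw = 120.0, 48.0     # illustrative GFLOPS and GB/s
    for ai in (0.25, 2.5, 10.0):
        print(ai, classify(ai, peak, bw), attainable_gflops(ai, peak, bw))
```

Comparing a kernel's measured GFLOPS against `attainable_gflops` shows how far below the applicable roof it sits.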

Expected values (i7-12700K, DDR5-4800, 1 core)

Metric	Expected	Acceptable range
L1 load latency	4-5 cycles (~1.4 ns)	3-6 cycles
L2 load latency	12-14 cycles (~4 ns)	10-16 cycles
L3 load latency	40-50 cycles (~14 ns)	35-60 cycles
DRAM latency	60-80 ns	50-100 ns
DRAM bandwidth (1 thread)	40-52 GB/s	35-60 GB/s
Peak SP GFLOPS (1 core)	120-150	100-175
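
The figures in the table can be sanity-checked with simple arithmetic. The sketch below assumes a core clock of 3.6 GHz with turbo disabled and two 256-bit FMA units per core; both are assumptions about the reference machine, not values taken from the suite, so substitute the clock and port count of the CPU under test.

```python
def cycles_to_ns(cycles, freq_ghz):
    """Convert a latency in core cycles to nanoseconds."""
    return cycles / freq_ghz


def peak_sp_gflops(freq_ghz, fma_units=2, simd_bits=256):
    """Theoretical single-core FP32 peak.

    Each FMA unit processes simd_bits / 32 lanes per cycle, and one
    fused multiply-add counts as two floating-point operations.
    """
    lanes = simd_bits // 32
    return freq_ghz * fma_units * lanes * 2


if __name__ == "__main__":
    freq = 3.6                 # GHz, assumed locked clock
    print(cycles_to_ns(5, freq))     # L1 latency in ns for 5 cycles
    print(peak_sp_gflops(freq))      # FP32 peak in GFLOPS
```

A measured peak noticeably above the value computed for the locked clock suggests that the core is still boosting.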

Measurement pitfalls

CPU frequency scaling -- always lock frequency before running. Turbo Boost causes non-stationary distributions and inflated FLOPS counts.

NUMA remote access -- on multi-socket systems, use numactl to bind to a single node. Remote memory can appear as 2x DRAM latency.

Hyperthreading -- bandwidth benchmarks do not model contention between HT siblings sharing L1/L2. Pin to physical cores only.

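DRAM refresh -- refresh commands are issued roughly every 7.8 us on DDR4 (about 3.9 us on DDR5), and each one stalls the affected rank or bank for a few hundred nanoseconds. These stalls show up in p99 latency but average out in the mean. Report both.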

Prefetcher interaction -- the pointer-chase benchmark defeats the hardware stride prefetcher by design (random permutation). Bandwidth benchmarks intentionally keep the prefetcher enabled, since that is what production kernels experience.
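
For the random permutation to defeat the prefetcher at every working-set size, it should form a single cycle so that the chase visits every slot before repeating. One standard way to get this is Sattolo's algorithm, sketched below; whether src/pointer_chase.cpp uses this exact method is an assumption, and the function names are illustrative.

```python
import random


def build_chase(n, seed=0):
    """Return next_index where following it from slot 0 visits all n slots.

    Uses Sattolo's algorithm, which produces a uniformly random
    permutation consisting of exactly one cycle of length n.
    """
    rng = random.Random(seed)
    next_index = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)   # 0 <= j < i, never i itself
        next_index[i], next_index[j] = next_index[j], next_index[i]
    return next_index


def cycle_length(next_index):
    """Count the steps needed to return to slot 0."""
    steps, pos = 0, 0
    while True:
        pos = next_index[pos]
        steps += 1
        if pos == 0:
            return steps


if __name__ == "__main__":
    chase = build_chase(1024, seed=42)
    print(cycle_length(chase) == len(chase))
```

A plain Fisher-Yates shuffle can produce several short cycles, in which case the chase touches only part of the buffer and under-reports latency at large sizes.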

Extending the suite

Adding a new kernel

Implement in src/ as a Google Benchmark function.
Report state.counters["arith_intensity_flop_byte"] from the benchmark.
Add perf counter collection in scripts/run_suite.sh.
Add a record to kernels in scripts/parse_perf.py's build_kernel_list().
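
Google Benchmark writes user counters as extra keys on each benchmark entry in its JSON output, so the reported intensity can be picked up during post-processing. The sketch below shows one way this might look; the function name and record layout are hypothetical and do not reflect the actual schema used by scripts/parse_perf.py.

```python
import json

COUNTER = "arith_intensity_flop_byte"


def kernel_records(benchmark_json_text, counter=COUNTER):
    """Collect (name, counter value) records from Google Benchmark JSON.

    Entries that do not report the counter are skipped.
    """
    data = json.loads(benchmark_json_text)
    records = []
    for entry in data.get("benchmarks", []):
        if counter in entry:
            records.append({"name": entry["name"], counter: entry[counter]})
    return records


if __name__ == "__main__":
    # Minimal hand-written sample, not real benchmark output.
    sample = json.dumps({
        "benchmarks": [
            {"name": "BM_Triad_AVX2_NT", "real_time": 1.0, COUNTER: 0.0833},
            {"name": "BM_PointerChase", "real_time": 2.0},
        ]
    })
    print(kernel_records(sample))
```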

Adding a new architecture

Document PMU event codes in docs/hardware_notes/<arch>.md.
Add event codes to include/perf_event.hpp's dispatch table.
Add SIMD intrinsic variant in src/peak_flops.cpp under #ifdef __<ARCH>__.
Add toolchain file cmake/toolchain-<arch>.cmake.
Update cmake/DetectArch.cmake to probe the new ISA feature.

References

McCalpin, J.D. (1995). Memory Bandwidth and Machine Balance in Current High Performance Computers. IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter.
Williams, S., Waterman, A., Patterson, D. (2009). Roofline: An Insightful Visual Performance Model for Multicore Architectures. CACM 52(4).
Van Zee, F.G., van de Geijn, R.A. (2015). BLIS: A Framework for Rapidly Instantiating BLAS Functionality. ACM TOMS 41(3).
Fog, A. (2024). Instruction Tables. Technical University of Denmark. https://agner.org/optimize/
Drepper, U. (2007). What Every Programmer Should Know About Memory. Red Hat.
Intel Corp. (2024). Intel 64 and IA-32 Architectures Optimization Reference Manual, Vol. 1.
ARM Ltd. (2021). Cortex-A72 Software Optimization Guide.

License

MIT

Citation

If you use this repository in academic or industrial research, cite it using the metadata in CITATION.cff.
