# Performance Monitoring
Below we describe how to use LIKWID.jl to measure the performance of a piece of Julia code at the hardware level.
## CPU
### The `@perfmon` macro

The macro `@perfmon` is the easiest tool to use for performance monitoring. You need to provide two things:
- the performance group(s) that you're interested in, and
- the piece of Julia code to be analyzed.
As for the first, you can use `PerfMon.supported_groups` to get a list of the performance groups available on your system. The most common ones, which should be available on most systems, are `FLOPS_DP` and `FLOPS_SP` for obtaining information about double- and single-precision floating-point operations.
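If you are unsure whether a group is supported, you can check its name against this list before measuring. A minimal sketch: the hard-coded `available` vector below is a hypothetical stand-in for the group names reported by LIKWID, so the logic is self-contained; on a real system you would extract the names from `PerfMon.supported_groups()` instead.

```julia
# Hypothetical stand-in for the names returned by PerfMon.supported_groups();
# replace this hard-coded list with the actual query on a real system.
available = ["FLOPS_DP", "FLOPS_SP", "MEM", "L2CACHE"]

group = "FLOPS_DP"
if group in available
    println("$group is available")
else
    println("$group is not supported on this system")
end
```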
As for the second, pretty much any Julia code is valid input: you can call a function or use, e.g., a `begin ... end` block to monitor multiple statements. However, it is important to realize that, by default, `@perfmon` will only monitor the CPU threads ("cores") associated with Julia threads and, for example, does not reliably provide information about computations happening on separate BLAS threads. To monitor the latter, one can use the `perfmon` function instead.
```julia
using LIKWID
using Base.Threads

N = 10_000
a = 3.141
x = rand(N)
y = rand(N)
z = zeros(N)

function saxpy!(z, a, x, y)
    @threads :static for i in eachindex(z)
        z[i] = a * x[i] + y[i]
    end
    return z
end

saxpy!(z, a, x, y); # warmup

metrics, events = @perfmon "FLOPS_DP" saxpy!(z, a, x, y);
```
```
Group: FLOPS_DP
┌───────────────────────────┬──────────┬──────────┬──────────┐
│                     Event │ Thread 1 │ Thread 2 │ Thread 3 │
├───────────────────────────┼──────────┼──────────┼──────────┤
│          ACTUAL_CPU_CLOCK │ 594353.0 │ 172690.0 │ 219724.0 │
│             MAX_CPU_CLOCK │ 398790.0 │ 225764.0 │ 143752.0 │
│      RETIRED_INSTRUCTIONS │  67624.0 │  80827.0 │  85597.0 │
│       CPU_CLOCKS_UNHALTED │ 161793.0 │  63454.0 │  71502.0 │
│ RETIRED_SSE_AVX_FLOPS_ALL │   6668.0 │   6666.0 │   6666.0 │
│                     MERGE │      0.0 │      0.0 │      0.0 │
└───────────────────────────┴──────────┴──────────┴──────────┘
┌──────────────────────┬─────────────┬────────────┬────────────┐
│               Metric │    Thread 1 │   Thread 2 │   Thread 3 │
├──────────────────────┼─────────────┼────────────┼────────────┤
│  Runtime (RDTSC) [s] │  5.23601e-5 │ 5.23601e-5 │ 5.23601e-5 │
│ Runtime unhalted [s] │ 0.000264663 │ 7.68981e-5 │ 9.78422e-5 │
│          Clock [MHz] │     3346.97 │    1717.77 │    3432.53 │
│                  CPI │     2.39254 │   0.785059 │   0.835333 │
│         DP [MFLOP/s] │     127.349 │    127.311 │    127.311 │
└──────────────────────┴─────────────┴────────────┴────────────┘
```
Apart from printing, the monitoring results are returned in the form of the nested data structures `metrics` and `events`. For example, the FLOPS (floating-point operations per second) can be queried as follows:

```julia
metrics["FLOPS_DP"][1]["DP [MFLOP/s]"]
```

```
127.34884328126886
```

Here, `"FLOPS_DP"` is the performance group, `1` indicates the first Julia thread, and `"DP [MFLOP/s]"` is a LIKWID metric.
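The nested layout can also be traversed programmatically, e.g. to aggregate a metric over all monitored threads. A minimal sketch, using a hard-coded stand-in with the per-thread values from the table above (on a real system, use the `metrics` object returned by `@perfmon` directly):

```julia
# Stand-in mirroring the nested structure: group name => per-thread metric dicts.
# The values are the per-thread DP rates from the output above.
metrics = Dict("FLOPS_DP" => [
    Dict("DP [MFLOP/s]" => 127.349),
    Dict("DP [MFLOP/s]" => 127.311),
    Dict("DP [MFLOP/s]" => 127.311),
])

# Sum the metric across all monitored Julia threads.
total_mflops = sum(t["DP [MFLOP/s]"] for t in metrics["FLOPS_DP"])
println(total_mflops)  # total DP rate across the three threads
```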
To ensure a reliable monitoring process, `@perfmon` will automatically pin the Julia threads to the CPU threads they are currently running on (to avoid migration).
### The `perfmon` function
If you need more fine-grained control, you should use the `perfmon` function instead of the `@perfmon` macro. Among other things, it allows one to

- disable automatic thread-pinning via `autopin=false`,
- manually specify the CPU threads ("cores") to be monitored through the `cpuids` keyword argument, and
- suppress printing via `print=false`.
```julia
# Since we'll have autopin=false, we must manually ensure that computations run
# on the CPU threads / cores that we're monitoring!
LIKWID.pinthreads([0, 1, 2])

metrics, events = perfmon(() -> saxpy!(z, a, x, y), "FLOPS_DP"; cpuids=[0, 1], autopin=false);
```
```
Group: FLOPS_DP
┌───────────────────────────┬──────────┬──────────┐
│                     Event │ Thread 1 │ Thread 2 │
├───────────────────────────┼──────────┼──────────┤
│          ACTUAL_CPU_CLOCK │ 805504.0 │ 677456.0 │
│             MAX_CPU_CLOCK │ 548820.0 │ 451058.0 │
│      RETIRED_INSTRUCTIONS │  64537.0 │ 103995.0 │
│       CPU_CLOCKS_UNHALTED │ 181393.0 │ 120942.0 │
│ RETIRED_SSE_AVX_FLOPS_ALL │   6668.0 │   6666.0 │
│                     MERGE │      0.0 │      0.0 │
└───────────────────────────┴──────────┴──────────┘
┌──────────────────────┬─────────────┬─────────────┐
│               Metric │    Thread 1 │    Thread 2 │
├──────────────────────┼─────────────┼─────────────┤
│  Runtime (RDTSC) [s] │ 0.000189612 │ 0.000189612 │
│ Runtime unhalted [s] │ 0.000358688 │ 0.000301668 │
│          Clock [MHz] │     3296.01 │     3372.87 │
│                  CPI │     2.81068 │     1.16296 │
│         DP [MFLOP/s] │     35.1665 │     35.1559 │
└──────────────────────┴─────────────┴─────────────┘
```
Note that Julia's `do` syntax can often be useful here.
```julia
metrics, events = perfmon("FLOPS_DP"; cpuids=[0, 1], autopin=false, print=false) do
    # code goes here...
    saxpy!(z, a, x, y)
end;
```
## GPU

**Experimental**
```julia
using LIKWID
using CUDA

N = 10_000
a = 3.141f0 # Float32
x = CUDA.rand(Float32, N)
y = CUDA.rand(Float32, N)
z = CUDA.zeros(Float32, N)

saxpy!(z, a, x, y) = z .= a .* x .+ y
saxpy!(z, a, x, y); # warmup

metrics, events = @nvmon "FLOPS_SP" saxpy!(z, a, x, y);
```
```
Group: FLOPS_SP
┌────────────────────────────────────────────────────┬─────────┐
│                                              Event │   GPU 1 │
├────────────────────────────────────────────────────┼─────────┤
│ SMSP_SASS_THREAD_INST_EXECUTED_OP_FADD_PRED_ON_SUM │     0.0 │
│ SMSP_SASS_THREAD_INST_EXECUTED_OP_FMUL_PRED_ON_SUM │     0.0 │
│ SMSP_SASS_THREAD_INST_EXECUTED_OP_FFMA_PRED_ON_SUM │ 10000.0 │
└────────────────────────────────────────────────────┴─────────┘
┌─────────────────────┬────────────┐
│              Metric │      GPU 1 │
├─────────────────────┼────────────┤
│ Runtime (RDTSC) [s] │ 1.84467e10 │
│        SP [MFLOP/s] │ 1.0842e-12 │
└─────────────────────┴────────────┘
```
This page was generated using Literate.jl.