Counting Events
PAPI provides the ability to start, stop, read, and accumulate the counters for a specified list of events. PAPI.jl exposes this functionality (see the PAPI interface section below) but also builds more user-friendly primitives on top of it. PAPI.jl stores count values using the Counts type.
Measuring Performance
PAPI.jl provides two ways of measuring performance counters:
- quick, one-shot measurements via profile
- extensive, sample-based measurements via sample
Profile
profile and @profile can be used to quickly measure performance counts for a function or expression. Since the function is executed only once, these counts should be taken with a grain of salt: they can be riddled with noise, especially for short-running functions.
PAPI.profile — Function

profile(f::Function, events::Vector{Event}; gcfirst::Bool=true, warmup::Int64=0)

Execute the function f once while counting specific events.

Arguments:
- `f`: the function to profile
- `events`: the events to count
- `gcfirst`: run the GC several times before the execution to reduce GC noise
- `warmup`: number of times to run the function prior to counting

Return values:
An EventValues containing the events, counts and runtime collected.
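The same measurement can be made without the macro shown below; a minimal sketch of the function form (the anonymous workload and event list here are only illustrative):

evts = PAPI.profile(() -> sum(rand(1000)), Event[PAPI.TOT_INS, PAPI.TOT_CYC]; warmup=1)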
PAPI.@profile — Macro

profile(ex, args...)

Convenience macro for profiling an expression. Events can be specified as a first argument; otherwise the default events [BR_INS, BR_MSP, TOT_INS, TOT_CYC] are counted. Arguments and return values are similar to profile.

Example:
@profile f(x, y, z) # profiling default events
@profile [PAPI.TOT_INS, PAPI.DP_OPS, native_event] f(x, y, z) gcfirst=false

The resulting EventValues contains the events and counts collected; when printed, a summary with additional information is generated.
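If you want programmatic access to the collected numbers rather than the printed summary, you can introspect the returned value; a minimal sketch using standard Julia reflection (the exact field names of EventValues are not spelled out here, so we query them rather than assume them):

X = rand(10000)
evts = @profile sum(X)
fieldnames(typeof(evts))   # discover which fields (events, counts, runtime, ...) EventValues exposes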
Sample
Contrary to profile, sample and @sample perform multiple executions. The resulting samples can be used to investigate the distribution of the counts and to perform various statistical tests.
You should be careful when basing performance decisions on single or aggregate values (such as the mean, minimum, or median), as well as on "eyeball" statistics. Statistical tests such as the t-test or the Wilcoxon test can be used instead. Personally, I am also wary of asymptotic tests that put strong assumptions on the distributions (such as the t-test). For example, performance numbers (even runtimes) are not always normally distributed: the distribution is usually skewed towards one side. I tend to use the samples themselves as well as their quantiles.
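As a sketch of that quantile-based workflow (mysum and X are defined in the Example section below; stats.time also appears there):

using Statistics
stats = @sample mysum(X)                    # collect many runtime samples
quantile(stats.time, [0.025, 0.5, 0.975])   # inspect the spread, not just a point estimate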
PAPI.sample — Function

sample(f::Function, events::Vector{Event}; max_secs::Float64=5, max_epochs::Int64=1000, gcsample::Bool=false, warmup::Int64=1)

Execute the function f several times, each time counting specific events. Sampling continues until either the maximum number of samples max_epochs is collected or the runtime budget max_secs is exceeded.

Arguments:
- `f`: the function to profile
- `events`: the events to count
- `max_secs`: maximum number of seconds to sample for
- `max_epochs`: maximum number of samples to collect
- `gcsample`: run the GC several times before each execution to reduce GC noise
- `warmup`: number of times to run the function prior to counting

Return values:
An EventStats containing the events, counts and runtimes collected.
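As with profile, there is a function form; a minimal sketch (the workload and events are only illustrative):

stats = PAPI.sample(() -> sum(rand(1000)), Event[PAPI.TOT_INS, PAPI.TOT_CYC]; max_secs=1.0, max_epochs=500)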
PAPI.@sample — Macro

sample(ex, args...)

Convenience macro for sampling an expression. Events can be specified as a first argument; otherwise the default events [BR_INS, BR_MSP, TOT_INS, TOT_CYC] are counted. Arguments and return values are similar to sample.

Example:
@sample f(x, y, z) # sampling default events
@sample [PAPI.TOT_INS, PAPI.DP_OPS, native_event] f(x, y, z) max_secs=1

The resulting EventStats contains the events and all counts collected; when printed, a summary with additional statistics is generated. These statistics are computed using the mean and are mostly intended as information to guide the optimization process.
Example
As an example, consider the following function:
function mysum(X::AbstractArray)
s = zero(eltype(X))
for x in X
s += x
end
s
end
Next, we create an array of double-precision values and produce a profile using the default events. warmup is set to 1 to force compilation prior to measurement:
X = rand(10000)
@profile mysum(X) warmup=1
EventValues:
    BR_INS  = 10261    # 25.0% of all instructions # 587.0 M/sec
    BR_MSP  = 57       # 1.0% of all branches
    TOT_INS = 41495    # 0.573 insn per cycle
    TOT_CYC = 72379    # 4.138 Ghz # 1.744 cycles per insn
    runtime = 17492 nsecs
Since the number of elements added was 10000, note that this takes roughly 4 instructions per element, one of which is a branch instruction corresponding to the termination test of the loop. We also expect the data to be loaded in every iteration, something we can verify by additionally measuring the number of loads using the preset PAPI.LD_INS.
@profile [PAPI.TOT_INS, PAPI.TOT_CYC, PAPI.LD_INS, PAPI.BR_INS] mysum(X)
EventValues:
    TOT_INS = 41380    # 0.57 insn per cycle
    TOT_CYC = 72575    # 3.964 Ghz # 1.754 cycles per insn
    LD_INS  = 10429    # 25.0% of all instructions # 570.0 M/sec
    BR_INS  = 10293    # 25.0% of all instructions # 562.0 M/sec
    runtime = 18310 nsecs
Indeed, in addition to the branch instruction there is a load in every iteration as well. Generally, you would expect a load, a branch, an addition and a counter increment in every iteration, and this data seems to confirm that. However, modern processors are capable of loading and adding multiple data items at the same time. The @simd macro might be useful here. @simd instructs the compiler that it is safe to execute the iterations in arbitrary and overlapping order. This also means that we accept any floating-point differences introduced by this reordering of operations.
function mysum2(X::AbstractArray)
s = zero(eltype(X))
@simd for x in X
s += x
end
s
end
@profile [PAPI.TOT_INS, PAPI.TOT_CYC, PAPI.LD_INS, PAPI.BR_INS] mysum2(X) warmup=1
EventValues:
    TOT_INS = 5770     # 0.211 insn per cycle
    TOT_CYC = 27286    # 5.101 Ghz # 4.729 cycles per insn
    LD_INS  = 2948     # 51.0% of all instructions # 551.0 M/sec
    BR_INS  = 920      # 16.0% of all instructions # 172.0 M/sec
    runtime = 5349 nsecs
Note the reduction in the number of cycles, instructions, branches, loads and runtime. From these measurements, one could assume that the addition of @simd has a positive effect. However, it would be better to collect more samples and perform a statistical test to confirm that the effect is significant.
Finally, let's look at the number of floating-point operations and vectorized floating-point operations.
@profile [PAPI.TOT_INS, PAPI.DP_OPS, PAPI.VEC_DP] mysum(X)
EventValues:
    TOT_INS = 41386
    DP_OPS  = 10000    # 24.0% of all instructions # 1.0x vectorized
    VEC_DP  = 10000    # 24.0% of all instructions
    runtime = 17046 nsecs
@profile [PAPI.TOT_INS, PAPI.DP_OPS, PAPI.VEC_DP] mysum2(X)
EventValues:
    TOT_INS = 5781
    DP_OPS  = 10015    # 173.0% of all instructions # 3.998x vectorized
    VEC_DP  = 2505     # 43.0% of all instructions
    runtime = 9565 nsecs
Both functions perform the same number of additions ($\approx 1$ per element), but mysum2 is able to vectorize those operations (performing almost 4 additions at the same time). Looking back at the earlier comparison, we can also see that the number of loads has decreased by a factor of 4, indicating that the loop now loads roughly four elements simultaneously. The number of branches has decreased even more: inspecting @code_native mysum2(X) indicates that the compiler performs loop unrolling in addition to vectorization.
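To reproduce that inspection, something along these lines works (debuginfo=:none, available on recent Julia versions, suppresses source-line annotations; on x86_64, look for packed instructions such as vaddpd):

@code_native debuginfo=:none mysum2(X)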
Let's now collect multiple samples in order to draw our final conclusion:
stats = @sample mysum(X) # original version
stats2 = @sample mysum2(X) # optimized version
EventStats:
    BR_INS  = [774, 773, 773, 773, 773, 773, 773, 773, 773, 773 … 773, 773, 773, 773, 773, 773, 773, 773, 773, 773]  # 15.0% of all instructions # 580.0 M/sec
    BR_MSP  = [19, 8, 7, 2, 3, 2, 3, 2, 2, 2 … 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]  # 0.0% of all branches
    TOT_INS = [5122, 5119, 5119, 5119, 5119, 5119, 5119, 5119, 5119, 5119 … 5119, 5119, 5119, 5119, 5119, 5119, 5119, 5119, 5119, 5119]  # 0.998 insn per cycle
    TOT_CYC = [22533, 7377, 6621, 6016, 6555, 5483, 6605, 5411, 6313, 6035 … 3516, 3524, 3578, 3756, 3507, 3510, 3483, 3497, 3483, 3497]  # 3.849 Ghz # 1.033 cycles per insn
    runtime = [4719, 1726, 1793, 1620, 1784, 1446, 1778, 1425, 1729, 1617 … 909, 910, 923, 977, 902, 906, 898, 901, 894, 899] nsecs
1000 samples were collected for both versions, for the default events as well as the runtimes. A decision can, for example, be made using a Wilcoxon (or Mann-Whitney U) test, which tests whether the two sets of samples come from the same underlying distribution.
using HypothesisTests
x = stats.time
y = stats2.time
MannWhitneyUTest(x,y)
Approximate Mann-Whitney U test
-------------------------------
Population details:
    parameter of interest:   Location parameter (pseudomedian)
    value under h_0:         0
    point estimate:          12109.5

Test summary:
    outcome with 95% confidence: reject h_0
    two-sided p-value:           <1e-99

Details:
    number of observations in each group: [1000, 1000]
    Mann-Whitney-U statistic:             1.0e6
    rank sums:                            [1.5005e6, 500500.0]
    adjustment for ties:                  11502.0
    normal approximation (μ, σ):          (500000.0, 12913.2)
Rejection of the null hypothesis indicates that the underlying distributions are different and the changes have a significant effect (in this case a positive one). Alternatively, a Student's t-test could be used to perform a similar test. However, the t-test assumes normally distributed samples, and plotting a histogram of the samples clearly indicates this is not the case.
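A quick way to check, assuming a plotting package such as UnicodePlots is installed (any plotting package works); and if you do want a t-test, the Welch variant from HypothesisTests at least drops the equal-variance assumption:

using UnicodePlots, HypothesisTests
histogram(x)                 # the runtime samples are visibly right-skewed
UnequalVarianceTTest(x, y)   # Welch's t-test; still assumes approximate normality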
Using a Monte-Carlo estimator, the expected speedup can be computed as well as 95% uncertainty intervals.
using Statistics
println("expected speedup = ", mean(x ./ y'))
println("95% of the time, the speedup lies within ", quantile(vec(x ./ y'), [.025, 0.975]))
expected speedup = 10.15672911251631
95% of the time, the speedup lies within [7.631516587677726, 14.996757105867978]
PAPI interface
Only a limited number of events can be counted at the same time; num_counters returns the number of hardware counters available on the system.
PAPI.num_counters — Function

Get the number of hardware counters available on the system.

PAPI.num_counters() initializes the PAPI library if necessary.

PAPI_num_counters() returns the number of hardware counters available on the system.
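For example, to check how many events your machine can track at once:

n = PAPI.num_counters()
println("hardware counters available: ", n)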
After selecting the events to count, the counting process can be started and stopped using start_counters and stop_counters. It is the user's responsibility to choose events that can be counted simultaneously by reading the vendor's documentation.
PAPI.start_counters — Function

start_counters(events)

Start counting hardware events. This function cannot be called if the counters have already been started.
PAPI.stop_counters — Function

stop_counters(evtset::EventSet)

Stop counters and return counts. The counters must have been started by a previous call to start_counters.
PAPI.stop_counters! — Function

stop_counters!(evtset::EventSet, values::Vector{Counts})

Stop counters and return current counts. The counters must have been started by a previous call to start_counters. The user must provide a vector of the correct size (equal to the number of events).
events = Event[PAPI.TOT_INS, PAPI.DP_OPS]
evtset = start_counters(events)
computation()
values = stop_counters(evtset)
2-element Vector{Int64}:
 62936334
    10106
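The in-place variant follows the same pattern with a preallocated output vector; a minimal sketch reusing the computation function from the snippet above:

values = Vector{Counts}(undef, length(events))
evtset = start_counters(events)
computation()
stop_counters!(evtset, values)   # counts are written into the provided vector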
During the counting process, the current counts can be queried and accumulated using read_counters! and accum_counters!.
PAPI.read_counters! — Function

read_counters!(evtset::EventSet, values::Vector{Counts})

Read and reset counters. read_counters! copies the event counters into values. The counters are reset and left running after the call. The user must provide a vector of the correct size (equal to the number of events).
PAPI.accum_counters! — Function

accum_counters!(evtset::EventSet, values::Vector{Counts})

Accumulate and reset counters. accum_counters! accumulates the event counters into values. The counters are reset and left running after the call. The user must provide a vector of the correct size (equal to the number of events).
events = Event[PAPI.DP_OPS]
values = Vector{Counts}(undef, length(events))
evtset = start_counters(events)
computation(100) # perform 100 double precision operations
read_counters!(evtset, values)
println(values[1], " roughly equals 100")
computation(100) # perform 100 double precision operations
accum_counters!(evtset, values)
println(values[1], " roughly equals 200")
values[1] = -100
computation(100) # perform 100 double precision operations
accum_counters!(evtset, values)
println(values[1], " roughly equals 0")
stop_counters(evtset); nothing
135 roughly equals 100
415 roughly equals 200
35 roughly equals 0