Counting Events
PAPI provides the ability to start, stop, read, and accumulate the counters for a specified list of events. PAPI.jl exposes this functionality (see the PAPI interface section below) but also builds more user-friendly primitives on top of it. PAPI.jl stores count values using the Counts type.
Measuring Performance
PAPI.jl provides two ways of measuring performance counters:
- quick, one-shot measurements via profile
- extensive, sample-based measurements via sample
Profile
profile and @profile can be used to quickly measure performance counts for a function or expression. Since the function is executed only once, these counts should be taken with a grain of salt: they can be riddled with noise, especially for short-running functions.
PAPI.profile — Function

profile(f::Function, events::Vector{Event}; gcfirst::Bool=true, warmup::Int64=0)

Execute the function f once while counting specific events.

Arguments:
- `f`: the function to profile
- `events`: the events to count
- `gcfirst`: run the GC several times before the execution to reduce GC noise
- `warmup`: number of times to run the function prior to counting

Return values:
An EventValues containing the events, counts and runtime collected.
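The same measurement can be made without the macro shown below; a minimal sketch of the function form (the anonymous workload and event list here are only illustrative):

evts = PAPI.profile(() -> sum(rand(1000)), Event[PAPI.TOT_INS, PAPI.TOT_CYC]; warmup=1)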
PAPI.@profile — Macro

profile(ex, args...)

Convenience macro for profiling an expression. Events can be specified as a first argument; otherwise the default events [BR_INS, BR_MSP, TOT_INS, TOT_CYC] are counted. Arguments and return values are similar to profile.

Example:
@profile f(x, y, z) # profiling default events
@profile [PAPI.TOT_INS, PAPI.DP_OPS, native_event] f(x, y, z) gcfirst=false

The resulting EventValues contains the events and counts collected; when printed, a summary with additional information is generated.
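If you want programmatic access to the collected numbers rather than the printed summary, you can introspect the returned value; a minimal sketch using standard Julia reflection (the exact field names of EventValues are not spelled out here, so we query them rather than assume them):

X = rand(10000)
evts = @profile sum(X)
fieldnames(typeof(evts))   # discover which fields (events, counts, runtime, ...) EventValues exposes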
Sample
Contrary to profile, sample and @sample perform multiple executions. The resulting samples can be used to investigate the distribution of the counts and to perform various statistical tests.
You should be careful when basing performance decisions on single or aggregate values (such as the mean, minimum, or median), as well as on "eyeball" statistics. Statistical tests such as the t-test or the Wilcoxon test can be used instead. Personally, I am also wary of asymptotic tests that put strong assumptions on the distributions (such as the t-test). For example, performance numbers (even runtimes) are not always normally distributed: the distribution is usually skewed towards one side. I tend to use the samples themselves as well as their quantiles.
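As a sketch of that quantile-based workflow (mysum and X are defined in the Example section below; stats.time also appears there):

using Statistics
stats = @sample mysum(X)                    # collect many runtime samples
quantile(stats.time, [0.025, 0.5, 0.975])   # inspect the spread, not just a point estimate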
PAPI.sample — Function

sample(f::Function, events::Vector{Event}; max_secs::Float64=5, max_epochs::Int64=1000, gcsample::Bool=false, warmup::Int64=1)

Execute the function f several times, each time counting specific events. Sampling continues until either the maximum number of samples max_epochs is collected or the runtime budget max_secs is exceeded.

Arguments:
- `f`: the function to profile
- `events`: the events to count
- `max_secs`: maximum number of seconds to sample for
- `max_epochs`: maximum number of samples to collect
- `gcsample`: run the GC several times before each execution to reduce GC noise
- `warmup`: number of times to run the function prior to counting

Return values:
An EventStats containing the events, counts and runtimes collected.
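As with profile, there is a function form; a minimal sketch (the workload and events are only illustrative):

stats = PAPI.sample(() -> sum(rand(1000)), Event[PAPI.TOT_INS, PAPI.TOT_CYC]; max_secs=1.0, max_epochs=500)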
PAPI.@sample — Macro

sample(ex, args...)

Convenience macro for sampling an expression. Events can be specified as a first argument; otherwise the default events [BR_INS, BR_MSP, TOT_INS, TOT_CYC] are counted. Arguments and return values are similar to sample.

Example:
@sample f(x, y, z) # sampling default events
@sample [PAPI.TOT_INS, PAPI.DP_OPS, native_event] f(x, y, z) max_secs=1

The resulting EventStats contains the events and all counts collected; when printed, a summary with additional statistics is generated. These statistics are computed using the mean and are mostly intended as information to guide the optimization process.
Example
As an example, consider the following function:
function mysum(X::AbstractArray)
s = zero(eltype(X))
for x in X
s += x
end
s
end
Next, we create an array of double-precision values and produce a profile using the default events. warmup is set to 1 to force compilation prior to measurement:
X = rand(10000)
@profile mysum(X) warmup=1
EventValues:
    BR_INS  = 10261    # 25.0% of all instructions # 587.0 M/sec
    BR_MSP  = 57       # 1.0% of all branches
    TOT_INS = 41495    # 0.573 insn per cycle
    TOT_CYC = 72379    # 4.138 Ghz # 1.744 cycles per insn
    runtime = 17492 nsecs
Since the number of elements added was 10000, note that this takes roughly 4 instructions per element, one of which is a branch instruction corresponding to the termination test of the loop. We also expect the data to be loaded in every iteration, something we can verify by additionally measuring the number of loads using the preset PAPI.LD_INS.
@profile [PAPI.TOT_INS, PAPI.TOT_CYC, PAPI.LD_INS, PAPI.BR_INS] mysum(X)
EventValues:
    TOT_INS = 41380    # 0.57 insn per cycle
    TOT_CYC = 72575    # 3.964 Ghz # 1.754 cycles per insn
    LD_INS  = 10429    # 25.0% of all instructions # 570.0 M/sec
    BR_INS  = 10293    # 25.0% of all instructions # 562.0 M/sec
    runtime = 18310 nsecs
Indeed, in addition to the branch instruction there is a load in every iteration as well. Generally, you would expect a load, a branch, an addition and a counter increment in every iteration, and this data seems to confirm that. However, modern processors are capable of loading and adding multiple data items at the same time. The @simd macro might be useful here. @simd instructs the compiler that it is safe to execute the iterations in arbitrary and overlapping order. This also means that we accept any floating-point differences introduced by this reordering of operations.
function mysum2(X::AbstractArray)
s = zero(eltype(X))
@simd for x in X
s += x
end
s
end
@profile [PAPI.TOT_INS, PAPI.TOT_CYC, PAPI.LD_INS, PAPI.BR_INS] mysum2(X) warmup=1
EventValues:
    TOT_INS = 5770     # 0.211 insn per cycle
    TOT_CYC = 27286    # 5.101 Ghz # 4.729 cycles per insn
    LD_INS  = 2948     # 51.0% of all instructions # 551.0 M/sec
    BR_INS  = 920      # 16.0% of all instructions # 172.0 M/sec
    runtime = 5349 nsecs
Note the reduction in the number of cycles, instructions, branches, loads and runtime. From these measurements, one could assume that the addition of @simd has a positive effect. However, it would be better to collect more samples and perform a statistical test to confirm that the effect is significant.
Finally, let's look at the number of floating-point operations and vectorized floating-point operations.
@profile [PAPI.TOT_INS, PAPI.DP_OPS, PAPI.VEC_DP] mysum(X)
EventValues:
    TOT_INS = 41386
    DP_OPS  = 10000    # 24.0% of all instructions # 1.0x vectorized
    VEC_DP  = 10000    # 24.0% of all instructions
    runtime = 17046 nsecs
@profile [PAPI.TOT_INS, PAPI.DP_OPS, PAPI.VEC_DP] mysum2(X)
EventValues:
    TOT_INS = 5781
    DP_OPS  = 10015    # 173.0% of all instructions # 3.998x vectorized
    VEC_DP  = 2505     # 43.0% of all instructions
    runtime = 9565 nsecs
Both functions perform the same number of additions ($\approx 1$ per element), but mysum2 is able to vectorize those operations (performing almost 4 additions at the same time). Looking back at the earlier comparison, we can also see that the number of loads has decreased by a factor of 4, indicating that the loop now loads roughly four elements simultaneously. The number of branches has decreased even more: inspecting @code_native mysum2(X) indicates that the compiler performs loop unrolling in addition to vectorization.
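To reproduce that inspection, something along these lines works (debuginfo=:none, available on recent Julia versions, suppresses source-line annotations; on x86_64, look for packed instructions such as vaddpd):

@code_native debuginfo=:none mysum2(X)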
Let's now collect multiple samples in order to draw our final conclusion:
stats = @sample mysum(X) # original version
stats2 = @sample mysum2(X) # optimized version
EventStats:
    BR_INS  = [774, 773, 773, 773, 773, 773, 773, 773, 773, 773 … 773, 773, 773, 773, 773, 773, 773, 773, 773, 773]  # 15.0% of all instructions # 580.0 M/sec
    BR_MSP  = [19, 8, 7, 2, 3, 2, 3, 2, 2, 2 … 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]  # 0.0% of all branches
    TOT_INS = [5122, 5119, 5119, 5119, 5119, 5119, 5119, 5119, 5119, 5119 … 5119, 5119, 5119, 5119, 5119, 5119, 5119, 5119, 5119, 5119]  # 0.998 insn per cycle
    TOT_CYC = [22533, 7377, 6621, 6016, 6555, 5483, 6605, 5411, 6313, 6035 … 3516, 3524, 3578, 3756, 3507, 3510, 3483, 3497, 3483, 3497]  # 3.849 Ghz # 1.033 cycles per insn
    runtime = [4719, 1726, 1793, 1620, 1784, 1446, 1778, 1425, 1729, 1617 … 909, 910, 923, 977, 902, 906, 898, 901, 894, 899] nsecs
1000 samples were collected for both versions, for the default events as well as the runtimes. A decision can, for example, be made using a Wilcoxon (or Mann-Whitney U) test, which tests whether the two sets of samples come from the same underlying distribution.
using HypothesisTests
x = stats.time
y = stats2.time
MannWhitneyUTest(x,y)
Approximate Mann-Whitney U test
-------------------------------
Population details:
    parameter of interest:   Location parameter (pseudomedian)
    value under h_0:         0
    point estimate:          12109.5

Test summary:
    outcome with 95% confidence: reject h_0
    two-sided p-value:           <1e-99

Details:
    number of observations in each group: [1000, 1000]
    Mann-Whitney-U statistic:             1.0e6
    rank sums:                            [1.5005e6, 500500.0]
    adjustment for ties:                  11502.0
    normal approximation (μ, σ):          (500000.0, 12913.2)
Rejection of the null hypothesis indicates that the underlying distributions are different and the changes have a significant effect (in this case a positive one). Alternatively, a Student's t-test could be used to perform a similar test. However, the t-test assumes normally distributed samples, and plotting a histogram of the samples clearly indicates this is not the case.
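A quick way to check, assuming a plotting package such as UnicodePlots is installed (any plotting package works); and if you do want a t-test, the Welch variant from HypothesisTests at least drops the equal-variance assumption:

using UnicodePlots, HypothesisTests
histogram(x)                 # the runtime samples are visibly right-skewed
UnequalVarianceTTest(x, y)   # Welch's t-test; still assumes approximate normality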
Using a Monte-Carlo estimator, the expected speedup can be computed as well as 95% uncertainty intervals.
using Statistics
println("expected speedup = ", mean(x ./ y'))
println("95% of the time, the speedup lies within ", quantile(vec(x ./ y'), [.025, 0.975]))
expected speedup = 10.15672911251631
95% of the time, the speedup lies within [7.631516587677726, 14.996757105867978]
PAPI interface
Only a limited number of events can be counted at the same time; num_counters returns the number of hardware counters available on the system.
PAPI.num_counters — Function

Get the number of hardware counters available on the system.

PAPI.num_counters() initializes the PAPI library if necessary.

PAPI_num_counters() returns the number of hardware counters available on the system.
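For example, to check how many events your machine can track at once:

n = PAPI.num_counters()
println("hardware counters available: ", n)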
After selecting the events to count, the counting process can be started and stopped using start_counters and stop_counters. It is the user's responsibility to choose events that can be counted simultaneously by reading the vendor's documentation.
PAPI.start_counters — Function

start_counters(events)

Start counting hardware events. This function cannot be called if the counters have already been started.
PAPI.stop_counters — Function

stop_counters(evtset::EventSet)

Stop counters and return counts. The counters must have been started by a previous call to start_counters.
PAPI.stop_counters! — Function

stop_counters!(evtset::EventSet, values::Vector{Counts})

Stop counters and return current counts. The counters must have been started by a previous call to start_counters. The user must provide a vector of the correct size (equal to the number of events).
events = Event[PAPI.TOT_INS, PAPI.DP_OPS]
evtset = start_counters(events)
computation()
values = stop_counters(evtset)
2-element Vector{Int64}:
 62936334
    10106
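The in-place variant follows the same pattern with a preallocated output vector; a minimal sketch reusing the computation function from the snippet above:

values = Vector{Counts}(undef, length(events))
evtset = start_counters(events)
computation()
stop_counters!(evtset, values)   # counts are written into the provided vector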
During the counting process, the current counts can be queried and accumulated using read_counters! and accum_counters!.
PAPI.read_counters! — Function

read_counters!(evtset::EventSet, values::Vector{Counts})

Read and reset counters. read_counters! copies the event counters into values. The counters are reset and left running after the call. The user must provide a vector of the correct size (equal to the number of events).
PAPI.accum_counters! — Function

accum_counters!(evtset::EventSet, values::Vector{Counts})

Accumulate and reset counters. accum_counters! accumulates the event counters into values. The counters are reset and left running after the call. The user must provide a vector of the correct size (equal to the number of events).
events = Event[PAPI.DP_OPS]
values = Vector{Counts}(undef, length(events))
evtset = start_counters(events)
computation(100) # perform 100 double precision operations
read_counters!(evtset, values)
println(values[1], " roughly equals 100")
computation(100) # perform 100 double precision operations
accum_counters!(evtset, values)
println(values[1], " roughly equals 200")
values[1] = -100
computation(100) # perform 100 double precision operations
accum_counters!(evtset, values)
println(values[1], " roughly equals 0")
stop_counters(evtset); nothing
135 roughly equals 100
415 roughly equals 200
35 roughly equals 0