WS23 971007 VU Advanced Methods and its Applications - Performance topics

How to measure performance in Julia

For that, we have a look at a simple vector summation function and include the simple @time macro.

function my_sum(V)
    s = zero(eltype(V))

    for i in eachindex(V)
        @inbounds s += V[i]
    end

    return s
end

V = rand(100_000)
@time my_sum(V)
@time my_sum(V)

  0.007566 seconds (2.21 k allocations: 152.062 KiB, 98.51% compilation time)
  0.000097 seconds (1 allocation: 16 bytes)
49923.9182677654

In order to optimize the loop call we use the @inbounds macro to eliminate inbound checks for the array access and thus improve performance. It is challenging and probably even not possible to prove the safety of methods using @inbounds, so it is essential to consider the trade-off between performance gains and the potential for unnoticed errors. Rule of thumb: Do not use it, unless you really need this performance gain.

The downside with the @time macro is, that it really just measures the execution time of what is given to it. This means, if the function is not already compiled this might include compiling or if the CPU is busy with something else it is often not accurate.

Therefore, if we are serious about measuring performance we should stick to the BenchmarkTools. It comes with a couple of macros that we should test out:

In order to use the BenchmarkTools we need to include it with using BenchmarkTools, as any other package. Benchmark our my_sum function with the following macros:

@benchmark
@btime
Look at the detailed output of your benchmark with dump(t), where t is the output result of a @benchmark run and compare the output and results.
Duplicate my_sum(), name it my_sum_no_inbounds(), remove the @inbounds macro and compare the performance to that of my_sum().

Solution

To measure the performance of the above code we do the following:

using BenchmarkTools

@benchmark my_sum($V)

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  92.542 μs … 131.215 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     92.613 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   93.217 μs ±   2.190 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █   ▂▁                                                       ▁
  █▅▃▅██▃▅▄▂▄▅▇▆▅▅▅▄▅▄▅▅▄▄▄▅▄▅▅▄▅▅▅▇▆▆▆▆▇▇▆▆▆▅▅▅▅▄▄▅▄▇▇▇▆▅▅▅▅▅ █
  92.5 μs       Histogram: log(frequency) by time       103 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

the full details with

t = @benchmark my_sum($V)
dump(t)

BenchmarkTools.Trial
  params: BenchmarkTools.Parameters
    seconds: Float64 5.0
    samples: Int64 10000
    evals: Int64 1
    overhead: Float64 0.0
    gctrial: Bool true
    gcsample: Bool false
    time_tolerance: Float64 0.05
    memory_tolerance: Float64 0.01
  times: Array{Float64}((10000,)) [94116.0, 93164.0, 92583.0, 92644.0, 92713.0, 92704.0, 92653.0, 92874.0, 92613.0, 110998.0  …  99376.0, 92613.0, 92634.0, 92633.0, 92673.0, 92644.0, 92613.0, 92664.0, 92653.0, 92643.0]
  gctimes: Array{Float64}((10000,)) [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0  …  0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
  memory: Int64 0
  allocs: Int64 0

and the often used sanity check, that actually also shows you the output of your code.

@btime my_sum($V)

  92.553 μs (0 allocations: 0 bytes)
49923.9182677654

Skipping the inbound check only gives a minor improvement as you can see when you compare the results to the following version which performs the check:

function my_sum_no_inbounds(V)
    s = zero(eltype(V))

    for i in eachindex(V)
        s += V[i]
    end

    return s
end

@benchmark my_sum_no_inbounds($V)

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  92.543 μs … 115.947 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     92.603 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   93.134 μs ±   2.011 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █   ▂▂                                                       ▁
  █▆▆▇██▆▆▄▄▄▆▇█▇▅▅▆▆▅▆▄▆▁▆▅▅▅▆▅▆▅▄▅▆▆▆▆▆▇▇▇▇▆▆▅▆▅▅▆▅▇██▆▅▅▄▄▅ █
  92.5 μs       Histogram: log(frequency) by time       102 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Note that for benchmarking we often use the $ literal for variables to tell the Julia interpreter to use interpolation. This will make sure that the variable is not allocated inside the function and the measurement is more accurate, or more likely what we actually want to know.

We can also use the Profiler package to really dig into profiling the code but this is a bit too much of a deep dive for this class, it would look like this:

using Profile

Profile.clear()
@profile for _ in 1:100_000; my_sum(V); end
Profile.print(maxdepth=15)

Overhead ╎ [+additional indent] Count File:Line; Function
=========================================================
    ╎8281  @Base/client.jl:495; _start()
    ╎ 8281  @Base/client.jl:309; exec_options(opts::Base.JLOptions)
    ╎  8281  @Base/client.jl:379; run_main_repl(interactive::Bool, quiet::Bool, banner::Bool, history_file::Bool, color_set::Bool)
    ╎   8281  @Base/essentials.jl:714; invokelatest
    ╎    8281  @Base/essentials.jl:716; #invokelatest#2
    ╎     8281  @Base/client.jl:394; (::Base.var"#936#938"{Bool, Bool, Bool})(REPL::Module)
    ╎    ╎ 8281  ...r/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/REPL/src/REPL.jl:351; run_repl(repl::REPL.AbstractREPL, consumer::Any)
    ╎    ╎  8281  ...r/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/REPL/src/REPL.jl:364; run_repl(repl::REPL.AbstractREPL, consumer::Any; backend_on_current_task::Bool)
    ╎    ╎   8281  .../worker/package_linux64/build/usr/share/julia/stdlib/v1.7/REPL/src/REPL.jl:231; start_repl_backend(backend::REPL.REPLBackend, consumer::Any)
    ╎    ╎    8281  .../worker/package_linux64/build/usr/share/julia/stdlib/v1.7/REPL/src/REPL.jl:246; repl_backend_loop(backend::REPL.REPLBackend)
    ╎    ╎     8281  ...worker/package_linux64/build/usr/share/julia/stdlib/v1.7/REPL/src/REPL.jl:150; eval_user_input(ast::Any, backend::REPL.REPLBackend)
    ╎    ╎    ╎ 8281  @Base/boot.jl:373; eval
    ╎    ╎    ╎  8281  .../package_linux64/build/usr/share/julia/stdlib/v1.7/Profile/src/Profile.jl:28; top-level scope
   1╎    ╎    ╎   8281  REPL[18]:1; macro expansion
    ╎    ╎    ╎    1     REPL[7]:3; my_sum(V::Vector{Float64})
    ╎    ╎    ╎     1     @Base/abstractarray.jl:279; eachindex
    ╎    ╎    ╎    8279  REPL[7]:4; my_sum(V::Vector{Float64})
    ╎    ╎    ╎     8279  @Base/range.jl:837; iterate
Total snapshots: 16562

CC BY-NC-SA 4.0 - Gregor Ehrensperger, Peter Kandolf. Last modified: January 18, 2024. Website built with Franklin.jl and the Julia programming language.