Introduction to HPC and benchmarking

Computational Biology 2025

Benjamin Rombaut

Ghent University

Simon Van de Vyver

Ghent University

2025-02-21

(Premature) Optimization

Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.

– Donald Knuth in Structured Programming with go to Statements (1974)

The Rules of Optimization Club:

  1. You do not optimize.
  2. You do not optimize without profiling your code first.
  3. Think about external factors like ease-of-use and maintainability.
  4. Only optimize code that already has full unit test coverage.
  5. Optimize one code change or factor at a time.

Tips for optimizing speed

  • Improve implementation: remove unneeded code, reduce work inside loops…
  • Improve data structure: e.g. a dict instead of a list
  • Improve algorithm: e.g. lower time complexity
  • More speed at the expense of simplicity:
    • Improve usage of the CPU, e.g. multiprocessing
      • the Global Interpreter Lock (GIL) prevents true parallelism across Python threads, which is why multiprocessing is used
      • staying single-threaded and running multiple jobs with xargs or GNU parallel can be faster
    • Improve usage of accelerators: use better compiled libraries, e.g. BLAS, numba, cython, pycuda
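The data-structure tip above can be illustrated with a minimal, hypothetical lookup benchmark (the names and sizes are ours): membership tests in a list are O(n), while in a set (hash-based, like dict keys) they are O(1) on average.

```python
import timeit

# Hypothetical data: 10,000 integers stored two ways.
items_list = list(range(10_000))
items_set = set(items_list)

# Time a worst-case membership test (last element) in each structure.
t_list = timeit.timeit(lambda: 9_999 in items_list, number=1_000)
t_set = timeit.timeit(lambda: 9_999 in items_set, number=1_000)
print(f"list: {t_list:.4f}s  set: {t_set:.4f}s")
```

On typical hardware the set lookup wins by several orders of magnitude, at the cost of building the set once up front.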

Tips for optimizing memory

  • Improve data structure
  • Reduce copies of the data structure: e.g. numpy views, pandas Copy-on-Write
  • Avoid materializing the data structure, e.g. generators
    • memory efficient, no barriers
    • elegant and clean functions
    • hard to debug, less familiar
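The generator point can be sketched in a few lines (a toy example of ours): a materialized list holds every element in memory at once, while a generator produces them lazily.

```python
import sys

# Materialized: all 100,000 results exist in memory simultaneously.
squares_list = [n * n for n in range(100_000)]
# Lazy: elements are computed one at a time as they are consumed.
squares_gen = (n * n for n in range(100_000))

print(sys.getsizeof(squares_list))  # grows with the number of elements
print(sys.getsizeof(squares_gen))   # small and constant

# Both yield the same values when consumed.
assert list(squares_gen) == squares_list
```

The trade-off from the slide applies: the generator is memory efficient and composes cleanly, but once consumed it is exhausted, which can make debugging harder.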

Benchmarking and profiling

  • doctest, pytest, unit tests… come first!
  • Read the profile docs
  • A profiler tries to understand the performance of a program
    • Profiling is a type of runtime analysis that operates on large amounts of runtime data and gives you a view of what is happening inside a process.
    • can be as simple as difference of time.perf_counter_ns before and after a function
      • but limited, no insight into effects of caching, variation, statistical significance…
    • don’t manually write down metrics from Windows Task Manager
      • not reproducible or reliable…
  • A benchmark tries to compare different code implementations
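The "difference of time.perf_counter_ns" approach mentioned above can be sketched as follows (a minimal example, with its limitations noted in comments):

```python
import time

def work():
    return sum(range(1_000_000))

# The simplest possible measurement: wall-clock time around one call.
start = time.perf_counter_ns()
result = work()
elapsed_ns = time.perf_counter_ns() - start
print(f"work() took {elapsed_ns / 1e6:.2f} ms")
# Limitations: a single run, so no insight into caching or warm-up
# effects, no variance estimate, no statistical significance.
```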

Prefer timeit for small code snippets

python -m timeit '"-".join(str(n) for n in range(100))'
10000 loops, best of 5: 30.2 usec per loop
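The same measurement can be done from Python with the timeit API; a sketch using the snippet from the shell example above:

```python
import timeit

# repeat() returns one total time per repetition; reporting the best
# (minimum) is conventional, since it is least affected by background noise.
times = timeit.repeat(
    '"-".join(str(n) for n in range(100))',
    repeat=5,
    number=10_000,
)
print(f"{len(times)} repeats, best: {min(times) / 10_000 * 1e6:.1f} usec per loop")
```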

A profiler tries to understand the performance of a program

  • Time profile
  • Memory profile
  • Time tracing profile


The way you visualize the profile is important

Visualize snakeviz

python -m cProfile -o results.prof myscript.py
snakeviz results.prof

Icicle example graph of profiled function calls. A box's width along X is proportional to the cumulative time spent in that function (boxes at the same depth are ordered alphabetically); Y is depth in the call stack, with the top-level caller at the top and the deepest callees at the bottom.
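Instead of the command line, cProfile can also be driven programmatically and inspected with the standard-library pstats module; a sketch (the profiled function is a placeholder of ours):

```python
import cProfile
import io
import pstats

def work():
    # placeholder workload: format and sort 10,000 numbers
    return sorted(str(n) for n in range(10_000))

profiler = cProfile.Profile()
profiler.enable()
work()
profiler.disable()

# Sort by cumulative time and show the 5 most expensive entries.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```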

memory-profiler is still useful, but no longer actively maintained

python -m memory_profiler example.py

Line #    Mem usage    Increment  Occurrences   Line Contents
============================================================
     3   38.816 MiB   38.816 MiB           1   @profile
     4                                         def my_func():
     5   46.492 MiB    7.676 MiB           1       a = [1] * (10 ** 6)
     6  199.117 MiB  152.625 MiB           1       b = [2] * (2 * 10 ** 7)
     7   46.629 MiB -152.488 MiB           1       del b
     8   46.629 MiB    0.000 MiB           1       return a

Visualize in Excel, matplotlib, or KCachegrind (Linux)
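Since memory-profiler is no longer actively maintained, the standard library's tracemalloc covers similar ground; a minimal sketch mirroring the allocations in the example above:

```python
import tracemalloc

tracemalloc.start()
a = [1] * (10 ** 6)
b = [2] * (2 * 10 ** 7)
current, peak = tracemalloc.get_traced_memory()  # bytes
del b
after_del, _ = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"before del b: {current / 2**20:.1f} MiB (peak {peak / 2**20:.1f} MiB)")
print(f"after del b:  {after_del / 2**20:.1f} MiB")
```

Unlike memory-profiler, tracemalloc reports Python-level allocations rather than process RSS, so the numbers are comparable but not identical.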

Tracking forked child processes over time

mprof run --multiprocess example.py
mprof plot

memray

🕵️‍♀️ Traces every function call so it can accurately represent the call stack, unlike sampling profilers.

ℭ Also handles native calls in C/C++ libraries so the entire call stack is present in the results.

🏎 Blazing fast! Profiling slows the application only slightly. Tracking native code is somewhat slower, but this can be enabled or disabled on demand.

📈 It can generate various reports about the collected memory usage data, like flame graphs.

🧵 Works with Python threads.

👽🧵 Works with native-threads (e.g. C++ threads in C extensions).

Profiling multithreaded code with yappi becomes more complex

import yappi
import time
import threading

_NTHREAD = 3


def _work(n):
    time.sleep(n * 0.1)


yappi.start()

threads = []
# generate _NTHREAD threads
for i in range(_NTHREAD):
    t = threading.Thread(target=_work, args=(i + 1, ))
    t.start()
    threads.append(t)
# wait for all threads to finish
for t in threads:
    t.join()

yappi.stop()

# retrieve thread stats by their thread id (given by yappi)
threads = yappi.get_thread_stats()
for thread in threads:
    print(
        "Function stats for (%s) (%d)" % (thread.name, thread.id)
    )  # it is the Thread.__class__.__name__
    yappi.get_func_stats(ctx_id=thread.id).print_all()

Function stats for (Thread) (3)

name                                  ncall  tsub      ttot      tavg
..hon3.7/threading.py:859 Thread.run  1      0.000017  0.000062  0.000062
doc3.py:8 _work                       1      0.000012  0.000045  0.000045

Function stats for (Thread) (2)

name                                  ncall  tsub      ttot      tavg
..hon3.7/threading.py:859 Thread.run  1      0.000017  0.000065  0.000065
doc3.py:8 _work                       1      0.000010  0.000048  0.000048


Function stats for (Thread) (1)

name                                  ncall  tsub      ttot      tavg
..hon3.7/threading.py:859 Thread.run  1      0.000010  0.000043  0.000043
doc3.py:8 _work                       1      0.000006  0.000033  0.000033

Your IDE can help you with profiling and visualization

  • VS Code
  • PyCharm
    • Time profiling
    • Flame Graph
    • Call Tree/Graph
    • Method List

Benchmarking and profiling

  • doctest, pytest, unit tests… come first!
  • Read the profile docs
  • A profiler tries to understand the performance of a program
    • can be as simple as difference of time.perf_counter_ns before and after a function
      • but limited, no insight into effects of caching, variation, statistical significance…
    • don’t manually write down metrics from Windows Task Manager
      • not reproducible or reliable…
  • A benchmark tries to compare different code implementations

A benchmark is much more than just runtime and memory usage

Example of a funkyheatmap. Other metrics are also taken into account, such as scalability, stability, usability…

Open Problems in Single-Cell Analysis

Some general benchmark tools

Some examples

Using the time builtin is very limited

time sleep 0.01
sleep 0.01  0.00s user 0.00s system 11% cpu 0.020 total

Keeping track of time in a bash script has overhead and limited precision

start=$(date +%s.%N)
sleep 0.01
end=$(date +%s.%N)

runtime=$(echo "$end - $start" | bc -l)
echo "$runtime"
.025655000

Good benchmark frameworks run the code multiple times and report statistics

hyperfine --runs 5 'sleep 0.01'
Benchmark 1: sleep 0.01
  Time (mean ± σ):      16.0 ms ±   1.8 ms    [User: 0.3 ms, System: 0.8 ms]
  Range (min … max):    13.7 ms …  18.3 ms    5 runs

Tips for assignments

  • Test the code for correctness
  • Create different input sizes to test scalability
  • Compare with a simple baseline (brute-force algorithm, a no-op to estimate overhead…)
  • Set up a benchmark to compare different implementations
  • Use a profiler to understand the performance of the code and where to focus optimization
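The tips above can be combined into a small, hypothetical benchmark (the task and function names are ours, echoing the frequency-counting example in the repository below): test correctness first, then compare a brute-force baseline against an optimized implementation.

```python
import timeit
from collections import Counter

def freq_baseline(text):
    # brute-force baseline: one pass over the text per distinct character
    return {c: text.count(c) for c in set(text)}

def freq_counter(text):
    # optimized: a single pass using collections.Counter
    return dict(Counter(text))

text = "abracadabra" * 10_000

# Correctness first: both implementations must agree.
assert freq_baseline(text) == freq_counter(text)

# Then compare runtimes over repeated runs, reporting the best.
for fn in (freq_baseline, freq_counter):
    best = min(timeit.repeat(lambda: fn(text), repeat=3, number=10))
    print(f"{fn.__name__}: {best:.4f}s for 10 runs")
```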

HPC-UGent

Benchmarking on the HPC

Some older example code is available at https://github.com/saeyslab/hydra_hpc_example:

  • src/sleep_pbs/README.md is an example used to explain interactive and job-based scheduling with PBS and SLURM. The example sleep script is benchmarked for runtime and memory usage with timeit and memray.
  • src/sleep_hydra/README.md is the same sleep example and benchmarking, but executed with the Hydra framework. More powerful and flexible, but also more complex.
  • src/dask_jobqueue/README.md is an example of how to use Hydra and submitit to launch a Dask jobqueue through the SLURM scheduler.
  • src/frequencies_hydra/README.md is the counting frequencies example with benchmarking. It uses a Python-only configuration, based on hydra-zen.
  • src/dask_batchrunner/README.md is an example of how to use the SLURMRunner from dask-jobqueue to launch a Dask cluster on SLURM. It uses either Pixi or vsc-venv to manage the environment. It currently does not use MPI.
  • src/dask_mympi/README.md is an example of how to use the vsc-mympirun to launch a Dask cluster on SLURM. It uses vsc-venv to manage the environment.