Writing a benchmark report

Computational Biology 2024

Benjamin Rombaut

Ghent University

2024-03-15

A benchmark report is a form of scientific communication

Every engineer is also a writer

An open course for Google software engineers and a good resource for learning how to write technical reports.

  • Technical Writing One: Learn the critical basics of technical writing. Take this course before taking any of the other courses.
  • Technical Writing Two: Practice four intermediate topics in technical writing.
  • Tech Writing for Accessibility: Develop skills for making documentation more accessible to all.
  • Writing Helpful Error Messages: Write clearer, more effective error messages, whether they appear in IDEs, command lines, or GUIs. This course is online only.

Academic writing: part of the scientific method

The Scientific Method - Wikipedia

You will see this again in e.g. Machine Learning (C003758) and in the Master’s thesis.

Academic writing: structure

Most academic writing and reporting follows a similar structure:

  1. Introduction: brief overview of the relevant literature and explanation of what you did, how you did it, and the core finding.
  2. Methods: should contain enough detail for a colleague to repeat the experiments.
  3. Results: not just a big data dump, but point out key patterns in your results in connection to the goal of your research, including figures.
  4. Discussion: provide an interpretation of the results, be self-critical.
  5. Conclusion: think about what’s next.

Academic writing: The ABCs of writing style

  • Accurate: Be precise.
    • Bad: The majority of samples came from dataset X
    • Better: 74% of samples were randomly selected from dataset X, with the remaining 26% coming from Y (see Table 1).
  • Brief: Don’t be redundant. Don’t repeat yourself. Avoid making the same point twice. Try not to rephrase the same idea multiple times. Just make your point once.
  • Clear: Don’t try to say multiple things at the same time.
    • Bad: Whereas the samples from class X, which is the most difficult to separate, and for which the most errors were made by the model, represent only 10% of the training set, the samples from class Y represent the other 90%.
    • Better: 10% of the samples belong to class X, whereas the other 90% belong to class Y. Class X is the most difficult to separate. This can also be seen in the model’s increased error rate for class X.

Academic writing: Common issues in writing style

  • Ambiguity
    • Incomplete comparison: X is greater.
    • Ambiguous reference: It was working well.
    • Anthropomorphism: The model didn’t want to converge.
  • Unnecessary complexity
    • Don’t use flowery language (e.g. instead of utilize, just use use)
    • However: don’t get too informal! e.g. write do not instead of don’t
  • Passive voice
    • It’s OK to write “we used X to determine Y” instead of “X was used to determine Y”.
    • Using active voice for your own work and passive for other people’s work can improve clarity.

Academic writing: Citing other works

  • You should add a citation whenever:
    • You make some claim that was proven in another paper
    • You use some idea or algorithm from another paper
    • You mention existing work
  • Don’t do this manually! Use tools like biblatex, Zotero, browser plugins…
  • More info: Bibliography management with biblatex
  • Can’t find a citation? Make sure your VPN is active so you can use the UGent network access to articles, more info here. Alternatively, SciHub exists.
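
A minimal sketch of such a biblatex setup (the .bib file name and the citation key are placeholders):

% Preamble: load biblatex with the biber backend and register the bibliography file
\usepackage[backend=biber]{biblatex}
\addbibresource{references.bib} % e.g. exported automatically from Zotero

% In the text: cite by key instead of typing references manually
As shown by \textcite{smith2023}, …

% At the end of the document: print all cited entries
\printbibliography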

Visualizations for performance evaluation

List of datasets

  • A good benchmark is performed on a variety of datasets. They can be summarized in a table with the following information:
    • Dataset name
    • Citation to the paper introducing the dataset
    • Link to fully reproduce that specific dataset
    • Number of samples
    • Number of features

  • Additional information is optional, but can be useful:
    • Comparison measure
      • just using e.g. the image dimensions is not always comparable across datasets; the total number of pixels in the image is more useful for comparison.
    • Metadata
      • origin of the dataset, e.g. images of healthy human liver cells
    • Annotations
      • how the dataset was annotated, e.g. with a certain tool or method, by a pathologist
      • which annotation classes were used
    • Indication of complexity
      • how much class imbalance is there?
      • performance of a simple baseline model, e.g. random guessing or predicting the most common label (see the sketch below)
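
A minimal sketch of such baselines with scikit-learn (the dataset here is a stand-in; plug in the benchmark dataset under study):

from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Stand-in dataset; replace with the benchmark dataset under study
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Two simple baselines: uniform random guessing and always predicting
# the most frequent label in the training set
for strategy in ["uniform", "most_frequent"]:
    baseline = DummyClassifier(strategy=strategy, random_state=0).fit(X_train, y_train)
    print(strategy, baseline.score(X_test, y_test))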

List of methods

The method table should contain the following columns:

  • Method name
  • Citation to the paper introducing the method
  • Link to the code

Additional information is optional, but can be useful:

  • Hyperparameters
    • which hyperparameters were used for the benchmark
  • Training time
    • how long did it take to train the model
  • Indication of complexity
    • how many parameters does the model have

List of evaluation metrics

  • The evaluation metrics used should be listed and explained.
  • e.g. accuracy, precision, recall, F1-score, ROC AUC
  • The metrics should be explained in a way that a non-expert can understand them.
  • It should be clear what is preferred for each metric, e.g. high values mean better performance.
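
A minimal sketch of computing such metrics with scikit-learn (the labels, predictions and probabilities below are made up):

from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Made-up ground truth, hard predictions and predicted probabilities
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]
y_prob = [0.2, 0.6, 0.9, 0.8, 0.4, 0.1, 0.7, 0.3]

# Higher is better for all of these metrics
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
print("ROC AUC  :", roc_auc_score(y_true, y_prob))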

A benchmark is much more than just the metrics

Example of a funkyheatmap. Other metrics are also taken into account, such as scalability, stability, usability…

Consider also including other metrics like usability of the method. To help readers understand when to use which method, a decision-making flowchart can be helpful.

Interesting plots and figures

Performance table

  • A table with the performance of each method on each dataset.
  • The table should contain the following:
    • Method name
    • Per Dataset
      • Evaluation metric 1
      • Evaluation metric 2
  • You can create a table with pandas and use the pandas.DataFrame Styler to make it look nice in LaTeX.

Example of pandas.DataFrame Styler
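
A minimal sketch of building and exporting such a table with the pandas Styler (method names, datasets and scores are made up):

import pandas as pd

# Made-up benchmark results: one row per method, one column per (dataset, metric) pair
results = pd.DataFrame(
    {
        ("Dataset A", "Accuracy"): [0.91, 0.87],
        ("Dataset A", "F1"): [0.89, 0.84],
        ("Dataset B", "Accuracy"): [0.78, 0.81],
        ("Dataset B", "F1"): [0.75, 0.80],
    },
    index=["Method X", "Method Y"],
)

# Format to two decimals, bold the best score per column and export to LaTeX
styler = results.style.format("{:.2f}").highlight_max(axis=0, props="textbf:--rwrap;")
print(styler.to_latex(hrules=True))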

Performance plot

  • A plot with the performance of each method on each dataset.
  • Also include the variance of the performance, e.g. with error bars.

https://www.galileo.fbw.ugent.be/schrijven/wat-met-lay-out-tabellen-en-figuren/

Seaborn makes it easy to create complex plots

import seaborn as sns
sns.set_theme(style="whitegrid")

penguins = sns.load_dataset("penguins")

# Draw a nested barplot by species and sex
g = sns.catplot(
    data=penguins, kind="bar",
    x="species", y="body_mass_g", hue="sex",
    errorbar="sd", palette="dark", alpha=.6, height=6
)
g.despine(left=True)
g.set_axis_labels("", "Body mass (g)")
g.legend.set_title("")

https://seaborn.pydata.org/examples/grouped_barplot.html

More bespoke performance plot

  • Note that the same data is shown 3 times: as datapoints, as a boxplot and as a distribution. This gives a more complete picture of the data.
  • Missing datapoints due to memory errors are also included in the plot.

https://www.nature.com/articles/s41592-021-01326-w/figures/1

Scalability plot for time

  • Note the use of the speedup factor on the y-axis and the dashed line for the theoretical upper bound on the speedup (see the sketch below).

https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04736-5/figures/2
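
A minimal sketch of such a plot with matplotlib (worker counts and timings below are made up; in practice they come from your time measurements):

import matplotlib.pyplot as plt

# Made-up wall-clock times (seconds) for 1, 2, 4 and 8 workers
workers = [1, 2, 4, 8]
runtimes = [100.0, 55.0, 32.0, 21.0]

# Speedup relative to the single-worker run
speedup = [runtimes[0] / t for t in runtimes]

fig, ax = plt.subplots()
ax.plot(workers, speedup, marker="o", label="measured")
# Dashed line: theoretical upper bound (linear speedup)
ax.plot(workers, workers, linestyle="--", label="ideal")
ax.set_xlabel("Number of workers")
ax.set_ylabel("Speedup factor")
ax.legend()
plt.show()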

Scalability plot for memory

https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04736-5/figures/3

Much more bespoke scalability plots (not for the assignment)

https://www.nature.com/articles/s41587-019-0071-9/figures/10

Git burndown chart

X: development time; Y: best performance of current code

Shows how the performance of the method evolves over time as a downwards step function. This can be useful to show the best points of improvement (e.g. 5 improvement points A-E explained in a separate table). You can use git tags to mark the method at a certain point in time (see the sketch below).
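
A minimal sketch of such a chart (tag dates and runtimes below are made up; in practice you would collect one measurement per git tag):

from datetime import date

import matplotlib.pyplot as plt

# Made-up (git tag date, runtime in seconds) pairs, one per tagged version
tag_dates = [date(2024, 1, 10), date(2024, 1, 24), date(2024, 2, 7), date(2024, 2, 28)]
runtimes = [120.0, 95.0, 60.0, 42.0]

# Best runtime so far: a downwards step function over development time
best_so_far = []
best = float("inf")
for t in runtimes:
    best = min(best, t)
    best_so_far.append(best)

fig, ax = plt.subplots()
ax.step(tag_dates, best_so_far, where="post")
ax.set_xlabel("Development time (git tag date)")
ax.set_ylabel("Best runtime so far (s)")
plt.show()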

Tips for assignment

See the rubric for the assignment.

  • Formatting
    • template
    • figures and tables
  • Reporting
    • language use
    • narrative structure
    • abstract
  • Algorithms and data structures
    • algorithms
    • time measurements
    • memory measurements