Data visualization: Communicating insights in academic publishing

Using R and ggplot2

Peder Rustøen Braadland

Who am I?

Peder Rustøen Braadland

Master in biotechnology (NTNU, 2013)

PhD at the Institute for Cancer Research (2020)

Current:

Postdoc at the NoPSC Research Center

Mostly working with bioinformatics, data analysis, biostatistics…

braadland.bsky.social

peder.quarto.pub/blog/

pbraadland

Outline

  • Short introduction to data visualization
  • Types of plots
  • Colors
  • Creating publication-ready plots
  • Grant applications, illustrations
  • Some example data visualizations

Why is data visualization important?

  • Data visualization is storytelling with data
  • Communicates insights faster than “raw” numbers
    • Convert tabular data into plots and tables
  • Helps us draw insights from complex, otherwise opaque data sets
  • Helps identify patterns and trends
  • Helps the reader understand your data more easily
  • Can aid in decision-making
    • Scientific research: Which prognostic marker looks most promising?
    • Public health (FHI, the Norwegian Institute of Public Health): Which measures will be most effective at limiting the spread of Covid-19?

Data visualization is all around us

Common (mis)conceptions about data visualization

  • Claim: Graphs should be dull and preferably grayscale
    • Remnant from when color figures were expensive to print?
    • Visually appealing figures sell better and improve comprehension and accessibility
  • Claim: The analyst must create data visualizations in an unbiased way
    • When we write, we tell a story. We should have this academic freedom also when producing figures
    • One data set can have a thousand stories to tell. As data visualizers it’s our job to direct the reader’s attention
  • Claim: People know how to read graphs
    • People read and interpret graphs in different ways (even researchers!)
    • Self-explanatory and intuitive data visualizations can be understood faster and more easily

The branches of data visualization

Context              Priority
Art                  Data should be beautiful
Public information   Clear messages, intuitive, simple
Scientific paper     Data accuracy, details, statistics
Presentations        Clear messages, engage the audience

  • Data visualization “rules” differ by context
  • Some “rules” exist
    • High data-to-ink ratio?
    • Honest representation (e.g. no axis truncation)?
    • One plot – one message
  • My opinion: Do what works, discuss with colleagues, friends, family
    • Key point is that your visualization manages to convey your message to the target audience

How do we choose the right data visualization?

Some examples

Endless opportunities

The bar plot

The workhorse of academic figures

Several questions arise:

  • What does the top of the bar indicate?
    • Count, median, mean, max?
  • Does the bar area matter?
  • How many samples were measured?
  • Was there a distribution of y-values?

Visualizing distributions and uncertainty using bar plots

Be mindful of which summary statistics you show

Bars typically indicate the mean of some parameter, and we should communicate the uncertainty associated with our estimate.

  • Standard deviation (SD): Reflects the variability of the individual observations around the mean
  • Standard error of the mean (SEM): Reflects how precisely the sample mean estimates the population mean
    • Shrinks as the number of samples grows (SEM = SD / √n)
    • Relates more to inferential statistics and hypothesis testing
  • The standard deviation is preferred for showing variability in bar plots
    • We use p-values to draw inference, and the standard deviation to reflect variability
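
The difference between the two statistics can be sketched in a few lines of base R. The data are simulated and the helper name sem() is an assumption made for this illustration. The SD stays close to the simulated value of 10 for both sample sizes, while the SEM shrinks as n grows.

```r
# Simulated measurements (hypothetical values, for illustration only)
set.seed(1)
small <- rnorm(10, mean = 50, sd = 10)
large <- rnorm(1000, mean = 50, sd = 10)

# Standard error of the mean: SD divided by the square root of n
sem <- function(x) sd(x) / sqrt(length(x))

summary_table <- data.frame(
  n    = c(length(small), length(large)),
  mean = c(mean(small), mean(large)),
  sd   = c(sd(small), sd(large)),
  sem  = c(sem(small), sem(large))
)
print(summary_table)
```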

The boxplot

The box plot overcomes many of the challenges with the bar plot

  • Can be used when data are skewed or follow a non-normal distribution
  • The plot components (box, median line and whiskers) are “standardized”
  • Automatically computes and plots:
    • Median
    • 25th and 75th percentiles
    • Whiskers (extending to the most extreme data point within 1.5 * IQR of the box) and outliers (points beyond the whiskers)
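
As a rough sketch of what a box plot computes, the components can be reproduced by hand in base R. The data vector is made up for illustration, and the quartiles use R's default quantile() method; other software may compute the hinges slightly differently.

```r
# Hypothetical measurements with one extreme value
x <- c(2, 3, 4, 5, 6, 7, 8, 9, 10, 30)

# Box: 25th percentile, median, 75th percentile
q <- quantile(x, probs = c(0.25, 0.5, 0.75), names = FALSE)
iqr <- q[3] - q[1]

# Fences lie 1.5 * IQR outside the box
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[3] + 1.5 * iqr

# Whiskers end at the most extreme data points inside the fences
whisker_low  <- min(x[x >= lower_fence])
whisker_high <- max(x[x <= upper_fence])

# Points outside the fences are drawn individually as outliers
outliers <- x[x < lower_fence | x > upper_fence]

print(c(whisker_low = whisker_low, q1 = q[1], median = q[2],
        q3 = q[3], whisker_high = whisker_high))
print(outliers)
```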

The boxplot can deceive

  • Different distributions can give rise to similar-looking box plots

  • We see that adding individual points gives the reader more context than a box plot alone
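
A minimal ggplot2 sketch of overlaying the individual observations on a box plot. The two groups are simulated (one unimodal, one bimodal) purely for illustration; the group names and parameter values are assumptions.

```r
library(ggplot2)

# Simulated data: one unimodal and one bimodal group (hypothetical)
set.seed(42)
df <- data.frame(
  group = rep(c("Unimodal", "Bimodal"), each = 60),
  value = c(rnorm(60, mean = 10, sd = 2),
            rnorm(30, mean = 7.5, sd = 0.6),
            rnorm(30, mean = 12.5, sd = 0.6))
)

p <- ggplot(df, aes(x = group, y = value)) +
  # Hide the box plot's own outlier symbols so points are not drawn twice
  geom_boxplot(outlier.shape = NA, width = 0.5) +
  # Jitter horizontally only, so the y-values stay exact
  geom_jitter(width = 0.12, height = 0, alpha = 0.5, size = 1)
p
```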

The raincloud plot

Shows distributions as well as the number of samples

Pie charts

  • Often criticized
    • Difficult to estimate quantity by angles
    • Difficult to match labels with pie slices

  • But: they illustrate parts of a whole well
    • Microbiome composition
  • Stacked bar plots are an option
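
One way to draw parts of a whole as stacked bars in ggplot2 is sketched below. The sample names, taxa and abundances are invented for illustration.

```r
library(ggplot2)

# Hypothetical relative abundances (percent) for three samples
df <- data.frame(
  sample    = rep(c("S1", "S2", "S3"), each = 4),
  taxon     = rep(c("Bacteroidetes", "Firmicutes", "Proteobacteria", "Other"), times = 3),
  abundance = c(40, 35, 15, 10,
                25, 50, 10, 15,
                55, 20,  5, 20)
)

p <- ggplot(df, aes(x = sample, y = abundance, fill = taxon)) +
  # position = "fill" rescales every bar to the same total height
  geom_col(position = "fill") +
  scale_y_continuous(labels = function(x) paste0(100 * x, "%")) +
  labs(x = NULL, y = NULL, fill = NULL)
p
```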

Alternative to pie chart

Visualizing uncertainty

  • Error bars
  • Different types of plots can benefit from other ways to visualize uncertainty

Confidence bands

An alternative to the confidence interval?
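
A sketch of a confidence band around a fitted line, assuming a simple linear model on simulated data. The band comes from predict() with interval = "confidence" and is drawn with geom_ribbon(); geom_smooth(method = "lm") would be a shorter route to a similar picture.

```r
library(ggplot2)

# Simulated data with a linear trend (hypothetical)
set.seed(7)
df <- data.frame(x = seq(1, 20, length.out = 40))
df$y <- 2 + 0.5 * df$x + rnorm(40, sd = 1.5)

# Fit the model and compute a 95% confidence interval for the mean response
fit  <- lm(y ~ x, data = df)
band <- as.data.frame(predict(fit, newdata = df, interval = "confidence", level = 0.95))
df   <- cbind(df, band)  # adds the columns fit, lwr and upr

p <- ggplot(df, aes(x = x)) +
  geom_ribbon(aes(ymin = lwr, ymax = upr), fill = "grey80") +
  geom_line(aes(y = fit)) +
  geom_point(aes(y = y), size = 1)
p
```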

Colors

Color palettes

Sequential:

Ordered data - low to high

Intuitively, light colors are low, and dark colors are high

Diverging:

Works to illustrate mid values and extreme ends

Typically useful for scaled data

Qualitative:

Suitable for categorical or nominal data
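
The three palette types can be generated in base R with hcl.colors() (R 3.6 or later). The palette names below are just examples from hcl.pals(); they are not the palettes shown on the slide.

```r
# Sequential: reversed so the palette runs from light (low) to dark (high)
sequential <- hcl.colors(5, palette = "Blues", rev = TRUE)

# Diverging: two hues meeting in a neutral midpoint
diverging <- hcl.colors(7, palette = "Blue-Red")

# Qualitative: distinct hues with no implied order
qualitative <- hcl.colors(4, palette = "Dark 2")

print(list(sequential = sequential, diverging = diverging, qualitative = qualitative))
```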

Choosing a palette

  • Around 8% of men and 0.5% of women are color blind
    • If you like green, don’t combine it with orange/red or blue
    • By varying lightness, everyone can differentiate the colors
    • Vary shapes or patterns
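
A small sketch that combines a color-blind-friendly palette with redundant shape coding. It assumes R 4.0 or later for palette.colors() and uses the built-in iris data only as a stand-in.

```r
library(ggplot2)

# Okabe-Ito palette from base R; skip the first entry (black)
okabe_ito <- unname(palette.colors(n = 4, palette = "Okabe-Ito")[2:4])

p <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width,
                      color = Species, shape = Species)) +
  # Mapping shape as well as color keeps groups separable without color vision
  geom_point(size = 2) +
  scale_color_manual(values = okabe_ito)
p
```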

Choosing a palette

  • Many color palettes online
    • R Color Brewer’s palettes (r-graph-gallery.com/38-rcolorbrewers-palettes.html)
    • Emil Hvitfeldt’s comprehensive list (github.com/EmilHvitfeldt/r-color-palettes)

Using colors to our benefit

Highlighting specific elements can guide the reader

  • Your data may hold many messages, but if you tell them all at once, the reader will be overwhelmed and no single point will stand out
  • When we have multiple messages in our data, we can:
    • Split the messages into separate plots (e.g. faceting)
    • Or use colors to guide the reader’s attention

De-emphasis using gray tones
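
De-emphasis can be sketched in ggplot2 by filling every bar gray except the one the message is about. The marker names, scores and the highlight color are hypothetical.

```r
library(ggplot2)

# Hypothetical performance scores for five markers
df <- data.frame(
  marker = c("Marker A", "Marker B", "Marker C", "Marker D", "Marker E"),
  score  = c(0.62, 0.58, 0.81, 0.66, 0.55)
)

# Flag the single bar we want the reader to look at
df$highlight <- df$marker == "Marker C"

p <- ggplot(df, aes(x = marker, y = score, fill = highlight)) +
  geom_col() +
  # Gray for context, one saturated color for the message; no legend needed
  scale_fill_manual(values = c(`FALSE` = "grey80", `TRUE` = "#d62728"), guide = "none") +
  labs(x = NULL)
p
```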

Sequential gray-to-red

Emphasizes change over time

Changing opacity

  • Reduce opacity by appending a two-digit hexadecimal alpha value to your hex code (e.g. #FF000080 is red at about 50% opacity)
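
The two extra digits are the alpha channel written in hexadecimal (00 is fully transparent, FF is fully opaque). A small helper makes the conversion explicit; add_alpha() is a hypothetical name written for this sketch, and grDevices::adjustcolor() offers similar functionality in base R.

```r
# Append an alpha channel (0 = transparent, 1 = opaque) to a 6-digit hex color
add_alpha <- function(hex, alpha) {
  stopifnot(grepl("^#[0-9A-Fa-f]{6}$", hex), alpha >= 0, alpha <= 1)
  # Scale alpha to 0-255 and format it as two hexadecimal digits
  paste0(hex, sprintf("%02X", as.integer(round(alpha * 255))))
}

print(add_alpha("#f17367", 0.2))
print(add_alpha("#f17367", 1))
```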

Using intuitive colors

Varying line color (and thickness)

  • Note how the anomalous years 2023 and 2024 are emphasized using colors and line thickness
  • Direct annotations on plot instead of in a legend
  • Inset map helps the reader locate the Main Development Region
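
A rough sketch of the same idea: recent years get a stronger color and a thicker line, and are labelled directly at the end of the line rather than in a legend. The temperature series is simulated and is not the data behind the figure; scale_linewidth_manual() assumes ggplot2 3.4 or later.

```r
library(ggplot2)

# Simulated monthly values for five years (hypothetical)
set.seed(11)
df <- expand.grid(month = 1:12, year = 2020:2024)
df$temp <- 20 + 5 * sin((df$month - 4) * pi / 6) +
  (df$year - 2020) * 0.15 + rnorm(nrow(df), sd = 0.2)
df$recent <- df$year >= 2023

# One label per highlighted year, placed at the last month
labels <- df[df$month == 12 & df$recent, ]

p <- ggplot(df, aes(x = month, y = temp, group = year)) +
  geom_line(aes(color = recent, linewidth = recent)) +
  geom_text(data = labels, aes(label = year), hjust = -0.2, size = 3, color = "#d62728") +
  scale_color_manual(values = c(`FALSE` = "grey75", `TRUE` = "#d62728"), guide = "none") +
  scale_linewidth_manual(values = c(`FALSE` = 0.3, `TRUE` = 0.9), guide = "none") +
  # Extra room on the right so the labels are not clipped
  scale_x_continuous(breaks = 1:12, expand = expansion(mult = c(0.02, 0.1)))
p
```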

Saving your plots for a scientific journal or a grant application

Not ideal

  • Low resolution
  • Inconsistent text sizes
  • Unintuitive colors
  • Overlapping text
  • Unnecessary gridlines, legend and y-axis title

Much better

  • Consistent use of colors, text size
  • Plots are aligned
  • No unnecessary figure legends
  • High resolution

Adhering to journal figure guidelines

Journals typically specify how figures should appear

  • Canvas size (in mm or inches)
  • Text size (in points)
  • Font
  • Line widths (in points)
  • dpi (dots per inch; resolution)

General recommendations for plots for scientific publishing

  • Be consistent in plot appearances (use a reusable theme function)
    • Same text size, font, colors, line thickness…
    • Consistent coloring of categories
  • Create plot composites (ggpubr::ggarrange() or the patchwork package are good choices)
    • Allows alignment and automatic plot tags (A, B, C, ...)
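
A minimal composite using the patchwork package, with the built-in mtcars data as a placeholder. The layout (two panels on top, one below) is an arbitrary choice for this sketch.

```r
library(ggplot2)
library(patchwork)

p1 <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()
p2 <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) + geom_boxplot()
p3 <- ggplot(mtcars, aes(x = hp)) + geom_histogram(bins = 15)

# "|" places plots side by side, "/" stacks them;
# tag_levels = "A" labels the panels A, B, C automatically
composite <- (p1 | p2) / p3 + plot_annotation(tag_levels = "A")
composite
```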

Saving your plots for a scientific journal or application

  • I recommend using R with the ggplot2 package and its ggsave() function to meet these requirements
  • We can define a theme function and apply it to all (gg)plots
    • Fonts, text size, line widths and other aesthetics
  • ggsave() to define:
    • File type (svg and pdf produce vector-based, editable graphics)
    • Canvas size (e.g. 165 mm width)
    • Resolution (dpi)
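
A sketch of a ggsave() call that sets file type, canvas size and resolution. The 165 mm width and dpi = 500 follow the values used in this talk; the 80 mm height, the file name and the temporary output folder are assumptions. The dpi setting only matters for raster formats such as png or tiff, since pdf and svg are vector formats.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(size = 0.8) +
  theme_minimal(base_size = 7)

# Written to a temporary folder here; use your own figure path in practice
out_file <- file.path(tempdir(), "figure_1.pdf")

ggsave(
  filename = out_file,  # the file extension selects the output format
  plot     = p,
  width    = 165,       # canvas width
  height   = 80,        # canvas height (assumed for this sketch)
  units    = "mm",
  dpi      = 500        # used by raster formats, ignored for pdf
)
```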

A “raw” plot example

Define a reusable theme and color palette

# Reusable publication theme. Requires ggplot2 and an installed "IBM Plex Sans" font.
library(ggplot2)

theme_publication <- function() {
  # Small base text size and thin lines to suit journal figure guidelines
  theme_minimal(base_size = 7, base_family = "IBM Plex Sans", base_line_size = 0.2) +
    theme(
      # Dark gray axis titles, slightly lighter axis text
      axis.title.y = element_text(color = "#222222"),
      axis.title.x = element_text(color = "#222222"),
      axis.text.y = element_text(color = "#444444"),
      axis.text.x = element_text(color = "#444444"),
      # Draw axis lines and ticks, which theme_minimal() omits
      axis.line.y = element_line(color = "#222222"),
      axis.line.x = element_line(color = "#222222"),
      axis.ticks.x = element_line(color = "#222222"),
      axis.ticks.y = element_line(color = "#222222"),
      axis.ticks.length = unit(2, "mm"),
      # Keep only the major grid lines
      panel.grid.minor = element_blank(),
      # Compact, untitled legend above the plot
      legend.position = "top",
      legend.title = element_blank(),
      legend.text = element_text(color = "#444444"),
      legend.key.height = unit(2, "mm"),
      legend.key.width = unit(2, "mm")
    )
}

palette <- c(
  cofactors = "#d4e080", 
  lipids = "#f17367", 
  nucleotides = "#b3646c", 
  peptides = "#ff85a5", 
  PCM = "#e06bdf",
  energy = "#71dcca",
  carbohydrates = "#66c4ec",
  `amino acids` = "#ffd38b"
)

ggsave() the plot with dpi = 500 and width = 165 mm

Applying the theme to our ggplots and saving
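
A sketch of applying a theme and a named palette to a plot. To keep the block self-contained it uses a trimmed stand-in for theme_publication() (without the custom font) and a three-color subset of the palette; the data frame is hypothetical.

```r
library(ggplot2)

# Trimmed stand-in for the full theme_publication() (assumption: no custom font)
theme_publication <- function() {
  theme_minimal(base_size = 7) +
    theme(legend.position = "top", legend.title = element_blank())
}

# Subset of the named palette; the names must match the category labels in the data
palette <- c(lipids = "#f17367", peptides = "#ff85a5", energy = "#71dcca")

# Make the theme the default for every plot created afterwards
theme_set(theme_publication())

# Hypothetical counts per metabolite class
df <- data.frame(class = c("lipids", "peptides", "energy"), count = c(12, 7, 9))

p <- ggplot(df, aes(x = class, y = count, fill = class)) +
  geom_col() +
  # Named values give each category the same color in every figure
  scale_fill_manual(values = palette)
p
```

Because the palette is matched by name, a category keeps its color even when a plot shows only some of the categories.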

How it would look in a journal print

Grant applications, illustrations

  • Make your application stand out in the pile of applications sent to the evaluation committee
  • Focus on the main messages
    • Test your visualizations on colleagues and ditch your “darlings”

  • Use harmonious colors, and use them consistently throughout the application
  • Simplify cluttered or complicated graphs and emphasize elements that you think are important
    • Details in preliminary data may matter less to the evaluator than they do to you
  • Annotate, but sparingly, so the figure “reads itself” (don’t rely on people reading the figure legend to understand the figure)
  • Learn Illustrator
    • BioRender is a great alternative
    • Making illustrations yourself gives you full control, however
      • Besides, it’s fun

Illustrator

Keep all illustrations in one large project for easier re-use of vector elements

Some data visualizations/inspiration

For a general audience

Telling stories with data

  • Filled area emphasizes difference
  • Emphasis on certain years using line thickness
  • No y-axis title - the reader typically reads from top (title and subtitle) to bottom
    • Y-axis text: degrees Celsius (intuitive!)
  • Works well on small screens like cell phones

Telling stories with data

  • The y-axis text does not have to be on the left-hand side
  • Dark background
  • Emphasis on certain years using coloring and line thickness

Telling stories with data

  • The sequential color palette is intuitive
    • The dark background helps the intuition

Interactive plots

Examples using polling data

Line plot

  • Colors are intuitive
  • Annotations (political party) directly on plot instead of in legend
  • Y-axis shows percentage - intuitive for the reader and hence no axis title needed
  • Dashed lines indicate uncertainty

Stacked bars

  • Easier to see how distributions change over time?
  • Using transparency to indicate uncertainty

Grouped bars

  • Useful to see change by political party

Waffle plot

  • Shows part-of-whole (similar to the pie chart)
  • Perhaps more useful with fewer groups
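
A waffle plot can be sketched with plain ggplot2 by drawing a 10 x 10 grid of tiles, one tile per percentage point. The party names and shares are invented and are not the polling data shown in the talk; the linewidth argument assumes ggplot2 3.4 or later.

```r
library(ggplot2)

# Hypothetical vote shares in whole percent (must sum to 100)
shares <- c(A = 38, B = 27, C = 20, D = 15)

# 10 x 10 grid, filled row by row with one tile per percentage point
grid <- expand.grid(col = 1:10, row = 1:10)
grid$party <- rep(names(shares), times = shares)

p <- ggplot(grid, aes(x = col, y = row, fill = party)) +
  geom_tile(color = "white", linewidth = 0.5) +
  coord_equal() +  # square tiles
  theme_void()     # no axes or grid needed
p
```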

Thank you!

The presentation will be uploaded to:

https://peder.quarto.pub/blog