Aastat

Creating advanced figures in R

Aastat — Mon, 21 Sep 2020 09:30:14 +0000

In this blog post we are replicating the picture below, which was originally created in SAS.

Let’s first generate some dummy data:

library(tidyverse)
n_pat <- 25
patient <- 1:n_pat
treatment <- sample(c("Drug A", "Drug B"), n_pat, replace=TRUE)
change <- rnorm(n_pat, 0, 20)
biomarkers <- c("T790M","Ex19del","L959R","Ex20Ins","MET","ERBB2","EGFR",
                "EGFR2","PIK3CA","KRAS","CDKN2","RB1","ALK","KIT","MET2",
                "Other")
genes <- matrix(sample(x=c("CC", "AA", "AC"), replace=TRUE, size=n_pat * length(biomarkers)),
                nrow=n_pat, ncol=length(biomarkers))
biomarker_groups <- c(rep("Baseline", 4), rep("SCNA", 3), rep("SNV", 9))
df <- data.frame(patient, treatment, change, genes)
colnames(df) <- c("patient", "treatment", "change", biomarkers)
head(df)

##    patient treatment      change T790M Ex19del L959R Ex20Ins MET ERBB2 EGFR EGFR2 PIK3CA KRAS CDKN2 RB1 ALK KIT MET2 Other
## 1        1    Drug B  17.3058647    AC      AC    AC      CC  AA    AC   AA    CC     AC   AC    AC  AC  CC  AC   CC    CC
## 2        2    Drug B   8.1824572    AC      CC    CC      AC  CC    AC   AA    AC     AC   CC    AA  AA  CC  CC   AA    AC
## 3        3    Drug B -18.5752930    AA      AA    AC      CC  AC    AA   AC    AA     AA   AA    AA  AC  AC  AA   AA    AC
## 4        4    Drug A  -5.2139298    AC      AA    AC      AA  AA    AA   CC    AA     AC   AA    CC  AC  AC  CC   CC    AC
## 5        5    Drug B   5.6130694    CC      CC    CC      CC  AA    CC   AA    AC     AC   AA    CC  AA  AC  CC   CC    AC

Our dataset has a patient number, a treatment group and the change in tumor size. We have also collected some biomarkers so that we can look for interesting correlations.

In the picture above we have 3 distinct plots:

The change in tumor size
Highlighted genes in biomarkers
Percentage of selected genes from each one of the biomarkers.

First plot

The plot is a fairly standard bar plot, but there are some notable options that we need to set. First, there is text outside of the bars indicating the change in tumor size. Second, the x-axis ticks are not just numbers: a custom string indicates that the ticks represent patients.

We can add the text to the bars using geom_text, but with only geom_bar and geom_text the plot is not very aesthetic. The stat = “identity” in geom_bar means that we are providing our own values, so the function does not try to plot counts or something else.

df %>% 
  ggplot(aes(x=factor(patient), y=change, label=change, fill=treatment)) +
  geom_bar(stat="identity") +
  geom_text()
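As a side note, ggplot2 also provides geom_col(), which is shorthand for a bar layer with stat = "identity". The sketch below compares the two on a made-up three-patient data frame; the toy data and the names toy, p_bar and p_col are assumptions for illustration, not part of the dataset above.

```r
library(ggplot2)

# Toy data, invented for this illustration only
toy <- data.frame(patient = factor(1:3), change = c(12, -8, 3))

# geom_bar() counts rows by default; stat = "identity" plots the y values as given
p_bar <- ggplot(toy, aes(x = patient, y = change)) +
  geom_bar(stat = "identity")

# geom_col() uses the identity stat out of the box
p_col <- ggplot(toy, aes(x = patient, y = change)) +
  geom_col()
```

Both objects should draw the same bars, so which one to use is mostly a matter of taste.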

We can round the numbers, rotate the text and add some vertical adjustment so that the labels line up nicely outside of the bars instead of on their edges.
The

hjust = ifelse(change < 0, 1.1, -0.3)

means that the label is placed above the bar if the change is positive and below the bar if the change is negative. Because the text is rotated by 90 degrees, the "horizontal" adjustment acts along the length of the text, which now runs vertically.
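To see what the ifelse() call produces, here is a small sketch on three made-up values. The vector change_toy is an assumption for illustration; the adjustment values and the formatC() call are the ones used in this post.

```r
# Toy values, invented for illustration only
change_toy <- c(17.3, -18.6, 5.6)

# ifelse() is vectorised: every element gets its own adjustment.
# Positive changes get -0.3, negative changes get 1.1.
hjust_toy <- ifelse(change_toy < 0, 1.1, -0.3)
hjust_toy   # -0.3  1.1 -0.3

# Rounded labels, as produced by formatC(..., format = "f", digits = 0)
labels_toy <- formatC(change_toy, format = "f", digits = 0)
labels_toy  # "17" "-19" "6"
```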

The text of the x-axis ticks can be changed with scale_x_discrete. For the label values we paste the string “pat” and the corresponding number together using paste0. Finally, I’m changing the colors from light to darker to indicate importance.

p1 <- df %>%
    ggplot(aes(x=factor(patient), y=change)) +
    geom_bar(stat = "identity", aes(fill=factor(treatment))) +
    geom_text(aes(label=formatC(change, format="f", digits=0)),
              hjust=ifelse(change < 0, 1.1, -0.3), angle=90,
              vjust=0.35) +
    theme(axis.title.x = element_blank(),
          panel.grid = element_blank(),
          axis.text.x = element_text(angle=45, hjust=1)) +
    ylab("Change from baseline (%)") +
    labs(fill="Treatment") +
    scale_fill_manual(values = c("#0044ba", "#9e181c")) +
    scale_x_discrete(labels = paste0("pat", patient)) + 
    ylim(min(change) - 10, max(change) + 10)
p1

Second plot

First we need to create a dataframe where we have all the biomarkers, the percentage of patients that have the selected gene for each biomarker, and the grouping of the biomarker.

genes_df <- df %>%
    select(all_of(biomarkers))

# nrow(df) is the number of patients (length(df) would count the columns)
pcts <- colSums(genes_df == "CC") / nrow(df)

gene_pct_df <- data.frame(pcts, biomarker_groups, biomarkers)
gene_pct_df

##         pcts biomarker_groups biomarkers
## T790M   0.24         Baseline      T790M
## Ex19del 0.32         Baseline    Ex19del
## L959R   0.28         Baseline      L959R
## Ex20Ins 0.56         Baseline    Ex20Ins
## MET     0.28             SCNA        MET
## ERBB2   0.32             SCNA      ERBB2
## EGFR    0.36             SCNA       EGFR
## EGFR2   0.16              SNV      EGFR2
## PIK3CA  0.32              SNV     PIK3CA
## KRAS    0.12              SNV       KRAS
## CDKN2   0.28              SNV      CDKN2
## RB1     0.32              SNV        RB1
## ALK     0.32              SNV        ALK
## KIT     0.44              SNV        KIT
## MET2    0.24              SNV       MET2
## Other   0.32              SNV      Other
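The share of patients carrying a given genotype is the count of matches divided by the number of patients, that is, the number of rows. Below is a minimal sketch on a made-up genotype matrix; toy_genes and its contents are assumptions for illustration only.

```r
# Toy genotype matrix, invented for illustration: 4 patients x 2 biomarkers
toy_genes <- matrix(c("CC", "AA", "CC", "AC",
                      "AC", "AC", "CC", "AA"),
                    nrow = 4,
                    dimnames = list(NULL, c("T790M", "KRAS")))

# Comparing to "CC" gives a logical matrix; TRUE counts as 1 when summed
cc_counts <- colSums(toy_genes == "CC")

# Share of patients with "CC" per biomarker: divide by the number of rows
cc_share <- cc_counts / nrow(toy_genes)

# colMeans() does the same in one step
cc_share_alt <- colMeans(toy_genes == "CC")
cc_share_alt
```

Using colMeans() avoids having to pick the denominator by hand.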

For the next plot I am using a helper function, percent, that converts a decimal to a percentage and adds the percent sign.

percent <- function(x, digits = 2, format = "f", is.float=TRUE,...) {
  paste0(formatC(100 * x, format = format, digits = digits, ...), "%")
}
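A few calls show what the helper returns. The function is repeated here only so that the snippet runs on its own; the input values are made up for illustration.

```r
# Copy of the helper above so that the snippet is self-contained
percent <- function(x, digits = 2, format = "f", is.float=TRUE,...) {
  paste0(formatC(100 * x, format = format, digits = digits, ...), "%")
}

percent(0.3157895)                  # "31.58%"
percent(0.3157895, digits = 0)      # "32%"
percent(c(0.5, 0.125), digits = 1)  # "50.0%" "12.5%"
```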

This plot is similar to the first one, but there are a few notable differences. First, the bars are horizontal instead of vertical.
Second, the ticks are removed from the axes so that they do not interfere with the other plots.

We first create the plot in its original orientation and, as the last step, use coord_flip to flip it sideways. Changing the tick labels is done by modifying the underlying theme. Generally, to remove something from the plot we use “theme(something = element_blank())”.

One more thing that is absolutely necessary is to order the bars by group. For this we first create the variable bio_factor, which is just
a number from 1 to 3 according to the group the biomarker belongs to. Using this variable we can order the x-axis (later the y-axis) by group.

p2 <- gene_pct_df %>%
  mutate(bio_factor = as.numeric(factor(biomarker_groups))) %>%
  ggplot(aes(x=reorder(biomarkers, -bio_factor), y=pcts)) +
  geom_bar(stat="identity", aes(fill=biomarker_groups), show.legend = F) +
  geom_text(aes(label=percent(pcts, digits=0)), hjust = -0.2, size=3) +
  theme(axis.title.x = element_blank(),
        axis.ticks.y = element_blank(),
        axis.title.y = element_blank(),
        axis.ticks = element_blank(),
        axis.text = element_blank(),
        panel.grid = element_blank()) +
  ylim(0, max(pcts) + 0.3) +
  coord_flip()
p2
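The ordering trick used in p2 can be checked on a tiny example. The sketch below uses a made-up three-row data frame (toy and its values are assumptions for illustration) and the same as.numeric(factor()) and reorder() calls.

```r
# Toy data, invented for illustration only
toy <- data.frame(
  biomarkers = c("KRAS", "T790M", "MET"),
  biomarker_groups = c("SNV", "Baseline", "SCNA")
)

# factor() sorts the group names alphabetically: Baseline = 1, SCNA = 2, SNV = 3
toy$bio_factor <- as.numeric(factor(toy$biomarker_groups))

# reorder() sorts the levels by the second argument in increasing order,
# so the negative sign puts the highest group number first
ordered_markers <- reorder(toy$biomarkers, -toy$bio_factor)
levels(ordered_markers)  # "KRAS" "MET" "T790M"
```

After coord_flip() the first level is drawn at the bottom, so the Baseline group ends up at the top of the flipped plot.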

Last plot

This is the most complicated of the three plots. It contains a grid that is divided into subgroups by the biomarker groups. Cells with specific genes are colored differently from the others.

Before we use the plotting functions we again add the bio_factor variable (as we did in the last step) and a color_scheme variable that tells what color each of the cells should be.

We create the grid with geom_raster (we could also use geom_rect, but according to the documentation geom_raster is preferred when all the squares are the same size) and add the text as usual with geom_text. To get the grouping working correctly we need to use facet_grid to break the plot into smaller grids and add options so that the grids are closer together (I encourage you to copy the code and see what each of the options does).

p3 <- df %>%
    pivot_longer(cols=all_of(biomarkers)) %>%
    left_join(., gene_pct_df, by=c("name" = "biomarkers")) %>%
    mutate(bio_factor = as.numeric(factor(biomarker_groups))) %>%
    mutate(color_scheme = case_when(
      value == "CC" & bio_factor == 1 ~ "a",
      value == "CC" & bio_factor == 2 ~ "b",
      value == "CC" & bio_factor == 3 ~ "c",
      TRUE ~ "d")) %>%
    ggplot(aes(x = factor(patient), y=reorder(name, bio_factor))) +
    geom_raster(aes(fill=color_scheme,
                    alpha=color_scheme),
                show.legend = F) +
    geom_text(aes(label=value), size=3,
              show.legend = F) +
    facet_grid(biomarker_groups ~ ., switch = "both", scales="free_y",
               space = "free_y") +
    scale_fill_manual(values = c("#F8766D", "#00BA38" ,"#619CFF", "white")) +
    scale_alpha_manual(values = c(0.9, 0.9, 0.9, 0.4)) +
    theme(axis.title.x=element_blank(),
          axis.ticks.y=element_blank(),
          axis.title.y=element_blank(),
          axis.text.x=element_blank(),
          axis.ticks = element_blank(),
          legend.title = element_blank(),
          panel.grid = element_blank(),
          panel.spacing.y = unit(-0.1, "lines"))
p3
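To see what the facet_grid options do in isolation, here is a stripped-down sketch on a made-up grid. The data frame toy, its marker and group values, and the use of geom_tile instead of geom_raster are assumptions chosen to keep the example short.

```r
library(ggplot2)

# Toy grid, invented for illustration: 3 patients, 3 markers in 2 groups
toy <- expand.grid(patient = factor(1:3),
                   marker = c("T790M", "Ex19del", "KRAS"),
                   stringsAsFactors = FALSE)
toy$group <- ifelse(toy$marker == "KRAS", "SNV", "Baseline")

p_toy <- ggplot(toy, aes(x = patient, y = marker)) +
  geom_tile(fill = "grey80", color = "white") +
  # One panel per group, stacked vertically. scales = "free_y" drops the
  # markers that do not belong to a panel, and space = "free_y" lets each
  # panel take height in proportion to its number of markers.
  facet_grid(group ~ ., scales = "free_y", space = "free_y", switch = "both")
p_toy
```

Removing scales and space one at a time is a quick way to see why both are needed for evenly sized cells.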

Combining the plots

Now all that is left is to combine the three plots so that all the columns and rows line up. For this we are using a library called cowplot. According to its documentation, cowplot is a library that

“provides various features that help with creating publication-quality
figures, such as a set of themes, functions to align plots and arrange
them into complex compound figures, and functions that make it easy to
annotate plots and or mix plots with images.”

The function plot_grid from the cowplot package is used for creating table-like layouts of plots. We can specify how the plots are arranged and aligned using the arguments ncol, nrow, align and axis.

First we need to remove the legend from the first plot so that the alignment works better; we add it back later.

library(cowplot)
p1_legend <- get_legend(p1)
p1 <- p1 + theme(legend.position = "none")

Here we saved the legend from the first plot into a variable called p1_legend and hid the legend in the original plot. Now we are going to do a nested plot_grid.

First we align the change plot with the gene plot.
Second we align the legend from the first plot with the bar plot.
Then we combine these two columns and adjust their widths using the rel_widths argument so that the plots on the left are larger than the plots on the right side.
And finally we draw the aligned plot using the ggdraw function.

ggdraw(plot_grid(
    plot_grid(p1, p3, ncol=1, align = "v", axis="lr"),
    plot_grid(p1_legend, p2, ncol=1),
    rel_widths = c(1, 0.2)
  ))
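The nesting is easier to follow with placeholder plots. The sketch below rebuilds the same two-column layout from made-up plots; the names top_left, bottom_left and right_side are assumptions for illustration, and NULL stands in for the legend slot.

```r
library(ggplot2)
library(cowplot)

# Placeholder plots, invented for illustration only
toy <- data.frame(x = factor(1:3), y = c(2, -1, 3))
top_left    <- ggplot(toy, aes(x, y)) + geom_col()
bottom_left <- ggplot(toy, aes(x, y = 1)) +
  geom_tile(fill = "grey80", color = "white")
right_side  <- ggplot(toy, aes(x, y)) + geom_col() + coord_flip()

# Left column: two plots stacked and aligned on their left and right axes
left_col <- plot_grid(top_left, bottom_left, ncol = 1, align = "v", axis = "lr")

# Right column: an empty slot (NULL) on top, a plot below
right_col <- plot_grid(NULL, right_side, ncol = 1)

# Put the columns side by side; the left one is five times wider
combined <- plot_grid(left_col, right_col, rel_widths = c(1, 0.2))
combined
```

Changing rel_widths is the quickest way to trade space between the two columns.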

Finally, we have recreated the plot that we set out to mimic.

The code it took to recreate this figure is a bit shorter than the code originally used to create it in SAS. I will be posting the original SAS code on our GitHub pages and I will update the URL here after that. One downside is that the plots need quite a few extra options and some tweaking to get them looking right.

EDIT:
Link to the SAS code

Mikael Roto
4/8/2021

The article Creating advanced figures in R was first published on Aastat.

PDF compiler

Aastat — Mon, 21 Sep 2020 09:29:56 +0000

The goal of this post is to introduce a small program developed by Aastat.

Background and motivation

In medical data science we occasionally must send data to the FDA. Usually the data is collected
from tens or hundreds of individual *.txt or *.rtf files and manually put together using
a text editor, usually Microsoft Word. This approach usually takes hours and
is very susceptible to manual errors, and FDA standards require that there is a table of
contents page with hyperlinks.

Light at the end of the tunnel

The solution for this is to automate it all away. I coded up a script in Python that completely
automates the process. The user doesn’t need to do anything other than select the files, change
a few options depending on the layout and structure of the provided files and then press the compile
button, which outputs the document with an automatically generated table of contents.

And best of all, the program is completely free and open source, so you can edit the
code, see what code others have written, and if you’d like to contribute more features we’d love that.

But I don’t have Python installed or resources to learn it

No worries. I bundled all the dependencies into one *.exe file that provides Python and all the
required libraries. All you need to do is download the project from GitHub,
extract the files from the packed file and start up Creator.exe (the name might change later on).

Okay I got the files, but what do all these options mean?

The number of options looks intimidating, and most of the names do not say much about what the setting
does. Fortunately there is HUGE documentation on the GitHub main page. There you can find everything
you need to know: what the settings do, what you need to set before running the program and how the
program works behind the scenes.

If you encounter any bugs or problems you can contact Aastat and we might fix them in the future. If
you know how to program in Python and want to contribute to the project, create a pull request on
GitHub and we’ll check it out!

Mikael Roto
10/8/2021

The article PDF compiler was first published on Aastat.

More figures in R

Aastat — Mon, 21 Sep 2020 09:29:41 +0000

In this blog post we are making the picture below.

We will be using a library called tidyverse in this tutorial. Tidyverse is a collection of
packages that share an underlying design philosophy, grammar and data structures. dplyr, one of
the tidyverse packages, provides the useful “pipe” operator %>% (originally from the magrittr package),
which pipes data forward into another expression or function call.

library(tidyverse)

You can find more information about tidyverse and its other packages in the online documentation.
Let’s first generate some data that we can use in our figure.

n_pat <- 25
patient <- 1:n_pat
censoring <- ceiling(rexp(n_pat, 1/30))
tumor_shrink <- (rbeta(n_pat, 2, 2) - 0.5) * 100

n_parameters <- 15
parameters <- paste("Parameter", 1:n_parameters)

response <- sample(c("PR", "NE", "CR", "PD", "SD"), size=n_pat,
                   replace = T)

missing_combination <- sample(c(TRUE, FALSE), size=n_pat, replace=T, prob = c(0.1, 0.9))

changes <- matrix(runif(n_pat * n_parameters, 1, 100), nrow=n_pat, ncol=n_parameters)
changes[sample(1:dim(changes)[1], 4, replace = FALSE), sample(1:dim(changes)[2], 5, replace = F)] <- NA

df <- data.frame(patient, censoring, tumor_shrink, changes, missing_combination)
colnames(df) <- c("patient", "censoring", "tumor_shrink", parameters, "missing_combination")
head(df)

##    patient censoring tumor_shrink Parameter 1 Parameter 2 Parameter 3 Parameter 4  Parameter 5 Parameter 6 Parameter 7 Parameter 8 Parameter 9 Parameter 10  Parameter 11 Parameter 12 Parameter 13 Parameter 14 Parameter 15 
## 1        1        50   -33.5020150  76.692716  12.905816  51.320504    6.95165    3.702113  80.529008  52.739191  35.523220  20.034390   98.995443     52.34731   89.944140   22.969047   86.286218   37.865428
## 2        2        18   -34.3932674  71.841917  94.354270   4.175872   40.83416   78.104306  95.018387  86.157191  44.547260  66.223263    4.477640     19.13054   57.357747   66.792806   57.220612   71.090477
## 3        3        19    25.5744672         NA         NA  75.877590   51.54885   59.516219  47.779858  22.964046  20.790171  27.846610   46.499506           NA   42.132381   26.674702          NA          NA
## 4        4        10     4.2591308  90.204811  36.336677  39.754126   72.06269    1.586673  60.106080  40.002346  47.315590  56.189063   78.099096     46.84222    6.844924   80.998685   77.085822   38.931028
## 5        5        14    -8.4798810  33.499890  13.695571  28.529885   87.61651   99.882826  71.494717  60.329041  58.260342  51.893355   78.442637     41.88079   75.042574   58.337938   78.939537    1.698262
##    missing_combination
## 1                FALSE
## 2                FALSE
## 3                 TRUE
## 4                FALSE
## 5                 TRUE

Here we have generated a dataframe containing example patients, how their tumor size has changed from the start of the study until its end, and the time after which they were censored from the study (quit, died, etc.). On top of that we also have measurements of different anonymized parameters, denoted by Parameter [number]. Note that the data has missing values, indicated by NA. The dataframe also has a column called missing_combination, which indicates whether there were problems while gathering the data: TRUE indicates problems and FALSE indicates that the data was gathered fine. Note that if you run the code yourself you may get different results, because the data is random. You can fix the seed with set.seed(), passing any integer as the seed number, so that the data stays the same from run to run.
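A minimal sketch of fixing the seed is shown below. The seed value 2020 and the names first_run and second_run are arbitrary choices for illustration; any integer works.

```r
# Fix the random number generator state; 2020 is an arbitrary choice
set.seed(2020)
first_run <- rnorm(5, 0, 20)

# Resetting the same seed reproduces exactly the same draws
set.seed(2020)
second_run <- rnorm(5, 0, 20)

identical(first_run, second_run)  # TRUE
```

Calling set.seed() once at the top of the script, before any of the random data is generated, is enough to make the whole dataset reproducible.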

The figure consists of four individual plots: three smaller plots stacked on top of each other and a larger plot under those three. Let’s create the topmost plot first.

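p1 <- df %>%
  # response is not a column of df: attach it before sorting so that the
  # labels in geom_text() stay matched to the right patient
  mutate(response = response) %>%
  mutate(color = case_when(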
    response == "PR" ~ "lightgreen",
    response == "NE" ~ "white",
    response == "CR" ~ "darkgreen",
    response == "PD" ~ "red",
    response == "SD" ~ "yellow"
  )) %>% 
  arrange(tumor_shrink) %>% 
  mutate(patient = factor(patient, levels=patient)) %>% 
  ggplot(aes(x=patient, y=1, fill=color)) +
  geom_raster() +
  geom_tile(color="black", size=1) +
  geom_text(aes(label=response), size=3) +
  theme(axis.text = element_blank(),
        axis.ticks = element_blank(),
        axis.title.x = element_blank(),
        legend.position = "none",
        axis.title.y = element_text(angle = 0, vjust=0.57, size = 12),
        plot.margin = unit(c(5, 0, 0, 0), "pt")) +
  scale_fill_identity() +
  labs(y="Best ov. resp") +
  coord_fixed()
p1

Everything else looks pretty standard except for arrange() and mutate(). We want to sort our patients by the change in the size of their tumor. First we arrange the rows by tumor_shrink, and after that we convert the patient column from an integer into a factor whose levels follow the sorted order, so ggplot draws the patients in that order. The main point of this is that ggplot fills its value in (0, 1) instead of (0.5, 1.5). We could also have used only ggplot(aes(x = factor(patient))), but in a later plot we also need the numerical value, so for consistency we use this approach.
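The effect of setting the factor levels after sorting can be seen on a small example. The sketch uses base R and a made-up four-patient data frame (toy and its values are assumptions for illustration); order() plays the role of arrange().

```r
# Toy data, invented for illustration only
toy <- data.frame(patient = 1:4, tumor_shrink = c(10, -5, 30, 0))

# Sort the rows by tumor_shrink (base R equivalent of arrange())
toy <- toy[order(toy$tumor_shrink), ]

# Use the sorted patient numbers as factor levels
toy$patient <- factor(toy$patient, levels = toy$patient)

levels(toy$patient)      # "2" "4" "1" "3"
as.integer(toy$patient)  # 1 2 3 4: the position on a discrete axis
```

A discrete axis follows the order of the levels, not the numeric value of the labels, which is why patient 2 would be drawn first here.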

scale_fill_identity() is useful when you want to set the colors manually using mutate and if/else conditions.
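As a small illustration of scale_fill_identity(), the sketch below maps a column that already contains color names straight to the fill. The data frame toy and its values are made up for this example.

```r
library(ggplot2)

# Toy data, invented for illustration: the color column holds actual color names
toy <- data.frame(patient = factor(1:3),
                  response = c("PR", "PD", "SD"),
                  color = c("lightgreen", "red", "yellow"))

p_toy <- ggplot(toy, aes(x = patient, y = 1, fill = color)) +
  geom_tile(color = "black") +
  geom_text(aes(label = response)) +
  # Use the values of `color` directly as fill colors instead of mapping
  # them to a default palette
  scale_fill_identity()
p_toy
```

Without scale_fill_identity() ggplot would treat the strings as categories and pick its own colors for them.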

The second and third plots are fairly similar to the first one. Again we are using “hacks” to get our plot looking right. We pass the patients as x-values and keep the y-value constant at 1. In each square we print the value that we want to show (censoring), pass the colors in aes(…, fill=color) and finally draw the black lines around the squares with geom_tile.

On to the next plot!

p2 <- df %>%
  arrange(tumor_shrink) %>% 
  mutate(patient = factor(patient, levels=patient)) %>% 
  mutate(color = ifelse(missing_combination, "white", "gray")) %>% 
  ggplot(aes(x=factor(patient), y=1, fill=color)) +
  geom_raster() +
  geom_tile(color="black", size=1) +
  geom_text(aes(label=censoring), size=3) +
  theme(axis.text = element_blank(),
        axis.title = element_blank(),
        axis.ticks = element_blank(),
        legend.position = "none",
        axis.title.y = element_text(angle = 0, vjust=0.57, size=12),
        plot.margin = unit(c(-5, 0, 0, 0), "pt")) +
  scale_fill_identity() + 
  labs(y="Censoring")+
  coord_fixed()
p2

This plot is again similar to the previous two, and the code should be self-explanatory if you understood how to make those. The main difference in this section is the use of scale_fill_gradient(), which gives a nice gradient of colors from the minimum of the tumor_shrink variable to the maximum.

p3 <- df %>% 
  arrange(tumor_shrink) %>% 
  mutate(patient = factor(patient, levels=patient)) %>% 
  ggplot(aes(x=patient, y=1, fill=tumor_shrink)) +
  geom_raster(alpha=0.8) +
  geom_tile(color="black", size=1) +
  geom_text(aes(label=formatC(tumor_shrink, digits=0, format="f")), 
            size=3) +
  theme(axis.text = element_blank(),
        axis.title = element_blank(), 
        axis.ticks = element_blank(),
        axis.title.y = element_text(angle = 0, vjust=0.57, size=12),
        plot.margin = unit(c(-5, 0, 0, 0), "pt"),
        legend.position = "none") +
  scale_fill_gradient(low="green", high="red") +
  labs(y="Tumor shrink")+
  coord_fixed()
p3

Now we need to transform the dataframe into long format and normalize the values to the [-100, 100] range. For this we are using the function

$$x' = 100 \left( 2 \, \frac{x - \min(x)}{\max(x) - \min(x)} - 1 \right)$$

In the equation above, x’ is the scaled vector of values and x is the original vector. The fraction inside the parentheses normalizes the x values to [0, 1], and then we transform them to the desired [-100, 100] range. Here is that as an R function.

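normalize <- function(x, na.rm = TRUE) {
  # Shift so that the minimum is 0, divide by the range to get [0, 1],
  # then stretch to [-100, 100]
  up <- x - min(x, na.rm = na.rm)
  down <- max(x, na.rm = na.rm) - min(x, na.rm = na.rm)
  return((2 * (up / down) - 1) * 100)
}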
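A quick check on a made-up vector shows the mapping. The function is repeated here only so that the snippet runs on its own; the input values are assumptions for illustration.

```r
# Same logic as the function above, repeated so the snippet is self-contained
normalize <- function(x, na.rm = TRUE) {
  up <- x - min(x, na.rm = na.rm)
  down <- max(x, na.rm = na.rm) - min(x, na.rm = na.rm)
  (2 * (up / down) - 1) * 100
}

# The minimum maps to -100, the midpoint to 0, the maximum to 100; NA stays NA
scaled_toy <- normalize(c(1, 50.5, 100, NA))
scaled_toy  # -100 0 100 NA
```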

In the next block we pivot the dataframe into long format and apply our normalization function to the values (NA values are ignored when computing the minimum and maximum and stay NA). We also create a column called pat, which is patient converted to an ordered factor whose levels follow the tumor_shrink order. This is needed so that we can sort the columns of the last plot by the tumor_shrink values. To add more to the plot I decided to add markers that could indicate some importance. For this exercise I have flagged the cells whose absolute scaled value is higher than 85.

cdf <- df %>% 
  pivot_longer(all_of(parameters)) %>% 
  mutate(scaled_val = normalize(value)) %>% 
  mutate(important = ifelse((abs(scaled_val) > 85), TRUE, FALSE)) %>% 
  replace_na(list(important = FALSE)) %>% 
  arrange(tumor_shrink) %>% 
  # Levels follow the sorted order so the columns line up with the plots above
  mutate(pat=factor(patient, levels = unique(patient), ordered=TRUE))
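The pivot and flagging steps can be tried on a tiny wide table. In the sketch below the data frame toy, the vector toy_params and the values are made up for illustration, and the cutoff of 85 simply mirrors the one in the code above.

```r
library(dplyr)
library(tidyr)

# Toy wide data, invented for illustration only
toy <- data.frame(patient = 1:2,
                  `Parameter 1` = c(10, NA),
                  `Parameter 2` = c(95, 40),
                  check.names = FALSE)
toy_params <- c("Parameter 1", "Parameter 2")

toy_long <- toy %>%
  # One row per patient/parameter pair; names go to `name`, values to `value`
  pivot_longer(all_of(toy_params)) %>%
  # A comparison with NA gives NA, so turn those into FALSE afterwards
  mutate(important = abs(value) > 85) %>%
  replace_na(list(important = FALSE))
toy_long
```

The result has four rows, and only the cell holding 95 is flagged.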

With most of the work done in creating the dataframe, it is pretty easy to create the plot from it. The plot itself is similar to the one created in the previous post.

p4 <- cdf %>% 
  ggplot(aes(x=pat, y=name, fill=scaled_val)) +
  geom_raster(alpha=0.85) +
  geom_text(data=filter(cdf, important), aes(label="★"), colour="black",
            size=8, vjust=0.2, alpha=0.9) +
  scale_fill_gradient2(low="blue", mid="white", high="red", guide="none") +
  scale_x_discrete(labels = paste0("Subj ", levels(cdf$pat))) +
  theme(axis.title = element_blank(),
        panel.grid = element_blank(),
        axis.text.x = element_text(angle=-45, hjust=0.3),
        plot.margin = unit(c(10, 5, 5, 5), "pt")) +
  coord_fixed()
p4

Now all we need to do is to combine all the plots. This time there is no need to use cowplot, as we can use a somewhat simpler method from a library called patchwork. Patchwork is a brilliant library that allows joining plots using arithmetic operators. You may have been wondering why we needed to specify the plot margins. The three plots on top of the
bigger plot sit tightly together, and to mimic that we need to remove the plot margins.

library(patchwork)
p1 / p2 / p3 / p4
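If the default split between the stacked plots looks off, patchwork's plot_layout() can set relative heights. The sketch below uses placeholder plots; the names strip and main and the height ratio are assumptions for illustration, not values tuned for the figure in this post.

```r
library(ggplot2)
library(patchwork)

# Placeholder plots, invented for illustration only
toy <- data.frame(x = factor(1:3), y = c(2, 1, 3))
strip <- ggplot(toy, aes(x, y = 1)) +
  geom_tile(fill = "grey80", color = "black")
main  <- ggplot(toy, aes(x, y)) + geom_col()

# `/` stacks plots vertically, `|` places them side by side
stacked <- strip / strip / main

# Give the bottom plot four times the height of each strip
stacked_sized <- stacked + plot_layout(heights = c(1, 1, 4))
stacked_sized
```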

Mikael Roto
14/8/2021

The article More figures in R was first published on Aastat.