In this blog post we are making the picture below.
We will be using library called tidyverse in this tutorial. Tidyverse is a collection of
packages that share underlying design philosophy, grammar and data structures. Dplyr from
tidyverse provides useful “pipes” that allows piping data forward into another expression
or funtion call.
library(tidyverse)
You can find more information about tidyverse and its other packages from online documentation.
Lets first generate some data to work with that we can use in our figure.
n_pat <- 25
patient <- 1:n_pat
censoring <- ceiling(rexp(n_pat, 1/30))
tumor_shrink <- (rbeta(n_pat, 2, 2) - 0.5) * 100
n_parameters <- 15
parameters <- paste("Parameter", 1:n_parameters)
response <- sample(c("PR", "NE", "CR", "PD", "SD"), size=n_pat,
replace = T)
missing_combination <- sample(c(TRUE, FALSE), size=n_pat, replace=T, prob = c(0.1, 0.9))
changes <- matrix(runif(n_pat * n_parameters, 1, 100), nrow=n_pat, ncol=n_parameters)
changes[sample(1:dim(changes)[1], 4, replace = FALSE), sample(1:dim(changes)[2], 5, replace = F)] <- NA
df <- data.frame(patient, censoring, tumor_shrink, changes, missing_combination)
colnames(df) <- c("patient", "censoring", "tumor_shrink", parameters, "missing_combination")
head(df)
## patient censoring tumor_shrink Parameter 1 Parameter 2 Parameter 3 Parameter 4
## 1 1 50 -33.5020150 76.692716 12.905816 51.320504 6.95165
## 2 2 18 -34.3932674 71.841917 94.354270 4.175872 40.83416
## 3 3 19 25.5744672 NA NA 75.877590 51.54885
## 4 4 10 4.2591308 90.204811 36.336677 39.754126 72.06269
## 5 5 14 -8.4798810 33.499890 13.695571 28.529885 87.61651
## Parameter 5 Parameter 6 Parameter 7 Parameter 8 Parameter 9 Parameter 10
## 3.702113 80.529008 52.739191 35.523220 20.034390 98.995443
## 88.104306 95.018387 86.157191 44.547260 66.223263 4.477640
## 59.516219 47.779858 22.964046 20.790171 27.846610 46.499506
## 1.586673 60.106080 40.002346 47.315590 56.189063 78099096
## 99.882826 71.494717 60.329041 58.260342 51.893355 78.442637
## Parameter 11 Parameter 12 Parameter 13 Parameter 14 Parameter 15 missing_combination
## 5 2.34731 89.944140 22.969047 86.286218 37.865428 FALSE
## 19.13054 57.357747 66.792806 57.220612 71.090477 FALSE
## NA 42.132381 26.674702 NA NA TRUE
## 46.84222 6.844924 80.998685 77.085822 38.931028 FALSE
## 41.88079 75.042574 58.337938 78.939537 1.698262 TRUE
## patient censoring tumor_shrink Parameter 1 Parameter 2 Parameter 3 Parameter 4 Parameter 5 Parameter 6 Parameter 7 Parameter 8 Parameter 9 Parameter 10 Parameter 11 Parameter 12 Parameter 13 Parameter 14 Parameter 15
## 1 1 50 -33.5020150 76.692716 12.905816 51.320504 6.95165 3.702113 80.529008 52.739191 35.523220 20.034390 98.995443 52.34731 89.944140 22.969047 86.286218 37.865428
## 2 2 18 -34.3932674 71.841917 94.354270 4.175872 40.83416 78.104306 95.018387 86.157191 44.547260 66.223263 4.477640 19.13054 57.357747 66.792806 57.220612 71.090477
## 3 3 19 25.5744672 NA NA 75.877590 51.54885 59.516219 47.779858 22.964046 20.790171 27.846610 46.499506 NA 42.132381 26.674702 NA NA
## 4 4 10 4.2591308 90.204811 36.336677 39.754126 72.06269 1.586673 60.106080 40.002346 47.315590 56.189063 78.099096 46.84222 6.844924 80.998685 77.085822 38.931028
## 5 5 14 -8.4798810 33.499890 13.695571 28.529885 87.61651 99.882826 71.494717 60.329041 58.260342 51.893355 78.442637 41.88079 75.042574 58.337938 78.939537 1.698262
missing_combination
## 1 FALSE
## 2 FALSE
## 3 TRUE
## 4 FALSE
## 5 TRUE
Here we have generated a dataframe containing example patients, how their tumor size has changed from start of the study until the end of study and time after they were censored from the study (quit, died, etc). On top of that we also have measurements on different anonymized parameters noted by Parameter [number]. Note that data has missing values indicated by NA. In the dataframe there is a column called missing_combination which indicates that there was problems while gathering the data. TRUE values indicates problems and FALSE values indicate the data is gathered fine. Note that if you try to replicate the code you may get different results. You can set seed using set.seed(“Seed number”) so the data will stay same from run to run
The figure consists of four individual plots. Three smaller plots stacked on top of each other and larger plot under those three. Lets create the top most plot first.
p1 <- df %>%
mutate(color = case_when(
response == "PR" ~ "lightgreen",
response == "NE" ~ "white",
response == "CR" ~ "darkgreen",
response == "PD" ~ "red",
response == "SD" ~ "yellow"
)) %>%
arrange(tumor_shrink) %>%
mutate(patient = factor(patient, levels=patient)) %>%
ggplot(aes(x=patient, y=1, fill=color)) +
geom_raster() +
geom_tile(color="black", size=1) +
geom_text(aes(label=response), size=3) +
theme(axis.text = element_blank(),
axis.ticks = element_blank(),
axis.title.x = element_blank(),
legend.position = "none",
axis.title.y = element_text(angle = 0, vjust=0.57, size = 12),
plot.margin = unit(c(5, 0, 0, 0), "pt")) +
scale_fill_identity() +
labs(y="Best ov. resp") +
coord_fixed()
p1
Everything else looks pretty standard except the arrange() and mutate(). We want to sort our patients by their growth of their tumor. First we arrange them by the change of size in their tumors and after that we modify the patient column. This changes from integer into ordinal. Main point of this is that ggplot fills its value in (0, 1) instead of (0.5, 1.5). We also could have used only ggplot(aes(x = factor(patient))) but in the later plot we also need the numerical value. So for the consistency we use this approach.
scale_fill_identity() is useful when you want to set the colors manually using mutate and if/else conditions.
The second and third plot are fairly similar to the first one. Again we are using “hacks” to get our plot looking correct. We pass the patients as x-values and keep the y-value at constant 1. In each square we plot value that we want to plot (censoring), pass the colors in aes(…, fill=color) and finally create the black lines around the square with geom_tile.
Onto the next plot!
p2 <- df %>%
arrange(tumor_shrink) %>%
mutate(patient = factor(patient, levels=patient)) %>%
mutate(color = ifelse(missing_combination, "white", "gray")) %>%
ggplot(aes(x=factor(patient), y=1, fill=color)) +
geom_raster() +
geom_tile(color="black", size=1) +
geom_text(aes(label=censoring), size=3) +
theme(axis.text = element_blank(),
axis.title = element_blank(),
axis.ticks = element_blank(),
legend.position = "none",
axis.title.y = element_text(angle = 0, vjust=0.57, size=12),
plot.margin = unit(c(-5, 0, 0, 0), "pt")) +
scale_fill_identity() +
labs(y="Censoring")+
coord_fixed()
p2
This plot is again similar to the previous two and the code seems self explatory if you understood how to make the first two. Main differences in this section are modifying the scale_fill_gradient() so we get a nice gradient of colors from minimum of tumor_shrink variable to the maximum value.
p3 <- df %>%
arrange(tumor_shrink) %>%
mutate(patient = factor(patient, levels=patient)) %>%
ggplot(aes(x=patient, y=1, fill=tumor_shrink)) +
geom_raster(alpha=0.8) +
geom_tile(color="black", size=1) +
geom_text(aes(label=formatC(tumor_shrink, 0, format="f")),
size=3) +
theme(axis.text = element_blank(),
axis.title = element_blank(),
axis.ticks = element_blank(),
axis.title.y = element_text(angle = 0, vjust=0.57, size=12),
plot.margin = unit(c(-5, 0, 0, 0), "pt"),
legend.position = "none") +
scale_fill_gradient(low="green", high="red") +
labs(y="Tumor shrink")+
coord_fixed()
p3
Now we need to transform the dataframe into long format and normalize the values to be in the [-100, 100] range. For this we are using function
In the previous equation x’ is the scaled vector of values and x is the original vector. The fraction inside parenthesis normalizes the x values between [0, 1] and then we transform them to desired [-100, 100] range. Here is that as a R function.
normalize <- function(x, na.rm = TRUE) {
up = x - min(x, na.rm=T)
down = max(x, na.rm=T) - min(x, na.rm=T)
return((2 * (up / down) - 1) * 100)
}
In the next block we will pivot the dataframe into long format and apply our normalization function to all non NaN values. We are also creating a column called pat which is factor(patient) but with numerical columns. This was needed so we can sort the values in the last plot with the tumor_shrink values. To add more things to the plot I decided to add markers to the plot that could indicate some importance. For this exercise I have flagged cells that have absolute scaled value higher than 70.
cdf <- df %>%
pivot_longer(all_of(parameters)) %>%
mutate(scaled_val = normalize(value)) %>%
mutate(important = ifelse((abs(scaled_val) > 85), TRUE, FALSE)) %>%
replace_na(list(important = FALSE)) %>%
arrange(tumor_shrink) %>%
mutate(pat=factor(patient, levels = rev(unique(patient)), ordered=TRUE))
With most of the work done with creating the dataframe that we want to plot it is pretty easy to create the plot from that. The plot itself is similar to one created in the previous post.
p4 <- cdf %>%
ggplot(aes(x=pat, y=name, fill=scaled_val)) +
geom_raster(alpha=0.85) +
geom_text(data=filter(cdf, important), aes(label="★"), colour="black",
size=8, vjust=0.2, alpha=0.9) +
scale_fill_gradient2(low="blue", mid="white", high="red", guide="none") +
scale_x_discrete(labels = paste0("Subj ", unique(cdf$patient))) +
theme(axis.title = element_blank(),
panel.grid = element_blank(),
axis.text.x = element_text(angle=-45, hjust=0.3),
plot.margin = unit(c(10, 5, 5, 5), "pt")) +
coord_fixed()
p4
Now all we need to do is to combine all the plots together. This time there is no need to use cowplot as we can use a bit simpler method from library called Patchwork. Patchwork is a brilliant library that allows joining plot using arithmetic operations. You may have been wondering why we need to specify the plot margins. The three plots on top of the
bigger plot are tightly together and to mimic that we need to remove the plot margins.
library(patchwork)
p1 / p2 / p3 / p4
Mikael Roto
14/8/2021