Learning objectives

  • Create RMarkdown reports for sharing analysis methods, code and results
  • Import your Salmon quantification files in R and build a DESeq2 object

1 Making your research reproducible

We have already made the case for reproducibility in this training. In this lesson we will focus on one of the tools that enables and empowers you to perform analyses reproducibly.

When you do lab work, you use lab notebooks to organize your methods, results, and conclusions for future retrieval and reproduction. The information in these notebooks is converted into a more concise experimental description for the Methods section when publishing the results. Computational analysis requires the same diligence! The equivalent of a lab notebook for computational work is a detailed log of the workflow used, the tools at each step, the parameters for those tools and last, but not least, the versions of the tools.

Image source: “Reproducible Research in Computational Science”, Peng 2011 https://doi.org/10.1126/science.1213847

1.1 RMarkdown for R analysis

Creating the “gold standard” code is not always easy, depending on the programming language you are using. For analyses within R, RStudio helps facilitate reproducible research with the use of R scripts, which document all code used to perform a particular analysis. However, we often don’t record the versions of the tools we use in a script, nor do we include or interpret the results of the analyses within the script.

In the first part of this session we will be learning about RMarkdown. RMarkdown is, in its most basic form, a file format that can be converted into a shareable document, e.g. HTML, PDF and many others. It allows you to document not just your R code (as well as Python and SQL), but also to include tables and figures along with descriptive text. The result is a final document that has the methods, the code and the interpretation of results all in a single place!

To elaborate, you write a file using the Markdown language and within it embed executable R code chunks. The code chunks are paired with knitr syntax, so that once your document is complete, you can easily convert it into one of several common formats (i.e. HTML, PDF, PPT) for sharing or documentation.

Nothing beats an example to convince you!

Exercises #1

  1. Open a new RMarkdown file (File > New File > R Markdown…); a dialog box will open. Add your name, keep the other default values and click on OK.
  • A template file will be generated to give you a starting point to modify for your analyses.

  2. Save it as “default_template.Rmd” and “knit” the document to generate an HTML document.

  3. A web page will pop up (you may have to allow the page to pop up).

  4. Change the title and knit again.

  5. Add a sentence below “## R Markdown” and knit once again.

1.2 RMarkdown basics

Markdown is a lightweight markup language with plain-text-formatting syntax. It is often used for formatting README files, writing messages in online discussion forums, and creating rich text documents using a plain text editor. The Markdown language has been adopted by many different coding groups, and some have added their own “flavours”. RStudio implements an “R-flavoured markdown”, or “RMarkdown”, which has really nice features for text and code formatting.

The RStudio cheatsheet for RMarkdown is quite daunting, but includes more advanced RMarkdown options that may be helpful as you become familiar with report generation, including options for adding interactive plots with RShiny.

1.2.1 Components of a .Rmd file

Let’s take a closer look at the “raw” file and understand the components therein.

1. A file header in YAML format

---
title: "Super title"
author: "Toto"
date: "2023-03-24"
output: html_document
---

This section has information listed in YAML format, and is usually used to specify metadata (title, author) and basic configuration information (output format) associated with the file. You can find detailed information about specifications that can be made in this section on this webpage.

2. Descriptive text

## R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see <http://rmarkdown.rstudio.com>.

The syntax for formatting the text portion of the report is relatively easy. You can easily get text that is bolded, italicized, or bolded & italicized. You can create “headers” and “sub-headers” to organize the information by placing a “#” or “##” and so on in front of a line of text, generate numbered and bulleted lists, add hyperlinks to words or phrases, and so on.

Let’s take a look at the syntax of how to do this in RMarkdown:
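Below is a small sample of raw Markdown showing the most common constructs (plain Markdown, nothing R-specific; the text and URL are just placeholders):

**bold**, *italic*, ***bold & italic***

# A header
## A sub-header

1. first numbered item
2. second numbered item

- a bulleted item

[link text](https://example.com)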

You can also get more information about Markdown formatting here and here.

3. Code chunks

The basic idea behind RMarkdown is that you can describe your analysis workflow and provide interpretation of results in plain text, and intersperse chunks of R code within that document to tell a complete story using a single document. Code chunks in RMarkdown are delimited by three backticks (```) on the lines before and after the code.

Each individual code chunk should be given a unique name: the string after r between the curly brackets ({r name}) at the beginning of the chunk. The name should be something meaningful, and we recommend using snake_case for the names whenever possible.
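For example, here is a minimal named chunk as it appears in the raw .Rmd file (the name and its contents are just illustrative):

```{r mean_example}
# Compute the mean of a small vector
mean(c(1, 2, 3))
```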

There is a handy Insert button within RStudio that allows you to insert an empty R chunk in your document without having to type the backticks etc. yourself.

Alternatively, there are keyboard shortcuts available as well.

  • Ctrl + Alt + i for PC users
  • Command + option + i for Mac users

Finally, you can write inline R code enclosed by single backticks (`) and starting with a lowercase r. This allows for variable returns outside of code chunks, and is extremely useful for making report text more dynamic. For example, writing `r Sys.Date()` in the text prints the current date inline within the report (rendered here as 2024-05-01). See how we implement this in the YAML header.

For the final code chunk in your analysis, it is recommended to run the sessionInfo() function. This function will output the R version and the versions of all libraries loaded in the R environment. Documenting the versions of the tools you used is important for reproduction of your analysis in the future.
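A typical closing chunk looks like this (the chunk name is just illustrative):

```{r session_info}
# Record the R version and all loaded package versions for reproducibility
sessionInfo()
```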

1.2.2 Generating the report

Once we have finished creating an RMarkdown file, we finally need to “knit” the report. You can knit the file by using the knit() function, or by just clicking on “Knit” in the panel above the script, as we did in our first exercise in this lesson.

Note that when creating your own reports, you will very likely find yourself knitting the report periodically as you work through it, rather than just once at the end. It is usually an iterative process: you may have to turn off warnings, resize a figure, or update the descriptive text in the document to be informative (for others and your future self).

When you click on the “Knit” button, by default an HTML report will be generated. If you would prefer a different document format, this can be specified in the YAML header with the output: parameter as discussed above, or you can click on the arrow next to the “Knit” button in the panel above the script to choose among the various output options, as shown in the image under the 5th part of the exercise above.

Note: PDF rendering is sometimes problematic, especially when running R remotely, like on the cluster. If you run into problems, it’s likely an issue related to the pandoc and LaTeX installation.

Only HTML output can be responsive.

Exercises #2

  1. Scroll down to the end of your Rmd template file. Add a new code chunk, and within it place the operation 1+1. Knit your document.

  2. Add a new section header (“New section”) above the newly created code chunk, and a new sub section header (“New sub section”).

  3. Click on the Outline button in the top right corner to get the table of contents of your document and navigate it more easily.

  4. Add bold text, italic text, and a list.

  5. Add an image to your Rmd. (First, save the image you want to display on your VM.)

![](img.png){width=250px}

You can try to center it:

# Using chunk options: {r, out.width = "30%", fig.align = "center"}
knitr::include_graphics("session_5/R.png")

2 R basics

We will not do an R basics section, but don’t hesitate to ask questions.

For those who have never used R, take a look at sessions 1 to 6 of the R for beginners tutorial:

https://can.gitbiopages.ens-lyon.fr/R_basis/

3 Differential gene expression (DGE) analysis (Training dataset)

In the following, we will walk you through an end-to-end gene-level RNA-seq differential expression workflow using various R packages. We will start with reading in data obtained from Salmon, convert pseudocounts to counts, perform exploratory data analysis for quality assessment and to explore the relationship between samples, perform differential expression analysis, and visually explore the results prior to performing downstream functional analysis.

3.1 Review of the dataset

We will be using the Salmon abundance estimates from the RNA-Seq dataset we used earlier, which is part of a larger study described in Kenny PJ et al, Cell Rep 2014.

The RNA-Seq was performed on HEK293F cells that were either transfected with a MOV10 transgene, or siRNA to knock down Mov10 expression, or non-specific (irrelevant) siRNA. This resulted in 3 conditions: Mov10 oe (over expression), Mov10 kd (knock down) and Irrelevant kd, respectively. The number of replicates is shown below.

Using these data, we will evaluate transcriptional patterns associated with perturbation of MOV10 expression. Please note that the irrelevant siRNA will be treated as our control condition.

What is the purpose of these datasets? What does Mov10 do?

The authors are investigating interactions between various genes involved in Fragile X syndrome, a disease in which there is aberrant production of the FMRP protein.

  • FMRP is “most commonly found in the brain, is essential for normal cognitive development and female reproductive function. Mutations of this gene can lead to fragile X syndrome, mental retardation, premature ovarian failure, autism, Parkinson’s disease, developmental delays and other cognitive deficits.” - from wikipedia

  • MOV10 is a putative RNA helicase that is also associated with FMRP in the context of the microRNA pathway.

The hypothesis the paper is testing is that FMRP and MOV10 associate and regulate the translation of a subset of RNAs.

Our questions:

  • What patterns of expression can we identify with the loss or gain of MOV10?
  • Are there any genes shared between the two conditions?

3.2 Setting up

Before we get into the details of the analysis, let’s start by creating a new directory for this analysis: R_analysis, in your training_project directory.

Within your R_analysis directory create two new directories:

  • meta
  • results

(NOTE: the data folder will be downloaded in a moment.)

Terminal
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project
mkdir -p R_analysis
cd R_analysis
mkdir -p meta data results

Remember, the key to a good analysis is staying organized from the start!

Now we need to grab the files that we will be working with for the analysis. There are two things we need to download.

  1. First we need the Salmon results for the full dataset.

Use the following command lines to download directly into your project directory:

Terminal
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/R_analysis

wget https://www.dropbox.com/s/oz9yralwbtphw8u/data.zip?dl=1 -O data.zip
unzip data.zip

# rm the compressed file and unwanted downloaded files
rm -r data.zip __MACOSX/

# To see the data content
tree data

Once you have the zip file downloaded you will want to decompress it. This will create a data directory with sub-directories that correspond to each of the samples in our dataset.

  2. Next, we need the annotation file which maps our transcript identifiers to gene identifiers. We have created this file for you using the R Bioconductor package AnnotationHub. For now, we will use it as is, but later in the workshop we will spend some time showing you how to create one yourself.

Use the following command lines to download directly into your project directory:

Terminal
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/R_analysis/data

wget https://github.com/hbctraining/DGE_workshop_salmon/raw/master/data/tx2gene_grch38_ens94.txt

Look at the first lines of this file.

Terminal
head tx2gene_grch38_ens94.txt
tx_id   ensgene symbol
ENST00000387314 ENSG00000210049 MT-TF
ENST00000389680 ENSG00000211459 MT-RNR1
ENST00000387342 ENSG00000210077 MT-TV
ENST00000387347 ENSG00000210082 MT-RNR2
ENST00000612848 ENSG00000276345 AC004556.1
ENST00000386347 ENSG00000209082 MT-TL1
ENST00000361390 ENSG00000198888 MT-ND1

Finally, we will create a script to analyse the data with R. For that, go to the File menu and select New File, then R Markdown…. Use “Gene-level differential expression analysis using DESeq2 (Training dataset)” as the title and your name as the author. This should open up a script editor in the top left-hand corner. This is where we will be typing and saving all commands required for this analysis.

Keep only the YAML header of the Rmd template file and the first chunk.

Now save the file as DEG_analysis.Rmd. When finished your working directory should now look similar to this:

R_analysis/
├── DEG_analysis.Rmd
├── data
│   ├── tx2gene_grch38_ens94.txt
│   ├── Irrel_kd_1.salmon
│   ├── Irrel_kd_2.salmon
│   ├── Irrel_kd_3.salmon
│   ├── Mov10_kd_2.salmon
│   ├── Mov10_kd_3.salmon
│   ├── Mov10_oe_1.salmon
│   ├── Mov10_oe_2.salmon
│   └── Mov10_oe_3.salmon
├── meta
└── results

3.2.1 Loading libraries

For this analysis we will be using several R packages, some which have been installed from CRAN and others from Bioconductor. To use these packages (and the functions contained within them), we need to load the libraries.

  • Add a section “#Loading R libraries”

  • Add the following in a chunk and don’t forget to comment liberally!

## Setup
### Bioconductor and CRAN libraries used
library(DESeq2)
library(tidyverse)
library(RColorBrewer)
library(pheatmap)
library(tximport)
library(ggplot2)
library(ggrepel)
  • You can execute this chunk to check that all the libraries are installed on your machine.

To install R packages:

  • CRAN libraries
install.packages("tidyverse")
  • Bioconductor libraries
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("DESeq2")

3.2.2 Loading data

  • Create a new section “#Load Salmon data” and a sub section ‘## quant files’

The main output of Salmon is a quant.sf file, and we have one of these for each individual sample in our dataset. A screenshot of the file is displayed below:

NOTE: The effective length of a gene in a sample is the average of its transcript lengths after weighting for their relative expression. You may see effective lengths that are larger than the physical length. The interpretation would be that, given the sequence composition of these transcripts (including both the sequence-specific and fragment GC biases), you’d expect a priori to sample more reads from them; thus they have a longer estimated effective length.

The pseudocounts generated by Salmon are represented as normalized TPM (transcripts per million) counts and map to transcripts. These need to be converted into non-normalized count estimates for performing DESeq2 analysis. To use DESeq2 we also need to collapse our abundance estimates from the transcript level to the gene level. We will be using the R Bioconductor package tximport to do all of the above and get set up for DESeq2.

The first thing we need to do is create a variable that contains the paths to each of our quant.sf files. Then we will add names to our quant files which will allow us to easily discriminate between samples in the final output matrix.

  • Add the following chunk and execute it
## List all directories containing data and finishing by salmon
## (Warning, the following line may be adjusted according to how you have named the Salmon results directories (here: data/sample_name.salmon))
samples <- list.files(path = "./data", full.names = T, pattern="salmon$")

## Obtain a vector of all quant.sf filenames including the path
quant_files <- file.path(samples, "quant.sf")

## Since all quant files have the same name it is useful to have names for each element
## (Warning, the following line may be adjusted according to how you have named the Salmon results directories (here: data/sample_name.salmon))
names(quant_files) <- str_replace(samples, "./data/", "") %>% 
                str_replace(".salmon", "")

## Display the quant_files named vector
quant_files
                         Irrel_kd_1 
"./data/Irrel_kd_1.salmon/quant.sf" 
                         Irrel_kd_2 
"./data/Irrel_kd_2.salmon/quant.sf" 
                         Irrel_kd_3 
"./data/Irrel_kd_3.salmon/quant.sf" 
                         Mov10_kd_2 
"./data/Mov10_kd_2.salmon/quant.sf" 
                         Mov10_kd_3 
"./data/Mov10_kd_3.salmon/quant.sf" 
                         Mov10_oe_1 
"./data/Mov10_oe_1.salmon/quant.sf" 
                         Mov10_oe_2 
"./data/Mov10_oe_2.salmon/quant.sf" 
                         Mov10_oe_3 
"./data/Mov10_oe_3.salmon/quant.sf"

Our Salmon index was generated with transcript sequences listed by Ensembl IDs, but tximport needs to know which gene each transcript came from in order to sum transcript counts at the gene level. We will use the annotation table that we downloaded to extract transcript-to-gene information, so that we know which transcripts belong to each gene.

  • Create a new sub section ‘## tx2gene file’

  • Add the following chunk and execute it

# Load the annotation table for GrCh38
tx2gene <- read.delim("data/tx2gene_grch38_ens94.txt")

# Take a look at it 
tx2gene %>% head()
            tx_id         ensgene     symbol
1 ENST00000387314 ENSG00000210049      MT-TF
2 ENST00000389680 ENSG00000211459    MT-RNR1
3 ENST00000387342 ENSG00000210077      MT-TV
4 ENST00000387347 ENSG00000210082    MT-RNR2
5 ENST00000612848 ENSG00000276345 AC004556.1
6 ENST00000386347 ENSG00000209082     MT-TL1

tx2gene is a three-column dataframe linking transcript ID (column 1) to gene ID (column 2) to gene symbol (column 3). We will take the first two columns as input to tximport. The column names are not relevant, but the column order is (i.e. transcript ID must be first).

Now we are ready to run tximport. Note that although there is a column in our quant.sf files that corresponds to the estimated count value for each transcript, those values are correlated with effective length. What we want to do is use the countsFromAbundance="lengthScaledTPM" argument. This will use the TPM column, and compute quantities that are on the same scale as original counts, except no longer correlated with transcript length across samples.

  • Create a new sub section ‘## run tximport’

  • Add the following chunk and execute it

#?tximport   # let's take a look at the arguments for the tximport function

# Run tximport
txi <- tximport(quant_files,
                type="salmon",
                tx2gene=tx2gene[,c("tx_id", "ensgene")],
                countsFromAbundance="lengthScaledTPM")

NOTE: When performing your own analysis you may find that the reference transcriptome file you obtain from Ensembl has version numbers included on your identifiers (i.e. ENSG00000265439.2). This will cause a discrepancy with the tx2gene file, since the annotation databases don’t usually contain version numbers (i.e. ENSG00000265439). To get around this issue you can use the argument ignoreTxVersion = TRUE. The logical value indicates whether to split the tx id on the ‘.’ character to remove version information, for easier matching.
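A sketch of what that call could look like in this situation (identical to the chunk above, plus the extra argument):

# Only needed when the transcript IDs in quant.sf carry version suffixes
txi <- tximport(quant_files,
                type = "salmon",
                tx2gene = tx2gene[, c("tx_id", "ensgene")],
                countsFromAbundance = "lengthScaledTPM",
                ignoreTxVersion = TRUE)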

3.2.3 Viewing data

The txi object is a simple list containing matrices of the abundances, counts, and lengths. Another list element, ‘countsFromAbundance’, carries through the character argument used in the tximport call. The length matrix contains the average transcript length for each gene, which can be used as an offset for gene-level analysis.

  • execute the following line to get the attributes of “txi”
attributes(txi)
$names
[1] "abundance"           "counts"              "length"             
[4] "countsFromAbundance"
  • execute this command to get the count matrix
# Look at the counts
txi$counts

3.2.4 Creating metadata

Keeping track of the information about our data is of great importance. At minimum, we need a file which maps our samples to the corresponding sample groups that we are investigating. We will use the column headers from the counts matrix as the row names of our metadata file, and have a single column to identify each sample as “MOV10_overexpression”, “MOV10_knockdown”, or “control”.

  • Add a sub section “## build/import metadata”

  • Add the following chunk and execute it

## Create a sampletable/metadata

### create manually each column
manual_sample_names <- c("Mov10_kd_2","Mov10_kd_3","Mov10_oe_1","Mov10_oe_2","Mov10_oe_3","Irrel_kd_1","Irrel_kd_2","Irrel_kd_3")

manual_sampletype <- c("MOV10_knockdown", "MOV10_knockdown", "MOV10_overexpression", "MOV10_overexpression", "MOV10_overexpression", "control", "control", "control")


### create manually the table

manual_meta <- data.frame(name = manual_sample_names,
                          sampletype = manual_sampletype)

manual_meta
        name           sampletype
1 Mov10_kd_2      MOV10_knockdown
2 Mov10_kd_3      MOV10_knockdown
3 Mov10_oe_1 MOV10_overexpression
4 Mov10_oe_2 MOV10_overexpression
5 Mov10_oe_3 MOV10_overexpression
6 Irrel_kd_1              control
7 Irrel_kd_2              control
8 Irrel_kd_3              control
# save the metadata to a file in tsv format (tab-separated text file)
write_tsv(manual_meta, file = "meta/sample_metadata.tsv")
  • You can modify the file, for example by manually adding a “replicate” column containing rep1, rep2, rep3.

  • Save the file under a new name (sample_metadata_ok.tsv) to avoid overwriting the original.

  • Import the metadata from this file.

# Load the metadata from a file in tsv format (tab-separated text file)
meta <- read_tsv("meta/sample_metadata_ok.tsv")

3.3 Differential gene expression analysis overview

So what does this count data actually represent? The count data used for differential expression analysis represents the number of sequence reads that originated from a particular gene. The higher the number of counts, the more reads associated with that gene, and the assumption is that the gene was expressed at a higher level in the sample.

With differential expression analysis, we are looking for genes that change in expression between two or more groups (defined in the metadata)

  • case vs. control
  • correlation of expression with some variable or clinical outcome

Why does it not work to identify differentially expressed genes by ranking the genes by how different they are between the two groups (based on fold change values)?

More often than not, there is much more going on with your data than what you are anticipating. Variation in gene expression levels between samples is a consequence not only of the experimental variables of interest but also of extraneous sources. The goal of differential expression analysis is to determine the relative role of these effects, and to separate the “interesting” from the “uninteresting”.

Even though the mean expression levels between sample groups may appear to be quite different, it is possible that the difference is not actually significant. This is illustrated for ‘GeneA’ expression between ‘untreated’ and ‘treated’ groups in the figure below. The mean expression level of geneA for the ‘treated’ group is twice as large as for the ‘untreated’ group. But is the difference in expression (counts) between groups significant given the amount of variation observed within groups (replicates)?

We need to take into account the variation in the data (and where it might be coming from) when determining whether genes are differentially expressed.

3.3.1 RNA-seq count distribution

To test for significance, we need an appropriate statistical model that accurately performs normalization (to account for differences in sequencing depth, etc.) and variance modeling (to account for few numbers of replicates and large dynamic expression range).

To determine the appropriate statistical model, we need information about the distribution of counts. To get an idea about how RNA-seq counts are distributed, let’s plot the counts for a single sample, ‘Mov10_oe_1’:

# Write the counts to an object
data <- txi$counts %>% 
  round() %>% 
  data.frame()
  
ggplot(data) +
  geom_histogram(aes(x = Mov10_oe_1), stat = "bin", bins = 200) +
  xlab("Raw expression counts") +
  ylab("Number of genes")

This plot illustrates some common features of RNA-seq count data:

  • a low number of counts associated with a large proportion of genes
  • a long right tail due to the lack of any upper limit for expression
  • large dynamic range

NOTE: The log intensities of microarray data approximate a normal distribution. However, due to the different properties of RNA-seq count data, such as integer counts instead of continuous measurements and non-normally distributed data, the normal distribution does not accurately model RNA-seq counts [1].

3.3.2 Modeling count data

Count data in general can be modeled with various distributions:

  1. Binomial distribution: Gives you the probability of getting a number of heads upon tossing a coin a number of times. Based on discrete events and used in situations when you have a certain number of cases.

  2. Poisson distribution: For use when the number of cases is very large (i.e. people who buy lottery tickets), but the probability of an event is very small (probability of winning). The Poisson is a limiting case of the binomial and models counts of discrete events occurring over a continuous interval (e.g. time). Appropriate for data where mean == variance.

  3. Negative binomial distribution: An extension of the Poisson, with an additional dispersion parameter that allows the variance to be adjusted independently of the mean.

Details provided by Rafael Irizarry in the EdX class.

3.3.2.1 So what do we use for RNA-seq count data?

With RNA-Seq data, a very large number of RNAs are represented and the probability of pulling out a particular transcript is very small. Thus, it would be an appropriate situation to use the Poisson or Negative binomial distribution. Choosing one over the other will depend on the relationship between mean and variance in our data.

However, in practice a large number of replicates can be hard to obtain (depending on how samples are acquired) and/or unaffordable. It is more common to see datasets with only a handful of replicates (~3-5) and a reasonable amount of variation between them. The model that fits best, given this type of variability between replicates, is the Negative Binomial (NB) model. Essentially, the NB model is a good approximation for data where the mean < variance, as is the case with RNA-Seq count data.
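You can check this mean-versus-variance relationship yourself. Below is a minimal sketch using the three Mov10 overexpression replicates; it assumes the data data frame created for the histogram in the previous section:

# Mean and variance of the raw counts across the Mov10 overexpression replicates
mean_counts <- apply(data[, c("Mov10_oe_1", "Mov10_oe_2", "Mov10_oe_3")], 1, mean)
variance_counts <- apply(data[, c("Mov10_oe_1", "Mov10_oe_2", "Mov10_oe_3")], 1, var)

mv <- data.frame(mean_counts, variance_counts)

# Each point is a gene; the red line is x = y (variance == mean)
ggplot(mv) +
  geom_point(aes(x = mean_counts, y = variance_counts)) +
  geom_abline(intercept = 0, slope = 1, color = "red") +
  scale_x_log10() +
  scale_y_log10()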

In the figure above (mean versus variance for the ‘Mov10 overexpression’ replicates), each data point represents a gene and the red line represents x = y. There are a few important things to note here:

  1. The variance across replicates tends to be greater than the mean (red line), especially for genes with large mean expression levels.
  2. For the lowly expressed genes we see quite a bit of scatter. We usually refer to this as “heteroscedasticity”. That is, for a given expression level we observe a lot of variation in the amount of variance.

NOTE: If we used the Poisson distribution, we would underestimate the variability, leading to an increase in false positive DE genes.

3.3.3 Improving mean estimates with biological replicates

The value of additional replicates is that as you add more data, you get increasingly precise estimates of group means, and ultimately greater confidence in the ability to distinguish differences between sample classes (i.e. more DE genes).

  • Biological replicates represent multiple samples (i.e. RNA from different mice) representing the same sample class
  • Technical replicates represent the same sample (i.e. RNA from the same mouse) but with technical steps replicated
  • Usually biological variance is much greater than technical variance, so we do not need to account for technical variance to identify biological differences in expression
  • Don’t spend money on technical replicates - biological replicates are much more useful

NOTE: If you are using cell lines and are unsure whether you have prepared biological or technical replicates, take a look at this link. This is a useful resource to help you determine how best to set up your in vitro experiment.

The figure below illustrates the relationship between sequencing depth and number of replicates on the number of differentially expressed genes identified [1].

Note that an increase in the number of replicates tends to return more DE genes than increasing the sequencing depth. Therefore, generally more replicates are better than higher sequencing depth, with the caveat that higher depth is required for detection of lowly expressed DE genes and for performing isoform-level differential expression.

3.3.4 Differential expression analysis workflow

To model counts appropriately when performing a differential expression analysis, a number of software packages have been developed for differential expression analysis of RNA-seq data. Even as new methods are continuously being developed, a few tools are generally recommended as best practice, e.g. DESeq2 and edgeR. Both tools use the negative binomial model, employ similar methods, and typically yield similar results. They are fairly stringent, and have a good balance between sensitivity and specificity (reducing both false positives and false negatives).

Limma-Voom is another set of tools often used together for DE analysis, but this method may be less sensitive for small sample sizes. This method is recommended when the number of biological replicates per group grows large (> 20).

Many studies describing comparisons between these methods show that while there is some agreement, there is also much variability between tools. Additionally, there is no single method that performs optimally under all conditions (Soneson and Delorenzi, 2013).

We will be using DESeq2 for the DE analysis, and the analysis steps with DESeq2 are shown in the flowchart below in green. DESeq2 first normalizes the count data to account for differences in library sizes and RNA composition between samples. Then, we will use the normalized counts to make some plots for QC at the gene and sample level. The final step is to use the appropriate functions from the DESeq2 package to perform the differential expression analysis. We will go in-depth into each of these steps in the following lessons, but additional details and helpful suggestions regarding DESeq2 can be found in the DESeq2 vignette.

3.4 Differential gene expression (DGE) analysis using DESeq2

3.4.1 Normalization

The first step in the DE analysis workflow is count normalization, which is necessary to make accurate comparisons of gene expression between samples.

The counts of mapped reads for each gene is proportional to the expression of RNA (“interesting”) in addition to many other factors (“uninteresting”). Normalization is the process of scaling raw count values to account for the “uninteresting” factors. In this way the expression levels are more comparable between and/or within samples.

The main factors often considered during normalization are:

  • Sequencing depth: Accounting for sequencing depth is necessary for comparison of gene expression between samples. In the example below, each gene appears to have doubled in expression in Sample A relative to Sample B, however this is a consequence of Sample A having double the sequencing depth.

NOTE: In the figure above, each pink and green rectangle represents a read aligned to a gene. Reads connected by dashed lines connect a read spanning an intron.

  • Gene length: Accounting for gene length is necessary for comparing expression between different genes within the same sample. In the example, Gene X and Gene Y have similar levels of expression, but the number of reads mapped to Gene X would be many more than the number mapped to Gene Y because Gene X is longer.

  • RNA composition: A few highly differentially expressed genes between samples, differences in the number of genes expressed between samples, or presence of contamination can skew some types of normalization methods. Accounting for RNA composition is recommended for accurate comparison of expression between samples, and is particularly important when performing differential expression analyses [1].

    In the example, imagine the sequencing depths are similar between Sample A and Sample B, and every gene except for gene DE presents similar expression level between samples. The counts in Sample B would be greatly skewed by the DE gene, which takes up most of the counts. Other genes for Sample B would therefore appear to be less expressed than those same genes in Sample A.

While normalization is essential for differential expression analyses, it is also necessary for exploratory data analysis, visualization of data, and whenever you are exploring or comparing counts between or within samples.

3.4.2 Common normalization methods

Several common normalization methods exist to account for these differences:

| Normalization method | Description | Accounted factors | Recommendations for use |
|---|---|---|---|
| CPM (counts per million) | counts scaled by total number of reads | sequencing depth | gene count comparisons between replicates of the same sample group; NOT for within-sample comparisons or DE analysis |
| TPM (transcripts per kilobase million) | counts per length of transcript (kb) per million reads mapped | sequencing depth and gene length | gene count comparisons within a sample or between samples of the same sample group; NOT for DE analysis |
| RPKM/FPKM (reads/fragments per kilobase of exon per million reads/fragments mapped) | similar to TPM | sequencing depth and gene length | gene count comparisons between genes within a sample; NOT for between-sample comparisons or DE analysis |
| DESeq2’s median of ratios [1] | counts divided by sample-specific size factors determined by median ratio of gene counts relative to geometric mean per gene | sequencing depth and RNA composition | gene count comparisons between samples and for DE analysis; NOT for within-sample comparisons |
| EdgeR’s trimmed mean of M values (TMM) [2] | uses a weighted trimmed mean of the log expression ratios between samples | sequencing depth and RNA composition | gene count comparisons between samples and for DE analysis; NOT for within-sample comparisons |

3.4.2.2 DESeq2-normalized counts: Median of ratios method

Since tools for differential expression analysis are comparing the counts between sample groups for the same gene, gene length does not need to be accounted for by the tool. However, sequencing depth and RNA composition do need to be taken into account.

To normalize for sequencing depth and RNA composition, DESeq2 uses the median of ratios method. On the user-end there is only one step, but on the back-end there are multiple steps involved, as described below.

NOTE: The steps below describe in detail some of the steps performed by DESeq2 when you run a single function to get DE genes. Basically, for a typical RNA-seq analysis, you would not run these steps individually.

  • Step 1: creates a pseudo-reference sample (row-wise geometric mean)

For each gene, a pseudo-reference sample is created that is equal to the geometric mean across all samples.

| gene | sampleA | sampleB | pseudo-reference sample |
|---|---|---|---|
| EF2A | 1489 | 906 | sqrt(1489 * 906) = 1161.5 |
| ABCD1 | 22 | 13 | sqrt(22 * 13) = 16.9 |

  • Step 2: calculates ratio of each sample to the reference

For every gene in a sample, the ratios (sample/ref) are calculated (as shown below). This is performed for each sample in the dataset. Since the majority of genes are not differentially expressed, the majority of genes in each sample should have similar ratios within the sample.

| gene | sampleA | sampleB | pseudo-reference sample | ratio of sampleA/ref | ratio of sampleB/ref |
|---|---|---|---|---|---|
| EF2A | 1489 | 906 | 1161.5 | 1489/1161.5 = 1.28 | 906/1161.5 = 0.78 |
| ABCD1 | 22 | 13 | 16.9 | 22/16.9 = 1.30 | 13/16.9 = 0.77 |
| MEFV | 793 | 410 | 570.2 | 793/570.2 = 1.39 | 410/570.2 = 0.72 |
| BAG1 | 76 | 42 | 56.5 | 76/56.5 = 1.35 | 42/56.5 = 0.74 |
| MOV10 | 521 | 1196 | 883.7 | 521/883.7 = 0.59 | 1196/883.7 = 1.35 |

  • Step 3: calculate the normalization factor for each sample (size factor)

The median value (column-wise for the above table) of all ratios for a given sample is taken as the normalization factor (size factor) for that sample, as calculated below. Notice that the differentially expressed genes should not affect the median value:

### The size factor is the median of all ratios for a given sample
normalization_factor_sampleA <- median(c(1.28, 1.30, 1.39, 1.35, 0.59))  # = 1.30

normalization_factor_sampleB <- median(c(0.78, 0.77, 0.72, 0.74, 1.35))  # = 0.77

The figure below illustrates the median value for the distribution of all gene ratios for a single sample (frequency is on the y-axis).

The median of ratios method makes the assumption that not ALL genes are differentially expressed; therefore, the normalization factors should account for sequencing depth and RNA composition of the sample (large outlier genes will not represent the median ratio values). This method is robust to imbalance in up-/down-regulation and large numbers of differentially expressed genes.

Usually these size factors are around 1; if you see large variations between samples it is important to take note, since this might indicate the presence of extreme outliers.

  • Step 4: calculate the normalized count values using the normalization factor

This is performed by dividing each raw count value in a given sample by that sample’s normalization factor to generate normalized count values. This is performed for all count values (every gene in every sample). For example, if the median ratio for sampleA was 1.3 and the median ratio for sampleB was 0.77, you could calculate normalized counts as follows:

sampleA median ratio = 1.3

sampleB median ratio = 0.77

Raw Counts

| gene | sampleA | sampleB |
|---|---|---|
| EF2A | 1489 | 906 |
| ABCD1 | 22 | 13 |

Normalized Counts

| gene | sampleA | sampleB |
|---|---|---|
| EF2A | 1489 / 1.3 = 1145.39 | 906 / 0.77 = 1176.62 |
| ABCD1 | 22 / 1.3 = 16.92 | 13 / 0.77 = 16.88 |

Please note that normalized count values are not whole numbers.
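As a toy illustration of steps 3-4 in R (values taken from the tables above; purely illustrative, since DESeq2 performs all of this for you):

# Raw counts for the two example genes
raw <- data.frame(sampleA = c(1489, 22),
                  sampleB = c(906, 13),
                  row.names = c("EF2A", "ABCD1"))

# Size factors (median ratios) from step 3
size_factors <- c(sampleA = 1.3, sampleB = 0.77)

# Divide every count in each column by that sample's size factor
normalized <- sweep(raw, 2, size_factors, "/")
normalized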

3.5 Count normalization of Mov10 dataset using DESeq2

Now that we know the theory of count normalization, we will normalize the counts for the Mov10 dataset using DESeq2. This requires a few steps:

  1. Ensure the row names of the metadata dataframe are present and in the same order as the column names of the counts dataframe.
  2. Create a DESeqDataSet object
  3. Generate the normalized counts

3.6 1. Match the metadata and counts data

We should always make sure that we have sample names that match between the two files, and that the samples are in the right order. DESeq2 will output an error if this is not the case.

  • Add the following chunk to your Rmd file
# display col names of your count table
colnames(txi$counts)

# display the column containing sample id in your metadata table
meta$name
  • You can see that colnames(txi$counts) and your metadata table are not in the same order.

  • We have to order them:

### Check that sample names from colnames(txi$counts) are in the name column of meta
all(colnames(txi$counts) %in% meta$name) # TRUE


### There are several ways of sorting your meta table; choose the one that suits you best.

# Solution 1: use a factor to sort your table and add row names
meta1 <- meta

meta1 <- meta1 %>% 
  mutate(rownames_tmp = factor(name, levels = colnames(txi$counts))) %>%
  arrange(rownames_tmp) %>%
  column_to_rownames("rownames_tmp")

# display meta1
meta1

# Solution 2: use row names to sort your table
meta2 <- meta

rownames(meta2) <- meta2$name
meta2 <- meta2[colnames(txi$counts), ]
rownames(meta2) <- meta2$name  # subsetting drops the row names, so set them again

# display meta2
meta2


# Finally, keep one of the solutions
meta <- meta1

# Final check: sample names match and are in the same order in both objects
all(colnames(txi$counts) %in% rownames(meta)) # TRUE
all(colnames(txi$counts) == rownames(meta))   # TRUE

3.7 2. Create DESeq2 object

  • Create a new section in your Rmd file: “# Normalization and overview”

Bioconductor software packages often define and use a custom class within R for storing data (input data, intermediate data and also results). These custom data structures are similar to lists in that they can contain multiple different data types/structures within them. But, unlike lists they have pre-specified data slots, which hold specific types/classes of data. The data stored in these pre-specified slots can be accessed by using specific package-defined functions.

Let’s start by creating the DESeqDataSet object and then we can talk a bit more about what is stored inside it. To create the object we will need the count matrix and the metadata table as input. We will also need to specify a design formula. The design formula specifies the column(s) in the metadata table and how they should be used in the analysis. For our dataset we only have one column we are interested in, that is ~sampletype. This column has three factor levels, which tells DESeq2 that for each gene we want to evaluate gene expression change with respect to these different levels.

Our count matrix input is stored inside the txi list object, and so we pass that in using the DESeqDataSetFromTximport() function which will extract the counts component and round the values to the nearest whole number.

  • Add a sub section “## Create DESeq object”

  • Add the following chunk and execute it

## Init DESeq2Dataset object
dds <- DESeqDataSetFromTximport(txi, colData = meta, design = ~ sampletype)

You can use DESeq-specific functions to access the different slots and retrieve information, if you wish. For example, suppose we wanted the original count matrix we would use counts():

counts(dds)
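A related optional step worth knowing about: when DESeq2 builds contrasts, it uses the first level of the design factor as the reference. Factor levels are ordered alphabetically by default (which can be locale-dependent), so it is safer to set the reference explicitly; a minimal sketch:

## Explicitly set "control" as the reference level of sampletype
dds$sampletype <- relevel(dds$sampletype, ref = "control")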

As we go through the workflow we will use the relevant functions to check what information gets stored inside our object.

3.8 3. Generate the Mov10 normalized counts

The next step is to normalize the count data in order to be able to make fair gene comparisons between samples.

To perform the normalization, you will use the DESeq() function on the dds object. This “meta” function will in fact run the normalization through the internal function estimateSizeFactors(), which generates the size factors for us.

  • Add the following chunk and execute it
dds <- DESeq(dds)

By assigning the results back to the dds object we are filling in the slots of the DESeqDataSet object with the appropriate information. We can take a look at the normalization factor applied to each sample using:

sizeFactors(dds)

Now, to retrieve the normalized counts matrix from dds, we use the counts() function and add the argument normalized=TRUE.

  • Add the following chunk and execute it
normalized_counts <- counts(dds, normalized=TRUE)

To retrieve the raw counts matrix from dds, we can use the counts() function and add the argument normalized=FALSE.

  • Add the following chunk and execute it
raw_counts <- counts(dds, normalized=FALSE)

We can save this normalized and the raw data matrix to files for later use:

  • Add the following chunk and execute it
# Convert each matrix to a table and put the rownames in a column named "gene_ID"
normalized_counts_df <- normalized_counts %>% as_tibble(rownames = "gene_ID")
raw_counts_df <- raw_counts %>% as_tibble(rownames = "gene_ID")
write_tsv(normalized_counts_df,  file="results/normalized_counts.txt")
write_tsv(raw_counts_df, file="results/raw_counts.txt")

NOTE: DESeq2 doesn’t actually use normalized counts; rather, it uses the raw counts and models the normalization inside the Generalized Linear Model (GLM). These normalized counts will be useful for downstream visualization of results, but cannot be used as input to DESeq2 or any other tool that performs differential expression analysis using the negative binomial model.

3.9 Quality Control

The next step in the DESeq2 workflow is QC, which includes sample-level and gene-level steps to perform QC checks on the count data to help us ensure that the samples/replicates look good.

3.9.1 Sample-level QC

A useful initial step in an RNA-seq analysis is often to assess overall similarity between samples:

  • Which samples are similar to each other, which are different?
  • Does this fit to the expectation from the experiment’s design?
  • What are the major sources of variation in the dataset?

To explore the similarity of our samples, we will be performing sample-level QC using Principal Component Analysis (PCA) and hierarchical clustering methods. Our sample-level QC allows us to see how well our replicates cluster together, as well as, observe whether our experimental condition represents the major source of variation in the data. Performing sample-level QC can also identify any sample outliers, which may need to be explored further to determine whether they need to be removed prior to DE analysis.

When using these unsupervised clustering methods, log2-transformation of the normalized counts improves the distances/clustering for visualization. DESeq2 uses a regularized log transform (rlog) of the normalized counts for sample-level QC as it moderates the variance across the mean, improving the clustering.

NOTE: The DESeq2 vignette suggests that for large datasets (100s of samples) the variance-stabilizing transformation (vst) be used instead of rlog for transformation of the counts, since the rlog function might take too long to run and the vst() function is faster, with similar properties to rlog.

3.9.2 Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a technique used to emphasize variation and bring out strong patterns in a dataset (dimensionality reduction). Details regarding PCA are given below (based on materials from StatQuest); if you would like a more thorough description, we encourage you to explore StatQuest’s video and our longer lesson.

If two samples have similar levels of expression for the genes that contribute significantly to the variation represented by PC1, they will be plotted close together on the PC1 axis. Therefore, we would expect biological replicates to have similar scores (since the same genes are changing) and to cluster together on PC1 and/or PC2, and samples from different treatment groups to have different scores. This is easiest to understand by visualizing example PCA plots.

3.9.3 Hierarchical Clustering Heatmap

Similar to PCA, hierarchical clustering is another, complementary method for identifying strong patterns in a dataset and potential outliers. The heatmap displays the correlation of gene expression for all pairwise combinations of samples in the dataset. Since the majority of genes are not differentially expressed, samples generally have high correlations with each other (values higher than 0.80). Samples below 0.80 may indicate an outlier in your data and/or sample contamination.

The hierarchical tree can indicate which samples are more similar to each other based on the normalized gene expression values. The color blocks indicate substructure in the data, and you would expect to see your replicates cluster together as a block for each sample group. Additionally, we expect to see samples clustered similar to the groupings observed in a PCA plot.

3.9.4 Gene-level QC

In addition to examining how well the samples/replicates cluster together, there are a few more QC steps. Prior to differential expression analysis it is beneficial to omit genes that have little or no chance of being detected as differentially expressed. This will increase the power to detect differentially expressed genes. The genes omitted fall into three categories:

  • Genes with zero counts in all samples
  • Genes with an extreme count outlier
  • Genes with low mean normalized counts

DESeq2 will perform this filtering by default; however, other DE tools, such as edgeR, will not. Filtering is a necessary step, even if you are using limma-voom and/or edgeR’s quasi-likelihood methods. Be sure to follow the pre-filtering steps outlined in the user guides on Bioconductor when using these tools, as they generally perform much better with it.
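For reference, here is a minimal pre-filter of the kind those guides describe (shown on the dds object for illustration only; DESeq2 itself already handles low counts internally):

## Keep only genes with at least 10 counts summed across all samples
keep <- rowSums(counts(dds)) >= 10
dds_filtered <- dds[keep, ]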

3.10 Mov10 quality assessment and exploratory analysis using DESeq2

Now that we have a good understanding of the QC steps normally employed for RNA-seq, let’s implement them for the Mov10 dataset we are going to be working with.

  • add a sub section “## Quality control”

3.10.1 Transform normalized counts using the rlog transformation

To improve the distances/clustering for the PCA and hierarchical clustering visualization methods, we need to moderate the variance across the mean by applying the rlog transformation to the normalized counts.

The rlog transformation of the normalized counts is only necessary for these visualization methods during this quality assessment. We will not be using these transformed counts downstream.

  • Add the following chunk and execute it
### Transform counts for data visualization
rld <- rlog(dds, blind=TRUE)

The blind=TRUE argument results in a transformation unbiased to sample condition information. When performing quality assessment, it is important to include this option. The DESeq2 vignette has more details.

The rlog function returns a DESeqTransform object, another type of DESeq-specific object. The reason you don’t just get a matrix of transformed values is that all of the parameters (i.e. size factors) that went into computing the rlog transform are stored in that object. We use this object to plot the PCA and hierarchical clustering figures for quality assessment.

NOTE: The rlog() function can be a bit slow when you have e.g. > 20 samples. In these situations the vst() function is much faster and performs a similar transformation appropriate for use with plotPCA(). It typically takes just a few seconds with vst(), due to optimizations and the nature of the transformation.
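For reference, the vst() call mirrors the rlog() one:

### Variance-stabilizing transformation (faster alternative to rlog for many samples)
vsd <- vst(dds, blind = TRUE)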

3.10.2 Principal components analysis (PCA)

DESeq2 has a built-in function for plotting PCA plots, that uses ggplot2 under the hood. This is great because it saves us having to type out lines of code and having to fiddle with the different ggplot2 layers. In addition, it takes the rlog object as an input directly, hence saving us the trouble of extracting the relevant information from it.

The function plotPCA() requires two arguments as input: an rlog object and the intgroup (the column in our metadata that we are interested in).

  • Add a sub sub section “### PCA”

  • Add the following chunk and execute it

### Plot PCA 
plotPCA(rld, intgroup="sampletype")

What does this plot tell you about the similarity of samples? Does it fit the expectation from the experimental design? By default the function uses the top 500 most variable genes. You can change this by adding the ntop argument and specifying how many genes you want to use to draw the plot.
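For example:

### Use the top 1000 most variable genes instead of the default 500
plotPCA(rld, intgroup = "sampletype", ntop = 1000)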

Depending on how much variation is explained by the first few principal components, you may want to explore more (i.e consider more components and plot pairwise combinations). Even if your samples do not separate clearly by the experimental variable, you may still get biologically relevant results from the DE analysis. If you are expecting very small effect sizes, then it’s possible the signal is drowned out by extraneous sources of variation. In situations where you can identify those sources, it is important to account for these in your model, as it provides more power to the tool for detecting DE genes.

NOTE: The plotPCA() function will only return the values for PC1 and PC2. If you would like to explore the additional PCs in your data or if you would like to identify genes that contribute most to the PCs, you can use the prcomp() function. For example, to plot any of the PCs we could run the following code:

# Input is a matrix of log transformed values
rld <- rlog(dds, blind = TRUE)
rld_mat <- assay(rld)
pca <- prcomp(t(rld_mat))
# Create data frame with metadata and PC3 and PC4 values for input to ggplot
df <- cbind(meta, pca$x)
ggplot(df) + geom_point(aes(x=PC3, y=PC4, color = sampletype))

Resources are available to learn how to do more complex inquiries using the PCs.

3.10.3 Hierarchical Clustering

Since there is no built-in function for heatmaps in DESeq2, we will be using the pheatmap() function from the pheatmap package. This function requires a matrix/dataframe of numeric values as input, so the first thing we need to do is retrieve that information from the rld object:

  • Add a sub sub section “### Hierarchical Clustering”

  • Add the following chunck and execute it

### Extract the rlog matrix from the object
rld_mat <- assay(rld)    ## assay() is a function from the "SummarizedExperiment" package, which was loaded when you loaded DESeq2

Then we need to compute the pairwise correlation values for samples. We can do this using the cor() function:

### Compute pairwise correlation values
rld_cor <- cor(rld_mat)    ## cor() is a base R function

head(rld_cor)   ## check the output of cor(), make note of the rownames and colnames

And now to plot the correlation values as a heatmap:

### Load pheatmap package
library(pheatmap)

### Plot heatmap
pheatmap(rld_cor, annotation = meta)

Overall, we observe pretty high correlations across the board ( > 0.999) suggesting no outlying sample(s). Also, similar to the PCA plot you see the samples clustering together by sample group. Together, these plots suggest to us that the data are of good quality and we have the green light to proceed to differential expression analysis.

NOTE: The pheatmap function has a number of different arguments that we can alter from default values to enhance the aesthetics of the plot. If you are curious and want to explore more, try running the code below. How does your plot change? Take a look through the help pages (?pheatmap) and identify what each of the added arguments is contributing to the plot.

heat.colors <- brewer.pal(6, "Blues")

pheatmap(rld_cor, annotation = meta, color = heat.colors,
         border_color=NA, fontsize = 10, fontsize_row = 10, height=20)

Curious about all of the available color palettes offered by the RColorBrewer package? Try typing display.brewer.all() in your console and see what happens!

3.11 DEG on your dataset

Now it is time to do the same thing on your dataset. Create a new R_analysis directory in your my_project directory.

You can find below some commands to get the transcript to gene maps for human and mouse.

3.11.1 Human tr2gene

#BiocManager::install("biomaRt")
library(biomaRt)
human_mart <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")

human_tr2genes <- getBM(
  attributes=c("ensembl_transcript_id","ensembl_gene_id"),
  mart = human_mart)

head(human_tr2genes)
dim(human_tr2genes)

3.11.2 Mouse tr2gene

#BiocManager::install("biomaRt")
library(biomaRt)
mouse_mart <- useMart("ensembl", dataset = "mmusculus_gene_ensembl")

mouse_tr2genes <- getBM(
  attributes=c("ensembl_transcript_id","ensembl_gene_id"),
  mart = mouse_mart)

head(mouse_tr2genes)
dim(mouse_tr2genes)
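These two-column maps can be saved for reuse and passed directly to tximport; a sketch (the output file name is just an example):

# Save the transcript-to-gene map for reuse (write_tsv comes from the tidyverse)
write_tsv(mouse_tr2genes, file = "data/tx2gene_mouse_ensembl.txt")

# It can then be given to tximport as-is, e.g.:
# txi <- tximport(quant_files, type = "salmon", tx2gene = mouse_tr2genes,
#                 countsFromAbundance = "lengthScaledTPM")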

NOTE: When performing your own analysis you may find that the reference transcriptome file you obtain from Ensembl has version numbers included on your identifiers (i.e. ENSG00000265439.2). This will cause a discrepancy with the tx2gene file, since the annotation databases don’t usually contain version numbers (i.e. ENSG00000265439). To get around this issue you can use the argument ignoreTxVersion = TRUE in the tximport() function. The logical value indicates whether to split the tx id on the ‘.’ character to remove version information, for easier matching.

See you in the next session


These materials have been developed by members of BIBS team of the CIRI (https://ciri.ens-lyon.fr/). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

  • Some materials used in these lessons were derived or adapted from work made available by the Harvard Chan Bioinformatics Core (HBC) (https://github.com/hbctraining) under the Creative Commons Attribution license (CC BY 4.0).
