Data form the previous session:

1 Analysing DE gene lists

Previously, we used DEseq2 to get DE gene lists for 2 comparisons in the training dataset:

  • MOV10 overexpressed (OE) samples versus control samples
  • MOV10 knockdown (KD) samples versus control samples.

Now, we now want to perform additional analyses to characterize these lists. You can create a new Rmd file to conduct these additional analyses, which will utilize the output of your previous Rmd file.

2 How to compare two lists of genes

You may wish to identify common genes from 2 lists of genes. For instance, in the training dataset, we seek to identify common genes between both comparisons.

To achieve this, you can use the following command lines.

  • First import the data
#import results comparison MOV10 over expression samples and the control samples
res_tableOE_df <- read_tsv("results/DE_result_table_oe_vs_control.tsv")
head(res_tableOE_df)
# A tibble: 57,761 × 8
   gene_ID         baseMean log2FoldChange  lfcSE    pvalue      padj is_up is_down
   <chr>              <dbl>          <dbl>  <dbl>     <dbl>     <dbl> <chr> <chr>  
 1 ENSG00000155363   95772.          6.99  0.122  0         0         yes   no     
 2 ENSG00000189060    8090.          1.51  0.0666 3.72e-115 3.41e-111 yes   no     
 3 ENSG00000173110     239.          6.50  0.363  9.44e- 72 5.76e- 68 yes   no     
 4 ENSG00000265972    5285.          1.37  0.0788 1.64e- 68 7.48e- 65 yes   no     
 5 ENSG00000187837    1747.          1.48  0.0890 1.75e- 63 6.41e- 60 yes   no     
#import results comparison MOV10 Knockdown samples and the control samples
res_tableKD_df <- read_tsv("results/DE_result_table_kd_vs_control.tsv")
head(res_tableKD_df)
# A tibble: 57,761 × 8
   gene_ID         baseMean log2FoldChange  lfcSE   pvalue     padj is_up is_down
   <chr>              <dbl>          <dbl>  <dbl>    <dbl>    <dbl> <chr> <chr>  
 1 ENSG00000116962    6434.          1.09  0.0701 7.24e-56 1.27e-51 yes   no     
 2 ENSG00000270882    2595.          1.62  0.106  9.61e-54 8.44e-50 yes   no     
 3 ENSG00000143183    1807.          1.23  0.0858 8.12e-48 4.75e-44 yes   no     
 4 ENSG00000124762    2894.          1.22  0.0867 5.27e-46 2.31e-42 yes   no     
 5 ENSG00000116473    2189.         -1.00  0.0719 2.69e-45 9.44e-42 no    yes    
 6 ENSG00000129250    2797.         -1.04  0.0822 3.36e-38 9.83e-35 no    yes      
  • Build subset tables for DE, UP, Down genes of each comparisons
DE_genes_OE_df <- res_tableOE_df %>% dplyr::filter(is_up == "yes" | is_down == "yes") # "|" means OR
UP_genes_OE_df <- res_tableOE_df %>% dplyr::filter(is_up == "yes")
DOWN_genes_OE_df <- res_tableOE_df %>% dplyr::filter(is_down == "yes")
DE_genes_KD_df <- res_tableKD_df %>% dplyr::filter(is_up == "yes" | is_down == "yes") # "|" means OR
UP_genes_KD_df <- res_tableKD_df %>% dplyr::filter(is_up == "yes")
DOWN_genes_KD_df <- res_tableKD_df %>% dplyr::filter(is_down == "yes")
  • Make some intersects of

    • for list of genes:

in_OE_and_in_KD <- intersect(DE_genes_OE_df$gene_ID,
                             DE_genes_KD_df$gene_ID)
length(in_OE_and_in_KD) #1095

in_OE_and_not_KD <- setdiff(DE_genes_OE_df$gene_ID,
                            DE_genes_KD_df$gene_ID)
length(in_OE_and_not_KD) #3679

not_OE_and_in_KD <- setdiff(DE_genes_KD_df$gene_ID,
                            DE_genes_OE_df$gene_ID)
length(not_OE_and_in_KD) #1715
  • for table:

#keep columns from x and y : inner_join
in_OE_and_in_KD_df <- inner_join(DE_genes_OE_df,
                                 DE_genes_KD_df,
                                 by = "gene_ID",
                                 suffix = c(".OE",".KD"))
dim(in_OE_and_in_KD_df) #1095    15


#keep only columns from x : semi_join
in_OE_and_in_KD_df2 <- semi_join(DE_genes_OE_df,
                                  DE_genes_KD_df,
                                  by = "gene_ID")

dim(in_OE_and_in_KD_df2) #1095   8


in_OE_and_not_KD_df <- anti_join(DE_genes_OE_df,
                                  DE_genes_KD_df,
                                  by = "gene_ID")
dim(in_OE_and_not_KD_df) #3679    8


not_OE_and_in_KD_df <- anti_join(DE_genes_KD_df,
                                  DE_genes_OE_df,
                                  by = "gene_ID")
dim(not_OE_and_in_KD_df)  #1715    8
  • One way to visualize list intersections is to use Venn diagramm.

    There are different R packages available for creating Venn Diagramm. We will use “ggvenn”, which can be found here: https://github.com/yanlinlin82/ggvenn

    Let’s install it

#specify the CRAN url if necessary
options(repos = c(CRAN = "https://cran.rstudio.com"))

#install a package
install.packages("ggvenn") 

Exercise

  • From the documentation, try to reproduce this figure:

library(ggvenn)

# List of items
DE_genes_OE_KD <- list(OE = DE_genes_OE_df$gene_ID,
                       KD = DE_genes_KD_df$gene_ID)

# 2D Venn diagram
ggvenn(
  DE_genes_OE_KD, 
  fill_color = c("#0073C2FF", "#EFC000FF", "#868686FF", "#CD534CFF"),
  stroke_size = 2, set_name_size = 12, text_size = 8,
  show_percentage = FALSE
  )

3 Functional analysis

source

The output of RNA-seq differential expression analysis is a list of significant differentially expressed genes (DEGs). To gain greater biological insight onto the differentially expressed genes, various analyses can be performed:

  • determine whether there is enrichment of known biological functions, interactions, or pathways
  • identify genes’ involvement in novel pathways or networks by grouping genes together based on similar trends
  • use global changes in gene expression by visualizing all genes being significantly up- or down-regulated in the context of external interaction data

Generally for any differential expression analysis, it is useful to interpret the resulting gene lists using freely available web- and R-based tools. While tools for functional analysis span a wide variety of techniques, they can loosely be categorized into three main types: over-representation analysis, functional class scoring, and pathway topology [1].

The goal of functional analysis is provide biological insight, so it’s necessary to analyze our results in the context of our experimental hypothesis: FMRP and MOV10 associate and regulate the translation of a subset of RNAs. Therefore, based on the authors’ hypothesis, we may expect the enrichment of processes/pathways related to translation, splicing, and the regulation of mRNAs, which we would need to validate experimentally.

Note that all tools described below are great t for validating experimental results and generating hypotheses. These tools suggest genes/pathways that may be involved in your condition of interest; however, you should NOT use these tools to draw conclusions about the pathways involved in your experimental process. Experimental validation of any suggested pathways is necessary.

3.1 Over-representation analysis

There are a plethora of functional enrichment tools that perform some type of “over-representation” analysis by querying databases containing information about gene function and interactions.

These databases typically categorize genes into groups (gene sets) based on shared function, involvement in a pathway, presence in a specific cellular location, or other categorizations, e.g. functional pathways, etc. Essentially, known genes are binned into categories that have been consistently named (controlled vocabulary) based on how the gene has been annotated functionally. These categories are independent of any organism, however each organism has distinct categorizations available.

To determine whether any categories are over-represented, you can calculate the probability of observing the proportion of genes associated with a specific category in your gene list, based on the proportion of genes associated with the same category in the background set (gene categorizations for the appropriate organism).

The statistical test used to determine whether something is actually over-represented is the Hypergeometric test.

3.1.1 Hypergeometric testing

Using the example of the first functional category above, hypergeometric distribution is a probability distribution that describes the probability of 25 genes (k) being associated with “Functional category 1”, out of all genes in our gene list (n=1000), from a population of all of the genes in entire genome (N=13,000) which contains 35 genes (K) associated with “Functional category 1” [2].

The calculation of probability of k successes follows the formula:

This test will result in an adjusted p-value (after multiple test correction) for each category tested.

3.1.2 Gene Ontology project

One of the most widely-used categorizations is the Gene Ontology (GO) established by the Gene Ontology project.

“The Gene Ontology project is a collaborative effort to address the need for consistent descriptions of gene products across databases” [3]. The Gene Ontology Consortium maintains the GO terms, and these GO terms are incorporated into gene annotations in many of the popular repositories for animal, plant, and microbial genomes.

Tools that investigate enrichment of biological functions or interactions often use the Gene Ontology (GO) categorizations, i.e. the GO terms to determine whether any have significantly modified representation in a given list of genes. Therefore, to best use and interpret the results from these functional analysis tools, it is helpful to have a good understanding of the GO terms themselves and their organization.

3.1.2.1 GO Ontologies

To describe the roles of genes and gene products, GO terms are organized into three independent controlled vocabularies (ontologies) in a species-independent manner:

  • Biological process: refers to the biological role involving the gene or gene product, and could include “transcription”, “signal transduction”, and “apoptosis”. A biological process generally involves a chemical or physical change of the starting material or input.
  • Molecular function: represents the biochemical activity of the gene product, such as “ligand”, “GTPase”, and “transporter”.
  • Cellular component: refers to the location in the cell of the gene product. Cellular components could include “nucleus”, “lysosome”, and “plasma membrane”.

Each GO term has a term name (e.g. DNA repair) and a unique term accession number (GO:0005125), and a single gene product can be associated with many GO terms, since a single gene product “may function in several processes, contain domains that carry out diverse molecular functions, and participate in multiple alternative interactions with other proteins, organelles or locations in the cell” [4].

3.1.2.2 GO term hierarchy

Some gene products are well-researched, with vast quantities of data available regarding their biological processes and functions. However, other gene products have very little data available about their roles in the cell.

For example, the protein, “p53”, would contain a wealth of information on its roles in the cell, whereas another protein might only be known as a “membrane-bound protein” with no other information available.

The GO ontologies were developed to describe and query biological knowledge with differing levels of information available. To do this, GO ontologies are loosely hierarchical, ranging from general, ‘parent’, terms to more specific, ‘child’ terms. The GO ontologies are “loosely” hierarchical since ‘child’ terms can have multiple ‘parent’ terms.

Some genes with less information may only be associated with general ‘parent’ terms or no terms at all, while other genes with a lot of information may be associated with many terms.

Nature Reviews Cancer 7, 23-34 (January 2007)

Tips for working with GO terms

3.1.3 clusterProfiler

We will be using clusterProfiler to perform over-representation analysis on GO terms associated with our list of significant genes. The tool takes as input a significant gene list and a background gene list and performs statistical enrichment analysis using hypergeometric testing. The basic arguments allow the user to select the appropriate organism and GO ontology (BP, CC, MF) to test.

3.1.3.1 Running clusterProfiler

To run clusterProfiler GO over-representation analysis, we will use Ensembl IDs instead of gene name, since the tool works a bit easier with the Ensembl IDs.

Install required libraries:

# Install the libraries if necessary , (not update old package for the training)
BiocManager::install("clusterProfiler")
BiocManager::install("org.Hs.eg.db")
BiocManager::install("org.Mm.eg.db")

Then load the following libraries:

# Load libraries
library(clusterProfiler)
library(org.Hs.eg.db)

To perform the over-representation analysis, we need a list of background genes and a list of significant genes.

  • For our background dataset we will use all genes tested for differential expression (all genes in our results table).
  • For our significant gene list we will use genes with p-adjusted values less than 0.05 (we could include a fold change threshold too if we have many DE genes).
## Create background dataset for hypergeometric testing using all genes tested for significance in the results                 
allOE_genes <- as.character(res_tableOE_df$gene_ID)
length(allOE_genes) #57761

## Extract significant results
sigOE_df <- res_tableOE_df %>% dplyr::filter(padj < 0.05)

sigOE_genes <- as.character(sigOE_df$gene_ID)
length(sigOE_genes) #4774

Now we can perform the GO enrichment analysis and save the results:

## Run GO enrichment analysis 
sigOE_ego <- enrichGO(gene = sigOE_genes, 
                universe = allOE_genes,
                keyType = "ENSEMBL",
                OrgDb = org.Hs.eg.db, 
                ont = "BP", 
                pAdjustMethod = "BH", 
                qvalueCutoff = 0.05, 
                readable = TRUE)

NOTE: The different organisms with annotation databases available to use with for the OrgDb argument can be found here.

Also, the keyType argument may be coded as keytype in different versions of clusterProfiler.

Finally, the ont argument can accept either “BP” (Biological Process), “MF” (Molecular Function), and “CC” (Cellular Component) subontologies, or “ALL” for all three.

## Output results from GO analysis to a table
sigOE_cluster_summary <- as_tibble(sigOE_ego)

head(sigOE_cluster_summary)

write_tsv(sigOE_cluster_summary, "results/clusterProfiler_Mov10oe.tsv")

NOTE: Instead of saving just the results summary from the ego object, it might also be beneficial to save the object itself. The save() function enables you to save it as a .rda file, e.g. save(ego, file="results/ego.rda"). The complementary function to save() is the function load(), e.g. ego <- load(file="results/ego.rda").

This is a useful set of functions to know, since it enables one to preserve analyses at specific stages and reload them when needed. More information about these functions can be found here & here.


NOTE: You can also perform GO enrichment analysis with only the up or down regulated genes in addition to performing it for the full list of significant genes. This can be useful to identify GO terms impacted in one direction and not the other. If very few genes are in any of these lists (< 50, roughly) it may not be possible to get any significant GO terms.

 ## Extract upregulated genes
 sigOE_up_df <- dplyr::filter(res_tableOE_df, padj < 0.05 & log2FoldChange > 0)
 
 sigOE_up_genes <- as.character(sigOE_up_df$gene_ID)
 
 ## Extract downregulated genes
 sigOE_down_df <- dplyr::filter(res_tableOE_df, padj < 0.05 & log2FoldChange < 0)
  
 sigOE_down_genes <- as.character(sigOE_down_df$gene_ID)

You can then run the enrichGO() function for the up and then the down genes lists by using gene = sigOE_up_genes or gene = sigOE_down_genes and save the results in ego_up & ego_down objects.


3.1.3.2 Visualizing clusterProfiler results

clusterProfiler has a variety of options for viewing the over-represented GO terms. We will explore the dotplot, enrichment plot, and the category netplot.

The dotplot shows the number of genes associated with the first 50 terms (size) and the p-adjusted values for these terms (color). This plot displays the top 50 GO terms by gene ratio (# genes related to GO term / total number of sig genes), not p-adjusted value.

## Dotplot 
dotplot(sigOE_ego, showCategory=50, label_format=70)

To save the figure, click on the Export button in the RStudio Plots tab and select Save as PDF.... or use the ggsave function as the output plot is a ggplot plot.

## Dotplot  
p_dotplot_oe <- dotplot(sigOE_ego, showCategory=50, label_format=70)
ggsave(p_dotplot_oe, filename = "results/dotplot_ego_oe.pdf", width = 8, height = 14)
ggsave(p_dotplot_oe, filename = "results/dotplot_ego_oe.png", width = 8, height = 14)

For this training, we will not perform enrichment analysis on other databases as KEGG but you can find more information from:

4 Summary of differential expression analysis workflow

Bilan

We have detailed the various steps in a differential expression analysis workflow, providing theory with example code. To provide a more succinct reference for the code needed to run a DGE analysis, we have summarized the steps in an analysis below:

  1. Obtaining gene-level counts from Salmon using tximport

    # Run tximport
    txi <- tximport(files, 
            type="salmon", 
            tx2gene=t2g, 
            countsFromAbundance = "lengthScaledTPM")
    
    # "files" is a vector wherein each element is the path to the salmon quant.sf file, and each element is named with the name of the sample.
    # "t2g" is a 2 column data frame which contains transcript IDs mapped to geneIDs (in that order)
  2. Creating the dds object:

    # Check that the row names of the metadata equal the column names of the **raw counts** data
    all(colnames(txi$counts) == rownames(metadata))
    
    # Create DESeq2Dataset object
    dds <- DESeqDataSetFromTximport(txi, 
                    colData = metadata, 
                    design = ~ condition)
  3. Exploratory data analysis (PCA & hierarchical clustering) - identifying outliers and sources of variation in the data:

    # Transform counts for data visualization
    rld <- rlog(dds, 
            blind=TRUE)
    
    # Plot PCA 
    plotPCA(rld, 
        intgroup="condition")
    
    # Extract the rlog matrix from the object and compute pairwise correlation values
    rld_mat <- assay(rld)
    rld_cor <- cor(rld_mat)
    
    # Plot heatmap
    pheatmap(rld_cor, 
         annotation = metadata)
  4. Run DESeq2:

        # **Optional step** - Re-create DESeq2 dataset if the design formula has changed after QC analysis in include other sources of variation using "dds <- DESeqDataSetFromTximport(txi, colData = metadata, design = ~ covaraite + condition)"
    
    # Run DESeq2 differential expression analysis
    dds <- DESeq(dds)
    
        # **Optional step** - Output normalized counts to save as a file to access outside RStudio using "normalized_counts <- counts(dds, normalized=TRUE)"
  5. Check the fit of the dispersion estimates:

    # Plot dispersion estimates
    plotDispEsts(dds)
  6. Create contrasts to perform Wald testing on the shrunken log2 foldchanges between specific conditions:

    # Specify contrast for comparison of interest
    contrast <- c("condition", "level_to_compare", "base_level")
    
    # Output results of Wald test for contrast
    res <- results(dds, 
               contrast = contrast, 
               alpha = 0.05)
    
    # Shrink the log2 fold changes to be more accurate
    res <- lfcShrink(dds, 
             coef = "sampletype_group1_vs_group2", 
             type = "apeglm")    
             # The coef will be dependent on what your contras was. and should be identical to what is stored in resultsNames()
  7. Output significant results:

    # Set thresholds
    padj.cutoff < - 0.05
    
    # Turn the results object into a tibble for use with tidyverse functions
    res_tbl <- res %>%
                      data.frame() %>%
                      rownames_to_column(var="gene") %>% 
                      as_tibble()
    
    # Subset the significant results
    sig_res <- filter(res_tbl, 
              padj < padj.cutoff)
  8. Visualize results: volcano plots, heatmaps, normalized counts plots of top genes, etc.

  9. Perform analysis to extract functional significance of results: GO or KEGG enrichment, GSEA, etc.

  10. Make sure to output the versions of all tools used in the DE analysis:

    sessionInfo()

    For better reproducibility, it can help to create RMarkdown reports, which save all code, results, and visualizations as nicely formatted html reports. To create these reports we have additional materials available.

5 To go further

This training has helped you get your foot in the door.

Here are some other resources to help you learn more:


These materials have been developed by members of BIBS team of the CIRI (https://ciri.ens-lyon.fr/). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

  • Some materials used in these lessons were derived or adapted from work made available by the Harvard Chan Bioinformatics Core (HBC) (https://github.com/hbctraining) under the Creative Commons Attribution license (CC BY 4.0).

cc_by_sa