Learning objectives

  • Use R packages for retrieval of genomic annotations
  • Visualizing differentially expressed genes (MAplot, Volcanoplot, Heatmap, Venn Diagramm)

Link to the Rmd file from the previous session

1 Visualizing the results of the MOV10 dataset

When we are working with large datasets, it’s beneficial to visually represent the information to gain deeper insight. In this lesson, we will introduce you to some basic and more advanced plots commonly used when for exploring differential gene expression data, however, many of these plots can be helpful in visualizing other types of data as well.

We will be working with three different data objects created in earlier lessons:

  • Metadata for our samples (a dataframe): meta
meta
# A tibble: 8 × 3
  name       sampletype           replicate
* <chr>      <chr>                <chr>    
1 Irrel_kd_1 control              rep1     
2 Irrel_kd_2 control              rep2     
3 Irrel_kd_3 control              rep3     
4 Mov10_kd_2 MOV10_knockdown      rep2     
5 Mov10_kd_3 MOV10_knockdown      rep3     
6 Mov10_oe_1 MOV10_overexpression rep1     
7 Mov10_oe_2 MOV10_overexpression rep2     
8 Mov10_oe_3 MOV10_overexpression rep3
  • Normalized expression data for every gene in each of our samples (a matrix): normalized_counts
head(normalized_counts_df)
                  Irrel_kd_1   Irrel_kd_2   Irrel_kd_3   Mov10_kd_2   Mov10_kd_3   Mov10_oe_1   Mov10_oe_2   Mov10_oe_3
ENSG00000000003 3.924533e+03 3.794359e+03 3.960739e+03 3.951612e+03 3.940558e+03 2.727254e+03 2.729935e+03 3.178078e+03
ENSG00000000005 2.421444e+01 3.018832e+01 3.069306e+01 2.366618e+01 1.389025e+01 2.039526e+01 3.331328e+01 3.363046e+01
ENSG00000000419 1.325516e+03 1.340778e+03 1.179681e+03 1.515275e+03 1.431764e+03 1.541882e+03 1.548191e+03 1.942923e+03
ENSG00000000457 4.555902e+02 4.215954e+02 4.764096e+02 5.974110e+02 6.101026e+02 5.270136e+02 5.181091e+02 5.411446e+02
ENSG00000000460 1.250183e+03 1.211697e+03 1.134309e+03 1.389268e+03 1.300341e+03 9.651038e+02 9.985217e+02 1.028786e+03
ENSG00000000938 8.968311e-01 1.040976e+00 0.000000e+00 1.279253e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
  • Tibble versions of the DESeq2 results we generated in the last lesson: res_tableOE_df and res_tableKD_df
head(res_tableOE_df)
# A tibble: 6 × 8
  gene_ID         baseMean log2FoldChange  lfcSE    pvalue      padj is_up is_down
  <chr>              <dbl>          <dbl>  <dbl>     <dbl>     <dbl> <chr> <chr>  
1 ENSG00000155363   95772.           6.99 0.122  0         0         yes   no     
2 ENSG00000189060    8090.           1.51 0.0666 3.72e-115 3.41e-111 yes   no     
3 ENSG00000173110     239.           6.50 0.363  9.44e- 72 5.76e- 68 yes   no     
4 ENSG00000265972    5285.           1.37 0.0788 1.64e- 68 7.48e- 65 yes   no     
5 ENSG00000187837    1747.           1.48 0.0890 1.75e- 63 6.41e- 60 yes   no     
6 ENSG00000270882    2595.           1.47 0.0956 1.49e- 54 4.54e- 51 yes   no   
head(res_tableKD_df)
# A tibble: 6 × 8
  gene_ID         baseMean log2FoldChange  lfcSE   pvalue     padj is_up is_down
  <chr>              <dbl>          <dbl>  <dbl>    <dbl>    <dbl> <chr> <chr>  
1 ENSG00000116962    6434.           1.09 0.0701 7.24e-56 1.27e-51 yes   no     
2 ENSG00000270882    2595.           1.62 0.106  9.61e-54 8.44e-50 yes   no     
3 ENSG00000143183    1807.           1.23 0.0858 8.12e-48 4.75e-44 yes   no     
4 ENSG00000124762    2894.           1.22 0.0867 5.27e-46 2.31e-42 yes   no     
5 ENSG00000116473    2189.          -1.00 0.0719 2.69e-45 9.44e-42 no    yes    
6 ENSG00000129250    2797.          -1.04 0.0822 3.36e-38 9.83e-35 no    yes    
  • An annotation table for the Mov10 dataset containing, Transcript IDs, gene IDs and gene names (Symbol ID). (For your data, we will use BiomaRt to get obtain this information)
head(tx2gene)
          tx_id         ensgene     symbol
ENST00000387314 ENSG00000210049      MT-TF
ENST00000389680 ENSG00000211459    MT-RNR1
ENST00000387342 ENSG00000210077      MT-TV
ENST00000387347 ENSG00000210082    MT-RNR2
ENST00000612848 ENSG00000276345 AC004556.1
ENST00000386347 ENSG00000209082     MT-TL1

First, let’s add a column with gene symbols to the normalized_counts object, so we can use them to label our plots. Ensembl IDs are useful for many purposes, but gene symbols are more recognizable to biologists.

# DESeq2 creates a matrix when you use the counts() function
## First convert normalized_counts to a data frame and transfer the row names to a new column called "gene_ID"
normalized_counts_df <- normalized_counts %>% as_tibble(rownames = "gene_ID")
  
# Next, merge together (ensembl IDs) the normalized counts data frame with a subset of the annotations in the tx2gene data frame (only the columns for ensembl gene IDs and gene symbols)
grch38annot <- tx2gene %>% 
               dplyr::select(ensgene, symbol) %>% 
               dplyr::distinct()

head(grch38annot)
          ensgene     symbol
1 ENSG00000210049      MT-TF
2 ENSG00000211459    MT-RNR1
3 ENSG00000210077      MT-TV
4 ENSG00000210082    MT-RNR2
5 ENSG00000276345 AC004556.1
6 ENSG00000209082     MT-TL1
## This will bring in a column of gene symbols
normalized_counts_and_annot_df <- left_join(normalized_counts_df, grch38annot, by=c("gene_ID"="ensgene"))



#Relocate the symbol column at the beginning
normalized_counts_and_annot_df <- normalized_counts_and_annot_df %>%
                     relocate("symbol", .after =  "gene_ID") %>%
                     as_tibble()

head(normalized_counts_and_annot_df)
# A tibble: 6 × 10
  gene_ID         symbol   Irrel_kd_1 Irrel_kd_2 Irrel_kd_3 Mov10_kd_2 Mov10_kd_3 Mov10_oe_1 Mov10_oe_2 Mov10_oe_3
  <chr>           <chr>         <dbl>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>
1 ENSG00000000003 TSPAN6     3925.       3794.       3961.     3952.       3941.      2727.      2730.      3178. 
2 ENSG00000000005 TNMD         24.2        30.2        30.7      23.7        13.9       20.4       33.3       33.6
3 ENSG00000000419 DPM1       1326.       1341.       1180.     1515.       1432.      1542.      1548.      1943. 
4 ENSG00000000457 SCYL3       456.        422.        476.      597.        610.       527.       518.       541. 
5 ENSG00000000460 C1orf112   1250.       1212.       1134.     1389.       1300.       965.       999.      1029. 
6 ENSG00000000938 FGR           0.897       1.04        0         1.28        0          0          0          0  

Exercise

You can also add gene names to your DEseq2 tables res_tableOE_df, res_tableKD_df and create 2 new tables : - res_tableOE_and_annot_df

# A tibble: 57,761 × 9
   gene_ID         symbol   baseMean log2FoldChange  lfcSE    pvalue      padj is_up is_down
   <chr>           <chr>       <dbl>          <dbl>  <dbl>     <dbl>     <dbl> <chr> <chr>  
 1 ENSG00000155363 MOV10      95772.          6.99  0.122  0         0         yes   no     
 2 ENSG00000189060 H1F0        8090.          1.51  0.0666 3.72e-115 3.41e-111 yes   no     
 3 ENSG00000173110 HSPA6        239.          6.50  0.363  9.44e- 72 5.76e- 68 yes   no     
 4 ENSG00000265972 TXNIP       5285.          1.37  0.0788 1.64e- 68 7.48e- 65 yes   no    
 5 ENSG00000187837 HIST1H1C    1747.          1.48  0.0890 1.75e- 63 6.41e- 60 yes   no     

res_tableOE_and_annot_df <- left_join(res_tableOE_df, grch38annot, by=c("gene_ID"="ensgene"))
res_tableOE_and_annot_df <- relocate(res_tableOE_and_annot_df, "symbol", .after =  "gene_ID")
  • res_tableKD_and_annot_df
# A tibble: 57,761 × 9
   gene_ID         symbol   baseMean log2FoldChange  lfcSE   pvalue     padj is_up is_down
   <chr>           <chr>       <dbl>          <dbl>  <dbl>    <dbl>    <dbl> <chr> <chr>  
 1 ENSG00000116962 NID1        6434.          1.09  0.0701 7.24e-56 1.27e-51 yes   no     
 2 ENSG00000270882 HIST2H4A    2595.          1.62  0.106  9.61e-54 8.44e-50 yes   no     
 3 ENSG00000143183 TMCO1       1807.          1.23  0.0858 8.12e-48 4.75e-44 yes   no     
 4 ENSG00000124762 CDKN1A      2894.          1.22  0.0867 5.27e-46 2.31e-42 yes   no     
 5 ENSG00000116473 RAP1A       2189.         -1.00  0.0719 2.69e-45 9.44e-42 no    yes  

#in one command (!)
res_tableKD_and_annot_df <- res_tableKD_df %>%
  left_join(grch38annot, by=c("gene_ID"="ensgene")) %>%
  relocate("symbol", .after =  "gene_ID")

1.1 Plotting signicant DE genes

One way to visualize results is to simply plot the expression data for a handful of genes. We can do this by selecting specific genes of interest or by choosing a range of genes.

1.1.1 Using DESeq2 plotCounts() to plot expression of a single gene

To select a specific gene of interest to plot, for example MOV10, we can use the plotCounts() from DESeq2. plotCounts() requires that the gene specified matches the original input to DESeq2, which in our case was Ensembl IDs.

# Find the Ensembl ID of MOV10
grch38annot[grch38annot$symbol == "MOV10", "ensgene"]

# Plot expression for single gene
plotCounts(dds, gene="ENSG00000155363", intgroup="sampletype") 

This DESeq2 function only allows for plotting the counts of a single gene at a time, and is not flexible regarding the appearance.

1.1.2 Using ggplot2 to plot expression of a single gene

If you wish to change the appearance of this plot, we can save the output of plotCounts() to a variable specifying the returnData=TRUE argument, then use ggplot():

# Save plotcounts to a data frame object
count_gene_x_df <- plotCounts(dds, gene="ENSG00000155363", intgroup="sampletype", returnData=TRUE)

# What is the data output of plotCounts()?
count_gene_x_df

# Plot the MOV10 normalized counts, using the samplenames (rownames(d) as labels)
ggplot(count_gene_x_df, aes(x = sampletype, y = count, color = sampletype)) + 
    geom_point(position=position_jitter(w = 0.1,h = 0)) +
    geom_text_repel(aes(label = rownames(count_gene_x_df)), show.legend = FALSE) + 
    theme_bw() +
    ggtitle("MOV10") +
    theme(plot.title = element_text(hjust = 0.5))

Note that in the plot below (code above), we are using geom_text_repel() from the ggrepel package to label our individual points on the plot.

1.1.3 Using ggplot2 to plot expression of a list of genes

Often it is helpful to check the expression of multiple genes of interest at the same time. This often first requires some data wrangling.

We are going to plot the normalized count values for the top 20 differentially expressed genes (by padj values).

To do this, we first need to get the gene IDs from our ordered (by padj values) results table (sigOE_up) and extracting the top 20 genes :

top20_sigOE_genes <- sigOE_up_df %>%
  pull(gene_ID) %>%
  head(20)

top20_sigOE_genes

Then, we can extract the normalized count values for these top 20 genes:

## normalized counts for top 20 significant genes
top20_sigOE_norm_df <- normalized_counts_and_annot_df %>%
        filter(gene_ID %in% top20_sigOE_genes)

top20_sigOE_norm_df

Now we have the normalized counts for each of the top 20 genes for all 8 samples.

To plot the data using ggplot(), we need to convert the counts in a “long format”. Indeed, ggplot need one column with the values we want it to plot.

The pivot_longer() function in the tidyr package will perform this operation and will output the normalized counts for all genes for Mov10_oe_1 listed in the first 20 rows, followed by the normalized counts for Mov10_oe_2 in the next 20 rows, and so on.

# Gathering the columns to have normalized counts to a single column
top20_sigOE_dfl <- top20_sigOE_norm_df %>%
  pivot_longer(cols = c("Irrel_kd_1","Irrel_kd_2","Irrel_kd_3","Mov10_kd_2","Mov10_kd_3","Mov10_oe_1","Mov10_oe_2","Mov10_oe_3"),
               names_to = "samplename",
               values_to = "normalized_counts")

# a shorter way to write it
top20_sigOE_dfl <- top20_sigOE_norm_df %>%
  pivot_longer(cols = -c(gene_ID,symbol),
               names_to = "samplename",
               values_to = "normalized_counts")

## check the column header in the "long format" data frame
top20_sigOE_dfl
# A tibble: 160 × 4
   gene_ID         symbol samplename normalized_counts
   <chr>           <chr>  <chr>                  <dbl>
 1 ENSG00000096654 ZNF184 Irrel_kd_1              502.
 2 ENSG00000096654 ZNF184 Irrel_kd_2              500.
 3 ENSG00000096654 ZNF184 Irrel_kd_3              475.
 4 ENSG00000096654 ZNF184 Mov10_kd_2              794.
 5 ENSG00000096654 ZNF184 Mov10_kd_3              793.
 6 ENSG00000096654 ZNF184 Mov10_oe_1             1046.
 7 ENSG00000096654 ZNF184 Mov10_oe_2             1011.
 8 ENSG00000096654 ZNF184 Mov10_oe_3             1102.
 9 ENSG00000112972 HMGCS1 Irrel_kd_1             9310.
10 ENSG00000112972 HMGCS1 Irrel_kd_2             9373.
# ℹ 150 more rows

Now, if we want our counts colored by sample group, then we need to combine the metadata information with the normalized counts data into the same data frame for input to ggplot():

top20_sigOE_dfl <- left_join(top20_sigOE_dfl, meta, by = c("samplename" = "name"))
head(top20_sigOE_dfl)
# A tibble: 6 × 6
  gene_ID         symbol samplename normalized_counts sampletype           replicate
  <chr>           <chr>  <chr>                  <dbl> <chr>                <chr>    
1 ENSG00000096654 ZNF184 Irrel_kd_1              502. control              rep1     
2 ENSG00000096654 ZNF184 Irrel_kd_2              500. control              rep2     
3 ENSG00000096654 ZNF184 Irrel_kd_3              475. control              rep3     
4 ENSG00000096654 ZNF184 Mov10_kd_2              794. MOV10_knockdown      rep2     
5 ENSG00000096654 ZNF184 Mov10_kd_3              793. MOV10_knockdown      rep3     
6 ENSG00000096654 ZNF184 Mov10_oe_1             1046. MOV10_overexpression rep1     

The left_join() will join the 2 data frames with respect to the samplename column of the top20_sigOE_dfl table and the sample column of the meta table.

Now that we have a data frame in a format that can be utilized by ggplot easily, let’s plot!

## plot using ggplot2
ggplot(top20_sigOE_dfl) +
        geom_point(aes(x = sampletype, y = normalized_counts, color = sampletype)) +
        scale_y_log10() +
        xlab("Genes") +
        ylab("log10 Normalized Counts") +
        ggtitle("Top 20 Significant DE Genes") +
        theme_bw() +
  facet_wrap(~symbol, scales = "free_y") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    theme(plot.title = element_text(hjust = 0.5))

You can change chunk options to increase plot size. For example, you can use {r, fig.height=6, fig.width=10}

You can also change sample name to have a shorter X labels. For that we will add a column with the new names

top20_sigOE_dfl <- top20_sigOE_dfl %>%
  mutate(sampletype_short =  case_when(
    sampletype == "MOV10_knockdown" ~ 'kd',
    sampletype == "MOV10_overexpression" ~ 'oe',
    TRUE ~ 'ctl'))
## plot using ggplot2
ggplot(top20_sigOE_dfl) +
        geom_point(aes(x = sampletype_short, y = normalized_counts, color = sampletype_short)) +
        scale_y_log10() +
        xlab("Genes") +
        ylab("log10 Normalized Counts") +
        ggtitle("Top 20 Significant DE Genes") +
        theme_bw() +
  facet_wrap(~symbol, scales = "free") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    theme(plot.title = element_text(hjust = 0.5))


Excercise

  1. Make the same plot for the top 20 downregulated genes

MOV10 Differential Expression Analysis: Control versus Knockdown

  1. Now that we have results for the overexpression results, do the same for the Control vs. Knockdown samples.

1.2 Heatmap

In addition to plotting subsets, we could also extract the normalized values of all the significant genes and plot a heatmap of their expression using pheatmap().

### Extract normalized expression for significant genes from the OE and control samples (3:5 and 8:10) (and gene ID)

norm_OEsig_df <- normalized_counts_and_annot_df[,c("gene_ID",
                                                "symbol",
                                                "Irrel_kd_1",
                                                "Irrel_kd_2",
                                                "Irrel_kd_3",
                                                "Mov10_oe_1",
                                                "Mov10_oe_2",
                                                "Mov10_oe_3")
                                             ] %>% 
              filter(gene_ID %in% sigOE_df$gene_ID)

norm_OEsig_df
# A tibble: 4,774 × 7
   gene_ID         Irrel_kd_1 Irrel_kd_2 Irrel_kd_3 Mov10_oe_1 Mov10_oe_2 Mov10_oe_3
   <chr>                <dbl>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>
 1 ENSG00000000003      3925.      3794.      3961.      2727.      2730.      3178.
 2 ENSG00000000419      1326.      1341.      1180.      1542.      1548.      1943.
 3 ENSG00000000460      1250.      1212.      1134.       965.       999.      1029.
 4 ENSG00000001084      2691.      2538.      2625.      1898.      2117.      2379.
 5 ENSG00000001167      2752.      2514.      2601.      2149.      2134.      2096.

Now let’s draw the heatmap using pheatmap:

### Set a color palette
heat_colors <- brewer.pal(6, "YlOrRd")

norm_OEsig_mat <- norm_OEsig_df %>% column_to_rownames("gene_ID") %>%
  dplyr::select(Irrel_kd_1,Irrel_kd_2,Irrel_kd_3,Mov10_oe_1,Mov10_oe_2,Mov10_oe_3) %>%
  as.matrix()

annotation_col_df <- meta%>% as.data.frame()
rownames(annotation_col_df) <- annotation_col_df$name
annotation_col_df <- annotation_col_df[,c("replicate","sampletype")]

### Run pheatmap using the metadata data frame for the annotation
pheatmap(norm_OEsig_mat, 
    color = heat_colors, 
    cluster_rows = T, 
    show_rownames = F,
    annotation = annotation_col_df, 
    border_color = NA, 
    fontsize = 10, 
    scale = "row", 
    fontsize_row = 10, 
    height = 20)

NOTE: There are several additional arguments we have included in the function for aesthetics. One important one is scale="row", in which Z-scores are plotted, rather than the actual normalized count value. Z-scores are computed on a gene-by-gene basis by subtracting the mean and then dividing by the standard deviation. The Z-scores are computed after the clustering, so that it only affects the graphical aesthetics and the color visualization is improved.


Excercise

MOV10 Differential Expression Analysis: Control versus Knockdown

Now that we have results for the overexpression results, do the same for the Control vs. Knockdown samples.


1.3 Volcano plot

The above plot would be great to look at the expression levels of a good number of genes, but for more of a global view, there are other plots we can draw. A commonly used one is a volcano plot; in which you have the log transformed adjusted p-values plotted on the y-axis and log2 fold change values on the x-axis.

To generate a volcano plot, we first need to have a column in our results data indicating whether or not the gene is considered differentially expressed based on p-adjusted values and we will include a log2fold change here.

## Obtain logical vector where TRUE values denote padj values < 0.05 and fold change > 1.5 (log2(1.5) = 0.58) in either direction

res_tableOE_and_annot_df <- res_tableOE_and_annot_df %>% 
                            mutate(threshold_OE = (!is.na(padj)) & padj < 0.05 & abs(log2FoldChange) >= 0.58)
# A tibble: 57,761 × 10
   gene_ID         symbol   baseMean log2FoldChange  lfcSE    pvalue      padj is_up is_down threshold_OE
   <chr>           <chr>       <dbl>          <dbl>  <dbl>     <dbl>     <dbl> <chr> <chr>   <lgl>       
 1 ENSG00000155363 MOV10      95772.          6.99  0.122  0         0         yes   no      TRUE        
 2 ENSG00000189060 H1F0        8090.          1.51  0.0666 3.72e-115 3.41e-111 yes   no      TRUE        
 3 ENSG00000173110 HSPA6        239.          6.50  0.363  9.44e- 72 5.76e- 68 yes   no      TRUE        
 4 ENSG00000265972 TXNIP       5285.          1.37  0.0788 1.64e- 68 7.48e- 65 yes   no      TRUE        
 5 ENSG00000187837 HIST1H1C    1747.          1.48  0.0890 1.75e- 63 6.41e- 60 yes   no      TRUE        

Now we can start plotting. The geom_point object is most applicable, as this is essentially a scatter plot:

## Volcano plot
ggplot(res_tableOE_and_annot_df) +
    geom_point(aes(x = log2FoldChange, y = -log10(padj), colour = threshold_OE)) +
    ggtitle("Mov10 overexpression") +
    xlab("log2 fold change") + 
    ylab("-log10 adjusted p-value") +
    theme_bw() +
    theme(legend.position = "none",
          plot.title = element_text(size = rel(1.5), hjust = 0.5),
          axis.title = element_text(size = rel(1.25)))  

This is a great way to get an overall picture of what is going on, but what if we also wanted to know where the top 10 genes (lowest padj) in our DE list are located on this plot? We could label those dots with the gene name on the Volcano plot using geom_text_repel().

For that, we plot it as before with an additional layer for geom_text_repel() wherein we can specify a subset of the results table and the column of gene labels.

ggplot(res_tableOE_and_annot_df, aes(x = log2FoldChange, y = -log10(padj))) +
    geom_point(aes(colour = threshold_OE)) +
    geom_text_repel(data = res_tableOE_and_annot_df %>% head(10) , aes(label = symbol)) +
    ggtitle("Mov10 overexpression") +
    xlab("log2 fold change") + 
    ylab("-log10 adjusted p-value") +
    theme_bw() +
    theme(legend.position = "none",
          plot.title = element_text(size = rel(1.5), hjust = 0.5),
          axis.title = element_text(size = rel(1.25))) 


1.4 R packages for advanced visualization of DGE results

1.4.1 EnhancedVolcano

The Bioconductor package EnhancedVolcano can use the DESeq2 results output to make the volcano plots generated above by writing much fewer lines of code.

#BiocManager::install("EnhancedVolcano")
library(EnhancedVolcano)
  EnhancedVolcano(res_tableOE_and_annot_df,
    lab = res_tableOE_and_annot_df$symbol,
    x = 'log2FoldChange',
    y = 'pvalue')

Using its documentation, you can “easily” customize your plot. Don’t hesitate to take a look.

top20_sigOE_symbol <- res_tableOE_and_annot_df %>% pull(symbol) %>% head(20)

EnhancedVolcano(res_tableOE_and_annot_df,
    lab = res_tableOE_and_annot_df$symbol,
    x = 'log2FoldChange',
    y = 'pvalue',
    pCutoff = 0.05,
    FCcutoff = 0.58,
    drawConnectors = TRUE,
    widthConnectors = 0.75,
    selectLab = top20_sigOE_symbol)


Excercise

MOV10 Differential Expression Analysis: Control versus Knockdown

Now that we have results for the overexpression results, do the same for the Control vs. Knockdown samples.


1.4.2 ggpubR and its ggmaplot function

Another package, ggpubr, offers a useful function (ggmaplot) to easily draw MA plot (mean expression across all samples in function of the logfoldchange).

#install.packages("ggpubr")
library(ggpubr)

Documentation of this function can be found here

# Default plot
ggmaplot(res_tableOE_and_annot_df, main = "MOV10 overexpression samples versus control samples",
   fdr = 0.05, fc = 1.5, size = 0.4,
   palette = c("#B31B21", "#1465AC", "darkgray"),
   genenames = as.vector(res_tableOE_and_annot_df$symbol),
   legend = "top", top = 20,
   font.label = c("bold", 11),
   font.legend = "bold",
   font.main = "bold",
   ggtheme = ggplot2::theme_bw())

The caveat of these functions is you have to check the default values of the options to understand what is effectively plotted.

summary(res_tableOE_shrunken, alpha = 0.05)
adjusted p-value < 0.05
LFC > 0 (up)       : 2004, 5.2%
LFC < 0 (down)     : 2770, 7.1%

For example, if you compare the displayed number of up/down regulated genes to the results of the function “summary”, numbers are different. The difference comes from the value of the fc option.

# Default plot
ggmaplot(res_tableOE_and_annot_df, main = "MOV10 overexpression samples versus control samples",
   fdr = 0.05, fc = 0, size = 0.4,
   palette = c("#B31B21", "#1465AC", "darkgray"),
   genenames = as.vector(res_tableOE_and_annot_df$symbol),
   legend = "top", top = 20,
   font.label = c("bold", 11),
   font.legend = "bold",
   font.main = "bold",
   ggtheme = ggplot2::theme_bw())

Excercise

MOV10 Differential Expression Analysis: Control versus Knockdown

Now that we have results for the overexpression results, do the same for the Control vs. Knockdown samples.


2 Visualizing the results of the your dataset

Now it is time to do the same things with your dataset.

2.1 Find the mapping between gene IDs and gene names and descriptions

We will use the BiomaRt package again to retrieve from Ensembl the mapping between gene IDs and gene names (and descriptions) in order to plot gene names instead of gene IDs.

You can find below command lines to get this table for human or mouse.

2.1.1 Human gene IDs to gene names

#BiocManager::install("biomaRt")
library(biomaRt)
human_mart <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")

human_geneIDs2geneNames <- getBM(
  attributes=c("ensembl_gene_id","external_gene_name","gene_biotype","description"),
  mart = human_mart)

head(human_geneIDs2geneNames)
dim(human_geneIDs2geneNames) # 70711     4
  ensembl_gene_id external_gene_name   gene_biotype   description
1 ENSG00000210049              MT-TF        Mt_tRNA           ...
2 ENSG00000211459            MT-RNR1        Mt_rRNA           ...
3 ENSG00000210077              MT-TV        Mt_tRNA           ...
4 ENSG00000210082            MT-RNR2        Mt_rRNA           ...
5 ENSG00000209082             MT-TL1        Mt_tRNA           ...
6 ENSG00000198888             MT-ND1 protein_coding           ...

2.1.2 Mouse gene IDs to gene names

#BiocManager::install("biomaRt")
library(biomaRt)
mouse_mart <- useMart("ensembl", dataset = "mmusculus_gene_ensembl")

mouse_geneIDs2geneNames <- getBM(
  attributes=c("ensembl_gene_id","external_gene_name","gene_biotype","description"),
  mart = mouse_mart)

head(mouse_geneIDs2geneNames)
dim(mouse_geneIDs2geneNames)

Another useful R package is AnnotationHub allowing you to retrieve data from several genomic database.

2.2 Pre-Filtering your count table

2.2.1 for low count genes

While it is not necessary to pre-filter low count genes before running the DESeq2 functions, there are two reasons which make pre-filtering useful: by removing rows in which there are very few reads, we reduce the memory size of the dds data object, and we increase the speed of the transformation and testing functions within DESeq2. It can also improve visualizations, as features with no information for differential expression are not plotted.

Here we perform minimal pre-filtering to keep only rows that have at least 10 reads total. Note that stricter filtering to increase power is automatically applied via independent filtering on the mean of normalized counts within the results function.

keep <- rowSums(counts(dds)) >= 10
dds_bis <- dds[keep,]
nrow(dds) #57761
nrow(dds_bis) #27817
dds_bis <- DESeq(dds_bis)

Alternatively, a popular filter is to ensure at least X samples with a count of 10 or more, where X can be chosen as the sample size of the smallest group of samples:

X = 3
keep <- rowSums(counts(dds) >= 10) >= X
dds_ter <- dds[keep,]
nrow(dds) #57761
nrow(dds_ter) #20024
dds_ter <- DESeq(dds_ter)

!Warning! : you have to do this pre-filtering before running the DEseq function.

2.2.2 for protein coding genes

human_geneIDs2geneNames_only_proteins <- human_geneIDs2geneNames %>%
  filter(gene_biotype == "protein_coding")

dim(human_geneIDs2geneNames_only_proteins)
dim(human_geneIDs2geneNames)
keep_proteins <- rownames(counts(dds)) %in% human_geneIDs2geneNames_only_proteins$ensembl_gene_id

dds_prot <- dds[keep_proteins,]
nrow(dds)
nrow(dds_prot)
dds_prot <- DESeq(dds_prot)

!Warning! : you have to do this pre-filtering before running the DEseq function.

Using the following code you can check the number/proportion of reads mapping in protein coding genes or other genes (pseudogenes, ncrna, …)

gene_biotype_simplified_list <- c("protein_coding","lncRNA","processed_pseudogene","unprocessed_pseudogene","misc_RNA","rRNA","snRNA")

bilan_gene_biotype_df <-   counts(dds, normalize = F) %>%
  as_tibble(rownames = "ensembl_gene_id") %>% 
  left_join(human_geneIDs2geneNames) %>%
  pivot_longer(-c(ensembl_gene_id,external_gene_name,description, gene_biotype), names_to = "sample", values_to = "count") %>%
  mutate(gene_biotype_simplified = ifelse(gene_biotype %in% gene_biotype_simplified_list, gene_biotype, "other")) %>%
  group_by(sample, gene_biotype_simplified) %>%
  summarise(total_by_sample = sum(count)) %>%
  group_by(sample) %>%
  mutate(percentage = total_by_sample / sum(total_by_sample) * 100) %>%
  arrange(desc(percentage))

head(bilan_gene_biotype_df)
  sample     gene_biotype_simplified total_by_sample percentage
  <chr>      <chr>                             <int>      <dbl>
1 Irrel_kd_1 protein_coding                 30093095       96.6
2 Irrel_kd_2 protein_coding                 25582178       96.5
3 Irrel_kd_3 protein_coding                 19783387       96.5
4 Mov10_kd_2 protein_coding                 43683866       96.4
5 Mov10_kd_3 protein_coding                 25778701       96.4
6 Mov10_oe_3 protein_coding                 16596199       96.0
ggplot(data = bilan_gene_biotype_df, aes(x = sample, y = total_by_sample, fill = gene_biotype_simplified)) +
  geom_bar(stat = "identity") + theme_bw() +
  ggtitle("Raw read mapping distribution") +
  scale_fill_brewer(palette="Dark2")

ggplot(data = bilan_gene_biotype_df, aes(x = sample, y = percentage, fill = gene_biotype_simplified)) +
  geom_bar(stat = "identity") + theme_bw() +
  ggtitle("Raw read mapping distribution") +
  scale_fill_brewer(palette="Dark2")

See you in the next session


These materials have been developed by members of BIBS team of the CIRI (https://ciri.ens-lyon.fr/). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

  • Some materials used in these lessons were derived or adapted from work made available by the Harvard Chan Bioinformatics Core (HBC) (https://github.com/hbctraining) under the Creative Commons Attribution license (CC BY 4.0).

cc_by_sa

sources