Learning objectives
Link to the Rmd file from the previous session
When we are working with large datasets, it’s beneficial to visually represent the information to gain deeper insight. In this lesson, we will introduce you to some basic and more advanced plots commonly used when for exploring differential gene expression data, however, many of these plots can be helpful in visualizing other types of data as well.
We will be working with three different data objects created in earlier lessons:
meta
meta
# A tibble: 8 × 3
name sampletype replicate
* <chr> <chr> <chr>
1 Irrel_kd_1 control rep1
2 Irrel_kd_2 control rep2
3 Irrel_kd_3 control rep3
4 Mov10_kd_2 MOV10_knockdown rep2
5 Mov10_kd_3 MOV10_knockdown rep3
6 Mov10_oe_1 MOV10_overexpression rep1
7 Mov10_oe_2 MOV10_overexpression rep2
8 Mov10_oe_3 MOV10_overexpression rep3
normalized_counts
head(normalized_counts_df)
Irrel_kd_1 Irrel_kd_2 Irrel_kd_3 Mov10_kd_2 Mov10_kd_3 Mov10_oe_1 Mov10_oe_2 Mov10_oe_3
ENSG00000000003 3.924533e+03 3.794359e+03 3.960739e+03 3.951612e+03 3.940558e+03 2.727254e+03 2.729935e+03 3.178078e+03
ENSG00000000005 2.421444e+01 3.018832e+01 3.069306e+01 2.366618e+01 1.389025e+01 2.039526e+01 3.331328e+01 3.363046e+01
ENSG00000000419 1.325516e+03 1.340778e+03 1.179681e+03 1.515275e+03 1.431764e+03 1.541882e+03 1.548191e+03 1.942923e+03
ENSG00000000457 4.555902e+02 4.215954e+02 4.764096e+02 5.974110e+02 6.101026e+02 5.270136e+02 5.181091e+02 5.411446e+02
ENSG00000000460 1.250183e+03 1.211697e+03 1.134309e+03 1.389268e+03 1.300341e+03 9.651038e+02 9.985217e+02 1.028786e+03
ENSG00000000938 8.968311e-01 1.040976e+00 0.000000e+00 1.279253e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
res_tableOE_df
and res_tableKD_df
head(res_tableOE_df)
# A tibble: 6 × 8
gene_ID baseMean log2FoldChange lfcSE pvalue padj is_up is_down
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 ENSG00000155363 95772. 6.99 0.122 0 0 yes no
2 ENSG00000189060 8090. 1.51 0.0666 3.72e-115 3.41e-111 yes no
3 ENSG00000173110 239. 6.50 0.363 9.44e- 72 5.76e- 68 yes no
4 ENSG00000265972 5285. 1.37 0.0788 1.64e- 68 7.48e- 65 yes no
5 ENSG00000187837 1747. 1.48 0.0890 1.75e- 63 6.41e- 60 yes no
6 ENSG00000270882 2595. 1.47 0.0956 1.49e- 54 4.54e- 51 yes no
head(res_tableKD_df)
# A tibble: 6 × 8
gene_ID baseMean log2FoldChange lfcSE pvalue padj is_up is_down
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 ENSG00000116962 6434. 1.09 0.0701 7.24e-56 1.27e-51 yes no
2 ENSG00000270882 2595. 1.62 0.106 9.61e-54 8.44e-50 yes no
3 ENSG00000143183 1807. 1.23 0.0858 8.12e-48 4.75e-44 yes no
4 ENSG00000124762 2894. 1.22 0.0867 5.27e-46 2.31e-42 yes no
5 ENSG00000116473 2189. -1.00 0.0719 2.69e-45 9.44e-42 no yes
6 ENSG00000129250 2797. -1.04 0.0822 3.36e-38 9.83e-35 no yes
head(tx2gene)
tx_id ensgene symbol
ENST00000387314 ENSG00000210049 MT-TF
ENST00000389680 ENSG00000211459 MT-RNR1
ENST00000387342 ENSG00000210077 MT-TV
ENST00000387347 ENSG00000210082 MT-RNR2
ENST00000612848 ENSG00000276345 AC004556.1
ENST00000386347 ENSG00000209082 MT-TL1
First, let’s add a column with gene symbols to the normalized_counts
object, so we can use them to label our plots. Ensembl IDs are useful for many purposes, but gene symbols are more recognizable to biologists.
# DESeq2 creates a matrix when you use the counts() function
## First convert normalized_counts to a data frame and transfer the row names to a new column called "gene_ID"
normalized_counts_df <- normalized_counts %>% as_tibble(rownames = "gene_ID")
# Next, merge together (ensembl IDs) the normalized counts data frame with a subset of the annotations in the tx2gene data frame (only the columns for ensembl gene IDs and gene symbols)
grch38annot <- tx2gene %>%
dplyr::select(ensgene, symbol) %>%
dplyr::distinct()
head(grch38annot)
ensgene symbol
1 ENSG00000210049 MT-TF
2 ENSG00000211459 MT-RNR1
3 ENSG00000210077 MT-TV
4 ENSG00000210082 MT-RNR2
5 ENSG00000276345 AC004556.1
6 ENSG00000209082 MT-TL1
## This will bring in a column of gene symbols
normalized_counts_and_annot_df <- left_join(normalized_counts_df, grch38annot, by=c("gene_ID"="ensgene"))
#Relocate the symbol column at the beginning
normalized_counts_and_annot_df <- normalized_counts_and_annot_df %>%
relocate("symbol", .after = "gene_ID") %>%
as_tibble()
head(normalized_counts_and_annot_df)
# A tibble: 6 × 10
gene_ID symbol Irrel_kd_1 Irrel_kd_2 Irrel_kd_3 Mov10_kd_2 Mov10_kd_3 Mov10_oe_1 Mov10_oe_2 Mov10_oe_3
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ENSG00000000003 TSPAN6 3925. 3794. 3961. 3952. 3941. 2727. 2730. 3178.
2 ENSG00000000005 TNMD 24.2 30.2 30.7 23.7 13.9 20.4 33.3 33.6
3 ENSG00000000419 DPM1 1326. 1341. 1180. 1515. 1432. 1542. 1548. 1943.
4 ENSG00000000457 SCYL3 456. 422. 476. 597. 610. 527. 518. 541.
5 ENSG00000000460 C1orf112 1250. 1212. 1134. 1389. 1300. 965. 999. 1029.
6 ENSG00000000938 FGR 0.897 1.04 0 1.28 0 0 0 0
Exercise
You can also add gene names to your DEseq2 tables res_tableOE_df
, res_tableKD_df
and create 2 new tables :
- res_tableOE_and_annot_df
# A tibble: 57,761 × 9
gene_ID symbol baseMean log2FoldChange lfcSE pvalue padj is_up is_down
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 ENSG00000155363 MOV10 95772. 6.99 0.122 0 0 yes no
2 ENSG00000189060 H1F0 8090. 1.51 0.0666 3.72e-115 3.41e-111 yes no
3 ENSG00000173110 HSPA6 239. 6.50 0.363 9.44e- 72 5.76e- 68 yes no
4 ENSG00000265972 TXNIP 5285. 1.37 0.0788 1.64e- 68 7.48e- 65 yes no
5 ENSG00000187837 HIST1H1C 1747. 1.48 0.0890 1.75e- 63 6.41e- 60 yes no
res_tableOE_and_annot_df <- left_join(res_tableOE_df, grch38annot, by=c("gene_ID"="ensgene"))
res_tableOE_and_annot_df <- relocate(res_tableOE_and_annot_df, "symbol", .after = "gene_ID")
res_tableKD_and_annot_df
# A tibble: 57,761 × 9
gene_ID symbol baseMean log2FoldChange lfcSE pvalue padj is_up is_down
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 ENSG00000116962 NID1 6434. 1.09 0.0701 7.24e-56 1.27e-51 yes no
2 ENSG00000270882 HIST2H4A 2595. 1.62 0.106 9.61e-54 8.44e-50 yes no
3 ENSG00000143183 TMCO1 1807. 1.23 0.0858 8.12e-48 4.75e-44 yes no
4 ENSG00000124762 CDKN1A 2894. 1.22 0.0867 5.27e-46 2.31e-42 yes no
5 ENSG00000116473 RAP1A 2189. -1.00 0.0719 2.69e-45 9.44e-42 no yes
#in one command (!)
res_tableKD_and_annot_df <- res_tableKD_df %>%
left_join(grch38annot, by=c("gene_ID"="ensgene")) %>%
relocate("symbol", .after = "gene_ID")
One way to visualize results is to simply plot the expression data for a handful of genes. We can do this by selecting specific genes of interest or by choosing a range of genes.
plotCounts()
to plot expression of a single geneTo select a specific gene of interest to plot, for example MOV10, we can use the plotCounts()
from DESeq2. plotCounts()
requires that the gene specified matches the original input to DESeq2, which in our case was Ensembl IDs.
# Find the Ensembl ID of MOV10
grch38annot[grch38annot$symbol == "MOV10", "ensgene"]
# Plot expression for single gene
plotCounts(dds, gene="ENSG00000155363", intgroup="sampletype")
This DESeq2 function only allows for plotting the counts of a single gene at a time, and is not flexible regarding the appearance.
If you wish to change the appearance of this plot, we can save the output of plotCounts()
to a variable specifying the returnData=TRUE
argument, then use ggplot()
:
# Save plotcounts to a data frame object
count_gene_x_df <- plotCounts(dds, gene="ENSG00000155363", intgroup="sampletype", returnData=TRUE)
# What is the data output of plotCounts()?
count_gene_x_df
# Plot the MOV10 normalized counts, using the samplenames (rownames(d) as labels)
ggplot(count_gene_x_df, aes(x = sampletype, y = count, color = sampletype)) +
geom_point(position=position_jitter(w = 0.1,h = 0)) +
geom_text_repel(aes(label = rownames(count_gene_x_df)), show.legend = FALSE) +
theme_bw() +
ggtitle("MOV10") +
theme(plot.title = element_text(hjust = 0.5))
Note that in the plot below (code above), we are using geom_text_repel()
from the ggrepel
package to label our individual points on the plot.
Often it is helpful to check the expression of multiple genes of interest at the same time. This often first requires some data wrangling.
We are going to plot the normalized count values for the top 20 differentially expressed genes (by padj values).
To do this, we first need to get the gene IDs from our ordered (by padj values) results table (sigOE_up
) and extracting the top 20 genes :
top20_sigOE_genes <- sigOE_up_df %>%
pull(gene_ID) %>%
head(20)
top20_sigOE_genes
Then, we can extract the normalized count values for these top 20 genes:
## normalized counts for top 20 significant genes
top20_sigOE_norm_df <- normalized_counts_and_annot_df %>%
filter(gene_ID %in% top20_sigOE_genes)
top20_sigOE_norm_df
Now we have the normalized counts for each of the top 20 genes for all 8 samples.
To plot the data using ggplot(), we need to convert the counts in a “long format”. Indeed, ggplot need one column with the values we want it to plot.
The pivot_longer() function in the tidyr package will perform this operation and will output the normalized counts for all genes for Mov10_oe_1 listed in the first 20 rows, followed by the normalized counts for Mov10_oe_2 in the next 20 rows, and so on.
# Gathering the columns to have normalized counts to a single column
top20_sigOE_dfl <- top20_sigOE_norm_df %>%
pivot_longer(cols = c("Irrel_kd_1","Irrel_kd_2","Irrel_kd_3","Mov10_kd_2","Mov10_kd_3","Mov10_oe_1","Mov10_oe_2","Mov10_oe_3"),
names_to = "samplename",
values_to = "normalized_counts")
# a shorter way to write it
top20_sigOE_dfl <- top20_sigOE_norm_df %>%
pivot_longer(cols = -c(gene_ID,symbol),
names_to = "samplename",
values_to = "normalized_counts")
## check the column header in the "long format" data frame
top20_sigOE_dfl
# A tibble: 160 × 4
gene_ID symbol samplename normalized_counts
<chr> <chr> <chr> <dbl>
1 ENSG00000096654 ZNF184 Irrel_kd_1 502.
2 ENSG00000096654 ZNF184 Irrel_kd_2 500.
3 ENSG00000096654 ZNF184 Irrel_kd_3 475.
4 ENSG00000096654 ZNF184 Mov10_kd_2 794.
5 ENSG00000096654 ZNF184 Mov10_kd_3 793.
6 ENSG00000096654 ZNF184 Mov10_oe_1 1046.
7 ENSG00000096654 ZNF184 Mov10_oe_2 1011.
8 ENSG00000096654 ZNF184 Mov10_oe_3 1102.
9 ENSG00000112972 HMGCS1 Irrel_kd_1 9310.
10 ENSG00000112972 HMGCS1 Irrel_kd_2 9373.
# ℹ 150 more rows
Now, if we want our counts colored by sample group, then we need to combine the metadata information with the normalized counts data into the same data frame for input to ggplot():
top20_sigOE_dfl <- left_join(top20_sigOE_dfl, meta, by = c("samplename" = "name"))
head(top20_sigOE_dfl)
# A tibble: 6 × 6
gene_ID symbol samplename normalized_counts sampletype replicate
<chr> <chr> <chr> <dbl> <chr> <chr>
1 ENSG00000096654 ZNF184 Irrel_kd_1 502. control rep1
2 ENSG00000096654 ZNF184 Irrel_kd_2 500. control rep2
3 ENSG00000096654 ZNF184 Irrel_kd_3 475. control rep3
4 ENSG00000096654 ZNF184 Mov10_kd_2 794. MOV10_knockdown rep2
5 ENSG00000096654 ZNF184 Mov10_kd_3 793. MOV10_knockdown rep3
6 ENSG00000096654 ZNF184 Mov10_oe_1 1046. MOV10_overexpression rep1
The left_join() will join the 2 data frames with respect to the samplename column of the top20_sigOE_dfl
table and the sample column of the meta
table.
Now that we have a data frame in a format that can be utilized by ggplot easily, let’s plot!
## plot using ggplot2
ggplot(top20_sigOE_dfl) +
geom_point(aes(x = sampletype, y = normalized_counts, color = sampletype)) +
scale_y_log10() +
xlab("Genes") +
ylab("log10 Normalized Counts") +
ggtitle("Top 20 Significant DE Genes") +
theme_bw() +
facet_wrap(~symbol, scales = "free_y") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
theme(plot.title = element_text(hjust = 0.5))
You can change chunk options to increase plot size. For example, you can use {r, fig.height=6, fig.width=10}
You can also change sample name to have a shorter X labels. For that we will add a column with the new names
top20_sigOE_dfl <- top20_sigOE_dfl %>%
mutate(sampletype_short = case_when(
sampletype == "MOV10_knockdown" ~ 'kd',
sampletype == "MOV10_overexpression" ~ 'oe',
TRUE ~ 'ctl'))
## plot using ggplot2
ggplot(top20_sigOE_dfl) +
geom_point(aes(x = sampletype_short, y = normalized_counts, color = sampletype_short)) +
scale_y_log10() +
xlab("Genes") +
ylab("log10 Normalized Counts") +
ggtitle("Top 20 Significant DE Genes") +
theme_bw() +
facet_wrap(~symbol, scales = "free") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
theme(plot.title = element_text(hjust = 0.5))
Excercise
MOV10 Differential Expression Analysis: Control versus Knockdown
In addition to plotting subsets, we could also extract the normalized values of all the significant genes and plot a heatmap of their expression using pheatmap()
.
### Extract normalized expression for significant genes from the OE and control samples (3:5 and 8:10) (and gene ID)
norm_OEsig_df <- normalized_counts_and_annot_df[,c("gene_ID",
"symbol",
"Irrel_kd_1",
"Irrel_kd_2",
"Irrel_kd_3",
"Mov10_oe_1",
"Mov10_oe_2",
"Mov10_oe_3")
] %>%
filter(gene_ID %in% sigOE_df$gene_ID)
norm_OEsig_df
# A tibble: 4,774 × 7
gene_ID Irrel_kd_1 Irrel_kd_2 Irrel_kd_3 Mov10_oe_1 Mov10_oe_2 Mov10_oe_3
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ENSG00000000003 3925. 3794. 3961. 2727. 2730. 3178.
2 ENSG00000000419 1326. 1341. 1180. 1542. 1548. 1943.
3 ENSG00000000460 1250. 1212. 1134. 965. 999. 1029.
4 ENSG00000001084 2691. 2538. 2625. 1898. 2117. 2379.
5 ENSG00000001167 2752. 2514. 2601. 2149. 2134. 2096.
Now let’s draw the heatmap using pheatmap
:
### Set a color palette
heat_colors <- brewer.pal(6, "YlOrRd")
norm_OEsig_mat <- norm_OEsig_df %>% column_to_rownames("gene_ID") %>%
dplyr::select(Irrel_kd_1,Irrel_kd_2,Irrel_kd_3,Mov10_oe_1,Mov10_oe_2,Mov10_oe_3) %>%
as.matrix()
annotation_col_df <- meta%>% as.data.frame()
rownames(annotation_col_df) <- annotation_col_df$name
annotation_col_df <- annotation_col_df[,c("replicate","sampletype")]
### Run pheatmap using the metadata data frame for the annotation
pheatmap(norm_OEsig_mat,
color = heat_colors,
cluster_rows = T,
show_rownames = F,
annotation = annotation_col_df,
border_color = NA,
fontsize = 10,
scale = "row",
fontsize_row = 10,
height = 20)
NOTE: There are several additional arguments we have included in the function for aesthetics. One important one is scale="row"
, in which Z-scores are plotted, rather than the actual normalized count value.
Z-scores are computed on a gene-by-gene basis by subtracting the mean and then dividing by the standard deviation. The Z-scores are computed after the clustering, so that it only affects the graphical aesthetics and the color visualization is improved.
Excercise
MOV10 Differential Expression Analysis: Control versus Knockdown
Now that we have results for the overexpression results, do the same for the Control vs. Knockdown samples.
The above plot would be great to look at the expression levels of a good number of genes, but for more of a global view, there are other plots we can draw. A commonly used one is a volcano plot; in which you have the log transformed adjusted p-values plotted on the y-axis and log2 fold change values on the x-axis.
To generate a volcano plot, we first need to have a column in our results data indicating whether or not the gene is considered differentially expressed based on p-adjusted values and we will include a log2fold change here.
## Obtain logical vector where TRUE values denote padj values < 0.05 and fold change > 1.5 (log2(1.5) = 0.58) in either direction
res_tableOE_and_annot_df <- res_tableOE_and_annot_df %>%
mutate(threshold_OE = (!is.na(padj)) & padj < 0.05 & abs(log2FoldChange) >= 0.58)
# A tibble: 57,761 × 10
gene_ID symbol baseMean log2FoldChange lfcSE pvalue padj is_up is_down threshold_OE
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <lgl>
1 ENSG00000155363 MOV10 95772. 6.99 0.122 0 0 yes no TRUE
2 ENSG00000189060 H1F0 8090. 1.51 0.0666 3.72e-115 3.41e-111 yes no TRUE
3 ENSG00000173110 HSPA6 239. 6.50 0.363 9.44e- 72 5.76e- 68 yes no TRUE
4 ENSG00000265972 TXNIP 5285. 1.37 0.0788 1.64e- 68 7.48e- 65 yes no TRUE
5 ENSG00000187837 HIST1H1C 1747. 1.48 0.0890 1.75e- 63 6.41e- 60 yes no TRUE
Now we can start plotting. The geom_point
object is most applicable, as this is essentially a scatter plot:
## Volcano plot
ggplot(res_tableOE_and_annot_df) +
geom_point(aes(x = log2FoldChange, y = -log10(padj), colour = threshold_OE)) +
ggtitle("Mov10 overexpression") +
xlab("log2 fold change") +
ylab("-log10 adjusted p-value") +
theme_bw() +
theme(legend.position = "none",
plot.title = element_text(size = rel(1.5), hjust = 0.5),
axis.title = element_text(size = rel(1.25)))
This is a great way to get an overall picture of what is going on, but what if we also wanted to know where the top 10 genes (lowest padj) in our DE list are located on this plot? We could label those dots with the gene name on the Volcano plot using geom_text_repel()
.
For that, we plot it as before with an additional layer for geom_text_repel()
wherein we can specify a subset of the results table and the column of gene labels.
ggplot(res_tableOE_and_annot_df, aes(x = log2FoldChange, y = -log10(padj))) +
geom_point(aes(colour = threshold_OE)) +
geom_text_repel(data = res_tableOE_and_annot_df %>% head(10) , aes(label = symbol)) +
ggtitle("Mov10 overexpression") +
xlab("log2 fold change") +
ylab("-log10 adjusted p-value") +
theme_bw() +
theme(legend.position = "none",
plot.title = element_text(size = rel(1.5), hjust = 0.5),
axis.title = element_text(size = rel(1.25)))
The Bioconductor package EnhancedVolcano
can use the DESeq2 results output to make the volcano plots generated above by writing much fewer lines of code.
#BiocManager::install("EnhancedVolcano")
library(EnhancedVolcano)
EnhancedVolcano(res_tableOE_and_annot_df,
lab = res_tableOE_and_annot_df$symbol,
x = 'log2FoldChange',
y = 'pvalue')
Using its documentation, you can “easily” customize your plot. Don’t hesitate to take a look.
top20_sigOE_symbol <- res_tableOE_and_annot_df %>% pull(symbol) %>% head(20)
EnhancedVolcano(res_tableOE_and_annot_df,
lab = res_tableOE_and_annot_df$symbol,
x = 'log2FoldChange',
y = 'pvalue',
pCutoff = 0.05,
FCcutoff = 0.58,
drawConnectors = TRUE,
widthConnectors = 0.75,
selectLab = top20_sigOE_symbol)
Excercise
MOV10 Differential Expression Analysis: Control versus Knockdown
Now that we have results for the overexpression results, do the same for the Control vs. Knockdown samples.
Another package, ggpubr
, offers a useful function (ggmaplot
) to easily draw MA plot (mean expression across all samples in function of the logfoldchange).
#install.packages("ggpubr")
library(ggpubr)
Documentation of this function can be found here
# Default plot
ggmaplot(res_tableOE_and_annot_df, main = "MOV10 overexpression samples versus control samples",
fdr = 0.05, fc = 1.5, size = 0.4,
palette = c("#B31B21", "#1465AC", "darkgray"),
genenames = as.vector(res_tableOE_and_annot_df$symbol),
legend = "top", top = 20,
font.label = c("bold", 11),
font.legend = "bold",
font.main = "bold",
ggtheme = ggplot2::theme_bw())
The caveat of these functions is you have to check the default values of the options to understand what is effectively plotted.
summary(res_tableOE_shrunken, alpha = 0.05)
adjusted p-value < 0.05
LFC > 0 (up) : 2004, 5.2%
LFC < 0 (down) : 2770, 7.1%
For example, if you compare the displayed number of up/down regulated genes to the results of the function “summary”, numbers are different. The difference comes from the value of the fc option.
# Default plot
ggmaplot(res_tableOE_and_annot_df, main = "MOV10 overexpression samples versus control samples",
fdr = 0.05, fc = 0, size = 0.4,
palette = c("#B31B21", "#1465AC", "darkgray"),
genenames = as.vector(res_tableOE_and_annot_df$symbol),
legend = "top", top = 20,
font.label = c("bold", 11),
font.legend = "bold",
font.main = "bold",
ggtheme = ggplot2::theme_bw())
Excercise
MOV10 Differential Expression Analysis: Control versus Knockdown
Now that we have results for the overexpression results, do the same for the Control vs. Knockdown samples.
Now it is time to do the same things with your dataset.
We will use the BiomaRt package again to retrieve from Ensembl the mapping between gene IDs and gene names (and descriptions) in order to plot gene names instead of gene IDs.
You can find below command lines to get this table for human or mouse.
#BiocManager::install("biomaRt")
library(biomaRt)
human_mart <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
human_geneIDs2geneNames <- getBM(
attributes=c("ensembl_gene_id","external_gene_name","gene_biotype","description"),
mart = human_mart)
head(human_geneIDs2geneNames)
dim(human_geneIDs2geneNames) # 70711 4
ensembl_gene_id external_gene_name gene_biotype description
1 ENSG00000210049 MT-TF Mt_tRNA ...
2 ENSG00000211459 MT-RNR1 Mt_rRNA ...
3 ENSG00000210077 MT-TV Mt_tRNA ...
4 ENSG00000210082 MT-RNR2 Mt_rRNA ...
5 ENSG00000209082 MT-TL1 Mt_tRNA ...
6 ENSG00000198888 MT-ND1 protein_coding ...
#BiocManager::install("biomaRt")
library(biomaRt)
mouse_mart <- useMart("ensembl", dataset = "mmusculus_gene_ensembl")
mouse_geneIDs2geneNames <- getBM(
attributes=c("ensembl_gene_id","external_gene_name","gene_biotype","description"),
mart = mouse_mart)
head(mouse_geneIDs2geneNames)
dim(mouse_geneIDs2geneNames)
Another useful R package is AnnotationHub
allowing you to retrieve data from several genomic database.
While it is not necessary to pre-filter low count genes before running the DESeq2 functions, there are two reasons which make pre-filtering useful: by removing rows in which there are very few reads, we reduce the memory size of the dds data object, and we increase the speed of the transformation and testing functions within DESeq2. It can also improve visualizations, as features with no information for differential expression are not plotted.
Here we perform minimal pre-filtering to keep only rows that have at least 10 reads total. Note that stricter filtering to increase power is automatically applied via independent filtering on the mean of normalized counts within the results function.
keep <- rowSums(counts(dds)) >= 10
dds_bis <- dds[keep,]
nrow(dds) #57761
nrow(dds_bis) #27817
dds_bis <- DESeq(dds_bis)
Alternatively, a popular filter is to ensure at least X samples with a count of 10 or more, where X can be chosen as the sample size of the smallest group of samples:
X = 3
keep <- rowSums(counts(dds) >= 10) >= X
dds_ter <- dds[keep,]
nrow(dds) #57761
nrow(dds_ter) #20024
dds_ter <- DESeq(dds_ter)
!Warning! : you have to do this pre-filtering before running the DEseq function.
human_geneIDs2geneNames_only_proteins <- human_geneIDs2geneNames %>%
filter(gene_biotype == "protein_coding")
dim(human_geneIDs2geneNames_only_proteins)
dim(human_geneIDs2geneNames)
keep_proteins <- rownames(counts(dds)) %in% human_geneIDs2geneNames_only_proteins$ensembl_gene_id
dds_prot <- dds[keep_proteins,]
nrow(dds)
nrow(dds_prot)
dds_prot <- DESeq(dds_prot)
!Warning! : you have to do this pre-filtering before running the DEseq function.
Using the following code you can check the number/proportion of reads mapping in protein coding genes or other genes (pseudogenes, ncrna, …)
gene_biotype_simplified_list <- c("protein_coding","lncRNA","processed_pseudogene","unprocessed_pseudogene","misc_RNA","rRNA","snRNA")
bilan_gene_biotype_df <- counts(dds, normalize = F) %>%
as_tibble(rownames = "ensembl_gene_id") %>%
left_join(human_geneIDs2geneNames) %>%
pivot_longer(-c(ensembl_gene_id,external_gene_name,description, gene_biotype), names_to = "sample", values_to = "count") %>%
mutate(gene_biotype_simplified = ifelse(gene_biotype %in% gene_biotype_simplified_list, gene_biotype, "other")) %>%
group_by(sample, gene_biotype_simplified) %>%
summarise(total_by_sample = sum(count)) %>%
group_by(sample) %>%
mutate(percentage = total_by_sample / sum(total_by_sample) * 100) %>%
arrange(desc(percentage))
head(bilan_gene_biotype_df)
sample gene_biotype_simplified total_by_sample percentage
<chr> <chr> <int> <dbl>
1 Irrel_kd_1 protein_coding 30093095 96.6
2 Irrel_kd_2 protein_coding 25582178 96.5
3 Irrel_kd_3 protein_coding 19783387 96.5
4 Mov10_kd_2 protein_coding 43683866 96.4
5 Mov10_kd_3 protein_coding 25778701 96.4
6 Mov10_oe_3 protein_coding 16596199 96.0
ggplot(data = bilan_gene_biotype_df, aes(x = sample, y = total_by_sample, fill = gene_biotype_simplified)) +
geom_bar(stat = "identity") + theme_bw() +
ggtitle("Raw read mapping distribution") +
scale_fill_brewer(palette="Dark2")
ggplot(data = bilan_gene_biotype_df, aes(x = sample, y = percentage, fill = gene_biotype_simplified)) +
geom_bar(stat = "identity") + theme_bw() +
ggtitle("Raw read mapping distribution") +
scale_fill_brewer(palette="Dark2")
These materials have been developed by members of BIBS team of the CIRI (https://ciri.ens-lyon.fr/). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.