Learning objectives

  • Demonstrate the use of the design formula with simple and complex designs
  • Construct R code to execute the differential expression analysis workflow with DESeq2
  • Recongnize the importance of multiple test correction
  • Setup results data for application of visualization techniques
  • Describe different data visualization useful for exploring results from a DGE analysis
  • Create a volcano plot to evaluate relationship among DGE statistics
  • Create a heatmap to illustrate expression changes of differentially expressed genes

Link to the Rmd file (DEG_analysis.Rmd) from the previous session

Link to the tsv file (meta/sample_metadata_ok.tsv) from the previous session

1 Differential expression analysis with DESeq2

The final step in the differential expression analysis workflow is fitting the raw counts to the NB model and performing the statistical test for differentially expressed genes. In this step we essentially want to determine whether the mean expression levels of different sample groups are significantly different.

Image credit: Paul Pavlidis, UBC

The DESeq2 paper was published in 2014, but the package is continually updated and available for use in R through Bioconductor. It builds on good ideas for dispersion estimation and use of Generalized Linear Models from the DSS and edgeR methods.

Differential expression analysis with DESeq2 involves multiple steps as displayed in the flowchart below in blue. Briefly, DESeq2 will model the raw counts, using normalization factors (size factors) to account for differences in library depth. Then, it will estimate the gene-wise dispersions and shrink these estimates to generate more accurate estimates of dispersion to model the counts. Finally, DESeq2 will fit the negative binomial model and perform hypothesis testing using the Wald test or Likelihood Ratio Test.

NOTE: DESeq2 is actively maintained by the developers and continuously being updated. As such, it is important that you note the version you are working with. Recently, there have been some rather big changes implemented that impact the output. To find out more detail about the specific modifications made to methods described in the original 2014 paper, take a look at this section in the DESeq2 vignette.

Additional details on the statistical concepts underlying DESeq2 are elucidated nicely in Rafael Irizarry’s materials for the EdX course, “Data Analysis for the Life Sciences Series”.

1.1 Running DESeq2

Prior to performing the differential expression analysis, it is a good idea to know what sources of variation are present in your data, either by exploration during the QC and/or prior knowledge. Once you know the major sources of variation, you can remove them prior to analysis or control for them in the statistical model by including them in your design formula.

1.1.1 Design formula

A design formula tells the statistical software the known sources of variation to control for, as well as, the factor of interest to test for during differential expression testing. For example, if you know that sex is a significant source of variation in your data, then sex should be included in your model. The design formula should have all of the factors in your metadata that account for major sources of variation in your data. The last factor entered in the formula should be the condition of interest.

For example, suppose you have the following metadata:

If you want to examine the expression differences between treatments, and you know that major sources of variation include sex and age, then your design formula would be:

design = ~ sex + age + treatment

The tilde (~) should always precede your factors and tells DESeq2 to model the counts using the following formula. Note the factors included in the design formula need to match the column names in the metadata.


Exercises

  1. Suppose you wanted to study the expression differences between the two age groups in the metadata shown above, and major sources of variation were sex and treatment, how would the design formula be written?

~ sex + treatment + age
  1. Based on our Mov10 metadata dataframe, which factors could we include in our design formula?

~ sampletype
  1. What would you do if you wanted to include a factor in your design formula that is not in your metadata?

#Add it in your metadata

1.1.1.1 Complex designs

DESeq2 also allows for the analysis of complex designs. You can explore interactions or ‘the difference of differences’ by specifying for it in the design formula. For example, if you wanted to explore the effect of sex on the treatment effect, you could specify for it in the design formula as follows:

design = ~ sex + age + treatment + sex:treatment

Since the interaction term sex:treatment is last in the formula, the results output from DESeq2 will output results for this term.

There are additional recommendations for complex designs in the DESeq2 vignette. In addition, Limma documentation offers additional insight into creating more complex design formulas.

NOTE: Need help figuring out what information should be present in your metadata?. You can find additional materials highlighting bulk RNA-seq planning considerations. Please take a look at these materials before starting an experiment to help with proper experimental design.

1.1.2 MOV10 DE analysis

Now that we know how to specify the model to DESeq2, we can run the differential expression pipeline on the raw counts.

To get our differential expression results from our raw count data, we only need to run 2 lines of code!

First we create a DESeqDataSet as we did in the ‘Count normalization’ lesson and specify the txi object which contains our raw counts, the metadata variable, and provide our design formula:

  • Add a new section in your Rmd file: “# Differential expression analysis with DESeq2” and a new sub section “## build the dds”

  • Add the following chunk and execute it

## Create DESeq2Dataset object
dds <- DESeqDataSetFromTximport(txi, colData = meta, design = ~ sampletype)

## Run analysis
dds <- DESeq(dds)

To run the actual differential expression analysis, we use a single call to the function DESeq().

By re-assigning the results of the function back to the same variable name (dds), we can fill in the slots of our DESeqDataSet object.

Everything from normalization to linear modeling was carried out by the use of a single function! This function will print out a message for the various steps it performs:

estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing

To see what is occurring in each of these steps you can go here, but the code to execute these steps is encompassed in the two lines above.

NOTE: There are individual functions available in DESeq2 that would allow us to carry out each step in the workflow in a step-wise manner, rather than a single call. We demonstrated one example when generating size factors to create a normalized matrix. By calling DESeq(), the individual functions for each step are run for you.


Exercise

Let’s suppose our experiment has the following metadata:

genotype treatment
sample1 WT ev
sample2 WT ev
sample3 WT ev
sample4 WT ev
sample5 KO_geneA ev
sample6 KO_geneA ev
sample7 KO_geneA ev
sample8 KO_geneA ev
sample9 WT treated
sample10 WT treated
sample11 WT treated
sample12 WT treated
sample13 KO_geneA treated
sample14 KO_geneA treated
sample15 KO_geneA treated
sample16 KO_geneA treated

How would the design formula be structured to perform the following analyses?

  1. Test for the effect of treatment.

~ treatment
  1. Test for the effect of genotype, while regressing out the variation due to treatment.

~ treatment + genotype
  1. Test for the effect of genotype on the treatment effects.

~ genotype + treatment + genotype:treatment

2 Exploring Results (Wald test)

By default DESeq2 uses the Wald test to identify genes that are differentially expressed between two sample classes. Given the factor(s) used in the design formula, and how many factor levels are present, we can extract results for a number of different comparisons. Here, we will walk through how to obtain results from the dds object and provide some explanations on how to interpret them.

The Wald test is a test usually performed on parameters that have been estimated by maximum likelihood. In our case we are testing each gene model coefficient (LFC) which was derived using parameters like dispersion, which were estimated using maximum likelihood.

NOTE: The Wald test can also be used with continuous variables. If the variable of interest provided in the design formula is continuous-valued, then the reported log2FoldChange is per unit of change of that variable.

2.1 Specifying contrasts

In our dataset, we have three sample classes so we can make three possible pairwise comparisons:

  1. Control vs. Mov10 overexpression
  2. Control vs. Mov10 knockdown
  3. Mov10 knockdown vs. Mov10 overexpression

We are really only interested in #1 and #2 from above. When we intially created our dds object we had provided ~ sampletype as our design formula, indicating that sampletype is our main factor of interest.

To indicate which two sample classes we are interested in comparing, we need to specify contrasts. The contrasts are used as input to the DESeq2 results() function to extract the desired results.

The more commonly method to specify contrasts is to supply as a character vector with exactly three elements: the name of the factor (of interest) in the design formula, the name of the two factors levels to compare. The factor level given last is the base level for the comparison. The syntax is given below:

# DO NOT RUN!
contrast <- c("condition", "level_to_compare", "base_level")
results(dds, contrast = contrast)

Alternatively, if you only had two factor levels you could do nothing and not worry about specifying contrasts (i.e. results(dds)). In this case, DESeq2 will choose what your base factor level based on alphabetical order of the levels.

To start, we want to evaluate expression changes between the MOV10 overexpression samples and the control samples. As such we will use the first method for specifying contrasts and create a character vector:


Exercise

  • What contrast will we use ?

## Define contrasts for MOV10 overexpression
contrast_oe <- c("sampletype", "MOV10_overexpression", "control")
  • Does it matter what I choose to be my base level?

#Yes, it does matter. **Deciding what level is the base level will determine how to interpret the fold change that is reported.**  So for example, if we observe a log2 fold change of -2 this would mean the gene expression is lower in factor level of interest relative to the base level. Thus, if leaving it up to DESeq2 to decide on the contrasts be sure to check that the alphabetical order coincides with the fold change direction you are anticipating.

2.2 The results table

Now that we have our contrast created, we can use it as input to the results() function. Let’s take a quick look at the help manual for the function:

?results

You will see we have the option to provide a wide array of arguments and tweak things from the defaults as needed. As we go through the lesson we will keep coming back to the help documentation to discuss some arguments that are good to know about.

  • add a sub section “## expression changes between the MOV10 overexpression samples and the control samples”

  • add the following chunk

## Define contrasts for MOV10 overexpression
contrast_oe <- c("sampletype", "MOV10_overexpression", "control")

## Extract results for MOV10 overexpression vs control
res_tableOE <- results(dds, contrast=contrast_oe, alpha = 0.05)

NOTE: For our analysis, in addition to the contrast argument we will also provide a value of 0.05 for the alpha argument. We will describe this in more detail when we talk about gene-level filtering.

The results table that is returned to us is a DESeqResults object, which is a simple subclass of DataFrame. In many ways it can be treated like a dataframe (i.e when accessing/subsetting data), however it is important to recognize that there are differences for downstream steps like visualization.

# Check what type of object is returned
class(res_tableOE)

Now let’s take a look at what information is stored in the results:

# What is stored in results?
res_tableOE
log2 fold change (MLE): sampletype MOV10_overexpression vs control 
Wald test p-value: sampletype MOV10 overexpression vs control 
DataFrame with 57761 rows and 6 columns
                  baseMean log2FoldChange     lfcSE       stat      pvalue        padj
                 <numeric>      <numeric> <numeric>  <numeric>   <numeric>   <numeric>
ENSG00000000003  3525.8835      -0.438245 0.0774607 -5.6576468 1.53463e-08 4.25096e-07
ENSG00000000005    26.2489       0.029208 0.4411289  0.0662119 9.47209e-01 9.72687e-01
ENSG00000000419  1478.2512       0.383635 0.1137610  3.3722909 7.45457e-04 4.67394e-03
ENSG00000000457   518.4220       0.228971 0.1023312  2.2375448 2.52508e-02 8.02341e-02
ENSG00000000460  1159.7761      -0.269138 0.0814993 -3.3023396 9.58819e-04 5.76079e-03
...                    ...            ...       ...        ...         ...         ...
ENSG00000285889    1.82171       -4.68144 3.9266061   -1.19224  0.23316908          NA
ENSG00000285950    7.58089       -1.01978 1.0715579   -0.95168  0.34125937          NA
ENSG00000285976 4676.24904        0.19364 0.0656673    2.94881  0.00319004   0.0157502
ENSG00000285978    2.25697        4.13612 2.0706198    1.99753  0.04576780          NA
ENSG00000285980    0.00000             NA        NA         NA          NA          NA

We have six columns of information reported for each gene (row). We can use the mcols() function to extract information on what the values stored in each column represent:

# Get information on each column in results
 mcols(res_tableOE, use.names=T) 
                       type                                                  description
baseMean       intermediate                    mean of normalized counts for all samples
log2FoldChange      results log2 fold change (MLE): sampletype MOV10_overexpression vs control
lfcSE               results         standard error: sampletype MOV10 overexpression vs control
stat                results         Wald statistic: sampletype MOV10 overexpression vs control
pvalue              results      Wald test p-value: sampletype MOV10 overexpression vs control
padj                results                                         BH adjusted p-values

2.3 P-values

The p-value is a probability value used to determine whether there is evidence to reject the null hypothesis. A smaller p-value means that there is stronger evidence in favor of the alternative hypothesis. However, because we are performing a test for each individual gene we need to correct these p-values for multiple testing.

The padj column in the results table represents the p-value adjusted for multiple testing, and is the most important column of the results. Typically, a threshold such as padj < 0.05 is a good starting point for identifying significant genes. The default method for multiple test correction in DESeq2 is an implementation of the Benjamini-Hochberg false discovery rate (FDR). There are other corrections methods available and can be changed by adding the pAdjustMethod argument to the results() function.

2.3.1 Gene-level filtering

Let’s take a closer look at our results table. As we scroll through it, you will notice that for selected genes there are NA values in the pvalue and padj columns. What does this mean?

The missing values represent genes that have undergone filtering as part of the DESeq() function. Prior to differential expression analysis it is beneficial to omit genes that have little or no chance of being detected as differentially expressed. This will increase the power to detect differentially expressed genes. DESeq2 does not physically remove any genes from the original counts matrix, and so all genes will be present in your results table. The genes omitted by DESeq2 meet one of the three filtering criteria outlined below:

1. Genes with zero counts in all samples

If within a row, all samples have zero counts there is no expression information and therefore these genes are not tested.

# Filter genes by zero expression
res_tableOE %>% filter(baseMean == 0) %>% as_tible()

The baseMean column for these genes will be zero, and the log2 fold change estimates, p-value and adjusted p-value will all be set to NA.

2. Genes with an extreme count outlier

The DESeq() function calculates, for every gene and for every sample, a diagnostic test for outliers called Cook’s distance. Cook’s distance is a measure of how much a single sample is influencing the fitted coefficients for a gene, and a large value of Cook’s distance is intended to indicate an outlier count. Genes which contain a Cook’s distance above a threshold are flagged, however at least 3 replicates are required for flagging, as it is difficult to judge which sample might be an outlier with only 2 replicates. We can turn off this filtering by using the cooksCutoff argument in the results() function.

# Filter genes that have an extreme outlier
res_tableOE %>%
  as_data_frame(rownames = "gene_ID") %>%
  filter(is.na(pvalue) & is.na(padj) & baseMean > 0) 

If a gene contains a sample with an extreme count outlier then the p-value and adjusted p-value will be set to NA.

3. Genes with a low mean normalized counts

DESeq2 defines a low mean threshold, that is empirically determined from your data, in which the fraction of significant genes can be increased by reducing the number of genes that are considered for multiple testing. This is based on the notion that genes with very low counts are not likely to see significant differences typically due to high dispersion.

Image courtesy of slideshare presentation from Joachim Jacob, 2014.

At a user-specified value (alpha = 0.05), DESeq2 evaluates the change in the number of significant genes as it filters out increasingly bigger portions of genes based on their mean counts, as shown in the figure above. The point at which the number of significant genes reaches a peak is the low mean threshold that is used to filter genes that undergo multiple testing. There is also an argument to turn off the filtering off by setting independentFiltering = F.

# Filter genes below the low mean threshold

res_tableOE %>%
  as_data_frame(rownames = "gene_ID") %>%
  filter( ! is.na(pvalue) & is.na(padj) & baseMean > 0)  %>% 
  View()

If a gene is filtered by independent filtering, then only the adjusted p-value will be set to NA.

NOTE: DESeq2 will perform the filtering outlined above by default; however other DE tools, such as EdgeR will not. Filtering is a necessary step, even if you are using limma-voom and/or edgeR’s quasi-likelihood methods. Be sure to follow pre-filtering steps when using other tools, as outlined in their user guides found on Bioconductor as they generally perform much better.

2.4 Fold change

Another important column in the results table, is the log2FoldChange. With large significant gene lists it can be hard to extract meaningful biological relevance. To help increase stringency, one can also add a fold change threshold. Keep in mind when setting that value that we are working with log2 fold changes, so a cutoff of log2FoldChange < 1 would translate to an actual fold change of 2.

An alternative approach to add the fold change threshold: The results() function has an option to add a fold change threshold using the lfcThrehsold argument. This method is more statistically motivated, and is recommended when you want a more confident set of genes based on a certain fold-change. It actually performs a statistical test against the desired threshold, by performing a two-tailed test for log2 fold changes greater than the absolute value specified instead of 0. The user can change the alternative hypothesis using altHypothesis and perform two one-tailed tests as well. This is a more conservative approach, so expect to retrieve a much smaller set of genes!

The fold changes reported in the results table are calculated by:

log2 (normalized_counts_group1 / normalized_counts_group2)

The problem is, these fold change estimates are not entirely accurate as they do not account for the large dispersion we observe with low read counts. To address this, the log2 fold changes need to be adjusted.

2.4.1 More accurate LFC estimates

To generate more accurate log2 foldchange (LFC) estimates, DESeq2 allows for the shrinkage of the LFC estimates toward zero when the information for a gene is low, which could include:

  • Low counts
  • High dispersion values

LFC shrinkage uses information from all genes to generate more accurate estimates. Specifically, the distribution of LFC estimates for all genes is used (as a prior) to shrink the LFC estimates of genes with little information or high dispersion toward more likely (lower) LFC estimates.

Illustration taken from the DESeq2 paper.

In the figure above, we have an example using two genes green gene and purple gene. For each gene the expression values are plotted for each sample in the two different mouse strains (C57BL/6J and DBA/2J). Both genes have the same mean values for the two sample groups, but the green gene has little variation within group while the purple gene has high levels of variation. For the green gene with low within group variation, the unshrunken LFC estimate (vertex of the green solid line) is very similar to the shrunken LFC estimate (vertex of the green dotted line). However, LFC estimates for the purple gene are quite different due to the high dispersion. So even though two genes can have similar normalized count values, they can have differing degrees of LFC shrinkage. Notice the LFC estimates are shrunken toward the prior (black solid line).

Shrinking the log2 fold changes will not change the total number of genes that are identified as significantly differentially expressed. The shrinkage of fold change is to help with downstream assessment of results. For example, if you wanted to subset your significant genes based on fold change for further evaluation, you may want to use shruken values. Additionally, for functional analysis tools such as GSEA which require fold change values as input you would want to provide shrunken values.

To generate the shrunken log2 fold change estimates, you have to run an additional step on your results object (that we will create below) with the function lfcShrink().

  • add the following chunk
## Save the unshrunken results to compare
res_tableOE_unshrunken <- res_tableOE

# Apply fold change shrinkage
res_tableOE_shrunken <- lfcShrink(dds,
                                  res=res_tableOE_unshrunken,
                                  coef="sampletype_MOV10_overexpression_vs_control",
                                  type="apeglm")
  • **contrastvscoef** When using the alternative methods, rather than using thecontrastargument you will be required to specifycoef`. Using contrast forms an expanded model matrix, treating all factor levels equally, and averages over all distances between all pairs of factor levels to estimate the prior. Using coef, means looking only at that column of the model matrix (so usually that would be one level against the reference level) and estimates the prior for that coefficient from the distribution of those MLE of coefficients. When using coef, the shrinkage depends on which level is chosen as reference.

  • How do I know what to value to provide to the coef argument? The value you provide here needs to match identically to what is stored in the column header of the coefficients table. To see what values you have to work with you can use resultsNames(dds).

Note that the stat column is no longer present in the results table. You can compare by inspecting the output of res_tableOE and res_tableOE_unshrunken in the R console.

Depending on the version of DESeq2 you are using the default method for shrinkage estimation will differ. The defaults can be changed by adding the argument type in the lfcShrink() function as we have above. For most recent versions of DESeq2, type="normal" is the default and was the only method in earlier versions. It has been shown that in most situations there are alternative methods that have less bias than the ’normal` method, and therefore we chose to use apeglm.

For more information on shrinkage, the DESeq2 vignette has an Extended section on shrinkage estimators that is quite useful.

2.5 MA plot

A plot that can be useful to exploring our results is the MA plot. The MA plot shows the mean of the normalized counts versus the log2 foldchanges for all genes tested. The genes that are significantly DE are colored to be easily identified. This is also a great way to illustrate the effect of LFC shrinkage. The DESeq2 package offers a simple function to generate an MA plot.

  • Add chunks with the following codes:

Let’s start with the unshrunken results:

# MA plot using unshrunken fold changes
plotMA(res_tableOE_unshrunken, ylim=c(-2,2))

And now the shrunken results:

# MA plot using shrunken fold changes
plotMA(res_tableOE_shrunken, ylim=c(-2,2))

On the left you have the unshrunken fold change values plotted and you can see the abundance of scatter for the the lowly expressed genes. That is, many of the low expressors exhibit very high fold changes. After shrinkage, we see the fold changes are much smaller estimates.

In addition to the comparison described above, this plot allows us to evaluate the magnitude of fold changes and how they are distributed relative to mean expression. Generally, we would expect to see significant genes across the full range of expression levels.


Excercise

MOV10 Differential Expression Analysis: Control versus Knockdown

Now that we have results for the overexpression results, do the same for the Control vs. Knockdown samples.

(Before add a subsection “## expression changes between the MOV10 Knockdown samples and the control samples” and organize your code in chunk)

  1. Create a contrast vector called contrast_kd.

## Define contrasts for MOV10 Knockdown
contrast_kd <- c("sampletype", "MOV10_knockdown", "control")
  1. Use contrast vector in the results() to extract a results table and store that to a variable called res_tableKD.

## Define contrasts for MOV10 Knockdown
contrast_kd <- c("sampletype", "MOV10_knockdown", "control")

## Extract results for MOV10 overexpression vs control
res_tableKD <- results(dds, contrast=contrast_kd, alpha = 0.05)
  1. Shrink the LFC estimates using lfcShrink() and assign it back to res_tableKD.

## Save the unshrunken results to compare
res_tableKD_unshrunken <- res_tableKD

# Apply fold change shrinkage
res_tableKD_shrunken <- lfcShrink(dds, res=res_tableKD_unshrunken, coef="sampletype_MOV10_knockdown_vs_control", type="apeglm")
  1. Make a MA plot using shrunken fold changes

# MA plot using shrunken fold changes
plotMA(res_tableKD_shrunken, ylim=c(-2,2))

2.6 Summarizing results

To summarize the results table, a handy function in DESeq2 is summary(). Confusingly it has the same name as the function used to inspect data frames. This function when called with a DESeq results table as input, will summarize the results using a default threshold of padj < 0.1. However, since we had set the alpha argument to 0.05 when creating our results table threshold: FDR < 0.05 (padj/FDR is used even though the output says p-value < 0.05). Let’s start with the OE vs control results:

## Summarize results
summary(res_tableOE_shrunken, alpha = 0.05)
out of 38903 with nonzero total read count
adjusted p-value < 0.05
LFC > 0 (up)       : 2011, 5.2%
LFC < 0 (down)     : 2797, 7.2%
outliers [1]       : 28, 0.072%
low counts [2]     : 21313, 55%
(mean count < 16)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

In addition to the number of genes up- and down-regulated at the default threshold, the function also reports the number of genes that were tested (genes with non-zero total read count), and the number of genes not included in multiple test correction due to a low mean count.

2.7 Order the result table and save it

You can save the result table in a file. We will in the next session add “true” gene names to this file.

  • add the following chunk
## Convert matrix in table (keep rownames in a column gene_ID) and
res_tableOE_df <- res_tableOE_shrunken %>%
                  as_tibble(rownames = "gene_ID") 

## Order result by padj
res_tableOE_df <- res_tableOE_df %>%
                  arrange(padj)

#add columns to identify DE genes (up,down)
res_tableOE_df <- res_tableOE_df %>%
                  mutate(is_up   = ifelse(padj < 0.05 & log2FoldChange > 0, "yes","no"),
                         is_down = ifelse(padj < 0.05 & log2FoldChange < 0, "yes","no"))

#save the table in a tabulated files
write_tsv(res_tableOE_df, file = "results/DE_result_table_oe_vs_control.tsv")

2.8 Extracting significant differentially expressed genes

Let’s first create variables that contain our threshold criteria. We will only be using the adjusted p-values in our criteria:

### Set thresholds
padj.cutoff <- 0.05

We can subset that table to only keep the significant genes using our pre-defined thresholds using the filter() function.

# Subset the tibble to keep only significant genes
sigOE_df <- res_tableOE_df %>%
            filter(padj < padj.cutoff)
# Take a quick look at this tibble
sigOE_df

#save the table in a tabulated files
write_tsv(sigOE_df, file = "results/sign_DE_genes_oe_vs_control.tsv")

You can do also the same to subset up regulated genes

# Take a quick look at this tibble
sigOE_up_df <- sigOE_df %>%
               filter(log2FoldChange > 0)

sigOE_down_df <- sigOE_df %>%
                 filter(log2FoldChange < 0)

#Get the ID of up regulated genes
sigOE_up_gene_ID <- sigOE_up_df$gene_ID

#Get the ID of down regulated genes
sigOE_down_gene_ID <- sigOE_down_df$gene_ID
  • Build a new chunk from the previous code

Exercise

MOV10 Differential Expression Analysis: Control versus Knockdown

  1. Using the same p-adjusted threshold as above (padj.cutoff < 0.05), subset res_tableKD_shrunken to report the number of genes that are up- and down-regulated in Mov10_knockdown compared to control.

summary(res_tableKD_shrunken, alpha = 0.05)
  1. Order the result table and save it in a file “results/DE_result_table_kd_vs_control.tsv”

## Convert matrix in table (keep rownames in a column gene_ID) and
res_tableKD_df <- res_tableKD_shrunken %>%
                  as_tibble(rownames = "gene_ID") 

## Order result by padj
res_tableKD_df <- res_tableKD_df %>%
                  arrange(padj)


#add columns to identify DE genes (up,down)
res_tableKD_df <- res_tableKD_df %>%
                  mutate(is_up   = ifelse(padj < 0.05 & log2FoldChange > 0, "yes","no"),
                         is_down = ifelse(padj < 0.05 & log2FoldChange < 0, "yes","no"))

#save the table in a tabulated files
write_tsv(res_tableKD_df, file = "results/DE_result_table_kd_vs_control.tsv")
  1. Identify up/down regulated genes

### Set thresholds
padj.cutoff <- 0.05

# Subset the tibble to keep only significant genes
sigKD_df <- res_tableKD_df %>%
        filter(padj < padj.cutoff)

# Take a quick look at this tibble
sigKD_df

#save the table in a tabulated files
write_tsv(sigKD_df, file = "results/sign_DE_genes_KD_vs_control.tsv")

# Take a quick look at this tibble
sigKD_up_df <- sigKD_df %>%
        filter(log2FoldChange > 0)

sigKD_down_df <- sigKD_df %>%
        filter(log2FoldChange < 0)

#Get the ID of up regulated genes
sigKD_up_gene_ID <- sigKD_up_df$gene_ID

#Get the ID of down regulated genes
sigKD_down_gene_ID <- sigKD_down_df$gene_ID

Now that we have extracted the significant results, we are ready for visualization!

See you in the next session


These materials have been developed by members of BIBS team of the CIRI (https://ciri.ens-lyon.fr/). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

  • Materials used in these lessons were derived or adapted from work made available by the Harvard Chan Bioinformatics Core (HBC) (https://github.com/hbctraining) under the Creative Commons Attribution license (CC BY 4.0).

cc_by_sa

sources