#Test whether tidyverse is installed on your computer and install it if it isn't
if (!requireNamespace("tidyverse", quietly = TRUE)) {
install.packages('tidyverse')
else {
} unloadNamespace("tidyverse")
}
Table filtering and joining
BIBS Workshop
1 Get the mission
In this training, we will use the result of a differential expression analysis described in Kenny PJ et al, Cell Rep 2014.
The RNA-Seq was performed on HEK293F cells that were either transfected with a MOV10 transgene or non-specific (irrelevant) siRNA.
Using these data, the authors evaluate transcriptional patterns associated with perturbation of MOV10 expression. Please note that the irrelevant siRNA will be treated as our control condition.
Training files:
- The complete result of the differential expression analysis, with gene identifiers, logfoldchange values and pvalue adjusted: here
- The same file in xlsx format here
- A table containing the name and the description of each human gene identifiers here
We will use a R markdown file to keep track of your code.
RMarkdown is a file format in its most basic form, that can eventually be converted into a shareable document, e.g HTML, PDF and many others. It allows you to document not just your R (Python and SQL) code, but also enables the inclusion of tables, figures, along with descriptive text. Thus resulting in a final document that has the methods, the code and interpretation of results all in a single document!
To put it simply, you write a file using the Markdown language and within it embed executable R code chunks. The code chunks are paired with knitr syntax, so that once your document is complete, you can easily convert it into one of several common formats (i.e. HTML, PDF, PPT) for sharing or documentation.
To go further you can read the first part of this document: here
2 Import/read a table from a text file or a xlsx file in R
In this training, we will use a package called tidyverse which contain in fact a set of package containing a lot of useful function with simple syntax. If you do not have this package installed on your computer, you can install it with the following command:
One time, tidyverse is installed, we have to load it in our environment with the following command:
library(tidyverse)
To load xlsx file you have to use the openxlsx2::read_xlsx() function. This function came not from tidyvserse, we have next to convert the table in tibble, the tidyverse table format, using the function as_tibble().
<- openxlsx2::read_xlsx("data/human_geneIDs2geneNames.xlsx", sheet = 1)
gene_annotation <- as_tibble(gene_annotation) gene_annotation
To load txt file, containing a table you can use the *read_delim()** function which is part of tidyverse, so you do not need to convert it in tibble.
#Don't run
<- read_delim("data/test.tsv") test_table
In programming languages, all variables are not equal. When you display a tibble you can see the type of a column. Here is a list of common variable types that you will encounter
- int stands for integers.
- dbl stands for doubles or real numbers.
- chr stands for character vectors or strings.
- dttm stands for date-times (a date + a time).
- lgl stands for logical, vectors that contain only TRUE or FALSE.
- fctr stands for factors, which R uses to represent categorical variables with fixed possible values.
- date stands for dates.
You cannot add an int to a chr, but you can add an int to a dbl the results will be a dbl.
To display a tibble, you can just type its variable name. That will allow you to have information about the dimension of the table, the colnames and the type of each column.
gene_annotation
To display all column names, you can use the colnames() function:
colnames(gene_annotation)
You can also access to the dimension of your table with the dim() function:
dim(gene_annotation)
Or just the number of row of your table with the nrow() function:
nrow(gene_annotation)
For example, here the column “entrezgene_id” contains the NCBI identifier, which is in fact not numeric. You can convert this column to a character if required.
<- mutate(gene_annotation, entrezgene_id = as.character(entrezgene_id))
gene_annotation gene_annotation
You can convert a character column in numeric with the as.numeric() function.
mutate(gene_annotation, entrezgene_id = as.numeric(entrezgene_id))
3 Filter a table according the content of a column
The filter() function is used to subset a data frame, retaining all rows that satisfy your conditions.
You can use the relational operators (<,>,==,<=,>=,!=) to make a test on a column and keep rows for which the results is TRUE.
filter(gene_annotation, percentage_gene_gc_content >= 50)
filter(gene_annotation, chromosome_name == "X")
filter(gene_annotation, external_gene_name == "GAPDH")
The operator %in% is very useful to test if a value is in a list.
filter(gene_annotation, chromosome_name %in% c("1","MT"))
The function grepl() is also very useful to test if a pattern is in a column of string.
filter(gene_annotation, grepl("interferon", description))
dplyr functions never modify their inputs, so if you want to save the result, you’ll need to use the assignment operator, <-
<- filter(gene_annotation, percentage_gene_gc_content < 33 & chromosome_name %in% c("1","2"))
gene_annotation_chr1and2_lowGC gene_annotation_chr1and2_lowGC
4 Ordering/Arranging a table according the content of a column
arrange() works similarly to filter() except that instead of selecting rows, it changes their order.
arrange(gene_annotation_chr1and2_lowGC, percentage_gene_gc_content)
arrange(gene_annotation_chr1and2_lowGC, desc(percentage_gene_gc_content))
arrange(gene_annotation_chr1and2_lowGC, chromosome_name,percentage_gene_gc_content)
As for filter, if you want to save the result, you’ll need to use the assignment operator*, <-
<-arrange(gene_annotation_chr1and2_lowGC, chromosome_name,percentage_gene_gc_content) gene_annotation_chr1and2_lowGC
5 Write a table in a text file or a xlsx file
To save your tibble as a text file (which can be opened in Excel), use the following command:
write_tsv(gene_annotation_chr1and2_lowGC, file = "results/gene_annotation_chr1and2_lowGC.tsv")
To save a tibble directly in a xlsx file, you will use the write_xlsx() function of the package openxlsx2 :
::write_xlsx(gene_annotation_chr1and2_lowGC, file = "results/gene_annotation_chr1and2_lowGC.xlsx") openxlsx2
If you have an error, check that the results directory exist in your working directory.
6 Reorganize or Rename columns in a table
You may want to select only some columns before writing your file. For that you will use the select() function.
select(gene_annotation_chr1and2_lowGC,
c(ensembl_gene_id,
external_gene_name,
description ))
You can also remove column using the - sign before the second argument:
select(gene_annotation_chr1and2_lowGC,
-c(gene_biotype,
percentage_gene_gc_content,
chromosome_name ))
As for filter and arrange, if you want to save the result, you’ll need to use the assignment operator*, <-
<- select(gene_annotation_chr1and2_lowGC,
gene_annotation_chr1and2_lowGC_ok c(ensembl_gene_id,
external_gene_name,
description,
gene_biotype ))
You may want to change the order of the column before writing your file. For that you will use the relocate() function.
#original order: ensembl_gene_id | external_gene_name | description | gene_biotype
gene_annotation_chr1and2_lowGC_ok
relocate(gene_annotation_chr1and2_lowGC_ok, external_gene_name) #new order: external_gene_name | ensembl_gene_id | description | gene_biotype
relocate(gene_annotation_chr1and2_lowGC_ok, external_gene_name, .after=last_col()) #new order: ensembl_gene_id | description | gene_biotype | external_gene_name
relocate(gene_annotation_chr1and2_lowGC_ok, gene_biotype, .before=description) #new order: ensembl_gene_id | external_gene_name | gene_biotype | description
Finally, You may want to change the name of a column. For that you will use the rename() function.
#original name: ensembl_gene_id | external_gene_name | description | gene_biotype
gene_annotation_chr1and2_lowGC_ok
rename(gene_annotation_chr1and2_lowGC_ok, human_gene_name = external_gene_name) #new name: ensembl_gene_id | human_gene_name | description | gene_biotype
As for previous functions, if you want to save the result, you’ll need to use the assignment operator*, <-
<- relocate(gene_annotation_chr1and2_lowGC_ok, gene_biotype, .before=description)
gene_annotation_chr1and2_lowGC_ok write_tsv(gene_annotation_chr1and2_lowGC_ok, file = "results/gene_annotation_chr1and2_lowGC.tsv")
7 Joining 2 tables according a common column
To avoid having a huge table and to save space, information is often split between different tables.
For example, here we have the DE genes table on one side and the gene annotation on the other.
If we want to know the name of the DE genes we need to join the 2 columns.
Depending on how you want to join your tables, you’ll need several functions. For this, we’ll use functions from the “join” family.
A inner_join() only keeps observations from x that have a matching key in y.
A left_join() keeps all observations in x.
A right_join() keeps all observations in y.
A full_join() keeps all observations in x and y.
Small concrete examples:
By default, the *join** functions use all variables that appear in both tables as key columns, the so called natural join.
If you want to join by data that are in two columns with different names, you must specify the correspondence with a named character vector: by = c(“a” = “b”). This will match variable a in table x to variable b in table y.
When you start programming, I recommend that you systematically specify the column on which you want to join your 2 tables.
For example, if we want to add the gene name of each gene of the DE_table using the gene_annotation table, you need to use a left join using the gene_ID column for the DE_table and the ensembl_gene_id for the gene_annotation as key columns:
left_join(DE_table, gene_annotation, by= c("gene_ID"="ensembl_gene_id"))
If 2 columns have identical names in the input tables but are not used in the join, they are automatically renamed with the suffix .x and .y because all column names must be different in the output table.
8 Time to work on your data
These materials have been developed by members of BIBS team of the CIRI (https://ciri.ens-lyon.fr/). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
- Some materials used in these lessons were derived or adapted from work made available by the Harvard Chan Bioinformatics Core (HBC) (https://github.com/hbctraining) under the Creative Commons Attribution license (CC BY 4.0).