Table filtering and joining

BIBS Workshop

Author

Carine Rey - BIBS team - CIRI

Published

Last session: June, 2023

Objectives
  • Learn how to import a table from a text file or a xlsx file in R

  • Learn how to filter a table according the content of a column

  • Learn how to write a table in a text file or a xlsx file

  • Learn how to reorganize or rename columns in a table

  • Learn how to sort a table according the content of a column

  • Learn how to join 2 tables according a common column (ex: gene ID)

  • Practice ! Practice ! and practice !

1 Get the mission

In this training, we will use the result of a differential expression analysis described in Kenny PJ et al, Cell Rep 2014.

The RNA-Seq was performed on HEK293F cells that were either transfected with a MOV10 transgene or non-specific (irrelevant) siRNA.

Using these data, the authors evaluate transcriptional patterns associated with perturbation of MOV10 expression. Please note that the irrelevant siRNA will be treated as our control condition.

Training files:

  • The complete result of the differential expression analysis, with gene identifiers, logfoldchange values and pvalue adjusted: here
  • The same file in xlsx format here
  • A table containing the name and the description of each human gene identifiers here
Objective

Obtain the list of the 10 most significantly up-regulated gene names in the decreasing order.

We will use a R markdown file to keep track of your code.

RMarkdown is a file format in its most basic form, that can eventually be converted into a shareable document, e.g HTML, PDF and many others. It allows you to document not just your R (Python and SQL) code, but also enables the inclusion of tables, figures, along with descriptive text. Thus resulting in a final document that has the methods, the code and interpretation of results all in a single document!

To put it simply, you write a file using the Markdown language and within it embed executable R code chunks. The code chunks are paired with knitr syntax, so that once your document is complete, you can easily convert it into one of several common formats (i.e. HTML, PDF, PPT) for sharing or documentation.

To go further you can read the first part of this document: here

Exercice
  • Create a working directory (ex: BIBS_workshop_June)
  • Create 2 directories, insides, data and results
  • Download these 3 training files in data
  • Create a new Rmd file to keep the track of your code:
    • On the left corner, click on File > New file > R markdown file
    • Define a title, for example : “BIBS workshop” and click on “Ok”
    • A template file will be open by Rstudio
    • Save it in this working directory as “BIBS_workshop.Rmd”
    • Click on “Knit

Now that your working directory is set up, we can start coding!

2 Import/read a table from a text file or a xlsx file in R

In this training, we will use a package called tidyverse which contain in fact a set of package containing a lot of useful function with simple syntax. If you do not have this package installed on your computer, you can install it with the following command:

#Test whether tidyverse is installed on your computer and install it if it isn't
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages('tidyverse')
} else {
  unloadNamespace("tidyverse")
}

One time, tidyverse is installed, we have to load it in our environment with the following command:

library(tidyverse)

To load xlsx file you have to use the openxlsx2::read_xlsx() function. This function came not from tidyvserse, we have next to convert the table in tibble, the tidyverse table format, using the function as_tibble().

gene_annotation <- openxlsx2::read_xlsx("data/human_geneIDs2geneNames.xlsx", sheet = 1) 
gene_annotation <- as_tibble(gene_annotation)
Warning

The numeric separator must be well defined (, or .) in your file according your default language so that numeric values are identified as numeric and not as character strings.

To load txt file, containing a table you can use the *read_delim()** function which is part of tidyverse, so you do not need to convert it in tibble.

#Don't run
test_table <- read_delim("data/test.tsv")

In programming languages, all variables are not equal. When you display a tibble you can see the type of a column. Here is a list of common variable types that you will encounter

  • int stands for integers.
  • dbl stands for doubles or real numbers.
  • chr stands for character vectors or strings.
  • dttm stands for date-times (a date + a time).
  • lgl stands for logical, vectors that contain only TRUE or FALSE.
  • fctr stands for factors, which R uses to represent categorical variables with fixed possible values.
  • date stands for dates.

You cannot add an int to a chr, but you can add an int to a dbl the results will be a dbl.

To display a tibble, you can just type its variable name. That will allow you to have information about the dimension of the table, the colnames and the type of each column.

gene_annotation

To display all column names, you can use the colnames() function:

colnames(gene_annotation)

You can also access to the dimension of your table with the dim() function:

dim(gene_annotation)

Or just the number of row of your table with the nrow() function:

nrow(gene_annotation)

For example, here the column “entrezgene_id” contains the NCBI identifier, which is in fact not numeric. You can convert this column to a character if required.

gene_annotation <- mutate(gene_annotation, entrezgene_id = as.character(entrezgene_id))
gene_annotation

You can convert a character column in numeric with the as.numeric() function.

mutate(gene_annotation, entrezgene_id = as.numeric(entrezgene_id))

Exercice

1- Import the table contain in “data/DE_complete_results.xlsx” in a variable called DE_table

Show the code
DE_table <- read_delim("data/DE_complete_results.tsv")

#OR

DE_table <- openxlsx2::read_xlsx("data/DE_complete_results.xlsx", sheet = 1) 
DE_table <- as_tibble(DE_table)


DE_table

Link to the starting Rmd file here

1- Count the number of lines and columns

Show the code
dim(DE_table)

3 Filter a table according the content of a column

The filter() function is used to subset a data frame, retaining all rows that satisfy your conditions.

You can use the relational operators (<,>,==,<=,>=,!=) to make a test on a column and keep rows for which the results is TRUE.

filter(gene_annotation, percentage_gene_gc_content >= 50)
filter(gene_annotation, chromosome_name ==  "X")
filter(gene_annotation, external_gene_name ==  "GAPDH")

The operator %in% is very useful to test if a value is in a list.

filter(gene_annotation, chromosome_name %in% c("1","MT"))

The function grepl() is also very useful to test if a pattern is in a column of string.

filter(gene_annotation, grepl("interferon", description))

dplyr functions never modify their inputs, so if you want to save the result, you’ll need to use the assignment operator, <-

gene_annotation_chr1and2_lowGC <- filter(gene_annotation, percentage_gene_gc_content < 33 & chromosome_name %in% c("1","2"))
gene_annotation_chr1and2_lowGC
Exercice

1- Create a new tibble with only significantly differential expressed genes (value in padj < 0.05)

Show the code
DE_table_sign <- filter(DE_table, padj < 0.05)
#OR
DE_table_sign <- DE_table %>% filter(padj < 0.05)

1- count the number of significantly differential expressed genes

Show the code
nrow(DE_table_sign)
#OR
dim(DE_table_sign)

1- Create a new tibble with only significantly differential expressed genes (value in padj < 0.05) and up-regulated (log2FoldChange > 0)

Show the code
DE_table_sign_up <- filter(DE_table, padj < 0.05 & log2FoldChange > 0)

#OR
DE_table_sign_up <- DE_table %>% filter( padj < 0.05 & log2FoldChange > 0)

1- Count the number of significantly differential expressed genes and up-regulated

Show the code
nrow(DE_table_sign_up)
#OR
dim(DE_table_sign_up)

1- Count the number of significantly differential expressed genes with a log2FoldChange superior to 2 in absolute

Show the code
DE_table_sign_down_lfcsup2 <- filter(DE_table, padj < 0.05 & log2FoldChange > 2)
DE_table_sign_down_lfcinf2 <- filter(DE_table, padj < 0.05 & log2FoldChange < -2)

nrow(DE_table_sign_down_lfcsup2) + nrow(DE_table_sign_down_lfcinf2)

#OR
filter(DE_table, padj < 0.05 & abs(log2FoldChange) > 2)

4 Ordering/Arranging a table according the content of a column

arrange() works similarly to filter() except that instead of selecting rows, it changes their order.

arrange(gene_annotation_chr1and2_lowGC, percentage_gene_gc_content)
arrange(gene_annotation_chr1and2_lowGC, desc(percentage_gene_gc_content))
arrange(gene_annotation_chr1and2_lowGC, chromosome_name,percentage_gene_gc_content)

As for filter, if you want to save the result, you’ll need to use the assignment operator*, <-

gene_annotation_chr1and2_lowGC <-arrange(gene_annotation_chr1and2_lowGC, chromosome_name,percentage_gene_gc_content)
Exercice

1- Sort table from most to least significant genes

Show the code
DE_table_sign <- arrange(DE_table_sign, padj)

#OR
DE_table_sign <- DE_table_sign %>% arrange(padj)


#to visualize the fisrt lines : 
head(DE_table_sign)

1- Count the number of significantly differential expressed genes with a log2FoldChange superior to 2 in absolute

Show the code
DE_table_sign_up <- filter(DE_table, padj < 0.05 & log2FoldChange > 0)

#OR
DE_table_sign_up <- DE_table %>% filter( padj < 0.05 & log2FoldChange > 0)

1- count the number of significantly differential expressed genes and up-regulated

Show the code
nrow(DE_table_sign_up)
#OR
dim(DE_table_sign_up)

5 Write a table in a text file or a xlsx file

To save your tibble as a text file (which can be opened in Excel), use the following command:

write_tsv(gene_annotation_chr1and2_lowGC, file = "results/gene_annotation_chr1and2_lowGC.tsv")

To save a tibble directly in a xlsx file, you will use the write_xlsx() function of the package openxlsx2 :

openxlsx2::write_xlsx(gene_annotation_chr1and2_lowGC, file = "results/gene_annotation_chr1and2_lowGC.xlsx")

If you have an error, check that the results directory exist in your working directory.

To go further

Take a look to the BIBS workshop about R <-> Excel format file: https://rpubs.com/EnsBioStat/1022687 to know how to also add formatting to your excel file.

Exercice

1- Write significantly up regulated genes in a file, and check it

Show the code
DE_table_sign_up <- filter(DE_table, padj < 0.05 & log2FoldChange > 0)


openxlsx2::write_xlsx(DE_table_sign_up, file = "results/DE_table_sign_up.xlsx")
#OR
write_tsv(DE_table_sign_up, file = "results/DE_table_sign_up.tsv")

1- Write significantly down regulated genes in a file, ranking them from the most important logfoldchange to the weakest, and check it.

Show the code
DE_table_sign_down <- filter(DE_table, padj < 0.05 & log2FoldChange < 0) %>%
  arrange(log2FoldChange)

openxlsx2::write_xlsx(DE_table_sign_down, file = "results/DE_table_sign_down.xlsx")
#OR
write_tsv(DE_table_sign_down, file = "results/DE_table_sign_down.tsv")

6 Reorganize or Rename columns in a table

You may want to select only some columns before writing your file. For that you will use the select() function.

select(gene_annotation_chr1and2_lowGC,
       c(ensembl_gene_id,
         external_gene_name,
         description
       ))

You can also remove column using the - sign before the second argument:

select(gene_annotation_chr1and2_lowGC,
       -c(gene_biotype,
          percentage_gene_gc_content,
          chromosome_name
       ))

As for filter and arrange, if you want to save the result, you’ll need to use the assignment operator*, <-

gene_annotation_chr1and2_lowGC_ok <- select(gene_annotation_chr1and2_lowGC,
                                            c(ensembl_gene_id,
                                              external_gene_name,
                                              description,
                                              gene_biotype
                                            ))

You may want to change the order of the column before writing your file. For that you will use the relocate() function.

gene_annotation_chr1and2_lowGC_ok #original order: ensembl_gene_id | external_gene_name | description | gene_biotype

relocate(gene_annotation_chr1and2_lowGC_ok, external_gene_name) #new order: external_gene_name | ensembl_gene_id |  description | gene_biotype

relocate(gene_annotation_chr1and2_lowGC_ok, external_gene_name, .after=last_col()) #new order: ensembl_gene_id | description | gene_biotype | external_gene_name

relocate(gene_annotation_chr1and2_lowGC_ok, gene_biotype, .before=description) #new order: ensembl_gene_id | external_gene_name | gene_biotype | description 

Finally, You may want to change the name of a column. For that you will use the rename() function.

gene_annotation_chr1and2_lowGC_ok #original name: ensembl_gene_id | external_gene_name | description | gene_biotype

rename(gene_annotation_chr1and2_lowGC_ok, human_gene_name = external_gene_name) #new name: ensembl_gene_id | human_gene_name |  description | gene_biotype

As for previous functions, if you want to save the result, you’ll need to use the assignment operator*, <-

gene_annotation_chr1and2_lowGC_ok <- relocate(gene_annotation_chr1and2_lowGC_ok, gene_biotype, .before=description)
write_tsv(gene_annotation_chr1and2_lowGC_ok, file = "results/gene_annotation_chr1and2_lowGC.tsv")
Exercice

1- In the DE_table, select only the gene_ID, baseMean,log2FoldChange, padj columns and save it in a new variable called DE_table_simplified

Show the code
DE_table_simplified <- select(DE_table, gene_ID, baseMean,log2FoldChange, padj)
DE_table_simplified

1- Rename the column baseMean by “GlobalMeanExpression

Show the code
DE_table_simplified <- rename(DE_table_simplified, GlobalMeanExpression = baseMean)
DE_table_simplified

1- Relocate the GlobalMeanExpression column at the end of the tibble

Show the code
DE_table_simplified <- relocate(DE_table_simplified, GlobalMeanExpression, .after=last_col())
DE_table_simplified

7 Joining 2 tables according a common column

To avoid having a huge table and to save space, information is often split between different tables.

For example, here we have the DE genes table on one side and the gene annotation on the other.

If we want to know the name of the DE genes we need to join the 2 columns.

Depending on how you want to join your tables, you’ll need several functions. For this, we’ll use functions from the “join” family.

  • A inner_join() only keeps observations from x that have a matching key in y.

  • A left_join() keeps all observations in x.

  • A right_join() keeps all observations in y.

  • A full_join() keeps all observations in x and y.

Small concrete examples:

By default, the *join** functions use all variables that appear in both tables as key columns, the so called natural join.

If you want to join by data that are in two columns with different names, you must specify the correspondence with a named character vector: by = c(“a” = “b”). This will match variable a in table x to variable b in table y.

When you start programming, I recommend that you systematically specify the column on which you want to join your 2 tables.

For example, if we want to add the gene name of each gene of the DE_table using the gene_annotation table, you need to use a left join using the gene_ID column for the DE_table and the ensembl_gene_id for the gene_annotation as key columns:

left_join(DE_table, gene_annotation, by= c("gene_ID"="ensembl_gene_id"))

If 2 columns have identical names in the input tables but are not used in the join, they are automatically renamed with the suffix .x and .y because all column names must be different in the output table.

Exercice

Now, we have all the keys to achieve your mission: Obtain the list of the 10 most significantly up-regulated gene names in the decreasing order.

1 - join the DE_table and the gene_annotation tibbles. You can simplify each tibble by keeping only columns of interest.

Show the code
DE_table_simplified <- select(DE_table, gene_ID, baseMean,log2FoldChange, padj)
gene_annotation_simplified <- select(gene_annotation, ensembl_gene_id, external_gene_name)

cross_DE_table_gene_annotation_simplified <- left_join(DE_table_simplified,gene_annotation_simplified, by= c("gene_ID"="ensembl_gene_id"))

1- Check that you keep the same number of rows in the merged table.

Show the code
dim(DE_table_simplified)
dim(cross_DE_table_gene_annotation_simplified)

1- order the merged table by padj.

Show the code
cross_DE_table_gene_annotation_simplified <- arrange(cross_DE_table_gene_annotation_simplified,padj)

1- Keep only the first 10 rows with the head function -> head(tibble, 10)

Show the code
cross_DE_table_gene_annotation_simplified_top10 <- head(cross_DE_table_gene_annotation_simplified,10)

1- Write the result in a file

Show the code
openxlsx2::write_xlsx(cross_DE_table_gene_annotation_simplified_top10, file = "results/cross_DE_table_gene_annotation_simplified_top10.xlsx")

2- New challenges:

  • extract the top 15 genes with the highest expression (baseMean column)
  • extract “interferon” genes

8 Time to work on your data


These materials have been developed by members of BIBS team of the CIRI (https://ciri.ens-lyon.fr/). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

  • Some materials used in these lessons were derived or adapted from work made available by the Harvard Chan Bioinformatics Core (HBC) (https://github.com/hbctraining) under the Creative Commons Attribution license (CC BY 4.0).