Learning objectives

  • Learn how to quantify your libraries
  • Learn how to use a new bioinformatic tool

1 Aligning reads: tools and theory

Now, your data are “clean”, we can map or aligne reads to the reference genome to determine from where these sequences originated and thus build a count table.

There are two primary complementary approaches to achieve this:

  • Genome mapping : You map your reads onto the genome and then count how many reads map on each gene. This was the former “standard” approach for quantifying gene expression.
  • Transcriptome mapping: You map your reads directly onto the transcriptome. This is the current standard for quantifying gene expression but you don’t know where reads map on the genome.

Depending on your goals, you may use one or both approaches.

source

1.1 Genome mapping

The goal is to determine the origin of each read in the genome.

1.1.1 Splice aware alignement

To accomplish this, we will use tools as STAR or HISAT2 to align each read to the genome and find the better match.

You must take care to use splice aware aligner, as intronic sequences should not be in your reads.

In these tools, complex algorithms are employed to optimize the search time of the best matches. By adjusting parameters, you can fine-tune the sensitivity threshold according to your specific question.

However, this approach is time-consuming due to various challenges:

Non-comprehensive list of challenges

  • Large, incomplete and repetitive genomes
  • Short reads: 50-150 bp
    • Non-unique alignment
    • Sensitive to non-exact matching (variants, sequencing errors)
  • Massive number of short reads
  • Small insert size: 200-500 bp libraries
  • Computational capacity for efficient base-to-base mapping

1.1.2 Input files: Fastq files and Reference genome data files

For input files, you’ll need the Fastq files of your samples and your reference genome.

Reference data files

  • Genome assembly: Provides the nucleotide sequence of the reference genome.

    • It is in the form of a Fasta file. (A FASTA file is a text file, where each sequence begins with a single-line description, followed by lines of sequence data.)

    • Genome build/release represents a version of the genome with improvements (i.e. gaps filled, mistakes corrected).

  • Transcriptome: Provides the complete set of transcripts

  • It is in the form of a GTF (gene transfer format) file. (A GTF file is a simple tab-delimited text file for describing genes structure.)

Reference data versions matter !

Make sure that all reference files used in an analysis are matched (i.e. genome file (Fasta), transcriptome file (Fasta, gtf))

  • Same build version
  • Same source (eg. both from Ensembl)

1.2 Transcriptome mapping

For transcriptome mapping, such as with Salmon, we use the reference transcriptome (in FASTA format) and raw sequencing reads (in FASTQ format) as input to perform both mapping and quantification of the reads.

The “quasi-mapping” approach used by Salmon requires a transcriptome reference index to determine the position and orientation information for where the fragments best map prior to quantification [2]. The reference index essentially provides the transcriptome in a format that is easily and rapidly searchable. Therefore, it will allow us to quickly find the positions in the transcriptome where each of the reads originated.

NOTE: Since we are searching against only the transcripts (transcriptome), Salmon would not be the appropriate tool to use if trying to detect novel genes or isoforms, intron retention events, or other methods that require a well annotated whole genome instead of only the transcriptome.

1.2.1 Why use lightweight alignment?

These approaches avoid base-to-base alignment:

  • Faster, more efficient (~ >20x faster than alignment-based)
  • Improved accuracy for transcript-level quantification
  • Improvements in accuracy for gene-level quantification ref
  • Tools include: Kallisto (quasi-aligner), Sailfish (kmer-based),Salmon (quasi-aligner), RSEM

1.2.2 Input files: Fastq files and Transcriptome reference file

The transcriptome reference file is a fasta file containing each (known) transcript sequences.

Mapping results are only as good as the quality of the reference transcriptome

  • If one does not exist, it can be created using coordinates from a GTF file and the genome sequence file
  • Reference data versions matter!
  • Stay consistent with the source and builds/releases being used

For this training, we will use the transcriptome mapping approach for all samples, and we may try the genome mapping approach on one sample.

2 Get transcriptomic or genomic reference data

Therefore, we need a reference transcriptome (in FASTA format) to align our sequences.

To download reference data, there are a few different sources available:

  • General biological databases: Ensembl, NCBI, and UCSC
  • Organism-specific biological databases: Wormbase, Flybase, etc. (often updated more frequently, so may be more comprehensive)
  • Reference data collections: Illumina’s iGenomes, providing access genome reference data from Ensembl, UCSC and NCBI in one location
  • Local access: shared databases on the FAS-RC cluster or HMS-RC’s O2 cluster with access to genome reference data from Ensembl, UCSC and NCBI

*Note that these reference data sources are relevant to most types of genomic analyses, not just differential expression analyses.

2.1 General biological databases

Biological databases for gene expression data store genome assemblies and provide annotations regarding where the genes, transcripts, and other genomic features are located on the genome.

Genome assemblies provide the nucleotide sequence of the reference genome. Although the Human Genome Project was “completed” in 2003, small gaps in the sequence remained (estimated 1% of gene-containing portions). As technology improves and more genomes are sequenced, these gaps are filled, mistakes are corrected and alternate alleles are provided. Therefore, every several years a new genome build is released that contains these improvements.

The current genome build is GRCh38/hg38 for the human, which was released in 2013 and is maintained by the Genome Reference Consortium (GRC). Usually the biological databases will include the updated versions as soon as they are stably released, in addition to access to archived versions.

Genome databases incorporate these genomes and generate the gene annotations with the following similarities/differences:

  • Ensembl, NCBI, and UCSC all use the same genome assemblies or builds provided by the GRC

    • GRCh38 = hg38; GRCh37 = hg19
    • Patches or minor revisions of the genome, which don’t change the genome coordinates, are frequently provided by the GRC. Each database makes the patches available for users at different intervals. If the user applies the patches, the genome reference sequence may differ between databases.
      • GRCh38p1 != GRCh38p2
  • Each biological database independently determines the gene annotations; therefore, gene annotations between these databases can differ, even though the genome assembly is the same. Naming conventions are also different (chr1=1) between databases.

  • Always use the same biological database for all reference data!

2.1.1 Ensembl

Ensembl provides a website that acts as a single point of access to annotated genomes for vertebrate species. For all other organisms there are additional Ensembl databases available through Ensembl Genomes; however, they do not include viruses (NCBI does).

  • Genome assemblies/builds (reference genomes)
    • New genome builds are released every several years or more depending on the species
    • Genome assemblies are updated every two years to include patches, or less often depending on the species
  • Gene annotations
    • Gene annotations are created or updated using a variety of sources (ENA, UniProtKB, NCBI RefSeq, RFAM, miRBase, and tRNAscan-SE databases)
    • Automatic annotation is performed for all species using identified proteins and transcripts
    • Manual curation by the HAVANA group is performed for human, mouse, zebrafish, and rat species, providing better confidence of transcript annotations
    • Directly imports annotations from FlyBase, WormBase and SGD

2.1.2 Using the Ensembl genomic database and genome browser

Navigate to the Ensembl website to view the interface. The homepage for Ensembl has a lot to offer, with the a lot of information and access to a range of functionality and tools.

  • Searching Ensembl: Look for a gene, location, variant and more using the search box on the homepage or the box that is provided in the top right corner of any Ensembl page.

    • a gene name (for example, BRCA2) - best to use the official gene symbols (HGNC)
    • a UniProt accession number (for example, P51587)
    • a disease name (for example, coronary heart disease)
    • a variation (for example, rs1223)
    • a location - a genomic region (for example, rat X:100000..200000)
    • a PDB ID or a Gene Ontology (GO) term

    Most search results will take you to the appropriate Ensembl view through a results page. These linked pages will allow you to download information/sequences for specific genes/transcripts/exons/variants. If you search using a location you will be directed straight to the location tab (this tab provides a view of a region of a genome).

  • Browse a Genome: Choose your species of interest in this section. The drop-down menu under ‘All genomes’ allows you to select from the full list. The Ensembl Pre! site contains new genomes (either new species to Ensembl or updates in the reference assembly) that do not yet have an Ensembl gene set. BLAST/BLAT is available for organisms in all Ensembl sites, including Pre!

  • Help: There is a wealth of help and documentation in Ensembl if you are new to the browser. Video tutorials are provided and printable pdfs with exercises. Custom data may be uploaded to Ensembl or displayed directly by attaching a file by URL.

  • News: To find out what genome build and release you are working with, have a look at the news section of the homepage. If the current release is not the one you need, access archive sites to access previous versions, or releases, of Ensembl using the link on the lower right side.

While we are not going to explore the Ensembl database in this workshop, we have materials available if you wish to explore on your own.

Ensembl identifiers

When using Ensembl, note that it uses the following format for biological identifiers:

  • ENSG###########: Ensembl Gene ID
  • ENST###########: Ensembl Transcript ID
  • ENSP###########: Ensembl Peptide ID
  • ENSE###########: Ensembl Exon ID

For non-human species a suffix is added:

  • ENSMUSG###: MUS (Mus musculus) for mouse
  • ENSDARG###: DAR (Danio rerio) for zebrafish

source

3 Finding and accessing reference data for the training dataset on Ensembl

We will download the human transcriptome from Ensembl site.

Click on “human” to have access to human genomic data (https://www.ensembl.org/Homo_sapiens/Info/Index)

Then, on the right, click on Download fasta in genome annotation.

Click on cdna and then “right” click to copy the link to the file.

We will use this link to download the file from your VM using the wget command.

Script: scripts/06_download_human_transcriptome.sh
#!/bin/bash
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project

mkdir -p 01_toy_data/reference_transcriptome

cd 01_toy_data/reference_transcriptome

# Download from the FTP server
wget https://ftp.ensembl.org/pub/release-111/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz

# Decompress the FASTA file
gzip -d Homo_sapiens.GRCh38.cdna.all.fa.gz

Exercices

  1. Build the script and execute it

Terminal
cd  /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/
bash scripts/06_download_human_transcriptome.sh
  1. Look to the first lines of the transcriptome

Terminal
cd  /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/01_toy_data/reference_transcriptome
ls
head Homo_sapiens.GRCh38.cdna.all.fa
  1. Count the number of sequences of the transcriptome

Terminal

cd  /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/01_toy_data/reference_transcriptome

grep ">" Homo_sapiens.GRCh38.cdna.all.fa | wc -l
  1. Display the first 100 sequence names and 100 last sequence names.

Terminal
cd  /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/01_toy_data/reference_transcriptome

grep ">" Homo_sapiens.GRCh38.cdna.all.fa | head -n 100
grep ">" Homo_sapiens.GRCh38.cdna.all.fa | tail -n 100
  1. Check your favorite gene (ex:IFNLR1)

Terminal
cd  /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/01_toy_data/reference_transcriptome
grep "IFNLR1" Homo_sapiens.GRCh38.cdna.all.fa 

If you are interested in non-coding RNA, you have to had them to your transcriptome.

Script: scripts/06_download_human_transcriptome_cdna_ncrna.sh
#!/bin/bash
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project

mkdir -p 01_toy_data/reference_transcriptome

cd 01_toy_data/reference_transcriptome

# Download from the FTP server
wget https://ftp.ensembl.org/pub/release-111/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz
wget https://ftp.ensembl.org/pub/release-111/fasta/homo_sapiens/ncrna/Homo_sapiens.GRCh38.ncrna.fa.gz

cat Homo_sapiens.GRCh38.cdna.all.fa.gz Homo_sapiens.GRCh38.ncrna.fa.gz > Homo_sapiens.GRCh38.cdna_ncrna.all.fa.gz

# Decompress the FASTA file
gzip -d Homo_sapiens.GRCh38.cdna_ncrna.all.fa.gz

4 Finding and accessing reference data for your project

Now, you must do the same thing for your project.

Exercices

  1. Find the transcriptome of your reference species in a public database

  2. Find and copy the link to the fasta file

  3. Build your script

  4. Execute it

See you in the next session


These materials have been developed by members of BIBS team of the CIRI (https://ciri.ens-lyon.fr/). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

  • Some materials used in these lessons were derived or adapted from work made available by the Harvard Chan Bioinformatics Core (HBC) (https://github.com/hbctraining) under the Creative Commons Attribution license (CC BY 4.0).

cc_by_sa

sources