Learning objectives
Now, your data are “clean”, we can map or aligne reads to the reference genome to determine from where these sequences originated and thus build a count table.
There are two primary complementary approaches to achieve this:
Depending on your goals, you may use one or both approaches.
The goal is to determine the origin of each read in the genome.
To accomplish this, we will use tools as STAR or HISAT2 to align each read to the genome and find the better match.
You must take care to use splice aware aligner, as intronic sequences should not be in your reads.
In these tools, complex algorithms are employed to optimize the search time of the best matches. By adjusting parameters, you can fine-tune the sensitivity threshold according to your specific question.
However, this approach is time-consuming due to various challenges:
Non-comprehensive list of challenges
For input files, you’ll need the Fastq files of your samples and your reference genome.
Reference data files
Genome assembly: Provides the nucleotide sequence of the reference genome.
It is in the form of a Fasta file. (A FASTA file is a text file, where each sequence begins with a single-line description, followed by lines of sequence data.)
Genome build/release represents a version of the genome with improvements (i.e. gaps filled, mistakes corrected).
Transcriptome: Provides the complete set of transcripts
It is in the form of a GTF (gene transfer format) file. (A GTF file is a simple tab-delimited text file for describing genes structure.)
Reference data versions matter !
Make sure that all reference files used in an analysis are matched (i.e. genome file (Fasta), transcriptome file (Fasta, gtf))
For transcriptome mapping, such as with Salmon, we use the reference transcriptome (in FASTA format) and raw sequencing reads (in FASTQ format) as input to perform both mapping and quantification of the reads.
The “quasi-mapping” approach used by Salmon requires a transcriptome reference index to determine the position and orientation information for where the fragments best map prior to quantification [2]. The reference index essentially provides the transcriptome in a format that is easily and rapidly searchable. Therefore, it will allow us to quickly find the positions in the transcriptome where each of the reads originated.
NOTE: Since we are searching against only the transcripts (transcriptome), Salmon would not be the appropriate tool to use if trying to detect novel genes or isoforms, intron retention events, or other methods that require a well annotated whole genome instead of only the transcriptome.
These approaches avoid base-to-base alignment:
The transcriptome reference file is a fasta file containing each (known) transcript sequences.
Mapping results are only as good as the quality of the reference transcriptome
For this training, we will use the transcriptome mapping approach for all samples, and we may try the genome mapping approach on one sample.
Therefore, we need a reference transcriptome (in FASTA format) to align our sequences.
To download reference data, there are a few different sources available:
*Note that these reference data sources are relevant to most types of genomic analyses, not just differential expression analyses.
Biological databases for gene expression data store genome assemblies and provide annotations regarding where the genes, transcripts, and other genomic features are located on the genome.
Genome assemblies provide the nucleotide sequence of the reference genome. Although the Human Genome Project was “completed” in 2003, small gaps in the sequence remained (estimated 1% of gene-containing portions). As technology improves and more genomes are sequenced, these gaps are filled, mistakes are corrected and alternate alleles are provided. Therefore, every several years a new genome build is released that contains these improvements.
The current genome build is GRCh38/hg38 for the human, which was released in 2013 and is maintained by the Genome Reference Consortium (GRC). Usually the biological databases will include the updated versions as soon as they are stably released, in addition to access to archived versions.
Genome databases incorporate these genomes and generate the gene annotations with the following similarities/differences:
Ensembl, NCBI, and UCSC all use the same genome assemblies or builds provided by the GRC
Each biological database independently determines the gene annotations; therefore, gene annotations between these databases can differ, even though the genome assembly is the same. Naming conventions are also different (chr1=1) between databases.
Always use the same biological database for all reference data!
Ensembl provides a website that acts as a single point of access to annotated genomes for vertebrate species. For all other organisms there are additional Ensembl databases available through Ensembl Genomes; however, they do not include viruses (NCBI does).
Navigate to the Ensembl website to view the interface. The homepage for Ensembl has a lot to offer, with the a lot of information and access to a range of functionality and tools.
Searching Ensembl: Look for a gene, location, variant and more using the search box on the homepage or the box that is provided in the top right corner of any Ensembl page.
Most search results will take you to the appropriate Ensembl view through a results page. These linked pages will allow you to download information/sequences for specific genes/transcripts/exons/variants. If you search using a location you will be directed straight to the location tab (this tab provides a view of a region of a genome).
Browse a Genome: Choose your species of interest in this section. The drop-down menu under ‘All genomes’ allows you to select from the full list. The Ensembl Pre! site contains new genomes (either new species to Ensembl or updates in the reference assembly) that do not yet have an Ensembl gene set. BLAST/BLAT is available for organisms in all Ensembl sites, including Pre!
Help: There is a wealth of help and documentation in Ensembl if you are new to the browser. Video tutorials are provided and printable pdfs with exercises. Custom data may be uploaded to Ensembl or displayed directly by attaching a file by URL.
News: To find out what genome build and release you are working with, have a look at the news section of the homepage. If the current release is not the one you need, access archive sites to access previous versions, or releases, of Ensembl using the link on the lower right side.
While we are not going to explore the Ensembl database in this workshop, we have materials available if you wish to explore on your own.
Ensembl identifiers
When using Ensembl, note that it uses the following format for biological identifiers:
For non-human species a suffix is added:
We will download the human transcriptome from Ensembl site.
Click on “human” to have access to human genomic data (https://www.ensembl.org/Homo_sapiens/Info/Index)
Then, on the right, click on Download fasta in genome annotation.
Click on cdna and then “right” click to copy the link to the file.
We will use this link to download the file from your VM using the wget
command.
#!/bin/bash
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project
mkdir -p 01_toy_data/reference_transcriptome
cd 01_toy_data/reference_transcriptome
# Download from the FTP server
wget https://ftp.ensembl.org/pub/release-111/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz
# Decompress the FASTA file
gzip -d Homo_sapiens.GRCh38.cdna.all.fa.gz
Exercices
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/
bash scripts/06_download_human_transcriptome.sh
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/01_toy_data/reference_transcriptome
ls
head Homo_sapiens.GRCh38.cdna.all.fa
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/01_toy_data/reference_transcriptome
grep ">" Homo_sapiens.GRCh38.cdna.all.fa | wc -l
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/01_toy_data/reference_transcriptome
grep ">" Homo_sapiens.GRCh38.cdna.all.fa | head -n 100
grep ">" Homo_sapiens.GRCh38.cdna.all.fa | tail -n 100
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/01_toy_data/reference_transcriptome
grep "IFNLR1" Homo_sapiens.GRCh38.cdna.all.fa
If you are interested in non-coding RNA, you have to had them to your transcriptome.
#!/bin/bash
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project
mkdir -p 01_toy_data/reference_transcriptome
cd 01_toy_data/reference_transcriptome
# Download from the FTP server
wget https://ftp.ensembl.org/pub/release-111/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz
wget https://ftp.ensembl.org/pub/release-111/fasta/homo_sapiens/ncrna/Homo_sapiens.GRCh38.ncrna.fa.gz
cat Homo_sapiens.GRCh38.cdna.all.fa.gz Homo_sapiens.GRCh38.ncrna.fa.gz > Homo_sapiens.GRCh38.cdna_ncrna.all.fa.gz
# Decompress the FASTA file
gzip -d Homo_sapiens.GRCh38.cdna_ncrna.all.fa.gz
Now, you must do the same thing for your project.
Exercices
Find the transcriptome of your reference species in a public database
Find and copy the link to the fasta file
Build your script
Execute it
These materials have been developed by members of BIBS team of the CIRI (https://ciri.ens-lyon.fr/). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.