1 Aligning reads: in practice on the training dataset

As previously mentioned, we will use Salmon to quantify the genes expression for the differential gene expression analysis. The documentation can be found here or clicking on the tab Documentation on the salmon web page.

1.1 Creating transcriptome index

The “quasi-mapping” approach utilized by Salmon requires a transcriptome reference index to determine the position and orientation information for where the fragments best map prior to quantification.

So, first we need to build this index. In real-life scenarios, one must find and understand command lines to utilize a new tool.

Should decoy sequences be used or not?

Fast is good but fast and accurate is better! Alignment and mapping methodologies influence transcript abundance estimation, and accounting for fragments of unexpected origin can improve transcript quantification. To this end, Salmon provides the ability to index both the transcriptome as well as decoy sequences that can be considered during mapping and quantification. The decoy sequence accounts for reads that might otherwise be (spuriously) attributed to some annotated transcript.

Using decoy sequences in the Salmon index requires much more resources to build and also during the quantification step.

For example:

  • Salmon index for human cDNA (no decoy sequences): 680Mo ~5min
  • Salmon index for human cDNA + human genome as decoy sequences : 22Go ~1h

For the training, we will not use decoy sequences. However, if you have enough resources, you can find help to build your index with decoy sequences here

Exercices

  1. From the Salmon documentation, find the command line to index your transcriptome with Salmon.

  2. Build your script.

Script: scripts/07_build_salmon_index.sh
#!/bin/bash
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project

ref_transcriptome="/home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/01_toy_data/reference_transcriptome/Homo_sapiens.GRCh38.cdna.all.fa"
mkdir -p results/07_salmon_index

#Documentation https://salmon.readthedocs.io/en/latest/salmon.html#preparing-transcriptome-indices-mapping-based-mode
# salmon index
# -t: the path to the transcriptome file (in FASTA format)
# -i: the path to the folder to store the indices generated
# -k: the length of kmer to use to create the indices (will output all sequences in transcriptome of length k)
# -p: the number of threads
# Thus, a smaller value of k may slightly improve sensitivity.
# We find that a k of 31 seems to work well for reads of 75bp or longer, but you might consider a smaller k if you plan to deal with shorter reads. 

salmon index -k 31 -t $ref_transcriptome -p 7 -i results/07_salmon_index/human_cdna.salmon_index
  1. Execute your script.

Terminal
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project
bash scripts/07_build_salmon_index.sh

Exercises:

In your RNA-seq experiment, you expressed a GFP transgene in your mice, and you would like to quantify the expression of GFP. We know that the sequence is not present in the mice transcriptome. What would you do?

  • a. Try laboratory techniques like quantitative PCR to act as a proxy for expression level
  • b. Add the sequence of the GFP transcript to the FASTA reference file
  • c. Manually search for and count the number of times the GFP transcript is present in the read files for each sample
  • d. Feel defeated and accept that there is no valid way to do this

b

1.2 Quasi-mapping and transcripts quantification

The quasi-mapping approach estimates where the reads best map to on the transcriptome through identifying where informative sequences within the read map instead of performing base-by-base alignment.

NOTE: If there are k-mers in the reads that are not in the index, they are not counted. As such, trimming is not required when using this method. But it is reassuring to see the amount of usable data. Accordingly, if there are reads from transcripts not present in the reference transcriptome, they will not be quantified. Quantification of the reads is only as good as the quality of the reference transcriptome.

After determining the best mapping for each read/fragment using the quasi-mapping method, Salmon will generate the final transcript abundance estimates after modeling sample-specific parameters and biases. Note that reads/fragments that map equally well to more than one transcript will have the count divided between all of the mappings, thereby not losing information for the various gene isoforms.

Instead of only counting the number of reads/fragments mapping to each of the transcripts, Salmon uses multiple complex modeling approaches to estimate the transcript abundances while correcting the abundance estimates for any sample-specific biases/factors [4]. Sample-specific bias models are helpful when needing to account for known biases present in RNA-Seq data including:

  • GC bias
  • positional coverage biases
  • sequence biases at 5’ and 3’ ends of the fragments
  • fragment length distribution
  • strand-specific methods

If not accounted for, these biases can lead to unacceptable false positive rates in differential expression studies. The Salmon algorithm can learn these sample-specific biases and account for them in the transcript abundance estimates. Generally, this step results in more accurate transcript abundance estimation.

Now we know a little bit about how it works, let’s map our data using Salmon.

They are two modes to quantify with Salmon, the alignment-based mode or the mapping-based mode.

  • The mapping-based mode of Salmon runs in two phases; indexing and quantification. The indexing step is independent of the reads, and only needs to be run once for a particular set of reference transcripts. The quantification step, obviously, is specific to the set of RNA-seq reads.

  • The alignment-based mode of Salmon does not require indexing but you need to align your reads on the transcriptome with another tool (wich may need an index) to get an alignment file (SAM/BAM).

You will use the mapping-based mode.

Exercices

  1. Build a command for on sample (results/04_trimmed_data_SE/Irrel_kd_1.subset_trim.fastq.gz). You may have to create an output directory to store results (results/08_salmon_quantification/)

Script: scripts/08_run_quantification_salmon.sh
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project
mkdir -p results/08_salmon_quantification/

#salmon quant
#--seqBias will enable it to learn and correct for sequence-specific biases in the input data
#--GCbias will enable it to learn and correct for fragment-level GC biases in the input data
#--validateMapping is  now the default mapping strategy, add this option will not change the command

salmon quant -i results/07_salmon_index/human_cdna.salmon_index \
        -l A \
        -p 7 \
          -r results/04_trimmed_data_SE/Irrel_kd_1.subset_trim.fastq.gz \
          -o results/08_salmon_quantification/Irrel_kd_1.salmon \
          --seqBias \
          --gcBias

The command is very fast as the traininig dataset is a subset of data. You can run it.

  1. Build a command for one sample using variables ($output_dir, $salmon_index, $full_fastq_filename, $prefix, … )

Script: scripts/08_run_quantification_salmon.sh
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project

output_dir=results/08_salmon_quantification/
mkdir -p $output_dir

salmon_index=results/07_salmon_index/human_cdna.salmon_index

full_fillnameSE=results/04_trimmed_data_SE/Irrel_kd_2.subset_trim.fastq.gz
fillnameSE=$(basename $full_fillnameSE)
prefix=${fillnameSE/.subset_trim.fastq.gz/}

#salmon quant
#--seqBias will enable it to learn and correct for sequence-specific biases in the input data
#--GCbias will enable it to learn and correct for fragment-level GC biases in the input data
#--validateMapping is  now the default mapping strategy, add this option will not change the command

salmon quant -i  $salmon_index\
        -l A \
        -p 7 \
          -r $full_fillnameSE \
          -o $output_dir/${prefix}.salmon \
          --seqBias \
          --gcBias
  1. Finally, build a command for all samples of the training dataset using a for loop

Script: scripts/08_run_quantification_salmon.sh
#!/bin/bash
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project

output_dir=results/08_salmon_quantification/
mkdir -p $output_dir

salmon_index=results/07_salmon_index/human_cdna.salmon_index

for full_fillnameSE in results/04_trimmed_data_SE/*.subset_trim.fastq.gz
do
  echo START: $full_fillnameSE

  fillnameSE=$(basename $full_fillnameSE)
  prefix=${fillnameSE/.subset_trim.fastq.gz/}

  #salmon quant
  #--seqBias will enable it to learn and correct for sequence-specific biases in the input data
  #--GCbias will enable it to learn and correct for fragment-level GC biases in the input data
  #--validateMapping is  now the default mapping strategy, add this option will not change the command
  salmon quant -i  $salmon_index\
               -l A \
               -p 7 \
               -r $full_fillnameSE \
               -o $output_dir/${prefix}.salmon \
               --seqBias \
               --gcBias
  echo END
done
Terminal
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project
bash scripts/08_run_quantification_salmon.sh
  1. Adapt your previous script to teh PE training dataset

Script: scripts/08_run_quantification_salmon_PE.sh
#!/bin/bash
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project

output_dir=results/08_salmon_quantification_PE/
mkdir -p $output_dir

salmon_index=results/07_salmon_index/human_cdna.salmon_index

for full_filenameR1 in results/04_trimmed_data_PE/*_trim_R1.fastq.gz
do
  echo START: $full_filenameR1
  
  fastq_dirname=$(dirname $full_filenameR1)
  filenameR1=$(basename $full_filenameR1)
  
  filenameR2=${filenameR1/R1/R2}
  full_filenameR2="$fastq_dirname/$filenameR2"
  
  prefix=${filenameR1/_trim_R1.fastq.gz/}
  
  echo Prefix: $prefix # display Irrel_kd_1

  #salmon quant
  #--seqBias will enable it to learn and correct for sequence-specific biases in the input data
  #--GCbias will enable it to learn and correct for fragment-level GC biases in the input data
  #--validateMapping is  now the default mapping strategy, add this option will not change the command
  salmon quant -i  $salmon_index\
               -l A \
               -p 7 \
               -1 $full_filenameR1 \
               -2 $full_filenameR2 \
               -o $output_dir/${prefix}.salmon \
               --seqBias \
               --gcBias
  echo END
done
Terminal
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project
bash scripts/08_run_quantification_salmon_PE.sh

RNA-seq bias correction: To have Salmon correct for RNA-Seq biases you will need to specify the appropriate parameters when you run it. As noted, when describing the FASTQC results, with RNA-seq data you will always observe sequence-specific biases due to the random hexamer priming and so we would always want to have that correction turned on. Before using the remaining parameters it is advisable to assess your data using tools like Qualimap to look specifically for the presence of these biases in your data and decide on which parameters would be appropriate.

1.3 Salmon output

Salmon output should be in results/08_salmon_quantification/. Take a look at what is contained in this directory:

Terminal
ls  results/08_salmon_quantification/*

From the documentation, you should find details about the output files. On the left, there is a Salmon Output File Formats section.

The most important file is quant.sf. This is the quantification file in which each row corresponds to a transcript, listed by Ensembl ID. The columns correspond to metrics for each transcript:

  • The first two columns are self-explanatory, the name of the transcript and the length of the transcript in base pairs (bp).
  • The effective length represents the various factors that effect the length of transcript (i.e degradation, technical limitations of the sequencing platform)
  • Salmon outputs ‘pseudocounts’ or ‘abundance estimates’ which predict the relative abundance of different isoforms in the form of three possible metrics (FPKM, RPKM, and TPM). TPM (transcripts per million) is a commonly used normalization method as described in [1] and is computed based on the effective length of the transcript. We do NOT recommend FPKM or RPKM.
  • Estimated number of reads, which is the estimate of the number of reads drawn from this transcript given the transcript’s relative abundance and length)

NOTE: The effective gene length in a sample is the average of the transcript lengths after weighting for their relative expression. You may see effective lengths that are larger than the physical length. The interpretation would be that in this case, given the sequence composition of these transcripts (including both the sequence-specific and fragment GC biases), you’d expect a priori to sample more reads from them — thus they have a longer estimated effective length.

Exercices

  1. Look to the top lines of the quantification file (quant.sf) of the Irrel_kd_1 (SE) sample

bash
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project

head results/08_salmon_quantification/Irrel_kd_1.salmon/quant.sf
Name    Length  EffectiveLength TPM     NumReads
ENST00000632684.1       12      3.00168 0       0
ENST00000434970.2       9       2.81792 0       0
ENST00000448914.1       13      3.04008 0       0
ENST00000415118.1       8       2.72193 0       0
ENST00000631435.1       12      3.00168 0       0
ENST00000390567.1       20      3.18453 0       0
ENST00000439842.1       11      2.95387 0       0
  1. Search for the count of “ENST00000612242” with the command grep

bash
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project

grep ENST00000612242 results/08_salmon_quantification/Irrel_kd_1.salmon/quant.sf
ENST00000612242.4       6592    5586.055        2031.576451     1212.992
  1. There is also logs directory, which contains all of the text that was printed to screen as Salmon was running. Display with the cat command the file results/08_salmon_quantification/Irrel_kd_1.salmon/logs/salmon_quant.log and search for the mapping rate.

bash
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project

cat results/08_salmon_quantification/Irrel_kd_1.salmon/logs/salmon_quant.log
  1. Search for each sample the line containing the Mapping rate with the grep command and using * to build the pattern of all log files.

bash
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project

grep "Mapping rate" results/08_salmon_quantification/*.salmon/logs/salmon_quant.log

The quant.sf file is used as input for downstream analyses including differential gene expression analysis. For each sample that you run through Salmon, you will get a corresponding quant.sf file. Downstream tools like tximport take these files and aggregate them to obtain expression matrices for your dataset. We will see that in the next session.

1.4 Get and analyse mapping quality reports with multiqc

Instead of get manually mapping statistics you can use once again multiqc to agglomerate log results of Salmon.

Build the script as below and execute it.

Script: scripts/09_agglomerate_salmon_reports.sh
#!/bin/bash
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/
multiqc results/08_salmon_quantification/ -o results/09_multiqc_report_quantification/
Terminal
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/
bash scripts/09_agglomerate_salmon_reports.sh

Using, the multiqc report you can see the percentage of mapping for eaach of your samples. For PE data, you can see the fragment length distribution. For SE data, a unique value will be estimated.

2 Aligning reads: in practice on your dataset

Exercises:

It’s now time to run quantification on your dataset. As previsously, write a script adapted to your data, check it on one sample and execute it on all your samples.

Terminal [To be adjusted]
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/my_project
bash scripts/08_run_quantification_salmon.sh

3 Quality control

3.1 Aligned sequence reads (STAR/Qualimap)

As mentioned above, the differential gene expression analysis will use transcript/gene pseudocounts generated by Salmon. However, to perform some basic quality checks on the sequencing data, it is important to align the reads to the whole genome. Either STAR or HiSAT2 are able to perform this step and generate a BAM file that can be used for QC.

3.1.1 STAR (genome mapping)

As Salmon, STAR need an index to facilitate the research. This time, we will index the genome. This can take time and computing resources. For the training, we will just index the chromosome 1 of the Human genome, as the training dataset is a subset of reads mapping on the first chromosome.

As previously for the transcriptome you can find genomic sequences on the Ensembl database.

You can find information about STAR on its github repository (https://github.com/alexdobin/STAR) and also the complete documentation. ( https://github.com/alexdobin/STAR/raw/master/doc/STARmanual.pdf)

You can find below, the script to download the sequence of the first human chromosome from Ensembl thus the gtf file containing gene annotations, and use them to build the STAR index.

Copy the following script in scripts/10_download_human_genome_and_STAR_index.sh and execute it.

Script: scripts/10_download_human_genome_and_STAR_index.sh
#!/bin/bash

cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/
mkdir -p data/genome

wget https://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.1.fa.gz -P data/genome
wget https://ftp.ensembl.org/pub/release-109/gtf/homo_sapiens/Homo_sapiens.GRCh38.109.gtf.gz -P data/genome

gunzip data/genome/Homo_sapiens.GRCh38.dna.chromosome.1.fa.gz
gunzip data/genome/Homo_sapiens.GRCh38.109.gtf.gz

mkdir -p results/10_star_index/

STAR --runThreadN 7 \
--runMode genomeGenerate \
--genomeDir results/10_star_index/chr1_hg38_index \
--genomeFastaFiles data/genome/Homo_sapiens.GRCh38.dna.chromosome.1.fa \
--sjdbGTFfile data/genome/Homo_sapiens.GRCh38.109.gtf \
--sjdbOverhang 99 \
--genomeSAindexNbases 12 # min(14, log2(GenomeLength)/2 - 1)

 #sjdbOverhang  = readlength - 1
Terminal
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/
bash scripts/10_download_human_genome_and_STAR_index.sh

Then, you will use STAR to map reads from results/04_trimmed_data_SE/Irrel_kd_2.subset_trim.fastq.gz on the previously building index.

Script: scripts/11_STAR_mapping.sh
#!/bin/bash

# Run STAR on one sample
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/

star_index=results/10_star_index/chr1_hg38_index
outputdir_bam=results/11_star_mapping/bam/

mkdir -p $outputdir_bam

full_fillnameSE=results/04_trimmed_data_SE/Irrel_kd_2.subset_trim.fastq.gz

fillnameSE=$(basename $full_fillnameSE)
prefix=${fillnameSE/.subset_trim.fastq.gz/}
  

STAR --runThreadN 7 \
     --genomeDir $star_index  \
     --readFilesCommand zcat \
     --readFilesIn $full_fillnameSE \
     --outFileNamePrefix $outputdir_bam/${prefix}. \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMunmapped Within \
     --outSAMattributes Standard


#for PE data
#STAR --runThreadN 7 \
#     --genomeDir $star_index  \
#     --readFilesCommand zcat \
#     --readFilesIn $full_fillnameR1 $full_fillnameR2 \
#     --outFileNamePrefix $outputdir_bam/${prefix}. \
#     --outSAMtype BAM SortedByCoordinate \
#     --outSAMunmapped Within \
#     --outSAMattributes Standard

To note, here we use the option “–outSAMunmapped Within” of STAR, as we are interested to get statistic about mapping rate. You can remove this option to remove unmapped read of the bam file and thus save space.

Terminal
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/
bash scripts/11_STAR_mapping.sh

STAR will output BAM files which are compressed SAM files.

The Sequence Alignment/Map (SAM) is a text file format to save alignment information of short reads mapped against reference sequences. It usually starts with a header section followed by alignment information as tab separated lines for each read. (https://www.metagenomics.wiki/tools/samtools/bam-sam-file-format)

3.1.2 Bam files

To look to the content of bam files you can use samtools view as in the following command lines:

Terminal
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/

#To see the content of a bam file
samtools view results/11_star_mapping/bam//Irrel_kd_2.Aligned.sortedByCoord.out.bam | head

# You need to add the -h option to see the header section
samtools view -h results/11_star_mapping/bam//Irrel_kd_2.Aligned.sortedByCoord.out.bam | head

Bam files contain alignment information in a tabular format. Each row represent an hit of a read on a reference sequence. Reads can have several hits. For each hit you can find among other information, the reference sequence name (3rd column) and the coordinate of the alignment (4th)

You can find easily on internet the specification of each column (for example: https://www.metagenomics.wiki/tools/samtools/bam-sam-file-format).

In the second column, there is a code/flag to encode some properties of the mapping of a read and for PE data about its mate.

  • is the read mapped or not ?
  • its mate is also map, in the same sens ?
  • is it a uniq map ?

You can find here a tool to decode this flag: https://broadinstitute.github.io/picard/explain-flags.html or here: https://www.samformat.info/sam-format-flag

You can use the -f option of samtools to filter hits on this flag.

Terminal
#To see unmapped reads flag=4
samtools view -f 4 results/11_star_mapping/bam//Irrel_kd_2.Aligned.sortedByCoord.out.bam | head

#To see multimap read flag=256
samtools view -f 256 results/11_star_mapping/bam//Irrel_kd_2.Aligned.sortedByCoord.out.bam | head

You can also combine samtools with grep to search all mapping hit of a given read:

Terminal
# To see all maps of a read
samtools view  results/11_star_mapping/bam//Irrel_kd_2.Aligned.sortedByCoord.out.bam | grep HWI-ST330:304:H045HADXX:2:1202:12011:47361

3.1.3 Qualimap

A tool called Qualimap explores the features of aligned reads in the context of the genomic region they map to, hence providing an overall view of the data quality (as an HTML file). Various quality metrics assessed by Qualimap include:

  • DNA or rRNA contamination
  • 5’-3’ biases
  • Coverage biases
Script: scripts/12_qualimap.sh
#!/bin/bash

# Run Qualimap on one sample
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/

gtf_file=data/genome/Homo_sapiens.GRCh38.109.gtf

qualimap_out=results/12_qualimap/
mkdir -p $qualimap_out

full_fillnameSE=results/04_trimmed_data_SE/Irrel_kd_2.subset_trim.fastq.gz

fillnameSE=$(basename $full_fillnameSE)
prefix=${fillnameSE/.subset_trim.fastq.gz/}
outputdir_bam=results/11_star_mapping/bam/
align_out_bam=${outputdir_bam}/${prefix}.Aligned.sortedByCoord.out.bam


unset DISPLAY
# Run Qualimap
qualimap rnaseq \
  -outdir $qualimap_out \
  -a proportional \
  -bam $align_out_bam \
  -p strand-specific-reverse \
  -gtf $gtf_file \
  --java-mem-size=8G

# for PE data
#-pe 
Terminal
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/
bash scripts/12_qualimap.sh

An alternative to Qualimap, is to use multiqc to get mapping rates on the genome but not the distribution between exon/intron/…

3.2 Count features and get count table (just for comparison)

From bam/sam files you can use the former “standard” approach for quantifying gene expression which consist to count how many reads map on each gene.

For that, we will use a tool name htseq-count. You can find here documentation

htseq-count need the gtf file containing gene annotation (=all gene coordinates) and bam/sam alignment files (=all read alignment coordinates).

Use the following script to get the count table for the Irrel_kd_2 sample.

Script: scripts/13_htseq_count.sh
#!/bin/bash

# Run htseq-count on one sample
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/

gtf_file=01_toy_data/genome/Homo_sapiens.GRCh38.109.gtf
outputdir_count=results/13_htseq_count/
mkdir -p $outputdir_count

prefix=Irrel_kd_2
outputdir_bam=results/11_star_mapping/bam/
align_out_bam=$outputdir_bam/$prefix.Aligned.sortedByCoord.out.bam

htseq-count -n 7 -r pos \
            -t exon -i gene_id $align_out_bam $gtf_file -c $outputdir_count/$prefix.tsv
Terminal
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/
bash scripts/13_htseq_count.sh

You can inspect the output file with

Terminal
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/
head  results/13_htseq_count/Irrel_kd_2.tsv 

tail  results/13_htseq_count/Irrel_kd_2.tsv 

You can see that there are multimapping (__alignment_not_unique) reads. The defaut parameter of htseq-count is to count all of them.

You have to specify what to do with multi mapping reads with the --secondary-alignments option

Script: scripts/14_htseq_count_ignore_secondary.sh
#!/bin/bash

# Run htseq-count on one sample
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/

gtf_file=data/genome/Homo_sapiens.GRCh38.109.gtf
outputdir_count=results/14_htseq_count_ignore_secondary/
mkdir -p $outputdir_count

full_fillnameSE=results/04_trimmed_data_SE/Irrel_kd_2.subset_trim.fastq.gz

fillnameSE=$(basename $full_fillnameSE)
prefix=${fillnameSE/.subset_trim.fastq.gz/}
outputdir_bam=results/11_star_mapping/bam/
align_out_bam=${outputdir_bam}/${prefix}.Aligned.sortedByCoord.out.bam

htseq-count -n 7 -r pos -t exon \
            -i gene_id $align_out_bam ${gtf_file} \
            --nonunique all --secondary-alignments=ignore > ${outputdir_count}/${prefix}.nosecondary.tsv
Terminal
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/
bash scripts/14_htseq_count_ignore_secondary.sh

You could compare the output files of the 2 scripts.

Terminal
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/
diff  results/13_htseq_count/Irrel_kd_2.tsv results/14_htseq_count_ignore_secondary/Irrel_kd_2.nosecondary.tsv


grep ENSG00000232811 results/13_htseq_count/Irrel_kd_2.tsv results/14_htseq_count_ignore_secondary/Irrel_kd_2.nosecondary.tsv

During genome mapping you can filter for multi-mapping reads or then after with samtools

4 Aggregating all results with MultiQC

source

Throughout the workflow we have performed various steps of quality checks on our data. You will need to do this for every sample in your dataset, making sure these metrics are consistent across the samples for a given experiment. Outlier samples should be flagged for further investigation and potential removal.

Manually tracking these metrics and browsing through multiple HTML reports (FastQC, Qualimap) and log files (Salmon, STAR) for each samples is tedious and prone to errors. MultiQC is a tool which aggregates results from several tools and generates a single HTML report with plots to visualize and compare various QC metrics between the samples. Assessment of the QC metrics may result in the removal of samples before proceeding to the next step, if necessary.

Terminal
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/

multiqc results/ -o results/multiqc_all

5 Visualize mapping data

We will use IGV to visualize genomic mapping store in your bam files.

From IGV website open, the IGV web app. (https://igv.org/app/)

You need to index you bam files to help IGV to quickly navigate in the bam files.

Terminal
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/

samtools index results/11_star_mapping/bam/Irrel_kd_2.Aligned.sortedByCoord.out.bam 

ls results/11_star_mapping/bam/
  1. Download on your computer Irrel_kd_2.Aligned.sortedByCoord.out.bam and Irrel_kd_2.Aligned.sortedByCoord.out.bam.bai

  2. In IGV web app, import these files

  1. In IGV web app, go to look ENSG00000203865, using the research bar

6 And now in the “true” life

6.1 Deal with install

6.1.1 the PATH

The PATH is an environment variable that is used to store the location of the resources that contain executable files on the Linux operating system.

Terminal
echo $PATH

The which command allows users to search the list of paths in the $PATH environment variable and outputs the full path of the command specified as an argument.

Terminal
which ls

So, in fact, when you use the command ls, you execute the script contain in the file: /usr/bin/ls.

Terminal
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/
ls 
Terminal
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/
/usr/bin/ls

6.1.2 use package management system as Conda

Conda is an open-source package management system and environment management system that runs on Windows, macOS, and Linux. Conda quickly installs, runs, and updates packages and their dependencies. Conda easily creates, saves, loads, and switches between environments on your local computer. It was created for Python programs but it can package and distribute software for any language.

To install conda on your computer -> https://docs.conda.io/en/latest/miniconda.html

example: to install salmon https://anaconda.org/bioconda/salmon

Terminal
conda install -c bioconda salmon

Alternatives :

  • Use other package managers as apt.
  • Use container as Docker
  • Install from source

6.2 Where to run scripts

  • Using VM on the Biosphere cloud of IFB as in the training
  • On your computer for small process
    • in your terminal
    • in virtual box
  • On the PSMN cluster at the ENS

6.3 How to run your scripts for a reproducible science

One script to rule them all!

  • First step : a well documented readme file.

  • Final step : Use pipeline managers as Nextflow

Nextflow enables scalable and reproducible scientific workflows using software containers. It allows the adaptation of pipelines written in the most common scripting languages. Its fluent DSL simplifies the implementation and the deployment of complex parallel and reactive workflows on clouds and clusters.

example: https://nf-co.re/rnaseq

6.4 Save, synchronize your scripts and collaborate with git/gitlab

The most important files on your virtual machine is your script (as your raw data are stored elsewhere).

For the training, we don’t use a tool called git to version, save and synchronize your script in a remote server. You could learn how to use this (fabulous) tool in a BIBS monthly workshop.

7 Save on your local computer your work

  1. Save on your local computer your scripts directory, use the export button of rstudio.

  2. Repatriate also the important results files

  3. Take time also to right a README file to explain how to use these different scripts to get the result.

See you in the next session


These materials have been developed by members of BIBS team of the CIRI (https://ciri.ens-lyon.fr/). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

  • Some materials used in these lessons were derived or adapted from work made available by the Harvard Chan Bioinformatics Core (HBC) (https://github.com/hbctraining) under the Creative Commons Attribution license (CC BY 4.0).

cc_by_sa

sources