Learning objectives

  • Acquire best practices for bioinformatics.
  • Learn checking sequencing data quality and cleaning them accordingly

1 Analyse the quality of your library

FastQC is a widely used software that offers a straightforward method for conducting quality control checks on raw sequence data.

Its main functions include:

  • Providing a quick overview to identify areas with potential problems
  • Generating summary graphs and tables to quickly assess your data
  • Export of results to an HTML-based permanent report

1.1 Libraries quality checking with fastqc

The different sequencing quality indicator include:

  • Number of reads
  • Per base sequence quality (!)
  • Per sequence quality score
  • Per base sequence content
  • Per sequence GC content (!)
  • Per base N content
  • Sequence length distribution
  • Sequence duplication levels
  • Overrepresented sequences
  • Adapter content (!)
  • Kmer content

You can find a good report here and a bad report here.

Some known pitfalls/issues are listed here. One of the most well-known issues pertains to the biased composition of the first bases, as detailed here

1.2 Run fastqc on your data

Script: my_project/02_check_quality.sh [To be adjusted]
#!/bin/bash
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/my_project
mkdir -p results/02_fastqc/

fastqc -o results/02_fastqc/ 01_raw_data/raw_fastq/*.fastq.gz

and run the script again

TERMINAL [To be adjusted]
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/my_project
bash 02_check_quality.sh

2 How to organize your project directory ?

For now, we have placed all the scripts at the root of your project directory. In fact it is a bad idea.

It is recommended to put all files related to the same project within a single folder, whether they are:

  • raw data,
  • scripts,
  • intermediate results,
  • final results,
  • project documentation,

However, it is advisable to further organize these files by separating them into sub directories.

This method will help you navigate your project more effectively and avoid mixing up or accidentally deleting files.

For instance, your working directory might resemble the following:

project_name/
├── README.md             # overview of the project
├── data/                 # data files used in the project
├── results/              # results of the analysis (data, tables, figures)
├── scripts/              # contains all code in the project
│   ├── 01_...
│   ├── 02_...
│   └── ...
└── doc/                  # documentation for your project
    └── ...

Moreover, to be able to find your way around and for questions of reproducibility, you need to add a file, often named README.md, to the root of your folder which will contain all the useful information to take the project in hand. This is also the file that will be visible on the home page of your project on gitlab. So when someone wants to (re)work on the project, they will be able to open the folder, and they will know where to go to see and understand what has been done. This person can be a collaborator, your manager or simply yourself 6 months later.

Exercices

    1. In the training project folder, create a “scripts” folder and move scripts inside.

Terminal
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project

mkdir -p scripts

mv 0*sh scripts
    1. Do the same thing for your project and if necessary, put all your fastq files in a 01_raw_data/fastq_files/

Terminal
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/my_project

mkdir -p scripts
mv 0*sh scripts

mkdir -p 01_raw_data/fastq_files/

mv 01_raw_data/raw_fastq/*.fastq.gz 01_raw_data/fastq_files/

#check it is ok and then you can remove the old directory
#rm -r 01_raw_data/raw_fastq

2.1 Scripts and data naming

Another aspect of staying organized is making sure that all the directories and filenames for an analysis are as consistent as possible. You want to avoid names like alignment1.bam, and rather have names like 20170823_kd_rep1_gmap-1.4.bam which provide a basic level of information about the file.

Good name must be :

  • machine-readable
    • avoid spaces, punctuation, accented characters
    • friendly for regular expression and globbing (e.g., ls CondA* )
    • easy to process (ex: deliberate use of delimiters, CondA_rep1.tsv)
  • human-readable
    • names should contain information on content (e.g., 01_download_training_dataset.sh)
  • plays well with default ordering (ex: put something numeric first for script for examle, or use date for results)

You can refer to this link and slideshow for some useful guidelines for file naming dos and don’ts.

3 More useful bash commands

In the previous session, you were introduced to some basic bash commands. In this session, we will explore additional commands to enhance your bash proficiency. As previously, the idea is only to provide you with commands relevant to this training but not with a comprehensive overview of bash.

3.1 New bash commands

To begin, create a new “tuto_bash” directory for the exercises within mydatalocal

Terminal
cd /home/rstudio/mydatalocal
mkdir -p tuto_bash
cd tuto_bash
  • grep -> Used to extract lines from a file that match a specific pattern.
Terminal
cowsay -f dragon "Hello Hello!" > dragon.txt
cat dragon.txt
grep "-" dragon.txt
  • sed -> Utilized for replacing all instances of a pattern with another.
Terminal
sed "s/Hello/TOTO/g"  dragon.txt

sed can do a lot of other things but we will use only this one.

To replace only the first occurrence of a pattern, omit the g option

Terminal
sed "s/Hello/TOTO/"  dragon.txt
  • wc -l -> Count the number of lines in a file
Terminal
wc -l  dragon.txt

wc can do other things but we will use only this one.

  • basename -> to get only the name of a file from a given path (the file may not exist)
Terminal
basename raw_fastq/sample1.fastq.gz
  • dirname -> to get only the name of the directory containing the file from a given path (the file may not exist)
Terminal
dirname raw_fastq/sample1.fastq.gz
dirname /home/rstudio/mydatalocal/tuto_bash/raw_fastq/sample1.fastq.gz
  • pipe command with |

Sometimes, you need to chain several consecutive commands. For example, if you want to count the number of line containing a pattern in a file:

Terminal
grep "-" dragon.txt > occurences.txt
wc -l occurences.txt

You can “pipe” the command using a pipe | (AltGr+6 for PC and Alt + Maj + L for Mac ) between two commands to link the output of the first command as input to the second.

Terminal
grep "-" dragon.txt | wc -l  

(By the way, grep have an option for counting occurrences (grep -c), but it was for the example.)

  • gunzip and zcat -> Used for decompressing files.

As previously seen, fastq files are huge files and are commonly stored in compressed formats. In Unix systems, the preferred compression format is typically .gz (utilizing gunzip) rather than .zip, although the concept remains the same.

To check the contents of a compressed file, you can unzip it with gunzip, but this process consumes time and disk space. To do it quickly, there is a command called zcat which enables you to decompress a .gz file on the fly and display its contents. herefore, piping zcat with head allow you to display the first few lines of a .gz file without consuming additional time or disk space.

Terminal
zcat /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/01_toy_data/raw_fastq/Irrel_kd_1.subset.fastq.gz | head -n 8

Exercices

Using the commands we’ve just seen:

    1. Display the first 100 sequences of a training fastq file

Terminal
zcat /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/01_toy_data/raw_fastq/Irrel_kd_1.subset.fastq.gz | head -n 400
    1. Count the number of sequences of a training fastq file

Terminal
zcat /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/01_toy_data/raw_fastq/Irrel_kd_1.subset.fastq.gz | grep "@HWI-ST330" | wc -l

3.2 Stock strings in variables

In scripts, we often use the path to a file multiple times. It is more convenient to store it in a named “variable” and then use it as needed. This way, if you need to modify the path, you only need to change it once instead of at every occurrence.

To define a variable, you will declare it with an = without spaces.

Terminal
variable1="toto"

Quote are not mandatory unless the string contains spaces.

Terminal
variable1=toto

Then, you can refer to its content by using a $ before its name. To display the content of a variable you can use echo.

Terminal
echo $variable1
echo variable1
echo "variable1: $variable1"

You can also combine variables to define a new one.

Terminal
variable1="TOTO"
variable2="TATA"

echo variable1: $variable1
echo variable2: $variable2

variable3="$variable1 $variable2"
echo variable3: $variable3

It is advisable to use descriptive variable names rather than generic ones like “variable1”, “variable2”.

Sometimes, you may need to enclose the variable name in curly brackets ({}) to distinguish it from the following string.

For example, if you want to create a filename using the content of a variable called $sample as a prefix, such as $sample_rep1.fastq.gz, you need to add curly brackets around sample (${sample}_rep1.fastq.gz). Otherwise, the terminal will attempt to interpret $sample_rep1 as a single variable.

Terminal
sample="sampleA"

filenameX=$sample_rep1.txt #variable sample_rep1 don't exist so filenameX=.txt
filenameY=${sample}_rep1.txt

echo filenameX: $filenameX
echo filenameY: $filenameY

Variable can also be use to save the output of a command, using $(command)

Terminal
list_files=$(ls)
working_directory=$(pwd)

echo "$working_directory contains the following files:
$list_files"

You can use also pipes inside

Terminal
nb_dash=$(grep "-" dragon.txt | wc -l)
echo $nb_dash

Using variables will simplify both your script writing and reading processes.

Terminal
filename=dragon.txt
new_filename=dragon_bis.txt

echo "$filename will be cp to $new_filename"
cp  $filename  $new_filename

You can also perform pattern substitutions in a variable, using ${variable/pattern/replace} to replace the first occurrence.

Terminal
filenameR1=sample1.R1.fastq.gz
echo ${filenameR1/R1/R2}

If you want to remove a pattern, you can use the same syntax but without a replacement string. This is a useful method to remove the extension of a file name.

Terminal
filenameR1=sample1.R1.fastq.gz
echo ${filenameR1/.R1.fastq.gz/}

You can save the output of this pattern substitution in a new variable

Terminal
filenameR1=sample1.R1.fastq.gz
filenameR2=${filenameR1/R1/R2}

echo R1: $filenameR1 R2: $filenameR2

For the following exercise, we need Paired-End (PE) data. PE fastq file names differ only by “R1” or “R2”. The file containing reads 1, contains “R1” in its file name and the other file containing reads 2, contains “R2”. Therefore, from one, you can deduce the other. The training dataset is single end (SE). We will just create a fake PE dataset (until we find a real one).

Terminal
fastqgz_dir="/home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/01_toy_data/raw_fastq/"

ls $fastqgz_dir 

cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/01_toy_data/

mkdir -p fake_PE_fastq
cp $fastqgz_dir/Irrel_kd_1.subset.fastq.gz fake_PE_fastq/Irrel_kd_1_R1.fastq.gz
cp $fastqgz_dir/Irrel_kd_1.subset.fastq.gz fake_PE_fastq/Irrel_kd_1_R2.fastq.gz

cp $fastqgz_dir/Mov10_oe_1.subset.fastq.gz fake_PE_fastq/Mov10_oe_1_R1.fastq.gz
cp $fastqgz_dir/Mov10_oe_1.subset.fastq.gz fake_PE_fastq/Mov10_oe_1_R2.fastq.gz

You must have:

01_toy_data/
├── fake_PE_fastq
│   ├── Irrel_kd_1_R1.fastq.gz
│   ├── Irrel_kd_1_R2.fastq.gz
│   ├── Mov10_oe_1_R1.fastq.gz
│   └── Mov10_oe_1_R2.fastq.gz
└── raw_fastq
    ├── Irrel_kd_1.subset.fastq.gz
    ├── Irrel_kd_2.subset.fastq.gz
    ├── Irrel_kd_3.subset.fastq.gz
    ├── Mov10_oe_1.subset.fastq.gz
    ├── Mov10_oe_2.subset.fastq.gz
    └── Mov10_oe_3.subset.fastq.gz

Exercices

You will work in the /home/rstudio/mydatalocal/tuto_bash directory to do the following exercice. But data will be in /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/01_toy_data/raw_fastq/

Terminal
cd /home/rstudio/mydatalocal/tuto_bash
fastq_data_dir="/home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/01_toy_data/"

Using the commands we’ve just seen:

  1. Complete the commands below to define a variable prefix containing the prefix of the sample name (“Irrel_kd_1”) from one file of a pair of PE fastq files (fake_PE_fastq/Irrel_kd_1_R1.fastq.gz)
Terminal
full_filenameR1=$fastq_data_dir/fake_PE_fastq/Irrel_kd_1_R1.fastq.gz

...

echo $prefix # display Irrel_kd_1

Terminal
full_filenameR1=$fastq_data_dir/fake_PE_fastq/Irrel_kd_1_R1.fastq.gz
filenameR1=$(basename $full_filenameR1)

prefix=${filenameR1/_R1.fastq.gz/}

echo $prefix # display Irrel_kd_1
  1. From this prefix variable, write codes to save in a file (results/seq_numbers/$prefix.nb_sequences.txt) the number of sequences (~22000) of the R1 fastq file. Before create a directory to save the results (“results/seq_numbers”).

Terminal
output_dir="results/seq_numbers"
mkdir -p $output_dir


full_filenameR1=$fastq_data_dir/fake_PE_fastq/Irrel_kd_1_R1.fastq.gz

filenameR1=$(basename $full_filenameR1)

prefix=${filenameR1/_R1.fastq.gz/}

echo $prefix # display Irrel_kd_1

zcat $full_filenameR1 | grep "@HWI-ST330" | wc -l > $output_dir/$prefix.nb_sequences.txt
  1. Starting from the full filename: full_filenameR1=$fastq_data_dir/fake_PE_fastq/Irrel_kd_1_R1.fastq.gz, write codes to save automatically:

    • in a variable named filenameR2 the name of the R2 file (=Irrel_kd_1_R2.fastq.gz)
    • and in a variable full_filenameR2 the path to the file R2 (=/home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/01_toy_data//fake_PE_fastq/Irrel_kd_1_R2.fastq.gz)
Terminal
full_filenameR1=$fastq_data_dir/fake_PE_fastq/Irrel_kd_1_R1.fastq.gz

...

echo filenameR2: $filenameR2  # display Irrel_kd_1_R2.fastq.gz
echo full_filenameR2: $full_filenameR2 # display /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/01_toy_data//fake_PE_fastq/Irrel_kd_1_R2.fastq.gz

Terminal

full_filenameR1=$fastq_data_dir/fake_PE_fastq/Irrel_kd_1_R1.fastq.gz
fastq_dirname=$(dirname $full_filenameR1)

filenameR1=$(basename $full_filenameR1)

echo filenameR1: $filenameR1
echo full_filenameR1: $full_filenameR1

filenameR2=${filenameR1/R1/R2}
full_filenameR2="$fastq_dirname/$filenameR2"

echo filenameR2: $filenameR2
echo full_filenameR2: $full_filenameR2
  1. Complete the previous codes to get the number of sequences of each files of the sample (R1 and R2)

the output file must contain:

Irrel_kd_1_R1.fastq.gz: 222921
Irrel_kd_1_R2.fastq.gz: 222921

Terminal
cd /home/rstudio/mydatalocal/tuto_bash/

output_dir="results/seq_numbers"
mkdir -p $output_dir


full_filenameR1=$fastq_data_dir/fake_PE_fastq/Irrel_kd_1_R1.fastq.gz
fastq_dirname=$(dirname $full_filenameR1)

filenameR1=$(basename $full_filenameR1)

filenameR2=${filenameR1/R1/R2}
full_filenameR2="$fastq_dirname/$filenameR2"


prefix=${filenameR1/_R1.fastq.gz/}
output_file="$output_dir/$prefix.nb_sequences.txt"


echo $prefix # display sampleA

nb_lines_R1=$(zcat $full_filenameR1 | grep "@HWI-ST330" | wc -l)

echo "$filenameR1: $nb_lines_R1"> $output_file

nb_lines_R2=$(zcat $full_filenameR2  | grep "@HWI-ST330" | wc -l)

echo "$filenameR2: $nb_lines_R2">> $output_file


cat $output_dir/$prefix.nb_sequences.txt

3.3 “For” loops for iterating over a list

Bioinformatic analysis often involves executing numerous commands, but with only slight variations each time you run a new command.

For instance, you might have to analyze sample1, then sample2, sample3, and so forth. To save time, we employ something known as a loop, specifically a for-loop.

Let’s clarify this concept with a straightforward example. Suppose you wish to print the letters a, b and c. You could achieve this by typing echo a, hitting return; then echo b, and echo c. While this accomplishes the task, it is evident that this method would become cumbersome if you had to to print dozens or hundreds of numbers.

How can we simplify this process? You probably noticed that each time we ran the command, we changed the letter after the echo command. A for-loop automates this process for you.

Here is an example of a for-loop to print the letters a, b, c, d:

Terminal
for X in a b c d
do
  echo $X
done

The for-loop comprises three sections:

  • The first section is the Declaration: it begins by assigning the first item after in to the variable *X; in this case, it would assign the value “a” to “X”. The letters after in are referred to as the List**.
  • The next section is the Body, which runs the commands written after do, replacing the variable with whichever value is currently assigned to the variable - for the first loop, this will be the letter a. Since items remain in the list, the loop goes back to the declaration and assigns the next number in the list to the variable X; in this case, the number b. Then the body is run, and the process repeats until the end of the list is reached.
  • The final section, called the End, contains only the word done, meaning to exit the loop after all of the items in the list have been run through the Body of the loop.

You can include additional commands to the Body section. For instance, we could modify the loop to:

Terminal
for X in sheep elephant fox duck
do
  # Display the value in X
  echo $X
  #Do something independent of the variable X
  cowsay -f turtle "Hello"
  #Do something dependent of the content of X ($X)
  cowsay -f $X "Hello, I'm a $X"
  #Do another thing, if you want
  cowsay -f turtle "Welcome, $X"
done

You can use a variable, as seen previously, to store the list of items to iterate over. Additionally, it is preferable to use a more descriptive variable name than “X” for iteration (for example “letter”).

Terminal
list="a b c d"
for letter in $list
do
  echo $letter
  echo Do something
  echo Do something with $letter
  echo Do another thing
done

Furthermore, the list to iterate over can be automatically generated. For instance, if you wish to iterate over a list of files you can do:

Terminal
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/

list_files=$(ls 01_toy_data/raw_fastq/*gz)
echo $list_files

echo "Start:"

for file in $list_files
do
  echo $file
  echo Do something
  echo Do something with $file
  echo Do another thing
done

echo "End"

#OR

for file in 01_toy_data/raw_fastq/*gz
do
  echo $file
  echo Do something
  echo Do something with $file
  echo Do another thing
done

Note: Indentations are not important for the machine, but they helps in code readability for humans.

Exercices

Using the commands we’ve just seen:

  1. Iterate on each fastq files of the fake_PE_fastq directory and print its path

Terminal

cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/
for full_filename in 01_toy_data/fake_PE_fastq/*gz
do
  echo Path : $full_filename
done
Path : fake_PE_fastq/Irrel_kd_1_R1.fastq.gz
Path : fake_PE_fastq/Irrel_kd_1_R2.fastq.gz
Path : fake_PE_fastq/Mov10_oe_1_R1.fastq.gz
Path : fake_PE_fastq/Mov10_oe_1_R2.fastq.gz
  1. Using the same basis, get from the file path the prefix of the sample

Terminal
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/

for full_filename in 01_toy_data/fake_PE_fastq/*gz
do
  echo Path : $full_filename
  
  filename=$(basename $full_filename)

  prefix=${filename/.fastq.gz/}

  echo Prefix : $prefix
done
Path : fake_PE_fastq/Irrel_kd_1_R1.fastq.gz
Prefix : Irrel_kd_1_R1
Path : fake_PE_fastq/Irrel_kd_1_R2.fastq.gz
Prefix : Irrel_kd_1_R2
Path : fake_PE_fastq/Mov10_oe_1_R1.fastq.gz
Prefix : Mov10_oe_1_R1
Path : fake_PE_fastq/Mov10_oe_1_R2.fastq.gz
Prefix : Mov10_oe_1_R2
  1. Because the fake training dataset is Paired-end, each sample is printed 2 times. But we will want to iterate only one time by sample to run analysis on each sample. Search for a way to display only once a prefix corresponding to each sample (without “_R1” or “_R2”).

Terminal
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/

for full_filenameR1 in 01_toy_data/fake_PE_fastq/*R1*gz
do
  echo Path R1: $full_filenameR1
  
  filename=$(basename $full_filenameR1)

  prefix=${filename/_R1.fastq.gz/}

  echo Prefix : $prefix
done
Path R1: 01_toy_data/fake_PE_fastq/Irrel_kd_1_R1.fastq.gz
Prefix : Irrel_kd_1
Path R1: 01_toy_data/fake_PE_fastq/Mov10_oe_1_R1.fastq.gz
Prefix : Mov10_oe_1
  1. Retrieve the code you wrote to obtain the number of sequences from a single fastq file, and iterate over each sample in the PE training dataset to obtain the number of sequences from each fastq file.
Terminal
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project

output_dir="/home/rstudio/mydatalocal/tuto_bash/results/seq_numbers"
mkdir -p $output_dir

for full_filenameR1 in 01_toy_data/fake_PE_fastq/*R1*gz
do


...


done

Terminal
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project

output_dir="/home/rstudio/mydatalocal/tuto_bash/results/seq_numbers"
mkdir -p $output_dir

for full_filenameR1 in 01_toy_data/fake_PE_fastq/*R1*gz
do
  echo Filename : $full_filenameR1

  fastq_dirname=$(dirname $full_filenameR1)
  filenameR1=$(basename $full_filenameR1)
  
  filenameR2=${filenameR1/R1/R2}
  full_filenameR2="$fastq_dirname/$filenameR2"
  
  prefix=${filenameR1/_R1.fastq.gz/}
  
  output_file="$output_dir/$prefix.nb_sequences.txt"
  
  echo Prefix: $prefix # display Irrel_kd_1
  
  nb_lines_R1=$(zcat $full_filenameR1 | grep "@HWI-ST330" | wc -l)
  echo "$filenameR1: $nb_lines_R1"> $output_file
  
  nb_lines_R2=$(zcat $full_filenameR2 | grep "@HWI-ST330" | wc -l)
  echo "$filenameR2: $nb_lines_R2">> $output_file
  
  #to check
  cat $output_file

done

#to check the output dir
ls $output_dir

4 Raw data cleaning

We will use fastp for data cleaning.

fastp is a tool designed to offer fast all-in-one preprocessing for FastQ files. Developed in C++ with support for multithreading, this tool ensures high performance.

4.1 On the training dataset

4.1.1 Fastp: Trim data and remove adapters

For this complex tool, we provide you with the command to execute for a Single-End or Paired-end dataset. In the “true” life, one typically consults documentation or seeks assistance from forums or online tutorials to determine the appropriate command. The key point is to understand ((or at least grasp the concept of) the code you wiare going to run.

For the next one tool, you will find the command independently from the documentation. (We will assist you, don’t worry!).

  • Code for one sample of a SE dataset (so one file)
Script: Do not run
#Do not run
mkdir -p results/04_trim_data results/04_report_trimming

#For one SE fastq file

fastp --thread 7 \
      --qualified_quality_phred 30 \
      --length_required 80 \
      --in1 01_toy_data/raw_fastq/Irrel_kd_1.subset.fastq.gz \
      --out1 results/04_trimmed_data/Irrel_kd_1_trim.fastq.gz \
      --html results/04_report_trimming/Irrel_kd_1_trim_fastp.html \
      --json results/04_report_trimming/Irrel_kd_1_trim_fastp.json \
      --report_title Irrel_kd_1

(For more readability you can cut a command line with a backslash \ without consequence. Indeed, the backslash is used by bash to indicate a line continuation and is commonly used in bash scripts.)

  • Code for one sample of a PE dataset (so 2 paired files)
Script: Do not run
#Do not run
cd /home/rstudio/mydatalocal/tuto_bash

#for one pair of fastq files sample1 sample1_R1.fastq.gz & sample1_R2.fastq.gz

fastp --thread 7 \
      --qualified_quality_phred 30 \
      --detect_adapter_for_pe \
      --length_required 80 \
      --in1  fake_PE_fastq/Irrel_kd_1_R1.fastq.gz \
      --in2  fake_PE_fastq/Irrel_kd_1_R2.fastq.gz \
      --out1 results/04_trimmed_data/Irrel_kd_1_trim_R1.fastq.gz \
      --out2 results/04_trimmed_data/Irrel_kd_1_trim_R2.fastq.gz \
      --html results/04_report_trimming/Irrel_kd_1_trim_fastp.html \
      --json results/04_report_trimming/Irrel_kd_1_trim_fastp.json \
      --report_title Irrel_kd_1

Using the documentation, we will break down the different options. (Then, you will be able to annotate this poorly commented code.)

Depending on the library preparation kit you used, you may need to remove UMI from your reads (e.g., kit Corall from lexogen) Fastp can Fastp can accomplish this with the following additional options:

--umi --umi_loc read1 --umi_len 12  # remove UMI for Corall kit

4.1.2 Build the script to run on the training dataset

To iterate over filenames we need to replace all actual filenames and paths with variables.

Using variables:

Script: Do not run
#Do not run
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project

output_dir_trimmed_data=results/04_trimmed_data
output_dir_report=results/04_report_trimming

mkdir -p $output_dir_trimmed_data $output_dir_report

#Filenames and path for one SE fastq file
full_filenameSE=01_toy_data/raw_fastq/Irrel_kd_1.subset.fastq.gz
filenameSE=Irrel_kd_1.subset.fastq.gz
prefix=Irrel_kd_1.subset
output_SE_trim="$output_dir_trimmed_data/$prefix"_trim.fastq.gz

echo full_filenameSE: $full_filenameSE
echo output_SE_trim: $output_SE_trim

fastp --thread 7 \
      --qualified_quality_phred 30 \
      --length_required 80 \
      --in1  ${full_filenameSE} \
      --out1 ${output_SE_trim} \
      --html ${output_dir_report}/${prefix}_trim_fastp.html \
      --json ${output_dir_report}/${prefix}_trim_fastp.json \
      --report_title ${prefix}

Now that we have designed the command for one sample, we can iterate on each sample. It is necessary to generate filenames and paths programmatically, as we did before.

  • Code to iterate on all samples to be adapted.

    • For SE data
Script: Do not run
#Do not run

#SE data
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project

output_dir_trimmed_data=results/04_trimmed_data
output_dir_report=results/04_report_trimming

mkdir -p $output_dir_trimmed_data $output_dir_report

for full_filenameSE in 01_toy_data/raw_fastq/*gz
do
  #For each SE fastq file
  echo Path : $full_filenameSE treatment:
  
  dirname=$(dirname $full_filenameSE) # actually we do not use it here!
  
  filenameSE=$(basename $full_filenameSE)
  prefix=${filenameSE/.fastq.gz/}
  
  output_SE_trim="$output_dir_trimmed_data/${prefix}_trim.fastq.gz"
  
  echo Prefix : $prefix
  echo full_filenameSE: $full_filenameSE
  echo output_SE_trim: $output_SE_trim
  
  fastp --thread 7 \
        --qualified_quality_phred 30 \
        --length_required 80 \
        --in1  ${full_filenameSE} \
        --out1 ${output_SE_trim} \
        --html ${output_dir_report}/${prefix}_trim_fastp.html \
        --json ${output_dir_report}/${prefix}_trim_fastp.json \
        --report_title ${prefix}
done
  • For PE data, we need to iterate only on “R1” files, and generate “R2” filenames programmatically, as we did before:
Script: Do not run
#Do not run

#(fake) PE data
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project

output_dir_trimmed_data=results/04_trimmed_data_PE
output_dir_report=results/04_report_trimming_PE

mkdir -p $output_dir_trimmed_data $output_dir_report

for full_filenameR1 in 01_toy_data/fake_PE_fastq/*R1*gz
do
  #For each PE R1 fastq file
  echo Filename : $full_filenameR1 treatment:

  fastq_dirname=$(dirname $full_filenameR1) # actually, we do not use it here!
  
  filenameR1=$(basename $full_filenameR1)
  
  filenameR2=${filenameR1/R1/R2}
  full_filenameR2="$fastq_dirname/$filenameR2"
  
  prefix=${filenameR1/_R1.fastq.gz/}
  
  echo Prefix: $prefix

  fastp --thread 7 \
        --qualified_quality_phred 30 \
        --detect_adapter_for_pe \
        --length_required 80 \
        --in1  $full_filenameR1 \
        --in2  $full_filenameR2 \
        --out1 ${output_dir_trimmed_data}/${prefix}_trim_R1.fastq.gz \
        --out2 ${output_dir_trimmed_data}/${prefix}_trim_R2.fastq.gz \
        --html ${output_dir_report}/${prefix}_trim_fastp.html \
        --json ${output_dir_report}/${prefix}_trim_fastp.json \
        --report_title ${prefix}
done

Exercices

Trim (SE or PE) training dataset : You have to check and save the code in scripts (scripts/04_trim_SE_data_fastp.sh; scripts/04_trim_PE_data_fastp.sh) and then execute them. (Do not forget to replace yourName by your actual name!)

Terminal
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/
bash scripts/04_trim_SE_data_fastp.sh

bash scripts/04_trim_PE_data_fastp.sh

4.1.3 Agglomerate your triming results with multiqc

Once, the script is done you can use again multiqc to agglomerate trimming report of fastp.

Script: scripts/05_agglomerate_fastp_reports.sh
#!/bin/bash

cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/
multiqc results/04_report_trimming/ -o results/05_multiqc_report_trimming/

Terminal
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/
bash scripts/05_agglomerate_fastp_reports.sh

4.2 On your dataset

4.2.1 Agglomerate your QC results with multiqc

Now, it’s time to look to the quality result of your data. For that you can use multiqc. Write a script and execute it.

Script: scripts/03_agglomerate_fastqc.sh [To be adjusted]
#!/bin/bash

cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/my_project/
multiqc results/02_fastqc/  -o results/03_fastqc_multiqc/
Terminal [To be adjusted]
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/my_project/
bash scripts/03_agglomerate_fastqc.sh

4.2.2 Check the quality of your raw reads and define fastp options

Now, it’s time to look to the quality result of your data, define which fastp option are needed, write a command for one file, write a script for all files using a for loop and execute it.

(Before execute it on all sample, it’s better to check that it’s run on one sample.)

4.2.3 Run fastp on your data

Exercices

From previous exercises, adapt the code corresponding to your data (PE or SE) for

  1. On one sample

  2. On all samples

  3. Build your scripts and execute them.

4.2.4 Agglomerate your triming results with multiqc

Script: scripts/05_agglomerate_fastp_reports.sh [To be adjusted]
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/my_project/

multiqc results/04_report_trimming/ -o results/05_multiqc_report_trimming/
Terminal [To be adjusted]
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/my_project/
bash scripts/05_agglomerate_fastp_reports.sh

See you in the next session


These materials have been developed by members of BIBS team of the CIRI (https://ciri.ens-lyon.fr/). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

  • Some materials used in these lessons were derived or adapted from work made available by the Harvard Chan Bioinformatics Core (HBC) (https://github.com/hbctraining) under the Creative Commons Attribution license (CC BY 4.0).

cc_by_sa

sources