Learning objectives
FastQC is a widely used software that offers a straightforward method for conducting quality control checks on raw sequence data.
Its main functions include:
The different sequencing quality indicator include:
You can find a good report here and a bad report here.
Some known pitfalls/issues are listed here. One of the most well-known issues pertains to the biased composition of the first bases, as detailed here
#!/bin/bash
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/my_project
mkdir -p results/02_fastqc/
fastqc -o results/02_fastqc/ 01_raw_data/raw_fastq/*.fastq.gz
and run the script again
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/my_project
bash 02_check_quality.sh
For now, we have placed all the scripts at the root of your project directory. In fact it is a bad idea.
It is recommended to put all files related to the same project within a single folder, whether they are:
However, it is advisable to further organize these files by separating them into sub directories.
This method will help you navigate your project more effectively and avoid mixing up or accidentally deleting files.
For instance, your working directory might resemble the following:
project_name/
├── README.md # overview of the project
├── data/ # data files used in the project
├── results/ # results of the analysis (data, tables, figures)
├── scripts/ # contains all code in the project
│ ├── 01_...
│ ├── 02_...
│ └── ...
└── doc/ # documentation for your project
└── ...
Moreover, to be able to find your way around and for questions of reproducibility, you need to add a file, often named README.md, to the root of your folder which will contain all the useful information to take the project in hand. This is also the file that will be visible on the home page of your project on gitlab. So when someone wants to (re)work on the project, they will be able to open the folder, and they will know where to go to see and understand what has been done. This person can be a collaborator, your manager or simply yourself 6 months later.
Exercices
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project
mkdir -p scripts
mv 0*sh scripts
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/my_project
mkdir -p scripts
mv 0*sh scripts
mkdir -p 01_raw_data/fastq_files/
mv 01_raw_data/raw_fastq/*.fastq.gz 01_raw_data/fastq_files/
#check it is ok and then you can remove the old directory
#rm -r 01_raw_data/raw_fastq
Another aspect of staying organized is making sure that all the directories and filenames for an analysis are as consistent as possible. You want to avoid names like alignment1.bam, and rather have names like 20170823_kd_rep1_gmap-1.4.bam which provide a basic level of information about the file.
Good name must be :
You can refer to this link and slideshow for some useful guidelines for file naming dos and don’ts.
In the previous session, you were introduced to some basic bash commands. In this session, we will explore additional commands to enhance your bash proficiency. As previously, the idea is only to provide you with commands relevant to this training but not with a comprehensive overview of bash.
To begin, create a new “tuto_bash” directory for the exercises within mydatalocal
cd /home/rstudio/mydatalocal
mkdir -p tuto_bash
cd tuto_bash
grep
-> Used to extract lines from a file that match a specific pattern.cowsay -f dragon "Hello Hello!" > dragon.txt
cat dragon.txt
grep "-" dragon.txt
sed
-> Utilized for replacing all instances of a pattern with another.sed "s/Hello/TOTO/g" dragon.txt
sed
can do a lot of other things but we will use only this one.
To replace only the first occurrence of a pattern, omit the g option
sed "s/Hello/TOTO/" dragon.txt
wc -l
-> Count the number of lines in a filewc -l dragon.txt
wc
can do other things but we will use only this one.
basename
-> to get only the name of a file from a given path (the file may not exist)basename raw_fastq/sample1.fastq.gz
dirname
-> to get only the name of the directory containing the file from a given path (the file may not exist)dirname raw_fastq/sample1.fastq.gz
dirname /home/rstudio/mydatalocal/tuto_bash/raw_fastq/sample1.fastq.gz
|
Sometimes, you need to chain several consecutive commands. For example, if you want to count the number of line containing a pattern in a file:
grep "-" dragon.txt > occurences.txt
wc -l occurences.txt
You can “pipe” the command using a pipe |
(AltGr+6 for PC and Alt + Maj + L for Mac ) between two commands to link the output of the first command as input to the second.
grep "-" dragon.txt | wc -l
(By the way, grep have an option for counting occurrences (grep -c), but it was for the example.)
gunzip
and zcat
-> Used for decompressing files.As previously seen, fastq files are huge files and are commonly stored in compressed formats. In Unix systems, the preferred compression format is typically .gz
(utilizing gunzip
) rather than .zip
, although the concept remains the same.
To check the contents of a compressed file, you can unzip it with gunzip
, but this process consumes time and disk space. To do it quickly, there is a command called zcat
which enables you to decompress a .gz
file on the fly and display its contents. herefore, piping zcat
with head
allow you to display the first few lines of a .gz
file without consuming additional time or disk space.
zcat /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/01_toy_data/raw_fastq/Irrel_kd_1.subset.fastq.gz | head -n 8
Exercices
Using the commands we’ve just seen:
zcat /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/01_toy_data/raw_fastq/Irrel_kd_1.subset.fastq.gz | head -n 400
zcat /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/01_toy_data/raw_fastq/Irrel_kd_1.subset.fastq.gz | grep "@HWI-ST330" | wc -l
In scripts, we often use the path to a file multiple times. It is more convenient to store it in a named “variable” and then use it as needed. This way, if you need to modify the path, you only need to change it once instead of at every occurrence.
To define a variable, you will declare it with an =
without spaces.
variable1="toto"
Quote are not mandatory unless the string contains spaces.
variable1=toto
Then, you can refer to its content by using a $
before its name. To display the content of a variable you can use echo
.
echo $variable1
echo variable1
echo "variable1: $variable1"
You can also combine variables to define a new one.
variable1="TOTO"
variable2="TATA"
echo variable1: $variable1
echo variable2: $variable2
variable3="$variable1 $variable2"
echo variable3: $variable3
It is advisable to use descriptive variable names rather than generic ones like “variable1”, “variable2”.
Sometimes, you may need to enclose the variable name in curly brackets ({}
) to distinguish it from the following string.
For example, if you want to create a filename using the content of a variable called $sample
as a prefix, such as $sample_rep1.fastq.gz
, you need to add curly brackets around sample (${sample}_rep1.fastq.gz
). Otherwise, the terminal will attempt to interpret $sample_rep1
as a single variable.
sample="sampleA"
filenameX=$sample_rep1.txt #variable sample_rep1 don't exist so filenameX=.txt
filenameY=${sample}_rep1.txt
echo filenameX: $filenameX
echo filenameY: $filenameY
Variable can also be use to save the output of a command, using $(command)
list_files=$(ls)
working_directory=$(pwd)
echo "$working_directory contains the following files:
$list_files"
You can use also pipes inside
nb_dash=$(grep "-" dragon.txt | wc -l)
echo $nb_dash
Using variables will simplify both your script writing and reading processes.
filename=dragon.txt
new_filename=dragon_bis.txt
echo "$filename will be cp to $new_filename"
cp $filename $new_filename
You can also perform pattern substitutions in a variable, using ${variable/pattern/replace}
to replace the first occurrence.
filenameR1=sample1.R1.fastq.gz
echo ${filenameR1/R1/R2}
If you want to remove a pattern, you can use the same syntax but without a replacement string. This is a useful method to remove the extension of a file name.
filenameR1=sample1.R1.fastq.gz
echo ${filenameR1/.R1.fastq.gz/}
You can save the output of this pattern substitution in a new variable
filenameR1=sample1.R1.fastq.gz
filenameR2=${filenameR1/R1/R2}
echo R1: $filenameR1 R2: $filenameR2
For the following exercise, we need Paired-End (PE) data. PE fastq file names differ only by “R1” or “R2”. The file containing reads 1, contains “R1” in its file name and the other file containing reads 2, contains “R2”. Therefore, from one, you can deduce the other. The training dataset is single end (SE). We will just create a fake PE dataset (until we find a real one).
fastqgz_dir="/home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/01_toy_data/raw_fastq/"
ls $fastqgz_dir
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/01_toy_data/
mkdir -p fake_PE_fastq
cp $fastqgz_dir/Irrel_kd_1.subset.fastq.gz fake_PE_fastq/Irrel_kd_1_R1.fastq.gz
cp $fastqgz_dir/Irrel_kd_1.subset.fastq.gz fake_PE_fastq/Irrel_kd_1_R2.fastq.gz
cp $fastqgz_dir/Mov10_oe_1.subset.fastq.gz fake_PE_fastq/Mov10_oe_1_R1.fastq.gz
cp $fastqgz_dir/Mov10_oe_1.subset.fastq.gz fake_PE_fastq/Mov10_oe_1_R2.fastq.gz
You must have:
01_toy_data/
├── fake_PE_fastq
│ ├── Irrel_kd_1_R1.fastq.gz
│ ├── Irrel_kd_1_R2.fastq.gz
│ ├── Mov10_oe_1_R1.fastq.gz
│ └── Mov10_oe_1_R2.fastq.gz
└── raw_fastq
├── Irrel_kd_1.subset.fastq.gz
├── Irrel_kd_2.subset.fastq.gz
├── Irrel_kd_3.subset.fastq.gz
├── Mov10_oe_1.subset.fastq.gz
├── Mov10_oe_2.subset.fastq.gz
└── Mov10_oe_3.subset.fastq.gz
Exercices
You will work in the /home/rstudio/mydatalocal/tuto_bash directory to do the following exercice. But data will be in /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/01_toy_data/raw_fastq/
cd /home/rstudio/mydatalocal/tuto_bash
fastq_data_dir="/home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/01_toy_data/"
Using the commands we’ve just seen:
prefix
containing the prefix of the sample name (“Irrel_kd_1”) from one file of a pair of PE fastq files (fake_PE_fastq/Irrel_kd_1_R1.fastq.gz
)full_filenameR1=$fastq_data_dir/fake_PE_fastq/Irrel_kd_1_R1.fastq.gz
...
echo $prefix # display Irrel_kd_1
full_filenameR1=$fastq_data_dir/fake_PE_fastq/Irrel_kd_1_R1.fastq.gz
filenameR1=$(basename $full_filenameR1)
prefix=${filenameR1/_R1.fastq.gz/}
echo $prefix # display Irrel_kd_1
output_dir="results/seq_numbers"
mkdir -p $output_dir
full_filenameR1=$fastq_data_dir/fake_PE_fastq/Irrel_kd_1_R1.fastq.gz
filenameR1=$(basename $full_filenameR1)
prefix=${filenameR1/_R1.fastq.gz/}
echo $prefix # display Irrel_kd_1
zcat $full_filenameR1 | grep "@HWI-ST330" | wc -l > $output_dir/$prefix.nb_sequences.txt
Starting from the full filename: full_filenameR1=$fastq_data_dir/fake_PE_fastq/Irrel_kd_1_R1.fastq.gz
, write codes to save automatically:
full_filenameR1=$fastq_data_dir/fake_PE_fastq/Irrel_kd_1_R1.fastq.gz
...
echo filenameR2: $filenameR2 # display Irrel_kd_1_R2.fastq.gz
echo full_filenameR2: $full_filenameR2 # display /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/01_toy_data//fake_PE_fastq/Irrel_kd_1_R2.fastq.gz
full_filenameR1=$fastq_data_dir/fake_PE_fastq/Irrel_kd_1_R1.fastq.gz
fastq_dirname=$(dirname $full_filenameR1)
filenameR1=$(basename $full_filenameR1)
echo filenameR1: $filenameR1
echo full_filenameR1: $full_filenameR1
filenameR2=${filenameR1/R1/R2}
full_filenameR2="$fastq_dirname/$filenameR2"
echo filenameR2: $filenameR2
echo full_filenameR2: $full_filenameR2
the output file must contain:
Irrel_kd_1_R1.fastq.gz: 222921
Irrel_kd_1_R2.fastq.gz: 222921
cd /home/rstudio/mydatalocal/tuto_bash/
output_dir="results/seq_numbers"
mkdir -p $output_dir
full_filenameR1=$fastq_data_dir/fake_PE_fastq/Irrel_kd_1_R1.fastq.gz
fastq_dirname=$(dirname $full_filenameR1)
filenameR1=$(basename $full_filenameR1)
filenameR2=${filenameR1/R1/R2}
full_filenameR2="$fastq_dirname/$filenameR2"
prefix=${filenameR1/_R1.fastq.gz/}
output_file="$output_dir/$prefix.nb_sequences.txt"
echo $prefix # display sampleA
nb_lines_R1=$(zcat $full_filenameR1 | grep "@HWI-ST330" | wc -l)
echo "$filenameR1: $nb_lines_R1"> $output_file
nb_lines_R2=$(zcat $full_filenameR2 | grep "@HWI-ST330" | wc -l)
echo "$filenameR2: $nb_lines_R2">> $output_file
cat $output_dir/$prefix.nb_sequences.txt
Bioinformatic analysis often involves executing numerous commands, but with only slight variations each time you run a new command.
For instance, you might have to analyze sample1, then sample2, sample3, and so forth. To save time, we employ something known as a loop, specifically a for-loop.
Let’s clarify this concept with a straightforward example. Suppose you wish to print the letters a, b and c. You could achieve this by typing echo a
, hitting return; then echo b
, and echo c
. While this accomplishes the task, it is evident that this method would become cumbersome if you had to to print dozens or hundreds of numbers.
How can we simplify this process? You probably noticed that each time we ran the command, we changed the letter after the echo command. A for-loop automates this process for you.
Here is an example of a for-loop to print the letters a, b, c, d:
for X in a b c d
do
echo $X
done
The for-loop comprises three sections:
You can include additional commands to the Body section. For instance, we could modify the loop to:
for X in sheep elephant fox duck
do
# Display the value in X
echo $X
#Do something independent of the variable X
cowsay -f turtle "Hello"
#Do something dependent of the content of X ($X)
cowsay -f $X "Hello, I'm a $X"
#Do another thing, if you want
cowsay -f turtle "Welcome, $X"
done
You can use a variable, as seen previously, to store the list of items to iterate over. Additionally, it is preferable to use a more descriptive variable name than “X” for iteration (for example “letter”).
list="a b c d"
for letter in $list
do
echo $letter
echo Do something
echo Do something with $letter
echo Do another thing
done
Furthermore, the list to iterate over can be automatically generated. For instance, if you wish to iterate over a list of files you can do:
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/
list_files=$(ls 01_toy_data/raw_fastq/*gz)
echo $list_files
echo "Start:"
for file in $list_files
do
echo $file
echo Do something
echo Do something with $file
echo Do another thing
done
echo "End"
#OR
for file in 01_toy_data/raw_fastq/*gz
do
echo $file
echo Do something
echo Do something with $file
echo Do another thing
done
Note: Indentations are not important for the machine, but they helps in code readability for humans.
Exercices
Using the commands we’ve just seen:
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/
for full_filename in 01_toy_data/fake_PE_fastq/*gz
do
echo Path : $full_filename
done
Path : fake_PE_fastq/Irrel_kd_1_R1.fastq.gz
Path : fake_PE_fastq/Irrel_kd_1_R2.fastq.gz
Path : fake_PE_fastq/Mov10_oe_1_R1.fastq.gz
Path : fake_PE_fastq/Mov10_oe_1_R2.fastq.gz
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/
for full_filename in 01_toy_data/fake_PE_fastq/*gz
do
echo Path : $full_filename
filename=$(basename $full_filename)
prefix=${filename/.fastq.gz/}
echo Prefix : $prefix
done
Path : fake_PE_fastq/Irrel_kd_1_R1.fastq.gz
Prefix : Irrel_kd_1_R1
Path : fake_PE_fastq/Irrel_kd_1_R2.fastq.gz
Prefix : Irrel_kd_1_R2
Path : fake_PE_fastq/Mov10_oe_1_R1.fastq.gz
Prefix : Mov10_oe_1_R1
Path : fake_PE_fastq/Mov10_oe_1_R2.fastq.gz
Prefix : Mov10_oe_1_R2
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/
for full_filenameR1 in 01_toy_data/fake_PE_fastq/*R1*gz
do
echo Path R1: $full_filenameR1
filename=$(basename $full_filenameR1)
prefix=${filename/_R1.fastq.gz/}
echo Prefix : $prefix
done
Path R1: 01_toy_data/fake_PE_fastq/Irrel_kd_1_R1.fastq.gz
Prefix : Irrel_kd_1
Path R1: 01_toy_data/fake_PE_fastq/Mov10_oe_1_R1.fastq.gz
Prefix : Mov10_oe_1
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project
output_dir="/home/rstudio/mydatalocal/tuto_bash/results/seq_numbers"
mkdir -p $output_dir
for full_filenameR1 in 01_toy_data/fake_PE_fastq/*R1*gz
do
...
done
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project
output_dir="/home/rstudio/mydatalocal/tuto_bash/results/seq_numbers"
mkdir -p $output_dir
for full_filenameR1 in 01_toy_data/fake_PE_fastq/*R1*gz
do
echo Filename : $full_filenameR1
fastq_dirname=$(dirname $full_filenameR1)
filenameR1=$(basename $full_filenameR1)
filenameR2=${filenameR1/R1/R2}
full_filenameR2="$fastq_dirname/$filenameR2"
prefix=${filenameR1/_R1.fastq.gz/}
output_file="$output_dir/$prefix.nb_sequences.txt"
echo Prefix: $prefix # display Irrel_kd_1
nb_lines_R1=$(zcat $full_filenameR1 | grep "@HWI-ST330" | wc -l)
echo "$filenameR1: $nb_lines_R1"> $output_file
nb_lines_R2=$(zcat $full_filenameR2 | grep "@HWI-ST330" | wc -l)
echo "$filenameR2: $nb_lines_R2">> $output_file
#to check
cat $output_file
done
#to check the output dir
ls $output_dir
We will use fastp for data cleaning.
fastp is a tool designed to offer fast all-in-one preprocessing for FastQ files. Developed in C++ with support for multithreading, this tool ensures high performance.
For this complex tool, we provide you with the command to execute for a Single-End or Paired-end dataset. In the “true” life, one typically consults documentation or seeks assistance from forums or online tutorials to determine the appropriate command. The key point is to understand ((or at least grasp the concept of) the code you wiare going to run.
For the next one tool, you will find the command independently from the documentation. (We will assist you, don’t worry!).
#Do not run
mkdir -p results/04_trim_data results/04_report_trimming
#For one SE fastq file
fastp --thread 7 \
--qualified_quality_phred 30 \
--length_required 80 \
--in1 01_toy_data/raw_fastq/Irrel_kd_1.subset.fastq.gz \
--out1 results/04_trimmed_data/Irrel_kd_1_trim.fastq.gz \
--html results/04_report_trimming/Irrel_kd_1_trim_fastp.html \
--json results/04_report_trimming/Irrel_kd_1_trim_fastp.json \
--report_title Irrel_kd_1
(For more readability you can cut a command line with a backslash \
without consequence. Indeed, the backslash is used by bash to indicate a line continuation and is commonly used in bash scripts.)
#Do not run
cd /home/rstudio/mydatalocal/tuto_bash
#for one pair of fastq files sample1 sample1_R1.fastq.gz & sample1_R2.fastq.gz
fastp --thread 7 \
--qualified_quality_phred 30 \
--detect_adapter_for_pe \
--length_required 80 \
--in1 fake_PE_fastq/Irrel_kd_1_R1.fastq.gz \
--in2 fake_PE_fastq/Irrel_kd_1_R2.fastq.gz \
--out1 results/04_trimmed_data/Irrel_kd_1_trim_R1.fastq.gz \
--out2 results/04_trimmed_data/Irrel_kd_1_trim_R2.fastq.gz \
--html results/04_report_trimming/Irrel_kd_1_trim_fastp.html \
--json results/04_report_trimming/Irrel_kd_1_trim_fastp.json \
--report_title Irrel_kd_1
Using the documentation, we will break down the different options. (Then, you will be able to annotate this poorly commented code.)
Depending on the library preparation kit you used, you may need to remove UMI from your reads (e.g., kit Corall from lexogen) Fastp can Fastp can accomplish this with the following additional options:
--umi --umi_loc read1 --umi_len 12 # remove UMI for Corall kit
To iterate over filenames we need to replace all actual filenames and paths with variables.
Using variables:
#Do not run
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project
output_dir_trimmed_data=results/04_trimmed_data
output_dir_report=results/04_report_trimming
mkdir -p $output_dir_trimmed_data $output_dir_report
#Filenames and path for one SE fastq file
full_filenameSE=01_toy_data/raw_fastq/Irrel_kd_1.subset.fastq.gz
filenameSE=Irrel_kd_1.subset.fastq.gz
prefix=Irrel_kd_1.subset
output_SE_trim="$output_dir_trimmed_data/$prefix"_trim.fastq.gz
echo full_filenameSE: $full_filenameSE
echo output_SE_trim: $output_SE_trim
fastp --thread 7 \
--qualified_quality_phred 30 \
--length_required 80 \
--in1 ${full_filenameSE} \
--out1 ${output_SE_trim} \
--html ${output_dir_report}/${prefix}_trim_fastp.html \
--json ${output_dir_report}/${prefix}_trim_fastp.json \
--report_title ${prefix}
Now that we have designed the command for one sample, we can iterate on each sample. It is necessary to generate filenames and paths programmatically, as we did before.
Code to iterate on all samples to be adapted.
#Do not run
#SE data
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project
output_dir_trimmed_data=results/04_trimmed_data
output_dir_report=results/04_report_trimming
mkdir -p $output_dir_trimmed_data $output_dir_report
for full_filenameSE in 01_toy_data/raw_fastq/*gz
do
#For each SE fastq file
echo Path : $full_filenameSE treatment:
dirname=$(dirname $full_filenameSE) # actually we do not use it here!
filenameSE=$(basename $full_filenameSE)
prefix=${filenameSE/.fastq.gz/}
output_SE_trim="$output_dir_trimmed_data/${prefix}_trim.fastq.gz"
echo Prefix : $prefix
echo full_filenameSE: $full_filenameSE
echo output_SE_trim: $output_SE_trim
fastp --thread 7 \
--qualified_quality_phred 30 \
--length_required 80 \
--in1 ${full_filenameSE} \
--out1 ${output_SE_trim} \
--html ${output_dir_report}/${prefix}_trim_fastp.html \
--json ${output_dir_report}/${prefix}_trim_fastp.json \
--report_title ${prefix}
done
#Do not run
#(fake) PE data
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project
output_dir_trimmed_data=results/04_trimmed_data_PE
output_dir_report=results/04_report_trimming_PE
mkdir -p $output_dir_trimmed_data $output_dir_report
for full_filenameR1 in 01_toy_data/fake_PE_fastq/*R1*gz
do
#For each PE R1 fastq file
echo Filename : $full_filenameR1 treatment:
fastq_dirname=$(dirname $full_filenameR1) # actually, we do not use it here!
filenameR1=$(basename $full_filenameR1)
filenameR2=${filenameR1/R1/R2}
full_filenameR2="$fastq_dirname/$filenameR2"
prefix=${filenameR1/_R1.fastq.gz/}
echo Prefix: $prefix
fastp --thread 7 \
--qualified_quality_phred 30 \
--detect_adapter_for_pe \
--length_required 80 \
--in1 $full_filenameR1 \
--in2 $full_filenameR2 \
--out1 ${output_dir_trimmed_data}/${prefix}_trim_R1.fastq.gz \
--out2 ${output_dir_trimmed_data}/${prefix}_trim_R2.fastq.gz \
--html ${output_dir_report}/${prefix}_trim_fastp.html \
--json ${output_dir_report}/${prefix}_trim_fastp.json \
--report_title ${prefix}
done
Exercices
Trim (SE or PE) training dataset : You have to check and save the code in scripts (scripts/04_trim_SE_data_fastp.sh; scripts/04_trim_PE_data_fastp.sh) and then execute them. (Do not forget to replace yourName by your actual name!)
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/
bash scripts/04_trim_SE_data_fastp.sh
bash scripts/04_trim_PE_data_fastp.sh
Once, the script is done you can use again multiqc to agglomerate trimming report of fastp.
#!/bin/bash
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/
multiqc results/04_report_trimming/ -o results/05_multiqc_report_trimming/
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/
bash scripts/05_agglomerate_fastp_reports.sh
Now, it’s time to look to the quality result of your data. For that you can use multiqc. Write a script and execute it.
#!/bin/bash
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/my_project/
multiqc results/02_fastqc/ -o results/03_fastqc_multiqc/
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/my_project/
bash scripts/03_agglomerate_fastqc.sh
Now, it’s time to look to the quality result of your data, define which fastp option are needed, write a command for one file, write a script for all files using a for loop and execute it.
(Before execute it on all sample, it’s better to check that it’s run on one sample.)
Exercices
From previous exercises, adapt the code corresponding to your data (PE or SE) for
On one sample
On all samples
Build your scripts and execute them.
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/my_project/
multiqc results/04_report_trimming/ -o results/05_multiqc_report_trimming/
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/my_project/
bash scripts/05_agglomerate_fastp_reports.sh
These materials have been developed by members of BIBS team of the CIRI (https://ciri.ens-lyon.fr/). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.