Learning objectives

  • Understand the different steps of the RNA-seq workflow, from RNA extraction to assessing the expression levels of genes.
  • Get the basic skills to use the command line interface (bash)

1 Bulk RNA-seq data analysis in a nutshell

RNA-seq is an exciting experimental technique that is utilized to explore and/or quantify gene expression within or between conditions.

To perform Differential Gene Expression (DGE) analysis, we need to start with a matrix of counts representing the levels of gene expression where you have in row genes and in column samples. Below is an example of the first few rows of a count matrix.

After the normalization of the count matrix using dedicated tools, we will look for genes, as below, whose expression is statistically different between condition, which is why we need replicates.

It is important to understand how the count matrix is generated, before diving into the statistical analysis.

Source

So the goal of these first bloc of 4 sessions, is to discover the RNA-processing pipeline for bulk RNA-seq, and the different steps to go from raw sequencing reads to a gene expression count matrix.

We will see in the second bloc the statistical analysis.

It would be really difficult to do these steps on your computer, that why we will use more powerful computers that will not be turned off during the entire training. Also, you will be able to use your own computers outside of the training (which is much more convenient).

2 Start your powerful remote work computer

For the training we choose to use a convenient service of the French Institute of Bioinformatics - IFB which provide virtual powerful machine available for remote access.

2.1 What is Biosphere - the French Institute of Bioinformatics - IFB Cloud services ?

French Institute of Bioinformatics - IFB is providing cloud services to analyze life science data. These services rely on a federation of clouds - Biosphere - built on interconnected IT infrastructures of some IFB’s platforms.

The IFB cloud allows powerful computers to be shared among all users.

Each user can use one of these computers, called instances or virtual machine (VM), and configure it as he wishes.

The configuration of an instance includes:

  • its content (pre-installed software, operating system, …)
  • its computing power (CPU and RAM)
  • its storage

When the user shutdown one of his instances, the modifications made on the instance (installed software, …) are lost as well as the data that have been added on the storage associated to the VM (mydatalocal).

A permanent storage can be attached to a VM if your belong to a group. This storage will be shared between all the member of the group. It will not be deleted when a VM is shut down and will be automatically connected to each of your VMs.

We will use this particularity to store your data between distant sessions. This permanent storage will appear as a directory called ciri_rnaseq_2024 in your VM.

The principle of IFB is to share the computational resources shared between all users.

You should therefore think about:

  • Be careful, do not shut down your instance if you haven’t been asked to do!

After the training, you could use this computation solution but take car to

  • Turn off the instances that you are not using in order not to block the computation resources so that other users can use them.
  • Store your data on your own hard disk before turning off your instance else they will be lost.

More information: https://biosphere.france-bioinformatique.fr/

2.2 Create your account on the cloud

To create your account at IFB:

  1. go to https://biosphere.france-bioinformatique.fr/
  2. at the top right click on login/se connecter
  3. click again on login/se connecter at the center of the window
  4. click on accept conditions
  5. Select your employer (INSERM, CNRS, ENS de Lyon,..)
  6. identify yourself with your institutional login and password
  7. fill in the information Name, First name, city and postal code and leave the other information by default and accept.
  8. You will have a new page, in the top right corner click on the little man icon then group
  9. Click on Join a group in the tabs at the top left
  10. Search for the group CIRI-RNAseq-2024 (Formation à l’analyse de l’expression différentielle à partir de données RNA-Seq - CIRI 2024) and click on the + button to apply
  11. and it’s all good ! Now we have to validate your application.

After the training, you could ask to join the “CIRI” group to get a personal count and use the IFB cloud for your own projects.

2.3 Start your own VM

  • To start an instance, you have to go to the RAINBio catalog (top left), select it and choose an “image”.

  • For the training, you will use the image UE NGS-ENS Lyon. Go to the bottom of the window to see it.

  • To configure your instance (name and size) click on Configure (wrench symbol), then, you have 4 fields to fill:

    1. Name : RNAseq_training_yourname
    2. Group : CIRI-RNAseq-2024
    3. Cloud : meso-psmn-cirrus
    4. Template/gabarit:
      • standard.8c.32g (8 vCPU, 32GB RAM, 120Go local disk)

  • Then click on launch and go to the myVM panel (top left)
  • Your new VM has appeared in the “Deployments” frame. However the deployment may take > 20 min (on the left, the blue hourglass will be replaced by a green dot).
  • Then, you will be able to retrieve the login information in the access column
  • By clicking on the ID of your machine, you will have access to the address, the login and the password of your machine.
  • If the deployment of your machine has an error, try again

3 Presentation of the trainees and their dataset

While waiting for your machines to start, we will take some time for you to introduce yourself, your dataset and your expectations of this training.

4 From biological samples to computer files

Before we start uploading your sequencing output files, we need to understand the different steps leading to these files from biological samples. We will highlight the important steps that will then matter in the analysis.

4.1 Biological samples

To make proteins, the DNA is transcribed into messenger RNA, or mRNA, which is translated by the ribosome into protein. However, some genes encode RNA that does not get translated into protein; these RNAs are called non-coding RNAs, or ncRNAs. Often these RNAs have a function in and of themselves and include rRNAs, tRNAs, and siRNAs, among others. All RNAs transcribed from genes are called transcripts.

To be translated into proteins, the primary RNA must undergo processing to generate the mature mRNA. In the figure below, the top strand in the image represents a gene in the genome, comprised of the untranslated regions (UTRs) and the open read frame. In eucaryotes, genes are transcribed into pre-mRNA, which still contains the intronic sequences. After post-transciptional processing, the introns are spliced out and a polyA tail and 5’ cap are added to yield mature mRNA transcripts, which can be translated into proteins.

While most cellular mRNA transcripts have a polyA tail (eg. histones have not), many of the non-coding RNA transcripts do not, as the post-transcriptional processing is different for these transcripts.

Contrary to eukaryotes, prokaryote genes do not have intronic sequences neither poly-A tails.

In RNA-seq data analyses, you will analyze the mRNA expression levels and not the proteins levels. They are correlated but there could be differences due to translation efficiency, protein degradation or stabilization.

4.2 RNA Extraction and library preparation

Before RNA can be sequenced, it must first be extracted and separated from its cellular environment and prepared into a cDNA library. There are a number of steps involved which are outlined in the figure below, and in parallel there are various quality checks implemented to make sure we have good quality RNA to move forward with. We briefly describe some of these steps below.

  • Enriching for mRNA. Once the sample has been treated with DNAse to remove any contaminating DNA sequence, the sample undergoes either selection of the mRNA (polyA selection) or depletion of the rRNA. Generally, ribosomal RNA represents the majority of the RNAs present in a cell, while messenger RNAs represent a small percentage of total RNA, ~2% in humans. Therefore, if we want to study the protein-coding genes, we need to enrich for mRNA or deplete the rRNA. For differential gene expression analysis, it is best to enrich for Poly(A)+, unless you are aiming to obtain information about long non-coding RNAs, in which case ribosomal RNA depletion is recommended. To go further.

  • RNA Quality check: It is essential to check for the integrity of the extracted RNA prior to starting the cDNA library prepation. Traditionally, RNA integrity was assessed via gel electrophoresis by visual inspection of the ribosomal RNA bands; but that method is time consuming and imprecise. The Bioanalyzer system from Agilent will rapidly assess RNA integrity and calculate an RNA Integrity Number (RIN), which facilitates the interpretation and reproducibility of RNA quality. RIN, essentially, provides a means by which RNA quality from different samples can be compared to each other in a standardized manner.

  • Fragmentation and size selection. The remaining RNA molecules are then fragmented. This is done either via chemical, enzymatic (e.g., RNAses) or physical processes (e.g., chemical/mechanical shearing). These fragments then undergo size selection to retain only those fragments within a size range that Illumina sequencing machines can handle best, i.e., between 150 to 300 bp.

  • Fragment size quality check: After size selection/exclusion the fragment size distribution should be assessed to ensure that it is unimodal and well-defined.

  • Reverse transcribe RNA into double-stranded cDNA. Information about which strand a fragment originated from can be preserved by creating stranded libraries. The most commonly used method incorporates deoxy-UTP during the synthesis of the second cDNA strand (for details see Levin et al. (2010)). Once double-stranded cDNA fragments are generated, sequence adapters are ligated to the ends. (Size selection can be performed here instead of at the RNA-level.)

  • PCR amplification. If the amount of starting material is low and/or to increase the number of cDNA molecules to an amount sufficient for sequencing, libraries are usually PCR amplified. Run as few amplication cycles as possible to avoid PCR artefacts.

Image source: Zeng and Mortavi, 2012

4.3 Sequencing (Illumina)

Sequencing of the cDNA libraries will generate reads. Reads correspond to the nucleotide sequences of the ends of each of the cDNA fragments in the library. You will have the choice of sequencing either a single end of the cDNA fragments (single-end reads) or both ends of the fragments (paired-end reads).

  • SE - Single end dataset => Only Read1
  • PE - Paired-end dataset => Read1 + Read2
    • can be 2 separate fastq files or just one with interleaved pairs

Generally single-end sequencing is sufficient unless it is expected that the reads will match multiple locations on the genome (e.g. organisms with many paralogous genes), assemblies are being performed, or for splice isoform differentiation. Be aware that paired-end reads are generally 2x more expensive but they are better for expression analysis. source

  • Sequencing-by-synthesis

Illumina sequencing technology uses a sequencing-by-synthesis approach. To explore sequencing by synthesis in more depth, please watch this linked video on Illumina’s YouTube channel.

We have provided a brief explanation of the steps below:

Cluster growth: The DNA fragments in the cDNA library are denatured and hybridized to the glass flowcell (adapter complementarity). Each fragment is then clonally amplified, forming a cluster of double-stranded DNA. This step is necessary to ensure that the sequencing signal will be strong enough to be detected/captured unambiguously for each base of each fragment.

  • Number of clusters ~= Number of reads

Sequencing: The sequencing of the fragment ends is based on fluorophore labelled dNTPs with reversible terminator elements. In each sequencing cycle, a base is incorporated into every cluster and excited by a laser.

Image acquisition: Each dNTP has a distinct excitatory signal emission which is captured by cameras.

Base calling: The Base calling program will then generate the sequence of bases, i.e. reads, for each fragement/cluster by assessing the images captured during the many sequencing cycles. In addition to calling the base in every position, the base caller will also report the certainty with which it was able to make the call (quality information).

  • Number of sequencing cycles = Length of reads

4.4 Fastq files

The raw reads obtained from the sequencer are stored as FASTQ files. The FASTQ file format is the defacto file format for sequence reads generated from next-generation sequencing technologies.

Each FASTQ file is a text file which represents sequence readouts for a sample. Each read is represented by 4 lines as shown below:

@HWI-ST330:304:H045HADXX:1:1101:1111:61397
CACTTGTAAGGGCAGGCCCCCTTCACCCTCCCGCTCCTGGGGGANNNNNNNNNNANNNCGAGGCCCTGGGGTAGAGGGNNNNNNNNNNNNNNGATCTTGG
+
@?@DDDDDDHHH?GH:?FCBGGB@C?DBEGIIIIAEF;FCGGI#########################################################
Line Description
1 Always begins with ‘@’ and then information about the read
2 The actual DNA sequence
3 Always begins with a ‘+’ and sometimes the same info as in line 1
4 Has a string of characters which represent the quality scores; must have same number of characters as line 2

These files can be huge (severals GB) and it is a good idea to store them in a compressed form to save space. So the extension of these files can be .fastq.gz, you can also find sometimes .fq.gz.

Now that we understand the origin of the contents of the fastq file, we can start to analyze them.

Many/most bioinformatic tools do not have graphical interfaces. To use them, you need some basics on the command line.

5 Learn the basics of using the command line

The goal is not to turn you into bioinformatics specialists but to give you some skills so that you can run basic commands. By this way you will be able to find, follow and understand tutorials and helps available on the jungle of the Internet

5.1 Connect you to your VM

To start, you need a linux environment where you can train yourself. For that, go back to the IFB website (There is a link on the top of this page) and go to the deployment frame in the “myVM” and connect you to the VM you started at the beginning of the session.

To connect to an instance you need 3 pieces of information: a login, a password and the IP address of your VM.

    1. Click on deployment “Params” link (right), then copy your Rstudio_PASWORD
    1. Click on deployment “https” link (right), which will send you to the correct IP.
    1. fill the login (“rstudio”), and your password in the popup window, et type “enter”. ”

Warning

  • You must accept the security warning to access the machine.
  • If you are using firefox, it is possible that firefox does not want to override the security warning, you must then use another browser e.g. chrome.

5.2 Discovering the Rstudio interface and access to its terminal

By default, the “R” console is open on the left-hand side. This is because R is the expected language in Rstudio, but we are going to use another Rstudio function to talk directly to the computer. To do this, we are going to use the terminal.

You can also see a file browser in the bottom right-hand corner and a text editor in the Source tab.

5.2.1 Access to the terminal

Open the “Terminal” tab of your Rstudio interface to access the terminal on your VM.

5.3 What is a terminal ?

A terminal is an interface used to give commands to a computer by typing only text. You already use a lot of commands indirectly through the graphical interface of your computer (like creating or opening folders, for example).

5.4 Why use a terminal ?

There are plenty or reasons. Here are 4 :

  1. Automation/scalability. With one command, it is possible to create 100 folders instantly, which is much more tedious through a graphical interface.
  2. Flexibility. Commands can have a wide array of inputs and options.
  3. Availability. A graphical interface is a luxury! Some programs (even some operating sytems) just. . . don’t have any. Thus, if you want to use a lot of programs out there, you need the terminal.
  4. Reproducibility. A set of commands can be stored in a script, which makes them easy to reproduce.

5.5 What does a terminal look like ?

The terminal’s input line has the following layout

login@hostname:~/path$

We can split this in two parts:

• who is executing the commands: name of the user ( login ), @ , and name of the machine ( hostname ).

• where on the system is the command being executed is indicated the spot “from where” the command is executed ( ~/path ).

A $ sign ends the input prompt.

5.6 Launch your first commands

You have to imagine that you are a cursor in a tree of directories and files of your computer and that you can move from directory to directory and launch commands from your position on the files of the folder you are in but also of the whole computer.

5.6.1 What is a command ?

A command is a text you pass to your terminal where you will call a program which will do something and which can take or not arguments and options.

An example will be easier to understand. cowsay is a (fun and useless) program which need a message as argument.

Type this command in your terminal and type enter to execute it.

Terminal
cowsay Hello

A program can take different options (with or without arguments)

Terminal
cowsay -f dragon Hi

Warning, each space is interpreted as a new argument, so you have to add quote to an argument with spaces.

Terminal
cowsay -e @@ "What! A talking dragon!"

The following command will list all available animals that can be made to speak.

Terminal
cowsay -l

If you want to know all the options of a (common) program, you have usually to use the option “-h” or “–help” to get help about its usage but you can find much more help on the Internet.

Terminal
cowsay -h

There are plenty of (more useful) programs. We will see only the basics.

5.6.2 Commands to navigate around your files tree.

  • pwd -> print working directory

To know where you are in the tree file

Terminal
pwd
  • ls -> list files and directories of your current directory
Terminal
ls
  • cd -> change directory
Terminal
cd mydatalocal
  • mkdir -> make new directorie(s)
Terminal
mkdir tuto_bash

With the -p option, it will create all intermediary directories which don’t exist

Terminal
mkdir tuto_bash/folderA/folderA_1 # raise an error
mkdir -p tuto_bash/folderA/folderA_1

The -p option allows also avoiding an error if directory exists

Terminal
mkdir tuto_bash # raise an error if tuto_bash exists
mkdir -p tuto_bash

5.6.3 Absolute/Relative paths

To know where your files are, the machine uses paths. There are 2 kinds of paths, relative and absolute.

  • A relative path is “how to get somewhere from my current position”. Relative paths use the “.” and “..” notations to look under or above your current position. “../../” means the parent directory of your parent directory.
Terminal
cd tuto_bash/folderA/folderA_1
pwd
Terminal
cd ../
pwd

cd ../../
pwd

cd # In the absence of an argument, returns to your starting directory (your "home").
  • Absolute paths all start with “/” , which is the root of the file system. Absolute paths are useful because (unlike relative paths) they are correct no matter where you are on the machine.
Terminal
cd /home/rstudio/mydatalocal/tuto_bash
  • *” the wildcard

a star is a “wildcard” matching any number of any characters. e.g* folder* matches all files/folders starting by “folder”.

Terminal
ls folder*

Exercices

Using the commands we’ve just seen:

    1. Make sure you are inside tuto_bash . What is its absolute path ?

Terminal
# Answer: check the prompt, and
pwd

#or
cd /home/rstudio/mydatalocal/tuto_bash
    1. Make a new directory inside tuto_bash named folderB/ and create a subdirectory folderB/folderB_1

Terminal
cd /home/rstudio/mydatalocal/tuto_bash

mkdir folderB
mkdir folderB/folderB_1

#or

mkdir folderB
cd folderB
mkdir folderB_1

#or

mkdir -p folderB/folderB_1
    1. Go in the last sub directory you created (/home/rstudio/mydatalocal/tuto_bash/folderB/folderB_1)

Terminal
#with an absolute path
cd  /home/rstudio/mydatalocal/tuto_bash/folderB/folderB_1
pwd

#or a relative path from  tuto_bash
cd  /home/rstudio/mydatalocal/tuto_bash
cd folderB/folderB_1
pwd

#or a relative path from folderB
cd  /home/rstudio/mydatalocal/tuto_bash/folderB
cd folderB_1
pwd
    1. From here, create a subdirectory folderA_2 in folderA

Terminal
#with an absolute path
mkdir -p /home/rstudio/mydatalocal/tuto_bash/folderA/folderA_2

#or a relative path from  folderB_1
cd  /home/rstudio/mydatalocal/tuto_bash/folderB/folderB_1
mkdir -p ../../folderA/folderA_2
    1. List all the directories of folderA

Terminal
#with an absolute path
ls /home/rstudio/mydatalocal/tuto_bash/folderA

#or a relative path from  folderB_1
cd  /home/rstudio/mydatalocal/tuto_bash/folderB/folderB_1
ls ../../folderA

5.6.4 Writing and Reading Text Files

Go back to the /home/rstudio/mydatalocal/tuto_bash/ directory

Terminal
cd  /home/rstudio/mydatalocal/tuto_bash/
  • touch -> to create empty file
Terminal
touch empty_file.txt

But you can easily create new file using the right bottom panel or using file > New File > Text File

Exercices

  • Create a text file and name it toto.txt,
  • Write something inside.
  • Save it.
  • echo -> display text in the terminal
Terminal
echo "hello world"
  • redirection > and >> -> > redirect the output of a command in a text file, >> add output of a command at the end of a file
Terminal
echo "hello world" > tata.txt
Terminal
echo "hello world 2" >> tata.txt

You can also redirect output of other commands.

Terminal
cowsay -f dragon "hello world" > dragon.txt
Terminal
cowsay  "hi" >> dragon.txt
  • cat -> concatenate files and display output
Terminal
cat dragon.txt  tata.txt

It is often use to display the content of one file

Terminal
cat dragon.txt

The cat command is useful for viewing the contents of smaller files, but if the file contains hundreds of lines of text, it is overwhelming to have everything printed to the Terminal at once. To see only a part of the file, we can use the commands head and tail to see the first few or the last few lines of the file, respectively.

  • head and tail -> to display first or last lines of a file
Terminal
head  dragon.txt
Terminal
tail  dragon.txt

You can specify the number of lines you want to display

Terminal
head -n 5  dragon.txt

Exercices

    1. Create 2 new files named turtle.txt and sheep.txt, containing animal saying something. You will use cowsay -f turtle message and redirection

Terminal
cowsay -f turtle hello > turtle.txt
cowsay -f sheep hello > sheep.txt
    1. Concatenate them and redirect the result in a file turtle_and_sheep.txt

Terminal
cat  turtle.txt sheep.txt  > turtle_and_sheep.txt
    1. Display the entire file, only the first 8 lines, the last 8 lines.

Terminal
cat turtle_and_sheep.txt

head -n 8  turtle_and_sheep.txt


tail -n 8 turtle_and_sheep.txt
    1. List all files of your current directory and save it in list_files.txt (Try to write in the first lines: “All the files of this directory:”, then the absolute path of the current directory)

Terminal
echo "All the files of this directory:" > list_files.txt
pwd >> list_files.txt

echo >> list_files.txt

ls >>  list_files.txt

cat list_files.txt

5.6.5 Copying and Removing Files

  • cp -> copy a file in another path. cp <file> <file|folder>

If you want to copy a file in the same directory: (you have to change the name)

Terminal
cp dragon.txt dragon_cp.txt

If you want to copy a file in another directory: (you can keep the same name or change the name)

Terminal
cp dragon.txt folderA/
ls folderA/
Terminal
cp dragon.txt folderA/dragon_cp.txt
ls folderA/
  • cp -r -> copy a directory
Terminal
cp -r folderA folderAprime
  • mv -> move a file/folder in another path. mv <file> <file|folder>

Move mv can be used to rename a file/folder or move it in another paths

Terminal
ls
mv dragon.txt dragon_new.txt
ls
Terminal
mv folderAprime folderC

If you want to move a file in another directory: (you can keep the same name or change the name)

Terminal
mv  dragon_new.txt folderA/
ls folderA/
Terminal
mv dragon_cp.txt folderA/dragon_cp_NEW.txt
ls folderA/
  • rm -> remove file(s). Deleting is immediate and permanent!
Terminal
rm empty_file.txt
  • rm -r -> remove recursively file(s) and directories. Deleting is immediate and permanent!
Terminal
mkdir -p folderC/folderC_1
cp sheep.txt folderC/folderC_1
ls folderC/folderC_1

rm -r folderC 
ls folderC/folderC_1  #raise an error

Exercices

    1. Copy turtle.txt in folderB/ and copy sheep.txt in folderB/folderB_1

Terminal
cp turtle.txt folderB/
cp sheep.txt folderB/folderB_1/
    1. Move folderB/turtle.txt in folderB/folderB_1/

Terminal
mv folderB/turtle.txt folderB/folderB_1/
    1. Move all files of folderB/folderB_1/ to folderA/folderA_1/. You will use the * (wildcard)

Terminal
ls folderB/folderB_1/

mv folderB/folderB_1/* folderA/folderA_1/

ls folderB/folderB_1/
ls folderA/folderA_1/
    1. List files of your current directory and remove files starting by “sheep”

Terminal
ls 

rm sheep*

ls
    1. Copy the subdirectory folderB_1 in folderD

Terminal
mkdir folderD
cp -r folderB/folderB_1/ folderD
  • 6 - use the command tree to get the files tree

Terminal
cd /home/rstudio/mydatalocal/tuto_bash
tree

├── folderA
│   ├── dragon_cp_NEW.txt
│   ├── dragon_cp.txt
│   ├── dragon_new.txt
│   ├── dragon.txt
│   ├── folderA_1
│   │   ├── sheep.txt
│   │   └── turtle.txt
│   └── folderA_2
├── folderB
│   └── folderB_1
├── folderD
│   └── folderB_1
├── list_files.txt
├── tata.txt
├── turtle_and_sheep.txt
└── turtle.txt

Most commands have multiple options which are specified with a single or double dash (e.g. -v , --verbose ). All options should be detailed on the help page of a command, which you can usually print in the terminal with the --help option.

Terminal
cp --help

5.7 Combining Commands in scripts

So far you have learned how to use commands. You’ll soon find, however, that large and complex blocks of code are tedious to write out by hand every time you want to run them. It is also difficult to debug a long string of code that you wrote in the Terminal.

Instead, we can put everything into a script, or file that contains code. This allows you to make your code compact and easy to move between directories if you need to. It also makes debugging much easier.

A bash (the language you use to communicate) script is a simple text file, usually starting with #!/bin/bash and named with a ‘.sh’ extension, in which you list commands to run.

Here is an example of a simple script:

Script: tuto_bash/my_script.sh
#!/bin/bash

cd /home/rstudio/mydatalocal/tuto_bash

echo "Run in:"
pwd

mkdir -p folder_1
echo "Made a folder named folder_1"
touch folder_1/text_file.txt
echo "I want to eat strawberries." > folder_1/text_file.txt
echo "Created a file in folder_1 with some text"

Open a text file in the RStudio text editor using File -> New File -> Shell Script, copy the script text in the new text file and save it as “my_script.sh” (saved in the tuto_bash folder). Run it with the bash command:

TERMINAL
cd /home/rstudio/mydatalocal/tuto_bash
bash my_script.sh

You can check that the script worked using the commands previously shown.

Comments

Always add comments to your scripts to keep track of what they do. In Bash, lines starting with a # symbol are ignored by the computer.

Script: folderA/my_script_comments.sh
#!/bin/bash

# This is a comment

#Display the working directory
echo "Run in:"
pwd

#Made a folder named folder_1
mkdir -p folder_1

# Created a file in folder_1 with some text
echo "I want to eat strawberries." > folder_1/text_file.txt

cowsay "me too"  >> folder_1/text_file.txt

Add a couple of comments to the script we created above to specify what the commands do.

In longer and more complex scripts, it’s important (necessary, even) to document each section properly.

You’ll be thankful when reading properly commented code from others (or from past you), rather than spend time figuring out what each part does.

Script can be run from another folder, you just have to adjust the path:

TERMINAL
cd /home/rstudio/mydatalocal/tuto_bash/
bash folderA/my_script_comments.sh
TERMINAL
cd /home/rstudio/mydatalocal/tuto_bash/folderB/folderB_1/
bash /home/rstudio/mydatalocal/tuto_bash/folderA/my_script_comments.sh

You can use tree to search for the “folder1” you created with the previous script

TERMINAL
cd /home/rstudio/mydatalocal/tuto_bash/
tree

(Usually, we prefer to use relative path.)

If you run a script from another directory, take care to the path used in your script. Indeed, the working directory will be the directory where you launch the script and not the directory containing the script. A good habit is to fix at the beginning of your script the working directory, for example :

Script: sandbox/my_template_script.sh
#!/bin/bash

cd /home/rstudio/mydatalocal/tuto_bash

[...]

You can redirect the output of your script in a file to keep log of its execution (if you have print message)

TERMINAL
cd /home/rstudio/mydatalocal/tuto_bash/
bash folderA/my_script_comments.sh > script.log

5.8 Helpful resources

6 Get a toy dataset for the training

As we want to keep this data between all the sessions, you will use the permanent storage shared between all the trainees.

For the following of the training, we will work in a new personal directory: /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project.

Warning: Change “YourName” by your actual name.

Create this directory and go inside.

Terminal
mkdir -p /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project

You can also remove the tuto_bash directory

Terminal
rm -r /home/rstudio/mydatalocal/tuto_bash

The dataset we are using for this training is part of a larger study described in Kenny PJ et al., Cell Rep 2014. The authors are investigating interactions between various genes involved in Fragile X syndrome, a disease of aberrant protein production, which results in cognitive impairment and autistic-like features. The authors sought to show that RNA helicase MOV10 regulates the translation of RNAs involved in Fragile X syndrome.

6.1 Metadata

Some relevant metadata for our training dataset is provided below:

  • The RNA was extracted from human HEK293F cells that were transfected with a MOV10 transgene, MOV10 siRNA, or an irrelevant siRNA. (For this workshop we won’t be using the MOV10 knock down samples.)
  • The libraries for this dataset are stranded and were generated using the standard Tru-seq prep kit (using the dUTP method).
  • Sequencing was carried out on the Illumina HiSeq-2500 and 100bp single end reads were generated.
  • The full dataset was sequenced to ~40 million reads per sample, but for this workshop we will be looking at a small subset on chr1 (~300,000 reads/sample).
  • For each group we have three replicates as described in the figure below.

From: https://hbctraining.github.io/Intro-to-rnaseq-hpc-salmon-flipped/

6.2 Get the raw fastq files

Copy the script below in a file training_project/01_download_toydata.sh to automatically download the training dataset.

We will use wget which is a command (like ls or cat) which allows to get a file from an URL.

Script: training_project/01_download_toydata.sh
#!/bin/bash

mkdir -p /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project

mkdir -p 01_toy_data/
cd 01_toy_data/

cp -r /home/rstudio/ciri_rnaseq_2024/Toy_dataset/raw_fastq .

echo "Files:"
ls raw_fastq

Run the following command, replacing “YourName” by your real name, in the script as well as in the the command.

TERMINAL
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project
bash 01_download_toydata.sh

6.3 Run the quality checking on the training dataset

Now we get some data, you can run your first bioinformatic analysis: check the quality control of your raw sequence data.

FastQC is a commonly used software that provides a simple way to do this.

All the tools have already been installed to save you this boring part.

The template command to run fastqc is :

fastqc -o output_dir sample.fastq.gz

First get the path of a fastq file of the training dataset

TERMINAL
ls 01_toy_data/raw_fastq/*
01_toy_data/raw_fastq/Irrel_kd_1.subset.fastq.gz
01_toy_data/raw_fastq/Irrel_kd_2.subset.fastq.gz
01_toy_data/raw_fastq/Irrel_kd_3.subset.fastq.gz
01_toy_data/raw_fastq/Mov10_oe_1.subset.fastq.gz
01_toy_data/raw_fastq/Mov10_oe_2.subset.fastq.gz
01_toy_data/raw_fastq/Mov10_oe_3.subset.fastq.gz

Note the difference between the command with the wildcard “*”

TERMINAL
ls 01_toy_data/raw_fastq/*

and without the “*”

TERMINAL
ls 01_toy_data/raw_fastq/
Irrel_kd_1.subset.fastq.gz
Irrel_kd_2.subset.fastq.gz
Irrel_kd_3.subset.fastq.gz
Mov10_oe_1.subset.fastq.gz
Mov10_oe_2.subset.fastq.gz
Mov10_oe_3.subset.fastq.gz

Now we have “just” to adapt the command to run fastqc on a file of the training dataset.

You can save the commands in a script : “02_toy_check_quality.sh”

Script: training_project/02_toy_check_quality.sh
#!/bin/bash
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project
mkdir -p results/02_toy_fastqc/

fastqc -o results/02_toy_fastqc/ 01_toy_data/raw_fastq/Irrel_kd_1.subset.fastq.gz

and run it

TERMINAL
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project
bash 02_toy_check_quality.sh

We can now look to the results but we will study the results in the following session.

The result will be in the 02_toy_fastqc folder in the results folder

└── results
    └── 02_toy_fastqc
        ├── Irrel_kd_1.subset_fastqc.html   <-----
        └── Irrel_kd_1.subset_fastqc.zip

You can search it in the right bottom panel of rstudio and open it with “View in a web Browser

Fastqc have the particularity that you can pass it as argument a list of files to analyse. We will use the “*” joker to avoid to write the path of all the files

Modify the script 02_toy_check_quality.sh in consequence:

Script: training_project/02_toy_check_quality.sh
#!/bin/bash
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project
mkdir -p results/02_toy_fastqc/

fastqc -o results/02_toy_fastqc/ 01_toy_data/raw_fastq/*.fastq.gz

and run the script again

TERMINAL
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project
bash 02_toy_check_quality.sh

In Rstudio you can open several terminal at the same time to avoid to wait for the completion of a command.

By doing a list of files corresponding to the patern pass to fastqc we can check to you do not have make an error :

TERMINAL
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project
ls 01_toy_data/raw_fastq/*.fastq.gz

6.4 Agglomerate quality reports with multiqc

MultiQC aggregates results from bioinformatics analyses across many samples into a single report It searches a given directory for analysis logs and compiles a HTML report. It’s a general use tool, perfect for summarising the output from numerous bioinformatics tools.

Write this script and execute it.

Script: training_project/03_toy_aggregate_quality_report_multiqc.sh
#!/bin/bash
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project
mkdir -p results/03_toy_fastqc_multiqc/

multiqc results/02_toy_fastqc/ -o results/03_toy_fastqc_multiqc/
TERMINAL
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project
bash 03_toy_aggregate_quality_report_multiqc.sh

7 Upload your own data on your remote computer

We will now upload your dataset into your remote computer. As for the training project, we will use the permanent storage.

First create a new directory /home/rstudio/ciri_rnaseq_2024/trainees/YourName/my_project for your project. (Adapt “YourName”)

From now, the codes that you have to adapt to your own project will have a red top box.

Create a directory to welcome your data:

TERMINAL [To be adjusted]
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName

mkdir -p my_project/01_raw_data/
cd my_project/01_raw_data/

The procedure to upload your data on the VM will be different if your data are on your computer or they come from a public dataset available on the NCBI

7.1 Your data are on your local computer

To download your data, if your data are in local on your computer :

  • Create a archive containing your fastq files (a zip or a tar archive)
  • Using the file browser (right bottom corner of rstudio), go in the directory you want to download the data
  • click on the upload button and choose your archive on your computer

  • Wait for the uploading, that can be take a long time
  • untar (tar -xvf archive.tar) or unzip (unzip archive.zip) your archive on your VM.

7.2 Downlad public data from the NCBI

Sequencing data are often stored in a public database called SRA maintained by the NCBI. To acceess to data link from the publication, you should find in its “data availability” section a sentence like “The sequencing data have been deposited in GEO under access codes GSE153946” or “The sequencing data have been deposited under the project PRJNA644595”. From this analyse/project identifier you should find sequencing experiments link to this project in the section “Relationship” or by clicking on “SRA Run Selector”.

Each sequencing experiment of a sample has a unique identifier starting by “SRR”. This ID will be use to download the data. There is a tool called fastq-dump which allow downloading directly a experiment.

For example, if you want to download files associated to experiments SRR12164959, SRR12164960, SRR12164961, SRR12164962, SRR12164963, SRR12164964, SRR12164965, SRR12164966, SRR12164967.

Create a new empty script called 01_download_rawdata.sh. Copy the following code and replace the list of SRR corresponding to your study

Script: my_project/01_downlad_rawdata.sh [To be adjusted]
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/my_project

mkdir -p 01_raw_data/
cd 01_raw_data/


# SRA id to download
sra_accessions="SRR12164959
SRR12164960
SRR12164961
SRR12164962
SRR12164963
SRR12164964
SRR12164965
SRR12164966
SRR12164967"

for SRR in $sra_accessions
do
    echo "Download $SRR"
    fastq-dump --split-files --gzip "$SRR"
done

Then, in the terminal execute the script using this command

TERMINAL [To be adjusted]
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/my_project
bash 01_downlad_rawdata.sh

This step will take some time but the script will run even if you close your web tab.

We will see in the following session how a “for loop” works.

See you in the next session


These materials have been developed by members of BIBS team of the CIRI (https://ciri.ens-lyon.fr/). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

  • Some materials used in these lessons were derived or adapted from work made available by the Harvard Chan Bioinformatics Core (HBC) (https://github.com/hbctraining) under the Creative Commons Attribution license (CC BY 4.0).

cc_by_sa

sources