Learning objectives
RNA-seq is an exciting experimental technique that is utilized to explore and/or quantify gene expression within or between conditions.
To perform Differential Gene Expression (DGE) analysis, we need to start with a matrix of counts representing the levels of gene expression where you have in row genes and in column samples. Below is an example of the first few rows of a count matrix.
After the normalization of the count matrix using dedicated tools, we will look for genes, as below, whose expression is statistically different between condition, which is why we need replicates.
It is important to understand how the count matrix is generated, before diving into the statistical analysis.
So the goal of these first bloc of 4 sessions, is to discover the RNA-processing pipeline for bulk RNA-seq, and the different steps to go from raw sequencing reads to a gene expression count matrix.
We will see in the second bloc the statistical analysis.
It would be really difficult to do these steps on your computer, that why we will use more powerful computers that will not be turned off during the entire training. Also, you will be able to use your own computers outside of the training (which is much more convenient).
For the training we choose to use a convenient service of the French Institute of Bioinformatics - IFB which provide virtual powerful machine available for remote access.
French Institute of Bioinformatics - IFB is providing cloud services to analyze life science data. These services rely on a federation of clouds - Biosphere - built on interconnected IT infrastructures of some IFB’s platforms.
The IFB cloud allows powerful computers to be shared among all users.
Each user can use one of these computers, called instances or virtual machine (VM), and configure it as he wishes.
The configuration of an instance includes:
When the user shutdown one of his instances, the modifications made on the instance (installed software, …) are lost as well as the data that have been added on the storage associated to the VM (mydatalocal).
A permanent storage can be attached to a VM if your belong to a group. This storage will be shared between all the member of the group. It will not be deleted when a VM is shut down and will be automatically connected to each of your VMs.
We will use this particularity to store your data between distant sessions. This permanent storage will appear as a directory called ciri_rnaseq_2024 in your VM.
The principle of IFB is to share the computational resources shared between all users.
You should therefore think about:
After the training, you could use this computation solution but take car to
More information: https://biosphere.france-bioinformatique.fr/
To create your account at IFB:
After the training, you could ask to join the “CIRI” group to get a personal count and use the IFB cloud for your own projects.
To start an instance, you have to go to the RAINBio catalog (top left), select it and choose an “image”.
For the training, you will use the image UE NGS-ENS Lyon. Go to the bottom of the window to see it.
To configure your instance (name and size) click on Configure (wrench symbol), then, you have 4 fields to fill:
Before we start uploading your sequencing output files, we need to understand the different steps leading to these files from biological samples. We will highlight the important steps that will then matter in the analysis.
To make proteins, the DNA is transcribed into messenger RNA, or mRNA, which is translated by the ribosome into protein. However, some genes encode RNA that does not get translated into protein; these RNAs are called non-coding RNAs, or ncRNAs. Often these RNAs have a function in and of themselves and include rRNAs, tRNAs, and siRNAs, among others. All RNAs transcribed from genes are called transcripts.
To be translated into proteins, the primary RNA must undergo processing to generate the mature mRNA. In the figure below, the top strand in the image represents a gene in the genome, comprised of the untranslated regions (UTRs) and the open read frame. In eucaryotes, genes are transcribed into pre-mRNA, which still contains the intronic sequences. After post-transciptional processing, the introns are spliced out and a polyA tail and 5’ cap are added to yield mature mRNA transcripts, which can be translated into proteins.
While most cellular mRNA transcripts have a polyA tail (eg. histones have not), many of the non-coding RNA transcripts do not, as the post-transcriptional processing is different for these transcripts.
Contrary to eukaryotes, prokaryote genes do not have intronic sequences neither poly-A tails.
In RNA-seq data analyses, you will analyze the mRNA expression levels and not the proteins levels. They are correlated but there could be differences due to translation efficiency, protein degradation or stabilization.
Before RNA can be sequenced, it must first be extracted and separated from its cellular environment and prepared into a cDNA library. There are a number of steps involved which are outlined in the figure below, and in parallel there are various quality checks implemented to make sure we have good quality RNA to move forward with. We briefly describe some of these steps below.
Enriching for mRNA. Once the sample has been treated with DNAse to remove any contaminating DNA sequence, the sample undergoes either selection of the mRNA (polyA selection) or depletion of the rRNA. Generally, ribosomal RNA represents the majority of the RNAs present in a cell, while messenger RNAs represent a small percentage of total RNA, ~2% in humans. Therefore, if we want to study the protein-coding genes, we need to enrich for mRNA or deplete the rRNA. For differential gene expression analysis, it is best to enrich for Poly(A)+, unless you are aiming to obtain information about long non-coding RNAs, in which case ribosomal RNA depletion is recommended. To go further.
RNA Quality check: It is essential to check for the integrity of the extracted RNA prior to starting the cDNA library prepation. Traditionally, RNA integrity was assessed via gel electrophoresis by visual inspection of the ribosomal RNA bands; but that method is time consuming and imprecise. The Bioanalyzer system from Agilent will rapidly assess RNA integrity and calculate an RNA Integrity Number (RIN), which facilitates the interpretation and reproducibility of RNA quality. RIN, essentially, provides a means by which RNA quality from different samples can be compared to each other in a standardized manner.
Fragmentation and size selection. The remaining RNA molecules are then fragmented. This is done either via chemical, enzymatic (e.g., RNAses) or physical processes (e.g., chemical/mechanical shearing). These fragments then undergo size selection to retain only those fragments within a size range that Illumina sequencing machines can handle best, i.e., between 150 to 300 bp.
Fragment size quality check: After size selection/exclusion the fragment size distribution should be assessed to ensure that it is unimodal and well-defined.
Reverse transcribe RNA into double-stranded cDNA. Information about which strand a fragment originated from can be preserved by creating stranded libraries. The most commonly used method incorporates deoxy-UTP during the synthesis of the second cDNA strand (for details see Levin et al. (2010)). Once double-stranded cDNA fragments are generated, sequence adapters are ligated to the ends. (Size selection can be performed here instead of at the RNA-level.)
PCR amplification. If the amount of starting material is low and/or to increase the number of cDNA molecules to an amount sufficient for sequencing, libraries are usually PCR amplified. Run as few amplication cycles as possible to avoid PCR artefacts.
Image source: Zeng and Mortavi, 2012
Sequencing of the cDNA libraries will generate reads. Reads correspond to the nucleotide sequences of the ends of each of the cDNA fragments in the library. You will have the choice of sequencing either a single end of the cDNA fragments (single-end reads) or both ends of the fragments (paired-end reads).
Generally single-end sequencing is sufficient unless it is expected that the reads will match multiple locations on the genome (e.g. organisms with many paralogous genes), assemblies are being performed, or for splice isoform differentiation. Be aware that paired-end reads are generally 2x more expensive but they are better for expression analysis. source
Illumina sequencing technology uses a sequencing-by-synthesis approach. To explore sequencing by synthesis in more depth, please watch this linked video on Illumina’s YouTube channel.
We have provided a brief explanation of the steps below:
Cluster growth: The DNA fragments in the cDNA library are denatured and hybridized to the glass flowcell (adapter complementarity). Each fragment is then clonally amplified, forming a cluster of double-stranded DNA. This step is necessary to ensure that the sequencing signal will be strong enough to be detected/captured unambiguously for each base of each fragment.
Sequencing: The sequencing of the fragment ends is based on fluorophore labelled dNTPs with reversible terminator elements. In each sequencing cycle, a base is incorporated into every cluster and excited by a laser.
Image acquisition: Each dNTP has a distinct excitatory signal emission which is captured by cameras.
Base calling: The Base calling program will then generate the sequence of bases, i.e. reads, for each fragement/cluster by assessing the images captured during the many sequencing cycles. In addition to calling the base in every position, the base caller will also report the certainty with which it was able to make the call (quality information).
The raw reads obtained from the sequencer are stored as FASTQ files. The FASTQ file format is the defacto file format for sequence reads generated from next-generation sequencing technologies.
Each FASTQ file is a text file which represents sequence readouts for a sample. Each read is represented by 4 lines as shown below:
@HWI-ST330:304:H045HADXX:1:1101:1111:61397
CACTTGTAAGGGCAGGCCCCCTTCACCCTCCCGCTCCTGGGGGANNNNNNNNNNANNNCGAGGCCCTGGGGTAGAGGGNNNNNNNNNNNNNNGATCTTGG
+
@?@DDDDDDHHH?GH:?FCBGGB@C?DBEGIIIIAEF;FCGGI#########################################################
Line | Description |
---|---|
1 | Always begins with ‘@’ and then information about the read |
2 | The actual DNA sequence |
3 | Always begins with a ‘+’ and sometimes the same info as in line 1 |
4 | Has a string of characters which represent the quality scores; must have same number of characters as line 2 |
These files can be huge (severals GB) and it is a good idea to store them in a compressed form to save space. So the extension of these files can be .fastq.gz, you can also find sometimes .fq.gz.
Now that we understand the origin of the contents of the fastq file, we can start to analyze them.
Many/most bioinformatic tools do not have graphical interfaces. To use them, you need some basics on the command line.
The goal is not to turn you into bioinformatics specialists but to give you some skills so that you can run basic commands. By this way you will be able to find, follow and understand tutorials and helps available on the jungle of the Internet
To start, you need a linux environment where you can train yourself. For that, go back to the IFB website (There is a link on the top of this page) and go to the deployment frame in the “myVM” and connect you to the VM you started at the beginning of the session.
To connect to an instance you need 3 pieces of information: a login, a password and the IP address of your VM.
Warning
By default, the “R” console is open on the left-hand side. This is because R is the expected language in Rstudio, but we are going to use another Rstudio function to talk directly to the computer. To do this, we are going to use the terminal.
You can also see a file browser in the bottom right-hand corner and a text editor in the Source tab.
Open the “Terminal” tab of your Rstudio interface to access the terminal on your VM.
A terminal is an interface used to give commands to a computer by typing only text. You already use a lot of commands indirectly through the graphical interface of your computer (like creating or opening folders, for example).
There are plenty or reasons. Here are 4 :
The terminal’s input line has the following layout
login@hostname:~/path$
We can split this in two parts:
• who is executing the commands: name of the user ( login ), @ , and name of the machine ( hostname ).
• where on the system is the command being executed is indicated the spot “from where” the command is executed ( ~/path ).
A $ sign ends the input prompt.
You have to imagine that you are a cursor in a tree of directories and files of your computer and that you can move from directory to directory and launch commands from your position on the files of the folder you are in but also of the whole computer.
A command is a text you pass to your terminal where you will call a program which will do something and which can take or not arguments and options.
An example will be easier to understand. cowsay
is a (fun and useless) program which need a message as argument.
Type this command in your terminal and type enter to execute it.
cowsay Hello
A program can take different options (with or without arguments)
cowsay -f dragon Hi
Warning, each space is interpreted as a new argument, so you have to add quote to an argument with spaces.
cowsay -e @@ "What! A talking dragon!"
The following command will list all available animals that can be made to speak.
cowsay -l
If you want to know all the options of a (common) program, you have usually to use the option “-h” or “–help” to get help about its usage but you can find much more help on the Internet.
cowsay -h
There are plenty of (more useful) programs. We will see only the basics.
To know where your files are, the machine uses paths. There are 2 kinds of paths, relative and absolute.
cd tuto_bash/folderA/folderA_1
pwd
cd ../
pwd
cd ../../
pwd
cd # In the absence of an argument, returns to your starting directory (your "home").
cd /home/rstudio/mydatalocal/tuto_bash
*
” the wildcarda star is a “wildcard” matching any number of any characters. e.g* folder* matches all files/folders starting by “folder”.
ls folder*
Exercices
Using the commands we’ve just seen:
# Answer: check the prompt, and
pwd
#or
cd /home/rstudio/mydatalocal/tuto_bash
cd /home/rstudio/mydatalocal/tuto_bash
mkdir folderB
mkdir folderB/folderB_1
#or
mkdir folderB
cd folderB
mkdir folderB_1
#or
mkdir -p folderB/folderB_1
#with an absolute path
cd /home/rstudio/mydatalocal/tuto_bash/folderB/folderB_1
pwd
#or a relative path from tuto_bash
cd /home/rstudio/mydatalocal/tuto_bash
cd folderB/folderB_1
pwd
#or a relative path from folderB
cd /home/rstudio/mydatalocal/tuto_bash/folderB
cd folderB_1
pwd
#with an absolute path
mkdir -p /home/rstudio/mydatalocal/tuto_bash/folderA/folderA_2
#or a relative path from folderB_1
cd /home/rstudio/mydatalocal/tuto_bash/folderB/folderB_1
mkdir -p ../../folderA/folderA_2
#with an absolute path
ls /home/rstudio/mydatalocal/tuto_bash/folderA
#or a relative path from folderB_1
cd /home/rstudio/mydatalocal/tuto_bash/folderB/folderB_1
ls ../../folderA
Go back to the /home/rstudio/mydatalocal/tuto_bash/ directory
cd /home/rstudio/mydatalocal/tuto_bash/
touch
-> to create empty filetouch empty_file.txt
But you can easily create new file using the right bottom panel or using file > New File > Text File
Exercices
echo
-> display text in the terminalecho "hello world"
>
and >>
-> >
redirect the output of a command in a text file, >>
add output of a command at the end of a fileecho "hello world" > tata.txt
echo "hello world 2" >> tata.txt
You can also redirect output of other commands.
cowsay -f dragon "hello world" > dragon.txt
cowsay "hi" >> dragon.txt
cat
-> concatenate files and display outputcat dragon.txt tata.txt
It is often use to display the content of one file
cat dragon.txt
The cat
command is useful for viewing the contents of smaller files, but if the file contains hundreds of lines of text, it is overwhelming to have everything printed to the Terminal at once. To see only a part of the file, we can use the commands head
and tail
to see the first few or the last few lines of the file, respectively.
head
and tail
-> to display first or last lines of a filehead dragon.txt
tail dragon.txt
You can specify the number of lines you want to display
head -n 5 dragon.txt
Exercices
cowsay -f turtle message
and redirection
cowsay -f turtle hello > turtle.txt
cowsay -f sheep hello > sheep.txt
cat turtle.txt sheep.txt > turtle_and_sheep.txt
cat turtle_and_sheep.txt
head -n 8 turtle_and_sheep.txt
tail -n 8 turtle_and_sheep.txt
echo "All the files of this directory:" > list_files.txt
pwd >> list_files.txt
echo >> list_files.txt
ls >> list_files.txt
cat list_files.txt
cp
-> copy a file in another path. cp <file> <file|folder>
If you want to copy a file in the same directory: (you have to change the name)
cp dragon.txt dragon_cp.txt
If you want to copy a file in another directory: (you can keep the same name or change the name)
cp dragon.txt folderA/
ls folderA/
cp dragon.txt folderA/dragon_cp.txt
ls folderA/
cp -r
-> copy a directorycp -r folderA folderAprime
mv
-> move a file/folder in another path. mv <file> <file|folder>
Move mv
can be used to rename a file/folder or move it in another paths
ls
mv dragon.txt dragon_new.txt
ls
mv folderAprime folderC
If you want to move a file in another directory: (you can keep the same name or change the name)
mv dragon_new.txt folderA/
ls folderA/
mv dragon_cp.txt folderA/dragon_cp_NEW.txt
ls folderA/
rm
-> remove file(s). Deleting is immediate and permanent!rm empty_file.txt
rm -r
-> remove recursively file(s) and directories. Deleting is immediate and permanent!mkdir -p folderC/folderC_1
cp sheep.txt folderC/folderC_1
ls folderC/folderC_1
rm -r folderC
ls folderC/folderC_1 #raise an error
Exercices
cp turtle.txt folderB/
cp sheep.txt folderB/folderB_1/
mv folderB/turtle.txt folderB/folderB_1/
*
(wildcard)
ls folderB/folderB_1/
mv folderB/folderB_1/* folderA/folderA_1/
ls folderB/folderB_1/
ls folderA/folderA_1/
ls
rm sheep*
ls
mkdir folderD
cp -r folderB/folderB_1/ folderD
tree
to get the files tree
cd /home/rstudio/mydatalocal/tuto_bash
tree
├── folderA
│ ├── dragon_cp_NEW.txt
│ ├── dragon_cp.txt
│ ├── dragon_new.txt
│ ├── dragon.txt
│ ├── folderA_1
│ │ ├── sheep.txt
│ │ └── turtle.txt
│ └── folderA_2
├── folderB
│ └── folderB_1
├── folderD
│ └── folderB_1
├── list_files.txt
├── tata.txt
├── turtle_and_sheep.txt
└── turtle.txt
Most commands have multiple options which are specified with a single or double dash (e.g. -v
,
--verbose
). All options should be detailed on the help page of a command, which you can usually
print in the terminal with the --help
option.
cp --help
So far you have learned how to use commands. You’ll soon find, however, that large and complex blocks of code are tedious to write out by hand every time you want to run them. It is also difficult to debug a long string of code that you wrote in the Terminal.
Instead, we can put everything into a script, or file that contains code. This allows you to make your code compact and easy to move between directories if you need to. It also makes debugging much easier.
A bash (the language you use to communicate) script is a simple text file, usually starting with #!/bin/bash
and named with a ‘.sh’ extension, in which you list commands to run.
Here is an example of a simple script:
#!/bin/bash
cd /home/rstudio/mydatalocal/tuto_bash
echo "Run in:"
pwd
mkdir -p folder_1
echo "Made a folder named folder_1"
touch folder_1/text_file.txt
echo "I want to eat strawberries." > folder_1/text_file.txt
echo "Created a file in folder_1 with some text"
Open a text file in the RStudio text editor using File -> New File -> Shell Script, copy the script text in the new text file and save it as “my_script.sh” (saved in the tuto_bash folder). Run it with the bash command:
cd /home/rstudio/mydatalocal/tuto_bash
bash my_script.sh
You can check that the script worked using the commands previously shown.
Comments
Always add comments to your scripts to keep track of what they do. In Bash, lines starting with a #
symbol are ignored by the computer.
#!/bin/bash
# This is a comment
#Display the working directory
echo "Run in:"
pwd
#Made a folder named folder_1
mkdir -p folder_1
# Created a file in folder_1 with some text
echo "I want to eat strawberries." > folder_1/text_file.txt
cowsay "me too" >> folder_1/text_file.txt
Add a couple of comments to the script we created above to specify what the commands do.
In longer and more complex scripts, it’s important (necessary, even) to document each section properly.
You’ll be thankful when reading properly commented code from others (or from past you), rather than spend time figuring out what each part does.
Script can be run from another folder, you just have to adjust the path:
cd /home/rstudio/mydatalocal/tuto_bash/
bash folderA/my_script_comments.sh
cd /home/rstudio/mydatalocal/tuto_bash/folderB/folderB_1/
bash /home/rstudio/mydatalocal/tuto_bash/folderA/my_script_comments.sh
You can use tree
to search for the “folder1” you created with the previous script
cd /home/rstudio/mydatalocal/tuto_bash/
tree
(Usually, we prefer to use relative path.)
If you run a script from another directory, take care to the path used in your script. Indeed, the working directory will be the directory where you launch the script and not the directory containing the script. A good habit is to fix at the beginning of your script the working directory, for example :
#!/bin/bash
cd /home/rstudio/mydatalocal/tuto_bash
[...]
You can redirect the output of your script in a file to keep log of its execution (if you have print message)
cd /home/rstudio/mydatalocal/tuto_bash/
bash folderA/my_script_comments.sh > script.log
As we want to keep this data between all the sessions, you will use the permanent storage shared between all the trainees.
For the following of the training, we will work in a new personal directory: /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project
.
Warning: Change “YourName” by your actual name.
Create this directory and go inside.
mkdir -p /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project
You can also remove the tuto_bash directory
rm -r /home/rstudio/mydatalocal/tuto_bash
The dataset we are using for this training is part of a larger study described in Kenny PJ et al., Cell Rep 2014. The authors are investigating interactions between various genes involved in Fragile X syndrome, a disease of aberrant protein production, which results in cognitive impairment and autistic-like features. The authors sought to show that RNA helicase MOV10 regulates the translation of RNAs involved in Fragile X syndrome.
Some relevant metadata for our training dataset is provided below:
From: https://hbctraining.github.io/Intro-to-rnaseq-hpc-salmon-flipped/
Copy the script below in a file training_project/01_download_toydata.sh to automatically download the training dataset.
We will use wget
which is a command (like ls
or cat
) which allows to get a file from an URL.
#!/bin/bash
mkdir -p /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project
mkdir -p 01_toy_data/
cd 01_toy_data/
cp -r /home/rstudio/ciri_rnaseq_2024/Toy_dataset/raw_fastq .
echo "Files:"
ls raw_fastq
Run the following command, replacing “YourName” by your real name, in the script as well as in the the command.
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project
bash 01_download_toydata.sh
Now we get some data, you can run your first bioinformatic analysis: check the quality control of your raw sequence data.
FastQC is a commonly used software that provides a simple way to do this.
All the tools have already been installed to save you this boring part.
The template command to run fastqc is :
fastqc -o output_dir sample.fastq.gz
First get the path of a fastq file of the training dataset
ls 01_toy_data/raw_fastq/*
01_toy_data/raw_fastq/Irrel_kd_1.subset.fastq.gz
01_toy_data/raw_fastq/Irrel_kd_2.subset.fastq.gz
01_toy_data/raw_fastq/Irrel_kd_3.subset.fastq.gz
01_toy_data/raw_fastq/Mov10_oe_1.subset.fastq.gz
01_toy_data/raw_fastq/Mov10_oe_2.subset.fastq.gz
01_toy_data/raw_fastq/Mov10_oe_3.subset.fastq.gz
Note the difference between the command with the wildcard “*”
ls 01_toy_data/raw_fastq/*
and without the “*”
ls 01_toy_data/raw_fastq/
Irrel_kd_1.subset.fastq.gz
Irrel_kd_2.subset.fastq.gz
Irrel_kd_3.subset.fastq.gz
Mov10_oe_1.subset.fastq.gz
Mov10_oe_2.subset.fastq.gz
Mov10_oe_3.subset.fastq.gz
Now we have “just” to adapt the command to run fastqc on a file of the training dataset.
You can save the commands in a script : “02_toy_check_quality.sh”
#!/bin/bash
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project
mkdir -p results/02_toy_fastqc/
fastqc -o results/02_toy_fastqc/ 01_toy_data/raw_fastq/Irrel_kd_1.subset.fastq.gz
and run it
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project
bash 02_toy_check_quality.sh
We can now look to the results but we will study the results in the following session.
The result will be in the 02_toy_fastqc folder in the results folder
└── results
└── 02_toy_fastqc
├── Irrel_kd_1.subset_fastqc.html <-----
└── Irrel_kd_1.subset_fastqc.zip
You can search it in the right bottom panel of rstudio and open it with “View in a web Browser”
Fastqc have the particularity that you can pass it as argument a list of files to analyse. We will use the “*” joker to avoid to write the path of all the files
Modify the script 02_toy_check_quality.sh in consequence:
#!/bin/bash
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project
mkdir -p results/02_toy_fastqc/
fastqc -o results/02_toy_fastqc/ 01_toy_data/raw_fastq/*.fastq.gz
and run the script again
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project
bash 02_toy_check_quality.sh
In Rstudio you can open several terminal at the same time to avoid to wait for the completion of a command.
By doing a list of files corresponding to the patern pass to fastqc we can check to you do not have make an error :
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project
ls 01_toy_data/raw_fastq/*.fastq.gz
MultiQC aggregates results from bioinformatics analyses across many samples into a single report It searches a given directory for analysis logs and compiles a HTML report. It’s a general use tool, perfect for summarising the output from numerous bioinformatics tools.
Write this script and execute it.
#!/bin/bash
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project
mkdir -p results/03_toy_fastqc_multiqc/
multiqc results/02_toy_fastqc/ -o results/03_toy_fastqc_multiqc/
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project
bash 03_toy_aggregate_quality_report_multiqc.sh
We will now upload your dataset into your remote computer. As for the training project, we will use the permanent storage.
First create a new directory /home/rstudio/ciri_rnaseq_2024/trainees/YourName/my_project for your project. (Adapt “YourName”)
From now, the codes that you have to adapt to your own project will have a red top box.
Create a directory to welcome your data:
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName
mkdir -p my_project/01_raw_data/
cd my_project/01_raw_data/
The procedure to upload your data on the VM will be different if your data are on your computer or they come from a public dataset available on the NCBI
To download your data, if your data are in local on your computer :
tar -xvf archive.tar
) or unzip (unzip archive.zip
) your archive on your VM.Sequencing data are often stored in a public database called SRA maintained by the NCBI. To acceess to data link from the publication, you should find in its “data availability” section a sentence like “The sequencing data have been deposited in GEO under access codes GSE153946” or “The sequencing data have been deposited under the project PRJNA644595”. From this analyse/project identifier you should find sequencing experiments link to this project in the section “Relationship” or by clicking on “SRA Run Selector”.
Each sequencing experiment of a sample has a unique identifier starting by “SRR”. This ID will be use to download the data.
There is a tool called fastq-dump
which allow downloading directly a experiment.
For example, if you want to download files associated to experiments SRR12164959, SRR12164960, SRR12164961, SRR12164962, SRR12164963, SRR12164964, SRR12164965, SRR12164966, SRR12164967.
Create a new empty script called 01_download_rawdata.sh. Copy the following code and replace the list of SRR corresponding to your study
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/my_project
mkdir -p 01_raw_data/
cd 01_raw_data/
# SRA id to download
sra_accessions="SRR12164959
SRR12164960
SRR12164961
SRR12164962
SRR12164963
SRR12164964
SRR12164965
SRR12164966
SRR12164967"
for SRR in $sra_accessions
do
echo "Download $SRR"
fastq-dump --split-files --gzip "$SRR"
done
Then, in the terminal execute the script using this command
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/my_project
bash 01_downlad_rawdata.sh
This step will take some time but the script will run even if you close your web tab.
We will see in the following session how a “for loop” works.
These materials have been developed by members of BIBS team of the CIRI (https://ciri.ens-lyon.fr/). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
While waiting for your machines to start, we will take some time for you to introduce yourself, your dataset and your expectations of this training.