As previously mentioned, we will use Salmon to quantify the genes expression for the differential gene expression analysis. The documentation can be found here or clicking on the tab Documentation on the salmon web page.
The “quasi-mapping” approach utilized by Salmon requires a transcriptome reference index to determine the position and orientation information for where the fragments best map prior to quantification.
So, first we need to build this index. In real-life scenarios, one must find and understand command lines to utilize a new tool.
Should decoy sequences be used or not?
Fast is good but fast and accurate is better! Alignment and mapping methodologies influence transcript abundance estimation, and accounting for fragments of unexpected origin can improve transcript quantification. To this end, Salmon provides the ability to index both the transcriptome as well as decoy sequences that can be considered during mapping and quantification. The decoy sequence accounts for reads that might otherwise be (spuriously) attributed to some annotated transcript.
Using decoy sequences in the Salmon index requires much more resources to build and also during the quantification step.
For example:
For the training, we will not use decoy sequences. However, if you have enough resources, you can find help to build your index with decoy sequences here
Exercices
From the Salmon documentation, find the command line to index your transcriptome with Salmon.
Build your script.
#!/bin/bash
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project
ref_transcriptome="/home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/01_toy_data/reference_transcriptome/Homo_sapiens.GRCh38.cdna.all.fa"
mkdir -p results/07_salmon_index
#Documentation https://salmon.readthedocs.io/en/latest/salmon.html#preparing-transcriptome-indices-mapping-based-mode
# salmon index
# -t: the path to the transcriptome file (in FASTA format)
# -i: the path to the folder to store the indices generated
# -k: the length of kmer to use to create the indices (will output all sequences in transcriptome of length k)
# -p: the number of threads
# Thus, a smaller value of k may slightly improve sensitivity.
# We find that a k of 31 seems to work well for reads of 75bp or longer, but you might consider a smaller k if you plan to deal with shorter reads.
salmon index -k 31 -t $ref_transcriptome -p 7 -i results/07_salmon_index/human_cdna.salmon_index
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project
bash scripts/07_build_salmon_index.sh
Exercises:
In your RNA-seq experiment, you expressed a GFP transgene in your mice, and you would like to quantify the expression of GFP. We know that the sequence is not present in the mice transcriptome. What would you do?
b
The quasi-mapping approach estimates where the reads best map to on the transcriptome through identifying where informative sequences within the read map instead of performing base-by-base alignment.
NOTE: If there are k-mers in the reads that are not in the index, they are not counted. As such, trimming is not required when using this method. But it is reassuring to see the amount of usable data. Accordingly, if there are reads from transcripts not present in the reference transcriptome, they will not be quantified. Quantification of the reads is only as good as the quality of the reference transcriptome.
After determining the best mapping for each read/fragment using the quasi-mapping method, Salmon will generate the final transcript abundance estimates after modeling sample-specific parameters and biases. Note that reads/fragments that map equally well to more than one transcript will have the count divided between all of the mappings, thereby not losing information for the various gene isoforms.
Instead of only counting the number of reads/fragments mapping to each of the transcripts, Salmon uses multiple complex modeling approaches to estimate the transcript abundances while correcting the abundance estimates for any sample-specific biases/factors [4]. Sample-specific bias models are helpful when needing to account for known biases present in RNA-Seq data including:
If not accounted for, these biases can lead to unacceptable false positive rates in differential expression studies. The Salmon algorithm can learn these sample-specific biases and account for them in the transcript abundance estimates. Generally, this step results in more accurate transcript abundance estimation.
Now we know a little bit about how it works, let’s map our data using Salmon.
They are two modes to quantify with Salmon, the alignment-based mode or the mapping-based mode.
The mapping-based mode of Salmon runs in two phases; indexing and quantification. The indexing step is independent of the reads, and only needs to be run once for a particular set of reference transcripts. The quantification step, obviously, is specific to the set of RNA-seq reads.
The alignment-based mode of Salmon does not require indexing but you need to align your reads on the transcriptome with another tool (wich may need an index) to get an alignment file (SAM/BAM).
You will use the mapping-based mode.
Exercices
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project
mkdir -p results/08_salmon_quantification/
#salmon quant
#--seqBias will enable it to learn and correct for sequence-specific biases in the input data
#--GCbias will enable it to learn and correct for fragment-level GC biases in the input data
#--validateMapping is now the default mapping strategy, add this option will not change the command
salmon quant -i results/07_salmon_index/human_cdna.salmon_index \
-l A \
-p 7 \
-r results/04_trimmed_data_SE/Irrel_kd_1.subset_trim.fastq.gz \
-o results/08_salmon_quantification/Irrel_kd_1.salmon \
--seqBias \
--gcBias
The command is very fast as the traininig dataset is a subset of data. You can run it.
$output_dir
, $salmon_index
, $full_fastq_filename
, $prefix
, … )
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project
output_dir=results/08_salmon_quantification/
mkdir -p $output_dir
salmon_index=results/07_salmon_index/human_cdna.salmon_index
full_fillnameSE=results/04_trimmed_data_SE/Irrel_kd_2.subset_trim.fastq.gz
fillnameSE=$(basename $full_fillnameSE)
prefix=${fillnameSE/.subset_trim.fastq.gz/}
#salmon quant
#--seqBias will enable it to learn and correct for sequence-specific biases in the input data
#--GCbias will enable it to learn and correct for fragment-level GC biases in the input data
#--validateMapping is now the default mapping strategy, add this option will not change the command
salmon quant -i $salmon_index\
-l A \
-p 7 \
-r $full_fillnameSE \
-o $output_dir/${prefix}.salmon \
--seqBias \
--gcBias
#!/bin/bash
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project
output_dir=results/08_salmon_quantification/
mkdir -p $output_dir
salmon_index=results/07_salmon_index/human_cdna.salmon_index
for full_fillnameSE in results/04_trimmed_data_SE/*.subset_trim.fastq.gz
do
echo START: $full_fillnameSE
fillnameSE=$(basename $full_fillnameSE)
prefix=${fillnameSE/.subset_trim.fastq.gz/}
#salmon quant
#--seqBias will enable it to learn and correct for sequence-specific biases in the input data
#--GCbias will enable it to learn and correct for fragment-level GC biases in the input data
#--validateMapping is now the default mapping strategy, add this option will not change the command
salmon quant -i $salmon_index\
-l A \
-p 7 \
-r $full_fillnameSE \
-o $output_dir/${prefix}.salmon \
--seqBias \
--gcBias
echo END
done
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project
bash scripts/08_run_quantification_salmon.sh
#!/bin/bash
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project
output_dir=results/08_salmon_quantification_PE/
mkdir -p $output_dir
salmon_index=results/07_salmon_index/human_cdna.salmon_index
for full_filenameR1 in results/04_trimmed_data_PE/*_trim_R1.fastq.gz
do
echo START: $full_filenameR1
fastq_dirname=$(dirname $full_filenameR1)
filenameR1=$(basename $full_filenameR1)
filenameR2=${filenameR1/R1/R2}
full_filenameR2="$fastq_dirname/$filenameR2"
prefix=${filenameR1/_trim_R1.fastq.gz/}
echo Prefix: $prefix # display Irrel_kd_1
#salmon quant
#--seqBias will enable it to learn and correct for sequence-specific biases in the input data
#--GCbias will enable it to learn and correct for fragment-level GC biases in the input data
#--validateMapping is now the default mapping strategy, add this option will not change the command
salmon quant -i $salmon_index\
-l A \
-p 7 \
-1 $full_filenameR1 \
-2 $full_filenameR2 \
-o $output_dir/${prefix}.salmon \
--seqBias \
--gcBias
echo END
done
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project
bash scripts/08_run_quantification_salmon_PE.sh
RNA-seq bias correction: To have Salmon correct for RNA-Seq biases you will need to specify the appropriate parameters when you run it. As noted, when describing the FASTQC results, with RNA-seq data you will always observe sequence-specific biases due to the random hexamer priming and so we would always want to have that correction turned on. Before using the remaining parameters it is advisable to assess your data using tools like Qualimap to look specifically for the presence of these biases in your data and decide on which parameters would be appropriate.
Salmon output should be in results/08_salmon_quantification/
. Take a look at what is contained in this directory:
ls results/08_salmon_quantification/*
From the documentation, you should find details about the output files. On the left, there is a Salmon Output File Formats section.
The most important file is quant.sf
. This is the quantification file in which each row corresponds to a transcript, listed by Ensembl ID. The columns correspond to metrics for each transcript:
NOTE: The effective gene length in a sample is the average of the transcript lengths after weighting for their relative expression. You may see effective lengths that are larger than the physical length. The interpretation would be that in this case, given the sequence composition of these transcripts (including both the sequence-specific and fragment GC biases), you’d expect a priori to sample more reads from them — thus they have a longer estimated effective length.
Exercices
quant.sf
) of the Irrel_kd_1 (SE) sample
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project
head results/08_salmon_quantification/Irrel_kd_1.salmon/quant.sf
Name Length EffectiveLength TPM NumReads
ENST00000632684.1 12 3.00168 0 0
ENST00000434970.2 9 2.81792 0 0
ENST00000448914.1 13 3.04008 0 0
ENST00000415118.1 8 2.72193 0 0
ENST00000631435.1 12 3.00168 0 0
ENST00000390567.1 20 3.18453 0 0
ENST00000439842.1 11 2.95387 0 0
grep
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project
grep ENST00000612242 results/08_salmon_quantification/Irrel_kd_1.salmon/quant.sf
ENST00000612242.4 6592 5586.055 2031.576451 1212.992
cat
command the file results/08_salmon_quantification/Irrel_kd_1.salmon/logs/salmon_quant.log and search for the mapping rate.
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project
cat results/08_salmon_quantification/Irrel_kd_1.salmon/logs/salmon_quant.log
grep
command and using *
to build the pattern of all log files.
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project
grep "Mapping rate" results/08_salmon_quantification/*.salmon/logs/salmon_quant.log
The quant.sf file is used as input for downstream analyses including differential gene expression analysis. For each sample that you run through Salmon, you will get a corresponding quant.sf
file. Downstream tools like tximport
take these files and aggregate them to obtain expression matrices for your dataset. We will see that in the next session.
Instead of get manually mapping statistics you can use once again multiqc to agglomerate log results of Salmon.
Build the script as below and execute it.
#!/bin/bash
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/
multiqc results/08_salmon_quantification/ -o results/09_multiqc_report_quantification/
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/
bash scripts/09_agglomerate_salmon_reports.sh
Using, the multiqc report you can see the percentage of mapping for eaach of your samples. For PE data, you can see the fragment length distribution. For SE data, a unique value will be estimated.
Exercises:
It’s now time to run quantification on your dataset. As previsously, write a script adapted to your data, check it on one sample and execute it on all your samples.
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/my_project
bash scripts/08_run_quantification_salmon.sh
As mentioned above, the differential gene expression analysis will use transcript/gene pseudocounts generated by Salmon. However, to perform some basic quality checks on the sequencing data, it is important to align the reads to the whole genome. Either STAR or HiSAT2 are able to perform this step and generate a BAM file that can be used for QC.
As Salmon, STAR need an index to facilitate the research. This time, we will index the genome. This can take time and computing resources. For the training, we will just index the chromosome 1 of the Human genome, as the training dataset is a subset of reads mapping on the first chromosome.
As previously for the transcriptome you can find genomic sequences on the Ensembl database.
You can find information about STAR on its github repository (https://github.com/alexdobin/STAR) and also the complete documentation. ( https://github.com/alexdobin/STAR/raw/master/doc/STARmanual.pdf)
You can find below, the script to download the sequence of the first human chromosome from Ensembl thus the gtf file containing gene annotations, and use them to build the STAR index.
Copy the following script in scripts/10_download_human_genome_and_STAR_index.sh and execute it.
#!/bin/bash
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/
mkdir -p data/genome
wget https://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.1.fa.gz -P data/genome
wget https://ftp.ensembl.org/pub/release-109/gtf/homo_sapiens/Homo_sapiens.GRCh38.109.gtf.gz -P data/genome
gunzip data/genome/Homo_sapiens.GRCh38.dna.chromosome.1.fa.gz
gunzip data/genome/Homo_sapiens.GRCh38.109.gtf.gz
mkdir -p results/10_star_index/
STAR --runThreadN 7 \
--runMode genomeGenerate \
--genomeDir results/10_star_index/chr1_hg38_index \
--genomeFastaFiles data/genome/Homo_sapiens.GRCh38.dna.chromosome.1.fa \
--sjdbGTFfile data/genome/Homo_sapiens.GRCh38.109.gtf \
--sjdbOverhang 99 \
--genomeSAindexNbases 12 # min(14, log2(GenomeLength)/2 - 1)
#sjdbOverhang = readlength - 1
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/
bash scripts/10_download_human_genome_and_STAR_index.sh
Then, you will use STAR to map reads from results/04_trimmed_data_SE/Irrel_kd_2.subset_trim.fastq.gz on the previously building index.
#!/bin/bash
# Run STAR on one sample
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/
star_index=results/10_star_index/chr1_hg38_index
outputdir_bam=results/11_star_mapping/bam/
mkdir -p $outputdir_bam
full_fillnameSE=results/04_trimmed_data_SE/Irrel_kd_2.subset_trim.fastq.gz
fillnameSE=$(basename $full_fillnameSE)
prefix=${fillnameSE/.subset_trim.fastq.gz/}
STAR --runThreadN 7 \
--genomeDir $star_index \
--readFilesCommand zcat \
--readFilesIn $full_fillnameSE \
--outFileNamePrefix $outputdir_bam/${prefix}. \
--outSAMtype BAM SortedByCoordinate \
--outSAMunmapped Within \
--outSAMattributes Standard
#for PE data
#STAR --runThreadN 7 \
# --genomeDir $star_index \
# --readFilesCommand zcat \
# --readFilesIn $full_fillnameR1 $full_fillnameR2 \
# --outFileNamePrefix $outputdir_bam/${prefix}. \
# --outSAMtype BAM SortedByCoordinate \
# --outSAMunmapped Within \
# --outSAMattributes Standard
To note, here we use the option “–outSAMunmapped Within” of STAR, as we are interested to get statistic about mapping rate. You can remove this option to remove unmapped read of the bam file and thus save space.
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/
bash scripts/11_STAR_mapping.sh
STAR will output BAM files which are compressed SAM files.
The Sequence Alignment/Map (SAM) is a text file format to save alignment information of short reads mapped against reference sequences. It usually starts with a header section followed by alignment information as tab separated lines for each read. (https://www.metagenomics.wiki/tools/samtools/bam-sam-file-format)
To look to the content of bam files you can use samtools view
as in the following command lines:
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/
#To see the content of a bam file
samtools view results/11_star_mapping/bam//Irrel_kd_2.Aligned.sortedByCoord.out.bam | head
# You need to add the -h option to see the header section
samtools view -h results/11_star_mapping/bam//Irrel_kd_2.Aligned.sortedByCoord.out.bam | head
Bam files contain alignment information in a tabular format. Each row represent an hit of a read on a reference sequence. Reads can have several hits. For each hit you can find among other information, the reference sequence name (3rd column) and the coordinate of the alignment (4th)
You can find easily on internet the specification of each column (for example: https://www.metagenomics.wiki/tools/samtools/bam-sam-file-format).
In the second column, there is a code/flag to encode some properties of the mapping of a read and for PE data about its mate.
You can find here a tool to decode this flag: https://broadinstitute.github.io/picard/explain-flags.html or here: https://www.samformat.info/sam-format-flag
You can use the -f
option of samtools to filter hits on this flag.
#To see unmapped reads flag=4
samtools view -f 4 results/11_star_mapping/bam//Irrel_kd_2.Aligned.sortedByCoord.out.bam | head
#To see multimap read flag=256
samtools view -f 256 results/11_star_mapping/bam//Irrel_kd_2.Aligned.sortedByCoord.out.bam | head
You can also combine samtools with grep to search all mapping hit of a given read:
# To see all maps of a read
samtools view results/11_star_mapping/bam//Irrel_kd_2.Aligned.sortedByCoord.out.bam | grep HWI-ST330:304:H045HADXX:2:1202:12011:47361
A tool called Qualimap explores the features of aligned reads in the context of the genomic region they map to, hence providing an overall view of the data quality (as an HTML file). Various quality metrics assessed by Qualimap include:
#!/bin/bash
# Run Qualimap on one sample
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/
gtf_file=data/genome/Homo_sapiens.GRCh38.109.gtf
qualimap_out=results/12_qualimap/
mkdir -p $qualimap_out
full_fillnameSE=results/04_trimmed_data_SE/Irrel_kd_2.subset_trim.fastq.gz
fillnameSE=$(basename $full_fillnameSE)
prefix=${fillnameSE/.subset_trim.fastq.gz/}
outputdir_bam=results/11_star_mapping/bam/
align_out_bam=${outputdir_bam}/${prefix}.Aligned.sortedByCoord.out.bam
unset DISPLAY
# Run Qualimap
qualimap rnaseq \
-outdir $qualimap_out \
-a proportional \
-bam $align_out_bam \
-p strand-specific-reverse \
-gtf $gtf_file \
--java-mem-size=8G
# for PE data
#-pe
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/
bash scripts/12_qualimap.sh
An alternative to Qualimap, is to use multiqc to get mapping rates on the genome but not the distribution between exon/intron/…
From bam/sam files you can use the former “standard” approach for quantifying gene expression which consist to count how many reads map on each gene.
For that, we will use a tool name htseq-count. You can find here documentation
htseq-count
need the gtf file containing gene annotation (=all gene coordinates) and bam/sam alignment files (=all read alignment coordinates).
Use the following script to get the count table for the Irrel_kd_2 sample.
#!/bin/bash
# Run htseq-count on one sample
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/
gtf_file=01_toy_data/genome/Homo_sapiens.GRCh38.109.gtf
outputdir_count=results/13_htseq_count/
mkdir -p $outputdir_count
prefix=Irrel_kd_2
outputdir_bam=results/11_star_mapping/bam/
align_out_bam=$outputdir_bam/$prefix.Aligned.sortedByCoord.out.bam
htseq-count -n 7 -r pos \
-t exon -i gene_id $align_out_bam $gtf_file -c $outputdir_count/$prefix.tsv
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/
bash scripts/13_htseq_count.sh
You can inspect the output file with
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/
head results/13_htseq_count/Irrel_kd_2.tsv
tail results/13_htseq_count/Irrel_kd_2.tsv
You can see that there are multimapping (__alignment_not_unique) reads. The defaut parameter of htseq-count is to count all of them.
You have to specify what to do with multi mapping reads with the --secondary-alignments
option
#!/bin/bash
# Run htseq-count on one sample
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/
gtf_file=data/genome/Homo_sapiens.GRCh38.109.gtf
outputdir_count=results/14_htseq_count_ignore_secondary/
mkdir -p $outputdir_count
full_fillnameSE=results/04_trimmed_data_SE/Irrel_kd_2.subset_trim.fastq.gz
fillnameSE=$(basename $full_fillnameSE)
prefix=${fillnameSE/.subset_trim.fastq.gz/}
outputdir_bam=results/11_star_mapping/bam/
align_out_bam=${outputdir_bam}/${prefix}.Aligned.sortedByCoord.out.bam
htseq-count -n 7 -r pos -t exon \
-i gene_id $align_out_bam ${gtf_file} \
--nonunique all --secondary-alignments=ignore > ${outputdir_count}/${prefix}.nosecondary.tsv
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/
bash scripts/14_htseq_count_ignore_secondary.sh
You could compare the output files of the 2 scripts.
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/
diff results/13_htseq_count/Irrel_kd_2.tsv results/14_htseq_count_ignore_secondary/Irrel_kd_2.nosecondary.tsv
grep ENSG00000232811 results/13_htseq_count/Irrel_kd_2.tsv results/14_htseq_count_ignore_secondary/Irrel_kd_2.nosecondary.tsv
During genome mapping you can filter for multi-mapping reads or then after with samtools
Throughout the workflow we have performed various steps of quality checks on our data. You will need to do this for every sample in your dataset, making sure these metrics are consistent across the samples for a given experiment. Outlier samples should be flagged for further investigation and potential removal.
Manually tracking these metrics and browsing through multiple HTML reports (FastQC, Qualimap) and log files (Salmon, STAR) for each samples is tedious and prone to errors. MultiQC is a tool which aggregates results from several tools and generates a single HTML report with plots to visualize and compare various QC metrics between the samples. Assessment of the QC metrics may result in the removal of samples before proceeding to the next step, if necessary.
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/
multiqc results/ -o results/multiqc_all
We will use IGV to visualize genomic mapping store in your bam files.
From IGV website open, the IGV web app. (https://igv.org/app/)
You need to index you bam files to help IGV to quickly navigate in the bam files.
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/
samtools index results/11_star_mapping/bam/Irrel_kd_2.Aligned.sortedByCoord.out.bam
ls results/11_star_mapping/bam/
Download on your computer Irrel_kd_2.Aligned.sortedByCoord.out.bam and Irrel_kd_2.Aligned.sortedByCoord.out.bam.bai
In IGV web app, import these files
The PATH is an environment variable that is used to store the location of the resources that contain executable files on the Linux operating system.
echo $PATH
The which
command allows users to search the list of paths in the $PATH environment variable and outputs the full path of the command specified as an argument.
which ls
So, in fact, when you use the command ls
, you execute the script contain in the file: /usr/bin/ls
.
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/
ls
cd /home/rstudio/ciri_rnaseq_2024/trainees/YourName/training_project/
/usr/bin/ls
Conda is an open-source package management system and environment management system that runs on Windows, macOS, and Linux. Conda quickly installs, runs, and updates packages and their dependencies. Conda easily creates, saves, loads, and switches between environments on your local computer. It was created for Python programs but it can package and distribute software for any language.
To install conda on your computer -> https://docs.conda.io/en/latest/miniconda.html
example: to install salmon https://anaconda.org/bioconda/salmon
conda install -c bioconda salmon
Alternatives :
One script to rule them all!
First step : a well documented readme file.
Final step : Use pipeline managers as Nextflow
Nextflow enables scalable and reproducible scientific workflows using software containers. It allows the adaptation of pipelines written in the most common scripting languages. Its fluent DSL simplifies the implementation and the deployment of complex parallel and reactive workflows on clouds and clusters.
example: https://nf-co.re/rnaseq
The most important files on your virtual machine is your script (as your raw data are stored elsewhere).
For the training, we don’t use a tool called git to version, save and synchronize your script in a remote server. You could learn how to use this (fabulous) tool in a BIBS monthly workshop.
Save on your local computer your scripts directory, use the export button of rstudio.
Repatriate also the important results files
Take time also to right a README file to explain how to use these different scripts to get the result.
These materials have been developed by members of BIBS team of the CIRI (https://ciri.ens-lyon.fr/). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.