Installation
Main Pipeline
Installing TAP
git clone git@github.com:FrietzeLabUVM/TAP.git
cd TAP
There is one general setup step required once per TAP installation.
Prior to running TAP jobs, scripts must be deployed. This process
ensures TAP knows how to best utilize the job scheduler if you’re
working on a HPC (high performance computing environment, i.e. a
cluster). If you are not on an HPC it’s still required to run the
following command but you can skip over the details.
bash deployment/deploy_TAP_scripts.sh -c deployment/deployment_configs/generic_deployment_config.csv
Modify the
deployment/deployment_configs/generic_deployment_config.csv
file for compatibility with your HPC. Currently deployment has only been
tested for SLURM but please contact jrboyd@uvm.edu for help deploying on SGE or other
scheduler systems.
TAPhelpR R package
The TAP companion R package helps with preparing to run TAP and with
utilizing the outputs.
if(!require("devtools")){
install.packages("devtools")
}
# This optional dependency has fallen out of Bioconductor and must be installed manually.
# It is only required to download data from ENCODE.
devtools::install_github("CharlesJB/ENCODExplorer")
devtools::install_github("FrietzeLabUVM/TAPhelpR")
Before Running TAP
All bash commands are assumed to be running from the directory where
you’ve installed TAP.
Installing Dependencies
It is strongly recommended that you use either singularity
or docker to
satisfy dependencies. When working on a cluster, you probably only have
singularity as an option.
Commands in this demonstration will assume the use of
singularity.
The singularity image file can be created as follows. You’ll need to
add --singularity tap_latest.sif
to TAP script calls.
# creates tap_latest.sif
bash setup_scripts/pull_singularity_image.sh
The docker image can be downloaded as follows. You’ll need to add
--docker jrboyd/tap image
to TAP script calls.
# pulls jrboyd/tap image
bash setup_scripts/pull_docker_image.sh
Manual installation of dependencies is not recommended. If you do
successfully add all dependencies to your PATH, there is not special
command addon required, in contrast to using singularity or docker.
# not expected to work
# you'll need to run equivalent commands on your system and address errors as they arise
bash setup_scripts/manual_install_dependencies.sh
Creating reference
The reference directory should be generated using
setup_scripts/setup_new_reference.sh
from the directory
where TAP is installed. This script requires 2 minimal inputs with some
additional options. You must provide a single .fasta file and compatible
(same organism assembly) gene reference .gtf/.gff file.
Let’s assume you downloaded the NCBI fasta and gtf files as described
in the previous section and moved the .zip file to your TAP installation
directory.
unzip ncbi_dataset.zip
# fasta files ending in .fasta or .fa are expected. .fna merely indicates that it's a fasta file with nucleotide information.
mv ncbi_dataset/data/GCF_003254395.2/GCF_003254395.2_Amel_HAv3.1_genomic.fna ncbi_dataset/data/GCF_003254395.2/GCF_003254395.2_Amel_HAv3.1_genomic.fasta
bash setup_scripts/setup_new_reference.sh \
-f ncbi_dataset/data/GCF_003254395.2/GCF_003254395.2_Amel_HAv3.1_genomic.fasta \
-gens ncbi_dataset/data/GCF_003254395.2/genomic.gtf \
-o indexes/honeybee \
--genomeSAindexNbases 12 \
--singularity tap_latest.sif
# --genomeSAindexNbases 12 is required for smaller genomes. For mammalian genomes the default of 14 works fine.
R Setup
library(TAPhelpR)
library(tidyverse)
The following parameters control relevant file and directory
locations. Modify them to suit your local system.
tap_in = "input_fastq"
tap_out = "output_TAP"
tap_ref = "indexes/honeybee"
tap_singularity = "tap_latest.sif"
tap_config = "honeybee_config.csv"
Obtaining fastq files
TAPhelpR contains a file mapping SRR accessions to useful file names.
The following code loads this file and then subsets it to download,
rename, and compress, a selection of relevant fastq files.
df_srr = example_honeybee_metadata()
df_srr = df_srr %>%
separate(name, into = c("species", "day", "role", "rep"), remove = FALSE)
# for demonstration we will filter to 2 reps from a single day.
# skip this filter step to work with the complete dataset.
df_srr = df_srr %>% filter(day == "4days") %>% filter(rep %in% c("rep1", "rep2"))
SRA_download_files(srr_tofetch = df_srr$srr, fastq_prefixes = df_srr$name, singularity = tap_singularity, out_dir = tap_in, ncores = 8)
If you want to download data from GEO or ENCODE there are additional
helper functions for that.
GEO_download_files
ENCODE_download_files
Configs
Generate Config
Configuration files are strongly recommended to run TAP. They can
control all command line parameters and also set paths to fastq files.
You should also take the opportunity to rename samples to something more
meaningful and easier to work with. If you have sequencing replicates,
they can be aggregated to single samples in the config.
To use TAPhelpR to generate your config file, you need 2 things. A
directory with reference files and a directory with
fastq.gz files.
The reference directory should be generated using
setup_scripts/setup_new_reference.sh
from the directory
where TAP is installed. This script requires 2 minimal inputs with some
additional options. You must provide a single .fasta file and compatible
(same organism assembly) gene reference .gtf/.gff file.
The fastq.gz files must be gzipped should be for both reads
if you have paired-end data. They also must have consistent suffixes
(i.e. all end in _R1_001.fastq.gz or _R2_001.fastq.gz for PE data or all
end in _R1_001.fastq.gz for SE data). It’s OK if your suffixes differ
from these defaults, just be sure you specify f1_suffix (and f2_suffix
for PE data) when creating the config file.
cfg = config_create(
inDir = tap_in,
f1_suffix = "_R1_001.fastq.gz", #these suffixes are defaults but should match your files
f2_suffix = "_R2_001.fastq.gz",
outDir = tap_out,
singularity = tap_singularity,
reference = tap_ref
)
config_create
only gathers the information required to
write the config file. We have the opportunity to add some important
information before writing this config.
In our case, as an exercise we will change “honeybee” to the
assembly, “Amel_HAv3.1”.
# data.frame from config_create contains fastq file names and default sample name (fastq with suffix removed).
cfg_fq = cfg$fastq_lines
cfg_fq$name = sub("honeybee", "Amel_HAv3.1", cfg_fq$name)
# replace fastq data.frame in cfg
cfg$fastq_lines = cfg_fq
Now we want to write the config file.
config_write(cfg, tap_config, overwrite = TRUE)
Validate Config
Prior to running TAP, it’s a good idea to validate our config file.
This validation will not catch all possible errors but focuses on the
most common and hardest to diagnose.
config_validate(tap_config)
## V1 V2 V3 V4 V5 V6
## honeybee_4days_queen_rep1_R1_001.fastq.gz Amel HAv3 1 4days queen rep1
## honeybee_4days_queen_rep2_R1_001.fastq.gz Amel HAv3 1 4days queen rep2
## honeybee_4days_worker_rep1_R1_001.fastq.gz Amel HAv3 1 4days worker rep1
## honeybee_4days_worker_rep2_R1_001.fastq.gz Amel HAv3 1 4days worker rep2
Note the output of the above is a data.frame with fastq files as the
rownames and the _/. delimited name elements split. This also allows you
to verify that your names are well formed. Well formed names have the
same number of elements. Poorly formed names will make downstream
analysis more painful as you’ll constantly have to add additional code
to handle these inconsistencies.
You’ll see a warning if sample names are not well formed.
While running TAP
Evaluate completeness
After submitting the TAP pipeline, it’s nice to know how close to
completion it is. Once it’s finished, identifying and debugging any
errors is also critical. TAPhelpR
provides a series of
report functions for this purpose.
report_completion
is the most basic of these functions.
It simply checks for the existence of 3 files that TAP write to record
its status.
report_completion(tap_out)
Here we see a message reporting the number of samples that have begun
running and: 1) how many are still processing and 2) how many have
finished with no errors.
If all samples have finished without error there’s no reason to
investigate further.
For a more detailed report of progress, use
report_progress
. You’ll receive a warning if any steps
failed to complete due to errors.
report_progress(tap_out)
report_progress_plot(tap_out)
Eventually all jobs will complete.
Finally, if there were errors, use report_errors
.
report_errors(tap_out)
This will return paths to all of the relevant .error log files.
You’ll need to investigate these files to diagnose and resolve the
errors. You may need to delete affected files but then you can resubmit
TAP using the same output. You must allow TAP to finish running or
cancel remaining jobs before resubmitting.
After running TAP
Working with outputs
Only proceed once report_completion
returns 100%
completed successfully.
TAPhelpR contains several setup_*_files functions to create a
convenient data.frame containing file paths and meta data.
bam_files = setup_bam_files(tap_out, variable_map = c("species", "assembly", "assembly_version", "day", "role", "rep"))
The generic equivalent of the above looks like this:
setup_files(tap_out, pattern = "sortedByCoord.out.bam$", variable_map = c("species", "day", "role", "rep"))
## file
## 1: output_TAP/Amel_HAv3.1_4days_queen_rep1.Aligned.sortedByCoord.out.bam
## 2: output_TAP/Amel_HAv3.1_4days_queen_rep2.Aligned.sortedByCoord.out.bam
## 3: output_TAP/Amel_HAv3.1_4days_worker_rep1.Aligned.sortedByCoord.out.bam
## 4: output_TAP/Amel_HAv3.1_4days_worker_rep2.Aligned.sortedByCoord.out.bam
## species day role rep name
## 1: Amel HAv3 1 4days Amel_HAv3_1_4days
## 2: Amel HAv3 1 4days Amel_HAv3_1_4days
## 3: Amel HAv3 1 4days Amel_HAv3_1_4days
## 4: Amel HAv3 1 4days Amel_HAv3_1_4days
Run suppa2 diffSplice
suppa2
conducts differential splicing analysis and
TAPhelpR provides several wrapper functions for steps required to run
suppa2
joinFiles
groups replicates.
suppa_joinFiles(bam_files, by = "role")
diffSplice
performs differential analysis between groups
specified in joinFiles
.
suppa_diffSplice(
ref_location = example_honeybee_reference(),
PSI_todo = TAP_SPLICE_EVENTS$SkippingExon)
This variant of diffSplice
allows more complicated
specification of analysis “bins” that should be isolated. In this
examples, within each “day” group, differences will be
calculated between each “role” group.
suppa_diffSplice.within_group(
input_files = bam_files,
within_group = "day",
between_group = "role",
ref_location = example_honeybee_reference(),
PSI_todo = TAP_SPLICE_EVENTS$SkippingExon)
Clustering events can help makes the differential results more
interpretable.
suppa_clusterEvents(PSI_todo = TAP_SPLICE_EVENTS$SkippingExon)
Counts and DESeq2
To get to the entrypoint for a standard DESeq2 workflow is only a few
lines of code.
Loading raw counts is simple.
mat = load_counts(tap_out)
head(mat)
## Amel_HAv3.1_4days_queen_rep1 Amel_HAv3.1_4days_queen_rep2
## LOC551580 281 396
## LOC551555 3346 3664
## LOC726347 43 81
## Rfwd3 296 320
## LOC726145 1763 1825
## LOC551448 1101 1256
## Amel_HAv3.1_4days_worker_rep1 Amel_HAv3.1_4days_worker_rep2
## LOC551580 218 213
## LOC551555 2599 2126
## LOC726347 56 59
## Rfwd3 215 185
## LOC726145 1673 1484
## LOC551448 939 674
Raw counts will be per “gene_id” in reference by default. If you
provide the gtf reference, TAPhelpR can convert to “gene_name” or a
different, human friendly, attribute.
#if you supply the reference gtf file, or GRanges of same, to load_counts, gene_ids in counts will be aggregated to gene_name
mat.gene_name = load_counts(tap_out, gtf_file = tap_ref, name_attribute = "gene")
The final bit of information needed is the colData with sample
metadata.
library(DESeq2)
meta_df = bam_files
meta_df$file = NULL
meta_df = dplyr::mutate(meta_df, name = paste(day, role, rep, sep = "_"))
colnames(mat) = sub("Amel_HAv3.1_", "", colnames(mat))
des = DESeqDataSetFromMatrix(mat[, meta_df$name], colData = meta_df, design = ~role)
## Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
## design formula are characters, converting to factors
# and run from there ...
UCSC tracks
UCSC tracks are extremely useful for looking at individual genes in
detail.
stage_output_for_UCSC_tracks(tap_out, track_hosting_dir = "~/public_files/honeybee", overwrite = TRUE)
launch shiny app
launch_UCSC_tracks_app.UVM_galaxy(track_hosting_dir = "~/public_files/honeybee")