README for VARiD: Variation Identification for Letter-space and Color-space. This README can be found at http://compbio.cs.utoronto.ca/varid/ This document described the installation, use and output of the VARiD toolset. VARiD employs a Hidden Markov Model for variation detection, using color-space and letter-space reads as the emissions at positions along the genome. A manuscript describing the algorithms has been submitted for publication and will be included in this README for reference. The VARiD algorithms and tools involve: Adrian V. Dalca Michael Brudno Misko Dzamba Stephen M. Rumble Samuel Levy with a significant amount of help in the development of the C code by Adrian Dalca Sr. The authors may be contacted at varid at cs dot toronto dot edu Copyright: VARiD is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. VARiD is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. Please see the TODO file for the next steps in development. For an overview of the algorithms used please see: Adrian V. Dalca; Stephen M. Rumble; Samuel Levy; Michael Brudno. VARiD: A variation detection framework for color-space and letter-space platforms. Bioinformatics 2010 26: i343-i349. ---------------------------------------------------------------------- Table of Contents ---------------------------------------------------------------------- 1. Installation 2. Quick Start 3. Usage (incl parameters and output) 4. Usage with larger contigs 5. Using VARiD with BAM files 6. VARiD and re-calibrating read quality values 7. Contacts 8. Acknowledgements ---------------------------------------------------------------------- 1. Installation ---------------------------------------------------------------------- The source code is available at http://compbio.cs.utoronto.ca/varid/ . You can download the package using a web-browser or using wget, wget http://compbio.cs.utoronto.ca/varid/releases/varid-1.0.7c.tar.gz To extract the downloaded package use, tar -zxvf varid-1.0.7c.tar.gz next, move into the created folder and start the build: cd varid-1.0.7c ./configure --prefix=`pwd` make make install VARID_FOLDER=`pwd`/ export VARID_FOLDER compilation should proceed automatically. Once complete execute varid with the flag '-v' to determine the current version, ./bin/varid_exec -v for quick usage reference use, ./bin/varid_exec --help or ./bin/varid_exec -h ---------------------------------------------------------------------- 2. Quick Start ---------------------------------------------------------------------- Here a brief example using VARiD to detect short indels and SNPs will be outlined. To follow along you may wish to download the referenced data files which are avaliable for download at the following location, http://compbio.cs.utoronto.ca/varid/releases/varid-1.0.7c_example.tar.gz The archive contains the following files (and more), . |-- donor.fa |-- reads | |-- cs_fa_mapped.sam | |-- cs_reads.fa | |-- cs_fa_mapped.sam_sorted.bam | `-- cs_fa_mapped.sam_sorted.bam.bai |-- reference.fa |-- results | |-- indel_locations | `-- snp_locations `-- varid_output |-- cs_fa_mapped.sam_calls `-- cs_fa_mapped.sam_varid Here 'reference.fa' is a randomly generated 5000bp genome. The other fasta file, 'donor.fa', is a modified version of reference.fa with the changes made recorded in the 'results/' folder. For example, the contents of 'results/snp_locations' is, SNP: 173 -> c SNP: 917 -> c SNP: 1138 -> c SNP: 1204 -> c SNP: 2097 -> a SNP: 2413 -> t SNP: 2731 -> c SNP: 3385 -> c SNP: 4307 -> t SNP: 4481 -> t This lists all the SNPs added to 'donor.fa' relative to 'reference.fa' coordinates. To run VARiD the user must specify a reference genome (in fasta format) along with SAM format alignments. In this example 'reference.fa' will be the reference. The SAM format alignments used are stored in the file 'reads/cs_fa_mapped.sam' Which was generated by SHRiMP 2.0.2 using the following flags, gmapper-cs -h 25% --strata -q -20 -g -20 -E -N 4 cs_reads.fa reference.fa The file 'reads/cs_reads.fa' is a simulated AB SOLiD data set, with simulation details provided in 'reads/generate_log'. To run VARiD on the full genome with the alignments provide, the following command is used, ~/varid-1.0.7c/bin/varid_exec -a reads/cs_fa_mapped.sam \ -r reference.fa \ -o ./varid_prettybase \ --threads 4 VARiD takes significant memory, depending on the order of the genome, please see section (4) for this. In this example we can reduce the memory usage by only running prediction on a subset of the genome, ~/varid-1.0.7c/bin/varid_exec -a reads/cs_fa_mapped.sam \ -r reference.fa \ -o ./varid_prettybase \ --threads 4 \ --predict-start 50 \ --predict-end 100 This will be default create and evaluate a HMM for the selected prediction region (buffered on both sides by 200 bp if avaliable). In this particular case VARiD would create and run a HMM for positions 1 to 300 and report only positions 50 to 100 in the prettybase output. ---------------------------------------------------------------------- 3. Usage (incl parameters and output) ---------------------------------------------------------------------- 3.1 Usage. ---------- VARiD is special in that it can make use of both color-space and letter-space alignment data. The two necessary files to give VARiD are the SAM format alignments and the reference fasta file. By default VARiD assumes the SAM format alignments will come from standard in. So only 1 parameter needs to be specified for VARiD to run, the location of the reference fasta file. A current limitation of VARiD is memory usage which is proportional to length of genome. This results in a large memory footprint for VARiD when run on large genomes. As a result, VARiD may fail to allocate memory and crash, to work with large contigs please see section (4). After reading in the reference and SAM alignments file, VARiD will attempt to make small re-alignment corrections for gaps. Once this is complete it will create a HMM model based on the VARiD publication, and run by default the forward-backward algorithm (viterbi avaliable, but is not default). 3.2 Paramteres. --------------- Required Parameters: [ -r/--ref-file filepath ] This parameter specifies a filepath to the reference fasta file. Optional Parameters: [ -t/--call-threshold real_value ] The probability threshold (0..1) for calling bases. VARiD computes a confidence probability for each state (base call), when using the Forward Backward algorithm. It will only give bases then it is more confident in than this threshold value (>=). The default is 0.75. [ -n/--donor-name name ] The name specified with this option is provided in the prettybase VARiD output as the donor genome name. The default is "DONOR_GENOME". [ -a/--alignments filepath ] This parameter specifies a filepath to the SAM alignment file. The default is STDIN (see 'man stdin'). [ -d/--temp-dir folderpath ] This parameter specifies a folderpath to where VARiD can keep its temporary files. These files include the split a possibly trimmed SAM alignments for each contig. The default is "./varid_tmp". [ --min-coverage integer_value ] This value specifies the minimum read coverage (letter or color) at a position VARiD must have before making a none 'N' base call. The default is 4. [ --format pb|vcf ] Use pretty-base (pb) output format, or use vcf4.0 (vcf) output format. The default is pb. [ --all ] Report calls for every-base even if reference base is called. The default is disabled. Probability Settings: [ -s/--snp-rate real_value ] This value specifies the expected probability of a SNP. This value is used in the VARiD HMM as described in the VARiD publication. The default is 0.001. [ -g/--gap-error real_value ] This value specifies the probability of a gap occuring in a read. This is used instead of a quality value derived probability for the emission matrix when a read has a 'deletion' relative to the reference. The default is 0.03. [ --insert-open real_value ] This is the probability of having a gap open in the reference sequence. The probabilty of beginning a insertion into the reference to form the donor genome. This value is used in the VARiD HMM. The default is 0.00010. [ --insert-extend real_value ] This is the probability of having a gap extend in the reference sequence. The probability of extending a insertion in the reference to form the donor genome. This value is used in the VARiD HMM. The default is 0.00100. [ --deletion-open real_value ] This is the probability of having a deletion in the reference sequence. The probability of beginning a deletion in the reference to form the donor genome. This value is used in the VARiD HMM. The default is 0.00010. [ --err-coverage integer_value ] This is the minimum coverage needed for a quality value (or read index, if no quality values provided) in order for VARiD to estimate its error. The default is 40. [ --ls-err real_value ] This is the default letter space error value for VARiD if it does not run its error estimation algorithm. The default is 0.01. [ --min-ls-err real_value ] This is the minimum allowable letter space error for VARiDs error estimation algorithm. The default is 0.001. [ --max-ls-err real_value ] This is the maximum allowable letter space error for VARiDs error estimation algorithm. The default is 0.99. [ --cs-err real_value ] This is the default color space error value for VARiD if it does not run its error estimation algorithm. The default is 0.01. [ --min-cs-err real_value ] This is the minimum allowable color space error for VARiDs error estimation algorithm. The default is 0.001. [ --max-cs-err real_value ] This is the maximum allowable color space error for VARiDs error estimation algorithm. The default is 0.99. [ --get-bp-freq ] This option tells VARiD to compute A,C,G,T frequencies from the reference fasta file. The default is disabled. [ --freq-a real_value ] This is the default frequency of 'A' in the genome when --get-bp-freq is not used. The default is 0.25. [ --freq-c real_value ] This is the default frequency of 'C' in the genome when --get-bp-freq is not used. The default is 0.25. [ --freq-g real_value ] This is the default frequency of 'G' in the genome when --get-bp-freq is not used. The default is 0.25. [ --freq-t real_value ] This is the default frequency of 'T' in the genome when --get-bp-freq is not used. The default is 0.25. [ --log-coeff-ls real_value ] This is a scaling coefficient that can be used to scale the log probabilities of all letter space associated emissions. I.e. a value of 2 , squares all letter space emission probabilities. The default is 1. [ --log-coeff-cs real_value ] This is a scaling coefficient that can be used to scale the log probabilities of all color space associated emissions. I.e. a value of 2 , squares all color space emission probabilities. The default is 1. [ -S/--save-learned filepath ] This provides VARiD a filepath where to dump learned quality error values on a large genomic window. This can later be loaded when run on a smaller genomic window, to have a better error estimate of quality value read errors. The default is disabled. [ -L/--load-learned filepath ] This provides VARiD a filepath where previously learned quality error values can be read from and used. This is useful when generating error estimates for larger genomic windows, while running VARiD only a small window. The default is disabled. HMM options: [ --predict-start integer_value ] This tells VARiD what position to start making base calls from. The default is 1. [ --predict-end integer_value ] This tells VARiD what position to stop making base calls at. The default is length of reference genome. [ --hmm-area integer_value ] This tells VARiD how many bases to go upstream and downstream from '--predict-start' and '--predict-end' for the underlying HMM model. The default is 200. [ --hmm-start integer_value ] This tells VARiD what position to start the HMM model from. This value must be always greater then or equal to the value specified to '--predict-start'. The default is MAX(1,value specified by --predict-start - value specified by --hmm-area [200]) [ --hmm-end integer_value ] This tells VARiD what position to end the HMM model at. This value must be always less then or equal to the value specified to '--predict-end'. The default is MIN(genome length,value specified by --predict-end + value specified by --hmm-area [200]) [ --viterbi ] Use the viterbi based prediction algorithm. In this mode, VARiD does not report confidence values, so the confidence threshold is ignored. The default is disabled. Heterozygous distribution penalties: [ --rank-sum ] Use a hybrid of rank-sum based quality value penalty with a modified gaussian, to assign penalties to heterozygous states. The default is enabled. [ --normal real_value ] Use a gaussian based heterzygous penalty. With the provided standard deviation parameter. The default is disabled. [ --poisson ] Use two poisson distributions to model the coverage of two haplomes in a heterozygous state, and assign penalties based on this. The default is disabled. [ --binomial ] Use a binomial distribution to model the coverage of the two haplomes in a heterzygous state, and assign penalties based on this. The default is disabled. VARiD options: [ --haploid ] Model the organism as a haploid. The default is disabled. [ -e/--heuristics ] Use non-quality VARiD 0.99 based post processing heuristics. The default is disabled. [ --drop-bad-reads ] Try to drop alignments that cause problems. The default is disabled. [ -v/--version ] Display the VARiD version and exit. The default is disabled. [ -h/--help ] Display the complete usage and exit. The default is disabled. 3.3 Output ---------- VARiD not only calls SNP positions (heterozygous or homozygous SNPs and micro-indels), but in general will fully call the nucleotides at each position as well. To represent this, VARiD outputs base calls in a version of the prettybase format (see http://pga.mbt.washington.edu/) with slight additions. More concretly, the output is: where the fields are tab-delimited and where: will be the reference from the mfa reference file followed by the position along that reference will be the name given in -ref_name will be the nucleotide call of the first haplotype (order of first and second is arbritrary) will be the nucleotide call of the second haplotype (order of first and second is arbritrary) will depend on the ALLELE calls, and will be 'R' for a Reference Allele (no variation), 'H' for a heterozygous call and 'A' for a homozygous alternate SNP call. represents the confidence in this particular state. for example, potential lines from an output are chr3_1009519 NA17156 G G R 0.98 chr3_1009520 NA17156 T G H 0.99 chr3_1009521 NA17156 A A R 0.83 ---------------------------------------------------------------------- 4. Usage with larger contigs ---------------------------------------------------------------------- The VARiD HMM currently has 1,600 states per genome position. This creates problems for VARiD memory footprint. When using forward backward prediction algorithm. This means that for each genome position VARiD must allocate (1600 x 3 x sizeof(double)) bytes. This is roughly, 40kB x Length of Genome This means that for a genome of length 10,000 just the forward, backward and probability matrix will take approximately 400MB. If you try to run chromosome 1 in VARiD for prediction this will try to allocate approximately 9.2TB. The current solution VARiD offers to this is (not yet a disk based paging solution) but a helper script and option to run VARiD on a subset of the input genome. For example if you would like to run VARiD on a 5,000 bp genome but you only have 128MB RAM the script 'helper/varidhelper.sh' can help. Running varidhelper.sh with the flag "--script" will print the commands that 'would have been issued by varidhelper.sh'. For example running, varidhelper.sh --alignments reads/cs_fa_mapped.sam --ref-file reference.fa --output output --temp-dir /tmp/ --script --threads 4 --split-size 1000 Produces the output, ~/varid-1.0.7c/bin/varid_exec -a reads/cs_fa_mapped.sam -r reference.fa --threads 4 ~/varid-1.0.7c/bin/varid_exec -a aln.sam -r 100k -n -d /tmp/ --predict-start 1 --predict-end 1000 --hmm-area 100 --threads 4 >> output ~/varid-1.0.7c/bin/varid_exec -a aln.sam -r 100k -n -d /tmp/ --predict-start 1001 --predict-end 2000 --hmm-area 100 --threads 4 >> output ~/varid-1.0.7c/bin/varid_exec -a aln.sam -r 100k -n -d /tmp/ --predict-start 2001 --predict-end 3000 --hmm-area 100 --threads 4 >> output ~/varid-1.0.7c/bin/varid_exec -a aln.sam -r 100k -n -d /tmp/ --predict-start 3001 --predict-end 4000 --hmm-area 100 --threads 4 >> output ~/varid-1.0.7c/bin/varid_exec -a aln.sam -r 100k -n -d /tmp/ --predict-start 4001 --predict-end 5000 --hmm-area 100 --threads 4 >> output To get VARiD to actually run, run the helper script without the flag "--script". ---------------------------------------------------------------------- 5. Using VARiD with BAM files ---------------------------------------------------------------------- By default VARiD assumes SAM alignments will be provided on standard input (STDIN, see man stdin). This can be useful when using VARiD with BAM files. For example we can run the following, samtools view reads/cs_fa_mapped.sam_sorted.bam | ~/varid-1.0.7c/varid_exec -r reference.fa For large alignment files (one BAM file for a whole human individual) it might be useful to only give VARiD a subset of the alignments relevant to the region. To do this VARiD can be run with the following, samtools view reads/cs_fa_mapped.sam_sorted.bam RANDOM_GENOME_5000:3000-4000 | ~/varid-1.0.7c/varid_exec -r reference.fa --predict-start 3300 --predict-end 3600 This command will retrieve only alignments spanning into the 3000~4000 bp region of the reference and pass them to VARiD. VARiD will then by default contrsuct a HMM model (with 200 bp boundary buffer, see '--hmm-area') from position 3100 to 3800. Also VARiD will only report base calls on the interval 3300 to 3600. ---------------------------------------------------------------------- 6. VARiD and re-calibrating read quality values ---------------------------------------------------------------------- VARiD tries to learn alignment specific error rates for quality values instead of using any hard coded values. When running VARiD on smaller genomic windows VARiD will only load those alignments which span into this window. This will lower the data points avaliable for each quality value, and cause VARiD to have a worse internal estimate of these quality values. This is why VARiD has options to learn errors on a full dataset (without running the actual HMM) and save them to a file, so that they can be loaded later. For example we can instruct VARiD to learn error values on the full alignments with the following, ~/varid-1.0.7c/varid_exec -r reference.fa -a reads/cs_fa_mapped.sam -S cs_error This will save estimated errors to the file 'cs_error'. When quality values are not avaliable for alignments, VARiD will create a position specific error profile. This is what happens in the above example. To later use these learned error rates in VARiD when running on smaller genomic windows, the following can be used, samtools view reads/cs_fa_mapped.sam_sorted.bam RANDOM_GENOME_5000:3000-4000 | ~/varid-1.0.7c/varid_exec -r reference.fa --predict-start 3300 --predict-end 3600 -L cs_error ---------------------------------------------------------------------- 7. Contacts ---------------------------------------------------------------------- The program and more information can be found at http://compbio.cs.utoronto.ca/varid/ The authors can be contacted at varid at cs dot toronto dot edu ---------------------------------------------------------------------- 8. Acknowledgements ---------------------------------------------------------------------- The algorithms of VARiD were developed at the University of Toronto Computational Biology Lab. The development was performed by the authors with significant help from Adrian Dalca Sr. The work was made possible in part by National Engineering and Research Council of Canada.