README  for  VARiD:  Variation  Identification  for  Letter-space  and
Color-space.      This      README      can      be      found      at
http://compbio.cs.utoronto.ca/varid/

This document described the installation,  use and output of the VARiD
toolset. VARiD employs a  Hidden Markov Model for variation detection,
using color-space and letter-space reads as the emissions at positions
along  the genome.  A  manuscript describing  the algorithms  has been
submitted  for publication  and will  be included  in this  README for
reference.

The VARiD algorithms and tools involve:
	Adrian V. Dalca
	Michael Brudno
	Misko Dzamba
	Stephen M. Rumble
	Samuel Levy

with a significant amount of help  in the development of the C code by
Adrian Dalca  Sr.  The  authors may  be contacted at  varid at  cs dot
toronto dot edu

Copyright:  VARiD is  free software:  you can  redistribute  it and/or
modify  it  under the  terms  of the  GNU  General  Public License  as
published by  the Free  Software Foundation, either  version 3  of the
License, or (at your option)  any later version.  VARiD is distributed
in the hope that it will  be useful, but WITHOUT ANY WARRANTY; without
even  the  implied  warranty  of  MERCHANTABILITY  or  FITNESS  FOR  A
PARTICULAR  PURPOSE.  See  the  GNU General  Public  License for  more
details.

Please see the TODO file for the next steps in development.

For an overview of the algorithms used please see:
	Adrian V. Dalca; Stephen M. Rumble; Samuel Levy; Michael Brudno. 
	VARiD: A variation detection framework for color-space and 
	letter-space platforms.  Bioinformatics 2010 26: i343-i349.

----------------------------------------------------------------------
Table of Contents
----------------------------------------------------------------------
 1. Installation
 2. Quick Start
 3. Usage (incl parameters and output)
 4. Usage with larger contigs
 5. Using VARiD with BAM files
 6. VARiD and re-calibrating read quality values
 7. Contacts
 8. Acknowledgements

----------------------------------------------------------------------
 1. Installation
----------------------------------------------------------------------

The source  code is available  at http://compbio.cs.utoronto.ca/varid/
. You can download the package using a web-browser or using wget,

wget http://compbio.cs.utoronto.ca/varid/releases/varid-1.0.7c.tar.gz

To extract the downloaded package use,

	tar -zxvf varid-1.0.7c.tar.gz

next, move into the created folder and start the build:

	cd varid-1.0.7c
	./configure --prefix=`pwd`
	make
 	make install
	VARID_FOLDER=`pwd`/
	export VARID_FOLDER

compilation should proceed automatically. Once complete execute varid
with the flag '-v' to determine the current version,

	./bin/varid_exec -v

for quick usage reference use,

	./bin/varid_exec --help
	or
	./bin/varid_exec -h

----------------------------------------------------------------------
 2. Quick Start
----------------------------------------------------------------------

Here a brief example using VARiD to detect short indels and SNPs will
be outlined. To follow along you may wish to download the referenced
data files which are avaliable for download at the following location,

http://compbio.cs.utoronto.ca/varid/releases/varid-1.0.7c_example.tar.gz

The archive contains the following files (and more),

.
|-- donor.fa
|-- reads
|   |-- cs_fa_mapped.sam
|   |-- cs_reads.fa
|   |-- cs_fa_mapped.sam_sorted.bam
|   `-- cs_fa_mapped.sam_sorted.bam.bai
|-- reference.fa
|-- results
|   |-- indel_locations
|   `-- snp_locations
`-- varid_output
    |-- cs_fa_mapped.sam_calls
    `-- cs_fa_mapped.sam_varid

Here 'reference.fa' is a randomly generated 5000bp genome. The other fasta
file, 'donor.fa', is a modified version of reference.fa with the changes
made recorded in the 'results/' folder.

For example, the contents of 'results/snp_locations' is,

SNP: 173 -> c
SNP: 917 -> c
SNP: 1138 -> c
SNP: 1204 -> c
SNP: 2097 -> a
SNP: 2413 -> t
SNP: 2731 -> c
SNP: 3385 -> c
SNP: 4307 -> t
SNP: 4481 -> t

This lists all the SNPs added to 'donor.fa' relative to 'reference.fa'
coordinates.

To run VARiD the user must specify a reference genome (in fasta format)
along with SAM format alignments. In this example 'reference.fa' will be
the reference.

The SAM format alignments used are stored in the file 
'reads/cs_fa_mapped.sam'
Which was generated by SHRiMP 2.0.2 using the following flags,
gmapper-cs -h 25% --strata -q -20 -g -20 -E -N 4 cs_reads.fa reference.fa
The file 'reads/cs_reads.fa' is a simulated AB SOLiD data set, with 
simulation details provided in 'reads/generate_log'.

To run VARiD on the full genome with the alignments provide, the
following command is used,

~/varid-1.0.7c/bin/varid_exec -a reads/cs_fa_mapped.sam  \
                              -r reference.fa            \
                              -o ./varid_prettybase      \
                              --threads 4

VARiD takes significant memory, depending on the order of the genome,
please see section (4) for this. In this example we can reduce the 
memory usage by only running prediction on a subset of the genome,

~/varid-1.0.7c/bin/varid_exec -a reads/cs_fa_mapped.sam  \
                              -r reference.fa            \
                              -o ./varid_prettybase      \
                              --threads 4                \
                              --predict-start 50         \
                              --predict-end 100

This will be default create and evaluate a HMM for the selected
prediction region (buffered on both sides by 200 bp if avaliable).
In this particular case VARiD would create and run a HMM for positions
1 to 300 and report only positions 50 to 100 in the prettybase output.
		 
----------------------------------------------------------------------
 3. Usage (incl parameters and output)
----------------------------------------------------------------------

  3.1 Usage.
  ----------

  VARiD is special in that it can make use of both color-space and 
letter-space alignment data. The two necessary files to give VARiD
are the SAM format alignments and the reference fasta file. By default
VARiD assumes the SAM format alignments will come from standard in.
So only 1 parameter needs to be specified for VARiD to run, the location
of the reference fasta file. A current limitation of VARiD is memory
usage which is proportional to length of genome. This results in a large
memory footprint for VARiD when run on large genomes. As a result, VARiD
may fail to allocate memory and crash, to work with large contigs please
see section (4).
 
After reading in the reference and SAM alignments file, VARiD will
attempt to make small re-alignment corrections for gaps. Once this is
complete it will create a HMM model based on the VARiD publication, and
run by default the forward-backward algorithm (viterbi avaliable, but is
not default).

  3.2 Paramteres.
  ---------------

  Required Parameters:

  [ -r/--ref-file filepath ]
	This parameter specifies a filepath to the reference fasta file. 

  Optional Parameters:

  [ -t/--call-threshold real_value ] 
	The probability threshold (0..1) for calling bases. VARiD 
computes a confidence probability for each state (base call), when
using the Forward Backward algorithm. It will only give bases then it 
is more confident in than this threshold value (>=).
The default is 0.75. 

  [ -n/--donor-name name ]
	The name specified with this option is provided in the
prettybase VARiD output as the donor genome name.
The default is "DONOR_GENOME".	

  [ -a/--alignments filepath ]
	This parameter specifies a filepath to the SAM alignment file.
The default is STDIN (see 'man stdin').

  [ -d/--temp-dir folderpath ]
	This parameter specifies a folderpath to where VARiD can keep
its temporary files. These files include the split a possibly trimmed
SAM alignments for each contig.
The default is "./varid_tmp".

  [    --min-coverage integer_value ]
	This value specifies the minimum read coverage (letter or color)
at a position VARiD must have before making a none 'N' base call.
The default is 4.

  [    --format pb|vcf ]
	Use pretty-base (pb) output format, or use vcf4.0 (vcf) output
format. The default is pb.

  [    --all ]
	Report calls for every-base even if reference base is called.
The default is disabled.

  Probability Settings:

  [ -s/--snp-rate real_value ]
	This value specifies the expected probability of a SNP. This 
value is used in the VARiD HMM as described in the VARiD publication.
The default is 0.001.

  [ -g/--gap-error real_value ]
	This value specifies the probability of a gap occuring in a
read. This is used instead of a quality value derived probability
for the emission matrix when a read has a 'deletion' relative to
the reference.
The default is 0.03.

  [    --insert-open real_value ]
	This is the probability of having a gap open in the reference
sequence. The probabilty of beginning a insertion into the reference
to form the donor genome. This value is used in the VARiD HMM.
The default is 0.00010.

  [    --insert-extend real_value ]
	This is the probability of having a gap extend in the
reference sequence. The probability of extending a insertion in the
reference to form the donor genome. This value is used in the VARiD HMM.
The default is 0.00100.

  [    --deletion-open real_value ]
	This is the probability of having a deletion in the reference
sequence. The probability of beginning a deletion in the reference to
form the donor genome. This value is used in the VARiD HMM.
The default is 0.00010.

  [    --err-coverage integer_value ]
	This is the minimum coverage needed for a quality value (or 
read index, if no quality values provided) in order for VARiD to estimate
its error.
The default is 40.

  [    --ls-err real_value ]
	This is the default letter space error value for VARiD if it
does not run its error estimation algorithm.
The default is 0.01.

  [    --min-ls-err real_value ]
	This is the minimum allowable letter space error for VARiDs 
error estimation algorithm.
The default is 0.001.

  [    --max-ls-err real_value ]
	This is the maximum allowable letter space error for VARiDs
error estimation algorithm.
The default is 0.99.

  [    --cs-err real_value ]
	This is the default color space error value for VARiD if it
does not run its error estimation algorithm.
The default is 0.01.

  [    --min-cs-err real_value ]
	This is the minimum allowable color space error for VARiDs 
error estimation algorithm.
The default is 0.001.

  [    --max-cs-err real_value ]
	This is the maximum allowable color space error for VARiDs
error estimation algorithm.
The default is 0.99.

  [    --get-bp-freq ]
	This option tells VARiD to compute A,C,G,T frequencies from the
reference fasta file.
The default is disabled.

  [    --freq-a real_value ]
	This is the default frequency of 'A' in the genome when
--get-bp-freq is not used. 
The default is 0.25.

  [    --freq-c real_value ]
	This is the default frequency of 'C' in the genome when
--get-bp-freq is not used. 
The default is 0.25.

  [    --freq-g real_value ]
	This is the default frequency of 'G' in the genome when
--get-bp-freq is not used. 
The default is 0.25.

  [    --freq-t real_value ]
	This is the default frequency of 'T' in the genome when
--get-bp-freq is not used. 
The default is 0.25.

  [    --log-coeff-ls real_value ]
	This is a scaling coefficient that can be used to scale the
log probabilities of all letter space associated emissions. 
I.e. a value of 2 , squares all letter space emission probabilities.
The default is 1.

  [    --log-coeff-cs real_value ]
	This is a scaling coefficient that can be used to scale the
log probabilities of all color space associated emissions. 
I.e. a value of 2 , squares all color space emission probabilities.
The default is 1.

  [ -S/--save-learned filepath ]
	This provides VARiD a filepath where to dump learned 
quality error values on a large genomic window. This can later be 
loaded when run on a smaller genomic window, to have a better error
estimate of quality value read errors.
The default is disabled.

  [ -L/--load-learned filepath ]
	This provides VARiD a filepath where previously learned 
quality error values can be read from and used. This is useful when
generating error estimates for larger genomic windows, while running
VARiD only a small window.
The default is disabled.

  HMM options:

  [    --predict-start integer_value ]
	This tells VARiD what position to start making base calls from.
The default is 1.

  [    --predict-end integer_value ]
	This tells VARiD what position to stop making base calls at.
The default is length of reference genome.

  [    --hmm-area integer_value ]
	This tells VARiD how many bases to go upstream and downstream
from '--predict-start' and '--predict-end' for the underlying HMM model.
The default is 200.

  [    --hmm-start integer_value ]
	This tells VARiD what position to start the HMM model from. 
This value must be always greater then or equal to the value specified
to '--predict-start'.
The default is MAX(1,value specified by --predict-start - 
                          value specified by --hmm-area [200])

  [    --hmm-end integer_value ]
	This tells VARiD what position to end the HMM model at. 
This value must be always less then or equal to the value specified
to '--predict-end'.
The default is MIN(genome length,value specified by --predict-end + 
                          value specified by --hmm-area [200])

  [    --viterbi ]
	Use the viterbi based prediction algorithm. In this mode, 
VARiD does not report confidence values, so the confidence threshold 
is ignored.
The default is disabled.
 
  Heterozygous distribution penalties:

  [    --rank-sum ]
	Use a hybrid of rank-sum based quality value penalty with a 
modified gaussian, to assign penalties to heterozygous states.
The default is enabled.

  [    --normal real_value ]
	Use a gaussian based heterzygous penalty. With the provided 
standard deviation parameter.
The default is disabled.

  [    --poisson ]
	Use two poisson distributions to model the coverage of two 
haplomes in a heterozygous state, and assign penalties based on this.
The default is disabled.

  [    --binomial ]
	Use a binomial distribution to model the coverage of the two 
haplomes in a heterzygous state, and assign penalties based on this.
The default is disabled.

  VARiD options:

  [    --haploid ]
	Model the organism as a haploid.
The default is disabled.

  [ -e/--heuristics ]
	Use non-quality VARiD 0.99 based post processing heuristics.
The default is disabled.

  [ --drop-bad-reads ]
	Try to drop alignments that cause problems.
The default is disabled.

  [ -v/--version ]
	Display the VARiD version and exit.
The default is disabled.

  [ -h/--help ]
	Display the complete usage and exit.
The default is disabled.

 
  3.3 Output
  ----------
  VARiD not only calls  SNP positions (heterozygous or homozygous SNPs
and micro-indels), but  in general will fully call  the nucleotides at
each position as well. To  represent this, VARiD outputs base calls in
a      version      of       the      prettybase      format      (see
http://pga.mbt.washington.edu/) with slight additions. More concretly, 
the output is:

  <refname_position> <run_ID> <Allele1> <Allele2> <VAR_call> <confidence>

  where the fields are tab-delimited and where:
    <refname_position> will  be the  reference from the  mfa reference
file followed by the position along that reference
    <run_ID> will be the name given in -ref_name
    <Allele1>  will be  the  nucleotide call  of  the first  haplotype
(order of first and second is arbritrary)
    <Allele2>  will be  the nucleotide  call of  the  second haplotype
(order of first and second is arbritrary)
    <VAR_call> will depend on the ALLELE  calls, and will be 'R' for a
Reference Allele (no  variation), 'H' for a heterozygous  call and 'A'
for a homozygous alternate SNP call.
    <Confidence> represents the confidence in this particular state.

  for example, potential lines from an output are

	chr3_1009519        NA17156 G       G       R	0.98
	chr3_1009520        NA17156 T       G       H	0.99
	chr3_1009521        NA17156 A       A       R	0.83

----------------------------------------------------------------------
 4. Usage with larger contigs
----------------------------------------------------------------------

The VARiD HMM currently has 1,600 states per genome position. This 
creates problems for VARiD memory footprint. When using
forward backward prediction algorithm. This means that for each genome
position VARiD must allocate (1600 x 3 x sizeof(double)) bytes.
This is roughly,
	40kB x Length of Genome
This means that for a genome of length 10,000 just the forward,
backward and probability matrix will take approximately 400MB.
If you try to run chromosome 1 in VARiD for prediction this will try
to allocate approximately 9.2TB.

The current solution VARiD offers to this is (not yet a disk based 
paging solution) but a helper script and option to run VARiD on a 
subset of the input genome.

For example if you would like to run VARiD on a 5,000 bp genome
but you only have 128MB RAM the script 'helper/varidhelper.sh' can help.
Running varidhelper.sh with the flag "--script" will print the commands
that 'would have been issued by varidhelper.sh'.
For example running,

varidhelper.sh --alignments reads/cs_fa_mapped.sam --ref-file reference.fa --output output --temp-dir /tmp/ --script --threads 4 --split-size 1000

Produces the output,
~/varid-1.0.7c/bin/varid_exec -a reads/cs_fa_mapped.sam -r reference.fa --threads 4

~/varid-1.0.7c/bin/varid_exec -a aln.sam -r 100k -n -d /tmp/ --predict-start 1 --predict-end 1000 --hmm-area 100 --threads 4 >> output
~/varid-1.0.7c/bin/varid_exec -a aln.sam -r 100k -n -d /tmp/ --predict-start 1001 --predict-end 2000 --hmm-area 100 --threads 4 >> output
~/varid-1.0.7c/bin/varid_exec -a aln.sam -r 100k -n -d /tmp/ --predict-start 2001 --predict-end 3000 --hmm-area 100 --threads 4 >> output
~/varid-1.0.7c/bin/varid_exec -a aln.sam -r 100k -n -d /tmp/ --predict-start 3001 --predict-end 4000 --hmm-area 100 --threads 4 >> output
~/varid-1.0.7c/bin/varid_exec -a aln.sam -r 100k -n -d /tmp/ --predict-start 4001 --predict-end 5000 --hmm-area 100 --threads 4 >> output

To get VARiD to actually run, run the helper script without the
flag "--script".

----------------------------------------------------------------------
 5. Using VARiD with BAM files
----------------------------------------------------------------------

By default VARiD assumes SAM alignments will be provided on standard
input (STDIN, see man stdin). This can be useful when using VARiD
with BAM files.

For example we can run the following,
samtools view reads/cs_fa_mapped.sam_sorted.bam | ~/varid-1.0.7c/varid_exec -r reference.fa

For large alignment files (one BAM file for a whole human individual)
it might be useful to only give VARiD a subset of the alignments
relevant to the region. To do this VARiD can be run with the following,

samtools view reads/cs_fa_mapped.sam_sorted.bam RANDOM_GENOME_5000:3000-4000 | ~/varid-1.0.7c/varid_exec -r reference.fa --predict-start 3300 --predict-end 3600

This command will retrieve only alignments spanning into the
3000~4000 bp region of the reference and pass them to VARiD.
VARiD will then by default contrsuct a HMM model (with 200 bp boundary
buffer, see '--hmm-area') from position 3100 to 3800. Also
VARiD will only report base calls on the interval 3300 to 3600.


----------------------------------------------------------------------
 6. VARiD and re-calibrating read quality values
----------------------------------------------------------------------

VARiD tries to learn alignment specific error rates for quality values
instead of using any hard coded values. When running VARiD on smaller
genomic windows VARiD will only load those alignments which span into
this window. This will lower the data points avaliable for each quality
value, and cause VARiD to have a worse internal estimate of these 
quality values.

This is why VARiD has options to learn errors on a full dataset 
(without running the actual HMM) and save them to a file, so that they
can be loaded later.

For example we can instruct VARiD to learn error values on the full
alignments with the following,

~/varid-1.0.7c/varid_exec -r reference.fa -a reads/cs_fa_mapped.sam -S cs_error

This will save estimated errors to the file 'cs_error'. When quality
values are not avaliable for alignments, VARiD will create a position
specific error profile. This is what happens in the above example.

To later use these learned error rates in VARiD when running on 
smaller genomic windows, the following can be used,

samtools view reads/cs_fa_mapped.sam_sorted.bam RANDOM_GENOME_5000:3000-4000 | ~/varid-1.0.7c/varid_exec -r reference.fa --predict-start 3300 --predict-end 3600 -L cs_error

----------------------------------------------------------------------
 7. Contacts
----------------------------------------------------------------------

The program and more information can be found at
	http://compbio.cs.utoronto.ca/varid/

The authors can be contacted at
	varid at cs dot toronto dot edu



----------------------------------------------------------------------
 8. Acknowledgements
----------------------------------------------------------------------

The algorithms  of VARiD were  developed at the University  of Toronto
Computational  Biology  Lab.  The  development  was  performed by  the
authors with significant help from  Adrian Dalca Sr. The work was made
possible  in part  by  National Engineering  and  Research Council  of
Canada.