SHARCGS Documentation
SHARCGS - SHort-read Assembler based on Robust Contig-extension
for Genomic Sequencing.
sharcgs.pl [options] <input-file>
Using this software implies acceptance of the terms of the GNU General
Public License (see http://sharcgs.molgen.mpg.de/docs/license.txt).
This program generates long contigs of genomic sequence based on very
short 25-40mer, error prone reads as produced by 2nd generation sequencing
machines. A large number of equal-sized reads is read from the input
file and concatenated to generate contigs of several 1000 bases
in length. The reads may contain errors in as many as 2% of all
base calls.
The algorithm filters the input for reads which fulfill the
following criteria: they are read several times and their ends are
matched perfectly by other, overlapping reads. When quality scores
are provided for each call, the first step of the filtering is
modified as follows: If a read is read several times, it is stored
only once. For each call the highest observed quality score is
retained. If two reads are reverse complements, their corresponding
quality scores are added up. The criterion for the first filter
step is the smallest cumulated quality score for a read.
The assembly algorithm assumes that only correct reads pass
this filter despite the unreliability of the raw data. The
algorithm then only joins reads into contigs when this is possible
unambiguously. In order to assemble contigs covering almost the
complete target sequence, reads from a clear majority of positions
must pass the mentioned filters. If no read is available from a
position, this is called 'gap' in this documentation. Actually,
a gap can consist of several positions in a row, for which no read
passes the filter.
SHARCGS can perform several runs of the core assembly algorithm
for different filter settings. When running the assembly for
sloppy filtering, contig breaks are expected to occur frequently
due to reads containing sequencing errors. With stringent
filtering on the other hand, contig breaks occur because larger
gaps must be spanned. The contig breaks, however, are not expected
at the same positions. Therefore, SHARCGS gives the possibility to
merge contigs from different core assembly runs.
The final assembly is stored in a file containing one entry for
each generated contig in FASTA format. If quality scores are
provided as input, this output file is complemented by a
FASTA-format file containing entries in the same order as the
contig file. Instead of base calls, however, quality scores
separated by spaces are reported. Quality scores are thereby
computed by taking the maximum when several reads confirm a base
call on the same strand. For each contig, quality scores
corresponding to its reverse complement are also computed. The
final quality scores are computed by adding up the corresponding
quality scores from forward and reverse contig.
The only mandatory argument is the input file, which must contain
one read per line. Lines starting with a '>' sign are skipped from
the input. All of these reads must have exactly the same
length. Since we normally do not know, which strand has been read,
all reads are converted to their reverse complement and added to
the input data.
The following options can be used to fine-tune the assembly
process by overriding the corresponding defaults given in
parentheses:
Reads which are occuring in the input data less than this value
are discarded from the assembly. Setting this value to 1 most
probably integrates too many faulty reads into the assembly, since
as many as 20% of all 30-nt reads contain sequencing errors, when
the error rate per base call is 0.6%. Setting the read repetition
requirement to values above 5 is not advisable for its wasteful
consequences. This parameter can be set to a comma delimited list
of values. In this case, one by one of these values is used as
filter setting and the resulting contigs are merged into a single
assembly. During the merge, the algorithm uses first generated
contigs first and extends or merges them afterwards with contigs
from subsequent runs. It is therefore recommended to start the
list with higher values, since contigs assembled with more
stringent filtering are expected to be more reliable. If quality
scores are provided by specifying a quality file, the values given
to -r are interpreted as minimal quality scores.
If the -r option is not specified SHARCGS chooses settings
according to the input data. It estimates the length N of the
target sequence from the observed read repetitions according to a
maximum likelihood model. Then, SHARCGS chooses read repetition
levels for which N/2 to N different reads pass the filter.
This value sets the maximal length of a gap that can be safely
spanned by the assembly algorithm. Setting this value too low
leads to small contigs, because gaps that result from missing reads
cannot be bridged. Moreover, SHARCGS cannot guarantee that
contigs are joined correctly due to the lack of reads needed to
detect ambiguous extensions. Setting this value too high causes
the algorithm to detect spurious ambiguities and will also lead to
short contigs due to resulting minimal overlaps being not
unique. The best value for this parameter depends on the amount of
repetitive sequence and the fraction of missing data. It is
actually translated into the minimal overlap we use to join a read
to a contig. Large gap-span values also imply a deeper prefix
trie that needs more system memory. If the user sets this parameter
to a specific value, the algorithm performs exactly one assembly
using this particular gap span. By default, the algorithm
iterates through values between the estimated maximum gap size
(see section OUTPUT) and the depth of the prefix trie to find the
smallest gap span which leads to sufficiently few "open ends" (see
below).
In a quality file, quality scores for each base call in the main
input file are specified. The read must be reported in the same
order in both files and the quality file must conform to the
Illumina 1G format. For each base call, four integer values
separated by space characters are reported. Each base call is
separated by a tab and each read is reported on one line. Quality
scores are interpreted as Q = -10 log10 (p/(1-p)) with p .
Whenever SHARCGS stops the assembly of a contig, because no
extension is found with a minimal overlap, we call this an open or
dangling end. Open ends are expected at both end of a target
sequence if the target is in linear form. If the target is
circular, no open ends are expected, i.e. all contigs should be
terminated due to detected ambiguities. This is the best guarantee
that no wrong connections between contigs are generated. Thus for
linear targets, this parameter should be set to 2, while for
circular ones it should be set to 0.
All reads passing the filter are stored in a prefix trie. This
data structure allows for very fast searching of extensions given
a prefix. All reads are split into a prefix of minimal length and
a suffix containing the rest of the read. The suffix is stored in
a tree structure associated to the prefix using a hash table. In
this manner, prefixes of any length larger than the minimal prefix
length can be searched quickly and the corresponding suffix is
determined easily. The prefix trie requires a large amount of
memory. Setting this parameter to low values makes SHARCGS less
greedy but limits it to span gaps shorter than this value. If the
parameter is not specified, it is set to a value such that large
gaps can be spanned, as they are expected when reads from as many
as 50% of target sequence positions are missing.
The minimal length of contigs to be reported.
The output file where the contigs are to be stored in FASTA
format.
The most important output, namely the collection of contigs
assembled, is written to the output file specified with the '-o'
parameter. In addition, the following messages generated on the
screen may be of interest:
On this line the number of reads in the input file and the length
of each individual read is given.
In a first step of the assembly reverse complements of all reads
are generated and stored into a single hash table. Here the number
of unique reads from both strands is given.
The length of the target sequence and the error rate per base with
which the reads are generated is estimated using a maximum
likelihood method. The information used is the distribution of
how often reads are generated. The probability that a read at a
given position in the target sequence is read exactly x times, can
be deduced from a binomial distribution Binom(p, k) using the
number of reads k in the input file and the probability p that a
read at a particular position occurs without error as
parameters. Thereby, p depends on the error rate e, the read
length l, and the length of the target sequence n:
p = (1 - e)^l / n
The number of reads which are expected to occur exacly x times in
the input file is also deduced from a binomial distribution. This
time the parameters are the length of the target sequence k and
the probability to read x times without errors computed as
mentioned above. Maximizing this probability over a grid of
parameters deemed to be likely for a sequencing machine, we obtain
the values given here. This method has shown to be accurate on
simulated data but estimates are imprecise on real data.
This line reports, which fraction of missing reads can be handled
and what depth of the prefix trie is needed to do so. If the -p or
-g parameters are specified, the fraction of missing reads is
adapted, otherwise the depth of the trie is adapted such that 50%
missing reads can be handled.
The first filter step removes reads which are not confirmed a
minimal number of times or with a minimal quality score for any of
its bases. This line states the number of reads that are kept
after this filtering step.
This line gives the number of suffix tree nodes allocated during
the storage of all reads in the prefix trie.
Here SHARCGS reports the number of reads kept after the removal of
frequent reads which have no partner.
The read fraction given here is the estimated fraction of
positions from the target sequence, from which reads survived the
filtering steps. It is deduced from the estimated target sequence
length and the number of reads kept after both filtering
steps. The initial gap size to be spanned is chosen such that we
expect only one gap of that length or longer. The probability that
a gap of length g starts at a given position is f(1-f)^g, provided
the read fraction f. This probability is multiplied by the length
of the target sequence n to obtain the expected number of such
gaps. Setting this value to 1 gives the first guess for a
reasonable gap size for our algorithm:
n f (1 - f)^g = 1
log(n) + log(f) + g log(1 - f) = 0
g = -(log(n) + log(f)) / log(1 - f)
This gap size is used to initiate the assembly and increased if
needed due to open ends.
For each run of the core assembly algorithm one progress bar is
shown. After an assembly is finished, the algorithm reports the
number of contigs generated and the number of open ends
detected. If this second number is larger than the corresponding
parameter, the gap span is increased and another run of the core
assembly algorithm follows.
Finally, a summary of the generated assembly is given. It provides
simple characteristics like the number of contigs generated, the
length of the largest contig found, and the average length of
contigs.
Juliane Dohm <dohm@molgen.mpg.de>
Claudio Lottaz <Claudio.Lottaz@klinik.uni-regensburg.de>
http://sharcgs.molgen.mpg.de
# trust defaults
sharcgs.pl my.reads -o my.seq
# circular targets
sharcgs.pl -d 0 my.reads -o my.seq
# try to limit memory usage
sharcgs.pl -p 10 my.reads -o my.seq
# focus on large contigs
sharcgs.pl -l 500 my.reads -o my.seq
# tell the algorithm exactly, what to do
sharcgs.pl -r 2 -g 10 my.reads -o my.seq
# tell the algorithm, what option settings to explore
sharcgs.pl -r 2,3,5 my.reads -o my.seq