SHARCGS Logo
Home
Publications
Download

SHARCGS Documentation


NAME

    SHARCGS - SHort-read Assembler based on Robust Contig-extension
    for Genomic Sequencing.


SYNOPSIS

    sharcgs.pl [options] <input-file>


LICENSE

    Using this software implies acceptance of the terms of the GNU General
    Public License (see http://sharcgs.molgen.mpg.de/docs/license.txt).


DESCRIPTION

    This program generates long contigs of genomic sequence based on very
    short 25-40mer, error prone reads as produced by 2nd generation sequencing
    machines. A large number of equal-sized reads is read from the input
    file and concatenated to generate contigs of several 1000 bases
    in length. The reads may contain errors in as many as 2% of all
    base calls.
    The algorithm filters the input for reads which fulfill the
    following criteria: they are read several times and their ends are
    matched perfectly by other, overlapping reads. When quality scores
    are provided for each call, the first step of the filtering is
    modified as follows: If a read is read several times, it is stored
    only once. For each call the highest observed quality score is
    retained. If two reads are reverse complements, their corresponding
    quality scores are added up. The criterion for the first filter
    step is the smallest cumulated quality score for a read.
    The assembly algorithm assumes that only correct reads pass
    this filter despite the unreliability of the raw data. The
    algorithm then only joins reads into contigs when this is possible
    unambiguously. In order to assemble contigs covering almost the
    complete target sequence, reads from a clear majority of positions
    must pass the mentioned filters. If no read is available from a
    position, this is called 'gap' in this documentation. Actually,
    a gap can consist of several positions in a row, for which no read
    passes the filter.
    SHARCGS can perform several runs of the core assembly algorithm
    for different filter settings. When running the assembly for
    sloppy filtering, contig breaks are expected to occur frequently
    due to reads containing sequencing errors. With stringent
    filtering on the other hand, contig breaks occur because larger
    gaps must be spanned. The contig breaks, however, are not expected
    at the same positions. Therefore, SHARCGS gives the possibility to
    merge contigs from different core assembly runs.
    The final assembly is stored in a file containing one entry for
    each generated contig in FASTA format. If quality scores are
    provided as input, this output file is complemented by a
    FASTA-format file containing entries in the same order as the
    contig file. Instead of base calls, however, quality scores
    separated by spaces are reported. Quality scores are thereby
    computed by taking the maximum when several reads confirm a base
    call on the same strand. For each contig, quality scores
    corresponding to its reverse complement are also computed. The
    final quality scores are computed by adding up the corresponding
    quality scores from forward and reverse contig.
    The only mandatory argument is the input file, which must contain
    one read per line. Lines starting with a '>' sign are skipped from
    the input. All of these reads must have exactly the same
    length. Since we normally do not know, which strand has been read,
    all reads are converted to their reverse complement and added to
    the input data.


OPTIONS

    The following options can be used to fine-tune the assembly
    process by overriding the corresponding defaults given in
    parentheses:

-r <read-repetition-requirement> (default according to input data)

    Reads which are occuring in the input data less than this value
    are discarded from the assembly. Setting this value to 1 most
    probably integrates too many faulty reads into the assembly, since
    as many as 20% of all 30-nt reads contain sequencing errors, when
    the error rate per base call is 0.6%. Setting the read repetition
    requirement to values above 5 is not advisable for its wasteful
    consequences. This parameter can be set to a comma delimited list
    of values. In this case, one by one of these values is used as
    filter setting and the resulting contigs are merged into a single
    assembly. During the merge, the algorithm uses first generated
    contigs first and extends or merges them afterwards with contigs
    from subsequent runs.  It is therefore recommended to start the
    list with higher values, since contigs assembled with more
    stringent filtering are expected to be more reliable. If quality
    scores are provided by specifying a quality file, the values given
    to -r are interpreted as minimal quality scores.
    If the -r option is not specified SHARCGS chooses settings
    according to the input data. It estimates the length N of the
    target sequence from the observed read repetitions according to a
    maximum likelihood model. Then, SHARCGS chooses read repetition
    levels for which N/2 to N different reads pass the filter.

-g <gap-span> (default according to input data)

    This value sets the maximal length of a gap that can be safely
    spanned by the assembly algorithm. Setting this value too low
    leads to small contigs, because gaps that result from missing reads
    cannot be bridged. Moreover, SHARCGS cannot guarantee that
    contigs are joined correctly due to the lack of reads needed to
    detect ambiguous extensions. Setting this value too high causes
    the algorithm to detect spurious ambiguities and will also lead to
    short contigs due to resulting minimal overlaps being not
    unique. The best value for this parameter depends on the amount of
    repetitive sequence and the fraction of missing data.  It is
    actually translated into the minimal overlap we use to join a read
    to a contig.  Large gap-span values also imply a deeper prefix
    trie that needs more system memory. If the user sets this parameter
    to a specific value, the algorithm performs exactly one assembly
    using this particular gap span.  By default, the algorithm
    iterates through values between the estimated maximum gap size
    (see section OUTPUT) and the depth of the prefix trie to find the
    smallest gap span which leads to sufficiently few "open ends" (see
    below).

-q <quality-file> (default is none)

    In a quality file, quality scores for each base call in the main
    input file are specified. The read must be reported in the same
    order in both files and the quality file must conform to the
    Illumina 1G format.  For each base call, four integer values
    separated by space characters are reported. Each base call is
    separated by a tab and each read is reported on one line. Quality
    scores are interpreted as Q = -10 log10 (p/(1-p)) with p .

-d <max-dangling-ends> (default 2)

    Whenever SHARCGS stops the assembly of a contig, because no
    extension is found with a minimal overlap, we call this an open or
    dangling end. Open ends are expected at both end of a target
    sequence if the target is in linear form. If the target is
    circular, no open ends are expected, i.e. all contigs should be
    terminated due to detected ambiguities. This is the best guarantee
    that no wrong connections between contigs are generated. Thus for
    linear targets, this parameter should be set to 2, while for
    circular ones it should be set to 0.

-p <prefix-trie-depth> (default according to input data)

    All reads passing the filter are stored in a prefix trie. This
    data structure allows for very fast searching of extensions given
    a prefix. All reads are split into a prefix of minimal length and
    a suffix containing the rest of the read. The suffix is stored in
    a tree structure associated to the prefix using a hash table. In
    this manner, prefixes of any length larger than the minimal prefix
    length can be searched quickly and the corresponding suffix is
    determined easily. The prefix trie requires a large amount of
    memory. Setting this parameter to low values makes SHARCGS less
    greedy but limits it to span gaps shorter than this value.  If the
    parameter is not specified, it is set to a value such that large
    gaps can be spanned, as they are expected when reads from as many
    as 50% of target sequence positions are missing.

-l <min-contig-length> (default 50)

    The minimal length of contigs to be reported.

-o <output-file> (default contigs.seq)

    The output file where the contigs are to be stored in FASTA
    format.


OUTPUT

    The most important output, namely the collection of contigs
    assembled, is written to the output file specified with the '-o'
    parameter. In addition, the following messages generated on the
    screen may be of interest:

Read input data

    On this line the number of reads in the input file and the length
    of each individual read is given.

Add reverse complements

    In a first step of the assembly reverse complements of all reads
    are generated and stored into a single hash table. Here the number
    of unique reads from both strands is given.

Estimate sequence length/error rate

    The length of the target sequence and the error rate per base with
    which the reads are generated is estimated using a maximum
    likelihood method.  The information used is the distribution of
    how often reads are generated.  The probability that a read at a
    given position in the target sequence is read exactly x times, can
    be deduced from a binomial distribution Binom(p, k) using the
    number of reads k in the input file and the probability p that a
    read at a particular position occurs without error as
    parameters. Thereby, p depends on the error rate e, the read
    length l, and the length of the target sequence n:
    p = (1 - e)^l / n
    The number of reads which are expected to occur exacly x times in
    the input file is also deduced from a binomial distribution. This
    time the parameters are the length of the target sequence k and
    the probability to read x times without errors computed as
    mentioned above. Maximizing this probability over a grid of
    parameters deemed to be likely for a sequencing machine, we obtain
    the values given here. This method has shown to be accurate on
    simulated data but estimates are imprecise on real data.

Prefix trie depth

    This line reports, which fraction of missing reads can be handled
    and what depth of the prefix trie is needed to do so. If the -p or
    -g parameters are specified, the fraction of missing reads is
    adapted, otherwise the depth of the trie is adapted such that 50%
    missing reads can be handled.

Remove rare reads

    The first filter step removes reads which are not confirmed a
    minimal number of times or with a minimal quality score for any of
    its bases. This line states the number of reads that are kept
    after this filtering step.

Store frequent reads to prefix trie

    This line gives the number of suffix tree nodes allocated during
    the storage of all reads in the prefix trie.

Remove reads with no partners

    Here SHARCGS reports the number of reads kept after the removal of
    frequent reads which have no partner.

Guess read fraction/best gap span

    The read fraction given here is the estimated fraction of
    positions from the target sequence, from which reads survived the
    filtering steps. It is deduced from the estimated target sequence
    length and the number of reads kept after both filtering
    steps. The initial gap size to be spanned is chosen such that we
    expect only one gap of that length or longer. The probability that
    a gap of length g starts at a given position is f(1-f)^g, provided
    the read fraction f. This probability is multiplied by the length
    of the target sequence n to obtain the expected number of such
    gaps. Setting this value to 1 gives the first guess for a
    reasonable gap size for our algorithm:
    n f (1 - f)^g = 1
    log(n) + log(f) + g log(1 - f) = 0
    g = -(log(n) + log(f)) / log(1 - f)
    This gap size is used to initiate the assembly and increased if
    needed due to open ends.

Assemble contigs

    For each run of the core assembly algorithm one progress bar is
    shown. After an assembly is finished, the algorithm reports the
    number of contigs generated and the number of open ends
    detected. If this second number is larger than the corresponding
    parameter, the gap span is increased and another run of the core
    assembly algorithm follows.

Assembly statistics

    Finally, a summary of the generated assembly is given. It provides
    simple characteristics like the number of contigs generated, the
    length of the largest contig found, and the average length of
    contigs.


AUTHORS

Juliane Dohm <dohm@molgen.mpg.de>
Claudio Lottaz <Claudio.Lottaz@klinik.uni-regensburg.de>


URL

http://sharcgs.molgen.mpg.de


EXAMPLES

    # trust defaults
    sharcgs.pl my.reads -o my.seq
    # circular targets 
    sharcgs.pl -d 0 my.reads -o my.seq
    # try to limit memory usage
    sharcgs.pl -p 10 my.reads -o my.seq
    # focus on large contigs
    sharcgs.pl -l 500 my.reads -o my.seq
    # tell the algorithm exactly, what to do
    sharcgs.pl -r 2 -g 10 my.reads -o my.seq
    # tell the algorithm, what option settings to explore
    sharcgs.pl -r 2,3,5 my.reads -o my.seq

Last modified November 20, 2007 by Juliane Dohm