This document describes installation and usage of the LASTZ sequence alignment program. LASTZ is a drop-in replacement for BLASTZ, and is backward compatible with BLASTZ’s command-line syntax. That is, it supports all of BLASTZ’s options but also has additional ones, and may produce slightly different alignment results.
| LASTZ: | A tool for (1) aligning two DNA sequences, and (2) inferring appropriate scoring parameters automatically. | |
|---|---|---|
| Platform: | This package was developed on a Macintosh OS X system, but should work on other Unix or Linux platforms with little change (if any). LASTZ is written in C and compiled with gcc; other C compilers can probably be used by adjusting the Makefile. Some ancillary tools are written in Python, but only use modules available in typical python installations. | |
| Author: | Bob Harris <rsharris at bx dot psu dot edu> | |
| Date: | August 2021 | |
| Mailing list: | http://lists.bx.psu.edu/listinfo/lastz-users |
LASTZ is available from github at https://github.com/lastz/lastz.
A packed archive containing source code for LASTZ is available from the Miller Lab at Penn State. Note: 1.04.15 is the last release that will be available via the Miller Lab website.
If you have received the distribution as a packed archive, unpack it
by whatever means are appropriate for your computer. The result should be
a directory <somepath>/lastz‑distrib‑X.XX.XX that contains
a src subdirectory (and some others). You may find it convenient
to remove the revision number (‑X.XX.XX) from the directory name.
Before building or installing any of the programs, you will need to tell the
installer where to put the executable, either by setting the shell variable
$LASTZ_INSTALL, or by editing the make‑include.mak
file to set the definition of installDir. Also, be sure to add
the directory you choose to your $PATH.
Then to build the LASTZ executable, enter the following commands from bash
or a similar command-line shell (Solaris users should substitute
gmake for make). This will build two executables
(lastz and lastz_D) and copy them into your
installDir.
cd <somepath>/lastz-distrib-X.XX.XX/src
make
make install
The two executables are basically the same program; the only difference is
that lastz uses integer scores, while lastz_D uses
floating-point scores.
The build process should not report any warnings or errors. Because of this, the Makefile is set up so that warnings are considered errors and will stop the build. If you encounter this situation, you can use Makefile.warnings instead:
make -f Makefile.warnings
This should allow the build to complete, while still reporting the warnings.
You'll need to decide whether the warnings indicate something is really wrong.
Usually they don't, but please report them to the author regardless.
A simple self test is included so you can test whether the build succeeded. To run it, enter the following command:
make test
If the test is successful, you will see no output from this command.
Otherwise, you will see the differences between the expected output and the
output of your build, plus a line that looks like this:
make: *** [test] Error 1
An additional executable (lastz_32) can be built, to handle
genomes larger than 2 gigabases. For details, see the section on
aligning to whole genomes.
Any executable can be built to allow adjacent indels (by default, these are not allowed). For details, see the section on adjacent indels.
LASTZ is designed to preprocess one sequence or set of sequences (which we collectively call the target) and then align several query sequences to it. The general flow of the program is like a pipeline: the output of one stage is the input to the next. The user can choose to skip most stages via command-line options; any stages that are skipped pass their input along to the next stage unchanged. Two of the stages, scoring inference and interpolation, are special in that they perform a miniature version of the pipeline within them.
Note that the following discussion is a generalization, intended to describe the basic idea of LASTZ’s operation. There are many exceptions that depend on the particular options specified.
The stages are:
The usual flow is as follows (though most of these steps are optional,
and some settings like ‑‑anyornone
may affect the processing order).
We first read the target sequence(s) into memory, and use that to build a seed
word position table that will allow us to quickly map any word in the target to
all of the positions where it appears. (For the purposes of this discussion
you can think of a word as a 12-mer of DNA.) Then we read each
query sequence in turn, processing them more or less independently. We examine
the word starting at each base in the query and use the position table to find
matches, called seeds, in the target. The seeds are extended to
longer matches called HSPs (high-scoring segment pairs) and filtered
based on score. The HSPs are chained into the highest-scoring set of syntenic
alignments, and then reduced to single locations called anchors.
The anchors are then extended to local alignments (which may contain
gaps) and again filtered by score, followed by back-end filtering to discard
alignment blocks that do not meet specified criteria for certain traits. We
then interpolate, repeating the entire process at a higher sensitivity in the
holes between the alignment blocks. And finally, we write out the alignment
information to a file. Then these steps are repeated with the reverse
complement of the query sequence, before moving on to the next sequence in the
query file.
The scoring inference stage is not usually performed. Typically it is used only when sequences for new species are acquired, to create scoring files for subsequent alignments of those species.
For those eager to try it out, here are some illustrative examples to get you started. Detailed reference material begins with the next section.
It is often adequate to use a lower sensitivity level than is achieved with LASTZ’s defaults. For example, to compare two complete chromosomes, even for species as distant as human and chicken, the alignment landscape is evident even at very low sensitivity settings. This can speed up the alignment process considerably.
This example compares human chromosome 4 to chicken chromosome 4. These sequences can be found in the downloads section of the UCSC Genome Browser, and are 191 and 94 megabases long, respectively. To run a quick low-sensitivity alignment of these sequences, use a command like this:
lastz hg18.chr4.fa galGal3.chr4.fa \
--notransition --step=20 --nogapped \
--format=maf > hg18_4_vs_galGal3_4.maf
This runs in about two and a half minutes on a 2-GHz workstation, requiring
only 400 Mb of RAM. Figure 1(a) shows the results, plotted using the
‑‑format=rdotplot output option and
the R statistical package.
(When in MAF format, LASTZ output can be browsed with
the GMAJ interactive viewer for multiple alignments, available from the
Miller Lab at Penn State.)
Using ‑‑notransition lowers
seeding sensitivity and reduces runtime (by a factor of about 10 in this case).
‑‑step=20 also lowers seeding
sensitivity, reducing runtime and also reducing memory consumption (by a factor
of about 3.3 in this case).
‑‑nogapped eliminates the
computation of gapped alignments. The complete alignment process using default
settings (shown in Figure 1(b)) uses 1.3 Gb of RAM and takes 4.5 hours on a
machine running at 2.83 GHz.
|
Figure 1(a)
|
Figure 1(b)
|
Short read mapping for close species requires parameters very different from
LASTZ’s defaults. This example compares a simulated set of primate shotgun
reads to human chromosome 21. The chromosome can be found in the downloads
section of the UCSC Genome Browser
(it is about 47 megabases). Ten thousand simulated reads were generated by
extracting 60-bp intervals from chimp chr21, subjecting them to mild mutation
(including short gaps), and then truncating them to 50 bp (these are included
in the LASTZ distribution, in test_data/fake_chimp_reads.2bit).
To see where these reads map onto the human chromosome, use this command:
lastz hg18.chr21.fa[unmask] fake_chimp_reads.2bit \
--step=10 --seed=match12 --notransition --exact=20 --noytrim \
--match=1,5 --ambiguous=n \
--filter=coverage:90 --filter=identity:95 \
--format=general:name1,start1,length1,name2,strand2 \
> hg18_21_vs_reads.dat