LASTZ   Release 1.04.15, built August 27, 2021

TABLE OF CONTENTS


Introduction

This document describes installation and usage of the LASTZ sequence alignment program. LASTZ is a drop-in replacement for BLASTZ, and is backward compatible with BLASTZ’s command-line syntax. That is, it supports all of BLASTZ’s options but also has additional ones, and may produce slightly different alignment results.

LASTZ: A tool for (1) aligning two DNA sequences, and (2) inferring appropriate scoring parameters automatically.
Platform: This package was developed on a Macintosh OS X system, but should work on other Unix or Linux platforms with little change (if any). LASTZ is written in C and compiled with gcc; other C compilers can probably be used by adjusting the Makefile. Some ancillary tools are written in Python, but only use modules available in typical python installations.
Author: Bob Harris  <rsharris at bx dot psu dot edu>
Date: August 2021
Mailing list: http://lists.bx.psu.edu/listinfo/lastz-users

Availability

LASTZ is available from github at https://github.com/lastz/lastz.

A packed archive containing source code for LASTZ is available from the Miller Lab at Penn State. Note: 1.04.15 is the last release that will be available via the Miller Lab website.


Installation

If you have received the distribution as a packed archive, unpack it by whatever means are appropriate for your computer. The result should be a directory <somepath>/lastz‑distrib‑X.XX.XX that contains a src subdirectory (and some others). You may find it convenient to remove the revision number (‑X.XX.XX) from the directory name.

Before building or installing any of the programs, you will need to tell the installer where to put the executable, either by setting the shell variable $LASTZ_INSTALL, or by editing the make‑include.mak file to set the definition of installDir. Also, be sure to add the directory you choose to your $PATH.

Then to build the LASTZ executable, enter the following commands from bash or a similar command-line shell (Solaris users should substitute gmake for make). This will build two executables (lastz and lastz_D) and copy them into your installDir.

    cd <somepath>/lastz-distrib-X.XX.XX/src
    make
    make install
The two executables are basically the same program; the only difference is that lastz uses integer scores, while lastz_D uses floating-point scores.

The build process should not report any warnings or errors. Because of this, the Makefile is set up so that warnings are considered errors and will stop the build. If you encounter this situation, you can use Makefile.warnings instead:

    make -f Makefile.warnings
This should allow the build to complete, while still reporting the warnings. You'll need to decide whether the warnings indicate something is really wrong. Usually they don't, but please report them to the author regardless.

A simple self test is included so you can test whether the build succeeded. To run it, enter the following command:

    make test
If the test is successful, you will see no output from this command. Otherwise, you will see the differences between the expected output and the output of your build, plus a line that looks like this:
    make: *** [test] Error 1

Build Options

An additional executable (lastz_32) can be built, to handle genomes larger than 2 gigabases. For details, see the section on aligning to whole genomes.

Any executable can be built to allow adjacent indels (by default, these are not allowed). For details, see the section on adjacent indels.


Overview of Processing Stages and Terminology

LASTZ is designed to preprocess one sequence or set of sequences (which we collectively call the target) and then align several query sequences to it. The general flow of the program is like a pipeline: the output of one stage is the input to the next. The user can choose to skip most stages via command-line options; any stages that are skipped pass their input along to the next stage unchanged. Two of the stages, scoring inference and interpolation, are special in that they perform a miniature version of the pipeline within them.

Note that the following discussion is a generalization, intended to describe the basic idea of LASTZ’s operation. There are many exceptions that depend on the particular options specified.

The stages are:

The usual flow is as follows (though most of these steps are optional, and some settings like ‑‑anyornone may affect the processing order). We first read the target sequence(s) into memory, and use that to build a seed word position table that will allow us to quickly map any word in the target to all of the positions where it appears. (For the purposes of this discussion you can think of a word as a 12-mer of DNA.) Then we read each query sequence in turn, processing them more or less independently. We examine the word starting at each base in the query and use the position table to find matches, called seeds, in the target. The seeds are extended to longer matches called HSPs (high-scoring segment pairs) and filtered based on score. The HSPs are chained into the highest-scoring set of syntenic alignments, and then reduced to single locations called anchors. The anchors are then extended to local alignments (which may contain gaps) and again filtered by score, followed by back-end filtering to discard alignment blocks that do not meet specified criteria for certain traits. We then interpolate, repeating the entire process at a higher sensitivity in the holes between the alignment blocks. And finally, we write out the alignment information to a file. Then these steps are repeated with the reverse complement of the query sequence, before moving on to the next sequence in the query file.

The scoring inference stage is not usually performed. Typically it is used only when sequences for new species are acquired, to create scoring files for subsequent alignments of those species.


Examples

For those eager to try it out, here are some illustrative examples to get you started. Detailed reference material begins with the next section.

Comparing a Human Chromosome and a Chicken Chromosome

It is often adequate to use a lower sensitivity level than is achieved with LASTZ’s defaults. For example, to compare two complete chromosomes, even for species as distant as human and chicken, the alignment landscape is evident even at very low sensitivity settings. This can speed up the alignment process considerably.

This example compares human chromosome 4 to chicken chromosome 4. These sequences can be found in the downloads section of the UCSC Genome Browser, and are 191 and 94 megabases long, respectively. To run a quick low-sensitivity alignment of these sequences, use a command like this:

    lastz hg18.chr4.fa galGal3.chr4.fa \
      --notransition --step=20 --nogapped \
      --format=maf > hg18_4_vs_galGal3_4.maf

This runs in about two and a half minutes on a 2-GHz workstation, requiring only 400 Mb of RAM. Figure 1(a) shows the results, plotted using the ‑‑format=rdotplot output option and the R statistical package. (When in MAF format, LASTZ output can be browsed with the GMAJ interactive viewer for multiple alignments, available from the Miller Lab at Penn State.)

Using ‑‑notransition lowers seeding sensitivity and reduces runtime (by a factor of about 10 in this case). ‑‑step=20 also lowers seeding sensitivity, reducing runtime and also reducing memory consumption (by a factor of about 3.3 in this case). ‑‑nogapped eliminates the computation of gapped alignments. The complete alignment process using default settings (shown in Figure 1(b)) uses 1.3 Gb of RAM and takes 4.5 hours on a machine running at 2.83 GHz.

Figure 1(a)

human vs. chicken: low sensitivity

lastz \
  hg18.chr4.fa galGal3.chr4.fa \
  --notransition --step=20 \
  --nogapped
Figure 1(b)

human vs. chicken: defaults

lastz \
  hg18.chr4.fa galGal3.chr4.fa

Aligning Shotgun Reads to a Human Chromosome

Short read mapping for close species requires parameters very different from LASTZ’s defaults. This example compares a simulated set of primate shotgun reads to human chromosome 21. The chromosome can be found in the downloads section of the UCSC Genome Browser (it is about 47 megabases). Ten thousand simulated reads were generated by extracting 60-bp intervals from chimp chr21, subjecting them to mild mutation (including short gaps), and then truncating them to 50 bp (these are included in the LASTZ distribution, in test_data/fake_chimp_reads.2bit).

To see where these reads map onto the human chromosome, use this command:

    lastz hg18.chr21.fa[unmask] fake_chimp_reads.2bit \
      --step=10 --seed=match12 --notransition --exact=20 --noytrim \
      --match=1,5 --ambiguous=n \
      --filter=coverage:90 --filter=identity:95 \
      --format=general:name1,start1,length1,name2,strand2 \
      > hg18_21_vs_reads.dat