TABLE OF CONTENTS
LAV is a plain-text file format for alignments of two DNA sequences. It is the only output format produced by the BLASTZ alignment program (though often converted to AXT format by post-processing programs), and is the default output format for BLASTZ's successor, LASTZ.
The alignment blocks are grouped by sequence (e.g. chromosome, scaffold, contig, cDNA read, shotgun sequencing read, etc.) and strand, and described by listing the coordinates of the gap-free aligning segments in each block. This format is compact because it does not include the nucleotides, but the tradeoff is that interpretation usually requires access to the original sequence files, and it is not easy for humans to read.
Here's a typical LAV file:
#:lav d { "lastz.v0.3 malus.fa aurantium.fa C=2 W=8 T=0 A C G T 91 -114 -31 -123 -114 100 -125 -31 -31 -125 100 -114 -123 -31 -114 91 O = 400, E = 30, K = 3000, L = 3000, M = 0" } #:lav s { "malus.fa" 1 191411218 0 1 "aurantium.fa" 1 90634903 0 1 } h { "> apple" "> orange" } a { s 20643 b 46566766 2083211 e 46567353 2083795 l 46566766 2083211 46566796 2083241 61 l 46566797 2083245 46566814 2083262 78 l 46566821 2083263 46567353 2083795 65 } a { s 4233 b 47246530 10635696 e 47246660 10635826 l 47246530 10635696 47246660 10635826 63 } ... many more a-stanzas ... #:lav s { "malus.fa" 1 191411218 0 1 "aurantium.fa-" 1 90634903 1 1 } h { "> apple" "> orange (reverse complement)" } a { s 13897 b 1005819 5352698 e 1006099 5352978 l 1005819 5352698 1006099 5352978 74 } ... many more a-stanzas ... #:eof
An LAV file primarily consists of a series of "stanzas", each of which
is a single letter code followed by a brace-enclosed block. There are
also #:lav
lines which break the file into sections, and
one #:eof
line indicating the end of the file. Programs
that read LAV format should consider the file bad if the
#:eof
is missing (or if anything appears after it).
The d-stanza is intended to document the program and parameters used to create the file. Programs reading the file normally treat this as a comment, but it is possible to extract the scoring parameters for further processing.
An s-stanza describes the sequences used for the subsequent alignment records (a-stanzas). It contains exactly two lines in the following format.
"<filename>[-]" <start> <stop> [<rev_comp_flag> <sequence_number>]
Here <start>
and <stop>
are
origin 1 (i.e. the first base in the original given sequence is called
"1") and inclusive (both endpoints are included in the interval).
Usually <start>
is 1 and <stop>
is the full length of the given sequence, however they can specify any
subsequence (e.g. if the alignment program was instructed to use only
part of the original sequence).
<rev_comp_flag>
is 1 if the sequence was
reverse-complemented before aligning, or 0 otherwise. Usually the
first sequence will have a 0 here, since most alignment programs only
ever reverse-complement the second one. If this flag is 1, the
<filename>
will also have a -
appended
to it; programs that read LAV format should report an error if these
two indicators are contradictory. Note that even when this flag
indicates reverse-complement, the <start>
and
<stop>
endpoints are still relative to the original
orientation, and <start>
is less than
<stop>
. That is, conceptually the alignment program
extracts the requested sequence fragment first, then reverse-complements
it (if applicable), and finally tries to align it.
<sequence_number>
is useful when the second file
contains multiple sequences. The first sequence is 1, the second is 2, and so
on. Most programs that write and read LAV format do not allow the first file to
contain multiple sequences, so in these cases the sequence number for the first
file is always 1 (though the format itself does not require this). Note that
<start>
and <stop>
are relative to each
sequence, not to the entire file.
The <rev_comp_flag>
and
<sequence_number>
are shown here as optional because
early versions of this format did not include them.
Usually an s-stanza is followed immediately by an h-stanza, which
provides a name for each of the two sequences, typically obtained
from the FASTA header line. (Before the s-stanza's
<sequence_number>
field was introduced, this was
the only way to identify which sequence from a multi-sequence file was
aligned.) A (reverse complement)
suffix is appended when
applicable; again, programs should report an error if this contradicts
the other indicators.
An a-stanza describes a single alignment block, sometimes called a
"local alignment", which typically includes gaps due to small insertions
and deletions in the aligned sequences. In the example below, the
s
, b
, and
e
lines indicate that the block has a score
of 13916 and an overall range of 4886..5171 in sequence 1 and 21292..21537
in sequence 2.
The l
lines describe the block's gap-free segments, with the
final field representing the percentage of matching bases in each segment.
In this example the alignment starts with a segment from 4886..4899 in
sequence 1 and from 21292..21305 in sequence 2, having a percent identity
of 79%. Note that the segment length must be the same in both sequences
(14 basepairs for this segment). The next segment starts at 4900 and 21308
in sequences 1 and 2, respectively, indicating a two-base gap in sequence 1
(corresponding to positions 21306 and 21307 in sequence 2).
a { s 13916 b 4886 21292 e 5171 21537 l 4886 21292 4899 21305 79 l 4900 21308 4924 21332 92 l 4925 21334 5024 21433 88 l 5027 21434 5040 21447 100 l 5086 21448 5117 21479 84 l 5118 21484 5171 21537 87 }
Coordinates in an a-stanza are origin 1 and inclusive, and are relative to the subsequences indicated in the most recent s-stanza. In the example below the alignment is of apple 1333..1444 to orange 2777..2888.
s { "malus.fa" 1001 2000 0 1 "aurantium.fa" 2001 5000 0 1 } ... a { s 7321 b 333 777 e 444 888 l 333 777 444 888 62 }
If a sequence is reverse-complemented, then the coordinates are relative
to the reverse complement, so they are counted back from the end of
the subsequence. Thus the example below represents an alignment of apple
1333..1444 to the reverse complement of orange 4113..4224. In detail:
the s-stanza indicates that the first sequence from aurantium.fa should be
used, and its subsequence from 2001..5000 should be extracted and then
reverse complemented before aligning with apple. In this 3000 bp
reverse-complemented subsequence, the first base corresponds to position
5000 in the original sequence, the second to position 4999, and so on to
the last (3000th) base, which corresponds to position 2001. Thus the
conversion formula is p = 5000 - (r - 1)
, where p
is the position in the original sequence, and r is the position in the
reverse-complemented subsequence. Within the reverse-complemented
subsequence, the alignment is at 777..888. The starting point, 777, is the
nucleotide 776 bp back from 5000, or 4224, while the ending point, 888, is
887 bp back from 5000, or 4113.
s { "malus.fa" 1001 2000 0 1 "aurantium.fa-" 2001 5000 1 1 } ... a { s 7321 b 333 777 e 444 888 l 333 777 444 888 62 }
The fifth numeric field in an a-stanza's l
line is the
percentage of bases in the aligned segment that match (often called the
"percent identity" or "percent id"). This is used by viewer tools such as
Laj and
PipMaker.
An LAV file may also contain x- and m-stanzas describing dynamic masking. Each section will contain an x-stanza that looks like the one below. The count is the number of bases newly masked as a result of processing the latest query sequence; it does not include bases previously masked.
x { n <count> }
A single m-stanza listing the masked regions is then included in the final
section, and looks like the one below. <start>
and
<stop>
are origin 1 and inclusive, and are relative
to the first subsequence indicated in the most recent s-stanza.
m { x <start> <end> x <start> <end> ... n <count> }
Dynamic masking is invoked by the
‑‑masking=<count>
option in LASTZ, or
the M=<count>
option in BLASTZ. For more information
about these options, please see the
LASTZ documentation.
In LASTZ, the ‑‑census
option will produce a
Census-stanza. The first field in each line (1
,
2
, 3
, …) is a
position in the target (sequence 1). The count indicates the number of
times the corresponding base appears in an alignment.
Census { 1 <count> 2 <count> ... }