abtools.alignment: Pairwise and Multiple Sequence Alignment

abtools.alignment.mafft(sequences=None, alignment_file=None, fasta=None, fmt='fasta', threads=-1, as_file=False, print_stdout=False, print_stderr=False)

Performs multiple sequence alignment with MAFFT.

MAFFT is a required dependency.

Parameters:
  • sequences (list) –

    Sequences to be aligned. sequences can be one of four things:

    1. a FASTA-formatted string
    2. a list of BioPython SeqRecord objects
    3. a list of AbTools Sequence objects
    4. a list of lists/tuples, of the format [sequence_id, sequence]
  • alignment_file (str) – Path for the output alignment file. If not supplied, a name will be generated using tempfile.NamedTemporaryFile().
  • fasta (str) – Path to a FASTA-formatted file of sequences. Used as an alternative to sequences when suppling a FASTA file.
  • fmt (str) – Format of the alignment. Options are ‘fasta’ and ‘clustal’. Default is ‘fasta’.
  • threads (int) – Number of threads for MAFFT to use. Default is -1, which results in MAFFT using multiprocessing.cpu_count() threads.
  • as_file (bool) – If True, returns a path to the alignment file. If False, returns a BioPython MultipleSeqAlignment object (obtained by calling Bio.AlignIO.read() on the alignment file).
Returns:

Returns a BioPython MultipleSeqAlignment object, unless as_file is True,

in which case the path to the alignment file is returned.

abtools.alignment.muscle(sequences=None, alignment_file=None, fasta=None, fmt='fasta', as_file=False, maxiters=None, diags=False, gap_open=None, gap_extend=None)

Performs multiple sequence alignment with MUSCLE.

MUSCLE is a required dependency.

Parameters:
  • sequences (list) –

    Sequences to be aligned. sequences can be one of four things:

    1. a FASTA-formatted string
    2. a list of BioPython SeqRecord objects
    3. a list of AbTools Sequence objects
    4. a list of lists/tuples, of the format [sequence_id, sequence]
  • alignment_file (str) – Path for the output alignment file. If not supplied, a name will be generated using tempfile.NamedTemporaryFile().
  • fasta (str) – Path to a FASTA-formatted file of sequences. Used as an alternative to sequences when suppling a FASTA file.
  • fmt (str) – Format of the alignment. Options are ‘fasta’ and ‘clustal’. Default is ‘fasta’.
  • threads (int) – Number of threads for MAFFT to use. Default is -1, which results in MAFFT using multiprocessing.cpu_count() threads.
  • as_file (bool) – If True, returns a path to the alignment file. If False, returns a BioPython MultipleSeqAlignment object (obtained by calling Bio.AlignIO.read() on the alignment file).
  • maxiters (int) – Passed directly to MUSCLE using the -maxiters flag.
  • diags (int) – Passed directly to MUSCLE using the -diags flag.
  • gap_open (float) – Passed directly to MUSCLE using the -gapopen flag. Ignored if gap_extend is not also provided.
  • gap_extend (float) – Passed directly to MUSCLE using the -gapextend flag. Ignored if gap_open is not also provided.
Returns:

Returns a BioPython MultipleSeqAlignment object, unless as_file is True,

in which case the path to the alignment file is returned.

abtools.alignment.local_alignment(query, target=None, targets=None, match=3, mismatch=-2, gap_open=-5, gap_extend=-2, matrix=None, aa=False, gap_open_penalty=None, gap_extend_penalty=None)

Striped Smith-Waterman local pairwise alignment.

Parameters:
  • query

    Query sequence. query can be one of four things:

    1. a nucleotide or amino acid sequence, as a string
    2. a Biopython SeqRecord object
    3. an AbTools Sequence object
    4. a list/tuple of the format [seq_id, sequence]
  • target – A single target sequence. target can be anything that query accepts.
  • targets (list) – A list of target sequences, to be proccssed iteratively. Each element in the targets list can be anything accepted by query.
  • match (int) – Match score. Should be a positive integer. Default is 3.
  • mismatch (int) – Mismatch score. Should be a negative integer. Default is -2.
  • gap_open (int) – Penalty for opening gaps. Should be a negative integer. Default is -5.
  • gap_extend (int) – Penalty for extending gaps. Should be a negative integer. Default is -2.
  • matrix (str, dict) –

    Alignment scoring matrix. Two options for passing the alignment matrix:

    • The name of a built-in matrix. Current options are blosum62 and pam250.
    • A nested dictionary, giving an alignment score for each residue pair. Should be formatted such that retrieving the alignment score for A and G is accomplished by:
      matrix['A']['G']
      
  • aa (bool) – Must be set to True if aligning amino acid sequences. Default is False.
Returns:

If a single target sequence is provided (via target), a single SSWAlignment object will be returned. If multiple target sequences are supplied (via targets), a list of SSWAlignment objects will be returned.

abtools.alignment.global_alignment(query, target=None, targets=None, match=3, mismatch=-2, gap_open=-5, gap_extend=-2, score_match=None, score_mismatch=None, score_gap_open=None, score_gap_extend=None, matrix=None, aa=False)

Needleman-Wunch global pairwise alignment.

With global_alignment, you can score an alignment using different paramaters than were used to compute the alignment. This allows you to compute pure identity scores (match=1, mismatch=0) on pairs of sequences for which those alignment parameters would be unsuitable. For example:

seq1 = 'ATGCAGC'
seq2 = 'ATCAAGC'

using identity scoring params (match=1, all penalties are 0) for both alignment and scoring produces the following alignment:

ATGCA-GC
|| || ||
AT-CAAGC

with an alignment score of 6 and an alignment length of 8 (identity = 75%). But what if we want to calculate the identity of a gapless alignment? Using:

global_alignment(seq1, seq2,
                 gap_open=20,
                 score_match=1,
                 score_mismatch=0,
                 score_gap_open=10,
                 score_gap_extend=1)

we get the following alignment:

ATGCAGC
||  |||
ATCAAGC

which has an score of 5 and an alignment length of 7 (identity = 71%). Obviously, this is an overly simple example (it would be much easier to force gapless alignment by just iterating over each sequence and counting the matches), but there are several real-life cases in which different alignment and scoring paramaters are desirable.

Parameters:
  • query

    Query sequence. query can be one of four things:

    1. a nucleotide or amino acid sequence, as a string
    2. a Biopython SeqRecord object
    3. an AbTools Sequence object
    4. a list/tuple of the format [seq_id, sequence]
  • target – A single target sequence. target can be anything that query accepts.
  • targets (list) – A list of target sequences, to be proccssed iteratively. Each element in the targets list can be anything accepted by query.
  • match (int) – Match score for alignment. Should be a positive integer. Default is 3.
  • mismatch (int) – Mismatch score for alignment. Should be a negative integer. Default is -2.
  • gap_open (int) – Penalty for opening gaps in alignment. Should be a negative integer. Default is -5.
  • gap_extend (int) – Penalty for extending gaps in alignment. Should be a negative integer. Default is -2.
  • score_match (int) – Match score for scoring the alignment. Should be a positive integer. Default is to use the score from match or matrix, whichever is provided.
  • score_mismatch (int) – Mismatch score for scoring the alignment. Should be a negative integer. Default is to use the score from mismatch or matrix, whichever is provided.
  • score_gap_open (int) – Gap open penalty for scoring the alignment. Should be a negative integer. Default is to use gap_open.
  • score_gap_extend (int) – Gap extend penalty for scoring the alignment. Should be a negative integer. Default is to use gap_extend.
  • matrix (str, dict) –

    Alignment scoring matrix. Two options for passing the alignment matrix:

    • The name of a built-in matrix. Current options are blosum62 and pam250.
    • A nested dictionary, giving an alignment score for each residue pair. Should be formatted such that retrieving the alignment score for A and G is accomplished by:
      matrix['A']['G']
      
  • aa (bool) – Must be set to True if aligning amino acid sequences. Default is False.
Returns:

If a single target sequence is provided (via target), a single NWAlignment object will be returned. If multiple target sequences are supplied (via targets), a list of NWAlignment objects will be returned.

class abtools.alignment.BaseAlignment(query, target, matrix, match, mismatch, gap_open, gap_extend, aa)

Base class for local and global pairwise alignments.

Note

All comparisons between BaseAlignments are done on the score attribute (which must be implemented by any classes that subclass BaseAlignment). This was done so that sorting alignments like so:

alignments = [list of alignments]
alignments.sort(reverse=True)

results in a sorted list of alignments from the highest alignment score to the lowest.

query

Sequence – The input query sequence, as an AbTools Sequence object.

target

Sequence – The input target sequence, as an AbTools Sequence object.

target_id

str – ID of the target sequence.

raw_query

The raw query, before conversion to a Sequence.

raw_target

The raw target, before conversion to a Sequence.

class abtools.alignment.SSWAlignment(query, target, match=3, mismatch=-2, matrix=None, gap_open=5, gap_extend=2, aa=False)

Structure for performing and analyzing a Smith-Waterman local alignment.

alignment_type

str – Is ‘local’ for all SSWAlignment objects.

aligned_query

str – The aligned query sequence (including gaps).

aligned_target

str – The aligned target sequence (including gaps).

alignment_midline

str – Midline for the aligned sequences, with | indicating matches and a gap indicating mismatches:

print(aln.aligned_query)
print(aln.alignment_midline)
print(aln.aligned_target)

# ATGC
# || |
# ATCC
score

int – Alignment score.

query_begin

int – Position in the raw query sequence at which the optimal alignment begins.

query_end

int – Position in the raw query sequence at which the optimal alignment ends.

target_begin

int – Position in the raw target sequence at which the optimal alignment begins.

target_end

int – Position in the raw target sequence at which the optimal alignment ends.

class abtools.alignment.NWAlignment(query, target, match=3, mismatch=-2, gap_open=-5, gap_extend=-2, score_match=None, score_mismatch=None, score_gap_open=None, score_gap_extend=None, matrix=None, aa=False)

Structure for performing and analyzing a Needleman-Wunch global alignment.

alignment_type

str – Is ‘global’ for all NWAlignment objects.

aligned_query

str – The aligned query sequence (including gaps).

aligned_target

str – The aligned target sequence (including gaps).

alignment_midline

str – Midline for the aligned sequences, with | indicating matches and a gap indicating mismatches:

print(aln.aligned_query)
print(aln.alignment_midline)
print(aln.aligned_target)

# ATGC
# || |
# ATCC
score

int – Alignment score.

query_begin

int – Position in the raw query sequence at which the optimal alignment begins.

query_end

int – Position in the raw query sequence at which the optimal alignment ends.

target_begin

int – Position in the raw target sequence at which the optimal alignment begins.

target_end

int – Position in the raw target sequence at which the optimal alignment ends.