abtools.alignment: Pairwise and Multiple Sequence Alignment¶
-
abtools.alignment.mafft(sequences=None, alignment_file=None, fasta=None, fmt='fasta', threads=-1, as_file=False, print_stdout=False, print_stderr=False)¶ Performs multiple sequence alignment with MAFFT.
MAFFT is a required dependency.
Parameters: - sequences (list) –
Sequences to be aligned.
sequencescan be one of four things:- a FASTA-formatted string
- a list of BioPython
SeqRecordobjects - a list of AbTools
Sequenceobjects - a list of lists/tuples, of the format
[sequence_id, sequence]
- alignment_file (str) – Path for the output alignment file. If not supplied,
a name will be generated using
tempfile.NamedTemporaryFile(). - fasta (str) – Path to a FASTA-formatted file of sequences. Used as an
alternative to
sequenceswhen suppling a FASTA file. - fmt (str) – Format of the alignment. Options are ‘fasta’ and ‘clustal’. Default is ‘fasta’.
- threads (int) – Number of threads for MAFFT to use. Default is
-1, which results in MAFFT usingmultiprocessing.cpu_count()threads. - as_file (bool) – If
True, returns a path to the alignment file. IfFalse, returns a BioPythonMultipleSeqAlignmentobject (obtained by callingBio.AlignIO.read()on the alignment file).
Returns: - Returns a BioPython
MultipleSeqAlignmentobject, unlessas_fileisTrue, in which case the path to the alignment file is returned.
- sequences (list) –
-
abtools.alignment.muscle(sequences=None, alignment_file=None, fasta=None, fmt='fasta', as_file=False, maxiters=None, diags=False, gap_open=None, gap_extend=None)¶ Performs multiple sequence alignment with MUSCLE.
MUSCLE is a required dependency.
Parameters: - sequences (list) –
Sequences to be aligned.
sequencescan be one of four things:- a FASTA-formatted string
- a list of BioPython
SeqRecordobjects - a list of AbTools
Sequenceobjects - a list of lists/tuples, of the format
[sequence_id, sequence]
- alignment_file (str) – Path for the output alignment file. If not supplied,
a name will be generated using
tempfile.NamedTemporaryFile(). - fasta (str) – Path to a FASTA-formatted file of sequences. Used as an
alternative to
sequenceswhen suppling a FASTA file. - fmt (str) – Format of the alignment. Options are ‘fasta’ and ‘clustal’. Default is ‘fasta’.
- threads (int) – Number of threads for MAFFT to use. Default is
-1, which results in MAFFT usingmultiprocessing.cpu_count()threads. - as_file (bool) – If
True, returns a path to the alignment file. IfFalse, returns a BioPythonMultipleSeqAlignmentobject (obtained by callingBio.AlignIO.read()on the alignment file). - maxiters (int) – Passed directly to MUSCLE using the
-maxitersflag. - diags (int) – Passed directly to MUSCLE using the
-diagsflag. - gap_open (float) – Passed directly to MUSCLE using the
-gapopenflag. Ignored ifgap_extendis not also provided. - gap_extend (float) – Passed directly to MUSCLE using the
-gapextendflag. Ignored ifgap_openis not also provided.
Returns: - Returns a BioPython
MultipleSeqAlignmentobject, unlessas_fileisTrue, in which case the path to the alignment file is returned.
- sequences (list) –
-
abtools.alignment.local_alignment(query, target=None, targets=None, match=3, mismatch=-2, gap_open=-5, gap_extend=-2, matrix=None, aa=False, gap_open_penalty=None, gap_extend_penalty=None)¶ Striped Smith-Waterman local pairwise alignment.
Parameters: - query –
Query sequence.
querycan be one of four things:- a nucleotide or amino acid sequence, as a string
- a Biopython
SeqRecordobject - an AbTools
Sequenceobject - a list/tuple of the format
[seq_id, sequence]
- target – A single target sequence.
targetcan be anything thatqueryaccepts. - targets (list) – A list of target sequences, to be proccssed iteratively.
Each element in the
targetslist can be anything accepted byquery. - match (int) – Match score. Should be a positive integer. Default is 3.
- mismatch (int) – Mismatch score. Should be a negative integer. Default is -2.
- gap_open (int) – Penalty for opening gaps. Should be a negative integer. Default is -5.
- gap_extend (int) – Penalty for extending gaps. Should be a negative integer. Default is -2.
- matrix (str, dict) –
Alignment scoring matrix. Two options for passing the alignment matrix:
- The name of a built-in matrix. Current options are
blosum62andpam250. - A nested dictionary, giving an alignment score for each residue pair. Should be formatted
such that retrieving the alignment score for A and G is accomplished by:
matrix['A']['G']
- The name of a built-in matrix. Current options are
- aa (bool) – Must be set to
Trueif aligning amino acid sequences. Default isFalse.
Returns: If a single target sequence is provided (via
target), a singleSSWAlignmentobject will be returned. If multiple target sequences are supplied (viatargets), a list ofSSWAlignmentobjects will be returned.- query –
-
abtools.alignment.global_alignment(query, target=None, targets=None, match=3, mismatch=-2, gap_open=-5, gap_extend=-2, score_match=None, score_mismatch=None, score_gap_open=None, score_gap_extend=None, matrix=None, aa=False)¶ Needleman-Wunch global pairwise alignment.
With
global_alignment, you can score an alignment using different paramaters than were used to compute the alignment. This allows you to compute pure identity scores (match=1, mismatch=0) on pairs of sequences for which those alignment parameters would be unsuitable. For example:seq1 = 'ATGCAGC' seq2 = 'ATCAAGC'
using identity scoring params (match=1, all penalties are 0) for both alignment and scoring produces the following alignment:
ATGCA-GC || || || AT-CAAGC
with an alignment score of 6 and an alignment length of 8 (identity = 75%). But what if we want to calculate the identity of a gapless alignment? Using:
global_alignment(seq1, seq2, gap_open=20, score_match=1, score_mismatch=0, score_gap_open=10, score_gap_extend=1)
we get the following alignment:
ATGCAGC || ||| ATCAAGC
which has an score of 5 and an alignment length of 7 (identity = 71%). Obviously, this is an overly simple example (it would be much easier to force gapless alignment by just iterating over each sequence and counting the matches), but there are several real-life cases in which different alignment and scoring paramaters are desirable.
Parameters: - query –
Query sequence.
querycan be one of four things:- a nucleotide or amino acid sequence, as a string
- a Biopython
SeqRecordobject - an AbTools
Sequenceobject - a list/tuple of the format
[seq_id, sequence]
- target – A single target sequence.
targetcan be anything thatqueryaccepts. - targets (list) – A list of target sequences, to be proccssed iteratively.
Each element in the
targetslist can be anything accepted byquery. - match (int) – Match score for alignment. Should be a positive integer. Default is 3.
- mismatch (int) – Mismatch score for alignment. Should be a negative integer. Default is -2.
- gap_open (int) – Penalty for opening gaps in alignment. Should be a negative integer. Default is -5.
- gap_extend (int) – Penalty for extending gaps in alignment. Should be a negative integer. Default is -2.
- score_match (int) – Match score for scoring the alignment. Should be a positive integer.
Default is to use the score from
matchormatrix, whichever is provided. - score_mismatch (int) – Mismatch score for scoring the alignment. Should be a negative
integer. Default is to use the score from
mismatchormatrix, whichever is provided. - score_gap_open (int) – Gap open penalty for scoring the alignment. Should be a negative
integer. Default is to use
gap_open. - score_gap_extend (int) – Gap extend penalty for scoring the alignment. Should be a negative
integer. Default is to use
gap_extend. - matrix (str, dict) –
Alignment scoring matrix. Two options for passing the alignment matrix:
- The name of a built-in matrix. Current options are
blosum62andpam250. - A nested dictionary, giving an alignment score for each residue pair. Should be
formatted such that retrieving the alignment score for A and G is accomplished by:
matrix['A']['G']
- The name of a built-in matrix. Current options are
- aa (bool) – Must be set to
Trueif aligning amino acid sequences. Default isFalse.
Returns: If a single target sequence is provided (via
target), a singleNWAlignmentobject will be returned. If multiple target sequences are supplied (viatargets), a list ofNWAlignmentobjects will be returned.- query –
-
class
abtools.alignment.BaseAlignment(query, target, matrix, match, mismatch, gap_open, gap_extend, aa)¶ Base class for local and global pairwise alignments.
Note
All comparisons between
BaseAlignmentsare done on thescoreattribute (which must be implemented by any classes that subclassBaseAlignment). This was done so that sorting alignments like so:alignments = [list of alignments] alignments.sort(reverse=True)
results in a sorted list of alignments from the highest alignment score to the lowest.
-
query¶ Sequence – The input query sequence, as an AbTools
Sequenceobject.
-
target¶ Sequence – The input target sequence, as an AbTools
Sequenceobject.
-
target_id¶ str – ID of the target sequence.
-
raw_query¶ The raw query, before conversion to a
Sequence.
-
raw_target¶ The raw target, before conversion to a
Sequence.
-
-
class
abtools.alignment.SSWAlignment(query, target, match=3, mismatch=-2, matrix=None, gap_open=5, gap_extend=2, aa=False)¶ Structure for performing and analyzing a Smith-Waterman local alignment.
-
alignment_type¶ str – Is ‘local’ for all
SSWAlignmentobjects.
-
aligned_query¶ str – The aligned query sequence (including gaps).
-
aligned_target¶ str – The aligned target sequence (including gaps).
-
alignment_midline¶ str – Midline for the aligned sequences, with
|indicating matches and a gap indicating mismatches:print(aln.aligned_query) print(aln.alignment_midline) print(aln.aligned_target) # ATGC # || | # ATCC
-
score¶ int – Alignment score.
-
query_begin¶ int – Position in the raw query sequence at which the optimal alignment begins.
-
query_end¶ int – Position in the raw query sequence at which the optimal alignment ends.
-
target_begin¶ int – Position in the raw target sequence at which the optimal alignment begins.
-
target_end¶ int – Position in the raw target sequence at which the optimal alignment ends.
-
-
class
abtools.alignment.NWAlignment(query, target, match=3, mismatch=-2, gap_open=-5, gap_extend=-2, score_match=None, score_mismatch=None, score_gap_open=None, score_gap_extend=None, matrix=None, aa=False)¶ Structure for performing and analyzing a Needleman-Wunch global alignment.
-
alignment_type¶ str – Is ‘global’ for all
NWAlignmentobjects.
-
aligned_query¶ str – The aligned query sequence (including gaps).
-
aligned_target¶ str – The aligned target sequence (including gaps).
-
alignment_midline¶ str – Midline for the aligned sequences, with
|indicating matches and a gap indicating mismatches:print(aln.aligned_query) print(aln.alignment_midline) print(aln.aligned_target) # ATGC # || | # ATCC
-
score¶ int – Alignment score.
-
query_begin¶ int – Position in the raw query sequence at which the optimal alignment begins.
-
query_end¶ int – Position in the raw query sequence at which the optimal alignment ends.
-
target_begin¶ int – Position in the raw target sequence at which the optimal alignment begins.
-
target_end¶ int – Position in the raw target sequence at which the optimal alignment ends.
-