abtools.sequence: Sequence utilities
class abtools.sequence.Sequence(seq, id=None, qual=None, id_key='seq_id', seq_key='vdj_nt')
Container for biological (RNA and DNA) sequences.
seq can be one of several things:
- a raw sequence, as a string
- an iterable, formatted as [seq_id, sequence]
- a dict, containing at least the ID (default key = 'seq_id') and a sequence (default key = 'vdj_nt'). Alternate id_key and seq_key can be provided at instantiation.
- a Biopython SeqRecord object
- an AbTools Sequence object
If seq is provided as a string, the sequence ID can optionally be provided via id. If seq is a string and id is not provided, a random sequence ID will be generated with uuid.uuid4().

Quality scores can be supplied with qual or as part of a SeqRecord object. If both a SeqRecord object with quality scores and quality scores via qual are provided, the qual scores will override the SeqRecord quality scores.

If seq is a dictionary, typically the result of a MongoDB query, the dictionary can be accessed directly from the Sequence instance. To retrieve the value for 'junc_aa' in the instantiating dictionary:

    s = Sequence(d)  # d is the instantiating dictionary
    junc = s['junc_aa']
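The input handling described above can be sketched without AbTools installed. The helper below is hypothetical (`normalize_input` is not part of the library) and covers only the string, list, and dict forms. Storing the generated UUID as a string is an assumption.

```python
import uuid


def normalize_input(seq, id=None, id_key='seq_id', seq_key='vdj_nt'):
    """Hypothetical helper: return (id, sequence) for some accepted inputs."""
    if isinstance(seq, str):
        # Raw string: use the supplied ID, or fall back to a random UUID.
        # (Keeping the UUID as a string is an assumption of this sketch.)
        seq_id = id if id is not None else str(uuid.uuid4())
        return seq_id, seq
    if isinstance(seq, dict):
        # Dict: read the ID and sequence from the configured keys.
        return seq[id_key], seq[seq_key]
    if isinstance(seq, (list, tuple)):
        # Iterable of the form [seq_id, sequence].
        seq_id, sequence = seq
        return seq_id, sequence
    raise TypeError('unsupported input type: {}'.format(type(seq).__name__))


print(normalize_input('ATGC', id='seq1'))
print(normalize_input(['seq2', 'GGCC']))
print(normalize_input({'seq_id': 'seq3', 'vdj_nt': 'TTAA'}))
```

SeqRecord and Sequence inputs are left out to keep the sketch free of external dependencies.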
If seq is a dictionary, optional id_key and seq_key arguments can be provided, which tell the Sequence object which fields to use to populate Sequence.id and Sequence.sequence. Defaults are id_key='seq_id' and seq_key='vdj_nt'.

Alternately, the __getitem__() interface can be used to obtain a slice from the sequence attribute. An example of the distinction:

    d = {'name': 'MySequence', 'sequence': 'ATGC'}
    seq = Sequence(d, id_key='name', seq_key='sequence')
    seq['name']  # 'MySequence'
    seq[:2]      # 'AT'
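One way such dual-purpose indexing could work is to dispatch on the type of the key. The class below is a hypothetical stand-in (`DictBackedSequence` is not the AbTools implementation) that assumes slices go to the sequence string and every other key goes to the instantiating dictionary.

```python
class DictBackedSequence:
    """Hypothetical stand-in illustrating key-type dispatch in __getitem__."""

    def __init__(self, data, id_key='seq_id', seq_key='vdj_nt'):
        self._data = data
        self.id = data[id_key]
        self.sequence = data[seq_key]

    def __getitem__(self, key):
        # A slice is applied to the sequence string...
        if isinstance(key, slice):
            return self.sequence[key]
        # ...while any other key is looked up in the instantiating dict.
        return self._data[key]


d = {'name': 'MySequence', 'sequence': 'ATGC'}
seq = DictBackedSequence(d, id_key='name', seq_key='sequence')
print(seq['name'])
print(seq[:2])
```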
If the Sequence is instantiated with a dictionary, calls to __contains__() will return True if the supplied item is a key in the dictionary. In non-dict instantiations, __contains__() will look in the Sequence.sequence field directly (essentially a motif search). For example:

    dict_seq = Sequence({'seq_id': 'seq1', 'vdj_nt': 'ACGT'})
    'seq_id' in dict_seq  # True
    'ACG' in dict_seq     # False

    str_seq = Sequence('ACGT', id='seq1')
    'seq_id' in str_seq   # False
    'ACG' in str_seq      # True
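The membership rule above can be reproduced with a small stand-in class. `ContainsSketch` is hypothetical and only mirrors the documented behavior: key lookup for dict instantiations, substring search otherwise.

```python
class ContainsSketch:
    """Hypothetical stand-in illustrating the documented __contains__ rule."""

    def __init__(self, seq, id=None, id_key='seq_id', seq_key='vdj_nt'):
        if isinstance(seq, dict):
            self._data = seq
            self.id = seq[id_key]
            self.sequence = seq[seq_key]
        else:
            self._data = None
            self.id = id
            self.sequence = seq

    def __contains__(self, item):
        # Dict instantiation: test for a key in the instantiating dict.
        if self._data is not None:
            return item in self._data
        # Otherwise: substring (motif) search in the sequence itself.
        return item in self.sequence


dict_seq = ContainsSketch({'seq_id': 'seq1', 'vdj_nt': 'ACGT'})
str_seq = ContainsSketch('ACGT', id='seq1')
print('seq_id' in dict_seq, 'ACG' in dict_seq)
print('seq_id' in str_seq, 'ACG' in str_seq)
```

A practical consequence is that a motif search on a dict-instantiated object has to target the sequence attribute explicitly, for example `'ACG' in dict_seq.sequence`.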
Note
When comparing Sequence objects, they are considered equal only if their sequences and IDs are identical. This means that two Sequence objects with identical sequences but without user-supplied IDs won't be equal, because their IDs will have been randomly generated.
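The equality rule in the note can be illustrated with a hypothetical stand-in (`EqualitySketch` is not the AbTools class). It assumes IDs default to a random UUID string and that equality compares the ID and the sequence together.

```python
import uuid


class EqualitySketch:
    """Hypothetical stand-in: equal only if both ID and sequence match."""

    def __init__(self, sequence, id=None):
        self.sequence = sequence
        # Assumption: a missing ID is replaced by a random UUID string.
        self.id = id if id is not None else str(uuid.uuid4())

    def __eq__(self, other):
        if not isinstance(other, EqualitySketch):
            return NotImplemented
        return (self.id, self.sequence) == (other.id, other.sequence)


same = EqualitySketch('ACGT', id='seq1') == EqualitySketch('ACGT', id='seq1')
random_ids = EqualitySketch('ACGT') == EqualitySketch('ACGT')
print(same, random_ids)
```

To compare only the nucleotide content, compare the sequence attributes directly rather than the objects.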
fasta
str – Returns the sequence as a FASTA-formatted string.
Note: The FASTA string is built using Sequence.id and Sequence.sequence.
fastq
str – Returns the sequence as a FASTQ-formatted string.
If Sequence.qual is None, then None will be returned instead of a FASTQ string.
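For readers unfamiliar with the two formats, the sketch below builds minimal FASTA and FASTQ records from an ID, a sequence, and a quality string. The function names are hypothetical. The exact layout AbTools produces (for example, whether long sequences are line-wrapped) is not stated here, so a single-line record is assumed.

```python
def to_fasta(seq_id, sequence):
    """Hypothetical helper: single-line FASTA record (no line wrapping)."""
    return '>{}\n{}'.format(seq_id, sequence)


def to_fastq(seq_id, sequence, qual):
    """Hypothetical helper: four-line FASTQ record, or None without quality."""
    if qual is None:
        # Mirrors the documented behavior: no quality scores, no FASTQ string.
        return None
    # Assumes qual is an already-encoded quality string, one character per base.
    return '@{}\n{}\n+\n{}'.format(seq_id, sequence, qual)


print(to_fasta('seq1', 'ATGC'))
print(to_fastq('seq1', 'ATGC', 'IIII'))
print(to_fastq('seq1', 'ATGC', None))
```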
reverse_complement
str – Returns the reverse complement of Sequence.sequence.
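As a reminder of what a reverse complement is, the sketch below complements each base and then reverses the string. It is a generic DNA illustration, not the AbTools implementation. How the library treats RNA or ambiguity codes is not covered, so this sketch simply leaves any other character unchanged.

```python
# Translation table for the four DNA bases, in upper and lower case.
COMPLEMENT = str.maketrans('ACGTacgt', 'TGCAtgca')


def reverse_complement(sequence):
    """Complement each base, then reverse; other characters pass through."""
    return sequence.translate(COMPLEMENT)[::-1]


print(reverse_complement('ATGC'))  # GCAT
print(reverse_complement('AACG'))  # CGTT
```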
region(start=0, end=None)
Returns a region of Sequence.sequence, in FASTA format.
If called without kwargs, the entire sequence will be returned.
Parameters:
- start (int) – Start position of the region to be returned. Default is 0.
- end (int) – End position of the region to be returned. Negative values will function as they do when slicing strings.
Returns: A region of Sequence.sequence, in FASTA format
Return type: str
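The slicing semantics of region() can be approximated with ordinary string slicing. The function below is a hypothetical free-standing version that takes the ID and sequence as arguments. It assumes that end=None means "through the end of the sequence" and that the FASTA header is just the sequence ID.

```python
def region(seq_id, sequence, start=0, end=None):
    """Hypothetical helper: FASTA string for sequence[start:end]."""
    if end is None:
        # Assumption: no end given means slice through the last base.
        end = len(sequence)
    # Negative values behave as in normal Python string slicing.
    return '>{}\n{}'.format(seq_id, sequence[start:end])


print(region('seq1', 'ATGCATGC'))
print(region('seq1', 'ATGCATGC', start=2, end=-2))
```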