abtools.sequence: Sequence utilities

class abtools.sequence.Sequence(seq, id=None, qual=None, id_key='seq_id', seq_key='vdj_nt')

Container for biological (RNA and DNA) sequences.

seq can be one of several things:

  1. a raw sequence, as a string
  2. an iterable, formatted as [seq_id, sequence]
  3. a dict, containing at least the ID (default key = 'seq_id') and a sequence (default key = 'vdj_nt'). Alternate id_key and seq_key values can be provided at instantiation.
  4. a Biopython SeqRecord object
  5. an AbTools Sequence object

If seq is provided as a string, the sequence ID can optionally be provided via id. If seq is a string and id is not provided, a random sequence ID will be generated with uuid.uuid4().
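The ID fallback described above can be sketched as follows. resolve_id is a hypothetical helper, not part of AbTools, and storing the generated ID as a string is an assumption; the documentation only states that uuid.uuid4() is used.

```python
import uuid


def resolve_id(seq_id=None):
    """Return the supplied ID, or a random UUID4-based ID if none was given.

    Hypothetical helper for illustration only; not the AbTools implementation.
    """
    # Assumption: the random ID is kept as the string form of the UUID.
    return seq_id if seq_id is not None else str(uuid.uuid4())


named = resolve_id('seq1')  # the user-supplied ID is kept as-is
anon_a = resolve_id()       # random ID
anon_b = resolve_id()       # a different random ID
```

Because each call to uuid.uuid4() produces a new value, two sequences created without an explicit ID receive different IDs.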

Quality scores can be supplied with qual or as part of a SeqRecord object. If both a SeqRecord object with quality scores and qual are provided, the qual scores override the SeqRecord quality scores.

If seq is a dictionary, typically the result of a MongoDB query, the dictionary can be accessed directly from the Sequence instance. To retrieve the value for 'junc_aa' in the instantiating dictionary:

s = Sequence(d)  # d is the instantiating dictionary
junc = s['junc_aa']

If seq is a dictionary, optional id_key and seq_key arguments can be provided, which tell the Sequence object which fields to use to populate Sequence.id and Sequence.sequence. Defaults are id_key='seq_id' and seq_key='vdj_nt'.

Alternatively, the __getitem__() interface can be used to obtain a slice of the sequence attribute. An example of the distinction:

d = {'name': 'MySequence', 'sequence': 'ATGC'}
seq = Sequence(d, id_key='name', seq_key='sequence')

seq['name']  # 'MySequence'
seq[:2]  # 'AT'

If the Sequence is instantiated with a dictionary, calls to __contains__() will return True if the supplied item is a key in the dictionary. In non-dict instantiations, __contains__() will look in the Sequence.sequence field directly (essentially a motif search). For example:

dict_seq = Sequence({'seq_id': 'seq1', 'vdj_nt': 'ACGT'})
'seq_id' in dict_seq  # True
'ACG' in dict_seq     # False

str_seq = Sequence('ACGT', id='seq1')
'seq_id' in str_seq  # False
'ACG' in str_seq     # True
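One way to picture the dual behavior of __getitem__() and __contains__() is the stand-in class below. DictAwareSeq and its _source attribute are hypothetical names used only for illustration; the actual AbTools implementation may dispatch differently.

```python
class DictAwareSeq:
    """Hypothetical stand-in for illustration; not the AbTools Sequence class."""

    def __init__(self, seq, id=None, id_key='seq_id', seq_key='vdj_nt'):
        if isinstance(seq, dict):
            # Dict instantiation: keep the dict and pull out ID and sequence.
            self._source = seq
            self.id = seq[id_key]
            self.sequence = seq[seq_key]
        else:
            # String instantiation: there is no dict to fall back on.
            self._source = None
            self.id = id
            self.sequence = seq

    def __getitem__(self, key):
        # Slices and integer positions index into the sequence string.
        if isinstance(key, (slice, int)):
            return self.sequence[key]
        # Any other key is looked up in the instantiating dict, if present.
        if self._source is not None:
            return self._source[key]
        raise KeyError(key)

    def __contains__(self, item):
        # Dict instantiation: membership means "is this a key?".
        if self._source is not None:
            return item in self._source
        # Otherwise: substring (motif) search in the sequence.
        return item in self.sequence


from_dict = DictAwareSeq({'name': 'MySequence', 'sequence': 'ATGC'},
                         id_key='name', seq_key='sequence')
from_str = DictAwareSeq('ACGT', id='seq1')
```

In this sketch, from_dict['name'] is a key lookup, from_dict[:2] is a slice of the sequence, and the in operator tests dictionary keys for from_dict but substrings for from_str.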

Note

Sequence objects are considered equal only if both their sequences and their IDs are identical. This means that two Sequence objects with identical sequences but without user-supplied IDs won’t be equal, because their IDs will have been randomly generated.
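The consequence of this equality rule can be illustrated with a small stand-in class. SeqSketch is hypothetical and only mimics the documented behavior (random ID when none is given, equality on both ID and sequence); it is not the AbTools class.

```python
import uuid


class SeqSketch:
    """Hypothetical stand-in that mimics the documented equality rule."""

    def __init__(self, sequence, id=None):
        self.sequence = sequence
        # Assumption: a missing ID is replaced by a random UUID4 string.
        self.id = id if id is not None else str(uuid.uuid4())

    def __eq__(self, other):
        if not isinstance(other, SeqSketch):
            return NotImplemented
        # Equal only when both the ID and the sequence match.
        return (self.id, self.sequence) == (other.id, other.sequence)

    def __hash__(self):
        # Keep hashing consistent with __eq__.
        return hash((self.id, self.sequence))


same_id = SeqSketch('ACGT', id='seq1') == SeqSketch('ACGT', id='seq1')
random_ids = SeqSketch('ACGT') == SeqSketch('ACGT')
```

Here same_id is True, while random_ids is False even though the sequences match. To compare by sequence alone, compare the sequence attributes directly.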

fasta

str – Returns the sequence, as a FASTA-formatted string

Note: The FASTA string is built using Sequence.id and Sequence.sequence.
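As a rough sketch of how such a string is assembled, the hypothetical to_fasta helper below joins a '>' header line and the sequence. It assumes the sequence is written on a single line without wrapping, which the documentation does not specify.

```python
def to_fasta(seq_id, sequence):
    """Build a two-line FASTA record (hypothetical helper, not the AbTools code)."""
    # Assumption: the header is just the ID and the sequence is not line-wrapped.
    return '>{}\n{}'.format(seq_id, sequence)


record = to_fasta('seq1', 'ACGT')
```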

fastq

str – Returns the sequence, as a FASTQ-formatted string

If Sequence.qual is None, then None will be returned instead of a FASTQ string.
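The None fallback can be sketched with the hypothetical to_fastq helper below. It assumes qual is already an encoded quality string of the same length as the sequence and that the separator line is a bare '+'; neither detail is confirmed by the documentation.

```python
def to_fastq(seq_id, sequence, qual=None):
    """Build a four-line FASTQ record, or return None when there is no quality.

    Hypothetical helper for illustration; not the AbTools implementation.
    """
    if qual is None:
        # Mirrors the documented behavior: no quality scores, no FASTQ string.
        return None
    # Assumption: qual is an already-encoded quality string.
    return '@{}\n{}\n+\n{}'.format(seq_id, sequence, qual)


with_qual = to_fastq('seq1', 'ACGT', qual='IIII')
without_qual = to_fastq('seq1', 'ACGT')
```

Callers that write FASTQ files should therefore check for None before writing.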

reverse_complement

str – Returns the reverse complement of Sequence.sequence.
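For readers unfamiliar with the operation, a minimal sketch of reverse complementation is shown below. The reverse_complement function here is a hypothetical standalone helper that assumes a plain DNA alphabet (A, C, G, T, N); how AbTools handles RNA or ambiguity codes is not stated in this documentation.

```python
# Translation table mapping each base to its complement.
# Assumption: DNA only, upper or lower case, with N left unchanged.
_COMPLEMENT = str.maketrans('ACGTNacgtn', 'TGCANtgcan')


def reverse_complement(sequence):
    """Return the reverse complement of a DNA string (illustrative helper)."""
    # Reverse the string first, then complement every base.
    return sequence[::-1].translate(_COMPLEMENT)


rc = reverse_complement('ATGC')
```

Applying the function twice returns the original sequence, which is a convenient sanity check.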

region(start=0, end=None)

Returns a region of Sequence.sequence, in FASTA format.

If called without kwargs, the entire sequence will be returned.

Parameters:
  • start (int) – Start position of the region to be returned. Default is 0.
  • end (int) – End position of the region to be returned. Negative values will function as they do when slicing strings. Default is None, which extends the region to the end of the sequence.
Returns:
  A region of Sequence.sequence, in FASTA format.
Return type:
  str
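The slicing semantics of region() can be sketched as below. This region function is a hypothetical standalone helper; in particular, using the unmodified ID as the FASTA header is an assumption, since the documentation does not say whether the header reflects the selected coordinates.

```python
def region(seq_id, sequence, start=0, end=None):
    """Return sequence[start:end] as a FASTA string (illustrative helper)."""
    # Python slicing already treats end=None as "through the last position"
    # and handles negative values relative to the end of the string.
    # Assumption: the header is the unchanged sequence ID.
    return '>{}\n{}'.format(seq_id, sequence[start:end])


whole = region('seq1', 'ATGCATGC')
trimmed = region('seq1', 'ATGCATGC', start=2, end=-2)
```

With no keyword arguments the whole sequence is returned; with start=2 and end=-2 the first two and last two positions are dropped.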