abtools.cluster: Sequence Clustering

class abtools.cluster.Cluster(raw_cluster, seq_db=None, db_path=None, seq_dict=None)

Data and methods for a cluster of sequences.

All public attributes are evaluated lazily, so attributes that require significant processing time are computed only when first accessed. Each attribute is also calculated only once: if you change the Cluster object after accessing an attribute, that attribute will not update. Setters are provided for all attributes, however, so you can update them manually if necessary:

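from abtools.cluster import cluster

seqs = [Sequence1, Sequence2, ... SequenceN]

# cluster() returns a list of Cluster objects; work with the first one
clusters = cluster(seqs)
clust = clusters[0]

# calculate the consensus (computed on first access, then cached)
consensus = clust.consensus

# add sequences to the Cluster
more_sequences = [SequenceA, SequenceB, SequenceC]
clust.sequences += more_sequences

# need to recompute the consensus manually
clust.consensus = clust._make_consensus()

ids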

list – A list of all sequence IDs in the Cluster

size

int – Number of sequences in the Cluster

sequences

list – A list of all sequences in the Cluster, as AbTools Sequence objects.

consensus

Sequence – Consensus sequence, calculated by aligning all sequences with MAFFT and then computing the consensus with Bio.Align.AlignInfo.SummaryInfo.gap_consensus().

centroid

Sequence – Centroid sequence, as calculated by CD-HIT.
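The lazy, compute-once behavior described for Cluster attributes can be illustrated with a small standalone sketch. It is not the AbTools implementation: the class name LazyCluster and the consensus_calls counter are hypothetical, and the consensus here is a simple per-column majority vote over pre-aligned sequences, standing in for the MAFFT plus gap_consensus() step.

```python
from collections import Counter


class LazyCluster:
    """Toy stand-in for a cluster whose consensus is computed lazily."""

    def __init__(self, aligned_seqs):
        # Assumption: sequences are already aligned and of equal length.
        self.sequences = list(aligned_seqs)
        self._consensus = None      # nothing computed yet
        self.consensus_calls = 0    # counts runs of the expensive step

    @property
    def consensus(self):
        # Compute on first access only; later accesses reuse the cached value.
        if self._consensus is None:
            self._consensus = self._make_consensus()
        return self._consensus

    @consensus.setter
    def consensus(self, value):
        # The setter lets callers refresh the cached value by hand.
        self._consensus = value

    def _make_consensus(self):
        # Per-column majority vote; a simplification of a real consensus call.
        self.consensus_calls += 1
        columns = zip(*self.sequences)
        return "".join(Counter(col).most_common(1)[0][0] for col in columns)


clust = LazyCluster(["ACGT", "ACGA", "ACGA"])
first = clust.consensus                      # triggers the computation
second = clust.consensus                     # served from the cache
clust.sequences += ["TCGT", "TCGT", "TCGT", "TCGT"]
stale = clust.consensus                      # cached value, not updated
clust.consensus = clust._make_consensus()    # manual refresh via the setter
fresh = clust.consensus
```

After the extra sequences are added, stale still holds the original consensus; only the explicit assignment through the setter brings the cached value up to date.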

abtools.cluster.cluster(seqs, threshold=0.975, out_file=None, make_db=True, temp_dir=None, quiet=False, threads=0, return_just_seq_ids=False, max_memory=800, debug=False)

Perform sequence clustering with CD-HIT.

Parameters:
  • seqs (list) – An iterable of sequences, in any format that abtools.sequence.Sequence() can handle
  • threshold (float) – Clustering identity threshold. Default is 0.975.
  • out_file (str) – Path to the clustering output file. Default is to use tempfile.NamedTemporaryFile to generate an output file name.
  • temp_dir (str) – Path to the temporary directory. If not provided, ‘/tmp’ is used.
  • make_db (bool) – Whether to build a SQLite database of sequence information. Required if you want to calculate consensus/centroid sequences for the resulting clusters or if you need to access the clustered sequences (not just sequence IDs). Default is True.
Returns:

A list of Cluster objects, one per cluster.

Return type:

list
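To give a feel for what the identity threshold controls, here is a toy sketch of greedy, representative-based clustering in the general style usually attributed to CD-HIT. It does not call CD-HIT or AbTools: the functions identity and greedy_cluster are hypothetical, and identity is approximated by position-wise matches over unaligned sequences divided by the shorter length, which is much cruder than a real alignment.

```python
def identity(a, b):
    """Fraction of matching positions, relative to the shorter sequence."""
    shorter = min(len(a), len(b))
    if shorter == 0:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / shorter


def greedy_cluster(seqs, threshold=0.975):
    """Group (seq_id, sequence) pairs around greedily chosen representatives."""
    clusters = []  # each entry: {"centroid": (id, seq), "ids": [...]}
    # Longest sequences are considered first, so they become representatives.
    for seq_id, seq in sorted(seqs, key=lambda s: len(s[1]), reverse=True):
        for c in clusters:
            if identity(seq, c["centroid"][1]) >= threshold:
                c["ids"].append(seq_id)   # close enough: join this cluster
                break
        else:
            # No representative is similar enough: start a new cluster.
            clusters.append({"centroid": (seq_id, seq), "ids": [seq_id]})
    return clusters


seqs = [
    ("s1", "ACGTACGTAC"),
    ("s2", "ACGTACGTAT"),   # differs from s1 at one position (identity 0.9)
    ("s3", "TTTTTTTTTT"),
]
loose = greedy_cluster(seqs, threshold=0.85)
strict = greedy_cluster(seqs)  # default threshold of 0.975
```

With the looser threshold, s1 and s2 fall into one cluster; at the default of 0.975 a single mismatch in ten positions is enough to split them, so every sequence ends up as its own representative.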