`abtools.cluster`: Sequence Clustering

class abtools.cluster.Cluster(raw_cluster, seq_db=None, db_path=None, seq_dict=None)

Data and methods for a cluster of sequences.

All public attributes are evaluated lazily, so attributes that require significant processing time are computed only when first accessed. Each attribute is also calculated only once, so if you change the Cluster object after accessing an attribute, that attribute will not update. Setters are provided for all attributes, however, so you can update them manually if necessary:

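from abtools.cluster import cluster

seqs = [Sequence1, Sequence2, ... SequenceN]
# cluster() returns a list of Cluster objects; work with the first one
clusters = cluster(seqs)
clust = clusters[0]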

# calculate the consensus
consensus = clust.consensus

# add sequences to the Cluster
more_sequences = [SequenceA, SequenceB, SequenceC]
clust.sequences += more_sequences

# need to recompute the consensus manually
clust.consensus = clust._make_consensus()

ids: list – A list of all sequence IDs in the Cluster.

size: int – Number of sequences in the Cluster.

sequences: list – A list of all sequences in the Cluster, as AbTools Sequence objects.

consensus: Sequence – Consensus sequence, calculated by aligning all sequences with MAFFT and then calling Bio.Align.AlignInfo.SummaryInfo.gap_consensus() on the alignment.

centroid: Sequence – Centroid sequence, as calculated by CD-HIT.
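
To give a rough idea of what a gap-aware consensus does with an alignment, here is a simplified pure-Python sketch. It is not the Biopython routine and does not call MAFFT; the function name `gap_consensus_sketch`, the 0.7 threshold, and the `X` ambiguity character are illustrative assumptions, and the input is assumed to be already aligned.

```python
from collections import Counter


def gap_consensus_sketch(aligned, threshold=0.7, ambiguous="X"):
    """Column-wise majority consensus in which gaps ('-') count as characters."""
    if not aligned:
        return ""
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("all aligned sequences must have the same length")
    consensus = []
    for column in zip(*aligned):
        counts = Counter(column)
        (top, top_n), *rest = counts.most_common()
        tied = bool(rest) and rest[0][1] == top_n
        # Keep the top character only if it is unique and frequent enough;
        # otherwise emit the ambiguity character.
        if not tied and top_n / len(column) >= threshold:
            consensus.append(top)
        else:
            consensus.append(ambiguous)
    return "".join(consensus)


# Four pre-aligned toy sequences; the fourth column is mostly gaps.
aligned = ["ACG-T", "ACG-T", "ACGAT", "TCG-A"]
result = gap_consensus_sketch(aligned)
```

Because gaps are counted like any other character, a column that is mostly gaps produces a gap in the consensus rather than the minority residue.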

abtools.cluster.cluster(seqs, threshold=0.975, out_file=None, make_db=True, temp_dir=None, quiet=False, threads=0, return_just_seq_ids=False, max_memory=800, debug=False)

Perform sequence clustering with CD-HIT.

Parameters:
	seqs (list) – An iterable of sequences, in any format that abtools.sequence.Sequence() can handle.
	threshold (float) – Clustering identity threshold. Default is 0.975.
	out_file (str) – Path to the clustering output file. Default is to use tempfile.NamedTemporaryFile to generate an output file name.
	temp_dir (str) – Path to the temporary directory. If not provided, ‘/tmp’ is used.
	make_db (bool) – Whether to build a SQLite database of sequence information. Required if you want to calculate consensus/centroid sequences for the resulting clusters or if you need to access the clustered sequences (not just sequence IDs). Default is True.
Returns:	A list of Cluster objects, one per cluster.
Return type:	list
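
Since the return value is a plain list of Cluster objects, it can be filtered and sorted with ordinary Python. The sketch below relies only on the `size` and `ids` attributes documented above; the helper `largest_clusters` and its `min_size` parameter are hypothetical, and `SimpleNamespace` stand-ins replace real Cluster objects so the example runs without abtools or CD-HIT installed.

```python
from types import SimpleNamespace


def largest_clusters(clusters, min_size=2):
    """Return clusters with at least `min_size` members, largest first."""
    kept = [c for c in clusters if c.size >= min_size]
    return sorted(kept, key=lambda c: c.size, reverse=True)


# Stand-ins for Cluster objects (assumption for illustration only).
# With abtools installed, `clusters` would instead come from cluster(seqs).
clusters = [
    SimpleNamespace(ids=["s1"], size=1),
    SimpleNamespace(ids=["s2", "s3", "s4"], size=3),
    SimpleNamespace(ids=["s5", "s6"], size=2),
]

top = largest_clusters(clusters)
summary = [(c.size, c.ids) for c in top]
```

Dropping singletons before touching `consensus` or `centroid` may be worthwhile, because those attributes are computed lazily and only the clusters you actually inspect incur the cost.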
