abtools.cluster: Sequence Clustering
-
class abtools.cluster.Cluster(raw_cluster, seq_db=None, db_path=None, seq_dict=None)
Data and methods for a cluster of sequences.
All public attributes are evaluated lazily, so attributes that require significant processing time are only computed when needed. In addition, attributes are only calculated once, so if you change the Cluster object after accessing attributes, the attributes will not update. Setters are provided for all attributes, however, so you can update them manually if necessary:
    seqs = [Sequence1, Sequence2, ... SequenceN]
    clusters = cluster(seqs)  # cluster() returns a list of Cluster objects
    clust = clusters[0]

    # calculate the consensus
    consensus = clust.consensus

    # add sequences to the Cluster
    more_sequences = [SequenceA, SequenceB, SequenceC]
    clust.sequences += more_sequences

    # need to recompute the consensus manually
    clust.consensus = clust._make_consensus()
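The compute-once behavior described above can be illustrated with a minimal sketch. LazyCluster is a hypothetical stand-in, not the AbTools Cluster class, and it assumes the caching is done with a property backed by a private attribute; the real implementation may differ.

```python
class LazyCluster:
    """Hypothetical stand-in for Cluster; shows lazy, cached attributes."""

    def __init__(self, sequences):
        self.sequences = list(sequences)
        self._size = None  # not computed until first access

    @property
    def size(self):
        # Compute on first access, then reuse the cached value.
        if self._size is None:
            self._size = len(self.sequences)
        return self._size

    @size.setter
    def size(self, value):
        # The setter lets callers refresh a stale cached value by hand.
        self._size = value


clust = LazyCluster(["ATGC", "ATGG"])
first = clust.size                  # computed and cached here
clust.sequences += ["ATGA"]         # the cached size is now stale
stale = clust.size                  # still the old value
clust.size = len(clust.sequences)   # manual update through the setter
fresh = clust.size
```

The stale read in the middle is the situation the paragraph above warns about: changing the object after an attribute has been accessed does not refresh the cached value.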
-
ids (list) – A list of all sequence IDs in the Cluster.
-
size (int) – Number of sequences in the Cluster.
-
sequences (list) – A list of all sequences in the Cluster, as AbTools Sequence objects.
-
consensus (Sequence) – Consensus sequence, calculated by aligning all sequences with MAFFT and computing the gap consensus with Bio.Align.AlignInfo.SummaryInfo.gap_consensus().
-
centroid (Sequence) – Centroid sequence, as calculated by CD-HIT.
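To make the consensus attribute more concrete, here is a simplified, pure-Python sketch of a column-wise gap consensus over sequences that are already aligned. It is not the Biopython or AbTools code: the function name column_consensus, the 0.7 threshold, and the "N" ambiguity character are assumptions chosen for illustration.

```python
from collections import Counter


def column_consensus(aligned, threshold=0.7, ambiguous="N"):
    """Majority-rule consensus over aligned sequences; gaps count as characters."""
    if not aligned:
        return ""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")

    consensus = []
    for column in zip(*aligned):
        counts = Counter(column).most_common()
        top_char, top_count = counts[0]
        # A tie for first place means no single character wins the column.
        tied = len(counts) > 1 and counts[1][1] == top_count
        if not tied and top_count / len(column) >= threshold:
            consensus.append(top_char)
        else:
            consensus.append(ambiguous)
    return "".join(consensus)


aligned = ["ATG-C", "ATG-C", "ATGAC", "TTG-C"]
result = column_consensus(aligned)
```

Because gaps are counted like any other character, a column that is mostly gaps yields a gap in the consensus rather than the minority residue.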
-
abtools.cluster.cluster(seqs, threshold=0.975, out_file=None, make_db=True, temp_dir=None, quiet=False, threads=0, return_just_seq_ids=False, max_memory=800, debug=False)
Perform sequence clustering with CD-HIT.
Parameters:
- seqs (list) – An iterable of sequences, in any format that abtools.sequence.Sequence() can handle.
- threshold (float) – Clustering identity threshold. Default is 0.975.
- out_file (str) – Path to the clustering output file. Default is to use tempfile.NamedTemporaryFile to generate an output file name.
- temp_dir (str) – Path to the temporary directory. If not provided, ‘/tmp’ is used.
- make_db (bool) – Whether to build a SQLite database of sequence information. Required if you want to calculate consensus/centroid sequences for the resulting clusters or if you need to access the clustered sequences (not just sequence IDs). Default is True.
Returns: A list of Cluster objects, one per cluster.
Return type: list
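The out_file, temp_dir, and make_db parameters can be pictured with a small standard-library sketch. Everything here is an assumption made for illustration: the helper names, the seqs table, and its two-column schema are hypothetical and are not taken from the AbTools source.

```python
import os
import sqlite3
import tempfile


def make_output_path(temp_dir):
    """Generate an output file name with tempfile.NamedTemporaryFile."""
    # delete=False keeps the file on disk after the handle is closed.
    with tempfile.NamedTemporaryFile(dir=temp_dir, delete=False) as handle:
        return handle.name


def build_seq_db(records, db_path):
    """Store (id, sequence) pairs in a SQLite table (hypothetical schema)."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE seqs (id TEXT PRIMARY KEY, sequence TEXT)")
    conn.executemany("INSERT INTO seqs VALUES (?, ?)", records)
    conn.commit()
    return conn


def fetch_sequences(conn, seq_ids):
    """Look up full sequences from a list of sequence IDs."""
    placeholders = ",".join("?" * len(seq_ids))
    query = "SELECT id, sequence FROM seqs WHERE id IN ({})".format(placeholders)
    return dict(conn.execute(query, list(seq_ids)))


# gettempdir() is used here for portability; the text above names '/tmp'.
out_path = make_output_path(tempfile.gettempdir())
conn = build_seq_db([("seq1", "ATGC"), ("seq2", "ATGG")], ":memory:")
found = fetch_sequences(conn, ["seq2"])
conn.close()
os.remove(out_path)
```

A lookup table of this kind shows why a database is needed when the clustered sequences themselves are wanted: sequence IDs alone do not carry the sequence data required for a consensus or centroid.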