Cookbook 2: Clustering in Depth
Discover natural groupings in large vector datasets with DBSCAN and k-means, from protein sequences to support tickets, without exporting data to a separate analytics stack.
Answer
How do I cluster records in EigenLake?
Filter a snapshot of your index, run spherical k-means or DBSCAN directly over the embeddings, and receive cluster summaries with representative records you can validate immediately.
Inputs
- A corpus of protein sequences with known organism and family metadata
- Embedding vectors that match the index dimension
- An EigenLake index with workload-ready schema fields
Outputs
- DBSCAN clusters that represent natural structural families
- k-means clusters for fixed-size routing queues
- Representative records per cluster for human validation
- A repeatable clustering workflow over one index
Clustering is the workload that turns a vector store into a pattern-discovery engine. This recipe uses protein sequences because the ground truth is objective — a TIM barrel is a TIM barrel whether the algorithm knows the label or not — but the same primitives apply to support tickets, sensor traces, or any other embedded corpus.
Problem
You have 8,000 protein sequences and you want to discover structural families without pre-defined labels. You also want to compare density-based discovery (DBSCAN, which finds its own number of clusters) with centroid-based routing (k-means, which you size to your downstream queues). And you want representative sequences from each cluster so a biologist can validate the results without reading thousands of records.
Prerequisites
- Python 3.10 or newer.
- pip install eigenlake numpy datasets
- An API key from https://api.eigenlake.dev.
- A protein embedding model (ESM-2, ProtTrans) or the fake_embed fallback for sandbox testing.
The recipe
Step 1 — Load the dataset
from datasets import load_dataset
import numpy as np
import eigenlake
from eigenlake import schema as s
DIM = 768
# UniProt sequences: https://huggingface.co/datasets/jglaser/uniprot_sequences
ds = load_dataset("jglaser/uniprot_sequences", split="train")
ds = ds.shuffle(seed=42).select(range(8_000))
Step 2 — Define the schema
schema, index_options = (
    s.SchemaBuilder(additional_properties=False)
    .add("accession", s.string(required=True, filterable=True))
    .add("organism", s.string(filterable=True))
    .add("protein_family", s.string(filterable=True))  # known labels for validation
    .add("sequence", s.string(filterable=False))
    .add("length", s.integer(filterable=True))
    .build()
)
protein_family is optional metadata. You do not use it during clustering, but it is invaluable afterwards for validating whether the discovered clusters align with known biology.
Step 3 — Ingest records
# Real embedding (recommended for production)
# Note: esm2_t6_8M_UR50D produces 320-dimensional vectors, so set DIM to match the model you load.
# from transformers import AutoTokenizer, AutoModel
# tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")
# model = AutoModel.from_pretrained("facebook/esm2_t6_8M_UR50D")
# def embed(seq: str) -> list[float]:
#     inputs = tokenizer(seq, return_tensors="pt", truncation=True, max_length=512)
#     outputs = model(**inputs)
#     return outputs.last_hidden_state.mean(dim=1).detach().numpy()[0].tolist()
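# Deterministic fallback for sandbox testing
import hashlib

def fake_embed(text: str) -> list[float]:
    # Built-in hash() is salted per process for strings, so seed from a stable digest instead.
    seed = int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest()[:4], "big")
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(DIM).astype("float32")
    v /= np.linalg.norm(v) + 1e-12
    return v.tolist()

embed = fake_embed  # swap to real embedder in production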
with eigenlake.connect(
    url="https://api.eigenlake.dev",
    api_key="<sk_sbx_your_api_key_here>",
) as client:
    idx = client.indexes.create_or_get(
        namespace="cookbook-02",
        index="protein-clustering",
        dimensions=DIM,
        schema=schema,
        index_options=index_options,
    )
    payload = [
        {
            "properties": {
                "accession": r["accession"],
                "organism": r.get("organism", "Unknown"),
                "protein_family": r.get("protein_family", "Unknown"),
                "sequence": r["sequence"],
                "length": len(r["sequence"]),
            },
            "vector": embed(r["sequence"]),
        }
        for r in ds
    ]
    result = idx.records.add_many(payload, on_error="continue")
    print(f"inserted {len(result)} of {len(payload)} sequences")
Step 4 — DBSCAN for discovery
Use DBSCAN when you do not know how many clusters exist and you want the algorithm to find natural families while tagging outliers as noise.
clusters = idx.search.cluster(
filter={"length": {"$gte": 100}},
limit=10_000,
algorithm="dbscan",
dbscan_min_samples=5,
distance_metric="cosine",
representatives_per_cluster=3,
)
for i, c in enumerate(clusters["clusters"], start=1):
    print(f"cluster {i}: {c['count']} sequences")
    for rep in c["representatives"]:
        fam = rep["metadata"].get("protein_family", "Unknown")
        print(f"  - {rep['metadata']['accession']} ({fam})")
print(f"noise: {clusters.get('noise_count', 0)} sequences")
dbscan_min_samples=5 is the minimum number of records within the eps radius for a point to become a core point. Lower it (2–3) to find smaller families at the cost of more false positives. Raise it (8–10) to demand stronger consensus.
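To make the core-point rule concrete, here is a minimal NumPy sketch that runs locally and does not call EigenLake. The helper name `core_point_mask` and the `eps=0.05` value are assumptions for illustration only; the recipe does not pass an eps value, and this sketch says nothing about how the service picks one.

```python
import numpy as np

def core_point_mask(vectors: np.ndarray, eps: float, min_samples: int) -> np.ndarray:
    """Return a boolean mask marking DBSCAN core points under cosine distance."""
    # Normalize rows so the dot product equals cosine similarity.
    unit = vectors / (np.linalg.norm(vectors, axis=1, keepdims=True) + 1e-12)
    cosine_distance = 1.0 - unit @ unit.T
    # Count neighbors within eps; each point counts itself, a common DBSCAN convention.
    neighbor_counts = (cosine_distance <= eps).sum(axis=1)
    return neighbor_counts >= min_samples

rng = np.random.default_rng(0)
# Six points packed tightly around one direction, plus two isolated directions.
dense = np.array([1.0, 0.0, 0.0]) + 0.01 * rng.standard_normal((6, 3))
isolated = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
points = np.vstack([dense, isolated])

mask = core_point_mask(points, eps=0.05, min_samples=5)
print(mask)
```

With `min_samples=5` the six packed points qualify as core points and the two isolated ones do not. Dropping `min_samples` to 1 would make every point a core point, which is the false-positive risk described above.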
representatives_per_cluster=3 returns the three sequences closest to each cluster centroid. A biologist reads those three, recognizes a Rossmann fold or an immunoglobulin domain, and labels the cluster without inspecting thousands of members.
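The idea of "closest to the centroid" can be sketched locally in a few lines of NumPy. This is an illustration under assumptions, not EigenLake's implementation: the helper `closest_to_centroid` is hypothetical, and it ranks members by cosine similarity to the normalized mean direction.

```python
import numpy as np

def closest_to_centroid(vectors: np.ndarray, k: int = 3) -> np.ndarray:
    """Return indices of the k members nearest the cluster centroid by cosine similarity."""
    unit = vectors / (np.linalg.norm(vectors, axis=1, keepdims=True) + 1e-12)
    centroid = unit.mean(axis=0)
    centroid /= np.linalg.norm(centroid) + 1e-12
    similarity = unit @ centroid
    # argsort is ascending, so negate to rank the most similar members first.
    return np.argsort(-similarity)[:k]

# Toy cluster: three members pointing roughly the same way and one straggler.
members = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.8, 0.2],
    [0.0, 1.0],
])
representative_idx = closest_to_centroid(members, k=2)
print(representative_idx)
```

The straggler pulls the centroid toward itself, so the picks are the members nearest the mean direction, not necessarily the most typical-looking ones. That is one reason a human check of the representatives is still worthwhile.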
Step 5 — k-means for fixed routing
Use k-means when you already know roughly how many queues you have (for example, about ten cryo-EM validation pipelines) and you want every sequence routed to exactly one queue. The call below lets auto_tune pick k from a range around that target.
kmeans_clusters = idx.search.cluster(
filter={"length": {"$gte": 100}},
limit=10_000,
algorithm="kmeans",
auto_tune=True,
min_clusters=8,
max_clusters=15,
distance_metric="cosine",
representatives_per_cluster=3,
)
print(f"selected k={kmeans_clusters['parameters']['num_clusters']}")
for c in kmeans_clusters["clusters"]:
    print(f"  cluster {c['cluster_id']}: {c['count']} sequences")
auto_tune=True evaluates every cluster count in [min_clusters, max_clusters] and selects the one with the best silhouette score. The chosen num_clusters is returned in parameters so you can audit the decision.
Step 6 — Multi-scale clustering
For large heterogeneous corpora, run DBSCAN coarsely first, then k-means inside each large cluster:
# Coarse pass: broad families
coarse = idx.search.cluster(
filter={"length": {"$gte": 100}},
limit=10_000,
algorithm="dbscan",
dbscan_min_samples=20,
distance_metric="cosine",
)
# Fine pass: subfamilies inside the largest coarse cluster
largest = max(coarse["clusters"], key=lambda c: c["count"])
member_ids = [r["uuid"] for r in largest.get("members", [])]
# In production, you would filter by the member UUIDs or a shared property.
# member_ids is collected only to show what such a filter would need; the call
# below does not use it and filters on organism as a stand-in:
fine = idx.search.cluster(
filter={"organism": {"$eq": "Homo sapiens"}},
limit=5_000,
algorithm="kmeans",
auto_tune=True,
min_clusters=3,
max_clusters=8,
)
print(f"fine subclusters: {len(fine['clusters'])}")
Multi-scale clustering is how you handle corpora that contain both super-families (TIM barrels) and sub-families (variant classes within a barrel) in the same index.
Agent path
Natural-language clustering defers filter inference and algorithm selection to EigenLake:
agent_result = idx.agent.query(
"Cluster proteins by structural family",
mode="cluster",
algorithm="dbscan",
dbscan_min_samples=5,
)
print(agent_result["action"]) # "cluster"
print(agent_result["filter"]) # inferred filter
for c in agent_result["clusters"]:
    print(f"  {c['count']} sequences: {c['summary']}")
Agent mode inspects the schema, infers a sensible filter, and runs the clustering workload. It is the fastest path from a question to a set of clusters. When you need precise control over dbscan_min_samples, distance_metric, or multi-scale passes, drop to the direct API.
What is happening, line by line
Spherical k-means with cosine distance. EigenLake normalizes vectors before clustering. On unit-length vectors, squared Euclidean distance equals twice the cosine distance, so both metrics rank neighbors identically; this does not hold for the raw, unnormalized vectors. The spherical k-means variant respects this geometry and produces clusters that are interpretable as semantic cones.
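The relationship between the two metrics on normalized vectors can be checked numerically. The snippet below is a standalone NumPy check on random vectors, not an EigenLake call; it verifies the identity ||a - b||^2 = 2 * (1 - cos(a, b)) for unit-length a and b.

```python
import numpy as np

rng = np.random.default_rng(7)
a, b = rng.standard_normal((2, 768))

# Normalize both vectors to unit length.
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

cosine_distance = 1.0 - float(a @ b)
squared_euclidean = float(np.sum((a - b) ** 2))

# For unit vectors the squared Euclidean distance is exactly twice the cosine distance.
identity_holds = bool(np.isclose(squared_euclidean, 2.0 * cosine_distance))
print(identity_holds)
```

Because one quantity is a monotonic function of the other, nearest-neighbor rankings and cluster assignments agree under either metric once vectors are normalized.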
Silhouette scoring for k-means. When auto_tune=True, the API runs k-means for every candidate k in the range, computes the silhouette score for each, and returns the winner. The silhouette score measures how similar a record is to its own cluster versus the nearest other cluster. Higher is better.
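For intuition, the silhouette score can be computed by hand on a toy example. The function below is a minimal sketch that assumes cosine distance and at least two clusters; it is not the service's implementation, and the name `mean_silhouette` is hypothetical.

```python
import numpy as np

def mean_silhouette(vectors: np.ndarray, labels: np.ndarray) -> float:
    """Mean silhouette over all records using cosine distance (needs >= 2 clusters)."""
    unit = vectors / (np.linalg.norm(vectors, axis=1, keepdims=True) + 1e-12)
    dist = 1.0 - unit @ unit.T
    cluster_ids = np.unique(labels)
    scores = np.zeros(len(unit))
    for i in range(len(unit)):
        same = labels == labels[i]
        if same.sum() == 1:
            continue  # singleton clusters score 0 by convention
        # a: mean distance to the other members of the record's own cluster
        a = dist[i, same].sum() / (same.sum() - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(dist[i, labels == c].mean() for c in cluster_ids if c != labels[i])
        denom = max(a, b)
        scores[i] = (b - a) / denom if denom > 0 else 0.0
    return float(scores.mean())

# Two tight groups pointing in nearly orthogonal directions.
points = np.array([[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.05, 0.99]])
good_score = mean_silhouette(points, np.array([0, 0, 1, 1]))  # matches the geometry
bad_score = mean_silhouette(points, np.array([0, 1, 0, 1]))   # splits each group
print(good_score, bad_score)
```

The labeling that matches the geometry scores close to 1, while the labeling that splits each group scores below 0. Auto-tuning uses the same comparison across candidate values of k.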
DBSCAN noise is not a bug. Records tagged as noise are often the most interesting: they are singletons or small groups that do not fit any major family. After clustering, route noise records to the anomaly detection workload for a second opinion.
Variations
Validate with known labels. If your corpus has ground-truth labels like protein_family, compute purity per cluster (fraction of members with the majority label). Low purity means the embedding model is not capturing the biological signal you care about.
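Purity is simple to compute once you have the known labels of each cluster's members. The sketch below works on plain Python lists with made-up labels; how you collect member labels from the clustering response is left open, and the `clusters_by_id` structure is an assumption for illustration.

```python
from collections import Counter

def cluster_purity(labels: list[str]) -> float:
    """Fraction of members that carry the cluster's majority label."""
    if not labels:
        return 0.0
    majority_count = Counter(labels).most_common(1)[0][1]
    return majority_count / len(labels)

# Hypothetical protein_family labels for the members of two clusters.
clusters_by_id = {
    "c1": ["TIM barrel", "TIM barrel", "TIM barrel", "Rossmann fold"],
    "c2": ["Immunoglobulin", "Immunoglobulin"],
}
purity = {cid: cluster_purity(families) for cid, families in clusters_by_id.items()}
print(purity)
```

Records labeled "Unknown" would count as a label of their own here, so you may want to drop them before computing purity.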
Auto-tag unlabeled records. After clustering, write the cluster_id back to each record:
for c in clusters["clusters"]:
    for member in c.get("members", []):
        idx.records.update(member["uuid"], properties={"cluster_id": c["cluster_id"]})
Switch distance metrics. The recipe uses cosine. Use euclidean when your embedding model does not normalize outputs and absolute magnitude carries signal.
Window for >10k records. The synchronous clustering limit is 10,000 records. For larger corpora, partition by a filterable field (e.g., organism or length quartile), cluster each window, and merge similar centroids in application code.
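One way to merge similar centroids in application code is a greedy pass with a cosine-similarity threshold. The sketch below assumes you have a centroid vector and a member count for each windowed cluster; the recipe does not show whether the response exposes centroids, so treat the inputs, the helper name `merge_centroids`, and the 0.9 threshold as assumptions.

```python
import numpy as np

def merge_centroids(centroids: np.ndarray, counts: list[int], threshold: float = 0.9):
    """Greedily merge centroids whose cosine similarity is at least `threshold`."""
    merged_sums: list[np.ndarray] = []  # count-weighted sums of unit centroids
    merged_counts: list[int] = []
    for centroid, count in zip(centroids, counts):
        unit = centroid / (np.linalg.norm(centroid) + 1e-12)
        for j, total in enumerate(merged_sums):
            direction = total / (np.linalg.norm(total) + 1e-12)
            if unit @ direction >= threshold:
                # Fold this cluster into the existing group, weighted by its size.
                merged_sums[j] = total + count * unit
                merged_counts[j] += count
                break
        else:
            # No existing group is similar enough, so start a new one.
            merged_sums.append(count * unit)
            merged_counts.append(count)
    merged = np.array([t / (np.linalg.norm(t) + 1e-12) for t in merged_sums])
    return merged, merged_counts

# Centroids from two hypothetical windows; the first of window_b overlaps window_a[0].
window_a = np.array([[1.0, 0.0], [0.0, 1.0]])
window_b = np.array([[0.98, 0.2], [-1.0, 0.0]])
merged, sizes = merge_centroids(np.vstack([window_a, window_b]), counts=[100, 50, 80, 30])
print(len(merged), sizes)
```

Greedy merging depends on input order and on the threshold, so it is worth checking a few merged groups by hand before relying on the result.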
Wire into an agent loop. A biology agent can cluster nightly, write cluster_id and representative_accession to a summary index, and pass those signals to a downstream validation agent. The upstream agent does the compute; the downstream agent reads structured results.
What to read next
- Cookbook 3: Anomaly Detection in Depth — find the outliers that DBSCAN tagged as noise, or detect subtle degradation signals in sensor data.
- Cookbook 6: From Search to Insight — see how clustering fits into a multi-workload investigative pipeline.
- EigenRun: How EigenLake Lets Agents Compute, Not Just Retrieve — the research-backed argument for why clustering is a perceptual modality for agents.