Cookbook 2: Clustering in Depth

Discover natural groupings in large vector datasets with DBSCAN and k-means, from protein sequences to support tickets, without exporting data to a separate analytics stack.

Tags: cookbook, python, clustering, dbscan, k-means, proteins

Answer

How do I cluster records in EigenLake?

Filter a snapshot of your index, run spherical k-means or DBSCAN directly over the embeddings, and receive cluster summaries with representative records you can validate immediately.

Workload: Clustering
Difficulty: intermediate
Estimated time: 10 minutes

Inputs

  • A corpus of protein sequences with known organism and family metadata
  • Embedding vectors that match the index dimension
  • An EigenLake index with workload-ready schema fields

Outputs

  • DBSCAN clusters that represent natural structural families
  • k-means clusters for fixed-size routing queues
  • Representative records per cluster for human validation
  • A repeatable clustering workflow over one index

Clustering is the workload that turns a vector store into a pattern-discovery engine. This recipe uses protein sequences because the ground truth is objective — a TIM barrel is a TIM barrel whether the algorithm knows the label or not — but the same primitives apply to support tickets, sensor traces, or any other embedded corpus.

Problem

You have 8,000 protein sequences and you want to discover structural families without pre-defined labels. You also want to compare density-based discovery (DBSCAN, which finds its own number of clusters) with centroid-based routing (k-means, which you size to your downstream queues). And you want representative sequences from each cluster so a biologist can validate the results without reading thousands of records.

Prerequisites

  • Python 3.10 or newer.
  • pip install eigenlake numpy datasets.
  • An API key from https://api.eigenlake.dev.
  • A protein embedding model (ESM-2, ProtTrans) or the fake_embed fallback for sandbox testing.

The recipe

Step 1 — Load the dataset

from datasets import load_dataset
import numpy as np
import eigenlake
from eigenlake import schema as s

DIM = 768

# UniProt sequences: https://huggingface.co/datasets/jglaser/uniprot_sequences
ds = load_dataset("jglaser/uniprot_sequences", split="train")
ds = ds.shuffle(seed=42).select(range(8_000))

Step 2 — Define the schema

schema, index_options = (
    s.SchemaBuilder(additional_properties=False)
    .add("accession", s.string(required=True, filterable=True))
    .add("organism", s.string(filterable=True))
    .add("protein_family", s.string(filterable=True))  # known labels for validation
    .add("sequence", s.string(filterable=False))
    .add("length", s.integer(filterable=True))
    .build()
)

protein_family is optional metadata. You do not use it during clustering, but it is invaluable afterwards for validating whether the discovered clusters align with known biology.

Step 3 — Ingest records

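# Real embedding (recommended for production)
# Note: esm2_t6_8M_UR50D returns 320-dimensional vectors, so set DIM = 320
# (or pick a model whose hidden size matches DIM) before creating the index.
# import torch
# from transformers import AutoTokenizer, AutoModel
# tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")
# model = AutoModel.from_pretrained("facebook/esm2_t6_8M_UR50D")
# def embed(seq: str) -> list[float]:
#     inputs = tokenizer(seq, return_tensors="pt", truncation=True, max_length=512)
#     with torch.no_grad():  # inference only, no gradients needed
#         outputs = model(**inputs)
#     # Mean-pool the per-token states into one vector per sequence
#     return outputs.last_hidden_state.mean(dim=1)[0].tolist()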

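# Deterministic fallback for sandbox testing
# The vectors are random, so clusters built from them carry no biological
# meaning; this fallback only exercises the API end to end.
import hashlib

def fake_embed(text: str) -> list[float]:
    # Python's built-in hash() is salted per process for strings, so seed from
    # a stable digest to get the same vector for the same sequence on every run.
    seed = int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest()[:4], "big")
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(DIM).astype("float32")
    v /= np.linalg.norm(v) + 1e-12
    return v.tolist()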

embed = fake_embed  # swap to real embedder in production

with eigenlake.connect(
    url="https://api.eigenlake.dev",
    api_key="<sk_sbx_your_api_key_here>",
) as client:
    idx = client.indexes.create_or_get(
        namespace="cookbook-02",
        index="protein-clustering",
        dimensions=DIM,
        schema=schema,
        index_options=index_options,
    )

    payload = [
        {
            "properties": {
                "accession": r["accession"],
                "organism": r.get("organism", "Unknown"),
                "protein_family": r.get("protein_family", "Unknown"),
                "sequence": r["sequence"],
                "length": len(r["sequence"]),
            },
            "vector": embed(r["sequence"]),
        }
        for r in ds
    ]

    result = idx.records.add_many(payload, on_error="continue")
    print(f"inserted {len(result)} of {len(payload)} sequences")

Step 4 — DBSCAN for discovery

Use DBSCAN when you do not know how many clusters exist and you want the algorithm to find natural families while tagging outliers as noise.

    clusters = idx.search.cluster(
        filter={"length": {"$gte": 100}},
        limit=10_000,
        algorithm="dbscan",
        dbscan_min_samples=5,
        distance_metric="cosine",
        representatives_per_cluster=3,
    )

    for i, c in enumerate(clusters["clusters"], start=1):
        print(f"cluster {i}: {c['count']} sequences")
        for rep in c["representatives"]:
            fam = rep["metadata"].get("protein_family", "Unknown")
            print(f"  - {rep['metadata']['accession']} ({fam})")
    print(f"noise: {clusters.get('noise_count', 0)} sequences")

dbscan_min_samples=5 is the minimum number of records within the eps radius for a point to become a core point. Lower it (2–3) to find smaller families at the cost of more false positives. Raise it (8–10) to demand stronger consensus.
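To make the core-point rule concrete, here is a minimal local sketch in NumPy. It is not EigenLake's implementation: the function name core_point_mask, the eps value, and the choice to count a point as its own neighbour (the scikit-learn convention) are all assumptions made for illustration.

```python
import numpy as np

def core_point_mask(vectors: np.ndarray, eps: float, min_samples: int) -> np.ndarray:
    """Return a boolean mask marking DBSCAN core points under cosine distance."""
    # Normalize rows so the dot product equals cosine similarity.
    unit = vectors / (np.linalg.norm(vectors, axis=1, keepdims=True) + 1e-12)
    cosine_distance = 1.0 - unit @ unit.T
    # Count neighbours within eps. Each point counts itself (distance ~0),
    # which is an assumption borrowed from scikit-learn's min_samples rule.
    neighbour_counts = (cosine_distance <= eps).sum(axis=1)
    return neighbour_counts >= min_samples

# Six points tightly grouped around one axis, plus two unrelated directions.
tight = np.tile(np.eye(8)[0], (6, 1))
tight[:, 1] += np.linspace(-0.02, 0.02, 6)  # small deterministic spread
stray = np.eye(8)[[2, 3]]
points = np.vstack([tight, stray])

mask = core_point_mask(points, eps=0.05, min_samples=5)
print(mask)
```

The six grouped points each have at least five neighbours within eps and become core points; the two stray points have only themselves and would be candidates for noise.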

representatives_per_cluster=3 returns the three sequences closest to each cluster centroid. A biologist reads those three, recognizes a Rossmann fold or an immunoglobulin domain, and labels the cluster without inspecting thousands of members.
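Selecting representatives by closeness to the centroid can be sketched locally as follows. This assumes cosine similarity to a normalized mean vector; pick_representatives is a hypothetical helper, and the service may rank members differently.

```python
import numpy as np

def pick_representatives(vectors: np.ndarray, n: int = 3) -> np.ndarray:
    """Return indices of the n members most cosine-similar to the cluster centroid."""
    unit = vectors / (np.linalg.norm(vectors, axis=1, keepdims=True) + 1e-12)
    centroid = unit.mean(axis=0)
    centroid /= np.linalg.norm(centroid) + 1e-12
    similarity = unit @ centroid
    # argsort is ascending, so negate to list the most similar members first.
    return np.argsort(-similarity)[:n]

members = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.8, 0.2],
    [0.0, 1.0],  # an outlying member that pulls the centroid but is far from it
])
reps = pick_representatives(members, n=3)
print(reps.tolist())
```

The outlying fourth member is never chosen, which is the point of the technique: representatives describe the bulk of the cluster, not its edges.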

Step 5 — k-means for fixed routing

Use k-means when you know roughly how many queues you have (for example, about ten cryo-EM validation pipelines) and you want every sequence routed to exactly one queue. The call below lets auto_tune pick the cluster count within a range around that number.

    kmeans_clusters = idx.search.cluster(
        filter={"length": {"$gte": 100}},
        limit=10_000,
        algorithm="kmeans",
        auto_tune=True,
        min_clusters=8,
        max_clusters=15,
        distance_metric="cosine",
        representatives_per_cluster=3,
    )

    print(f"selected k={kmeans_clusters['parameters']['num_clusters']}")
    for c in kmeans_clusters["clusters"]:
        print(f"  cluster {c['cluster_id']}: {c['count']} sequences")

auto_tune=True evaluates every cluster count in [min_clusters, max_clusters] and selects the one with the best silhouette score. The chosen num_clusters is returned in parameters so you can audit the decision.

Step 6 — Multi-scale clustering

For large heterogeneous corpora, run DBSCAN coarsely first, then k-means inside each large cluster:

    # Coarse pass: broad families
    coarse = idx.search.cluster(
        filter={"length": {"$gte": 100}},
        limit=10_000,
        algorithm="dbscan",
        dbscan_min_samples=20,
        distance_metric="cosine",
    )

    # Fine pass: subfamilies inside the largest coarse cluster
    largest = max(coarse["clusters"], key=lambda c: c["count"])
    member_ids = [r["uuid"] for r in largest.get("members", [])]

    # member_ids is collected to show where the fine pass would be scoped.
    # In production, filter by those UUIDs or by a property the members share.
    # This recipe uses an organism filter as a stand-in to show the pattern:
    fine = idx.search.cluster(
        filter={"organism": {"$eq": "Homo sapiens"}},
        limit=5_000,
        algorithm="kmeans",
        auto_tune=True,
        min_clusters=3,
        max_clusters=8,
    )
    print(f"fine subclusters: {len(fine['clusters'])}")

Multi-scale clustering is how you handle corpora that contain both super-families (TIM barrels) and sub-families (variant classes within a barrel) in the same index.

Agent path

Natural-language clustering lets EigenLake infer the filter from the schema. The call below still pins the algorithm and min_samples explicitly:

    agent_result = idx.agent.query(
        "Cluster proteins by structural family",
        mode="cluster",
        algorithm="dbscan",
        dbscan_min_samples=5,
    )
    print(agent_result["action"])       # "cluster"
    print(agent_result["filter"])       # inferred filter
    for c in agent_result["clusters"]:
        print(f"  {c['count']} sequences: {c['summary']}")

Agent mode inspects the schema, infers a sensible filter, and runs the clustering workload. It is the fastest path from a question to a set of clusters. When you need precise control over min_samples, distance_metric, or multi-scale passes, drop to the direct API.

What is happening, line by line

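Spherical k-means with cosine distance. EigenLake normalizes vectors before clustering. On unit-length vectors, squared Euclidean distance is a monotonic function of cosine distance (‖a − b‖² = 2(1 − cos θ)), so the two metrics rank neighbours identically; this does not hold for the raw, unnormalized vectors. The spherical k-means variant respects this geometry and produces clusters that are interpretable as semantic cones.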
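The relationship between Euclidean and cosine distance after normalization can be checked numerically. This is a small standalone NumPy check, independent of the EigenLake client; the random vectors stand in for real embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.standard_normal((2, 768))  # two raw, unnormalized vectors

# Unit-normalize both vectors.
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

cosine_distance = 1.0 - float(a_unit @ b_unit)
squared_euclidean = float(np.sum((a_unit - b_unit) ** 2))

# On unit vectors: ||a - b||^2 == 2 * (1 - cos(a, b))
identity_holds = bool(np.isclose(squared_euclidean, 2.0 * cosine_distance))

# The same identity fails on the raw vectors, where magnitude still matters.
raw_squared_euclidean = float(np.sum((a - b) ** 2))
raw_identity_holds = bool(np.isclose(raw_squared_euclidean, 2.0 * cosine_distance))

print(identity_holds, raw_identity_holds)
```

Because the mapping is monotonic, nearest-centroid assignments agree under either metric once vectors are unit length.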

Silhouette scoring for k-means. When auto_tune=True, the API runs k-means for every candidate k in the range, computes the silhouette score for each, and returns the winner. The silhouette score ranges from -1 to 1 and compares each record's average distance to its own cluster with its average distance to the nearest other cluster. Higher is better.
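The selection loop can be approximated locally to build intuition. The sketch below is an assumption-laden stand-in, not the service's code: it uses a simple spherical k-means with deterministic farthest-point seeding (the real initialization is not documented here), a cosine-distance silhouette, and synthetic data with three well-separated groups.

```python
import numpy as np

def farthest_point_init(unit: np.ndarray, k: int) -> np.ndarray:
    """Deterministic seeding: start at row 0, then repeatedly add the row
    least similar to every centroid chosen so far."""
    chosen = [0]
    for _ in range(k - 1):
        closest = (unit @ unit[chosen].T).max(axis=1)
        chosen.append(int(np.argmin(closest)))
    return unit[chosen].copy()

def spherical_kmeans(unit: np.ndarray, k: int, iters: int = 20) -> np.ndarray:
    """Assign unit-length rows to k centroids by cosine similarity."""
    centroids = farthest_point_init(unit, k)
    for _ in range(iters):
        labels = np.argmax(unit @ centroids.T, axis=1)
        for j in range(k):
            members = unit[labels == j]
            if len(members):  # an emptied cluster keeps its previous centroid
                mean = members.mean(axis=0)
                centroids[j] = mean / (np.linalg.norm(mean) + 1e-12)
    return np.argmax(unit @ centroids.T, axis=1)

def silhouette(unit: np.ndarray, labels: np.ndarray) -> float:
    """Mean silhouette score under cosine distance, in [-1, 1]."""
    dist = 1.0 - unit @ unit.T
    ids = np.unique(labels)
    scores = np.zeros(len(unit))
    for i in range(len(unit)):
        same = labels == labels[i]
        if same.sum() < 2 or len(ids) < 2:
            continue  # singletons (and a single cluster) score 0
        a = dist[i, same].sum() / (same.sum() - 1)  # mean distance to own cluster
        b = min(dist[i, labels == c].mean() for c in ids if c != labels[i])
        scores[i] = (b - a) / max(a, b, 1e-12)
    return float(scores.mean())

# Synthetic data: three tight groups around three orthogonal directions.
rng = np.random.default_rng(0)
centers = np.eye(8)[:3]
points = np.vstack([c + 0.05 * rng.standard_normal((30, 8)) for c in centers])
unit = points / np.linalg.norm(points, axis=1, keepdims=True)

# Score every candidate k and keep the one with the highest silhouette.
scores = {k: silhouette(unit, spherical_kmeans(unit, k)) for k in range(2, 6)}
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

Splitting a genuine group (k too high) or merging two groups (k too low) both lower the mean silhouette, which is why the score peaks at the true group count on clean data. Real embeddings are noisier, so treat the selected k as a starting point to audit.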

DBSCAN noise is not a bug. Records tagged as noise are often the most interesting: they are singletons or small groups that do not fit any major family. After clustering, route noise records to the anomaly detection workload for a second opinion.

Variations

Validate with known labels. If your corpus has ground-truth labels like protein_family, compute purity per cluster (fraction of members with the majority label). Low purity means the embedding model is not capturing the biological signal you care about.
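Purity is simple to compute once you have the family labels of each cluster's members. The sketch below uses made-up labels; cluster_purity and clusters_by_id are hypothetical names, and how you collect member labels from the clustering response depends on what the API returns.

```python
from collections import Counter

def cluster_purity(labels: list[str]) -> tuple[str, float]:
    """Return the majority label and the fraction of members that carry it."""
    if not labels:
        raise ValueError("cluster has no labelled members")
    majority, count = Counter(labels).most_common(1)[0]
    return majority, count / len(labels)

# Hypothetical family labels for the members of two clusters.
clusters_by_id = {
    "c1": ["TIM barrel", "TIM barrel", "TIM barrel", "Rossmann fold"],
    "c2": ["Immunoglobulin", "Rossmann fold", "TIM barrel", "Immunoglobulin"],
}

for cluster_id, families in clusters_by_id.items():
    majority, purity = cluster_purity(families)
    print(f"{cluster_id}: majority={majority} purity={purity:.2f}")
```

Records ingested with the "Unknown" default from Step 3 are best dropped before computing purity, since a cluster dominated by "Unknown" would otherwise look misleadingly pure.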

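Auto-tag unlabeled records. After clustering, write the cluster_id back to each record. The Step 2 schema sets additional_properties=False, so this assumes you have added a cluster_id field to the schema: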

for c in clusters["clusters"]:
    for member in c.get("members", []):
        idx.records.update(member["uuid"], properties={"cluster_id": c["cluster_id"]})

Switch distance metrics. The recipe uses cosine. Use euclidean when your embedding model does not normalize outputs and absolute magnitude carries signal.

Window for >10k records. The synchronous clustering limit is 10,000 records. For larger corpora, partition by a filterable field (e.g., organism or length quartile), cluster each window, and merge similar centroids in application code.
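The merge step in application code might look like the sketch below. It assumes you can obtain one centroid vector per cluster for each window (for example by averaging member vectors yourself; the source does not say whether the response includes centroids). merge_centroids and the 0.9 similarity threshold are illustrative choices, not part of the EigenLake API.

```python
import numpy as np

def merge_centroids(centroids: np.ndarray, threshold: float = 0.9) -> list[list[int]]:
    """Group centroid indices whose cosine similarity is >= threshold.

    Uses single linkage via union-find: if A matches B and B matches C,
    all three land in the same merged cluster.
    """
    unit = centroids / (np.linalg.norm(centroids, axis=1, keepdims=True) + 1e-12)
    similar = unit @ unit.T >= threshold
    parent = list(range(len(unit)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(unit)):
        for j in range(i + 1, len(unit)):
            if similar[i, j]:
                parent[find(j)] = find(i)

    groups: dict[int, list[int]] = {}
    for i in range(len(unit)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

# Centroids from two hypothetical windows; the first of each nearly coincide.
window_a = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
window_b = np.array([[0.98, 0.05, 0.0], [0.0, 0.0, 1.0]])
merged = merge_centroids(np.vstack([window_a, window_b]))
print(merged)
```

Single linkage is the simplest rule but can chain loosely related clusters together; a stricter threshold or a complete-linkage check is worth considering if merged groups look too broad.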

Wire into an agent loop. A biology agent can cluster nightly, write cluster_id and representative_accession to a summary index, and pass those signals to a downstream validation agent. The upstream agent does the compute; the downstream agent reads structured results.

What to read next

Cookbook 1: Semantic Search at Scale (8 minutes, intermediate)
Use filtered nearest-neighbour search, cursor pagination, and search-unit economics to retrieve relevant records from a corpus of tens of thousands of consumer complaints.

Cookbook 3: Anomaly Detection in Depth (10 minutes, intermediate)
Surface unusual sensor events, operational records, and outlier folds with Local Outlier Factor over vector embeddings, without pre-defining what normal looks like.

Cookbook 4: Topic Modeling in Depth (10 minutes, intermediate)
Discover emergent themes in multilingual text, scientific literature, or operational notes without manual labeling, using spherical k-means, c-TF-IDF, and optional LLM-generated labels.