Cookbook 4: Topic Modeling in Depth

Discover emergent themes in multilingual text, scientific literature, or operational notes without manual labeling — using spherical k-means, c-TF-IDF, and optional LLM-generated labels.

Tags: cookbook, python, topic-modeling, ctf-idf, multilingual, reviews

Answer

How do I model topics in EigenLake?

Filter a snapshot, run spherical k-means over the embeddings, derive distinguishing terms with class-based TF-IDF, and receive human-readable topic labels with per-topic metadata facets.

Workload: Topic Modeling
Difficulty: intermediate
Estimated time: 10 minutes

Inputs

  • A corpus of multilingual Amazon product reviews
  • Embedding vectors that match the index dimension
  • An EigenLake index with language and star-rating metadata

Outputs

  • 5–15 topic clusters with c-TF-IDF terms and coverage statistics
  • Metadata facets per topic (e.g., 80% of Topic 3 is 1-star reviews)
  • Optional LLM-generated human-readable labels
  • A repeatable topic-discovery workflow over one index

Topic modeling is the workload that names what you are looking at. This recipe uses multilingual Amazon reviews because the themes are intuitive — shipping complaints, quality issues, sizing problems — but the same primitives apply to scientific abstracts, support tickets, or any corpus where you need to understand the thematic landscape before deciding what to investigate.

Problem

You have 10,000 Amazon reviews in German and English and you want to know what people complain about without reading them all. You also want to know which complaints correlate with 1-star ratings, and you want stable, reproducible topic assignments you can compare week-over-week.

Prerequisites

  • Python 3.10 or newer.
  • pip install eigenlake numpy datasets (add sentence-transformers if you use the real embedder in Step 3).
  • An API key from https://api.eigenlake.dev.
  • A multilingual sentence transformer. The recipe uses paraphrase-multilingual-MiniLM-L12-v2 (384 dimensions) or the fake_embed fallback.

The recipe

Step 1 — Load the dataset

from datasets import load_dataset
import numpy as np
import eigenlake
from eigenlake import schema as s

DIM = 384

# Multilingual Amazon reviews: https://huggingface.co/datasets/mteb/amazon_reviews_multi
ds = load_dataset("mteb/amazon_reviews_multi", "de", split="train")
ds = ds.shuffle(seed=42).select(range(5_000))

# Add English reviews for cross-language comparison
ds_en = load_dataset("mteb/amazon_reviews_multi", "en", split="train")
ds_en = ds_en.shuffle(seed=43).select(range(5_000))

Step 2 — Define the schema

schema, index_options = (
    s.SchemaBuilder(additional_properties=False)
    .add("review_id", s.string(required=True, filterable=True))
    .add("language", s.string(filterable=True))
    .add("product_category", s.string(filterable=True))
    .add("stars", s.integer(filterable=True))
    .add("review_text", s.string(filterable=False))
    .add("date", s.datetime(filterable=True))
    .build()
)

language and stars are filterable so you can compare topics across languages and ratings. review_text is the semantic field the embedding is computed over.

Step 3 — Ingest records

# Real embedding (recommended for production)
# from sentence_transformers import SentenceTransformer
# model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
# def embed(text: str) -> list[float]:
#     return model.encode(text, normalize_embeddings=True).tolist()

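import hashlib

def fake_embed(text: str) -> list[float]:
    # Seed from a stable digest: the built-in hash() is salted per process,
    # so it would produce different vectors on every run.
    seed = int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest()[:4], "big")
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(DIM).astype("float32")
    v /= np.linalg.norm(v) + 1e-12
    return v.tolist()

# Random vectors carry no meaning, so topics found with fake_embed are not
# interpretable; swap to the real embedder in production.
embed = fake_embed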

with eigenlake.connect(
    url="https://api.eigenlake.dev",
    api_key="<sk_sbx_your_api_key_here>",
) as client:
    idx = client.indexes.create_or_get(
        namespace="cookbook-04",
        index="multilingual-reviews",
        dimensions=DIM,
        schema=schema,
        index_options=index_options,
    )

    payload = []
    for r in ds:
        payload.append({
            "properties": {
                "review_id": str(r["id"]),
                "language": "de",
                "product_category": str(r["label"]),  # schema declares a string
                "stars": int(r["label"]),  # using label as proxy; real data has stars
                "review_text": r["text"][:1000],
                "date": "2025-01-01T00:00:00Z",
            },
            "vector": embed(r["text"]),
        })
    for r in ds_en:
        payload.append({
            "properties": {
                "review_id": str(r["id"]),
                "language": "en",
                "product_category": str(r["label"]),  # schema declares a string
                "stars": int(r["label"]),  # using label as proxy; real data has stars
                "review_text": r["text"][:1000],
                "date": "2025-01-01T00:00:00Z",
            },
            "vector": embed(r["text"]),
        })

    result = idx.records.add_many(payload, on_error="continue")
    print(f"inserted {len(result)} of {len(payload)} reviews")

Step 4 — Discover topics

    topics = idx.search.topics(
        filter={"language": {"$eq": "de"}},
        limit=10_000,
        text_fields=["review_text"],
        metadata_fields=["stars", "product_category"],
        min_topics=5,
        max_topics=15,
        top_terms=10,
        label_mode="keywords",  # deterministic; use "llm" for human-readable labels
    )

    for t in topics["topics"]:
        print(f"\n{t['label']:<40} count={t['count']:<4} coverage={t['text_coverage']:.2f}")
        print(f"  terms: {', '.join(term['term'] for term in t['terms'][:8])}")
        # Metadata facets are in the response when metadata_fields is provided

min_topics=5 and max_topics=15 define the search range. EigenLake evaluates each candidate on a deterministic sample, computes the cosine silhouette score, and returns the best number of topics.
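The selection step can be sketched locally. The following is a minimal illustration of choosing a topic count by cosine silhouette, not EigenLake's implementation: the function name cosine_silhouette, the toy data, and the hand-made candidate labelings are all assumptions made for the example.

```python
import numpy as np

def cosine_silhouette(X, labels):
    """Mean silhouette score using cosine distance (needs >= 2 clusters)."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    D = 1.0 - X @ X.T                      # pairwise cosine distance
    np.fill_diagonal(D, 0.0)
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    scores = np.zeros(len(labels))
    for i in range(len(labels)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own <= 1:
            continue                       # singleton clusters score 0
        a = D[i, own].sum() / (n_own - 1)  # mean distance within own cluster
        b = min(D[i, labels == c].mean()   # mean distance to nearest other cluster
                for c in clusters if c != labels[i])
        denom = max(a, b)
        scores[i] = (b - a) / denom if denom > 0 else 0.0
    return float(scores.mean())

# Toy data: three tight groups around orthogonal directions.
rng = np.random.default_rng(0)
X = np.vstack([c + 0.05 * rng.standard_normal((20, 3)) for c in np.eye(3)])
true3 = np.repeat([0, 1, 2], 20)
merged2 = np.where(true3 == 2, 1, true3)   # a worse labeling with only 2 clusters

# Hypothetical candidates: topic count -> labeling produced for that count.
candidates = {2: merged2, 3: true3}
scores = {k: cosine_silhouette(X, lab) for k, lab in candidates.items()}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

In a real search each candidate labeling would come from clustering a fixed sample with that topic count; fixing the sample and the random seed is what makes the chosen count repeatable.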

text_fields=["review_text"] is required for keyword output. The c-TF-IDF terms are computed from the text fields you list. If you omit it, the workload can still cluster by embedding but cannot produce human-readable keywords.

metadata_fields=["stars", "product_category"] adds explanatory counts per topic. You might discover that Topic 3 is 80% 1-star reviews and 60% electronics — a signal that the topic represents quality complaints in that category. Metadata does not affect topic membership; it only annotates the result.

label_mode="keywords" returns a stable, deterministic label derived from the top c-TF-IDF terms. Use this when you need reproducible topic names across runs. Switch to "llm" for customer-facing reports where a short sentence like "Shipping delays in electronics" is more useful than a keyword list.

Step 5 — LLM labels for presentation

    topics_llm = idx.search.topics(
        filter={"language": {"$eq": "de"}},
        limit=10_000,
        text_fields=["review_text"],
        min_topics=5,
        max_topics=15,
        label_mode="llm",
    )

    for t in topics_llm["topics"][:3]:
        print(f"LLM label: {t['label']}")
        print(f"Keyword fallback: {t.get('keyword_label', 'n/a')}")

LLM labeling is a presentation layer. Topic membership, centroids, and c-TF-IDF terms are computed first and never changed by the LLM. If the LLM is unavailable, label falls back to keyword_label automatically.

Agent path

    agent_result = idx.agent.query(
        "What are the main complaint themes in German marketplace reviews?"
    )
    print(agent_result["action"])   # "filter"
    print(agent_result["filter"])   # {"language": {"$eq": "de"}, ...}

    # Apply the inferred filter, then run topic modeling directly
    topics = idx.search.topics(
        filter=agent_result["filter"],
        limit=10_000,
        text_fields=["review_text"],
        metadata_fields=["stars"],
        min_topics=5,
        max_topics=15,
    )

Agent mode infers the language filter from the natural-language query. You then run the topic modeling workload directly on the filtered subset. This two-step pattern is the current SDK behavior.

What is happening, line by line

Spherical k-means. EigenLake uses spherical k-means with cosine similarity, which respects the geometry of normalized embedding spaces. Standard k-means assumes Euclidean space and can produce elongated or unbalanced clusters when applied to high-dimensional unit vectors. Spherical k-means constrains centroids to the unit sphere, producing clusters that are interpretable as semantic cones.
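To make the idea concrete, here is a minimal NumPy sketch of spherical k-means. It is an illustration under stated assumptions, not the code EigenLake runs: the function name, the random initialization, and the iteration cap are choices made for this example.

```python
import numpy as np

def spherical_kmeans(X, k, n_iter=50, seed=0):
    """Cluster rows of X by cosine similarity; centroids stay on the unit sphere."""
    rng = np.random.default_rng(seed)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    # Assumption: initialize from k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # Assignment step: highest cosine similarity (dot product of unit vectors).
        new_labels = np.argmax(X @ centroids.T, axis=1)
        if np.array_equal(new_labels, labels):
            break                          # assignments stopped changing
        labels = new_labels
        # Update step: sum the members, then project back onto the sphere.
        for j in range(k):
            members = X[labels == j]
            if len(members) == 0:
                continue                   # keep the old centroid for an empty cluster
            total = members.sum(axis=0)
            centroids[j] = total / (np.linalg.norm(total) + 1e-12)
    return labels, centroids

rng = np.random.default_rng(1)
X = np.vstack([
    np.array([1.0, 0.0, 0.0]) + 0.05 * rng.standard_normal((30, 3)),
    np.array([0.0, 1.0, 0.0]) + 0.05 * rng.standard_normal((30, 3)),
])
labels, centroids = spherical_kmeans(X, k=2)
print(np.linalg.norm(centroids, axis=1))   # each centroid has unit length
```

The only difference from standard k-means is the renormalization in the update step, which keeps every centroid a direction rather than a point inside the sphere.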

Class-based TF-IDF (c-TF-IDF). After clustering, the algorithm computes term frequency within each topic and inverse document frequency across topics. Words that are frequent in one topic and rare in others rise to the top. This is why the terms are distinguishing — they tell you what makes Topic 3 different from Topic 7, not just what Topic 3 is about in absolute terms.
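A small sketch of the scoring follows. It assumes one common c-TF-IDF formulation, tf(term, topic) × log(1 + A / f(term)), where A is the average number of words per topic and f is the term's total frequency across topics. EigenLake's exact weighting is not stated in this recipe, and the whitespace tokenizer and toy reviews are placeholders.

```python
import math
from collections import Counter

def ctfidf(docs_by_topic):
    """Return {topic: {term: score}} using a class-based TF-IDF weighting."""
    # Term counts per topic: all documents of a topic are treated as one class.
    tf = {
        topic: Counter(w for doc in docs for w in doc.lower().split())
        for topic, docs in docs_by_topic.items()
    }
    total = Counter()
    for counts in tf.values():
        total.update(counts)
    avg_words = sum(total.values()) / len(tf)   # A: average words per topic
    scores = {}
    for topic, counts in tf.items():
        n_words = sum(counts.values())
        scores[topic] = {
            term: (count / n_words) * math.log(1 + avg_words / total[term])
            for term, count in counts.items()
        }
    return scores

def top_terms(scores, topic, n=2):
    ranked = sorted(scores[topic].items(), key=lambda kv: kv[1], reverse=True)
    return [term for term, _ in ranked[:n]]

# Toy corpus: "package" occurs in both topics, so it is down-weighted.
docs_by_topic = {
    0: ["package arrived late", "late delivery package"],
    1: ["zipper broke fast", "package zipper broke"],
}
scores = ctfidf(docs_by_topic)
print(top_terms(scores, 0), top_terms(scores, 1))
```

Terms shared across topics receive a smaller logarithmic factor, which is why a generic word like "package" ranks below topic-specific words even when it is equally frequent within a topic.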

Deterministic sampling. The topic count selection uses a deterministic sample, so running the same query twice with the same data produces the same topics. This reproducibility is essential for week-over-week comparison.

Variations

Cross-language topic comparison. Run topics separately for language="de" and language="en", then compare the term distributions. A multilingual embedding model aligns the spaces so that "shipping delay" in English and "Verspätung" in German map to the same region. You can validate this by checking whether the same product categories appear in corresponding topics across languages.
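One way to run that validation is to match topics across languages by the similarity of their product-category distributions. The sketch below uses hypothetical facet counts; the dictionary shape is an assumption for illustration and not the documented response format.

```python
import numpy as np

# Hypothetical facet counts: topic label -> {product_category: review count}.
de_topics = {
    "de-0": {"electronics": 60, "apparel": 10, "home": 5},
    "de-1": {"electronics": 5, "apparel": 70, "home": 10},
}
en_topics = {
    "en-0": {"electronics": 8, "apparel": 65, "home": 12},
    "en-1": {"electronics": 55, "apparel": 12, "home": 6},
}

def to_distribution(counts, categories):
    v = np.array([counts.get(c, 0) for c in categories], dtype=float)
    return v / v.sum()

def match_topics(a, b):
    """For each topic in a, find the topic in b with the closest category mix."""
    categories = sorted({c for t in (*a.values(), *b.values()) for c in t})
    matches = {}
    for name_a, counts_a in a.items():
        va = to_distribution(counts_a, categories)
        sims = {}
        for name_b, counts_b in b.items():
            vb = to_distribution(counts_b, categories)
            sims[name_b] = float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))
        best = max(sims, key=sims.get)
        matches[name_a] = (best, round(sims[best], 3))
    return matches

matches = match_topics(de_topics, en_topics)
print(matches)
```

A high similarity between matched topics is only supporting evidence; reading a few reviews from each matched pair remains the more direct check.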

Star-rating correlation. After topic modeling, aggregate stars per topic. Topics with average stars < 2.0 are priority investigation targets. Topics with average stars > 4.0 are praise themes you can amplify in marketing.
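The aggregation itself is a weighted mean over the per-topic star counts. The sketch below assumes the counts have already been extracted into a plain dictionary; the topic names and numbers are invented, and the shape of the facet data in the real response may differ.

```python
# Hypothetical per-topic star counts: {stars: number of reviews}.
star_facets = {
    "topic-0": {1: 80, 2: 10, 3: 5, 4: 3, 5: 2},
    "topic-1": {1: 2, 2: 3, 3: 10, 4: 25, 5: 60},
    "topic-2": {1: 10, 2: 20, 3: 40, 4: 20, 5: 10},
}

def average_stars(counts):
    total = sum(counts.values())
    return sum(stars * n for stars, n in counts.items()) / total

def triage(facets, low=2.0, high=4.0):
    """Bucket topics using the thresholds from the text above."""
    result = {}
    for topic, counts in facets.items():
        avg = average_stars(counts)
        if avg < low:
            bucket = "investigate"
        elif avg > high:
            bucket = "praise"
        else:
            bucket = "neutral"
        result[topic] = (round(avg, 2), bucket)
    return result

report = triage(star_facets)
print(report)
```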

Auto-tagging pipeline. Write the topic_id back to each record:

for t in topics["topics"]:
    # In production, iterate through topic assignments and update each record
    pass

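Then filter by topic_id in downstream searches instead of re-running topic modeling every time. The schema in Step 2 sets additional_properties=False, so you will likely need to declare a filterable topic_id property there before writing it back.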

Weekly topic evolution. Run topic modeling every Monday on the last 7 days of reviews. Store the topic distributions in a summary index. After four weeks, use search.temporal_shift to detect which topics are emerging, growing, or shrinking.

Chunk long documents. The topic modeling workload assigns one topic per embedding. If a 5,000-word document covers three themes, chunk it into 500-word pieces before indexing. Each chunk gets its own topic assignment, and you can reconstruct multi-topic documents in post-processing.
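A minimal chunking sketch, assuming whitespace-delimited words and no overlap between chunks. The doc_id and chunk_index fields are hypothetical names used to reassemble documents later; they are not part of the schema in Step 2.

```python
def chunk_words(text, size=500):
    """Split text into consecutive pieces of at most `size` words."""
    if size <= 0:
        raise ValueError("size must be positive")
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def chunk_records(doc_id, text, size=500):
    """Build one record per chunk, keeping a pointer back to the source document."""
    return [
        {"doc_id": doc_id, "chunk_index": i, "review_text": chunk}
        for i, chunk in enumerate(chunk_words(text, size))
    ]

# A synthetic 1,200-word document yields chunks of 500, 500, and 200 words.
long_text = " ".join(f"w{i}" for i in range(1200))
records = chunk_records("doc-1", long_text)
print(len(records), [len(r["review_text"].split()) for r in records])
```

After topic modeling, grouping the chunk-level topic assignments by doc_id gives the set of topics each original document touches.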

What to read next

Cookbook 1: Semantic Search at Scale (8 minutes, intermediate)

Use filtered nearest-neighbour search, cursor pagination, and search-unit economics to retrieve relevant records from a corpus of tens of thousands of consumer complaints.

Cookbook 2: Clustering in Depth (10 minutes, intermediate)

Discover natural groupings in large vector datasets with DBSCAN and k-means, from protein sequences to support tickets, without exporting data to a separate analytics stack.

Cookbook 3: Anomaly Detection in Depth (10 minutes, intermediate)

Surface unusual sensor events, operational records, and outlier folds with Local Outlier Factor over vector embeddings, without pre-defining what normal looks like.