Cookbook 4: Topic Modeling in Depth
Discover emergent themes in multilingual text, scientific literature, or operational notes without manual labeling — using spherical k-means, c-TF-IDF, and optional LLM-generated labels.
Answer
How do I model topics in EigenLake?
Filter a snapshot, run spherical k-means over the embeddings, derive distinguishing terms with class-based TF-IDF, and receive human-readable topic labels with per-topic metadata facets.
Inputs
- A corpus of multilingual Amazon product reviews
- Embedding vectors that match the index dimension
- An EigenLake index with language and star-rating metadata
Outputs
- 5–15 topic clusters with c-TF-IDF terms and coverage statistics
- Metadata facets per topic (e.g., 80% of Topic 3 is 1-star reviews)
- Optional LLM-generated human-readable labels
- A repeatable topic-discovery workflow over one index
Topic modeling is the workload that names what you are looking at. This recipe uses multilingual Amazon reviews because the themes are intuitive — shipping complaints, quality issues, sizing problems — but the same primitives apply to scientific abstracts, support tickets, or any corpus where you need to understand the thematic landscape before deciding what to investigate.
Problem
You have 10,000 Amazon reviews in German and English and you want to know what people complain about without reading them all. You also want to know which complaints correlate with 1-star ratings, and you want stable, reproducible topic assignments you can compare week-over-week.
Prerequisites
- Python 3.10 or newer.
- pip install eigenlake numpy datasets
- An API key from https://api.eigenlake.dev.
- A multilingual sentence transformer. The recipe uses paraphrase-multilingual-MiniLM-L12-v2 (384 dimensions) or the fake_embed fallback.
The recipe
Step 1 — Load the dataset
from datasets import load_dataset
import numpy as np
import eigenlake
from eigenlake import schema as s
DIM = 384
# Multilingual Amazon reviews: https://huggingface.co/datasets/mteb/amazon_reviews_multi
ds = load_dataset("mteb/amazon_reviews_multi", "de", split="train")
ds = ds.shuffle(seed=42).select(range(5_000))
# Add English reviews for cross-language comparison
ds_en = load_dataset("mteb/amazon_reviews_multi", "en", split="train")
ds_en = ds_en.shuffle(seed=43).select(range(5_000))
Step 2 — Define the schema
schema, index_options = (
    s.SchemaBuilder(additional_properties=False)
    .add("review_id", s.string(required=True, filterable=True))
    .add("language", s.string(filterable=True))
    .add("product_category", s.string(filterable=True))
    .add("stars", s.integer(filterable=True))
    .add("review_text", s.string(filterable=False))
    .add("date", s.datetime(filterable=True))
    .build()
)
language and stars are filterable so you can compare topics across languages and ratings. review_text is the semantic field the embedding is computed over.
Step 3 — Ingest records
# Real embedding (recommended for production)
# from sentence_transformers import SentenceTransformer
# model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
# def embed(text: str) -> list[float]:
#     return model.encode(text, normalize_embeddings=True).tolist()
import hashlib
def fake_embed(text: str) -> list[float]:
    # Seed from a stable digest. The built-in hash() is salted per process for
    # strings, so it would produce different vectors on every run.
    seed = int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest()[:4], "big")
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(DIM).astype("float32")
    v /= np.linalg.norm(v) + 1e-12
    return v.tolist()
embed = fake_embed  # swap to real embedder in production
with eigenlake.connect(
    url="https://api.eigenlake.dev",
    api_key="<sk_sbx_your_api_key_here>",
) as client:
    idx = client.indexes.create_or_get(
        namespace="cookbook-04",
        index="multilingual-reviews",
        dimensions=DIM,
        schema=schema,
        index_options=index_options,
    )
    payload = []
    for r in ds:
        payload.append({
            "properties": {
                "review_id": str(r["id"]),
                "language": "de",
                "product_category": str(r["label"]),  # schema declares a string
                "stars": int(r["label"]),  # using label as proxy; real data has stars
                "review_text": r["text"][:1000],
                "date": "2025-01-01T00:00:00Z",
            },
            "vector": embed(r["text"]),
        })
    for r in ds_en:
        payload.append({
            "properties": {
                "review_id": str(r["id"]),
                "language": "en",
                "product_category": str(r["label"]),  # schema declares a string
                "stars": int(r["label"]),
                "review_text": r["text"][:1000],
                "date": "2025-01-01T00:00:00Z",
            },
            "vector": embed(r["text"]),
        })
    result = idx.records.add_many(payload, on_error="continue")
    print(f"inserted {len(result)} of {len(payload)} reviews")
Step 4 — Discover topics
topics = idx.search.topics(
    filter={"language": {"$eq": "de"}},
    limit=10_000,
    text_fields=["review_text"],
    metadata_fields=["stars", "product_category"],
    min_topics=5,
    max_topics=15,
    top_terms=10,
    label_mode="keywords",  # deterministic; use "llm" for human-readable labels
)
for t in topics["topics"]:
    print(f"\n{t['label']:<40} count={t['count']:<4} coverage={t['text_coverage']:.2f}")
    print(f"  terms: {', '.join(term['term'] for term in t['terms'][:8])}")
    # Metadata facets are in the response when metadata_fields is provided
min_topics=5 and max_topics=15 define the search range. EigenLake evaluates each candidate topic count in that range on a deterministic sample, computes the cosine silhouette score for each, and returns the topics for the highest-scoring count.
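To make the selection criterion concrete, here is a minimal NumPy sketch of a cosine silhouette score and of choosing the best-scoring candidate. It is an illustration under stated assumptions, not EigenLake's implementation: the function name cosine_silhouette, the toy vectors, and the two candidate labelings are all invented for this example. In the real workload each candidate would be a clustering at a different topic count.

```python
import numpy as np

def cosine_silhouette(X: np.ndarray, labels: np.ndarray) -> float:
    """Mean silhouette score under cosine distance (1 - cosine similarity)."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    dist = 1.0 - X @ X.T  # pairwise cosine distances
    clusters = np.unique(labels)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = int(same.sum())
        if n_same == 1:
            continue  # singleton clusters score 0 by convention
        a = dist[i, same].sum() / (n_same - 1)  # own cluster, self excluded
        b = min(dist[i, labels == c].mean() for c in clusters if c != labels[i])
        denom = max(a, b)
        scores[i] = (b - a) / denom if denom > 0 else 0.0
    return float(scores.mean())

# Toy data (hypothetical): two tight groups of directions in 3-D.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal([1.0, 0.0, 0.0], 0.05, size=(20, 3)),
    rng.normal([0.0, 1.0, 0.0], 0.05, size=(20, 3)),
])
candidates = {
    "matches_structure": np.array([0] * 20 + [1] * 20),
    "ignores_structure": np.arange(40) % 2,
}
scores = {name: cosine_silhouette(X, lab) for name, lab in candidates.items()}
best = max(scores, key=scores.get)
print(best, scores)
```

A score near 1 means points sit much closer to their own cluster than to the nearest other cluster; a score near 0 or below means the partition does not follow the geometry of the embeddings.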
text_fields=["review_text"] is required for keyword output. The c-TF-IDF terms are computed from the fields you list. If you omit it, the workload can still cluster by embedding but cannot produce human-readable keywords.
metadata_fields=["stars", "product_category"] adds explanatory counts per topic. You might discover that Topic 3 is 80% 1-star reviews and 60% electronics — a signal that the topic represents quality complaints in that category. Metadata does not affect topic membership; it only annotates the result.
label_mode="keywords" returns a stable, deterministic label derived from the top c-TF-IDF terms. Use this when you need reproducible topic names across runs. Switch to "llm" for customer-facing reports where a short sentence like "Shipping delays in electronics" is more useful than a keyword list.
Step 5 — LLM labels for presentation
topics_llm = idx.search.topics(
    filter={"language": {"$eq": "de"}},
    limit=10_000,
    text_fields=["review_text"],
    min_topics=5,
    max_topics=15,
    label_mode="llm",
)
for t in topics_llm["topics"][:3]:
    print(f"LLM label: {t['label']}")
    print(f"Keyword fallback: {t.get('keyword_label', 'n/a')}")
LLM labeling is a presentation layer. Topic membership, centroids, and c-TF-IDF terms are computed first and never changed by the LLM. If the LLM is unavailable, label falls back to keyword_label automatically.
Agent path
agent_result = idx.agent.query(
    "What are the main complaint themes in German marketplace reviews?"
)
print(agent_result["action"])  # "filter"
print(agent_result["filter"])  # {"language": {"$eq": "de"}, ...}
# Apply the inferred filter, then run topic modeling directly
topics = idx.search.topics(
    filter=agent_result["filter"],
    limit=10_000,
    text_fields=["review_text"],
    metadata_fields=["stars"],
    min_topics=5,
    max_topics=15,
)
Agent mode infers the language filter from the natural language query. The topic modeling workload is then run directly on the filtered subset. This two-step pattern is the current SDK behavior.
What is happening, line by line
Spherical k-means. EigenLake uses spherical k-means with cosine similarity, which respects the geometry of normalized embedding spaces. Standard k-means assumes Euclidean space and can produce elongated or unbalanced clusters when applied to high-dimensional unit vectors. Spherical k-means constrains centroids to the unit sphere, producing clusters that are interpretable as semantic cones.
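The idea can be sketched in a few lines of NumPy. This is a minimal illustration, not EigenLake's implementation: the function name spherical_kmeans, the random initialization, and the fixed iteration count are assumptions made for the example. The two defining steps are assignment by highest cosine similarity and re-normalizing each centroid back onto the unit sphere.

```python
import numpy as np

def spherical_kmeans(X: np.ndarray, k: int, n_iter: int = 20, seed: int = 0):
    """Cluster rows of X by cosine similarity, keeping centroids unit-length."""
    rng = np.random.default_rng(seed)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    # Initialize centroids from k distinct data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Assignment: nearest centroid by cosine similarity (dot product of unit vectors).
        labels = np.argmax(X @ centroids.T, axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) == 0:
                continue  # keep the previous centroid if a cluster empties
            total = members.sum(axis=0)
            # Update: project the summed direction back onto the unit sphere.
            centroids[j] = total / (np.linalg.norm(total) + 1e-12)
    labels = np.argmax(X @ centroids.T, axis=1)
    return labels, centroids

# Toy data (hypothetical): two narrow cones around orthogonal directions.
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal([1.0, 0.0, 0.0], 0.05, size=(20, 3)),
    rng.normal([0.0, 1.0, 0.0], 0.05, size=(20, 3)),
])
labels, centroids = spherical_kmeans(X, k=2)
print(labels)
print(np.linalg.norm(centroids, axis=1))
```

Because every centroid has unit length, the dot product with a record's embedding is directly its cosine similarity to the topic, which is what makes the "semantic cone" reading possible.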
Class-based TF-IDF (c-TF-IDF). After clustering, the algorithm computes term frequency within each topic and inverse document frequency across topics. Words that are frequent in one topic and rare in others rise to the top. This is why the terms are distinguishing — they tell you what makes Topic 3 different from Topic 7, not just what Topic 3 is about in absolute terms.
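As a rough sketch of the scoring, the example below uses one common c-TF-IDF formulation: term frequency within the topic, multiplied by log(1 + A / f), where A is the average number of words per topic and f is the term's count across all topics. Whether EigenLake uses exactly this weighting is an assumption; the function name c_tf_idf, the whitespace tokenizer, and the toy reviews are invented for illustration.

```python
import math
from collections import Counter

def c_tf_idf(docs_by_topic: dict[int, list[str]], top_n: int = 2) -> dict[int, list[str]]:
    """Return the top_n distinguishing terms per topic."""
    # Treat all documents in a topic as one concatenated "class document".
    tf = {
        topic: Counter(word for doc in docs for word in doc.lower().split())
        for topic, docs in docs_by_topic.items()
    }
    total = Counter()
    for counts in tf.values():
        total.update(counts)
    avg_words = sum(total.values()) / len(tf)  # A: average words per topic
    top_terms = {}
    for topic, counts in tf.items():
        n_words = sum(counts.values())
        scores = {
            word: (count / n_words) * math.log(1 + avg_words / total[word])
            for word, count in counts.items()
        }
        top_terms[topic] = sorted(scores, key=scores.get, reverse=True)[:top_n]
    return top_terms

# Toy corpus (hypothetical): "the" is frequent everywhere, so it is down-weighted.
docs = {
    0: ["the package arrived late", "late delivery and the package was damaged"],
    1: ["the size is too small", "the shirt size runs small"],
}
terms = c_tf_idf(docs)
print(terms)
```

The word "the" has the same raw frequency as "package" and "size" in this corpus, but it appears in both topics, so its inverse-frequency factor is smaller and it drops out of the top terms.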
Deterministic sampling. The topic count selection uses a deterministic sample, so running the same query twice with the same data produces the same topics. This reproducibility is essential for week-over-week comparison.
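The source does not describe how the sample is drawn. One way to get a sample that is stable across runs and independent of record order is to hash each record ID into a bucket, as in the sketch below. The function name deterministic_sample and the ID format are hypothetical.

```python
import hashlib

def deterministic_sample(ids: list[str], fraction: float) -> list[str]:
    """Select roughly `fraction` of ids, identically on every run."""
    def bucket(record_id: str) -> float:
        # Map the ID to a stable number in [0, 1) using a cryptographic digest.
        digest = hashlib.sha256(record_id.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64
    return sorted(i for i in ids if bucket(i) < fraction)

ids = [f"review-{i}" for i in range(1000)]
sample = deterministic_sample(ids, 0.2)
same_again = deterministic_sample(list(reversed(ids)), 0.2)
print(len(sample), sample == same_again)
```

A record's membership depends only on its own ID, so adding new records next week does not reshuffle which of the old records are sampled.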
Variations
Cross-language topic comparison. Run topics separately for language="de" and language="en", then compare the term distributions. A multilingual embedding model aligns the spaces so that "shipping delay" in English and "Verspätung" in German map to the same region. You can validate this by checking whether the same product categories appear in corresponding topics across languages.
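Because keyword terms differ between languages, one way to pair topics is to compare their centroids directly. The sketch below assumes you can obtain a centroid vector per topic for each run (the recipe does not show whether the response exposes them); match_topics and the toy centroids are hypothetical.

```python
import numpy as np

def match_topics(centroids_a: np.ndarray, centroids_b: np.ndarray) -> list[tuple[int, int, float]]:
    """For each topic in A, find the most similar topic in B by cosine similarity."""
    a = centroids_a / np.linalg.norm(centroids_a, axis=1, keepdims=True)
    b = centroids_b / np.linalg.norm(centroids_b, axis=1, keepdims=True)
    sim = a @ b.T  # rows: topics in A, columns: topics in B
    best = sim.argmax(axis=1)
    return [(i, int(j), float(sim[i, j])) for i, j in enumerate(best)]

# Toy centroids (hypothetical): German topic 0 points the same way as English topic 1.
de = np.array([[1.0, 0.1, 0.0], [0.0, 1.0, 0.2]])
en = np.array([[0.1, 0.9, 0.1], [0.9, 0.0, 0.1]])
pairs = match_topics(de, en)
for de_topic, en_topic, similarity in pairs:
    print(f"de topic {de_topic} ~ en topic {en_topic} (cosine {similarity:.2f})")
```

Pairs with low similarity are candidates for themes that exist in only one language.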
Star-rating correlation. After topic modeling, aggregate stars per topic. Topics with average stars < 2.0 are priority investigation targets. Topics with average stars > 4.0 are praise themes you can amplify in marketing.
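A minimal sketch of the aggregation, assuming you have already extracted (topic_id, stars) pairs from the topic assignments. The exact response shape is not shown in this recipe, so the input format, the function name mean_stars_by_topic, and the sample values are hypothetical.

```python
from collections import defaultdict

def mean_stars_by_topic(assignments: list[tuple[int, int]]) -> dict[int, float]:
    """Average star rating per topic from (topic_id, stars) pairs."""
    totals = defaultdict(lambda: [0, 0])  # topic_id -> [sum of stars, count]
    for topic_id, stars in assignments:
        totals[topic_id][0] += stars
        totals[topic_id][1] += 1
    return {topic: total / count for topic, (total, count) in totals.items()}

# Hypothetical assignments: topic 3 is mostly 1-star, topic 7 mostly 5-star.
assignments = [(3, 1), (3, 1), (3, 2), (7, 5), (7, 4), (7, 5), (1, 3)]
means = mean_stars_by_topic(assignments)
priority = sorted(t for t, m in means.items() if m < 2.0)  # investigate first
praise = sorted(t for t, m in means.items() if m > 4.0)    # themes to amplify
print(means, priority, praise)
```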
Auto-tagging pipeline. Write the topic_id back to each record:
for t in topics["topics"]:
    # In production, iterate through topic assignments and update each record
    pass
Then filter by topic_id in downstream searches instead of re-running topic modeling every time.
Weekly topic evolution. Run topic modeling every Monday on the last 7 days of reviews. Store the topic distributions in a summary index. After four weeks, use search.temporal_shift to detect which topics are emerging, growing, or shrinking.
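The weekly window can be computed with the standard library, as sketched below. The function name last_week_window is invented, and the range operators $gte and $lt in the filter are assumptions: this recipe only demonstrates $eq, so check the filter reference before relying on them.

```python
from datetime import datetime, timedelta, timezone

def last_week_window(run_at: datetime) -> tuple[str, str]:
    """Return ISO-8601 UTC bounds for the 7 days ending at midnight of run_at."""
    end = run_at.astimezone(timezone.utc).replace(hour=0, minute=0, second=0, microsecond=0)
    start = end - timedelta(days=7)
    fmt = "%Y-%m-%dT%H:%M:%SZ"  # same format as the date property in Step 3
    return start.strftime(fmt), end.strftime(fmt)

# Example run on Monday 2025-01-13 at 09:30 UTC.
start, end = last_week_window(datetime(2025, 1, 13, 9, 30, tzinfo=timezone.utc))
# Hypothetical range filter; the operator names are assumed, not confirmed.
weekly_filter = {"date": {"$gte": start, "$lt": end}}
print(weekly_filter)
```

Using a half-open window (start inclusive, end exclusive) keeps consecutive weeks from double-counting reviews that land exactly on midnight.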
Chunk long documents. The topic modeling workload assigns one topic per embedding. If a 5,000-word document covers three themes, chunk it into 500-word pieces before indexing. Each chunk gets its own topic assignment, and you can reconstruct multi-topic documents in post-processing.
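A minimal sketch of word-based chunking before ingestion. The helper name chunk_words and the parent_id and chunk_index property names are hypothetical; if you use them, they would need to be added to the schema in Step 2.

```python
def chunk_words(text: str, max_words: int = 500) -> list[str]:
    """Split text into consecutive pieces of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

# Toy document (hypothetical): 1,200 placeholder words.
document_id = "doc-001"
text = " ".join(f"w{i}" for i in range(1200))
chunks = chunk_words(text)

# One record per chunk; parent_id lets you regroup topics per document later.
records = [
    {"parent_id": document_id, "chunk_index": i, "review_text": chunk}
    for i, chunk in enumerate(chunks)
]
print([len(c.split()) for c in chunks])
```

Grouping the resulting topic assignments by parent_id gives the set of topics each original document touches.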
What to read next
- Cookbook 5: Temporal Shift in Depth — compare topic distributions across time windows to detect emerging themes.
- Cookbook 6: From Search to Insight — see how topic modeling chains with clustering and anomaly detection in a production pipeline.
- Why We Call It Vector Intelligence, Not Vector Search — the argument for why topic modeling is computation, not retrieval.