Cookbook 5: Temporal Shift in Depth
Compare two time windows and find semantic shifts — emerging themes, growing clusters, and disappearing patterns — without manual trend analysis or external BI tools.
Answer
How do I detect temporal shifts in EigenLake?
Define a baseline window and a current window, run topic discovery independently in each, align similar topics across the two windows, and rank the shifts by kind and magnitude.
Inputs
- A time-stamped corpus of scientific abstracts or operational records
- Embedding vectors that match the index dimension
- A timestamp_field that partitions records into windows
Outputs
- Ranked shifts: emerging, growing, shrinking, drifted, and restructured topics
- Per-shift evidence with baseline and current topic details
- Optional LLM-generated summaries for human-readable reporting
- A repeatable drift-detection workflow over one index
Temporal shift is the workload that answers "what changed?" This recipe uses scientific abstracts from the PubMed corpus because the evolution of research themes is concrete — CRISPR emerges, RNA interference shrinks — but the same primitives apply to support tickets, fraud patterns, or any time-stamped embedded corpus.
Problem
You have 10,000 scientific abstracts published over six months and you want to know which research themes emerged, which grew, which shrank, and which restructured between Q1 and Q2. You want this comparison per research category (immunology, oncology, neuroscience) so that shifts in one field do not drown out signals in another.
Prerequisites
- Python 3.10 or newer.
- pip install eigenlake numpy datasets.
- An API key from https://api.eigenlake.dev.
The recipe
Step 1 — Load the dataset
from datasets import load_dataset
import numpy as np
import eigenlake
from eigenlake import schema as s
DIM = 384
# PubMed scientific abstracts: https://huggingface.co/datasets/scientific_papers
ds = load_dataset("scientific_papers", "pubmed", split="train")
ds = ds.shuffle(seed=42).select(range(10_000))
Step 2 — Define the schema
schema, index_options = (
    s.SchemaBuilder(additional_properties=False)
    .add("paper_id", s.string(required=True, filterable=True))
    .add("title", s.string(filterable=True))
    .add("abstract", s.string(filterable=False))
    .add("category", s.string(filterable=True))  # immunology, oncology, neuroscience
    .add("year", s.integer(filterable=True))
    .add("month", s.integer(filterable=True))
    .build()
)
year and month are filterable so you can define arbitrary time windows. category is the group_by field that keeps comparisons independent across research areas.
Step 3 — Ingest records with synthetic dates
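The PubMed dataset does not include granular dates, so this recipe maps each record to a synthetic month for demonstration. The category labels are likewise assigned at random, so the shifts this recipe reports illustrate the workflow rather than real research trends. In production, use the actual submitted_date or published_date from your data.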
# Real embedding (recommended for production)
# from sentence_transformers import SentenceTransformer
# model = SentenceTransformer("all-MiniLM-L6-v2")
# def embed(text: str) -> list[float]:
#     return model.encode(text, normalize_embeddings=True).tolist()
import zlib

def fake_embed(text: str) -> list[float]:
    # zlib.crc32 is stable across runs; the built-in hash() of a str is salted per process
    rng = np.random.default_rng(zlib.crc32(text.encode("utf-8")))
    v = rng.standard_normal(DIM).astype("float32")
    v /= np.linalg.norm(v) + 1e-12
    return v.tolist()

embed = fake_embed  # swap to real embedder in production
with eigenlake.connect(
    url="https://api.eigenlake.dev",
    api_key="<sk_sbx_your_api_key_here>",
) as client:
    idx = client.indexes.create_or_get(
        namespace="cookbook-05",
        index="scientific-papers",
        dimensions=DIM,
        schema=schema,
        index_options=index_options,
    )
    # Seeded generator so the synthetic category labels are reproducible
    rng = np.random.default_rng(42)
    payload = []
    for i, r in enumerate(ds):
        # Synthetic dates: first 5,000 = Q1 (months 1-3), last 5,000 = Q2 (months 4-6)
        month = 1 + (i % 3) if i < 5_000 else 4 + (i % 3)
        year = 2025
        # Synthetic category, assigned at random for demonstration only
        category = str(rng.choice(["immunology", "oncology", "neuroscience"]))
        text = r["abstract"][:2000] if r["abstract"] else r["article"][:2000]
        payload.append({
            "properties": {
                "paper_id": str(i),
                # Fall back to the start of the text if the row has no title field
                "title": (r.get("title") or text)[:200],
                "abstract": text,
                "category": category,
                "year": year,
                "month": month,
            },
            "vector": embed(text),
        })
    result = idx.records.add_many(payload, on_error="continue")
    print(f"inserted {len(result)} of {len(payload)} papers")
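If you would rather window on a datetime value than on the integer month, one option is to derive an ISO-8601 string from the synthetic year and month and store it next to them. The sketch below only builds that string. `synthetic_timestamp` is a hypothetical helper, and whether an EigenLake schema accepts such a string as a timestamp field is an assumption to check against the schema reference.

```python
from datetime import datetime, timezone


def synthetic_timestamp(year: int, month: int, day: int = 15) -> str:
    """Return an ISO-8601 UTC timestamp for a synthetic mid-month date."""
    dt = datetime(year, month, day, tzinfo=timezone.utc)
    return dt.strftime("%Y-%m-%dT%H:%M:%SZ")


ts = synthetic_timestamp(2025, 4)
print(ts)

# Strings in this fixed format sort chronologically, so a plain comparison
# is enough to check that the value falls inside the Q2 window.
in_q2 = "2025-04-01T00:00:00Z" <= ts <= "2025-06-30T23:59:59Z"
print(in_q2)
```

Using a mid-month day keeps every synthetic record well inside its quarter, so window boundaries never split a month.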
Step 4 — Run temporal shift
shifts = idx.search.temporal_shift(
    baseline={
        "start": "2025-01-01T00:00:00Z",
        "end": "2025-03-31T23:59:59Z",
    },
    current={
        "start": "2025-04-01T00:00:00Z",
        "end": "2025-06-30T23:59:59Z",
    },
    timestamp_field="month",  # In production, use a datetime field like "submitted_date"
    filter={"year": {"$eq": 2025}},
    group_by=["category"],
    limit_per_window=5_000,
    min_clusters=3,
    max_clusters=12,
    min_relative_shift=0.20,
    min_count_shift=5,
    text_fields=["abstract"],
    metadata_fields=["year"],
    summary_mode="deterministic",
)
for shift in shifts["shifts"][:10]:
    print(f"\n{shift['kind']:<12} {shift['direction']:<12} score={shift['score']:.4f}")
    print(f"  label: {shift['label']}")
    print(f"  explanation: {shift['explanation']}")
baseline and current define the two windows. The API selects records whose timestamp_field falls within each range.
group_by=["category"] runs independent comparisons per research area. Immunology shifts are ranked against immunology baselines; oncology shifts against oncology baselines. Without group_by, a large shift in one dominant category can drown out smaller but important shifts in others.
min_relative_shift=0.20 requires a 20% change in topic prevalence to qualify as growing or shrinking. min_count_shift=5 requires an absolute change of at least 5 records. Together they suppress noise from tiny topics.
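To make the two thresholds concrete, here is a minimal sketch of how they might combine. It assumes that prevalence means a topic's share of the records in its window and that both thresholds are inclusive; neither detail is confirmed by the API description above, and `passes_thresholds` is a hypothetical helper, not part of the eigenlake client.

```python
def passes_thresholds(
    baseline_count: int,
    current_count: int,
    baseline_total: int,
    current_total: int,
    min_relative_shift: float = 0.20,
    min_count_shift: int = 5,
) -> bool:
    """Return True if a topic's change clears both thresholds (assumed logic)."""
    baseline_share = baseline_count / baseline_total
    current_share = current_count / current_total
    if baseline_share == 0:
        return False  # a topic absent from the baseline is "emerging", not "growing"
    relative = (current_share - baseline_share) / baseline_share
    count_delta = current_count - baseline_count
    return abs(relative) >= min_relative_shift and abs(count_delta) >= min_count_shift


# 100 -> 130 papers out of 5,000: +30% relative, +30 absolute
print(passes_thresholds(100, 130, 5_000, 5_000))
# 10 -> 13 papers: +30% relative but only +3 absolute, so it is treated as noise
print(passes_thresholds(10, 13, 5_000, 5_000))
# 1,000 -> 1,050 papers: +50 absolute but only +5% relative
print(passes_thresholds(1_000, 1_050, 5_000, 5_000))
```

The second and third cases show why both conditions are useful: the relative threshold alone would flag tiny topics, and the count threshold alone would flag small wobbles in large ones.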
summary_mode="deterministic" returns stable, reproducible labels and explanations. Use "llm" for customer-facing reports where a narrative summary is more useful than a structured signal list.
Step 5 — Read the shift kinds
Each shift has a kind and a direction:
| Kind | Direction | Meaning |
|---|---|---|
| emerging | new | Topic appeared in current window, absent in baseline |
| growing | up | Topic grew by more than min_relative_shift |
| shrinking | down | Topic shrank by more than min_relative_shift |
| drift | shifted | Topic persisted but its semantic center moved |
| restructured | restructured | Topic merged, split, or recombined |
Read baseline_topics and current_topics for the topic IDs, counts, and similarity scores that support the shift classification.
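When a response contains many shifts, it can help to bucket them by kind before reading the evidence. The sketch below works on a plain dictionary shaped like the `shifts["shifts"]` list used in Step 4; the sample entries are invented for illustration and `group_by_kind` is a hypothetical helper.

```python
from collections import defaultdict


def group_by_kind(response: dict) -> dict:
    """Bucket shifts by kind, highest score first within each bucket."""
    grouped = defaultdict(list)
    for shift in response["shifts"]:
        grouped[shift["kind"]].append(shift)
    for kind in grouped:
        grouped[kind].sort(key=lambda s: s["score"], reverse=True)
    return dict(grouped)


# Illustrative data only, not real API output
sample = {
    "shifts": [
        {"kind": "growing", "direction": "up", "score": 0.61, "label": "topic A"},
        {"kind": "emerging", "direction": "new", "score": 0.83, "label": "topic B"},
        {"kind": "growing", "direction": "up", "score": 0.74, "label": "topic C"},
    ]
}

for kind, items in group_by_kind(sample).items():
    print(kind, [item["label"] for item in items])
```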
Agent path
agent_result = idx.agent.query(
    "What changed in immunology submissions between Q1 and Q2 2025?"
)
print(agent_result["action"])  # "filter"
print(agent_result["filter"])  # {"category": {"$eq": "immunology"}, ...}
# Apply the inferred filter, then run temporal_shift directly
shifts = idx.search.temporal_shift(
    baseline={"start": "2025-01-01T00:00:00Z", "end": "2025-03-31T23:59:59Z"},
    current={"start": "2025-04-01T00:00:00Z", "end": "2025-06-30T23:59:59Z"},
    timestamp_field="month",
    filter=agent_result["filter"],
    limit_per_window=5_000,
    min_relative_shift=0.20,
    summary_mode="llm",
)
Agent mode infers the category and date filters from natural language. The temporal shift workload is then run directly on the filtered subset.
What is happening, line by line
Independent topic discovery per window. The API runs spherical k-means separately on the baseline snapshot and the current snapshot. This ensures that topics are defined by the data in each window, not constrained by an external taxonomy.
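For intuition, spherical k-means can be sketched in a few lines of NumPy: vectors and centroids are kept at unit length, and each point is assigned to the centroid with the highest cosine similarity. This is a generic textbook version with random initialization and a fixed iteration count, not EigenLake's implementation; the function name and parameters are illustrative.

```python
import numpy as np


def spherical_kmeans(vectors, k, n_iter=20, seed=0):
    """Cluster unit-normalized vectors by cosine similarity."""
    rng = np.random.default_rng(seed)
    x = np.asarray(vectors, dtype=float)
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    # Start from k distinct data points
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # On unit vectors the dot product equals cosine similarity
        labels = np.argmax(x @ centroids.T, axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members) == 0:
                continue  # keep the previous centroid for an empty cluster
            total = members.sum(axis=0)
            norm = np.linalg.norm(total)
            if norm > 0:
                centroids[j] = total / norm  # project the mean back onto the sphere
    labels = np.argmax(x @ centroids.T, axis=1)
    return labels, centroids


# Two tight groups of directions in 2-D
points = [
    [1.0, 0.0], [0.99, 0.1], [0.98, -0.1],
    [0.0, 1.0], [0.1, 0.99], [-0.1, 0.98],
]
labels, centroids = spherical_kmeans(points, k=2)
print(labels)
```

Running this once per window, on that window's vectors only, mirrors the independence described above: the baseline clusters never influence the current ones.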
Cross-window topic alignment. After discovery, the API aligns similar topics across the two windows using cosine similarity between topic centroids. An aligned pair is the same theme at two points in time. An unaligned baseline topic is a disappearing theme. An unaligned current topic is an emerging theme.
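One simple way to picture the alignment step is a greedy one-to-one match on the centroid similarity matrix. The sketch below is an assumption about how such matching could work, not the API's actual procedure: the `min_similarity` cutoff of 0.7 is arbitrary, and a one-to-one match cannot express merges or splits.

```python
import numpy as np


def align_topics(baseline_centroids, current_centroids, min_similarity=0.7):
    """Greedily pair topics across windows by centroid cosine similarity."""
    b = np.asarray(baseline_centroids, dtype=float)
    c = np.asarray(current_centroids, dtype=float)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    c = c / np.linalg.norm(c, axis=1, keepdims=True)
    sim = b @ c.T  # sim[i, j] = cosine(baseline topic i, current topic j)

    # Visit candidate pairs from most to least similar
    candidates = sorted(
        ((sim[i, j], i, j) for i in range(len(b)) for j in range(len(c))),
        reverse=True,
    )
    pairs, used_b, used_c = [], set(), set()
    for score, i, j in candidates:
        if score < min_similarity:
            break  # everything after this is even less similar
        if i in used_b or j in used_c:
            continue
        pairs.append((i, j, float(score)))
        used_b.add(i)
        used_c.add(j)
    unaligned_baseline = [i for i in range(len(b)) if i not in used_b]
    unaligned_current = [j for j in range(len(c)) if j not in used_c]
    return pairs, unaligned_baseline, unaligned_current


baseline = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
current = [[0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
pairs, gone, new = align_topics(baseline, current)
print(pairs)  # baseline topic 0 pairs with current topic 0
print(gone)   # baseline topics with no partner (disappearing)
print(new)    # current topics with no partner (emerging)
```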
Shift classification. The API classifies each aligned or unaligned topic pair using relative prevalence, absolute count change, and centroid drift. The result is a structured signal list, not a single number.
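The three signals can be combined in a small decision function. The version below is a hedged sketch: the order of the checks, the `max_drift` cutoff, and the "stable" and "disappearing" return values are assumptions made for illustration, and it does not attempt to detect restructured topics.

```python
def classify_shift(
    baseline,
    current,
    similarity,
    min_relative_shift=0.20,
    min_count_shift=5,
    max_drift=0.15,
):
    """Classify one topic pair; baseline or current is None when unaligned.

    Each topic is a dict with "count" (records) and "share" (fraction of window).
    """
    if baseline is None and current is None:
        raise ValueError("at least one topic is required")
    if baseline is None:
        return "emerging"
    if current is None:
        return "disappearing"  # hypothetical label; the kinds table lists none for this case
    relative = (current["share"] - baseline["share"]) / baseline["share"]
    count_delta = current["count"] - baseline["count"]
    if abs(relative) >= min_relative_shift and abs(count_delta) >= min_count_shift:
        return "growing" if relative > 0 else "shrinking"
    if 1.0 - similarity > max_drift:
        return "drift"  # size is steady but the centroid moved
    return "stable"


print(classify_shift(None, {"count": 40, "share": 0.008}, None))
print(classify_shift({"count": 100, "share": 0.02}, {"count": 130, "share": 0.026}, 0.95))
print(classify_shift({"count": 100, "share": 0.02}, {"count": 102, "share": 0.0204}, 0.70))
```

Returning a label per topic pair, rather than one aggregate score, matches the structured signal list described above.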
Variations
Weekly scheduling. Wrap the temporal shift call in a cron job that runs every Monday morning:
# Pseudocode for scheduled drift detection
shifts = idx.search.temporal_shift(
    baseline={"start": last_week_start, "end": last_week_end},
    current={"start": this_week_start, "end": this_week_end},
    ...
)
for shift in shifts["shifts"]:
    if shift["kind"] == "emerging" and shift["score"] > 0.8:
        alert_team(shift["label"], shift["explanation"])
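The window bounds in the pseudocode have to come from somewhere. One possible approach, sketched below, treats "this week" as the seven days ending at run time and "last week" as the seven days before that, formatted in the same ISO-8601 style as Step 4. `rolling_week_windows` is a hypothetical helper, and the rolling-window definition is an assumption; calendar weeks would work equally well.

```python
from datetime import datetime, timedelta, timezone


def iso(dt: datetime) -> str:
    """Format a timezone-aware datetime as an ISO-8601 UTC string."""
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def rolling_week_windows(now: datetime):
    """Return (baseline, current) windows; now must be timezone-aware."""
    this_week_end = now.replace(microsecond=0)
    this_week_start = this_week_end - timedelta(days=7)
    # End the baseline one second before the current window starts
    last_week_end = this_week_start - timedelta(seconds=1)
    last_week_start = this_week_start - timedelta(days=7)
    baseline = {"start": iso(last_week_start), "end": iso(last_week_end)}
    current = {"start": iso(this_week_start), "end": iso(this_week_end)}
    return baseline, current


# Example run time: Monday 16 June 2025, 08:00 UTC
baseline, current = rolling_week_windows(datetime(2025, 6, 16, 8, 0, tzinfo=timezone.utc))
print(baseline)
print(current)
```

The two dictionaries can be passed as the `baseline` and `current` arguments of the scheduled call.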
Multi-granularity comparison. Run daily shifts for operational alerts, weekly shifts for tactical review, and monthly shifts for strategic planning. The same API call with different window sizes gives you three different temporal lenses.
Correlation with external metrics. After detecting a growing topic, correlate its growth with an external metric — citation counts, stock prices, or support-ticket volume. The shift tells you what changed; the external metric helps you understand why.
Per-tenant drift. If your index holds data for multiple customers, use group_by=["tenant_id"] to run independent drift detection per tenant. One tenant's emerging fraud pattern is another tenant's stable baseline.
Alert pipeline. Write shift signals to a downstream index:
for shift in shifts["shifts"]:
    if shift["kind"] in ("emerging", "growing"):
        # Write to alert index
        pass
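Before writing anything downstream, the selected shifts need to be turned into records. The sketch below only builds the property dictionaries from the fields printed in Step 4. The alert index's schema, and the vector each record would need, are not specified in this recipe, so `build_alert_properties` is a hypothetical helper that stops short of the write itself.

```python
def build_alert_properties(shifts, kinds=("emerging", "growing"), min_score=0.0):
    """Select shifts worth alerting on and flatten them into property dicts."""
    alerts = []
    for shift in shifts:
        if shift["kind"] not in kinds or shift["score"] < min_score:
            continue
        alerts.append({
            "kind": shift["kind"],
            "direction": shift["direction"],
            "label": shift["label"],
            "explanation": shift["explanation"],
            "score": shift["score"],
        })
    return alerts


# Illustrative data only, not real API output
sample_shifts = [
    {"kind": "emerging", "direction": "new", "score": 0.9,
     "label": "topic B", "explanation": "absent in baseline"},
    {"kind": "shrinking", "direction": "down", "score": 0.7,
     "label": "topic D", "explanation": "fewer records"},
]
alerts = build_alert_properties(sample_shifts)
print(alerts)
```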
What to read next
- Cookbook 6: From Search to Insight — chain temporal shift with clustering, anomaly detection, and topic modeling into a single investigative pipeline.
- EigenRun: How EigenLake Lets Agents Compute, Not Just Retrieve — the research-backed argument for why temporal shift is a perceptual modality for agents.