How do I detect temporal shifts in EigenLake?

Define a baseline window and a current window, run topic discovery independently in each, align similar topics across the two windows, and rank the shifts by kind and magnitude.

Cookbook 5: Temporal Shift in Depth

Temporal shift is the workload that answers "what changed?" This recipe uses scientific abstracts from the PubMed corpus because the evolution of research themes is concrete — CRISPR emerges, RNA interference shrinks — but the same primitives apply to support tickets, fraud patterns, or any time-stamped embedded corpus.

Problem

You have 10,000 scientific abstracts published over six months and you want to know which research themes emerged, which grew, which shrank, and which restructured between Q1 and Q2. You want this comparison per research category (immunology, oncology, neuroscience) so that shifts in one field do not drown out signals in another.

Prerequisites

Python 3.10 or newer.
pip install eigenlake numpy datasets.
An API key from https://api.eigenlake.dev.

The recipe

Step 1 — Load the dataset

from datasets import load_dataset
import numpy as np
import eigenlake
from eigenlake import schema as s

DIM = 384

# PubMed scientific abstracts: https://huggingface.co/datasets/scientific_papers
ds = load_dataset("scientific_papers", "pubmed", split="train")
ds = ds.shuffle(seed=42).select(range(10_000))

Step 2 — Define the schema

schema, index_options = (
    s.SchemaBuilder(additional_properties=False)
    .add("paper_id", s.string(required=True, filterable=True))
    .add("title", s.string(filterable=True))
    .add("abstract", s.string(filterable=False))
    .add("category", s.string(filterable=True))      # immunology, oncology, neuroscience
    .add("year", s.integer(filterable=True))
    .add("month", s.integer(filterable=True))
    .build()
)

year and month are filterable so you can define arbitrary time windows. category is the group_by field that keeps comparisons independent across research areas.

Step 3 — Ingest records with synthetic dates

The PubMed dataset includes neither granular dates nor research categories, so this recipe assigns each record a synthetic month and a randomly chosen category for demonstration. In production, use the actual submitted_date or published_date and the real category from your data.

# Real embedding (recommended for production)
# from sentence_transformers import SentenceTransformer
# model = SentenceTransformer("all-MiniLM-L6-v2")
# def embed(text: str) -> list[float]:
#     return model.encode(text, normalize_embeddings=True).tolist()

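import hashlib

def fake_embed(text: str) -> list[float]:
    # hash() is salted per process; a digest keeps vectors stable across runs.
    seed = int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest()[:4], "big")
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(DIM).astype("float32")
    v /= np.linalg.norm(v) + 1e-12
    return v.tolist()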

embed = fake_embed  # swap to real embedder in production

with eigenlake.connect(
    url="https://api.eigenlake.dev",
    api_key="<sk_sbx_your_api_key_here>",
) as client:
    idx = client.indexes.create_or_get(
        namespace="cookbook-05",
        index="scientific-papers",
        dimensions=DIM,
        schema=schema,
        index_options=index_options,
    )

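    payload = []
    rng = np.random.default_rng(42)  # seeded so category assignment is reproducible
    for i, r in enumerate(ds):
        # Synthetic dates: first 5,000 = Q1 (months 1-3), last 5,000 = Q2 (months 4-6)
        month = 1 + (i % 3) if i < 5_000 else 4 + (i % 3)
        year = 2025
        # Synthetic category; str() turns NumPy's np.str_ into a plain Python string
        category = str(rng.choice(["immunology", "oncology", "neuroscience"]))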

        text = r["abstract"][:2000] if r["abstract"] else r["article"][:2000]
        payload.append({
            "properties": {
                "paper_id": str(i),
                # scientific_papers rows have no title field; reuse the start of the text
                "title": text.strip()[:200],
                "abstract": text,
                "category": category,
                "year": year,
                "month": month,
            },
            "vector": embed(text),
        })

    result = idx.records.add_many(payload, on_error="continue")
    print(f"inserted {len(result)} of {len(payload)} papers")
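The windows in the next step are ISO 8601 timestamps, so one option is to store a synthetic timestamp string next to the integer month. The following is a minimal sketch under stated assumptions: synthetic_published_date is a hypothetical helper, and the field you would store its result in (for example published_date) is not part of the schema above. How EigenLake types datetime fields is not covered by this recipe.

```python
from datetime import datetime, timezone

def synthetic_published_date(i: int, year: int = 2025) -> str:
    """Map record position i to an ISO 8601 UTC timestamp (hypothetical helper)."""
    # Same split as the recipe: the first 5,000 records land in Q1, the rest in Q2.
    month = 1 + (i % 3) if i < 5_000 else 4 + (i % 3)
    # Spread records over days 1-28 so the date is valid in every month.
    day = 1 + (i % 28)
    stamp = datetime(year, month, day, tzinfo=timezone.utc)
    return stamp.strftime("%Y-%m-%dT%H:%M:%SZ")
```

The output uses the same format as the window bounds, so the two can be compared as timestamps rather than as an integer against a date string.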

Step 4 — Run temporal shift

    shifts = idx.search.temporal_shift(
        baseline={
            "start": "2025-01-01T00:00:00Z",
            "end": "2025-03-31T23:59:59Z",
        },
        current={
            "start": "2025-04-01T00:00:00Z",
            "end": "2025-06-30T23:59:59Z",
        },
        timestamp_field="month",  # In production, use a datetime field like "submitted_date"
        filter={"year": {"$eq": 2025}},
        group_by=["category"],
        limit_per_window=5_000,
        min_clusters=3,
        max_clusters=12,
        min_relative_shift=0.20,
        min_count_shift=5,
        text_fields=["abstract"],
        metadata_fields=["year"],
        summary_mode="deterministic",
    )

    for shift in shifts["shifts"][:10]:
        print(f"\n{shift['kind']:<12} {shift['direction']:<12} score={shift['score']:.4f}")
        print(f"  label: {shift['label']}")
        print(f"  explanation: {shift['explanation']}")

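baseline and current define the two windows. The API selects records whose timestamp_field falls within each range. This recipe passes the integer month field only because the demo data has no real dates; with a real corpus, point timestamp_field at a datetime field so that its values are directly comparable to the ISO 8601 bounds.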

group_by=["category"] runs independent comparisons per research area. Immunology shifts are ranked against immunology baselines; oncology shifts against oncology baselines. Without group_by, a large shift in one dominant category can drown out smaller but important shifts in others.

min_relative_shift=0.20 requires a 20% change in topic prevalence to qualify as growing or shrinking. min_count_shift=5 requires an absolute change of at least 5 records. Together they suppress noise from tiny topics.
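The interaction of the two thresholds can be made concrete with a little arithmetic. This sketch assumes that "relative shift" means the change in a topic's share of its window divided by its baseline share; the exact formula the API uses is not stated here, and passes_thresholds is a hypothetical helper.

```python
def passes_thresholds(
    baseline_count: int,
    baseline_total: int,
    current_count: int,
    current_total: int,
    min_relative_shift: float = 0.20,
    min_count_shift: int = 5,
) -> bool:
    """Return True when a topic's change clears both noise thresholds."""
    if baseline_count == 0:
        raise ValueError("topic is absent from the baseline; treat it as emerging")
    baseline_share = baseline_count / baseline_total
    current_share = current_count / current_total
    # Relative change in prevalence (assumed definition).
    relative_shift = abs(current_share - baseline_share) / baseline_share
    # Absolute change in record count.
    count_shift = abs(current_count - baseline_count)
    return relative_shift >= min_relative_shift and count_shift >= min_count_shift
```

Under this definition, a topic going from 4 to 8 records out of 1,000 doubles in prevalence but changes by only 4 records, so the count threshold filters it out. A topic going from 500 to 540 records changes by 40 records but only 8% in prevalence, so the relative threshold filters it out.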

summary_mode="deterministic" returns stable, reproducible labels and explanations. Use "llm" for customer-facing reports where a narrative summary is more useful than a structured signal list.

Step 5 — Read the shift kinds

Each shift has a kind and a direction:

Kind	Direction	Meaning
emerging	new	Topic appeared in current window, absent in baseline
growing	up	Topic grew by more than min_relative_shift
shrinking	down	Topic shrank by more than min_relative_shift
drift	shifted	Topic persisted but its semantic center moved
restructured	restructured	Topic merged, split, or recombined

Read baseline_topics and current_topics for the topic IDs, counts, and similarity scores that support the shift classification.
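For a quick overview of a response, it can help to bucket shifts by kind before reading individual entries. The sketch below relies only on the kind and score keys printed in Step 4. The sample response is made up for illustration, and group_shifts_by_kind is a hypothetical helper, not part of the client.

```python
from collections import defaultdict

def group_shifts_by_kind(response: dict) -> dict[str, list[dict]]:
    """Bucket shifts by kind, strongest score first within each bucket."""
    grouped: dict[str, list[dict]] = defaultdict(list)
    for shift in response["shifts"]:
        grouped[shift["kind"]].append(shift)
    for items in grouped.values():
        items.sort(key=lambda s: s["score"], reverse=True)
    return dict(grouped)

# Made-up response for illustration only.
sample = {
    "shifts": [
        {"kind": "growing", "direction": "up", "score": 0.41, "label": "topic A"},
        {"kind": "emerging", "direction": "new", "score": 0.87, "label": "topic B"},
        {"kind": "growing", "direction": "up", "score": 0.63, "label": "topic C"},
    ]
}

for kind, items in group_shifts_by_kind(sample).items():
    print(kind, [s["label"] for s in items])
```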

Agent path

    agent_result = idx.agent.query(
        "What changed in immunology submissions between Q1 and Q2 2025?"
    )
    print(agent_result["action"])   # "filter"
    print(agent_result["filter"])   # {"category": {"$eq": "immunology"}, ...}

    # Apply the inferred filter, then run temporal_shift directly
    shifts = idx.search.temporal_shift(
        baseline={"start": "2025-01-01T00:00:00Z", "end": "2025-03-31T23:59:59Z"},
        current={"start": "2025-04-01T00:00:00Z", "end": "2025-06-30T23:59:59Z"},
        timestamp_field="month",
        filter=agent_result["filter"],
        limit_per_window=5_000,
        min_relative_shift=0.20,
        summary_mode="llm",
    )

Agent mode infers the category and date filters from natural language. The temporal shift workload is then run directly on the filtered subset.

What is happening, line by line

Independent topic discovery per window. The API runs spherical k-means separately on the baseline snapshot and the current snapshot. This ensures that topics are defined by the data in each window, not constrained by an external taxonomy.
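To make the clustering step concrete, here is a textbook spherical k-means in NumPy. It is a sketch of the general technique, not the service's implementation. The farthest-first initialization, the fixed iteration count, and the function name are all assumptions chosen for brevity.

```python
import numpy as np

def spherical_kmeans(x, k: int, iters: int = 20, seed: int = 0):
    """Cluster rows of x by cosine similarity; returns (unit centroids, labels)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    x = x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-12)

    # Farthest-first initialization: start from a random point, then keep adding
    # the point whose best cosine similarity to the chosen centroids is lowest.
    chosen = [int(rng.integers(len(x)))]
    while len(chosen) < k:
        best_sim = (x @ x[chosen].T).max(axis=1)
        chosen.append(int(np.argmin(best_sim)))
    centroids = x[chosen].copy()

    for _ in range(iters):
        # Assign each vector to the centroid with the highest cosine similarity.
        labels = np.argmax(x @ centroids.T, axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:
                # Re-project the summed members onto the unit sphere.
                total = members.sum(axis=0)
                centroids[j] = total / (np.linalg.norm(total) + 1e-12)
    labels = np.argmax(x @ centroids.T, axis=1)
    return centroids, labels
```

Running this once on the baseline vectors and once on the current vectors yields two unrelated sets of centroids, which is why a separate alignment step is needed afterwards.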

Cross-window topic alignment. After discovery, the API aligns similar topics across the two windows using cosine similarity between topic centroids. An aligned pair is the same theme at two points in time. An unaligned baseline topic is a disappearing theme. An unaligned current topic is an emerging theme.
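The alignment step can be sketched as a greedy one-to-one matching on the centroid similarity matrix. The text above only states that cosine similarity between centroids is used. The greedy strategy, the 0.7 cutoff, and the name align_topics are assumptions made for illustration.

```python
import numpy as np

def align_topics(baseline_centroids, current_centroids, min_similarity: float = 0.7):
    """Greedily pair baseline and current topics by centroid cosine similarity."""
    b = np.asarray(baseline_centroids, dtype=float)
    c = np.asarray(current_centroids, dtype=float)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    c = c / np.linalg.norm(c, axis=1, keepdims=True)
    sim = b @ c.T  # sim[i, j] = cosine similarity of baseline i and current j

    pairs, used_b, used_c = [], set(), set()
    # Visit candidate pairs from most to least similar.
    for flat in np.argsort(-sim, axis=None):
        i, j = divmod(int(flat), sim.shape[1])
        if sim[i, j] < min_similarity:
            break  # everything after this is below the cutoff
        if i in used_b or j in used_c:
            continue
        pairs.append((i, j, float(sim[i, j])))
        used_b.add(i)
        used_c.add(j)

    unmatched_baseline = [i for i in range(len(b)) if i not in used_b]
    unmatched_current = [j for j in range(len(c)) if j not in used_c]
    return pairs, unmatched_baseline, unmatched_current
```

A strict one-to-one matching like this cannot represent merges or splits, so a restructured topic would need a many-to-many comparison that this sketch leaves out.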

Shift classification. The API classifies each aligned or unaligned topic pair using relative prevalence, absolute count change, and centroid drift. The result is a structured signal list, not a single number.

Variations

Weekly scheduling. Wrap the temporal shift call in a cron job that runs every Monday morning:

# Sketch of scheduled drift detection. The *_start and *_end values are
# ISO 8601 strings you compute on each run; alert_team is your own notification hook.
shifts = idx.search.temporal_shift(
    baseline={"start": last_week_start, "end": last_week_end},
    current={"start": this_week_start, "end": this_week_end},
    # ... remaining arguments as in Step 4
)
for shift in shifts["shifts"]:
    if shift["kind"] == "emerging" and shift["score"] > 0.8:
        alert_team(shift["label"], shift["explanation"])
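The window bounds for a scheduled run have to be computed somewhere. The following is a minimal sketch, assuming a Monday-morning job should compare the two most recently completed Monday-to-Sunday weeks in UTC, since the week in progress would hold only a few hours of data. weekly_windows is a hypothetical helper; adjust the offsets if your job should include the current partial week.

```python
from datetime import datetime, timedelta, timezone

ISO_FORMAT = "%Y-%m-%dT%H:%M:%SZ"

def weekly_windows(now: datetime) -> tuple[dict, dict]:
    """Return (baseline, current) windows for the two full weeks before `now`."""
    now = now.astimezone(timezone.utc)
    # Monday 00:00 UTC of the week that contains `now`.
    monday = (now - timedelta(days=now.weekday())).replace(
        hour=0, minute=0, second=0, microsecond=0
    )

    def window(start: datetime) -> dict:
        # Inclusive end bound, matching the 23:59:59 style used in Step 4.
        end = start + timedelta(days=7) - timedelta(seconds=1)
        return {"start": start.strftime(ISO_FORMAT), "end": end.strftime(ISO_FORMAT)}

    return window(monday - timedelta(days=14)), window(monday - timedelta(days=7))

baseline, current = weekly_windows(datetime.now(timezone.utc))
```

The two dictionaries have the same start and end keys as the baseline and current arguments shown in Step 4.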

Multi-granularity comparison. Run daily shifts for operational alerts, weekly shifts for tactical review, and monthly shifts for strategic planning. The same API call with different window sizes gives you three different temporal lenses.

Correlation with external metrics. After detecting a growing topic, correlate its growth with an external metric — citation counts, stock prices, or support-ticket volume. The shift tells you what changed; the external metric helps you understand why.
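A first pass at such a comparison is a Pearson correlation between a topic's per-period counts and the external series. The sketch below uses NumPy's corrcoef; the numbers are made up for illustration, and the helper name is hypothetical. A high coefficient is a prompt for investigation, not evidence of cause.

```python
import numpy as np

def pearson(topic_counts, external_metric) -> float:
    """Pearson correlation between two equally long series."""
    a = np.asarray(topic_counts, dtype=float)
    b = np.asarray(external_metric, dtype=float)
    if a.shape != b.shape:
        raise ValueError("series must have the same length")
    return float(np.corrcoef(a, b)[0, 1])

# Made-up weekly values for illustration only.
weekly_topic_counts = [12, 15, 21, 30, 44, 52]
weekly_external_metric = [100, 104, 130, 160, 210, 240]
r = pearson(weekly_topic_counts, weekly_external_metric)
print(f"correlation: {r:.2f}")
```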

Per-tenant drift. If your index holds data for multiple customers, use group_by=["tenant_id"] to run independent drift detection per tenant. One tenant's emerging fraud pattern is another tenant's stable baseline.

Alert pipeline. Write shift signals to a downstream index:

for shift in shifts["shifts"]:
    if shift["kind"] in ("emerging", "growing"):
        # Write to alert index
        pass
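The body of that loop could build records in the same properties shape used in Step 3. This is a sketch under stated assumptions: build_alert_records is a hypothetical helper, the property names of the alert index are invented for illustration, and each record would still need a vector (for example an embedding of the label) before it can be written with add_many.

```python
def build_alert_records(
    response: dict,
    kinds: tuple[str, ...] = ("emerging", "growing"),
    min_score: float = 0.0,
) -> list[dict]:
    """Turn qualifying shifts into record payloads for a downstream alert index."""
    records = []
    for shift in response["shifts"]:
        if shift["kind"] not in kinds or shift["score"] < min_score:
            continue
        records.append({
            "properties": {
                "kind": shift["kind"],
                "direction": shift["direction"],
                "label": shift["label"],
                "explanation": shift["explanation"],
                "score": float(shift["score"]),
            },
            # A "vector" entry must be added before ingesting, as in Step 3.
        })
    return records
```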
