EigenLake

Cookbook 5: Temporal Shift in Depth

Compare two time windows and find semantic shifts — emerging themes, growing clusters, and disappearing patterns — without manual trend analysis or external BI tools.

Tags: cookbook, python, temporal-shift, time-series, drift-detection, pubmed

Answer

How do I detect temporal shifts in EigenLake?

Define a baseline window and a current window, run topic discovery independently in each, align similar topics across the two windows, and rank the shifts by kind and magnitude.

Workload: Operational Analytics
Difficulty: intermediate
Estimated time: 10 minutes

Inputs

  • A time-stamped corpus of scientific abstracts or operational records
  • Embedding vectors that match the index dimension
  • A timestamp_field that partitions records into windows

Outputs

  • Ranked shifts: emerging, growing, shrinking, drifted, and restructured topics
  • Per-shift evidence with baseline and current topic details
  • Optional LLM-generated summaries for human-readable reporting
  • A repeatable drift-detection workflow over one index

Temporal shift is the workload that answers "what changed?" This recipe uses scientific abstracts from the PubMed corpus because the evolution of research themes is concrete — CRISPR emerges, RNA interference shrinks — but the same primitives apply to support tickets, fraud patterns, or any time-stamped embedded corpus.

Problem

You have 10,000 scientific abstracts published over six months and you want to know which research themes emerged, which grew, which shrank, and which restructured between Q1 and Q2. You want this comparison per research category (immunology, oncology, neuroscience) so that shifts in one field do not drown out signals in another.

Prerequisites

  • Python 3.10 or newer.
  • pip install eigenlake numpy datasets (add sentence-transformers if you switch to the real embedder in Step 3)
  • An API key from https://api.eigenlake.dev.

The recipe

Step 1 — Load the dataset

from datasets import load_dataset
import numpy as np
import eigenlake
from eigenlake import schema as s

DIM = 384

# PubMed scientific abstracts: https://huggingface.co/datasets/scientific_papers
ds = load_dataset("scientific_papers", "pubmed", split="train")
ds = ds.shuffle(seed=42).select(range(10_000))

Step 2 — Define the schema

schema, index_options = (
    s.SchemaBuilder(additional_properties=False)
    .add("paper_id", s.string(required=True, filterable=True))
    .add("title", s.string(filterable=True))
    .add("abstract", s.string(filterable=False))
    .add("category", s.string(filterable=True))      # immunology, oncology, neuroscience
    .add("year", s.integer(filterable=True))
    .add("month", s.integer(filterable=True))
    .build()
)

year and month are filterable so you can define arbitrary time windows. category is the group_by field that keeps comparisons independent across research areas.

Step 3 — Ingest records with synthetic dates

The PubMed dataset does not include granular dates, so this recipe maps each record to a synthetic month for demonstration. In production, use the actual submitted_date or published_date from your data.

# Real embedding (recommended for production)
# from sentence_transformers import SentenceTransformer
# model = SentenceTransformer("all-MiniLM-L6-v2")
# def embed(text: str) -> list[float]:
#     return model.encode(text, normalize_embeddings=True).tolist()

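import zlib

def fake_embed(text: str) -> list[float]:
    # Deterministic placeholder vector. zlib.crc32 is stable across runs,
    # unlike hash(), which Python salts per process for strings.
    rng = np.random.default_rng(zlib.crc32(text.encode("utf-8")))
    v = rng.standard_normal(DIM).astype("float32")
    v /= np.linalg.norm(v) + 1e-12
    return v.tolist()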

embed = fake_embed  # swap to real embedder in production

with eigenlake.connect(
    url="https://api.eigenlake.dev",
    api_key="<sk_sbx_your_api_key_here>",
) as client:
    idx = client.indexes.create_or_get(
        namespace="cookbook-05",
        index="scientific-papers",
        dimensions=DIM,
        schema=schema,
        index_options=index_options,
    )

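    payload = []
    # Categories are assigned at random for the demo; seed the generator so reruns match.
    category_rng = np.random.default_rng(42)
    for i, r in enumerate(ds):
        # Synthetic dates: first 5,000 = Q1 (months 1-3), last 5,000 = Q2 (months 4-6)
        month = (1 if i < 5_000 else 4) + (i % 3)
        year = 2025
        # str() turns NumPy's np.str_ into a plain Python string before serialization
        category = str(category_rng.choice(["immunology", "oncology", "neuroscience"]))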

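        text = r["abstract"][:2000] if r["abstract"] else r["article"][:2000]
        # scientific_papers exposes article, abstract, and section_names but no title,
        # so use the first line of the text as a stand-in title.
        title = text.strip().split("\n", 1)[0][:200]
        payload.append({
            "properties": {
                "paper_id": str(i),
                "title": title,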
                "abstract": text,
                "category": category,
                "year": year,
                "month": month,
            },
            "vector": embed(text),
        })

    result = idx.records.add_many(payload, on_error="continue")
    print(f"inserted {len(result)} of {len(payload)} papers")
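Step 4 passes ISO 8601 window bounds, while the demo records carry only an integer month. If you want the synthetic data to line up with those bounds, one option is to store a derived timestamp as well. The sketch below is a minimal illustration using only the standard library; the helper names and the idea of a separate timestamp property (for example a hypothetical published_at field) are assumptions, not part of the recipe's schema.

```python
from datetime import datetime, timezone

def synthetic_timestamp(year: int, month: int, day: int = 15) -> str:
    """Return an ISO 8601 UTC timestamp for a day inside the given month."""
    dt = datetime(year, month, day, tzinfo=timezone.utc)
    return dt.isoformat().replace("+00:00", "Z")

def in_window(ts: str, start: str, end: str) -> bool:
    """Check window membership by parsing the ISO strings, not comparing text."""
    def parse(s: str) -> datetime:
        return datetime.fromisoformat(s.replace("Z", "+00:00"))
    return parse(start) <= parse(ts) <= parse(end)

# Q1 bounds copied from Step 4
q1 = {"start": "2025-01-01T00:00:00Z", "end": "2025-03-31T23:59:59Z"}

ts = synthetic_timestamp(2025, 2)
print(ts, in_window(ts, **q1))
```

With a field like this in place, timestamp_field could point at it instead of month, assuming the index schema supports a datetime or ISO string type.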

Step 4 — Run temporal shift

    shifts = idx.search.temporal_shift(
        baseline={
            "start": "2025-01-01T00:00:00Z",
            "end": "2025-03-31T23:59:59Z",
        },
        current={
            "start": "2025-04-01T00:00:00Z",
            "end": "2025-06-30T23:59:59Z",
        },
        timestamp_field="month",  # In production, use a datetime field like "submitted_date"
        filter={"year": {"$eq": 2025}},
        group_by=["category"],
        limit_per_window=5_000,
        min_clusters=3,
        max_clusters=12,
        min_relative_shift=0.20,
        min_count_shift=5,
        text_fields=["abstract"],
        metadata_fields=["year"],
        summary_mode="deterministic",
    )

    for shift in shifts["shifts"][:10]:
        print(f"\n{shift['kind']:<12} {shift['direction']:<12} score={shift['score']:.4f}")
        print(f"  label: {shift['label']}")
        print(f"  explanation: {shift['explanation']}")

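baseline and current define the two windows. The API selects records whose timestamp_field falls within each range. The window bounds are ISO 8601 timestamps, so in production timestamp_field should name a real datetime field such as submitted_date; the integer month field appears here only because the demo dates are synthetic.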

group_by=["category"] runs independent comparisons per research area. Immunology shifts are ranked against immunology baselines; oncology shifts against oncology baselines. Without group_by, a large shift in one dominant category can drown out smaller but important shifts in others.

min_relative_shift=0.20 requires a 20% change in topic prevalence to qualify as growing or shrinking. min_count_shift=5 requires an absolute change of at least 5 records. Together they suppress noise from tiny topics.
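To see how the two thresholds interact, here is a small worked sketch. It assumes that relative shift means the change in a topic's share of its window divided by its baseline share; the exact formula EigenLake uses is not stated in this recipe, and the function name is hypothetical.

```python
def passes_thresholds(
    baseline_count: int,
    current_count: int,
    baseline_total: int,
    current_total: int,
    min_relative_shift: float = 0.20,
    min_count_shift: int = 5,
) -> bool:
    """Return True when a topic's change clears both thresholds.

    Assumption: relative shift = |current share - baseline share| / baseline share.
    """
    if baseline_count == 0:
        raise ValueError("topic is absent from the baseline; treat it as emerging")
    base_share = baseline_count / baseline_total
    curr_share = current_count / current_total
    relative = abs(curr_share - base_share) / base_share
    absolute = abs(current_count - baseline_count)
    return relative >= min_relative_shift and absolute >= min_count_shift

# Tiny topic: 4 -> 7 records is a 75% jump but only 3 records, so it is suppressed.
print(passes_thresholds(4, 7, 1000, 1000))        # False
# Larger topic: 100 -> 130 records is a 30% change and 30 records, so it qualifies.
print(passes_thresholds(100, 130, 1000, 1000))    # True
# Big topic: 1000 -> 1050 of 5000 is 50 records but only a 5% change, so it is suppressed.
print(passes_thresholds(1000, 1050, 5000, 5000))  # False
```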

summary_mode="deterministic" returns stable, reproducible labels and explanations. Use "llm" for customer-facing reports where a narrative summary is more useful than a structured signal list.

Step 5 — Read the shift kinds

Each shift has a kind and a direction:

Kind          Direction     Meaning
emerging      new           Topic appeared in current window, absent in baseline
growing       up            Topic grew by more than min_relative_shift
shrinking     down          Topic shrank by more than min_relative_shift
drift         shifted       Topic persisted but its semantic center moved
restructured  restructured  Topic merged, split, or recombined

Read baseline_topics and current_topics for the topic IDs, counts, and similarity scores that support the shift classification.
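When a response contains many shifts, it can help to bucket them by kind before reading the evidence. The sketch below runs on a mock response; the field names (kind, direction, score, label) follow the Step 4 printout, but the values are invented for illustration and group_by_kind is a hypothetical helper.

```python
from collections import defaultdict

# Mock response: same field names as Step 4, made-up values.
response = {
    "shifts": [
        {"kind": "emerging", "direction": "new", "score": 0.72, "label": "topic C"},
        {"kind": "growing", "direction": "up", "score": 0.64, "label": "topic B"},
        {"kind": "emerging", "direction": "new", "score": 0.91, "label": "topic A"},
        {"kind": "shrinking", "direction": "down", "score": 0.55, "label": "topic D"},
    ]
}

def group_by_kind(shifts: list[dict]) -> dict[str, list[dict]]:
    """Bucket shifts by kind, strongest score first within each bucket."""
    grouped: dict[str, list[dict]] = defaultdict(list)
    for shift in shifts:
        grouped[shift["kind"]].append(shift)
    return {
        kind: sorted(items, key=lambda s: s["score"], reverse=True)
        for kind, items in grouped.items()
    }

grouped = group_by_kind(response["shifts"])
for kind, items in grouped.items():
    labels = ", ".join(f"{s['label']} ({s['score']:.2f})" for s in items)
    print(f"{kind:<10} {labels}")
```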

Agent path

    agent_result = idx.agent.query(
        "What changed in immunology submissions between Q1 and Q2 2025?"
    )
    print(agent_result["action"])   # "filter"
    print(agent_result["filter"])   # {"category": {"$eq": "immunology"}, ...}

    # Apply the inferred filter, then run temporal_shift directly
    shifts = idx.search.temporal_shift(
        baseline={"start": "2025-01-01T00:00:00Z", "end": "2025-03-31T23:59:59Z"},
        current={"start": "2025-04-01T00:00:00Z", "end": "2025-06-30T23:59:59Z"},
        timestamp_field="month",
        filter=agent_result["filter"],
        limit_per_window=5_000,
        min_relative_shift=0.20,
        summary_mode="llm",
    )

Agent mode infers the category and date filters from natural language. The temporal shift workload is then run directly on the filtered subset.

What is happening, line by line

Independent topic discovery per window. The API runs spherical k-means separately on the baseline snapshot and the current snapshot. This ensures that topics are defined by the data in each window, not constrained by an external taxonomy.
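To make the discovery step concrete, here is a minimal NumPy sketch of spherical k-means. It illustrates the general technique only and is not EigenLake's implementation; the function name, random initialisation, and iteration cap are assumptions.

```python
import numpy as np

def spherical_kmeans(x: np.ndarray, k: int, n_iter: int = 50, seed: int = 0):
    """Cluster rows of x by cosine similarity.

    Returns (centroids, labels). Centroids are renormalised to unit length
    after every update, which is what makes this k-means "spherical".
    """
    rng = np.random.default_rng(seed)
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    # Initialise centroids from k distinct data points.
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    labels = np.full(len(x), -1)
    for _ in range(n_iter):
        # On unit vectors, the dot product equals cosine similarity.
        new_labels = np.argmax(x @ centroids.T, axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = x[labels == j]
            if len(members) == 0:
                continue  # keep the previous centroid for an empty cluster
            total = members.sum(axis=0)
            centroids[j] = total / (np.linalg.norm(total) + 1e-12)
    return centroids, labels

# Two well-separated synthetic groups in 8 dimensions.
rng = np.random.default_rng(1)
group_a = rng.normal(0, 0.05, (50, 8)) + np.eye(8)[0]
group_b = rng.normal(0, 0.05, (50, 8)) + np.eye(8)[1]
centroids, labels = spherical_kmeans(np.vstack([group_a, group_b]), k=2)
print(np.bincount(labels))
```

Running the same function once on baseline vectors and once on current vectors gives two independent sets of centroids, which is the input the alignment step needs.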

Cross-window topic alignment. After discovery, the API aligns similar topics across the two windows using cosine similarity between topic centroids. An aligned pair is the same theme at two points in time. An unaligned baseline topic is a disappearing theme. An unaligned current topic is an emerging theme.
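The alignment step can be sketched as a greedy one-to-one matching over the centroid similarity matrix. This is one plausible approach, not a description of what the API does internally; the min_similarity cut-off of 0.7 and the function name are assumptions chosen for the example.

```python
import numpy as np

def align_topics(baseline: np.ndarray, current: np.ndarray, min_similarity: float = 0.7):
    """Greedily pair baseline and current centroids by cosine similarity.

    Returns (pairs, unmatched_baseline, unmatched_current), where each pair
    is (baseline_index, current_index, similarity).
    """
    b = baseline / np.linalg.norm(baseline, axis=1, keepdims=True)
    c = current / np.linalg.norm(current, axis=1, keepdims=True)
    sim = b @ c.T
    # Visit candidate pairs from most to least similar.
    candidates = sorted(
        ((float(sim[i, j]), i, j) for i in range(len(b)) for j in range(len(c))),
        reverse=True,
    )
    used_b, used_c, pairs = set(), set(), []
    for score, i, j in candidates:
        if score < min_similarity:
            break  # everything after this is too dissimilar to align
        if i in used_b or j in used_c:
            continue  # each topic may be matched at most once
        pairs.append((i, j, score))
        used_b.add(i)
        used_c.add(j)
    unmatched_b = [i for i in range(len(b)) if i not in used_b]
    unmatched_c = [j for j in range(len(c)) if j not in used_c]
    return pairs, unmatched_b, unmatched_c

baseline = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
current = np.array([[0.9, 0.1, 0.0], [0.0, 0.0, 1.0]])
pairs, gone, new = align_topics(baseline, current)
print(pairs)  # baseline topic 0 aligns with current topic 0
print(gone)   # baseline topics with no match: disappearing themes
print(new)    # current topics with no match: emerging themes
```

A greedy match is simple to read but not globally optimal; an assignment solver would be the usual alternative when many topics have similar centroids.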

Shift classification. The API classifies each aligned or unaligned topic pair using relative prevalence, absolute count change, and centroid drift. The result is a structured signal list, not a single number.
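As a rough illustration of how the three signals could combine into a kind and direction, consider the decision sketch below. The precedence (size change before drift), the 0.90 drift cut-off, and the "disappearing" and "stable" labels are all assumptions made for this example; the restructured kind involves many-to-many matches and is not modelled here.

```python
def classify_shift(
    in_baseline: bool,
    in_current: bool,
    relative_change: float = 0.0,
    count_change: int = 0,
    centroid_similarity: float = 1.0,
    *,
    min_relative_shift: float = 0.20,
    min_count_shift: int = 5,
    drift_similarity: float = 0.90,  # assumed cut-off, not a documented default
) -> tuple[str, str]:
    """Return a (kind, direction) pair for one topic or aligned topic pair."""
    if in_current and not in_baseline:
        return ("emerging", "new")
    if in_baseline and not in_current:
        # Placeholder label: the kinds table lists no dedicated kind for this case.
        return ("disappearing", "gone")
    big_enough = (
        abs(relative_change) >= min_relative_shift
        and abs(count_change) >= min_count_shift
    )
    if big_enough:
        return ("growing", "up") if relative_change > 0 else ("shrinking", "down")
    if centroid_similarity < drift_similarity:
        return ("drift", "shifted")
    return ("stable", "none")

print(classify_shift(False, True))
print(classify_shift(True, True, relative_change=0.30, count_change=30))
print(classify_shift(True, True, relative_change=-0.50, count_change=-40))
print(classify_shift(True, True, relative_change=0.05, count_change=2, centroid_similarity=0.80))
```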

Variations

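Weekly scheduling. Run the temporal shift call from a cron job every Monday morning:

# Scheduled drift detection sketch. Assumes `idx` from Step 3.
from datetime import datetime, timedelta, timezone

def alert_team(label: str, explanation: str) -> None:
    print(f"ALERT {label}: {explanation}")  # placeholder: call your pager or chat hook here

now = datetime.now(timezone.utc)
week = timedelta(days=7)
fmt = "%Y-%m-%dT%H:%M:%SZ"
shifts = idx.search.temporal_shift(
    baseline={"start": (now - 2 * week).strftime(fmt), "end": (now - week).strftime(fmt)},
    current={"start": (now - week).strftime(fmt), "end": now.strftime(fmt)},
    # ... remaining parameters as in Step 4
)
for shift in shifts["shifts"]:
    if shift["kind"] == "emerging" and shift["score"] > 0.8:
        alert_team(shift["label"], shift["explanation"])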

Multi-granularity comparison. Run daily shifts for operational alerts, weekly shifts for tactical review, and monthly shifts for strategic planning. The same API call with different window sizes gives you three different temporal lenses.
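One way to get the three lenses is a small helper that builds adjacent baseline and current windows of any length. This is a sketch using only the standard library; adjacent_windows is a hypothetical name, and treating a month as 30 days is a simplification.

```python
from datetime import datetime, timedelta, timezone

def adjacent_windows(now: datetime, size: timedelta) -> tuple[dict, dict]:
    """Return (baseline, current) windows of equal length, with current ending at `now`."""
    def fmt(dt: datetime) -> str:
        return dt.strftime("%Y-%m-%dT%H:%M:%SZ")
    current_start = now - size
    baseline_start = current_start - size
    baseline = {
        "start": fmt(baseline_start),
        # End one second early so the two windows do not share a boundary.
        "end": fmt(current_start - timedelta(seconds=1)),
    }
    current = {"start": fmt(current_start), "end": fmt(now)}
    return baseline, current

now = datetime(2025, 7, 1, tzinfo=timezone.utc)
lenses = {
    "daily": timedelta(days=1),
    "weekly": timedelta(weeks=1),
    "monthly": timedelta(days=30),  # approximation of a calendar month
}
for name, size in lenses.items():
    baseline, current = adjacent_windows(now, size)
    print(name, baseline, current)
```

Each pair of dictionaries has the same shape as the baseline and current arguments in Step 4, so they can be passed to the temporal shift call directly.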

Correlation with external metrics. After detecting a growing topic, correlate its growth with an external metric — citation counts, stock prices, or support-ticket volume. The shift tells you what changed; the external metric helps you understand why.
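A first pass at this correlation needs nothing more than a Pearson coefficient over two aligned series. The numbers below are invented purely for illustration: one series stands in for a growing topic's record count per week, the other for an external metric sampled over the same weeks.

```python
import numpy as np

def pearson(a: np.ndarray, b: np.ndarray) -> float:
    """Pearson correlation between two equally long series."""
    return float(np.corrcoef(a, b)[0, 1])

# Illustrative values only, one entry per week.
topic_counts = np.array([12, 15, 21, 30, 41, 55], dtype=float)
external_metric = np.array([3, 4, 6, 9, 11, 16], dtype=float)

r = pearson(topic_counts, external_metric)
print(f"Pearson r = {r:.3f}")
```

A high coefficient suggests the two series move together and is a prompt for closer investigation; shifting one series by a week or two before correlating is a simple way to look for a lead or lag.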

Per-tenant drift. If your index holds data for multiple customers, use group_by=["tenant_id"] to run independent drift detection per tenant. One tenant's emerging fraud pattern is another tenant's stable baseline.

Alert pipeline. Write shift signals to a downstream index:

for shift in shifts["shifts"]:
    if shift["kind"] in ("emerging", "growing"):
        # Write to alert index
        pass

What to read next

  • Cookbook 1: Semantic Search at Scale (8 minutes, intermediate). Use filtered nearest-neighbour search, cursor pagination, and search-unit economics to retrieve relevant records from a corpus of tens of thousands of consumer complaints.
  • Cookbook 2: Clustering in Depth (10 minutes, intermediate). Discover natural groupings in large vector datasets with DBSCAN and k-means, from protein sequences to support tickets, without exporting data to a separate analytics stack.
  • Cookbook 3: Anomaly Detection in Depth (10 minutes, intermediate). Surface unusual sensor events, operational records, and outlier folds with Local Outlier Factor over vector embeddings, without pre-defining what normal looks like.