EigenLake

Cookbook 1: Semantic Search at Scale

Use filtered nearest-neighbour search, cursor pagination, and search-unit economics to retrieve relevant records from a corpus of 10,000 consumer complaints.

Tags: cookbook, python, semantic-search, filtering, pagination, production

Answer

How do I run semantic search at scale in EigenLake?

Define a schema with filterable metadata, ingest embedded records, then run filtered nearest-neighbour search with cursor pagination so you can retrieve thousands of relevant results economically.

Workload: Semantic Search
Difficulty: intermediate
Estimated time: 8 minutes

Inputs

  • A corpus of consumer complaints with metadata (product, issue, date, company)
  • Embedding vectors that match the index dimension
  • An EigenLake API key

Outputs

  • A populated EigenLake index with 10,000 complaint records
  • Filtered nearest-neighbour search results ranked by relevance
  • Paginated retrieval beyond the first page of results
  • Search-unit cost accounting for production budgeting

This recipe teaches search as an engineering workload, not a demo. You will learn how to combine vector similarity with structured metadata filters, how to paginate through large result sets, and how to account for search-unit costs when you move from prototype to production.

Problem

You have 10,000 consumer complaints and you need to find credit-card complaints about incorrect information from the last 30 days. A plain vector search would return the most semantically similar complaints regardless of product or date. You need the similarity ranking to apply only within the subset that matches your business criteria.

Prerequisites

  • Python 3.10 or newer.
  • The packages installed by: pip install eigenlake numpy datasets (add sentence-transformers if you plan to use the real embedding model)
  • An API key from https://api.eigenlake.dev.
  • A sentence-transformer model. The recipe uses sentence-transformers/all-MiniLM-L6-v2 (384 dimensions). If you do not want to download a model, the fake_embed fallback below is deterministic and self-contained.

The recipe

Step 1 — Load the dataset

from datasets import load_dataset
import numpy as np
import eigenlake
from eigenlake import schema as s

DIM = 384

# Load a subset of the CFPB consumer-complaints dataset
# Full dataset: https://huggingface.co/datasets/cfpb/consumer-complaints
ds = load_dataset("cfpb/consumer-complaints", split="train")
ds = ds.filter(lambda r: r["consumer_complaint_narrative"] is not None)
ds = ds.shuffle(seed=42).select(range(10_000))

Step 2 — Define the schema

schema, index_options = (
    s.SchemaBuilder(additional_properties=False)
    .add("complaint_id", s.string(required=True, filterable=True))
    .add("product", s.string(filterable=True))
    .add("sub_product", s.string(filterable=True))
    .add("issue", s.string(filterable=True))
    .add("company", s.string(filterable=True))
    .add("state", s.string(filterable=True))
    .add("consumer_complaint_narrative", s.string(filterable=False))
    .add("date_received", s.datetime(filterable=True))
    .build()
)

Every field that will appear in a filter= argument must be filterable=True. The complaint narrative is filterable=False because you search it by vector similarity, not by text equality.
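One way to catch a non-filterable field before it reaches the API is to walk the filter and compare the property names against the schema. The sketch below is plain Python and does not call EigenLake; fields_in_filter, unfilterable_fields, and the FILTERABLE set are hypothetical helpers written for this illustration, and the sketch assumes every key starting with "$" is a logical operator whose value is a sub-filter or a list of sub-filters.

```python
def fields_in_filter(flt: dict) -> set[str]:
    """Collect every property name referenced by a MongoDB-style filter."""
    names: set[str] = set()
    for key, value in flt.items():
        if key.startswith("$"):
            # Assumed shape: logical operators hold one sub-filter or a list of them.
            subs = value if isinstance(value, list) else [value]
            for sub in subs:
                names |= fields_in_filter(sub)
        else:
            names.add(key)
    return names


# Mirrors the schema in Step 2: every property except the narrative is filterable.
FILTERABLE = {
    "complaint_id", "product", "sub_product", "issue",
    "company", "state", "date_received",
}


def unfilterable_fields(flt: dict) -> set[str]:
    """Return the fields in the filter that the schema does not mark filterable."""
    return fields_in_filter(flt) - FILTERABLE


good_filter = {
    "$and": [
        {"product": {"$eq": "Credit card"}},
        {"date_received": {"$gte": "2024-01-01T00:00:00Z"}},
    ]
}
bad_filter = {"consumer_complaint_narrative": {"$eq": "refund"}}

print(unfilterable_fields(good_filter))
print(unfilterable_fields(bad_filter))
```

Running a check like this in tests is cheaper than discovering the problem after the index has been built.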

Step 3 — Ingest records

# Real embedding (recommended for production)
# from sentence_transformers import SentenceTransformer
# model = SentenceTransformer("all-MiniLM-L6-v2")
# def embed(text: str) -> list[float]:
#     return model.encode(text, normalize_embeddings=True).tolist()

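# Deterministic fallback for sandbox testing.
# Python salts the built-in hash() of strings per process, so seed the
# generator from a stable CRC32 checksum to get the same vector on every run.
import zlib

def fake_embed(text: str) -> list[float]:
    rng = np.random.default_rng(zlib.crc32(text.encode("utf-8")))
    v = rng.standard_normal(DIM).astype("float32")
    v /= np.linalg.norm(v) + 1e-12
    return v.tolist()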

embed = fake_embed  # swap to real embedder in production

with eigenlake.connect(
    url="https://api.eigenlake.dev",
    api_key="<sk_sbx_your_api_key_here>",
) as client:
    idx = client.indexes.create_or_get(
        namespace="cookbook-01",
        index="complaints-search",
        dimensions=DIM,
        schema=schema,
        index_options=index_options,
    )

    payload = [
        {
            "properties": {
                "complaint_id": str(r["complaint_id"]),
                "product": r["product"] or "Unknown",
                "sub_product": r["sub_product"] or "Unknown",
                "issue": r["issue"] or "Unknown",
                "company": r["company"] or "Unknown",
                "state": r["state"] or "Unknown",
                "consumer_complaint_narrative": r["consumer_complaint_narrative"],
                "date_received": r["date_received"] + "T00:00:00Z",
            },
            "vector": embed(r["consumer_complaint_narrative"]),
        }
        for r in ds
    ]

    result = idx.records.add_many(payload, on_error="continue")
    print(f"inserted {len(result)} of {len(payload)} records")

Step 4 — Filtered nearest-neighbour search

    query = "credit card charged twice for the same purchase"
    query_vector = embed(query)

    hits = idx.search.nearest(
        vector=query_vector,
        limit=250,
        filter={
            "$and": [
                {"product": {"$eq": "Credit card"}},
                {"date_received": {"$gte": "2024-01-01T00:00:00Z"}},
            ]
        },
    )

    for hit in hits["vectors"][:5]:
        meta = hit["metadata"]
        print(f"{hit['distance']:.4f}  {meta['issue']}  {meta['company']}")

The filter is applied before similarity ranking. Only credit-card complaints received since 2024-01-01 are compared against the query vector. This is the difference between a vector database and a vector workload platform: the filter and the search run in the same call over the same index.

Step 5 — Cursor pagination

    # Walk the filtered subset with search.iterate, which yields one record at a time
    max_records = 2_500  # five pages of 500
    record_count = 0
    for obj in idx.search.iterate(
        filter={"product": {"$eq": "Credit card"}},
        page_size=500,
    ):
        record_count += 1
        if record_count >= max_records:
            break
    print(f"retrieved {record_count} records via cursor pagination")

search.iterate returns an iterator that handles the after cursor automatically. Use it when you need to walk a large filtered subset without manually managing offsets.

Step 6 — Search-unit economics

Each search unit returns up to 100 results. The request above with limit=250 consumes 3 search units. A limit=1,000 request consumes 10 units. Budget accordingly:

limit    Search units
100      1
250      3
500      5
1,000    10

In production, cap limit to what your downstream system can actually use. If you only show the top 20 results to a user, request limit=100 and re-rank locally.
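The unit figures above follow a round-up rule: divide limit by 100 and round any partial unit up. A small helper makes this easy to budget. The function names and the queries-per-day figure below are illustrative, and the sketch assumes a request with fewer than 100 results still costs one full unit.

```python
RESULTS_PER_UNIT = 100  # from the text: each search unit returns up to 100 results


def search_units(limit: int) -> int:
    """Search units consumed by one request, rounding partial units up."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    return (limit + RESULTS_PER_UNIT - 1) // RESULTS_PER_UNIT


def daily_units(limit: int, queries_per_day: int) -> int:
    """Total units per day for a fixed limit and query volume."""
    return search_units(limit) * queries_per_day


for limit in (100, 250, 500, 1_000):
    print(limit, search_units(limit))

# Hypothetical workload: 2,000 queries per day at limit=250
print(daily_units(250, 2_000))
```

Under this rule, limit=20 and limit=100 cost the same, which is why requesting 100 and re-ranking locally adds no search-unit cost.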

Agent path

The same search can be expressed in natural language:

    agent_result = idx.agent.query(
        "Find credit card complaints about incorrect information from the last 30 days"
    )
    print(agent_result["action"])   # "filter" or "cluster"
    print(agent_result["filter"])   # inferred MongoDB-style filter

In mode="auto", the agent inspects the query and returns a structured action. For search-like language it currently returns a filter action. Apply that filter to search.nearest to get the ranked results. Future SDK versions may route directly to search.nearest.
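Applying the agent's filter to search.nearest might look like the sketch below. It uses only the two calls shown in this recipe (idx.agent.query and idx.search.nearest); the wrapper name search_via_agent is hypothetical, and the stub index exists only so the example runs without a network call.

```python
from types import SimpleNamespace


def search_via_agent(idx, question: str, query_vector: list[float], limit: int = 100) -> dict:
    """Ask the agent for a filter, then run a ranked search with it."""
    agent_result = idx.agent.query(question)
    if agent_result["action"] != "filter":
        raise ValueError(f"expected a filter action, got {agent_result['action']!r}")
    return idx.search.nearest(
        vector=query_vector,
        limit=limit,
        filter=agent_result["filter"],
    )


# Stand-in index for illustration only; pass a real idx in practice.
seen = {}


def _fake_nearest(vector, limit, filter):
    seen["filter"] = filter  # record what the wrapper forwarded
    return {"vectors": []}


stub_idx = SimpleNamespace(
    agent=SimpleNamespace(
        query=lambda q: {"action": "filter", "filter": {"product": {"$eq": "Credit card"}}}
    ),
    search=SimpleNamespace(nearest=_fake_nearest),
)

hits = search_via_agent(stub_idx, "Find credit card complaints", [0.0] * 384)
print(seen["filter"])
```

Raising on a non-filter action keeps a "cluster" response from being silently treated as a search.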

What is happening, line by line

Schema-first filtering. product and date_received are marked filterable=True so that they can appear in the filter= argument to search.nearest. If you forget to mark a field as filterable at index creation time, you cannot filter by it later without rebuilding the index.

$and and $or. The filter language supports MongoDB-style operators. Use $and to require multiple conditions, $or to accept any of several, and $in for membership lists. The full set is $eq, $ne, $in, $nin, $gt, $gte, $lt, $lte, $and, $or, $not, and $exists.
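To make the operator semantics concrete, here is a small in-memory matcher that evaluates such a filter against a plain dictionary. It is an illustration of MongoDB-style semantics, not EigenLake's implementation; the function names are hypothetical, and the handling of missing fields and of $not at both the top level and the field level are assumptions.

```python
from typing import Any

_MISSING = object()  # sentinel for "property not present on the record"


def _match_field(value: Any, cond: dict) -> bool:
    """Check one property value against a dict of comparison operators."""
    for op, arg in cond.items():
        if op == "$exists":
            ok = (value is not _MISSING) == bool(arg)
        elif op == "$not":
            ok = not _match_field(value, arg)
        elif value is _MISSING:
            # Assumption: a missing field only satisfies the negative operators.
            ok = op in ("$ne", "$nin")
        elif op == "$eq":
            ok = value == arg
        elif op == "$ne":
            ok = value != arg
        elif op == "$in":
            ok = value in arg
        elif op == "$nin":
            ok = value not in arg
        elif op == "$gt":
            ok = value > arg
        elif op == "$gte":
            ok = value >= arg
        elif op == "$lt":
            ok = value < arg
        elif op == "$lte":
            ok = value <= arg
        else:
            raise ValueError(f"unsupported operator: {op}")
        if not ok:
            return False
    return True


def matches(record: dict, flt: dict) -> bool:
    """Return True if the record satisfies every clause of the filter."""
    for key, cond in flt.items():
        if key == "$and":
            ok = all(matches(record, sub) for sub in cond)
        elif key == "$or":
            ok = any(matches(record, sub) for sub in cond)
        elif key == "$not":
            ok = not matches(record, cond)
        else:
            ok = _match_field(record.get(key, _MISSING), cond)
        if not ok:
            return False
    return True


record = {"product": "Credit card", "state": "CA", "date_received": "2024-03-05T00:00:00Z"}

print(matches(record, {"product": {"$in": ["Credit card", "Bank account"]}}))
print(matches(record, {"$or": [{"state": {"$eq": "NY"}}, {"state": {"$eq": "TX"}}]}))
```

Date comparisons work here because ISO-8601 strings in the same format sort chronologically; a server-side datetime type may compare differently.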

search.iterate vs. search.list. list returns a single page with an after cursor. iterate is a convenience wrapper that calls list repeatedly and yields each record. Both are useful; iterate is simpler for large scans.
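The relationship between the two can be sketched as a generator that keeps requesting pages until no cursor comes back. This is a generic cursor-pagination pattern, not the SDK's source: the (records, after) return shape of list_page is an assumption, and fake_list_page is an in-memory stand-in for a real list call.

```python
from typing import Callable, Iterator, Optional

# Assumed page shape: (records, after-cursor), with None as the cursor on the last page.
Page = tuple[list[dict], Optional[str]]


def iterate(list_page: Callable[[Optional[str]], Page]) -> Iterator[dict]:
    """Yield records one at a time, following the cursor between pages."""
    after = None
    while True:
        records, after = list_page(after)
        yield from records
        if after is None:
            return


# In-memory stand-in for a paginated endpoint, for illustration only.
DATA = [{"id": i} for i in range(7)]


def fake_list_page(after: Optional[str], page_size: int = 3) -> Page:
    start = int(after) if after is not None else 0
    chunk = DATA[start:start + page_size]
    next_start = start + page_size
    return chunk, (str(next_start) if next_start < len(DATA) else None)


ids = [rec["id"] for rec in iterate(fake_list_page)]
print(ids)
```

Because the generator is lazy, breaking out of the loop early stops further page requests.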

Metadata in every hit. Every vector hit includes the full record: UUID, distance, and every property you stored. There is no follow-up get call required. This is why EigenLake does not need a separate metadata store.

Variations

Swap the embedding model. all-MiniLM-L6-v2 is 384-dimensional and fast. For higher quality at the cost of speed, use all-mpnet-base-v2 (768 dimensions) or a domain-specific model. Just change DIM and the model name; the rest of the recipe is identical.

Search across multiple products. Replace the $eq with $in:

filter={"product": {"$in": ["Credit card", "Bank account"]}}

Geographic aggregation. Filter by state and aggregate locally:

hits = idx.search.nearest(vector=query_vector, limit=1_000, filter={"state": {"$eq": "CA"}})
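Local aggregation over the returned hits can be as simple as a Counter. The sketch below assumes the hit shape used in Step 4 (hits["vectors"], each with a "metadata" dict); count_by is a hypothetical helper, and the sample hits and company names are made up for illustration.

```python
from collections import Counter


def count_by(hits: dict, field: str) -> Counter:
    """Count hits per value of one metadata field."""
    return Counter(h["metadata"].get(field, "Unknown") for h in hits["vectors"])


# Made-up hits in the shape returned by search.nearest in Step 4.
sample_hits = {
    "vectors": [
        {"distance": 0.12, "metadata": {"state": "CA", "company": "Acme Bank"}},
        {"distance": 0.15, "metadata": {"state": "CA", "company": "Acme Bank"}},
        {"distance": 0.21, "metadata": {"state": "CA", "company": "Example Credit"}},
    ]
}

by_company = count_by(sample_hits, "company")
print(by_company.most_common(2))
```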

Raise the limit for downstream re-ranking. Request limit=500 (5 search units), then apply a cross-encoder re-ranking model to the top 500 before showing the final top 10 to the user.
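A re-ranking step can be written so the scoring model is pluggable. In the sketch below, rerank is a hypothetical helper and overlap_score is a toy word-overlap scorer used only so the example runs without downloading a model; in practice score_pairs would wrap a cross-encoder that scores (query, text) pairs.

```python
from typing import Callable, Sequence


def rerank(
    query: str,
    hits: list[dict],
    score_pairs: Callable[[list[tuple[str, str]]], Sequence[float]],
    top_k: int = 10,
    text_field: str = "consumer_complaint_narrative",
) -> list[dict]:
    """Re-order hits by a pairwise relevance score and keep the best top_k."""
    pairs = [(query, h["metadata"][text_field]) for h in hits]
    scores = score_pairs(pairs)
    order = sorted(range(len(hits)), key=lambda i: scores[i], reverse=True)
    return [hits[i] for i in order[:top_k]]


def overlap_score(pairs: list[tuple[str, str]]) -> list[float]:
    """Toy scorer: number of words shared by the query and the text."""
    return [float(len(set(q.lower().split()) & set(t.lower().split()))) for q, t in pairs]


candidates = [
    {"metadata": {"consumer_complaint_narrative": "late fee on my mortgage"}},
    {"metadata": {"consumer_complaint_narrative": "my credit card was charged twice"}},
    {"metadata": {"consumer_complaint_narrative": "card charged"}},
]

top = rerank("credit card charged twice", candidates, overlap_score, top_k=2)
print([h["metadata"]["consumer_complaint_narrative"] for h in top])
```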

Merged structured + semantic retrieval. For agent memory patterns, combine records.list (exact metadata matches) and search.nearest (similarity matches) in one de-duplicated result set. See Cookbook 2: Clustering in Depth for the multi-workload evolution of this pattern.
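The de-duplication step can be sketched in a few lines. merge_results is a hypothetical helper, and the "uuid" key name is an assumption based on the statement that every hit carries a UUID; adjust it to the field name the SDK actually returns.

```python
def merge_results(exact: list[dict], similar: list[dict], key: str = "uuid") -> list[dict]:
    """Concatenate two result lists, keeping the first occurrence of each key."""
    merged: list[dict] = []
    seen: set = set()
    for rec in [*exact, *similar]:
        if rec[key] not in seen:
            seen.add(rec[key])
            merged.append(rec)
    return merged


# Made-up records: "b" appears in both the exact and the similarity results.
exact_matches = [{"uuid": "a"}, {"uuid": "b"}]
similar_matches = [{"uuid": "b", "distance": 0.2}, {"uuid": "c", "distance": 0.3}]

merged = merge_results(exact_matches, similar_matches)
print([r["uuid"] for r in merged])
```

Listing exact matches first means they win when a record appears in both sets; reverse the argument order if the similarity hit, with its distance, should be kept instead.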

What to read next

  • Cookbook 2: Clustering in Depth (10 minutes, intermediate). Discover natural groupings in large vector datasets with DBSCAN and k-means, from protein sequences to support tickets, without exporting data to a separate analytics stack.
  • Cookbook 3: Anomaly Detection in Depth (10 minutes, intermediate). Surface unusual sensor events, operational records, and outlier folds with Local Outlier Factor over vector embeddings, without pre-defining what normal looks like.
  • Cookbook 4: Topic Modeling in Depth (10 minutes, intermediate). Discover emergent themes in multilingual text, scientific literature, or operational notes without manual labeling, using spherical k-means, c-TF-IDF, and optional LLM-generated labels.