How do I run semantic search at scale in EigenLake?

Define a schema with filterable metadata, ingest embedded records, then run filtered nearest-neighbour search with cursor pagination so you can retrieve thousands of relevant results economically.

Cookbook 1: Semantic Search at Scale

This recipe teaches search as an engineering workload, not a demo. You will learn how to combine vector similarity with structured metadata filters, how to paginate through large result sets, and how to account for search-unit costs when you move from prototype to production.

Problem

You have 10,000 consumer complaints and you need to find credit-card complaints about incorrect information from the last 30 days. A plain vector search would return the most semantically similar complaints regardless of product or date. You need the similarity ranking to apply only within the subset that matches your business criteria.

Prerequisites

Python 3.10 or newer.
pip install eigenlake numpy datasets (add sentence-transformers if you plan to use the real embedding model).
An API key from https://api.eigenlake.dev.
A sentence-transformer model. The recipe uses sentence-transformers/all-MiniLM-L6-v2 (384 dimensions). If you do not want to download a model, the fake_embed fallback below is self-contained. It produces pseudo-random unit vectors, so it exercises the API but does not return semantically meaningful rankings.

The recipe

Step 1 — Load the dataset

from datasets import load_dataset
import numpy as np
import eigenlake
from eigenlake import schema as s

DIM = 384

# Load a subset of the CFPB consumer-complaints dataset
# Full dataset: https://huggingface.co/datasets/cfpb/consumer-complaints
ds = load_dataset("cfpb/consumer-complaints", split="train")
ds = ds.filter(lambda r: r["consumer_complaint_narrative"] is not None)
ds = ds.shuffle(seed=42).select(range(10_000))

Step 2 — Define the schema

schema, index_options = (
    s.SchemaBuilder(additional_properties=False)
    .add("complaint_id", s.string(required=True, filterable=True))
    .add("product", s.string(filterable=True))
    .add("sub_product", s.string(filterable=True))
    .add("issue", s.string(filterable=True))
    .add("company", s.string(filterable=True))
    .add("state", s.string(filterable=True))
    .add("consumer_complaint_narrative", s.string(filterable=False))
    .add("date_received", s.datetime(filterable=True))
    .build()
)

Every field that will appear in a filter= argument must be filterable=True. The complaint narrative is filterable=False because you search it by vector similarity, not by text equality.
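
A small pre-flight check can catch a non-filterable field before a request is sent. The sketch below is plain Python and does not call the SDK: FILTERABLE_FIELDS and fields_in_filter are hypothetical names, and the walk assumes the MongoDB-style filter shape used in this recipe (including a $not that wraps a single sub-filter, which is an assumption).

```python
# Hypothetical helper, not part of the eigenlake SDK.
# The set mirrors the fields marked filterable=True in the schema above.
FILTERABLE_FIELDS = {
    "complaint_id", "product", "sub_product", "issue",
    "company", "state", "date_received",
}


def fields_in_filter(flt: dict) -> set[str]:
    """Collect every field name referenced in a MongoDB-style filter."""
    found: set[str] = set()
    for key, value in flt.items():
        if key in ("$and", "$or"):
            # Logical operators hold a list of sub-filters.
            for clause in value:
                found |= fields_in_filter(clause)
        elif key == "$not":
            # Assumed shape: {"$not": {<sub-filter>}}
            found |= fields_in_filter(value)
        elif not key.startswith("$"):
            # Anything that is not an operator is treated as a field name.
            found.add(key)
    return found


flt = {
    "$and": [
        {"product": {"$eq": "Credit card"}},
        {"consumer_complaint_narrative": {"$eq": "refund"}},
    ]
}
not_filterable = fields_in_filter(flt) - FILTERABLE_FIELDS
print(sorted(not_filterable))  # ['consumer_complaint_narrative']
```

Running the check before search.nearest turns a schema mistake into a local error instead of a failed request.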

Step 3 — Ingest records

# Real embedding (recommended for production)
# from sentence_transformers import SentenceTransformer
# model = SentenceTransformer("all-MiniLM-L6-v2")
# def embed(text: str) -> list[float]:
#     return model.encode(text, normalize_embeddings=True).tolist()

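# Deterministic fallback for sandbox testing
import hashlib

def fake_embed(text: str) -> list[float]:
    # Built-in hash() is salted per process, so derive the seed from a stable digest.
    seed = int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest()[:4], "big")
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(DIM).astype("float32")
    v /= np.linalg.norm(v) + 1e-12
    return v.tolist()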

embed = fake_embed  # swap to real embedder in production

with eigenlake.connect(
    url="https://api.eigenlake.dev",
    api_key="<sk_sbx_your_api_key_here>",
) as client:
    idx = client.indexes.create_or_get(
        namespace="cookbook-01",
        index="complaints-search",
        dimensions=DIM,
        schema=schema,
        index_options=index_options,
    )

    payload = [
        {
            "properties": {
                "complaint_id": str(r["complaint_id"]),
                "product": r["product"] or "Unknown",
                "sub_product": r["sub_product"] or "Unknown",
                "issue": r["issue"] or "Unknown",
                "company": r["company"] or "Unknown",
                "state": r["state"] or "Unknown",
                "consumer_complaint_narrative": r["consumer_complaint_narrative"],
                "date_received": r["date_received"] + "T00:00:00Z",
            },
            "vector": embed(r["consumer_complaint_narrative"]),
        }
        for r in ds
    ]

    result = idx.records.add_many(payload, on_error="continue")
    print(f"inserted {len(result)} of {len(payload)} records")
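
For larger datasets you may prefer to send the payload in chunks rather than in one call. The sketch below shows only the chunking logic; batched is a hypothetical helper, and the batch size of 1,000 is an arbitrary assumption, not a documented EigenLake limit.

```python
from typing import Iterator


def batched(items: list[dict], size: int) -> Iterator[list[dict]]:
    """Yield consecutive slices of `items`, each with at most `size` entries."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Placeholder payload with the same top-level keys as the recipe's records.
payload = [{"properties": {"complaint_id": str(i)}, "vector": [0.0]} for i in range(2_500)]

sizes = [len(batch) for batch in batched(payload, 1_000)]
print(sizes)  # [1000, 1000, 500]
```

Inside the with block, each batch would be passed to idx.records.add_many(batch, on_error="continue") in turn.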

Step 4 — Filtered nearest-neighbour search

    query = "credit card charged twice for the same purchase"
    query_vector = embed(query)

    hits = idx.search.nearest(
        vector=query_vector,
        limit=250,
        filter={
            "$and": [
                {"product": {"$eq": "Credit card"}},
                {"date_received": {"$gte": "2024-01-01T00:00:00Z"}},
            ]
        },
    )

    for hit in hits["vectors"][:5]:
        meta = hit["metadata"]
        print(f"{hit['distance']:.4f}  {meta['issue']}  {meta['company']}")

The filter is applied before similarity ranking: only credit-card complaints received on or after 2024-01-01 are compared against the query vector. The fixed date stands in for the 30-day window from the problem statement; in production, compute the cutoff from the current date. This is the difference between a vector database and a vector workload platform: the filter and the search run in the same call over the same index.
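
To make "filter first, then rank" concrete, here is a brute-force reference in NumPy. It is an illustration of the semantics only, under the assumption of cosine similarity on unit vectors; it says nothing about how EigenLake implements the search internally, and filtered_nearest is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(0)

# Six toy records: unit-length vectors plus one metadata column.
vectors = rng.standard_normal((6, 4)).astype("float32")
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
products = np.array([
    "Credit card", "Mortgage", "Credit card",
    "Mortgage", "Credit card", "Mortgage",
])

# The query equals record 1 (a Mortgage), so an unfiltered search would rank it first.
query = vectors[1]


def filtered_nearest(query, vectors, mask, limit):
    """Rank only the rows where `mask` is True, by descending cosine similarity."""
    candidates = np.flatnonzero(mask)        # 1. keep rows that pass the filter
    scores = vectors[candidates] @ query     # 2. similarity against survivors only
    order = np.argsort(-scores)[:limit]      # 3. best matches first
    return candidates[order].tolist()


top = filtered_nearest(query, vectors, products == "Credit card", limit=2)
print(top, [products[i] for i in top])
```

Record 1 never appears in the output even though it is the closest vector overall, because it was excluded before ranking.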

Step 5 — Cursor pagination

    # Walk the filtered subset with search.iterate, which yields one record at a time
    record_count = 0
    for obj in idx.search.iterate(
        filter={"product": {"$eq": "Credit card"}},
        page_size=500,
    ):
        record_count += 1
        if record_count >= 2_500:  # stop after five pages of 500 records
            break
    print(f"retrieved {record_count} records via cursor pagination")

search.iterate returns an iterator that handles the after cursor automatically. Use it when you need to walk a large filtered subset without manually managing offsets.
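
Because iterate yields individual records, the standard library's itertools.islice is a compact way to cap a scan. The sketch below uses fake_iterate, a hypothetical stand-in for idx.search.iterate(...), so it runs without a connection; the record shape is an assumption.

```python
from itertools import islice


def fake_iterate():
    """Stand-in for idx.search.iterate(...): yields records one at a time, without end."""
    n = 0
    while True:
        yield {"metadata": {"complaint_id": str(n)}}
        n += 1


# islice stops pulling from the iterator once 2,500 records have been consumed.
first_2500 = list(islice(fake_iterate(), 2_500))
print(len(first_2500))  # 2500
```

With the real iterator, stopping early this way should also stop further page requests, assuming pages are fetched lazily.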

Step 6 — Search-unit economics

Each search unit returns up to 100 results. The request above with limit=250 consumes 3 search units. A limit=1,000 request consumes 10 units. Budget accordingly:

`limit`	Search units
100	1
250	3
500	5
1,000	10

In production, cap limit to what your downstream system can actually use. If you only show the top 20 results to a user, request limit=100 and re-rank locally.
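
The table follows from rounding limit up to the next multiple of 100. A minimal sketch of that arithmetic, based only on the rule stated above (search_units and RESULTS_PER_UNIT are hypothetical names, not SDK symbols):

```python
RESULTS_PER_UNIT = 100  # from the text: each search unit returns up to 100 results


def search_units(limit: int) -> int:
    """Number of search units consumed by one request with the given limit."""
    if limit < 1:
        raise ValueError("limit must be a positive integer")
    # Integer ceiling division: 250 -> 3, 100 -> 1.
    return (limit + RESULTS_PER_UNIT - 1) // RESULTS_PER_UNIT


for limit in (20, 100, 250, 500, 1_000):
    print(f"limit={limit:>5}  units={search_units(limit)}")
```

Note that limit=20 and limit=100 both cost one unit, which is why requesting 100 and re-ranking locally costs nothing extra.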

Agent path

The same search can be expressed in natural language:

    agent_result = idx.agent.query(
        "Find credit card complaints about incorrect information from the last 30 days"
    )
    print(agent_result["action"])   # "filter" or "cluster"
    print(agent_result["filter"])   # inferred MongoDB-style filter

In mode="auto", the agent inspects the query and returns a structured action. The call above does not pass mode, so this assumes auto is the default. For search-like language it currently returns a filter action. Pass agent_result["filter"] as the filter= argument to search.nearest to get the ranked results. Future SDK versions may route directly to search.nearest.
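
The hand-off from the agent to search.nearest can be sketched as a small adapter. nearest_kwargs is a hypothetical helper; it relies only on the "action" and "filter" keys shown above and on the vector, limit, and filter keyword names used in Step 4.

```python
def nearest_kwargs(agent_result: dict, query_vector: list[float], limit: int = 100) -> dict:
    """Turn an agent 'filter' action into keyword arguments for search.nearest."""
    action = agent_result.get("action")
    if action != "filter":
        # A "cluster" action (or anything else) needs different handling.
        raise ValueError(f"expected a filter action, got {action!r}")
    return {"vector": query_vector, "limit": limit, "filter": agent_result["filter"]}


# Example agent response, written by hand for illustration.
example = {"action": "filter", "filter": {"product": {"$eq": "Credit card"}}}
kwargs = nearest_kwargs(example, query_vector=[0.0] * 384)
print(kwargs["limit"], kwargs["filter"])
```

Inside the with block, the result would be used as hits = idx.search.nearest(**kwargs).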

What is happening, line by line

Schema-first filtering. product and date_received are marked filterable=True so that they can appear in the filter= argument to search.nearest. If you forget to mark a field as filterable at index creation time, you cannot filter by it later without rebuilding the index.

$and and $or. The filter language supports MongoDB-style operators. Use $and to require multiple conditions, $or to accept any of several, and $in for membership lists. The full set is $eq, $ne, $in, $nin, $gt, $gte, $lt, $lte, $and, $or, $not, and $exists.
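
As an example of composing these operators, the sketch below builds a filter for a rolling date window instead of a hard-coded date. recent_products_filter is a hypothetical helper; the timestamp format copies the one used for date_received in this recipe.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def recent_products_filter(products: list[str], days: int,
                           now: Optional[datetime] = None) -> dict:
    """Match any of `products` received within the last `days` days."""
    now = now or datetime.now(timezone.utc)
    cutoff = (now - timedelta(days=days)).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "$and": [
            {"product": {"$in": products}},        # membership list
            {"date_received": {"$gte": cutoff}},   # on or after the cutoff
        ]
    }


# A fixed "now" keeps the example reproducible.
flt = recent_products_filter(
    ["Credit card", "Bank account"],
    days=30,
    now=datetime(2024, 1, 31, tzinfo=timezone.utc),
)
print(flt["$and"][1])  # {'date_received': {'$gte': '2024-01-01T00:00:00Z'}}
```

Omit the now argument in production so the window tracks the current date.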

search.iterate vs. search.list. list returns a single page with an after cursor. iterate is a convenience wrapper that calls list repeatedly and yields each record. Both are useful; iterate is simpler for large scans.
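
The relationship between the two calls can be sketched with a toy in-memory pager. Everything here is illustrative: fake_list stands in for a single-page list call, the "records" response key and the integer-as-string cursor are assumptions, and only the after parameter name comes from the text above.

```python
from typing import Callable, Iterator, Optional

DATA = [{"complaint_id": str(i)} for i in range(7)]


def fake_list(page_size: int, after: Optional[str] = None) -> dict:
    """Hypothetical single-page call: returns one page and the cursor for the next."""
    start = 0 if after is None else int(after)
    page = DATA[start:start + page_size]
    end = start + len(page)
    return {"records": page, "after": str(end) if end < len(DATA) else None}


def iterate(list_page: Callable[..., dict], page_size: int) -> Iterator[dict]:
    """Call list_page repeatedly, yielding records until the cursor runs out."""
    after = None
    while True:
        page = list_page(page_size=page_size, after=after)
        yield from page["records"]
        after = page["after"]
        if after is None:
            break


ids = [r["complaint_id"] for r in iterate(fake_list, page_size=3)]
print(ids)  # ['0', '1', '2', '3', '4', '5', '6']
```

Seven records with a page size of three take three list calls; the caller of iterate never sees the cursor.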

Metadata in every hit. Every vector hit includes the full record: UUID, distance, and every property you stored. There is no follow-up get call required. This is why EigenLake does not need a separate metadata store.

Variations

Swap the embedding model. all-MiniLM-L6-v2 is 384-dimensional and fast. For higher quality at the cost of speed, use all-mpnet-base-v2 (768 dimensions) or a domain-specific model. Just change DIM and the model name; the rest of the recipe is identical.
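
When you swap models, a one-line probe can confirm that the embedder's output length matches DIM before any records are ingested. check_dimension is a hypothetical helper that works with any embed function of the shape used in this recipe.

```python
from typing import Callable


def check_dimension(embed: Callable[[str], list[float]], dim: int) -> int:
    """Embed a probe string and verify the vector length matches `dim`."""
    got = len(embed("dimension probe"))
    if got != dim:
        raise ValueError(f"embedder returns {got} dimensions, index expects {dim}")
    return got


# Toy embedder standing in for a 768-dimensional model.
toy_embed = lambda text: [0.0] * 768
print(check_dimension(toy_embed, 768))  # 768
```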

Search across multiple products. Replace the $eq with $in:

filter={"product": {"$in": ["Credit card", "Bank account"]}}

Geographic aggregation. Filter by state and aggregate locally:

hits = idx.search.nearest(vector=query_vector, limit=1_000, filter={"state": {"$eq": "CA"}})
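
Local aggregation can be as simple as a Counter over the metadata of each hit. The sketch below uses a hand-written stand-in for the response, following the hits["vectors"] and hit["metadata"] shape from Step 4; the company names and distances are made-up placeholders.

```python
from collections import Counter

# Placeholder response in the shape used in Step 4; values are illustrative only.
hits = {
    "vectors": [
        {"distance": 0.12, "metadata": {"state": "CA", "company": "Bank A"}},
        {"distance": 0.15, "metadata": {"state": "CA", "company": "Bank B"}},
        {"distance": 0.21, "metadata": {"state": "CA", "company": "Bank A"}},
    ]
}

# Count how many of the returned complaints belong to each company.
by_company = Counter(hit["metadata"]["company"] for hit in hits["vectors"])
print(by_company.most_common(1))  # [('Bank A', 2)]
```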

Raise the limit for downstream re-ranking. Request limit=500 (5 search units), then apply a cross-encoder re-ranking model to the top 500 before showing the final top 10 to the user.
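
The re-ranking step can be sketched with a pluggable scoring function. Here token_overlap is a deliberately crude stand-in for a cross-encoder so the example runs without downloading a model; rerank and token_overlap are hypothetical names, and the hit shape follows Step 4.

```python
from typing import Callable


def rerank(query: str, hits: list[dict],
           score: Callable[[str, str], float], top_k: int = 10) -> list[dict]:
    """Re-order hits by score(query, narrative), highest first, and keep top_k."""
    scored = [
        (score(query, hit["metadata"]["consumer_complaint_narrative"]), hit)
        for hit in hits
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [hit for _, hit in scored[:top_k]]


def token_overlap(query: str, text: str) -> float:
    """Toy scorer: fraction of query tokens that appear in the text."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / max(len(q), 1)


hits = [
    {"metadata": {"consumer_complaint_narrative": "wrong address on my report"}},
    {"metadata": {"consumer_complaint_narrative": "my card was charged a fee"}},
    {"metadata": {"consumer_complaint_narrative": "I was charged twice for one purchase"}},
]
best = rerank("charged twice", hits, token_overlap, top_k=2)
print([h["metadata"]["consumer_complaint_narrative"] for h in best])
```

In production, replace token_overlap with a function that calls your cross-encoder on the (query, narrative) pair.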

Merged structured + semantic retrieval. For agent memory patterns, combine records.list (exact metadata matches) and search.nearest (similarity matches) in one de-duplicated result set. See Cookbook 2: Clustering in Depth for the multi-workload evolution of this pattern.
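
The de-duplication itself is independent of the SDK. The sketch below assumes both result lists have already been fetched and that every record exposes complaint_id under a "metadata" key; that shape is confirmed above only for search hits, so treat it as an assumption for list results. merge_results is a hypothetical helper.

```python
def merge_results(exact: list[dict], similar: list[dict]) -> list[dict]:
    """Combine two result lists, keeping the first occurrence of each complaint_id."""
    merged: dict[str, dict] = {}
    # Exact matches are processed first, so they win when an id appears in both lists.
    for source, records in (("exact", exact), ("similar", similar)):
        for rec in records:
            key = rec["metadata"]["complaint_id"]
            merged.setdefault(key, {**rec, "source": source})
    return list(merged.values())


exact = [
    {"metadata": {"complaint_id": "1"}},
    {"metadata": {"complaint_id": "2"}},
]
similar = [
    {"distance": 0.10, "metadata": {"complaint_id": "2"}},
    {"distance": 0.20, "metadata": {"complaint_id": "3"}},
]
combined = merge_results(exact, similar)
print([(r["metadata"]["complaint_id"], r["source"]) for r in combined])
# [('1', 'exact'), ('2', 'exact'), ('3', 'similar')]
```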
