Cookbook 1: Semantic Search at Scale
Use filtered nearest-neighbour search, cursor pagination, and search-unit economics to retrieve relevant records from a corpus of tens of thousands of consumer complaints.
Answer
How do I run semantic search at scale in EigenLake?
Define a schema with filterable metadata, ingest embedded records, then run filtered nearest-neighbour search with cursor pagination so you can retrieve thousands of relevant results economically.
Inputs
- A corpus of consumer complaints with metadata (product, issue, date, company)
- Embedding vectors that match the index dimension
- An EigenLake API key
Outputs
- A populated EigenLake index with 10,000 complaint records
- Filtered nearest-neighbour search results ranked by relevance
- Paginated retrieval beyond the first page of results
- Search-unit cost accounting for production budgeting
This recipe teaches search as an engineering workload, not a demo. You will learn how to combine vector similarity with structured metadata filters, how to paginate through large result sets, and how to account for search-unit costs when you move from prototype to production.
Problem
You have 10,000 consumer complaints and you need to find credit-card complaints about incorrect information from the last 30 days. A plain vector search would return the most semantically similar complaints regardless of product or date. You need the similarity ranking to apply only within the subset that matches your business criteria.
Prerequisites
- Python 3.10 or newer.
- pip install eigenlake numpy datasets
- An API key from https://api.eigenlake.dev.
- A sentence-transformer model. The recipe uses sentence-transformers/all-MiniLM-L6-v2 (384 dimensions). If you do not want to download a model, the fake_embed fallback below is deterministic and self-contained.
The recipe
Step 1 — Load the dataset
from datasets import load_dataset
import numpy as np
import eigenlake
from eigenlake import schema as s
DIM = 384
# Load a subset of the CFPB consumer-complaints dataset
# Full dataset: https://huggingface.co/datasets/cfpb/consumer-complaints
ds = load_dataset("cfpb/consumer-complaints", split="train")
ds = ds.filter(lambda r: r["consumer_complaint_narrative"] is not None)
ds = ds.shuffle(seed=42).select(range(10_000))
Step 2 — Define the schema
schema, index_options = (
    s.SchemaBuilder(additional_properties=False)
    .add("complaint_id", s.string(required=True, filterable=True))
    .add("product", s.string(filterable=True))
    .add("sub_product", s.string(filterable=True))
    .add("issue", s.string(filterable=True))
    .add("company", s.string(filterable=True))
    .add("state", s.string(filterable=True))
    .add("consumer_complaint_narrative", s.string(filterable=False))
    .add("date_received", s.datetime(filterable=True))
    .build()
)
Every field that will appear in a filter= argument must be filterable=True. The complaint narrative is filterable=False because you search it by vector similarity, not by text equality.
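One way to catch a non-filterable field before a request leaves your process is to walk the filter dictionary locally. The sketch below is illustrative only: FILTERABLE_FIELDS, fields_in_filter, and check_filter are hypothetical helpers, not part of the EigenLake SDK, and the code assumes the MongoDB-style filter shape used later in this recipe.

```python
# Hypothetical local check; not an EigenLake SDK feature.
# The set mirrors the fields marked filterable=True in the schema above.
FILTERABLE_FIELDS = {
    "complaint_id", "product", "sub_product", "issue",
    "company", "state", "date_received",
}

LOGICAL_OPS = {"$and", "$or"}  # operators whose value is a list of sub-filters


def fields_in_filter(flt: dict) -> set[str]:
    """Collect every field name referenced by a MongoDB-style filter."""
    found: set[str] = set()
    for key, value in flt.items():
        if key in LOGICAL_OPS:
            for sub in value:
                found |= fields_in_filter(sub)
        else:
            found.add(key)  # any other top-level key is treated as a field name
    return found


def check_filter(flt: dict) -> None:
    """Raise early if the filter touches a field that is not filterable."""
    unknown = fields_in_filter(flt) - FILTERABLE_FIELDS
    if unknown:
        raise ValueError(f"not filterable: {sorted(unknown)}")
```

Calling check_filter on a filter that references consumer_complaint_narrative raises a ValueError, which is usually easier to debug than a rejected request.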
Step 3 — Ingest records
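import hashlib

# Real embedding (recommended for production)
# from sentence_transformers import SentenceTransformer
# model = SentenceTransformer("all-MiniLM-L6-v2")
# def embed(text: str) -> list[float]:
#     return model.encode(text, normalize_embeddings=True).tolist()

# Deterministic fallback for sandbox testing (vectors carry no semantic meaning)
def fake_embed(text: str) -> list[float]:
    # Seed from a stable digest. Built-in hash() is salted per process for
    # strings, so it would produce different vectors on every run.
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    rng = np.random.default_rng(int.from_bytes(digest[:4], "big"))
    v = rng.standard_normal(DIM).astype("float32")
    v /= np.linalg.norm(v) + 1e-12  # unit-normalise; epsilon guards against division by zero
    return v.tolist()

embed = fake_embed  # swap to real embedder in production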
with eigenlake.connect(
    url="https://api.eigenlake.dev",
    api_key="<sk_sbx_your_api_key_here>",
) as client:
    idx = client.indexes.create_or_get(
        namespace="cookbook-01",
        index="complaints-search",
        dimensions=DIM,
        schema=schema,
        index_options=index_options,
    )
    payload = [
        {
            "properties": {
                "complaint_id": str(r["complaint_id"]),
                "product": r["product"] or "Unknown",
                "sub_product": r["sub_product"] or "Unknown",
                "issue": r["issue"] or "Unknown",
                "company": r["company"] or "Unknown",
                "state": r["state"] or "Unknown",
                "consumer_complaint_narrative": r["consumer_complaint_narrative"],
                "date_received": r["date_received"] + "T00:00:00Z",
            },
            "vector": embed(r["consumer_complaint_narrative"]),
        }
        for r in ds
    ]
    result = idx.records.add_many(payload, on_error="continue")
    print(f"inserted {len(result)} of {len(payload)} records")
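For corpora much larger than this one, it may be more practical to send records in batches than in one call. The helper below is a generic sketch: chunked is a hypothetical name, the batch size of 500 is an arbitrary choice, and calling add_many once per batch is an assumption rather than a documented EigenLake requirement.

```python
from itertools import islice
from typing import Iterable, Iterator


def chunked(items: Iterable[dict], size: int) -> Iterator[list[dict]]:
    """Yield consecutive lists of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    it = iter(items)
    # islice pulls the next `size` items; an empty list ends the loop.
    while batch := list(islice(it, size)):
        yield batch


# Usage sketch (assumes `idx` and `payload` from the step above):
# for batch in chunked(payload, 500):
#     idx.records.add_many(batch, on_error="continue")
```

Because chunked accepts any iterable, the payload can also be a generator, which avoids holding every embedded record in memory at once.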
Step 4 — Filtered nearest-neighbour search
query = "credit card charged twice for the same purchase"
query_vector = embed(query)
hits = idx.search.nearest(
    vector=query_vector,
    limit=250,
    filter={
        "$and": [
            {"product": {"$eq": "Credit card"}},
            {"date_received": {"$gte": "2024-01-01T00:00:00Z"}},
        ]
    },
)
for hit in hits["vectors"][:5]:
    meta = hit["metadata"]
    print(f"{hit['distance']:.4f} {meta['issue']} {meta['company']}")
The filter is applied before similarity ranking. Only credit-card complaints received since 2024-01-01 are compared against the query vector. This is the difference between a vector database and a vector workload platform: the filter and the search run in the same call over the same index.
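To make "filter first, then rank" concrete, here is a small NumPy sketch of the idea on six toy vectors. It is not how EigenLake is implemented; filtered_nearest is a hypothetical function, the data is random, and cosine distance on unit vectors is an assumption about the metric.

```python
import numpy as np

# Toy corpus: six unit vectors with a product label each (made-up data).
rng = np.random.default_rng(0)
vectors = rng.standard_normal((6, 4)).astype("float32")
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
products = np.array(
    ["Credit card", "Mortgage", "Credit card", "Credit card", "Mortgage", "Credit card"]
)


def filtered_nearest(query, vectors, mask, limit):
    """Rank only the rows where mask is True; return (row index, distance) pairs."""
    candidates = np.flatnonzero(mask)  # rows that pass the metadata filter
    if candidates.size == 0:
        return []
    dists = 1.0 - vectors[candidates] @ query  # cosine distance for unit vectors
    order = np.argsort(dists)[:limit]  # closest first
    return [(int(candidates[i]), float(dists[i])) for i in order]


top = filtered_nearest(vectors[2], vectors, products == "Credit card", limit=3)
```

Rows labelled Mortgage never enter the distance computation, so they cannot appear in the result however close they are to the query.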
Step 5 — Cursor pagination
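# Walk a filtered subset with search.iterate
PAGE_SIZE = 500
MAX_RECORDS = 5 * PAGE_SIZE  # stop after five pages' worth of records

record_count = 0
for obj in idx.search.iterate(
    filter={"product": {"$eq": "Credit card"}},
    page_size=PAGE_SIZE,
):
    # iterate yields one record at a time, so count records, not pages
    record_count += 1
    if record_count >= MAX_RECORDS:
        break
print(f"retrieved {record_count} records via cursor pagination")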
search.iterate returns an iterator that handles the after cursor automatically. Use it when you need to walk a large filtered subset without manually managing offsets.
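The cursor pattern itself can be sketched without the SDK. In the example below, list_page stands in for a single-page call and iterate is the wrapper that follows the cursor. Both are hypothetical, in-memory stand-ins with assumed field names ("records", "after"), not EigenLake functions.

```python
from typing import Iterator, Optional

# In-memory stand-in for a filtered subset of the index (made-up data).
RECORDS = [{"id": i, "product": "Credit card"} for i in range(1_200)]


def list_page(after: Optional[int], page_size: int) -> dict:
    """Return one page of records plus the cursor for the next page."""
    start = 0 if after is None else after
    page = RECORDS[start:start + page_size]
    end = start + len(page)
    next_cursor = end if end < len(RECORDS) else None  # None marks the last page
    return {"records": page, "after": next_cursor}


def iterate(page_size: int) -> Iterator[dict]:
    """Yield records one by one, following the cursor until it runs out."""
    after = None
    while True:
        page = list_page(after, page_size)
        yield from page["records"]
        after = page["after"]
        if after is None:
            return
```

The caller only sees a stream of records. Page boundaries and the cursor stay inside the wrapper, which is why a loop over iterate counts records rather than pages.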
Step 6 — Search-unit economics
Each search unit returns up to 100 results. The request above with limit=250 consumes 3 search units. A limit=1,000 request consumes 10 units. Budget accordingly:
| limit | Search units |
|---|---|
| 100 | 1 |
| 250 | 3 |
| 500 | 5 |
| 1,000 | 10 |
In production, cap limit to what your downstream system can actually use. If you only show the top 20 results to a user, request limit=100 and re-rank locally.
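The table follows from rounding limit up to the next multiple of 100. A minimal budgeting sketch, assuming that rounding rule holds for every limit; search_units and monthly_units are hypothetical helpers, and the query volume in the usage note is an invented example.

```python
import math

RESULTS_PER_UNIT = 100  # each search unit returns up to 100 results


def search_units(limit: int) -> int:
    """Search units consumed by one request with the given limit."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return math.ceil(limit / RESULTS_PER_UNIT)


def monthly_units(limit: int, queries_per_day: int, days: int = 30) -> int:
    """Rough budget: units per request times request volume."""
    return search_units(limit) * queries_per_day * days
```

For example, 1,000 queries a day at limit=250 would come to 3 × 1,000 × 30 = 90,000 search units over a 30-day month, while the same traffic at limit=100 would need 30,000.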
Agent path
The same search can be expressed in natural language:
agent_result = idx.agent.query(
    "Find credit card complaints about incorrect information from the last 30 days"
)
print(agent_result["action"])  # "filter" or "cluster"
print(agent_result["filter"])  # inferred MongoDB-style filter
In mode="auto", the agent inspects the query and returns a structured action. For search-like language it currently returns a filter action. Apply that filter to search.nearest to get the ranked results. Future SDK versions may route directly to search.nearest.
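Applying the inferred filter could look like the sketch below. It uses only the two calls shown in this recipe (agent.query and search.nearest) and assumes the result is a dictionary with "action" and "filter" keys, as in the example above. The wrapper name search_with_agent_filter is hypothetical.

```python
def search_with_agent_filter(idx, question: str, query_vector: list[float], limit: int = 100):
    """Ask the agent for a filter, then run a filtered nearest-neighbour search."""
    agent_result = idx.agent.query(question)
    if agent_result.get("action") != "filter":
        # For example "cluster": not a search request, so let the caller decide.
        return None
    return idx.search.nearest(
        vector=query_vector,
        limit=limit,
        filter=agent_result["filter"],
    )
```

Returning None for non-filter actions keeps the routing decision visible to the caller instead of silently running an unfiltered search.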
What is happening, line by line
Schema-first filtering. The reason product and date_received are filterable=True is so they can appear in the filter= argument to search.nearest. If you forget to mark a field as filterable at index creation time, you cannot filter by it later without rebuilding the index.
$and and $or. The filter language supports MongoDB-style operators. Use $and to require multiple conditions, $or to accept any of several, and $in for membership lists. The full set is $eq, $ne, $in, $nin, $gt, $gte, $lt, $lte, $and, $or, $not, and $exists.
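To see how these operators compose, the sketch below evaluates a filter against a single record in plain Python. It follows MongoDB conventions as an assumption (for instance, $not is treated as a field-level negation and a missing field never satisfies a range test). EigenLake's server-side semantics may differ in edge cases, and matches is a hypothetical helper.

```python
from typing import Any

_MISSING = object()  # sentinel for "field not present on the record"

_COMPARE = {
    "$eq": lambda v, a: v == a,
    "$ne": lambda v, a: v != a,
    "$in": lambda v, a: v in a,
    "$nin": lambda v, a: v not in a,
    "$gt": lambda v, a: v > a,
    "$gte": lambda v, a: v >= a,
    "$lt": lambda v, a: v < a,
    "$lte": lambda v, a: v <= a,
}
_ORDERED = {"$gt", "$gte", "$lt", "$lte"}


def _test(op: str, value: Any, arg: Any) -> bool:
    """Evaluate one operator against one field value."""
    if op == "$exists":
        return (value is not _MISSING) == bool(arg)
    if op == "$not":  # field-level negation of a nested condition
        return not all(_test(o, value, a) for o, a in arg.items())
    if value is _MISSING and op in _ORDERED:
        return False  # a missing field never satisfies a range test
    return _COMPARE[op](value, arg)


def matches(record: dict, flt: dict) -> bool:
    """Return True if the record satisfies every clause of the filter."""
    for key, cond in flt.items():
        if key == "$and":
            ok = all(matches(record, sub) for sub in cond)
        elif key == "$or":
            ok = any(matches(record, sub) for sub in cond)
        else:
            value = record.get(key, _MISSING)
            ok = all(_test(op, value, arg) for op, arg in cond.items())
        if not ok:
            return False
    return True
```

Dates here are compared as strings, which works only because every timestamp uses the same ISO 8601 layout (for example 2024-01-01T00:00:00Z), so lexicographic order matches chronological order.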
search.iterate vs. search.list. list returns a single page with an after cursor. iterate is a convenience wrapper that calls list repeatedly and yields each record. Both are useful; iterate is simpler for large scans.
Metadata in every hit. Every vector hit includes the full record: UUID, distance, and every property you stored. There is no follow-up get call required. This is why EigenLake does not need a separate metadata store.
Variations
Swap the embedding model. all-MiniLM-L6-v2 is 384-dimensional and fast. For higher quality at the cost of speed, use all-mpnet-base-v2 (768 dimensions) or a domain-specific model. Just change DIM and the model name; the rest of the recipe is identical.
Search across multiple products. Replace the $eq with $in:
filter={"product": {"$in": ["Credit card", "Bank account"]}}
Geographic aggregation. Filter by state and aggregate locally:
hits = idx.search.nearest(vector=query_vector, limit=1_000, filter={"state": {"$eq": "CA"}})
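Local aggregation can be as simple as a Counter over the returned metadata. The sketch assumes the response shape used in Step 4 (a "vectors" list whose items carry a "metadata" dictionary). count_by is a hypothetical helper and the company names are invented sample data.

```python
from collections import Counter


def count_by(hits: dict, field: str) -> Counter:
    """Count hits per value of a metadata field."""
    return Counter(hit["metadata"].get(field, "Unknown") for hit in hits["vectors"])


# Invented sample in the same shape as a search.nearest response.
sample_hits = {
    "vectors": [
        {"distance": 0.12, "metadata": {"company": "Acme Bank", "state": "CA"}},
        {"distance": 0.18, "metadata": {"company": "Example Credit", "state": "CA"}},
        {"distance": 0.21, "metadata": {"company": "Acme Bank", "state": "CA"}},
    ]
}
by_company = count_by(sample_hits, "company")
```

by_company.most_common(5) then gives the companies that appear most often among the nearest complaints in that state.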
Raise the limit for downstream re-ranking. Request limit=500 (5 search units), then apply a cross-encoder re-ranking model to the top 500 before showing the final top 10 to the user.
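The re-ranking step can be sketched with any function that scores a (query, text) pair. In the example below, token_overlap is a deliberately crude stand-in for a cross-encoder so the code runs without a model download. In practice you would pass the cross-encoder's scoring function instead. rerank and token_overlap are hypothetical names, and the hit shape follows Step 4.

```python
from typing import Callable


def rerank(query: str, hits: list[dict], score: Callable[[str, str], float], top_k: int = 10) -> list[dict]:
    """Re-order hits by score(query, narrative), highest first, and keep top_k."""
    scored = [
        (score(query, hit["metadata"]["consumer_complaint_narrative"]), hit)
        for hit in hits
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable sort keeps ties in input order
    return [hit for _, hit in scored[:top_k]]


def token_overlap(query: str, text: str) -> float:
    """Toy scorer: fraction of query tokens that also appear in the text."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0
```

A call such as rerank(query, hits["vectors"], token_overlap, top_k=10) returns the ten hits to display. The vector search supplies recall and the scorer supplies the final ordering.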
Merged structured + semantic retrieval. For agent memory patterns, combine records.list (exact metadata matches) and search.nearest (similarity matches) in one de-duplicated result set. See Cookbook 2: Clustering in Depth for the multi-workload evolution of this pattern.
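A de-duplicated merge might look like the sketch below. It assumes both result lists contain dictionaries with a "metadata" entry holding complaint_id. The actual shape returned by records.list is not shown in this recipe, so treat the key names and the merge_results helper as hypothetical.

```python
def merge_results(exact: list[dict], similar: list[dict], key: str = "complaint_id") -> list[dict]:
    """Merge two result lists, keeping one entry per key; exact matches win."""
    merged: dict[str, dict] = {}
    for rec in exact:
        merged[rec["metadata"][key]] = {**rec, "source": "exact"}
    for hit in similar:
        # setdefault keeps the exact match if the same record appears in both lists
        merged.setdefault(hit["metadata"][key], {**hit, "source": "similar"})
    return list(merged.values())  # dicts preserve insertion order
```

Tagging each entry with its source lets downstream code weight exact metadata matches differently from similarity matches.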
What to read next
- Cookbook 2: Clustering in Depth — discover structural families in the same complaint corpus without pre-defined labels.
- Cookbook 6: From Search to Insight — chain search with clustering, anomaly detection, topic modeling, and temporal shift into a single investigative pipeline.
- Launching EigenLake: Vector Workloads Where Your AI Data Lives — the launch narrative for running workloads where your vector data lives.
- Why We Call It Vector Intelligence, Not Vector Search — the philosophical argument for computation over retrieval.