The Problem
We had a working DialogFlow ES chatbot serving apartment property clients—handling availability queries, scheduling tours, answering questions about amenities. It worked, but DialogFlow's intent-matching was brittle and required constant tuning per client.
The goal: migrate to an LLM-based system that could handle more natural conversations while still delivering accurate, structured data (apartment availability, pricing, filtering by bedrooms, move-in date, and so on).
My Role
Dev lead. I proposed the architecture to the company, got buy-in, and implemented the prototype with a junior colleague. This was my bet to make and my design to execute.
The Architectural Bet
I bet on large context windows (50-60K input tokens at conversation start) rather than step-by-step orchestration with heavy guardrails. The reasoning:
- Frontier models were rapidly improving — by the time we shipped, context handling would be stable enough
- Loading all relevant training and RAG data upfront would reduce orchestration complexity
- Utterance-aligned RAG preselection would keep responses grounded (see the sketch after this list)
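In spirit, the approach looked something like the sketch below. The embedding function, tokenizer, prompt text, and 60K budget are stand-ins for illustration, not our production code: rank the RAG corpus against the opening utterance, then greedily pack the most relevant documents into one large system prompt.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

public sealed record RagDoc(string Id, string Content, float[] Embedding);

public static class ContextBuilder
{
    // Build a single large system prompt up front, instead of fetching
    // context step by step during the conversation.
    public static async Task<string> BuildSystemPromptAsync(
        string openingUtterance,
        IReadOnlyList<RagDoc> corpus,
        Func<string, Task<float[]>> embedAsync,   // stand-in for the embedding model
        Func<string, int> countTokens,            // stand-in for the tokenizer
        int tokenBudget = 60_000)
    {
        var queryVector = await embedAsync(openingUtterance);

        // Utterance-aligned preselection: rank the corpus against the opener.
        var ranked = corpus.OrderByDescending(d => Cosine(queryVector, d.Embedding));

        var prompt = new StringBuilder("You are a leasing assistant for this property.\n\n");
        var used = countTokens(prompt.ToString());

        // Greedily add documents until the input-token budget is exhausted.
        foreach (var doc in ranked)
        {
            var cost = countTokens(doc.Content);
            if (used + cost > tokenBudget) break;
            prompt.AppendLine(doc.Content);
            used += cost;
        }
        return prompt.ToString();
    }

    private static float Cosine(float[] a, float[] b)
    {
        float dot = 0, na = 0, nb = 0;
        for (var i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (MathF.Sqrt(na) * MathF.Sqrt(nb) + 1e-6f);
    }
}
```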
The counterargument (which I heard but discounted): tokens cost money, and large payloads increase hallucination risk. I believed the ecosystem would catch up.
What Worked Well
Elasticsearch for Training/RAG Data
We were already on Elasticsearch, so storing training docs and RAG content as queryable NoSQL documents was natural. Non-developers could view and edit content. The system's knowledge was maintainable, not buried in code.
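To make that concrete, here's a hedged sketch assuming the NEST 7.x client; the index name, document shape, and field names are invented for illustration rather than taken from our actual schema.

```csharp
using System.Collections.Generic;
using Nest;

// Illustrative document shape for editable training/RAG content.
public sealed class RagContent
{
    public string Id { get; set; } = "";
    public string ClientId { get; set; } = "";  // which property the content belongs to
    public string Topic { get; set; } = "";     // e.g. "amenities", "pet policy"
    public string Body { get; set; } = "";      // the text handed to the LLM
}

public static class RagContentStore
{
    public static IReadOnlyCollection<RagContent> Search(
        IElasticClient client, string clientId, string utterance)
    {
        // Filter to the client's content, then rank by relevance to the utterance.
        // Assumes ClientId is mapped as a keyword field.
        var response = client.Search<RagContent>(s => s
            .Index("rag-content")
            .Size(20)
            .Query(q => q
                .Bool(b => b
                    .Filter(f => f.Term(t => t.Field(d => d.ClientId).Value(clientId)))
                    .Must(m => m.Match(mm => mm.Field(d => d.Body).Query(utterance))))));

        return response.Documents;
    }
}
```

Because the content lived in ordinary documents, non-developers could edit it through normal Elasticsearch tooling while the bot just queried it.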
Function/Tool Calls for Structured Data
Apartment listings (~1MB of JSON per property, refreshed daily) needed exact queries: "3 bedrooms, ground floor, available next week." We adapted the proven query patterns from DialogFlow into LLM tool calls: the LLM collected parameters in conversation, and the tools returned exact data.
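A simplified sketch of that division of labor, with illustrative types and field names rather than the real listings schema: the LLM gathers the parameters, and the tool runs an exact query against the daily feed.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Parameters the LLM collects in conversation before invoking the tool.
public sealed record AvailabilityQuery(int? Bedrooms, bool? GroundFloor, DateOnly? MoveInBy);

// One unit from the per-property listings feed.
public sealed record Unit(
    string UnitNumber, int Bedrooms, int Floor, decimal Rent, DateOnly AvailableOn);

public static class AvailabilityTool
{
    // The model never computes availability itself; it hands over exact
    // parameters and gets exact rows back to present to the user.
    public static IReadOnlyList<Unit> FindUnits(
        IEnumerable<Unit> listings, AvailabilityQuery query)
    {
        return listings
            .Where(u => query.Bedrooms is null || u.Bedrooms == query.Bedrooms)
            .Where(u => query.GroundFloor is not true || u.Floor == 1)
            .Where(u => query.MoveInBy is null || u.AvailableOn <= query.MoveInBy)
            .OrderBy(u => u.Rent)
            .ToList();
    }
}
```

The tool declaration exposed the same handful of parameters to the model; everything downstream of the call stayed deterministic.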
Clean Separation of Concerns
The LLM handles conversation flow and natural language; tools handle structured queries and calculations. Neither tries to do the other's job.
What Didn't Work
USEARCH In-Memory Vectors
We used USEARCH for vector storage, loading every embedding into memory at startup. Load times were painful: minutes per deployment, and worse when scaled across Kubernetes nodes. Development slowed to a crawl as we waited on each other's builds.
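Roughly, the startup path looked like the sketch below. The types are hypothetical stand-ins for our USEARCH wrapper and embedding export, but the shape of the problem is accurate: every pod rebuilt the whole index in memory before it could serve a request.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;

public interface IEmbeddingSource
{
    // Streams every stored embedding; in our case, a full export.
    IAsyncEnumerable<(string Id, float[] Vector)> LoadAllAsync();
}

public sealed class InMemoryVectorIndex
{
    private readonly Dictionary<string, float[]> _vectors = new();
    public void Add(string id, float[] vector) => _vectors[id] = vector;
    public int Count => _vectors.Count;
}

public static class VectorIndexBootstrap
{
    // Ran on every deployment, on every Kubernetes node: minutes of
    // loading before the service was ready, multiplied across the team.
    public static async Task<InMemoryVectorIndex> BuildAsync(IEmbeddingSource source)
    {
        var index = new InMemoryVectorIndex();
        await foreach (var (id, vector) in source.LoadAllAsync())
            index.Add(id, vector);
        return index;
    }
}
```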
This was a design mistake.
Never Validated the Core Thesis
The company was acquired before the system reached production. The acquiring company was Python-only (we were .NET) and had its own LLM systems. Our work was discarded.
Honest Uncertainty
I still don't know if the large-context approach would have worked in production.
The acquiring company used LangChain with step-by-step orchestration and heavy guardrails, the opposite of my bet. Their approach was more defensive against hallucination. Whether it was actually better on token cost and reliability, I genuinely can't say.
The bet was reasonable given where frontier models were heading in 2023-2024. But it was never proven.
What I'd Do Differently
- Not use in-memory vector storage during development. The slow load times killed iteration speed. A proper vector database (Pinecone, Qdrant, etc.) would have been worth the operational overhead.
- Build a smaller proof of concept first. We went big before validating the context-window hypothesis at smaller scale.
- More explicit fallback paths. If the large-context approach started hallucinating, we didn't have a clean degradation strategy.