I built a RAG chatbot that answers questions with accurate, cited responses by tightly integrating retrieval with generation. Most chatbots hallucinate or give vague answers; mine pulls precise quotes from a custom knowledge base. This post breaks down the technical decisions, failures, and fixes that made it work.
Architecture Overview
The system follows a standard RAG pattern but with strict controls over retrieval quality. A user submits a question via a Next.js frontend. The request hits an API route that first queries a Pinecone vector store using LlamaIndex to find the most relevant text chunks. Those chunks, along with the original question, are formatted into a precise prompt for the OpenAI API. The final answer is streamed back with citations linked to the source text.
flowchart TD
A[User Question] --> B[Next.js API Route]
B --> C[LlamaIndex Query Engine]
C --> D[Pinecone Vector Store]
D --> E[Retrieved Text Chunks with Metadata]
E --> F[Construct Prompt with Citations]
F --> G[OpenAI GPT-4 Completion]
G --> H[Stream Answer with Sources]
H --> I[Next.js Frontend]
The key is that the answer is generated from the retrieved chunks, not just inspired by them. The prompt engineering enforces this.
Key Technical Decisions
The first critical decision was using metadata filtering during retrieval. Without it, the vector search often returns semantically similar but contextually irrelevant chunks. I used Pinecone's metadata filters to restrict searches to specific document sections or types.
import { Pinecone, QueryOptions } from '@pinecone-database/pinecone';

const queryPineconeWithFilter = async (query: string, docType: string) => {
  const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });
  const index = pinecone.Index('knowledge-base');
  const queryOptions: QueryOptions = {
    topK: 5,
    includeMetadata: true,
    filter: { doc_type: { $eq: docType } }, // Critical filter
  };
  const queryEmbedding = await getEmbedding(query); // Your embedding function (sketched below)
  const results = await index.query({
    vector: queryEmbedding,
    ...queryOptions,
  });
  return results.matches;
};
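The getEmbedding helper is whatever embedding function you already use. As a minimal sketch, assuming OpenAI's embeddings API and a model choice of my own (the post doesn't specify one), it could look like this:

import OpenAI from 'openai';

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// Hypothetical embedding helper: any model works as long as its output
// dimension matches the Pinecone index's dimension.
const getEmbedding = async (text: string): Promise<number[]> => {
  const response = await openai.embeddings.create({
    model: 'text-embedding-3-small', // assumed model, not specified in the post
    input: text,
  });
  return response.data[0].embedding;
};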
The second decision was implementing citation anchoring. Simply returning sources isn't enough; you need to show which part of the answer comes from which source. I modified the LlamaIndex response synthesizer to insert special markers.
// Helper to inject citation markers into the synthesized answer
import { MetadataMode, NodeWithScore } from 'llamaindex';

const getResponseWithCitations = (sources: NodeWithScore[], answer: string) => {
  let citedAnswer = answer;
  sources.forEach((source, idx) => {
    // Take the first 150 characters of the source as a matchable quote
    const quote = source.node.getContent(MetadataMode.NONE).substring(0, 150);
    const marker = `[${idx + 1}]`;
    // Naive anchoring: if the answer reuses the quote verbatim, append the marker
    if (citedAnswer.includes(quote)) {
      citedAnswer = citedAnswer.replace(quote, `${quote} ${marker}`);
    }
  });
  return { text: citedAnswer, sources };
};
What Broke and How I Fixed It
The first major break was chunking degradation. I used a naive 512-character text splitter, which often cut sentences mid-thought. This led to retrievals that started or ended abruptly, confusing the LLM. The fix was sentence-aware chunking with LlamaIndex's SentenceSplitter, which respects sentence boundaries and maintains a token overlap between chunks.
import { SentenceSplitter } from 'llamaindex';

const splitter = new SentenceSplitter({
  chunkSize: 1024, // tokens per chunk: larger, sentence-aligned chunks
  chunkOverlap: 200, // token overlap to preserve context across boundaries
});
// splitText operates on a single string, so map over the raw document texts
const chunks = yourDocuments.flatMap((text) => splitter.splitText(text));
// This preserved complete thoughts and improved retrieval quality by ~40%
The second break was the model ignoring its context and hallucinating. Even with perfect retrieval, the GPT model would sometimes disregard the provided chunks and answer from its training data. I fixed this by making the prompt instruction more explicit and using a delimiter system to label each chunk.
const strictPrompt = `
Answer the question using ONLY the context provided below.
If the context does not contain the answer, say "I cannot answer based on the provided sources."
Context:
${retrievedChunks.map((chunk, i) => `[Source ${i + 1}] ${chunk.text}`).join('\n\n')}
Question: ${userQuestion}
Answer (with citations like [Source 1]):`;
// This forceful instruction reduced hallucinations to near zero.
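For completeness, here is roughly how that prompt gets sent to the model. This is a sketch using the OpenAI Node SDK's streaming chat API; the post doesn't pin down the exact call, so treat the details as assumptions:

import OpenAI from 'openai';

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// Stream the completion token by token so the frontend can render as it arrives
const stream = await openai.chat.completions.create({
  model: 'gpt-4',
  stream: true,
  messages: [{ role: 'user', content: strictPrompt }],
});
for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? '');
}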
How to Build Something Similar
Start by building your ingestion pipeline before the chat interface. Use LlamaIndex to load, chunk, and embed your documents into Pinecone. This is the most important foundation.
// scripts/ingest.ts
import {
  Document,
  PineconeVectorStore,
  VectorStoreIndex,
  storageContextFromDefaults,
} from 'llamaindex';

// PineconeVectorStore reads PINECONE_API_KEY from the environment
const vectorStore = new PineconeVectorStore({ indexName: 'kb' });
const storageContext = await storageContextFromDefaults({ vectorStore });
const documents = [new Document({ text: yourContent })];
const index = await VectorStoreIndex.fromDocuments(documents, { storageContext });
// Now your index is ready for querying
Create a simple Next.js API route (app/api/chat/route.ts) that uses this index. To stream responses back, you can use the createStreamableValue utility from the Vercel AI SDK's ai/rsc module. Focus on getting retrieval working perfectly in the backend before polishing the UI.
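As a starting point, here is a stripped-down sketch of that route. It skips streaming and the citation pass for brevity, and assumes the 'kb' index built by the ingestion script above:

// app/api/chat/route.ts — minimal non-streaming sketch
import { NextRequest } from 'next/server';
import { PineconeVectorStore, VectorStoreIndex } from 'llamaindex';

export async function POST(req: NextRequest) {
  const { question } = await req.json();
  // Reconnect to the index populated by scripts/ingest.ts
  const vectorStore = new PineconeVectorStore({ indexName: 'kb' });
  const index = await VectorStoreIndex.fromVectorStore(vectorStore);
  const queryEngine = index.asQueryEngine();
  const result = await queryEngine.query({ query: question });
  return Response.json({ answer: result.response, sources: result.sourceNodes });
}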
Would I Build It the Same Way Again?
For a production system, I would swap Pinecone for a self-hosted vector store like Weaviate or Qdrant to control costs at scale. Pinecone is fantastic for prototyping, but the per-query cost adds up. I would also experiment with smaller, fine-tuned models (like Llama 3) for the final generation step instead of GPT-4, using the same high-quality retrieval, to reduce API dependency and latency.
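Because LlamaIndex hides the store behind a single vector store interface, the swap is mostly contained to one constructor. A sketch with Qdrant, where the URL and collection name are placeholders of mine:

import { QdrantVectorStore, VectorStoreIndex } from 'llamaindex';

// Point at a self-hosted Qdrant instance instead of Pinecone;
// the retrieval, prompting, and citation code stays unchanged.
const vectorStore = new QdrantVectorStore({
  url: 'http://localhost:6333', // placeholder self-hosted endpoint
  collectionName: 'knowledge-base', // placeholder collection
});
const index = await VectorStoreIndex.fromVectorStore(vectorStore);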
The core architecture—LlamaIndex for orchestration, strict retrieval filtering, and citation anchoring—remains sound. I'd keep that 100%.
The one thing you should know before starting is that your RAG system is only as good as your data. Spend 80% of your time on ingestion: cleaning, chunking, and testing retrieval on sample questions before writing a single line of chat UI code.
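Concretely, that retrieval testing can start as a script no bigger than this sketch (the questions and top-k are placeholders); if the top chunks look wrong here, no amount of prompting will rescue the answers:

import { MetadataMode, PineconeVectorStore, VectorStoreIndex } from 'llamaindex';

const sampleQuestions = ['How do refunds work?', 'What is the uptime SLA?']; // placeholders

const vectorStore = new PineconeVectorStore({ indexName: 'kb' });
const index = await VectorStoreIndex.fromVectorStore(vectorStore);
const retriever = index.asRetriever({ similarityTopK: 3 });

// Print the top chunks for each question and eyeball them for relevance
for (const q of sampleQuestions) {
  const nodes = await retriever.retrieve({ query: q });
  console.log(`\nQ: ${q}`);
  nodes.forEach((n, i) =>
    console.log(`[${i + 1}] ${n.node.getContent(MetadataMode.NONE).slice(0, 80)}...`)
  );
}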