Snehal Patel

I love to build things ✨

VerbalVista: Talking to Your Own Data with RAG, FAISS, and a Bit of Stubbornness

February 15, 2025

VerbalVista logo

There’s a specific frustration that sets in the first time you try to ask ChatGPT about something it cannot know: a meeting transcript, a proprietary PDF, an internal wiki page, an audio recording you made last Tuesday. The model is fluent and confident and completely useless for your actual question. It knows everything about the world and nothing about your work.

The obvious workaround is to paste the document into the context. That works, right up until it doesn’t. A few pages is fine. A 200-page technical spec, a 6-hour podcast, or an entire Git repository is not. You hit the context limit, or you hit the cost ceiling, or you discover that models quietly deprioritize content buried deep in a long prompt. The problem isn’t access to a capable LLM. The problem is retrieval: getting the right slice of your data in front of the model at query time.

RAG (Retrieval-Augmented Generation) is the standard answer, and there are plenty of frameworks that package it up for you. But the interesting questions only surface when you build it yourself: what chunk size actually works? When does BM25 beat cosine similarity? How do you keep embedding costs reasonable when you have tens of thousands of chunks? I wanted those answers for myself, so I built VerbalVista: a full-stack RAG platform that accepts eight different input types, chunks and indexes them into FAISS, and answers questions using GPT-4 or Claude.


The Problem with Context Windows

LLMs have no persistent memory of your private data. Every query starts cold. The model knows what you put in the prompt and nothing else. For most tasks that’s fine, but for document intelligence, it’s the entire problem.

The naive fix is to stuff everything into context. Upload the document, prepend it to your question, send it to the API. For a short document this is perfectly reasonable. For anything longer, you run into three compounding issues. First, most models have a hard token limit, so large documents simply don’t fit. Second, cost scales linearly with context length: sending a 100,000-token document with every query gets expensive fast. Third, and subtlest: attention is not uniform. Models tend to recall content near the beginning and end of a long context more reliably than content in the middle. If your answer lives in paragraph 47 of a 200-page spec, the model may not find it even if it’s technically in the prompt.

RAG sidesteps all three problems. Instead of sending everything, you send only the relevant chunks: a few hundred tokens retrieved from an index rather than tens of thousands retrieved from nowhere. The index does the heavy lifting so the model doesn’t have to.

What RAG frameworks hide from you are the choices that actually determine quality: chunk size and overlap, embedding model, distance metric, whether to run lexical retrieval alongside semantic retrieval, how to merge results, how many chunks to include before you exceed the prompt budget. Getting those right is the actual engineering work. Building VerbalVista was mostly an excuse to make those decisions deliberately rather than accepting a framework’s defaults.


What I Built

VerbalVista is a full-stack RAG platform with a Streamlit front-end, a FastAPI + Ray Serve backend, and a FAISS + BM25 retrieval layer. It accepts PDFs, DOCX files, plain text, email (.eml), audio/video files, URLs, YouTube videos, and code repositories. Everything gets transcribed or parsed into text, chunked, embedded, and indexed. At query time, it runs both semantic and lexical retrieval, merges the results, and streams a response from GPT-4 or Claude, including a per-query cost estimate.

The project is at github.com/spate141/VerbalVista. To run it locally:

streamlit run app.py

Under the Hood: The RAG Pipeline

The full pipeline has five stages. Each one is straightforward in isolation; the interesting parts are the seams between them.

Input sources
  │
  ▼
[Ingestion] ──► .data.txt files
  │
  ▼
[Chunking] ──► text chunks + metadata
  │
  ▼
[Embedding & Indexing] ──► FAISS (semantic) + BM25 (lexical)
  │
  ▼
[Retrieval] ──► top-k chunks, merged + deduplicated
  │
  ▼
[Generation] ──► streamed answer + token counts + cost

1. Ingestion: the wide funnel

The most underestimated part of any RAG system is getting content in. VerbalVista handles eight source types:

  • Audio/video → Whisper transcription → text
  • PDF/DOCX/TXT/EML → document_parser.py → text
  • URLs → Selenium-based url_parser.py → text
  • YouTube → transcript API → text
  • Reddit / Hacker News / 4chan → specialized scrapers → text
  • Code repositories → code_parser.py (Python + Markdown files) → text

Every path converges on the same output: a .data.txt file on disk. The FAISS index has no idea whether its chunks came from a podcast or a PDF, and that’s intentional.

2. Chunking

Text files are split with RecursiveCharacterTextSplitter at a configurable chunk size and overlap. Overlap is the key parameter most explanations skip past: without it, a sentence that straddles a chunk boundary gets split in two, and neither half carries enough context to be useful for retrieval. A 10–20% overlap ensures boundary content appears in full in at least one chunk.
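A minimal sliding-window chunker makes the overlap mechanic concrete. This is a simplified stand-in, not the actual splitter: RecursiveCharacterTextSplitter additionally prefers to break on paragraph and sentence boundaries rather than at arbitrary character offsets.

```python
def chunk_text(text, chunk_size=500, overlap=75):
    """Split text into fixed-size chunks with overlap (~15% here).

    Simplified sketch: the real splitter also respects paragraph and
    sentence boundaries instead of cutting at raw character offsets.
    """
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break
    return chunks

text = "".join(str(i % 10) for i in range(1200))
chunks = chunk_text(text, chunk_size=500, overlap=75)
# Adjacent chunks share a 75-character region, so a sentence that
# straddles a boundary appears in full in at least one of them.
```

With these settings, the tail of each chunk is repeated verbatim at the head of the next one, which is exactly the property that keeps boundary-straddling sentences retrievable.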

Each chunk carries metadata (source filename, chunk index) that surfaces in the response, so you know exactly where an answer came from.

3. Embedding and indexing

The EmbedChunks class calls the OpenAI embeddings API in batches, converting each chunk into a float vector. Vectors are L2-normalized to unit norm and stored in a FAISS IndexFlatIP (inner product). Normalized vectors make inner product equivalent to cosine similarity, which is the distance metric that works best for semantic text search.
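The normalization step is what makes an inner-product index behave like cosine similarity. A quick NumPy check (toy random vectors standing in for embeddings, no FAISS required) demonstrates the equivalence:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=384), rng.normal(size=384)  # toy "embedding" vectors

# Cosine similarity computed on the raw vectors
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Inner product on L2-normalized vectors -- what IndexFlatIP computes
# once every stored vector and every query is scaled to unit norm
a_n = a / np.linalg.norm(a)
b_n = b / np.linalg.norm(b)
inner = a_n @ b_n

assert np.isclose(cosine, inner)
```

This is why the vectors must be normalized both at index time and at query time; skip it on either side and the scores stop being cosine similarities.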

In parallel, the same chunks go into a rank_bm25 BM25 index for lexical retrieval. Both indices are persisted to disk (FAISS binary format for the vector index, pickle for the BM25 object and chunk metadata), so re-indexing is only needed when the source documents change.
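rank_bm25 handles the scoring in VerbalVista, but the formula itself is compact enough to sketch in stdlib Python. This is a simplified version of Okapi BM25 (the variant rank_bm25's BM25Okapi implements), shown here to make the lexical side less of a black box:

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs, k1=1.5, b=0.75):
    """Minimal BM25 sketch over pre-tokenized documents.

    k1 controls term-frequency saturation; b controls how strongly
    scores are normalized by document length.
    """
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    tfs = [Counter(d) for d in docs]                      # term frequencies
    df = Counter(t for d in docs for t in set(d))         # document frequencies
    scores = []
    for tf, d in zip(tfs, docs):
        s = 0.0
        for t in query_tokens:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            )
        scores.append(s)
    return scores

docs = [["faiss", "index", "error"],
        ["login", "error", "codes"],
        ["cat", "dog"]]
scores = bm25_scores(["error", "codes"], docs)
# The document containing both query terms scores highest; the one
# with no query terms scores zero.
```

Exact term matching is the whole point: a rare identifier in the query contributes a large IDF weight, which is precisely what embedding similarity tends to wash out.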

4. Retrieval: dual strategy

At query time, both indices run in parallel:

import numpy as np

# Semantic search
def do_semantic_search(query, k=10):
    query_vec = np.asarray(embed(query), dtype="float32")
    query_vec /= np.linalg.norm(query_vec)        # unit norm, so IP == cosine
    # FAISS expects a 2-D batch of queries, even for a single query
    scores, indices = faiss_index.search(query_vec.reshape(1, -1), k)
    return [(chunks[i], scores[0][j]) for j, i in enumerate(indices[0])]

# Lexical search
def do_lexical_search(query, k=10):
    tokens = query.lower().split()
    scores = bm25.get_scores(tokens)
    top_k = np.argsort(scores)[::-1][:k]          # highest BM25 scores first
    return [(chunks[i], scores[i]) for i in top_k]

Results from both searches are merged, deduplicated by chunk ID, and trimmed to fit within the prompt token budget. The merged set becomes the context for the LLM.
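A sketch of that merge step, assuming (chunk_id, text) result tuples and a rough four-characters-per-token heuristic; both assumptions are illustrative, not VerbalVista's actual internals:

```python
from itertools import chain, zip_longest

def merge_results(semantic, lexical, token_budget=3000):
    """Merge two ranked hit lists of (chunk_id, text) tuples:
    interleave, dedupe by chunk_id, trim to a rough token budget."""
    merged, seen, used = [], set(), 0
    # Interleave so both retrieval strategies contribute early hits
    for hit in chain.from_iterable(zip_longest(semantic, lexical)):
        if hit is None:
            continue
        chunk_id, text = hit
        if chunk_id in seen:
            continue
        est_tokens = len(text) // 4   # crude heuristic: ~4 chars per token
        if used + est_tokens > token_budget:
            break
        seen.add(chunk_id)
        merged.append(text)
        used += est_tokens
    return merged

semantic_hits = [(1, "a" * 400), (2, "b" * 400)]
lexical_hits = [(2, "b" * 400), (3, "c" * 400)]
context = merge_results(semantic_hits, lexical_hits, token_budget=250)
# Chunk 2 appears once despite being returned by both searches, and
# the list is cut off when the budget is exhausted.
```

A production version would use a real tokenizer (e.g. tiktoken) for the budget check, but the interleave-dedupe-trim shape stays the same.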

5. Generation

GPTAgent and ClaudeAgent wrap the respective APIs with a consistent interface. Retrieved chunks are formatted into a system context. Responses stream token by token back to the Streamlit UI. Each response includes prompt token count, completion token count, and an estimated USD cost calculated from current API pricing.
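The cost arithmetic itself is simple. A sketch of the shape of that calculation follows; the per-1K-token prices here are placeholders for illustration, not current OpenAI or Anthropic rates:

```python
# Illustrative per-1K-token prices (USD); real values come from the
# providers' pricing pages and change over time.
PRICING = {
    "gpt-4":  {"prompt": 0.03,  "completion": 0.06},
    "claude": {"prompt": 0.008, "completion": 0.024},
}

def estimate_cost(model, prompt_tokens, completion_tokens):
    """Estimate the USD cost of one query from its token counts."""
    p = PRICING[model]
    return (prompt_tokens / 1000) * p["prompt"] \
         + (completion_tokens / 1000) * p["completion"]

cost = estimate_cost("gpt-4", prompt_tokens=3200, completion_tokens=450)
# 3.2 * 0.03 + 0.45 * 0.06 = 0.123
```

Note that prompt tokens dominate in a RAG system: the retrieved context is almost always much larger than the generated answer, which is one more reason the retrieval trim step matters.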


Three Design Decisions Worth Calling Out

Hybrid retrieval: semantic + lexical, not either/or

Pure semantic search (embed the query, find the nearest vectors) handles paraphrase and conceptual synonymy well. Ask about “authentication failures” and you’ll surface chunks that talk about “login errors” or “credential issues,” because the embeddings are close in vector space.

But semantic search struggles with specificity. If your document contains version numbers, error codes, product names, or precise technical identifiers, embedding similarity often lets you down. AttributeError: 'NoneType' object has no attribute 'shape' and “null pointer dereference” are conceptually related, but their embeddings are not that close. BM25 finds exact keyword matches that cosine similarity misses.

Running both and merging the top-k from each is not complicated to implement (both indices are built at index time and queried at retrieval time), but the quality difference on technical content is significant. Exact-match queries get answered better. Conceptual queries get answered better. Neither index alone covers the full retrieval space.

Everything is text first

The ingestion layer looks simple from the outside: “just parse the document.” In practice it’s the part that takes the longest to get right and breaks the most often. Audio files need Whisper, which needs GPU time or API calls. PDFs have scanned pages, embedded images, inconsistent encoding. Email threads have quoted replies, HTML, attachments. URLs have JavaScript-rendered content, login walls, navigation noise.

The architectural decision that makes all of this manageable is committing to a single intermediate format: plain text files on disk. Every parser’s job is to produce a .data.txt file and nothing else. The chunker, embedder, and FAISS index never see the original source format. This decouples ingestion from retrieval completely. Adding a new source type means writing one new parser; nothing downstream changes.
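The contract every parser satisfies can be stated in a few lines. The helper name and directory layout here are hypothetical, chosen to illustrate the idea rather than mirror VerbalVista's code:

```python
from pathlib import Path

def write_data_txt(text: str, source_name: str, out_dir: str = "data") -> Path:
    """Every parser's single obligation: emit plain text to a .data.txt file.

    Downstream stages (chunker, embedder, FAISS, BM25) only ever see
    these files and never the original source format.
    """
    out = Path(out_dir) / f"{source_name}.data.txt"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(text, encoding="utf-8")
    return out
```

Whether the text came from Whisper, a PDF parser, or a YouTube transcript, this is the only interface the rest of the pipeline depends on, which is what makes adding a ninth source type a one-parser change.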

Cost tracking as a first-class concern

Every query in VerbalVista returns the prompt token count, the completion token count, and the estimated USD cost. These accumulate in Streamlit session state so you can see the running total for the session.

This sounds like a minor UI feature. It isn’t. When you’re running 50 exploratory queries a day during development (trying different phrasings, different retrieval depths, different models), the costs add up faster than intuition suggests. GPT-4 at scale is not cheap. Building the cost counter in from the start rather than adding it later means the number is always in front of you when you’re deciding whether to run another experiment. It keeps iteration honest.


Closing Thoughts

The most educational part of building VerbalVista wasn’t the LLM integration; that part is relatively mechanical once you have the retrieval working. It was the retrieval layer itself: the realization that chunk size and overlap aren’t hyperparameters to tune once and forget, that BM25 and cosine similarity are complements not substitutes, and that the amount of context you give the model matters as much as the model you choose.

The project grew in ways that reflect how RAG systems evolve in practice. You start with PDFs. Someone asks if you can index an audio recording. Then a YouTube video. Then a GitHub repository. The ingestion layer ends up being the majority of the codebase, and the LLM calls end up being a thin wrapper at the end of a much longer pipeline. VerbalVista went through 42 releases and 262 commits because the interesting problems kept accumulating at the input end, not the output end.

If you’re building something similar or curious about the implementation details, the full source is at github.com/spate141/VerbalVista.