From zero to a RAG system: successes and failures

A detailed account of building an internal RAG system from scratch, handling 1TB of messy technical data using local LLMs, and overcoming hardware and indexing bottlenecks.

A few months ago I was tasked with creating an internal tool for the company's engineers: a chat interface backed by a local LLM. Nothing extraordinary so far. Then the requirements came in: it had to be fast (I insist, fast!), and it also had to answer questions about every project the company has done in its entire history, almost a decade. They didn't want a traditional search engine, but a tool where you could ask questions in natural language and get answers with references to the original documents, with an emphasis on information from OrcaFlex files (OrcaFlex is simulation software for the dynamics of floating bodies, cables, etc., widely used in the offshore industry). It already seemed complex, and that was confirmed when I was given access to 1 TB of projects mixed with technical documentation, reports, analyses, regulations, CSVs, etc. The emotional roller coaster had begun.

I'll tell you upfront that it was neither a quick nor an easy process, and that's why I'd like to share it, from the first attempts and mistakes to the final architecture that ended up in production. I should also point out that I had never done anything similar before and didn't know how a RAG system worked.

We'll go problem by problem, with the solution I applied to each one.

Problem 1: selecting the right technology

The first step was to define the stack.

I needed a local language model that didn't rely on external APIs, for confidentiality reasons. Ollama emerged as the most mature and easy-to-use option for running LLaMA models locally. I tried several embedding models, and nomic-embed-text offered good performance and quality for technical documents.

Next was a RAG engine to orchestrate the document indexing process, embedding generation, vector database storage, and queries. Without it, no matter how fast the language model is, we couldn't retrieve relevant information from the documents. Think of it like a book's index: without it, you'd have to read the entire book to find the information you need. And with a good index, you can go straight to the right page. I'll call this process indexing for simplicity, although it's really a vectorization and indexing process.

After some research, I found a mature open source framework called LlamaIndex.

The language would be Python. I could list many reasons, but the most important one is that I feel comfortable and productive with it. Additionally, both Ollama and LlamaIndex have excellent Python SDKs.

I was ready to start building the software. I wrote my first scripts to run vector tests on the RAG system and do some query experiments. It worked really well with very little code. I thought it would be a project of a few weeks. I couldn't have been more wrong.

The next step was working with the actual documents. Hold on tight, it's going to be a bumpy ride!

Problem 2: the document chaos

My file source was a folder on Azure with a massive amount of technical documents: hundreds of gigabytes, thousands of files, various formats, with no organization or structure beyond the folder hierarchy. Every data engineer's dream (note the irony).

I cracked my knuckles, set the RAG output to save to disk, and launched my first script. LlamaIndex ended up overflowing my laptop's RAM within minutes, choking my OS until everything froze. I tried many configurations, caching systems, and other strategies, but at some point my machine always died.

After debugging, I discovered it was processing huge files that contributed nothing: videos, simulations, backup files... Documents that added nothing to a RAG system, but that LlamaIndex tried to process as if they were text. If a file weighed several gigabytes, the system tried to load it entirely into memory for processing, which was suicide.

I added a filtering system to the pipeline that excluded files by extension and by name patterns (simulation files, numerical results, etc.).

| Category | Excluded extensions |
|---|---|
| Video | mp4, avi, mov, mkv, wmv, flv, webm, m4v, mpg, mpeg, 3gp, mts... |
| Images | jpg, jpeg, png, gif, bmp, tiff, svg, ico, webp, heic, psd... |
| Executables | exe, dll, msi, bat, sh, app, dmg, so, jar... |
| Compressed | zip, rar, 7z, tar, gz, bz2, xz |
| Simulation | sim, dat |
| Temporary | tmp, temp, cache, log, swp, pyc, crdownload, partial... |
| Backups | bak, 3dmbak, dwgbak, dxfbak, pdfbak, stlbak, old, bkp, original... |
| Email | msg, pst, eml, oft |
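As an illustration, a filter along these lines could look like the sketch below. It is not the production code: the extension set is an abbreviated subset of the table, and the name patterns and the function names (`should_index`, `iter_indexable_files`) are hypothetical, since the real patterns are not listed here.

```python
from fnmatch import fnmatch
from pathlib import Path

# Abbreviated subset of the extensions from the table above.
EXCLUDED_EXTENSIONS = {
    "mp4", "avi", "mov", "jpg", "png", "exe", "dll",
    "zip", "rar", "7z", "sim", "dat", "tmp", "log", "bak", "msg", "pst",
}

# Hypothetical name patterns (lowercase); the real ones are project-specific.
EXCLUDED_NAME_PATTERNS = ["~$*", "*_backup*", "*results_raw*"]


def should_index(path: Path) -> bool:
    """Return True if the file passes both the extension and the name filter."""
    extension = path.suffix.lower().lstrip(".")
    if extension in EXCLUDED_EXTENSIONS:
        return False
    name = path.name.lower()
    return not any(fnmatch(name, pattern) for pattern in EXCLUDED_NAME_PATTERNS)


def iter_indexable_files(root: Path):
    """Yield every file under root that survives the filters."""
    for path in sorted(root.rglob("*")):
        if path.is_file() and should_index(path):
            yield path
```

Filtering on the path alone means a multi-gigabyte video is rejected before anything tries to open it, which is the whole point.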

I also removed files that were expensive to process and didn't add value either, such as CSV and JSON files. On the other hand, I converted PDF, DOCX, XLSX, PPTX, and similar files to plain text so LlamaIndex could process them without issues.

The result was a 54% reduction in the number of files to index. And of course, my RAM stopped exploding.

I could finally start indexing without fear.

Problem 3: indexing 451GB of documents without dying in the attempt

A RAG system relies on a vector index containing document embeddings. Vectors are numerical representations of documents that make it possible to measure how similar they are. LlamaIndex has a simple system you can configure with a couple of lines. You just point it to the directory and it takes care of storing all the information inside in JSON format. It's really convenient and works well, unless you're dealing with hundreds of gigabytes of documents. The system became unmanageable: every time the service restarted, it had to reprocess all documents from scratch, which could take days. Also, the default JSON format is not optimal for searching at that scale.
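To make "measuring similarity" concrete, here is a minimal sketch using cosine similarity, one common metric for comparing embeddings. Which metric the real system used is not stated here, so treat it as an assumption; the three-dimensional vectors and document ids are made up for the example (real embeddings have hundreds of dimensions).

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction; treat it as unrelated
    return dot / (norm_a * norm_b)


def top_k(query: list[float], vectors: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the ids of the k stored vectors most similar to the query."""
    ranked = sorted(
        vectors,
        key=lambda doc_id: cosine_similarity(query, vectors[doc_id]),
        reverse=True,
    )
    return ranked[:k]


# Toy "embeddings" keyed by made-up document ids.
store = {
    "mooring_report": [0.9, 0.1, 0.0],
    "hr_policy": [0.0, 0.2, 0.9],
    "riser_analysis": [0.8, 0.3, 0.1],
}
print(top_k([1.0, 0.2, 0.0], store))
```

This brute-force scan compares the query against every stored vector, which is fine for a toy example but is exactly the kind of work a dedicated vector index is meant to avoid at scale.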

I added a checkpoint system to save indexing progress. That way, when a problem occurred, I wouldn't lose everything and could resume from the last processed file. However, the data still got corrupted, and the process was error-prone and very slow. I was facing a bottleneck I couldn't overcome.

After much trial and error, and more reading, I decided to make the leap to a dedicated vector database: ChromaDB, an open-source database (Apache-2.0 license) for storing and querying vectors. Not to be confused with the Chrome/Chromium browser. ChromaDB is an abstraction layer that stores its data on top of a traditional database (I configured SQLite) and offers specific functionalities like similarity searches, clustering, etc.

The change was radical and instant. Indexing went from being a monolithic process that loaded everything into memory to a batch pipeline that processed 150 files at a time, generated their embeddings, and stored them directly in ChromaDB. This allowed indexing the 451GB of documents across multiple sessions, with checkpoints, without losing progress on interruptions, without corrupted data. Additionally, it was really easy to back up and restore the index in case of failures (just copy the SQLite file).
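The batch-plus-checkpoint idea can be sketched with the standard library alone. This is an assumption-laden illustration, not the production pipeline: `process_batch` is a hypothetical callable standing in for "generate embeddings and write them to the vector database", and the checkpoint is a plain JSON list of processed paths.

```python
import json
import os
from pathlib import Path
from typing import Callable, Iterable

BATCH_SIZE = 150  # batch size quoted above


def load_checkpoint(checkpoint: Path) -> set[str]:
    """Return the set of file paths already indexed in earlier sessions."""
    if not checkpoint.exists():
        return set()
    return set(json.loads(checkpoint.read_text(encoding="utf-8")))


def save_checkpoint(checkpoint: Path, done: set[str]) -> None:
    """Write to a temp file, then rename, so a crash never leaves a half-written file."""
    tmp = checkpoint.with_name(checkpoint.name + ".tmp")
    tmp.write_text(json.dumps(sorted(done)), encoding="utf-8")
    os.replace(tmp, checkpoint)


def index_in_batches(
    files: Iterable[str],
    process_batch: Callable[[list[str]], None],
    checkpoint: Path,
    batch_size: int = BATCH_SIZE,
) -> int:
    """Process pending files batch by batch; return how many were processed."""
    done = load_checkpoint(checkpoint)
    pending = [f for f in files if f not in done]
    for start in range(0, len(pending), batch_size):
        batch = pending[start:start + batch_size]
        process_batch(batch)               # hypothetical: embed + store
        done.update(batch)
        save_checkpoint(checkpoint, done)  # recorded only after the batch is stored
    return len(pending)
```

Because the checkpoint is updated only after a batch has been stored, an interruption costs at most one batch of rework, and memory use is bounded by the batch size rather than by the corpus.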

The system was ready. With a quick benchmark, I discovered I would need several months to index all the content with my laptop. Now the bottleneck was neither the RAM, nor the indexing system, nor the files, but the GPU.

Problem 4: my graphics card is not a rocket

My laptop has an integrated graphics card. Processing 500 MB of documents on the CPU takes 4 to 5 hours, which is not a good number. I absolutely needed a powerful GPU. In a follow-up meeting, we decided to rent a virtual machine with an NVIDIA RTX 4000 SFF Ada, which has 20GB of VRAM. These kinds of rentals are not exactly cheap, so now I was working under more pressure.
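A back-of-the-envelope check shows why the laptop was a dead end. The sketch below extrapolates the CPU figure above to the 451GB corpus, assuming throughput scales linearly, the machine runs around the clock, and 1 GB = 1000 MB; none of those assumptions was measured, so read the result as an order of magnitude only.

```python
CORPUS_GB = 451        # corpus size to index
SAMPLE_MB = 500        # size of the benchmark sample
SAMPLE_HOURS = (4, 5)  # CPU time range for that sample


def estimate_days(corpus_gb: float, sample_mb: float, sample_hours: float) -> float:
    """Extrapolate linearly from the sample to the full corpus, running 24/7."""
    samples = corpus_gb * 1000 / sample_mb  # assumes 1 GB = 1000 MB
    return samples * sample_hours / 24


low, high = (estimate_days(CORPUS_GB, SAMPLE_MB, hours) for hours in SAMPLE_HOURS)
print(f"{low:.0f} to {high:.0f} days of continuous CPU time")
```

Under those assumptions the estimate lands at roughly five to six months, which is consistent with "several months".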

I modified my containers and optimized the system to take advantage of the GPU. I launched my script. After two to three weeks, the indexing process finished without failures: 738,470 vectors, a 54GB index in ChromaDB, and a RAG system ready to answer questions. I copied the ChromaDB database, a SQLite file, to my local machine and that was it. To the relief of my Sysadmin and Project Manager, we could finally shut down the virtual machine. The cost was 184 euros on Hetzner, not cheap.

It was time to build the backend and frontend.