From Weeks to $0.01: Automating Analysis with RAG

Motivation

When my friend described her team’s annual ritual of analyzing corporate ESG reports, I understood why she was desperate for change. For years, interns manually scanned hundreds of PDF pages to answer precise questions: Does the company track all emission scopes? Have they set quantifiable 2030 decarbonization targets? The process consumed weeks as they cross-checked interpretations while research associates verified results.

As a computer architect, for many years my main knowledge about AI was that "it's just matrix multiplication, but it can do anything now." This time, I saw an opportunity to test its real-world potential.

The core objective is straightforward: provide an AI system with a PDF report, together with a predefined list of questions and corresponding guidance, then collect the AI's answers to each question. However, several key challenges must be addressed:

  • Accuracy: Responses must be highly accurate to avoid excessive review time and ensure reliability.
  • Explainability: While AI operates like a magical black box to us muggles, it must transparently provide its reasoning process to enable effective review and guidance adjustments.
  • Automation: The end-to-end process requires robust automation to efficiently handle numerous company reports.
  • Affordability: While potentially cheaper than manual analysis, costs remain significant:
    • Cloud-based models (e.g., ChatGPT, DeepSeek) charge per token processed (DeepSeek Pricing)
    • Local hardware requires substantial investment (e.g., ~$2000+ for high-end GPUs like NVIDIA’s RTX 5090)
    • Energy consumption carries an ironic cost when the documents being analyzed are ESG reports

Background: What is RAG?

After preliminary research, I’ve determined this task falls under the category of Retrieval-Augmented Generation (RAG). For reference, here’s an excellent introductory video on RAG.

The core concept is straightforward: An AI model generates answers to queries (the generation component), but faces two inherent limitations:

  1. Temporal constraints: Models lack awareness of recent developments without costly retraining
  2. Context limitations: Models cannot directly access your specific document content

Our specific workflow can be visualized through this pipeline (created at https://excalidraw.com/):

[Figure: Simple RAG Example]
  • Retrieval: The retrieval model identifies the most relevant document sections (context) for each question
  • Augmentation: The original question is enhanced with retrieved context to create an optimized prompt (an example is sketched after this list)
  • Generation: The reasoning model processes the augmented prompt to produce:
    • Final answers
    • Supporting explanations
    • Source quotations from the report
  • Human Review: Essential for verifying answers and refining guidance to align the AI’s understanding with requirements
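
To make the augmentation step concrete, here is what an augmented prompt might look like. The wording, page numbers, and quoted snippets below are invented for illustration and are not the exact template used in the implementation:

```python
# A hypothetical augmented prompt; the question comes from the ESG checklist,
# the page excerpts are made up for illustration.
AUGMENTED_PROMPT = """\
Question: Has the company set a quantifiable 2030 decarbonization target?
Guidance: Answer "Yes" only if a numeric reduction target and the 2030
deadline both appear in the report; otherwise answer "No" and explain
what is missing.

Context retrieved from the report:
[Page 47] ... we commit to reducing Scope 1 and 2 emissions by 45% by 2030 ...
[Page 51] ... Scope 3 emissions are currently being assessed ...

Answer the question, explain your reasoning, and quote the page(s) you relied on.
"""
```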

A Naive First Attempt: Copilot

Even without RAG expertise, we can leverage existing online chat models (which implement server-side RAG pipelines). I initially tested with Copilot—uploading reports, feeding questions, and evaluating outputs. However, after several trials, this approach proved unfeasible:

  • Manual Labor: The process lacks any automation. Completing a single report takes approximately 30 minutes (excluding final review), which is both inefficient and tedious - I'd rather read the reports manually!
  • Inaccuracy and Instability: The model frequently provides incorrect answers. While guidance tuning might improve accuracy, the more critical issue is output inconsistency across attempts. A majority-vote approach could mitigate the instability but would require repeating the entire process - an impractical solution.

Okay, let's code

Given these constraints, it’s time to implement an automated solution - after all, we’re programmers! The complete code is available here. The implementation follows this workflow:

  • Retrieval: Uses Ollama to run a local retrieval model on the GPU. Model selection was guided by the MTEB leaderboard on Hugging Face.
  • Augmentation: Combines the top 5 relevant pages with each question to create the final prompt. Crucially, page numbers are explicitly preserved for accurate source referencing.
  • Generation: Leverages the DeepSeek-R1 API to generate final answers with explanations. Outputs are saved as both JSON and human-readable Markdown.

I won't detail the full code here - it's under 150 lines including comments - so feel free to explore the implementation directly. A compressed sketch of the three steps is below.
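
The sketch below is illustrative rather than a copy of the repository: the embedding model name, file names, and prompt wording are assumptions, and DeepSeek-R1 is reached through its OpenAI-compatible API.

```python
# rag_sketch.py - minimal illustration of retrieval -> augmentation -> generation.
import json
import numpy as np
import ollama                     # runs the local embedding (retrieval) model
from openai import OpenAI         # DeepSeek's API is OpenAI-compatible

EMBED_MODEL = "nomic-embed-text"  # placeholder; pick one from the MTEB leaderboard
client = OpenAI(api_key="YOUR_DEEPSEEK_KEY", base_url="https://api.deepseek.com")

def embed(text: str) -> np.ndarray:
    """Embed text with the local Ollama model."""
    resp = ollama.embeddings(model=EMBED_MODEL, prompt=text)
    return np.array(resp["embedding"])

def top_pages(pages: list[str], question: str, k: int = 5) -> list[tuple[int, str]]:
    """Retrieval: rank report pages by cosine similarity to the question."""
    q = embed(question)
    scores = []
    for i, page in enumerate(pages):
        p = embed(page)
        scores.append((float(q @ p / (np.linalg.norm(q) * np.linalg.norm(p))), i))
    best = sorted(scores, reverse=True)[:k]
    return [(i + 1, pages[i]) for _, i in best]   # keep 1-based page numbers for citations

def answer(pages: list[str], question: str, guidance: str) -> dict:
    """Augmentation + generation: build the prompt and call DeepSeek-R1."""
    context = "\n\n".join(f"[Page {n}]\n{text}" for n, text in top_pages(pages, question))
    prompt = (
        f"Question: {question}\nGuidance: {guidance}\n\n"
        f"Context from the report:\n{context}\n\n"
        "Answer the question, explain your reasoning, and quote the pages you used."
    )
    resp = client.chat.completions.create(
        model="deepseek-reasoner",                # DeepSeek-R1
        messages=[{"role": "user", "content": prompt}],
    )
    return {"question": question, "answer": resp.choices[0].message.content}

if __name__ == "__main__":
    pages = json.load(open("report_pages.json"))    # hypothetical: one string per PDF page
    questions = json.load(open("questions.json"))   # hypothetical: [{"question": ..., "guidance": ...}]
    results = [answer(pages, q["question"], q["guidance"]) for q in questions]
    json.dump(results, open("answers.json", "w"), indent=2)
```

In practice you would embed the pages once and cache the vectors instead of re-embedding them for every question, and the real script also writes a human-readable Markdown report alongside the JSON.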

Discussion

Size doesn’t matter… Until it does.

Even as someone whose work does not directly involve AI, I have heard of the famous scaling law: that intelligence emerges as models keep growing. So far, this seems to hold. For example, DeepSeek-V3 has 671 billion parameters, and GPT-4 is estimated (it is not open-sourced) to have more than 1 trillion. This time, I witnessed the power of the scaling law firsthand.

I first used a small distilled DeepSeek model running on a desktop GPU (distilled from the full 671-billion-parameter model). It worked well when I passed questions one by one. However, since many questions are related, I needed to feed them together so the model could better capture the context. The small model quickly lost track and started rambling once it had to handle this more complicated case.

Then I pasted the exact same prompt to the online DeepSeek-R1, and it answered most questions correctly! The larger model is indeed better, and it’s genuinely exciting to imagine what AI could do with even larger models.

Guidance precision determines performance.

Our initial implementation achieved only 5/29 correct answers (17%). After refining the questions and guidance to (1) eliminate ambiguity, (2) specify constraints, and (3) prevent invalid assumptions, accuracy jumped to 26/29 (90%). This 73-percentage-point improvement underscores that careful prompt engineering matters as much as model selection. A more systematic accuracy experiment is left for future work.
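
As a hypothetical illustration of that refinement (the real question set is not reproduced here), the shift was roughly from vague instructions to tightly constrained ones:

```python
# Hypothetical before/after guidance for one of the checklist questions.
vague = {
    "question": "Does the company track all emission scopes?",
    "guidance": "Check the emissions section.",
}
refined = {
    "question": "Does the company report Scope 1, Scope 2, AND Scope 3 emissions?",
    "guidance": (
        "Answer 'Yes' only if all three scopes appear with reported figures. "  # specify constraints
        "Answer 'No' if any scope is missing or merely 'planned'. "             # eliminate ambiguity
        "Do not infer Scope 3 coverage from general supply-chain statements."   # prevent invalid assumptions
    ),
}
```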

AI affordability is remarkable

Using the DeepSeek API, processing a 200+ page document costs just $0.01. Better yet, during off-peak hours (12:30am-8:30am UTC+8), you get 75% off! With basic coding skills, you can automate batch processing overnight to maximize these savings. Our strategy of using a local retrieval model to identify relevant pages before engaging the large reasoning model significantly reduces costs, since expenses scale directly with token usage.
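
A small scheduling sketch, assuming the off-peak window quoted above (00:30-08:30 UTC+8); this is not part of the repository, just one way to defer a batch run:

```python
# Wait until DeepSeek's off-peak window, then kick off the batch.
import time
from datetime import datetime, timedelta, timezone

UTC8 = timezone(timedelta(hours=8))

def seconds_until_off_peak() -> float:
    now = datetime.now(UTC8)
    start = now.replace(hour=0, minute=30, second=0, microsecond=0)  # today 00:30
    if now >= start.replace(hour=8):      # window (ends 08:30) already over today
        start += timedelta(days=1)
    elif now >= start:                    # already inside the window
        return 0.0
    return (start - now).total_seconds()

time.sleep(seconds_until_off_peak())
# ...then loop over the reports and run the pipeline from the earlier sketch.
```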

Challenges persist despite AI’s capabilities

These advancements feel both revolutionary and slightly unnerving - will AI replace us? I remain optimistic because significant challenges endure. Consider this tactic reported by Nikkei Asia: researchers now hide prompts like “focus on this work’s novelty” in papers using white text or microscopic fonts. When AI summarizes such documents, it produces unrealistically positive evaluations, bypassing human reviewers. This demonstrates how humans remain essential gatekeepers for integrity, even as AI’s capabilities grow.

This first RAG implementation revealed AI’s astonishing potential - it’s genuinely exciting how much these tools can accomplish today. Everyone should start learning to harness them!

You can find the code here. And of course, this post was revised by AI :)