My RAG Learning Journey

Written by

in

I’m documenting my RAG learning journey as a personal notebook of concise lessons, aha moments, and practical mental models — not a rewrite of course material. Expect short, actionable takeaways, intuition-first explanations, links to original resources, and checkpoints that save me from re-watching lectures. This log is for beginners following a structured path who want quick reminders and decision points.


Setup dev machine

Setup dev machine

Install Ollama

download from Download Ollama on Windows
Verify: ollama --version

Expected result:

ollama version is 0.24.0

Pull model

ollama pull qwen3:4b
ollama pull qwen3:8b
ollama pull deepseek-coder:6.7b

Big model such as qwen:14b, with 6GB VRAM at RTX A1000, it worsen the result due to some layer sit on VRAM and other layer on RAM, and experience bus bottleneck.

Use AI model that fit 100% on the GPU.

Prompt with specific to coding, AI model for coding superior in code implementation, but generic reasoning AI model superior in design. Use generic reasoning AI model such as qwen3:8b to reason your design – translate high level requirement into design, the use coding AI model such as deepseek-coder:6.7b to implement the design.

As long as your AI model fit to the GPU VRAM, use bigger model, we suggest to use qwen3:8b over qwen3:4b if VRAM size permit.

Use command ollama ps to check how much GPU used.

Verify the model

  • Command: ollama list
  • Prompt: ollama run qwen2.5:7b "Explain what a purchase order is in one sentence"

My machine size

Use my machine size as your reference, extrapolate to your machine. The benchmark and other experiment result based on this hardware size

  • 1x RTX A1000 6GB
  • 1x CPU i7 13850HX gen 13
  • 1x 32 GB ram
  • 1 TB disk
Benchmark

Benchmark

Benchmark result for reasoning prompt

ModelScoreKey StrengthFatal Weakness
Qwen3:8B✅ 9.5Correct + consistentMinor assumption only
Qwen3:4B⚠️ 7.5Clear reasoningContradicts itself
Qwen2.5:7B❌ 4.5Good structureBroken logic & math

The prompt:

Data:
- Product A:
  - Current stock: 120 units
  - Daily sales (last 30 days avg): 10 units/day
  - Lead time: 7 days
  - Safety stock: 50 units

- Product B:
  - Current stock: 40 units
  - Daily sales: 5 units/day
  - Lead time: 14 days
  - Safety stock: 30 units

Task:
1. Calculate reorder point for each product
2. Determine if reorder is needed
3. Suggest reorder quantity (cover 30 days demand)

Return:
- Step-by-step reasoning
- Final recommendation per product

Benchmark for coding prompt

RankModelNotes
🥇✅Qwen3:8BBest (reasoning + correctness)
🥈qwen2.5-coder:7BBest practical coding output
🥈✅DeepSeek-Coder:6.7BVery strong coding (less complete)
🥈LLaMA3:8BClean + teaching-friendly
🥉MagicoderCorrect but overly simplified
🥉Qwen3:4BReasoning but weaker precision
🥉Qwen2.5:7BBasic template
⚠️Qwen2.5-coder:1.5BFlawed abstraction
Phi-4-miniIncorrect pattern

The prompt:

explain strategy pattern in Java

Benchmark for reasoning prompt, with coding subject

overal score:

ModelAverage ScoreSummary
Qwen3.5:4B⭐⭐⭐⭐⭐ 4.67Best balance of depth, clarity, and real-world examples
Qwen3:4B⭐⭐⭐⭐⭐ 4.67Best for clarity, engagement, teaching
Qwen3:8B⭐⭐⭐⭐ 3.83More formal, better precision

LARGER MODEL not always better result, make sure it’s size fit on your GPU.

use LATEST VERSION is possible.

Qwen3:4B has same score with Qwen3:4B, but it has different strength:

ModelGainsLossesNet Effect
Qwen3:4BStructure + EngagementDepth + PrecisionBalanced
Qwen3.5:4BDepth + PrecisionStructure + EngagementBalanced

For Agentic ERP executor, use Qwen3.5:4B:

ModelSuitability for Agentic AssistantReason
Qwen3.5:4B⭐⭐⭐⭐⭐ ✅ Best balanced choiceHigh depth + good clarity + strong practical reasoning
Qwen3:4B⭐⭐⭐⭐Excellent clarity and structure, but slightly less depth and reasoning breadth
Qwen3:8B⭐⭐⭐⭐⭐Strongest precision and consistency in decision-making
AspectQwen3:4BQwen3:8BQwen3.5:4B
Clarity⭐⭐⭐⭐⭐ Very clear, beginner-friendly⭐⭐⭐⭐ Clear but more textbook-like⭐⭐⭐⭐⭐ Very clear, strong explanations
Depth⭐⭐⭐⭐ Good + examples⭐⭐⭐⭐ Slightly more formal, includes variants⭐⭐⭐⭐⭐ Very deep, includes variants & context
Structure⭐⭐⭐⭐⭐ Excellent (tables, visuals, flow)⭐⭐⭐⭐ Standard structure⭐⭐⭐⭐ Good but less polished than 3:4B
Practical Insight⭐⭐⭐⭐⭐ Strong intuition + use cases⭐⭐⭐ Moderate⭐⭐⭐⭐⭐ Strong real-world + applied examples
Technical Precision⭐⭐⭐⭐ Good⭐⭐⭐⭐⭐ Slightly better⭐⭐⭐⭐⭐ Very strong, most complete
Engagement⭐⭐⭐⭐⭐ Highly engaging (icons, metaphor)⭐⭐⭐ More neutral⭐⭐⭐⭐ Engaging but less styled than 3:4B
Extra ValueIncludes example (8-puzzle), summary tablesMentions algorithm variants (stochastic)Includes multiple variants, ML links, rich examples
AudienceBetter Model
Beginner / Student✅ 4B
Exam preparation✅ 4B
Technical interview prep✅ Qwen3.5:4B
Research / deeper study✅ Qwen3.5:4B
Quick refresher✅ 8B

The prompt:

in search algorithm, there is a term called "hill climbing", explain.
Token cost saving

Token cost saving

We can use Ollama as a LLM wrapper. Since Ollama can be used to construct KV Cache, it can be used for token cost saving when you call LLM API.

Core Concepts

Prefill phase

Short desc:
Model reads the entire prompt and builds internal state (KV cache). This is the most expensive part. [learncodecamp.net]

🔗 Read more:

  • https://learncodecamp.net/llm-inference-basics-prefill-decode-ttft-itl/

Decode phase

Short desc:
Model generates output token-by-token, reusing KV cache from prefill. Cheaper per token but sequential. [learncodecamp.net]

🔗 Read more:

  • https://redis.io/blog/prefill-vs-decode/

KV cache

Short desc:
Stores attention states (keys/values) so the model doesn’t recompute previous tokens during generation → speeds up decoding. [huggingface.co]

🔗 Read more:

  • https://huggingface.co/blog/not-lain/kv-caching

Use Case: Save Token Cost with Ollama Wrapper

Key fact

  • Cost ≈ number of tokens sent in prefill
  • Long prompt = expensive (recomputed every request)

What you do with Ollama

1. Compress context before prefill

  • Summarize old chat
  • Keep only last few messages
    👉 fewer tokens → cheaper prefill

2. Don’t resend everything

Instead of:

full chat history → every request

Do:

summary + recent + relevant facts

👉 avoids “context explosion” [medium.com]


3. Use Ollama as local preprocessor

  • summarize
  • filter
  • classify

👉 small model prepares input → big model does reasoning


4. Route simple queries locally

  • FAQ / classification handled by Ollama
    👉 avoid cloud calls entirely

Result

Without wrapperWith Ollama wrapper
2000 tokens per request400–600 tokens
High prefill costLow prefill cost
Full history sentCompressed context
Cloud always usedHybrid local + cloud
AI Agent

AI Agen for RAG

  • Hermes: AI agent specialized for self improving, this is good candidate for knowledge accumulation, and organize the knowledge, to be used for RAG.
  • OpenClaw: AI agent specialized for orchestration, this is good candidate for automation, where customer power user defines inputs, outcome, and boundaries, then OpenClaw implement the workflow.

I’m not done yet with experiment with both AI agent, will post update once I got conclusion.

Comments

Leave a Reply

Your email address will not be published. Required fields are marked *