My RAG Learning Journey


I’m documenting my RAG learning journey as a personal notebook of concise lessons, aha moments, and practical mental models — not a rewrite of course material. Expect short, actionable takeaways, intuition-first explanations, links to original resources, and checkpoints that save me from re-watching lectures. This log is for beginners following a structured path who want quick reminders and decision points.

Setup dev machine

Install Ollama

Download the installer from the Download Ollama on Windows page.
Verify: ollama --version

Expected result:

ollama version is 0.24.0

Pull model

ollama pull qwen3:4b
ollama pull qwen3:8b
ollama pull deepseek-coder:6.7b

A big model such as qwen:14b does not fit in the 6 GB of VRAM on an RTX A1000. It worsens the result because some layers sit in VRAM and the others in system RAM, so inference hits a bus bottleneck.

Use an AI model that fits 100% on the GPU.

Match the model to the task: a coding model is superior at code implementation, but a generic reasoning model is superior at design. Use a generic reasoning model such as qwen3:8b to reason about your design (translate the high-level requirement into a design), then use a coding model such as deepseek-coder:6.7b to implement the design.

As long as the model fits in GPU VRAM, use the bigger model. We suggest qwen3:8b over qwen3:4b if VRAM size permits.

Use the ollama ps command to check how much of the loaded model is running on the GPU.
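To make the "fits 100% on the GPU" rule concrete, here is a rough back-of-the-envelope sketch. It assumes about 4.5 bits per weight (a typical 4-bit quantization; the real figure depends on the model file) and a hypothetical 1 GB reserve for the KV cache and runtime overhead. Neither number is a measurement from this machine, so treat the output as a first guess and confirm with ollama ps.

```python
def estimated_weights_gb(params_billion: float, bits_per_weight: float = 4.5) -> float:
    """Rough size of the model weights in GB (assumed quantization, not measured)."""
    return params_billion * bits_per_weight / 8


def fits_on_gpu(params_billion: float, vram_gb: float, reserve_gb: float = 1.0) -> bool:
    """True if the estimated weights plus an assumed reserve fit in VRAM."""
    return estimated_weights_gb(params_billion) + reserve_gb <= vram_gb


if __name__ == "__main__":
    # 6 GB matches the RTX A1000 used for these notes.
    for label, size in [("4B", 4), ("8B", 8), ("14B", 14)]:
        print(label, estimated_weights_gb(size), "GB, fits in 6 GB:", fits_on_gpu(size, 6))
```

Under these assumptions the 4B and 8B models fit in 6 GB and the 14B model does not, which matches the behavior described above.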

Verify the model

Command: ollama list
Prompt: ollama run qwen2.5:7b "Explain what a purchase order is in one sentence"

My machine size

Use my machine as your reference and extrapolate to your own. The benchmarks and other experiment results are based on this hardware:

1x RTX A1000 6GB
1x Intel Core i7-13850HX CPU (13th gen)
1x 32 GB RAM
1 TB disk

Benchmark

Benchmark result for reasoning prompt

Model	Score	Key Strength	Fatal Weakness
Qwen3:8B	✅ 9.5	Correct + consistent	Minor assumption only
Qwen3:4B	⚠️ 7.5	Clear reasoning	Contradicts itself
Qwen2.5:7B	❌ 4.5	Good structure	Broken logic & math

The prompt:

Data:
- Product A:
  - Current stock: 120 units
  - Daily sales (last 30 days avg): 10 units/day
  - Lead time: 7 days
  - Safety stock: 50 units

- Product B:
  - Current stock: 40 units
  - Daily sales: 5 units/day
  - Lead time: 14 days
  - Safety stock: 30 units

Task:
1. Calculate reorder point for each product
2. Determine if reorder is needed
3. Suggest reorder quantity (cover 30 days demand)

Return:
- Step-by-step reasoning
- Final recommendation per product
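To grade the model answers it helps to have a reference result. The sketch below assumes the common textbook formula (reorder point = daily sales × lead time + safety stock) and assumes a reorder is triggered when stock is at or below the reorder point. The prompt states neither, so both are assumptions, and the class and function names are only illustrative.

```python
from dataclasses import dataclass


@dataclass
class Product:
    name: str
    current_stock: int
    daily_sales: int
    lead_time_days: int
    safety_stock: int


def reorder_point(p: Product) -> int:
    """Assumed formula: expected demand during lead time plus safety stock."""
    return p.daily_sales * p.lead_time_days + p.safety_stock


def needs_reorder(p: Product) -> bool:
    """Assumption: reorder when stock is at or below the reorder point."""
    return p.current_stock <= reorder_point(p)


def reorder_quantity(p: Product, cover_days: int = 30) -> int:
    """Quantity that covers `cover_days` of demand, as the prompt asks."""
    return p.daily_sales * cover_days


products = [
    Product("A", current_stock=120, daily_sales=10, lead_time_days=7, safety_stock=50),
    Product("B", current_stock=40, daily_sales=5, lead_time_days=14, safety_stock=30),
]

for p in products:
    print(p.name, reorder_point(p), needs_reorder(p), reorder_quantity(p))
```

Product A sits exactly at its reorder point (120), so the answer for A depends on the "at or below" assumption. Product B (40 in stock against a reorder point of 100) needs a reorder either way.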

Benchmark for coding prompt

Rank	Model	Notes
🥇✅	Qwen3:8B	Best (reasoning + correctness)
🥈	qwen2.5-coder:7B	Best practical coding output
🥈✅	DeepSeek-Coder:6.7B	Very strong coding (less complete)
🥈	LLaMA3:8B	Clean + teaching-friendly
🥉	Magicoder	Correct but overly simplified
🥉	Qwen3:4B	Reasoning but weaker precision
🥉	Qwen2.5:7B	Basic template
⚠️	Qwen2.5-coder:1.5B	Flawed abstraction
❌	Phi-4-mini	Incorrect pattern

The prompt:

explain strategy pattern in Java

Benchmark for reasoning prompt, with coding subject

Overall score:

Model	Average Score	Summary
Qwen3.5:4B	⭐⭐⭐⭐⭐ 4.67	Best balance of depth, clarity, and real-world examples
Qwen3:4B	⭐⭐⭐⭐⭐ 4.67	Best for clarity, engagement, teaching
Qwen3:8B	⭐⭐⭐⭐ 3.83	More formal, better precision

A LARGER MODEL does not always give a better result; make sure its size fits on your GPU.

Use the LATEST VERSION if possible.

Qwen3:4B has the same score as Qwen3.5:4B, but they have different strengths:

Model	Gains	Losses	Net Effect
Qwen3:4B	Structure + Engagement	Depth + Precision	Balanced
Qwen3.5:4B	Depth + Precision	Structure + Engagement	Balanced

For an agentic ERP executor, use Qwen3.5:4B:

Model	Suitability for Agentic Assistant	Reason
Qwen3.5:4B	⭐⭐⭐⭐⭐ ✅ Best balanced choice	High depth + good clarity + strong practical reasoning
Qwen3:4B	⭐⭐⭐⭐	Excellent clarity and structure, but slightly less depth and reasoning breadth
Qwen3:8B	⭐⭐⭐⭐⭐	Strongest precision and consistency in decision-making

Aspect	Qwen3:4B	Qwen3:8B	Qwen3.5:4B
Clarity	⭐⭐⭐⭐⭐ Very clear, beginner-friendly	⭐⭐⭐⭐ Clear but more textbook-like	⭐⭐⭐⭐⭐ Very clear, strong explanations
Depth	⭐⭐⭐⭐ Good + examples	⭐⭐⭐⭐ Slightly more formal, includes variants	⭐⭐⭐⭐⭐ Very deep, includes variants & context
Structure	⭐⭐⭐⭐⭐ Excellent (tables, visuals, flow)	⭐⭐⭐⭐ Standard structure	⭐⭐⭐⭐ Good but less polished than Qwen3:4B
Practical Insight	⭐⭐⭐⭐⭐ Strong intuition + use cases	⭐⭐⭐ Moderate	⭐⭐⭐⭐⭐ Strong real-world + applied examples
Technical Precision	⭐⭐⭐⭐ Good	⭐⭐⭐⭐⭐ Slightly better	⭐⭐⭐⭐⭐ Very strong, most complete
Engagement	⭐⭐⭐⭐⭐ Highly engaging (icons, metaphor)	⭐⭐⭐ More neutral	⭐⭐⭐⭐ Engaging but less styled than Qwen3:4B
Extra Value	Includes example (8-puzzle), summary tables	Mentions algorithm variants (stochastic)	Includes multiple variants, ML links, rich examples
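The average scores in the overall table are consistent with an unweighted mean of the six starred rows in the aspect table. A quick check, assuming that is how they were computed (the Extra Value row has no stars and is left out):

```python
from statistics import mean

# Star counts copied from the aspect table, in row order:
# Clarity, Depth, Structure, Practical Insight, Technical Precision, Engagement.
stars = {
    "Qwen3:4B": [5, 4, 5, 5, 4, 5],
    "Qwen3:8B": [4, 4, 4, 3, 5, 3],
    "Qwen3.5:4B": [5, 5, 4, 5, 5, 4],
}

# Unweighted mean per model, rounded to two decimals like the overall table.
averages = {model: round(mean(scores), 2) for model, scores in stars.items()}
print(averages)
```

This gives 4.67 for both 4B models and 3.83 for Qwen3:8B, matching the overall table.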

Audience	Better Model
Beginner / Student	✅ Qwen3:4B
Exam preparation	✅ Qwen3:4B
Technical interview prep	✅ Qwen3.5:4B
Research / deeper study	✅ Qwen3.5:4B
Quick refresher	✅ Qwen3:8B

The prompt:

in search algorithm, there is a term called "hill climbing", explain.

Token cost saving

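We can use Ollama as a local wrapper in front of a cloud LLM API. The cloud model has to run prefill (and build the KV cache) for every token in the request, so using a small local model to shrink the prompt first saves token cost on each API call.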

Core Concepts

Prefill phase

Short desc:
Model reads the entire prompt and builds internal state (KV cache). This is the most expensive part. [learncodecamp.net]

🔗 Read more:

https://learncodecamp.net/llm-inference-basics-prefill-decode-ttft-itl/

Decode phase

Short desc:
Model generates output token-by-token, reusing KV cache from prefill. Cheaper per token but sequential. [learncodecamp.net]

🔗 Read more:

https://redis.io/blog/prefill-vs-decode/

KV cache

Short desc:
Stores attention states (keys/values) so the model doesn’t recompute previous tokens during generation → speeds up decoding. [huggingface.co]

🔗 Read more:

https://huggingface.co/blog/not-lain/kv-caching
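A toy sketch of the three concepts above, in plain Python. The constant "projection weights" (0.5, 2.0, 1.0) and the 2-dimensional token vectors are made up for illustration; a real model uses learned matrices, many heads, and many layers. The only point is the bookkeeping: prefill projects every prompt token once, and each decode step projects just the new token and reuses the cached keys and values.

```python
import math


def project(x, weight):
    """Toy stand-in for a learned projection: scale the vector by a constant."""
    return [weight * xi for xi in x]


def attend(query, keys, values):
    """Softmax attention of one query over all keys/values seen so far."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [
        sum(w * v[i] for w, v in zip(weights, values)) / total
        for i in range(len(values[0]))
    ]


class ToyKVCache:
    """Keeps keys and values of past tokens so each token is projected only once."""

    def __init__(self):
        self.keys, self.values, self.projections = [], [], 0

    def append(self, x):
        self.keys.append(project(x, 0.5))
        self.values.append(project(x, 2.0))
        self.projections += 1  # counts tokens whose key/value pair was computed


def prefill(cache, prompt_vectors):
    """Prefill: process every prompt token and fill the cache."""
    for x in prompt_vectors:
        cache.append(x)


def decode_step(cache, x):
    """Decode: project only the new token, then attend over the whole cache."""
    cache.append(x)
    return attend(project(x, 1.0), cache.keys, cache.values)


prompt = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cache = ToyKVCache()
prefill(cache, prompt)          # 3 tokens projected
decode_step(cache, [0.5, 0.5])  # only 1 more token projected
print(cache.projections)
```

Without the cache, the decode step would have to project all four tokens again before attending.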

Use Case: Save Token Cost with Ollama Wrapper

Key fact

Cost ≈ number of tokens sent in prefill
Long prompt = expensive (recomputed every request)

What you do with Ollama

1. Compress context before prefill

Summarize old chat
Keep only last few messages
👉 fewer tokens → cheaper prefill

2. Don’t resend everything

Instead of:

full chat history → every request

Do:

summary + recent + relevant facts

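👉 avoids “context explosion” [medium.com]: https://medium.com/@ravityuval/how-i-reduced-llm-token-costs-by-90-using-prompt-rag-and-ai-agent-optimization-f64bd1b56d9f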
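A minimal sketch of the "summary + recent + relevant facts" idea. The function name, the keep_last default, and the stub summarizer are all hypothetical; in practice summarize would be a call to a small local model.

```python
def build_compact_prompt(history, question, relevant_facts, summarize, keep_last=4):
    """Return summary + recent messages + relevant facts instead of the full history.

    `summarize` is any callable that turns a list of messages into a short
    string, for example a call to a small local model (hypothetical here).
    """
    split = max(len(history) - keep_last, 0)
    old, recent = history[:split], history[split:]
    parts = []
    if old:
        parts.append("Summary of earlier chat: " + summarize(old))
    if relevant_facts:
        parts.append("Relevant facts: " + "; ".join(relevant_facts))
    parts.extend(recent)
    parts.append("User: " + question)
    return "\n".join(parts)


# Stub summarizer so the example runs without a model.
def stub_summary(messages):
    return f"{len(messages)} earlier messages about purchase orders"


history = [f"User: message {i}" for i in range(10)]
compact = build_compact_prompt(
    history,
    question="Which supplier is late?",
    relevant_facts=["Supplier X lead time is 14 days"],
    summarize=stub_summary,
)
print(compact)
```

Only the last four messages are sent verbatim; the six older ones are replaced by one summary line.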

3. Use Ollama as local preprocessor

summarize
filter
classify

👉 small model prepares input → big model does reasoning
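The local preprocessing step could look roughly like this, using Ollama's local REST endpoint (/api/generate on port 11434 by default) and only the Python standard library. The prompt wording, the function names, and the choice of qwen3:4b as the small model are assumptions for illustration.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local Ollama endpoint


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for the local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def summarize_locally(text: str, model: str = "qwen3:4b") -> str:
    """Ask the small local model for a short summary (needs Ollama running)."""
    prompt = "Summarize the following chat in 3 bullet points:\n" + text
    with urllib.request.urlopen(build_request(model, prompt), timeout=120) as resp:
        return json.loads(resp.read())["response"]
```

The returned summary is what gets sent to the big cloud model instead of the raw chat. The same pattern works for filtering and classifying by changing the prompt.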

4. Route simple queries locally

FAQ / classification handled by Ollama
👉 avoid cloud calls entirely
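Routing can be sketched as a small decision function. The keyword list below is a hypothetical stand-in for the classifier; in practice the check could be a classification prompt sent to the small local model.

```python
def route(query: str, is_simple) -> str:
    """Return "local" for simple queries and "cloud" for everything else."""
    return "local" if is_simple(query) else "cloud"


# Hypothetical stand-in classifier: a plain keyword check.
FAQ_KEYWORDS = ("opening hours", "reset password", "what is a purchase order")


def keyword_is_simple(query: str) -> bool:
    """True if the query looks like one of the known FAQ topics."""
    q = query.lower()
    return any(keyword in q for keyword in FAQ_KEYWORDS)


print(route("What is a purchase order?", keyword_is_simple))
print(route("Design a reorder policy for 500 SKUs", keyword_is_simple))
```

Queries routed to "local" are answered by Ollama and never reach the cloud API.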

Result

Without wrapper	With Ollama wrapper
2000 tokens per request	400–600 tokens
High prefill cost	Low prefill cost
Full history sent	Compressed context
Cloud always used	Hybrid local + cloud
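Taking the token counts in the table at face value (they are illustrative, not a measurement), the reduction in prompt tokens sent to the cloud works out as follows. The function name is only for this sketch.

```python
def prefill_reduction(tokens_before: int, tokens_after: int) -> float:
    """Fraction of prompt tokens no longer sent to the cloud API."""
    return 1 - tokens_after / tokens_before


# 2000 tokens without the wrapper, 400 to 600 tokens with it.
for after in (400, 600):
    print(after, f"{prefill_reduction(2000, after):.0%}")
```

So the wrapper cuts the prompt tokens per request by roughly 70% to 80% in this example.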

AI Agent

AI Agent for RAG

I found two popular agents with different strengths:

Hermes (https://github.com/nousresearch/hermes-agent): an AI agent specialized in self-improvement. It is a good candidate for accumulating and organizing knowledge to be used for RAG.
OpenClaw (https://github.com/openclaw/openclaw): an AI agent specialized in orchestration. It is a good candidate for automation, where a power user on the customer side defines the inputs, outcomes, and boundaries, then OpenClaw implements the workflow.

I’m not done experimenting with both AI agents yet; I will post an update once I reach a conclusion.
