<p class="wp-block-paragraph">I’m documenting my RAG learning journey as a personal notebook of concise lessons, aha moments, and practical mental models — not a rewrite of course material. Expect short, actionable takeaways, intuition-first explanations, links to original resources, and checkpoints that save me from re-watching lectures. This log is for beginners following a structured path who want quick reminders and decision points.</p>
<hr class="wp-block-separator has-alpha-channel-opacity"/>
<details class="wp-block-details is-layout-flow wp-block-details-is-layout-flow"><summary>Setup dev machine</summary>
<h2 id="htoc-setup-dev-machine" class="wp-block-heading">Setup dev machine</h2>
<h3 id="htoc-install-ollama" class="wp-block-heading">Install Ollama</h3>
<p class="wp-block-paragraph" id="htoc-download-from-download-ollama-on-windowsverify-ollama-version">download from Download Ollama on Windows<br>Verify: <code>ollama –version</code></p>
<p class="wp-block-paragraph" id="htoc-expected-result-ollama-version-is-0-24-0">Expected result:</p>
<pre id="htoc-1" class="wp-block-code"><code><code>ollama version is 0.24.0</code></code></pre>
<h3 id="htoc-pull-model" class="wp-block-heading">Pull model</h3>
<p class="wp-block-paragraph" id="htoc-ollama-pull-qwen3-4bollama-pull-qwen3-8bollama-pull-deepseek-coder-6-7b"><code>ollama pull qwen3:4b</code><br><code>ollama pull qwen3:8b</code><br><code>ollama pull deepseek-coder:6.7b</code></p>
<p class="wp-block-paragraph" id="htoc-big-model-such-as-qwen-14b-with-6gb-vram-at-rtx-a1000-it-worsen-the-result-due-to-some-layer-sit-on-vram-and-other-layer-on-ram-and-experience-bus-bottleneck">Big model such as <code>qwen:14b</code>, with 6GB VRAM at RTX A1000, it worsen the result due to some layer sit on VRAM and other layer on RAM, and experience bus bottleneck.</p>
<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph" id="htoc-use-ai-model-that-fit-100-on-the-gpu">Use AI model that fit 100% on the GPU.</p>
</blockquote>
<p class="wp-block-paragraph" id="htoc-prompt-with-specific-to-coding-ai-model-for-coding-superior-in-code-implementation-but-generic-reasoning-ai-model-superior-in-design-use-generic-reasoning-ai-model-such-as-qwen3-8b-to-reason-your-design-translate-high-level-requirement-into-design-the-use-coding-ai-model-such-as-deepseek-coder-6-7b-to-implement-the-design">Prompt with specific to coding, AI model for coding superior in code implementation, but generic reasoning AI model superior in design. Use generic reasoning AI model such as <code>qwen3:8b</code> to reason your design – translate high level requirement into design, the use coding AI model such as <code>deepseek-coder:6.7b</code> to implement the design.</p>
<p class="wp-block-paragraph" id="htoc-as-long-as-your-ai-model-fit-to-the-gpu-vram-use-bigger-model-we-suggest-to-use-qwen3-8b-over-qwen3-4b-if-vram-size-permit">As long as your AI model fit to the GPU VRAM, use bigger model, we suggest to use <code>qwen3:8b</code> over <code>qwen3:4b</code> if VRAM size permit.</p>
<p class="wp-block-paragraph" id="htoc-use-command-ollama-ps-to-check-how-much-gpu-used">Use command <code>ollama ps</code> to check how much GPU used.</p>
<h3 id="htoc-verify-the-model" class="wp-block-heading">Verify the model</h3>
<ul class="wp-block-list">
<li>Command: <code>ollama list</code></li>
<li>Prompt: <code>ollama run qwen2.5:7b "Explain what a purchase order is in one sentence"</code></li>
</ul>
<h3 id="htoc-my-machine-size" class="wp-block-heading">My machine size</h3>
<p class="wp-block-paragraph" id="htoc-use-my-machine-size-as-your-reference-extrapolate-to-your-machine-the-benchmark-and-other-experiment-result-based-on-this-hardware-size">Use my machine size as your reference, extrapolate to your machine. The benchmark and other experiment result based on this hardware size</p>
<ul class="wp-block-list">
<li>1x RTX A1000 6GB</li>
<li>1x CPU i7 13850HX gen 13</li>
<li>1x 32 GB ram</li>
<li>1 TB disk</li>
</ul>
</details>
<details class="wp-block-details is-layout-flow wp-block-details-is-layout-flow"><summary>Benchmark</summary>
<h2 id="htoc-benchmark" class="wp-block-heading">Benchmark</h2>
<h3 id="htoc-benchmark-result-for-reasoning-prompt" class="wp-block-heading">Benchmark result for reasoning prompt</h3>
<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Model</th><th>Score</th><th>Key Strength</th><th>Fatal Weakness</th></tr></thead><tbody><tr><td>Qwen3:8B</td><td>✅ 9.5</td><td>Correct + consistent</td><td>Minor assumption only</td></tr><tr><td>Qwen3:4B</td><td>⚠️ 7.5</td><td>Clear reasoning</td><td>Contradicts itself</td></tr><tr><td>Qwen2.5:7B</td><td>❌ 4.5</td><td>Good structure</td><td>Broken logic & math</td></tr></tbody></table></figure>
<p class="wp-block-paragraph" id="htoc-the-prompt">The prompt:</p>
<pre id="htoc-data-product-a-current-stock-120-units-daily-sales-last-30-days-avg-10-units-day-lead-time-7-days-safety-stock-50-units-product-b-current-stock-40-units-daily-sales-5-units-day-lead-time-14-days-safety-stock-30-units-task-1-calculate-reorder-point-for-each-product-2-determine-if-reorder-is-needed-3-suggest-reorder-quantity-cover-30-days-demand-return-step-by-step-reasoning-final-recommendation-per-product" class="wp-block-code"><code>Data:
– Product A:
– Current stock: 120 units
– Daily sales (last 30 days avg): 10 units/day
– Lead time: 7 days
– Safety stock: 50 units
– Product B:
– Current stock: 40 units
– Daily sales: 5 units/day
– Lead time: 14 days
– Safety stock: 30 units
Task:
1. Calculate reorder point for each product
2. Determine if reorder is needed
3. Suggest reorder quantity (cover 30 days demand)
Return:
– Step-by-step reasoning
– Final recommendation per product</code></pre>
<h3 id="htoc-benchmark-for-coding-prompt" class="wp-block-heading">Benchmark for coding prompt</h3>
<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Rank</th><th>Model</th><th>Notes</th></tr></thead><tbody><tr><td>🥇✅</td><td>Qwen3:8B</td><td>Best (reasoning + correctness)</td></tr><tr><td>🥈</td><td>qwen2.5-coder:7B</td><td>Best practical coding output</td></tr><tr><td>🥈✅</td><td>DeepSeek-Coder:6.7B</td><td>Very strong coding (less complete)</td></tr><tr><td>🥈</td><td>LLaMA3:8B</td><td>Clean + teaching-friendly</td></tr><tr><td>🥉</td><td>Magicoder</td><td>Correct but overly simplified</td></tr><tr><td>🥉</td><td>Qwen3:4B</td><td>Reasoning but weaker precision</td></tr><tr><td>🥉</td><td>Qwen2.5:7B</td><td>Basic template</td></tr><tr><td>⚠️</td><td>Qwen2.5-coder:1.5B</td><td>Flawed abstraction</td></tr><tr><td>❌</td><td>Phi-4-mini</td><td>Incorrect pattern</td></tr></tbody></table></figure>
<p class="wp-block-paragraph" id="htoc-the-prompt1">The prompt:</p>
<pre id="htoc-explain-strategy-pattern-in-java" class="wp-block-code"><code>explain strategy pattern in Java</code></pre>
<h3 id="htoc-benchmark-for-reasoning-prompt-with-coding-subject" class="wp-block-heading">Benchmark for reasoning prompt, with coding subject</h3>
<p class="wp-block-paragraph" id="htoc-overal-score">overal score:</p>
<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Model</th><th>Average Score</th><th>Summary</th></tr></thead><tbody><tr><td>Qwen3.5:4B</td><td>⭐⭐⭐⭐⭐ 4.67</td><td>Best balance of depth, clarity, and real-world examples</td></tr><tr><td>Qwen3:4B</td><td>⭐⭐⭐⭐⭐ 4.67</td><td>Best for clarity, engagement, teaching</td></tr><tr><td>Qwen3:8B</td><td>⭐⭐⭐⭐ 3.83</td><td>More formal, better precision</td></tr></tbody></table></figure>
<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph" id="htoc-larger-model-not-always-better-result-make-sure-it-s-size-fit-on-your-gpu">LARGER MODEL not always better result, make sure it’s size fit on your GPU.</p>
<p class="wp-block-paragraph">use LATEST VERSION is possible.</p>
</blockquote>
<p class="wp-block-paragraph" id="htoc-">Qwen3:4B has same score with Qwen3:4B, but it has different strength:</p>
<figure class="wp-block-table"><table class="has-fixed-layout"><tbody><tr><th>Model</th><th>Gains</th><th>Losses</th><th>Net Effect</th></tr><tr><td>Qwen3:4B</td><td>Structure + Engagement</td><td>Depth + Precision</td><td>Balanced</td></tr><tr><td>Qwen3.5:4B</td><td>Depth + Precision</td><td>Structure + Engagement</td><td>Balanced</td></tr></tbody></table></figure>
<p class="wp-block-paragraph" id="htoc-for-agentic-erp-executor-use-qwen3-5-4b">For Agentic ERP executor, use Qwen3.5:4B:</p>
<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Model</th><th>Suitability for Agentic Assistant</th><th>Reason</th></tr></thead><tbody><tr><td>Qwen3.5:4B</td><td>⭐⭐⭐⭐⭐ ✅ Best balanced choice</td><td>High depth + good clarity + strong practical reasoning</td></tr><tr><td>Qwen3:4B</td><td>⭐⭐⭐⭐</td><td>Excellent clarity and structure, but slightly less depth and reasoning breadth</td></tr><tr><td>Qwen3:8B</td><td>⭐⭐⭐⭐⭐</td><td>Strongest precision and consistency in decision-making</td></tr></tbody></table></figure>
<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Aspect</th><th>Qwen3:4B</th><th>Qwen3:8B</th><th>Qwen3.5:4B</th></tr></thead><tbody><tr><td>Clarity</td><td>⭐⭐⭐⭐⭐ Very clear, beginner-friendly</td><td>⭐⭐⭐⭐ Clear but more textbook-like</td><td>⭐⭐⭐⭐⭐ Very clear, strong explanations</td></tr><tr><td>Depth</td><td>⭐⭐⭐⭐ Good + examples</td><td>⭐⭐⭐⭐ Slightly more formal, includes variants</td><td>⭐⭐⭐⭐⭐ Very deep, includes variants & context</td></tr><tr><td>Structure</td><td>⭐⭐⭐⭐⭐ Excellent (tables, visuals, flow)</td><td>⭐⭐⭐⭐ Standard structure</td><td>⭐⭐⭐⭐ Good but less polished than 3:4B</td></tr><tr><td>Practical Insight</td><td>⭐⭐⭐⭐⭐ Strong intuition + use cases</td><td>⭐⭐⭐ Moderate</td><td>⭐⭐⭐⭐⭐ Strong real-world + applied examples</td></tr><tr><td>Technical Precision</td><td>⭐⭐⭐⭐ Good</td><td>⭐⭐⭐⭐⭐ Slightly better</td><td>⭐⭐⭐⭐⭐ Very strong, most complete</td></tr><tr><td>Engagement</td><td>⭐⭐⭐⭐⭐ Highly engaging (icons, metaphor)</td><td>⭐⭐⭐ More neutral</td><td>⭐⭐⭐⭐ Engaging but less styled than 3:4B</td></tr><tr><td>Extra Value</td><td>Includes example (8-puzzle), summary tables</td><td>Mentions algorithm variants (stochastic)</td><td>Includes multiple variants, ML links, rich examples</td></tr></tbody></table></figure>
<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Audience</th><th>Better Model</th></tr></thead><tbody><tr><td>Beginner / Student</td><td>✅ 4B</td></tr><tr><td>Exam preparation</td><td>✅ 4B</td></tr><tr><td>Technical interview prep</td><td>✅ Qwen3.5:4B</td></tr><tr><td>Research / deeper study</td><td>✅ Qwen3.5:4B</td></tr><tr><td>Quick refresher</td><td>✅ 8B</td></tr></tbody></table></figure>
<p class="wp-block-paragraph" id="htoc-the-prompt-in-search-algorithm-there-is-a-term-called-hill-climbing-explain">The prompt:</p>
<pre id="htoc-11" class="wp-block-code"><code><code>in search algorithm, there is a term called "hill climbing", explain.</code></code></pre>
</details>
<details class="wp-block-details is-layout-flow wp-block-details-is-layout-flow"><summary>Token cost saving</summary>
<h2 id="htoc-111" class="wp-block-heading">Token cost saving</h2>
<p class="wp-block-paragraph" id="htoc-lorem-ipsum">We can use Ollama as a LLM wrapper. Since Ollama can be used to construct KV Cache, it can be used for token cost saving when you call LLM API.</p>
<h3 id="htoc-core-concepts" class="wp-block-heading">Core Concepts</h3>
<h4 id="htoc-prefill-phase" class="wp-block-heading">Prefill phase</h4>
<p class="wp-block-paragraph" id="htoc-short-desc-model-reads-the-entire-prompt-and-builds-internal-state-kv-cache-this-is-the-most-expensive-part-learncodecamp-net"><strong>Short desc:</strong><br>Model reads the entire prompt and builds internal state (KV cache). This is the <strong>most expensive part</strong>. <a href="https://learncodecamp.net/llm-inference-basics-prefill-decode-ttft-itl/">[learncodecamp.net]</a></p>
<p class="wp-block-paragraph" id="htoc-read-more">🔗 Read more:</p>
<ul class="wp-block-list">
<li>https://learncodecamp.net/llm-inference-basics-prefill-decode-ttft-itl/</li>
</ul>
<hr class="wp-block-separator has-alpha-channel-opacity"/>
<h4 id="htoc-decode-phase" class="wp-block-heading">Decode phase</h4>
<p class="wp-block-paragraph" id="htoc-short-desc-model-generates-output-token-by-token-reusing-kv-cache-from-prefill-cheaper-per-token-but-sequential-learncodecamp-net"><strong>Short desc:</strong><br>Model generates output <strong>token-by-token</strong>, reusing KV cache from prefill. Cheaper per token but sequential. <a href="https://learncodecamp.net/llm-inference-basics-prefill-decode-ttft-itl/">[learncodecamp.net]</a></p>
<p class="wp-block-paragraph" id="htoc-read-more1">🔗 Read more:</p>
<ul class="wp-block-list">
<li>https://redis.io/blog/prefill-vs-decode/</li>
</ul>
<hr class="wp-block-separator has-alpha-channel-opacity"/>
<h4 id="htoc-kv-cache" class="wp-block-heading">KV cache</h4>
<p class="wp-block-paragraph" id="htoc-short-desc-stores-attention-states-keys-values-so-the-model-doesn-t-recompute-previous-tokens-during-generation-speeds-up-decoding-huggingface-co"><strong>Short desc:</strong><br>Stores attention states (keys/values) so the model <strong>doesn’t recompute previous tokens</strong> during generation → speeds up decoding. <a href="https://huggingface.co/blog/not-lain/kv-caching">[huggingface.co]</a></p>
<p class="wp-block-paragraph" id="htoc-read-more11">🔗 Read more:</p>
<ul class="wp-block-list">
<li>https://huggingface.co/blog/not-lain/kv-caching</li>
</ul>
<hr class="wp-block-separator has-alpha-channel-opacity"/>
<h3 id="htoc-your-use-case-save-token-cost-with-ollama-wrapper" class="wp-block-heading">Use Case: Save Token Cost with Ollama Wrapper</h3>
<h4 id="htoc-key-fact" class="wp-block-heading">Key fact</h4>
<ul class="wp-block-list">
<li>Cost ≈ number of tokens sent in <strong>prefill</strong></li>
<li>Long prompt = expensive (recomputed every request)</li>
</ul>
<hr class="wp-block-separator has-alpha-channel-opacity"/>
<h4 id="htoc-what-you-do-with-ollama" class="wp-block-heading">What you do with Ollama</h4>
<p class="wp-block-paragraph" id="htoc-1-compress-context-before-prefill"><strong>1. Compress context before prefill</strong></p>
<ul class="wp-block-list">
<li>Summarize old chat</li>
<li>Keep only last few messages<br>👉 fewer tokens → cheaper prefill</li>
</ul>
<hr class="wp-block-separator has-alpha-channel-opacity"/>
<p class="wp-block-paragraph" id="htoc-2-don-t-resend-everything"><strong>2. Don’t resend everything</strong></p>
<p class="wp-block-paragraph" id="htoc-instead-of">Instead of:</p>
<pre id="htoc-full-chat-history-every-request" class="wp-block-code"><code>full chat history → every request
</code></pre>
<p class="wp-block-paragraph" id="htoc-do">Do:</p>
<pre id="htoc-summary-recent-relevant-facts" class="wp-block-code"><code>summary + recent + relevant facts
</code></pre>
<p class="wp-block-paragraph" id="htoc-avoids-context-explosion-medium-com">👉 avoids “context explosion” <a href="https://medium.com/@ravityuval/how-i-reduced-llm-token-costs-by-90-using-prompt-rag-and-ai-agent-optimization-f64bd1b56d9f">[medium.com]</a></p>
<hr class="wp-block-separator has-alpha-channel-opacity"/>
<p class="wp-block-paragraph" id="htoc-3-use-ollama-as-local-preprocessor"><strong>3. Use Ollama as local preprocessor</strong></p>
<ul class="wp-block-list">
<li>summarize</li>
<li>filter</li>
<li>classify</li>
</ul>
<p class="wp-block-paragraph" id="htoc-small-model-prepares-input-big-model-does-reasoning">👉 small model prepares input → big model does reasoning</p>
<hr class="wp-block-separator has-alpha-channel-opacity"/>
<p class="wp-block-paragraph" id="htoc-4-route-simple-queries-locally"><strong>4. Route simple queries locally</strong></p>
<ul class="wp-block-list">
<li>FAQ / classification handled by Ollama<br>👉 avoid cloud calls entirely</li>
</ul>
<hr class="wp-block-separator has-alpha-channel-opacity"/>
<h4 id="htoc-result" class="wp-block-heading">Result</h4>
<figure class="wp-block-table"><table class="has-fixed-layout"><tbody><tr><th>Without wrapper</th><th>With Ollama wrapper</th></tr><tr><td>2000 tokens per request</td><td>400–600 tokens</td></tr><tr><td>High prefill cost</td><td>Low prefill cost</td></tr><tr><td>Full history sent</td><td>Compressed context</td></tr><tr><td>Cloud always used</td><td>Hybrid local + cloud</td></tr></tbody></table></figure>
</details>
<details class="wp-block-details is-layout-flow wp-block-details-is-layout-flow"><summary>AI Agent</summary>
<h2 id="htoc-ai-agen-for-rag" class="wp-block-heading">AI Agen for RAG</h2>
<p class="wp-block-paragraph" id="htoc-i-found-2-popular-agents-with-different-strength">I found 2 popular agents with different strength:</p>
<ul class="wp-block-list">
<li id="htoc-h"><a href="https://github.com/nousresearch/hermes-agent">Hermes</a>: AI agent specialized for self improving, this is good candidate for knowledge accumulation, and organize the knowledge, to be used for RAG.</li>
<li id="htoc-h1"><a href="https://github.com/openclaw/openclaw">OpenClaw</a>: AI agent specialized for orchestration, this is good candidate for automation, where customer power user defines inputs, outcome, and boundaries, then OpenClaw implement the workflow.</li>
</ul>
<p class="wp-block-paragraph" id="htoc-i-m-not-done-yet-with-experiment-with-both-ai-agent-will-post-update-once-i-got-conclusion">I’m not done yet with experiment with both AI agent, will post update once I got conclusion.</p>
</details>
<p class="wp-block-paragraph"></p>
I’m documenting my RAG learning journey as a personal notebook of concise lessons, aha moments, and practical mental models — not a rewrite of course material. Expect short, actionable takeaways, intuition-first explanations, links to original resources, and checkpoints that save me from re-watching lectures. This log is for beginners following a structured path who want quick reminders and decision points.
Setup dev machine
Setup dev machine
Install Ollama
download from Download Ollama on Windows Verify: ollama --version
Big model such as qwen:14b, with 6GB VRAM at RTX A1000, it worsen the result due to some layer sit on VRAM and other layer on RAM, and experience bus bottleneck.
Use AI model that fit 100% on the GPU.
Prompt with specific to coding, AI model for coding superior in code implementation, but generic reasoning AI model superior in design. Use generic reasoning AI model such as qwen3:8b to reason your design – translate high level requirement into design, the use coding AI model such as deepseek-coder:6.7b to implement the design.
As long as your AI model fit to the GPU VRAM, use bigger model, we suggest to use qwen3:8b over qwen3:4b if VRAM size permit.
Use command ollama ps to check how much GPU used.
Verify the model
Command: ollama list
Prompt: ollama run qwen2.5:7b "Explain what a purchase order is in one sentence"
My machine size
Use my machine size as your reference, extrapolate to your machine. The benchmark and other experiment result based on this hardware size
1x RTX A1000 6GB
1x CPU i7 13850HX gen 13
1x 32 GB ram
1 TB disk
Benchmark
Benchmark
Benchmark result for reasoning prompt
Model
Score
Key Strength
Fatal Weakness
Qwen3:8B
✅ 9.5
Correct + consistent
Minor assumption only
Qwen3:4B
⚠️ 7.5
Clear reasoning
Contradicts itself
Qwen2.5:7B
❌ 4.5
Good structure
Broken logic & math
The prompt:
Data:
- Product A:
- Current stock: 120 units
- Daily sales (last 30 days avg): 10 units/day
- Lead time: 7 days
- Safety stock: 50 units
- Product B:
- Current stock: 40 units
- Daily sales: 5 units/day
- Lead time: 14 days
- Safety stock: 30 units
Task:
1. Calculate reorder point for each product
2. Determine if reorder is needed
3. Suggest reorder quantity (cover 30 days demand)
Return:
- Step-by-step reasoning
- Final recommendation per product
Benchmark for coding prompt
Rank
Model
Notes
🥇✅
Qwen3:8B
Best (reasoning + correctness)
🥈
qwen2.5-coder:7B
Best practical coding output
🥈✅
DeepSeek-Coder:6.7B
Very strong coding (less complete)
🥈
LLaMA3:8B
Clean + teaching-friendly
🥉
Magicoder
Correct but overly simplified
🥉
Qwen3:4B
Reasoning but weaker precision
🥉
Qwen2.5:7B
Basic template
⚠️
Qwen2.5-coder:1.5B
Flawed abstraction
❌
Phi-4-mini
Incorrect pattern
The prompt:
explain strategy pattern in Java
Benchmark for reasoning prompt, with coding subject
overal score:
Model
Average Score
Summary
Qwen3.5:4B
⭐⭐⭐⭐⭐ 4.67
Best balance of depth, clarity, and real-world examples
Qwen3:4B
⭐⭐⭐⭐⭐ 4.67
Best for clarity, engagement, teaching
Qwen3:8B
⭐⭐⭐⭐ 3.83
More formal, better precision
LARGER MODEL not always better result, make sure it’s size fit on your GPU.
use LATEST VERSION is possible.
Qwen3:4B has same score with Qwen3:4B, but it has different strength:
Model
Gains
Losses
Net Effect
Qwen3:4B
Structure + Engagement
Depth + Precision
Balanced
Qwen3.5:4B
Depth + Precision
Structure + Engagement
Balanced
For Agentic ERP executor, use Qwen3.5:4B:
Model
Suitability for Agentic Assistant
Reason
Qwen3.5:4B
⭐⭐⭐⭐⭐ ✅ Best balanced choice
High depth + good clarity + strong practical reasoning
Qwen3:4B
⭐⭐⭐⭐
Excellent clarity and structure, but slightly less depth and reasoning breadth
Qwen3:8B
⭐⭐⭐⭐⭐
Strongest precision and consistency in decision-making
Aspect
Qwen3:4B
Qwen3:8B
Qwen3.5:4B
Clarity
⭐⭐⭐⭐⭐ Very clear, beginner-friendly
⭐⭐⭐⭐ Clear but more textbook-like
⭐⭐⭐⭐⭐ Very clear, strong explanations
Depth
⭐⭐⭐⭐ Good + examples
⭐⭐⭐⭐ Slightly more formal, includes variants
⭐⭐⭐⭐⭐ Very deep, includes variants & context
Structure
⭐⭐⭐⭐⭐ Excellent (tables, visuals, flow)
⭐⭐⭐⭐ Standard structure
⭐⭐⭐⭐ Good but less polished than 3:4B
Practical Insight
⭐⭐⭐⭐⭐ Strong intuition + use cases
⭐⭐⭐ Moderate
⭐⭐⭐⭐⭐ Strong real-world + applied examples
Technical Precision
⭐⭐⭐⭐ Good
⭐⭐⭐⭐⭐ Slightly better
⭐⭐⭐⭐⭐ Very strong, most complete
Engagement
⭐⭐⭐⭐⭐ Highly engaging (icons, metaphor)
⭐⭐⭐ More neutral
⭐⭐⭐⭐ Engaging but less styled than 3:4B
Extra Value
Includes example (8-puzzle), summary tables
Mentions algorithm variants (stochastic)
Includes multiple variants, ML links, rich examples
Audience
Better Model
Beginner / Student
✅ 4B
Exam preparation
✅ 4B
Technical interview prep
✅ Qwen3.5:4B
Research / deeper study
✅ Qwen3.5:4B
Quick refresher
✅ 8B
The prompt:
in search algorithm, there is a term called "hill climbing", explain.
Token cost saving
Token cost saving
We can use Ollama as a LLM wrapper. Since Ollama can be used to construct KV Cache, it can be used for token cost saving when you call LLM API.
Core Concepts
Prefill phase
Short desc: Model reads the entire prompt and builds internal state (KV cache). This is the most expensive part. [learncodecamp.net]
Short desc: Model generates output token-by-token, reusing KV cache from prefill. Cheaper per token but sequential. [learncodecamp.net]
🔗 Read more:
https://redis.io/blog/prefill-vs-decode/
KV cache
Short desc: Stores attention states (keys/values) so the model doesn’t recompute previous tokens during generation → speeds up decoding. [huggingface.co]
🔗 Read more:
https://huggingface.co/blog/not-lain/kv-caching
Use Case: Save Token Cost with Ollama Wrapper
Key fact
Cost ≈ number of tokens sent in prefill
Long prompt = expensive (recomputed every request)
What you do with Ollama
1. Compress context before prefill
Summarize old chat
Keep only last few messages 👉 fewer tokens → cheaper prefill
Hermes: AI agent specialized for self improving, this is good candidate for knowledge accumulation, and organize the knowledge, to be used for RAG.
OpenClaw: AI agent specialized for orchestration, this is good candidate for automation, where customer power user defines inputs, outcome, and boundaries, then OpenClaw implement the workflow.
I’m not done yet with experiment with both AI agent, will post update once I got conclusion.
Leave a Reply