Why We Turned RAG Into a Tool

June 4, 2026 · 7 min read

Creator of DevoxxGenie

For months, DevoxxGenie's RAG pipeline worked beautifully — as long as you stayed in chat mode. Index your project into ChromaDB, ask a question, and the most relevant code chunks would automatically appear in the prompt as a <SemanticContext> block. It was invisible, automatic, and effective.

Then we shipped Agent Mode, and RAG fell off a cliff.

Users would ask conceptual questions like "which slides discuss MCP?" or "where do we explain the indexing pipeline?" and the agent would ignore the rich semantic context we had just injected. Instead, it reached for search_files — a regex grep — and returned nonsense. The semantic context had become wallpaper: present, but unseen.

The tool bias problem

Agent-mode LLMs are trained to prefer tools over passive context. When both a <SemanticContext> block and a search_files tool are available, the model almost always chooses the tool. It's not being lazy; it's doing exactly what the agent loop incentivises. The problem is that regex grep is terrible for conceptual queries. A question about the "authentication flow" turns into a pattern like auth.*flow, which matches nothing if the code never puts those two words on the same line, yet the same question embeds close to the actual auth code.
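To make the mismatch concrete, here is a minimal Python sketch. The file names and contents are invented for illustration, and the pattern is only one plausible regex an agent might derive from the question.

```python
import re

# Hypothetical project files; names and contents are invented for illustration.
files = {
    "LoginController.java": "public Response login(Credentials c) { return tokenService.issue(c); }",
    "TokenService.java": "String issue(Credentials c) { verifyPassword(c); return signJwt(c.user()); }",
}

# One plausible regex an agent might pass to search_files for "authentication flow".
pattern = re.compile(r"auth.*flow", re.IGNORECASE)

# Both files implement the login path, yet neither contains the literal words.
matches = [name for name, text in files.items() if pattern.search(text)]
print(matches)  # prints []
```

An embedding-based search has no such dependency on shared vocabulary, which is why the same question can still land near these files.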

We tried making the tool descriptions more persuasive. We tried bigger models. Smaller models still defaulted to grep. The only reliable fix was to stop fighting the bias and lean into it: if the model wants a tool, give it a tool.

From injection to orchestration

The semantic_search tool (task-221, refined in task-222) does exactly what the passive injection did — embed the query, search ChromaDB, return the top-K chunks — but exposes it through the agent tool loop instead of prepending it to the prompt.

The key design decision was mutual exclusion: when Agent mode is on, passive injection is completely suppressed. The LLM sees the index only through the tool. This avoids three problems at once:

- Token waste: no duplicate context in the prompt
- Contradictory sources: the LLM can't get confused by injected chunks that differ from tool-retrieved ones
- Tool bias: the model has no choice but to use the tool if it wants semantic retrieval
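The mutual-exclusion rule can be sketched in a few lines of Python. The function name echoes shouldInjectPassiveRagContext() from the plugin, but the signature, the build_prompt helper, and the prompt layout are assumptions made for this illustration, not the actual implementation.

```python
from typing import List


def should_inject_passive_rag_context(rag_enabled: bool, agent_mode: bool) -> bool:
    """Passive injection applies only when RAG is on and Agent mode is off."""
    return rag_enabled and not agent_mode


def build_prompt(question: str, chunks: List[str], rag_enabled: bool, agent_mode: bool) -> str:
    # Hypothetical helper: prepend retrieved chunks only in chat mode.
    if should_inject_passive_rag_context(rag_enabled, agent_mode) and chunks:
        context = "\n".join(chunks)
        return f"<SemanticContext>\n{context}\n</SemanticContext>\n\n{question}"
    # In Agent mode the index is reachable only through the semantic_search tool.
    return question


chat_prompt = build_prompt("Where is auth handled?", ["class TokenService {}"], True, False)
agent_prompt = build_prompt("Where is auth handled?", ["class TokenService {}"], True, True)
```

With the same inputs, chat_prompt carries the context block while agent_prompt is just the question, so the two retrieval paths never coexist in one request.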

Of course, the agent still has search_files for exact-string lookups, list_files for browsing, and the PSI tools for symbol navigation. The point isn't to replace those. It's to give the agent an orchestration layer: semantic search for meaning, grep for literals, PSI for symbols.

Query expansion and the meta-query trap

Conceptual queries have a second problem: they embed like conversational boilerplate. Ask "where do we discuss authentication?" and the embedding lands near chit-chat, not near AuthenticationService.java.

Our answer was optional query expansion via ExpandingQueryTransformer. The query is paraphrased into multiple variants, each searched independently, and the results fused with Reciprocal Rank Fusion (RRF, k=60). A single meta-query becomes a small retrieval ensemble. It's overkill for "find the User class" and essential for "where do we explain the indexing pipeline?"
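For readers unfamiliar with RRF, the fusion step can be sketched as follows. This is a generic Python illustration of Reciprocal Rank Fusion with k=60, not the plugin's code; the file names and the three rankings are made up, and ties are broken alphabetically as an arbitrary choice for this sketch.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists: each item scores sum(1 / (k + rank)), rank starting at 1."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by name for a deterministic order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results for three paraphrases of the same question.
rankings = [
    ["AuthService.java", "README.md", "LoginController.java"],
    ["LoginController.java", "AuthService.java"],
    ["AuthService.java", "SecurityConfig.java"],
]
fused = reciprocal_rank_fusion(rankings)
print(fused[:2])  # prints ['AuthService.java', 'LoginController.java']
```

A chunk that ranks well across several paraphrases rises to the top even if no single variant ranked it first, which is the property that helps with meta-queries.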

The nudge that smaller models need

Even with the tool registered, we noticed smaller models still sometimes defaulted to search_files for conceptual queries. Tool descriptions alone weren't enough. So when Agent mode + RAG are both enabled, DevoxxGenie now injects a dedicated <RAG_INSTRUCTION> system-prompt fragment:

Prefer semantic_search for conceptual queries (e.g. "which slides discuss X", "where do we explain Y").

This lives outside the tool schema, in the system prompt itself, where smaller models are more likely to honour it. It's a small patch, but it meaningfully improved tool selection on models like Qwen 2.5 and Gemma.
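A minimal sketch of that conditional injection, in Python. The wrapping tags, the closing tag, and the build_system_prompt helper are assumptions for illustration; only the instruction sentence itself comes from the fragment quoted above.

```python
# The instruction text is quoted from the post; the surrounding tags are assumed.
RAG_INSTRUCTION = (
    "<RAG_INSTRUCTION>\n"
    'Prefer semantic_search for conceptual queries '
    '(e.g. "which slides discuss X", "where do we explain Y").\n'
    "</RAG_INSTRUCTION>"
)


def build_system_prompt(base_prompt: str, agent_mode: bool, rag_enabled: bool) -> str:
    """Append the nudge only when Agent mode and RAG are both enabled."""
    if agent_mode and rag_enabled:
        return f"{base_prompt}\n\n{RAG_INSTRUCTION}"
    return base_prompt


with_nudge = build_system_prompt("You are a coding assistant.", agent_mode=True, rag_enabled=True)
without_nudge = build_system_prompt("You are a coding assistant.", agent_mode=True, rag_enabled=False)
```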

Error handling as a first-class concern

One subtle requirement: the tool must fail gracefully. If ChromaDB is unreachable, throwing an exception breaks the agent loop. Instead, SemanticSearchToolExecutor returns a descriptive error string:

Error: ChromaDB is not available. Docker container may not be running.

The agent reads this, understands the index is down, and falls back to search_files or PSI tools. No modal dialogs, no stack traces in the chat window. Because semantic_search is classified as a read-only tool, it is also auto-approved when auto-approve for read-only tools is on, so there is no approval friction on every call.
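The pattern of returning errors as strings rather than raising them can be sketched like this. It is a Python illustration under stated assumptions: the result shape, the use of ConnectionError to signal an unreachable index, the "no results" wording, and the 200-character truncation limit are all hypothetical; only the ChromaDB error message is taken from the text above.

```python
from typing import Callable, List, Tuple

Hit = Tuple[str, str]  # (file path, snippet); a simplified, hypothetical result shape

CHROMA_DOWN = "Error: ChromaDB is not available. Docker container may not be running."


def execute_semantic_search(query: str, search: Callable[[str], List[Hit]],
                            max_snippet_chars: int = 200) -> str:
    """Run the search and always return a string, never raise into the agent loop."""
    try:
        hits = search(query)
    except ConnectionError:
        # Assumption: an unreachable index surfaces as ConnectionError in this sketch.
        return CHROMA_DOWN
    if not hits:
        return f'No results found for "{query}".'
    # Truncate snippets so a single large chunk cannot flood the context window.
    return "\n".join(f"{path}: {snippet[:max_snippet_chars]}" for path, snippet in hits)


def unreachable_index(query: str) -> List[Hit]:
    # Stand-in for a search backend whose vector store is down.
    raise ConnectionError("connection refused")


print(execute_semantic_search("indexing pipeline", unreachable_index))
```

Because the failure arrives as an ordinary tool result, the model can read it and choose another tool on the next turn.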

The user-control layer

Not everyone wants the agent to have semantic search. We kept the control granular: semantic_search appears in Settings → Agent Mode → Built-in Tools as an individual checkbox, independent of the master RAG switch. You can keep passive injection active for chat mode while excluding the tool from the agent's toolbox, or vice versa.

Architecture in three layers

The implementation is thin by design — it reuses the existing RAG stack rather than building a parallel one:

| Layer | Component | Role |
| --- | --- | --- |
| Storage | `ProjectIndexerService` + ChromaDB | Language-aware chunking, content-hash manifest, batched embeddings |
| Retrieval | `SemanticSearchService` | Embedding, optional query expansion, RRF fusion, score filtering |
| Agent glue | `SemanticSearchToolExecutor` | Formats results for LLM consumption, truncates snippets, returns safe errors |

Registration happens in BuiltInToolProvider: the tool is added only when ragEnabled is true. Suppression of passive injection lives in MessageCreationService.shouldInjectPassiveRagContext(). The <RAG_INSTRUCTION> fragment is injected by ChatMemoryManager. Each concern is separated, so the feature can be disabled or extended without touching the core RAG pipeline.
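As a rough Python sketch of the registration gate: the Settings fields and the built_in_tools function are hypothetical stand-ins, and combining the ragEnabled flag with the per-tool checkbox using a logical AND is an assumption about how the two controls interact, not something the plugin's code is confirmed to do.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Settings:
    rag_enabled: bool                   # stands in for the ragEnabled flag
    semantic_search_tool_enabled: bool  # hypothetical name for the Built-in Tools checkbox


def built_in_tools(settings: Settings) -> List[str]:
    """Return the tool names offered to the agent for the given settings."""
    tools = ["search_files", "list_files"]
    # Assumption: the tool is registered only if RAG is on AND its checkbox is ticked.
    if settings.rag_enabled and settings.semantic_search_tool_enabled:
        tools.append("semantic_search")
    return tools


print(built_in_tools(Settings(rag_enabled=True, semantic_search_tool_enabled=False)))
# prints ['search_files', 'list_files']
```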

A third retrieval tier: web_search

semantic_search covers your local codebase. search_files covers exact strings. But sometimes the right answer isn't in the project at all — it's in a library changelog, a Stack Overflow thread, or a vendor API reference. That's the gap the new web_search tool fills (task-223).

When web_search is enabled in Settings → Agent Mode → Built-in Tools, the agent gains access to live web search backed by whichever provider you've already configured in Settings → Web Search: Tavily or Google Custom Search. No new API key management — it reuses the keys you've already set up for the /search slash command.

The tool returns raw structured results — title, URL, and snippet — so the agent can reason over the sources directly:

Found 3 results for "langchain4j ChromaDB 0.6 migration":

1. LangChain4j 0.37 release notes
   URL: https://github.com/langchain4j/langchain4j/releases/tag/0.37.0
   ChromaDB store updated to API v2 (0.6.x). Collection creation now uses ...

2. ...

The tool picks the provider automatically: Tavily is tried first if its key is present; Google Custom Search is used as the fallback. If neither is configured, the tool returns a descriptive error string — the agent degrades gracefully, just like semantic_search does when ChromaDB is unreachable.
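The selection order and the output shape can be sketched as follows. This Python example makes no real network calls: providers are passed in as plain functions, with None standing for a missing API key. The function name, the result dictionary keys, and the exact wording of the error string are assumptions for illustration.

```python
from typing import Callable, Dict, List, Optional

Result = Dict[str, str]  # assumed keys: "title", "url", "snippet"
SearchFn = Callable[[str], List[Result]]


def run_web_search(query: str, tavily: Optional[SearchFn], google: Optional[SearchFn]) -> str:
    """Try Tavily first, fall back to Google Custom Search, else return an error string."""
    provider = tavily or google
    if provider is None:
        # Hypothetical wording; the point is that the agent gets a string, not an exception.
        return "Error: no web search provider is configured."
    results = provider(query)
    lines = [f'Found {len(results)} results for "{query}":', ""]
    for i, r in enumerate(results, start=1):
        lines += [f"{i}. {r['title']}", f"   URL: {r['url']}", f"   {r['snippet']}", ""]
    return "\n".join(lines).rstrip()


def fake_google(query: str) -> List[Result]:
    # Stand-in provider returning canned data for the example.
    return [
        {"title": "Example docs", "url": "https://example.com/docs", "snippet": "Getting started ..."},
        {"title": "Example changelog", "url": "https://example.com/changes", "snippet": "Version notes ..."},
    ]


output = run_web_search("example migration", tavily=None, google=fake_google)
print(output.splitlines()[0])  # prints Found 2 results for "example migration":
```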

When a result looks worth reading in full, the agent can follow up with fetch_page, which fetches the complete content of a webpage given a URL — making web_search → fetch_page a natural two-step pattern for retrieving and reading external documentation.

Like its sibling tools, web_search is classified as read-only and is therefore auto-approved — no per-call confirmation dialog when auto-approve read-only is on.

The three-tier retrieval stack

With all three tools enabled, the agent now has a complete retrieval stack:

| Tool | Best for |
| --- | --- |
| `semantic_search` | Conceptual queries over your indexed codebase |
| `search_files` | Exact-string or regex lookups in project files |
| `web_search` | Documentation, release notes, external references |

The LLM decides which tier to invoke based on the query. Conceptual questions about your own code go to semantic_search. Grep-style lookups go to search_files. Anything that requires up-to-date external knowledge goes to web_search.

What changed, what didn't

The RAG pipeline itself is unchanged. It still indexes via Ollama's nomic-embed-text, still stores vectors in ChromaDB v0.6.2, still filters low-content chunks at index time, still debounces re-indexing on save. What changed is the interface surface: from prompt injection to tool contract — and now, a third tool tier that reaches beyond the local project entirely.

If you're already using RAG in chat mode, nothing breaks. If you turn on Agent mode, the same index becomes queryable on demand. And if you want the gory setup details — Docker, Ollama, indexing, configuration — the RAG docs have you covered.

Install: JetBrains Marketplace · GitHub
