Building My Own Private AI Search Agent That Actually Works

I built a local AI search system with gpt-oss 20B that handles a surprising amount of my AI search queries.

I've been experimenting with local AI models for a while, but recent industry trends are pushing me to experiment with them a lot more. Companies like Cursor are clearly wrestling with the costs of providing AI services, and fixed-cost plans from Anthropic keep getting more restrictive. It's clear where this is heading for heavy users like me: higher costs, less "unlimited" usage, and more pay-per-token pricing.

This experiment is a small but important step: the first local setup I've built that I actually use instead of ChatGPT for certain tasks. When I looked at my usage patterns, about 25% of my AI interactions were just web searches with synthesis. These queries need current information but don't require deep thinking, which makes them a perfect target for a local solution.

Understanding my AI usage

I spend about 50% of my AI time on coding—using Claude Code, Gemini CLI, and testing my own agent backed by Groq. Another 25% goes to deep research with Claude or ChatGPT's best models. The remaining 25% is web exploration and search.

That last category matters more than it seems. These are quick queries throughout the day where I need an explanation of something unfamiliar that I can digest in under 5 minutes. They don't need deep reasoning, just accurate, current information presented clearly.

The components

The system uses a handful of components that work well together:

  • gpt-oss 20B model - Handles the reasoning and synthesis
  • LibreChat - Clean frontend that feels familiar
  • Kagi APIs - Quality private web search without the usual headaches
  • Ollama - Currently serving the model (migrating to vLLM soon)
  • Custom Kotlin server - OpenAI-compatible API built with Ktor

I can run this on my laptop for testing, but for daily use it runs on my homelab with an RTX 4090.

The techniques that make it work

There are tons of great resources online about building research agents. I got to explore many of these techniques and implement basic versions that dramatically improved my inference pipeline.

Deduplication and Canonicalization: When a query comes in, I have the LLM generate diverse search queries in a separate context. After searching Kagi, the challenge is deduplicating and caching URLs and content. Different searches return the same URLs with different parameters. The model sometimes requests the same content multiple times. Being efficient with context is critical.
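The core of that deduplication step is mapping URL variants to a single cache key. Here's a rough sketch (in Python for brevity; my actual server is Kotlin, and names like `TRACKING_PARAMS` and `canonicalize` are illustrative, not my real implementation):

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Illustrative list of query parameters that don't affect page content.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "ref", "fbclid"}

def canonicalize(url: str) -> str:
    """Normalize a URL so equivalent variants map to one cache key."""
    parts = urlsplit(url)
    # Drop tracking params and sort the rest so parameter order never matters.
    query = sorted(
        (k, v) for k, v in parse_qsl(parts.query) if k not in TRACKING_PARAMS
    )
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(query), "")
    )

def dedupe(urls: list[str]) -> list[str]:
    """Keep only the first occurrence of each canonical URL."""
    seen: set[str] = set()
    out = []
    for u in urls:
        key = canonicalize(u)
        if key not in seen:
            seen.add(key)
            out.append(u)
    return out
```

With this, `https://Example.com/page/?utm_source=x&id=1` and `https://example.com/page?id=1` collapse to the same key, so the page is fetched and summarized once instead of twice.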

Parallelization: The model processes serially, so I parallelize everything else. Running searches in parallel, fetching multiple URLs simultaneously, chunking and summarizing documents concurrently. Fortunately, these are embarrassingly parallel tasks.
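Because the fetches are independent, a simple thread pool is enough. A minimal sketch (again Python for brevity; `fetch_page` is a stand-in for a real HTTP fetch, not my actual code):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_page(url: str) -> str:
    # Placeholder for an HTTP GET; the concurrency structure is the point.
    return f"content of {url}"

def fetch_all(urls: list[str], parallelism: int = 8) -> list[str]:
    """Fetch pages concurrently; pool.map returns results in input order."""
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        return list(pool.map(fetch_page, urls))
```

The same pattern applies to running the diverse search queries and to summarizing chunks: fan out the I/O-bound work, then hand the collected results back to the model's serial context.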

Chunking and Summarization: Web content and local documents are too large to dump into context. I implemented basic chunking and summary of summaries. There are more advanced techniques I want to try, but even simple approaches compress rich context into manageable token counts.
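The basic version I implemented amounts to overlapping fixed-size chunks plus a second summarization pass. A sketch of the flow, with `summarize` as a truncating placeholder for the actual LLM call so the example stays runnable:

```python
def chunk(text: str, size: int = 2000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size chunks; overlap preserves context at boundaries."""
    assert overlap < size
    chunks, start = [], 0
    while start < len(text):
        end = min(start + size, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap
    return chunks

def summarize(text: str, max_chars: int = 500) -> str:
    # Placeholder for an LLM summarization call.
    return text if len(text) <= max_chars else text[:max_chars]

def summary_of_summaries(document: str) -> str:
    """First pass summarizes each chunk; second pass compresses the partials."""
    partials = [summarize(c) for c in chunk(document)]
    return summarize("\n".join(partials))
```

The chunk size and overlap here are illustrative; in practice they'd be tuned to the model's context window and measured in tokens rather than characters.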

What I learned about "Good Enough"

This local setup produces excellent results for my daily search and exploration needs. The model isn't as capable as hosted models, but it runs fast, runs locally, and has custom tools that give it access to high-quality information. This system can improve over time as I integrate it more into my workflow.

The model searches as much as it needs within limits I can customize. It uses Kagi for search and can fetch full web pages, PDFs, or documents from my local filesystem or network storage. It remembers what it has already searched. And because content is fetched locally rather than through AI provider infrastructure, many sites that block known AI systems serve content to my homelab without issue. These advantages add up.
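Those customizable limits can be as simple as a per-conversation budget that also remembers past queries. A hypothetical sketch (the class name and default limits are illustrative, not my actual configuration):

```python
class ToolBudget:
    """Per-conversation limits on tool use, plus memory of past searches."""

    def __init__(self, max_searches: int = 5, max_fetches: int = 10):
        self.max_searches = max_searches
        self.max_fetches = max_fetches
        self.searches = 0
        self.fetches = 0
        self.seen_queries: set[str] = set()

    def allow_search(self, query: str) -> bool:
        """Refuse repeated queries and anything past the search budget."""
        key = query.strip().lower()
        if key in self.seen_queries or self.searches >= self.max_searches:
            return False
        self.seen_queries.add(key)
        self.searches += 1
        return True

    def allow_fetch(self) -> bool:
        if self.fetches >= self.max_fetches:
            return False
        self.fetches += 1
        return True
```

The agent loop checks the budget before each tool call, so a model that gets stuck re-requesting the same search simply gets a refusal and moves on to synthesis.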

The unexpected benefits

This experiment delivered on several fronts:

  1. Building real AI agents - I now understand what it takes to create a useful agent, which is valuable knowledge as AI becomes core to software development.
  2. Privacy by default - My searches and research stay local. No external logging.
  3. Unlimited usage - No rate limits or usage anxiety. My AI consumption can grow without hitting walls.

The psychological difference matters. The cost hides in my electricity bill rather than accumulating per-token charges. I'm more willing to experiment when I'm not watching a meter run.

Where this breaks down

Local models can't match GPT-4 or Claude for complex reasoning. When I need to analyze intricate codebases or synthesize multiple complex documents, I still use commercial services. But those cases are becoming less common.

There are limitations—context length, reasoning depth, model capabilities. But models keep getting better, smaller, and more efficient while consumer hardware with more VRAM becomes more accessible. What's barely workable today will be easy in a year.

The OpenAI paradox

We should acknowledge something: this only works because companies like OpenAI and Meta have released capable models that run on consumer hardware. The 20B parameter sweet spot balances size and utility perfectly with proper tooling. Its ability to do tool calling and provide JSON output via simple prompting is head and shoulders above other models of a similar size based on my testing and usage.

OpenAI's work in making AI accessible has ironically enabled us to build alternatives to their services. Their success created the foundation for AI independence.

The final word

Building a local AI search replacement is practical today. The GPT-OSS models have solid agentic capabilities, and with the right tools, they handle my search and exploration needs perfectly.

We're entering an era where AI assistance doesn't require subscriptions, usage limits, or privacy compromises. The tools exist. The models work. The only question is whether you're ready to take control of your AI infrastructure.

What's "good enough" today is already pretty good. And it's only getting better.
