Practical AI Architecture

Sharing thoughts and learnings towards more sustainable AI applications
December 30, 2024

TLDR

Start simple and build your way up. Along the way, don't use a large language model for everything. Instead, use a mix of small and large language models for the appropriate tasks. If possible, also consider running some tasks locally; Mac devices have a lot of untapped potential.

AI features are rapidly becoming table stakes for applications, but there's a growing challenge with scale. Even as cloud AI costs decrease, the fundamental architecture of today's AI applications creates real constraints that manifest in unexpected ways: "unlimited" plans that aren't truly unlimited, byzantine pricing tiers that confuse users, and applications that struggle to maintain their promise as usage grows.

Like many others building with AI, we initially just used large language models (LLMs) for everything. It was convenient - one solution for all problems. It felt like bringing a gun to a knife fight. Then we started adding agentic behaviours to our app with frameworks like Langchain, llama-index, and crew-ai, and that's where the cracks started to show. As we scaled and added more complexity, we discovered that this approach wasn't sustainable. The frameworks were too abstract: because they're built to be so general, you don't truly learn how things work under the hood until they break in production. And using the LLM hammer everywhere became increasingly costly.

Anthropic's recent blog post "Building Effective Agents" resonated strongly with our experience. At our previous startup, we had iterated through most of the frameworks they mentioned, and learned similar lessons through trial and error. Their emphasis on practical implementation over theoretical complexity particularly struck home - it was a principle we had discovered the hard way. While they focused on architectural simplicity, we found these principles extended naturally to the broader challenge of building sustainable AI applications. Their warnings about the hidden costs of complexity aligned perfectly with our own journey.

This realization pushed us to reimagine how AI applications could grow sustainably while maintaining their promise to users. Our journey led us to explore more targeted solutions.

We won't go into detail about what an agent is here, but if you're interested in learning more about agents and their technical implementation, check out my previous posts "What is an agent?" and "What is an agent, really?". Honestly, though, most people are building agents without a clear understanding of what they are. It's an overloaded term at this point, so we will focus on the underlying models instead.

Finding the Right Balance

Local First Design vs Cloud Native

I have since left our previous company, which built generic agentic solutions for enterprises, and moved on to a startup more focused on multi-modal use cases for individuals. For that we needed a transcription model, and Whisper seemed like the obvious choice. But the more we looked into it, the less it made sense cost-wise. Using a premium service like Deepgram would destroy our margins (~$20 per user per month with normal usage), and even the cheapest Whisper cloud model still ate up close to half of what we wanted to charge per seat. This wouldn't work: transcription is table stakes for us, and it was too risky to go down this path.
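
To make that concrete, here is a rough back-of-envelope sketch. The usage volume and seat price are illustrative assumptions (the post doesn't state ours); the $0.006/minute figure is OpenAI's listed Whisper API rate at the time of writing.

```python
# Back-of-envelope: what cloud transcription costs per seat per month.
# Assumptions (illustrative only): ~1 hour of audio per working day,
# 22 working days per month, and a hypothetical $20/month seat price.
WHISPER_API_PER_MIN = 0.006      # OpenAI's listed Whisper API rate, USD per minute
MINUTES_PER_MONTH = 1 * 60 * 22  # 1,320 minutes of audio
SEAT_PRICE = 20.0                # hypothetical price per seat, USD/month

cloud_cost = MINUTES_PER_MONTH * WHISPER_API_PER_MIN  # ~= $7.92 per user per month
print(f"Cloud transcription: ${cloud_cost:.2f}/user/month "
      f"= {cloud_cost / SEAT_PRICE:.0%} of a ${SEAT_PRICE:.0f} seat")
```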

The breakthrough came when we started experimenting with running Whisper on the device. Initial results were promising, but real-time transcription presented significant challenges. Through careful optimization and tuning, we achieved a Word Error Rate (WER) of approximately 11% for real-time streaming transcription on M-series chips – performance we can deliver consistently even on an M2 MacBook Air. When compared to cloud providers' benchmarks, this demonstrated that local processing could rival cloud solutions with zero marginal cost.
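
For readers who want to try this themselves, below is a minimal sketch using the open-source mlx-whisper package on Apple silicon. It is not our production pipeline (the package and model checkpoint named here are just one reasonable choice), and real-time streaming additionally requires chunking the microphone input and stitching overlapping windows, which is where most of the tuning effort went.

```python
# Minimal sketch: offline, file-based Whisper transcription on Apple silicon
# using the open-source mlx-whisper package. Real-time streaming needs extra
# work (audio chunking, overlapping windows, merging partial results).
import mlx_whisper

result = mlx_whisper.transcribe(
    "meeting.wav",                                          # path to any local audio file
    path_or_hf_repo="mlx-community/whisper-large-v3-mlx",   # converted Whisper weights
)
print(result["text"])
```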

This technical choice had significant implications. We've observed a common pattern in the AI space: platforms advertise "unlimited" features, but often with hidden constraints. Some limit usage duration, while others start with attractive pricing only to raise rates significantly once the unit economics become unsustainable. By processing locally, we could offer genuinely unlimited features without these asterisks or future price hikes.

Small vs Large Language Models

Our approach to language models evolved similarly. For cloud operations, we started with the usual suspects - OpenAI, Anthropic, and others. Then we paired them with SLMs (Small Language Models) for simpler use cases that don't need complex reasoning. Beyond that, we wanted users to run SLMs directly on device - partly for privacy, but also because it means we can provide more value at no cost.

What's exciting is how quickly these models are evolving. The latest Qwen 2.5-3B models are achieving benchmark scores that seemed impossible just a year ago. These compact models are now scoring above 65 on the MMLU benchmark – a feat that even the massive LLaMA-65B model couldn't achieve in early 2023. However, in our experience, local SLMs still struggle with long contexts and are prone to repetition. We typically use them for system-2 processing and for features not directly exposed to the user.
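
As an example of what running an SLM locally looks like in practice, here is a minimal sketch using the open-source mlx-lm package; the specific Qwen checkpoint and the summarization prompt are illustrative choices, not necessarily what we ship.

```python
# Minimal sketch: running a small language model on-device with the open-source
# mlx-lm package. We reserve this kind of setup for background tasks that never
# leave the machine; the model checkpoint below is just an illustrative pick.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen2.5-3B-Instruct-4bit")

prompt = "Summarize the following meeting notes in three bullet points:\n<notes>"
summary = generate(model, tokenizer, prompt=prompt, max_tokens=256)
print(summary)
```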

There's a lot of untapped potential in users' Mac devices that AI apps could be using. Modern Macs pack significant computing power that often sits idle, and it's clear Apple is investing heavily in this area, so it will only get better. All in all, this mix cut our language model costs by roughly 50%.

This kind of cost optimization isn't just relevant for apps using local deployment - it's becoming crucial for any AI application at scale. The key is identifying which tasks truly require the full capabilities of a powerful LLM versus those that can be handled by more efficient models. Exa.ai learned this lesson firsthand when their Twitter wrapped feature went viral. Faced with surging costs, they implemented a hybrid approach: routing simpler requests to smaller, more efficient models while reserving their most powerful LLMs for tasks that genuinely required them. This wasn't just clever engineering – it was essential for maintaining sustainable unit economics at scale. Read more here.
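
In pseudocode, that kind of hybrid routing can be as simple as the sketch below. The heuristic, threshold, and stand-in callables are placeholder assumptions for illustration, not Exa.ai's (or our) actual implementation; in practice the small model might be a local SLM via mlx-lm and the large one a cloud API.

```python
# Illustrative sketch of hybrid model routing: simple requests go to a small
# (possibly local) model, and the expensive cloud LLM is reserved for tasks
# that genuinely need it. The heuristic and model hookups are placeholders.
from typing import Callable

def needs_large_model(task: str) -> bool:
    """Crude heuristic: long inputs or multi-step asks go to the big model."""
    multi_step = any(k in task.lower() for k in ("analyze", "plan", "compare", "debug"))
    return len(task) > 2000 or multi_step

def route(task: str,
          small_model: Callable[[str], str],
          large_model: Callable[[str], str]) -> str:
    chosen = large_model if needs_large_model(task) else small_model
    return chosen(task)

# Example wiring with stand-in callables; real code would wrap a local SLM
# and a cloud LLM client here.
reply = route(
    "Summarize this note in one sentence.",
    small_model=lambda t: f"[local SLM] {t}",
    large_model=lambda t: f"[cloud LLM] {t}",
)
print(reply)
```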

Practical AI Architecture

The future of AI applications lies in smart resource allocation rather than brute force approaches. By combining on-device processing for compute-heavy tasks like transcription with a mix of local and cloud-based language models, we've found a path to truly sustainable AI applications. This isn't just about technical architecture – it's about delivering on the promise of "unlimited" features without hidden constraints or unsustainable economics.

Our experience has shown that this hybrid approach cuts costs significantly: 100% reduction in transcription costs through local processing, and roughly 50% reduction in language model costs through smart mixing of cloud and local models. More importantly, it allows us to maintain consistent performance and user experience as we scale, without the typical trade-offs between cost and capability.

Looking ahead, the possibilities are expanding rapidly. Tools like MLX are making it possible to fine-tune and train models efficiently right on users' devices. We're seeing more capable on-device models emerge while their resource requirements stay reasonable. This shift isn't just about technical capabilities – it's about fundamentally rethinking how we architect AI applications.

The key isn't reaching for the largest available model or defaulting to cloud processing – it's about understanding each component's requirements and choosing the right tool for each task. When we combine local processing power with cloud capabilities thoughtfully, we can build AI applications that are not just powerful, but genuinely sustainable at scale. This approach lets us focus on what really matters: delivering consistent value to users without worrying about usage limits or unsustainable unit economics.

If you have read this far, you might be interested in the full post I wrote for our startup.