By Saatvik Arya · 3 min read

Announcing Avaan: The Real-Time Transcription Software I Needed to Exist

I built Avaan because transcription was killing my workflow. It combines FluidAudio CoreML models for local real-time processing with cloud streaming to deliver results in under a second with live preview.

Tags: product, avaan, engineering

I was deep in a planning session, trying to document a complex feature implementation. The ideas were flowing fast, but the transcription software I was using felt like watching paint dry. Every time I recorded something, I'd wait 5+ seconds to see what I'd actually said. The conversation had moved on by the time I got my transcription back.

That lag kills momentum. When you're in the zone, waiting for batch processing feels archaic. You lose your train of thought. The mental model dissolves. And don't get me started on existing transcription software that makes you choose between privacy (local offline models) and accuracy (cloud processing with inevitable latency).

I hit my breaking point not because I wanted to build another transcription app, but because I needed something that didn't exist.

The Epiphany Moment with Real-Time Audio

Here's what changed everything: while experimenting with different audio processing approaches, I integrated FluidAudio's CoreML models — specifically their StreamingEouAsrManager with Parakeet TDT v3 running on Apple Neural Engine. The results were immediate and compelling. Text started appearing almost instantly as I spoke, with end-of-utterance detection so precise it felt psychic.

Apple's Neural Engine isn't just marketing fluff: when you run audio models locally, you get strong performance with low power consumption. Parakeet TDT v3 processes audio far faster than real time (a 190x speed factor), giving me a glimpse of what transcription could be: immediate, fluid, conversational.
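
To put the 190x speed factor in perspective, here is a small worked calculation. It assumes the factor means "seconds of audio processed per second of compute" and that it holds for short chunks as well as long recordings. The function name is illustrative only and is not part of FluidAudio.

```typescript
// Speed factor: seconds of audio processed per second of compute.
// 190 is the figure quoted above; processingTimeMs is a hypothetical helper.
function processingTimeMs(audioSeconds: number, speedFactor: number): number {
  if (speedFactor <= 0) {
    throw new RangeError("speedFactor must be positive");
  }
  return (audioSeconds / speedFactor) * 1000;
}

// One minute of audio: about 316 ms of model compute.
const oneMinute = processingTimeMs(60, 190);

// A two-second chunk: about 10.5 ms of model compute.
const twoSecondChunk = processingTimeMs(2, 190);
```

This counts model compute only. Audio capture, buffering, and UI updates add their own delay on top.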

This was the peek into the future I wanted. But local models, while fast, sometimes trade accuracy for speed. I needed cloud-level accuracy with local-level responsiveness. The traditional approach would say "pick one," but I needed both.

The Technical Architecture

To deliver streaming transcription that feels local but provides cloud accuracy, I built a dual-streaming architecture:

Local Streaming with FluidAudio: For offline functionality and maximum privacy, I integrated FluidAudio's CoreML models. The Parakeet TDT v3 model runs natively on the Apple Neural Engine, giving offline processing at a 190x speed factor. Your audio never leaves your device, and you get instant feedback.

Cloud Streaming with Cloudflare Durable Objects: For when you need state-of-the-art accuracy, I built a streaming proxy using Cloudflare Workers. Each user gets their own Durable Object session with a dedicated WebSocket connection. These persistent sessions maintain state across hibernation, so a stream can pick up where it left off after a brief network interruption.
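
The per-user session idea can be sketched without any Cloudflare code. The model below is a hypothetical in-memory stand-in, not Avaan's implementation and not the Durable Objects API: a Map plays the role of the Durable Object namespace, and the class names, the sequence numbers, and the resumeFrom method are all assumptions made for illustration.

```typescript
// Hypothetical model of "one session per user". In a real Worker the Map
// would be a Durable Object namespace; nothing here is Cloudflare API.
interface TranscriptEvent {
  seq: number; // increases by one per event within a session
  text: string;
}

class TranscriptSession {
  private events: TranscriptEvent[] = [];
  private nextSeq = 1;

  // Record a transcript fragment, whether or not the client is connected.
  publish(text: string): TranscriptEvent {
    const event: TranscriptEvent = { seq: this.nextSeq++, text };
    this.events.push(event);
    return event;
  }

  // On reconnect the client reports the last seq it saw and gets the rest.
  resumeFrom(lastSeenSeq: number): TranscriptEvent[] {
    return this.events.filter((e) => e.seq > lastSeenSeq);
  }
}

class SessionRegistry {
  private sessions = new Map<string, TranscriptSession>();

  // The same user id always resolves to the same session object.
  forUser(userId: string): TranscriptSession {
    let session = this.sessions.get(userId);
    if (!session) {
      session = new TranscriptSession();
      this.sessions.set(userId, session);
    }
    return session;
  }
}
```

Because the session holds the state rather than the connection, a client that drops after event 1 can reconnect, call resumeFrom(1), and receive only what it missed.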

The key insight: Both approaches deliver the same streaming experience. Whether you're using local models or cloud processing, you see text appear as you speak.

What Makes Streaming Different From Real-Time

Here's where most people get confused: "real-time" transcription usually means "faster batch processing." True streaming means you see partial transcription as you speak, not after you're done speaking.

Traditional approach: Record → Upload → Wait → Download → Read

Streaming approach: Speak → See text appear → Continue speaking → See corrections

I implemented a live preview window that updates continuously. You watch your words appear and spot mistakes right away. At roughly 300-500 ms of latency, you're editing in real time instead of reviewing later. Combined with end-of-utterance detection, you get conversational feedback loops instead of archival processing.
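
One way to picture the "see text appear, then see corrections" loop is a preview buffer that separates committed text from the hypothesis still in progress. This is a minimal sketch under assumptions: the partial/final update shape and the LivePreview class are hypothetical and do not describe Avaan's internals.

```typescript
// Hypothetical update shape: a "partial" replaces the in-progress hypothesis,
// a "final" commits it once the end of the utterance has been detected.
type TranscriptUpdate =
  | { kind: "partial"; text: string }
  | { kind: "final"; text: string };

class LivePreview {
  private committed: string[] = [];
  private pending = "";

  apply(update: TranscriptUpdate): void {
    if (update.kind === "partial") {
      // Later partials overwrite earlier ones; that is how corrections appear.
      this.pending = update.text;
    } else {
      this.committed.push(update.text);
      this.pending = "";
    }
  }

  // Text to display: committed utterances followed by the current hypothesis.
  render(): string {
    return this.committed
      .concat(this.pending)
      .filter((s) => s.length > 0)
      .join(" ");
  }
}
```

Applying a partial "I scream", then a partial "ice cream is", then a final "Ice cream is great." leaves the preview showing only the final sentence. The earlier guesses never pile up on screen.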

Built for Getting Work Done

I didn't build Avaan for demos. I built it because I needed transcription that fit my workflow instead of interrupting it.

Menu Bar Architecture: Avaan lives in your menu bar (no dock icon), accessible via Cmd+Shift+Space from anywhere. The floating recording window stays out of your way but close enough to monitor.

Context-Aware Transcription: Six modes optimized for different use cases: Auto, Notes, Email, Chat, Code, and Off. Each one formats output differently based on what you're working on.
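
As a rough illustration of mode-based formatting, the sketch below maps each mode to a formatter function. Only the six mode names come from the text above. Every formatting rule, the hint parameter used to resolve Auto, and the function names are invented placeholders, since the post does not say what each mode actually does.

```typescript
// The mode names are from the post; every rule below is an invented
// placeholder and NOT Avaan's actual behavior.
type Mode = "auto" | "notes" | "email" | "chat" | "code" | "off";
type ConcreteMode = Exclude<Mode, "auto">;

const formatters: Record<ConcreteMode, (text: string) => string> = {
  notes: (t) => "- " + t, // bullet item
  email: (t) =>
    t.charAt(0).toUpperCase() + t.slice(1) + (/[.!?]$/.test(t) ? "" : "."),
  chat: (t) => t.replace(/\.$/, ""), // drop a trailing period
  code: (t) => "// " + t, // comment line
  off: (t) => t, // pass through untouched
};

// "auto" needs some signal about the active app; here it is a plain hint
// supplied by the caller (hypothetical).
function formatTranscript(
  text: string,
  mode: Mode,
  hint: ConcreteMode = "off"
): string {
  const resolved: ConcreteMode = mode === "auto" ? hint : mode;
  return formatters[resolved](text);
}
```

Keeping the formatters in one lookup table means adding a mode is a one-line change, and Off falls out naturally as the identity function.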

Privacy by Design: With local models, your audio never leaves your device. With cloud streaming, audio is encrypted in transit and no data is stored permanently: it's processed, delivered back to you in real time, then discarded.

Available Now

I built Avaan because I needed real-time transcription that didn't interrupt my flow. The streaming preview window changes how you think about audio documentation: from "record and review later" to "see and adjust as you speak." Download at avaan.app.

I've used Avaan daily for feature planning, meeting notes, daily standups, and technical documentation. The live preview window means I catch transcription errors immediately rather than discovering them days later.

Most importantly — no more waiting for your thoughts to appear on screen.

Originally published on saatvikarya.com
