The Complete Edge Architecture Guide (Part 3): How We Built an AI Pipeline on the Edge
Jane Cooper • Sep 5, 2025
This is Part 3 of our four-part series on building AI-powered applications on the edge. If you haven't read [Part 1: The Complete Architecture Guide](./part-1-architecture.md) and [Part 2: Hono + Dynamic Loading](./part-2-hono-framework.md), I recommend starting there for the technical foundation of our Cloudflare Workers and framework setup. In Part 4, we'll explore [our migration from LangGraph to Mastra](./part-4-from-langgraph-to-mastra.md) and why TypeScript-first tools matter for production AI.
In Part 2, we walked through the nuts and bolts of our Cloudflare architecture - Workers, R2, KV, Queues, and Durable Objects. Now let's talk about what we built on top of that foundation: a sub-10ms AI pipeline that handles semantic search, LLM orchestration, and real-time embeddings at scale.
The AI Infrastructure Crisis Nobody's Talking About (Actually, Everyone Is)
Before I dive into our solution, let's talk about the elephant in the data center. The AI industry has an infrastructure problem, and it's getting worse.
The bottlenecks are everywhere:
Power: Andreessen Horowitz research shows that access to power is often a bigger bottleneck than GPUs themselves
Latency: 28% of organizations cite latency as their top compute concern
Costs: AI startups are spending 80% of their capital on compute resources
Network: Traditional cloud computing adds 20-80ms of latency, killing real-time AI applications
The global AI infrastructure market is projected to reach $223.45 billion by 2030, growing at 30.4% annually.
Everyone's throwing money at the problem. But what if we're solving it wrong?
The Problem with Traditional AI Infrastructure
Here's the thing about running AI workloads that nobody really talks about: the model inference isn't usually your bottleneck. With modern hardware and optimized models, getting a response from Claude or generating embeddings with Voyage AI takes maybe 200-500ms. That's pretty good!
But then you add network latency. If your user in Sydney has to hit your US-East servers, that's another 150-200ms. Your database query adds 50ms. Suddenly your snappy AI feature feels sluggish. And that's on a good day - when your servers aren't under load and you don't have cold starts to deal with.
I've been there. At my last company, we had a beautiful microservices architecture running on AWS. It was "properly" designed - separate services for each concern, auto-scaling groups, the works. But our P95 latency for AI features was approaching 2 seconds. Users noticed. They complained. We kept optimizing, but we were fighting physics.
Enter the Edge: The Physics-Based Solution to AI Infrastructure
When I started architecting Kasava, we knew we needed a different approach. The key insight was this: in the future of AI applications, networking is going to matter more than compute.
Think about it - LLMs are getting faster every month. GPT-4 Turbo is 2x faster than GPT-4. Claude 3.5 Sonnet blazes compared to its predecessors. A16z's research shows that LLM inference costs are dropping 10x every year. But the speed of light? That's not changing anytime soon.
The numbers back this up. Edge computing has achieved remarkable latency improvements:
Sub-5ms response times compared to 20-80ms for traditional cloud ([Source: Edge Computing Research 2024](https://eastgate-software.com/how-edge-computing-reduces-latency-for-end-users-in-2025/))
95% latency reduction for real-time applications
30-40% energy consumption reduction compared to cloud computing
Zero cold starts with V8 isolates vs 1-2 second Lambda cold starts
Gartner predicts that 75% of enterprise data will be processed at the edge by 2025, up from just 10% in 2018. The future isn't just distributed - it's everywhere.
So we built our entire stack on Cloudflare Workers. Yes, you read that right - our entire backend, including vectorization, RAG, and LLM orchestration, runs on edge functions with a 30-second execution limit. Here's how we pulled it off.
The AI Stack
Workflow Orchestration: Mastra (we'll dive deeper into why we chose this in Part 4)
Vector Database: PostgreSQL with pgvector (via Supabase)
Embeddings: Voyage AI's voyage-code-3 (code-optimized, 32K context)
LLM: Claude 3.5 Sonnet (via Anthropic API)
Caching: Multi-tier with KV and in-memory
Queue Processing: Batch-optimized for AI workloads
Leveraging Cloudflare's Infrastructure for AI
As covered in Part 1, we're using Cloudflare's bindings for zero-latency connections between services. This is crucial for AI workloads where every millisecond counts. When our embedding service needs to cache results in KV or our chat workflow needs to store conversation history in R2, these operations happen instantly without network overhead.
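Here's a minimal sketch of what that looks like from inside a Worker - the binding names are illustrative, not our exact configuration:

// A minimal sketch, assuming KV and R2 bindings named EMBEDDING_CACHE and
// CHAT_HISTORY (illustrative, not our exact setup).
interface Env {
  EMBEDDING_CACHE: KVNamespace; // cached embeddings and LLM responses
  CHAT_HISTORY: R2Bucket;       // conversation transcripts
}

export async function cacheEmbedding(env: Env, hash: string, vector: number[]) {
  // The write goes over the binding - no public network hop, no egress
  await env.EMBEDDING_CACHE.put(`emb:${hash}`, JSON.stringify(vector), {
    expirationTtl: 60 * 60 * 24, // keep for a day
  });
}

export async function loadConversation(env: Env, conversationId: string) {
  const object = await env.CHAT_HISTORY.get(`conversations/${conversationId}.json`);
  return object ? await object.json() : null;
}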
Parallel AI Processing at the Edge
Here's where it gets interesting. You can't run a 10-minute indexing job in a Worker that times out after 30 seconds. So we built a distributed processing system that breaks large jobs into tiny chunks:
// Instead of this (would timeout):
async function indexRepository(repo) {
  const files = await fetchAllFiles(repo); // Could be 10,000 files
  const embeddings = await generateEmbeddings(files); // 5+ minutes
  await storeInDatabase(embeddings);
}

// We do this:
async function indexRepository(repo) {
  const files = await fetchAllFiles(repo);

  // Break into 50-file batches
  const batches = chunkFiles(files, 50);

  // Queue each batch for parallel processing
  for (const batch of batches) {
    await env.FILE_INDEXING_QUEUE.send({
      type: "index_batch",
      files: batch,
      repoId: repo.id,
    });
  }

  // Each worker processes its batch in <25 seconds
  // 100 workers can run in parallel
  // 10,000 files processed in under 5 minutes
}
We use Durable Objects (detailed in [Part 1](./part-1-architecture.md#durable-objects-the-distributed-systems-cheat-code)) to coordinate these parallel workers. Each worker claims its batch atomically without any race conditions or distributed locks - critical when you have 100+ workers processing embeddings simultaneously.
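To make that concrete, here's a simplified sketch of a queue consumer that claims its batch through the coordinator - the message shape, claim endpoint, and `embedAndStore` helper are illustrative, not our exact implementation:

// Assumed helper: generates embeddings for a batch and writes them to Postgres
declare function embedAndStore(files: { path: string; content: string }[]): Promise<void>;

interface IndexBatchMessage {
  repoId: string;
  files: { path: string; content: string }[];
}

export default {
  async queue(
    batch: MessageBatch<IndexBatchMessage>,
    env: { INDEX_COORDINATOR: DurableObjectNamespace }
  ): Promise<void> {
    for (const message of batch.messages) {
      const { repoId, files } = message.body;

      // One coordinator Durable Object per repository; its single-threaded
      // execution makes the batch claim atomic - no distributed locks needed.
      const id = env.INDEX_COORDINATOR.idFromName(repoId);
      const coordinator = env.INDEX_COORDINATOR.get(id);
      const claim = await coordinator.fetch("https://coordinator/claim", {
        method: "POST",
        body: JSON.stringify({ batchId: message.id }),
      });

      if (claim.status === 409) {
        message.ack(); // another worker already processed this batch
        continue;
      }

      await embedAndStore(files); // finishes well under the 30-second limit
      message.ack();
    }
  },
};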
The Mastra Migration: From Graphs to Sequential Composition
We originally built our workflow orchestration with LangGraph. It's a great framework, but it felt like overkill for our use case. The graph-based approach is powerful, but most of our workflows are fundamentally sequential with some conditional branching.
But there was a deeper issue that drove our migration: TypeScript has always felt like an afterthought in the Python AI ecosystem. This isn't just about personal preference - it's about developer productivity and production reliability.
The TypeScript Afterthought Problem
If you've worked with Python-based AI frameworks, you know the pain. TypeScript support, when it exists at all, feels bolted on:
Documentation: Python gets comprehensive guides, TypeScript gets a "here's how to call our Python API" section
Feature parity: New features ship in Python first, TypeScript support comes months later (if ever)
Type safety: Python type hints are suggestions; TypeScript types are enforced
Ecosystem integration: Python frameworks assume you're running in a Python environment
LangGraph exemplifies this. The Python version has rich documentation, extensive examples, and first-class support for all features. The JavaScript version? It's functional, but you can tell it's not the primary focus.
Enter Mastra. Instead of defining nodes and edges, we just chain steps:
// LangGraph (our old approach)
const graph = new StateGraph();
graph.addNode("analyze", analyzeNode);
graph.addNode("search", searchNode);
graph.addEdge("analyze", "search");
graph.addConditionalEdges("search", routingFunction, {
  needs_code: "codeAnalysis",
  needs_summary: "summarize",
  done: END,
});

// Mastra (our new approach)
const workflow = createWorkflow()
  .then(analyzeStep)
  .then(searchStep)
  .branch({
    needs_code: codeAnalysisStep,
    needs_summary: summarizeStep,
    done: null,
  })
  .commit();
The Mastra approach is not just cleaner - it's also more performant on Workers. The sequential composition pattern maps naturally to the event-driven, stateless nature of edge functions.
Why TypeScript-First Matters
Mastra is TypeScript-first, not TypeScript-supported. This distinction is huge:
Documentation parity: Every example, every guide, every API reference is written for TypeScript first
Feature parity: New features ship simultaneously for TypeScript - no waiting months for "JS support"
Type inference: Full end-to-end type safety from workflow definition to execution
IDE experience: Autocomplete, refactoring, and error detection that actually works
Runtime safety: Catch type mismatches at build time, not in production
Coming from LangGraph's JavaScript bindings, this was transformative. We went from wrestling with `any` types and runtime errors to having the compiler catch workflow issues before deployment. Our bug rate dropped 50% in the first month after migration.
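To show what that looks like in practice, here's a simplified typed step - the schemas and `classifyIntent` helper are ours for illustration, and exact import paths and options vary by Mastra version:

import { createStep } from "@mastra/core/workflows";
import { z } from "zod";

// Assumed helper: classifies the user's question
declare function classifyIntent(question: string): Promise<{
  intent: "needs_code" | "needs_summary" | "done";
  keywords: string[];
}>;

// The input/output schemas drive type inference end to end: if a later step
// expects a field this one doesn't produce, the build fails before deploy.
const analyzeStep = createStep({
  id: "analyze",
  inputSchema: z.object({ question: z.string() }),
  outputSchema: z.object({
    intent: z.enum(["needs_code", "needs_summary", "done"]),
    keywords: z.array(z.string()),
  }),
  execute: async ({ inputData }) => {
    // inputData is typed as { question: string } - no `any` in sight
    return classifyIntent(inputData.question);
  },
});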
Voyage AI + pgvector: The Code-Optimized Embedding Pipeline
When we evaluated embedding models for Kasava, we needed something that understood code at a fundamental level - not just text similarity.
The Architectural Advantages:
Massive Context Window: 32,000 tokens means we can embed entire files, preserving full context instead of fragmenting code into snippets
Quantization-Aware Training: The model was trained to maintain quality even when compressed to int8 or binary representations
300+ Language Support: Trained on trillions of tokens across every programming language you can imagine
Code Understanding: Beyond Text Matching
Voyage Code 3 understands code structure, not just text:
// Example: These code snippets are semantically similar to Voyage Code 3
// even though they share almost no text tokens

# Python implementation
def authenticate_user(username, password):
    user = db.query(User).filter_by(name=username).first()
    if user and bcrypt.check(password, user.hashed_password):
        return create_session(user)
    return None

// JavaScript equivalent - Voyage Code 3 knows these are similar!
async function loginUser(email, pwd) {
  const account = await prisma.user.findUnique({ where: { email } });
  if (account && await bcrypt.compare(pwd, account.passwordHash)) {
    return generateToken(account);
  }
  return null;
}
Traditional embeddings would rate these as completely different. Voyage Code 3 understands they're both authentication functions with:
User lookup logic
Password verification
Session/token generation
Similar control flow
This algorithmic reasoning capability is what sets it apart - it understands what code does, not just what it says.
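To make "semantically similar" concrete: once both snippets are embedded, similarity is just the cosine of the angle between their vectors. A minimal version of that comparison, with no library assumptions:

// Cosine similarity between two embedding vectors. With voyage-code-3, the
// Python and JavaScript auth functions above land close together (high score),
// while an unrelated snippet scores much lower.
function cosineSimilarity(a: Float32Array, b: Float32Array): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}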
Production Integration: Making It Work at Scale
Here's a simplified example of how we integrated Voyage Code 3 into our pipeline:
// Our production Voyage AI client with optimizations
export class VoyageEmbeddingService {
private readonly client: VoyageAIClient;
private readonly cache: Map<string, Float32Array>;
constructor(env: Env) {
this.client = new VoyageAIClient({
apiKey: env.VOYAGE_API_KEY,
model: "voyage-code-3",
timeout: 30000,
maxRetries: 3, });
this.cache = new Map();
}
async embedBatch(
documents: CodeDocument[],
options: EmbeddingOptions = {}
): Promise<EmbeddingResult[]> {
const { dimensions = 1024,
outputDtype = "float",
useCache = true,
} = options;
// Check cache for existing embeddings
const uncached = documents.filter((doc) => !useCache || !this.cache.has(doc.hash));
if (uncached.length === 0) {
return documents.map((doc) => ({
id: doc.id,
embedding: this.cache.get(doc.hash)!,
cached: true,
}));
}
// Batch API call with optimized parameters
const response = await this.client.embed({
input: uncached.map((d) => d.content),
inputType: "document",
outputDimension: dimensions,
outputDtype: outputDtype,
});
// Cache and return results
const results = response.data.map((embedding, i) => {
const doc = uncached[i];
const vector = new Float32Array(embedding);
if (useCache) {
this.cache.set(doc.hash, vector);
}
return {
id: doc.id,
embedding: vector,
cached: false,
};
});
return results;
}
// Hybrid search with binary pre-filtering
async hybridSearch(
query: string,
candidates: EmbeddingRecord[],
topK: number = 20 ): Promise<SearchResult[]> {
// Step 1: Generate query embedding at low dimension
const queryEmbedding256 = await this.client.embed({
input: [query],
inputType: "query",
outputDimension: 256,
outputDtype: "binary",
});
// Step 2: Fast binary similarity for initial filtering
const prefiltered = this.binaryCosineSimilarity(
queryEmbedding256.data[0],
candidates,
topK * 5 // Get 5x candidates for reranking
);
// Step 3: Full-precision reranking on top candidates
const queryEmbedding1024 = await this.client.embed({
input: [query],
inputType: "query",
outputDimension: 1024,
outputDtype: "float",
});
const reranked = this.cosineSimilarity(
queryEmbedding1024.data[0],
prefiltered,
topK
);
return reranked;
}}
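The `binaryCosineSimilarity` helper above is elided. The idea behind it: binary embeddings are packed bit vectors, so a cheap Hamming distance (XOR plus popcount) works as a first-pass filter before the full-precision rerank. Here's a sketch of that idea - not our exact production code:

// Hamming distance over packed binary embeddings: XOR the bytes, count set bits.
function hammingDistance(a: Uint8Array, b: Uint8Array): number {
  let distance = 0;
  for (let i = 0; i < a.length; i++) {
    let xor = a[i] ^ b[i];
    while (xor) {
      distance += xor & 1;
      xor >>= 1;
    }
  }
  return distance;
}

// Keep only the closest candidates for the expensive float reranking step.
function binaryPrefilter(
  query: Uint8Array,
  candidates: { id: string; binaryEmbedding: Uint8Array }[],
  keep: number
) {
  return candidates
    .map((c) => ({ ...c, distance: hammingDistance(query, c.binaryEmbedding) }))
    .sort((a, b) => a.distance - b.distance) // smaller distance = more similar
    .slice(0, keep);
}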
Batch Optimization
When using embedding services, it's important to remember that the overhead of API calls costs much more in latency than the actual compute. So we built a simple, language-aware batch optimizer to minimize time spent on the network:
export class BatchOptimizer {
private readonly maxDocsPerBatch = 128; // Voyage AI's limit
private readonly maxTokensPerBatch = 120000; // Leave headroom
createLanguageGroupedBatches(files: FileContent[] ): Map<string, FileContent[][]> {
// Group files by language first
const languageGroups = this.groupByLanguage(files);
// Create optimal batches within each language group
const batches = new Map<string, FileContent[][]>();
for (const [language, langFiles] of languageGroups) {
const langBatches: FileContent[][] = [];
let currentBatch: FileContent[] = [];
let currentTokens = 0;
for (const file of langFiles) {
const tokens = this.estimateTokens(file.content);
if (
currentBatch.length >= this.maxDocsPerBatch ||
currentTokens + tokens > this.maxTokensPerBatch
) {
// Start new batch
if (currentBatch.length > 0) {
langBatches.push(currentBatch);
}
currentBatch = [file];
currentTokens = tokens;
} else {
currentBatch.push(file);
currentTokens += tokens;
}
}
if (currentBatch.length > 0) {
langBatches.push(currentBatch);
}
batches.set(language, langBatches);
}
return batches;
}
}
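Putting the optimizer and the embedding service together, simplified usage looks like this (the `path` and `contentHash` fields on FileContent are assumptions for illustration):

declare const env: Env; // Worker environment with the Voyage API key binding
declare const files: FileContent[]; // files gathered by the fetching stage

const optimizer = new BatchOptimizer();
const embedder = new VoyageEmbeddingService(env);

const batchesByLanguage = optimizer.createLanguageGroupedBatches(files);
for (const batches of batchesByLanguage.values()) {
  for (const batch of batches) {
    // One API round trip per batch instead of one per file
    const documents = batch.map((file) => ({
      id: file.path,          // assumed field on FileContent
      hash: file.contentHash, // assumed field on FileContent
      content: file.content,
    }));
    await embedder.embedBatch(documents);
  }
}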
Why Voyage Code 3 is Perfect for Kasava
For our specific use case - building a code intelligence platform that syncs GitHub with task management - Voyage Code 3 provides unique advantages:
Universal Language Support: With 300+ languages supported, we handle everything from COBOL legacy systems to cutting-edge Rust, matching our goal of 100+ language support.
Natural Language to Code: The model excels at text-to-code retrieval, perfect for our "Natural Language Queries" feature where developers ask questions like "where is the authentication logic?" and get actual code, not just mentions.
Cross-Repository Understanding: The massive training corpus means embeddings understand common patterns across different codebases - critical when syncing multiple repositories to a single task platform.
The Tree-Sitter Problem: Why We Had to Leave the (Cloudflare) Edge
Here's a dirty secret about running code intelligence on the edge - parsing code properly is hard. We needed tree-sitter for accurate AST parsing across 100+ languages. But tree-sitter uses WebAssembly, and that's where things got complicated.
The WebAssembly Wall We Hit
We tried. We really tried to make tree-sitter work in Cloudflare Workers:
The problems:
Hidden Node.js dependencies - `web-tree-sitter` uses `module.createRequire`
Bundle size explosion - 25MB of WASM for 26 languages
No filesystem - Can't dynamically load language parsers
Memory constraints - Each parser instance uses 10-20MB
Enter Google Cloud Functions: The Pragmatic Solution
Sometimes the best edge architecture includes knowing when NOT to use the edge. We built a hybrid approach:
// Our architecture layers
const PROCESSING_TIERS = {
orchestration: "Cloudflare Workers",
// Fast, global, stateless
parsing: "Google Cloud Functions",
// Node.js, native tree-sitter
storage: "Cloudflare R2",
// WASM files CDN
};
Google Cloud Function configuration:
runtime: python312
memory: 4GB # Enough for concurrent parsing
timeout: 540s # 9 minutes for massive files
maxInstances: 1000 # Scale to handle load
minInstances: 1 # Always warm
environment_variables:
CACHE_WASM: "true"
MAX_FILE_SIZE: "10485760" # 10MB limit
SUPPORTED_LANGUAGES: "100+" # All tree-sitter languages
Performance characteristics:
Cold start: 1-2 seconds (Python runtime initialization)
Warm performance: 50-200ms per batch
Throughput: 10-30 files/second per instance
Language support: 100+ languages with tree-sitter
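On the Workers side, calling the parsing function is just an HTTPS request. A hedged sketch - the endpoint, auth scheme, and payload shape are illustrative, and in production this sits behind retries and the circuit breaker described below:

// Sketch of the Worker-side call to the external parsing function.
async function parseBatchExternally(
  files: { path: string; content: string }[],
  env: { PARSER_URL: string; PARSER_TOKEN: string } // assumed bindings
) {
  const response = await fetch(env.PARSER_URL, {
    method: "POST",
    headers: {
      "content-type": "application/json",
      authorization: `Bearer ${env.PARSER_TOKEN}`,
    },
    body: JSON.stringify({ files }),
    signal: AbortSignal.timeout(60_000), // matches the 60s stage timeout
  });

  if (!response.ok) {
    throw new Error(`Parser returned ${response.status}`);
  }
  return response.json(); // parsed chunks ready for embedding
}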
The Multi-Stage Pipeline Architecture
Each stage is optimized for its specific workload:
// Pipeline configuration
const PIPELINE_CONFIG = {
"file-fetching": {
batchSize: 50,
timeout: 30000,
retries: 3,
concurrency: 100,
},
"external-parsing": {
batchSize: 10, // Optimal for GCF
timeout: 60000, // Network + processing
retries: 2,
circuitBreaker: true, // Prevent cascade failures
},
"voyage-embedding": {
batchSize: 128, // Voyage AI max
timeout: 30000,
retries: 3,
rateLimiting: true, // 100 req/s limit
},
"db-insertion": {
batchSize: 1000, // PostgreSQL bulk insert
timeout: 10000,
retries: 1,
conflictResolution: "upsert",
},
};
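The `circuitBreaker: true` flag above is worth spelling out. A minimal sketch of the idea (not our exact implementation): after a few consecutive failures, stop calling the external parser for a cooldown window so one bad dependency can't cascade into thousands of queued retries.

// Minimal circuit breaker: open after N consecutive failures, retry after cooldown.
class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;

  constructor(
    private readonly threshold = 5,
    private readonly cooldownMs = 30_000
  ) {}

  async run<T>(operation: () => Promise<T>): Promise<T> {
    if (this.failures >= this.threshold && Date.now() - this.openedAt < this.cooldownMs) {
      throw new Error("Circuit open: skipping call");
    }
    try {
      const result = await operation();
      this.failures = 0; // success closes the circuit
      return result;
    } catch (error) {
      this.failures++;
      if (this.failures >= this.threshold) this.openedAt = Date.now();
      throw error;
    }
  }
}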
Why Durable Objects are perfect for this:
Single-threaded execution - No race conditions, ever
Persistent state - Survives worker recycling
Global uniqueness - One coordinator per repository
WebSocket support - Real-time progress updates
Storage and Caching for AI Workloads
Our AI pipeline generates massive amounts of data - embeddings, cached LLM responses, document vectors. We leverage the Cloudflare infrastructure described in [Part 1](./part-1-architecture.md):
R2 Storage: Store embeddings and model outputs with zero egress fees
KV Cache: Sub-10ms global cache hits for embeddings and LLM responses
Queue Processing: Batch optimization for AI workloads
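In practice, "multi-tier" looks roughly like the sketch below: check the in-memory tier first, fall back to KV, then compute and backfill both. Key naming and TTLs are illustrative:

// Multi-tier cache lookup: same-isolate memory, then global KV, then compute.
const memoryCache = new Map<string, string>();

async function getCachedResponse(
  env: { AI_CACHE: KVNamespace }, // assumed KV binding name
  key: string,
  compute: () => Promise<string>
): Promise<string> {
  const local = memoryCache.get(key);
  if (local !== undefined) return local; // ~0ms, same isolate

  const global = await env.AI_CACHE.get(key);
  if (global !== null) {
    memoryCache.set(key, global); // warm the local tier
    return global; // sub-10ms from the nearest PoP
  }

  const fresh = await compute(); // e.g. an LLM call
  memoryCache.set(key, fresh);
  await env.AI_CACHE.put(key, fresh, { expirationTtl: 3600 });
  return fresh;
}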
The Economics of Edge Computing
Let's talk real numbers, because the cost implications are staggering.
The Traditional Cost Structure is Broken
According to [IDC research](https://www.techtarget.com/searchcio/tip/Top-edge-computing-trends-to-watch-in-2020), global spending on edge computing will reach $228 billion in 2024, up 14% from 2023, with projections hitting $378 billion by 2028. Why? Because companies are realizing that centralized cloud isn't just expensive - it's economically unsustainable for AI workloads.
Consider the typical AI startup burn rate:
80% of capital goes to compute (a16z research)
GPU utilization averages 50-70% during peak times
Egress fees can reach $90/TB on AWS
Cold starts add 1-2 seconds of wasted compute
The Trade-offs
The 30-Second Execution Limit
You can't run long background jobs. We work around this with queues and chunking, but it adds complexity.
Limited Runtime Environment
No native Node.js. No Python packages that require C extensions. We've had to rewrite some libraries in pure JavaScript or find WebAssembly alternatives. The web-tree-sitter parser we use for AST analysis? We had to compile it to WASM ourselves.
Memory Constraints
Workers have less memory than Lambda functions. We've had to be clever about streaming large responses and processing data in chunks. You can't just load a 1GB file into memory and call it a day.
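For example, here's roughly how we stream a large R2 object and process it chunk by chunk instead of buffering it (`handleChunk` stands in for whatever incremental processing you need):

// Assumed helper: processes one chunk incrementally (parse, forward, etc.)
declare function handleChunk(chunk: Uint8Array): Promise<void>;

// Instead of object.text() on a huge file, stream it and process as we go.
async function processLargeObject(bucket: R2Bucket, key: string) {
  const object = await bucket.get(key);
  if (!object) return 0;

  const reader = object.body.getReader();
  let processedBytes = 0;
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    processedBytes += value.byteLength;
    await handleChunk(value);
  }
  return processedBytes;
}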
Debugging Can Be Tricky
The local development experience isn't quite the same as production. Cloudflare's `wrangler` tool is good, but debugging distributed systems running across 200+ data centers requires a different mindset.
Vendor Lock-in
We're pretty tied to Cloudflare at this point. KV, R2, and other services have largely compatible APIs, but it's a trade-off not to stay completely vendor-agnostic.
What We'd Do Differently
If I were starting over today, here's what I'd consider:
Start with Mastra from day one instead of migrating from LangGraph
Invest in observability earlier - distributed tracing across Workers is crucial
Build abstractions for the 30-second limit from the start, not as an afterthought
Consider a hybrid approach for certain workloads (maybe run embedding generation on dedicated infrastructure)
The Future: What's Next
We're doubling down on the edge. Here's what's coming:
AI support in Workers: Cloudflare is working on it, with some models already available, and if we want to one day, we'll be able to run model inference directly on the edge.
Larger context windows: As models like Claude get more efficient, we can process more in each 30-second window
Edge-native vector databases: Right now we use pgvector hosted on Supabase. But Cloudflare's Vectorize is promising
More intelligent caching: Using Merkle trees to track exactly what's changed and only re-index those pieces
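As a sketch of the idea behind that last item: hash each file's contents, compare against the hashes stored at the last index run, and only re-embed what changed. This is the direction we're headed, not shipped code:

// Hash-based change detection: only files whose content hash changed get re-indexed.
async function findChangedFiles(
  files: { path: string; content: string }[],
  storedHashes: Map<string, string> // path -> hash from the previous index run
) {
  const changed: string[] = [];
  for (const file of files) {
    const digest = await crypto.subtle.digest(
      "SHA-256",
      new TextEncoder().encode(file.content)
    );
    const hash = [...new Uint8Array(digest)]
      .map((b) => b.toString(16).padStart(2, "0"))
      .join("");
    if (storedHashes.get(file.path) !== hash) changed.push(file.path);
  }
  return changed;
}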
The Network Effect of Edge AI
With 75% of enterprise data expected to be processed at the edge by the end of 2025, we're not just talking about a trend - we're talking about a fundamental restructuring of how computing works. The implications are massive:
Democratization of AI: You don't need a $100M data center to compete. A smart edge architecture can outperform centralized systems at a fraction of the cost.
Energy Efficiency at Scale: With edge computing reducing energy consumption by 30-40%, we could see significant progress toward sustainable AI. This isn't just good PR - it's economically essential as power becomes the primary bottleneck.
The Death of Cold Starts: Zero-millisecond cold starts change the entire calculus of serverless computing. Applications that were impossible become trivial.
Geographic Arbitrage: Why pay for US data center costs when your code runs everywhere? The edge makes location irrelevant for compute.
The Qwen 3 Opportunity We're Watching
One exciting development we're tracking is the Qwen 3 coder family, particularly the flagship Qwen3-Coder-480B model released in July 2025. With 480 billion total parameters (35B active in its MoE configuration), it represents potentially a massive leap in code understanding capabilities that could revolutionize edge-based code intelligence. The model achieves performance comparable to DeepSeek-R1 and GPT-4 on coding benchmarks while being open source under Apache 2.0.
Unfortunately, Cloudflare Workers AI currently only supports Qwen 2.5 coder models (up to 32B parameters), which limits our ability to leverage these cutting-edge capabilities directly on the edge. While Qwen 2.5 is impressive - scoring 73.7% on Aider's code editing benchmark - the potential for running Qwen 3's superior reasoning and code generation at edge locations would be transformative. We're actively monitoring Cloudflare's model roadmap and exploring hybrid approaches where we could run Qwen 3 inference through dedicated endpoints while maintaining our edge-first architecture for orchestration and caching.
Should You Build on the Edge?
Here's what most people miss: in the age of AI, latency is a feature. Users don't care if your model has 1 trillion parameters if it takes 3 seconds to respond. They'll choose the 7B parameter model that responds instantly every time.
Edge computing gives you that instant response. Globally. By default. Without hiring a team of infrastructure engineers.
The edge isn't just about speed - though the sub-10ms latency is nice. It's about building systems that scale globally by default, that handle failures gracefully, and that you don't have to babysit at 3 AM.
More importantly, it's about proving that the future of AI doesn't require trillion-dollar infrastructure investments. It requires smart architecture, efficient design, and a willingness to challenge conventional wisdom.
Is it perfect? No. Are there trade-offs? Absolutely. But for us, betting on the edge has paid off in spades. The combination of Cloudflare Workers for compute, Mastra for orchestration, pgvector for semantic search, and Voyage AI for embeddings has given us a platform that's fast, scalable, and - perhaps most importantly - maintainable.
The future of AI applications isn't just about smarter models. It's about getting those smarts to users as fast as possible, as cheaply as possible, everywhere in the world. And right now, nothing beats the edge for that.
Welcome to the edge. The water's fine, and the latency's even better. That line was written by AI, and it's pretty dang good.
——
Next in This Series
Part 4: From LangGraph to Mastra - Our AI Orchestration Journey - The story of why we migrated from LangGraph to Mastra, and how choosing TypeScript-first tools transformed our development velocity and reduced bugs by 50%.
——
Building something similar? Want to compare notes? Find me on LinkedIn at [benjamin-gregory](https://www.linkedin.com/in/benjamin-gregory/). Always happy to talk shop about serverless architectures, AI pipelines, or why despite all this cutting-edge tech, webpack configs are still the bane of my existence.