The Complete Edge Architecture Guide (Part 3): How We Built an AI Pipeline on the Edge

Jane Cooper

Sep 5, 2025

This is Part 3 of our four-part series on building AI-powered applications on the edge. If you haven't read [Part 1: The Complete Architecture Guide](./part-1-architecture.md) and [Part 2: Hono + Dynamic Loading](./part-2-hono-framework.md), I recommend starting there for the technical foundation of our Cloudflare Workers and framework setup. In Part 4, we'll explore [our migration from LangGraph to Mastra](./part-4-from-langgraph-to-mastra.md) and why TypeScript-first tools matter for production AI.

In Part 2, we walked through the nuts and bolts of our Cloudflare architecture - Workers, R2, KV, Queues, and Durable Objects. Now let's talk about what we built on top of that foundation: a sub-10ms AI pipeline that handles semantic search, LLM orchestration, and real-time embeddings at scale.

The AI Infrastructure Crisis Nobody's Talking About (Actually, Everyone Is)

Before I dive into our solution, let's talk about the elephant in the data center. The AI industry has an infrastructure problem, and it's getting worse.

The bottlenecks are everywhere:

  • Power: Andreessen Horowitz research shows that access to power is often a bigger bottleneck than GPUs themselves

  • Latency: 28% of organizations cite latency as their top compute concern

  • Costs: AI startups are spending 80% of their capital on compute resources

  • Network: Traditional cloud computing adds 20-80ms of latency, killing real-time AI applications

The global AI infrastructure market is projected to reach $223.45 billion by 2030, growing at 30.4% annually.

Everyone's throwing money at the problem. But what if we're solving it wrong?

The Problem with Traditional AI Infrastructure

Here's the thing about running AI workloads that nobody really talks about: the model inference isn't usually your bottleneck. With modern hardware and optimized models, getting a response from Claude or generating embeddings with Voyage AI takes maybe 200-500ms. That's pretty good!

But then you add network latency. If your user in Sydney has to hit your US-East servers, that's another 150-200ms. Your database query adds 50ms. Suddenly your snappy AI feature feels sluggish. And that's on a good day - when your servers aren't under load and you don't have cold starts to deal with.

I've been there. At my last company, we had a beautiful microservices architecture running on AWS. It was "properly" designed - separate services for each concern, auto-scaling groups, the works. But our P95 latency for AI features was approaching 2 seconds. Users noticed. They complained. We kept optimizing, but we were fighting physics.

Enter the Edge: The Physics-Based Solution to AI Infrastructure

When I started architecting Kasava, we knew we needed a different approach. The key insight was this: in the future of AI applications, networking is going to matter more than compute.

Think about it - LLMs are getting faster every month. GPT-4 Turbo is 2x faster than GPT-4. Claude 3.5 Sonnet blazes compared to its predecessors. A16z's research shows that LLM inference costs are dropping 10x every year. But the speed of light? That's not changing anytime soon.

The numbers back this up. Edge computing has achieved remarkable latency improvements:

  • Sub-5ms response times compared to 20-80ms for traditional cloud ([Source: Edge Computing Research 2024](https://eastgate-software.com/how-edge-computing-reduces-latency-for-end-users-in-2025/))

  • 95% latency reduction for real-time applications

  • 30-40% energy consumption reduction compared to cloud computing

  • Zero cold starts with V8 isolates vs 1-2 second Lambda cold starts

Gartner predicts that 75% of enterprise data will be processed at the edge by 2025, up from just 10% in 2018. The future isn't just distributed - it's everywhere.

So we built our entire stack on Cloudflare Workers. Yes, you read that right - our entire backend, including vectorization, RAG, and LLM orchestration, runs on edge functions with a 30-second execution limit. Here's how we pulled it off.

The AI Stack

  • Workflow Orchestration: Mastra (we'll dive deep into why we chose this in Part 4)

  • Vector Database: PostgreSQL with pgvector (via Supabase)

  • Embeddings: Voyage AI's voyage-code-3 (code-optimized, 32K context)

  • LLM: Claude 3.5 Sonnet (via Anthropic API)

  • Caching: Multi-tier with KV and in-memory

  • Queue Processing: Batch-optimized for AI workloads

Leveraging Cloudflare's Infrastructure for AI

As covered in Part 1, we're using Cloudflare's bindings for zero-latency connections between services. This is crucial for AI workloads where every millisecond counts. When our embedding service needs to cache results in KV or our chat workflow needs to store conversation history in R2, these operations happen instantly without network overhead.
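
To make that concrete, here's a minimal sketch of how a Worker might cache embedding lookups in KV through a binding. The `EMBEDDING_CACHE` binding and the `getEmbedding()` helper are illustrative names, not our actual production code.

// Minimal sketch: a KV-backed embedding cache inside a Worker.
// The EMBEDDING_CACHE binding and getEmbedding() helper are illustrative names.
interface Env {
  EMBEDDING_CACHE: KVNamespace;
}

declare function getEmbedding(text: string): Promise<number[]>; // external embedding API call

async function cachedEmbedding(env: Env, contentHash: string, text: string): Promise<number[]> {
  // KV reads through a binding are served from Cloudflare's edge - no public network hop
  const hit = await env.EMBEDDING_CACHE.get(contentHash, "json");
  if (hit) return hit as number[];

  const vector = await getEmbedding(text);
  // Writes are eventually consistent across the edge; cache for a week
  await env.EMBEDDING_CACHE.put(contentHash, JSON.stringify(vector), {
    expirationTtl: 60 * 60 * 24 * 7,
  });
  return vector;
}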

Parallel AI Processing at the Edge

Here's where it gets interesting. You can't run a 10-minute indexing job in a Worker that times out after 30 seconds. So we built a distributed processing system that breaks large jobs into tiny chunks:

// Instead of this (would timeout):
async function indexRepository(repo) {
  const files = await fetchAllFiles(repo); // Could be 10,000 files
  const embeddings = await generateEmbeddings(files); // 5+ minutes
  await storeInDatabase(embeddings);
}

// We do this:
async function indexRepository(repo) {
  // Break into 50-file batches
  const files = await fetchAllFiles(repo);
  const batches = chunkFiles(files, 50);

  // Queue each batch for parallel processing
  for (const batch of batches) {
    await env.FILE_INDEXING_QUEUE.send({
      type: "index_batch",
      files: batch,
      repoId: repo.id,
    });
  }

  // Each worker processes its batch in <25 seconds
  // 100 workers can run in parallel
  // 10,000 files processed in under 5 minutes
}

We use Durable Objects (detailed in [Part 1](./part-1-architecture.md#durable-objects-the-distributed-systems-cheat-code)) to coordinate these parallel workers. Each worker claims its batch atomically without any race conditions or distributed locks - critical when you have 100+ workers processing embeddings simultaneously.
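
Here's a simplified sketch of what that coordination pattern can look like. The class name, routes, and storage keys are hypothetical, not our production coordinator.

// Simplified sketch of an indexing coordinator as a Durable Object.
// Single-threaded execution means two workers can never claim the same batch.
// Names (IndexingCoordinator, /claim, /complete) are hypothetical.
export class IndexingCoordinator {
  constructor(private state: DurableObjectState) {}

  async fetch(request: Request): Promise<Response> {
    const url = new URL(request.url);

    if (url.pathname === "/claim") {
      // Atomically pop the next unclaimed batch id from persistent storage
      const pending = (await this.state.storage.get<string[]>("pending")) ?? [];
      const batchId = pending.shift();
      await this.state.storage.put("pending", pending);
      return Response.json({ batchId: batchId ?? null });
    }

    if (url.pathname === "/complete") {
      const { batchId } = await request.json<{ batchId: string }>();
      const done = (await this.state.storage.get<number>("done")) ?? 0;
      await this.state.storage.put("done", done + 1);
      return Response.json({ completed: batchId, done: done + 1 });
    }

    return new Response("Not found", { status: 404 });
  }
}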

The Mastra Migration: From Graphs to Sequential Composition

We originally built our workflow orchestration with LangGraph. It's a great framework, but it felt like overkill for our use case. The graph-based approach is powerful, but most of our workflows are fundamentally sequential with some conditional branching.

But there was a deeper issue that drove our migration: TypeScript has always felt like an afterthought in the Python AI ecosystem. This isn't just about personal preference - it's about developer productivity and production reliability.

The TypeScript Afterthought Problem

If you've worked with Python-based AI frameworks, you know the pain. TypeScript support, when it exists at all, feels bolted on:

  • Documentation: Python gets comprehensive guides, TypeScript gets a "here's how to call our Python API" section

  • Feature parity: New features ship in Python first, TypeScript support comes months later (if ever)

  • Type safety: Python type hints are suggestions; TypeScript types are enforced

  • Ecosystem integration: Python frameworks assume you're running in a Python environment

LangGraph exemplifies this. The Python version has rich documentation, extensive examples, and first-class support for all features. The JavaScript version? It's functional, but you can tell it's not the primary focus.

Enter Mastra. Instead of defining nodes and edges, we just chain steps:

// LangGraph (our old approach)
const graph = new StateGraph();
graph.addNode("analyze", analyzeNode);
graph.addNode("search", searchNode);
graph.addEdge("analyze", "search");
graph.addConditionalEdge("search", routingFunction, {
  needs_code: "codeAnalysis",
  needs_summary: "summarize",
  done: END,
});

// Mastra (our new approach)
const workflow = createWorkflow()
  .then(analyzeStep)
  .then(searchStep)
  .branch({
    needs_code: codeAnalysisStep,
    needs_summary: summarizeStep,
    done: null,
  })
  .commit();


The Mastra approach is not just cleaner - it's also more performant on Workers. The sequential composition pattern maps naturally to the event-driven, stateless nature of edge functions.

Why TypeScript-First Matters

Mastra is TypeScript-first, not TypeScript-supported. This distinction is huge:

  • Documentation parity: Every example, every guide, every API reference is written for TypeScript first

  • Feature parity: New features ship simultaneously for TypeScript - no waiting months for "JS support"

  • Type inference: Full end-to-end type safety from workflow definition to execution

  • IDE experience: Autocomplete, refactoring, and error detection that actually works

  • Runtime safety: Catch type mismatches at build time, not in production

Coming from LangGraph's JavaScript bindings, this was transformative. We went from wrestling with `any` types and runtime errors to having the compiler catch workflow issues before deployment. Our bug rate dropped 50% in the first month after migration.
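
To illustrate what end-to-end typing buys you, here's a generic sketch of a schema-validated step using zod. This shows the general pattern rather than Mastra's exact API - the step and schema names are illustrative.

import { z } from "zod";

// Generic sketch of a typed, schema-validated step - the pattern, not Mastra's exact API.
const analyzeInput = z.object({ question: z.string() });
const analyzeOutput = z.object({
  intent: z.enum(["needs_code", "needs_summary", "done"]),
});

type AnalyzeInput = z.infer<typeof analyzeInput>;
type AnalyzeOutput = z.infer<typeof analyzeOutput>;

async function analyzeStep(raw: unknown): Promise<AnalyzeOutput> {
  // Validation fails loudly at the boundary instead of letting bad data flow downstream
  const input: AnalyzeInput = analyzeInput.parse(raw);
  const intent = input.question.includes("code") ? "needs_code" : "needs_summary";
  return analyzeOutput.parse({ intent });
}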

Voyage AI + pgvector: The Code-Optimized Embedding Pipeline

When we evaluated embedding models for Kasava, we needed something that understood code at a fundamental level - not just text similarity.

The Architectural Advantages:

  1. Massive Context Window: 32,000 tokens means we can embed entire files, preserving full context instead of fragmenting code into snippets

  2. Quantization-Aware Training: The model was trained to maintain quality even when compressed to int8 or binary representations

  3. 300+ Language Support: Trained on trillions of tokens across every programming language you can imagine

Code Understanding: Beyond Text Matching

Voyage Code 3 understands code structure, not just text:

// Example: These code snippets are semantically similar to Voyage Code 3
// even though they share almost no text tokens

# Python implementation
def authenticate_user(username, password):
    user = db.query(User).filter_by(name=username).first()
    if user and bcrypt.check(password, user.hashed_password):
        return create_session(user)
    return None

// JavaScript equivalent - Voyage Code 3 knows these are similar!
async function loginUser(email, pwd) {
  const account = await prisma.user.findUnique({ where: { email } });
  if (account && await bcrypt.compare(pwd, account.passwordHash)) {
    return generateToken(account);
  }
  return null;
}


Traditional embeddings would rate these as completely different. Voyage Code 3 understands they're both authentication functions with:

  • User lookup logic

  • Password verification

  • Session/token generation

  • Similar control flow

This algorithmic reasoning capability is what sets it apart - it understands what code does, not just what it says.
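
Under the hood, "semantically similar" just means the two embedding vectors point in nearly the same direction. A quick cosine-similarity helper, shown here as a sketch, makes the comparison concrete:

// Cosine similarity between two embedding vectors: 1.0 = same direction, ~0 = unrelated.
// The two authentication functions above would score close to each other.
function cosineSimilarity(a: Float32Array, b: Float32Array): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}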

Production Integration: Making It Work at Scale

Here's a simplified example of how we integrated Voyage Code 3 into our pipeline:

// Our production Voyage AI client with optimizations
export class VoyageEmbeddingService {
  private readonly client: VoyageAIClient;
  private readonly cache: Map<string, Float32Array>;

  constructor(env: Env) {
    this.client = new VoyageAIClient({
      apiKey: env.VOYAGE_API_KEY,
      model: "voyage-code-3",
      timeout: 30000,
      maxRetries: 3,
    });
    this.cache = new Map();
  }

  async embedBatch(
    documents: CodeDocument[],
    options: EmbeddingOptions = {}
  ): Promise<EmbeddingResult[]> {
    const {
      dimensions = 1024,
      outputDtype = "float",
      useCache = true,
    } = options;

    // Check cache for existing embeddings
    const uncached = documents.filter(
      (doc) => !useCache || !this.cache.has(doc.hash)
    );

    if (uncached.length === 0) {
      return documents.map((doc) => ({
        id: doc.id,
        embedding: this.cache.get(doc.hash)!,
        cached: true,
      }));
    }

    // Batch API call with optimized parameters
    const response = await this.client.embed({
      input: uncached.map((d) => d.content),
      inputType: "document",
      outputDimension: dimensions,
      outputDtype: outputDtype,
    });

    // Cache and return results
    const results = response.data.map((embedding, i) => {
      const doc = uncached[i];
      const vector = new Float32Array(embedding);

      if (useCache) {
        this.cache.set(doc.hash, vector);
      }

      return {
        id: doc.id,
        embedding: vector,
        cached: false,
      };
    });

    return results;
  }

  // Hybrid search with binary pre-filtering
  async hybridSearch(
    query: string,
    candidates: EmbeddingRecord[],
    topK: number = 20
  ): Promise<SearchResult[]> {
    // Step 1: Generate query embedding at low dimension
    const queryEmbedding256 = await this.client.embed({
      input: [query],
      inputType: "query",
      outputDimension: 256,
      outputDtype: "binary",
    });

    // Step 2: Fast binary similarity for initial filtering
    const prefiltered = this.binaryCosineSimilarity(
      queryEmbedding256.data[0],
      candidates,
      topK * 5 // Get 5x candidates for reranking
    );

    // Step 3: Full-precision reranking on top candidates
    const queryEmbedding1024 = await this.client.embed({
      input: [query],
      inputType: "query",
      outputDimension: 1024,
      outputDtype: "float",
    });

    const reranked = this.cosineSimilarity(
      queryEmbedding1024.data[0],
      prefiltered,
      topK
    );

    return reranked;
  }
}
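
The `binaryCosineSimilarity` call above does the cheap part of the work: with binary embeddings, similarity reduces to counting matching bits. Here's a rough sketch of that helper; the `EmbeddingRecord` shape and the implementation details differ from our production version.

// Rough sketch of binary pre-filtering: rank candidates by how many bits
// they share with the bit-packed query embedding, then keep the top `limit`.
// The EmbeddingRecord shape here is illustrative.
interface EmbeddingRecord {
  id: string;
  binaryEmbedding: Uint8Array; // 256-dim binary embedding = 32 bytes
}

function binaryCosineSimilarity(
  query: Uint8Array,
  candidates: EmbeddingRecord[],
  limit: number
): EmbeddingRecord[] {
  const popcount = (byte: number): number => {
    let count = 0;
    while (byte) {
      count += byte & 1;
      byte >>= 1;
    }
    return count;
  };

  const scored = candidates.map((candidate) => {
    let matchingBits = 0;
    for (let i = 0; i < query.length; i++) {
      // XOR yields the differing bits; subtract from 8 to count agreements per byte
      matchingBits += 8 - popcount(query[i] ^ candidate.binaryEmbedding[i]);
    }
    return { candidate, score: matchingBits };
  });

  return scored
    .sort((a, b) => b.score - a.score)
    .slice(0, limit)
    .map((s) => s.candidate);
}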

Batch Optimization

When using embedding services, it's important to remember that the overhead of the API calls themselves costs far more in latency than the actual compute. So we built a simple, language-aware batch optimizer to minimize time spent on the network:

export class BatchOptimizer {
  private readonly maxDocsPerBatch = 128; // Voyage AI's limit
  private readonly maxTokensPerBatch = 120000; // Leave headroom

  createLanguageGroupedBatches(files: FileContent[]): Map<string, FileContent[][]> {
    // Group files by language first
    const languageGroups = this.groupByLanguage(files);

    // Create optimal batches within each language group
    const batches = new Map<string, FileContent[][]>();

    for (const [language, langFiles] of languageGroups) {
      const langBatches: FileContent[][] = [];
      let currentBatch: FileContent[] = [];
      let currentTokens = 0;

      for (const file of langFiles) {
        const tokens = this.estimateTokens(file.content);

        if (
          currentBatch.length >= this.maxDocsPerBatch ||
          currentTokens + tokens > this.maxTokensPerBatch
        ) {
          // Start new batch
          if (currentBatch.length > 0) {
            langBatches.push(currentBatch);
          }
          currentBatch = [file];
          currentTokens = tokens;
        } else {
          currentBatch.push(file);
          currentTokens += tokens;
        }
      }

      if (currentBatch.length > 0) {
        langBatches.push(currentBatch);
      }

      batches.set(language, langBatches);
    }

    return batches;
  }
}
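
The two helpers the optimizer leans on are deliberately simple. Here's a sketch of how they might look - the `FileContent` shape and the roughly-4-characters-per-token estimate are assumptions, not Voyage's actual tokenizer.

// Illustrative helpers for the batch optimizer above. The FileContent shape
// and the ~4 chars/token estimate are assumptions, not Voyage's tokenizer.
interface FileContent {
  path: string;
  content: string;
  language?: string;
}

function estimateTokens(content: string): number {
  return Math.ceil(content.length / 4); // rough heuristic: ~4 characters per token
}

function groupByLanguage(files: FileContent[]): Map<string, FileContent[]> {
  const groups = new Map<string, FileContent[]>();
  for (const file of files) {
    const language = file.language ?? "unknown";
    const group = groups.get(language) ?? [];
    group.push(file);
    groups.set(language, group);
  }
  return groups;
}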

Why Voyage Code 3 is Perfect for Kasava

For our specific use case - building a code intelligence platform that syncs GitHub with task management - Voyage Code 3 provides unique advantages:

  1. Universal Language Support: With 300+ languages supported, we handle everything from COBOL legacy systems to cutting-edge Rust, matching our goal of 100+ language support.

  2. Natural Language to Code: The model excels at text-to-code retrieval, perfect for our "Natural Language Queries" feature where developers ask questions like "where is the authentication logic?" and get actual code, not just mentions.

  3. Cross-Repository Understanding: The massive training corpus means embeddings understand common patterns across different codebases - critical when syncing multiple repositories to a single task platform.


The Tree-Sitter Problem: Why We Had to Leave the (Cloudflare) Edge

Here's a dirty secret about running code intelligence on the edge - parsing code properly is hard. We needed tree-sitter for accurate AST parsing across 100+ languages. But tree-sitter uses WebAssembly, and that's where things got complicated.


The WebAssembly Wall We Hit


We tried. We really tried to make tree-sitter work in Cloudflare Workers:

The problems:
  • Hidden Node.js dependencies - `web-tree-sitter` uses `module.createRequire`

  • Bundle size explosion - 25MB of WASM for 26 languages

  • No filesystem - Can't dynamically load language parsers

  • Memory constraints - Each parser instance uses 10-20MB

Enter Google Cloud Functions: The Pragmatic Solution

Sometimes the best edge architecture includes knowing when NOT to use the edge. We built a hybrid approach:

// Our architecture layers
const PROCESSING_TIERS = {
  orchestration: "Cloudflare Workers", // Fast, global, stateless
  parsing: "Google Cloud Functions", // Node.js, native tree-sitter
  storage: "Cloudflare R2", // WASM files CDN
};


Google Cloud Function configuration:

runtime: python312
memory: 4GB # Enough for concurrent parsing
timeout: 540s # 9 minutes for massive files
maxInstances: 1000 # Scale to handle load
minInstances: 1 # Always warm
environment_variables:
  CACHE_WASM: "true"
  MAX_FILE_SIZE: "10485760" # 10MB limit
  SUPPORTED_LANGUAGES: "100+" # All tree-sitter languages

Performance characteristics:

  • Cold start: 1-2 seconds (Python runtime initialization)

  • Warm performance: 50-200ms per batch

  • Throughput: 10-30 files/second per instance

  • Language support: 100+ languages with tree-sitter
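
On the Worker side, handing a batch to the parsing tier is just an HTTP call with a timeout and a retry budget. Here's a hedged sketch; the `PARSER_URL` binding and the request/response shapes are illustrative.

// Sketch: Worker-side call to the external parsing function.
// PARSER_URL and the payload shapes are illustrative.
interface ParseRequest {
  repoId: string;
  files: { path: string; content: string }[];
}

async function parseBatch(env: { PARSER_URL: string }, batch: ParseRequest) {
  const controller = new AbortController();
  const timeout = setTimeout(() => controller.abort(), 60_000); // match the 60s stage timeout

  try {
    const response = await fetch(env.PARSER_URL, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(batch),
      signal: controller.signal,
    });
    if (!response.ok) {
      throw new Error(`Parser returned ${response.status}`);
    }
    return await response.json(); // AST chunks ready for embedding
  } finally {
    clearTimeout(timeout);
  }
}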

The Multi-Stage Pipeline Architecture

Each stage is optimized for its specific workload:

// Pipeline configuration
const PIPELINE_CONFIG = {
  "file-fetching": {
    batchSize: 50,
    timeout: 30000,
    retries: 3,
    concurrency: 100,
  },
  "external-parsing": {
    batchSize: 10, // Optimal for GCF
    timeout: 60000, // Network + processing
    retries: 2,
    circuitBreaker: true, // Prevent cascade failures
  },
  "voyage-embedding": {
    batchSize: 128, // Voyage AI max
    timeout: 30000,
    retries: 3,
    rateLimiting: true, // 100 req/s limit
  },
  "db-insertion": {
    batchSize: 1000, // PostgreSQL bulk insert
    timeout: 10000,
    retries: 1,
    conflictResolution: "upsert",
  },
};
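
Each stage runs as a queue consumer whose batch size and retry budget come from this config. A simplified sketch of the consumer shape is below; the message type, `Env` contents, and `processEmbeddingBatch` helper are illustrative.

// Simplified sketch of a queue consumer for one pipeline stage.
// The message shape, Env contents, and processEmbeddingBatch are illustrative.
interface IndexMessage {
  type: "index_batch";
  repoId: string;
  files: { path: string; content: string }[];
}

interface Env {}

declare function processEmbeddingBatch(env: Env, body: IndexMessage): Promise<void>;

export default {
  async queue(batch: MessageBatch<IndexMessage>, env: Env): Promise<void> {
    for (const message of batch.messages) {
      try {
        await processEmbeddingBatch(env, message.body);
        message.ack(); // success - don't redeliver
      } catch (err) {
        message.retry(); // let the queue redeliver, up to the configured retry count
      }
    }
  },
};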

A Durable Object coordinator (the pattern described in Part 1) tracks progress and failures across these stages. Why Durable Objects are perfect for this:

  • Single-threaded execution - No race conditions, ever

  • Persistent state - Survives worker recycling

  • Global uniqueness - One coordinator per repository

  • WebSocket support - Real-time progress updates


Storage and Caching for AI Workloads

Our AI pipeline generates massive amounts of data - embeddings, cached LLM responses, document vectors. We leverage the Cloudflare infrastructure described in [Part 1](./part-1-architecture.md):

  • R2 Storage: Store embeddings and model outputs with zero egress fees

  • KV Cache: Sub-10ms global cache hits for embeddings and LLM responses

  • Queue Processing: Batch optimization for AI workloads

The Economics of Edge Computing

Let's talk real numbers, because the cost implications are staggering.

The Traditional Cost Structure is Broken

According to [IDC research](https://www.techtarget.com/searchcio/tip/Top-edge-computing-trends-to-watch-in-2020), global spending on edge computing will reach $228 billion in 2024, up 14% from 2023, with projections hitting $378 billion by 2028. Why? Because companies are realizing that centralized cloud isn't just expensive - it's economically unsustainable for AI workloads.
Consider the typical AI startup burn rate:

  • 80% of capital goes to compute (a16z research)

  • GPU utilization averages 50-70% during peak times

  • Egress fees can reach $90/TB on AWS

  • Cold starts add 1-2 seconds of wasted compute


The Trade-offs

The 30-Second Execution Limit

You can't run long background jobs. We work around this with queues and chunking, but it adds complexity.

Limited Runtime Environment

No native Node.js. No Python packages that require C extensions. We've had to rewrite some libraries in pure JavaScript or find WebAssembly alternatives. The web-tree-sitter parser we use for AST analysis? We had to compile it to WASM ourselves.

Memory Constraints

Workers have less memory than Lambda functions. We've had to be clever about streaming large responses and processing data in chunks. You can't just load a 1GB file into memory and call it a day.
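
In practice that means streaming instead of buffering. For example, here's a minimal sketch of piping a large R2 object through a Worker response without ever materializing it in memory; the `BUCKET` binding name is illustrative.

// Sketch: stream a large R2 object straight to the client instead of
// buffering it in Worker memory. The BUCKET binding name is illustrative.
interface Env {
  BUCKET: R2Bucket;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const key = new URL(request.url).pathname.slice(1);
    const object = await env.BUCKET.get(key);
    if (!object) return new Response("Not found", { status: 404 });

    // object.body is a ReadableStream - bytes flow through the Worker
    // without the whole file ever being loaded into memory
    return new Response(object.body, {
      headers: {
        "Content-Type": object.httpMetadata?.contentType ?? "application/octet-stream",
      },
    });
  },
};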

Debugging Can Be Tricky

The local development experience isn't quite the same as production. Cloudflare's `wrangler` tool is good, but debugging distributed systems running across 200+ data centers requires a different mindset.

Vendor Lock-in

We're pretty tied to Cloudflare at this point. KV, R2, and the other services have rough equivalents elsewhere, but moving off them would take real work - a trade-off we accepted rather than staying fully platform-agnostic.

What We'd Do Differently

If I were starting over today, here's what I'd consider:

  1. Start with Mastra from day one instead of migrating from LangGraph

  2. Invest in observability earlier - distributed tracing across Workers is crucial

  3. Build abstractions for the 30-second limit from the start, not as an afterthought

  4. Consider a hybrid approach for certain workloads (maybe run embedding generation on dedicated infrastructure)

The Future: What's Next

We're doubling down on the edge. Here's what's coming:

  • AI support in Workers: Cloudflare's Workers AI already offers a growing catalog of models, and when it makes sense for us we'll be able to run model inference directly on the edge

  • Larger context windows: As models like Claude get more efficient, we can process more in each 30-second window

  • Edge-native vector databases: Right now we use pgvector hosted on Supabase. But Cloudflare's Vectorize is promising

  • More intelligent caching: Using Merkle trees to track exactly what's changed and only re-index those pieces (a minimal sketch follows below)
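
The Merkle idea is straightforward: hash every file, roll the hashes up into a root, and only re-index what actually changed. Here's a minimal sketch of that idea using the Web Crypto API available in Workers - not our production implementation.

// Minimal sketch of Merkle-style change detection (not our production code).
async function sha256(text: string): Promise<string> {
  const digest = await crypto.subtle.digest("SHA-256", new TextEncoder().encode(text));
  return [...new Uint8Array(digest)].map((b) => b.toString(16).padStart(2, "0")).join("");
}

// Files whose hash differs from the previous snapshot need re-indexing
async function changedFiles(
  files: { path: string; content: string }[],
  previous: Map<string, string> // path -> content hash from the last index run
): Promise<string[]> {
  const changed: string[] = [];
  for (const file of files) {
    const hash = await sha256(file.content);
    if (previous.get(file.path) !== hash) changed.push(file.path);
  }
  return changed;
}

// Root hash lets us skip a repository entirely when nothing changed at all
async function merkleRoot(leafHashes: string[]): Promise<string> {
  if (leafHashes.length === 0) return sha256("");
  let level = leafHashes;
  while (level.length > 1) {
    const next: string[] = [];
    for (let i = 0; i < level.length; i += 2) {
      next.push(await sha256(level[i] + (level[i + 1] ?? level[i])));
    }
    level = next;
  }
  return level[0];
}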

The Network Effect of Edge AI

When 75% of enterprise data will be processed at the edge by end of 2025, we're not just talking about a trend - we're talking about a fundamental restructuring of how computing works. The implications are massive:

  1. Democratization of AI: You don't need a $100M data center to compete. A smart edge architecture can outperform centralized systems at a fraction of the cost.

  2. Energy Efficiency at Scale: With edge computing reducing energy consumption by 30-40%, we could see significant progress toward sustainable AI. This isn't just good PR - it's economically essential as power becomes the primary bottleneck.

  3. The Death of Cold Starts: Zero-millisecond cold starts change the entire calculus of serverless computing. Applications that were impossible become trivial.

  4. Geographic Arbitrage: Why pay for US data center costs when your code runs everywhere? The edge makes location irrelevant for compute.

The Qwen 3 Opportunity We're Watching

One exciting development we're tracking is the Qwen 3 coder family, particularly the flagship Qwen3-Coder-480B model released in July 2025. With 480 billion total parameters (35B active in its MoE configuration), it potentially represents a massive leap in code understanding that could reshape edge-based code intelligence. The model achieves performance comparable to DeepSeek-R1 and GPT-4 on coding benchmarks while being open source under Apache 2.0.

Unfortunately, Cloudflare Workers AI currently only supports Qwen 2.5 coder models (up to 32B parameters), which limits our ability to leverage these cutting-edge capabilities directly on the edge. While Qwen 2.5 is impressive - scoring 73.7% on Aider's code editing benchmark - the potential for running Qwen 3's superior reasoning and code generation at edge locations would be transformative. We're actively monitoring Cloudflare's model roadmap and exploring hybrid approaches where we could run Qwen 3 inference through dedicated endpoints while maintaining our edge-first architecture for orchestration and caching.

Should You Build on the Edge?

Here's what most people miss: in the age of AI, latency is a feature. Users don't care if your model has 1 trillion parameters if it takes 3 seconds to respond. They'll choose the 7B parameter model that responds instantly every time.

Edge computing gives you that instant response. Globally. By default. Without hiring a team of infrastructure engineers.

The edge isn't just about speed - though the sub-10ms latency is nice. It's about building systems that scale globally by default, that handle failures gracefully, and that you don't have to babysit at 3 AM.

More importantly, it's about proving that the future of AI doesn't require trillion-dollar infrastructure investments. It requires smart architecture, efficient design, and a willingness to challenge conventional wisdom.

Is it perfect? No. Are there trade-offs? Absolutely. But for us, betting on the edge has paid off in spades. The combination of Cloudflare Workers for compute, Mastra for orchestration, pgvector for semantic search, and Voyage AI for embeddings has given us a platform that's fast, scalable, and - perhaps most importantly - maintainable.

The future of AI applications isn't just about smarter models. It's about getting those smarts to users as fast as possible, as cheaply as possible, everywhere in the world. And right now, nothing beats the edge for that.

Welcome to the edge. The water's fine, and the latency's even better. That line was written by AI, and it's pretty dang good.

——

Next in This Series

Part 4: From LangGraph to Mastra - Our AI Orchestration Journey - The story of why we migrated from LangGraph to Mastra, and how choosing TypeScript-first tools transformed our development velocity and reduced bugs by 50%.

——
Building something similar? Want to compare notes? Find me on LinkedIn at [benjamin-gregory](https://www.linkedin.com/in/benjamin-gregory/). Always happy to talk shop about serverless architectures, AI pipelines, or why despite all this cutting-edge tech, webpack configs are still the bane of my existence.
