Stop building fragile AI wrappers. Start architecting resilient systems.
Most AI Engineering tutorials stop at `client.chat.completions.create()`. That works for a hackathon, but it breaks in production.
This book is not about prompt engineering. It is an architectural playbook for the non-deterministic, high-latency, and expensive reality of LLMs in production.
I am Sampriti Mitra, a Software Engineering Lead at Sumologic (ex-Razorpay, IIT BHU alumna). I’ve spent the last 6 years building scalable systems. I wrote System Design for the LLM Era to document the exact patterns needed to move from prototype to production grade, distilled from deep dives into whitepapers, engineering blogs, and post-mortems.
This book is the bridge between a working prototype and a resilient, scalable system.
What’s inside the final book:
We deep dive into architectural patterns, scaling issues, cost optimizations, and case studies from real companies.
Core System Design patterns for using LLMs
We examine the design patterns that matter most when integrating LLMs:
- The LLM Gateway Pattern: How to decouple your application from OpenAI/Anthropic using Tiered Fallback strategies
- Async Event-Driven Architecture: Decouple high-latency agentic workflows using Kafka/SQS message queues and worker patterns for long-running generation tasks
- Cost Optimization: How to combine Semantic Caching (Redis/Vector) and Model Routing (sending simple grammar tasks to cheaper models) to reduce redundant API calls and lower inference costs by ~40%
- Security at the Prompt Layer: Implementing Instruction/Data Separation and using Firewall LLMs to detect and block prompt injection attacks before they reach your expensive models
What Breaks at Scale
This book focuses on failure modes that only show up under load:
- The Cache Stampede: Why standard caching fails during high-traffic events and how to fix it with Coalesce Caching middleware
- The RAG Precision Drop: Why Naive RAG fails on multi-hop reasoning questions and how GraphRAG (Vector + Knowledge Graph) bridges the gap
- Golden Datasets: Why simple assertions fail for non-deterministic LLMs and how to implement LLM-as-a-Judge evaluation pipelines using Golden Datasets
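As a preview of the coalescing idea: when many requests miss the cache on the same key at once, only one should hit the expensive model while the rest wait and reuse its result. A minimal in-process sketch (the class name and structure are mine, not the book's; a production version would use distributed locks):

```python
import threading

class CoalescingCache:
    """Cache where concurrent misses on a key trigger only one upstream call."""

    def __init__(self):
        self._values = {}
        self._locks = {}                   # per-key locks for in-flight work
        self._guard = threading.Lock()     # protects the lock registry

    def get(self, key, compute):
        if key in self._values:            # fast path: cache hit
            return self._values[key]
        with self._guard:                  # get-or-create this key's lock
            lock = self._locks.setdefault(key, threading.Lock())
        with lock:                         # only one caller computes...
            if key not in self._values:    # ...the rest find the value here
                self._values[key] = compute()
        return self._values[key]
```

Under a stampede, N concurrent misses result in one call to `compute` instead of N, which is exactly the behavior that keeps an LLM backend alive during a traffic spike.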
Case Study Deconstructions
We deep dive into the architecture of successful products based on their public engineering blogs and whitepapers:
- AI-Native IDEs (like Cursor/Copilot): Handling the Context Window problem with smart code indexing and low-latency code completion
- Adaptive Learning Platforms (like Duolingo): Architecting offline content pipelines vs. online serving paths
- AI-powered E-commerce search (for platforms like Amazon) and others!
Table of Contents:
Chapter 1: LLM System Design: Why Integration Requires New Patterns
- Beyond the hype: Understanding Tokens, Embeddings, and the RAG lifecycle.
- Why Naive RAG fails in production and how to fix it with GraphRAG.
- Agentic AI: Understanding the shift from simple prompts to autonomous agents.
- Operationalizing: Performance benchmarking, testing strategies, and handling failures.
Chapter 2: Core Architectural Patterns
- Resilience: Circuit breakers and fallbacks for when OpenAI goes down.
- Latency: Caching strategies to make LLM apps feel instant.
- Cost: Token optimization techniques to slash your API bill by 40%.
- Security: Injection attacks, data privacy, and Grounding strategies.
Chapter 3: Case Study: Designing an AI-Native IDE (like Cursor/Copilot)
- Handling the Context Window problem with smart code indexing.
- Privacy patterns for handling proprietary user code.
- Deep dive: Latency vs. Accuracy trade-offs in code completion.
Chapter 4: Case Study: Adaptive Learning Platform (like Duolingo)
- Architecting an offline content pipeline vs. an online serving path.
- Asynchronous processing patterns for generating personalized courseware.
- Database selection: When to use Vector DBs vs. Relational vs. Graph.
Chapter 5: Case Study: AI-Powered Search for E-Commerce (like Amazon)
- Moving beyond keyword search: Hybrid Search architecture.
- The Product Discovery flow: Ranking and re-ranking with LLMs.
- Caching strategies for high-traffic retail events.
Chapter 6: Case Study: AI Customer Support Agent
- The Golden Dataset: How to build an evaluation suite that actually works.
- LLM-as-a-Judge: Automating your quality assurance.
- Ingestion pipelines: Keeping your knowledge base fresh in real-time.