Part 7: The AI Stack and Building Real AI Systems

Part of the AI Fundamentals 101 Series

From Concepts to Systems

Over the last six articles, we've covered the building blocks: what AI is, how ML and deep learning work, NLP, LLMs, RAG and fine-tuning, agents and protocols. Now it's time to put it all together.

This final article covers the practical questions that determine whether an AI project succeeds or ends up in the graveyard of abandoned prototypes. I've built AI systems that shipped and AI systems that didn't, and the difference was never the model; it was everything around it.


The Modern AI Stack

Every AI system, from a simple chatbot to a multi-agent orchestration platform, sits on a stack of layers. Understanding this stack is what separates engineers who use AI from engineers who build AI systems.

┌──────────────────────────────────────────────────────┐
│  Layer 5: Application                                │
│  Your product: chatbot, monitoring tool, code assist │
├──────────────────────────────────────────────────────┤
│  Layer 4: Orchestration                              │
│  Agent frameworks, RAG pipelines, prompt chains      │
├──────────────────────────────────────────────────────┤
│  Layer 3: Model Layer                                │
│  LLMs, embedding models, classifiers                 │
├──────────────────────────────────────────────────────┤
│  Layer 2: Data & Infrastructure                      │
│  Vector DBs, feature stores, data pipelines          │
├──────────────────────────────────────────────────────┤
│  Layer 1: Compute                                    │
│  GPUs (NVIDIA), TPUs (Google), cloud instances       │
└──────────────────────────────────────────────────────┘

Layer 1: Compute

AI workloads need specialized hardware. Training LLMs requires thousands of GPUs running for months. Inference is cheaper but still GPU-heavy for large models.

Layer 2: Data & Infrastructure

The storage and plumbing: vector databases for embeddings, feature stores for ML features, and the data pipelines that keep both fresh.

Layer 3: Model Layer

The models themselves: LLMs for generation, embedding models for retrieval, and classifiers for fast, cheap decisions.

Layer 4: Orchestration

The logic that coordinates models: agent frameworks, RAG pipelines, and prompt chains.

Layer 5: Application

This is where everything comes together into something users interact with.


Adding AI to Existing Applications

Most engineers don't build AI-native applications from scratch. They add AI capabilities to existing systems. Here's a practical framework:

The Embedded AI Pattern

The pattern:

  1. Keep existing rules (fast, deterministic, free)

  2. Add ML for pattern-based decisions (fast, cheap)

  3. Use LLM only for complex, ambiguous cases (slow, expensive, powerful)

This layered approach is how I add AI to everything: the LLM is the last resort, not the first.


Why Most AI Projects Fail (The AI Graveyard)

I've seen this pattern repeatedly: a team builds an impressive AI demo in two weeks, then spends six months trying to get it to production and eventually abandons it. Here's why.

The Top Failure Modes

  1. No success metric: the demo "feels" good, but nobody defined what good means in production.

  2. Ignoring cost and latency: behavior that's acceptable in a two-week demo rarely survives real traffic and real budgets.

  3. Underestimating hallucinations: outputs that are amusing in a demo become liabilities in front of users.

  4. Reaching for an LLM where rules or classical ML would do: the overengineering trap.

The "Do I Even Need AI?" Checklist

  1. Could a rule or a regex solve this? If yes, stop there.

  2. Is it a pattern-recognition task classical ML handles well? Start simple.

  3. Does it genuinely require language understanding or generation? Only then reach for an LLM.

  4. Can the product tolerate occasional wrong answers? If not, an LLM alone is the wrong tool.


NeuroSymbolic AI: The Best of Both Worlds

NeuroSymbolic AI is a promising direction that combines neural networks (learning from data) with symbolic AI (logical reasoning).

Why this matters: Pure neural approaches (just call the LLM) lack determinism. Pure symbolic approaches (just write rules) lack flexibility. The combination gives you both.
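One common shape of this combination: the neural side proposes, the symbolic side disposes. Here's a minimal sketch under assumed names; `extract_refund()` is a hypothetical stand-in for an LLM extraction call, and `MAX_REFUND` is an invented business rule.

```python
# Neurosymbolic sketch: a deterministic rule layer that the
# non-deterministic model output must pass through.
MAX_REFUND = 100.0  # hypothetical business rule

def extract_refund(message: str) -> dict:
    """Neural side (stand-in): flexible, but non-deterministic."""
    return {"action": "refund", "amount": 250.0}

def symbolic_guard(proposal: dict) -> dict:
    """Symbolic side: auditable rules the model cannot override."""
    if proposal["action"] == "refund" and proposal["amount"] > MAX_REFUND:
        return {"action": "escalate_to_human", "reason": "refund over limit"}
    return proposal

decision = symbolic_guard(extract_refund("I want my money back"))
print(decision)  # {'action': 'escalate_to_human', 'reason': 'refund over limit'}
```

The neural component handles the messy language; the symbolic guard gives you the determinism a pure LLM call lacks, and it's a plain function you can unit-test.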


Responsible AI: Building Systems You Can Trust

This isn't just ethics for ethics' sake; it's engineering. Irresponsible AI creates bugs, liability, and user distrust.


Putting It All Together: Your AI Engineering Roadmap

Based on everything in this series, here's the path I'd recommend:

  1. Start with classical ML (scikit-learn): cheap, fast, and often enough.

  2. Learn deep learning fundamentals so the layers above aren't magic.

  3. Build with LLMs: prompts first, then RAG for your data, fine-tuning as a last resort.

  4. Add agents only when a single model call isn't enough.

  5. Treat production as the goal: measure cost, latency, and failure modes before scaling.


Series Recap

Let's tie all seven articles together:

| Part | Title | Core Takeaway |
| --- | --- | --- |
| 1 | What is AI? | AI = systems that learn, reason, or adapt. All current AI is narrow (ANI). |
| 2 | ML, DL, & Foundation Models | Three layers of increasing capability. Start simple, upgrade when needed. |
| 3 | NLP, NLU, NLG | Machines processing human language. Classical NLP is still useful alongside LLMs. |
| 4 | LLMs & Generative AI | Transformers predict next tokens. Powerful but hallucinate, expensive, non-deterministic. |
| 5 | RAG, Fine-Tuning, Prompts | Three ways to customize AI. Start with prompts, add RAG for data, fine-tune as last resort. |
| 6 | Agents & Protocols | Agents think + act in loops. MCP for tools, A2A for agent communication. |
| 7 | AI Stack & Building Systems | The complete picture. Layered approach, avoid overengineering, build responsibly. |

The meta-lesson across all seven articles: AI is a tool, not magic. Like any tool, it has a cost, limitations, and specific use cases where it excels. The best AI engineers are the ones who know when not to use AI, when to use a $0.001 scikit-learn prediction instead of a $0.10 LLM call, and how to build systems where AI and traditional code work together.
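To make that cost gap concrete, here's the back-of-the-envelope arithmetic at a hypothetical one million requests per month, using the per-call figures from the paragraph above:

```python
# Cost comparison: $0.001/prediction vs $0.10/LLM call at 1M requests/month.
requests = 1_000_000
sklearn_cost = requests * 0.001  # classical ML prediction
llm_cost = requests * 0.10       # LLM API call

print(f"scikit-learn: ${sklearn_cost:,.0f}/mo, LLM: ${llm_cost:,.0f}/mo")
# scikit-learn: $1,000/mo, LLM: $100,000/mo
```

A 100x cost difference is the kind of number that decides whether a feature ships, which is exactly why the routing decision belongs in your architecture, not an afterthought.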


If You Want To...

  - Build ML models hands-on

  - Learn deep learning with PyTorch

  - Build production LLM apps

  - Build RAG systems

  - Deploy ML to production


This is the final article in the AI Fundamentals 101 series. Thanks for reading.


← Part 6: AI Agents and Protocols · Series Overview
