Part 2: Machine Learning, Deep Learning, and Foundation Models

Part of the AI Fundamentals 101 Series

Why These Three Matter

In Part 1, we mapped the AI landscape. Now we're going to zoom into the three layers that power almost every AI system you'll encounter as an engineer:

  1. Machine Learning — algorithms that learn from data

  2. Deep Learning — neural networks with many layers

  3. Foundation Models — massive pre-trained models that changed the industry

Understanding the differences between these isn't academic — it determines which tool you reach for when solving a real problem. I've wasted weeks using an LLM for a task that scikit-learn could handle in 10 lines. And I've wasted days trying to train a classifier when I should have just prompted Claude. The distinction matters.


Machine Learning: Learning Rules from Data

Machine learning is the approach where, instead of programming explicit rules, you give an algorithm data and let it learn the patterns.

The Three Types of Machine Learning

1. Supervised Learning — "Learn from Labeled Examples"

You provide input-output pairs. The algorithm learns the mapping.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score

# Load a dataset with labels
iris = load_iris()
X, y = iris.data, iris.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train: the model learns the relationship between features and labels
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Predict on data the model has never seen
predictions = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, predictions):.2%}")
# Accuracy: 100.00%

Real use cases I've encountered:

  • Predicting whether a server alert is a real issue or noise (classification)

  • Estimating build times for CI/CD pipelines based on code change size (regression)

  • Detecting anomalous network traffic patterns (classification)

2. Unsupervised Learning — "Find Structure in Unlabeled Data"

No labels. The algorithm discovers patterns on its own.

Real use cases:

  • Grouping similar error log messages together (clustering)

  • Reducing high-dimensional monitoring data for visualization (dimensionality reduction)

  • Detecting unusual patterns in server behavior (anomaly detection)
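
Clustering like this takes only a few lines with scikit-learn. A minimal sketch using KMeans on synthetic data (the blobs stand in for real feature vectors such as log-message embeddings or server metrics):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate 300 unlabeled 2-D points scattered around 3 hidden centers
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# KMeans discovers the grouping without ever seeing a label
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print(labels[:10])                    # cluster id (0-2) for the first 10 points
print(kmeans.cluster_centers_.shape)  # (3, 2): one 2-D center per cluster
```

The only thing you choose up front is the number of clusters; the algorithm finds the groups on its own.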

3. Reinforcement Learning — "Learn by Trial and Reward"

The agent takes actions in an environment and receives rewards or penalties. It learns the best strategy over time.

Real use cases:

  • Game AI (AlphaGo, game bots)

  • Robotics (learning to walk, grasp objects)

  • Resource optimization (dynamic scaling, cache policies)
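
The trial-and-reward loop can be sketched with the simplest RL setting: a multi-armed bandit. In this toy example (the payout probabilities are invented), an epsilon-greedy agent mostly exploits the best-known arm and occasionally explores:

```python
import random

# A 3-armed bandit: each arm pays 1 with a hidden probability (invented values)
true_probs = [0.2, 0.5, 0.8]
estimates = [0.0] * 3   # the agent's running reward estimate per arm
counts = [0] * 3
epsilon = 0.1           # explore 10% of the time, exploit 90%
random.seed(0)

for step in range(5000):
    if random.random() < epsilon:
        arm = random.randrange(3)                            # explore
    else:
        arm = max(range(3), key=lambda a: estimates[a])      # exploit
    reward = 1 if random.random() < true_probs[arm] else 0
    counts[arm] += 1
    # Incremental average: nudge the estimate toward the observed reward
    estimates[arm] += (reward - estimates[arm]) / counts[arm]

print(max(range(3), key=lambda a: estimates[a]))  # prints 2, the best-paying arm
```

No one labels any data here: the agent discovers the best strategy purely from the reward signal, which is exactly the dynamic behind game AI and scaling policies, just at a much larger scale.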

When to Use Which Type

| Type | You Need | You Get | Example |
| --- | --- | --- | --- |
| Supervised | Labeled data (input → output) | Predictions on new data | "Is this server going to crash?" |
| Unsupervised | Unlabeled data | Discovered patterns | "What groups do these servers fall into?" |
| Reinforcement | An environment + reward signal | An optimal strategy | "What's the best scaling policy?" |


Deep Learning: Neural Networks with Many Layers

Deep learning is a subset of machine learning that uses neural networks with multiple layers to learn increasingly abstract representations of data.

What Makes It "Deep"?

A shallow model (like logistic regression) learns a single transformation: input → output. A deep model learns a hierarchy of transformations, each one building on the last.

For example, in image recognition:

  • Layer 1 detects edges

  • Layer 2 combines edges into shapes

  • Layer 3 combines shapes into parts (eyes, wheels)

  • Layer 4 combines parts into objects (face, car)

Each layer builds on the previous one, learning more complex and abstract features.

A Simple Neural Network in Python
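
A minimal sketch of the contrast, using scikit-learn's MLPClassifier as the deep model and logistic regression as the shallow baseline, on the classic two-moons dataset (synthetic data chosen because it is not linearly separable):

```python
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Two interleaved half-moons: impossible to separate with a straight line
X, y = make_moons(n_samples=1000, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Shallow model: a single linear transformation
shallow = LogisticRegression().fit(X_train, y_train)

# Deep model: two hidden layers learn a curved decision boundary
deep = MLPClassifier(hidden_layer_sizes=(16, 16), max_iter=2000,
                     random_state=42).fit(X_train, y_train)

print(f"Shallow accuracy: {shallow.score(X_test, y_test):.2%}")
print(f"Deep accuracy:    {deep.score(X_test, y_test):.2%}")
```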

The deep model captures the non-linear decision boundary that the shallow model, limited to a single linear transformation, simply cannot represent.

Key Deep Learning Architectures

| Architecture | Best For | How It Works |
| --- | --- | --- |
| CNN (Convolutional Neural Network) | Images, spatial data | Slides filters across input to detect patterns |
| RNN (Recurrent Neural Network) | Sequences, time series | Processes input one step at a time, maintaining hidden state |
| LSTM (Long Short-Term Memory) | Long sequences | RNN with memory gates that can remember/forget over long distances |
| Transformer | Language, multimodal | Processes entire sequences in parallel using an attention mechanism |
| GAN (Generative Adversarial Network) | Image generation | Two networks compete: a generator creates, a discriminator judges |
| Autoencoder | Compression, anomaly detection | Learns to compress and reconstruct data |

When Deep Learning Beats Traditional ML

The rule of thumb:

  • < 10,000 samples, < 50 features? Start with traditional ML (Random Forest, XGBoost)

  • Images, text, audio, or massive tabular data? Deep learning

  • Need interpretability? Traditional ML (decision trees, logistic regression)

  • Need maximum accuracy and have compute? Deep learning


Foundation Models: The Paradigm Shift

Foundation models are the biggest shift in AI engineering in the last decade. Understanding them is essential.

What is a Foundation Model?

A foundation model is a large model trained on broad data that can be adapted to many downstream tasks. The "foundation" metaphor is deliberate — it's a base you build on.

Examples of foundation models:

  • GPT-4, Claude, LLaMA — language understanding and generation

  • DALL-E, Stable Diffusion — image generation

  • Whisper — speech recognition

  • CLIP — connecting images and text

  • Codex/Copilot — code generation

What Changed

Before foundation models, every task required its own model trained on task-specific data: one model for sentiment, another for summarization, another for translation.

With foundation models, one model handles many tasks through different prompts.
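
The shift is easiest to see in code. Here `llm()` is a stub standing in for a real foundation-model API call (the function, its output, and the log line are all hypothetical); the point is that the task lives in the prompt, not in the model weights:

```python
def llm(prompt: str) -> str:
    """Stub for a real foundation-model API call (hypothetical)."""
    return f"<model response to: {prompt[:30]}...>"

log_line = "ERR disk /dev/sda1 98% full"

# One model, many tasks: only the prompt changes
tasks = {
    "classify":  f"Is this log line an error or routine info? {log_line}",
    "summarize": f"Summarize this log line in plain English: {log_line}",
    "extract":   f"Extract the device name from this log line: {log_line}",
}
for name, prompt in tasks.items():
    print(f"{name}: {llm(prompt)}")
```

In the old paradigm, each entry in that dictionary would have been a separate model with its own labeled training set.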

Why Foundation Models Work

Three ingredients came together:

  1. Scale — GPT-3 was trained on ~300 billion tokens of internet text. The model saw enough language to learn grammar, facts, reasoning patterns, and coding conventions.

  2. Self-supervised learning — The model learns by predicting the next word in a sequence. No human labels needed for training data.

  3. Transfer learning — Knowledge learned in one context transfers to another. A model that learned to summarize news articles can also summarize error logs — the underlying skill (compression, key point extraction) is the same.

The Foundation Model Stack

The stack runs from pre-training at the bottom, through adaptation (fine-tuning) in the middle, to applications (prompting) at the top. As an engineer, you typically work in the top two layers. You don't train foundation models — you use and adapt them.


Comparing All Three: ML vs DL vs Foundation Models

Decision Framework

| Factor | Traditional ML | Deep Learning | Foundation Model |
| --- | --- | --- | --- |
| Training data | Hundreds–thousands | Thousands–millions | Zero (few-shot) |
| Training time | Minutes | Hours–days | None (pre-trained) |
| Inference cost | Near-zero | Low | $0.001–0.10+ per call |
| Interpretability | High | Low | Very low |
| Customization | Full control | Full control | Prompt/fine-tune only |
| Accuracy (small data) | Good | Poor | Good (via in-context learning) |
| Accuracy (large data) | Good | Excellent | Excellent |
| Handles new patterns | No (needs retraining) | No (needs retraining) | Yes (generalization) |

My personal rule:

  1. Can I solve this with 10 lines of scikit-learn? → Traditional ML

  2. Do I have images, audio, or need to learn complex patterns from massive data? → Deep Learning

  3. Do I need language understanding, generation, or zero-shot capability? → Foundation Model

  4. Am I unsure? → Start with traditional ML, upgrade if it's not enough


Ten Real-World ML Use Cases You Encounter Daily

Understanding these helps you recognize AI in the wild.


From My Own Experience: Choosing the Right Level

When I started my home server monitoring project, I went through all three levels:

Attempt 1 — Traditional ML: I collected CPU, memory, and disk metrics over two weeks, labeled them (alert/no-alert), and trained a Random Forest. It worked well — 91% accuracy — but it couldn't handle new types of issues it hadn't seen in training.

Attempt 2 — Deep Learning: I tried an LSTM to capture time-series patterns. It improved accuracy to 94% but required significantly more data and compute for marginal gains on my small dataset. Overkill for my use case.

Attempt 3 — Foundation Model (current): I added an LLM layer that takes the prediction from my Random Forest and the raw metrics, then generates a human-readable diagnosis. The ML model handles the fast, cheap detection. The LLM handles the nuanced explanation.

The sweet spot is often a hybrid: traditional ML for the fast, repeatable classification plus a foundation model for the parts that need language understanding.
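
That hybrid pattern can be sketched roughly like this (the metrics, the alert rule, and the prompt are all invented for illustration, and the LLM call is left as a stub):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy training data: cpu%, mem%, disk% with a simple invented alert rule
rng = np.random.default_rng(42)
X = rng.uniform(0, 100, size=(500, 3))
y = (X.max(axis=1) > 90).astype(int)   # "alert" if any metric exceeds 90%
detector = RandomForestClassifier(random_state=42).fit(X, y)

def diagnose(cpu, mem, disk):
    # Fast, cheap path: the ML model screens every sample
    if detector.predict([[cpu, mem, disk]])[0] == 0:
        return "ok"
    # Slow, expensive path: only flagged samples reach the LLM (stubbed here,
    # where a real system would call a model API for the diagnosis)
    return f"LLM prompt: explain alert for cpu={cpu} mem={mem} disk={disk}"

print(diagnose(10, 20, 30))   # healthy sample never touches the LLM
print(diagnose(99, 20, 30))   # alert is routed to the LLM for explanation
```

The key design choice: the expensive language model only sees the small fraction of traffic the cheap classifier flags, which keeps both latency and API cost low.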


What's Next

Now that you understand the three layers of modern AI (ML → DL → Foundation Models), we'll explore one of the most practically useful branches: Natural Language Processing — the technology that lets machines understand, process, and generate human language.


Next: Part 3 — Natural Language Processing: NLP, NLU, and NLG

