Architecture and Components
Introduction
Working with containerized applications across multiple cloud providers taught me an important lesson: understanding Kubernetes architecture isn't just about memorizing component names—it's about grasping how these pieces work together to create a resilient, self-healing system. When I first encountered production issues where pods weren't scheduling correctly, or services weren't routing traffic as expected, I realized that surface-level knowledge wasn't enough. I needed to understand the control plane, the worker nodes, and how the reconciliation loop maintains desired state.
In this article, I'll share what I've learned about Kubernetes architecture from troubleshooting real cluster issues, scaling microservices workloads, and designing reliable container orchestration solutions. We'll explore how the control plane components coordinate to manage your cluster, how worker nodes execute workloads, and how the entire system maintains your application's desired state even when things fail.
High-Level Architecture Overview
Kubernetes follows a master-worker architecture pattern, though the terminology has evolved to use "control plane" and "worker nodes" instead. The architecture is designed around a declarative model where you specify the desired state of your application, and Kubernetes continuously works to maintain that state.
Core Architectural Principles
Declarative Configuration: You describe what you want (desired state), not how to achieve it. Kubernetes controllers handle the implementation details.
Controller Pattern: Independent controllers watch for changes and work to reconcile current state with desired state. This creates a self-healing system.
API-Driven: Everything in Kubernetes is an API object. The API server is the central communication hub for all components.
Distributed System: Components are loosely coupled and communicate through the API server, making the system resilient to individual component failures.
Control Plane Components
The control plane makes global decisions about the cluster (like scheduling) and detects and responds to cluster events. Control plane components can run on any machine in the cluster, but typically run on dedicated master nodes that don't execute user workloads.
API Server (kube-apiserver)
The API server is the front end for the Kubernetes control plane. It's the only component that directly interacts with etcd and serves as the central communication hub for all other components.
Key Responsibilities:
Validates and processes REST operations
Authenticates and authorizes requests
Provides the only interface to etcd
Serves as the communication hub for all components
Implements admission controllers for policy enforcement
How It Works:
When you run kubectl apply -f deployment.yaml, here's what happens:
1. kubectl sends an HTTP POST request to the API server
2. API server authenticates the request (using certificates, tokens, etc.)
3. API server authorizes the request (RBAC checks)
4. Admission controllers process the request (mutating, then validating)
5. API server validates the object schema
6. API server writes the object to etcd
7. API server returns the response to kubectl
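You can watch this flow yourself by raising kubectl's verbosity, which prints the underlying HTTP calls:

```bash
# -v=6 logs request URLs and response codes; -v=8 adds request/response bodies
kubectl apply -f deployment.yaml -v=6
```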
API Server Watch Mechanism:
Components don't poll the API server; they establish long-lived watch connections. This is efficient because:
The API server pushes changes the moment they are written, so components react in near real time
Clients avoid repeatedly re-listing entire resource collections
A disconnected client can resume from its last-seen resourceVersion instead of starting over
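The same mechanism is available from the command line:

```bash
# Stream pod changes as they happen instead of polling:
kubectl get pods --watch

# The raw equivalent: one long-lived HTTP request that streams change events
kubectl get --raw "/api/v1/namespaces/default/pods?watch=true"
```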
etcd
etcd is a distributed, consistent key-value store that serves as Kubernetes' backing store for all cluster data. It's the single source of truth for your cluster's state.
Key Responsibilities:
Store all cluster state data
Provide consistency guarantees (using Raft consensus)
Support watch operations for event notification
Handle leader election and distributed locking
Data Structure in etcd:
Everything in Kubernetes is stored in etcd under specific key prefixes:
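You can see the layout directly with etcdctl, assuming access to an etcd member and kubeadm-style certificate paths:

```bash
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  get /registry --prefix --keys-only | head

# Typical keys:
#   /registry/pods/<namespace>/<pod-name>
#   /registry/deployments/<namespace>/<deployment-name>
#   /registry/services/specs/<namespace>/<service-name>
```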
etcd Cluster Configuration:
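As an illustration, one member of a three-node cluster might be started with flags like these (names, IPs, and the token are placeholders):

```bash
etcd --name etcd-1 \
  --initial-advertise-peer-urls https://10.0.0.1:2380 \
  --listen-peer-urls https://10.0.0.1:2380 \
  --listen-client-urls https://10.0.0.1:2379,https://127.0.0.1:2379 \
  --advertise-client-urls https://10.0.0.1:2379 \
  --initial-cluster etcd-1=https://10.0.0.1:2380,etcd-2=https://10.0.0.2:2380,etcd-3=https://10.0.0.3:2380 \
  --initial-cluster-state new \
  --initial-cluster-token my-etcd-cluster
```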
etcd Best Practices:
Always run etcd in a cluster (3 or 5 nodes for production)
Regular backups are critical
Monitor etcd performance (disk I/O is crucial)
Use dedicated disks (SSDs recommended)
Secure communication with TLS
Backing up etcd:
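A snapshot-based backup looks like this (kubeadm-style certificate paths assumed):

```bash
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  snapshot save /backup/etcd-snapshot-$(date +%Y%m%d).db

# Verify the snapshot:
ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-snapshot-$(date +%Y%m%d).db --write-out=table
```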
Scheduler (kube-scheduler)
The scheduler watches for newly created pods with no assigned node and selects a node for them to run on based on resource requirements, constraints, and policies.
Key Responsibilities:
Watch for unscheduled pods
Find feasible nodes (filtering phase)
Score nodes to find the best fit (scoring phase)
Bind pods to nodes
Scheduling Process:
For each pod in its queue, the scheduler filters out nodes that cannot run the pod, scores the feasible nodes that remain, picks the highest-scoring node, and posts a Binding back to the API server. The kubelet on the chosen node then sees the assignment and starts the containers.
Filtering Predicates:
The scheduler applies predicates to filter nodes (in recent Kubernetes releases these run as filter plugins in the scheduling framework, but the logic is the same):
PodFitsResources: Node has enough CPU/memory
PodFitsHostPorts: Required ports are available
MatchNodeSelector: Node matches pod's nodeSelector
CheckNodeTaints: Pod tolerates node taints
CheckVolumeBinding: Required volumes can be mounted
Scoring Functions:
After filtering, the scheduler scores remaining nodes:
LeastRequestedPriority: Prefers nodes with fewer requested resources
BalancedResourceAllocation: Balances CPU and memory usage
SelectorSpreadPriority: Spreads pods across nodes
NodeAffinityPriority: Prefers nodes matching affinity rules
Advanced Scheduling Example:
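As a sketch, here is a hypothetical pod that combines node affinity, pod anti-affinity, and a toleration (the labels and the taint key are assumptions):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web-frontend
  labels:
    app: web
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: disktype
            operator: In
            values: ["ssd"]      # only nodes labeled disktype=ssd pass filtering
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: web
          topologyKey: kubernetes.io/hostname   # prefer spreading replicas across nodes
  tolerations:
  - key: dedicated               # allows nodes tainted dedicated=frontend:NoSchedule
    operator: Equal
    value: frontend
    effect: NoSchedule
  containers:
  - name: web
    image: nginx:1.25
    resources:
      requests:
        cpu: 250m
        memory: 256Mi
```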
Custom Scheduler:
You can write custom schedulers for specific requirements:
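The simplest integration point is the pod's schedulerName field: a custom scheduler watches for pods that carry its name and creates Binding objects for them. The scheduler name below is hypothetical:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: custom-scheduled-pod
spec:
  schedulerName: my-custom-scheduler   # pods without this field use the default scheduler
  containers:
  - name: app
    image: nginx:1.25
```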
Controller Manager (kube-controller-manager)
The controller manager runs multiple controllers as a single process. Each controller is a control loop that watches the shared state of the cluster through the API server and makes changes to move the current state toward the desired state.
Built-in Controllers:
Node Controller: Monitors node health, marks nodes as NotReady
ReplicaSet Controller: Maintains the correct number of pods for each ReplicaSet
Endpoints Controller: Populates Endpoints objects (joins Services and Pods)
Service Account Controller: Creates default ServiceAccounts for namespaces
Namespace Controller: Deletes all resources when a namespace is deleted
PersistentVolume Controller: Binds PVs to PVCs
Job Controller: Creates pods for Jobs
CronJob Controller: Creates Jobs on a schedule
Deployment Controller: Manages ReplicaSets for Deployments
StatefulSet Controller: Manages StatefulSets
Controller Pattern:
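Here is a minimal, runnable sketch of the idea in Go. Real controllers are built on client-go informers and work queues rather than a polling loop like this:

```go
package main

import (
	"fmt"
	"time"
)

// reconcile compares desired and observed replica counts and acts on the
// difference, the same way the ReplicaSet controller does.
func reconcile(desired, observed int) int {
	switch {
	case observed < desired:
		fmt.Printf("creating %d pod(s)\n", desired-observed)
	case observed > desired:
		fmt.Printf("deleting %d pod(s)\n", observed-desired)
	default:
		fmt.Println("in sync, nothing to do")
	}
	return desired
}

func main() {
	desired, observed := 3, 1 // desired state vs. what the watch reported
	for i := 0; i < 3; i++ {
		observed = reconcile(desired, observed)
		time.Sleep(time.Second) // real controllers block on watch events instead
	}
}
```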
Controller Manager Configuration:
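An illustrative subset of its flags, usually set in the static pod manifest on kubeadm clusters:

```bash
kube-controller-manager \
  --leader-elect=true \
  --controllers='*,bootstrapsigner,tokencleaner' \
  --node-monitor-period=5s \
  --node-monitor-grace-period=40s \
  --kubeconfig=/etc/kubernetes/controller-manager.conf
```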
Cloud Controller Manager
The cloud controller manager runs controllers specific to cloud providers. It allows cloud vendors to integrate with Kubernetes without modifying core Kubernetes code.
Cloud-Specific Controllers:
Node Controller: Checks cloud provider to determine if node has been deleted
Route Controller: Sets up routes in cloud infrastructure
Service Controller: Creates/updates/deletes cloud load balancers
Volume Controller: Creates/attaches/mounts cloud volumes
AWS Cloud Controller Example:
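For example, creating a Service of type LoadBalancer on AWS causes the service controller to provision a load balancer; the annotation below is the classic in-tree way to request a Network Load Balancer instead of the default Classic ELB:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
spec:
  type: LoadBalancer
  selector:
    app: web
  ports:
  - port: 80
    targetPort: 8080
```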
Worker Node Components
Worker nodes run the containerized applications. Each node contains the services necessary to run pods and is managed by the control plane.
kubelet
The kubelet is the primary node agent that runs on each node. It ensures containers are running in a pod as specified.
Key Responsibilities:
Register node with API server
Watch for pod assignments to its node
Pull container images
Start and stop containers
Report pod and node status
Execute liveness and readiness probes
Mount volumes
How kubelet Works:
The kubelet watches the API server for pods bound to its node (and reads static pod manifests from local disk), directs the container runtime through the CRI to pull images and start or stop containers, runs the configured liveness and readiness probes, and continuously reports pod and node status back to the API server.
kubelet Configuration:
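A sketch of a KubeletConfiguration file (typically /var/lib/kubelet/config.yaml; the values here are illustrative, not recommendations):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
maxPods: 110
evictionHard:
  memory.available: "200Mi"   # evict pods when free memory drops below this
  nodefs.available: "10%"
systemReserved:               # capacity held back for OS daemons
  cpu: "500m"
  memory: "512Mi"
serializeImagePulls: false    # pull images in parallel
```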
Static Pods:
kubelet can manage static pods directly without the API server:
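For example, a manifest dropped into the kubelet's staticPodPath (default /etc/kubernetes/manifests on kubeadm clusters):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: static-web
spec:
  containers:
  - name: web
    image: nginx:1.25
    ports:
    - containerPort: 80
```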
The kubelet automatically creates and manages this pod. A read-only mirror pod appears in the API server so you can see it with kubectl, but the pod can only be deleted by removing the file from the node.
kube-proxy
kube-proxy maintains network rules on nodes, implementing part of the Kubernetes Service concept. It enables the Service abstraction by maintaining network rules and performing connection forwarding.
Key Responsibilities:
Maintain network rules for Services
Implement Service load balancing
Handle iptables/ipvs rules
Enable pod-to-service communication
Proxy Modes:
1. iptables Mode (default):
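In this mode kube-proxy programs NAT rules that pick a backend pod for each new connection. You can inspect them on a node (chain names carry per-cluster hashes, so yours will differ):

```bash
# Service-level chains created by kube-proxy:
sudo iptables -t nat -L KUBE-SERVICES | head

# Per-service load-balancing rules:
sudo iptables-save -t nat | grep KUBE-SVC
```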
2. IPVS Mode (more scalable):
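IPVS mode uses the kernel's IP Virtual Server, which keeps lookups fast even with thousands of Services. It is enabled through the kube-proxy configuration and inspected with ipvsadm:

```bash
# In the KubeProxyConfiguration (kubeproxy.config.k8s.io/v1alpha1), set:
#   mode: "ipvs"

# List the virtual servers and their backend pods:
sudo ipvsadm -Ln
```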
kube-proxy DaemonSet:
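kube-proxy ships as a DaemonSet so exactly one instance runs on every node; on kubeadm-based clusters you can inspect it with:

```bash
kubectl get daemonset kube-proxy -n kube-system
kubectl get pods -n kube-system -l k8s-app=kube-proxy -o wide
```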
Container Runtime
The container runtime is responsible for running containers. Kubernetes supports several runtimes through the Container Runtime Interface (CRI).
Supported Runtimes:
containerd (most common, CNCF project)
CRI-O (lightweight, OCI-focused)
Docker Engine (via cri-dockerd shim)
Container Runtime Interface:
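The CRI is a gRPC API served over a local socket; crictl is the standard CLI for talking to any CRI-compatible runtime:

```bash
# Point crictl at the runtime's socket (containerd shown here):
sudo crictl --runtime-endpoint unix:///run/containerd/containerd.sock ps
sudo crictl pods
sudo crictl images
```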
containerd Configuration:
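A fragment of /etc/containerd/config.toml relevant to Kubernetes (config format version 2; the pause image tag varies by release):

```toml
version = 2

[plugins."io.containerd.grpc.v1.cri"]
  sandbox_image = "registry.k8s.io/pause:3.9"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
  SystemdCgroup = true   # match the kubelet's systemd cgroup driver
```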
The Kubernetes Control Loop
The control loop is the core of Kubernetes' self-healing nature. Understanding this loop is essential to understanding how Kubernetes works.
Reconciliation Loop
Every controller runs the same basic cycle: observe the current state, compare it against the desired state recorded through the API server, act to close any gap, and repeat.
Example: Deployment Controller Flow
Let's trace what happens when you create a Deployment:
Step-by-Step Process:
1. kubectl posts the Deployment to the API server, which validates it and persists it to etcd
2. The Deployment controller sees the new Deployment and creates a matching ReplicaSet
3. The ReplicaSet controller sees the new ReplicaSet and creates the specified number of Pod objects
4. The scheduler sees the unscheduled pods and binds each one to a node
5. The kubelet on each chosen node sees its assignment, pulls the images, and starts the containers
6. Status flows back through the API server, and the Deployment's observed state converges on the desired state
Continuous Reconciliation:
Controllers continuously reconcile:
Deployment Controller ensures correct ReplicaSet exists
ReplicaSet Controller ensures correct number of pods
Node Controller monitors node health
Endpoints Controller updates Service endpoints
If a pod dies, the kubelet reports the change, the ReplicaSet controller notices that the observed replica count no longer matches the desired count, and it creates a replacement pod, which the scheduler binds and a kubelet starts. No human intervention is required.
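You can watch this happen yourself (the label and pod name below are illustrative):

```bash
# Stream pod changes in the background, then kill one replica:
kubectl get pods -l app=web --watch &
kubectl delete pod web-7d4b9c6f5-abcde

# Within seconds a replacement appears: the ReplicaSet controller saw
# observed replicas < desired replicas and created a new pod.
```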
Communication Patterns
Understanding how components communicate is crucial for troubleshooting.
Communication Flow
Key Communication Rules:
Only the API server talks to etcd - All state changes go through the API
Components use watches, not polling - Efficient event-driven architecture
All communication is authenticated - Mutual TLS between components
API server is the communication hub - Components never talk to each other directly; they coordinate through API objects
Network Policies for Control Plane
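Control plane components usually run with host networking, where NetworkPolicy does not apply, so master nodes are typically protected with node-level firewalls instead. For cluster-admin tooling that runs as ordinary pods, an illustrative policy looks like this (names and labels are assumptions):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: restrict-cluster-tools
  namespace: ops
spec:
  podSelector:
    matchLabels:
      app: cluster-tools      # the pods being protected
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          role: admin         # only admin-labeled pods may connect
    ports:
    - protocol: TCP
      port: 443
```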
High Availability Architecture
Production clusters require high availability for the control plane.
HA Control Plane
HA Considerations:
API Server: All instances are active (load balanced)
etcd: Cluster with quorum (3 or 5 nodes)
Scheduler: One active, others on standby (leader election)
Controller Manager: One active, others on standby (leader election)
Leader Election Configuration:
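Leader election is controlled with flags on the scheduler and controller manager (the values shown are the upstream defaults), and the current leader is recorded in a Lease object:

```bash
kube-scheduler \
  --leader-elect=true \
  --leader-elect-lease-duration=15s \
  --leader-elect-renew-deadline=10s \
  --leader-elect-retry-period=2s

# See which instance currently holds the lock:
kubectl get lease -n kube-system kube-scheduler -o yaml
```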
Stacked vs External etcd
Stacked etcd topology: each control plane node runs its own etcd member alongside the API server, scheduler, and controller manager. This is simpler to set up and uses fewer machines, but losing one node removes both a control plane instance and an etcd member.
External etcd topology: etcd runs on dedicated hosts separate from the control plane nodes. This requires more machines but isolates etcd from control plane failures and resource contention.
Architecture Best Practices
Control Plane
Run at least 3 master nodes for production
Use external etcd for large clusters (>100 nodes)
Monitor etcd performance - it's the most critical component
Regular etcd backups - automated and tested
Separate master and worker nodes - don't schedule workloads on masters
Resource reservations for control plane components
Worker Nodes
Right-size nodes - balance between too many small nodes and few large nodes
Use node pools for different workload types
Configure resource reservations for system components
Enable swap accounting for better resource management
Monitor node resources and set up autoscaling
Networking
Choose the right CNI for your use case (Calico, Cilium, Flannel)
Plan IP address spaces carefully
Implement Network Policies for security
Use appropriate Service types for different scenarios
Security
Enable RBAC and follow principle of least privilege
Use Pod Security Standards (Baseline, Restricted)
Encrypt etcd data at rest
Rotate certificates regularly
Enable audit logging
Common Architecture Issues
Issue 1: etcd Performance Degradation
Symptoms:
Slow API responses
Controller delays
Watch events delayed
Diagnosis:
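Start by checking member health and disk latency (kubeadm-style certificate paths and metrics port assumed):

```bash
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint status --write-out=table

# Sustained p99 fsync latency above ~10ms usually means the disk is too slow
# (kubeadm exposes etcd metrics on http://127.0.0.1:2381 by default):
curl -s http://127.0.0.1:2381/metrics | grep etcd_disk_wal_fsync_duration
```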
Solutions:
Use SSDs for etcd
Defragment etcd database (see the commands after this list)
Compact etcd history
Consider scaling etcd cluster
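The defragment and compact steps look like this in practice (certificate flags omitted for brevity; they match the backup example above):

```bash
# Find the current revision, compact history up to it, then defragment:
rev=$(ETCDCTL_API=3 etcdctl endpoint status --write-out=json | egrep -o '"revision":[0-9]*' | egrep -o '[0-9].*')
ETCDCTL_API=3 etcdctl compact "$rev"
ETCDCTL_API=3 etcdctl defrag
```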
Issue 2: Scheduler Not Scheduling Pods
Symptoms:
Pods stuck in Pending state
No node assignment
Diagnosis:
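The pod's events usually state exactly which filter rejected each node:

```bash
# The Events section explains why scheduling failed:
kubectl describe pod <pod-name>

# Compare allocatable capacity with what is already requested per node:
kubectl describe nodes | grep -A 5 "Allocated resources"

# Check for taints the pod may not tolerate:
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints}{"\n"}{end}'
```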
Common Causes:
Insufficient resources
Node taints without tolerations
Node affinity not satisfied
Volume binding issues
Issue 3: Control Plane Communication Issues
Symptoms:
Components can't reach API server
Certificate errors
Authentication failures
Diagnosis:
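Useful first checks (the control plane address is a placeholder; kubeadm paths assumed):

```bash
# Is the API server reachable and healthy?
kubectl get --raw /healthz
curl -k https://<control-plane-ip>:6443/healthz

# Check certificate expiry on kubeadm clusters:
kubeadm certs check-expiration
openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -dates
```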
Solutions:
Renew expired certificates (on kubeadm clusters: kubeadm certs renew all, then restart the control plane pods)
Verify that the load balancer or DNS name in front of the API server resolves and forwards port 6443
Confirm each component's kubeconfig points at the correct API server endpoint
Ensure the required ports are open between nodes (6443 for the API server, 2379-2380 for etcd, 10250 for the kubelet)
Issue 4: Worker Node NotReady
Symptoms:
Nodes show NotReady status
Pods evicted from node
Diagnosis:
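Check the node's conditions first, then the kubelet on the affected host:

```bash
# Conditions show memory, disk, or PID pressure and the kubelet's last heartbeat:
kubectl describe node <node-name>

# On the node itself:
systemctl status kubelet
journalctl -u kubelet --since "30 min ago" | tail -n 50
```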
Common Causes:
kubelet crashed
Network plugin issues
Resource pressure (disk, memory)
Certificate problems
What I Learned
Understanding Kubernetes architecture transformed how I approach container orchestration challenges:
Start with the API Server: Everything in Kubernetes flows through the API server. When troubleshooting, check API server logs first, then work backward to other components.
etcd is Critical: The health of your cluster depends on etcd performance. I learned to monitor etcd metrics closely and ensure it runs on fast SSDs with dedicated resources.
Controllers are Independent: Each controller works independently, watching for its specific resources. This design makes Kubernetes resilient but also means you need to understand the reconciliation loop to debug issues effectively.
The Declarative Model Works: Specifying desired state and letting controllers reconcile it is more reliable than imperative commands. Trust the control loop—it will eventually converge to desired state.
Component Communication Matters: Understanding that only the API server talks to etcd, and all other components use watches (not polling), helps explain why certain operations are fast and others are slow.
High Availability Requires Planning: Don't wait until production to think about HA. Design for it from the start—etcd quorum, leader election, and load balancing all need careful consideration.
Security is Built-In: Kubernetes' architecture includes security by design—mutual TLS, RBAC, admission controllers. Use these features; don't work around them.
The architecture of Kubernetes reflects years of experience running distributed systems at scale. Understanding these components and how they interact gives you the foundation to build reliable, scalable applications on Kubernetes. In the next articles, we'll build on this architectural knowledge to explore practical implementation patterns and best practices.