# Backup and Recovery

*My journey from praying nothing breaks to sleeping soundly with automated backups*

***

## Table of Contents

1. [Introduction: The Backup I Never Made](#introduction-the-backup-i-never-made)
2. [Backup Strategies Overview](#backup-strategies-overview)
3. [pg\_dump: Logical Backups](#pg_dump-logical-backups)
4. [pg\_basebackup: Physical Backups](#pg_basebackup-physical-backups)
5. [Continuous Archiving and PITR](#continuous-archiving-and-pitr)
6. [Restoring from Backups](#restoring-from-backups)
7. [Backup Automation and Scheduling](#backup-automation-and-scheduling)
8. [Verifying Your Backups](#verifying-your-backups)
9. [Replication for High Availability](#replication-for-high-availability)
10. [Cloud Backup Solutions](#cloud-backup-solutions)
11. [Disaster Recovery Planning](#disaster-recovery-planning)
12. [What I Learned About Backups](#what-i-learned-about-backups)
13. [Next Steps](#next-steps)

***

## Introduction: The Backup I Never Made

I had been developing my blog for two years when my VPS provider sent an email: "Your server has experienced a hardware failure and cannot be recovered."

I went cold. I searched my laptop for any local copies. I found SQL export files—but they were eight months old. My staging environment had a copy but it was three weeks behind.

I lost eight months of posts, all the comments and user data from that period, and countless hours of work.

The sickening part? Backups would have taken 30 minutes to set up properly. I had been saying "I'll do it properly later" for two years.

Don't be me. Set up backups today.

***

## Backup Strategies Overview

{% @mermaid/diagram content="graph TD
Backup\["Backup Strategies"] --> Logical
Backup --> Physical
Backup --> Continuous

Logical\["Logical Backups\n(pg_dump)"] --> L1\["SQL dump file\n(portable, human-readable)"]
Physical\["Physical Backups\n(pg_basebackup)"] --> P1\["Binary copy of data directory\n(faster, larger)"]
Continuous\["Continuous Archiving\n(WAL archiving + PITR)"] --> C1\["Point-in-time recovery\n(recover to any moment)"]" %}

| Strategy                | Recovery Point    | Recovery Time         | Use Case                     |
| ----------------------- | ----------------- | --------------------- | ---------------------------- |
| pg\_dump (daily)        | 24 hours ago      | Minutes-hours         | Small databases, portability |
| pg\_basebackup (weekly) | Last full backup  | Minutes               | Medium databases             |
| WAL archiving + PITR    | Any point in time | Minutes + replay time | Production, compliance       |
| Streaming Replication   | Near real-time    | Seconds               | High availability            |

**The 3-2-1 Rule:** 3 copies, 2 different media, 1 offsite location.
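As a rough sketch (the directory and bucket names here are hypothetical), the rule can be wired into a post-backup step: one copy stays next to the database, a second goes to different media, and a third goes offsite:

```shell
# replicate_backup FILE: implement copies 2 and 3 of the 3-2-1 rule.
# SECOND_COPY_DIR should live on different media (e.g. an external disk).
replicate_backup() {
    local backup_file="$1"
    local second_copy_dir="${SECOND_COPY_DIR:-/mnt/external/backups}"
    mkdir -p "${second_copy_dir}"
    cp "${backup_file}" "${second_copy_dir}/"   # copy 2: different media
    # Copy 3: offsite, e.g. object storage (uncomment with your own bucket):
    # aws s3 cp "${backup_file}" "s3://my-blog-backups/postgresql/"
    echo "Replicated $(basename "${backup_file}") to ${second_copy_dir}"
}
```

Call it from the end of your backup script so the extra copies happen automatically, not when you remember.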

***

## pg\_dump: Logical Backups

`pg_dump` exports a database as SQL statements or a compressed binary archive. It takes a consistent MVCC snapshot, so it doesn't block normal reads and writes while it runs.

### Basic Usage

```bash
# Dump single database to SQL file
pg_dump -U postgres -d blog_db > blog_db_$(date +%Y%m%d_%H%M%S).sql

# Dump to compressed custom format (recommended)
pg_dump -U postgres -d blog_db -Fc -f blog_db_$(date +%Y%m%d).dump

# With connection details
pg_dump -h localhost -p 5432 -U blog_admin -d blog_db -Fc \
    -f /backups/blog_db_$(date +%Y%m%d).dump
```

### pg\_dump Formats

| Format    | Flag            | Notes                           |
| --------- | --------------- | ------------------------------- |
| Plain SQL | `-Fp` (default) | Human-readable, largest size    |
| Custom    | `-Fc`           | Compressed, selective restore   |
| Directory | `-Fd`           | Parallel dump (large databases) |
| Tar       | `-Ft`           | Archive, no compression         |

### Selective Dumps

```bash
# Dump only specific tables
pg_dump -U postgres -d blog_db -Fc \
    -t posts -t authors -t comments \
    -f posts_backup.dump

# Dump schema only (no data)
pg_dump -U postgres -d blog_db --schema-only -f schema_only.sql

# Dump data only (no schema)
pg_dump -U postgres -d blog_db --data-only -f data_only.sql

# Exclude a table
pg_dump -U postgres -d blog_db -Fc \
    --exclude-table=activity_log \
    -f blog_db_no_logs.dump
```

### Dump All Databases

```bash
# pg_dumpall includes globals (users, roles, tablespaces)
pg_dumpall -U postgres > all_databases_$(date +%Y%m%d).sql

# Only dump globals (roles and permissions)
pg_dumpall -U postgres --globals-only > globals_$(date +%Y%m%d).sql
```

***

## pg\_basebackup: Physical Backups

`pg_basebackup` copies the entire PostgreSQL data directory as a binary backup. It's faster for large databases but less portable (same PostgreSQL major version required for restore).

```bash
# Full backup to directory
pg_basebackup -U replication_user -h localhost \
    -D /backups/base_$(date +%Y%m%d) \
    -P --wal-method=stream

# Compressed tar backup
pg_basebackup -U replication_user -h localhost \
    -D /backups -Ft -z \
    -P --wal-method=stream

# -P = progress indicator
# --wal-method=stream = include WAL files needed for consistent restore
```

### Create a Replication Role for Backups

```sql
-- Dedicated user for backup operations
CREATE ROLE backup_user WITH REPLICATION LOGIN PASSWORD 'strong_backup_password';
```

```bash
# In pg_hba.conf, allow replication connections
# host  replication  backup_user  backup_server_ip/32  scram-sha-256
```

***

## Continuous Archiving and PITR

Write-Ahead Logging (WAL) records every database change. By archiving WAL files, you can restore to any point in time—not just the last backup.

### Configure WAL Archiving

```bash
# In postgresql.conf:
wal_level = replica           # or logical
archive_mode = on
archive_command = 'cp %p /mnt/wal_archive/%f'
# Or to S3:
# archive_command = 'aws s3 cp %p s3://my-bucket/wal/%f'
archive_timeout = 300         # Force a WAL segment switch (and archive) at least every 5 minutes
```
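The plain `cp` archive\_command above works, but the PostgreSQL docs recommend refusing to overwrite an existing archived segment, so a misconfigured second server can't silently clobber your archive. A sketch of such a wrapper (script path hypothetical; you would wire it up as `archive_command = '/opt/scripts/archive_wal.sh %p %f'`):

```shell
# archive_wal SRC NAME: copy WAL segment SRC into the archive as NAME,
# failing with a non-zero exit (so PostgreSQL retries later) if NAME
# already exists in the archive.
archive_wal() {
    local src="$1" name="$2"
    local archive_dir="${WAL_ARCHIVE_DIR:-/mnt/wal_archive}"
    if [ -f "${archive_dir}/${name}" ]; then
        echo "ERROR: ${name} already present in archive" >&2
        return 1
    fi
    cp "${src}" "${archive_dir}/${name}"
}
```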

### Full Backup + WAL Archive Workflow

```bash
# 1. Take a base backup nightly
pg_basebackup -U backup_user -D /backups/base_$(date +%Y%m%d) \
    --wal-method=stream -P

# 2. WAL files are archived automatically by archive_command
# 3. For PITR: restore base backup, then replay WAL to target time
```

### Point-in-Time Recovery

```bash
# 1. Stop PostgreSQL
sudo systemctl stop postgresql

# 2. Move current data directory
sudo mv /var/lib/postgresql/14/main /var/lib/postgresql/14/main.old

# 3. Restore the base backup
sudo mkdir -p /var/lib/postgresql/14/main
sudo cp -a /backups/base_20260115/. /var/lib/postgresql/14/main/
sudo chown -R postgres:postgres /var/lib/postgresql/14/main/
sudo chmod 700 /var/lib/postgresql/14/main

# 4. Append recovery settings (run with root privileges)
sudo tee -a /var/lib/postgresql/14/main/postgresql.auto.conf << EOF
restore_command = 'cp /mnt/wal_archive/%f %p'
recovery_target_time = '2026-02-20 14:30:00'
recovery_target_action = 'promote'
EOF

# Create recovery signal file (PostgreSQL 12+)
sudo touch /var/lib/postgresql/14/main/recovery.signal

# 5. Start PostgreSQL - it will replay WAL to the target time
sudo systemctl start postgresql
```

***

## Restoring from Backups

### Restore from pg\_dump

```bash
# Restore SQL dump
psql -U postgres -d blog_db_restored < backup_20260220.sql

# Restore custom format dump
pg_restore -U postgres -d blog_db_restored \
    --clean --if-exists \
    backup_20260220.dump

# Restore only specific tables
pg_restore -U postgres -d blog_db_restored \
    -t posts -t authors \
    backup_20260220.dump

# Restore to a new database
createdb -U postgres blog_db_restored
pg_restore -U postgres -d blog_db_restored backup_20260220.dump
```

### Restore Specific Tables (Emergency Recovery)

```bash
# Extract just the posts table from a backup
pg_restore -U postgres -d blog_db_restored \
    --table=posts backup_20260220.dump

# Then copy the recovered rows back to production, e.g. via a CSV round-trip
# (adjust the WHERE clause to the rows you actually lost):
psql -U postgres -d blog_db_restored -c \
    "\copy (SELECT * FROM posts WHERE id BETWEEN 100 AND 120) TO '/tmp/recovered.csv' CSV"
psql -U postgres -d blog_db -c "\copy posts FROM '/tmp/recovered.csv' CSV"
# This is useful when you need to recover accidentally deleted rows
```

### Verify Restored Data

```sql
-- Quick sanity checks after restore
SELECT COUNT(*) FROM posts;
SELECT COUNT(*) FROM authors;
SELECT MAX(created_at) FROM posts;  -- Check latest data is present
SELECT * FROM posts ORDER BY created_at DESC LIMIT 5;
```

***

## Backup Automation and Scheduling

### Shell Script for Daily Backups

```bash
#!/bin/bash
# /opt/scripts/backup_blog_db.sh

set -e  # Exit on any error

BACKUP_DIR="/backups/postgresql"
DB_NAME="blog_db"
DB_USER="backup_user"
DATE=$(date +%Y%m%d_%H%M%S)
BACKUP_FILE="${BACKUP_DIR}/${DB_NAME}_${DATE}.dump"
RETENTION_DAYS=30

# Create backup directory if it doesn't exist
mkdir -p "${BACKUP_DIR}"

# Create backup
echo "Starting backup of ${DB_NAME}..."
pg_dump -U "${DB_USER}" -d "${DB_NAME}" -Fc -f "${BACKUP_FILE}"

# Verify backup was created
if [ ! -f "${BACKUP_FILE}" ]; then
    echo "ERROR: Backup file not created!"
    exit 1
fi

BACKUP_SIZE=$(du -sh "${BACKUP_FILE}" | cut -f1)
echo "Backup created: ${BACKUP_FILE} (${BACKUP_SIZE})"

# Upload to S3 (optional)
# aws s3 cp "${BACKUP_FILE}" "s3://my-blog-backups/postgresql/"

# Remove backups older than retention period
find "${BACKUP_DIR}" -name "*.dump" -mtime +${RETENTION_DAYS} -delete
echo "Removed backups older than ${RETENTION_DAYS} days"

echo "Backup completed successfully"
```

### Scheduling with Cron

```bash
# Edit crontab
crontab -e

# Daily backup at 2 AM
0 2 * * * /opt/scripts/backup_blog_db.sh >> /var/log/pg_backup.log 2>&1

# Weekly full backup at 1 AM on Sundays
0 1 * * 0 /opt/scripts/full_backup.sh >> /var/log/pg_full_backup.log 2>&1
```

### Scheduling with pg\_cron

```sql
-- pg_cron can run backups from within PostgreSQL (using COPY to file)
CREATE EXTENSION IF NOT EXISTS pg_cron;

-- Export to CSV nightly (for smaller, selective backups)
-- Note: server-side COPY TO a file requires superuser or pg_write_server_files
SELECT cron.schedule(
    'nightly-posts-export',
    '0 3 * * *',
    $$COPY posts TO '/backups/posts_export.csv' WITH CSV HEADER$$
);
```

***

## Verifying Your Backups

**An unverified backup is not a backup.** Schedule regular restore tests:

```bash
#!/bin/bash
# /opt/scripts/verify_backup.sh
# Run weekly to verify the latest backup is valid

BACKUP_DIR="/backups/postgresql"
TEST_DB="blog_db_backup_test"
LATEST_BACKUP=$(ls -t "${BACKUP_DIR}"/*.dump | head -1)

echo "Verifying backup: ${LATEST_BACKUP}"

# Create test database
createdb -U postgres "${TEST_DB}"

# Restore backup
pg_restore -U postgres -d "${TEST_DB}" "${LATEST_BACKUP}"
RESTORE_STATUS=$?

if [ $RESTORE_STATUS -ne 0 ]; then
    echo "ERROR: Restore failed!"
    dropdb -U postgres "${TEST_DB}"
    exit 1
fi

# Run verification queries
POST_COUNT=$(psql -U postgres -d "${TEST_DB}" -t -c "SELECT COUNT(*) FROM posts")
AUTHOR_COUNT=$(psql -U postgres -d "${TEST_DB}" -t -c "SELECT COUNT(*) FROM authors")

echo "Restored: ${POST_COUNT} posts, ${AUTHOR_COUNT} authors"

# Clean up
dropdb -U postgres "${TEST_DB}"
echo "Backup verification successful: ${LATEST_BACKUP}"
```
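Restore tests catch corrupt backups; a freshness check catches backup jobs that silently stopped running. A minimal sketch (directory and threshold are illustrative) that cron or a monitoring agent can run:

```shell
# check_backup_freshness DIR MAX_AGE_HOURS: succeed only if DIR contains
# a *.dump file modified within the last MAX_AGE_HOURS hours.
check_backup_freshness() {
    local dir="$1" max_age_hours="$2"
    local newest
    newest=$(find "${dir}" -name '*.dump' -mmin "-$((max_age_hours * 60))" | head -1)
    if [ -z "${newest}" ]; then
        echo "ALERT: no backup newer than ${max_age_hours}h in ${dir}" >&2
        return 1
    fi
    echo "OK: recent backup found: ${newest}"
}

# Example: alert if the daily job has missed more than one run
# check_backup_freshness /backups/postgresql 26
```

Hook the non-zero exit code into whatever alerting you already have, so a dead cron job pages someone instead of going unnoticed for months.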

***

## Replication for High Availability

Streaming replication keeps a standby server up to date with the primary in near real-time:

{% @mermaid/diagram content="graph LR
Primary\["Primary Server\n(Read/Write)"] -- "WAL Stream" --> Standby1\["Standby 1\n(Hot Standby - Read Only)"]
Primary -- "WAL Stream" --> Standby2\["Standby 2\n(Warm Standby)"]" %}

### Setting Up a Standby Server

```bash
# On the primary server (postgresql.conf):
wal_level = replica
max_wal_senders = 3
wal_keep_size = 1GB

# Create replication role
psql -c "CREATE ROLE replica_user WITH REPLICATION LOGIN PASSWORD 'password';"

# In pg_hba.conf:
# host  replication  replica_user  standby_ip/32  scram-sha-256

# On the standby server:
# Take initial backup from primary
pg_basebackup -h primary_ip -U replica_user \
    -D /var/lib/postgresql/14/main \
    -P --wal-method=stream -R
# -R creates postgresql.auto.conf with primary_conninfo automatically

# Start standby
systemctl start postgresql
```

```sql
-- Monitor replication lag on primary
SELECT
    client_addr,
    state,
    sent_lsn,
    write_lsn,
    flush_lsn,
    replay_lsn,
    write_lag,
    flush_lag,
    replay_lag
FROM pg_stat_replication;
```

### Failover

```bash
# If primary fails, promote the standby to primary
pg_ctl promote -D /var/lib/postgresql/14/main
# Or from SQL on the standby:
# SELECT pg_promote();
```

Use **Patroni** or **pg\_auto\_failover** for automated failover in production.

***

## Cloud Backup Solutions

Most cloud-managed PostgreSQL services handle backups automatically:

| Service                       | Backup Type         | PITR       | Retention      |
| ----------------------------- | ------------------- | ---------- | -------------- |
| AWS RDS PostgreSQL            | Automated snapshots | ✅ Yes      | 1–35 days      |
| AWS Aurora PostgreSQL         | Continuous backup   | ✅ Yes      | 1–35 days      |
| GCP Cloud SQL                 | Automated backups   | ✅ Yes      | Up to 365 days |
| Azure Database for PostgreSQL | Full + differential | ✅ Yes      | 7–35 days      |
| Supabase                      | Daily backups       | ✅ Pro plan | 7 days         |

Even with managed backups, test restores regularly and keep your own exports for compliance.

***

## Disaster Recovery Planning

A backup without a recovery plan is just a file. Document your recovery procedure:

### Recovery Time Objective (RTO) and Recovery Point Objective (RPO)

* **RTO**: How long can the service be down? (1 hour? 4 hours? 24 hours?)
* **RPO**: How much data can you lose? (Last hour? Last day? Zero data loss?)

These determine your backup strategy:

| RPO       | RTO        | Strategy                                |
| --------- | ---------- | --------------------------------------- |
| 1 hour    | 4 hours    | Daily pg\_dump + hourly WAL archiving   |
| 5 minutes | 30 minutes | WAL archiving every 5 min + hot standby |
| \~0       | Seconds    | Synchronous replication + hot standby   |

### Runbook Template

```markdown
## Database Recovery Runbook

### Contacts
- DBA on-call: [phone/slack]
- Cloud infrastructure: [phone/slack]

### Scenario 1: Accidental Data Deletion

1. Identify the time of deletion from application logs
2. Create a test database and restore from last backup
3. Extract the deleted rows using pg_restore --table
4. Insert the recovered rows back into production

Estimated time: 30–60 minutes
RPO impact: Data since last backup

### Scenario 2: Full Server Failure

1. Provision a new server (AMI/snapshot or fresh install)
2. Install PostgreSQL same major version
3. Restore latest base backup
4. Apply WAL archives from archive_command target
5. Verify data integrity with verification queries
6. Update DNS/connection strings
7. Run smoke tests

Estimated time: 1–3 hours
RPO impact: Data since last WAL archive
```

***

## What I Learned About Backups

That data loss incident was the defining moment in my relationship with infrastructure. The lesson was brutal but clear: **the data you haven't backed up doesn't really belong to you—you're just borrowing it.**

The mindset shift that made everything click:

**Backups are not for when things go wrong. Backups are for when things go wrong *and* you don't panic.**

My current setup:

* Daily automated pg\_dump to S3 with 30-day retention
* WAL archiving with 7-day PITR window
* Weekly automated restore test to a throw-away database
* Monitoring alert if that test fails

Total setup time: about 2 hours. Total peace of mind: priceless.

**The non-negotiables:**

* ✅ Automate backups—manual backups don't get done
* ✅ Test restores regularly—untested backups are a false sense of security
* ✅ Store backups offsite or in a different availability zone
* ✅ Document your recovery procedure before you need it
* ✅ Monitor that your backup jobs are completing successfully

***

## Next Steps

With backups secured, we complete the series with comprehensive **Database Design Best Practices**—pulling together everything we've learned into production-ready patterns.

* [**→ Next: Database Design Best Practices**](https://blog.htunnthuthu.com/getting-started/programming/database-101/database-101-database-design-best-practices)
* [**← Previous: Database Security**](https://blog.htunnthuthu.com/getting-started/programming/database-101/database-101-database-security)

***

*Part of the* [*Database 101 Series*](https://blog.htunnthuthu.com/getting-started/programming/database-101)
