Part 6: Performance Optimization and Production Best Practices

← Part 5: Advanced Queries | Part 7: Real-World Applications →

The 2-Second Query That Cost Us Users

Our documentation search was working great... until it wasn't.

The problem: Search queries started taking 2-3 seconds. Users complained. Bounce rate spiked from 8% to 34%.

Root cause analysis:

Database had grown to 500K document chunks
No HNSW index (was using sequential scan!)
Every search generated a new embedding (no caching)
No connection pooling
No query result caching

One weekend of optimization later:

Query time: 2.3s → 47ms (98% improvement)
Bounce rate: 34% → 6% (better than before!)
Server costs: Same hardware handling 10x traffic

This article shows you every optimization I implemented to get there.

Query Performance Optimization

1. Create Proper Indexes

-- Without index: ~2000ms (sequential scan)
-- With HNSW index: ~50ms (index scan)

CREATE INDEX CONCURRENTLY document_embedding_hnsw_idx 
ON documents 
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

-- CONCURRENTLY: allows reads and writes during index build (a plain CREATE INDEX blocks writes)
-- m = 16: good balance of recall vs memory
-- ef_construction = 64: higher = better quality, slower build

Verify index is being used:

EXPLAIN ANALYZE
SELECT id, title, 1 - (embedding <=> '[...]'::vector) as similarity
FROM documents
ORDER BY embedding <=> '[...]'::vector
LIMIT 10;

-- Look for: "Index Scan using document_embedding_hnsw_idx"
-- NOT:      "Seq Scan on documents"
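If you want to automate this check, for example in a deployment smoke test, one option is to scan the plan text for the two markers above. The sketch below is illustrative only: `planUsesIndex` is a hypothetical helper, and it assumes you pass in the rows of the "QUERY PLAN" column that your database client returns for the EXPLAIN statement.

```typescript
// Hypothetical helper: decide from EXPLAIN output whether the vector index was used.
// `planLines` is assumed to hold the plan text, one string per output row.
export function planUsesIndex(planLines: string[], indexName: string): boolean {
  const plan = planLines.join('\n');
  const hasIndexScan = plan.includes(`Index Scan using ${indexName}`);
  const hasSeqScan = /Seq Scan on \w+/.test(plan);
  return hasIndexScan && !hasSeqScan;
}
```

A plan that mentions both an index scan and a sequential scan is treated as a failure here, which is deliberately strict for a single-table query like the one above.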

2. Optimize Index Parameters

interface IndexConfig {
  // HNSW Parameters
  m: number;              // Max connections per layer (8-64)
  ef_construction: number; // Build-time search depth (32-512)
  ef_search: number;      // Query-time candidate list size (default 40, independent of ef_construction)
  
  // IVFFlat Parameters
  lists: number;          // Number of clusters (start with rows/1000 up to 1M rows, sqrt(rows) above)
  probes: number;         // Clusters to search at query time (start with sqrt(lists))
}

// Performance vs Accuracy tradeoffs
const configs = {
  // High performance (lower accuracy)
  fast: {
    m: 8,
    ef_construction: 32,
    ef_search: 16,
  },
  
  // Balanced (recommended for most use cases)
  balanced: {
    m: 16,
    ef_construction: 64,
    ef_search: 40,
  },
  
  // High accuracy (slower queries)
  accurate: {
    m: 32,
    ef_construction: 128,
    ef_search: 80,
  },
};
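The presets above only cover HNSW. For IVFFlat, the pgvector README suggests starting points rather than fixed values: `lists` of roughly rows/1000 for tables up to 1M rows and sqrt(rows) above that, and `probes` of roughly sqrt(lists). The helper below is a small sketch of that rule of thumb; the function name is made up for illustration, and the results are starting values to tune from, not guarantees.

```typescript
// Hypothetical helper: starting values for IVFFlat, based on the rule of thumb
// in the pgvector README. Tune against your own recall and latency measurements.
export function suggestIvfflatParams(rowCount: number): { lists: number; probes: number } {
  if (!Number.isFinite(rowCount) || rowCount < 1) {
    throw new RangeError('rowCount must be a positive number');
  }
  const rawLists = rowCount <= 1000000 ? rowCount / 1000 : Math.sqrt(rowCount);
  const lists = Math.max(1, Math.round(rawLists));
  const probes = Math.max(1, Math.round(Math.sqrt(lists)));
  return { lists, probes };
}
```

For the 500K-chunk table from the introduction this gives 500 lists and 22 probes.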

Set query-time parameters:

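// Increase search quality at query time.
// SET LOCAL inside a transaction keeps the setting on the same pooled
// connection as the query and resets it when the transaction ends.
const results = await prisma.$transaction(async (tx) => {
  await tx.$executeRaw`SET LOCAL hnsw.ef_search = 80`;

  // List columns explicitly: Prisma cannot deserialize the raw vector column
  return tx.$queryRaw`
    SELECT id, title, content FROM documents
    ORDER BY embedding <=> ${vector}::vector
    LIMIT 10
  `;
});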
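The queries in this article interpolate a JavaScript array and cast it with `::vector`. If your driver or setup does not send arrays in a form Postgres can cast, an alternative is to pass pgvector's text form, for example `[0.1,0.2,0.3]`, as a string parameter. The helper below is a minimal sketch under that assumption; `toVectorLiteral` is a hypothetical name, not something defined earlier in this series.

```typescript
// Hypothetical helper: format an embedding as pgvector's text representation.
// Rejects empty input and NaN/Infinity, which pgvector would refuse anyway.
export function toVectorLiteral(values: number[]): string {
  if (values.length === 0) {
    throw new Error('Embedding must not be empty');
  }
  for (const value of values) {
    if (!Number.isFinite(value)) {
      throw new Error(`Embedding contains a non-finite value: ${value}`);
    }
  }
  return `[${values.join(',')}]`;
}
```

The result would then be used as the query parameter, along the lines of `${toVectorLiteral(vector)}::vector`.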

3. Embedding Caching

Cache embeddings to avoid redundant API calls:

import { createClient } from 'redis';
import { createHash } from 'crypto';

const redis = createClient({ url: process.env.REDIS_URL });
await redis.connect();

export class CachedEmbeddingService {
  private static readonly TTL = 60 * 60 * 24 * 30; // 30 days
  
  /**
   * Get embedding with Redis caching
   */
  static async getEmbedding(text: string): Promise<number[]> {
    const hash = createHash('sha256').update(text).digest('hex');
    const cacheKey = `embedding:${hash}`;
    
    // Try cache first
    const cached = await redis.get(cacheKey);
    if (cached) {
      return JSON.parse(cached);
    }
    
    // Generate embedding
    const embedding = await EmbeddingService.getEmbedding(text);
    
    // Cache result
    await redis.setEx(cacheKey, this.TTL, JSON.stringify(embedding));
    
    return embedding;
  }
  
  /**
   * Preload common queries
   */
  static async preloadCommonQueries(queries: string[]) {
    const embeddings = await EmbeddingService.getBatchEmbeddings(queries);
    
    for (let i = 0; i < queries.length; i++) {
      const hash = createHash('sha256').update(queries[i]).digest('hex');
      const cacheKey = `embedding:${hash}`;
      
      await redis.setEx(
        cacheKey,
        this.TTL,
        JSON.stringify(embeddings[i])
      );
    }
    
    console.log(`Preloaded ${queries.length} common queries`);
  }
}
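Two details can noticeably affect how well this cache works: queries that differ only in whitespace produce different hashes, and embeddings from different models must never share a key. The sketch below shows one way to handle both. `embeddingCacheKey` and the `model` argument are assumptions for illustration; if you normalize the text for the key, it is safest to send the same normalized text to the embedding API.

```typescript
import { createHash } from 'crypto';

// Hypothetical helper: build a cache key that ignores whitespace differences
// and is scoped to the embedding model that produced the vector.
export function embeddingCacheKey(text: string, model: string = 'default'): string {
  const normalized = text.trim().replace(/\s+/g, ' ');
  const hash = createHash('sha256').update(`${model}\n${normalized}`).digest('hex');
  return `embedding:${model}:${hash}`;
}
```

Scoping the key by model also means a model upgrade starts with a cold cache instead of serving stale vectors for up to 30 days.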

Impact:

Query latency: 450ms → 45ms (when cached)
OpenAI API costs: -90%

4. Result Caching

import NodeCache from 'node-cache';

const searchCache = new NodeCache({
  stdTTL: 300,           // 5 minutes
  checkperiod: 60,       // Check for expired keys every 60s
  maxKeys: 10000,        // Limit cache size
});

export class CachedSearchService {
  static async search(query: string, options: SearchOptions) {
    const cacheKey = this.getCacheKey(query, options);
    
    // Check cache
    const cached = searchCache.get<SearchResult[]>(cacheKey);
    if (cached) {
      return { results: cached, fromCache: true };
    }
    
    // Execute search
    const results = await SearchService.searchArticles(query, options);
    
    // Cache results
    searchCache.set(cacheKey, results);
    
    return { results, fromCache: false };
  }
  
  private static getCacheKey(query: string, options: SearchOptions): string {
    return `search:${query}:${JSON.stringify(options)}`;
  }
  
  /**
   * Invalidate cache when data changes
   */
  static invalidate(_articleId?: number) {
    // Cache keys are built from the query text, so there is no way to look up
    // the entries that contain a given article. Flush everything for now.
    searchCache.flushAll();
  }
}
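One weakness of building the key with `JSON.stringify(options)` is that property order matters: `{ limit: 10, category: 'a' }` and `{ category: 'a', limit: 10 }` produce different keys for the same search. A possible fix is to serialize with sorted keys. The sketch below assumes the options are plain JSON-like values (no Dates, Maps, or class instances); both function names are hypothetical.

```typescript
// Hypothetical helper: JSON serialization with object keys in sorted order,
// so equal option objects always produce the same string.
export function stableStringify(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map(stableStringify).join(',')}]`;
  }
  if (value !== null && typeof value === 'object') {
    const obj = value as Record<string, unknown>;
    const entries = Object.keys(obj)
      .filter(key => obj[key] !== undefined) // undefined fields do not affect the search
      .sort()
      .map(key => `${JSON.stringify(key)}:${stableStringify(obj[key])}`);
    return `{${entries.join(',')}}`;
  }
  return JSON.stringify(value) ?? 'null';
}

export function searchCacheKey(query: string, options: Record<string, unknown>): string {
  return `search:${query}:${stableStringify(options)}`;
}
```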

5. Database Connection Pooling

// prisma/schema.prisma
datasource db {
  provider = "postgresql"
  url      = env("DATABASE_URL")
  // Connection pool configuration in URL:
  // postgresql://user:pass@host:5432/db?connection_limit=20&pool_timeout=10
}

// src/lib/db.ts
import { PrismaClient } from '@prisma/client';

// Singleton pattern for connection pooling
class DatabaseClient {
  private static instance: PrismaClient;
  
  static getInstance(): PrismaClient {
    if (!this.instance) {
      this.instance = new PrismaClient({
        log: ['error', 'warn'],
        datasources: {
          db: {
            url: process.env.DATABASE_URL,
          },
        },
      });
    }
    
    return this.instance;
  }
  
  static async disconnect() {
    if (this.instance) {
      await this.instance.$disconnect();
    }
  }
}

export const prisma = DatabaseClient.getInstance();
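The pool itself is configured through the `connection_limit` and `pool_timeout` parameters on the connection URL, as noted in the schema comment above. If you prefer to keep those values in code rather than in the environment variable, a small helper can append them. This is a sketch: `withPoolSettings` is a hypothetical name, and it assumes the URL is a standard `postgresql://` connection string.

```typescript
// Hypothetical helper: add Prisma pool settings to a connection URL.
// Existing values for the same parameters are overwritten.
export function withPoolSettings(
  databaseUrl: string,
  connectionLimit: number,
  poolTimeoutSeconds: number
): string {
  const url = new URL(databaseUrl);
  url.searchParams.set('connection_limit', String(connectionLimit));
  url.searchParams.set('pool_timeout', String(poolTimeoutSeconds));
  return url.toString();
}
```

The result could be passed as `datasources.db.url` when constructing the client, in place of the bare `process.env.DATABASE_URL`.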

PostgreSQL server settings (connection limit and memory):

# postgresql.conf
max_connections = 100
shared_buffers = 4GB
effective_cache_size = 12GB
maintenance_work_mem = 1GB
work_mem = 16MB
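`max_connections` is shared by every application instance, so the per-instance pool size has to fit inside it. The arithmetic can be sketched as follows; the function name and the default of 10 reserved connections (for migrations, monitoring, and admin sessions) are assumptions, not values from the setup described above.

```typescript
// Hypothetical helper: largest connection_limit each app instance can use
// without exceeding the server's max_connections.
export function maxPoolSizePerInstance(
  maxConnections: number,
  appInstances: number,
  reservedConnections: number = 10
): number {
  if (appInstances < 1) {
    throw new RangeError('appInstances must be at least 1');
  }
  const available = maxConnections - reservedConnections;
  return Math.max(0, Math.floor(available / appInstances));
}
```

With `max_connections = 100` and four instances, each instance can use at most 22 connections, which is consistent with the `connection_limit=20` shown earlier.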

6. Parallel Query Execution

export class ParallelSearchService {
  /**
   * Execute multiple searches in parallel
   */
  static async multiSearch(queries: string[]) {
    // Generate all embeddings in parallel
    const embeddingPromises = queries.map(q => 
      CachedEmbeddingService.getEmbedding(q)
    );
    const embeddings = await Promise.all(embeddingPromises);
    
    // Execute all searches in parallel
    // List columns explicitly: Prisma cannot deserialize the raw vector column
    const searchPromises = embeddings.map(embedding =>
      prisma.$queryRaw`
        SELECT id, title, content FROM documents
        ORDER BY embedding <=> ${embedding}::vector
        LIMIT 10
      `
    );
    const results = await Promise.all(searchPromises);
    
    return results;
  }
}
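`Promise.all` starts every query at once, so a large `queries` array can occupy the whole connection pool and starve other requests. One way to keep the parallelism while capping it is a small worker-pool helper. The sketch below is generic and not tied to Prisma; `mapWithConcurrency` is a hypothetical name.

```typescript
// Hypothetical helper: like Promise.all(items.map(worker)), but with at most
// `limit` workers running at the same time. Results keep the input order.
export async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>
): Promise<R[]> {
  if (limit < 1) {
    throw new RangeError('limit must be at least 1');
  }
  const results: R[] = new Array(items.length);
  let next = 0;

  // Each runner pulls the next unclaimed index until none are left.
  async function runWorker(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  const runners = Array.from({ length: Math.min(limit, items.length) }, () => runWorker());
  await Promise.all(runners);
  return results;
}
```

In `multiSearch`, the search step could then become something like `mapWithConcurrency(embeddings, 5, e => runSearch(e))`, where the limit stays well below the pool's `connection_limit`.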

Index Management Strategies

Progressive Index Building

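export class IndexManager {
  /**
   * Table names are spliced into DDL as text (identifiers cannot be bind
   * parameters), so only plain lowercase identifiers are allowed.
   */
  private static assertIdentifier(name: string): string {
    if (!/^[a-z_][a-z0-9_]*$/.test(name)) {
      throw new Error(`Invalid SQL identifier: ${name}`);
    }
    return name;
  }

  /**
   * Build index with progress tracking
   */
  static async buildIndex(tableName: string) {
    const table = this.assertIdentifier(tableName);
    console.log(`Building HNSW index for ${table}...`);
    
    const startTime = Date.now();
    
    // Set high maintenance work mem for faster build.
    // NOTE: SET is per session. With a connection pool the CREATE INDEX below
    // may run on a different connection, so run maintenance jobs on a
    // dedicated connection (for example connection_limit=1).
    await prisma.$executeRawUnsafe(`SET maintenance_work_mem = '2GB'`);
    
    // Create index (CONCURRENTLY cannot run inside a transaction)
    await prisma.$executeRawUnsafe(`
      CREATE INDEX CONCURRENTLY ${table}_embedding_idx
      ON ${table}
      USING hnsw (embedding vector_cosine_ops)
      WITH (m = 16, ef_construction = 64)
    `);
    
    const duration = Date.now() - startTime;
    console.log(`Index built in ${duration}ms`);
    
    // Analyze table for query planner
    await prisma.$executeRawUnsafe(`ANALYZE ${table}`);
  }
  
  /**
   * Monitor index health
   */
  static async getIndexStats(tableName: string) {
    // pg_stat_user_indexes exposes relname / indexrelname
    // (tablename / indexname are columns of pg_indexes)
    const stats = await prisma.$queryRaw<
      Array<{ indexname: string; size_bytes: bigint; size: string }>
    >`
      SELECT 
        schemaname,
        relname as tablename,
        indexrelname as indexname,
        idx_scan as scans,
        idx_tup_read as tuples_read,
        idx_tup_fetch as tuples_fetched,
        pg_relation_size(indexrelid) as size_bytes,
        pg_size_pretty(pg_relation_size(indexrelid)) as size
      FROM pg_stat_user_indexes
      WHERE relname = ${tableName}
        AND indexrelname LIKE '%embedding%'
    `;
    
    return stats;
  }
  
  /**
   * Rebuild index if performance degrades
   */
  static async rebuildIndexIfNeeded(tableName: string) {
    const table = this.assertIdentifier(tableName);
    const stats = await this.getIndexStats(table);
    
    // Heuristic: rebuild if index is >2x table size (compare raw byte counts)
    const tableSize = await this.getTableSize(table);
    const indexSize = Number(stats[0]?.size_bytes ?? 0);
    
    if (indexSize > tableSize * 2) {
      console.log('Index bloat detected, rebuilding...');
      await prisma.$executeRawUnsafe(
        `REINDEX INDEX CONCURRENTLY ${table}_embedding_idx`
      );
    }
  }
  
  private static async getTableSize(tableName: string): Promise<number> {
    // pg_table_size excludes indexes. pg_total_relation_size includes them,
    // which would make the ">2x table size" check impossible to satisfy.
    const result = await prisma.$queryRaw<[{ size: bigint }]>`
      SELECT pg_table_size(${tableName}::regclass) as size
    `;
    return Number(result[0].size);
  }
}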

Partial Indexes for Filtered Queries

-- Create partial index for frequently filtered category
CREATE INDEX documents_tech_embedding_idx 
ON documents 
USING hnsw (embedding vector_cosine_ops)
WHERE category = 'Technology';

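-- Create partial index for recent documents only.
-- Index predicates must be immutable, so NOW() is not allowed here.
-- Use a fixed cutoff (the date below is only an example) and recreate
-- the index periodically to move the window forward.
CREATE INDEX documents_recent_embedding_idx 
ON documents 
USING hnsw (embedding vector_cosine_ops)
WHERE created_at > '2025-01-01';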

Benefits:

Smaller index = faster queries
Lower memory usage
Better cache hit rate
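A partial index is only considered when the query's WHERE clause implies the index predicate. The PostgreSQL documentation notes that this matching happens at planning time, so a bind parameter such as `category = $1` may not qualify. One way around this is to inline the category as a literal, taken from a fixed allow-list so no user input reaches the SQL text. The sketch below assumes that approach; the function names and the allow-list are hypothetical.

```typescript
// Categories that have a matching partial index (assumption for illustration).
const PARTIAL_INDEX_CATEGORIES: readonly string[] = ['Technology'];

// Hypothetical helper: return the literal predicate that matches the partial index.
export function categoryPredicate(category: string): string {
  if (!PARTIAL_INDEX_CATEGORIES.includes(category)) {
    throw new Error(`No partial index for category: ${category}`);
  }
  return `category = '${category}'`;
}

// Build the search SQL. The embedding stays a bind parameter ($1).
export function buildCategorySearchSql(category: string): string {
  return [
    'SELECT id, title FROM documents',
    `WHERE ${categoryPredicate(category)}`,
    'ORDER BY embedding <=> $1::vector',
    'LIMIT 10',
  ].join('\n');
}
```

The resulting string could be run with `prisma.$queryRawUnsafe(sql, embedding)`. Checking the plan with EXPLAIN, as in section 1, confirms whether `documents_tech_embedding_idx` is actually chosen.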

Embedding Model Management

Version Embeddings

model Document {
  id              Int      @id @default(autoincrement())
  content         String
  
  // Support multiple embedding versions
  embedding_v1    Unsupported("vector(1536)")?
  embedding_v2    Unsupported("vector(1536)")?
  currentVersion  String   @default("v2") @map("current_version")
  
  createdAt       DateTime @default(now()) @map("created_at")

  // Map to the snake_case names used by the raw SQL in this article
  @@map("documents")
}

export class EmbeddingVersionManager {
  /**
   * Migrate to new embedding model
   */
  static async migrateEmbeddings(fromVersion: string, toVersion: string) {
    const batchSize = 100;
    let migrated = 0;
    
    while (true) {
      // No `skip` here: migrated rows stop matching the filter, so the next
      // batch is always at the front. Paging by offset would skip rows.
      const documents = await prisma.document.findMany({
        where: { currentVersion: fromVersion },
        orderBy: { id: 'asc' },
        take: batchSize,
      });
      
      if (documents.length === 0) break;
      
      // Generate new embeddings
      const texts = documents.map(d => d.content);
      const newEmbeddings = await EmbeddingService.getBatchEmbeddings(texts);
      
      // Update documents with new embeddings
      // (the target column is fixed to embedding_v2 in this example)
      for (let i = 0; i < documents.length; i++) {
        await prisma.$executeRaw`
          UPDATE documents
          SET 
            embedding_v2 = ${newEmbeddings[i]}::vector,
            current_version = ${toVersion}
          WHERE id = ${documents[i].id}
        `;
      }
      
      migrated += documents.length;
      console.log(`Migrated ${migrated} documents`);
    }
    
    console.log('Migration complete!');
  }
  
  /**
   * A/B test embedding models
   */
  static async compareModels(query: string) {
    const [v1Results, v2Results] = await Promise.all([
      this.searchWithVersion(query, 'v1'),
      this.searchWithVersion(query, 'v2'),
    ]);
    
    return {
      v1: v1Results,
      v2: v2Results,
      overlap: this.calculateOverlap(v1Results, v2Results),
    };
  }
  
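  private static async searchWithVersion(query: string, version: string) {
    // Column names cannot be bind parameters, so map the version to a known column
    const columns: Record<string, string> = { v1: 'embedding_v1', v2: 'embedding_v2' };
    const embeddingColumn = columns[version];
    if (!embeddingColumn) throw new Error(`Unknown embedding version: ${version}`);

    // NOTE: the query should be embedded with the model that produced this column
    const embedding = await EmbeddingService.getEmbedding(query);
    
    // Search every row that has this version's embedding, so both versions
    // can be compared on the same documents
    return await prisma.$queryRawUnsafe<any[]>(
      `SELECT id, content FROM documents
       WHERE ${embeddingColumn} IS NOT NULL
       ORDER BY ${embeddingColumn} <=> $1::vector
       LIMIT 10`,
      embedding
    );
  }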
  
  private static calculateOverlap(results1: any[], results2: any[]): number {
    if (results1.length === 0) return 0; // avoid 0 / 0 = NaN
    const ids1 = new Set(results1.map(r => r.id));
    const overlap = results2.filter(r => ids1.has(r.id)).length;
    return overlap / results1.length;
  }
}

Monitoring and Observability

Query Performance Tracking

import { performance } from 'perf_hooks';

export class PerformanceMonitor {
  /**
   * Track search performance
   */
  static async trackSearch(query: string, options: SearchOptions) {
    const startTotal = performance.now();
    
    // Embedding generation
    const startEmbedding = performance.now();
    const embedding = await EmbeddingService.getEmbedding(query);
    const embeddingTime = performance.now() - startEmbedding;
    
    // Database query
    const startDB = performance.now();
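    // Type the result so `results.length` compiles, and list columns
    // explicitly: Prisma cannot deserialize the raw vector column
    const results = await prisma.$queryRaw<SearchResult[]>`
      SELECT id, title, content FROM documents
      ORDER BY embedding <=> ${embedding}::vector
      LIMIT 10
    `;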
    const dbTime = performance.now() - startDB;
    
    const totalTime = performance.now() - startTotal;
    
    // Log metrics
    await this.logMetrics({
      query,
      embeddingTime,
      dbTime,
      totalTime,
      resultCount: results.length,
      timestamp: new Date(),
    });
    
    return { results, metrics: { embeddingTime, dbTime, totalTime } };
  }
  
  private static async logMetrics(metrics: any) {
    await prisma.performanceLog.create({ data: metrics });
    
    // Alert if slow
    if (metrics.totalTime > 1000) {
      console.warn(`Slow query detected: ${metrics.totalTime}ms for "${metrics.query}"`);
    }
  }
  
  /**
   * Get performance stats
   */
  static async getPerformanceStats(hours: number = 24) {
    const since = new Date(Date.now() - hours * 60 * 60 * 1000);
    
    const stats = await prisma.performanceLog.aggregate({
      where: { timestamp: { gte: since } },
      _avg: {
        embeddingTime: true,
        dbTime: true,
        totalTime: true,
      },
      _max: {
        totalTime: true,
      },
      _count: true,
    });
    
    return stats;
  }
}
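Averages and a single maximum hide tail latency: a handful of 2-second queries barely move the mean of thousands of 50 ms ones. Percentiles are a common complement. The sketch below computes them with the nearest-rank method over an in-memory array of `totalTime` samples; the function names are hypothetical, and for large logs you would more likely compute this in SQL.

```typescript
// Hypothetical helper: nearest-rank percentile of a list of samples.
export function percentile(values: number[], p: number): number {
  if (values.length === 0) {
    throw new Error('No samples');
  }
  if (p <= 0 || p > 100) {
    throw new RangeError('p must be in the range (0, 100]');
  }
  const sorted = [...values].sort((a, b) => a - b);
  // Integer arithmetic first to avoid floating point surprises in the rank
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1];
}

export function latencySummary(totalTimes: number[]) {
  return {
    p50: percentile(totalTimes, 50),
    p95: percentile(totalTimes, 95),
    p99: percentile(totalTimes, 99),
  };
}
```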

Error Handling and Retries

export class ResilientSearchService {
  /**
   * Search with automatic retry on failure
   */
  static async search(
    query: string,
    maxRetries: number = 3
  ): Promise<SearchResult[]> {
    for (let attempt = 0; attempt < maxRetries; attempt++) {
      try {
        return await this.executeSearch(query);
      } catch (error: any) {
        // Don't retry on validation errors
        if (error.message.includes('invalid') || error.message.includes('validation')) {
          throw error;
        }
        
        if (attempt === maxRetries - 1) {
          // Last attempt failed
          console.error(`Search failed after ${maxRetries} attempts:`, error);
          
          // Fallback to keyword search
          return await this.fallbackKeywordSearch(query);
        }
        
        // Exponential backoff
        const delay = Math.pow(2, attempt) * 1000;
        console.log(`Retry ${attempt + 1}/${maxRetries} after ${delay}ms`);
        await new Promise(resolve => setTimeout(resolve, delay));
      }
    }
    
    return [];
  }
  
  private static async executeSearch(query: string): Promise<SearchResult[]> {
    const embedding = await EmbeddingService.getEmbeddingWithRetry(query);
    
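    // Type the result to match the return type, and list columns
    // explicitly: Prisma cannot deserialize the raw vector column
    return await prisma.$queryRaw<SearchResult[]>`
      SELECT id, title, content FROM documents
      ORDER BY embedding <=> ${embedding}::vector
      LIMIT 10
    `;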
  }
  
  /**
   * Fallback to traditional keyword search
   */
  private static async fallbackKeywordSearch(query: string): Promise<SearchResult[]> {
    console.log('Using fallback keyword search');
    
    return await prisma.document.findMany({
      where: {
        OR: [
          { title: { contains: query, mode: 'insensitive' } },
          { content: { contains: query, mode: 'insensitive' } },
        ],
      },
      take: 10,
    }) as SearchResult[];
  }
}
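The retry loop above waits exactly 1s, 2s, then 4s. When many requests fail at the same moment (for example during an embedding API outage), fixed delays make them all retry together. A common refinement is to cap the delay and add random jitter. The sketch below uses "full jitter" (a uniform delay between 0 and the exponential bound); the function name, the 30-second cap, and the injectable `random` parameter are assumptions added for illustration and testing.

```typescript
// Hypothetical helper: exponential backoff with a cap and full jitter.
// attempt 0 -> up to baseMs, attempt 1 -> up to 2 * baseMs, and so on.
export function backoffDelay(
  attempt: number,
  baseMs: number = 1000,
  maxMs: number = 30000,
  random: () => number = Math.random
): number {
  const exponential = Math.min(maxMs, baseMs * Math.pow(2, attempt));
  return Math.floor(random() * exponential);
}
```

In `ResilientSearchService.search`, the line computing `delay` could then be replaced with `backoffDelay(attempt)`.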

Health Checks and Readiness Probes

export class HealthCheckService {
  /**
   * Database health check
   */
  static async checkDatabase(): Promise<boolean> {
    try {
      await prisma.$queryRaw`SELECT 1`;
      return true;
    } catch (error) {
      console.error('Database health check failed:', error);
      return false;
    }
  }
  
  /**
   * Embedding service health check
   */
  static async checkEmbeddingService(): Promise<boolean> {
    try {
      await EmbeddingService.getEmbedding('test');
      return true;
    } catch (error) {
      console.error('Embedding service health check failed:', error);
      return false;
    }
  }
  
  /**
   * Vector index health check
   */
  static async checkVectorIndex(): Promise<boolean> {
    try {
      const dummyVector = new Array(1536).fill(0);
      
      await prisma.$queryRaw`
        EXPLAIN SELECT * FROM documents
        ORDER BY embedding <=> ${dummyVector}::vector
        LIMIT 1
      `;
      
      return true;
    } catch (error) {
      console.error('Vector index health check failed:', error);
      return false;
    }
  }
  
  /**
   * Complete health check
   */
  static async healthCheck() {
    const [db, embedding, index] = await Promise.all([
      this.checkDatabase(),
      this.checkEmbeddingService(),
      this.checkVectorIndex(),
    ]);
    
    return {
      healthy: db && embedding && index,
      checks: { database: db, embedding, vectorIndex: index },
      timestamp: new Date().toISOString(),
    };
  }
}

// Health endpoint
app.get('/health', async (req, res) => {
  const health = await HealthCheckService.healthCheck();
  const statusCode = health.healthy ? 200 : 503;
  res.status(statusCode).json(health);
});
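A readiness probe that hangs is as bad as one that fails, and each check above waits for as long as its dependency takes to answer. One option is to give every check a deadline and treat a timeout as unhealthy. The helper below is a minimal sketch; `withTimeout` is a hypothetical name and the timeout value is yours to choose.

```typescript
// Hypothetical helper: resolve with `fallback` if `promise` has not settled
// within `ms` milliseconds. The timer is always cleared afterwards.
export async function withTimeout<T>(promise: Promise<T>, ms: number, fallback: T): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<T>(resolve => {
    timer = setTimeout(() => resolve(fallback), ms);
  });
  try {
    return await Promise.race([promise, timeout]);
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

Inside `healthCheck`, each entry could be wrapped along the lines of `withTimeout(this.checkDatabase(), 2000, false)`.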

Production Deployment Checklist

export const PRODUCTION_CHECKLIST = {
  database: [
    '✓ HNSW index created on embedding column',
    '✓ Connection pooling configured (20-50 connections)',
    '✓ Shared buffers ≥ 25% of RAM',
    '✓ Effective cache size set properly',
    '✓ Regular VACUUM and ANALYZE scheduled',
    '✓ Backup strategy implemented',
  ],
  
  performance: [
    '✓ Embedding caching enabled (Redis)',
    '✓ Query result caching implemented',
    '✓ Batch embedding generation for ingestion',
    '✓ Connection pooling configured',
    '✓ Query timeout set (5-10 seconds)',
  ],
  
  monitoring: [
    '✓ Query performance logging',
    '✓ Error tracking (Sentry, etc.)',
    '✓ Health check endpoints',
    '✓ Metrics exported (Prometheus/Grafana)',
    '✓ Alert on slow queries (>1s)',
  ],
  
  reliability: [
    '✓ Retry logic for embedding API',
    '✓ Fallback to keyword search on errors',
    '✓ Circuit breaker for external services',
    '✓ Graceful degradation implemented',
  ],
  
  security: [
    '✓ API rate limiting',
    '✓ Input validation (query length, etc.)',
    '✓ SQL injection prevention (parameterized queries)',
    '✓ API key rotation strategy',
  ],
};
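The checklist mentions a circuit breaker, which the code in this article does not show. The idea is to stop calling a failing dependency (such as the embedding API) for a while instead of retrying on every request. The class below is a minimal sketch, not a production implementation: the class name, the default threshold of 5 failures, the 30-second reset, and the injectable clock are all assumptions for illustration.

```typescript
type BreakerState = 'closed' | 'open' | 'half-open';

// Hypothetical minimal circuit breaker: opens after N consecutive failures,
// rejects calls while open, and allows a trial call after the reset timeout.
export class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;
  private state: BreakerState = 'closed';

  constructor(
    private readonly failureThreshold: number = 5,
    private readonly resetTimeoutMs: number = 30000,
    private readonly now: () => number = () => Date.now()
  ) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === 'open') {
      if (this.now() - this.openedAt < this.resetTimeoutMs) {
        throw new Error('Circuit open: skipping call');
      }
      this.state = 'half-open'; // timeout elapsed, allow a trial call
    }
    try {
      const result = await fn();
      this.failures = 0;
      this.state = 'closed';
      return result;
    } catch (error) {
      this.failures++;
      if (this.state === 'half-open' || this.failures >= this.failureThreshold) {
        this.state = 'open';
        this.openedAt = this.now();
      }
      throw error;
    }
  }

  getState(): BreakerState {
    return this.state;
  }
}
```

A caller would typically catch the "circuit open" error and go straight to the keyword fallback, for example `breaker.call(() => EmbeddingService.getEmbedding(query))` inside a try/catch.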

What's Next

In this article, you learned:

✅ Query performance optimization (2.3s → 47ms)
✅ Embedding and result caching strategies
✅ Index management and tuning
✅ Embedding model versioning and migration
✅ Monitoring, metrics, and health checks
✅ Error handling and graceful degradation
✅ Production deployment checklist

Next: Real-world applications including RAG chatbots, recommendation engines, and semantic search systems.


