Solving Kubernetes Nginx Configuration Management Without Downtime

Architecting multi-tenant nginx deployments in Kubernetes to eliminate configuration downtime and enable true rolling updates.

Ever found yourself in a situation where adding a single client configuration to your nginx reverse proxy requires a significant maintenance window? I encountered this scenario in production, and what I discovered reveals a fundamental flaw in how many teams architect multi-client nginx deployments in Kubernetes.

The Setup

We were running a multi-tenant nginx reverse proxy in Kubernetes, serving dozens of client websites through a single nginx instance:

  • Multiple client domains (client1.example.com, client2.example.com, etc.)
  • Single Kubernetes pod containing nginx + logging pipeline
  • All client configurations managed through ConfigMaps
  • Persistent volume for log storage

The supposed workflow: update the ConfigMap with new client configs, restart the pod, and you're done. Except you're never actually done.
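
For context, this is roughly what such a ConfigMap looks like. The sketch below is simplified and hypothetical - two clients instead of dozens, placeholder upstream names - but it shows why every client shares one configuration artifact:

yaml Per-Client ConfigMap (simplified, hypothetical sketch)
apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-config
data:
  client1.conf: |
    server {
      listen 80;
      server_name client1.example.com;
      location / {
        # placeholder upstream; real configs point at each client's backend
        proxy_pass http://client1-backend:8080;
      }
    }
  client2.conf: |
    server {
      listen 80;
      server_name client2.example.com;
      location / {
        proxy_pass http://client2-backend:8080;
      }
    }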

text Original Architecture - Single Pod
┌─────────────────────────────────────────────────────────────────┐
│                        Internet                                 │
│  client1.example.com ──┐                                        │
│  client2.example.com ──┼─→ DNS Points to AWS Elastic IP         │  
│  client3.example.com ──┘     (Static Public IP: 203.0.113.10)   │
└─────────────────────────────────┬───────────────────────────────┘
                                  │
                ┌─────────────────▼────────────────┐
                │         AWS EC2 Instance         │
                │    (Elastic IP: 203.0.113.10)    │
                │                                  │
                │  ┌─────────────────────────────┐ │
                │  │     Kubernetes Cluster      │ │
                │  │                             │ │
                │  │  ┌─────────────────────────┐│ │
                │  │  │      Single Pod         ││ │
                │  │  │ ┌─────────┐ ┌────────┐  ││ │
                │  │  │ │  nginx  │ │   MQ   │  ││ │
                │  │  │ │(reverse │ │        │  ││ │
                │  │  │ │ proxy)  │ │        │  ││ │
                │  │  │ └─────────┘ └────────┘  ││ │
                │  │  │      │      ┌────────┐  ││ │
                │  │  │      │      │Filter  │  ││ │
                │  │  │      │      │  API   │  ││ │
                │  │  │      ▼      └────────┘  ││ │
                │  │  │ [PVC Logs]              ││ │
                │  │  └─────────────────────────┘│ │
                │  └─────────────────────────────┘ │
                └──────────────────────────────────┘

The Hidden Problems

Issue 1: Full Service Downtime

Every ConfigMap change requires a complete pod restart:

bash Configuration Update Process
kubectl apply -f updated-nginx-configmap.yaml
kubectl rollout restart deployment/nginx-deployment

This triggers a cascade of problems that turns a 2-minute configuration change into a 30-minute production incident.

Issue 2: All-or-Nothing Availability

With a single nginx pod handling all client traffic, any restart means every client site goes down simultaneously. There's no graceful failover, no partial service - it's a complete outage for everyone.

Issue 3: Scaling Creates Log Chaos

The obvious solution seems to be "add more replicas for high availability," but this creates different problems:

  • Multiple nginx pods writing the same logs to different locations
  • Logging pipeline gets overwhelmed with duplicate data
  • Storage costs multiply unnecessarily
  • Debugging becomes impossible with redundant log entries

Issue 4: Resource Multiplication

Each nginx replica requires its own full container set (nginx + MQ + filtering API). Three replicas means nine containers total, tripling your resource usage for what should be a simple availability improvement.

The Root Cause

The fundamental issue isn't with ConfigMaps or pod restarts - it's architectural. We've tightly coupled:

  • Network identity: The static IP clients depend on
  • Compute identity: The specific node where our pod runs
  • Processing identity: The containers handling logs

When any component needs updates, everything must restart together.

The Solution Architecture

The fix requires separating these concerns and implementing true rolling updates. Here's the corrected architecture:

Improved Architecture - Separated Concerns

Multi-pod nginx with Network Load Balancer and centralized logging

text Improved Architecture - Separated Concerns
┌─────────────────────────────────────────────────────────────────┐
│                        Internet                                 │
│  client1.example.com ──┐                                        │
│  client2.example.com ──┼─→ DNS Points to Static Elastic IPs     │  
│  client3.example.com ──┘                                        │
└─────────────────────────────────┬───────────────────────────────┘
                                  │
                ┌─────────────────▼────────────────┐
                │      Network Load Balancer       │
                │     (Static Elastic IPs)         │ 
                └─────────────────┬────────────────┘
                                  │
           ┌─────────────────────▼─────────────────────────┐
           │              Kubernetes Cluster               │
           │                                               │
           │  ┌──────────┐  ┌──────────┐  ┌──────────┐     │
           │  │nginx-pod1│  │nginx-pod2│  │nginx-pod3│     │
           │  │(nginx    │  │(nginx    │  │(nginx    │     │
           │  │ only)    │  │ only)    │  │ only)    │     │
           │  └─────┬────┘  └─────┬────┘  └─────┬────┘     │
           │        │              │              │        │
           │        └──── logs ────┼──── logs ────┘        │
           │                       │                       │
           │              ┌────────▼────────┐              │
           │              │   Shared EFS    │              │
           │              │   (Single PVC)  │              │
           │              └────────┬────────┘              │
           │                       │                       │
           │              ┌────────▼────────┐              │
           │              │ Centralized     │              │
           │              │ Logging Pod     │              │
           │              │  ┌──────────┐   │              │
           │              │  │    MQ    │   │              │
           │              │  └──────────┘   │              │
           │              │  ┌──────────┐   │              │
           │              │  │Filter API│   │              │
           │              │  └──────────┘   │              │
           │              └─────────────────┘              │
           └──────────────────────────────────────────────┘

Implementation Details

1. Shared Storage Foundation

yaml EFS Storage Configuration
# EFS Storage Class for shared access
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap
  fileSystemId: fs-xxxxxxxxx
  directoryPerms: "0755"
---
# Single PVC shared by all nginx pods
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-nginx-logs
spec:
  accessModes:
    - ReadWriteMany  # Critical: supports multiple writers
  storageClassName: efs-sc
  resources:
    requests:
      storage: 100Gi
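
One assumption baked into this manifest: the AWS EFS CSI driver must already be running in the cluster for the StorageClass to do anything. A typical installation path is its Helm chart (repo and chart names as documented by the upstream project); after that, the shared claim should report Bound:

bash EFS CSI Driver and PVC Verification
# One-time: install the EFS CSI driver (chart/repo names per the upstream project docs)
helm repo add aws-efs-csi-driver https://kubernetes-sigs.github.io/aws-efs-csi-driver/
helm upgrade --install aws-efs-csi-driver aws-efs-csi-driver/aws-efs-csi-driver -n kube-system

# The claim should show STATUS=Bound and ACCESS MODES=RWX
kubectl get pvc shared-nginx-logs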

2. Network Load Balancer for Static IPs

yaml NLB Service Configuration
apiVersion: v1
kind: Service
metadata:
  name: nginx-nlb
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
    service.beta.kubernetes.io/aws-load-balancer-eip-allocations: "eipalloc-12345678,eipalloc-87654321"
    service.beta.kubernetes.io/aws-load-balancer-target-group-attributes: preserve_client_ip.enabled=true
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-path: "/health"
spec:
  type: LoadBalancer
  selector:
    app: nginx-reverse-proxy
  ports:
  - name: http
    port: 80
    targetPort: 80
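
After applying the Service (assuming the AWS Load Balancer Controller is handling it), the controller provisions the NLB and attaches the Elastic IPs from the eip-allocations annotation. A quick sanity check - the allocation IDs below are the placeholder values from the manifest above:

bash Verifying the NLB and Elastic IPs
# EXTERNAL-IP shows the NLB's DNS name once provisioning completes
kubectl get service nginx-nlb

# Confirm the Elastic IPs are associated with the new load balancer
aws ec2 describe-addresses --allocation-ids eipalloc-12345678 eipalloc-87654321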

3. Fast Rolling Update Deployment

yaml Rolling Update Nginx Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-reverse-proxy
spec:
  replicas: 3  # Multiple replicas for rolling updates
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1  # Always keep 2 pods running
      maxSurge: 1        # Add 1 extra during updates
  selector:
    matchLabels:
      app: nginx-reverse-proxy  # must match the NLB Service selector
  template:
    metadata:
      labels:
        app: nginx-reverse-proxy
    spec:
      terminationGracePeriodSeconds: 30
      containers:
      - name: nginx
        image: nginx:1.25-alpine
        volumeMounts:
        - name: nginx-config
          mountPath: /etc/nginx/conf.d
        - name: shared-logs
          mountPath: /var/log/nginx
        
        # Aggressive health checks for fast updates
        readinessProbe:
          httpGet:
            path: /health
            port: 80
          initialDelaySeconds: 5
          periodSeconds: 2
        
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "sleep 10"]
      
      volumes:
      - name: nginx-config
        configMap:
          name: nginx-config
      - name: shared-logs
        persistentVolumeClaim:
          claimName: shared-nginx-logs
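
Both the NLB health check annotation and the readinessProbe above point at /health, so nginx itself has to answer that path. A minimal sketch of the extra config file, assuming it lives in the same nginx-config ConfigMap as the client server blocks and that no other server block claims default_server:

yaml Health Endpoint in the Nginx ConfigMap (sketch)
apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-config
data:
  # per-client *.conf entries omitted for brevity
  health.conf: |
    # Answers NLB health checks and the readinessProbe regardless of Host header
    server {
      listen 80 default_server;
      server_name _;
      location = /health {
        access_log off;
        default_type text/plain;
        return 200 'ok';
      }
    }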

4. Centralized Logging Service

yaml Centralized Logging Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: centralized-logging
spec:
  replicas: 1
  selector:
    matchLabels:
      app: centralized-logging
  template:
    metadata:
      labels:
        app: centralized-logging
    spec:
      containers:
      - name: mq
        image: rabbitmq:3.12-alpine
        ports:
        - containerPort: 5672
      
      - name: filter-api
        image: your-registry/log-filter-api:latest
        ports:
        - containerPort: 8080
        volumeMounts:
        - name: shared-logs
          mountPath: /var/log/nginx
          readOnly: true
        
      volumes:
      - name: shared-logs
        persistentVolumeClaim:
          claimName: shared-nginx-logs

Why This Architecture Works

Separation of Concerns

  • Nginx pods handle only reverse proxy duties
  • Logging pod handles only log processing
  • Shared storage layer enables communication between them

Fast Rolling Updates

Configuration updates now take 15-30 seconds instead of 30 minutes:

bash Fast Configuration Updates
kubectl apply -f updated-nginx-config.yaml
kubectl rollout restart deployment/nginx-reverse-proxy
# Rollout completes in ~15-30 seconds; at least two pods keep serving throughout
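
Two standard kubectl commands make this safer in practice: watch the rollout complete, and fall back to the previous ReplicaSet if a bad config slips through:

bash Watching and Rolling Back Updates
# Block until all replicas are updated and ready (fails after the timeout)
kubectl rollout status deployment/nginx-reverse-proxy --timeout=120s

# If a broken config made it out, revert to the previous revision
kubectl rollout undo deployment/nginx-reverse-proxy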

No Log Redundancy

  • Single shared PVC eliminates duplicate log entries (an optional per-pod layout is sketched below)
  • Centralized logging processes each request exactly once
  • Resource usage scales predictably
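
If interleaved writes to shared log files on EFS ever become a concern, one optional refinement - not part of the manifests above - is to give each nginx pod its own subdirectory on the shared volume via subPathExpr:

yaml Per-Pod Log Subdirectories (optional sketch)
# Fields to add to the nginx container in the Deployment above;
# each pod then writes to its own subdirectory of the shared EFS volume.
env:
- name: POD_NAME
  valueFrom:
    fieldRef:
      fieldPath: metadata.name
volumeMounts:
- name: shared-logs
  mountPath: /var/log/nginx
  subPathExpr: $(POD_NAME)   # expands to the pod's name at mount time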

Zero Manual Intervention

  • NLB handles all IP routing automatically
  • Kubernetes manages pod lifecycle
  • Health checks prevent traffic to unhealthy pods

The Results

Before this architecture:

  • 30-minute maintenance windows for config changes
  • All client sites down during updates
  • Manual intervention required for every change
  • Unpredictable resource usage with replicas

After implementation:

  • 15-30 second rolling updates
  • Zero simultaneous client downtime
  • Fully automated configuration deployment
  • Predictable resource scaling

The transformation is dramatic. What used to be emergency maintenance windows requiring coordination with clients became routine deployments that happen transparently during business hours.

Limitations and Trade-offs

AWS Constraints:

  • Security group rule limits can be reached with many NLBs
  • Target limits of 500 per load balancer
  • Connection limits of ~55,000 with client IP preservation
  • Cannot share NLB across multiple services

Cost Considerations:

  • NLB costs ~$16/month per load balancer
  • EFS storage costs scale with usage
  • Additional complexity in monitoring

Operational Complexity:

  • More moving parts to monitor
  • Shared storage introduces potential bottlenecks
  • Requires AWS-specific knowledge

Despite these limitations, the operational benefits far outweigh the constraints for most multi-client deployments.

Key Lessons Learned

  1. Separate scaling concerns: Don't force nginx scaling and log processing to scale together
  2. Decouple network identity: Static IPs should live with load balancers, not pods
  3. Design for rolling updates: Health checks and graceful shutdown are critical
  4. Shared storage patterns: EFS with ReadWriteMany enables elegant solutions
  5. Monitor the right metrics: Focus on rolling update success rates and health check latency

The Bottom Line

The key insight isn't about any specific technology - it's about architectural separation. By decoupling nginx scaling from logging scaling, and separating network identity from pod identity, we eliminate the constraints that forced us into all-or-nothing deployments.

This approach proves that sometimes the best solution isn't adding more complexity, but removing the artificial constraints that create complexity in the first place. Your nginx reverse proxy can be both highly available and operationally simple - you just need to architect it correctly from the start.

For teams running multi-client nginx deployments in Kubernetes, this architecture provides a clear path from maintenance-heavy operations to truly cloud-native, automated deployments. The difference between 30-minute outages and 30-second updates isn't just technical - it's the difference between infrastructure that fights your business growth and infrastructure that enables it.

Related Reading

This article builds on concepts from our previous post about The Hidden Manual Work Behind Kubernetes Pod Restarts. If you haven't read that yet, it provides essential background on why manual intervention becomes necessary and how Network Load Balancers solve fundamental architectural problems in Kubernetes deployments.