Incident Response Runbook

How to respond to CRM platform incidents.

Severity Levels

Severity   Description                  Response Time
P1         Complete outage, data loss   15 minutes
P2         Major feature broken         1 hour
P3         Minor feature issue          4 hours
P4         Non-urgent bug               Next business day
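
The severity levels above can also be encoded as a lookup so that scripts (for example, a paging helper) stay in step with the runbook. This is a minimal sketch; the function name `response_target` is hypothetical and not part of the CRM platform.

```shell
# Hypothetical helper: print the response-time target for a severity level.
# The values mirror the severity table in this runbook.
response_target() {
  case "$1" in
    P1) echo "15 minutes" ;;
    P2) echo "1 hour" ;;
    P3) echo "4 hours" ;;
    P4) echo "Next business day" ;;
    *)  echo "unknown severity: $1" >&2; return 1 ;;
  esac
}
```

For example, `response_target P2` prints `1 hour`, and an unrecognised level returns a non-zero exit status.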

Initial Response

1. Acknowledge

  • Note the time the incident was detected
  • Assign incident commander
  • Create incident channel (if needed)
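
Recording the acknowledgement details in a consistent format makes the later timeline easier to reconstruct. The sketch below is one possible way to do that; the function name `incident_record` and the output format are assumptions, not an existing tool.

```shell
# Hypothetical helper: print a one-line incident record.
# Arguments: severity, incident commander, optional detection time.
# If no detection time is given, the current UTC time is used.
incident_record() {
  severity="$1"
  commander="$2"
  detected="${3:-$(date -u +%Y-%m-%dT%H:%M:%SZ)}"
  echo "detected=$detected severity=$severity commander=$commander"
}
```

The output line can be pasted as the first message in the incident channel.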

2. Assess

Check overall system health:

# Pod status
kubectl get pods -n crm

# Recent events
kubectl get events -n crm --sort-by='.lastTimestamp'

# Service health
curl https://crm-api.digiwedge.dev/health
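
During an outage a plain `curl` can hang, so it may help to add a timeout and reduce the response to a coarse state. The following is a sketch under assumptions: the function names and the state labels are illustrative, and the 5-second timeout is an arbitrary choice.

```shell
# Map an HTTP status code to a coarse health state.
# 000 is what curl reports when no response was received.
classify_status() {
  case "$1" in
    2??) echo "healthy" ;;
    5??) echo "failing" ;;
    000) echo "unreachable" ;;
    *)   echo "unexpected" ;;
  esac
}

# Probe the health endpoint with a 5-second timeout and classify the result.
check_health() {
  url="${1:-https://crm-api.digiwedge.dev/health}"
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url") || code=000
  classify_status "$code"
}
```

Running `check_health` with no arguments probes the health URL used above.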

3. Communicate

  • Update status page
  • Notify stakeholders
  • Keep incident channel updated

Common Incidents

API Unresponsive

Symptoms: 5xx errors, timeouts

Check:

# Pod logs
kubectl logs -l app=crm-backend -n crm --tail=100

# Resource usage
kubectl top pods -n crm

# Database connections ($CRM_DATABASE_URL is expanded inside the pod, not locally)
kubectl exec -it deploy/crm-backend -n crm -- \
  sh -c 'psql "$CRM_DATABASE_URL" -c "SELECT count(*) FROM pg_stat_activity"'

Resolution:

# Restart pods
kubectl rollout restart deployment/crm-backend -n crm

# Scale up if needed
kubectl scale deployment/crm-backend --replicas=4 -n crm
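
A restart returns immediately, so it is worth waiting for the rollout to finish before declaring the API recovered. The sketch below wraps the restart with `kubectl rollout status`; the function name, the `KUBECTL` override variable, and the 120-second default timeout are assumptions.

```shell
# KUBECTL can be overridden (for example, with a dry-run wrapper); defaults to kubectl.
KUBECTL="${KUBECTL:-kubectl}"

# Restart a deployment in the crm namespace and block until the rollout
# completes or the timeout (default 120s, an assumed value) expires.
restart_and_wait() {
  deployment="$1"
  timeout="${2:-120s}"
  $KUBECTL rollout restart "deployment/$deployment" -n crm &&
    $KUBECTL rollout status "deployment/$deployment" -n crm --timeout="$timeout"
}
```

For example, `restart_and_wait crm-backend` exits non-zero if the new pods do not become ready in time, which is a signal to escalate rather than retry.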

Worker Queue Stuck

Symptoms: Jobs not processing, queue growing

Check:

# Worker logs
kubectl logs -l app=crm-worker -n crm --tail=100

# Redis queue depth
redis-cli -u $REDIS_URL LLEN crm-events

Resolution:

# Restart workers
kubectl rollout restart deployment/crm-worker -n crm
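
A single queue depth reading does not show whether the queue is stuck, so comparing two samples taken some seconds apart can confirm that the restart helped. A minimal sketch; the function name `queue_trend` and the 30-second interval are assumptions.

```shell
# Compare two queue-depth samples and report the trend.
queue_trend() {
  previous="$1"
  current="$2"
  if [ "$current" -gt "$previous" ]; then
    echo "growing"
  elif [ "$current" -lt "$previous" ]; then
    echo "draining"
  else
    echo "steady"
  fi
}

# Example usage (not run here):
#   before=$(redis-cli -u "$REDIS_URL" LLEN crm-events)
#   sleep 30
#   after=$(redis-cli -u "$REDIS_URL" LLEN crm-events)
#   queue_trend "$before" "$after"
```

A result of `growing` or `steady` on a non-empty queue after the restart suggests the workers are still not consuming jobs.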

Database Connection Exhausted

Symptoms: Connection timeout errors

Check:

-- Active connections
SELECT count(*) FROM pg_stat_activity WHERE datname = 'crm';

-- Idle connections
SELECT count(*) FROM pg_stat_activity
WHERE datname = 'crm' AND state = 'idle';

Resolution:

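-- Terminate connections that have been idle for more than 10 minutes
-- (state_change records when the connection entered its current state)
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE datname = 'crm' AND state = 'idle'
AND state_change < now() - interval '10 minutes';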

# Restart pods to reset pool
kubectl rollout restart deployment/crm-backend -n crm
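
To judge how close the database is to exhaustion, the active connection count can be compared with the server limit, which PostgreSQL reports via `SHOW max_connections;`. The sketch below does only the arithmetic; the function name `conn_usage_pct` is hypothetical.

```shell
# Print the percentage of the connection limit in use (integer, rounded down).
# active: current connection count; limit: value of max_connections.
conn_usage_pct() {
  active="$1"
  limit="$2"
  if [ "$limit" -le 0 ]; then
    echo "limit must be positive" >&2
    return 1
  fi
  echo $(( active * 100 / limit ))
}
```

For example, `conn_usage_pct 90 120` prints `75`. What percentage counts as critical is a team decision and is not defined in this runbook.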

Social Publishing Failed

Symptoms: Posts not appearing on platforms

Check:

-- Check failed posts
SELECT id, platform, status, error_message
FROM "SocialPost"
WHERE status = 'FAILED'
ORDER BY created_at DESC LIMIT 10;

-- Check token expiry
SELECT platform, token_expires_at
FROM "SocialConnection"
WHERE token_expires_at < now();

Resolution:

  • If token expired: User must re-authenticate
  • If rate limited: Wait and retry
  • If content rejected: Review content and resubmit
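
For the rate-limited case, "wait and retry" is usually done with increasing delays so that retries do not extend the rate limit. The following sketch computes an exponential backoff schedule; the function name, the 30-second base delay, and the 900-second cap are assumed values, not platform settings or limits published by any social platform.

```shell
# Print the wait in seconds before retry number $1 (1-based).
# The delay doubles on each attempt and is capped.
backoff_delay() {
  attempt="$1"
  delay=30      # assumed base delay in seconds
  cap=900       # assumed maximum delay in seconds
  i=1
  while [ "$i" -lt "$attempt" ]; do
    delay=$(( delay * 2 ))
    if [ "$delay" -ge "$cap" ]; then
      delay="$cap"
      break
    fi
    i=$(( i + 1 ))
  done
  echo "$delay"
}
```

With these values the first three retries wait 30, 60, and 120 seconds, and from the sixth retry onward the wait stays at 900 seconds.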

Event Processing Lag

Symptoms: Customer data not updating

Check:

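# Queue depth
redis-cli -u "$REDIS_URL" LLEN crm-events

# Processing rate (with a label selector, kubectl logs returns only the
# last 10 lines per pod unless --tail is set)
kubectl logs -l app=crm-worker -n crm --tail=1000 | grep "processed"

-- Check for failed events
SELECT * FROM "SyncLog"
WHERE status = 'FAILED'
ORDER BY created_at DESC LIMIT 10;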

Resolution:

# Scale workers
kubectl scale deployment/crm-worker --replicas=4 -n crm

-- Retry failed events from the last hour
UPDATE "SyncLog"
SET status = 'PENDING', retry_count = 0
WHERE status = 'FAILED' AND created_at > now() - interval '1 hour';
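
After scaling, a rough drain estimate helps when updating stakeholders on expected recovery time. The sketch below divides queue depth by the observed processing rate and rounds up; the function name `drain_minutes` is hypothetical, and the estimate ignores events that arrive while the queue drains.

```shell
# Estimate minutes needed to drain the queue, rounded up.
# depth: current queue length; rate: events processed per minute.
drain_minutes() {
  depth="$1"
  rate="$2"
  if [ "$rate" -le 0 ]; then
    echo "rate must be positive" >&2
    return 1
  fi
  echo $(( (depth + rate - 1) / rate ))
}
```

For example, `drain_minutes 1000 300` prints `4`. If the estimate does not shrink between checks, incoming events are outpacing the workers and further scaling may be needed.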

Post-Incident

1. Resolve

  • Verify fix is working
  • Monitor for recurrence
  • Update status page

2. Document

  • Timeline of events
  • Root cause
  • Actions taken

3. Review

  • Schedule post-mortem
  • Identify preventive measures
  • Update runbooks

Escalation Path

  1. On-call engineer
  2. Platform lead
  3. DevOps team
  4. Management