Incident Response Runbook
How to respond to CRM platform incidents.
Severity Levels
| Severity | Description | Response Time |
|---|---|---|
| P1 | Complete outage, data loss | 15 minutes |
| P2 | Major feature broken | 1 hour |
| P3 | Minor feature issue | 4 hours |
| P4 | Non-urgent bug | Next business day |
Initial Response
1. Acknowledge
- Note the time the incident was detected
- Assign incident commander
- Create incident channel (if needed)
2. Assess
Check overall system health:
# Pod status
kubectl get pods -n crm
# Recent events
kubectl get events -n crm --sort-by='.lastTimestamp'
# Service health (-f exits non-zero on HTTP errors; --max-time bounds a hung request)
curl -fsS --max-time 10 https://crm-api.digiwedge.dev/health
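The three checks above can be bundled into one pass so the first responder gets a single snapshot. The following is a minimal sketch, not an existing tool: the function name `assess` and the `NAMESPACE` and `HEALTH_URL` variables are hypothetical, with defaults taken from the commands in this runbook.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run the assessment checks from this runbook in one pass.
# NAMESPACE and HEALTH_URL are illustrative variables; override them as needed.
NAMESPACE="${NAMESPACE:-crm}"
HEALTH_URL="${HEALTH_URL:-https://crm-api.digiwedge.dev/health}"

assess() {
  echo "== Pod status =="
  kubectl get pods -n "$NAMESPACE"

  echo "== Last 20 events =="
  kubectl get events -n "$NAMESPACE" --sort-by='.lastTimestamp' | tail -n 20

  echo "== Health endpoint =="
  # -f makes curl exit non-zero on HTTP 4xx/5xx; --max-time bounds a hung request.
  if curl -fsS --max-time 10 "$HEALTH_URL" >/dev/null; then
    echo "health: OK"
  else
    echo "health: FAILED"
    return 1
  fi
}
```

Source the file and run `assess`. A non-zero return code means the health endpoint did not answer successfully.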
3. Communicate
- Update status page
- Notify stakeholders
- Keep incident channel updated
Common Incidents
API Unresponsive
Symptoms: 5xx errors, timeouts
Check:
# Pod logs
kubectl logs -l app=crm-backend -n crm --tail=100
# Resource usage
kubectl top pods -n crm
# Database connections ($CRM_DATABASE_URL is expanded by your local shell, so it must be set there)
kubectl exec -it deploy/crm-backend -n crm -- \
psql "$CRM_DATABASE_URL" -c "SELECT count(*) FROM pg_stat_activity"
Resolution:
# Restart pods
kubectl rollout restart deployment/crm-backend -n crm
# Scale up if needed
kubectl scale deployment/crm-backend --replicas=4 -n crm
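`kubectl rollout restart` returns as soon as the restart is triggered, so it helps to wait for the rollout to finish before declaring the fix applied. A minimal sketch follows. The function name `restart_and_wait` and the 120-second default timeout are assumptions, not values from this runbook.

```shell
#!/usr/bin/env bash
# Hypothetical helper: restart a deployment and block until the rollout finishes.
# Usage: restart_and_wait <deployment> [namespace] [timeout]
restart_and_wait() {
  local deploy="$1" ns="${2:-crm}" timeout="${3:-120s}"

  kubectl rollout restart "deployment/$deploy" -n "$ns" || return 1

  # rollout status waits until the new pods are ready or the timeout expires.
  if kubectl rollout status "deployment/$deploy" -n "$ns" --timeout="$timeout"; then
    echo "rollout complete: $deploy"
  else
    echo "rollout did not finish within $timeout: $deploy" >&2
    return 1
  fi
}
```

For example, `restart_and_wait crm-backend` restarts the API pods and reports whether they became ready in time.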
Worker Queue Stuck
Symptoms: Jobs not processing, queue growing
Check:
# Worker logs
kubectl logs -l app=crm-worker -n crm --tail=100
# Redis queue depth
redis-cli -u "$REDIS_URL" LLEN crm-events
Resolution:
# Restart workers
kubectl rollout restart deployment/crm-worker -n crm
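A single queue depth reading does not show whether the queue is growing or draining. One way to tell, sketched below, is to sample `LLEN` twice and compare. The function name `queue_delta` and the 30-second default interval are hypothetical. The sketch assumes `REDIS_URL` is set in the shell where it runs.

```shell
#!/usr/bin/env bash
# Hypothetical helper: sample the queue depth twice and report the change.
# Usage: queue_delta [queue] [interval_seconds]
queue_delta() {
  local queue="${1:-crm-events}" interval="${2:-30}"
  local before after

  before="$(redis-cli -u "$REDIS_URL" LLEN "$queue")" || return 1
  sleep "$interval"
  after="$(redis-cli -u "$REDIS_URL" LLEN "$queue")" || return 1

  # A positive change means jobs arrive faster than workers consume them.
  echo "depth: $before -> $after (change: $((after - before)) in ${interval}s)"
}
```

Run it before and after restarting the workers. A change that stays positive after the restart suggests the workers are still not consuming.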
Database Connection Exhausted
Symptoms: Connection timeout errors
Check:
-- Active connections
SELECT count(*) FROM pg_stat_activity WHERE datname = 'crm';
-- Idle connections
SELECT count(*) FROM pg_stat_activity
WHERE datname = 'crm' AND state = 'idle';
Resolution:
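-- Terminate connections that have been idle for more than 10 minutes
-- (state_change records when the session last changed state, i.e. went idle)
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE datname = 'crm' AND state = 'idle'
AND state_change < now() - interval '10 minutes';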
# Restart pods to reset pool
kubectl rollout restart deployment/crm-backend -n crm
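Before terminating anything, it can help to see how the connections break down by state and how close the total is to the server limit. The sketch below wraps two queries in a hypothetical `db_connection_summary` function. It assumes `psql` and `CRM_DATABASE_URL` are available in the shell where it runs.

```shell
#!/usr/bin/env bash
# Hypothetical helper: show crm connections grouped by state, plus the server limit.
# -A (unaligned) and -t (tuples only) give plain "state count" lines.
db_connection_summary() {
  psql "$CRM_DATABASE_URL" -At -F ' ' <<'SQL'
SELECT state, count(*)
FROM pg_stat_activity
WHERE datname = 'crm'
GROUP BY state
ORDER BY count(*) DESC;
-- max_connections is server-wide, not per database.
SELECT 'max_connections', setting FROM pg_settings WHERE name = 'max_connections';
SQL
}
```

If most sessions are `idle`, terminating idle connections is likely to help. If most are `active`, look for slow queries instead.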
Social Publishing Failed
Symptoms: Posts not appearing on platforms
Check:
-- Check failed posts
SELECT id, platform, status, error_message
FROM "SocialPost"
WHERE status = 'FAILED'
ORDER BY created_at DESC LIMIT 10;
-- Check token expiry
SELECT platform, token_expires_at
FROM "SocialConnection"
WHERE token_expires_at < now();
Resolution:
- If token expired: User must re-authenticate
- If rate limited: Wait and retry
- If content rejected: Review content and resubmit
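To decide which of the resolutions above applies, it may help to group the failures by platform and error message rather than reading individual rows. This is a sketch only: the function name `failed_posts_by_cause` is hypothetical, and it assumes `psql` and `CRM_DATABASE_URL` are available in the shell where it runs. The table and column names come from the queries in this section.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count failed posts per platform and error message.
failed_posts_by_cause() {
  psql "$CRM_DATABASE_URL" -At -F ' | ' <<'SQL'
SELECT platform, error_message, count(*)
FROM "SocialPost"
WHERE status = 'FAILED'
GROUP BY platform, error_message
ORDER BY count(*) DESC
LIMIT 10;
SQL
}
```

A single dominant error on one platform usually points to one cause, such as an expired token or a rate limit. A spread of different errors suggests content-level rejections.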
Event Processing Lag
Symptoms: Customer data not updating
Check:
# Queue depth
redis-cli -u "$REDIS_URL" LLEN crm-events
# Processing rate (with -l, kubectl logs returns only the last 10 lines per pod unless --tail is set)
kubectl logs -l app=crm-worker -n crm --tail=1000 | grep "processed"
-- Check for failed events
SELECT * FROM "SyncLog"
WHERE status = 'FAILED'
ORDER BY created_at DESC LIMIT 10;
Resolution:
# Scale workers
kubectl scale deployment/crm-worker --replicas=4 -n crm
-- Retry failed events
UPDATE "SyncLog"
SET status = 'PENDING', retry_count = 0
WHERE status = 'FAILED' AND created_at > now() - interval '1 hour';
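Because the retry statement rewrites every failed row from the last hour, it can be worth counting the candidates first. The sketch below does that with the same filter. The function names and the default limit of 1000 are hypothetical, and it assumes `psql` and `CRM_DATABASE_URL` are available in the shell where it runs.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count the rows the retry UPDATE would touch.
count_retry_candidates() {
  psql "$CRM_DATABASE_URL" -At <<'SQL'
SELECT count(*) FROM "SyncLog"
WHERE status = 'FAILED' AND created_at > now() - interval '1 hour';
SQL
}

# Hypothetical guard: fail if the retry would re-queue more than <max> events.
# The default limit of 1000 is an illustrative value, not a documented threshold.
retry_guard() {
  local max="${1:-1000}" n
  n="$(count_retry_candidates)" || return 1
  if [ "$n" -le "$max" ]; then
    echo "$n events would be retried (limit $max): OK to proceed"
  else
    echo "$n events would be retried (limit $max): retry in smaller batches" >&2
    return 1
  fi
}
```

Running `retry_guard 500` before the `UPDATE` gives a quick check that a mass retry will not flood workers that were just scaled up.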
Post-Incident
1. Resolve
- Verify fix is working
- Monitor for recurrence
- Update status page
2. Document
- Timeline of events
- Root cause
- Actions taken
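The timeline is easier to reconstruct if entries are recorded as the incident unfolds. One lightweight option, sketched here, is a shell function that appends UTC-timestamped lines to a file. The function name `log_event` and the `TIMELINE_FILE` variable are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: append a UTC-timestamped entry to an incident timeline file.
TIMELINE_FILE="${TIMELINE_FILE:-incident-timeline.md}"

log_event() {
  # ISO 8601 timestamps in UTC avoid timezone confusion across responders.
  printf '%s\n' "- $(date -u +%Y-%m-%dT%H:%M:%SZ) $*" >> "$TIMELINE_FILE"
}
```

For example, `log_event "Restarted crm-backend"` adds one line, and the resulting file can be pasted into the post-mortem document.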
3. Review
- Schedule post-mortem
- Identify preventive measures
- Update runbooks
Escalation Path
1. On-call engineer
2. Platform lead
3. DevOps team
4. Management