Incident Response Runbook

How to respond to CRM platform incidents.

Severity Levels

Severity	Description	Response Time
P1	Complete outage, data loss	15 minutes
P2	Major feature broken	1 hour
P3	Minor feature issue	4 hours
P4	Non-urgent bug	Next business day
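
For scripted paging or reminders, the table above can be encoded as a small lookup. This is a minimal sketch; the function name `response_target` is hypothetical and not part of any existing tooling.

```shell
# Map a severity label from the table above to its response target.
# Prints the target and returns 0, or returns 1 for an unknown label.
response_target() {
  case "$1" in
    P1) echo "15 minutes" ;;
    P2) echo "1 hour" ;;
    P3) echo "4 hours" ;;
    P4) echo "next business day" ;;
    *)  echo "unknown severity: $1" >&2; return 1 ;;
  esac
}
```

Example: `response_target P2` prints `1 hour`.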

Initial Response

1. Acknowledge

Note the time the incident was detected
Assign incident commander
Create incident channel (if needed)
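
One way to capture the detection time and later actions is a timestamped log file, which also feeds the post-incident timeline. The sketch below is an assumption about workflow, not an existing tool: the `log_event` helper and the `incident.log` file name are hypothetical.

```shell
# Append one UTC-timestamped entry to an incident log.
# Usage: log_event <logfile> <message...>
log_event() {
  logfile=$1
  shift
  printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >> "$logfile"
}

# Example (not run here):
#   log_event incident.log "DETECTED: 5xx spike on crm-api"
#   log_event incident.log "COMMANDER: <name>"
```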

2. Assess

Check overall system health:

# Pod status
kubectl get pods -n crm

# Recent events
kubectl get events -n crm --sort-by='.lastTimestamp'

# Service health
curl https://crm-api.digiwedge.dev/health
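
When the namespace has many pods, it can help to filter the pod list down to the ones that need attention. The sketch below assumes the default column layout of `kubectl get pods` (NAME, READY, STATUS, ...); the `unhealthy_pods` helper is hypothetical.

```shell
# Read `kubectl get pods --no-headers` output on stdin and print only pods
# that look unhealthy: STATUS other than Running/Completed, or a Running pod
# whose READY count (e.g. 1/2) shows containers that are not ready.
unhealthy_pods() {
  awk '
    {
      split($2, ready, "/")
      if ($3 == "Completed") next
      if ($3 != "Running" || ready[1] != ready[2]) print $1, $2, $3
    }
  '
}

# Example (not run here):
#   kubectl get pods -n crm --no-headers | unhealthy_pods
```

Empty output means every pod is either Running with all containers ready, or Completed.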

3. Communicate

Update status page
Notify stakeholders
Keep incident channel updated

Common Incidents

API Unresponsive

Symptoms: 5xx errors, timeouts

Check:

# Pod logs
kubectl logs -l app=crm-backend -n crm --tail=100

# Resource usage
kubectl top pods -n crm

# Database connections (sh -c makes $CRM_DATABASE_URL expand inside the pod, not in your local shell)
kubectl exec deploy/crm-backend -n crm -- \
  sh -c 'psql "$CRM_DATABASE_URL" -c "SELECT count(*) FROM pg_stat_activity"'

Resolution:

# Restart pods
kubectl rollout restart deployment/crm-backend -n crm

# Scale up if needed
kubectl scale deployment/crm-backend --replicas=4 -n crm
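
`kubectl rollout restart` returns immediately, so it does not confirm that the new pods came up. A minimal sketch that restarts and then waits with `kubectl rollout status` follows; the `restart_and_wait` wrapper and the 120s default are assumptions.

```shell
# Restart a deployment in the crm namespace and block until the rollout
# finishes or the timeout expires. Returns non-zero if either step fails.
# Usage: restart_and_wait <deployment> [timeout]   (timeout defaults to 120s)
restart_and_wait() {
  deploy=$1
  wait_for=${2:-120s}
  kubectl rollout restart "deployment/$deploy" -n crm &&
    kubectl rollout status "deployment/$deploy" -n crm --timeout="$wait_for"
}

# Example (not run here):
#   restart_and_wait crm-backend 180s
```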

Worker Queue Stuck

Symptoms: Jobs not processing, queue growing

Check:

# Worker logs
kubectl logs -l app=crm-worker -n crm --tail=100

# Redis queue depth
redis-cli -u "$REDIS_URL" LLEN crm-events

Resolution:

# Restart workers
kubectl rollout restart deployment/crm-worker -n crm
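
A single `LLEN` reading does not show whether the queue is growing. Taking two samples a short time apart gives a trend, both before and after the restart. This is a sketch; the `queue_trend` helper and the 30-second interval are assumptions.

```shell
# Compare two queue-depth samples and report the trend.
# Usage: queue_trend <earlier_depth> <later_depth>
queue_trend() {
  if [ "$2" -gt "$1" ]; then
    echo "growing (+$(( $2 - $1 )))"
  elif [ "$2" -lt "$1" ]; then
    echo "draining (-$(( $1 - $2 )))"
  else
    echo "flat"
  fi
}

# Example (not run here): sample the list twice, 30 seconds apart.
#   before=$(redis-cli -u "$REDIS_URL" LLEN crm-events)
#   sleep 30
#   after=$(redis-cli -u "$REDIS_URL" LLEN crm-events)
#   queue_trend "$before" "$after"
```

A result of `flat` at a non-zero depth may also mean that workers are not consuming, so read it together with the worker logs.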

Database Connection Exhausted

Symptoms: Connection timeout errors

Check:

-- Active connections
SELECT count(*) FROM pg_stat_activity WHERE datname = 'crm';

-- Idle connections
SELECT count(*) FROM pg_stat_activity
WHERE datname = 'crm' AND state = 'idle';

Resolution:

-- Terminate connections that have been idle for more than 10 minutes
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE datname = 'crm' AND state = 'idle'
AND state_change < now() - interval '10 minutes';

# Restart pods to reset pool
kubectl rollout restart deployment/crm-backend -n crm
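
A raw connection count is easier to judge against the server limit. The sketch below turns the two numbers into a percentage; the `conn_usage_pct` helper is hypothetical. The result is approximate, because `pg_stat_activity` can also list background processes that do not count toward `max_connections`.

```shell
# Print connection usage as a whole-number percentage of the server limit
# (integer division, so the result is rounded down).
# Usage: conn_usage_pct <current_connections> <max_connections>
conn_usage_pct() {
  if [ "$2" -le 0 ]; then
    echo "max_connections must be positive" >&2
    return 1
  fi
  echo $(( $1 * 100 / $2 ))
}

# Example (not run here), assuming psql can reach the database:
#   used=$(psql "$CRM_DATABASE_URL" -Atc "SELECT count(*) FROM pg_stat_activity")
#   max=$(psql "$CRM_DATABASE_URL" -Atc "SHOW max_connections")
#   conn_usage_pct "$used" "$max"
```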

Social Publishing Failed

Symptoms: Posts not appearing on platforms

Check:

-- Check failed posts
SELECT id, platform, status, error_message
FROM "SocialPost"
WHERE status = 'FAILED'
ORDER BY created_at DESC LIMIT 10;

-- Check token expiry
SELECT platform, token_expires_at
FROM "SocialConnection"
WHERE token_expires_at < now();

Resolution:

If token expired: User must re-authenticate
If rate limited: Wait and retry
If content rejected: Review content and resubmit

Event Processing Lag

Symptoms: Customer data not updating

Check:

# Queue depth
redis-cli -u "$REDIS_URL" LLEN crm-events

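# Processing rate (with -l, kubectl logs returns only the last 10 lines per pod unless --tail is set)
kubectl logs -l app=crm-worker -n crm --since=10m --tail=-1 | grep "processed"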

-- Check for failed events
SELECT * FROM "SyncLog"
WHERE status = 'FAILED'
ORDER BY created_at DESC LIMIT 10;

Resolution:

# Scale workers
kubectl scale deployment/crm-worker --replicas=4 -n crm

-- Retry failed events
UPDATE "SyncLog"
SET status = 'PENDING', retry_count = 0
WHERE status = 'FAILED' AND created_at > now() - interval '1 hour';

Post-Incident

1. Resolve

Verify fix is working
Monitor for recurrence
Update status page

2. Document

Timeline of events
Root cause
Actions taken

3. Review

Schedule post-mortem
Identify preventive measures
Update runbooks

Escalation Path

On-call engineer
Platform lead
DevOps team
Management
