# Monitoring & Alerts Runbook

How to respond to CRM platform alerts.
## Alert Overview
| Alert | Severity | Description |
|---|---|---|
| CRM API Down | P1 | Health endpoint failing |
| High Error Rate | P2 | More than 5% of requests failing |
| Queue Depth High | P2 | More than 1000 pending jobs |
| Database Connections | P2 | More than 80% of connection pool in use |
| Social Token Expiring | P3 | Token expires within 7 days |
| Memory High | P3 | More than 80% memory usage |
## Prometheus Queries

### Request Rate

```promql
sum(rate(http_requests_total{app="crm-backend"}[5m]))
```

### Error Rate

```promql
sum(rate(http_requests_total{app="crm-backend", status=~"5.."}[5m]))
  /
sum(rate(http_requests_total{app="crm-backend"}[5m])) * 100
```

### Response Latency (p99)

```promql
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket{app="crm-backend"}[5m])) by (le)
)
```

### Queue Depth

```promql
bullmq_queue_size{queue="crm-events"}
```

### Database Connections

```promql
pg_stat_activity_count{datname="crm"}
```
## Alert Response

### CRM API Down

**Alert:** Health endpoint returning non-200

**Check:**

```bash
# Test health endpoint
curl -v https://crm-api.digiwedge.dev/health

# Check pod status
kubectl get pods -n crm

# Check logs
kubectl logs -l app=crm-backend -n crm --tail=50
```

**Resolve:**

```bash
# Restart pods
kubectl rollout restart deployment/crm-backend -n crm

# If persistent, check resources
kubectl describe pod -l app=crm-backend -n crm
```
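If a restart does not bring the pods back, recent namespace events usually show why (failed probes, image pull errors, evictions):

```bash
# Recent events in the crm namespace, newest last
kubectl get events -n crm --sort-by=.lastTimestamp | tail -20
```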
### High Error Rate

**Alert:** Error rate >5%

**Check:**

```bash
# Find error patterns
kubectl logs -l app=crm-backend -n crm | grep -i error | tail -50

# For specific endpoint errors, filter by endpoint in Grafana
```
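If Grafana is not to hand, the same per-endpoint breakdown can be pulled from the Prometheus HTTP API. This is a sketch: the Prometheus host is left as a placeholder, and the `route` label is an assumption, so substitute whatever label your instrumentation uses for the endpoint.

```bash
# 5xx rate per endpoint over the last 5 minutes (assumes a "route" label)
curl -sG "https://<prometheus-host>/api/v1/query" \
  --data-urlencode 'query=sum by (route) (rate(http_requests_total{app="crm-backend", status=~"5.."}[5m]))'
```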
**Resolve:**

- If database error: check DB connectivity (see the sketch below)
- If timeout: check downstream services
- If validation: check for bad client requests
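For the database case, a quick connectivity check from a backend pod, using the same psql-over-kubectl pattern as the database section below:

```bash
# Verify the backend can reach Postgres at all
kubectl exec -it deploy/crm-backend -n crm -- \
  psql $CRM_DATABASE_URL -c "SELECT 1"
```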
### Queue Depth High

**Alert:** Queue has >1000 pending jobs

**Check:**

```bash
# Pending (waiting) jobs; BullMQ keeps the wait list under the bull: prefix by default
redis-cli LLEN bull:crm-events:wait

# Worker status
kubectl get pods -l app=crm-worker -n crm

# Worker logs
kubectl logs -l app=crm-worker -n crm --tail=50
```
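A high waiting count is only part of the picture; jobs may also be piling up in the delayed or failed sets. A quick breakdown, again assuming the default `bull:` key prefix:

```bash
# Waiting and active jobs are Redis lists; delayed and failed are sorted sets
redis-cli LLEN bull:crm-events:wait
redis-cli LLEN bull:crm-events:active
redis-cli ZCARD bull:crm-events:delayed
redis-cli ZCARD bull:crm-events:failed
```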
**Resolve:**

```bash
# Scale workers
kubectl scale deployment/crm-worker --replicas=4 -n crm

# If workers are stuck, restart them
kubectl rollout restart deployment/crm-worker -n crm
```
### Database Connections High

**Alert:** >80% of connection pool used

**Check:**

```bash
# Connection count
kubectl exec -it deploy/crm-backend -n crm -- \
  psql $CRM_DATABASE_URL -c \
  "SELECT count(*) FROM pg_stat_activity WHERE datname = 'crm'"

# Connections grouped by state (watch for a large idle count)
kubectl exec -it deploy/crm-backend -n crm -- \
  psql $CRM_DATABASE_URL -c \
  "SELECT state, count(*) FROM pg_stat_activity WHERE datname = 'crm' GROUP BY state"
```
**Resolve:**

```bash
# Kill idle connections older than 5 minutes
kubectl exec -it deploy/crm-backend -n crm -- \
  psql $CRM_DATABASE_URL -c \
  "SELECT pg_terminate_backend(pid) FROM pg_stat_activity
   WHERE datname = 'crm' AND state = 'idle'
   AND query_start < now() - interval '5 minutes'"

# Restart pods to reset the pool
kubectl rollout restart deployment/crm-backend -n crm
```
### Social Token Expiring

**Alert:** Token expires within 7 days

**Check:**

```sql
SELECT id, platform, club_id, token_expires_at
FROM "SocialConnection"
WHERE token_expires_at < now() + interval '7 days';
```
**Resolve:**

- Notify the club admin to re-authenticate
- Or trigger an automatic token refresh if the platform supports it (see the sketch below)
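Where refresh is supported, it can be triggered without involving the club admin. The endpoint below is purely hypothetical (this runbook does not document a refresh API); substitute whatever mechanism actually exists:

```bash
# Hypothetical example: trigger a token refresh for one connection via an
# internal admin endpoint (the path and auth header are assumptions)
curl -X POST "https://crm-api.digiwedge.dev/internal/social-connections/<connection-id>/refresh" \
  -H "Authorization: Bearer $ADMIN_API_TOKEN"
```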
### Memory High

**Alert:** Pod memory exceeds 80%

**Check:**

```bash
# Current usage
kubectl top pods -n crm

# Memory trends in Grafana
```
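If Grafana is unavailable, the working-set metric can be queried from Prometheus directly. A sketch: the Prometheus host is left as a placeholder, and the metric assumes standard cAdvisor/kubelet metrics are being scraped.

```bash
# Current memory working set per crm-backend pod, in bytes
curl -sG "https://<prometheus-host>/api/v1/query" \
  --data-urlencode 'query=container_memory_working_set_bytes{namespace="crm", pod=~"crm-backend.*", container!=""}'
```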
**Resolve:**

```bash
# Restart to clear memory
kubectl rollout restart deployment/crm-backend -n crm

# If persistent, increase limits
kubectl set resources deployment/crm-backend \
  --limits=memory=1Gi -n crm
```
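Before raising limits, check whether the containers are actually being OOM-killed or whether usage just creeps up over time; steady growth after each restart points at a leak rather than undersized limits.

```bash
# Look for recent OOM kills in the container status
kubectl describe pod -l app=crm-backend -n crm | grep -i -B2 -A4 "oomkilled"
```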
## Grafana Dashboards
| Dashboard | Purpose |
|---|---|
| CRM Overview | Request rate, latency, errors |
| CRM Workers | Queue depth, processing rate |
| CRM Database | Connections, query performance |
| CRM Social | Publishing rate, failures |
## Silencing Alerts

For planned maintenance:

```bash
# Create a silence in Alertmanager for all CRM alerts (note the regex matcher =~)
amtool silence add 'alertname=~"CRM.*"' \
  --comment="Planned maintenance" \
  --duration=2h
```
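If maintenance finishes early, expire the silence instead of waiting for it to lapse:

```bash
# List active silences to find the ID
amtool silence query

# Expire a silence by ID
amtool silence expire <silence-id>
```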
## Updating Alert Rules

Alert rules are in `kubernetes/crm/alerts.yaml`:

```yaml
groups:
  - name: crm
    rules:
      - alert: CRMAPIDown
        expr: up{job="crm-backend"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "CRM API is down"
```
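After editing, the file can be syntax-checked with `promtool` before rollout. How the rules actually reach Prometheus (ConfigMap reload, PrometheusRule CRD, etc.) depends on the deployment, so follow whatever process the cluster uses.

```bash
# Validate rule syntax before deploying
promtool check rules kubernetes/crm/alerts.yaml
```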