Monitoring & Alerts Runbook

How to respond to CRM platform alerts.

Alert Overview

Alert                  | Severity | Description
CRM API Down           | P1       | Health endpoint failing
High Error Rate        | P2       | More than 5% of requests failing
Queue Depth High       | P2       | More than 1000 pending jobs
Database Connections   | P2       | More than 80% of connection pool used
Social Token Expiring  | P3       | Token expires within 7 days
Memory High            | P3       | More than 80% memory usage

Prometheus Queries

Request Rate

sum(rate(http_requests_total{app="crm-backend"}[5m]))

Error Rate

sum(rate(http_requests_total{app="crm-backend", status=~"5.."}[5m]))
/
sum(rate(http_requests_total{app="crm-backend"}[5m])) * 100

Response Latency (p99)

histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket{app="crm-backend"}[5m])) by (le)
)

Queue Depth

bullmq_queue_size{queue="crm-events"}

Database Connections

pg_stat_activity_count{datname="crm"}
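
When Grafana is unavailable, any of these queries can be run ad-hoc against the Prometheus HTTP API. A minimal sketch using the error-rate query; the Prometheus address below is a placeholder for wherever Prometheus is reachable in your environment:

# Ad-hoc PromQL query via the Prometheus HTTP API (address is a placeholder)
curl -sG 'http://prometheus.monitoring.svc:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(http_requests_total{app="crm-backend", status=~"5.."}[5m])) / sum(rate(http_requests_total{app="crm-backend"}[5m])) * 100'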

Alert Response

CRM API Down

Alert: Health endpoint returning non-200

Check:

# Test health endpoint
curl -v https://crm-api.digiwedge.dev/health

# Check pod status
kubectl get pods -n crm

# Check logs
kubectl logs -l app=crm-backend -n crm --tail=50

Resolve:

# Restart pods
kubectl rollout restart deployment/crm-backend -n crm

# If persistent, check resources
kubectl describe pod -l app=crm-backend -n crm
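
If a restart does not bring the endpoint back, recent cluster events usually point at the cause (failed probes, image pull errors, evictions):

# Rollout progress and recent events
kubectl rollout status deployment/crm-backend -n crm
kubectl get events -n crm --sort-by=.lastTimestamp | tail -20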

High Error Rate

Alert: Error rate >5%

Check:

# Find error patterns
kubectl logs -l app=crm-backend -n crm | grep -i error | tail -50

# Check specific endpoint errors
# In Grafana, filter by endpoint

Resolve:

  • If database error: Check DB connectivity (see the sketch below)
  • If timeout: Check downstream services
  • If validation: Check for bad client requests
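
For the database case, a minimal connectivity check from inside a backend pod, reusing $CRM_DATABASE_URL as in the database section below:

# Quick DB connectivity check from a backend pod
kubectl exec -it deploy/crm-backend -n crm -- \
  psql $CRM_DATABASE_URL -c "SELECT 1"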

Queue Depth High

Alert: Queue has >1000 pending jobs

Check:

# Queue depth (BullMQ waiting list; default bull: key prefix assumed)
redis-cli LLEN bull:crm-events:wait

# Worker status
kubectl get pods -l app=crm-worker -n crm

# Worker logs
kubectl logs -l app=crm-worker -n crm --tail=50

Resolve:

# Scale workers
kubectl scale deployment/crm-worker --replicas=4 -n crm

# If workers stuck, restart
kubectl rollout restart deployment/crm-worker -n crm
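
If workers are running but the backlog is not draining, it is also worth checking how many jobs have failed or been delayed. Assuming the standard BullMQ key layout with the default bull: prefix, those are kept in sorted sets:

# Failed and delayed job counts (default bull: key prefix assumed)
redis-cli ZCARD bull:crm-events:failed
redis-cli ZCARD bull:crm-events:delayed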

Database Connections High

Alert: >80% connection pool used

Check:

# Connection count
kubectl exec -it deploy/crm-backend -n crm -- \
  psql $CRM_DATABASE_URL -c \
  "SELECT count(*) FROM pg_stat_activity WHERE datname = 'crm'"

# Idle connections
kubectl exec -it deploy/crm-backend -n crm -- \
  psql $CRM_DATABASE_URL -c \
  "SELECT state, count(*) FROM pg_stat_activity WHERE datname = 'crm' GROUP BY state"

Resolve:

# Kill idle connections
kubectl exec -it deploy/crm-backend -n crm -- \
  psql $CRM_DATABASE_URL -c \
  "SELECT pg_terminate_backend(pid) FROM pg_stat_activity
   WHERE datname = 'crm' AND state = 'idle'
   AND query_start < now() - interval '5 minutes'"

# Restart pods to reset pool
kubectl rollout restart deployment/crm-backend -n crm
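
If killing idle sessions does not free the pool, a few long-running queries are often holding the remaining connections. A sketch that lists the longest-running active ones using standard pg_stat_activity columns:

# Longest-running active queries holding connections
kubectl exec -it deploy/crm-backend -n crm -- \
  psql $CRM_DATABASE_URL -c \
  "SELECT pid, now() - query_start AS runtime, left(query, 60) AS query
   FROM pg_stat_activity
   WHERE datname = 'crm' AND state <> 'idle'
   ORDER BY runtime DESC LIMIT 10"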

Social Token Expiring

Alert: Token expires within 7 days

Check:

SELECT id, platform, club_id, token_expires_at
FROM "SocialConnection"
WHERE token_expires_at < now() + interval '7 days';
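
To gauge how many clubs are affected before notifying anyone, the same table can be grouped by platform; a sketch run from a backend pod, reusing the psql pattern from the database section:

# Count expiring tokens per platform
kubectl exec -it deploy/crm-backend -n crm -- \
  psql $CRM_DATABASE_URL -c \
  "SELECT platform, count(*) FROM \"SocialConnection\"
   WHERE token_expires_at < now() + interval '7 days'
   GROUP BY platform"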

Resolve:

  • Notify club admin to re-authenticate
  • Or trigger automatic token refresh if supported

Memory High

Alert: Pod memory exceeds 80%

Check:

# Current usage
kubectl top pods -n crm

# Memory trends in Grafana

Resolve:

# Restart to clear memory
kubectl rollout restart deployment/crm-backend -n crm

# If persistent, increase limits
kubectl set resources deployment/crm-backend \
  --limits=memory=1Gi -n crm
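
If memory climbs back up after each restart, check whether the pods are actually being OOM-killed at the current limit before raising it:

# Restart counts and last termination reason (OOMKilled means the limit was hit)
kubectl get pods -n crm -l app=crm-backend
kubectl describe pods -l app=crm-backend -n crm | grep -A5 "Last State"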

Grafana Dashboards

Dashboard     | Purpose
CRM Overview  | Request rate, latency, errors
CRM Workers   | Queue depth, processing rate
CRM Database  | Connections, query performance
CRM Social    | Publishing rate, failures

Silencing Alerts

For planned maintenance:

# Create silence in Alertmanager
amtool silence add alertname=~"CRM.*" \
  --comment="Planned maintenance" \
  --duration=2h
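
If maintenance finishes early, the silence can be listed and expired rather than left to lapse:

# Review active silences and expire one by ID
amtool silence query
amtool silence expire <silence-id>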

Updating Alert Rules

Alert rules are in kubernetes/crm/alerts.yaml:

groups:
  - name: crm
    rules:
      - alert: CRMAPIDown
        expr: up{job="crm-backend"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "CRM API is down"
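
If the file is a plain Prometheus rules file (as the snippet suggests), promtool can validate it before rollout; how it is then loaded into Prometheus (ConfigMap reload, Prometheus Operator, etc.) depends on the deployment:

# Validate rule syntax locally (assumes promtool is installed)
promtool check rules kubernetes/crm/alerts.yaml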