CloudWatch: a minimal alerting baseline (with 3 starter alarms) - kevwells.com

CloudWatch: a minimal alerting baseline (with 3 starter alarms)

Last updated: 20 Aug 2025

Short version: set log retention explicitly, wire 3 alarms that actually matter, and keep severity mapping simple. Don’t create 40 “informational” alerts and call it monitoring.

1) Set log retention on day one

  • Pick a default (e.g., 30 or 90 days) for all CloudWatch log groups. “Never expire” is not a plan.
  • Apply exceptions only where justified (e.g., audit streams mirrored to S3).
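As a rough sketch of the retention step above: the helper below plans which log groups need a retention policy and then applies it. It assumes log group records shaped like the output of the CloudWatch Logs `describe_log_groups` call (a `logGroupName` key plus an optional `retentionInDays`). The function names, the 30-day default, and the `/audit/trail` exception are illustrative assumptions, not part of the original baseline.

```python
# Retention values CloudWatch Logs accepts, to the best of my knowledge;
# verify against the current PutRetentionPolicy documentation.
VALID_RETENTION_DAYS = {
    1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365, 400, 545,
    731, 1096, 1827, 2192, 2557, 2922, 3288, 3653,
}


def plan_retention(log_groups, default_days=30, exceptions=None):
    """Return {log_group_name: days} for groups whose retention should be set.

    Groups with no retention ("never expire") get the default. Groups named in
    `exceptions` are forced to their listed value. Other explicit settings are
    left alone.
    """
    exceptions = exceptions or {}
    plan = {}
    for group in log_groups:
        name = group["logGroupName"]
        current = group.get("retentionInDays")  # absent means "never expire"
        if name in exceptions:
            wanted = exceptions[name]
        elif current is None:
            wanted = default_days
        else:
            continue
        if wanted not in VALID_RETENTION_DAYS:
            raise ValueError(f"{wanted} is not a valid retention for {name}")
        if current != wanted:
            plan[name] = wanted
    return plan


def apply_retention(logs_client, plan):
    """Apply a plan using a boto3 CloudWatch Logs client."""
    for name, days in plan.items():
        logs_client.put_retention_policy(logGroupName=name, retentionInDays=days)


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   logs = boto3.client("logs")
#   pages = logs.get_paginator("describe_log_groups").paginate()
#   groups = [g for page in pages for g in page["logGroups"]]
#   apply_retention(logs, plan_retention(groups, exceptions={"/audit/trail": 365}))
```

Splitting planning from applying makes it possible to review the plan (or run it as a dry run) before anything changes in the account.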

2) Three alarms that earn their keep

  1. EC2 instance health: StatusCheckFailed_System > 0 for 5 minutes → page the on-call.
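  2. ALB 5xx rate spike: error rate (5xx count divided by RequestCount, computed with metric math) >= threshold (e.g., 5% over 5 minutes) → notify incident channel.
  3. RDS free storage low: FreeStorageSpace below threshold for 10 minutes → ticket + notify. The metric is reported in bytes, so convert a percentage target (e.g., 15% of allocated storage) into an absolute byte value.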

Everything else can wait until your runbook exists.
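One way to express the three starter alarms is as keyword arguments for the CloudWatch `put_metric_alarm` call. This is a sketch under assumptions: the alarm names, the choice of `HTTPCode_Target_5XX_Count` as the 5xx source, the metric-math ids, and the helper names are all illustrative; thresholds mirror the examples above.

```python
def ec2_system_check_alarm(instance_id, topic_arn):
    """StatusCheckFailed_System > 0 for five consecutive 1-minute periods."""
    return {
        "AlarmName": f"ec2-system-check-{instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed_System",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": 60,
        "EvaluationPeriods": 5,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [topic_arn],
    }


def alb_5xx_rate_alarm(load_balancer, topic_arn, threshold_pct=5.0):
    """5xx responses as a percentage of requests over 5 minutes (metric math)."""
    def stat(metric_id, metric_name):
        return {
            "Id": metric_id,
            "ReturnData": False,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/ApplicationELB",
                    "MetricName": metric_name,
                    "Dimensions": [{"Name": "LoadBalancer", "Value": load_balancer}],
                },
                "Period": 300,
                "Stat": "Sum",
            },
        }

    return {
        "AlarmName": f"alb-5xx-rate-{load_balancer.replace('/', '-')}",
        "Metrics": [
            stat("m_errors", "HTTPCode_Target_5XX_Count"),
            stat("m_requests", "RequestCount"),
            {
                "Id": "error_pct",
                "Expression": "100 * m_errors / m_requests",
                "Label": "5xx rate (%)",
                "ReturnData": True,  # the alarm evaluates this expression
            },
        ],
        "EvaluationPeriods": 1,
        "Threshold": threshold_pct,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "notBreaching",  # no traffic or no errors is not an incident
        "AlarmActions": [topic_arn],
    }


def rds_free_storage_alarm(db_instance_id, allocated_gib, topic_arn, min_free_pct=15):
    """FreeStorageSpace (bytes) below a percentage of allocated storage for 10 minutes."""
    threshold_bytes = allocated_gib * 1024**3 * min_free_pct / 100
    return {
        "AlarmName": f"rds-free-storage-{db_instance_id}",
        "Namespace": "AWS/RDS",
        "MetricName": "FreeStorageSpace",
        "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": db_instance_id}],
        "Statistic": "Average",
        "Period": 300,
        "EvaluationPeriods": 2,
        "Threshold": threshold_bytes,
        "ComparisonOperator": "LessThanThreshold",
        "AlarmActions": [topic_arn],
    }


# Usage (requires boto3 and AWS credentials; identifiers are placeholders):
#   import boto3
#   cw = boto3.client("cloudwatch")
#   cw.put_metric_alarm(**rds_free_storage_alarm("prod-db", 100, topic_arn))
```

The load balancer value is the `LoadBalancer` dimension form (for example `app/my-alb/50dc6c495c0c9188`), not the full ARN.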

3) Optional: metric filters from CloudTrail

If your CloudTrail is shipped to CloudWatch Logs, add these filters + alarms (they’re security-relevant and low-noise):

# Root account usage
{ ($.userIdentity.type = "Root") && ($.userIdentity.invokedBy NOT EXISTS) && ($.eventType != "AwsServiceEvent") }

# Unauthorized API calls
{ ($.errorCode = "*UnauthorizedOperation") || ($.errorCode = "AccessDenied*") }

# Console logins without MFA
{ ($.eventName = "ConsoleLogin") && ($.additionalEventData.MFAUsed = "No") && ($.responseElements.ConsoleLogin = "Success") }
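The filters above could be registered with the CloudWatch Logs `put_metric_filter` call and paired with a simple alarm on the resulting metric. The sketch below only builds the request arguments. The metric namespace `Security/CloudTrail`, the filter and alarm names, and the helper names are assumptions for illustration.

```python
# Filter patterns copied from the section above, keyed by a hypothetical metric name.
CLOUDTRAIL_FILTERS = {
    "RootAccountUsage": (
        '{ ($.userIdentity.type = "Root") && ($.userIdentity.invokedBy NOT EXISTS) '
        '&& ($.eventType != "AwsServiceEvent") }'
    ),
    "UnauthorizedApiCalls": (
        '{ ($.errorCode = "*UnauthorizedOperation") || ($.errorCode = "AccessDenied*") }'
    ),
    "ConsoleLoginWithoutMfa": (
        '{ ($.eventName = "ConsoleLogin") && ($.additionalEventData.MFAUsed = "No") '
        '&& ($.responseElements.ConsoleLogin = "Success") }'
    ),
}


def metric_filter_requests(log_group_name, namespace="Security/CloudTrail"):
    """Build one put_metric_filter request per pattern; each match counts as 1."""
    return [
        {
            "logGroupName": log_group_name,
            "filterName": name,
            "filterPattern": pattern,
            "metricTransformations": [
                {"metricName": name, "metricNamespace": namespace, "metricValue": "1"}
            ],
        }
        for name, pattern in CLOUDTRAIL_FILTERS.items()
    ]


def filter_alarm(metric_name, topic_arn, namespace="Security/CloudTrail"):
    """Alarm when the filter matched at least once in a 5-minute window."""
    return {
        "AlarmName": f"cloudtrail-{metric_name}",
        "Namespace": namespace,
        "MetricName": metric_name,
        "Statistic": "Sum",
        "Period": 300,
        "EvaluationPeriods": 1,
        "Threshold": 1,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "notBreaching",  # no matching events means no data
        "AlarmActions": [topic_arn],
    }


# Usage (requires boto3 and AWS credentials; the log group name is a placeholder):
#   import boto3
#   logs, cw = boto3.client("logs"), boto3.client("cloudwatch")
#   for request in metric_filter_requests("my-cloudtrail-log-group"):
#       logs.put_metric_filter(**request)
#       cw.put_metric_alarm(**filter_alarm(request["filterName"], topic_arn))
```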

4) Wire alerts sensibly

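  • Alarms → SNS → email/chat/incident tool. Use one topic per severity.
  • Define OK/ALARM/INSUFFICIENT_DATA handling: decide which state changes notify, and set how missing data is treated rather than relying on the default. Don’t spam on flaps; use a 5-minute period and require more than one breaching datapoint (e.g., 2 of 3) before the alarm fires.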
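A minimal sketch of the wiring described above: one SNS topic per severity, explicit actions for each alarm state, and evaluation settings that damp flapping. The severity names, topic ARNs, and the "2 of 3" defaults are assumptions chosen for illustration. The keys match `put_metric_alarm` parameters.

```python
# Hypothetical severity-to-topic mapping; replace with your own topic ARNs.
SEVERITY_TOPICS = {
    "page": "arn:aws:sns:eu-west-2:123456789012:alerts-page",
    "notify": "arn:aws:sns:eu-west-2:123456789012:alerts-notify",
    "ticket": "arn:aws:sns:eu-west-2:123456789012:alerts-ticket",
}


def wire_alarm(alarm, severity, topics=SEVERITY_TOPICS, notify_on_ok=True):
    """Return a copy of put_metric_alarm kwargs with state handling made explicit."""
    topic = topics[severity]  # KeyError on an unknown severity is intentional
    wired = dict(alarm)
    wired["AlarmActions"] = [topic]
    wired["OKActions"] = [topic] if notify_on_ok else []
    wired["InsufficientDataActions"] = []  # stay quiet when data is missing
    wired.setdefault("TreatMissingData", "missing")
    # Damp flapping: 2 breaching datapoints out of the last 3 periods.
    wired.setdefault("EvaluationPeriods", 3)
    wired.setdefault("DatapointsToAlarm", 2)
    # Metric-math alarms carry their period inside "Metrics", so a top-level
    # Period is only set for single-metric alarms.
    if "Metrics" not in wired:
        wired.setdefault("Period", 300)
    return wired


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   base = {"AlarmName": "...", "Namespace": "...", "MetricName": "...",
#           "Statistic": "Average", "Threshold": 1,
#           "ComparisonOperator": "GreaterThanThreshold"}
#   boto3.client("cloudwatch").put_metric_alarm(**wire_alarm(base, "page"))
```

Because existing keys are preserved via `setdefault`, an alarm that already specifies its own period or evaluation window keeps it.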

5) Keep the noise down

  • Every alarm must have a clear owner and a runbook link. If nobody owns it, delete it.
  • Review alarms monthly; remove those that didn’t help.
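The ownership rule above can be checked mechanically during the monthly review. This sketch assumes a convention that is not part of the original post: each alarm's description carries `owner=<team>` and `runbook=<url>` tokens. It takes alarm records shaped like the `MetricAlarms` entries returned by the CloudWatch `describe_alarms` call.

```python
import re

# Hypothetical convention: "owner=<team> runbook=<url>" inside AlarmDescription.
OWNER_RE = re.compile(r"\bowner=\S+")
RUNBOOK_RE = re.compile(r"\brunbook=https?://\S+")


def alarms_missing_metadata(alarms):
    """Return {alarm_name: [missing fields]} for alarms lacking owner or runbook."""
    missing = {}
    for alarm in alarms:
        description = alarm.get("AlarmDescription") or ""
        gaps = []
        if not OWNER_RE.search(description):
            gaps.append("owner")
        if not RUNBOOK_RE.search(description):
            gaps.append("runbook")
        if gaps:
            missing[alarm["AlarmName"]] = gaps
    return missing


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   cw = boto3.client("cloudwatch")
#   pages = cw.get_paginator("describe_alarms").paginate()
#   alarms = [a for page in pages for a in page["MetricAlarms"]]
#   for name, gaps in alarms_missing_metadata(alarms).items():
#       print(f"{name}: missing {', '.join(gaps)}")
```

Alarms that show up in this report are candidates for either fixing or deleting.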

Want this baseline deployed correctly across accounts/regions? Request a call.

Security gaps in Linux and cloud systems risk downtime, data compromise, lost business — and compliance failures.

With 20+ years’ experience and active UK Security Check (SC) clearance, I harden Linux and cloud platforms for government, corporate, and academic sectors — ensuring secure, compliant, and resilient infrastructure.