AWS CloudWatch Monitoring Overview


 

AWS CloudWatch is the basic AWS monitoring service that collects metrics on your resources in AWS, including your applications, in real time.

 

You can also collect and monitor log files with AWS CloudWatch. You can set alarms for metrics in CloudWatch to continuously monitor performance, utilization, health, and other parameters of your AWS resources and take action when metrics cross set thresholds.

 

CloudWatch metrics are stored per AWS region, but a single CloudWatch dashboard can display resources and services from across all AWS regions.

 

 

CloudWatch provides basic monitoring free of charge at 5-minute intervals. As a serverless AWS service, it requires no additional software to be installed.

 

 

For an additional charge, you can set detailed monitoring that provides data at 1-minute intervals.

 

 

AWS CloudWatch also lets you publish custom metrics for your applications, services, and resources. Custom metrics have standard resolution (1-minute granularity) by default; metrics published with 1-second granularity are known as high-resolution custom metrics.

 

CloudWatch stores metrics data for 15 months, so even after terminating an EC2 instance or deleting an ELB, you can still retrieve historical metrics for these resources.

 

 

How CloudWatch Works

 

CW Monitoring Is Event-Driven

 

All monitoring in AWS is event-driven. An event is “something that happens in AWS and is captured.”

 

For example, when a new EBS volume is created, the createVolume event is triggered, with a result of either available or failed. This event and its result are sent to CloudWatch.

 

You can create a maximum of 5000 alarms in every region in your AWS account.

 

You can create alarms that take action on an EC2 instance, such as stopping, terminating, rebooting, or recovering it, or that notify you when an instance is experiencing a service issue.

 

Monitoring Is Customizable

 

You can define custom metrics easily. A custom metric behaves just like a predefined one and can then be analyzed and interpreted in the same way as standard metrics.

One important limitation of CloudWatch – exam question! 

 

CloudWatch functions below the AWS Hypervisor, which means it functions below the virtualization layer of AWS.

 

This means it can report on things like CPU usage and disk I/O, but it cannot see what is happening *above* that layer, inside the guest operating system.

 

This means CloudWatch CANNOT tell you what tasks or application processes are affecting performance. Remember this point!

 

Thus it cannot tell you about disk usage unless you write code that checks disk usage and sends the result as a custom metric to CloudWatch.

 

This is an important aspect that can appear in the exam. You might be asked if CloudWatch can report on memory or disk usage by default; it cannot.
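The custom-metric workaround described above can be sketched in Python. This is a minimal illustration, not a definitive implementation: the namespace `Custom/System`, the metric name `DiskUsedPercent`, and the instance ID are hypothetical placeholders, and the dictionary is shaped as keyword arguments for the CloudWatch PutMetricData call.

```python
import shutil
from datetime import datetime, timezone


def disk_used_percent(path="/"):
    """Return the percentage of the filesystem at `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def build_disk_metric(path="/", instance_id="i-0123456789abcdef0"):
    """Build keyword arguments for a PutMetricData request (names are assumptions)."""
    return {
        "Namespace": "Custom/System",  # hypothetical custom namespace
        "MetricData": [
            {
                "MetricName": "DiskUsedPercent",  # hypothetical metric name
                "Dimensions": [
                    {"Name": "InstanceId", "Value": instance_id},
                    {"Name": "Path", "Value": path},
                ],
                "Timestamp": datetime.now(timezone.utc),
                "Value": disk_used_percent(path),
                "Unit": "Percent",
            }
        ],
    }


params = build_disk_metric(".")
# With boto3 installed and credentials configured, the datapoint could be sent
# with: boto3.client("cloudwatch").put_metric_data(**params)
```

Run on a schedule (for example from cron), a script like this would give CloudWatch the disk usage figure it cannot see by itself.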

 

Monitoring Drives Action

 

The final piece of the AWS monitoring puzzle is alarms – this is what occurs after a metric has reported a value or result outside a set “everything is okay” threshold.

 

When this happens, an alarm is triggered. Note that an alarm is not necessarily the same as “something is wrong”; an alarm is merely a notification that something has happened at a particular point.

 

The resulting action could be, for example, running some code in Lambda, sending a message to an Auto Scaling group telling it to scale in, or sending an email via the AWS SNS message service.

 

Think of alarms as saving you from having to sit monitoring the CloudWatch dashboard 24×7.

 

One of your tasks as a SysOps administrator is to define these alarms.

 

 

CloudWatch Is Metric- and Event-Based

 

Know the difference between metrics and events.

An event is predefined and is something that happens, such as bytes coming into a network interface.

 

The metric is a measure of that event, e.g. how many bytes are received in a given period of time.

 

Events and metrics are related, but they are not the same thing.
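To make the distinction concrete, here is a small Python sketch that turns individual "bytes received" events into a per-period metric. The event tuples are invented for illustration, and the aggregation only mimics the idea of a metric period; it is not how CloudWatch is implemented internally.

```python
from collections import defaultdict


def bytes_per_period(events, period_seconds=60):
    """Sum byte counts into fixed-length periods, keyed by period start time."""
    totals = defaultdict(int)
    for timestamp, byte_count in events:
        period_start = timestamp - (timestamp % period_seconds)
        totals[period_start] += byte_count
    return dict(totals)


# Hypothetical events: (epoch seconds, bytes received on a network interface)
events = [(0, 500), (30, 1500), (61, 200), (119, 300), (120, 1000)]
network_in = bytes_per_period(events)
```

Five events collapse into three datapoints, one per 60-second period: the events are what happened, the metric is the measurement.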

 

CloudWatch Events Are Lower Level

 

An event is something that happens, usually a metric changing or reporting to CloudWatch, but at a system level.

 

An event can then trigger further action, just as an alarm can.

 

Events are typically reported constantly from low-level AWS resources to CloudWatch.

 

CloudWatch Events Have Three Components

 

CloudWatch Events have three key components: events, rules, and targets.

 

An event:

 

the thing being reported. Events describe change in your AWS resources. They can be thought of as event logs for services, applications and resources.

 

A rule:

 

 

an expression that matches incoming events. If the rule matches an event, then the event is forwarded to a target for processing.

 

 

A target:

 

 

another AWS component that processes the event, for example a Lambda function, an Auto Scaling group, or an SNS topic or SQS queue that sends out a message.
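The event, rule, and target relationship can be sketched in a few lines of Python. This is a simplified model that supports exact-value matching only; real event patterns support more operators. The sample event follows the general shape of an EBS volume notification but is abbreviated and illustrative, and the `notify` target is hypothetical.

```python
def matches(pattern, event):
    """Return True if every field in `pattern` is satisfied by `event`.

    A list in the pattern means "the event value must be one of these";
    a dict means "apply the nested pattern to the nested event object".
    """
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        else:
            actual_values = actual if isinstance(actual, list) else [actual]
            if not any(value in expected for value in actual_values):
                return False
    return True


def route(event, rules):
    """Forward the event to the target of every rule whose pattern matches."""
    return [target(event) for pattern, target in rules if matches(pattern, event)]


# Rule: match failed EBS volume creations only.
failed_volume_pattern = {
    "source": ["aws.ec2"],
    "detail-type": ["EBS Volume Notification"],
    "detail": {"event": ["createVolume"], "result": ["failed"]},
}


def notify(event):
    # Hypothetical target; in AWS this would be e.g. a Lambda function or SNS topic.
    return "createVolume failed: " + event["resources"][0]


rules = [(failed_volume_pattern, notify)]

sample_event = {
    "source": "aws.ec2",
    "detail-type": "EBS Volume Notification",
    "resources": ["arn:aws:ec2:us-east-1:123456789012:volume/vol-01234567"],
    "detail": {"event": "createVolume", "result": "failed"},
}
```

Calling `route(sample_event, rules)` invokes `notify` once; the same event with a result of `available` matches no rule and reaches no target.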

 

 

Both alarms and events are important and it is essential to monitor both.

 

CloudWatch Namespaces

 

A CloudWatch Namespace is a container for a collection of related CloudWatch metrics. This provides for a way to group metrics together for easier understanding and recognition.

AWS provides a number of predefined namespaces, which all begin with AWS/[service].

 

E.g., the CPUUtilization metric in the AWS/EC2 namespace is CPU utilization for an EC2 instance,

while CPUUtilization in the AWS/RDS namespace is the same metric but for an RDS database instance.

 

 

The AWS/ namespaces are reserved for AWS services, so you cannot publish your own metrics to them. For custom metrics, you create your own custom namespaces in CloudWatch.

 

 

exam question:

CloudWatch accepts metric data with timestamps from up to 2 weeks in the past to up to 2 hours in the future, so make sure your EC2 instance time is set accurately for this to work correctly!
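A quick way to internalize this window is to express it as a check. The sketch below is a local pre-flight test you might run before publishing a datapoint; the function name is hypothetical and the limits are simply the two figures quoted above.

```python
from datetime import datetime, timedelta, timezone

MAX_PAST = timedelta(weeks=2)    # oldest accepted timestamp, relative to now
MAX_FUTURE = timedelta(hours=2)  # furthest accepted timestamp ahead of now


def timestamp_accepted(timestamp, now=None):
    """Return True if `timestamp` falls inside the accepted window around `now`."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now - MAX_PAST <= timestamp <= now + MAX_FUTURE
```

An instance whose clock has drifted three hours ahead would stamp its datapoints outside the window, which is why accurate time synchronization matters.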

 

 

Monitoring EC2 Instances

 

CloudWatch provides some important often-encountered metrics for EC2.

 

Here are some of the most common EC2 metrics which you should be familiar with for the exam:

 

 

CPUUtilization – one of the fundamental EC2 instance metrics. It shows the percentage of allocated compute units currently in use.

 

DiskReadOps – reports a count of completed read operations from all instance store volumes.

 

DiskWriteOps – the counterpart of DiskReadOps; reports a count of completed write operations to all instance store volumes.

 

DiskReadBytes – reports the bytes read from all available instance store volumes.

 

DiskWriteBytes – reports the total of all bytes written to instance store volumes.

 

NetworkIn – total bytes received by all network interfaces.

 

NetworkOut – total bytes sent out across all network interfaces on the instance.

 

NetworkPacketsIn – total number of packets received by all network interfaces on the instance (available only for basic monitoring).

 

NetworkPacketsOut – number of packets sent out across all network interfaces on the instance. Also available only for basic monitoring.
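As an illustration of how one of these metrics is retrieved, the following Python sketch builds the parameters for a GetMetricStatistics request for CPUUtilization. The helper name and the instance ID are hypothetical; the function only constructs the request and makes no AWS call.

```python
from datetime import datetime, timedelta, timezone


def cpu_stats_request(instance_id, end_time, hours=1, period=300):
    """Build keyword arguments for a GetMetricStatistics request."""
    return {
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "StartTime": end_time - timedelta(hours=hours),
        "EndTime": end_time,
        "Period": period,  # seconds; 300 matches basic 5-minute monitoring
        "Statistics": ["Average", "Maximum"],
    }


request = cpu_stats_request(
    "i-0123456789abcdef0",  # hypothetical instance ID
    datetime(2024, 6, 15, 12, 0, tzinfo=timezone.utc),
)
# With boto3 installed and credentials configured:
# boto3.client("cloudwatch").get_metric_statistics(**request)
```

Swapping `MetricName` for `NetworkIn` or `DiskReadOps` would query the other metrics in the list in the same way.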

 

 

 

S3 Metrics

 

There are many S3 metrics, but these are the most common ones you should know:

BucketSizeBytes – shows the daily storage of your buckets as bytes.

NumberOfObjects – the total number of objects stored in a bucket, across all storage classes.

 

AllRequests – the total number of all HTTP requests made to a bucket.

 

GetRequests – total number of GET requests to a bucket. There are also similar metrics for other requests: PutRequests , DeleteRequests , HeadRequests , PostRequests , and SelectRequests.

 

BytesDownloaded – total bytes downloaded for requests to a bucket.

 

BytesUploaded – total bytes uploaded to a bucket, counted for requests that include a request body.

 

FirstByteLatency – the per-request time, in milliseconds, from when the bucket receives the complete request to when the response starts to be returned.

 

TotalRequestLatency – the elapsed time in milliseconds from the first to the last byte of a request.

 

 

 

CloudWatch Alarms

 

 

Alarms Indicate a Notifiable Change

 

 

A CloudWatch alarm initiates action. You can set an alarm for when a metric is reported with a value outside of a set level.

 

E.g., for when your EC2 instance CPU utilization reaches 85 percent.
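The 85 percent example could be expressed as a PutMetricAlarm request along these lines. This is a sketch under assumptions: the alarm name, the two 5-minute evaluation periods, and the SNS topic ARN are illustrative choices, not values from this article.

```python
def cpu_alarm_request(instance_id, topic_arn, threshold=85.0):
    """Build keyword arguments for a PutMetricAlarm request on EC2 CPU usage."""
    return {
        "AlarmName": "high-cpu-" + instance_id,  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,           # seconds per evaluation period
        "EvaluationPeriods": 2,  # assumed: two consecutive periods must breach
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [topic_arn],  # e.g. an SNS topic that emails the team
    }


alarm = cpu_alarm_request(
    "i-0123456789abcdef0",
    "arn:aws:sns:us-east-1:123456789012:ops-alerts",  # hypothetical topic ARN
)
# With boto3 installed and credentials configured:
# boto3.client("cloudwatch").put_metric_alarm(**alarm)
```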

 

 

Alarms have three possible states at any given point in time:

 

OK : means the metric lies within the defined threshold.

ALARM : means the metric is outside the defined threshold (below or above it, depending on how the alarm is configured).

 

INSUFFICIENT_DATA : can have a number of reasons. The most common reasons are that the alarm has only just started or been created, that the metric it is monitoring is not available for some reason, or there is not enough data at this time to determine whether the alarm is OK or in ALARM state.
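The three states can be modeled with a short Python function. This is a deliberately simplified model for study purposes, assuming a "greater than or equal to" comparison and treating any missing datapoint as insufficient data; the real service offers more options for both.

```python
def alarm_state(datapoints, threshold, evaluation_periods=1):
    """Return the alarm state for the most recent `evaluation_periods` datapoints.

    `datapoints` is ordered oldest to newest; None marks a missing datapoint.
    """
    recent = datapoints[-evaluation_periods:]
    if len(recent) < evaluation_periods or any(value is None for value in recent):
        return "INSUFFICIENT_DATA"
    if all(value >= threshold for value in recent):
        return "ALARM"
    return "OK"
```

With a threshold of 85, `alarm_state([50, 90], 85)` is in ALARM, `alarm_state([90, 50], 85)` is OK, and a newly created alarm with no datapoints yet, `alarm_state([], 85)`, reports INSUFFICIENT_DATA.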

 

 

CloudWatch Logs

 

CloudWatch Logs stores logs from AWS systems and resources and can also handle the logs for on-premises systems provided they have the Amazon Unified CloudWatch Agent installed.

 

If you are monitoring AWS CloudTrail activity through CloudWatch, then that activity is sent to CloudWatch Logs.

 

If you need a long retention period for your logs, then CloudWatch Logs can also do this.

 

By default logs are kept forever and never expire. But you can adjust this based on your own retention policies.

You can choose to keep logs for only a single day or go up to 10 years.
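Retention is set per log group, and only specific day counts are accepted. The sketch below rounds a requested retention up to the next accepted value and builds the parameters for a PutRetentionPolicy request. The list of accepted values reflects the API documentation as understood at the time of writing and should be verified against the current API reference; the helper names and the log group name are hypothetical.

```python
import bisect

# Accepted retentionInDays values (assumption: verify against the API reference).
ALLOWED_RETENTION_DAYS = [
    1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365, 400, 545,
    731, 1096, 1827, 2192, 2557, 2922, 3288, 3653,
]


def nearest_retention(days):
    """Round `days` up to the next accepted retention value."""
    if days < 1 or days > ALLOWED_RETENTION_DAYS[-1]:
        raise ValueError("retention must be between 1 and 3653 days")
    return ALLOWED_RETENTION_DAYS[bisect.bisect_left(ALLOWED_RETENTION_DAYS, days)]


def retention_request(log_group, days):
    """Build keyword arguments for a PutRetentionPolicy request."""
    return {"logGroupName": log_group, "retentionInDays": nearest_retention(days)}


request = retention_request("/myapp/web", 10)  # hypothetical log group name
# With boto3: boto3.client("logs").put_retention_policy(**request)
```

Rounding up rather than down is a design choice here: it never deletes logs earlier than the policy you asked for.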

 

Log Groups and Log Streams

 

You can group logs together that serve a similar purpose or come from a similar resource type, for example EC2 instances that handle web traffic. Such a collection is a log group.

 

 

Log streams refer to the sequence of log data from an individual source within a log group, such as an application instance, a log file, or a container.

 

 

CloudWatch Logs can send logs to S3, Kinesis Data Streams, Kinesis Data Firehose, Lambda, and ElasticSearch.

 

 

CloudWatch Logs – sources can be:

 

SDKs

CloudWatch Logs Agent

CloudWatch Unified Agent

Elastic Beanstalk

ECS – Elastic Container Service

Lambda function logs

VPC Flow Logs – these are VPC specific

API Gateway

CloudTrail based on filters

Route53 – logs DNS queries

 

 

Define Metric Filters and Insights for CloudWatch Logs

You can apply a filter expression, e.g. to look for a specific IP in a log or to count the number of occurrences of “ERROR” in the log.

Metric filters can be used to trigger CloudWatch Alarms.

 

CloudWatch Logs Insights can be used to query logs, and its queries can be added to CloudWatch Dashboards.
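A metric filter for the "ERROR" case might look like the sketch below. The first function is only a local approximation of what the filter counts (log events that contain the term); the second builds the parameters for a PutMetricFilter request. The filter name, metric name, and namespace are hypothetical.

```python
def count_matching_events(log_lines, term="ERROR"):
    """Count log lines containing `term` (case-sensitive), one per matching line."""
    return sum(1 for line in log_lines if term in line)


def error_filter_request(log_group):
    """Build keyword arguments for a PutMetricFilter request."""
    return {
        "logGroupName": log_group,
        "filterName": "error-count",  # hypothetical filter name
        "filterPattern": "ERROR",     # match log events containing this term
        "metricTransformations": [
            {
                "metricName": "ErrorCount",       # hypothetical metric name
                "metricNamespace": "Custom/App",  # hypothetical namespace
                "metricValue": "1",               # add 1 per matching log event
            }
        ],
    }


sample_log = [
    "2024-06-15 12:00:01 INFO request served",
    "2024-06-15 12:00:02 ERROR database timeout",
    "2024-06-15 12:00:03 ERROR database timeout",
]
errors = count_matching_events(sample_log)
```

An alarm on the resulting `ErrorCount` metric then works like an alarm on any other metric.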

 

 

CloudWatch Logs – Exporting to S3

 

 

NOTE

It can take up to 12 hours for log data to become available for export, so exporting is not real time. For real-time delivery you should use Log Subscriptions.

 

 

The API call for this is “CreateExportTask”
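A request for this call could be assembled as in the following Python sketch. The time range is given in milliseconds since the epoch; the task name, bucket name, and prefix are hypothetical, and the function builds the parameters without contacting AWS.

```python
from datetime import datetime, timezone


def to_epoch_ms(moment):
    """Convert a timezone-aware datetime to milliseconds since the Unix epoch."""
    return int(moment.timestamp() * 1000)


def export_task_request(log_group, bucket, start, end):
    """Build keyword arguments for a CreateExportTask request."""
    return {
        "taskName": "export-" + start.strftime("%Y%m%d"),  # hypothetical name
        "logGroupName": log_group,
        "fromTime": to_epoch_ms(start),
        "to": to_epoch_ms(end),
        "destination": bucket,                # name of the target S3 bucket
        "destinationPrefix": "exported-logs",  # hypothetical key prefix
    }


request = export_task_request(
    "/myapp/web",
    "my-log-archive-bucket",  # hypothetical bucket name
    datetime(2024, 1, 1, tzinfo=timezone.utc),
    datetime(2024, 1, 2, tzinfo=timezone.utc),
)
# With boto3: boto3.client("logs").create_export_task(**request)
```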

 

 

 

CloudWatch Log Subscriptions

 

You apply a “subscription filter” to a CloudWatch log group, and the matching log events are delivered in real time to a destination, e.g. an AWS-managed Lambda function or a custom-designed Lambda function, and from there on to e.g. ElasticSearch. Alternatively, the subscription filter can send the events to Kinesis.

 

 

You can also aggregate logs from different accounts and different regions: a subscription filter in each region forwards the logs to a single common Kinesis Data Stream, then to Firehose, and from there in near-real time on to e.g. S3.
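When a subscription filter targets a Lambda function, the log events arrive base64-encoded and gzip-compressed under `awslogs.data`. The handler below is a minimal sketch of the decoding step; what it does with the messages afterwards (here, simply returning them) is a placeholder for forwarding to a downstream system.

```python
import base64
import gzip
import json


def decode_subscription_event(event):
    """Decode the payload that CloudWatch Logs delivers to a Lambda target."""
    compressed = base64.b64decode(event["awslogs"]["data"])
    return json.loads(gzip.decompress(compressed))


def handler(event, context):
    """Lambda entry point: return the raw message of every delivered log event."""
    payload = decode_subscription_event(event)
    return [log_event["message"] for log_event in payload["logEvents"]]


def make_test_event(payload):
    """Build a fake invocation event for local testing (helper is hypothetical)."""
    raw = gzip.compress(json.dumps(payload).encode("utf-8"))
    return {"awslogs": {"data": base64.b64encode(raw).decode("ascii")}}
```

The `make_test_event` helper lets you exercise the handler locally without deploying anything, which is useful because the compressed payload is awkward to construct by hand.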

 


Unified CloudWatch Agent

 

 

The AWS Unified CloudWatch Agent provides more detailed information than the standard free CloudWatch service.

 

You can also use it to gather logs from your on-premises servers in the case of a hybrid environment and then centrally manage and store them from within the CloudWatch console.

 

The agent is available for Windows and Linux operating systems.

 

When installed on a Windows machine, you can forward in-depth information to CloudWatch from the Windows Performance Monitor, which is built into the Windows operating system.

 

When the agent is installed on a Linux system, you can receive more in-depth metrics about CPU, memory, network, processes, and swap memory usage. You can also gather custom logs from applications installed on servers.

 

To install the CloudWatch agent, you need to set up the configuration file.
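As a rough idea of what that configuration file contains, the Python sketch below generates a small JSON configuration that collects memory, disk, and swap usage plus one application log file. The log file path, log group name, and output handling are hypothetical, and a real deployment would likely need more settings than shown here.

```python
import json

agent_config = {
    "agent": {"metrics_collection_interval": 60},  # seconds between samples
    "metrics": {
        "namespace": "CWAgent",
        "metrics_collected": {
            "mem": {"measurement": ["mem_used_percent"]},
            "disk": {"measurement": ["used_percent"], "resources": ["/"]},
            "swap": {"measurement": ["swap_used_percent"]},
        },
    },
    "logs": {
        "logs_collected": {
            "files": {
                "collect_list": [
                    {
                        "file_path": "/var/log/myapp/app.log",  # hypothetical path
                        "log_group_name": "myapp",              # hypothetical group
                        "log_stream_name": "{instance_id}",
                    }
                ]
            }
        }
    },
}

config_text = json.dumps(agent_config, indent=2)
```

Writing `config_text` to the agent's configuration file gives the agent the memory and disk usage metrics that CloudWatch cannot collect by default.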

 

 
