A Task: CloudWatch monitoring and alerting

6 min read · Aug 19, 2024

CloudWatch Monitoring and Alerting

Agenda

Monitoring and alerting for AWS resources is handled by Amazon CloudWatch.

Monitoring is built on a concept called Metrics. Alerting is built on a concept called Alarms.
In this post we will get familiar with:
- Metrics
- Monitoring
- Alarms
- Alerts

Alerts are delivered through Amazon SNS (Simple Notification Service), so we will also touch upon the following SNS concepts:
- Topic
- Subscription

We will monitor an EC2 instance in this post. Monitoring can also be configured on other AWS resources like API Gateway, Lambda, S3 etc.

Prerequisites

We will use three AWS services, namely CloudWatch, EC2 and SNS, so you will need a user authorised for all three.

EC2 setup

We are interested in monitoring the CPU Utilisation of an EC2 instance. Hence, we need an EC2 instance.

Let’s launch a t2.micro EC2 instance. We can use the Ubuntu 22.04 LTS AMI or any other AMI.

We have named our instance Tutorial Server.
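If you prefer to script the launch instead of using the console, the step above could be sketched roughly as follows with boto3. This is only a sketch under assumptions: the AMI id must be supplied by you for your own region, the helper names are made up for illustration, and the launch function needs boto3 installed and AWS credentials configured.

```python
def build_run_instances_params(ami_id, name="Tutorial Server"):
    """Build keyword arguments for the EC2 run_instances call."""
    return {
        "ImageId": ami_id,  # assumption: a valid AMI id for your region
        "InstanceType": "t2.micro",
        "MinCount": 1,
        "MaxCount": 1,
        "TagSpecifications": [
            {
                "ResourceType": "instance",
                "Tags": [{"Key": "Name", "Value": name}],
            }
        ],
    }


def launch_instance(ami_id, name="Tutorial Server"):
    """Launch one instance and return its instance id."""
    import boto3  # imported lazily so the helper above works without boto3

    ec2 = boto3.client("ec2")
    response = ec2.run_instances(**build_run_instances_params(ami_id, name))
    return response["Instances"][0]["InstanceId"]
```

Keeping the parameters in a separate helper makes it easy to inspect what would be sent before anything is actually launched.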

Metrics

Monitoring is always performed on a Metric. Some examples of metrics are:
- CPU Utilisation
- Memory used
- Memory available
- Bytes read
- Bytes written

A useful metric for compute resources is CPU utilisation. As EC2 is a compute service, let’s focus on CPU utilisation.

CloudWatch provides the ability to view metrics for different resources. It can be used to see the CPU utilisation of an EC2 instance.

Navigate to CloudWatch > Metrics > All metrics

Metrics are grouped under namespaces.

You will find EC2 under AWS namespaces. As we are interested in an instance metric, we need to select Per-Instance Metrics.

If you have multiple EC2 instances, metrics for all the instances will show up. Filter by the instance id to view only a single instance’s metrics.

You will find a column named Metric name. It lists several metrics like CPUUtilization, DiskReadBytes, DiskReadOps etc.

We can select and graph CPUUtilization.

The graph should start looking like:

We are working on a brand new EC2 instance which isn’t doing any real work. Hence the CPU utilisation is less than 1 percent.
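The same data points that the console graphs can also be read programmatically. The following is a rough sketch using boto3's get_metric_statistics call; the helper names, the 60 minute look-back window and the 300 second period are assumptions chosen for illustration, and the fetch function needs boto3 and AWS credentials.

```python
from datetime import datetime, timedelta, timezone


def build_cpu_query(instance_id, minutes=60, period=300):
    """Build keyword arguments for a CPUUtilization query on one instance."""
    end = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
        "Period": period,  # seconds per data point
        "Statistics": ["Average"],
    }


def fetch_cpu_datapoints(instance_id):
    """Return the data points sorted by time, oldest first."""
    import boto3  # imported lazily so build_cpu_query works without boto3

    cloudwatch = boto3.client("cloudwatch")
    response = cloudwatch.get_metric_statistics(**build_cpu_query(instance_id))
    return sorted(response["Datapoints"], key=lambda point: point["Timestamp"])
```

Each returned data point carries an Average value, which for an idle instance should match the low numbers seen on the graph.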

Stress test

Let’s stress the instance with some real load and see how the CPU utilisation spikes.

stress is a useful tool to subject any machine to real CPU load.

SSH into the instance.

Script to simulate a CPU spike on the EC2 instance:

import time


def simulate_cpu_spike(duration=30, cpu_percent=80, slice_seconds=0.1):
    """Keep one CPU core busy for about cpu_percent of every time slice."""
    print(f"Simulating CPU spike at {cpu_percent}% for {duration} seconds...")
    busy_seconds = slice_seconds * cpu_percent / 100
    end_time = time.time() + duration

    while time.time() < end_time:
        slice_start = time.time()

        # Busy phase: spin on simple arithmetic until this slice's share is used up
        result = 0
        while time.time() - slice_start < busy_seconds:
            result += 1

        # Idle phase: sleep for whatever is left of the slice
        time.sleep(max(0, slice_seconds - (time.time() - slice_start)))

    print("CPU spike simulation completed.")


if __name__ == '__main__':
    # Simulate a CPU spike for 300 seconds (5 minutes) at about 80% CPU utilisation
    simulate_cpu_spike(duration=300, cpu_percent=80)

This puts load on the CPU for 300 seconds, i.e. 5 minutes.

We will check the metric after 5 minutes, once the stress test has completed. Go and grab a coffee in the meantime!

Recheck the metric and graph it after 5 minutes. The graph will look similar to the following:

This suggests that during the 5-minute interval in which the stress test was running, CPU utilisation was 50 percent.

Let’s increase the number of concurrent processes to 4 and run the processes for 10 minutes.
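One way to run several load processes side by side is to start each one as a separate Python process. The sketch below is an assumption about how this could be done, not the exact command used for this post; `WORKER_CODE` and `run_load` are names invented for illustration.

```python
import subprocess
import sys

# Busy loop executed by every worker process; it spins until the deadline passes.
WORKER_CODE = """
import sys, time
deadline = time.time() + float(sys.argv[1])
while time.time() < deadline:
    pass
"""


def run_load(processes=4, duration=600):
    """Start `processes` busy-loop workers and wait for all of them to finish.

    Returns the exit code of every worker.
    """
    workers = [
        subprocess.Popen([sys.executable, "-c", WORKER_CODE, str(duration)])
        for _ in range(processes)
    ]
    return [worker.wait() for worker in workers]
```

Calling `run_load(processes=4, duration=600)` keeps four workers spinning for 10 minutes. A t2.micro has a single vCPU, so the four workers share that one core rather than each getting their own.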

Alarm and Alert

We want to be alerted whenever there is a spike in CPU utilisation. This will allow us to take corrective action like scaling the compute capacity before there is an outage.

The foundation for Alerting is Alarms. Hence, Alarms need to be configured.

The alarm tells us that the current server resources aren’t sufficient to sustain the load.

We can create an Alarm from the Graphed metrics tab.

Alarm creation has three major parts:
1. Metric
2. Condition
3. Action

A CloudWatch Alarm has different states. The two states of interest are OK and In Alarm.
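To make the idea of alarm states concrete, here is a deliberately simplified model of how a threshold alarm could pick its state from recent data points. It is not how CloudWatch is implemented: the real service has additional options (for example for missing data) that this sketch ignores, and `evaluate_alarm` is a made-up name. The state strings follow the API spelling, where In Alarm is written ALARM.

```python
def evaluate_alarm(datapoints, threshold, evaluation_periods=1):
    """Return 'ALARM', 'OK' or 'INSUFFICIENT_DATA' for a list of metric values.

    Only the most recent `evaluation_periods` values are considered, and the
    alarm fires only if every one of them is above the threshold.
    """
    recent = datapoints[-evaluation_periods:]
    if len(recent) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    if all(value > threshold for value in recent):
        return "ALARM"
    return "OK"
```

With a threshold of 60, an idle instance reporting `[0.6, 0.9]` evaluates to OK, while `[50.0, 72.3]` evaluates to ALARM because the latest value is above the threshold.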

We took the following steps:
1. Under Actions, we chose Create alarm.
2. Because we reached the Create alarm page from the Metric detail page, the fields Metric name, InstanceId, Statistic and Period are auto-populated.
3. We want the Alarm to transition to the In Alarm state when CPU utilisation breaches 60 percent, so the condition we specified is Greater than 60.
4. When the Alarm changes state, CloudWatch sends a notification to an SNS Topic.
5. Hence, we created an SNS Topic and added an Email Subscription to it. You should add your own email address.
6. Subscriptions need to be confirmed.
7. We navigated to the Subscription list page, where the subscription shows Pending Confirmation.

You should have received an email from AWS Notifications. Open the email and confirm the subscription.

Refresh the subscriptions list page. The subscription Status should change to Confirmed.

You have to confirm the subscription before SNS will deliver any emails to it.
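The topic and subscription steps can also be scripted. The following is a minimal sketch with boto3, assuming credentials are configured; the function names are hypothetical. It relies on the observation that an unconfirmed subscription is listed with a placeholder such as PendingConfirmation in place of a real ARN, which is my understanding of the SNS API rather than something shown in this post.

```python
def create_email_alert_topic(topic_name, email_address):
    """Create an SNS topic, subscribe an email address, return the topic ARN."""
    import boto3  # imported lazily so is_confirmed works without boto3

    sns = boto3.client("sns")
    topic_arn = sns.create_topic(Name=topic_name)["TopicArn"]
    sns.subscribe(TopicArn=topic_arn, Protocol="email", Endpoint=email_address)
    return topic_arn


def is_confirmed(subscription):
    """A subscription entry carries a real ARN only once it is confirmed."""
    return subscription["SubscriptionArn"].startswith("arn:")


def pending_endpoints(topic_arn):
    """Return the endpoints on the topic that still await confirmation."""
    import boto3

    sns = boto3.client("sns")
    response = sns.list_subscriptions_by_topic(TopicArn=topic_arn)
    return [
        subscription["Endpoint"]
        for subscription in response["Subscriptions"]
        if not is_confirmed(subscription)
    ]
```

An empty list from `pending_endpoints` would correspond to every subscription showing Confirmed on the subscriptions list page.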

We consider our application stable when CPU utilisation stays below 60%. That’s why our alarm has been configured to trigger when the utilisation goes above 60%.

CloudWatch also provides an Alarm list page. Let’s navigate there and see the created Alarm.

We can see the created Alarm named CPUUtilization. The current state of the Alarm, which is OK, is also shown here.

Let’s increase the number of concurrent processes in the stress test. This will put additional load on the server CPU.
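For reference, an alarm like the one created in the console could be expressed roughly as the following boto3 call. Treat it as a sketch: the Average statistic, the 300 second period and the single evaluation period are assumptions, since the post relies on the auto-populated console values, and the helper names are invented. The threshold is left as a required argument so you can pass whatever value you chose in the console.

```python
def build_alarm_params(instance_id, topic_arn, threshold):
    """Build keyword arguments for the CloudWatch put_metric_alarm call."""
    return {
        "AlarmName": "CPUUtilization",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",      # assumption
        "Period": 300,               # assumption: 5 minute data points
        "EvaluationPeriods": 1,      # assumption
        "Threshold": float(threshold),
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [topic_arn],  # SNS topic notified on In Alarm
    }


def create_alarm(instance_id, topic_arn, threshold):
    """Create or update the alarm (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so build_alarm_params works without boto3

    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_alarm(**build_alarm_params(instance_id, topic_arn, threshold))
```

The three parts of alarm creation map directly onto the parameters: the metric (Namespace, MetricName, Dimensions), the condition (Threshold, ComparisonOperator) and the action (AlarmActions).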

After 10 minutes, once the relevant data points are published to CloudWatch, check the Alarm page again.

It’s highly likely that the Alarm has transitioned to the In Alarm state.

Since our Alarm’s action specified an SNS Topic with an Email subscription, you should have received an email from AWS.

This confirms that our alarm and alerting are working as intended.
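Instead of refreshing the Alarm page, the state can also be polled from a script. The sketch below uses boto3's describe_alarms call and assumes the alarm is named CPUUtilization as in this post; the helper names are hypothetical, and the API reports the In Alarm state as ALARM.

```python
def alarm_state(response, alarm_name):
    """Extract the state of one alarm from a describe_alarms response."""
    for alarm in response.get("MetricAlarms", []):
        if alarm["AlarmName"] == alarm_name:
            return alarm["StateValue"]
    return None  # no alarm with that name in the response


def fetch_alarm_state(alarm_name="CPUUtilization"):
    """Query CloudWatch for the current state (needs boto3 and credentials)."""
    import boto3  # imported lazily so alarm_state works without boto3

    cloudwatch = boto3.client("cloudwatch")
    response = cloudwatch.describe_alarms(AlarmNames=[alarm_name])
    return alarm_state(response, alarm_name)
```

Splitting the parsing from the API call means the parsing can be checked against a hand-written response without touching AWS.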

Recap

CloudWatch allows users to monitor different resource metrics. Alarms can be configured on the monitored metrics.

When the configured metric threshold is breached, the alarm transitions from the OK state to the In Alarm state.

Alerts can be configured to be triggered on alarm transition. A basic alerting mechanism could be sending an email. In AWS, alerts are configured using SNS.
