Amazon CloudWatch

Metrics, Alarms, Logs, Dashboards, Insights

Overview

Amazon CloudWatch is AWS's comprehensive monitoring and observability service. It lets you monitor AWS resources, applications, and services running in the cloud or on-premises.

┌─────────────────────────────────────────────────────────────────────────────┐
│                         AMAZON CLOUDWATCH ECOSYSTEM                         │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   ┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐  │
│   │   Metrics   │    │    Logs     │    │   Alarms    │    │  Dashboards │  │
│   │   📈        │    │   📋        │    │   🔔        │    │   📊        │  │
│   └──────┬──────┘    └──────┬──────┘    └──────┬──────┘    └──────┬──────┘  │
│          │                  │                  │                   │        │
│          └──────────────────┼──────────────────┼──────────────────┘         │
│                             │                   │                           │
│                    ┌────────▼────────┐   ┌─────▼─────┐                      │
│                    │ CloudWatch Logs │   │    SNS    │                      │
│                    │    Insights     │   │  Lambda   │                      │
│                    └────────┬────────┘   │Auto Scale │                      │
│                             │            └───────────┘                      │
│                    ┌────────▼────────┐                                      │
│                    │  Events/Alarms  │                                      │
│                    │   Automation    │                                      │
│                    └─────────────────┘                                      │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

CloudWatch Core Components

| Component | Description | Use Case |
|---|---|---|
| Metrics | Time-series performance data | CPU, Memory, Network I/O |
| Logs | Collects and stores log files | Application logs, system logs |
| Alarms | Alerts based on metrics | Notify when CPU > 80% |
| Dashboards | Centralized visualization | Combine metrics from multiple services |
| Events | React to AWS resource changes | Trigger Lambda when an EC2 instance stops |
| Insights | Query and analyze logs | Troubleshooting, analytics |
| Synthetics | Canary scripts | Monitor endpoints and APIs |
| ServiceLens | End-to-end observability | Distributed tracing |

CloudWatch Metrics

1. What Is a Metric?

A metric is a variable measured over time (time-series data), for example the CPU utilization of an EC2 instance.

┌─────────────────────────────────────────────────────────────────────────────┐
│                           CLOUDWATCH METRICS FLOW                           │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   ┌──────────────┐     ┌──────────────────┐     ┌──────────────────────┐    │
│   │ AWS Services │────▶│ CloudWatch       │────▶│ Dashboards/Alarms    │    │
│   │ (EC2, RDS,   │     │ Metrics Store    │     │ Analysis/Automation  │    │
│   │  Lambda...)  │     │ (15 months)      │     │                      │    │
│   └──────────────┘     └──────────────────┘     └──────────────────────┘    │
│                                                                             │
│   ┌──────────────┐     ┌──────────────────┐                                 │
│   │ Custom Apps  │────▶│ PutMetricData    │────┐                            │
│   │ (Your Code)  │     │ API/SDK          │     │                           │
│   └──────────────┘     └──────────────────┘     │                           │
│                                                 ▼                           │
│                                        CloudWatch Metrics                   │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

2. Metrics Structure

Namespace     : AWS/EC2
MetricName    : CPUUtilization
Dimensions    : InstanceId=i-1234567890abcdef0
Timestamp     : 2024-01-15T10:00:00Z
Value         : 45.5
Unit          : Percent

| Component | Description | Example |
|---|---|---|
| Namespace | Container for metrics | AWS/EC2, AWS/RDS, Custom/MyApp |
| Metric Name | Name of the metric | CPUUtilization, RequestCount |
| Dimensions | Key-value pairs used to filter | InstanceId, AutoScalingGroupName |
| Timestamp | Time of the data point | ISO 8601 format |
| Value | Measured value | 45.5, 1024 |
| Unit | Unit of measurement | Percent, Bytes, Count |
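
To make the structure above concrete, the following sketch shows how the same fields (namespace, metric name, dimensions, time range) map onto a `get_metric_statistics` request in boto3. The helper names `build_statistics_request` and `fetch_average_cpu` are hypothetical, and the one-hour window and 300-second period are illustrative choices rather than recommendations.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, List, Optional


def build_statistics_request(
    instance_id: str,
    end: Optional[datetime] = None,
    hours: int = 1,
    period: int = 300,
) -> Dict[str, Any]:
    """Build the keyword arguments for get_metric_statistics (no AWS call)."""
    end = end or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": period,  # seconds per data point
        "Statistics": ["Average"],
    }


def fetch_average_cpu(instance_id: str) -> List[Dict[str, Any]]:
    """Fetch data points sorted by time. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the request builder works without boto3

    cloudwatch = boto3.client("cloudwatch")
    response = cloudwatch.get_metric_statistics(**build_statistics_request(instance_id))
    return sorted(response["Datapoints"], key=lambda point: point["Timestamp"])


# Build a request for a fixed end time so the result is reproducible.
fixed_end = datetime(2024, 1, 15, 10, 0, tzinfo=timezone.utc)
request = build_statistics_request("i-1234567890abcdef0", end=fixed_end)
print(request["StartTime"].isoformat(), "->", request["EndTime"].isoformat())
```

Keeping the request builder separate from the API call makes it easy to inspect or unit-test the parameters before anything is sent to AWS.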

3. Default Metrics vs Detailed Monitoring

| Type | Resolution | Cost | Availability |
|---|---|---|---|
| Basic (Default) | 5 minutes | Free | All EC2 instances |
| Detailed Monitoring | 1 minute | Paid | Must be enabled |
| High-Resolution Custom | 1 second | Higher cost | Custom metrics only |

4. Important Default Metrics by Service

EC2 Instance Metrics

| Metric | Description | ⚠️ Notes |
|---|---|---|
| CPUUtilization | % of CPU in use | Available by default |
| NetworkIn/Out | Network traffic in bytes | Available by default |
| DiskReadBytes/WriteBytes | Disk I/O bytes | Instance store only |
| StatusCheckFailed | Health check status | System & instance checks |
| MemoryUtilization | % of RAM in use | NOT available by default - requires the CloudWatch Agent |
| DiskSpaceUtilization | % of disk in use | NOT available by default - requires the CloudWatch Agent |

[!IMPORTANT] Memory and disk space are NOT collected by CloudWatch by default. You need to install the CloudWatch Agent to get these metrics.

RDS Metrics

| Metric | Description |
|---|---|
| CPUUtilization | CPU % |
| DatabaseConnections | Number of open connections |
| FreeableMemory | Available RAM |
| ReadIOPS/WriteIOPS | I/O operations per second |
| FreeStorageSpace | Remaining disk space |

Lambda Metrics

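| Metric | Description |
|---|---|
| Invocations | Number of times the function is invoked |
| Duration | Execution time (ms) |
| Errors | Number of failed invocations |
| Throttles | Number of throttled invocations |
| ConcurrentExecutions | Number of concurrent executions |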

5. Custom Metrics

You can push custom metrics from your own application:

import boto3
from datetime import datetime, timezone

cloudwatch = boto3.client('cloudwatch')

# Push two custom metrics in a single PutMetricData call
cloudwatch.put_metric_data(
    Namespace='CustomApp/OrderService',
    MetricData=[
        {
            'MetricName': 'OrdersProcessed',
            'Dimensions': [
                {
                    'Name': 'Environment',
                    'Value': 'Production'
                },
                {
                    'Name': 'Region',
                    'Value': 'us-east-1'
                }
            ],
            # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated since Python 3.12)
            'Timestamp': datetime.now(timezone.utc),
            'Value': 150,
            'Unit': 'Count'
        },
        {
            # No Timestamp given: CloudWatch uses the time the data is received
            'MetricName': 'ProcessingTime',
            'Value': 234.5,
            'Unit': 'Milliseconds',
            'StorageResolution': 1  # High resolution (1 second)
        }
    ]
)
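
An application that emits many data points usually batches them instead of calling the API once per value. The sketch below is a minimal, hypothetical helper (`chunk_metric_data` and `publish` are illustrative names, not part of boto3) that splits a list of metric entries into request-sized batches. The default batch size of 1000 is an assumption based on the documented per-request limit at the time of writing, so check the current quota before relying on it.

```python
from typing import Any, Dict, Iterator, List

MetricDatum = Dict[str, Any]


def chunk_metric_data(data: List[MetricDatum], batch_size: int = 1000) -> Iterator[List[MetricDatum]]:
    """Yield successive batches of at most `batch_size` metric entries."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]


def publish(namespace: str, data: List[MetricDatum], batch_size: int = 1000) -> int:
    """Send all entries to CloudWatch, one PutMetricData call per batch.

    Returns the number of API calls made. Requires boto3 and AWS credentials.
    """
    import boto3  # imported lazily so the chunking helper works without boto3

    cloudwatch = boto3.client("cloudwatch")
    calls = 0
    for batch in chunk_metric_data(data, batch_size):
        cloudwatch.put_metric_data(Namespace=namespace, MetricData=batch)
        calls += 1
    return calls


# Example: 2,500 data points are split into batches of 1000, 1000 and 500.
sample = [{"MetricName": "OrdersProcessed", "Value": i, "Unit": "Count"} for i in range(2500)]
batch_sizes = [len(batch) for batch in chunk_metric_data(sample)]
print(batch_sizes)
```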

6. Namespaces and Dimensions in Detail

What is a namespace?

A namespace is a container/category that groups related metrics together. It works like a "folder" for organizing metrics.

┌─────────────────────────────────────────────────────────────────────────────┐
│                         CLOUDWATCH NAMESPACES                               │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   ┌───────────────────────────────────────────────────────────────────────┐ │
│   │  NAMESPACE: AWS/EC2                                                   │ │
│   │  ├── CPUUtilization                                                   │ │
│   │  ├── NetworkIn / NetworkOut                                           │ │
│   │  └── StatusCheckFailed                                                │ │
│   └───────────────────────────────────────────────────────────────────────┘ │
│                                                                             │
│   ┌───────────────────────────────────────────────────────────────────────┐ │
│   │  NAMESPACE: AWS/Lambda                                                │ │
│   │  ├── Invocations                                                      │ │
│   │  ├── Duration                                                         │ │
│   │  └── Errors                                                           │ │
│   └───────────────────────────────────────────────────────────────────────┘ │
│                                                                             │
│   ┌───────────────────────────────────────────────────────────────────────┐ │
│   │  NAMESPACE: MyCompany/OrderService  ← Custom namespace                │ │
│   │  ├── OrdersProcessed                                                  │ │
│   │  └── PaymentSuccess                                                   │ │
│   └───────────────────────────────────────────────────────────────────────┘ │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

| AWS Service | Namespace |
|---|---|
| EC2 | AWS/EC2 |
| RDS | AWS/RDS |
| Lambda | AWS/Lambda |
| ALB | AWS/ApplicationELB |
| DynamoDB | AWS/DynamoDB |
| S3 | AWS/S3 |
| SQS | AWS/SQS |
| Custom | MyCompany/MyApp (your own name; do NOT use the AWS/ prefix) |

What are dimensions?

Dimensions are key-value pairs that identify and categorize a specific metric. They act like "filters/tags" that distinguish metrics sharing the same name.

┌─────────────────────────────────────────────────────────────────────────────┐
│                         METRIC DIMENSIONS                                   │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   MetricName: CPUUtilization (same name)                                    │
│                                                                             │
│   ┌───────────────────────────────────────────────────────────────────────┐ │
│   │  Dimension: InstanceId = i-abc123  → CPU of instance abc123           │ │
│   │  Dimension: InstanceId = i-xyz789  → CPU of instance xyz789           │ │
│   │                                                                       │ │
│   │  Dimension: AutoScalingGroupName = web-asg → All instances in ASG     │ │
│   └───────────────────────────────────────────────────────────────────────┘ │
│                                                                             │
│   MULTI-DIMENSION (combine):                                                │
│   ┌───────────────────────────────────────────────────────────────────────┐ │
│   │  Dimensions:                                                          │ │
│   │    - InstanceId = i-abc123                                            │ │
│   │    - AutoScalingGroupName = web-asg                                   │ │
│   │  → CPU of instance abc123 IN ASG web-asg                              │ │
│   └───────────────────────────────────────────────────────────────────────┘ │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

| Service | Common Dimensions |
|---|---|
| EC2 | InstanceId, AutoScalingGroupName, ImageId |
| RDS | DBInstanceIdentifier, DBClusterIdentifier |
| Lambda | FunctionName, Resource, Version |
| ALB | LoadBalancer, TargetGroup, AvailabilityZone |
| SQS | QueueName |
| DynamoDB | TableName, GlobalSecondaryIndexName |

Summary: Namespace + MetricName + Dimensions

┌─────────────────────────────────────────────────────────────────────────────┐
│                 UNIQUE METRIC IDENTIFICATION                                │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   CloudWatch Metric = Namespace + MetricName + Dimensions                   │
│                       ─────────   ──────────   ──────────                   │
│                       Folder      File name    Tags/Filters                 │
│                                                                             │
│   Example:                                                                  │
│   ┌───────────────────────────────────────────────────────────────────────┐ │
│   │  Namespace:   AWS/EC2                                                 │ │
│   │  MetricName:  CPUUtilization                                          │ │
│   │  Dimensions:  InstanceId = i-abc123                                   │ │
│   │              Environment = Production                                 │ │
│   │  ─────────────────────────────────────────────────────────            │ │
│   │  → 1 UNIQUE time series (CPU of i-abc123 in Production)               │ │
│   └───────────────────────────────────────────────────────────────────────┘ │
│                                                                             │
│   KEY RULES:                                                                │
│   • Max 30 dimensions per metric                                            │
│   • Each unique combination = 1 custom metric (billed separately!)          │
│   • Custom namespace: do NOT use the "AWS/" prefix                          │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
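
The identification rule above can be illustrated with a small sketch that treats namespace, metric name, and the full set of dimensions as one key and counts the distinct combinations. This is only a model of the rule, not how CloudWatch is implemented; `metric_key` and `count_unique_metrics` are hypothetical helper names.

```python
from typing import Dict, Iterable, Tuple

MetricKey = Tuple[str, str, Tuple[Tuple[str, str], ...]]


def metric_key(namespace: str, metric_name: str, dimensions: Dict[str, str]) -> MetricKey:
    """Return a hashable identity for one metric time series.

    Dimensions are sorted so the order in which they are supplied does not matter.
    """
    return (namespace, metric_name, tuple(sorted(dimensions.items())))


def count_unique_metrics(points: Iterable[Tuple[str, str, Dict[str, str]]]) -> int:
    """Count distinct (namespace, name, dimensions) combinations."""
    return len({metric_key(ns, name, dims) for ns, name, dims in points})


points = [
    ("MyCompany/OrderService", "OrdersProcessed", {"Environment": "Production"}),
    ("MyCompany/OrderService", "OrdersProcessed", {"Environment": "Staging"}),
    # Same name and dimensions as the first entry: not a new metric.
    ("MyCompany/OrderService", "OrdersProcessed", {"Environment": "Production"}),
    # Adding a dimension creates a separate time series.
    ("MyCompany/OrderService", "OrdersProcessed",
     {"Environment": "Production", "Region": "us-east-1"}),
]
unique_count = count_unique_metrics(points)
print(unique_count)
```

Running a count like this over the dimension values an application plans to emit is a quick way to estimate how many custom metrics it will create, for example when a high-cardinality value such as a user ID is being considered as a dimension.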

CloudWatch Logs

1. Logs Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                        CLOUDWATCH LOGS ARCHITECTURE                         │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                          LOG GROUP                                  │    │
│  │  (Container for logs from one source, e.g., /aws/lambda/myFunc)    │    │
│  │                                                                     │    │
│  │  ┌────────────────────────────────────────────────────────────────┐  │   │
│  │  │ LOG STREAM 1                                                    │ │   │
│  │  │ (Sequence of log events from one source instance)              │  │   │
│  │  │ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐        │   │   │
│  │  │ │Event 1 │ │Event 2 │ │Event 3 │ │Event 4 │ │Event 5 │ ...    │   │   │
│  │  │ └────────┘ └────────┘ └────────┘ └────────┘ └────────┘        │   │   │
│  │  └────────────────────────────────────────────────────────────────┘  │   │
│  │                                                                     │    │
│  │  ┌────────────────────────────────────────────────────────────────┐  │   │
│  │  │ LOG STREAM 2                                                    │ │   │
│  │  │ ┌────────┐ ┌────────┐ ┌────────┐                               │  │   │
│  │  │ │Event 1 │ │Event 2 │ │Event 3 │ ...                           │  │   │
│  │  │ └────────┘ └────────┘ └────────┘                               │  │   │
│  │  └────────────────────────────────────────────────────────────────┘  │   │
│  │                                                                     │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

| Concept | Description | Example |
|---|---|---|
| Log Group | Container for related log streams | /aws/lambda/my-function |
| Log Stream | Sequence of events from the same source | 2024/01/15/[$LATEST]abc123 |
| Log Event | Single log entry with a timestamp | {"timestamp": ..., "message": "..."} |
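
As a rough illustration of how these three concepts fit together in code, the sketch below builds log events in the shape `put_log_events` expects (millisecond timestamps, ordered chronologically) and sends them to one stream in one group. The helper names and the example group and stream names are hypothetical, and the sketch assumes the log group and log stream already exist.

```python
from typing import Dict, List, Tuple, Union

LogEvent = Dict[str, Union[int, str]]


def build_log_events(entries: List[Tuple[float, str]]) -> List[LogEvent]:
    """Convert (unix_seconds, message) pairs into log events sorted by timestamp.

    CloudWatch Logs expects timestamps in milliseconds since the epoch and
    events in chronological order within a batch.
    """
    events: List[LogEvent] = [
        {"timestamp": int(seconds * 1000), "message": message}
        for seconds, message in entries
    ]
    return sorted(events, key=lambda event: event["timestamp"])


def send_events(log_group: str, log_stream: str, entries: List[Tuple[float, str]]) -> None:
    """Write events to an existing log stream. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so build_log_events works without boto3

    logs = boto3.client("logs")
    logs.put_log_events(
        logGroupName=log_group,
        logStreamName=log_stream,
        logEvents=build_log_events(entries),
    )


# Entries supplied out of order are sorted before sending.
events = build_log_events([(2.0, "second event"), (1.5, "first event")])
print(events)
```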

2. Log Sources

Two main groups of sources send logs to CloudWatch:

┌─────────────────────────────────────────────────────────────────────────────┐
│                           LOG SOURCES → CLOUDWATCH                          │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   ╔═══════════════════════════════╗    ╔═══════════════════════════════╗    │
│   ║   AWS NATIVE SERVICES         ║    ║   CUSTOM SOURCES              ║    │
│   ║   (Built-in Integration)      ║    ║  (Needs CloudWatch Agent/SDK) ║    │
│   ╠═══════════════════════════════╣    ╠═══════════════════════════════╣    │
│   ║                               ║    ║                               ║    │
│   ║  • Lambda (automatic)         ║    ║  • EC2 Instances              ║    │
│   ║  • API Gateway                ║    ║  • On-Premises Servers        ║    │
│   ║  • ECS/EKS (awslogs driver)   ║    ║  • Docker Containers          ║    │
│   ║  • Route 53 (Query logs)      ║    ║  • Custom Applications        ║    │
│   ║  • VPC Flow Logs              ║    ║  • Any server with CW Agent   ║    │
│   ║  • CloudTrail                 ║    ║                               ║    │
│   ║  • RDS (Slow query logs)      ║    ║                               ║    │
│   ║                               ║    ║                               ║    │
│   ╚═══════════════╦═══════════════╝    ╚═══════════════╦═══════════════╝    │
│                   ║                                    ║                    │
│                   ║                                    ║                    │
│                   ▼                                    ▼                    │
│              ┌─────────────────────────────────────────────┐                │
│              │            CloudWatch Logs                  │                │
│              │  (Central log storage & analysis)           │                │
│              └─────────────────────────────────────────────┘                │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

| Group | How Logs Are Sent | Example |
|---|---|---|
| AWS Native | Automatically, or enabled in the console | Lambda logs go automatically to /aws/lambda/<function-name> |
| Custom Sources | Install the CloudWatch Agent or use the SDK | EC2 needs the agent installed to push /var/log/* |

3. Log Retention

┌────────────────────────────────────────────────────────┐
│              LOG RETENTION OPTIONS                     │
├────────────────────────────────────────────────────────┤
│  1 day  │  3 days │  5 days │  1 week │  2 weeks       │
│  1 month │ 2 months │ 3 months │ 6 months              │
│  1 year  │ 13 months │ 18 months │ 2 years             │
│  3 years │ 5 years │ 6 years │ 7 years │ 8 years       │
│  9 years │ 10 years │ Never expire (default)           │
└────────────────────────────────────────────────────────┘

[!WARNING] By default, logs NEVER expire! This can lead to high storage costs. Always set an appropriate retention policy.
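
Retention can be set per log group through the API. The sketch below is one possible approach: because the API only accepts specific day counts, a hypothetical helper `round_up_retention` picks the smallest accepted value that covers the requested number of days before `put_retention_policy` is called. The list of accepted values is copied from the API documentation as an assumption and should be verified against the current reference.

```python
import bisect

# Values accepted by the retentionInDays parameter (assumption: taken from the
# PutRetentionPolicy API reference; verify before use).
VALID_RETENTION_DAYS = [
    1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365, 400, 545,
    731, 1096, 1827, 2192, 2557, 2922, 3288, 3653,
]


def round_up_retention(days: int) -> int:
    """Return the smallest accepted retention value that is >= `days`."""
    if days < 1:
        raise ValueError("days must be at least 1")
    index = bisect.bisect_left(VALID_RETENTION_DAYS, days)
    if index == len(VALID_RETENTION_DAYS):
        raise ValueError("requested retention exceeds the largest accepted value")
    return VALID_RETENTION_DAYS[index]


def apply_retention(log_group: str, days: int) -> int:
    """Set the retention policy on a log group. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so round_up_retention works without boto3

    retention = round_up_retention(days)
    boto3.client("logs").put_retention_policy(
        logGroupName=log_group,
        retentionInDays=retention,
    )
    return retention


print(round_up_retention(45))   # a 45-day requirement maps to the next accepted value
```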

4. Log Export & Integration

┌─────────────────────────────────────────────────────────────────────────────┐
│                    CLOUDWATCH LOGS EXPORT OPTIONS                           │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│                         ┌─────────────────┐                                 │
│                         │ CloudWatch Logs │                                 │
│                         └────────┬────────┘                                 │
│                                  │                                          │
│          ┌───────────────────────┼───────────────────────┐                  │
│          │                       │                        │                 │
│          ▼                       ▼                       ▼                  │
│   ┌──────────────┐       ┌──────────────┐       ┌──────────────┐            │
│   │     S3       │       │  Kinesis     │       │  Lambda        │          │
│   │  (Export)    │       │  Firehose    │       │(Subscription)  │          │
│   │              │       │  (Real-time) │       │                │          │
│   └──────────────┘       └──────────────┘       └──────────────┘            │
│          │                       │                        │                 │
│          ▼                       ▼                        │                 │
│   ┌──────────────┐       ┌──────────────┐                 │                 │
│   │ Athena       │       │ OpenSearch   │                 │                 │
│   │ Glue         │       │ Splunk       │                 │                 │
│   │ QuickSight   │       │ Datadog      │               ▼                   │
│   └──────────────┘       └──────────────┘       ┌──────────────┐            │
│                                                 │ Any Custom   │            │
│                                                 │ Processing   │            │
│                                                 └──────────────┘            │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

| Export Method | Use Case | Real-time? |
|---|---|---|
| S3 Export | Archival, long-term storage, Athena analysis | ❌ Batch (up to 12h delay) |
| Subscription Filter → Kinesis Firehose | Real-time streaming to S3/OpenSearch | ✅ Near real-time |
| Subscription Filter → Lambda | Custom processing, alerting | ✅ Near real-time |
| Subscription Filter → Kinesis Data Streams | Complex event processing | ✅ Real-time |
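
For the Lambda subscription option, the function receives log data as a base64-encoded, gzip-compressed JSON document under `event["awslogs"]["data"]`. The sketch below shows one way to decode it and count error lines. The handler logic, the `"ERROR"` substring check, and the sample payload values are illustrative assumptions, not a prescribed design.

```python
import base64
import gzip
import json
from typing import Any, Dict


def decode_subscription_event(event: Dict[str, Any]) -> Dict[str, Any]:
    """Decode the payload CloudWatch Logs delivers to a subscribed Lambda function."""
    compressed = base64.b64decode(event["awslogs"]["data"])
    return json.loads(gzip.decompress(compressed))


def handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
    """Count log events whose message contains the text ERROR."""
    payload = decode_subscription_event(event)
    errors = [e for e in payload.get("logEvents", []) if "ERROR" in e["message"]]
    return {"logGroup": payload.get("logGroup"), "errorCount": len(errors)}


# Local demonstration with a hand-built payload (values are made up).
sample_payload = {
    "messageType": "DATA_MESSAGE",
    "logGroup": "/ec2/myapp",
    "logStream": "i-1234567890abcdef0/app.log",
    "logEvents": [
        {"id": "1", "timestamp": 1705312800000, "message": "INFO started"},
        {"id": "2", "timestamp": 1705312801000, "message": "ERROR: payment failed"},
    ],
}
encoded = base64.b64encode(gzip.compress(json.dumps(sample_payload).encode("utf-8")))
sample_event = {"awslogs": {"data": encoded.decode("ascii")}}
result = handler(sample_event, None)
print(result)
```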

5. CloudWatch Logs Insights

A powerful query language for analyzing logs:

# Find all ERROR logs (the time range, e.g. the last hour, is set in the console or API call)
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 100

# Count errors by type
fields @message
| filter @message like /ERROR/
| parse @message "ERROR: *" as errorType
| stats count(*) as count by errorType
| sort count desc

# Calculate average response time
fields @timestamp, @message
| parse @message "ResponseTime: * ms" as responseTime
| stats avg(responseTime) as avgTime,
        max(responseTime) as maxTime,
        min(responseTime) as minTime
| limit 1

# Top 10 most expensive Lambda invocations
fields @timestamp, @billedDuration, @memorySize
| filter @type = "REPORT"
| sort @billedDuration desc
| limit 10
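
Queries like these can also be run programmatically. The following sketch uses the boto3 `start_query` and `get_query_results` calls with simple polling; the helper names, the one-hour default window, and the polling limits are assumptions chosen for illustration.

```python
import time
from typing import Any, Dict, List, Optional


def build_query_params(
    log_group: str,
    query: str,
    window_seconds: int = 3600,
    now: Optional[float] = None,
) -> Dict[str, Any]:
    """Build start_query arguments covering the last `window_seconds` seconds."""
    end = int(now if now is not None else time.time())
    return {
        "logGroupName": log_group,
        "startTime": end - window_seconds,  # epoch seconds
        "endTime": end,
        "queryString": query,
    }


def rows_to_dicts(results: List[List[Dict[str, str]]]) -> List[Dict[str, str]]:
    """Flatten each result row from [{field, value}, ...] into one dict."""
    return [{cell["field"]: cell["value"] for cell in row} for row in results]


def run_query(log_group: str, query: str, poll_seconds: float = 1.0,
              max_polls: int = 60) -> List[Dict[str, str]]:
    """Run a query and wait for it to finish. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the helpers above work without boto3

    logs = boto3.client("logs")
    query_id = logs.start_query(**build_query_params(log_group, query))["queryId"]
    for _ in range(max_polls):
        response = logs.get_query_results(queryId=query_id)
        if response["status"] == "Complete":
            return rows_to_dicts(response["results"])
        if response["status"] in ("Failed", "Cancelled", "Timeout"):
            raise RuntimeError(f"query ended with status {response['status']}")
        time.sleep(poll_seconds)
    raise TimeoutError("query did not complete within the polling limit")


params = build_query_params(
    "/aws/lambda/my-function",
    "fields @timestamp, @message | filter @message like /ERROR/ | limit 100",
    now=1700000000,
)
print(params["startTime"], params["endTime"])
```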

CloudWatch Alarms

1. Alarm States

┌─────────────────────────────────────────────────────────────────────────────┐
│                          CLOUDWATCH ALARM STATES                            │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   ┌─────────────────────────────────────────────────────────────────────┐   │
│   │                                                                     │   │
│   │     ┌────────────┐                           ┌────────────┐          │  │
│   │     │            │   Threshold Breached      │            │          │  │
│   │     │     OK     │ ────────────────────────▶ │   ALARM    │          │  │
│   │     │    ✅      │                           │    🔴      │          │  │
│   │     │            │ ◀──────────────────────── │            │          │  │
│   │     └────────────┘   Threshold Recovered     └────────────┘          │  │
│   │           ▲                                        ▲                │   │
│   │           │                                        │                 │  │
│   │           │         ┌────────────┐                 │                 │  │
│   │           │         │            │                 │                 │  │
│   │           └─────────│INSUFFICIENT│─────────────────┘                 │  │
│   │      Not enough     │   DATA     │    Not enough                     │  │
│   │      data points    │    ⚪      │    data points                    │  │
│   │                     │            │                                   │  │
│   │                     └────────────┘                                   │  │
│   │                                                                     │   │
│   └─────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

| State | Meaning |
|---|---|
| OK | Metric is within the normal threshold |
| ALARM | Metric has breached the threshold |
| INSUFFICIENT_DATA | Not enough data to evaluate (alarm just created, or the metric has no data) |

2. Alarm Configuration

AlarmName: HighCPUAlarm
MetricName: CPUUtilization
Namespace: AWS/EC2
Dimensions:
  - Name: InstanceId
    Value: i-1234567890abcdef0
 
# Threshold Configuration
Statistic: Average          # Sum, SampleCount, Minimum, Maximum
Period: 300                  # 5 minutes (in seconds)
EvaluationPeriods: 3         # Check 3 consecutive periods
DatapointsToAlarm: 2         # 2 out of 3 periods must breach
Threshold: 80                # 80%
ComparisonOperator: GreaterThanThreshold
 
# Actions
ActionsEnabled: true
AlarmActions:
  - arn:aws:sns:us-east-1:123456789012:notify-ops
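  # Note: the EC2 recover action is designed for StatusCheckFailed_System alarms;
  # it appears here only to show the ARN format.
  - arn:aws:automate:us-east-1:ec2:recover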
OKActions:
  - arn:aws:sns:us-east-1:123456789012:notify-ops
InsufficientDataActions:
  - arn:aws:sns:us-east-1:123456789012:notify-ops
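
The interaction between `EvaluationPeriods` and `DatapointsToAlarm` ("M out of N") is easier to see in a small simulation. The sketch below is a deliberately simplified model for the `GreaterThanThreshold` operator: it looks at the most recent N data points and ignores the real service's configurable handling of missing data. `evaluate_alarm` is a hypothetical function, not a CloudWatch API.

```python
from typing import List, Optional


def evaluate_alarm(
    datapoints: List[Optional[float]],
    threshold: float,
    evaluation_periods: int,
    datapoints_to_alarm: int,
) -> str:
    """Return the alarm state for the most recent `evaluation_periods` points.

    `None` marks a period with no data. Simplification: missing points are
    skipped, and the state is INSUFFICIENT_DATA only when every point is missing.
    """
    window = datapoints[-evaluation_periods:]
    present = [value for value in window if value is not None]
    if not present:
        return "INSUFFICIENT_DATA"
    breaching = sum(1 for value in present if value > threshold)
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"


# Settings from the configuration above: 2 out of 3 periods above 80%.
print(evaluate_alarm([50, 85, 70, 90], threshold=80, evaluation_periods=3, datapoints_to_alarm=2))
print(evaluate_alarm([85, 70, 75], threshold=80, evaluation_periods=3, datapoints_to_alarm=2))
print(evaluate_alarm([None, None, None], threshold=80, evaluation_periods=3, datapoints_to_alarm=2))
```

In the first call the last three points are 85, 70 and 90, so two of three breach the threshold and the state is ALARM. In the second call only one point breaches, so the state stays OK.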

3. Alarm Actions

┌─────────────────────────────────────────────────────────────────────────────┐
│                         CLOUDWATCH ALARM ACTIONS                            │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│                        ┌──────────────────┐                                 │
│                        │ CloudWatch Alarm │                                 │
│                        │     TRIGGERS     │                                 │
│                        └────────┬─────────┘                                 │
│                                  │                                          │
│     ┌───────────────────────────┼───────────────────────────┐               │
│     │               │           │           │                │              │
│     ▼               ▼           ▼           ▼               ▼               │
│ ┌───────┐     ┌──────────┐ ┌─────────┐ ┌─────────┐    ┌─────────┐           │
│ │  SNS  │     │ Auto     │ │   EC2   │ │   EC2   │    │ Systems │           │
│ │       │     │ Scaling  │ │  Stop   │ │ Recover │    │ Manager │           │
│ │ Email │     │          │ │         │ │         │    │         │           │
│ │ SMS   │     │ Scale    │ │ Reduce  │ │ Auto    │    │Run      │           │
│ │Lambda │     │ In/Out   │ │ Costs   │ │ Healing │    │Command  │           │
│ └───────┘     └──────────┘ └─────────┘ └─────────┘    └─────────┘           │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

| Action Type | Use Case | Example |
|---|---|---|
| SNS | Notifications | Email, SMS, Lambda trigger |
| Auto Scaling | Scale resources | Add EC2 when CPU > 80% |
| EC2 Stop | Cost optimization | Stop dev instance after hours |
| EC2 Terminate | Cleanup | Terminate unhealthy instance |
| EC2 Recover | Self-healing | Recover failed instance |
| Systems Manager | Automation | Run remediation runbook |

4. Composite Alarms

Combine multiple alarms with AND/OR logic:

┌─────────────────────────────────────────────────────────────────────────────┐
│                           COMPOSITE ALARM EXAMPLE                           │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   Individual Alarms:                                                        │
│   ┌────────────────────┐  ┌────────────────────┐  ┌────────────────────┐    │
│   │ HighCPU Alarm      │  │ HighMemory Alarm   │  │ HighDisk Alarm     │    │
│   │ CPU > 80%          │  │ Memory > 85%       │  │ Disk > 90%         │    │
│   └─────────┬──────────┘  └─────────┬──────────┘  └─────────┬──────────┘    │
│             │                       │                        │              │
│             └───────────────────────┼───────────────────────┘               │
│                                      │                                      │
│                                     ▼                                       │
│                    ┌────────────────────────────────────┐                   │
│                    │        COMPOSITE ALARM             │                   │
│                    │                                    │                   │
│                    │  Rule: (HighCPU AND HighMemory)    │                   │
│                    │        OR HighDisk                 │                   │
│                    │                                    │                   │
│                    │  → Only alert when TRULY critical  │                   │
│                    │  → Reduce alert fatigue            │                   │
│                    └────────────────────────────────────┘                   │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
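
The rule in the diagram can be expressed as an `AlarmRule` string and passed to `put_composite_alarm`. The sketch below builds that string and also evaluates the same logic locally so the behaviour can be checked without AWS. The alarm names `HighCPU`, `HighMemory` and `HighDisk`, the composite alarm name, and the helper functions are hypothetical; the underlying metric alarms are assumed to exist already.

```python
from typing import Dict


def build_alarm_rule(cpu_alarm: str, memory_alarm: str, disk_alarm: str) -> str:
    """Build the rule: (CPU AND Memory) OR Disk."""
    return f'(ALARM("{cpu_alarm}") AND ALARM("{memory_alarm}")) OR ALARM("{disk_alarm}")'


def rule_fires(states: Dict[str, str], cpu_alarm: str, memory_alarm: str, disk_alarm: str) -> bool:
    """Evaluate the same rule locally from a mapping of alarm name to state."""
    def in_alarm(name: str) -> bool:
        return states.get(name) == "ALARM"

    return (in_alarm(cpu_alarm) and in_alarm(memory_alarm)) or in_alarm(disk_alarm)


def create_composite_alarm(name: str, rule: str, topic_arn: str) -> None:
    """Create or update the composite alarm. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the helpers above work without boto3

    boto3.client("cloudwatch").put_composite_alarm(
        AlarmName=name,
        AlarmRule=rule,
        AlarmActions=[topic_arn],
    )


rule = build_alarm_rule("HighCPU", "HighMemory", "HighDisk")
print(rule)

# High CPU alone does not fire the composite alarm; CPU plus memory does.
cpu_only = rule_fires({"HighCPU": "ALARM", "HighMemory": "OK", "HighDisk": "OK"},
                      "HighCPU", "HighMemory", "HighDisk")
cpu_and_memory = rule_fires({"HighCPU": "ALARM", "HighMemory": "ALARM", "HighDisk": "OK"},
                            "HighCPU", "HighMemory", "HighDisk")
print(cpu_only, cpu_and_memory)
```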

CloudWatch Agent

1. Why Use the CloudWatch Agent?

┌─────────────────────────────────────────────────────────────────────────────┐
│                    DEFAULT METRICS vs AGENT METRICS                         │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  Without Agent (Default)           │  With CloudWatch Agent                 │
│  ─────────────────────────────────│──────────────────────────────────────   │
│                                     │                                       │
│  ✅ CPU Utilization                │  ✅ All Default Metrics                │
│  ✅ Network In/Out                 │  ➕ Memory Utilization                 │
│  ✅ Disk Read/Write (Instance      │  ➕ Disk Space Utilization             │
│     Store only)                    │  ➕ Swap Usage                         │
│  ✅ Status Check                   │  ➕ Netstat Metrics                    │
│  ❌ Memory - NOT AVAILABLE         │  ➕ Process-level Metrics              │
│  ❌ Disk Space - NOT AVAILABLE     │  ➕ Custom Application Logs            │
│  ❌ Application Logs               │  ➕ StatsD/collectd Metrics            │
│                                     │                                       │
└────────────────────────────────────┴────────────────────────────────────────┘

2. Agent Installation & Configuration

# 1. Download & Install (Amazon Linux 2)
sudo yum install amazon-cloudwatch-agent -y
 
# 2. Create configuration using wizard
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-config-wizard
 
# 3. Start agent with config
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
    -a fetch-config \
    -m ec2 \
    -c file:/opt/aws/amazon-cloudwatch-agent/bin/config.json \
    -s

3. Agent Configuration File

{
  "agent": {
    "metrics_collection_interval": 60,
    "run_as_user": "cwagent"
  },
  "metrics": {
    "namespace": "CustomEC2Metrics",
    "metrics_collected": {
      "cpu": {
        "measurement": ["cpu_usage_idle", "cpu_usage_user", "cpu_usage_system"],
        "totalcpu": true
      },
      "mem": {
        "measurement": ["mem_used_percent", "mem_available_percent"]
      },
      "disk": {
        "measurement": ["disk_used_percent", "disk_free"],
        "resources": ["/", "/data"]
      },
      "swap": {
        "measurement": ["swap_used_percent"]
      }
    },
    "append_dimensions": {
      "InstanceId": "${aws:InstanceId}",
      "AutoScalingGroupName": "${aws:AutoScalingGroupName}"
    }
  },
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/messages",
            "log_group_name": "/ec2/system/messages",
            "log_stream_name": "{instance_id}",
            "timestamp_format": "%b %d %H:%M:%S"
          },
          {
            "file_path": "/var/log/myapp/*.log",
            "log_group_name": "/ec2/myapp",
            "log_stream_name": "{instance_id}/{file_name}",
            "multi_line_start_pattern": "^\\d{4}-\\d{2}-\\d{2}"
          }
        ]
      }
    }
  }
}
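
Once the agent runs with a configuration like the one above, its metrics are published under the CustomEC2Metrics namespace. The following is a minimal sketch of reading the memory metric back with boto3. The helper names, the instance ID, and the time window are illustrative assumptions; the dimension list must match exactly what the agent publishes (with the append_dimensions block above, an instance in an Auto Scaling group would also carry AutoScalingGroupName).

```python
from datetime import datetime, timedelta, timezone


def build_memory_query(instance_id, end_time, hours=1, extra_dimensions=None):
    """Build get_metric_statistics arguments for the agent's mem_used_percent metric."""
    dimensions = [{"Name": "InstanceId", "Value": instance_id}]
    # CloudWatch only returns data when the full dimension set matches.
    for name, value in (extra_dimensions or {}).items():
        dimensions.append({"Name": name, "Value": value})
    return {
        "Namespace": "CustomEC2Metrics",      # "namespace" in the agent config
        "MetricName": "mem_used_percent",
        "Dimensions": dimensions,
        "StartTime": end_time - timedelta(hours=hours),
        "EndTime": end_time,
        "Period": 60,                         # matches metrics_collection_interval
        "Statistics": ["Average", "Maximum"],
    }


def fetch_memory_datapoints(instance_id):
    """Query CloudWatch; needs boto3 and AWS credentials at call time."""
    import boto3  # imported lazily so the builder above works without boto3

    cloudwatch = boto3.client("cloudwatch")
    params = build_memory_query(instance_id, datetime.now(timezone.utc))
    response = cloudwatch.get_metric_statistics(**params)
    return sorted(response["Datapoints"], key=lambda point: point["Timestamp"])
```

Calling fetch_memory_datapoints("i-1234567890abcdef0") would return the datapoints in time order, assuming the instance reports only the InstanceId dimension.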

CloudWatch Dashboards

1. Dashboard Features

┌──────────────────────────────────────────────────────────────────────────────┐
│                      CLOUDWATCH DASHBOARD EXAMPLE                            │
├──────────────────────────────────────────────────────────────────────────────┤
│  Production Overview                                      [Time: Last 3h ▼]  │
├──────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ┌─────────────────────────────┐  ┌─────────────────────────────┐            │
│  │  📈 EC2 CPU Utilization     │  │  📈 RDS Connections         │            │
│  │  ┌─────────────────────┐    │  │  ┌─────────────────────┐    │            │
│  │  │     ___/\___        │    │  │  │   ___    ___        │    │            │
│  │  │    /       \        │    │  │  │  /   \__/   \___    │    │            │
│  │  │___/         \___    │    │  │  │_/               \_  │    │            │
│  │  └─────────────────────┘    │  │  └─────────────────────┘    │            │
│  │  Avg: 45%  Max: 78%         │  │  Current: 127  Max: 200     │            │
│  └─────────────────────────────┘  └─────────────────────────────┘            │
│                                                                              │
│  ┌─────────────────────────────┐  ┌─────────────────────────────┐            │
│  │  📊 Lambda Errors (Table)   │  │  🔢 Active Alarms           │            │
│  │  ┌─────────────────────┐    │  │                             │            │
│  │  │ Function   | Errors │    │  │  ⚠️  HighCPU-Web-Server     │            │
│  │  │ OrderProc  |   3    │    │  │  ⚠️  LowDiskSpace-DB        │            │
│  │  │ PaymentSvc |   0    │    │  │  ✅  All other alarms OK    │            │
│  │  │ UserAuth   |   1    │    │  │                             │            │
│  │  └─────────────────────┘    │  │                             │            │
│  └─────────────────────────────┘  └─────────────────────────────┘            │
│                                                                              │
└──────────────────────────────────────────────────────────────────────────────┘

2. Widget Types

| Widget Type | Use Case | Example |
|---|---|---|
| Line | Time series trends | CPU over time |
| Stacked Area | Show composition | Memory breakdown |
| Number | Single current value | Error count |
| Gauge | Show vs threshold | CPU vs 80% limit |
| Bar | Compare values | Requests by endpoint |
| Pie | Show distribution | Traffic by region |
| Text | Markdown content | Instructions, links |
| Alarm Status | Show alarm states | Critical alarms |
| Logs Table | Recent log entries | Error logs |
| Explorer | Dynamic resource view | All EC2 instances |
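
Dashboards are defined as a JSON body made of widgets placed on a 24-column grid. As a rough sketch, a few of the widget types above could be assembled in Python and published with put_dashboard. The helper names, the dashboard name, the instance ID, and the mydb database identifier are placeholders, not values from a real environment.

```python
import json


def metric_widget(title, metrics, x, y, view="timeSeries", region="us-east-1"):
    """Return one metric widget; `view` picks the chart style (line, number, ...)."""
    return {
        "type": "metric",
        "x": x, "y": y, "width": 12, "height": 6,
        "properties": {
            "title": title,
            "metrics": metrics,
            "view": view,          # e.g. "timeSeries", "singleValue", "bar", "pie"
            "stat": "Average",
            "period": 300,
            "region": region,
        },
    }


def text_widget(markdown, x, y):
    """Return a text widget that renders Markdown content."""
    return {"type": "text", "x": x, "y": y, "width": 24, "height": 2,
            "properties": {"markdown": markdown}}


def build_dashboard_body():
    """Assemble the dashboard JSON: one text header and two metric widgets."""
    widgets = [
        text_widget("# Production Overview", 0, 0),
        metric_widget(
            "EC2 CPU Utilization",
            [["AWS/EC2", "CPUUtilization", "InstanceId", "i-1234567890abcdef0"]],
            x=0, y=2),
        metric_widget(
            "RDS Connections",
            [["AWS/RDS", "DatabaseConnections", "DBInstanceIdentifier", "mydb"]],
            x=12, y=2, view="singleValue"),
    ]
    return json.dumps({"widgets": widgets})


def publish_dashboard(name="Production-Overview"):
    """Create or update the dashboard; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builders work without boto3

    cloudwatch = boto3.client("cloudwatch")
    return cloudwatch.put_dashboard(DashboardName=name,
                                    DashboardBody=build_dashboard_body())
```

Keeping the body in code like this makes it easy to review dashboard changes and to reuse the same layout across environments.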

3. Cross-Account & Cross-Region Dashboards

┌─────────────────────────────────────────────────────────────────────────────┐
│                    CROSS-ACCOUNT CLOUDWATCH SETUP                           │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   ┌─────────────────────┐                    ┌─────────────────────┐        │
│   │  Monitoring Account │                    │   Source Account A  │        │
│   │  (Central View)     │◀───────────────────│   (Production)      │        │
│   │                     │  CloudWatch        │                     │        │
│   │  ┌───────────────┐  │  Cross-Account     └─────────────────────┘        │
│   │  │  Unified      │  │  Sharing                                          │
│   │  │  Dashboard    │  │                    ┌─────────────────────┐        │
│   │  │               │  │◀───────────────────│   Source Account B  │        │
│   │  │  All Accounts │  │                    │   (Development)     │        │
│   │  │  All Regions  │  │                    │                     │        │
│   │  └───────────────┘  │                    └─────────────────────┘        │
│   │                     │                                                   │
│   └─────────────────────┘                                                   │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
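
In the monitoring account, a single widget can plot metrics from several source accounts and Regions by attaching options to each metric entry. The sketch below assumes cross-account sharing is already enabled between the accounts; the account IDs, instance IDs, and the helper name are hypothetical.

```python
import json


def cross_account_cpu_widget(sources):
    """Build one line widget that plots EC2 CPU from several accounts/Regions.

    `sources` is a list of (account_id, region, instance_id, label) tuples.
    """
    metrics = []
    for account_id, region, instance_id, label in sources:
        metrics.append([
            "AWS/EC2", "CPUUtilization", "InstanceId", instance_id,
            # Per-metric options select the source account and Region.
            {"accountId": account_id, "region": region, "label": label},
        ])
    return {
        "type": "metric",
        "x": 0, "y": 0, "width": 24, "height": 6,
        "properties": {
            "title": "CPU across accounts",
            "metrics": metrics,
            "view": "timeSeries",
            "stat": "Average",
            "period": 300,
            "region": "us-east-1",  # widget default; per-metric options override it
        },
    }


widget = cross_account_cpu_widget([
    ("111111111111", "us-east-1", "i-0aaaaaaaaaaaaaaaa", "Production"),
    ("222222222222", "eu-west-1", "i-0bbbbbbbbbbbbbbbb", "Development"),
])
dashboard_body = json.dumps({"widgets": [widget]})
```

The resulting dashboard_body string can be passed to put_dashboard in the monitoring account.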

CloudWatch Synthetics (Canaries)

1. Canary Overview

Canaries are configurable scripts that run on a schedule to monitor endpoints and APIs.

┌─────────────────────────────────────────────────────────────────────────────┐
│                        CLOUDWATCH SYNTHETICS FLOW                           │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   ┌─────────────────┐         ┌─────────────────┐         ┌─────────────┐   │
│   │   Canary        │         │   Your Website  │         │ CloudWatch  │   │
│   │   Script        │────────▶│   or API        │────────▶│ Metrics/    │   │
│   │   (Scheduled)   │ Request │                 │ Response│ Alarms      │   │
│   └─────────────────┘         └─────────────────┘         └─────────────┘   │
│          │                                                        │         │
│          │ Run every                                              │         │
│          │ X minutes                                            ▼           │
│          │                                              ┌─────────────┐     │
│          │                                              │ SNS Alert   │     │
│          │                                              │ if Failed   │     │
│          ▼                                              └─────────────┘     │
│   ┌─────────────────┐                                                       │
│   │ S3 Bucket       │                                                       │
│   │ - Screenshots   │                                                       │
│   │ - HAR files     │                                                       │
│   │ - Logs          │                                                       │
│   └─────────────────┘                                                       │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
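
The "SNS Alert if Failed" step in the flow above is usually a CloudWatch alarm on the canary's SuccessPercent metric in the CloudWatchSynthetics namespace. A minimal sketch of such an alarm follows. The canary name, the topic ARN, the helper names, and the choice to treat missing data as a failure are assumptions to adapt.

```python
def canary_failure_alarm(canary_name, sns_topic_arn, period_seconds=300):
    """Build put_metric_alarm arguments that fire when a canary run fails."""
    return {
        "AlarmName": f"CanaryFailed-{canary_name}",
        "AlarmDescription": f"Canary {canary_name} success rate dropped below 100%",
        "Namespace": "CloudWatchSynthetics",
        "MetricName": "SuccessPercent",
        "Dimensions": [{"Name": "CanaryName", "Value": canary_name}],
        "Statistic": "Average",
        "Period": period_seconds,
        "EvaluationPeriods": 1,
        "Threshold": 100.0,
        "ComparisonOperator": "LessThanThreshold",
        # Assumption: a canary that stops reporting is treated as failing.
        "TreatMissingData": "breaching",
        "AlarmActions": [sns_topic_arn],
    }


def create_canary_alarm(canary_name, sns_topic_arn):
    """Create the alarm; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder works without boto3

    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_alarm(**canary_failure_alarm(canary_name, sns_topic_arn))
```

The Period should be at least as long as the canary's schedule interval so that every evaluation window contains a run.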

2. Canary Use Cases

| Use Case | Description |
|---|---|
| Heartbeat Monitoring | Simple availability check |
| API Monitoring | Validate API responses |
| UI Workflow | Test login flows, checkout process |
| Visual Monitoring | Screenshot comparison |
| Broken Link Checker | Find 404 errors |

3. Sample Canary Script (Node.js)

const synthetics = require('Synthetics');
const log = require('SyntheticsLogger');
 
const flowBuilderBlueprint = async function () {
    // Configure the browser
    let page = await synthetics.getPage();
    
    // Step 1: Navigate to homepage
    await synthetics.executeStep('navigateToHomepage', async function () {
        await page.goto('https://www.example.com', {
            waitUntil: 'networkidle0',
            timeout: 30000
        });
    });
    
    // Step 2: Verify page title
    await synthetics.executeStep('verifyTitle', async function () {
        const title = await page.title();
        if (!title.includes('Example')) {
            throw new Error('Title does not contain expected text');
        }
        log.info('Page title verified: ' + title);
    });
    
    // Step 3: Check API endpoint
    await synthetics.executeStep('checkAPIEndpoint', async function () {
        const response = await page.goto('https://api.example.com/health');
        if (response.status() !== 200) {
            throw new Error(`API returned status ${response.status()}`);
        }
    });
};
 
exports.handler = async () => {
    return await flowBuilderBlueprint();
};

CloudWatch ServiceLens & X-Ray Integration

1. End-to-End Observability

┌─────────────────────────────────────────────────────────────────────────────┐
│                    CLOUDWATCH SERVICELENS ARCHITECTURE                      │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   User Request                                                              │
│       │                                                                     │
│       ▼                                                                     │
│   ┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐                  │
│   │   ALB   │───▶│   API   │───▶│ Lambda  │───▶│ DynamoDB│                  │
│   │         │    │ Gateway │    │Function │    │         │                  │
│   └────┬────┘    └────┬────┘    └────┬────┘    └────┬────┘                  │
│        │              │              │               │                      │
│        │              │              │               │                      │
│        └──────────────┴──────────────┴──────────────┘                       │
│                              │                                              │
│                              ▼                                              │
│                    ┌──────────────────┐                                     │
│                    │   AWS X-Ray      │                                     │
│                    │   (Traces)       │                                     │
│                    └────────┬─────────┘                                     │
│                              │                                              │
│                             ▼                                               │
│   ┌─────────────────────────────────────────────────────────────────────┐   │
│   │                    CloudWatch ServiceLens                           │   │
│   │  ┌──────────────┐ ┌──────────────┐ ┌──────────────┐                 │   │
│   │  │   Service    │ │  Resource    │ │   Trace      │                 │   │
│   │  │   Map        │ │  Health      │ │   Analysis   │                 │   │
│   │  └──────────────┘ └──────────────┘ └──────────────┘                 │   │
│   └─────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

2. ServiceLens Features

| Feature | Description |
|---|---|
| Service Map | Visual map of application dependencies |
| Trace Analysis | Follow requests across services |
| Correlated Metrics | Link traces with CloudWatch metrics |
| Latency Analysis | Identify slow components |
| Error Tracking | Trace error paths |
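
The data behind the service map is also available through the X-Ray GetServiceGraph API, which makes a rough latency analysis scriptable. The sketch below ranks services by average response time. The function names are illustrative, pagination (NextToken) is ignored for brevity, and only the SummaryStatistics fields used here are assumed to be present.

```python
def slowest_services(services, top_n=3):
    """Rank services by average response time from a GetServiceGraph-style list."""
    ranked = []
    for service in services:
        stats = service.get("SummaryStatistics", {})
        total = stats.get("TotalCount", 0)
        if total == 0:
            continue  # no traffic in the window, nothing to average
        faults = stats.get("FaultStatistics", {}).get("TotalCount", 0)
        ranked.append({
            "name": service.get("Name"),
            "avg_seconds": stats.get("TotalResponseTime", 0.0) / total,
            "fault_rate": faults / total,
        })
    ranked.sort(key=lambda entry: entry["avg_seconds"], reverse=True)
    return ranked[:top_n]


def fetch_service_graph(minutes=10):
    """Fetch the first page of the service graph; needs boto3 and credentials."""
    from datetime import datetime, timedelta, timezone

    import boto3  # imported lazily so slowest_services works without boto3

    end = datetime.now(timezone.utc)
    xray = boto3.client("xray")
    response = xray.get_service_graph(StartTime=end - timedelta(minutes=minutes),
                                      EndTime=end)
    return response["Services"]
```

For example, slowest_services(fetch_service_graph()) would list the components worth opening in Trace Analysis first.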

CloudWatch Container Insights

1. Container Monitoring

┌─────────────────────────────────────────────────────────────────────────────┐
│                     CONTAINER INSIGHTS ARCHITECTURE                         │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   ┌─────────────────────────────────────────────────────────────────────┐   │
│   │                          EKS / ECS Cluster                          │   │
│   │                                                                     │   │
│   │   ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐             │   │
│   │   │ Node 1  │   │ Node 2  │   │ Node 3  │   │ Node N  │             │   │
│   │   │┌───────┐│   │┌───────┐│   │┌───────┐│   │┌───────┐│             │   │
│   │   ││Pod A  ││   ││Pod D  ││   ││Pod G  ││   ││Pod J  ││             │   │
│   │   │├───────┤│   │├───────┤│   │├───────┤│   │├───────┤│             │   │
│   │   ││Pod B  ││   ││Pod E  ││   ││Pod H  ││   ││Pod K  ││             │   │
│   │   │├───────┤│   │├───────┤│   │├───────┤│   │├───────┤│             │   │
│   │   ││Pod C  ││   ││Pod F  ││   ││Pod I  ││   ││Pod L  ││             │   │
│   │   │└───────┘│   │└───────┘│   │└───────┘│   │└───────┘│             │   │
│   │   └────┬────┘   └────┬────┘   └────┬────┘   └────┬────┘             │   │
│   │        │             │             │             │                  │   │
│   └────────┼─────────────┼─────────────┼─────────────┼──────────────────┘   │
│            │             │             │               │                    │
│            └─────────────┴─────────────┴─────────────┘                      │
│                                     │                                       │
│                                    ▼                                        │
│                    ┌───────────────────────────────┐                        │
│                    │   CloudWatch Container        │                        │
│                    │   Insights Agent              │                        │
│                    │   (DaemonSet / Sidecar)       │                        │
│                    └───────────────┬───────────────┘                        │
│                                     │                                       │
│                                    ▼                                        │
│                    ┌───────────────────────────────┐                        │
│                    │   CloudWatch Logs & Metrics   │                        │
│                    │   - Cluster metrics           │                        │
│                    │   - Node metrics              │                        │
│                    │   - Pod metrics               │                        │
│                    │   - Container metrics         │                        │
│                    └───────────────────────────────┘                        │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

2. Container Insights Metrics

| Level | Metrics Collected |
|---|---|
| Cluster | Node count, Pod count, CPU/Memory reservation |
| Node | CPU, Memory, Network, Filesystem, Pod count |
| Pod | CPU, Memory, Network, Container restarts |
| Container | CPU, Memory limits/requests |
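
Container Insights publishes these values as ordinary CloudWatch metrics in the ContainerInsights namespace, so they can be queried like any other metric. The sketch below builds a GetMetricData query for pod CPU on EKS. The cluster, namespace, and pod names are placeholders, the helper names are illustrative, and the dimension set shown is an assumption that should be checked against what your cluster actually reports.

```python
def pod_cpu_query(cluster, namespace, pod_name, period=60):
    """Build a MetricDataQueries list for pod-level CPU utilization on EKS."""
    return [{
        "Id": "pod_cpu",
        "MetricStat": {
            "Metric": {
                "Namespace": "ContainerInsights",
                "MetricName": "pod_cpu_utilization",
                "Dimensions": [
                    {"Name": "ClusterName", "Value": cluster},
                    {"Name": "Namespace", "Value": namespace},
                    {"Name": "PodName", "Value": pod_name},
                ],
            },
            "Period": period,
            "Stat": "Average",
        },
        "ReturnData": True,
    }]


def fetch_pod_cpu(cluster, namespace, pod_name, minutes=30):
    """Run the query; needs boto3 and AWS credentials."""
    from datetime import datetime, timedelta, timezone

    import boto3  # imported lazily so the builder works without boto3

    end = datetime.now(timezone.utc)
    cloudwatch = boto3.client("cloudwatch")
    response = cloudwatch.get_metric_data(
        MetricDataQueries=pod_cpu_query(cluster, namespace, pod_name),
        StartTime=end - timedelta(minutes=minutes),
        EndTime=end,
    )
    result = response["MetricDataResults"][0]
    return list(zip(result["Timestamps"], result["Values"]))
```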

EventBridge Integration (formerly CloudWatch Events)

┌─────────────────────────────────────────────────────────────────────────────┐
│                 CLOUDWATCH EVENTS → EVENTBRIDGE EVOLUTION                   │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  CloudWatch Events (Legacy)              EventBridge (Current)              │
│  ┌─────────────────────────┐             ┌─────────────────────────┐        │
│  │ • Basic event routing   │   ────▶     │ • Advanced event bus    │        │
│  │ • AWS events only       │             │ • SaaS integrations     │        │
│  │ • Simple rules          │             │ • Schema registry       │        │
│  │                         │             │ • Event archive/replay  │        │
│  │                         │             │ • Cross-account events  │        │
│  └─────────────────────────┘             └─────────────────────────┘        │
│                                                                             │
│  Note: CloudWatch Events API still works but routes to EventBridge          │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

[!NOTE] CloudWatch Events has been rebranded as Amazon EventBridge. See eventbridge.md for full details.
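
As an illustration of event routing, the sketch below builds an event pattern that matches EC2 instances entering the stopped state and attaches a target to the rule. The rule name, target ARN, and helper names are placeholders; the target (for example an SNS topic) also needs a resource policy that allows EventBridge to invoke it, which is not shown here.

```python
import json


def ec2_stopped_pattern(instance_ids=None):
    """Return an event pattern (JSON string) for EC2 'stopped' state changes."""
    pattern = {
        "source": ["aws.ec2"],
        "detail-type": ["EC2 Instance State-change Notification"],
        "detail": {"state": ["stopped"]},
    }
    if instance_ids:
        # Optionally narrow the rule to specific instances.
        pattern["detail"]["instance-id"] = list(instance_ids)
    return json.dumps(pattern)


def create_stopped_rule(rule_name, target_arn):
    """Create the rule and attach one target; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the pattern builder works without boto3

    events = boto3.client("events")
    events.put_rule(Name=rule_name,
                    EventPattern=ec2_stopped_pattern(),
                    State="ENABLED")
    events.put_targets(Rule=rule_name,
                       Targets=[{"Id": "notify", "Arn": target_arn}])
```

The same boto3 "events" client serves both the legacy CloudWatch Events API and EventBridge.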


CloudWatch Pricing

1. Pricing Components

| Component | Free Tier | Paid Pricing |
|---|---|---|
| Metrics | 10 custom metrics | $0.30/metric/month (first 10K) |
| Dashboards | 3 dashboards | $3/dashboard/month |
| Alarms | 10 alarms | $0.10/alarm/month (standard) |
| Logs Ingestion | 5GB | $0.50/GB |
| Logs Storage | 5GB | $0.03/GB/month |
| Logs Insights | None | $0.005/GB scanned |
| Contributor Insights | 1 rule | $0.50/rule/month + $0.02 per 1 million matching log events |
| Canaries | 100 runs/month | $0.0012/canary run |
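
To make the table concrete, here is a small back-of-the-envelope estimator using the first five rates listed above. It is a simplification: it treats each row independently, ignores volume tiers and regional price differences, and the usage figures in the example are invented.

```python
# Free-tier allowances and list prices taken from the table above.
FREE_TIER = {"custom_metrics": 10, "dashboards": 3, "alarms": 10,
             "ingest_gb": 5, "storage_gb": 5}
RATES = {"custom_metrics": 0.30, "dashboards": 3.00, "alarms": 0.10,
         "ingest_gb": 0.50, "storage_gb": 0.03}


def estimate_monthly_cost(usage):
    """Return a per-component cost breakdown (USD) plus a total."""
    breakdown = {}
    for item, rate in RATES.items():
        billable = max(0, usage.get(item, 0) - FREE_TIER[item])
        breakdown[item] = round(billable * rate, 2)
    breakdown["total"] = round(sum(breakdown.values()), 2)
    return breakdown


example = estimate_monthly_cost({
    "custom_metrics": 50,   # 40 billable
    "dashboards": 5,        # 2 billable
    "alarms": 30,           # 20 billable
    "ingest_gb": 105,       # 100 GB billable
    "storage_gb": 205,      # 200 GB billable
})
print(example)
```

In this made-up scenario, log ingestion (100 GB × $0.50) accounts for most of the bill, which is typical and is why the log-related tips in the next section tend to matter most.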

2. Cost Optimization Tips

┌─────────────────────────────────────────────────────────────────────────────┐
│                     CLOUDWATCH COST OPTIMIZATION                            │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  1. SET LOG RETENTION                                                       │
│     ─────────────────                                                       │
│     Change from "Never Expire" to appropriate retention period              │
│     Most logs don't need > 30 days retention                                │
│                                                                             │
│  2. USE LOG FILTERS WISELY                                                  │
│     ───────────────────────                                                 │
│     Create metric filters instead of querying all logs                      │
│     Push aggregated metrics, not every data point                           │
│                                                                             │
│  3. EXPORT TO S3                                                            │
│     ──────────────────                                                      │
│     For long-term storage, export to S3 (cheaper than CW Logs storage)      │
│     Use Athena for querying archived logs                                   │
│                                                                             │
│  4. OPTIMIZE METRIC RESOLUTION                                              │
│     ────────────────────────                                                │
│     Use standard resolution (1 min) unless you truly need high-res          │
│     High-resolution metrics cost significantly more                         │
│                                                                             │
│  5. CONSOLIDATE DASHBOARDS                                                  │
│     ────────────────────                                                    │
│     Each dashboard costs $3/month                                           │
│     Combine related metrics into fewer dashboards                           │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
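
Tip 1 is easy to automate. A log group with no retentionInDays field in the DescribeLogGroups response never expires, so a short script can find those groups and apply a policy. This is a sketch only: the function names are illustrative, the 30-day default is an assumption, and compliance requirements should be checked before logs are allowed to expire.

```python
def groups_without_retention(log_groups):
    """Return names of log groups that never expire (no retentionInDays key)."""
    return [group["logGroupName"] for group in log_groups
            if "retentionInDays" not in group]


def apply_retention(days=30):
    """Set a retention policy on every never-expiring log group.

    Needs boto3 and AWS credentials; returns the names that were updated.
    """
    import boto3  # imported lazily so the filter above works without boto3

    logs = boto3.client("logs")
    updated = []
    paginator = logs.get_paginator("describe_log_groups")
    for page in paginator.paginate():
        for name in groups_without_retention(page["logGroups"]):
            logs.put_retention_policy(logGroupName=name, retentionInDays=days)
            updated.append(name)
    return updated
```

Running groups_without_retention on its own first gives a dry-run list to review before anything is changed.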

Common Use Cases & Best Practices

1. Basic EC2 Monitoring Setup

import boto3
 
cloudwatch = boto3.client('cloudwatch')
 
# Create alarm for high CPU
cloudwatch.put_metric_alarm(
    AlarmName='HighCPU-WebServer',
    ComparisonOperator='GreaterThanThreshold',
    EvaluationPeriods=2,
    MetricName='CPUUtilization',
    Namespace='AWS/EC2',
    Period=300,
    Statistic='Average',
    Threshold=80.0,
    ActionsEnabled=True,
    AlarmActions=[
        'arn:aws:sns:us-east-1:123456789012:AlertTopic'
    ],
    AlarmDescription='Alert when CPU exceeds 80%',
    Dimensions=[
        {
            'Name': 'InstanceId',
            'Value': 'i-1234567890abcdef0'
        },
    ]
)

2. Application Logging Pattern

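import logging
import watchtower  # pip install watchtower

# Set up standard logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Add CloudWatch handler (parameter names as of watchtower 2.0 and later;
# older releases used log_group / stream_name)
logger.addHandler(watchtower.CloudWatchLogHandler(
    log_group_name='/myapp/production',
    # {strftime:...} inserts the current UTC date into the stream name
    log_stream_name='web-server-{strftime:%Y-%m-%d}',
    create_log_group=True
))

# Structured logging: watchtower serializes dict messages to JSON,
# which keeps the fields queryable in Logs Insights
logger.info({
    'message': 'Order processed',
    'order_id': '12345',
    'customer_id': 'C-999',
    'amount': 150.00,
    'status': 'completed'
})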

3. Metric Filter for Error Counting

# CloudFormation snippet
MetricFilter:
  Type: AWS::Logs::MetricFilter
  Properties:
    LogGroupName: /aws/lambda/my-function
    FilterPattern: "ERROR"
    MetricTransformations:
      - MetricName: ErrorCount
        MetricNamespace: CustomApp/Lambda
        MetricValue: "1"
        DefaultValue: 0
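
A metric filter only produces a metric; to get notified, an alarm is still needed on top of it. As a sketch, the ErrorCount metric defined above could be alarmed on as follows. The alarm name, the threshold of 5 errors per 5 minutes, the helper names, and the topic ARN are assumptions.

```python
def error_count_alarm(sns_topic_arn, threshold=5):
    """Build put_metric_alarm arguments for the ErrorCount metric filter."""
    return {
        "AlarmName": "LambdaErrorCountHigh",
        "Namespace": "CustomApp/Lambda",   # MetricNamespace from the filter
        "MetricName": "ErrorCount",        # MetricName from the filter
        "Statistic": "Sum",                # each matching log line adds 1
        "Period": 300,
        "EvaluationPeriods": 1,
        "Threshold": float(threshold),
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        # Periods with no log events at all should not trigger the alarm.
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
    }


def create_error_alarm(sns_topic_arn):
    """Create the alarm; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder works without boto3

    boto3.client("cloudwatch").put_metric_alarm(**error_count_alarm(sns_topic_arn))
```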

CloudWatch FAQ

Q: CloudWatch Agent vs Built-in Metrics - When do you need the Agent?

| Scenario | Agent needed? |
|---|---|
| Monitor EC2 CPU/Network | ❌ No |
| Monitor EC2 Memory | ✅ Yes |
| Monitor Disk Space | ✅ Yes |
| Collect application logs from EC2 | ✅ Yes |
| Monitor Lambda metrics | ❌ No (automatic) |
| Monitor on-premises servers | ✅ Yes |

Q: Log Group vs Log Stream?

Log Group: /aws/lambda/order-service     ← Container (billing, retention)
├── Log Stream: 2024/01/15/[$LATEST]abc  ← Single Lambda instance log
├── Log Stream: 2024/01/15/[$LATEST]def  ← Another instance
└── Log Stream: 2024/01/16/[$LATEST]ghi  ← Next day instance

Q: Standard vs High-Resolution Metrics?

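| Aspect | Standard | High-Resolution |
|---|---|---|
| Resolution | 1 minute | 1 second |
| Retention | 15 months (rolled up to coarser periods as data ages) | Sub-minute data kept 3 hours, then aggregated |
| Cost | Lower | Higher (more PutMetricData calls; high-resolution alarms cost more) |
| Use Case | Most workloads | Real-time trading, gaming |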
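
In the API, the difference comes down to the StorageResolution field on each datapoint sent with PutMetricData: 60 (the default) for standard resolution, 1 for high resolution. A minimal sketch follows; the metric and namespace names and the helper names are made up for illustration.

```python
from datetime import datetime, timezone


def metric_datum(name, value, high_resolution=False, unit="Count", timestamp=None):
    """Build one MetricData entry; StorageResolution selects the resolution."""
    return {
        "MetricName": name,
        "Value": float(value),
        "Unit": unit,
        "Timestamp": timestamp or datetime.now(timezone.utc),
        "StorageResolution": 1 if high_resolution else 60,
    }


def publish(namespace, data):
    """Send the datapoints; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder works without boto3

    boto3.client("cloudwatch").put_metric_data(Namespace=namespace, MetricData=data)


standard = metric_datum("OrdersProcessed", 42)
high_res = metric_datum("OrdersPerSecond", 12, high_resolution=True)
```

Calling publish("CustomApp/Orders", [standard, high_res]) would send both datapoints in one request.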

Q: CloudWatch Alarms vs EventBridge?

| Feature | CloudWatch Alarms | EventBridge |
|---|---|---|
| Trigger Based On | Metric thresholds | Events/State changes |
| Example | CPU > 80% for 5 min | EC2 instance stopped |
| Actions | SNS, EC2, Auto Scaling | Lambda, Step Functions, SQS, etc. |
| Pattern Matching | Simple threshold | Complex event patterns |

Related Services

| Service | Relationship to CloudWatch |
|---|---|
| SNS | Receives alarm notifications |
| EventBridge | Event-driven automation (successor to CW Events) |
| X-Ray | Distributed tracing, ServiceLens integration |
| Auto Scaling | Scale based on CW metrics/alarms |
| Systems Manager | Run automation based on alarms |
| Lambda | Log destination, alarm target |
| Kinesis Firehose | Real-time log streaming |
Summary

┌──────────────────────────────────────────────────────────────────────────────┐
│                      CLOUDWATCH KEY TAKEAWAYS                                │
├──────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ✅ Metrics: Time-series data, default + custom, 1-second to 5-min res       │
│                                                                              │
│  ✅ Logs: Centralized logging, retention policies, Insights for queries      │
│                                                                              │
│  ✅ Alarms: Threshold-based alerts, 3 states (OK/ALARM/INSUFFICIENT_DATA)    │
│                                                                              │
│  ✅ Agent: Required for Memory/Disk metrics and custom logs                  │
│                                                                              │
│  ✅ Dashboards: Unified visualization, cross-account/region support          │
│                                                                              │
│  ✅ Canaries: Synthetic monitoring for endpoints and workflows               │
│                                                                              │
│  ✅ Container Insights: EKS/ECS monitoring with cluster/pod/container        │
│                         level metrics                                        │
│                                                                              │
│  ✅ ServiceLens: End-to-end observability with X-Ray integration             │
│                                                                              │
│  ⚠️  Memory & Disk Space: NOT default metrics - need CloudWatch Agent        │
│                                                                              │
│  ⚠️  Log retention: Default is "Never Expire" - SET RETENTION POLICY!        │
│                                                                              │
└──────────────────────────────────────────────────────────────────────────────┘
