Setting up Amazon CloudWatch in EKS

Introduction

Recently I picked up a ticket to troubleshoot why our CloudWatch pods were crash-looping. After investigating, I noticed that the deployment was outdated and had not been updated in years. It had been created manually from the DaemonSet manifests, using an old version of the CloudWatch agent that is no longer supported or compatible with our current EKS and node versions.

The first thing I did was switch to the right cluster and namespace. There are two tools I used that make life easier when navigating between clusters and namespaces: kubectx for switching contexts and kubens for switching namespaces.

kubectx <context>
kubens <namespace>

Once I was in the right namespace, I used another tool called k9s to get a better visual representation of its resources.

k9s -n cloudwatch

This is where I found that the CloudWatch pods on the old Amazon Linux nodes were failing to start (CrashLoopBackOff). That led me to investigate the deployment manifests and ConfigMaps to ascertain the root cause. This is what really stopped me in my tracks, because I had difficulty figuring out how this was deployed and what the intended configuration was. However, once I checked the official AWS documentation, I noticed that the manually deployed DaemonSet approach appears to be deprecated, and the Amazon CloudWatch Observability Helm chart is now the recommended way to deploy the CloudWatch agent and Fluent Bit.
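
For anyone following along, these are the kinds of commands I used to dig into the failing pods (pod names will differ in your cluster; the --previous flag shows logs from the last crashed container):

kubectl get pods -n cloudwatch
kubectl describe pod <POD_NAME> -n cloudwatch
kubectl logs <POD_NAME> -n cloudwatch --previous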

I decided to take the Helm chart approach as it would probably make the deployment and management easier and more configurable for our use case.

What I implemented

My manager mentioned it would have been better if I had created a more detailed design document. While I did create one initially, I realized I didn’t have much context for the CloudWatch agent and Fluent Bit. So after a bit of research, I decided to do a test deployment in our dev environment.

So this was my thought process:

  • GitOps deployment
  • Kustomize, since it supports patching, which I knew I would need because the AWS Helm chart is strict
  • Helm chart for easier deployment and management
  • Testing in a development environment

I created a new directory in our git repo called test, then a base directory containing two files: kustomization.yaml and a base values.yaml. Since I was testing in dev, my understanding was that by adding a base configuration, I could deploy it to different environments using an ApplicationSet, with the cluster-name template variable in the values file replaced with the correct cluster name for each environment.

The base directory handles the Helm chart deployment and configuration, and it spans all the environments we need to deploy to.
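
Roughly, the layout looked like this (the overlay directory names here are illustrative):

test/
├── base/
│   ├── kustomization.yaml
│   └── values.yaml
└── overlays/
    └── dev/
        └── kustomization.yaml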

base/kustomization.yaml:

---
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources: []

helmCharts:
  - name: amazon-cloudwatch-observability
    repo: https://aws-observability.github.io/helm-charts
    version: 4.3.1
    releaseName: amazon-cloudwatch-observability
    namespace: amazon-cloudwatch
    valuesInline:
      clusterName: ""
      region: <AWS_REGION>
      containerLogs:
        enabled: true
        fluentBit:
          resources:
            requests:
              memory: 128Mi
              cpu: 250m
            limits:
              memory: 512Mi
              cpu: 500m
      agent:
        name: cloudwatch-agent
        mode: daemonset
        replicas: 1
        resources:
          requests:
            memory: 128Mi
            cpu: 250m
          limits:
            memory: 512Mi
            cpu: 500m
        serviceAccount:
          name: cloudwatch-agent
          annotations:
            eks.amazonaws.com/role-arn: arn:aws:iam::<AWS_ACCOUNT_ID>:role/<EKS_ROLE_NAME>
        defaultConfig:
          agent:
            region: <AWS_REGION>
          logs:
            metrics_collected:
              kubernetes:
                cluster_name: ""
                enhanced_container_insights: true
                force_flush_interval: 30
                metrics_collection_interval: 300
              application_signals:
                hosted_in: ""
          metrics:
            namespace: CWAgent
            metrics_collected:
              cpu:
                measurement:
                  - name: cpu_usage_idle
                    rename: CPU_USAGE_IDLE
                    unit: Percent
                  - name: cpu_usage_iowait
                    rename: CPU_USAGE_IOWAIT
                    unit: Percent
                  - name: cpu_usage_user
                    rename: CPU_USAGE_USER
                    unit: Percent
                  - name: cpu_usage_system
                    rename: CPU_USAGE_SYSTEM
                    unit: Percent
                metrics_collection_interval: 60
          traces:
            traces_collected:
              application_signals: {}
      dcgmExporter:
        enabled: false
      neuronMonitor:
        enabled: false
      tolerations:
        - operator: Exists
      updateStrategy:
        type: RollingUpdate
        rollingUpdate:
          maxUnavailable: 1
          maxSurge: 1
    includeCRDs: true

patches:
  # Patch FluentBit DaemonSet with correct environment variables
  - target:
      kind: DaemonSet
      name: fluent-bit
      namespace: amazon-cloudwatch
    patch: |-
      - op: replace
        path: /spec/template/spec/containers/0/env/0/value
        value: <AWS_REGION>
      - op: replace
        path: /spec/template/spec/containers/0/env/1/value
        value: ""
  # Patch ServiceAccounts with IAM role
  - target:
      kind: ServiceAccount
      name: amazon-cloudwatch-observability-controller-manager
      namespace: amazon-cloudwatch
    patch: |-
      - op: add
        path: /metadata/annotations
        value:
          eks.amazonaws.com/role-arn: arn:aws:iam::<AWS_ACCOUNT_ID>:role/<EKS_ROLE_NAME>
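
Since the helmCharts field relies on Helm inflation, the rendered manifests can be sanity-checked locally before committing anything (this assumes kustomize and helm are installed locally):

kustomize build --enable-helm base/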

base/values.yaml:

---
# Values for Amazon CloudWatch Observability Helm Chart
# This file structure must match the chart's values.yaml format

# Cluster configuration
clusterName: "" 
region: <AWS_REGION>

# Container logs configuration
containerLogs:
  enabled: true
  fluentBit:
    resources:
      requests:
        memory: 128Mi
        cpu: 250m
      limits:
        memory: 512Mi
        cpu: 500m

# CloudWatch agent configuration
agent:
  name: cloudwatch-agent
  mode: daemonset
  replicas: 1
  resources:
    requests:
      memory: 128Mi
      cpu: 250m
    limits:
      memory: 512Mi
      cpu: 500m
  serviceAccount:
    name: cloudwatch-agent

  defaultConfig:
    agent:
      region: <AWS_REGION>
    logs:
      metrics_collected:
        kubernetes:
          enhanced_container_insights: true
          force_flush_interval: 30
          metrics_collection_interval: 300
        application_signals: {}
    traces:
      traces_collected:
        application_signals: {}

# Disable GPU monitoring (only needed for NVIDIA GPU nodes)
dcgmExporter:
  enabled: false

# Disable Neuron monitoring (only needed for Inferentia/Trainium nodes)
neuronMonitor:
  enabled: false

# Tolerations
tolerations:
  - operator: Exists

# Update strategy
updateStrategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 1
    maxSurge: 1

The above configuration is generic and can be customized for specific environments by overriding values in the respective environment's values.yaml file. One note I learned the hard way: the AWS Observability Helm chart is strict, so values like clusterName need to be patched to the correct name in each environment's kustomization.yaml file.

Below is an example:

---
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
  - ../../base

patches:
  - target:
      kind: AmazonCloudWatchAgent
      name: cloudwatch-agent
      namespace: amazon-cloudwatch
    patch: |-
      - op: replace
        path: /spec/config
        value: '{"agent":{"region":"<AWS_REGION>"},"logs":{"metrics_collected":{"application_signals":{"hosted_in":"<ENVIRONMENT>"},"kubernetes":{"cluster_name":"<ENVIRONMENT>","enhanced_container_insights":true,"force_flush_interval":30,"metrics_collection_interval":300}}},"metrics":{"namespace":"CWAgent","metrics_collected":{"cpu":{"measurement":[{"name":"cpu_usage_idle","rename":"CPU_USAGE_IDLE","unit":"Percent"},{"name":"cpu_usage_iowait","rename":"CPU_USAGE_IOWAIT","unit":"Percent"},{"name":"cpu_usage_user","rename":"CPU_USAGE_USER","unit":"Percent"},{"name":"cpu_usage_system","rename":"CPU_USAGE_SYSTEM","unit":"Percent"}],"metrics_collection_interval":60}}},"traces":{"traces_collected":{"application_signals":{}}}}'
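
To confirm the patch actually lands, the overlay can be rendered the same way and the inlined agent config inspected (the overlay path here is illustrative):

kustomize build --enable-helm overlays/dev/ | grep cluster_name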

This is what works for me and my environment; yours could be different.

Once I had these configured and ready, to be frank, I was still not sure what I had configured was correct, so I figured it had to be trial by fire: deploy and see what happens.

So I went ahead and created an ApplicationSet and AppProject within the same file; these are the Argo CD resources we use to deploy and create the app in our deployment interface. A rough sketch is below.
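
This is a minimal sketch of what such a file can look like, assuming Argo CD; the project name, repo URL, and paths are illustrative, not our actual configuration:

---
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: cloudwatch              # illustrative project name
  namespace: argocd
spec:
  sourceRepos:
    - https://example.com/our/gitops-repo.git   # illustrative repo URL
  destinations:
    - namespace: amazon-cloudwatch
      server: '*'
  clusterResourceWhitelist:     # the chart installs CRDs and other cluster-scoped resources
    - group: '*'
      kind: '*'
---
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: amazon-cloudwatch-observability
  namespace: argocd
spec:
  generators:
    - list:
        elements:
          - env: dev            # one element per environment
  template:
    metadata:
      name: 'cloudwatch-{{env}}'
    spec:
      project: cloudwatch
      source:
        repoURL: https://example.com/our/gitops-repo.git   # illustrative
        targetRevision: main
        path: 'test/overlays/{{env}}'   # the kustomize overlay for this environment
      destination:
        server: https://kubernetes.default.svc
        namespace: amazon-cloudwatch
      syncPolicy:
        syncOptions:
          - CreateNamespace=true

The list generator stamps out one Application per environment, which matches the base/overlay layout above. One caveat: for Argo CD to render the helmCharts field at all, its kustomize build options need to include --enable-helm (configurable via kustomize.buildOptions in the argocd-cm ConfigMap).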

Deployment

One thing most developers, and even tech people in general, underestimate is the fear and caution that comes with deploying an app or a change to any environment. This is where I would check, check, and check again. I asked for help; even so, I was still hesitant to apply this to the cluster.

In the end, I did press enter and the app was deployed. I saw the notification on our communication platform stating it was created, so I navigated to our deployment interface and checked: it was there, in an unknown state. This is where I noticed the error .Values.clusterName is required, which is due to the Helm chart being so strict. I fixed this with the patching solution shown above and did a sync, which created everything successfully.

My next issue was a permission problem with the EKS role I had specified. I had made an error in the IAM role configuration; once that was corrected, everything was fixed and sending to CloudWatch properly. Since I'm paranoid, I double-checked in AWS CloudWatch and confirmed metrics and logs were coming through. A simple mistake, but one that's easily overlooked.
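
If you prefer to verify from the CLI rather than the console, Container Insights writes its log groups under /aws/containerinsights/<CLUSTER_NAME>/, so something like this confirms logs are arriving and the pods are healthy:

aws logs describe-log-groups --log-group-name-prefix /aws/containerinsights/
kubectl get pods -n amazon-cloudwatch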

Boom! Everything was working as intended. The joy and relief of not breaking anything is a great feeling.

Lessons Learned

In tech there's not much help on offer, and even when you do ask, it's often met with disdain or seen negatively, especially at work, where most people just want to do their job and go home, which you can't fault them for. However, one thing I want to do is not just follow the status quo. For this task, I created a guide and put it in our internal documentation so it can help other junior engineers. I do think that if we had more interaction and less judgment as tech folk, we would level up our skills even more.

Testing in dev is a must. I did break the deployment a few times, but I learned from it and fixed it. I think I learned more from my testing than from anyone telling me what to do.

Even though our work is hard and stressful, it’s just work at the end of the day and shouldn’t consume your life. It took me a while to get this task completed—maybe someone could have done it quicker and better.

If you work in DevOps or any tech role, the best thing you can do is try, try, and try again because at the end of the day you will learn more from doing it yourself than from anyone telling you what to do.

If you got this far, thanks for reading this blog post. I hope you found it useful or at least entertaining.

This post is licensed under CC BY 4.0 by the author.