Automate Amazon EKS troubleshooting using an Amazon Bedrock agentic workflow

As organizations scale their Amazon Elastic Kubernetes Service (Amazon EKS) deployments, platform administrators face growing challenges in efficiently managing multi-tenant clusters. Tasks such as investigating pod failures, addressing resource constraints, and resolving misconfigurations can consume significant time and effort. Instead of spending valuable engineering hours manually parsing logs, tracking metrics, and implementing fixes, teams should focus on driving innovation. Now, with the power of generative AI, you can transform your Kubernetes operations. By implementing intelligent cluster monitoring, pattern analysis, and automated remediation, you can dramatically reduce both mean time to identify (MTTI) and mean time to resolve (MTTR) for common cluster issues.

At AWS re:Invent 2024, we announced the multi-agent collaboration capability for Amazon Bedrock (preview). With multi-agent collaboration, you can build, deploy, and manage multiple AI agents working together on complex multistep tasks that require specialized skills. Because troubleshooting an EKS cluster involves deriving insights from multiple observability signals and applying fixes using a continuous integration and deployment (CI/CD) pipeline, a multi-agent workflow can help an operations team streamline the management of EKS clusters. The workflow manager agent can integrate with individual agents that interface with individual observability signals and a CI/CD workflow to orchestrate and perform tasks based on the user prompt.

In this post, we demonstrate how to orchestrate multiple Amazon Bedrock agents to create a sophisticated Amazon EKS troubleshooting system. By enabling collaboration between specialized agents (deriving insights from K8sGPT and performing actions through the ArgoCD framework), you can build a comprehensive automation that identifies, analyzes, and resolves cluster issues with minimal human intervention.

Solution overview

The architecture consists of the following core components:

Amazon Bedrock collaborator agent – Orchestrates the workflow and maintains context while routing user prompts to specialized agents, managing multistep operations and agent interactions
Amazon Bedrock agent for K8sGPT – Evaluates cluster and pod events through K8sGPT’s Analyze API for security issues, misconfigurations, and performance problems, providing remediation suggestions in natural language
Amazon Bedrock agent for ArgoCD – Manages GitOps-based remediation through ArgoCD, handling rollbacks, resource optimization, and configuration updates

The following diagram illustrates the solution architecture.

Prerequisites

You need to have the following prerequisites in place:

Set up the Amazon EKS cluster with K8sGPT and ArgoCD

We start by installing and configuring the K8sGPT operator and ArgoCD controller on the EKS cluster.

The K8sGPT operator enables AI-powered analysis and troubleshooting of cluster issues. For example, it can automatically detect and suggest fixes for misconfigured deployments, such as identifying and resolving resource constraint problems in pods.

ArgoCD is a declarative GitOps continuous delivery tool for Kubernetes that automates the deployment of applications by keeping the desired application state in sync with what’s defined in a Git repository.

The Amazon Bedrock agent serves as the intelligent decision-maker in our architecture, analyzing cluster issues detected by K8sGPT. After the root cause is identified, the agent orchestrates corrective actions through ArgoCD’s GitOps engine. This integration means that when problems are detected (whether a misconfigured deployment, a resource constraint, or a scaling issue), the agent can automatically work with ArgoCD to provide the necessary fixes. ArgoCD then picks up these changes and synchronizes them with your EKS cluster, creating a self-healing infrastructure.

Create the necessary namespaces in Amazon EKS:

kubectl create ns helm-guestbook
kubectl create ns k8sgpt-operator-system

Add the k8sgpt Helm repository and install the operator:

helm repo add k8sgpt https://charts.k8sgpt.ai/
helm repo update
helm install k8sgpt-operator k8sgpt/k8sgpt-operator \
  --namespace k8sgpt-operator-system

You can verify the installation by entering the following command:

kubectl get pods -n k8sgpt-operator-system

NAME                                                          READY   STATUS    RESTARTS  AGE
release-k8sgpt-operator-controller-manager-5b749ffd7f-7sgnd   2/2     Running   0         1d
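In a scripted setup, the visual check above can be turned into a pass/fail test. The following is a minimal sketch rather than part of the original walkthrough: the function name pods_ready is hypothetical, and it assumes the default column layout of kubectl get pods (NAME, READY, STATUS).

```shell
# Hypothetical helper: succeed only if the namespace has at least one pod
# and every pod is Running with all of its containers ready.
pods_ready() {
  kubectl get pods -n "$1" --no-headers | awk '
    {
      split($2, ready, "/")   # READY column, for example 2/2
      total++
      if ($3 != "Running" || ready[1] != ready[2]) bad++
    }
    END { if (total == 0 || bad > 0) exit 1 }'
}
```

For example, `pods_ready k8sgpt-operator-system && echo "operator is ready"` prints the message only when every pod in the namespace is up. The same helper can be reused for the argocd namespace later in this post.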

After the operator is deployed, you can configure a K8sGPT resource. This custom resource, defined by the operator’s Custom Resource Definition (CRD), holds the large language model (LLM) configuration that assists in AI-powered analysis and troubleshooting of cluster issues. K8sGPT supports various backends for AI-powered analysis. For this post, we use Amazon Bedrock as the backend and Anthropic’s Claude V3 as the LLM.

You need to create the pod identity to give the EKS cluster access to other AWS services, including Amazon Bedrock:

eksctl create podidentityassociation --cluster PetSite --namespace k8sgpt-operator-system --service-account-name k8sgpt --role-name k8sgpt-app-eks-pod-identity-role --permission-policy-arns arn:aws:iam::aws:policy/AmazonBedrockFullAccess --region $AWS_REGION

Configure the K8sGPT CRD:

cat << EOF > k8sgpt.yaml
apiVersion: core.k8sgpt.ai/v1alpha1
kind: K8sGPT
metadata:
  name: k8sgpt-bedrock
  namespace: k8sgpt-operator-system
spec:
  ai:
    enabled: true
    model: anthropic.claude-v3
    backend: amazonbedrock
    region: us-east-1
    credentials:
      secretRef:
        name: k8sgpt-secret
        namespace: k8sgpt-operator-system
  noCache: false
  repository: ghcr.io/k8sgpt-ai/k8sgpt
  version: v0.3.48
EOF

kubectl apply -f k8sgpt.yaml

Validate the settings to confirm the k8sgpt-bedrock pod is running successfully:

kubectl get pods -n k8sgpt-operator-system
NAME                                                          READY   STATUS    RESTARTS      AGE
k8sgpt-bedrock-5b655cbb9b-sn897                               1/1     Running   9 (22d ago)   22d
release-k8sgpt-operator-controller-manager-5b749ffd7f-7sgnd   2/2     Running   3 (10h ago)   22d

Now you can configure the ArgoCD controller:

helm repo add argo https://argoproj.github.io/argo-helm
helm repo update
kubectl create namespace argocd
helm install argocd argo/argo-cd \
  --namespace argocd \
  --create-namespace

Verify the ArgoCD installation:

kubectl get pods -n argocd
NAME                                                READY   STATUS    RESTARTS   AGE
argocd-application-controller-0                     1/1     Running   0          43d
argocd-applicationset-controller-5c787df94f-7jpvp   1/1     Running   0          43d
argocd-dex-server-55d5769f46-58dwx                  1/1     Running   0          43d
argocd-notifications-controller-7ccbd7fb6-9pptz     1/1     Running   0          43d
argocd-redis-587d59bbc-rndkp                        1/1     Running   0          43d
argocd-repo-server-76f6c7686b-rhjkg                 1/1     Running   0          43d
argocd-server-64fcc786c-bd2t8                       1/1     Running   0          43d

Patch the argocd service to have an external load balancer:

kubectl patch svc argocd-server -n argocd -p '{"spec": {"type": "LoadBalancer"}}'

You can now access the ArgoCD UI with the following load balancer endpoint and the credentials for the admin user:

kubectl get svc argocd-server -n argocd
NAME            TYPE           CLUSTER-IP       EXTERNAL-IP                                                              PORT(S)                      AGE
argocd-server   LoadBalancer   10.100.168.229   a91a6fd4292ed420d92a1a5c748f43bc-653186012.us-east-1.elb.amazonaws.com   80:32334/TCP,443:32261/TCP   43d
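If you want the endpoint in a script rather than reading it from the table, the hostname can be pulled out with a JSONPath query. This is a minimal sketch: the function name argocd_url is hypothetical, and it assumes the load balancer publishes a DNS hostname (as in the EXTERNAL-IP column above) rather than an IP address.

```shell
# Hypothetical helper: print the HTTPS URL of the argocd-server load balancer.
argocd_url() {
  host=$(kubectl get svc argocd-server -n argocd \
    -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
  # The field stays empty until the load balancer has been provisioned.
  [ -n "$host" ] || { echo "load balancer not provisioned yet" >&2; return 1; }
  printf 'https://%s\n' "$host"
}
```

The function returns a non-zero status while the hostname is still empty, so it can be retried in a loop until the endpoint is available.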

Retrieve the credentials for the ArgoCD UI:

export argocdpassword=`kubectl -n argocd get secret argocd-initial-admin-secret \
-o jsonpath="{.data.password}" | base64 -d`

echo ArgoCD admin password - $argocdpassword

Push the credentials to AWS Secrets Manager:

aws secretsmanager create-secret \
--name argocdcreds \
--description "Credentials for argocd" \
--secret-string "{\"USERNAME\":\"admin\",\"PASSWORD\":\"$argocdpassword\"}"
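Because the secret string is JSON embedded in a shell argument, hand-written quoting is easy to get wrong. The following sketch builds the same USERNAME/PASSWORD document with a small escaping helper. The function names are hypothetical, and only backslashes and double quotes are escaped, which is an assumption about the characters a password may contain (control characters are not handled).

```shell
# Hypothetical helper: escape a value for use inside a JSON string.
json_escape() {
  # Escape backslashes first, then double quotes.
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Hypothetical helper: build the secret document used in this post.
argocd_secret_json() {
  printf '{"USERNAME":"%s","PASSWORD":"%s"}' \
    "$(json_escape "$1")" "$(json_escape "$2")"
}
```

The result can be passed directly to the CLI, for example `--secret-string "$(argocd_secret_json admin "$argocdpassword")"`.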

Configure a sample application in ArgoCD:

cat << EOF > argocd-application.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: helm-guestbook
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/awsvikram/argocd-example-apps
    targetRevision: HEAD
    path: helm-guestbook
  destination:
    server: https://kubernetes.default.svc
    namespace: helm-guestbook
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
EOF

Apply the configuration and verify it from the ArgoCD UI by logging in as the admin user:
```
kubectl apply -f argocd-application.yaml
```
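The same verification can be done from the command line by reading the sync and health fields of the Application resource. This is a minimal sketch with hypothetical function names; it assumes the application lives in the argocd namespace, as in the manifest above.

```shell
# Hypothetical helper: print "<sync status> <health status>" for an ArgoCD Application.
app_status() {
  kubectl get applications.argoproj.io "$1" -n argocd \
    -o jsonpath='{.status.sync.status} {.status.health.status}'
}

# Hypothetical helper: succeed only when the application is Synced and Healthy.
app_healthy() {
  [ "$(app_status "$1")" = "Synced Healthy" ]
}
```

For example, `app_healthy helm-guestbook || echo "helm-guestbook needs attention"` flags an application that is out of sync or unhealthy.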
It takes some time for K8sGPT to analyze the newly created pods. To make that immediate, restart the pods created in the k8sgpt-operator-system namespace. The pods can be restarted by entering the following command:
```
kubectl -n k8sgpt-operator-system rollout restart deploy

deployment.apps/k8sgpt-bedrock restarted
deployment.apps/k8sgpt-operator-controller-manager restarted
```
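After the restart, you may want to confirm that K8sGPT has produced findings before involving the agents. The sketch below rests on an assumption that is not covered in this post: that the operator records each finding as a Result custom resource (results.core.k8sgpt.ai) in its own namespace, with spec.kind, spec.name, and spec.error[].text fields. The function names are hypothetical.

```shell
# Hypothetical helper: list findings as tab-separated "<kind> <name> <first error>".
# Assumes the operator stores findings as Result resources in its own namespace.
k8sgpt_findings() {
  kubectl get results.core.k8sgpt.ai -n k8sgpt-operator-system \
    -o jsonpath='{range .items[*]}{.spec.kind}{"\t"}{.spec.name}{"\t"}{.spec.error[0].text}{"\n"}{end}'
}

# Hypothetical helper: print only the findings for one resource kind, such as Pod.
k8sgpt_findings_for_kind() {
  k8sgpt_findings | awk -F '\t' -v kind="$1" '$1 == kind'
}
```

Running `k8sgpt_findings_for_kind Pod` narrows the output to pod-level problems. If the resource or field names differ in your operator version, adjust the JSONPath expression accordingly.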

Set up the Amazon Bedrock agents for K8sGPT and ArgoCD

We use a CloudFormation stack to deploy the individual agents into the US East (N. Virginia) Region. When you deploy the CloudFormation template, you deploy several resources (costs will be incurred for the AWS resources used).

Use the following parameters for the CloudFormation template:

The stack creates the following AWS Lambda functions:

-LambdaK8sGPTAgent-
-RestartRollBackApplicationArgoCD-
-ArgocdIncreaseMemory-

The stack creates the following Amazon Bedrock agents:

ArgoCDAgent, with the following action groups:
1. argocd-rollback
2. argocd-restart
3. argocd-memory-management

K8sGPTAgent, with the following action group:
1. k8s-cluster-operations

The collaborator agent, with the following agents associated with it:

ArgoCDAgent
K8sGPTAgent

The stack outputs the following:

LambdaK8sGPTAgentRole, the AWS Identity and Access Management (IAM) role Amazon Resource Name (ARN) associated with the Lambda function handling interactions with the K8sGPT agent on the EKS cluster. This role ARN will be needed at a later stage of the configuration process.
K8sGPTAgentAliasId, ID of the K8sGPT Amazon Bedrock agent alias
ArgoCDAgentAliasId, ID of the ArgoCD Amazon Bedrock agent alias
CollaboratorAgentAliasId, ID of the collaborator Amazon Bedrock agent alias
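Because several of these outputs are needed in later steps, it can help to read them with one small helper instead of copying them from the console. This is a minimal sketch; the function name stack_output is hypothetical, and the stack name is whatever you chose at deployment time (EKS-Troubleshooter is the name used in the commands in this post).

```shell
# Hypothetical helper: print a single output value of a CloudFormation stack.
# Usage: stack_output <stack-name> <output-key>
stack_output() {
  aws cloudformation describe-stacks --stack-name "$1" \
    --query "Stacks[0].Outputs[?OutputKey=='$2'].OutputValue" --output text
}
```

For example, `stack_output EKS-Troubleshooter CollaboratorAgentAliasId` prints the alias ID of the collaborator agent.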

Assign appropriate permissions to enable the K8sGPT Amazon Bedrock agent to access the EKS cluster

To enable the K8sGPT Amazon Bedrock agent to access the EKS cluster, you need to configure the appropriate IAM permissions using Amazon EKS access management APIs. This is a two-step process: first, you create an access entry for the Lambda function’s execution role (which you can find in the CloudFormation template output section), and then you associate the AmazonEKSViewPolicy to grant read-only access to the cluster. This configuration makes sure that the K8sGPT agent has the necessary permissions to monitor and analyze the EKS cluster resources while maintaining the principle of least privilege.

Create an access entry for the Lambda function’s execution role

export CFN_STACK_NAME=EKS-Troubleshooter
export EKS_CLUSTER=PetSite

export K8SGPT_LAMBDA_ROLE=`aws cloudformation describe-stacks --stack-name $CFN_STACK_NAME --query "Stacks[0].Outputs[?OutputKey=='LambdaK8sGPTAgentRole'].OutputValue" --output text`

aws eks create-access-entry \
    --cluster-name $EKS_CLUSTER \
    --principal-arn $K8SGPT_LAMBDA_ROLE

Associate the EKS view policy with the access entry

aws eks associate-access-policy \
    --cluster-name $EKS_CLUSTER \
    --principal-arn $K8SGPT_LAMBDA_ROLE \
    --policy-arn arn:aws:eks::aws:cluster-access-policy/AmazonEKSViewPolicy \
    --access-scope type=cluster
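To confirm that the Lambda role ended up with read-only access, you can list the policies associated with its access entry. The following is a minimal sketch with a hypothetical function name; it treats the association as correct only if the AmazonEKSViewPolicy ARN is present.

```shell
# Hypothetical helper: succeed only if the principal has the EKS view policy associated.
# Usage: has_view_policy <cluster-name> <principal-arn>
has_view_policy() {
  aws eks list-associated-access-policies \
    --cluster-name "$1" --principal-arn "$2" \
    --query 'associatedAccessPolicies[].policyArn' --output text |
    tr '\t' '\n' |
    grep -qx 'arn:aws:eks::aws:cluster-access-policy/AmazonEKSViewPolicy'
}
```

For example, `has_view_policy "$EKS_CLUSTER" "$K8SGPT_LAMBDA_ROLE" && echo "read-only access is in place"`.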

Verify the Amazon Bedrock agents. The CloudFormation template adds all three required agents. To view the agents, on the Amazon Bedrock console, under Builder tools in the navigation pane, select Agents, as shown in the following screenshot.

Perform Amazon EKS troubleshooting using the Amazon Bedrock agentic workflow

Now, test the solution. We explore the following two scenarios:

The agent coordinates with the K8sGPT agent to provide insights into the root cause of a pod failure
The collaborator agent coordinates with the ArgoCD agent to provide a response

Agent coordinates with K8sGPT agent to provide insights into the root cause of a pod failure

In this section, we examine a down alert for a sample application called memory-demo. We’re interested in the root cause of the issue. We use the following prompt: “We got a down alert for the memory-demo app. Help us with the root cause of the issue.”

The agent not only stated the root cause, but went one step further to suggest a potential fix for the error, which in this case is increasing the memory resources allocated to the application.

Collaborator agent coordinates with ArgoCD agent to provide a response

For this scenario, we continue from the previous prompt. We feel the application wasn’t provided enough memory, and it should be increased to permanently fix the issue. We can also tell the application is in an unhealthy state in the ArgoCD UI, as shown in the following screenshot.

Let’s now proceed to increase the memory, as shown in the following screenshot.

The agent interacted with the argocd_operations Amazon Bedrock agent and was able to successfully increase the memory. The same can be confirmed in the ArgoCD UI.

Cleanup

If you decide to stop using the solution, complete the following steps:

To delete the associated resources deployed using AWS CloudFormation:
1. On the AWS CloudFormation console, choose Stacks in the navigation pane.
2. Locate the stack you created during the deployment process (you assigned a name to it).
3. Select the stack and choose Delete.
Delete the EKS cluster if you created one specifically for this implementation.
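If you prefer the command line to the console, the stack deletion can be scripted. This is a minimal sketch with a hypothetical function name; it covers only the CloudFormation stack, not the EKS cluster or the resources created with kubectl, helm, and eksctl earlier in this post.

```shell
# Hypothetical helper: delete a CloudFormation stack and block until it is gone.
# Usage: delete_stack <stack-name>
delete_stack() {
  aws cloudformation delete-stack --stack-name "$1" &&
    aws cloudformation wait stack-delete-complete --stack-name "$1"
}
```

For example, `delete_stack "$CFN_STACK_NAME"` removes the stack created for the agents and returns once the deletion has completed.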

Conclusion

By orchestrating multiple Amazon Bedrock agents, we’ve demonstrated how to build an AI-powered Amazon EKS troubleshooting system that simplifies Kubernetes operations. This integration of K8sGPT analysis and ArgoCD deployment automation showcases the powerful possibilities when combining specialized AI agents with existing DevOps tools. Although this solution represents an advancement in automated Kubernetes operations, it’s important to remember that human oversight remains valuable, particularly for complex scenarios and strategic decisions.

As Amazon Bedrock and its agent capabilities continue to evolve, we can expect even more sophisticated orchestration possibilities. You can extend this solution to incorporate additional tools, metrics, and automation workflows to meet your organization’s specific needs.

To learn more about Amazon Bedrock, refer to the following resources:

About the authors

Vikram Venkataraman is a Principal Specialist Solutions Architect at Amazon Web Services (AWS). He helps customers modernize, scale, and adopt best practices for their containerized workloads. With the emergence of generative AI, Vikram has been actively working with customers to leverage AWS’s AI/ML services to solve complex operational challenges, streamline monitoring workflows, and enhance incident response through intelligent automation.

Puneeth Ranjan Komaragiri is a Principal Technical Account Manager at Amazon Web Services (AWS). He is particularly passionate about monitoring and observability, cloud financial management, and generative AI domains. In his current role, Puneeth enjoys collaborating closely with customers, leveraging his expertise to help them design and architect their cloud workloads for optimal scale and resilience.

Sudheer Sangunni is a Senior Technical Account Manager at AWS Enterprise Support. With his extensive expertise in the AWS Cloud and big data, Sudheer plays a pivotal role in assisting customers with enhancing their monitoring and observability capabilities within AWS offerings.

Vikrant Choudhary is a Senior Technical Account Manager at Amazon Web Services (AWS), specializing in healthcare and life sciences. With over 15 years of experience in cloud solutions and enterprise architecture, he helps businesses accelerate their digital transformation initiatives. In his current role, Vikrant partners with customers to architect and implement innovative solutions, from cloud migrations and application modernization to emerging technologies such as generative AI, driving successful business outcomes through cloud adoption.
