    Automate Amazon EKS troubleshooting using an Amazon Bedrock agentic workflow

    By Oliver Chambers, April 20, 2025 (12 min read)


    As organizations scale their Amazon Elastic Kubernetes Service (Amazon EKS) deployments, platform administrators face increasing challenges in efficiently managing multi-tenant clusters. Tasks such as investigating pod failures, addressing resource constraints, and resolving misconfigurations can consume significant time and effort. Instead of spending valuable engineering hours manually parsing logs, tracking metrics, and implementing fixes, teams should focus on driving innovation. Now, with the power of generative AI, you can transform your Kubernetes operations. By implementing intelligent cluster monitoring, pattern analysis, and automated remediation, you can dramatically reduce both mean time to identify (MTTI) and mean time to resolve (MTTR) for common cluster issues.

    At AWS re:Invent 2024, we announced the multi-agent collaboration capability for Amazon Bedrock (preview). With multi-agent collaboration, you can build, deploy, and manage multiple AI agents working together on complex multistep tasks that require specialized skills. Because troubleshooting an EKS cluster involves deriving insights from multiple observability signals and applying fixes using a continuous integration and deployment (CI/CD) pipeline, a multi-agent workflow can help an operations team streamline the management of EKS clusters. The workflow manager agent can integrate with individual agents that interface with individual observability signals and a CI/CD workflow to orchestrate and perform tasks based on the user prompt.

    In this post, we demonstrate how to orchestrate multiple Amazon Bedrock agents to create a sophisticated Amazon EKS troubleshooting system. By enabling collaboration between specialized agents (deriving insights from K8sGPT and performing actions through the ArgoCD framework), you can build a comprehensive automation that identifies, analyzes, and resolves cluster issues with minimal human intervention.

    Solution overview

    The architecture consists of the following core components:

    • Amazon Bedrock collaborator agent – Orchestrates the workflow and maintains context while routing user prompts to specialized agents, managing multistep operations and agent interactions
    • Amazon Bedrock agent for K8sGPT – Evaluates cluster and pod events through K8sGPT’s Analyze API for security issues, misconfigurations, and performance problems, providing remediation suggestions in natural language
    • Amazon Bedrock agent for ArgoCD – Manages GitOps-based remediation through ArgoCD, handling rollbacks, resource optimization, and configuration updates

    The following diagram illustrates the solution architecture.

    Prerequisites

    You need to have the following prerequisites in place:

    Set up the Amazon EKS cluster with K8sGPT and ArgoCD

    We start by installing and configuring the K8sGPT operator and ArgoCD controller on the EKS cluster.

    The K8sGPT operator enables AI-powered analysis and troubleshooting of cluster issues. For example, it can automatically detect and suggest fixes for misconfigured deployments, such as identifying and resolving resource constraint problems in pods.

    ArgoCD is a declarative GitOps continuous delivery tool for Kubernetes that automates the deployment of applications by keeping the desired application state in sync with what’s defined in a Git repository.

    The Amazon Bedrock agent serves as the intelligent decision-maker in our architecture, analyzing cluster issues detected by K8sGPT. After the root cause is identified, the agent orchestrates corrective actions through ArgoCD’s GitOps engine. This integration means that when problems are detected (whether it’s a misconfigured deployment, resource constraints, or a scaling issue), the agent can automatically integrate with ArgoCD to provide the necessary fixes. ArgoCD then picks up these changes and synchronizes them with your EKS cluster, creating a self-healing infrastructure.

    1. Create the required namespaces in Amazon EKS:
      kubectl create ns helm-guestbook
      kubectl create ns k8sgpt-operator-system
    2. Add the k8sgpt Helm repository and install the operator:
      helm repo add k8sgpt https://charts.k8sgpt.ai/
      helm repo update
      helm install k8sgpt-operator k8sgpt/k8sgpt-operator \
        --namespace k8sgpt-operator-system
    3. You can verify the installation by entering the following command:
      kubectl get pods -n k8sgpt-operator-system
      
      NAME                                                          READY   STATUS    RESTARTS  AGE
      release-k8sgpt-operator-controller-manager-5b749ffd7f-7sgnd   2/2     Running   0         1d
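The manual check in step 3 can also be scripted. The following is a minimal sketch and not part of the original solution: it parses the plain-text table printed by `kubectl get pods` and lists any pod that is not fully ready and in the Running state. The function names `unhealthy_pods` and `check_namespace` are hypothetical, and the sketch assumes the default column order (NAME, READY, STATUS, ...) shown in the sample output.

```python
import subprocess


def unhealthy_pods(kubectl_output: str) -> list[str]:
    """Return the names of pods that are not Running or not fully ready.

    Assumes the default `kubectl get pods` table: NAME READY STATUS ...
    """
    bad = []
    lines = [ln for ln in kubectl_output.splitlines() if ln.strip()]
    for line in lines[1:]:  # skip the header row
        fields = line.split()
        name, ready, status = fields[0], fields[1], fields[2]
        ready_count, total = ready.split("/")
        if status != "Running" or ready_count != total:
            bad.append(name)
    return bad


def check_namespace(namespace: str) -> list[str]:
    """Run kubectl against the current context and report unhealthy pods.

    Requires kubectl to be installed and configured for the EKS cluster.
    """
    result = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace],
        check=True, capture_output=True, text=True,
    )
    return unhealthy_pods(result.stdout)
```

Calling `check_namespace("k8sgpt-operator-system")` after the Helm install should return an empty list once the operator pod is up. The same helper can be reused for the `argocd` namespace later in the walkthrough.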
      

    After the operator is deployed, you can configure a K8sGPT resource. This custom resource holds the large language model (LLM) configuration used for AI-powered analysis and troubleshooting of cluster issues. K8sGPT supports various backends for this analysis. For this post, we use Amazon Bedrock as the backend and Anthropic’s Claude V3 as the LLM.

    1. Create the pod identity association that gives the K8sGPT service account on the EKS cluster access to Amazon Bedrock:
      eksctl create podidentityassociation \
        --cluster PetSite \
        --namespace k8sgpt-operator-system \
        --service-account-name k8sgpt \
        --role-name k8sgpt-app-eks-pod-identity-role \
        --permission-policy-arns arn:aws:iam::aws:policy/AmazonBedrockFullAccess \
        --region $AWS_REGION
    2. Configure the K8sGPT custom resource:
      cat << EOF > k8sgpt.yaml
      apiVersion: core.k8sgpt.ai/v1alpha1
      kind: K8sGPT
      metadata:
        name: k8sgpt-bedrock
        namespace: k8sgpt-operator-system
      spec:
        ai:
          enabled: true
          model: anthropic.claude-v3
          backend: amazonbedrock
          region: us-east-1
          credentials:
            secretRef:
              name: k8sgpt-secret
              namespace: k8sgpt-operator-system
        noCache: false
        repository: ghcr.io/k8sgpt-ai/k8sgpt
        version: v0.3.48
      EOF
      
      kubectl apply -f k8sgpt.yaml
      
    3. Validate the settings to confirm the k8sgpt-bedrock pod is running successfully:
      kubectl get pods -n k8sgpt-operator-system
      NAME                                                          READY   STATUS    RESTARTS      AGE
      k8sgpt-bedrock-5b655cbb9b-sn897                               1/1     Running   9 (22d ago)   22d
      release-k8sgpt-operator-controller-manager-5b749ffd7f-7sgnd   2/2     Running   3 (10h ago)   22d
      
    4. Now you can configure the ArgoCD controller:
      helm repo add argo https://argoproj.github.io/argo-helm
      helm repo update
      kubectl create namespace argocd
      helm install argocd argo/argo-cd \
        --namespace argocd \
        --create-namespace
    5. Verify the ArgoCD installation:
      kubectl get pods -n argocd
      NAME                                                READY   STATUS    RESTARTS   AGE
      argocd-application-controller-0                     1/1     Running   0          43d
      argocd-applicationset-controller-5c787df94f-7jpvp   1/1     Running   0          43d
      argocd-dex-server-55d5769f46-58dwx                  1/1     Running   0          43d
      argocd-notifications-controller-7ccbd7fb6-9pptz     1/1     Running   0          43d
      argocd-redis-587d59bbc-rndkp                        1/1     Running   0          43d
      argocd-repo-server-76f6c7686b-rhjkg                 1/1     Running   0          43d
      argocd-server-64fcc786c-bd2t8                       1/1     Running   0          43d
    6. Patch the argocd-server service to use an external load balancer:
      kubectl patch svc argocd-server -n argocd -p '{"spec": {"type": "LoadBalancer"}}'
    7. You can now access the ArgoCD UI through the following load balancer endpoint, using the credentials for the admin user:
      kubectl get svc argocd-server -n argocd
      NAME            TYPE           CLUSTER-IP       EXTERNAL-IP                                                              PORT(S)                      AGE
      argocd-server   LoadBalancer   10.100.168.229   a91a6fd4292ed420d92a1a5c748f43bc-653186012.us-east-1.elb.amazonaws.com   80:32334/TCP,443:32261/TCP   43d
    8. Retrieve the credentials for the ArgoCD UI:
      export argocdpassword=$(kubectl -n argocd get secret argocd-initial-admin-secret \
        -o jsonpath="{.data.password}" | base64 -d)
      
      echo "ArgoCD admin password - $argocdpassword"
    9. Push the credentials to AWS Secrets Manager:
      aws secretsmanager create-secret \
        --name argocdcreds \
        --description "Credentials for argocd" \
        --secret-string "{\"USERNAME\":\"admin\",\"PASSWORD\":\"$argocdpassword\"}"
    10. Configure a sample application in ArgoCD:
      cat << EOF > argocd-application.yaml
      apiVersion: argoproj.io/v1alpha1
      kind: Application
      metadata:
        name: helm-guestbook
        namespace: argocd
      spec:
        project: default
        source:
          repoURL: https://github.com/awsvikram/argocd-example-apps
          targetRevision: HEAD
          path: helm-guestbook
        destination:
          server: https://kubernetes.default.svc
          namespace: helm-guestbook
        syncPolicy:
          automated:
            prune: true
            selfHeal: true
      EOF
    11. Apply the configuration and verify it from the ArgoCD UI by logging in as the admin user:
      kubectl apply -f argocd-application.yaml

      ArgoCD Application

    12. It takes some time for K8sGPT to analyze the newly created pods. To trigger the analysis immediately, restart the deployments in the k8sgpt-operator-system namespace by entering the following command:
      kubectl -n k8sgpt-operator-system rollout restart deploy
      
      deployment.apps/k8sgpt-bedrock restarted
      deployment.apps/k8sgpt-operator-controller-manager restarted
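Step 9 embeds the password in a hand-written JSON string, which breaks if the password contains a double quote or a backslash. As a hedged alternative, the sketch below builds the same `USERNAME`/`PASSWORD` payload with `json.dumps`, which escapes such characters. The function names are hypothetical; `store_argocd_secret` expects a boto3 Secrets Manager client supplied by the caller and reuses the `argocdcreds` name from step 9.

```python
import json


def build_argocd_secret(password: str, username: str = "admin") -> str:
    """Serialize the ArgoCD credentials into the JSON shape used in step 9."""
    return json.dumps({"USERNAME": username, "PASSWORD": password})


def store_argocd_secret(client, password: str, name: str = "argocdcreds") -> None:
    """Create the secret with a Secrets Manager client passed in by the caller.

    `client` is assumed to be boto3.client("secretsmanager") or a compatible stub.
    """
    client.create_secret(
        Name=name,
        Description="Credentials for argocd",
        SecretString=build_argocd_secret(password),
    )
```

With boto3 installed and AWS credentials configured, the call would look like `store_argocd_secret(boto3.client("secretsmanager"), password)`. Passing the client in keeps the helper easy to exercise with a stub before touching a real account.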

    Set up the Amazon Bedrock agents for K8sGPT and ArgoCD

    We use a CloudFormation stack to deploy the individual agents into the US East (N. Virginia) Region. When you deploy the CloudFormation template, you deploy several resources (costs will be incurred for the AWS resources used).

    Use the following parameters for the CloudFormation template:

    The stack creates the following AWS Lambda functions:

    • -LambdaK8sGPTAgent-
    • -RestartRollBackApplicationArgoCD-
    • -ArgocdIncreaseMemory-

    The stack creates the following Amazon Bedrock agents:

    • ArgoCDAgent, with the following action groups:
      1. argocd-rollback
      2. argocd-restart
      3. argocd-memory-management
    • K8sGPTAgent, with the following action group:
      1. k8s-cluster-operations

    The stack also creates a collaborator agent, with the following agents associated to it:

    1. ArgoCDAgent
    2. K8sGPTAgent

    The stack outputs the following:

    • LambdaK8sGPTAgentRole, the AWS Identity and Access Management (IAM) role Amazon Resource Name (ARN) associated with the Lambda function handling interactions with the K8sGPT agent on the EKS cluster. This role ARN is needed at a later stage of the configuration process.
    • K8sGPTAgentAliasId, ID of the K8sGPT Amazon Bedrock agent alias
    • ArgoCDAgentAliasId, ID of the ArgoCD Amazon Bedrock agent alias
    • CollaboratorAgentAliasId, ID of the collaborator Amazon Bedrock agent alias
    Assign appropriate permissions to enable the K8sGPT Amazon Bedrock agent to access the EKS cluster

    To enable the K8sGPT Amazon Bedrock agent to access the EKS cluster, you need to configure the appropriate IAM permissions using the Amazon EKS access management APIs. This is a two-step process: first, you create an access entry for the Lambda function’s execution role (which you can find in the CloudFormation template output section), and then you associate the AmazonEKSViewPolicy to grant read-only access to the cluster. This configuration makes sure that the K8sGPT agent has the necessary permissions to monitor and analyze the EKS cluster resources while maintaining the principle of least privilege.

    1. Create an access entry for the Lambda function’s execution role:
      export CFN_STACK_NAME=EKS-Troubleshooter
      export EKS_CLUSTER=PetSite
      
      export K8SGPT_LAMBDA_ROLE=$(aws cloudformation describe-stacks --stack-name $CFN_STACK_NAME --query "Stacks[0].Outputs[?OutputKey=='LambdaK8sGPTAgentRole'].OutputValue" --output text)
      
      aws eks create-access-entry \
          --cluster-name $EKS_CLUSTER \
          --principal-arn $K8SGPT_LAMBDA_ROLE
    2. Associate the EKS view policy with the access entry:
      aws eks associate-access-policy \
          --cluster-name $EKS_CLUSTER \
          --principal-arn $K8SGPT_LAMBDA_ROLE \
          --policy-arn arn:aws:eks::aws:cluster-access-policy/AmazonEKSViewPolicy \
          --access-scope type=cluster
    3. Verify the Amazon Bedrock agents. The CloudFormation template adds all three required agents. To view the agents, on the Amazon Bedrock console, under Builder tools in the navigation pane, select Agents, as shown in the following screenshot.
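The two-step access configuration described in this section can also be expressed with the AWS SDK for Python. This is a minimal sketch under stated assumptions, not the method used by the original solution: the function name `grant_read_only_access` is hypothetical, and `eks_client` is assumed to be `boto3.client("eks")` or a compatible stub. It attaches the read-only AmazonEKSViewPolicy at cluster scope, in line with the least-privilege goal stated above.

```python
VIEW_POLICY_ARN = "arn:aws:eks::aws:cluster-access-policy/AmazonEKSViewPolicy"


def grant_read_only_access(eks_client, cluster_name: str, role_arn: str) -> None:
    """Create an access entry for the role, then attach the cluster-wide view policy."""
    # Step 1: register the IAM role as a principal on the cluster.
    eks_client.create_access_entry(
        clusterName=cluster_name,
        principalArn=role_arn,
    )
    # Step 2: grant read-only access across the whole cluster.
    eks_client.associate_access_policy(
        clusterName=cluster_name,
        principalArn=role_arn,
        policyArn=VIEW_POLICY_ARN,
        accessScope={"type": "cluster"},
    )
```

For the cluster used in this post, the call would be `grant_read_only_access(boto3.client("eks"), "PetSite", role_arn)`, where `role_arn` is the `LambdaK8sGPTAgentRole` stack output.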

    Bedrock agents

    Perform Amazon EKS troubleshooting using the Amazon Bedrock agentic workflow

    Now, test the solution. We explore the following two scenarios:

    1. The collaborator agent coordinates with the K8sGPT agent to provide insights into the root cause of a pod failure
    2. The collaborator agent coordinates with the ArgoCD agent to provide a response

    Collaborator agent coordinates with K8sGPT agent to provide insights into the root cause of a pod failure

    In this section, we examine a down alert for a sample application called memory-demo. We’re interested in the root cause of the issue. We use the following prompt: “We got a down alert for the memory-demo app. Help us with the root cause of the issue.”

    The agent not only stated the root cause, but went one step further and proposed a potential fix, which in this case is increasing the memory resources allocated to the application.
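The same prompt can also be sent programmatically. The sketch below is a hedged illustration using the `invoke_agent` operation of the Amazon Bedrock agent runtime; the function name `ask_agent` is hypothetical, and `client` is assumed to be `boto3.client("bedrock-agent-runtime")` or a compatible stub. The agent ID is not among the stack outputs listed earlier, so it would need to be looked up separately, for example on the Amazon Bedrock console.

```python
def ask_agent(client, agent_id: str, agent_alias_id: str,
              session_id: str, prompt: str) -> str:
    """Send one prompt to an Amazon Bedrock agent and return the full text reply."""
    response = client.invoke_agent(
        agentId=agent_id,
        agentAliasId=agent_alias_id,
        sessionId=session_id,  # reuse the same ID to keep context across turns
        inputText=prompt,
    )
    parts = []
    # The reply is streamed; only "chunk" events carry response text.
    for event in response["completion"]:
        chunk = event.get("chunk")
        if chunk:
            parts.append(chunk["bytes"].decode("utf-8"))
    return "".join(parts)
```

A session ID can be generated once with `str(uuid.uuid4())` and reused for follow-up prompts, which matters for the second scenario because it continues from this conversation.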

    K8sgpt agent finding

    Collaborator agent coordinates with ArgoCD agent to provide a response

    For this scenario, we continue from the previous prompt. We conclude that the application wasn’t given enough memory, and that the allocation should be increased to permanently fix the issue. We can also tell the application is in an unhealthy state in the ArgoCD UI, as shown in the following screenshot.

    ArgoUI

    Let’s now proceed to increase the memory, as shown in the following screenshot.

    Interacting with agent to increase memory

    The agent interacted with the argocd_operations Amazon Bedrock agent and was able to successfully increase the memory. The change can be confirmed in the ArgoCD UI.

    ArgoUI showing memory increase

    Cleanup

    If you decide to stop using the solution, complete the following steps:

    1. To delete the associated resources deployed using AWS CloudFormation:
      1. On the AWS CloudFormation console, choose Stacks in the navigation pane.
      2. Locate the stack you created during the deployment process (you assigned a name to it).
      3. Select the stack and choose Delete.
    2. Delete the EKS cluster if you created one specifically for this implementation.

    Conclusion

    By orchestrating multiple Amazon Bedrock agents, we’ve demonstrated how to build an AI-powered Amazon EKS troubleshooting system that simplifies Kubernetes operations. This integration of K8sGPT analysis and ArgoCD deployment automation shows what becomes possible when specialized AI agents are combined with existing DevOps tools. Although this solution represents an advancement in automated Kubernetes operations, it’s important to remember that human oversight remains valuable, particularly for complex scenarios and strategic decisions.

    As Amazon Bedrock and its agent capabilities continue to evolve, we can expect even more sophisticated orchestration possibilities. You can extend this solution to incorporate additional tools, metrics, and automation workflows to meet your organization’s specific needs.

    To learn more about Amazon Bedrock, refer to the following resources:


    About the authors

    Vikram Venkataraman is a Principal Specialist Solutions Architect at Amazon Web Services (AWS). He helps customers modernize, scale, and adopt best practices for their containerized workloads. With the emergence of generative AI, Vikram has been actively working with customers to leverage AWS’s AI/ML services to solve complex operational challenges, streamline monitoring workflows, and enhance incident response through intelligent automation.

    Puneeth Ranjan Komaragiri is a Principal Technical Account Manager at Amazon Web Services (AWS). He is particularly passionate about monitoring and observability, cloud financial management, and generative AI domains. In his current role, Puneeth enjoys collaborating closely with customers, leveraging his expertise to help them design and architect their cloud workloads for optimal scale and resilience.

    Sudheer Sangunni is a Senior Technical Account Manager at AWS Enterprise Support. With his extensive expertise in the AWS Cloud and big data, Sudheer plays a pivotal role in assisting customers with enhancing their monitoring and observability capabilities within AWS offerings.

    Vikrant Choudhary is a Senior Technical Account Manager at Amazon Web Services (AWS), specializing in healthcare and life sciences. With over 15 years of experience in cloud solutions and enterprise architecture, he helps businesses accelerate their digital transformation initiatives. In his current role, Vikrant partners with customers to architect and implement innovative solutions, from cloud migrations and application modernization to emerging technologies such as generative AI, driving successful business outcomes through cloud adoption.
