GPUs are a precious resource; they are both in short supply and much more costly than traditional CPUs. They are also highly adaptable to many different use cases. Organizations building or adopting generative AI use GPUs to run simulations, run inference (for internal or external usage), build agentic workloads, and run data scientists' experiments. The workloads range from ephemeral single-GPU experiments run by scientists to long multi-node continuous pre-training runs. Many organizations need to share a centralized, high-performance GPU computing infrastructure across different teams, business units, or accounts within their organization. With this infrastructure, they can maximize the utilization of expensive accelerated computing resources like GPUs, rather than having siloed infrastructure that might be underutilized. Organizations also use multiple AWS accounts for their users. Larger enterprises might want to separate different business units, teams, or environments (production, staging, development) into different AWS accounts. This provides more granular control and isolation between these different parts of the organization. It also makes it straightforward to track and allocate cloud costs to the appropriate teams or business units for better financial oversight.
The specific reasons and setup can vary depending on the size, structure, and requirements of the enterprise. But in general, a multi-account strategy provides greater flexibility, security, and manageability for large-scale cloud deployments. In this post, we discuss how an enterprise with multiple accounts can access a shared Amazon SageMaker HyperPod cluster for running their heterogeneous workloads. We use SageMaker HyperPod task governance to enable this feature.
Solution overview
SageMaker HyperPod task governance streamlines resource allocation and gives cluster administrators the capability to set up policies that maximize compute utilization in a cluster. Task governance can be used to create distinct teams, each with its own unique namespace, compute quotas, and borrowing limits. In a multi-account setting, you can restrict which accounts have access to which team's compute quota by using role-based access control.
In this post, we describe the settings required to set up multi-account access for SageMaker HyperPod clusters orchestrated by Amazon Elastic Kubernetes Service (Amazon EKS) and how to use SageMaker HyperPod task governance to allocate accelerated compute to multiple teams in different accounts.
The following diagram illustrates the solution architecture.
In this architecture, one organization splits resources across a few accounts. Account A hosts the SageMaker HyperPod cluster. Account B is where the data scientists reside. Account C is where the data is prepared and stored for training usage. In the following sections, we demonstrate how to set up multi-account access so that data scientists in Account B can train a model on Account A's SageMaker HyperPod and EKS cluster, using the preprocessed data stored in Account C. We break down this setup in two sections: cross-account access for data scientists and cross-account access for prepared data.
Cross-account access for data scientists
When you create a compute allocation with SageMaker HyperPod task governance, your EKS cluster creates a unique Kubernetes namespace per team. For this walkthrough, we create one AWS Identity and Access Management (IAM) role per team, referred to as cluster access roles, which are scoped to access only the team's task governance-generated namespace in the shared EKS cluster. Role-based access control is how we make sure that the data science members of Team A can't submit tasks on behalf of Team B.
To access Account A's EKS cluster as a user in Account B, you need to assume a cluster access role in Account A. The cluster access role has only the permissions data scientists need to access the EKS cluster. For an example of IAM roles for data scientists using SageMaker HyperPod, see IAM users for scientists.
Next, you need to assume the cluster access role from a role in Account B. The cluster access role in Account A then needs a trust policy for the data scientist role in Account B. The data scientist role is the role in Account B that is used to assume the cluster access role in Account A. The following is an example of the policy statement for the data scientist role so that it can assume the cluster access role in Account A:
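A minimal sketch of that policy statement, assuming a cluster access role named `hyperpod-cluster-access-role` in Account A (the role name and account ID placeholder are illustrative):

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "sts:AssumeRole",
            "Resource": "arn:aws:iam::<account-a-id>:role/hyperpod-cluster-access-role"
        }
    ]
}
```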
The following is an example of the trust policy for the cluster access role so that it allows the data scientist role to assume it:
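A minimal sketch of that trust policy, assuming a data scientist role named `data-scientist-role` in Account B (the role name and account ID placeholder are illustrative):

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::<account-b-id>:role/data-scientist-role"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}
```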
The final step is to create an access entry for the team's cluster access role in the EKS cluster. This access entry should also have an access policy, such as AmazonEKSEditPolicy, that is scoped to the namespace of the team. This makes sure that Team A users in Account B can't launch tasks outside of their assigned namespace. You can also optionally set up custom role-based access control; see Setting up Kubernetes role-based access control for more information.
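This step can be sketched with the AWS CLI, assuming an EKS cluster named `hyperpod-eks-cluster` and a task governance namespace `hyperpod-ns-team-a` (the cluster, role, and namespace names are illustrative):

```shell
# Create an access entry for Team A's cluster access role
aws eks create-access-entry \
  --cluster-name hyperpod-eks-cluster \
  --principal-arn arn:aws:iam::<account-a-id>:role/hyperpod-cluster-access-role

# Associate the edit policy, scoped to the team's namespace only
aws eks associate-access-policy \
  --cluster-name hyperpod-eks-cluster \
  --principal-arn arn:aws:iam::<account-a-id>:role/hyperpod-cluster-access-role \
  --policy-arn arn:aws:eks::aws:cluster-access-policy/AmazonEKSEditPolicy \
  --access-scope type=namespace,namespaces=hyperpod-ns-team-a
```

Scoping the access policy to the namespace, rather than the cluster, is what keeps each team inside its task governance-generated namespace.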
For users in Account B, you can repeat the same setup for each team. You must create a unique cluster access role for each team to align the access role for the team with its associated namespace. To summarize, we use two different IAM roles:
- Data scientist role – The role in Account B used to assume the cluster access role in Account A. This role just needs to be able to assume the cluster access role.
- Cluster access role – The role in Account A used to grant access to the EKS cluster. For an example, see IAM role for SageMaker HyperPod.
Cross-account access to prepared data
In this section, we demonstrate how to set up EKS Pod Identity and S3 Access Points so that pods running training tasks in Account A's EKS cluster have access to data stored in Account C. EKS Pod Identity allows you to map an IAM role to a service account in a namespace. If a pod uses the service account that has this association, Amazon EKS sets the environment variables in the containers of the pod.
S3 Access Points are named network endpoints that simplify data access for shared datasets in S3 buckets. They act as a way to grant fine-grained access control to specific users or applications accessing a shared dataset within an S3 bucket, without requiring those users or applications to have full access to the entire bucket. Permissions to the access point are granted through S3 access point policies. Each S3 Access Point is configured with an access policy specific to a use case or application. Because the HyperPod cluster in this post can be used by multiple teams, each team could have its own S3 access point and access point policy.
Before following these steps, make sure you have the EKS Pod Identity Agent add-on installed on your EKS cluster.
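If the add-on isn't installed yet, it can be added with the AWS CLI (the cluster name is an illustrative placeholder):

```shell
# Install the EKS Pod Identity Agent add-on on the cluster
aws eks create-addon \
  --cluster-name hyperpod-eks-cluster \
  --addon-name eks-pod-identity-agent

# Verify the add-on status
aws eks describe-addon \
  --cluster-name hyperpod-eks-cluster \
  --addon-name eks-pod-identity-agent
```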
- In Account A, create an IAM role that contains S3 permissions (such as s3:ListBucket and s3:GetObject on the access point resource) and has a trust relationship with Pod Identity; this will be your data access role. The following is an example of a trust policy:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowEksAuthToAssumeRoleForPodIdentity",
            "Effect": "Allow",
            "Principal": {
                "Service": "pods.eks.amazonaws.com"
            },
            "Action": [
                "sts:AssumeRole",
                "sts:TagSession"
            ]
        }
    ]
}
- In Account C, create an S3 access point by following the steps here.
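Creating the access point can also be sketched with the AWS CLI, assuming a bucket and an access point name chosen for Team A (all names are illustrative placeholders):

```shell
# In Account C: create an S3 access point for the training data bucket
aws s3control create-access-point \
  --account-id <account-c-id> \
  --name team-a-access-point \
  --bucket <training-data-bucket>
```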
- Next, configure your S3 access point to allow access to the role created in step 1. The following is an example access point policy that gives Account A permission to the access point in Account C:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::<account-a-id>:role/<data-access-role>"
            },
            "Action": [
                "s3:ListBucket",
                "s3:GetObject"
            ],
            "Resource": [
                "arn:aws:s3:<region>:<account-c-id>:accesspoint/<access-point-name>",
                "arn:aws:s3:<region>:<account-c-id>:accesspoint/<access-point-name>/object/*"
            ]
        }
    ]
}
- Make sure your S3 bucket policy is updated to allow Account A access. The following is an example S3 bucket policy:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": "*",
            "Action": [
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::<bucket-name>",
                "arn:aws:s3:::<bucket-name>/*"
            ],
            "Condition": {
                "StringEquals": {
                    "s3:DataAccessPointAccount": "<account-c-id>"
                }
            }
        }
    ]
}
- In Account A, create a pod identity association for your EKS cluster using the AWS CLI.
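A sketch of the association, assuming the team namespace and a service account named `team-a-service-account` (names are illustrative placeholders):

```shell
# Map the data access role to Team A's service account via EKS Pod Identity
aws eks create-pod-identity-association \
  --cluster-name hyperpod-eks-cluster \
  --namespace hyperpod-ns-team-a \
  --service-account team-a-service-account \
  --role-arn arn:aws:iam::<account-a-id>:role/<data-access-role>
```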
- Pods accessing cross-account S3 buckets need the service account name referenced in their pod specification.
You can test cross-account data access by spinning up a test pod and then executing into the pod to run Amazon S3 commands:
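A minimal test sketch, reusing the namespace, service account, and access point placeholders from the earlier steps:

```shell
# Launch a test pod that uses Team A's service account
kubectl run s3-test \
  --namespace hyperpod-ns-team-a \
  --image=amazon/aws-cli \
  --overrides='{"spec": {"serviceAccountName": "team-a-service-account"}}' \
  --command -- sleep 3600

# List the shared dataset through the access point ARN from inside the pod
kubectl exec -n hyperpod-ns-team-a s3-test -- \
  aws s3 ls arn:aws:s3:<region>:<account-c-id>:accesspoint/<access-point-name>/
```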
This example shows creating a single data access role for a single team. For multiple teams, use a namespace-specific ServiceAccount with its own data access role to help prevent overlapping resource access across teams. You can also configure cross-account Amazon S3 access for an Amazon FSx for Lustre file system in Account A, as described in Use Amazon FSx for Lustre to share Amazon S3 data across accounts. FSx for Lustre and Amazon S3 need to be in the same AWS Region, and the FSx for Lustre file system needs to be in the same Availability Zone as your SageMaker HyperPod cluster.
Conclusion
In this post, we provided guidance on how to set up cross-account access for data scientists accessing a centralized SageMaker HyperPod cluster orchestrated by Amazon EKS. In addition, we covered how to provide Amazon S3 data access from one account to an EKS cluster in another account. With SageMaker HyperPod task governance, you can restrict access and compute allocation to specific teams. This architecture can be used at scale by organizations wanting to share a large compute cluster across accounts within their organization. To get started with SageMaker HyperPod task governance, refer to the Amazon EKS Support in Amazon SageMaker HyperPod workshop and the SageMaker HyperPod task governance documentation.
About the Authors
Nisha Nadkarni is a Senior GenAI Specialist Solutions Architect at AWS, where she guides companies through best practices when deploying large scale distributed training and inference on AWS. Prior to her current role, she spent several years at AWS focused on helping emerging GenAI startups develop models from ideation to production.
Anoop Saha is a Sr GTM Specialist at Amazon Web Services (AWS) focusing on generative AI model training and inference. He partners with top frontier model builders, strategic customers, and AWS service teams to enable distributed training and inference at scale on AWS and lead joint GTM motions. Before AWS, Anoop held several leadership roles at startups and large corporations, primarily focusing on silicon and system architecture of AI infrastructure.
Kareem Syed-Mohammed is a Product Manager at AWS. He is focused on compute optimization and cost governance. Prior to this, at Amazon QuickSight, he led embedded analytics and developer experience. In addition to QuickSight, he has been with AWS Marketplace and Amazon retail as a Product Manager. Kareem started his career as a developer for call center technologies, Local Expert and Ads for Expedia, and management consultant at McKinsey.
Rajesh Ramchander is a Principal ML Engineer in Professional Services at AWS. He helps customers at various stages of their AI/ML and GenAI journey, from those that are just getting started all the way to those that are leading their business with an AI-first strategy.