GPU Server Integration

Supported versions: EKS 1.31+, nodeadm 0.1+
Last updated: February 21, 2026

This document covers how to integrate on-premises GPU servers into EKS Hybrid Nodes.

Deploying the NVIDIA GPU Operator

The GPU Operator automatically deploys the components needed to manage NVIDIA GPUs in a Kubernetes cluster.

# Add the NVIDIA GPU Operator Helm repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Install the GPU Operator
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=false \
  --set toolkit.enabled=true \
  --set devicePlugin.enabled=true \
  --set migManager.enabled=false \
  --set dcgmExporter.enabled=true

Note: driver.enabled=false is set because the NVIDIA driver is already installed on the on-premises nodes.
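Since the chart is installed with driver.enabled=false, each on-premises node needs a working host driver before the operator is deployed. The following is a minimal pre-flight sketch to run on the node itself. The helper name check_host_driver is hypothetical, and the sketch assumes nvidia-smi was installed alongside the driver.

```shell
# Sketch: confirm a host-level NVIDIA driver before relying on driver.enabled=false.
# check_host_driver is a hypothetical helper name, not part of the GPU Operator.
check_host_driver() {
  if ! command -v nvidia-smi >/dev/null 2>&1; then
    echo "nvidia-smi not found: no host driver detected" >&2
    return 1
  fi
  # Print one line per GPU: "<driver version>, <GPU name>"
  nvidia-smi --query-gpu=driver_version,name --format=csv,noheader
}
```

Running check_host_driver on a GPU server should print one line per GPU. A non-zero exit status suggests the driver must be installed first, or that driver.enabled should be reconsidered.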

H100/H200 Server Integration

Verifying the Device Plugin Configuration

# Check the Device Plugin status on GPU nodes
kubectl get pods -n gpu-operator -l app=nvidia-device-plugin-daemonset

# Check GPU resources
kubectl describe node hybrid-gpu-node-001 | grep -A 10 "Allocatable:"
# Expected output:
# Allocatable:
#   cpu:                128
#   memory:             1024Gi
#   nvidia.com/gpu:     8
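For scripted checks, the same value can be read directly instead of grepping the describe output. The sketch below assumes the node name from the example above. gpu_allocatable and assert_gpu_count are hypothetical helper names.

```shell
# gpu_allocatable NODE: print the node's allocatable nvidia.com/gpu count.
gpu_allocatable() {
  # The dot in the resource name is escaped so JSONPath treats it as one key.
  kubectl get node "$1" -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
}

# assert_gpu_count NODE EXPECTED: succeed only if the count matches.
assert_gpu_count() {
  actual=$(gpu_allocatable "$1")
  if [ "$actual" = "$2" ]; then
    echo "OK: $1 exposes $actual GPU(s)"
  else
    echo "MISMATCH: $1 exposes '${actual:-0}' GPU(s), expected $2" >&2
    return 1
  fi
}
```

For example, assert_gpu_count hybrid-gpu-node-001 8 would fail if the Device Plugin has not yet registered all eight GPUs.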

GPU Resource Verification

# Verify GPU access with a test Pod
kubectl run gpu-test --rm -it \
  --image=nvidia/cuda:12.3.1-base-ubuntu22.04 \
  --restart=Never \
  --overrides='
{
  "spec": {
    "nodeSelector": {"topology.kubernetes.io/zone": "on-premises"},
    "tolerations": [{"key": "location", "operator": "Equal", "value": "on-premises", "effect": "NoSchedule"}],
    "containers": [{
      "name": "gpu-test",
      "image": "nvidia/cuda:12.3.1-base-ubuntu22.04",
      "command": ["nvidia-smi"],
      "resources": {"limits": {"nvidia.com/gpu": "1"}}
    }]
  }
}' \
  -- nvidia-smi
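The test Pod stays Pending if no node carries the topology.kubernetes.io/zone=on-premises label that the nodeSelector expects. A small sketch for checking this beforehand follows. onprem_nodes and require_onprem_node are hypothetical helper names, and the label and taint key are the ones used in the override above.

```shell
# onprem_nodes: list on-premises nodes with their taint keys (tab-separated).
onprem_nodes() {
  kubectl get nodes -l topology.kubernetes.io/zone=on-premises \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints[*].key}{"\n"}{end}'
}

# require_onprem_node: fail when the selector would match nothing.
require_onprem_node() {
  nodes=$(onprem_nodes)
  if [ -z "$nodes" ]; then
    echo "no node is labeled topology.kubernetes.io/zone=on-premises" >&2
    return 1
  fi
  printf '%s\n' "$nodes"
}
```

If the taint column does not list location for a tainted node, the toleration in the test Pod likely needs adjusting to match.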

Dynamic Resource Allocation (DRA)

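On Kubernetes 1.31+, DRA enables more flexible GPU resource management than the integer nvidia.com/gpu counts exposed by the Device Plugin. In 1.31 the DRA API used below (resource.k8s.io/v1alpha3) is an alpha feature, so it must be enabled on the cluster before these manifests can be applied.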
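Before applying any DRA manifests, it can help to confirm which resource.k8s.io versions the API server actually serves. A minimal sketch, with dra_api_versions as a hypothetical helper name:

```shell
# dra_api_versions: print the resource.k8s.io API versions served by the cluster.
dra_api_versions() {
  kubectl api-versions | grep '^resource\.k8s\.io/' || {
    echo "resource.k8s.io is not served; DRA is not enabled on this cluster" >&2
    return 1
  }
}
```

If the printed version differs from the apiVersion used in the manifests that follow, the manifests need to be adapted to the served version.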

Defining a DeviceClass

Note: In Kubernetes 1.31 (resource.k8s.io/v1alpha3), DeviceClass replaces ResourceClass.

# gpu-device-class.yaml
apiVersion: resource.k8s.io/v1alpha3
kind: DeviceClass
metadata:
  name: nvidia-gpu
spec:
  selectors:
  - cel:
      expression: "device.driver == 'gpu.nvidia.com'"
  suitableNodes:
    nodeSelectorTerms:
    - matchExpressions:
      - key: nvidia.com/gpu.present
        operator: In
        values: ["true"]
---
apiVersion: resource.k8s.io/v1alpha3
kind: DeviceClass
metadata:
  name: high-memory-gpu
spec:
  selectors:
  - cel:
      expression: "device.driver == 'gpu.nvidia.com' && device.attributes['gpu.nvidia.com'].productName in ['NVIDIA-H100-80GB-HBM3', 'NVIDIA-H200']"
  suitableNodes:
    nodeSelectorTerms:
    - matchExpressions:
      - key: nvidia.com/gpu.product
        operator: In
        values: ["NVIDIA-H100-80GB-HBM3", "NVIDIA-H200"]
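The CEL selectors only match devices that a DRA driver has published in ResourceSlice objects, so the driver name and attribute values in the expressions should be checked against what the cluster reports. The sketch below lists published devices per driver and node. list_dra_devices and devices_for_driver are hypothetical helper names, and the field paths assume the v1alpha3 ResourceSlice layout.

```shell
# list_dra_devices: print "<driver>\t<node>\t<device names>" for each ResourceSlice.
list_dra_devices() {
  kubectl get resourceslices -o \
    jsonpath='{range .items[*]}{.spec.driver}{"\t"}{.spec.nodeName}{"\t"}{.spec.devices[*].name}{"\n"}{end}'
}

# devices_for_driver DRIVER: keep only slices published by the given driver.
devices_for_driver() {
  list_dra_devices | awk -F'\t' -v d="$1" '$1 == d { print $2 ": " $3 }'
}
```

For example, devices_for_driver gpu.nvidia.com prints nothing when the driver named in the DeviceClass has not published any devices. The full attribute set, including product names, can be inspected with kubectl get resourceslices -o yaml.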

Defining a ResourceClaimTemplate

# gpu-resource-claim-template.yaml
apiVersion: resource.k8s.io/v1alpha3
kind: ResourceClaimTemplate
metadata:
  name: gpu-claim-template
  namespace: ai-workloads
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: nvidia-gpu
        count: 1

Defining a Pod That Uses DRA

# pod-with-dra.yaml
apiVersion: v1
kind: Pod
metadata:
  name: llm-inference-pod
  namespace: ai-workloads
spec:
  nodeSelector:
    topology.kubernetes.io/zone: on-premises
  tolerations:
  - key: location
    operator: Equal
    value: on-premises
    effect: NoSchedule
  containers:
  - name: llm-server
    image: <REGISTRY>/ai/vllm-server:v0.4.0  # ECR or a private registry
    resources:
      claims:
      - name: gpu-resource
    env:
    # The claim template requests a single GPU (count: 1), so only index 0 is visible.
    - name: CUDA_VISIBLE_DEVICES
      value: "0"
  resourceClaims:
  - name: gpu-resource
    # Kubernetes 1.31 removed the "source" wrapper; the template name is set directly.
    resourceClaimTemplateName: gpu-claim-template
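The three manifests above can be applied in dependency order: the DeviceClass first, then the ResourceClaimTemplate, then the Pod. This is a sketch under stated assumptions: deploy_dra_workload is a hypothetical helper name, the file names are taken from the comments at the top of each manifest, and the 300-second timeout is an arbitrary choice.

```shell
# deploy_dra_workload: apply the DRA manifests in order and wait for the Pod.
deploy_dra_workload() {
  ns="ai-workloads"
  # Create the namespace idempotently (no error if it already exists).
  kubectl create namespace "$ns" --dry-run=client -o yaml | kubectl apply -f - || return 1
  # DeviceClass is cluster-scoped; the template and Pod set their own namespace.
  for f in gpu-device-class.yaml gpu-resource-claim-template.yaml pod-with-dra.yaml; do
    kubectl apply -f "$f" || return 1
  done
  # Block until the Pod is Ready, or fail after the (assumed) timeout.
  kubectl wait --for=condition=Ready pod/llm-inference-pod -n "$ns" --timeout=300s
}
```

If the wait times out, the ResourceClaim generated for the Pod is usually the first place to look.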

DRA Monitoring

# Check ResourceClaim status
kubectl get resourceclaims -n ai-workloads

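# Show ResourceClaim details
# (claims generated from a template are named <pod name>-<claim name>-<suffix>)
kubectl describe resourceclaim llm-inference-pod-gpu-resource-xxxxx -n ai-workloads

# Follow the DRA driver logs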
kubectl logs -n gpu-operator -l app=nvidia-dra-driver -f
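To spot claims that have not been allocated a device, the claim list can be filtered on the allocation result. A minimal sketch follows. unallocated_claims is a hypothetical helper name, and the field path assumes the v1alpha3 ResourceClaim status layout.

```shell
# unallocated_claims NAMESPACE: print claims that have no allocated device yet.
unallocated_claims() {
  kubectl get resourceclaims -n "$1" -o \
    jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocation.devices.results[*].device}{"\n"}{end}' |
    awk -F'\t' '$1 != "" && $2 == "" { print $1 }'
}
```

For example, unallocated_claims ai-workloads prints nothing once every claim in the namespace has a device assigned.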

< Previous: Node Bootstrap | Table of Contents | Next: Workload Placement Strategy >
