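CUDA Basics and GPU Programming

GPU Programming Basics

A CPU has a small number of powerful cores optimized for complex, largely sequential work. A GPU has thousands of simpler cores that apply the same kind of simple operation to many data elements in parallel. Tensor operations such as matrix multiplication and convolution consist of many independent calculations, which makes them a good fit for the GPU.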

CPU:  ● → ● → ● → ● → ●   (sequential, a few powerful cores)
GPU:  ●●●●●●●●●●●●●●●●     (parallel, thousands of simpler cores)
      ●●●●●●●●●●●●●●●●
      ●●●●●●●●●●●●●●●●

In PyTorch, moving your tensors and model to the GPU is all it takes for subsequent operations on them to run with GPU acceleration.

Checking GPU Availability

import torch

# Whether CUDA (an NVIDIA GPU) is available
print(torch.cuda.is_available())       # True / False

# Number of available GPUs
print(torch.cuda.device_count())       # e.g. 1

# Name of GPU 0
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. NVIDIA GeForce RTX 4090

# Check for MPS (Apple Silicon, M-series Macs)
print(torch.backends.mps.is_available())  # True on an M-series Mac with MPS support

Automatic device selection pattern. This code falls back to the CPU when no GPU is present:

# Priority: CUDA → MPS → CPU
if torch.cuda.is_available():
    device = torch.device('cuda')
elif torch.backends.mps.is_available():
    device = torch.device('mps')
else:
    device = torch.device('cpu')

print(f"Using device: {device}")
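The selection logic above can be wrapped in a small reusable function. The sketch below is one possible way to do it; the name `pick_device` is hypothetical, and the `getattr` lookup is a defensive assumption for PyTorch builds that do not ship `torch.backends.mps`.

```python
import torch

def pick_device() -> torch.device:
    """Return the preferred device in the order CUDA, MPS, CPU."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    # torch.backends.mps may be missing on older builds, so look it up defensively.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

device = pick_device()
x = torch.ones(2, 2, device=device)  # created directly on the chosen device
print(device, x.device.type)
```

Calling the function once at program start and passing `device` around keeps the rest of the code free of hardware checks.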

Moving Tensors to the GPU

.to(device): the recommended method

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

x = torch.randn(3, 4)
print(x.device)  # cpu

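# Move to the GPU (returns a new tensor; x itself stays on the CPU)
x_gpu = x.to(device)
print(x_gpu.device)  # cuda:0 (if a GPU is available)

# Create a new tensor directly on the GPU
y = torch.zeros(3, 4, device=device)
print(y.device)  # cuda:0 (if a GPU is available)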

.cuda() / .cpu(): explicit moves

x = torch.randn(3, 4)

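x_gpu = x.cuda()   # to the GPU (raises an error if CUDA is unavailable)
x_cpu = x_gpu.cpu()  # back to the CPU

# Select a specific GPU (multi-GPU setups)
x_gpu1 = x.cuda(1)   # the second GPU (index 1); requires at least two GPUs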

Operating directly on a CPU tensor and a GPU tensor raises a RuntimeError. Before an operation, make sure both tensors are on the same device.

a = torch.randn(3)          # CPU
b = torch.randn(3).cuda()   # GPU (this example requires CUDA)

try:
    c = a + b  # RuntimeError: Expected all tensors to be on the same device
except RuntimeError as e:
    print(e)

# Correct approach
c = a.to(b.device) + b  # move a to the GPU first
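The "check, then move" step can be made explicit with a small helper. This is a minimal sketch rather than a PyTorch API: `same_device` and `add_on_common_device` are hypothetical names, and the code falls back to the CPU so that it also runs on machines without a GPU.

```python
import torch

def same_device(*tensors: torch.Tensor) -> bool:
    """Return True if every tensor lives on the same device."""
    return len({t.device for t in tensors}) <= 1

def add_on_common_device(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Add two tensors, moving `a` to b's device first when they differ."""
    if not same_device(a, b):
        a = a.to(b.device)
    return a + b

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
a = torch.ones(3)                 # always created on the CPU
b = torch.ones(3, device=device)  # on the GPU when one is available
c = add_on_common_device(a, b)
print(c.device, c.tolist())
```

The result lives on the device of `b`, so later operations that combine `c` and `b` need no further transfers.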

GPU Memory Management

Checking Current Memory Usage

if torch.cuda.is_available():
    # Memory currently occupied by tensors (bytes)
    allocated = torch.cuda.memory_allocated(0)
    print(f"Allocated: {allocated / 1024**2:.1f} MiB")

    # Memory held by PyTorch's caching allocator:
    # allocated memory plus freed blocks kept for reuse
    reserved = torch.cuda.memory_reserved(0)
    print(f"Reserved: {reserved / 1024**2:.1f} MiB")

    # Total GPU memory
    total = torch.cuda.get_device_properties(0).total_memory
    print(f"Total memory: {total / 1024**3:.1f} GiB")

Freeing Memory

import gc

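# Delete a tensor, then clear the cache (requires a CUDA device)
large_tensor = torch.randn(10000, 10000, device='cuda')
del large_tensor

# Run Python's garbage collector
gc.collect()

# Release unused cached GPU memory back to the driver
torch.cuda.empty_cache()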
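To see what each step does, it helps to record the allocated and reserved counters at three points. The following is a rough sketch with hypothetical helper names (`snapshot`, `measure_release`); it returns `None` on machines without CUDA, and the exact numbers depend on the allocator state, so no specific output is shown.

```python
import gc
import torch

def snapshot():
    """Return (allocated_bytes, reserved_bytes) for the current CUDA device."""
    return torch.cuda.memory_allocated(), torch.cuda.memory_reserved()

def measure_release():
    """Record memory before and after freeing a tensor; None when CUDA is absent."""
    if not torch.cuda.is_available():
        return None
    t = torch.randn(1024, 1024, device="cuda")  # roughly 4 MiB of float32 data
    with_tensor = snapshot()
    del t                      # drops the last reference; the block returns to the cache
    gc.collect()
    after_del = snapshot()
    torch.cuda.empty_cache()   # hands unused cached blocks back to the driver
    after_empty = snapshot()
    return with_tensor, after_del, after_empty

result = measure_release()
if result is None:
    print("CUDA not available; nothing to measure")
else:
    labels = ("with tensor", "after del", "after empty_cache")
    for label, (allocated, reserved) in zip(labels, result):
        print(f"{label:>18}: allocated={allocated / 1024**2:.1f} MiB, "
              f"reserved={reserved / 1024**2:.1f} MiB")
```

On a GPU machine, the allocated figure is expected to drop after `del`, while the reserved figure only shrinks once `torch.cuda.empty_cache()` runs. Memory still referenced by live tensors is not released by either step.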

Moving a Model to the GPU

import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Move all of the model's parameters and buffers to the selected device
model = model.to(device)

# Move the input data to the same device
x = torch.randn(32, 784).to(device)
output = model(x)
print(output.shape)   # torch.Size([32, 10])
print(output.device)  # cuda:0 (if a GPU is available)
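One detail worth checking is that `nn.Module.to` moves the module in place and returns the same object, whereas `Tensor.to` returns a new tensor. The sketch below verifies this and confirms that every parameter ended up on the chosen device; the variable names `moved` and `param_devices` are illustrative only.

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)

# Module.to() modifies the module in place and returns the module itself.
moved = model.to(device)
print(moved is model)

# Collect the device type of every parameter; a single entry means a consistent placement.
param_devices = {p.device.type for p in model.parameters()}
print(param_devices)

# Inputs created directly on the same device avoid a separate copy step.
x = torch.randn(32, 784, device=device)
with torch.no_grad():
    output = model(x)
print(output.shape, output.device.type)
```

Because the module is moved in place, writing `model.to(device)` without reassigning works for modules, but the same habit silently fails for tensors.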

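Introduction to Multi-GPU: DataParallel

When several GPUs are available, nn.DataParallel splits each input batch across them and processes the pieces in parallel. Note that the PyTorch documentation recommends DistributedDataParallel over DataParallel for multi-GPU training, even on a single machine; DataParallel is shown here because it is the simplest to introduce.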

# In an environment with multiple GPUs
if torch.cuda.device_count() > 1:
    print(f"Using {torch.cuda.device_count()} GPUs")
    model = nn.DataParallel(model)   # splits each batch across the GPUs

model = model.to(device)

# Usage is the same as with a single GPU
x = torch.randn(64, 784).to(device)
output = model(x)
print(output.shape)  # torch.Size([64, 10])

How DataParallel works:

Batch (64 samples)
       │
  ┌────┴────┐
GPU 0       GPU 1
(32)        (32)
  │           │
  └── gather ─┘
       │
  return output
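The split-and-gather flow in the diagram can be imitated on the CPU with plain tensor operations. This is only an illustration of the data flow, not how `nn.DataParallel` is implemented: `num_replicas` is a hypothetical stand-in for the GPU count, and a single model is reused where DataParallel would run one replica per GPU.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(784, 10)
batch = torch.randn(64, 784)

num_replicas = 2  # hypothetical GPU count, used here only to split the batch

# Scatter: split the batch along the sample dimension.
chunks = torch.chunk(batch, num_replicas, dim=0)

with torch.no_grad():
    # Each chunk is processed independently (on its own GPU in the real setup).
    partial_outputs = [model(chunk) for chunk in chunks]
    # Gather: concatenate the per-chunk outputs back into one batch.
    gathered = torch.cat(partial_outputs, dim=0)
    # Reference result from processing the whole batch at once.
    reference = model(batch)

print([tuple(c.shape) for c in chunks])
print(tuple(gathered.shape))
print(torch.allclose(gathered, reference, atol=1e-5))
```

The outputs are concatenated rather than summed; summation applies to the gradients that the replicas send back during the backward pass.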

Device API Summary Table

Method	Description
`torch.cuda.is_available()`	Whether a CUDA GPU is available
`torch.cuda.device_count()`	Number of available GPUs
`torch.cuda.get_device_name(i)`	Name of GPU i
`tensor.to(device)`	Move a tensor to the given device
`tensor.cuda()`	Move to the GPU
`tensor.cpu()`	Move to the CPU
`tensor.device`	Device the tensor currently lives on
`torch.cuda.memory_allocated()`	GPU memory currently allocated (bytes)
`torch.cuda.empty_cache()`	Release unused cached GPU memory

Key Takeaways

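GPUs run tensor operations in parallel across thousands of cores, which is typically much faster than a CPU for large workloads
Check for a GPU with torch.cuda.is_available(), then manage placement consistently through a single device variable
Use .to(device) to place tensors and the model on the same device
A CPU ↔ GPU device mismatch raises a RuntimeError, so always confirm that tensors are on the same device
torch.cuda.empty_cache() releases unused cached GPU memory; on an out-of-memory (OOM) error, combine it with del and gc.collect()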

The next chapter compares CPU and GPU performance and covers practical performance optimization techniques such as mixed-precision training.