Improving the Image Preprocessing Pipeline - 100-hours-a-week/5-yeosa-wiki GitHub Wiki

1. Current code

def clip_preprocess_np(img: np.ndarray) -> torch.Tensor:
    img = cv2.resize(img, (224, 224), interpolation=cv2.INTER_CUBIC)
    tensor = torch.from_numpy(img).permute(2, 0, 1).float() / 255.0
    mean = torch.tensor([0.48145466, 0.4578275, 0.40821073]).view(3, 1, 1)
    std = torch.tensor([0.26862954, 0.26130258, 0.27577711]).view(3, 1, 1)
    return (tensor - mean) / std

with ThreadPoolExecutor() as executor:
    # Wrap the preprocessing function in a lambda
    preprocess_func = lambda img: preprocess(img)

    for i in range(0, len(images), batch_size):
        batch_images = images[i:i + batch_size]
        # Preprocess the images in parallel, one task per image
        preprocessed_batch = list(executor.map(preprocess_func, batch_images))
        preprocessed_batch = torch.stack(preprocessed_batch)
        preprocessed_batches.append(preprocessed_batch)

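2. Expected problems and proposed improvements

a. Preprocessing runs on the CPU

i. Problem

In clip_preprocess_np, the image is converted to a tensor after the resize, but no device is specified, so the tensor stays on the CPU.
The executor.map(preprocess_func, batch_images) call does not appear to use the GPU either.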
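
A quick way to confirm where the preprocessed tensor lives is to inspect its `device` attribute. The sketch below uses a dummy 224×224 image (an assumption, standing in for a decoded and already-resized input) and the same conversion steps as `clip_preprocess_np`, minus the resize.

```python
import numpy as np
import torch

# Dummy RGB image standing in for a decoded, already-resized input (assumption).
img = np.zeros((224, 224, 3), dtype=np.uint8)

# Same conversion as clip_preprocess_np, without the cv2.resize step.
tensor = torch.from_numpy(img).permute(2, 0, 1).float() / 255.0

# torch.from_numpy wraps CPU memory, so the tensor stays on the CPU
# unless it is moved explicitly with .to(device).
print(tensor.device)  # cpu
print(tensor.shape)   # torch.Size([3, 224, 224])
```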

ii. Proposed improvement

Move the existing preprocessing onto the GPU.

MEAN = torch.tensor([0.48145466, 0.4578275, 0.40821073]).view(3, 1, 1).to('cuda')
STD = torch.tensor([0.26862954, 0.26130258, 0.27577711]).view(3, 1, 1).to('cuda')

def clip_preprocess_gpu(img: np.ndarray, device="cuda") -> torch.Tensor:
    img = cv2.resize(img, (224, 224), interpolation=cv2.INTER_CUBIC)
    tensor = torch.from_numpy(img).permute(2, 0, 1).float() / 255.0
    tensor = tensor.to(device)
    return (tensor - MEAN.to(device)) / STD.to(device)

Alternatively, take the original clip_preprocess code and try to make its preprocessing run on the GPU. → Inefficient
```
def _transform(n_px):
    return Compose([
        Resize(n_px, interpolation=BICUBIC),
        CenterCrop(n_px),
        _convert_image_to_rgb,
        ToTensor(),
        Normalize((0.48145466, 0.4578275, 0.40821073), (0.26862954, 0.26130258, 0.27577711)),
    ])
```
- This approach takes a PIL Image as input, so each np.ndarray must first be converted to a PIL Image.
- Normalization runs on the CPU. → Inefficient

b. Batching does not appear to happen in practice

i. Problem

The code looks like it was meant to process images in batches, but it actually behaves as follows:
```
# Run the preprocessing function on each image in parallel
preprocessed_batch = list(executor.map(preprocess_func, batch_images))

# Combine the list into a single tensor
preprocessed_batch = torch.stack(preprocessed_batch)
preprocessed_batches.append(preprocessed_batch)
```
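- Each of the N images is preprocessed independently in its own thread task, and the results are simply combined with torch.stack().
  - The preprocessing computation itself is not performed on the batch as a whole.
- Under the GIL, even work running independently in separate threads is not guaranteed real parallelism.
As a result, preprocessing is expected to perform about the same as running sequentially.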
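
The expectation above can be checked with a small timing comparison. This is a minimal sketch, assuming a hypothetical pure-Python `fake_preprocess` as a stand-in for the real function; for pure-Python work the two timings are usually close. Library calls that release the GIL (many OpenCV and NumPy routines do) can still overlap across threads, so the real `preprocess` should be measured the same way before drawing conclusions.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_preprocess(n: int) -> int:
    # Hypothetical CPU-bound stand-in for per-image preprocessing.
    # Pure Python, so it holds the GIL for its whole run.
    total = 0
    for i in range(n):
        total += i * i
    return total

def timed(fn):
    """Call fn() and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start

work = [50_000] * 16  # 16 "images"

sequential, t_seq = timed(lambda: [fake_preprocess(n) for n in work])
with ThreadPoolExecutor() as executor:
    threaded, t_thr = timed(lambda: list(executor.map(fake_preprocess, work)))

print(f"sequential: {t_seq * 1000:.1f} ms, threaded: {t_thr * 1000:.1f} ms")
```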

ii. Proposed improvement

Define a preprocess function that processes the whole batch at once.

MEAN = torch.tensor([0.48145466, 0.4578275, 0.40821073]).view(1, 3, 1, 1)
STD = torch.tensor([0.26862954, 0.26130258, 0.27577711]).view(1, 3, 1, 1)

def batch_clip_preprocess_np(images: list[np.ndarray]) -> torch.Tensor:
    resized = [cv2.resize(img, (224, 224), interpolation=cv2.INTER_CUBIC) for img in images]
    np_batch = np.stack(resized)
    tensor_batch = torch.from_numpy(np_batch).permute(0, 3, 1, 2).float() / 255.0
    return (tensor_batch - MEAN) / STD

3. Tests

a. Running preprocessing on the GPU

MEAN = torch.tensor([0.48145466, 0.4578275, 0.40821073]).view(3, 1, 1).to('cuda')
STD = torch.tensor([0.26862954, 0.26130258, 0.27577711]).view(3, 1, 1).to('cuda')

def clip_preprocess_gpu(img: np.ndarray, device="cuda") -> torch.Tensor:
    img = cv2.resize(img, (224, 224), interpolation=cv2.INTER_CUBIC)
    tensor = torch.from_numpy(img).permute(2, 0, 1).to(device, dtype=torch.float32) / 255.0
    return (tensor - MEAN) / STD

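i. Analysis of the changes

[ 1. cv2.resize(...) → CPU ]

OpenCV is a high-performance C++ library and runs very fast on the CPU.
Resizing on the GPU would require moving the full-size image to the GPU first, which adds an unnecessary transfer cost.

[ 2. torch.from_numpy(img) → CPU ]

NumPy arrays live in CPU memory, so the tensor is also created on the CPU.

[ 3. .permute(2, 0, 1) → CPU ]

A pure axis reordering that returns a view; it is very cheap on the CPU and copies no data.

[ 4. .to(device, dtype=torch.float32) → CPU + GPU ]

.to(device, dtype=...) changes dtype and device in a single call.
- The intent is to send the uint8 data to the GPU and convert it to float32 as part of the same call.
Whether the cast happens before or after the transfer is a PyTorch implementation detail; writing .to(device).float() makes the transfer-then-cast order explicit.

[ 5. / 255.0 → GPU ]

Dividing by 255.0 before the move would cast the tensor to float on the CPU, so four times as many bytes would be sent to GPU memory. The division is therefore done after the move.

[ 6. (tensor - MEAN) / STD → GPU ]

Normalization is a per-channel elementwise operation over the whole tensor, so it can use the GPU's parallelism.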
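
The transfer-then-cast order described in steps 4 and 5 can be written out explicitly. The sketch below is a hypothetical variant (`clip_normalize_on_device` is not from the original code); it assumes the image has already been resized and falls back to the CPU when CUDA is unavailable so it can run anywhere.

```python
import numpy as np
import torch

# Fall back to the CPU when CUDA is unavailable so the sketch still runs (assumption).
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

MEAN = torch.tensor([0.48145466, 0.4578275, 0.40821073], device=device).view(3, 1, 1)
STD = torch.tensor([0.26862954, 0.26130258, 0.27577711], device=device).view(3, 1, 1)

def clip_normalize_on_device(img: np.ndarray) -> torch.Tensor:
    """Hypothetical variant: expects an already-resized HWC uint8 image."""
    chw_uint8 = torch.from_numpy(img).permute(2, 0, 1)  # CPU view, still uint8
    on_device = chw_uint8.to(device)                     # transfer 1 byte per value
    as_float = on_device.float() / 255.0                 # cast and scale on the device
    return (as_float - MEAN) / STD

img = np.full((224, 224, 3), 255, dtype=np.uint8)
out = clip_normalize_on_device(img)
print(out.shape, out.dtype, out.device)
```

Splitting `.to(device)` and `.float()` costs one extra line but removes any doubt about how many bytes cross the bus.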

ii. Test results

[ Based on 30 users each requesting 50 images ]

Baseline

Stage	min	max	avg
Loading + decoding	258.89ms	1340.00ms	513.93ms
Preprocessing	85.64ms	332.42ms	181.99ms
Embedding	143.88ms	645.61ms	434.57ms
Total time	622.76ms	2020.00ms	1287.32ms

Improved

Stage	min	max	avg
Loading + decoding	310.92ms	1390.00ms	570.49ms
Preprocessing	60.47ms	436.93ms	127.24ms
Embedding	153.62ms	800.51ms	419.82ms
Total time	569.23ms	2010.00ms	1294.56ms

iii. Analysis of results

[ Summary of performance changes ]

Stage	min	max	avg
Loading + decoding	+20.08%	+3.73%	+10.99%
Preprocessing	-29.43%	+31.43%	-30.07%
Embedding	+6.77%	+23.98%	-3.38%
Total time	-8.60%	-0.50%	+0.56%
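
The percentages in this summary are relative changes against the baseline, `(improved - baseline) / baseline × 100`. A minimal sketch, using the total-time row from the test results (`pct_change` is a hypothetical helper name):

```python
def pct_change(before_ms: float, after_ms: float) -> float:
    """Relative change from the baseline to the improved timing, in percent."""
    return (after_ms - before_ms) / before_ms * 100.0

# Total time (min, max, avg) from the baseline and improved tables.
baseline = {"min": 622.76, "max": 2020.00, "avg": 1287.32}
improved = {"min": 569.23, "max": 2010.00, "avg": 1294.56}

changes = {k: round(pct_change(baseline[k], improved[k]), 2) for k in baseline}
print(changes)  # {'min': -8.6, 'max': -0.5, 'avg': 0.56}
```

Negative values mean the improved version is faster.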

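[ Key changes ]

Preprocessing is about 30% faster on average
Average embedding time improves slightly, by about 3%
Average total time is almost unchanged, but the minimum improves by about 9%

[ Analysis of the performance gain ]

Item	Analysis
Faster preprocessing	In the CPU version, the measured preprocessing time also included the `.to('cuda')` transfer, which was costly. In the improved version, `from_numpy → to('cuda') → normalization` runs as one sequence ending on the GPU, which shortens the overall preprocessing path
No real difference in embedding speed	Both versions pass a GPU tensor (after `.to('cuda')`) to the model, so the input conditions for embedding are identical. The difference is presumed to come from run-to-run variance or system load
More stable total time	Preprocessing and embedding now follow each other on the GPU, easing the hand-off between stages; this shows up as a meaningful improvement in the minimum time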
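
CUDA kernels are launched asynchronously, so a wall-clock timer placed around GPU code can attribute preprocessing cost to a later stage such as embedding. How the per-stage timings above were taken is not stated; the sketch below shows one way to time a GPU stage, assuming `torch.cuda.synchronize()` is acceptable in the measurement path. `timed_ms` and `normalize` are hypothetical helpers, and the code falls back to the CPU when CUDA is unavailable.

```python
import time
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def timed_ms(fn, *args):
    """Run fn(*args) and return (result, elapsed milliseconds)."""
    if device.type == "cuda":
        torch.cuda.synchronize()  # let previously queued GPU work finish first
    start = time.perf_counter()
    result = fn(*args)
    if device.type == "cuda":
        torch.cuda.synchronize()  # wait for the kernels launched by fn
    return result, (time.perf_counter() - start) * 1000.0

def normalize(batch: torch.Tensor) -> torch.Tensor:
    # Stand-in for the normalization step (hypothetical helper).
    return (batch.float() / 255.0 - 0.5) / 0.5

batch = torch.zeros(8, 3, 224, 224, dtype=torch.uint8, device=device)
out, elapsed_ms = timed_ms(normalize, batch)
print(f"normalize: {elapsed_ms:.2f} ms")
```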

b. Introducing real batch processing

MEAN = torch.tensor([0.48145466, 0.4578275, 0.40821073]).view(1, 3, 1, 1)
STD = torch.tensor([0.26862954, 0.26130258, 0.27577711]).view(1, 3, 1, 1)

def batch_clip_preprocess_np(images: list[np.ndarray]) -> torch.Tensor:
    resized = [cv2.resize(img, (224, 224), interpolation=cv2.INTER_CUBIC) for img in images]
    np_batch = np.stack(resized)
    tensor_batch = torch.from_numpy(np_batch).permute(0, 3, 1, 2).float() / 255.0
    return (tensor_batch - MEAN) / STD

# The original version processed each image in a separate thread, hence the executor.
# Batch processing now provides the parallelism, so the executor is removed.
# with ThreadPoolExecutor() as executor:
#     preprocess_func = lambda img: preprocess(img)

for i in range(0, len(images), batch_size):
    batch_images = images[i:i + batch_size]
    # Previously: one thread task per image
    # preprocessed_batch = list(executor.map(preprocess_func, batch_images))
    # The result is already stacked, so torch.stack is no longer needed
    # preprocessed_batch = torch.stack(preprocessed_batch)
    preprocessed_batch = preprocess(batch_images)
    preprocessed_batches.append(preprocessed_batch)

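i. Analysis of the changes

[ Summary ]

Item	Original code (single image)	Improved code (batch)	Notes
Input	`img: np.ndarray`	`images: list[np.ndarray]`	Takes a list so several images can be processed at once
Resize	single `cv2.resize(...)`	list comprehension `[cv2.resize(...)]`	Each image is still resized one at a time inside the batch function
Tensor conversion	`(3, 224, 224)`	`(N, 3, 224, 224)`	`np.stack` adds the batch dimension before conversion to a tensor
`permute`	`.permute(2, 0, 1)`	`.permute(0, 3, 1, 2)`	Moves the channel axis forward while keeping the batch axis first
Normalization shape	`(3, 1, 1)`	`(1, 3, 1, 1)`	Includes a batch dimension for broadcasting against `(N, 3, 224, 224)`

[ MEAN = torch.tensor([...]).view(1, 3, 1, 1) ]

tensor_batch has shape (N, 3, 224, 224).
With MEAN and STD shaped (1, 3, 1, 1), PyTorch's broadcasting rules apply the per-channel normalization to all N images.

[ np.stack(resized) ]

Collects the image arrays into a single 4-dimensional array for batch processing.
- The result has shape (N, 224, 224, 3).
torch.from_numpy() then converts it to a tensor in one step.

[ .permute(0, 3, 1, 2) ]

Plays the same role as .permute(2, 0, 1): it moves the channel axis ahead of height and width, with the batch axis kept first.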
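
The shape changes and the broadcasting described above can be checked directly. The sketch below uses random images that are assumed to be already resized to 224×224 (so `cv2.resize` is skipped) and confirms that the batched path gives the same result as the per-image path.

```python
import numpy as np
import torch

MEAN = torch.tensor([0.48145466, 0.4578275, 0.40821073]).view(1, 3, 1, 1)
STD = torch.tensor([0.26862954, 0.26130258, 0.27577711]).view(1, 3, 1, 1)

# Four random RGB images, assumed to be already resized (cv2.resize skipped).
rng = np.random.default_rng(0)
resized = [rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8) for _ in range(4)]

np_batch = np.stack(resized)                                   # (4, 224, 224, 3)
tensor_batch = torch.from_numpy(np_batch).permute(0, 3, 1, 2)  # (4, 3, 224, 224)
batched = (tensor_batch.float() / 255.0 - MEAN) / STD          # MEAN/STD broadcast over N, H, W

# Per-image path for comparison: permute(2, 0, 1) with (3, 1, 1) statistics.
single = torch.stack([
    (torch.from_numpy(img).permute(2, 0, 1).float() / 255.0 - MEAN[0]) / STD[0]
    for img in resized
])

print(np_batch.shape, tuple(batched.shape))
print(torch.allclose(batched, single))
```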

ii. Test results

[ Based on 30 users each requesting 50 images ]

Baseline

Stage	min	max	avg
Loading + decoding	258.89ms	1340.00ms	513.93ms
Preprocessing	85.64ms	332.42ms	181.99ms
Embedding	143.88ms	645.61ms	434.57ms
Total time	622.76ms	2020.00ms	1287.32ms

Improved

Stage	min	max	avg
Loading + decoding	229.27ms	1310.00ms	425.60ms
Preprocessing	133.73ms	473.17ms	251.23ms
Embedding	157.16ms	601.26ms	358.43ms
Total time	640.44ms	1960.00ms	1167.04ms

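iii. Analysis of results

[ Summary of performance changes ]

Stage	min	max	avg
Loading + decoding	-11.44%	-2.24%	-17.18%
Preprocessing	+56.19%	+42.31%	+38.06%
Embedding	+9.22%	-6.87%	-17.52%
Total time	+2.83%	-2.97%	-9.35%

[ Key changes ]

Average preprocessing time increases by about 38%
Average embedding time decreases by about 18%
Average total time decreases by about 9%
Average loading + decoding time improves by about 17%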

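[ Analysis of the performance regression ]

Item	Analysis
Slower preprocessing	`cv2.resize`, `np.stack`, and `from_numpy` now run for the whole batch in one go, which concentrates the CPU work, and `.to('cuda')` is applied to the entire batch at once, which is more memory-intensive than the per-image path
Faster embedding	Average embedding time improves slightly, but this is read as an indirect effect of better batch alignment and a steadier GPU flow rather than of the structural change. Both structures start embedding after `.to('cuda')`, so there is no direct improvement in the computation path
Shorter total time	Although preprocessing takes longer, the cost of the embedding and decoding stages drops, so the overall flow is faster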

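[ Overall assessment ]

This is a case where a structural change meant to improve preprocessing backfired.
Batch preprocessing pays off in GPU-centric environments or where the operations can actually be parallelized.

[ Points to verify further ]

Image loading + decoding time went down.
Hypothesis: in the original version, CPU-side preprocessing created a bottleneck among CPU tasks and delayed decoding as well → further testing is needed to check whether the decoding wait time really decreased.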

c. Introducing GPU batch preprocessing

MEAN = torch.tensor([0.48145466, 0.4578275, 0.40821073]).view(1, 3, 1, 1).to('cuda')
STD = torch.tensor([0.26862954, 0.26130258, 0.27577711]).view(1, 3, 1, 1).to('cuda')

def batch_clip_preprocess_np(images: list[np.ndarray], device="cuda") -> torch.Tensor:
    resized = [cv2.resize(img, (224, 224), interpolation=cv2.INTER_CUBIC) for img in images]
    np_batch = np.stack(resized)
    tensor_batch = torch.from_numpy(np_batch).permute(0, 3, 1, 2).to(device, dtype=torch.float32) / 255.0
    return (tensor_batch - MEAN) / STD

# The original version processed each image in a separate thread, hence the executor.
# Batch processing now provides the parallelism, so the executor is removed.
# with ThreadPoolExecutor() as executor:
#     preprocess_func = lambda img: preprocess(img)

for i in range(0, len(images), batch_size):
    batch_images = images[i:i + batch_size]
    # Previously: one thread task per image
    # preprocessed_batch = list(executor.map(preprocess_func, batch_images))
    # The result is already stacked, so torch.stack is no longer needed
    # preprocessed_batch = torch.stack(preprocessed_batch)
    preprocessed_batch = preprocess(batch_images)
    preprocessed_batches.append(preprocessed_batch)

i. Analysis of the changes

Combines the changes from a and b.
Batching increased preprocessing time on the CPU; on the GPU, batching is expected to be effective in practice.

ii. Test results

[ Based on 30 users each requesting 50 images ]

CPU batch preprocessing

Stage	min	max	avg
Loading + decoding	229.27ms	1310.00ms	425.60ms
Preprocessing	133.73ms	473.17ms	251.23ms
Embedding	157.16ms	601.26ms	358.43ms
Total time	640.44ms	1960.00ms	1167.04ms

GPU batch preprocessing

Stage	min	max	avg
Loading + decoding	220.41ms	1310.00ms	428.02ms
Preprocessing	107.05ms	464.60ms	190.38ms
Embedding	133.27ms	634.93ms	379.62ms
Total time	498.42ms	2160.00ms	1129.13ms

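iii. Analysis of results

[ Summary of performance changes ]

Stage	min	max	avg
Loading + decoding	-3.86%	0.00%	+0.57%
Preprocessing	-19.99%	-1.81%	-24.24%
Embedding	-15.22%	+5.61%	+5.91%
Total time	-22.16%	+10.20%	-3.25%

[ Key changes ]

Average preprocessing time decreases by about 24%
Average embedding time increases by about 6%
Average total time decreases by about 3%; the minimum improves by 22%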

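[ Analysis of the performance changes ]

Item	Analysis
Faster preprocessing	The structure changed from calling `.to('cuda')` separately within each batch (result 1) to running `from_numpy → .to('cuda') → normalization` in one pass on the GPU, which shortens the preprocessing path
Slightly slower embedding	Although the preprocessed result is already on the GPU, it may have lost out slightly to result 1 in memory contiguity or cache state. This area is not structurally improved
Decoding time nearly identical	Decoding itself is structurally unchanged, but the different timing of `.to('cuda')` appears to have caused small fluctuations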

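[ Overall assessment ]

The structural change served the goal of improving preprocessing; folding .to('cuda') into the GPU-side path was the main factor.
Embedding speed is structurally the same. The small deviation likely comes from memory alignment and buffer timing effects and is not significant.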

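4. Comparing bottlenecks across approaches

a. Overview

i. Problem

Compared with batch preprocessing, per-image preprocessing takes roughly 120% of the time for image loading + preprocessing.

ii. Suspected cause

Per-image processing runs one thread task per image, which causes CPU context switching and resource contention → loading + preprocessing bottleneck
Batch processing handles each batch serially, so there are fewer threads and a simpler execution flow → faster results

b. Logging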

async def _download(self, file_name: str) -> bytes:
        """
        Download a single image from GCS as raw bytes.
        Decoding and RGB conversion happen later, in decode_image_cv2.

        Args:
            file_name (str): File name inside the GCS bucket

        Returns:
            bytes: The downloaded image bytes

        """
        # print(f"[INFO] Image loading started")
        start = time.time()
        image_bytes = await self.client.download(
            bucket=self.bucket_name, object_name=file_name
        )
        # image_bytes = await blob.download_as_bytes(client=self.client)
        end = time.time()
        print(f"[INFO] Image loading done : {format_elapsed(end - start)}")
        return image_bytes

def decode_image_cv2(image_bytes: bytes, label: str) -> tuple[np.ndarray, float]:
    start = time.time()
    nparr = np.frombuffer(image_bytes, np.uint8)
    img = cv2.imdecode(nparr, cv2.IMREAD_COLOR)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    end = time.time()
    print(f"[INFO] Decoding done : {format_elapsed(end - start)}")
    # Return the elapsed time too, so the caller can separate run time from wait time
    return img, end - start

async def _process_single_file(
        self, filename: str, executor=None
    ) -> np.ndarray:
        loop = asyncio.get_running_loop()
        image_bytes = await self._download(filename)
        
        start = time.time()
        decoded_img, decoding_time = await loop.run_in_executor(
            executor, decode_image_cv2, image_bytes, "gcs"
        )
        end = time.time()
        print(f"[INFO] Decoding wait : {format_elapsed(end - start - decoding_time)}")
        return decoded_img
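
The wait-time logging above subtracts the function's own run time from the time observed by the caller of `run_in_executor`; what remains is time spent queued for a free worker thread or waiting for the event loop to resume. A self-contained sketch of the same idea follows, with a hypothetical `fake_decode` standing in for `decode_image_cv2` and an assumed pool of two workers so that queueing is visible.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

def fake_decode(payload: bytes) -> tuple[int, float]:
    # Hypothetical stand-in for decode_image_cv2: returns a result and its own run time.
    start = time.perf_counter()
    time.sleep(0.02)  # pretend decoding takes about 20 ms
    return len(payload), time.perf_counter() - start

async def decode_with_wait(payload: bytes, executor: ThreadPoolExecutor) -> float:
    loop = asyncio.get_running_loop()
    start = time.perf_counter()
    _, run_time = await loop.run_in_executor(executor, fake_decode, payload)
    # Everything not spent inside fake_decode is queueing or scheduling delay.
    return time.perf_counter() - start - run_time

async def main() -> list[float]:
    # Two workers and eight tasks, so later tasks must wait for a free thread.
    with ThreadPoolExecutor(max_workers=2) as executor:
        return await asyncio.gather(
            *(decode_with_wait(b"x" * 10, executor) for _ in range(8))
        )

waits = asyncio.run(main())
print([f"{w * 1000:.1f} ms" for w in waits])
```

With more tasks than workers, the later tasks report a wait several times longer than the decode itself, which is the pattern this logging is meant to expose.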

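c. Results

i. Approaches

Approach 1: CPU preprocessing + per-image preprocessing
Approach 2: CPU preprocessing + batch preprocessing
Approach 3: GPU preprocessing + per-image preprocessing
Approach 4: GPU preprocessing + batch preprocessing

ii. Results

Item	Approach 1	Approach 2	Approach 3	Approach 4
Loading	193.42	166.47	179.09	160.44
Decoding (run time)	24.15	24.70	24.51	23.84
Decoding wait	157.83	126.49	148.06	130.87
Loading + decoding	578.19	495.21	563.29	472.69

iii. Analysis

Decoding itself is short, but the wait time is considerable.
- This points to excessive thread creation → more CPU context switching → I/O and decoding bottlenecks

→ What if we upgraded the CPU?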