로깅을 통해 태스크별 소요 시간 파악 - 100-hours-a-week/5-yeosa-wiki GitHub Wiki

1. 로깅 방식

공통 형식 작성

def format_elapsed(t: float) -> str:
    return f"{t * 1000:.2f} ms" if t < 1 else f"{t:.2f} s"

이미지 로딩, 디코딩 : 각각 한 장 단위로 로깅한 후, 최소 / 최대 / 평균 시간 분석

async def _download(self, file_name: str) -> bytes:
    """
    GCS에서 단일 이미지를 다운로드하고 RGB로 변환합니다.
    Args:
        file_name (str): GCS 내 파일 이름

    Returns:
        bytes: 로드된 이미지 바이트
    """
    print(f"[INFO] 이미지 로딩 시작")
    start = time.time()
    image_bytes = await self.client.download(
        bucket=self.bucket_name, object_name=file_name
    )
    end = time.time()
    print(f"[INFO] 이미지 로딩 완료 : {format_elapsed(end - start)}")
    
    return image_bytes

def decode_image_cv2(image_bytes: bytes, label: str) -> np.ndarray:
    print(f"[INFO] 디코딩 시작")
    start = time.time()
    
    nparr = np.frombuffer(image_bytes, np.uint8)
    img = cv2.imdecode(nparr, cv2.IMREAD_COLOR)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    
    end = time.time()
    print(f"[INFO] 디코딩 완료 : {format_elapsed(end - start)}")
    return img

이미지 로딩 + 디코딩, 전처리, 임베딩, 총 처리 시간: 요청 당 전체 사진 기준으로 로깅

요청이 여러개면 최소 / 최대 / 평균 시간 분석

async def embed_controller(request: Request):
    t0 = time.time()
    
		...

    print(f"[INFO] 이미지 로딩 및 디코딩 시작")
    start = time.time()
    
    image_loader = request.app.state.image_loader
    images = await image_loader.load_images(image_refs)
    
    end = time.time()
    print(f"[INFO] 이미지 로딩 및 디코딩 완료: {format_elapsed(start - end)}")
    
    ...
    
    t6 = time.time()
    print(f"[INFO] 총 처리 시간: {format_elapsed(t6 - t0)}")
    
    return Response(
        content=serialized,
        media_type="application/octet-stream"
    )

def embed_images(
    model, preprocess, images, filenames, batch_size=32, device="cuda"
):
    # 이미지 전처리를 배치 단위로 수행
    preprocessed_batches = []
    print(f"[INFO] 전처리 시작")
    t1 = time.time()
    # ThreadPoolExecutor 생성
    with ThreadPoolExecutor() as executor:
        # 전처리 함수를 lambda로 정의
        preprocess_func = lambda img: preprocess(img)
        
        for i in range(0, len(images), batch_size):
            batch_images = images[i:i + batch_size]
            # 병렬로 전처리 수행
            preprocessed_batch = list(executor.map(preprocess_func, batch_images))
            preprocessed_batch = torch.stack(preprocessed_batch)
            preprocessed_batches.append(preprocessed_batch)
    t2 = time.time()
    print(f"[INFO] 전처리 완료: {format_elapsed(t2 - t1)}")
    
    # 결과를 저장할 딕셔너리
    results = {}
    
    print(f"[INFO] 임베딩 시작")
    t1 = time.time()
    # 배치 단위로 임베딩 수행
    for i, batch in enumerate(preprocessed_batches):
        batch_filenames = filenames[i * batch_size:(i + 1) * batch_size]
        
        # GPU로 데이터 이동
        image_input = batch.to(device)
        
        # 임베딩 생성
        with torch.no_grad():
            batch_features = model.encode_image(image_input)
        
        # CPU로 결과 이동 및 저장
        batch_features = batch_features.cpu()
        for filename, feature in zip(batch_filenames, batch_features):
            results[filename] = feature.cpu()

    t2 = time.time()
    print(f"[INFO] 임베딩 완료: {format_elapsed(t2 - t1)}")

    return results

2. 태스크별 소요 시간

a. 이미지 1장 기준

이미지 로딩 : 76.68ms
디코딩 : 12.79ms
이미지 로딩 + 디코딩 : 91.61ms
전처리 : 7.02ms
임베딩 : 19.40ms

b. 이미지 50장 기준(최소 / 최대 / 평균)

이미지 로딩(장당) : 333.88ms / 671.60ms / 546.46ms
디코딩(장당) : 8.94ms / 227.93ms / 40.05ms
이미지 로딩 + 디코딩 : 727.79 ms
전처리 : 60.89 ms
임베딩 : 223.67 ms

c. 이미지 100장 기준(최소 / 최대 / 평균)

이미지 로딩(장당) : 597.63ms / 982.61ms / 856.58ms
디코딩(장당) : 12.44ms / 70.84ms / 34.87ms
이미지 로딩 + 디코딩(100장) : 1.28s
전처리 : 224.65ms
임베딩 : 429.77ms

d. 이미지 50장 기준, 30명이 동시 요청

이미지 로딩(장당) : 561.07ms / 8870.00ms / 4652.34ms
디코딩(장당) : 8.05ms / 47.84ms / 20.19ms
이미지 로딩 + 디코딩(100장) : 1.37s / 915ms / 5.71s
전처리 : 127.80ms / 398.46ms / 247.88ms
임베딩 : 246.57ms / 1090.00ms / 557.87ms

2. 분석 및 개선 방안

a. 분석

임베딩 시간 증가율(약 1.43배) 보다 전처리 시간 증가율(약 40.5배)이 훨씬 두드러진다.
- 임베딩 작업은 GPU에서 수행되다보니 비교적 병렬화가 잘 되는 것으로 보임
- 전처리 작업은 CPU에서 진행되는 작업 → 스레드가 과도하게 생성됐을 가능성

b. 개선 방안

전처리 작업도 GPU에서 수행
전처리 작업을 CPU에서 진행하되, max_workers를 두어 최대 생성 스레드 수 제한