Analysis based on https://github.com/dhs-shine/litellm/tree/rate_limit_model_per_key_fallback_1.72.0-stable - dhs-shine/litellm GitHub Wiki

Base branch

https://github.com/dhs-shine/litellm/tree/rate_limit_model_per_key_fallback_1.72.0-stable

Behavioral issues

  • When the global rate limit configured via the rpm setting under litellm_params in model_list is hit, fallback works as expected
  • When cache is set to true in litellm_settings and a per-user (per-API-key) rate limit is applied, only a 429 error is raised and fallback does not trigger

With the model rpm set to 3 and the user API key rpm also set to 3:
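A minimal proxy config reproducing this setup might look like the sketch below. The model name, upstream model, and exact key layout are illustrative assumptions, not taken from the branch; the per-key rpm of 3 would be attached to the API key itself (e.g. via a `rpm_limit` on key generation).

```yaml
model_list:
  - model_name: my-model            # illustrative alias, not from the source
    litellm_params:
      model: openai/gpt-4o          # illustrative upstream model
      rpm: 3                        # global per-deployment rate limit

litellm_settings:
  cache: true                       # enables per-key rate limit tracking
```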


  • A fastapi.exceptions.HTTPException with status 429 is raised, and the client is told to retry after 60 seconds. Because the limit is per minute, it resets after one minute, so a retried request then succeeds.
22:48:24 - LiteLLM Proxy:ERROR: common_request_processing.py:485 - litellm.proxy.proxy_server._handle_llm_api_exception(): Exception occured - 429: Max parallel request limit reached Crossed TPM / RPM / Max Parallel Request Limit. Hit limit for model_per_key. Current usage: max_parallel_requests: 2.0, current_rpm: 4.0, current_tpm: 30195.0. Current limits: max_parallel_requests: None, rpm_limit: 3, tpm_limit: None.
Traceback (most recent call last):
  File "/home/dhsshin/Workspace/AAP/litellm/litellm/proxy/proxy_server.py", line 3566, in chat_completion
    return await base_llm_response_processor.base_process_llm_request(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dhsshin/Workspace/AAP/litellm/litellm/proxy/common_request_processing.py", line 355, in base_process_llm_request
    self.data, logging_obj = await self.common_processing_pre_call_logic(
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dhsshin/Workspace/AAP/litellm/litellm/proxy/common_request_processing.py", line 302, in common_processing_pre_call_logic
    self.data = await proxy_logging_obj.pre_call_hook(  # type: ignore
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dhsshin/Workspace/AAP/litellm/litellm/proxy/utils.py", line 582, in pre_call_hook
    raise e
  File "/home/dhsshin/Workspace/AAP/litellm/litellm/proxy/utils.py", line 569, in pre_call_hook
    response = await _callback.async_pre_call_hook(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dhsshin/Workspace/AAP/litellm/litellm/proxy/hooks/parallel_request_limiter_v2.py", line 420, in async_pre_call_hook
    await asyncio.gather(*tasks)
  File "/home/dhsshin/Workspace/AAP/litellm/litellm/proxy/hooks/parallel_request_limiter_v2.py", line 250, in check_key_in_limits_v2
    raise self.raise_rate_limit_error(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dhsshin/Workspace/AAP/litellm/litellm/proxy/hooks/parallel_request_limiter_v2.py", line 275, in raise_rate_limit_error
    raise HTTPException(
fastapi.exceptions.HTTPException: 429: Max parallel request limit reached Crossed TPM / RPM / Max Parallel Request Limit. Hit limit for model_per_key. Current usage: max_parallel_requests: 2.0, current_rpm: 4.0, current_tpm: 30195.0. Current limits: max_parallel_requests: None, rpm_limit: 3, tpm_limit: None.
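From the client side, the behavior above means a 429 can be worked around by waiting and retrying. A minimal retry sketch that honors a Retry-After header (the function names and the `send` callback are hypothetical, not part of LiteLLM's client API):

```python
import time


def retry_after_seconds(headers: dict, default: float = 60.0) -> float:
    """Parse a Retry-After header (seconds form); fall back to 60s,
    matching the wait the proxy suggests in the error above."""
    value = headers.get("retry-after") or headers.get("Retry-After")
    try:
        return float(value)
    except (TypeError, ValueError):
        return default


def call_with_retry(send, max_attempts: int = 3):
    """send() is a hypothetical callable returning (status, headers, body).
    Retry on 429, sleeping for the advertised Retry-After between attempts."""
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status != 429:
            return status, body
        if attempt < max_attempts - 1:
            time.sleep(retry_after_seconds(headers))
    return status, body
```

This is only a client-side workaround; the underlying issue is that the proxy itself never attempts a fallback deployment.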
  • Fallback handling only happens once the request reaches route_request, but the error above is raised earlier in the pipeline, so processing stops before fallback can run
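The control flow in the traceback can be sketched as follows. This is a deliberate simplification: the function names mirror the frames above, but the bodies are illustrative, not LiteLLM's actual implementation. The point is that the per-key limiter raises inside the pre-call hook, so the routing layer (where fallbacks live) is never entered.

```python
class RateLimitError(Exception):
    """Stand-in for the HTTPException(429) raised by the limiter."""


def pre_call_hook(current_rpm: int, rpm_limit: int) -> None:
    # parallel_request_limiter_v2 raises here when the per-key limit is hit
    if current_rpm >= rpm_limit:
        raise RateLimitError("429: Max parallel request limit reached")


def route_request() -> str:
    # router-level fallbacks are only applied inside this call
    return "routed with fallback support"


def base_process_llm_request(current_rpm: int, rpm_limit: int) -> str:
    pre_call_hook(current_rpm, rpm_limit)  # raises on a 429 ...
    return route_request()                 # ... so this is never reached
```

Under this reading, making per-key fallback work would require either catching the limiter's exception before it propagates to the client, or moving the per-key check to where the router can react to it.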