[Bug]: Rate limiter decrements by incorrect token count for /v1/responses endpoint #18671

@AmethystLiang

What happened?

Description

When using the /v1/responses endpoint with team-per-model TPM rate limiting enabled, the x-ratelimit-model_per_team-remaining-tokens header decreases by only ~2 tokens per request, regardless of actual token consumption reported in total_tokens.

Environment

  • Affected: Original LiteLLM
  • Not affected: Stably fork

Steps to Reproduce

Run the following curl command multiple times:

  curl -sD - -o - <LITELLM_PROXY_URL>/v1/responses \
    -H "Authorization: Bearer <API_KEY>" \
    -H "Content-Type: application/json" \
    -d '{
      "model": "gemini-3-flash-preview",
      "input": "hello"
    }' \
  | grep -oE '"total_tokens"[[:space:]]*:[[:space:]]*[0-9]+|x-ratelimit-model_per_team-remaining-tokens:[[:space:]]*[0-9]+'

Expected Behavior

x-ratelimit-model_per_team-remaining-tokens should decrease by the value reported in total_tokens (~35 tokens per request).

Actual Behavior

| Request | remaining-tokens | total_tokens | Actual decrease |
|---------|------------------|--------------|-----------------|
| 1       | 1,999,998        | 35           | –               |
| 2       | 1,999,996        | 35           | 2               |
| 3       | 1,999,994        | 35           | 2               |

The rate limiter only decrements by 2 tokens instead of the actual 35 tokens consumed.
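The mismatch can be checked directly from the measured values above (a quick arithmetic sketch, not LiteLLM code):

```python
# Values taken from the measurements in the table above.
remaining = [1_999_998, 1_999_996, 1_999_994]  # header value after requests 1-3
total_tokens = 35                              # usage reported per request

# Per-request decrement the limiter actually applied.
actual = [a - b for a, b in zip(remaining, remaining[1:])]
print(actual)                    # [2, 2] -- far below the reported usage
print(total_tokens - actual[0])  # 33 tokens per request go uncounted
```

At this rate, each request under-charges the TPM budget by roughly 94% of its real consumption.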

Impact

  • Rate limiting is ineffective for /v1/responses endpoint
  • Users can consume significantly more tokens than their rate limit should allow
  • TPM quotas are not being enforced correctly
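For context, the expected accounting is simple: each request should reduce the remaining budget by the `total_tokens` reported in its response body. A minimal sketch of that behavior (illustrative only; `TPMBudget` is a hypothetical name, not a LiteLLM internal):

```python
class TPMBudget:
    """Minimal illustrative tokens-per-minute budget (a sketch, not LiteLLM's code)."""

    def __init__(self, limit: int) -> None:
        self.remaining = limit

    def record_usage(self, usage: dict) -> int:
        # Correct behavior: decrement by the usage the provider reported
        # in the response, not by a fixed or pre-call-estimated amount.
        self.remaining -= usage["total_tokens"]
        return self.remaining


budget = TPMBudget(limit=2_000_000)
print(budget.record_usage({"total_tokens": 35}))  # 1999965
print(budget.record_usage({"total_tokens": 35}))  # 1999930
```

The bug reported here is consistent with the decrement being computed from something other than the response's `total_tokens` for the /v1/responses code path.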

Additional Context

  • This issue is specific to the /v1/responses endpoint with model_per_team TPM rate limiting
  • Requires team-per-model TPM rate limiting to be configured

Relevant log output

=== Request 1 ===
x-ratelimit-model_per_team-remaining-tokens: 1999998
"total_tokens":104

=== Request 2 ===
x-ratelimit-model_per_team-remaining-tokens: 1999996
"total_tokens":104

=== Request 3 ===
x-ratelimit-model_per_team-remaining-tokens: 1999994
"total_tokens":104

What part of LiteLLM is this about?

SDK (litellm Python package)

What LiteLLM version are you on?

v1.80.11
