Rate-Limiting

Effective December 19, 2025, Architect will enforce rate limits by user ID. The rate limits will impact any application, script, or manual usage that communicates excessively with the Architect API.

Review your code and the Architect documentation to make sure your requests follow the best practices described on this page.

Architect limits the rate of incoming gRPC requests (including requests made through the SDK) to keep services reliable and responsive. Limits are enforced per user: each authenticated user has an independent rate limit quota.

Overview

Rate limiting applies to all gRPC requests made through the Architect API, including:

Direct gRPC calls
SDK usage (Python, TypeScript, Rust)
Any application or script that communicates with Architect services

How Rate Limiting Works

Per-User Enforcement

Rate limits are enforced based on the authenticated user ID. When you make a request:

Architect extracts your user ID
The rate limiter checks your personal quota
If you have available tokens, the request proceeds
If your quota is exhausted, the request is rejected with a rate limit error

Note that multiple accounts under a single user share a single token bucket.

Token Bucket Algorithm

The rate limiter uses a token bucket algorithm with the following characteristics:

Burst Capacity: You can make a burst of requests up to your configured limit
Refill Rate: Tokens are replenished over time according to your quota

For example, if your rate limit is 10 requests per second with a burst capacity of 100:

You can make up to 100 requests immediately (burst)
Consumed tokens refill at a rate of 10 per second
Once the burst capacity is used up, further requests are rejected until tokens refill
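The behavior in this example can be reproduced with a small simulation. This is only an illustrative sketch of a generic token bucket, not Architect's server implementation; the `TokenBucket` class and its field names are hypothetical, and time is passed in explicitly so the result is deterministic.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Illustrative token bucket (hypothetical; not Architect's implementation)."""

    capacity: float            # burst capacity: the most tokens the bucket can hold
    refill_per_sec: float      # tokens added per second
    tokens: float = 0.0        # current tokens; set to capacity in __post_init__
    last_refill: float = 0.0   # timestamp of the last refill, in seconds

    def __post_init__(self) -> None:
        # Start full so that an initial burst is allowed.
        self.tokens = self.capacity

    def try_acquire(self, now: float) -> bool:
        """Refill for the elapsed time, then spend one token if one is available."""
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last_refill = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


bucket = TokenBucket(capacity=100, refill_per_sec=10)

# 150 requests arrive at t=0: the first 100 pass, the remaining 50 are rejected.
accepted_at_start = sum(bucket.try_acquire(now=0.0) for _ in range(150))

# One second later, 10 tokens have refilled, so 10 more requests pass.
accepted_after_1s = sum(bucket.try_acquire(now=1.0) for _ in range(150))

print(accepted_at_start, accepted_after_1s)  # 100 10
```

Note that rejected requests do not consume tokens in this sketch; whether the server behaves the same way is not stated here.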

Default Rate Limits

By default the server will enforce rate limits with a burst capacity of 100 requests and a refill rate of 10 requests per second. This applies per user (not per account). However, Architect reserves the right to change these limits at any time depending on server capacity and use cases.
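To get a feel for what the defaults mean in practice, the following back-of-the-envelope helper estimates the minimum time needed to get a given number of calls accepted, starting from a full bucket. The function name `min_seconds_for` is hypothetical, and the numbers assume the default limits stated above remain in effect.

```python
def min_seconds_for(requests: int, burst: int = 100, refill_per_sec: float = 10.0) -> float:
    """Lower bound on the time to get `requests` calls accepted from a full bucket."""
    # The first `burst` calls can go out immediately; each further call
    # needs one refilled token, and tokens arrive at `refill_per_sec`.
    beyond_burst = max(0, requests - burst)
    return beyond_burst / refill_per_sec


print(min_seconds_for(100))   # 0.0
print(min_seconds_for(1000))  # 90.0
```

By the same arithmetic, an empty bucket takes 100 / 10 = 10 seconds to refill completely under the defaults.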

Rate Limit Responses

When you exceed your rate limit, Architect returns a gRPC error with the following characteristics:

Error Details

gRPC Status Code: RESOURCE_EXHAUSTED (code 8)
Error Message: "rate limit exceeded"
Retry-After Header: Provides the recommended wait time in milliseconds before retrying
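A client usually needs the wait time in seconds before it can sleep. The sketch below converts a retry-after value to seconds, assuming the value arrives as a plain integer count of milliseconds (as text or bytes). The exact wire format is not specified on this page, so treat that as an assumption and check it against a real response; the helper name `retry_after_seconds` and its one-second fallback are hypothetical.

```python
from typing import Optional, Union


def retry_after_seconds(raw: Optional[Union[str, bytes]], default: float = 1.0) -> float:
    """Convert a retry-after value (assumed: integer milliseconds) to seconds.

    Returns `default` when the value is missing, negative, or not a plain integer.
    """
    if raw is None:
        return default
    text = raw.decode("ascii", errors="ignore") if isinstance(raw, bytes) else raw
    try:
        millis = int(text.strip())
    except ValueError:
        return default
    if millis < 0:
        return default
    return millis / 1000.0


print(retry_after_seconds("250"))    # 0.25
print(retry_after_seconds(b"1500"))  # 1.5
print(retry_after_seconds(None))     # 1.0
```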

Python Example

When using the Python SDK, you'll receive an AioRpcError:

Example Error Output:

AioRpcError: <AioRpcError of RPC that terminated with:
	status = StatusCode.RESOURCE_EXHAUSTED
	details = "rate limit exceeded"
	debug_error_string = "UNKNOWN:Error received from peer  {grpc_message:"rate limit exceeded", grpc_status:8}"
>

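from grpc import StatusCode
from grpc.aio import AioRpcError


async def list_accounts_repeatedly(client):
    """Call list_accounts in a tight loop and report a rate-limit rejection."""
    try:
        # 1000 back-to-back calls are likely to exhaust the default burst capacity
        for _ in range(1000):
            _ = await client.list_accounts()
    except AioRpcError as e:
        if e.code() == StatusCode.RESOURCE_EXHAUSTED:
            # Rate limit exceeded; the suggested wait is read from the trailing metadata
            retry_after = e.trailing_metadata().get("retry-after")
            print(f"Rate limit exceeded: {e.details()}. Retry after {retry_after}")
        else:
            # Handle other errors
            raise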

Rust Example

When using the Rust SDK:

use tonic::Code;
use tokio::time::{sleep, Duration};
use humantime::parse_duration;

match client.list_accounts().await {
    Ok(_response) => { /* Handle success */ }
    Err(status) if status.code() == Code::ResourceExhausted => {
        // Use the suggested wait if it is present and parseable; otherwise wait 1 second
        let duration = status.metadata()
            .get("retry-after")
            .and_then(|h| h.to_str().ok())
            .and_then(|s| parse_duration(s).ok())
            .unwrap_or(Duration::from_secs(1));

        eprintln!("Rate limit hit. Waiting {:?}...", duration);
        sleep(duration).await;
        // Place your retry logic here
    }
    Err(e) => {
        // Handle all other errors
        return Err(e);
    }
}

Best Practices

Use Streaming Channels Instead of Polling

The most effective way to stay within rate limits is to use Architect's streaming channels instead of repeatedly calling unary endpoints.

Rate limit tokens are consumed per gRPC call, not per message on a stream. Establishing a streaming connection such as orderflow consumes only one token; subsequent messages sent over that stream (PlaceOrder, CancelOrder, GetOrder, etc.) do not consume additional tokens.

| Approach | Rate Limit Impact |
| --- | --- |
| Polling get_order in a loop | 1 token per call ❌ |
| Using orderflow channel | 1 token to connect, unlimited messages ✅ |
| Subscribing to subscribe_orderflow | 1 token to connect, unlimited updates ✅ |

Example: Prefer orderflow over polling


# Bad: Polling consumes rate limit tokens on every call
while True:
    order = await client.get_order('some_order')
    print(f" --> {order}")

# Good: Streaming uses only 1 token for the connection
async for event in client.stream_orderflow():
    print(f" --> {event}")

Batch and Cache Where Possible

Batch requests: Use batch endpoints (e.g., place_batch_order) instead of making multiple individual calls
Cache static data: Cache infrequently changing data like account lists or product definitions to reduce redundant API calls
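As one way to cache infrequently changing data, the sketch below wraps an async fetch in a small time-to-live cache. The `TTLCache` class is hypothetical and not part of the Architect SDK, and `fake_list_accounts` is a stand-in for a real call such as `client.list_accounts()`. The 60-second lifetime is an arbitrary choice; pick one that matches how often the data actually changes.

```python
import asyncio
import time
from typing import Any, Awaitable, Callable, Optional


class TTLCache:
    """Caches the result of one async fetch for `ttl_seconds` (hypothetical helper)."""

    def __init__(self, fetch: Callable[[], Awaitable[Any]], ttl_seconds: float) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._value: Any = None
        self._expires_at: Optional[float] = None
        self._lock = asyncio.Lock()

    async def get(self) -> Any:
        async with self._lock:  # concurrent callers share a single refresh
            now = time.monotonic()
            if self._expires_at is None or now >= self._expires_at:
                self._value = await self._fetch()
                self._expires_at = now + self._ttl
            return self._value


async def demo() -> int:
    calls = 0

    async def fake_list_accounts() -> list:
        # Stand-in for a real SDK call; counts how often the "server" is hit.
        nonlocal calls
        calls += 1
        return ["account-a", "account-b"]

    accounts = TTLCache(fake_list_accounts, ttl_seconds=60.0)
    for _ in range(1000):
        await accounts.get()
    return calls


fetch_count = asyncio.run(demo())
print(fetch_count)  # 1
```

Here 1000 lookups cost a single upstream call, and therefore a single rate limit token, instead of 1000.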

Request Backoff

Clients should treat RESOURCE_EXHAUSTED as a signal to reduce request pressure. Retry after a delay, and double the delay on each consecutive RESOURCE_EXHAUSTED error (exponential backoff). Adding random jitter to each delay helps avoid the thundering herd problem, where many clients retry at the same instant.
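A minimal sketch of this retry strategy follows. It uses the "full jitter" variant, where each delay is drawn uniformly between zero and a ceiling that doubles per attempt; this is one common choice, not something the Architect documentation prescribes. The names `backoff_delays`, `retry_with_backoff` and `FakeRateLimitError`, as well as the default base, cap and attempt count, are all hypothetical.

```python
import asyncio
import random
from typing import Awaitable, Callable, Iterator, Optional, TypeVar

T = TypeVar("T")


def backoff_delays(base: float = 0.1, cap: float = 10.0,
                   rng: Optional[random.Random] = None) -> Iterator[float]:
    """Yield delays drawn uniformly from [0, ceiling]; the ceiling doubles up to `cap`."""
    rng = rng or random.Random()
    attempt = 0
    while True:
        ceiling = min(cap, base * (2 ** attempt))
        yield rng.uniform(0.0, ceiling)
        if ceiling < cap:  # stop growing the exponent once the cap is reached
            attempt += 1


async def retry_with_backoff(call: Callable[[], Awaitable[T]],
                             is_rate_limited: Callable[[BaseException], bool],
                             max_attempts: int = 5,
                             base: float = 0.1, cap: float = 10.0) -> T:
    """Retry `call` while `is_rate_limited(exc)` is true, sleeping between attempts."""
    delays = backoff_delays(base, cap)
    for attempt in range(1, max_attempts + 1):
        try:
            return await call()
        except Exception as exc:
            # Re-raise anything that is not a rate-limit error, or the final failure.
            if not is_rate_limited(exc) or attempt == max_attempts:
                raise
            await asyncio.sleep(next(delays))
    raise ValueError("max_attempts must be at least 1")


class FakeRateLimitError(Exception):
    """Stand-in for an SDK error whose status code is RESOURCE_EXHAUSTED."""


async def demo() -> tuple:
    attempts = 0

    async def flaky_call() -> str:
        nonlocal attempts
        attempts += 1
        if attempts < 3:
            raise FakeRateLimitError("rate limit exceeded")
        return "ok"

    result = await retry_with_backoff(
        flaky_call,
        is_rate_limited=lambda exc: isinstance(exc, FakeRateLimitError),
        base=0.001,  # tiny delays so the demo finishes quickly
    )
    return result, attempts


result, attempts = asyncio.run(demo())
print(result, attempts)  # ok 3
```

With the Python SDK, the predicate would instead check for an `AioRpcError` whose `code()` is `StatusCode.RESOURCE_EXHAUSTED`. If the response carries a retry-after value, waiting at least that long is a reasonable refinement.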

FAQs

What happens when I exceed the rate limit?

When you exceed your rate limit, your request is immediately rejected with a RESOURCE_EXHAUSTED error. This differs from throttling, where requests are slowed down rather than refused. A rejected call is not processed later: if it is still needed, you must send it again.

You must wait for tokens to refill before making additional requests. The error response includes a Retry-After header indicating how long you should wait.

How do I know what my rate limit is?

Rate limit quotas are configured per deployment. The default values are listed under Default Rate Limits; contact the Architect team to learn the specific rate limits for your environment.

Can I request a higher rate limit?

Rate limit quotas are configured based on system capacity and fair usage policies. Contact the Architect team to discuss your specific needs.

Do rate limits apply to all gRPC methods?

Yes, rate limiting applies to all gRPC requests made to Architect services, regardless of the specific method or service being called.

How are rate limits enforced across multiple connections?

Rate limits are enforced per user ID, not per connection. If you have multiple connections or clients using the same user credentials, they all share the same rate limit quota.


Last updated 2 months ago
