Skip to content

Latest commit

 

History

History
193 lines (139 loc) · 14.5 KB

SdkDesign.md

File metadata and controls

193 lines (139 loc) · 14.5 KB

.NET SDK component graph

General component schema

Whenever an operation is executed through the .NET SQL SDK, a public API is invoked. The API will leverage the ClientContext to create the Diagnostics scope for the operation (and any retries involved) and create the RequestMessage.

The handler pipeline is used to process and handle the RequestMessage and perform actions like handling retries including any custom user handler added through CosmosClientOptions.CustomHandlers. See the pipeline section for more details.

At the end of the pipeline, the request is sent to the transport layer, which will process the request depending on the CosmosClientOptions.ConnectionMode and use gateway or direct connectivity mode to reach to the Azure Cosmos DB service.

flowchart LR
    PublicApi[Public API]
    PublicApi <--> ClientContext[Client Context]
    ClientContext <--> Pipeline[Handler Pipeline]
    Pipeline <--> TransportClient[Configured Transport]
    TransportClient <--> Service[(Cosmos DB Service)]
Loading

Handler pipeline

The handler pipeline processes the RequestMessage and each handler can choose to augment it in different ways, as shown in our handler samples and also handle certain error conditions and retry, like our own RetryHandler. The RetryHandler will handle any failures from the Transport layer that should be handled as regional failovers.

The default pipeline structure is:

flowchart
    RequestInvokerHandler <--> UserHandlers([Any User defined handlers])
    UserHandlers <--> DiagnosticHandler
    DiagnosticHandler <--> TelemetryHandler
    TelemetryHandler <--> RetryHandler[Cross region retries]
    RetryHandler <--> ThrottlingRetries[Throttling retries]
    ThrottlingRetries <--> RouteHandler
    RouteHandler <--> IsPartitionedFeedOperation{{Partitioned Feed operation?}}
    IsPartitionedFeedOperation <-- No --> TransportHandler
    IsPartitionedFeedOperation <-- Yes --> InvalidPartitionExceptionRetryHandler
    InvalidPartitionExceptionRetryHandler <--> PartitionKeyRangeHandler
    PartitionKeyRangeHandler <--> TransportHandler
    TransportHandler <--> TransportClient[[Selected Transport]]
Loading

Throttling retries

Any HTTP response, with a status code 429 from the service means the current operation is being rate limited and it's handled by the RetryHandler through the ResourceThrottleRetryPolicy.

The policy will retry the operation using the delay indicated in the x-ms-retryafter response header up to the maximum configured in CosmosClientOptions.MaxRetryAttemptsOnRateLimitedRequests with a default value of 9.

Cross region retries

Any failure response from the Transport that matches the conditions for cross-regional communication are handled by the RetryHandler through the ClientRetryPolicy.

CancellationToken passed down as input to the Public API can stop these retries.

  • HTTP connection failures or DNS resolution problems (HttpRequestException) - The account information is refreshed, the current region is marked unavailable to be used, and the request is retried on the next available region after account refresh if the account has multiple regions. In case no other regions are available, the SDK keeps retrying by refreshing the account information up to a maximum of times.
  • HTTP 403 with Substatus 3 - The current region is no longer a Write region (write region failover), the account information is refreshed, and the request is retried on the new Write region.
  • HTTP 403 with Substatus 1008 - The current region is not available (adding or removing a region). The region is marked as unavailable for 5 minutes and the request retried on the next available region.
  • HTTP 404 with Substatus 1002 - Session consistency request where the region did not yet receive the requested Session Token, the request is retried on the primary (write) region for accounts with a single write region or on the next preferred region for accounts with multiple write regions.
  • HTTP 503 - The request could not succeed on the region due to repeated TCP connectivity issues, the request is retried on the next preferred region.

Transport

Once a RequestMessage reaches the TransportHandler it will be sent through either the GatewayStoreModel for HTTP requests and Gateway mode clients or through the ServerStoreModel for clients configured with Direct mode.

Even on clients configured on Direct mode, there can be HTTP requests that get routed to the Gateway. The ConnectionMode defined in the CosmosClientOptions affects data-plane operations (operations related to Items, like CRUD or query over existing Items in a Container) but metadata/control-plane operations (that appear as MetadataRequests on Azure Monitor) are sent through HTTP to the Gateway.

The ServerStoreModel contains the Direct connectivity stack, which takes care of discovering, for each operation, which is the physical partition to route to and which replica/s should be contacted. The Direct connectivity stack includes a retry layer, a consistency component and the TCP protocol implementation.

The GatewayStoreModel connects to the Cosmos DB Gateway and sends HTTP requests through our CosmosHttpClient, which just wraps the HttpClient through a retry layer to handle transient timeouts.

flowchart
    Start{{HTTP or Direct operation?}}
        Start <-- HTTP --> GatewayStoreModel
        Start <-- Direct --> ServerStoreModel
        
        TCPClient <-- TCP --> R1
        TCPClient <-- TCP --> R17
        TCPClient <-- TCP --> R20

        GatewayService <-- TCP --> R6
        GatewayService <-- TCP --> R3
        GatewayService <-- TCP --> R2

        ServerStoreModel <--> StoreClient
        StoreClient <--> ReplicatedResourceClient
        ReplicatedResourceClient <-- Retries using --> DirectRetry[[Direct mode retry layer]]
        DirectRetry <--> Consistency[[Consistency stack]]
        Consistency <--> TCPClient[TCP implementation]

        GatewayStoreModel <--> GatewayStoreClient
        GatewayStoreClient <--> CosmosHttpClient
        CosmosHttpClient <-- Retries using --> HttpTimeoutPolicy[[HTTP retry layer]]
        HttpTimeoutPolicy <-- HTTPS --> GatewayService
    subgraph Service
        subgraph Partition1
            R1[(Replica 1)]
            R2[(Replica 2)]
            R6[(Replica 6)]
            R8[(Replica 8)]
        end
        subgraph Partition2
            R3[(Replica 3)]
            R20[(Replica 20)]
            R10[(Replica 10)]
            R17[(Replica 17)]
            
        end

        GatewayService[Gateway Service]
    end

Loading

HTTP retry layer

The HttpClient is wrapped around a CosmosHttpClient which employs an HttpTimeoutPolicy to retry if the request has a transient failure (timeout) or if it takes longer than expected. Requests are canceled if latency is higher than expected to account for transient network delays (retrying would be faster than waiting for the request to fail) and for scenarios where the Cosmos DB Gateway is performing rollout upgrades on their endpoints.

The different HTTP retry policies are:

flowchart
    GatewayStoreModel --> GatewayStoreClient
    GatewayStoreClient -- selects --> HttpTimeoutPolicy
    HttpTimeoutPolicy -- sends request through --> CosmosHttpClient
    CosmosHttpClient <-- HTTPS --> GatewayService[(Gateway Service)]
    GatewayService --> IsSuccess{{Request succeeded?}}
    IsSuccess -- No --> IsRetriable{{Transient retryable error / reached max latency?}}
    IsRetriable -- Yes --> HttpTimeoutPolicy

Loading

Direct mode retry layer

Direct connectivity is obtained from the Microsoft.Azure.Cosmos.Direct package reference.

The below code links are to an example branch in this repository that contains the Microsoft.Azure.Cosmos.Direct source code, but that branch might not be updated with the latest source code.

The ServerStoreModel uses the StoreClient to execute Direct operations. The StoreClient is used to capture session token updates (in case of Session consistency) and calls the ReplicatedResourceClient which wraps the request into the GoneAndRetryWithRequestRetryPolicy which retries for up to 30 seconds (or up to the user CancellationToken).

CancellationToken passed down as input to the Public API can stop these retries.

If the retry period (30 seconds) is exhausted, it returns an HTTP 503 ServiceUnavailable error to the caller. Takes care of handling:

  • HTTP 410 Substatus 0 (replica moved) responses from the service or TCP timeouts (connect or send timeouts) -> Refreshes partition addresses and retries.
  • HTTP 410 Substatus 1008 (partition is migrating) responses from the service -> Refreshes partition map (rediscovers partitions) and retries.
  • HTTP 410 Substatus 1007 (partition is splitting) responses from the service -> Refreshes partition map (rediscovers partitions) and retries.
  • HTTP 410 Substatus 1000 responses from the service, up to 3 times -> Refreshes container map (for cases when the container was recreated with the same name) and retries.

ℹ️ There is no delay for the first retry. Further retries start with 1 second and using a backoff multiplier of 2 can go up to 15 seconds.

  • HTTP 449 responses from the service -> Retries using a random salted period.

ℹ️ There is no delay for the first retry. Further retries start with 10 milliseconds and using a backoff multiplier of 2 with a 5 millisecond salt can go up to 1 second.

flowchart
    ServerStoreModel --> StoreClient
    StoreClient --> ReplicatedResourceClient
    ReplicatedResourceClient -- uses --> GoneAndRetryWithRequestRetryPolicy
    GoneAndRetryWithRequestRetryPolicy --> Consistency[[Consistency stack]]
    Consistency --> TCPClient[TCP implementation]
    TCPClient --> Replica[(Replica X)]
    Replica --> IsSuccess{{Request succeeded?}}
    IsSuccess -- No --> IsRetryable{{Is retryable condition?}}
    IsRetryable -- Yes --> GoneAndRetryWithRequestRetryPolicy

Loading

Direct mode caches

Per our connectivity documentation, the SDK will store, in internal caches, critical information to allow for request routing.

For details on the caches, please see the cache design documentation.

Consistency (direct mode)

When performing operations through Direct mode, the SDK is involved in checking consistency for Bounded Staleness and Strong accounts. Read requests are handled by the ConsistencyReader and write requests are handled by the ConsistencyWriter. The ConsistencyReader uses the QuorumReader when the consistency is Bounded Staleness or Strong to verify quorum after performing two requests and comparing the LSNs. If quorum cannot be achieved, the SDK starts what is defined as "barrier requests" to the container and waits for it to achieve quorum. The ConsistencyWriter also performs a similar LSN check after receiving the response from the write, the GlobalCommittedLSN and the item LSN. If they don't match, barrier requests are also performed.

flowchart LR
    ReplicatedResourceClient --> RequestType{{Is it a read request?}}
    RequestType -- Yes --> ConsistencyReader
    RequestType -- No --> ConsistencyWriter
    ConsistencyReader --> QuorumReader
    QuorumReader --> TCPClient[TCP implementation]
    ConsistencyWriter --> TCPClient

Loading