[exporter/loadbalancing] Support consistency between scale-out events #33959

jamesmoessis opened this issue Jul 9, 2024 · 3 comments
Labels: enhancement (New feature or request), exporter/loadbalancing, needs triage (New item requiring triage), Stale


Component(s)

exporter/loadbalancing

Is your feature request related to a problem? Please describe.

When a scale-out event occurs, the loadbalancing exporter goes from having n endpoints to n+1 endpoints, so once the scaling event is complete the data is divided across the endpoints differently than before.

Consider the case where a trace has 2 spans and the load balancing exporter is configured to route by trace ID. Span (a) arrives and is routed to a given host. The scaling event then occurs, and span (b) arrives and is routed to a different host.
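To make that concrete, here is a minimal, runnable Go sketch (deliberately using naive hash-mod routing rather than the exporter's actual consistent-hash ring; the trace ID and backend names are made up) showing how the routing decision for the same trace ID can change once the endpoint list grows from n to n+1:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// routeByTraceID picks a backend by hashing the trace ID over the current
// endpoint list. Simplified on purpose: naive hash-mod, not the exporter's
// consistent-hash ring.
func routeByTraceID(traceID string, endpoints []string) string {
	h := fnv.New32a()
	h.Write([]byte(traceID))
	return endpoints[h.Sum32()%uint32(len(endpoints))]
}

func main() {
	traceID := "4bf92f3577b34da6a3ce929d0e0e4736" // example trace ID

	before := []string{"backend-0", "backend-1"}             // n endpoints
	after := []string{"backend-0", "backend-1", "backend-2"} // n+1 after scale-out

	// Span (a) is routed before the scale-out, span (b) after it.
	fmt.Println("span (a) ->", routeByTraceID(traceID, before))
	fmt.Println("span (b) ->", routeByTraceID(traceID, after))
	// If the two outputs differ, the trace has been split across two backends.
}
```

The exporter's consistent-hash ring reduces how many trace IDs move (see the comments below), but it does not eliminate the window in which the two spans can land on different hosts.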

We need a way to scale out our backend effectively, without these inconsistencies and while maintaining performance.

This is a separate problem from a scale-in event, which in my opinion presents a different set of problems and requires the terminating node to flush any data it is statefully holding onto. It may be worth discussing here, but I want to focus on the simpler case of a scale-out event.

Describe the solution you'd like

I don't know exactly what the solution should be yet; I'm hoping this thread will generate discussion so we can converge on a solution that works.

Essentially every solution I've seen discussed for this problem involves some kind of cache that holds the trace ID as the key and the backend as the value. My question to the community is: how should this cache be implemented?
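For the sake of discussion, here is one rough sketch of what such a cache could look like; the type names, TTL strategy, and the pick callback are all hypothetical, not a concrete design proposal:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// stickyCache is one hypothetical shape for the cache discussed above: it pins
// a trace ID to the backend chosen on first sight, so spans arriving after a
// scale-out keep going to the original backend. Entries expire after a TTL so
// the cache does not grow without bound.
type stickyCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	entries map[string]stickyEntry
}

type stickyEntry struct {
	backend  string
	lastSeen time.Time
}

func newStickyCache(ttl time.Duration) *stickyCache {
	return &stickyCache{ttl: ttl, entries: make(map[string]stickyEntry)}
}

// backendFor returns the cached backend for traceID, or records and returns
// the one produced by pick (e.g. the consistent-hash ring) on a miss or after
// the entry has expired.
func (c *stickyCache) backendFor(traceID string, pick func() string) string {
	c.mu.Lock()
	defer c.mu.Unlock()

	now := time.Now()
	if e, ok := c.entries[traceID]; ok && now.Sub(e.lastSeen) < c.ttl {
		e.lastSeen = now
		c.entries[traceID] = e
		return e.backend
	}
	b := pick()
	c.entries[traceID] = stickyEntry{backend: b, lastSeen: now}
	return b
}

func main() {
	cache := newStickyCache(5 * time.Minute)
	pick := func() string { return "backend-1" } // stand-in for the real routing decision
	fmt.Println(cache.backendFor("4bf92f3577b34da6a3ce929d0e0e4736", pick))
}
```

Open questions with any variant of this include memory bounds and eviction, how the TTL interacts with long-running traces, and the fact that each load-balancer instance would hold its own cache unless it is somehow shared or replicated.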

Describe alternatives you've considered

No response

Additional context

No response


github-actions bot commented Jul 9, 2024

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.


jpkrohling (Member) commented Jul 9, 2024

This is a problem inherent to distributed systems: there's only so much we can do before we have to document the use cases we won't support. When designing the load-balancing exporter, the trade-off was: either scaling events aren't frequent and we need fewer sync stages (the resolvers' periodic intervals can be longer), or scaling events are frequent and we need more frequent, shorter refresh intervals.

This still requires no coordination between the nodes, acknowledging that they might end up making different decisions for the same trace ID during the moments when one load balancer is out of sync with the others. Hopefully this is a short period of time, but it will happen. To alleviate some of the pain, I chose an algorithm that is a bit more expensive than the alternatives but brings some stability with it: my recollection from the paper was that changes to the circle would affect only about a third of the hashes. I think it was (and still is, for most cases) a good compromise.
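As an illustration of that stability property, here is a toy consistent-hash ring (not the exporter's implementation; the node names, virtual-node count, and sample size are arbitrary) that estimates how many trace IDs get remapped when a fourth backend is added:

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"math/rand"
	"sort"
)

// A toy consistent-hash ring, used only to show that adding a backend remaps
// a bounded fraction of trace IDs instead of nearly all of them.
type ring struct {
	points []uint32
	owner  map[uint32]string
}

func hash32(s string) uint32 {
	sum := sha256.Sum256([]byte(s))
	return binary.BigEndian.Uint32(sum[:4])
}

func newRing(nodes []string, virtualNodes int) *ring {
	r := &ring{owner: make(map[uint32]string)}
	for _, n := range nodes {
		for v := 0; v < virtualNodes; v++ {
			p := hash32(fmt.Sprintf("%s#%d", n, v))
			r.points = append(r.points, p)
			r.owner[p] = n
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// lookup returns the node owning the first ring point at or after the key's hash.
func (r *ring) lookup(key string) string {
	h := hash32(key)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0 // wrap around the circle
	}
	return r.owner[r.points[i]]
}

func main() {
	before := newRing([]string{"backend-0", "backend-1", "backend-2"}, 100)
	after := newRing([]string{"backend-0", "backend-1", "backend-2", "backend-3"}, 100)

	const total = 100000
	moved := 0
	for i := 0; i < total; i++ {
		traceID := fmt.Sprintf("%032x", rand.Uint64())
		if before.lookup(traceID) != after.lookup(traceID) {
			moved++
		}
	}
	// With an ideal ring, roughly 1/(n+1) of the keys move when going from n
	// to n+1 backends (about 25% for 3 -> 4 here), versus the large majority
	// that naive hash-mod routing would remap.
	fmt.Printf("remapped: %.1f%%\n", 100*float64(moved)/total)
}
```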

If we want to make this even better in terms of consistency, I have considered a few alternatives in the past, which should be doable as different implementations of the resolver interface. My favorite is to use a distributed key/value store, like etcd or ZooKeeper, which would allow all nodes to get the same data at the same time, including updates. Consensus would be handled there, but we'd still need to handle split-brain scenarios (which is what I was trying to avoid in the first place). Another thing I considered was implementing a gossip extension so that load-balancer instances could communicate with each other, and we'd implement the consensus algorithm ourselves (likely Raft, using an external library?). Again, split-brain would probably be something we'd have to handle ourselves.
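To illustrate the key/value-store direction only, here is a rough sketch of every load-balancer instance watching the same etcd prefix for the backend list; the prefix, callback shape, and etcd address are assumptions, and this is not wired into the exporter's actual resolver interface:

```go
package main

import (
	"context"
	"log"
	"sort"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// watchBackends sketches the shared key/value-store idea: every load-balancer
// instance watches the same etcd prefix, so all of them observe endpoint
// changes at (nearly) the same time.
func watchBackends(ctx context.Context, cli *clientv3.Client, prefix string, onChange func([]string)) error {
	current := map[string]string{} // etcd key -> backend endpoint

	// Seed from the current state of the prefix.
	resp, err := cli.Get(ctx, prefix, clientv3.WithPrefix())
	if err != nil {
		return err
	}
	for _, kv := range resp.Kvs {
		current[string(kv.Key)] = string(kv.Value)
	}
	onChange(sortedValues(current))

	// Then follow updates as backends register and deregister themselves.
	for wresp := range cli.Watch(ctx, prefix, clientv3.WithPrefix()) {
		for _, ev := range wresp.Events {
			switch ev.Type {
			case clientv3.EventTypePut:
				current[string(ev.Kv.Key)] = string(ev.Kv.Value)
			case clientv3.EventTypeDelete:
				delete(current, string(ev.Kv.Key))
			}
		}
		onChange(sortedValues(current))
	}
	return ctx.Err()
}

func sortedValues(m map[string]string) []string {
	out := make([]string, 0, len(m))
	for _, v := range m {
		out = append(out, v)
	}
	sort.Strings(out)
	return out
}

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"}, // assumed etcd address
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// "/collector/lb/backends/" is a hypothetical prefix under which each
	// backend would register its endpoint.
	err = watchBackends(context.Background(), cli, "/collector/lb/backends/", func(endpoints []string) {
		log.Println("endpoint list changed:", endpoints) // feed this into the hash ring
	})
	if err != nil {
		log.Fatal(err)
	}
}
```

The parts this sketch glosses over (consensus, availability during partitions, and the split-brain handling mentioned above) are exactly the hard bits.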

If we want to have consistency even on split-brain scenarios... well, then I don't know :-) I'm not ready to think about a solution for that if we don't have that problem yet.

Anyway: thank you for opening this issue! I finally got those things out of my head :-)

github-actions bot commented Sep 9, 2024

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

