Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Maintain a stable order of children context, resolves a non-determinism around cancels #1183

Merged
merged 2 commits into from
Nov 1, 2022

Conversation

Groxx
Copy link
Contributor

@Groxx Groxx commented Aug 6, 2022

After some tough-to-identify determinism issues in what appeared to be correct user workflows, and some investigations by both us and them:
This PR resolves a non-deterministic behavior involving child context cancellation propagation, in particular when unblocking selects based on those contexts (possibly transitively, e.g. via activity futures).

As this was previously non-deterministic behavior, both the previous and new code could cause determinism failures after upgrading... but the random execution order previously stood a good chance of failing a few times and then automatically resolving itself. Unfortunately that is not maintained here - failures are likely to be permanent.

Resolving this is... probably not feasible currently. We do not record client-library versions in workflow history, so we cannot maintain backwards compatibility accurately in scenarios like this. We almost certainly should record this on decisions, at least when it changes - we could randomly cancel entries in the list when replaying old decisions, and allow the random behavior to eventually choose a stable execution on a host somewhere.

In any case, for all future workflows this makes behavior deterministic, and should resolve the issue for good.


A full repro can be seen with:

  1. Create multiple cancellable child contexts off a single cancellable parent context, populating its child-context map.
  2. Base some behavior off each child context. Any one-shot logic works, but activities are pretty easy and occur a lot in practice (i.e. waiting on N activities, and being able to cancel many at once).
  3. Block on the selector.
  4. Cancel the parent context. This will:
    1. Cancel the parent context
    2. Propagate that to a random child context
    3. Which will synchronously resolve the future(s) attached to the child context
    4. Which will synchronously trigger any pending callbacks
    5. One of which is a "first call wins" closure which the selector uses to choose which branch to execute

Maintaining the children contexts in an order resolves this, as it ensures the same child is canceled first (then second, etc) each time. Any order should work.
For clearer semantics, I chose to implement it as a compacting FIFO list (as children can remove themselves if they are cancelled independently). This is not noticeably costly (maintenance in a large list will be dwarfed by any side effects of canceling) and it makes it very easy to define and hopefully maintain, as it must not be changed.


This order decision will not be a defined semantic of workflows, however. Cancellation of multiple futures / selector branches should be treated as unordered, and implementing exactly the same behavior in other languages may not be efficient.
In a future implementation it may be worth making selectors choose from any available branch pseudo-randomly, e.g. by run-ID, for the same reason Go explicitly randomizes these behaviors: it prevents accidentally depending on implementation details, by exposing logical flaws sooner.

@coveralls
Copy link

coveralls commented Aug 6, 2022

Pull Request Test Coverage Report for Build 01843045-5520-4306-94ac-81baa73093bf

  • 19 of 19 (100.0%) changed or added relevant lines in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.03%) to 64.171%

Totals Coverage Status
Change from base Build 01838582-1814-4eb0-a654-48c8844fff22: 0.03%
Covered Lines: 12618
Relevant Lines: 19663

💛 - Coveralls

@Groxx Groxx changed the title Maintain a stable order of children context, hopefully resolves some non-determinism around cancels Maintain a stable order of children context, resolves some non-determinism around cancels Oct 31, 2022
…non-determinism around cancels

We have some non-deterministic failures in a user's workflows, which (as far as either of us can tell)
are not due to flawed workflow code.  We have been unable to reliably reproduce it to investigate, but
there is this map of contexts which are iterated over to propagate cancellation information, and that
seems at least slightly suspect.

I'm not confident that this map can or cannot cause non-determinism elsewhere, but making it clearly
deterministic seems both safe[1] and reduces questionable code.

---

I've opted to keep the list in order while removing entries because it's likely to be a small list (and
even if it is not, it'll be dwarfed by other costs associated with code that caused it to be large),
and because it seems probably easier to reason about by hand if needed: the list of cancellation signals
will be first-in, first-out, rather than being based on when things are removed from the list.

Either way should be perfectly repeatable though, so I'm happy to be convinced to do the more efficient
thing instead.  I'm not totally sold on its utility anyway.

---

[1]: If the map's random order does not cause non-determinism failures, changing the execution order
     like this PR does will also not cause non-determinism failures.
     If it *does*, then this change is no worse (it will not introduce *more* non-determinism due to
     the change) and it may resolve the problem for future workflows.
     So this is at worst identical, and at best a fix.
@Groxx Groxx changed the title Maintain a stable order of children context, resolves some non-determinism around cancels Maintain a stable order of children context, resolves a non-determinism around cancels Oct 31, 2022
@Groxx Groxx merged commit dcaec77 into uber-go:master Nov 1, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants