Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Waitblockheight MPP bugs #3914

Merged

Conversation

ZmnSCPxj
Copy link
Collaborator

@ZmnSCPxj ZmnSCPxj commented Aug 9, 2020

So I stumbled on a bug where the payee has a longer chain than the payer while developing #3913.

I wrote this little test plus fix, but it looks like we still have issues with being too aggressive in returning a transient error permanently!

Here are logs on the payer, after the (partial fix) is done on the payer side:

b'2020-08-09T07:12:07.387Z DEBUG plugin-pay: Added a channel hint for 103x1x1/1: enabled true, estimated capacity 4294967295msat'
b"2020-08-09T07:12:07.388Z DEBUG plugin-pay: After filtering routehints we're left with 1 usable hints"
b'2020-08-09T07:12:07.388Z DEBUG plugin-pay: Asking gossipd whether 035d2b1192dfba134e10e540875d366ebc8bc353d5aa766b80c090b39c3a5d885d is reachable without routehints.'
b'2020-08-09T07:12:07.391Z DEBUG plugin-pay: Using routehint 022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d59 (105x1x1) cltv_delta=6'
b'2020-08-09T07:12:07.391Z DEBUG plugin-pay: The destination is not directly reachable excluding attempts without routehints'
b'2020-08-09T07:12:07.391Z INFO plugin-pay: Split into 3 sub-payments due to initial size (20000000msat > 10000000msat)'
b'2020-08-09T07:12:07.395Z DEBUG plugin-pay: Using routehint 022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d59 (105x1x1) cltv_delta=6'
b'2020-08-09T07:12:07.396Z DEBUG plugin-pay: Using routehint 022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d59 (105x1x1) cltv_delta=6'
b'2020-08-09T07:12:07.396Z DEBUG plugin-pay: Using routehint 022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d59 (105x1x1) cltv_delta=6'
b'2020-08-09T07:12:07.699Z INFO plugin-pay: failed: WIRE_INCORRECT_OR_UNKNOWN_PAYMENT_DETAILS (reply from remote)'
b'2020-08-09T07:12:07.700Z INFO plugin-pay: Remote node appears to be on a longer chain, which causes CLTV timeouts to be incorrect. Waiting up to 59 seconds to catch up to block 113 before retrying.'
b'2020-08-09T07:12:07.700Z INFO plugin-pay: failed: WIRE_INCORRECT_OR_UNKNOWN_PAYMENT_DETAILS (reply from remote)'
b'2020-08-09T07:12:07.700Z INFO plugin-pay: Remote node appears to be on a longer chain, which causes CLTV timeouts to be incorrect. Waiting up to 59 seconds to catch up to block 113 before retrying.'
b'2020-08-09T07:12:07.700Z INFO plugin-pay: failed: WIRE_INCORRECT_OR_UNKNOWN_PAYMENT_DETAILS (reply from remote)'
b'2020-08-09T07:12:07.700Z INFO plugin-pay: Remote node appears to be on a longer chain, which causes CLTV timeouts to be incorrect. Waiting up to 59 seconds to catch up to block 113 before retrying.'
b'2020-08-09T07:12:08.226Z DEBUG plugin-pay: Using routehint 022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d59 (105x1x1) cltv_delta=6'
b'2020-08-09T07:12:08.227Z DEBUG plugin-pay: Using routehint 022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d59 (105x1x1) cltv_delta=6'
b'2020-08-09T07:12:08.227Z DEBUG plugin-pay: Using routehint 022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d59 (105x1x1) cltv_delta=6'
b'2020-08-09T07:12:08.558Z INFO plugin-pay: failed: WIRE_INCORRECT_OR_UNKNOWN_PAYMENT_DETAILS (reply from remote)'
b'2020-08-09T07:12:08.559Z INFO plugin-pay: failed: WIRE_INCORRECT_OR_UNKNOWN_PAYMENT_DETAILS (reply from remote)'
b'2020-08-09T07:12:08.559Z INFO plugin-pay: failed: WIRE_INCORRECT_OR_UNKNOWN_PAYMENT_DETAILS (reply from remote)'
b'2020-08-09T07:12:08.678Z INFO plugin-pay: Killing plugin: Plugin exited before completing handshake.'

And here are the relevant logs on the payee:

b'2020-08-09T07:12:07.558Z DEBUG lightningd: WIRE_INCORRECT_OR_UNKNOWN_PAYMENT_DETAILS: lightningd/peer_htlcs.c:503'
b'2020-08-09T07:12:07.562Z DEBUG lightningd: WIRE_INCORRECT_OR_UNKNOWN_PAYMENT_DETAILS: lightningd/peer_htlcs.c:503'
b'2020-08-09T07:12:07.566Z DEBUG lightningd: WIRE_INCORRECT_OR_UNKNOWN_PAYMENT_DETAILS: lightningd/peer_htlcs.c:503'
b'2020-08-09T07:12:08.421Z DEBUG lightningd: WIRE_INCORRECT_OR_UNKNOWN_PAYMENT_DETAILS: lightningd/htlc_set.c:113'
b'2020-08-09T07:12:08.425Z DEBUG lightningd: WIRE_INCORRECT_OR_UNKNOWN_PAYMENT_DETAILS: lightningd/htlc_set.c:113'
b'2020-08-09T07:12:08.429Z DEBUG lightningd: WIRE_INCORRECT_OR_UNKNOWN_PAYMENT_DETAILS: lightningd/htlc_set.c:113'

I find it very strange that payee still thinks there is some problem with the incoming payment. So it looks like we have yet another bug.... LOL

@cdecker for help at the payer side (I think it is a genuine bug as I find the logic of the d->offset fairly weird) and @rustyrussell for help at the payee side (I have no idea about how the htlc_set works, and why the source of the error is different in the initial case (where the payer genuinely has the wrong blockheight) then in the second case (where the payer should have advanced its view of the blockheight already)).

@cdecker
Copy link
Member

cdecker commented Aug 9, 2020

Quite a scare you gave me there: it seems like after the waitblockheight modifier we add the current payment's amount to the onion payload, instead of the total payment amount. This is easily fixed by changing the following msat to payment_root(p)->amount.millisatoshis here:

if (payment_secret != NULL) {
assert(final);
tlvstream_set_tlv_payload_data(fields, payment_secret,
msat);
}

What bothers me more is that the existing code works for most MPP payments, just not in combination with waitblockheight which is strange. Still debugging where the difference comes from. The offset stuff is unlikely to be involved, and the receiver side is correct.

@cdecker
Copy link
Member

cdecker commented Aug 9, 2020

Ok, found the issue: there are three places we add the TLV payload, which may include the total_msat value. All use the the root->amount, except this one:

if (payment_secret != NULL) {
assert(final);
tlvstream_set_tlv_payload_data(fields, payment_secret,
msat);
}

Which is the only invocation that isn't mediated through the MPP modifiers. Using the root amount is correct always, however I still don't quite get why the MPP modifiers doesn't just overwrite the TLV field here:

if (hop->style == ROUTE_HOP_TLV) {
struct tlv_field **fields = &hop->tlv_payload->fields;
tlvstream_set_tlv_payload_data(
fields, root->payment_secret,
root->amount.millisatoshis); /* Raw: onion payload */

and here

if (hop->style == ROUTE_HOP_TLV) {
struct tlv_field **fields = &hop->tlv_payload->fields;
tlvstream_set_tlv_payload_data(
fields, root->payment_secret,
root->amount.millisatoshis); /* Raw: onion payload */

Still looking into that, but the fix should indeed just be using the root->amount.

@ZmnSCPxj
Copy link
Collaborator Author

ZmnSCPxj commented Aug 9, 2020

@cdecker the reason why I think the d->offset computations for routehint_data_init are wrong is because of these logs which appear if I do not change the d->offset compputations:

b'2020-08-09T13:20:26.702Z DEBUG plugin-pay: Added a channel hint for 103x1x1/1: enabled true, estimated capacity 4294967295msat'
b"2020-08-09T13:20:26.703Z DEBUG plugin-pay: After filtering routehints we're left with 1 usable hints"
b'2020-08-09T13:20:26.703Z DEBUG plugin-pay: Asking gossipd whether 035d2b1192dfba134e10e540875d366ebc8bc353d5aa766b80c090b39c3a5d885d is reachable without routehints.'
b'2020-08-09T13:20:26.705Z DEBUG plugin-pay: Using routehint 022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d59 (105x1x1) cltv_delta=6'
b'2020-08-09T13:20:26.706Z DEBUG plugin-pay: The destination is not directly reachable excluding attempts without routehints'
b'2020-08-09T13:20:26.706Z INFO plugin-pay: Split into 2 sub-payments due to initial size (20000000msat > 10000000msat)'
b'2020-08-09T13:20:26.708Z DEBUG plugin-pay: Using routehint 022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d59 (105x1x1) cltv_delta=6'
b'2020-08-09T13:20:26.708Z DEBUG plugin-pay: Using routehint 022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d59 (105x1x1) cltv_delta=6'
b'2020-08-09T13:20:27.048Z INFO plugin-pay: failed: WIRE_INCORRECT_OR_UNKNOWN_PAYMENT_DETAILS (reply from remote)'
b'2020-08-09T13:20:27.049Z INFO plugin-pay: Remote node appears to be on a longer chain, which causes CLTV timeouts to be incorrect. Waiting up to 59 seconds to catch up to block 113 before retrying.'
b'2020-08-09T13:20:27.049Z INFO plugin-pay: failed: WIRE_INCORRECT_OR_UNKNOWN_PAYMENT_DETAILS (reply from remote)'
b'2020-08-09T13:20:27.049Z INFO plugin-pay: Remote node appears to be on a longer chain, which causes CLTV timeouts to be incorrect. Waiting up to 59 seconds to catch up to block 113 before retrying.'
b'2020-08-09T13:20:27.629Z DEBUG plugin-pay: Not using a routehint'
b'2020-08-09T13:20:27.629Z DEBUG plugin-pay: Not using a routehint'
b'2020-08-09T13:20:27.633Z INFO plugin-pay: Error computing a route to 035d2b1192dfba134e10e540875d366ebc8bc353d5aa766b80c090b39c3a5d885d: \\"Could not find a route\\" (205)'
b'2020-08-09T13:20:27.633Z INFO plugin-pay: Error computing a route to 035d2b1192dfba134e10e540875d366ebc8bc353d5aa766b80c090b39c3a5d885d: \\"Could not find a route\\" (205)'
b'2020-08-09T13:20:27.807Z INFO plugin-pay: Killing plugin: Plugin exited before completing handshake.'

As you can see above, the sub-payments used to have a routehint, but after the WIRE_INCORRECT_OR_UNKNOWN_PAYMENT_DETAILS the routehint is forgotten. And the d->offset initialization is wonky. First, it checks if the parent is in RETRY state, but gets the offset from the root payment offset. And it adds +1 instead of retrying the same offset as the retried parent. Which is why I am asking to check it, since it looks wrong to me, and as seen above it leads to forgetting the routehint "early".

This is particularly important since completely unpublished payees can only be accessed via the routehints they put in the invoice, so forgetting routehints should be strongly discouraged.

With both my changes to the routehint_data_init function and the change to use root amount, the test passes. Will push it now.

@ZmnSCPxj ZmnSCPxj force-pushed the waitblockheight-aggressive-routehint branch from 7152d76 to 1b86601 Compare August 9, 2020 13:34
@ZmnSCPxj ZmnSCPxj marked this pull request as ready for review August 9, 2020 13:36
@ZmnSCPxj ZmnSCPxj requested a review from cdecker as a code owner August 9, 2020 13:36
@cdecker
Copy link
Member

cdecker commented Aug 9, 2020

Hm, that is kind of bad tbh: without the d->offset = pd->offset + 1 we will always only consider the first routehint, we will not even try the other ones, and worse we'll not try without routehints despite that being possible sometimes (if the initial getroute check tells us the destination is reachable directly).

We could probably add a signal to the struct payment to allow reusing a routehint in a child that came from the waitblockheight modifier, but that is the kind of cross-talk that may end up undoing some of the nice compartmentalization.

To be honest I don't think the waitblockheight case is going to hit us too often, and I'd hate to be adding cross-modifier signaling just for this rare case. Let's see if we can somehow work around it.

@cdecker cdecker force-pushed the waitblockheight-aggressive-routehint branch from 1b86601 to b1f6976 Compare August 9, 2020 13:43
@ZmnSCPxj
Copy link
Collaborator Author

ZmnSCPxj commented Aug 9, 2020

we will not even try the other ones

But we do! See:

for (; d->offset <numhints; d->offset++) {
curr = d->routehints[d->offset];
if (curr == NULL || !routehint_excluded(p, curr))
return curr;
}

We only try the other ones if the current one is now banned by routehint_excluded, but IMO that is correct: if the problem does not lie in the current routehint, why change the current routehint?

What we should do instead is what I suggest in #3894, i.e. give each sub-payment its own permutation of the order of routehints, so that each sub-payment tries a different routehint if possible.

@cdecker
Copy link
Member

cdecker commented Aug 9, 2020

Hm, you're right, we can rotate the routehints without assuming the last retry was due to the routehint being bad.

I like your idea of giving each split its own iteration order, but think that could also be achieved simply by offsetting the iteration randomly into the routehints list? Basically use d->offset to iterate through the array, and have a d->base = RANDOM be the starting point for that iteration.

Our current routehint then just becomes d->routehints[(d->offset + d->base) % num_routehints]. If instead of assigning a random value to d->base we can make just increment them on each split, spreading partial payments evenly across all routehints.

@cdecker
Copy link
Member

cdecker commented Aug 9, 2020

Nice find that we don't need to increment routehint-offset and retry counter in lock-step btw :-)

@ZmnSCPxj
Copy link
Collaborator Author

ZmnSCPxj commented Aug 9, 2020

hmm so something like d->offset = 0 always, and d->base = pseudorand(tal_count(d->routehints)), in routehint_data_init?

@cdecker
Copy link
Member

cdecker commented Aug 9, 2020

That sounds reasonable to me, it'd mean that we always use the first non-excluded routehint, starting from a random position in the array. d->offset is then just used to determine if we have more routehints we can try before we need to split.

@ZmnSCPxj
Copy link
Collaborator Author

ZmnSCPxj commented Aug 9, 2020

It is less optimal than making a strong effort to ensure that different sub-payments have as diverse ordering as possible, but is very simple to implement.

@ZmnSCPxj ZmnSCPxj force-pushed the waitblockheight-aggressive-routehint branch from b1f6976 to fb24e00 Compare August 9, 2020 14:23
@ZmnSCPxj
Copy link
Collaborator Author

ZmnSCPxj commented Aug 9, 2020

Rebased to fix merge conflicts, also add changelogs. Will go put the randomization we suggested, please wait.

@ZmnSCPxj
Copy link
Collaborator Author

ZmnSCPxj commented Aug 9, 2020

@cdecker there is a good reason for us to advance the d->update, and it is: https://travis-ci.org/github/ElementsProject/lightning/jobs/716299819#L1943

Basically, if my understanding of the code is right:

  • Exceeding the budget needs to be signalled by setting p->route = NULL: the error message is "Fee exceeds our budget ... discarding route" even, but we do not, in fact, discard the route.
  • routehint_step_cb needs to check for failure, and if it is because of an inability to route, should check routehints. If there are more routehints beyond the current one, it should advance the routehint pointer d->offset and retry.
    • We should note as well that the retry paymod does not retry if p->route is NULL. routehints should instead advance this. Maybe.
    • Alternately we could check this in routehint_data_init, if the parent has no route, we should advance. Note we have a failroute_retry that routehint writes to, which we can say amounts to a promise by the routehint paymod to the retry paymod that if failed to route, it will advance the pointer (because there are more routehints to try).

What do you think?

@ZmnSCPxj
Copy link
Collaborator Author

ZmnSCPxj commented Aug 9, 2020

The issue with the above is that when we exceed the constraints, we also exclude the most expensive channel. And the reason for that is that we used to have an actual route randomization, and sometimes it would allow really expensive channels due to randomization. So we need to factor in route randomization (once implemented) in our decision to advance or not advance the routehints pointer. If the reason for route rejection is due to route randomization going overboard, we should consider whether to advance the routehints pointer, or reduce the randomization, but not both (we should advance the routehints pointer first, and if that runs out, reduce randomization and reset the routehints pointer).

@rustyrussell notice interaction with route randomization!

@ZmnSCPxj ZmnSCPxj force-pushed the waitblockheight-aggressive-routehint branch from fb24e00 to 2d43c79 Compare August 10, 2020 02:03
@ZmnSCPxj
Copy link
Collaborator Author

Okay, went with "advance the routehint only if routing failed", and promote constraint violation to a "routing failed". @cdecker thoughts? I think constraint violation should be consider a routing failure.

@ZmnSCPxj
Copy link
Collaborator Author

Yet another reason to advance the routehint: https://travis-ci.org/github/ElementsProject/lightning/jobs/716418486#L2348

In this scenario there exists two routehints, in this order: l3->l4->l5, and l3->l5, l5 is the payee. l4 happens to be down.

This is what happens in the payer:

b'2020-08-10T03:04:25.353Z DEBUG plugin-pay: Added a channel hint for 103x1x1/1: enabled true, estimated capacity 957757992msat'
b'2020-08-10T03:04:25.353Z INFO plugin-pay: Payment fee constraint 500msat is below exemption threshold, allowing a maximum fee of 5000msat'
b"2020-08-10T03:04:25.355Z DEBUG plugin-pay: After filtering routehints we're left with 2 usable hints"
b'2020-08-10T03:04:25.355Z DEBUG plugin-pay: Asking gossipd whether 032cf15d1ad9c4a08d26eab1918f732d8ef8fdc6abb9640bf3db174372c491304e is reachable without routehints.'
b'2020-08-10T03:04:25.357Z DEBUG plugin-pay: Using routehint 035d2b1192dfba134e10e540875d366ebc8bc353d5aa766b80c090b39c3a5d885d (110x1x1) cltv_delta=6'
b'2020-08-10T03:04:25.357Z DEBUG plugin-pay: The destination is not directly reachable excluding attempts without routehints'
b'2020-08-10T03:04:25.671Z DEBUG plugin-pay: Added a channel hint for 110x1x1/0: enabled true, estimated capacity 101001msat'
b'2020-08-10T03:04:25.671Z INFO plugin-pay: failed: WIRE_TEMPORARY_CHANNEL_FAILURE (reply from remote)'
b'2020-08-10T03:04:25.671Z DEBUG plugin-pay: Retrying d7b7f51de4b0c08aa1276b174f6b154612605d0f3337bfa489c2f7e2b3c47f8c/1 (100000msat), new partid 2. 9 attempts left\\n'
b'2020-08-10T03:04:25.672Z DEBUG plugin-pay: Using routehint 035d2b1192dfba134e10e540875d366ebc8bc353d5aa766b80c090b39c3a5d885d (110x1x1) cltv_delta=6'
b'2020-08-10T03:04:25.997Z INFO plugin-pay: failed: WIRE_TEMPORARY_CHANNEL_FAILURE (reply from remote)'
b'2020-08-10T03:04:25.997Z DEBUG plugin-pay: Retrying d7b7f51de4b0c08aa1276b174f6b154612605d0f3337bfa489c2f7e2b3c47f8c/2 (100000msat), new partid 3. 8 attempts left\\n'
b'2020-08-10T03:04:25.998Z DEBUG plugin-pay: Using routehint 035d2b1192dfba134e10e540875d366ebc8bc353d5aa766b80c090b39c3a5d885d (119x1x0) cltv_delta=6'
b'2020-08-10T03:04:26.660Z INFO plugin-pay: Killing plugin: Plugin exited before completing handshake.'

So on the first WIRE_TEMPORARY_CHANNEL_FAILURE we added a channel hint for 110x1x1/0. However, the routehint paymod does not consider the channel hints in next_routehint. This is because routehint_excluded does not consider channel hints. The correct solution here is to have routehint_excluded check channel hints.

@ZmnSCPxj ZmnSCPxj force-pushed the waitblockheight-aggressive-routehint branch from 2d43c79 to d701179 Compare August 10, 2020 04:11
@ZmnSCPxj
Copy link
Collaborator Author

Fixed the above. Hopefully this is the final added fix.

@ZmnSCPxj
Copy link
Collaborator Author

YAY IT PASSES

@cdecker re: the routehint randomization, I will put in #3913 instead, as that adds multi-routehint invoices. I will rebase 3913 on top of this one, I think the flaky in 3913 would be resolved by the changes here.

@ZmnSCPxj
Copy link
Collaborator Author

I now realized that the waitblockheight condition would often cause the paymod system to give up too early due to setting the abort flag in handle_final_failure.

@ZmnSCPxj ZmnSCPxj force-pushed the waitblockheight-aggressive-routehint branch from df42e6a to 586ff15 Compare August 11, 2020 03:21
@ZmnSCPxj ZmnSCPxj added this to the v0.9.1 milestone Aug 11, 2020
@ZmnSCPxj ZmnSCPxj force-pushed the waitblockheight-aggressive-routehint branch from 586ff15 to 6c63829 Compare August 11, 2020 07:43
@ZmnSCPxj
Copy link
Collaborator Author

Rebased. Also, bump

@ZmnSCPxj ZmnSCPxj mentioned this pull request Aug 11, 2020
ZmnSCPxj and others added 4 commits August 12, 2020 19:13
… causes us to advance routehints too aggressively.

The worst effect is that unpublished nodes are harder to pay, but
even published ones make us do unnecessary work, since we are
losing routehints from the published ones that could help us
actually route better to them.
…utehints.

Only advance through routehints if no route was found at all, or if the
estimated capacity at the routehint is lower than the amount that we
have to send through the routehint.

Changelog-Fixed: pay: Be less aggressive with forgetting routehints.
Reported-by: ZmnSCPxj
Signed-off-by: Christian Decker <@cdecker>

Changelog-Fixed: pay: Correct a case where we put the sub-payment value instead of the *total* value in the `total_msat` field of a multi-part payment.
…prevent all future progress.

Blockheight disagreement is signalled with a permanent failure at the
end node, but is actually a transient failure.
@cdecker cdecker force-pushed the waitblockheight-aggressive-routehint branch from 6c63829 to ce05d3f Compare August 12, 2020 17:14
@cdecker
Copy link
Member

cdecker commented Aug 12, 2020

Rebased on top of master

@cdecker
Copy link
Member

cdecker commented Aug 13, 2020

ACK ce05d3f

@cdecker cdecker merged commit f81611e into ElementsProject:master Aug 13, 2020
@ZmnSCPxj ZmnSCPxj deleted the waitblockheight-aggressive-routehint branch August 13, 2020 11:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants