Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

routing: fix stuck inflight payments #8174

Merged

Conversation

yyforyongyu
Copy link
Collaborator

@yyforyongyu yyforyongyu commented Nov 14, 2023

Replaces #8150, this PR attempts to fix #8146.

Timed out HTLCs are properly handle

The original assumption was, when the HTLC timed out, the attempt result wasn't sent back the payment lifecycle, thus causing the payment to be in stuck. However, this assumption is not true, as shown in the newly added itest - when an outgoing HTLC has timed out, the payment will be marked as failed. Moreover, if the outgoing HTLC is swept by the remote via the preimage path, the payment will be marked as succeeded. All these are expected behavior.

HTLC Attempt result is not saved

If there's already a result in db, GetAttemptResult will return it, so the result must not be found in db. Also notice that, since there's no result in db, the only line we can hit is:

nChan, err = s.networkResults.subscribeResult(attemptID)

as otherwise hitting this line will return the error ErrPaymentIDNotFound:
res, err := s.networkResults.getResult(attemptID)

And the method GetAttemptResult is blocking at line:
res, err := s.networkResults.getResult(attemptID)

On the other hand, these network results will be cleaned when ChannelRouter starts here:

if err := r.cfg.Payer.CleanStore(toKeep); err != nil {

But this cleanup only applied to terminated payments. For inflight payments, we always keep the network results. This rules out the possibility that the HTLC attempt once had a result but was later removed.

This seems to leave us only one possibility - that the network result was never written to disk, causing an HTLC attempt waiting for its result forever, hence the payment stayed inflight, which leads to suspicion that this line is hit, thus the line go s.handleLocalResponse(packet) is skipped so the result is not saved:

lnd/htlcswitch/switch.go

Lines 1289 to 1293 in e02fd39

// closeCircuit returns a nil circuit when a settle packet returns an
// ErrUnknownCircuit error upon the inner call to CloseCircuit.
if circuit == nil {
return nil
}

This PR refactors handlePacketForward and makes sure the network result is always saved for locally-initiated packet.

Fixes,

@yyforyongyu yyforyongyu added payments Related to invoices/payments bug fix labels Nov 14, 2023
@yyforyongyu yyforyongyu force-pushed the fix-inflight-payments branch 4 times, most recently from 1435e5b to 2793c53 Compare November 15, 2023 17:16
@yyforyongyu yyforyongyu self-assigned this Nov 16, 2023
@yyforyongyu yyforyongyu added this to the v0.18.0 milestone Nov 16, 2023
@yyforyongyu yyforyongyu force-pushed the fix-inflight-payments branch 6 times, most recently from d044321 to a80524d Compare November 19, 2023 14:42
@saubyk
Copy link
Collaborator

saubyk commented Nov 28, 2023

cc: @Roasbeef for approach ack

@saubyk saubyk modified the milestones: v0.18.0, v0.18.1 Mar 3, 2024
Copy link
Collaborator

@bitromortac bitromortac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First pass done, looks really good already 🔥. I'll let this run on a node that has stuck inflight payments.

@@ -562,4 +562,8 @@ var allTestCases = []*lntest.TestCase{
Name: "payment succeeded htlc remote swept",
TestFunc: testPaymentSucceededHTLCRemoteSwept,
},
{
Name: "send to route failed htlc timeout",
TestFunc: testSendToRouteFailHTLCTimeout,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the repeated itest needed? What's the main difference for SendPaymentV2?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I follow, but it's testing a different RPC? Tho SendToRouteV2 looks similar to SendPaymentV2, it's nonetheless an independent RPC that can diverge someday.

@@ -978,7 +978,7 @@ func (s *Switch) handleLocalResponse(pkt *htlcPacket) {
s.cfg.HtlcNotifier.NotifySettleEvent(key, htlc.PaymentPreimage,
eventType)

case *lnwire.UpdateFailHTLC:
case *lnwire.UpdateFailHTLC, *lnwire.UpdateFailMalformedHTLC:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is that change needed? I think for the same reason mentioned below, we don't expect UpdateFailMalformedHTLC here because of the conversion, right?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! Yeah based on the callsites I don't think this is needed, now removed.

htlcswitch/switch.go Show resolved Hide resolved
htlcs[a.AttemptID] = hash
}

payHash := payment.Info.PaymentIdentifier
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: payHash can be used more to compress some lines below

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done!

routing/router.go Show resolved Hide resolved
@yyforyongyu
Copy link
Collaborator Author

lightninglabs-deploy mute 336h

@saubyk saubyk added the P2 should be fixed if one has time label Jun 25, 2024
@yyforyongyu
Copy link
Collaborator Author

!lightninglabs-deploy mute 336h30m

1 similar comment
@yyforyongyu
Copy link
Collaborator Author

!lightninglabs-deploy mute 336h30m

@kilrau
Copy link
Contributor

kilrau commented Jul 20, 2024

Why are you guys muting this? This happens in prod and imo serious

@saubyk
Copy link
Collaborator

saubyk commented Jul 20, 2024

Why are you guys muting this? This happens in prod and imo serious

Hi @kilrau the notifications for reviews were muted as YY focused on other prs, but as you can see the pr is still tagged for 18.3 and will be picked up again soon

@kilrau
Copy link
Contributor

kilrau commented Jul 20, 2024

Why are you guys muting this? This happens in prod and imo serious

Hi @kilrau the notifications for reviews were muted as YY focused on other prs, but as you can see the pr is still tagged for 18.3 and will be picked up again soon

Getting this into 0.18.3 sounds good 🙏

Copy link
Contributor

coderabbitai bot commented Jul 22, 2024

Important

Review skipped

Auto reviews are limited to specific labels.

Labels to auto review (1)
  • llm-review

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

Share
Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai generate interesting stats about this repository and render them as a table.
    • @coderabbitai show all the console.log statements in this repository.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (invoked as PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Additionally, you can add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@ziggie1984 ziggie1984 removed the request for review from bhandras July 31, 2024 12:08
@ziggie1984
Copy link
Collaborator

Exchanged with the guys from the BOLTZ team, and it seems like they are suffering from another issue in regards of Inflight Payments which I am not sure this PR will solve.

There seems to be an edge case where we fail a payment ( exitresumePayment) although we have still Inflight HTLCs which need to be resolved. This has some side effects:

When a service tries to use trackpayment => this call remains pending although we already canceled all resultCollector goroutines, so it will pending until the next restart.

Moreover this payment will be inflight until the next restart in general, where we try to recollect all the Inflight HTLCs of a payment.

I think I might have found the culprit based on a logs of a payment attempt which got stuck:

image_2024-08-01_19-09-16

So basically we are in the process of adding a new HTLC Attempt to the payment here:

Code parts taken from this PR.

attempt, err := p.registerAttempt(rt, ps.RemainingAmt)
		if err != nil {
			return exitWithErr(err)
		}

This will however return an error if the payment is failed already in
func (p *controlTower) RegisterAttempt(....

// Check if registering a new attempt is allowed.
		if err := payment.Registrable(); err != nil {
			return err
		}

hence the Payment Lifecycle will exit with (as seen in the picture):


	// If the payment is already failed, we won't allow adding more HTLCs.
	if m.State.PaymentFailed {
		return ErrPaymentPendingFailed
	}

I think the problem is For MPP when we hit the MPP Timeout error we would mark a payment prematurely as failed although HTLCs are still pending in handleSwitchErr

Because we would fail a payment as soon as the final hop reported the failure which is the case for the MPP Timeout for example.

I think my analysis describes whats happening in the above log picture, although it's not a full log but just a grep of the payment hash. However I think we really need to be carefull to not prematurely kill the payment lifecycle until all HTLCs reported back, or when killing the payment flow, we should at least also kill the trackpayment stream so that the subscriber is not waiting forever.

@ziggie1984
Copy link
Collaborator

here is another log which shows the MPP-Timeout of one of the HTLCAttempts:

image_2024-08-01_19-24-22

Copy link
Collaborator

@ziggie1984 ziggie1984 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First pass done, mostly nits. Except from my other comment related to the other problem with Inflight HTLCs.

itest/lnd_payment_test.go Outdated Show resolved Hide resolved
@@ -3041,3 +2949,162 @@ func (s *Switch) handlePacketAdd(packet *htlcPacket,

return destination.handleSwitchPacket(packet)
}

// handlePacketAdd handles forwarding a settle packet.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: handlePacketAdd => handlePacketSettle

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

// ErrUnknownCircuit error upon the inner call to CloseCircuit.
//
// NOTE: We can only get a nil circuit when it has already been deleted
// when `UpdateFulfillHTLC` is received. After which `RevokeAndAck` is
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: ... and when UpdateFulfillHTLC is received.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done!

return s.mailOrchestrator.Deliver(packet.incomingChanID, packet)
}

// handlePacketAdd handles forwarding a fail packet.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: godocs

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

payStream := alice.RPC.TrackPaymentV2(payHash[:])
dustPayStream := alice.RPC.TrackPaymentV2(dustPayHash[:])

// Check that the dust payment is failed
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Maybe add a comment why both checks, the stream and the normal findpayment command should be done here ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done!

htlcswitch/switch.go Show resolved Hide resolved
// handlePacketAdd handles forwarding a settle packet.
func (s *Switch) handlePacketSettle(packet *htlcPacket) error {
// If the source of this packet has not been set, use the circuit map
// to lookup the origin.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q: when can this happen, would be good to describe it in the comment.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool let me dig a bit...

@@ -33,6 +33,9 @@
created](https://github.com/lightningnetwork/lnd/pull/8921) when multiple RPC
calls are made concurrently.

* [Fixed](https://github.com/lightningnetwork/lnd/pull/8174) old payments that
are stuck inflight. Check more info [here](https://github.com/lightningnetwork/lnd/issues/8146).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wondering whether the three commits on top of this PR are related ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you mean the last three? yeah they are related as they are required to get the itest pass.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe switch the order so we have the release at the end ?

@yyforyongyu
Copy link
Collaborator Author

Interesting case @ziggie1984! Cool could you open a new issue to track this instead - think I know what happened, but don't have a good fix, plus I prefer not growing the scope of this PR to stay focused. In short it's due to async fetching of the same payment, and it happens when an invoice times out exactly during the period while the node is still restarting. Would also be great to have raw logs instead of screenshots, as the timestamps were messed up and difficult to analyze.

Copy link
Collaborator

@bitromortac bitromortac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, looks very close 🚀, just a few nits and questions.

htlcswitch/switch.go Outdated Show resolved Hide resolved
htlcswitch/switch.go Outdated Show resolved Hide resolved
routing/router.go Outdated Show resolved Hide resolved
routing/router.go Outdated Show resolved Hide resolved
routing/router.go Show resolved Hide resolved
routing/router.go Outdated Show resolved Hide resolved
Copy link
Collaborator

@ziggie1984 ziggie1984 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 🐝, still had some general questions about this code change but all non-blocking

}

log.Debugf("Cleaning network result store.")
if err := r.cfg.Payer.CleanStore(toKeep); err != nil {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we clean the store only once on startup ? Would it make sense to run this periodically:

from the comment of the clean function:

This should be called
// preiodically to let the switch clean up payment results that we have
// handled.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good q - yeah I think it should be cleaned more frequently. I also remember we can fetch whatever is stored here from our open channel, which makes this store redundant. Will make an issue to track it - either clean it more often or just don't store it.

a := a

// We check whether the individual attempts have their
// HTLC hash set, if not we'll fall back to the overall
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q: Is the hash set equal of AMP being used ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah I think so

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: maybe add it to the comment then, I would have found it useful ?

// We pass in a non-timeout context, to indicate we don't need
// it to timeout. It will stop immediately after the existing
// attempt has finished anyway. We also set a zero fee limit,
// as no more routes should be tried.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be => can we be sure that INFLIGHT HTLCs will never have payment status (e.g. StatusInFlight) which could lead us to retry the payment and send more HTCL attempts ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think once there are inflight HTLCs the payment should always be StatusInFlight tho,

// Based on the above variables, we derive the status using the following
// table,
// | inflight | settled | htlc failed | payment failed | status |
// |:--------:|:-------:|:-----------:|:--------------:|:--------------------:|
// | true | true | true | true | StatusInFlight |
// | true | true | true | false | StatusInFlight |
// | true | true | false | true | StatusInFlight |
// | true | true | false | false | StatusInFlight |
// | true | false | true | true | StatusInFlight |
// | true | false | true | false | StatusInFlight |
// | true | false | false | true | StatusInFlight |
// | true | false | false | false | StatusInFlight |

Since the fee limit is zero, it means there are no routes to be tried, unless we can find a zero-fee route.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would the zero-fee route cause a problem in terms of not setting a timeout here ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if there exists a zero-fee route (which seems impossible?), then whether to send another attempt or not is determined by the remaining payment amt, if it's not zero, then another attempt will be tried.

Copy link
Collaborator

@bitromortac bitromortac Aug 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A question is whether we really would like to set a zero fee limit here (and starve the payment), as in principle we may want to have payments to continue after a restart, right? I think that's why resumePayment was added. Do we have an itest for a payment that continues its pathfinding after a restart?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question! I think we need to revisit the restart process a bit - atm, there's the existing issue re how we should go and handle the payment timeout from the sender side. The second issue now is the one you mentioned above. Say we have this case,

  1. invoice has a timeout of 60s.
  2. the sender tries to fulfill the invoice via MPP, say htlc1 and htlc2.
  3. htlc1 is sent, but if the sender restarts before sending htlc2(or even before the creation of it, as long as there's a remaining amount), and,
  4. the sender's restart finishes before the invoice times out.
  5. the payment is now resumed, with only htlc1 sent, and some remaining amount to try.

In this case, this payment is def gonna fail since it won't try another route. I also don't think this is an urgent issue, also guess that's why we log errors during the shutdown process if we still have inflight payments. There are multiple ways to fix it and think it's worthy having a tracking issue.

@@ -33,6 +33,9 @@
created](https://github.com/lightningnetwork/lnd/pull/8921) when multiple RPC
calls are made concurrently.

* [Fixed](https://github.com/lightningnetwork/lnd/pull/8174) old payments that
are stuck inflight. Check more info [here](https://github.com/lightningnetwork/lnd/issues/8146).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe switch the order so we have the release at the end ?

@@ -48,6 +48,9 @@
bumping an anchor channel closing was not possible when no HTLCs were on the
commitment when the channel was force closed.

* [Fixed](https://github.com/lightningnetwork/lnd/pull/8174) old payments that
are stuck inflight. Check more info [here](https://github.com/lightningnetwork/lnd/issues/8146).
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe add in the description that the culprit of this problem is still not detected ? (Meaning why those payments are still inflight and did not resolve their corresponding payment attempt)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool updated.

// Double check whether the reported closed channel has indeed finished
// closing.
//
// NOTE: There are misalignments regarding when a channel's FC is
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you find out which misalignments in particular ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's what the TODO is for...

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok sure, was just asking what misalignment means in this context, afaict it means somehow we mark channels prematurely as closed although they are still pending ?

Copy link
Collaborator

@bitromortac bitromortac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM ⚡

yyforyongyu and others added 15 commits August 7, 2024 22:17
This commit adds a new test case to validate that when an HTLC has timed
out, the corresponding payment is marked as failed.
Simply moves the code into a new method so it's easier to follow the
method `handlePacketForward`.
This commit adds two methods, `handlePacketFail` and
`handlePacketSettle` to handle the settle and fail packets differently.
This commit adds a check during router's startup and fails the inflight
HTLCs if they are routing using channels unknown to us. The channels are
unknown because they are already closed, usually long time ago.
The method `FetchClosedChannels` sometimes prematurely mark a pending
force closing channel as finalized, therefore we need to furthur check
`FetchPendingChannels` to make sure the channel is indeed finalized.
To reflect the new sweeping behavior, also makes it easier to be used as
we need to method to quickly cleanup force closes without concerning the
details when we are not testing the force close behavior.
@yyforyongyu yyforyongyu merged commit a449a5d into lightningnetwork:master Aug 7, 2024
24 of 27 checks passed
@yyforyongyu yyforyongyu deleted the fix-inflight-payments branch August 7, 2024 15:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment