Payment status: IN_FLIGHT (stuck) - (updated) -> v0.13.1-beta.rc2 #5396

Closed
miketwenty1 opened this issue Jun 17, 2021 · 20 comments · Fixed by #8174
Assignees
Labels
needs investigation P2 should be fixed if one has time payments Related to invoices/payments

Comments

@miketwenty1

Background

We have payments that are stuck in "IN_FLIGHT" even after upgrading from v0.13.0rc3-beta to v0.13.0-beta.
Here we will analyze just one of these payments, with payment hash 41abd340f698c0db087b0d9f15cc0a514affd142108eed71d16a12c65d223b30.

Your environment

  • version of lnd: v0.13.0-beta
  • operating system (uname -a on *Nix): Ubuntu 20.04
  • backend (btcd, bitcoind, or other): bitcoind (zmq)
  • other relevant environment details: docker

Steps to reproduce

Unsure; supposedly this was supposed to auto-resolve. The block height is now 5000+ blocks past the HTLC's expiration.

Expected behaviour

I would assume it should auto-resolve to a failed payment?

Actual behaviour

Still shows IN_FLIGHT using lncli trackpayment

Please see pastebin for details: https://pastebin.com/Ep8Sr2Sn
^ Above is the output of lncli trackpayment and, after some pagination, the output of lncli listpayments.

@carlaKC
Collaborator

carlaKC commented Jun 18, 2021

Thanks for the details @miketwenty1!

The specific issue we suspected, which was patched in 0.13 and occurs when a single htlc shard fails, is not the case here, so this looks like something else.

Could you lookup the following for me:

  • Is there an htlc with this hash on the outgoing channel 742948803069739015?
  • Do you have any logs with this payment hash?

I'm carla on the lnd slack if you'd rather DM me this information than post it here. Feel free to xxx out any identifiable info; we just need to see the payment logs.

@miketwenty1
Author

I messaged you the information on slack; we can discuss there and post our conclusions/findings here afterwards.

@carlaKC
Copy link
Collaborator

carlaKC commented Jun 23, 2021

Some updates from discussion on slack:

  • Stuck payment has a single htlc, with expired timelock at 682566 which was mined on 08 May
  • The htlc is no longer present on the outgoing channel of its route
  • Logs had been rotated, so we restarted lnd to get fresh payment resume logs:
2021-06-22 15:22:31.266 [INF] CRTR: Resuming payment shard 389058 for payment 41abd340f698c0db087b0d9f15cc0a514affd142108eed71d16a12c65d223b30
2021-06-22 15:22:31.327 [DBG] HSWC: Subscribing to result for paymentID=389058

This doesn't look like the stuck-in-flight issue we've previously seen, because the payment has a single shard and the htlc does not have any failures (both characteristics of the previous bug). Since the payment stays stuck in this state, it seems we're never getting a result from the switch.

@Roasbeef Roasbeef added needs investigation payments Related to invoices/payments labels Jun 23, 2021
@yyforyongyu yyforyongyu self-assigned this Jun 29, 2021
@guggero
Collaborator

guggero commented Jun 29, 2021

Update: Talked to @halseth who was involved in the original fix of the stuck HTLC PR (#5214). Though we agree that the payments don't look exactly like those caused by the fixed bug, there are still some similarities (HTLC not around anymore, not getting an update from the switch). That's why we think this might have been caused by the same bug (the payments were made before the first RC of v0.13.0 was released).

Therefore we probably need to look into cleaning up the payment state manually and then observe if this indeed does not happen anymore with v0.13.0.

@yyforyongyu is going to take a look as well, thanks a lot!

@Roasbeef
Member

Yeh agreed @guggero, unless we have a clean slate, then it's hard to determine if this is new, or was fixed and created using an older version that allows this scenario to arise.

@yyforyongyu
Collaborator

Read the logs and found the following lines, which show the payment is still stuck in flight under the new version of lnd.

2021-06-22 15:21:36.886 [INF] LTND: Version: 0.13.0-beta commit=v0.13.0-beta, build=production, logging=default, debuglevel=CNCT=debug,HSWC=debug
2021-06-22 15:22:31.176 [INF] CRTR: Resuming payment 41abd340f698c0db087b0d9f15cc0a514affd142108eed71d16a12c65d223b30
2021-06-22 15:22:31.266 [INF] CRTR: Resuming payment shard 389058 for payment 41abd340f698c0db087b0d9f15cc0a514affd142108eed71d16a12c65d223b30
2021-06-22 15:22:31.327 [DBG] HSWC: Subscribing to result for paymentID=389058
2021-06-22 15:20:55.252 [ERR] CRTR: Resuming payment 41abd340f698c0db087b0d9f15cc0a514affd142108eed71d16a12c65d223b30 failed: router shutting down.

As @carlaKC mentioned above, from the logs, the collectResult goroutine seems to be stuck at the line case result, ok = <-resultChan:, which is created here:

// Now ask the switch to return the result of the payment when
// available.
resultChan, err := p.router.cfg.Payer.GetPaymentResult(
    attempt.AttemptID, p.identifier, errorDecryptor,
)
switch {

  // If this attempt ID is unknown to the Switch, it means it was never
  // checkpointed and forwarded by the switch before a restart. In this
  // case we can safely send a new payment attempt, and wait for its
  // result to be available.
  case err == htlcswitch.ErrPaymentIDNotFound:
    log.Debugf("Attempt ID %v for payment %v not found in "+
      "the Switch, retrying.", attempt.AttemptID,
      p.identifier)

    attempt, cErr := p.failAttempt(attempt, err)
    if cErr != nil {
      return nil, cErr
    }

    return &shardResult{
      attempt: attempt,
      err:     err,
    }, nil

The resultChan blocks until a *PaymentResult is received.

Meanwhile, inside GetPaymentResult, we have two paths to get the result:

  • If the payment attempt is not found in the circuit map, we getResult, and return the error ErrPaymentIDNotFound if that attempt is not found.
  • If the attempt is found in the circuit map, we subscribeResult; even if the result is never delivered, we won't return ErrPaymentIDNotFound, and the returned channel resultChan will wait forever.

This could be a problem if the attempt's result cannot be found (maybe cleaned using htlcswitch.CleanStore), yet we take the second path inside GetPaymentResult. The error ErrPaymentIDNotFound is never returned, so we can never reach the line case err == htlcswitch.ErrPaymentIDNotFound: above, which would likely generate logs similar to those mentioned above.

Remaining questions (notes for myself):

  • How could we land on the first path in GetPaymentResult?
  • How/when to clean the circuit map?
  • How/when is the status of HTLC attempt modified?

@yyforyongyu
Collaborator

A possible mitigation is to put a timeout on the router's wait on resultChan. Meanwhile, we could just mark the old in-flight payments as failed, since their HTLCs are far past their deadlines.

@yyforyongyu
Collaborator

What remains unclear here is that we should have seen the log printed from this line, since the call to CollectResultAsync is non-blocking, unless payment, currentState, err := p.fetchPaymentState() or shardHandler.checkShards() blocks. I've messaged @miketwenty1 on slack hoping to get more recent logs, thanks!

@miketwenty1
Author

miketwenty1 commented Jun 30, 2021

I did respond to @yyforyongyu, I look forward to working with you. I provided the information you asked for on slack.

@miketwenty1
Author

above was a fat finger.

@miketwenty1
Author

Yeh agreed @guggero, unless we have a clean slate, then it's hard to determine if this is new, or was fixed and created using an older version that allows this scenario to arise.

I know there was some concern about whether new ones were coming in or if this is an artifact of a previous version.
It seems that for every stuck IN_FLIGHT payment, the "creation_date" unix timestamp is May 7th 2021 or earlier.
https://pastebin.com/Y92t3Cc0
We currently have 676 IN_FLIGHT payments for this node.

@yyforyongyu
Collaborator

yyforyongyu commented Jun 30, 2021

Chatted with @michael1011; all the stuck payments happened before the fix. In particular, the stats are:

2020-10-07 00:00:00 :  1
2020-12-20 00:00:00 :  1
2020-12-26 00:00:00 :  1
2021-02-17 00:00:00 :  1
2021-02-20 00:00:00 :  1
2021-03-03 00:00:00 :  1
2021-05-06 00:00:00 :  1
2020-09-13 00:00:00 :  2
2021-01-25 00:00:00 :  2
2021-02-07 00:00:00 :  2
2021-02-14 00:00:00 :  2
2021-03-17 00:00:00 :  2
2021-04-14 00:00:00 :  2
2020-12-01 00:00:00 :  3
2021-02-16 00:00:00 :  3
2021-04-15 00:00:00 :  3
2021-04-16 00:00:00 :  3
2020-11-15 00:00:00 :  4
2021-05-05 00:00:00 :  4
2021-02-09 00:00:00 :  6
2021-04-20 00:00:00 :  8
2021-04-27 00:00:00 :  8
2021-04-17 00:00:00 :  16
2021-03-07 00:00:00 :  33
2021-02-28 00:00:00 :  67
2021-04-28 00:00:00 :  142
2021-05-07 00:00:00 :  144
2021-04-21 00:00:00 :  213

I think it's been fixed by #5214; however, it seems we didn't clean up the payments that were already stuck and dangling before the fix landed. I'll draft a PR to address this issue.

@yyforyongyu
Collaborator

yyforyongyu commented Jun 30, 2021

We currently have 676 IN_FLIGHT payments for this node.

@miketwenty1 Just to be sure, all these payments should have been marked as failed right?

@miketwenty1
Author

@yyforyongyu, just fyi.. im @miketwenty1 You tagged the wrong Michael there. 😂

We currently have 676 IN_FLIGHT payments for this node.

@michael1011 Just to be sure, all these payments should have been marked as failed right?

Every "failure_reason" is "FAILURE_REASON_NONE"

@miketwenty1
Author

miketwenty1 commented Jul 14, 2021

running v0.13.1-beta.rc2 did not resolve this issue @yyforyongyu (updating title of issue).

@miketwenty1 miketwenty1 changed the title Payment status: IN_FLIGHT (stuck) - v0.13.0-beta release Payment status: IN_FLIGHT (stuck) - (updated) -> v0.13.1-beta.rc2 Jul 14, 2021
@yyforyongyu
Collaborator

yyforyongyu commented Jul 19, 2021

running v0.13.1-beta.rc2 did not resolve this issue @yyforyongyu (updating title of issue).

Thanks for the update! This will be fixed in two steps,

@carlaKC carlaKC removed their assignment Jul 20, 2021
@Roasbeef Roasbeef added the P2 should be fixed if one has time label Aug 31, 2021
@jackwhite420

Just adding a quick note that this is affecting one of my withdrawal attempts from a LN exchange (southxchange).

@dolcalmi

Quick note: 2 withdrawal attempts with this issue in bitcoin beach wallet #4788 (comment)

Can someone confirm whether it is safe to return the funds? (One of the txs is for more than 1 million sats, so we need to be extra sure.)

@yyforyongyu
Collaborator

What do you mean by "return the funds"?

@dolcalmi

dolcalmi commented Dec 27, 2021

What do you mean by "return the funds"?

The issue is with a payment made by one of the users of Bitcoin Beach Wallet, so we have the tx as pending, i.e., the amount was deducted from the user's balance. We need to know if it is safe to void the transaction (the user's tx).
