
discovery: implement banning for invalid channel anns #9009

Merged (7 commits) on Aug 27, 2024

Conversation

@Crypt-iQ (Collaborator) commented Aug 14, 2024

Partially addresses #8889, specifically parts of #4 & #5 here: #8889 (comment)

This PR implements banning for invalid channel announcements:

  • we will ignore all channel announcements from a banned peer until it becomes un-banned (after 48 hours).
  • non-channel peers will be disconnected when their ban score reaches the ban threshold.
  • channel peers won't be disconnected when their ban score reaches the threshold, but we will ignore their announcements. Note that this still allows us to create channels with them since the announcement isn't gossiped between channel peers.

This PR also keeps track of closed channels such that we won't attempt to validate channel announcements for closed channels.

Future improvements:

  • the banning is currently in-memory only, meaning a restart of lnd will wipe all ban data. Ideally we would persist a limited set of ban info to disk and not use as much memory.
  • a banned peer that has reconnected is only disconnected again if they send another invalid announcement. Ideally they would be disconnected immediately in the peer/brontide.go code, but I decided to keep things contained in the gossiper.
  • instead of ignoring channel peers' announcements if they are banned, we should instead rate limit them.
  • if we receive a channel announcement from a non-syncing peer that isn't banned, we can potentially ignore it.
  • generalize banning to other gossip messages.


@coderabbitai bot (Contributor) commented Aug 14, 2024

Review skipped: auto reviews are limited to specific labels (llm-review). Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

@morehouse (Collaborator)

Concept ACK

@Crypt-iQ (Author) commented Aug 14, 2024

It just occurred to me that we can get rid of the new banning code and just use slices of rate limiters in the gossiper instead. The tradeoff would be losing the customizable banning code in exchange for less code in the discovery package.

@saubyk saubyk added this to the v0.18.3 milestone Aug 15, 2024
@ziggie1984 (Collaborator) left a comment

Nice work 👌, had some questions but it looks very close.

Missing release notes for 0.18.3.

@Crypt-iQ force-pushed the gossip_ban_8132024 branch 3 times, most recently from 0af94be to 0d4e31e (August 19, 2024 13:50)
@bitromortac (Collaborator) left a comment

Nice work 🔥! I think it's important that we keep the flow of channel announcements going if it's our channel, which I think is ensured, but I also have a question in a comment. Perhaps we could also make the closed channel index more exhaustive by filling it with detected closes (in the future we may be able to separate zombies from closed channels as well).

@Crypt-iQ force-pushed the gossip_ban_8132024 branch 2 times, most recently from 900d2c7 to ee220b6 (August 20, 2024 20:35)
@bitromortac (Collaborator) left a comment

Looks almost good to go 🙏, only a few nits.


// Ban a peer by repeatedly incrementing its ban score.
peer1 := [33]byte{0x00}

perhaps test that the peer is not banned beforehand, which also tests the cache.ErrElementNotFound case

// Assert that purgeBanEntries does nothing.
b.purgeBanEntries()
banInfo, err = b.peerBanIndex.Get(peer1)
require.Nil(t, err)
Semantically, a require.NoError may be better.


select {
case err = <-ctx.gossiper.ProcessRemoteAnnouncement(ca, nodePeer1):
require.NotNil(t, err)
we could add require.ErrorContains(t, err, "peer is banned"), to be a bit more explicit


select {
case err = <-ctx.gossiper.ProcessRemoteAnnouncement(ca, nodePeer2):
require.NotNil(t, err)
and then here maybe add require.ErrorContains(t, err, "ignoring closed channel")

@ziggie1984 (Collaborator) left a comment

LGTM, great work 👏

Maybe add a statement in the release notes that this new ban protection excludes Neutrino nodes, though they are not doing any expensive checks in the first place. The bandwidth requirements would still be high for them, though.

@Crypt-iQ force-pushed the gossip_ban_8132024 branch 2 times, most recently from 20f70f3 to a97bc4f (August 22, 2024 00:04)
}

if !chanPeer {
nMsg.peer.Disconnect(ErrPeerBanned)
Member:
Will we prevent incoming connections from being fully accepted (either at the brontide connection handshake layer, or in the server before we finalize the peer)? Otherwise, we could have a situation where they: connect, send something bad, and we disconnect (in a loop).

@Crypt-iQ (Author) replied:
Made the change in the server, but still need to properly test in a few different scenarios

@@ -4037,3 +4037,26 @@ func TestGraphLoading(t *testing.T) {
graphReloaded.graphCache.nodeFeatures,
)
}

func TestClosedScid(t *testing.T) {
nit: missing doc string and it would be better to use require.NoError instead of require.Nil.

@bitromortac (Collaborator) replied Aug 27, 2024:

Right, it tests the same. No strong opinion here, but NoError has an error type as a parameter, will display a better debug message, and may be a bit more readable and nicer for code-uniformity reasons. This linter, for example, would complain: https://github.com/Antonboom/testifylint?tab=readme-ov-file#error-nil (non-blocking)

@Roasbeef (Member)
Some lint failures in the latest run:

Error: server.go:3660:3: return with no blank line before (nlreturn)
		return
		^
Error: server.go:3763:3: return with no blank line before (nlreturn)
		return
		^
Error: discovery/gossiper.go:2674:4: return with no blank line before (nlreturn)
			return nil, false
			^
make: *** [Makefile:322: lint-source] Error 1
Error: Process completed with exit code 2.

@Crypt-iQ force-pushed the gossip_ban_8132024 branch 2 times, most recently from 4774dae to 39d7deb (August 26, 2024 18:00)
@Crypt-iQ (Author)

test errors look to be unrelated

@Roasbeef (Member) left a comment

Reviewed 3 of 3 files at r1, 19 of 19 files at r2, 4 of 4 files at r4, all commit messages.
Reviewable status: all files reviewed, 37 unresolved discussions (waiting on @bitromortac, @Crypt-iQ, and @ziggie1984)

@Roasbeef (Member)

test errors look to be unrelated

Yeah the native sql failures will be fixed with: #9022

@ziggie1984 (Collaborator) left a comment
LGTM, cool idea rejecting the peer at the peer connection level ⚡️


// Reset the AddEdge error and pass the same announcement again. An
// error should be returned even though AddEdge won't fail.
ctx.router.resetAddEdgeErrCode()
Non-blocking, but I think we could improve the graph.Error print method, which would make the debug log easier to read:

// Error satisfies the error interface and prints human-readable errors.
func (e *Error) Error() string {
	if e.err != nil {
		return fmt.Sprintf("ErrCode: %v, error: %s", e.code, e.err)
	}
	return fmt.Sprintf("ErrCode: %v", e.code)
}

// String returns the string representation of the error code.
func (e ErrorCode) String() string {
	switch e {
	case ErrOutdated:
		return "ErrOutdated"

	case ErrIgnored:
		return "ErrIgnored"

	case ErrChannelSpent:
		return "ErrChannelSpent"

	case ErrNoFundingTransaction:
		return "ErrNoFundingTransaction"

	case ErrInvalidFundingOutput:
		return "ErrInvalidFundingOutput"

	case ErrVBarrierShuttingDown:
		return "ErrVBarrierShuttingDown"

	case ErrParentValidationFailed:
		return "ErrParentValidationFailed"

	default:
		return "<unknown>"
	}
}

which makes the output of the test way clearer:

2024-08-27 09:48:05.971 [DBG] DISC: Adding edge for short_chan_id: 111050674405376
2024-08-27 09:48:05.971 [DBG] DISC: Graph rejected edge for short_chan_id(111050674405376): ErrCode: ErrChannelSpent, error: received error

instead of:

2024-08-27 09:11:08.573 [DBG] DISC: Adding edge for short_chan_id: 108851651149824
2024-08-27 09:11:08.573 [DBG] DISC: Graph rejected edge for short_chan_id(108851651149824): received error

This commit adds the ability to store closed channels by scid in
the database. This will allow the gossiper to ignore channel
announcements for closed channels without having to do any
expensive validation.
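A map-backed sketch of such a closed-channel index (the PR's actual version is persisted in channeldb; the method names here mirror the idea, but the exact signatures are assumptions):

```go
package main

import "fmt"

// closedScidIndex records short channel IDs of closed channels.
type closedScidIndex struct {
	scids map[uint64]struct{}
}

func newClosedScidIndex() *closedScidIndex {
	return &closedScidIndex{scids: make(map[uint64]struct{})}
}

// PutClosedScid records a short channel ID whose channel has closed.
func (c *closedScidIndex) PutClosedScid(scid uint64) {
	c.scids[scid] = struct{}{}
}

// IsClosedScid lets the gossiper reject announcements for closed
// channels without any expensive on-chain validation.
func (c *closedScidIndex) IsClosedScid(scid uint64) bool {
	_, ok := c.scids[scid]
	return ok
}

func main() {
	idx := newClosedScidIndex()
	idx.PutClosedScid(123456)
	fmt.Println(idx.IsClosedScid(123456), idx.IsClosedScid(654321))
}
```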
This commit introduces a ban manager that marks peers as banned if
they send too many invalid channel announcements to us. Expired
entries are purged after a certain period of time (currently 48 hours).
This will be used in the gossiper to disconnect from peers if their
ban score passes the ban threshold.
This commit hooks up the banman to the gossiper:
- peers that are banned and don't have a channel with us will get
  disconnected until they are unbanned.
- peers that are banned and have a channel with us won't get
  disconnected, but we will ignore their channel announcements until
  they are no longer banned. Note that this only disables gossip of
  announcements to us and still allows us to open channels to them.
@ziggie1984 (Collaborator) left a comment

So one thing we probably need to add in a follow-up PR is a way for nodes infected with the old channel data set to get cured. Is there currently a way to wipe the whole graph history other than chantools? Nodes that are infected might get banned by a lot of peers after this PR. Assuming they are not doing it deliberately, we should probably also fix the problem for them (in case they are running lnd)?

LGTM

Reviewable status: 20 of 23 files reviewed, 43 unresolved discussions (waiting on @bitromortac, @Crypt-iQ, and @Roasbeef)

@Roasbeef (Member)

Is there currently a way to wipe the whole graph history other than chantools? Because for nodes infected, they might after this PR get banned by a lot of peers

So we know that one vector of these old channels was actually a version of CLN that had a bug causing it not to detect channels as actually being closed. We also know that some lnd nodes running in neutrino mode (assumechanvalid) may have stored those channels on disk momentarily. Once the zombie tick interval passes, neutrino nodes will prune these from disk, and will then be able to use the spend index to avoid re-downloading all the channels.

@Roasbeef Roasbeef merged commit 1bf7ad9 into lightningnetwork:master Aug 27, 2024
23 of 31 checks passed
@Crypt-iQ Crypt-iQ deleted the gossip_ban_8132024 branch August 28, 2024 04:40
@ziggie1984 (Collaborator)

So I analysed it for neutrino nodes, and the problem is that we are not deleting those channels that have only an announcement:
#8889 (comment)

6 participants