graph/db: add zombie channels cleanup routine #10015

GustavoStingelin · 2025-07-01T03:30:01Z

This PR addresses issue #9524, which caused zombie channels to remain in the in-memory graph cache. This led to incorrect pathfinding behavior and unnecessary memory consumption.

Benchmark

To evaluate the performance impact of the cleanup logic, I added a benchmark simulating a node graph with 50,000 nodes and 500,000 channels. On my machine, the cleanup took approximately 120 ms, which I think is acceptable for a daily cleanup routine. Additionally, we could potentially improve this by using the channelCache struct, but it appears underutilized.

$ go test -bench=. ./graph/db
goos: linux
goarch: amd64
pkg: github.com/lightningnetwork/lnd/graph/db
cpu: AMD Ryzen 7 5700X 8-Core Processor             
BenchmarkGraphCacheCleanupZombies-16    	       9	 120417156 ns/op	       120.3 ms/op	16957585 B/op	  149297 allocs/op

gemini-code-assist

Summary of Changes

Hello @GustavoStingelin, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a potential memory leak in the GraphCache by implementing a "zombie channel" cleanup mechanism. It introduces a dedicated index to track channels that cannot be immediately removed due to incomplete node information and a background process to periodically prune these channels, ensuring the cache remains consistent and efficient.

Highlights

Zombie Channel Tracking: Introduced a zombieIndex within the GraphCache to temporarily store channel IDs that cannot be fully removed immediately due to missing node information (e.g., a zeroVertex).
Background Cleanup Process: Added a new background goroutine (zombieCleaner) that periodically (every 24 hours) attempts to clean up channels listed in the zombieIndex, ensuring they are eventually removed from the cache.
Robust Channel Removal: Modified the RemoveChannel method to identify channels where one or both associated nodes are unknown (represented by a zeroVertex) and add them to the zombieIndex for deferred cleanup, preventing potential memory leaks.
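
To make the three highlights concrete, here is a minimal sketch of the deferred-removal idea. It is not the PR's code: the `cache` type, the `vertex` type, and every method name below are hypothetical stand-ins for `GraphCache` and `route.Vertex`, and the cleanup strategy (one full scan per pass) is an assumption.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// vertex stands in for route.Vertex; zeroVertex marks an unknown node.
type vertex [33]byte

var zeroVertex vertex

// cache is a toy stand-in for GraphCache, not lnd's real type.
type cache struct {
	mu          sync.Mutex
	nodeChans   map[vertex]map[uint64]struct{} // channel IDs per node
	zombieIndex map[uint64]struct{}            // IDs awaiting cleanup
}

func newCache() *cache {
	return &cache{
		nodeChans:   make(map[vertex]map[uint64]struct{}),
		zombieIndex: make(map[uint64]struct{}),
	}
}

// addChannel records the channel under both of its endpoints.
func (c *cache) addChannel(a, b vertex, id uint64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for _, n := range []vertex{a, b} {
		if c.nodeChans[n] == nil {
			c.nodeChans[n] = make(map[uint64]struct{})
		}
		c.nodeChans[n][id] = struct{}{}
	}
}

// removeChannel deletes the channel directly when both endpoints are
// known, and otherwise defers the work by recording the ID as a zombie.
func (c *cache) removeChannel(a, b vertex, id uint64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if a == zeroVertex || b == zeroVertex {
		c.zombieIndex[id] = struct{}{}
		return
	}
	delete(c.nodeChans[a], id)
	delete(c.nodeChans[b], id)
}

// cleanupZombies scans every node, drops entries whose ID is in the zombie
// index, resets the index, and returns the number of entries removed.
func (c *cache) cleanupZombies() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	removed := 0
	for _, chans := range c.nodeChans {
		for id := range c.zombieIndex {
			if _, ok := chans[id]; ok {
				delete(chans, id)
				removed++
			}
		}
	}
	c.zombieIndex = make(map[uint64]struct{})
	return removed
}

// runCleaner calls cleanupZombies on every tick until quit is closed.
func (c *cache) runCleaner(interval time.Duration, quit <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			c.cleanupZombies()
		case <-quit:
			return
		}
	}
}

func main() {
	c := newCache()
	var a, b vertex
	a[0], b[0] = 1, 2
	c.addChannel(a, b, 42)

	// Endpoints unknown: the channel is only indexed, not yet removed.
	c.removeChannel(zeroVertex, zeroVertex, 42)

	quit := make(chan struct{})
	go c.runCleaner(24*time.Hour, quit)
	defer close(quit)

	// Trigger one pass by hand instead of waiting for the ticker.
	fmt.Println("removed entries:", c.cleanupZombies())
}
```

The trade-off in this sketch is that a zombie stays visible to readers of the cache until the next pass, which is the concern raised later in the review about pathfinding between cleanups.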

gemini-code-assist

Code Review

This pull request introduces a background process to clean up zombie channels from the graph cache, addressing a potential memory leak. The implementation includes a dedicated goroutine and a zombie index to track channels awaiting removal. The review suggests improvements for maintainability and performance, such as making the cleanup interval configurable and optimizing the cleanup logic.

graph/db/graph_cache.go

ellemouton · 2025-07-01T04:53:23Z

thanks for the PR @GustavoStingelin!
Feel free to ping me once this is ready for review. Also remember to remove the [skip ci] from the commit message at that point so that the CI can run

GustavoStingelin · 2025-07-02T02:13:16Z

@ellemouton ready!

MPins

Well done! 👏

I ran the tests and everything LGTM ✅

Here are the benchmark results on my machine:

goos: linux
goarch: amd64
pkg: github.com/lightningnetwork/lnd/graph/db
cpu: Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz
=== RUN BenchmarkGraphCacheCleanupZombies
BenchmarkGraphCacheCleanupZombies
BenchmarkGraphCacheCleanupZombies-8 5 207848522 ns/op 207.8 ms/op 31292878 B/op 245935 allocs/op
PASS
ok github.com/lightningnetwork/lnd/graph/db 9.526s

ellemouton

Looking great so far! Thanks for this 🙏

graph/db/graph_cache.go

graph/db/graph_cache_test.go

graph/db/graph_cache.go

yyforyongyu

Thanks for the PR! My main question is: would it cost more to remove the channel directly inside RemoveChannel? And if the zombies are only cleaned every X hours, does that mean pathfinding may fail in the meantime because of the zombies?

graph/db/graph_cache.go

graph/db/graph_cache_test.go

GustavoStingelin · 2025-07-29T02:33:07Z

Just rebased. Let me know if you have any additional comments!

GustavoStingelin · 2025-08-05T22:50:05Z

@saubyk, could you assign me to this?

lightninglabs-deploy · 2025-09-03T01:39:51Z

@yyforyongyu: review reminder
@ellemouton: review reminder
@GustavoStingelin, remember to re-request review from reviewers when ready

ellemouton

lgtm! thanks for this 🎉

graph/db/graph_cache.go

yyforyongyu · 2025-09-10T05:12:04Z

graph/db/graph_cache.go

@@ -305,6 +410,9 @@ func (c *GraphCache) getChannels(node route.Vertex) []*DirectedChannel {
 		i++
 	}

+	// Copy the slice to clean up the unused pre allocated tail entries.
+	copy(channelsCopy, channelsCopy[:i])


I think we can just return channelsCopy[:i]. In Go, re-slicing with [:i] already creates a new slice header that points to the same underlying array as channelsCopy but has length i. And the copy will do nothing here.

GustavoStingelin · 2025-09-12

> And the copy will do nothing here

It's a slice trick 😃.

My original intent here was to free the underlying array, since it could still hold unused tail entries that the GC wouldn’t reclaim as long as the slice referenced them. By using copy, we force a new backing array to be created, which allows those tail elements to be freed earlier and potentially saves some bytes in the current cycle.

After rethinking this, I realized it might not be worth the extra cost. The zombie cleaner will eventually release memory anyway, and the “tail” entries are likely irrelevant compared to the overhead of copying and allocating a new array.

So I switched to simply using [:i]. This means the unused entries remain in the underlying array, but they’re not visible through the slice header, and the tradeoff avoids the extra copy and allocation cost.

reference

"Because the “deleted” value is referenced in the underlying array, the deleted value is still “reachable” during GC, even though the value cannot be referenced by your code. If the underlying array is long-lived, this represents a leak"
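
For readers following the slice discussion, a small sketch of the behaviours involved may help. The helper names `trimShared` and `trimCloned` are made up for illustration. One point worth stating plainly: Go's built-in `copy` never allocates; it only writes into the destination's existing backing array, so obtaining a fresh array requires an explicit allocation such as `make` followed by `copy`.

```go
package main

import "fmt"

// trimShared returns the first n elements and keeps the original backing
// array alive, including the pointers stored in its tail.
func trimShared(s []*int, n int) []*int {
	return s[:n]
}

// trimCloned copies the first n elements into a newly allocated array, so
// the original array can be collected once nothing else references it.
func trimCloned(s []*int, n int) []*int {
	out := make([]*int, n)
	copy(out, s[:n])
	return out
}

func main() {
	buf := make([]*int, 8)
	for i := range buf {
		v := i
		buf[i] = &v
	}

	// Copying a prefix onto itself rewrites the same elements in place.
	fmt.Println(copy(buf, buf[:3])) // 3

	shared := trimShared(buf, 3)
	cloned := trimCloned(buf, 3)
	fmt.Println(len(shared), cap(shared), &shared[0] == &buf[0]) // 3 8 true
	fmt.Println(len(cloned), cap(cloned), &cloned[0] == &buf[0]) // 3 3 false

	// A cheaper alternative: nil out the tail so the GC can reclaim what
	// those slots pointed to while the array itself stays shared.
	clear(buf[3:])
	fmt.Println(buf[3] == nil) // true
}
```

Which variant is preferable depends on how long the slice lives: for a short-lived return value the shared form avoids an allocation, while a long-lived slice over a much larger array is where cloning or clearing the tail tends to pay off.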

yyforyongyu · 2025-09-10T05:16:29Z

graph/db/graph_cache_test.go

+		for j := range numChannels / 10 {
+			cache.RemoveChannel(zeroVertex, zeroVertex,
+				uint64(j*1000*10))
+			cache.RemoveChannel(zeroVertex, zeroVertex,


So we are marking 20% of the channels as zombies? What does the +5 mean here?

GustavoStingelin · 2025-09-12

I did a small refactor in this bench to make it more readable. The setup is:

10% of existing channels are marked as zombies.

Another 10% worth of entries are marked as zombies using IDs that do not exist in the map.

So the run includes 10% real zombies and 10% of "ghost zombies".
The +5 is just to generate IDs outside the existing range.
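
The ID scheme described above could look roughly like the following sketch. It assumes, purely for illustration, that the benchmark's existing channels use IDs of the form `i*stride`; the function name `zombieIDs` and the `stride` parameter are hypothetical and do not come from the PR.

```go
package main

import "fmt"

// zombieIDs returns two equally sized ID sets for a cache that is assumed
// to hold numChannels channels with IDs 0, stride, 2*stride, ...
//   - existing: every tenth real channel ID (10% real zombies)
//   - ghosts:   the same IDs shifted by 5, which match no channel as long
//     as stride is larger than 5 ("ghost zombies")
func zombieIDs(numChannels, stride int) (existing, ghosts []uint64) {
	for j := 0; j < numChannels/10; j++ {
		id := uint64(j * stride * 10)
		existing = append(existing, id)
		ghosts = append(ghosts, id+5)
	}
	return existing, ghosts
}

func main() {
	existing, ghosts := zombieIDs(500_000, 1000)
	fmt.Println(len(existing), len(ghosts)) // 50000 50000
	fmt.Println(existing[1], ghosts[1])     // 10000 10005
}
```

Including ghost IDs exercises the cleanup path for index entries that match nothing in the cache, which is the case where the lookup cost is paid without any removal.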

yyforyongyu · 2025-09-10T05:18:14Z

graph/db/graph_cache.go

@@ -83,6 +95,9 @@ func NewGraphCache(preAllocNumNodes int) *GraphCache {
 			map[route.Vertex]*lnwire.FeatureVector,
 			preAllocNumNodes,
 		),
+		zombieIndex:           make(map[uint64]struct{}),
+		zombieCleanerInterval: time.Hour,


we should make this time.Hour a var above for easy reference.
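
One way to act on this suggestion is a package-level constant that the constructor refers to. The sketch below uses a simplified `graphCache` struct, and the name `DefaultZombieCleanerInterval` is an assumption rather than the identifier the PR ended up with.

```go
package main

import (
	"fmt"
	"time"
)

// DefaultZombieCleanerInterval is a hypothetical name for the default
// period between two zombie cleanup passes.
const DefaultZombieCleanerInterval = time.Hour

// graphCache is a reduced stand-in for GraphCache with only the fields
// that appear in the diff above.
type graphCache struct {
	zombieIndex           map[uint64]struct{}
	zombieCleanerInterval time.Duration
}

// newGraphCache wires the named default into the cache instead of using
// a bare time.Hour literal at the construction site.
func newGraphCache() *graphCache {
	return &graphCache{
		zombieIndex:           make(map[uint64]struct{}),
		zombieCleanerInterval: DefaultZombieCleanerInterval,
	}
}

func main() {
	fmt.Println(newGraphCache().zombieCleanerInterval) // 1h0m0s
}
```

A named default also gives tests a single place to reference or override the interval.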

bitromortac

I would like to be sure to understand the solution space a bit better.

Builder.MarkZombieEdge is only called in validateFundingTransaction, which leads to the only call of ChannelGraph.MarkEdgeZombie with zero pubkeys. Are edges even added to the graph (and cache) if they fail validateFundingTransaction, and is there thus a problem at all?

Additionally, there seems to only be a single chain of call sites Builder.MarkZombieEdge -> ChannelGraph.MarkEdgeZombie -> GraphCache.RemoveChannel, so would it work if we'd pass in the pubkeys to Builder.MarkZombieEdge and change ChannelGraph.MarkEdgeZombie to call V1Store.MarkEdgeZombie(chanID, zero, zero) instead, but have the real pubkeys available in GraphCache.RemoveChannel?

yyforyongyu · 2025-09-12T10:00:25Z

> so would it work if we'd pass in the pubkeys to Builder.MarkZombieEdge and change ChannelGraph.MarkEdgeZombie to call V1Store.MarkEdgeZombie(chanID, zero, zero) instead, but have the real pubkeys available in GraphCache.RemoveChannel?

Yeah, this would fix the leaky cache completely, but it would also open an attack surface where a malicious node can remove other nodes from our graph cache. The attack scenario:

a new node starts with no knowledge of past closed channels
an attacker sends an old channel that is valid but is now closed, which is a replay attack
if we also use the node public keys here instead of zero keys, this removes the victim nodes from our graph cache
we check whether the channel is closed before the validation, but that won't catch this case because we don't yet know that the channel is closed

yyforyongyu

Pending @bitromortac's confirmation of the analysis, otherwise LGTM 👏

Fix a bug that leaks zombie channels in the memory graph, resulting in incorrect path finding and memory usage.

ellemouton · 2025-09-15T05:27:33Z

> Builder.MarkZombieEdge is only called in validateFundingTransaction, which leads to the only call of ChannelGraph.MarkEdgeZombie with zero pubkeys. Are edges even added to the graph (and cache) if they fail validateFundingTransaction, and is there thus a problem at all?

Hmmm great point!!
I think that may indeed mean that this may not be an issue 🤔 and if so, we should just leave it so as not to take up more in-memory space

bitromortac

I'm fairly sure my suggested workaround isn't relevant, because if validateFundingTransaction (the only call site for RemoveChannel with empty pubkeys) errors, we won't add the edge to the graph and therefore don't have issues with those channels in the cache. So unless an issue with this can be demonstrated, I think we should not pursue this PR, to keep the complexity out (although the code itself looks good and is well tested).

saubyk · 2025-09-15T14:12:14Z

@GustavoStingelin based on the last comment, I am pulling this out of release 0.20's scope. If you agree with @bitromortac 's assessment we can close this pr and revisit in the future if an issue arises. Thanks.

GustavoStingelin · 2025-09-15T14:48:31Z

> @GustavoStingelin based on the last comment, I am pulling this out of release 0.20's scope. If you agree with @bitromortac 's assessment we can close this pr and revisit in the future if an issue arises. Thanks.

agreed.

saubyk · 2025-09-15T23:48:29Z

Closing the pr based on the above comment.

gemini-code-assist bot reviewed Jul 1, 2025

GustavoStingelin force-pushed the graph-cache/zombie-channels branch 2 times, most recently from 8acd2d2 to 0142868 Compare July 1, 2025 22:00

GustavoStingelin changed the title ~~DRAFT: graph/db: add zombie channel process - WIP [skip ci]~~ graph/db: add zombie channels cleanup routine Jul 1, 2025

GustavoStingelin marked this pull request as ready for review July 2, 2025 02:08

ellemouton self-requested a review July 2, 2025 07:08

MPins approved these changes Jul 3, 2025

ellemouton reviewed Jul 8, 2025

GustavoStingelin force-pushed the graph-cache/zombie-channels branch 3 times, most recently from 6a1bd16 to 46d2623 Compare July 10, 2025 16:39

GustavoStingelin requested a review from ellemouton July 10, 2025 16:54

GustavoStingelin commented Jul 10, 2025

yyforyongyu requested changes Jul 14, 2025

GustavoStingelin force-pushed the graph-cache/zombie-channels branch from 46d2623 to 4f05f0d Compare July 17, 2025 19:18

GustavoStingelin requested a review from yyforyongyu July 17, 2025 20:02

GustavoStingelin force-pushed the graph-cache/zombie-channels branch from 4f05f0d to 918e871 Compare July 29, 2025 02:30

saubyk assigned GustavoStingelin Aug 6, 2025

ellemouton approved these changes Sep 4, 2025

saubyk added this to the v0.20.0 milestone Sep 4, 2025

saubyk added this to lnd v0.20 Sep 4, 2025

saubyk moved this to In review in lnd v0.20 Sep 4, 2025

saubyk requested review from bitromortac and removed request for yyforyongyu September 9, 2025 16:34

yyforyongyu reviewed Sep 10, 2025

bitromortac reviewed Sep 10, 2025

yyforyongyu reviewed Sep 12, 2025

GustavoStingelin force-pushed the graph-cache/zombie-channels branch from 918e871 to d5785c4 Compare September 12, 2025 20:41

GustavoStingelin added 3 commits September 12, 2025 17:44

graph/db: add zombie channels cache cleanup routine

5ec2ea7

Fix a bug that leaks zombie channels in the memory graph, resulting in incorrect path finding and memory usage.

graph/db: test zombie channel cleaning

675c283

docs: update release-notes-0.20.0

0f62b18

GustavoStingelin force-pushed the graph-cache/zombie-channels branch from d5785c4 to 0f62b18 Compare September 12, 2025 20:45

GustavoStingelin requested a review from yyforyongyu September 12, 2025 21:35

bitromortac reviewed Sep 15, 2025

saubyk removed this from the v0.20.0 milestone Sep 15, 2025

saubyk removed this from lnd v0.20 Sep 15, 2025

saubyk closed this Sep 15, 2025

bitromortac mentioned this pull request Sep 17, 2025

graph cache: zombie channels are not properly removed from cache #9524

Closed
