Conversation

Contributor

@sanducb sanducb commented Jul 18, 2025

Changes proposed in this pull request

  • Adds multi-hop payment support through static routing with longest-prefix matching
  • Telemetry support for ILP packet processing time per operation, i.e. outgoing-payment, incoming-payment, routing, or unknown (rate probes fall into this category)
  • Telemetry support for ILP payment round-trip time
  • Simplifies determining which tenant is the destination of a payment introduced in this PR by making it part of the routing logic.

Context

Closes #3444

Overall setup and routing logic

The setup creates 3 instances where instance A is peered with B, and B is peered with C (please check the setup for exact instance names).
Payments should succeed from A -> B -> C using the existing Bruno collection. Instances A and C were kept as cloud-nine and happy-life-bank to minimize changes to the existing setup.

At startup of a Rafiki instance, routes are loaded from the database and stored in the in-memory routing table. All subsequent peer updates also refresh the routing table. For backwards compatibility, if no routes exist, the direct peers' addresses and asset ids are used to populate the routing table.

A routing table entry has the following structure:

| tenantId:destination | next hop | asset id |

where:

  • tenantId is the tenant id of the caller
  • destination is the static ILP address of the payment receiver.
  • next hop is the peer id of the direct peer that will either route or be the destination of the packet
  • asset id is the asset id of the next hop peer. This field is mandatory when adding or removing a route but optional when querying for the next hop, since the caller may or may not care which asset the peering relationship uses when forwarding the packet.

tenantId:destination is called the prefix in the implementation and is the key of the table. Longest-prefix matching is done against this key.
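
As a rough illustration of the lookup described above, here is a minimal sketch of a longest-prefix match over tenantId:destination keys. The RoutingTable class, its method names, the sample ids, and the choice to match on whole dot-separated address segments are all assumptions made for this example; they are not the code in this PR.

```typescript
// Hypothetical sketch: names and shapes are illustrative, not Rafiki's API.
interface RouteEntry {
  nextHop: string // peer id of the direct peer
  assetId: string // asset id of the peering relationship
}

class RoutingTable {
  // Key is `${tenantId}:${prefix}`; value is the list of candidate next hops.
  private table = new Map<string, RouteEntry[]>()

  addRoute(tenantId: string, prefix: string, nextHop: string, assetId: string): void {
    const key = `${tenantId}:${prefix}`
    const entries = this.table.get(key) ?? []
    // Avoid storing the same (nextHop, assetId) pair twice for one prefix.
    if (!entries.some((e) => e.nextHop === nextHop && e.assetId === assetId)) {
      entries.push({ nextHop, assetId })
    }
    this.table.set(key, entries)
  }

  // assetId is optional when querying, mirroring the description above.
  getNextHop(tenantId: string, destination: string, assetId?: string): string | undefined {
    const segments = destination.split('.')
    // Try the full address first, then drop one trailing segment at a time,
    // so the longest matching prefix wins.
    for (let n = segments.length; n > 0; n--) {
      const key = `${tenantId}:${segments.slice(0, n).join('.')}`
      const entries = this.table.get(key) ?? []
      const match = entries.find((e) => assetId === undefined || e.assetId === assetId)
      if (match) return match.nextHop
    }
    return undefined
  }
}

const table = new RoutingTable()
table.addRoute('tenant-a', 'test', 'peer-b', 'asset-usd')
table.addRoute('tenant-a', 'test.happy-life-bank', 'peer-c', 'asset-usd')

const viaLongest = table.getNextHop('tenant-a', 'test.happy-life-bank.alice')
const viaFallback = table.getNextHop('tenant-a', 'test.other-bank.bob')
const otherTenant = table.getNextHop('tenant-x', 'test.happy-life-bank.alice')
```

In this sketch viaLongest resolves to peer-c (the more specific prefix), viaFallback resolves to peer-b (only the shorter test prefix matches), and otherTenant is undefined because the tenant id is part of the key.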

The routing logic is now also responsible for resolving the peering asymmetry issue described here in a multi-tenanted environment.

Telemetry

There are 2 key metrics added in this PR:

  • ilp_prepare_packet_processing_ms: Measures the time it takes to process an individual ILP prepare packet through the connector middleware. It is a histogram with a label that denotes the packet's operation (outgoing_payment, incoming_payment, routing, or unknown, which includes rate probes). The ILP metrics Grafana dashboard has P50 and P95 percentile panels for tracking latency.
  • ilp_payment_round_trip_ms: Measures the round-trip time for completing an ILP payment (on the sender side). This is also a histogram, and the average round-trip time is shown in the dashboard.
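
To make the P50 and P95 panels more concrete, here is a small sketch of recording processing times per operation label and reading percentiles back. The DurationRecorder class, the nearest-rank percentile method, and the sample durations are assumptions for illustration only; the actual metrics are emitted through the telemetry stack, and the dashboard derives its percentiles from histogram buckets rather than raw samples.

```typescript
// Hypothetical in-memory stand-in for a histogram with an "operation" label.
type Operation = 'outgoing_payment' | 'incoming_payment' | 'routing' | 'unknown'

class DurationRecorder {
  private samples = new Map<Operation, number[]>()

  record(operation: Operation, durationMs: number): void {
    const list = this.samples.get(operation) ?? []
    list.push(durationMs)
    this.samples.set(operation, list)
  }

  // Nearest-rank percentile: the smallest sample such that at least
  // p percent of all samples are less than or equal to it.
  percentile(operation: Operation, p: number): number | undefined {
    const sorted = (this.samples.get(operation) ?? []).slice().sort((a, b) => a - b)
    if (sorted.length === 0) return undefined
    const rank = Math.max(1, Math.ceil((p / 100) * sorted.length))
    return sorted[rank - 1]
  }
}

const recorder = new DurationRecorder()
// Made-up durations in milliseconds for packets classified as "routing".
const routingDurations = [4, 7, 5, 30, 6, 5, 8, 6, 5, 120]
for (const ms of routingDurations) {
  recorder.record('routing', ms)
}

const p50 = recorder.percentile('routing', 50)
const p95 = recorder.percentile('routing', 95)
```

With these made-up samples, p50 is 6 and p95 is 120, which shows why the two panels are tracked separately: a couple of slow packets barely move the median but dominate the tail.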

Local testing

Only the non-tenanted setup now has the multi-hop feature with 3 instances.
Use any of the non-tenanted Open Payments Bruno collections as-is to test this flow.
Run pnpm localenv:compose:multihop up to spin it up locally.

Notes

  • maxPacketAmount was added to the seed because, if it is not set, rate probes lock a large part of a peer's balance during quoting, and that balance is only moved or released on receiving a fulfill or reject. The total amount locked is usually 10^13 + 10^12 + ... + 10^3 (the sum of each probe packet's decreasing value). This hurts payment throughput, as even a few payments could lock a peer's whole balance. We mitigate that by setting maxPacketAmount to a reasonable value, so that rate probe packets do not lock any value until they match the expected maxPacketAmount. Therefore, expect to see error logs (caused by reject packets) for AmountTooLargeError when quoting until the rate probes match the configured maxPacketAmount. Only global-bank needs maxPacketAmount set, since this is relevant for "receiving" instances when quoting.
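
The arithmetic behind that note can be worked through with a short sketch. It assumes the probe amounts are exactly the powers of ten from 10^3 to 10^13 mentioned above, and it uses a simplified model in which any probe larger than maxPacketAmount is rejected and locks nothing. The totalLocked helper and the limit of 1000000 are hypothetical, chosen only to show the effect; plain numbers are used for brevity, which is safe here because every value stays below 2^53.

```typescript
// Probe amounts as described in the note: 10^3, 10^4, ..., 10^13.
const probeAmounts: number[] = []
let amount = 1000
for (let exp = 3; exp <= 13; exp++) {
  probeAmounts.push(amount)
  amount *= 10
}

// Simplified model: a probe above maxPacketAmount is rejected
// (AmountTooLargeError) and therefore locks nothing.
function totalLocked(amounts: number[], maxPacketAmount?: number): number {
  return amounts
    .filter((a) => maxPacketAmount === undefined || a <= maxPacketAmount)
    .reduce((sum, a) => sum + a, 0)
}

// No limit configured: every probe locks its full amount.
const lockedWithoutLimit = totalLocked(probeAmounts)

// Hypothetical limit of 10^6: only probes at or below the limit lock value.
const lockedWithLimit = totalLocked(probeAmounts, 1000000)
```

Under these assumptions lockedWithoutLimit is 11111111111000 per quote, while lockedWithLimit drops to 1111000, which is why a handful of concurrent quotes can exhaust a peer's balance when no limit is set.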

netlify bot commented Jul 18, 2025

Deploy Preview for brilliant-pasca-3e80ec canceled.

Name Link
🔨 Latest commit 60c60dc
🔍 Latest deploy log https://app.netlify.com/projects/brilliant-pasca-3e80ec/deploys/68d2bf51f57a2a00082499cb

@github-actions github-actions bot added type: tests Testing related pkg: backend Changes in the backend package. pkg: frontend Changes in the frontend package. type: source Changes business logic pkg: mock-ase pkg: mock-account-service-lib labels Jul 18, 2025
@sanducb sanducb changed the title feat: implement multihop static routing feat(backend): multi-hop payments with static routing Jul 18, 2025
github-actions bot commented Jul 18, 2025

🚀 Performance Test Results

Test Configuration:

  • VUs: 4
  • Duration: 1m0s

Test Metrics:

  • Requests/s: 44.59
  • Iterations/s: 14.88
  • Failed Requests: 0.00% (0 of 2682)
📜 Logs

> [email protected] run-tests:testenv /home/runner/work/rafiki/rafiki/test/performance
> ./scripts/run-tests.sh -e test "-k" "-q" "--vus" "4" "--duration" "1m"

Cloud Nine GraphQL API is up: http://localhost:3101/graphql
Cloud Nine Wallet Address is up: http://localhost:3100/
Happy Life Bank Address is up: http://localhost:4100/
cloud-nine-wallet-test-backend already set
cloud-nine-wallet-test-auth already set
happy-life-bank-test-backend already set
happy-life-bank-test-auth already set
     data_received..................: 968 kB 16 kB/s
     data_sent......................: 2.1 MB 34 kB/s
     http_req_blocked...............: avg=7.16µs   min=1.76µs   med=4.78µs   max=954.97µs p(90)=5.97µs   p(95)=6.39µs  
     http_req_connecting............: avg=908ns    min=0s       med=0s       max=921.79µs p(90)=0s       p(95)=0s      
     http_req_duration..............: avg=89.13ms  min=8.07ms   med=72.78ms  max=646.27ms p(90)=151.09ms p(95)=172.1ms 
       { expected_response:true }...: avg=89.13ms  min=8.07ms   med=72.78ms  max=646.27ms p(90)=151.09ms p(95)=172.1ms 
     http_req_failed................: 0.00%  ✓ 0         ✗ 2682
     http_req_receiving.............: avg=79.87µs  min=26.37µs  med=70.53µs  max=1.19ms   p(90)=106.1µs  p(95)=136.44µs
     http_req_sending...............: avg=38.08µs  min=10.18µs  med=25.52µs  max=3.69ms   p(90)=36.73µs  p(95)=52.4µs  
     http_req_tls_handshaking.......: avg=0s       min=0s       med=0s       max=0s       p(90)=0s       p(95)=0s      
     http_req_waiting...............: avg=89.01ms  min=7.95ms   med=72.69ms  max=646.16ms p(90)=151.01ms p(95)=172.01ms
     http_reqs......................: 2682   44.59473/s
     iteration_duration.............: avg=268.62ms min=170.34ms med=254.17ms max=1.2s     p(90)=323.64ms p(95)=370.89ms
     iterations.....................: 895    14.881537/s
     vus............................: 4      min=4       max=4 
     vus_max........................: 4      min=4       max=4 

@BlairCurrey
Contributor

Just checking in with some observations to follow up on the issues @sanducb mentioned in the call last week.

These were:

  • some random socket hangups by some backends requiring a container restart
  • Open Payments flow in the Bruno example intermittently failing

I ran the tenanted open payments flow, which always completed but sometimes showed these error logs:

global-bank-backend-1          | {"level":20,"time":1753127207814,"pid":30,"hostname":"global-bank-backend","service":"RouterService","destination":"test.happy-life-bank.MJHzYR7_ogW2GlMTr5teAEO8KgPMf4I6cFW1oUJkmNdrteSvdOc75l6o_rmTa_A4Z8hl1-9z5j9OcCOz8DzEdQ2eGKZxOeA","prefix":"test","tenantId":"53f2d913-e98a-40b9-b270-372d0547f23e","selectedPeer":"8e8aaed3-761f-4050-8de2-7b094df64b4b","msg":"found next hop"}
global-bank-backend-1          | {"level":50,"time":1753127207816,"pid":30,"hostname":"global-bank-backend","service":"ConnectorService","module":"balance-middleware","transferOptions":{"sourceAccount":{"id":"8e8aaed3-761f-4050-8de2-7b094df64b4b","assetId":"d5002e16-bc22-46f1-bc3f-a0b9d3c60e96","maxPacketAmount":null,"staticIlpAddress":"test.intergalactic-bank","name":null,"createdAt":"2025-07-18T19:06:32.900Z","updatedAt":"2025-07-18T19:06:32.900Z","liquidityThreshold":"1000000","tenantId":"53f2d913-e98a-40b9-b270-372d0547f23e","routes":["test.intergalactic-bank","test.intergalactic-bank","test"],"http":{"outgoing":{"authToken":"global-to-intergalactic","endpoint":"http://intergalactic-bank-backend:3002"}},"asset":{"id":"d5002e16-bc22-46f1-bc3f-a0b9d3c60e96","ledger":1,"code":"USD","scale":2,"withdrawalThreshold":null,"createdAt":"2025-07-18T19:06:32.780Z","updatedAt":"2025-07-18T19:06:32.780Z","liquidityThreshold":"10000000","deletedAt":null,"tenantId":"53f2d913-e98a-40b9-b270-372d0547f23e"}},"destinationAccount":{"id":"8e8aaed3-761f-4050-8de2-7b094df64b4b","assetId":"d5002e16-bc22-46f1-bc3f-a0b9d3c60e96","maxPacketAmount":null,"staticIlpAddress":"test.intergalactic-bank","name":null,"createdAt":"2025-07-18T19:06:32.900Z","updatedAt":"2025-07-18T19:06:32.900Z","liquidityThreshold":"1000000","tenantId":"53f2d913-e98a-40b9-b270-372d0547f23e","routes":["test.intergalactic-bank","test.intergalactic-bank","test"],"http":{"outgoing":{"authToken":"[Redacted]","endpoint":"http://intergalactic-bank-backend:3002"}},"asset":{"id":"d5002e16-bc22-46f1-bc3f-a0b9d3c60e96","ledger":1,"code":"USD","scale":2,"withdrawalThreshold":null,"createdAt":"2025-07-18T19:06:32.780Z","updatedAt":"2025-07-18T19:06:32.780Z","liquidityThreshold":"10000000","deletedAt":null,"tenantId":"53f2d913-e98a-40b9-b270-372d0547f23e"}},"sourceAmount":"100","destinationAmount":"100","transferType":"TRANSFER","timeout":5},"transferError":"SameAccounts","msg":"Could not create transfer"}
global-bank-backend-1          | {"level":30,"time":1753127207816,"pid":30,"hostname":"global-bank-backend","service":"ConnectorService","err":{"type":"InternalServerError","message":"[object Object]","stack":"InternalServerError: [object Object]\n    at ctxThrow (/home/rafiki/node_modules/.pnpm/[email protected]/node_modules/koa/lib/context.js:97:11)\n    at createPendingTransfer (/home/rafiki/packages/backend/src/payment-method/ilp/connector/core/middleware/balance.ts:100:13)\n    at processTicksAndRejections (node:internal/process/task_queues:95:5)\n    at /home/rafiki/packages/backend/src/payment-method/ilp/connector/core/middleware/balance.ts:115:19\n    at ildcp (/home/rafiki/packages/backend/src/payment-method/ilp/connector/core/middleware/ildcp.ts:19:7)\n    at /home/rafiki/packages/backend/src/payment-method/ilp/connector/core/middleware/throughput.ts:90:5\n    at /home/rafiki/packages/backend/src/payment-method/ilp/connector/core/middleware/rate-limit.ts:54:5\n    at /home/rafiki/packages/backend/src/payment-method/ilp/connector/core/middleware/max-packet-amount.ts:30:5\n    at account (/home/rafiki/packages/backend/src/payment-method/ilp/connector/core/middleware/account.ts:137:5)\n    at /home/rafiki/packages/backend/src/payment-method/ilp/connector/core/middleware/stream-address.ts:38:5","status":500,"statusCode":500,"expose":false},"msg":"Error thrown in incoming pipeline"}
global-bank-backend-1          | {"level":50,"time":1753127207817,"pid":30,"hostname":"global-bank-backend","service":"ConnectorService","err":{"type":"InternalServerError","message":"[object Object]","stack":"InternalServerError: [object Object]\n    at ctxThrow (/home/rafiki/node_modules/.pnpm/[email protected]/node_modules/koa/lib/context.js:97:11)\n    at createPendingTransfer (/home/rafiki/packages/backend/src/payment-method/ilp/connector/core/middleware/balance.ts:100:13)\n    at processTicksAndRejections (node:internal/process/task_queues:95:5)\n    at /home/rafiki/packages/backend/src/payment-method/ilp/connector/core/middleware/balance.ts:115:19\n    at ildcp (/home/rafiki/packages/backend/src/payment-method/ilp/connector/core/middleware/ildcp.ts:19:7)\n    at /home/rafiki/packages/backend/src/payment-method/ilp/connector/core/middleware/throughput.ts:90:5\n    at /home/rafiki/packages/backend/src/payment-method/ilp/connector/core/middleware/rate-limit.ts:54:5\n    at /home/rafiki/packages/backend/src/payment-method/ilp/connector/core/middleware/max-packet-amount.ts:30:5\n    at account (/home/rafiki/packages/backend/src/payment-method/ilp/connector/core/middleware/account.ts:137:5)\n    at /home/rafiki/packages/backend/src/payment-method/ilp/connector/core/middleware/stream-address.ts:38:5","status":500,"statusCode":500,"expose":false},"msg":"unexpected internal error"}

I ran the non-tenanted open payments flow and the first time saw the create quote command take ~10 seconds and then return an Internal Server Error with these logs:

cloud-nine-backend-1           | {"level":50,"time":1753127354732,"pid":30,"hostname":"cloud-nine-wallet-backend","service":"QuoteService","err":{"type":"PaymentMethodHandlerError","message":"Received error during ILP quoting","stack":"PaymentMethodHandlerError: Received error during ILP quoting\n    at getQuote (/home/rafiki/packages/backend/src/payment-method/ilp/service.ts:155:13)\n    at runNextTicks (node:internal/process/task_queues:60:5)\n    at processTimers (node:internal/timers:516:9)\n    at createQuote (/home/rafiki/packages/backend/src/open_payments/quote/service.ts:208:15)\n    at createQuote (/home/rafiki/packages/backend/src/open_payments/quote/routes.ts:98:22)\n    at getWalletAddressForSubresource (/home/rafiki/packages/backend/src/open_payments/wallet_address/middleware.ts:121:3)\n    at httpsigMiddleware (/home/rafiki/packages/backend/src/open_payments/auth/middleware.ts:264:3)\n    at /home/rafiki/packages/backend/src/open_payments/auth/middleware.ts:165:5\n    at getWalletAddressUrlFromRequestBody (/home/rafiki/packages/backend/src/open_payments/wallet_address/middleware.ts:19:3)\n    at /home/rafiki/node_modules/.pnpm/@[email protected]/node_modules/@interledger/openapi/dist/middleware.js:27:9","name":"PaymentMethodHandlerError","description":"RateProbeFailed","retryable":true},"msg":"error creating a quote"}
cloud-nine-backend-1           | 
cloud-nine-backend-1           |   InternalServerError: Internal Server Error
cloud-nine-backend-1           |       at Object.throw (/home/rafiki/node_modules/.pnpm/[email protected]/node_modules/koa/lib/context.js:97:11)
cloud-nine-backend-1           |       at openPaymentsServerErrorMiddleware (/home/rafiki/packages/backend/src/open_payments/route-errors.ts:105:14)
cloud-nine-backend-1           |       at processTicksAndRejections (node:internal/process/task_queues:95:5)
cloud-nine-backend-1           |       at bodyParser (/home/rafiki/node_modules/.pnpm/[email protected]/node_modules/koa-bodyparser/index.js:78:5)
cloud-nine-backend-1           |       at cors (/home/rafiki/node_modules/.pnpm/@[email protected]/node_modules/@koa/cors/index.js:109:16)
cloud-nine-backend-1           | 
cloud-nine-backend-1           | {"level":50,"time":1753127354735,"pid":30,"hostname":"cloud-nine-wallet-backend","method":"POST","path":"/438fa74a-fa7d-4317-9ced-dde32ece1787/quotes","err":{"type":"PaymentMethodHandlerError","message":"Received error during ILP quoting","stack":"PaymentMethodHandlerError: Received error during ILP quoting\n    at getQuote (/home/rafiki/packages/backend/src/payment-method/ilp/service.ts:155:13)\n    at runNextTicks (node:internal/process/task_queues:60:5)\n    at processTimers (node:internal/timers:516:9)\n    at createQuote (/home/rafiki/packages/backend/src/open_payments/quote/service.ts:208:15)\n    at createQuote (/home/rafiki/packages/backend/src/open_payments/quote/routes.ts:98:22)\n    at getWalletAddressForSubresource (/home/rafiki/packages/backend/src/open_payments/wallet_address/middleware.ts:121:3)\n    at httpsigMiddleware (/home/rafiki/packages/backend/src/open_payments/auth/middleware.ts:264:3)\n    at /home/rafiki/packages/backend/src/open_payments/auth/middleware.ts:165:5\n    at getWalletAddressUrlFromRequestBody (/home/rafiki/packages/backend/src/open_payments/wallet_address/middleware.ts:19:3)\n    at /home/rafiki/node_modules/.pnpm/@[email protected]/node_modules/@interledger/openapi/dist/middleware.js:27:9","name":"PaymentMethodHandlerError","description":"RateProbeFailed","retryable":true},"msg":"Received unhandled error in Open Payments request"}

Then a subsequent attempt to create the quote worked, as did the rest of the flow. Then I tried again, and the entire flow worked. I tore everything down, including the volumes, retried, and saw an error in Bruno on the grant request for incoming payment (Error invoking remote method 'send-http-request': Error: socket hang up), although I don't see any stopped containers. Then I tore down (removing volumes) and rebuilt several times, and now see the socket hang up errors for the get wallet address requests too, with one or more mock ASEs down.

@njlie
Contributor

njlie commented Jul 31, 2025

I am not too comfortable with the current localenv test setup even though it works, because I think it should be separated completely from the multitenancy-only setup. I leave it up for discussion to find the most ergonomic way we can do this.

I was thinking about how cumbersome the localenv options were getting when adding the localenv script for creating the multitenancy environment. Perhaps something we could do in the future is have a bash script that takes in options modularly like pnpm localenv compose --telemetry --multitenancy --multihop up so it can handle the different environments more neatly.

Contributor

@njlie njlie left a comment

Haven't finished reviewing everything, but wanted to share the comments I have already.

peerUrl: http://happy-life-bank-backend:3002
peerIlpAddress: test.happy-life-bank
- initialLiquidity: '1000000000000'
peerUrl: http://intergalactic-bank-backend:3002
Contributor

I think this breaks the most basic local environment setup. There might need to be different docker-compose files for the multihop environment that point to different seed files, so that cloud-nine-wallet can route to happy-life-bank with or without any hops in the middle.

Contributor Author

@sanducb sanducb Aug 1, 2025

I agree. Current localenv setup only makes sense for the purpose of testing the multi-hop logic. I will think about a way of making multi-hop coexist with all the other setups and bring it up here for discussion. Your suggestion here sounds like a good avenue to explore.

@sanducb
Contributor Author

sanducb commented Aug 22, 2025

I addressed some comments in the latest commits and made some adjustments that I detailed in the Notes section of the PR description. Please check the updated description before re-reviewing. @njlie @BlairCurrey

@sanducb sanducb requested a review from njlie August 22, 2025 16:52
Contributor

@njlie njlie left a comment

LGTM, I tested the Bruno collection and it seems to be working. I just had a small note about the tests.

)
})

test('Updates peer routes in router service', async (): Promise<void> => {
Contributor

It might be good for the update tests to also ensure that the prior routes were cleared before the new ones are synced with a .toHaveBeenCalled() check or something.

@sanducb sanducb requested a review from njlie August 28, 2025 13:12
BlairCurrey
BlairCurrey previously approved these changes Sep 2, 2025
Contributor

@BlairCurrey BlairCurrey left a comment

Looks good and appears to be working as expected locally

Contributor

@mkurapov mkurapov left a comment

All working for me!

I'm thinking, though, that to keep our local playground docs accurate (and keep the local playground simple), we should keep pnpm localenv:compose the same, such that it only starts up two Rafiki nodes.

What we can do is have a pnpm localenv:compose:multihop, which starts up the three nodes as you have them now, and because you updated the UpdatePeer Mutation, we can just have the seed script update the peering routes on start, such that we "enable" the multi-hop functionality without having to do much else :)

Contributor

@mkurapov mkurapov left a comment

I think deleting a peer will be difficult given the FK constraints + the accounting data (and also maybe we shouldn't for audit reasons). Instead, we could create an additional multihop docker compose (kind of how the merged docker compose file gets added to the pnpm command) which configures the containers to use separate seed files. In the seed files, the peer which will be skipped for routing will just have an "unreachable" ILP address and routes, kind of how you did it for global bank in the cloud nine wallet seed file. That way we don't need any custom code in the seed script for global bank.

Comment on lines +290 to +294
if (options.routes !== undefined) {
  const staticIlpAddress =
    options.staticIlpAddress ?? existingPeer.staticIlpAddress
  updateData.routes = [staticIlpAddress, ...options.routes]
}
Contributor

Is the staticIlpAddress already added in syncPeerRoutes?

Contributor Author

@sanducb sanducb Sep 22, 2025

Yes, in syncPeerRoutes we already add the staticIlpAddress to the routing table. Here we make sure that we also update the routes db entry with the peer's static address, since that is used here when loading the routes from the db. The redundancy in syncPeerRoutes ensures that the static address is always in the routing table (even if the peer is modified to have invalid routes) by taking the address from the peer entity.

@sanducb sanducb requested a review from mkurapov September 22, 2025 15:38
Development

Successfully merging this pull request may close these issues.

Static routing implementation in connector
4 participants