@oneonestar
Member

@oneonestar oneonestar commented Oct 15, 2025

Description

Add a cache to HaGatewayManager.
Continues the work in #501.

Additional context and related issues

Set databaseCacheTTL to a non-zero Airlift duration value to enable in-memory caching of backend metadata retrieved from the gateway database. Trino Gateway caches the list of backend clusters for the specified time and refreshes it asynchronously. Use this setting to reduce database load and improve routing performance.

A value of 0s (the default) disables the cache and queries the database on
every request.
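
As a rough illustration of the refresh behavior described above, here is a minimal sketch that uses only the JDK. It is not the code in this PR, which relies on Guava's `LoadingCache` with `refreshAfterWrite`. The class name `RefreshingSnapshot`, the injected clock, and the use of `java.time.Duration` instead of an Airlift duration are all assumptions made for the example, and it covers only the cache-enabled case.

```java
import java.time.Duration;
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical stand-in for the Guava cache used in the PR; all names are illustrative.
final class RefreshingSnapshot<T>
{
    private final Supplier<T> loader;
    private final long ttlNanos;
    private final Executor refreshExecutor;
    private final LongSupplier clock;
    private final AtomicBoolean refreshing = new AtomicBoolean();
    private volatile T value;
    private volatile long loadedAtNanos;

    RefreshingSnapshot(Supplier<T> loader, Duration ttl, Executor refreshExecutor, LongSupplier clock)
    {
        this.loader = loader;
        this.ttlNanos = ttl.toNanos();
        this.refreshExecutor = refreshExecutor;
        this.clock = clock;
        // Load once synchronously so that a broken database fails construction.
        this.value = loader.get();
        this.loadedAtNanos = clock.getAsLong();
    }

    T get()
    {
        boolean stale = clock.getAsLong() - loadedAtNanos >= ttlNanos;
        // Schedule at most one background reload at a time.
        if (stale && refreshing.compareAndSet(false, true)) {
            refreshExecutor.execute(() -> {
                try {
                    value = loader.get();
                    loadedAtNanos = clock.getAsLong();
                }
                catch (RuntimeException e) {
                    // Keep the previous value; the next stale read schedules another attempt.
                }
                finally {
                    refreshing.set(false);
                }
            });
        }
        // Callers never block on the reload; they see the old value until it completes.
        return value;
    }
}
```

In this sketch a stale read returns the previous value immediately and hands the reload to the executor, which is the trade-off the setting makes: lower database load in exchange for data that can be up to one TTL old, or older if reloads keep failing.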

Release notes

(x) Release notes are required, with the following suggested text:

* Allow setting `databaseCacheTTL` to enable in-memory caching of backend metadata retrieved from the gateway database.

Summary by Sourcery

Introduce an optional in-memory cache for gateway backends in HaGatewayManager with configurable TTL and automatic invalidation on data changes.

New Features:

  • Add optional Guava LoadingCache for all gateway backends with a databaseCacheTTL setting in RoutingConfiguration
  • Expose metrics for cache lookup successes and failures

Enhancements:

  • Invalidate and asynchronously refresh the backend cache on startup and whenever backends are added, updated, activated, or deleted
  • Replace multiple DAO methods with a single findAll and perform in-memory filtering for active, group-specific, and named backends

Tests:

  • Update TestHaGatewayManager to validate behavior with cache enabled and disabled

Summary by Sourcery

Introduce configurable in-memory caching in HaGatewayManager to reduce database load and improve routing performance by caching backend metadata with TTL, automatic invalidation, and streamlined retrieval, while updating tests and documentation.

New Features:

  • Add optional Guava-based in-memory cache for gateway backends with configurable TTL and asynchronous refresh
  • Expose metrics for backend cache lookup successes and failures

Enhancements:

  • Invalidate and refresh the backend cache automatically on backend additions, updates, activations, or deletions
  • Simplify backend retrieval by fetching all backends once and performing in-memory filtering instead of multiple DAO queries

Documentation:

  • Update routing rules documentation to include the databaseCacheTTL configuration and caching behavior

Tests:

  • Extend TestHaGatewayManager to validate behavior with cache enabled and disabled

@cla-bot cla-bot bot added the cla-signed label Oct 15, 2025
sourcery-ai bot commented Oct 15, 2025

Reviewer's Guide

This pull request adds optional in-memory caching of gateway backend metadata in HaGatewayManager, driven by a new databaseCacheTTL setting. It leverages a Guava LoadingCache with TTL-based asynchronous reload, centralizes all data retrieval through a single fetch method with integrated success/failure metrics, refactors existing backend lookup methods to use in-memory filtering, and automatically invalidates the cache on backend mutations. Redundant DAO queries are removed, and tests and documentation are updated to support both cache-enabled and disabled configurations.
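
The in-memory filtering mentioned above could look roughly like the following sketch. The `Backend` record and the `BackendFilters` helper are hypothetical names introduced for illustration; the record fields mirror the `gateway_backend` columns listed in this guide, and the real code operates on `GatewayBackend` and `ProxyBackendConfiguration` instead.

```java
import java.util.List;
import java.util.Optional;

// Illustrative record; the fields follow the gateway_backend columns.
record Backend(String name, String routingGroup, String proxyTo, String externalUrl, boolean active) {}

// Hypothetical helper showing filters applied to a single findAll() result.
final class BackendFilters
{
    private BackendFilters() {}

    static List<Backend> allActive(List<Backend> all)
    {
        return all.stream()
                .filter(Backend::active)
                .toList();
    }

    static List<Backend> activeInGroup(List<Backend> all, String routingGroup)
    {
        return all.stream()
                .filter(Backend::active)
                .filter(backend -> backend.routingGroup().equals(routingGroup))
                .toList();
    }

    static Optional<Backend> byName(List<Backend> all, String name)
    {
        return all.stream()
                .filter(backend -> backend.name().equals(name))
                .findFirst();
    }
}
```

Each lookup is a linear scan over the full list, which is cheap for a small number of backends but grows with the size of the table.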

Entity relationship diagram for gateway_backend table access

erDiagram
    GATEWAY_BACKEND {
        string name
        string routing_group
        string proxy_to
        string external_url
        boolean active
    }
    HaGatewayManager ||--o{ GATEWAY_BACKEND : fetches
    GatewayBackendDao ||--o{ GATEWAY_BACKEND : queries

Class diagram for updated HaGatewayManager caching logic

classDiagram
    class HaGatewayManager {
        -GatewayBackendDao dao
        -String defaultRoutingGroup
        -boolean cacheEnabled
        -LoadingCache<Object, List<GatewayBackend>> backendCache
        -CounterStat backendLookupSuccesses
        -CounterStat backendLookupFailures
        +HaGatewayManager(Jdbi, RoutingConfiguration)
        +List<ProxyBackendConfiguration> getAllBackends()
        +List<ProxyBackendConfiguration> getAllActiveBackends()
        +List<ProxyBackendConfiguration> getActiveBackends(String)
        +Optional<ProxyBackendConfiguration> getBackendByName(String)
        +ProxyBackendConfiguration addBackend(ProxyBackendConfiguration)
        +ProxyBackendConfiguration updateBackend(ProxyBackendConfiguration)
        +void deleteBackend(String)
        -List<GatewayBackend> fetchAllBackends()
        -void invalidateBackendCache()
        -List<GatewayBackend> getOrFetchAllBackends()
    }
    HaGatewayManager --> GatewayBackendDao
    HaGatewayManager --> LoadingCache
    HaGatewayManager --> CounterStat
    class LoadingCache {
        +getUnchecked(Object)
        +invalidateAll()
    }
    class CounterStat {
        +update(int)
    }

Class diagram for updated RoutingConfiguration with databaseCacheTTL

classDiagram
    class RoutingConfiguration {
        -Duration asyncTimeout
        -Duration databaseCacheTTL
        -boolean addXForwardedHeaders
        -String defaultRoutingGroup
        +Duration getDatabaseCacheTTL()
        +void setDatabaseCacheTTL(Duration)
    }
    RoutingConfiguration --> Duration

Class diagram for updated GatewayBackendDao interface

classDiagram
    class GatewayBackendDao {
        +List<GatewayBackend> findAll()
        +GatewayBackend findByName(String name)
        +void create(...)
        +void update(...)
        +void deleteByName(String name)
    }

File-Level Changes

| Change | Details | Files |
| --- | --- | --- |
| Introduce optional in-memory cache with configurable TTL and metrics in HaGatewayManager | Add cacheEnabled flag and LoadingCache initialization based on databaseCacheTTL; configure cache with refreshAfterWrite and async reloading; implement initial cache priming in constructor for fail-fast behavior; expose CounterStat metrics for cache lookup successes and failures | HaGatewayManager.java |
| Centralize data access with fetchAllBackends and conditional cache retrieval | Implement fetchAllBackends method with try/catch and metric updates; add getOrFetchAllBackends to switch between cache and direct database fetch | HaGatewayManager.java |
| Refactor backend lookup methods to use unified fetch and in-memory filtering | Replace direct DAO calls in all get* methods with getOrFetchAllBackends; apply stream-based filters for active, group-specific, and named backends using ImmutableList collector | HaGatewayManager.java |
| Invalidate cache on backend lifecycle changes | Call invalidateBackendCache after activation status change, backend add, update, and delete operations | HaGatewayManager.java |
| Consolidate DAO methods to a single findAll call | Remove specialized DAO queries (findActiveBackend, findActiveBackendByRoutingGroup, findByName); rely on in-memory filtering instead of multiple DAO methods | GatewayBackendDao.java |
| Enhance tests to cover cache-enabled and disabled modes | Add testGatewayManagerWithCache and testGatewayManagerWithoutCache to set non-zero and zero TTL; refactor common test logic into a shared helper method | TestHaGatewayManager.java |
| Expose databaseCacheTTL setting and update documentation | Add databaseCacheTTL field with default zero to RoutingConfiguration; provide getter and setter for the new setting; update routing-rules.md with TTL usage instructions | RoutingConfiguration.java, routing-rules.md |

@oneonestar oneonestar marked this pull request as ready for review October 16, 2025 03:49
@sourcery-ai sourcery-ai bot left a comment

Hey there - I've reviewed your changes - here's some feedback:

  • HaGatewayManager creates a dedicated executor for cache reloads but never shuts it down—consider implementing AutoCloseable or a shutdown method to avoid thread leaks.
  • Using refreshAfterWrite alone can serve stale data if an async reload fails; consider adding expireAfterWrite or an eviction policy to prevent stale entries from persisting indefinitely.
  • Loading all backends and filtering in memory for every lookup may not scale with large backend lists—evaluate using selective DAO queries or secondary caches/indexes for group- and name-based lookups.
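
One way to address the executor lifecycle point is to let the owning class implement `AutoCloseable`. The following is only a sketch under that assumption: `RefreshExecutorOwner` is a hypothetical name standing in for HaGatewayManager, and the five-second grace period is an arbitrary placeholder.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical owner of the cache refresh executor.
final class RefreshExecutorOwner
        implements AutoCloseable
{
    private final ExecutorService refreshExecutor = Executors.newSingleThreadExecutor();

    ExecutorService executor()
    {
        return refreshExecutor;
    }

    @Override
    public void close()
    {
        // Stop accepting new reloads and let queued ones finish.
        refreshExecutor.shutdown();
        try {
            // Placeholder grace period; force shutdown if reloads are still running.
            if (!refreshExecutor.awaitTermination(5, TimeUnit.SECONDS)) {
                refreshExecutor.shutdownNow();
            }
        }
        catch (InterruptedException e) {
            refreshExecutor.shutdownNow();
            // Preserve the interrupt status for the caller.
            Thread.currentThread().interrupt();
        }
    }
}
```

With this shape, tests and application shutdown hooks can use try-with-resources or call `close()` explicitly so the refresh thread does not outlive the manager.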
## Individual Comments

### Comment 1
<location> `gateway-ha/src/main/java/io/trino/gateway/ha/router/HaGatewayManager.java:65-66` </location>
<code_context>
     {
         dao = requireNonNull(jdbi, "jdbi is null").onDemand(GatewayBackendDao.class);
         this.defaultRoutingGroup = routingConfiguration.getDefaultRoutingGroup();
+        if (!routingConfiguration.getDatabaseCacheTTL().isZero()) {
+            cacheEnabled = true;
+            backendCache = CacheBuilder
+                    .newBuilder()
+                    .initialCapacity(1)
+                    .refreshAfterWrite(routingConfiguration.getDatabaseCacheTTL().toJavaTime())
+                    .build(CacheLoader.asyncReloading(
+                            CacheLoader.from(this::fetchAllBackends),
+                            MoreExecutors.listeningDecorator(Executors.newSingleThreadExecutor())));
+            // Load the data once during initialization. This ensures a fail-fast behavior in case of database misconfiguration.
+            backendCache.getUnchecked(ALL_BACKEND_CACHE_KEY);
+        }
+        else {
</code_context>

<issue_to_address>
**suggestion (bug_risk):** Consider handling cache warm-up failures more robustly.

If cache warm-up fails due to a database issue, an unchecked exception may prevent service startup or obscure error details. Consider catching and logging these failures, or failing fast with a clear error message.

```suggestion
            // Load the data once during initialization. This ensures a fail-fast behavior in case of database misconfiguration.
            try {
                backendCache.getUnchecked(ALL_BACKEND_CACHE_KEY);
            }
            catch (Exception e) {
                // Log the error and fail fast with a clear message
                // Replace with your logger if not using SLF4J
                org.slf4j.LoggerFactory.getLogger(HaGatewayManager.class)
                        .error("Failed to warm up backend cache during initialization. Check database configuration.", e);
                throw new IllegalStateException("Failed to warm up backend cache during initialization", e);
            }
```
</issue_to_address>

### Comment 2
<location> `gateway-ha/src/test/java/io/trino/gateway/ha/router/TestHaGatewayManager.java:32-39` </location>
<code_context>
-
-    @BeforeAll
-    void setUp()
+    @Test
+    void testGatewayManagerWithCache()
     {
         JdbcConnectionManager connectionManager = createTestingJdbcConnectionManager();
</code_context>

<issue_to_address>
**suggestion (testing):** Missing test for cache invalidation after backend changes.

Add tests to confirm cache invalidation and refresh when backends are modified, ensuring stale data is not served.

```suggestion
    @Test
    void testGatewayManagerWithCache()
    {
        JdbcConnectionManager connectionManager = createTestingJdbcConnectionManager();
        RoutingConfiguration routingConfiguration = new RoutingConfiguration();
        routingConfiguration.setDatabaseCacheTTL(new Duration(5, TimeUnit.SECONDS));
        testGatewayManager(new HaGatewayManager(connectionManager.getJdbi(), routingConfiguration));
    }

    @Test
    void testCacheInvalidationAfterBackendChange()
    {
        JdbcConnectionManager connectionManager = createTestingJdbcConnectionManager();
        RoutingConfiguration routingConfiguration = new RoutingConfiguration();
        routingConfiguration.setDatabaseCacheTTL(new Duration(5, TimeUnit.SECONDS));
        HaGatewayManager gatewayManager = new HaGatewayManager(connectionManager.getJdbi(), routingConfiguration);

        // Populate cache by fetching backends
        var initialBackends = gatewayManager.getAllActiveBackends();
        String newBackendName = "new-backend";
        assertThat(initialBackends).extracting(ProxyBackendConfiguration::getName).doesNotContain(newBackendName);

        // Add a new backend through the manager, which should invalidate the cache
        ProxyBackendConfiguration newBackend = new ProxyBackendConfiguration();
        newBackend.setName(newBackendName);
        newBackend.setRoutingGroup("adhoc");
        newBackend.setProxyTo("http://new-backend:8080");
        newBackend.setExternalUrl("http://new-backend:8080");
        newBackend.setActive(true);
        gatewayManager.addBackend(newBackend);

        // Fetch again without waiting for the TTL; the new backend should already be visible
        var refreshedBackends = gatewayManager.getAllActiveBackends();
        assertThat(refreshedBackends).extracting(ProxyBackendConfiguration::getName).contains(newBackendName);
    }
```
</issue_to_address>

### Comment 3
<location> `gateway-ha/src/test/java/io/trino/gateway/ha/router/TestHaGatewayManager.java:33-38` </location>
<code_context>
-    @BeforeAll
-    void setUp()
+    @Test
+    void testGatewayManagerWithCache()
     {
         JdbcConnectionManager connectionManager = createTestingJdbcConnectionManager();
         RoutingConfiguration routingConfiguration = new RoutingConfiguration();
-        haGatewayManager = new HaGatewayManager(connectionManager.getJdbi(), routingConfiguration);
+        routingConfiguration.setDatabaseCacheTTL(new Duration(5, TimeUnit.SECONDS));
+        testGatewayManager(new HaGatewayManager(connectionManager.getJdbi(), routingConfiguration));
     }

</code_context>

<issue_to_address>
**suggestion (testing):** No test for cache refresh failures or database errors.

Add a test that triggers a database failure during cache refresh to verify proper exception handling and metric updates.

Suggested implementation:

```java
import java.util.concurrent.TimeUnit;
import static org.mockito.Mockito.*;
import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.AfterEach;

```

```java
    @Test
    void testGatewayManagerWithCache()
    {
        JdbcConnectionManager connectionManager = createTestingJdbcConnectionManager();
        RoutingConfiguration routingConfiguration = new RoutingConfiguration();
        routingConfiguration.setDatabaseCacheTTL(new Duration(5, TimeUnit.SECONDS));
        testGatewayManager(new HaGatewayManager(connectionManager.getJdbi(), routingConfiguration));
    }

    @Test
    void testCacheRefreshDatabaseFailure()
    {
        // Arrange: mock JdbcConnectionManager to throw exception on cache refresh
        JdbcConnectionManager mockConnectionManager = mock(JdbcConnectionManager.class);
        RoutingConfiguration routingConfiguration = new RoutingConfiguration();
        routingConfiguration.setDatabaseCacheTTL(new Duration(5, TimeUnit.SECONDS));
        HaGatewayManager gatewayManager = new HaGatewayManager(mockConnectionManager.getJdbi(), routingConfiguration);

        // Simulate database failure during cache refresh
        doThrow(new RuntimeException("Database error")).when(mockConnectionManager).refreshCache();

        // Act & Assert: verify exception is handled and metrics are updated
        try {
            gatewayManager.refreshCache();
        } catch (Exception e) {
            assertThat(e).isInstanceOf(RuntimeException.class)
                .hasMessageContaining("Database error");
        }

        // If you have a metric for cache refresh failures, assert it was incremented
        // Example:
        // assertThat(gatewayManager.getCacheRefreshFailureCount()).isGreaterThan(0);
    }

```

- You may need to implement or expose a `refreshCache()` method in `HaGatewayManager` if it is not public.
- If you track cache refresh failures via a metric, ensure you have a getter like `getCacheRefreshFailureCount()` or similar.
- Adjust the mocking/stubbing to match your actual cache refresh logic and error handling.
</issue_to_address>

### Comment 4
<location> `gateway-ha/src/test/java/io/trino/gateway/ha/router/TestHaGatewayManager.java:50` </location>
<code_context>
+        testGatewayManager(new HaGatewayManager(connectionManager.getJdbi(), routingConfiguration));
+    }
+
+    void testGatewayManager(HaGatewayManager haGatewayManager)
     {
         ProxyBackendConfiguration backend = new ProxyBackendConfiguration();
</code_context>

<issue_to_address>
**suggestion (testing):** Tests do not verify cache TTL expiration and automatic refresh.

Add a test to confirm cache expiration after TTL and automatic refresh to validate correct time-based behavior.

Suggested implementation:

```java
    void testGatewayManager(HaGatewayManager haGatewayManager)
    {
        ProxyBackendConfiguration backend = new ProxyBackendConfiguration();
        backend.setActive(true);
    }

    @Test
    void testCacheExpirationAndAutomaticRefresh() throws InterruptedException
    {
        JdbcConnectionManager connectionManager = createTestingJdbcConnectionManager();
        // Set cache TTL to 1 second for testing
        RoutingConfiguration routingConfiguration = new RoutingConfiguration();
        routingConfiguration.setDatabaseCacheTTL(new Duration(1, TimeUnit.SECONDS));
        HaGatewayManager haGatewayManager = new HaGatewayManager(connectionManager.getJdbi(), routingConfiguration);

        // Initial fetch to populate cache
        List<ProxyBackendConfiguration> initialBackends = haGatewayManager.getAllActiveBackends();
        assertNotNull(initialBackends, "Initial backend list should not be null");

        // Wait for TTL to expire
        Thread.sleep(1500);

        // Simulate a change in backend configuration
        // (This step may need to be adapted to your test setup. For example, update the DB or mock the backend list.)
        // For demonstration, we assume a method exists to update the backend list:
        // connectionManager.updateBackendList(newBackendList);

        // Fetch again after TTL expiration
        List<ProxyBackendConfiguration> refreshedBackends = haGatewayManager.getAllActiveBackends();
        assertNotNull(refreshedBackends, "Refreshed backend list should not be null");

        // The refreshed list should reflect changes after TTL expiration
        // (You may need to adapt this assertion based on how you simulate backend changes)
        // assertNotEquals(initialBackends, refreshedBackends, "Backend list should be refreshed after cache expiration");
    }

```

You may need to:
1. Implement or mock a way to change the backend list in your test setup so that the refreshed cache returns different data.
2. Adjust the assertion to compare the initial and refreshed backend lists based on your actual backend update logic.
3. Ensure that `getActiveBackends()` triggers a cache refresh after TTL expiration.
</issue_to_address>


                    .refreshAfterWrite(routingConfiguration.getDatabaseCacheTTL().toJavaTime())
                    .build(CacheLoader.asyncReloading(
                            CacheLoader.from(this::fetchAllBackends),
                            MoreExecutors.listeningDecorator(Executors.newSingleThreadExecutor())));
Member

How about using a ThreadFactory to prevent thread leaks?
By default, newSingleThreadExecutor seems to create non-daemon threads.

Member

   ThreadFactory daemonThreadFactory = runnable -> {
       Thread thread = new Thread(runnable, "backend-cache-refresh");
       thread.setDaemon(true);
       return thread;
   };
   ...

   MoreExecutors.listeningDecorator(Executors.newSingleThreadExecutor(daemonThreadFactory))));
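
For reference, a self-contained version of the snippet above using only the JDK might look like this. The class name `DaemonRefreshThreads` and its method names are hypothetical; in the PR the resulting executor would still be wrapped with `MoreExecutors.listeningDecorator`.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;

// Hypothetical helper that builds the refresh executor with daemon threads.
final class DaemonRefreshThreads
{
    private DaemonRefreshThreads() {}

    static ThreadFactory daemonThreadFactory(String threadName)
    {
        return runnable -> {
            Thread thread = new Thread(runnable, threadName);
            // Daemon threads do not keep the JVM alive at shutdown.
            thread.setDaemon(true);
            return thread;
        };
    }

    static ExecutorService newRefreshExecutor()
    {
        return Executors.newSingleThreadExecutor(daemonThreadFactory("backend-cache-refresh"));
    }
}
```

Note that a daemon thread only stops blocking JVM exit; the executor still holds its thread until it is shut down, so an explicit shutdown remains useful in tests that create many managers.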

Comment on lines +83 to +84
log.warn(e, "Failed to fetch backends");
throw e;
Member

Should we throw an exception here, or just return an empty list to maintain service availability?
The cache will retry on the next refresh cycle anyway.

Member

@Peiyingy Peiyingy left a comment

Thanks for putting up this PR!
At LinkedIn, we’ve already implemented caching in GatewayBackendDao and have it running in production. Since we run multiple Gateway instances, we found that a cache-aside approach can lead to noticeable inconsistencies across instances. To address this, we switched to a DB-first model, where the cache only serves as a fallback during DB outages.

We’d recommend making the cache behavior configurable so that users can choose between cache-aside and DB-first depending on their deployment setup.

Separately, we also added a write buffer mechanism in QueryHistoryManager. With the DB-first cache and write buffer in place, the Gateway can continue processing user queries even during DB outages, eliminating this single point of failure. We plan to open up that write buffer PR to OSS soon as well.
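
To make the DB-first model concrete, here is a minimal sketch of the idea. It is not LinkedIn's implementation: `DbFirstReader` is a hypothetical name, and a plain `Supplier` stands in for the DAO call.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Hypothetical DB-first reader: always query the database, and fall back to
// the last successful result only when the query fails.
final class DbFirstReader<T>
{
    private final Supplier<T> database;
    private final AtomicReference<T> lastKnownGood = new AtomicReference<>();

    DbFirstReader(Supplier<T> database)
    {
        this.database = database;
    }

    T read()
    {
        try {
            T fresh = database.get();
            // Remember the latest successful result for use during outages.
            lastKnownGood.set(fresh);
            return fresh;
        }
        catch (RuntimeException e) {
            T fallback = lastKnownGood.get();
            if (fallback == null) {
                // Nothing has ever been loaded, so there is nothing to serve.
                throw e;
            }
            return fallback;
        }
    }
}
```

Compared with cache-aside, every read still reaches the database while it is healthy, so instances stay consistent with each other at the cost of giving up the reduction in database load.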

routing:
  defaultRoutingGroup: "test-group"
  # Optional: cache backend metadata to reduce database look-ups
  databaseCacheTTL: "5m"
Member

@Peiyingy Peiyingy Oct 31, 2025

Let's create a CacheConfiguration instead of putting all the configs under routing configs. We can define cacheEnabled, expireAfterWrite, and maximumSize explicitly instead of using if (!routingConfiguration.getDatabaseCacheTTL().isZero()) to decide if cache is enabled, and set default values for them. Also, we can make cache behavior configurable there.
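
A possible shape for such a class is sketched below. The property names come from the comment above; the class itself, the default values, and the use of `java.time.Duration` rather than an Airlift duration are assumptions made only for illustration.

```java
import java.time.Duration;

// Hypothetical configuration class; defaults are placeholders, not project decisions.
final class CacheConfiguration
{
    private boolean cacheEnabled;
    private Duration expireAfterWrite = Duration.ofMinutes(5);
    private long maximumSize = 1;

    public boolean isCacheEnabled()
    {
        return cacheEnabled;
    }

    public void setCacheEnabled(boolean cacheEnabled)
    {
        this.cacheEnabled = cacheEnabled;
    }

    public Duration getExpireAfterWrite()
    {
        return expireAfterWrite;
    }

    public void setExpireAfterWrite(Duration expireAfterWrite)
    {
        // Reject non-positive values so that "disabled" is expressed only via cacheEnabled.
        if (expireAfterWrite.isZero() || expireAfterWrite.isNegative()) {
            throw new IllegalArgumentException("expireAfterWrite must be positive");
        }
        this.expireAfterWrite = expireAfterWrite;
    }

    public long getMaximumSize()
    {
        return maximumSize;
    }

    public void setMaximumSize(long maximumSize)
    {
        this.maximumSize = maximumSize;
    }
}
```

An explicit `cacheEnabled` flag removes the need to treat a zero TTL as a special value, which is the main point of the suggestion.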

*/
package io.trino.gateway.ha.router;

import com.google.common.cache.CacheBuilder;
Member

Let's also use this opportunity to switch the caching dependency from Guava to Caffeine.

Member

@Peiyingy Peiyingy left a comment

Let's also add a cache for findFirstByName to improve resiliency during DB outages.
