Iceberg Plugin: Add BigQuery Metastore to Trino #26892
Conversation
Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Salih Göktuğ Köse.
Reviewer's Guide

This PR integrates BigQuery Metastore into the Trino Iceberg plugin by adding Google BigQuery client dependencies, defining configuration and Guice modules, registering a new catalog type, and implementing both the TrinoCatalog and IcebergTableOperations to map catalog operations to BigQuery Metastore API calls, alongside utility classes and tests.

Class diagram for new BigQuery Metastore integration classes:

classDiagram
class IcebergBigQueryMetastoreCatalogConfig {
+String projectID
+String location
+String listAllTables
+String warehouse
+String jsonKeyFilePath
+setProjectID(String)
+setLocation(String)
+setListAllTables(String)
+setWarehouse(String)
+setJsonKeyFilePath(String)
+getProjectID()
+getLocation()
+getListAllTables()
+getWarehouse()
+getJsonKeyFilePath()
}
class TrinoBigQueryMetastoreCatalogFactory {
+create(ConnectorIdentity): TrinoCatalog
-IcebergTableOperationsProvider tableOperationsProvider
-BigQueryMetastoreClientImpl bigQueryMetastoreClient
-CatalogName catalogName
-TypeManager typeManager
-TrinoFileSystemFactory fileSystemFactory
-ForwardingFileIoFactory fileIoFactory
-String projectID
-String gcpLocation
-String listAllTables
-String warehouse
-boolean isUniqueTableLocation
}
class TrinoBigQueryMetastoreCatalog {
+namespaceExists(ConnectorSession, String): boolean
+listNamespaces(ConnectorSession): List<String>
+dropNamespace(ConnectorSession, String)
+createNamespace(ConnectorSession, String, Map<String, Object>, TrinoPrincipal)
+listTables(ConnectorSession, Optional<String>): List<TableInfo>
+listIcebergTables(ConnectorSession, Optional<String>): List<SchemaTableName>
+dropTable(ConnectorSession, SchemaTableName)
+loadTable(ConnectorSession, SchemaTableName): BaseTable
+defaultTableLocation(ConnectorSession, SchemaTableName): String
-BigQueryMetastoreClientImpl client
-String projectID
-String gcpLocation
-String warehouse
-boolean listAllTables
}
class BigQueryMetastoreIcebergTableOperations {
+commitNewTable(TableMetadata)
+commitToExistingTable(TableMetadata, TableMetadata)
+getRefreshedLocation(boolean): String
-BigQueryMetastoreClientImpl client
-String projectId
-String datasetId
-String tableId
-TableReference tableReference
-String tableName
}
class BigQueryMetastoreIcebergTableOperationsProvider {
+createTableOperations(TrinoCatalog, ConnectorSession, String, String, Optional<String>, Optional<String>): IcebergTableOperations
-TrinoFileSystemFactory fileSystemFactory
-ForwardingFileIoFactory fileIoFactory
-BigQueryMetastoreClientImpl bqmsClient
-String projectId
}
class BigQueryMetastoreIcebergUtil {
+createExternalCatalogTableOptions(String, Map<String, String>): ExternalCatalogTableOptions
+createExternalCatalogDatasetOptions(String, Map<String, Object>): ExternalCatalogDatasetOptions
}
class IcebergBigQueryMetastoreModule {
+setup(Binder)
+createBigQueryMetastoreClient(IcebergBigQueryMetastoreCatalogConfig): BigQueryMetastoreClientImpl
}
IcebergBigQueryMetastoreCatalogConfig <.. IcebergBigQueryMetastoreModule
IcebergBigQueryMetastoreCatalogConfig <.. TrinoBigQueryMetastoreCatalogFactory
TrinoBigQueryMetastoreCatalogFactory <|.. TrinoBigQueryMetastoreCatalog
BigQueryMetastoreIcebergTableOperationsProvider <|.. BigQueryMetastoreIcebergTableOperations
BigQueryMetastoreIcebergTableOperationsProvider <.. TrinoBigQueryMetastoreCatalog
BigQueryMetastoreIcebergUtil <.. BigQueryMetastoreIcebergTableOperations
BigQueryMetastoreIcebergUtil <.. TrinoBigQueryMetastoreCatalog
Class diagram for updated CatalogType enum:

classDiagram
class CatalogType {
<<enum>>
JDBC
NESSIE
SNOWFLAKE
BIGQUERY_METASTORE
}
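The enum change above can be sketched as a minimal, hypothetical mirror; the real CatalogType enum lives in the trino-iceberg plugin and carries more constants than the diagram shows:

```java
// Hypothetical, trimmed-down mirror of the updated CatalogType enum from the
// diagram; the real enum in Trino has additional constants and wiring.
enum CatalogType
{
    JDBC,
    NESSIE,
    SNOWFLAKE,
    BIGQUERY_METASTORE
}

public class CatalogTypeExample
{
    public static void main(String[] args)
    {
        // The connector resolves the configured catalog type by name at startup
        CatalogType type = CatalogType.valueOf("BIGQUERY_METASTORE");
        System.out.println(type); // prints "BIGQUERY_METASTORE"
    }
}
```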
File-Level Changes
Hey there - I've reviewed your changes - here's some feedback:
- In TrinoBigQueryMetastoreCatalog.toDatasetReference you call `Arrays.stream(namespace.split("\\.")).toList().getFirst()` which won't compile; use `namespace.split("\\.")[0]` or `stream.findFirst()` instead and simplify your namespace splitting logic.
- The config keys in IcebergBigQueryMetastoreCatalogConfig (`iceberg.bqms-catalog.*`) don't match the constants in TrinoBigQueryMetastoreCatalog (`gcp.bigquery.*`); please unify property names or remove unused constants to avoid confusion.
- Add precondition checks in TrinoBigQueryMetastoreCatalogFactory to fail fast when mandatory settings (project-id, location, warehouse) are missing, so users get a clear error at startup rather than NPEs later.
## Individual Comments
### Comment 1
<location> `plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/catalog/bigquery/BigQueryMetastoreIcebergTableOperations.java:170-174` </location>
<code_context>
+ }
+
+ /** Update table properties with concurrent update detection using etag. */
+ private void updateTable(
+ String oldMetadataLocation, String newMetadataLocation, TableMetadata metadata)
+ {
+ Table table = client.load(tableReference);
+ if (table.getEtag().isEmpty()) {
+ throw new ValidationException(
+ "Etag of legacy table %s is empty, manually update the table via the BigQuery API or"
</code_context>
<issue_to_address>
**issue (bug_risk):** Etag check may not be robust for null values.
Using `isEmpty()` directly on `getEtag()` can cause a NullPointerException if the value is null. Please use a null-safe check such as `Strings.isNullOrEmpty(table.getEtag())`.
</issue_to_address>
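A null-safe version of that check can also be written without Guava; this is a sketch only, with `isNullOrEmpty` mirroring what `Strings.isNullOrEmpty` does:

```java
// Sketch of a null-safe etag check. Guava's Strings.isNullOrEmpty behaves the
// same way; this plain-Java version just avoids the extra dependency.
public class EtagCheckExample
{
    static boolean isNullOrEmpty(String s)
    {
        return s == null || s.isEmpty();
    }

    public static void main(String[] args)
    {
        // A legacy BigQuery table may report a null or an empty etag; both
        // cases must be rejected before attempting an etag-guarded update.
        System.out.println(isNullOrEmpty(null));     // prints "true"
        System.out.println(isNullOrEmpty(""));       // prints "true"
        System.out.println(isNullOrEmpty("abc123")); // prints "false"
    }
}
```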
### Comment 2
<location> `plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/catalog/bigquery/IcebergBigQueryMetastoreModule.java:63-68` </location>
<code_context>
+ if (config.getLocation() != null) {
+ optionsBuilder.setLocation(config.getLocation());
+ }
+ if (config.getJsonKeyFilePath() != null) {
+ try (FileInputStream fs = new FileInputStream(config.getJsonKeyFilePath())) {
+ optionsBuilder.setCredentials(ServiceAccountCredentials.fromStream(fs));
+ }
+ catch (Exception e) {
+ throw new CredentialException("Unable to locate GCP Service Account JSON file");
+ }
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Credential exception handling may mask underlying IO errors.
Consider logging or including the original exception details in the CredentialException to aid debugging and avoid losing information about the underlying error.
```suggestion
try (FileInputStream fs = new FileInputStream(config.getJsonKeyFilePath())) {
optionsBuilder.setCredentials(ServiceAccountCredentials.fromStream(fs));
}
catch (Exception e) {
throw new CredentialException("Unable to locate GCP Service Account JSON file: " + e.getMessage(), e);
}
```
</issue_to_address>
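The point of passing the caught exception as the cause is that nothing about the original failure is lost. A minimal illustration, with RuntimeException standing in for CredentialException (whose constructor signature is assumed here):

```java
import java.io.FileNotFoundException;

public class ExceptionChainingExample
{
    public static void main(String[] args)
    {
        try {
            // Simulate the failure mode: the key file path does not exist
            throw new FileNotFoundException("/etc/gcp/key.json");
        }
        catch (Exception e) {
            // Wrapping with the cause keeps the original message and stack
            // trace available to whoever catches or logs the wrapper.
            RuntimeException wrapped = new RuntimeException(
                    "Unable to locate GCP Service Account JSON file: " + e.getMessage(), e);
            System.out.println(wrapped.getCause().getMessage()); // prints "/etc/gcp/key.json"
        }
    }
}
```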
### Comment 3
<location> `plugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/catalog/bigquery/TestIcebergBigQueryMetastoreCatalogConnectorSmokeTest.java:128-83` </location>
<code_context>
+
+public class TestIcebergBigQueryMetastoreCatalogConfig
+{
+ @Test
+ public void testDefaults()
+ {
</code_context>
<issue_to_address>
**suggestion (testing):** Missing tests for error scenarios with invalid configuration/environment variables.
Please add tests to cover cases where required environment variables are missing or invalid, to confirm the connector handles configuration errors appropriately.
Suggested implementation:
```java
@Nested
class ConfigurationErrorTests
{
@Test
public void testMissingRequiredEnvironmentVariable()
{
// Simulate missing environment variable
String originalValue = System.getenv("BIGQUERY_PROJECT_ID");
try {
// Unset the environment variable (using reflection hack for test only)
TestUtils.unsetEnv("BIGQUERY_PROJECT_ID");
IcebergConfig config = new IcebergConfig();
config.setBigQueryProjectId(null); // explicitly set to null
Exception exception = assertThrows(IllegalArgumentException.class, () -> {
new IcebergBigQueryMetastoreCatalog(config);
});
assertTrue(exception.getMessage().contains("BIGQUERY_PROJECT_ID"));
}
finally {
// Restore environment variable
if (originalValue != null) {
TestUtils.setEnv("BIGQUERY_PROJECT_ID", originalValue);
}
}
}
@Test
public void testInvalidEnvironmentVariableValue()
{
String originalValue = System.getenv("BIGQUERY_PROJECT_ID");
try {
TestUtils.setEnv("BIGQUERY_PROJECT_ID", "!!!invalid_project_id!!!");
IcebergConfig config = new IcebergConfig();
config.setBigQueryProjectId("!!!invalid_project_id!!!");
Exception exception = assertThrows(RuntimeException.class, () -> {
new IcebergBigQueryMetastoreCatalog(config);
});
assertTrue(exception.getMessage().contains("invalid project id"));
}
finally {
if (originalValue != null) {
TestUtils.setEnv("BIGQUERY_PROJECT_ID", originalValue);
}
}
}
}
```
- You will need to implement `TestUtils.setEnv` and `TestUtils.unsetEnv` methods to manipulate environment variables for testing purposes. This can be done using reflection (see various Java testing guides for details).
- Ensure that `IcebergBigQueryMetastoreCatalog` and `IcebergConfig` throw appropriate exceptions when required environment variables are missing or invalid.
- Adjust exception types and messages in the assertions to match your actual implementation.
- If your connector uses a different configuration mechanism, adapt the test setup accordingly.
</issue_to_address>
I don't think we want to add this catalog. See #26219
Hi, I saw #26219, which integrates BigLake Metastore via the REST API. However, that feature is not generally available at the moment. Also, organizations that heavily depend on Google Cloud services may benefit from better performance compared to the REST solution.
This commit introduces BigQuery Metastore to Trino. (force-pushed from 813e578 to 7e852f3)
Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to [email protected]. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla
I know it’s currently in the preview phase, but I expect it will eventually become generally available.
Could you share the benchmark result that shows this catalog is faster than the REST catalog approach?
Sorry for the late response 🙏 Some other notes:
@talatuyarer @rambleraptor Do you have any idea why the Iceberg REST catalog endpoint (#26219) is slower than this PR?
This commit introduces BigQuery Metastore to Trino.
Description
BigQuery Metastore can already be reached from external engines such as Spark and Flink, but a Trino integration has not been provided yet. This PR introduces a BigQuery Metastore integration for Trino.
Relates to: Google Official Documentation
Inspired by: apache/iceberg#12808
The following configuration options can be used to interact with BigQuery Metastore.
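The option list itself did not survive extraction here. As a rough sketch only, based on the `iceberg.bqms-catalog.*` prefix mentioned in the review and the fields of IcebergBigQueryMetastoreCatalogConfig, a catalog properties file might look like this (the exact key names are assumptions, not confirmed by this PR):

```properties
# Hypothetical key names derived from the config class fields; verify against
# the actual PR before use.
connector.name=iceberg
iceberg.catalog.type=bigquery_metastore
iceberg.bqms-catalog.project-id=my-gcp-project
iceberg.bqms-catalog.location=us-central1
iceberg.bqms-catalog.warehouse=gs://my-bucket/warehouse
iceberg.bqms-catalog.list-all-tables=true
iceberg.bqms-catalog.json-key-file-path=/etc/gcp/key.json
```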
Additional context and related issues
TODO:
- iceberg.bqms-catalog.bq_connection option will be added.

Release notes
( ) This is not user-visible or is docs only, and no release notes are required.
(X) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text:
Summary by Sourcery
Add BigQuery Metastore catalog support to the Trino Iceberg plugin
New Features:
Build:
Tests: