-
Notifications
You must be signed in to change notification settings - Fork 5.5k
feat(native): Support custom schemas in native sidecar function registry #26236
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Reviewer's GuideImplements catalog-scoped function metadata retrieval and registration by extending the native sidecar API, HTTP endpoint, client provider, connector integration for Hive functions, OpenAPI spec, and corresponding tests to support custom schemas. Class diagram for NativeFunctionDefinitionProvider and catalog-scoped retrievalclassDiagram
class NativeFunctionDefinitionProvider {
- JsonCodec nativeFunctionSignatureMapJsonCodec
- HttpClient httpClient
- NativeFunctionNamespaceManagerConfig config
- String catalogName
+ NativeFunctionDefinitionProvider(..., String catalogName)
+ UdfFunctionSignatureMap getUdfDefinition(NodeManager)
}
class UdfFunctionSignatureMap {
+ UdfFunctionSignatureMap(Map<String, List<JsonBasedUdfFunctionMetadata>>)
}
NativeFunctionDefinitionProvider --> UdfFunctionSignatureMap
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
1c9bf83
to
8ace7d7
Compare
8ace7d7
to
3767a71
Compare
@sourcery-ai review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey there - I've reviewed your changes - here's some feedback:
- The new GET /v1/functions/{catalog} handler always returns 200 with null or empty JSON even for non‐existent catalogs—modify it to return a 404 when no functions are found to match the OpenAPI spec.
- The
/v1/functions/{catalog}
path conflicts with the existing/v1/functions/{schema}
route—consider renaming or ordering the handlers to avoid ambiguous routing. - The hard-coded blocklist in getFunctionsMetadata is scattered in the implementation—extract it into a shared constant or make it configurable to simplify future updates.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The new GET /v1/functions/{catalog} handler always returns 200 with null or empty JSON even for non‐existent catalogs—modify it to return a 404 when no functions are found to match the OpenAPI spec.
- The `/v1/functions/{catalog}` path conflicts with the existing `/v1/functions/{schema}` route—consider renaming or ordering the handlers to avoid ambiguous routing.
- The hard-coded blocklist in getFunctionsMetadata is scattered in the implementation—extract it into a shared constant or make it configurable to simplify future updates.
## Individual Comments
### Comment 1
<location> `presto-native-execution/presto_cpp/main/functions/FunctionMetadata.cpp:341` </location>
<code_context>
+ continue;
+ }
+
+ const auto parts = getFunctionNameParts(name);
+ if (parts[0] != catalog) {
+ continue;
</code_context>
<issue_to_address>
**issue:** Potential out-of-bounds access in 'parts' array.
Validate that 'parts' contains at least three elements before accessing indices 1 and 2 to prevent undefined behavior.
</issue_to_address>
### Comment 2
<location> `presto-native-execution/presto_cpp/main/connectors/PrestoToVeloxConnector.cpp:63-65` </location>
<code_context>
+
+ // Register hive-specific functions when hive catalog is detected.
+ // Delegate to generic Hive native function registrar which is idempotent.
+ if (connectorName ==
+ velox::connector::hive::HiveConnectorFactory::kHiveConnectorName ||
+ connectorName == std::string("hive-hadoop2")) {
+ hive::functions::registerHiveNativeFunctions();
+ }
</code_context>
<issue_to_address>
**suggestion:** Connector name comparison may be brittle.
Hardcoding connector names increases maintenance risk if new variants are added. Centralize connector name definitions or use a more robust matching approach.
Suggested implementation:
```cpp
// Register hive-specific functions when hive catalog is detected.
// Delegate to generic Hive native function registrar which is idempotent.
if (isHiveConnector(connectorName)) {
hive::functions::registerHiveNativeFunctions();
}
```
```cpp
#include "velox/functions/FunctionRegistry.h"
#include <unordered_set>
```
```cpp
connectorName);
protocol::registerConnectorProtocol(
connectorName, std::move(connectorProtocol));
// Centralized Hive connector name definitions.
namespace {
const std::unordered_set<std::string> kHiveConnectorNames = {
velox::connector::hive::HiveConnectorFactory::kHiveConnectorName,
"hive-hadoop2"
};
bool isHiveConnector(const std::string& connectorName) {
return kHiveConnectorNames.count(connectorName) > 0;
}
} // namespace
```
</issue_to_address>
### Comment 3
<location> `presto-native-execution/presto_cpp/main/connectors/hive/functions/InitcapFunction.h:30` </location>
<code_context>
+struct InitCapFunction {
+ VELOX_DEFINE_FUNCTION_TYPES(T);
+
+ static constexpr bool is_default_ascii_behavior = true;
+
+ FOLLY_ALWAYS_INLINE void call(
</code_context>
<issue_to_address>
**nitpick:** Unused 'is_default_ascii_behavior' constant.
If this constant is not needed, please remove it. If it is reserved for future use or external access, add a comment explaining its purpose.
</issue_to_address>
### Comment 4
<location> `presto-native-execution/presto_cpp/main/functions/tests/FunctionMetadataTest.cpp:117` </location>
<code_context>
testFunction("variance", "Variance.json", 5);
}
+
+TEST_F(FunctionMetadataTest, GetFunctionsMetadataWithCatalog) {
+ // Test with the "presto" catalog that is registered in SetUpTestSuite
+ std::string catalog = "presto";
</code_context>
<issue_to_address>
**suggestion (testing):** Missing test coverage for custom schemas and non-default schemas.
Please add tests for functions registered under non-default schemas to ensure getFunctionsMetadata returns correct metadata for those cases.
Suggested implementation:
```cpp
TEST_F(FunctionMetadataTest, GetFunctionsMetadataWithCatalog) {
// Test with the "presto" catalog that is registered in SetUpTestSuite
std::string catalog = "presto";
auto metadata = getFunctionsMetadata(catalog);
// The result should be a JSON object with function names as keys
ASSERT_TRUE(metadata.is_object());
ASSERT_FALSE(metadata.empty());
// Verify that common functions are present
ASSERT_TRUE(metadata.contains("abs"));
ASSERT_TRUE(metadata.contains("mod"));
}
// Register a function under a custom schema for testing purposes.
TEST_F(FunctionMetadataTest, GetFunctionsMetadataWithCustomSchema) {
// Assume registerFunction is available for registering test functions.
// Register a function "custom_func" under schema "custom_schema".
std::string customCatalog = "presto";
std::string customSchema = "custom_schema";
std::string functionName = "custom_func";
// The registration API may differ; adjust as needed for your codebase.
registerFunction(customCatalog, customSchema, functionName, /*function implementation*/ nullptr);
auto metadata = getFunctionsMetadata(customCatalog);
// The result should include the custom function under the custom schema.
ASSERT_TRUE(metadata.is_object());
ASSERT_TRUE(metadata.contains(functionName));
// Optionally, check that the schema is correct in the metadata.
ASSERT_EQ(metadata[functionName]["schema"], customSchema);
}
// Register a function under another non-default schema for additional coverage.
TEST_F(FunctionMetadataTest, GetFunctionsMetadataWithNonDefaultSchema) {
std::string catalog = "presto";
std::string nonDefaultSchema = "analytics";
std::string functionName = "analytics_func";
registerFunction(catalog, nonDefaultSchema, functionName, /*function implementation*/ nullptr);
auto metadata = getFunctionsMetadata(catalog);
ASSERT_TRUE(metadata.is_object());
ASSERT_TRUE(metadata.contains(functionName));
ASSERT_EQ(metadata[functionName]["schema"], nonDefaultSchema);
}
```
- Ensure that the `registerFunction` API exists and is accessible in your test environment. If not, you may need to use the actual function registration mechanism used in your codebase.
- If the metadata structure differs (e.g., schema is not a direct property), adjust the assertions accordingly.
- If setup/teardown is required for custom functions, add appropriate cleanup code.
</issue_to_address>
### Comment 5
<location> `presto-native-execution/presto_cpp/main/functions/tests/FunctionMetadataTest.cpp:150` </location>
<code_context>
+ }
+}
+
+TEST_F(FunctionMetadataTest, GetFunctionsMetadataWithNonExistentCatalog) {
+ // Test with a catalog that doesn't exist
+ std::string catalog = "nonexistent";
</code_context>
<issue_to_address>
**suggestion (testing):** Consider adding a test for error conditions or malformed catalog names.
Please include test cases for malformed catalog names, such as empty strings or special characters, to verify robust error handling.
</issue_to_address>
### Comment 6
<location> `presto-native-execution/presto_cpp/main/functions/tests/FunctionMetadataTest.cpp:143` </location>
<code_context>
+ ASSERT_TRUE(signature.contains("functionKind"))
+ << "Function: " << it.key();
+
+ // Schema should be "default" since we registered with "presto.default."
+ // prefix
+ EXPECT_EQ(signature["schema"], "default") << "Function: " << it.key();
</code_context>
<issue_to_address>
**nitpick (testing):** Test assertions could be more robust for schema values.
Hardcoding 'default' for the schema may cause the test to fail if registration logic changes or custom schemas are used. Parameterize the expected schema or derive it from the registration logic to improve test resilience.
</issue_to_address>
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
continue; | ||
} | ||
|
||
const auto parts = getFunctionNameParts(name); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue: Potential out-of-bounds access in 'parts' array.
Validate that 'parts' contains at least three elements before accessing indices 1 and 2 to prevent undefined behavior.
struct InitCapFunction { | ||
VELOX_DEFINE_FUNCTION_TYPES(T); | ||
|
||
static constexpr bool is_default_ascii_behavior = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick: Unused 'is_default_ascii_behavior' constant.
If this constant is not needed, please remove it. If it is reserved for future use or external access, add a comment explaining its purpose.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Joe-Abraham : Please fix.
ASSERT_TRUE(signature.contains("functionKind")) | ||
<< "Function: " << it.key(); | ||
|
||
// Schema should be "default" since we registered with "presto.default." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick (testing): Test assertions could be more robust for schema values.
Hardcoding 'default' for the schema may cause the test to fail if registration logic changes or custom schemas are used. Parameterize the expected schema or derive it from the registration logic to improve test resilience.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Joe-Abraham ,
From a high-level, the changes look good to me. Will do a more thorough analysis later.
Can you write a small RFC so we can get more feedback on the architecture?
I have created the RFC - prestodb/rfcs#50 |
3767a71
to
ed9051e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Joe-Abraham
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
add_library(presto_hive_functions HiveFunctionRegistration.cpp) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit : Add a black line before add_library
#include "velox/functions/Macros.h" | ||
#include "velox/functions/lib/string/StringImpl.h" | ||
|
||
using namespace facebook::velox; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do not use "using" in header files. Please use the fully resolved names in the rest of the code.
struct InitCapFunction { | ||
VELOX_DEFINE_FUNCTION_TYPES(T); | ||
|
||
static constexpr bool is_default_ascii_behavior = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Joe-Abraham : Please fix.
connectorName); | ||
protocol::registerConnectorProtocol( | ||
connectorName, std::move(connectorProtocol)); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code shouldn't be part of the PrestoToVeloxConnector protocol conversion registration. We should add them in registerFunctions if the hive connector is present https://github.com/prestodb/presto/blob/master/presto-native-execution/presto_cpp/main/PrestoServer.cpp#L1353
|
||
#include "presto_cpp/main/connectors/hive/functions/InitcapFunction.h" | ||
#include "presto_cpp/main/functions/dynamic_registry/DynamicFunctionRegistrar.h" | ||
#include "velox/functions/FunctionRegistry.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leave a blank line between the presto_cpp and velox includes.
return j; | ||
} | ||
|
||
json getFunctionsMetadata(const std::string& catalog) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is largely duplicated from the getFunctionsMetadata above. We should change the above function to take an optional catalog parameter... or use a default of the default namespace and reduce the duplication.
// Register hive catalog for hive-specific functions | ||
queryRunner.loadFunctionNamespaceManager( | ||
NativeFunctionNamespaceManagerFactory.NAME, | ||
"hive", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This assumes the PrestoServer has hive connector. Thats a reasonable assumption as we always setup hive connector. But then we should make the C++ PrestoServer code consistent by always registering the hive function as well (and not check if the hive connector is present)
|
||
TEST_F(FunctionMetadataTest, GetFunctionsMetadataWithNonExistentCatalog) { | ||
// Test with a catalog that doesn't exist | ||
std::string catalog = "nonexistent"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove this temporary variable.
} | ||
} | ||
|
||
TEST_F(FunctionMetadataTest, GetFunctionsMetadataWithNonExistentCatalog) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Test names are camelCase
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Joe-Abraham : Please also add e2e tests for initcap (with and without side-car) in presto-native-tests module
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add initcap tests as well.
ImmutableMap.of()); | ||
|
||
// Register native catalog for built-in functions | ||
queryRunner.loadFunctionNamespaceManager( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pdabre12 : If we want to make this code truly dynamic then we will have to get the list of the connectors that the worker supports and create namespace catalogs for all of them. At present the worker reports the connectors to the DiscoveryServer. We could try to reuse that response or add a new endpoint for the worker sidecar to report the supported catalogs.
We can do that as a follow up.
Description
Fixes
hive.default.
namespaceMotivation and Context
Add support for custom schemas in native sidecar function registry
Impact
Enables functions to be registered in connector specific or custom namespaces.
Test Plan
Included with this PR.
Contributor checklist