Skip to content

Conversation

@fgerlits
Copy link
Contributor

@fgerlits fgerlits commented Dec 5, 2024

Summary

NIFI-13997

Load the requested python processor only

Before this change: when multiple python processors were present in a package,
we only installed the dependencies from requirements.txt plus those from
ProcessorDetails.dependencies of the requested processor. Then we tried to
load all the python files in the package, including other processors which may
have other dependencies. The result was that the requested processor could not
be loaded.

After the change: we only load the requested processor and any non-processor
helper files which are present in the package.

Enable py4j integration tests in system-tests

There are 26 tests (+1 added now) in PythonControllerInteractionIT.java, which
is in nifi-py4j-integration-tests, but these were not run in any of the CI
jobs. I have added them to system-tests, because integration-tests explicitly
skips nifi-py4j-integration-tests with a comment that it is run as part of
system-tests.

Tracking

Please complete the following tracking steps prior to pull request creation.

Issue Tracking

Pull Request Tracking

  • Pull Request title starts with Apache NiFi Jira issue number, such as NIFI-00000
  • Pull Request commit message starts with Apache NiFi Jira issue number, as such NIFI-00000

Pull Request Formatting

  • Pull Request based on current revision of the main branch
  • Pull Request refers to a feature branch with one commit containing changes

Verification

Please indicate the verification steps performed prior to pull request creation.

Build

  • Build completed using mvn clean install -P contrib-check
    • JDK 21

Licensing

  • New dependencies are compatible with the Apache License 2.0 according to the License Policy
  • New dependencies are documented in applicable LICENSE and NOTICE files

Documentation

  • Documentation formatting appears as expected in rendered files

Copy link
Contributor

@exceptionfactory exceptionfactory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this improvement @fgerlits.

On initial review, I'm concerned about building a test case based on the Milvus and Unstructured libraries. These libraries have large transitive dependency trees, which can change often. This also takes more time to download and run. I recommend building the tests using much more narrowly-scoped dependencies, preferably avoiding transitive dependencies as much as possible.

Before this change: when multiple python processors were present in a package,
we only installed the dependencies from requirements.txt plus those from
ProcessorDetails.dependencies of the requested processor. Then we tried to
load all the python files in the package, including other processors which may
have other dependencies. The result was that the requested processor could not
be loaded.

After the change: we only load the requested processor and any non-processor
helper files which are present in the package.
There are 26 tests (+1 added now) in PythonControllerInteractionIT.java, which
is in nifi-py4j-integration-tests, but these were not run in any of the CI
jobs. I have added them to system-tests, because integration-tests explicitly
skips nifi-py4j-integration-tests with a comment that it is run as part of
system-tests.
@fgerlits fgerlits force-pushed the NIFI-13997_Load-requested-processor-only branch from c20e05b to 19f2f34 Compare December 10, 2024 14:13
@fgerlits
Copy link
Contributor Author

I recommend building the tests using much more narrowly-scoped dependencies

fixed in 19f2f34

Copy link
Contributor

@exceptionfactory exceptionfactory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update @fgerlits. I noted one additional test-related concern regarding external network requests.

One general question, however, is that I'm not sure about the general strategy of supporting different sets of dependencies in this way. The general purpose of a shared module is to have shared dependencies. That is the case of Java modules, and that should generally be the expectation for Python modules. The fact that there are ways to support both inline and external dependencies with Python Processors is a special feature, but I'm not sure whether this is something we should promote in general. That being said, the runtime code changes that simply filter out loading other Processors in the same module same module seem reasonable, so I'm willing to go forward with these changes.

@fgerlits
Copy link
Contributor Author

[...] The fact that there are ways to support both inline and external dependencies with Python Processors is a special feature, but I'm not sure whether this is something we should promote in general. [...]

Yes, shared dependencies per package is the way to go in most cases. We did, however, run into a case of a package of "parser" processors, and while most of their dependencies were common, some of the format-specific libs were quite large. We would like to avoid loading all format-specific libs if only one format is used.

Copy link
Contributor

@exceptionfactory exceptionfactory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again for working through the feedback @fgerlits! +1 merging

@exceptionfactory exceptionfactory merged commit 116f2f7 into apache:main Dec 19, 2024
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants