NUTCH-3175 Implement integration testing framework for Nutch Protocol plugins using Testcontainers#913
Open
lewismc wants to merge 2 commits intoapache:masterfrom
Open
NUTCH-3175 Implement integration testing framework for Nutch Protocol plugins using Testcontainers#913lewismc wants to merge 2 commits intoapache:masterfrom
lewismc wants to merge 2 commits intoapache:masterfrom
Conversation
… plugins using Testcontainers
… plugins using Testcontainers
Member
Author
|
Pushed changes to fix the issues flagged by Yetus. Great to see these Yetus checks upholding stricter consistency. |
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.




PR for NUTCH-3175.
My goal was to run each protocol plugin against a real server rather than mocks. This PR adds a new Ant target
test-protocol-integrationwired into both the top-level build and GitHub Actions CI (Ubuntu only, triggered when protocol plugin files change). This mimics what we did previously with index plugins.Integration test framework (src/test/)
Per-plugin integration tests
Testcontainers-based tests are annotated
@Testcontainers(disabledWithoutDocker = true)and skip cleanly when Docker is unavailable.Build / CI changes
build.xml— new top-level test-protocol-integration target delegates to src/plugin/build.xml.src/plugin/build.xml— runs each protocol plugin'stest-protocol-integrationtarget sequentially to avoid container resource contention.src/plugin/build-plugin.xml— addstest-protocol-integrationtarget; adds testcontainers*.jar to the global plugin test classpath so plugins can compile against Testcontainers without declaring it individually..github/workflows/master-build.yml— addsprotocol_pluginspath filter and test protocol integration step, gated toubuntu-latestonly.Bug fixes in protocol-ftp (found while writing tests)
This part surprised me as admittedly I hadn't ever used
protocol-ftpbefore. These are production fixes, not test scaffolding:FtpResponse: ignored URL port —ftp.client.connect(addr)always connected to port 21, ignoring the port in the URL. Fixed to useurl.getPort()with fallback toFTP.DEFAULT_PORT.FtpResponse: quotedSYSTreply — RFC 959 allows servers to quote the system type (215 "UNIX"). After .substring(4) the client received "UNIX" (with literal quotes), causing parser initialization to fail silently withftp.parserisnull. Fixed with explicit quote stripping.FtpResponse: empty directory listing treated as exception — when a server returns a 150+226 response with an empty listing for a non-existent file,list.get(0)threwIndexOutOfBoundsException. Fixed by checkinglist.isEmpty()and returning 404 instead.Ftp: status code never set on exception — ifFtpResponseconstructor threw beforegetProtocolOutputreached thedatum.getMetaData().put(...)call,PROTOCOL_STATUS_CODE_KEYwas never set, causing aNullPointerExceptionin callers. Fixed by setting code 500 in the outer catch block.protocol-ftp/ivy.xml: commons-net upgraded 1.2.2 → 3.9.0 — commons-net 1.2.2'sUnixFTPEntryParserdepended on Apache ORO (org.apache.oro.text.regex), which is not on the Nutch classpath. At runtime this caused aNoClassDefFoundErrorthat was silently swallowed by a finally/return block, leavingftp.parser = nulland every fetch returning HTTP 500. Upgrading to 3.9.0 eliminates the ORO dependency.Other fixes
conf/log4j2.xml— renamed internal elements fromhadoop.log.dir/hadoop.log.filetonutch.log.dir/nutch.log.file. Hadoop's test harness sets system propertieshadoop.log.dirandhadoop.log.fileto self-referential values; when log4j2 resolved${sys:hadoop.log.dir}inside a property of the same name, it detected an infinite interpolation loop and emitted repeatedWARNInfinite loop in property interpolation messages. Renaming the log4j2 properties breaks the cycle while preserving the same runtime behaviour.protocol-httpclient/ivy.xml— adds WireMock 3.0.1 as a test-scoped dependency to supportHttpClientProtocolIT.