NUTCH-3175 Implement integration testing framework for Nutch Protocol plugins using Testcontainers #913

Open
lewismc wants to merge 2 commits into apache:master from lewismc:NUTCH-3175

lewismc commented May 8, 2026

PR for NUTCH-3175.

My goal was to run each protocol plugin against a real server rather than mocks. This PR adds a new Ant target test-protocol-integration wired into both the top-level build and GitHub Actions CI (Ubuntu only, triggered when protocol plugin files change). This mimics what we did previously with index plugins.

Integration test framework (src/test/)

  • AbstractProtocolPluginIT — base class providing getHttpStatusCode(CrawlDatum), assertFetchSuccess(), and assertFetchNotFound() helpers shared across all protocol ITs.
  • ProtocolPluginIntegrationTest — JUnit 5 interface declaring the setUpProtocol / tearDownProtocol / getProtocol / getTestUrl contract; each plugin IT implements it.
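
To illustrate how the interface and the base class fit together, here is a minimal sketch, not the code in this PR. It uses only the JDK so it runs standalone: the JDK's built-in HttpServer stands in for the real server (nginx, WireMock, MockFtpServer), and a plain HttpURLConnection stands in for the Nutch Protocol and CrawlDatum types. The names ProtocolContractSketch, AbstractProtocolSketch, LocalHttpSketch, fetchStatus, and run are hypothetical; only the setUpProtocol / tearDownProtocol / getTestUrl / assertFetchSuccess / assertFetchNotFound names come from the description above.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URI;

/** Stand-in for the per-plugin contract; method names follow the PR, types are simplified. */
interface ProtocolContractSketch {
  void setUpProtocol();
  void tearDownProtocol();
  String getTestUrl();
  int fetchStatus(String url); // hypothetical: replaces Protocol + CrawlDatum
}

/** Stand-in for the shared base class that holds the assertion helpers. */
abstract class AbstractProtocolSketch implements ProtocolContractSketch {
  void assertFetchSuccess(String url) { expectStatus(200, url); }
  void assertFetchNotFound(String url) { expectStatus(404, url); }

  private void expectStatus(int expected, String url) {
    int actual = fetchStatus(url);
    if (actual != expected) {
      throw new AssertionError("expected " + expected + " but got " + actual + " for " + url);
    }
  }
}

/** One concrete test: an in-process JDK HTTP server plays the role of the real server. */
final class LocalHttpSketch extends AbstractProtocolSketch {
  private HttpServer server;

  @Override public void setUpProtocol() {
    try {
      // Port 0 asks the OS for a free port, so parallel builds do not collide.
      server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    server.createContext("/", exchange -> {
      boolean known = "/index.html".equals(exchange.getRequestURI().getPath());
      exchange.sendResponseHeaders(known ? 200 : 404, -1); // -1 means no response body
      exchange.close();
    });
    server.start();
  }

  @Override public void tearDownProtocol() {
    server.stop(0);
  }

  @Override public String getTestUrl() {
    return "http://127.0.0.1:" + server.getAddress().getPort() + "/index.html";
  }

  @Override public int fetchStatus(String url) {
    try {
      HttpURLConnection conn =
          (HttpURLConnection) URI.create(url).toURL().openConnection(Proxy.NO_PROXY);
      try {
        return conn.getResponseCode();
      } finally {
        conn.disconnect();
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Runs the lifecycle once and returns the status codes for a known and an unknown path. */
  static int[] run() {
    LocalHttpSketch it = new LocalHttpSketch();
    it.setUpProtocol();
    try {
      String url = it.getTestUrl();
      it.assertFetchSuccess(url);
      it.assertFetchNotFound(url + ".missing");
      return new int[] {it.fetchStatus(url), it.fetchStatus(url + ".missing")};
    } finally {
      it.tearDownProtocol();
    }
  }
}
```

The point of the split is that each plugin test only supplies the server lifecycle and the test URL, while the status assertions live in one place.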

Per-plugin integration tests

  • protocol-ftp — FtpProtocolIT — in-process MockFtpServer 3.1.0, no Docker required
  • protocol-http — HttpProtocolIT — nginx:alpine via Testcontainers
  • protocol-httpclient — HttpClientProtocolIT — in-process WireMock 3.0.1, no Docker required
  • protocol-htmlunit — HtmlUnitProtocolIT — nginx:alpine via Testcontainers
  • protocol-okhttp — OkHttpProtocolIT — nginx:alpine via Testcontainers
  • protocol-selenium — SeleniumProtocolIT — nginx:alpine via Testcontainers

Testcontainers-based tests are annotated @Testcontainers(disabledWithoutDocker = true) and skip cleanly when Docker is unavailable.

Build / CI changes

  • build.xml — new top-level test-protocol-integration target delegates to src/plugin/build.xml.
  • src/plugin/build.xml — runs each protocol plugin's test-protocol-integration target sequentially to avoid container resource contention.
  • src/plugin/build-plugin.xml — adds test-protocol-integration target; adds testcontainers*.jar to the global plugin test classpath so plugins can compile against Testcontainers without declaring it individually.
  • .github/workflows/master-build.yml — adds protocol_plugins path filter and test protocol integration step, gated to ubuntu-latest only.

Bug fixes in protocol-ftp (found while writing tests)

This part surprised me; admittedly, I had never used protocol-ftp before. These are production fixes, not test scaffolding:

  1. FtpResponse: ignored URL port — ftp.client.connect(addr) always connected to port 21, ignoring the port in the URL. Fixed to use url.getPort() with fallback to FTP.DEFAULT_PORT.
  2. FtpResponse: quoted SYST reply — RFC 959 allows servers to quote the system type (215 "UNIX"). After .substring(4) the client received "UNIX" (with literal quotes), causing parser initialization to fail silently with ftp.parser is null. Fixed with explicit quote stripping.
  3. FtpResponse: empty directory listing treated as exception — when a server returns a 150+226 response with an empty listing for a non-existent file, list.get(0) threw IndexOutOfBoundsException. Fixed by checking list.isEmpty() and returning 404 instead.
  4. Ftp: status code never set on exception — if FtpResponse constructor threw before getProtocolOutput reached the datum.getMetaData().put(...) call, PROTOCOL_STATUS_CODE_KEY was never set, causing a NullPointerException in callers. Fixed by setting code 500 in the outer catch block.
  5. protocol-ftp/ivy.xml: commons-net upgraded 1.2.2 → 3.9.0 — commons-net 1.2.2's UnixFTPEntryParser depended on Apache ORO (org.apache.oro.text.regex), which is not on the Nutch classpath. At runtime this caused a NoClassDefFoundError that was silently swallowed by a finally/return block, leaving ftp.parser = null and every fetch returning HTTP 500. Upgrading to 3.9.0 eliminates the ORO dependency.
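
The logic behind the first three fixes can be sketched in isolation as follows. This is a JDK-only illustration under stated assumptions, not the patched FtpResponse code: the class and method names (FtpFixSketch, portOf, resolvePort, systemType, statusForListing) are hypothetical, the listing is modelled as a plain list of strings, and stripping every double quote is just one possible way to do the "explicit quote stripping" mentioned above.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.util.List;

final class FtpFixSketch {
  /** Same value as FTP.DEFAULT_PORT in commons-net. */
  static final int DEFAULT_FTP_PORT = 21;

  /** Fix 1: honour an explicit port in the URL, otherwise fall back to the default. */
  static int resolvePort(URL url) {
    int port = url.getPort(); // -1 when the URL carries no explicit port
    return port > 0 ? port : DEFAULT_FTP_PORT;
  }

  /** Convenience wrapper so callers can pass a string (hypothetical helper). */
  static int portOf(String spec) {
    try {
      return resolvePort(URI.create(spec).toURL());
    } catch (MalformedURLException e) {
      throw new IllegalArgumentException("not a valid URL: " + spec, e);
    }
  }

  /**
   * Fix 2: drop the "215 " reply-code prefix, then remove literal double quotes
   * so that a reply of 215 "UNIX" yields UNIX rather than "UNIX".
   */
  static String systemType(String systReply) {
    String name = systReply.length() > 4 ? systReply.substring(4) : "";
    return name.replace("\"", "").trim();
  }

  /** Fix 3: an empty listing means "not found" instead of calling list.get(0). */
  static int statusForListing(List<String> listing) {
    return listing.isEmpty() ? 404 : 200;
  }
}
```

For example, `FtpFixSketch.portOf("ftp://localhost:2121/a.txt")` resolves to the explicit port, which is what lets a test point the plugin at a server bound to a non-default port.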

Other fixes

  • conf/log4j2.xml — renamed internal elements from hadoop.log.dir/hadoop.log.file to nutch.log.dir/nutch.log.file. Hadoop's test harness sets system properties hadoop.log.dir and hadoop.log.file to self-referential values; when log4j2 resolved ${sys:hadoop.log.dir} inside a property of the same name, it detected an infinite interpolation loop and emitted repeated WARN Infinite loop in property interpolation messages. Renaming the log4j2 properties breaks the cycle while preserving the same runtime behaviour.
  • protocol-httpclient/ivy.xml — adds WireMock 3.0.1 as a test-scoped dependency to support HttpClientProtocolIT.

lewismc self-assigned this May 8, 2026

lewismc commented May 8, 2026

Pushed changes to fix the issues flagged by Yetus. Great to see these Yetus checks upholding stricter consistency.

sonarqubecloud Bot commented May 8, 2026

Quality Gate failed

Failed conditions
0.0% Coverage on New Code (required ≥ 80%)
B Reliability Rating on New Code (required ≥ A)
