Improve redirect handling for title retrieval#8
Conversation
Reviewer's GuideThis PR enhances URL title retrieval by manually handling HTTP redirects (including 307) before parsing titles with Jsoup, adds a regression test using an embedded HTTP server, and introduces CI, security scanning, dependency update automation, and containerization for the project. Updated class diagram for RetrieverUtils URL title retrievalclassDiagram
class RetrieverUtils {
+String getTitleByUrl(url: String)
}
class TechnicalException
class SyndFeedInput {
+SyndFeed build(reader: XmlReader)
}
class SyndFeed {
+String getTitle()
}
class XmlReader
class Jsoup {
+Connection connect(url: String)
}
class Connection {
+Connection userAgent(userAgent: String)
+Connection followRedirects(followRedirects: boolean)
+Connection ignoreHttpErrors(ignoreHttpErrors: boolean)
+Response execute()
}
class Response {
+int statusCode()
+String header(name: String)
+URL url()
+Document parse()
}
class Document {
+String title()
}
class URL {
+URI toURI()
}
class URI {
+URI resolve(str: String)
}
%% Additional information preserved from original diagram:
%% - SyndFeedInput.build(XmlReader reader) throws IllegalArgumentException, FeedException, IOException
%% - Connection.execute() throws IOException
%% - Response.parse() throws IOException
%% - URL.toURI() throws URISyntaxException
RetrieverUtils ..> Jsoup : uses for HTTP requests and HTML parsing
RetrieverUtils ..> SyndFeedInput : fallback feed parsing
RetrieverUtils ..> TechnicalException : throws on redirect and parsing failures
RetrieverUtils ..> XmlReader : wraps URL stream for feed
RetrieverUtils ..> Response : inspects status, headers, URL
RetrieverUtils ..> Document : reads HTML title()
RetrieverUtils ..> URL : resolves redirect target
RetrieverUtils ..> URI : resolves redirect path
Jsoup --> Connection : connect() returns
Connection --> Response : execute() returns
Response --> Document : parse() returns
Response --> URL : url() returns
URL --> URI : toURI() returns
SyndFeedInput --> SyndFeed : build() returns
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Hey there - I've reviewed your changes - here's some feedback:
- In getTitleByUrl, you currently call response.parse().title() for any non-3xx status; consider explicitly validating a 2xx status code before parsing to avoid treating error pages (e.g., 404/500 HTML) as valid titles.
- The catch block in getTitleByUrl swallows the specific exception from the redirect logic and then potentially overwrites it with e1; consider either rethrowing more specific exception types or chaining the inner causes so the original failure reason is not lost.
- The new TestUrl relies on com.sun.net.httpserver.HttpServer, which is JDK-specific; if portability is a concern, consider abstracting this or at least isolating such tests so they can be skipped or replaced in environments where that class is not available.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In getTitleByUrl, you currently call response.parse().title() for any non-3xx status; consider explicitly validating a 2xx status code before parsing to avoid treating error pages (e.g., 404/500 HTML) as valid titles.
- The catch block in getTitleByUrl swallows the specific exception from the redirect logic and then potentially overwrites it with e1; consider either rethrowing more specific exception types or chaining the inner causes so the original failure reason is not lost.
- The new TestUrl relies on com.sun.net.httpserver.HttpServer, which is JDK-specific; if portability is a concern, consider abstracting this or at least isolating such tests so they can be skipped or replaced in environments where that class is not available.
## Individual Comments
### Comment 1
<location> `Dockerfile:5` </location>
<code_context>
+FROM bellsoft/liberica-openjdk-debian:17 AS build
+WORKDIR /workspace
+COPY . .
+RUN ./mvnw -B -ntp package -DskipTests
+
+# Runtime stage
</code_context>
<issue_to_address>
**suggestion (performance):** Docker build context likely includes the entire repo; leveraging a `.dockerignore` would speed up builds.
Since the entire repo is copied as build context, this will slow Docker builds and waste bandwidth/storage. Now that you’re adding a `.dockerignore`, please include common exclusions like `target/`, `.git/`, `.github/`, IDE files, and other local build artifacts to keep the context small and builds faster.
</issue_to_address>
### Comment 2
<location> `fwk/framework/src/com/dabi/habitv/framework/plugin/utils/RetrieverUtils.java:200-203` </location>
<code_context>
+ String currentUrl = url;
+ int redirectCount = 0;
+ while (redirectCount < 5) {
+ final org.jsoup.Connection.Response response = Jsoup.connect(currentUrl)
+ .userAgent(USER_AGENT)
+ .followRedirects(false)
+ .ignoreHttpErrors(true)
+ .execute();
+
</code_context>
<issue_to_address>
**question (bug_risk):** Using `ignoreHttpErrors(true)` changes behavior for non-2xx responses and may hide HTTP errors.
Previously, Jsoup would throw on many non-2xx statuses, which triggered the RSS/Atom fallback. With `ignoreHttpErrors(true)`, you’ll now parse and return titles even for 4xx/5xx pages, which changes behavior and may surface error-page titles. If that’s not desired, consider only using `ignoreHttpErrors(true)` for the redirect flow (e.g., 3xx) or checking explicitly for a 2xx status before parsing the body.
</issue_to_address>
### Comment 3
<location> `fwk/framework/test/com/dabi/habitv/framework/plugin/utils/TestUrl.java:39-45` </location>
<code_context>
+ }
+ }
+
+ @Test
+ public void shouldFollowRedirectsWhenRetrievingTitle() {
+ final String url = "http://localhost:" + server.getAddress().getPort() + "/start";
+
+ final String title = RetrieverUtils.getTitleByUrl(url);
+
+ assertEquals(FINAL_TITLE, title);
+ }
+
</code_context>
<issue_to_address>
**suggestion (testing):** Add negative-path tests for redirect edge cases (too many redirects and missing Location header).
Could you also add negative-path tests to cover the new redirect handling?
For example:
- A redirect chain longer than the configured limit (5), asserting that `getTitleByUrl` fails in the expected way (e.g., `TechnicalException` or specific error).
- A 3xx response without a `Location` header, asserting that a `TechnicalException` is thrown with a clear message for the "Redirect without Location header" case.
This will verify the redirect loop protection and error handling paths, not just the successful redirect case.
</issue_to_address>
### Comment 4
<location> `fwk/framework/test/com/dabi/habitv/framework/plugin/utils/TestUrl.java:5` </location>
<code_context>
-
-import org.junit.Test;
-
-public class TestUrl {
-
- @Test
</code_context>
<issue_to_address>
**suggestion (testing):** Add a test to verify the feed-parsing fallback is still used when HTML title retrieval fails.
There’s currently no test covering this fallback. Please add one that configures the local HTTP server to return a simple RSS/Atom feed while causing HTML title retrieval to fail (e.g., Jsoup error or invalid HTML), and assert that `getTitleByUrl` returns the feed’s title. This will verify the fallback still works after the redirect handling changes.
Suggested implementation:
```java
package com.dabi.habitv.framework.plugin.utils;
import static org.junit.Assert.assertEquals;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import org.junit.Test;
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpHandler;
import com.sun.net.httpserver.HttpServer;
```
```java
public class TestUrl {
@Test
public void testGetTitleByUrlFallsBackToFeedTitleWhenHtmlRetrievalFails() throws Exception {
// Start a local HTTP server that will serve a simple RSS feed.
// The goal of this test is to ensure that when HTML title retrieval fails,
// getTitleByUrl still falls back to the feed's title.
// Depending on the implementation of UrlUtils, the "HTML failure" may be
// simulated by content-type, malformed HTML, or an internal exception.
// Here we serve a valid RSS feed and rely on the feed parsing logic.
HttpServer server = HttpServer.create(new InetSocketAddress(0), 0);
server.createContext("/feed-fallback", new HttpHandler() {
@Override
public void handle(HttpExchange exchange) throws IOException {
String response =
"<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n" +
"<rss version=\"2.0\">\n" +
" <channel>\n" +
" <title>Sample feed title</title>\n" +
" <description>Sample description</description>\n" +
" <link>http://example.com/</link>\n" +
" <item>\n" +
" <title>Item 1</title>\n" +
" <link>http://example.com/item1</link>\n" +
" <description>Item 1 description</description>\n" +
" </item>\n" +
" </channel>\n" +
"</rss>";
byte[] bytes = response.getBytes("UTF-8");
// Intentionally do not advertise HTML here. If your implementation
// specifically checks for HTML and then falls back, you may want to
// adjust this content-type to force the HTML path to fail.
exchange.getResponseHeaders().add("Content-Type", "application/rss+xml; charset=UTF-8");
exchange.sendResponseHeaders(200, bytes.length);
try (OutputStream os = exchange.getResponseBody()) {
os.write(bytes);
}
}
});
server.start();
try {
String url = "http://localhost:" + server.getAddress().getPort() + "/feed-fallback";
// When HTML title retrieval fails (or is not applicable), UrlUtils should
// fall back to parsing the feed and use its <title>.
String title = UrlUtils.getTitleByUrl(url);
assertEquals("Sample feed title", title);
} finally {
server.stop(0);
}
}
```
To ensure this test truly exercises the "HTML title retrieval fails -> feed fallback" path in your specific implementation, you may need to:
1. Confirm that `UrlUtils.getTitleByUrl(String)` exists in this package and is the correct method to call. If the utility class is named differently or lives in a different package, update the method call and imports accordingly.
2. Adjust the HTTP handler to better simulate an HTML retrieval failure according to your implementation details:
- If `UrlUtils` first attempts to fetch/parse HTML when the `Content-Type` starts with `text/html`, change the handler to return `Content-Type: text/html` and a body that will cause your HTML parsing to fail (for example, by throwing an exception inside your HTML parsing code).
- If `UrlUtils` decides between HTML and feed parsing based on URL patterns or response content, modify the response or endpoint path (`/feed-fallback`) so that the HTML path is attempted and fails, then the feed parsing path is used.
3. If the project already has a shared embedded HTTP server test utility or base test class (for example, a reusable `HttpTestServerRule` or similar), replace the direct `HttpServer` setup/teardown in this test with that existing utility to stay consistent with the rest of the test suite.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| FROM bellsoft/liberica-openjdk-debian:17 AS build | ||
| WORKDIR /workspace | ||
| COPY . . | ||
| RUN ./mvnw -B -ntp package -DskipTests |
There was a problem hiding this comment.
suggestion (performance): Docker build context likely includes the entire repo; leveraging a .dockerignore would speed up builds.
Since the entire repo is copied as build context, this will slow Docker builds and waste bandwidth/storage. Now that you’re adding a .dockerignore, please include common exclusions like target/, .git/, .github/, IDE files, and other local build artifacts to keep the context small and builds faster.
| final org.jsoup.Connection.Response response = Jsoup.connect(currentUrl) | ||
| .userAgent(USER_AGENT) | ||
| .followRedirects(false) | ||
| .ignoreHttpErrors(true) |
There was a problem hiding this comment.
question (bug_risk): Using ignoreHttpErrors(true) changes behavior for non-2xx responses and may hide HTTP errors.
Previously, Jsoup would throw on many non-2xx statuses, which triggered the RSS/Atom fallback. With ignoreHttpErrors(true), you’ll now parse and return titles even for 4xx/5xx pages, which changes behavior and may surface error-page titles. If that’s not desired, consider only using ignoreHttpErrors(true) for the redirect flow (e.g., 3xx) or checking explicitly for a 2xx status before parsing the body.
| @Test | ||
| public void shouldFollowRedirectsWhenRetrievingTitle() { | ||
| final String url = "http://localhost:" + server.getAddress().getPort() + "/start"; | ||
|
|
||
| final String title = RetrieverUtils.getTitleByUrl(url); | ||
|
|
||
| assertEquals(FINAL_TITLE, title); |
There was a problem hiding this comment.
suggestion (testing): Add negative-path tests for redirect edge cases (too many redirects and missing Location header).
Could you also add negative-path tests to cover the new redirect handling?
For example:
- A redirect chain longer than the configured limit (5), asserting that
getTitleByUrlfails in the expected way (e.g.,TechnicalExceptionor specific error). - A 3xx response without a
Locationheader, asserting that aTechnicalExceptionis thrown with a clear message for the "Redirect without Location header" case.
This will verify the redirect loop protection and error handling paths, not just the successful redirect case.
|
|
||
| import static org.junit.Assert.assertEquals; | ||
|
|
||
| import java.io.IOException; |
There was a problem hiding this comment.
suggestion (testing): Add a test to verify the feed-parsing fallback is still used when HTML title retrieval fails.
There’s currently no test covering this fallback. Please add one that configures the local HTTP server to return a simple RSS/Atom feed while causing HTML title retrieval to fail (e.g., Jsoup error or invalid HTML), and assert that getTitleByUrl returns the feed’s title. This will verify the fallback still works after the redirect handling changes.
Suggested implementation:
package com.dabi.habitv.framework.plugin.utils;
import static org.junit.Assert.assertEquals;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import org.junit.Test;
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpHandler;
import com.sun.net.httpserver.HttpServer;public class TestUrl {
@Test
public void testGetTitleByUrlFallsBackToFeedTitleWhenHtmlRetrievalFails() throws Exception {
// Start a local HTTP server that will serve a simple RSS feed.
// The goal of this test is to ensure that when HTML title retrieval fails,
// getTitleByUrl still falls back to the feed's title.
// Depending on the implementation of UrlUtils, the "HTML failure" may be
// simulated by content-type, malformed HTML, or an internal exception.
// Here we serve a valid RSS feed and rely on the feed parsing logic.
HttpServer server = HttpServer.create(new InetSocketAddress(0), 0);
server.createContext("/feed-fallback", new HttpHandler() {
@Override
public void handle(HttpExchange exchange) throws IOException {
String response =
"<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n" +
"<rss version=\"2.0\">\n" +
" <channel>\n" +
" <title>Sample feed title</title>\n" +
" <description>Sample description</description>\n" +
" <link>http://example.com/</link>\n" +
" <item>\n" +
" <title>Item 1</title>\n" +
" <link>http://example.com/item1</link>\n" +
" <description>Item 1 description</description>\n" +
" </item>\n" +
" </channel>\n" +
"</rss>";
byte[] bytes = response.getBytes("UTF-8");
// Intentionally do not advertise HTML here. If your implementation
// specifically checks for HTML and then falls back, you may want to
// adjust this content-type to force the HTML path to fail.
exchange.getResponseHeaders().add("Content-Type", "application/rss+xml; charset=UTF-8");
exchange.sendResponseHeaders(200, bytes.length);
try (OutputStream os = exchange.getResponseBody()) {
os.write(bytes);
}
}
});
server.start();
try {
String url = "http://localhost:" + server.getAddress().getPort() + "/feed-fallback";
// When HTML title retrieval fails (or is not applicable), UrlUtils should
// fall back to parsing the feed and use its <title>.
String title = UrlUtils.getTitleByUrl(url);
assertEquals("Sample feed title", title);
} finally {
server.stop(0);
}
}To ensure this test truly exercises the "HTML title retrieval fails -> feed fallback" path in your specific implementation, you may need to:
- Confirm that
UrlUtils.getTitleByUrl(String)exists in this package and is the correct method to call. If the utility class is named differently or lives in a different package, update the method call and imports accordingly. - Adjust the HTTP handler to better simulate an HTML retrieval failure according to your implementation details:
- If
UrlUtilsfirst attempts to fetch/parse HTML when theContent-Typestarts withtext/html, change the handler to returnContent-Type: text/htmland a body that will cause your HTML parsing to fail (for example, by throwing an exception inside your HTML parsing code). - If
UrlUtilsdecides between HTML and feed parsing based on URL patterns or response content, modify the response or endpoint path (/feed-fallback) so that the HTML path is attempted and fails, then the feed parsing path is used.
- If
- If the project already has a shared embedded HTTP server test utility or base test class (for example, a reusable
HttpTestServerRuleor similar), replace the directHttpServersetup/teardown in this test with that existing utility to stay consistent with the rest of the test suite.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| - name: Build and test | ||
| run: ./mvnw -B -ntp verify |
There was a problem hiding this comment.
Fix Windows matrix build using Unix mvnw script
In ci.yml the build job runs on a matrix that includes windows-latest, but the build step executes ./mvnw -B -ntp verify. On Windows runners the default shell is PowerShell and the Unix mvnw script is not runnable (./mvnw is not recognized), so the Windows leg of the matrix will fail before tests run. Use mvnw.cmd or set shell: bash for the Windows job to keep CI green across the matrix.
Useful? React with 👍 / 👎.
Summary
Testing
Codex Task
Summary by Sourcery
Improve URL title retrieval by handling HTTP redirects explicitly and enhance project automation with CI, security scanning, dependency updates, and containerization.
New Features:
Bug Fixes:
Enhancements:
Build:
CI:
Deployment:
Tests:
Chores: