-
Notifications
You must be signed in to change notification settings - Fork 304
Fix issue #804: Support multiple template namespace prefixes for Macedonian #812
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Fix issue #804: Support multiple template namespace prefixes for Macedonian #812
Conversation
Removing broken information in brackets
* implement selection of construct validation tests * rename test report file of construct validation tests * fix path to test reports directory * add removing brackets that start with "(," * make removing broken brackets in WikipediaNifExtractor configurable * rename abstract extractors * fix getting short abstracts in WikipediaNifExtractor * generate two dumps for plain and html abstract extractors * fix minidump abstract extraction
* add Construct model * add construct validation test for (; in abstracts * implement handling of right and left validators
* Fix merging of multiple infoboxes under one resource, see issue dbpedia#709 * Add new page to the minidump for testing * Create SHACL test, add it to the shacl-test-groups.csv and disabled isbn test
closes dbpedia#771. --------- Co-authored-by: Ted Thibodeau Jr <[email protected]>
📝 WalkthroughWalkthroughThis PR introduces a broken bracket removal feature for abstracts, refactors extractor names (AbstractExtractor → PlainAbstractExtractor, AbstractExtractorWikipedia → HtmlAbstractExtractor), redesigns the validation framework to use a Construct data model instead of plain Strings, fixes a Macedonian template namespace handling bug in MappingStatsHolder, and updates test infrastructure with new configurations and utilities. Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
…edonian using wsl
a5630e7 to
df0b449
Compare
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 11
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
dump/src/test/resources/shaclTestsCoverageTable.md (1)
29-172: Normalize SHACL coverage table rows to 4 columns (no trailing pipes).markdownlint reports that many added rows (e.g., Line 29 onward, plus Lines 216–229) have too few cells and end with a trailing pipe. Since the header defines 4 columns, each row should include 4 cells and end with the last cell value (not a trailing pipe). Please add placeholders for the “issue” and “comment” columns (or fill them) and remove trailing pipes to keep the table valid.
🛠️ Example fix (apply consistently to the other new rows)
-[http://de.dbpedia.org/resource/Arthur_Schopenhauer](http://dief.tools.dbpedia.org/server/extraction/de/extract?title=Arthur_Schopenhauer&revid=&format=trix&extractors=custom) | [http://dbpedia.org/ontology/abstract](http://dbpedia.org/ontology/abstract) `#en_abstract_validation` | +[http://de.dbpedia.org/resource/Arthur_Schopenhauer](http://dief.tools.dbpedia.org/server/extraction/de/extract?title=Arthur_Schopenhauer&revid=&format=trix&extractors=custom) | [http://dbpedia.org/ontology/abstract](http://dbpedia.org/ontology/abstract) `#en_abstract_validation` | - | -Also applies to: 216-229
dump/src/main/scala/org/dbpedia/validation/construct/tests/suites/NTripleTestSuite.scala (1)
73-101: Prevent Construct(null) from reaching validators (NPE regression).If parsing fails,
s/p/oremain null but you still buildConstruct(null). These survive.filter(_ != null)and laternTriplePart.self.startsWith(...)throws. Previously, null Strings were filtered out.🔧 Proposed fix (return empty on parse failure / nulls)
try { s = spo(0).drop(1) p = spo(1).trim.drop(1) o = { @@ if (aa.startsWith("<")) aa.drop(1).dropRight(1) else aa } } catch { - case ae: ArrayIndexOutOfBoundsException => println(line); ae.printStackTrace() + case ae: ArrayIndexOutOfBoundsException => + println(line); ae.printStackTrace() + return Array.empty } - Array(Construct(s), Construct(p, Some(s), Some(o)), Construct(o)) + if (s == null || p == null || o == null) return Array.empty + Array(Construct(s), Construct(p, Some(s), Some(o)), Construct(o))Also applies to: 119-128
🤖 Fix all issues with AI agents
In @.github/ISSUE_TEMPLATE/data.md:
- Around line 10-38: Fix the markdownlint issues in the template sections
(headers "Issue validity", "Error Description", "Pinpointing the source of the
error", "Details") by replacing bare URLs with proper link syntax (e.g.,
[text](http://...)) for the links currently on the "Issue validity" and
"Pinpointing the source of the error" lines, remove the stray blank line inside
the blockquote in the "Details" section, and add language identifiers to the
fenced code blocks under "Wrong triples RDF snippet" and "Expected / corrected
RDF outcome snippet" (use ttl) and to the "Example DBpedia resource URL(s)"
block (use text) so the fences become ```ttl and ```text respectively. Ensure
the blockquote prefixes (>) remain consistent for each paragraph line.
In
`@core/src/main/scala/org/dbpedia/extraction/mappings/PlainAbstractExtractor.scala`:
- Around line 251-255: The regex in patternsToRemove inside
PlainAbstractExtractor.scala uses a greedy and non-DOTALL pattern for
"<normalized>.*</normalized>", which can span too much or miss multiline
content; change that entry to a non‑greedy DOTALL pattern (e.g. use a pattern
equivalent to (?s)<normalized>.*?</normalized>) so it matches the smallest block
including newlines between each pair of tags rather than greedily across the
line.
In `@dump/src/test/bash/minidump-overview.md`:
- Line 48: The bare URL on line 48 causes a markdownlint MD034 warning. Fix this
by wrapping the URL in angle brackets or converting it into a Markdown link
format in the minidump-overview.md file where the Marian Breland Bailey
Wikipedia link appears.
In `@dump/src/test/resources/ci-tests/dbpedia-specific-ci-tests.ttl`:
- Around line 301-307: The v:validator block for <#iri_slash_n> uses
v:doesNotContain "\\n", which matches a backslash + 'n' rather than an actual
newline; update the v:doesNotContain value to use the newline escape "\n" (i.e.,
replace "\\n" with "\n") in the v:IRI_Validator entry so the test asserts IRIs
do not contain actual U+000A newlines.
In `@dump/src/test/resources/shacl-tests/properties/dbp_abstract.ttl`:
- Around line 15-19: Fix the malformed language-tag and replace the unsupported
negative lookahead: remove the space before the language tag in the sh:message
so the "@en" immediately follows the closing quote, and replace the sh:pattern
with a sh:not constraint on the dbpedia abstract path
(<http://dbpedia.org/ontology/abstract>) using a simple pattern that matches the
disallowed sequence instead of a negative lookahead (i.e., use sh:not with a
nested sh:pattern that targets "\(;" rather than the "(?!...)" construct).
In
`@dump/src/test/scala/org/dbpedia/extraction/dump/ConstructValidationTest.scala`:
- Around line 37-51: The current logic builds selectValues as a newline-joined
string and then removes statements whose subjects match those keys (inverting
intended filtering); instead build a Set of URIs and remove statements that are
NOT in that set so you keep only intended tests: change the creation to val
selectUris = groupKeys.map(x =>
s"https://github.com/dbpedia/extraction-framework$x").toSet, then in the
iterator loop use if (rdfSubject.isURIResource && rdfObject.toString ==
testGeneratorURI && !selectUris.contains(rdfSubject.getURI)) { iterator.remove()
} so membership checks use the Set and the filter keeps the selected tests
(referencing Utils.loadTestGroupsKeys, selectUris/selectValues,
iterator/listStatements, testGeneratorURI).
In `@dump/src/test/scala/org/dbpedia/extraction/dump/Utils.scala`:
- Around line 9-45: In loadTestGroupsKeys and getGroup ensure opened streams are
closed: replace scala.io.Source.fromFile(filePath) with a resource-safe pattern
(e.g. scala.util.Using) to close the Source after reading and avoid leaking file
handles, and wrap getClass.getClassLoader.getResourceAsStream(...) usage in a
safe/try-with-resources style (or scala.util.Using) so the InputStream is
closed; also handle the missing properties resource by returning
TestConfig.defaultTestGroup or throwing a clear exception instead of letting a
NullPointerException propagate (update loadTestGroupsKeys, getGroup, and any use
of classLoader/getResource/getResourceAsStream accordingly).
- Around line 47-58: The method renameAbstractsDatasetFiles currently assumes
minidumpDir.listFiles() is non-null and ignores failures from File.renameTo;
update renameAbstractsDatasetFiles to first check that minidumpDir.exists() and
that listFiles() != null (return or throw a descriptive exception if missing),
then iterate over Option(listFiles()) or a safe array, and after each call to
renameTo(...) check the boolean result and handle failures (log an error via
your test logger or throw an AssertionError) for both longAbstractsFile and
shortAbstractsFile so missing directories and failed renames fail fast and are
visible in tests; reference renameAbstractsDatasetFiles, minidumpDir, listFiles,
and renameTo in your changes.
In `@README.md`:
- Line 7: Fix the Slack line in the README: remove the duplicate "the", correct
"developement" to "development", fix casing/punctuation for clarity, and remove
the stray space inside the Slack invite URL so the link works; keep the channel
mention "#dev-team" and the DBpedia Slack workspace link but rewrite the
sentence to read smoothly (e.g., "Slack: join the `#dev-team` channel in the
DBpedia Slack workspace — the main point for development updates and
discussions").
- Around line 107-125: Fix markdownlint warnings in the guidelines section by
correcting nested list indentation (indent the sub-bullets under "If you already
pushed a branch..." and the quoted links block to align as nested list items),
normalize emphasis style (replace mixed *italic* with either underscores or
consistent asterisks throughout occurrences like "Try *not* to modify the
indentation." and the _italic_ examples), and replace bare URLs with proper
Markdown links (turn the bare https://forum.dbpedia.org and any other bare URLs
into [forum.dbpedia.org](https://forum.dbpedia.org) or descriptive link text).
Locate and update the paragraph and bulleted area containing the lines starting
"If you already pushed a branch..." and the "Please keep in mind:" block to
apply these changes.
In
`@server/src/main/scala/org/dbpedia/extraction/server/stats/MappingStatsHolder.scala`:
- Around line 82-113: Add a small safe-ratio helper (e.g., def safeRatio(num:
Double, den: Double): Double = if (den == 0) 0.0 else num / den) and use it in
MappingStatsHolder to compute mappedTemplateRatio, mappedPropertyRatio,
mappedTemplateUseRatio and mappedPropertyUseRatio instead of directly dividing
(replace mappedTemplateCount.toDouble / templateCount.toDouble, etc.). Apply the
same helper or equivalent zero-check in MappingStats for mappedPropertyRatio and
mappedPropertyUseRatio to guard against zero denominators and avoid NaN/Infinity
in UI output. Ensure you reference the existing identifiers: MappingStatsHolder,
countTemplates, mappedTemplateRatio, mappedPropertyRatio,
mappedTemplateUseRatio, mappedPropertyUseRatio, and in MappingStats update
mappedPropertyRatio and mappedPropertyUseRatio.
🧹 Nitpick comments (4)
core/src/main/scala/org/dbpedia/extraction/mappings/MissingAbstractsExtractor.scala (1)
55-58: Confirm logger category change is intentional.Line 57 now logs under
PlainAbstractExtractor. If you want logs scoped toMissingAbstractsExtractor, consider switching the logger category back to this class.💡 Optional adjustment
- private val logger = Logger.getLogger(classOf[PlainAbstractExtractor].getName) + private val logger = Logger.getLogger(classOf[MissingAbstractsExtractor].getName)dump/src/main/scala/org/dbpedia/validation/construct/model/Construct.scala (1)
1-3: Consider adding ScalaDoc for field semantics.The
Constructcase class is well-designed with appropriate use ofOptiontypes. However, the purpose ofleftandrightfields isn't immediately clear from the code alone. A brief ScalaDoc comment would help maintainers understand when these fields are populated and how validators should use them (e.g., for triple subject/object validation).📝 Suggested documentation
package org.dbpedia.validation.construct.model +/** + * Represents a construct to be validated. + * `@param` self The primary value to validate + * `@param` left Optional left context (e.g., subject in a triple) + * `@param` right Optional right context (e.g., object in a triple) + */ case class Construct(self: String, left: Option[String] = None, right: Option[String] = None)core/src/test/scala/org/dbpedia/extraction/util/AbstractUtilsTest.scala (1)
5-30: Good test coverage for core scenarios.The tests cover the main use cases effectively. Consider adding edge case tests for robustness:
- Nested brackets:
"Text ((; nested)) more"- Multiple patterns:
"A (; x) B (; y) C"- No brackets (identity):
"Plain text without brackets"- Brackets at boundaries:
"(; start) middle (end; )"These would help ensure the utility handles less common inputs gracefully.
dump/src/test/scala/org/dbpedia/extraction/dump/ConstructValidationTest.scala (1)
3-9: Remove redundant imports.Several imports are duplicated:
StmtIteratoris imported on both lines 6 and 9RDFNodeis imported on both lines 9 and 19model(line 4) andgraph(line 3) appear unused♻️ Suggested cleanup
package org.dbpedia.extraction.dump -import org.apache.jena.graph -import org.apache.jena.rdf.model -import org.apache.jena.rdf.model.impl.StatementImpl -import org.apache.jena.rdf.model.StmtIterator - -import java.io.{File, FileInputStream, FileOutputStream} -import org.apache.jena.rdf.model.{Alt, Bag, Literal, Model, ModelFactory, Property, RDFNode, RSIterator, ReifiedStatement, Resource, ResourceF, Statement, StmtIterator} +import java.io.{File, FileInputStream, FileOutputStream} +import java.util.function.Consumer + +import org.apache.jena.rdf.model.{Model, ModelFactory, RDFNode} import org.apache.jena.riot.{RDFDataMgr, RDFLanguages}Also applies to: 19-19
| # Issue validity | ||
| > Some explanation: DBpedia Snapshot is produced every three months, see [Release Frequency & Schedule](https://www.dbpedia.org/blog/snapshot-2021-06-release/#anchor1), which is loaded into http://dbpedia.org/sparql . During these three months, Wikipedia changes and also the DBpedia Information Extraction Framework receives patches. At http://dief.tools.dbpedia.org/server/extraction/en/ we host a daily updated extraction web service that can extract one Wikipedia page at a time. To check whether your issue is still valid, please enter the article name, e.g. `Berlin` or `Joe_Biden` here: http://dief.tools.dbpedia.org/server/extraction/en/ | ||
| > If the issue persists, please post the link from your browser here: | ||
|
|
||
| # Source | ||
| > Where did you find the data issue? Pick one, remove the others. | ||
|
|
||
| ### Web / SPARQL | ||
| > State the service (e.g. http://dbpedia.org/sparql) and the SPARQL query | ||
| > give a link to the web / linked data pages (e.g. http://dbpedia.org/resource/Berlin) | ||
|
|
||
| ### Release Dumps | ||
| > DBpedia provides monthly release dumps, cf. release-dashboard.dbpedia.org | ||
| > provide artifact & version or download link | ||
|
|
||
| ### Running the DBpedia Extraction (DIEF) software | ||
| > Please include all necessary information. | ||
|
|
||
|
|
||
| # Classification | ||
| > If you have some familiarity with DBpedia, please use the classification tags at (link) to correctly file this issue. Otherwise skip this step. | ||
|
|
||
|
|
||
|
|
||
| ### Error Description | ||
| # Error Description | ||
| > Please state the nature of your technical emergency: | ||
|
|
||
| # Pinpointing the source of the error | ||
| > Where did you find the data issue? Non-exhaustive options are: | ||
| * Web/SPARQL, e.g. http://dbpedia.org/sparql or http://dbpedia.org/resource/Berlin, please **provide query or link** | ||
| * Dumps: dumps are managed by the Databus. Please **provide artifact & version or download link** | ||
| * DIEF: you ran the software and the error occured then, please **include all necessary information such as the extractor or log**. If you had problems running the software use [another issue template](https://github.com/dbpedia/extraction-framework/issues/new/choose) | ||
|
|
||
| ### Error specification | ||
| > Pick the appropriate: | ||
| # Details | ||
| > please post the details | ||
|
|
||
| - Affected extraction artifacts (Databus artifact version or file identifiers): | ||
| - https://databus.dbpedia.org/dbpedia/mappings/mappingbased-objects/mappingbased-objects_lang=en_disjointDomain.ttl.bz2 | ||
| - | ||
| - Example DBpedia resource URL(s) having the error (one full IRI per line): | ||
| - http://dbpedia.org/resource/Leipzig | ||
| - | ||
| - Erroneous triples RDF snippet (NTRIPLES): | ||
| > Wrong triples RDF snippet | ||
| ``` | ||
|
|
||
| ``` | ||
| - Expected / corrected RDF outcome snippet (NTRIPLES): | ||
| > Expected / corrected RDF outcome snippet | ||
| ``` | ||
|
|
||
| ``` | ||
| >Example DBpedia resource URL(s) | ||
| ``` | ||
|
|
||
| ### Additional context | ||
| > Add any other context about the problem here. | ||
| ``` | ||
| > Other |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Resolve markdownlint issues: bare URLs + missing code‑fence languages.
Line 11 and Line 19 include bare URLs, and the fenced blocks at Lines 27/31/35 are missing language identifiers. There’s also a blank line inside a blockquote (Line 25). These are easy markdownlint failures to fix.
✅ Suggested cleanup
-# Issue validity
-> Some explanation: DBpedia Snapshot is produced every three months, see [Release Frequency & Schedule](https://www.dbpedia.org/blog/snapshot-2021-06-release/#anchor1), which is loaded into http://dbpedia.org/sparql . During these three months, Wikipedia changes and also the DBpedia Information Extraction Framework receives patches. At http://dief.tools.dbpedia.org/server/extraction/en/ we host a daily updated extraction web service that can extract one Wikipedia page at a time. To check whether your issue is still valid, please enter the article name, e.g. `Berlin` or `Joe_Biden` here: http://dief.tools.dbpedia.org/server/extraction/en/
+# Issue validity
+> Some explanation: DBpedia Snapshot is produced every three months, see [Release Frequency & Schedule](https://www.dbpedia.org/blog/snapshot-2021-06-release/#anchor1), which is loaded into <http://dbpedia.org/sparql>. During these three months, Wikipedia changes and also the DBpedia Information Extraction Framework receives patches. At <http://dief.tools.dbpedia.org/server/extraction/en/> we host a daily updated extraction web service that can extract one Wikipedia page at a time. To check whether your issue is still valid, please enter the article name, e.g. `Berlin` or `Joe_Biden` here: <http://dief.tools.dbpedia.org/server/extraction/en/>
@@
-> Please state the nature of your technical emergency:
+> Please state the nature of your technical emergency:
@@
-# Details
-> please post the details
+# Details
+> please post the details
+>
@@
-> Wrong triples RDF snippet
- ```
+> Wrong triples RDF snippet
+ ```ttl
@@
-> Expected / corrected RDF outcome snippet
- ```
+> Expected / corrected RDF outcome snippet
+ ```ttl
@@
->Example DBpedia resource URL(s)
-```
+>Example DBpedia resource URL(s)
+```text🧰 Tools
🪛 markdownlint-cli2 (0.18.1)
11-11: Bare URL used
(MD034, no-bare-urls)
11-11: Bare URL used
(MD034, no-bare-urls)
11-11: Bare URL used
(MD034, no-bare-urls)
19-19: Bare URL used
(MD034, no-bare-urls)
19-19: Bare URL used
(MD034, no-bare-urls)
25-25: Blank line inside blockquote
(MD028, no-blanks-blockquote)
27-27: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
31-31: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
35-35: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents
In @.github/ISSUE_TEMPLATE/data.md around lines 10 - 38, Fix the markdownlint
issues in the template sections (headers "Issue validity", "Error Description",
"Pinpointing the source of the error", "Details") by replacing bare URLs with
proper link syntax (e.g., [text](http://...)) for the links currently on the
"Issue validity" and "Pinpointing the source of the error" lines, remove the
stray blank line inside the blockquote in the "Details" section, and add
language identifiers to the fenced code blocks under "Wrong triples RDF snippet"
and "Expected / corrected RDF outcome snippet" (use ttl) and to the "Example
DBpedia resource URL(s)" block (use text) so the fences become ```ttl and
```text respectively. Ensure the blockquote prefixes (>) remain consistent for
each paragraph line.
| val patternsToRemove = List( | ||
| """<div style=[^/]*/>""".r -> " ", | ||
| """</div>""".r -> " " | ||
| """</div>""".r -> " ", | ||
| """<normalized>.*<\/normalized>""".r -> "" | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make <normalized> removal non‑greedy to avoid stripping unrelated text.
.* is greedy and can remove content between the first and last <normalized> tag on the same line. It also won’t match across newlines. Prefer a non‑greedy DOTALL pattern.
🔧 Proposed fix (non-greedy DOTALL)
val patternsToRemove = List(
"""<div style=[^/]*/>""".r -> " ",
"""</div>""".r -> " ",
- """<normalized>.*<\/normalized>""".r -> ""
+ """(?s)<normalized>.*?</normalized>""".r -> ""
)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| val patternsToRemove = List( | |
| """<div style=[^/]*/>""".r -> " ", | |
| """</div>""".r -> " " | |
| """</div>""".r -> " ", | |
| """<normalized>.*<\/normalized>""".r -> "" | |
| ) | |
| val patternsToRemove = List( | |
| """<div style=[^/]*/>""".r -> " ", | |
| """</div>""".r -> " ", | |
| """(?s)<normalized>.*?</normalized>""".r -> "" | |
| ) |
🤖 Prompt for AI Agents
In
`@core/src/main/scala/org/dbpedia/extraction/mappings/PlainAbstractExtractor.scala`
around lines 251 - 255, The regex in patternsToRemove inside
PlainAbstractExtractor.scala uses a greedy and non-DOTALL pattern for
"<normalized>.*</normalized>", which can span too much or miss multiline
content; change that entry to a non‑greedy DOTALL pattern (e.g. use a pattern
equivalent to (?s)<normalized>.*?</normalized>) so it matches the smallest block
including newlines between each pair of tags rather than greedily across the
line.
| * https://en.wikipedia.org/wiki/IKEA | ||
| * https://en.wikipedia.org/wiki/Jim_Pewter | ||
| * https://en.wikipedia.org/wiki/Kerala_Agricultural_University | ||
| * https://en.wikipedia.org/wiki/Marian_Breland_Bailey |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix MD034 by wrapping the new bare URL (Line 48).
markdownlint flags the newly added bare URL. Wrap it in angle brackets or use a Markdown link to satisfy MD034.
✅ Suggested fix
-* https://en.wikipedia.org/wiki/Marian_Breland_Bailey
+* <https://en.wikipedia.org/wiki/Marian_Breland_Bailey>📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| * https://en.wikipedia.org/wiki/Marian_Breland_Bailey | |
| * <https://en.wikipedia.org/wiki/Marian_Breland_Bailey> |
🧰 Tools
🪛 markdownlint-cli2 (0.18.1)
48-48: Bare URL used
(MD034, no-bare-urls)
🤖 Prompt for AI Agents
In `@dump/src/test/bash/minidump-overview.md` at line 48, The bare URL on line 48
causes a markdownlint MD034 warning. Fix this by wrapping the URL in angle
brackets or converting it into a Markdown link format in the
minidump-overview.md file where the Marian Breland Bailey Wikipedia link
appears.
| <#iri_slash_n> | ||
| a v:TestGenerator ; | ||
| v:trigger trigger:generic_iri ; | ||
| v:validator [ | ||
| a v:IRI_Validator ; | ||
| v:doesNotContain "\\n" | ||
| ] . |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
🌐 Web query:
Turtle RDF string escape sequences \\n newline
💡 Result:
Yes — in Turtle string literals the escape sequence \n represents a newline (U+000A). Example: "line1\nline2" yields a string with a line break. In short (quoted) string literals actual newline characters must be escaped; in long ("""...""") literals you may include real newlines directly. [1] (w3.org)
Citations:
🏁 Script executed:
# Check the file context to understand the validator better
fd "dbpedia-specific-ci-tests.ttl" --type f -x wc -l {}Repository: dbpedia/extraction-framework
Length of output: 140
🏁 Script executed:
# Read the relevant section of the file
fd "dbpedia-specific-ci-tests.ttl" --type f -x head -320 {} | tail -50Repository: dbpedia/extraction-framework
Length of output: 2167
\\n matches a literal backslash‑n, not a newline character.
In Turtle, \n represents a newline (U+000A), so \\n represents a backslash followed by 'n'. To check that IRIs don't contain actual newline characters, use "\n" instead.
✅ Suggested fix
- v:doesNotContain "\\n"
+ v:doesNotContain "\n"📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| <#iri_slash_n> | |
| a v:TestGenerator ; | |
| v:trigger trigger:generic_iri ; | |
| v:validator [ | |
| a v:IRI_Validator ; | |
| v:doesNotContain "\\n" | |
| ] . | |
| <#iri_slash_n> | |
| a v:TestGenerator ; | |
| v:trigger trigger:generic_iri ; | |
| v:validator [ | |
| a v:IRI_Validator ; | |
| v:doesNotContain "\n" | |
| ] . |
🤖 Prompt for AI Agents
In `@dump/src/test/resources/ci-tests/dbpedia-specific-ci-tests.ttl` around lines
301 - 307, The v:validator block for <#iri_slash_n> uses v:doesNotContain "\\n",
which matches a backslash + 'n' rather than an actual newline; update the
v:doesNotContain value to use the newline escape "\n" (i.e., replace "\\n" with
"\n") in the v:IRI_Validator entry so the test asserts IRIs do not contain
actual U+000A newlines.
| sh:message "Error, found (; in dbo:abstract. "@en ; | ||
| sh:property [ | ||
| sh:path <http://dbpedia.org/ontology/abstract> ; | ||
| sh:pattern "^((?!\\(\\;).)*$" ; | ||
| ] . |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
🏁 Script executed:
# First, find and examine the file
find . -name "dbp_abstract.ttl" -type fRepository: dbpedia/extraction-framework
Length of output: 137
🏁 Script executed:
# Also check the file size to understand how to read it
wc -l dump/src/test/resources/shacl-tests/properties/dbp_abstract.ttl 2>/dev/null || echo "File not found in expected location"Repository: dbpedia/extraction-framework
Length of output: 138
🌐 Web query:
Turtle RDF language-tagged literal syntax @en space rules
💡 Result:
Short answer
- The language tag must immediately follow the closing quote with no intervening space: e.g. "hello"@en (valid), "hello"
@en(invalid). [1][2] - The @ is followed by a BCP‑47 language tag (well‑formed per BCP47). In RDF 1.2 an optional initial text‑direction can follow the tag separated by “--” (values ltr or rtl). [1][2]
Examples
- Valid: "That Seventies Show"@en. [2]
- Invalid (space): "That Seventies Show"
@en. [1]
References
- W3C RDF 1.2 Turtle specification (language/tag and LANG_DIR). [1]
- W3C RDF 1.1 Turtle specification (language‑tag syntax and examples). [2]
🌐 Web query:
SPARQL XPath regex negative lookahead support
💡 Result:
Web search failed: Server error: no LLM provider could handle the message
🏁 Script executed:
cat -n dump/src/test/resources/shacl-tests/properties/dbp_abstract.ttlRepository: dbpedia/extraction-framework
Length of output: 992
🌐 Web query:
SPARQL REGEX function negative lookahead support
💡 Result:
Short answer: No—SPARQL's REGEX uses the XPath/XQuery / XML Schema regular‑expression flavor, which does not include Perl‑style lookaround (negative/positive lookahead or lookbehind). (w3.org)
What to do instead (options):
- Use SPARQL logic instead of lookarounds (e.g. FILTER + NOT EXISTS or an explicit FILTER with two regex calls). Example: FILTER(regex(...)) and FILTER NOT EXISTS { ... FILTER(regex(...)) } . (stackoverflow.com)
- Preprocess your strings or add computed properties so matching becomes simpler.
- If you control the engine and it offers an extension (some XQuery engines allow native Java/PCRE regex via a flag), use that engine‑specific extension—but this is nonstandard. (stackoverflow.com)
If you want, tell me which SPARQL engine you use and the exact pattern you need; I can propose a concrete non‑lookaround rewrite.
Citations:
- 1: https://www.w3.org/2009/sparql/docs/query-1.1/rq25-pub-2011-05.html?utm_source=openai
- 2: https://stackoverflow.com/questions/20777406/sparql-combine-and-exclude-regex-filters?utm_source=openai
- 3: https://stackoverflow.com/questions/52558237/positive-lookbehind-in-xquery?utm_source=openai
🌐 Web query:
XSD XML Schema regular expressions negative lookahead
💡 Result:
Short answer: XML Schema (XSD) regular expressions do not support lookaround assertions (positive/negative lookahead or lookbehind). [1][2]
Options/alternatives:
- If you can use XSD 1.1, add an xs:assert that uses XPath to reject matches, e.g.:
<xs:assert test="not(matches(., 'FILENAME'))"/> — this enforces “no FILENAME anywhere” without lookahead. [1] - Build the allowed pattern(s) explicitly (unions/alternation, enumerations) so the forbidden case is never matched (can be cumbersome). [2][8]
- Use a processor-specific extension to switch to a different regex engine (e.g. Saxon’s ';j' flag to use Java regex with lookarounds) — vendor-specific only. [3][7]
Sources:
- W3C XML Schema (regular-expression syntax and limitations). [1]
- Community discussion / examples showing lack of lookahead and workarounds (StackOverflow). [2][5]
- Saxon documentation and examples showing the Java-regex extension flag. [3]
References:
[1] W3C XML Schema Part 2: Datatypes — Regular expressions.
[2] StackOverflow — “XML schema restriction pattern for not allowing specific string” / “Negative lookahead in XML schema”.
[3] Saxonica documentation (XSD 1.1 / Java-regex 'j' flag).
Fix invalid language-tagged literal and unsupported negative lookahead regex.
Line 15 has a space before @en which violates Turtle syntax for language-tagged literals (the tag must immediately follow the closing quote). Line 18's negative lookahead (?!...) is not supported by SPARQL/XPath regex and will be rejected by SHACL engines. Use sh:not with a simple pattern instead.
Proposed fix
- sh:message "Error, found (; in dbo:abstract. "@en ;
+ sh:message "Error, found (; in dbo:abstract."@en ;
sh:property [
sh:path <http://dbpedia.org/ontology/abstract> ;
- sh:pattern "^((?!\\(\\;).)*$" ;
+ sh:not [ sh:pattern "\\(;" ] ;
] .📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| sh:message "Error, found (; in dbo:abstract. "@en ; | |
| sh:property [ | |
| sh:path <http://dbpedia.org/ontology/abstract> ; | |
| sh:pattern "^((?!\\(\\;).)*$" ; | |
| ] . | |
| sh:message "Error, found (; in dbo:abstract."@en ; | |
| sh:property [ | |
| sh:path <http://dbpedia.org/ontology/abstract> ; | |
| sh:not [ sh:pattern "\\(;" ] ; | |
| ] . |
🤖 Prompt for AI Agents
In `@dump/src/test/resources/shacl-tests/properties/dbp_abstract.ttl` around lines
15 - 19, Fix the malformed language-tag and replace the unsupported negative
lookahead: remove the space before the language tag in the sh:message so the
"@en" immediately follows the closing quote, and replace the sh:pattern with a
sh:not constraint on the dbpedia abstract path
(<http://dbpedia.org/ontology/abstract>) using a simple pattern that matches the
disallowed sequence instead of a negative lookahead (i.e., use sh:not with a
nested sh:pattern that targets "\(;" rather than the "(?!...)" construct).
| def loadTestGroupsKeys(group: String, path: String, option: String = "yes"): Array[String] = { | ||
| println( | ||
| s"""############## | ||
| | GROUP $group | ||
| |##############""".stripMargin) | ||
|
|
||
| val filePath = classLoader.getResource(path).getFile | ||
| val file = scala.io.Source.fromFile(filePath) | ||
|
|
||
| val table: Array[Array[String]] = file.getLines().map(_.split(",")).toArray | ||
| val columnsNames: Array[String] = table.head | ||
|
|
||
| if (!columnsNames.contains(group)) { | ||
| Array[String]() | ||
| } | ||
| else { | ||
| val indexOfGroup = columnsNames.indexOf(group) | ||
| val groupsKeys: Array[String] = table.tail.flatMap(row => | ||
| if (row(indexOfGroup) == option) Array[String](row(0)) | ||
| else Array[String]()) | ||
| groupsKeys | ||
| } | ||
| } | ||
|
|
||
| def getGroup(testName: String): String = { | ||
| val resourceInputStream = Option(getClass.getClassLoader.getResourceAsStream("properties-from-pom.properties")) | ||
| val properties = new Properties() | ||
| resourceInputStream match { | ||
| case Some(inputStream) => properties.load(inputStream) | ||
| case None => return TestConfig.defaultTestGroup | ||
| } | ||
| val groupOption = Option(properties.getProperty(testName)) | ||
| groupOption match { | ||
| case Some(group) => group | ||
| case None => TestConfig.defaultTestGroup | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Close opened resources to avoid file-handle leaks.
Source.fromFile and the properties input stream aren’t closed, which can leak file descriptors in long test runs. Also, a missing resource currently NPEs without a clear error.
🔧 Proposed fix (close resources + clearer error)
- val filePath = classLoader.getResource(path).getFile
- val file = scala.io.Source.fromFile(filePath)
-
- val table: Array[Array[String]] = file.getLines().map(_.split(",")).toArray
+ val fileUrl = Option(classLoader.getResource(path))
+ .getOrElse(throw new IllegalArgumentException(s"Missing resource: $path"))
+ val file = scala.io.Source.fromFile(fileUrl.getFile)
+ val table: Array[Array[String]] =
+ try file.getLines().map(_.split(",")).toArray
+ finally file.close()
@@
- resourceInputStream match {
- case Some(inputStream) => properties.load(inputStream)
+ resourceInputStream match {
+ case Some(inputStream) =>
+ try properties.load(inputStream)
+ finally inputStream.close()
case None => return TestConfig.defaultTestGroup
}🤖 Prompt for AI Agents
In `@dump/src/test/scala/org/dbpedia/extraction/dump/Utils.scala` around lines 9 -
45, In loadTestGroupsKeys and getGroup ensure opened streams are closed: replace
scala.io.Source.fromFile(filePath) with a resource-safe pattern (e.g.
scala.util.Using) to close the Source after reading and avoid leaking file
handles, and wrap getClass.getClassLoader.getResourceAsStream(...) usage in a
safe/try-with-resources style (or scala.util.Using) so the InputStream is
closed; also handle the missing properties resource by returning
TestConfig.defaultTestGroup or throwing a clear exception instead of letting a
NullPointerException propagate (update loadTestGroupsKeys, getGroup, and any use
of classLoader/getResource/getResourceAsStream accordingly).
| def renameAbstractsDatasetFiles(datasetName: String): Unit = { | ||
| val minidumpDir = new File("./target/minidumptest/base") | ||
| minidumpDir.listFiles().foreach(f => { | ||
| val longAbstractsFile = new File( s"./target/minidumptest/base/${f.getName}/$date/${f.getName}-$date-long-abstracts.ttl.bz2") | ||
| if (longAbstractsFile.exists()) { | ||
| longAbstractsFile.renameTo(new File(s"./target/minidumptest/base/${f.getName}/$date/${f.getName}-$date-long-abstracts-$datasetName.ttl.bz2")) | ||
| } | ||
| val shortAbstractsFile = new File( s"./target/minidumptest/base/${f.getName}/$date/${f.getName}-$date-short-abstracts.ttl.bz2") | ||
| if (shortAbstractsFile.exists()) { | ||
| shortAbstractsFile.renameTo(new File(s"./target/minidumptest/base/${f.getName}/$date/${f.getName}-$date-short-abstracts-$datasetName.ttl.bz2")) | ||
| } | ||
| }) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Guard against missing base dir and failed renames.
listFiles() can return null if the directory doesn’t exist, and renameTo failures are currently silent, which can break downstream tests unpredictably.
🔧 Proposed fix (null-safe listing + rename check)
- val minidumpDir = new File("./target/minidumptest/base")
- minidumpDir.listFiles().foreach(f => {
+ val minidumpDir = new File("./target/minidumptest/base")
+ Option(minidumpDir.listFiles()).getOrElse(Array.empty).foreach(f => {
val longAbstractsFile = new File( s"./target/minidumptest/base/${f.getName}/$date/${f.getName}-$date-long-abstracts.ttl.bz2")
if (longAbstractsFile.exists()) {
- longAbstractsFile.renameTo(new File(s"./target/minidumptest/base/${f.getName}/$date/${f.getName}-$date-long-abstracts-$datasetName.ttl.bz2"))
+ val target = new File(s"./target/minidumptest/base/${f.getName}/$date/${f.getName}-$date-long-abstracts-$datasetName.ttl.bz2")
+ if (!longAbstractsFile.renameTo(target)) {
+ throw new IllegalStateException(s"Failed to rename ${longAbstractsFile.getPath} -> ${target.getPath}")
+ }
}
val shortAbstractsFile = new File( s"./target/minidumptest/base/${f.getName}/$date/${f.getName}-$date-short-abstracts.ttl.bz2")
if (shortAbstractsFile.exists()) {
- shortAbstractsFile.renameTo(new File(s"./target/minidumptest/base/${f.getName}/$date/${f.getName}-$date-short-abstracts-$datasetName.ttl.bz2"))
+ val target = new File(s"./target/minidumptest/base/${f.getName}/$date/${f.getName}-$date-short-abstracts-$datasetName.ttl.bz2")
+ if (!shortAbstractsFile.renameTo(target)) {
+ throw new IllegalStateException(s"Failed to rename ${shortAbstractsFile.getPath} -> ${target.getPath}")
+ }
}
})🤖 Prompt for AI Agents
In `@dump/src/test/scala/org/dbpedia/extraction/dump/Utils.scala` around lines 47
- 58, The method renameAbstractsDatasetFiles currently assumes
minidumpDir.listFiles() is non-null and ignores failures from File.renameTo;
update renameAbstractsDatasetFiles to first check that minidumpDir.exists() and
that listFiles() != null (return or throw a descriptive exception if missing),
then iterate over Option(listFiles()) or a safe array, and after each call to
renameTo(...) check the boolean result and handle failures (log an error via
your test logger or throw an AssertionError) for both longAbstractsFile and
shortAbstractsFile so missing directories and failed renames fail fast and are
visible in tests; reference renameAbstractsDatasetFiles, minidumpDir, listFiles,
and renameTo in your changes.
| **Documentation**: http://dev.dbpedia.org/Extraction <br/> | ||
| **Get in touch with DBpedia**: https://wiki.dbpedia.org/join/get-in-touch <br/> | ||
| **Slack**: join the [**#dev-team**](https://dbpedia.slack.com/archives/C0L9MJFU7) slack channel within the the [DBpedia Slack workspace](https://dbpedia-slack.herokuapp.com/) - the main point for [developement updates](https://github.com/dbpedia/extraction-framework/blob/master/.github/workflows/maven.yml) and discussions <br/> | ||
| **Slack**: join the [**#dev-team**](https://dbpedia.slack.com/archives/C0L9MJFU7) slack channel within the the [DBpedia Slack workspace]( https://join.slack.com/t/dbpedia/shared_invite/zt-nffbn1ra-dRoi8oeWBlolJb_lKifEqA) - the main point for developement updates and discussions <br/> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix typos and casing in the Slack line.
Small wording issues in user-facing docs.
✏️ Proposed fix
-**Slack**: join the [**#dev-team**](https://dbpedia.slack.com/archives/C0L9MJFU7) slack channel within the the [DBpedia Slack workspace]( https://join.slack.com/t/dbpedia/shared_invite/zt-nffbn1ra-dRoi8oeWBlolJb_lKifEqA) - the main point for developement updates and discussions <br/>
+**Slack**: join the [**#dev-team**](https://dbpedia.slack.com/archives/C0L9MJFU7) Slack channel within the [DBpedia Slack workspace](https://join.slack.com/t/dbpedia/shared_invite/zt-nffbn1ra-dRoi8oeWBlolJb_lKifEqA) - the main point for development updates and discussions <br/>📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| **Slack**: join the [**#dev-team**](https://dbpedia.slack.com/archives/C0L9MJFU7) slack channel within the the [DBpedia Slack workspace]( https://join.slack.com/t/dbpedia/shared_invite/zt-nffbn1ra-dRoi8oeWBlolJb_lKifEqA) - the main point for developement updates and discussions <br/> | |
| **Slack**: join the [**#dev-team**](https://dbpedia.slack.com/archives/C0L9MJFU7) Slack channel within the [DBpedia Slack workspace](https://join.slack.com/t/dbpedia/shared_invite/zt-nffbn1ra-dRoi8oeWBlolJb_lKifEqA) - the main point for development updates and discussions <br/> |
🧰 Tools
🪛 LanguageTool
[uncategorized] ~7-~7: Did you mean the communication tool “Slack” (= proper noun, capitalized)?
Context: ...//dbpedia.slack.com/archives/C0L9MJFU7) slack channel within the the [DBpedia Slack w...
(ON_SKYPE)
[grammar] ~7-~7: Ensure spelling is correct
Context: ...oeWBlolJb_lKifEqA) - the main point for developement updates and discussions
## Conte...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
🤖 Prompt for AI Agents
In `@README.md` at line 7, Fix the Slack line in the README: remove the duplicate
"the", correct "developement" to "development", fix casing/punctuation for
clarity, and remove the stray space inside the Slack invite URL so the link
works; keep the channel mention "#dev-team" and the DBpedia Slack workspace link
but rewrite the sentence to read smoothly (e.g., "Slack: join the `#dev-team`
channel in the DBpedia Slack workspace — the main point for development updates
and discussions").
| * In addition, [Travis-CI](https://www.travis-ci.com/about-us/) will test if the merged version passes the build. | ||
| * If there are further changes you need to make, because Travis said the build fails or because somebody caught something you overlooked, go back to item 4. Stay on the same branch (if it is still related to the same issue). GitHub will add the new commits to the same pull request. | ||
| * When everything is fine, your changes will be merged into `extraction-framework/dev`, finally the `dev` together with your improvements will be merged with the `master` branch. | ||
|
|
||
| Please keep in mind: | ||
| - Try *not* to modify the indentation. If you want to re-format, use a separate "formatting" commit in which no functionality changes are made. | ||
| - **Never** rebase the master onto a development branch (i.e. _never_ call `rebase` from `extraction-framework/master`). Only rebase your branch onto the dev branch, *if and only if* nobody already pulled from the development branch! | ||
| - If you already pushed a branch to GitHub, later rebased the master onto this branch and then tried to push again, GitHub won't let you saying _"To prevent you from losing history, non-fast-forward updates were rejected"_. If _(and only if)_ you are sure that nobody already pulled from this branch, add `--force` to the push command. | ||
| [_"Don’t rebase branches you have shared with another developer."_](http://www.jarrodspillers.com/2009/08/19/git-merge-vs-git-rebase-avoiding-rebase-hell/) | ||
| [_"Rebase is awesome, I use rebase exclusively for everything local. Never for anything that I've already pushed."_](http://jeffkreeftmeijer.com/2010/the-magical-and-not-harmful-rebase/#comment-87479247) | ||
| [_"Never ever rebase a branch that you pushed, or that you pulled from another person_"](http://blog.experimentalworks.net/2009/03/merge-vs-rebase-a-deep-dive-into-the-mysteries-of-revision-control/) | ||
| - _"[Don’t rebase branches you have shared with another developer.](http://www.jarrodspillers.com/2009/08/19/git-merge-vs-git-rebase-avoiding-rebase-hell/)"_ | ||
| - _"[Rebase is awesome, I use rebase exclusively for everything local. Never for anything that I've already pushed.](http://jeffkreeftmeijer.com/2010/the-magical-and-not-harmful-rebase/#comment-87479247)"_ | ||
| - _"[Never ever rebase a branch that you pushed, or that you pulled from another person](https://web.archive.org/web/20150622064245/http://blog.experimentalworks.net/2009/03/merge-vs-rebase-a-deep-dive-into-the-mysteries-of-revision-control/)"_ | ||
| - In general, we prefer Scala over Java. | ||
|
|
||
| More tips: | ||
| - Guides to setup your development environment for [Intellij](Setting up IntelliJ IDEA) or [Eclipse](Setting up eclipse). | ||
| - Get help with the [Maven build](Build-from-Source-with-Maven) or another form of [installation](Installation). | ||
| - [Download](Downloads) some data to work with. | ||
| - How to run [from Scala/Java](Run-from-Java-or-Scala) or [from a JAR](Run-from-a-JAR). | ||
| - Having different troubles? Check the [troubleshooting page](Troubleshooting) or post on https://forum.dbpedia.org. | ||
| - Get help with the [Maven build](https://maven.apache.org/guides/introduction/introduction-to-the-lifecycle.html) or another form of [installation](https://maven.apache.org/install.html). | ||
| - [Download](https://dumps.wikimedia.org/) some data to work with. | ||
| - How to run [from Scala/Java](https://docs.scala-lang.org/tutorials/scala-with-maven.html) or [from a JAR](https://docs.oracle.com/javase/tutorial/deployment/jar/run.html). | ||
| - Having different troubles? Check the [troubleshooting page](https://maven.apache.org/users/getting-help.html) or post on https://forum.dbpedia.org. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Resolve markdownlint warnings in the guidelines section.
Nested list indentation, emphasis style, and bare URL can be made lint-friendly.
✏️ Proposed fix
- * In addition, [Travis-CI](https://www.travis-ci.com/about-us/) will test if the merged version passes the build.
- * If there are further changes you need to make, because Travis said the build fails or because somebody caught something you overlooked, go back to item 4. Stay on the same branch (if it is still related to the same issue). GitHub will add the new commits to the same pull request.
- * When everything is fine, your changes will be merged into `extraction-framework/dev`, finally the `dev` together with your improvements will be merged with the `master` branch.
+ * In addition, [Travis-CI](https://www.travis-ci.com/about-us/) will test if the merged version passes the build.
+ * If there are further changes you need to make, because Travis said the build fails or because somebody caught something you overlooked, go back to item 4. Stay on the same branch (if it is still related to the same issue). GitHub will add the new commits to the same pull request.
+ * When everything is fine, your changes will be merged into `extraction-framework/dev`, finally the `dev` together with your improvements will be merged with the `master` branch.
@@
-- Try *not* to modify the indentation. If you want to re-format, use a separate "formatting" commit in which no functionality changes are made.
-- **Never** rebase the master onto a development branch (i.e. _never_ call `rebase` from `extraction-framework/master`). Only rebase your branch onto the dev branch, *if and only if* nobody already pulled from the development branch!
+- Try _not_ to modify the indentation. If you want to re-format, use a separate "formatting" commit in which no functionality changes are made.
+- __Never__ rebase the master onto a development branch (i.e. _never_ call `rebase` from `extraction-framework/master`). Only rebase your branch onto the dev branch, _if and only if_ nobody already pulled from the development branch!
@@
-- Having different troubles? Check the [troubleshooting page](https://maven.apache.org/users/getting-help.html) or post on https://forum.dbpedia.org.
+- Having different troubles? Check the [troubleshooting page](https://maven.apache.org/users/getting-help.html) or post on [https://forum.dbpedia.org](https://forum.dbpedia.org).📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| * In addition, [Travis-CI](https://www.travis-ci.com/about-us/) will test if the merged version passes the build. | |
| * If there are further changes you need to make, because Travis said the build fails or because somebody caught something you overlooked, go back to item 4. Stay on the same branch (if it is still related to the same issue). GitHub will add the new commits to the same pull request. | |
| * When everything is fine, your changes will be merged into `extraction-framework/dev`, finally the `dev` together with your improvements will be merged with the `master` branch. | |
| Please keep in mind: | |
| - Try *not* to modify the indentation. If you want to re-format, use a separate "formatting" commit in which no functionality changes are made. | |
| - **Never** rebase the master onto a development branch (i.e. _never_ call `rebase` from `extraction-framework/master`). Only rebase your branch onto the dev branch, *if and only if* nobody already pulled from the development branch! | |
| - If you already pushed a branch to GitHub, later rebased the master onto this branch and then tried to push again, GitHub won't let you saying _"To prevent you from losing history, non-fast-forward updates were rejected"_. If _(and only if)_ you are sure that nobody already pulled from this branch, add `--force` to the push command. | |
| [_"Don’t rebase branches you have shared with another developer."_](http://www.jarrodspillers.com/2009/08/19/git-merge-vs-git-rebase-avoiding-rebase-hell/) | |
| [_"Rebase is awesome, I use rebase exclusively for everything local. Never for anything that I've already pushed."_](http://jeffkreeftmeijer.com/2010/the-magical-and-not-harmful-rebase/#comment-87479247) | |
| [_"Never ever rebase a branch that you pushed, or that you pulled from another person_"](http://blog.experimentalworks.net/2009/03/merge-vs-rebase-a-deep-dive-into-the-mysteries-of-revision-control/) | |
| - _"[Don’t rebase branches you have shared with another developer.](http://www.jarrodspillers.com/2009/08/19/git-merge-vs-git-rebase-avoiding-rebase-hell/)"_ | |
| - _"[Rebase is awesome, I use rebase exclusively for everything local. Never for anything that I've already pushed.](http://jeffkreeftmeijer.com/2010/the-magical-and-not-harmful-rebase/#comment-87479247)"_ | |
| - _"[Never ever rebase a branch that you pushed, or that you pulled from another person](https://web.archive.org/web/20150622064245/http://blog.experimentalworks.net/2009/03/merge-vs-rebase-a-deep-dive-into-the-mysteries-of-revision-control/)"_ | |
| - In general, we prefer Scala over Java. | |
| More tips: | |
| - Guides to setup your development environment for [Intellij](Setting up IntelliJ IDEA) or [Eclipse](Setting up eclipse). | |
| - Get help with the [Maven build](Build-from-Source-with-Maven) or another form of [installation](Installation). | |
| - [Download](Downloads) some data to work with. | |
| - How to run [from Scala/Java](Run-from-Java-or-Scala) or [from a JAR](Run-from-a-JAR). | |
| - Having different troubles? Check the [troubleshooting page](Troubleshooting) or post on https://forum.dbpedia.org. | |
| - Get help with the [Maven build](https://maven.apache.org/guides/introduction/introduction-to-the-lifecycle.html) or another form of [installation](https://maven.apache.org/install.html). | |
| - [Download](https://dumps.wikimedia.org/) some data to work with. | |
| - How to run [from Scala/Java](https://docs.scala-lang.org/tutorials/scala-with-maven.html) or [from a JAR](https://docs.oracle.com/javase/tutorial/deployment/jar/run.html). | |
| - Having different troubles? Check the [troubleshooting page](https://maven.apache.org/users/getting-help.html) or post on https://forum.dbpedia.org. | |
| * In addition, [Travis-CI](https://www.travis-ci.com/about-us/) will test if the merged version passes the build. | |
| * If there are further changes you need to make, because Travis said the build fails or because somebody caught something you overlooked, go back to item 4. Stay on the same branch (if it is still related to the same issue). GitHub will add the new commits to the same pull request. | |
| * When everything is fine, your changes will be merged into `extraction-framework/dev`, finally the `dev` together with your improvements will be merged with the `master` branch. | |
| Please keep in mind: | |
| - Try _not_ to modify the indentation. If you want to re-format, use a separate "formatting" commit in which no functionality changes are made. | |
| - __Never__ rebase the master onto a development branch (i.e. _never_ call `rebase` from `extraction-framework/master`). Only rebase your branch onto the dev branch, _if and only if_ nobody already pulled from the development branch! | |
| - If you already pushed a branch to GitHub, later rebased the master onto this branch and then tried to push again, GitHub won't let you saying _"To prevent you from losing history, non-fast-forward updates were rejected"_. If _(and only if)_ you are sure that nobody already pulled from this branch, add `--force` to the push command. | |
| - _"[Don't rebase branches you have shared with another developer.](http://www.jarrodspillers.com/2009/08/19/git-merge-vs-git-rebase-avoiding-rebase-hell/)"_ | |
| - _"[Rebase is awesome, I use rebase exclusively for everything local. Never for anything that I've already pushed.](http://jeffkreeftmeijer.com/2010/the-magical-and-not-harmful-rebase/#comment-87479247)"_ | |
| - _"[Never ever rebase a branch that you pushed, or that you pulled from another person](https://web.archive.org/web/20150622064245/http://blog.experimentalworks.net/2009/03/merge-vs-rebase-a-deep-dive-into-the-mysteries-of-revision-control/)"_ | |
| - In general, we prefer Scala over Java. | |
| More tips: | |
| - Guides to setup your development environment for [Intellij](Setting up IntelliJ IDEA) or [Eclipse](Setting up eclipse). | |
| - Get help with the [Maven build](https://maven.apache.org/guides/introduction/introduction-to-the-lifecycle.html) or another form of [installation](https://maven.apache.org/install.html). | |
| - [Download](https://dumps.wikimedia.org/) some data to work with. | |
| - How to run [from Scala/Java](https://docs.scala-lang.org/tutorials/scala-with-maven.html) or [from a JAR](https://docs.oracle.com/javase/tutorial/deployment/jar/run.html). | |
| - Having different troubles? Check the [troubleshooting page](https://maven.apache.org/users/getting-help.html) or post on [https://forum.dbpedia.org](https://forum.dbpedia.org). |
🧰 Tools
🪛 LanguageTool
[style] ~108-~108: Consider an alternative to strengthen your wording.
Context: ...sion passes the build. * If there are further changes you need to make, because Travis said t...
(CHANGES_ADJUSTMENTS)
[style] ~116-~116: Consider using a more formal and expressive alternative to ‘awesome’.
Context: ...oiding-rebase-hell/)"_ - _"[Rebase is awesome, I use rebase exclusively for everythin...
(AWESOME)
🪛 markdownlint-cli2 (0.18.1)
107-107: Unordered list indentation
Expected: 0; Actual: 2
(MD007, ul-indent)
108-108: Unordered list indentation
Expected: 0; Actual: 2
(MD007, ul-indent)
109-109: Unordered list indentation
Expected: 0; Actual: 2
(MD007, ul-indent)
112-112: Emphasis style
Expected: underscore; Actual: asterisk
(MD049, emphasis-style)
112-112: Emphasis style
Expected: underscore; Actual: asterisk
(MD049, emphasis-style)
113-113: Emphasis style
Expected: underscore; Actual: asterisk
(MD049, emphasis-style)
113-113: Emphasis style
Expected: underscore; Actual: asterisk
(MD049, emphasis-style)
125-125: Bare URL used
(MD034, no-bare-urls)
🤖 Prompt for AI Agents
In `@README.md` around lines 107 - 125, Fix markdownlint warnings in the
guidelines section by correcting nested list indentation (indent the sub-bullets
under "If you already pushed a branch..." and the quoted links block to align as
nested list items), normalize emphasis style (replace mixed *italic* with either
underscores or consistent asterisks throughout occurrences like "Try *not* to
modify the indentation." and the _italic_ examples), and replace bare URLs with
proper Markdown links (turn the bare https://forum.dbpedia.org and any other
bare URLs into [forum.dbpedia.org](https://forum.dbpedia.org) or descriptive
link text). Locate and update the paragraph and bulleted area containing the
lines starting "If you already pushed a branch..." and the "Please keep in
mind:" block to apply these changes.
| private def countTemplates(all: Boolean, count: MappingStats => Int): Int = { | ||
| var sum = 0 | ||
| for (ms <- mappedStatistics) { | ||
| if (all || ms.isMapped) { | ||
| if (!ignoreList.isTemplateIgnored(ms.templateName)) { | ||
| sum += count(ms) | ||
| } | ||
| } | ||
| sum | ||
| } | ||
| sum | ||
| } | ||
|
|
||
| private def countAllTemplates(count: MappingStats => Int): Int = countTemplates(true, count) | ||
| private def countMappedTemplates(count: MappingStats => Int): Int = countTemplates(false, count) | ||
|
|
||
| val templateCount = countAllTemplates(_ => 1) | ||
| val mappedTemplateCount = countMappedTemplates(_ => 1) | ||
|
|
||
| val templateUseCount = countAllTemplates(_.templateCount) | ||
| val mappedTemplateUseCount = countMappedTemplates(_.templateCount) | ||
|
|
||
| val propertyCount = countAllTemplates(_.propertyCount) | ||
| val mappedPropertyCount = countMappedTemplates(_.mappedPropertyCount) | ||
|
|
||
| private def countAllTemplates(count: MappingStats => Int): Int = countTemplates(true, count) | ||
| private def countMappedTemplates(count: MappingStats => Int): Int = countTemplates(false, count) | ||
|
|
||
| val templateCount = countAllTemplates(_ => 1) | ||
| val mappedTemplateCount = countMappedTemplates(_ => 1) | ||
|
|
||
| val templateUseCount = countAllTemplates(_.templateCount) | ||
| val mappedTemplateUseCount = countMappedTemplates(_.templateCount) | ||
|
|
||
| val propertyCount = countAllTemplates(_.propertyCount) | ||
| val mappedPropertyCount = countMappedTemplates(_.mappedPropertyCount) | ||
|
|
||
| val propertyUseCount = countAllTemplates(_.propertyUseCount) | ||
| val mappedPropertyUseCount = countMappedTemplates(_.mappedPropertyUseCount) | ||
|
|
||
| val mappedTemplateRatio = mappedTemplateCount.toDouble / templateCount.toDouble | ||
| val mappedPropertyRatio = mappedPropertyCount.toDouble / propertyCount.toDouble | ||
|
|
||
| val mappedTemplateUseRatio = mappedTemplateUseCount.toDouble / templateUseCount.toDouble | ||
| val mappedPropertyUseRatio = mappedPropertyUseCount.toDouble / propertyUseCount.toDouble | ||
| val propertyUseCount = countAllTemplates(_.propertyUseCount) | ||
| val mappedPropertyUseCount = countMappedTemplates(_.mappedPropertyUseCount) | ||
|
|
||
| val mappedTemplateRatio = mappedTemplateCount.toDouble / templateCount.toDouble | ||
| val mappedPropertyRatio = mappedPropertyCount.toDouble / propertyCount.toDouble | ||
|
|
||
| val mappedTemplateUseRatio = mappedTemplateUseCount.toDouble / templateUseCount.toDouble | ||
| val mappedPropertyUseRatio = mappedPropertyUseCount.toDouble / propertyUseCount.toDouble |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
🏁 Script executed:
# First, check if the file exists and examine the code at the specified lines
cat -n server/src/main/scala/org/dbpedia/extraction/server/stats/MappingStatsHolder.scala | sed -n '70,125p'Repository: dbpedia/extraction-framework
Length of output: 2766
🏁 Script executed:
# Search for usages of the ratio vals in the codebase
rg "mappedTemplateRatio|mappedPropertyRatio|mappedTemplateUseRatio|mappedPropertyUseRatio" --type scalaRepository: dbpedia/extraction-framework
Length of output: 4317
🏁 Script executed:
# Check MappingStats.scala for similar patterns
cat -n server/src/main/scala/org/dbpedia/extraction/server/stats/MappingStats.scala | head -100Repository: dbpedia/extraction-framework
Length of output: 2221
🏁 Script executed:
# Look for test files related to MappingStats or MappingStatsHolder
find . -type f -name "*test*" -o -name "*Test*" -o -name "*Spec*" | grep -i "mapping" | head -20Repository: dbpedia/extraction-framework
Length of output: 1792
🏁 Script executed:
# Search for tests in the server/stats directory
find server -type f \( -name "*Test*.scala" -o -name "*Spec*.scala" \) | head -20Repository: dbpedia/extraction-framework
Length of output: 252
🏁 Script executed:
# Check the IgnoreListTest to understand testing patterns for stats
cat server/src/test/scala/org/dbpedia/extraction/server/stats/IgnoreListTest.scalaRepository: dbpedia/extraction-framework
Length of output: 902
🏁 Script executed:
# Search for NaN, Infinity, or zero-denominator handling patterns in stats code
rg "NaN|Infinity|/ 0|denominator|division" server/src/main/scala/org/dbpedia/extraction/server/stats/ --type scalaRepository: dbpedia/extraction-framework
Length of output: 54
🏁 Script executed:
# Check the MappingStatsManager or MappingStatsHolder to see how mappedStatistics is populated
rg "new MappingStatsHolder|MappingStatsHolder(" server/src/main/scala/ --type scala -B 3 -A 3Repository: dbpedia/extraction-framework
Length of output: 206
🏁 Script executed:
# Check the MappingStatsManager or MappingStatsHolder to see how mappedStatistics is populated
rg "MappingStatsHolder" server/src/main/scala/ --type scala -B 2 -A 2Repository: dbpedia/extraction-framework
Length of output: 4731
🏁 Script executed:
# Look at the MappingStatsHolder.apply method to see how statistics are created
sed -n '1,75p' server/src/main/scala/org/dbpedia/extraction/server/stats/MappingStatsHolder.scalaRepository: dbpedia/extraction-framework
Length of output: 2957
🏁 Script executed:
# Check if there are any comments or handling around empty statistics
rg "empty|null|zero" server/src/main/scala/org/dbpedia/extraction/server/stats/MappingStatsHolder.scala --type scala -B 2 -A 2Repository: dbpedia/extraction-framework
Length of output: 402
🏁 Script executed:
# Verify the exact lines in MappingStats.scala with the ratio calculations
sed -n '44,53p' server/src/main/scala/org/dbpedia/extraction/server/stats/MappingStats.scalaRepository: dbpedia/extraction-framework
Length of output: 502
Guard ratio calculations in both MappingStatsHolder and MappingStats against zero denominators.
Division by zero in ratio calculations will produce NaN/Infinity when denominators are zero (e.g., empty mappedStatistics, all templates filtered by ignoreList, or no mapped items). These values are then formatted as strings in UI output ("NaN"/"Infinity" instead of percentages). Implement a safe ratio helper across both files.
Note: The same issue exists in MappingStats.scala lines 51–52 (mappedPropertyRatio and mappedPropertyUseRatio), which also requires the same fix.
🔧 Suggested fix
MappingStatsHolder.scala:
+ private def safeRatio(n: Int, d: Int): Double = if (d == 0) 0.0 else n.toDouble / d.toDouble
@@
- val mappedTemplateRatio = mappedTemplateCount.toDouble / templateCount.toDouble
- val mappedPropertyRatio = mappedPropertyCount.toDouble / propertyCount.toDouble
+ val mappedTemplateRatio = safeRatio(mappedTemplateCount, templateCount)
+ val mappedPropertyRatio = safeRatio(mappedPropertyCount, propertyCount)
@@
- val mappedTemplateUseRatio = mappedTemplateUseCount.toDouble / templateUseCount.toDouble
- val mappedPropertyUseRatio = mappedPropertyUseCount.toDouble / propertyUseCount.toDouble
+ val mappedTemplateUseRatio = safeRatio(mappedTemplateUseCount, templateUseCount)
+ val mappedPropertyUseRatio = safeRatio(mappedPropertyUseCount, propertyUseCount)MappingStats.scala:
+ private def safeRatio(n: Int, d: Int): Double = if (d == 0) 0.0 else n.toDouble / d.toDouble
@@
- val mappedPropertyRatio = mappedPropertyCount.toDouble / propertyCount.toDouble
- val mappedPropertyUseRatio = mappedPropertyUseCount.toDouble / propertyUseCount.toDouble
+ val mappedPropertyRatio = safeRatio(mappedPropertyCount, propertyCount)
+ val mappedPropertyUseRatio = safeRatio(mappedPropertyUseCount, propertyUseCount)🤖 Prompt for AI Agents
In
`@server/src/main/scala/org/dbpedia/extraction/server/stats/MappingStatsHolder.scala`
around lines 82 - 113, Add a small safe-ratio helper (e.g., def safeRatio(num:
Double, den: Double): Double = if (den == 0) 0.0 else num / den) and use it in
MappingStatsHolder to compute mappedTemplateRatio, mappedPropertyRatio,
mappedTemplateUseRatio and mappedPropertyUseRatio instead of directly dividing
(replace mappedTemplateCount.toDouble / templateCount.toDouble, etc.). Apply the
same helper or equivalent zero-check in MappingStats for mappedPropertyRatio and
mappedPropertyUseRatio to guard against zero denominators and avoid NaN/Infinity
in UI output. Ensure you reference the existing identifiers: MappingStatsHolder,
countTemplates, mappedTemplateRatio, mappedPropertyRatio,
mappedTemplateUseRatio, mappedPropertyUseRatio, and in MappingStats update
mappedPropertyRatio and mappedPropertyUseRatio.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 11
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
dump/src/main/scala/org/dbpedia/validation/construct/model/validators/generic/GenericLiteralLangTagValidator.scala (1)
14-14: Typo: "LTIERAL" should be "LITERAL".The validator IRI contains a typo:
#GENERIC_LTIERAL_LANG_TAG_VALIDATORshould be#GENERIC_LITERAL_LANG_TAG_VALIDATOR.🐛 Proposed fix
- override val iri: ValidatorIRI = "#GENERIC_LTIERAL_LANG_TAG_VALIDATOR" + override val iri: ValidatorIRI = "#GENERIC_LITERAL_LANG_TAG_VALIDATOR"dump/src/main/scala/org/dbpedia/validation/construct/tests/suites/NTripleTestSuite.scala (1)
104-109: UnusedconstructLabelparameter.The
constructLabelparameter is added to the method signature but is never used within the method body. This appears to be dead code.#!/bin/bash # Verify if constructLabel is used elsewhere in the codebase rg -n "constructLabel" --type scaladump/src/main/scala/org/dbpedia/validation/construct/tests/generators/NTripleTestGenerator.scala (1)
282-290: InconsistentvalidatorGrouphandling forNotContainsValidator.The
NotContainsValidatorinstantiation here does not passvalidatorGroup, unlike lines 337-341 where the same validator type receives it. This inconsistency means IRI-basedNotContainsValidatorinstances will always useValidatorGroup.DEFAULT, potentially ignoring the configured group.🐛 Proposed fix
if (validatorQuerySolution.contains("doesNotContains")) { validatorQuerySolution.getLiteral("doesNotContains").getLexicalForm.split(delim).foreach(charSeq => { - - validatorCollection.append(NotContainsValidator(currentValidatorID, validatorIRI, charSeq)) + val validatorGroup = getValidatorGroup(validatorQuerySolution) + validatorCollection.append(NotContainsValidator(currentValidatorID, validatorIRI, charSeq, validatorGroup)) groupedValidators.append(currentValidatorID) currentValidatorID += 1 }) }
🤖 Fix all issues with AI agents
In
`@core/src/main/scala/org/dbpedia/extraction/mappings/PlainAbstractExtractor.scala`:
- Around line 251-255: In PlainAbstractExtractor.scala update the
patternsToRemove entry that removes <normalized> blocks to use non-greedy
matching so it doesn't span multiple tags; specifically change the regex in the
patternsToRemove list (the entry currently """<normalized>.*<\/normalized>""".r
-> "") to a non-greedy form (e.g. use .*? inside the same triple-quoted regex)
so each <normalized>...</normalized> is removed independently.
In
`@dump/src/main/scala/org/dbpedia/validation/construct/model/validators/TypedLiteralValidator.scala`:
- Around line 13-27: The current lexical extraction in run (using
VALIDATOR_GROUP and nTriplePart.left/right/self) is fragile because
split("\"").dropRight(1).drop(1).mkString("") fails on escaped quotes; replace
that inline splitting with a robust extractor: implement a function
extractLexicalForm(value: String) that correctly parses N-Triples quoted
literals (handling escaped quotes, language tags and datatypes) using a regex or
an N-Triples literal parser, then call extractLexicalForm(value) in each branch
(the branches of ValidatorGroup.RIGHT, ValidatorGroup.LEFT and the default)
instead of the split-based logic and return false when Option is None as before.
In `@dump/src/test/bash/minidump-overview.md`:
- Line 48: Replace the bare URL
"https://en.wikipedia.org/wiki/Marian_Breland_Bailey" with a proper Markdown
link to satisfy MD034: either wrap it as an autolink (<https://...>) or use link
text such as the page title (e.g., "Marian Breland Bailey") pointing to that URL
so the line no longer contains a bare URL.
In `@dump/src/test/resources/ci-tests/dbpedia-specific-ci-tests.ttl`:
- Around line 59-63: The trigger pattern for trigger:abstract_property is
missing regex anchors and should be changed to an exact-match pattern by adding
^ at the start and $ at the end of the trigger:pattern value; locate the
trigger:abstract_property block and update its trigger:pattern string to include
the ^ and $ anchors so it matches only the full IRI
"http://dbpedia.org/ontology/abstract".
In `@dump/src/test/resources/shaclTestsCoverageTable.md`:
- Around line 29-33: The new table rows in shaclTestsCoverageTable.md (e.g.,
entries beginning with [http://el.dbpedia.org/resource/Βερολίνο] and the various
[http://.../resource/%3F_(film)] lines) only have two columns and end with a
trailing pipe, breaking the 4-column table format; update each such row to
include two additional empty cells for the “issue” and “comment” columns (add "
| | " after the existing cells) and remove any trailing pipe at the end of the
line so every row has exactly four pipe-separated columns.
In
`@dump/src/test/scala/org/dbpedia/extraction/dump/ConstructValidationTest.scala`:
- Around line 37-51: The current selectValues is built as a single String
(mkString) so selectValues.contains(...) does substring checks; change
selectValues to a Set[String] of exact URIs and compare membership against
rdfSubject.getURI. Specifically, in the block using Utils.loadTestGroupsKeys and
selectValues, build selectValues as groupKeys.map(x =>
s"https://github.com/dbpedia/extraction-framework$x").toSet (no angle brackets
or trailing space) and then use if (rdfSubject.isURIResource &&
selectValues.contains(rdfSubject.getURI) && rdfObject.toString ==
testGeneratorURI) to perform exact set membership; keep the rest of the
iterator/remove logic the same.
In `@dump/src/test/scala/org/dbpedia/extraction/dump/Utils.scala`:
- Around line 47-59: The method renameAbstractsDatasetFiles can NPE because
minidumpDir.listFiles() may return null; update renameAbstractsDatasetFiles to
guard against that by first checking that minidumpDir.exists() &&
minidumpDir.isDirectory() or by wrapping listFiles() in Option(...) and using
getOrElse(Array.empty) before iterating; keep the rest of the logic unchanged
and operate only when files are present so longAbstractsFile/shortAbstractsFile
checks remain valid.
- Around line 33-45: In getGroup, the InputStream returned by
getResourceAsStream is not closed causing a resource leak; update getGroup to
load the Properties inside a safe-closure (e.g., scala.util.Using or a
try/finally) that ensures inputStream.close() is always called after
properties.load(inputStream), and keep the existing logic returning
TestConfig.defaultTestGroup on missing resource or missing property; reference
the getGroup method and the local variable resourceInputStream/inputStream when
implementing the safe-close.
- Around line 15-18: The code opens a scala.io.Source (variable file) and never
closes it, risking a resource leak; update the Utils code that computes table
(using variables filePath and file) to ensure the Source is closed after reading
— either wrap the file usage in a try-finally or use a managed resource
construct (e.g., Using.from or scala.util.Using) to read
file.getLines().map(_.split(",")).toArray and close the Source in the
finally/Using block so the file handle is always released.
In `@README.md`:
- Line 7: Fix the typos in the Slack sentence that starts "**Slack**: join the
[**#dev-team**]...": remove the duplicated "the", correct "developement" to
"development", ensure "Slack workspace" capitalization is consistent, and remove
the extra space before the URL so the sentence reads cleanly and without double
words or misspellings.
In
`@server/src/main/scala/org/dbpedia/extraction/server/stats/MappingStatsHolder.scala`:
- Around line 78-113: The ratio fields in MappingStatsHolder
(mappedTemplateRatio, mappedPropertyRatio, mappedTemplateUseRatio,
mappedPropertyUseRatio) perform raw division and can produce NaN/Infinity when
denominators (templateCount, propertyCount, templateUseCount, propertyUseCount)
are zero; add a small helper method (e.g., safeRatio(numerator: Double,
denominator: Double): Double) that returns 0.0 when denominator == 0.0 and
otherwise returns numerator/denominator, then replace the four direct divisions
with safeRatio calls to ensure safe results.
🧹 Nitpick comments (15)
.github/ISSUE_TEMPLATE/data.md (3)
10-12: Major inconsistency: This file doesn't relate to the PR objectives.The PR objectives state this PR fixes issue
#804for Macedonian template namespace handling inMappingStatsHolder.scala, and the AI summary describes changes to AbstractExtractor, validation framework, and broken bracket removal. However, this file is just a GitHub issue template with documentation updates—it contains no Scala code, template namespace handling, or extractor changes.This suggests either the wrong file was included in the review, or this PR contains many unrelated changes beyond the stated objectives.
Optional: Wrap bare URLs in angle brackets.
The three URLs in this section are flagged by markdownlint (MD034). Consider wrapping them in angle brackets for better markdown compliance:
<https://www.dbpedia.org/blog/snapshot-2021-06-release/#anchor1><http://dbpedia.org/sparql><http://dief.tools.dbpedia.org/server/extraction/en/>
17-21: Optional: Wrap bare URLs in angle brackets.Line 19 contains two bare URLs flagged by markdownlint (MD034). Consider wrapping them:
<http://dbpedia.org/sparql><http://dbpedia.org/resource/Berlin>
25-35: Optional: Address markdown formatting issues.Static analysis flags a few minor markdown formatting issues:
- Line 25: Blank line inside blockquote (MD028) - consider removing the blank line or closing the blockquote before it
- Lines 27, 31, 35: Fenced code blocks should specify a language (MD040) - consider adding language identifiers like
turtle,rdf, ortextto the code fence openersThese are pedantic formatting improvements and don't affect functionality.
core/src/main/scala/org/dbpedia/extraction/mappings/TemplateMapping.scala (5)
9-9: Unused import.The
Loggerimport is added but never used in this file.🧹 Proposed fix
-import java.util.logging.Logger
86-90: Simplify boolean assignment and consider internationalization.The if-else block can be simplified to a direct boolean expression. Additionally, hardcoding "Infobox" may not work correctly for non-English Wikipedia editions where infobox templates have different naming conventions.
♻️ Proposed simplification
- val isInfobox = if (node.title.decoded.contains("Infobox")) { - true - } else { - false - } + val isInfobox = node.title.decoded.contains("Infobox")Please verify whether non-English Wikipedia editions use different infobox naming conventions that this check might miss.
92-109: Usevalinstead ofvarand consider more robust name comparison.Several variables are declared as
varbut never reassigned. Additionally, the name comparison at line 106 doesn't account for case sensitivity or URL encoding differences.♻️ Proposed refactor
var condition4_same_entity_infoBox = true if(isInfobox) { //getting name property from infobox val allNames = node.children.filter(p => p.key == "name") - var name = subjectUri; - if(allNames.size > 0) - name = allNames(0).propertyNodeValueToPlainText + val name = allNames.headOption + .map(_.propertyNodeValueToPlainText) + .getOrElse(subjectUri) //getting subject of wikipedia page - var splittedURI = subjectUri.split("/") - var pageTitle = splittedURI(splittedURI.size - 1) + val splittedURI = subjectUri.split("/") + val pageTitle = splittedURI.last - if(!name.contains(pageTitle) && !pageTitle.contains(name)) + if(!name.toLowerCase.contains(pageTitle.toLowerCase) && !pageTitle.toLowerCase.contains(name.toLowerCase)) condition4_same_entity_infoBox = false }
199-213: Use idiomaticfindinstead of loop with mutable variable.The loop pattern with a mutable
namePropertyvariable can be replaced with Scala'sfindmethod for cleaner, more idiomatic code.♻️ Proposed refactor
- //Try to find a property which contains 'name' - var nameProperty : PropertyNode = null - for(property <- properties if nameProperty == null) - { - if(property.key.toLowerCase.contains("name")) - { - nameProperty = property - } - } - - //If no name property has been found -> Use the first property of the template - if(nameProperty == null) - { - nameProperty = properties.head - } + //Try to find a property which contains 'name', or use the first property + val nameProperty = properties + .find(_.key.toLowerCase.contains("name")) + .getOrElse(properties.head)
31-34: Incomplete comment or missing implementation.The comment mentions "when extractor has a pre-phase" but there's no corresponding pre-phase implementation visible. Consider either completing the documentation or removing this placeholder comment.
dump/src/test/resources/ci-tests/dbpedia-specific-ci-tests.ttl (1)
293-299: Consider extracting the duplicated language list pattern.The language code pattern is duplicated from
trigger:generic_wikipedia_dbpedia_extraction(line 37). While functional, this creates a maintenance burden if languages are added/removed. Consider whether this could reference the trigger pattern or be extracted to a shared constant.core/src/main/scala/org/dbpedia/extraction/config/Config.scala (1)
280-281: Consider documenting the new config keys in sample configs/docs.
remove-broken-brackets-plain-abstractsandremove-broken-brackets-html-abstractsare new knobs; adding them to config templates (e.g., universal/sample properties) will make them discoverable.Also applies to: 297-299
core/src/main/scala/org/dbpedia/extraction/mappings/MissingAbstractsExtractor.scala (1)
57-57: Use the extractor’s own class for the logger category.Logging under
PlainAbstractExtractormakes MissingAbstractsExtractor messages harder to trace.🔧 Suggested tweak
- private val logger = Logger.getLogger(classOf[PlainAbstractExtractor].getName) + private val logger = Logger.getLogger(classOf[MissingAbstractsExtractor].getName)dump/src/test/scala/org/dbpedia/extraction/dump/ConstructValidationTest.scala (1)
3-21: Remove duplicate imports.Several imports are duplicated:
StmtIteratoris imported at lines 6 and 9RDFNodeis imported at lines 9 and 19- Line 3-5 imports (
graph,model,StatementImpl) appear unused♻️ Suggested cleanup
package org.dbpedia.extraction.dump -import org.apache.jena.graph -import org.apache.jena.rdf.model -import org.apache.jena.rdf.model.impl.StatementImpl -import org.apache.jena.rdf.model.StmtIterator - -import java.io.{File, FileInputStream, FileOutputStream} -import org.apache.jena.rdf.model.{Alt, Bag, Literal, Model, ModelFactory, Property, RDFNode, RSIterator, ReifiedStatement, Resource, ResourceF, Statement, StmtIterator} +import java.io.{File, FileInputStream, FileOutputStream} +import org.apache.jena.rdf.model.{Model, ModelFactory} import org.apache.jena.riot.{RDFDataMgr, RDFLanguages} import org.apache.spark.sql.SQLContext import org.dbpedia.extraction.dump.TestConfig.{XSDCITestFile, ciTestFile, ciTestModel, classLoader, date, mappingsConfig, sparkSession} import org.dbpedia.extraction.dump.tags.ConstructValidationTestTag import org.dbpedia.validation.construct.report.ReportWriter import org.dbpedia.validation.construct.report.formats.ReportFormat import org.dbpedia.validation.construct.tests.TestSuiteFactory import org.dbpedia.validation.construct.tests.suites.NTripleTestSuite import org.scalatest.{BeforeAndAfterAll, DoNotDiscover, FunSuite} -import org.apache.jena.rdf.model.RDFNode - -import java.util.function.Consumerdump/src/main/scala/org/dbpedia/validation/construct/tests/suites/NTripleTestSuite.scala (1)
50-51: Null check may need adjustment for Construct type.The filter
(_ != null)checks if the Construct object is null, but based onprepareFlatTerseLine, the Construct is always created (never null). However, theselffield inside could be null if parsing fails (lines 77-98 shows,p,ocan remain null on exception). Consider filtering on_.self != nullinstead.Suggested fix
- .filter(_ != null) + .filter(c => c != null && c.self != null)dump/src/main/scala/org/dbpedia/validation/construct/tests/generators/NTripleTestGenerator.scala (1)
350-363: Consider using pattern matching for cleaner code.The logic is correct, but this could be more idiomatic using Scala's pattern matching.
♻️ Optional refactor
def getValidatorGroup(validatorQuerySolution: QuerySolution): ValidatorGroup.Value = { - if (validatorQuerySolution.contains("validatorGroup")) { - val validatorGroup = validatorQuerySolution.getResource("validatorGroup").getLocalName - if (validatorGroup == rightValidator) { - ValidatorGroup.RIGHT - } else if (validatorGroup == leftValidator){ - ValidatorGroup.LEFT - } else { - ValidatorGroup.DEFAULT - } - } else { - ValidatorGroup.DEFAULT + Option(validatorQuerySolution) + .filter(_.contains("validatorGroup")) + .map(_.getResource("validatorGroup").getLocalName) match { + case Some(`rightValidator`) => ValidatorGroup.RIGHT + case Some(`leftValidator`) => ValidatorGroup.LEFT + case _ => ValidatorGroup.DEFAULT } }core/src/main/scala/org/dbpedia/extraction/mappings/PlainAbstractExtractor.scala (1)
223-249: Code duplication withNifExtractor.postProcessExtractedHtml.The
postProcessExtractedHtmlmethod appears to be duplicated fromNifExtractor.scala(lines 72-95 in the relevant snippets). Consider extracting this to a shared utility (e.g.,AbstractUtils) to avoid maintenance burden.
| val patternsToRemove = List( | ||
| """<div style=[^/]*/>""".r -> " ", | ||
| """</div>""".r -> " " | ||
| """</div>""".r -> " ", | ||
| """<normalized>.*<\/normalized>""".r -> "" | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use non-greedy matching for <normalized> tag removal.
The regex <normalized>.*<\/normalized> uses greedy matching, which could inadvertently match across multiple <normalized> blocks on the same line, removing more content than intended.
🐛 Proposed fix
val patternsToRemove = List(
"""<div style=[^/]*/>""".r -> " ",
"""</div>""".r -> " ",
- """<normalized>.*<\/normalized>""".r -> ""
+ """<normalized>.*?<\/normalized>""".r -> ""
)🤖 Prompt for AI Agents
In
`@core/src/main/scala/org/dbpedia/extraction/mappings/PlainAbstractExtractor.scala`
around lines 251 - 255, In PlainAbstractExtractor.scala update the
patternsToRemove entry that removes <normalized> blocks to use non-greedy
matching so it doesn't span multiple tags; specifically change the regex in the
patternsToRemove list (the entry currently """<normalized>.*<\/normalized>""".r
-> "") to a non-greedy form (e.g. use .*? inside the same triple-quoted regex)
so each <normalized>...</normalized> is removed independently.
| override def run(nTriplePart: Construct): Boolean = { | ||
| val lexicalForm = VALIDATOR_GROUP match { | ||
| case ValidatorGroup.RIGHT => nTriplePart.right match { | ||
| // TODO: 1) maybe we need to rename "value" | ||
| // 2) discuss what to do if we want to check the value that doesn't exist on | ||
| // the left or right side, at the moment we only return false in these cases | ||
| case Some(value) => value.trim.split("\"").dropRight(1).drop(1).mkString("") | ||
| case None => return false | ||
| } | ||
| case ValidatorGroup.LEFT => nTriplePart.left match { | ||
| case Some(value) => value.trim.split("\"").dropRight(1).drop(1).mkString("") | ||
| case None => return false | ||
| } | ||
| case _ => nTriplePart.self.trim.split("\"").dropRight(1).drop(1).mkString("") | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fragile lexical form extraction for literals with escaped quotes.
The split("\"").dropRight(1).drop(1).mkString("") approach works for simple quoted literals but will produce incorrect results for literals containing escaped quotes (e.g., "hello \"world\"").
Consider using a regex or proper N-Triples literal parser for more robust extraction:
Suggested improvement
- case ValidatorGroup.RIGHT => nTriplePart.right match {
- // TODO: 1) maybe we need to rename "value"
- // 2) discuss what to do if we want to check the value that doesn't exist on
- // the left or right side, at the moment we only return false in these cases
- case Some(value) => value.trim.split("\"").dropRight(1).drop(1).mkString("")
- case None => return false
- }
- case ValidatorGroup.LEFT => nTriplePart.left match {
- case Some(value) => value.trim.split("\"").dropRight(1).drop(1).mkString("")
- case None => return false
- }
- case _ => nTriplePart.self.trim.split("\"").dropRight(1).drop(1).mkString("")
+ private def extractLexicalForm(value: String): String = {
+ val trimmed = value.trim
+ if (trimmed.startsWith("\"")) {
+ // Find the closing quote (handling escaped quotes)
+ val content = trimmed.drop(1)
+ val endIdx = content.lastIndexOf("\"")
+ if (endIdx >= 0) content.substring(0, endIdx) else content
+ } else trimmed
+ }Then use extractLexicalForm(value) in each branch.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| override def run(nTriplePart: Construct): Boolean = { | |
| val lexicalForm = VALIDATOR_GROUP match { | |
| case ValidatorGroup.RIGHT => nTriplePart.right match { | |
| // TODO: 1) maybe we need to rename "value" | |
| // 2) discuss what to do if we want to check the value that doesn't exist on | |
| // the left or right side, at the moment we only return false in these cases | |
| case Some(value) => value.trim.split("\"").dropRight(1).drop(1).mkString("") | |
| case None => return false | |
| } | |
| case ValidatorGroup.LEFT => nTriplePart.left match { | |
| case Some(value) => value.trim.split("\"").dropRight(1).drop(1).mkString("") | |
| case None => return false | |
| } | |
| case _ => nTriplePart.self.trim.split("\"").dropRight(1).drop(1).mkString("") | |
| } | |
| override def run(nTriplePart: Construct): Boolean = { | |
| val lexicalForm = VALIDATOR_GROUP match { | |
| case ValidatorGroup.RIGHT => nTriplePart.right match { | |
| case Some(value) => extractLexicalForm(value) | |
| case None => return false | |
| } | |
| case ValidatorGroup.LEFT => nTriplePart.left match { | |
| case Some(value) => extractLexicalForm(value) | |
| case None => return false | |
| } | |
| case _ => extractLexicalForm(nTriplePart.self) | |
| } | |
| // ... rest of the function | |
| } | |
| private def extractLexicalForm(value: String): String = { | |
| val trimmed = value.trim | |
| if (trimmed.startsWith("\"")) { | |
| // Find the closing quote (handling escaped quotes) | |
| val content = trimmed.drop(1) | |
| val endIdx = content.lastIndexOf("\"") | |
| if (endIdx >= 0) content.substring(0, endIdx) else content | |
| } else trimmed | |
| } |
🤖 Prompt for AI Agents
In
`@dump/src/main/scala/org/dbpedia/validation/construct/model/validators/TypedLiteralValidator.scala`
around lines 13 - 27, The current lexical extraction in run (using
VALIDATOR_GROUP and nTriplePart.left/right/self) is fragile because
split("\"").dropRight(1).drop(1).mkString("") fails on escaped quotes; replace
that inline splitting with a robust extractor: implement a function
extractLexicalForm(value: String) that correctly parses N-Triples quoted
literals (handling escaped quotes, language tags and datatypes) using a regex or
an N-Triples literal parser, then call extractLexicalForm(value) in each branch
(the branches of ValidatorGroup.RIGHT, ValidatorGroup.LEFT and the default)
instead of the split-based logic and return false when Option is None as before.
| * https://en.wikipedia.org/wiki/IKEA | ||
| * https://en.wikipedia.org/wiki/Jim_Pewter | ||
| * https://en.wikipedia.org/wiki/Kerala_Agricultural_University | ||
| * https://en.wikipedia.org/wiki/Marian_Breland_Bailey |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wrap the new URL to satisfy MD034 (no-bare-urls).
markdownlint flags the added bare URL. Use an autolink or link text.
🛠️ Proposed fix
-* https://en.wikipedia.org/wiki/Marian_Breland_Bailey
+* <https://en.wikipedia.org/wiki/Marian_Breland_Bailey>📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| * https://en.wikipedia.org/wiki/Marian_Breland_Bailey | |
| * <https://en.wikipedia.org/wiki/Marian_Breland_Bailey> |
🧰 Tools
🪛 markdownlint-cli2 (0.18.1)
48-48: Bare URL used
(MD034, no-bare-urls)
🤖 Prompt for AI Agents
In `@dump/src/test/bash/minidump-overview.md` at line 48, Replace the bare URL
"https://en.wikipedia.org/wiki/Marian_Breland_Bailey" with a proper Markdown
link to satisfy MD034: either wrap it as an autolink (<https://...>) or use link
text such as the page title (e.g., "Marian Breland Bailey") pointing to that URL
so the line no longer contains a bare URL.
| trigger:abstract_property | ||
| a v:RDF_IRI_Trigger ; | ||
| trigger:pattern "http://dbpedia.org/ontology/abstract" ; | ||
| rdfs:label "abstract IRIs" ; | ||
| rdfs:comment "Match abstracts" . |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing regex anchor for exact match.
The trigger pattern "http://dbpedia.org/ontology/abstract" lacks the ^ start anchor and $ end anchor, which all other triggers in this file use consistently. Without anchors, this could match IRIs that merely contain this substring.
Proposed fix
trigger:abstract_property
a v:RDF_IRI_Trigger ;
- trigger:pattern "http://dbpedia.org/ontology/abstract" ;
+ trigger:pattern "^http://dbpedia.org/ontology/abstract$" ;
rdfs:label "abstract IRIs" ;
rdfs:comment "Match abstracts" .📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| trigger:abstract_property | |
| a v:RDF_IRI_Trigger ; | |
| trigger:pattern "http://dbpedia.org/ontology/abstract" ; | |
| rdfs:label "abstract IRIs" ; | |
| rdfs:comment "Match abstracts" . | |
| trigger:abstract_property | |
| a v:RDF_IRI_Trigger ; | |
| trigger:pattern "^http://dbpedia.org/ontology/abstract$" ; | |
| rdfs:label "abstract IRIs" ; | |
| rdfs:comment "Match abstracts" . |
🤖 Prompt for AI Agents
In `@dump/src/test/resources/ci-tests/dbpedia-specific-ci-tests.ttl` around lines
59 - 63, The trigger pattern for trigger:abstract_property is missing regex
anchors and should be changed to an exact-match pattern by adding ^ at the start
and $ at the end of the trigger:pattern value; locate the
trigger:abstract_property block and update its trigger:pattern string to include
the ^ and $ anchors so it matches only the full IRI
"http://dbpedia.org/ontology/abstract".
| [http://de.dbpedia.org/resource/Arthur_Schopenhauer](http://dief.tools.dbpedia.org/server/extraction/de/extract?title=Arthur_Schopenhauer&revid=&format=trix&extractors=custom) | [http://dbpedia.org/ontology/abstract](http://dbpedia.org/ontology/abstract) #en_abstract_validation | | ||
| [http://de.dbpedia.org/resource/Berlin](http://dief.tools.dbpedia.org/server/extraction/de/extract?title=Berlin&revid=&format=trix&extractors=custom) | [http://dbpedia.org/ontology/abstract](http://dbpedia.org/ontology/abstract) #en_abstract_validation | | ||
| [http://el.dbpedia.org/resource/Βερολίνο](http://dief.tools.dbpedia.org/server/extraction/el/extract?title=Βερολίνο&revid=&format=trix&extractors=custom) | | ||
| [http://en.dbpedia.org/resource/%3F_(film)](http://dief.tools.dbpedia.org/server/extraction/en/extract?title=%3F_(film)&revid=&format=trix&extractors=custom) | [http://dbpedia.org/ontology/abstract](http://dbpedia.org/ontology/abstract) #en_abstract_validation | | ||
| [http://en.dbpedia.org/resource/%3F_(film)](http://dief.tools.dbpedia.org/server/extraction/en/extract?title=%3F_(film)&revid=&format=trix&extractors=custom) | [http://dbpedia.org/property/accessDate](http://dbpedia.org/property/accessDate) #Citation_english_languagа_accessDate_datatype_validation | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix table rows to keep 4-column shape (and remove trailing pipes).
Several new rows only have two columns and end with a trailing pipe, which breaks the 4‑column table contract and triggers markdownlint. Add empty cells for “issue” and “comment” (and avoid trailing pipes) for all new rows in this pattern.
🔧 Example fix (apply to similar rows)
-[http://de.dbpedia.org/resource/Arthur_Schopenhauer](http://dief.tools.dbpedia.org/server/extraction/de/extract?title=Arthur_Schopenhauer&revid=&format=trix&extractors=custom) | [http://dbpedia.org/ontology/abstract](http://dbpedia.org/ontology/abstract) `#en_abstract_validation` |
+[http://de.dbpedia.org/resource/Arthur_Schopenhauer](http://dief.tools.dbpedia.org/server/extraction/de/extract?title=Arthur_Schopenhauer&revid=&format=trix&extractors=custom) | [http://dbpedia.org/ontology/abstract](http://dbpedia.org/ontology/abstract) `#en_abstract_validation` | | 🧰 Tools
🪛 markdownlint-cli2 (0.18.1)
29-29: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe
(MD055, table-pipe-style)
29-29: Table column count
Expected: 4; Actual: 2; Too few cells, row will be missing data
(MD056, table-column-count)
30-30: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe
(MD055, table-pipe-style)
30-30: Table column count
Expected: 4; Actual: 2; Too few cells, row will be missing data
(MD056, table-column-count)
31-31: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe
(MD055, table-pipe-style)
31-31: Table column count
Expected: 4; Actual: 1; Too few cells, row will be missing data
(MD056, table-column-count)
32-32: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe
(MD055, table-pipe-style)
32-32: Table column count
Expected: 4; Actual: 2; Too few cells, row will be missing data
(MD056, table-column-count)
33-33: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe
(MD055, table-pipe-style)
33-33: Table column count
Expected: 4; Actual: 2; Too few cells, row will be missing data
(MD056, table-column-count)
🤖 Prompt for AI Agents
In `@dump/src/test/resources/shaclTestsCoverageTable.md` around lines 29 - 33, The
new table rows in shaclTestsCoverageTable.md (e.g., entries beginning with
[http://el.dbpedia.org/resource/Βερολίνο] and the various
[http://.../resource/%3F_(film)] lines) only have two columns and end with a
trailing pipe, breaking the 4-column table format; update each such row to
include two additional empty cells for the “issue” and “comment” columns (add "
| | " after the existing cells) and remove any trailing pipe at the end of the
line so every row has exactly four pipe-separated columns.
| val filePath = classLoader.getResource(path).getFile | ||
| val file = scala.io.Source.fromFile(filePath) | ||
|
|
||
| val table: Array[Array[String]] = file.getLines().map(_.split(",")).toArray |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Resource leak: file handle not closed.
scala.io.Source.fromFile opens a file handle that should be closed after use to prevent resource leaks.
Suggested fix using try-finally
- val file = scala.io.Source.fromFile(filePath)
-
- val table: Array[Array[String]] = file.getLines().map(_.split(",")).toArray
+ val file = scala.io.Source.fromFile(filePath)
+ val table: Array[Array[String]] = try {
+ file.getLines().map(_.split(",")).toArray
+ } finally {
+ file.close()
+ }📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| val filePath = classLoader.getResource(path).getFile | |
| val file = scala.io.Source.fromFile(filePath) | |
| val table: Array[Array[String]] = file.getLines().map(_.split(",")).toArray | |
| val filePath = classLoader.getResource(path).getFile | |
| val file = scala.io.Source.fromFile(filePath) | |
| val table: Array[Array[String]] = try { | |
| file.getLines().map(_.split(",")).toArray | |
| } finally { | |
| file.close() | |
| } |
🤖 Prompt for AI Agents
In `@dump/src/test/scala/org/dbpedia/extraction/dump/Utils.scala` around lines 15
- 18, The code opens a scala.io.Source (variable file) and never closes it,
risking a resource leak; update the Utils code that computes table (using
variables filePath and file) to ensure the Source is closed after reading —
either wrap the file usage in a try-finally or use a managed resource construct
(e.g., Using.from or scala.util.Using) to read
file.getLines().map(_.split(",")).toArray and close the Source in the
finally/Using block so the file handle is always released.
| def getGroup(testName: String): String = { | ||
| val resourceInputStream = Option(getClass.getClassLoader.getResourceAsStream("properties-from-pom.properties")) | ||
| val properties = new Properties() | ||
| resourceInputStream match { | ||
| case Some(inputStream) => properties.load(inputStream) | ||
| case None => return TestConfig.defaultTestGroup | ||
| } | ||
| val groupOption = Option(properties.getProperty(testName)) | ||
| groupOption match { | ||
| case Some(group) => group | ||
| case None => TestConfig.defaultTestGroup | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Resource leak: InputStream not closed after loading properties.
The InputStream from getResourceAsStream should be closed after use.
Suggested fix
def getGroup(testName: String): String = {
val resourceInputStream = Option(getClass.getClassLoader.getResourceAsStream("properties-from-pom.properties"))
val properties = new Properties()
resourceInputStream match {
- case Some(inputStream) => properties.load(inputStream)
+ case Some(inputStream) =>
+ try {
+ properties.load(inputStream)
+ } finally {
+ inputStream.close()
+ }
case None => return TestConfig.defaultTestGroup
}🤖 Prompt for AI Agents
In `@dump/src/test/scala/org/dbpedia/extraction/dump/Utils.scala` around lines 33
- 45, In getGroup, the InputStream returned by getResourceAsStream is not closed
causing a resource leak; update getGroup to load the Properties inside a
safe-closure (e.g., scala.util.Using or a try/finally) that ensures
inputStream.close() is always called after properties.load(inputStream), and
keep the existing logic returning TestConfig.defaultTestGroup on missing
resource or missing property; reference the getGroup method and the local
variable resourceInputStream/inputStream when implementing the safe-close.
| def renameAbstractsDatasetFiles(datasetName: String): Unit = { | ||
| val minidumpDir = new File("./target/minidumptest/base") | ||
| minidumpDir.listFiles().foreach(f => { | ||
| val longAbstractsFile = new File( s"./target/minidumptest/base/${f.getName}/$date/${f.getName}-$date-long-abstracts.ttl.bz2") | ||
| if (longAbstractsFile.exists()) { | ||
| longAbstractsFile.renameTo(new File(s"./target/minidumptest/base/${f.getName}/$date/${f.getName}-$date-long-abstracts-$datasetName.ttl.bz2")) | ||
| } | ||
| val shortAbstractsFile = new File( s"./target/minidumptest/base/${f.getName}/$date/${f.getName}-$date-short-abstracts.ttl.bz2") | ||
| if (shortAbstractsFile.exists()) { | ||
| shortAbstractsFile.renameTo(new File(s"./target/minidumptest/base/${f.getName}/$date/${f.getName}-$date-short-abstracts-$datasetName.ttl.bz2")) | ||
| } | ||
| }) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potential NullPointerException if directory doesn't exist.
minidumpDir.listFiles() returns null if the path is not a directory or an I/O error occurs, which would cause a NullPointerException when calling foreach.
Suggested fix
def renameAbstractsDatasetFiles(datasetName: String): Unit = {
val minidumpDir = new File("./target/minidumptest/base")
- minidumpDir.listFiles().foreach(f => {
+ Option(minidumpDir.listFiles()).getOrElse(Array.empty).foreach(f => {📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| def renameAbstractsDatasetFiles(datasetName: String): Unit = { | |
| val minidumpDir = new File("./target/minidumptest/base") | |
| minidumpDir.listFiles().foreach(f => { | |
| val longAbstractsFile = new File( s"./target/minidumptest/base/${f.getName}/$date/${f.getName}-$date-long-abstracts.ttl.bz2") | |
| if (longAbstractsFile.exists()) { | |
| longAbstractsFile.renameTo(new File(s"./target/minidumptest/base/${f.getName}/$date/${f.getName}-$date-long-abstracts-$datasetName.ttl.bz2")) | |
| } | |
| val shortAbstractsFile = new File( s"./target/minidumptest/base/${f.getName}/$date/${f.getName}-$date-short-abstracts.ttl.bz2") | |
| if (shortAbstractsFile.exists()) { | |
| shortAbstractsFile.renameTo(new File(s"./target/minidumptest/base/${f.getName}/$date/${f.getName}-$date-short-abstracts-$datasetName.ttl.bz2")) | |
| } | |
| }) | |
| } | |
| def renameAbstractsDatasetFiles(datasetName: String): Unit = { | |
| val minidumpDir = new File("./target/minidumptest/base") | |
| Option(minidumpDir.listFiles()).getOrElse(Array.empty).foreach(f => { | |
| val longAbstractsFile = new File( s"./target/minidumptest/base/${f.getName}/$date/${f.getName}-$date-long-abstracts.ttl.bz2") | |
| if (longAbstractsFile.exists()) { | |
| longAbstractsFile.renameTo(new File(s"./target/minidumptest/base/${f.getName}/$date/${f.getName}-$date-long-abstracts-$datasetName.ttl.bz2")) | |
| } | |
| val shortAbstractsFile = new File( s"./target/minidumptest/base/${f.getName}/$date/${f.getName}-$date-short-abstracts.ttl.bz2") | |
| if (shortAbstractsFile.exists()) { | |
| shortAbstractsFile.renameTo(new File(s"./target/minidumptest/base/${f.getName}/$date/${f.getName}-$date-short-abstracts-$datasetName.ttl.bz2")) | |
| } | |
| }) | |
| } |
🤖 Prompt for AI Agents
In `@dump/src/test/scala/org/dbpedia/extraction/dump/Utils.scala` around lines 47
- 59, The method renameAbstractsDatasetFiles can NPE because
minidumpDir.listFiles() may return null; update renameAbstractsDatasetFiles to
guard against that by first checking that minidumpDir.exists() &&
minidumpDir.isDirectory() or by wrapping listFiles() in Option(...) and using
getOrElse(Array.empty) before iterating; keep the rest of the logic unchanged
and operate only when files are present so longAbstractsFile/shortAbstractsFile
checks remain valid.
| **Documentation**: http://dev.dbpedia.org/Extraction <br/> | ||
| **Get in touch with DBpedia**: https://wiki.dbpedia.org/join/get-in-touch <br/> | ||
| **Slack**: join the [**#dev-team**](https://dbpedia.slack.com/archives/C0L9MJFU7) slack channel within the the [DBpedia Slack workspace](https://dbpedia-slack.herokuapp.com/) - the main point for [developement updates](https://github.com/dbpedia/extraction-framework/blob/master/.github/workflows/maven.yml) and discussions <br/> | ||
| **Slack**: join the [**#dev-team**](https://dbpedia.slack.com/archives/C0L9MJFU7) slack channel within the the [DBpedia Slack workspace]( https://join.slack.com/t/dbpedia/shared_invite/zt-nffbn1ra-dRoi8oeWBlolJb_lKifEqA) - the main point for developement updates and discussions <br/> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix typos and capitalization in the Slack sentence.
Line 7 has a duplicated “the” and “developement” typo.
✅ Suggested edit
-**Slack**: join the [**#dev-team**](https://dbpedia.slack.com/archives/C0L9MJFU7) slack channel within the the [DBpedia Slack workspace]( https://join.slack.com/t/dbpedia/shared_invite/zt-nffbn1ra-dRoi8oeWBlolJb_lKifEqA) - the main point for developement updates and discussions <br/>
+**Slack**: join the [**#dev-team**](https://dbpedia.slack.com/archives/C0L9MJFU7) Slack channel within the [DBpedia Slack workspace](https://join.slack.com/t/dbpedia/shared_invite/zt-nffbn1ra-dRoi8oeWBlolJb_lKifEqA) - the main point for development updates and discussions <br/>📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| **Slack**: join the [**#dev-team**](https://dbpedia.slack.com/archives/C0L9MJFU7) slack channel within the the [DBpedia Slack workspace]( https://join.slack.com/t/dbpedia/shared_invite/zt-nffbn1ra-dRoi8oeWBlolJb_lKifEqA) - the main point for developement updates and discussions <br/> | |
| **Slack**: join the [**#dev-team**](https://dbpedia.slack.com/archives/C0L9MJFU7) Slack channel within the [DBpedia Slack workspace](https://join.slack.com/t/dbpedia/shared_invite/zt-nffbn1ra-dRoi8oeWBlolJb_lKifEqA) - the main point for development updates and discussions <br/> |
🧰 Tools
🪛 LanguageTool
[uncategorized] ~7-~7: Did you mean the communication tool “Slack” (= proper noun, capitalized)?
Context: ...//dbpedia.slack.com/archives/C0L9MJFU7) slack channel within the the [DBpedia Slack w...
(ON_SKYPE)
[grammar] ~7-~7: Ensure spelling is correct
Context: ...oeWBlolJb_lKifEqA) - the main point for developement updates and discussions
## Conte...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
🤖 Prompt for AI Agents
In `@README.md` at line 7, Fix the typos in the Slack sentence that starts
"**Slack**: join the [**#dev-team**]...": remove the duplicated "the", correct
"developement" to "development", ensure "Slack workspace" capitalization is
consistent, and remove the extra space before the URL so the sentence reads
cleanly and without double words or misspellings.
| * Also holds on to the mappings to make synchronization in MappingStatsManager easier. | ||
| * TODO: better solution for mappings? | ||
| */ | ||
| class MappingStatsHolder(val mappings: Mappings, val mappedStatistics: List[MappingStats], val reversedRedirects: Map[String, String], ignoreList: IgnoreList) { | ||
|
|
||
| private def countTemplates(all: Boolean, count: MappingStats => Int): Int = { | ||
| var sum = 0 | ||
| for (ms <- mappedStatistics) { | ||
| if (all || ms.isMapped) { | ||
| if (! ignoreList.isTemplateIgnored(ms.templateName)) { | ||
| sum += count(ms) | ||
| } | ||
| private def countTemplates(all: Boolean, count: MappingStats => Int): Int = { | ||
| var sum = 0 | ||
| for (ms <- mappedStatistics) { | ||
| if (all || ms.isMapped) { | ||
| if (!ignoreList.isTemplateIgnored(ms.templateName)) { | ||
| sum += count(ms) | ||
| } | ||
| } | ||
| sum | ||
| } | ||
| sum | ||
| } | ||
|
|
||
| private def countAllTemplates(count: MappingStats => Int): Int = countTemplates(true, count) | ||
| private def countMappedTemplates(count: MappingStats => Int): Int = countTemplates(false, count) | ||
|
|
||
| val templateCount = countAllTemplates(_ => 1) | ||
| val mappedTemplateCount = countMappedTemplates(_ => 1) | ||
|
|
||
| val templateUseCount = countAllTemplates(_.templateCount) | ||
| val mappedTemplateUseCount = countMappedTemplates(_.templateCount) | ||
|
|
||
| val propertyCount = countAllTemplates(_.propertyCount) | ||
| val mappedPropertyCount = countMappedTemplates(_.mappedPropertyCount) | ||
|
|
||
| private def countAllTemplates(count: MappingStats => Int): Int = countTemplates(true, count) | ||
| private def countMappedTemplates(count: MappingStats => Int): Int = countTemplates(false, count) | ||
|
|
||
| val templateCount = countAllTemplates(_ => 1) | ||
| val mappedTemplateCount = countMappedTemplates(_ => 1) | ||
|
|
||
| val templateUseCount = countAllTemplates(_.templateCount) | ||
| val mappedTemplateUseCount = countMappedTemplates(_.templateCount) | ||
|
|
||
| val propertyCount = countAllTemplates(_.propertyCount) | ||
| val mappedPropertyCount = countMappedTemplates(_.mappedPropertyCount) | ||
|
|
||
| val propertyUseCount = countAllTemplates(_.propertyUseCount) | ||
| val mappedPropertyUseCount = countMappedTemplates(_.mappedPropertyUseCount) | ||
|
|
||
| val mappedTemplateRatio = mappedTemplateCount.toDouble / templateCount.toDouble | ||
| val mappedPropertyRatio = mappedPropertyCount.toDouble / propertyCount.toDouble | ||
|
|
||
| val mappedTemplateUseRatio = mappedTemplateUseCount.toDouble / templateUseCount.toDouble | ||
| val mappedPropertyUseRatio = mappedPropertyUseCount.toDouble / propertyUseCount.toDouble | ||
| val propertyUseCount = countAllTemplates(_.propertyUseCount) | ||
| val mappedPropertyUseCount = countMappedTemplates(_.mappedPropertyUseCount) | ||
|
|
||
| val mappedTemplateRatio = mappedTemplateCount.toDouble / templateCount.toDouble | ||
| val mappedPropertyRatio = mappedPropertyCount.toDouble / propertyCount.toDouble | ||
|
|
||
| val mappedTemplateUseRatio = mappedTemplateUseCount.toDouble / templateUseCount.toDouble | ||
| val mappedPropertyUseRatio = mappedPropertyUseCount.toDouble / propertyUseCount.toDouble |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Guard ratio calculations against zero counts.
If any denominator is 0 (e.g., empty stats or all templates ignored), ratios become NaN/Infinity. Consider a safe helper that returns 0.0 in that case.
✅ Suggested fix (safe ratios)
private def countAllTemplates(count: MappingStats => Int): Int = countTemplates(true, count)
private def countMappedTemplates(count: MappingStats => Int): Int = countTemplates(false, count)
+
+ private def ratio(num: Int, den: Int): Double =
+ if (den == 0) 0.0 else num.toDouble / den.toDouble
@@
- val mappedTemplateRatio = mappedTemplateCount.toDouble / templateCount.toDouble
- val mappedPropertyRatio = mappedPropertyCount.toDouble / propertyCount.toDouble
+ val mappedTemplateRatio = ratio(mappedTemplateCount, templateCount)
+ val mappedPropertyRatio = ratio(mappedPropertyCount, propertyCount)
@@
- val mappedTemplateUseRatio = mappedTemplateUseCount.toDouble / templateUseCount.toDouble
- val mappedPropertyUseRatio = mappedPropertyUseCount.toDouble / propertyUseCount.toDouble
+ val mappedTemplateUseRatio = ratio(mappedTemplateUseCount, templateUseCount)
+ val mappedPropertyUseRatio = ratio(mappedPropertyUseCount, propertyUseCount)🤖 Prompt for AI Agents
In
`@server/src/main/scala/org/dbpedia/extraction/server/stats/MappingStatsHolder.scala`
around lines 78 - 113, The ratio fields in MappingStatsHolder
(mappedTemplateRatio, mappedPropertyRatio, mappedTemplateUseRatio,
mappedPropertyUseRatio) perform raw division and can produce NaN/Infinity when
denominators (templateCount, propertyCount, templateUseCount, propertyUseCount)
are zero; add a small helper method (e.g., safeRatio(numerator: Double,
denominator: Double): Double) that returns 0.0 when denominator == 0.0 and
otherwise returns numerator/denominator, then replace the four direct divisions
with safeRatio calls to ensure safe results.


Description
Modified
MappingStatsHolder.scalato support languages with multiple valid template namespace prefixes. Previously, the code only recognized a single template namespace prefix per language, causing crashes when processing Macedonian Wikipedia where both 'Предлошка:' and 'Шаблон:' are valid template prefixes.Changes Made:
org.dbpedia.extraction.wikiparser.impl.wikipedia.NamespacesNamespacesconfiguration instead of hardcoding a single prefixvalidTemplatePrefixes.find()to accept any valid prefixmatchedPrefix.isDefinedbefore calling substring operationsMotivation and Context
Issue #804: Macedonian Wikipedia extraction crashes with
StringIndexOutOfBoundsExceptionwhen processing templates with the 'Шаблон:' prefix.Root Cause: Macedonian Wikipedia uses two valid template namespace prefixes:
The original code only checked for a hardcoded single prefix, causing the parser to fail when encountering the alternative prefix.
Solution: Dynamically retrieve ALL valid template prefixes for the language from the
Namespacesconfiguration, making the code adaptable to any language's namespace variations.Fixes #804
How Has This Been Tested?
Namespaces.names(language)Types of changes
Checklist:
Summary by CodeRabbit
Documentation
Bug Fixes
Tests
Configuration
✏️ Tip: You can customize this high-level summary in your review settings.