Conversation
📝 Walkthrough

Converted multiple importer classes from dataclasses to plain classes, moved post-init logic into constructors, switched author preprocessing to emit Wikibase author claims (P50/P2093) instead of QID lists, and normalized string quoting and metadata access across the ArXiv, Crossref, Zenodo, and CRAN modules.
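The dataclass-to-plain-class conversion described in the walkthrough typically follows the pattern below (a minimal sketch with a hypothetical `url` field, not the actual importer code):

```python
from dataclasses import dataclass, field

# Before: a dataclass whose setup logic lives in __post_init__.
@dataclass
class ImporterBefore:
    label: str
    url: str = field(init=False)

    def __post_init__(self):
        self.url = f"https://example.org/package={self.label}"

# After: a plain class whose constructor does the same setup directly.
class ImporterAfter:
    def __init__(self, label: str):
        self.label = label
        self.url = f"https://example.org/package={self.label}"

assert ImporterBefore("ggplot2").url == ImporterAfter("ggplot2").url
```

The observable attributes are identical; what changes is that constructor logic no longer depends on dataclass machinery.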
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Importer as Importer
    participant SourceAPI as "Source API\n(arXiv / Crossref / Zenodo / CRAN)"
    participant Preprocessor as Preprocessor
    participant Wikibase as Wikibase
    Importer->>SourceAPI: fetch metadata (authors, ids, dates)
    SourceAPI-->>Importer: return metadata
    Importer->>Preprocessor: __preprocess_authors(metadata.authors)
    Preprocessor-->>Importer: list of author claims (P50 / P2093)
    Importer->>Wikibase: create/update item with claims (P50/P2093, P356, P577, ...)
    Wikibase-->>Importer: success / failure
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)

Important — Action Needed: IP Allowlist Update

If your organization protects your Git platform with IP whitelisting, please add the new CodeRabbit IP address to your allowlist. Failure to add the new IP will result in interrupted reviews.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@mardi_importer/mardi_importer/cran/RPackage.py`:
- Around line 262-269: The author handling currently only uses P50 when
author.orcid exists, which downgrades authors who already have author.QID;
change the flow in the RPackage.py author processing to always call
author.pull_QID(self.author_pool) (as is done) and then prefer adding a P50
claim when author.QID is present, regardless of author.orcid; if no QID but
author.orcid is present, call author.create() or otherwise resolve the ORCID to
a QID and add P50, and only when no QID or ORCID exists add the P2093 name
string via self.item.add_claim("wdt:P2093", author.name). Apply the same change
to the other occurrence around lines 386-393.
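The preference order this comment asks for — existing QID wins, then ORCID resolved to a QID, then a plain name string — can be sketched as a small decision function (hypothetical `Author` shape and `create()` behavior; the real importer resolves QIDs against a Wikibase author pool):

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Author:
    name: str
    QID: Optional[str] = None
    orcid: Optional[str] = None

    def create(self) -> str:
        # Stand-in for resolving the ORCID to a new Wikibase item.
        self.QID = "Q_new"
        return self.QID

def author_claim(author: Author) -> Tuple[str, str]:
    if author.QID:                      # an existing item wins, even with an ORCID
        return ("wdt:P50", author.QID)
    if author.orcid:                    # resolve the ORCID to an item, then P50
        return ("wdt:P50", author.create())
    return ("wdt:P2093", author.name)   # last resort: author name string

assert author_claim(Author("A. Smith", QID="Q42")) == ("wdt:P50", "Q42")
assert author_claim(Author("A. Smith", orcid="0000-0001"))[0] == "wdt:P50"
assert author_claim(Author("A. Smith")) == ("wdt:P2093", "A. Smith")
```

The point of the ordering is that a known entity claim (P50) is never downgraded to a name-string claim (P2093) just because the ORCID is missing.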
In `@mardi_importer/mardi_importer/crossref/CrossrefPublication.py`:
- Around line 374-381: The P50 claim is being created with the Author object
instead of its QID, causing inconsistent API usage; update the call to
self.api.get_claim("wdt:P50", author) to pass the author's QID (author.QID)
after ensuring author.create() sets author.QID when missing—i.e., in the loop
over self.authors, call author.create() if needed and then call
self.api.get_claim("wdt:P50", author.QID) so get_claim receives the entity ID
string.
🧹 Nitpick comments (2)

mardi_importer/mardi_importer/arxiv/ArxivPublication.py (1)

357-379: Align return type hint with actual claim objects.

`__preprocess_authors()` now returns claim objects rather than strings; updating the annotation improves clarity for type-checking.

♻️ Suggested change

```diff
-from typing import List, Optional
+from typing import List, Optional, Any
@@
-    def __preprocess_authors(self) -> List[str]:
+    def __preprocess_authors(self) -> List[Any]:
```

mardi_importer/mardi_importer/zenodo/ZenodoResource.py (1)

293-315: Update the return type hint to match claim objects.

The method now returns claim objects rather than QIDs.

♻️ Suggested change

```diff
-from typing import Dict, List, Optional
+from typing import Dict, List, Optional, Any
@@
-    def __preprocess_authors(self) -> List[str]:
+    def __preprocess_authors(self) -> List[Any]:
```
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
mardi_importer/mardi_importer/cran/RPackage.py (2)

185-209: ⚠️ Potential issue | 🟡 Minor — Remove unnecessary `f` prefixes from strings with no interpolation.

Ruff flags these as F541 (f-string-missing-placeholders). Three instances found:

- Lines 185-187: `f"Metadata table not found in CRAN. Package has probably been archived."`
- Line 206: `f"Package could not be created."`
- Line 451: `f"Package could not be updated."`

✂️ Suggested fix

```diff
-        log.warning(
-            f"Metadata table not found in CRAN. Package has probably been archived."
-        )
+        log.warning(
+            "Metadata table not found in CRAN. Package has probably been archived."
+        )
@@
-            log.info(f"Package could not be created.")
+            log.info("Package could not be created.")
@@
-            print(f"Package could not be updated.")
+            print("Package could not be updated.")
```
152-159: ⚠️ Potential issue | 🟠 Major — Add request timeouts and avoid bare `except` clauses.

Network calls at lines 155 and 949 can hang indefinitely without timeouts, and bare `except:` blocks at lines 157 and 951 mask parsing errors from BeautifulSoup. Use `requests.RequestException`, add timeouts, and log exception details.

🛠️ Suggested fix

```diff
-        try:
-            page = requests.get(self.url)
-            soup = BeautifulSoup(page.content, "lxml")
-        except:
-            log.warning(f"Package {self.label} package not found in CRAN.")
+        try:
+            page = requests.get(self.url, timeout=10)
+            page.raise_for_status()
+            soup = BeautifulSoup(page.content, "lxml")
+        except requests.RequestException as exc:
+            log.warning(f"Package {self.label} package not found in CRAN.", exc_info=exc)
@@
-        try:
-            page = requests.get(url)
-            soup = BeautifulSoup(page.content, "lxml")
-        except:
-            log.warning(f"Version page for package {self.label} not found.")
+        try:
+            page = requests.get(url, timeout=10)
+            page.raise_for_status()
+            soup = BeautifulSoup(page.content, "lxml")
+        except requests.RequestException as exc:
+            log.warning(f"Version page for package {self.label} not found.", exc_info=exc)
```

mardi_importer/mardi_importer/zenodo/ZenodoResource.py (1)

8-45: ⚠️ Potential issue | 🟠 Major — Add timeout and error handling to Zenodo API fetch.

The `urlopen` call at lines 40-42 lacks a timeout, which can cause indefinite hangs during import. This is inconsistent with similar patterns in `ZenodoSource.py`, which uses `timeout=10` for Zenodo API requests. Add both a timeout and exception handling.

Suggested fix

```diff
-import urllib.request, json, re
+import urllib.request, urllib.error, json, re
@@
-        with urllib.request.urlopen(
-            f"https://zenodo.org/api/records/{self.zenodo_id}"
-        ) as url:
-            json_data = json.load(url)
+        try:
+            with urllib.request.urlopen(
+                f"https://zenodo.org/api/records/{self.zenodo_id}", timeout=10
+            ) as url:
+                json_data = json.load(url)
+        except urllib.error.URLError as e:
+            raise RuntimeError(
+                f"Failed to fetch Zenodo record {self.zenodo_id}"
+            ) from e
```
🤖 Fix all issues with AI agents
In `@mardi_importer/mardi_importer/arxiv/ArxivPublication.py`:
- Around line 301-304: The requests.get calls in ArxivPublication (look for
base_url + arxiv_author_id + ".html" and the later requests.get around lines
~353-354) must include a timeout and be wrapped in error handling: add a timeout
parameter (e.g., timeout=10) to both requests.get(...) invocations and surround
them with try/except for requests.exceptions.RequestException to catch
timeouts/connection errors, then log the failure (using the existing logger) and
either continue/return appropriately instead of letting the loop hang; update
references to the response variable (req) handling so you only parse req.content
when the request succeeds.
In `@mardi_importer/mardi_importer/utils/Author.py`:
- Around line 43-71: The comparison logic in Author.py uses indexing like
self_name.first[0] and other_name.first[0] without ensuring both sides are
non-empty; update the four conditional blocks that compare initials (the ones
referencing self_name.first[0], other_name.first[0], self_name.last[0],
other_name.last[0]) to first check that the counterpart string is non-empty
(e.g., len(other_name.first) > 0 or other_name.first) before indexing, and
similarly check the counterpart last name before accessing [0], so both sides
are validated prior to any [0] access.
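The symmetric guard this prompt describes — check that a string is non-empty before indexing its first character — can be sketched as a tiny helper (hypothetical name, not part of `Author.py`):

```python
# Compare first characters only when BOTH strings are non-empty, so an
# incomplete name ("" from the parser) short-circuits instead of raising
# IndexError on [0].
def initials_match(a: str, b: str) -> bool:
    return bool(a) and bool(b) and a[0] == b[0]

assert initials_match("A.", "Alice")
assert not initials_match("", "Alice")   # empty side no longer crashes
assert not initials_match("A.", "")
```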
🧹 Nitpick comments (2)

mardi_importer/mardi_importer/zenodo/ZenodoResource.py (1)

10-12: Align `__preprocess_authors` return type with claim objects.

This now returns claim objects, but the annotation and docstring still imply `List[str]`. Update to `List[Any]` for clarity and downstream tooling.

♻️ Suggested fix

```diff
-from typing import Dict, List, Optional
+from typing import Dict, List, Optional, Any
@@
-    def __preprocess_authors(self) -> List[str]:
+    def __preprocess_authors(self) -> List[Any]:
@@
-            List:
-                Author claims to be added (P50 entity claims and/or P2093 string claims).
+            List[Any]:
+                Author claims to be added (P50 entity claims and/or P2093 string claims).
```

Also applies to: 322-344

mardi_importer/mardi_importer/arxiv/ArxivPublication.py (1)

1-3: Update `__preprocess_authors` return annotation to claim objects.

The method now returns claim objects, not strings. Align the type hint and docstring.

♻️ Suggested fix

```diff
-from typing import List, Optional
+from typing import List, Optional, Any
@@
-    def __preprocess_authors(self) -> List[str]:
+    def __preprocess_authors(self) -> List[Any]:
@@
-            List:
-                Author claims to be added (P50 entity claims and/or P2093 string claims).
+            List[Any]:
+                Author claims to be added (P50 entity claims and/or P2093 string claims).
```

Also applies to: 502-514
```diff
  base_url = "https://arxiv.org/a/"
  req = requests.get(base_url + arxiv_author_id + ".html", headers=headers)
- soup = BeautifulSoup(req.content, 'html.parser')
+ soup = BeautifulSoup(req.content, "html.parser")
```
🧩 Analysis chain
🏁 Script executed:

```shell
# First, let's examine the file and the specific lines mentioned
cd mardi_importer && head -310 mardi_importer/arxiv/ArxivPublication.py | tail -15
```

Repository: MaRDI4NFDI/docker-importer
Length of output: 729

🏁 Script executed:

```shell
# Also check lines around 353-354
cd mardi_importer && sed -n '350,360p' mardi_importer/arxiv/ArxivPublication.py
```

Repository: MaRDI4NFDI/docker-importer
Length of output: 455

🏁 Script executed:

```shell
# Get an overview of the file structure and context
cd mardi_importer && wc -l mardi_importer/arxiv/ArxivPublication.py
```

Repository: MaRDI4NFDI/docker-importer
Length of output: 114

🏁 Script executed:

```shell
# Search for all requests.get calls in the file to understand full scope
cd mardi_importer && rg -n "requests\.get" mardi_importer/arxiv/ArxivPublication.py
```

Repository: MaRDI4NFDI/docker-importer
Length of output: 205

🏁 Script executed:

```shell
# Get context around line 302 - see the function definition and error handling
cd mardi_importer && sed -n '290,320p' mardi_importer/arxiv/ArxivPublication.py
```

Repository: MaRDI4NFDI/docker-importer
Length of output: 1445

🏁 Script executed:

```shell
# Get context around line 353 - see the function definition and error handling
cd mardi_importer && sed -n '340,365p' mardi_importer/arxiv/ArxivPublication.py
```

Repository: MaRDI4NFDI/docker-importer
Length of output: 877

🏁 Script executed:

```shell
# Check if there are any try-except blocks or timeout handling in the file
cd mardi_importer && rg -n "except|timeout|RequestException" mardi_importer/arxiv/ArxivPublication.py
```

Repository: MaRDI4NFDI/docker-importer
Length of output: 108

🏁 Script executed:

```shell
# Check imports to see what's being used
cd mardi_importer && head -30 mardi_importer/arxiv/ArxivPublication.py
```

Repository: MaRDI4NFDI/docker-importer
Length of output: 636
Add timeout parameters to arXiv HTTP requests to prevent indefinite blocking.
Both requests.get() calls lack timeout parameters. Line 302 is particularly problematic as it runs in a loop, and line 353 lacks any error handling. Without timeouts, these requests can hang indefinitely if the arXiv server is unresponsive or network connectivity is poor.
🛠️ Suggested fix

```diff
-        req = requests.get(base_url + arxiv_author_id + ".html", headers=headers)
+        req = requests.get(
+            base_url + arxiv_author_id + ".html", headers=headers, timeout=10
+        )
@@
-        response = requests.get(full_url)
+        response = requests.get(full_url, timeout=10)
```

Also applies to: 353-354
🧰 Tools
🪛 Ruff (0.14.14)
[error] 302-302: Probable use of requests call without timeout
(S113)
🤖 Prompt for AI Agents
In `@mardi_importer/mardi_importer/arxiv/ArxivPublication.py` around lines 301 -
304, The requests.get calls in ArxivPublication (look for base_url +
arxiv_author_id + ".html" and the later requests.get around lines ~353-354) must
include a timeout and be wrapped in error handling: add a timeout parameter
(e.g., timeout=10) to both requests.get(...) invocations and surround them with
try/except for requests.exceptions.RequestException to catch timeouts/connection
errors, then log the failure (using the existing logger) and either
continue/return appropriately instead of letting the loop hang; update
references to the response variable (req) handling so you only parse req.content
when the request succeeds.
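The timeout-plus-narrow-except pattern the prompt asks for can be sketched with the standard library (a hypothetical `fetch` helper; the review's actual fix uses `requests.get` with `requests.exceptions.RequestException`, but the shape is the same):

```python
import logging
import urllib.request
import urllib.error
from typing import Optional

log = logging.getLogger(__name__)

def fetch(url: str, timeout: float = 10.0) -> Optional[bytes]:
    """Fetch a URL; on failure, log and return None instead of hanging or raising."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read()
    except (urllib.error.URLError, ValueError) as exc:
        log.warning("Request to %s failed: %s", url, exc)
        return None

# A bad URL fails fast and is logged, rather than crashing the caller's loop.
assert fetch("notascheme://example") is None
```

The caller can then branch on `None` (skip the author, continue the loop) and only parse the response content when the request succeeded.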
```diff
         if self_name.first.lower().replace("-", "") == other_name.first.lower().replace(
             "-", ""
         ) and self_name.last.lower().replace(
             "-", ""
         ) == other_name.last.lower().replace("-", ""):
             return True
-        if ( len(self_name.first) == 2 and self_name.first[0] == other_name.first[0]
-            and self_name.last == other_name.last ):
+        if (
+            len(self_name.first) == 2
+            and self_name.first[0] == other_name.first[0]
+            and self_name.last == other_name.last
+        ):
             return True
-        if ( len(other_name.first) == 2 and self_name.first[0] == other_name.first[0]
-            and self_name.last == other_name.last ):
+        if (
+            len(other_name.first) == 2
+            and self_name.first[0] == other_name.first[0]
+            and self_name.last == other_name.last
+        ):
             return True
-        if ( self_name.first == other_name.first and len(self_name.last) == 2
-            and self_name.last[0] == other_name.last[0] ):
+        if (
+            self_name.first == other_name.first
+            and len(self_name.last) == 2
+            and self_name.last[0] == other_name.last[0]
+        ):
             return True
-        if ( self_name.first == other_name.first and len(other_name.last) == 2
-            and self_name.last[0] == other_name.last[0] ):
+        if (
+            self_name.first == other_name.first
+            and len(other_name.last) == 2
+            and self_name.last[0] == other_name.last[0]
+        ):
```
🧩 Analysis chain
🏁 Script executed:

```shell
find . -name "Author.py" -type f
```

Repository: MaRDI4NFDI/docker-importer
Length of output: 166

🏁 Script executed:

```shell
wc -l ./mardi_importer/mardi_importer/utils/Author.py
```

Repository: MaRDI4NFDI/docker-importer
Length of output: 121

🏁 Script executed:

```shell
cat -n ./mardi_importer/mardi_importer/utils/Author.py
```

Repository: MaRDI4NFDI/docker-importer
Length of output: 8733
🌐 Web query:

> nameparser HumanName empty string first last attribute behavior

💡 Result:

In nameparser's HumanName, the `first` and `last` attributes are properties that return the joined underlying lists (e.g., `first_list`, `last_list`), and if those lists are empty they return `C.empty_attribute_default`. (nameparser.readthedocs.io)

So:

- `HumanName("")` parses to "no pieces", and `name.first` / `name.last` return `CONSTANTS.empty_attribute_default`, which defaults to `""` (empty string). (nameparser.readthedocs.io)
- If you change the default, empty attributes follow it (e.g., `CONSTANTS.empty_attribute_default = None` makes `name.title` / `name.first` / `name.last` return `None` when empty). (nameparser.readthedocs.io)

Citations:

- 1: https://nameparser.readthedocs.io/en/latest/_modules/nameparser/parser.html
- 2: https://nameparser.readthedocs.io/en/latest/modules.html?utm_source=openai
- 3: https://nameparser.readthedocs.io/en/latest/modules.html?utm_source=openai
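The failure mode described above is easy to reproduce with plain strings, since nameparser returns `""` for a missing name part (sketch with string literals, not the actual `HumanName` objects):

```python
# Indexing the first character of a missing (empty) name part raises
# IndexError; a truthiness check short-circuits before indexing.
first = ""            # what the parser returns for a missing first name
try:
    _ = first[0]
    crashed = False
except IndexError:
    crashed = True
assert crashed        # unguarded [0] access crashes on incomplete metadata

# The guarded comparison evaluates to False instead of raising:
other_first = "Alice"
assert not (first and other_first and first[0] == other_first[0])
```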
Add length checks on both sides before indexing to prevent IndexError on incomplete names.
When comparing HumanName objects, the code accesses the [0] index on both sides of some comparisons, but only verifies the length of one side. Since nameparser returns empty strings for missing name components, incomplete author metadata can crash with IndexError. Each of the four conditions (lines 49–72) needs to check that the other name part is non-empty before accessing its first character.
🛠️ Suggested fix

```diff
-        if (
-            len(self_name.first) == 2
-            and self_name.first[0] == other_name.first[0]
-            and self_name.last == other_name.last
-        ):
+        if (
+            len(self_name.first) == 2
+            and other_name.first
+            and self_name.first[0] == other_name.first[0]
+            and self_name.last == other_name.last
+        ):
             return True
-        if (
-            len(other_name.first) == 2
-            and self_name.first[0] == other_name.first[0]
-            and self_name.last == other_name.last
-        ):
+        if (
+            len(other_name.first) == 2
+            and self_name.first
+            and self_name.first[0] == other_name.first[0]
+            and self_name.last == other_name.last
+        ):
             return True
-        if (
-            self_name.first == other_name.first
-            and len(self_name.last) == 2
-            and self_name.last[0] == other_name.last[0]
-        ):
+        if (
+            self_name.first == other_name.first
+            and len(self_name.last) == 2
+            and other_name.last
+            and self_name.last[0] == other_name.last[0]
+        ):
             return True
-        if (
-            self_name.first == other_name.first
-            and len(other_name.last) == 2
-            and self_name.last[0] == other_name.last[0]
-        ):
+        if (
+            self_name.first == other_name.first
+            and len(other_name.last) == 2
+            and self_name.last
+            and self_name.last[0] == other_name.last[0]
+        ):
             return True
```

📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
        if self_name.first.lower().replace("-", "") == other_name.first.lower().replace(
            "-", ""
        ) and self_name.last.lower().replace(
            "-", ""
        ) == other_name.last.lower().replace("-", ""):
            return True
        if (
            len(self_name.first) == 2
            and other_name.first
            and self_name.first[0] == other_name.first[0]
            and self_name.last == other_name.last
        ):
            return True
        if (
            len(other_name.first) == 2
            and self_name.first
            and self_name.first[0] == other_name.first[0]
            and self_name.last == other_name.last
        ):
            return True
        if (
            self_name.first == other_name.first
            and len(self_name.last) == 2
            and other_name.last
            and self_name.last[0] == other_name.last[0]
        ):
            return True
        if (
            self_name.first == other_name.first
            and len(other_name.last) == 2
            and self_name.last
            and self_name.last[0] == other_name.last[0]
        ):
🤖 Prompt for AI Agents
In `@mardi_importer/mardi_importer/utils/Author.py` around lines 43 - 71, The
comparison logic in Author.py uses indexing like self_name.first[0] and
other_name.first[0] without ensuring both sides are non-empty; update the four
conditional blocks that compare initials (the ones referencing
self_name.first[0], other_name.first[0], self_name.last[0], other_name.last[0])
to first check that the counterpart string is non-empty (e.g.,
len(other_name.first) > 0 or other_name.first) before indexing, and similarly
check the counterpart last name before accessing [0], so both sides are
validated prior to any [0] access.
Force-pushed 3d66bd3 to 1624453 (Compare)
Actionable comments posted: 11
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
mardi_importer/mardi_importer/zenodo/ZenodoResource.py (1)

196-215: ⚠️ Potential issue | 🟠 Major — Potential TypeError: `self.license` may be None.

The `update` method accesses `self.license["id"]` at line 206 without checking if `self.license` is not None. The `license` property can return `None` if the metadata doesn't contain a license field, which would cause a `TypeError: 'NoneType' object is not subscriptable`.

🐛 Proposed fix

```diff
         new_item = self.api.item.get(entity_id=zenodo_id)
-        if self.license["id"] == "cc-by-4.0":
-            new_item.add_claim("wdt:P275", "wd:Q20007257")
-        elif self.license["id"] == "cc-by-sa-4.0":
-            new_item.add_claim("wdt:P275", "wd:Q18199165")
-        elif self.license["id"] == "cc-by-nc-sa-4.0":
-            new_item.add_claim("wdt:P275", "wd:Q42553662")
-        elif self.license["id"] == "mit-license":
-            new_item.add_claim("wdt:P275", "wd:Q334661")
+        if self.license:
+            license_id = self.license.get("id")
+            if license_id == "cc-by-4.0":
+                new_item.add_claim("wdt:P275", "wd:Q20007257")
+            elif license_id == "cc-by-sa-4.0":
+                new_item.add_claim("wdt:P275", "wd:Q18199165")
+            elif license_id == "cc-by-nc-sa-4.0":
+                new_item.add_claim("wdt:P275", "wd:Q42553662")
+            elif license_id == "mit-license":
+                new_item.add_claim("wdt:P275", "wd:Q334661")
         return new_item.write()
```
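The guarded license handling can also be written table-driven, which keeps the None check in one place (a sketch with a hypothetical helper; the license-to-QID pairs are those listed in the review comment above):

```python
from typing import Optional

# License id -> Wikibase item, replacing the if/elif chain with a lookup.
LICENSE_QIDS = {
    "cc-by-4.0": "wd:Q20007257",
    "cc-by-sa-4.0": "wd:Q18199165",
    "cc-by-nc-sa-4.0": "wd:Q42553662",
    "mit-license": "wd:Q334661",
}

def license_claim(license_field: Optional[dict]) -> Optional[str]:
    if not license_field:        # metadata may omit the license entirely
        return None
    return LICENSE_QIDS.get(license_field.get("id"))

assert license_claim(None) is None
assert license_claim({"id": "mit-license"}) == "wd:Q334661"
assert license_claim({"id": "unknown"}) is None
```

A `None` result then simply means "add no P275 claim", so missing or unknown licenses can never raise a TypeError.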
🤖 Fix all issues with AI agents
In `@mardi_importer/mardi_importer/arxiv/ArxivPublication.py`:
- Around line 282-285: The TypeError raised when validating author_factory uses
an unnecessary f-string; in the raise statement inside ArxivPublication (the
block that checks "if not callable(author_factory)"), replace the f-string
f"Could not resolve a callable Author class. Check your imports." with a plain
string "Could not resolve a callable Author class. Check your imports." (or a
shorter message if TRY003 is enforced) so Ruff F541 no longer flags the line.
- Around line 397-407: The code assumes item.get_value("wdt:P50") returns an
iterable but it may be None; update the block around author_QID in
ArxivPublication (the self.QID branch where you call item =
self.api.item.get(self.QID) and then author_QID = item.get_value("wdt:P50")) to
guard against falsy values by treating author_QID as an empty list when
None/empty (or explicitly check if author_QID before iterating) so the for QID
in author_QID loop does not raise a TypeError; ensure subsequent uses of
author_QID (and the loop that fetches author_item, name, orcid, arxiv_author_id)
only run when author_QID is present.
In `@mardi_importer/mardi_importer/cran/RPackage.py`:
- Around line 520-523: The DOI normalization inside the Zenodo branch currently
loses the first replacement because the second line reassigns zenodo_id from doi
again; update the logic in the re.search("10.5281/", doi) branch so both
transforms are applied to the same variable (use zenodo_id =
zenodo_id.replace("10.5281/zenodo.", "") after the initial zenodo_id =
doi.replace(":", ".")), and ensure the final value appended to zenodo_references
is zenodo_id.strip(); reference variables: doi, zenodo_id, zenodo_references and
the re.search("10.5281/", doi) branch.
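The corrected normalization described above — both transforms applied to the same variable — can be sketched as a small function (hypothetical helper name mirroring the described fix, not the actual RPackage.py code):

```python
import re
from typing import Optional

def zenodo_id_from_doi(doi: str) -> Optional[str]:
    """Extract a Zenodo record id from a 10.5281 DOI, or None otherwise."""
    if not re.search("10.5281/", doi):
        return None
    zenodo_id = doi.replace(":", ".")                     # first transform
    zenodo_id = zenodo_id.replace("10.5281/zenodo.", "")  # applied to the SAME variable
    return zenodo_id.strip()

assert zenodo_id_from_doi("10.5281/zenodo.1234567") == "1234567"
assert zenodo_id_from_doi("10.1000/other") is None
```

The bug being fixed is exactly the reassignment from `doi` on the second line, which would silently discard the first replacement.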
- Around line 352-357: The loop that builds current authors uses Author(...)
directly which bypasses the resolved author_factory and can raise if Author is a
module; replace the direct constructor call with author_factory(self.api,
name=author_label) (or the correct signature of author_factory) when creating
current_author inside the for over current_authors, preserving the subsequent
assignment current_author._QID = author_qid and use the same author_item lookup
(author_item = self.api.item.get(entity_id=author_qid)) and author_label
extraction as before.
- Around line 154-159: In pull() and get_versions(), avoid bare HTTP calls: add
a timeout to requests.get (e.g., timeout=10), call response.raise_for_status()
to surface non-2xx codes, and replace the bare except: with except
requests.RequestException as e to only catch network/HTTP errors; then log using
structured logging (e.g., log.warning("Package %s not found in CRAN: %s",
self.label, e)) so errors aren’t masked and timeouts prevent hangs.
- Around line 185-187: Remove redundant f-string prefixes for constant log
messages in RPackage.py: any logging calls like log.warning(...), log.info(...),
or log.debug(...) that pass a string with no interpolations should drop the
leading f (e.g., change f"Metadata table not found..." to "Metadata table not
found..."). Search for occurrences in the RPackage module (the logging calls
shown in the diff and the other reported spots) and replace only the f-prefix on
literal strings that contain no formatting expressions, leaving interpolated
f-strings unchanged.
- Around line 236-238: Remove the unnecessary f-string prefixes on the TypeError
messages in RPackage.py where no interpolation is used: replace the f-prefixed
messages such as "Could not resolve a callable Author class. Check your
imports." and the other three similar raise TypeError messages (the ones at the
other reported locations) with plain string literals (remove the leading "f") so
F541 is not triggered; locate the raise TypeError occurrences in the RPackage
class/module and update each raise TypeError(...) to use a normal string instead
of an f-string.
In `@mardi_importer/mardi_importer/crossref/CrossrefPublication.py`:
- Around line 239-241: The TypeError message in CrossrefPublication.py is using
an unnecessary f-string; locate the raise TypeError(...) statement that
constructs "Could not resolve a callable Author class. Check your imports." (the
one referencing the Author class resolution) and remove the leading "f" so the
literal is a plain string instead of an f-string.
In `@mardi_importer/mardi_importer/zenodo/ZenodoResource.py`:
- Around line 111-122: The authors property currently accesses
self.metadata["creators"] directly which can raise KeyError; change it to safely
read creators = self.metadata.get("creators") (or check 'creators' in
self.metadata) and treat a missing or None value as an empty list before
iterating, so the loop in authors and the creation of Author(self.api, ...) only
runs when creators is truthy; ensure self._authors is returned unchanged when
there are no creators.
- Around line 91-97: The publication_date property accesses
self.metadata["publication_date"] without verifying the key exists, risking
KeyError; update the property (publication_date) to first check
"publication_date" in self.metadata (or use .get()) and only run re.match and
build the ISO timestamp when the key is present and non-empty, mirroring the
defensive check used in the description property to avoid KeyError and return
None when absent.
- Around line 124-127: The resource_type property accesses nested keys
self.metadata["resource_type"]["title"] without checking for their existence,
which can raise KeyError; update the resource_type property to defensively read
these keys (e.g., check if self.metadata is truthy and if "resource_type" in
self.metadata and "title" in self.metadata["resource_type"], or use nested .get
calls or a try/except KeyError) and set self._resource_type to a safe default
(None or empty string) when missing, then return self._resource_type; refer to
the resource_type property, the _resource_type attribute, and the metadata dict
when making the change.
🧹 Nitpick comments (5)
mardi_importer/mardi_importer/utils/Author.py (2)

8-11: Global nameparser configuration is fragile.

These `remove()` calls modify global state and will raise `KeyError` if the titles don't exist in future nameparser versions. Consider wrapping in try/except or checking existence first.

```diff
-CONSTANTS.titles.remove("Mahdi")
-CONSTANTS.titles.remove("Bodhisattva")
+for title in ("Mahdi", "Bodhisattva"):
+    if title in CONSTANTS.titles:
+        CONSTANTS.titles.remove(title)
```

207-207: Inconsistent property identifier for ORCID claim.

Line 178 uses `"wdt:P496"` while this line uses `"ORCID iD"`. Use consistent property identifiers throughout to avoid confusion and potential mapping issues.

```diff
-        self._item.add_claim("ORCID iD", self.orcid)
+        self._item.add_claim("wdt:P496", self.orcid)
```
mardi_importer/mardi_importer/crossref/CrossrefPublication.py (1)

82-86: Prefer `log.exception()` when logging errors inside exception handlers.

Using `log.error()` loses the traceback. `log.exception()` automatically captures it.

Suggested fix

```diff
         except HTTPStatusError as e:
-            log.error(
-                f"Publication with doi: {self.doi} not found in Crossref: {str(e)}"
-            )
+            log.exception(
+                f"Publication with doi: {self.doi} not found in Crossref"
+            )
```
mardi_importer/mardi_importer/zenodo/ZenodoResource.py (2)

40-44: Consider adding error handling for the URL request.

The `urlopen` call can fail due to network errors, 404 responses, or invalid JSON. Currently, any of these would cause an unhandled exception during initialization.

🛡️ Proposed fix to add error handling

```diff
-        with urllib.request.urlopen(
-            f"https://zenodo.org/api/records/{self.zenodo_id}"
-        ) as url:
-            json_data = json.load(url)
-        self.metadata = json_data["metadata"]
+        try:
+            with urllib.request.urlopen(
+                f"https://zenodo.org/api/records/{self.zenodo_id}"
+            ) as url:
+                json_data = json.load(url)
+            self.metadata = json_data.get("metadata", {})
+        except (urllib.error.URLError, json.JSONDecodeError) as e:
+            print(f"Failed to fetch Zenodo metadata for {self.zenodo_id}: {e}")
+            self.metadata = {}
```

337-344: Verify `author.QID` is set after `author.create()` call.

The code assumes `author.QID` is set after calling `author.create()` on line 339, but if the creation fails, `author.QID` may still be `None`, which would then be passed to `get_claim` on line 340. Consider adding a check or fallback.

🛡️ Proposed fix

```diff
         for author in self.authors:
             if author.orcid or author.QID:
                 if not author.QID:
                     author.create()
-                claims.append(self.api.get_claim("wdt:P50", author.QID))
+                if author.QID:
+                    claims.append(self.api.get_claim("wdt:P50", author.QID))
+                elif author.name:
+                    # Fallback to author string if creation failed
+                    claims.append(self.api.get_claim("wdt:P2093", author.name))
             else:
                 if author.name:
                     claims.append(self.api.get_claim("wdt:P2093", author.name))
```
```diff
         try:
             page = requests.get(self.url)
-            soup = BeautifulSoup(page.content, 'lxml')
+            soup = BeautifulSoup(page.content, "lxml")
         except:
             log.warning(f"Package {self.label} package not found in CRAN.")
             return None
```
❓ Verification inconclusive
Script executed:
cat -n mardi_importer/mardi_importer/cran/RPackage.py | sed -n '150,165p'Repository: MaRDI4NFDI/docker-importer
Repository: MaRDI4NFDI/docker-importer
Exit code: 0
stdout:
150 attributes.
151 """
152 self.url = f"https://CRAN.R-project.org/package={self.label}"
153
154 try:
155 page = requests.get(self.url)
156 soup = BeautifulSoup(page.content, "lxml")
157 except:
158 log.warning(f"Package {self.label} package not found in CRAN.")
159 return None
160 else:
161 if soup.find_all("table"):
162 self.long_description = soup.find_all("p")[0].get_text() or ""
163 self.parse_publications(self.long_description)
164 self.long_description = re.sub("\n", "", self.long_description).strip()
165 self.long_description = re.sub("\t", "", self.long_description).strip()
Script executed:
cat -n mardi_importer/mardi_importer/cran/RPackage.py | sed -n '985,1000p'
Repository: MaRDI4NFDI/docker-importer
Exit code: 0
stdout:
985 return result
986
987 def get_versions(self):
988 url = f"https://cran.r-project.org/src/contrib/Archive/{self.label}"
989
990 try:
991 page = requests.get(url)
992 soup = BeautifulSoup(page.content, "lxml")
993 except:
994 log.warning(f"Version page for package {self.label} not found.")
995 else:
996 if soup.find_all("table"):
997 table = soup.find_all("table")[0]
998 versions_df = pd.read_html(StringIO(str(table)))
999 versions_df = versions_df[0]
1000 versions_df = versions_df.drop(
Script executed:
# Check if requests is imported and what timeout patterns are used elsewhere
rg -n "import requests|from requests" mardi_importer/mardi_importer/cran/RPackage.py
Repository: MaRDI4NFDI/docker-importer
Exit code: 0
stdout:
16:import requests
Add timeouts and narrow exception handling for CRAN HTTP requests.
Both pull() (lines 154–159) and get_versions() (lines 990–994) make unguarded HTTP requests without timeouts, which can cause indefinite hangs. The bare except: clauses also mask real failures. Use a timeout, call raise_for_status(), and catch requests.RequestException specifically. Additionally, replace f-string logging with structured format arguments for consistency.
🛠️ Proposed fix
- page = requests.get(self.url)
- soup = BeautifulSoup(page.content, "lxml")
- except:
- log.warning(f"Package {self.label} package not found in CRAN.")
+ page = requests.get(self.url, timeout=30)
+ page.raise_for_status()
+ soup = BeautifulSoup(page.content, "lxml")
+ except requests.RequestException as exc:
+ log.warning("Package %s package not found in CRAN: %s", self.label, exc)
@@
- page = requests.get(url)
- soup = BeautifulSoup(page.content, "lxml")
- except:
- log.warning(f"Version page for package {self.label} not found.")
+ page = requests.get(url, timeout=30)
+ page.raise_for_status()
+ soup = BeautifulSoup(page.content, "lxml")
+ except requests.RequestException as exc:
+ log.warning("Version page for package %s not found: %s", self.label, exc)
🧰 Tools
🪛 Ruff (0.14.14)
[error] 155-155: Probable use of requests call without timeout
(S113)
[error] 157-157: Do not use bare except
(E722)
🤖 Prompt for AI Agents
In `@mardi_importer/mardi_importer/cran/RPackage.py` around lines 154 - 159, In
pull() and get_versions(), avoid bare HTTP calls: add a timeout to requests.get
(e.g., timeout=10), call response.raise_for_status() to surface non-2xx codes,
and replace the bare except: with except requests.RequestException as e to only
catch network/HTTP errors; then log using structured logging (e.g.,
log.warning("Package %s not found in CRAN: %s", self.label, e)) so errors aren’t
masked and timeouts prevent hangs.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@mardi_importer/mardi_importer/arxiv/ArxivPublication.py`:
- Around line 502-524: The return type of __preprocess_authors is incorrect: it
declares List[str] but returns claim objects from self.api.get_claim; change the
annotation to List[Any] to match CrossrefPublication.__preprocess_authors and
actual return values, and add Any to the module imports so the type is
available; update the signature of def __preprocess_authors(self) -> List[Any]:
and ensure the import line includes Any.
🧹 Nitpick comments (1)
mardi_importer/mardi_importer/crossref/CrossrefPublication.py (1)
83-85: Use `logging.exception` or `%s` formatting for exception logging. When logging exceptions, prefer `log.exception()` (which automatically includes the traceback) or use `%s` formatting instead of `str(e)` for lazy evaluation.
Suggested fix

```diff
-            log.error(
-                f"Publication with doi: {self.doi} not found in Crossref: {str(e)}"
-            )
+            log.exception(
+                "Publication with doi: %s not found in Crossref", self.doi
+            )
```
```diff
     def __preprocess_authors(self) -> List[str]:
         """Processes the author information of each publication.

         Create the author if it does not exist already as an
         entity in wikibase.

+        If an author has no ORCID and no existing QID, store the author as
+        an author name string (P2093) instead of creating an author item.
+
         Returns:
-            List[str]:
-                QIDs corresponding to each author.
+            List:
+                Author claims to be added (P50 entity claims and/or P2093 string claims).
         """
         log = get_logger_safe(__name__)
-        author_QID = []
+        claims = []
         for author in self.authors:
-            if not author.QID:
-                log.debug(f"Creating author: {author}")
-                author.create()
-            author_QID.append(author.QID)
-        return author_QID
+            if author.orcid or author.arxiv_id or author.QID:
+                if not author.QID:
+                    author.create()
+                claims.append(self.api.get_claim("wdt:P50", author.QID))
+            else:
+                if author.name:
+                    claims.append(self.api.get_claim("wdt:P2093", author.name))
+        return claims
```
Return type annotation is inaccurate.
The method signature declares List[str] but actually returns a list of claim objects (from self.api.get_claim()). This should match the CrossrefPublication.__preprocess_authors() signature which uses List[Any].
Fix
+from typing import List, Optional, Any
+
- def __preprocess_authors(self) -> List[str]:
+ def __preprocess_authors(self) -> List[Any]:
Note: You'll also need to add `Any` to the imports at line 3.
🤖 Prompt for AI Agents
In `@mardi_importer/mardi_importer/arxiv/ArxivPublication.py` around lines 502 -
524, The return type of __preprocess_authors is incorrect: it declares List[str]
but returns claim objects from self.api.get_claim; change the annotation to
List[Any] to match CrossrefPublication.__preprocess_authors and actual return
values, and add Any to the module imports so the type is available; update the
signature of def __preprocess_authors(self) -> List[Any]: and ensure the import
line includes Any.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
mardi_importer/mardi_importer/zenodo/ZenodoResource.py (1)
218-236: ⚠️ Potential issue | 🟠 Major
Guard `license` before indexing in `update()`.
If the record has no license, this will crash.
Suggested fix

```diff
-        if self.license["id"] == "cc-by-4.0":
+        license_id = self.license.get("id") if self.license else None
+        if license_id == "cc-by-4.0":
             new_item.add_claim("wdt:P275", "wd:Q20007257")
-        elif self.license["id"] == "cc-by-sa-4.0":
+        elif license_id == "cc-by-sa-4.0":
             new_item.add_claim("wdt:P275", "wd:Q18199165")
-        elif self.license["id"] == "cc-by-nc-sa-4.0":
+        elif license_id == "cc-by-nc-sa-4.0":
             new_item.add_claim("wdt:P275", "wd:Q42553662")
-        elif self.license["id"] == "mit-license":
+        elif license_id == "mit-license":
             new_item.add_claim("wdt:P275", "wd:Q334661")
```
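The guarded `elif` chain could equally be a lookup table, which makes the None-safety explicit. A sketch — the QID mapping is copied from the diff above, and `license_claim_target` is a hypothetical helper, not part of the importer:

```python
# License id -> Wikidata QID mapping, as used in the diff above.
LICENSE_QIDS = {
    "cc-by-4.0": "wd:Q20007257",
    "cc-by-sa-4.0": "wd:Q18199165",
    "cc-by-nc-sa-4.0": "wd:Q42553662",
    "mit-license": "wd:Q334661",
}


def license_claim_target(license_field):
    """Return the mapped QID for a record's license dict, or None when the
    license is missing or unrecognized."""
    license_id = license_field.get("id") if license_field else None
    return LICENSE_QIDS.get(license_id)
```

A dict lookup also keeps future license additions to a one-line change.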
🤖 Fix all issues with AI agents
In `@mardi_importer/mardi_importer/cran/RPackage.py`:
- Around line 1009-1011: Replace the faulty regex that strips the archive suffix
in RPackage (around the code setting variable `version` where `self.label` is
removed) so it only removes the literal ".tar.gz" at the end; specifically
change the `re.sub(".tar.gz", "", version)` call to use a literal-dot,
end-anchored pattern such as `re.sub(r"\.tar\.gz$", "", version)` (or use a safe
endswith/slicing approach) to avoid corrupting version strings.
🧹 Nitpick comments (1)
mardi_importer/mardi_importer/zenodo/ZenodoResource.py (1)
344-366: Fix return type annotation for claim objects.
The method returns claim objects, not strings.
Suggested fix

```diff
-def __preprocess_authors(self) -> List[str]:
+def __preprocess_authors(self) -> List[object]:
```
```python
                version = re.sub(f"{self.label}_", "", name)
                version = re.sub(".tar.gz", "", version)
```
🧩 Analysis chain
🏁 Script executed:
cat -n mardi_importer/mardi_importer/cran/RPackage.py | sed -n '1000,1020p'
Repository: MaRDI4NFDI/docker-importer
Length of output: 931
🏁 Script executed:
# Let me verify the regex behavior with a quick test
python3 << 'EOF'
import re
# Test the buggy regex
test_version = "3.4.0.tar.gz"
buggy_result = re.sub(".tar.gz", "", test_version)
print(f"Buggy regex: '{test_version}' -> '{buggy_result}'")
# Test with another example
test_version2 = "1.2.3.tar.gz"
buggy_result2 = re.sub(".tar.gz", "", test_version2)
print(f"Buggy regex: '{test_version2}' -> '{buggy_result2}'")
# Test the fixed regex
fixed_result = re.sub(r"\.tar\.gz$", "", test_version)
print(f"Fixed regex: '{test_version}' -> '{fixed_result}'")
fixed_result2 = re.sub(r"\.tar\.gz$", "", test_version2)
print(f"Fixed regex: '{test_version2}' -> '{fixed_result2}'")
EOF
Repository: MaRDI4NFDI/docker-importer
Length of output: 225
🏁 Script executed:
# Test with more edge cases to see if the regex can fail
python3 << 'EOF'
import re
# Test cases that might expose the bug
test_cases = [
"3.4.0.tar.gz", # Normal case
"1.tar.gz", # Minimal case
"1tar.gz", # No dot before tar
"tar.gz.tar.gz", # Double occurrence (shouldn't happen but...)
"0.0.0.tar.gz", # All zeros
"3.4tar.gz", # Missing dot before tar
]
print("Testing buggy regex: \".tar.gz\"")
for test in test_cases:
result = re.sub(".tar.gz", "", test)
print(f" '{test}' -> '{result}'")
print("\nTesting fixed regex: r\"\.tar\.gz$\"")
for test in test_cases:
result = re.sub(r"\.tar\.gz$", "", test)
print(f" '{test}' -> '{result}'")
EOF
Repository: MaRDI4NFDI/docker-importer
Length of output: 442
Fix .tar.gz stripping to prevent version corruption.
re.sub(".tar.gz", "", ...) is buggy because . matches any character in regex, not a literal dot. For versions with patterns like 1tar.gz or 3.4tar.gz, the regex will incorrectly remove characters mid-string (e.g., 3.4tar.gz becomes 3. instead of 3.4). Use re.sub(r"\.tar\.gz$", "", version) to match the literal extension only at the end.
✅ Suggested fix
- version = re.sub(".tar.gz", "", version)
+ version = re.sub(r"\.tar\.gz$", "", version)
🤖 Prompt for AI Agents
In `@mardi_importer/mardi_importer/cran/RPackage.py` around lines 1009 - 1011,
Replace the faulty regex that strips the archive suffix in RPackage (around the
code setting variable `version` where `self.label` is removed) so it only
removes the literal ".tar.gz" at the end; specifically change the
`re.sub(".tar.gz", "", version)` call to use a literal-dot, end-anchored pattern
such as `re.sub(r"\.tar\.gz$", "", version)` (or use a safe endswith/slicing
approach) to avoid corrupting version strings.
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In `@mardi_importer/mardi_importer/arxiv/ArxivPublication.py`:
- Around line 517-520: The code assumes author.create() always sets author.QID
before calling self.api.get_claim("wdt:P50", author.QID); add a defensive check
after calling author.create() to verify author.QID is truthy and only append the
P50 claim when it exists, otherwise append the author name claim using
self.api.get_claim("p:P2093", author.name) (matching the existing else-branch
pattern); update the block around author.create(), author.QID, and
self.api.get_claim to handle the creation-failure case gracefully.
In `@mardi_importer/mardi_importer/zenodo/ZenodoResource.py`:
- Around line 68-69: The code iterates "for QID in author_QID" after assigning
author_QID = item.get_value("wdt:P50") without a defensive guard; change the
assignment to use a fallback list (e.g., author_QID = item.get_value("wdt:P50")
or []) so that if get_value returns None the loop won't raise a TypeError—apply
this change in ZenodoResource.py where author_QID and the subsequent for QID in
author_QID are used.
- Around line 363-366: The code can pass None to self.api.get_claim("wdt:P50",
author.QID) if author.create() fails to set author.QID; update the block around
author.create() in ZenodoResource (the author.orcid/orcid check and claims
append) to verify author.QID is truthy after calling author.create() and only
call self.api.get_claim when author.QID is present—otherwise skip adding the
claim (or log/raise as appropriate) to avoid creating a claim with a None value.
- Around line 274-275: The property lookup is using prefixed IDs as labels when
calling get_local_id_by_label for author claims; update the two calls that set
author_item_prop and author_string_prop (in ZenodoResource.py) to pass the
correct human-readable labels instead of "wdt:P50" and "wdt:P2093" — use
"author" for P50 and "stated as" for P2093 — or, if you prefer the ID route
consistent with other usages, change the calls to use the prefixed IDs as the
lookup key in the same style as elsewhere; ensure you modify the calls to
get_local_id_by_label(author/“stated as”) or use "wdt:P50"/"wdt:P2093"
accordingly so the lookups return the correct properties.
🧹 Nitpick comments (2)
mardi_importer/mardi_importer/zenodo/ZenodoResource.py (2)
36-45: Consider extracting repeated author factory resolution. The author factory resolution logic (checking `hasattr(Author, "Author")`, validating callable) is duplicated in `__post_init__` (lines 36-45) and the `authors` property (lines 124-133). The same pattern also appears in `ArxivPublication.py`. Extracting this to a shared utility function would reduce duplication. Also applies to: 124-133
hasattr(Author, "Author"), validating callable) is duplicated in__post_init__(lines 36-45) and theauthorsproperty (lines 124-133). The same pattern also appears inArxivPublication.py. Extracting this to a shared utility function would reduce duplication.Also applies to: 124-133
348-370: Return type annotation is inaccurate.Same as in
ArxivPublication.py, the method declaresList[str]but returns claim objects. Should beList[Any].♻️ Proposed fix
- def __preprocess_authors(self) -> List[str]: + def __preprocess_authors(self) -> List[Any]:Also add
Anyto the imports:-from typing import Dict, List, Optional +from typing import Any, Dict, List, Optional
```python
            if author.orcid or author.arxiv_id or author.QID:
                if not author.QID:
                    author.create()
                claims.append(self.api.get_claim("wdt:P50", author.QID))
```
🧩 Analysis chain
🏁 Script executed:
# First, find the Author class definition
find . -type f -name "*.py" | xargs grep -l "class Author"
Repository: MaRDI4NFDI/docker-importer
Length of output: 197
🏁 Script executed:
# Search for the create() method in Author class
rg "def create\(" --type py -A 20 | head -100
Repository: MaRDI4NFDI/docker-importer
Length of output: 7141
🏁 Script executed:
# Look at imports and Author class location in ArxivPublication.py
head -50 mardi_importer/mardi_importer/arxiv/ArxivPublication.py
Repository: MaRDI4NFDI/docker-importer
Length of output: 896
🏁 Script executed:
sed -n '/def create(self):/,/^    def /p' mardi_importer/mardi_importer/utils/Author.py | head -100
Repository: MaRDI4NFDI/docker-importer
Length of output: 1863
🏁 Script executed:
# Get the full Author class to understand the structure
wc -l mardi_importer/mardi_importer/utils/Author.py
Repository: MaRDI4NFDI/docker-importer
Length of output: 119
🏁 Script executed:
# View the complete create() method
awk '/def create\(self\):/,/^    def [a-z_]/' mardi_importer/mardi_importer/utils/Author.py
Repository: MaRDI4NFDI/docker-importer
Length of output: 91
🏁 Script executed:
# View the context around lines 517-520 in ArxivPublication.py
sed -n '500,530p' mardi_importer/mardi_importer/arxiv/ArxivPublication.py
Repository: MaRDI4NFDI/docker-importer
Length of output: 992
🏁 Script executed:
# Check the entire method that contains lines 517-520
grep -n "def.*preprocess_authors" mardi_importer/mardi_importer/arxiv/ArxivPublication.py
Repository: MaRDI4NFDI/docker-importer
Length of output: 122
🏁 Script executed:
# Find the method containing these lines
awk 'NR>=510 && NR<=530' mardi_importer/mardi_importer/arxiv/ArxivPublication.py
Repository: MaRDI4NFDI/docker-importer
Length of output: 613
🏁 Script executed:
# Check if there's any exception handling around author.create() calls in the codebase
rg "try|except" mardi_importer/mardi_importer/arxiv/ArxivPublication.py -B 5 -A 5
Repository: MaRDI4NFDI/docker-importer
Length of output: 3668
🏁 Script executed:
# Look at the complete Author class initialization and property definition
cat -n mardi_importer/mardi_importer/utils/Author.py | head -80
Repository: MaRDI4NFDI/docker-importer
Length of output: 3078
🏁 Script executed:
# Find the QID property getter in the Author class
grep -n "def QID\|@property" mardi_importer/mardi_importer/utils/Author.py | head -20
Repository: MaRDI4NFDI/docker-importer
Length of output: 135
🏁 Script executed:
# Check for exception handling or error cases in _item.write()
rg "\.write\(\)" mardi_importer/mardi_importer/utils/Author.py -B 2 -A 2
Repository: MaRDI4NFDI/docker-importer
Length of output: 578
🏁 Script executed:
# Look at test cases to see if create() can fail or return None
grep -rn "author.create()" tests/ --type py -B 3 -A 3
Repository: MaRDI4NFDI/docker-importer
Length of output: 187
🏁 Script executed:
# Get the QID property definition
sed -n '136,145p' mardi_importer/mardi_importer/utils/Author.py
Repository: MaRDI4NFDI/docker-importer
Length of output: 328
🏁 Script executed:
# Check if _item.write() could return something without a valid id
rg "class MardiItem|def write" --type py -A 3 | head -50
Repository: MaRDI4NFDI/docker-importer
Length of output: 1018
🏁 Script executed:
# Look at the entire create() method again with focus on error handling
sed -n '150,195p' mardi_importer/mardi_importer/utils/Author.py
Repository: MaRDI4NFDI/docker-importer
Length of output: 1916
🏁 Script executed:
# Look at complete Author.create() implementation to understand all paths
sed -n '147,210p' mardi_importer/mardi_importer/utils/Author.py
Repository: MaRDI4NFDI/docker-importer
Length of output: 2424
🏁 Script executed:
# Get the rest of the create() method to see the complete implementation
sed -n '200,217p' mardi_importer/mardi_importer/utils/Author.py
Repository: MaRDI4NFDI/docker-importer
Length of output: 518
🏁 Script executed:
# Check if there are any test cases or usage examples of author.create()
rg "author\.create\(\)" --type py -B 2 -A 5 | head -80
Repository: MaRDI4NFDI/docker-importer
Length of output: 6811
🏁 Script executed:
# Check if MardiClient API might throw exceptions on write()
find . -name "*.py" -path "*mardiclient*" -o -name "*mardiclient*" | head -5
Repository: MaRDI4NFDI/docker-importer
Length of output: 52
Add defensive check after author.create() to handle creation failures.
The author.create() method should set author.QID on success, but if the API call fails without exception handling, author.QID could remain None. Add a guard before using the QID, with a fallback to the author name claim (P2093), consistent with the pattern already used in the else branch.
Suggested fix
if author.orcid or author.arxiv_id or author.QID:
if not author.QID:
author.create()
- claims.append(self.api.get_claim("wdt:P50", author.QID))
+ if author.QID:
+ claims.append(self.api.get_claim("wdt:P50", author.QID))
+ elif author.name:
+ claims.append(self.api.get_claim("wdt:P2093", author.name))🤖 Prompt for AI Agents
In `@mardi_importer/mardi_importer/arxiv/ArxivPublication.py` around lines 517 -
520, The code assumes author.create() always sets author.QID before calling
self.api.get_claim("wdt:P50", author.QID); add a defensive check after calling
author.create() to verify author.QID is truthy and only append the P50 claim
when it exists, otherwise append the author name claim using
self.api.get_claim("p:P2093", author.name) (matching the existing else-branch
pattern); update the block around author.create(), author.QID, and
self.api.get_claim to handle the creation-failure case gracefully.
```python
        author_QID = item.get_value("wdt:P50")
        for QID in author_QID:
```
Missing defensive guard for P50 iteration.
Unlike ArxivPublication.py (line 400) which now uses item.get_value("wdt:P50") or [], this line lacks the defensive guard. If get_value("wdt:P50") returns None, the subsequent for QID in author_QID will raise TypeError.
🛡️ Proposed fix
- author_QID = item.get_value("wdt:P50")
+ author_QID = item.get_value("wdt:P50") or []
for QID in author_QID:📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
-        author_QID = item.get_value("wdt:P50")
+        author_QID = item.get_value("wdt:P50") or []
         for QID in author_QID:
```
🤖 Prompt for AI Agents
In `@mardi_importer/mardi_importer/zenodo/ZenodoResource.py` around lines 68 - 69,
The code iterates "for QID in author_QID" after assigning author_QID =
item.get_value("wdt:P50") without a defensive guard; change the assignment to
use a fallback list (e.g., author_QID = item.get_value("wdt:P50") or []) so that
if get_value returns None the loop won't raise a TypeError—apply this change in
ZenodoResource.py where author_QID and the subsequent for QID in author_QID are
used.
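The `or []` guard generalizes to any accessor that may return `None`. A tiny illustration — `get_value` below is a stand-in for `item.get_value("wdt:P50")`, not the real client API:

```python
def collect_author_qids(get_value):
    """`get_value` may return a list of QIDs or None; never raise on None."""
    author_QID = get_value("wdt:P50") or []  # None collapses to an empty list
    return [qid for qid in author_QID]
```

Iterating the empty list is then a harmless no-op instead of a `TypeError`.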
```python
        author_item_prop = self.api.get_local_id_by_label("wdt:P50", "property")
        author_string_prop = self.api.get_local_id_by_label("wdt:P2093", "property")
```
🧩 Analysis chain
🏁 Script executed:
rg -n "get_local_id_by_label" --type py -A 2 -B 1
Repository: MaRDI4NFDI/docker-importer
Length of output: 16459
🏁 Script executed:
rg -i "author|stated" --type py mardi_importer/mardi_importer/ -B 2 -A 2 | head -100
Repository: MaRDI4NFDI/docker-importer
Length of output: 8632
🏁 Script executed:
# Check the API implementation to understand get_local_id_by_label behavior
find . -type f -name "*.py" -exec grep -l "def get_local_id_by_label" {} \;
Repository: MaRDI4NFDI/docker-importer
Length of output: 52
🏁 Script executed:
head -30 mardi_importer/mardi_importer/zenodo/ZenodoResource.py
Repository: MaRDI4NFDI/docker-importer
Length of output: 1009
🏁 Script executed:
# Search for any comments or documentation about P50 and P2093 in the codebase
rg -i "P50|P2093" --type py -B 2 -A 2
Repository: MaRDI4NFDI/docker-importer
Length of output: 16127
🏁 Script executed:
# Check how other properties are retrieved and used, especially nearby in the same file
sed -n '260,285p' mardi_importer/mardi_importer/zenodo/ZenodoResource.py
Repository: MaRDI4NFDI/docker-importer
Length of output: 1104
Fix incorrect property lookup for author claims.
The calls on lines 274-275 pass prefixed property IDs ("wdt:P50", "wdt:P2093") to get_local_id_by_label, which expects human-readable labels. All other usages in the codebase pass labels like "description", "community", etc. For these author properties, use the correct labels: "author" for P50 and "stated as" for P2093. Alternatively, use the prefixed IDs directly (e.g., "wdt:P50") as done elsewhere in the codebase for this same operation.
🤖 Prompt for AI Agents
In `@mardi_importer/mardi_importer/zenodo/ZenodoResource.py` around lines 274 -
275, The property lookup is using prefixed IDs as labels when calling
get_local_id_by_label for author claims; update the two calls that set
author_item_prop and author_string_prop (in ZenodoResource.py) to pass the
correct human-readable labels instead of "wdt:P50" and "wdt:P2093" — use
"author" for P50 and "stated as" for P2093 — or, if you prefer the ID route
consistent with other usages, change the calls to use the prefixed IDs as the
lookup key in the same style as elsewhere; ensure you modify the calls to
get_local_id_by_label(author/“stated as”) or use "wdt:P50"/"wdt:P2093"
accordingly so the lookups return the correct properties.
```python
            if author.orcid or author.QID:
                if not author.QID:
                    author.create()
                claims.append(self.api.get_claim("wdt:P50", author.QID))
```
Potential None claim value if author creation fails.
Same issue as in ArxivPublication.py: if author.create() fails to populate author.QID, line 366 will create a claim with None value.
🛡️ Proposed fix
if author.orcid or author.QID:
if not author.QID:
author.create()
- claims.append(self.api.get_claim("wdt:P50", author.QID))
+ if author.QID:
+ claims.append(self.api.get_claim("wdt:P50", author.QID))
+ elif author.name:
+ claims.append(self.api.get_claim("wdt:P2093", author.name))
🤖 Prompt for AI Agents
In `@mardi_importer/mardi_importer/zenodo/ZenodoResource.py` around lines 363 -
366, The code can pass None to self.api.get_claim("wdt:P50", author.QID) if
author.create() fails to set author.QID; update the block around author.create()
in ZenodoResource (the author.orcid/orcid check and claims append) to verify
author.QID is truthy after calling author.create() and only call
self.api.get_claim when author.QID is present—otherwise skip adding the claim
(or log/raise as appropriate) to avoid creating a claim with a None value.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@mardi_importer/mardi_importer/crossref/CrossrefPublication.py`:
- Around line 386-397: The loop in CrossrefPublication that builds author claims
may append a P50 claim with a None QID when author.create() doesn't set
author.QID; update the block handling authors with orcid/QID so that after
calling author.create() you re-check author.QID and only call
self.api.get_claim("wdt:P50", author.QID) if author.QID is truthy, otherwise
fall back to adding self.api.get_claim("wdt:P2093", author.name) (if author.name
exists); ensure you reference author.create(), author.QID, and
self.api.get_claim(...) in the updated logic and optionally add a debug/warning
log when falling back to P2093.
- Around line 149-162: The code in CrossrefPublication.py only sets
self.proceedings_year when metadata["created"]["date-parts"][0] has length > 1,
so publications with only a year skip setting proceedings_year; change the logic
in the block that handles metadata["created"]["date-parts"] (the code that sets
self.proceedings_month and self.proceedings_year) so the month is set only when
an index 1 exists but the year is always set when index 0 exists (i.e., check
len(...) > 0 or >= 1 for setting self.proceedings_year while keeping the
len(...) > 1 check for self.proceedings_month and its zero-pad logic).
- Around line 248-260: The ORCID extraction in CrossrefPublication (inside the
block that checks "ORCID" in author.keys()) uses re.findall(...)[0] which can
raise IndexError for malformed values; modify this to safely check the findall
result (or use re.search) before indexing and only set orcid_id and call
author_factory(self.api, name=author_label, orcid=orcid_id) when a match
exists—otherwise fall back to appending the name-only author via
author_factory(self.api, name=author_label) so self.authors is not populated
with a failing entry.
```diff
             if "created" in metadata.keys():
                 if "date-parts" in metadata["created"].keys():
                     if len(metadata["created"]["date-parts"][0]) > 1:
-                        self.proceedings_month = str(metadata['created']['date-parts'][0][1])
+                        self.proceedings_month = str(
+                            metadata["created"]["date-parts"][0][1]
+                        )
                         if len(self.proceedings_month) == 1:
-                            self.proceedings_month = "0" + self.proceedings_month
-                    if len(metadata['created']['date-parts'][0]) > 1:
-                        self.proceedings_year = str(metadata['created']['date-parts'][0][0])
-        elif metadata['type'] == 'book-chapter':
-            self.instance = 'wd:Q1980247'
-            self.description = 'book chapter'
+                            self.proceedings_month = (
+                                "0" + self.proceedings_month
+                            )
+                    if len(metadata["created"]["date-parts"][0]) > 1:
+                        self.proceedings_year = str(
+                            metadata["created"]["date-parts"][0][0]
+                        )
```
Ensure proceedings year is captured when only the year is provided.
Line 159 currently skips setting proceedings_year if the date has only a year (length 1), which later prevents a P577 claim.
🔧 Suggested fix
- if len(metadata["created"]["date-parts"][0]) > 1:
+ if len(metadata["created"]["date-parts"][0]) > 0:
self.proceedings_year = str(
metadata["created"]["date-parts"][0][0]
)🤖 Prompt for AI Agents
In `@mardi_importer/mardi_importer/crossref/CrossrefPublication.py` around lines
149 - 162, The code in CrossrefPublication.py only sets self.proceedings_year
when metadata["created"]["date-parts"][0] has length > 1, so publications with
only a year skip setting proceedings_year; change the logic in the block that
handles metadata["created"]["date-parts"] (the code that sets
self.proceedings_month and self.proceedings_year) so the month is set only when
an index 1 exists but the year is always set when index 0 exists (i.e., check
len(...) > 0 or >= 1 for setting self.proceedings_year while keeping the
len(...) > 1 check for self.proceedings_month and its zero-pad logic).
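The corrected rule — year whenever index 0 exists, month only when index 1 exists — can be isolated in a small helper. `parse_date_parts` is illustrative, not part of the importer; it only mirrors the shape of Crossref's `date-parts` field:

```python
def parse_date_parts(date_parts):
    """Return (year, month) strings from a Crossref-style date-parts value.

    `date_parts` looks like [[2021, 3]] or [[2021]]; the month is
    zero-padded and None when absent, the year is None only when the
    entry is empty.
    """
    parts = date_parts[0] if date_parts else []
    year = str(parts[0]) if len(parts) > 0 else None
    month = str(parts[1]).zfill(2) if len(parts) > 1 else None
    return year, month
```

A year-only record then still yields a year (and hence a P577 claim), instead of being skipped.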
```diff
                     if "ORCID" in author.keys():
                         orcid_id = re.findall(
                             "\d{4}-\d{4}-\d{4}-.{4}", author["ORCID"]
                         )[0]
                         self.authors.append(
                             author_factory(
                                 self.api, name=author_label, orcid=orcid_id
                             )
                         )
                     else:
-                        self.authors.append(author_factory(self.api, name=author_label))
+                        self.authors.append(
+                            author_factory(self.api, name=author_label)
+                        )
```
Guard ORCID extraction to avoid IndexError on malformed values.
re.findall(...)[0] will raise if the ORCID string doesn’t match the pattern. Add a safe fallback to treat it as a name-only author.
🛡️ Proposed fix
- orcid_id = re.findall(
- "\d{4}-\d{4}-\d{4}-.{4}", author["ORCID"]
- )[0]
- self.authors.append(
- author_factory(
- self.api, name=author_label, orcid=orcid_id
- )
- )
+ orcid_matches = re.findall(
+ r"\d{4}-\d{4}-\d{4}-.{4}", author["ORCID"]
+ )
+ if orcid_matches:
+ orcid_id = orcid_matches[0]
+ self.authors.append(
+ author_factory(
+ self.api, name=author_label, orcid=orcid_id
+ )
+ )
+ else:
+ log.warning(
+ "Invalid ORCID format for author %s: %s",
+ author_label,
+ author["ORCID"],
+ )
+ self.authors.append(
+ author_factory(self.api, name=author_label)
+ )
🤖 Prompt for AI Agents
In `@mardi_importer/mardi_importer/crossref/CrossrefPublication.py` around lines
248 - 260, The ORCID extraction in CrossrefPublication (inside the block that
checks "ORCID" in author.keys()) uses re.findall(...)[0] which can raise
IndexError for malformed values; modify this to safely check the findall result
(or use re.search) before indexing and only set orcid_id and call
author_factory(self.api, name=author_label, orcid=orcid_id) when a match
exists—otherwise fall back to appending the name-only author via
author_factory(self.api, name=author_label) so self.authors is not populated
with a failing entry.
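The safe-extraction pattern suggested above can be sketched standalone with `re.search`; `extract_orcid` is a hypothetical helper, not the importer's code:

```python
import re

# Same ORCID shape the review comment quotes: four dash-separated groups.
ORCID_RE = re.compile(r"\d{4}-\d{4}-\d{4}-.{4}")


def extract_orcid(value):
    """Return the first ORCID-shaped substring of `value`, or None
    instead of raising IndexError when nothing matches."""
    match = ORCID_RE.search(value)
    return match.group(0) if match else None
```

The caller can then branch on `None` and fall back to a name-only author.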
| for author in self.authors: | ||
| if not author.QID: | ||
| log.debug(f"Creating author item for: {author}") | ||
| author.create() | ||
| log.debug(f"Created author with QID: {author.QID}") | ||
| author_QID.append(author.QID) | ||
| return author_QID | ||
| if author.orcid or author.QID: | ||
| if not author.QID: | ||
| log.debug(f"Creating author item for: {author}") | ||
| author.create() | ||
| log.debug(f"Created author with QID: {author.QID}") | ||
| claims.append(self.api.get_claim("wdt:P50", author.QID)) | ||
| else: | ||
| # if it does not exist yet and there is not orcid, it should be a name string | ||
| if author.name: | ||
| claims.append(self.api.get_claim("wdt:P2093", author.name)) | ||
| return claims |
Fallback when author creation doesn’t yield a QID.
If author.create() fails to set author.QID, the current flow still adds a P50 claim with None. Instead, fall back to P2093 when QID is missing after creation.
🔧 Suggested fix
```diff
 for author in self.authors:
     if author.orcid or author.QID:
         if not author.QID:
             log.debug(f"Creating author item for: {author}")
             author.create()
             log.debug(f"Created author with QID: {author.QID}")
-        claims.append(self.api.get_claim("wdt:P50", author.QID))
+        if author.QID:
+            claims.append(self.api.get_claim("wdt:P50", author.QID))
+        elif author.name:
+            log.warning(
+                "Author QID missing after create; using name string for %s",
+                author,
+            )
+            claims.append(self.api.get_claim("wdt:P2093", author.name))
     else:
         # if it does not exist yet and there is not orcid, it should be a name string
         if author.name:
             claims.append(self.api.get_claim("wdt:P2093", author.name))
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```suggestion
for author in self.authors:
    if author.orcid or author.QID:
        if not author.QID:
            log.debug(f"Creating author item for: {author}")
            author.create()
            log.debug(f"Created author with QID: {author.QID}")
        if author.QID:
            claims.append(self.api.get_claim("wdt:P50", author.QID))
        elif author.name:
            log.warning(
                "Author QID missing after create; using name string for %s",
                author,
            )
            claims.append(self.api.get_claim("wdt:P2093", author.name))
    else:
        # if it does not exist yet and there is not orcid, it should be a name string
        if author.name:
            claims.append(self.api.get_claim("wdt:P2093", author.name))
return claims
```
🤖 Prompt for AI Agents
In `@mardi_importer/mardi_importer/crossref/CrossrefPublication.py` around lines
386 - 397, The loop in CrossrefPublication that builds author claims may append
a P50 claim with a None QID when author.create() doesn't set author.QID; update
the block handling authors with orcid/QID so that after calling author.create()
you re-check author.QID and only call self.api.get_claim("wdt:P50", author.QID)
if author.QID is truthy, otherwise fall back to adding
self.api.get_claim("wdt:P2093", author.name) (if author.name exists); ensure you
reference author.create(), author.QID, and self.api.get_claim(...) in the
updated logic and optionally add a debug/warning log when falling back to P2093.
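The P50/P2093 fallback can be exercised in isolation with stub objects — `FakeApi`, the `Author` stub, and `build_author_claims` below are illustrative stand-ins for the importer's real classes, not its actual API:

```python
class FakeApi:
    """Stand-in for the Wikibase API: a claim is just a (property, value) pair."""
    def get_claim(self, prop, value):
        return (prop, value)

class Author:
    """Stub author whose item creation may or may not yield a QID."""
    def __init__(self, name=None, orcid=None, QID=None, creatable=True):
        self.name, self.orcid, self.QID = name, orcid, QID
        self._creatable = creatable
    def create(self):
        if self._creatable:
            self.QID = "Q1"  # simulate a successful item creation

def build_author_claims(api, authors):
    claims = []
    for author in authors:
        if author.orcid or author.QID:
            if not author.QID:
                author.create()
            if author.QID:
                claims.append(api.get_claim("wdt:P50", author.QID))
            elif author.name:
                # creation failed to yield a QID: fall back to the name string
                claims.append(api.get_claim("wdt:P2093", author.name))
        elif author.name:
            # no identifiers at all: author name string
            claims.append(api.get_claim("wdt:P2093", author.name))
    return claims

api = FakeApi()
authors = [
    Author(name="A. Known", QID="Q42"),                 # existing item -> P50
    Author(name="B. New", orcid="0000-0002-1825-0097"), # created -> P50
    Author(name="C. Fails", orcid="0000-0001-0000-0000",
           creatable=False),                            # creation fails -> P2093
    Author(name="D. Plain"),                            # no identifiers -> P2093
]
print(build_author_claims(api, authors))
```

This covers all four paths the review distinguishes: a pre-existing QID, a successful creation, a creation that leaves `QID` unset, and an author with no identifiers at all.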
If no author identifiers are present, use author string
Summary by CodeRabbit
Improvements
Bug Fixes