Releases: opencitations/ref-matcher

Ref-Matcher

17 Nov 13:27

This release of the Reference Matching Tool introduces a comprehensive, production-ready system for automated bibliographic reference matching against the OpenCitations Meta database.
The tool combines SPARQL-based queries, fuzzy matching algorithms, and optional GROBID integration to achieve high-precision matching of academic citations from both structured (Crossref JSON) and semi-structured (TEI XML) sources.
The system implements a cascading query strategy with multiple fallback mechanisms, achieving robust matching rates across diverse academic disciplines while maintaining strict quality thresholds and detailed provenance tracking.

The matching engine employs a six-tier cascading SPARQL query strategy that progresses from high-precision DOI-based queries, through author-title combinations, to bibliometric triples drawn from year, author, page, and volume. The queries execute sequentially, with early stopping once the matching threshold is reached, optimizing both performance and accuracy.
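
The cascade with early stopping can be sketched roughly as follows. This is a minimal illustration, not the tool's actual code: `cascade_match`, the `(name, query_fn)` tier format, and `score_fn` are hypothetical names, and each `query_fn` stands in for one SPARQL query against OpenCitations Meta.

```python
def cascade_match(reference, tiers, score_fn, threshold=26):
    """Run query tiers in order and stop once a candidate reaches the threshold.

    tiers:    ordered list of (name, query_fn) pairs, most precise first
              (hypothetical format; query_fn returns candidate records).
    score_fn: callable(reference, candidate) -> numeric score.
    Returns (candidate, score, tier_name), or None if no tier reaches the threshold.
    """
    best = None
    for name, query_fn in tiers:
        for candidate in query_fn(reference):
            score = score_fn(reference, candidate)
            if best is None or score > best[1]:
                best = (candidate, score, name)
        # Early stopping: later, broader tiers are skipped entirely.
        if best is not None and best[1] >= threshold:
            return best
    return None
```

Ordering the tiers from most to least precise means that a DOI hit never triggers the broader fallback queries.
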
The 48-point scoring algorithm implements weighted field matching based on the methodology of Visser et al. (2021): an exact DOI match contributes 15 points, a fuzzy title match 10 to 14 points, author surnames 7 points, page numbers 8 points, volume 3 points, and year 1 point. The default threshold of 26 points (~54% of the maximum) includes a dynamic adjustment at 90% for near-matches.
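
To make the arithmetic concrete, here is a minimal sketch of the weighted scoring using the point values listed above. Several details are assumptions rather than the tool's implementation: the fuzzy ratio comes from Python's `difflib`, the 0.8 similarity cutoff and the linear mapping onto the 10 to 14 point range are illustrative, records are plain dicts with a single `author` surname, and the dynamic 90% adjustment is left out.

```python
from difflib import SequenceMatcher

# Point values from the release notes; the title is handled separately.
WEIGHTS = {"doi": 15, "author": 7, "page": 8, "volume": 3, "year": 1}
TITLE_MIN, TITLE_MAX = 10, 14
TITLE_CUTOFF = 0.8   # assumption: minimum similarity that earns title points
THRESHOLD = 26       # default threshold from the release notes


def title_points(a, b):
    """Map fuzzy title similarity onto 10-14 points (assumed linear mapping)."""
    if not a or not b:
        return 0
    ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    if ratio < TITLE_CUTOFF:
        return 0
    scaled = (ratio - TITLE_CUTOFF) / (1 - TITLE_CUTOFF)
    return TITLE_MIN + round(scaled * (TITLE_MAX - TITLE_MIN))


def score(ref, cand):
    """Sum the points for every field present and equal in both records."""
    total = title_points(ref.get("title"), cand.get("title"))
    for field, points in WEIGHTS.items():
        a, b = ref.get(field), cand.get(field)
        if a and b and str(a).strip().lower() == str(b).strip().lower():
            total += points
    return total
```

Under these assumptions, a record matching on every field scores 15 + 14 + 7 + 8 + 3 + 1 = 48, while an exact title plus author and year gives 14 + 7 + 1 = 22, which falls short of the 26-point threshold.
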
When structured metadata proves insufficient, GROBID integration activates as a fallback extraction layer, processing unstructured citation strings to extract author surnames, titles, years, volumes, pages, and DOIs. The system merges fields so that high-quality source data is preserved and only enriched with GROBID-extracted metadata. It also tracks fallback statistics, including attempts, successes, and the fallback's contribution to the overall match rate.
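
One plausible reading of the field merging described above is that source values always win and GROBID only fills gaps. The sketch below illustrates that reading; `merge_fields` and the dict-based record format are hypothetical, not taken from the tool.

```python
def merge_fields(source, grobid):
    """Keep non-empty source values; fill missing fields from GROBID output.

    Returns the merged record and the list of fields that GROBID supplied,
    which a caller could use to update fallback statistics.
    """
    merged = dict(source)
    filled = []
    for field, value in grobid.items():
        if value and not merged.get(field):
            merged[field] = value
            filled.append(field)
    return merged, filled


merged, filled = merge_fields(
    {"title": "A Study of Things", "year": "", "doi": None},
    {"title": "A study of thngs", "year": "2020", "volume": "5"},
)
# merged keeps the source title; "year" and "volume" come from GROBID.
```
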