Description
I found an example case in our code where annotation fails. In this simplified example, we're trying to insert {brackets} around the citation, after removing page numbers in pre-processing:
>>> from eyecite.annotate import annotate_citations
# we start with this html:
>>> html = '145, <a id="p410" href="#p410" data-label="410" data-citation-index="1" class="page-label">*410</a>11 <em>N. H.</em> 459. 1 Bla'
# we remove page labels and other tags during cleaning:
>>> text = '145, 11 N. H. 459. 1 Bla'
# and then, having located citations, attempt to annotate them:
>>> annots = [((5, 17), '{', '}')]
# but look where the '{' ends up:
>>> annotate_citations(text, annots, html)
'145, <a id="p4{10" href="#p410" data-label="410" data-citation-index="1" class="page-label">*410</a>11 <em>N. H.</em> 459}. 1 Bla'
# it does work correctly with the slower Python diff algorithm, though that may not be reliable; it may just be how the diff happens to shake out in this case:
>>> annotate_citations(text, annots, html, use_dmp=False)
'145, <a id="p410" href="#p410" data-label="410" data-citation-index="1" class="page-label">*410</a>{11 <em>N. H.</em> 459}. 1 Bla'
The solution I found is to protect the strings removed in cleaning by temporarily encoding each one as a single character from the Unicode private use range:
import re

def encode_strings_as_unicode(big_string, substrings):
    """ Replace substrings in big_string with unique characters in the unicode private use range. """
    char_mapping = []
    for i, substring in enumerate(substrings):
        unicode_char = chr(0xE000 + i)  # start of the private use range
        char_mapping.append([substring, unicode_char])
        big_string = big_string.replace(substring, unicode_char, 1)
    return big_string, char_mapping

def decode_unicode_to_strings(big_string, char_mapping):
    """Undo encode_strings_as_unicode by replacing each pair in char_mapping."""
    for s, c in char_mapping:
        big_string = big_string.replace(c, s)
    return big_string

strings_to_protect = re.findall(r'<a[^>]*>.*?</a>|<[^>]+>', html, flags=re.S)
new_html, char_mapping = encode_strings_as_unicode(html, strings_to_protect)
new_html = annotate_citations(text, annots, new_html)
new_html = decode_unicode_to_strings(new_html, char_mapping)
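To show what the diff actually sees after encoding, here is a small self-contained sketch that runs the same encode/decode helpers on the example html, without calling eyecite at all. The variable names `protected`, `encoded`, and `mapping` are just for illustration. Each protected string collapses to one private-use character, so the encoded html differs from the cleaned text only by three single-character inserts:

```python
import re

def encode_strings_as_unicode(big_string, substrings):
    """Replace each substring with a unique private-use character."""
    char_mapping = []
    for i, substring in enumerate(substrings):
        unicode_char = chr(0xE000 + i)  # start of the private use range
        char_mapping.append([substring, unicode_char])
        big_string = big_string.replace(substring, unicode_char, 1)
    return big_string, char_mapping

def decode_unicode_to_strings(big_string, char_mapping):
    """Undo encode_strings_as_unicode."""
    for s, c in char_mapping:
        big_string = big_string.replace(c, s)
    return big_string

html = ('145, <a id="p410" href="#p410" data-label="410" '
        'data-citation-index="1" class="page-label">*410</a>'
        '11 <em>N. H.</em> 459. 1 Bla')
text = '145, 11 N. H. 459. 1 Bla'

# The whole page-label link is one protected string; <em> and </em> are two more.
protected = re.findall(r'<a[^>]*>.*?</a>|<[^>]+>', html, flags=re.S)
encoded, mapping = encode_strings_as_unicode(html, protected)

# Dropping the placeholders from the encoded html leaves exactly the cleaned text.
stripped = re.sub(r'[\uE000-\uF8FF]', '', encoded)
roundtrip = decode_unicode_to_strings(encoded, mapping)
```

Note that this relies on the html not already containing private-use characters, and that the `<a[^>]*>` pattern would also match other tags whose names start with "a".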
This works and gives correct results, but is a bit messy, especially since it requires some funky coding to use a custom annotation function:
from functools import partial

# is_balanced_html is assumed to be defined elsewhere (not shown here)
def annotator(char_mapping, before, encoded_text, after):
    """
    Attach annotation tags to a stretch of citation text. If text contains a link or an unbalanced tag, wrap
    those tags.
    """
    text = decode_unicode_to_strings(encoded_text, char_mapping)
    if '<a' in text or not is_balanced_html(text):
        # close the annotation before each protected string and reopen it afterwards
        encoded_text = re.sub(r'[\uE000-\uF8FF]', rf"{after}\g<0>{before}", encoded_text)
    return before + encoded_text + after

# used as:
new_html = annotate_citations(text, annots, new_html, annotator=partial(annotator, char_mapping))
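For clarity, here is a minimal sketch of just the substitution step inside that annotator, applied unconditionally to a made-up encoded span. The function name `split_around_placeholders` and the sample `char_mapping` are hypothetical, chosen for the demo. The annotation is closed before every private-use placeholder and reopened after it, so the protected tags never end up nested inside an annotation:

```python
import re

def split_around_placeholders(before, encoded_text, after):
    """Close the annotation before each placeholder and reopen it after."""
    wrapped = re.sub(r'[\uE000-\uF8FF]', rf"{after}\g<0>{before}", encoded_text)
    return before + wrapped + after

# Hypothetical mapping and encoded citation span, for illustration only.
char_mapping = [['<em>', '\ue001'], ['</em>', '\ue002']]
encoded_span = '11 \ue001N. H.\ue002 459'

annotated = split_around_placeholders('{', encoded_span, '}')

# Decode the placeholders back into the original tags.
decoded = annotated
for s, c in char_mapping:
    decoded = decoded.replace(c, s)

print(decoded)  # {11 }<em>{N. H.}</em>{ 459}
```

Because `before` and `after` are interpolated into a regex replacement template, this form assumes they contain no backslashes; passing a function as the replacement would avoid that limitation.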
The ideal thing would probably be to have a way to give hints to the diff function, something like annotate_citations(text, annots, new_html, protected_strings=strings_removed_in_cleaning), and then ensure that all protected strings end up as coherent inserts in the diff generated by diff-match-patch.