Need ability to guard strings so annotations won't be inserted inside; example case where annotator fails #180

Open
@jcushman

I found an example case in our code where annotation fails. In this simplified example, we're trying to insert {brackets} around the citation, after removing page numbers in pre-processing:

>>> from eyecite.annotate import annotate_citations
# we start with this html:
>>> html = '145, <a id="p410" href="#p410" data-label="410" data-citation-index="1" class="page-label">*410</a>11 <em>N. H.</em> 459. 1 Bla'
# we remove page labels and other tags during cleaning:
>>> text = '145, 11 N. H. 459. 1 Bla'
# and then, having located citations, attempt to annotate them:
>>> annots = [((5, 17), '{', '}')]
# but look where the '{' ends up:
>>> annotate_citations(text, annots, html)
'145, <a id="p4{10" href="#p410" data-label="410" data-citation-index="1" class="page-label">*410</a>11 <em>N. H.</em> 459}. 1 Bla'
# it does work correctly with the slower Python diff algorithm, though that may not be reliable; it may just be how the diff shakes out in this case:
>>> annotate_citations(text, annots, html, use_dmp=False)
'145, <a id="p410" href="#p410" data-label="410" data-citation-index="1" class="page-label">*410</a>{11 <em>N. H.</em> 459}. 1 Bla'

The solution I found is to protect the strings removed in cleaning by temporarily encoding each one as a single character from the Unicode private use area:

import re

def encode_strings_as_unicode(big_string, substrings):
    """ Replace substrings in big_string with unique characters in the unicode private use range. """
    char_mapping = []
    for i, substring in enumerate(substrings):
        unicode_char = chr(0xE000 + i)  # start of the private use range
        char_mapping.append([substring, unicode_char])
        big_string = big_string.replace(substring, unicode_char, 1)
    return big_string, char_mapping

def decode_unicode_to_strings(big_string, char_mapping):
    """Undo encode_strings_as_unicode by replacing each pair in char_mapping."""
    for s, c in char_mapping:
        big_string = big_string.replace(c, s)
    return big_string

strings_to_protect = re.findall(r'<a[^>]*>.*?</a>|<[^>]+>', html, flags=re.S)
new_html, char_mapping = encode_strings_as_unicode(html, strings_to_protect)
new_html = annotate_citations(text, annots, new_html)
new_html = decode_unicode_to_strings(new_html, char_mapping)

This works and gives correct results, but is a bit messy, especially since it requires some funky coding to use a custom annotation function:

def annotator(char_mapping, before, encoded_text, after):
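    """
    Attach annotation tags to a stretch of citation text. If the text contains a link or an
    unbalanced tag, close the annotation before each protected string and reopen it after,
    so the annotation never spans those tags.
    """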
    text = decode_unicode_to_strings(encoded_text, char_mapping)
    if '<a' in text or not is_balanced_html(text):
        encoded_text = re.sub(r'[\uE000-\uF8FF]', rf"{after}\g<0>{before}", encoded_text)
    return before + encoded_text + after

# used as follows (partial comes from functools):
from functools import partial
new_html = annotate_citations(text, annots, new_html, annotator=partial(annotator, char_mapping))
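
The annotator above calls is_balanced_html, which is not defined in this issue. A minimal sketch of what such a helper might look like, assuming it only needs to check that every opening tag in a fragment is closed in order; the void-element list and the use of the standard library's html.parser are my assumptions, not the original helper:

```python
from html.parser import HTMLParser

# Elements that never take a closing tag (assumed list, not exhaustive).
VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
}


class _BalanceChecker(HTMLParser):
    """Track open tags on a stack and flag any mismatched closing tag."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.balanced = True

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_ELEMENTS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS:
            return
        # A closing tag must match the most recently opened tag.
        if not self.stack or self.stack.pop() != tag:
            self.balanced = False


def is_balanced_html(fragment):
    """Hypothetical stand-in: True if every tag in fragment is opened and closed in order."""
    checker = _BalanceChecker()
    checker.feed(fragment)
    checker.close()
    return checker.balanced and not checker.stack
```

With this version, a citation span that starts or ends in the middle of an <em> element is reported as unbalanced, which is the case where the annotator splits the annotation around the protected strings.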

The ideal thing would probably be to have a way to give hints to the diff function -- something like annotate_citations(text, annots, new_html, protected_strings=strings_removed_in_cleaning), and then ensure that all protected strings end up as coherent inserts in the diff generated by diff-match-patch.
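
As a rough illustration of that interface, here is a sketch that implements it with the placeholder workaround described earlier rather than with real diff hints. The function name annotate_with_protected_strings and its annotate parameter are hypothetical and not part of eyecite; in real use, annotate would be annotate_citations.

```python
PUA_START, PUA_END = 0xE000, 0xF8FF  # Unicode private use area (BMP)


def annotate_with_protected_strings(text, annots, html, protected_strings, annotate):
    """Hypothetical wrapper: hide protected_strings from the diff, annotate, then restore them.

    annotate is any callable with the shape annotate(text, annots, html) -> str.
    """
    # Refuse input that already uses the placeholder range, since decoding would be ambiguous.
    if any(PUA_START <= ord(ch) <= PUA_END for ch in html):
        raise ValueError("html already contains private use characters")
    if len(protected_strings) > PUA_END - PUA_START + 1:
        raise ValueError("too many protected strings for the private use range")

    # Swap each protected string (in document order) for a unique one-character placeholder.
    mapping = []
    encoded = html
    for i, protected in enumerate(protected_strings):
        placeholder = chr(PUA_START + i)
        mapping.append((placeholder, protected))
        encoded = encoded.replace(protected, placeholder, 1)

    annotated = annotate(text, annots, encoded)

    # Restore the original strings in place of their placeholders.
    for placeholder, protected in mapping:
        annotated = annotated.replace(placeholder, protected)
    return annotated
```

Because each protected string is a single character while the diff runs, an annotation can land before or after it but never inside it. A built-in protected_strings option could give the same guarantee without the caller having to manage the mapping.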
