Need ability to guard strings so annotations won't be inserted inside; example case where annotator fails

I found an example case in our code where annotation fails. In this simplified example, we're trying to insert {brackets} around the citation, after removing page numbers in pre-processing:

```
>>> from eyecite.annotate import annotate_citations
# we start with this html:
>>> html = '145, <a id="p410" href="#p410" data-label="410" data-citation-index="1" class="page-label">*410</a>11 N. H. 459. 1 Bla'
# we remove page labels and other tags during cleaning:
>>> text = '145, 11 N. H. 459. 1 Bla'
# and then, having located citations, attempt to annotate them:
>>> annots = [((5, 17), '{', '}')]
# but look where the '{' ends up:
>>> annotate_citations(text, annots, html)
'145, <a id="p4{10" href="#p410" data-label="410" data-citation-index="1" class="page-label">*410</a>11 N. H. 459}. 1 Bla'
# it does work correctly with the slower python diff algorithm, though that may not be reliable, just how the diff shakes out in this case:
>>> annotate_citations(text, annots, html, use_dmp=False)
'145, <a id="p410" href="#p410" data-label="410" data-citation-index="1" class="page-label">*410</a>{11 N. H. 459}. 1 Bla'
```

The solution I found is to protect the strings removed in cleaning, by temporarily encoding them as reserved UTF characters:

```
import re

def encode_strings_as_unicode(big_string, substrings):
 """ Replace substrings in big_string with unique characters in the unicode private use range. """
 char_mapping = []
 for i, substring in enumerate(substrings):
 unicode_char = chr(0xE000 + i) # start of the private use range
 char_mapping.append([substring, unicode_char])
 big_string = big_string.replace(substring, unicode_char, 1)
 return big_string, char_mapping

def decode_unicode_to_strings(big_string, char_mapping):
 """Undo encode_strings_as_unicode by replacing each pair in char_mapping."""
 for s, c in char_mapping:
 big_string = big_string.replace(c, s)
 return big_string

strings_to_protect = re.findall(r'<a[^>]*>.*?</a>|<[^>]+>', html, flags=re.S)
new_html, char_mapping = encode_strings_as_unicode(html, strings_to_protect)
new_html = annotate_citations(text, annots, new_html)
new_html = decode_unicode_to_strings(new_html, char_mapping)
```

This works and gives correct results, but is a bit messy, especially since it requires some funky coding to use a custom annotation function:

```
def annotator(char_mapping, before, encoded_text, after):
 """
 Attach annotation tags to a stretch of citation text. If text contains a link or an unbalanced tag, wrap
 those tags.
 """
 text = decode_unicode_to_strings(encoded_text, char_mapping)
 if '<a' in text or not is_balanced_html(text):
 encoded_text = re.sub(r'[\uE000-\uF8FF]', rf"{after}\g<0>{before}", encoded_text)
 return before + encoded_text + after

# used as as:
new_html = annotate_citations(text, annots, new_html, annotator=partial(annotator, char_mapping))
```

The ideal thing would probably be to have a way to give hints to the diff function -- something like `annotate_citations(text, annots, new_html, protected_strings=strings_removed_in_cleaning)`, and then ensure that all protected strings end up as coherent inserts in the diff generated by diff-match-patch.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Need ability to guard strings so annotations won't be inserted inside; example case where annotator fails #180

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

Need ability to guard strings so annotations won't be inserted inside; example case where annotator fails #180

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions