Ignore uniform dates under redactions #30

Open
@mlissner

It seems to be common for a date to sit under the redaction boxes, as you can see highlighted in the screenshot below:

Screenshot from 2021-09-11 11-21-13

Note that the date isn't semantically relevant to the sentence it appears in. Here is the text found under the redactions throughout this document, keyed by page number:

{2: [{'bbox': (390.3498229980469,
               536.0278930664062,
               415.180419921875,
               552.8250122070312),
      'text': '03/23/2019'}],
 20: [{'bbox': (434.0060119628906,
                293.506103515625,
                446.1649169921875,
                307.0159912109375),
       'text': '03/23/2'}],
 29: [{'bbox': (197.58200073242188,
                75.3205795288086,
                224.60189819335938,
                89.5059814453125),
       'text': '03/23/2019'},
      {'bbox': (232.70700073242188,
                75.31907653808594,
                269.1838073730469,
                88.8289794921875),
       'text': '03/23/2019'},
      {'bbox': (278.6400146484375,
                75.99359130859375,
                319.1697998046875,
                87.47698974609375),
       'text': '03/23/2019'},
      {'bbox': (348.2170104980469,
                75.3205795288086,
                421.17059326171875,
                89.5059814453125),
       'text': '03/23/2019'},

The pattern is that the text under each box is always the same date, or a truncated piece of it (like '03/23/2' on page 20). When this is the case, we should nuke all such redactions from our list as false positives.
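One way this filter could look is sketched below. It is only a rough illustration, not existing project code: the function name `drop_uniform_date_redactions`, the `MM/DD/YYYY` regex, the `min_hits` threshold, and the minimum prefix length are all assumptions made for the example. It takes the same page-to-redactions dict shown above, finds the most common full date, and drops every redaction whose text is that date or a truncated prefix of it.

```python
import re
from collections import Counter

# Assumption: the repeated dates look like MM/DD/YYYY, as in the sample above.
DATE_RE = re.compile(r"\d{1,2}/\d{1,2}/\d{4}")

# Assumption: a truncated date must be at least this long to count as a match,
# so that a stray "0" is not treated as a piece of the date.
MIN_PREFIX_LEN = 5


def drop_uniform_date_redactions(redactions, min_hits=2):
    """Drop redactions whose hidden text is one repeated date.

    `redactions` maps page number -> list of {'bbox': ..., 'text': ...}.
    `min_hits` (hypothetical knob) is how many times the full date must
    appear before it is treated as a uniform, ignorable date.
    """
    texts = [r["text"].strip() for items in redactions.values() for r in items]
    full_dates = Counter(t for t in texts if DATE_RE.fullmatch(t))
    if not full_dates:
        return redactions

    date, count = full_dates.most_common(1)[0]
    if count < min_hits:
        return redactions

    def is_repeated_date(text):
        text = text.strip()
        # Matches the full date or a truncated prefix such as '03/23/2'.
        return len(text) >= MIN_PREFIX_LEN and date.startswith(text)

    cleaned = {}
    for page, items in redactions.items():
        kept = [r for r in items if not is_repeated_date(r["text"])]
        if kept:  # pages with nothing left are dropped entirely
            cleaned[page] = kept
    return cleaned
```

With this approach, redactions that hide anything other than the repeated date are kept, so a document that mixes such dates with genuinely bad redactions would still be flagged.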

gov.uscourts.cacd.45170.569.9_2.pdf
