Meeting notes

Martin Holmes edited this page Sep 13, 2023 · 4 revisions

2023-09-13

We have decided to allow any namespace in the searchable collection, provided that either the namespace has a prefix declared on the root element of the config file, or it is specified by the xpath-default-namespace attribute on that root element.

The tokenizer span should then be in the staticSearch namespace.
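
As a rough illustration of the namespace rule above, the following Python sketch collects the namespaces that a config root would make usable. It is not the project's implementation; the function name `allowed_namespaces` and the sample config are hypothetical, and the config namespace URI shown is assumed for the example.

```python
import io
import xml.etree.ElementTree as ET


def allowed_namespaces(config_xml):
    """Return the namespace URIs made usable by the config root (sketch only)."""
    allowed = set()
    source = io.BytesIO(config_xml.encode("utf-8"))
    for event, payload in ET.iterparse(source, events=("start-ns", "start")):
        if event == "start-ns":
            prefix, uri = payload
            if prefix:  # count only declarations that bind an explicit prefix
                allowed.add(uri)
        else:
            # The first "start" event is the root element; its namespace
            # declarations have already been reported as "start-ns" events.
            default_ns = payload.get("xpath-default-namespace")
            if default_ns:
                allowed.add(default_ns)
            break
    return allowed


# Hypothetical config root; the xmlns value is an assumption for this example.
SAMPLE_CONFIG = (
    '<config xmlns="http://hcmc.uvic.ca/ns/staticSearch" '
    'xmlns:tei="http://www.tei-c.org/ns/1.0" '
    'xpath-default-namespace="http://www.w3.org/1999/xhtml"/>'
)
```

Here the config's own unprefixed default namespace is deliberately not counted, since the rule only mentions declared prefixes and xpath-default-namespace.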

We should consider forking the process after the well-formedness check, so that a jar file can be used to make the HTML well-formed. HtmlCleaner (https://htmlcleaner.sourceforge.net/) is a good option: it is a single open-source jar that we could include in our repo.

We also discussed issues #219 and #246. We realized that multiple passes through the document, which have to run in a particular order, cannot easily make good decisions about what to prioritize and what to drop. Instead, we should use a single pass with an accumulator that builds a profile of the element, followed by a decision function at the end, so that the algorithm is specified clearly in a single location.
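
The single-pass idea can be sketched in Python as follows. This is only an illustration under assumptions: the rule names (`exclude`, `context`), their predicates, and the resulting actions are hypothetical and are not taken from issues #219 or #246.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules: each maps a name to a predicate on an element.
# Their order here is irrelevant, because they only record facts.
RULES = {
    "exclude": lambda el: el.get("class") == "nav",
    "context": lambda el: el.tag in {"p", "li"},
}


def profile_element(el):
    """Accumulate the names of every rule that matches this element."""
    return frozenset(name for name, matches in RULES.items() if matches(el))


def decide(profile):
    """The only place where priorities between rules are expressed."""
    if "exclude" in profile:
        return "drop"
    if "context" in profile:
        return "context"
    return "keep"


def process(root):
    """Visit each element once, in document order, and decide its fate."""
    return [(el.tag, decide(profile_element(el))) for el in root.iter()]


doc = ET.fromstring('<div><p class="nav">a</p><p>b</p><span>c</span></div>')
decisions = process(doc)
```

The first `p` matches both hypothetical rules, and `decide` resolves the conflict in favour of dropping it. Changing that priority means editing one function, not reordering passes.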
