Meeting notes
2023-09-13
We decided to allow any namespace in the searchable collection, provided that either a prefix for it is declared on the root element of the config file, or the xpath-default-namespace attribute is specified on that root element.
The tokenizer span should then be in the staticSearch namespace.
We should consider forking after the well-formedness check, so that a jar file can be used to make the HTML well-formed. https://htmlcleaner.sourceforge.net/ is a good option: it is a single open-source jar that we could include in our repo.
We also discussed issues #219 and #246. We realized that multiple passes through the document have to run in a particular order, and therefore can't easily make good decisions about what to prioritize and what to drop. Instead, we should use a single pass with an accumulator that builds a profile of the element, followed by a decision function at the end, so that the algorithm is clearly specified in a single location.
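
The single-pass idea can be illustrated with a minimal sketch. This is not the staticSearch implementation (which these notes do not spell out), and it is written in Python purely for illustration. The profile fields, the rules in `decide`, and the outcome labels `"drop"`, `"prioritize"` and `"keep"` are all hypothetical; the point is only that one traversal gathers facts and one function holds every rule.

```python
from dataclasses import dataclass, field
import xml.etree.ElementTree as ET


@dataclass
class ElementProfile:
    """Facts gathered about one element in a single traversal.

    All field names here are hypothetical, chosen for illustration.
    """
    tag: str
    text_length: int = 0
    child_count: int = 0
    descendant_tags: set = field(default_factory=set)


def build_profile(element):
    """Walk the element's subtree once, accumulating facts into a profile."""
    profile = ElementProfile(tag=element.tag, child_count=len(element))
    for node in element.iter():  # one traversal, document order
        profile.text_length += len((node.text or "").strip())
        if node is not element:
            # Tail text of a descendant still lies inside `element`.
            profile.text_length += len((node.tail or "").strip())
            profile.descendant_tags.add(node.tag)
    return profile


def decide(profile):
    """Single location for the whole algorithm: rules in priority order.

    The rules themselves are invented examples, not project decisions.
    """
    if profile.text_length == 0:
        return "drop"
    if profile.descendant_tags & {"h1", "h2", "h3"}:
        return "prioritize"
    return "keep"


def classify(xml_fragment):
    """Convenience wrapper: parse a fragment, profile its root, decide."""
    return decide(build_profile(ET.fromstring(xml_fragment)))
```

Because no rule runs until the profile is complete, the order in which facts are collected stops mattering, and changing a priority means editing only `decide`.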