HTML Parser
In many institutions and hospitals, medical reports are stored in HTML format, which makes them easy to parse. If the reports are structured into sections, it becomes easy to select the sections that are most informative for inference tasks.
text_parser is the core script for parsing the HTML documents. It relies on the excellent BeautifulSoup package, widely used for web scraping, and performs the following actions:
- normalizes the text in the soup
- removes useless tags
- fetches the text of the different sections
- returns, for each document, a dictionary whose keys are section names and whose values are the section contents.
To do so, the script iterates over each document and extracts the text located between section headers.
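As an illustration, the extraction step can be sketched directly with BeautifulSoup: find every header tag, then collect the text of the sibling tags up to the next header. The tag names and sample report below are invented for the example; the actual script's logic may differ.

```python
from bs4 import BeautifulSoup

html = """
<html><body>
<h2>History</h2><p>Patient reports chest pain.</p>
<h2>Findings</h2><p>No acute abnormality.</p>
</body></html>
"""

def parse_sections(html, headers=("h2",)):
    """Map each section name to the text found before the next header."""
    soup = BeautifulSoup(html, "html.parser")
    sections = {}
    for header in soup.find_all(list(headers)):
        name = header.get_text(strip=True)
        parts = []
        # find_next_siblings(True) yields only tags, skipping bare strings
        for sib in header.find_next_siblings(True):
            if sib.name in headers:
                break
            parts.append(sib.get_text(" ", strip=True))
        sections[name] = " ".join(parts)
    return sections

print(parse_sections(html))
# {'History': 'Patient reports chest pain.', 'Findings': 'No acute abnormality.'}
```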
This is done so that it is easy to remove the least informative sections (e.g. sections that are identical across all reports) and then merge the remaining text using section_manager.reduce_dico.
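The merge step might look like the following sketch. This is purely an illustration of the idea: the real section_manager.reduce_dico may have a different signature and behavior.

```python
# Hypothetical sketch only -- the real section_manager.reduce_dico
# may take different arguments.
def reduce_dico(doc, sections_to_keep):
    """Concatenate the chosen sections of one parsed document."""
    return " ".join(doc[name] for name in sections_to_keep if name in doc)

doc = {
    "history": "chest pain",
    "findings": "no abnormality",
    "footer": "page 1",  # identical across reports, so dropped
}
print(reduce_dico(doc, ["history", "findings"]))  # chest pain no abnormality
```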
All this is wrapped up in the parser.ReportsParser object for convenience. This transformer is compatible with the scikit-learn Pipeline API, so it is easy to combine with other transformers.
Note that you can grid-search the hyperparameters of ReportsParser, for instance to find the most informative sections of the reports, using scikit-learn's GridSearchCV.
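The mechanism behind this is scikit-learn's estimator protocol: any transformer that declares its hyperparameters in __init__ can sit in a Pipeline and be grid-searched through the step__param naming convention. The ToyParser below is an invented stand-in (the real class is parser.ReportsParser), used only to show the protocol.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer

class ToyParser(BaseEstimator, TransformerMixin):
    """Invented stand-in for ReportsParser, only to show the protocol."""
    def __init__(self, headers="h2"):
        self.headers = headers  # hyperparameter exposed to GridSearchCV

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # A real parser would extract the chosen sections here.
        return [str(x) for x in X]

pipe = Pipeline([("parser", ToyParser()), ("vect", CountVectorizer())])
# Grid-search keys follow the "<step>__<param>" convention:
print("parser__headers" in pipe.get_params())  # True
```

A param grid such as `{"parser__headers": [["h2"], ["h2", "h3"]]}` could then be passed to GridSearchCV along with the pipeline.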
A sample of a typical HTML report is provided here, to help understand the format and for testing purposes.
However, some reports might not be as structured as this one, and therefore cannot be parsed section by section with BeautifulSoup. In that case, set the headers parameter of ReportsParser to None, and the parser fetches all the text in the document.
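Fetching all the text of an unstructured document is straightforward with BeautifulSoup's get_text, which is presumably what the headers=None fallback amounts to (the sample HTML below is invented):

```python
from bs4 import BeautifulSoup

# A report with no section headers to anchor on.
html = "<html><body><p>Unstructured</p><p>report text.</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Join all text fragments with a space, stripping whitespace from each.
print(soup.get_text(" ", strip=True))  # Unstructured report text.
```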
The ReportsParser.transform method parallelizes its work using Python's built-in multiprocessing library, controlled by the n_jobs attribute. It therefore performs best on a machine with several CPUs.
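Parallelizing per-document parsing with the standard library's multiprocessing module follows the usual Pool pattern, roughly as below. parse_one is a made-up stand-in for the real per-document work, and the processes argument plays the role of n_jobs.

```python
from multiprocessing import Pool

def parse_one(doc):
    # Stand-in for the per-document parsing work.
    return doc.upper()

if __name__ == "__main__":
    docs = ["report a", "report b"]
    # processes plays the role of n_jobs: the number of worker processes.
    with Pool(processes=2) as pool:
        results = pool.map(parse_one, docs)
    print(results)  # ['REPORT A', 'REPORT B']
```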