For now, this code converts PDF research papers into clean text files, skipping images and tables.
pip install playwright
playwright install firefox
This is because of the way Firefox renders PDFs. It renders them quite differently from other browsers, which made text extraction easier.
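As a rough illustration of the idea, here is a minimal sketch of reading the text that Firefox renders for a PDF through Playwright. It is not this project's actual code. The function names `clean_spans` and `extract_pdf_text` are hypothetical, and the sketch assumes that Firefox's built-in viewer exposes page text as `<span>` elements inside `.textLayer` containers and that the `pdfjs.disabled` preference controls whether that viewer is used.

```python
def clean_spans(spans):
    """Strip each text fragment, drop empty ones, and join the rest with newlines."""
    return "\n".join(s.strip() for s in spans if s.strip())


def extract_pdf_text(pdf_url):
    """Open pdf_url in Firefox and return the text found in the viewer's text layer."""
    # Imported here so clean_spans stays usable without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        # Assumption: this preference keeps Firefox's built-in PDF viewer enabled.
        browser = p.firefox.launch(firefox_user_prefs={"pdfjs.disabled": False})
        try:
            page = browser.new_page()
            page.goto(pdf_url)
            # Assumption: the viewer renders text as <span> nodes inside ".textLayer".
            page.wait_for_selector(".textLayer span")
            spans = page.locator(".textLayer span").all_inner_texts()
        finally:
            browser.close()
    return clean_spans(spans)
```

Calling `extract_pdf_text("https://example.org/paper.pdf")` (a placeholder URL) would return whatever text the viewer has rendered so far. A viewer that renders pages lazily may need scrolling before later pages appear, which this sketch does not handle.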
- Conversion from PDF to other formats as well
- More field-specific cleaning (not just CS papers)
"This url calls the api, which returns the results in the Atom 1.0 format." To learn more, see the official documentation.
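To make the Atom 1.0 response concrete, here is a minimal sketch of pulling titles and PDF links out of such a feed using only the Python standard library. The function name `parse_atom_entries`, the sample feed, and the idea that each entry carries a link with `type="application/pdf"` are assumptions for illustration. Check the official documentation for the fields the real API returns.

```python
import xml.etree.ElementTree as ET

# Standard XML namespace for Atom 1.0 feeds.
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

# Hypothetical sample feed, used only to demonstrate the parser.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example query results</title>
  <entry>
    <id>http://example.org/abs/0000.00000</id>
    <title>An Example
      Paper Title</title>
    <link href="http://example.org/abs/0000.00000" rel="alternate" type="text/html"/>
    <link href="http://example.org/pdf/0000.00000" rel="related" type="application/pdf"/>
  </entry>
</feed>"""


def parse_atom_entries(xml_text):
    """Return one dict per <entry> with its id, title, and PDF link (if any)."""
    root = ET.fromstring(xml_text)
    entries = []
    for entry in root.findall("atom:entry", ATOM_NS):
        title = entry.findtext("atom:title", default="", namespaces=ATOM_NS)
        entry_id = entry.findtext("atom:id", default="", namespaces=ATOM_NS)
        pdf_url = None
        for link in entry.findall("atom:link", ATOM_NS):
            # Assumption: the PDF link is marked with this MIME type.
            if link.get("type") == "application/pdf":
                pdf_url = link.get("href")
                break
        entries.append({
            "id": entry_id,
            # Collapse line breaks and repeated spaces inside the title.
            "title": " ".join(title.split()),
            "pdf_url": pdf_url,
        })
    return entries
```

With a live feed, the response body fetched from the API URL would be passed to `parse_atom_entries` in place of `SAMPLE_FEED`.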
Well, I think it is easier to just install Firefox! Open the PDF of a research paper in another browser and then in Firefox, and you will notice the difference in how the information is presented.
Actually, at the time of release it worked fine on Windows. I am not sure about other operating systems.
Absolutely! If you can find a way to extract information from the unhelpful way it is displayed in other browsers, if you can extend support to other operating systems, or if you think you can further improve the quality of the extracted text, you are welcome! Just submit a PR.