arXiv Extractor

For now, this code converts PDF research papers into clean text files, skipping images and tables.

Requirements

It requires Playwright.

pip install playwright

It also requires Firefox to be installed in the Playwright environment.

playwright install firefox
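
To confirm that both steps worked, a small check like the one below may help. It is only a sketch and not part of this repository: the function name `firefox_available` is made up for illustration, and it assumes Playwright's synchronous Python API.

```python
def firefox_available():
    """Return True if Playwright can launch its Firefox build, else False.

    Hypothetical helper for illustration; it is not part of this repository.
    """
    try:
        # Imported here so a missing package is reported as False, not a crash.
        from playwright.sync_api import sync_playwright
    except ImportError:
        return False  # "pip install playwright" has not been run

    try:
        with sync_playwright() as p:
            browser = p.firefox.launch()  # headless by default
            browser.close()
        return True
    except Exception:
        # Most likely "playwright install firefox" has not been run,
        # or the browser cannot start in this environment.
        return False


if __name__ == "__main__":
    print("Firefox ready:", firefox_available())
```

If this prints `Firefox ready: False`, re-running the two install commands above is the first thing to try.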

Compatible Software

Currently, it works only with Mozilla Firefox.

This is because Firefox renders PDFs quite differently from other browsers, which made extracting the text easier.

Features to Add

Conversion from PDF to other formats as well
More field-specific cleaning (not just CS papers)

FAQ

1. Why did I not use the arXiv API?

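The official documentation says: "This url calls the api, which returns the results in the Atom 1.0 format." In other words, the API returns an Atom feed of paper metadata, not the text of the paper itself. See the official arXiv API documentation to know more.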
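
To make that concrete, here is a minimal sketch of reading an Atom 1.0 feed with Python's standard library. The feed below is a hand-written placeholder, not a real API response, and `parse_entries` is a hypothetical helper that is not part of this repository. It only shows the kind of fields such a feed carries: an id, a title, a summary, and links.

```python
import xml.etree.ElementTree as ET

# Hand-written placeholder in the shape of an Atom 1.0 feed.
# This is NOT a real API response; every value here is made up.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example query results</title>
  <entry>
    <id>http://example.org/abs/0000.00000v1</id>
    <title>An Example Paper Title</title>
    <summary>A short abstract. The body of the paper is not included.</summary>
    <link href="http://example.org/pdf/0000.00000v1" type="application/pdf"/>
  </entry>
</feed>
"""

# Atom 1.0 elements live in this XML namespace.
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}


def parse_entries(feed_xml):
    """Return one dict of metadata per <entry> in an Atom feed string."""
    root = ET.fromstring(feed_xml)
    entries = []
    for entry in root.findall("atom:entry", ATOM_NS):
        # Collect only the links that are marked as PDFs.
        pdf_links = [
            link.get("href")
            for link in entry.findall("atom:link", ATOM_NS)
            if link.get("type") == "application/pdf"
        ]
        entries.append({
            "id": entry.findtext("atom:id", default="", namespaces=ATOM_NS),
            "title": entry.findtext("atom:title", default="", namespaces=ATOM_NS).strip(),
            "summary": entry.findtext("atom:summary", default="", namespaces=ATOM_NS).strip(),
            "pdf_url": pdf_links[0] if pdf_links else None,
        })
    return entries


if __name__ == "__main__":
    for item in parse_entries(SAMPLE_FEED):
        print(item["title"], "->", item["pdf_url"])
```

The feed yields a link to the PDF, but turning that PDF into text is still a separate step.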

2. Will additional browser support be added?

Well, I think it is easier to just install Firefox! Open the PDF of a research paper in Firefox and in another browser, and you will notice a clear difference in how the content is presented.

3. Does it work on all operating systems?

At the time of release, it worked fine on Windows. I am not sure about other operating systems.

4. Is there any scope for contributions?

Absolutely! You are welcome to contribute if you can find a way to extract information from the less helpful way PDFs are displayed in other browsers, extend support to other operating systems, or further improve the quality of the extracted text. Just submit a PR.

Gratitude

I am grateful for the help I got from Mistral's Le Chat. It helped me overcome significant challenges.
