A Python tool that converts PDF files to Markdown format using a configurable selection of AI models and OCR services.
- Support for multiple AI providers and OCR solutions:
  - Anthropic (Claude)
  - Google (Gemini)
  - OpenAI
  - Mistral
  - Ollama
  - Unstructured.io
  - MarkItDown
- Configurable model selection for each provider
- Structured output with proper markdown formatting
- Environment variable configuration for API keys
- Clone the repository:

  ```bash
  git clone https://github.com/cotrane/pdf2md.git
  cd pdf2md
  ```

- Create and activate a virtual environment:

  ```bash
  uv venv
  source .venv/bin/activate  # On Unix/macOS
  # or
  .venv\Scripts\activate     # On Windows
  ```

- Install dependencies:

  ```bash
  uv pip install -e ".[dev]"
  ```

Create a `.env` file similar to the `.env.tmpl` file and add your API keys.
The supported API keys are:
- `GOOGLE_API_KEY`
- `OPENAI_API_KEY`
- `ANTHROPIC_API_KEY`
- `MISTRAL_API_KEY`
- `UNSTRUCTURED_API_KEY`
In order to test AWS Textract, you also need an AWS account and credentials, which can be added to the `.env` file using:

- `AWS_ACCESS_KEY_ID`
- `AWS_SECRET_ACCESS_KEY`
- `AWS_REGION`
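An illustrative `.env` (placeholder values only; add just the keys for the providers you plan to use):

```bash
# .env - placeholder values, replace with your own credentials
GOOGLE_API_KEY=your-google-api-key
OPENAI_API_KEY=your-openai-api-key
ANTHROPIC_API_KEY=your-anthropic-api-key
MISTRAL_API_KEY=your-mistral-api-key
UNSTRUCTURED_API_KEY=your-unstructured-api-key

# Only needed for AWS Textract
AWS_ACCESS_KEY_ID=your-aws-access-key-id
AWS_SECRET_ACCESS_KEY=your-aws-secret-access-key
AWS_REGION=us-east-1  # example region
```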
To convert a PDF, run:

```bash
uv run src/run.py --input input.pdf --parser anthropic
```

The output file is written to the `output` folder and named `<input_file_name>_<parser>_<model>.md`.
Available parsers:

- `anthropic`: Uses Claude 3.5 Sonnet
- `googleai`: Uses Google's Gemini Pro
- `openai`: Uses GPT-4 Turbo
- `mistral`: Uses Mistral Large
- `ollama`: Uses Ollama models
- `textract`: Uses AWS Textract for text extraction
- `unstructuredio`: Uses the Unstructured.io API
- `markitdown`: Uses Microsoft's open-source tool MarkItDown
Each parser supports different models. Use the `--model` option to specify a model:

```bash
uv run src/run.py --input input.pdf --parser anthropic --model claude-3-sonnet-20240229
```

Available models per parser:
- Anthropic:
  - `claude-3-7-sonnet-20250219` (default)
  - `claude-3-5-sonnet-20241022`
- Google AI:
  - `gemini-1.5-flash`
  - `gemini-2.0-flash` (default)
  - `gemini-2.0-flash-thinking-exp-01-21`
  - `gemini-2.5-pro-exp-03-25`
- OpenAI:
  - `gpt-4o` (default)
  - `gpt-4o-mini`
  - `gpt-4.5-preview`
  - `o1`
- Mistral:
  - `mistral-ocr-latest`
- Ollama: Any model available in your Ollama installation
- Textract: No model selection
- Unstructured.io:
  - `gpt-4o` (default)
  - `claude-3-5-sonnet-20241022`
  - `gemini-2.0-flash-001`
  - `hi-res`
  - `fast`
- MarkItDown:
  - `pdfminer` (default)
  - `gpt-4o`
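For example, to pick a specific Gemini model, or to use a locally installed Ollama model (the Ollama model name below is only an illustration):

```bash
uv run src/run.py --input input.pdf --parser googleai --model gemini-2.0-flash
uv run src/run.py --input input.pdf --parser ollama --model llama3.2  # any locally pulled model
```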
The tool includes a utility to evaluate the similarity between markdown files. The script creates a similarity heatmap across the generated output files; once one of them has been checked manually, this gives a rough measure of how accurate the other outputs are.
```bash
uv run src/evaluate.py -f <filename>
```

The evaluation provides several metrics:
- Cosine Similarity: TF-IDF based similarity score (0-1)
- Word Overlap Ratio: Ratio of common words to total unique words (0-1)
- Levenshtein Ratio: Normalized similarity score based on Levenshtein distance (0-1)
- Word Statistics: Counts of words, common words, and unique words in each file
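As a rough illustration of what the first three metrics measure (a minimal sketch, not the project's actual implementation; it assumes scikit-learn and the `Levenshtein` package are installed, e.g. via the `eval` extra):

```python
import Levenshtein  # provided by the python-Levenshtein / Levenshtein package
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def compare(text_a: str, text_b: str) -> dict[str, float]:
    """Compute the three pairwise similarity scores described above."""
    # Cosine similarity between TF-IDF vectors (0-1)
    tfidf = TfidfVectorizer().fit_transform([text_a, text_b])
    cosine = float(cosine_similarity(tfidf[0], tfidf[1])[0, 0])

    # Word overlap: common words / total unique words (0-1)
    words_a, words_b = set(text_a.lower().split()), set(text_b.lower().split())
    total_unique = len(words_a | words_b)
    overlap = len(words_a & words_b) / total_unique if total_unique else 0.0

    # Normalized similarity based on Levenshtein distance (0-1)
    lev_ratio = Levenshtein.ratio(text_a, text_b)

    return {"cosine": cosine, "word_overlap": overlap, "levenshtein": lev_ratio}
```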
In order to evaluate only the OCR part of the output file, we can remove all markdown notation by running the script as follows:
```bash
uv run src/evaluate.py -f <filename> -m
```

To run all unit tests:

```bash
uv run pytest tests/ -v
```

To run tests with coverage:

```bash
uv run pytest tests/ -v --cov=src --cov-report=term-missing
```

Integration tests require external services to be configured. To run them:

```bash
uv run pytest tests/ -v -m integration
```

The test files are:

- `test_base.py`: Tests for the base parser functionality
- `test_run.py`: Tests for the main script functionality
- `test_integration.py`: Integration tests requiring external services
The test suite is configured in pyproject.toml with the following settings:
- Test paths: `tests/`
- Test file pattern: `test_*.py`
- Coverage reporting: Enabled
- Integration test marker: `@pytest.mark.integration`
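For illustration, this is how a test would be marked so that it is picked up by the `-m integration` selector (the test name and body are hypothetical):

```python
import pytest


@pytest.mark.integration
def test_parser_against_live_api():
    """Hypothetical integration test, selected via `pytest -m integration`."""
    # Would call an external service here, which is why it carries the marker.
    ...
```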
The project uses:
- Pylint for code linting (configured in pyproject.toml)
- Black for code formatting
- MyPy for type checking
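For example, these tools could be invoked as follows (the targets shown here are assumptions, not taken from the project configuration):

```bash
uv run black src tests
uv run pylint src/run.py src/evaluate.py
uv run mypy src
```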
Install development dependencies:
```bash
uv pip install ".[dev]"
```

Install evaluation dependencies:

```bash
uv pip install ".[eval]"
```

This project is licensed under the MIT License - see the LICENSE file for details.