HierarchiSummarizer is a document processing tool that analyzes document hierarchy, extracts structured content, summarizes, and translates paragraphs from Markdown and PDF files. It supports multiple LLM providers and provides customizable output formats.
- Converts PDF and Markdown files into structured summaries.
- Uses Mistral API for PDF to Markdown conversion (Mistral API key required).
- Supports multiple LLM providers for summarization (Mistral, OpenAI, Groq, Ollama).
- Excludes specified sections and heading levels.
- Provides hierarchical or paragraph-style summaries.
- Outputs results in multiple languages.
- Allows customized summary output through prompt tuning (`additional_requirements` in `config.yaml`).
- Clone the repository:
  ```bash
  git clone https://github.com/jeongHwarr/HierarchiSummarizer.git
  cd HierarchiSummarizer
  ```
- Install dependencies:
  ```bash
  pip install -r requirements.txt
  ```
Since HierarchiSummarizer uses the Mistral API for PDF to Markdown conversion, obtaining a Mistral API key is mandatory.
Example configuration (in `config.yaml`):
```yaml
mistral:
  model: open-mistral-nemo # https://docs.mistral.ai/getting-started/models/models_overview/
  api_key: your_api_key # Required! This API key is needed for converting PDFs to Markdown.
```
- Visit the Mistral AI website
- Sign up or log in
- Navigate to the API section to generate your key
- Add your API key to the `config.yaml` file under `mistral.api_key`
- Modify the Configuration (`config.yaml`).
  - Open the `config.yaml` file and ensure you add your Mistral `api_key`. This key is required for the script to function properly.
- Place PDF or Markdown Files in the `workspace/to_process/` Folder.
  - Place any PDF or Markdown documents you wish to process in the `workspace/to_process/` directory.
  - If the document is a PDF, it will automatically be converted to Markdown before summarization.
- Run the Script.
  - To begin processing, execute the following command:

    ```bash
    python run.py
    ```
- Place PDF or Markdown files in `workspace/to_process/`.
  - If the file is a PDF, it will first be converted to Markdown format before the summarization process begins.
  - The conversion result will be saved as `workspace/output/{title}/pdf_to_md.md`.
- Summarized files will be saved in `workspace/output/`.
  - For each processed document, the final summary will be saved as `title_summary.md` in the corresponding folder.
  - If the document was originally a PDF, you will find the Markdown conversion in `workspace/output/{title}/pdf_to_md.md`.
- Processed files will be moved to `workspace/done/` (see the layout sketch below).
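Putting the steps above together, a processed document leaves the workspace looking roughly like this (a sketch assembled from the paths described above):

```text
workspace/
├── to_process/                # place input PDF/Markdown files here
├── output/
│   └── {title}/
│       ├── pdf_to_md.md       # created only for PDF inputs
│       └── title_summary.md   # final summary
└── done/                      # processed inputs are moved here
```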
Edit the `config.yaml` file to customize the summarization settings.
This section configures the AI model provider for summarization. Choose the model provider you want to use, and adjust the settings accordingly.
- `summary_provider`: The AI model provider for summarization. Options:
  - `mistral`: Use the Mistral AI API.
  - `open_ai`: Use the OpenAI API.
  - `groq`: Use the Groq AI API.
  - `ollama`: Use the Ollama model (for local inference).
- `temperature`: Controls the creativity of the generated text. Higher values (e.g., 0.7) produce more creative text, while lower values (e.g., 0.3) generate more consistent responses.
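As a rough sketch, the two settings above might be set like this in `config.yaml` (where exactly these keys live in the file is an assumption here; check your own `config.yaml` for the real structure):

```yaml
# Sketch only: key placement inside config.yaml may differ in your version.
summary_provider: mistral   # one of: mistral, open_ai, groq, ollama
temperature: 0.3            # lower = more consistent, higher = more creative
```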
- Each AI model provider (such as Mistral, OpenAI, Groq) requires an API key to function properly.
- Ensure that you add the required API keys for the selected provider.
- The Mistral `api_key` is REQUIRED! The Mistral API is used for converting PDF files into Markdown format. You must provide an API key in the `mistral` section to enable this conversion process.
Example configuration:
```yaml
mistral:
  model: open-mistral-nemo # https://docs.mistral.ai/getting-started/models/models_overview/
  api_key: your_api_key # Required! This API key is needed for converting PDFs to Markdown.
```
This section defines the directory paths for input, output, and processed files.
- `input_dir_path`: The directory where documents to be processed are located. Default: `workspace/to_process`.
- `output_dir_path`: The directory where summarized files will be saved. Default: `workspace/output`.
- `done_dir_path`: The directory where processed files will be moved after summarization. Default: `workspace/done`.
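A minimal sketch of these path settings with their documented defaults (whether they sit under a parent key in `config.yaml` is not shown here and should be verified against the actual file):

```yaml
# Directory settings (sketch; defaults taken from the descriptions above)
input_dir_path: workspace/to_process   # documents to be processed
output_dir_path: workspace/output      # summarized files are written here
done_dir_path: workspace/done          # processed files are moved here
```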
This section customizes the output format and language of the summary.
- `language`: The language of the summary. Default is `english`.
- `exclude_level`: A list of heading levels to exclude from the summary. For example, `- 1` excludes level 1 headings (e.g., `# Title`).
- `exclude_section`: A list of sections to exclude from summarization, such as `- reference`, `- references`.
- `summary_style`: The style of the summary. Options:
  - `hierarchical bullet list`
  - `hierarchical numbered list`
  - `bullet list`
  - `paragraph`
  - Others depending on your needs.
- `summary_level`: The detail level of the summary. Options:
  - `detailed`
  - `medium`
  - `concise`
  - Others depending on your needs.
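For illustration, the output options above could be combined as follows. This is a sketch based only on the option names and values documented here, so verify the exact nesting and list syntax against your `config.yaml`:

```yaml
# Output settings (sketch based on the options documented above)
language: english
exclude_level:
  - 1                 # skip level 1 headings such as "# Title"
exclude_section:
  - reference
  - references
summary_style: hierarchical bullet list
summary_level: concise
```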
Customize the prompt that is sent to the AI model for summarization.
- `additional_requirements`: Add custom instructions for how the AI should generate the summary. For example:

  ```yaml
  additional_requirements: |
    **Sentence Structure**: Do NOT use periods at the end of sentences.
  ```
HierarchiSummarizer supports multiple LLM providers:
- Mistral
- OpenAI
- Groq
- Ollama
This project is licensed under the MIT License.
Pull requests and feature suggestions are welcome.
For issues or inquiries, open an issue on GitHub or contact the maintainer.