README for Extraction Package

This code is part of the TA2 project for USGS. The package extracts information from mining reports (deposit types, mineral inventory, mining report references, and mining site information) to build a larger database. The most up-to-date package is stored in ./extraction_package.

Installation (requires python >3.12 and pip)

Create virtual environment (python, anaconda, etc.)
In the project root: pip install -r requirements.txt

How to run Docker Image

Clone the Repository: git clone git@github.com:DARPA-CRITICALMAAS/ta2-extraction.git
Create a .env file containing API_KEY and CDR_BEARER so that the application works. API_KEY should be an OpenAI API key, and CDR_BEARER is the bearer token used to connect to the CDR (Polymer).
Fill out any necessary variables in settings.py
Build the Docker Image: docker build -t my-extraction-app .
Running the Docker Container:

    docker run \
    -v /path/to/stored/reports/ta2-extraction/reports:/app/reports \
    -v /path/to/stored/reports/ta2-extraction/output:/app/output \
    my-extraction-app \
    python -m extraction_package.pipeline \
    --pdf_p "/app/reports/" \
    --pdf_name "FileName.pdf" \
    --output_path "/app/output/"

Note: Make sure the directories for finding the reports and saving the output are correctly named. This will only run one file at a time.
--pdf_p: the path to the folder where the reports are stored.
--pdf_name: the name of the file that you want to extract from.
--output_path: the folder where you want to store the output.
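
Because the container processes one file per run, a small wrapper can generate one docker run command per report. The following is a minimal Python sketch under stated assumptions, not part of the package: the helper names build_docker_command and commands_for_folder are hypothetical, the host paths are placeholders, and the image name my-extraction-app is taken from the build step above. It only prints the commands so that they can be reviewed before anything is run.

```python
import shlex
from pathlib import Path


def build_docker_command(reports_dir, output_dir, pdf_name, image="my-extraction-app"):
    """Return the `docker run` argument list for one PDF, mirroring the README command."""
    return [
        "docker", "run",
        "-v", f"{reports_dir}:/app/reports",   # host reports folder -> container
        "-v", f"{output_dir}:/app/output",     # host output folder -> container
        image,
        "python", "-m", "extraction_package.pipeline",
        "--pdf_p", "/app/reports/",
        "--pdf_name", pdf_name,
        "--output_path", "/app/output/",
    ]


def commands_for_folder(reports_dir, output_dir):
    """Build one command per PDF found directly inside reports_dir, sorted by name."""
    pdf_names = sorted(p.name for p in Path(reports_dir).glob("*.pdf"))
    return [build_docker_command(reports_dir, output_dir, name) for name in pdf_names]


if __name__ == "__main__":
    # Hypothetical host paths; replace them with your own directories.
    for cmd in commands_for_folder("/path/to/reports", "/path/to/output"):
        print(shlex.join(cmd))
```

Each printed line can be pasted into a shell, or the argument lists can be passed to subprocess.run(cmd, check=True) to run the files one after another.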

Extraction Package Directory

Note: all loggers should be named similarly to the file they belong to. The documents you want to process should also already be downloaded locally.

Updated Jan 2025.

extraction_package/
| |---- __init__
| |---- pipeline.py : the main driver of the code
| |---- genericFunctions: all generalized functions that do not just relate to a single section
| |---- mineralInventoryHelp: code to generate the Mineral Inventory
| |---- documentRefHelp: code to generate the Mineral Site and document reference information
| |---- LLMFunctions: all code related to calling the LLM
| |---- LLMModels: all formats for structured outputs
| |---- depositTypesHelp : the code given to derive the deposit type candidates
| |---- extractionPrompts: all prompts used
| |---- schemaFormats: all the formats that match the schema derived by the larger TA2 team

Description of Variables

file_name: the filename of the PDF that you want to extract from. The file_name is expected to contain the record_id as part of the name.
folderpath: path to the folder where the report PDF is stored.
output_path: path to the folder where the mineral inventory extraction JSON should be written.

How to run One File

Make sure all variables in the .env are correctly formatted (i.e. API key and CDR Bearer)
Create virtual environment
Install required dependencies: pip install -r requirements.txt
Run the pipeline for one file, which must already be downloaded locally. From the top of the ta2-extraction directory, run: python -m extraction_package.pipeline --pdf_p "/path/to/reports/" --pdf_name "FileName.pdf" --output_path "/path/to/stored/output/"
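
Before running the pipeline it can help to confirm that the .env file actually defines both required variables. The sketch below is a hedged illustration rather than part of the package: it assumes the .env file uses plain KEY=VALUE lines (the README does not specify the format), and the helper names parse_env and missing_keys are hypothetical.

```python
from pathlib import Path


def parse_env(text):
    """Parse simple KEY=VALUE lines; blank lines and # comments are skipped."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(text, required=("API_KEY", "CDR_BEARER")):
    """Return the required keys that are absent or empty in the .env text."""
    values = parse_env(text)
    return [key for key in required if not values.get(key)]


if __name__ == "__main__":
    env_path = Path(".env")  # assumed to sit in the project root
    if not env_path.exists():
        print("No .env file found in the current directory")
    else:
        missing = missing_keys(env_path.read_text())
        print("Missing keys:", ", ".join(missing) if missing else "none")
```

This only checks that the keys are present and non-empty; it does not verify that the OpenAI key or the CDR bearer token is valid.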

CDR Polymer process

Log in to the CDR: https://auth.polymer.rocks/
Upload a document by hitting the CDR dropdown in top right corner
Insert document name and information.
Hit upload
Once upload completes, go to the file
Hit the process button to initiate the extraction of the document

Version Control

Current Version 3.0

Major Changes

Removed assistants in favor of a more generic model for longevity
Changed the approach for getting page numbers
Use of structured outputs and OpenAI improvements for the 4o models
Working on adding a filtered extraction
Updates to the schema

Previous Version 2.0: extraction_package

Major Changes

LARGE overhaul to make the code more scalable, more readable, and in line with MPG's standards
Added logging and more try/catch handling
Modularized the code into separate files

Past version explanations

1.2 Changes for 9 month, pushed to main DATE
Reference: https://platform.openai.com/docs/assistants/tools/file-search

Removed the need to look at total for categories or for zones
Using gpt-4-Turbo
updates to assistants v2 which includes JSON return
add a separate check for just tonnage or units using chat GPT
utilizing tenacity for any run status failures/errors
Changing the author prompts
Updating how we get pages by looking for key words

1.1 Change to the extraction clean-up

Adding the instance check & removal of keys
New unit keys were added
This was used for cobalt
Errors: not picking up all the rows, schema errors with removal of keys

1.0 Initial Prompts

This was done for copper & nickel & zinc

OLD Extraction Package Directory v1

extraction_package/
| |---- __init__
| |---- extraction_pipeline : the main driver of the code
| |---- extraction_functions : stores all the functions needed to help the pipeline code
| |---- prompts: all prompts used
| |---- schema_formats: all the formats that match the schema derived by the larger TA2 team
