e-Gov Law Data Fetcher and Summarizer

Overview

This script fetches law data from the Japanese e-Gov API, generates summaries using the OpenAI API, and saves the results to JSON files. To handle long law texts, it splits each text into chunks and summarizes them hierarchically.
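
As a rough illustration of chunking plus hierarchical summarization, the sketch below summarizes each chunk and then summarizes the joined summaries until the text fits in one chunk. It is not the code in LawDataGPT.py: the function names are hypothetical, the limit is counted in characters rather than tokens, and `first_sentence` is a toy stand-in for the OpenAI call.

```python
from typing import Callable, List


def split_into_chunks(text: str, max_chunk_chars: int) -> List[str]:
    """Split text into consecutive pieces of at most max_chunk_chars characters."""
    return [text[i:i + max_chunk_chars] for i in range(0, len(text), max_chunk_chars)]


def hierarchical_summary(text: str, summarize: Callable[[str], str], max_chunk_chars: int) -> str:
    """Summarize chunk by chunk, then summarize the joined summaries, until one chunk is left."""
    if len(text) <= max_chunk_chars:
        return summarize(text)
    partial = [summarize(chunk) for chunk in split_into_chunks(text, max_chunk_chars)]
    combined = "\n".join(partial)
    if len(combined) >= len(text):
        # Guard: stop recursing if the summarizer failed to shrink the text.
        return summarize(combined[:max_chunk_chars])
    return hierarchical_summary(combined, summarize, max_chunk_chars)


def first_sentence(text: str) -> str:
    """Toy stand-in for a model call (assumption): keep only the first sentence."""
    return text.split(".")[0].strip() + "."


if __name__ == "__main__":
    long_text = "Article one. Details. " * 50
    print(hierarchical_summary(long_text, first_sentence, max_chunk_chars=100))
```

In a real run, `summarize` would wrap a model request and the limit would come from the model's context size.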

Features

Fetches law lists and full texts for specified categories and date ranges.
Generates hierarchical summaries of laws using the OpenAI API.
Writes output to JSON files, splitting them automatically to stay under a file size limit.
Easy configuration of API keys and settings.

Prerequisites

Python 3.8 or higher
Python libraries:
- requests
- tqdm
- openai
- retrying
An OpenAI API key set in the environment variable OPENAI_API_KEY
A valid api_headers.json file containing the e-Gov API headers

Installation

Clone this repository or download LawDataGPT.py.

Install dependencies:

pip install requests tqdm openai retrying

Export your OpenAI API key:

export OPENAI_API_KEY="your_openai_api_key"

Ensure api_headers.json is present in the working directory (or pass a custom path via --headers).

Usage

Run the script using:

python LawDataGPT.py <categories> --start_date YYYY-MM-DD --end_date YYYY-MM-DD \
    [--headers path/to/api_headers.json] \
    [--model MODEL_NAME] \
    [--max_chunk_tokens N]

Arguments

<categories>: Space-separated list of category numbers to process. Available categories are:
- 2: Constitutional laws and acts
- 3: Cabinet orders and Imperial ordinances
- 4: Ministerial ordinances and regulations
--start_date: Start date in YYYY-MM-DD format.
--end_date: End date in YYYY-MM-DD format.

--headers: Path to API headers JSON file (default: api_headers.json)
--model: OpenAI model name for summarization (default: gpt-4o-mini)
--max_chunk_tokens: Override chunk token limit (auto-detected if omitted)
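
The argument list above could be declared with argparse roughly as follows. This is a sketch based only on the options documented here, not the script's actual parser: `parse_date` and `build_parser` are hypothetical names, and treating the two dates as required is an assumption drawn from the usage line.

```python
import argparse
from datetime import date, datetime


def parse_date(value: str) -> date:
    """Parse a YYYY-MM-DD string into a date, or fail with a readable argparse error."""
    try:
        return datetime.strptime(value, "%Y-%m-%d").date()
    except ValueError:
        raise argparse.ArgumentTypeError(f"expected YYYY-MM-DD, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented options (hypothetical helper)."""
    parser = argparse.ArgumentParser(prog="LawDataGPT.py")
    parser.add_argument("categories", nargs="+", type=int, choices=[2, 3, 4],
                        help="category numbers to process")
    # Assumption: both dates are mandatory, as the usage line suggests.
    parser.add_argument("--start_date", required=True, type=parse_date)
    parser.add_argument("--end_date", required=True, type=parse_date)
    parser.add_argument("--headers", default="api_headers.json")
    parser.add_argument("--model", default="gpt-4o-mini")
    parser.add_argument("--max_chunk_tokens", type=int, default=None)
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```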

Example

Fetch and summarize laws in categories 2, 3, and 4 between 2004-12-15 and 2024-12-15, using GPT-4.1 as the model:

python LawDataGPT.py 2 3 4 --start_date 2004-12-15 --end_date 2024-12-15 \
    --model gpt-4.1

Output

The script saves summarized data into JSON files named according to the format:

law_data_category_<category>_<file_count>_<start_date>_to_<end_date>.json

Each law record in a file contains the following fields:

LawId: Unique identifier for the law.
LawNumber: Number assigned to the law.
LawName: Name of the law.
PromulgationDate: Date the law was promulgated.
Summary: Hierarchical summary of the law's content.
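
To make the naming scheme and record shape concrete, here is a minimal sketch. The helper names are hypothetical, the field values are placeholders rather than real law data, and storing the records as a JSON array is an assumption about the file layout.

```python
import json
from typing import Dict, List


def output_filename(category: int, file_count: int, start_date: str, end_date: str) -> str:
    """Build a file name following the documented pattern."""
    return f"law_data_category_{category}_{file_count}_{start_date}_to_{end_date}.json"


def make_record(law_id: str, law_number: str, law_name: str,
                promulgation_date: str, summary: str) -> Dict[str, str]:
    """Assemble one record with the five documented fields."""
    return {
        "LawId": law_id,
        "LawNumber": law_number,
        "LawName": law_name,
        "PromulgationDate": promulgation_date,
        "Summary": summary,
    }


def serialize(records: List[Dict[str, str]]) -> str:
    """Serialize records as a JSON array (assumed layout); keep Japanese text readable."""
    return json.dumps(records, ensure_ascii=False, indent=2)


if __name__ == "__main__":
    record = make_record("EXAMPLE_ID", "EXAMPLE_NUMBER", "Example law name",
                         "2004-12-15", "Example summary text.")
    print(output_filename(2, 1, "2004-12-15", "2024-12-15"))
    print(serialize([record]))
```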

Error Handling

If the OpenAI API key is not set, the script exits with an error message.
If the api_headers.json file is missing or invalid, the script exits with an error message.
If the API response contains invalid data or fails, the script logs an appropriate message and continues processing other laws.
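
The three behaviours above might look like the following sketch. It only illustrates the pattern (exit early on missing configuration, log and continue on per-law failures); the function names and messages are hypothetical and not taken from LawDataGPT.py.

```python
import json
import logging
import os
import sys
from typing import Callable, Dict, Iterable, List, Mapping, Optional


def require_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the OpenAI API key, or exit with an error message if it is not set."""
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY")
    if not key:
        sys.exit("Error: OPENAI_API_KEY is not set.")
    return key


def load_headers(path: str = "api_headers.json") -> Dict[str, str]:
    """Load the e-Gov API headers, or exit if the file is missing or not valid JSON."""
    try:
        with open(path, encoding="utf-8") as handle:
            headers = json.load(handle)
    except (OSError, json.JSONDecodeError) as exc:
        sys.exit(f"Error: cannot load API headers from {path}: {exc}")
    if not isinstance(headers, dict):
        sys.exit(f"Error: {path} must contain a JSON object.")
    return headers


def process_all(law_ids: Iterable[str], process_one: Callable[[str], object]) -> List[object]:
    """Process each law; log failures and continue with the remaining laws."""
    results = []
    for law_id in law_ids:
        try:
            results.append(process_one(law_id))
        except Exception as exc:  # broad on purpose: one bad law must not stop the run
            logging.warning("Skipping %s: %s", law_id, exc)
    return results
```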

Configuration

API Key

Set your OpenAI API key in the environment:

export OPENAI_API_KEY="your_openai_api_key"

API Headers

Prepare the api_headers.json file with valid e-Gov API headers:

{
  "Authorization": "Bearer your_api_token",
  "Accept": "application/xml"
}

File Size Limit

The default file size limit is 3 MB (DEFAULT_FILE_SIZE_LIMIT). Adjust in the code if needed.
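
One way to keep each output file under the limit is to group records into batches by their serialized size, as sketched below. This is an assumption about the approach, not the script's code: `batch_by_size` is a hypothetical helper, the size estimate assumes compact `json.dumps` output, and whether the script treats 3 MB as 3 × 1024 × 1024 bytes is not stated.

```python
import json
from typing import Dict, Iterable, Iterator, List

# Assumption: 3 MB is taken as 3 * 1024 * 1024 bytes.
DEFAULT_FILE_SIZE_LIMIT = 3 * 1024 * 1024


def batch_by_size(records: Iterable[Dict[str, str]],
                  limit: int = DEFAULT_FILE_SIZE_LIMIT) -> Iterator[List[Dict[str, str]]]:
    """Yield lists of records whose compact JSON array stays within `limit` bytes.

    A single record larger than the limit is still emitted, alone in its own batch.
    """
    batch: List[Dict[str, str]] = []
    size = 2  # the enclosing "[" and "]"
    for record in records:
        # +2 accounts for the ", " separator json.dumps puts between items.
        encoded = len(json.dumps(record, ensure_ascii=False).encode("utf-8")) + 2
        if batch and size + encoded > limit:
            yield batch
            batch, size = [], 2
        batch.append(record)
        size += encoded
    if batch:
        yield batch
```

Each yielded batch would then be written to its own file, with the file counter in the name incremented per batch.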

Model & Chunk Size

By default, the script uses gpt-4o-mini with an auto chunk limit. If you specify --model, the chunk size is set to:

gpt-4.1: ~1,047,576 tokens
gpt-4-32k: 32,768 tokens
gpt-4 (8k context): 8,192 tokens
otherwise: 3,500 tokens

Use --max_chunk_tokens to override.
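
The mapping above can be expressed as a small lookup. The sketch below is hypothetical: `chunk_token_limit` is not a name from the script, and the matching rules (prefix match for gpt-4.1 and gpt-4-32k, exact match for gpt-4) are assumptions about how model names are compared.

```python
from typing import Optional


def chunk_token_limit(model: str, override: Optional[int] = None) -> int:
    """Return the chunk token limit for a model, honouring an explicit override."""
    if override is not None:
        return override  # corresponds to --max_chunk_tokens
    if model.startswith("gpt-4.1"):
        return 1_047_576
    if model.startswith("gpt-4-32k"):
        return 32_768
    if model == "gpt-4":
        return 8_192
    # Every other model, including the default gpt-4o-mini under this reading.
    return 3_500


if __name__ == "__main__":
    for name in ("gpt-4.1", "gpt-4-32k", "gpt-4", "gpt-4o-mini"):
        print(name, chunk_token_limit(name))
```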

Development Notes

The default model is gpt-4o-mini, but you can override it with --model. Chunk sizes and max summary tokens are auto-detected.
API rate limits are managed with time.sleep(); tune delays as needed.
Date inputs must be YYYY-MM-DD.
Imports of tqdm, openai, and retrying are guarded so that tests and mock environments can run without the actual packages installed.
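
Two of the notes above, guarded imports and sleep-based rate limiting, can be sketched as follows. The fallback `tqdm` and the `throttled` helper are hypothetical illustrations of the pattern, and the one-second delay is a placeholder rather than a value taken from the script.

```python
import time
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")

try:
    from tqdm import tqdm
except ImportError:
    def tqdm(iterable, **kwargs):
        """Fallback when tqdm is not installed: iterate without a progress bar."""
        return iterable


def throttled(items: Iterable[T], delay_seconds: float = 1.0,
              sleep: Callable[[float], None] = time.sleep) -> Iterator[T]:
    """Yield items one by one, sleeping between consecutive items."""
    for index, item in enumerate(items):
        if index:
            sleep(delay_seconds)  # no pause before the first item
        yield item


if __name__ == "__main__":
    for law_id in tqdm(list(throttled(["a", "b", "c"], delay_seconds=0.1))):
        print(law_id)
```

Passing `sleep` as a parameter lets tests replace the real delay with a recorder, so the throttling logic can be checked without waiting.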

Contributing

Contributions are welcome! Please open an issue or submit a pull request with proposed changes.

License

This script is licensed under the MIT License. See LICENSE for details.

Special Thanks

Toshio Yamada (LinkedIn, Github)

SUN SHIHAO
