This script fetches law data from the Japanese e-Gov API, generates summaries using the OpenAI API, and saves the results into JSON files. It is designed to handle large volumes of data efficiently by chunking and hierarchical summarization.
- Fetches law lists and full texts for specified categories and date ranges.
- Generates hierarchical summaries of laws using the OpenAI API.
- Supports output to JSON files, automatically managing file sizes.
- Easy configuration of API keys and settings.
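The chunking and hierarchical summarization mentioned above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: `split_into_chunks` and `hierarchical_summary` are hypothetical names, the `summarize` callable stands in for an OpenAI API call, and chunk size is measured in characters here for simplicity, whereas the script works with token limits.

```python
from typing import Callable, List


def split_into_chunks(text: str, max_chunk_chars: int) -> List[str]:
    """Split text into consecutive pieces of at most max_chunk_chars characters."""
    if max_chunk_chars <= 0:
        raise ValueError("max_chunk_chars must be positive")
    return [text[i:i + max_chunk_chars] for i in range(0, len(text), max_chunk_chars)]


def hierarchical_summary(
    text: str,
    summarize: Callable[[str], str],  # placeholder for a model call (assumption)
    max_chunk_chars: int,
) -> str:
    """Summarize each chunk, then summarize the joined summaries, until one chunk is left."""
    chunks = split_into_chunks(text, max_chunk_chars)
    while len(chunks) > 1:
        combined = "\n".join(summarize(chunk) for chunk in chunks)
        next_chunks = split_into_chunks(combined, max_chunk_chars)
        if len(next_chunks) >= len(chunks):
            # The summarizer is not shrinking the text; keep only the first
            # chunk rather than looping forever.
            next_chunks = next_chunks[:1]
        chunks = next_chunks
    return summarize(chunks[0]) if chunks else ""
```

Passing the summarizer in as a callable keeps the sketch testable without network access; a real run would wrap the OpenAI client call in that callable.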
- Python 3.8 or higher
- Python libraries: requests, tqdm, openai, retrying
- An OpenAI API key set in the environment variable OPENAI_API_KEY
- A valid api_headers.json file containing the e-Gov API headers
- Clone this repository or download LawDataGPT.py.
- Install dependencies:
pip install requests tqdm openai retrying
- Export your OpenAI API key:
export OPENAI_API_KEY="your_openai_api_key"
- Ensure api_headers.json is present in the working directory (or pass a custom path via --headers).
Run the script using:
python LawDataGPT.py <categories> --start_date YYYY-MM-DD --end_date YYYY-MM-DD \
[--headers path/to/api_headers.json] \
[--model MODEL_NAME] \
[--max_chunk_tokens N]
- <categories>: Space-separated list of category numbers to process. Available categories are:
  - 2: Constitutional laws and acts
  - 3: Cabinet orders and Imperial ordinances
  - 4: Ministerial ordinances and regulations
- --start_date: Start date in YYYY-MM-DD format.
- --end_date: End date in YYYY-MM-DD format.
- --headers: Path to API headers JSON file (default: api_headers.json)
- --model: OpenAI model name for summarization (default: gpt-4o-mini)
- --max_chunk_tokens: Override chunk token limit (auto-detected if omitted)
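For readers who want to see how the options above fit together, here is a rough `argparse` sketch of the documented command line. It is an assumption about how such a parser could look, not the script's actual parser: `build_parser` and `parse_date` are hypothetical names, and restricting categories to 2, 3, and 4 is inferred from the list of available categories.

```python
import argparse
from datetime import datetime


def parse_date(value: str) -> str:
    """Check that the value parses as a year-month-day date and return it unchanged."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        raise argparse.ArgumentTypeError(f"expected YYYY-MM-DD, got {value!r}")
    return value


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the options documented in this README."""
    parser = argparse.ArgumentParser(prog="LawDataGPT.py")
    # One or more category numbers; the allowed set is an assumption.
    parser.add_argument("categories", nargs="+", type=int, choices=[2, 3, 4])
    parser.add_argument("--start_date", required=True, type=parse_date)
    parser.add_argument("--end_date", required=True, type=parse_date)
    parser.add_argument("--headers", default="api_headers.json")
    parser.add_argument("--model", default="gpt-4o-mini")
    # None means "auto-detect the chunk limit from the model".
    parser.add_argument("--max_chunk_tokens", type=int, default=None)
    return parser
```

Calling `build_parser().parse_args(["2", "3", "--start_date", "2004-12-15", "--end_date", "2024-12-15"])` yields the two categories as integers and the documented defaults for the optional flags.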
Fetch and summarize laws in categories 2, 3, and 4 between 2004-12-15 and 2024-12-15, using GPT-4.1 as the model:
python LawDataGPT.py 2 3 4 --start_date 2004-12-15 --end_date 2024-12-15 \
--model gpt-4.1
The script saves summarized data into JSON files named according to the format:
law_data_category_<category>_<file_count>_<start_date>_to_<end_date>.json
Each file contains the following fields:
- LawId: Unique identifier for the law.
- LawNumber: Number assigned to the law.
- LawName: Name of the law.
- PromulgationDate: Date the law was promulgated.
- Summary: Hierarchical summary of the law's content.
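The naming scheme and automatic size management described above could look something like the sketch below. The helper names `output_filename` and `split_records_by_size` are hypothetical, the interpretation of 3 MB as 3 * 1024 * 1024 bytes is an assumption, and the script itself may split files differently.

```python
import json
from typing import Dict, List

# Assumption: "3 MB" is interpreted as 3 * 1024 * 1024 bytes.
DEFAULT_FILE_SIZE_LIMIT = 3 * 1024 * 1024


def output_filename(category: int, file_count: int, start_date: str, end_date: str) -> str:
    """Build a file name following the documented pattern."""
    return f"law_data_category_{category}_{file_count}_{start_date}_to_{end_date}.json"


def split_records_by_size(records: List[Dict], size_limit: int) -> List[List[Dict]]:
    """Group records into batches whose JSON encoding fits in size_limit bytes.

    A single record larger than the limit still gets a batch of its own.
    """
    batches: List[List[Dict]] = []
    current: List[Dict] = []
    for record in records:
        candidate = current + [record]
        size = len(json.dumps(candidate, ensure_ascii=False).encode("utf-8"))
        if current and size > size_limit:
            # Adding this record would exceed the limit: close the batch.
            batches.append(current)
            current = [record]
        else:
            current = candidate
    if current:
        batches.append(current)
    return batches
```

Each batch would then be written to its own file, with `file_count` incremented per batch. Re-serializing the candidate batch on every record is simple but quadratic; tracking a running size would be cheaper for large runs.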
- If the OpenAI API key is not set, the script exits with an error message.
- If the api_headers.json file is missing or invalid, the script exits with an error message.
- If the API response contains invalid data or fails, the script logs an appropriate message and continues processing other laws.
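The two startup checks above can be sketched as follows. The function names `require_openai_key` and `load_api_headers` and the exact error messages are assumptions for illustration; the script's own wording may differ.

```python
import json
import os
import sys
from typing import Dict


def require_openai_key() -> str:
    """Return the OpenAI API key, or exit with an error if it is not set."""
    key = os.environ.get("OPENAI_API_KEY")
    if not key:
        sys.exit("Error: the OPENAI_API_KEY environment variable is not set.")
    return key


def load_api_headers(path: str) -> Dict[str, str]:
    """Load the e-Gov API headers, or exit with an error if the file is missing or invalid."""
    try:
        with open(path, encoding="utf-8") as fh:
            headers = json.load(fh)
    except (OSError, json.JSONDecodeError) as exc:
        sys.exit(f"Error: could not load API headers from {path}: {exc}")
    if not isinstance(headers, dict):
        sys.exit(f"Error: {path} must contain a JSON object of header names and values.")
    return headers
```

`sys.exit` with a string prints the message to stderr and exits with status 1, which matches the "exits with an error message" behavior described above.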
Set your OpenAI API key in the environment:
export OPENAI_API_KEY="your_openai_api_key"
Prepare the api_headers.json file with valid e-Gov API headers:
{
  "Authorization": "Bearer your_api_token",
  "Accept": "application/xml"
}
The default file size limit is 3 MB (DEFAULT_FILE_SIZE_LIMIT). Adjust in the code if needed.
By default, the script uses gpt-4o-mini with an auto chunk limit. If you specify --model, the chunk size is set to:
- gpt-4.1: ~1 047 576 tokens
- gpt-4-32k: 32 768 tokens
- gpt-4 (8k context): 8 192 tokens
- otherwise: 3 500 tokens
Use --max_chunk_tokens to override.
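The model-to-limit mapping above amounts to a small lookup with a fallback. The sketch below assumes exact model-name matching and uses the hypothetical names `MODEL_CHUNK_LIMITS` and `chunk_token_limit`; the script may match model names differently (for example by prefix).

```python
from typing import Optional

# Limits taken from the list above; exact-name matching is an assumption.
MODEL_CHUNK_LIMITS = {
    "gpt-4.1": 1_047_576,
    "gpt-4-32k": 32_768,
    "gpt-4": 8_192,
}
FALLBACK_CHUNK_LIMIT = 3_500


def chunk_token_limit(model: str, override: Optional[int] = None) -> int:
    """Return the chunk token limit for a model, honoring an explicit override."""
    if override is not None:
        return override
    return MODEL_CHUNK_LIMITS.get(model, FALLBACK_CHUNK_LIMIT)
```

Under this reading, the default gpt-4o-mini falls into the "otherwise" case and gets 3 500 tokens unless --max_chunk_tokens is given.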
- The default model is gpt-4o-mini, but you can override it with --model. Chunk sizes and max summary tokens are auto-detected.
- API rate limits are managed with time.sleep(); tune delays as needed.
- Date inputs must be YYYY-MM-DD.
- Imports of tqdm, openai, and retrying are guarded so that tests and mock environments work without the actual packages.
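As an illustration of the guarded imports and sleep-based rate limiting noted above, a minimal sketch might look like this. The fallback `tqdm` function, the `process_with_delay` helper, and the one-second default delay are all assumptions, not the script's actual code.

```python
import time
from typing import Callable, Iterable, List

try:
    from tqdm import tqdm
except ImportError:
    # Fallback so the code still runs where tqdm is not installed:
    # iterate without a progress bar.
    def tqdm(iterable, **kwargs):
        return iterable


def process_with_delay(
    items: Iterable,
    handler: Callable,
    delay_seconds: float = 1.0,  # hypothetical default; tune to the API's limits
) -> List:
    """Apply handler to each item, pausing after each call to respect rate limits."""
    results = []
    for item in tqdm(items):
        results.append(handler(item))
        time.sleep(delay_seconds)
    return results
```

In tests, passing `delay_seconds=0` avoids waiting while exercising the same code path.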
Contributions are welcome! Please open an issue or submit a pull request with proposed changes.
This script is licensed under the MIT License. See LICENSE for details.
Toshio Yamada (LinkedIn, Github)
SUN SHIHAO