A Python application for extracting, preparing, and classifying Knowledge Graphs, leveraging LLMs and traditional machine learning.
Thesis Project, University of Salerno, ISISLab
KgSum is a Python application for extracting, preparing, and classifying Knowledge Graphs (KGs). It combines Large Language Models (such as Mistral 7B Instruct fine-tuned with QLoRA) with traditional machine learning for effective graph classification and profiling.
Thesis Project for Bachelor's Degree
University of Salerno
Lab: ISISLab
Author: Mario Cosenza
Supervisor: Maria Angela Pellegrino
Follow these steps to set up KgSum locally.
- Miniconda (required)
- Python 3.12 (suggested)
- CUDA 12.8 (for transformer models like Mistral)
- NVIDIA GPU (recommended: RTX 3070 or higher)
- Node.js
- npm
- Docker
- Docker Compose
1. Clone the repository:

   ```sh
   git clone https://github.com/mariocosenza/kgsum.git
   cd kgsum
   ```

2. Create and activate the conda environment:

   ```sh
   conda env create -f environment.yml
   conda activate kgsum
   ```

3. For GPU/transformer models (Mistral):
   - Comment out the CUDA libraries in `environment.yml`
   - Change the TensorFlow version to a GPU-compatible version, as suggested in the comments

4. Install the frontend dependencies:

   ```sh
   npm install
   ```

5. Run the frontend:

   ```sh
   npm run dev
   ```

6. For GraphDB embedding visualization:
   - Replace GraphDB's `security-config.xml` with the one in `/docker/graphdb`
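After creating the environment, a quick sanity check can confirm the suggested prerequisites are met. This is a minimal sketch, not part of the repository; it only inspects TensorFlow if it happens to be installed, since the conda environment decides the actual deep-learning stack.

```python
import importlib.util
import sys


def environment_report():
    """Best-effort check of the suggested prerequisites.

    Python 3.12 is suggested by the project; the GPU check runs only
    if TensorFlow is importable in the current environment.
    """
    report = {
        "python_3_12": sys.version_info[:2] == (3, 12),
        "tensorflow_installed": importlib.util.find_spec("tensorflow") is not None,
    }
    if report["tensorflow_installed"]:
        import tensorflow as tf

        # A CUDA-enabled build with a visible NVIDIA GPU reports at least one device.
        report["gpu_visible"] = len(tf.config.list_physical_devices("GPU")) > 0
    return report


print(environment_report())
```

If `gpu_visible` is `False` on a machine with an NVIDIA GPU, revisit step 3 above (CUDA libraries and the GPU-compatible TensorFlow version in `environment.yml`).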
Set the following environment variables in your shell:
export GEMINI_API_KEY=your_gemini_api_key_here
export LOCAL_ENDPOINT_LOV=http://your-local-endpoint
export LOCAL_ENDPOINT=http://your-local-endpoint
export SECRET_KEY=your_secret_key_here
export UPLOAD_FOLDER=/path/to/uploads
export UPLOAD=true
export NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY=your_clerk_publishable_key
export CLASSIFICATION_API_URL=http://localhost:5000
export GITHUB_TOKEN=your_github_token_here

Configure the backend by editing `config.json`:
```json
{
  "labeling": {
    "use_gemini": false,
    "search_zenodo": true,
    "search_github": true,
    "search_lod_cloud": true,
    "stop_before_merging": false
  },
  "extraction": {
    "start_offset": 0,
    "step_numbers": 10,
    "step_range": 16,
    "extract_sparql": true,
    "query_lov": false
  },
  "processing": {
    "use_ner": false,
    "use_filter": true
  },
  "training": {
    "classifier": "NAIVE_BAYES",
    "feature": ["CURI", "PURI", "LAB", "CON", "TLDS", "VOC", "LCN", "LPN", "DSC", "SBJ"],
    "oversample": true,
    "max_token": 36000,
    "use_tfidf_autoencoder": true
  },
  "profile": {
    "store_profile_after_training": false,
    "base_domain": "http://www.isislab.it"
  },
  "general_settings": {
    "info": "Possible classifiers: SVM, NAIVE_BAYES, KNN, J48, MISTRAL, MLP, DEEP, BATCHNORM, Phase: LABELING, EXTRACTION, PROCESSING, TRAINING, STORE",
    "start_phase": "labeling",
    "stop_phase": "training",
    "allow_upload": true
  }
}
```

Available Classifiers: SVM, NAIVE_BAYES, KNN, J48, MISTRAL, MLP, DEEP, BATCHNORM
Available Features: CURI, PURI, LAB, CON, TLDS, VOC, LCN, LPN, DSC, SBJ
Processing Phases: LABELING, EXTRACTION, PROCESSING, TRAINING, STORE
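As a sketch of how the classifiers and phases above fit together, a small hypothetical helper (not part of the repository) could rewrite the parsed `config.json` dictionary before a run. The key names match the config shown above; load and save the file with `json.load`/`json.dump`.

```python
# Allowed values, taken from the config documentation above.
CLASSIFIERS = {"SVM", "NAIVE_BAYES", "KNN", "J48", "MISTRAL", "MLP", "DEEP", "BATCHNORM"}
PHASES = ["labeling", "extraction", "processing", "training", "store"]


def configure_run(config, classifier, start_phase="labeling", stop_phase="training"):
    """Set the classifier and phase range on a parsed config.json dict.

    Hypothetical helper: validates against the lists documented above
    before mutating the "training" and "general_settings" sections.
    """
    if classifier not in CLASSIFIERS:
        raise ValueError(f"unknown classifier: {classifier}")
    if PHASES.index(start_phase) > PHASES.index(stop_phase):
        raise ValueError("start_phase must not come after stop_phase")
    config["training"]["classifier"] = classifier
    config["general_settings"]["start_phase"] = start_phase
    config["general_settings"]["stop_phase"] = stop_phase
    return config
```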
Run the complete training process from extraction to model training:

```sh
python train.py
```

For more fine-tuned control, run individual scripts in `/src`:

```sh
# Run scripts in /src directory for specific phases
```

After completing training, start the WSGI Flask server on port 5000:

```sh
python app.py
```

Note: a Linked Open Vocabularies (LOV) instance is required for complete profiling and initial data extraction.
Send POST requests to:
- `/api/v1/profile/sparql`
- `/api/v1/profile/file`
Refer to the Swagger documentation for detailed request and response formats.
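A minimal client sketch for the SPARQL endpoint, using only the standard library. The endpoint path comes from the list above, but the JSON body field (`endpoint`) is an assumption; consult the Swagger documentation for the real request schema.

```python
import json
from urllib import request

# CLASSIFICATION_API_URL from the environment variables above.
BASE_URL = "http://localhost:5000"


def profile_url(kind):
    """Return the URL of one of the two profiling endpoints."""
    if kind not in ("sparql", "file"):
        raise ValueError('kind must be "sparql" or "file"')
    return f"{BASE_URL}/api/v1/profile/{kind}"


def classify_sparql(endpoint_url):
    """POST a SPARQL endpoint URL for profiling and return the JSON response.

    The body field name is assumed; extraction over SPARQL can be slow,
    hence the generous timeout.
    """
    body = json.dumps({"endpoint": endpoint_url}).encode()
    req = request.Request(
        profile_url("sparql"),
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req, timeout=600) as resp:
        return json.load(resp)
```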
For a simpler deployment using the pre-trained Naive Bayes model:
1. Navigate to the docker directory:

   ```sh
   cd /docker
   ```

2. Fill the `.env` file with your configuration

3. Run with Docker Compose:

   ```sh
   docker-compose up
   ```
Three individual Dockerfiles are provided for custom deployments:
- Backend service
- Frontend service
- GraphDB configuration
Hardware used during development:

| Component | Specification |
|---|---|
| CPU | AMD Ryzen 5800x |
| RAM | 32 GB DDR4 3600 MHz |
| GPU | NVIDIA RTX 3070 |
Recommended hardware:

| Component | Specification |
|---|---|
| RAM | 64 GB or more |
| GPU | High-performance GPU for better LLM performance |
- Add Swagger API documentation
- Expand coverage for more LLMs
- Improve Docker deployment documentation
- Add more dataset preparation examples
- Add performance optimization guides
- Enhance frontend visualization features
See the open issues for a full list of proposed features (and known issues).
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!
- Fork the Project
- Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
- Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the Branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
Distributed under the MIT License. See LICENSE.txt for more information.
Mario Cosenza - @mario_cosenza_ - [email protected]
Maria Angela Pellegrino - [email protected]
Gabriele Tuozzo - [email protected]
Project Link: https://github.com/isislab-unisa/KGSum
- University of Salerno, ISISLab
- Mistral LLM
- LOD Cloud
- Zenodo
- Linked Open Vocabularies
