A Python application for extracting, preparing, and classifying Knowledge Graphs, leveraging LLMs and traditional machine learning.
Thesis Project, University of Salerno, ISISLab
KgSum is a Python application for extracting, preparing, and classifying Knowledge Graphs (KGs). It combines Large Language Models (such as Mistral 7B Instruct fine-tuned with QLoRA) with traditional machine learning for effective graph classification and profiling.
Thesis Project for Bachelor's Degree
University of Salerno
Lab: ISISLab
Author: Mario Cosenza
Supervisor: Maria Angela Pellegrino
Follow these steps to set up KgSum locally.
- Miniconda (required)
- Python 3.12 (suggested)
- CUDA 12.8 (for transformer models like Mistral)
- NVIDIA GPU (recommended: RTX 3070 or higher)
- Node.js
- npm
- Docker
- Docker Compose
1. Clone the repository:

   ```sh
   git clone https://github.com/mariocosenza/kgsum.git
   cd kgsum
   ```

2. Create and activate the conda environment:

   ```sh
   conda env create -f environment.yml
   conda activate kgsum
   ```

3. For GPU/transformer models (Mistral):
   - Comment out the CUDA libraries in `environment.yml`
   - Change the TensorFlow version to a GPU-compatible version, as suggested in the comments

4. Install the frontend dependencies:

   ```sh
   npm install
   ```

5. Run the frontend:

   ```sh
   npm run dev
   ```

6. For GraphDB embedding visualization:
   - Replace GraphDB's `security-config.xml` with the one in `/docker/graphdb`
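After creating the environment, a quick sanity check can confirm the suggested prerequisites are met. This is a minimal sketch, not part of the repository; it only inspects TensorFlow if it happens to be installed, since the conda environment decides the actual deep-learning stack.

```python
import importlib.util
import sys


def environment_report():
    """Best-effort check of the suggested prerequisites.

    Python 3.12 is suggested by the project; the GPU check runs only
    if TensorFlow is importable in the current environment.
    """
    report = {
        "python_3_12": sys.version_info[:2] == (3, 12),
        "tensorflow_installed": importlib.util.find_spec("tensorflow") is not None,
    }
    if report["tensorflow_installed"]:
        import tensorflow as tf

        # A CUDA-enabled build with a visible NVIDIA GPU reports at least one device.
        report["gpu_visible"] = len(tf.config.list_physical_devices("GPU")) > 0
    return report


print(environment_report())
```

If `gpu_visible` is `False` on a machine with an NVIDIA GPU, revisit step 3 above (CUDA libraries and the GPU-compatible TensorFlow version in `environment.yml`).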
Set the following environment variables in your shell:
export GEMINI_API_KEY=your_gemini_api_key_here
export LOCAL_ENDPOINT_LOV=http://your-local-endpoint
export LOCAL_ENDPOINT=http://your-local-endpoint
export SECRET_KEY=your_secret_key_here
export UPLOAD_FOLDER=/path/to/uploads
export UPLOAD=true
export NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY=your_clerk_publishable_key
export CLASSIFICATION_API_URL=http://localhost:5000
export GITHUB_TOKEN=your_github_token_here

Configure the backend by editing `config.json`:
```json
{
  "labeling": {
    "use_gemini": false,
    "search_zenodo": true,
    "search_github": true,
    "search_lod_cloud": true,
    "stop_before_merging": false
  },
  "extraction": {
    "start_offset": 0,
    "step_numbers": 10,
    "step_range": 16,
    "extract_sparql": true,
    "query_lov": false
  },
  "processing": {
    "use_ner": false,
    "use_filter": true
  },
  "training": {
    "classifier": "NAIVE_BAYES",
    "feature": ["CURI", "PURI", "LAB", "CON", "TLDS", "VOC", "LCN", "LPN", "DSC", "SBJ"],
    "oversample": true,
    "max_token": 36000,
    "use_tfidf_autoencoder": true
  },
  "profile": {
    "store_profile_after_training": false,
    "base_domain": "http://www.isislab.it"
  },
  "general_settings": {
    "info": "Possible classifiers: SVM, NAIVE_BAYES, KNN, J48, MISTRAL, MLP, DEEP, BATCHNORM, Phase: LABELING, EXTRACTION, PROCESSING, TRAINING, STORE",
    "start_phase": "labeling",
    "stop_phase": "training",
    "allow_upload": true
  }
}
```

Available Classifiers: SVM, NAIVE_BAYES, KNN, J48, MISTRAL, MLP, DEEP, BATCHNORM
Available Features: CURI, PURI, LAB, CON, TLDS, VOC, LCN, LPN, DSC, SBJ
Processing Phases: LABELING, EXTRACTION, PROCESSING, TRAINING, STORE
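As a sketch of how the classifiers and phases above fit together, a small hypothetical helper (not part of the repository) could rewrite the parsed `config.json` dictionary before a run. The key names match the config shown above; load and save the file with `json.load`/`json.dump`.

```python
# Allowed values, taken from the config documentation above.
CLASSIFIERS = {"SVM", "NAIVE_BAYES", "KNN", "J48", "MISTRAL", "MLP", "DEEP", "BATCHNORM"}
PHASES = ["labeling", "extraction", "processing", "training", "store"]


def configure_run(config, classifier, start_phase="labeling", stop_phase="training"):
    """Set the classifier and phase range on a parsed config.json dict.

    Hypothetical helper: validates against the lists documented above
    before mutating the "training" and "general_settings" sections.
    """
    if classifier not in CLASSIFIERS:
        raise ValueError(f"unknown classifier: {classifier}")
    if PHASES.index(start_phase) > PHASES.index(stop_phase):
        raise ValueError("start_phase must not come after stop_phase")
    config["training"]["classifier"] = classifier
    config["general_settings"]["start_phase"] = start_phase
    config["general_settings"]["stop_phase"] = stop_phase
    return config
```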
Run the complete training process from extraction to model training:

```sh
python train.py
```

For more fine-tuned control, run individual scripts in `/src`:

```sh
# Run scripts in /src directory for specific phases
```

After completing training, start the WSGI Flask server on port 5000:

```sh
python app.py
```

Note: a Linked Open Vocabularies (LOV) instance is required for complete profiling and initial data extraction.
Send POST requests to:
- `/api/v1/profile/sparql`
- `/api/v1/profile/file`
Refer to the Swagger documentation for detailed request and response formats.
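A minimal client sketch for the SPARQL endpoint, using only the standard library. The endpoint path comes from the list above, but the JSON body field (`endpoint`) is an assumption; consult the Swagger documentation for the real request schema.

```python
import json
from urllib import request

# CLASSIFICATION_API_URL from the environment variables above.
BASE_URL = "http://localhost:5000"


def profile_url(kind):
    """Return the URL of one of the two profiling endpoints."""
    if kind not in ("sparql", "file"):
        raise ValueError('kind must be "sparql" or "file"')
    return f"{BASE_URL}/api/v1/profile/{kind}"


def classify_sparql(endpoint_url):
    """POST a SPARQL endpoint URL for profiling and return the JSON response.

    The body field name is assumed; extraction over SPARQL can be slow,
    hence the generous timeout.
    """
    body = json.dumps({"endpoint": endpoint_url}).encode()
    req = request.Request(
        profile_url("sparql"),
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req, timeout=600) as resp:
        return json.load(resp)
```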
For a simpler deployment using the pre-trained Naive Bayes model:
1. Navigate to the docker directory:

   ```sh
   cd /docker
   ```

2. Fill the `.env` file with your configuration

3. Run with Docker Compose:

   ```sh
   docker-compose up
   ```
Three individual Dockerfiles are provided for custom deployments:
- Backend service
- Frontend service
- GraphDB configuration
Hardware used during development:

| Component | Specification |
|---|---|
| CPU | AMD Ryzen 5800x |
| RAM | 32 GB DDR4 3600 MHz |
| GPU | NVIDIA RTX 3070 |
Recommended hardware:

| Component | Specification |
|---|---|
| RAM | 64 GB or more |
| GPU | High-performance GPU for better LLM performance |
- Add Swagger API documentation
- Expand coverage for more LLMs
- Improve Docker deployment documentation
- Add more dataset preparation examples
- Add performance optimization guides
- Enhance frontend visualization features
See the open issues for a full list of proposed features (and known issues).
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!
- Fork the Project
- Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
- Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the Branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
Distributed under the MIT License. See LICENSE.txt for more information.
Mario Cosenza - @mario_cosenza_ - [email protected]
Maria Angela Pellegrino - [email protected]
Gabriele Tuozzo - [email protected]
Project Link: https://github.com/isislab-unisa/KGSum
- University of Salerno, ISISLab
- Mistral LLM
- LOD Cloud
- Zenodo
- Linked Open Vocabularies
