This Python project processes audio transcriptions and classifies them for aggression and hate speech categories using various language models.
- Binary Classification: Determines if text is aggressive (0 = not aggressive, 1 = aggressive).
- Multiclass Classification: Categorizes text into one of five classes:
- 0: Racism (attacks based on race, nationality, or religion)
- 1: Sexism (attacks directed at women, gender roles, objectification)
- 2: Hate Speech (general hate speech not fitting other categories)
- 3: Vulgarism (vulgar language without targeted attacks)
- 4: Neutral (non-aggressive language)
- Supports multiple model providers: Local (Ollama), OpenAI, and Google Gemini.
- Processes input files from the
input/directory and outputs results tooutput/. - Handles errors gracefully (e.g., unreachable models, invalid responses).
- Python 3.7+
- Required packages (install via
pip install -r requirements.txt):litellmrequestsopenai(for OpenAI provider)google-genai(for Gemini provider)
- For local models: Ollama server running with supported models (e.g., llama3.3, mistral-large).
- API keys for OpenAI and Gemini if using those providers.
-
Clone the repository:
git clone <repository-url> cd Classification -
Install dependencies:
pip install -r requirements.txt -
Set up API keys (if using OpenAI or Gemini):
- Edit the script files to set
OPENAI_API_KEYandGEMINI_API_KEY, or set environment variables.
- Edit the script files to set
-
For local models, ensure Ollama is installed and running:
- Install Ollama from ollama.ai.
- Pull required models:
ollama pull llama3.3, etc. - Set
SELF_HOSTED_MODELS_URLto your Ollama API base (e.g.,http://localhost:11434).
-
Place input files in the
input/directory. Each file should contain lines in the format:filename;text(semicolon-separated). -
Run the classification script:
- For one-shot classification:
python classification-oneshot.py - For zero-shot classification:
python classification-zeroshot.py
- For one-shot classification:
-
Results will be saved in the
output/directory as CSV files, one per model and input file.
Output files are CSV with semicolon-separated values:
filename: Original filenamebinary_aggressive: 0 or 1 (or error codes)multiclass_label: 0-4 (or error codes)text: Sanitized input text
- Local: llama3.3, mistral-large, SpeakLeash/bielik-11b-v2.3-instruct:Q8_0
- OpenAI: gpt-5 (adjust as needed)
- Gemini: gemini-2.5-flash