Official dataset registry for the Data Science for Social Impact (DSFSI) Research Group
Last Updated: December 31, 2025
This repository serves as the comprehensive catalog and registry for all publicly available datasets produced by the DSFSI research group at the University of Pretoria. Our datasets span multiple domains including Natural Language Processing, Speech Recognition, Public Health, Legal Documents, Financial Data, and more, with a focus on South African and African languages and contexts.
- Featured Datasets
- Datasets by Category
- Datasets by Language
- Datasets by Platform
- Local Datasets
- How to Use
- Contributing
- Citation
- Contact
Our most popular and impactful datasets:
The largest publicly available multilingual speech dataset for South African languages
- Platform: HuggingFace
- Type: Speech/Audio
- Size: 3,000 hours of audio
- Languages: 7 South African languages (isiZulu, Setswana, Sesotho sa Leboa, Tshivenda, Xitsonga, isiXhosa, Sesotho)
- Models: 8 Whisper-based ASR models available
- License: Custom (see dataset card)
- Use Cases: Automatic Speech Recognition, Speech-to-Text, Low-resource Language Processing
Comprehensive multilingual terminologies and glossaries for South African languages
- Platform: HuggingFace Collection
- Type: Terminology/Lexicon
- Datasets: 9+ specialized glossaries
- Languages: 11 South African languages
- Notable Collections:
- License: Various (see individual datasets)
- Use Cases: Machine Translation, Terminology Management, Multilingual NLP
Comprehensive COVID-19 data repository and dashboard for South Africa
- Platform: GitHub | Zenodo
- Type: Public Health/Time Series
- Coverage: Provincial and district-level data (2020-present)
- Stars: 256 | Forks: 197 | Contributors: 64+
- Dashboard: Live Dashboard
- Data Includes: Cases, deaths, testing, vaccination, mobility, hospital surveillance
- License: Code (MIT), Data (CC BY-SA 4.0)
- Use Cases: Epidemiology, Public Health Research, Data Visualization
- Showcase: 14+ projects using this data
South African government magazine corpus in 11 languages
- Platform: GitHub | Zenodo
- Type: Text/Parallel Corpus
- Languages: 11 South African languages (English, Afrikaans, isiZulu, isiXhosa, Sesotho, Sepedi, Setswana, isiNdebele, Siswati, Tshivenda, Xitsonga)
- Size: 2,200+ aligned sentence pairs
- Stars: 7 | Forks: 6
- License: Code (MIT), Data (CC BY 4.0)
- Use Cases: Machine Translation, Multilingual NLP, Sentence Alignment
- Paper: RAIL 2023
Setswana language model and curated dataset
- Platform: GitHub | HuggingFace Model | HuggingFace Dataset
- Type: Text/Language Model
- Language: Setswana (Tswana)
- GitHub Stars: 5
- Variants: PuoBERTa-News, PuoBERTa-NER, PuoBERTa-POS
- License: CC BY 4.0
- Performance: 83.4 F1 (POS tagging), 78.2 F1 (NER)
- Use Cases: Text Classification, NER, POS Tagging, Embeddings
- Paper: arXiv:2310.09141
| Dataset | Platform | Languages | Size |
|---|---|---|---|
| African Next Voices (ZA-ANV) | HuggingFace | 7 SA languages | 3,000 hours |
| Vuk'uzenzele isiXhosa Speech (ViXSD) | HuggingFace | isiXhosa | - |
| NCHLT Speech Corpus | NWU Repository | 11 SA languages | - |
| Lwazi Speech Corpus | SADiLaR | Multiple | - |
ASR Models Available:
- Multilingual Whisper v3 Turbo (0.8B params)
- Language-specific Whisper models for Tshivenda, Xitsonga, Setswana, isiZulu, Sesotho
- MMS-1B models (NCHLT & Lwazi variants)
- W2V-BERT 2.0 models
| Dataset | Platform | Languages | Description | License |
|---|---|---|---|---|
| Vuk'uzenzele Corpus | GitHub/Zenodo | 11 SA languages | Government magazine corpus with 2,200+ aligned sentences | CC BY 4.0 |
| ZA-Gov Multilingual | GitHub | Multiple | Cabinet statements from SA government | MIT |
| Gov-ZA Cabinet Statements | HuggingFace | Multiple | Sentence-aligned government statements | - |
| Dataset | Platform | Languages | Description | License |
|---|---|---|---|---|
| IsiZulu & Siswati News 2022 | GitHub | isiZulu, Siswati | News categorization (Izindaba-Tindzaba) | - |
| South African News Data (2020) | Zenodo | Multiple | News articles for classification | - |
| AG News Dataset | Local (this repo) | English | Text classification dataset | - |
| Dataset | Platform | Languages | Description | License |
|---|---|---|---|---|
| ZA Fake News 2020 | Zenodo | English | SA disinformation website data | - |
| ZA Fake News Repo | GitHub | English | Fake news detection corpus | - |
| Dataset | Platform | Language | Description |
|---|---|---|---|
| PuoData | HuggingFace | Setswana | Curated Setswana text corpus |
| ANV Paper Sample | HuggingFace | Multiple | Sample from ANV research |
| Dataset | Platform | Languages | Description | License |
|---|---|---|---|---|
| Umsuka EN-IsiZulu Parallel Corpus | Zenodo | English, isiZulu | Parallel corpus for MT | - |
| Kinyarwanda-Kirundi Models | GitHub | Kinyarwanda, Kirundi | Cross-lingual transfer | - |
| Masakhane MT | GitHub | 40+ African languages | Participatory MT resources | - |
| Dataset | Platform | Languages | Description | License |
|---|---|---|---|---|
| African Pre-Trained Embeddings | Zenodo | Multiple African | Word embeddings for African languages | - |
| Dataset | Platform | Languages | Description | License |
|---|---|---|---|---|
| AfroCS-xs | ACL Anthology | Multiple African | Compact code-switched dataset | - |
The za-mafoko collection provides comprehensive multilingual terminologies:
| Dataset | Platform | Source |
|---|---|---|
| DSAC Terminology | HuggingFace | Dept. Sport, Arts & Culture |
| StatsSA Terminology | HuggingFace | Statistics South Africa |
| UNISA Multilingual | HuggingFace | University of South Africa |
| OERTB Terminology | HuggingFace | Open Educational Resources |
| UP Glossary | HuggingFace | University of Pretoria |
| AI Glossary | HuggingFace | AI/ML terminology |
| UNISA Robotics | HuggingFace | Robotics terminology |
| Tshivenda Augmented | HuggingFace | Tshivenda translations |
Web Portal: za-mafoko.dsfsi.co.za
Other Lexicon Resources:
- WordNets for SA Languages (2020) - Zenodo
- Loughran McDonald-SA-2020 Sentiment Word List - UP Repository
| Dataset | Platform | Description | License |
|---|---|---|---|
| ZASCA-Sum | UP Repository | Supreme Court of Appeal judgments and media summaries | - |
| State Capture Commission Transcripts | GitHub | Zondo Commission transcripts (Oct 2022) | - |
| Dataset | Platform | Coverage | Description | Stars/Downloads |
|---|---|---|---|---|
| COVID-19 ZA | GitHub | 2020-present | Comprehensive COVID-19 data for South Africa | 256 ⭐ / 197 forks |
| COVID-19 ZA (Zenodo) | Zenodo | 2020-present | Archived version with DOI | - |
| Dataset | Location | Coverage | Description | Format |
|---|---|---|---|---|
| JSE Top 40 Stock Data | This repo | 2019-2024 | Daily closing prices for JSE Top 40 companies | CSV, Pickle |
Data Structure:
- Ticker lists: data/stocks/top40_jse_YYYY.csv
- Performance data: data/stocks/top40_jse_YYYY_performance.csv
- Years: 2019, 2021, 2022, 2023, 2024
- Time range: First 4 months of each year (Jan-Apr)
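The naming pattern above can be turned into a small path helper. This is a minimal sketch, assuming the files follow the pattern exactly and that the listed years are the only ones available; `AVAILABLE_YEARS` and `stock_paths` are hypothetical names used for illustration and are not part of the repository.

```python
from pathlib import Path

# Years listed in the data structure notes (assumed to be the complete set).
AVAILABLE_YEARS = (2019, 2021, 2022, 2023, 2024)


def stock_paths(year, root="data/stocks"):
    """Return (ticker_list_path, performance_path) for one year.

    Hypothetical helper: it only builds paths from the documented
    naming pattern and does not check that the files exist.
    """
    if year not in AVAILABLE_YEARS:
        raise ValueError(f"No JSE Top 40 files are listed for {year}")
    base = Path(root)
    tickers = base / f"top40_jse_{year}.csv"
    performance = base / f"top40_jse_{year}_performance.csv"
    return tickers, performance


# Print the expected file locations for every listed year.
for year in AVAILABLE_YEARS:
    tickers, performance = stock_paths(year)
    print(year, tickers, performance)
```

The returned paths can be passed straight to `pandas.read_csv`. Raising on an unlisted year is a design choice that surfaces a wrong year early rather than as a missing-file error later.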
Curated datasets for coursework and teaching:
| Dataset | Location | Course | Description |
|---|---|---|---|
| Customer Segmentation | data/cos781/ | COS781 | Customer demographics and behavior |
| Hypermarket Dataset | data/cos781/ | COS781 | Retail transaction data |
| Market Basket Data | data/cos781/ | COS781 | Association rule mining |
| Online Retail II | data/cos781/ | COS781 | E-commerce transactions (zipped) |
| AG News | data/cos802/ag_news_csv/ | COS802 | News text classification |
| What's Cooking | data/whats-cooking/ | - | Kaggle competition data |
- African Next Voices (ZA-ANV)
- Vuk'uzenzele Corpus
- IsiZulu News 2022
- Umsuka EN-IsiZulu Parallel Corpus
- za-mafoko collections
- Whisper ASR models
- African Next Voices (ZA-ANV)
- PuoData & PuoBERTa
- Vuk'uzenzele Corpus
- za-mafoko collections
- Whisper ASR models
- African Next Voices (ZA-ANV)
- Vuk'uzenzele isiXhosa Speech Dataset (ViXSD)
- Vuk'uzenzele Corpus
- za-mafoko collections
- African Next Voices (ZA-ANV)
- Vuk'uzenzele Corpus
- za-mafoko collections
- Whisper ASR models
- African Next Voices (ZA-ANV)
- Vuk'uzenzele Corpus
- za-mafoko collections
- African Next Voices (ZA-ANV)
- Vuk'uzenzele Corpus
- za-mafoko collections
- Mafoko Tshivenda Augmented Translations
- Whisper ASR models
- African Next Voices (ZA-ANV)
- Vuk'uzenzele Corpus
- za-mafoko collections
- Whisper ASR models
- Siswati News 2022
- Vuk'uzenzele Corpus
- za-mafoko collections
- Vuk'uzenzele Corpus
- za-mafoko collections
- COVID-19 ZA
- ZA Fake News
- Most corpora include English
- AG News Dataset
- Vuk'uzenzele Corpus
- za-mafoko collections
- Kinyarwanda & Kirundi (Cross-lingual models)
- 40+ languages in Masakhane MT
- African Pre-Trained Embeddings
Organization: dsfsi
- 22+ datasets, 30+ models
- Collections: za-mafoko, PuoBERTa
Organization: dsfsi-anv
- African Next Voices dataset (3,000 hours multilingual speech)
- 8 Whisper ASR models
Browse All: HuggingFace DSFSI Datasets
Organization: github.com/dsfsi
- 10+ dataset repositories
- Code and preprocessing scripts
- Documentation and papers
Archived datasets with DOIs for academic citation:
- COVID-19 ZA
- Vuk'uzenzele NLP
- Umsuka Parallel Corpus
- SA News Data
- African Embeddings
- WordNets
- ZA Fake News 2020
This repository hosts small-scale datasets locally in the data/ directory:
- JSE Top 40 Stock Performance (2019-2024)
  - Location: data/stocks/
  - Format: CSV and pickle files
  - Contains daily closing prices for Johannesburg Stock Exchange Top 40 companies
- COS781 Datasets - Customer analytics and retail data
  - Location: data/cos781/
- COS802 Datasets - Text classification datasets
  - Location: data/cos802/
- What's Cooking - Recipe classification data
  - Location: data/whats-cooking/
Note: These datasets are version-controlled locally but .gitignore rules are in place for the data/ directory.
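Since .gitignore rules apply to the data/ directory, some of these folders may be absent from a fresh clone (an assumption; the rules themselves are not shown here). The sketch below checks which of the local dataset directories listed above are present. `LOCAL_DATASETS` and `find_local_datasets` are illustrative names, not part of the repository.

```python
from pathlib import Path

# Directory locations as listed in this registry, relative to the repo root.
LOCAL_DATASETS = {
    "JSE Top 40 Stock Performance": "data/stocks",
    "COS781 Datasets": "data/cos781",
    "COS802 Datasets": "data/cos802",
    "What's Cooking": "data/whats-cooking",
}


def find_local_datasets(repo_root="."):
    """Map each local dataset name to True if its directory exists."""
    root = Path(repo_root)
    return {
        name: (root / relative_dir).is_dir()
        for name, relative_dir in LOCAL_DATASETS.items()
    }


# Report availability relative to the current working directory.
for name, present in find_local_datasets().items():
    print(f"{name}: {'found' if present else 'missing'}")
```

Run it from the repository root, or pass the clone location as `repo_root`.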
```python
from datasets import load_dataset

# Load a dataset
dataset = load_dataset("dsfsi/PuoData")

# Load za-mafoko terminology
terminology = load_dataset("dsfsi/za-mafoko-dsac")

# Load African Next Voices
anv = load_dataset("dsfsi-anv/za-african-next-voices")
```

```bash
# Clone a repository
git clone https://github.com/dsfsi/covid19za.git

# Clone this registry
git clone https://github.com/dsfsi/dsfsi-datasets.git
```

```python
from transformers import pipeline

# Load PuoBERTa for masked language modeling
model = pipeline("fill-mask", model="dsfsi/PuoBERTa")

# Load Whisper ASR model
asr = pipeline("automatic-speech-recognition",
               model="dsfsi-anv/multilingual-whisper-v3-turbo")
```

```python
import pandas as pd

# Load stock ticker list
tickers = pd.read_csv("data/stocks/top40_jse_2024.csv")

# Load performance data
performance = pd.read_csv("data/stocks/top40_jse_2024_performance.csv")

# Or load pickle format
performance_df = pd.read_pickle("data/stocks/top40_jse_2024_performance.df")
```

When using our datasets, please cite appropriately:
- Use DOIs from Zenodo/repository pages when available
- Cite associated papers listed in dataset descriptions
- Reference this registry for general dataset discovery
We welcome contributions to this registry! See CONTRIBUTING.md for guidelines on:
- Submitting new datasets
- Updating dataset information
- Reporting issues with datasets
- Proposing improvements to the registry
- Publications: dsfsi.co.za/publications
- Research Group: dsfsi.co.za
- GitHub: github.com/dsfsi
- HuggingFace: huggingface.co/dsfsi
If you use datasets from the DSFSI collection, please cite both the specific dataset and this registry:
```bibtex
@misc{dsfsi-datasets-2025,
  title={DSFSI Public Datasets Registry},
  author={{Data Science for Social Impact Research Group}},
  year={2025},
  publisher={University of Pretoria},
  url={https://github.com/dsfsi/dsfsi-datasets}
}
```

For specific datasets, please see individual dataset pages for citation information.
Data Science for Social Impact (DSFSI) Research Group
- Institution: University of Pretoria, South Africa
- Website: www.dsfsi.co.za
- Email: dsfsi.info@up.ac.za
- Lead: Prof Vukosi Marivate (vukosi.marivate@cs.up.ac.za)
- GitHub: github.com/dsfsi
- Twitter/X: @dsfsi_research
- LinkedIn: DSFSI Research Group
- Bluesky: @dsfsi.bsky.social
- This registry: MIT License
- Individual datasets: See individual dataset licenses (varies by dataset)
- Code in this repository: MIT License
- Local data: Various licenses (see dataset-specific documentation)
Maintained by: DSFSI Research Group, University of Pretoria