DSFSI Public Datasets Registry

Official dataset registry for the Data Science for Social Impact (DSFSI) Research Group

Last Updated: December 31, 2025

This repository is the catalog of all publicly available datasets produced by the DSFSI research group at the University of Pretoria. The datasets cover Natural Language Processing, Speech Recognition, Public Health, Legal Documents, Financial Data, and teaching material, with a focus on South African and other African languages and contexts.

Featured Datasets

Our most popular and impactful datasets:

🎤 African Next Voices (ZA-ANV)

The largest publicly available multilingual speech dataset for South African languages

Platform: HuggingFace
Type: Speech/Audio
Size: 3,000 hours of audio
Languages: 7 South African languages (isiZulu, Setswana, Sesotho sa Leboa, Tshivenda, Xitsonga, isiXhosa, Sesotho)
Models: 8 Whisper-based ASR models available
License: Custom (see dataset card)
Use Cases: Automatic Speech Recognition, Speech-to-Text, Low-resource Language Processing

📚 za-mafoko: Multilingual Terminology Collection

Comprehensive multilingual terminologies and glossaries for South African languages

Platform: HuggingFace Collection
Type: Terminology/Lexicon
Datasets: 9+ specialized glossaries
Languages: 11 South African languages
Notable Collections:
License: Various (see individual datasets)
Use Cases: Machine Translation, Terminology Management, Multilingual NLP

🦠 COVID-19 South Africa Data

Comprehensive COVID-19 data repository and dashboard for South Africa

Platform: GitHub | Zenodo
Type: Public Health/Time Series
Coverage: Provincial and district-level data (2020-present)
Stars: 256 | Forks: 197 | Contributors: 64+
Dashboard: Live Dashboard
Data Includes: Cases, deaths, testing, vaccination, mobility, hospital surveillance
License: Code (MIT), Data (CC BY-SA 4.0)
Use Cases: Epidemiology, Public Health Research, Data Visualization
Showcase: 14+ projects using this data

📰 Vuk'uzenzele Multilingual Corpus

South African government magazine corpus in 11 languages

Platform: GitHub | Zenodo
Type: Text/Parallel Corpus
Languages: 11 South African languages (English, Afrikaans, isiZulu, isiXhosa, Sesotho, Sepedi, Setswana, isiNdebele, Siswati, Tshivenda, Xitsonga)
Size: 2,200+ aligned sentence pairs
Stars: 7 | Forks: 6
License: Code (MIT), Data (CC BY 4.0)
Use Cases: Machine Translation, Multilingual NLP, Sentence Alignment
Paper: RAIL 2023

🤖 PuoBERTa & PuoData

Setswana language model and curated dataset

Platform: GitHub | HuggingFace Model | HuggingFace Dataset
Type: Text/Language Model
Language: Setswana (Tswana)
GitHub Stars: 5
Variants: PuoBERTa-News, PuoBERTa-NER, PuoBERTa-POS
License: CC BY 4.0
Performance: 83.4 F1 (POS tagging), 78.2 F1 (NER)
Use Cases: Text Classification, NER, POS Tagging, Embeddings
Paper: arXiv:2310.09141

Datasets by Category

Speech & Audio

Dataset	Platform	Languages	Size
African Next Voices (ZA-ANV)	HuggingFace	7 SA languages	3,000 hours
Vuk'uzenzele isiXhosa Speech (ViXSD)	HuggingFace	isiXhosa	-
NCHLT Speech Corpus	NWU Repository	11 SA languages	-
Lwazi Speech Corpus	SADiLaR	Multiple	-

ASR Models Available:

Multilingual Whisper v3 Turbo (0.8B params)
Language-specific Whisper models for Tshivenda, Xitsonga, Setswana, isiZulu, Sesotho
MMS-1B models (NCHLT & Lwazi variants)
W2V-BERT 2.0 models

Text & NLP

Multilingual Corpora

Dataset	Platform	Languages	Description	License
Vuk'uzenzele Corpus	GitHub/Zenodo	11 SA languages	Government magazine corpus with 2,200+ aligned sentence pairs	CC BY 4.0
ZA-Gov Multilingual	GitHub	Multiple	Cabinet statements from SA government	MIT
Gov-ZA Cabinet Statements	HuggingFace	Multiple	Sentence-aligned government statements	-

News & Media

Dataset	Platform	Languages	Description	License
IsiZulu & Siswati News 2022	GitHub	isiZulu, Siswati	News categorization (Izindaba-Tindzaba)	-
South African News Data (2020)	Zenodo	Multiple	News articles for classification	-
AG News Dataset	Local (this repo)	English	Text classification dataset	-

Disinformation & Fact-Checking

Dataset	Platform	Languages	Description	License
ZA Fake News 2020	Zenodo	English	SA disinformation website data	-
ZA Fake News Repo	GitHub	English	Fake news detection corpus	-

Language-Specific Datasets

Dataset	Platform	Language	Description
PuoData	HuggingFace	Setswana	Curated Setswana text corpus
ANV Paper Sample	HuggingFace	Multiple	Sample from ANV research

Cross-lingual & Translation

Dataset	Platform	Languages	Description	License
Umsuka EN-IsiZulu Parallel Corpus	Zenodo	English, isiZulu	Parallel corpus for MT	-
Kinyarwanda-Kirundi Models	GitHub	Kinyarwanda, Kirundi	Cross-lingual transfer	-
Masakhane MT	GitHub	40+ African languages	Participatory MT resources	-

Embeddings

Dataset	Platform	Languages	Description	License
African Pre-Trained Embeddings	Zenodo	Multiple African	Word embeddings for African languages	-

Code-Switching

Dataset	Platform	Languages	Description	License
AfroCS-xs	ACL Anthology	Multiple African	Compact code-switched dataset	-

Terminology & Lexicons

The za-mafoko collection provides comprehensive multilingual terminologies:

Dataset	Platform	Source
DSAC Terminology	HuggingFace	Dept. Sport, Arts & Culture
StatsSA Terminology	HuggingFace	Statistics South Africa
UNISA Multilingual	HuggingFace	University of South Africa
OERTB Terminology	HuggingFace	Open Educational Resources
UP Glossary	HuggingFace	University of Pretoria
AI Glossary	HuggingFace	AI/ML terminology
UNISA Robotics	HuggingFace	Robotics terminology
Tshivenda Augmented	HuggingFace	Tshivenda translations

Web Portal: za-mafoko.dsfsi.co.za

Other Lexicon Resources:

WordNets for SA Languages (2020) - Zenodo
Loughran McDonald-SA-2020 Sentiment Word List - UP Repository

Legal Documents

Dataset	Platform	Description	License
ZASCA-Sum	UP Repository	Supreme Court of Appeal judgments and media summaries	-
State Capture Commission Transcripts	GitHub	Zondo Commission transcripts (Oct 2022)	-

Public Health

Dataset	Platform	Coverage	Description	Stars/Forks
COVID-19 ZA	GitHub	2020-present	Comprehensive COVID-19 data for South Africa	256 ⭐ / 197 forks
COVID-19 ZA (Zenodo)	Zenodo	2020-present	Archived version with DOI	-

Financial Data

Dataset	Location	Coverage	Description	Format
JSE Top 40 Stock Data	This repo	2019-2024	Daily closing prices for JSE Top 40 companies	CSV, Pickle

Data Structure:

Ticker lists: data/stocks/top40_jse_YYYY.csv
Performance data: data/stocks/top40_jse_YYYY_performance.csv
Years: 2019, 2021, 2022, 2023, 2024
Time range: First 4 months of each year (Jan-Apr)
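
The naming pattern above can be turned into a small path helper. This is a minimal sketch, not part of the repository: the function name `top40_paths` and the `AVAILABLE_YEARS` constant are hypothetical, and the year list is copied from the "Years" line above.

```python
from pathlib import Path

# Years listed in this README (2020 is not among them).
AVAILABLE_YEARS = (2019, 2021, 2022, 2023, 2024)


def top40_paths(year, root="data/stocks"):
    """Return (ticker_csv, performance_csv) paths for one year.

    Follows the top40_jse_YYYY naming pattern described in this README.
    """
    if year not in AVAILABLE_YEARS:
        raise ValueError(f"No JSE Top 40 files listed for {year}")
    base = Path(root)
    return (
        base / f"top40_jse_{year}.csv",
        base / f"top40_jse_{year}_performance.csv",
    )


tickers_csv, performance_csv = top40_paths(2024)
print(tickers_csv.as_posix())
print(performance_csv.as_posix())
```

Rejecting unlisted years up front gives a clearer error than a later "file not found" from the CSV reader.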

Educational Datasets

Curated datasets for coursework and teaching:

Dataset	Location	Course	Description
Customer Segmentation	`data/cos781/`	COS781	Customer demographics and behavior
Hypermarket Dataset	`data/cos781/`	COS781	Retail transaction data
Market Basket Data	`data/cos781/`	COS781	Association rule mining
Online Retail II	`data/cos781/`	COS781	E-commerce transactions (zipped)
AG News	`data/cos802/ag_news_csv/`	COS802	News text classification
What's Cooking	`data/whats-cooking/`	-	Kaggle competition data

Datasets by Language

isiZulu (Zulu)

African Next Voices (ZA-ANV)
Vuk'uzenzele Corpus
IsiZulu News 2022
Umsuka EN-IsiZulu Parallel Corpus
za-mafoko collections
Whisper ASR models

Setswana (Tswana)

African Next Voices (ZA-ANV)
PuoData & PuoBERTa
Vuk'uzenzele Corpus
za-mafoko collections
Whisper ASR models

isiXhosa (Xhosa)

African Next Voices (ZA-ANV)
Vuk'uzenzele isiXhosa Speech Dataset (ViXSD)
Vuk'uzenzele Corpus
za-mafoko collections

Sesotho (Southern Sotho)

African Next Voices (ZA-ANV)
Vuk'uzenzele Corpus
za-mafoko collections
Whisper ASR models

Sepedi (Northern Sotho / Sesotho sa Leboa)

African Next Voices (ZA-ANV)
Vuk'uzenzele Corpus
za-mafoko collections

Tshivenda (Venda)

African Next Voices (ZA-ANV)
Vuk'uzenzele Corpus
za-mafoko collections
Mafoko Tshivenda Augmented Translations
Whisper ASR models

Xitsonga (Tsonga)

African Next Voices (ZA-ANV)
Vuk'uzenzele Corpus
za-mafoko collections
Whisper ASR models

Siswati (Swati)

Siswati News 2022
Vuk'uzenzele Corpus
za-mafoko collections

isiNdebele (Ndebele)

Vuk'uzenzele Corpus
za-mafoko collections

English

COVID-19 ZA
ZA Fake News
AG News Dataset
Most corpora include English

Afrikaans

Vuk'uzenzele Corpus
za-mafoko collections

Other African Languages

Kinyarwanda & Kirundi (Cross-lingual models)
40+ languages in Masakhane MT
African Pre-Trained Embeddings

Datasets by Platform

HuggingFace 🤗

Organization: dsfsi

22+ datasets, 30+ models
Collections: za-mafoko, PuoBERTa

Organization: dsfsi-anv

African Next Voices dataset (3,000 hours multilingual speech)
8 Whisper ASR models

Browse All: HuggingFace DSFSI Datasets

GitHub

Organization: github.com/dsfsi

10+ dataset repositories
Code and preprocessing scripts
Documentation and papers

Zenodo

Archived datasets with DOIs for academic citation:

COVID-19 ZA
Vuk'uzenzele NLP
Umsuka Parallel Corpus
SA News Data
African Embeddings
WordNets
ZA Fake News 2020

University of Pretoria Repository

SADiLaR

Lwazi Speech Corpus

NWU Repository

NCHLT Speech Corpus

Local Datasets

This repository hosts small-scale datasets locally in the data/ directory:

Financial Data

JSE Top 40 Stock Performance (2019-2024)
- Location: data/stocks/
- Format: CSV and pickle files
- Contains daily closing prices for Johannesburg Stock Exchange Top 40 companies

Course Datasets

COS781 Datasets - Customer analytics and retail data
- Location: data/cos781/
COS802 Datasets - Text classification datasets
- Location: data/cos802/

Kaggle Datasets

What's Cooking - Recipe classification data
- Location: data/whats-cooking/

Note: .gitignore rules are in place for the data/ directory, so some local files may not be tracked in version control. Check .gitignore if a file listed here is missing after cloning.

How to Use

Accessing HuggingFace Datasets

from datasets import load_dataset

# Load a dataset
dataset = load_dataset("dsfsi/PuoData")

# Load za-mafoko terminology
terminology = load_dataset("dsfsi/za-mafoko-dsac")

# Load African Next Voices
anv = load_dataset("dsfsi-anv/za-african-next-voices")
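
For a corpus as large as African Next Voices, it may be more practical to stream records than to download everything first. The sketch below assumes the `datasets` library's streaming mode; the split name "train" is an assumption, and the dataset may also require a configuration name (check the dataset card). The helper names `preview` and `stream_anv` are hypothetical.

```python
from itertools import islice


def preview(examples, n=3):
    """Return the first n items of any iterable without reading the rest."""
    return list(islice(examples, n))


def stream_anv(split="train"):
    """Open African Next Voices in streaming mode.

    Needs network access and the `datasets` package. The split name is an
    assumption; a configuration name may also be required.
    """
    from datasets import load_dataset  # lazy import so preview() works offline

    return load_dataset(
        "dsfsi-anv/za-african-next-voices", split=split, streaming=True
    )


# Offline demonstration with stand-in records (field names are made up).
sample = ({"id": i, "text": f"utterance {i}"} for i in range(1000))
print(preview(sample, n=2))
```

With network access, `preview(stream_anv(), n=2)` would fetch only the first two records instead of the whole corpus.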

Accessing GitHub Repositories

# Clone a repository
git clone https://github.com/dsfsi/covid19za.git

# Clone this registry
git clone https://github.com/dsfsi/dsfsi-datasets.git

Using HuggingFace Models

from transformers import pipeline

# Load PuoBERTa for masked language modeling
model = pipeline("fill-mask", model="dsfsi/PuoBERTa")

# Load Whisper ASR model
asr = pipeline("automatic-speech-recognition",
               model="dsfsi-anv/multilingual-whisper-v3-turbo")
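
A fill-mask pipeline returns a list of dictionaries with `score`, `token`, `token_str`, and `sequence` keys. The sketch below shows one way to rank that output. The helper name `top_tokens` is hypothetical, and the sample predictions are invented stand-ins so the example runs without downloading a model.

```python
def top_tokens(predictions, k=3):
    """Return the k highest-scoring (token, score) pairs from fill-mask output."""
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [(p["token_str"].strip(), round(p["score"], 3)) for p in ranked[:k]]


# Stand-in for the list a fill-mask pipeline call would return.
# Tokens and scores here are made up for illustration.
fake_output = [
    {"score": 0.12, "token": 11, "token_str": " b", "sequence": "x b"},
    {"score": 0.80, "token": 10, "token_str": " a", "sequence": "x a"},
    {"score": 0.05, "token": 12, "token_str": " c", "sequence": "x c"},
]
print(top_tokens(fake_output, k=2))
```

With the real pipeline, the input sentence must contain the model's mask token, which can be read from `model.tokenizer.mask_token` rather than hard-coded.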

Working with Local JSE Stock Data

import pandas as pd

# Load stock ticker list
tickers = pd.read_csv("data/stocks/top40_jse_2024.csv")

# Load performance data
performance = pd.read_csv("data/stocks/top40_jse_2024_performance.csv")

# Or load pickle format
performance_df = pd.read_pickle("data/stocks/top40_jse_2024_performance.df")
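
Because each file covers only the first four months of a year, a common first calculation is the simple return over that window. The sketch below uses only the standard library and an in-memory CSV. The column layout (a `Date` column plus one column per ticker) and the tickers `AAA` and `BBB` are assumptions for illustration; inspect the real files before relying on it.

```python
import csv
import io


def period_return(closes):
    """Simple return between the first and last closing price in a sequence."""
    if len(closes) < 2:
        raise ValueError("need at least two prices")
    return closes[-1] / closes[0] - 1.0


# Hypothetical layout: one row per trading day, one column per ticker.
sample_csv = io.StringIO(
    "Date,AAA,BBB\n"
    "2024-01-02,100.0,50.0\n"
    "2024-01-03,102.0,49.0\n"
    "2024-04-30,110.0,45.0\n"
)
rows = list(csv.DictReader(sample_csv))
for ticker in ("AAA", "BBB"):
    closes = [float(r[ticker]) for r in rows]
    print(ticker, f"{period_return(closes):+.1%}")
```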

Citing Datasets

When using our datasets, please cite appropriately:

Use DOIs from Zenodo/repository pages when available
Cite associated papers listed in dataset descriptions
Reference this registry for general dataset discovery

Contributing

We welcome contributions to this registry! See CONTRIBUTING.md for guidelines on:

Submitting new datasets
Updating dataset information
Reporting issues with datasets
Proposing improvements to the registry

Quick Links

Publications: dsfsi.co.za/publications
Research Group: dsfsi.co.za
GitHub: github.com/dsfsi
HuggingFace: huggingface.co/dsfsi

Citation

If you use datasets from the DSFSI collection, please cite both the specific dataset and this registry:

@misc{dsfsi-datasets-2025,
  title={DSFSI Public Datasets Registry},
  author={{Data Science for Social Impact Research Group}},
  year={2025},
  publisher={University of Pretoria},
  url={https://github.com/dsfsi/dsfsi-datasets}
}

For specific datasets, please see individual dataset pages for citation information.

Contact

Data Science for Social Impact (DSFSI) Research Group

Institution: University of Pretoria, South Africa
Website: www.dsfsi.co.za
Email: dsfsi.info@up.ac.za
Lead: Prof Vukosi Marivate (vukosi.marivate@cs.up.ac.za)
GitHub: github.com/dsfsi
Twitter/X: @dsfsi_research
LinkedIn: DSFSI Research Group
Bluesky: @dsfsi.bsky.social

License

This registry: MIT License
Individual datasets: See individual dataset licenses (varies by dataset)
Code in this repository: MIT License
Local data: Various licenses (see dataset-specific documentation)

Maintained by: DSFSI Research Group, University of Pretoria
