Official DSFSI Public Datasets Registry - Comprehensive catalog of 50+ datasets for South African & African languages. Includes speech recognition, NLP, terminology, health, legal & financial data across HuggingFace, GitHub, Zenodo & more.

DSFSI Public Datasets Registry

Official dataset registry for the Data Science for Social Impact (DSFSI) Research Group

Last Updated: December 31, 2025

This repository is the catalog and registry for all publicly available datasets produced by the DSFSI research group at the University of Pretoria. The datasets cover Natural Language Processing, Speech Recognition, Public Health, Legal Documents, Financial Data, and coursework data, with a focus on South African and African languages and contexts.

Featured Datasets

Our most popular and impactful datasets:

🎤 African Next Voices (ZA-ANV)

The largest publicly available multilingual speech dataset for South African languages

  • Platform: HuggingFace
  • Type: Speech/Audio
  • Size: 3,000 hours of audio
  • Languages: 7 South African languages (isiZulu, Setswana, Sesotho sa Leboa, Tshivenda, Xitsonga, isiXhosa, Sesotho)
  • Models: 8 Whisper-based ASR models available
  • License: Custom (see dataset card)
  • Use Cases: Automatic Speech Recognition, Speech-to-Text, Low-resource Language Processing

📚 za-mafoko: Multilingual Terminology Collection

Comprehensive multilingual terminologies and glossaries for South African languages

🦠 COVID-19 South Africa Data

Comprehensive COVID-19 data repository and dashboard for South Africa

  • Platform: GitHub | Zenodo
  • Type: Public Health/Time Series
  • Coverage: Provincial and district-level data (2020-present)
  • Stars: 256 | Forks: 197 | Contributors: 64+
  • Dashboard: Live Dashboard
  • Data Includes: Cases, deaths, testing, vaccination, mobility, hospital surveillance
  • License: Code (MIT), Data (CC BY-SA 4.0)
  • Use Cases: Epidemiology, Public Health Research, Data Visualization
  • Showcase: 14+ projects using this data

📰 Vuk'uzenzele Multilingual Corpus

South African government magazine corpus in 11 languages

  • Platform: GitHub | Zenodo
  • Type: Text/Parallel Corpus
  • Languages: 11 South African languages (English, Afrikaans, isiZulu, isiXhosa, Sesotho, Sepedi, Setswana, isiNdebele, Siswati, Tshivenda, Xitsonga)
  • Size: 2,200+ aligned sentence pairs
  • Stars: 7 | Forks: 6
  • License: Code (MIT), Data (CC BY 4.0)
  • Use Cases: Machine Translation, Multilingual NLP, Sentence Alignment
  • Paper: RAIL 2023

🤖 PuoBERTa & PuoData

Setswana language model and curated dataset

  • Platform: GitHub | HuggingFace Model | HuggingFace Dataset
  • Type: Text/Language Model
  • Language: Setswana (Tswana)
  • GitHub Stars: 5
  • Variants: PuoBERTa-News, PuoBERTa-NER, PuoBERTa-POS
  • License: CC BY 4.0
  • Performance: 83.4 F1 (POS tagging), 78.2 F1 (NER)
  • Use Cases: Text Classification, NER, POS Tagging, Embeddings
  • Paper: arXiv:2310.09141

Datasets by Category

Speech & Audio

| Dataset | Platform | Languages | Size |
| --- | --- | --- | --- |
| African Next Voices (ZA-ANV) | HuggingFace | 7 SA languages | 3,000 hours |
| Vuk'uzenzele isiXhosa Speech (ViXSD) | HuggingFace | isiXhosa | - |
| NCHLT Speech Corpus | NWU Repository | 11 SA languages | - |
| Lwazi Speech Corpus | SADiLaR | Multiple | - |

ASR Models Available:

Text & NLP

Multilingual Corpora

| Dataset | Platform | Languages | Description | License |
| --- | --- | --- | --- | --- |
| Vuk'uzenzele Corpus | GitHub/Zenodo | 11 SA languages | Government magazine corpus with 2,200+ aligned sentences | CC BY 4.0 |
| ZA-Gov Multilingual | GitHub | Multiple | Cabinet statements from SA government | MIT |
| Gov-ZA Cabinet Statements | HuggingFace | Multiple | Sentence-aligned government statements | - |

News & Media

| Dataset | Platform | Languages | Description | License |
| --- | --- | --- | --- | --- |
| IsiZulu & Siswati News 2022 | GitHub | isiZulu, Siswati | News categorization (Izindaba-Tindzaba) | - |
| South African News Data (2020) | Zenodo | Multiple | News articles for classification | - |
| AG News Dataset | Local (this repo) | English | Text classification dataset | - |

Disinformation & Fact-Checking

| Dataset | Platform | Languages | Description | License |
| --- | --- | --- | --- | --- |
| ZA Fake News 2020 | Zenodo | English | SA disinformation website data | - |
| ZA Fake News Repo | GitHub | English | Fake news detection corpus | - |

Language-Specific Datasets

| Dataset | Platform | Language | Description |
| --- | --- | --- | --- |
| PuoData | HuggingFace | Setswana | Curated Setswana text corpus |
| ANV Paper Sample | HuggingFace | Multiple | Sample from ANV research |

Cross-lingual & Translation

| Dataset | Platform | Languages | Description | License |
| --- | --- | --- | --- | --- |
| Umsuka EN-IsiZulu Parallel Corpus | Zenodo | English, isiZulu | Parallel corpus for MT | - |
| Kinyarwanda-Kirundi Models | GitHub | Kinyarwanda, Kirundi | Cross-lingual transfer | - |
| Masakhane MT | GitHub | 40+ African languages | Participatory MT resources | - |

Embeddings

| Dataset | Platform | Languages | Description | License |
| --- | --- | --- | --- | --- |
| African Pre-Trained Embeddings | Zenodo | Multiple African | Word embeddings for African languages | - |

Code-Switching

| Dataset | Platform | Languages | Description | License |
| --- | --- | --- | --- | --- |
| AfroCS-xs | ACL Anthology | Multiple African | Compact code-switched dataset | - |

Terminology & Lexicons

The za-mafoko collection provides comprehensive multilingual terminologies:

| Dataset | Platform | Source |
| --- | --- | --- |
| DSAC Terminology | HuggingFace | Dept. Sport, Arts & Culture |
| StatsSA Terminology | HuggingFace | Statistics South Africa |
| UNISA Multilingual | HuggingFace | University of South Africa |
| OERTB Terminology | HuggingFace | Open Educational Resources |
| UP Glossary | HuggingFace | University of Pretoria |
| AI Glossary | HuggingFace | AI/ML terminology |
| UNISA Robotics | HuggingFace | Robotics terminology |
| Tshivenda Augmented | HuggingFace | Tshivenda translations |

Web Portal: za-mafoko.dsfsi.co.za

Other Lexicon Resources:

Legal Documents

| Dataset | Platform | Description | License |
| --- | --- | --- | --- |
| ZASCA-Sum | UP Repository | Supreme Court of Appeal judgments and media summaries | - |
| State Capture Commission Transcripts | GitHub | Zondo Commission transcripts (Oct 2022) | - |

Public Health

| Dataset | Platform | Coverage | Description | Stars / Forks |
| --- | --- | --- | --- | --- |
| COVID-19 ZA | GitHub | 2020-present | Comprehensive COVID-19 data for South Africa | 256 ⭐ / 197 forks |
| COVID-19 ZA (Zenodo) | Zenodo | 2020-present | Archived version with DOI | - |

Financial Data

| Dataset | Location | Coverage | Description | Format |
| --- | --- | --- | --- | --- |
| JSE Top 40 Stock Data | This repo | 2019-2024 | Daily closing prices for JSE Top 40 companies | CSV, Pickle |

Data Structure:

  • Ticker lists: data/stocks/top40_jse_YYYY.csv
  • Performance data: data/stocks/top40_jse_YYYY_performance.csv
  • Years: 2019, 2021, 2022, 2023, 2024
  • Time range: First 4 months of each year (Jan-Apr)
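
The naming pattern above can be expanded into concrete file paths. The following is a minimal sketch: the helper name `jse_paths` and the `JSE_YEARS` constant are illustrative and not part of the repository, and the helper only builds paths without checking that the files exist.

```python
from pathlib import Path

# Years listed in this README for the JSE Top 40 files (2020 is not listed).
JSE_YEARS = (2019, 2021, 2022, 2023, 2024)


def jse_paths(year, root="data/stocks"):
    """Return (ticker_csv, performance_csv) paths for one year.

    Hypothetical helper: it follows the documented naming pattern
    and does not touch the file system.
    """
    if year not in JSE_YEARS:
        raise ValueError(f"No JSE Top 40 files are listed for {year}")
    base = Path(root)
    tickers = base / f"top40_jse_{year}.csv"
    performance = base / f"top40_jse_{year}_performance.csv"
    return tickers, performance


tickers_path, performance_path = jse_paths(2024)
print(tickers_path.as_posix())      # data/stocks/top40_jse_2024.csv
print(performance_path.as_posix())  # data/stocks/top40_jse_2024_performance.csv
```

Rejecting unlisted years early gives a clearer error than a later "file not found" from pandas.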

Educational Datasets

Curated datasets for coursework and teaching:

| Dataset | Location | Course | Description |
| --- | --- | --- | --- |
| Customer Segmentation | data/cos781/ | COS781 | Customer demographics and behavior |
| Hypermarket Dataset | data/cos781/ | COS781 | Retail transaction data |
| Market Basket Data | data/cos781/ | COS781 | Association rule mining |
| Online Retail II | data/cos781/ | COS781 | E-commerce transactions (zipped) |
| AG News | data/cos802/ag_news_csv/ | COS802 | News text classification |
| What's Cooking | data/whats-cooking/ | - | Kaggle competition data |

Datasets by Language

isiZulu (Zulu)

  • African Next Voices (ZA-ANV)
  • Vuk'uzenzele Corpus
  • IsiZulu News 2022
  • Umsuka EN-IsiZulu Parallel Corpus
  • za-mafoko collections
  • Whisper ASR models

Setswana (Tswana)

  • African Next Voices (ZA-ANV)
  • PuoData & PuoBERTa
  • Vuk'uzenzele Corpus
  • za-mafoko collections
  • Whisper ASR models

isiXhosa (Xhosa)

  • African Next Voices (ZA-ANV)
  • Vuk'uzenzele isiXhosa Speech Dataset (ViXSD)
  • Vuk'uzenzele Corpus
  • za-mafoko collections

Sesotho (Southern Sotho)

  • African Next Voices (ZA-ANV)
  • Vuk'uzenzele Corpus
  • za-mafoko collections
  • Whisper ASR models

Sepedi (Northern Sotho / Sesotho sa Leboa)

  • African Next Voices (ZA-ANV)
  • Vuk'uzenzele Corpus
  • za-mafoko collections

Tshivenda (Venda)

  • African Next Voices (ZA-ANV)
  • Vuk'uzenzele Corpus
  • za-mafoko collections
  • Mafoko Tshivenda Augmented Translations
  • Whisper ASR models

Xitsonga (Tsonga)

  • African Next Voices (ZA-ANV)
  • Vuk'uzenzele Corpus
  • za-mafoko collections
  • Whisper ASR models

Siswati (Swati)

  • Siswati News 2022
  • Vuk'uzenzele Corpus
  • za-mafoko collections

isiNdebele (Ndebele)

  • Vuk'uzenzele Corpus
  • za-mafoko collections

English

  • COVID-19 ZA
  • ZA Fake News
  • Most corpora include English
  • AG News Dataset

Afrikaans

  • Vuk'uzenzele Corpus
  • za-mafoko collections

Other African Languages

  • Kinyarwanda & Kirundi (Cross-lingual models)
  • 40+ languages in Masakhane MT
  • African Pre-Trained Embeddings

Datasets by Platform

HuggingFace 🤗

Organization: dsfsi

  • 22+ datasets, 30+ models
  • Collections: za-mafoko, PuoBERTa

Organization: dsfsi-anv

  • African Next Voices dataset (3,000 hours multilingual speech)
  • 8 Whisper ASR models

Browse All: HuggingFace DSFSI Datasets

GitHub

Organization: github.com/dsfsi

  • 10+ dataset repositories
  • Code and preprocessing scripts
  • Documentation and papers

Zenodo

Archived datasets with DOIs for academic citation:

  • COVID-19 ZA
  • Vuk'uzenzele NLP
  • Umsuka Parallel Corpus
  • SA News Data
  • African Embeddings
  • WordNets
  • ZA Fake News 2020

University of Pretoria Repository

SADiLaR

NWU Repository


Local Datasets

This repository hosts small-scale datasets locally in the data/ directory:

Financial Data

  • JSE Top 40 Stock Performance (2019-2024)
    • Location: data/stocks/
    • Format: CSV and pickle files
    • Contains daily closing prices for Johannesburg Stock Exchange Top 40 companies

Course Datasets

  • COS781 Datasets - Customer analytics and retail data

    • Location: data/cos781/
  • COS802 Datasets - Text classification datasets

    • Location: data/cos802/

Kaggle Datasets

  • What's Cooking - Recipe classification data
    • Location: data/whats-cooking/

Note: These datasets live in the repository's data/ directory, but .gitignore rules also apply to that directory, so confirm that the files you need are present after cloning.
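
Because .gitignore rules apply to data/, it can help to check which of the local dataset directories named in this section are actually present in a clone. This is a minimal sketch: the function name `missing_dataset_dirs` is hypothetical, and treating an empty directory as missing is a design choice rather than anything the repository prescribes.

```python
from pathlib import Path

# Local dataset directories named in this README.
EXPECTED_DIRS = (
    "data/stocks",
    "data/cos781",
    "data/cos802",
    "data/whats-cooking",
)


def missing_dataset_dirs(repo_root="."):
    """List the documented data directories that are absent or empty."""
    root = Path(repo_root)
    missing = []
    for rel in EXPECTED_DIRS:
        path = root / rel
        # A directory with no entries is treated the same as a missing one.
        if not path.is_dir() or not any(path.iterdir()):
            missing.append(rel)
    return missing


if __name__ == "__main__":
    # Run from the root of a clone of this registry.
    for rel in missing_dataset_dirs("."):
        print(f"not found or empty: {rel}")
```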


How to Use

Accessing HuggingFace Datasets

from datasets import load_dataset

# Load a dataset
dataset = load_dataset("dsfsi/PuoData")

# Load za-mafoko terminology
terminology = load_dataset("dsfsi/za-mafoko-dsac")

# Load African Next Voices
anv = load_dataset("dsfsi-anv/za-african-next-voices")
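
African Next Voices is listed above at 3,000 hours of audio, so downloading it in full just to inspect a few records may be impractical. The sketch below previews records using the streaming mode of the `datasets` library. The `peek` helper and the `split="train"` default are assumptions; some datasets also need a configuration name or have access conditions, so check the dataset card first.

```python
from itertools import islice

# Hub IDs as given in this README.
HUB_IDS = {
    "puodata": "dsfsi/PuoData",
    "za-mafoko-dsac": "dsfsi/za-mafoko-dsac",
    "anv": "dsfsi-anv/za-african-next-voices",
}


def peek(name, n=3, split="train"):
    """Return the first n records of a dataset without a full download.

    Hypothetical helper; split="train" is an assumption about the dataset.
    """
    # Imported lazily so the ID table is usable without the library installed.
    from datasets import load_dataset

    # streaming=True yields records on demand instead of downloading everything.
    stream = load_dataset(HUB_IDS[name], split=split, streaming=True)
    return list(islice(stream, n))


# Example (requires network access and the `datasets` package):
# records = peek("anv", n=2)
# print(records[0].keys())
```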

Accessing GitHub Repositories

# Clone a repository
git clone https://github.com/dsfsi/covid19za.git

# Clone this registry
git clone https://github.com/dsfsi/dsfsi-datasets.git

Using HuggingFace Models

from transformers import pipeline

# Load PuoBERTa for masked language modeling
model = pipeline("fill-mask", model="dsfsi/PuoBERTa")

# Load Whisper ASR model
asr = pipeline("automatic-speech-recognition",
               model="dsfsi-anv/multilingual-whisper-v3-turbo")

Working with Local JSE Stock Data

import pandas as pd

# Load stock ticker list
tickers = pd.read_csv("data/stocks/top40_jse_2024.csv")

# Load performance data
performance = pd.read_csv("data/stocks/top40_jse_2024_performance.csv")

# Or load pickle format
performance_df = pd.read_pickle("data/stocks/top40_jse_2024_performance.df")
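
Once closing prices are loaded, a common first step is converting them to daily returns. The sketch below uses a small made-up frame because the column layout of the performance files is not documented here. The `date` index and ticker-named columns are assumptions, and the prices are illustrative values, not real JSE data.

```python
import pandas as pd

# Illustrative prices only; not real JSE data. The layout (one column per
# ticker, one row per trading day) is an assumption about the file format.
prices = pd.DataFrame(
    {"AAA": [100.0, 102.0, 101.0], "BBB": [50.0, 50.0, 55.0]},
    index=pd.to_datetime(["2024-01-02", "2024-01-03", "2024-01-04"]),
)
prices.index.name = "date"

# Simple daily return: (p_t - p_{t-1}) / p_{t-1}. The first row has no
# previous day, so it is dropped.
daily_returns = prices.pct_change().dropna()

# Cumulative return over the whole window for each ticker.
cumulative = (1 + daily_returns).prod() - 1

print(daily_returns.round(4))
print(cumulative.round(4))
```

If the real files use a long format (one row per ticker per day), pivot them to this wide layout before calling `pct_change`.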

Citing Datasets

When using our datasets, please cite appropriately:

  • Use DOIs from Zenodo/repository pages when available
  • Cite associated papers listed in dataset descriptions
  • Reference this registry for general dataset discovery

Contributing

We welcome contributions to this registry! See CONTRIBUTING.md for guidelines on:

  • Submitting new datasets
  • Updating dataset information
  • Reporting issues with datasets
  • Proposing improvements to the registry

Citation

If you use datasets from the DSFSI collection, please cite both the specific dataset and this registry:

@misc{dsfsi-datasets-2025,
  title={DSFSI Public Datasets Registry},
  author={{Data Science for Social Impact Research Group}},
  year={2025},
  publisher={University of Pretoria},
  url={https://github.com/dsfsi/dsfsi-datasets}
}

For specific datasets, please see individual dataset pages for citation information.


Contact

Data Science for Social Impact (DSFSI) Research Group


License

  • This registry and the code in this repository: MIT License
  • Individual datasets: Licenses vary; see each dataset's own page
  • Local data: Various licenses (see dataset-specific documentation)

Maintained by: DSFSI Research Group, University of Pretoria
