Sample PySpark ETL pipeline for REST API data source with pagination

Credits

The method for ingesting data from a REST API is based on the one demonstrated in the repo https://github.com/jamesshocking/Spark-REST-API-UDF.git, where the API call is integrated into the DataFrame via a UDF. This has the potential benefit of making full use of the workers in Spark's distributed architecture, as it was intended to be used, rather than running entirely on the driver.
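As a rough illustration of the UDF idea described above, the sketch below wraps a plain HTTP GET in a Spark UDF so that each row's URL is fetched on whichever worker processes that row. It is not the code from either repo: the names `fetch_json` and `add_response_column`, the column names `url` and `body`, and the choice of `urllib` for the HTTP call are all assumptions made for this example.

```python
from urllib.request import urlopen


def fetch_json(url, opener=urlopen):
    """Return the response body as text, or None if the request fails.

    `opener` is a hypothetical hook so the function can be exercised
    without network access; by default it is urllib's urlopen.
    """
    try:
        with opener(url) as response:
            return response.read().decode("utf-8")
    except Exception:
        # Assumption: a failed call should yield a null cell rather than
        # fail the whole Spark job. A real pipeline may prefer to log or retry.
        return None


def add_response_column(df, url_col="url", out_col="body"):
    """Add a string column holding the raw response for each URL in `url_col`."""
    # Imported lazily so fetch_json can be used and tested without PySpark.
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import StringType

    fetch_udf = udf(fetch_json, StringType())
    return df.withColumn(out_col, fetch_udf(col(url_col)))
```

Given a DataFrame with one URL per row, `add_response_column(df)` would return the same rows plus a `body` column. Because the UDF is evaluated per row, the requests are spread over the partitions of the DataFrame instead of being issued from the driver.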

In this sample project I have slightly extended this technique by adding pagination to the API calls. This is a necessary addition, as most production APIs have an upper limit on the data payload sent over the internet and provide a facility to iteratively request additional pages until all data is extracted. This sample uses the Rick & Morty REST API to demonstrate this feature.
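The pagination loop described above might look roughly like the following. This is a minimal sketch, not the project's actual implementation. It assumes each page is a JSON object with a `results` list and an `info.next` field holding the URL of the next page (null on the last page), which is how the Rick & Morty API is understood to respond. The names `get_page`, `fetch_all` and `max_pages` are hypothetical.

```python
import json
from urllib.request import urlopen


def get_page(url):
    """Fetch one page and parse it as JSON."""
    with urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))


def fetch_all(first_url, get_page=get_page, max_pages=1000):
    """Follow `info.next` links from `first_url` and collect every result.

    `get_page` can be swapped for a stub in tests; `max_pages` is a
    safety cap (an assumption) against an API that never stops paging.
    """
    results = []
    url = first_url
    pages = 0
    while url and pages < max_pages:
        payload = get_page(url)
        results.extend(payload.get("results", []))
        # Assumed response shape: {"info": {"next": <url or null>}, "results": [...]}
        url = (payload.get("info") or {}).get("next")
        pages += 1
    return results
```

For example, `fetch_all("https://rickandmortyapi.com/api/character")` would be expected to walk through every character page. Running a function like this inside a UDF keeps the page-by-page loop on the worker that handles the row.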

The helper functions for starting the Spark session and logging have been adopted from the repo https://github.com/AlexIoannides/pyspark-example-project.

Project Structure

The project is structured to allow running scripts and notebooks, and to enable DevOps CI/CD integration through modularised Python folders and files.

data
docs
scripts
src
tests
.gitignore
README.md
pytest.ini
requirements.txt

rarpal/sample-sparkingest-rickmortyapi

Folders and files

Latest commit

History

Repository files navigation

Sample PySpark ETL pipeline for REST API data source with pagination

Credits

Project Structure

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages