ETL

Comparison of technologies for performing ETL. The idea is simple: read a dataset from Postgres, replicate it on another Postgres database, and evaluate the time and memory consumption. The detailed statistics can be found in results.ipynb.

Data

OS Open UPRN https://osdatahub.os.uk/downloads/open/OpenUPRN

full count: 41,011,955 rows
test count: 2,000,000 rows

Highlights

pg_dump/pg_restore

  • The most efficient tool for PostgreSQL-to-PostgreSQL transfers.
  • It is not possible to transform the data along the way.
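
For reference, the whole copy is a single pipeline. A minimal sketch, assuming a hypothetical table named open_uprn and that the origin (postgres) and target (target) databases from the setup below are reachable with default credentials:

pg_dump -Fc -d postgres -t open_uprn | pg_restore -d target --no-owner

The custom format (-Fc) writes to stdout, so pg_restore can consume the stream directly without an intermediate file.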

Sling

  • Great for replication as it includes many built-in features (retries, streaming, etc.)
  • It has a very low memory footprint
  • It is not as fast as the other solutions
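
A replication boils down to one command. A sketch, assuming connections named POSTGRES_ORIGIN and POSTGRES_TARGET (defined via environment variables or Sling's env.yaml) and the same hypothetical open_uprn table:

sling run --src-conn POSTGRES_ORIGIN --src-stream public.open_uprn --tgt-conn POSTGRES_TARGET --tgt-object public.open_uprn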

DuckDB

  • The winner (in terms of execution time) for both the small and the large dataset
  • It is not distributed, so it might struggle with very large datasets
  • It is mostly SQL based: familiar to many, but potentially limiting
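
With the postgres extension, the replication is one CREATE TABLE ... AS SELECT between two attached databases. A minimal sketch in Python; the table name open_uprn and the libpq connection strings are assumptions:

import duckdb

con = duckdb.connect()
con.execute("INSTALL postgres")
con.execute("LOAD postgres")

# Attach the origin and target Postgres databases (libpq-style connection strings).
con.execute("ATTACH 'dbname=postgres' AS origin (TYPE postgres)")
con.execute("ATTACH 'dbname=target' AS tgt (TYPE postgres)")

# Copy the table in a single SQL statement (table name is hypothetical).
con.execute("CREATE TABLE tgt.open_uprn AS SELECT * FROM origin.open_uprn")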

Spark

  • Handles memory well for both small and large datasets
  • Not as fast as DuckDB in these tests
  • It is distributed, so it can handle very large datasets (terabytes and more)
  • Allows SQL, Python and Scala
  • It also has machine learning and graph processing capabilities
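
Reading and writing over JDBC is enough for this benchmark. A sketch; the table name open_uprn, the JDBC URLs and the driver version are assumptions (credentials would go in the user/password options):

from pyspark.sql import SparkSession

# Pull in the Postgres JDBC driver; pin whichever version you actually use.
spark = (
    SparkSession.builder.appName("etl")
    .config("spark.jars.packages", "org.postgresql:postgresql:42.7.3")
    .getOrCreate()
)

# Read the source table from the origin database.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://localhost:5432/postgres")
    .option("dbtable", "open_uprn")
    .load()
)

# Write it to the target database, replacing any existing table.
(
    df.write.format("jdbc")
    .option("url", "jdbc:postgresql://localhost:5432/target")
    .option("dbtable", "open_uprn")
    .mode("overwrite")
    .save()
)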

Polars

  • Very efficient compared to Pandas, and for small datasets it competes well against Spark.
  • The API is very similar to Pandas.
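
The round trip fits in two calls. A sketch; the table name open_uprn and the connection URIs are assumptions, and read_database_uri relies on the connectorx package being installed:

import polars as pl

# Read the source table from the origin database (table name is hypothetical).
df = pl.read_database_uri(
    query="SELECT * FROM open_uprn",
    uri="postgresql://localhost:5432/postgres",
)

# Write it to the target database, replacing the table if it already exists.
df.write_database(
    table_name="open_uprn",
    connection="postgresql://localhost:5432/target",
    if_table_exists="replace",
)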

Setup

Initial data upload

Run the upload script:

cd data
sh initial_upload.sh

Create databases

origin: postgres
target: target

Create the target database:

createdb target