Exercises in Python/SQL: semester project for the Advanced Topics in Database Systems course at ECE⚡, NTUA🎓, academic year 2021-2022.
The dataset used for this project is the Full MovieLens Dataset.
The project consists of two main parts:
- Implement and test the 5 requested queries using the RDD API and Spark SQL
- Analyze the performance of the reduce-side join and map-side join implementations
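
To illustrate the difference between the two join strategies, here is a minimal sketch in plain Python (no Spark required). It is not the project's implementation: the function names `map_side_join` and `reduce_side_join` and the toy `movies` and `ratings` records are hypothetical, chosen only to show the idea. A map-side (broadcast) join builds a hash table from the small input and streams the large one through it, while a reduce-side (repartition) join tags records by source, groups them by key, and pairs them per key.

```python
from collections import defaultdict

# Hypothetical toy data: (movie_id, title) and (movie_id, rating).
movies = [(1, "Toy Story"), (2, "Jumanji"), (3, "Heat")]
ratings = [(1, 4.0), (1, 5.0), (2, 3.5), (4, 2.0)]


def map_side_join(small, large):
    """Broadcast join: hash the small input, stream the large one through it."""
    lookup = defaultdict(list)
    for key, value in small:
        lookup[key].append(value)
    # Each record of the large input is joined locally; no grouping step is needed.
    return [(key, (s, l)) for key, l in large for s in lookup.get(key, [])]


def reduce_side_join(left, right):
    """Repartition join: tag records, group by key, then pair them per key."""
    # Map phase: tag every record with the input it came from.
    tagged = [(k, ("L", v)) for k, v in left] + [(k, ("R", v)) for k, v in right]
    # Shuffle phase (simulated): collect all tagged records that share a key.
    groups = defaultdict(list)
    for key, tagged_value in tagged:
        groups[key].append(tagged_value)
    # Reduce phase: emit the cross product of left and right values per key.
    result = []
    for key, values in groups.items():
        lefts = [v for tag, v in values if tag == "L"]
        rights = [v for tag, v in values if tag == "R"]
        result.extend((key, (l, r)) for l in lefts for r in rights)
    return result


map_result = map_side_join(movies, ratings)
reduce_result = reduce_side_join(movies, ratings)
```

Both functions compute the same inner join. In a distributed setting, the map-side variant avoids shuffling the large input but requires the small input to fit in memory on every worker.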
Details:
- We used 3 VMs for our cluster (1 NameNode, 2 DataNodes)
- Dataset formats used: CSV, DataFrame, Parquet
Goals:
- Get familiar with the Spark API
- Evaluate performance for a list of queries
- Compare different join algorithms in Spark Map-Reduce
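
Evaluating query performance usually comes down to timing each implementation several times and comparing a robust statistic. The following is a minimal sketch of such a harness, not the one used in the project: `time_query` and the two lambda "queries" are hypothetical stand-ins for the real RDD and Spark SQL implementations.

```python
import statistics
import time


def time_query(run_query, repeats=3):
    """Run a zero-argument callable `repeats` times; return the median seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)


# Hypothetical stand-ins for two implementations of the same query.
queries = {
    "q1_rdd": lambda: sum(x * x for x in range(10_000)),
    "q1_sql": lambda: sum(map(lambda x: x * x, range(10_000))),
}

# Median wall-clock time per implementation, keyed by name.
timings = {name: time_query(fn) for name, fn in queries.items()}
```

When timing real Spark jobs, the callable should end with an action (for example `collect()` or `count()`), since transformations are evaluated lazily and would otherwise not be measured.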
The project assignment and report are written in Greek.
| Name - GitHub | |
|---|---|
| Stylianos Kandylakis | |
| Kitsos Orfanopoulos | |
| Christos Tsoufis | |

| OS | CPUs | RAM | Disk space |
|---|---|---|---|
| Ubuntu 16.04 LTS (Xenial) | 2 | 2 GB | 30 GB |