DataPrep Workflow Engine

Analyze • Transform • Export

Project Overview

Real-world datasets are often messy, inconsistent, and not directly suitable for machine learning or analytical workflows. Data scientists and analysts typically spend significant time performing repetitive preprocessing tasks such as:

Handling missing values
Removing duplicates
Detecting and removing outliers
Scaling numerical features
Encoding categorical variables
Understanding dataset structure and quality

Most beginner EDA tools focus only on visualization and static analysis. They lack a preprocessing workflow in which transformations are applied sequentially and preserved for the rest of the session.

The DataPrep Workflow Engine solves this problem by providing a stateful preprocessing pipeline system that allows users to:

Analyze datasets interactively
Apply sequential transformations
Preserve transformation state
Track preprocessing history
Export final cleaned datasets

The application behaves more like a lightweight preprocessing workflow engine than a simple visualization dashboard.

Problem Statement

Traditional EDA dashboards typically suffer from several limitations:

Transformations are not persistent
Data resets after interactions
No sequential preprocessing pipeline
Lack of workflow transparency
Limited preprocessing capabilities
Poor support for iterative cleaning workflows

This project addresses these limitations through:

Stateful session management
Sequential transformation architecture
Modular preprocessing pipeline design
Transformation history tracking
Interactive preprocessing workflows

Preview

Analysis Page Preview

Transformation Page Preview

Key Features

Stateful Sequential Transformation Pipeline

The application maintains two separate datasets throughout the workflow:

`original_df`

Stores the raw uploaded dataset
Never modified
Serves as the source of truth

`working_df`

Stores transformed data
Updated after every preprocessing step
Used throughout the transformation pipeline
Exported as the final cleaned dataset

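Because every step reads from and writes back to `working_df`, transformations compose in the order they are applied, while `original_df` stays available as the baseline for a reset.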
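The two-dataset design can be sketched as follows. This is a minimal illustration, not the project's actual code: a plain dict stands in for Streamlit's `st.session_state`, and the helper names `init_state`, `apply_step`, and `reset_pipeline` are hypothetical.

```python
import pandas as pd

def init_state(state, uploaded_df):
    """Store the raw upload plus an independent working copy."""
    state["original_df"] = uploaded_df.copy()
    state["working_df"] = uploaded_df.copy()

def apply_step(state, transform):
    """Run one transformation on the working copy and persist the result."""
    state["working_df"] = transform(state["working_df"].copy())

def reset_pipeline(state):
    """Discard all transformations by copying the untouched original."""
    state["working_df"] = state["original_df"].copy()

state = {}  # assumption: stand-in for st.session_state
raw = pd.DataFrame({"Age": [25.0, None, 40.0], "City": ["Pune", "Delhi", "Pune"]})
init_state(state, raw)

# Each step sees the output of the previous one.
apply_step(state, lambda df: df.dropna())
apply_step(state, lambda df: df.reset_index(drop=True))
```

Copying on initialization and on reset keeps `original_df` isolated, so no transformation can modify the source of truth by accident.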

Multi-Page Application Architecture

The application is divided into separate workflow-focused pages:

Page	Purpose
Home	Dataset upload and session initialization
Data Analysis	Exploratory Data Analysis and visualizations
Data Transformations	Sequential preprocessing and export workflows

Analysis Page Features

The Data Analysis module provides interactive exploratory data analysis capabilities.

Dataset Preview

Displays dataset head
Provides quick overview of uploaded data

Dataset Shape Metrics

Total rows
Total columns

Datatype Analysis

Detects column datatypes
Helps identify categorical and numerical features

Null Value Analysis

Displays missing value counts per column
Helps identify incomplete features

Duplicate Analysis

Detects exact duplicate rows
Helps identify redundant records

Unique Value Analysis

Displays unique feature counts
Shows cardinality levels
Helps guide encoding decisions
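The shape, datatype, null, duplicate, and unique-value checks described above map closely onto a handful of pandas calls. The sketch below is an assumption about how such a summary could be assembled; the function name `profile` and the sample data are illustrative only.

```python
import pandas as pd

def profile(df):
    """Per-column dtype, missing-value count, and unique-value count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "nulls": df.isna().sum(),
        "unique": df.nunique(),  # NaN is not counted as a distinct value
    })

df = pd.DataFrame({
    "Age": [25.0, None, 40.0, 25.0],
    "City": ["Pune", "Delhi", "Pune", "Pune"],
})

rows, cols = df.shape                          # dataset shape metrics
summary = profile(df)                          # dtype / null / unique analysis
duplicate_rows = int(df.duplicated().sum())    # exact duplicate rows
```

A column with few unique values relative to the row count is usually a reasonable candidate for one-hot encoding, while a high-cardinality column may be better served by label encoding.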

Outlier Analysis

Uses IQR-based outlier detection
Detects numerical anomalies
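As an illustration of IQR-based detection, the sketch below counts values that fall outside `[Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]` for each numerical column. The multiplier 1.5 is the conventional default and an assumption here, as is the function name `iqr_outlier_counts`.

```python
import pandas as pd

def iqr_outlier_counts(df, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] per numeric column."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column bounds with the DataFrame columns.
    mask = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return mask.sum()

df = pd.DataFrame({
    "Salary": [40, 42, 45, 47, 50, 400],
    "Dept": ["A", "A", "B", "B", "C", "C"],
})
counts = iqr_outlier_counts(df)
```

Non-numeric columns such as `Dept` are skipped, so the result contains one count per numerical feature.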

Histogram Visualization

Interactive histogram plotting
Numerical feature distribution analysis

Correlation Heatmap

Multi-column correlation analysis
Dynamic heatmap visualization

Transformation Page Features

The Data Transformations module enables sequential preprocessing workflows.

Missing Value Handling

Supported strategies:

Mean Imputation
Median Imputation
Mode Imputation
Row Deletion
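The four strategies could be dispatched from a single function, as in the sketch below. The function name `handle_missing` and the strategy keys are hypothetical, and the mode branch assumes each column has at least one non-missing value.

```python
import pandas as pd

def handle_missing(df, columns, strategy):
    """Return a copy of df with missing values in `columns` handled."""
    out = df.copy()
    if strategy == "drop":
        # Row deletion: drop rows that are missing any of the chosen columns.
        return out.dropna(subset=columns)
    for col in columns:
        if strategy == "mean":
            fill = out[col].mean()
        elif strategy == "median":
            fill = out[col].median()
        elif strategy == "mode":
            fill = out[col].mode().iloc[0]  # first mode if there are ties
        else:
            raise ValueError(f"Unknown strategy: {strategy}")
        out[col] = out[col].fillna(fill)
    return out

df = pd.DataFrame({"Age": [20.0, None, 40.0], "City": ["Pune", None, "Pune"]})
mean_filled = handle_missing(df, ["Age"], "mean")
mode_filled = handle_missing(df, ["City"], "mode")
dropped = handle_missing(df, ["Age"], "drop")
```

Mean and median apply only to numerical columns, whereas mode imputation also works for categorical ones.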

Duplicate Removal

Removes exact duplicate rows
Maintains dataset consistency

Outlier Removal

Uses:

IQR (Interquartile Range) Method

Features:

Column-wise outlier removal
Numerical feature filtering
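Column-wise removal might look like the sketch below, which filters rows one selected column at a time. The name `remove_outliers_iqr`, the 1.5 multiplier, and the choice to keep rows with missing values are assumptions made for illustration.

```python
import pandas as pd

def remove_outliers_iqr(df, columns, k=1.5):
    """Drop rows whose value in any selected column lies outside the IQR fences."""
    out = df
    for col in columns:
        q1, q3 = out[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        in_range = out[col].between(q1 - k * iqr, q3 + k * iqr)
        # Keep missing values here; they belong to the missing-value step.
        out = out[in_range | out[col].isna()]
    return out.reset_index(drop=True)

df = pd.DataFrame({"Salary": [40, 42, 45, 47, 50, 400]})
cleaned = remove_outliers_iqr(df, ["Salary"])
```

Because the columns are processed in sequence, the fences for each later column are computed on the rows that survived the earlier ones.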

Feature Scaling

Supported scaling methods:

MinMax Scaling
Standard Scaling

Implemented using:

`sklearn.preprocessing` (scikit-learn)
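A minimal sketch of how the two methods could be applied to selected columns with `MinMaxScaler` and `StandardScaler` from `sklearn.preprocessing`. The wrapper `scale_columns` and the `SCALERS` lookup are hypothetical names, not taken from the project.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Assumption: method keys chosen for this example only.
SCALERS = {"minmax": MinMaxScaler, "standard": StandardScaler}

def scale_columns(df, columns, method):
    """Return a copy of df with the selected numeric columns rescaled."""
    out = df.copy()
    scaler = SCALERS[method]()
    out[columns] = scaler.fit_transform(out[columns])
    return out

df = pd.DataFrame({"Salary": [10.0, 20.0, 30.0], "City": ["A", "B", "C"]})
minmax = scale_columns(df, ["Salary"], "minmax")      # values mapped to [0, 1]
standard = scale_columns(df, ["Salary"], "standard")  # zero mean, unit variance
```

Only the listed columns are touched, so categorical features such as `City` pass through unchanged.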

Feature Encoding

Supported encoding methods:

Label Encoding
One-Hot Encoding

Features:

Automatic categorical feature handling
Integer-based encoded outputs
Dynamic feature expansion
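Both encodings can be approximated with pandas alone, as sketched below. This is a stand-in rather than the project's implementation: `pd.factorize` with `sort=True` assigns integer codes in sorted category order, and `pd.get_dummies` with `dtype=int` yields 0/1 integer columns. The function names are hypothetical.

```python
import pandas as pd

def label_encode(df, column):
    """Replace each category with an integer code (sorted category order)."""
    out = df.copy()
    codes, _ = pd.factorize(out[column], sort=True)
    out[column] = codes  # missing values receive the code -1
    return out

def one_hot_encode(df, columns):
    """Expand each category into its own 0/1 integer column."""
    return pd.get_dummies(df, columns=columns, dtype=int)

df = pd.DataFrame({"Age": [25, 31, 40], "City": ["Pune", "Delhi", "Pune"]})
labelled = label_encode(df, "City")
expanded = one_hot_encode(df, ["City"])
```

One-hot encoding replaces `City` with one new column per category, which is the dynamic feature expansion mentioned above.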

Transformation History

Tracks all preprocessing steps applied during the session.

Example:

✓ Applied Mean Imputation on ['Age']
✓ Removed Duplicate Rows
✓ Applied Standard Scaling on ['Salary']
✓ Applied One-Hot Encoding on ['City']

This provides workflow transparency and preprocessing traceability.
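One simple way to produce such a log is to append a label each time a step is applied, as in this sketch. A plain dict again stands in for the session state, and `apply_and_log` is a hypothetical helper.

```python
import pandas as pd

def apply_and_log(state, label, transform):
    """Apply a transformation to the working data and record what was done."""
    state["working_df"] = transform(state["working_df"])
    state["history"].append(label)

# Assumption: a dict standing in for the per-session state.
state = {
    "working_df": pd.DataFrame({"Age": [20.0, None, 40.0, 40.0]}),
    "history": [],
}

apply_and_log(
    state,
    "Applied Mean Imputation on ['Age']",
    lambda df: df.fillna({"Age": df["Age"].mean()}),
)
apply_and_log(state, "Removed Duplicate Rows", lambda df: df.drop_duplicates())

for entry in state["history"]:
    print(f"✓ {entry}")
```

Recording the label in the same call that mutates the data keeps the history and the working dataset from drifting apart.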

Reset Pipeline

Allows users to reset all transformations and restore the original uploaded dataset.

Export System

Supported export formats:

CSV
Excel (.xlsx)

Users can download the fully cleaned and transformed dataset after preprocessing.
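A sketch of how the two export payloads could be built in memory. The function names are hypothetical, and the Excel path assumes the `openpyxl` package listed in the tech stack is installed. In a Streamlit app, byte payloads like these can be handed to `st.download_button`.

```python
import io
import pandas as pd

def to_csv_bytes(df):
    """Serialize the DataFrame to UTF-8 encoded CSV without the index."""
    return df.to_csv(index=False).encode("utf-8")

def to_excel_bytes(df):
    """Serialize the DataFrame to an in-memory .xlsx file (needs openpyxl)."""
    buffer = io.BytesIO()
    df.to_excel(buffer, index=False, engine="openpyxl")
    return buffer.getvalue()

df = pd.DataFrame({"Age": [20, 30], "City": ["Pune", "Delhi"]})
csv_payload = to_csv_bytes(df)
```

Building the files in memory avoids writing temporary files on the server for each download.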

System Architecture

The project follows a modular architecture for scalability and maintainability.

data-prep-workflow-engine/
│
├── streamlit_app.py
│
├── pages/
│   ├── analysis.py
│   └── transformations.py
│
├── analyzers/
│   ├── dtype_analyzer.py
│   ├── null_analyzer.py
│   ├── duplicate_analyzer.py
│   ├── unique_analyzer.py
│   └── outlier_analyzer.py
│
├── transformers/
│   ├── missing_handler.py
│   ├── duplicate_handler.py
│   ├── outlier_handler.py
│   ├── scaler.py
│   └── encoder.py
│
├── visualizations/
│   ├── histogram.py
│   └── heatmap.py
│
└── requirements.txt

Tech Stack

Technology	Purpose
Python	Core programming language
Streamlit	Interactive web application framework
Pandas	Data manipulation and preprocessing
NumPy	Numerical operations
Matplotlib	Data visualization
Seaborn	Statistical plotting
Scikit-learn	Preprocessing and ML utilities
OpenPyXL	Excel export functionality

Engineering Concepts Implemented

This project demonstrates several important software engineering and machine learning workflow concepts:

Stateful Session Management
Sequential Data Pipelines
Modular Application Design
Interactive Data Processing
Workflow-Oriented Architecture
Reusable Transformer Design
Multi-Page Streamlit Applications
Persistent Transformation Pipelines

Future Improvements

Potential future enhancements include:

Pipeline export as Python code
Undo/Redo preprocessing steps
Smart preprocessing recommendations
Datatype conversion utilities
AutoML integration
Workflow save/load functionality
Pipeline templates
Advanced visualization modules

Conclusion

The DataPrep Workflow Engine is designed as a lightweight interactive preprocessing workflow system that bridges the gap between traditional EDA dashboards and real-world data preparation pipelines.

The project emphasizes:

Sequential preprocessing workflows
Stateful data transformations
Modular engineering practices
Interactive preprocessing pipelines

rather than focusing solely on static visualizations.

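Because analyzers, transformers, and visualizations live in separate modules, new preprocessing steps can be added without changing the existing ones, which keeps the system reusable and extensible for future machine learning workflow enhancements.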
