Real-world datasets are often messy, inconsistent, and not directly suitable for machine learning or analytical workflows. Data scientists and analysts typically spend significant time performing repetitive preprocessing tasks such as:
- Handling missing values
- Removing duplicates
- Detecting and removing outliers
- Scaling numerical features
- Encoding categorical variables
- Understanding dataset structure and quality
Most beginner EDA tools focus only on visualization and static analysis. They lack a preprocessing workflow in which transformations are applied sequentially and preserved throughout the session.
The DataPrep Workflow Engine solves this problem by providing a stateful preprocessing pipeline system that allows users to:
- Analyze datasets interactively
- Apply sequential transformations
- Preserve transformation state
- Track preprocessing history
- Export final cleaned datasets
The application behaves more like a lightweight preprocessing workflow engine than a simple visualization dashboard.
Traditional EDA dashboards typically suffer from several limitations:
- Transformations are not persistent
- Data resets after interactions
- No sequential preprocessing pipeline
- Lack of workflow transparency
- Limited preprocessing capabilities
- Poor support for iterative cleaning workflows
This project addresses these limitations through:
- Stateful session management
- Sequential transformation architecture
- Modular preprocessing pipeline design
- Transformation history tracking
- Interactive preprocessing workflows
The application maintains two separate datasets throughout the workflow:
- Original dataset
  - Stores the raw uploaded dataset
  - Never modified
  - Serves as the source of truth
- Working dataset
  - Stores transformed data
  - Updated after every preprocessing step
  - Used throughout the transformation pipeline
  - Exported as the final cleaned dataset
This architecture enables true sequential preprocessing workflows.
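The two-dataset design can be illustrated with a minimal sketch. The class name `WorkflowState` and its attributes (`original_df`, `current_df`, `history`) are hypothetical and are not taken from the project's code. In a Streamlit app, an object like this would typically be kept in `st.session_state` so that it survives reruns, which is also an assumption about how the project is wired.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

import pandas as pd


@dataclass
class WorkflowState:
    """Hypothetical container for the original data, working data, and history."""

    original_df: pd.DataFrame                     # raw upload, never modified
    current_df: Optional[pd.DataFrame] = None     # working copy, replaced per step
    history: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Start the working copy as an independent copy of the upload.
        if self.current_df is None:
            self.current_df = self.original_df.copy()

    def apply(self, step: Callable[[pd.DataFrame], pd.DataFrame], description: str) -> None:
        # Each step receives the output of the previous one (sequential pipeline).
        self.current_df = step(self.current_df)
        self.history.append(description)

    def reset(self) -> None:
        # Discard every transformation and return to the uploaded data.
        self.current_df = self.original_df.copy()
        self.history.clear()


state = WorkflowState(pd.DataFrame({"Age": [25.0, None, 31.0, 31.0]}))
state.apply(lambda df: df.dropna(), "Removed rows with missing values")
state.apply(lambda df: df.drop_duplicates(), "Removed Duplicate Rows")
```

Because every step reads from and writes back to the working copy, the second step sees the result of the first, while `original_df` stays available for a reset.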
The application is divided into separate workflow-focused pages:
| Page | Purpose |
|---|---|
| Home | Dataset upload and session initialization |
| Data Analysis | Exploratory Data Analysis and visualizations |
| Data Transformations | Sequential preprocessing and export workflows |
The Data Analysis module provides interactive exploratory data analysis capabilities.
- Displays dataset head
- Provides quick overview of uploaded data
- Total rows
- Total columns
- Detects column datatypes
- Helps identify categorical and numerical features
- Displays missing value counts per column
- Helps identify incomplete features
- Detects exact duplicate rows
- Helps identify redundant records
- Displays unique value counts per feature
- Shows cardinality levels
- Helps guide encoding decisions
- Uses IQR-based outlier detection
- Detects numerical anomalies
- Interactive histogram plotting
- Numerical feature distribution analysis
- Multi-column correlation analysis
- Dynamic heatmap visualization
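The tabular checks listed above (shape, datatypes, missing values, duplicates, and cardinality) map onto a handful of pandas calls. The function name `summarize` and the keys of the returned dictionary are assumptions made for this sketch, not the project's actual analyzer interface.

```python
import pandas as pd


def summarize(df: pd.DataFrame) -> dict:
    """Collect basic structure and quality metrics for a DataFrame."""
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "dtypes": df.dtypes.astype(str).to_dict(),
        "missing": df.isna().sum().to_dict(),            # nulls per column
        "duplicate_rows": int(df.duplicated().sum()),    # exact duplicates only
        "unique": df.nunique().to_dict(),                # cardinality, NaN excluded
        "numeric_columns": df.select_dtypes(include="number").columns.tolist(),
    }


sample = pd.DataFrame(
    {
        "Age": [25.0, None, 31.0, 31.0],
        "City": ["Pune", "Delhi", "Pune", "Pune"],
    }
)
report = summarize(sample)
```

Low-cardinality columns in `report["unique"]` are natural candidates for one-hot encoding, while the columns in `report["numeric_columns"]` are the ones that histograms, outlier checks, and the correlation heatmap apply to.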
The Data Transformations module enables sequential preprocessing workflows.
Supported missing-value handling strategies:
- Mean Imputation
- Median Imputation
- Mode Imputation
- Row Deletion
- Removes exact duplicate rows
- Maintains dataset consistency
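The missing-value strategies and duplicate removal described above can be sketched with pandas as follows. The function names and the strategy strings (`"mean"`, `"median"`, `"mode"`, `"drop"`) are hypothetical. The sketch also assumes that mode imputation uses the first mode when several values tie.

```python
import pandas as pd


def handle_missing(df: pd.DataFrame, columns: list, strategy: str) -> pd.DataFrame:
    """Return a copy of df with missing values in `columns` handled."""
    out = df.copy()
    if strategy == "drop":
        # Row deletion: drop rows that have a null in any selected column.
        return out.dropna(subset=columns)
    for col in columns:
        if strategy == "mean":
            fill = out[col].mean()
        elif strategy == "median":
            fill = out[col].median()
        elif strategy == "mode":
            modes = out[col].mode()
            if modes.empty:        # column is entirely null, nothing to fill with
                continue
            fill = modes.iloc[0]
        else:
            raise ValueError(f"Unknown strategy: {strategy}")
        out[col] = out[col].fillna(fill)
    return out


def remove_duplicates(df: pd.DataFrame) -> pd.DataFrame:
    """Drop exact duplicate rows, keeping the first occurrence."""
    return df.drop_duplicates().reset_index(drop=True)


raw = pd.DataFrame(
    {
        "Age": [20.0, None, 30.0, 40.0],
        "City": ["Pune", None, "Pune", "Delhi"],
    }
)
filled = handle_missing(raw, ["Age"], "mean")
filled = handle_missing(filled, ["City"], "mode")
deduped = remove_duplicates(filled)
```

In this example the imputed row becomes identical to an existing row, so duplicate removal only catches it when it runs after imputation. This is one reason the order of steps in a sequential pipeline matters.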
Outlier removal uses:
- IQR (Interquartile Range) Method
Features:
- Column-wise outlier removal
- Numerical feature filtering
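A minimal sketch of column-wise IQR filtering is shown below. The multiplier `k=1.5` is the conventional choice and is assumed here, since the text does not state the value used. Keeping rows whose value is missing is also an assumption, and the function names are hypothetical.

```python
import pandas as pd


def iqr_bounds(series: pd.Series, k: float = 1.5) -> tuple:
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def remove_outliers_iqr(df: pd.DataFrame, columns: list, k: float = 1.5) -> pd.DataFrame:
    """Drop rows whose value in any selected numeric column lies outside the fences."""
    out = df
    for col in columns:
        low, high = iqr_bounds(out[col], k)
        # Keep in-range values; rows with a null in this column are left for
        # the missing-value step to handle.
        out = out[out[col].between(low, high) | out[col].isna()]
    return out.reset_index(drop=True)


salaries = pd.DataFrame({"Salary": [30, 32, 35, 31, 33, 200]})
cleaned = remove_outliers_iqr(salaries, ["Salary"])
```

When several columns are selected, this version recomputes the fences for each column on the rows that survived the previous column, so the column order can affect the result.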
Supported scaling methods:
- MinMax Scaling
- Standard Scaling
Implemented using:
sklearn.preprocessing (scikit-learn)
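The two scaling methods correspond to `MinMaxScaler` and `StandardScaler` in `sklearn.preprocessing`. The wrapper below is a sketch: the function name `scale_columns` and the method strings are hypothetical, and the project's own scaler module may be organized differently.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical mapping from a UI choice to a scikit-learn scaler class.
SCALERS = {"minmax": MinMaxScaler, "standard": StandardScaler}


def scale_columns(df: pd.DataFrame, columns: list, method: str) -> pd.DataFrame:
    """Return a copy of df with the selected numeric columns scaled."""
    if method not in SCALERS:
        raise ValueError(f"Unknown scaling method: {method}")
    out = df.copy()
    scaler = SCALERS[method]()
    # fit_transform learns the statistics from these columns and applies them.
    out[columns] = scaler.fit_transform(out[columns])
    return out


frame = pd.DataFrame({"Salary": [10.0, 20.0, 30.0], "City": ["Pune", "Delhi", "Pune"]})
minmax = scale_columns(frame, ["Salary"], "minmax")      # values mapped to [0, 1]
standard = scale_columns(frame, ["Salary"], "standard")  # zero mean, unit variance
```

`StandardScaler` divides by the population standard deviation (`ddof=0`), so the scaled values will differ slightly from a calculation based on pandas' default `std()`, which uses `ddof=1`.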
Supported encoding methods:
- Label Encoding
- One-Hot Encoding
Features:
- Automatic categorical feature handling
- Integer-based encoded outputs
- Dynamic feature expansion
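Both encoding methods can be sketched with pandas alone. This is an assumption made for illustration: the text does not say which library the project's encoder uses. The function names are hypothetical. Label codes here follow the sorted order of the categories, and `dtype=int` is used so that the one-hot columns contain integers rather than booleans.

```python
import pandas as pd


def label_encode(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Replace each category with an integer code (sorted category order)."""
    out = df.copy()
    for col in columns:
        # Missing values receive the code -1 with this approach.
        out[col] = out[col].astype("category").cat.codes
    return out


def one_hot_encode(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Expand each category into its own 0/1 integer column."""
    return pd.get_dummies(df, columns=columns, dtype=int)


people = pd.DataFrame({"Age": [25, 31, 40], "City": ["Pune", "Delhi", "Pune"]})
labelled = label_encode(people, ["City"])
expanded = one_hot_encode(people, ["City"])
```

One-hot encoding adds one column per distinct value, so the unique value counts from the analysis page are a useful guide: high-cardinality columns can widen the dataset considerably.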
The transformation history tracks all preprocessing steps applied during the session.
Example:
✓ Applied Mean Imputation on ['Age']
✓ Removed Duplicate Rows
✓ Applied Standard Scaling on ['Salary']
✓ Applied One-Hot Encoding on ['City']
This provides workflow transparency and preprocessing traceability.
A reset option allows users to discard all transformations and restore the original uploaded dataset.
Supported export formats:
- CSV
- Excel (.xlsx)
Users can download the fully cleaned and transformed dataset after preprocessing.
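The export step can be sketched as two helpers that serialize a DataFrame to in-memory bytes. The function names and the sheet name are hypothetical. In a Streamlit app, such bytes could be passed to `st.download_button`, although whether the project does this is an assumption. The Excel helper requires the `openpyxl` package listed in the technology table.

```python
import io

import pandas as pd


def to_csv_bytes(df: pd.DataFrame) -> bytes:
    """Serialize df as UTF-8 CSV without the index column."""
    return df.to_csv(index=False).encode("utf-8")


def to_excel_bytes(df: pd.DataFrame) -> bytes:
    """Serialize df as an .xlsx workbook held in memory (needs openpyxl)."""
    buffer = io.BytesIO()
    with pd.ExcelWriter(buffer, engine="openpyxl") as writer:
        df.to_excel(writer, index=False, sheet_name="cleaned")
    return buffer.getvalue()


cleaned = pd.DataFrame({"Age": [25, 31], "City": ["Pune", "Delhi"]})
csv_payload = to_csv_bytes(cleaned)
# xlsx_payload = to_excel_bytes(cleaned)  # same pattern for the Excel download
```

Writing to a buffer rather than a file on disk keeps the export stateless on the server side, since each download is generated from the current working dataset.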
The project follows a modular architecture for scalability and maintainability.
data-prep-workflow-engine/
│
├── streamlit_app.py
│
├── pages/
│ ├── analysis.py
│ └── transformations.py
│
├── analyzers/
│ ├── dtype_analyzer.py
│ ├── null_analyzer.py
│ ├── duplicate_analyzer.py
│ ├── unique_analyzer.py
│ └── outlier_analyzer.py
│
├── transformers/
│ ├── missing_handler.py
│ ├── duplicate_handler.py
│ ├── outlier_handler.py
│ ├── scaler.py
│ └── encoder.py
│
├── visualizations/
│ ├── histogram.py
│ └── heatmap.py
│
└── requirements.txt
| Technology | Purpose |
|---|---|
| Python | Core programming language |
| Streamlit | Interactive web application framework |
| Pandas | Data manipulation and preprocessing |
| NumPy | Numerical operations |
| Matplotlib | Data visualization |
| Seaborn | Statistical plotting |
| Scikit-learn | Preprocessing and ML utilities |
| OpenPyXL | Excel export functionality |
This project demonstrates several important software engineering and machine learning workflow concepts:
- Stateful Session Management
- Sequential Data Pipelines
- Modular Application Design
- Interactive Data Processing
- Workflow-Oriented Architecture
- Reusable Transformer Design
- Multi-Page Streamlit Applications
- Persistent Transformation Pipelines
Potential future enhancements include:
- Pipeline export as Python code
- Undo/Redo preprocessing steps
- Smart preprocessing recommendations
- Datatype conversion utilities
- AutoML integration
- Workflow save/load functionality
- Pipeline templates
- Advanced visualization modules
The DataPrep Workflow Engine is designed as a lightweight interactive preprocessing workflow system that bridges the gap between traditional EDA dashboards and real-world data preparation pipelines.
The project emphasizes:
- Sequential preprocessing workflows
- Stateful data transformations
- Modular engineering practices
- Interactive preprocessing pipelines
rather than focusing solely on static visualizations.
This architecture makes the system scalable, reusable, and extensible for future machine learning workflow enhancements.


