This project develops an intelligent Property Recommendation System leveraging machine learning to connect users with their ideal properties. It addresses the challenge of sifting through vast real estate data by providing personalized suggestions based on various property features and user preferences. The system integrates data processing, advanced feature engineering, robust machine learning modeling, and an interactive web application for a seamless user experience.
- Data Cleaning & Preprocessing: Robust handling of missing values, duplicates, and standardization.
- Feature Engineering & Selection: Creation of impactful new features and selection of the most relevant ones.
- Machine Learning Model Training: Development and evaluation of a predictive recommendation model.
- Interactive Web Application: A user-friendly Streamlit interface for real-time recommendations.
- Model Interpretability (SHAP): Insights into model predictions using SHAP values.
- Data Visualization: Comprehensive visualizations for deeper data understanding.
- Frontend: Streamlit (Interactive web application)
- Backend: Python
- Data Processing: Pandas, NumPy
- Machine Learning: XGBoost, Scikit-learn
- Visualization: Plotly
- Model Interpretation: SHAP
Our project is organized into the following logical directories:
Property-Recommendation/
├── data/ # Raw and processed dataset files
├── model/ # Trained machine learning models (e.g., XGBoost.pkl)
├── script/ # Core source code and notebooks
│ ├── app.py # Streamlit web application for recommendations
│ ├── data_cleaning.ipynb # Notebook for initial data cleaning and preprocessing
│ ├── data_processing.ipynb # Notebook for data transformation and validation
│ ├── feature_engineering.ipynb # Notebook for creating and selecting new features
│ ├── model.ipynb # Notebook for model training, tuning, and evaluation
│ └── requirements.txt # Python package dependencies
└── script_info/ # Additional scripts and documentation
Our robust data pipeline ensures high-quality data feeds into the recommendation engine:
- Data Cleaning (`script/data_cleaning.ipynb`)
  - Purpose: To refine raw data by handling inconsistencies and preparing it for analysis.
  - Processes:
    - Identification and imputation/removal of missing values.
    - Detection and elimination of duplicate property entries.
    - Standardization of data formats across various features.
- Data Processing (`script/data_processing.ipynb`)
  - Purpose: To transform cleaned data into a structured format suitable for feature engineering and modeling.
  - Processes:
    - Extraction of relevant information from raw text fields (e.g., amenities lists).
    - Normalization and scaling of numerical features.
    - Validation checks to ensure data integrity and consistency.
- Feature Engineering (`script/feature_engineering.ipynb`)
  - Purpose: To create new, informative features that enhance the predictive power of the model.
  - Processes:
    - Generation of interaction terms between existing features.
    - Creation of categorical indicators from text descriptions.
    - Application of feature scaling techniques (e.g., StandardScaler, MinMaxScaler).
- Model Training (`script/model.ipynb`)
  - Purpose: To train, tune, and evaluate the machine learning model for property recommendations.
  - Processes:
    - Selection of the optimal machine learning algorithm (e.g., XGBoost).
    - Hyperparameter tuning using techniques like GridSearchCV or RandomizedSearchCV.
    - Cross-validation for robust model evaluation.
- Streamlit Application (`script/app.py`)
  - Purpose: To serve the trained model through an interactive web interface.
  - Processes:
    - Loads the pre-trained model and preprocessing pipelines.
    - Takes user inputs (preferences, property criteria).
    - Generates and displays property recommendations dynamically.
    - Visualizes feature importance and other insights.
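As a rough illustration of the cleaning stage at the start of this pipeline, the sketch below removes duplicate records, drops rows with too many gaps, and imputes what remains (median for numeric columns, mode for categorical ones). It is a dependency-free approximation, not the notebook's actual code: the function name `clean_rows`, the `max_missing` threshold, and the sample column names are all assumptions made for the example.

```python
from statistics import median, mode


def clean_rows(rows, numeric_cols, categorical_cols, max_missing=1):
    """Deduplicate, drop sparse rows, then impute remaining gaps (hypothetical helper)."""
    # Step 1: drop exact duplicate records, keeping the first occurrence.
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(dict(row))

    # Step 2: remove rows with more than `max_missing` missing values.
    cols = list(numeric_cols) + list(categorical_cols)
    kept = [r for r in unique if sum(r.get(c) is None for c in cols) <= max_missing]

    # Step 3: impute with the median (numeric) or the mode (categorical).
    for group, fill_fn in ((numeric_cols, median), (categorical_cols, mode)):
        for col in group:
            observed = [r[col] for r in kept if r.get(col) is not None]
            if not observed:
                continue  # nothing to learn a fill value from
            fill = fill_fn(observed)
            for r in kept:
                if r.get(col) is None:
                    r[col] = fill
    return kept


# Made-up records for illustration only.
rows = [
    {"id": 1, "gla": 1500, "bedrooms": 3, "property_type": "Detached"},
    {"id": 1, "gla": 1500, "bedrooms": 3, "property_type": "Detached"},  # duplicate
    {"id": 2, "gla": None, "bedrooms": 4, "property_type": "Detached"},
    {"id": 3, "gla": 1100, "bedrooms": 2, "property_type": None},
    {"id": 4, "gla": None, "bedrooms": None, "property_type": None},  # too sparse
    {"id": 5, "gla": 1300, "bedrooms": 3, "property_type": "Condo"},
]
cleaned = clean_rows(rows, ["gla", "bedrooms"], ["property_type"], max_missing=1)
```

In a pandas-based notebook the same steps would more likely be expressed with `drop_duplicates`, `dropna(thresh=...)`, and `fillna`; the plain-Python version is used here only to keep the example self-contained.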
Effective feature engineering was crucial for capturing the most relevant similarities between the subject property and candidate comparables. Here are the engineered features used in the model:
- `gla_diff`: Difference in Gross Living Area (GLA) between subject and candidate. Reason: Properties with similar living area are more comparable in valuation.
- `lot_size_diff`: Difference in lot size between subject and candidate. Reason: Lot size impacts property value and is a key metric for comparison.
- `bedroom_diff`: Difference in number of bedrooms. Reason: Number of bedrooms is a fundamental factor in property similarity and buyer interest.
- `bathroom_diff`: Difference in number of bathrooms. Reason: Bathrooms, like bedrooms, are core to property utility and comparability.
- `room_count_diff`: Difference in total room count. Reason: Overall room count gives an additional measure of property size and use.
- `same_property_type`: Boolean flag (1/0) if subject and candidate have the same property type. Reason: Comparing the same property type ensures more meaningful valuation (e.g., house to house, condo to condo).
- `same_storey_type`: Boolean flag if both properties have the same number/type of stories (e.g., both are 2-storey homes). Reason: Storey type affects structure, value, and buyer preference.
- `sold_recently_90`: Boolean flag if candidate was sold within 90 days of the subject's valuation date. Reason: Recent sales provide more reliable market comparisons due to similar market conditions.
(These features were selected based on appraisal best practices and confirmed by SHAP analysis as the most influential drivers for the model's recommendations.)
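To make the feature definitions concrete, here is a minimal sketch of how the eight features could be computed for one subject/candidate pair. The `Property` dataclass, its field names, and the `pair_features` function are hypothetical. Two details are assumptions the list above does not settle: differences are taken as absolute values, and "within 90 days" is treated as 90 days on either side of the valuation date.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Property:
    """Hypothetical record; field names are illustrative, not the project's schema."""
    gla: float
    lot_size: float
    bedrooms: int
    bathrooms: float
    room_count: int
    property_type: str
    storey_type: str
    ref_date: date  # valuation date for the subject, sale date for a candidate


def pair_features(subject: Property, candidate: Property) -> dict:
    """Build the comparison features for a single subject/candidate pair."""
    days_apart = abs((subject.ref_date - candidate.ref_date).days)
    return {
        # Assumption: absolute differences (the sign convention is not documented).
        "gla_diff": abs(subject.gla - candidate.gla),
        "lot_size_diff": abs(subject.lot_size - candidate.lot_size),
        "bedroom_diff": abs(subject.bedrooms - candidate.bedrooms),
        "bathroom_diff": abs(subject.bathrooms - candidate.bathrooms),
        "room_count_diff": abs(subject.room_count - candidate.room_count),
        "same_property_type": int(subject.property_type == candidate.property_type),
        "same_storey_type": int(subject.storey_type == candidate.storey_type),
        # Assumption: a sale up to 90 days before or after the valuation date counts.
        "sold_recently_90": int(days_apart <= 90),
    }


# Made-up example values.
subject = Property(1800, 5000, 3, 2.5, 8, "Detached", "2 Storey", date(2024, 6, 1))
candidate = Property(1650, 5400, 3, 2.0, 7, "Detached", "2 Storey", date(2024, 4, 15))
features = pair_features(subject, candidate)
```

Building one feature row per pair, rather than per property, is what lets a binary classifier score each candidate relative to a specific subject.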
We employed XGBoost (Extreme Gradient Boosting) for our recommendation model.
- Why XGBoost?
  - Performance: Known for its high performance and speed, especially on structured data.
  - Robustness: Handles various data types and missing values effectively.
  - Interpretability: Integrates well with SHAP for understanding feature contributions.
  - Scalability: Efficient for large datasets, suitable for real estate data.
The model was trained using an 80/20 train–test split on a curated property dataset, with group-aware splitting to prevent data leakage. Hyperparameter tuning was done via GridSearchCV.
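The idea behind group-aware splitting can be sketched in a few lines. The example assumes the grouping key is the subject property, so that all candidate rows for one subject land on the same side of the split; the README does not state which key was actually used. The function name `group_train_test_split` and the sample IDs are hypothetical.

```python
import random


def group_train_test_split(group_ids, test_size=0.2, seed=42):
    """Split row indices so that no group appears in both train and test."""
    groups = sorted(set(group_ids))
    rng = random.Random(seed)
    rng.shuffle(groups)

    # Hold out roughly `test_size` of the groups (not of the rows).
    n_test = max(1, round(len(groups) * test_size))
    test_groups = set(groups[:n_test])

    train_idx = [i for i, g in enumerate(group_ids) if g not in test_groups]
    test_idx = [i for i, g in enumerate(group_ids) if g in test_groups]
    return train_idx, test_idx


# Made-up data: 10 subjects with 3 candidate rows each.
group_ids = [f"subject_{i}" for i in range(10) for _ in range(3)]
train_idx, test_idx = group_train_test_split(group_ids, test_size=0.2)

train_groups = {group_ids[i] for i in train_idx}
test_groups = {group_ids[i] for i in test_idx}
```

Splitting rows at random instead would let candidates for the same subject appear in both sets, which tends to inflate test scores. In a scikit-learn workflow, `sklearn.model_selection.GroupShuffleSplit` provides the same behaviour through its `groups` argument.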
- Evaluation: Performance was measured by:
  - Top-3 Hit Rate: Achieved a 94.4% hit rate (17/18), meaning at least one expert-chosen comp was present in the model's top-3 ranked candidates for nearly all test subjects.
  - ROC AUC Score: Scored 0.93, indicating excellent ability to distinguish true comps from non-comps.
- Interpretability: SHAP (SHapley Additive exPlanations) analysis identified `gla_diff`, `lot_size_diff`, `bedroom_diff`, and `sold_recently_90` as the most influential features in model recommendations. SHAP results confirmed the model's reliance on core property characteristics and sale recency, closely matching human expert logic.
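The top-3 hit rate is straightforward to compute once each candidate has a model score. The sketch below is one possible implementation, not the project's evaluation code: the function name `top_k_hit_rate`, the dictionary layout, and the sample scores are assumptions for illustration.

```python
def top_k_hit_rate(scores_by_subject, true_comps_by_subject, k=3):
    """Fraction of subjects with at least one true comp among the k highest-scored candidates."""
    hits = 0
    for subject, scores in scores_by_subject.items():
        # Rank candidate IDs by model score, highest first, and keep the top k.
        top_k = sorted(scores, key=scores.get, reverse=True)[:k]
        if set(top_k) & true_comps_by_subject[subject]:
            hits += 1
    return hits / len(scores_by_subject)


# Made-up scores: S1's true comp ranks second (hit), S2's ranks last (miss).
scores_by_subject = {
    "S1": {"a": 0.9, "b": 0.2, "c": 0.7, "d": 0.4},
    "S2": {"e": 0.1, "f": 0.8, "g": 0.6, "h": 0.5},
}
true_comps_by_subject = {"S1": {"c"}, "S2": {"e"}}

hit_rate = top_k_hit_rate(scores_by_subject, true_comps_by_subject, k=3)

# The reported figure corresponds to 17 hits out of 18 test subjects.
reported = round(17 / 18, 3)
```

Because the metric is computed per subject, a single miss on a test set of 18 subjects moves the result by about 5.6 percentage points, which is worth keeping in mind when reading the 94.4% figure.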
Follow these steps to set up and run the Property Recommendation System locally.
- Python 3.8+
- Git
- Clone the repository:

      git clone https://github.com/yourusername/Property-Recommendation.git
      cd Property-Recommendation

- Create and activate a virtual environment: It's highly recommended to use a virtual environment to manage dependencies.

      python -m venv venv
      # On macOS/Linux:
      source venv/bin/activate
      # On Windows:
      venv\Scripts\activate

- Install required packages: All project dependencies are listed in `script/requirements.txt`.

      pip install -r script/requirements.txt
Navigate to the script/ directory and open the .ipynb files in sequence using Jupyter Lab or Jupyter Notebook to understand the data pipeline:
# After activating your virtual environment
cd script/
jupyter lab # or jupyter notebook

Follow the notebooks in this order:
- `data_cleaning.ipynb`
- `data_processing.ipynb`
- `feature_engineering.ipynb`
- `model.ipynb`
After training the model (by running model.ipynb or ensuring a trained model is saved in model/), you can launch the interactive web application:
# Ensure you are in the script directory of the project: Property-Recommendation/script
streamlit run app.py

Open your web browser and navigate to the local URL provided by Streamlit (typically http://localhost:8501).
The model demonstrated strong performance in recommending comparable properties:
- Key Metrics:
  - Top-3 Hit Rate: Achieved a 94.4% hit rate (17/18), meaning at least one expert-chosen comp was present in the model's top-3 ranked candidates for nearly all test subjects.
  - ROC AUC Score: Scored 0.93, indicating excellent ability to distinguish true comps from non-comps.
- SHAP Analysis Insights:
  - SHAP (SHapley Additive exPlanations) analysis identified GLA Difference, Lot Size Difference, Bedroom Difference, and Sold Within 90 Days as the most influential features in model recommendations.
  - SHAP results confirmed the model's reliance on core property characteristics and sale recency, closely matching human expert logic.
Developing this system involved several critical decisions and overcoming notable challenges:
- Challenge 1: Data Sparsity & Missing Values
  - Decision: Opted for a hybrid approach involving imputation (e.g., median for numerical, mode for categorical) and strategic removal of rows with excessive missing data, balancing data integrity with dataset size.
  - Learning: Understanding the impact of different imputation strategies on model performance.
- Challenge 2: Feature Engineering Complexity
  - Decision: Focused on creating domain-specific features (`gla_diff`, `same_property_type`) that directly relate to real estate valuation, rather than relying solely on raw features.
  - Learning: The significant uplift in model performance achievable through well-thought-out feature engineering.
- Challenge 3: Model Interpretability
  - Decision: Integrated SHAP values early in the development cycle to ensure the model's predictions could be explained, which is crucial for trust in recommendation systems.
  - Learning: How SHAP helps in debugging model errors and gaining stakeholder confidence.
- Decision 4: Streamlit for Rapid Prototyping
  - Reason: Chosen for its ability to quickly build interactive web applications with pure Python, allowing for rapid iteration and demonstration of the recommendation system.
  - Learning: Streamlit's simplicity accelerated the deployment phase significantly.
This project is licensed under the MIT License - see the LICENSE file for details.
Raahim Khan - Initial work & Core Development - https://www.linkedin.com/in/raahimk24/