This repository contains the submission for the first Mini Project of CS771 (Fall 2024), completed under the instruction of Prof. Piyush Rai, Department of CSE, IIT Kanpur.
| Name | Roll Number |
|---|---|
| Anushka Singh | 220188 |
| Arush Upadhyaya | 220213 |
| Aujasvit Datta | 220254 |
| Pahal Dhruvin Patel | 220742 |
| Pranav Agrawal | 220791 |
### Files

- `17.py`: main file to generate and save predictions
- `utils.py`: utility functions used in `17.py`
- `pred_emoticon.txt`: predictions for the emoticons dataset
- `pred_deepfeat.txt`: predictions for the deep features dataset
- `pred_text_seq.txt`: predictions for the text sequences dataset
- `emoticons/`: Jupyter notebooks containing experiments and EDA for the emoticons dataset
- `features/`: Jupyter notebooks containing experiments and EDA for the deep features dataset
- `text_seq/`: Jupyter notebooks containing experiments and EDA for the text sequences dataset
- `combined/`: Jupyter notebooks containing experiments and EDA for all datasets combined
- `common/`: helper functions used in experiments
### Usage

- Install the dependencies: `pip install -r requirements.txt`
- Download the dataset and make sure the `datasets/` directory is present in the root
- Run `17.py` to generate the prediction files: `python 17.py`
### Emoticons Dataset

- Preprocessing:
  - Removed dummy emojis that occur in all the input emoji strings
  - Columnarised the emoji strings into one column per character
  - One-hot encoded the categorical columns
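The preprocessing steps above can be sketched as follows (a minimal illustration with hypothetical helper names, not the actual code from `17.py` or `utils.py`):

```python
def remove_common_chars(strings):
    """Drop characters (the 'dummy' emojis) that appear in every input string."""
    common = set.intersection(*(set(s) for s in strings))
    return ["".join(ch for ch in s if ch not in common) for s in strings]

def columnarise(strings):
    """Split each string into one entry per character position."""
    return [list(s) for s in strings]

def one_hot(rows):
    """One-hot encode each character column using a shared vocabulary."""
    vocab = sorted({ch for row in rows for ch in row})
    index = {ch: i for i, ch in enumerate(vocab)}
    encoded = []
    for row in rows:
        vec = []
        for ch in row:
            onehot = [0] * len(vocab)
            onehot[index[ch]] = 1
            vec.extend(onehot)
        encoded.append(vec)
    return encoded

# Toy example: 'a' and 'X' occur in every string, so both are dropped.
cleaned = remove_common_chars(["abX", "acX", "adX"])
features = one_hot(columnarise(cleaned))
```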
- Model: Logistic Regression
- Best parameters:

  | Parameter | Value |
  |---|---|
  | C | 10 |
  | penalty | l1 |
  | solver | liblinear |

- Achieved accuracy on the validation set: 97.13%
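The tuned model can be reproduced with scikit-learn's `LogisticRegression` (a sketch on toy data, assuming scikit-learn is installed; the real features come from the one-hot preprocessing described above):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy binary data where the label equals the first feature.
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]] * 5)
y = np.array([0, 1, 0, 1] * 5)

# Best hyperparameters found for the emoticons dataset.
clf = LogisticRegression(C=10, penalty="l1", solver="liblinear")
clf.fit(X, y)
pred = clf.predict(np.array([[1.0, 0.0]]))
```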
### Deep Features Dataset

- Preprocessing: None
- Model: Logistic Regression
- Best parameters:

  | Parameter | Value |
  |---|---|
  | C | 10.0 |
  | fit_intercept | True |
  | penalty | l2 |
  | solver | lbfgs |

- Achieved accuracy on the validation set: 98.77%
### Text Sequences Dataset

- Preprocessing:
  - Removed substrings occurring in all the input strings
  - Converted the input strings into a character n-gram representation, with $n$ ranging from 3 to 5
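The n-gram expansion can be sketched as follows (illustrative only; the project may instead use a vectoriser such as scikit-learn's `CountVectorizer(analyzer="char", ngram_range=(3, 5))`):

```python
def char_ngrams(s, n_min=3, n_max=5):
    """Return all character n-grams of s for n in [n_min, n_max]."""
    return [s[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(s) - n + 1)]

grams = char_ngrams("abcde")
```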
- Model: XGBoost
- Best parameters:

  | Parameter | Value |
  |---|---|
  | colsample_bytree | 1.0 |
  | eval_metric | logloss |
  | gamma | 0.2 |
  | learning_rate | 0.1 |
  | max_depth | 7 |
  | min_child_weight | 3 |
  | n_estimators | 500 |
  | subsample | 1.0 |

- Achieved accuracy on the validation set: 93.05%
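The tuned XGBoost hyperparameters listed above can be collected into a single parameter dict (a sketch; passing them as `XGBClassifier(**xgb_params)` assumes the `xgboost` package is installed):

```python
# Tuned XGBoost hyperparameters for the text sequences dataset.
xgb_params = {
    "colsample_bytree": 1.0,
    "eval_metric": "logloss",
    "gamma": 0.2,
    "learning_rate": 0.1,
    "max_depth": 7,
    "min_child_weight": 3,
    "n_estimators": 500,
    "subsample": 1.0,
}
```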
### Combined Datasets

- Model: Logistic Regression
- Best parameters:

  | Parameter | Value |
  |---|---|
  | C | 10.0 |
  | fit_intercept | True |
  | penalty | l2 |
  | solver | lbfgs |

- Achieved accuracy on the validation set: 98.77%
We used seed 42 for all the probabilistic models we ran.