Phishing links have been around since the inception of the Internet and have only grown more sophisticated over the years. Despite multiple research efforts to better detect and prevent such malicious activity, the problem continues to reemerge because so little effort is needed to create hundreds of new links every day. At present, machine learning based approaches have proven to be the best defense against known bad actors, but they have been shown to be less resilient against large variances in generated link features. Thus, without larger and broader training datasets, some of the top solutions, such as Deep Neural Networks (DNNs), cannot achieve stronger real-time detection rates. A focus on feature selection could give us insight into which factors are most relevant for improving current models. Experimentation with an unsupervised AutoEncoder DNN should provide a clear visualization, through clustering, of potential patterns in our created features.
The dataset for our proposed project will come from PhishTank, a website widely used as a source of hundreds of thousands of confirmed phishing links. The data will be extracted into a .csv file, from which we will create features such as the number of dots, the total length, and the number of digits found in each link, giving our unsupervised AutoEncoder DNN a wide range of patterns to discover.
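As a minimal sketch of the feature creation described above, the snippet below computes the three named features for each link and writes them out in .csv form. The function name `extract_features`, the column names, and the example URLs are hypothetical illustrations, not PhishTank data or the final schema of the project.

```python
import csv
import io


def extract_features(url: str) -> dict:
    """Compute simple lexical features for a single URL (hypothetical helper)."""
    return {
        "url": url,
        "num_dots": url.count("."),                     # number of dots
        "total_length": len(url),                       # total length in characters
        "num_digits": sum(ch.isdigit() for ch in url),  # digits found in the URL
    }


# Made-up example URLs standing in for rows exported from PhishTank.
urls = [
    "http://login.example-bank.com.verify123.test/account/update",
    "https://example.org/index.html",
]
rows = [extract_features(u) for u in urls]

# Write to an in-memory buffer here; a real run would open a .csv file instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
```

Keeping each feature as a single expression should make it straightforward to add further counts later without changing the .csv writing step.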
Our project will employ several types of features that are commonly seen in datasets, such as the number of dots, total length, and number of slashes, all of which will be scaled to the range 0 to 1 with MinMaxScaler from the sklearn Python library. During preprocessing, the data will be stratified, shuffled, and split into training, validation, and testing datasets with StratifiedShuffleSplit and train_test_split, also from the sklearn Python library. We will mainly rely on batch normalization to help train our unsupervised AutoEncoder DNN, paired with clustering algorithms such as K-means or DBSCAN to group our data points. A visualization technique such as t-SNE or PCA will be used to project the high-dimensional data into a lower-dimensional space for easy visualization.
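The preprocessing, clustering, and projection steps could be sketched roughly as follows. This is an illustration under stated assumptions: the feature matrix and the binary labels used for stratification are synthetic, the 60/20/20 split ratio is arbitrary, and K-means is applied directly to the scaled features rather than to an AutoEncoder's learned representation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the features: dots, total length, slashes.
X = rng.random((300, 3)) * np.array([7.0, 190.0, 10.0]) + np.array([1.0, 10.0, 2.0])
# Hypothetical binary labels, used here only so the split can be stratified.
y = rng.integers(0, 2, size=300)

# First hold out 20% for testing, then 25% of the remainder for validation.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, shuffle=True, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=0
)

# Fit the scaler on the training set only, then reuse it on the other sets.
# Validation and test values can therefore fall slightly outside [0, 1].
scaler = MinMaxScaler()
X_train_s = scaler.fit_transform(X_train)
X_val_s = scaler.transform(X_val)
X_test_s = scaler.transform(X_test)

# Group the training points, then project them to 2-D for plotting.
cluster_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_train_s)
embedding = PCA(n_components=2).fit_transform(X_train_s)
```

Fitting MinMaxScaler on the training set alone avoids leaking information from the validation and testing sets into the scaling; `embedding` and `cluster_labels` can then be passed to any plotting library.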
Our model will perform representation learning, trained through backpropagation with an optimizer such as stochastic gradient descent (SGD), Adam, or Adagrad. We will also use RandomizedSearchCV from the sklearn Python library to help optimize our model's hyperparameters, such as the number of hidden layers, the learning rate, and the number of hidden neurons. Early stopping can be enabled to help reduce our overall training times.
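One way to prototype this search without a deep learning framework is shown below. It is a sketch under assumptions, not the proposed model: sklearn's MLPRegressor is trained to reconstruct its own input as a stand-in for the AutoEncoder DNN, the data is synthetic, and the candidate values are arbitrary. MLPRegressor offers the "sgd" and "adam" solvers but not Adagrad, and it has no batch normalization, so both are left out here.

```python
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.random((200, 4))  # synthetic stand-in for features already scaled to [0, 1]

# Candidate hyperparameters: layer count and width, learning rate, optimizer.
param_distributions = {
    "hidden_layer_sizes": [(2,), (3,), (8, 2, 8), (6, 3, 6)],
    "learning_rate_init": [1e-3, 3e-3, 1e-2],
    "solver": ["sgd", "adam"],
}

# early_stopping=True holds out part of each training fold and stops once the
# held-out score fails to improve for n_iter_no_change consecutive epochs.
autoencoder = MLPRegressor(
    early_stopping=True,
    n_iter_no_change=5,
    max_iter=200,
    random_state=0,
)

search = RandomizedSearchCV(
    autoencoder,
    param_distributions=param_distributions,
    n_iter=4,                          # number of sampled configurations
    cv=3,
    scoring="neg_mean_squared_error",  # reconstruction error, negated
    random_state=0,
)
search.fit(X, X)  # the reconstruction target is the input itself

reconstruction = search.best_estimator_.predict(X)
```

After fitting, `search.best_params_` holds the sampled configuration with the lowest mean reconstruction error across the folds.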
We will evaluate our results by comparing the training and validation accuracy of two models: our initial DNN trained on all created features, and a second DNN trained on only the most strongly correlated features found during initial clustering. During data preparation, we will use StratifiedShuffleSplit from the sklearn Python library with three splits to implement a 3-fold cross-validation, stratifying and splitting our dataset into training and testing sets.
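The splitting scheme could look roughly like the sketch below, which uses synthetic data and hypothetical labels with a 2:1 class ratio. Note that StratifiedShuffleSplit draws each split independently, so test sets may overlap across splits; sklearn's StratifiedKFold is the alternative when strictly disjoint folds are required.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

rng = np.random.default_rng(0)
X = rng.random((150, 3))             # synthetic feature matrix
y = np.array([0] * 100 + [1] * 50)   # hypothetical labels, 2:1 class ratio

# Three independent stratified splits, each holding out 20% for testing.
splitter = StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=0)

test_class_ratios = []
for train_idx, test_idx in splitter.split(X, y):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # A model would be fit on (X_train, y_train) and scored on (X_test, y_test).
    # Here we only record the share of class 1 in each test set.
    test_class_ratios.append(float(y_test.mean()))
```

Because the splits are stratified, every test set preserves the class proportions of the full dataset.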
"Phishing Detection Using Machine Learning Techniques" by Vahid Shahrivari, Mohammad Mahdi Darabi, and Mohammad Izadi will be our guideline for this project. All references current at the time of this project proposal are listed below.