
Commit 25bb904

Merge pull request #1 from diru1100/sb-api
Create Initial core PR for SpamBrainz API
2 parents 4bd0fad + 312594a commit 25bb904

21 files changed: +698 additions, −2 deletions

README.md

Lines changed: 44 additions & 2 deletions
@@ -1,2 +1,44 @@
-# SB_API
-SpamBrainz API using LodBrok model.

# SpamBrainz API

An API to classify editor accounts using LodBrok (a Keras model) in the backend. It also provides an option to retrain the model for future enhancement by taking incorrect predictions into consideration, based on SpamNinja feedback.

### Steps to run the API:

1) Install all the dependencies in a virtual environment:

```
pip install -r requirements.txt
```

2) In the spambrainz folder, set the following environment variables in the terminal to run the API:

```
$ export FLASK_APP=sb_api.py
$ export FLASK_DEBUG=1 # only for development purposes
$ export FLASK_RUN_PORT=4321 # API requests are sent to this port
```

3) Install Redis:

```
$ wget http://download.redis.io/redis-stable.tar.gz
$ tar xvzf redis-stable.tar.gz
$ cd redis-stable
$ make
$ sudo make install
```

4) Run Redis in a separate terminal to store the data sent to SB_API:

```
$ redis-server
```

5) With all the dependencies in place, start the server:

```
$ flask run
```

This runs the API on the specified port in debug mode.

![](spambrainz/static/images/sb_api_working.png)

The detailed internal functioning of the API is documented [here](spambrainz/app/README.md).

The requests used, their details, and their output are documented [here](spambrainz/README.md).
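As a rough illustration only, once the server and Redis are running, a classification request could be sent from Python as sketched below; the payload shape and field values are assumptions mirroring the fields read in spambrainz/app/classify.py, and the canonical request scripts are documented in the links above.

```
# Hypothetical example request to /predict -- field names and payload shape are assumed.
import requests

editor = {
    "id": 1, "email": "user@example.com", "website": "http://example.com",
    "bio": None, "area": None, "gender": None, "birth_date": None, "privs": 0,
    "member_since": "2020-07-01T10:00:00", "email_confirm_date": None,
    "last_login_date": None, "last_updated": None,
}

print(requests.post("http://localhost:4321/predict", json=editor).json())
```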

requirements.txt

Lines changed: 58 additions & 0 deletions
@@ -0,0 +1,58 @@
absl-py==0.9.0
aniso8601==8.0.0
appdirs==1.4.4
astor==0.8.1
astunparse==1.6.3
cachetools==4.1.1
certifi==2020.6.20
chardet==3.0.4
click==7.1.2
DateTime==4.3
filelock==3.0.12
Flask==1.1.2
Flask-RESTful==0.3.8
gast==0.3.3
gevent==20.6.2
google-auth==1.20.0
google-auth-oauthlib==0.4.1
google-pasta==0.2.0
greenlet==0.4.16
grpcio==1.30.0
h5py==2.10.0
idna==2.10
importlib-metadata==1.7.0
itsdangerous==1.1.0
Jinja2==2.11.2
Keras==2.3.1
Keras-Applications==1.0.8
Keras-Preprocessing==1.1.2
Markdown==3.2.2
MarkupSafe==1.1.1
mbdata==25.0.4
numpy==1.18.5
oauthlib==3.1.0
opt-einsum==3.3.0
protobuf==3.12.4
pyasn1==0.4.8
pyasn1-modules==0.2.8
pytz==2020.1
PyYAML==5.3.1
redis==3.5.3
requests==2.24.0
requests-oauthlib==1.3.0
rsa==4.6
scipy==1.4.1
six==1.15.0
tensorboard==2.3.0
tensorboard-plugin-wit==1.7.0
tensorflow==2.3.0
tensorflow-estimator==2.3.0
termcolor==1.1.0
uritools==3.0.0
urlextract==1.0.0
urllib3==1.25.10
Werkzeug==1.0.1
wrapt==1.12.1
zipp==3.1.0
zope.event==4.4
zope.interface==5.1.0

spambrainz/README.md

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
## Functioning of requests:

1) There are two request scripts, classify_request.py and train_request.py, for /predict and /train respectively.
2) classify_request.py sends a sample spam editor account to be classified by the LodBrok model running in the backend. The command ```python classify_request.py``` gives the following output:

![](static/images/classify_request.png)
3) train_request.py sends spam editor accounts along with the **verdict** given by SpamNinja (spam or not). The command ```python train_request.py``` gives the following output:

![](static/images/train_request.png)

More details regarding the API functioning are documented [here](app/README.md).
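The request scripts themselves are not part of this diff. As a minimal sketch only, a training request might look as follows; the field names mirror those read in app/classify.py, the **verdict** flag follows the description above, and the exact payload wrapping expected by /train is an assumption:

```
# sketch_train_request.py -- hypothetical example, not a file from this PR
import requests

# Editor fields mirror those used in app/classify.py; "verdict" carries the
# SpamNinja decision (1 = spam, 0 = not spam). Payload shape is assumed.
editor = {
    "id": 1,
    "area": None,
    "bio": "Buy cheap watches at http://spam.example.com",
    "birth_date": None,
    "privs": 0,
    "gender": None,
    "email": "spammer@example.com",
    "website": "http://spam.example.com",
    "member_since": "2020-07-01T10:00:00",
    "email_confirm_date": "2020-07-01T10:05:00",
    "last_login_date": "2020-07-02T09:00:00",
    "last_updated": "2020-07-02T09:30:00",
    "verdict": 1,
}

response = requests.post("http://localhost:4321/train", json=editor)
print(response.json())
```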

spambrainz/app/README.md

Lines changed: 46 additions & 0 deletions
@@ -0,0 +1,46 @@
## Internal Functioning of API:

This is the structure of the API:

![](../static/images/api_structure.png)

- The initialization of the application is done in **\_\_init\_\_.py**, where the Flask, Redis, and model instances are initialized.

- **../sb_api.py** takes the above instances and runs the application.
    - (Note: this is done to avoid circular imports in Flask.)

- **classify.py** contains classify_process(), which classifies editors by retrieving the data stored in Redis.
    - First, it converts the given JSON data into a form the model can predict on:
        - It converts the datetime values stored as strings in the JSON back into datetime objects with the help of the string_to_datetime function.
        - It converts the JSON data into an np array with the help of the preprocess_editor function.
    - After the preprocessed data is obtained, it performs the predictions and stores the results back into Redis to be retrieved later.
    - The editor details stored in Redis are then removed from the queue.

- **routes.py** contains the API endpoints /predict and /train, which are called when POST requests are made to the API.
    - The **/predict** endpoint:
        - The input JSON data is pushed into the Redis queue after being converted into a compatible form.
        - The editor ids are stored beforehand so that the results written to Redis by classify_process (keyed by id) can be retrieved later on.
        - Unfilled details such as area or bio are set to None so they stay compatible later on.
        - classify_process() is called with the batch size to load the model and classify the editor accounts retrieved from Redis.
        - The results stored in Redis by classify_process are retrieved and sent back to SpamNinja in JSON format.
    - The **/train** endpoint:
        - The input JSON data, with the additional **verdict** parameter and the editor details, is converted directly into a compatible format (an np array) for the model to retrain on.
        - Unfilled details such as area or bio are set to None, and the datetime values stored as strings in the JSON are converted back into datetime objects with the help of the string_to_datetime function.
        - preprocess_editor then converts the data into an np array.
        - The preprocessed data is sent to the retrain_model function to retrain the model.
        - If retraining succeeds, a success JSON message is returned.

- **preprocessing.py** preprocesses the JSON/dict editor data into a proper np array for the model to predict/train on; it uses the originally created tokenizers to convert each parameter properly.

- **train.py** contains the retraining part of the model (a sketch follows this list).
    - The data sent to retrain_model is used to retrain the model.
    - The learning rate of the optimizer (Adam) is set to 0.001, a low fixed value, so the model can learn new patterns while not forgetting its old learnings (avoiding catastrophic forgetting).
    - The current model is saved as previous_lodbrok.h5 in ../static/weights/ for future reference, in case we want to roll back.
    - It then calls the train_model function, which continues training the model on the new data. The batch size is set to 1, with 2 epochs, to keep the learning balanced.
    - After training is done, the new model is saved as current_lodbrok.h5, overwriting the old model, so it keeps classifying new data over time.
    - The original LodBrok weights are kept in original_lodbrok.h5 so we can go back and trace the progress made so far.
    - **Benefits of using this method:**
        - No extra database is needed to store the new data sent by SpamNinja or to maintain the old data the model was trained on.
        - There is no fixed size for the batch of data sent to the model; it can contain any number of editor accounts.
        - The model's old learnings are not quickly forgotten, thanks to the **slow static learning rate** and the fact that the structure of the data stays the same.
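train.py itself is not shown in this diff. As a minimal sketch only, based on the description above and on the input layout used in classify.py, retrain_model might look roughly like the following; the file paths, loss function, and one-hot label encoding are assumptions, and the real code delegates the fitting step to a train_model helper:

```
# Hypothetical sketch of train.py -- not the committed file.
import numpy as np
from tensorflow.keras.models import load_model
from tensorflow.keras.optimizers import Adam

WEIGHTS_DIR = "static/models/weights/"  # assumed location, matching classify.py


def retrain_model(data):
    """Retrain LodBrok on a 2D array of preprocessed editors (verdict in column 0)."""
    model = load_model(WEIGHTS_DIR + "current_lodbrok.h5")

    # Keep a copy of the current weights so we can roll back if needed.
    model.save(WEIGHTS_DIR + "previous_lodbrok.h5")

    # Low, fixed learning rate: learn new patterns while avoiding catastrophic forgetting.
    model.compile(optimizer=Adam(learning_rate=0.001),
                  loss="categorical_crossentropy", metrics=["accuracy"])

    # Same column layout as classify.py: column 0 is the SpamNinja verdict,
    # the remaining columns feed the model's four inputs.
    labels = np.stack([1 - data[:, 0], data[:, 0]], axis=1)  # [non-spam, spam]
    inputs = [data[:, 1:10], data[:, 10], data[:, 11], data[:, 12:]]

    # Small batch and few epochs, as described above, to keep learning balanced
    # (the committed code performs this step via a train_model helper).
    model.fit(inputs, labels, batch_size=1, epochs=2)

    # Overwrite the serving weights so future /predict calls use the updated model.
    model.save(WEIGHTS_DIR + "current_lodbrok.h5")
    return True
```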

spambrainz/app/__init__.py

Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
import flask
import redis
import keras
from threading import Thread

# initialize our Flask application, Redis server, and Keras model

app = flask.Flask(__name__)
db = redis.StrictRedis(host="localhost", port=6379, db=0)
model = None

from app import routes
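sb_api.py is referenced by FLASK_APP above but is not part of this diff; a minimal sketch of what it might contain, given the note in spambrainz/app/README.md about avoiding circular imports (this is an assumption, not the committed file):

```
# Hypothetical sb_api.py -- not included in this diff.
# Import the already-configured Flask app so that `flask run` can discover it;
# route registration stays inside the app package to avoid circular imports.
from app import app
```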

spambrainz/app/classify.py

Lines changed: 94 additions & 0 deletions
@@ -0,0 +1,94 @@
from app import model, db
from tensorflow.keras.models import load_model
from .preprocessing import preprocess_editor
import json
import datetime
import numpy as np
import flask

model = model


# initialize constants used for the redis server
EDITOR_QUEUE = "editor_queue"
SERVER_SLEEP = 0.25
CLIENT_SLEEP = 0.25


# function used to convert a string to a datetime, used before preprocessing
def string_to_datetime(string_dt):
    if string_dt is None:
        return None
    return datetime.datetime(*[int(v) for v in string_dt.replace('T', '-').replace(':', '-').split('-')])


# function used to retrieve editor data from redis and store the results back
def classify_process(size):

    print("* Loading model...")
    global model
    model = load_model('static/models/weights/current_lodbrok.h5')
    print("* Model loaded")

    BATCH_SIZE = size

    # all the editor details are retrieved here from redis
    queue = db.lrange(EDITOR_QUEUE, 0, BATCH_SIZE - 1)

    for q in queue:

        q = json.loads(q)
        editor_id = q["id"]

        # changing string datetimes to datetime objects
        q["birth_date"] = string_to_datetime(q["birth_date"])
        q["member_since"] = string_to_datetime(q["member_since"])
        q["email_confirm_date"] = string_to_datetime(q["email_confirm_date"])
        q["last_updated"] = string_to_datetime(q["last_updated"])
        q["last_login_date"] = string_to_datetime(q["last_login_date"])

        # preprocessing the given input to get a prediction
        q = preprocess_editor(q)

        # defining the structure
        q = np.array([q])

        # only data from index 1 onwards is considered while predicting, thus
        # not taking the spam value into consideration
        predict_data = {
            "main_input": np.array(q[:, 1:10]),
            "email_input": np.array(q[:, 10]),
            "website_input": np.array(q[:, 11]),
            "bio_input": np.array(q[:, 12:]),
        }

        result = model.predict(x=[
            predict_data["main_input"],
            predict_data["email_input"],
            predict_data["website_input"],
            predict_data["bio_input"],
        ])

        # identifying the prediction made by lodbrok:
        # if closer to 1 it's a spam account, else a non-spam account
        if result[0][1] > result[0][0]:
            prediction = {
                'result': "Spam Editor Account"
            }
        else:
            prediction = {
                'result': "Non Spam Editor Account"
            }

        # converting to fit in redis
        prediction = json.dumps(prediction)

        # storing the result in redis
        db.set(str(editor_id), prediction)

    # remove the processed batch of editors from our queue
    db.ltrim(EDITOR_QUEUE, size, -1)

spambrainz/app/preprocessing.py

Lines changed: 103 additions & 0 deletions
@@ -0,0 +1,103 @@
import os
import pickle
import numpy as np
from keras.preprocessing.text import Tokenizer
from urllib.parse import urlparse
from datetime import timedelta
from urlextract import URLExtract

extractor = URLExtract()
one_hour = timedelta(hours=1)

bio_tokenizer, website_tokenizer, email_tokenizer = None, None, None


# The tokenizers built while training the model are loaded here to convert
# the given input to vectors to train on
def load_tokenizers():
    global bio_tokenizer, website_tokenizer, email_tokenizer

    with open("static/models/others/bio_tokenizer.pickle", "rb") as f:
        bio_tokenizer = pickle.load(f)

    with open("static/models/others/website_tokenizer.pickle", "rb") as f:
        website_tokenizer = pickle.load(f)

    with open("static/models/others/email_tokenizer.pickle", "rb") as f:
        email_tokenizer = pickle.load(f)


# editor preprocessing:
# preprocess_editor is used to convert the given data into an np array for
# the model to train on
def preprocess_editor(editor, spam=None):

    load_tokenizers()
    # Apparently there are users with unset member_since
    if editor["member_since"] is not None:
        # These shouldn't be None but you can't trust the database
        if editor["last_updated"] is not None:
            update_delta = (editor["last_updated"] - editor["member_since"]) / one_hour
        else:
            update_delta = -1

        if editor["last_login_date"] is not None:
            login_delta = (editor["last_login_date"] - editor["member_since"]) / one_hour
        else:
            login_delta = -1

        # Confirm date may be None
        if editor["email_confirm_date"] is not None:
            conf_delta = (editor["email_confirm_date"] - editor["member_since"]) / one_hour
        else:
            conf_delta = -1
    else:
        update_delta, login_delta, conf_delta = -2, -2, -2

    # Email domain
    email_domain = email_tokenizer.texts_to_sequences([editor["email"].split("@")[1]])[0]
    if len(email_domain) == 0:
        email_token = 1024
    else:
        email_token = email_domain[0]

    # Website domain
    domain = urlparse(editor["website"]).hostname
    if domain is not None:
        website_domain = website_tokenizer.texts_to_sequences([domain])[0]
        if len(website_domain) == 0:
            website_token = 1023
        else:
            website_token = website_domain[0]
    else:
        website_token = 1024

    # Bio metadata
    if editor["bio"] is not None:
        bio_len = len(editor["bio"])
        bio_urls = extractor.has_urls(editor["bio"])
        bio = bio_tokenizer.texts_to_matrix([editor["bio"]], mode="tfidf")[0]
    else:
        bio_len, bio_urls = 0, 0
        bio = np.zeros(512)

    data = np.array([
        spam,                              # spam classification (used only when training the model)
        editor["area"] is not None,        # Area set
        editor["gender"] is not None,      # Gender set
        editor["birth_date"] is not None,  # Birth date set
        editor["privs"] != 0,              # Nonzero privs
        bio_len,                           # Bio length
        bio_urls,                          # URLs in bio
        conf_delta,                        # Confirmation delta
        update_delta,                      # Last updated delta
        login_delta,                       # Last login delta
        email_token,                       # Email domain
        website_token,                     # Website domain
    ], dtype=np.float32)

    data = np.concatenate((data, bio))
    # print("Preprocessed data")
    return data
