Commit e929bef

Add gallery of examples (#40)
1 parent 0e50fa1 commit e929bef

17 files changed: +269 −18 lines changed

README.md

Lines changed: 2 additions & 2 deletions

@@ -191,9 +191,9 @@ predictions = pipeline.predict(X_test)

 ## Step-by-step walk-through

-A step-by-step walk-through is available on our interactive notebook hosted on [Google Colab](https://colab.research.google.com/drive/1Idzht9dNoB85pjc9gOL24t9ksrXZEA-9?usp=sharing).
+A step-by-step walk-through is available in our documentation hosted on [Read the Docs](https://hiclass.readthedocs.io/en/latest/index.html).

-This will guide you through the process of installing hiclass with conda, training and predicting a small dataset.
+This will guide you through the process of installing hiclass within a virtual environment, training, predicting, persisting models, and much more.

 ## API documentation

docs/examples/README.rst

Lines changed: 4 additions & 0 deletions

@@ -0,0 +1,4 @@
+Gallery of Examples
+===================
+
+These examples illustrate the main features of HiClass.

docs/source/algorithms/selecting_training_policy.rst renamed to docs/examples/plot_binary_policies.py

Lines changed: 37 additions & 1 deletion

@@ -1,4 +1,7 @@
-Selecting a training policy
+# -*- coding: utf-8 -*-
+"""
+===========================
+Binary Training Policies
 ===========================

 The siblings policy is used by default on the local classifier per node, but the remaining ones can be selected with the parameter :literal:`binary_policy`, for example:
@@ -40,3 +43,36 @@

     rf = RandomForestClassifier()
     classifier = LocalClassifierPerNode(local_classifier=rf, binary_policy="exclusive_siblings")
+
+In the code below, the inclusive policy is selected.
+However, the code can easily be updated by replacing lines 20-21 with the examples shown in the tabs above.
+
+.. seealso::
+
+    Mathematical definitions of the different policies are given at :ref:`Training Policies`.
+"""
+from sklearn.ensemble import RandomForestClassifier
+
+from hiclass import LocalClassifierPerNode
+
+# Define data
+X_train = [[1], [2], [3], [4]]
+X_test = [[4], [3], [2], [1]]
+Y_train = [
+    ["Animal", "Mammal", "Sheep"],
+    ["Animal", "Mammal", "Cow"],
+    ["Animal", "Reptile", "Snake"],
+    ["Animal", "Reptile", "Lizard"],
+]
+
+# Use random forest classifiers for every node
+# and the inclusive policy to select training examples for the binary classifiers
+rf = RandomForestClassifier()
+classifier = LocalClassifierPerNode(local_classifier=rf, binary_policy="inclusive")
+
+# Train local classifier per node
+classifier.fit(X_train, Y_train)
+
+# Predict
+predictions = classifier.predict(X_test)
+print(predictions)
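
Since the example differs from the tabs above only in the binary_policy string, a quick way to compare policies is to loop over them. A minimal sketch, assuming only the three policy names that appear in this diff ("siblings", "inclusive" and "exclusive_siblings"); see :ref:`Training Policies` for the full list:

from sklearn.ensemble import RandomForestClassifier

from hiclass import LocalClassifierPerNode

# Toy data, identical to the example above
X_train = [[1], [2], [3], [4]]
X_test = [[4], [3], [2], [1]]
Y_train = [
    ["Animal", "Mammal", "Sheep"],
    ["Animal", "Mammal", "Cow"],
    ["Animal", "Reptile", "Snake"],
    ["Animal", "Reptile", "Lizard"],
]

# Train and predict once per policy to compare the resulting predictions
for policy in ["siblings", "inclusive", "exclusive_siblings"]:
    classifier = LocalClassifierPerNode(
        local_classifier=RandomForestClassifier(), binary_policy=policy
    )
    classifier.fit(X_train, Y_train)
    print(policy, classifier.predict(X_test))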

Lines changed: 32 additions & 0 deletions

@@ -0,0 +1,32 @@
+# -*- coding: utf-8 -*-
+"""
+=====================
+Hello HiClass
+=====================
+
+A minimalist example showing how to use HiClass to train and predict.
+"""
+from sklearn.ensemble import RandomForestClassifier
+
+from hiclass import LocalClassifierPerNode
+
+# Define data
+X_train = [[1], [2], [3], [4]]
+X_test = [[4], [3], [2], [1]]
+Y_train = [
+    ["Animal", "Mammal", "Sheep"],
+    ["Animal", "Mammal", "Cow"],
+    ["Animal", "Reptile", "Snake"],
+    ["Animal", "Reptile", "Lizard"],
+]
+
+# Use random forest classifiers for every node
+rf = RandomForestClassifier()
+classifier = LocalClassifierPerNode(local_classifier=rf)
+
+# Train local classifier per node
+classifier.fit(X_train, Y_train)
+
+# Predict
+predictions = classifier.predict(X_test)
+print(predictions)

Lines changed: 45 additions & 0 deletions

@@ -0,0 +1,45 @@
+# -*- coding: utf-8 -*-
+"""
+=====================
+Model Persistence
+=====================
+
+HiClass is fully compatible with pickle, which can be used to easily store
+machine learning models on disk.
+In this example, we demonstrate how to use pickle to store and load trained classifiers.
+"""
+import pickle
+
+from sklearn.linear_model import LogisticRegression
+
+from hiclass import LocalClassifierPerLevel
+
+# Define data
+X_train = [[1, 2], [3, 4], [5, 6], [7, 8]]
+X_test = [[7, 8], [5, 6], [3, 4], [1, 2]]
+Y_train = [
+    ["Animal", "Mammal", "Sheep"],
+    ["Animal", "Mammal", "Cow"],
+    ["Animal", "Reptile", "Snake"],
+    ["Animal", "Reptile", "Lizard"],
+]
+
+# Use logistic regression classifiers for every level in the hierarchy
+lr = LogisticRegression()
+classifier = LocalClassifierPerLevel(local_classifier=lr)
+
+# Train local classifier per level
+classifier.fit(X_train, Y_train)
+
+# Save the model to disk
+filename = "trained_model.sav"
+with open(filename, "wb") as file:
+    pickle.dump(classifier, file)
+
+# Some time in the future...
+
+# Load the model from disk
+with open(filename, "rb") as file:
+    loaded_model = pickle.load(file)
+
+# Predict
+predictions = loaded_model.predict(X_test)
+print(predictions)
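
As a side note, joblib is a common alternative to pickle for scikit-learn estimators, since it handles large numpy arrays more efficiently. A minimal sketch, reusing the trained classifier and X_test from the example above and assuming joblib is installed (it is a scikit-learn dependency):

import joblib

# Persist the trained classifier to disk
joblib.dump(classifier, "trained_model.joblib")

# Some time in the future... load it back and predict as usual
loaded_model = joblib.load("trained_model.joblib")
print(loaded_model.predict(X_test))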

Lines changed: 76 additions & 0 deletions

@@ -0,0 +1,76 @@
+# -*- coding: utf-8 -*-
+"""
+=====================
+Parallel Training
+=====================
+
+Larger datasets require more time for training.
+While by default the models in HiClass are trained using a single core,
+it is possible to train each local classifier in parallel by leveraging the library Ray [1]_.
+In this example, we demonstrate how to train a hierarchical classifier in parallel,
+using all the cores available, on a mock dataset from Kaggle [2]_.
+
+.. [1] https://www.ray.io/
+
+.. [2] https://www.kaggle.com/datasets/kashnitsky/hierarchical-text-classification
+"""
+import sys
+from os import cpu_count
+
+import pandas as pd
+import requests
+from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
+from sklearn.linear_model import LogisticRegression
+from sklearn.pipeline import Pipeline
+
+from hiclass import LocalClassifierPerParentNode
+
+
+def download(url: str, path: str) -> None:
+    """
+    Download a file from the internet.
+
+    Parameters
+    ----------
+    url : str
+        The address of the file to be downloaded.
+    path : str
+        The path to store the downloaded file.
+    """
+    response = requests.get(url)
+    with open(path, "wb") as file:
+        file.write(response.content)
+
+
+# Download training data
+training_data_url = "https://zenodo.org/record/6657410/files/train_40k.csv?download=1"
+training_data_path = "train_40k.csv"
+download(training_data_url, training_data_path)
+
+# Load training data into a pandas dataframe
+training_data = pd.read_csv(training_data_path).fillna(" ")
+
+# We will use logistic regression classifiers for every parent node
+lr = LogisticRegression(max_iter=1000)
+
+pipeline = Pipeline(
+    [
+        ("count", CountVectorizer()),
+        ("tfidf", TfidfTransformer()),
+        (
+            "lcppn",
+            LocalClassifierPerParentNode(local_classifier=lr, n_jobs=cpu_count()),
+        ),
+    ]
+)
+
+# Select training data
+X_train = training_data["Title"]
+Y_train = training_data[["Cat1", "Cat2", "Cat3"]]
+
+# Fixes the bug AttributeError: '_LoggingTee' object has no attribute 'fileno'.
+# This only happens when building the documentation,
+# hence you don't actually need it for your code to work.
+sys.stdout.fileno = lambda: False
+
+# Now, let's train the local classifier per parent node
+pipeline.fit(X_train, Y_train)
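
The script above stops after fitting. A minimal sketch of what prediction could look like, reusing a few training titles as stand-in queries (the Kaggle dataset also ships a separate test split, which this example does not download):

# Reuse a handful of training titles as stand-in queries
sample_titles = X_train[:3]
predictions = pipeline.predict(sample_titles)
print(predictions)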

docs/examples/plot_pipeline.py

Lines changed: 45 additions & 0 deletions

@@ -0,0 +1,45 @@
+# -*- coding: utf-8 -*-
+"""
+=====================
+Building Pipelines
+=====================
+
+HiClass can be adopted in scikit-learn pipelines, and it fully supports sparse matrices as input.
+This example demonstrates the use of both of these features.
+"""
+from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
+from sklearn.linear_model import LogisticRegression
+from sklearn.pipeline import Pipeline
+
+from hiclass import LocalClassifierPerParentNode
+
+# Define data
+X_train = [
+    "Struggling to repay loan",
+    "Unable to get annual report",
+]
+X_test = [
+    "Unable to get annual report",
+    "Struggling to repay loan",
+]
+Y_train = [["Loan", "Student loan"], ["Credit reporting", "Reports"]]
+
+# We will use logistic regression classifiers for every parent node
+lr = LogisticRegression()
+
+# Let's build a pipeline using CountVectorizer and TfidfTransformer
+# to extract features as sparse matrices
+pipeline = Pipeline(
+    [
+        ("count", CountVectorizer()),
+        ("tfidf", TfidfTransformer()),
+        ("lcppn", LocalClassifierPerParentNode(local_classifier=lr)),
+    ]
+)
+
+# Now, let's train a local classifier per parent node
+pipeline.fit(X_train, Y_train)
+
+# Finally, let's predict using the pipeline
+predictions = pipeline.predict(X_test)
+print(predictions)
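
Because sparse matrices are fully supported, the vectorizer output can also be passed to the classifier directly, without wrapping everything in a pipeline. A minimal sketch of that variant, reusing X_train, X_test and Y_train from the example above:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

from hiclass import LocalClassifierPerParentNode

# Extract features as a scipy.sparse matrix and feed it to the classifier directly
vectorizer = CountVectorizer()
X_train_sparse = vectorizer.fit_transform(X_train)

classifier = LocalClassifierPerParentNode(local_classifier=LogisticRegression())
classifier.fit(X_train_sparse, Y_train)
print(classifier.predict(vectorizer.transform(X_test)))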

docs/requirements.txt

Lines changed: 4 additions & 1 deletion

@@ -2,4 +2,7 @@
 sphinx==5.0.0
 sphinx_rtd_theme==1.0.0
 readthedocs-sphinx-search==0.1.2
-sphinx_code_tabs==0.5.3
+sphinx_code_tabs==0.5.3
+sphinx-gallery==0.10.1
+matplotlib==3.5.2
+pandas==1.4.2

docs/source/algorithms/local_classifier_per_node.rst

Lines changed: 0 additions & 1 deletion

@@ -14,6 +14,5 @@ One of the most popular approaches in the literature, the local classifier per n
     :hidden:

     training_policies
-    selecting_training_policy

 Each binary classifier is trained in parallel using the library `Ray <https://www.ray.io/>`_. To avoid inconsistencies, prediction is performed in a top-down manner. For example, given a hypothetical test example, the local classifier per node first queries the binary classifiers at the nodes "Reptile" and "Mammal". Suppose that the probability of the test example belonging to class "Reptile" is 0.8, while the probability of belonging to class "Mammal" is 0.5; then class "Reptile" is picked. At the next level, only the classifiers at the nodes "Snake" and "Lizard" are queried, and again the one with the highest probability is selected.
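
A minimal sketch of the top-down selection logic described above, with hand-written probabilities (illustrative only, not HiClass internals; the values for the leaf nodes are made up):

# Children of each node in the example hierarchy
hierarchy = {
    "Animal": ["Mammal", "Reptile"],
    "Mammal": ["Sheep", "Cow"],
    "Reptile": ["Snake", "Lizard"],
}

# Per-node probabilities; "Mammal" and "Reptile" come from the paragraph above
probability = {
    "Mammal": 0.5,
    "Reptile": 0.8,
    "Sheep": 0.6,
    "Cow": 0.4,
    "Snake": 0.7,
    "Lizard": 0.3,
}

# Walk the hierarchy top-down, querying only the children of the chosen node
node, path = "Animal", ["Animal"]
while node in hierarchy:
    node = max(hierarchy[node], key=probability.get)
    path.append(node)
print(path)  # ['Animal', 'Reptile', 'Snake']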

docs/source/algorithms/metrics.rst

Lines changed: 1 addition & 1 deletion

@@ -1,6 +1,6 @@
 .. _metrics-overview:

-Hierarchical Metrics
+Metrics
 ====================

 According to [1]_, the use of flat classification metrics might not give enough insight into which algorithm is better at classifying hierarchical data. Hence, in HiClass we implemented the metrics of hierarchical precision (hP), hierarchical recall (hR) and hierarchical F-score (hF), which are extensions of the renowned metrics of precision, recall and F-score, but tailored to the hierarchical classification scenario. These hierarchical counterparts were initially proposed by [2]_, and are defined as follows:
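
The hunk ends before the formulas themselves; for reference, a sketch of the definitions proposed in [2]_, where :math:`\hat{Y}_i` is the set of classes predicted for sample :math:`i` extended with all their ancestors, and :math:`Y_i` is the analogous set of true classes:

.. math::

    hP = \frac{\sum_i |\hat{Y}_i \cap Y_i|}{\sum_i |\hat{Y}_i|}
    \qquad
    hR = \frac{\sum_i |\hat{Y}_i \cap Y_i|}{\sum_i |Y_i|}
    \qquad
    hF = \frac{2 \cdot hP \cdot hR}{hP + hR}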
