Proposal for Groundrules
eli knaap edited this page Mar 1, 2019
·
1 revision
One way that Python projects smooth the user experience is to identify methods and conventions that all classes or estimators must have. The best-known example is scikit-learn, but other packages, such as statsmodels or even spreg, do this as well.
Usually, this behavior is achieved through class inheritance, but it can be achieved more generally by ensuring semantic consistency across the library. What I hope to do here is provide a set of conventions, attributes, and methods that I think would make our user experience consistent, simple to understand, and easy to integrate with other software packages. I'd also like to make sklearn a dependency.
- Vocabulary & ground rules
  - spatially-determined clusters are `regions`, which are composed of collections of `observations`.
  - a polygon that encloses all observations within a region is the `hull` for that region.
  - the individual numbers that map observations into regions are `labels`.
    - all `labels` are `int` types and start at zero.
    - observations that are disconnected or that are not assigned to regions should be labeled `-1`.
  - estimated quantities that are stored on the object should end with a `_` character.
    - for instance, the assignments made by the regionalizer should be called `labels_`.
    - the data used for the regionalization should not be cached on the object, but derived properties (like an affinity matrix) can be.
  - return types should be flat numpy arrays.
- Borrowing from `sklearn`, all classes should probably be declared as `class Regionalizer(sklearn.base.BaseEstimator, sklearn.base.ClusterMixin)`, and their methods should follow a similar pattern:
  - `__init__`: the initialization of the estimator
    - only set attributes or configuration flags.
    - the number of regions to find should be called `n_regions`
      - when wanting a max-p-type solution (the largest number of feasible regions), `n_clusters=np.inf` should be used.
      - when wanting an optimal number of clusters given a fit metric (that is, when the number of clusters should be learned from the data), `n_clusters=None` should be used.
    - the connectivity matrix should be given as a `connectivity` argument, and should focus on scipy sparse matrices. We can build the `W` behind the scenes, but this lowers the barrier to folks outside of PySAL (e.g. networkx/osmnx).
  - `fit(X, y=None)`:
    - this should ignore `y`. This is a convention in `sklearn`, but I'm open to just taking `X`.
    - this might clean up data and then pass it along to a function designed to regionalize on clean data. See dbscan for examples.
    - this should return `self`.
  - `fit_transform(X)`
    - this should be implemented as `return self.fit(X).labels_`
- Possible "new" concepts for regionalization? Open for definition (do not implement on classes)
  - `update(new_X)` (not needed)
    - modify existing partition to accommodate new data.
    - return labels for new data
  - `assign(new_X)`: implement in a utility, because this does not pertain to the clustering algorithm directly.
    - assign observations to regions using the region hull
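The labelling rules under "Vocabulary & ground rules" (integer labels starting at zero, `-1` for unassigned observations, flat numpy arrays as return types) could be sketched roughly as follows. The helper name `normalize_labels` and the use of `None` to mark an unassigned observation are assumptions made for illustration; they are not part of the proposal.

```python
import numpy as np


def normalize_labels(raw_labels):
    """Map arbitrary region identifiers onto ints 0..k-1, using -1 for unassigned.

    Hypothetical helper: `None` marks an observation that belongs to no region.
    """
    raw = list(raw_labels)
    # Number regions in order of first appearance so the result is deterministic.
    mapping = {}
    for value in raw:
        if value is not None and value not in mapping:
            mapping[value] = len(mapping)
    # Return a flat integer array, as the ground rules ask.
    return np.array([-1 if v is None else mapping[v] for v in raw], dtype=int)


labels = normalize_labels(["north", "south", None, "north", "east"])
print(labels)  # [ 0  1 -1  0  2]
```

Numbering regions by first appearance is only one possible choice; the proposal itself does not say how region numbers should be ordered.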
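To make the `__init__` / `fit` / `fit_transform` pattern concrete, here is a minimal sketch of an estimator that follows it. The class name `ComponentRegionalizer` and its rule (every connected component of the `connectivity` graph becomes a region, isolated observations get `-1`) are assumptions chosen only to keep the example short; it is not a real regionalization algorithm, and `n_clusters` is stored but unused here.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.csgraph import connected_components
from sklearn.base import BaseEstimator, ClusterMixin


class ComponentRegionalizer(BaseEstimator, ClusterMixin):
    """Toy regionalizer (hypothetical): one region per connected component."""

    def __init__(self, n_clusters=None, connectivity=None):
        # Only store configuration here; no validation and no computation.
        self.n_clusters = n_clusters  # unused by this toy rule
        self.connectivity = connectivity

    def fit(self, X, y=None):
        # y is ignored, following the sklearn convention.
        if self.connectivity is None:
            raise ValueError("a connectivity matrix is required")
        X = np.asarray(X)
        graph = sparse.csr_matrix(self.connectivity)
        if graph.shape[0] != X.shape[0]:
            raise ValueError("connectivity and X disagree on the number of observations")
        _, labels = connected_components(graph, directed=False)
        labels = labels.astype(int)
        # Observations without any neighbour are disconnected: label them -1.
        labels[graph.getnnz(axis=1) == 0] = -1
        # Renumber the remaining regions so they start at zero without gaps.
        assigned = labels >= 0
        _, labels[assigned] = np.unique(labels[assigned], return_inverse=True)
        # Store the estimate with a trailing underscore; X itself is not cached.
        self.labels_ = labels
        return self

    def fit_transform(self, X):
        return self.fit(X).labels_


# Observations 0-1-2 form a chain, 3-4 form a pair, and 5 has no neighbours.
rows = [0, 1, 1, 2, 3, 4]
cols = [1, 0, 2, 1, 4, 3]
W = sparse.csr_matrix((np.ones(6), (rows, cols)), shape=(6, 6))
X = np.zeros((6, 2))

model = ComponentRegionalizer(connectivity=W)
labels = model.fit_transform(X)
print(labels)
```

Because `connectivity` is a plain scipy sparse matrix, an adjacency matrix exported from another graph library could be passed in the same way.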
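The `assign(new_X)` idea, kept as a standalone utility rather than a method, might look like the sketch below. It assumes that observations are 2-D points and that the "hull" of a region is the convex hull of its points; the proposal only says the hull is a polygon enclosing the region, so both are simplifications. The function signature is hypothetical, and the point-in-hull test uses `scipy.spatial.Delaunay.find_simplex`, which returns `-1` for points outside the triangulated hull.

```python
import numpy as np
from scipy.spatial import Delaunay


def assign(new_X, X, labels):
    """Label each row of new_X with the first region whose convex hull contains it.

    Hypothetical utility: rows that fall inside no hull are labelled -1.
    """
    new_X = np.asarray(new_X, dtype=float)
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    out = np.full(len(new_X), -1, dtype=int)
    for region in np.unique(labels[labels >= 0]):
        members = X[labels == region]
        if len(members) < 3:
            continue  # fewer than three points cannot span a 2-D hull
        # Triangulate the region; this fails if all its points are collinear.
        triangulation = Delaunay(members)
        inside = triangulation.find_simplex(new_X) >= 0
        # Keep the first match so earlier regions win when hulls overlap.
        out[inside & (out == -1)] = region
    return out


# Two well-separated quadrilateral regions.
X = np.array([[0, 0], [2, 0], [2, 2], [0, 1],
              [5, 0], [7, 0], [7, 2], [5, 1]], dtype=float)
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])

new_X = np.array([[1.5, 0.5], [6.5, 0.5], [3.5, 0.5]])
new_labels = assign(new_X, X, labels)
print(new_labels)  # [ 0  1 -1]
```

Keeping this outside the estimator means it only needs the fitted labels and the original coordinates, which fits the rule that the data should not be cached on the object.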