Add a New K-Means Binning Primitive #56

In the current implementation, the only pre-processing step that converts raw datapoints to tokens is `Float2Scalar`. The same is true for post-processing, where `Scalar2Float` converts scalar outputs back into float predictions.

`Float2Scalar` multiplies each float by a power of 10 (default 100) and truncates the remaining decimals. The problem with this approach is that important information could be encoded in the removed decimals. We seek a discretization method that is optimized to preserve the information in the data.
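To make the information loss concrete, here is a minimal sketch of the scale-and-truncate behaviour described above. The function names `float_to_scalar` and `scalar_to_float` and the `decimal_places` parameter are hypothetical stand-ins, not the actual primitive signatures, and truncation toward zero is an assumption.

```python
import math


def float_to_scalar(values, decimal_places=2):
    """Scale by 10**decimal_places and truncate toward zero (assumed behaviour)."""
    scale = 10 ** decimal_places
    return [math.trunc(v * scale) for v in values]


def scalar_to_float(tokens, decimal_places=2):
    """Undo the scaling; the truncated digits cannot be recovered."""
    scale = 10 ** decimal_places
    return [t / scale for t in tokens]


values = [0.3149, 0.3101, 0.3199]
tokens = float_to_scalar(values)      # all three values collapse to token 31
recovered = scalar_to_float(tokens)   # all three come back as 0.31
```

Three distinct inputs map to the same token, so any signal carried by the third and fourth decimals is gone after the round trip.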

We propose adding two new primitives, `Float2Cluster` and `Cluster2Float`. The primitive `Float2Cluster` will run K-Means binning on the first N data points with K clusters. We then sort the cluster means in increasing order and assign each one its index in that order. Lastly, we map each point in the full data to its nearest cluster mean and replace its value with the corresponding cluster index. For example, if a datapoint with value 0.32 is closest to cluster 40 with mean 0.31, the point is discretized to 40. The primitive `Cluster2Float` takes each output token and maps it back to the corresponding cluster mean.

Note: in the multivariate case, this is done for each dimension independently.
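The proposal could be sketched roughly as follows, in pure Python with a small 1-D Lloyd's-algorithm loop. This is an illustration under assumptions, not the intended implementation: the function names (`fit_cluster_means`, `float_to_cluster`, `cluster_to_float`), the quantile-based initialisation, the iteration cap, and the tie-breaking rule are all hypothetical choices.

```python
def nearest_index(means, x):
    """Index of the cluster mean closest to x (ties go to the lower index)."""
    return min(range(len(means)), key=lambda j: abs(means[j] - x))


def fit_cluster_means(train, k, n_iter=100):
    """1-D K-Means (Lloyd's algorithm); returns the means sorted ascending."""
    data = sorted(train)
    # Assumed initialisation: evenly spaced quantiles, which keeps it deterministic.
    means = [data[int((i + 0.5) * len(data) / k)] for i in range(k)]
    for _ in range(n_iter):
        sums, counts = [0.0] * k, [0] * k
        for x in data:
            j = nearest_index(means, x)
            sums[j] += x
            counts[j] += 1
        # An empty cluster keeps its previous mean.
        new_means = [sums[j] / counts[j] if counts[j] else means[j] for j in range(k)]
        if new_means == means:
            break
        means = new_means
    return sorted(means)  # cluster index == rank of the mean


def float_to_cluster(values, means):
    """Discretize each value to the index of its nearest cluster mean."""
    return [nearest_index(means, v) for v in values]


def cluster_to_float(tokens, means):
    """Map each token back to its cluster mean."""
    return [means[t] for t in tokens]


series = [0.10, 0.12, 0.11, 0.30, 0.31, 0.32, 0.90, 0.92, 0.95, 0.05]
n_fit, k = 8, 3                                  # fit on the first N points only
means = fit_cluster_means(series[:n_fit], k)
tokens = float_to_cluster(series, means)         # applied to the full series
decoded = cluster_to_float(tokens, means)

# Multivariate case: fit one set of means per dimension, independently.
rows = [(0.10, 5.0), (0.12, 5.2), (0.90, 9.0), (0.92, 9.1)]
per_dim_means = [fit_cluster_means(col, 2) for col in zip(*rows)]
```

Sorting the means before assigning indices keeps the token order aligned with the value order, so a larger token always decodes to a larger float.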
