Description
In the current implementation, the only pre-processing step that converts raw datapoints to tokens is Float2Scalar. The same is true for post-processing, where Scalar2Float converts scalar outputs back into float predictions.
Float2Scalar multiplies each float by a power of 10 (100 by default) and truncates the remaining decimals. The problem with this approach is that important information could be encoded in the removed decimals. We seek a discretization method that preserves as much of the information in the data as possible.
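To make the information loss concrete, here is a minimal sketch of the scaling-and-truncation behaviour described above. It is not the library's code: the function names `float2scalar` and `scalar2float` and the `decimals` parameter are hypothetical stand-ins, with `decimals=2` corresponding to the default multiplier of 100.

```python
import math


def float2scalar(values, decimals=2):
    """Scale each float by 10**decimals and truncate what is left (hypothetical helper)."""
    scale = 10 ** decimals
    return [math.trunc(v * scale) for v in values]


def scalar2float(tokens, decimals=2):
    """Map integer tokens back to floats by undoing the scaling (hypothetical helper)."""
    scale = 10 ** decimals
    return [t / scale for t in tokens]


# Two distinct inputs that differ only past the second decimal
# end up with the same token, so the difference cannot be recovered.
tokens = float2scalar([0.3201, 0.3249])
recovered = scalar2float(tokens)
```

Both inputs collapse to the same token here, which is the kind of loss the proposal aims to reduce.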
We propose adding two new primitives, Float2Cluster and Cluster2Float. Float2Cluster runs K-Means binning on the first N data points with K clusters. The cluster means are then sorted in increasing order and each one is assigned its position in that order as its cluster index. Finally, every point in the full data is mapped to the index of its nearest cluster mean. For example, if a datapoint with value 0.32 is closest to cluster 40, whose mean is 0.31, the point is discretized to 40. Cluster2Float takes each output token and maps it back to the corresponding cluster mean.
Note: in the multivariate case, this is done for each dimension independently.
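The proposed pair could look roughly like the sketch below for a single dimension. It is an illustration under stated assumptions, not a proposed implementation: the function names, the `n_fit` and `k` parameters, the quantile-based initialisation, and the hand-written 1-D Lloyd's iteration are all choices made here so the example stays dependency-free. A real primitive would more likely delegate to an existing K-Means implementation.

```python
from bisect import bisect_left


def nearest(centers, x):
    """Index of the center closest to x; centers must be sorted ascending."""
    i = bisect_left(centers, x)
    if i == 0:
        return 0
    if i == len(centers):
        return len(centers) - 1
    return i if centers[i] - x < x - centers[i - 1] else i - 1


def fit_centers(sample, k, iters=50):
    """1-D Lloyd's k-means; returns the cluster means sorted ascending."""
    data = sorted(sample)
    n = len(data)
    # Assumption: deterministic start at evenly spaced quantiles of the sample.
    centers = [data[min(n - 1, (2 * i + 1) * n // (2 * k))] for i in range(k)]
    for _ in range(iters):
        sums, counts = [0.0] * k, [0] * k
        for x in data:
            j = nearest(centers, x)
            sums[j] += x
            counts[j] += 1
        # Empty clusters keep their previous center.
        updated = sorted(
            sums[j] / counts[j] if counts[j] else centers[j] for j in range(k)
        )
        if updated == centers:
            break
        centers = updated
    return centers


def float2cluster(values, n_fit, k):
    """Fit on the first n_fit points, then tokenize every point by nearest mean."""
    centers = fit_centers(values[:n_fit], k)
    return [nearest(centers, v) for v in values], centers


def cluster2float(tokens, centers):
    """Map each token back to its cluster mean."""
    return [centers[t] for t in tokens]


values = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 10.0, 10.1, 10.2, 4.9]
tokens, centers = float2cluster(values, n_fit=9, k=3)
restored = cluster2float(tokens, centers)
```

Because the means are kept sorted, the token order matches the value order, so a larger token always stands for a larger mean. For multivariate data, the same functions would be applied to each dimension separately, with one set of centers per dimension.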