Description
I have a basic question about the use of LightFM; apologies if this isn't the right forum.
I'm building a recommender system that will recommend documents to users. There are no interactions yet and all we know about the users are the set of keywords they're interested in.
I've built a prototype where I transform each document using TF-IDF. I then transform the user's keywords with the same transformer and use cosine similarity to find the most relevant documents. It works reasonably well.
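For reference, the baseline looks roughly like this (a minimal sketch; the documents and keywords are placeholders):

```python
# Minimal sketch of the TF-IDF baseline; documents and keywords are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = ["first document text", "second document text"]  # the corpus
user_keywords = "keyword1 keyword2"  # the user's keywords as one string

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(documents)     # shape (n_docs, n_terms)
user_vector = vectorizer.transform([user_keywords])  # shape (1, n_terms)

# Rank documents by cosine similarity to the user's keyword vector.
scores = cosine_similarity(user_vector, doc_matrix).ravel()
ranking = scores.argsort()[::-1]
```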
I'm now porting this to LightFM so that we can include interactions, but first I need the system to perform as well as the TF-IDF solution, and I'm struggling to make that work. Here's the current approach (a rough code sketch follows the list):
- build a Dataset object on all items in the corpus, using TF-IDF to build the item features
When a request for recommendations for a new user comes in:
- get that user's keywords and form a pseudo-document: a single string containing all the keywords
- get the TF-IDF features for that pseudo-document, using the same vectorizer that was used to build the corpus features
- retrain the LightFM model with a single interaction between the user and the pseudo-document, and item_features formed by concatenating the corpus's item features and the pseudo-document's features
- call the predict function to get the recommendations
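In code, the flow is roughly the following (a simplified sketch, skipping the Dataset helper and building the sparse matrices directly; it reuses `vectorizer` and `doc_matrix` from the snippet above, and the keyword is a placeholder):

```python
# Simplified sketch of the LightFM flow; reuses vectorizer/doc_matrix from above.
import numpy as np
import scipy.sparse as sp
from lightfm import LightFM

n_items = doc_matrix.shape[0]  # number of corpus documents

# Pseudo-document features via the same vectorizer (keyword is a placeholder).
pseudo_vector = vectorizer.transform(["keyword1"])

# Item features: corpus documents plus the pseudo-document as one extra item.
item_features = sp.vstack([doc_matrix, pseudo_vector]).tocsr()

# A single interaction: user 0 interacted with the pseudo-document
# (item index n_items, the last row of item_features).
interactions = sp.coo_matrix(([1.0], ([0], [n_items])), shape=(1, n_items + 1))

model = LightFM(loss="warp")
model.fit(interactions, item_features=item_features, epochs=30)

# Score the real corpus documents for user 0.
scores = model.predict(0, np.arange(n_items), item_features=item_features)
ranking = np.argsort(-scores)
```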
In my unit tests I have 52 documents, which get transformed to TF-IDF vectors with about 3300 columns. The user's pseudo-document is transformed to a vector with a single 1.0 entry, corresponding to that keyword.
So I would expect the prediction to score highly those documents whose TF-IDF entry for that keyword is also high. Instead, the scores are all roughly the same, around -0.5.
Am I doing something wrong here?