In one of our research projects, we are trying to compare some alternative algorithms for generating recommendations based on content similarity. As you might expect, we have some data we’re playing with, but the data is noisy and sometimes it’s hard to make sense of the variability: is it due to noise in the data, or is the algorithm trying to tell us something?
So my thought was to break the problem into two parts: first deal with our algorithms on known data, and then apply the results to the new, noisy data to see what’s there. My purpose in writing this post is to solicit suggestions about which publicly-available data we should be using.
Here are the characteristics we’d like the data to have:
- Each item should belong to one or more categories.
- Each item should be characterized by at least one attribute.
- There should be a large number of possible attributes, but each item may have only a few of them.
- If possible, attributes should have numerical scores or ratings.
- There should be lots of items.
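To make the wish-list above concrete, here is a minimal Python sketch of the kind of structure we have in mind. The schema, item names, and attribute names are all hypothetical, and the cosine similarity is just one illustrative content-similarity measure, not necessarily one of the algorithms we’re comparing:

```python
import math

# Hypothetical items: each belongs to one or more categories and has a
# sparse set of scored attributes drawn from a large vocabulary.
items = {
    "item_a": {
        "categories": ["fiction", "mystery"],
        "attributes": {"suspense": 4.5, "plot_twist": 3.0},
    },
    "item_b": {
        "categories": ["fiction"],
        "attributes": {"suspense": 4.0, "romance": 2.0},
    },
}

def cosine_similarity(attrs1, attrs2):
    """Cosine similarity over sparse attribute-score vectors."""
    shared = set(attrs1) & set(attrs2)
    dot = sum(attrs1[k] * attrs2[k] for k in shared)
    norm1 = math.sqrt(sum(v * v for v in attrs1.values()))
    norm2 = math.sqrt(sum(v * v for v in attrs2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

sim = cosine_similarity(items["item_a"]["attributes"],
                        items["item_b"]["attributes"])
```

A dataset with these properties would let us compute similarities like this on items whose categories we already know, so we can tell whether an algorithm’s output tracks the known structure before we trust it on noisy data.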
Looking forward to hearing your suggestions. In the end, we’ll probably run the analysis on several different datasets to check the robustness of our findings.