Recommendations needed


In one of our research projects, we are trying to compare some alternative algorithms for generating recommendations based on content similarity. As you might expect, we have some data we’re playing with, but the data is noisy and sometimes it’s hard to make sense of the variability: is it due to noise in the data, or is the algorithm trying to tell us something?

So my thought was to break the problem into two parts: first evaluate our algorithms on well-understood data, and then apply what we learn to the new, noisy data to see what’s there. My purpose in writing this post is to solicit suggestions about which publicly available data we should be using.

Here are the characteristics we’d like the data to have:

  • Each item should belong to one or more categories.
  • Each item should be characterized by at least one attribute.
  • There should be a large number of attributes, but each item may have only a few of them assigned.
  • If possible, attributes should have numerical scores or ratings.
  • There should be lots of items.
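
To make these criteria concrete, here is a minimal sketch of the kind of data and comparison I have in mind. Everything in it is an assumption for illustration: the item names, categories, attribute names, and scores are made up, and cosine similarity stands in for whichever similarity measure we end up testing.

```python
from math import sqrt

# Hypothetical toy data: each item has one or more categories and a sparse
# attribute -> score mapping (most attributes are absent for any given item).
items = {
    "item_a": {"categories": {"drama"}, "attrs": {"slow_paced": 4.0, "dark": 3.0}},
    "item_b": {"categories": {"drama", "crime"}, "attrs": {"dark": 5.0, "violent": 2.0}},
    "item_c": {"categories": {"comedy"}, "attrs": {"witty": 4.5}},
}


def cosine(a, b):
    """Cosine similarity between two sparse attribute vectors stored as dicts."""
    dot = sum(score * b[attr] for attr, score in a.items() if attr in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(item_id, k=2):
    """Rank the other items by attribute similarity to item_id, best first."""
    target = items[item_id]["attrs"]
    scored = [
        (other, cosine(target, data["attrs"]))
        for other, data in items.items()
        if other != item_id
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


if __name__ == "__main__":
    for other, score in most_similar("item_a"):
        shared = items["item_a"]["categories"] & items[other]["categories"]
        print(other, round(score, 3), "shared categories:", sorted(shared))
```

The categories are not used to compute similarity here; they are only printed next to each result, as one possible way to sanity-check whether similar items tend to share a category.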

Looking forward to hearing about your suggestions. In the end, we’ll probably try to do the analysis on several different datasets to check the robustness of our findings.


  1. How about the New York Times Annotated Corpus? It seems to satisfy 4 out of 5 of your criteria; the only thing it’s missing is numerical scores for the attributes.

    I’m not sure what IMDB gives you, but that would seem to satisfy all 5.

  2. You might want to try this brand new dataset:

    The Million Song Dataset is a freely-available collection of audio features and metadata for a million contemporary popular music tracks.

  3. Xavier Amatriain says:

    I am not sure I understand your needs. Is all you care about a set of annotated content data? Or do you also need the link from that content to user preferences? In other words, if you take a dataset like, say, IMDB, you will have a great deal of content with many categories, descriptions, and so on. But you are missing the user evaluation for that content, so I am not sure how you will be able to evaluate your approach, unless your only success measure is similarity, with no user feedback. Even in that case, what would your ground truth be in terms of movie or song similarity?

    If I had to evaluate a content-driven approach I would take the Netflix dataset and add content info from IMDB. This is pretty easy to do and you then have both sides: user ratings and content annotation.

  4. Thanks everyone for your suggestions!

    @Daniel Is the Times corpus available publicly? Has anyone used it to generate recommendations based on article similarity, and, if so, have these results been published?

    @Oscar I’ll take a look at the Million Song Dataset. It would be nice to have some benchmark recommendations to compare against.

    @Xavier Mashing up Netflix and IMDB makes sense. Is that something that others have already done? I would like to compare our algorithms to established results to understand whether we’re doing anything interesting.

  5. Xavier Amatriain says:

    @Gene Yes, using IMDB to extend the Netflix dataset was fairly common practice. I did it myself for some experiments by simply using the titles in Netflix to run queries through the imdb-py library. This works pretty well in most cases, although you need to deal with near-matches, different spellings, etc. if you want to get to 100%. If you don’t want to use the Python library approach, there is an offline version of the IMDB database. Actually, given the popularity of this approach during the Netflix Prize, I wouldn’t be surprised if you could find some public code to do this matching if you look for it.

  6. Thanks Xavier! Sounds like this is worth pursuing.

  7. Hi,
    just wanted to mention the Yahoo Ratings Dataset (datasets R-1, R-2 and R-3) as the counterpart of the Netflix dataset for music.
    There is a good overlap between them and the Million Song Dataset, so the pair might be a viable option.
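
The title matching that Xavier describes for joining Netflix and IMDB (exact matches first, then near-matches and alternate spellings) could be sketched roughly as follows. This is only an illustration under stated assumptions: it uses Python's standard `difflib` rather than the imdb-py library, the function names `normalize` and `match_titles` and the 0.85 cutoff are hypothetical choices, and the sample titles are illustrative rather than drawn from either dataset.

```python
import difflib
import re
import unicodedata


def normalize(title):
    """Strip accents, lowercase, drop punctuation, and collapse whitespace."""
    ascii_title = (
        unicodedata.normalize("NFKD", title)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    cleaned = re.sub(r"[^a-z0-9 ]", " ", ascii_title.lower())
    return " ".join(cleaned.split())


def match_titles(source_titles, catalog_titles, cutoff=0.85):
    """Map each source title to its closest catalog title, or None if no match.

    Exact matches on the normalized form are tried first; otherwise the
    closest normalized catalog title above `cutoff` is used.
    """
    by_norm = {normalize(t): t for t in catalog_titles}
    candidates = list(by_norm)
    matches = {}
    for title in source_titles:
        key = normalize(title)
        if key in by_norm:
            matches[title] = by_norm[key]
            continue
        close = difflib.get_close_matches(key, candidates, n=1, cutoff=cutoff)
        matches[title] = by_norm[close[0]] if close else None
    return matches


if __name__ == "__main__":
    ratings_titles = ["The Godfather", "Amelie", "Spirited Awy", "Unknown Film XYZ"]
    content_titles = ["The Godfather", "Amélie", "Spirited Away", "Casablanca"]
    for src, dst in match_titles(ratings_titles, content_titles).items():
        print(f"{src!r} -> {dst!r}")
```

Titles that come back as `None` are the ones that would need manual review or a looser cutoff. A real join would probably also compare release years, since different films can share a title.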
