Is the KDD Cup really music recommendation?

Feb 22, 2011

The KDD Cup is an annual Data Mining and Knowledge Discovery competition organized by the ACM Special Interest Group on Knowledge Discovery and Data Mining. This year, the KDD-Cup is called Learn the rhythm, predict the musical scores. Yahoo! Music has contributed 300 million ratings made by over 1 million anonymized users. The ratings are given to songs, albums, artists and genres. The goal for submitters is to (1) accurately predict the ratings that users gave to various items and (2) separate loved songs from other songs.

This is a pretty exciting set of data. It is perhaps the largest set of music rating data ever released. With a data set of this size we should see Netflix Prize-sized advances in the music recommendation field. However, there's one little gotcha. The data is entirely anonymized. Not only has the user data been anonymized, but so have all of the songs, albums, artists and genres. So instead of getting ratings data like 'user 1 rated Bon Jovi with five stars', you get data like 'user 1 rated artist 10 with five stars'. Here's a sample of data for one user:

3|14  # user ID 3 has 14 ratings
5980    90      3811    13:24:00   # item 5980 got a score of 90/100
11059   90      3811    13:24:00   # 3811 is a day offset from an
21931   90      3811    13:24:00   #     undisclosed date
74262   90      3811    13:24:00   #
146781  90      3811    13:24:00   # 13:24 is the time on day 3811
173094  90      3811    13:24:00
175835  90      3811    13:24:00
180037  90      3811    13:24:00
194044  90      3811    13:24:00
267723  90      3811    13:24:00
290303  90      3811    13:24:00
366723  90      3811    13:24:00
432968  90      3811    13:24:00
451800  90      3811    13:24:00
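For anyone who wants to poke at data in this shape, here is a minimal parsing sketch. It assumes the layout shown in the sample above: a `user|count` header line followed by whitespace-separated rating lines of item ID, score, day offset and time. The `#` annotations are my own notes rather than part of the data, so the sketch strips them. The names `Rating` and `parse_ratings` are made up for illustration and are not part of any official KDD Cup tooling.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class Rating:
    user_id: int
    item_id: int
    score: int   # 0-100 scale, as in the sample
    day: int     # day offset from an undisclosed date
    time: str    # time of day on that day, e.g. "13:24:00"


def parse_ratings(lines: Iterable[str]) -> Iterator[Rating]:
    """Yield one Rating per data line, assuming the layout in the sample."""
    user_id = None
    for raw in lines:
        # Drop any trailing "# ..." annotation and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        if "|" in line:
            # Header line: "<user id>|<number of ratings that follow>".
            uid, _count = line.split("|", 1)
            user_id = int(uid)
            continue
        if user_id is None:
            raise ValueError("rating line found before any user header")
        item, score, day, time = line.split()
        yield Rating(user_id, int(item), int(score), int(day), time)
```

Passing an open file handle (or any list of lines) to `parse_ratings` gives back a stream of `Rating` records, which is about all the structure there is to work with: every field except the score and the timestamp is an opaque number.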

Without any way to tie the item IDs to actual music items, this competition seems to be less about music recommendation and more about collaborative filtering (CF) algorithms. As Oscar Celma (who literally wrote the book on music recommendation) put it in the KDD Cup competition forum:

Without artist/song name, the dataset has no interest for me (e.g. it doesn't make any sense not being able to understand what are you predicting). As it is now, this is not really a "music dataset" nor a competition about "music recommendation", but simply a way to apply CF to a huge dataset. In a way, this is good for people doing research on CF. But, not being able to add *any* knowledge about the domain... it doesn't make any sense, IMHO.

Researcher Amelie Anglade adds:

There is so much we could do if we had access to the artist and track names, using Music Information Retrieval techniques: we could analyse the audio (tempo, chords, melody, timbre, etc.), the scores, the lyrics,the artists' connections, and much more. There is a growing community working on these topics, and attempting to do music recommendation without any contextual and/or content information other than the genres (which is a limited approach) is simply ignoring this whole branch of research.

The folks at Yahoo! who have generously put together the dataset do understand how the lack of real, non-anonymized music data makes it difficult for a whole branch of researchers from the Music Information Retrieval community to participate in the competition. However, Noam Koenigstein, one of the organizers of this year's KDD-Cup, says that the aggressive anonymization of the data is required by their legal team due to recent lawsuits around large releases of user rating data (see Netflix lawsuit), and that their hands are tied. Noam does go on to say that:

After working with this dataset for 6 months now, I can defiantly say that there are differences between music CF and other types of CF. One example is the popularity temporal trends in music that are different than in movies (Netflix). So a CF system that considers also temporal effects will be different in music. There are other differences as well, but I can not reveal them right now.

I'm sure Noam is right that there are some interesting differences between the music rating data and other large rating sets, and I'm sure that exploring these differences will improve the state-of-the-art in CF systems. But Oscar and Amelie are right too: so much more could be learned if we knew what items were actually being rated.

There have been two very active research communities involved in music recommendation. The RecSys community takes a traditional recommender systems approach and relies mostly on collaborative filtering techniques to make recommendations. To this community, data mining of user behavior is enough to make good recommendations. The Music Information Retrieval (MIR) community, by contrast, focuses much more on the music itself, relying on content-based (CB) techniques based on the audio (or descriptions of the audio) to find the musical connections on which to base recommendations. Each approach has its own strengths and weaknesses (CF has the cold start problem, popularity feedback loops, susceptibility to hacking, etc., while CB tends to be computationally more challenging and has trouble separating good music from bad). The best systems tend to combine aspects of both approaches into hybrid systems.
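To make the hybrid idea concrete, here is a toy sketch of one way to blend the two signals. It is only an illustration under my own assumptions, not how any production system does it: each item has a vector of user ratings (0 meaning no rating) and a vector of content features, and the two cosine similarities are mixed with a weight `alpha`. The function names and the `alpha` parameter are hypothetical.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def hybrid_similarity(
    ratings_a: Sequence[float],
    ratings_b: Sequence[float],
    content_a: Sequence[float],
    content_b: Sequence[float],
    alpha: float = 0.5,
) -> float:
    """Weighted mix of CF (rating) and CB (content) similarity."""
    cf = cosine(ratings_a, ratings_b)   # behavior-based signal
    cb = cosine(content_a, content_b)   # content-based signal
    return alpha * cf + (1 - alpha) * cb
```

A brand-new track with no ratings gets a CF similarity of zero to everything (the cold start problem), but it can still score well on the content side. That content side is exactly what cannot be computed when the item IDs are anonymized.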

The KDD-Cup data set is a fantastic set of data, and I'm sure this data will help the RecSys community improve the state-of-the-art in CF systems. The MIR community is also creating its own industrial-sized datasets for research, such as the recently released Million Song Dataset, which will be used to improve CB techniques. It is my hope that someday we'll be able to offer a combined dataset that contains both massive rating data and massive content data. If we put all this data in the hands of researchers, there's no telling what they'll find. And perhaps that's the real problem. As Jeremy Reed tweeted: "Biomed researchers can obtain illegal substances for research, but we can't get data because we'll find users with bad taste!"

Music Machinery
