4 How does Netflix recommend
movies?
We have just seen three beautiful equations in the last three chapters, each used
at least a billion times every single day:
pi[t+1] = (γi / SIRi[t]) pi[t]
Ui(b) = vi − pi(b)
π∗T = π∗T G.
We now continue with our first block of four chapters that present four funda-
mental algorithms: Distributed power control, second price auction, pagerank,
and now, collaborative filtering. These four chapters also introduce the basic
languages of optimization, game, graph, and learning theories.
4.1 A Short Answer
Netflix started its DVD rental business in 1997: instead of you going to rental
stores, DVDs come to you in the mail. Instead of incurring a late fee for each day
you hold a DVD beyond the return date, you can keep holding it as long as you
continue to pay the monthly subscription fee, but you cannot receive a new one
without returning the old one. This is similar in spirit to the sliding window
mechanism of congestion control in Chapter 14, or the tit-for-tat incentive
mechanism in P2P in Chapter 15. Netflix created an attractive web interface and
maintained a remarkable inventory control and mail delivery system. It operated
with great scalability (per-customer cost goes down as the number of customers
goes up) and stickiness (users are reluctant to change the service). By 2008, there
were about 9 million users in North America.
As Netflix’s business grew, it moved on to the next mode of delivering
entertainment. This time it is streaming movies and TV programs from video servers,
through the Internet and wireless networks, to your Internet-connected devices:
game consoles, set-top boxes, smart phones, and tablets. With its branding,
choice of content, and aggressive pricing, Netflix’s subscriber base nearly tripled
to 23 million by April 2011. Netflix video streaming generated so much Internet
traffic that more than one in every four bits going through the Internet that
month was Netflix traffic.
In September 2011, Netflix announced that it would separate the DVD rental
and online streaming businesses. We will later look at the cloud-based video
distribution technology, including Netflix, Amazon Prime, Hulu, HBO Go, etc.,
in Chapter 17, and how that is changing the future of both entertainment and
networking.
In this chapter, we instead focus on the dimension of social networking in
Netflix: how does it recommend movies for you to watch (either by mail or by
streaming)? It is like trying to read your mind and predict your movie rating.
An effective recommendation system is important to Netflix: it enhances user
experience, increases loyalty and volume, and helps with inventory control.
Recommendation systems are a helpful feature for many applications beyond
video distribution. Similar to the search engines in Chapter 3, recommendation
systems give rise to structure in the sea of raw data and reduce the impact of
information explosion.
You must have also seen how Amazon recommends products to you based
on your purchase and viewing history, adjusting its recommendation each time
you browse a product. This recommendation runs content-based filtering, in
contrast to collaborative filtering used by Netflix. There is a related question:
when can you trust the averaged rating of a product on Amazon? It is a different
variant of the recommendation problem, and will be taken up in Chapter 5.
You must have been swayed by recommendations on YouTube that followed
each of the videos you watched. We will look at YouTube recommendation in
Chapter 7.
You may have used Pandora’s online music selection, where the recommendations
are developed by experts. But you get to thumb-up or thumb-down each
recommendation: an explicit binary feedback.
Netflix wants a recommendation system that does not depend on any expert,
but instead uses the rich history of all user behavior to profile each user’s
taste in movies. This system has the following inputs, outputs, and criterion
of success:
• Among the inputs to this system is the history of star ratings across all the
users and all the movies. Each data point consists of four numbers: (a) user ID,
indexed by u, (b) movie title, indexed by i, (c) number of stars, 1-5, of
the rating, denoted as rui, (d) date of the rating, denoted as tui. This is
a really large data set: think of millions of users and tens of thousands
of movies. But only a fraction of users actually watched any given movie, and
only a fraction of that fraction actually bothered to rate the movie. Still,
the size of this input is on the order of billions for Netflix.
• The output is, first of all, a set of predictions rˆui, one for each movie i that
user u has not watched yet. These can be real numbers, not just integers.
We can interpret a predicted rating of, say, 4.2 as saying that the user
will rate this movie with 4 stars with 80% probability and 5 stars with
20% probability (so the expected rating is 0.8 × 4 + 0.2 × 5 = 4.2). The
actual output is a short, ranked list of movies recommended to each user u,
presumably those movies receiving rˆui ≥ 4 or the 5 movies with the highest
predicted rˆui.
• The real test of this mind-reading system is whether user u actually likes the
movies recommended. This information, however, is hard to collect. So a
proxy used by Netflix is Root Mean Squared Error (RMSE), measured
for those (u, i) pairs where we have both the prediction and the actual
rating. Let us say there are C such pairs. Each rating prediction’s error
is squared, (rui − rˆui)², and the squared errors are then averaged over all C
predictions. Since squaring inflates the scale, a square root is taken of this
average:
RMSE = √( Σ(u,i) (rui − rˆui)² / C ).
Netflix could have used the absolute value of the error instead of the squared
error, but for our purposes we will stick to RMSE as the metric that quantifies
the accuracy of a recommendation system. The smaller the RMSE, the better
the recommendation system.
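As a quick sanity check of the metric, here is a minimal RMSE computation in Python; the ratings are made-up numbers, not Netflix data:

```python
import math

def rmse(actuals, predictions):
    """Root mean squared error over the C (user, movie) pairs for which
    we have both an actual rating and a prediction."""
    c = len(actuals)
    total = sum((r - r_hat) ** 2 for r, r_hat in zip(actuals, predictions))
    return math.sqrt(total / c)

# Two predictions: one off by exactly 1 star, one exactly right.
print(rmse([4.0, 3.0], [5.0, 3.0]))  # sqrt((1 + 0) / 2) ≈ 0.707
```

A perfect predictor gives RMSE 0; each error contributes quadratically, so a few badly wrong predictions hurt the score more than many slightly wrong ones.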
Can recommendation accuracy be improved by 10% over what Netflix was
using? That was the question Netflix threw at the research community in
October 2006, with a prize of $1 million and through an open, online, international
competition called the Netflix Prize.
The competition’s mechanism is interesting in its own right. Netflix made
available a set of over 100 million ratings, as part of its record from 1999 to
2005. That amount of data can just fit in the memory of standard desktops in
2006, making it easy for anyone in the world to participate in the competition.
These rating data come from more than 480,000 users and 17,770 movies. On
average, each movie is rated by more than 5000 users and each user rates more
than 200 movies. But those average numbers disguise the real difficulty here:
many users only rate a few movies, and very few users rate a huge number of
movies (one user rated over 17,000 movies). Whatever recommendation system
we use, it must be able to predict well for all these users.
The exact distribution of the data is shown in Figure 4.1:
• A little less than 100 million ratings are made public as the training set.
• About 1.4 million additional ratings are also made public; they have similar
statistical properties to the quiz set and the test set described next. This is
called the probe set, which competitors in the Netflix Prize can use to test their
algorithms.
• About 1.4 million additional ratings are hidden from the competitors, called
the quiz set. Each competing team can submit an algorithm that runs on the
quiz set, but not more than once a day. The RMSE scores continuously
updated on the leaderboard on the Netflix Prize website are based on this set’s
data.
Figure 4.1 Netflix Prize’s four data sets. The training set and probe set are publicly
released, whereas the quiz set and test set are hidden from the public. The probe, quiz,
and test sets have similar statistical properties, but the probe set can be used by each
competing team as often as they want, the quiz set at most once a day, and the
final decision is based on comparison of RMSE on the test set.
• Another 1.4 million ratings, also hidden from the competitors, called the test
set. This is the real test. RMSE scores on this set determine the winner.
Each competing team first comes up with a model for its recommendation
system, then decides its model parameters’ values by minimizing the
RMSE between known ratings in the training set and its model’s predictions,
and finally uses this model with tuned parameters to predict unknown ratings in
the quiz set. Of course Netflix knows the actual ratings in the quiz set, and can
evaluate the RMSE between those ratings and the predictions from each team.
This is a smart arrangement. No team can reverse engineer the actual test set,
since only scores on the quiz set are shown. It is also helpful to have a probe
set where the competing teams can run their own tests as many times as they
want.
Netflix had its own algorithm called Cinematch that gave an RMSE of 0.9514
on the quiz set if its parameters were tuned by the training set. Improving RMSE
by even 0.01 can sometimes make a real difference in the top 10 recommendation
for a user. If the recommendation accuracy is improved by 10% over Cinematch,
it would push RMSE to 0.8563 on the quiz set, and 0.8572 on the test set.
The Netflix Prize ignited the most intense and high-profile surge of activity
in the research communities of machine learning, data mining, and information
retrieval in recent years. To some researchers, the quality and sheer size of the
available data were as attractive as the hype and prize. Over 5000 teams worldwide
entered more than 44,000 submissions. Both Netflix and these research fields
benefited from the three-year quest towards the goal of 10%. It turns out that setting the
Figure 4.2 Netflix Prize’s timeline and some of the highlight events. It lasted for
almost three years and the final decision came down to a twenty-minute differential.
target at a 10% improvement was a really good decision. For the given training
set and quiz set, getting an 8% improvement was reasonably easy, but getting 11%
would have been extremely difficult.
Here are a few highlights in the history of the Netflix Prize, also shown in the
timeline in Figure 4.2: (a) Within a week of the start of the competition, Cinematch
was beaten. (b) By early September 2007, team BellKor made an 8.26% improvement
over Cinematch, but first place changed hands a couple of times, until
(c) in the last hour before the first year of competition ended, the same team
got an 8.43% improvement and won the $50,000 annual progress prize for leading
the pack during the first year of the competition.
Then teams started merging. BellKor and BigChaos, two of the leading teams,
merged, and (d) received the 2008 progress prize for pushing RMSE down to
0.8616. They further merged with Pragmatic Theory, and (e) the new team,
BellKor’s Pragmatic Chaos, in June 2009 became the first team to achieve
more than a 10% improvement, beating Cinematch by 10.06% on the quiz set.
Then the competition entered the “last call” period: all teams had 30 days to
make their final submissions. (f) At the end of this period, two teams beat
Cinematch by more than 10% on the quiz set: BellKor’s Pragmatic Chaos had
an RMSE of 0.8554, and The Ensemble had an RMSE of 0.8553, slightly better.
The final winner was to be declared by comparing their RMSEs on the test set.
Here is the grand finale: both teams beat Cinematch by more than 10% on
the test set, and actually got the same RMSE on that set: 0.8567. But BellKor’s
Pragmatic Chaos submitted their algorithm 20 minutes earlier, and thus became
the winner of the grand prize. A world-class science race lasting almost three years
concluded with a 20-minute differential.
You must be wondering what algorithm BellKor’s Pragmatic Chaos used
Figure 4.3 Recommendation system’s problem: predicting missing ratings from given
ratings in a large yet sparse table. In this toy example of 6 users and 8 movies, there
are 18 known ratings as a training set, and 4 unknown ratings to be predicted. Real
problems are much larger, with billions of entries in the table, and sparse: only about
1% filled with known ratings.
in the final winning solution? The answer is documented in detail in a set of
three reports, one from each component of this composite team, linked from the
Netflix Prize website. But what you’ll find is that the winning solution is really
a cocktail of many methods combined, with hundreds of ingredient algorithms
blended together, and thousands of model parameters fine-tuned specifically to
the training set provided by Netflix. That was what it took to get that last 1% of
improvement. But if you are only interested in the main approaches, big ideas,
and getting 8-9% improvement over Cinematch, there are actually just a few key
methodologies. And those are what we will focus on in the rest of this chapter.
To start with, take a look at the table in Figure 4.3. We can also think of the
table as a matrix R, or a weighted bipartite graph where the user nodes are on
the left column and the movie nodes on the right, with a link connecting user
node u and movie node i if there was such a rating, and the value of the rating is
the weight of the link. In Chapter 8 we will discuss other matrices that describe
the structure of different graphs.
Each column in this table is a movie (or an item in general), each row is a
user, and each cell’s number is the star rating by that user for that movie. Most
cells are empty, since most users rate only a few of the movies. You are given a large
yet sparse table like this, and asked to predict some missing entries, like those
four indicated by question marks in the last two rows of this table. This is a
visualization of our problem at hand.
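Since the vast majority of the cells are empty, such a table is stored sparsely in practice, keeping only the known (user, movie, stars) entries. A minimal sketch in Python, with hypothetical ratings of the same shape as the toy example (6 users, 8 movies, 18 known entries) rather than the figure’s exact values:

```python
# Hypothetical (user, movie) -> stars entries; same shape as the toy
# example in Figure 4.3 (6 users, 8 movies, 18 known ratings), but the
# specific values and positions here are made up for illustration.
ratings = {
    (1, 1): 5, (1, 3): 2, (1, 7): 4,
    (2, 2): 4, (2, 4): 3, (2, 5): 1, (2, 8): 3,
    (3, 1): 5, (3, 3): 4, (3, 6): 5, (3, 8): 4,
    (4, 2): 1, (4, 5): 1, (4, 7): 2,
    (5, 4): 3, (5, 6): 3,
    (6, 1): 2, (6, 5): 4,
}
n_users, n_movies = 6, 8
density = len(ratings) / (n_users * n_movies)
print(len(ratings), "known ratings, density", density)
```

At Netflix scale, with billions of cells and only about 1% of them filled, a dense array would mostly store blanks; sparse storage keeps memory proportional to the number of known ratings.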
There are two main camps of techniques for any recommendation system:
content-based filtering and collaborative filtering.
Content-based filtering looks at each row in isolation and attaches labels to
the columns: if you like a comedy starring Rowan Atkinson, you will probably like
another comedy starring Rowan Atkinson. This straightforward solution
is often inadequate for Netflix.
In contrast, collaborative filtering exploits all the data in the entire table,
across all the rows and all the columns, trying to discover structures in the
pattern and values of the entries in this table.
Drawing an imperfect analogy with the search engines in Chapter 3: content-based
filtering is like the relevance score of individual webpages, and collaborative
filtering is like the importance score determined by the connections among the
webpages.
Within the camp of collaborative filtering, there are in turn two main ap-
proaches.
• The intuitively simpler one is called the neighborhood model. Here, two users
are “neighbors” if they share similar tastes in movies. If Alice and Bob both
like “Schindler’s List” and “Life is Beautiful”, but not so much “E.T.”
and “Lion King”, then knowing that Alice likes “Dr. Zhivago” a lot would make
us think that Bob likes “Dr. Zhivago” a lot, too. In the neighborhood method,
we first compute a similarity score between each pair of users. The larger the
score, the closer the two users are. Then, for a given user whose opinion of
some unwatched movie we would like to predict, we select some number,
say 50, of the most similar users who have rated that movie, take a weighted
sum of these ratings, and call that our prediction.
• The second approach is called the latent factor model. It assumes that underneath
the billions of ratings out there, there are only a few hundred key
factors on which users and movies interact. Statistical similarities among
users or among movies are actually due to some hidden, low-dimensional
structure in the data. This is a big assumption: there are far fewer types
of people and movies than there are people and movies, but it sounds about
right. It turns out that one way to represent a low-dimensional model is to
factorize the table into two sets of short vectors.
Determining baseline predictors for the neighborhood model, and finding just
the right short vectors in the latent factor model, boil down to solving least
squares problems, also called linear regression.
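To make the neighborhood method concrete, here is a minimal sketch in Python. It uses plain cosine similarity over commonly rated movies and a similarity-weighted average; the users and ratings are illustrative, and real systems refine this with baseline predictors and mean-centered (Pearson-style) similarities:

```python
import math

def similarity(ru, rv):
    """Cosine similarity between two users' rating dicts, computed over
    the movies both of them have rated (0.0 if they share none)."""
    common = set(ru) & set(rv)
    if not common:
        return 0.0
    dot = sum(ru[i] * rv[i] for i in common)
    norm = (math.sqrt(sum(ru[i] ** 2 for i in common))
            * math.sqrt(sum(rv[i] ** 2 for i in common)))
    return dot / norm

def predict(target, others, movie, k=50):
    """Similarity-weighted average of the k most similar users' ratings
    of `movie`; None if no neighbor has rated it."""
    neighbors = sorted(((similarity(target, r), r[movie])
                        for r in others if movie in r), reverse=True)[:k]
    total = sum(abs(s) for s, _ in neighbors)
    if total == 0:
        return None
    return sum(s * r for s, r in neighbors) / total

# Illustrative ratings keyed by movie title (not real data).
alice = {"Schindler's List": 5, "Life is Beautiful": 5, "E.T.": 2, "Dr. Zhivago": 5}
carol = {"E.T.": 5, "Lion King": 5, "Dr. Zhivago": 2}
bob   = {"Schindler's List": 5, "Life is Beautiful": 4, "E.T.": 2}
print(predict(bob, [alice, carol], "Dr. Zhivago", k=2))
```

Note that Carol comes out with a perfect similarity of 1.0 to Bob because they share only one rated movie; practical systems shrink similarity scores by the number of co-rated movies to discount such coincidences.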
Most of the mileage in the leading solutions to the Netflix Prize is obtained
by combining variants of these two approaches, supplemented by a whole bag of
tricks. Two of these supplementary ideas are particularly interesting.
One is implicit feedback. A user does not have to rate a movie to tell us
something about her mind. Which movies she browsed, which she watched, and
which she bothered to rate at all are all helpful hints. For example, it is useful to
leverage the information in a binary table where each entry simply indicates whether
this user rated that movie or not.
Another idea played an important role in pushing the improvement to 9%
Figure 4.4 The main flow of two steps: first training and then prediction. The training
module optimizes over model parameters, and the prediction module uses the
optimized model parameters to make predictions.
in the Netflix Prize: incorporating temporal dynamics. The model parameters
become time-dependent. This allows the model to capture changes in a person’s
taste and in the trends of the movie market, as well as the mood of the day when
a user rates movies, at the expense of dramatically increasing the number of model
parameters to optimize over. One interesting observation is that when a user
rates many movies on the same day, she tends to give similar ratings to all of
these movies. By discovering and discounting these temporal features, the truly
long-term structures in the training set are better quantified.
In the next section, we will present baseline predictor training and the neigh-
borhood method, leaving latent factor models to Advanced Material.
4.2 A Longer Answer
Before diving into any specific predictors, let us take a look at the generic work-
flow consisting of two phases, as shown in Figure 4.4:
• Training: we put a model with its parameters to work on observation data,
and then compare the predictions with the ground truth that we know for the
training data. We then tune the model parameters so as to minimize some
error metric, like RMSE, relative to that ground truth.
• Prediction: now we use the optimized model parameters to work on observation
data beyond those in the training set. These predictions are then used in
practice (or, in the case of the Netflix Prize, compared against a quiz set of
ground truth that Netflix knows).
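As a minimal instance of this two-phase flow (with made-up numbers), consider a one-parameter model rˆui = m for all (u, i). Training chooses m to minimize the squared error on the training set, and for this model the least-squares fit is simply the training mean; prediction then applies the tuned m to held-out ratings:

```python
train = [5, 3, 4, 4, 1, 5]                 # observed ratings (training set)
m = sum(train) / len(train)                # training: least-squares fit of m
held_out = [4, 2]                          # stand-in for quiz-set ground truth
predictions = [m] * len(held_out)          # prediction: apply the tuned parameter
rmse = (sum((r - p) ** 2 for r, p in zip(held_out, predictions))
        / len(held_out)) ** 0.5
print(f"m = {m:.3f}, RMSE = {rmse:.3f}")   # prints: m = 3.667, RMSE = 1.202
```

Richer models replace the single parameter m with many parameters (per-user and per-movie biases, short vectors), but the flow stays the same: fit on the training set, evaluate on held-out data.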
4.2.1 Baseline predictor: least squares
Collaborative filtering extracts and leverages structures from the data set like
those in the table in Figure 4.3. But here is an easy method that does not even
require any understanding of the st