One of the most interesting parts of machine learning is predicting what a user wants based on their logs.
Netflix is a good example: users rate movies from zero to five stars, and the system recommends movies you may like by analyzing your personal taste.
Content-based recommendation system
Let’s look at the table below of users’ ratings for the movies they watched.
What if $User_0$ watches the movie notebook? Can you guess the rating?
You can probably guess it from the other users’ ratings and $User_0$’s taste.
\[\begin{array}{r|rrrrr} \text{Movies} & User_0 & User_1 & User_2 & User_3 & User_4 \\ \hline \text{the greatest showman} & 5 & 4 & 4 & 2 & 1 \\ \text{once upon a time} & 5 & 4 & ? & 2 & 4 \\ \text{notebook} & ? & 4 & 2 & 2 & 2 \\ \text{taken} & 4 & 3 & 2 & 3 & 5 \\ \end{array}\]For each user $j$, learn a parameter vector $\theta^{(j)} \in \mathbb{R}^3$. Predict user $j$'s rating of movie $i$ as $(\theta^{(j)})^T x^{(i)}$ stars.
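As a quick illustration, here is a minimal sketch of that prediction in Python. The feature vector for notebook and the parameters for $User_0$ are made-up numbers (two hypothetical categories, "romance" and "action", plus an intercept feature $x_0 = 1$); they only show how $(\theta^{(j)})^T x^{(i)}$ is computed.

```python
import numpy as np

# Hypothetical feature vector for the movie "notebook": [x_0, romance, action].
x_notebook = np.array([1.0, 0.9, 0.1])
# Hypothetical learned taste of User_0: loves romance, indifferent to action.
theta_user0 = np.array([0.0, 5.0, 0.0])

# Predicted rating = (theta^(j))^T x^(i)
predicted_rating = theta_user0 @ x_notebook
print(predicted_rating)  # 4.5 stars
```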
\[\begin{align} & r(i, j) = 1 \ \text{if user $j$ has rated movie $i$ ($0$ otherwise)}\\ & y^{(i, j)} = \text{rating by user $j$ on movie $i$ (if defined)} \\ & \theta^{(j)} = \text{parameter vector for user $j$}\\ & x^{(i)} = \text{feature vector for movie $i$}\\ & (\theta^{(j)})^T x^{(i)} =\text{predicted rating of movie $i$ by user $j$}\\ & m^{(j)} = \text{number of movies rated by user $j$} \end{align}\]
Optimization objective
\[\begin{align} & \text{To learn $\theta^{(j)}$, the parameters for user $j$}: \\ & \min_{\theta^{(j)}} \frac{1}{2} \sum_{i:r(i,j)=1} \left( (\theta^{(j)})^T x^{(i)} - y^{(i, j)} \right)^2 + \frac{\lambda}{2} \sum_{k=1}^n \left( \theta^{(j)}_k \right)^2 \\ \\ & \text{To learn $\theta^{(1)}, \theta^{(2)}, \dots, \theta^{(n_u)}$}: \\ & \min_{\theta^{(1)}, \theta^{(2)}, \dots, \theta^{(n_u)}} \frac{1}{2} \sum_{j=1}^{n_u} \sum_{i:r(i,j)=1} \left( (\theta^{(j)})^T x^{(i)} - y^{(i, j)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^n \left( \theta^{(j)}_k \right)^2 \\ \end{align}\]
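A minimal sketch of this per-user objective, assuming the movie feature vectors are stored as rows of a matrix `X` (with the intercept feature $x_0 = 1$ in the first column), `y` holds one user's ratings, and `rated` is the $r(i, j)$ indicator for that user. The names are my own, not from any library.

```python
import numpy as np

def user_cost(theta, X, y, rated, lambda_):
    """Regularized cost for a single user's parameter vector theta^(j)."""
    err = X @ theta - y                            # (theta^(j))^T x^(i) - y^(i,j)
    squared_err = 0.5 * np.sum(err[rated] ** 2)    # only movies the user rated
    reg = 0.5 * lambda_ * np.sum(theta[1:] ** 2)   # k = 1..n; theta_0 not regularized
    return squared_err + reg
```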
Optimization algorithm
\[\begin{align} & \text{Cost function}: \\ & J({\theta^{(1)}, \theta^{(2)}, \dots , \theta^{(n_u)}}) = \frac{1}{2} \sum_{j=1}^{n_u} \sum_{i:r(i,j)=1} \left( (\theta^{(j)})^T x^{(i)} - y^{(i, j)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^n \left( \theta^{(j)}_k \right)^2 \\ & \text{Gradient descent update:}\\ & \theta_k^{(j)} := \theta_k^{(j)} - \alpha \sum_{i:r(i,j) = 1} \left( (\theta^{(j)})^Tx^{(i)} - y^{(i,j)} \right)x_k^{(i)} \quad (\text{for } k = 0) \\ & \theta_k^{(j)} := \theta_k^{(j)}- \alpha \left( \sum_{i:r(i,j) = 1} \left( (\theta^{(j)})^Tx^{(i)} - y^{(i,j)} \right)x_k^{(i)} +\lambda \theta_k^{(j)} \right) \quad (\text{for } k \ne 0) \\ \end{align}\]
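A sketch of one gradient-descent step for a single user's $\theta^{(j)}$ under the update rules above, with the same assumed `X`, `y`, `rated` layout as before (the intercept $\theta_0$ is left unregularized).

```python
import numpy as np

def gradient_step(theta, X, y, rated, alpha, lambda_):
    """One gradient-descent update of theta^(j) for a single user."""
    err = (X @ theta - y) * rated        # zero out movies the user has not rated
    grad = X.T @ err                     # sum_i (prediction - y^(i,j)) * x_k^(i)
    grad[1:] += lambda_ * theta[1:]      # regularize every component except theta_0
    return theta - alpha * grad
```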
Collaborative filtering
Collaborative filtering means that the movie features and the user weights can be estimated from each other.
We will build a model that learns category features for each movie and, for each user, the weights they place on those categories.
There are two assumptions.
- Each user likes movies depending on their categories.
- Each user has different weights for the movie categories.
The main differences from content-based recommendation are that the category features themselves are learned, and that the two sets of weights are updated simultaneously.
$x^{(1)}, \dots, x^{(n_m)}$ are the category feature vectors of the movies.
$\theta^{(1)}, \dots, \theta^{(n_u)}$ are each user's weights over the movie categories.
\[\begin{align} &\text{Estimate the users' taste weights:}\\ &\text{Given } x^{(1)}, \dots, x^{(n_m)} \text{ and movie ratings}\\ &\text{Can estimate } \theta^{(1)}, \dots, \theta^{(n_u)} \\ \\ &\text{Estimate the movie feature vectors:}\\ &\text{Given } \theta^{(1)}, \dots, \theta^{(n_u)}\\ &\text{Can estimate } x^{(1)}, \dots, x^{(n_m)} \\ \end{align}\]
Optimization function
\[\begin{align} &\text{for a single movie $i$:}\\ &\text{Given } \theta^{(1)}, \dots, \theta^{(n_u)} \text{ to learn } x^{(i)} \\ & \min_{x^{(i)}} \frac{1}{2} \sum_{j:r(i,j)=1} \left( ( \theta^{(j)} )^T x^{(i)} - y^{(i, j)} \right)^2 + \frac{\lambda}{2} \sum_{k=1}^n \left( x_k^{(i)}\right)^2 \\ \\ &\text{for all movies:}\\ &\text{Given } \theta^{(1)}, \dots, \theta^{(n_u)} \text{ to learn } x^{(1)}, \dots, x^{(n_m)} \\ & \min_{x^{(1)}, \dots, x^{(n_m)}} \frac{1}{2} \sum_{i=1}^{n_m} \sum_{j:r(i,j)=1} \left( ( \theta^{(j)} )^T x^{(i)} - y^{(i, j)} \right)^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^n \left( x_k^{(i)}\right)^2 \\ \end{align}\]
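A sketch of the "given the $\theta$'s, learn the $x$'s" objective, assuming `X` ($n_m \times n$) holds the movie feature vectors as rows, `Theta` ($n_u \times n$) the user weights, `Y` the rating matrix, and `R` the $r(i, j)$ indicator matrix (no intercept feature is used here, so all of `X` is regularized).

```python
import numpy as np

def movie_features_cost(X, Theta, Y, R, lambda_):
    """Cost of candidate movie features X, holding the user weights Theta fixed."""
    err = (X @ Theta.T - Y) * R          # only (i, j) pairs with r(i, j) = 1 count
    return 0.5 * np.sum(err ** 2) + 0.5 * lambda_ * np.sum(X ** 2)
```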
Minimize $x^{(1)}, \dots, x^{(n_m)}$ and $\theta^{(1)}, \dots, \theta^{(n_u)}$ simultaneously
\[J(x^{(1)}, \dots, x^{(n_m)}, \theta^{(1)}, \dots, \theta^{(n_u)}) = \\ \frac{1}{2} \sum_{(i,j):r(i,j)=1} \left( (\theta^{(j)})^Tx^{(i)} - y^{(i, j)}\right)^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^n \left( x_k^{(i)}\right)^2 + \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^n \left( \theta_k^{(j)}\right)^2 \\\]
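Using the same assumed matrix layout, here is a sketch of the joint cost $J$ together with its gradients with respect to `X` and `Theta`; these are the quantities gradient descent needs.

```python
import numpy as np

def cofi_cost_and_grads(X, Theta, Y, R, lambda_):
    """Joint collaborative-filtering cost J and its gradients."""
    err = (X @ Theta.T - Y) * R                   # masked prediction errors
    J = 0.5 * np.sum(err ** 2) \
        + 0.5 * lambda_ * (np.sum(X ** 2) + np.sum(Theta ** 2))
    X_grad = err @ Theta + lambda_ * X            # dJ / dx^(i)
    Theta_grad = err.T @ X + lambda_ * Theta      # dJ / dtheta^(j)
    return J, X_grad, Theta_grad
```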
Optimization algorithm
The training procedure is as follows.
- Initialize $x^{(1)}, \dots, x^{(n_m)}, \theta^{(1)}, \dots, \theta^{(n_u)}$ to small random values.
- Minimize $J(x^{(1)}, \dots, x^{(n_m)}, \theta^{(1)}, \dots, \theta^{(n_u)})$ using gradient descent until convergence, as sketched below.
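Putting the two steps together, a minimal end-to-end sketch: random initialization followed by plain batch gradient descent on $J$. The number of categories `n`, the learning rate, the regularization strength, and the iteration count are illustrative choices, not prescribed values; unrated entries of `Y` are assumed to be filled with 0 (they are masked out by `R`).

```python
import numpy as np

def train(Y, R, n=3, alpha=0.005, lambda_=1.0, iters=2000):
    """Learn movie features X and user weights Theta from the rating matrix Y."""
    n_m, n_u = Y.shape
    X = np.random.randn(n_m, n) * 0.01        # small random movie features
    Theta = np.random.randn(n_u, n) * 0.01    # small random user weights
    for _ in range(iters):
        err = (X @ Theta.T - Y) * R           # errors on the rated entries only
        X_grad = err @ Theta + lambda_ * X
        Theta_grad = err.T @ X + lambda_ * Theta
        X -= alpha * X_grad                   # update both parameter sets
        Theta -= alpha * Theta_grad           # simultaneously
    return X, Theta

# Predicted ratings for every (movie, user) pair: X @ Theta.T
```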