Collaborative Filtering: Non-Personalized Item-to-Item Similarity

Let's understand item-to-item collaborative filtering. Suppose we have a purchase matrix:

        Item1  Item2 ... ItemN
 User1  0        1   ...  0
 User2  1        1   ...  0 
  .
  .
  .
 UserM  1        0   ...  0

Then we can compute item similarity from the column vectors, e.g. using cosine similarity. This gives a symmetric item-similarity matrix, as below:

        Item1  Item2 ... ItemN
 Item1  1       1/M  ...  0
 Item2  1/M     1    ...  0 
  .
  .
  .
 ItemN  0       0    ...  1

This can be read as "Customers who viewed/purchased X also viewed/purchased Y, Z, ..." (collaborative filtering), because each item's vector is built from which users purchased it.
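For concreteness, here is a minimal sketch of building that symmetric matrix from the column vectors, in Python with NumPy (the purchase data is made up for illustration):

```python
import numpy as np

# Toy binary purchase matrix: rows = users, columns = items (made-up data).
R = np.array([
    [0, 1, 0],   # User1
    [1, 1, 0],   # User2
    [1, 0, 0],   # User3
])

def cosine(a, b):
    """Cosine similarity between two item column vectors."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return (a @ b) / denom if denom else 0.0

n_items = R.shape[1]
S = np.zeros((n_items, n_items))
for i in range(n_items):
    for j in range(n_items):
        S[i, j] = cosine(R[:, i], R[:, j])

print(S)  # symmetric; the diagonal is 1 for any item with at least one purchase
```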

Amazon's logic is exactly the same as the above; their goal is simply to make it efficient. As they put it:

We could build a product-to-product matrix by iterating through all item pairs and computing a similarity metric for each pair. However, many product pairs have no common customers, and thus the approach is inefficient in terms of processing time and memory usage. The iterative algorithm provides a better approach by calculating the similarity between a single product and all related products.
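Read literally, that iterative approach walks each customer's purchase history, so only item pairs that actually share a customer are ever touched. A rough Python sketch of my reading of it (the data layout is an assumption for illustration, not Amazon's actual implementation):

```python
from collections import defaultdict
from math import sqrt

# Hypothetical purchase data: user -> set of items purchased.
purchases = {
    "u1": {"A", "B"},
    "u2": {"A", "B", "C"},
    "u3": {"B", "C"},
}

co_counts = defaultdict(int)    # (i, j) -> customers who bought both
item_counts = defaultdict(int)  # i -> customers who bought i

# Only pairs that co-occur in some basket are ever counted, which is
# the efficiency win over iterating through all N^2 item pairs.
for items in purchases.values():
    for i in items:
        item_counts[i] += 1
        for j in items:
            if i != j:
                co_counts[(i, j)] += 1

def similarity(i, j):
    """Cosine on binary item vectors: co_count / sqrt(count_i * count_j)."""
    if (i, j) not in co_counts:
        return 0.0
    return co_counts[(i, j)] / sqrt(item_counts[i] * item_counts[j])

print(similarity("A", "B"))  # A and B share two customers -> ~0.82
```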


There's a good O'Reilly book on this topic. While the whitepaper may lay the logic out in pseudo-code like that, I don't think that approach would scale very well. The calculations are all probability calculations, so things like Bayes' Theorem get used to ask, "Given that Person A purchased X, what's the likelihood they also purchased Z?" Straightforward looping over the data works too hard: you have to pass over all of it for every person.
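To make that concrete, the conditional probability can be estimated straight from co-occurrence counts. A minimal sketch, with invented numbers:

```python
# Estimating P(purchased Z | purchased X) from counts.
# The counts below are invented purely for illustration.
count_X = 200        # customers who purchased X
count_X_and_Z = 50   # customers who purchased both X and Z

p_Z_given_X = count_X_and_Z / count_X
print(p_Z_given_X)   # 0.25 -> a quarter of X's buyers also bought Z
```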


@Neil or whoever comes to this question later on:

The choice of similarity metric is up to you, and you might want to leave it malleable for the future. Check out the Wikipedia article on the Frobenius norm for a start, or, as in the link you submitted, cosine similarity cos(I1, I2) or the Jaccard coefficient.
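One way to keep the metric malleable is to represent each item as the set of users who purchased it and pass the metric in as a function. A sketch with hypothetical data:

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity on binary vectors, expressed via sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))

def jaccard(a, b):
    """Jaccard coefficient: |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Items as sets of the users who purchased them (made-up data).
item_X = {"u1", "u2", "u3"}
item_Y = {"u2", "u3", "u4"}

for metric in (cosine, jaccard):
    print(metric.__name__, metric(item_X, item_Y))  # ~0.667 and 0.5
```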

User-item vs. user-user vs. item-item, or whatever combination, cannot be answered objectively. It depends on what kind of data you can get from your users, how the UI draws information out of them, what parts of your data you consider reliable, and your own time constraints (as far as hybrids go).

Since many people have written master's theses on the questions above, you probably want to start with the easiest implementable solution while leaving room for growth in the complexity of the algorithm.