How to build a recommendation engine using Spark MLlib, Spark Streaming, Kafka and MongoDB, with Scala.

This is the first post about how to create a recommendation engine: the preliminary know-how you need to build a good one.

What is a recommendation engine? Well, users of Spotify, Amazon and Netflix know that the recommendations shown to us are really accurate. It feels as if somebody knows us very well, but don't worry, nobody is following our steps. Instead, machine learning algorithms run regularly (every hour, say) on their clusters, often on software such as Spark MLlib.

What makes a good recommendation engine?

  1. The results are relevant but not obvious.
  2. There is a sense of surprise.

The Math Behind Relevance

• Finding ‘Similar’ Objects: Cosine Similarity

For two vectors A = (x1, y1) and B = (x2, y2):

$$\text{Similarity}(A, B) = \cos\theta = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \frac{x_1 x_2 + y_1 y_2}{\sqrt{x_1^2 + y_1^2}\,\sqrt{x_2^2 + y_2^2}}$$

The value of cos θ varies between:
• -1 [θ = 180°, absolutely dissimilar: vectors pointing in opposite directions]
• 0 [θ = 90°, dissimilar: perpendicular vectors]
• +1 [θ = 0°, absolutely similar: vectors pointing in the same direction]

I know, maybe it is not too clear…
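
A small numeric example may make it clearer. The sketch below is my own illustration in plain Scala (the object and method names are made up for this post, not part of any library). It computes the cosine for the three extreme cases: same direction, perpendicular and opposite direction.

```scala
object CosineSimilarity {
  /** Cosine of the angle between two vectors of the same length. */
  def cosine(a: Seq[Double], b: Seq[Double]): Double = {
    require(a.length == b.length, "vectors must have the same dimension")
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val normA = math.sqrt(a.map(x => x * x).sum)
    val normB = math.sqrt(b.map(x => x * x).sum)
    require(normA > 0 && normB > 0, "cosine is undefined for a zero vector")
    dot / (normA * normB)
  }

  def main(args: Array[String]): Unit = {
    println(cosine(Seq(1.0, 0.0), Seq(2.0, 0.0)))  // same direction: 1.0
    println(cosine(Seq(1.0, 0.0), Seq(0.0, 3.0)))  // perpendicular: 0.0
    println(cosine(Seq(1.0, 0.0), Seq(-4.0, 0.0))) // opposite direction: -1.0
  }
}
```

Note that the length of the vectors does not matter, only the direction: (1, 0) and (2, 0) are "absolutely similar".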

Where do we use recommendations?
Applications can be found in a wide variety of industries:
• Travel
• Financial services
• Music/Online radio
• TV and video
• Online publications
• Retail
...and countless others

Our case: movies!

How do we do it? Using Hadoop, Spark and MLlib, and maybe Mahout…

The best solutions include the results of at least two algorithms:

Content Based Filtering:
How similar is this particular movie to other movies, based on other users' ratings?
This algorithm builds an item-to-item matrix and calculates similarities (based on user ratings; this flavor is often called item-based filtering).
The most similar items are then output to a list.
• Item ID, Similar Item ID, Similarity Score
• Items with the highest score are the most similar
• In this example, users who liked “Twelve Monkeys” (7) also liked “Fargo” (100)

Item ID   Similar Item ID   Similarity Score
7         100               0.690951001800917
7         50                0.653299445638532
7         117               0.643701303640083
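
To see where rows like these could come from, here is a minimal sketch in plain Scala. It is only my illustration of the idea, not what Mahout or Spark do internally. It assumes cosine similarity as the measure, treats a missing rating as 0, and uses a tiny made-up set of ratings. The names `Rating`, `similarItems` and `sample` are hypothetical.

```scala
object ItemSimilarity {
  final case class Rating(userId: Int, itemId: Int, rating: Double)

  /** Cosine similarity between two items, each stored as user -> rating. */
  def cosine(a: Map[Int, Double], b: Map[Int, Double]): Double = {
    // Only users who rated both items contribute to the dot product.
    val dot = a.keySet.intersect(b.keySet).toSeq.map(u => a(u) * b(u)).sum
    val normA = math.sqrt(a.values.map(r => r * r).sum)
    val normB = math.sqrt(b.values.map(r => r * r).sum)
    if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
  }

  /** Rows of (item, similar item, score); for each item, most similar first. */
  def similarItems(ratings: Seq[Rating]): Seq[(Int, Int, Double)] = {
    val byItem: Map[Int, Map[Int, Double]] =
      ratings.groupBy(_.itemId).map { case (item, rs) =>
        item -> rs.map(r => r.userId -> r.rating).toMap
      }
    val rows = for {
      (i, vi) <- byItem.toSeq
      (j, vj) <- byItem.toSeq
      if i != j
    } yield (i, j, cosine(vi, vj))
    rows.sortWith((x, y) => x._1 < y._1 || (x._1 == y._1 && x._3 > y._3))
  }

  // Made-up ratings: users 1 and 2 rate items 7 and 100 alike, item 50 differently.
  val sample: Seq[Rating] = Seq(
    Rating(1, 7, 5.0), Rating(1, 100, 5.0),
    Rating(2, 7, 4.0), Rating(2, 100, 4.0), Rating(2, 50, 1.0),
    Rating(3, 50, 5.0)
  )

  def main(args: Array[String]): Unit =
    similarItems(sample).filter(_._1 == 7).foreach(println)
}
```

With this sample, item 100 comes out as the most similar to item 7, and item 50 far behind. Comparing every pair of items like this is quadratic, so it only works for toy data. That cost is why a distributed implementation is needed for a real catalog.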

At the moment, content based filtering is not available in Spark MLlib. In our project, I am going to use Apache Mahout.

Collaborative Filtering:

Predict an individual's preferences based on their peers' ratings.
Spark MLlib implements a collaborative filtering algorithm named Alternating Least Squares (ALS).

Collaborative Filtering: “People with similar taste to you liked these movies”

• Collaborative filtering applies weights based on “peer” user preference.
• Essentially it determines the best movie critics for you to follow
• The items with the highest recommendation score are then output as tuples
• User ID [Item ID1:Score, …, Item IDn:Score]
• Items with the highest recommendation score are the most relevant to this user
• For user “Johny Sisklebert” (572), the two most highly recommended movies are “Seven” and “Donnie Brasco”

User ID [Item ID1:Score, Item ID2:Score, Item ID3:Score, Item ID4:Score, Item ID5:Score, Item ID6:Score, Item ID7:Score, Item ID8:Score, Item ID9:Score, Item ID10:Score]
572 [11:5.0, 293:4.70718, 8:4.688335, 273:4.687676, 427:4.685926, 234:4.683155, 168:4.669672, 89:4.66959, 4:4.65515]
573 [487:4.54397, 1203:4.5291, 616:4.51644, 605:4.49344, 709:4.3406, 502:4.33706, 152:4.32263, 503:4.20515, 432:4.26455, 611:4.22019]
574 [1:5.0, 902:5.0, 546:5.0, 13:5.0, 534:5.0, 533:5.0, 531:5.0, 1082:5.0, 1631:5.0, 515:5.0]

Imagine two people: one likes P, Q, R and S, and the other likes Q, R, S and T.
A good recommendation for the second person would be P, and a good one for the first would be T.
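
The P, Q, R, S example fits in a few lines of plain Scala. This is only the intuition behind collaborative filtering, not how ALS actually computes its scores. The helper names `overlap` and `recommend` are made up for this post.

```scala
object PeerExample {
  /** Share of items two users have in common (Jaccard index), between 0 and 1. */
  def overlap(a: Set[String], b: Set[String]): Double =
    if (a.isEmpty && b.isEmpty) 0.0
    else a.intersect(b).size.toDouble / a.union(b).size

  /** Items the peer likes that the user does not have yet. */
  def recommend(user: Set[String], peer: Set[String]): Set[String] =
    peer.diff(user)

  def main(args: Array[String]): Unit = {
    val first = Set("P", "Q", "R", "S")
    val second = Set("Q", "R", "S", "T")
    println(overlap(first, second))   // 3 shared items out of 5 distinct ones
    println(recommend(second, first)) // what the first person can offer the second
    println(recommend(first, second)) // what the second person can offer the first
  }
}
```

The two people share three of their five distinct movies, so they are reasonable "peers", and each one gets the single movie the other has and they lack.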

Both algorithms need only 3 fields. In my particular case of movie recommendations, these are User ID, Item ID and Rating.

Recommendation Store

• Serving recommendations needs to be instantaneous
• The core of this solution is two reference tables:

Rec_Item_Similarity
Item_ID
Similar_Item
Similarity_Score

Rec_User_Item_Base
User_ID
Item_ID
Recommendation_Score

• When called to make recommendations, we query our store:
• Rec_Item_Similarity, based on the Item_ID they are viewing
• Rec_User_Item_Base, based on their User_ID
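
As a rough sketch of those two lookups, the plain Scala below uses in-memory maps as a stand-in for the real store (a database in the final project). The row classes mirror the two reference tables. The names `Store`, `similarTo` and `forUser` are my own assumptions, and the sample rows are taken from the examples earlier in this post.

```scala
object RecommendationStore {
  // One row of Rec_Item_Similarity.
  final case class SimilarityRow(itemId: Int, similarItem: Int, similarityScore: Double)
  // One row of Rec_User_Item_Base.
  final case class UserScoreRow(userId: Int, itemId: Int, recommendationScore: Double)

  final class Store(similarities: Seq[SimilarityRow], userScores: Seq[UserScoreRow]) {
    // Index once, so that serving a request is a single hash lookup.
    private val byItem: Map[Int, Seq[SimilarityRow]] = similarities.groupBy(_.itemId)
    private val byUser: Map[Int, Seq[UserScoreRow]] = userScores.groupBy(_.userId)

    /** Query Rec_Item_Similarity by the item being viewed. */
    def similarTo(itemId: Int): Seq[SimilarityRow] = byItem.getOrElse(itemId, Seq.empty)

    /** Query Rec_User_Item_Base by the user. */
    def forUser(userId: Int): Seq[UserScoreRow] = byUser.getOrElse(userId, Seq.empty)
  }

  val sample: Store = new Store(
    Seq(SimilarityRow(7, 100, 0.690951001800917), SimilarityRow(7, 50, 0.653299445638532)),
    Seq(UserScoreRow(572, 11, 5.0), UserScoreRow(572, 293, 4.70718))
  )

  def main(args: Array[String]): Unit = {
    sample.similarTo(7).foreach(println)
    sample.forUser(572).foreach(println)
  }
}
```

The point of precomputing both tables is that nothing expensive happens at request time: the batch jobs fill the store, and serving is just two key lookups.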

Delivering Recommendations

So if Johny is viewing “12 Monkeys”, we query our recommendation store and present the results.

Item-based scores + “peers like these movies” scores = best recommendations

Item (similarity)            Raw Score   Score
Fargo                        0.691       1.000
Star Wars                    0.653       0.946
Rock, The                    0.644       0.932
Pulp Fiction                 0.628       0.909
Return of the Jedi           0.627       0.908
Independence Day             0.618       0.894
Willy Wonka                  0.603       0.872
Mission: Impossible          0.597       0.864
Silence of the Lambs, The    0.596       0.863
Star Trek: First Contact     0.594       0.859
Raiders of the Lost Ark      0.584       0.845
Terminator, The              0.574       0.831
Blade Runner                 0.571       0.826
Usual Suspects, The          0.569       0.823
Seven (Se7en)                0.569       0.823

+

Item (peer-based)            Raw Score   Score
Seven                        5.000       1.000
Donnie Brasco                4.707       0.941
Babe                         4.688       0.938
Heat                         4.688       0.938
To Kill a Mockingbird        4.686       0.937
Jaws                         4.683       0.937
Monty Python, Holy Grail     4.670       0.934
Blade Runner                 4.670       0.934
Get Shorty                   4.655       0.931

=

Top 10 Recommendations
Seven (Se7en) 1.823
Blade Runner 1.760
Fargo 1.000
Star Wars 0.946
Donnie Brasco 0.941
Babe 0.938
Heat 0.938
To Kill a Mockingbird 0.937
Jaws 0.937
Monty Python, Holy Grail 0.934
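
The arithmetic behind these tables appears to be: divide each raw score by the best raw score in its own list (0.653 / 0.691 ≈ 0.946), then add the two normalized scores for movies that appear in both lists (Seven: 0.823 + 1.000 = 1.823). Here is a minimal sketch of that reading in plain Scala. The function names are made up, it assumes positive raw scores, and it assumes the same movie carries the same key in both lists (in practice you would join on the item ID rather than the title).

```scala
object CombineScores {
  /** Divide every raw score by the largest one, so the best item scores 1.0. */
  def normalize(raw: Map[String, Double]): Map[String, Double] =
    if (raw.isEmpty) raw
    else {
      val top = raw.values.foldLeft(Double.NegativeInfinity)((a, b) => math.max(a, b))
      raw.map { case (item, score) => item -> score / top }
    }

  /** Normalize both lists, add the scores per item and keep the best n. */
  def combine(itemBased: Map[String, Double],
              peerBased: Map[String, Double],
              n: Int): Seq[(String, Double)] = {
    val a = normalize(itemBased)
    val b = normalize(peerBased)
    (a.keySet ++ b.keySet).toSeq
      .map(item => item -> (a.getOrElse(item, 0.0) + b.getOrElse(item, 0.0)))
      .sortWith(_._2 > _._2)
      .take(n)
  }

  def main(args: Array[String]): Unit = {
    // A few raw scores copied from the two tables above.
    val itemBased = Map("Fargo" -> 0.691, "Star Wars" -> 0.653,
                        "Blade Runner" -> 0.571, "Seven" -> 0.569)
    val peerBased = Map("Seven" -> 5.000, "Donnie Brasco" -> 4.707,
                        "Blade Runner" -> 4.670)
    combine(itemBased, peerBased, 3).foreach(println)
  }
}
```

With this subset the top three are Seven, Blade Runner and Fargo, the same order as in the list above.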

From Good to Great Recommendations

• Note that the first 5 recommendations look pretty good, but the 6th result would have been “Babe”, the children's movie
• Tuning the algorithms might help: parameter changes, similarity measures.
• How else can we make it better?
1. Delivery filters
2. Introduce additional algorithms such as K-Means

Additional Algorithm – K-Means

“These movies are similar based on their attributes”

• Treats items as coordinates
• Places a number of random “centroids” and assigns the nearest items
• Moves the centroids around based on average location
• Process repeats until the assignments stop changing
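
The four steps above can be sketched in plain Scala as follows. This is a toy version for illustration, not the MLlib implementation. To keep the result reproducible it takes the starting centroids as a parameter instead of placing them randomly, and it stops when the centroids stop moving, which is the same moment the assignments stop changing. All names are made up.

```scala
import scala.annotation.tailrec

object KMeansSketch {
  type Point = Vector[Double]

  def squaredDistance(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  /** Index of the centroid closest to p. */
  def nearest(centroids: Vector[Point], p: Point): Int =
    centroids.indices.reduceLeft { (best, i) =>
      if (squaredDistance(centroids(i), p) < squaredDistance(centroids(best), p)) i else best
    }

  /** Average location of a non-empty group of points. */
  def mean(points: Vector[Point]): Point =
    points.transpose.map(column => column.sum / column.size)

  /** Returns, for each point, the index of the cluster it ends up in. */
  @tailrec
  def cluster(points: Vector[Point],
              centroids: Vector[Point],
              maxIterations: Int = 100): Vector[Int] = {
    // Assign every item to its nearest centroid.
    val assignment = points.map(p => nearest(centroids, p))
    // Move each centroid to the average location of its items.
    val moved = centroids.indices.toVector.map { i =>
      val members = points.zip(assignment).collect { case (p, c) if c == i => p }
      if (members.isEmpty) centroids(i) else mean(members)
    }
    // Repeat until nothing moves (or we run out of iterations).
    if (moved == centroids || maxIterations <= 1) assignment
    else cluster(points, moved, maxIterations - 1)
  }

  def main(args: Array[String]): Unit = {
    val points = Vector(Vector(0.0, 0.0), Vector(0.0, 1.0),
                        Vector(10.0, 10.0), Vector(10.0, 11.0))
    println(cluster(points, Vector(points(0), points(2))))
  }
}
```

In the example the first two points end up in cluster 0 and the last two in cluster 1. In a real run the result depends on where the centroids start, which is why implementations usually try several random starts.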

We would use the major attributes of the movie to create the coordinate points:
• Categories
• Actors
• Director
• Synopsis Text

Delivery Scoring and Filters

Apply assumptions to control the results of collaborative filtering:

• One or more categories must match
• Only children's movies will be recommended for children's movies
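
These two rules are easy to express as a filter over the candidate list. The sketch below is plain Scala with made-up names (`Movie`, `allowed`, `filter`), and the category sets in it are illustrative values I typed in for the example, so treat them as assumptions rather than reference data.

```scala
object DeliveryFilter {
  final case class Movie(title: String, categories: Set[String])

  /** True when the candidate may be shown next to the movie being viewed. */
  def allowed(viewed: Movie, candidate: Movie): Boolean = {
    // Rule 1: one or more categories must match.
    val sharesCategory = viewed.categories.intersect(candidate.categories).nonEmpty
    // Rule 2: for a children's movie, only children's movies are recommended.
    val childrenRuleOk =
      !viewed.categories.contains("Children's") || candidate.categories.contains("Children's")
    sharesCategory && childrenRuleOk
  }

  def filter(viewed: Movie, candidates: Seq[Movie]): Seq[Movie] =
    candidates.filter(candidate => allowed(viewed, candidate))

  // Illustrative category sets (assumed for this example).
  val twelveMonkeys = Movie("Twelve Monkeys", Set("Drama", "Sci-Fi"))
  val babe = Movie("Babe", Set("Children's", "Comedy", "Drama"))
  val bladeRunner = Movie("Blade Runner", Set("Film-Noir", "Sci-Fi"))
  val willyWonka = Movie("Willy Wonka", Set("Adventure", "Children's", "Comedy"))
  val montyPython = Movie("Monty Python", Set("Comedy"))
  val mockingbird = Movie("To Kill a Mockingbird", Set("Drama"))

  def main(args: Array[String]): Unit = {
    println(filter(twelveMonkeys, Seq(bladeRunner, montyPython, babe)).map(_.title))
    println(filter(babe, Seq(willyWonka, mockingbird)).map(_.title))
  }
}
```

Notice that with these category sets “Babe” still passes the filter for “Twelve Monkeys”, because both are tagged as Drama. The rules as stated are a starting point, and stricter ones may be needed.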

Action Adventure Children's Comedy Crime Drama Film-Noir Horror Romance Sci-Fi Thriller 
 Twelve Monkeys 0 0 0 0 0 1 0 0 0 1 0 0
 Babe 0 0 1 1 0 1 0 0 0 0 0 0
 Seven (Se7en) 0 0 0 0 1 1 0 0 0 0 1 1
 Star Wars 1 1 0 0 0 0 0 0 1 1 0 0
 Blade Runner 0 0 0 0 0 0 1 0 0 1 0 0 
 Fargo 0 0 0 0 1 1 0 0 0 0 1 1 
 Willy Wonka 0 1 1 1 0 0 0 0 0 0 0 0 
 Monty Python 0 0 0 1 0 0 0 0 0 0 0 0 
 Jaws 1 0 0 0 0 0 0 1 0 0 0 1
 Heat 1 0 0 0 1 0 0 0 0 0 1 1 
 Donnie Brasco 0 0 0 0 1 1 0 0 0 0 0 1 
 To Kill a Mockingbird 0 0 0 0 0 1 0 0 0 0 0 0

This matrix means that Babe is ranked as Children's and Comedy, Jaws is ranked as Action and Thriller, and so on.
Similar logic could be applied to promote more favorable options:

• New Releases
• Retail Case: Items that are on-sale, overstock

Something recommended by more than one algorithm is better suited and probably the most highly rated!

Summary
• Hadoop and Spark can provide a relatively low cost and extremely scalable platform for recommendations
• Spark, with MLlib, offers a great library of established machine learning algorithms, reducing development effort
• A good recommendation system combines collaborative and content filtering algorithms with custom business rules
• While Spark matures, Mahout or roll-your-own algorithms may still be needed.

Interesting links

https://chimpler.wordpress.com/2014/07/22/building-a-food-recommendation-engine-with-spark-mllib-and-play/

http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html

https://databricks-training.s3.amazonaws.com/movie-recommendation-with-mllib.html

Disclaimer

Thank you, Joe Caserta. I am going to code this recommendation engine following your guide.
