Spark Recommender System
The goal of this task is to finalize the implementation of a movie recommendation system. In the process, you will get more experience with programming in Scala and working with RDDs.
Dataset: Modified MovieLens dataset (Mirror)
Project Template: Download the project and try compiling it. Compilation should complete without errors. (Mirror)
Extra files: Download the list of movies used for collecting user preferences
Inspecting and Running The Template
Inspect the project template content.
spark-recommendation
├── build.sbt
├── for-grading.tsv
└── src
├── main
│ └── scala
│ ├── Grader.scala
│ └── MovieLensALS.scala
└── test
└── scala
- build.sbt contains the build configuration and dependencies. Make sure the correct Spark and Scala versions are specified here (read the documentation or run spark-submit --version).
- for-grading.tsv is used for collecting user ratings.
- Grader.scala reads user preferences.
- MovieLensALS.scala contains the main class for training the recommendation system.
Try to compile this project with sbt or create an artifact with your IDE. When using an IDE, make sure all dependencies are listed in the MANIFEST. More information about this is in the Appendix.
Try to run the application with
spark-submit --master yarn spark-recommendation.jar hdfs:///movielens-mod -user false
Data Description
The dataset includes two files: movies2.csv and ratings2.csv. The first contains the list of movie titles in the format
movieId,title,genres
for example
72,Kicking and Screaming (1995),Comedy|Drama
73,"Misérables, Les (1995)",Drama|War
The ratings data are stored in the second file and have the format
userId,movieId,rating,timestamp
for example
1,2253,3.5,1112486122
1,2288,4.0,1094786077
The data is also available in HDFS at /movielens-mod.
General Description
Your goal is to implement a movie recommendation system. The system takes data in the format Rating(userId, movieId, rating) and tries to learn a model of the user based on the movies they have rated. You need to shape the data into this format using RDD transformations. For more information about the Rating class, read here.
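A minimal sketch of such a transformation is shown below; the helper name loadRatings and the variables sc and dataDir are illustrative, not part of the template.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.recommendation.Rating

// Illustrative helper: sc is an existing SparkContext, dataDir points at the dataset root.
def loadRatings(sc: SparkContext, dataDir: String): RDD[Rating] =
  sc.textFile(dataDir + "/ratings2.csv")
    .filter(line => !line.startsWith("userId"))   // skip the header line, if present
    .map { line =>
      // userId,movieId,rating,timestamp  ->  Rating(user, product, rating)
      val fields = line.split(",")
      Rating(fields(0).toInt, fields(1).toInt, fields(2).toDouble)
    }
```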
In Spark, a recommendation system can be built with the ALS (Alternating Least Squares) class, which has simple train() and predict() interfaces. The predictor takes data in the format (userId, movieId) and returns Rating(userId, movieId, rating). To make the results human-readable, we need an additional structure that stores the mapping of movieId to movieTitle.
To evaluate the benefit of this system, we compare the result with a baseline. The baseline for this task is the average rating for a given movie. We need a data structure that stores the mapping of movieId to averageRating.
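One way to build that mapping is sketched below (ratings is assumed to be the full RDD[Rating]):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.recommendation.Rating

// Baseline: movieId -> average rating, collected to the driver as a Map.
def averageRatings(ratings: RDD[Rating]): scala.collection.Map[Int, Double] =
  ratings
    .map(r => (r.product, (r.rating, 1)))
    .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
    .mapValues { case (sum, count) => sum / count }
    .collectAsMap()
```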
Read the code. Understand how the program works.
Complete the code
Implement parseTitle and rmse(test: RDD[Rating], prediction: scala.collection.Map[Int, Double]), then run
spark-submit --master yarn spark-recommendation.jar hdfs:///movielens-mod -user false
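The exact signatures live in the template, but the two methods could look roughly like this. This is only a sketch; it assumes the prediction map is keyed by movieId, as with the baseline averages.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.recommendation.Rating

// parseTitle: extract (movieId, title) from a movies2.csv line, handling quoted titles
// that contain commas, e.g. 73,"Misérables, Les (1995)",Drama|War
def parseTitle(line: String): (Int, String) = {
  val firstComma = line.indexOf(',')
  val id = line.substring(0, firstComma).toInt
  val rest = line.substring(firstComma + 1)
  val title =
    if (rest.startsWith("\"")) rest.substring(1, rest.indexOf('"', 1))  // quoted title
    else rest.substring(0, rest.lastIndexOf(','))                       // plain title before the genres
  (id, title)
}

// rmse: root-mean-square error of per-movie predictions against the test ratings.
def rmse(test: RDD[Rating], prediction: scala.collection.Map[Int, Double]): Double = {
  val squaredErrors = test.map { r =>
    val err = r.rating - prediction.getOrElse(r.product, 0.0)
    err * err
  }
  math.sqrt(squaredErrors.mean())
}
```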
Run with Your Movie Preferences
spark/bin/spark-submit --master yarn spark-recommendation.jar hdfs:///movielens-mod -user true
Executing this command will allow you to specify your own movie preferences and get recommendations after the model has finished training.
:::info The dataset is not very large. You can run the application locally if you have sufficient resources. :::
Post-processing of Recommendations
Currently, the list of recommendations can include movies that the user has already rated. Modify the program so that these movies are filtered out. You can use either set difference or the RDD filter method, as in the sketch below.
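A possible filter-based approach (sketch; userRatings and removeSeen are illustrative names, with userRatings holding the ratings collected from the user):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.recommendation.Rating

// Drop candidate movies the user has already rated.
def removeSeen(recommendations: RDD[Rating], userRatings: Seq[Rating]): RDD[Rating] = {
  val seen = userRatings.map(_.product).toSet
  recommendations.filter(r => !seen.contains(r.product))
}
```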
Load Your Movie Preferences
Your movie preferences were saved into the file user_rating.tsv
. Write a method or class that will load this data instead of surveying the user every time.
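A possible loader is sketched below. It assumes one tab-separated movieId and rating per line and a fixed user id of 0 for the survey user; adjust both to whatever the template actually writes.

```scala
import scala.io.Source
import org.apache.spark.mllib.recommendation.Rating

// Load previously collected preferences instead of prompting the user again.
def loadUserRatings(path: String): Seq[Rating] = {
  val source = Source.fromFile(path)
  try {
    source.getLines()
      .map(_.split("\t"))
      .collect { case Array(movieId, rating) => Rating(0, movieId.toInt, rating.toDouble) }
      .toList
  } finally {
    source.close()
  }
}
```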
Change the Rank of the Model
Try different model ranks. A higher rank leads to a more complex model. Compare the difference between baseline and prediction for a model with a higher rank. Evaluate the quality of the proposed movies subjectively.
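For example (sketch; evaluate is a hypothetical helper that computes RMSE for a trained model on the test split, and the ranks are just sample values):

```scala
import org.apache.spark.mllib.recommendation.ALS

// Train with a few different ranks and compare the error.
for (rank <- Seq(4, 8, 12, 20)) {
  val model = ALS.train(training, rank, 10, 0.1)   // 10 iterations, lambda = 0.1
  println(s"rank=$rank  rmse=${evaluate(model, test)}")
}
```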
Extra Filtering
You might have noticed that the recommendations are not very good. The reason is that the model cannot compute confident representations for infrequent movies. Modify the program so that low-frequency items (movies with fewer than 50-100 ratings) are excluded from the training and recommendation process.
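One way to do this (sketch; the helper name and the 50-rating threshold are only examples):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.recommendation.Rating

// Keep only movies that received at least minRatings ratings.
def dropRareMovies(ratings: RDD[Rating], minRatings: Int = 50): RDD[Rating] = {
  val frequent = ratings
    .map(r => (r.product, 1))
    .reduceByKey(_ + _)
    .filter { case (_, count) => count >= minRatings }

  ratings
    .map(r => (r.product, r))
    .join(frequent)                        // inner join drops infrequent movies
    .map { case (_, (rating, _)) => rating }
}
```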
Reference
- Help with Scala
- Scala Book
- RDD partitioning
- Stack overflow during execution
- History Server
Self-Check Questions
- What does sc.parallelize do?
- What does collectAsMap do?
- What is the difference between foreach and map?
- What is pattern matching in Scala?
Troubleshooting
SBT Compilation Error
There are known issues compiling Scala libraries with Java 9 and above. If you see an error during compilation, try downgrading to Java 8.
Missing Winutils
There is a known issue with Spark on Windows: it requires winutils.exe. The solution can be found here.
Memory Issues
If you experience issues with memory, try to repartition the dataset into a larger number of partitions
val repartitioned = data.repartition(numberOfPartitions)
Creating more partitions reduces the amount of input data per partition, and therefore the memory required by each map task.
Appendix
Including Dependencies
When you configure an artifact with your IDE, you need to specify the dependencies. There is no need to include all the dependencies in the jar itself; simply add them to the manifest.
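If you build with sbt rather than an IDE artifact, the usual way to keep Spark out of the jar is to mark its dependencies as provided. The snippet below is only an illustration; match the versions to the output of spark-submit --version on your cluster.

```scala
// build.sbt (illustrative versions; align them with your cluster)
name := "spark-recommendation"

scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "2.4.0" % "provided",  // supplied by the cluster at runtime
  "org.apache.spark" %% "spark-mllib" % "2.4.0" % "provided"
)
```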