(This is my capstone project at NYC Data Science Academy)
1. Why we need recommendation?
Recommendation is a big topic. Netflix was willing to spend 1 million dollars to summon the best movie recommendation algorithm for their service. Facebook devised hundreds of variables just to carefully gauge the recommended feeds on the first page when you log in your page. Not to mention myriads of other applications in the market.
“Getting information off the internet is like taking a drink from a fire hydrant” (Mitchell Kapor). Information overload is a real phenomenon preventing us from making good decisions or taking actions. This is why recommendation systems are becoming common and extremely useful in products such as Netflix, Amazon Echo, and Facebook News Feed in recent year.
Besides, in the big data era, real-time recommendations may become a new norm because real-time recommendations:
- Allows both customers and businesses to gain real-time insights
- Enable rapid development
- Perform real-time analytics
2 What is “Yelper”?
Based on the dataset provided by the “Yelp Challenge 2016”, “Yelper” is a system that:
- Performs preprocessing by dividing business data by cities to allow fine tuned and customized recommendations
- Uses collaborative filtering based recommendation using Spark MLlib
- Generates user-business graph visualizations using D3 and graph-tool library
- Performs user-business graph analysis using Spark GraphX in Scala
- Handles real-time user request handling simulation using Spark Streaming and Apache Kafka
- Creates a Google Map view to recommend high rated businesses for users
The rationale for building Yelper is that:
Data scientists should not only know how machine learning works, but they also need to be prepared for the incoming new norm: the large scale streaming data mining and analytics in the big data era.
The GitHub page can be found here: https://github.com/sundeepblue/yelper_recommendation_system
Slides: Presentation slides
3. Components and technical stack to build Yelper
There are 5 major components to build Yelper, as shown below. Here is the overall flowchart of what happens.
3.1 Data preprocessing
Yelper was built based on the Yelp Challenge 2016 Dataset, which includes:
- 2.7M reviews and 649K tips by 687K users for 86K businesses
- 566K business attributes, e.g., hours, parking availability, ambiance
- Social network of 687K users for a total of 4.2M social edges
- Aggregated check-ins over time for each of the 86K businesses
- 200,000 pictures from the included businesses
Our system mainly relies on two data sets: the 687K users and the 86K businesses. We performed two preprocessing steps (Python source code):
- Map the raw IDs in ASCII string format for both user data and business data into integer indices
- Split the business data by nine major cities (see sample structure of splitting by cities): us_charlotte, us_lasvegas, us_madison, us_phoenix, us_pittsburgh, us_urbana_champaign, canada_montreal, germany_karlsruhe, uk_edinburgh
In Yelper, we deliberately created all business IDs as integers in range [0, 1M] (M means million), while all user IDs are in range [10M, +∞], because we have much more user IDs than business IDs. In this way, we can use the cutoff value 10M (=10000000) to check if an ID is associated to a user. To sum up, using integer number as ID (such as 3234) instead of a raw string (eg., “G8qH6TbfEhoYmS9KZM2Hfg”) has several advantages:
- Building the graph using Spark GraphX requires that all nodes and edges have an integer ID
- Exported CSV or JSON files of either user data or business data became much smaller, leading to faster loading and manipulation
- It was easy to differentiate the user id and business id just by a threshold.
Splitting business data by cities has the following advantages:
- Making each city have its own recommendation model allows customized recommendation
- Model training becomes faster since city-wise data is smaller than the totality
- City-wise model tuning becomes manageable
- Scalability is possible. Remember there are thousands of cities all over the world
- It does not make sense to build one single recommendation model to handle all world cities since, like human, each city has its own “gene”
3.2 Low-rank matrix factorization
We want to recommend the most highly-rated businesses to users. There are mainly two types of algorithms used for recommendation systems:
- Collaborative filtering based algorithm, such as low-rank matrix factorization, SVD, etc
- Content-based recommendations
Both algorithms have pros and cons. For simplicity and fast prototype, we explored the first one in Yelper. If we have more time, it is not difficult to integrate the second one into our system, which will be discussed in the final section.
Technically, we use Spark MLlib to train the ALS-based collaborative filtering models in Yelper (Python source code). Below are several brief steps:
- Load the city-wise user-business-star tuple file F by Spark (sample tuple for Las Vegas)
- Divide the F into three parts, 60% for training, 20% for testing, and 20% for validation using Spark MLlib
- Setup several rank values (such as 4, 8, 12) as the tuning parameter, perform cross-validation to find the best rank that leads to minimal Residual Mean Squared Error (RMSE)
- Generate the best model using the best rank
- Export the best model to a directory under the corresponding city name (sample trained model for Las Vegas)
For the last step above, persisting the city-wise models to disk (or cloud) is a good choice, because we can compress, transfer, distribute, or cache those models.
3.3 Dynamic network visualization
We are particularly interested in analyzing the user-business interaction and gaining city-level insights. Why do we want to do this? Below are two assumptions:
- Generally speaking, the more businesses a city has, the more prosperous a city will be
- The more customers (residents, tourists, population, etc) a city has, the more businesses a city will need to have to sustain the needs
We build city-wise user-business network graphs in the following steps:
- Extract connected components using Spark GraphX (Source code in Scala)
As an example, the dynamic user-business network graph of the city Madison (US) is shown below. Each node is either a user (in green color) or a business (in blue color). If a user u rated a business b, then there is an edge from u to b, namely, u –> b. For the city of Madison, there are in total more than 10K edges, and unfortunately, the library D3.js cannot handle and render such a large number of nodes and edges. Thus we randomly selected only a small number of edges to be rendered in Chrome.
What can we gain from this dynamic network? Well, just from a single network, we can gain the topological relationship between users and businesses. The density of edges (in red) reflects, to some extent, how the users rate all the businesses in a city. If we generated those dynamic graphs for all cities, we may find that each city’s network is distinct. We can further distill in-depth information just from the networks themselves, such as in/out degree, clustering, page rank analysis, min cut, community discovery, etc.
3.4 Google Map API based frontend
We implemented a web server to allow the user to interact with Yelper to get recommendations. This web server was built using:
- Python paste
Below are several features of the frontend:
- Users can search any keywords they are interested in (not just restaurants) (Python code)
- User scan change the city and how many recommended businesses to return on the map
- RESTful API is supported. User can change city/topk/keywords from the URL
As an example, we show the user interaction of Yelper through the image below. The user with ID “10081786” requested recommendations for the keywords “restaurants”, “book store”, “library”, and “ice cream” in the city of Charlotte. The Yelper returned the “topK” recommended results and showed the businesses locations via Google Map API.
3.5 Simulation of real-time customer requests handling
Now, imagine that there are one thousand or even one million of customers who need to access Yelper to get recommendations in different cities. Before we show how Yelper handles this challenging scenario, let us first delve a little bit deeper into why this scenario is challenging.
- More user requests means Yelper needs the ability of handle a large volume of requests from various sources (phone, iPad, laptop, etc)
- Users’ recommendation requests need to be handled promptly, and asynchronously
- Yelper cannot lose the request from any users. Otherwise, customer satisfaction and loyalty will be compromised
- An army of computing resources in the cloud need to be deployed to handle requests
- Customers have different tastes and needs
Those are just a few factors out of many that drive the need for a strong recommendation infrastructure. A detailed enumeration is definitely out of the scope of this post.
One neat solution is to view all incoming request from users across many cities as a non-stopping stream, and use Apache Kafka as a message broker to redirect all requests to Kafka, then let the Spark Streaming handle anything that is piped from Kafka into Spark streaming in a fault-tolerant and scalable manner. The advantages are:
- The streaming is fault-tolerant to any sudden uncertainties
- Allows different types of messages being piped to piper
- Easy to handle
We thus simulate the real-time handling of a large amount of users’ recommendation requests locally (Python source code), as shown in the figure below. On the left side of the terminal window, we run the Spark job to handle the requests in Spark Streaming at a frequency of 1.5 seconds. Note the text message showing the time stamp, such as “Time: 2016-09-25 14:37:22.500000“. The Yelper logo is shown before each of the recommended results.
On the right side of the terminal, we run a python script to generate users’ recommendation requests at a random time interval. Those requests were piped into Apache Kafka such that Spark Streaming can consume those requests.
(Click image above to see animated simulation)
4. Insights from user-business networks
We build static user-business network graph using a Python library called graph-tool (Python code) (sample network images). Now we show briefly what it looks like for the city-wise user-business network. Again, if a user rated a business, there will be an edge connecting the user node and the business node in the network.
In terms of visualization those edges, there is a dilemma. The more nodes and edges that are included in the graph, the more precise the graph will be, but the visualization will become a mess, since it is nearly impossible for many graph layout algorithms to elegantly visualize clearly the graph details when the edge number is above, say, 0.1 million. For this reason, we adopted two simple strategies: out of all the edges in a city, we either randomly selected 1% (or 2%) of all the edges, or sequentially selected the first 1% (or 2%) of all edges.
It is interesting to mention that, even if we randomly select only 2% of the entire edges, the generated networks still have the power to reveal many insights about a city if we dig deeper. Let us now briefly go over three cities in the US: Charlotte, Las Vegas, and Pittsburgh.
4.1 Charlotte, US
We randomly selected 3312 edges, which is only 1% of the total edges in this city.
4.2 Las Vegas, US
We randomly selected 11547 edges, which is only 1% of the total edges in this city.
What if we randomly select 23095 edges (2%)?
Instead of randomly, what if we sequentially select23095 edges out of all the edge list?
4.3 Pittsburgh, US
What about Pittsburgh? We randomly choose 2230 edges (2%) out of all edges lists.
4.4 What can we gain from the networks?
Again, similar to the dynamic network mentioned in the previous section, a lot of insights can be revealed from those static graphs if we delve deeper using graph-based algorithms and analytics techniques. For example, as we all already know, the economy of Las Vegas is higher than Pittsburgh. From the network, there are clearly more nodes with a high number of incoming edges in Las Vegas than in Pittsburgh.
If I have more time, below are several potential directions that I can think of to dig deeper into those networks:
- Use the edges with high ratings (i.e., rating star > 4.5) to generate a network of high-quality businesses, comparing the differences amongst cities
- Build the graph with low ratings (i.e., rating star < 2.5) businesses, see their distribution and figure out what led to low-ratings
- Add time stamps into consideration. Build networks that evolve with time such that we know how a city evolves over time
- Find influential businesses in cities via network community discovery algorithm
- Build user-business networks by business category. i.e, networks of restaurants, networks of libraries, etc. This allows gaining insights by sectors
5. Future work
○ More graph analysis
- Graph page rank analysis using Spark GraphX and Python graph-tool library
- Community discovery (similar to Facebook social network)
○ Improve recommendation
- Content-based recommendation
- Clustering all businesses
- Extract objects from business photos using Convolutional Neural Network