Setting Up an Apache Spark Powered Recommendation Engine
One of the easiest ways for retailers to increase their customers’ average shopping cart size is to suggest items to prospective buyers at the point of sale. We see this every day in the physical world: at the grocery store or hardware store, packs of gum, beef jerky, batteries, magazines and the like line the checkout lanes, tempting us with impulse items or things we may have forgotten we needed. This practice of suggestive selling is nothing new. But now, with online shopping at an all-time high, there is a much more effective way to target specific items to specific customers at this most crucial point in the digital shopping experience.
As we discussed in a previous post, Clickstream data analysis is a very valuable asset in the retail industry and a perfect example of the benefits of Big Data Analysis. When Clickstream data is combined with customer demographic data and passed into a Machine Learning algorithm, retailers can make reasonable predictions about which products their customers are most interested in, which makes the analysis even more valuable. The problem is that this analysis could take minutes to calculate with a standard MapReduce configuration. By the time the analysis is done, the online shopper is long gone and there is no guarantee they will be coming back. With the introduction of Apache Spark Streaming, combined with Talend’s native Spark architecture, the same analysis can now be done in seconds to provide real-time, intelligent recommendations at any time throughout the shopper’s experience.
Let’s take a deeper look into an example of this scenario, which you can experience first-hand with the new Talend Big Data Sandbox. In the sandbox, you have the option to run this scenario right within Talend Studio using a stand-alone Spark Engine that comes embedded within Studio, or you can choose to run it on either of the latest Cloudera or Hortonworks Hadoop Distributions through their respective YARN Clients. You can read more about the cool technology behind the new sandbox that makes all this possible in a separate blog. For now, I am going to focus on creating this Real-time Recommendation Pipeline to show how you can increase your online sales.
First, A Recommendation…
To begin building an intelligent recommendation pipeline, you first need a Recommendation Model. The Model is what drives the recommendations presented to your customers as they browse through your website. The Alternating Least Squares (or ALS) Algorithm is one of the most widely used methods of generating such a model, and it sounds very complex. That’s because it is. But with Talend Studio, you have a single, simple component that does all that nasty calculation for you. All you have to do is feed the model with data you already have at your fingertips from your Hadoop Cluster, Data Warehouse and/or Operational Data Store. Give it some historical Clickstream data (the more the better, for more accurate recommendations) combined with user demographic data from your registered user community, and then link it to your product catalog. Let the Algorithm do the rest. If you have a rocket scientist in your IT department, they may want to tweak the algorithm for even more accurate results. Talend gives you the ability to do that as well, with parameters for Training Percentage, Latent Factors, Number of Iterations and Regularization Factor, which can all be factored into your final Recommendation Model. I’m not a rocket scientist though, so I have no idea how those parameters affect the Model. But that is OK: the Talend Component is configured with default values that provide a good starting point for a first-time Recommendation Model! What is produced is a sort of mini database of information, stored on your Hadoop cluster, that can be referenced at a moment’s notice. Pretty cool, huh? But this is just the beginning. Let’s put this model to work!
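For the curious, here is a rough idea of what an ALS component is doing under the hood. This is a minimal plain-Python sketch, not Talend's or Spark's implementation. The function names, the tiny data set and the interest scores (5 for strong interest, 1 for weak) are all made up for illustration. The latent_factors, iterations and reg arguments play the roles of the Latent Factors, Number of Iterations and Regularization Factor parameters mentioned above. A Training Percentage would correspond to holding some rows back for evaluation, which is left out here.

```python
import random

def solve(a, b):
    """Solve the small dense system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        # Partial pivoting keeps the elimination numerically stable.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= f * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]

def update_factors(scores_by_row, fixed, k, reg):
    """Regularized least-squares fit of one side while the other is held fixed."""
    out = {}
    for row, scored in scores_by_row.items():
        a = [[reg if i == j else 0.0 for j in range(k)] for i in range(k)]
        b = [0.0] * k
        for col, score in scored.items():
            v = fixed[col]
            for i in range(k):
                b[i] += score * v[i]
                for j in range(k):
                    a[i][j] += v[i] * v[j]
        out[row] = solve(a, b)
    return out

def train_als(scores, latent_factors=2, iterations=20, reg=0.1, seed=0):
    """Alternate between solving for user factors and item factors."""
    rng = random.Random(seed)
    by_user, by_item = {}, {}
    for user, item, score in scores:
        by_user.setdefault(user, {})[item] = score
        by_item.setdefault(item, {})[user] = score
    items = {i: [rng.random() for _ in range(latent_factors)] for i in by_item}
    users = {}
    for _ in range(iterations):
        users = update_factors(by_user, items, latent_factors, reg)
        items = update_factors(by_item, users, latent_factors, reg)
    return users, items

def predict(users, items, user, item):
    """Predicted interest is the dot product of the two factor vectors."""
    return sum(u * v for u, v in zip(users[user], items[item]))

# Hypothetical interest scores derived from clicks; ann has never seen the socks.
outdoor = [("boots", 5), ("socks", 5), ("necklace", 1), ("earrings", 1)]
jewelry = [("boots", 1), ("socks", 1), ("necklace", 5), ("earrings", 5)]
scores = [("ann", "boots", 5), ("ann", "necklace", 1), ("ann", "earrings", 1)]
scores += [("bob", item, s) for item, s in outdoor]
scores += [(user, item, s) for user in ("cat", "dan") for item, s in jewelry]

users, items = train_als(scores)
socks_score = predict(users, items, "ann", "socks")
necklace_score = predict(users, items, "ann", "necklace")
print("ann/socks:", round(socks_score, 2), "ann/necklace:", round(necklace_score, 2))
```

The model has never seen ann look at the socks, but because her clicks resemble bob's, the learned factors should rank the socks well ahead of the necklace for her. That is the whole trick, just at toy scale.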
Apache Spark Streaming & Real-Time Retail
The way we put this model to work is through a Spark Streaming job that reads real-time web traffic “clicks” through a Kafka Topic, enriches the data with customer demographic information and sends it to the Recommendation Model to generate a list of products to present to the user. Confused? Let me break it down another way. When a user enters your retail website and starts navigating your product pages, each click of the mouse they make is captured in your Clickstream data. By utilizing Spark Streaming and Kafka queuing, each click can be immediately analyzed to identify the customer making the click and the product they are clicking on. That single click now contains enough information to pass to the Recommendation Model, where it can be instantly compared against not only that user’s click history but also the click history of other users who share some of the same demographic characteristics. The generated recommendations can be stored in a NoSQL Database for quick reference, ready to be presented to the customer on their next click or page refresh. On a personal level, maybe a previous shopper viewed a necklace and earrings on your site but didn’t buy the items. When they return days later, you can identify that the same shopper is about to click “purchase” on a dress and heels. At this critical moment in the shopping experience, they are much more likely to add recommended accessories they have already viewed and shown an interest in, such as that specific necklace and earrings, than random accessories they would be almost sure to overlook. Now you have made a personal connection with that shopper through a digital transaction and thereby increased not only the potential revenue of the sale but also the likelihood they will return.
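If it helps to see that flow as code, here is a toy simulation in plain Python. It does not use Spark, Kafka or a real NoSQL database. A Python list stands in for the Kafka Topic, fixed-size slices stand in for Spark Streaming micro-batches, and a dict stands in for the NoSQL store. The DEMOGRAPHICS, MODEL and SEGMENT_BOOST tables, and every name and score in them, are hypothetical placeholders for the customer data and the trained Recommendation Model.

```python
from itertools import islice

# Hypothetical stand-in for the customer demographic data.
DEMOGRAPHICS = {"u1": {"climate": "cold"}, "u2": {"climate": "warm"}}

# Hypothetical stand-in for the model: clicked product -> candidate scores.
MODEL = {
    "hiking boots": {"wool socks": 0.6, "sun hat": 0.5, "trail map": 0.3},
    "dress": {"necklace": 0.8, "earrings": 0.7, "scarf": 0.2},
}

# Hypothetical demographic adjustment applied on top of the base scores.
SEGMENT_BOOST = {("cold", "wool socks"): 0.3, ("warm", "sun hat"): 0.3}

def micro_batches(events, size):
    """Group an event stream into small batches, as a streaming job would."""
    it = iter(events)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

def enrich(click):
    """Attach demographic fields to a raw click."""
    return {**click, **DEMOGRAPHICS.get(click["user"], {})}

def recommend(click, top_n=2):
    """Rank candidate products for one enriched click."""
    candidates = MODEL.get(click["product"], {})
    scored = {
        name: score + SEGMENT_BOOST.get((click.get("climate"), name), 0.0)
        for name, score in candidates.items()
    }
    return sorted(scored, key=scored.get, reverse=True)[:top_n]

def run_pipeline(events, store, batch_size=2):
    """Read clicks, enrich them, score them, keep the latest result per user."""
    for batch in micro_batches(events, batch_size):
        for click in map(enrich, batch):
            store[click["user"]] = recommend(click)
    return store

clickstream = [
    {"user": "u1", "product": "hiking boots"},
    {"user": "u2", "product": "hiking boots"},
    {"user": "u1", "product": "dress"},
]
recommendations = run_pipeline(clickstream, store={})
print(recommendations)
```

Each new click simply overwrites that user's stored recommendations, so whatever the website reads on the next page refresh reflects the most recent click.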
From a dollars and cents perspective, instead of offering a shopper a pack of gum or a package of beef jerky at checkout in the hopes of cashing in on a $2 impulse buy, now you can offer them the perfect pair of $50 wool socks that similar customers bought along with the same pair of $200 hiking boots. Cha-Ching! You just increased the potential value of this sale by 25% ($50 on top of $200)! Do I have your attention now?
Understanding Your Customers in an Instant
Let’s bring this full circle. Not only do you have your Clickstream Analysis to understand product visibility and popularity across different regions, you can also combine that information with a specific customer’s demographic information and retrieve their full history of clicks to understand their individual interests and compare across different demographic metrics. Now, with a Real-time Recommendation Pipeline to present them with instant, intelligent and meaningful recommendations, you are creating a personal relationship through a digital transaction and ultimately increasing your potential revenue. Further, you can track the recommendations presented to your customers in Hadoop for later Big Data Analysis. For example, you can cross-reference their purchase history with the recommendations they were presented to identify which recommendations are providing the most value. As you continue to collect clicks and grow your customer base, you can update and improve your Recommendation Model for more accurate recommendations, and even predict with a certain level of accuracy what your customers will want and need depending on Age, Gender, Income and even Region, Climate and Season, among others.
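As a small illustration of that cross-referencing step, the sketch below counts how often each recommended product was later bought by the shopper it was shown to. It is plain Python on made-up data. In practice the two inputs would come from the recommendation log and purchase history kept in Hadoop, and the function name and record layout here are assumptions.

```python
from collections import defaultdict

def recommendation_value(presented, purchases):
    """Map each product to (times shown, times converted, conversion rate)."""
    bought = set(purchases)
    shown = defaultdict(int)
    converted = defaultdict(int)
    for user, product in presented:
        shown[product] += 1
        # A recommendation converts when the same user bought that product.
        if (user, product) in bought:
            converted[product] += 1
    return {p: (shown[p], converted[p], converted[p] / shown[p]) for p in shown}

# Hypothetical (user, product) records.
presented = [
    ("u1", "wool socks"), ("u2", "wool socks"),
    ("u1", "trail map"), ("u2", "trail map"),
    ("u3", "necklace"),
]
purchases = [("u1", "wool socks"), ("u3", "necklace"), ("u2", "hiking boots")]

value = recommendation_value(presented, purchases)
ranked = sorted(value, key=lambda p: value[p][2], reverse=True)
print(ranked)
```

On this toy data the necklace converts every time it is shown, the wool socks half the time, and the trail map never, which is the kind of signal you could feed back in when you retrain the model.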