One of the major differentiators between Apache Spark and the prior generation of Apache Hadoop–based and MapReduce-based technologies is the built-in Spark machine-learning library (MLlib). The motivation behind including these capabilities is to make practical machine learning scalable and understandable for data engineers and data scientists. MLlib also leverages Spark’s distributed, in-memory execution model to yield significant performance benefits over preceding technologies such as R and Apache Mahout.
Machine-learning algorithm applications
Out of the box, data scientists and engineers have access to many forms of statistical analysis, machine learning and some supporting methods:
- Classification: Logistic regression, support vector machines (SVMs), naive Bayes, decision trees and ensemble methods
- Regression: Linear regression, along with the regression variants of decision trees and ensemble methods
- Collaborative filtering: Alternating least squares (ALS)–based recommendations
- Clustering: Gaussian mixture, power iteration clustering (PIC), latent Dirichlet allocation (LDA), k-means and variants such as streaming k-means
- Dimensionality reduction: Singular value decomposition (SVD) and principal component analysis (PCA)
- Feature extraction: Term frequency-inverse document frequency (tf-idf), word2vec, PCA and more
While the capabilities in MLlib are powerful in the abstract, one still needs to identify a practical application, implement a technical solution and productionize the analysis for its downstream consumers. As I discussed in the post "Spark: The operating system for big data analytics," Spark makes the implementation and productionization of advanced data analysis significantly less challenging than it is with the earlier technologies mentioned above. However, identifying practical and appropriate applications for specific machine-learning algorithms remains a key challenge, not only in reaping the benefits of MLlib in Spark, but in reaping the benefits of machine learning in general.
Take a look at a few specific applications. Each case explains the rationale for the chosen machine-learning algorithm.
Principal component analysis
One common usage of applied machine learning is to identify the statistically relevant attributes of a high-dimensionality data set. For example, imagine you are charged with analyzing medical profiles to determine whether a particular treatment may have an adverse effect on a patient. Such a data set may contain hundreds, if not thousands, of attributes that describe a single patient’s medical history and current conditions. Wide data sets such as this one can be difficult to work with using traditional data analytics tools, so applying a statistical method is certainly appropriate. Unfortunately, some such algorithms do not perform well when the dimensionality of a data set is this large, and can result in long processing times and high operational costs.
To address this problem, a dimensionality reduction technique can be used to condense the data set into a much smaller number of statistically important attributes. In the best case, this technique can result in a substantial reduction in the width of the data set, without a statistically significant degradation in the accuracy of the downstream prediction.
As of version 1.6, Spark includes two such algorithms in MLlib: SVD and PCA. Let's look specifically at PCA, a technique that projects a data set onto a small number of new attributes, called principal components, each of which is a linear combination of the original attributes. The components are ranked by how much of the variance in the data they capture, so the first few typically retain most of the information in the original set. Note that PCA does not look at the medical outcome itself; it only assumes that the directions of greatest variance are the ones most worth keeping. The number of principal components is specified up front, enabling a data scientist to control the balance between cost and accuracy. Because this example involves medical outcomes, a user of this technique would likely err on the side of accuracy, and thus choose a higher number of principal components to compute. Once this has been done, the source data set can be projected onto those components and the reduced data set used in a downstream analysis.
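To make the idea of ranking components by variance concrete, here is a minimal sketch in plain Python rather than Spark. It is not how MLlib computes PCA (in Spark the computation is exposed through calls such as `RowMatrix.computePrincipalComponents`); it simply builds a covariance matrix and uses power iteration to find the single strongest component. The three-attribute patient rows are made-up values for illustration only.

```python
import math

def covariance_matrix(rows):
    """Sample covariance matrix of a list of equal-length numeric rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def top_component(cov, iterations=200):
    """Leading eigenvector and eigenvalue of cov, found by power iteration."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d  # arbitrary unit-length starting direction
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    # Rayleigh quotient: the variance captured along direction v
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                     for i in range(d))
    return v, eigenvalue

# Hypothetical data: the second attribute is roughly twice the first,
# and the third attribute barely varies at all.
rows = [[1.0, 2.0, 0.5], [2.0, 4.1, 0.4], [3.0, 5.9, 0.6],
        [4.0, 8.0, 0.5], [5.0, 10.1, 0.4]]

cov = covariance_matrix(rows)
component, variance = top_component(cov)
total_variance = sum(cov[i][i] for i in range(len(cov)))
explained_ratio = variance / total_variance
print(component, explained_ratio)
```

In this toy data set one component captures nearly all of the variance, so three attributes could reasonably be reduced to one. On a real data set you would compute several components and keep as many as the accuracy requirements call for.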
Another common usage of machine learning is to augment a decision-making process with supporting evidence or reliable, validated predictions. For example, imagine you are a data scientist at a large ecommerce company who is trying to forecast average monthly revenue per user by analyzing data sets related to shopping behavior. One way to frame this analysis is to think of the set of implicit decisions that exist in a behavioral data set as indicators of a particular spending pattern. Furthermore, these indicators could point to either a discrete shopping profile—a classification—or a forecasted monthly spend—a regression.
Using Spark, both the classification and regression analyses can be achieved with a decision tree. In the classification case, a summary of a user's behavior over a period of time serves as the set of input features. When these features are coupled with a class label derived from the user's actual monthly spend, a model can be trained in Spark that decides which shopping profile a new customer will likely fall into. In the regression case, the discrete labels are replaced with the actual monthly spend as a continuous target, so the trained model yields the expected spend for the next month.
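The sketch below illustrates, in plain Python, why one tree-building procedure can serve both cases: the search for the best split is identical, and only the impurity measure changes (Gini impurity for class labels, variance for a numeric target). This is a single-split simplification for illustration, not MLlib's implementation; in Spark the corresponding entry points are `DecisionTree.trainClassifier` and `DecisionTree.trainRegressor`. The visit counts, profile names and spend figures are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def best_split(feature, targets, impurity):
    """Threshold on one feature that minimizes the weighted impurity of both sides."""
    best = None
    # Skip the largest value so the right-hand side is never empty.
    for threshold in sorted(set(feature))[:-1]:
        left = [t for f, t in zip(feature, targets) if f <= threshold]
        right = [t for f, t in zip(feature, targets) if f > threshold]
        score = (len(left) * impurity(left)
                 + len(right) * impurity(right)) / len(targets)
        if best is None or score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical behavioral feature: site visits per month for six users
visits = [2, 3, 4, 10, 12, 15]
profiles = ["casual", "casual", "casual", "frequent", "frequent", "frequent"]
monthly_spend = [20.0, 25.0, 22.0, 90.0, 110.0, 105.0]

# Classification: split chosen to separate shopping profiles
class_threshold, class_impurity = best_split(visits, profiles, gini)

# Regression: same search, but the target is the spend itself
reg_threshold, reg_variance = best_split(visits, monthly_spend, variance)

# A regression leaf predicts the mean spend of the training users on its side
new_user_visits = 11
same_side = [s for v, s in zip(visits, monthly_spend)
             if (v > reg_threshold) == (new_user_visits > reg_threshold)]
predicted_spend = sum(same_side) / len(same_side)
print(class_threshold, reg_threshold, predicted_spend)
```

A full decision tree repeats this split search recursively on each side and across many features; the single split here is only meant to show the shared mechanism.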
In this last example, consider a use case in which we'd like to adaptively classify entities in real time. For example, imagine you work at a large shipping company that wants to give customers a prediction of when a package will likely be shipped. This prediction can be achieved by considering a number of attributes, including daily shipment volume, shipping class, destination and so on, and using these attributes to train a model that groups packages by their expected shipping time.
However, given the real-time nature of the problem, a model that is trained in the morning may not yield accurate results in the afternoon or evening. As such, we need to apply a technique that can adaptively update the model as new data arrives so that the classifications stay up to date.
To implement this system, we can combine two major components from Spark: MLlib’s streaming k-means and Spark Streaming. The first of these components is an implementation of the k-means clustering algorithm that can be trained using data streams constructed with the second component. By choosing this particular implementation of the clustering algorithm, we can meet the requirement of adaptively updating the model as new data comes in over the course of a business day, and still keep the solution confined to Spark’s core components.
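The adaptive behavior comes from folding each new batch of data into the existing cluster centers while discounting older data. The plain-Python sketch below shows one such decayed mini-batch update. It is an illustration of the idea, not MLlib's code, and the exact bookkeeping in Spark's `StreamingKMeans` (which exposes a decay factor setting) may differ. The two features, daily volume in thousands and hours until shipment, are hypothetical.

```python
def nearest(centers, point):
    """Index of the center closest to point (squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda k: sum((c - p) ** 2
                                 for c, p in zip(centers[k], point)))

def update(centers, weights, batch, decay=1.0):
    """Fold one mini-batch into the centers, discounting older data by decay."""
    dim = len(centers[0])
    sums = [[0.0] * dim for _ in centers]
    counts = [0] * len(centers)
    for point in batch:
        k = nearest(centers, point)
        counts[k] += 1
        for j in range(dim):
            sums[k][j] += point[j]
    for k in range(len(centers)):
        old = weights[k] * decay  # how much the history still counts
        if counts[k] == 0:
            weights[k] = old      # no new points: center stays, history fades
            continue
        total = old + counts[k]
        centers[k] = [(centers[k][j] * old + sums[k][j]) / total
                      for j in range(dim)]
        weights[k] = total
    return centers, weights

# Hypothetical starting centers with no accumulated history
centers = [[1.0, 1.0], [5.0, 5.0]]
weights = [0.0, 0.0]

morning = [[1.0, 2.0], [1.2, 1.8], [5.0, 6.0], [5.2, 6.2]]
afternoon = [[2.0, 3.0], [2.2, 3.2]]

update(centers, weights, morning)               # initial fit on morning data
update(centers, weights, afternoon, decay=0.5)  # morning data now counts half
print(centers, weights)
```

With a decay below 1.0, the first cluster's center moves noticeably toward the afternoon shipments, which is the behavior the shipping example needs. A decay of 1.0 would weight all data equally, and a decay of 0.0 would forget everything before the latest batch.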
Practical application of machine-learning algorithms
Hopefully the three examples described here shed some light on the issue of identifying practical and appropriate applications for specific machine-learning algorithms included in Spark. In addition, these examples show how to deploy Spark in a variety of advanced analytics use cases, including data preparation through dimensionality reduction, revenue prediction through decision trees trained on user behavior and package delay classification based on real-time, adaptive clustering.