We are in an age in which machine learning has increasing importance in our daily lives. Machine learning is put into action whenever your mobile map application automatically reminds you to leave for your next appointment because of unusual traffic situations. Besides personal assistants on your cell phones, wearable sport devices use machine-learning algorithms to propose personal training plans, and banks depend on accurate machine-learning models to detect malicious transactions.
Healthcare has also started to use machine learning to find helpful patterns in medical data. Modern technologies allow close monitoring of a patient’s condition through the large volume of data that sensors provide. Machine learning is applied to this data to find patterns and to predict, for example, how a patient will react to a treatment plan. Accuracy is particularly important in this field because each missed or incorrect prediction can have significant implications for the patient. This case presents several challenges for machine learning technologists:
- Complicated model preparation because of the huge data volume and various forms of data including highly imbalanced data sets
- Constant model retraining and reevaluation because of the ever-changing nature of patient data and structure, as well as the need to improve the accuracy of model prediction
- Fast deployment of newly trained models to monitor a patient’s condition
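To illustrate why highly imbalanced data sets complicate model preparation, here is a minimal sketch in plain Python. The patient counts and both "models" are made up for illustration and nothing here uses H2O or Spark. It shows how a model that never flags a critical condition can still report very high accuracy, which is why a measure such as recall is usually tracked alongside accuracy.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels, where 1 is the rare positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def accuracy(y_true, y_pred):
    """Share of all records that were classified correctly."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)


def recall(y_true, y_pred):
    """Share of truly positive records that the model actually caught."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Made-up data: 1,000 patient records, only 10 of which show the critical condition.
y_true = [1] * 10 + [0] * 990

# A useless model that always predicts "no critical condition".
always_negative = [0] * 1000

# A model that catches 8 of the 10 critical cases at the cost of 30 false alarms.
useful = [1] * 8 + [0] * 2 + [1] * 30 + [0] * 960

print("always negative:", accuracy(y_true, always_negative), recall(y_true, always_negative))
print("useful model:   ", accuracy(y_true, useful), recall(y_true, useful))
```

The always-negative model scores higher on accuracy even though it misses every critical case, so model comparison on imbalanced data needs criteria beyond accuracy alone.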
Machine learning with the power of Spark
Sparkling Water brings the open source H2O machine-learning platform to Apache Spark environments. H2O runs directly in a Spark Java virtual machine (JVM), which eliminates the data transfer overhead that other solutions typically incur.
H2O allows users to combine the data processing power of Spark with powerful machine-learning algorithms provided by the H2O platform. This combination solves the aforementioned challenges for machine learning technologists in a variety of ways.
Parallelized data processing
H2O is designed to process huge amounts of data in a distributed and fully parallelized fashion. This approach means a hospital can use all the data available for its analyses, explore and test more models in quick iterations and benefit from the results.
Operationalized model training, evaluation and comparison, and scoring
Finding the optimum model for a given patient condition is a tedious process that has many moving parts. Hospitals need to try out different strategies to explore the space of possible models and setups, compare the results and select the model best suited for their environments. H2O operationalizes this tedious training process in several ways:
- Providing a library of machine-learning algorithms supporting advanced, algorithm-specific features; moreover, H2O allows combining models into ensembles—super learners
- Performing fast exploration of the hyperparameter space (also known as grid search)
- Offering the facility to specify various criteria that identify and select the best model—for example, accuracy, building time, scoring time and so on
- Adding the ability to continue model preparation with modified parameters and additional relevant training data; this specific feature of H2O helps simplify the lives of data scientists and speeds up model preparation turnaround
- Creating visualizations of various model characteristics on the fly during training, as well as of the final model; moreover, users can explore the performance of the model on training data as well as on validation (that is, unseen) data.
H2O also allows users to stop the model training process manually if the visual feedback reports unexpected results, modify the parameters and continue the training.
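The grid search and model selection criteria described above can be sketched in a few lines of plain Python. This is an illustration of the idea only, not H2O's API: the tiny nearest-neighbour model, the parameter names `k` and `weighted`, and the sensor readings are all hypothetical. Each parameter combination is trained, scored on validation data, and then ranked by accuracy first and build time second.

```python
import itertools
import time


def train_knn(train, k):
    """'Training' a k-nearest-neighbour model just stores the data and k."""
    return {"points": list(train), "k": k}


def predict(model, x, weighted):
    """Predict a 0/1 label for x by an (optionally distance-weighted) neighbour vote."""
    neighbours = sorted(model["points"], key=lambda p: abs(p[0] - x))[: model["k"]]
    votes = {0: 0.0, 1: 0.0}
    for value, label in neighbours:
        votes[label] += 1.0 / (abs(value - x) + 1e-9) if weighted else 1.0
    return 1 if votes[1] > votes[0] else 0


def grid_search(train, valid, grid):
    """Train one model per parameter combination and record the selection criteria."""
    results = []
    names = sorted(grid)
    for combo in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, combo))
        start = time.perf_counter()
        model = train_knn(train, params["k"])
        build_time = time.perf_counter() - start
        correct = sum(predict(model, x, params["weighted"]) == y for x, y in valid)
        results.append(
            {"params": params, "accuracy": correct / len(valid), "build_time": build_time}
        )
    # Rank by accuracy first, then prefer the model that was faster to build.
    results.sort(key=lambda r: (-r["accuracy"], r["build_time"]))
    return results


# Made-up (sensor reading, label) pairs, including one noisy training point at 2.5.
train = [(0.5, 0), (1.0, 0), (1.5, 0), (2.0, 0), (2.5, 1),
         (3.5, 1), (4.0, 1), (4.5, 1), (5.0, 1)]
valid = [(0.8, 0), (1.7, 0), (2.2, 0), (3.8, 1), (4.7, 1)]

# Hypothetical hyperparameter grid: 3 x 2 = 6 combinations.
grid = {"k": [1, 3, 5], "weighted": [False, True]}

results = grid_search(train, valid, grid)
best = results[0]
for entry in results:
    print(entry["params"], entry["accuracy"])
print("selected:", best["params"])
```

Changing the sort key is all it takes to select by a different criterion, such as scoring time, which mirrors the idea of letting users specify what "best" means for their environment.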
Optimized model deployment
Model deployment is one of the most critical elements of the machine-learning process in healthcare. The model, or even multiple models, is instantiated and fed real-time data from sensors monitoring a patient’s body, and the models need to provide predictions as quickly as possible. To meet these strict requirements, H2O allows for the export of trained models as optimized code for deployment into target systems such as web services and applications. The optimized code is designed to deliver fast response times, which is crucial for applications that need to react quickly to changing conditions.
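The idea of exporting a trained model as standalone code can be illustrated with a small sketch. It assumes a logistic-regression model with hypothetical coefficients and generates Python source for a scoring function that needs no training library at run time. This is not H2O's export format or API; the function name `export_scoring_code` and all values are invented for the example.

```python
import math


def export_scoring_code(weights, bias, function_name="score"):
    """Generate source for a standalone scoring function from logistic-regression coefficients."""
    terms = " + ".join(f"({w!r} * features[{i}])" for i, w in enumerate(weights))
    return (
        "import math\n"
        f"def {function_name}(features):\n"
        f"    z = {bias!r} + {terms}\n"
        "    return 1.0 / (1.0 + math.exp(-z))\n"
    )


# Hypothetical coefficients, standing in for a model trained elsewhere.
weights = [0.8, -1.2, 0.05]
bias = -0.3

source = export_scoring_code(weights, bias)

# The "target system" only needs the generated source, not the training library.
namespace = {}
exec(source, namespace)
score = namespace["score"]

# Reference computation straight from the coefficients, for comparison.
sample = [1.0, 0.5, 20.0]
reference = 1.0 / (1.0 + math.exp(-(bias + sum(w * x for w, x in zip(weights, sample)))))

print(source)
print(score(sample), reference)
```

Because the coefficients are baked into the generated function as literals, scoring involves no model loading or lookup at prediction time, which is the property that makes exported code attractive for low-latency deployment.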
Use cases with streamlined implementation
Sparkling Water improves and streamlines the way machine learning is applied to healthcare. Besides healthcare, Sparkling Water can also elevate the use of machine learning in a variety of other use cases:
- Detecting fraud in the finance industry, where high accuracy and speed are key factors
- Proposing interest rates for insurance applications or predicting drivers’ risk factors
- Planning truck maintenance based on tracking trucks’ telemetry.
The next time you think about improving the quality of your life, remember Sparkling Water. At the Apache Spark Maker Community Event on 6 June 2016, IBM is sharing important announcements to help customers use Spark, R and open data science to drive business innovations. Register for this in-person event. If you can’t attend, register to watch a livestream presentation of the event.