Apache Ignite ML: possible use cases, racing with Spark ML, plans for the future

Alexey Zinoviev
Nov 12, 2020

Hi, distributed programmers! This is the second post in a series about the Apache Ignite ML library.

Today I would like to discuss possible scenarios for using the Ignite ML framework, compare its capabilities with those of Apache Spark ML, and outline plans for the future.

The posts in the series:

  1. Apache Ignite ML: origins and development
  2. Apache Ignite ML: possible use cases, racing with Spark ML, plans for the future
  3. Apache Ignite ML Performance Experiment

Possible usage scenarios in Big Data architecture and use cases in 2020

Okay, ambition is great, you say, but to which scenarios can this distributed ML framework be applied? When is it time to stop clinging to the Python data-science community and take a decisive step toward Ignite ML?

  1. Batch training on large datasets and batch predictions on data from Ignite caches
  2. Batch training on large datasets and real-time prediction on numerous cluster nodes
  3. Real-time training and real-time prediction
  4. External training with Apache Spark ML or XGBoost, loading the model into Ignite ML, and batch or real-time predictions
  5. External training with Apache Spark ML or XGBoost, loading the model and updating it based on Ignite data, and real-time predictions
  6. Use of preprocessors to calculate basic statistics about data in Ignite tables

If one of the items in the list above solves your problem, I’ll be happy.
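To make item 6 a bit more concrete, here is a minimal sketch of what a preprocessor does: one pass over the data to collect basic statistics (here, the per-feature minimum and maximum), then a cheap transformation of each row that uses those statistics. This is plain Java written for illustration only. It does not use the Ignite ML API, and the class name `MinMaxScalerSketch` and its methods are my own hypothetical names. In Ignite ML the same idea would run over data stored in caches rather than over an in-memory array.

```java
import java.util.Arrays;

// Illustrative sketch only: plain Java, not the Ignite ML preprocessing API.
final class MinMaxScalerSketch {
    final double[] min;
    final double[] max;

    private MinMaxScalerSketch(double[] min, double[] max) {
        this.min = min;
        this.max = max;
    }

    /** Pass 1: collect the per-feature minimum and maximum. Assumes at least one row. */
    static MinMaxScalerSketch fit(double[][] rows) {
        int n = rows[0].length;
        double[] min = new double[n];
        double[] max = new double[n];
        Arrays.fill(min, Double.POSITIVE_INFINITY);
        Arrays.fill(max, Double.NEGATIVE_INFINITY);
        for (double[] row : rows) {
            for (int j = 0; j < n; j++) {
                min[j] = Math.min(min[j], row[j]);
                max[j] = Math.max(max[j], row[j]);
            }
        }
        return new MinMaxScalerSketch(min, max);
    }

    /** Pass 2: rescale one row into [0, 1]; constant features map to 0. */
    double[] transform(double[] row) {
        double[] out = new double[row.length];
        for (int j = 0; j < row.length; j++) {
            double range = max[j] - min[j];
            out[j] = range == 0.0 ? 0.0 : (row[j] - min[j]) / range;
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] rows = { {1.0, 200.0}, {3.0, 400.0}, {5.0, 300.0} };
        MinMaxScalerSketch scaler = fit(rows);
        System.out.println(Arrays.toString(scaler.transform(rows[1]))); // prints [0.5, 1.0]
    }
}
```

The split into `fit` and `transform` is the point of the sketch: the statistics are computed once over the whole dataset, while the transformation is applied row by row, which is what makes the same preprocessor usable for both batch training and real-time prediction.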

If we go deeper into possible use cases, I can safely say that you can use Ignite ML in the same way you would use any other machine learning library.

Classification algorithms can help build models in the following areas:

  • transaction analysis
  • spam detection
  • credit scoring
  • disease identification

Regression algorithms are a good fit for predicting values in the following areas:

  • drug responses
  • stock prices
  • supermarket revenues

Clustering algorithms are typically a good fit for the following tasks:

  • customer segmentation
  • grouping of experiment outcomes
  • grouping of shopping items

Recommendation algorithms are newcomers that can help in building simple recommendation systems.

Maybe the most recent case that I faced will be of interest: a customer kept all the data in an old, on-disk database with an embedded Random Forest algorithm.

Training, which ran each time on hundreds of thousands of rows, took many hours. The customer moved his model training and its data mart to Ignite and trained with the Ignite ML Random Forest.

After that, he could repeat the training on a few million rows every ten minutes (if he allocated enough heap memory for the JVM, of course, my friends). Training became roughly 50–80x faster (depending on the RF hyperparameters).

Apache Spark ML versus Apache Ignite ML (feature comparison)

The following table identifies and compares the main features of Ignite ML 2.9 and Spark ML 3.0. At the moment, there is a strategic parity of capabilities between the two frameworks.

Spark is stronger in data preprocessing but weaker in regard to integration with other frameworks, support for model ensembles, and advanced methods for hyperparameter tuning. The weaknesses within the Spark framework are strengths within the Ignite ML framework.
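For readers new to hyperparameter tuning, the baseline that more advanced methods improve on is an exhaustive grid search: try every combination of candidate values and keep the one with the best score. The sketch below is plain Java and uses neither the Ignite ML nor the Spark ML API. The class `GridSearchSketch`, the `Params` fields, and the toy scoring function are hypothetical names of my own; in practice the score would be something like cross-validated accuracy of a trained model.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.ToDoubleFunction;

// Illustrative sketch only: plain Java, not the Ignite ML or Spark ML tuning API.
final class GridSearchSketch {
    /** One candidate configuration; both hyperparameters are hypothetical. */
    static final class Params {
        final int maxDepth;
        final int numTrees;

        Params(int maxDepth, int numTrees) {
            this.maxDepth = maxDepth;
            this.numTrees = numTrees;
        }
    }

    /** Scores every combination in the grid and returns the best one. */
    static Params search(List<Integer> depths, List<Integer> treeCounts,
                         ToDoubleFunction<Params> score) {
        Params best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int depth : depths) {
            for (int trees : treeCounts) {
                Params candidate = new Params(depth, trees);
                double s = score.applyAsDouble(candidate);
                if (s > bestScore) {
                    bestScore = s;
                    best = candidate;
                }
            }
        }
        return best; // null only if the grid is empty
    }

    public static void main(String[] args) {
        // Toy score standing in for cross-validated accuracy; it peaks at depth 5 and 50 trees.
        ToDoubleFunction<Params> toyScore =
            p -> -(Math.abs(p.maxDepth - 5) + Math.abs(p.numTrees - 50) / 10.0);
        Params best = search(Arrays.asList(3, 5, 7), Arrays.asList(10, 50, 100), toyScore);
        System.out.println(best.maxDepth + " " + best.numTrees); // prints 5 50
    }
}
```

The cost grows with the product of the grid sizes (nine evaluations here), which is why strategies that sample or evolve candidates instead of enumerating them become attractive once each evaluation means training a model on a large dataset.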

The API guarantees and stability

In this section, I discuss how stable the current API is. During the past year, model-training methods and model interfaces and classes, such as decision trees and logistic regression, have changed little. And, I think, they are unlikely to change in the future.

On the other hand, everything around the model (hyperparameter tuning, model evaluation, model export/import, and integration bridges with other ML frameworks) is a hot spot for change. So, we don’t guarantee that there will be no breaking changes in the current ML API.

Also, keep in mind that the number of users is growing and that users find bugs and submit proposals for new functionality. These user actions lead to improvements. The field of machine learning is also changing rapidly. What seemed unnecessary yesterday may now be present in all frameworks. Therefore, we have to make changes for the sake of user convenience and early access to new features.

What’s next?

The previous two years have been very turbulent in terms of new and experimental functionality. Some features did not last long; they were removed under pressure from the community and the first users. Some features were added in a hurry and were not backed up by sufficiently deep research (especially the hasty integration with MLeap and TensorFlow 1.14 via the TensorFlow Java API).

Now that the main functionality from the roadmap is implemented, it is time to slow the rapid development of the framework and the rapid addition of new features, and to hear from the first adopters in the community in order to understand which things are important and necessary for users.

We have many algorithms, good support for ensembles and hyperparameter tuning, and basic elements of AutoML, and our examples are the best among all Apache projects, but functionality related to MLOps [1] is lacking.

I read the user list and dev list and hear the voices of people who post issues about model export/import, the addition of new preprocessing algorithms, and various minor improvements. I hope to present some solutions for these problems in the next release.

Alexey Zinoviev

Apache Ignite Committer/PMC; Machine Learning Engineer in JetBrains