Apache Ignite ML: origins and development
Hi, distributed programmers! Nice to see you here on my technical blog. Gird your loins and read this long, long article about the Apache® Ignite™ Machine Learning framework.
Do you know what Apache Ignite is? It is a horizontally scalable, fault-tolerant, distributed, in-memory computing platform for building real-time applications that can process terabytes of data with in-memory speed [1].
De facto, Ignite is a distributed, in-memory database. It's not like Apache Spark (a distributed ETL framework, not a database, that reads and writes on disk and keeps intermediate results in memory). It's not like Cassandra (a distributed NoSQL database that keeps indices in memory and data on disk).
This is the first story in a series about Apache Ignite ML:
- Apache Ignite ML: origins and development
- Apache Ignite ML: possible use cases, racing with Spark ML, plans for the future
- Apache Ignite ML Performance Experiment
In this article, I'm going to describe in detail all the main points of the machine-learning framework that is built on top of Apache Ignite.
Let’s go, machine teachers!
What Apache Ignite ML is and is not
You can think of the machine-learning (ML) framework as a distributed version of scikit-learn, written in Java.
First, it’s a distributed machine learning framework, and, as a result, not all classic ML algorithms can be easily and effectively* ported to the distributed universe.
For example, the SGD algorithm can be easily parallelized or distributed, whereas DBSCAN does not distribute at all.
Second, it’s not a deep-learning framework, and support for neural networks (NNs) is limited for both training and inference, because full support would require implementing a lot of NN ops or embedding the TensorFlow or MXNet runtimes.
Third, it’s not the right choice for exploratory data analysis or data-science experiments. It’s a tool for training and model inference in situations where you know which model or model ensemble you want in production, and you have a choice: train the model from zero to hero, or load a pre-trained model.
The authors of this ML framework were inspired by the ideas at the foundation of scikit-learn, Apache Spark ML, Apache Flink ML, and Apache Mahout. We tried to make the API as simple as possible and easily recognizable to anyone who has dealt with the frameworks mentioned above (see the sketch after the footnote below).
* An algorithm can be ported if its complexity is low (lower than n² in Big O notation for CPU and memory operations) and the amount of data transferred over the network does not depend on n (the amount of data to train on).
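To give you a feel for that API, here is a minimal sketch of training a decision tree on a distributed cache. It follows the style of the official Ignite ML examples, but the exact class names and signatures (DummyVectorizer, DecisionTreeClassificationTrainer, the returned model type) vary between Ignite releases, so treat it as illustrative rather than definitive:

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.ml.dataset.feature.extractor.Vectorizer;
import org.apache.ignite.ml.dataset.feature.extractor.impl.DummyVectorizer;
import org.apache.ignite.ml.math.primitives.vector.Vector;
import org.apache.ignite.ml.math.primitives.vector.VectorUtils;
import org.apache.ignite.ml.tree.DecisionTreeClassificationTrainer;
import org.apache.ignite.ml.tree.DecisionTreeNode;

public class IgniteMlApiFlavor {
    public static void main(String[] args) {
        try (Ignite ignite = Ignition.start()) {
            // Training data lives in an ordinary distributed cache; here each
            // value is a dense vector of the form [label, feature1, feature2].
            IgniteCache<Integer, Vector> data = ignite.getOrCreateCache("trainData");
            data.put(1, VectorUtils.of(0.0, 10.0, 1.0));
            data.put(2, VectorUtils.of(1.0, 42.0, 7.0));

            // The vectorizer tells the trainer where the label sits in each value.
            Vectorizer<Integer, Vector, Integer, Double> vectorizer =
                new DummyVectorizer<Integer>().labeled(Vectorizer.LabelCoordinate.FIRST);

            // Training runs co-located with the cache partitions, not on one node.
            DecisionTreeClassificationTrainer trainer =
                new DecisionTreeClassificationTrainer(5, 0.0); // max depth, min impurity decrease

            // In some releases the returned type is DecisionTreeModel instead.
            DecisionTreeNode mdl = trainer.fit(ignite, data, vectorizer);

            // Inference is a cheap local call on the trained model.
            System.out.println("Prediction: " + mdl.predict(VectorUtils.of(40.0, 6.0)));
        }
    }
}
```

The key point is the scikit-learn-like shape of the workflow: build a trainer, call fit on distributed data, and get back a model you can call predict on.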
Motivation (for building yet another framework)
Why do people create frameworks? What drives them? Ambition, thirst for knowledge, questions from users, challenges of fate?
First motivation: Honestly, I think that, initially, the idea was to create a simple framework on top of Apache Ignite, with models like Linear Regression and Decision Trees, to compete with other distributed systems, like Redis, Apache Spark, or Oracle, which have their own data-science tools. Purely for show.
Second motivation: As I remember, some Ignite users asked about making fast predictions in memory by using pre-trained models. But all of those models were trained in Python, and, as a result, the serialized Python models could not easily be used for prediction on Ignite caches.
Third motivation: Users must consider the effect that scalability can have on performance. Just like Apache Spark and Apache Mahout, Apache Ignite can be scaled out and trained on data partitions.
With Apache Ignite ML, 1,000 nodes with 10,000 data partitions can be used to train Decision Tree or NaiveBayes classifiers.
Fourth motivation: There are not many useful ML libraries in the JVM world. Unlike with scikit-learn in Python, there is no community-driven gold standard. However, the creation of ML libraries (whether distributed or non-distributed) brings new possibilities. Because Java, Scala, and Kotlin all compile to standard JVM bytecode, libraries written in any of these languages can easily be called and reused for data processing.
As a result, if data-source reading, preprocessing, training, model deployment, and prediction can all be performed on one platform, then network traffic, CPU operations, and the amount of allocated memory can be reduced.
Fifth motivation: If you are using a distributed system, probably your data is too large to fit on one server or, in the case of Apache Ignite, within the memory of one server. If you have a lot of data, training on the data might require a good deal of time. If you lose a node, you might lose your calculations and have to restart the training. Apache Ignite Machine Learning is tolerant of node failures. Therefore, if a node fails during the learning process, recovery procedures are transparent to you, your training processes aren’t interrupted, and you obtain results relatively quickly, almost the same as if all nodes were working as they should. For detailed information, see Partition Based Dataset [2].
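To make this concrete: the training data lives in an ordinary Ignite cache, so you give it the same durability as any other cache, namely backup copies of its partitions; the recovery of the learning state itself is handled internally by the Partition Based Dataset [2]. A minimal sketch, assuming a cache of feature vectors (the cache name and backup count here are illustrative, not prescribed):

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.CacheConfiguration;
import org.apache.ignite.ml.math.primitives.vector.Vector;

public class FaultTolerantTrainingData {
    public static void main(String[] args) {
        try (Ignite ignite = Ignition.start()) {
            CacheConfiguration<Integer, Vector> cfg = new CacheConfiguration<>("trainData");

            // Keep one backup copy of every partition: if a node dies mid-training,
            // its data survives on another node and training can proceed without
            // being restarted from scratch.
            cfg.setBackups(1);

            IgniteCache<Integer, Vector> data = ignite.getOrCreateCache(cfg);
            // ... fill the cache and call trainer.fit(...) as usual.
        }
    }
}
```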
Sixth motivation. Last but not least: Let's be honest, all the main Big Data tools are written in JVM languages like Java or Scala (think of Kafka, Spark, Hadoop, Flink, Ignite, and Cassandra), and all of these tools are developed under the Apache community rules. But the data-analysis pipeline has a missing link: it lacks ML and data-science tools that integrate well with the other JVM Big Data tools. This deficiency drove the Flink and Spark communities to implement their own ML frameworks, and it played a significant role in the creation of the Apache Ignite ML framework.
What was the motivation for me, personally, to participate in the Ignite ML project during the past three years and to continue this work?
The main reason was to create competition for the Apache Spark ML framework, which offers a lot of advantages and is widely known, but provides limited support for model ensembles, integration with other ML frameworks, and online learning [3]. After a few fruitless attempts to change something within Apache Spark, I joined the Apache Ignite community, which had begun the development of the Ignite ML framework by trying to reuse the basics of Apache Mahout on top of Apache Ignite.
What is Apache Ignite ML for me in 2020?
I’ve been involved since 2017, and I remember the prototype, which had only Linear Regression and distributed matrices. Over the next two years, the very friendly community of ML developers (including Alexey Platonov, Anton Dmitriev, Yury Babak, Artem Malykh, Ravil Galeev, and your humble servant, Alexey Zinoviev) did a lot of work together.
During the past two years, I became first a committer and then a PMC member of Apache Ignite (responsible for the ML module and its future).
Personally, I created and added the following algorithms and ML techniques: SVM, KNN, Logistic Regression, one-hot encoding and other preprocessors, evaluation (including cross-validation, hyperparameter tuning based on random search and genetic algorithms, and metrics calculations), and model loading from Spark.
As a committer and PMC member, I completed more than two hundred reviews and PRs and wrote the documentation for the last two releases. I also presented talks about Ignite ML at ten events and conferences.
I hope to continue my work here and to involve you, my dear reader, in the active use and development of this wonderful ML library.
Educational Resources
The most popular question on the user lists is “How do I start with Ignite ML? Ignite is quite difficult; ML requires a mathematical background. Ignite ML is probably doubly difficult. How do I start?”
First, dear friends, read the documentation. The official site [4] contains up-to-date documentation on all the main features of the machine learning framework. Everything necessary is described in dry and concise language, and code snippets that you can copy and play with are provided.
Second, make this long, long article your guiding star for learning about the framework. All the insights into the different architectural solutions, and the reasons for choosing them, are described in this article. You need no other books or articles. Only here.
Third, watch videos.
On my YouTube channel:
- [En, 2019] Ensembles of ML algorithms & distributed online ML
- [Ru, 2019] Apache Spark and Apache Ignite are building a bright ML future together
- [Ru, 2019] Not all ML algorithms go to distributed heaven
- [En, 2019] Distributed ML/DL with Ignite using Spark as data-source
- [Ru, 2018] Nuances of Machine Learning with Apache Ignite ML
- [En, 2018] Nuances of Machine Learning with Apache Ignite ML
Also, from Ken Cotrell (GridGain): Architect’s Guide for Continuous Machine Learning Platforms.
Fourth, consult the examples folder [5], the main entry point for experiments and learning. Give special attention to the Titanic Tutorial [6], where the whole ML pipeline is implemented, from CSV parsing to hyperparameter tuning.
I hope that these resources will be enough to get you started, and I hope that the community will create more educational materials in the near future!
This was the first story in the series about Apache Ignite ML; stay tuned for the next ones.
Reference list
[1] Apache Ignite
[2] Partition Based Dataset
[3] Weakness of the Apache Spark ML library
[4] Machine Learning in Apache Ignite
[5] ML examples
[6] Titanic Tutorial