Apache Ignite ML Performance Experiment

Alexey Zinoviev
Nov 27, 2020 · 9 min read

Spoiler: The reported data is not an official benchmark with reproducible results and shared code, but the numbers can help with performance estimation.

A few months ago, I did a time-consuming performance experiment with the new version of Apache® Ignite™ (release 2.8).

First, I wrote a script that generates a dataset in Ignite caches with 15 columns (categorical and numerical features in a 40-to-60 proportion) and takes the number of rows as a parameter. Observations were generated according to a predefined function: the sum of linear and nonlinear functions of the arguments, with added noise.
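The original generator is not published, but a minimal sketch of the idea looks roughly like this (the cache name, the handful of features, and the label function are illustrative, not the ones from my script; the real dataset had 15 columns):

import java.util.Random;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.configuration.CacheConfiguration;
import org.apache.ignite.ml.math.primitives.vector.Vector;
import org.apache.ignite.ml.math.primitives.vector.VectorUtils;

public class DatasetGenerator {
    /** Fills a cache with synthetic rows; the label is stored in the first coordinate. */
    public static IgniteCache<Integer, Vector> generate(Ignite ignite, int rows) {
        IgniteCache<Integer, Vector> cache =
            ignite.getOrCreateCache(new CacheConfiguration<Integer, Vector>("ML_DATA"));

        Random rnd = new Random(42);

        for (int i = 0; i < rows; i++) {
            double x1 = rnd.nextDouble();       // numerical feature
            double x2 = rnd.nextDouble();       // numerical feature
            double x3 = rnd.nextInt(5);         // categorical feature encoded as an index

            // Label = linear part + nonlinear part + noise (illustrative formula only).
            double label = 2 * x1 - x2 + Math.sin(x3) + 0.1 * rnd.nextGaussian();

            cache.put(i, VectorUtils.of(label, x1, x2, x3));
        }

        return cache;
    }
}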

The primary goal (prior to the 2.8 release) was to test performance in most of the presented algorithms, check the scalability effect, compare them with each other, and identify possible bottlenecks.

I had five machines provisioned, each with eight cores and 100GB of RAM on board. Four machines were used as servers, and one was used as a client. I deployed Ignite 2.8-SNAPSHOT on all of the machines and ran the ML algorithms with various parameters.

I ran the Ignite nodes with various settings, but the most common configuration was:

-Xmx80G and DataRegionConfiguration.maxSize = 10G (referred to as OffHeap in the text below)

I tested the following trainers:

  • Random forest
  • Linear regression with LSQR solver
  • Linear regression with SGD solver
  • Decision tree for classification and regression tasks
  • SVM
  • KNN
  • ANN
  • K-means
  • GMM
  • Logistic regression
  • Boosted logistic regression
  • GBT

When possible, each training was repeated multiple times (10 times on one machine, then 10 times on two machines, and then 10 times on four machines, to get average numbers) with 10³, 10⁴, 10⁵, 10⁶, 10⁷, and 10⁸ rows. Each <number of machines, number of rows> pair was run as a separate performance test. For each test, the hyperparameters most appropriate for the given algorithm were used. Wow! A lot of runs.

And don’t forget about an important implementation detail: the number of partitions in the initial cache and, as a result, in the partition-based dataset. To obtain maximum performance for most algorithms, there should be roughly 1_000–10_000 rows per partition, but I tested with a fixed number of partitions for 10_000 and 10_000_000 rows.
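The partition count is set on the cache configuration via the affinity function. A minimal sketch (the 10_000-rows-per-partition target is just the rule of thumb above; in my runs the partition count was fixed per experiment, e.g. 10, 100, or 1024):

import org.apache.ignite.cache.affinity.rendezvous.RendezvousAffinityFunction;
import org.apache.ignite.configuration.CacheConfiguration;
import org.apache.ignite.ml.math.primitives.vector.Vector;

// Aim for roughly 1_000-10_000 rows per partition.
long rows = 10_000_000L;
int parts = (int) Math.max(1, rows / 10_000);

CacheConfiguration<Integer, Vector> cfg = new CacheConfiguration<>("ML_DATA");
cfg.setAffinity(new RendezvousAffinityFunction(false, parts));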

In the following sections, I share some notes about how to run Ignite ML algorithms to get the best performance.

  • DataRegionConfiguration.maxSize = 10G can be set in ignite-server.xml, for example as follows (you can find more information in the official documentation):

<property name="dataStorageConfiguration">
  <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
    <!-- Redefining the default region's settings. -->
    <property name="defaultDataRegionConfiguration">
      <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
        <property name="maxSize" value="#{10L * 1024L * 1024L * 1024L}"/>
      </bean>
    </property>
  </bean>
</property>

Random Forest

In Apache Ignite 2.8, I ran Ignite ML with the following hyperparameters: 100 trees in the ensemble and otherwise default Random Forest settings.

In the following table, I posted the data for the initial setting of 1024 partitions on four machines, with -Xmx80G and OffHeap = 10G on the client and server nodes.
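For reference, a rough sketch of such a trainer with the 2.8 API (the feature metadata below is illustrative, and the exact builder-method names may differ slightly between releases); `ignite` and `dataCache` are assumed to come from the generator sketch above:

import java.util.Arrays;
import java.util.List;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.ml.IgniteModel;
import org.apache.ignite.ml.dataset.feature.FeatureMeta;
import org.apache.ignite.ml.dataset.feature.extractor.Vectorizer;
import org.apache.ignite.ml.dataset.feature.extractor.impl.DummyVectorizer;
import org.apache.ignite.ml.math.primitives.vector.Vector;
import org.apache.ignite.ml.tree.randomforest.RandomForestClassifierTrainer;

public class RandomForestExample {
    public static IgniteModel<Vector, Double> train(Ignite ignite, IgniteCache<Integer, Vector> dataCache) {
        // Feature metadata: name, index, isCategorical (illustrative values).
        List<FeatureMeta> meta = Arrays.asList(
            new FeatureMeta("f0", 0, false),
            new FeatureMeta("f1", 1, false),
            new FeatureMeta("f2", 2, true));

        // 100 trees in the ensemble, everything else left at defaults.
        RandomForestClassifierTrainer trainer = new RandomForestClassifierTrainer(meta)
            .withAmountOfTrees(100);

        // The label is read from the first coordinate of each cached Vector.
        Vectorizer<Integer, Vector, Integer, Double> vectorizer =
            new DummyVectorizer<Integer>().labeled(Vectorizer.LabelCoordinate.FIRST);

        return trainer.fit(ignite, dataCache, vectorizer);
    }
}

The remaining sketches in this post reuse the same `ignite`, `dataCache`, and `vectorizer` setup and show only the trainer itself.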

Linear Regression with LSQR

Training time with this super-fast solver doesn’t really depend on the number of rows. I tested the solver on various datasets, each with a different number of columns, and found that both the required RAM and the training time grow linearly with the number of features (columns in the dataset). To find the coefficients of a linear regression, this trainer is the best choice.
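A sketch of the trainer (plugged into the same cache and vectorizer setup as in the Random Forest sketch above):

import org.apache.ignite.ml.regressions.linear.LinearRegressionLSQRTrainer;
import org.apache.ignite.ml.regressions.linear.LinearRegressionModel;

// No tuning needed: the LSQR solver is run with its defaults.
LinearRegressionLSQRTrainer trainer = new LinearRegressionLSQRTrainer();

LinearRegressionModel mdl = trainer.fit(ignite, dataCache, vectorizer);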

Linear Regression with SGD Optimizer

In distributed systems, SGD is a workhorse, but SGD is a relatively slow optimization method.

I ran it with the following hyperparameters: maxIterations = 1000, batchSize = 100, locIterations = 100 on 1, 2, and 4 nodes. In each case, training time scaled close to linearly (n times faster on n nodes).
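A sketch of such a trainer with those hyperparameters (the update-strategy arguments follow the public Ignite examples as I recall them and may differ between versions; the cache and vectorizer are as in the Random Forest sketch above):

import org.apache.ignite.ml.nn.UpdatesStrategy;
import org.apache.ignite.ml.optimization.updatecalculators.RPropParameterUpdate;
import org.apache.ignite.ml.optimization.updatecalculators.RPropUpdateCalculator;
import org.apache.ignite.ml.regressions.linear.LinearRegressionModel;
import org.apache.ignite.ml.regressions.linear.LinearRegressionSGDTrainer;

// Constructor arguments: update strategy, maxIterations, batchSize, locIterations, seed.
LinearRegressionSGDTrainer<?> trainer = new LinearRegressionSGDTrainer<>(
    new UpdatesStrategy<>(
        new RPropUpdateCalculator(),
        RPropParameterUpdate.SUM_LOCAL,
        RPropParameterUpdate.AVG),
    1000,   // maxIterations
    100,    // batchSize
    100,    // locIterations
    123L);  // seed

LinearRegressionModel mdl = trainer.fit(ignite, dataCache, vectorizer);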

The following table displays the results of performance testing on 4 nodes with a total of 10 dataset partitions and a variable number of features:

The following table reports results for a run with ten features, rather than with a variable number of features:

LinReg with LSQR or SGD can be run on four nodes (with each node requiring 100GB of RAM) and be trained on 10⁸ rows in one to two minutes.

Decision Tree (for Classification)

The Decision Tree trainer is the same trainer that is used to train the individual trees in Random Forest. Unlike Random Forest, when the decision-tree trainer is used alone, it trains on all the observed data without sub-sampling techniques.

I tested this algorithm with two hyperparameters: maxDeep=5, minImpurity=0.
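A sketch of the trainer with those two hyperparameters (same cache and vectorizer setup as above):

import org.apache.ignite.ml.IgniteModel;
import org.apache.ignite.ml.math.primitives.vector.Vector;
import org.apache.ignite.ml.tree.DecisionTreeClassificationTrainer;

// Constructor arguments: maxDeep = 5, minImpurityDecrease = 0.
DecisionTreeClassificationTrainer trainer = new DecisionTreeClassificationTrainer(5, 0);

IgniteModel<Vector, Double> mdl = trainer.fit(ignite, dataCache, vectorizer);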

In the following table, you can find the performance-testing results on four machines, with a total of 10 dataset partitions and a variable number of features: 10, 100, and 1000.

I should note that the decision-tree trainer does not scale well and that the best choice is to use random forest, with scaling on the number of trees, instead of on a single decision tree.

Unfortunately, I couldn’t get past the barrier of 10⁷ rows on four machines, each with 100GB of RAM.

SVM

The SVM linear-classification trainer is based on the communication-efficient, distributed, dual coordinate ascent algorithm (CoCoA). In my case, working with 10⁸ rows was easy, but for 10⁹ rows, there was not enough RAM on a few nodes.

I ran the trainer for 10, 100, and 1024 partitions and found that 10 partitions are better for 0–10⁵ rows, 100 partitions are better for 10⁵–10⁶ rows, and 1024 partitions are better for more than 10⁶ rows.

The following results were produced on a cluster with 4 nodes, a cache with 1024 partitions, and default SVM hyperparameters.
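A sketch of the trainer with the default hyperparameters (same cache and vectorizer setup as above):

import org.apache.ignite.ml.svm.SVMLinearClassificationModel;
import org.apache.ignite.ml.svm.SVMLinearClassificationTrainer;

// Default SVM hyperparameters, as in the runs above.
SVMLinearClassificationTrainer trainer = new SVMLinearClassificationTrainer();

SVMLinearClassificationModel mdl = trainer.fit(ignite, dataCache, vectorizer);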

For example, 10⁸ rows with 10 features were handled within one minute, and the produced model was good enough (acc ~ 90%).

Also, I observed that the duration did not depend on the number of features. At the very end, I encountered issues with garbage collection and the speed of memory allocation, which caused a noticeable, but not dramatic, increase in training time.

With eight machines with 200GB each, I expect that you will beat 10⁹ rows with ease.

KNN

This algorithm is special because there is no training phase and the time required for batch prediction depends on the batch size (with n² being the worst case). But, we keep this algorithm in our library because the models that it provides are very accurate and very fast on medium-size datasets.

I highly recommend that you run this algorithm on an average number of partitions: 50–200.
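A sketch of KNN usage as I recall the 2.8-era API (k, the distance measure, and the voting strategy are configured on the model; later releases changed this, so treat the method names as approximate):

import org.apache.ignite.ml.knn.classification.KNNClassificationModel;
import org.apache.ignite.ml.knn.classification.KNNClassificationTrainer;
import org.apache.ignite.ml.knn.classification.NNStrategy;
import org.apache.ignite.ml.math.distances.EuclideanDistance;

KNNClassificationTrainer trainer = new KNNClassificationTrainer();

// There is no real training phase; the "model" keeps a reference to the data.
KNNClassificationModel mdl = trainer.fit(ignite, dataCache, vectorizer)
    .withK(5)
    .withDistanceMeasure(new EuclideanDistance())
    .withStrategy(NNStrategy.WEIGHTED);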

Show me the numbers, dude! Here they are:

K-Means

K-means is the most popular clustering algorithm. It is an iterative algorithm, and we could keep searching for the best clusters (the clusters that minimize the pairwise squared deviations of points within the same cluster) until the sun goes supernova.

That is why I share the settings of my runs: if you run the same iterative algorithm with the same settings and a finite number of iterations, you have a chance of finishing close to my time numbers.

This distributed implementation is close to linear scalability, and increasing the number of features by a factor of 10 increases the training time by roughly a factor of 10.

The following results were produced on an Ignite cluster with 4 nodes and a cache with 1024 partitions. The algorithm was launched with default settings in the task of searching for two clusters in the dataset.
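A sketch of the trainer configured to search for two clusters (same cache and vectorizer setup as above; the builder-method names are from the 2.8 docs as I recall them):

import org.apache.ignite.ml.clustering.kmeans.KMeansModel;
import org.apache.ignite.ml.clustering.kmeans.KMeansTrainer;

// Two clusters, everything else at defaults.
KMeansTrainer trainer = new KMeansTrainer().withAmountOfClusters(2);

KMeansModel mdl = trainer.fit(ignite, dataCache, vectorizer);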

The following table focuses on trainings on a dataset with 10 features:

This algorithm is also used internally in the next one, a crazy mix of KNN and clustering ideas: the ANN algorithm.

ANN

I really like this algorithm and am proud to have it in the toolbox of our framework.

During the prediction phase, ANN’s behavior is similar to KNN’s behavior. However, the set of candidate nearest neighbors is not the whole dataset: it is limited by the value of the hyperparameter k (see the documentation). Internally, ANN uses the k-means algorithm during the training phase to find these k candidate points, the cluster centers.

As a result, training time depends on the number of clusters, and the number of clusters defines how accurate the model will be. Like in strategy games, you need to find the balance between desired accuracy and available resources you can use to train a model.

The following results were produced on an Ignite cluster with 4 nodes, a dataset with 10 columns, and a cache with 100 partitions. The ANN trainer was run with two values of the hyperparameter k, 100 and 1000 nearest neighbors, to compare training times.
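A sketch of the trainer with k = 100 (the second run used k = 1000); the builder-method names are from the docs as I recall them and may differ between versions:

import org.apache.ignite.ml.IgniteModel;
import org.apache.ignite.ml.knn.ann.ANNClassificationTrainer;
import org.apache.ignite.ml.math.distances.EuclideanDistance;
import org.apache.ignite.ml.math.primitives.vector.Vector;

// k = 100 candidate points found by the internal k-means run during training.
ANNClassificationTrainer trainer = new ANNClassificationTrainer()
    .withK(100)
    .withDistance(new EuclideanDistance());

IgniteModel<Vector, Double> mdl = trainer.fit(ignite, dataCache, vectorizer);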

The accuracy of ANN is close to the accuracy of KNN, but ANN can be trained on datasets that are up to 100 times larger.

GMM

This algorithm was added at the end of 2019 when it joined the family of clustering algorithms.

The Gaussian mixture model (GMM) is not like k-means. GMM requires more memory and CPU ops, but GMM can be used to build more accurate models and take into account the distribution characteristics of the generated data.

GMM almost reaches linear scalability, and training time is close to linearly dependent on the number of features.

The following results were produced on an Ignite cluster with 4 nodes, a dataset with 10, 50, or 100 columns, and a cache with 100 partitions. The algorithm was run with default hyperparameter values.
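A sketch of the trainer (the number of mixture components is passed explicitly here for illustration; the runs above used the defaults, and the exact constructor signature may differ between versions):

import org.apache.ignite.ml.clustering.gmm.GmmModel;
import org.apache.ignite.ml.clustering.gmm.GmmTrainer;

// Two mixture components; everything else at defaults.
GmmTrainer trainer = new GmmTrainer(2);

GmmModel mdl = trainer.fit(ignite, dataCache, vectorizer);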

The following table displays the results of training on a dataset with 10 features:

Logistic Regression

The current implementation of LogReg is a multilayer perceptron with a single layer, fixed parameters, and an SGD solver.

As a result, this algorithm has linear scalability for several nodes, and training time is proportional to the number of features.

Moreover, Logistic Regression is a super-fast, binary classification trainer that can train the model on 100 000 000 rows (with 10 columns per row = 1 000 000 000 doubles) within 1.5 minutes on a cluster with four nodes and a total of 400GB of RAM.

The following results were produced on an Ignite cluster with 4 nodes, a dataset with 10, 100, or 1000 columns, and a cache with 1024 partitions. The algorithm was run with default hyperparameter values.
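A sketch of the trainer with the default hyperparameters (same cache and vectorizer setup as above; builder methods for the iteration count, batch size, and local iterations exist for tuning, but I leave them out here since their names vary between versions):

import org.apache.ignite.ml.regressions.logistic.LogisticRegressionModel;
import org.apache.ignite.ml.regressions.logistic.LogisticRegressionSGDTrainer;

// Default hyperparameter values, as in the runs above.
LogisticRegressionSGDTrainer trainer = new LogisticRegressionSGDTrainer();

LogisticRegressionModel mdl = trainer.fit(ignite, dataCache, vectorizer);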

The following table focuses on trainings on a dataset with 10 features:

Boosted Logistic Regression

Also, I tested one of the most popular ensemble techniques, boosting, with the best trainer — LogReg.

The following results were produced on an Ignite cluster with 4 nodes, a dataset with 10, 100, or 1000 columns, and a cache with 100 partitions. The algorithm was run with default hyperparameter values.

GBT (Gradient Boosted Trees)

Also, we have our own implementation of gradient boosted trees, a widely known ensemble algorithm.

The following results were produced on a cluster with 4 nodes, a dataset with 10 columns, and a cache with 1024 partitions. The default GBT hyperparameters were used.

The following results were produced on a cluster with 4 nodes, a dataset with 10 columns, and a cache with 10 partitions. The default GBT hyperparameters were used.
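A sketch of the trainer (the constructor values below are illustrative; the runs above used the default GBT hyperparameters, and the exact signature may differ between versions):

import org.apache.ignite.ml.IgniteModel;
import org.apache.ignite.ml.math.primitives.vector.Vector;
import org.apache.ignite.ml.tree.boosting.GDBBinaryClassifierOnTreesTrainer;

// Constructor arguments: learning rate, boosting iterations, max tree depth, min impurity decrease.
GDBBinaryClassifierOnTreesTrainer trainer =
    new GDBBinaryClassifierOnTreesTrainer(1.0, 300, 3, 0.0);

IgniteModel<Vector, Double> mdl = trainer.fit(ignite, dataCache, vectorizer);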

So, this is the end of our performance journey. I hope that our journey will be helpful to you in your own performance experiments.

P.S. In many cases, algorithms fail (with GC limit exceeded and OutOfMemory errors) because more and more memory is required. Each algorithm has its own memory requirement for intermediate calculations, depending upon how many rows are in the training dataset.

Alexey Zinoviev

Apache Ignite Committer/PMC; Machine Learning Engineer in JetBrains