Theory and practice of hyperparameter tuning with Apache Ignite ML

Alexey Zinoviev
Dec 9, 2020 · 18 min read

How to build the best model

Data scientists train models by running training algorithms. There are many different ways to build a Linear Regression model or a Decision Tree model. And each training algorithm (trainer) has several tunable settings whose values can be changed before training starts. Every value can affect the final result (that is, the trained model).

What happens if we try to find the tunable settings’ best values and re-use them to train the best model?

Before we start, we need to answer a few questions:

  • How do we determine that a given set of tunable settings is the best?
  • Where is the boundary between what is and what is not a tunable setting?
  • How do we find the best values for the tunable settings?
  • How can we ensure that our search for the best values for tunable settings is compatible with the modern practice of training models?

Let’s keep in mind that, in machine learning, such settings are known as “hyperparameters” and that the task of searching for optimal hyperparameter values is known as “HPO” (the hyperparameter optimization problem).

Hyperparameters and model parameters

Let’s start from the basic definition: What is and is not a hyperparameter?

Newcomers to ML often confuse the model parameters with the hyperparameters of training algorithms.

Model parameters are learned during training, and hyperparameters, on the contrary, must be set by a data scientist or AutoML before training.

Good examples of model parameters are the slope and intercept in linear regression or a weight tensor in a convolutional layer — we try to find the best parameters during a single training run. On the other hand, the initial learning rate in the SGD optimization algorithm or the number of trees in a Random Forest must be defined before training and can differ between training runs.

The main idea of HPO is to find the hyperparameters of a given ML algorithm or pipeline that yield the best metric value, as measured on a validation set.

Yes, we could treat the best hyperparameters as parameters that are learned from the validation dataset rather than directly from within ML algorithms. And, I suppose that, with the AutoML era (in the near future), the difference between model parameters and hyperparameters will be erased.

Math description of HPO (Hyperparameter Optimization Problem)

Hyperparameter optimization can be formulated as a bilevel (two-level) optimization problem, where the optimal parameters on the training set depend on the hyperparameters.

Let w denote parameters (e.g., weights and biases) and λ denote hyperparameters (e.g., dropout probability). Let L_t and L_v be functions mapping parameters and hyperparameters to training and validation losses, respectively. We aim to solve

λ* = argmin_λ L_v(w*(λ), λ)

subject to:

w*(λ) = argmin_w L_t(w, λ)

where w*(λ) denotes the best-response parameters, that is, the weights that minimize the training loss for a fixed choice of hyperparameters λ.

(Formulation from [1].)

A bilevel optimization problem consists of two sub-problems called the upper-level and lower-level problems, where the upper-level problem must be solved subject to the optimality of the lower-level problem. Minimax problems are an example of bilevel programs where the upper-level objective equals the negative lower-level objective.

Bilevel programs were first studied in economics to model leader/follower firm dynamics (Von Stackelberg, 2010) and have since found uses in various fields (see Colson et al. (2007) for an overview). In machine learning, many problems can be formulated as bilevel programs, including hyperparameter optimization, GAN training (Goodfellow et al., 2014), meta-learning, and neural architecture search (Zoph & Le, 2016).

Even if all objectives and constraints are linear, bilevel problems are strongly NP-hard (Hansen et al., 1992; Vicente et al., 1994). The NP-hardness of the HPO task is the reason we should avoid direct methods and brute-force techniques and prefer heuristic algorithms with polynomial complexity (even if that polynomial complexity is not proven as an upper bound but only observed in practice).

ML algorithms that are supported in the Apache Ignite ML framework

Let’s list all the currently available Ignite ML training algorithms and the algorithms’ tunable hyperparameters.

I will start with algorithms that are based on different versions of SGD (stochastic gradient descent).

LinearRegressionSGDTrainer

  • maxIterations — the maximum number of iterations before training stops
  • batchSize — the batch size for the SGD-based updating algorithm
  • locIterations — the maximum number of local iterations before synchronization
  • seed — the seed for the random number generator
  • updatesStgy — an analog of the optimizer in DL frameworks

LogisticRegressionSGDTrainer

  • maxIterations — the maximum number of iterations before training stops
  • batchSize — the batch size for the SGD-based updating algorithm
  • locIterations — the maximum number of local iterations before synchronization
  • seed — the seed for the random number generator
  • updatesStgy — an analog of the optimizer in DL frameworks

SVMLinearClassificationTrainer

  • amountOfIterations — the number of outer SDCA algorithm iterations
  • amountOfLocIterations — the number of local SDCA algorithm iterations
  • lambda — a regularization parameter
  • seed — the seed for the random number generator

MLPTrainer

  • arch — a neural network architecture that defines layers and activation functions
  • loss — a loss function to be minimized during the training
  • maxIterations — the maximum number of iterations before training stops
  • batchSize — the batch size for the SGD-based updating algorithm
  • locIterations — the maximum number of local iterations before synchronization
  • seed — the seed for the random number generator
  • updatesStgy — an analog of the optimizer in DL frameworks

The sets of hyperparameters for the four algorithms are similar because all of the algorithms use an iterative optimization algorithm internally. The number of iterations and the type of optimizer play an essential role. Even minor changes in the hyperparameters can produce significant differences in the predictive abilities of the resulting models.
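To make these knobs concrete, here is a minimal sketch of constructing a LinearRegressionSGDTrainer with explicit hyperparameter values. It follows the public Ignite ML examples; the exact constructor signature and update-calculator classes may vary between Ignite versions.

```java
// Imports from org.apache.ignite.ml.* are omitted for brevity.
// Assumption: the constructor arguments follow the public Ignite ML examples
// (updatesStgy, maxIterations, batchSize, locIterations, seed).
LinearRegressionSGDTrainer<?> trainer = new LinearRegressionSGDTrainer<>(
    new UpdatesStrategy<>(                // updatesStgy: the optimizer analog
        new RPropUpdateCalculator(),      // how a single update is computed
        RPropParameterUpdate.SUM_LOCAL,   // how local updates are combined
        RPropParameterUpdate.AVG          // how updates from different nodes are merged
    ),
    100_000, // maxIterations
    10,      // batchSize
    100,     // locIterations
    123L     // seed
);
```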

Now, let’s move on to the widely used family of metric clustering and classification algorithms.

KMeansTrainer

  • k — the number of clusters to be discovered
  • maxIterations — the number of iterations
  • epsilon — the numerical value that is used to check the convergence on each iteration
  • distance — an ML framework distance measure, such as Euclidean, Hamming, or Manhattan

GmmTrainer

  • maxCountOfClusters — the number of possible clusters
  • maxCountOfIterations — the number of iterations, a stop criterion (the other stop criterion is epsilon)
  • epsilon — delta between the values of the old and new centroids
  • countOfComponents — the number of components
  • maxLikelihoodDivergence — the maximum divergence between the maximum likelihoods of two vectors in the dataset, used for anomaly identification
  • minElementsForNewCluster — the minimum number of required anomalies, in terms of maxLikelihoodDivergence, for creating a cluster
  • minClusterProbability — the minimum cluster probability

KNNClassificationTrainer and KNNRegressionTrainer

  • k — the number of nearest neighbors
  • distanceMeasure — an ML framework’s distance metric, such as Euclidean, Hamming, or Manhattan
  • isWeighted — false by default; if true, this hyperparameter enables the weighted KNN algorithm
  • dataCache — a holder for a set of training objects for which the class is known
  • indexType — a distributed spatial index that has three values: ARRAY, KD_TREE, and BALL_TREE

ANNClassificationTrainer

  • k — the number of candidate points for the prediction phase of the search of nearest neighbors
  • maxIterations — the number of iterations
  • epsilon — the numerical value that is used to check the convergence on each iteration
  • distance — an ML framework’s distance measure, such as Euclidean, Hamming, or Manhattan

Let’s move on to growing trees and forest thickets. This kind of gardening has its own specifics, tied to the amount of information and to entropy.

DecisionTreeClassificationTrainer and DecisionTreeRegressionTrainer

  • maxDeep — the maximum depth of the tree
  • minImpurityDecrease — the minimum impurity decrease
  • usingIdx — a setting that enables the use of an index structure instead of using sorting while learning.

RandomForestClassifierTrainer

  • featuresCountSelectionStrgy — a strategy that defines the count of random features for one tree learning
  • maxDepth — the maximum depth of the tree
  • minImpurityDelta — the minimum impurity delta required to split a node; a node in a decision tree is split into two nodes only if the impurity delta of the split is not less than this value
  • subSampleSize — a value in the [0; MAX_DOUBLE] interval that defines the count of sample repetitions during uniform sampling with replacement
  • seed — the seed value that is used in random generators

GDBBinaryClassifierOnTreesTrainer and GDBRegressionOnTreesTrainer

  • maxDeep — the maximum depth of the tree
  • minImpurityDecrease — the minimum impurity decrease
  • usingIdx — a setting that enables the use of an index structure instead of sorting during learning
  • gradStepSize — the constant weight of each model in the composition
  • cntOfIterations — the maximum number of iterations
  • checkConvergenceFactory — the factory for construction of the convergence checker that is used to prevent the overfitting and learning of useless models during training

Finally, consider the last representative of the ensemble algorithms, bagging, which can use different trainer algorithms (not only trees).

BaggedTrainer

  • ensembleSize — the number of trained models in the ensemble
  • subsampleRatio — the subsample ratio of the dataset
  • aggregator — the type of ensemble model prediction aggregation
  • featureVectorSize — feature vector dimensionality
  • featuresSubspaceDim — feature subspace dimensionality

A few of the training algorithms have no tunable hyperparameters at all.

There is no need to remember all of these parameters, but I hope that you are impressed by the number of tunable settings in the Apache Ignite ML framework.

Preprocessors

The training process in Ignite ML is not limited to the trainer itself. Various preprocessors can help normalize numbers or convert strings to numbers.

Ignite ML now supports the tuning of the hyperparameters of the preprocessor trainers. For example, in NormalizationTrainer, a user can tune the hyperparameter ‘p’ (the parameter that is responsible for normalization in L^p space), and, in ImputerTrainer, the user can tune the imputing strategy.
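As a minimal sketch (ignite, dataCache, and vectorizer are assumed to be defined elsewhere; imports are omitted), setting p on a NormalizationTrainer looks roughly like this:

```java
// Assumption: ignite, dataCache and vectorizer are defined elsewhere;
// imports from org.apache.ignite.ml.preprocessing.* are omitted for brevity.
NormalizationTrainer<Integer, Vector> normalizationTrainer = new NormalizationTrainer<Integer, Vector>()
    .withP(2); // the 'p' hyperparameter: normalize rows in L^2 space

// The fitted preprocessor can then be chained in front of any trainer.
Preprocessor<Integer, Vector> normalizer = normalizationTrainer.fit(ignite, dataCache, vectorizer);
```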

NOTE: The right choice of column scaler is a well-known hyperparameter tuning issue. Currently, Ignite ML offers a few scalers: MinMaxScaler, MaxAbsScaler, and StandardScaler. Unfortunately, the ability to tune the column scaler as a hyperparameter is not yet supported.

When the AutoML elements are added to the ML framework, the issue of column-scaler choice will be resolved.

Model evaluation

Apache Ignite ML provides a suite of classification and regression metrics to evaluate the prediction accuracy of ML models.

There are many classification algorithms. However, classification models are evaluated in similar ways. For each data point in a supervised classification problem, there are two outputs, a true output and a model-generated, predicted output. Therefore, the results for each data point can be assigned to one of four categories:

  • True Positive (TP) — The label is positive, and the prediction is also positive.
  • True Negative (TN) — The label is negative, and the prediction is also negative.
  • False Positive (FP) — The label is negative, but the prediction is positive.
  • False Negative (FN) — The label is positive, but the prediction is negative.

The following is a list of the binary classification metrics that are supported in Apache Ignite ML:

  • Accuracy
  • Balanced accuracy
  • F-Measure
  • FallOut
  • FN
  • FP
  • FDR
  • MissRate
  • NPV
  • Precision
  • Recall
  • Specificity
  • TN
  • TP

The explanations and formulas for these metrics can be found here.

NOTE: In Apache Ignite ML, multiclass classification evaluation is not yet supported.

When a continuous output variable is predicted from several independent variables, regression analysis is used.

The following is a list of the regression metrics that are supported in Apache Ignite ML:

  • MAE
  • R2
  • RMSE
  • RSS
  • MSE

NOTE: The metric calculations run in a distributed manner, without collecting all the predicted or actual (ground truth) labels on one client node.
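For illustration, a hedged sketch of computing one of these metrics through the Evaluator facade (mdl, dataCache, and vectorizer are assumed to exist; the exact overloads and metric classes differ between Ignite versions):

```java
// Assumption: mdl, dataCache and vectorizer are defined elsewhere;
// imports from org.apache.ignite.ml.selection.scoring.* are omitted for brevity.
double accuracy = Evaluator.evaluate(
    dataCache,        // the cache with labeled data
    mdl,              // the trained model
    vectorizer,       // extracts features and the label from cache entries
    new Accuracy<>()  // the metric to compute (distributed, no label collection on the client)
);

System.out.println(">>> Accuracy: " + accuracy);
```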

Test-train splitting

It is not good practice to learn and validate the model parameters on the same data; this practice leads to overfitting. One of the most efficient solutions is to hold out part of the data and use it as a test set.

In Apache Ignite ML, test-train splitting means filtering the data that is stored in a cache into two parts: a training part, used to train the model, and a test part, used to estimate the model quality. The filtering is performed according to particular conditions; in reality, no data transfer or network shuffle occurs.

All fit() methods have a parameter that passes a filter condition to each cache.

NOTE: Due to the distributed and lazy nature of dataset operations, the dataset split is a lazy operation. The operation could be defined as a filter condition that is applied to the initial cache to form both the train and test datasets.

In the following example, the model is trained on only 75% of the initial dataset. The value of the filter parameter comes from split.getTrainFilter(), which either keeps or rejects each row of the initial dataset.
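A hedged reconstruction of such a 75/25 split, based on the public Ignite ML examples (ignite, dataCache, and vectorizer are assumed to exist; class and method names may differ slightly between versions):

```java
// Assumption: ignite, dataCache and vectorizer are defined elsewhere;
// imports from org.apache.ignite.ml.selection.split.* and friends are omitted for brevity.

// A lazy 75/25 split: only filter conditions are created, no data is moved.
TrainTestSplit<Integer, Vector> split = new TrainTestDatasetSplitter<Integer, Vector>()
    .split(0.75);

DecisionTreeClassificationTrainer trainer = new DecisionTreeClassificationTrainer(5, 0.0);

// Train on the 75% of rows accepted by the train filter (var keeps the model type version-agnostic).
var mdl = trainer.fit(ignite, dataCache, split.getTrainFilter(), vectorizer);

// Estimate quality on the remaining 25%.
double accuracy = Evaluator.evaluate(dataCache, split.getTestFilter(), mdl, vectorizer, new Accuracy<>());
```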

Cross-validation types

Okay, test-train splitting helps to avoid overfitting. However, by partitioning the available data and excluding one or more subsets of the data from the training set, we significantly reduce the number of samples that are used to learn the model parameters. The results depend on a particular random choice for the pair of train and validation subsets.

The cross-validation approach solves this problem. Apache Ignite ML implements K-fold cross-validation:

  1. The training set is split into multiple sub-datasets. The k parameter defines the number of sub-datasets.
  2. Ignite ML uses k-1 of the folds (subsets) as training data and trains a model.
  3. The resulting model is validated on the data that was not used for training.

Apache Ignite provides cross-validation functionality that lets you parameterize the trainer that is to be validated, the metrics that are to be calculated for the model, and the number of folds into which the training data should be split.

You can find more information about cross-validation parameters in the documentation.

The user can also pass split.getTrainFilter() to the CrossValidation object to perform cross-validation on only the training data.
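A hedged sketch of a plain 3-fold cross-validation run over the training part only (the builder-style method names follow the Ignite 2.9-era examples and may differ in other versions; trainer, split, and vectorizer are assumed from the previous sketch):

```java
// Assumption: ignite, dataCache, trainer, split and vectorizer are defined as in the previous sketch;
// imports from org.apache.ignite.ml.selection.cv.* are omitted for brevity.
// The model type parameter (here DecisionTreeNode) must match the trainer's model class.
CrossValidation<DecisionTreeNode, Integer, Vector> scoreCalculator = new CrossValidation<>();

double[] scores = scoreCalculator
    .withIgnite(ignite)
    .withUpstreamCache(dataCache)
    .withTrainer(trainer)
    .withMetric(MetricName.ACCURACY)
    .withFilter(split.getTrainFilter()) // cross-validate on the training part only
    .withPreprocessor(vectorizer)
    .withAmountOfFolds(3)
    .scoreByFolds();
```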

Hyperparameter tuning process

In many cases, manual tuning is not an option because we need to tune multiple parameters. Of course, we could write a few nested loops to iterate through all possible combinations of parameters, as in the following example:
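For example, a hand-rolled sketch over two decision-tree hyperparameters might look like this (it reuses the assumed ignite, dataCache, split, and vectorizer from the earlier sketches):

```java
// Assumption: ignite, dataCache, split and vectorizer are defined as in the earlier sketches.
double bestAccuracy = 0.0;
int bestMaxDeep = -1;
double bestMinImpurityDecrease = Double.NaN;

for (int maxDeep : new int[] {1, 2, 3, 4, 5, 10}) {
    for (double minImpurityDecrease : new double[] {0.0, 0.25, 0.5}) {
        DecisionTreeClassificationTrainer trainer =
            new DecisionTreeClassificationTrainer(maxDeep, minImpurityDecrease);

        var mdl = trainer.fit(ignite, dataCache, split.getTrainFilter(), vectorizer);

        double accuracy = Evaluator.evaluate(
            dataCache, split.getTestFilter(), mdl, vectorizer, new Accuracy<>());

        if (accuracy > bestAccuracy) {
            bestAccuracy = accuracy;
            bestMaxDeep = maxDeep;
            bestMinImpurityDecrease = minImpurityDecrease;
        }
    }
}

System.out.printf("Best accuracy %.4f with maxDeep=%d, minImpurityDecrease=%.2f%n",
    bestAccuracy, bestMaxDeep, bestMinImpurityDecrease);
```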

And, of course, every software engineer’s first idea is to automate the process of parameter-set generation and injection in training algorithms and preprocessors.

But, I am happy to announce that, beginning with the 2.8 release of the Ignite ML framework, the automatic hyperparameter tuning feature is available. Now, Ignite ML includes a built-in library for serial and parallel optimization over awkward discrete search spaces.

Also, a feature to support search spaces with real-valued, discrete, and conditional dimensions is on the roadmap. Most likely, Apache Ignite 3.0 will ship with this feature on board.

ParamGrid

The primary object that keeps all possible values of hyperparameters is the ParamGrid object.

There are multiple approaches to finding the optimal set of hyperparameters:

  • Brute Force (Grid Search) — The traditional way of performing hyperparameter optimization is grid search (a parameter sweep). Grid search is an exhaustive search through a manually specified subset of the hyperparameter space of a learning algorithm.
  • Random search — Random search replaces the exhaustive enumeration of all combinations by selecting them randomly.
  • Evolutionary optimization — Evolutionary optimization is a methodology for the global optimization of noisy black-box functions. In hyperparameter optimization, evolutionary optimization uses evolutionary algorithms to search the hyperparameter space for a given algorithm.

The ParamGrid object supports some setter methods that set up the search-strategy parameters.

After the ParamGrid object is created, it can be passed to the CrossValidation object via the withParamGrid() method.

Firstly, we will consider the basic HPO technique, Grid Search (or Brute Force).

The main idea of this approach is to run the ML pipeline with every possible combination of parameters and collect the final metric scores. Once we have a filled-in table with two columns, “Parameter names and values” and “Metric value,” it’s easy to find the best hyperparameter combination and reuse it for training from scratch.

We should keep in mind the combinatorial explosion in the number of parameter combinations as the number of hyperparameters and their candidate values grows.

For example, if we have two hyperparameters with ten variants each, 10*10 = 100 training runs are required. If we increase the number of hyperparameters from 2 to 5, keeping ten variants for each, 10*10*10*10*10 = 100 000 training runs are required. And I suppose that nobody would wait long enough for 100 000 training runs to finish.
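As a hedged sketch (reusing the scoreCalculator from the cross-validation section; the ParamGrid and CrossValidationResult method names follow the public Ignite ML tuning examples), a brute-force grid over the two decision-tree hyperparameters looks roughly like this:

```java
// Assumption: scoreCalculator is configured as in the cross-validation sketch above.
// Brute force (grid search) is the default ParamGrid search strategy.
DecisionTreeClassificationTrainer trainerCV = new DecisionTreeClassificationTrainer();

ParamGrid paramGrid = new ParamGrid()
    .addHyperParam("maxDeep", trainerCV::withMaxDeep,
        new Double[] {1.0, 2.0, 3.0, 4.0, 5.0, 10.0})
    .addHyperParam("minImpurityDecrease", trainerCV::withMinImpurityDecrease,
        new Double[] {0.0, 0.25, 0.5});

CrossValidationResult res = scoreCalculator
    .withTrainer(trainerCV)
    .withParamGrid(paramGrid)
    .tuneHyperParameters(); // 6 * 3 = 18 training runs for this grid

System.out.println("Best maxDeep: " + res.getBest("maxDeep"));
System.out.println("Best minImpurityDecrease: " + res.getBest("minImpurityDecrease"));
```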

You can find a full example here.

Parallel execution

When I ran hyperparameter tuning for scikit-learn models, for the Random Search and Brute Force cases, I often wrote hand-made parallelization to run portions of the training in parallel.

I was confused that a library such as Hyperopt uses Apache Spark or MongoDB to run parallel training and does not have built-in parallelization. Running external JVM executors to start parallel execution is too much for me.

Of course, after ParamGrid was implemented in Apache Ignite ML, the next stop for our developers’ train was speeding up the brute-force search. If you have enough free memory and low CPU utilization, it’s a good idea to start another training run without waiting for the previous one to finish.

A brute-force search of all possible combinations, where each training runs independently, can be easily parallelized without synchronization barriers.

To run the parallel version of Cross-Validation, you need to change ParallelismStrategy from NO_PARALLELISM to ON_DEFAULT_POOL in LearningEnvironmentBuilder (CrossValidation parameter), as in the following example:
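A hedged sketch follows; the LearningEnvironmentBuilder method names and the location of the ON_DEFAULT_POOL constant are taken from the public examples and may differ between Ignite versions:

```java
// Assumption: scoreCalculator and paramGrid are configured as in the previous sketches.
// NOTE: depending on the Ignite version, the constant may live in ParallelismStrategy.Type
// and the builder method name may differ slightly.
CrossValidationResult res = scoreCalculator
    .withEnvironmentBuilder(LearningEnvironmentBuilder.defaultBuilder()
        .withParallelismStrategyTypeDependency(ParallelismStrategy.ON_DEFAULT_POOL))
    .withParamGrid(paramGrid)
    .tuneHyperParameters();
```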

Imagine that we have a dataset in Ignite’s cache and that we need to run dozens of training runs simultaneously.

After we call the tuneHyperParameters() method for search by brute force, the following steps are performed:

  1. The CrossValidation object generates all possible combinations of parameters.
  2. The CrossValidation object creates a sequence of tasks (one task per each combination of parameters).
  3. Tasks are submitted to the ThreadPool (with one main thread by default).
  4. Each task is executed independently, and each result is posted to the shared, thread-safe CrossValidationResult object.
  5. Inside each task, all parameters are injected into appropriate places of the ML pipeline.

Each task constructs a new instance of the cache-based dataset, and every partition of the dataset is spread over the Ignite cluster.

You can find a full example here.

How to make fewer training runs on data (alternative approaches)

In [2], Bergstra and Bengio write that manual search and grid search prevail as the state of the art despite decades of research into global optimization and the publication of several hyperparameter optimization algorithms; they detail the reasons for this anomaly:

  • There is no technical overhead or barrier to manual optimization.
  • Grid search is simple to implement, and parallelization is trivial.
  • Typically, Grid Search (with access to a compute cluster) finds a better λ than purely manual sequential optimization finds (in the same amount of time).
  • Grid search is reliable in low dimensional spaces (for example, 1-d, 2-d).

I can add that the performance of Grid Search quickly degrades when it searches for the best hyperparameter values in a space of dimension higher than three.

Meta-optimization task

The parameter space often has a high rank, and the search surface looks like a pile of dirty clothes in a laundry basket. As a result, the standard methods of convex and non-convex optimization do not apply to the search for global or local optima.

Random Search

Grid search and manual search are the most widely used strategies for hyperparameter optimization. Paper [2] shows empirically and theoretically that randomly chosen trials are more efficient for hyperparameter optimization than trials on a grid. When we compare ML algorithms configured by a pure grid search with ML algorithms configured by random search, we find that Random Search over the same domain can find models that are as good as or better and uses a small fraction of the computation time and cluster resources that a Grid Search requires.

The Apache Ignite ML implementation does not yet support the generation of random parameters from a range and according to a given distribution, as in paper [2]; instead, the implementation makes random jumps and trials on the discrete grid, performing no more than a set number of attempts. The metric value calculated on one training run does not influence the choice of the next hyperparameter set.

The following example shows how to make ten training runs to find enough good hyperparameters.
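A hedged sketch of such a ParamGrid (the RandomStrategy class and its setters follow the public Ignite ML tuning examples; trainerCV is assumed from the earlier brute-force sketch):

```java
// Assumption: trainerCV is defined as in the earlier brute-force sketch.
ParamGrid paramGrid = new ParamGrid()
    .withParameterSearchStrategy(
        new RandomStrategy()
            .withMaxTries(10)   // no more than ten training runs
            .withSeed(1234L))   // reproducible random jumps on the discrete grid
    .addHyperParam("maxDeep", trainerCV::withMaxDeep,
        new Double[] {1.0, 2.0, 3.0, 4.0, 5.0, 10.0})
    .addHyperParam("minImpurityDecrease", trainerCV::withMinImpurityDecrease,
        new Double[] {0.0, 0.25, 0.5});
```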

Also, Ignite ML includes a parallel version of Random Search optimization. Ignite users can enable it by changing ParallelismStrategy from NO_PARALLELISM to ON_DEFAULT_POOL in LearningEnvironmentBuilder.

Genetic algorithms (Evolutionary Optimization)

Genetic algorithms, also known as Evolutionary optimization, apply natural selection mechanisms to solve the optimization problem (for example, find the minimum of a function).

I hope you are familiar with the basic principles of genetic algorithms, such as population, individuals, fitness function, selection, crossover (crossing over), and mutation. If not, I recommend the following article or Wikipedia.

In brief, the adoption of a genetic algorithm to solve the HPO problem includes the following steps:

  1. The Genetic algorithm (GA) generates an initial population (fixed sets of hyperparameters, with each hyperparameter set representing an individual in the population).
  2. After that, the GA evaluates the initial population (runs training runs and gets the value of the fitness function; for example, metric value or scores by folds for each set of hyperparameters).
  3. Via SelectionStrategy (Roulette Wheel Selection, for example), the genetic algorithm selects the best individuals from the initial population to become the parents of a new generation.
  4. The GA creates the next generation population with crossover and mutation operators.
  5. Crossover acts in the following way: To produce a new member of the population, it randomly selects two parents from the previous generation and combines parents’ genes according to the chosen CrossoverStrategy. Ignite ML implements Single-point crossover, Two-point crossover, k-point crossover, and Uniform crossover.
  6. With user-defined mutationProbability, the mutation operator randomly changes a few of the chromosomes of each newly generated child. The mutation operator changes the chromosomes to valid values from the search space.
  7. The genetic algorithm evaluates the new generation that includes the newly generated children.
  8. Steps 3 through 7 are repeated until the maximum number of generations is reached or until some other stopping or convergence condition is met (for example, insufficient improvement in the fitness function).

Also, the Ignite ML evolutionary strategy supports elitism. Elitism is a practical variant of the general process of constructing a new population. Elitism enables the best organisms from the current generation to carry over to the next generation unaltered.

This strategy is known as “elitist selection.” The strategy guarantees that the quality of the GA solution does not decrease from one generation to the next. You can set up the number of elite chromosomes in ParamGrid.

Also, Ignite ML includes a parallel version of the evolutionary optimization. When you’re working in Ignite, you can enable the parallelization by changing ParallelismStrategy from NO_PARALLELISM to ON_DEFAULT_POOL in LearningEnvironmentBuilder.

Internally, the parallel version of the GA is implemented as a parallel calculation of the fitness functions for all individuals in the population (one fitness function per thread). The fitness function calculation is equivalent to a training run with the fixed set of hyperparameters represented by an individual. The end of the evaluation of the new individuals is a natural synchronization barrier.

The following is a practical example of ParamGrid object creation:
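(A hedged sketch: the EvolutionOptimizationStrategy class and the setters follow the public Ignite ML examples; the candidate values are illustrative and chosen to match the individuals discussed below. normalizationTrainer and trainerCV are assumed from the earlier sketches.)

```java
// Assumption: normalizationTrainer and trainerCV are defined as in the earlier sketches.
ParamGrid paramGrid = new ParamGrid()
    .withParameterSearchStrategy(new EvolutionOptimizationStrategy())
    .addHyperParam("p", normalizationTrainer::withP,
        new Double[] {1.0, 2.0, 3.0, 4.0, 5.0})
    .addHyperParam("maxDeep", trainerCV::withMaxDeep,
        new Double[] {1.0, 2.0, 3.0, 4.0, 5.0})
    .addHyperParam("minImpurityDecrease", trainerCV::withMinImpurityDecrease,
        new Double[] {0.0, 0.25, 0.5, 0.8, 0.9});
```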

In this three-hyperparameter example, each individual in the population has three chromosomes (genes): p, maxDeep, and minImpurityDecrease. Values from the arrays are passed to ParamGrid.

Let’s take a closer look at a couple of individuals in the initial population.

Suppose that the selection operator selects (1.0; 1.0; 0.0) and (5.0; 2.0; 0.8) as parents and that the parents produce two children via one-point crossover: (1.0; 2.0; 0.8) and (5.0; 1.0; 0.0).

After that, the mutation operator changes a gene in the first child from 0.8 to 0.9. As a result, two children, (1.0; 2.0; 0.9) and (5.0; 1.0; 0.0), are added to the new-generation population.

I hope I have explained some of the tricky things about using genetic algorithms to find the best hyperparameters.

In conclusion

I hope, dear reader, that you now have an idea of how hyperparameter optimization works in Ignite ML.

In this article, we walked step-by-step through hyperparameter optimization in Apache Ignite ML.

Automated hyperparameter tuning of ML models can be accomplished by using an Ignite ML HPO implementation. In contrast to Random Search, Evolutionary optimization chooses hyperparameters by applying natural selection mechanisms and, thus, by spending more time evaluating promising values. The outcome can be fewer evaluations of the fitness function and better generalization performance on the test set, as compared to Random Search and Grid Search.

Of course, there is much work here for Ignite ML contributors: addition of optimization methods such as Bayesian optimization or Gradient-based optimization and improvements such as EarlyStopping logic, which is based on conditions that are implemented via callbacks.

But the existing methods can significantly reduce the number of training runs and the amount of CPU time burned and memory occupied.

Reference list

[1] Matthew MacKay, Paul Vicol, Jon Lorraine, David Duvenaud, Roger Grosse. Self-Tuning Networks: Bilevel Optimization of Hyperparameters using Structured Best-Response Functions, arXiv:1903.03088 [cs.LG]

[2] James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13:281–305, 2012.


Alexey Zinoviev

Apache Ignite Committer/PMC; Machine Learning Engineer in JetBrains