How to build the best model

Data scientists train models by running training algorithms. There are many different ways to build a Linear Regression model or a Decision Tree model, and each training algorithm (trainer) has several tunable settings whose values can be changed before training starts. Every value can affect the final result (that is, the trained model).

What happens if we try to find the tunable settings’ best values and re-use them to train the best model?

Before we start, we need to answer a few questions:

  • How do we determine that a given set of tunable settings is the best?
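One common answer to the question above is to score every candidate set of settings on held-out validation data and keep the set with the lowest error. The following is a minimal sketch of that idea in plain Python, not the Apache Ignite ML API: the toy gradient-descent trainer, the helper names (`train_linear`, `mse`, `grid_search`), and the grid values are all assumptions made for illustration.

```python
from itertools import product


def train_linear(xs, ys, learning_rate, epochs):
    """Fit y = w*x + b with plain batch gradient descent (toy trainer)."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = sum((w * x + b - y) * x for x, y in zip(xs, ys)) * 2 / n
        grad_b = sum((w * x + b - y) for x, y in zip(xs, ys)) * 2 / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


def mse(model, xs, ys):
    """Mean squared error of the model on the given data."""
    w, b = model
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def grid_search(train, valid, grid):
    """Try every combination in `grid`; return (best_params, best_score)."""
    best_params, best_score = None, float("inf")
    names = sorted(grid)
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        model = train_linear(*train, **params)
        score = mse(model, *valid)  # lower validation error is better
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical toy data generated from y = 2x + 1.
train = ([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
valid = ([4.0, 5.0], [9.0, 11.0])
grid = {"learning_rate": [0.001, 0.01, 0.1], "epochs": [10, 100, 1000]}

best_params, best_score = grid_search(train, valid, grid)
print(best_params, best_score)
```

Here "best" simply means the lowest mean squared error on the validation split. A real experiment would more likely use cross-validation and a metric suited to the task.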

Spoiler: The reported data is not an official benchmark with reproducible results and shared code, but the numbers can help with performance estimation.

A few months ago, I ran a time-consuming performance experiment with the new version of Apache® Ignite™ (release 2.8).

This is the third story in a series about Apache Ignite ML.

  1. Apache Ignite ML: origins and development
  2. Apache Ignite ML: possible use cases, racing with Spark ML, plans for the future
  3. Apache Ignite ML Performance Experiment

First, I wrote a script that generates a dataset in Ignite caches with 15 columns (The dataset contained…

Hi, distributed programmers! This is the second post in a series about the Ignite ML library.

Today I would like to talk about possible scenarios for using the Ignite ML framework, compare its capabilities with Apache Spark ML, and discuss future plans.

This is the second story in a series about Apache Ignite ML.

  1. Apache Ignite ML: origins and development
  2. Apache Ignite ML: possible use cases, racing with Spark ML, plans for the future
  3. Apache Ignite ML Performance Experiment

Possible usage scenarios in Big Data architecture and use cases in 2020

Okay, ambition is great, you say, but to what scenarios can this Distributed ML framework be applied? When…

Hi, distributed programmers! Nice to see you here, in my technical blog. Gird your loins and read this long, long article about the Apache® Ignite™ Machine Learning framework.

Do you know what Apache Ignite is? It is a horizontally scalable, fault-tolerant, distributed, in-memory computing platform for building real-time applications that can process terabytes of data with in-memory speed [1].

De facto, Ignite is a distributed, in-memory database. It's not like Apache Spark, which is a distributed ETL framework rather than a database: Spark reads from and writes to disk and keeps intermediate results in memory. …

In my previous article, I showed how you can train a linear regression model in Kotlin using the TensorFlow API. This time I decided to tackle something a bit more complex: convolutional networks. In this article I'll show you how you can train a LeNet model in Kotlin.

Article Contents:

  1. Introduction
  2. LeNet-5 layers
  3. The updated LeNet-4-zaleslaw layers
  4. First Convolutional layer
  5. First pooling layer
  6. Second Convolutional layer and Pooling Layer
  7. Flatten the 2d input
  8. Dense layers and the output
  9. Training: loss function, gradient descent
  10. Evaluation: meet the Accuracy Queen!
  11. Conclusion
  12. References
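The layer sequence in the contents above (convolution, pooling, convolution, pooling, flatten, dense) can be traced as a small shape calculation. This sketch uses the classic LeNet-5 sizes (32×32 input, 5×5 kernels, 2×2 pooling, dense layers of 120, 84 and 10 units), not the article's updated variant, whose layer sizes are not given here. It is plain Python rather than the article's Kotlin code, and the helper names are illustrative.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial size after a convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, window):
    """Spatial size after non-overlapping pooling (stride equals window)."""
    return size // window


def lenet5_shapes(input_size=32):
    """Return (layer name, output shape) pairs for classic LeNet-5."""
    shapes = []
    s = conv_out(input_size, kernel=5)
    shapes.append(("conv1", (s, s, 6)))
    s = pool_out(s, window=2)
    shapes.append(("pool1", (s, s, 6)))
    s = conv_out(s, kernel=5)
    shapes.append(("conv2", (s, s, 16)))
    s = pool_out(s, window=2)
    shapes.append(("pool2", (s, s, 16)))
    # Flattening turns the (height, width, channels) volume into one vector.
    shapes.append(("flatten", (s * s * 16,)))
    for name, units in (("dense1", 120), ("dense2", 84), ("output", 10)):
        shapes.append((name, (units,)))
    return shapes


for name, shape in lenet5_shapes():
    print(name, shape)
```

Tracing shapes this way is a cheap sanity check before wiring up real layers, since a mismatch between the flattened size and the first dense layer is a common source of errors.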


The LeNet-5 architecture was published in 1998, more than 20 years ago…


Good afternoon. While the whole world is enthusiastically learning something new at home, mixing it with fun stand-up shows and incredible fitness workouts over webcam, I decided to offer you a selection of courses for people switching from classical ML to general-purpose DL.

Disclaimer: the following selection is calibrated to me and my colleagues, people with a solid foundation in mathematics and experience using classical ML both in everyday life and in production.

The theoretical minimum of math knowledge includes the following list, which is needed to understand the material offered below. If this is the first time you are hearing these terms, you will find it painful and boring.

  • vector spaces, operations on vectors
  • inverse and transposed matrices, matrix…

My story began a few days ago, when I realized there were no examples of how to train a Linear Regression model with TensorFlow using the Java API.


Why is it so important to me? Who would use TensorFlow to find the best weight and bias in a Linear Problem? Who would try to do it in Java? This may sound as crazy as using a hammer to eat noodles, doesn't it?

Let me explain: I’m a professional ML/DL framework designer and my main area of expertise is Java and other JVM languages like Kotlin.

I need a good Java API…

A few days ago I bought an Acer Nitro 5 515–54 notebook and, as usual, decided to install Ubuntu 18.04 LTS as a second OS.

This notebook has only one storage device, an SSD, and I didn't think that it could become part of the problem.

As usual, I created the installation LiveCD on a 4 GB flash card with Rufus, using the standard options to write the ISO (MBR for UEFI/FAT32/ISO).

But after the typical manipulations in the BIOS, during the installation process I got a message that there was not enough memory for the Ubuntu installation: my Ubuntu installer didn't see the SSD driver…

Like everything in the world, the Spark distributed ML library, widely known as MLlib, is not perfect, and when you work with it every day you come face to face with certain difficulties.

I will share my thoughts about the shortcomings of Spark ML that could be fixed in the future but exist in the 2.4.4 release.

Statement 1: Spark ML doesn't support model ensembles such as stacking, boosting, and bagging

OK, in reality it has limited support for boosting in Random Forest training or in Gradient Boosted Trees, but you have no common way to build a stacking model or a bagging model with…

I have been a Spark user since Spark 0.8 was launched in 2013. It was the first release in which Spark introduced its ML library. In 2014 I experimented with the SVM algorithm and distributed algebra to train a linear model on 100 GB of data in a cluster with 10 nodes. It was 10 times better than using a custom MapReduce algorithm or Mahout.

Currently, I have a lot of experience with different distributed algorithms in Apache Spark. …

Alexey Zinoviev

Apache Ignite Committer/PMC; Machine Learning Engineer at JetBrains
