Good parts of the Apache Spark ML library

Alexey Zinoviev
Dec 9, 2019

I have been a Spark user since Spark 0.8 was released in 2013. It was the first release that shipped with Spark’s ML library. In 2014 I experimented with the SVM algorithm and distributed algebra to train a linear model on 100 GB of data in a 10-node cluster. It was 10 times faster than a custom MapReduce implementation or Mahout.

Today I have a lot of experience with different distributed algorithms in Apache Spark. I have implemented a few custom proprietary algorithms for commercial projects and built an alternative distributed ML library on top of Apache Ignite, using Spark ML as a reference framework to reach those goals.

In this post, I will share my thoughts on the advantages and shortcomings of Apache Spark ML, many of which are well known to users of MLlib, but some may be a revelation even for experienced users.

Let’s start with the advantages that make Spark ML a popular and successful distributed ML library in the Big Data (and Middle Data) world.

If you have petabytes of data, single-machine Python libraries like scikit-learn will not be able to process even a small fraction of it to build a machine learning model.

Spark incorporates a truly remarkable set of classical machine learning algorithms that allow you to perform iterative computations on data stored in memory and on disk via operations on Spark dataframes.
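
To make this concrete, here is a minimal sketch of training a classifier on a dataframe. The dataset path and the column names (x1, x2, x3, label) are hypothetical, and I assume Spark 2.x with the MLlib dependency on the classpath:

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("SparkMLSketch").getOrCreate()

// Hypothetical dataset in HDFS with numeric columns x1..x3 and a label column.
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("hdfs:///data/observations.csv")

// Spark ML estimators expect all features in a single vector column.
val assembler = new VectorAssembler()
  .setInputCols(Array("x1", "x2", "x3"))
  .setOutputCol("features")

// cache() keeps the partitions in memory, so each of the (up to) 100
// training iterations runs as a distributed job over cached data.
val prepared = assembler.transform(df).cache()

val model = new LogisticRegression()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setMaxIter(100)
  .fit(prepared)

println(s"Coefficients: ${model.coefficients}")
```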

The main advantages of Spark ML are the following:

  • Spark ML is a DISTRIBUTED ML library: with many of its algorithms you can scale training close to linearly with the size of your cluster.
  • It works in production as part of a Java or Scala application. For example, you could run Spring Boot and Spark ML in one monolithic application (sic!)
  • It has Python and R connectors whose APIs closely mirror Spark’s Scala API.
  • It works with Hadoop data stored in HDFS, and it is not a strange, clunky console tool like Mahout.
  • It integrates well with the other parts of the Spark API, so you can write code in the same style and reuse core Spark Core/SQL concepts (like persistence or SQL operations on dataframes).
  • Many estimators and methods are named the same as in the popular Python library scikit-learn (although the two libraries are developed in parallel by different teams).
  • It supports classic ML algorithms that are well known to a wide audience.
  • Wide support for different data sources and sinks (mostly a Spark Core feature).
  • Easy building of ML Pipelines (a minimal sketch follows this list).
  • Model evaluation and hyper-parameter tuning support, features that, for example, Flink ML and Mahout lacked for many years (see the cross-validation sketch below).
  • MLlib supports several quite different approaches to model export and import (the second sketch below shows save and load).
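
Here is what an ML Pipeline looks like in practice: a minimal sketch, assuming Spark 2.x, that chains a tokenizer, a feature hasher, and logistic regression into a single estimator; the tiny in-memory training set exists only to keep the example self-contained:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("PipelineSketch").getOrCreate()
import spark.implicits._

// Toy training data; in real life this would be a large dataframe read from HDFS.
val training = Seq(
  (0L, "spark ml is fast", 1.0),
  (1L, "mapreduce is slow", 0.0),
  (2L, "distributed training scales", 1.0),
  (3L, "single node does not", 0.0)
).toDF("id", "text", "label")

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol(tokenizer.getOutputCol).setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)

// The Pipeline chains the transformers and the estimator into one unit
// that can be fit, evaluated, and persisted as a whole.
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val pipelineModel = pipeline.fit(training)
```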
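
And a sketch of hyper-parameter tuning together with model export and import, reusing the pipeline, stages, and training data from the previous sketch; the grid values and the output path are illustrative only:

```scala
import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// A grid over hyper-parameters of two pipeline stages.
val paramGrid = new ParamGridBuilder()
  .addGrid(hashingTF.numFeatures, Array(1000, 10000))
  .addGrid(lr.regParam, Array(0.01, 0.1))
  .build()

// K-fold cross-validation evaluates every parameter combination
// and keeps the best model.
val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

val cvModel = cv.fit(training)

// Export the winning pipeline to a directory (local FS, HDFS, S3, ...)
// and import it later, for example inside a serving application.
cvModel.bestModel.asInstanceOf[PipelineModel].write.overwrite().save("/models/best-pipeline")
val reloaded = PipelineModel.load("/models/best-pipeline")
```

Because the whole pipeline is saved, the tokenizer and hashing parameters travel with the model, so the serving side does not have to re-implement feature preparation.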

All these parts really make Spark ML the most popular production ML library for Big Data projects. Of course, an initial pipeline prototype can be built in a Jupyter notebook, but the real pipeline will be developed with Spark ML, and every data engineer will eventually run into the limitations of this library, which I will cover in the next article.

Alexey Zinoviev

Apache Ignite Committer/PMC; Machine Learning Engineer at JetBrains