Weakness of the Apache Spark ML library
Like everything in the world, the Spark Distributed ML library widely known as MLlib is not perfect and working with it every day you come face to face with certain difficulties.
I will share my thoughts about the shortcomings of Spark ML that could be fixed in the future but exist in the 2.4.4 release.
Statement 1: Spark ML doesn’t support model ensembles as stacking, boosting, bagging
Ok, in reality it has limited support of boosting in Random Forest training or in Gradient Boosted Trees, but you have no common way to build the stacking model or bagging model with an arbitrary trainer.
I couldn’t find the plan in Spark JIRA or on dev-list to implement common stacking or boosting.
In the following thread, this problem is discussed and a few solutions are proposed: https://stackoverflow.com/questions/54050547/stacking-ml-algorithms-in-spark
But I found only one solution on Github with 5 stars (later, I will post about the advantages of this private solution).
Statement 2: Spark ML has a very limited support of Pandas DataFrame functions
When a typical Data Scientist leaves his cozy little Pandas world, he is faced with the lack of a set of basic daily things in the Spark ML like fetching data in pandas style .cols[‘Name’][:5] , plotting, stratifying, handling missing data, building histograms and advanced statistic functions.
I found an interesting project with 112 stars on GitHub, which emulates Pandas dataframes on Scala and Spark. You can take a look at the project here: https://github.com/dvgodoy/handyspark.
Statement 3: Spark ML has limited support of statistic functions
Yes, you could calculate the correlation between two columns, test hypothesis, but Spark ML doesn’t support advanced stat methods like ANOVA, different descriptive statistics and so on.
Also, the RDD-based ML library supports more stat methods than DataFrame-based ML library. Compare the RDD-based supported functionality and DataFrame-based.
In many cases, you have to implement advanced methods manually.
I have found one RDD-based advanced stat library on Spark Packages with 27 stars here with ANOVA, Kolmogorov-Smirnov tests, and so on.
Statement 4: Spark ML doesn’t support online-learning for all algorithms
As mentioned in the docs, Spark ML supports only a few models like Linear Regression and KMeans in old Streaming mode [DStreams]; in other cases you should implement online-learning manually.
For example, Streaming KMeans has a hardcoded law for merging of two models based on previous data and new portion of data. This is the best implementation for clusters’ updation but this law cannot be reused for another algorithms like Decision Trees or SVM.
In common case, the Spark user should search the archive.org to find the mathematical methods of online learning to implement them near his Spark code.
Statement 5: A lot of data transformation/overhead from data source to ML types
Also you need to make a lot of noisy transformation from raw data in parquet to build Labeled Vectors.
In reality, in a typical ML Spark pipeline you will recreate your dataframe many times due to the immutable nature of Scala and Spark data structures.
For example, you have data in parquet and need to do 11 intermediate preprocessing transformations before training: Spark creates first dataframe from parquet data, and after that Spark creates new dataframe then you are using VectorAssembler, also it creates new dataframes for each preprocessing step and so on.
NOTE: Never use caching for all intermediate preprocessing steps, the storage memory in Spark cluster will finish soon.
Statement 6: The hard integration with TensorFlow/Caffee/PyTorch
The integration with TF and Caffee is hard and we see a lot of different approaches to run TF with Spark.
Historically, the YahooTensorFlowOnSpark solution was the first and has 3300 stars on Github. Also, Yahoo released the same integration for Caffee and Spark with the same API based on the same pricnciples (currently it’s an archived project with no support from Yahoo I think it’s related to the low level popularity of Caffee framework now).
Another approach with more fluent API was implemented in SparkFlow. A comparatively less popular library (have a look on CNN example here).
After TensorFlowOnSpark success, Databricks started to develop alternative TensorFrames, but now it’s deprecated and frozen in experimental status.
Also we should not forget the ineffective and slow DL4j — Spark integration, which was historically my first experience with Neural Networks on Spark.
Statement 7: Databricks goes wrong with TensorFlow-Horovod-Spark integration
Databricks totally concentrated on TensorFlow — Horovod — Spark integration where Spark plays a very modest role and does not support alternative approaches.
Insight 1: I asked about this situation to the Databricks ML developers in the Spark Summit this year and they answer me that Databricks spent a lot of resources for partnership with Uber and Google to build these bridges.
Insight 2: I am sceptical about the future of the integration because the maintainer of Horovod, Alex Sergeev left Uber a few months ago and we can see only a few commits from him in master.
Insight 3: It could never became a part of Apache Spark and be a part of Databricks Runtime for years. Look here and here to be sure about perspectives.
Statement 8: Spark ML doesn’t support building of modern Neural Network architectures
I suppose, that Spark could have his own Keras-like API to build traditional NN architectures like CNN, RNN and so on. It’s not difficult to implement. Multilayer Perceptron is not 10 times easier to implement, but now it’s part of Spark ML.
Statement 9: A part of algorithms are using sparse matrix
Also, Matrix computation like multiplication can be easily parallelized due to all data located in the shared memory, but many matrix operations are not easy to implement in a distribution version due to the distributed data representation.
Historically, a lot of ML libraries start from linear algebra implementation, but in reality many ML algorithms could avoid matrix computations and use only labeled or unlabeled vectors.
Now Spark ML has two versions of vectors, outdated matrix algebra, and very vague development prospects.
Statement 10: Several unfinished approaches of model inference/model serving
I had a lot of trouble trying to choose the right way to get trained model from Spark and put it to the internal system…
What should I use? Old approach via PMML? save method to serialize the model in the parquet file? Maybe Mleap? or JPMML library?
RDD-based ML has an ability to save a few trained models to PMML format. Special xml format for model exchange. Unfortunately, the new models from ml package couldn’t be imported via PMML.
Solution: I found and used one popular library, jpmml-sparkml, (200 stars) for converting Spark ML pipelines to PMML format. Yes, it works.
Also, Spark supports model serialization to parquet file or directory via model.write() method. Currently, it’s the most popular approach to send models via network.
Statement 11: Spark ML does not have full support for model inference
As I mentioned earlier, Spark has limited support of loading models trained in another systems for prediction purposes over the Spark cluster data. For example, Imagine that the model trained by data scientists in Jupyter Notebook could be applied to the data stored in HDFS or S3 many times to produce the new column..
In most cases, you should write model parser and initializer of Spark model by hand, it’s not hard math, but it wastes your time.
Statement 12: Some distributed implementations are inefficient, and you don’t know which ones
I will not describe from the zero the widely known story with the LinearSVM implementation which internally uses distributed SGD version with its own LossFunction.
For example, I know a more efficient method to solve the same optimization task like the widely known CoCoA Framework for effective building of SVM model on distributed data. Now, Flink ML and Ignite ML has Linear SVM implementations based on this approach.
Statement 13: Spark doesn’t support ML operators in Spark SQL
I hope that Spark will get it in the future, but currently, this important feature is not supported.
Solution: Have a look on HiveMall project — it gives an ability to run ML algorithms as a part of HiveSQL (and a part of Spark SQL too as a result).
Statement 14: ML algorithms internally uses Mllib on RDD
Currently we have two large packages in Spark ML: RDD-based and DataFrame-based. But in many cases, the DataFrame-based algorithm is only a wrapper on the RDD-based algorithm, including the inefficient DataFrame to RDD conversion, a mllib.vector to ml.vector casting, and so on.
If you don’t believe me, check the KMeans algorithm, which internally uses mllib.KMeans here or usage of CholeskyDecomposition from mllib in ALS recommendation algorithm.
But I am happy to see that from release to release more and more algorithms destroy their ties with old RDD-based implementation.
Hope that in Spark 3.0 we will have no such dependencies in all algorithms.
Statement 15: The main problem with Spark ML
You will grow old before your PR will be merged.
Summary
Unfortunately…yes, that’s the huge problem
Sometimes it seems that there is no alternative.
But in the Apache family there is another glorious son who can work in close conjunction with Spark, I happy to present Apache Ignite ML which close these gaps mentioned above.