I recently finished a project that tested the performance of Machine Learning algorithms across different data science platforms, with an explicit focus on explainability. In this post, I describe some of the criteria and the platforms that were used in this project.
In the era of the internet, vast amounts of data are being generated by sectors ranging from pharmaceuticals to molecular biology, and the need for automated tools that can analyse these data and surface business-critical insights is higher today than it has ever been.
Data science platforms, software hubs that allow for integrating data and for building and deploying models, have seen a rapid rise in both supply and demand, but which of these platforms is 'best'?
Without getting into a philosophical debate, the term 'best' would probably have a very different meaning to different individuals and organisations. However, platforms that offer interpretable Machine Learning (ML) and Artificial Intelligence (AI) outputs are preferred over 'black box' ones.
Platforms that met our criteria
The number of available tools keeps increasing as the friction in the market for data science services disappears. How would one choose the right platform for data science tasks?
In developing this project, we first set out the foundations for what makes a data science platform. To make it onto our list, a platform had to allow for data pre-processing, feature selection, classifier choice and parameter tuning, and it had to support open source use. Of the 41 platforms identified, R and Python were set as the benchmarks, and the five others that satisfied the selection criteria were chosen to be studied further and to run supervised and unsupervised ML models.
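As a rough illustration of how such a shortlist could be produced, the sketch below encodes the five criteria named above and keeps only the candidates that satisfy all of them. The `Platform` class, the capability labels and the three candidate entries are hypothetical and invented for this example; they are not the platforms or the data from the project.

```python
from dataclasses import dataclass

# The five capabilities named in the text as selection criteria.
# The label strings themselves are an assumption made for this sketch.
CRITERIA = (
    "data_preprocessing",
    "feature_selection",
    "classifier_choice",
    "parameter_tuning",
    "open_source_support",
)


@dataclass(frozen=True)
class Platform:
    """A candidate platform and the set of capabilities it offers."""
    name: str
    capabilities: frozenset


def meets_criteria(platform, criteria=CRITERIA):
    """Return True only if the platform offers every required capability."""
    return all(c in platform.capabilities for c in criteria)


def shortlist(platforms, criteria=CRITERIA):
    """Return the names of the platforms that satisfy all criteria."""
    return [p.name for p in platforms if meets_criteria(p, criteria)]


# Hypothetical candidates, invented purely for illustration.
candidates = [
    Platform("platform_a", frozenset(CRITERIA)),
    Platform("platform_b", frozenset(CRITERIA) - {"parameter_tuning"}),
    Platform("platform_c", frozenset(CRITERIA) | {"auto_ml"}),
]

print(shortlist(candidates))
```

Treating the criteria as a strict all-or-nothing filter is one possible reading of the selection step; a weighted score would be an alternative if some capabilities mattered more than others.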
R is an open source statistical platform that allows users to build very advanced ML models. Its functionality is widely known among industry users and academics alike. R offers many libraries that are fully open source, and the functions within those libraries are transparent and explained mathematically. R is arguably the best visualisation platform available. It runs on Windows, Mac OS X and Linux, and it can read data from sources such as Microsoft Excel, Microsoft Access, MySQL, SQLite and Oracle. One of the remarkable features of the R language is its adaptability: thanks to R's popularity, expressive power and transparency, developers keep building creative interfaces to software that complements R's strengths. R's memory management has been a drawback. However, recent advances in techniques allow developers to understand R's memory management better and ultimately make functions and loops run faster.
Python is a widely used programming language and DS platform. It is also widely used for web and game development, and it supports object-oriented programming. Python is used in many different software packages and in sectors ranging from academia to pharmaceuticals. It is used at Google and in services such as YouTube, Dropbox, Reddit, Quora, Disqus and FriendFeed, and organisations such as NASA, IBM and Mozilla also make use of it. Because it is open source and allows systems to be integrated quickly and effectively, Python is exceptionally appealing to startups and smaller companies.
H2O is an open source, in-memory, distributed ML platform. H2O runs on Java; internally it uses a distributed key-value store that makes data, models and other objects available across different machines. H2O uses a distributed map-reduce framework built on the Java Fork/Join framework. Data is loaded into an H2O data frame, which is distributed across the nodes of the cluster and held in memory. H2O's intelligent data parser can guess the schema of incoming datasets and supports data ingestion from multiple sources in various formats. H2O's API can be accessed from scripts via JSON over HTTP. The API is used by H2O's web interface (Flow UI), its R binding (H2O-R) and its Python binding (H2O-Python).
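To make the map-reduce idea concrete, here is a minimal sketch in plain Python: a column is split into chunks, each chunk is summarised independently (the map step), and the partial summaries are merged (the reduce step). This is not H2O's API or implementation; the function names and the sample column are assumptions made for illustration, and everything runs in a single process rather than across machines.

```python
from functools import reduce


def split_into_chunks(values, n_chunks):
    """Split a list into n_chunks contiguous pieces of near-equal size."""
    size, extra = divmod(len(values), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional element each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(values[start:end])
        start = end
    return chunks


def map_chunk(chunk):
    """Map step: each simulated node summarises its chunk as (sum, count)."""
    return (sum(chunk), len(chunk))


def combine(left, right):
    """Reduce step: merge two partial (sum, count) summaries."""
    return (left[0] + right[0], left[1] + right[1])


def distributed_mean(values, n_chunks=3):
    """Compute a mean from per-chunk summaries; assumes values is non-empty."""
    partials = [map_chunk(c) for c in split_into_chunks(values, n_chunks)]
    total, count = reduce(combine, partials, (0, 0))
    return total / count


# Hypothetical column of data, invented for illustration.
column = [2, 4, 6, 8, 10, 12, 14]
print(distributed_mean(column))
```

Because each chunk is reduced to a small (sum, count) pair before anything is combined, only those pairs would need to travel between machines in a real cluster, which is the property that makes this pattern attractive for data held in distributed memory.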
BigML allows developers and enterprises to create ML algorithms. BigML offers an abstract, simple interface to a wide range of ML algorithms that can be used in isolation at a very high level and also combined, by means of DSLs, into new, more complex algorithmic workflows. It can therefore cover the gamut from users who barely know the particulars of the algorithm they are invoking to savvy data scientists who can combine many algorithms in complex ways. BigML lets users tap the functionality of the platform through an open API (an API whose methods and outcomes are public and usable by anyone without any kind of reverse engineering) rather than a proprietary one. BigML connects to R via the 'bigml' package, which wraps the BigML API. However, this package is old and has not been updated. The package includes methods that provide straightforward access to basic API functionality, as well as methods that accommodate local R data types and concepts. BigML also offers open source bindings for many other languages, such as Python, Java, Ruby and Clojure.
RapidMiner offers data mining and ML procedures including data loading and transformation, data preprocessing and visualisation, modelling, evaluation, and deployment. RapidMiner is written in the Java programming language. It also integrates the learning schemes and attribute evaluators of the Weka machine learning environment and the statistical modelling schemes of the R project. The platform benefits from an extensive built-in library and integrates with existing databases and with the most common open source DS programming languages, such as R and Python. The platform's Auto ML function automates the lifecycle of building ML models.
Dataiku's Data Science Studio (DSS) platform allows connection to any data store, eliminating integration stages. DSS detects wrong entries while automatically cleansing, transforming and enriching data. Its visualisation features make it easier to find correlations, variables and patterns that help predict future outcomes and trends with greater confidence. DSS also has features that support collaborative data science, which makes it easier for different roles, such as data engineers, business analysts, business stakeholders, hardcore coders, R users and Python users, to work together efficiently on DS projects. The platform runs Python in memory.
Azure ML Studio allows users to develop models in the cloud. It is also integrated with R and Python environments, which makes it possible for data scientists to write and run R and Python programs in the cloud as well.
So, which platform is best?
One of the remarkable features of the chosen platforms is that they offer strong support for collaborative data science at scale and allow for integration with the benchmark platforms.
This project lays the groundwork for more work on comparing platforms by algorithm performance, as opposed to just testing algorithms. The need for automated ML and AI will continue to grow, and research in this area will enable industry and academia to weigh the trade-offs between different platforms and decide which is best suited to their purpose. All in all, there is no one-size-fits-all platform that solves every problem. The measures in this project show that the chosen platforms provide functionality and results similar to those of the benchmark platforms.
For more information on the criteria, the platforms not included here, the ML algorithms and the results of this project, please see my thesis, and reference it appropriately if you use it.
If the work in this project is of interest to you, please get in touch.