The more time I spend building products with machine learning at their core, the clearer it becomes to me that the ideal of a single data science language and framework will never be fully realized. There will always be legacy code written before the framework du jour arrived. And even without legacy code, reality already looks rather polyglot. Consider Spark: when using its Python bindings, one will almost surely end up with a mix of Scala and Python.
To support these polyglot architectures, more and more effort is being spent on ensuring that the underlying algorithms behave the same regardless of the programming language or framework chosen. A good example of such an algorithm is xgboost. Its core is implemented in C++, but it also supports current data processing frameworks, such as Spark and Flink, and data science languages, such as Python and R. One central feature of its design is that xgboost uses its own framework for parallelization: even if you use Spark, parallelization is handled entirely by xgboost once training starts. Another example of this design is CaffeOnSpark.
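To make the "library brings its own parallelization" design concrete, here is a minimal, hypothetical sketch (not xgboost's actual code): a library function that, no matter which host framework invokes it, splits the work across its own thread pool. The inner parallelism is entirely opaque to the caller.

```python
from concurrent.futures import ThreadPoolExecutor

def library_train(data_shards):
    """Hypothetical library routine that handles its own parallelization.

    Whatever framework calls this function (a Spark executor, a plain
    script, ...), the work is distributed across the library's own
    thread pool, mirroring how xgboost takes over once training starts.
    """
    with ThreadPoolExecutor(max_workers=len(data_shards)) as pool:
        # Each worker thread reduces one shard; the library then
        # combines the partial results itself.
        partial_sums = list(pool.map(sum, data_shards))
    return sum(partial_sums)

# A host framework would simply call library_train on its data;
# it never sees or controls the threads inside.
total = library_train([[1, 2], [3, 4], [5, 6]])
```

The point of the sketch is the boundary: the host framework hands over the data and gets back a result, with no say in how the computation is parallelized in between.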
If control could similarly flow back and forth between the machine learning library and the parallelization framework, the algorithm could use the native parallelization mechanisms supplied by the framework. Think of an algorithm implemented in terms of primitives such as allreduce and broadcast, as xgboost is. Whenever the algorithm requests an allreduce, the parallelization framework takes over, but hands back control once the reduction has been performed. This way, each part of the system could handle what it does best: the algorithm library the actual algorithm, the parallelization framework the parallelization.