In a nutshell, KAML-D (Kubernetes Advanced Machine Learning & Data Engineering Platform) addresses the double divide often found in teams that employ machine learning (ML) in their apps:
There are two orthogonal issues to overcome:
KAML-D focuses on this one.
Imagine a team that works on an app with a machine learning feature, for example a face recognition or voice recognition task. Now, the awesome data scientists develop their model, using, for example, R or Python. How do they go about it? Glad you asked. They get some training data (let’s say some CSV dump or maybe a ZIP with a gazillion .png images) and start choosing a ‘good’ machine learning approach, like an unsupervised or reinforcement learning approach. Every time the data scientists iterate, they adjust the training data, maybe cleaning it up, adding more data, or whatever. Then, of course, they need to split out a part, maybe 30%, as test data. They’re literally copying the original dataset (say myawesomedata/), removing or adding stuff as needed, and storing it under a new name, maybe myawesomedata1/ or even fancy stuff like
This is just the learning/training phase. Once they have the model, they need to serialize it so that the data engineers/developers can actually (re-)implement it in another language and/or environment, such as Apache Spark or Apache Flink (in Scala), to make it production ready. No matter whether they’re using a proprietary format such as TensorFlow’s checkpoint files or interchange formats such as ONNX or Core ML, the model can and will change (drivers include new data, a better model or algorithm, etc.), and that updated model needs versioning, just like the dataset above.
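To make the dataset side of this concrete, here is a minimal sketch, not KAML-D's actual mechanism, of a reproducible 70/30 split plus an ID derived from the dataset itself rather than an ad-hoc name like myawesomedata1/. The function names `split_dataset` and `fingerprint`, the seed, and the file names are all hypothetical, and the fingerprint here hashes item names only; a real setup would presumably hash the file contents as well.

```python
import hashlib
import random


def split_dataset(items, test_fraction=0.3, seed=42):
    """Shuffle a copy of items deterministically and split off a test set."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)


def fingerprint(items):
    """Return a short ID derived from the item names, independent of their order."""
    digest = hashlib.sha256()
    for item in sorted(items):
        digest.update(item.encode("utf-8"))
        digest.update(b"\0")  # separator so ["ab", "c"] differs from ["a", "bc"]
    return digest.hexdigest()[:12]


# Hypothetical file names standing in for the images in myawesomedata/
files = [f"img_{i:03d}.png" for i in range(10)]
train, test = split_dataset(files)
print(len(train), len(test))  # 7 3
print(fingerprint(files) == fingerprint(list(reversed(files))))  # True: same data, same ID
print(fingerprint(files) == fingerprint(files[:-1]))  # False: data changed, ID changed
```

Because the split is seeded and the ID follows from the data, two people starting from the same files get the same train/test sets and the same version label, which is exactly what manual copying and renaming cannot guarantee.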
Now, one could argue that they could use GitHub to capture the respective dataset at any given point in time, but did you know GitHub has a limit of 100 MB per file? Ah, OK, so I just put it on S3 and enable versioning on the bucket! Sure, you can do that, if you’d like to be locked into S3 ;)
What KAML-D brings to the table to remedy the situation is: