Why we need to talk about Machine Learning Operations and how it is different from DevOps.
When we talk about Machine Learning, we always have data and the model in mind, and most of the challenges we discuss relate to the Exploratory Data Analysis workflow. However, there is a big missing piece in the whole puzzle: deployment. To extract real business value from an ML model, you need to put it into production somehow. And this 'somehow' is a methodology called Machine Learning Operations.
We hear a lot of great news about new Machine Learning models, techniques, and tools. Everybody can agree that the industry is moving forward extremely quickly. However, there is a big bottleneck in the AI ecosystem: taking models and moving them into production, where they actually start to solve real problems, is a major challenge. Just as DevOps did 15 years ago, getting a piece of code off the ground from a single code repository, the whole AI ecosystem is fighting to achieve a similar outcome with ML models: execute a build pipeline that takes a model all the way to production, testing and validating every single step. It sounds easy, but it is not.
MLOps itself is not yet well defined, both in terms of the definition (different companies mean slightly different things by it) and of who should own it (the ML team or the backend team). The whole topic is still fresh, and the terminology has not yet been established.
At Syndicai, when we talk about MLOps, we talk about the ability to move AI models from data scientists' laptops to remote machine clusters at scale. The main goal of that process is to automate the whole workflow, taking care of resource management, orchestration, data/model versioning, and monitoring.
The main idea behind DevOps is that a piece of software is tested and QA-ed by the development team and, once stable, goes into production.
In the field of AI, things work quite differently.
Unlike DevOps, MLOps is much more experimental at its core. Throughout the workflow, Machine Learning engineers try different features, parameters, and model architectures. Therefore, it is important not only to version the code but also to be able to reproduce previously obtained results.
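To make that concrete, here is a minimal sketch in plain Python of what experiment tracking could look like. It assumes no particular tracking library (in practice you might use a tool built for this); the `log_run` function, its parameters, and the `runs/` output directory are all illustrative names, not part of any real API.

```python
import hashlib
import json
import time
from pathlib import Path


def log_run(params: dict, metrics: dict, code_dir: str = ".", out_dir: str = "runs") -> Path:
    """Record one experiment run: parameters, metrics, and a code fingerprint.

    The code hash lets you later verify which version of the code produced
    a result, so previously obtained results can be reproduced.
    """
    # Hash all Python source files to fingerprint the code version.
    digest = hashlib.sha256()
    for path in sorted(Path(code_dir).rglob("*.py")):
        digest.update(path.read_bytes())

    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "code_hash": digest.hexdigest(),
        "params": params,
        "metrics": metrics,
    }
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    run_file = out / f"run_{int(time.time() * 1000)}.json"
    run_file.write_text(json.dumps(record, indent=2))
    return run_file


# Example: log one experiment with its hyperparameters and results.
run_path = log_run(
    params={"learning_rate": 0.01, "n_estimators": 200},
    metrics={"accuracy": 0.93},
)
```

Even this simple record, one JSON file per run, captures the three things needed to reproduce a result: what code ran, with what parameters, and what it scored.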
When we deploy our models, the constant flow of new data causes accuracy to decrease over time, which is not the case with regular software. That is why we need to retrain and redeploy models that theoretically work properly (with no bugs) but have suffered a significant drop in performance.
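A minimal sketch of what such monitoring could look like, assuming ground-truth labels eventually arrive for live predictions: track accuracy over a sliding window and flag the model for retraining when it drops too far below the accuracy it had at deployment time. The `DriftMonitor` class and its thresholds are illustrative, not a real library API.

```python
from collections import deque


class DriftMonitor:
    """Track live accuracy over a sliding window and flag when it drops
    too far below the accuracy the model had at deployment time."""

    def __init__(self, baseline_accuracy: float, max_drop: float = 0.05,
                 window: int = 1000):
        self.baseline = baseline_accuracy
        self.max_drop = max_drop
        self.outcomes = deque(maxlen=window)  # 1 = correct, 0 = wrong

    def record(self, prediction, true_label) -> None:
        """Record one live prediction once its true label is known."""
        self.outcomes.append(1 if prediction == true_label else 0)

    @property
    def live_accuracy(self) -> float:
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else 1.0

    def needs_retraining(self) -> bool:
        # The model itself has no bugs; its performance has simply decayed
        # because incoming data no longer matches the training data.
        return (self.baseline - self.live_accuracy) > self.max_drop


# Example: a model deployed at 95% accuracy starts seeing shifted data.
monitor = DriftMonitor(baseline_accuracy=0.95, max_drop=0.05, window=100)
for _ in range(80):
    monitor.record(1, 1)  # correct predictions
for _ in range(20):
    monitor.record(1, 0)  # mistakes caused by new data
```

After the window above, live accuracy is 0.80, a 15-point drop from the baseline, so `monitor.needs_retraining()` returns `True` and the retraining pipeline can be triggered.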
When it comes to computing resources, we need to be aware that Machine Learning experiments can be hefty workloads. Unlike regular software, there is a need to run things on large machines from day one, and very often a need to switch machines depending on the workload.
In a DevOps environment, it is mainly the infrastructure that is under constant supervision. In MLOps, on the other hand, the model itself also needs attention: parameters and metrics must be constantly checked so that the model can be retrained when needed.
In addition to software tests such as unit and integration tests, there is a need for model validation and model training tests.
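As an illustration, here is a sketch of a model validation gate that could run in a build pipeline alongside the usual unit tests. The `validate_model` function, the stand-in model, and the thresholds are all hypothetical examples, not any established framework.

```python
def validate_model(model, X_val, y_val, min_accuracy=0.9):
    """Model validation gate: in addition to unit and integration tests,
    block deployment unless the trained model clears quality thresholds."""
    preds = [model(x) for x in X_val]
    accuracy = sum(p == y for p, y in zip(preds, y_val)) / len(y_val)

    # Behaviour check: every prediction must be a known class.
    assert all(p in {0, 1} for p in preds), "model emitted an unknown class"
    # Quality gate: fail the pipeline if accuracy is too low.
    assert accuracy >= min_accuracy, f"accuracy {accuracy:.2f} below {min_accuracy}"
    return accuracy


# Hypothetical stand-in model: a simple threshold rule on one feature.
model = lambda x: 1 if x > 0.5 else 0
acc = validate_model(model, X_val=[0.1, 0.9, 0.2, 0.8], y_val=[0, 1, 0, 1])
```

If the assertions fail, the build fails, which is exactly the behaviour wanted from a deployment gate: a model that trains without bugs but scores poorly never reaches production.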
Awareness of Machine Learning Operations is constantly growing. More and more companies are building their services on top of AI, which requires them to automate their processes. The whole movement is also empowering the industry, bringing new tools to life, all aimed at making Data Scientists more effective and efficient.
We need to be aware that, unlike 'normal software', Machine Learning models have to be treated with a different set of practices and tools. It is all about bringing a Continuous Integration flow into the Data Science workflow.