Radha Basu is the Founder and CEO of iMerit, a global AI data solutions company delivering high-quality data that powers machine learning and artificial intelligence applications for Fortune 500 companies. She is a leading tech entrepreneur, and a pioneer in the Indian software business. Under her leadership, iMerit has employed hundreds of skilled and marginalized women and youth in digital and data services worldwide. Additionally, iMerit employees contribute to growing industries ranging from virtual/augmented reality to the sharing economy, to e-commerce and financial services.
What is ML DataOps?
In the last decade, the commercial applications of Machine Learning (ML) have moved from conceptualization to testing to deployment. As the industry has moved through this cycle, the need for efficient and scalable processes resulted in the emergence of MLOps as a critical function within organizations developing AI.
The development of ML applications hinges on the collection and analysis of data and the creation of a model using that data. At its core, machine learning is a combination of data and a model. As ML has scaled, there has been a great deal of focus on tuning and optimizing the model to produce the best possible performance. Over the past year, the AI ecosystem has seen a push to move from a model-centric approach to a data-centric approach. This shift, championed by AI pioneer and technology entrepreneur Andrew Ng, is expected to have a dramatic impact on the quality of deployed Machine Learning models. Ng points out that sourcing and preparation of high-quality data account for 80% of the Machine Learning process.
In line with this development, there is a greater focus today on ML DataOps – the subset of MLOps that focuses entirely on the collection, preparation, and use of data, complete with a feedback cycle. A structured ML DataOps pipeline makes it possible to handle data at scale as it goes through the cyclical journey of AI training and deployment. The all-critical transition from testing to production must be tackled through repeatable and scalable processes to ensure the sustainability of the resulting AI solutions.
A robust ML DataOps ecosystem is needed to keep up with the dynamic nature of AI development and the specific data problems faced within different sectors employing AI. An open and interconnected ecosystem helps the industry as a whole be time-efficient and move closer to deployment at scale. No single company can do it all, and no one model or workflow can solve every problem.
The Three Pillars of ML DataOps
Within the ML DataOps ecosystem, companies focus on different aspects of the data pipeline and bring their specialized solutions to the market. The solutions offered by companies within the ecosystem broadly fall under three categories: people, technology and tooling, and processes or end-to-end solutions. The boundaries are fluid, and there is a constant evolution in the space to incorporate new ML DataOps capabilities to best serve the AI market.
- People: Skilled humans play an increasingly important role in ML development. Humans-in-the-loop bring judgement and nuance to the processes in the pipeline and enable complex solutions that cannot be tackled through technology alone.
- Technology and Tooling: Tools are being developed to streamline different functions within the ML DataOps pipeline, including labeling, curation, and visualization. Technology features are developed to improve the efficiency of humans-in-the-loop and complement human knowledge with technical support and automation.
- End-to-end processes: Some companies are focused on developing an end-to-end solution with streamlined processes built-in. Process efficiencies, which can be overlooked in a small project, can be huge time and cost savers when dealing with enterprise-grade data pipelines.
What an ideal DataOps ecosystem player looks like
Most companies are hyper-focused on providing solutions for a specific area within the DataOps pipeline. Companies like Lightly help AI teams use the right data for their project and offer data curation expertise. Within the labeling and annotation ecosystem, Datasaur offers a leading tool for NLP and text data, while Dataloop, SuperAnnotate, and V7 are go-to partners for computer vision.
Agile and responsive
Given the rapid pace of growth and dynamic nature of requirements within AI data, teams are constantly building workflows and features to solve new problems. Customers are looking for ML DataOps partners who can handle varied requirements and provide a single view into the data pipeline, with platform and expertise.
Building in automation
Platforms like AWS SageMaker and Snorkel AI offer automation and programmatic approaches for processes in the pipeline, shortening time to market and reducing costs. Google Document AI uses document understanding techniques at scale to automate work on specific data formats, enabling better decision-making and applications.
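To make the idea of a programmatic approach concrete, here is a minimal sketch of rule-based labeling with a majority vote. It is written in plain Python and does not use or reproduce the API of any platform named above; the rule names, the labels, and the `weak_label` helper are all hypothetical and chosen only for illustration.

```python
from collections import Counter

ABSTAIN = None  # a rule returns this when it has no opinion (illustrative convention)

# Hypothetical labeling rules for short customer messages.
def lf_contains_refund(text):
    return "complaint" if "refund" in text.lower() else ABSTAIN

def lf_contains_thanks(text):
    return "praise" if "thank" in text.lower() else ABSTAIN

def lf_exclamation(text):
    return "complaint" if text.count("!") >= 2 else ABSTAIN

LABELING_FUNCTIONS = [lf_contains_refund, lf_contains_thanks, lf_exclamation]

def weak_label(text):
    """Majority vote over the rules that did not abstain.

    Returns ABSTAIN when no rule fires or when the top labels tie,
    so that such items can be left for a human annotator.
    """
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v is not ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]
```

Under these assumptions, rules label the easy items cheaply, and anything the rules cannot agree on is left unlabeled for skilled annotators rather than guessed.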
Edge case resolution
The last mile of AI development lies in the resolution of edge cases. The ability to handle edge cases can make or break the production readiness of a trained ML system. Companies in the ecosystem are constantly finding ways to seamlessly integrate human expertise with tooling capabilities for auditing, monitoring, and handling edge cases. This is a critical step in getting ML past the final hurdles to production.
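One common way to combine tooling with human expertise for edge cases is to route low-confidence model outputs to a review queue. The sketch below assumes the model exposes a confidence score between 0 and 1; the `Prediction` type, the `route` function, and the 0.8 threshold are hypothetical, and a real pipeline would tune the threshold per project.

```python
from dataclasses import dataclass

@dataclass
class Prediction:
    item_id: str
    label: str
    confidence: float  # assumed model score in [0, 1]

def route(predictions, threshold=0.8):
    """Split predictions into an auto-accepted list and a human-review list.

    Items at or above the threshold are accepted as-is; everything else
    is treated as a potential edge case and queued for a human expert.
    """
    auto_accepted, needs_review = [], []
    for p in predictions:
        if p.confidence >= threshold:
            auto_accepted.append(p)
        else:
            needs_review.append(p)
    return auto_accepted, needs_review

batch = [
    Prediction("img-001", "pedestrian", 0.97),
    Prediction("img-002", "cyclist", 0.55),
    Prediction("img-003", "pedestrian", 0.80),
]
accepted, review = route(batch)
```

In this example `img-002` would land in the review queue, while the other two items are accepted automatically. Corrections made by reviewers can then be fed back into training data, which is the feedback cycle described earlier.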
An open architecture combined with points of aggregated control of the whole process will be the future of DataOps pipelines. Tooling teams need to be thinking of their position in the entire lifecycle and provide open interfaces to plug into customer data pipelines and workflows.
Trackable and engineered data flows
When large volumes of data are involved, success depends on the data engineering of open and flexible pipelines, optimization of relevant data streams, rapid feedback, large-scale data capacity, and focused human expertise.
A solutions-first approach is grounded in the ability to quickly navigate a vast ecosystem for specialized capabilities as required. This will be mission-critical for the deployment of AI. With each provider within the ecosystem bringing specialized expertise in the setup of flexible and productive workflows, less time is spent in solving problems that have already been solved. When a customer works with a company within the DataOps ecosystem, recommendations can be made for the best combinations of specialized tools and customized capabilities for that specific project and workflow.
Monitoring and insights
The ML DataOps pipeline consists of several workflows and processes. Detailed reporting is crucial for analyzing each workflow’s productivity and making informed decisions about project progress. The use of APIs and open-source tools will enable complete report customization, including drag-and-drop filters. Reports with details down to the granular level will become part of the reporting suite. Real-time reporting enables dynamic troubleshooting and resource allocation. Root cause analysis and trend spotting over time also help save costs over the long term.
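As a rough illustration of what granular reporting might compute, the sketch below aggregates a task log into per-workflow metrics. The log format (`workflow`, `seconds`, `accepted`) and the `summarize` function are assumptions made for this example, not the schema of any particular tool.

```python
from collections import defaultdict

def summarize(task_log):
    """Aggregate per-task records into per-workflow productivity metrics."""
    totals = defaultdict(lambda: {"tasks": 0, "seconds": 0.0, "accepted": 0})
    for record in task_log:
        t = totals[record["workflow"]]
        t["tasks"] += 1
        t["seconds"] += record["seconds"]
        t["accepted"] += int(record["accepted"])
    return {
        workflow: {
            "tasks": t["tasks"],
            "avg_seconds": t["seconds"] / t["tasks"],
            "acceptance_rate": t["accepted"] / t["tasks"],
        }
        for workflow, t in totals.items()
    }

# Hypothetical log: one record per completed annotation task.
log = [
    {"workflow": "labeling", "seconds": 30, "accepted": True},
    {"workflow": "labeling", "seconds": 50, "accepted": False},
    {"workflow": "review", "seconds": 20, "accepted": True},
]
report = summarize(log)
```

A falling acceptance rate or a rising average handling time in one workflow is the kind of signal that can prompt root cause analysis or a reallocation of resources.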