iMerit’s VP of Engineering Sudeep George has more than 18 years of experience, specializing in imaging sensors and computer vision. Sudeep co-founded Tonbo Imaging, a company that builds advanced imaging and sensor systems to sense, understand, and control complex environments, developing and manufacturing state-of-the-art computational imaging systems and products for a range of business verticals. Before Tonbo Imaging, Sudeep was Director of Engineering at Serial Innovations, where he led the complete product and engineering practice and managed a full spectrum of high-tech imaging systems designed for the military, security, and commercial markets. Sudeep holds an engineering degree from Kuvempu University and has also worked at companies such as Infosys and Samsung.
Machine learning and AI have made their way into daily life in the form of virtual assistants, robots to do our bidding, and increasing automation in almost every sphere. The road to this large-scale deployment has been and continues to be a long and complex one. While the excitement around AI has been seen in every industry and application, the ability to scale to enterprise-grade deployment has been the make or break for most projects.
As with any enterprise initiative, it all begins with the pilot. Pilots are a good indicator of whether AI is the right solution for the problem you are looking to solve, and a good litmus test before investing significant resources. Pilots, however, have their limitations, and projects that succeed in this phase often stumble when they enter production, i.e., when they scale up.
Fundamentally, an AI application is built when data helps train a model on what to do in a specific scenario – be it responding to a user query or successfully navigating through the streets. So, in order for the AI to be at its most effective, both the model and data have to be of their highest quality.
For a long time, the majority of the effort in optimizing AI was dedicated to improving the model. More recently, the industry has seen a shift from this model-centric approach to a more data-centric approach, and the significance of data quality to AI output has become the focus of industry conversations. Experiments by leading practitioners like Andrew Ng suggest that improving the quality and consistency of the data can have an outsized impact compared with tuning the model alone. We at iMerit have had a ringside view of this ongoing shift from a model-centric to a data-centric approach.
One change has been the growth of ML DataOps. The earlier growth of MLOps – the process of taking ML to production and managing and maintaining the models – helped streamline the complexities of the ML lifecycle from ingestion all the way to production and validation. ML DataOps is a branch of MLOps that focuses entirely on the collection, preparation, and use of data, complete with a feedback cycle, and it is gathering momentum in the industry. Both remain important in creating scalable AI applications across different industries.
Key considerations while scaling AI
The quality of the data required for an application directly impacts performance, and this effect is amplified when considering AI at scale. Enterprise-grade AI projects often work with millions of data points, potentially in different formats and from different sources. Each data point has to be annotated according to specific instructions and to serve different functions. For example, data annotation for autonomous mobility projects can take the form of semantic segmentation or polygon annotation. When working at scale, the data preparation effort for AI can run across months and years.
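To make the annotation formats mentioned above concrete, here is a minimal sketch of what a polygon annotation record might look like for an autonomous mobility frame. The schema (field names, the `PolygonAnnotation` class) is hypothetical and simplified; real projects typically follow established formats such as COCO:

```python
from dataclasses import dataclass, field

@dataclass
class PolygonAnnotation:
    """One polygon label on one image frame (hypothetical, simplified schema)."""
    image_id: str
    category: str                                   # e.g. "pedestrian", "vehicle"
    vertices: list = field(default_factory=list)    # [(x, y), ...] pixel coordinates

    def area(self) -> float:
        """Polygon area in square pixels, via the shoelace formula."""
        pts = self.vertices
        n = len(pts)
        s = sum(pts[i][0] * pts[(i + 1) % n][1] - pts[(i + 1) % n][0] * pts[i][1]
                for i in range(n))
        return abs(s) / 2.0

# A 100x100-pixel square region labelled as a vehicle
ann = PolygonAnnotation("frame_0001", "vehicle",
                        [(0, 0), (100, 0), (100, 100), (0, 100)])
print(ann.area())  # 10000.0
```

Even a toy record like this shows why annotation instructions matter at scale: every field (class taxonomy, vertex ordering, coordinate system) must be applied identically across millions of frames by many annotators.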
Data quality has to be maintained across these different parameters, and quality requirements grow and evolve along the stages of production. Once the POC and pilot stages are crossed and the number of workflows and experts working on the data increases, consistency becomes mission-critical.
Establishing stable processes for QC and super QC across the stages of the project can also help ensure high-quality data is moved through the pipeline and delivered to clients. Leading companies are able to ensure accuracy levels of around 98%.
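One simple form such a QC stage can take is comparing a sampled batch of annotations against gold-standard labels and holding it to an accuracy threshold. The sketch below is illustrative only (the function names and the flat label-per-item format are assumptions, not iMerit's actual pipeline):

```python
def qc_pass_rate(annotations: dict, gold: dict) -> float:
    """Fraction of sampled items whose label matches the gold standard."""
    matches = sum(1 for item_id, label in annotations.items()
                  if gold.get(item_id) == label)
    return matches / len(annotations)

def qc_gate(annotations: dict, gold: dict, threshold: float = 0.98):
    """Return (passed, rate); batches below the threshold go back for rework."""
    rate = qc_pass_rate(annotations, gold)
    return rate >= threshold, rate

# A sampled batch with one disagreement against the gold labels
batch = {"img_1": "car", "img_2": "pedestrian", "img_3": "cyclist", "img_4": "car"}
gold  = {"img_1": "car", "img_2": "pedestrian", "img_3": "cyclist", "img_4": "truck"}
passed, rate = qc_gate(batch, gold)
print(passed, rate)  # False 0.75
```

In practice the threshold, sampling strategy, and escalation path (QC versus super QC) are project-specific; the point is that the gate is an explicit, repeatable step in the pipeline rather than an ad hoc spot check.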
Expertise and skill levels
An AI project requires experts from different disciplines, and the composition and caliber of the team are significant factors when it comes time to scale. Yet assembling such a team remains one of the most common problems for companies exploring AI today.
The types of expertise that help augment an AI process can be broadly divided into technical expertise, domain expertise, and process expertise. The challenges of finding technical talent – data engineers, data scientists, developers, and others – are well-documented.
Domain experts are those who understand the problem statement and the data required to address the problem through an AI model. In the case of medical applications, trained doctors and healthcare professionals are well-suited to apply their experience from the field to help build impactful products that can improve patient experience and other outcomes. Needless to say, this expertise is very different from what is required for a retail or e-commerce application. Incorporating this type of expertise and consultation in not only developing solutions but also skilling other team members is valuable in building industry-specific products for a particular market.
The data pipeline, which includes many different steps and processes, is at its most complex during production; operations and process experts manage workflows in real time and ensure course corrections can be implemented.
Companies can choose to build internal teams that incorporate these types of experts or use a combination of internal resources and external strategic partners who can help manage the data pipeline from end to end.
Edge case management
As the number of data points in an AI data pipeline grows, edge cases come more firmly into focus. Edge cases, or rare occurrences in the data, are typically seen within the last mile of the AI development lifecycle. They arise from the complexity and sheer variation of the real world, which must be represented in the data so the AI model learns to recognize such situations and react appropriately.
Edge cases are valuable because these outliers are where the AI model needs more context and nuanced handling. Expert humans-in-the-loop can provide judgment and insight in these unique situations. An autonomous mobility workflow, for instance, can surface over 100 unique edge cases in a month, each of which has to be analyzed and given a standard resolution. At the overall workflow level, edge case management must be tackled through repeatable, scalable processes to ensure consistency.
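A common, repeatable pattern for surfacing edge-case candidates is confidence-based triage: predictions the model is unsure about are routed to a human-in-the-loop queue instead of being auto-accepted. The sketch below assumes a simple list-of-dicts prediction format and an illustrative threshold; both are hypothetical choices, not a prescribed implementation:

```python
def triage(predictions: list, confidence_threshold: float = 0.85):
    """Split model outputs into auto-accepted results and edge-case
    candidates that go to a human-in-the-loop review queue."""
    auto, review = [], []
    for pred in predictions:
        (review if pred["confidence"] < confidence_threshold else auto).append(pred)
    return auto, review

preds = [
    {"frame": "f1", "label": "pedestrian", "confidence": 0.97},
    {"frame": "f2", "label": "pedestrian", "confidence": 0.41},  # occluded: edge-case candidate
    {"frame": "f3", "label": "vehicle",    "confidence": 0.92},
]
auto, review = triage(preds)
print(len(auto), len(review))  # 2 1
```

Once a reviewer resolves an edge case, the resolution can be codified into the annotation guidelines so the same situation is handled consistently the next time it appears, which is what makes the process scale.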
Data security and governance
AI at scale brings together vast amounts of data that must be maintained and managed securely. An investment in AI fundamentally involves an investment in robust information security infrastructure. The requirements for AI can be more complex than those of traditional IT and cybersecurity, covering areas like cloud storage. Compliance with industry-standard certifications such as SOC 2 and ISO provides a framework to adhere to. In a world where remote work is increasingly prevalent, there is also greater scrutiny to ensure that on-premises security controls extend to employees working elsewhere. When working with data from around the world, geography-specific data privacy regulations like Europe’s GDPR and California’s CCPA also come into play.
The process of AI development is continuous: as the world around us evolves, AI products must adapt accordingly. One recent example was the need for facial recognition algorithms to learn to recognize people wearing masks during the COVID-19 pandemic. Another is self-driving cars, which are tested on the roads and bring back valuable information that must be fed back into training. The data collected and used in AI is continually changing, and data pipelines must be built with this in mind.
As companies seek to deploy AI at scale, a combination of technology, talent, and techniques is required to build an end-to-end data pipeline that can produce impactful and accurate ML applications.