In a GenAI World, Curating Unstructured Data Gives Your Company a Competitive Edge

Carolyn Parent is the CEO of Conveyer, an AI platform that helps businesses transform and generate proprietary data. Recently, she served as Entrepreneur in Residence at HearstLab. A recipient of the Ernst & Young Entrepreneur of the Year Award and tech veteran with 25+ years of CEO experience at growth-oriented software companies, Parent has ushered in six profitable exits. Her previous roles include: CEO of LiveSafe, a risk intelligence company acquired by Vector Solutions; Cofounder of Gravy Analytics, a location-based analytics company recognized as a leader in mobility data; and GM of Deltek, a project management software company that successfully IPO’d.

Building stand-out AI solutions starts with unlocking the power of your unstructured proprietary data

Unless you’ve been living under a rock, you know that generative AI (GenAI) is changing the world. The launch of ChatGPT triggered a tsunami of GenAI innovation as tech giants such as Google and Apple rushed to develop their own large language models (LLMs). With GenAI expected to drive $7.9 trillion per year in economic activity, businesses of all kinds are now racing to figure out what GenAI means for them, and how they can use this incredible technology to boost productivity, elevate their products, and dazzle their customers.

But there’s a catch: you can’t simply deploy an off-the-shelf LLM and expect to get results. By definition, public LLMs are available to everyone, so it’s hard to use them to generate significant competitive advantage; anything your business does with ChatGPT, mine can quickly emulate. To stand out from the crowd and unlock value that others can’t quickly match, businesses need to turn to what makes them unique — their proprietary data — to build a moat around their business.

As the Harvard Business Review recently reported, companies are already starting to build bridges between emerging GenAI technologies and proprietary data. The challenge, as they do so, is learning to transform and curate messy unstructured data into actionable assets. This, in fact, is the key frontier in the emerging GenAI arms race: the companies that most effectively operationalize their unstructured data will be best positioned to create bespoke LLMs, win market share, and stay a step ahead of their rivals.

Find Your Data Moat

Why is unstructured proprietary data the key to a successful GenAI strategy? It boils down to the fact that to get a competitive edge, you need to use GenAI to do things your rivals can’t replicate. Anyone can spin up a GenAI chatbot that provides plausible-sounding tips on how to clean your dishwasher, for instance — but create a tool that gives accurate, high-value results tailored to your brand and your customers’ specific appliance model, and you’ll rightly stand out from the pack.

To achieve those kinds of results, organizations need to use fine-tuning and prompt-tuning methodologies to infuse unstructured proprietary data into existing LLMs. Essentially, this means using valuable data to supplement the public data on which LLMs were originally trained, giving them domain-specific “expertise” and enabling them to generate more valuable outputs in ways that rivals, lacking access to the underlying proprietary data, can’t easily copy.

Importantly, this approach also helps mitigate business risks. While generic LLMs, built using unvetted public data, are prone to inaccuracies, “hallucinations,” and even copyright violations, adding proprietary data into the mix gives businesses more control and customers more confidence.

To unlock these benefits, though, organizations first need to marshal their proprietary data. The good news is that while training an LLM requires billions of data points, fine-tuning and prompt-tuning can be achieved with smaller datasets. Morningstar’s Mo tool, for instance, was prompt-tuned using about 10,000 pieces of investment research, while Morgan Stanley’s internal GenAI system was crafted with around 100,000 documents.

Creating the datasets needed for LLM customization isn’t easy, though. We’re talking about the data your company generates and uses every day — not carefully processed data sitting in neatly structured rows and columns, but messy, unstructured, organic data that lives in working documents, product manuals, customer service transcripts, and beyond. This data is often undervalued, but in the GenAI era it’s critically important, serving as a goldmine of expert insights and institutional knowledge that can be leveraged to customize and elevate LLMs.

To unlock that latent value, organizations urgently need to figure out how to surface the salient data, process it to protect customer privacy and business IP, and add tags and other structural metadata to enable efficient GenAI customization. Above all, they need to find a way to perform this data curation quickly, accurately, and affordably. That’s a challenging task, and one that, more than anything else, will determine the winners and losers in the GenAI revolution.
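To make those three steps concrete, here is a minimal Python sketch of what surfacing, redacting, and tagging might look like. It is an illustration under stated assumptions, not a production pipeline: the regular expressions, the tag vocabulary, and the names `CuratedRecord`, `redact`, `tag`, and `curate` are all hypothetical, and real privacy screening would need far broader coverage than two patterns.

```python
import re
from dataclasses import dataclass, field

# Hypothetical patterns for illustration only; real PII detection needs much more.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")

# Hypothetical tag vocabulary: each tag maps to keywords that suggest it.
TAG_KEYWORDS = {
    "dishwasher": ("dishwasher", "rinse", "detergent"),
    "warranty": ("warranty", "refund"),
}


@dataclass
class CuratedRecord:
    source: str                      # where the text came from
    text: str                        # redacted text
    tags: list = field(default_factory=list)


def redact(text):
    """Replace email addresses and phone numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


def tag(text):
    """Return the sorted tags whose keywords appear in the text."""
    lowered = text.lower()
    return sorted(
        name for name, keywords in TAG_KEYWORDS.items()
        if any(keyword in lowered for keyword in keywords)
    )


def curate(documents):
    """Turn a {source: raw_text} mapping into redacted, tagged records.

    Documents that match no known tag are treated as not salient and skipped.
    """
    records = []
    for source, raw in documents.items():
        tags = tag(raw)
        if not tags:
            continue
        records.append(CuratedRecord(source, redact(raw), tags))
    return records


if __name__ == "__main__":
    docs = {
        "ticket-1.txt": "Customer jane@example.com says the dishwasher leaks. Call 555-123-4567.",
        "lunch.txt": "Pizza on Friday.",
    }
    for record in curate(docs):
        print(record)
```

In this sketch, tagging runs on the raw text and redaction runs before anything is stored, so the curated record never holds the original contact details.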

Curation Is Key

As things stand, few companies have effective data curation processes in place: up to 90% of the world’s data is unstructured, but only 18% of companies say they’re currently able to leverage that data. Bridging that gap, and surfacing and organizing the value submerged in unstructured documents, records, and datasets, is tough — but it’s absolutely essential if we’re to sculpt dependable, high-value GenAI models.

One option is to curate unstructured proprietary data manually. That’s the approach Morgan Stanley took, committing a team of 20 analysts to validate and organize the dataset used to fine-tune their GenAI model. But manual curation — which includes trawling through datasets by hand to identify useful documents or records, screen them for sensitive data, and structure them in ways that AI tools can act on — isn’t easy.

Dedicate your best people to this task, and you’ll incur an opportunity cost: what else could these strategically and technologically astute leaders be doing with their time? Give more junior employees the responsibility, though, and you risk misjudgments or errors that lead to inefficiency, low-quality results, or even the expense of having to recalibrate your LLM with higher-quality data at a later date. That’s why many organizations are now using automated technologies to identify, catalog, sort, and structure the data they need.

Automated and AI-powered curation is, by its nature, more affordable and sustainable than manual approaches. It enables large volumes of data to be processed rapidly, empowering companies to more quickly build the GenAI tools they need. Crucially, AI-based curation can also run more-or-less continually, enabling LLMs to be updated with the latest and most salient data in near-real time — either via fine-tuning processes, or by calling up properly tagged and formatted data via APIs — to ensure your data moat gets deeper and more defensible over time.
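As a rough illustration of the second route, where curated data is looked up at question time rather than baked in through fine-tuning, the Python sketch below picks tagged records that match a question and folds them into a prompt. Everything here is an assumption made for the example: the record layout, the tag-overlap scoring, and the function names `retrieve` and `build_prompt` are hypothetical, and a real system would more likely use a search index or embeddings behind an API.

```python
import re


def retrieve(query, records, limit=2):
    """Return up to `limit` records whose tags appear as words in the query.

    Records are ranked by the number of matching tags; records with no
    matching tag are left out.
    """
    words = set(re.findall(r"\w+", query.lower()))
    scored = [(len(words & set(record["tags"])), record) for record in records]
    hits = [pair for pair in scored if pair[0] > 0]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [record for _, record in hits[:limit]]


def build_prompt(query, records):
    """Assemble a prompt that grounds the question in the retrieved records."""
    context = "\n".join(f"- {record['text']}" for record in retrieve(query, records))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


if __name__ == "__main__":
    curated = [
        {"text": "Run a hot cycle with vinegar once a month.", "tags": ["dishwasher", "cleaning"]},
        {"text": "Refunds are processed within five days.", "tags": ["refund"]},
    ]
    print(build_prompt("How do I clean my dishwasher?", curated))
```

Because the lookup runs on every question, newly curated records become available as soon as they are added to the list, with no retraining step.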

Don’t Do It Alone

To be clear, curating unstructured proprietary data is difficult — in some ways more difficult than simply training an LLM on vast volumes of undifferentiated data scraped from the public web. Trying to manage that task manually is a fool’s errand. Instead, organizations need to use AI to enable AI and find sophisticated tech partners that can help them identify and extract actionable data from their existing unstructured data assets.

This explains why some of the biggest acquisitions in the GenAI space involve firms whose technologies focus not on building new LLMs, but rather on leveraging proprietary data to customize existing models. Organizations generally lack the in-house expertise and resources to manage data curation — but by finding the right outside partners, they’re gaining the support they need to sort and structure their unstructured data and activate it for the use-cases where it can do the most good.

The bottom line: whether you’re looking to build customer support chatbots, empower employees with auto-generated analytics, or streamline digital marketing campaigns, every successful GenAI strategy starts with unstructured data curation. This is the new GenAI arms race, and the starting pistol has already fired — so it’s up to you to move quickly to find the tools and resources you need to unlock the power of your data. For businesses of all kinds, effectively curating unstructured proprietary data is the key to gaining and maintaining a competitive edge in the GenAI era.
