The Source. Thoughts on AI Data.

There has been a great deal of discussion about AI language models and their potential use cases. The transformer approach detailed in "Attention Is All You Need" revolutionized the way we think about organizing, analyzing and synthesizing data to unlock utility, productivity and creativity. However, there has been far less discussion about the data behind these models and the way these recent advances have changed our relationship to that data, even as they have amplified the impact of AI-driven content on our world.

Despite a series of mind-bending applications of conversational problem solving and surreal visual effects, artificial intelligence is fundamentally just another way to index data, and there is a long tradition of pattern recognition for us to consider. Organizing data is the foundation of language and society, and blending data into ideas is arguably what makes us human. So traditionally, we have taken great care with data. From the Library of Alexandria to the Dewey Decimal System, we have gone to great lengths to safeguard origins and connect the dots between sources.

However, the internet has radically changed that tradition. While democratizing knowledge, we have also depreciated it. We built the world’s greatest library without a librarian. According to a study by the Pew Research Center, up to 15% of social media accounts are bots, and in 2023, bots accounted for over 50% of all internet traffic. Let that sink in. Half of the traffic on the web comes from bots that can be used to spread misinformation, influence opinions and manipulate perceptions. What began almost 50 years ago as The Source, the first online service, built to transfer “voluminous amounts of information,” has become saturated with ulterior motives and falsified content.

As artificial intelligence encounters these challenges, it’s important that we take a thoughtful, regulated approach to the data fueling these models. So, let’s consider three factors that define the trajectory of any LLM: Sourcing, Processing, and Presenting.

When it comes to sourcing, the data for foundation models like ChatGPT and IBM’s decoder-only Granite models falls into two general categories: objective and subjective. The first includes academic papers from places like arXiv, mathematical question-and-answer pairs from DeepMind, legal opinions from the Free Law Project, code from GitHub, SEC filings, US patents, and other factual documents. The second, more subjective category includes fictional books from places like Project Gutenberg and crowdsourced information from Wikipedia, but also far less curated content from places like Stack Exchange, social media sites and the ubiquitous Common Crawl. That’s a lot of raw data to process, and it’s critical to get it right because, as the saying goes: garbage in, garbage out.
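As a rough illustration of that two-bucket framing, a source manifest might tag each dataset by category so downstream filters can treat objective and subjective material differently. The sources below simply restate the ones named above; the manifest structure and the helper function are hypothetical, not any vendor’s actual format.

```python
# Hypothetical sketch: tagging training sources by category.
# The source names follow the essay; the structure is illustrative only
# and does not reflect any real model's actual data mixture.
TRAINING_SOURCES = [
    {"name": "arXiv",             "category": "objective",  "kind": "academic papers"},
    {"name": "Free Law Project",  "category": "objective",  "kind": "legal opinions"},
    {"name": "GitHub",            "category": "objective",  "kind": "source code"},
    {"name": "SEC filings",       "category": "objective",  "kind": "financial documents"},
    {"name": "Project Gutenberg", "category": "subjective", "kind": "fiction"},
    {"name": "Wikipedia",         "category": "subjective", "kind": "crowdsourced reference"},
    {"name": "Common Crawl",      "category": "subjective", "kind": "raw web text"},
]

def by_category(category: str) -> list[str]:
    """Return the names of all sources tagged with the given category."""
    return [s["name"] for s in TRAINING_SOURCES if s["category"] == category]

print(by_category("subjective"))  # ['Project Gutenberg', 'Wikipedia', 'Common Crawl']
```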

Perhaps more importantly, as vast as the internet may seem, it represents only a sliver of the human experience. LLMs are wholly dependent on the knowledge we codify and capture within the digital domain. AI cannot watch a sunrise or give birth to a child or even access the deep web, which holds over 90% of the data on the internet. So there are enormous gaps in knowledge that AI cannot fully understand and, as mentioned, serious quality challenges in the information it can access.

When it comes to processing, that’s probably why an estimated 80% of the work is spent “cleaning data.” For example, IBM has invested heavily in a data governance process that evaluates datasets for risk, compliance and quality. After a clearance process, datasets move through a pre-processing pipeline that includes formatting, deduplication, language identification, profanity annotation and hate filtering. As IBM writes, “data sources drawing from the open Internet, such as Common Crawl, inevitably contain abusive language.” AI companies go to great lengths to apply what is mostly procedural filtering, but it’s not entirely clear how more conceptual forms of misinformation and bias are managed, since those problems demand more nuance than scaled solutions typically offer.
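To make the procedural nature of these steps concrete, here is a minimal sketch of a pre-processing pipeline: exact deduplication, crude language identification, and profanity annotation. This is a toy assembled for illustration, not IBM’s Granite pipeline; production systems use fuzzy deduplication, trained language-ID models, and learned hate-speech classifiers rather than the stand-ins below.

```python
# Toy pre-processing pipeline: exact dedup, naive language check, profanity flag.
import hashlib
import re

EN_STOPWORDS = {"the", "and", "of", "to", "is", "in", "that", "it"}
PROFANITY = {"badword1", "badword2"}  # placeholder word list, not a real lexicon

def normalize(text: str) -> str:
    """Collapse whitespace and lowercase for hashing and matching."""
    return re.sub(r"\s+", " ", text).strip().lower()

def looks_english(text: str) -> bool:
    """Stand-in for a real language-ID model: count common English stopwords."""
    tokens = normalize(text).split()
    hits = sum(1 for t in tokens if t in EN_STOPWORDS)
    return len(tokens) > 0 and hits / len(tokens) > 0.05

def annotate_profanity(text: str) -> bool:
    """Flag (rather than delete) documents containing listed terms."""
    return any(t in PROFANITY for t in normalize(text).split())

def preprocess(docs):
    seen = set()
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode()).hexdigest()
        if digest in seen:            # exact-duplicate removal
            continue
        seen.add(digest)
        if not looks_english(doc):    # language filter
            continue
        yield {"text": doc, "profane": annotate_profanity(doc)}

if __name__ == "__main__":
    corpus = [
        "The cat sat on the mat.",
        "The cat sat on the mat.",   # duplicate, dropped
        "Le chat est noir.",         # non-English, dropped
    ]
    for record in preprocess(corpus):
        print(record)
```

Even in this toy version, the filtering is mechanical: hashes, word lists, thresholds. Nothing in it can judge whether a cleanly written, English-language document is misleading, which is exactly the gap the paragraph above describes.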

Across the industry, there are no standards or regulatory committees to safeguard the quality and integrity of data sources. Each private enterprise, from Meta to OpenAI, has been entrusted to define its own best practices and processes. There is no oversight and limited accountability. Perhaps data processing is an area worth further discussion and consideration.

Finally, the presentation of content has generated a lot of discussion, with the Writers Guild of America going on strike and Forbes threatening Perplexity with legal action over copyright violations. But beyond concerns about ownership, there are deeper considerations about the sort of traditions safeguarded since the Library of Alexandria. As data is assimilated into AI language models, the lack of attribution breaks a cultural continuity we have historically taken for granted. The new works generated by AI lack a creative connection to their influences. There’s something inherently tragic and dehumanizing about that. When we lose these nods to heritage, we break the bonds of community and creativity that fuel our imagination and growth.

Furthermore, a disconnected remix of human data imbues AI with a false hubris of originality, as if AI were autonomous and independent of the corpus of human experience it appropriated to build whatever content it has re-rendered. It presumes that videos from Sora or songs from Suno are wholly original expressions rather than an inspiring yet predictable blending of learned patterns.

This lack of attribution is particularly insidious and has fueled the idea that artificial intelligence is somehow a new species, beyond human. It’s not. It’s an evolution of language and the printing press and the internet, patterns of data, models of human experience. The results are truly astonishing and yes, AI is revolutionizing creativity and society. But it’s not beyond human. It is humanity, itself.

I hope these thoughts inspire discussion about the limitations of digital data sets, the need for data processing oversight and the importance of creative continuity as our relationship with data evolves. Thanks for reading.
