Alongside the misguided fearmongering about AI, many people online, including influencers and creators, hold justified fears about new technologies and the companies behind them. Many creators are speaking out against the growing AI industry, defending their content from plagiarism and questionable AI training practices.
A recent Proof News investigation into the AI industry, specifically AI training data and its use by major, wealthy AI companies, has revealed that it isn't just publicly accessible and "ethically sourced" content being used to train AI models. The report reveals that Apple, Nvidia, and Anthropic have used a training dataset built from the subtitles of creators' YouTube videos.
The dataset, known as "YouTube Subtitles," captured transcripts from creators like MrBeast and PewDiePie as well as educational content from Khan Academy and MIT. The investigation found that transcripts from media outlets like the BBC, The Wall Street Journal, and NPR were part of the dataset as well.
Tech Giants Used YouTube Content for AI Training
While EleutherAI, the dataset's creator, has not responded to requests for comment on the investigation, a research paper it published explains that this dataset, built from YouTube subtitles, is part of a compilation called "The Pile." Proof News reports that the compilation draws on far more than YouTube subtitles, including content from English Wikipedia and the European Parliament.
The Pile's datasets are publicly available, and tech companies like Apple, Nvidia, and Salesforce have used them to train AI models, including Apple's OpenELM. Despite the usage documented in various reports, many of these companies argue that The Pile's authors should be held accountable for any "potential violations."
"The Pile includes a very small subset of YouTube subtitles," Anthropic spokesperson Jennifer Martinez said. "YouTube's terms cover direct use of the platform, which is distinct from use of The Pile dataset. On the point about potential violations of YouTube's terms of service, we'd have to refer you to The Pile authors."
Though the data is technically public, the use of datasets like The Pile and YouTube Subtitles raises ethical concerns in the creator community. "It's theft," Dave Wiskus, CEO of Nebula, told Proof News. "Will this be used to exploit and harm artists? Yes, absolutely."
According to Wiskus, the practice isn't just "disrespectful" to creators' work; it also shapes the expectations and norms of an industry in which many artists already face the looming threat of being replaced by generative AI built by profit-driven companies.
AI Training Strategy & Compensation
While training AI on publicly posted content might seem harmless, the practice carries deeper implications for creators' livelihoods. "If you're profiting off of work that I've done…that will put me out of work or people like me out of work, then there needs to be a conversation on the table about compensation or some kind of regulation," said YouTuber Dave Farina, who hosts the science-focused channel "Professor Dave Explains."
These billion-dollar companies can afford to compensate the creators whose subtitles feed their training models and AI products. Instead, they cut corners to save costs and set a troubling standard for the industry. Most creators remain unaware that their content is helping to train the large, profitable AI models these companies run.
"We are frustrated to learn that our thoughtfully produced educational content has been used in this way without our consent," said Julie Walsh Smith, CEO of the production company behind Crash Course.
Artists and creators deserve compensation and recognition for their humanity and artistry, not to have their work quietly folded into AI training sets. AI cannot recreate art, connection, and humanity by training on content taken from people who never agreed to participate and were never compensated.
The growth of artist-founded, artist-focused platforms like Cara shows that creators are becoming better informed about AI training initiatives and bolder in asserting their individuality and their claims to their own art. Between Instagram's trial introductions of AI influencers and its misapplied "Made with AI" labels, it's no surprise creators are eager to break away from traditional social media apps that struggle to protect their authenticity and their rights to their content against huge tech companies and the AI industry at large.
Artistic Authenticity & Creativity from Creatives Online
AI companies and the broader tech industry often cut corners in developing their technology, sacrificing creators' content, creativity, and behind-the-scenes work along the way. They know the value of content like YouTube subtitles, which capture creators' humanity and lend it to their often "robotic" AI systems.
These YouTube subtitles and other speech-to-text datasets are a "gold mine," according to OpenAI CTO Mira Murati, because they can teach AI to replicate how people speak. In admitting to using such datasets to train "Sora," OpenAI acknowledges that creators' unique content holds incredible power.
Public Availability of 'The Pile' for Large-Scale Companies
Some companies admit to using The Pile for AI training but avoid validating, compensating, or acknowledging the data's origins. Others decline to comment on their usage at all. Whatever their willingness to comment, Proof News' report raises questions about the validity and health of the data these companies are using, especially after Salesforce revealed the "flags" it attached to content within the sets.
Salesforce flagged the datasets for profanity, noted biases against gender and religious groups, and warned of potential safety concerns. For companies like Apple, which market themselves on inclusivity and data privacy, biases and vulnerabilities in AI can severely harm users.
These datasets turn creators' hard work into profit, lifting content from their channels and platforms to build potentially harmful AI technologies.
Closing Thoughts
Taking creators' content, using it without context, and failing to compensate them is unethical and threatens their livelihoods. Large companies and tech giants should embrace transparency, especially around AI technology, and rethink their ethos. Doing so would not only bolster trust with users, but could also reshape expectations and regulations in a space that remains largely uncharted territory.