Hello,

Common Crawl is very likely the most influential nonprofit you have never heard of. For many years, the organization has been flying under the radar – crawling and archiving the internet and openly sharing that data with everyone. But everything changed when OpenAI revealed Common Crawl as the primary source of training data for GPT-3, the large language model (LLM) that still powers the free version of ChatGPT. Now Common Crawl's more than 9.5 petabytes of data are the go-to source for LLM builders – reason enough for us to investigate this influential dataset.

In our brand-new report, Mozilla finds that Common Crawl's outsized role in the generative AI boom has improved transparency and competition, but is also contributing to biased and opaque generative AI models. But better is possible. Read our report on Common Crawl to learn how one small nonprofit shapes generative AI as we know it.

Key findings

- Common Crawl's data is only a fraction of the internet: it primarily captures English-language content, and websites from digitally marginalized communities are less likely to be included.
- Automated filtering isn't cutting it; human curation is a must: Common Crawl's data contains hate speech and explicit content that is useful to many researchers but harmful when used to train consumer products without care.
- Common Crawl and LLM builders have a shared responsibility: Common Crawl should highlight its limitations and biases and push for transparency among builders. Builders, in turn, need to share how they filtered Common Crawl's data and what measures they take to address harms from biased and explicit datasets.
Read the report to learn what can be done to build generative AI responsibly.

Thanks for all you do for the internet.

Stefan Baack
Research and Data Analyst
Mozilla