Data scraped from public websites and compiled into large data sets is what allows generative AI tools like OpenAI’s ChatGPT to write, code and generate images and videos. The more high-quality data these models are fed, the better their outputs generally are. But over the past year, many of the most important web sources used for training AI models have restricted the use of their data, according to a study published in July by the Data Provenance Initiative, a research group led by the Massachusetts Institute of Technology (MIT).

The study, which looked at 14,000 web domains included in three commonly used AI training data sets, identified what it says is an “emerging crisis in consent,” as publishers and online platforms take steps to prevent their data from being harvested. The trend raises questions about what will happen once available sources are exhausted. “If respected or enforced, these restrictions are rapidly biasing the diversity, freshness, and scaling laws for general-purpose AI systems,” say the study’s authors. They warn that the rising number of restrictions “is foreclosing much of the open Web, not only for commercial AI, but non-commercial AI and academic purposes.”

Plan B appears to have problems of its own. As they reach the limits of the human-made material available to improve the technology, AI companies, including OpenAI and Microsoft, are testing the use of so-called “synthetic” data: information created by AI systems and used to train large language models (LLMs). Research published in the journal Nature on July 24 suggests the use of such data could lead to the rapid degradation of AI models. The paper explores the tendency of AI models to collapse over time because mistakes inevitably accumulate and are amplified across successive generations of training. The speed of the deterioration depends on the severity of shortcomings in the design of the model, the learning process and the quality of the data used.
Read on to learn more about this story and other important technology news impacting business.