
UK privacy watchdog to examine practice of web scraping to get training data for AI

Britain’s data protection regulator, the Information Commissioner’s Office (ICO), is scrutinizing the legality of using web scraping to collect training data for generative AI models.

It announced on Monday the first consultation in a series focusing on generative AI models — the tools that create text or images based on a prompt after being trained on enormous datasets of similar media.

The collection of this training data can pose challenges under privacy laws because it risks sweeping up personal data, particularly since collection at this scale is almost always automated.

Research papers have uncovered ways to extract training data from large language models (LLMs), potentially exposing personal information. The National Cyber Security Centre has also warned that prompt injection attacks could be a fundamental flaw in all such AI tools, allowing attackers to access otherwise protected LLM data.

While there are concerns about web scraping infringing on intellectual property or contract law, the ICO’s consultations will be focusing on data protection standards.

“Based on current practices, five of the six lawful bases [for processing data under British laws] are unlikely to be available for training generative AI on web-scraped data,” wrote the ICO.

The only remaining lawful basis under the U.K. GDPR, legitimate interests, requires the entity doing the training to take a variety of actions, including weighing individuals’ rights to have their data handled safely against the necessity of web scraping for most generative AI training.

“We invite all stakeholders with an interest in generative AI to respond and help inform our positions. This includes developers and users of generative AI, legal advisors and consultants working in this area, civil society groups and other public bodies with an interest in generative AI,” the ICO stated.

Alexander Martin is the UK Editor for Recorded Future News. He was previously a technology reporter for Sky News and is also a fellow at the European Cyber Conflict Research Initiative.