Seize the data! Legal and regulatory issues for artificial intelligence (AI) training data

Data plays a vital role in training artificial intelligence (AI) models, and much of it is scraped from sources such as websites. The terms data scraping and data mining are similar but not identical. Both involve bulk copying of data. Data scraping covers only the process of obtaining the data, and not all scraped data will go on to be analysed. Data mining involves analysing the data that has been scraped. Both may raise issues around intellectual property rights, data protection and breach of contract, which we explore in this article.
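For readers who prefer to see the distinction in practice, the following is a minimal sketch in Python rather than a description of how any particular AI developer works. The function names scrape and mine and the sample pages are hypothetical, and the pages are held in memory so that nothing is fetched from a real website. The first step only copies text out of the pages; the second step analyses what was copied.

```python
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the visible text of an HTML page."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Keep only non-empty runs of text found between tags.
        if data.strip():
            self.chunks.append(data.strip())


def scrape(pages):
    """Scraping: bulk-copy the text out of each page, with no analysis."""
    texts = []
    for html in pages:
        parser = TextExtractor()
        parser.feed(html)
        parser.close()
        texts.append(" ".join(parser.chunks))
    return texts


def mine(texts, top_n=3):
    """Mining: analyse the copied text, here by counting word frequency."""
    words = Counter()
    for text in texts:
        words.update(word.lower().strip(".,!?") for word in text.split())
    return words.most_common(top_n)


# Hypothetical in-memory pages standing in for fetched website content.
SAMPLE_PAGES = [
    "<html><body><h1>Data law</h1><p>Data is valuable.</p></body></html>",
    "<html><body><p>Training data needs care.</p></body></html>",
]

scraped = scrape(SAMPLE_PAGES)      # copying only
top_words = mine(scraped, top_n=1)  # analysis of what was copied
```

The legal questions discussed below can arise at either step: the copying in scrape happens whether or not mine is ever run on the result.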

Intellectual Property Rights (IPR)

Scraped data may include copyrighted works, and scraping them without the owner’s permission can lead to IPR infringement, as alleged in the Getty case (read our article on this here), where Getty Images sued Stability AI for scraping data without its permission.

 

IPR in databases

A database can attract copyright protection if “by reason of the selection or arrangement of the contents of the database the database constitutes the author’s own intellectual creation”.

The sui generis database right protects the contents stored in a database. The right subsists automatically where there has been substantial investment in obtaining, verifying or presenting the contents, and no registration is required. It is infringed where all or a substantial part of the contents of the protected database is extracted or re-utilised without the owner’s consent.

Using scraped data that is protected by copyright or the sui generis database right without permission may therefore lead to IPR infringement.

 

UK: Text and data mining exception 

The UK Intellectual Property Office (UKIPO)’s proposed extension of the copyright exception for text and data mining (TDM) has been shelved. The government’s initial reasoning for extending the exception to cover TDM for commercial purposes was that it would encourage AI innovation in the UK. However, the proposal has now been withdrawn and the current position remains: the exception applies only to non-commercial research, and any other use requires the permission of the rights holder.

In its Budget 2023 speech, the government announced that UKIPO will produce a code of practice “by the summer” to support AI firms in accessing copyrighted work as an input to their models. The government stated that “An AI firm which commits to the code of practice can expect to be able to have a reasonable licence offered by a rights holder in return”. If the code is not adopted or an agreement cannot be reached, the government may follow up with legislation. In addition, the government will support rights holders by ensuring there are “protections (e.g., labelling) on generated output”.

 

EU: Disclosure requirement under the draft AI Act

The draft EU AI Act sets out an obligation for companies deploying generative AI tools, like ChatGPT, to disclose any copyrighted material used. According to reports from Euractiv, companies may have to “make publicly available a summary” disclosing “the use of training data protected under copyright law”. This could open the floodgates to copyright claims of the kind seen in the Getty case mentioned above. It is important to note that the rules around the disclosure requirement are not yet settled and may well change before the Act is finalised.

 

Contracts and confidentiality

Data scraping could attract liability for breach of contract if the website content is protected by terms of use or a similar website usage agreement.

Confidential information within the data may be subject to restrictions under confidentiality agreements or non-disclosure agreements. Separately, the equitable doctrine of breach of confidence prohibits the use of confidential information without authorisation. It is therefore important to consider whether the data or database is confidential and whether it is subject to any confidentiality obligations or restrictions.

 

Computer Misuse Act 1990 (CMA)

Under the CMA, it is a criminal offence to access a computer program or data without authorisation. The government is reviewing its consultation on the CMA. The consultation described the current penalty under the CMA as an “insufficient penalty to deal with the seriousness of the criminality” and proposed a general offence of possessing or using illegally obtained data.

 

Data protection

If personal data is involved in data scraping and data mining, consideration needs to be given to data protection legislation. The UK General Data Protection Regulation (UK GDPR) sets out principles and lawful bases that need to be satisfied where personal data is involved. For example, under the accuracy principle it is important to ensure that personal data is not “incorrect or misleading as to any matter of fact” and, where necessary, is corrected or deleted without undue delay. It is therefore important to check the personal data for inaccuracies before using it.

Because data scraping and data mining involve large-scale processing of personal data, it is important to consider whether a Data Protection Impact Assessment (“DPIA”) should be carried out. Article 35 UK GDPR states that “where a type of processing in particular using new technologies, and taking into account the nature, scope, context and purposes of the processing, is likely to result in a high risk to the rights and freedoms of natural persons, the controller shall, prior to the processing, carry out an assessment of the impact of the envisaged processing operations on the protection of personal data”.

Read our article here to find out more about data protection issues to consider within AI systems.

 

Key takeaways

  • Always check the terms of use of a website to see if there are any terms which prohibit data scraping.

  • Consider whether the data is confidential information and be mindful of any confidentiality restrictions which may exist.

  • Where possible, obtain permission or a licence from the rights holder to avoid any IPR infringement.

  • Check whether the TDM exception applies.

  • Ensure compliance with data protection legislation where personal data is involved.

  • Check content for any inaccuracies before using the data.

  • If you are outsourcing data scraping or data mining to a service provider, ensure that they comply with applicable laws and provide you with an indemnity against any third-party claims.

 
