It is well known that the Internet is the information source par excellence, where large amounts of information on many different topics can be found. This is especially relevant in the field of generative artificial intelligence (AI), which has been booming in recent times, as it requires large datasets for training and continuous improvement. In this context, web scraping or data scraping, as well as other similar techniques, have become essential tools for collecting information from the Internet more efficiently and quickly, without committing too many human or financial resources.
What is web scraping or data scraping?
Web scraping, or data scraping, is the technique used to automatically extract information, content (including images or videos) and data from websites or other online spaces. Generally, this extraction is performed by software known as ‘scrapers’ or ‘bots’. The content extracted through this technique can then be stored, analysed and used for various purposes, such as training algorithms.
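To make the idea of a ‘scraper’ more concrete, the following is a minimal sketch in Python, using only the standard library. It works on a hard-coded sample page rather than a live website, and the class name, function name and sample content are our own assumptions for illustration. It shows how visible text and link targets can be extracted automatically from HTML.

```python
from html.parser import HTMLParser


class TextAndLinkScraper(HTMLParser):
    """Collects visible text fragments and link targets from an HTML page."""

    def __init__(self):
        super().__init__()
        self.texts = []
        self.links = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep only non-empty text that is not inside a script or style block.
        if text and self._skip_depth == 0:
            self.texts.append(text)


def scrape(html):
    """Parse an HTML string and return the extracted text and links."""
    parser = TextAndLinkScraper()
    parser.feed(html)
    parser.close()
    return {"texts": parser.texts, "links": parser.links}


# Hypothetical page content, used instead of a real download.
SAMPLE_PAGE = """
<html><body>
  <h1>Staff directory</h1>
  <p>Contact Jane Doe at <a href="mailto:jane.doe@example.com">jane.doe@example.com</a></p>
  <script>var tracking = 1;</script>
</body></html>
"""

result = scrape(SAMPLE_PAGE)
print(result["texts"])
print(result["links"])
```

Even in this small example, the extracted content includes a name and an email address, which illustrates how easily personal data ends up in scraped output.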
How does this relate to the protection of personal data?
It is important to note that web scraping involves the processing of personal data when the websites or online spaces on which the technique is applied contain personal information. Even if the tool is designed to collect the information and anonymise it, the anonymisation process is itself a processing of personal data subject to compliance with data protection regulations.
Therefore, web scraping raises important questions as to its compatibility with data protection law. In this article, we will analyse the legality of this technique in relation to current data protection regulations, i.e. the General Data Protection Regulation (GDPR) and the pronouncements of the Spanish Data Protection Agency (AEPD).
Is it legal to capture information from the Internet, as a publicly accessible source?
Firstly, we must assess whether the Internet is a public source that allows us to collect and use the information available on websites, social networks, etc., and use it for our purposes without limitation.
In this regard, the previous legislative framework in Spain, regulated by Organic Law 15/1999, repealed by the GDPR, established the exception to consent in the processing of personal data when the data came from ‘publicly accessible sources’, which consisted of a closed list formed by the promotional census, telephone directories, lists of persons belonging to professional groups, newspapers and official bulletins and the media.
In the current regulation, although the GDPR refers to public sources in some provisions, it does not provide a specific definition, nor does it provide a list of sources that can be considered public. If we look at the availability of the source, the Internet could be a public source, but the AEPD has ruled (for example, in report 0089/2020, on the ASEDIE Code of Conduct) that, although the information published on a website (also applicable to social networks, blogs, forums or other online spaces) is available for consultation by any user, the personal data contained therein do not constitute freely usable data.
What should we take into account, then, if we want to collect information from the Internet using web scraping or other techniques?
In order to use these techniques and collect information from the Internet to, for example, train our algorithms, it is necessary to ensure that the collection of information and content: (a) does not infringe any law or right, including intellectual property or contract law; and (b) has a valid legal basis under the GDPR.
When the source is an official public administration source and the personal data are published in compliance with legal obligations, the publication has a specific purpose established by law (transparency, protection of rights, etc.). Therefore, the collection and further use of those personal data by a third party using web scraping would no longer be covered by the legal basis of the legal obligation. In this regard, there is a regime in place for the re-use of data from public administrations, or ‘open data’, which allows this re-use under licences for use and redistribution, provided that certain conditions are respected.
Consequently, we should check these licences and analyse whether they allow us to use the information to fulfil the purposes pursued. In the case of private sources, i.e. websites, social networks or other online spaces owned by private companies, it will be necessary to take into account the limitations imposed by applicable laws, depending on the type of information we want to collect and use, as well as the terms and conditions established by these entities in relation to the collection, processing and disclosure of the contents published on their spaces.
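As a technical complement to reviewing licences and terms and conditions, a scraper can at least check the machine-readable crawling rules that many websites publish in a robots.txt file. The sketch below uses Python's standard `urllib.robotparser` module on a hard-coded example file. The rules, the bot name and the URLs are assumptions for illustration. Note that robots.txt is not mentioned in the sources discussed in this article and is not a substitute for the legal analysis of a site's terms: it is only one signal of what the site owner allows.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real scraper would download the file
# from the target site before crawling it.
ROBOTS_TXT = """\
User-agent: *
Disallow: /members/
"""


def is_path_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


allowed_public = is_path_allowed(
    ROBOTS_TXT, "example-bot", "https://example.com/blog/post-1"
)
allowed_members = is_path_allowed(
    ROBOTS_TXT, "example-bot", "https://example.com/members/profile/42"
)

print(allowed_public, allowed_members)
```

A check of this kind can be run before each request, so that paths the site owner has excluded from crawling are skipped automatically.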
As regards the legal basis for processing, the AEPD understands that the collection and use of information for purposes that are not related to the original purpose must be consented to by the data subject, be compatible with the original purpose or be covered by legitimate interest. However, depending on the amount of information ‘scraped’ and necessary for the purposes of processing, especially in the case of training generative AI models, it will be impracticable to seek consent from all data subjects whose data are intended to be collected and processed. Therefore, potentially only legitimate interest could apply.
In this regard, the fact that the data are contained in public registers or bulletins may be considered as a relevant factor in the weighting of legitimate interest in the processing of personal data. However, it is not a determining factor on its own. It is therefore necessary to carry out a thorough case-by-case weighing of the interests of the company, and to ensure that the intended use passes the necessity and proportionality tests.
If, in addition, the information collected falls within the special categories of data under Art. 9.1 of the GDPR, it will be necessary not only to ensure the lawful basis for the processing, but also to analyse whether the prohibition on processing these data can be lifted, which may require obtaining the explicit consent of the data subject or the application of another exception provided for in the article.
What other obligations do we have to comply with?
The web scraping process, when it involves personal data, must also comply with the other obligations of the GDPR. For example, it is necessary to assess whether the activity must undergo an impact assessment, comply with the minimisation principle, guarantee the accuracy of the data, implement adequate security measures to protect the information, provide the necessary information to users about the collection and processing of their data, and respect the rights of data subjects, including providing sufficient mechanisms for users to exercise their rights of access, rectification, erasure, etc.
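By way of illustration, the minimisation principle can be reflected in the tool itself by keeping only the fields that are needed for the intended purpose and discarding the rest before storage. The following Python sketch assumes a hypothetical scraped record and a hypothetical list of permitted fields. Which fields are actually necessary depends on the purpose of each project.

```python
# Hypothetical allow-list: the only fields assumed to be needed for the purpose.
ALLOWED_FIELDS = {"post_text", "language", "published_at"}


def minimise(record, allowed_fields=ALLOWED_FIELDS):
    """Return a copy of the record containing only the permitted fields."""
    return {key: value for key, value in record.items() if key in allowed_fields}


# Hypothetical record as it might come out of a scraper.
scraped_record = {
    "post_text": "Great weather in Valencia today.",
    "language": "en",
    "published_at": "2024-05-01",
    "author_name": "Jane Doe",
    "author_email": "jane.doe@example.com",
    "profile_url": "https://example.com/members/jane",
}

minimal_record = minimise(scraped_record)
print(minimal_record)
```

An allow-list is used here instead of a block-list so that any new field appearing on the source website is discarded by default. The free text that is kept may still contain personal data, so this step does not replace the other obligations described above.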
When the tool anonymises the information extracted, we must also ensure that the anonymisation process is secure and irreversible, periodically evaluating the effectiveness of the technique used.
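A periodic evaluation of this kind can be partly automated. The sketch below, in Python, assumes a very simple approach in which email addresses and phone numbers are replaced by placeholders using regular expressions, and a second function reports any identifiers of those types that remain. The patterns and function names are our own assumptions, and they cover only two types of identifier. Names, addresses or indirect identifiers would not be detected, so a clean result would not, on its own, show that the process is irreversible.

```python
import re

# Simplified, assumed patterns for two kinds of direct identifier.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s-]{7,}\d")


def redact(text):
    """Replace email addresses and phone numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


def residual_identifiers(text):
    """Return any email addresses or phone numbers still present in the text."""
    return EMAIL_RE.findall(text) + PHONE_RE.findall(text)


raw = "Write to jane.doe@example.com or call +34 600 123 456."
clean = redact(raw)

print(clean)
print(residual_identifiers(clean))
```

Running `residual_identifiers` regularly on samples of the processed output gives an early warning when the source websites change format and identifiers start slipping through.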
Finally, it is important to bear in mind that all the observations included in this article apply whether the company itself collects the information using these techniques, acquires data scraped by third parties, or does both. In any case, the responsibility for ensuring compliance with data protection regulations lies with the data controller, who will be the one using the information to fulfil its purposes.
Article written by
Privacy, intellectual property and technology procurement lawyer
About Metricson
With offices in Barcelona, Madrid, Valencia and Seville and a significant international presence, Metricson is a pioneering firm in legal services for innovative and technology companies. Since its inception in 2009, it has advised more than 1,400 clients from 15 different countries, including startups, investors, large corporations, universities, institutions and governments.