Many artificial intelligence tools use public data to train their large language models, but now large social media sites are looking for ways to defend against data scraping. The problem is that scraping isn’t currently illegal.
According to Cybernews, data scraping refers to a computer program extracting data from the output generated by another program, and it is becoming a big problem for large social media sites like Twitter and Reddit.
One example comes from Steve Huffman, Reddit’s co-founder, who told The New York Times that it is unacceptable for AI companies like OpenAI to scrape huge amounts of Reddit data to train their systems for free. He then monetized access to the site’s data, angering thousands of users.
According to Dan Pinto, co-founder and chief executive of Fingerprint, generative AI models are not the only ones that can scrape companies’ data for training. Malicious actors or even competitors could steal data for nefarious purposes.
In an interview with Cybernews, Pinto explains that to deal with this issue, companies can implement web application firewalls, block IP ranges, countries, and data centers that are known to host scrapers, or add a CAPTCHA system. Nevertheless, he cautions: “With data scraping, you can never prevent 100% of the attempts. Your goal is to increase the difficulty level for scrapers to the correct level for your business.”
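To make the IP-range blocking that Pinto mentions more concrete, here is a minimal sketch in Python. It is not Fingerprint's method or any particular firewall's API. The `BLOCKED_RANGES` list and the `is_blocked` helper are hypothetical names, and the ranges shown are reserved documentation addresses standing in for real data-center ranges.

```python
import ipaddress

# Hypothetical blocklist. These CIDR ranges are reserved for documentation
# (RFC 5737) and only stand in for ranges known to host scrapers.
BLOCKED_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_blocked(client_ip: str) -> bool:
    """Return True if the client address falls inside any blocked range."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in network for network in BLOCKED_RANGES)


if __name__ == "__main__":
    for ip in ("192.0.2.44", "203.0.113.7"):
        print(ip, "blocked" if is_blocked(ip) else "allowed")
```

A check like this would typically run before a request reaches the application. As the quote above suggests, it only raises the cost of scraping, since a determined scraper can move to addresses outside the list.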
When asked about cases like Twitter and Reddit, he says, as many third-party app developers do, that companies need to maintain open APIs (application programming interfaces) and charge appropriate prices, while at the same time making data scraping very challenging.
There is a hitch, however: web or data scraping isn’t illegal.
On the one hand, ordinary data scraping and crawling can actually help businesses grow much faster. Pinto gives an example from his own work on a search engine for used machinery, which used crawling to collect information on the machinery available for sale online. He says he views this as ethical “because it helped both equipment buyers and sellers to complete many more transactions than before.”
On the other hand, it becomes problematic when data that is not publicly available gets extracted. Regardless of intent, that is theft.
“Regulations, policies, and even best practices are still being figured out, but recent rulings have pointed towards if information is available in the open it should be accessible to bots,” Pinto concluded. “This points again to focusing on making scraping difficult instead of depending on lawsuits.”