AI And The Issues with Data Scraping

Aug 17, 2023

Many artificial intelligence tools use public data to train their large language models, but now large social media sites are looking for ways to defend against data scraping. The problem is that scraping isn’t currently illegal.

According to Cybernews, data scraping refers to a computer program extracting data from the output generated by another program, and it is becoming a big problem for large social media sites like Twitter and Reddit.
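To make that definition concrete, here is a minimal sketch of a scraper in Python, using only the standard library. The sample page, the choice of `<h2>` headings as the target, and the names `TitleExtractor` and `scrape_titles` are illustrative assumptions, not taken from Cybernews or from any real site.

```python
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collects the text inside every <h2> element of an HTML page."""

    def __init__(self):
        super().__init__()
        self._in_h2 = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_h2 = True
            self.titles.append("")  # start a new title

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_h2 = False

    def handle_data(self, data):
        if self._in_h2:
            self.titles[-1] += data  # accumulate text of the current <h2>


def scrape_titles(html: str) -> list[str]:
    """Extract heading text from HTML that another program generated."""
    parser = TitleExtractor()
    parser.feed(html)
    parser.close()
    return [title.strip() for title in parser.titles]


# Hypothetical page output standing in for a real site's HTML.
SAMPLE_PAGE = """
<html><body>
<h2>First post</h2><p>Body text</p>
<h2>Second post</h2><p>More text</p>
</body></html>
"""

print(scrape_titles(SAMPLE_PAGE))
```

A real scraper would fetch pages over the network and repeat this extraction at scale, which is what makes it costly for the sites being scraped.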

For example, Reddit co-founder Steve Huffman told The New York Times that it is unacceptable for AI companies like OpenAI to have been scraping huge amounts of Reddit data for free to train their systems. He then monetized access to the site’s data, angering thousands of users.

Google is also facing a lawsuit that was filed after it updated its privacy policy to allow data scraping for AI training purposes, and OpenAI is facing a lawsuit for allegedly using copyrighted books without permission to train its AI systems.

According to Dan Pinto, co-founder and chief executive of Fingerprint, generative AI models are not the only ones that can scrape companies’ data for training. Malicious actors or even competitors could steal data for nefarious purposes.

Pinto explains in an interview with Cybernews that, to deal with this issue, companies can implement web application firewalls, block IP ranges, countries, and data centers that are known to host scrapers, or add a CAPTCHA system. Nevertheless, he cautions: “With data scraping, you can never prevent 100% of the attempts. Your goal is to increase the difficulty level for scrapers to the correct level for your business.”
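Blocking IP ranges, one of the measures Pinto mentions, can be sketched in a few lines of Python with the standard `ipaddress` module. This is only an illustration under stated assumptions: the ranges below are reserved documentation addresses, not real scraper hosts, and the names `BLOCKED_RANGES` and `is_blocked` are hypothetical.

```python
import ipaddress

# Hypothetical blocklist. These are reserved documentation ranges
# (RFC 5737 and RFC 3849), used here only as placeholders.
BLOCKED_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in ("203.0.113.0/24", "198.51.100.0/24", "2001:db8::/32")
]


def is_blocked(client_ip: str) -> bool:
    """Return True if the client address falls inside a blocked range."""
    try:
        address = ipaddress.ip_address(client_ip)
    except ValueError:
        # Assumption: treat unparsable addresses as blocked.
        return True
    # An IPv4 address is never "in" an IPv6 network, and vice versa.
    return any(address in network for network in BLOCKED_RANGES)


for ip in ("203.0.113.7", "192.0.2.1", "2001:db8::1"):
    print(ip, "blocked" if is_blocked(ip) else "allowed")
```

A check like this raises the cost of scraping rather than eliminating it, since scrapers can rotate to addresses outside the listed ranges.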

When asked about cases like Twitter and Reddit, he says that, as many third-party app developers do, companies need to maintain open APIs (application programming interfaces) and charge appropriate prices, while at the same time making data scraping very challenging.

But there is a hitch: web or data scraping isn’t illegal.

On the one hand, ordinary data scraping and crawling can actually help businesses grow much faster. Pinto gives an example from his own work on a search engine for used machinery, which used crawling to collect information on the machinery available for sale online. He says he views this as ethical “because it helped both equipment buyers and sellers to complete many more transactions than before.”

On the other hand, it becomes problematic when non-publicly available data is extracted. Regardless of intent, that is theft.

“Regulations, policies, and even best practices are still being figured out, but recent rulings have pointed towards if information is available in the open it should be accessible to bots,” Pinto concluded. “This points again to focusing on making scraping difficult instead of depending on lawsuits.”
