The Web Scraping Dilemma: How to Block Scraping Bots

Delve into the challenges of web scraping, its ethical implications, and proactive measures to shield data from scraping bots.

October 16, 2023

Fingerprint Co-founder and CEO Dan Pinto dives into the buzz surrounding web scraping, its legal and ethical implications, and strategies for businesses to safeguard their data from scraping bots.

Data scraping, specifically web scraping, is on the minds of tech leaders, regulators, and consumer advocates. Leaders from a dozen international privacy watchdog groups sent social media networks a statement urging them to protect user information from scraping bots. Meanwhile, X Corp (formerly known as Twitter) sued four unnamed individuals for scraping its site. Google and OpenAI also face lawsuits for privacy and copyright violations related to web scraping.

Data scraping is not illegal. It’s big business. Experts expect the web scraping software market value to reach nearly $1.7 billion by 2030, up from $695 million in 2022. Scraping can be useful, allowing us to track flight prices or compare products across sites. Companies use it to gather market research or aggregate information. Popular large language models (LLMs) like Bard and ChatGPT are trained on scraped data.

Web scraping has been around for many years. So why has it become a buzzword generating so much concern? And what can businesses do to prevent it?

What Is Web Scraping?

Let’s start with the basics. Web scraping typically uses bots to extract information from websites. The practice has many applications, from the helpful to the infamous. 

Web scraping is different from web crawling. Search engines use web crawlers to index web pages and provide search results to users, who follow a link back to the source. Data scraping involves extracting the data from the page and using it elsewhere. To use an analogy: crawling makes a list of library books to check out; scraping copies the books for you to take home.
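
To make the distinction concrete, here is a minimal sketch in Python using the third-party requests and BeautifulSoup libraries; the URL and CSS selectors are placeholders rather than any real site's markup.

```python
# A minimal sketch of crawling vs. scraping. The URL and CSS selectors are
# placeholders, not a real site's structure.
import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com/products", timeout=10).text
soup = BeautifulSoup(html, "html.parser")

# Crawling: collect links so an index can send users back to the source.
links = [a["href"] for a in soup.find_all("a", href=True)]

# Scraping: extract the data itself (hypothetical product names and prices)
# so it can be stored and reused elsewhere.
products = [
    {
        "name": item.select_one(".name").get_text(strip=True),
        "price": item.select_one(".price").get_text(strip=True),
    }
    for item in soup.select(".product")
]
```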

AI scraping, on the other hand, enters a gray area because it does not return value to the original content creator. The further the flow of value drifts from the original author, the more unethical the scraping becomes.

Why Is Web Scraping a Hot Topic?

We’ve all likely seen web scraping on travel search sites, real estate listings, and news aggregators, among many others. However, generative AI’s popularity is bringing concerns to the forefront. Engineers train these models on data, including personal information and intellectual property scraped from the web. The LLM could replicate the proprietary information without properly attributing the creator. Experts believe these copyright issues will head to the U.S. Supreme Court.

Additionally, scrapers are becoming more advanced. While scraping does not technically count as a data breach, many bad actors use the information for malicious ends, including:

  • Identity theft.
  • Spam.
  • Phishing.
  • Cyberattacks.
  • Counterfeit websites.
  • Plagiarism.
  • Price manipulation.
  • Ad fraud.
  • Credential or coupon stuffing. 
  • Account takeovers.
  • Fake content generation.

Even scrapers with good intentions create ripple effects. Bots consume bandwidth during each website visit, causing longer loading times, higher hosting costs, or disrupted service. And any resulting duplicate content may harm search engine optimization. 

Policymakers and government agencies are currently considering how to put guardrails on scraping bots. However, recent rulings suggest regulations may grant bots access to openly available information. 

Regardless of the ethical questions, businesses can decide what data to make available. 

How Can Companies Prevent Scraping?

Blocking 100% of scraping attempts is impossible. Instead, your goal should be making it more difficult for scrapers to access your protected data. Here’s how. 

  • Robots.txt: This file tells web robots which pages on your site are off-limits for crawling, such as login or checkout pages that hold sensitive information. It can also limit how often bots visit, for example by asking search engines to crawl only once a day to prevent performance problems. However, some bots ignore these directives, and others actively circumvent them, so businesses must add additional safeguards (see the first sketch after this list).
  • Web application firewall (WAF): WAFs offer a first line of defense against malicious bots, filtering and blocking problematic traffic before it reaches the site. The firewall inspects HTTP (Hypertext Transfer Protocol) traffic to identify patterns associated with attacks and can be configured to block traffic from IP ranges, countries, and data centers known to host bots (see the second sketch after this list). A WAF will block many bots, but newer, sophisticated ones may still sneak through.
  • CAPTCHA: We are all familiar with the Completely Automated Public Turing test to tell Computers and Humans Apart, better known as CAPTCHA. This challenge-response test presents a puzzle that only humans should be able to solve, such as selecting all the squares containing a motorcycle, to keep automated bots off the site. However, this gatekeeper comes at the cost of user experience, and a new study revealed that bots can solve these tests more quickly than humans. Businesses can’t rely on CAPTCHA alone.
  • Device intelligence: Leveraging device intelligence helps businesses distinguish bots from legitimate website users. Device intelligence includes browser and device fingerprinting, which uses signals and device characteristics such as IP address, location, VPN (virtual private network) use, and operating system to identify unique devices. This information can identify traffic from countries that often host bots, visitors with a history of bot-like behavior, and other dubious devices.
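
To illustrate the robots.txt point above, here is a minimal sketch of how a compliant crawler interprets the file, using Python’s built-in urllib.robotparser; the domain, paths, and user-agent string are placeholder examples, and ill-behaved bots simply skip this check.

```python
# How a well-behaved bot reads robots.txt. Example directives the file
# might contain:
#   User-agent: *
#   Disallow: /checkout
#   Crawl-delay: 86400
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://www.example.com/robots.txt")
rp.read()  # fetch and parse the file

# A compliant crawler checks permission before requesting a page...
print(rp.can_fetch("SomeBot", "https://www.example.com/checkout"))  # False here

# ...and honors any crawl-delay directive to limit visit frequency.
print(rp.crawl_delay("SomeBot"))  # e.g. 86400 seconds, or None if unset
```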
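
As a rough illustration of the WAF bullet, the sketch below filters requests by source IP range using Python’s built-in ipaddress module; the blocked ranges are reserved documentation networks standing in for data-center ranges, and a real WAF would also inspect request patterns and rely on maintained threat feeds.

```python
# A toy version of WAF-style IP filtering. The blocked ranges below are
# reserved documentation networks used as stand-ins for data-center ranges
# known to host bots.
from ipaddress import ip_address, ip_network

BLOCKED_RANGES = [
    ip_network("203.0.113.0/24"),
    ip_network("198.51.100.0/24"),
]

def is_blocked(client_ip: str) -> bool:
    """Return True if the client IP falls inside any blocked range."""
    addr = ip_address(client_ip)
    return any(addr in net for net in BLOCKED_RANGES)

print(is_blocked("203.0.113.42"))  # True: inside a blocked range
print(is_blocked("192.0.2.10"))    # False: allowed through to the site
```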

Bots send many signals that human users do not, including errors, network overrides, and browser attribute inconsistencies. Device intelligence detects these signals to distinguish potential scrapers. Bots also behave differently from humans, so device intelligence monitors visitor behavior to flag suspicious actions, such as repeated login attempts or repeated requests for the same information.
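
As a simplified example of the behavioral side, the sketch below flags a visitor whose request rate within a sliding window exceeds a human-plausible threshold; the device identifier, window, and threshold are illustrative assumptions, not Fingerprint’s actual scoring logic.

```python
# A toy rate-based flag: count requests per device identifier in a sliding
# window. The window and threshold are illustrative, not production values.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_REQUESTS_PER_WINDOW = 30

_request_log = defaultdict(deque)  # device_id -> timestamps of recent requests

def looks_like_bot(device_id: str) -> bool:
    """Flag a visitor whose request rate exceeds a human-plausible threshold."""
    now = time.time()
    log = _request_log[device_id]
    log.append(now)
    # Drop timestamps that have fallen outside the sliding window.
    while log and now - log[0] > WINDOW_SECONDS:
        log.popleft()
    return len(log) > MAX_REQUESTS_PER_WINDOW
```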

Realistically, businesses must combine several safety features to create sufficient hurdles for bots. With scrapers’ growing sophistication, protections require frequent updates to maintain effectiveness. 

Will we ever resolve the web scraping debate? Perhaps not. While the practice is neither inherently good nor bad, companies must decide their comfort level with the extent of data openness and act accordingly to protect their assets.

Why do ethical concerns matter, and how can businesses safeguard data from scraping bots? Let us know on Facebook, X, and LinkedIn. We’d love to hear from you!

Image Source: Shutterstock

Dan Pinto

CEO and co-founder, Fingerprint

Dan Pinto is CEO and co-founder of Fingerprint and brings over a decade of experience in tech. He began his career in software engineering, where he developed an interest in creating bots, but quickly shifted his focus to entrepreneurship. Dan has founded many small startups, including eBay stores, a tech blog, and even a forum for TV shows. In 2014, Dan co-founded Machinio, a search engine for used machinery, which was acquired by Liquidity Services (NASDAQ:LQDT) in 2018. After this success, he co-founded Fingerprint, the world’s most accurate device identifier, which has raised over $44 million since its first funding round in 2020. Fingerprint currently employs over 100 people and is dedicated to solving the complex issue of online fraud. When he's not busy building companies, Dan enjoys spending time with his family; he lives in Chicago with his wife and their nearly two-year-old son.