Tags: python · selenium · anti-bot-techniques

How to Build a Web Scraper That Doesn't Get Blocked

How I built a scraper that collects real estate data while mimicking human behavior to avoid bot detection — architecture decisions, key challenges, and what I'd do differently.


Suhail Roushan

May 10, 2026 · 4 min read

I built a Python web scraper that collects real estate listings without triggering bot detection, using Selenium and deliberate human-like behavior patterns. This project was essential for gathering market data where APIs were unavailable or prohibitively expensive. The core challenge wasn't extracting data, but doing so reliably over thousands of pages without getting IP-banned.

Architecture Overview

The system is built around a single scraping service that orchestrates browser behavior, request management, and data persistence. It prioritizes resilience over raw speed. A controller manages the workflow, while a dedicated browser driver handles all page interactions, applying anti-detection techniques at every step. Cleaned data is then structured and saved to a database for analysis.

flowchart TD
    A[Controller] --> B[Browser Driver]
    B --> C{Apply Anti-Detection}
    C --> D[Execute Page Actions]
    D --> E[Extract & Clean Data]
    E --> F[(Database)]
    C --> G[Failure: Log & Rotate]

Key Technical Decisions

The first major decision was using Selenium over simpler HTTP libraries like requests. For modern, JavaScript-heavy real estate sites, you need a real browser to execute code and render dynamic content. A headless browser is detectable, so I used a headed version but minimized its window to reduce resource load.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
# Modest, common window size; a full-screen automation window stands out
options.add_argument("--window-size=1024,768")
# Stop Blink from exposing the automation flag to the page
options.add_argument("--disable-blink-features=AutomationControlled")
# Remove the "Chrome is being controlled by automated software" banner
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option("useAutomationExtension", False)

driver = webdriver.Chrome(options=options)
# Hide the webdriver flag that detection scripts check first
driver.execute_script("Object.defineProperty(navigator, 'webdriver', {get: () => undefined})")
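One caveat with execute_script: it only patches the page currently loaded, so the override vanishes on the next navigation. With Chromium drivers, Selenium 4 can inject the same override through the DevTools Protocol so it runs before every document loads. A minimal sketch:

# Inject the override before each new document loads, so it survives navigation
driver.execute_cdp_cmd(
    "Page.addScriptToEvaluateOnNewDocument",
    {"source": "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"},
)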

The second decision was implementing a stochastic delay system. Fixed pauses between actions are a clear bot signature. My solution was to generate random, human-like intervals based on a normal distribution, simulating reading and hesitation times.

import time
import numpy as np

def human_delay(base=2.0, variability=1.5):
    """Generate a random delay that mimics human reading/hesitation."""
    # Sample from a normal distribution centered on the base time
    delay = np.random.normal(base, variability)
    # Enforce a minimum wait and cap extreme outliers
    delay = max(0.5, min(delay, base * 3))
    time.sleep(delay)

# Usage before a click or data extraction
human_delay(base=1.7, variability=1.2)

What Broke and How I Fixed It

The first major breakage occurred when the site started serving a CAPTCHA after about 50 page requests. The scraper would halt entirely. My fix was multi-pronged: I implemented automatic IP rotation using a pool of residential proxies and added a detection routine to pause and alert if a CAPTCHA page's HTML structure was found. This turned a complete stop into a manageable, monitored event.
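Here is a minimal sketch of that recovery loop. The PROXIES pool and the CAPTCHA_MARKERS strings are illustrative placeholders; the markers you match on depend on which CAPTCHA vendor the site uses.

import itertools
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Hypothetical residential proxy pool; substitute your provider's endpoints
PROXIES = itertools.cycle(["http://proxy1:8000", "http://proxy2:8000"])

# Illustrative fingerprints of the CAPTCHA page's HTML
CAPTCHA_MARKERS = ("g-recaptcha", "hcaptcha", "are you a human")

def make_driver(proxy):
    """Launch a fresh browser routed through the given proxy."""
    options = Options()
    options.add_argument(f"--proxy-server={proxy}")
    return webdriver.Chrome(options=options)

def captcha_detected(driver):
    """Check the rendered HTML for known CAPTCHA fingerprints."""
    html = driver.page_source.lower()
    return any(marker in html for marker in CAPTCHA_MARKERS)

def rotate_on_captcha(driver):
    """Tear down the flagged session and come back on a new IP."""
    if captcha_detected(driver):
        print("CAPTCHA hit, rotating proxy")  # alerting hook goes here
        driver.quit()
        return make_driver(next(PROXIES))
    return driver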

The second issue was subtle: mouse movement detection. Even with delays, the cursor moving in perfectly straight lines from link to link was flagged. I integrated pyautogui to generate slightly erratic, curved mouse paths between target elements before a click, which drastically reduced detection rates on more sophisticated platforms.
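pyautogui drives the real OS cursor, which is one more reason the browser runs headed. The sketch below shows the core idea: sample points along a quadratic Bezier curve with a randomly offset control point, then step the cursor through them at slightly uneven speeds. How you map a Selenium element's page coordinates to the screen coordinates pyautogui needs depends on your window position and toolbar height, so that part is setup-specific.

import random
import time
import pyautogui

def curved_move(x1, y1, x2, y2, steps=25):
    """Move the cursor along a bowed, jittered path instead of a straight line."""
    # A random control point bends the path off the straight line
    cx = (x1 + x2) / 2 + random.uniform(-100, 100)
    cy = (y1 + y2) / 2 + random.uniform(-100, 100)
    for i in range(1, steps + 1):
        t = i / steps
        # Quadratic Bezier interpolation between the endpoints
        x = (1 - t) ** 2 * x1 + 2 * (1 - t) * t * cx + t ** 2 * x2
        y = (1 - t) ** 2 * y1 + 2 * (1 - t) * t * cy + t ** 2 * y2
        pyautogui.moveTo(x, y)
        time.sleep(random.uniform(0.005, 0.02))  # uneven step timing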

How to Build Something Similar

Start by manually browsing your target site and noting the sequence of actions a human takes. Open developer tools and monitor the network tab to see the timing of XHR requests. Your code should mirror this rhythm.

Begin with a basic Selenium script that navigates to one page and extracts your target data. Before scaling, immediately integrate your delay function and the critical browser option to set navigator.webdriver to undefined. Use explicit WebDriverWaits for elements, not static sleeps. Only after this works reliably should you add complexity like proxy rotation. You can find the core patterns I used documented on my portfolio, suhailroushan.com.
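As a concrete example, here is a minimal explicit wait; the div.listing-card selector is a placeholder for whatever your target site actually uses:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Block until the listings render (up to 10 s) instead of sleeping a fixed interval
wait = WebDriverWait(driver, timeout=10)
cards = wait.until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, "div.listing-card"))
)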

Would I Build It the Same Way Again?

For a targeted, mid-volume scraper against complex sites, yes—the Selenium-based approach remains effective. However, for scraping thousands of listings daily, I would now consider a hybrid approach. I'd use a tool like Playwright, which offers better built-in stealth properties, and combine it with a lightweight HTTP library for repetitive API calls discovered during the browser session. This splits the workload: the browser handles the initial page load and JavaScript execution, while faster, direct HTTP requests handle pagination, reducing overall runtime and resource use.
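A sketch of that handoff, with a hypothetical /api/listings endpoint standing in for whatever JSON calls you discover in the network tab: Playwright loads the first page and establishes the session, then requests reuses its cookies for the cheap pagination calls.

import requests
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    context = browser.new_context()
    page = context.new_page()
    page.goto("https://example.com/listings")  # browser handles JS and any challenge
    cookies = context.cookies()
    browser.close()

# Reuse the browser's session cookies for fast, direct pagination requests
session = requests.Session()
for c in cookies:
    session.cookies.set(c["name"], c["value"], domain=c["domain"])

for page_num in range(2, 10):
    # Hypothetical JSON endpoint observed during the browser session
    resp = session.get("https://example.com/api/listings", params={"page": page_num})
    resp.raise_for_status()
    listings = resp.json()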

The non-negotiable first step is always to respect robots.txt and check a site's terms of service—ethical scraping is sustainable scraping.


Written by Suhail Roushan — Full-stack developer. More posts on AI, Next.js, and building products at suhailroushan.com/blog.
