Pithom Labs Scraper

A local, high-performance Go-based web scraping platform that uses an interactive, browser-based intent discovery system to configure and run automated data extraction.

Source-available, early-access local scraper engine for authorized browser workflows. It works across many modern site patterns, but site-specific failures may occur. Paid setup, diagnosis, and priority support are available for business-critical workflows.

High-Level Overview

Pithom Labs Scraper splits the web scraping workflow into two stages: an interactive setup stage (headed) and an automated extraction stage (headless).

┌─────────────────────────────────────────────────────────────────────────────┐
│  STAGE 1: Intent Discovery (Interactive Dashboard)                          │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐                      │
│  │   Chrome    │───▶│   Overlay   │───▶│   Intent    │                      │
│  │  (Headed)   │    │   (GUI)     │    │  Discoverer │                      │
│  └─────────────┘    └─────────────┘    └─────────────┘                      │
│         │                                       │                           │
│         │                                       ▼                           │
│         │                              ┌──────────────────────────────┐     │
│         └─────────────────────────────▶│  intent.json + session.json  │     │
│                                        └──────────────────────────────┘     │
└─────────────────────────────────────────────────────────────────────────────┘
                                         │
                                         ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│  STAGE 2: Automated Scraping (CLI Engine)                                    │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐                      │
│  │   Chrome    │◀───│  Render Pool│◀───│   Scraper   │                      │
│  │  (Headless) │    │  (go-rod)   │    │   (CLI)     │                      │
│  └─────────────┘    └─────────────┘    └─────────────┘                      │
│         │                   │                    │                          │
│         ▼                   ▼                    ▼                          │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐                      │
│  │   Session   │    │   Extractor │    │   Output    │                      │
│  │  (Cookies)  │    │   (Schema)  │    │  (CSV/JSON) │                      │
│  └─────────────┘    └─────────────┘    └─────────────┘                      │
└─────────────────────────────────────────────────────────────────────────────┘

Stage 1 (Intent Discovery): Navigate to target websites in a headed Chrome browser. Using an injected visual overlay, click elements on the page to define what data to extract, configure pagination, and save cookies.
Stage 2 (Automated Scraping): Run automated extraction using the generated "intent" configuration. The engine handles rendering pages concurrently, navigating pages, extracting target data schemas, and outputting to CSV/JSON.

Architecture

Technology Stack

Layer	Technology	Purpose
Browser Automation	`chromedp` + `go-rod`	Headless/Headed Chrome control with anti-detection capabilities
HTML Parsing	`goquery`	DOM traversal and selector matching fallback
UI Dashboard & Overlay	Vite + React + TypeScript / JS	Interactive dashboard and selector selection overlay
Data Extraction	Go custom schema generator	Applies structural CSS selectors and text transformation functions
Build & Run Tasks	`Taskfile`	Cross-platform build script automation

Component Packages

Package	Purpose
`internal/browser`	Launches Chrome, injects cookie policies, and configures proxy connections
`internal/overlay`	Injects the selection/labeling React app into the target page
`internal/intent`	Manages intent definition discovery state
`internal/selector`	Implements the selector generalization algorithms (LCS)
`internal/extract`	Extracts structure from raw HTML using selectors and text sanitization pipelines
`internal/render`	Maintains a browser tab rendering pool to support concurrent page processing
`internal/paginate`	Executes multi-page extraction (click-next, scroll-load, URL templates)
`internal/output`	Writes tabular data to CSV (with BOM) and structured JSON

Features

Visual Selector Mapping: Generate extraction logic simply by clicking elements on the page.
LCS-Based Selector Generalization: Uses a Longest Common Subsequence (LCS) algorithm to automatically generalize specific CSS selectors into robust patterns that capture all sibling list elements.
Stealth and Anti-Detection: Configured out-of-the-box to bypass bot detection using undetected browser footprints and cookie preservation.
Concurrent Rendering Pool: Parallelizes detail page loads using a thread-safe render pool to speed up extraction on large lists.
Robust Text Transforms: Clean raw text inside the engine using built-in pipelines (trim, extract_number, strip_html, abs_url, decode_html, and more).
Multi-Page Patterns Support: Works with single lists, paginated lists, lists that open detail pages, and paginated lists with detail pages.

Project Structure

scraper-public/
+-- docs/                    # Structured documentation guides
|   +-- install.md           # Getting started installation guide
|   +-- tutorial.md          # Step-by-step visual tutorial
|   +-- taskfile.md          # Automation Taskfile reference
|   +-- testing.md           # Testing instructions and guides
+-- scraper/                 # Scraper engine source directory
|   +-- assets/              # Injected visual overlay resources
|   +-- cmd/
|   |   +-- scraper/         # Command Line Interface entrypoint
|   |   +-- orch/            # Orchestrator & Mission Control server
|   +-- internal/            # Backend business logic
|   +-- json/                # Intent samples and test configurations
|   +-- manual/              # Screenshots for the tutorial guide
|   +-- web/                 # React Dashboard frontend code
|   +-- Taskfile.yml         # Project orchestration file
|   +-- go.mod
|   +-- go.sum
+-- LICENSE.md               # Business Source License 1.1
+-- README.md                # Root-level general reference

Prerequisites

To run or build the scraper, you will need:

Google Chrome (standard browser, not Chromium).
Go 1.22+ (if compiling from source code).
Node.js & npm (if building or running the React frontend in development mode).
Task (optional task runner, install via go install github.com/go-task/task/v3/cmd/task@latest).

Getting Started

Refer to docs/install.md for detailed platform-specific installation steps.

1. Launch Mission Control Dashboard

Start the local dashboard server by running:

cd scraper
./scraper ui
# Or using task
task go-dev

Open the dashboard at http://localhost:8080 (or http://localhost:5173 in web development mode).

2. Configure Intent (Stage 1)

Enter your target URL in the dashboard search bar. Pithom Labs Scraper will open a headed Chrome window with the visual selection overlay.

Click elements to select list items.
Select target fields, name them, and choose text transforms.
Choose your pagination type (e.g. click-next or page parameters).
Save the configuration. An intent.json and session.json (cookies) will be written to your working folder.

3. Run Extraction (Stage 2)

Run automated extraction using the CLI with your saved config:

./scraper scrape --intent-file intent.json --session-file session.json --output results.csv

Your results will be saved as results.csv.

Commands Reference

discover: Launch headed Chrome to visually map selectors and build intent configurations.
scrape: Execute automated extraction headlessly using existing intent and session files.
replay: Replay data extraction on saved HTML snapshots without connecting to Chrome.
diagnose: Run pre-flight checks and diagnose extraction issues automatically.
ui: Launch the local orchestrator dashboard (Mission Control).

For details on command flags, run:

./scraper [command] --help

Configuration

The intent.json schema captures the structural properties of target websites:

{
  "start_url": "https://books.toscrape.com/",
  "pattern": "list_paginated_detail",
  "container_selector": "ol.row li",
  "item_selector": "ol.row li a",
  "list_fields": [
    {
      "name": "title",
      "rel_selector": "img",
      "attribute": "alt"
    },
    {
      "name": "price",
      "rel_selector": ".price_color",
      "attribute": "innerText"
    },
    {
      "name": "product_url",
      "rel_selector": "a",
      "attribute": "href",
      "is_detail_link": true
    }
  ],
  "detail_fields": [
    {
      "name": "description",
      "rel_selector": "#product_description + p",
      "attribute": "innerText"
    }
  ],
  "pagination_type": "click_next",
  "next_button_selector": "ul.pager li.next a",
  "needs_js_render": false
}
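
For readers who want to load this file from their own Go tooling, the sample above can be decoded as follows. The JSON keys come from the sample; the type names `Intent` and `Field` and the function `ParseIntent` are assumptions made for this sketch and may not match the types in `internal/intent`.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Field mirrors one entry of list_fields or detail_fields in the sample.
type Field struct {
	Name         string `json:"name"`
	RelSelector  string `json:"rel_selector"`
	Attribute    string `json:"attribute"`
	IsDetailLink bool   `json:"is_detail_link,omitempty"`
}

// Intent mirrors the top-level keys of the sample intent.json.
type Intent struct {
	StartURL           string  `json:"start_url"`
	Pattern            string  `json:"pattern"`
	ContainerSelector  string  `json:"container_selector"`
	ItemSelector       string  `json:"item_selector"`
	ListFields         []Field `json:"list_fields"`
	DetailFields       []Field `json:"detail_fields"`
	PaginationType     string  `json:"pagination_type"`
	NextButtonSelector string  `json:"next_button_selector"`
	NeedsJSRender      bool    `json:"needs_js_render"`
}

// ParseIntent decodes raw intent.json bytes into an Intent value.
func ParseIntent(data []byte) (Intent, error) {
	var in Intent
	err := json.Unmarshal(data, &in)
	return in, err
}

func main() {
	sample := []byte(`{
	  "start_url": "https://books.toscrape.com/",
	  "pattern": "list_paginated_detail",
	  "list_fields": [{"name": "title", "rel_selector": "img", "attribute": "alt"}],
	  "pagination_type": "click_next"
	}`)
	in, err := ParseIntent(sample)
	if err != nil {
		fmt.Println("invalid intent:", err)
		return
	}
	fmt.Println(in.Pattern, len(in.ListFields), in.ListFields[0].Name)
}
```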

How It Works

CSS Selector Generalization

When you click multiple list elements in Stage 1, the engine receives their raw CSS paths (e.g. ul > li:nth-child(1) > a and ul > li:nth-child(2) > a). Pithom Labs Scraper processes these paths through a Longest Common Subsequence (LCS) algorithm to find structural invariants, stripping out unique child indexes to produce ul > li > a, which matches all items in the list.
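
A minimal sketch of this idea, assuming selectors are split on " > " and that two segments count as matching when they are equal after removing `:nth-child(n)`. The function `Generalize` is hypothetical and handles only two paths; the logic in `internal/selector` may differ.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// nthChild matches positional pseudo-classes such as ":nth-child(3)".
var nthChild = regexp.MustCompile(`:nth-child\(\d+\)`)

// base returns a selector segment with positional indexes removed.
func base(seg string) string { return nthChild.ReplaceAllString(seg, "") }

// Generalize computes the longest common subsequence of two CSS paths,
// comparing segments by their index-free form. A segment that differs
// only by index is emitted without the index.
func Generalize(a, b string) string {
	x, y := strings.Split(a, " > "), strings.Split(b, " > ")
	// dp[i][j] holds the LCS length of x[i:] and y[j:].
	dp := make([][]int, len(x)+1)
	for i := range dp {
		dp[i] = make([]int, len(y)+1)
	}
	for i := len(x) - 1; i >= 0; i-- {
		for j := len(y) - 1; j >= 0; j-- {
			switch {
			case base(x[i]) == base(y[j]):
				dp[i][j] = dp[i+1][j+1] + 1
			case dp[i+1][j] >= dp[i][j+1]:
				dp[i][j] = dp[i+1][j]
			default:
				dp[i][j] = dp[i][j+1]
			}
		}
	}
	// Walk the table from the start to recover the subsequence.
	var out []string
	for i, j := 0, 0; i < len(x) && j < len(y); {
		switch {
		case base(x[i]) == base(y[j]):
			seg := x[i]
			if x[i] != y[j] {
				seg = base(x[i]) // drop the index that differs
			}
			out = append(out, seg)
			i++
			j++
		case dp[i+1][j] >= dp[i][j+1]:
			i++
		default:
			j++
		}
	}
	return strings.Join(out, " > ")
}

func main() {
	fmt.Println(Generalize("ul > li:nth-child(1) > a", "ul > li:nth-child(2) > a"))
	// prints: ul > li > a
}
```

Comparing index-free segments matters here: a plain LCS over the raw segments would drop the `li` step entirely, because `li:nth-child(1)` and `li:nth-child(2)` are different strings.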

Text Transformation Pipeline

Raw DOM text is often messy. The scraper applies transforms sequentially before exporting data:

extract_number: Extracts integers/floats (e.g., "$19.99" -> "19.99").
trim: Strips outer whitespace.
strip_html: Removes nested markup tags.
abs_url: Resolves relative hrefs to absolute links.

Testing

See docs/testing.md for test pipelines, benchmarks, and integration configurations.

Run unit tests directly:

cd scraper
go test -v ./...

License

Pithom Labs Scraper is source-available under the Business Source License 1.1 (converting to Apache 2.0 on June 1, 2030).

Additional Use Grant: You are permitted to use this software for personal, research, evaluation, and internal business workloads.
Prohibited Use: You may not offer this software as a hosted scraping SaaS, managed extraction service, API, or competing commercial service.

See LICENSE.md for the full license terms.
