A local, high-performance Go-based web scraping platform that uses an interactive, browser-based intent discovery system to configure and run automated data extraction.
Source-available early-access local scraper engine for authorized browser workflows. Works across many modern site patterns, but site-specific failures may occur. Paid setup and diagnosis are available for business-critical workflows.
- High-Level Overview
- Architecture
- Features
- Project Structure
- Prerequisites
- Getting Started
- Commands Reference
- Configuration
- How It Works
- Testing
- License
Pithom Labs Scraper splits the web scraping workflow into two stages: an interactive setup stage (headed) and an automated extraction stage (headless).
+-----------------------------------------------------------------------------+
|  STAGE 1: Intent Discovery (Interactive Dashboard)                          |
|  +-------------+     +-------------+     +-------------+                    |
|  |   Chrome    |---->|   Overlay   |---->|   Intent    |                    |
|  |  (Headed)   |     |    (GUI)    |     | Discoverer  |                    |
|  +-------------+     +-------------+     +-------------+                    |
|         |                                       |                           |
|         |                                       v                           |
|         |                        +----------------------------+             |
|         +----------------------->| intent.json + session.json |             |
|                                  +----------------------------+             |
+-----------------------------------------------------------------------------+
                                       |
                                       v
+-----------------------------------------------------------------------------+
|  STAGE 2: Automated Scraping (CLI Engine)                                   |
|  +-------------+     +-------------+     +-------------+                    |
|  |   Chrome    |<----| Render Pool |<----|   Scraper   |                    |
|  | (Headless)  |     |  (go-rod)   |     |    (CLI)    |                    |
|  +-------------+     +-------------+     +-------------+                    |
|         |                   |                   |                           |
|         v                   v                   v                           |
|  +-------------+     +-------------+     +-------------+                    |
|  |   Session   |     |  Extractor  |     |   Output    |                    |
|  |  (Cookies)  |     |  (Schema)   |     | (CSV/JSON)  |                    |
|  +-------------+     +-------------+     +-------------+                    |
+-----------------------------------------------------------------------------+
- Stage 1 (Intent Discovery): Navigate to target websites in a headed Chrome browser. Using an injected visual overlay, click elements on the page to define what data to extract, configure pagination, and save cookies.
- Stage 2 (Automated Scraping): Run automated extraction using the generated "intent" configuration. The engine handles rendering pages concurrently, navigating pages, extracting target data schemas, and outputting to CSV/JSON.
| Layer | Technology | Purpose |
|---|---|---|
| Browser Automation | chromedp + go-rod | Headless/Headed Chrome control with anti-detection capabilities |
| HTML Parsing | goquery | DOM traversal and selector matching fallback |
| UI Dashboard & Overlay | Vite + React + TypeScript / JS | Interactive dashboard and selector selection overlay |
| Data Extraction | Go custom schema generator | Applies structural CSS selectors and text transformation functions |
| Build & Run Tasks | Taskfile | Cross-platform build script automation |
| Package | Purpose |
|---|---|
| internal/browser | Launches Chrome, injects cookie policies, and configures proxy connections |
| internal/overlay | Injects the selection/labeling React app into the target page |
| internal/intent | Manages intent definition discovery state |
| internal/selector | Implements the selector generalization algorithms (LCS) |
| internal/extract | Extracts structure from raw HTML using selectors and text sanitization pipelines |
| internal/render | Maintains a browser tab rendering pool to support concurrent page processing |
| internal/paginate | Executes multi-page extraction (click-next, scroll-load, URL templates) |
| internal/output | Writes tabular data to CSV (with BOM) and structured JSON |
- Visual Selector Mapping: Generate extraction logic simply by clicking elements on the page.
- LCS-Based Selector Generalization: Uses a Longest Common Subsequence (LCS) algorithm to automatically generalize specific CSS selectors into robust patterns that capture all sibling list elements.
- Stealth and Anti-Detection: Configured out-of-the-box to bypass bot detection using undetected browser footprints and cookie preservation.
- Concurrent Rendering Pool: Parallelizes detail page loads using a thread-safe render pool to speed up extraction on large lists.
- Robust Text Transforms: Clean raw text inside the engine using built-in pipelines (trim, extract_number, strip_html, abs_url, decode_html, and more).
- Multi-Page Patterns Support: Works with single lists, paginated lists, lists that open detail pages, and paginated lists with detail pages.
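
To make the concurrent rendering pool feature concrete, here is a minimal worker-pool sketch in plain Go. It is not the engine's implementation: `renderFunc`, `result`, and `renderAll` are hypothetical names, and the render step is a stand-in function rather than a real browser tab. It shows one common way to cap the number of pages in flight while keeping results in input order.

```go
package main

import (
	"fmt"
	"sync"
)

// renderFunc stands in for loading one detail page in a browser tab.
type renderFunc func(url string) (string, error)

// result pairs a URL with what rendering it produced.
type result struct {
	URL  string
	HTML string
	Err  error
}

// renderAll processes urls with at most `workers` renders in flight and
// returns results in the same order as the input.
func renderAll(urls []string, workers int, render renderFunc) []result {
	if workers < 1 {
		workers = 1
	}
	results := make([]result, len(urls))
	jobs := make(chan int)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				html, err := render(urls[i])
				// Each index is written by exactly one worker, so no mutex is needed.
				results[i] = result{URL: urls[i], HTML: html, Err: err}
			}
		}()
	}
	for i := range urls {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	return results
}

func main() {
	fake := func(url string) (string, error) {
		return "<html>" + url + "</html>", nil
	}
	for _, r := range renderAll([]string{"/a", "/b", "/c"}, 2, fake) {
		fmt.Println(r.URL, r.HTML, r.Err)
	}
}
```

Collecting errors per URL rather than aborting on the first failure suits scraping, where a single broken detail page should not discard the rest of the list.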
scraper-public/
+-- docs/ # Structured documentation guides
| +-- install.md # Getting started installation guide
| +-- tutorial.md # Step-by-step visual tutorial
| +-- taskfile.md # Automation Taskfile reference
| +-- testing.md # Testing instructions and guides
+-- scraper/ # Scraper engine source directory
| +-- assets/ # Injected visual overlay resources
| +-- cmd/
| | +-- scraper/ # Command Line Interface entrypoint
| | +-- orch/ # Orchestrator & Mission Control server
| +-- internal/ # Backend business logic
| +-- json/ # Intent samples and test configurations
| +-- manual/ # Screenshots for the tutorial guide
| +-- web/ # React Dashboard frontend code
| +-- Taskfile.yml # Project orchestration file
| +-- go.mod
| +-- go.sum
+-- LICENSE.md # Business Source License 1.1
+-- README.md # Root-level general reference
To run or build the scraper, you will need:
- Google Chrome (standard browser, not Chromium).
- Go 1.22+ (if compiling from source code).
- Node.js & npm (if building or running the React frontend in development mode).
- Task (optional task runner, install via go install github.com/go-task/task/v3/cmd/task@latest).
Refer to docs/install.md for detailed platform-specific installation steps.
Start the local dashboard server by running:
cd scraper
./scraper ui
# Or using task
task go-dev
Navigate to the dashboard at http://localhost:8080 (or http://localhost:5173 in web development mode).
Enter your target URL in the dashboard search bar. Pithom Labs Scraper will open a headed Chrome window with the visual selection overlay.
- Click elements to select list items.
- Select target fields, name them, and choose text transforms.
- Choose your pagination type (e.g. click-next or page parameters).
- Save the configuration. An intent.json and a session.json (cookies) will be written to your working folder.
Run automated extraction using the CLI with your saved config:
./scraper scrape --intent-file intent.json --session-file session.json --output results.csv
Your results will be saved as results.csv.
- discover: Launch headed Chrome to visually map selectors and build intent configurations.
- scrape: Execute automated extraction headlessly using existing intent and session files.
- replay: Replay data extraction on saved HTML snapshots without connecting to Chrome.
- diagnose: Run pre-flight checks and diagnose extraction issues automatically.
- ui: Launch the local orchestrator dashboard (Mission Control).
For details on command flags, run:
./scraper [command] --help
The intent.json schema captures the structural properties of target websites:
{
"start_url": "https://books.toscrape.com/",
"pattern": "list_paginated_detail",
"container_selector": "ol.row li",
"item_selector": "ol.row li a",
"list_fields": [
{
"name": "title",
"rel_selector": "img",
"attribute": "alt"
},
{
"name": "price",
"rel_selector": ".price_color",
"attribute": "innerText"
},
{
"name": "product_url",
"rel_selector": "a",
"attribute": "href",
"is_detail_link": true
}
],
"detail_fields": [
{
"name": "description",
"rel_selector": "#product_description + p",
"attribute": "innerText"
}
],
"pagination_type": "click_next",
"next_button_selector": "ul.pager li.next a",
"needs_js_render": false
}
When you click multiple list elements in Stage 1, the engine receives their raw CSS paths (e.g. ul > li:nth-child(1) > a and ul > li:nth-child(2) > a).
Pithom Labs Scraper processes these paths through a Longest Common Subsequence (LCS) algorithm to find structural invariants, stripping out unique child indexes to produce ul > li > a, which matches all items in the list.
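
The generalization step described above can be sketched as a token-level LCS over the steps of two CSS paths. This is an illustration under stated assumptions, not the code in internal/selector: the names `base` and `generalize` are hypothetical, and the sketch treats two steps as equal when they match after positional pseudo-classes such as `:nth-child(n)` are removed.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// positional matches index-based pseudo-classes such as ":nth-child(3)".
var positional = regexp.MustCompile(`:nth-(child|of-type)\(\d+\)`)

// base strips positional pseudo-classes so that "li:nth-child(1)" and
// "li:nth-child(2)" compare as the same structural step.
func base(step string) string { return positional.ReplaceAllString(step, "") }

// generalize returns the longest common subsequence of two CSS paths,
// comparing steps on their position-free form.
func generalize(a, b string) string {
	x, y := strings.Split(a, " > "), strings.Split(b, " > ")
	// dp[i][j] holds the LCS length of x[i:] and y[j:].
	dp := make([][]int, len(x)+1)
	for i := range dp {
		dp[i] = make([]int, len(y)+1)
	}
	for i := len(x) - 1; i >= 0; i-- {
		for j := len(y) - 1; j >= 0; j-- {
			switch {
			case base(x[i]) == base(y[j]):
				dp[i][j] = dp[i+1][j+1] + 1
			case dp[i+1][j] >= dp[i][j+1]:
				dp[i][j] = dp[i+1][j]
			default:
				dp[i][j] = dp[i][j+1]
			}
		}
	}
	var sb strings.Builder
	skipped := false // true when a step was dropped since the last match
	for i, j := 0, 0; i < len(x) && j < len(y); {
		switch {
		case base(x[i]) == base(y[j]):
			step := base(x[i])
			if x[i] == y[j] {
				step = x[i] // identical in both paths, keep as-is
			}
			if sb.Len() > 0 {
				if skipped {
					sb.WriteString(" ") // descendant combinator across a gap
				} else {
					sb.WriteString(" > ")
				}
			}
			sb.WriteString(step)
			skipped = false
			i++
			j++
		case dp[i+1][j] >= dp[i][j+1]:
			i++
			skipped = true
		default:
			j++
			skipped = true
		}
	}
	return sb.String()
}

func main() {
	fmt.Println(generalize("ul > li:nth-child(1) > a", "ul > li:nth-child(2) > a"))
	// prints "ul > li > a"
}
```

When the two paths differ in depth, the sketch falls back to a descendant combinator across the dropped steps, since a child combinator would no longer be accurate there.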
Raw DOM text is often messy. The scraper applies transforms sequentially before exporting data:
- extract_number: Extracts integers/floats (e.g., "$19.99" -> "19.99").
- trim: Strips outer whitespace.
- strip_html: Removes nested markup tags.
- abs_url: Resolves relative hrefs to absolute links.
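
A sequential transform pipeline of this kind can be sketched as a map from transform names to string functions, applied in order. The transform names come from the list above; everything else (`transform`, `newRegistry`, `apply`, and the regex-based implementations) is a hypothetical illustration and may differ from how the engine behaves.

```go
package main

import (
	"fmt"
	"html"
	"net/url"
	"regexp"
	"strings"
)

// transform is one cleaning step applied to an extracted value.
type transform func(string) string

var (
	numberRe = regexp.MustCompile(`-?\d+(?:\.\d+)?`) // first integer or float
	tagRe    = regexp.MustCompile(`<[^>]*>`)         // naive tag matcher
)

// newRegistry maps transform names to functions. abs_url needs the page URL
// to resolve against, so the registry is built per page.
func newRegistry(pageURL string) (map[string]transform, error) {
	pageBase, err := url.Parse(pageURL)
	if err != nil {
		return nil, err
	}
	return map[string]transform{
		"trim":           strings.TrimSpace,
		"extract_number": func(s string) string { return numberRe.FindString(s) },
		"strip_html":     func(s string) string { return tagRe.ReplaceAllString(s, "") },
		"decode_html":    html.UnescapeString,
		"abs_url": func(s string) string {
			ref, err := url.Parse(s)
			if err != nil {
				return s // leave unparsable hrefs untouched
			}
			return pageBase.ResolveReference(ref).String()
		},
	}, nil
}

// apply runs the named transforms in order, skipping unknown names.
func apply(reg map[string]transform, value string, names ...string) string {
	for _, name := range names {
		if t, ok := reg[name]; ok {
			value = t(value)
		}
	}
	return value
}

func main() {
	reg, err := newRegistry("https://books.toscrape.com/")
	if err != nil {
		panic(err)
	}
	fmt.Println(apply(reg, "<b> $19.99 </b>", "strip_html", "trim", "extract_number"))
	fmt.Println(apply(reg, "catalogue/page-2.html", "abs_url"))
}
```

Order matters: running strip_html before extract_number avoids picking up digits inside tag attributes. The simple number pattern here does not handle thousands separators such as "1,299.00".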
For testing pipelines, benchmarks, and integration configurations, please check the docs/testing.md guide.
Run unit tests directly:
cd scraper
go test -v ./...
Pithom Labs Scraper is source-available under the Business Source License 1.1 (converting to Apache 2.0 on June 1, 2030).
- Additional Use Grant: You are permitted to use this software for personal, research, evaluation, and internal business workloads.
- Prohibited Use: You may not offer this software as a hosted scraping SaaS, managed extraction service, API, or competing commercial service.
Please refer to the full license parameters in LICENSE.md.