Pithom Labs Scraper

A local, high-performance Go-based web scraping platform that uses an interactive, browser-based intent discovery system to configure and run automated data extraction.

Source-available early-access local scraper engine for authorized browser workflows. Works across many modern site patterns, but site-specific failures may occur. Paid setup and diagnosis are available for business-critical workflows.


Table of Contents

  1. High-Level Overview
  2. Architecture
  3. Features
  4. Project Structure
  5. Prerequisites
  6. Getting Started
  7. Commands Reference
  8. Configuration
  9. How It Works
  10. Testing
  11. License

High-Level Overview

Pithom Labs Scraper splits the web scraping workflow into two stages: an interactive setup stage (headed) and an automated extraction stage (headless).

┌─────────────────────────────────────────────────────────────────────────────┐
│  STAGE 1: Intent Discovery (Interactive Dashboard)                          │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐                      │
│  │   Chrome    │───▶│   Overlay   │───▶│   Intent    │                      │
│  │  (Headed)   │    │   (GUI)     │    │  Discoverer │                      │
│  └─────────────┘    └─────────────┘    └─────────────┘                      │
│         │                                     │                             │
│         │                                     ▼                             │
│         │                              ┌────────────────────────────┐       │
│         └─────────────────────────────▶│ intent.json + session.json │       │
│                                        └────────────────────────────┘       │
└─────────────────────────────────────────────────────────────────────────────┘
                                         │
                                         ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│  STAGE 2: Automated Scraping (CLI Engine)                                   │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐                      │
│  │   Chrome    │◀───│  Render Pool│◀───│   Scraper   │                      │
│  │  (Headless) │    │  (go-rod)   │    │   (CLI)     │                      │
│  └─────────────┘    └─────────────┘    └─────────────┘                      │
│         │                  │                  │                             │
│         ▼                  ▼                  ▼                             │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐                      │
│  │   Session   │    │   Extractor │    │   Output    │                      │
│  │  (Cookies)  │    │   (Schema)  │    │  (CSV/JSON) │                      │
│  └─────────────┘    └─────────────┘    └─────────────┘                      │
└─────────────────────────────────────────────────────────────────────────────┘

  • Stage 1 (Intent Discovery): Navigate to target websites in a headed Chrome browser. Using an injected visual overlay, click elements on the page to define what data to extract, configure pagination, and save cookies.
  • Stage 2 (Automated Scraping): Run automated extraction using the generated "intent" configuration. The engine renders pages concurrently, follows pagination, extracts the fields defined in the intent, and writes the results to CSV or JSON.

Architecture

Technology Stack

| Layer | Technology | Purpose |
| --- | --- | --- |
| Browser Automation | chromedp + go-rod | Headless/Headed Chrome control with anti-detection capabilities |
| HTML Parsing | goquery | DOM traversal and selector matching fallback |
| UI Dashboard & Overlay | Vite + React + TypeScript / JS | Interactive dashboard and selector selection overlay |
| Data Extraction | Go custom schema generator | Applies structural CSS selectors and text transformation functions |
| Build & Run Tasks | Taskfile | Cross-platform build script automation |

Component Packages

| Package | Purpose |
| --- | --- |
| internal/browser | Launches Chrome, injects cookie policies, and configures proxy connections |
| internal/overlay | Injects the selection/labeling React app into the target page |
| internal/intent | Manages intent definition discovery state |
| internal/selector | Implements the selector generalization algorithms (LCS) |
| internal/extract | Extracts structure from raw HTML using selectors and text sanitization pipelines |
| internal/render | Maintains a browser tab rendering pool to support concurrent page processing |
| internal/paginate | Executes multi-page extraction (click-next, scroll-load, URL templates) |
| internal/output | Writes tabular data to CSV (with BOM) and structured JSON |

Features

  • Visual Selector Mapping: Generate extraction logic simply by clicking elements on the page.
  • LCS-Based Selector Generalization: Uses a Longest Common Subsequence (LCS) algorithm to automatically generalize specific CSS selectors into robust patterns that capture all sibling list elements.
  • Stealth and Anti-Detection: Configured out of the box to reduce bot-detection blocks by using a stealth browser fingerprint and preserving session cookies.
  • Concurrent Rendering Pool: Parallelizes detail page loads using a thread-safe render pool to speed up extraction on large lists.
  • Robust Text Transforms: Clean raw text inside the engine using built-in pipelines (trim, extract_number, strip_html, abs_url, decode_html, and more).
  • Multi-Page Patterns Support: Works with single lists, paginated lists, lists that open detail pages, and paginated lists with detail pages.
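
The concurrent rendering pool listed above can be pictured as a bounded worker pool. The following is a minimal sketch under stated assumptions, not the code in internal/render: renderAll and the render callback are hypothetical names, and a real pool would give each worker a browser tab instead of a plain function.

```go
package main

import (
	"fmt"
	"sync"
)

// renderAll calls render for every URL using at most `workers` goroutines
// and returns the results in input order.
func renderAll(urls []string, workers int, render func(string) string) []string {
	if workers < 1 {
		workers = 1 // avoid blocking forever when no worker would be started
	}
	results := make([]string, len(urls))
	jobs := make(chan int)

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				// Each index is handled by exactly one goroutine,
				// so the slice needs no extra locking.
				results[i] = render(urls[i])
			}
		}()
	}

	for i := range urls {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	return results
}

func main() {
	urls := []string{"/book/1", "/book/2", "/book/3"} // placeholder detail pages
	pages := renderAll(urls, 2, func(u string) string {
		return "rendered " + u // stand-in for loading the page in a tab
	})
	fmt.Println(pages)
}
```

Sending indexes rather than URLs over the channel keeps the output aligned with the input list, which matters when detail-page fields have to be joined back to their list rows.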

Project Structure

scraper-public/
+-- docs/                    # Structured documentation guides
|   +-- install.md           # Getting started installation guide
|   +-- tutorial.md          # Step-by-step visual tutorial
|   +-- taskfile.md          # Automation Taskfile reference
|   +-- testing.md           # Testing instructions and guides
+-- scraper/                 # Scraper engine source directory
|   +-- assets/              # Injected visual overlay resources
|   +-- cmd/
|   |   +-- scraper/         # Command Line Interface entrypoint
|   |   +-- orch/            # Orchestrator & Mission Control server
|   +-- internal/            # Backend business logic
|   +-- json/                # Intent samples and test configurations
|   +-- manual/              # Screenshots for the tutorial guide
|   +-- web/                 # React Dashboard frontend code
|   +-- Taskfile.yml         # Project orchestration file
|   +-- go.mod
|   +-- go.sum
+-- LICENSE.md               # Business Source License 1.1
+-- README.md                # Root-level general reference

Prerequisites

To run or build the scraper, you will need:

  1. Google Chrome (standard browser, not Chromium).
  2. Go 1.22+ (if compiling from source code).
  3. Node.js & npm (if building or running the React frontend in development mode).
  4. Task (optional task runner, install via go install github.com/go-task/task/v3/cmd/task@latest).

Getting Started

Refer to docs/install.md for detailed platform-specific installation steps.

1. Launch Mission Control Dashboard

Start the local dashboard server by running:

cd scraper
./scraper ui
# Or using task
task go-dev

Open the dashboard at http://localhost:8080 (or http://localhost:5173 in web development mode).

2. Configure Intent (Stage 1)

Enter your target URL in the dashboard search bar. Pithom Labs Scraper will open a headed Chrome window with the visual selection overlay.

  1. Click elements to select list items.
  2. Select target fields, name them, and choose text transforms.
  3. Choose your pagination type (e.g. click-next or page parameters).
  4. Save the configuration. An intent.json and session.json (cookies) will be written to your working folder.

3. Run Extraction (Stage 2)

Run automated extraction using the CLI with your saved config:

./scraper scrape --intent-file intent.json --session-file session.json --output results.csv

Your results will be saved as results.csv.


Commands Reference

  • discover: Launch headed Chrome to visually map selectors and build intent configurations.
  • scrape: Execute automated extraction headlessly using existing intent and session files.
  • replay: Replay data extraction on saved HTML snapshots without connecting to Chrome.
  • diagnose: Run pre-flight checks and diagnose extraction issues automatically.
  • ui: Launch the local orchestrator dashboard (Mission Control).

For details on command flags, run:

./scraper [command] --help

Configuration

The intent.json schema captures the structural properties of target websites:

{
  "start_url": "https://books.toscrape.com/",
  "pattern": "list_paginated_detail",
  "container_selector": "ol.row li",
  "item_selector": "ol.row li a",
  "list_fields": [
    {
      "name": "title",
      "rel_selector": "img",
      "attribute": "alt"
    },
    {
      "name": "price",
      "rel_selector": ".price_color",
      "attribute": "innerText"
    },
    {
      "name": "product_url",
      "rel_selector": "a",
      "attribute": "href",
      "is_detail_link": true
    }
  ],
  "detail_fields": [
    {
      "name": "description",
      "rel_selector": "#product_description + p",
      "attribute": "innerText"
    }
  ],
  "pagination_type": "click_next",
  "next_button_selector": "ul.pager li.next a",
  "needs_js_render": false
}
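
For readers who want to inspect or validate an intent file from their own Go code, the sample above maps onto a pair of structs. This is a sketch: the JSON keys come from the sample, while the Go type names, the parseIntent helper, and its validation rules are assumptions and may not match the engine's internal types.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// Field mirrors one entry of list_fields or detail_fields.
type Field struct {
	Name         string `json:"name"`
	RelSelector  string `json:"rel_selector"`
	Attribute    string `json:"attribute"`
	IsDetailLink bool   `json:"is_detail_link,omitempty"`
}

// Intent mirrors the top-level keys shown in the sample intent.json.
type Intent struct {
	StartURL           string  `json:"start_url"`
	Pattern            string  `json:"pattern"`
	ContainerSelector  string  `json:"container_selector"`
	ItemSelector       string  `json:"item_selector"`
	ListFields         []Field `json:"list_fields"`
	DetailFields       []Field `json:"detail_fields"`
	PaginationType     string  `json:"pagination_type"`
	NextButtonSelector string  `json:"next_button_selector"`
	NeedsJSRender      bool    `json:"needs_js_render"`
}

// parseIntent decodes an intent document and applies two illustrative checks.
// The checks are assumptions for this sketch, not documented engine rules.
func parseIntent(data []byte) (*Intent, error) {
	var in Intent
	if err := json.Unmarshal(data, &in); err != nil {
		return nil, fmt.Errorf("decode intent: %w", err)
	}
	if in.StartURL == "" || in.ContainerSelector == "" {
		return nil, errors.New("intent needs start_url and container_selector")
	}
	if in.PaginationType == "click_next" && in.NextButtonSelector == "" {
		return nil, errors.New("click_next pagination needs next_button_selector")
	}
	return &in, nil
}

func main() {
	raw := []byte(`{
	  "start_url": "https://books.toscrape.com/",
	  "container_selector": "ol.row li",
	  "list_fields": [{"name": "title", "rel_selector": "img", "attribute": "alt"}],
	  "pagination_type": "click_next",
	  "next_button_selector": "ul.pager li.next a"
	}`)
	in, err := parseIntent(raw)
	if err != nil {
		fmt.Println("invalid intent:", err)
		return
	}
	fmt.Println(in.StartURL, "with", len(in.ListFields), "list field(s)")
}
```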

How It Works

CSS Selector Generalization

When you click multiple list elements in Stage 1, the engine receives their raw CSS paths (e.g. ul > li:nth-child(1) > a and ul > li:nth-child(2) > a). Pithom Labs Scraper processes these paths through a Longest Common Subsequence (LCS) algorithm to find structural invariants, stripping out unique child indexes to produce ul > li > a, which matches all items in the list.
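
The idea can be sketched in a few lines of Go. This is an illustration of LCS-style alignment, not the algorithm in internal/selector: generalize and base are hypothetical names, only two paths are merged, paths are assumed to use " > " between steps, and :nth-child(n) is the only index form handled.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// nthChild matches a positional suffix such as ":nth-child(3)".
var nthChild = regexp.MustCompile(`:nth-child\(\d+\)`)

// base returns a selector step with any positional index removed.
func base(step string) string { return nthChild.ReplaceAllString(step, "") }

// generalize aligns the steps of two CSS paths with an LCS table. Two steps
// align when they are equal once their indexes are ignored.
func generalize(a, b string) string {
	x, y := strings.Split(a, " > "), strings.Split(b, " > ")

	// dp[i][j] holds the LCS length of x[i:] and y[j:].
	dp := make([][]int, len(x)+1)
	for i := range dp {
		dp[i] = make([]int, len(y)+1)
	}
	for i := len(x) - 1; i >= 0; i-- {
		for j := len(y) - 1; j >= 0; j-- {
			switch {
			case base(x[i]) == base(y[j]):
				dp[i][j] = dp[i+1][j+1] + 1
			case dp[i+1][j] >= dp[i][j+1]:
				dp[i][j] = dp[i+1][j]
			default:
				dp[i][j] = dp[i][j+1]
			}
		}
	}

	// Walk the table and emit the aligned steps.
	var sb strings.Builder
	gap := false // true when a step was skipped since the last emitted step
	i, j := 0, 0
	for i < len(x) && j < len(y) {
		switch {
		case base(x[i]) == base(y[j]):
			step := x[i]
			if x[i] != y[j] {
				step = base(x[i]) // indexes differ, so drop the index
			}
			if sb.Len() > 0 {
				if gap {
					sb.WriteString(" ") // skipped steps: descendant combinator
				} else {
					sb.WriteString(" > ")
				}
			}
			sb.WriteString(step)
			gap = false
			i, j = i+1, j+1
		case dp[i+1][j] >= dp[i][j+1]:
			i, gap = i+1, true
		default:
			j, gap = j+1, true
		}
	}
	return sb.String()
}

func main() {
	fmt.Println(generalize("ul > li:nth-child(1) > a", "ul > li:nth-child(2) > a"))
	// prints: ul > li > a
}
```

A step whose index is identical in both paths is kept as-is, so a fixed position (for example the second column of every row) survives generalization while the varying row index is dropped.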

Text Transformation Pipeline

Raw DOM text is often messy. The scraper applies transforms sequentially before exporting data:

  • extract_number: Extracts integers/floats (e.g., "$19.99" -> "19.99").
  • trim: Strips outer whitespace.
  • strip_html: Removes nested markup tags.
  • abs_url: Resolves relative hrefs to absolute links.

Testing

For testing pipelines, benchmarks, and integration configurations, please check the docs/testing.md guide.

Run unit tests directly:

cd scraper
go test -v ./...

License

Pithom Labs Scraper is source-available under the Business Source License 1.1 (converting to Apache 2.0 on June 1, 2030).

  • Additional Use Grant: You are permitted to use this software for personal, research, evaluation, and internal business workloads.
  • Prohibited Use: You may not offer this software as a hosted scraping SaaS, managed extraction service, API, or competing commercial service.

Please refer to LICENSE.md for the full license terms.
