This project is responsible for coordinating dataset crawling. The coordinator, crawler, and CLI modules work together to do the actual crawling.
The webservice and webservice client present crawling status as recorded in ZooKeeper.
The Crawler project includes:
- crawler: Contains the actual crawlers that speak the various XML and DWC-A/ABCD-A/CamtrapDP dialects needed for crawling the GBIF network
- crawler-cleanup: Deletes crawl jobs in ZooKeeper (see the sub-module README for usage details)
- crawler-cli: Provides the services that listen to RabbitMQ for instructions to crawl resources
- crawler-coordinator: Coordinates crawling jobs via ZooKeeper (Curator)
- crawler-ws: Exposes read-only crawl status and access to logs.
- crawler-ws-client: Java client to the WS
See the individual sub-module READMEs for specific details, but in general it is enough to build all components with:
    mvn clean package

The crawler CLI ships as a single container image built from the root Dockerfile. The image contains the
crawler-cli fat jar and runs any CLI command by setting CRAWLER_COMMAND at runtime. The container entrypoint
(scripts/entrypoint-crawler-cli.sh) materialises a YAML config from environment
variables and Helm-injected values before invoking java -jar /app/crawler-cli.jar "$CRAWLER_COMMAND" --conf ....
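
As an illustration only, the entrypoint flow described above could look roughly like the sketch below. Apart from CRAWLER_COMMAND and the jar path, everything here is an assumption made for the example: the variable names RABBIT_HOST, RABBIT_PORT, and ZK_CONNECTION, the YAML keys, and the config path passed in by the caller. The real behaviour is defined by scripts/entrypoint-crawler-cli.sh.

```shell
# Hypothetical sketch: render a YAML config from environment variables.
# The variable names and YAML keys are illustrative assumptions, not the real schema.
render_config() {
  # $1: path of the YAML file to write
  cat > "$1" <<EOF
messaging:
  host: ${RABBIT_HOST:-localhost}
  port: ${RABBIT_PORT:-5672}
zooKeeper:
  connectionString: ${ZK_CONNECTION:-localhost:2181}
EOF
}

# Prints the java invocation instead of executing it, so the sketch is safe to run.
build_command() {
  # $1: config path; CRAWLER_COMMAND must be set in the environment
  : "${CRAWLER_COMMAND:?CRAWLER_COMMAND must be set}"
  printf 'java -jar /app/crawler-cli.jar %s --conf %s\n' "$CRAWLER_COMMAND" "$1"
}
```

Printing the command rather than running it makes it easy to check what a given CRAWLER_COMMAND would start before wiring it into a container.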
Kubernetes deployment is managed through the Helm chart in helm/. The chart is intentionally generic: each
CLI command is configured under commands in helm/values.yaml. Long-running listeners such as
downloaders and metasync services are deployed as Deployment resources, while one-off commands such as startcrawl,
crawleverything, metasynceverything, and pusheml can be deployed as Job resources when enabled.
RabbitMQ and Registry credentials should normally be provided as existing Kubernetes Secrets via
rabbit.credentialsSecret and registry.credentialsSecret. The chart can create development secrets when
createSecret is enabled, but production deployments should avoid storing real credentials in values files.
Archive storage must be shared by services that hand archives off to one another. In normal environments this should be
an NFS or other shared volume configured in helm/values.yaml under volumes. The default emptyDir values are only
suitable for local rendering or isolated development because each pod receives its own temporary filesystem.
For running the CLI locally against a real RabbitMQ and ZooKeeper without Kubernetes, bring up the lightweight stack in
docker-compose.local.yml:
    task local # docker compose -f docker-compose.local.yml up -d

Both services include healthchecks and listen on their default ports (5672 / 15672 / 2181). The CLI itself is intended to run from your IDE/terminal in this mode.
To run the packaged image instead of running the CLI from your IDE, Taskfile.dist.yml provides a generic
docker:run:local task that auto-creates the archive directory under ~/data and wires the container against
host.docker.internal (works on Docker Desktop and on Linux thanks to --add-host=host.docker.internal:host-gateway):
    task docker:run:local # dwcdp-metasync, ~/data/dwcdp
    DATA_DIR=coldp CRAWLER_COMMAND=coldp-metasync task docker:run:local # coldp-metasync, ~/data/coldp
    DATA_DIR=dwca CRAWLER_COMMAND=downloader task docker:run:local
    DATA_DIR=abcda CRAWLER_COMMAND=abcdadownloader task docker:run:local

DATA_DIR should match the archiveRepository name in helm/values.yaml. Commands that require
multiple shared volumes or extra repository paths (dwcdpdownloader, validator, dwca-metasync, camtrapdp*) are
not covered by docker:run:local; for those, call docker:run directly with the appropriate mounts/config, or run the
CLI from your IDE against the local stack.
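
To make the behaviour of a docker:run:local style task more concrete, here is a dry-run sketch that creates the archive directory and prints a docker run command. The image name crawler-cli:local and the in-container mount point /data are assumptions for illustration. The defaults (dwcdp-metasync, ~/data/dwcdp) and the --add-host flag come from the description above. The actual task definition in Taskfile.dist.yml may differ.

```shell
# Hypothetical dry-run sketch of a docker:run:local style task.
# IMAGE and the /data mount point are assumptions, not taken from Taskfile.dist.yml.
run_local_dry() {
  data_dir="${DATA_DIR:-dwcdp}"
  cmd="${CRAWLER_COMMAND:-dwcdp-metasync}"
  image="${IMAGE:-crawler-cli:local}"
  host_dir="$HOME/data/$data_dir"

  # Auto-create the archive directory on the host.
  mkdir -p "$host_dir"

  # Print the command instead of running it.
  printf '%s\n' "docker run --rm --add-host=host.docker.internal:host-gateway -e CRAWLER_COMMAND=$cmd -v $host_dir:/data/$data_dir $image"
}
```

Running `DATA_DIR=coldp CRAWLER_COMMAND=coldp-metasync run_local_dry` prints the command that would start the coldp-metasync listener against ~/data/coldp.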
GBIF cluster nodes run on linux/amd64, but Apple Silicon developer machines build linux/arm64 by default. An
arm64-only image pushed from such a machine fails at runtime on the cluster with exec format error. Use the multi-arch task helper
in Taskfile.dist.yml:
    REGISTRY=docker.gbif.org task docker:push:dist

This builds linux/amd64 and linux/arm64 images, pushes them, and creates a Docker manifest tagged with the
crawler-cli Maven project.version. The Helm chart's image.tag should match.
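
One way to confirm that a pushed tag really covers both architectures is to inspect its manifest list. The helper below is a minimal sketch that reads manifest JSON on stdin and checks for an architecture entry. It assumes the JSON layout printed by `docker manifest inspect`, and the image reference in the usage comment is a placeholder.

```shell
# Succeeds if the manifest JSON on stdin lists the given architecture.
# $1: architecture name, e.g. amd64
manifest_has_arch() {
  grep -Eq "\"architecture\": *\"$1\""
}

# Assumed usage (placeholder image reference, adjust to your registry and tag):
# docker manifest inspect docker.gbif.org/crawler-cli:<version> | manifest_has_arch amd64
```

If the check fails for amd64, the tag was most likely pushed from a single-architecture build.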
Each key under commands in helm/values.yaml becomes one Deployment or Job named after that
key. The nested command value must match the Java CLI name (see super("…") in each *Command.java under
crawler-cli); names are not all formatted the same (dwcdpdownloader vs dwcdp-metasync).
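
Because the names are inconsistent, it can help to list them straight from the sources rather than guess. The helper below is a small sketch that extracts the first string argument of each super("…") call in *Command.java files. It assumes a grep that supports `-r`, `-o`, and `--include` (GNU and BSD grep do), and that the CLI name is a string literal on the same line as super(.

```shell
# List the CLI names declared via super("...") in *Command.java files.
# $1: directory to search (for this repository, the crawler-cli module)
list_cli_commands() {
  grep -rhoE --include='*Command.java' 'super\("[^"]+"' "$1" \
    | sed -E 's/^super\("//; s/"$//' \
    | LC_ALL=C sort -u
}
```

Calling `list_cli_commands crawler-cli` from the repository root prints one name per line, ready to copy into the command value in helm/values.yaml.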
Typical commands include crawl coordination (scheduler, coordinator, crawlserver), DwC-A (downloader,
validator, dwca-metasync), DwC-DP / CoL-DP (dwcdpdownloader, dwcdp-metasync, coldpdownloader,
coldp-metasync), ABCD-A (abcdadownloader), CamtrapDP (camtrapdpdownloader, camtrapdptodwca), legacy metasync,
coordinatorcleanup, and Jobs such as startcrawl / crawleverything / metasynceverything / pusheml. Use
helm/values-dev.yaml.example as a second values file with placeholder hosts and NFS
paths; copy it to helm/values-dev.yaml, replace placeholders, and pass it with -f on helm template / helm upgrade
(Helm merges it over the chart defaults).
The steps below use namespace dev and release name crawler-cli; adjust to your conventions.
- Confirm the kubectl context points at the right cluster:

      kubectl config current-context
      kubectl get nodes

- Create the namespace if it does not already exist:

      kubectl create namespace dev

- Create RabbitMQ and Registry credentials as Secrets:

      kubectl -n dev create secret generic crawler-rabbitmq-credentials \
        --from-literal=username=<user> --from-literal=password=<pass>
      kubectl -n dev create secret generic crawler-registry-credentials \
        --from-literal=username=<user> --from-literal=password=<pass>

- Build and push a multi-arch image (or use CI), then set image.repository, image.name, and image.tag in your override to match:

      REGISTRY=docker.gbif.org task docker:push:dist

- Copy the committed template and edit hosts, ZooKeeper, registry URL, volumes, and toggles:

      cp helm/values-dev.yaml.example helm/values-dev.yaml
      $EDITOR helm/values-dev.yaml

- Render and install:

      helm template crawler-cli ./helm -f helm/values-dev.yaml -n dev | less
      helm upgrade --install crawler-cli ./helm -f helm/values-dev.yaml -n dev

- Check workloads (use your enabled command keys as deployment names):

      kubectl -n dev get deploy,job
      kubectl -n dev get pods
      kubectl -n dev logs deploy/<command-key> -f
      kubectl -n dev exec deploy/<command-key> -- cat /app/.tmp/crawler.yaml

- Roll back or uninstall:

      helm rollback crawler-cli -n dev
      helm uninstall crawler-cli -n dev

- Downloader
- Validator
- Metasync
- Pipelines (all archives)
- Normalizer (Checklist)
- Metasync
- Validator
More information in crawler-cli README.
Darwin Core Data Package crawling is split across the crawler CLI and the separate dwc-dp-analyser-service:
- dwcdpdownloader downloads the archive to shared storage and publishes a DwcDpDownloadFinishedMessage.
- dwc-dp-analyser-service is deployed from its own repository/chart, validates the archive, and publishes a DwcDpValidationFinishedMessage.
- dwcdp-metasync listens for validation-finished messages, forwards metadata to the Registry, and publishes a DwcDpMetadataSyncFinishedMessage after metadata sync succeeds.
The downloader, validator, and metasync services must all agree on the shared archive storage backing the DwC-DP archive paths.
- coldpdownloader downloads the archive to shared storage and publishes a ColDpDownloadFinishedMessage.
- coldp-metasync listens for the download-finished message and forwards CoL-DP metadata to the Registry.