GBIF Crawler

This project is responsible for coordinating dataset crawling. The coordinator, crawler, and CLI modules work together to do the actual crawling.

The webservice and webservice client present crawling status as recorded in Zookeeper.

The Crawler project includes:

- crawler: Contains the actual crawlers, which speak the various XML and DwC-A/ABCD-A/CamtrapDP dialects needed for crawling the GBIF network.
- crawler-cleanup: Deletes crawl jobs in Zookeeper (see the sub-module README for details on how to use it).
- crawler-cli: Provides the services that listen to RabbitMQ for instructions to crawl resources.
- crawler-coordinator: Coordinates crawling jobs via Zookeeper (Curator).
- crawler-ws: Exposes read-only crawl status and access to logs.
- crawler-ws-client: Java client for the web service.

Building

See the individual sub-module READMEs for specific details, but in general it is enough to build all components with:

mvn clean package

Docker and Kubernetes deployment

The crawler CLI ships as a single container image built from the root Dockerfile. The image contains the crawler-cli fat jar and runs any CLI command by setting CRAWLER_COMMAND at runtime. The container entrypoint (scripts/entrypoint-crawler-cli.sh) materialises a YAML config from environment variables and Helm-injected values before invoking java -jar /app/crawler-cli.jar "$CRAWLER_COMMAND" --conf ....
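
As a rough illustration of what such an entrypoint does (this is not the contents of scripts/entrypoint-crawler-cli.sh), the sketch below renders a small YAML file from environment variables and then hands over to the CLI. The variable names RABBIT_HOST, RABBIT_PORT, ZK_CONNECTION, and CRAWLER_CONF, and the YAML keys, are assumptions made for the example; only CRAWLER_COMMAND, the jar path, and the --conf flag come from the description above.

```shell
#!/bin/sh
# Hypothetical sketch: variable names and YAML keys are illustrative only.

# Print a YAML config built from environment variables, with local defaults.
render_config() {
  cat <<EOF
messaging:
  host: ${RABBIT_HOST:-localhost}
  port: ${RABBIT_PORT:-5672}
zooKeeper:
  connectionString: ${ZK_CONNECTION:-localhost:2181}
EOF
}

# Write the config to disk, then replace the shell with the CLI process.
run_cli() {
  conf="${CRAWLER_CONF:-/app/.tmp/crawler.yaml}"
  render_config > "$conf"
  exec java -jar /app/crawler-cli.jar "$CRAWLER_COMMAND" --conf "$conf"
}

# Inside the container, the script would end by calling: run_cli
```

Using exec means the Java process receives signals such as SIGTERM directly, which matters when Kubernetes stops a pod.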

Kubernetes deployment is managed through the Helm chart in helm/. The chart is intentionally generic: each CLI command is configured under commands in helm/values.yaml. Long-running listeners such as downloaders and metasync services are deployed as Deployment resources, while one-off commands such as startcrawl, crawleverything, metasynceverything, and pusheml can be deployed as Job resources when enabled.

RabbitMQ and Registry credentials should normally be provided as existing Kubernetes Secrets via rabbit.credentialsSecret and registry.credentialsSecret. The chart can create development secrets when createSecret is enabled, but production deployments should avoid storing real credentials in values files.

Archive storage must be shared by services that hand archives off to one another. In normal environments this should be an NFS or other shared volume configured in helm/values.yaml under volumes. The default emptyDir values are only suitable for local rendering or isolated development because each pod receives its own temporary filesystem.

Local dependencies for development

For running the CLI locally against a real RabbitMQ and ZooKeeper without Kubernetes, bring up the lightweight stack in docker-compose.local.yml:

task local           # docker compose -f docker-compose.local.yml up -d

Both services include healthchecks and listen on their default ports: RabbitMQ on 5672 (AMQP) and 15672 (management UI), ZooKeeper on 2181. In this mode the CLI itself is intended to run from your IDE or terminal.

To run the packaged image instead of running the CLI from your IDE, Taskfile.dist.yml provides a generic docker:run:local task that auto-creates the archive directory under ~/data and wires the container against host.docker.internal (works on Docker Desktop and on Linux thanks to --add-host=host.docker.internal:host-gateway):

task docker:run:local                                                         # dwcdp-metasync, ~/data/dwcdp
DATA_DIR=coldp CRAWLER_COMMAND=coldp-metasync   task docker:run:local         # coldp-metasync, ~/data/coldp
DATA_DIR=dwca  CRAWLER_COMMAND=downloader       task docker:run:local
DATA_DIR=abcda CRAWLER_COMMAND=abcdadownloader  task docker:run:local

DATA_DIR should match the archiveRepository name in helm/values.yaml. Commands that require multiple shared volumes or extra repository paths (dwcdpdownloader, validator, dwca-metasync, camtrapdp*) are not covered by docker:run:local; for those, call docker:run directly with the appropriate mounts/config, or run the CLI from your IDE against the local stack.
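
The pairing of commands and data directories can be captured in a small helper. This is a sketch only: the function name data_dir_for is hypothetical, and it covers just the four pairings shown in the examples above plus the commands listed as unsupported.

```shell
#!/bin/sh
# Hypothetical helper: map a CLI command to the DATA_DIR used in the examples.
data_dir_for() {
  case "$1" in
    dwcdp-metasync)  echo dwcdp ;;
    coldp-metasync)  echo coldp ;;
    downloader)      echo dwca ;;
    abcdadownloader) echo abcda ;;
    dwcdpdownloader|validator|dwca-metasync|camtrapdp*)
      # These need several mounts, so docker:run:local does not cover them.
      echo "$1 needs extra mounts; call docker:run directly" >&2
      return 1 ;;
    *)
      echo "no DATA_DIR known for $1" >&2
      return 2 ;;
  esac
}

# Usage (requires task and Docker):
#   cmd=coldp-metasync
#   DATA_DIR=$(data_dir_for "$cmd") CRAWLER_COMMAND=$cmd task docker:run:local
```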

Building and pushing the image (multi-arch)

GBIF cluster nodes run on linux/amd64, but Apple Silicon developer machines build linux/arm64 images by default. An arm64-only image pushed from such a machine fails at runtime on the cluster with exec format error. Use the multi-arch task helper in Taskfile.dist.yml instead:

REGISTRY=docker.gbif.org task docker:push:dist

This builds linux/amd64 and linux/arm64 images, pushes them, and creates a Docker manifest tagged with the crawler-cli Maven project.version. The Helm chart's image.tag should match.
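
To check which platform a given machine builds by default, the output of uname -m can be mapped to a Docker platform string. The helper name docker_platform is hypothetical; the mapping itself follows the usual naming of the two architectures mentioned above.

```shell
#!/bin/sh
# Hypothetical helper: translate a machine architecture into a Docker platform.
docker_platform() {
  case "$1" in
    x86_64|amd64)  echo linux/amd64 ;;
    aarch64|arm64) echo linux/arm64 ;;
    *)
      echo "unknown architecture: $1" >&2
      return 1 ;;
  esac
}

# Usage:
#   docker_platform "$(uname -m)"
# After pushing, the platforms in the published manifest can be listed with:
#   docker buildx imagetools inspect <registry>/<image>:<tag>
```

If the first command prints linux/arm64 on your machine, a plain docker build followed by a push would publish an image the cluster cannot execute.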

Deploying to a Kubernetes cluster

Each key under commands in helm/values.yaml becomes one Deployment or Job named after that key. The nested command value must match the Java CLI name (see super("…") in each *Command.java under crawler-cli); names are not all formatted the same (dwcdpdownloader vs dwcdp-metasync).

Typical commands include crawl coordination (scheduler, coordinator, crawlserver), DwC-A (downloader, validator, dwca-metasync), DwC-DP / CoL-DP (dwcdpdownloader, dwcdp-metasync, coldpdownloader, coldp-metasync), ABCD-A (abcdadownloader), CamtrapDP (camtrapdpdownloader, camtrapdptodwca), legacy metasync, coordinatorcleanup, and Jobs such as startcrawl / crawleverything / metasynceverything / pusheml. Use helm/values-dev.yaml.example as a second values file with placeholder hosts and NFS paths; copy it to helm/values-dev.yaml, replace placeholders, and pass it with -f on helm template / helm upgrade (Helm merges it over the chart defaults).

The steps below use namespace dev and release name crawler-cli; adjust to your conventions.

Confirm the kubectl context points at the right cluster:
```
kubectl config current-context
kubectl get nodes
```
Create the namespace if it does not already exist:
```
kubectl create namespace dev
```

Create RabbitMQ and Registry credentials as Secrets:

kubectl -n dev create secret generic crawler-rabbitmq-credentials \
  --from-literal=username=<user> --from-literal=password=<pass>

kubectl -n dev create secret generic crawler-registry-credentials \
  --from-literal=username=<user> --from-literal=password=<pass>
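
Kubernetes stores Secret values base64-encoded, so a quick way to confirm that a Secret holds the expected username is to read the field back and decode it. The function names below are hypothetical, and the sketch assumes a base64 implementation that accepts -d (GNU coreutils, BusyBox, recent macOS).

```shell
#!/bin/sh
# Decode one base64-encoded Secret field read from stdin.
decode_field() {
  base64 -d
}

# Print the username stored in a Secret.
# Arguments: namespace, secret name. Requires kubectl access to the cluster.
secret_username() {
  kubectl -n "$1" get secret "$2" -o jsonpath='{.data.username}' | decode_field
}

# Usage:
#   secret_username dev crawler-rabbitmq-credentials
#   secret_username dev crawler-registry-credentials
```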

Build and push a multi-arch image (or use CI), then set image.repository, image.name, and image.tag in your override to match:
```
REGISTRY=docker.gbif.org task docker:push:dist
```
Copy the committed template and edit hosts, ZooKeeper, registry URL, volumes, and toggles:
```
cp helm/values-dev.yaml.example helm/values-dev.yaml
$EDITOR helm/values-dev.yaml
```

Render and install:

helm template crawler-cli ./helm -f helm/values-dev.yaml -n dev | less
helm upgrade --install crawler-cli ./helm -f helm/values-dev.yaml -n dev
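
Before installing, it can help to reduce the rendered output to the list of workloads the chart would create. The filter below is a sketch with a hypothetical name, list_workloads; it assumes each rendered document puts kind before metadata and indents metadata.name by two spaces, which is the common layout but is not guaranteed by Helm.

```shell
#!/bin/sh
# Hypothetical filter: read rendered manifests on stdin and print
# "Kind/name" for every Deployment and Job.
list_workloads() {
  awk '
    /^---/      { kind = ""; in_meta = 0 }   # new YAML document
    /^kind:/    { kind = $2 }
    /^metadata:/ { in_meta = 1; next }       # top-level metadata block starts
    /^[^ ]/     { in_meta = 0 }              # any other top-level key ends it
    in_meta && /^  name:/ {
      if (kind == "Deployment" || kind == "Job") print kind "/" $2
      in_meta = 0
    }
  '
}

# Usage (requires helm):
#   helm template crawler-cli ./helm -f helm/values-dev.yaml -n dev | list_workloads
```

The printed names should correspond to the command keys you enabled in your values file.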

Check workloads (use your enabled command keys as deployment names):

kubectl -n dev get deploy,job
kubectl -n dev get pods
kubectl -n dev logs deploy/<command-key> -f
kubectl -n dev exec deploy/<command-key> -- cat /app/.tmp/crawler.yaml

Roll back or uninstall:

helm rollback crawler-cli -n dev
helm uninstall crawler-cli -n dev

Sequence

Darwin Core Archive

Downloader
- Validator
  - Metasync
    - Pipelines (all archives)
    - Normalizer (Checklist)

Darwin Core Data Package

Darwin Core Data Package crawling is split across the crawler CLI and the separate dwc-dp-analyser-service:

- dwcdpdownloader downloads the archive to shared storage and publishes a DwcDpDownloadFinishedMessage.
- dwc-dp-analyser-service is deployed from its own repository/chart, validates the archive, and publishes a DwcDpValidationFinishedMessage.
- dwcdp-metasync listens for validation-finished messages, forwards metadata to the Registry, and publishes a DwcDpMetadataSyncFinishedMessage after metadata sync succeeds.

The downloader, validator, and metasync services must all agree on the shared archive storage backing the DwC-DP archive paths.

Catalogue of Life Data Package

- coldpdownloader downloads the archive to shared storage and publishes a ColDpDownloadFinishedMessage.
- coldp-metasync listens for the download-finished message and forwards CoL-DP metadata to the Registry.
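
When debugging a stalled data-package crawl, it is useful to know which service should pick up a given message. The lookup below restates the two flows above as a sketch; the function name consumer_of is hypothetical, and messages whose consumer is not described in this README return a non-zero status.

```shell
#!/bin/sh
# Hypothetical lookup: which service consumes a given message type,
# according to the DwC-DP and CoL-DP sequences described in this README.
consumer_of() {
  case "$1" in
    DwcDpDownloadFinishedMessage)   echo dwc-dp-analyser-service ;;
    DwcDpValidationFinishedMessage) echo dwcdp-metasync ;;
    ColDpDownloadFinishedMessage)   echo coldp-metasync ;;
    *) return 1 ;;  # consumer not documented here
  esac
}

# Usage:
#   consumer_of DwcDpValidationFinishedMessage
```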
