Boost Data Collector is a Django project that collects and manages data from various Boost-related sources. The project has multiple Django apps in one repository. All apps share one virtual environment, one database (PostgreSQL), and the same Django settings. Each app exposes one or more management commands (e.g. run_boost_library_tracker). The main workflow runs these commands in a fixed order (e.g. via python manage.py run_all_collectors or a Celery task). See docs/Workflow.md for workflow details.
- Python 3.11+
- Django (version in requirements.txt)
- PostgreSQL database access
- pandoc: required by boost_library_docs_tracker for HTML→Markdown conversion (pypandoc calls the pandoc binary at runtime):
  - macOS: brew install pandoc
  - Debian/Ubuntu: sudo apt-get install pandoc
  - Windows: winget install JohnMacFarlane.Pandoc or download from pandoc.org
- Environment variables for database URL and API keys (e.g. via .env)
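Because pypandoc shells out to the pandoc binary at runtime, a missing install only surfaces when boost_library_docs_tracker first converts a page. A minimal preflight check, using only the standard library, might look like the sketch below. The helper names find_pandoc and pandoc_version are illustrative and are not part of this project.

```python
import shutil
import subprocess


def find_pandoc():
    """Return the path to the pandoc binary, or None if it is not on PATH."""
    return shutil.which("pandoc")


def pandoc_version(path):
    """Return the first line of `pandoc --version` for the given binary."""
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=True
    )
    return result.stdout.splitlines()[0]


if __name__ == "__main__":
    path = find_pandoc()
    if path is None:
        print("pandoc not found; install it with one of the commands above")
    else:
        print(f"{path}: {pandoc_version(path)}")
```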
- Clone the repository:

      git clone <boost-data-collector-repo-url>
      cd boost-data-collector

- Create and activate a virtual environment:

      python -m venv venv
      # Windows
      venv\Scripts\activate
      # Linux/macOS
      source venv/bin/activate

- Install dependencies:

      pip install -r requirements.txt

- Configure environment variables (e.g. copy .env.example to .env and set database URL and API credentials).
- Create and run migrations (required before any command that uses the database):

      python manage.py makemigrations
      python manage.py migrate

  Each project app has a migrations/ package. If you previously saw "No changes detected" but migrate only listed admin, auth, contenttypes, sessions, ensure those packages exist and run the commands again. After a successful migrate you should see migrations for cppa_user_tracker, github_activity_tracker, boost_library_tracker, workflow (and optionally github_ops).

  If you see relation "cppa_user_tracker_githubaccount" does not exist (or similar), the database tables are missing; run the two commands above.
- Run a single app command or the full workflow to confirm the project works:

      python manage.py run_all_collectors

For local development you can start the dev server: python manage.py runserver.
You can run the whole stack (Django, PostgreSQL, Redis, Celery worker and beat) in Docker. See docs/Docker.md for step-by-step instructions, including first-time setup and useful commands.
The daily workflow runs as a Celery task (see docs/Celery_test.md). You need Redis running (default: localhost:6379). Start the worker and (optionally) Beat in separate terminals:

    # Worker (executes tasks)
    celery -A config worker -l info
    # Beat (schedules the daily task at 1:00 AM Pacific)
    celery -A config beat -l info

On Windows, the project configures the worker to use the solo pool automatically; if you see PermissionError [WinError 5], run: celery -A config worker -l info --pool=solo.
The project uses pytest with pytest-django. Tests run against config.test_settings (SQLite in-memory by default; set DATABASE_URL to use PostgreSQL).
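The database choice described above (SQLite in-memory unless DATABASE_URL is set) can be sketched as a small function that builds a Django DATABASES dict. This is an illustration under assumptions, not the contents of config.test_settings: the function name build_test_databases is hypothetical, and the real settings may parse the URL with a helper library instead.

```python
import os
from urllib.parse import urlparse


def build_test_databases(environ=None):
    """Build a Django DATABASES dict for tests from an environment mapping."""
    environ = os.environ if environ is None else environ
    url = environ.get("DATABASE_URL", "")
    if not url:
        # Default: throwaway in-memory SQLite database
        return {
            "default": {
                "ENGINE": "django.db.backends.sqlite3",
                "NAME": ":memory:",
            }
        }
    parts = urlparse(url)
    if parts.scheme not in ("postgres", "postgresql"):
        raise ValueError(f"unsupported database scheme: {parts.scheme!r}")
    return {
        "default": {
            "ENGINE": "django.db.backends.postgresql",
            "NAME": parts.path.lstrip("/"),
            "USER": parts.username or "",
            "PASSWORD": parts.password or "",
            "HOST": parts.hostname or "",
            "PORT": str(parts.port or ""),
        }
    }
```

Passing the environment in as a mapping keeps the function easy to exercise without touching the real process environment.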
- Install test dependencies (once):

      pip install -r requirements-dev.txt

- Run the full test suite:

      python -m pytest

- Optional: run with coverage and a short traceback:

      python -m pytest --tb=short --cov=. --cov-report=term-missing

- Run a subset of tests (e.g. one app or one file):

      python -m pytest cppa_user_tracker/tests/ -v
      python -m pytest github_activity_tracker/tests/test_sync_utils.py -v

See docs/Development_guideline.md for when to run tests during development.
boost-data-collector/
├── manage.py
├── requirements.txt
├── .env.example
├── README.md
├── config/ or <project_name>/ # Django project settings (settings.py)
├── docs/ # Documentation (per-topic)
│ ├── README.md # Topic index
│ ├── operations/ # Shared I/O (GitHub, Discord, etc.)
│ │ ├── README.md
│ │ └── github.md
│ ├── service_api/ # Per-app service API
│ ├── Workflow.md
│ ├── Schema.md
│ └── ...
├── workspace/ # Raw/processed files (see docs/Workspace.md)
│ ├── github_activity_tracker/
│ ├── boost_library_tracker/
│ ├── ...
│ └── shared/
│ (Django Apps)
├── cppa_user_tracker/
├── github_activity_tracker/
├── workflow/
└── ...
Each Django app can expose management commands in management/commands/. All apps are in INSTALLED_APPS and use the shared database.
- Django project: One Django project with multiple Django apps; all apps share the same settings and database.
- Workflow: The main task runs app commands in a fixed order (e.g. run_all_collectors or a Celery task). Scheduling is done with Celery Beat or by running commands by hand.
- Database: One PostgreSQL database (e.g. boost_dashboard); Django ORM and migrations for all apps.
- Configuration: Django settings (settings.py) and environment variables (e.g. via django-environ or python-decouple).
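The fixed-order workflow can be pictured as a loop over command names. The sketch below is an assumption about the general shape, not the project's run_all_collectors implementation: run_in_order and django_runner are hypothetical names, and stopping at the first failure is a choice made here for illustration (see docs/Workflow.md for the actual order and error handling).

```python
def run_in_order(commands, run_command):
    """Run each command in sequence and return the names that completed.

    run_command is any callable taking a command name. An exception from it
    propagates, so the remaining commands are not run.
    """
    completed = []
    for name in commands:
        run_command(name)
        completed.append(name)
    return completed


def django_runner(name):
    """Run one Django management command by name."""
    # Imported lazily so run_in_order can be exercised without Django installed
    from django.core.management import call_command

    call_command(name)


if __name__ == "__main__":
    # Assumes DJANGO_SETTINGS_MODULE is set and django.setup() has been called.
    # Only run_boost_library_tracker is named in this README; a real list
    # would contain one entry per collector, in the required order.
    run_in_order(["run_boost_library_tracker"], django_runner)
```

Keeping the runner as a parameter means the ordering logic can be tested with a stub instead of real management commands.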
The project supports multiple GitHub tokens for different operations (see .env.example):
- GITHUB_TOKEN – Fallback when a specific token is not set.
- GITHUB_TOKENS_SCRAPING – Comma-separated list for API read/scraping; tokens are used in round-robin to spread rate limits.
- GITHUB_TOKEN_WRITE – Used for create PR, create issue, comment on issue, and git push (falls back to GITHUB_TOKEN).
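The token rules above (round-robin over the scraping list, with GITHUB_TOKEN as the fallback) can be sketched with the standard library alone. The function names below are hypothetical and the real logic lives in the project's GitHub layer; this only illustrates the selection rules as described.

```python
import itertools


def scraping_tokens(environ):
    """Return the scraping tokens, falling back to GITHUB_TOKEN if none are set."""
    raw = environ.get("GITHUB_TOKENS_SCRAPING", "")
    tokens = [t.strip() for t in raw.split(",") if t.strip()]
    if not tokens:
        fallback = environ.get("GITHUB_TOKEN", "")
        tokens = [fallback] if fallback else []
    return tokens


def write_token(environ):
    """Return GITHUB_TOKEN_WRITE, or GITHUB_TOKEN when it is unset or empty."""
    return environ.get("GITHUB_TOKEN_WRITE") or environ.get("GITHUB_TOKEN", "")


def token_cycle(environ):
    """Return an endless round-robin iterator over the scraping tokens."""
    tokens = scraping_tokens(environ)
    if not tokens:
        raise RuntimeError("no GitHub token configured")
    return itertools.cycle(tokens)
```

A caller would take next(cycle) before each read request, so consecutive requests use different tokens and the rate limit is spread across them.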
Operations (shared I/O): External integrations (GitHub, Discord, etc.) live in dedicated apps (e.g. github_ops) and are used by other apps. See docs/operations/ for the group and docs/operations/github.md for GitHub usage and token mapping.
The workspace is one folder with a subfolder per app. For github_activity_tracker, sync uses workspace/github_activity_tracker/<owner>/<repo>/commits|issues|prs/*.json; files are processed into the DB and then removed. Default root: workspace/ (configurable via WORKSPACE_DIR). See docs/Workspace.md.
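The process-then-remove cycle for workspace files can be sketched as follows. This is an illustration only: process_repo_files and the handle callback are hypothetical, and the real sync code writes through the project's service layer rather than an arbitrary callable.

```python
import json
from pathlib import Path


def process_repo_files(workspace_root, owner, repo, kind, handle):
    """Load each JSON file for one repo and kind, pass it to handle, then delete it.

    kind is one of "commits", "issues" or "prs". A file is removed only after
    handle returns, so a failure leaves it in place for the next run.
    Returns the number of files processed.
    """
    folder = Path(workspace_root) / "github_activity_tracker" / owner / repo / kind
    processed = 0
    for path in sorted(folder.glob("*.json")):
        with path.open(encoding="utf-8") as fh:
            payload = json.load(fh)
        handle(payload)
        path.unlink()
        processed += 1
    return processed
```

Deleting after a successful handle call makes the step safe to re-run: anything still on disk has not yet reached the database.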
Docs are organized by topic (one doc per concern: workflow, workspace, service API, etc.). See docs/README.md for the full index.
- docs/README.md – Per-topic index and how to find app-specific info.
- Running tests – How to run the test suite (pytest, coverage).
- Celery – How to start the Celery worker and Beat.
- Celery_test.md – Testing the Celery task (run once, Beat, Redis).
- operations/ – Operations group: shared I/O (GitHub, Discord, etc.); index and per-operation docs.
- Workflow.md – Main application workflow, execution order, and project details.
- operations/github.md – GitHub layer (clone, push, fetch file, create PR/issue/comment) and token use.
- Deployment.md – CI/CD pipeline, GitHub secrets, server setup, and deploy script behavior.
- Workspace.md – Workspace layout and usage for file processing.
- Schema.md – Database schema and table relationships.
- Development_guideline.md – Development setup, app requirements, and step-by-step workflow.
- Contributing.md – Service layer (single place for writes) and contributor guidelines.
- Service_API.md – API reference and index for all service layer functions.
- service_api/ – Per-app service API docs (name, description, parameters, return types, validation).
The project deploys automatically over SSH after CI passes. Pushes to develop deploy to staging; pushes to main deploy to production.
See docs/Deployment.md for:
- Required environment secrets (SSH_HOST, SSH_USER, SSH_PRIVATE_KEY) and optional SSH_PORT (defaults to 22), set per environment (production / staging)
- GitHub Environments setup (approval gates for production)
- One-time server setup (prerequisites, .env, SSH key)
- Deploy script behavior and override options
- main – Default/production branch (stable, release-ready code).
- develop – Development branch (active integration and feature work).
- Feature branches: Create from develop. Do not branch from main for day-to-day work.
- Pull requests: Open PRs against develop; merge to main for releases.