Metadata-first pipeline for ACM Digital Library PDFs. The project extracts spreadsheet-ready metadata (Title, Venue, Year, Authors, Abstract, DOI) so the teacher’s sheet can be filled with minimal manual work.
- Refactored the pipeline into reusable core modules (`acm_meta/`) so PDF parsing, Crossref calls, normalization, and persistence can be shared by both the CLI and API layers.
- Introduced typed models plus structured upload responses. Every API now returns a consistent `status`/`error_code`/`message` payload, and the frontend consumes a single `record` shape.
- Hardened persistence with atomic JSON/CSV/XLSX writes and process-level file locks. DELETE/reorder now operate purely on stable record IDs.
- Added inline editing + validation on the Metadata Table (double-click cells to edit; changes persist through the new `PATCH /api/records/{id}` endpoint).
- Built-in DOI dedupe: uploading the same paper again replaces the existing row instead of creating a duplicate, even if the PDF name differs.
- Simplified the schema to the six core spreadsheet columns (Title, Venue, Year, Authors, Abstract, DOI) so the table matches the teacher’s requirements exactly.
- Known issue: the Atlas browser sometimes swallows the confirm dialog, so Delete appears unresponsive. Chrome behaves correctly; use Chrome to delete rows until Atlas fixes the dialog bug.
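The DOI dedupe above can be sketched as a keyed upsert over the stored rows (an illustrative sketch, not the project's actual code; `upsert_record` and the row shape are assumptions):

```python
def upsert_record(records: list[dict], new: dict) -> list[dict]:
    """Replace any existing row with the same DOI, else append.

    DOIs are matched case-insensitively, so re-uploading the same
    paper under a different PDF name still hits the same row.
    """
    key = new["DOI"].strip().lower()
    kept = [r for r in records if r["DOI"].strip().lower() != key]
    kept.append(new)
    return kept
```

Matching on the DOI rather than the filename is what makes re-uploads idempotent.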
```bash
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
cp config.example.env .env                        # fill in CROSSREF_MAILTO
mkdir -p pdfs output && cp /path/to/acm.pdf pdfs/
python main.py serve                              # open http://127.0.0.1:8000
```

Batch mode still works: drop multiple PDFs into `pdfs/` and run `python main.py batch` to regenerate `output/metadata.json` + CSV.
Uploading `XR-Objects.pdf` produces the following table row (also persisted to `data/records.json` / `.csv`):

```json
{
  "Title": "Augmented Object Intelligence with XR-Objects",
  "Venue": "Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology",
  "Publication year": 2024,
  "Author list": "Mustafa Doga Dogan, Eric J Gonzalez, Karan Ahuja, Ruofei Du, Andrea Colaço, Johnny Lee, Mar Gonzalez-Franco, David Kim",
  "Abstract": "Seamless integration of physical objects as interactive digital entities remains a challenge for spatial computing...",
  "DOI": "10.1145/3654777.3676379"
}
```

- Added mouse-driven resizing for metadata-table columns and rows so reviewers can adjust column widths or row heights directly in the UI.
- Converted the entire documentation to English and removed Chinese text from the repository.
- Continued polishing the persistent metadata table (drag-to-reorder, inline delete, CSV/JSON/Excel export, adjustable font size).
Each paper occupies one spreadsheet row. Column definitions stay fixed:
| Column | Description | Constraint |
|---|---|---|
| Title | Paper title | Use the original English title; avoid trailing punctuation |
| Venue | Conference or journal name | Prefer the full official name (e.g., CHI Conference on Human Factors in Computing Systems) |
| Publication year | Year of publication | Four-digit integer (e.g., 2022) |
| Author list | Authors in `Given Family` format | Join with `, `; do not include "and" or periods |
| Abstract | Paper abstract | Plain English paragraph with the ABSTRACT label removed |
| DOI | Digital Object Identifier | e.g., 10.1145/3491102.3502071 |
The MVP automates Title, Venue, Publication year, Author list, Abstract, and DOI.
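The Author list constraint maps directly onto Crossref's author objects; a minimal sketch (the `given`/`family` field names are Crossref's, the helper name is ours):

```python
def format_authors(authors: list[dict]) -> str:
    """Render Crossref author objects as 'Given Family, Given Family'.

    Joined with ', ' only: no 'and', no periods, per the column constraint.
    """
    return ", ".join(
        " ".join(part for part in (a.get("given"), a.get("family")) if part)
        for a in authors
    )

print(format_authors([
    {"given": "Mustafa Doga", "family": "Dogan"},
    {"given": "Eric J", "family": "Gonzalez"},
]))  # Mustafa Doga Dogan, Eric J Gonzalez
```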
- Batch or web uploads for ACM DL PDFs (assuming each PDF contains a DOI).
- Extract the DOI from the first pages, query Crossref `/works/{doi}`, and normalize fields.
- Produce two outputs:
  - `output/metadata.json`: verbose record for downstream automation.
  - `output/metadata_for_spreadsheet.csv`: column order matches the teacher's sheet exactly.
- CLI or UI helpers for attaching optional hero images / demo links.
- Scripts that export useful figures from PDFs.
- Automatically discover high-quality teaser images / videos and recommend candidates.
- Rank media to help select a teaser.
- Generate “making of” prompts, XR browsing experiences, and other exploratory features.
- Project name: `acm-meta-mvp`
- Goal: given ACM DL PDFs, automatically output Title / Venue / Year / Authors / Abstract / DOI that can be pasted directly into the teacher's spreadsheet.
- Pipeline: PDF → DOI extraction → Crossref metadata → normalization → JSON & CSV.
- Python 3.10+
- FastAPI + Uvicorn: web API (`/api/upload`)
- PyPDF/PyMuPDF: extract DOI/abstract snippets from PDFs
- Requests: call Crossref
- Pandas: build CSV/Excel exports
- python-dotenv: read Crossref email for the polite User-Agent
```
MetaData/
├─ README.md
├─ requirements.txt
├─ main.py
├─ config.example.env
├─ data/
│  └─ .gitkeep        # becomes records.json / records.csv after runs
├─ frontend/
│  └─ index.html
├─ static/
│  ├─ app.js
│  └─ styles.css
├─ pdfs/
└─ output/
```
`frontend/index.html` plus everything in `static/` powers the web UI. Run `python main.py serve` and open `http://127.0.0.1:8000/` to batch-upload PDFs (max 20) and manage the persistent metadata table.
`requirements.txt`:

```
fastapi
uvicorn[standard]
pypdf
requests
pandas
python-dotenv
python-multipart
pymupdf
openpyxl
```

`config.example.env`:

```
CROSSREF_MAILTO=your_email@example.com
```

Copy it to `.env` and replace the email with a real inbox, per Crossref's polite User-Agent guidelines.
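Crossref's polite pool only needs the email to reach their servers in the request headers; one way to build them from the env var (the exact User-Agent string here is our assumption, not a mandated format):

```python
import os

def crossref_headers() -> dict[str, str]:
    """Build polite-pool headers from CROSSREF_MAILTO.

    The app loads .env via python-dotenv; reading os.environ is enough here.
    """
    mailto = os.environ.get("CROSSREF_MAILTO", "your_email@example.com")
    return {"User-Agent": f"acm-meta-mvp/0.1 (mailto:{mailto})"}
```

Pass the result to every `requests.get` against `api.crossref.org` so Crossref can contact you about traffic issues.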
- DOI extraction: use `pypdf` to read the first two pages and apply `r"10\.\d{4,9}/[^\s\"<>]+"`.
- Crossref lookup: `GET https://api.crossref.org/works/{doi}` with the configured email in the User-Agent.
- Field normalization:
  - Title: `message.title[0]`.
  - Venue: `container-title[0]`.
  - Publication year: `issued.date-parts[0][0]`.
  - Author list: join `given family` names with `, `.
  - Abstract: prefer Crossref; otherwise extract the PDF's "Abstract" paragraph via PyMuPDF.
  - DOI: returned DOI or the PDF fallback.
- Outputs: `metadata_for_spreadsheet.csv` with six columns (Title, Venue, Year, Authors, Abstract, DOI).
- API modes: `/api/upload` handles a single file, `POST /api/upload/batch` processes multiple PDFs, and `python main.py batch` runs over everything inside `pdfs/`.
- `POST /api/upload/batch`: accepts up to 20 files, returns per-file status, and saves successes to `data/records.json` and `data/records.csv`. Missing abstracts are auto-extracted from the PDF.
- `GET /api/records`: returns all stored records (most recent first); the frontend uses this for the metadata table.
- `DELETE /api/records/{id}`: deletes a record (triggered by the table's Delete button).
- `PATCH /api/records/{id}`: updates editable columns (Title/Venue/Year/Authors/Abstract/DOI/etc.) from the inline editor.
- `POST /api/records/reorder`: persists drag-and-drop ordering from the UI.
- `GET /api/export`: downloads `data/records.csv`.
- `GET /api/export/json`: downloads the JSON dataset.
- `GET /api/export/xlsx`: downloads an Excel workbook built with `openpyxl`.
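The DOI-extraction heuristic boils down to one regex plus a lookup URL; a self-contained sketch using the same pattern (`find_doi` is illustrative, not the project's actual function name):

```python
import re

DOI_RE = re.compile(r"10\.\d{4,9}/[^\s\"<>]+")

def find_doi(first_pages_text: str) -> str | None:
    """Return the first DOI-looking token, trimming trailing punctuation."""
    m = DOI_RE.search(first_pages_text)
    return m.group(0).rstrip(".,;") if m else None

text = "UIST '24 ... https://doi.org/10.1145/3654777.3676379."
doi = find_doi(text)
print(doi)                                     # 10.1145/3654777.3676379
print(f"https://api.crossref.org/works/{doi}")
```

The `rstrip` guards against sentence-final punctuation that the character class would otherwise capture.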
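The atomic writes behind `data/records.json` usually mean write-to-temp-then-rename; a minimal sketch (the project's actual locking and IO code may differ):

```python
import json
import os
import tempfile

def atomic_write_json(path: str, data) -> None:
    """Dump JSON to a temp file in the target directory, then rename it in.

    os.replace is atomic on the same filesystem, so concurrent readers
    of records.json never observe a half-written file.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(data, f, ensure_ascii=False, indent=2)
        os.replace(tmp, path)
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
```

Creating the temp file in the same directory as the target matters: `os.replace` is only atomic within one filesystem.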
- Environment setup

  ```bash
  python -m venv .venv
  source .venv/bin/activate      # Windows: .venv\Scripts\activate
  pip install -r requirements.txt
  cp config.example.env .env     # then edit CROSSREF_MAILTO
  mkdir -p pdfs output
  ```

- Batch extraction

  ```bash
  python main.py batch
  ```

  Outputs land in `output/metadata.json` and `output/metadata_for_spreadsheet.csv` with the teacher's column order.

- Web UI

  ```bash
  python main.py serve
  ```

  Browse to `http://127.0.0.1:8000/` to access:
  - Upload PDFs: drag/drop ≤20 files and watch metadata cards appear instantly; the status list shows success/failure for each file.
  - Metadata Table: browse persistent records, delete rows, drag to reorder, resize columns/rows, tweak font size, and export CSV/JSON/Excel. All data persists under `data/`.
  - CLI users can also call the API directly (sample `curl` commands live in the code comments).

- All-in-one helper

  ```bash
  ./run.sh
  ```

  Ensures the virtualenv, requirements, and env vars exist, then launches `python main.py serve`.
- ✅ Automatic metadata: DOI detection + Crossref + CSV/JSON/Excel export.
- ✅ Persistent metadata table with drag/drop, delete, column/row resize, font slider, and multi-format export.
- ✅ Documentation and UI strings fully in English.
- ⏳ Upcoming (V1): asset picker (hero image, demo link) and media export helpers.
- 🔭 Future (V2+): automated teaser/media discovery plus AI-powered tooling.
Happy metadata harvesting! Contributions, bug reports, and feature ideas are welcome—file an issue or open a pull request.