Metadata-first pipeline for ACM Digital Library PDFs. The project extracts spreadsheet-ready metadata (Title, Venue, Year, Authors, Abstract, DOI) so the teacher’s sheet can be filled with minimal manual work.
- Refactored the pipeline into reusable core modules (`acm_meta/`) so PDF parsing, Crossref calls, normalization, and persistence can be shared by both the CLI and API layers.
- Introduced typed models plus structured upload responses. Every API now returns a consistent `status`/`error_code`/`message` payload, and the frontend consumes a single `record` shape.
- Hardened persistence with atomic JSON/CSV/XLSX writes and process-level file locks. DELETE/reorder now operate purely on stable record IDs.
- Added inline editing + validation on the Metadata Table (double-click cells to edit; changes persist through the new `PATCH /api/records/{id}` endpoint).
- Built-in DOI dedupe: uploading the same paper again replaces the existing row instead of creating a duplicate, even if the PDF name differs.
- Simplified the schema to the six core spreadsheet columns (Title, Venue, Year, Authors, Abstract, DOI) so the table matches the teacher’s requirements exactly.
- Known issue: the Atlas browser sometimes swallows the confirm dialog, so Delete appears unresponsive. Chrome behaves correctly; use Chrome to delete rows until Atlas fixes the dialog bug.
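The DOI dedupe above can be sketched as a keyed upsert over the stored rows (an illustrative sketch, not the project's actual code; `upsert_record` and the row shape are assumptions):

```python
def upsert_record(records: list[dict], new: dict) -> list[dict]:
    """Replace any existing row with the same DOI, else append.

    DOIs are matched case-insensitively, so re-uploading the same
    paper under a different PDF name still hits the same row.
    """
    key = new["DOI"].strip().lower()
    kept = [r for r in records if r["DOI"].strip().lower() != key]
    kept.append(new)
    return kept
```

Matching on the DOI rather than the filename is what makes re-uploads idempotent.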
```bash
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
cp config.example.env .env                        # fill in CROSSREF_MAILTO
mkdir -p pdfs output && cp /path/to/acm.pdf pdfs/
python main.py serve                              # open http://127.0.0.1:8000
```

Batch mode still works: drop multiple PDFs into `pdfs/` and run `python main.py batch` to regenerate `output/metadata.json` + CSV.
Uploading `XR-Objects.pdf` produces the following table row (also persisted to `data/records.json` / `.csv`):

```json
{
  "Title": "Augmented Object Intelligence with XR-Objects",
  "Venue": "Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology",
  "Publication year": 2024,
  "Author list": "Mustafa Doga Dogan, Eric J Gonzalez, Karan Ahuja, Ruofei Du, Andrea Colaço, Johnny Lee, Mar Gonzalez-Franco, David Kim",
  "Abstract": "Seamless integration of physical objects as interactive digital entities remains a challenge for spatial computing...",
  "DOI": "10.1145/3654777.3676379"
}
```

- Added mouse-driven resizing for metadata-table columns and rows so reviewers can adjust column widths or row heights directly in the UI.
- Converted the entire documentation to English and removed Chinese text from the repository.
- Continued polishing the persistent metadata table (drag-to-reorder, inline delete, CSV/JSON/Excel export, adjustable font size).
Each paper occupies one spreadsheet row. Column definitions stay fixed:
| Column | Description | Constraint |
|---|---|---|
| Title | Paper title | Use the original English title; avoid trailing punctuation |
| Venue | Conference or journal name | Prefer the full official name (e.g., CHI Conference on Human Factors in Computing Systems) |
| Publication year | Year of publication | Four-digit integer (e.g., 2022) |
| Author list | Authors in `Given Family` format | Join with `, `; do not include "and" or periods |
| Abstract | Paper abstract | Plain English paragraph with the ABSTRACT label removed |
| DOI | Digital Object Identifier | e.g., 10.1145/3491102.3502071 |
The MVP automates Title, Venue, Publication year, Author list, Abstract, and DOI.
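The Author list constraint maps directly onto Crossref's author objects; a minimal sketch (the `given`/`family` field names are Crossref's, the helper name is ours):

```python
def format_authors(authors: list[dict]) -> str:
    """Render Crossref author objects as 'Given Family, Given Family'.

    Joined with ', ' only: no 'and', no periods, per the column constraint.
    """
    return ", ".join(
        " ".join(part for part in (a.get("given"), a.get("family")) if part)
        for a in authors
    )

print(format_authors([
    {"given": "Mustafa Doga", "family": "Dogan"},
    {"given": "Eric J", "family": "Gonzalez"},
]))  # Mustafa Doga Dogan, Eric J Gonzalez
```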
- Batch or web uploads for ACM DL PDFs (assuming each PDF contains a DOI).
- Extract the DOI from the first pages, query Crossref `/works/{doi}`, and normalize fields.
- Produce two outputs:
  - `output/metadata.json`: verbose record for downstream automation.
  - `output/metadata_for_spreadsheet.csv`: column order matches the teacher's sheet exactly.
- CLI or UI helpers for attaching optional hero images / demo links.
- Scripts that export useful figures from PDFs.
- Automatically discover high-quality teaser images / videos and recommend candidates.
- Rank media to help select a teaser.
- Generate “making of” prompts, XR browsing experiences, and other exploratory features.
- Project name: `acm-meta-mvp`
- Goal: given ACM DL PDFs, automatically output Title / Venue / Year / Authors / Abstract / DOI that can be pasted directly into the teacher's spreadsheet.
- Pipeline: PDF → DOI extraction → Crossref metadata → normalization → JSON & CSV.
- Python 3.10+
- FastAPI + Uvicorn: web API (`/api/upload`)
- PyPDF/PyMuPDF: extract DOI/abstract snippets from PDFs
- Requests: call Crossref
- Pandas: build CSV/Excel exports
- python-dotenv: read Crossref email for the polite User-Agent
```
MetaData/
├─ README.md
├─ requirements.txt
├─ main.py
├─ config.example.env
├─ data/
│  └─ .gitkeep        # becomes records.json / records.csv after runs
├─ frontend/
│  └─ index.html
├─ static/
│  ├─ app.js
│  └─ styles.css
├─ pdfs/
└─ output/
```
`frontend/index.html` plus everything in `static/` powers the web UI. Run `python main.py serve` and open `http://127.0.0.1:8000/` to batch-upload PDFs (max 20) and manage the persistent metadata table.
`requirements.txt`:

```
fastapi
uvicorn[standard]
pypdf
requests
pandas
python-dotenv
python-multipart
pymupdf
openpyxl
```

`config.example.env`:

```
CROSSREF_MAILTO=your_email@example.com
```

Copy it to `.env` and replace the email with a real inbox, per Crossref's polite User-Agent guidelines.
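Crossref's polite pool only needs the email to reach their servers in the request headers; one way to build them from the env var (the exact User-Agent string here is our assumption, not a mandated format):

```python
import os

def crossref_headers() -> dict[str, str]:
    """Build polite-pool headers from CROSSREF_MAILTO.

    The app loads .env via python-dotenv; reading os.environ is enough here.
    """
    mailto = os.environ.get("CROSSREF_MAILTO", "your_email@example.com")
    return {"User-Agent": f"acm-meta-mvp/0.1 (mailto:{mailto})"}
```

Pass the result to every `requests.get` against `api.crossref.org` so Crossref can contact you about traffic issues.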
- DOI extraction: use `pypdf` to read the first two pages and apply `r"10\.\d{4,9}/[^\s\"<>]+"`.
- Crossref lookup: `GET https://api.crossref.org/works/{doi}` with the configured email in the User-Agent.
- Field normalization:
  - Title: `message.title[0]`.
  - Venue: `container-title[0]`.
  - Publication year: `issued.date-parts[0][0]`.
  - Author list: join `given family` names with `, `.
  - Abstract: prefer Crossref; otherwise extract the PDF's "Abstract" paragraph via PyMuPDF.
  - DOI: returned DOI or the PDF fallback.
- Outputs: `metadata_for_spreadsheet.csv` with six columns (Title, Venue, Year, Authors, Abstract, DOI).
- API modes: `/api/upload` handles a single file, `POST /api/upload/batch` processes multiple PDFs, and `python main.py batch` runs over everything inside `pdfs/`.
- `POST /api/upload/batch`: accepts up to 20 files, returns per-file status, and saves successes to `data/records.json` and `data/records.csv`. Missing abstracts are auto-extracted from the PDF.
- `GET /api/records`: returns all stored records (most recent first); the frontend uses this for the metadata table.
- `DELETE /api/records/{id}`: deletes a record (triggered by the table's Delete button).
- `PATCH /api/records/{id}`: updates editable columns (Title/Venue/Year/Authors/Abstract/DOI/etc.) from the inline editor.
- `POST /api/records/reorder`: persists drag-and-drop ordering from the UI.
- `GET /api/export`: downloads `data/records.csv`.
- `GET /api/export/json`: downloads the JSON dataset.
- `GET /api/export/xlsx`: downloads an Excel workbook built with `openpyxl`.
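The DOI-extraction heuristic boils down to one regex plus a lookup URL; a self-contained sketch using the same pattern (`find_doi` is illustrative, not the project's actual function name):

```python
import re

DOI_RE = re.compile(r"10\.\d{4,9}/[^\s\"<>]+")

def find_doi(first_pages_text: str) -> str | None:
    """Return the first DOI-looking token, trimming trailing punctuation."""
    m = DOI_RE.search(first_pages_text)
    return m.group(0).rstrip(".,;") if m else None

text = "UIST '24 ... https://doi.org/10.1145/3654777.3676379."
doi = find_doi(text)
print(doi)                                     # 10.1145/3654777.3676379
print(f"https://api.crossref.org/works/{doi}")
```

The `rstrip` guards against sentence-final punctuation that the character class would otherwise capture.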
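The atomic writes behind `data/records.json` usually mean write-to-temp-then-rename; a minimal sketch (the project's actual locking and IO code may differ):

```python
import json
import os
import tempfile

def atomic_write_json(path: str, data) -> None:
    """Dump JSON to a temp file in the target directory, then rename it in.

    os.replace is atomic on the same filesystem, so concurrent readers
    of records.json never observe a half-written file.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(data, f, ensure_ascii=False, indent=2)
        os.replace(tmp, path)
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
```

Creating the temp file in the same directory as the target matters: `os.replace` is only atomic within one filesystem.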
- Environment setup

  ```bash
  python -m venv .venv
  source .venv/bin/activate      # Windows: .venv\Scripts\activate
  pip install -r requirements.txt
  cp config.example.env .env     # then edit CROSSREF_MAILTO
  mkdir -p pdfs output
  ```

- Batch extraction

  ```bash
  python main.py batch
  ```

  Outputs land in `output/metadata.json` and `output/metadata_for_spreadsheet.csv` with the teacher's column order.

- Web UI

  ```bash
  python main.py serve
  ```

  Browse to `http://127.0.0.1:8000/` to access:
  - Upload PDFs: drag/drop ≤20 files and watch metadata cards appear instantly; the status list shows success/failure for each file.
  - Metadata Table: browse persistent records, delete rows, drag to reorder, resize columns/rows, tweak font size, and export CSV/JSON/Excel. All data persists under `data/`.
  - CLI users can also call the API directly (sample `curl` commands live in the code comments).

- All-in-one helper

  ```bash
  ./run.sh
  ```

  Ensures the virtualenv, requirements, and env vars exist, then launches `python main.py serve`.
- ✅ Automatic metadata: DOI detection + Crossref + CSV/JSON/Excel export.
- ✅ Persistent metadata table with drag/drop, delete, column/row resize, font slider, and multi-format export.
- ✅ Documentation and UI strings fully in English.
- ⏳ Upcoming (V1): asset picker (hero image, demo link) and media export helpers.
- 🔭 Future (V2+): automated teaser/media discovery plus AI-powered tooling.
Happy metadata harvesting! Contributions, bug reports, and feature ideas are welcome—file an issue or open a pull request.