acm-meta-mvp (v0.2.2)

Metadata-first pipeline for ACM Digital Library PDFs. The project extracts spreadsheet-ready metadata (Title, Venue, Year, Authors, Abstract, DOI) so the teacher’s sheet can be filled with minimal manual work.

Version 0.2.2 – Release Notes

  • Refactored the pipeline into reusable core modules (acm_meta/) so PDF parsing, Crossref calls, normalization, and persistence can be reused by both CLI and API layers.
  • Introduced typed models plus structured upload responses. Every API now returns consistent status/error_code/message payloads, and the frontend consumes a single record shape.
  • Hardened persistence with atomic JSON/CSV/XLSX writes and process-level file locks. DELETE/reorder now operate purely on stable record IDs.
  • Added inline editing + validation on the Metadata Table (double-click cells to edit, changes persist through the new PATCH /api/records/{id} endpoint).
  • Built-in DOI dedupe: uploading the same paper again replaces the existing row instead of creating duplicates, even if the PDF name differs.
  • Simplified the schema to the six core spreadsheet columns (Title, Venue, Year, Authors, Abstract, DOI) so the table matches the teacher’s requirements exactly.
  • Known issue: the Atlas browser sometimes swallows the confirm dialog, making Delete appear unresponsive. Chrome handles the dialog correctly; until Atlas fixes the bug, use Chrome when you need to delete rows.
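The structured upload responses mentioned above might be modeled roughly as follows. This is an illustrative sketch, not the project's actual models: the `status`/`error_code`/`message` field names come from the release notes, while the dataclass and helper names are assumptions.

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class UploadResult:
    """One per-file entry in a batch upload response (illustrative shape)."""
    status: str                       # "ok" or "error"
    error_code: Optional[str] = None  # machine-readable code on failure
    message: Optional[str] = None     # human-readable explanation

def result_for(filename: str, doi: Optional[str]) -> dict:
    # A missing DOI is the most common failure mode for this pipeline.
    if doi is None:
        return asdict(UploadResult("error", "doi_not_found",
                                   f"No DOI detected in {filename}"))
    return asdict(UploadResult("ok"))
```

Returning the same shape for success and failure lets the frontend consume a single record format, as the notes describe.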

TL;DR (Quick Start)

python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
cp config.example.env .env  # fill in CROSSREF_MAILTO
mkdir -p pdfs output && cp /path/to/acm.pdf pdfs/
python main.py serve  # open http://127.0.0.1:8000

Batch mode still works: drop multiple PDFs into pdfs/ and run python main.py batch to regenerate output/metadata.json + CSV.

Sample I/O Snapshot

Uploading XR-Objects.pdf produces the following table row (also persisted to data/records.json / .csv):

{
  "Title": "Augmented Object Intelligence with XR-Objects",
  "Venue": "Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology",
  "Publication year": 2024,
  "Author list": "Mustafa Doga Dogan, Eric J Gonzalez, Karan Ahuja, Ruofei Du, Andrea Colaço, Johnny Lee, Mar Gonzalez-Franco, David Kim",
  "Abstract": "Seamless integration of physical objects as interactive digital entities remains a challenge for spatial computing...",
  "DOI": "10.1145/3654777.3676379"
}

Version 0.2 – Release Notes

  • Added mouse-driven resizing for metadata-table columns and rows so reviewers can adjust column widths or row heights directly in the UI.
  • Converted the entire documentation to English and removed Chinese text from the repository.
  • Continued polishing the persistent metadata table (drag-to-reorder, inline delete, CSV/JSON/Excel export, adjustable font size).

1. Requirements & Field Breakdown

1.1 Target Data Schema

Each paper occupies one spreadsheet row. Column definitions stay fixed:

| Column | Description | Constraint |
| --- | --- | --- |
| Title | Paper title | Use the original English title; avoid trailing punctuation |
| Venue | Conference or journal name | Prefer the full official name (e.g., CHI Conference on Human Factors in Computing Systems) |
| Publication year | Year of publication | Four-digit integer (e.g., 2022) |
| Author list | Authors in `Given Family` format | Join with `", "` (comma and space); do not include "and" or periods |
| Abstract | Paper abstract | Plain English paragraph with the "ABSTRACT" label removed |
| DOI | Digital Object Identifier | e.g., 10.1145/3491102.3502071 |
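The Author list constraint can be sketched as a small helper that joins Crossref-style author entries. The function name is illustrative; the `given`/`family` keys match Crossref's author objects.

```python
def format_authors(authors: list[dict]) -> str:
    """Join Crossref-style author dicts as 'Given Family, Given Family'.

    Comma-separated, no "and", no trailing periods, per the schema above.
    """
    parts = []
    for a in authors:
        name = " ".join(p for p in (a.get("given"), a.get("family")) if p)
        if name:
            parts.append(name)
    return ", ".join(parts)
```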

1.2 Core Metadata Fields

The MVP automates Title, Venue, Publication year, Author list, Abstract, and DOI.


2. Roadmap & Priority

2.1 V0 / MVP (current scope)

  • Batch or web uploads for ACM DL PDFs (assuming each PDF contains a DOI).
  • Extract DOI from the first pages, query Crossref /works/{doi}, and normalize fields.
  • Produce two outputs:
    • output/metadata.json: verbose record for downstream automation.
    • output/metadata_for_spreadsheet.csv: column order matches the teacher’s sheet exactly.

2.2 V1 (next)

  • CLI or UI helpers for attaching optional hero images / demo links.
  • Scripts that export useful figures from PDFs.

2.3 V2+ (long-term)

  • Automatically discover high-quality teaser images / videos and recommend candidates.
  • Rank media to help select a teaser.
  • Generate “making of” prompts, XR browsing experiences, and other exploratory features.

3. MVP Documentation

3.1 Overview

  • Project name: acm-meta-mvp
  • Goal: Given ACM DL PDFs, automatically output Title / Venue / Year / Authors / Abstract / DOI that can be pasted directly into the teacher’s spreadsheet.
  • Pipeline: PDF → DOI extraction → Crossref metadata → normalization → JSON & CSV.

3.2 Tech Stack

  • Python 3.10+
  • FastAPI + Uvicorn: Web API (/api/upload)
  • PyPDF/PyMuPDF: extract DOI/abstract snippets from PDFs
  • Requests: call Crossref
  • Pandas: build CSV/Excel exports
  • python-dotenv: read Crossref email for the polite User-Agent

3.3 Directory Layout

MetaData/
├─ README.md
├─ requirements.txt
├─ main.py
├─ config.example.env
├─ data/
│  └─ .gitkeep              # becomes records.json / records.csv after runs
├─ frontend/
│  └─ index.html
├─ static/
│  ├─ app.js
│  └─ styles.css
├─ pdfs/
└─ output/

frontend/index.html plus everything in static/ power the web UI. Run python main.py serve and open http://127.0.0.1:8000/ to batch-upload PDFs (max 20) and manage the persistent metadata table.

3.4 Dependencies & Configuration

requirements.txt

fastapi
uvicorn[standard]
pypdf
requests
pandas
python-dotenv
python-multipart
pymupdf
openpyxl

config.example.env

CROSSREF_MAILTO=your_email@example.com

Copy to .env and replace the email with a real inbox for Crossref’s User-Agent guidelines.

3.5 Core Pipeline (main.py)

  1. DOI extraction: Use pypdf to read the first two pages and apply r"10\.\d{4,9}/[^\s\"<>]+".
  2. Crossref lookup: GET https://api.crossref.org/works/{doi} with the configured email in the User-Agent.
  3. Field normalization:
    • Title: message.title[0].
    • Venue: container-title[0].
    • Publication year: issued.date-parts[0][0].
    • Author list: join each author's given and family names, separating authors with ", ".
    • Abstract: prefer Crossref, otherwise extract from the PDF’s “Abstract” paragraph via PyMuPDF.
    • DOI: returned DOI or the PDF fallback.
  4. Outputs:
    • output/metadata.json: verbose record for downstream automation.
    • output/metadata_for_spreadsheet.csv: six columns (Title, Venue, Year, Authors, Abstract, DOI).
  5. API modes: /api/upload handles a single file, POST /api/upload/batch processes multiple PDFs, and python main.py batch runs over everything inside pdfs/.

3.6 Persistence & Batch API

  • POST /api/upload/batch: accepts up to 20 files, returns a per-file status, and saves successes to data/records.json and data/records.csv. Missing abstracts are auto-extracted from the PDF.
  • GET /api/records: returns all stored records (most recent first); the frontend uses this for the metadata table.
  • DELETE /api/records/{id}: deletes a record (triggered by the table’s Delete button).
  • PATCH /api/records/{id}: updates editable columns (Title/Venue/Year/Authors/Abstract/DOI/etc.) from the inline editor.
  • POST /api/records/reorder: persists drag-and-drop ordering from the UI.
  • GET /api/export: downloads data/records.csv.
  • GET /api/export/json: downloads the JSON dataset.
  • GET /api/export/xlsx: downloads an Excel workbook built with openpyxl.
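The atomic-write behavior behind these endpoints (mentioned in the v0.2.2 notes) can be sketched as below. A minimal version under assumptions: the function name is illustrative, and the process-level file lock the real module holds around this call is omitted.

```python
import json
import os
import tempfile

def atomic_write_json(path: str, records: list) -> None:
    """Write JSON to a temp file, then rename it into place.

    os.replace is atomic on both POSIX and Windows, so a concurrent reader
    never observes a half-written records.json. The real pipeline also
    takes a process-level lock around this call (omitted in this sketch).
    """
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(records, f, ensure_ascii=False, indent=2)
        os.replace(tmp, path)  # atomic swap into place
    except BaseException:
        os.unlink(tmp)  # clean up the temp file on failure
        raise
```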

3.7 Running the Project

  1. Environment setup

    python -m venv .venv
    source .venv/bin/activate  # Windows: .venv\Scripts\activate
    pip install -r requirements.txt
    cp config.example.env .env  # then edit CROSSREF_MAILTO
    mkdir -p pdfs output
  2. Batch extraction

    python main.py batch

    Outputs land in output/metadata.json and output/metadata_for_spreadsheet.csv with the teacher’s column order.

  3. Web UI

    python main.py serve

    Browse to http://127.0.0.1:8000/ to access:

    • Upload PDFs: drag/drop ≤20 files and watch metadata cards appear instantly; the status list shows success/failure for each file.
    • Metadata Table: browse persistent records, delete rows, drag to reorder, resize columns/rows, tweak font size, and export CSV/JSON/Excel. All data persists under data/.
    • CLI users can also call the API directly (sample curl commands live in the code comments).
  4. All-in-one helper

    ./run.sh

    Ensures virtualenv/requirements/env vars exist, then launches python main.py serve.

3.8 Status Snapshot

  • ✅ Automatic metadata: DOI detection + Crossref + CSV/JSON/Excel export.
  • ✅ Persistent metadata table with drag/drop, delete, column/row resize, font slider, and multi-format export.
  • ✅ Documentation and UI strings fully in English.
  • ⏳ Upcoming (V1): asset picker (hero image, demo link) and media export helpers.
  • 🔭 Future (V2+): automated teaser/media discovery plus AI-powered tooling.

Happy metadata harvesting! Contributions, bug reports, and feature ideas are welcome—file an issue or open a pull request.
