🐛 Bug fix for exact entities. #80
…the fuzzy matching algorithm.
Reviewer's Guide
Introduces exact-match handling for specific entity labels in the entity disambiguation pipeline by propagating a normalized subclass key from anonymization postprocessing into canonical entity building, so that those labels cluster strictly by exact value instead of fuzzy similarity; also adjusts a notebook to point to a different example document.
Sequence diagram for exact-match handling in entity disambiguation
sequenceDiagram
participant AnonymizationPostprocess
participant FuzzyDisambiguation
participant CanonicalEntities
AnonymizationPostprocess->>AnonymizationPostprocess: process(ent)
AnonymizationPostprocess->>AnonymizationPostprocess: cleaned_text = pattern.sub("", ent.text)
AnonymizationPostprocess->>AnonymizationPostprocess: ent.attrs.aymurai_label_subclass = []
alt label in exact_labels
AnonymizationPostprocess->>AnonymizationPostprocess: flattened_text = re.sub("[^a-zA-Z0-9]", "", cleaned_text)
AnonymizationPostprocess->>AnonymizationPostprocess: ent.attrs.aymurai_label_subclass.append(flattened_text)
end
AnonymizationPostprocess->>AnonymizationPostprocess: ent.attrs.aymurai_alt_text = cleaned_text
FuzzyDisambiguation->>FuzzyDisambiguation: build_canonical_entities(labels, target_labels, threshold)
FuzzyDisambiguation->>FuzzyDisambiguation: grouped.setdefault(aymurai_label, []).append({text, aymurai_label, exact_alias})
loop for each label_type, items in grouped.items()
alt label_type in EXACT_LABELS
FuzzyDisambiguation->>FuzzyDisambiguation: exact_groups.setdefault(exact_alias, []).append(item)
FuzzyDisambiguation->>FuzzyDisambiguation: clusters = list(exact_groups.values())
else
FuzzyDisambiguation->>FuzzyDisambiguation: clusters = _cluster_aliases_with_cdist(items, threshold)
end
FuzzyDisambiguation->>CanonicalEntities: _clusters_to_canonical_entities(clusters)
end
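The flow in the diagram can be sketched in Python as follows. This is a minimal illustration and not the code from `fuzzy.py`: the function name `cluster_by_label`, the item dictionary layout, the `PER` label, the 0-100 threshold scale, and the `difflib`-based stand-in for `_cluster_aliases_with_cdist` are all assumptions made for the example.

```python
from difflib import SequenceMatcher

# Subset of the labels listed in the PR; used here only for illustration.
EXACT_LABELS = {"DNI", "CUIT_CUIL", "CBU"}


def fuzzy_clusters(items, threshold):
    """Hypothetical stand-in for _cluster_aliases_with_cdist (greedy, ratio-based)."""
    clusters = []
    for item in items:
        for cluster in clusters:
            first = cluster[0]["text"].lower()
            ratio = SequenceMatcher(None, item["text"].lower(), first).ratio()
            if ratio * 100 >= threshold:  # assumed 0-100 similarity scale
                cluster.append(item)
                break
        else:
            # No existing cluster was similar enough: start a new one.
            clusters.append([item])
    return clusters


def cluster_by_label(items, threshold=80):
    """Group items by label, then cluster exactly or fuzzily per label."""
    grouped = {}
    for item in items:
        grouped.setdefault(item["aymurai_label"], []).append(item)

    result = {}
    for label_type, label_items in grouped.items():
        if label_type in EXACT_LABELS:
            # Exact path: one cluster per distinct normalized alias.
            exact_groups = {}
            for item in label_items:
                exact_groups.setdefault(item["exact_alias"], []).append(item)
            result[label_type] = list(exact_groups.values())
        else:
            result[label_type] = fuzzy_clusters(label_items, threshold)
    return result


items = [
    {"text": "12.345.678", "aymurai_label": "DNI", "exact_alias": "12345678"},
    {"text": "12345678", "aymurai_label": "DNI", "exact_alias": "12345678"},
    {"text": "12.345.679", "aymurai_label": "DNI", "exact_alias": "12345679"},
    {"text": "Juan Perez", "aymurai_label": "PER", "exact_alias": None},
    {"text": "Juan Pérez", "aymurai_label": "PER", "exact_alias": None},
]
clusters = cluster_by_label(items)
```

In this sketch the two spellings of the same DNI share a cluster while the DNI that differs by one digit gets its own, which is the behavior the exact path is meant to guarantee; the two name variants still merge through the fuzzy path.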
Hey - I've found 2 issues, and left some high level feedback:
- The `exact_labels` set is duplicated in both `fuzzy.py` and `core.py`; consider centralizing this constant in a shared module to avoid divergence and make future updates easier.
- In `anonymization_postprocess/core.py`, `aymurai_label_subclass` is always reset to an empty list; verify whether you should preserve any existing subclasses or guard against overwriting previously set values.
- The notebook change from `documents[14]` to `documents[5]` looks like a local experiment tweak; confirm this is the intended default behavior and not a temporary debugging choice.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The `exact_labels` set is duplicated in both `fuzzy.py` and `core.py`; consider centralizing this constant in a shared module to avoid divergence and make future updates easier.
- In `anonymization_postprocess/core.py`, `aymurai_label_subclass` is always reset to an empty list; verify whether you should preserve any existing subclasses or guard against overwriting previously set values.
- The notebook change from `documents[14]` to `documents[5]` looks like a local experiment tweak; confirm this is the intended default behavior and not a temporary debugging choice.
## Individual Comments
### Comment 1
<location path="aymurai/utils/entity_disambiguation/fuzzy.py" line_range="10-18" />
<code_context>
from aymurai.meta.api_interfaces import DocLabel
from aymurai.meta.entities import CanonicalEntity
+EXACT_LABELS = {
+ "DNI",
+ "CUIT_CUIL",
+ "TELEFONO",
+ "PATENTE_DOMINIO",
+ "IP",
+ "NUM_CAJA_AHORRO",
+ "CBU",
+ "NUM_MATRICULA",
+}
+
</code_context>
<issue_to_address>
**suggestion:** Avoid duplicating the exact-label set in multiple modules by centralizing it
This set also exists here as `EXACT_LABELS` and in `anonymization_postprocess/core.py` as `exact_labels`. Please move it to a shared constants module and import it in both places so there’s a single source of truth and no risk of the two lists drifting out of sync.
Suggested implementation:
```python
from aymurai.meta.api_interfaces import DocLabel
from aymurai.meta.entities import CanonicalEntity
from aymurai.meta.constants import EXACT_LABELS
```
1. Create (or extend) a shared constants module, for example `aymurai/meta/constants.py`, and move the set definition there:
```python
EXACT_LABELS = {
"DNI",
"CUIT_CUIL",
"TELEFONO",
"PATENTE_DOMINIO",
"IP",
"NUM_CAJA_AHORRO",
"CBU",
"NUM_MATRICULA",
}
```
2. In `anonymization_postprocess/core.py`, replace the local `exact_labels` definition with an import from the same constants module, e.g.:
```python
from aymurai.meta.constants import EXACT_LABELS as exact_labels
```
(or adjust naming/import style to match existing conventions in that file).
3. Ensure `aymurai/meta/constants.py` is part of the package (has `__init__.py` as needed) and update any relevant `__all__` if your project uses it.
</issue_to_address>
### Comment 2
<location path="aymurai/transforms/anonymization_postprocess/core.py" line_range="60-64" />
<code_context>
+ "NUM_MATRICULA",
+ }
+
+ ent["attrs"]["aymurai_label_subclass"] = []
+
+ if label in exact_labels:
+ flattened_text = re.sub(r"[^a-zA-Z0-9]", "", cleaned_text)
+ ent["attrs"]["aymurai_label_subclass"].append(flattened_text)
+
# Update the entity's alt text and indices
</code_context>
<issue_to_address>
**issue (bug_risk):** Re-initializing `aymurai_label_subclass` may unintentionally discard previous subclass information
Unconditionally assigning `ent["attrs"]["aymurai_label_subclass"] = []` clears any existing data in this field before you append the new value. If earlier steps in the pipeline set this attribute (now or in the future), this could cause data loss. Consider only initializing when absent (e.g., via `setdefault`/`get`) or otherwise making this logic additive rather than destructive.
</issue_to_address>
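One way to make the initialization additive, as Comment 2 suggests, is sketched below. The helper name `record_exact_subclass` and its signature are hypothetical and do not come from `core.py`; only the attribute name and the regular expression are taken from the diff.

```python
import re


def record_exact_subclass(ent, label, cleaned_text, exact_labels):
    """Append a normalized key without discarding existing subclasses."""
    attrs = ent.setdefault("attrs", {})
    # setdefault only creates the list when the key is absent.
    subclasses = attrs.setdefault("aymurai_label_subclass", [])
    if label in exact_labels:
        flattened_text = re.sub(r"[^a-zA-Z0-9]", "", cleaned_text)
        # Skip duplicates so repeated processing stays idempotent.
        if flattened_text not in subclasses:
            subclasses.append(flattened_text)
    return ent


ent = {"attrs": {"aymurai_label_subclass": ["EXISTING"]}}
record_exact_subclass(ent, "DNI", "12.345.678", {"DNI"})

fresh = record_exact_subclass({}, "OTHER", "some text", {"DNI"})
```

If downstream code assumes the list holds only the exact key, that assumption would need checking before adopting an additive approach like this one.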
Pull request overview
This PR adjusts entity disambiguation/anonymization so certain identifier-like labels (e.g., DNI/CBU/IP) are treated as exact identifiers (no fuzzy clustering), using a normalized “exact alias” derived from label subclass metadata.
Changes:
- Add an `EXACT_LABELS` path in canonical-entity building to group exact-identifier labels by a normalized alias instead of fuzzy clustering.
- Update anonymization postprocessing to store a normalized subclass value for exact-identifier labels to support exact grouping.
- Update an experimental notebook to process a different sample document.
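As a rough illustration of the normalized alias mentioned in the changes above, the snippet below applies the same regular expression shown in the diff to a few made-up identifier strings. The helper name `exact_alias` and the sample values are assumptions for the example.

```python
import re


def exact_alias(text):
    """Strip everything except ASCII letters and digits (regex from the diff)."""
    return re.sub(r"[^a-zA-Z0-9]", "", text)


# Made-up formatting variants of one identifier.
variants = ["20-12345678-9", "20 12345678 9", "20123456789"]
aliases = {exact_alias(v) for v in variants}

# The regex does not change letter case, so these two keys differ.
upper_key = exact_alias("AB 123 CD")
lower_key = exact_alias("ab123cd")
```

All three variants collapse to a single key, so they would land in the same exact group. Since the pattern keeps case as-is, values such as licence plates written in different cases would produce different keys unless case is normalized elsewhere.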
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| notebooks/experiments/entity-disambiguation/10-anonymize-document-render-policy.ipynb | Changes which document sample index is processed in the experiment. |
| aymurai/utils/entity_disambiguation/fuzzy.py | Introduces exact-identifier grouping logic during canonical entity construction. |
| aymurai/transforms/anonymization_postprocess/core.py | Records a normalized subclass value for exact-identifier labels during entity cleaning. |
879309c8 Feat/entity manager mention feedback (#81) 4d2de106 Fix/responsive home layout (#80) 986e68d2 Fix/homogenize file check ui (#77) 046f8ab9 fix(file-annotator): fix upward autoscroll on search previous navigation (#76) 2ecf75dc feat(dependencies): add dnd-kit packages for drag-and-drop functionality git-subtree-dir: frontend git-subtree-split: 879309c841d8072babc4d06f1686d11cf8cbd03f
* Merge dev into main for v1.1.12 (#57) * Update README.md * 🐛 bugfix: Fix XML special character escaping in DocAnonymizer * ➕ build(deps): Add python-docx package * ✨ feat: Add watermark and hyperlink functionality to document anonymization * ✨ feat: Install Archivo font in Dockerfile * 🎨 refactor: Improve Dockerfile structure and comments for clarity * ⏪ revert: Remove Archivo font installation from Dockerfile * 🔖 feat: Update aymurai package version to 1.1.11 in uv.lock * 🐛 Improve get_extension logic to fix document extraction issues on Windows and remove python-magic dependency * 🔧 Update Dockerfile to use 'bullseye' variant for Python images for improved compatibility * 🔧 Update Makefile targets for improved Docker workflow * 🔖 feat: Bump aymurai package version to 1.1.12 * ♻️ Harden get_extension with header scans and zip safeguards * 🔧 Extend document extraction timeout to 30s * 🔧 Refactor Docker workflow to build and push images using docker/build-push-action * 🔧 Fix workflow step order to correctly extract tag name before building Docker images * 🔧 Remove tag extraction step and use github.ref_name directly for Docker image builds * ⏪ Revert Docker workflow to extract tag name and use it for image versioning * Update .github/workflows/build-docker-image.yml Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * ✏️ Remove incomplete comment Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --------- Co-authored-by: jed <jedzill4@users.noreply.github.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Squashed 'frontend/' content from commit 9123e6f git-subtree-dir: frontend git-subtree-split: 9123e6ff047ddc6da0528d1de827a4af68752d0f * Squashed 'frontend/' changes from 9123e6ff..8add5c45 8add5c45 1.25.0 d7424d94 use base_url when dealing with public assets ae1a47b8 Merge pull request #44 from AymurAI/feat/redesign 9cfc610c fix issue regarding annotation keyboard navigation a723be01 restore useNotify 
feature in process page b888214a make how it works modal bigger 49c3bc22 fix electron ts issues eb791d07 add missing anonymize value on label attrs 151e2710 make OS taskbar API safe to call in web 8dc38fbb add knip a0f6c687 fix ts clone element issue d975e3fd include label policies and annotations when anonymizing document 70c44a04 more fixes in height to validate dataset a36a7401 make suggestion clickable on uncontrolled text input ea61e4e2 add fixed height to validation dataset page 072d751e remove axios' default api url 6f8a478c adjust typings 528c646f add more canonical id operations on reducers ae09277c add update by cannonical id function de2b641c add random cannonical id on search add event 74d4dac2 add value as undefined on ui/input component 6aa895f4 simplify and remove unused code on fille annotator component d10fe8c7 add useShallow to local storage default getters 9f62ac69 adjust predicting and file parser flows 01aa8a79 fix TS issues 5655eac5 smaller fixes on file annotation components a7e91cec add remove dialog to label manager's entity tab cad8faaa simplify annotation components 52230ee4 add metadata to search and tag annotations 0f08b9c8 add more anonymizer copy locale texts 6c3112ac store label manager config in local storage 63c4d731 move suggestion label and add mark props 77017543 adjust spacing on dataset validation cae3b907 disambiguate hook 6b21903e move createAnnotationData, tag and suffix to context d1649e5d add max width to toast and remove instad of dismiss it ca56bb63 add size variant to suggestion and adjust display 55a67bf8 fix select viewport and add scroll aa2ce6a1 adjust styles on callout 5493cdee finish mark and tag annotation feature 4fb28120 finish major pages and add file protection feature af457c85 improve accessibility on host page c2c21c21 add className to SectionTitle 6edd7101 update tanstack react query version df491c17 add odt to pdf conversion efb04bc5 deprecate usage of useSchemedQuery and useSchemedQueries 046a73e2 
deprecate usage of useSchemedMutation 65ceb195 swap replace button icons 12348edd update button icons 2719755f extract tagger popover logic 28ce40ab create search tagger e54adfbe add small version of select component d6c3bd77 improve ui dialog component 63c6bf09 initial mark component update cd0f7d30 add radix's select as dependency 92ceaace remove old tooltip implementation 2499e4c8 add input size variants be48dac9 add custom icons to callout, toast and showToast 32d3441b add retro compatibility features to select 657e2a5a minor styling fixes + new select imports 7a323cd2 kill more unused components c5d7b4ba create better select component d855a93b adjust styling and positioning on preview page 1cfba9c4 adjust callout styling c28fdcd0 fix title in process copy etxt 3a9a5cc6 remove old component implementations 03954d74 create callout from toast, and then apply a11y to toast 93c9457d create toast component 642e760d improve suggestion component 707e844b update suggestion mark component's styles 7c1db956 add checked variant to button 204246c7 fix gaps on finish page 9e81ee12 hide file stepper selector arrows depending on cursor da1377ea update file processing component 12aa1c6b update decision tabs to panda 490a1adb simplify finish dataset and anonymizer ba12dad2 add className prop to footer 4ed68454 drop file queries when resetting the progress 762cf823 add missing built by in features page 2c9d6cf8 make dataset validation file annotator not annotable 59416b8f clear files on features menu 9585726b use translation on droparea 091b2c58 make search bar static and follow scroll 0978aff6 add label manager to file annotator 5b069d68 connect rest of label manager 23fa8407 improve layout header component 3c88d2f3 finish preview page d31ec034 finish onboarding page 7d1a71f3 add disambiguate and predict react query options 4e91d285 feat: add feature icon record bfc9baf7 feat: create initial label manager 0bc48bc9 chore: refactor Searchbar 962961e4 chore: remove stepper 
component 95e17235 fix: title in anonymizer locale df53488d chore: minor changes 4a9a2557 feat: create switch component d4830e48 feat: create label manager component baa9835c feat: add more copies for the finish page 9d915cab chore: kill unused old hidden input component 2fd61a7f fix: add missing feature param call on route.tsx 617ea974 feat: use a11y on finish page 172b731a Merge branch 'feat/a11y-dataset-header' into feat/redesign 394121d9 feat: add more copies to locale file c34951e8 fix: redo home layout 31078ff5 chore: cleanup cf9def05 chore: restructure HOC to be a regular component 088bed82 feat: add api base url protection and apply it cd15a776 fix(layout): address PR review on header and icon changes 59dc125e feat: add i18n support for the whole app 36d70bbd build: add i18n 244027be refactor(layout): hoist Topbar and Stepper to global wizard route 146905fb feat: improve topbar accessibility with semantic icons and aria labels 5f910470 chore: ignore personal analysis folder b6569f77 chore: rework layout components 331ff8dc fix: update enum import ee17865e feat(ui): create and/or adjust components 6c9daf38 feat: rework onboarding page 347054d5 chore: simplify main app layout cef7f882 chore: adjust button sizes and enum import dc59ad2a feat: make card clickable 051b6bd4 chore: add tutorial seen to local storage store 1e15bb74 fix: typo on anonymizer label 24f6de07 build: add web or electron run modes beb63bc3 chore: migrate hidden input 23442095 chore: use constants and base card element on feature selector 2eba84b3 chore: refactor header so we can correctly position all elements 0a50bb14 fix: adjust stepper styles (sizing and colors) 504840d9 chore: export constants 3b79f391 build: update react and add radix dependencies f6f7fac8 fix: remove fadeIn scaling animation 41a158c3 chore: create modern ui components 5a02a5df chore: flag card as deprecated edb85211 feat: create modern tooltip component a2c2040a fix: replace brand images with correct ones and set 
proper heights edce7295 feat(components): create brand, layout and ui components 5475a617 feat: add more brand images 9cd982f8 chore(styles): add animation semantic tokens ce991130 fix: extra character in home layout and rename the component 5e41df1a feat: redesign home 0947627c feat: create link card tool for features 0833a987 feat: create components to render in home screen a6198d99 fix: add fixed height to button and auto adjust icon size ab0a4e20 fix: add lineheight to text styles and adjust font weight 047b115a chore: replace custom use mutation hook with base on connect to host hook 59363cd7 chore: change to named export on local store 3a289e1e chore: add changes to router file 2393d18e chore: fix some tokens in panda and move stitches global styles 64048ac7 chore: re-implement button and partially input 51fe0990 feat: add loading screen on boot, timer of 1.5s 60cf7fde chore: configure view transition for all pages ebcf3a66 feat: add loading page and updated branding images 67603f66 chore: flag stitches as deprecated e416ace8 build: install and configure pandacss c2e5eac1 build: add support for environmental variables for both web and electron apps git-subtree-dir: frontend git-subtree-split: 8add5c452478cdbe6a99ad1b05183cd264183c72 * ✨ Add frontend routing and settings for frontend distribution directory * Squashed 'frontend/' changes from 8add5c45..ff882164 ff882164 chore: add .npmrc to configure public hoist pattern for @types 32bfab0a Merge pull request #59 from AymurAI/fix/53-restore-home-button 94e3816d Merge pull request #65 from AymurAI/fix/add-placeholder-to-select-entities 3545775f Merge pull request #66 from AymurAI/fix/remove-doc-extension d24c458b add a "config" button in features menu f427da73 make hover effect in button work for anchor tanstack link wrapper 77077905 remove slot checks on header 82d2c65a add home button to header on all flow's pages 5663da42 make aymurai's logo a link in the header c3c0e8a7 create home button component 51cb537f 
fix: prevent select caret rotation from leaking ancestor data-state 5d85973f feat: add tooltips to tagger label and suffix inputs 7e0b67b8 Fix text overflow in HowItWorksModal (#58) 3d516636 Restore delete-one and delete-all hover actions on annotations (#57) fa7e0967 create link component c8112e20 add "Entidad" placeholder to tagger select 70710e11 change NINO to NIÑO 6664e631 remove copies and functions referencing .doc files 4a00039b copy change 422affaf Merge pull request #64 from AymurAI/fix/browser-resources-exhaustion ce8a474a prevent semaphores underflow dcf3b6b7 Merge pull request #62 from AymurAI/fix/conversion-endpoint-usage 7729801f add error handling to finish file conversion 2afe00ee Merge pull request #63 from AymurAI/fix/copy-changes 387d6724 Merge pull request #61 from AymurAI/fix/responsiveness 84843877 limit concurrent predict requests to avoid connection exhaustion 8d3e2693 fix: increase spacing between home and features menu buttons 3025e341 fix: use House icon in header instead of BackButton arrow c9bba1a9 feat: add back-to-home button on features page a0b8f55f use extension to check if file conversion is needed 0b83d0b0 create pdf to odt service 78b1a9c9 responsiveness for screens less than 1280px in width fef9a94e copy change on label manager tab b9c6db53 copy changes on label manager config tab git-subtree-dir: frontend git-subtree-split: ff882164be8077dee58b6748886b0d7d3acbe376 * 🔧 Remove commented-out router for anonymizer database * ✨ Add Node.js and npm installation for frontend build in Dockerfile * 📝 Update API documentation URLs to include '/api' prefix * ✨ Add frontend build commands to Makefile * 🙈 Update .dockerignore and .gitignore to include frontend build output directories * ✅ Update API routes to include '/api' prefix in tests and add frontend integration tests * ♻️ Refactor routing and API integration to remove '/app' prefix and streamline feature routes * Squashed 'frontend/' changes from ff882164..d3e14b5e d3e14b5e 
feat(validation): persist and restore predictions via backend validation endpoint (#68) 6b8a23ba Add drag-and-drop reordering and inline rename to label manager (#60) git-subtree-dir: frontend git-subtree-split: d3e14b5e00af41fded1c113e51e2e8b73bbf1b22 * refactor: update feature routing, migrate to pnpm, and refine dev environment configuration * Squashed 'frontend/' changes from d3e14b5e..879309c8 879309c8 Feat/entity manager mention feedback (#81) 4d2de106 Fix/responsive home layout (#80) 986e68d2 Fix/homogenize file check ui (#77) 046f8ab9 fix(file-annotator): fix upward autoscroll on search previous navigation (#76) 2ecf75dc feat(dependencies): add dnd-kit packages for drag-and-drop functionality git-subtree-dir: frontend git-subtree-split: 879309c841d8072babc4d06f1686d11cf8cbd03f * Squashed 'frontend/' changes from 879309c8..a37adc20 a37adc20 fix(useLocal): stop persisting groupOrder and remove dead categoryAssignments (#78) (#87) 47fb1fb3 fix(disambiguate): match response items by text instead of array index (#78) (#86) c873c8e0 Mover configuración de pnpm de `package.json` a `pnpm-workspace.yml` (#83) f4ce881b fix(useFileParse): use position-based paragraph ID to avoid key collisions (#85) 181e0356 Fix/invalid entity offsets (#82) git-subtree-dir: frontend git-subtree-split: a37adc20f579276b3a0e5979424ba7809fb7e2ff * chore: migrate frontend build process from npm to pnpm in API Dockerfile * 🐛 fix: add support for numpy integer and floating types in EnhancedJSONEncoder * fix: update Stack component to use height instead of minHeight for consistent layout * fix: update imports for Label and Text components in UncontrolledInput to avoid circular dependency * chore: regenerate routeTree.gen.ts after removing $feature parent layout route * feat: add default anonymization policies to settings * chore: bump frontend version to 1.5.0 * fix(api): preserve pipeline cache for configured ttl * refactor: remove torch dependency and configure threads via settings * 
fix(frontend): replace previous anonymizer file on load * fix(frontend): support dataset export in web mode * fix(tests): add SQLALCHEMY_DATABASE_URI environment variable for api tests * fix(api): improve error logging during startup --------- Co-authored-by: jed <jedzill4@users.noreply.github.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: dmazzini <dmazzini@gmail.com>
* ➕ build(deps): Add langextract for text entity extraction
* 🚧 wip: Add langextract entity extraction experiment notebook
* ✨ feat: Enhance entity models with relation handling and canonical representation
* ✨ feat: Add JSON serialization support and enhance utility functions
* ⬆️ Upgrade ML dependencies and refresh uv.lock
* 🚧 wip: Update extraction examples in langextract notebook
* 📝 Add entity disambiguation notebook for canonical entity extraction
* ⬆️ Update dependencies: langextract to 1.1.0 and ollama to 0.6.1; add openai extra for langextract
* 📝 Integrate custom OpenAI model for extraction and remove failing empty example
* 📝 Update error message format in json_serial function for better readability
Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com>
* ♻️ Inline immediate return in get_pretty
Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com>
* 🐛 Fix: Use json_serial in save_json
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* 🎨 Format json.dumps call in save_json for improved readability
* Feature/ollama service (#59)
* ✨ Add GPU-enabled Ollama service to compose stack
* 🔧 Add Make targets for managing Ollama service and models
* 🔧 Add launch configuration and task for starting Ollama service
* Feature/llm providers (#60)
* ✨ Add GPU-enabled Ollama service to compose stack
* 🔧 Add Make targets for managing Ollama service and models
* 🔧 Add launch configuration and task for starting Ollama service
* ✨ Implement LLM providers module with Ollama adapter and shared abstractions
* ✅ Add unit tests for LLM providers including DummyProvider and OllamaLLMProvider
* 📝 Document Ollama provider usage via notebook demo
* 🐛 Fix tokenizer encoding by removing unnecessary special tokens flag
* ♻️ Refactor chunk handling in LLMProvider to use _append_chunk method for consistency and improved readability
* ✨ Enhance Ollama provider docs and DRY response building for sync/async calls
* ♻️ Refactor OllamaLLMProvider to reuse AsyncClient instance for improved efficiency
* 📝 Add async examples to OllamaLLMProvider notebook
* ✅ Add async coverage for OllamaLLMProvider and tighten chunking tests
* ♻️ Refactor OllamaLLMProvider to remove async client caching and streamline client instantiation
* Feature/disambiguation metric v2 (#62)
* Update .gitignore to exclude entity disambiguation experiment directories and modify Jupyter notebook execution counts and output handling
* Refactor Makefile for improved service management and update .gitignore to exclude specific experiment directories. Add new Jupyter notebooks for entity disambiguation metrics and documentation.
* Adjust example data for consistency in entity representation.
* Refactor entity disambiguation notebooks to standardize attribute naming and improve metric evaluation. Update role attribute from 'rol' to 'role' for consistency across examples and documentation. Adjust evaluation function to return both score and metrics.
* Add evaluation metrics for entity disambiguation
- Introduced new metrics module for evaluating entity disambiguation performance, including functions for alias normalization, Jaccard similarity, and greedy matching.
- Implemented main evaluation function to compute scores and metrics from gold and predicted entities.
- Added Jupyter notebooks for practical examples and evaluation results, including normalized and non-normalized text evaluations.
- Updated documentation to reflect changes in function signatures and outputs.
* 🔧 Expand Makefile: add API management targets (api-run, api-stop, api-logs, api-full-run) for smoother service control
* ♻️ Refactor metrics.py: clarify docstrings, align type hints, and polish logging
* ✏️ Fix role attribute reference in evaluation metric documentation for consistency
* 🔧 Add CanonicalEntities class to represent a collection of canonical entities
* 📝 Update entity disambiguation notebooks: clean up imports, adjust paths, and streamline API calls for improved clarity and functionality
---------
Co-authored-by: padonizetti
Co-authored-by: jansaldo
* Feature/summarization (#61)
* ✨ feat: Add Streamlit app for document summarization experiments
* Add statistical analysis notebook for summarization performance evaluation (visualized gaps in performance between CPU and CUDA models, LLM hallucinations)
* 🎨 Quantitative and qualitative analysis of summaries: descriptive analysis by features, model comparison, gap analysis (CPU-CUDA), garbage detection/outliers, analysis by document, visualizations.
* 🔒️ clear all outputs
* 🎨 Improve Summary Analysis per document: cuda vs llama (same model), gemma vs llama (cuda), same document phi3 vs. phi4. Token per second gap.
* ✨ Add YAML utility functions for loading and saving data
* Merge dev into main for v1.1.12 (#57)
* Update README.md
* 🐛 bugfix: Fix XML special character escaping in DocAnonymizer
* ➕ build(deps): Add python-docx package
* ✨ feat: Add watermark and hyperlink functionality to document anonymization
* ✨ feat: Install Archivo font in Dockerfile
* 🎨 refactor: Improve Dockerfile structure and comments for clarity
* ⏪ revert: Remove Archivo font installation from Dockerfile
* 🔖 feat: Update aymurai package version to 1.1.11 in uv.lock
* 🐛 Improve get_extension logic to fix document extraction issues on Windows and remove python-magic dependency
* 🔧 Update Dockerfile to use 'bullseye' variant for Python images for improved compatibility
* 🔧 Update Makefile targets for improved Docker workflow
* 🔖 feat: Bump aymurai package version to 1.1.12
* ♻️ Harden get_extension with header scans and zip safeguards
* 🔧 Extend document extraction timeout to 30s
* 🔧 Refactor Docker workflow to build and push images using docker/build-push-action
* 🔧 Fix workflow step order to correctly extract tag name before building Docker images
* 🔧 Remove tag extraction step and use github.ref_name directly for Docker image builds
* ⏪ Revert Docker workflow to extract tag name and use it for image versioning
* Update .github/workflows/build-docker-image.yml
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* ✏️ Remove incomplete comment
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
---------
Co-authored-by: jed <jedzill4@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* ✨ Add GPU-enabled Ollama service to compose stack
* 🔧 Add Make targets for managing Ollama service and models
* 🔧 Add launch configuration and task for starting Ollama service
* 🔧 Add system prompts for document summarization
* 📝 Add summarization benchmark notebook
* 🚚 Move statistical analysis notebook to summarization folder
* ✨ Implement LLM providers module with Ollama adapter and shared abstractions
* ✅ Add unit tests for LLM providers including DummyProvider and OllamaLLMProvider
* 📝 Document Ollama provider usage via notebook demo
* 🐛 Fix tokenizer encoding by removing unnecessary special tokens flag
* ♻️ Refactor chunk handling in LLMProvider to use _append_chunk method for consistency and improved readability
* ✨ Enhance Ollama provider docs and DRY response building for sync/async calls
* ♻️ Refactor OllamaLLMProvider to reuse AsyncClient instance for improved efficiency
* 📝 Add async examples to OllamaLLMProvider notebook
* ✅ Add async coverage for OllamaLLMProvider and tighten chunking tests
* ➕ Add tiktoken dependency to pyproject.toml and update version in uv.lock
* 🔧 Enhance summarization prompts with additional information extraction and entity identification details
* ✨ Add LLM summarization router
* 📝 Add notebook for the summarization endpoint
* ✏️ Fix formatting of keys in summarization defaults for consistency
* ➕ Add dspy dependency and update related packages in project configuration
* 🚧 WIP: Add prompt optimization notebook for summarization experiments
---------
Co-authored-by: Sofi <sofiamorenadelpozo@gmail.com>
Co-authored-by: jed <jedzill4@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* 🩹 Fix YAML key names in prompt defaults for summarization
* ♻️ refactor: Restructure USEM module with factory pattern and multipl… (#64)
* Merge dev into main for v1.1.12 (#57)
* Update README.md
* 🐛 bugfix: Fix XML special character escaping in DocAnonymizer
* ➕ build(deps): Add python-docx package
* ✨ feat: Add watermark and hyperlink functionality to document anonymization
* ✨ feat: Install Archivo font in Dockerfile
* 🎨 refactor: Improve Dockerfile structure and comments for clarity
* ⏪ revert: Remove Archivo font installation from Dockerfile
* 🔖 feat: Update aymurai package version to 1.1.11 in uv.lock
* 🐛 Improve get_extension logic to fix document extraction issues on Windows and remove python-magic dependency
* 🔧 Update Dockerfile to use 'bullseye' variant for Python images for improved compatibility
* 🔧 Update Makefile targets for improved Docker workflow
* 🔖 feat: Bump aymurai package version to 1.1.12
* ♻️ Harden get_extension with header scans and zip safeguards
* 🔧 Extend document extraction timeout to 30s
* 🔧 Refactor Docker workflow to build and push images using docker/build-push-action
* 🔧 Fix workflow step order to correctly extract tag name before building Docker images
* 🔧 Remove tag extraction step and use github.ref_name directly for Docker image builds
* ⏪ Revert Docker workflow to extract tag name and use it for image versioning
* Update .github/workflows/build-docker-image.yml
* ✏️ Remove incomplete comment
---------
* ♻️ refactor: Restructure USEM module with factory pattern and multiple encoder backends
- Add BaseSentenceEncoder abstract base class for encoder interface
- Implement factory pattern with EncoderType enum and create_encoder function
- Add sentence-transformers encoder implementations (DistilUSE, MultilingualMiniLM)
- Move TensorFlow implementation to tensorflow_encoder.py
- Add lazy loading for encoder implementations via __getattr__
- Add auto-detection for Apple Silicon compatibility (defaults
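A minimal sketch of the factory pattern described in the bullets above, assuming a registry keyed by an enum. `DummyLengthEncoder`, the `DUMMY_LENGTH` member and `_REGISTRY` are hypothetical stand-ins for illustration; the real backends and the `__getattr__` lazy loading are not reproduced here.

```python
from abc import ABC, abstractmethod
from enum import Enum


class BaseSentenceEncoder(ABC):
    """Minimal encoder interface: texts in, one vector per text out."""

    @abstractmethod
    def encode(self, texts: list[str]) -> list[list[float]]:
        ...


class EncoderType(str, Enum):
    # Hypothetical member; the real enum lists the actual backends.
    DUMMY_LENGTH = "dummy-length"


class DummyLengthEncoder(BaseSentenceEncoder):
    """Stand-in backend that encodes each text as its character count."""

    def encode(self, texts: list[str]) -> list[list[float]]:
        return [[float(len(text))] for text in texts]


# Hypothetical registry mapping each enum member to its implementation.
_REGISTRY = {EncoderType.DUMMY_LENGTH: DummyLengthEncoder}


def create_encoder(encoder_type) -> BaseSentenceEncoder:
    """Build an encoder from an EncoderType member or its string value."""
    key = EncoderType(encoder_type)  # raises ValueError for unknown types
    return _REGISTRY[key]()


encoder = create_encoder("dummy-length")
vectors = encoder.encode(["hola", "mundo!"])
```

Callers depend only on `BaseSentenceEncoder`, so a backend can presumably be swapped by changing the enum value rather than the calling code.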
* 🚚 Rename test sentence encoders mac notebook
* 📌 Sync dependencies
---------
* ⏪ Rollback to previous torch and torchtext versions to avoid conflicts
* 🩹 Fix: Add missing environment variable for OLLAMA_HOST in docker-compose
* 📝 Add anonymization pipeline docs
* 🚧 WIP: Add Playwright PJN scraper
* 📝 Add Jupyter notebook for entity disambiguation from pre-clustered validations
* Feature/pdf extraction upgrade (#65)
* 🔧 Configure VSCode Python env and Copilot scopes
* 🔧 Include resources/llm in .dockerignore
* 📌 Update dependencies in pyproject.toml and uv.lock
* 🔧 Update Dockerfile and devcontainer.json to install additional PDF tooling
* ♻️ Refactor Makefile and docker-compose.yml for improved service configuration and flexibility
* 🚧 FIXME: Remove DecisionConv1dBinRegex model from pipeline configuration for dependencies update compatibility
* 🔧 Set weights_only=False for torch.load compatibility
* ✨ Enhance PDF extraction with marker integration and improved text processing
* 🔧 Update run_safe_text_extraction to allow indefinite timeout by default
* ✨ Add warm_marker_models function to initialize marker-pdf artifacts at startup
* 🔥 Remove unused environment variables and rename TRANSFORMERS_CACHE to HF_HOME
* 🔧 Improve service stopping logic for Ollama and API services in Makefile
* 🔖 Bump aymurai package version to 2.0.0-alpha.1
* 🔧 Update HF_HOME path and remove HF_DATASETS_CACHE variable in .env.common
* 🔧 Update OLLAMA_HOST for GPU-enabled services to point to ollama-gpu
* 🔧 Simplify marker model warming logic by removing error handling
* ♻️ Refactor text extraction into modular format-specific extractors
* ✅ Add unit tests for document extraction and error handling
* ➕ Add marker-pdf stack and drop textract
* 🔧 Enhance PDF extraction with caching mechanism
* 📝 Improve cache utility functions with enhanced docstrings and type hints
* 🔧 Enhance cache key generation in PdfExtractor for improved stability and performance
* 🔖 Update aymurai package version to 2.0.0a2.dev9
* Feature/remove usem tensorflow deps (#68)
* 🩹 Ensure consistent entity attributes in reformat_entity function and reorder imports
* 📝 Update subcategories exploration notebook
* ⚗️ Add TensorFlow deprecation experiment notebook
* ♻️ Refactor entity subcategorization: Remove USEMSubcategorizer, add SentenceTransformerSubcategorizer
- Removed the USEMSubcategorizer implementation from `usem.py`.
- Introduced new Jupyter notebooks for testing and evaluating the SentenceTransformerSubcategorizer.
- Updated the pipeline configuration to utilize SentenceTransformerSubcategorizer with local embeddings instead of remote URLs.
* ♻️ Refactor download function: Replace gdown with requests for improved file downloading
* 🔥 Remove empty peft model module
* ➖ Remove TensorFlow and gdown dependencies from pyproject.toml
* 📌 Update uv.lock
* ♻️ Refactor sentence encoder module: Remove unused dependencies and streamline factory functions
* 🔖 Update aymurai package version to 2.0.0a3.dev9
* WIP: feat(decision): ✨ integrate TinyEmbeddingBagClassifier for decision detection (#67)
* feat(decision): ✨ integrate TinyEmbeddingBagClassifier for decision detection
- Introduced a new model class `DecisionEmbeddingBagBinRegex` using `TinyEmbeddingBagClassifier`.
- Updated model loading and saving mechanisms to support safetensors format.
- Added a new training notebook for the embedding bag classifier.
- Modified the pipeline configuration to include the new model.
* ⚡️ Remove unidecode usage to avoid double normalization in model_input_from_text
* 📝 Add type hints and docstrings for clarity in DecisionEmbeddingBagBinRegex and TinyEmbeddingBagClassifier
* 🔧 Refactor import statements for safetensors to remove try-except block
* 🔥 Remove Conv1dTextClassifier, Tokenizer and SpanishTokenizer implementations
* 🐛 Fix gen_aymurai_entity call by removing unused category parameter
* 🔖 Update aymurai package version to 2.0.0a4.dev1
* 🔥 Remove TensorFlow environment variables
* Feature/mlfow integration (#66)
* feat: add mlflow-based experiments and services (wip)
* feat: finalize mlflow experiment runner and artifact logging
* feat: add OpenAI ChatGPT extension and update postStartCommand in devcontainer
* 📝 Unify disambiguation evaluation notebooks
* 📝 Enhance documentation and add type hints across multiple modules
* 📌 Update uv.lock
* 🔧 Update devcontainer GPU device configuration
* 🔧 Change default Python environment manager to venv
* 🔧 Add container names for all services in docker-compose.yml
* ➖ Remove commented optional dependencies for GPU support in pyproject.toml
* 🔧 Increase document request timeout from 30 to 300 seconds in .env.common
* 🚚 Changed environment variable names from DOCUMENT_API_BASE_URL and DOCUMENT_REQUEST_TIMEOUT to API_BASE_URL and REQUEST_TIMEOUT
* 🔧 Update dependency installation to include 'mlops' group in entrypoint.sh
* 🔖 Update aymurai package version to 2.0.0a5.dev8
* Feature/document extract config (#69)
* ✨ Enhance document extraction with caching and configuration options
* ✅ Update extractor tests to handle additional configuration parameters and improve error handling
* 🔧 Update marker model warmup to include configuration setup for improved initialization
* 🔖 Update aymurai package version to 2.0.0a6.dev3
* ⏪ Revert multiprocessing context change in run_safe_text_extraction
* 🔖 Update aymurai package version to 2.0.0a6.dev5
* 🔥 Remove unused multiprocessing import from document_extract.py
* 🔥 Remove unused logging import from extraction.py
* 🔧 Change default value of force_ocr to False in pdf_to_text function
* 📝 Update argument descriptions in pdf_to_text and plain_text_extractor functions to include default values
* 📝 Remove duplicate argument description for path in BaseExtractor.extract method
* Feature/pre disambiguation optimization (#70)
* New pre-disambiguation feature notebooks
* New pre-disambiguation feature notebooks and metrics.py per label feature added
* Conclusion added to pre-cluster investigation
* Set the ocr variable to True in utils.py
* Changes in grid search function to store the best pre-clustered entities in a particular directory
* New llm inference function in notebook 07
* New llm grid search inference function
* Add disambiguation endpoint and utility functions for entity grouping
* Remove unused models and tokenizers to streamline the codebase
* Fix type hints for processor functions to avoid runtime errors
* Endpoint /disambiguate with LLM Inference (#72)
* Changes in old 07 notebook adding the usage of the disambiguate endpoint and its own name
* New token counter to check that the LLM inference won't hallucinate
* New tokenizer function for token counting and processing specific documents
* Batch optimization feature in llm-inference function
* Mapping feature added to llm-inference function
* Updated the /disambiguate endpoint to return DocumentAnnotations similar to the NER predictions, now enriched with role and entity_id fields where applicable.
* New /disambiguatev2 endpoint that runs the LLM inference and returns the DocumentAnnotations list with the role and the canonical_entity_id where applicable. When a prediction is not mapped, the program generates a canonical_entity_id
* New updates on endpoint /disambiguatev2 and notebook 07
* Cleaned code in anonymizer.py and utils.py following Raúl comments
* New classes defined for LLM prompts to validate each set of prompts per label before the LLM inference
* Sorted canonical entities before LLM inference to reduce the chance that mentions of a single canonical entity are processed in separate batches
* Cleaned anonymizer.py script and experimental notebook 07 discarding the old pre-cluster endpoint.
* Cleaned anonymizer.py script and experimental notebook 07 discarding the old pre-cluster endpoint. New disambiguation.py script to store functions to pre-clusterize the canonical entities.
* Code cleaned following Juli's comments regarding the new /disambiguate endpoint
* Remove unused relations field from CanonicalEntity class for LLM inference phase
* Final changes to the code adding the entity_disambiguation.yaml to handle the prompts
* Add entity disambiguation utilities and enhance canonical entity processing
- Introduced new utility functions for entity disambiguation in `fuzzy.py`.
- Implemented `assign_label_instances` and `map_canonical_entities_ner_preds` in `core.py`.
- Added LLM inference capabilities in `llm.py` for refining canonical entities.
- Updated `entities.py` to include `aymurai_label_instance` for ordered label indexing.
* Refactor anonymizer and paragraph modules for improved entity disambiguation and serialization
* Remove unused logger import from paragraph module
* Reviewed code and added some features to 07 experiment notebook
* Implement label policies for disambiguation and anonymization; enhance entity processing and prediction mapping
* New datetime formatter function and changes in old code; works around a bug where my OS does not support setlocale
* New functionality added to get_canonical_dates for dates with the same day and month
* 🐛 Fix entity handling in anonymizer and datapublic routers when use_cache is disabled to improve label processing
* Remove commented-out code
* DatetimeFormatter used after NER predictions in postprocess so we only have to take the datetime from aymurai_label_subclass to build the canonical entities from dates
* Fix locale setting for date formatting to ensure correct month name handling
* Add docstring for get_canonical_dates function to clarify input and output
* Remove DIRECCION prompt templates
* Update notebook formatting, remove unused MODE param and improve code readability
* Update uv.lock
* Hotfix: resolve file pathing, logic indentation, and date disambiguation
- Update configuration path in llm.py from .yaml to .yml.
- Fix indentation in core.py for canonical_entity_id assignment. This ensures
all predictions receive an ID even if they lack a canonical match, bypassing
the 'aymurai_label_subclass == 0' filter which caused issues with date
formatting in NER post-processing.
- Add condition in anonymizer.py to trigger 'get_canonical_dates' only when
FECHA is present in 'fuzzy_labels'. This prevents unintended date
disambiguation when the policy is set to None.
* Feature/anonymize document refactor (#73)
* Add render policy support and refactor anonymization logic for improved token rendering
* 📝Update anonymization docs
* ♻️ Refactor: modularize document anonymization
* 📝 Rename notebook for document anonymization with render policy
* FECHA disambiguation bug fixed, label and render policies changed and whole code reviewed for PR
* ⏪ Revert entrypoint.sh to 1ac2776
* ⏪ Revert .dockerignore to 5af5814
* ⏪ Revert .env.common to 90f7369
* ⏪ Revert .vscode/launch.json to f366690
* ⏪ Revert Makefile to cb3df05
* ⏪ Revert aymurai/api.core.py to 19a9ca8
* 🦖 Changed aymurai/api/endpoints/routers/anonymizer/anonymizer.py for release/v1.5.0 compatibility
* 🔥 Removed aymurai/api/endpoints/routers/llm for release/v1.5.0 compatibility
* 🦖 Changed aymurai/api/endpoints/routers/misc/document_extract.py for release/v1.5.0 compatibility
🦖 Changed aymurai/text/extractors/pdf.py for release/v1.5.0 compatibility
🦖 Changed aymurai/text/extractors/utils.py for release/v1.5.0 compatibility
* ⏪ Revert aymurai/api/main.py to a801bf4
* 🔥 Removed aymurai/api/startup/marker.py for release/v1.5.0 compatibility
* 🔥 Removed aymurai/experiments/entity_disambiguation folder for release/v1.5.0 compatibility
* 🔥 Removed aymurai/llm_providers for release/v1.5.0 compatibility
* 🦖 Changed aymurai/settings.py for release/v1.5.0 compatibility
* 🦖 Changed aymurai/api/endpoints/routers/anonymizer/anonymizer.py for release/v1.5.0 compatibility
🦖 Changed aymurai/utils/entity_disambiguation/__init__.py for release/v1.5.0 compatibility
🔥 Removed aymurai/utils/entity_disambiguation/llm.py for release/v1.5.0 compatibility
* ⏪ Reverted docker-compose.yml to 5b9c220
* ⏪ Revert docker/api/Dockerfile to 4196117
* 🦖 Changed docs/anonymization/README.md for release/v1.5.0 compatibility
* 🔥 Removed docs/experiments/README.md for release/v1.5.0 compatibility
🔥 Removed docs/experiments/base.yaml for release/v1.5.0 compatibility
* 🔥 Removed notebooks/experiments/anonymization/05-langextract.ipynb for release/v1.5.0 compatibility
* 🔥 Removed all the notebooks from folder: notebooks/experiments/entity-disambiguation that had something related to LLM disambiguation for release/v1.5.0 compatibility
* 🔥 Removed notebooks/experiments/llm-providers for release/v1.5.0 compatibility
* 🔥 Removed notebooks/experiments/summarization for release/v1.5.0 compatibility
* 🦖 Changed pyproject.toml for release/v1.5.0 compatibility
* 🔥 Removed resources/llm for release/v1.5.0 compatibility
* 🔥 Removed summarization_app for release/v1.5.0 compatibility
* 🔥 Removed test/llm_providers for release/v1.5.0 compatibility
* 🐛 Bug fixed in pyproject.toml line 106 for .venv build up
* 🐛 Bug fixed in function '_normalize_text' from 'aymurai.text.extractors.utils' that was changed to 'normalize_text' because it's used in aymurai/text/extractors/docx.py
* ⏪ Revert elimination of folder aymurai/experiments/entity_disambiguation for experimental purposes. There was an error in deleting everything, files will be changed in next commit.
* 🔥 Removed aymurai/experiments/entity_disambiguation for release/v1.5.0 compatibility
* 🐛 Bug fixed in experiments/entity-disambiguation/10-anonymize-document-render-policy.ipynb for release/v1.5.0 compatibility
* 🔥 Removed TESSDATA_PREFIX from .env.common
* 🙈 Update .gitignore to include notebooks directory while excluding subdirectories and non-IPYNB files
* 🔀 Synthesize docker-compose from 26033a8f/00709164 after b05b768 rollback
* 🔀 Synthesize Makefile from afbfda9/d80f74b/26033a8f after f645881 rollback
* 🔧 Fix repository URL case sensitivity in pyproject.toml and remove unused dependencies
* 🔥 Remove tasks.json configuration for Ollama service
* 🔥 Remove scraper and documentation
* 🔥 Remove experiment module
* 🔥 Remove path utility functions from paths.py
* 🔥 Remove unused PromptSet and PromptLibrary classes, and simplify disambiguation options in LabelPolicy
* 🔥 Remove EntityRelation class and its associated methods from entities.py
* 📝 Enhance documentation with detailed docstrings for various functions across multiple modules
* 🔥 Removed PromptLibrary class from aymurai/api/endpoints/routers/anonymizer/anonymizer.py for release/v1.5.0 compatibility
🔥 Removed `llm` disambiguation label policy for release/v1.5.0 compatibility
* 🎨 Changed map_canonical_entities_ner_preds function in aymurai/utils/entity_disambiguation/core.py discarding the role assignment for release/v1.5.0 compatibility
🎨 Changed aymurai/api/endpoints/routers/anonymizer/anonymizer.py discarding all the validations that had to do with LLM disambiguation for release/v1.5.0 compatibility
🎨 Minor changes in the rest of documents regarding to experimentation with the release/v1.5.0 API
* 🔀 Synthesize document_extract from d349c69 after 3c55d8e: remove extractor config passthrough and restore fixed timeout
* 🔀 Synthesize PDF extraction flow from d349c69/26033a8: remove cache/debug path
* 🔥 Remove text extraction tests
* 📝 Update description formatting for aymurai_disambiguation field in EntityAttributes
* 🦖 Update PdfExtractor.extract method to include ignored keyword arguments for backward compatibility
* 🔥 Remove unused static logo file from API resources
* 🔧 Add version_scheme configuration to setuptools_scm in pyproject.toml
* 📌 Update uv.lock
* 📝 Reorganize and update v1.5.0 documentation (EN/ES)
* 🚚 Rename full-paragraph pipeline to datapublic across code and docs
* ci(tests): add API + pipeline integration tests on linux and windows (#74)
* Merge dev into main for v1.1.12 (#57)
* Update README.md
* 🐛 bugfix: Fix XML special character escaping in DocAnonymizer
* ➕ build(deps): Add python-docx package
* ✨ feat: Add watermark and hyperlink functionality to document anonymization
* ✨ feat: Install Archivo font in Dockerfile
* 🎨 refactor: Improve Dockerfile structure and comments for clarity
* ⏪ revert: Remove Archivo font installation from Dockerfile
* 🔖 feat: Update aymurai package version to 1.1.11 in uv.lock
* 🐛 Improve get_extension logic to fix document extraction issues on Windows and remove python-magic dependency
* 🔧 Update Dockerfile to use 'bullseye' variant for Python images for improved compatibility
* 🔧 Update Makefile targets for improved Docker workflow
* 🔖 feat: Bump aymurai package version to 1.1.12
* ♻️ Harden get_extension with header scans and zip safeguards
* 🔧 Extend document extraction timeout to 30s
* 🔧 Refactor Docker workflow to build and push images using docker/build-push-action
* 🔧 Fix workflow step order to correctly extract tag name before building Docker images
* 🔧 Remove tag extraction step and use github.ref_name directly for Docker image builds
* ⏪ Revert Docker workflow to extract tag name and use it for image versioning
* Update .github/workflows/build-docker-image.yml
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* ✏️ Remove incomplete comment
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
---------
Co-authored-by: jed <jedzill4@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* WIP: feat(decision): ✨ integrate TinyEmbeddingBagClassifier for decision detection (#67)
* feat(decision): ✨ integrate TinyEmbeddingBagClassifier for decision detection
- Introduced a new model class `DecisionEmbeddingBagBinRegex` using `TinyEmbeddingBagClassifier`.
- Updated model loading and saving mechanisms to support safetensors format.
- Added a new training notebook for the embedding bag classifier.
- Modified the pipeline configuration to include the new model.
* ⚡️ Remove unidecode usage to avoid double normalization in model_input_from_text
* 📝 Add type hints and docstrings for clarity in DecisionEmbeddingBagBinRegex and TinyEmbeddingBagClassifier
* 🔧 Refactor import statements for safetensors to remove try-except block
* 🔥 Remove Conv1dTextClassifier, Tokenizer and SpanishTokenizer implementations
* 🐛 Fix gen_aymurai_entity call by removing unused category parameter
* 🔖 Update aymurai package version to 2.0.0a4.dev1
* 🔀 cherry-pick(decision): modernize decision model and upgrade ML dependencies
Cherry-pick TinyEmbeddingBagClassifier (safetensors) replacing Conv1d model.
Remove dead deps (torchtext, pytorch-lightning), upgrade torch to 2.x and flair to 0.15.1.
* 🐛 cherry-pick(fix): datapublic and anonymizer crash when use_cache is disabled
* test(infra): rewrite test infrastructure with architecture guide standards
- Delete old test files (test_document_extract.py, test_anonymizer_predict.py, test_datapublic_predict.py)
- Create new directory structure: tests/integration/pipelines/, tests/api/routers/{anonymizer,datapublic,misc}/
- Rewrite tests/conftest.py:
- Set env vars at module level (RESOURCES_BASEPATH=resources, SQLALCHEMY_DATABASE_URI=sqlite:///:memory:)
- Remove torch mock and lazy loader
- Direct imports from production code
- Clean fixtures: db_engine (session-scoped), db_session (function-scoped), client (with dependency override)
- Test data builders: build_data_item(), build_label(), build_anonymization_paragraph(), build_datapublic_paragraph()
- Update pyproject.toml with [tool.pytest.ini_options]: strict-markers, integration/slow markers
Verification: uv run python -c 'import tests.conftest' succeeds, pytest collection clean
* test(conftest): add pipeline loading helpers and mock factories for API tests
Wave 2 complete: integration pipeline conftest + API router conftest
Integration pipeline conftest:
- PIPELINE_CONFIGS dict for flair-anonymizer and full-paragraph
- load_test_pipeline() helper with print_config=False
- Session-scoped fixtures for both pipelines (expensive model loading)
- build_pipeline_input() test data builder
- sample_text fixture with Spanish legal text
API router conftest:
- build_mock_pipeline() factory with MagicMock
- Mock preprocess/predict_single/postprocess methods
- build_processed_data_item() test data builder
- Re-exports builders from root conftest
* test(api): add document extract endpoint tests with mocked extraction
* test(api): add anonymizer and datapublic endpoint tests with mocked pipelines
* test(integration): add pipeline integration tests for flair-anonymizer and full-paragraph
* ✅ test: refactor test infrastructure and add integration tests
- Reorganize test conftest files to proper hierarchy (tests/api/conftest.py)
- Add pytest to dependency groups in pyproject.toml
- Refactor API router tests to use centralized fixtures and builders
- Add real document extraction tests with DOCX/PDF generators
- Improve pipeline integration tests with fixture-based stages
- Fix label serialization to use model_dump(mode="json")
- Update UUID generation for datapublic tests to use uuid.uuid5
- Add cache path environment setup for integration tests
- Clean up imports and remove unused dependencies
- Remove empty test file (document_extract.py)
This refactoring improves test maintainability, adds proper integration
testing without excessive mocking, and establishes consistent test utilities
across the codebase.
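The switch to `uuid.uuid5` noted above can be illustrated with a short sketch. The namespace and the name format below are assumptions chosen for the example, not the values used in the test suite; the point is only that `uuid5` derives the same ID from the same inputs on every run.

```python
import uuid

# Hypothetical namespace and name format, chosen only for this sketch.
TEST_NAMESPACE = uuid.NAMESPACE_URL


def stable_id(document: str, paragraph_index: int) -> uuid.UUID:
    """Derive a reproducible UUID (name-based, SHA-1) from the inputs."""
    return uuid.uuid5(TEST_NAMESPACE, f"{document}#{paragraph_index}")


first = stable_id("document-01.docx", 0)
second = stable_id("document-01.docx", 0)
other = stable_id("document-01.docx", 1)
```

Unlike `uuid.uuid4`, which is random, this lets test fixtures and assertions agree on IDs without sharing state.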
* 👷 ci(github): add pytest workflow for CI integration
- Introduced a new GitHub Actions workflow for running pytest.
- Configured to trigger on pull requests and manual dispatch.
- Supports multiple OS and Python versions for comprehensive testing.
* 👷fix(tests): fix env variable DISKCACHE_ROOT
* 👷 ci(github): remove deprecated PR tests workflow & fix env variable
- Deleted the old PR tests workflow file.
- This cleanup helps streamline CI processes and reduces redundancy.
* ci(github): 👷 add pipeline download and integration tests to CI workflow
- Introduced a new script for downloading pipelines.
- Updated the pytest workflow to include running API and pipeline tests.
- Enhanced test execution with improved output formatting and failure limits.
* fix(tests): 🐛 avoid context manager in TestClient to skip app startup
- Changed TestClient usage to prevent app lifespan startup during tests.
- Ensured proper cleanup by closing the client after use.
- This improves test performance and reliability.
* 👷 ci(github): add RESOURCES_BASEPATH environment variable for pipeline tests
- Added RESOURCES_BASEPATH to the environment variables for both downloading pipelines data and running pipeline tests.
- This change ensures that the necessary resource paths are correctly set during the CI workflow execution.
* 👷 ci(github): update RESOURCES_BASEPATH for pipeline data download
- Changed RESOURCES_BASEPATH from /tmp to resources in the pipeline download step.
- Ensures the correct path is used for resource access during tests.
* chore(pyproject): 🔧 add environment markers for platform compatibility
- Introduced required-environments for tool.uv to specify platform requirements.
- Updated resolution-markers and required-markers in uv.lock for better dependency management.
- Added tensorflow-io-gcs-filesystem with specific markers for Windows and Linux.
* ci(github): 👷 configure es_AR locale for Ubuntu runners
- Added steps to configure the es_AR locale on Ubuntu.
- Ensures proper locale settings for tests running in the CI environment.
* 👷 ci(github): add AYMURAI_CACHE_BASEPATH environment variable for pipeline tests
- Introduced AYMURAI_CACHE_BASEPATH to the environment variables for both pipeline download and pipeline tests.
- This change ensures that the correct cache path is utilized during the execution of the tests.
* 🐛 fix(dependencies): adjust textract dependency for platform compatibility
- Added conditional dependency for textract based on the operating system.
- Specified different sources for textract depending on whether the platform is Windows or not.
* 🔥 chore(opencode): remove opencode.json configuration file
- Deleted the opencode.json file as it is no longer needed.
- This change helps to clean up the repository and remove obsolete configurations.
* 🚚 Update pipeline path for datapublic in scripts, notebooks and tests
* 📝 docs: replace Black references with Ruff in CONTRIBUTING and Alembic hook examples
* 🔧 Add backslash to default CACHE_BASEPATH value
* 🔧 Update cache path retrieval to use settings for consistency
* ➖ Remove textract dependencies and update documentation for extract_document function
* ✅ Update integration tests and add new test cases for anonymizer and datapublic flows
* 🔥 chore(test): remove legacy /test dir and standardize sample doc path to /resources/data/sample/document-01.docx
* 🔧 Update UV_VERSION to latest in devcontainer Dockerfile
* 🔧 Update dependency installation command to include all groups
* 📌 Update uv.lock
* 🐛 Fix CACHE_BASEPATH env alias resolution for CI pipeline downloads
* Feature/pdf layout anonymization (#76)
* ✨ feat(extractors): use pymupdf layout for pdf text extraction
* ✨ feat(normalization): enhance document normalization to preserve paragraph structure
* 📝 docs: document default values for extractor and normalization helpers
* 🩹 fix(extractors): use pymupdf4llm.to_text with page_chunks for pdf paragraphs
* ♻️ Add DOCX and PDF anonymizer modules
- Implemented DocxAnonymizer class to handle anonymization of DOCX documents by replacing sensitive data with label tokens. This includes functionality for unzipping documents, parsing XML, editing content, and adding watermarks.
- Developed PdfAnonymizer class for anonymizing PDF documents, utilizing pymupdf for document manipulation. This includes layout parsing, font caching, redaction operations, and watermarking.
* 🔧 Enhance PDF and DOCX handling in anonymization process
* 📝 Update backend module references for document rendering in README
* ✅ Update tests to use DOCX format for document anonymization and enhance mock behavior
* ✨ Add end-to-end PDF anonymization notebook with PyMuPDF and AymurAI API
* ♻️ Rework PDF anonymization for precise spans and widget handling
* 🔧 Update model_dump calls to exclude None values for improved data handling
* 📝 Add docstrings to label replacement functions
* ♻️ Refactor watermark handling and optimize PDF token aliasing
* ✅ Add integration tests for merging fragmented numeric labels and excluding null alt attributes in PDF anonymization
* ➖ Remove opencv-python-headless dependency from project requirements
* ♻️ Implement paragraph splitting function to enhance document text extraction
* 🔧 Update dependency installation command to prevent Python downloads
* 🔥 Remove redundant tests for merging fragmented numeric labels and PDF anonymization
* ♻️ Refactor anonymizer tests to use DOCX format and enhance mock functionality
* 🔧 Add xfail marker for PDF extraction test on Windows due to tensor type issue
* ✨ Enhance PDF anonymization by adding cleanup rects, removing overlapping links, and scrubbing metadata
* 🔧 Remove redundant return statement in _label_replacement_text function
* ♻️ Refactor anonymization module: split pdf and docx internals by format
* ✅ Add integration tests for PDF and DOCX anonymizers, including metadata scrubbing and link preservation
* ✨ Add watermark layout adjustments to avoid footer content overlap in PDF anonymization
* ✅ Add integration test to ensure watermark is positioned away from footer content in PDF anonymization
* 🩹 Fix: read docx xml as utf-8 across platforms
* ✅ Add Windows-specific xfail marker for PDF tests and implement UTF-8 XML reading test
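A rough sketch of the idea behind the UTF-8 XML fix, assuming the XML part is read as bytes from the DOCX archive and decoded explicitly. `read_docx_xml` is a hypothetical helper, not the project's function. Text-mode reads without an explicit encoding fall back to the platform's preferred encoding, which is often not UTF-8 on Windows.

```python
import io
import zipfile


def read_docx_xml(source, member="word/document.xml"):
    """Return one XML part of a DOCX archive, decoded explicitly as UTF-8."""
    with zipfile.ZipFile(source) as archive:
        return archive.read(member).decode("utf-8")


# Build a tiny in-memory archive to exercise the helper (not a valid DOCX).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", "<w:t>señal</w:t>".encode("utf-8"))
buffer.seek(0)

xml_text = read_docx_xml(buffer)
```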
* 🐛 Remove unnecessary --extra runtime flag from uv sync command
* 🐛 Date formatter bug fixed for canonical entities generation.
* 🐛 Fix duplicate DocLabel handling in anonymization and serialization processes
* ✅ Add tests to deduplicate duplicate labels in cached predictions and disambiguation processes
* 🐛 Fix handling of non-alphanumeric entities by returning None for empty cleaned text (#81)
* 🩹 Fix default timeout value in run_safe_text_extraction function from 30 to 300 seconds
* 🚸 Update PDF_TOKEN_ALIAS_MAP with clearer aliases
* Fix/pdf signature anonymization (#82)
* 🧪 test(pdf): cover signature anonymization regressions
* 🐛 fix(pdf): preserve signature appearance when redacting signer names
* ✅ test(pdf): add focused signature geometry tests
* ♻️ refactor(pdf): rename distance function for clarity and update references
* 📝 docs(pdf): clarify signature widget flattening process in preparation function
* ✅ test(pdf): cover signature review edge cases
* 🐛 Bug fix for exact entities. (#80)
* 🐛 Fixed a bug so that entities whose value is always identical bypass the fuzzy matching algorithm.
* ⚡️ Improved structure following Copilot's comments.
* ⚗️ Experimentation.
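The exact-match bypass from #80 can be sketched roughly as follows, based on the reviewer's guide for that PR: values are flattened to ASCII letters and digits and grouped by that key instead of being scored for similarity. The function names and the `DNI` label used in the sample data are hypothetical.

```python
import re


def exact_alias(text: str) -> str:
    """Flatten a value to ASCII letters and digits for exact comparison."""
    return re.sub(r"[^a-zA-Z0-9]", "", text)


def cluster_exact(items: list[dict]) -> list[list[dict]]:
    """Group items whose flattened value is identical; no similarity scoring."""
    groups: dict[str, list[dict]] = {}
    for item in items:
        groups.setdefault(exact_alias(item["text"]), []).append(item)
    return list(groups.values())


# "DNI" is a hypothetical exact-match label used only for illustration.
items = [
    {"text": "12.345.678", "aymurai_label": "DNI"},
    {"text": "12345678", "aymurai_label": "DNI"},
    {"text": "12.345.679", "aymurai_label": "DNI"},
]
clusters = cluster_exact(items)
```

Here the first two values share a cluster because they differ only in punctuation, while the third stays separate even though a fuzzy matcher would likely score it as nearly identical.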
* 🐛 Merge duplicate labels for the same span and AymurAI label in _dedupe_doclabels function
* ✅ Add integration test for merging cached duplicate labels for the same span
---------
Co-authored-by: jansaldo <julianansaldo@gmail.com>
* Feature/frontend integration (#83)
* Merge dev into main for v1.1.12 (#57)
* Update README.md
* 🐛 bugfix: Fix XML special character escaping in DocAnonymizer
* ➕ build(deps): Add python-docx package
* ✨ feat: Add watermark and hyperlink functionality to document anonymization
* ✨ feat: Install Archivo font in Dockerfile
* 🎨 refactor: Improve Dockerfile structure and comments for clarity
* ⏪ revert: Remove Archivo font installation from Dockerfile
* 🔖 feat: Update aymurai package version to 1.1.11 in uv.lock
* 🐛 Improve get_extension logic to fix document extraction issues on Windows and remove python-magic dependency
* 🔧 Update Dockerfile to use 'bullseye' variant for Python images for improved compatibility
* 🔧 Update Makefile targets for improved Docker workflow
* 🔖 feat: Bump aymurai package version to 1.1.12
* ♻️ Harden get_extension with header scans and zip safeguards
* 🔧 Extend document extraction timeout to 30s
* 🔧 Refactor Docker workflow to build and push images using docker/build-push-action
* 🔧 Fix workflow step order to correctly extract tag name before building Docker images
* 🔧 Remove tag extraction step and use github.ref_name directly for Docker image builds
* ⏪ Revert Docker workflow to extract tag name and use it for image versioning
* Update .github/workflows/build-docker-image.yml
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* ✏️ Remove incomplete comment
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
---------
Co-authored-by: jed <jedzill4@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Squashed 'frontend/' content from commit 9123e6f
git-subtree-dir: frontend
git-subtree-split: 9123e6ff047ddc6da0528d1de827a4af68752d0f
* Squashed 'frontend/' changes from 9123e6ff..8add5c45
8add5c45 1.25.0
d7424d94 use base_url when dealing with public assets
ae1a47b8 Merge pull request #44 from AymurAI/feat/redesign
9cfc610c fix issue regarding annotation keyboard navigation
a723be01 restore useNotify feature in process page
b888214a make how it works modal bigger
49c3bc22 fix electron ts issues
eb791d07 add missing anonymize value on label attrs
151e2710 make OS taskbar API safe to call in web
8dc38fbb add knip
a0f6c687 fix ts clone element issue
d975e3fd include label policies and annotations when anonymizing document
70c44a04 more fixes in height to validate dataset
a36a7401 make suggestion clickable on uncontrolled text input
ea61e4e2 add fixed height to validation dataset page
072d751e remove axios' default api url
6f8a478c adjust typings
528c646f add more canonical id operations on reducers
ae09277c add update by canonical id function
de2b641c add random canonical id on search add event
74d4dac2 add value as undefined on ui/input component
6aa895f4 simplify and remove unused code on file annotator component
d10fe8c7 add useShallow to local storage default getters
9f62ac69 adjust predicting and file parser flows
01aa8a79 fix TS issues
5655eac5 smaller fixes on file annotation components
a7e91cec add remove dialog to label manager's entity tab
cad8faaa simplify annotation components
52230ee4 add metadata to search and tag annotations
0f08b9c8 add more anonymizer copy locale texts
6c3112ac store label manager config in local storage
63c4d731 move suggestion label and add mark props
77017543 adjust spacing on dataset validation
cae3b907 disambiguate hook
6b21903e move createAnnotationData, tag and suffix to context
d1649e5d add max width to toast and remove instead of dismissing it
ca56bb63 add size variant to suggestion and adjust display
55a67bf8 fix select viewport and add scroll
aa2ce6a1 adjust styles on callout
5493cdee finish mark and tag annotation feature
4fb28120 finish major pages and add file protection feature
af457c85 improve accessibility on host page
c2c21c21 add className to SectionTitle
6edd7101 update tanstack react query version
df491c17 add odt to pdf conversion
efb04bc5 deprecate usage of useSchemedQuery and useSchemedQueries
046a73e2 deprecate usage of useSchemedMutation
65ceb195 swap replace button icons
12348edd update button icons
2719755f extract tagger popover logic
28ce40ab create search tagger
e54adfbe add small version of select component
d6c3bd77 improve ui dialog component
63c6bf09 initial mark component update
cd0f7d30 add radix's select as dependency
92ceaace remove old tooltip implementation
2499e4c8 add input size variants
be48dac9 add custom icons to callout, toast and showToast
32d3441b add retro compatibility features to select
657e2a5a minor styling fixes + new select imports
7a323cd2 kill more unused components
c5d7b4ba create better select component
d855a93b adjust styling and positioning on preview page
1cfba9c4 adjust callout styling
c28fdcd0 fix title in process copy text
3a9a5cc6 remove old component implementations
03954d74 create callout from toast, and then apply a11y to toast
93c9457d create toast component
642e760d improve suggestion component
707e844b update suggestion mark component's styles
7c1db956 add checked variant to button
204246c7 fix gaps on finish page
9e81ee12 hide file stepper selector arrows depending on cursor
da1377ea update file processing component
12aa1c6b update decision tabs to panda
490a1adb simplify finish dataset and anonymizer
ba12dad2 add className prop to footer
4ed68454 drop file queries when resetting the progress
762cf823 add missing built by in features page
2c9d6cf8 make dataset validation file annotator not annotable
59416b8f clear files on features menu
9585726b use translation on droparea
091b2c58 make search bar static and follow scroll
0978aff6 add label manager to file annotator
5b069d68 connect rest of label manager
23fa8407 improve layout header component
3c88d2f3 finish preview page
d31ec034 finish onboarding page
7d1a71f3 add disambiguate and predict react query options
4e91d285 feat: add feature icon record
bfc9baf7 feat: create initial label manager
0bc48bc9 chore: refactor Searchbar
962961e4 chore: remove stepper component
95e17235 fix: title in anonymizer locale
df53488d chore: minor changes
4a9a2557 feat: create switch component
d4830e48 feat: create label manager component
baa9835c feat: add more copies for the finish page
9d915cab chore: kill unused old hidden input component
2fd61a7f fix: add missing feature param call on route.tsx
617ea974 feat: use a11y on finish page
172b731a Merge branch 'feat/a11y-dataset-header' into feat/redesign
394121d9 feat: add more copies to locale file
c34951e8 fix: redo home layout
31078ff5 chore: cleanup
cf9def05 chore: restructure HOC to be a regular component
088bed82 feat: add api base url protection and apply it
cd15a776 fix(layout): address PR review on header and icon changes
59dc125e feat: add i18n support for the whole app
36d70bbd build: add i18n
244027be refactor(layout): hoist Topbar and Stepper to global wizard route
146905fb feat: improve topbar accessibility with semantic icons and aria labels
5f910470 chore: ignore personal analysis folder
b6569f77 chore: rework layout components
331ff8dc fix: update enum import
ee17865e feat(ui): create and/or adjust components
6c9daf38 feat: rework onboarding page
347054d5 chore: simplify main app layout
cef7f882 chore: adjust button sizes and enum import
dc59ad2a feat: make card clickable
051b6bd4 chore: add tutorial seen to local storage store
1e15bb74 fix: typo on anonymizer label
24f6de07 build: add web or electron run modes
beb63bc3 chore: migrate hidden input
23442095 chore: use constants and base card element on feature selector
2eba84b3 chore: refactor header so we can correctly position all elements
0a50bb14 fix: adjust stepper styles (sizing and colors)
504840d9 chore: export constants
3b79f391 build: update react and add radix dependencies
f6f7fac8 fix: remove fadeIn scaling animation
41a158c3 chore: create modern ui components
5a02a5df chore: flag card as deprecated
edb85211 feat: create modern tooltip component
a2c2040a fix: replace brand images with correct ones and set proper heights
edce7295 feat(components): create brand, layout and ui components
5475a617 feat: add more brand images
9cd982f8 chore(styles): add animation semantic tokens
ce991130 fix: extra character in home layout and rename the component
5e41df1a feat: redesign home
0947627c feat: create link card tool for features
0833a987 feat: create components to render in home screen
a6198d99 fix: add fixed height to button and auto adjust icon size
ab0a4e20 fix: add lineheight to text styles and adjust font weight
047b115a chore: replace custom use mutation hook with base on connect to host hook
59363cd7 chore: change to named export on local store
3a289e1e chore: add changes to router file
2393d18e chore: fix some tokens in panda and move stitches global styles
64048ac7 chore: re-implement button and partially input
51fe0990 feat: add loading screen on boot, timer of 1.5s
60cf7fde chore: configure view transition for all pages
ebcf3a66 feat: add loading page and updated branding images
67603f66 chore: flag stitches as deprecated
e416ace8 build: install and configure pandacss
c2e5eac1 build: add support for environmental variables for both web and electron apps
git-subtree-dir: frontend
git-subtree-split: 8add5c452478cdbe6a99ad1b05183cd264183c72
* ✨ Add frontend routing and settings for frontend distribution directory
* Squashed 'frontend/' changes from 8add5c45..ff882164
ff882164 chore: add .npmrc to configure public hoist pattern for @types
32bfab0a Merge pull request #59 from AymurAI/fix/53-restore-home-button
94e3816d Merge pull request #65 from AymurAI/fix/add-placeholder-to-select-entities
3545775f Merge pull request #66 from AymurAI/fix/remove-doc-extension
d24c458b add a "config" button in features menu
f427da73 make hover effect in button work for anchor tanstack link wrapper
77077905 remove slot checks on header
82d2c65a add home button to header on all flow's pages
5663da42 make aymurai's logo a link in the header
c3c0e8a7 create home button component
51cb537f fix: prevent select caret rotation from leaking ancestor data-state
5d85973f feat: add tooltips to tagger label and suffix inputs
7e0b67b8 Fix text overflow in HowItWorksModal (#58)
3d516636 Restore delete-one and delete-all hover actions on annotations (#57)
fa7e0967 create link component
c8112e20 add "Entidad" placeholder to tagger select
70710e11 change NINO to NIÑO
6664e631 remove copies and functions referencing .doc files
4a00039b copy change
422affaf Merge pull request #64 from AymurAI/fix/browser-resources-exhaustion
ce8a474a prevent semaphores underflow
dcf3b6b7 Merge pull request #62 from AymurAI/fix/conversion-endpoint-usage
7729801f add error handling to finish file conversion
2afe00ee Merge pull request #63 from AymurAI/fix/copy-changes
387d6724 Merge pull request #61 from AymurAI/fix/responsiveness
84843877 limit concurrent predict requests to avoid connection exhaustion
8d3e2693 fix: increase spacing between home and features menu buttons
3025e341 fix: use House icon in header instead of BackButton arrow
c9bba1a9 feat: add back-to-home button on features page
a0b8f55f use extension to check if file conversion is needed
0b83d0b0 create pdf to odt service
78b1a9c9 responsiveness for screens less than 1280px in width
fef9a94e copy change on label manager tab
b9c6db53 copy changes on label manager config tab
git-subtree-dir: frontend
git-subtree-split: ff882164be8077dee58b6748886b0d7d3acbe376
* 🔧 Remove commented-out router for anonymizer database
* ✨ Add Node.js and npm installation for frontend build in Dockerfile
* 📝 Update API documentation URLs to include '/api' prefix
* ✨ Add frontend build commands to Makefile
* 🙈 Update .dockerignore and .gitignore to include frontend build output directories
* ✅ Update API routes to include '/api' prefix in tests and add frontend integration tests
* ♻️ Refactor routing and API integration to remove '/app' prefix and streamline feature routes
* Squashed 'frontend/' changes from ff882164..d3e14b5e
d3e14b5e feat(validation): persist and restore predictions via backend validation endpoint (#68)
6b8a23ba Add drag-and-drop reordering and inline rename to label manager (#60)
git-subtree-dir: frontend
git-subtree-split: d3e14b5e00af41fded1c113e51e2e8b73bbf1b22
* refactor: update feature routing, migrate to pnpm, and refine dev environment configuration
* Squashed 'frontend/' changes from d3e14b5e..879309c8
879309c8 Feat/entity manager mention feedback (#81)
4d2de106 Fix/responsive home layout (#80)
986e68d2 Fix/homogenize file check ui (#77)
046f8ab9 fix(file-annotator): fix upward autoscroll on search previous navigation (#76)
2ecf75dc feat(dependencies): add dnd-kit packages for drag-and-drop functionality
git-subtree-dir: frontend
git-subtree-split: 879309c841d8072babc4d06f1686d11cf8cbd03f
* Squashed 'frontend/' changes from 879309c8..a37adc20
a37adc20 fix(useLocal): stop persisting groupOrder and remove dead categoryAssignments (#78) (#87)
47fb1fb3 fix(disambiguate): match response items by text instead of array index (#78) (#86)
c873c8e0 Move pnpm configuration from `package.json` to `pnpm-workspace.yml` (#83)
f4ce881b fix(useFileParse): use position-based paragraph ID to avoid key collisions (#85)
181e0356 Fix/invalid entity offsets (#82)
git-subtree-dir: frontend
git-subtree-split: a37adc20f579276b3a0e5979424ba7809fb7e2ff
* chore: migrate frontend build process from npm to pnpm in API Dockerfile
* 🐛 fix: add support for numpy integer and floating types in EnhancedJSONEncoder
* fix: update Stack component to use height instead of minHeight for consistent layout
* fix: update imports for Label and Text components in UncontrolledInput to avoid circular dependency
* chore: regenerate routeTree.gen.ts after removing $feature parent layout route
* feat: add default anonymization policies to settings
* chore: bump frontend version to 1.5.0
* fix(api): preserve pipeline cache for configured ttl
* refactor: remove torch dependency and configure threads via settings
* fix(frontend): replace previous anonymizer file on load
* fix(frontend): support dataset export in web mode
* fix(tests): add SQLALCHEMY_DATABASE_URI environment variable for api tests
* fix(api): improve error logging during startup
---------
Co-authored-by: jed <jedzill4@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: dmazzini <dmazzini@gmail.com>
* 🔥 Remove TensorFlow related environment variables in Dockerfile
* 📝 Update documentation for AymurAI v1.5.0
---------
Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Paolo Donizetti <padonizetti@gmail.com>
Co-authored-by: Sofi <sofiamorenadelpozo@gmail.com>
Co-authored-by: jed <jedzill4@users.noreply.github.com>
Co-authored-by: Lio <lionel.chamorro85@gmail.com>
Co-authored-by: conrabeatriz <conrabeatriz@gmail.com>
Co-authored-by: dmazzini <dmazzini@gmail.com>
* Update README.md
* Update README.md
* Update README.md
* Update README.md
* 🐛 bugfix: Fix XML special character escaping in DocAnonymizer
* ➕ build(deps): Add python-docx package
* ✨ feat: Add watermark and hyperlink functionality to document anonymization
* ✨ feat: Install Archivo font in Dockerfile
* 🎨 refactor: Improve Dockerfile structure and comments for clarity
* ⏪ revert: Remove Archivo font installation from Dockerfile
* 🔖 feat: Update aymurai package version to 1.1.11 in uv.lock
* 🐛 Improve get_extension logic to fix document extraction issues on Windows and remove python-magic dependency
* 🔧 Update Dockerfile to use 'bullseye' variant for Python images for improved compatibility
* 🔧 Update Makefile targets for improved Docker workflow
* 🔖 feat: Bump aymurai package version to 1.1.12
* ♻️ Harden get_extension with header scans and zip safeguards
* 🔧 Extend document extraction timeout to 30s
* 🔧 Refactor Docker workflow to build and push images using docker/build-push-action
* 🔧 Fix workflow step order to correctly extract tag name before building Docker images
* 🔧 Remove tag extraction step and use github.ref_name directly for Docker image builds
* ⏪ Revert Docker workflow to extract tag name and use it for image versioning
* Update .github/workflows/build-docker-image.yml
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* ✏️ Remove incomplete comment
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Release/v1.5.0 (#75)
* ➕ build(deps): Add langextract for text entity extraction
* 🚧 wip: Add langextract entity extraction experiment notebook
* ✨ feat: Enhance entity models with relation handling and canonical representation
* ✨ feat: Add JSON serialization support and enhance utility functions
* ⬆️ Upgrade ML dependencies and refresh uv.lock
* 🚧 wip: Update extraction examples in langextract notebook
* 📝 Add entity disambiguation notebook for canonical entity extraction
* ⬆️ Update dependencies: langextract to 1.1.0 and ollama to 0.6.1; add openai extra for langextract
* 📝 Integrate custom OpenAI model for extraction and remove failing empty example
* 📝 Update error message format in json_serial function for better readability
Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com>
* ♻️ Inline immediate return in get_pretty
Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com>
* 🐛 Fix: Use json_serial in save_json
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* 🎨 Format json.dumps call in save_json for improved readability
* Feature/ollama service (#59)
* ✨ Add GPU-enabled Ollama service to compose stack
* 🔧 Add Make targets for managing Ollama service and models
* 🔧 Add launch configuration and task for starting Ollama service
* Feature/llm providers (#60)
* ✨ Add GPU-enabled Ollama service to compose stack
* 🔧 Add Make targets for managing Ollama service and models
* 🔧 Add launch configuration and task for starting Ollama service
* ✨ Implement LLM providers module with Ollama adapter and shared abstractions
* ✅ Add unit tests for LLM providers including DummyProvider and OllamaLLMProvider
* 📝 Document Ollama provider usage via notebook demo
* 🐛 Fix tokenizer encoding by removing unnecessary special tokens flag
* ♻️ Refactor chunk handling in LLMProvider to use _append_chunk method for consistency and improved readability
* ✨ Enhance Ollama provider docs and DRY response building for sync/async calls
* ♻️ Refactor OllamaLLMProvider to reuse AsyncClient instance for improved efficiency
* 📝 Add async examples to OllamaLLMProvider notebook
* ✅ Add async coverage for OllamaLLMProvider and tighten chunking tests
* ♻️ Refactor OllamaLLMProvider to remove async client caching and streamline client instantiation
* Feature/disambiguation metric v2 (#62)
* Update .gitignore to exclude entity disambiguation experiment directories and modify Jupyter notebook execution counts and output handling
* Refactor Makefile for improved service management and update .gitignore to exclude specific experiment directories. Add new Jupyter notebooks for entity disambiguation metrics and documentation.
* Adjust example data for consistency in entity representation.
* Refactor entity disambiguation notebooks to standardize attribute naming and improve metric evaluation. Update role attribute from 'rol' to 'role' for consistency across examples and documentation. Adjust evaluation function to return both score and metrics.
* Add evaluation metrics for entity disambiguation
- Introduced new metrics module for evaluating entity disambiguation performance, including functions for alias normalization, Jaccard similarity, and greedy matching.
- Implemented main evaluation function to compute scores and metrics from gold and predicted entities.
- Added Jupyter notebooks for practical examples and evaluation results, including normalized and non-normalized text evaluations.
- Updated documentation to reflect changes in function signatures and outputs.
* 🔧 Expand Makefile: add API management targets (api-run, api-stop, api-logs, api-full-run) for smoother service control
* ♻️ Refactor metrics.py: clarify docstrings, align type hints, and polish logging
* ✏️ Fix role attribute reference in evaluation metric documentation for consistency
* 🔧 Add CanonicalEntities class to represent a collection of canonical entities
* 📝 Update entity disambiguation notebooks: clean up imports, adjust paths, and streamline API calls for improved clarity and functionality
---------
Co-authored-by: padonizetti
Co-authored-by: jansaldo
* Feature/summarization (#61)
* ✨ feat: Add Streamlit app for document summarization experiments
* Add statistical analysis notebook for summarization performance evaluation (visualized performance gaps between CPU and CUDA models, LLM hallucinations)
* 🎨 Quantitative and qualitative analysis of summaries: descriptive analysis by features, model comparison, gap analysis (CPU-CUDA), garbage detection/outliers, analysis by document, visualizations.
* 🔒️ clear all outputs
* 🎨 Improve Summary Analysis per document: cuda vs llama (same model), gemma vs llama (cuda), same document phi3 vs. phi4. Token per second gap.
* ✨ Add YAML utility functions for loading and saving data
* Merge dev into main for v1.1.12 (#57)
* Update README.md
* 🐛 bugfix: Fix XML special character escaping in DocAnonymizer
* ➕ build(deps): Add python-docx package
* ✨ feat: Add watermark and hyperlink functionality to document anonymization
* ✨ feat: Install Archivo font in Dockerfile
* 🎨 refactor: Improve Dockerfile structure and comments for clarity
* ⏪ revert: Remove Archivo font installation from Dockerfile
* 🔖 feat: Update aymurai package version to 1.1.11 in uv.lock
* 🐛 Improve get_extension logic to fix document extraction issues on Windows and remove python-magic dependency
* 🔧 Update Dockerfile to use 'bullseye' variant for Python images for improved compatibility
* 🔧 Update Makefile targets for improved Docker workflow
* 🔖 feat: Bump aymurai package version to 1.1.12
* ♻️ Harden get_extension with header scans and zip safeguards
* 🔧 Extend document extraction timeout to 30s
* 🔧 Refactor Docker workflow to build and push images using docker/build-push-action
* 🔧 Fix workflow step order to correctly extract tag name before building Docker images
* 🔧 Remove tag extraction step and use github.ref_name directly for Docker image builds
* ⏪ Revert Docker workflow to extract tag name and use it for image versioning
* Update .github/workflows/build-docker-image.yml
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* ✏️ Remove incomplete comment
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
---------
Co-authored-by: jed <jedzill4@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* ✨ Add GPU-enabled Ollama service to compose stack
* 🔧 Add Make targets for managing Ollama service and models
* 🔧 Add launch configuration and task for starting Ollama service
* 🔧 Add system prompts for document summarization
* 📝 Add summarization benchmark notebook
* 🚚 Move statistical analysis notebook to summarization folder
* ✨ Implement LLM providers module with Ollama adapter and shared abstractions
* ✅ Add unit tests for LLM providers including DummyProvider and OllamaLLMProvider
* 📝 Document Ollama provider usage via notebook demo
* 🐛 Fix tokenizer encoding by removing unnecessary special tokens flag
* ♻️ Refactor chunk handling in LLMProvider to use _append_chunk method for consistency and improved readability
* ✨ Enhance Ollama provider docs and DRY response building for sync/async calls
* ♻️ Refactor OllamaLLMProvider to reuse AsyncClient instance for improved efficiency
* 📝 Add async examples to OllamaLLMProvider notebook
* ✅ Add async coverage for OllamaLLMProvider and tighten chunking tests
* ➕ Add tiktoken dependency to pyproject.toml and update version in uv.lock
* 🔧 Enhance summarization prompts with additional information extraction and entity identification details
* ✨ Add LLM summarization router
* 📝 Add notebook for the summarization endpoint
* ✏️ Fix formatting of keys in summarization defaults for consistency
* ➕ Add dspy dependency and update related packages in project configuration
* 🚧 WIP: Add prompt optimization notebook for summarization experiments
---------
Co-authored-by: Sofi <sofiamorenadelpozo@gmail.com>
Co-authored-by: jed <jedzill4@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* 🩹 Fix YAML key names in prompt defaults for summarization
* ♻️ refactor: Restructure USEM module with factory pattern and multipl… (#64)
* Merge dev into main for v1.1.12 (#57)
* Update README.md
* 🐛 bugfix: Fix XML special character escaping in DocAnonymizer
* ➕ build(deps): Add python-docx package
* ✨ feat: Add watermark and hyperlink functionality to document anonymization
* ✨ feat: Install Archivo font in Dockerfile
* 🎨 refactor: Improve Dockerfile structure and comments for clarity
* ⏪ revert: Remove Archivo font installation from Dockerfile
* 🔖 feat: Update aymurai package version to 1.1.11 in uv.lock
* 🐛 Improve get_extension logic to fix document extraction issues on Windows and remove python-magic dependency
* 🔧 Update Dockerfile to use 'bullseye' variant for Python images for improved compatibility
* 🔧 Update Makefile targets for improved Docker workflow
* 🔖 feat: Bump aymurai package version to 1.1.12
* ♻️ Harden get_extension with header scans and zip safeguards
* 🔧 Extend document extraction timeout to 30s
* 🔧 Refactor Docker workflow to build and push images using docker/build-push-action
* 🔧 Fix workflow step order to correctly extract tag name before building Docker images
* 🔧 Remove tag extraction step and use github.ref_name directly for Docker image builds
* ⏪ Revert Docker workflow to extract tag name and use it for image versioning
* Update .github/workflows/build-docker-image.yml
* ✏️ Remove incomplete comment
---------
* ♻️ refactor: Restructure USEM module with factory pattern and multiple encoder backends
- Add BaseSentenceEncoder abstract base class for encoder interface
- Implement factory pattern with EncoderType enum and create_encoder function
- Add sentence-transformers encoder implementations (DistilUSE, MultilingualMiniLM)
- Move TensorFlow implementation to tensorflow_encoder.py
- Add lazy loading for encoder implementations via __getattr__
- Add auto-detection for Apple Silicon compatibility (defaults
* 🚚 Rename test sentence encoders mac notebook
* 📌 Sync dependencies
---------
* ⏪ Rollback to previous torch and torchtext versions to avoid conflicts
* 🩹 Fix: Add missing environment variable for OLLAMA_HOST in docker-compose
* 📝 Add anonymization pipeline docs
* 🚧 WIP: Add Playwright PJN scraper
* 📝 Add Jupyter notebook for entity disambiguation from pre-clustered validations
* Feature/pdf extraction upgrade (#65)
* 🔧 Configure VSCode Python env and Copilot scopes
* 🔧 Include resources/llm in .dockerignore
* 📌 Update dependencies in pyproject.toml and uv.lock
* 🔧 Update Dockerfile and devcontainer.json to install additional PDF tooling
* ♻️ Refactor Makefile and docker-compose.yml for improved service configuration and flexibility
* 🚧 FIXME: Remove DecisionConv1dBinRegex model from pipeline configuration for dependencies update compatibility
* 🔧 Set weights_only=False for torch.load compatibility
* ✨ Enhance PDF extraction with marker integration and improved text processing
* 🔧 Update run_safe_text_extraction to allow indefinite timeout by default
* ✨ Add warm_marker_models function to initialize marker-pdf artifacts at startup
* 🔥 Remove unused environment variables and rename TRANSFORMERS_CACHE to HF_HOME
* 🔧 Improve service stopping logic for Ollama and API services in Makefile
* 🔖 Bump aymurai package version to 2.0.0-alpha.1
* 🔧 Update HF_HOME path and remove HF_DATASETS_CACHE variable in .env.common
* 🔧 Update OLLAMA_HOST for GPU-enabled services to point to ollama-gpu
* 🔧 Simplify marker model warming logic by removing error handling
* ♻️ Refactor text extraction into modular format-specific extractors
* ✅ Add unit tests for document extraction and error handling
* ➕ Add marker-pdf stack and drop textract
* 🔧 Enhance PDF extraction with caching mechanism
* 📝 Improve cache utility functions with enhanced docstrings and type hints
* 🔧 Enhance cache key generation in PdfExtractor for improved stability and performance
* 🔖 Update aymurai package version to 2.0.0a2.dev9
* Feature/remove usem tensorflow deps (#68)
* 🩹 Ensure consistent entity attributes in reformat_entity function and reorder imports
* 📝 Update subcategories exploration notebook
* ⚗️ Add TensorFlow deprecation experiment notebook
* ♻️ Refactor entity subcategorization: Remove USEMSubcategorizer, add SentenceTransformerSubcategorizer
- Removed the USEMSubcategorizer implementation from `usem.py`.
- Introduced new Jupyter notebooks for testing and evaluating the SentenceTransformerSubcategorizer.
- Updated the pipeline configuration to utilize SentenceTransformerSubcategorizer with local embeddings instead of remote URLs.
* ♻️ Refactor download function: Replace gdown with requests for improved file downloading
* 🔥 Remove empty peft model module
* ➖ Remove TensorFlow and gdown dependencies from pyproject.toml
* 📌 Update uv.lock
* ♻️ Refactor sentence encoder module: Remove unused dependencies and streamline factory functions
* 🔖 Update aymurai package version to 2.0.0a3.dev9
* WIP: feat(decision): ✨ integrate TinyEmbeddingBagClassifier for decision detection (#67)
* feat(decision): ✨ integrate TinyEmbeddingBagClassifier for decision detection
- Introduced a new model class `DecisionEmbeddingBagBinRegex` using `TinyEmbeddingBagClassifier`.
- Updated model loading and saving mechanisms to support safetensors format.
- Added a new training notebook for the embedding bag classifier.
- Modified the pipeline configuration to include the new model.
* ⚡️ Remove unidecode usage to avoid double normalization in model_input_from_text
* 📝 Add type hints and docstrings for clarity in DecisionEmbeddingBagBinRegex and TinyEmbeddingBagClassifier
* 🔧 Refactor import statements for safetensors to remove try-except block
* 🔥 Remove Conv1dTextClassifier, Tokenizer and SpanishTokenizer implementations
* 🐛 Fix gen_aymurai_entity call by removing unused category parameter
* 🔖 Update aymurai package version to 2.0.0a4.dev1
* 🔥 Remove TensorFlow environment variables
* Feature/mlfow integration (#66)
* feat: add mlflow-based experiments and services (wip)
* feat: finalize mlflow experiment runner and artifact logging
* feat: add OpenAI ChatGPT extension and update postStartCommand in devcontainer
* 📝 Unify disambiguation evaluation notebooks
* 📝 Enhance documentation and add type hints across multiple modules
* 📌 Update uv.lock
* 🔧 Update devcontainer GPU device configuration
* 🔧 Change default Python environment manager to venv
* 🔧 Add container names for all services in docker-compose.yml
* ➖ Remove commented optional dependencies for GPU support in pyproject.toml
* 🔧 Increase document request timeout from 30 to 300 seconds in .env.common
* 🚚 Changed environment variable names from DOCUMENT_API_BASE_URL and DOCUMENT_REQUEST_TIMEOUT to API_BASE_URL and REQUEST_TIMEOUT
* 🔧 Update dependency installation to include 'mlops' group in entrypoint.sh
* 🔖 Update aymurai package version to 2.0.0a5.dev8
* Feature/document extract config (#69)
* ✨ Enhance document extraction with caching and configuration options
* ✅ Update extractor tests to handle additional configuration parameters and improve error handling
* 🔧 Update marker model warmup to include configuration setup for improved initialization
* 🔖 Update aymurai package version to 2.0.0a6.dev3
* ⏪ Revert multiprocessing context change in run_safe_text_extraction
* 🔖 Update aymurai package version to 2.0.0a6.dev5
* 🔥 Remove unused multiprocessing import from document_extract.py
* 🔥 Remove unused logging import from extraction.py
* 🔧 Change default value of force_ocr to False in pdf_to_text function
* 📝 Update argument descriptions in pdf_to_text and plain_text_extractor functions to include default values
* 📝 Remove duplicate argument description for path in BaseExtractor.extract method
* Feature/pre disambiguation optimization (#70)
* New pre-disambiguation feature notebooks
* New pre-disambiguation feature notebooks and per-label feature added to metrics.py
* Conclusion added to pre-cluster investigation
* utils.py ocr variable True
* Changes in grid search function to store the best pre-clustered entities in a particular directory
* New llm inference function in notebook 07
* New llm grid search inference function
* Add disambiguation endpoint and utility functions for entity grouping
* Remove unused models and tokenizers to streamline the codebase
* Fix type hints for processor functions to avoid runtime errors
* Endpoint /disambiguate with LLM Inference (#72)
* Changes in old 07 notebook adding the usage of the disambiguate endpoint and its own name
* New token counter to check that the LLM inference won't hallucinate
* New tokenizer function for token counting and processing specific documents
* Batch optimization feature in llm-inference function
* Mapping feature added to llm-inference function
* Updated the /disambiguate endpoint to return DocumentAnnotations similar to the NER predictions, now enriched with role and entity_id fields where applicable.
* New /disambiguatev2 endpoint that runs the LLM inference and returns the DocumentAnnotations list with the role and the canonical_entity_id where applicable. When a prediction was not mapped, the program generates a canonical_entity_id
* New /disambiguatev2 endpoint that runs the LLM inference and returns the DocumentAnnotations list with the role and the canonical_entity_id where applicable. When a prediction was not mapped, the program generates a canonical_entity_id
* New /disambiguatev2 endpoint that runs the LLM inference and returns the DocumentAnnotations list with the role and the canonical_entity_id where applicable. When a prediction was not mapped, the program generates a canonical_entity_id
* New updates on endpoint /disambiguatev2 and notebook 07
* Cleaned code in anonymizer.py and utils.py following Raúl's comments
* New classes defined for LLM prompts to validate each set of prompts per label before the LLM inference
* Sorted canonical entities before LLM inference to reduce the chance that two or more canonical entities referring to the same entity are processed in separate batches
* Cleaned anonymizer.py script and experimental notebook 07 discarding the old pre-cluster endpoint.
* Cleaned anonymizer.py script and experimental notebook 07 discarding the old pre-cluster endpoint. New disambiguation.py script to store functions to pre-cluster the canonical entities.
* Code cleaned following Juli's comments regarding the new /disambiguate endpoint
* Remove unused relations field from CanonicalEntity class for LLM inference phase
* Final changes to the code adding the entity_disambiguation.yaml to handle the prompts
* Add entity disambiguation utilities and enhance canonical entity processing
- Introduced new utility functions for entity disambiguation in `fuzzy.py`.
- Implemented `assign_label_instances` and `map_canonical_entities_ner_preds` in `core.py`.
- Added LLM inference capabilities in `llm.py` for refining canonical entities.
- Updated `entities.py` to include `aymurai_label_instance` for ordered label indexing.
* Refactor anonymizer and paragraph modules for improved entity disambiguation and serialization
* Remove unused logger import from paragraph module
* Reviewed code and added some features to 07 experiment notebook
* Implement label policies for disambiguation and anonymization; enhance entity processing and prediction mapping
* New datetime formatter function and changes in old code; there is a bug on my OS, which does not support setlocale
* New functionality added to get_canonical_dates for dates with the same day and month
* 🐛 Fix entity handling in anonymizer and datapublic routers when use_cache is disabled to improve label processing
* Remove commented-out code
* DatetimeFormatter used after NER predictions in postprocess so we only have to take the datetime from aymurai_label_subclass to build the canonical entities from dates
* Fix locale setting for date formatting to ensure correct month name handling
* Add docstring for get_canonical_dates function to clarify input and output
* Remove DIRECCION prompt templates
* Update notebook formatting, remove unused MODE param and improve code readability
* Update uv.lock
* Hotfix: resolve file pathing, logic indentation, and date disambiguation
- Update configuration path in llm.py from .yaml to .yml.
- Fix indentation in core.py for canonical_entity_id assignment. This ensures
all predictions receive an ID even if they lack a canonical match, bypassing
the 'aymurai_label_subclass == 0' filter which caused issues with date
formatting in NER post-processing.
- Add condition in anonymizer.py to trigger 'get_canonical_dates' only when
FECHA is present in 'fuzzy_labels'. This prevents unintended date
disambiguation when the policy is set to None.
* Feature/anonymize document refactor (#73)
* Add render policy support and refactor anonymization logic for improved token rendering
* 📝 Update anonymization docs
* ♻️ Refactor: modularize document anonymization
* 📝 Rename notebook for document anonymization with render policy
* FECHA disambiguation bug fixed, label and render policies changed and whole code reviewed for PR
* ⏪ Revert entrypoint.sh to 1ac2776
* ⏪ Revert .dockerignore to 5af5814
* ⏪ Revert .env.common to 90f7369
* ⏪ Revert .vscode/launch.json to f366690
* ⏪ Revert Makefile to cb3df05
* ⏪ Revert aymurai/api.core.py to 19a9ca8
* 🦖 Changed aymurai/api/endpoints/routers/anonymizer/anonymizer.py for release/v1.5.0 compatibility
* 🔥 Removed aymurai/api/endpoints/routers/llm for release/v1.5.0 compatibility
* 🦖 Changed aymurai/api/endpoints/routers/misc/document_extract.py for release/v1.5.0 compatibility
🦖 Changed aymurai/text/extractors/pdf.py for release/v1.5.0 compatibility
🦖 Changed aymurai/text/extractors/utils.py for release/v1.5.0 compatibility
* ⏪ Revert aymurai/api/main.py to a801bf4
* 🔥 Removed aymurai/api/startup/marker.py for release/v1.5.0 compatibility
* 🔥 Removed aymurai/experiments/entity_disambiguation folder for release/v1.5.0 compatibility
* 🔥 Removed aymurai/llm_providers for release/v1.5.0 compatibility
* 🦖 Changed aymurai/settings.py for release/v1.5.0 compatibility
* 🦖 Changed aymurai/api/endpoints/routers/anonymizer/anonymizer.py for release/v1.5.0 compatibility
🦖 Changed aymurai/utils/entity_disambiguation/__init__.py for release/v1.5.0 compatibility
🔥 Removed aymurai/utils/entity_disambiguation/llm.py for release/v1.5.0 compatibility
* ⏪ Reverted docker-compose.yml to 5b9c220
* ⏪ Revert docker/api/Dockerfile to 4196117
* 🦖 Changed docs/anonymization/README.md for release/v1.5.0 compatibility
* 🔥 Removed docs/experiments/README.md for release/v1.5.0 compatibility
🔥 Removed docs/experiments/base.yaml for release/v1.5.0 compatibility
* 🔥 Removed notebooks/experiments/anonymization/05-langextract.ipynb for release/v1.5.0 compatibility
* 🔥 Removed all the notebooks from folder: notebooks/experiments/entity-disambiguation that had something related to LLM disambiguation for release/v1.5.0 compatibility
* 🔥 Removed notebooks/experiments/llm-providers for release/v1.5.0 compatibility
* 🔥 Removed notebooks/experiments/summarization for release/v1.5.0 compatibility
* 🦖 Changed pyproject.toml for release/v1.5.0 compatibility
* 🔥 Removed resources/llm for release/v1.5.0 compatibility
* 🔥 Removed summarization_app for release/v1.5.0 compatibility
* 🔥 Removed test/llm_providers for release/v1.5.0 compatibility
* 🐛 Bug fixed in pyproject.toml line 106 for .venv build up
* 🐛 Bug fixed in function '_normalize_text' from 'aymurai.text.extractors.utils' that was changed to 'normalize_text' because it's used in aymurai/text/extractors/docx.py
* ⏪ Revert elimination of folder aymurai/experiments/entity_disambiguation for experimental purposes. There was an error in deleting everything, files will be changed in next commit.
* 🔥 Removed aymurai/experiments/entity_disambiguation for release/v1.5.0 compatibility
* 🐛 Bug fixed in experiments/entity-disambiguation/10-anonymize-document-render-policy.ipynb for release/v1.5.0 compatibility
* 🔥 Removed TESSDATA_PREFIX from .env.common
* 🙈 Update .gitignore to include notebooks directory while excluding subdirectories and non-IPYNB files
* 🔀 Synthesize docker-compose from 26033a8f/00709164 after b05b768 rollback
* 🔀 Synthesize Makefile from afbfda9/d80f74b/26033a8f after f645881 rollback
* 🔧 Fix repository URL case sensitivity in pyproject.toml and remove unused dependencies
* 🔥 Remove tasks.json configuration for Ollama service
* 🔥 Remove scraper and documentation
* 🔥 Remove experiment module
* 🔥 Remove path utility functions from paths.py
* 🔥 Remove unused PromptSet and PromptLibrary classes, and simplify disambiguation options in LabelPolicy
* 🔥 Remove EntityRelation class and its associated methods from entities.py
* 📝 Enhance documentation with detailed docstrings for various functions across multiple modules
* 🔥 Removed PromptLibrary class from aymurai/api/endpoints/routers/anonymizer/anonymizer.py for release/v1.5.0 compatibility
🔥 Removed `llm` disambiguation label policy for release/v1.5.0 compatibility
* 🎨 Changed map_canonical_entities_ner_preds function in aymurai/utils/entity_disambiguation/core.py discarding the role assignment for release/v1.5.0 compatibility
🎨 Changed aymurai/api/endpoints/routers/anonymizer/anonymizer.py discarding all the validations that had to do with LLM disambiguation for release/v1.5.0 compatibility
🎨 Minor changes in the rest of documents regarding to experimentation with the release/v1.5.0 API
* 🔀 Synthesize document_extract from d349c69 after 3c55d8e: remove extractor config passthrough and restore fixed timeout
* 🔀 Synthesize PDF extraction flow from d349c69/26033a8: remove cache/debug path
* 🔥 Remove text extraction tests
* 📝 Update description formatting for aymurai_disambiguation field in EntityAttributes
* 🦖 Update PdfExtractor.extract method to include ignored keyword arguments for backward compatibility
* 🔥 Remove unused static logo file from API resources
* 🔧 Add version_scheme configuration to setuptools_scm in pyproject.toml
* 📌 Update uv.lock
* 📝 Reorganize and update v1.5.0 documentation (EN/ES)
* 🚚 Rename full-paragraph pipeline to datapublic across code and docs
* ci(tests): add API + pipeline integration tests on linux and windows (#74)
* Merge dev into main for v1.1.12 (#57)
* Update README.md
* 🐛 bugfix: Fix XML special character escaping in DocAnonymizer
* ➕ build(deps): Add python-docx package
* ✨ feat: Add watermark and hyperlink functionality to document anonymization
* ✨ feat: Install Archivo font in Dockerfile
* 🎨 refactor: Improve Dockerfile structure and comments for clarity
* ⏪ revert: Remove Archivo font installation from Dockerfile
* 🔖 feat: Update aymurai package version to 1.1.11 in uv.lock
* 🐛 Improve get_extension logic to fix document extraction issues on Windows and remove python-magic dependency
* 🔧 Update Dockerfile to use 'bullseye' variant for Python images for improved compatibility
* 🔧 Update Makefile targets for improved Docker workflow
* 🔖 feat: Bump aymurai package version to 1.1.12
* ♻️ Harden get_extension with header scans and zip safeguards
* 🔧 Extend document extraction timeout to 30s
* 🔧 Refactor Docker workflow to build and push images using docker/build-push-action
* 🔧 Fix workflow step order to correctly extract tag name before building Docker images
* 🔧 Remove tag extraction step and use github.ref_name directly for Docker image builds
* ⏪ Revert Docker workflow to extract tag name and use it for image versioning
* Update .github/workflows/build-docker-image.yml
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* ✏️ Remove incomplete comment
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
---------
Co-authored-by: jed <jedzill4@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* WIP: feat(decision): ✨ integrate TinyEmbeddingBagClassifier for decision detection (#67)
* feat(decision): ✨ integrate TinyEmbeddingBagClassifier for decision detection
- Introduced a new model class `DecisionEmbeddingBagBinRegex` using `TinyEmbeddingBagClassifier`.
- Updated model loading and saving mechanisms to support safetensors format.
- Added a new training notebook for the embedding bag classifier.
- Modified the pipeline configuration to include the new model.
* ⚡️ Remove unidecode usage to avoid double normalization in model_input_from_text
* 📝 Add type hints and docstrings for clarity in DecisionEmbeddingBagBinRegex and TinyEmbeddingBagClassifier
* 🔧 Refactor import statements for safetensors to remove try-except block
* 🔥 Remove Conv1dTextClassifier, Tokenizer and SpanishTokenizer implementations
* 🐛 Fix gen_aymurai_entity call by removing unused category parameter
* 🔖 Update aymurai package version to 2.0.0a4.dev1
* 🔀 cherry-pick(decision): modernize decision model and upgrade ML dependencies
Cherry-pick TinyEmbeddingBagClassifier (safetensors) replacing Conv1d model.
Remove dead deps (torchtext, pytorch-lightning), upgrade torch to 2.x and flair to 0.15.1.
* 🐛 cherry-pick(fix): datapublic and anonymizer crash when use_cache is disabled
* test(infra): rewrite test infrastructure with architecture guide standards
- Delete old test files (test_document_extract.py, test_anonymizer_predict.py, test_datapublic_predict.py)
- Create new directory structure: tests/integration/pipelines/, tests/api/routers/{anonymizer,datapublic,misc}/
- Rewrite tests/conftest.py:
- Set env vars at module level (RESOURCES_BASEPATH=resources, SQLALCHEMY_DATABASE_URI=sqlite:///:memory:)
- Remove torch mock and lazy loader
- Direct imports from production code
- Clean fixtures: db_engine (session-scoped), db_session (function-scoped), client (with dependency override)
- Test data builders: build_data_item(), build_label(), build_anonymization_paragraph(), build_datapublic_paragraph()
- Update pyproject.toml with [tool.pytest.ini_options]: strict-markers, integration/slow markers
Verification: uv run python -c 'import tests.conftest' succeeds, pytest collection clean
* test(conftest): add pipeline loading helpers and mock factories for API tests
Wave 2 complete: integration pipeline conftest + API router conftest
Integration pipeline conftest:
- PIPELINE_CONFIGS dict for flair-anonymizer and full-paragraph
- load_test_pipeline() helper with print_config=False
- Session-scoped fixtures for both pipelines (expensive model loading)
- build_pipeline_input() test data builder
- sample_text fixture with Spanish legal text
API router conftest:
- build_mock_pipeline() factory with MagicMock
- Mock preprocess/predict_single/postprocess methods
- build_processed_data_item() test data builder
- Re-exports builders from root conftest
* test(api): add document extract endpoint tests with mocked extraction
* test(api): add anonymizer and datapublic endpoint tests with mocked pipelines
* test(integration): add pipeline integration tests for flair-anonymizer and full-paragraph
* ✅ test: refactor test infrastructure and add integration tests
- Reorganize test conftest files to proper hierarchy (tests/api/conftest.py)
- Add pytest to dependency groups in pyproject.toml
- Refactor API router tests to use centralized fixtures and builders
- Add real document extraction tests with DOCX/PDF generators
- Improve pipeline integration tests with fixture-based stages
- Fix label serialization to use model_dump(mode="json")
- Update UUID generation for datapublic tests to use uuid.uuid5
- Add cache path environment setup for integration tests
- Clean up imports and remove unused dependencies
- Remove empty test file (document_extract.py)
This refactoring improves test maintainability, adds proper integration
testing without excessive mocking, and establishes consistent test utilities
across the codebase.
* 👷 ci(github): add pytest workflow for CI integration
- Introduced a new GitHub Actions workflow for running pytest.
- Configured to trigger on pull requests and manual dispatch.
- Supports multiple OS and Python versions for comprehensive testing.
* 👷 fix(tests): fix env variable DISKCACHE_ROOT
* 👷 ci(github): remove deprecated PR tests workflow & fix env variable
- Deleted the old PR tests workflow file.
- This cleanup helps streamline CI processes and reduces redundancy.
* ci(github): 👷 add pipeline download and integration tests to CI workflow
- Introduced a new script for downloading pipelines.
- Updated the pytest workflow to include running API and pipeline tests.
- Enhanced test execution with improved output formatting and failure limits.
* fix(tests): 🐛 avoid context manager in TestClient to skip app startup
- Changed TestClient usage to prevent app lifespan startup during tests.
- Ensured proper cleanup by closing the client after use.
- This improves test performance and reliability.
* 👷 ci(github): add RESOURCES_BASEPATH environment variable for pipeline tests
- Added RESOURCES_BASEPATH to the environment variables for both downloading pipelines data and running pipeline tests.
- This change ensures that the necessary resource paths are correctly set during the CI workflow execution.
* 👷 ci(github): update RESOURCES_BASEPATH for pipeline data download
- Changed RESOURCES_BASEPATH from /tmp to resources in the pipeline download step.
- Ensures the correct path is used for resource access during tests.
* chore(pyproject): 🔧 add environment markers for platform compatibility
- Introduced required-environments for tool.uv to specify platform requirements.
- Updated resolution-markers and required-markers in uv.lock for better dependency management.
- Added tensorflow-io-gcs-filesystem with specific markers for Windows and Linux.
* ci(github): 👷 configure es_AR locale for Ubuntu runners
- Added steps to configure the es_AR locale on Ubuntu.
- Ensures proper locale settings for tests running in the CI environment.
* 👷 ci(github): add AYMURAI_CACHE_BASEPATH environment variable for pipeline tests
- Introduced AYMURAI_CACHE_BASEPATH to the environment variables for both pipeline download and pipeline tests.
- This change ensures that the correct cache path is utilized during the execution of the tests.
* 🐛 fix(dependencies): adjust textract dependency for platform compatibility
- Added conditional dependency for textract based on the operating system.
- Specified different sources for textract depending on whether the platform is Windows or not.
* 🔥 chore(opencode): remove opencode.json configuration file
- Deleted the opencode.json file as it is no longer needed.
- This change helps to clean up the repository and remove obsolete configurations.
* 🚚 Update pipeline path for datapublic in scripts, notebooks and tests
* 📝 docs: replace Black references with Ruff in CONTRIBUTING and Alembic hook examples
* 🔧 Add backslash to default CACHE_BASEPATH value
* 🔧 Update cache path retrieval to use settings for consistency
* ➖ Remove textract dependencies and update documentation for extract_document function
* ✅ Update integration tests and add new test cases for anonymizer and datapublic flows
* 🔥 chore(test): remove legacy /test dir and standardize sample doc path to /resources/data/sample/document-01.docx
* 🔧 Update UV_VERSION to latest in devcontainer Dockerfile
* 🔧 Update dependency installation command to include all groups
* 📌 Update uv.lock
* 🐛 Fix CACHE_BASEPATH env alias resolution for CI pipeline downloads
* Feature/pdf layout anonymization (#76)
* ✨ feat(extractors): use pymupdf layout for pdf text extraction
* ✨ feat(normalization): enhance document normalization to preserve paragraph structure
* 📝 docs: document default values for extractor and normalization helpers
* 🩹 fix(extractors): use pymupdf4llm.to_text with page_chunks for pdf paragraphs
* ♻️ Add DOCX and PDF anonymizer modules
- Implemented DocxAnonymizer class to handle anonymization of DOCX documents by replacing sensitive data with label tokens. This includes functionality for unzipping documents, parsing XML, editing content, and adding watermarks.
- Developed PdfAnonymizer class for anonymizing PDF documents, utilizing pymupdf for document manipulation. This includes layout parsing, font caching, redaction operations, and watermarking.
* 🔧 Enhance PDF and DOCX handling in anonymization process
* 📝 Update backend module references for document rendering in README
* ✅ Update tests to use DOCX format for document anonymization and enhance mock behavior
* ✨ Add end-to-end PDF anonymization notebook with PyMuPDF and AymurAI API
* ♻️ Rework PDF anonymization for precise spans and widget handling
* 🔧 Update model_dump calls to exclude None values for improved data handling
* 📝 Add docstrings to label replacement functions
* ♻️ Refactor watermark handling and optimize PDF token aliasing
* ✅ Add integration tests for merging fragmented numeric labels and excluding null alt attributes in PDF anonymization
* ➖ Remove opencv-python-headless dependency from project requirements
* ♻️ Implement paragraph splitting function to enhance document text extraction
* 🔧 Update dependency installation command to prevent Python downloads
* 🔥 Remove redundant tests for merging fragmented numeric labels and PDF anonymization
* ♻️ Refactor anonymizer tests to use DOCX format and enhance mock functionality
* 🔧 Add xfail marker for PDF extraction test on Windows due to tensor type issue
* ✨ Enhance PDF anonymization by adding cleanup rects, removing overlapping links, and scrubbing metadata
* 🔧 Remove redundant return statement in _label_replacement_text function
* ♻️ Refactor anonymization module: split pdf and docx internals by format
* ✅ Add integration tests for PDF and DOCX anonymizers, including metadata scrubbing and link preservation
* ✨ Add watermark layout adjustments to avoid footer content overlap in PDF anonymization
* ✅ Add integration test to ensure watermark is positioned away from footer content in PDF anonymization
* 🩹 Fix: read docx xml as utf-8 across platforms
* ✅ Add Windows-specific xfail marker for PDF tests and implement UTF-8 XML reading test
* 🐛 Remove unnecessary --extra runtime flag from uv sync command
* 🐛 Date formatter bug fixed for canonical entities generation.
* 🐛 Fix duplicate DocLabel handling in anonymization and serialization processes
* ✅ Add tests to deduplicate duplicate labels in cached predictions and disambiguation processes
* 🐛 Fix handling of non-alphanumeric entities by returning None for empty cleaned text (#81)
* 🩹 Fix default timeout value in run_safe_text_extraction function from 30 to 300 seconds
* 🚸 Update PDF_TOKEN_ALIAS_MAP with clearer aliases
* Fix/pdf signature anonymization (#82)
* 🧪 test(pdf): cover signature anonymization regressions
* 🐛 fix(pdf): preserve signature appearance when redacting signer names
* ✅ test(pdf): add focused signature geometry tests
* ♻️ refactor(pdf): rename distance function for clarity and update references
* 📝 docs(pdf): clarify signature widget flattening process in preparation function
* ✅ test(pdf): cover signature review edge cases
* 🐛 Bug fix for exact entities. (#80)
* 🐛 Fixed bug for entities whose values are always identical, which must bypass the fuzzy matching algorithm.
* ⚡️ Improved structure following Copilot comments.
* ⚗️ Experimentation.
* 🐛 Merge duplicate labels for the same span and AymurAI label in _dedupe_doclabels function
* ✅ Add integration test for merging cached duplicate labels for the same span
---------
Co-authored-by: jansaldo <julianansaldo@gmail.com>
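
The exact-entity fix above (#80) can be pictured with a minimal sketch. It assumes, as the PR summary describes, that entities with exact-match labels are grouped by a normalized key with non-alphanumeric characters stripped, rather than by fuzzy similarity. The label names in `EXACT_LABELS` and the helpers `flatten` and `cluster_exact` are hypothetical and chosen for illustration only; they are not the project's actual identifiers.

```python
import re
from collections import defaultdict

# Hypothetical labels that should cluster by exact value only (assumption).
EXACT_LABELS = {"DNI", "TELEFONO"}


def flatten(text: str) -> str:
    """Strip every non-alphanumeric character to build an exact-match key."""
    return re.sub(r"[^a-zA-Z0-9]", "", text)


def cluster_exact(items: list[dict]) -> list[list[dict]]:
    """Group items whose flattened text is identical; no fuzzy matching involved."""
    groups: dict[str, list[dict]] = defaultdict(list)
    for item in items:
        groups[flatten(item["text"])].append(item)
    return list(groups.values())


items = [
    {"text": "12.345.678", "label": "DNI"},
    {"text": "12345678", "label": "DNI"},
    {"text": "12.345.679", "label": "DNI"},
]
clusters = cluster_exact(items)
```

Under these assumptions the first two items share one cluster because they flatten to the same key, while the third stays separate even though a fuzzy matcher would likely score it as nearly identical.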
* Feature/frontend integration (#83)
* Squashed 'frontend/' content from commit 9123e6f
git-subtree-dir: frontend
git-subtree-split: 9123e6ff047ddc6da0528d1de827a4af68752d0f
* Squashed 'frontend/' changes from 9123e6ff..8add5c45
8add5c45 1.25.0
d7424d94 use base_url when dealing with public assets
ae1a47b8 Merge pull request #44 from AymurAI/feat/redesign
9cfc610c fix issue regarding annotation keyboard navigation
a723be01 restore useNotify feature in process page
b888214a make how it works modal bigger
49c3bc22 fix electron ts issues
eb791d07 add missing anonymize value on label attrs
151e2710 make OS taskbar API safe to call in web
8dc38fbb add knip
a0f6c687 fix ts clone element issue
d975e3fd include label policies and annotations when anonymizing document
70c44a04 more fixes in height to validate dataset
a36a7401 make suggestion clickable on uncontrolled text input
ea61e4e2 add fixed height to validation dataset page
072d751e remove axios' default api url
6f8a478c adjust typings
528c646f add more canonical id operations on reducers
ae09277c add update by canonical id function
de2b641c add random canonical id on search add event
74d4dac2 add value as undefined on ui/input component
6aa895f4 simplify and remove unused code on file annotator component
d10fe8c7 add useShallow to local storage default getters
9f62ac69 adjust predicting and file parser flows
01aa8a79 fix TS issues
5655eac5 smaller fixes on file annotation components
a7e91cec add remove dialog to label manager's entity tab
cad8faaa simplify annotation components
52230ee4 add metadata to search and tag annotations
0f08b9c8 add more anonymizer copy locale texts
6c3112ac store label manager config in local storage
63c4d731 move suggestion label and add mark props
77017543 adjust spacing on dataset validation
cae3b907 disambiguate hook
6b21903e move createAnnotationData, tag and suffix to context
d1649e5d add max width to toast and remove instead of dismissing it
ca56bb63 add size variant to suggestion and adjust display
55a67bf8 fix select viewport and add scroll
aa2ce6a1 adjust styles on callout
5493cdee finish mark and tag annotation feature
4fb28120 finish major pages and add file protection feature
af457c85 improve accessibility on host page
c2c21c21 add className to SectionTitle
6edd7101 update tanstack react query version
df491c17 add odt to pdf conversion
efb04bc5 deprecate usage of useSchemedQuery and useSchemedQueries
046a73e2 deprecate usage of useSchemedMutation
65ceb195 swap replace button icons
12348edd update button icons
2719755f extract tagger popover logic
28ce40ab create search tagger
e54adfbe add small version of select component
d6c3bd77 improve ui dialog component
63c6bf09 initial mark component update
cd0f7d30 add radix's select as dependency
92ceaace remove old tooltip implementation
2499e4c8 add input size variants
be48dac9 add custom icons to callout, toast and showToast
32d3441b add retro compatibility features to select
657e2a5a minor styling fixes + new select imports
7a323cd2 kill more unused components
c5d7b4ba create better select component
d855a93b adjust styling and positioning on preview page
1cfba9c4 adjust callout styling
c28fdcd0 fix title in process copy text
3a9a5cc6 remove old component implementations
03954d74 create callout from toast, and then apply a11y to toast
93c9457d create toast component
642e760d improve suggestion component
707e844b update suggestion mark component's styles
7c1db956 add checked variant to button
204246c7 fix gaps on finish page
9e81ee12 hide file stepper selector arrows depending on cursor
da1377ea update file processing component
12aa1c6b update decision tabs to panda
490a1adb simplify finish dataset and anonymizer
ba12dad2 add className prop to footer
4ed68454 drop file queries when resetting the progress
762cf823 add missing built by in features page
2c9d6cf8 make dataset validation file annotator not annotable
59416b8f clear files on features menu
9585726b use translation on droparea
091b2c58 make search bar static and follow scroll
0978aff6 add label manager to file annotator
5b069d68 connect rest of label manager
23fa8407 improve layout header component
3c88d2f3 finish preview page
d31ec034 finish onboarding page
7d1a71f3 add disambiguate and predict react query options
4e91d285 feat: add feature icon record
bfc9baf7 feat: create initial label manager
0bc48bc9 chore: refactor Searchbar
962961e4 chore: remove stepper component
95e17235 fix: title in anonymizer locale
df53488d chore: minor changes
4a9a2557 feat: create switch component
d4830e48 feat: create label manager component
baa9835c feat: add more copies for the finish page
9d915cab chore: kill unused old hidden input component
2fd61a7f fix: add missing feature param call on route.tsx
617ea974 feat: use a11y on finish page
172b731a Merge branch 'feat/a11y-dataset-header' into feat/redesign
394121d9 feat: add more copies to locale file
c34951e8 fix: redo home layout
31078ff5 chore: cleanup
cf9def05 chore: restructure HOC to be a regular component
088bed82 feat: add api base url protection and apply it
cd15a776 fix(layout): address PR review on header and icon changes
59dc125e feat: add i18n support for the whole app
36d70bbd build: add i18n
244027be refactor(layout): hoist Topbar and Stepper to global wizard route
146905fb feat: improve topbar accessibility with semantic icons and aria labels
5f910470 chore: ignore personal analysis folder
b6569f77 chore: rework layout components
331ff8dc fix: update enum import
ee17865e feat(ui): create and/or adjust components
6c9daf38 feat: rework onboarding page
347054d5 chore: simplify main app layout
cef7f882 chore: adjust button sizes and enum import
dc59ad2a feat: make card clickable
051b6bd4 chore: add tutorial seen to local storage store
1e15bb74 fix: typo on anonymizer label
24f6de07 build: add web or electron run modes
beb63bc3 chore: migrate hidden input
23442095 chore: use constants and base card element on feature selector
2eba84b3 chore: refactor header so we can correctly position all elements
0a50bb14 fix: adjust stepper styles (sizing and colors)
504840d9 chore: export constants
3b79f391 build: update react and add radix dependencies
f6f7fac8 fix: remove fadeIn scaling animation
41a158c3 chore: create modern ui components
5a02a5df chore: flag card as deprecated
edb85211 feat: create modern tooltip component
a2c2040a fix: replace brand images with correct ones and set proper heights
edce7295 feat(components): create brand, layout and ui components
5475a617 feat: add more brand images
9cd982f8 chore(styles): add animation semantic tokens
ce991130 fix: extra character in home layout and rename the component
5e41df1a feat: redesign home
0947627c feat: create link card tool for features
0833a987 feat: create components to render in home screen
a6198d99 fix: add fixed height to button and auto adjust icon size
ab0a4e20 fix: add lineheight to text styles and adjust font weight
047b115a chore: replace custom use mutation hook with base on connect to host hook
59363cd7 chore: change to named export on local store
3a289e1e chore: add changes to router file
2393d18e chore: fix some tokens in panda and move stitches global styles
64048ac7 chore: re-implement button and partially input
51fe0990 feat: add loading screen on boot, timer of 1.5s
60cf7fde chore: configure view transition for all pages
ebcf3a66 feat: add loading page and updated branding images
67603f66 chore: flag stitches as deprecated
e416ace8 build: install and configure pandacss
c2e5eac1 build: add support for environmental variables for both web and electron apps
git-subtree-dir: frontend
git-subtree-split: 8add5c452478cdbe6a99ad1b05183cd264183c72
* ✨ Add frontend routing and settings for frontend distribution directory
* Squashed 'frontend/' changes from 8add5c45..ff882164
ff882164 chore: add .npmrc to configure public hoist pattern for @types
32bfab0a Merge pull request #59 from AymurAI/fix/53-restore-home-button
94e3816d Merge pull request #65 from AymurAI/fix/add-placeholder-to-select-entities
3545775f Merge pull request #66 from AymurAI/fix/remove-doc-extension
d24c458b add a "config" button in features menu
f427da73 make hover effect in button work for anchor tanstack link wrapper
77077905 remove slot checks on header
82d2c65a add home button to header on all flow's pages
5663da42 make aymurai's logo a link in the header
c3c0e8a7 create home button component
51cb537f fix: prevent select caret rotation from leaking ancestor data-state
5d85973f feat: add tooltips to tagger label and suffix inputs
7e0b67b8 Fix text overflow in HowItWorksModal (#58)
3d516636 Restore delete-one and delete-all hover actions on annotations (#57)
fa7e0967 create link component
c8112e20 add "Entidad" placeholder to tagger select
70710e11 change NINO to NIÑO
6664e631 remove copies and functions referencing .doc files
4a00039b copy change
422affaf Merge pull request #64 from AymurAI/fix/browser-resources-exhaustion
ce8a474a prevent semaphores underflow
dcf3b6b7 Merge pull request #62 from AymurAI/fix/conversion-endpoint-usage
7729801f add error handling to finish file conversion
2afe00ee Merge pull request #63 from AymurAI/fix/copy-changes
387d6724 Merge pull request #61 from AymurAI/fix/responsiveness
84843877 limit concurrent predict requests to avoid connection exhaustion
8d3e2693 fix: increase spacing between home and features menu buttons
3025e341 fix: use House icon in header instead of BackButton arrow
c9bba1a9 feat: add back-to-home button on features page
a0b8f55f use extension to check if file conversion is needed
0b83d0b0 create pdf to odt service
78b1a9c9 responsiveness for screens less than 1280px in width
fef9a94e copy change on label manager tab
b9c6db53 copy changes on label manager config tab
git-subtree-dir: frontend
git-subtree-split: ff882164be8077dee58b6748886b0d7d3acbe376
* 🔧 Remove commented-out router for anonymizer database
* ✨ Add Node.js and npm installation for frontend build in Dockerfile
* 📝 Update API documentation URLs to include '/api' prefix
* ✨ Add frontend build commands to Makefile
* 🙈 Update .dockerignore and .gitignore to include frontend build output directories
* ✅ Update API routes to include '/api' prefix in tests and add frontend integration tests
* ♻️ Refactor routing and API integration to remove '/app' prefix and streamline feature routes
* Squashed 'frontend/' changes from ff882164..d3e14b5e
d3e14b5e feat(validation): persist and restore predictions via backend validation endpoint (#68)
6b8a23ba Add drag-and-drop reordering and inline rename to label manager (#60)
git-subtree-dir: frontend
git-subtree-split: d3e14b5e00af41fded1c113e51e2e8b73bbf1b22
* refactor: update feature routing, migrate to pnpm, and refine dev environment configuration
* Squashed 'frontend/' changes from d3e14b5e..879309c8
879309c8 Feat/entity manager mention feedback (#81)
4d2de106 Fix/responsive home layout (#80)
986e68d2 Fix/homogenize file check ui (#77)
046f8ab9 fix(file-annotator): fix upward autoscroll on search previous navigation (#76)
2ecf75dc feat(dependencies): add dnd-kit packages for drag-and-drop functionality
git-subtree-dir: frontend
git-subtree-split: 879309c841d8072babc4d06f1686d11cf8cbd03f
* Squashed 'frontend/' changes from 879309c8..a37adc20
a37adc20 fix(useLocal): stop persisting groupOrder and remove dead categoryAssignments (#78) (#87)
47fb1fb3 fix(disambiguate): match response items by text instead of array index (#78) (#86)
c873c8e0 Move pnpm configuration from `package.json` to `pnpm-workspace.yml` (#83)
f4ce881b fix(useFileParse): use position-based paragraph ID to avoid key collisions (#85)
181e0356 Fix/invalid entity offsets (#82)
git-subtree-dir: frontend
git-subtree-split: a37adc20f579276b3a0e5979424ba7809fb7e2ff
* chore: migrate frontend build process from npm to pnpm in API Dockerfile
* 🐛 fix: add support for numpy integer and floating types in EnhancedJSONEncoder
* fix: update Stack component to use height instead of minHeight for consistent layout
* fix: update imports for Label and Text components in UncontrolledInput to avoid circular dependency
* chore: regenerate routeTree.gen.ts after removing $feature parent layout route
* feat: add default anonymization policies to settings
* chore: bump frontend version to 1.5.0
* fix(api): preserve pipeline cache for configured ttl
* refactor: remove torch dependency and configure threads via settings
* fix(frontend): replace previous anonymizer file on load
* fix(frontend): support dataset export in web mode
* fix(tests): add SQLALCHEMY_DATABASE_URI environment variable for api tests
* fix(api): improve error logging during startup
---------
Co-authored-by: jed <jedzill4@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: dmazzini <dmazzini@gmail.com>
* 🔥 Remove TensorFlow related environment variables in Dockerfile
* 📝 Update documentation for AymurAI v1.5.0
---------
Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Paolo Donizetti <padonizetti@gmail.com>
Co-authored-by: Sofi <sofiamorenadelpozo@gmail.com>
Co-authored-by: jed <jedzill4@users.noreply.github.com>
Co-authored-by: Lio <lionel.chamorro85@gmail.com>
Co-authored-by: conrabeatriz <conrabeatriz@gmail.com>
Co-authored-by: dmazzini <dmazzini@gmail.com>
Summary by Sourcery
Handle certain entity labels as exact identifiers during disambiguation and anonymization, and adjust the experimental notebook document selection accordingly.
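As a rough illustration of the exact-versus-fuzzy split described in this summary, here is a minimal sketch. It is not the code from this PR: the label names in `EXACT_LABELS`, the helper names, and the `difflib`-based fuzzy fallback are all assumptions made for the example. Only the alphanumeric flattening step and the idea of grouping by an exact key come from the reviewer's guide.

```python
import re
from difflib import SequenceMatcher

# Hypothetical label set: the actual exact-match labels are not listed in this summary.
EXACT_LABELS = {"DNI", "TELEFONO"}


def exact_key(text: str) -> str:
    """Flatten a mention to ASCII letters and digits only, e.g. '12.345.678' -> '12345678'."""
    return re.sub(r"[^a-zA-Z0-9]", "", text)


def fuzzy_clusters(texts, threshold):
    """Stand-in greedy fuzzy clustering (assumption: the project uses its own algorithm)."""
    clusters = []
    for text in texts:
        for cluster in clusters:
            score = SequenceMatcher(None, text.lower(), cluster[0].lower()).ratio() * 100
            if score >= threshold:
                cluster.append(text)
                break
        else:
            # No existing cluster was similar enough, so start a new one.
            clusters.append([text])
    return clusters


def cluster_mentions(label, texts, threshold=85):
    """Cluster mentions of one label: exact key grouping for identifiers, fuzzy otherwise."""
    if label in EXACT_LABELS:
        groups = {}
        for text in texts:
            groups.setdefault(exact_key(text), []).append(text)
        return list(groups.values())
    return fuzzy_clusters(texts, threshold)


ids = ["12.345.678", "12345678", "12.345.679"]
print(cluster_mentions("DNI", ids))  # exact path: one differing digit stays separate
print(cluster_mentions("PER", ids))  # fuzzy path: near-identical strings may merge
```

The point of the exact path is that two identifiers differing by a single digit are different entities, even though a similarity score would rate them as near-duplicates.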
Bug Fixes:
Enhancements: