🐛 Bug fix for exact entities. by conrabeatriz · Pull Request #80 · AymurAI/backend

conrabeatriz · 2026-05-20T22:04:13Z

Summary by Sourcery

Handle certain entity labels as exact identifiers during disambiguation and anonymization, and adjust the experimental notebook document selection accordingly.

Bug Fixes:

Prevent fuzzy clustering for entities that should be treated as exact identifiers by grouping them using normalized exact aliases.
Ensure anonymization postprocessing records a normalized subclass value for exact-identifier labels to support precise disambiguation.

Enhancements:

Introduce shared handling of exact-identifier labels by deriving and storing an exact alias from the entity label subclass metadata.
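
The normalization behind these changes can be illustrated with a minimal sketch. The regular expression is the one shown in this PR's diff; the helper name `flatten_identifier` and the sample values are hypothetical and used only for illustration.

```python
import re


def flatten_identifier(cleaned_text: str) -> str:
    """Keep only ASCII letters and digits (hypothetical helper name)."""
    return re.sub(r"[^a-zA-Z0-9]", "", cleaned_text)


# Two spellings of the same made-up identifier collapse to a single key.
key_a = flatten_identifier("20-12345678-9")
key_b = flatten_identifier("20 12345678 9")
print(key_a == key_b)
```

Note that the pattern strips separators but does not fold case, so under this sketch `AB123CD` and `ab123cd` would remain distinct keys.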

…the fuzzy matching algorithm.

sourcery-ai · 2026-05-20T22:04:19Z

Reviewer's Guide

Introduces exact-match handling for specific entity labels in the entity disambiguation pipeline by propagating a normalized subclass key from anonymization postprocessing into canonical entity building, so that those labels cluster strictly by exact value instead of fuzzy similarity; also adjusts a notebook to point to a different example document.

Sequence diagram for exact-match handling in entity disambiguation

sequenceDiagram
    participant AnonymizationPostprocess
    participant FuzzyDisambiguation
    participant CanonicalEntities

    AnonymizationPostprocess->>AnonymizationPostprocess: process(ent)
    AnonymizationPostprocess->>AnonymizationPostprocess: cleaned_text = pattern.sub("", ent.text)
    AnonymizationPostprocess->>AnonymizationPostprocess: ent.attrs.aymurai_label_subclass = []
    alt label in exact_labels
        AnonymizationPostprocess->>AnonymizationPostprocess: flattened_text = re.sub("[^a-zA-Z0-9]", "", cleaned_text)
        AnonymizationPostprocess->>AnonymizationPostprocess: ent.attrs.aymurai_label_subclass.append(flattened_text)
    end
    AnonymizationPostprocess->>AnonymizationPostprocess: ent.attrs.aymurai_alt_text = cleaned_text

    FuzzyDisambiguation->>FuzzyDisambiguation: build_canonical_entities(labels, target_labels, threshold)
    FuzzyDisambiguation->>FuzzyDisambiguation: grouped.setdefault(aymurai_label, []).append({text, aymurai_label, exact_alias})
    loop for each label_type, items in grouped.items()
        alt label_type in EXACT_LABELS
            FuzzyDisambiguation->>FuzzyDisambiguation: exact_groups.setdefault(exact_alias, []).append(item)
            FuzzyDisambiguation->>FuzzyDisambiguation: clusters = list(exact_groups.values())
        else
            FuzzyDisambiguation->>FuzzyDisambiguation: clusters = _cluster_aliases_with_cdist(items, threshold)
        end
        FuzzyDisambiguation->>CanonicalEntities: _clusters_to_canonical_entities(clusters)
    end
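
The branching in the diagram above can be sketched roughly as follows. This is an illustration, not the code from `fuzzy.py`: the function name `cluster_items` is hypothetical, and the fuzzy step is passed in as a callable because the body of `_cluster_aliases_with_cdist` is not shown in this PR page.

```python
from typing import Callable, Dict, List

# Labels listed in the PR diff as requiring exact matching.
EXACT_LABELS = {
    "DNI", "CUIT_CUIL", "TELEFONO", "PATENTE_DOMINIO",
    "IP", "NUM_CAJA_AHORRO", "CBU", "NUM_MATRICULA",
}

Item = Dict[str, str]
Cluster = List[Item]


def cluster_items(
    label_type: str,
    items: List[Item],
    fuzzy_cluster: Callable[[List[Item]], List[Cluster]],
) -> List[Cluster]:
    """Group exact-identifier labels by exact_alias; defer others to fuzzy_cluster."""
    if label_type in EXACT_LABELS:
        exact_groups: Dict[str, Cluster] = {}
        for item in items:
            exact_groups.setdefault(item["exact_alias"], []).append(item)
        return list(exact_groups.values())
    return fuzzy_cluster(items)


# Made-up sample data; the lambda is a trivial stand-in for the fuzzy step.
items = [
    {"text": "12.345.678", "aymurai_label": "DNI", "exact_alias": "12345678"},
    {"text": "12345678", "aymurai_label": "DNI", "exact_alias": "12345678"},
    {"text": "12.345.679", "aymurai_label": "DNI", "exact_alias": "12345679"},
]
clusters = cluster_items("DNI", items, fuzzy_cluster=lambda xs: [xs])
print([len(c) for c in clusters])
```

With this grouping, two identifiers that differ by a single character never share a cluster, regardless of the fuzzy threshold.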

File-Level Changes

Change	Details	Files
Propagate an exact, normalized subclass alias per entity (for certain labels) from anonymization postprocessing into the entity disambiguation pipeline so those entities are clustered by exact value rather than fuzzy distance.	Define a shared set of labels that must be treated with exact matching semantics (e.g., DNI, CUIT_CUIL, TELEFONO, etc.). In anonymization postprocess, derive a cleaned, alphanumeric-only value for exact-match labels and store it in the entity attrs under aymurai_label_subclass, initializing that field as a list and appending the flattened value. In canonical entity building, compute an exact_alias from aymurai_label_subclass (handling both list and scalar cases) alongside the existing alias text, and include it in the grouped item structure. For labels in the exact-match set, form clusters by grouping items with identical exact_alias instead of using the distance-based clustering; for other labels, keep the existing fuzzy clustering behavior.	`aymurai/utils/entity_disambiguation/fuzzy.py` `aymurai/transforms/anonymization_postprocess/core.py`
Adjust the experiment notebook to run on a different sample document index.	Change the selected document index from 14 to 5 when choosing doc_path for processing in the entity disambiguation anonymization experiment notebook	`notebooks/experiments/entity-disambiguation/10-anonymize-document-render-policy.ipynb`
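
The table mentions that `exact_alias` is computed from `aymurai_label_subclass` while handling both list and scalar values, but the page does not show how. One possible shape is sketched below; the function name `derive_exact_alias`, the choice of the first non-empty list entry, and the `None` fallback are all assumptions, not the PR's actual logic.

```python
from typing import Any, Dict, Optional


def derive_exact_alias(attrs: Dict[str, Any]) -> Optional[str]:
    """Return a single alias from a list-or-scalar subclass field (hypothetical)."""
    subclass = attrs.get("aymurai_label_subclass")
    if isinstance(subclass, (list, tuple)):
        # Assumption: use the first non-empty entry of the list.
        subclass = next((value for value in subclass if value), None)
    if subclass is None or subclass == "":
        # Assumption: signal "no exact alias" with None.
        return None
    return str(subclass)


print(derive_exact_alias({"aymurai_label_subclass": ["12345678"]}))
print(derive_exact_alias({"aymurai_label_subclass": "12345678"}))
```

How the real code treats a missing or empty subclass matters for grouping: if every such item received the same fallback key, unrelated entities could end up in one cluster.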


sourcery-ai

Hey - I've found 2 issues, and left some high-level feedback:

The exact_labels set is duplicated in both fuzzy.py and core.py; consider centralizing this constant in a shared module to avoid divergence and make future updates easier.
In anonymization_postprocess/core.py, aymurai_label_subclass is always reset to an empty list; verify whether you should preserve any existing subclasses or guard against overwriting previously set values.
The notebook change from documents[14] to documents[5] looks like a local experiment tweak; confirm this is the intended default behavior and not a temporary debugging choice.

Prompt for AI Agents

Please address the comments from this code review:

## Overall Comments
- The `exact_labels` set is duplicated in both `fuzzy.py` and `core.py`; consider centralizing this constant in a shared module to avoid divergence and make future updates easier.
- In `anonymization_postprocess/core.py`, `aymurai_label_subclass` is always reset to an empty list; verify whether you should preserve any existing subclasses or guard against overwriting previously set values.
- The notebook change from `documents[14]` to `documents[5]` looks like a local experiment tweak; confirm this is the intended default behavior and not a temporary debugging choice.

## Individual Comments

### Comment 1
<location path="aymurai/utils/entity_disambiguation/fuzzy.py" line_range="10-18" />
<code_context>
 from aymurai.meta.api_interfaces import DocLabel
 from aymurai.meta.entities import CanonicalEntity

+EXACT_LABELS = {
+    "DNI",
+    "CUIT_CUIL",
+    "TELEFONO",
+    "PATENTE_DOMINIO",
+    "IP",
+    "NUM_CAJA_AHORRO",
+    "CBU",
+    "NUM_MATRICULA",
+}
+
</code_context>
<issue_to_address>
**suggestion:** Avoid duplicating the exact-label set in multiple modules by centralizing it

This set is defined here as `EXACT_LABELS` and again in `anonymization_postprocess/core.py` as `exact_labels`. Please move it to a shared constants module and import it in both places so there’s a single source of truth and no risk of the two definitions drifting out of sync.

Suggested implementation:

```python
from aymurai.meta.api_interfaces import DocLabel
from aymurai.meta.entities import CanonicalEntity
from aymurai.meta.constants import EXACT_LABELS

```

1. Create (or extend) a shared constants module, for example `aymurai/meta/constants.py`, and move the set definition there:

```python
EXACT_LABELS = {
    "DNI",
    "CUIT_CUIL",
    "TELEFONO",
    "PATENTE_DOMINIO",
    "IP",
    "NUM_CAJA_AHORRO",
    "CBU",
    "NUM_MATRICULA",
}
```

2. In `anonymization_postprocess/core.py`, replace the local `exact_labels` definition with an import from the same constants module, e.g.:

```python
from aymurai.meta.constants import EXACT_LABELS as exact_labels
```

(or adjust naming/import style to match existing conventions in that file).

3. Ensure the `aymurai/meta/` directory is an importable package (add an `__init__.py` if it lacks one) and update any relevant `__all__` if your project uses it.
</issue_to_address>

### Comment 2
<location path="aymurai/transforms/anonymization_postprocess/core.py" line_range="60-64" />
<code_context>
+            "NUM_MATRICULA",
+        }
+
+        ent["attrs"]["aymurai_label_subclass"] = []
+
+        if label in exact_labels:
+            flattened_text = re.sub(r"[^a-zA-Z0-9]", "", cleaned_text)
+            ent["attrs"]["aymurai_label_subclass"].append(flattened_text)
+
         # Update the entity's alt text and indices
</code_context>
<issue_to_address>
**issue (bug_risk):** Re-initializing `aymurai_label_subclass` may unintentionally discard previous subclass information

Unconditionally assigning `ent["attrs"]["aymurai_label_subclass"] = []` clears any existing data in this field before you append the new value. If earlier steps in the pipeline set this attribute (now or in the future), this could cause data loss. Consider only initializing when absent (e.g., via `setdefault`/`get`) or otherwise making this logic additive rather than destructive.
</issue_to_address>
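
A non-destructive variant of the snippet in Comment 2 might look like the sketch below. It assumes that `ent["attrs"]` is a plain dict and that any pre-existing `aymurai_label_subclass` value is a list; the function name `record_exact_subclass` is hypothetical and the label set is abbreviated.

```python
import re

# Abbreviated for the sketch; the full set appears in Comment 1.
EXACT_LABELS = {"DNI", "CBU"}


def record_exact_subclass(ent: dict, label: str, cleaned_text: str) -> None:
    """Append the flattened value without discarding existing subclasses."""
    # setdefault only creates the list when the key is absent.
    subclasses = ent["attrs"].setdefault("aymurai_label_subclass", [])
    if label in EXACT_LABELS:
        flattened_text = re.sub(r"[^a-zA-Z0-9]", "", cleaned_text)
        if flattened_text not in subclasses:
            subclasses.append(flattened_text)


# Made-up entity that already carries a subclass from an earlier step.
ent = {"attrs": {"aymurai_label_subclass": ["prior"]}}
record_exact_subclass(ent, "DNI", "12.345.678")
print(ent["attrs"]["aymurai_label_subclass"])
```

One trade-off to weigh: if earlier subclass values are preserved, the disambiguation step can no longer assume the list holds only the exact alias, so it would need a rule for picking the right element.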


Copilot

Pull request overview

This PR adjusts entity disambiguation/anonymization so certain identifier-like labels (e.g., DNI/CBU/IP) are treated as exact identifiers (no fuzzy clustering), using a normalized “exact alias” derived from label subclass metadata.

Changes:

Add an EXACT_LABELS path in canonical-entity building to group exact-identifier labels by a normalized alias instead of fuzzy clustering.
Update anonymization postprocessing to store a normalized subclass value for exact-identifier labels to support exact grouping.
Update an experimental notebook to process a different sample document.
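
To see why identifier-like labels are a poor fit for fuzzy clustering, consider the following sketch. It uses the standard library's `difflib.SequenceMatcher` as a stand-in similarity measure; the scorer actually used inside `_cluster_aliases_with_cdist` is not shown in this PR page, and the two numbers are made up.

```python
from difflib import SequenceMatcher

# Two different made-up identifiers that differ only in the last digit.
a, b = "30123456", "30123457"

similarity = SequenceMatcher(None, a, b).ratio()
print(f"similarity={similarity:.3f}, identical={a == b}")
```

The strings are nearly identical as text, so a similarity-based threshold could merge them, even though they would refer to different people or accounts. Grouping by the exact normalized value avoids that failure mode.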

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File	Description
notebooks/experiments/entity-disambiguation/10-anonymize-document-render-policy.ipynb	Changes which document sample index is processed in the experiment.
aymurai/utils/entity_disambiguation/fuzzy.py	Introduces exact-identifier grouping logic during canonical entity construction.
aymurai/transforms/anonymization_postprocess/core.py	Records a normalized subclass value for exact-identifier labels during entity cleaning.


879309c8 Feat/entity manager mention feedback (#81)
4d2de106 Fix/responsive home layout (#80)
986e68d2 Fix/homogenize file check ui (#77)
046f8ab9 fix(file-annotator): fix upward autoscroll on search previous navigation (#76)
2ecf75dc feat(dependencies): add dnd-kit packages for drag-and-drop functionality

git-subtree-dir: frontend
git-subtree-split: 879309c841d8072babc4d06f1686d11cf8cbd03f

…pe_doclabels function

…me span

* Merge dev into main for v1.1.12 (#57) * Update README.md * 🐛 bugfix: Fix XML special character escaping in DocAnonymizer * ➕ build(deps): Add python-docx package * ✨ feat: Add watermark and hyperlink functionality to document anonymization * ✨ feat: Install Archivo font in Dockerfile * 🎨 refactor: Improve Dockerfile structure and comments for clarity * ⏪ revert: Remove Archivo font installation from Dockerfile * 🔖 feat: Update aymurai package version to 1.1.11 in uv.lock * 🐛 Improve get_extension logic to fix document extraction issues on Windows and remove python-magic dependency * 🔧 Update Dockerfile to use 'bullseye' variant for Python images for improved compatibility * 🔧 Update Makefile targets for improved Docker workflow * 🔖 feat: Bump aymurai package version to 1.1.12 * ♻️ Harden get_extension with header scans and zip safeguards * 🔧 Extend document extraction timeout to 30s * 🔧 Refactor Docker workflow to build and push images using docker/build-push-action * 🔧 Fix workflow step order to correctly extract tag name before building Docker images * 🔧 Remove tag extraction step and use github.ref_name directly for Docker image builds * ⏪ Revert Docker workflow to extract tag name and use it for image versioning * Update .github/workflows/build-docker-image.yml Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * ✏️ Remove incomplete comment Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --------- Co-authored-by: jed <jedzill4@users.noreply.github.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Squashed 'frontend/' content from commit 9123e6f git-subtree-dir: frontend git-subtree-split: 9123e6ff047ddc6da0528d1de827a4af68752d0f * Squashed 'frontend/' changes from 9123e6ff..8add5c45 8add5c45 1.25.0 d7424d94 use base_url when dealing with public assets ae1a47b8 Merge pull request #44 from AymurAI/feat/redesign 9cfc610c fix issue regarding annotation keyboard navigation a723be01 restore useNotify 
feature in process page b888214a make how it works modal bigger 49c3bc22 fix electron ts issues eb791d07 add missing anonymize value on label attrs 151e2710 make OS taskbar API safe to call in web 8dc38fbb add knip a0f6c687 fix ts clone element issue d975e3fd include label policies and annotations when anonymizing document 70c44a04 more fixes in height to validate dataset a36a7401 make suggestion clickable on uncontrolled text input ea61e4e2 add fixed height to validation dataset page 072d751e remove axios' default api url 6f8a478c adjust typings 528c646f add more canonical id operations on reducers ae09277c add update by cannonical id function de2b641c add random cannonical id on search add event 74d4dac2 add value as undefined on ui/input component 6aa895f4 simplify and remove unused code on fille annotator component d10fe8c7 add useShallow to local storage default getters 9f62ac69 adjust predicting and file parser flows 01aa8a79 fix TS issues 5655eac5 smaller fixes on file annotation components a7e91cec add remove dialog to label manager's entity tab cad8faaa simplify annotation components 52230ee4 add metadata to search and tag annotations 0f08b9c8 add more anonymizer copy locale texts 6c3112ac store label manager config in local storage 63c4d731 move suggestion label and add mark props 77017543 adjust spacing on dataset validation cae3b907 disambiguate hook 6b21903e move createAnnotationData, tag and suffix to context d1649e5d add max width to toast and remove instad of dismiss it ca56bb63 add size variant to suggestion and adjust display 55a67bf8 fix select viewport and add scroll aa2ce6a1 adjust styles on callout 5493cdee finish mark and tag annotation feature 4fb28120 finish major pages and add file protection feature af457c85 improve accessibility on host page c2c21c21 add className to SectionTitle 6edd7101 update tanstack react query version df491c17 add odt to pdf conversion efb04bc5 deprecate usage of useSchemedQuery and useSchemedQueries 046a73e2 
deprecate usage of useSchemedMutation 65ceb195 swap replace button icons 12348edd update button icons 2719755f extract tagger popover logic 28ce40ab create search tagger e54adfbe add small version of select component d6c3bd77 improve ui dialog component 63c6bf09 initial mark component update cd0f7d30 add radix's select as dependency 92ceaace remove old tooltip implementation 2499e4c8 add input size variants be48dac9 add custom icons to callout, toast and showToast 32d3441b add retro compatibility features to select 657e2a5a minor styling fixes + new select imports 7a323cd2 kill more unused components c5d7b4ba create better select component d855a93b adjust styling and positioning on preview page 1cfba9c4 adjust callout styling c28fdcd0 fix title in process copy etxt 3a9a5cc6 remove old component implementations 03954d74 create callout from toast, and then apply a11y to toast 93c9457d create toast component 642e760d improve suggestion component 707e844b update suggestion mark component's styles 7c1db956 add checked variant to button 204246c7 fix gaps on finish page 9e81ee12 hide file stepper selector arrows depending on cursor da1377ea update file processing component 12aa1c6b update decision tabs to panda 490a1adb simplify finish dataset and anonymizer ba12dad2 add className prop to footer 4ed68454 drop file queries when resetting the progress 762cf823 add missing built by in features page 2c9d6cf8 make dataset validation file annotator not annotable 59416b8f clear files on features menu 9585726b use translation on droparea 091b2c58 make search bar static and follow scroll 0978aff6 add label manager to file annotator 5b069d68 connect rest of label manager 23fa8407 improve layout header component 3c88d2f3 finish preview page d31ec034 finish onboarding page 7d1a71f3 add disambiguate and predict react query options 4e91d285 feat: add feature icon record bfc9baf7 feat: create initial label manager 0bc48bc9 chore: refactor Searchbar 962961e4 chore: remove stepper 
component 95e17235 fix: title in anonymizer locale df53488d chore: minor changes 4a9a2557 feat: create switch component d4830e48 feat: create label manager component baa9835c feat: add more copies for the finish page 9d915cab chore: kill unused old hidden input component 2fd61a7f fix: add missing feature param call on route.tsx 617ea974 feat: use a11y on finish page 172b731a Merge branch 'feat/a11y-dataset-header' into feat/redesign 394121d9 feat: add more copies to locale file c34951e8 fix: redo home layout 31078ff5 chore: cleanup cf9def05 chore: restructure HOC to be a regular component 088bed82 feat: add api base url protection and apply it cd15a776 fix(layout): address PR review on header and icon changes 59dc125e feat: add i18n support for the whole app 36d70bbd build: add i18n 244027be refactor(layout): hoist Topbar and Stepper to global wizard route 146905fb feat: improve topbar accessibility with semantic icons and aria labels 5f910470 chore: ignore personal analysis folder b6569f77 chore: rework layout components 331ff8dc fix: update enum import ee17865e feat(ui): create and/or adjust components 6c9daf38 feat: rework onboarding page 347054d5 chore: simplify main app layout cef7f882 chore: adjust button sizes and enum import dc59ad2a feat: make card clickable 051b6bd4 chore: add tutorial seen to local storage store 1e15bb74 fix: typo on anonymizer label 24f6de07 build: add web or electron run modes beb63bc3 chore: migrate hidden input 23442095 chore: use constants and base card element on feature selector 2eba84b3 chore: refactor header so we can correctly position all elements 0a50bb14 fix: adjust stepper styles (sizing and colors) 504840d9 chore: export constants 3b79f391 build: update react and add radix dependencies f6f7fac8 fix: remove fadeIn scaling animation 41a158c3 chore: create modern ui components 5a02a5df chore: flag card as deprecated edb85211 feat: create modern tooltip component a2c2040a fix: replace brand images with correct ones and set 
proper heights edce7295 feat(components): create brand, layout and ui components 5475a617 feat: add more brand images 9cd982f8 chore(styles): add animation semantic tokens ce991130 fix: extra character in home layout and rename the component 5e41df1a feat: redesign home 0947627c feat: create link card tool for features 0833a987 feat: create components to render in home screen a6198d99 fix: add fixed height to button and auto adjust icon size ab0a4e20 fix: add lineheight to text styles and adjust font weight 047b115a chore: replace custom use mutation hook with base on connect to host hook 59363cd7 chore: change to named export on local store 3a289e1e chore: add changes to router file 2393d18e chore: fix some tokens in panda and move stitches global styles 64048ac7 chore: re-implement button and partially input 51fe0990 feat: add loading screen on boot, timer of 1.5s 60cf7fde chore: configure view transition for all pages ebcf3a66 feat: add loading page and updated branding images 67603f66 chore: flag stitches as deprecated e416ace8 build: install and configure pandacss c2e5eac1 build: add support for environmental variables for both web and electron apps git-subtree-dir: frontend git-subtree-split: 8add5c452478cdbe6a99ad1b05183cd264183c72 * ✨ Add frontend routing and settings for frontend distribution directory * Squashed 'frontend/' changes from 8add5c45..ff882164 ff882164 chore: add .npmrc to configure public hoist pattern for @types 32bfab0a Merge pull request #59 from AymurAI/fix/53-restore-home-button 94e3816d Merge pull request #65 from AymurAI/fix/add-placeholder-to-select-entities 3545775f Merge pull request #66 from AymurAI/fix/remove-doc-extension d24c458b add a "config" button in features menu f427da73 make hover effect in button work for anchor tanstack link wrapper 77077905 remove slot checks on header 82d2c65a add home button to header on all flow's pages 5663da42 make aymurai's logo a link in the header c3c0e8a7 create home button component 51cb537f 
fix: prevent select caret rotation from leaking ancestor data-state 5d85973f feat: add tooltips to tagger label and suffix inputs 7e0b67b8 Fix text overflow in HowItWorksModal (#58) 3d516636 Restore delete-one and delete-all hover actions on annotations (#57) fa7e0967 create link component c8112e20 add "Entidad" placeholder to tagger select 70710e11 change NINO to NIÑO 6664e631 remove copies and functions referencing .doc files 4a00039b copy change 422affaf Merge pull request #64 from AymurAI/fix/browser-resources-exhaustion ce8a474a prevent semaphores underflow dcf3b6b7 Merge pull request #62 from AymurAI/fix/conversion-endpoint-usage 7729801f add error handling to finish file conversion 2afe00ee Merge pull request #63 from AymurAI/fix/copy-changes 387d6724 Merge pull request #61 from AymurAI/fix/responsiveness 84843877 limit concurrent predict requests to avoid connection exhaustion 8d3e2693 fix: increase spacing between home and features menu buttons 3025e341 fix: use House icon in header instead of BackButton arrow c9bba1a9 feat: add back-to-home button on features page a0b8f55f use extension to check if file conversion is needed 0b83d0b0 create pdf to odt service 78b1a9c9 responsiveness for screens less than 1280px in width fef9a94e copy change on label manager tab b9c6db53 copy changes on label manager config tab git-subtree-dir: frontend git-subtree-split: ff882164be8077dee58b6748886b0d7d3acbe376 * 🔧 Remove commented-out router for anonymizer database * ✨ Add Node.js and npm installation for frontend build in Dockerfile * 📝 Update API documentation URLs to include '/api' prefix * ✨ Add frontend build commands to Makefile * 🙈 Update .dockerignore and .gitignore to include frontend build output directories * ✅ Update API routes to include '/api' prefix in tests and add frontend integration tests * ♻️ Refactor routing and API integration to remove '/app' prefix and streamline feature routes * Squashed 'frontend/' changes from ff882164..d3e14b5e d3e14b5e 
feat(validation): persist and restore predictions via backend validation endpoint (#68) 6b8a23ba Add drag-and-drop reordering and inline rename to label manager (#60) git-subtree-dir: frontend git-subtree-split: d3e14b5e00af41fded1c113e51e2e8b73bbf1b22 * refactor: update feature routing, migrate to pnpm, and refine dev environment configuration * Squashed 'frontend/' changes from d3e14b5e..879309c8 879309c8 Feat/entity manager mention feedback (#81) 4d2de106 Fix/responsive home layout (#80) 986e68d2 Fix/homogenize file check ui (#77) 046f8ab9 fix(file-annotator): fix upward autoscroll on search previous navigation (#76) 2ecf75dc feat(dependencies): add dnd-kit packages for drag-and-drop functionality git-subtree-dir: frontend git-subtree-split: 879309c841d8072babc4d06f1686d11cf8cbd03f * Squashed 'frontend/' changes from 879309c8..a37adc20 a37adc20 fix(useLocal): stop persisting groupOrder and remove dead categoryAssignments (#78) (#87) 47fb1fb3 fix(disambiguate): match response items by text instead of array index (#78) (#86) c873c8e0 Mover configuración de pnpm de `package.json` a `pnpm-workspace.yml` (#83) f4ce881b fix(useFileParse): use position-based paragraph ID to avoid key collisions (#85) 181e0356 Fix/invalid entity offsets (#82) git-subtree-dir: frontend git-subtree-split: a37adc20f579276b3a0e5979424ba7809fb7e2ff * chore: migrate frontend build process from npm to pnpm in API Dockerfile * 🐛 fix: add support for numpy integer and floating types in EnhancedJSONEncoder * fix: update Stack component to use height instead of minHeight for consistent layout * fix: update imports for Label and Text components in UncontrolledInput to avoid circular dependency * chore: regenerate routeTree.gen.ts after removing $feature parent layout route * feat: add default anonymization policies to settings * chore: bump frontend version to 1.5.0 * fix(api): preserve pipeline cache for configured ttl * refactor: remove torch dependency and configure threads via settings * 
fix(frontend): replace previous anonymizer file on load * fix(frontend): support dataset export in web mode * fix(tests): add SQLALCHEMY_DATABASE_URI environment variable for api tests * fix(api): improve error logging during startup --------- Co-authored-by: jed <jedzill4@users.noreply.github.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: dmazzini <dmazzini@gmail.com>

* ➕ build(deps): Add langextract for text entity extraction * 🚧 wip: Add langextract entity extraction experiment notebook * ✨ feat: Enhance entity models with relation handling and canonical representation * ✨ feat: Add JSON serialization support and enhance utility functions * ⬆️ Upgrade ML dependencies and refresh uv.lock * 🚧 wip: Update extraction examples in langextract notebook * 📝 Add entity disambiguation notebook for canonical entity extraction * ⬆️ Update dependencies: langextract to 1.1.0 and ollama to 0.6.1; add openai extra for langextract * 📝 Integrate custom OpenAI model for extraction and remove failing empty example * 📝 Update error message format in json_serial function for better readability Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com> * ♻️ Inline immediate return in get_pretty Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com> * 🐛 Fix: Use json_serial in save_json Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * 🎨 Format json.dumps call in save_json for improved readability * Feature/ollama service (#59) * ✨ Add GPU-enabled Ollama service to compose stack * 🔧 Add Make targets for managing Ollama service and models * 🔧 Add launch configuration and task for starting Ollama service * Feature/llm providers (#60) * ✨ Add GPU-enabled Ollama service to compose stack * 🔧 Add Make targets for managing Ollama service and models * 🔧 Add launch configuration and task for starting Ollama service * ✨ Implement LLM providers module with Ollama adapter and shared abstractions * ✅ Add unit tests for LLM providers including DummyProvider and OllamaLLMProvider * 📝 Document Ollama provider usage via notebook demo * 🐛 Fix tokenizer encoding by removing unnecessary special tokens flag * ♻️ Refactor chunk handling in LLMProvider to use _append_chunk method for consistency and improved readability * ✨ Enhance Ollama provider docs and DRY response building for sync/async 
calls * ♻️ Refactor OllamaLLMProvider to reuse AsyncClient instance for improved efficiency * 📝 Add async examples to OllamaLLMProvider notebook * ✅ Add async coverage for OllamaLLMProvider and tighten chunking tests * ♻️ Refactor OllamaLLMProvider to remove async client caching and streamline client instantiation * Feature/disambiguation metric v2 (#62) * Update .gitignore to exclude entity disambiguation experiment directories and modify Jupyter notebook execution counts and output handling * Refactor Makefile for improved service management and update .gitignore to exclude specific experiment directories. Add new Jupyter notebooks for entity disambiguation metrics and documentation. * Adjust example data for consistency in entity representation. * Refactor entity disambiguation notebooks to standardize attribute naming and improve metric evaluation. Update role attribute from 'rol' to 'role' for consistency across examples and documentation. Adjust evaluation function to return both score and metrics. * Add evaluation metrics for entity disambiguation - Introduced new metrics module for evaluating entity disambiguation performance, including functions for alias normalization, Jaccard similarity, and greedy matching. - Implemented main evaluation function to compute scores and metrics from gold and predicted entities. - Added Jupyter notebooks for practical examples and evaluation results, including normalized and non-normalized text evaluations. - Updated documentation to reflect changes in function signatures and outputs. 
* 🔧 Expand Makefile: add API management targets (api-run, api-stop, api-logs, api-full-run) for smoother service control * ♻️ Refactor metrics.py: clarify docstrings, align type hints, and polish logging * ✏️ Fix role attribute reference in evaluation metric documentation for consistency * 🔧 Add CanonicalEntities class to represent a collection of canonical entities * 📝 Update entity disambiguation notebooks: clean up imports, adjust paths, and streamline API calls for improved clarity and functionality --------- Co-authored-by: padonizetti Co-authored-by: jansaldo * Feature/summarization (#61) * ✨ feat: Add Streamlit app for document summarization experiments * Add statistical analysis notebook for summarization performance evaluation( Visualized gaps in performance between CPU and CUDA models, llm alucinations) * 🎨 Quantitative and qualitative analysis of summaries: descriptive analysis by features, model comparison, gap analusis (CPU-CUDA), Garbage detection/outliers, analysis by document, visuailzations. * 🔒️ clear all outputs * 🎨 Improve Summary Analysis per document: cuda vs llama (same model), gemma vs llama (cuda), same document phi3 vs. phi4. Token per second gap. 
* ✨ Add YAML utility functions for loading and saving data * Merge dev into main for v1.1.12 (#57) * Update README.md * 🐛 bugfix: Fix XML special character escaping in DocAnonymizer * ➕ build(deps): Add python-docx package * ✨ feat: Add watermark and hyperlink functionality to document anonymization * ✨ feat: Install Archivo font in Dockerfile * 🎨 refactor: Improve Dockerfile structure and comments for clarity * ⏪ revert: Remove Archivo font installation from Dockerfile * 🔖 feat: Update aymurai package version to 1.1.11 in uv.lock * 🐛 Improve get_extension logic to fix document extraction issues on Windows and remove python-magic dependency * 🔧 Update Dockerfile to use 'bullseye' variant for Python images for improved compatibility * 🔧 Update Makefile targets for improved Docker workflow * 🔖 feat: Bump aymurai package version to 1.1.12 * ♻️ Harden get_extension with header scans and zip safeguards * 🔧 Extend document extraction timeout to 30s * 🔧 Refactor Docker workflow to build and push images using docker/build-push-action * 🔧 Fix workflow step order to correctly extract tag name before building Docker images * 🔧 Remove tag extraction step and use github.ref_name directly for Docker image builds * ⏪ Revert Docker workflow to extract tag name and use it for image versioning * Update .github/workflows/build-docker-image.yml Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * ✏️ Remove incomplete comment Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --------- Co-authored-by: jed <jedzill4@users.noreply.github.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * ✨ Add GPU-enabled Ollama service to compose stack * 🔧 Add Make targets for managing Ollama service and models * 🔧 Add launch configuration and task for starting Ollama service * 🔧 Add system prompts for document summarization * 📝 Add summarization benchmark notebook * 🚚 Move statistical analysis notebook to summarization folder * ✨ Implement 
LLM providers module with Ollama adapter and shared abstractions * ✅ Add unit tests for LLM providers including DummyProvider and OllamaLLMProvider * 📝 Document Ollama provider usage via notebook demo * 🐛 Fix tokenizer encoding by removing unnecessary special tokens flag * ♻️ Refactor chunk handling in LLMProvider to use _append_chunk method for consistency and improved readability * ✨ Enhance Ollama provider docs and DRY response building for sync/async calls * ♻️ Refactor OllamaLLMProvider to reuse AsyncClient instance for improved efficiency * 📝 Add async examples to OllamaLLMProvider notebook * ✅ Add async coverage for OllamaLLMProvider and tighten chunking tests * ➕ Add tiktoken dependency to pyproject.toml and update version in uv.lock * 🔧 Enhance summarization prompts with additional information extraction and entity identification details * ✨ Add LLM summarization router * 📝 Add notebook for the summarization endpoint * ✏️ Fix formatting of keys in summarization defaults for consistency * ➕ Add dspy dependency and update related packages in project configuration * 🚧 WIP: Add prompt optimization notebook for summarization experiments --------- Co-authored-by: Sofi <sofiamorenadelpozo@gmail.com> Co-authored-by: jed <jedzill4@users.noreply.github.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * 🩹 Fix YAML key names in prompt defaults for summarization * ♻️ refactor: Restructure USEM module with factory pattern and multipl… (#64) * Merge dev into main for v1.1.12 (#57) * Update README.md * 🐛 bugfix: Fix XML special character escaping in DocAnonymizer * ➕ build(deps): Add python-docx package * ✨ feat: Add watermark and hyperlink functionality to document anonymization * ✨ feat: Install Archivo font in Dockerfile * 🎨 refactor: Improve Dockerfile structure and comments for clarity * ⏪ revert: Remove Archivo font installation from Dockerfile * 🔖 feat: Update aymurai package version to 1.1.11 in uv.lock * 🐛 Improve get_extension logic to 
fix document extraction issues on Windows and remove python-magic dependency * 🔧 Update Dockerfile to use 'bullseye' variant for Python images for improved compatibility * 🔧 Update Makefile targets for improved Docker workflow * 🔖 feat: Bump aymurai package version to 1.1.12 * ♻️ Harden get_extension with header scans and zip safeguards * 🔧 Extend document extraction timeout to 30s * 🔧 Refactor Docker workflow to build and push images using docker/build-push-action * 🔧 Fix workflow step order to correctly extract tag name before building Docker images * 🔧 Remove tag extraction step and use github.ref_name directly for Docker image builds * ⏪ Revert Docker workflow to extract tag name and use it for image versioning * Update .github/workflows/build-docker-image.yml * ✏️ Remove incomplete comment --------- * ♻️ refactor: Restructure USEM module with factory pattern and multiple encoder backends - Add BaseSentenceEncoder abstract base class for encoder interface - Implement factory pattern with EncoderType enum and create_encoder function - Add sentence-transformers encoder implementations (DistilUSE, MultilingualMiniLM) - Move TensorFlow implementation to tensorflow_encoder.py - Add lazy loading for encoder implementations via __getattr__ - Add auto-detection for Apple Silicon compatibility (defaults * 🚚 Rename test sentence encoders mac notebook * 📌 Sync dependencies --------- * ⏪ Rollback to previous torch and torchtext versions to avoid conflicts * 🩹 Fix: Add missing environment variable for OLLAMA_HOST in docker-compose * 📝 Add anonymization pipeline docs * 🚧 WIP: Add Playwright PJN scraper * 📝 Add Jupyter notebook for entity disambiguation from pre-clustered validations * Feature/pdf extraction upgrade (#65) * 🔧 Configure VSCode Python env and Copilot scopes * 🔧 Include resources/llm in .dockerignore * 📌 Update dependencies in pyproject.toml and uv.lock * 🔧 Update Dockerfile and devcontainer.json to install additional PDF tooling * ♻️ Refactor Makefile and 
docker-compose.yml for improved service configuration and flexibility * 🚧 FIXME: Remove DecisionConv1dBinRegex model from pipeline configuration for dependencies update compatibility * 🔧 Set weights_only=False for torch.load compatibility * ✨ Enhance PDF extraction with marker integration and improved text processing * 🔧 Update run_safe_text_extraction to allow indefinite timeout by default * ✨ Add warm_marker_models function to initialize marker-pdf artifacts at startup * 🔥 Remove unused environment variables and rename TRANSFORMERS_CACHE to HF_HOME * 🔧 Improve service stopping logic for Ollama and API services in Makefile * 🔖 Bump aymurai package version to 2.0.0-alpha.1 * 🔧 Update HF_HOME path and remove HF_DATASETS_CACHE variable in .env.common * 🔧 Update OLLAMA_HOST for GPU-enabled services to point to ollama-gpu * 🔧 Simplify marker model warming logic by removing error handling * ♻️ Refactor text extraction into modular format-specific extractors * ✅ Add unit tests for document extraction and error handling * ➕ Add marker-pdf stack and drop textract * 🔧 Enhance PDF extraction with caching mechanism * 📝 Improve cache utility functions with enhanced docstrings and type hints * 🔧 Enhance cache key generation in PdfExtractor for improved stability and performance * 🔖 Update aymurai package version to 2.0.0a2.dev9 * Feature/remove usem tensorflow deps (#68) * 🩹 Ensure consistent entity attributes in reformat_entity function and reorder imports * 📝 Update subcategories exploration notebook * ⚗️ Add TensorFlow deprecation experiment notebook * ♻️ Refactor entity subcategorization: Remove USEMSubcategorizer, add SentenceTransformerSubcategorizer - Removed the USEMSubcategorizer implementation from `usem.py`. - Introduced new Jupyter notebooks for testing and evaluating the SentenceTransformerSubcategorizer. - Updated the pipeline configuration to utilize SentenceTransformerSubcategorizer with local embeddings instead of remote URLs. 
* ♻️ Refactor download function: Replace gdown with requests for improved file downloading * 🔥 Remove empty peft model module * ➖ Remove TensorFlow and gdown dependencies from pyproject.toml * 📌 Update uv.lock * ♻️ Refactor sentence encoder module: Remove unused dependencies and streamline factory functions * 🔖 Update aymurai package version to 2.0.0a3.dev9 * WIP: feat(decision): ✨ integrate TinyEmbeddingBagClassifier for decision detection (#67) * feat(decision): ✨ integrate TinyEmbeddingBagClassifier for decision detection - Introduced a new model class `DecisionEmbeddingBagBinRegex` using `TinyEmbeddingBagClassifier`. - Updated model loading and saving mechanisms to support safetensors format. - Added a new training notebook for the embedding bag classifier. - Modified the pipeline configuration to include the new model. * ⚡️ Remove unidecode usage to avoid double normalization in model_input_from_text * 📝 Add type hints and docstrings for clarity in DecisionEmbeddingBagBinRegex and TinyEmbeddingBagClassifier * 🔧 Refactor import statements for safetensors to remove try-except block * 🔥 Remove Conv1dTextClassifier, Tokenizer and SpanishTokenizer implementations * 🐛 Fix gen_aymurai_entity call by removing unused category parameter * 🔖 Update aymurai package version to 2.0.0a4.dev1 * 🔥 Remove TensorFlow environment variables * Feature/mlfow integration (#66) * feat: add mlflow-based experiments and services (wip) * feat: finalize mlflow experiment runner and artifact logging * feat: add OpenAI ChatGPT extension and update postStartCommand in devcontainer * 📝 Unify disambiguation evaluation notebooks * 📝 Enhance documentation and add type hints across multiple modules * 📌 Update uv.lock * 🔧 Update devcontainer GPU device configuration * 🔧 Change default Python environment manager to venv * 🔧 Add container names for all services in docker-compose.yml * ➖ Remove commented optional dependencies for GPU support in pyproject.toml * 🔧 Increase document request timeout 
from 30 to 300 seconds in .env.common * 🚚 Changed environment variable names from DOCUMENT_API_BASE_URL and DOCUMENT_REQUEST_TIMEOUT to API_BASE_URL and REQUEST_TIMEOUT * 🔧 Update dependency installation to include 'mlops' group in entrypoint.sh * 🔖 Update aymurai package version to 2.0.0a5.dev8 * Feature/document extract config (#69) * ✨ Enhance document extraction with caching and configuration options * ✅ Update extractor tests to handle additional configuration parameters and improve error handling * 🔧 Update marker model warmup to include configuration setup for improved initialization * 🔖 Update aymurai package version to 2.0.0a6.dev3 * ⏪ Revert multiprocessing context change in run_safe_text_extraction * 🔖 Update aymurai package version to 2.0.0a6.dev5 * 🔥 Remove unused multiprocessing import from document_extract.py * 🔥 Remove unused logging import from extraction.py * 🔧 Change default value of force_ocr to False in pdf_to_text function * 📝 Update argument descriptions in pdf_to_text and plain_text_extractor functions to include default values * 📝 Remove duplicate argument description for path in BaseExtractor.extract method * Feature/pre disambiguation optimization (#70) * New pre-disambiguation feature notebooks * New pre-disambiguation feature notebooks and per-label feature added to metrics.py * Conclusion added to pre-cluster investigation * Set ocr variable to True in utils.py * Changes in grid search function to store the best pre-clustered entities in a dedicated directory * New LLM inference function in notebook 07 * New LLM grid search inference function * Add disambiguation endpoint and utility functions for entity grouping * Remove unused models and tokenizers to streamline the codebase * Fix type hints for processor functions to avoid runtime errors * Endpoint /disambiguate with LLM Inference (#72) * Changes in old 07 notebook adding the usage of the disambiguate endpoint and its own name * New token counter to check if the LLM inference won't 
hallucinate * New tokenizer function for token counting and processing specific documents * Batch optimization feature in llm-inference function * Mapping feature added to llm-inference function * Updated the /disambiguate endpoint to return DocumentAnnotations similar to the NER predictions, now enriched with role and entity_id fields where applicable. * New /disambiguatev2 endpoint which runs the LLM inference and returns the DocumentAnnotations list with the role and the canonical_entity_id where applicable. When a prediction wasn't mapped, the program generates a canonical_entity_id * New /disambiguatev2 endpoint which runs the LLM inference and returns the DocumentAnnotations list with the role and the canonical_entity_id where applicable. When a prediction wasn't mapped, the program generates a canonical_entity_id * New /disambiguatev2 endpoint which runs the LLM inference and returns the DocumentAnnotations list with the role and the canonical_entity_id where applicable. When a prediction wasn't mapped, the program generates a canonical_entity_id * New updates on endpoint /disambiguatev2 and notebook 07 * Cleaned code in anonymizer.py and utils.py following Raúl's comments * New classes defined for LLM prompts to validate each set of prompts per label before the LLM inference * Sorted canonical entities before LLM inference to try to avoid processing, in separate batches, two or more canonical entities that are actually the same one * Cleaned anonymizer.py script and experimental notebook 07 discarding the old pre-cluster endpoint. * Cleaned anonymizer.py script and experimental notebook 07 discarding the old pre-cluster endpoint. * Cleaned anonymizer.py script and experimental notebook 07 discarding the old pre-cluster endpoint. New disambiguation.py script to store functions to pre-clusterize the canonical entities. * Cleaned anonymizer.py script and experimental notebook 07 discarding the old pre-cluster endpoint. 
New disambiguation.py script to store functions to pre-clusterize the canonical entities. * Code cleaned following Juli's comments regarding the new /disambiguate endpoint * Remove unused relations field from CanonicalEntity class for LLM inference phase * Final changes to the code adding the entity_disambiguation.yaml to handle the prompts * Add entity disambiguation utilities and enhance canonical entity processing - Introduced new utility functions for entity disambiguation in `fuzzy.py`. - Implemented `assign_label_instances` and `map_canonical_entities_ner_preds` in `core.py`. - Added LLM inference capabilities in `llm.py` for refining canonical entities. - Updated `entities.py` to include `aymurai_label_instance` for ordered label indexing. * Refactor anonymizer and paragraph modules for improved entity disambiguation and serialization * Remove unused logger import from paragraph module * Reviewed code and added some features to 07 experiment notebook * Implement label policies for disambiguation and anonymization; enhance entity processing and prediction mapping * New datetime formatter function and changes to old code; there is a bug on my OS, which does not support setlocale * New functionality added to get_canonical_dates for dates with the same day and month * New functionality added to get_canonical_dates for dates with the same day and month * 🐛 Fix entity handling in anonymizer and datapublic routers when use_cache is disabled to improve label processing * Remove commented-out code * DatetimeFormatter used after NER predictions in postprocess so we only have to take the datetime from aymurai_label_subclass to build the canonical entities from dates * Fix locale setting for date formatting to ensure correct month name handling * Add docstring for get_canonical_dates function to clarify input and output * Remove DIRECCION prompt templates * Update notebook formatting, remove unused MODE param and improve code readability * Update uv.lock * Hotfix: resolve 
file pathing, logic indentation, and date disambiguation - Update configuration path in llm.py from .yaml to .yml. - Fix indentation in core.py for canonical_entity_id assignment. This ensures all predictions receive an ID even if they lack a canonical match, bypassing the 'aymurai_label_subclass == 0' filter which caused issues with date formatting in NER post-processing. - Add condition in anonymizer.py to trigger 'get_canonical_dates' only when FECHA is present in 'fuzzy_labels'. This prevents unintended date disambiguation when the policy is set to None. * Feature/anonymize document refactor (#73) * Add render policy support and refactor anonymization logic for improved token rendering * 📝Update anonymization docs * ♻️ Refactor: modularize document anonymization * 📝 Rename notebook for document anonymization with render policy * FECHA disambiguation bug fixed, label and render policies changed and whole code reviewed for PR * ⏪ Revert entrypoint.sh to 1ac2776 * ⏪ Revert .dockerignore to 5af5814 * ⏪ Revert .env.common to 90f7369 * ⏪ Revert .vscode/launch.json to f366690 * ⏪ Revert Makefile to cb3df05 * ⏪ Revert aymurai/api.core.py to 19a9ca8 * 🦖 Changed aymurai/api/endpoints/routers/anonymizer/anonymizer.py for release/v1.5.0 compatibility * 🔥 Removed aymurai/api/endpoints/routers/llm for release/v1.5.0 compatibility * 🦖 Changed aymurai/api/endpoints/routers/misc/document_extract.py for release/v1.5.0 compatibility 🦖 Changed aymurai/text/extractors/pdf.py for release/v1.5.0 compatibility 🦖 Changed aymurai/text/extractors/utils.py for release/v1.5.0 compatibility * ⏪ Revert aymurai/api/main.py to a801bf4 * 🔥 Removed aymurai/api/startup/marker.py for release/v1.5.0 compatibility * 🔥 aymurai/experiments/entity_disambiguation folder for release/v1.5.0 compatibility * 🔥 Removed aymurai/llm_providers for release/v1.5.0 compatibility * 🦖 Changed aymurai/settings.py for release/v1.5.0 compatibility * 🦖 Changed aymurai/api/endpoints/routers/anonymizer/anonymizer.py for 
release/v1.5.0 compatibility 🦖 Changed aymurai/utils/entity_disambiguation/__init__.py for release/v1.5.0 compatibility 🔥 Removed aymurai/utils/entity_disambiguation/llm.py for release/v1.5.0 compatibility * ⏪ Reverted docker-compose.yml to 5b9c220 * ⏪ Revert docker/api/Dockerfile to 4196117 * 🦖 Changed docs/anonymization/README.md for release/v1.5.0 compatibility * 🔥 Removed docs/experiments/README.md for release/v1.5.0 compatibility 🔥 Removed docs/experiments/base.yaml for release/v1.5.0 compatibility * 🔥 Removed notebooks/experiments/anonymization/05-langextract.ipynb for release/v1.5.0 compatibility * 🔥 Removed all notebooks in notebooks/experiments/entity-disambiguation related to LLM disambiguation for release/v1.5.0 compatibility * 🔥 Removed notebooks/experiments/llm-providers for release/v1.5.0 compatibility * 🔥 Removed notebooks/experiments/summarization for release/v1.5.0 compatibility * 🦖 Changed pyproject.toml for release/v1.5.0 compatibility * 🔥 Removed resources/llm for release/v1.5.0 compatibility * 🔥 Removed summarization_app for release/v1.5.0 compatibility * 🔥 Removed test/llm_providers for release/v1.5.0 compatibility * 🐛 Bug fixed in pyproject.toml line 106 for .venv build up * 🐛 Bug fixed in function '_normalize_text' from 'aymurai.text.extractors.utils' that was changed to 'normalize_text' because it's used in aymurai/text/extractors/docx.py * ⏪ Revert removal of folder aymurai/experiments/entity_disambiguation for experimental purposes. Deleting everything was a mistake; files will be changed in the next commit. 
* 🔥 Removed aymurai/experiments/entity_disambiguation for release/v1.5.0 compatibility * 🐛 Bug fixed in experiments/entity-disambiguation/10-anonymize-document-render-policy.ipynb for release/v1.5.0 compatibility * 🔥 Removed TESSDATA_PREFIX from .env.common * 🙈 Update .gitignore to include notebooks directory while excluding subdirectories and non-IPYNB files * 🔀 Synthesize docker-compose from 26033a8f/00709164 after b05b768 rollback * 🔀 Synthesize Makefile from afbfda9/d80f74b/26033a8f after f645881 rollback * 🔧 Fix repository URL case sensitivity in pyproject.toml and remove unused dependencies * 🔥 Remove tasks.json configuration for Ollama service * 🔥 Remove scraper and documentation * 🔥 Remove experiment module * 🔥 Remove path utility functions from paths.py * 🔥 Remove unused PromptSet and PromptLibrary classes, and simplify disambiguation options in LabelPolicy * 🔥 Remove EntityRelation class and its associated methods from entities.py * 📝 Enhance documentation with detailed docstrings for various functions across multiple modules * 🔥 Removed PromptLibrary class from aymurai/api/endpoints/routers/anonymizer/anonymizer.py for release/v1.5.0 compatibility 🔥 Removed `llm` disambiguation label policy for release/v1.5.0 compatibility * 🎨 Changed map_canonical_entities_ner_preds function in aymurai/utils/entity_disambiguation/core.py discarding the role assignment for release/v1.5.0 compatibility 🎨 Changed aymurai/api/endpoints/routers/anonymizer/anonymizer.py discarding all the validations that had to do with LLM disambiguation for release/v1.5.0 compatibility 🎨 Minor changes in the rest of documents regarding to experimentation with the release/v1.5.0 API * 🔀 Synthesize document_extract from d349c69 after 3c55d8e: remove extractor config passthrough and restore fixed timeout * 🔀 Synthesize PDF extraction flow from d349c69/26033a8: remove cache/debug path * 🔥 Remove text extraction tests * 📝 Update description formatting for aymurai_disambiguation field in 
EntityAttributes * 🦖 Update PdfExtractor.extract method to include ignored keyword arguments for backward compatibility * 🔥 Remove unused static logo file from API resources * 🔧 Add version_scheme configuration to setuptools_scm in pyproject.toml * 📌 Update uv.lock * 📝 Reorganize and update v1.5.0 documentation (EN/ES) * 🚚 Rename full-paragraph pipeline to datapublic across code and docs * ci(tests): add API + pipeline integration tests on linux and windows (#74) * Merge dev into main for v1.1.12 (#57) * Update README.md * 🐛 bugfix: Fix XML special character escaping in DocAnonymizer * ➕ build(deps): Add python-docx package * ✨ feat: Add watermark and hyperlink functionality to document anonymization * ✨ feat: Install Archivo font in Dockerfile * 🎨 refactor: Improve Dockerfile structure and comments for clarity * ⏪ revert: Remove Archivo font installation from Dockerfile * 🔖 feat: Update aymurai package version to 1.1.11 in uv.lock * 🐛 Improve get_extension logic to fix document extraction issues on Windows and remove python-magic dependency * 🔧 Update Dockerfile to use 'bullseye' variant for Python images for improved compatibility * 🔧 Update Makefile targets for improved Docker workflow * 🔖 feat: Bump aymurai package version to 1.1.12 * ♻️ Harden get_extension with header scans and zip safeguards * 🔧 Extend document extraction timeout to 30s * 🔧 Refactor Docker workflow to build and push images using docker/build-push-action * 🔧 Fix workflow step order to correctly extract tag name before building Docker images * 🔧 Remove tag extraction step and use github.ref_name directly for Docker image builds * ⏪ Revert Docker workflow to extract tag name and use it for image versioning * Update .github/workflows/build-docker-image.yml Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * ✏️ Remove incomplete comment Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --------- Co-authored-by: jed <jedzill4@users.noreply.github.com> 
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * WIP: feat(decision): ✨ integrate TinyEmbeddingBagClassifier for decision detection (#67) * feat(decision): ✨ integrate TinyEmbeddingBagClassifier for decision detection - Introduced a new model class `DecisionEmbeddingBagBinRegex` using `TinyEmbeddingBagClassifier`. - Updated model loading and saving mechanisms to support safetensors format. - Added a new training notebook for the embedding bag classifier. - Modified the pipeline configuration to include the new model. * ⚡️ Remove unidecode usage to avoid double normalization in model_input_from_text * 📝 Add type hints and docstrings for clarity in DecisionEmbeddingBagBinRegex and TinyEmbeddingBagClassifier * 🔧 Refactor import statements for safetensors to remove try-except block * 🔥 Remove Conv1dTextClassifier, Tokenizer and SpanishTokenizer implementations * 🐛 Fix gen_aymurai_entity call by removing unused category parameter * 🔖 Update aymurai package version to 2.0.0a4.dev1 * 🔀 cherry-pick(decision): modernize decision model and upgrade ML dependencies Cherry-pick TinyEmbeddingBagClassifier (safetensors) replacing Conv1d model. Remove dead deps (torchtext, pytorch-lightning), upgrade torch to 2.x and flair to 0.15.1. 
* 🐛 cherry-pick(fix): datapublic and anonymizer crash when use_cache is disabled * test(infra): rewrite test infrastructure with architecture guide standards - Delete old test files (test_document_extract.py, test_anonymizer_predict.py, test_datapublic_predict.py) - Create new directory structure: tests/integration/pipelines/, tests/api/routers/{anonymizer,datapublic,misc}/ - Rewrite tests/conftest.py: - Set env vars at module level (RESOURCES_BASEPATH=resources, SQLALCHEMY_DATABASE_URI=sqlite:///:memory:) - Remove torch mock and lazy loader - Direct imports from production code - Clean fixtures: db_engine (session-scoped), db_session (function-scoped), client (with dependency override) - Test data builders: build_data_item(), build_label(), build_anonymization_paragraph(), build_datapublic_paragraph() - Update pyproject.toml with [tool.pytest.ini_options]: strict-markers, integration/slow markers Verification: uv run python -c 'import tests.conftest' succeeds, pytest collection clean * test(conftest): add pipeline loading helpers and mock factories for API tests Wave 2 complete: integration pipeline conftest + API router conftest Integration pipeline conftest: - PIPELINE_CONFIGS dict for flair-anonymizer and full-paragraph - load_test_pipeline() helper with print_config=False - Session-scoped fixtures for both pipelines (expensive model loading) - build_pipeline_input() test data builder - sample_text fixture with Spanish legal text API router conftest: - build_mock_pipeline() factory with MagicMock - Mock preprocess/predict_single/postprocess methods - build_processed_data_item() test data builder - Re-exports builders from root conftest * test(api): add document extract endpoint tests with mocked extraction * test(api): add anonymizer and datapublic endpoint tests with mocked pipelines * test(integration): add pipeline integration tests for flair-anonymizer and full-paragraph * ✅ test: refactor test infrastructure and add integration tests - Reorganize test 
conftest files to proper hierarchy (tests/api/conftest.py) - Add pytest to dependency groups in pyproject.toml - Refactor API router tests to use centralized fixtures and builders - Add real document extraction tests with DOCX/PDF generators - Improve pipeline integration tests with fixture-based stages - Fix label serialization to use model_dump(mode="json") - Update UUID generation for datapublic tests to use uuid.uuid5 - Add cache path environment setup for integration tests - Clean up imports and remove unused dependencies - Remove empty test file (document_extract.py) This refactoring improves test maintainability, adds proper integration testing without excessive mocking, and establishes consistent test utilities across the codebase. * 👷 ci(github): add pytest workflow for CI integration - Introduced a new GitHub Actions workflow for running pytest. - Configured to trigger on pull requests and manual dispatch. - Supports multiple OS and Python versions for comprehensive testing. * 👷fix(tests): fix env variable DISKCACHE_ROOT * 👷 ci(github): remove deprecated PR tests workflow & fix env variable - Deleted the old PR tests workflow file. - This cleanup helps streamline CI processes and reduces redundancy. * ci(github): 👷 add pipeline download and integration tests to CI workflow - Introduced a new script for downloading pipelines. - Updated the pytest workflow to include running API and pipeline tests. - Enhanced test execution with improved output formatting and failure limits. * fix(tests): 🐛 avoid context manager in TestClient to skip app startup - Changed TestClient usage to prevent app lifespan startup during tests. - Ensured proper cleanup by closing the client after use. - This improves test performance and reliability. * 👷 ci(github): add RESOURCES_BASEPATH environment variable for pipeline tests - Added RESOURCES_BASEPATH to the environment variables for both downloading pipelines data and running pipeline tests. 
- This change ensures that the necessary resource paths are correctly set during the CI workflow execution. * 👷 ci(github): update RESOURCES_BASEPATH for pipeline data download - Changed RESOURCES_BASEPATH from /tmp to resources in the pipeline download step. - Ensures the correct path is used for resource access during tests. * chore(pyproject): 🔧 add environment markers for platform compatibility - Introduced required-environments for tool.uv to specify platform requirements. - Updated resolution-markers and required-markers in uv.lock for better dependency management. - Added tensorflow-io-gcs-filesystem with specific markers for Windows and Linux. * ci(github): 👷 configure es_AR locale for Ubuntu runners - Added steps to configure the es_AR locale on Ubuntu. - Ensures proper locale settings for tests running in the CI environment. * 👷 ci(github): add AYMURAI_CACHE_BASEPATH environment variable for pipeline tests - Introduced AYMURAI_CACHE_BASEPATH to the environment variables for both pipeline download and pipeline tests. - This change ensures that the correct cache path is utilized during the execution of the tests. * 🐛 fix(dependencies): adjust textract dependency for platform compatibility - Added conditional dependency for textract based on the operating system. - Specified different sources for textract depending on whether the platform is Windows or not. * 🔥 chore(opencode): remove opencode.json configuration file - Deleted the opencode.json file as it is no longer needed. - This change helps to clean up the repository and remove obsolete configurations. 
* 🚚 Update pipeline path for datapublic in scripts, notebooks and tests * 📝 docs: replace Black references with Ruff in CONTRIBUTING and Alembic hook examples * 🔧 Add backslash to default CACHE_BASEPATH value * 🔧 Update cache path retrieval to use settings for consistency * ➖ Remove textract dependencies and update documentation for extract_document function * ✅ Update integration tests and add new test cases for anonymizer and datapublic flows * 🔥 chore(test): remove legacy /test dir and standardize sample doc path to /resources/data/sample/document-01.docx * 🔧 Update UV_VERSION to latest in devcontainer Dockerfile * 🔧 Update dependency installation command to include all groups * 📌 Update uv.lock * 🐛 Fix CACHE_BASEPATH env alias resolution for CI pipeline downloads * Feature/pdf layout anonymization (#76) * ✨ feat(extractors): use pymupdf layout for pdf text extraction * ✨ feat(normalization): enhance document normalization to preserve paragraph structure * 📝 docs: document default values for extractor and normalization helpers * 🩹 fix(extractors): use pymupdf4llm.to_text with page_chunks for pdf paragraphs * ♻️ Add DOCX and PDF anonymizer modules - Implemented DocxAnonymizer class to handle anonymization of DOCX documents by replacing sensitive data with label tokens. This includes functionality for unzipping documents, parsing XML, editing content, and adding watermarks. - Developed PdfAnonymizer class for anonymizing PDF documents, utilizing pymupdf for document manipulation. This includes layout parsing, font caching, redaction operations, and watermarking. 
* 🔧 Enhance PDF and DOCX handling in anonymization process * 📝 Update backend module references for document rendering in README * ✅ Update tests to use DOCX format for document anonymization and enhance mock behavior * ✨ Add end-to-end PDF anonymization notebook with PyMuPDF and AymurAI API * ♻️ Rework PDF anonymization for precise spans and widget handling * 🔧 Update model_dump calls to exclude None values for improved data handling * 📝 Add docstrings to label replacement functions * ♻️ Refactor watermark handling and optimize PDF token aliasing * ✅ Add integration tests for merging fragmented numeric labels and excluding null alt attributes in PDF anonymization * ➖ Remove opencv-python-headless dependency from project requirements * ♻️ Implement paragraph splitting function to enhance document text extraction * 🔧 Update dependency installation command to prevent Python downloads * 🔥 Remove redundant tests for merging fragmented numeric labels and PDF anonymization * ♻️ Refactor anonymizer tests to use DOCX format and enhance mock functionality * 🔧 Add xfail marker for PDF extraction test on Windows due to tensor type issue * ✨ Enhance PDF anonymization by adding cleanup rects, removing overlapping links, and scrubbing metadata * 🔧 Remove redundant return statement in _label_replacement_text function * ♻️ Refactor anonymization module: split pdf and docx internals by format * ✅ Add integration tests for PDF and DOCX anonymizers, including metadata scrubbing and link preservation * ✨ Add watermark layout adjustments to avoid footer content overlap in PDF anonymization * ✅ Add integration test to ensure watermark is positioned away from footer content in PDF anonymization * 🩹 Fix: read docx xml as utf-8 across platforms * ✅ Add Windows-specific xfail marker for PDF tests and implement UTF-8 XML reading test * 🐛 Remove unnecessary --extra runtime flag from uv sync command * 🐛 Date formatter bug fixed for canonical entities generation. 
* 🐛 Fix duplicate DocLabel handling in anonymization and serialization processes * ✅ Add tests for deduplication of duplicate labels in cached predictions and disambiguation processes * 🐛 Fix handling of non-alphanumeric entities by returning None for empty cleaned text (#81) * 🩹 Change default timeout value in run_safe_text_extraction function from 30 to 300 seconds * 🚸 Update PDF_TOKEN_ALIAS_MAP with clearer aliases * Fix/pdf signature anonymization (#82) * 🧪 test(pdf): cover signature anonymization regressions * 🐛 fix(pdf): preserve signature appearance when redacting signer names * ✅ test(pdf): add focused signature geometry tests * ♻️ refactor(pdf): rename distance function for clarity and update references * 📝 docs(pdf): clarify signature widget flattening process in preparation function * ✅ test(pdf): cover signature review edge cases * 🐛 Bug fix for exact entities. (#80) * 🐛 Bug fixed for entities that are exact identifiers and must bypass the fuzzy matching algorithm. * ⚡️ Improved structure following Copilot comments. * ⚗️ Experimentation. 
* 🐛 Merge duplicate labels for the same span and AymurAI label in _dedupe_doclabels function * ✅ Add integration test for merging cached duplicate labels for the same span --------- Co-authored-by: jansaldo <julianansaldo@gmail.com> * Feature/frontend integration (#83) * Merge dev into main for v1.1.12 (#57) * Update README.md * 🐛 bugfix: Fix XML special character escaping in DocAnonymizer * ➕ build(deps): Add python-docx package * ✨ feat: Add watermark and hyperlink functionality to document anonymization * ✨ feat: Install Archivo font in Dockerfile * 🎨 refactor: Improve Dockerfile structure and comments for clarity * ⏪ revert: Remove Archivo font installation from Dockerfile * 🔖 feat: Update aymurai package version to 1.1.11 in uv.lock * 🐛 Improve get_extension logic to fix document extraction issues on Windows and remove python-magic dependency * 🔧 Update Dockerfile to use 'bullseye' variant for Python images for improved compatibility * 🔧 Update Makefile targets for improved Docker workflow * 🔖 feat: Bump aymurai package version to 1.1.12 * ♻️ Harden get_extension with header scans and zip safeguards * 🔧 Extend document extraction timeout to 30s * 🔧 Refactor Docker workflow to build and push images using docker/build-push-action * 🔧 Fix workflow step order to correctly extract tag name before building Docker images * 🔧 Remove tag extraction step and use github.ref_name directly for Docker image builds * ⏪ Revert Docker workflow to extract tag name and use it for image versioning * Update .github/workflows/build-docker-image.yml Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * ✏️ Remove incomplete comment Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --------- Co-authored-by: jed <jedzill4@users.noreply.github.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Squashed 'frontend/' content from commit 9123e6f git-subtree-dir: frontend git-subtree-split: 9123e6ff047ddc6da0528d1de827a4af68752d0f 
* Squashed 'frontend/' changes from 9123e6ff..8add5c45 8add5c45 1.25.0 d7424d94 use base_url when dealing with public assets ae1a47b8 Merge pull request #44 from AymurAI/feat/redesign 9cfc610c fix issue regarding annotation keyboard navigation a723be01 restore useNotify feature in process page b888214a make how it works modal bigger 49c3bc22 fix electron ts issues eb791d07 add missing anonymize value on label attrs 151e2710 make OS taskbar API safe to call in web 8dc38fbb add knip a0f6c687 fix ts clone element issue d975e3fd include label policies and annotations when anonymizing document 70c44a04 more fixes in height to validate dataset a36a7401 make suggestion clickable on uncontrolled text input ea61e4e2 add fixed height to validation dataset page 072d751e remove axios' default api url 6f8a478c adjust typings 528c646f add more canonical id operations on reducers ae09277c add update by cannonical id function de2b641c add random cannonical id on search add event 74d4dac2 add value as undefined on ui/input component 6aa895f4 simplify and remove unused code on fille annotator component d10fe8c7 add useShallow to local storage default getters 9f62ac69 adjust predicting and file parser flows 01aa8a79 fix TS issues 5655eac5 smaller fixes on file annotation components a7e91cec add remove dialog to label manager's entity tab cad8faaa simplify annotation components 52230ee4 add metadata to search and tag annotations 0f08b9c8 add more anonymizer copy locale texts 6c3112ac store label manager config in local storage 63c4d731 move suggestion label and add mark props 77017543 adjust spacing on dataset validation cae3b907 disambiguate hook 6b21903e move createAnnotationData, tag and suffix to context d1649e5d add max width to toast and remove instad of dismiss it ca56bb63 add size variant to suggestion and adjust display 55a67bf8 fix select viewport and add scroll aa2ce6a1 adjust styles on callout 5493cdee finish mark and tag annotation feature 4fb28120 finish major pages and 
add file protection feature af457c85 improve accessibility on host page c2c21c21 add className to SectionTitle 6edd7101 update tanstack react query version df491c17 add odt to pdf conversion efb04bc5 deprecate usage of useSchemedQuery and useSchemedQueries 046a73e2 deprecate usage of useSchemedMutation 65ceb195 swap replace button icons 12348edd update button icons 2719755f extract tagger popover logic 28ce40ab create search tagger e54adfbe add small version of select component d6c3bd77 improve ui dialog component 63c6bf09 initial mark component update cd0f7d30 add radix's select as dependency 92ceaace remove old tooltip implementation 2499e4c8 add input size variants be48dac9 add custom icons to callout, toast and showToast 32d3441b add retro compatibility features to select 657e2a5a minor styling fixes + new select imports 7a323cd2 kill more unused components c5d7b4ba create better select component d855a93b adjust styling and positioning on preview page 1cfba9c4 adjust callout styling c28fdcd0 fix title in process copy etxt 3a9a5cc6 remove old component implementations 03954d74 create callout from toast, and then apply a11y to toast 93c9457d create toast component 642e760d improve suggestion component 707e844b update suggestion mark component's styles 7c1db956 add checked variant to button 204246c7 fix gaps on finish page 9e81ee12 hide file stepper selector arrows depending on cursor da1377ea update file processing component 12aa1c6b update decision tabs to panda 490a1adb simplify finish dataset and anonymizer ba12dad2 add className prop to footer 4ed68454 drop file queries when resetting the progress 762cf823 add missing built by in features page 2c9d6cf8 make dataset validation file annotator not annotable 59416b8f clear files on features menu 9585726b use translation on droparea 091b2c58 make search bar static and follow scroll 0978aff6 add label manager to file annotator 5b069d68 connect rest of label manager 23fa8407 improve layout header component 3c88d2f3 
finish preview page d31ec034 finish onboarding page 7d1a71f3 add disambiguate and predict react query options 4e91d285 feat: add feature icon record bfc9baf7 feat: create initial label manager 0bc48bc9 chore: refactor Searchbar 962961e4 chore: remove stepper component 95e17235 fix: title in anonymizer locale df53488d chore: minor changes 4a9a2557 feat: create switch component d4830e48 feat: create label manager component baa9835c feat: add more copies for the finish page 9d915cab chore: kill unused old hidden input component 2fd61a7f fix: add missing feature param call on route.tsx 617ea974 feat: use a11y on finish page 172b731a Merge branch 'feat/a11y-dataset-header' into feat/redesign 394121d9 feat: add more copies to locale file c34951e8 fix: redo home layout 31078ff5 chore: cleanup cf9def05 chore: restructure HOC to be a regular component 088bed82 feat: add api base url protection and apply it cd15a776 fix(layout): address PR review on header and icon changes 59dc125e feat: add i18n support for the whole app 36d70bbd build: add i18n 244027be refactor(layout): hoist Topbar and Stepper to global wizard route 146905fb feat: improve topbar accessibility with semantic icons and aria labels 5f910470 chore: ignore personal analysis folder b6569f77 chore: rework layout components 331ff8dc fix: update enum import ee17865e feat(ui): create and/or adjust components 6c9daf38 feat: rework onboarding page 347054d5 chore: simplify main app layout cef7f882 chore: adjust button sizes and enum import dc59ad2a feat: make card clickable 051b6bd4 chore: add tutorial seen to local storage store 1e15bb74 fix: typo on anonymizer label 24f6de07 build: add web or electron run modes beb63bc3 chore: migrate hidden input 23442095 chore: use constants and base card element on feature selector 2eba84b3 chore: refactor header so we can correctly position all elements 0a50bb14 fix: adjust stepper styles (sizing and colors) 504840d9 chore: export constants 3b79f391 build: update react and add 
radix dependencies f6f7fac8 fix: remove fadeIn scaling animation 41a158c3 chore: create modern ui components 5a02a5df chore: flag card as deprecated edb85211 feat: create modern tooltip component a2c2040a fix: replace brand images with correct ones and set proper heights edce7295 feat(components): create brand, layout and ui components 5475a617 feat: add more brand images 9cd982f8 chore(styles): add animation semantic tokens ce991130 fix: extra character in home layout and rename the component 5e41df1a feat: redesign home 0947627c feat: create link card tool for features 0833a987 feat: create components to render in home screen a6198d99 fix: add fixed height to button and auto adjust icon size ab0a4e20 fix: add lineheight to text styles and adjust font weight 047b115a chore: replace custom use mutation hook with base on connect to host hook 59363cd7 chore: change to named export on local store 3a289e1e chore: add changes to router file 2393d18e chore: fix some tokens in panda and move stitches global styles 64048ac7 chore: re-implement button and partially input 51fe0990 feat: add loading screen on boot, timer of 1.5s 60cf7fde chore: configure view transition for all pages ebcf3a66 feat: add loading page and updated branding images 67603f66 chore: flag stitches as deprecated e416ace8 build: install and configure pandacss c2e5eac1 build: add support for environmental variables for both web and electron apps git-subtree-dir: frontend git-subtree-split: 8add5c452478cdbe6a99ad1b05183cd264183c72 * ✨ Add frontend routing and settings for frontend distribution directory * Squashed 'frontend/' changes from 8add5c45..ff882164 ff882164 chore: add .npmrc to configure public hoist pattern for @types 32bfab0a Merge pull request #59 from AymurAI/fix/53-restore-home-button 94e3816d Merge pull request #65 from AymurAI/fix/add-placeholder-to-select-entities 3545775f Merge pull request #66 from AymurAI/fix/remove-doc-extension d24c458b add a "config" button in features menu f427da73 
make hover effect in button work for anchor tanstack link wrapper 77077905 remove slot checks on header 82d2c65a add home button to header on all flow's pages 5663da42 make aymurai's logo a link in the header c3c0e8a7 create home button component 51cb537f fix: prevent select caret rotation from leaking ancestor data-state 5d85973f feat: add tooltips to tagger label and suffix inputs 7e0b67b8 Fix text overflow in HowItWorksModal (#58) 3d516636 Restore delete-one and delete-all hover actions on annotations (#57) fa7e0967 create link component c8112e20 add "Entidad" placeholder to tagger select 70710e11 change NINO to NIÑO 6664e631 remove copies and functions referencing .doc files 4a00039b copy change 422affaf Merge pull request #64 from AymurAI/fix/browser-resources-exhaustion ce8a474a prevent semaphores underflow dcf3b6b7 Merge pull request #62 from AymurAI/fix/conversion-endpoint-usage 7729801f add error handling to finish file conversion 2afe00ee Merge pull request #63 from AymurAI/fix/copy-changes 387d6724 Merge pull request #61 from AymurAI/fix/responsiveness 84843877 limit concurrent predict requests to avoid connection exhaustion 8d3e2693 fix: increase spacing between home and features menu buttons 3025e341 fix: use House icon in header instead of BackButton arrow c9bba1a9 feat: add back-to-home button on features page a0b8f55f use extension to check if file conversion is needed 0b83d0b0 create pdf to odt service 78b1a9c9 responsiveness for screens less than 1280px in width fef9a94e copy change on label manager tab b9c6db53 copy changes on label manager config tab git-subtree-dir: frontend git-subtree-split: ff882164be8077dee58b6748886b0d7d3acbe376 * 🔧 Remove commented-out router for anonymizer database * ✨ Add Node.js and npm installation for frontend build in Dockerfile * 📝 Update API documentation URLs to include '/api' prefix * ✨ Add frontend build commands to Makefile * 🙈 Update .dockerignore and .gitignore to include frontend build output directories * 
✅ Update API routes to include '/api' prefix in tests and add frontend integration tests * ♻️ Refactor routing and API integration to remove '/app' prefix and streamline feature routes * Squashed 'frontend/' changes from ff882164..d3e14b5e d3e14b5e feat(validation): persist and restore predictions via backend validation endpoint (#68) 6b8a23ba Add drag-and-drop reordering and inline rename to label manager (#60) git-subtree-dir: frontend git-subtree-split: d3e14b5e00af41fded1c113e51e2e8b73bbf1b22 * refactor: update feature routing, migrate to pnpm, and refine dev environment configuration * Squashed 'frontend/' changes from d3e14b5e..879309c8 879309c8 Feat/entity manager mention feedback (#81) 4d2de106 Fix/responsive home layout (#80) 986e68d2 Fix/homogenize file check ui (#77) 046f8ab9 fix(file-annotator): fix upward autoscroll on search previous navigation (#76) 2ecf75dc feat(dependencies): add dnd-kit packages for drag-and-drop functionality git-subtree-dir: frontend git-subtree-split: 879309c841d8072babc4d06f1686d11cf8cbd03f * Squashed 'frontend/' changes from 879309c8..a37adc20 a37adc20 fix(useLocal): stop persisting groupOrder and remove dead categoryAssignments (#78) (#87) 47fb1fb3 fix(disambiguate): match response items by text instead of array index (#78) (#86) c873c8e0 Mover configuración de pnpm de `package.json` a `pnpm-workspace.yml` (#83) f4ce881b fix(useFileParse): use position-based paragraph ID to avoid key collisions (#85) 181e0356 Fix/invalid entity offsets (#82) git-subtree-dir: frontend git-subtree-split: a37adc20f579276b3a0e5979424ba7809fb7e2ff * chore: migrate frontend build process from npm to pnpm in API Dockerfile * 🐛 fix: add support for numpy integer and floating types in EnhancedJSONEncoder * fix: update Stack component to use height instead of minHeight for consistent layout * fix: update imports for Label and Text components in UncontrolledInput to avoid circular dependency * chore: regenerate routeTree.gen.ts after removing $feature 
parent layout route * feat: add default anonymization policies to settings * chore: bump frontend version to 1.5.0 * fix(api): preserve pipeline cache for configured ttl * refactor: remove torch dependency and configure threads via settings * fix(frontend): replace previous anonymizer file on load * fix(frontend): support dataset export in web mode * fix(tests): add SQLALCHEMY_DATABASE_URI environment variable for api tests * fix(api): improve error logging during startup --------- Co-authored-by: jed <jedzill4@users.noreply.github.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: dmazzini <dmazzini@gmail.com> * 🔥 Remove TensorFlow related environment variables in Dockerfile * 📝 Update documentation for AymurAI v1.5.0 --------- Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Paolo Donizetti <padonizetti@gmail.com> Co-authored-by: Sofi <sofiamorenadelpozo@gmail.com> Co-authored-by: jed <jedzill4@users.noreply.github.com> Co-authored-by: Lio <lionel.chamorro85@gmail.com> Co-authored-by: conrabeatriz <conrabeatriz@gmail.com> Co-authored-by: dmazzini <dmazzini@gmail.com>
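The exact-entities fix (#80) and the empty-text fix (#81) listed in the log above can be sketched roughly as follows. This is a minimal illustration, not the project's code: the regex comes from the reviewer's sequence diagram, while `EXACT_LABELS`, `exact_alias`, and `group_exact` are hypothetical names, and the label values are placeholders.

```python
import re
from collections import defaultdict
from typing import Optional

# Hypothetical label set: the real list of exact-identifier labels is not shown here.
EXACT_LABELS = {"DNI", "CUIJ"}


def exact_alias(text: str) -> Optional[str]:
    """Strip every non-alphanumeric character; return None if nothing is left."""
    flattened = re.sub(r"[^a-zA-Z0-9]", "", text)
    return flattened or None


def group_exact(entities):
    """Group (label, text) pairs by (label, exact alias) instead of fuzzy similarity."""
    groups = defaultdict(list)
    for label, text in entities:
        if label not in EXACT_LABELS:
            continue  # other labels would go through the fuzzy path (not shown)
        alias = exact_alias(text)
        if alias is not None:
            groups[(label, alias)].append(text)
    return dict(groups)
```

Under these assumptions, `12.345.678` and `12345678` land in the same group, while `12.345.679` stays separate even though a fuzzy matcher would likely score it as a near-duplicate.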

* Update README.md * Update README.md * Update README.md * Update README.md * 🐛 bugfix: Fix XML special character escaping in DocAnonymizer * ➕ build(deps): Add python-docx package * ✨ feat: Add watermark and hyperlink functionality to document anonymization * ✨ feat: Install Archivo font in Dockerfile * 🎨 refactor: Improve Dockerfile structure and comments for clarity * ⏪ revert: Remove Archivo font installation from Dockerfile * 🔖 feat: Update aymurai package version to 1.1.11 in uv.lock * 🐛 Improve get_extension logic to fix document extraction issues on Windows and remove python-magic dependency * 🔧 Update Dockerfile to use 'bullseye' variant for Python images for improved compatibility * 🔧 Update Makefile targets for improved Docker workflow * 🔖 feat: Bump aymurai package version to 1.1.12 * ♻️ Harden get_extension with header scans and zip safeguards * 🔧 Extend document extraction timeout to 30s * 🔧 Refactor Docker workflow to build and push images using docker/build-push-action * 🔧 Fix workflow step order to correctly extract tag name before building Docker images * 🔧 Remove tag extraction step and use github.ref_name directly for Docker image builds * ⏪ Revert Docker workflow to extract tag name and use it for image versioning * Update .github/workflows/build-docker-image.yml Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * ✏️ Remove incomplete comment Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Release/v1.5.0 (#75) * ➕ build(deps): Add langextract for text entity extraction * 🚧 wip: Add langextract entity extraction experiment notebook * ✨ feat: Enhance entity models with relation handling and canonical representation * ✨ feat: Add JSON serialization support and enhance utility functions * ⬆️ Upgrade ML dependencies and refresh uv.lock * 🚧 wip: Update extraction examples in langextract notebook * 📝 Add entity disambiguation notebook for canonical entity extraction * ⬆️ Update dependencies: langextract to 
1.1.0 and ollama to 0.6.1; add openai extra for langextract * 📝 Integrate custom OpenAI model for extraction and remove failing empty example * 📝 Update error message format in json_serial function for better readability Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com> * ♻️ Inline immediate return in get_pretty Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com> * 🐛 Fix: Use json_serial in save_json Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * 🎨 Format json.dumps call in save_json for improved readability * Feature/ollama service (#59) * ✨ Add GPU-enabled Ollama service to compose stack * 🔧 Add Make targets for managing Ollama service and models * 🔧 Add launch configuration and task for starting Ollama service * Feature/llm providers (#60) * ✨ Add GPU-enabled Ollama service to compose stack * 🔧 Add Make targets for managing Ollama service and models * 🔧 Add launch configuration and task for starting Ollama service * ✨ Implement LLM providers module with Ollama adapter and shared abstractions * ✅ Add unit tests for LLM providers including DummyProvider and OllamaLLMProvider * 📝 Document Ollama provider usage via notebook demo * 🐛 Fix tokenizer encoding by removing unnecessary special tokens flag * ♻️ Refactor chunk handling in LLMProvider to use _append_chunk method for consistency and improved readability * ✨ Enhance Ollama provider docs and DRY response building for sync/async calls * ♻️ Refactor OllamaLLMProvider to reuse AsyncClient instance for improved efficiency * 📝 Add async examples to OllamaLLMProvider notebook * ✅ Add async coverage for OllamaLLMProvider and tighten chunking tests * ♻️ Refactor OllamaLLMProvider to remove async client caching and streamline client instantiation * Feature/disambiguation metric v2 (#62) * Update .gitignore to exclude entity disambiguation experiment directories and modify Jupyter notebook execution counts and output handling * 
Refactor Makefile for improved service management and update .gitignore to exclude specific experiment directories. Add new Jupyter notebooks for entity disambiguation metrics and documentation. * Adjust example data for consistency in entity representation. * Refactor entity disambiguation notebooks to standardize attribute naming and improve metric evaluation. Update role attribute from 'rol' to 'role' for consistency across examples and documentation. Adjust evaluation function to return both score and metrics. * Add evaluation metrics for entity disambiguation - Introduced new metrics module for evaluating entity disambiguation performance, including functions for alias normalization, Jaccard similarity, and greedy matching. - Implemented main evaluation function to compute scores and metrics from gold and predicted entities. - Added Jupyter notebooks for practical examples and evaluation results, including normalized and non-normalized text evaluations. - Updated documentation to reflect changes in function signatures and outputs. 
* 🔧 Expand Makefile: add API management targets (api-run, api-stop, api-logs, api-full-run) for smoother service control
* ♻️ Refactor metrics.py: clarify docstrings, align type hints, and polish logging
* ✏️ Fix role attribute reference in evaluation metric documentation for consistency
* 🔧 Add CanonicalEntities class to represent a collection of canonical entities
* 📝 Update entity disambiguation notebooks: clean up imports, adjust paths, and streamline API calls for improved clarity and functionality
---------
Co-authored-by: padonizetti
Co-authored-by: jansaldo
* Feature/summarization (#61)
* ✨ feat: Add Streamlit app for document summarization experiments
* Add statistical analysis notebook for summarization performance evaluation (visualizes performance gaps between CPU and CUDA models and LLM hallucinations)
* 🎨 Quantitative and qualitative analysis of summaries: descriptive analysis by features, model comparison, gap analysis (CPU-CUDA), garbage detection/outliers, analysis by document, visualizations.
* 🔒️ Clear all outputs
* 🎨 Improve summary analysis per document: cuda vs llama (same model), gemma vs llama (cuda), same document phi3 vs. phi4. Tokens-per-second gap.
* ✨ Add YAML utility functions for loading and saving data * Merge dev into main for v1.1.12 (#57) * Update README.md * 🐛 bugfix: Fix XML special character escaping in DocAnonymizer * ➕ build(deps): Add python-docx package * ✨ feat: Add watermark and hyperlink functionality to document anonymization * ✨ feat: Install Archivo font in Dockerfile * 🎨 refactor: Improve Dockerfile structure and comments for clarity * ⏪ revert: Remove Archivo font installation from Dockerfile * 🔖 feat: Update aymurai package version to 1.1.11 in uv.lock * 🐛 Improve get_extension logic to fix document extraction issues on Windows and remove python-magic dependency * 🔧 Update Dockerfile to use 'bullseye' variant for Python images for improved compatibility * 🔧 Update Makefile targets for improved Docker workflow * 🔖 feat: Bump aymurai package version to 1.1.12 * ♻️ Harden get_extension with header scans and zip safeguards * 🔧 Extend document extraction timeout to 30s * 🔧 Refactor Docker workflow to build and push images using docker/build-push-action * 🔧 Fix workflow step order to correctly extract tag name before building Docker images * 🔧 Remove tag extraction step and use github.ref_name directly for Docker image builds * ⏪ Revert Docker workflow to extract tag name and use it for image versioning * Update .github/workflows/build-docker-image.yml Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * ✏️ Remove incomplete comment Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --------- Co-authored-by: jed <jedzill4@users.noreply.github.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * ✨ Add GPU-enabled Ollama service to compose stack * 🔧 Add Make targets for managing Ollama service and models * 🔧 Add launch configuration and task for starting Ollama service * 🔧 Add system prompts for document summarization * 📝 Add summarization benchmark notebook * 🚚 Move statistical analysis notebook to summarization folder * ✨ Implement 
LLM providers module with Ollama adapter and shared abstractions * ✅ Add unit tests for LLM providers including DummyProvider and OllamaLLMProvider * 📝 Document Ollama provider usage via notebook demo * 🐛 Fix tokenizer encoding by removing unnecessary special tokens flag * ♻️ Refactor chunk handling in LLMProvider to use _append_chunk method for consistency and improved readability * ✨ Enhance Ollama provider docs and DRY response building for sync/async calls * ♻️ Refactor OllamaLLMProvider to reuse AsyncClient instance for improved efficiency * 📝 Add async examples to OllamaLLMProvider notebook * ✅ Add async coverage for OllamaLLMProvider and tighten chunking tests * ➕ Add tiktoken dependency to pyproject.toml and update version in uv.lock * 🔧 Enhance summarization prompts with additional information extraction and entity identification details * ✨ Add LLM summarization router * 📝 Add notebook for the summarization endpoint * ✏️ Fix formatting of keys in summarization defaults for consistency * ➕ Add dspy dependency and update related packages in project configuration * 🚧 WIP: Add prompt optimization notebook for summarization experiments --------- Co-authored-by: Sofi <sofiamorenadelpozo@gmail.com> Co-authored-by: jed <jedzill4@users.noreply.github.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * 🩹 Fix YAML key names in prompt defaults for summarization * ♻️ refactor: Restructure USEM module with factory pattern and multipl… (#64) * Merge dev into main for v1.1.12 (#57) * Update README.md * 🐛 bugfix: Fix XML special character escaping in DocAnonymizer * ➕ build(deps): Add python-docx package * ✨ feat: Add watermark and hyperlink functionality to document anonymization * ✨ feat: Install Archivo font in Dockerfile * 🎨 refactor: Improve Dockerfile structure and comments for clarity * ⏪ revert: Remove Archivo font installation from Dockerfile * 🔖 feat: Update aymurai package version to 1.1.11 in uv.lock * 🐛 Improve get_extension logic to 
fix document extraction issues on Windows and remove python-magic dependency * 🔧 Update Dockerfile to use 'bullseye' variant for Python images for improved compatibility * 🔧 Update Makefile targets for improved Docker workflow * 🔖 feat: Bump aymurai package version to 1.1.12 * ♻️ Harden get_extension with header scans and zip safeguards * 🔧 Extend document extraction timeout to 30s * 🔧 Refactor Docker workflow to build and push images using docker/build-push-action * 🔧 Fix workflow step order to correctly extract tag name before building Docker images * 🔧 Remove tag extraction step and use github.ref_name directly for Docker image builds * ⏪ Revert Docker workflow to extract tag name and use it for image versioning * Update .github/workflows/build-docker-image.yml * ✏️ Remove incomplete comment --------- * ♻️ refactor: Restructure USEM module with factory pattern and multiple encoder backends - Add BaseSentenceEncoder abstract base class for encoder interface - Implement factory pattern with EncoderType enum and create_encoder function - Add sentence-transformers encoder implementations (DistilUSE, MultilingualMiniLM) - Move TensorFlow implementation to tensorflow_encoder.py - Add lazy loading for encoder implementations via __getattr__ - Add auto-detection for Apple Silicon compatibility (defaults * 🚚 Rename test sentence encoders mac notebook * 📌 Sync dependencies --------- * ⏪ Rollback to previous torch and torchtext versions to avoid conflicts * 🩹 Fix: Add missing environment variable for OLLAMA_HOST in docker-compose * 📝 Add anonymization pipeline docs * 🚧 WIP: Add Playwright PJN scraper * 📝 Add Jupyter notebook for entity disambiguation from pre-clustered validations * Feature/pdf extraction upgrade (#65) * 🔧 Configure VSCode Python env and Copilot scopes * 🔧 Include resources/llm in .dockerignore * 📌 Update dependencies in pyproject.toml and uv.lock * 🔧 Update Dockerfile and devcontainer.json to install additional PDF tooling * ♻️ Refactor Makefile and 
docker-compose.yml for improved service configuration and flexibility * 🚧 FIXME: Remove DecisionConv1dBinRegex model from pipeline configuration for dependencies update compatibility * 🔧 Set weights_only=False for torch.load compatibility * ✨ Enhance PDF extraction with marker integration and improved text processing * 🔧 Update run_safe_text_extraction to allow indefinite timeout by default * ✨ Add warm_marker_models function to initialize marker-pdf artifacts at startup * 🔥 Remove unused environment variables and rename TRANSFORMERS_CACHE to HF_HOME * 🔧 Improve service stopping logic for Ollama and API services in Makefile * 🔖 Bump aymurai package version to 2.0.0-alpha.1 * 🔧 Update HF_HOME path and remove HF_DATASETS_CACHE variable in .env.common * 🔧 Update OLLAMA_HOST for GPU-enabled services to point to ollama-gpu * 🔧 Simplify marker model warming logic by removing error handling * ♻️ Refactor text extraction into modular format-specific extractors * ✅ Add unit tests for document extraction and error handling * ➕ Add marker-pdf stack and drop textract * 🔧 Enhance PDF extraction with caching mechanism * 📝 Improve cache utility functions with enhanced docstrings and type hints * 🔧 Enhance cache key generation in PdfExtractor for improved stability and performance * 🔖 Update aymurai package version to 2.0.0a2.dev9 * Feature/remove usem tensorflow deps (#68) * 🩹 Ensure consistent entity attributes in reformat_entity function and reorder imports * 📝 Update subcategories exploration notebook * ⚗️ Add TensorFlow deprecation experiment notebook * ♻️ Refactor entity subcategorization: Remove USEMSubcategorizer, add SentenceTransformerSubcategorizer - Removed the USEMSubcategorizer implementation from `usem.py`. - Introduced new Jupyter notebooks for testing and evaluating the SentenceTransformerSubcategorizer. - Updated the pipeline configuration to utilize SentenceTransformerSubcategorizer with local embeddings instead of remote URLs. 
* ♻️ Refactor download function: Replace gdown with requests for improved file downloading * 🔥 Remove empty peft model module * ➖ Remove TensorFlow and gdown dependencies from pyproject.toml * 📌 Update uv.lock * ♻️ Refactor sentence encoder module: Remove unused dependencies and streamline factory functions * 🔖 Update aymurai package version to 2.0.0a3.dev9 * WIP: feat(decision): ✨ integrate TinyEmbeddingBagClassifier for decision detection (#67) * feat(decision): ✨ integrate TinyEmbeddingBagClassifier for decision detection - Introduced a new model class `DecisionEmbeddingBagBinRegex` using `TinyEmbeddingBagClassifier`. - Updated model loading and saving mechanisms to support safetensors format. - Added a new training notebook for the embedding bag classifier. - Modified the pipeline configuration to include the new model. * ⚡️ Remove unidecode usage to avoid double normalization in model_input_from_text * 📝 Add type hints and docstrings for clarity in DecisionEmbeddingBagBinRegex and TinyEmbeddingBagClassifier * 🔧 Refactor import statements for safetensors to remove try-except block * 🔥 Remove Conv1dTextClassifier, Tokenizer and SpanishTokenizer implementations * 🐛 Fix gen_aymurai_entity call by removing unused category parameter * 🔖 Update aymurai package version to 2.0.0a4.dev1 * 🔥 Remove TensorFlow environment variables * Feature/mlfow integration (#66) * feat: add mlflow-based experiments and services (wip) * feat: finalize mlflow experiment runner and artifact logging * feat: add OpenAI ChatGPT extension and update postStartCommand in devcontainer * 📝 Unify disambiguation evaluation notebooks * 📝 Enhance documentation and add type hints across multiple modules * 📌 Update uv.lock * 🔧 Update devcontainer GPU device configuration * 🔧 Change default Python environment manager to venv * 🔧 Add container names for all services in docker-compose.yml * ➖ Remove commented optional dependencies for GPU support in pyproject.toml * 🔧 Increase document request timeout 
from 30 to 300 seconds in .env.common * 🚚 Changed environment variable names from DOCUMENT_API_BASE_URL and DOCUMENT_REQUEST_TIMEOUT to API_BASE_URL and REQUEST_TIMEOUT * 🔧 Update dependency installation to include 'mlops' group in entrypoint.sh * 🔖 Update aymurai package version to 2.0.0a5.dev8 * Feature/document extract config (#69) * ✨ Enhance document extraction with caching and configuration options * ✅ Update extractor tests to handle additional configuration parameters and improve error handling * 🔧 Update marker model warmup to include configuration setup for improved initialization * 🔖 Update aymurai package version to 2.0.0a6.dev3 * ⏪ Revert multiprocessing context change in run_safe_text_extraction * 🔖 Update aymurai package version to 2.0.0a6.dev5 * 🔥 Remove unused multiprocessing import from document_extract.py * 🔥 Remove unused logging import from extraction.py * 🔧 Change default value of force_ocr to False in pdf_to_text function * 📝 Update argument descriptions in pdf_to_text and plain_text_extractor functions to include default values * 📝 Remove duplicate argument description for path in BaseExtractor.extract method * Feature/pre disambiguation optimization (#70) * New pre-disambigutation feature notebooks * New pre-disambigutation feature notebooks and metrics.py per label feature added * Conclusion added to pre-cluster investigation * utils.py ocr variable True * Changes in grid search function to store the best pre-clusterizated entities in a particular directory * New llm inference function in notebook 07 * New llm grid search inference function * Add disambiguation endpoint and utility functions for entity grouping * Remove unused models and tokenizers to streamline the codebase * Fix type hints for processor functions to avoid runtime errors * Endpoint /disambiguate with LLM Inference (#72) * Changes in old 07 notebook adding the usage of the disambiguate endpoint and its own name * New token counter to check if the LLM inference won't 
hallucinate * New tokenizer function for token counting and processing specific documents * Batch optimization feature in llm-inference function * Mapping feature added to llm-inference function * Updated the /disambiguate endpoint to return DocumentAnnotations similar to the NER predictions, now enriched with role and entity_id fields where applicable. * New /disambiguatev2 endpoint which makes the LLM inference and returns the DocumentAnnotations list with the role and the canonical_entity_id where applicable. When there is a prediction that wasn't mapped the program generates a canonical_entity_id * New /disambiguatev2 endpoint which makes the LLM inference and returns the DocumentAnnotations list with the role and the canonical_entity_id where applicable. When there is a prediction that wasn't mapped the program generates a canonical_entity_id * New /disambiguatev2 endpoint which makes the LLM inference and returns the DocumentAnnotations list with the role and the canonical_entity_id where applicable. When there is a prediction that wasn't mapped the program generates a canonical_entity_id * New updates on endpoint /disambiguatev2 and notebook 07 * Cleaned code in anonymizer.py and utils.py following Raúl's comments * New classes defined for LLM prompts to validate each set of prompts per label before the LLM inference * Sorted canonical entities before LLM inference to avoid (or try to avoid) processing two or more canonical entities that are actually one in separate batches * Cleaned anonymizer.py script and experimental notebook 07 discarding the old pre-cluster endpoint. * Cleaned anonymizer.py script and experimental notebook 07 discarding the old pre-cluster endpoint. * Cleaned anonymizer.py script and experimental notebook 07 discarding the old pre-cluster endpoint. New disambiguation.py script to store functions to pre-clusterize the canonical entities. * Cleaned anonymizer.py script and experimental notebook 07 discarding the old pre-cluster endpoint.
New disambiguation.py script to store functions to pre-clusterize the canonical entities. * Code cleaned following Juli's comments regarding the new /disambiguate endpoint * Remove unused relations field from CanonicalEntity class for LLM inference phase * Final changes to the code adding the entity_disambiguation.yaml to handle the prompts * Add entity disambiguation utilities and enhance canonical entity processing - Introduced new utility functions for entity disambiguation in `fuzzy.py`. - Implemented `assign_label_instances` and `map_canonical_entities_ner_preds` in `core.py`. - Added LLM inference capabilities in `llm.py` for refining canonical entities. - Updated `entities.py` to include `aymurai_label_instance` for ordered label indexing. * Refactor anonymizer and paragraph modules for improved entity disambiguation and serialization * Remove unused logger import from paragraph module * Reviewed code and added some features to 07 experiment notebook * Implement label policies for disambiguation and anonymization; enhance entity processing and prediction mapping * New datetime formatter function and changes in old code; there is a bug on my OS, which does not support setlocale * New functionality added to get_canonical_dates for dates with the same day and month * New functionality added to get_canonical_dates for dates with the same day and month * 🐛 Fix entity handling in anonymizer and datapublic routers when use_cache is disabled to improve label processing * Remove commented-out code * DatetimeFormatter used after NER predictions in postprocess so we only have to take the datetime from aymurai_label_subclass to build the canonical entities from dates * Fix locale setting for date formatting to ensure correct month name handling * Add docstring for get_canonical_dates function to clarify input and output * Remove DIRECCION prompt templates * Update notebook formatting, remove unused MODE param and improve code readability * Update uv.lock * Hotfix: resolve
file pathing, logic indentation, and date disambiguation - Update configuration path in llm.py from .yaml to .yml. - Fix indentation in core.py for canonical_entity_id assignment. This ensures all predictions receive an ID even if they lack a canonical match, bypassing the 'aymurai_label_subclass == 0' filter which caused issues with date formatting in NER post-processing. - Add condition in anonymizer.py to trigger 'get_canonical_dates' only when FECHA is present in 'fuzzy_labels'. This prevents unintended date disambiguation when the policy is set to None. * Feature/anonymize document refactor (#73) * Add render policy support and refactor anonymization logic for improved token rendering * 📝Update anonymization docs * ♻️ Refactor: modularize document anonymization * 📝 Rename notebook for document anonymization with render policy * FECHA disambiguation bug fixed, label and render policies changed and whole code reviewed for PR * ⏪ Revert entrypoint.sh to 1ac2776 * ⏪ Revert .dockerignore to 5af5814 * ⏪ Revert .env.common to 90f7369 * ⏪ Revert .vscode/launch.json to f366690 * ⏪ Revert Makefile to cb3df05 * ⏪ Revert aymurai/api.core.py to 19a9ca8 * 🦖 Changed aymurai/api/endpoints/routers/anonymizer/anonymizer.py for release/v1.5.0 compatibility * 🔥 Removed aymurai/api/endpoints/routers/llm for release/v1.5.0 compatibility * 🦖 Changed aymurai/api/endpoints/routers/misc/document_extract.py for release/v1.5.0 compatibility 🦖 Changed aymurai/text/extractors/pdf.py for release/v1.5.0 compatibility 🦖 Changed aymurai/text/extractors/utils.py for release/v1.5.0 compatibility * ⏪ Revert aymurai/api/main.py to a801bf4 * 🔥 Removed aymurai/api/startup/marker.py for release/v1.5.0 compatibility * 🔥 aymurai/experiments/entity_disambiguation folder for release/v1.5.0 compatibility * 🔥 Removed aymurai/llm_providers for release/v1.5.0 compatibility * 🦖 Changed aymurai/settings.py for release/v1.5.0 compatibility * 🦖 Changed aymurai/api/endpoints/routers/anonymizer/anonymizer.py for 
release/v1.5.0 compatibility 🦖 Changed aymurai/utils/entity_disambiguation/__init__.py for release/v1.5.0 compatibility 🔥 Removed aymurai/utils/entity_disambiguation/llm.py for release/v1.5.0 compatibility * ⏪ Reverted docker-compose.yml to 5b9c220 * ⏪ Revert docker/api/Dockerfile to 4196117 * 🦖 Changed docs/anonymization/README.md for release/v1.5.0 compatibility * 🔥 Removed docs/experiments/README.md for release/v1.5.0 compatibility 🔥 Removed docs/experiments/base.yaml for release/v1.5.0 compatibility * 🔥 Removed notebooks/experiments/anonymization/05-langextract.ipynb for release/v1.5.0 compatibility * 🔥 Removed all the notebooks from folder: notebooks/experiments/entity-disambiguation that had something related to LLM disambiguation for release/v1.5.0 compatibility * 🔥 Removed notebooks/experiments/llm-providers for release/v1.5.0 compatibility * 🔥 Removed notebooks/experiments/summarization for release/v1.5.0 compatibility * 🦖 Changed pyproject.toml for release/v1.5.0 compatibility * 🔥 Removed resources/llm for release/v1.5.0 compatibility * 🔥 Removed summarization_app for release/v1.5.0 compatibility * 🔥 Removed test/llm_providers for release/v1.5.0 compatibility * 🐛 Bug fixed in pyproject.toml line 106 for .venv build up * 🐛 Bug fixed in function '_normalize_text' from 'aymurai.text.extractors.utils' that was changed to 'normalize_text' because it's used in aymurai/text/extractors/docx.py * ⏪ Revert elimination of folder aymurai/experiments/entity_disambiguation for experimental purposes. There was an error in deleting everything, files will be changed in next commit.
* 🔥 Removed aymurai/experiments/entity_disambiguation for release/v1.5.0 compatibility * 🐛 Bug fixed in experiments/entity-disambiguation/10-anonymize-document-render-policy.ipynb for release/v1.5.0 compatibility * 🔥 Removed TESSDATA_PREFIX from .env.common * 🙈 Update .gitignore to include notebooks directory while excluding subdirectories and non-IPYNB files * 🔀 Synthesize docker-compose from 26033a8f/00709164 after b05b768 rollback * 🔀 Synthesize Makefile from afbfda9/d80f74b/26033a8f after f645881 rollback * 🔧 Fix repository URL case sensitivity in pyproject.toml and remove unused dependencies * 🔥 Remove tasks.json configuration for Ollama service * 🔥 Remove scraper and documentation * 🔥 Remove experiment module * 🔥 Remove path utility functions from paths.py * 🔥 Remove unused PromptSet and PromptLibrary classes, and simplify disambiguation options in LabelPolicy * 🔥 Remove EntityRelation class and its associated methods from entities.py * 📝 Enhance documentation with detailed docstrings for various functions across multiple modules * 🔥 Removed PromptLibrary class from aymurai/api/endpoints/routers/anonymizer/anonymizer.py for release/v1.5.0 compatibility 🔥 Removed `llm` disambiguation label policy for release/v1.5.0 compatibility * 🎨 Changed map_canonical_entities_ner_preds function in aymurai/utils/entity_disambiguation/core.py discarding the role assignment for release/v1.5.0 compatibility 🎨 Changed aymurai/api/endpoints/routers/anonymizer/anonymizer.py discarding all the validations that had to do with LLM disambiguation for release/v1.5.0 compatibility 🎨 Minor changes in the rest of documents regarding to experimentation with the release/v1.5.0 API * 🔀 Synthesize document_extract from d349c69 after 3c55d8e: remove extractor config passthrough and restore fixed timeout * 🔀 Synthesize PDF extraction flow from d349c69/26033a8: remove cache/debug path * 🔥 Remove text extraction tests * 📝 Update description formatting for aymurai_disambiguation field in 
EntityAttributes * 🦖 Update PdfExtractor.extract method to include ignored keyword arguments for backward compatibility * 🔥 Remove unused static logo file from API resources * 🔧 Add version_scheme configuration to setuptools_scm in pyproject.toml * 📌 Update uv.lock * 📝 Reorganize and update v1.5.0 documentation (EN/ES) * 🚚 Rename full-paragraph pipeline to datapublic across code and docs * ci(tests): add API + pipeline integration tests on linux and windows (#74) * Merge dev into main for v1.1.12 (#57) * Update README.md * 🐛 bugfix: Fix XML special character escaping in DocAnonymizer * ➕ build(deps): Add python-docx package * ✨ feat: Add watermark and hyperlink functionality to document anonymization * ✨ feat: Install Archivo font in Dockerfile * 🎨 refactor: Improve Dockerfile structure and comments for clarity * ⏪ revert: Remove Archivo font installation from Dockerfile * 🔖 feat: Update aymurai package version to 1.1.11 in uv.lock * 🐛 Improve get_extension logic to fix document extraction issues on Windows and remove python-magic dependency * 🔧 Update Dockerfile to use 'bullseye' variant for Python images for improved compatibility * 🔧 Update Makefile targets for improved Docker workflow * 🔖 feat: Bump aymurai package version to 1.1.12 * ♻️ Harden get_extension with header scans and zip safeguards * 🔧 Extend document extraction timeout to 30s * 🔧 Refactor Docker workflow to build and push images using docker/build-push-action * 🔧 Fix workflow step order to correctly extract tag name before building Docker images * 🔧 Remove tag extraction step and use github.ref_name directly for Docker image builds * ⏪ Revert Docker workflow to extract tag name and use it for image versioning * Update .github/workflows/build-docker-image.yml Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * ✏️ Remove incomplete comment Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --------- Co-authored-by: jed <jedzill4@users.noreply.github.com> 
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * WIP: feat(decision): ✨ integrate TinyEmbeddingBagClassifier for decision detection (#67) * feat(decision): ✨ integrate TinyEmbeddingBagClassifier for decision detection - Introduced a new model class `DecisionEmbeddingBagBinRegex` using `TinyEmbeddingBagClassifier`. - Updated model loading and saving mechanisms to support safetensors format. - Added a new training notebook for the embedding bag classifier. - Modified the pipeline configuration to include the new model. * ⚡️ Remove unidecode usage to avoid double normalization in model_input_from_text * 📝 Add type hints and docstrings for clarity in DecisionEmbeddingBagBinRegex and TinyEmbeddingBagClassifier * 🔧 Refactor import statements for safetensors to remove try-except block * 🔥 Remove Conv1dTextClassifier, Tokenizer and SpanishTokenizer implementations * 🐛 Fix gen_aymurai_entity call by removing unused category parameter * 🔖 Update aymurai package version to 2.0.0a4.dev1 * 🔀 cherry-pick(decision): modernize decision model and upgrade ML dependencies Cherry-pick TinyEmbeddingBagClassifier (safetensors) replacing Conv1d model. Remove dead deps (torchtext, pytorch-lightning), upgrade torch to 2.x and flair to 0.15.1. 
* 🐛 cherry-pick(fix): datapublic and anonymizer crash when use_cache is disabled * test(infra): rewrite test infrastructure with architecture guide standards - Delete old test files (test_document_extract.py, test_anonymizer_predict.py, test_datapublic_predict.py) - Create new directory structure: tests/integration/pipelines/, tests/api/routers/{anonymizer,datapublic,misc}/ - Rewrite tests/conftest.py: - Set env vars at module level (RESOURCES_BASEPATH=resources, SQLALCHEMY_DATABASE_URI=sqlite:///:memory:) - Remove torch mock and lazy loader - Direct imports from production code - Clean fixtures: db_engine (session-scoped), db_session (function-scoped), client (with dependency override) - Test data builders: build_data_item(), build_label(), build_anonymization_paragraph(), build_datapublic_paragraph() - Update pyproject.toml with [tool.pytest.ini_options]: strict-markers, integration/slow markers Verification: uv run python -c 'import tests.conftest' succeeds, pytest collection clean * test(conftest): add pipeline loading helpers and mock factories for API tests Wave 2 complete: integration pipeline conftest + API router conftest Integration pipeline conftest: - PIPELINE_CONFIGS dict for flair-anonymizer and full-paragraph - load_test_pipeline() helper with print_config=False - Session-scoped fixtures for both pipelines (expensive model loading) - build_pipeline_input() test data builder - sample_text fixture with Spanish legal text API router conftest: - build_mock_pipeline() factory with MagicMock - Mock preprocess/predict_single/postprocess methods - build_processed_data_item() test data builder - Re-exports builders from root conftest * test(api): add document extract endpoint tests with mocked extraction * test(api): add anonymizer and datapublic endpoint tests with mocked pipelines * test(integration): add pipeline integration tests for flair-anonymizer and full-paragraph * ✅ test: refactor test infrastructure and add integration tests - Reorganize test 
conftest files to proper hierarchy (tests/api/conftest.py) - Add pytest to dependency groups in pyproject.toml - Refactor API router tests to use centralized fixtures and builders - Add real document extraction tests with DOCX/PDF generators - Improve pipeline integration tests with fixture-based stages - Fix label serialization to use model_dump(mode="json") - Update UUID generation for datapublic tests to use uuid.uuid5 - Add cache path environment setup for integration tests - Clean up imports and remove unused dependencies - Remove empty test file (document_extract.py) This refactoring improves test maintainability, adds proper integration testing without excessive mocking, and establishes consistent test utilities across the codebase. * 👷 ci(github): add pytest workflow for CI integration - Introduced a new GitHub Actions workflow for running pytest. - Configured to trigger on pull requests and manual dispatch. - Supports multiple OS and Python versions for comprehensive testing. * 👷fix(tests): fix env variable DISKCACHE_ROOT * 👷 ci(github): remove deprecated PR tests workflow & fix env variable - Deleted the old PR tests workflow file. - This cleanup helps streamline CI processes and reduces redundancy. * ci(github): 👷 add pipeline download and integration tests to CI workflow - Introduced a new script for downloading pipelines. - Updated the pytest workflow to include running API and pipeline tests. - Enhanced test execution with improved output formatting and failure limits. * fix(tests): 🐛 avoid context manager in TestClient to skip app startup - Changed TestClient usage to prevent app lifespan startup during tests. - Ensured proper cleanup by closing the client after use. - This improves test performance and reliability. * 👷 ci(github): add RESOURCES_BASEPATH environment variable for pipeline tests - Added RESOURCES_BASEPATH to the environment variables for both downloading pipelines data and running pipeline tests. 
- This change ensures that the necessary resource paths are correctly set during the CI workflow execution. * 👷 ci(github): update RESOURCES_BASEPATH for pipeline data download - Changed RESOURCES_BASEPATH from /tmp to resources in the pipeline download step. - Ensures the correct path is used for resource access during tests. * chore(pyproject): 🔧 add environment markers for platform compatibility - Introduced required-environments for tool.uv to specify platform requirements. - Updated resolution-markers and required-markers in uv.lock for better dependency management. - Added tensorflow-io-gcs-filesystem with specific markers for Windows and Linux. * ci(github): 👷 configure es_AR locale for Ubuntu runners - Added steps to configure the es_AR locale on Ubuntu. - Ensures proper locale settings for tests running in the CI environment. * 👷 ci(github): add AYMURAI_CACHE_BASEPATH environment variable for pipeline tests - Introduced AYMURAI_CACHE_BASEPATH to the environment variables for both pipeline download and pipeline tests. - This change ensures that the correct cache path is utilized during the execution of the tests. * 🐛 fix(dependencies): adjust textract dependency for platform compatibility - Added conditional dependency for textract based on the operating system. - Specified different sources for textract depending on whether the platform is Windows or not. * 🔥 chore(opencode): remove opencode.json configuration file - Deleted the opencode.json file as it is no longer needed. - This change helps to clean up the repository and remove obsolete configurations. 
* 🚚 Update pipeline path for datapublic in scripts, notebooks and tests * 📝 docs: replace Black references with Ruff in CONTRIBUTING and Alembic hook examples * 🔧 Add backslash to default CACHE_BASEPATH value * 🔧 Update cache path retrieval to use settings for consistency * ➖ Remove textract dependencies and update documentation for extract_document function * ✅ Update integration tests and add new test cases for anonymizer and datapublic flows * 🔥 chore(test): remove legacy /test dir and standardize sample doc path to /resources/data/sample/document-01.docx * 🔧 Update UV_VERSION to latest in devcontainer Dockerfile * 🔧 Update dependency installation command to include all groups * 📌 Update uv.lock * 🐛 Fix CACHE_BASEPATH env alias resolution for CI pipeline downloads * Feature/pdf layout anonymization (#76) * ✨ feat(extractors): use pymupdf layout for pdf text extraction * ✨ feat(normalization): enhance document normalization to preserve paragraph structure * 📝 docs: document default values for extractor and normalization helpers * 🩹 fix(extractors): use pymupdf4llm.to_text with page_chunks for pdf paragraphs * ♻️ Add DOCX and PDF anonymizer modules - Implemented DocxAnonymizer class to handle anonymization of DOCX documents by replacing sensitive data with label tokens. This includes functionality for unzipping documents, parsing XML, editing content, and adding watermarks. - Developed PdfAnonymizer class for anonymizing PDF documents, utilizing pymupdf for document manipulation. This includes layout parsing, font caching, redaction operations, and watermarking. 
* 🔧 Enhance PDF and DOCX handling in anonymization process * 📝 Update backend module references for document rendering in README * ✅ Update tests to use DOCX format for document anonymization and enhance mock behavior * ✨ Add end-to-end PDF anonymization notebook with PyMuPDF and AymurAI API * ♻️ Rework PDF anonymization for precise spans and widget handling * 🔧 Update model_dump calls to exclude None values for improved data handling * 📝 Add docstrings to label replacement functions * ♻️ Refactor watermark handling and optimize PDF token aliasing * ✅ Add integration tests for merging fragmented numeric labels and excluding null alt attributes in PDF anonymization * ➖ Remove opencv-python-headless dependency from project requirements * ♻️ Implement paragraph splitting function to enhance document text extraction * 🔧 Update dependency installation command to prevent Python downloads * 🔥 Remove redundant tests for merging fragmented numeric labels and PDF anonymization * ♻️ Refactor anonymizer tests to use DOCX format and enhance mock functionality * 🔧 Add xfail marker for PDF extraction test on Windows due to tensor type issue * ✨ Enhance PDF anonymization by adding cleanup rects, removing overlapping links, and scrubbing metadata * 🔧 Remove redundant return statement in _label_replacement_text function * ♻️ Refactor anonymization module: split pdf and docx internals by format * ✅ Add integration tests for PDF and DOCX anonymizers, including metadata scrubbing and link preservation * ✨ Add watermark layout adjustments to avoid footer content overlap in PDF anonymization * ✅ Add integration test to ensure watermark is positioned away from footer content in PDF anonymization * 🩹 Fix: read docx xml as utf-8 across platforms * ✅ Add Windows-specific xfail marker for PDF tests and implement UTF-8 XML reading test * 🐛 Remove unnecessary --extra runtime flag from uv sync command * 🐛 Date formatter bug fixed for canonical entities generation. 
* 🐛 Fix duplicate DocLabel handling in anonymization and serialization processes * ✅ Add tests to deduplicate duplicate labels in cached predictions and disambiguation processes * 🐛 Fix handling of non-alphanumeric entities by returning None for empty cleaned text (#81) * 🩹 Fix default timeout value in run_safe_text_extraction function from 30 to 300 seconds * 🚸 Update PDF_TOKEN_ALIAS_MAP with clearer aliases * Fix/pdf signature anonymization (#82) * 🧪 test(pdf): cover signature anonymization regressions * 🐛 fix(pdf): preserve signature appearance when redacting signer names * ✅ test(pdf): add focused signature geometry tests * ♻️ refactor(pdf): rename distance function for clarity and update references * 📝 docs(pdf): clarify signature widget flattening process in preparation function * ✅ test(pdf): cover signature review edge cases * 🐛 Bug fix for exact entities. (#80) * 🐛 Bug fixed for entities who are always the same that have to bypass the fuzzy matching algorithm. * ⚡️ Improved structure following copilot comments. * ⚗️ Experimentation. 
* 🐛 Merge duplicate labels for the same span and AymurAI label in _dedupe_doclabels function * ✅ Add integration test for merging cached duplicate labels for the same span --------- Co-authored-by: jansaldo <julianansaldo@gmail.com> * Feature/frontend integration (#83) * Merge dev into main for v1.1.12 (#57) * Update README.md * 🐛 bugfix: Fix XML special character escaping in DocAnonymizer * ➕ build(deps): Add python-docx package * ✨ feat: Add watermark and hyperlink functionality to document anonymization * ✨ feat: Install Archivo font in Dockerfile * 🎨 refactor: Improve Dockerfile structure and comments for clarity * ⏪ revert: Remove Archivo font installation from Dockerfile * 🔖 feat: Update aymurai package version to 1.1.11 in uv.lock * 🐛 Improve get_extension logic to fix document extraction issues on Windows and remove python-magic dependency * 🔧 Update Dockerfile to use 'bullseye' variant for Python images for improved compatibility * 🔧 Update Makefile targets for improved Docker workflow * 🔖 feat: Bump aymurai package version to 1.1.12 * ♻️ Harden get_extension with header scans and zip safeguards * 🔧 Extend document extraction timeout to 30s * 🔧 Refactor Docker workflow to build and push images using docker/build-push-action * 🔧 Fix workflow step order to correctly extract tag name before building Docker images * 🔧 Remove tag extraction step and use github.ref_name directly for Docker image builds * ⏪ Revert Docker workflow to extract tag name and use it for image versioning * Update .github/workflows/build-docker-image.yml Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * ✏️ Remove incomplete comment Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --------- Co-authored-by: jed <jedzill4@users.noreply.github.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Squashed 'frontend/' content from commit 9123e6f git-subtree-dir: frontend git-subtree-split: 9123e6ff047ddc6da0528d1de827a4af68752d0f 
* Squashed 'frontend/' changes from 9123e6ff..8add5c45 8add5c45 1.25.0 d7424d94 use base_url when dealing with public assets ae1a47b8 Merge pull request #44 from AymurAI/feat/redesign 9cfc610c fix issue regarding annotation keyboard navigation a723be01 restore useNotify feature in process page b888214a make how it works modal bigger 49c3bc22 fix electron ts issues eb791d07 add missing anonymize value on label attrs 151e2710 make OS taskbar API safe to call in web 8dc38fbb add knip a0f6c687 fix ts clone element issue d975e3fd include label policies and annotations when anonymizing document 70c44a04 more fixes in height to validate dataset a36a7401 make suggestion clickable on uncontrolled text input ea61e4e2 add fixed height to validation dataset page 072d751e remove axios' default api url 6f8a478c adjust typings 528c646f add more canonical id operations on reducers ae09277c add update by cannonical id function de2b641c add random cannonical id on search add event 74d4dac2 add value as undefined on ui/input component 6aa895f4 simplify and remove unused code on fille annotator component d10fe8c7 add useShallow to local storage default getters 9f62ac69 adjust predicting and file parser flows 01aa8a79 fix TS issues 5655eac5 smaller fixes on file annotation components a7e91cec add remove dialog to label manager's entity tab cad8faaa simplify annotation components 52230ee4 add metadata to search and tag annotations 0f08b9c8 add more anonymizer copy locale texts 6c3112ac store label manager config in local storage 63c4d731 move suggestion label and add mark props 77017543 adjust spacing on dataset validation cae3b907 disambiguate hook 6b21903e move createAnnotationData, tag and suffix to context d1649e5d add max width to toast and remove instad of dismiss it ca56bb63 add size variant to suggestion and adjust display 55a67bf8 fix select viewport and add scroll aa2ce6a1 adjust styles on callout 5493cdee finish mark and tag annotation feature 4fb28120 finish major pages and 
add file protection feature af457c85 improve accessibility on host page c2c21c21 add className to SectionTitle 6edd7101 update tanstack react query version df491c17 add odt to pdf conversion efb04bc5 deprecate usage of useSchemedQuery and useSchemedQueries 046a73e2 deprecate usage of useSchemedMutation 65ceb195 swap replace button icons 12348edd update button icons 2719755f extract tagger popover logic 28ce40ab create search tagger e54adfbe add small version of select component d6c3bd77 improve ui dialog component 63c6bf09 initial mark component update cd0f7d30 add radix's select as dependency 92ceaace remove old tooltip implementation 2499e4c8 add input size variants be48dac9 add custom icons to callout, toast and showToast 32d3441b add retro compatibility features to select 657e2a5a minor styling fixes + new select imports 7a323cd2 kill more unused components c5d7b4ba create better select component d855a93b adjust styling and positioning on preview page 1cfba9c4 adjust callout styling c28fdcd0 fix title in process copy etxt 3a9a5cc6 remove old component implementations 03954d74 create callout from toast, and then apply a11y to toast 93c9457d create toast component 642e760d improve suggestion component 707e844b update suggestion mark component's styles 7c1db956 add checked variant to button 204246c7 fix gaps on finish page 9e81ee12 hide file stepper selector arrows depending on cursor da1377ea update file processing component 12aa1c6b update decision tabs to panda 490a1adb simplify finish dataset and anonymizer ba12dad2 add className prop to footer 4ed68454 drop file queries when resetting the progress 762cf823 add missing built by in features page 2c9d6cf8 make dataset validation file annotator not annotable 59416b8f clear files on features menu 9585726b use translation on droparea 091b2c58 make search bar static and follow scroll 0978aff6 add label manager to file annotator 5b069d68 connect rest of label manager 23fa8407 improve layout header component 3c88d2f3 
finish preview page d31ec034 finish onboarding page 7d1a71f3 add disambiguate and predict react query options 4e91d285 feat: add feature icon record bfc9baf7 feat: create initial label manager 0bc48bc9 chore: refactor Searchbar 962961e4 chore: remove stepper component 95e17235 fix: title in anonymizer locale df53488d chore: minor changes 4a9a2557 feat: create switch component d4830e48 feat: create label manager component baa9835c feat: add more copies for the finish page 9d915cab chore: kill unused old hidden input component 2fd61a7f fix: add missing feature param call on route.tsx 617ea974 feat: use a11y on finish page 172b731a Merge branch 'feat/a11y-dataset-header' into feat/redesign 394121d9 feat: add more copies to locale file c34951e8 fix: redo home layout 31078ff5 chore: cleanup cf9def05 chore: restructure HOC to be a regular component 088bed82 feat: add api base url protection and apply it cd15a776 fix(layout): address PR review on header and icon changes 59dc125e feat: add i18n support for the whole app 36d70bbd build: add i18n 244027be refactor(layout): hoist Topbar and Stepper to global wizard route 146905fb feat: improve topbar accessibility with semantic icons and aria labels 5f910470 chore: ignore personal analysis folder b6569f77 chore: rework layout components 331ff8dc fix: update enum import ee17865e feat(ui): create and/or adjust components 6c9daf38 feat: rework onboarding page 347054d5 chore: simplify main app layout cef7f882 chore: adjust button sizes and enum import dc59ad2a feat: make card clickable 051b6bd4 chore: add tutorial seen to local storage store 1e15bb74 fix: typo on anonymizer label 24f6de07 build: add web or electron run modes beb63bc3 chore: migrate hidden input 23442095 chore: use constants and base card element on feature selector 2eba84b3 chore: refactor header so we can correctly position all elements 0a50bb14 fix: adjust stepper styles (sizing and colors) 504840d9 chore: export constants 3b79f391 build: update react and add 
radix dependencies f6f7fac8 fix: remove fadeIn scaling animation 41a158c3 chore: create modern ui components 5a02a5df chore: flag card as deprecated edb85211 feat: create modern tooltip component a2c2040a fix: replace brand images with correct ones and set proper heights edce7295 feat(components): create brand, layout and ui components 5475a617 feat: add more brand images 9cd982f8 chore(styles): add animation semantic tokens ce991130 fix: extra character in home layout and rename the component 5e41df1a feat: redesign home 0947627c feat: create link card tool for features 0833a987 feat: create components to render in home screen a6198d99 fix: add fixed height to button and auto adjust icon size ab0a4e20 fix: add lineheight to text styles and adjust font weight 047b115a chore: replace custom use mutation hook with base on connect to host hook 59363cd7 chore: change to named export on local store 3a289e1e chore: add changes to router file 2393d18e chore: fix some tokens in panda and move stitches global styles 64048ac7 chore: re-implement button and partially input 51fe0990 feat: add loading screen on boot, timer of 1.5s 60cf7fde chore: configure view transition for all pages ebcf3a66 feat: add loading page and updated branding images 67603f66 chore: flag stitches as deprecated e416ace8 build: install and configure pandacss c2e5eac1 build: add support for environmental variables for both web and electron apps git-subtree-dir: frontend git-subtree-split: 8add5c452478cdbe6a99ad1b05183cd264183c72 * ✨ Add frontend routing and settings for frontend distribution directory * Squashed 'frontend/' changes from 8add5c45..ff882164 ff882164 chore: add .npmrc to configure public hoist pattern for @types 32bfab0a Merge pull request #59 from AymurAI/fix/53-restore-home-button 94e3816d Merge pull request #65 from AymurAI/fix/add-placeholder-to-select-entities 3545775f Merge pull request #66 from AymurAI/fix/remove-doc-extension d24c458b add a "config" button in features menu f427da73 
make hover effect in button work for anchor tanstack link wrapper 77077905 remove slot checks on header 82d2c65a add home button to header on all flow's pages 5663da42 make aymurai's logo a link in the header c3c0e8a7 create home button component 51cb537f fix: prevent select caret rotation from leaking ancestor data-state 5d85973f feat: add tooltips to tagger label and suffix inputs 7e0b67b8 Fix text overflow in HowItWorksModal (#58) 3d516636 Restore delete-one and delete-all hover actions on annotations (#57) fa7e0967 create link component c8112e20 add "Entidad" placeholder to tagger select 70710e11 change NINO to NIÑO 6664e631 remove copies and functions referencing .doc files 4a00039b copy change 422affaf Merge pull request #64 from AymurAI/fix/browser-resources-exhaustion ce8a474a prevent semaphores underflow dcf3b6b7 Merge pull request #62 from AymurAI/fix/conversion-endpoint-usage 7729801f add error handling to finish file conversion 2afe00ee Merge pull request #63 from AymurAI/fix/copy-changes 387d6724 Merge pull request #61 from AymurAI/fix/responsiveness 84843877 limit concurrent predict requests to avoid connection exhaustion 8d3e2693 fix: increase spacing between home and features menu buttons 3025e341 fix: use House icon in header instead of BackButton arrow c9bba1a9 feat: add back-to-home button on features page a0b8f55f use extension to check if file conversion is needed 0b83d0b0 create pdf to odt service 78b1a9c9 responsiveness for screens less than 1280px in width fef9a94e copy change on label manager tab b9c6db53 copy changes on label manager config tab git-subtree-dir: frontend git-subtree-split: ff882164be8077dee58b6748886b0d7d3acbe376 * 🔧 Remove commented-out router for anonymizer database * ✨ Add Node.js and npm installation for frontend build in Dockerfile * 📝 Update API documentation URLs to include '/api' prefix * ✨ Add frontend build commands to Makefile * 🙈 Update .dockerignore and .gitignore to include frontend build output directories * 
✅ Update API routes to include '/api' prefix in tests and add frontend integration tests * ♻️ Refactor routing and API integration to remove '/app' prefix and streamline feature routes * Squashed 'frontend/' changes from ff882164..d3e14b5e d3e14b5e feat(validation): persist and restore predictions via backend validation endpoint (#68) 6b8a23ba Add drag-and-drop reordering and inline rename to label manager (#60) git-subtree-dir: frontend git-subtree-split: d3e14b5e00af41fded1c113e51e2e8b73bbf1b22 * refactor: update feature routing, migrate to pnpm, and refine dev environment configuration * Squashed 'frontend/' changes from d3e14b5e..879309c8 879309c8 Feat/entity manager mention feedback (#81) 4d2de106 Fix/responsive home layout (#80) 986e68d2 Fix/homogenize file check ui (#77) 046f8ab9 fix(file-annotator): fix upward autoscroll on search previous navigation (#76) 2ecf75dc feat(dependencies): add dnd-kit packages for drag-and-drop functionality git-subtree-dir: frontend git-subtree-split: 879309c841d8072babc4d06f1686d11cf8cbd03f * Squashed 'frontend/' changes from 879309c8..a37adc20 a37adc20 fix(useLocal): stop persisting groupOrder and remove dead categoryAssignments (#78) (#87) 47fb1fb3 fix(disambiguate): match response items by text instead of array index (#78) (#86) c873c8e0 Mover configuración de pnpm de `package.json` a `pnpm-workspace.yml` (#83) f4ce881b fix(useFileParse): use position-based paragraph ID to avoid key collisions (#85) 181e0356 Fix/invalid entity offsets (#82) git-subtree-dir: frontend git-subtree-split: a37adc20f579276b3a0e5979424ba7809fb7e2ff * chore: migrate frontend build process from npm to pnpm in API Dockerfile * 🐛 fix: add support for numpy integer and floating types in EnhancedJSONEncoder * fix: update Stack component to use height instead of minHeight for consistent layout * fix: update imports for Label and Text components in UncontrolledInput to avoid circular dependency * chore: regenerate routeTree.gen.ts after removing $feature 
parent layout route * feat: add default anonymization policies to settings * chore: bump frontend version to 1.5.0 * fix(api): preserve pipeline cache for configured ttl * refactor: remove torch dependency and configure threads via settings * fix(frontend): replace previous anonymizer file on load * fix(frontend): support dataset export in web mode * fix(tests): add SQLALCHEMY_DATABASE_URI environment variable for api tests * fix(api): improve error logging during startup --------- Co-authored-by: jed <jedzill4@users.noreply.github.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: dmazzini <dmazzini@gmail.com> * 🔥 Remove TensorFlow related environment variables in Dockerfile * 📝 Update documentation for AymurAI v1.5.0 --------- Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Paolo Donizetti <padonizetti@gmail.com> Co-authored-by: Sofi <sofiamorenadelpozo@gmail.com> Co-authored-by: jed <jedzill4@users.noreply.github.com> Co-authored-by: Lio <lionel.chamorro85@gmail.com> Co-authored-by: conrabeatriz <conrabeatriz@gmail.com> Co-authored-by: dmazzini <dmazzini@gmail.com> --------- Co-authored-by: jed <jedzill4@users.noreply.github.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com> Co-authored-by: Paolo Donizetti <padonizetti@gmail.com> Co-authored-by: Sofi <sofiamorenadelpozo@gmail.com> Co-authored-by: Lio <lionel.chamorro85@gmail.com> Co-authored-by: conrabeatriz <conrabeatriz@gmail.com> Co-authored-by: dmazzini <dmazzini@gmail.com>

🐛 Bug fixed for entities who are always the same that have to bypass …

49d5c5f

…the fuzzy matching algorithm.
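For illustration, a minimal sketch of the idea behind this commit, assuming mentions arrive as `(label, text)` pairs. The `EXACT_LABELS` set and both function names are hypothetical and not taken from the repository. Only the regex mirrors the flattening step shown in the Sourcery sequence diagram.

```python
import re
from collections import defaultdict

# Hypothetical label set: the PR does not list which labels are treated as exact.
EXACT_LABELS = {"DNI"}


def exact_alias(text: str) -> str:
    """Drop every character that is not an ASCII letter or digit."""
    return re.sub(r"[^a-zA-Z0-9]", "", text)


def group_exact(mentions):
    """Group exact-label mentions by (label, alias) instead of fuzzy similarity."""
    groups = defaultdict(list)
    for label, text in mentions:
        if label in EXACT_LABELS:
            groups[(label, exact_alias(text))].append(text)
    return dict(groups)


mentions = [
    ("DNI", "12.345.678"),
    ("DNI", "12345678"),
    ("DNI", "12.345.679"),  # differs by one digit, so it must stay separate
]
groups = group_exact(mentions)
```

Under this scheme, two identifiers that differ by a single digit land in different groups, which a similarity threshold might otherwise merge.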

conrabeatriz changed the title ~~🐛 Bug fix for exact entities.~~ 🐛 Bug fix for exact entities. PR May 20, 2026

sourcery-ai Bot reviewed May 20, 2026


Comment thread aymurai/utils/entity_disambiguation/fuzzy.py Outdated

Comment thread aymurai/transforms/anonymization_postprocess/core.py Outdated

conrabeatriz changed the title ~~🐛 Bug fix for exact entities. PR~~ 🐛 Bug fix for exact entities. May 20, 2026

jansaldo requested a review from Copilot May 21, 2026 18:03

Copilot started reviewing on behalf of jansaldo May 21, 2026 18:03

Copilot AI reviewed May 21, 2026


Comment thread aymurai/utils/entity_disambiguation/fuzzy.py Outdated

Comment thread aymurai/transforms/anonymization_postprocess/core.py Outdated

Comment thread aymurai/transforms/anonymization_postprocess/core.py Outdated

conrabeatriz added 3 commits May 25, 2026 11:46

✨ Introduced commit from PR#81.

6095fd2

⚡️ Improved structure following copilot comments.

6822b32

⚗️ Experimentation.

c963e3d

jansaldo added 2 commits May 27, 2026 18:51

🐛 Merge duplicate labels for the same span and AymurAI label in _dedu…

19603dc

…pe_doclabels function

✅ Add integration test for merging cached duplicate labels for the sa…

221ab6d

…me span

jansaldo merged commit f2310e2 into release/v1.5.0 May 27, 2026
3 checks passed

jansaldo deleted the bug-fix/issue#75 branch May 27, 2026 19:03

jansaldo mentioned this pull request May 27, 2026

DNIs grouped incorrectly AymurAI/desktop-app#75

Closed

jansaldo merged 6 commits into release/v1.5.0 from bug-fix/issue#75


3 participants
