53 changes: 53 additions & 0 deletions docs/user_guides/fs/feature_group/create.md
@@ -102,6 +102,59 @@ MaxDirectoryItemsExceededException - The directory item limit is exceeded: limit

By using partitioning the system will write the feature data in different subdirectories, thus allowing you to write 10240 files per partition.

##### Time-grain partitioning with `partitioned_by` (Delta only)

When you want partition columns that are derived from the feature group's `event_time`, pass the desired time grains with `partitioned_by=[...]` and the Python client derives the partition columns for you.
Pass one or more grains drawn from `hour`, `day`, `week`, `month`, and `year`.

```python
fg = fs.get_or_create_feature_group(
name="transactions",
version=1,
primary_key=["tx_id"],
event_time="tx_ts",
partitioned_by=["year", "month", "day"],
time_travel_format="DELTA",
)
fg.insert(df) # df does not need year/month/day — the client derives them
```

The example above is equivalent to manually decomposing `tx_ts` into three columns and passing `partition_key=["year", "month", "day"]`.
The grain columns are ordinary materialized partition columns: the client computes them from `event_time` on each write and the backend registers them as partition columns through the normal table-creation path.
The source dataframe does not need to carry them.
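
To make the derivation concrete, the sketch below computes grain values from a single timestamp using only the standard library. It illustrates the idea and is not the client's implementation: the helper name `derive_grains` is hypothetical, and the ISO-8601 week numbering used for `week` is an assumption, since the text above does not say how the client defines a week.

```python
from datetime import datetime

# Grains accepted by `partitioned_by`, per the list above.
SUPPORTED_GRAINS = ("hour", "day", "week", "month", "year")


def derive_grains(event_time: datetime, grains: list[str]) -> dict[str, int]:
    """Return one integer per requested grain (hypothetical helper)."""
    unknown = [g for g in grains if g not in SUPPORTED_GRAINS]
    if unknown:
        raise ValueError(f"unsupported grains: {unknown}")
    values = {
        "hour": event_time.hour,
        "day": event_time.day,
        # Assumption: ISO-8601 week number; the real client may differ.
        "week": event_time.isocalendar()[1],
        "month": event_time.month,
        "year": event_time.year,
    }
    return {g: values[g] for g in grains}


row = {"tx_id": 1, "tx_ts": datetime(2026, 3, 14, 9, 30)}
# Add the grain columns that partitioned_by=["year", "month", "day"] implies.
row.update(derive_grains(row["tx_ts"], ["year", "month", "day"]))
print(row["year"], row["month"], row["day"])  # 2026 3 14
```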

`partitioned_by` and `partition_key` are mutually exclusive.
`partitioned_by` requires `event_time` to be set.
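
If you want to catch both constraints before calling the API, a small pre-flight check along these lines can help. This is a sketch under stated assumptions: `check_partition_args` is a hypothetical helper, not part of the Hopsworks client, and the error messages are illustrative rather than the ones the client or backend actually returns.

```python
from typing import Optional, Sequence


def check_partition_args(
    event_time: Optional[str],
    partition_key: Optional[Sequence[str]] = None,
    partitioned_by: Optional[Sequence[str]] = None,
) -> None:
    """Raise ValueError if the two rules above are violated (hypothetical check)."""
    if partitioned_by and partition_key:
        raise ValueError("partitioned_by and partition_key are mutually exclusive")
    if partitioned_by and not event_time:
        raise ValueError("partitioned_by requires event_time to be set")


# A valid combination passes silently.
check_partition_args(event_time="tx_ts", partitioned_by=["year", "month"])
```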

###### Partition pruning

The grain columns are real partition columns, so a filter on a grain column (for example `year == 2026`) prunes partitions natively.
The query layer rewrites a filter on an `event_time` range into equivalent grain-column predicates, so range filters also prune when the spec is hierarchical:

| `partitioned_by` | Prunes on `event_time` range? | Prunes on grain-column filter? |
| --- | --- | --- |
| `["year"]` | ✅ | ✅ |
| `["year", "month"]` | ✅ | ✅ |
| `["year", "month", "day"]` | ✅ | ✅ |
| `["year", "month", "day", "hour"]` | ✅ | ✅ |
| `["month"]` (no year) | ⚠️ no — month alone is ambiguous across years | ✅ filter on month works |
| `["year", "week"]` | ⚠️ year only — week isn't directly derivable from a date range | ✅ both columns prune |
| `["day"]` (no year/month) | ⚠️ no — day-of-month is ambiguous | ✅ filter on day works |

Prefer hierarchical specs (`["year"]`, `["year", "month"]`, `["year", "month", "day"]`) — they line up with the typical batch-pipeline access pattern and prune naturally.
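
The difference between hierarchical and non-hierarchical specs can be seen by enumerating the partitions a date range touches. The sketch below is only an illustration of that reasoning, not the query layer's actual rewrite: `year_month_partitions` is a hypothetical helper that lists the `(year, month)` partitions overlapping a range.

```python
from datetime import datetime


def year_month_partitions(start: datetime, end: datetime) -> list[tuple[int, int]]:
    """List the (year, month) partitions that overlap [start, end] (hypothetical helper)."""
    if end < start:
        return []
    partitions = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        partitions.append((year, month))
        # Step to the next calendar month, rolling over the year in December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return partitions


# A range that crosses a year boundary.
parts = year_month_partitions(datetime(2025, 11, 20), datetime(2026, 2, 3))
print(parts)  # [(2025, 11), (2025, 12), (2026, 1), (2026, 2)]

# With only a month column, the year is lost: the same range collapses to
# month values that cannot distinguish 2025 from 2026.
print(sorted({m for _, m in parts}))  # [1, 2, 11, 12]
```

With `["year", "month"]` each tuple names exactly one partition directory, whereas the month-only projection matches the same months in every year, which is the ambiguity the table flags.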

###### Online feature store

Online-enabled feature groups do not yet support `partitioned_by`.
The online ingestion path does not yet exclude the offline-only grain columns from the Kafka/Avro schema, nor does it materialize them for the online write.
Until that work lands (tracked under a separate follow-up ticket), the backend rejects `partitioned_by` together with `online_enabled=True`.
Keep the feature group offline-only to use `partitioned_by`.

###### Hudi

`partitioned_by` on `time_travel_format="HUDI"` feature groups is not yet supported and the backend rejects it at creation.
Hudi needs a different mechanism (a `CustomKeyGenerator` + server-side `Transformer`) and is tracked under a separate follow-up ticket.
Until that lands, use `time_travel_format="DELTA"` to get time-grain partitioning, or partition Hudi groups explicitly via `partition_key=["year"]` with a `year` column the upstream pipeline computes.

##### Table format

When you create a feature group, you can specify the table format you want to use to store the data in your feature group by setting the `time_travel_format` parameter.