Expiring rows (EXPIRE ROWS)

EXPIRE ROWS is a row-level retention policy for materialized views. Where TTL drops whole partitions once they age out, EXPIRE ROWS decides retention row by row — keep the latest row per key, the top-N per group, rows matching a predicate, and so on — and recomputes the result continuously as the view refreshes.

Expired rows disappear from query results immediately; their on-disk storage is reclaimed afterwards by a background job.

note

EXPIRE ROWS is materialized-view-only, and the view must be a passthrough (non-aggregating) view: SELECT * FROM base with no SAMPLE BY / GROUP BY. CREATE TABLE ... EXPIRE ROWS and aggregating views are rejected. For base-table retention use TTL or, on Enterprise, storage policies.

When to use EXPIRE ROWS

Reach for EXPIRE ROWS on a passthrough materialized view when you want a continuously-maintained, pruned copy of a base table:

  • Current-state-per-key tables — keep only the latest row per device, symbol, or session (KEEP LATEST).
  • Per-group extremes or leaderboards — keep the highest/lowest value per group, or the top-N (KEEP HIGHEST/LOWEST, KEEP N).
  • Rolling row-level windows — keep rows newer than a cutoff, or matching any predicate, at finer granularity than TTL's whole-partition drops (WHEN predicate).

Use TTL instead when partition-granularity, age-based retention on a base table is enough — it is cheaper and has no passthrough-view requirement.

Requirements

EXPIRE ROWS requires a passthrough materialized view:

  • The view query is SELECT * FROM base. A column subset and a WHERE filter are allowed; aggregation, SAMPLE BY, GROUP BY, LATEST ON, DISTINCT, UNION, and JOIN are not — they make the view non-passthrough and are rejected with EXPIRE ROWS is only supported on passthrough (non-aggregating) materialized views.
  • The view inherits the base table's designated timestamp and partitioning.

A passthrough view mirrors its base table 1:1 and refreshes incrementally, so it is effectively a continuously-maintained replica. EXPIRE ROWS prunes that replica down to the rows you want to keep — without touching the base table.

The modes

Every mode keeps a defined set of rows and expires the rest. A row is expired only when the rule selects it for removal.

ModeWhat it keepsSyntax
Per-row predicateRows for which the predicate is not TRUEEXPIRE ROWS WHEN predicate
Keep latestThe latest row per key (current state per key)EXPIRE ROWS KEEP LATEST [ON ts] PARTITION BY cols
Keep highest / lowestRows tied at the group max / min of a columnEXPIRE ROWS KEEP HIGHEST|LOWEST col [PARTITION BY cols]
Keep top-NThe N highest / lowest rows per groupEXPIRE ROWS KEEP N HIGHEST|LOWEST col [PARTITION BY cols]
Window predicateRows for which a window predicate is not TRUEEXPIRE ROWS WHEN windowPredicate

KEEP HIGHEST/LOWEST and KEEP N are convenience forms that desugar to a window predicate, so the window WHEN is the general escape hatch.

The clause is attached to a passthrough CREATE MATERIALIZED VIEW (after the query, and after PARTITION BY if present), or set later with ALTER MATERIALIZED VIEW ... SET EXPIRE ROWS:

EXPIRE ROWS
{ WHEN predicate
| KEEP LATEST [ ON timestampColumn ] PARTITION BY col [, col ...]
| KEEP [ N ] ( HIGHEST | LOWEST ) col [ PARTITION BY col [, col ...] ] }
[ CLEANUP EVERY duration ]
ElementMeaning
predicateAny boolean expression over the view's columns. A row expires when it evaluates TRUE.
KEEP LATESTKeep the latest row per PARTITION BY key, by the designated timestamp.
ON timestampColOptional; if given it must name the view's designated timestamp.
HIGHEST|LOWESTKeep rows at the max / min of col per group (N omitted), or the top N per group.
CLEANUP EVERYHow often the background reclamation job runs for this view. Defaults to 1h if omitted.

Worked examples

The following walks through every mode on a small fixed dataset so you can see exactly which rows each policy keeps.

Setup

Base table and sample data
CREATE TABLE trades (
symbol SYMBOL,
price DOUBLE,
amount DOUBLE,
ts TIMESTAMP
) TIMESTAMP(ts) PARTITION BY DAY WAL;

INSERT INTO trades VALUES
('BTC', 100.0, 1.0, '2024-01-01T10:00:00.000000Z'),
('BTC', 105.0, 2.0, '2024-01-01T11:00:00.000000Z'),
('BTC', 102.0, 1.5, '2024-01-02T09:00:00.000000Z'),
('ETH', 50.0, 3.0, '2024-01-01T10:30:00.000000Z'),
('ETH', 55.0, 1.0, '2024-01-02T08:00:00.000000Z');
note

A materialized view starts an asynchronous initial refresh on creation, so it may briefly return no rows. Check progress with SELECT view_name, view_status, base_table_txn, refresh_base_table_txn FROM materialized_views(); — the view is up to date when refresh_base_table_txn = base_table_txn. With this small dataset that is effectively instant.

Per-row predicate: WHEN

A per-row predicate expires a row when it evaluates TRUE. Here, expire small trades (amount < 1.5):

Expire rows where amount < 1.5
CREATE MATERIALIZED VIEW trades_sized AS (
SELECT * FROM trades
) EXPIRE ROWS WHEN amount < 1.5;

SELECT * FROM trades_sized;
symbolpriceamountts
BTC105.02.02024-01-01T11:00:00.000000Z
BTC102.01.52024-01-02T09:00:00.000000Z
ETH50.03.02024-01-01T10:30:00.000000Z

The two amount = 1.0 rows are expired. amount = 1.5 is kept (1.5 < 1.5 is FALSE), and any NULL amount would be kept too (the comparison is UNKNOWN, not TRUE — see NULLs).

A predicate on the designated timestamp gives a rolling retention window, re-evaluated on every read so the visible set rolls forward with the clock:

Keep the last 1 day
CREATE MATERIALIZED VIEW trades_recent AS (
SELECT * FROM trades
) EXPIRE ROWS WHEN ts < dateadd('d', -1, now());

(With the 2024 sample timestamps above, every row is already older than a day and would be hidden; use recent data to see rows retained.)

Keep latest per key: KEEP LATEST

Keep only the most recent row per key — turning the passthrough view into a live, current-state-per-symbol table:

Keep the latest row per symbol
CREATE MATERIALIZED VIEW trades_latest AS (
SELECT * FROM trades
) EXPIRE ROWS KEEP LATEST PARTITION BY symbol;

SELECT * FROM trades_latest;
symbolpriceamountts
BTC102.01.52024-01-02T09:00:00.000000Z
ETH55.01.02024-01-02T08:00:00.000000Z

One row per symbol — the latest by the designated timestamp ts. As new trades arrive, the kept row advances automatically. PARTITION BY may list multiple key columns. You may write KEEP LATEST ON ts PARTITION BY symbol, but the ON column must be the view's designated timestamp.

Keep extremes per group: KEEP HIGHEST / KEEP LOWEST

Keep the rows tied at the group maximum (or minimum) of a column:

Keep the highest-priced trade per symbol
CREATE MATERIALIZED VIEW trades_peak AS (
SELECT * FROM trades
) EXPIRE ROWS KEEP HIGHEST price PARTITION BY symbol;

SELECT * FROM trades_peak;
symbolpriceamountts
BTC105.02.02024-01-01T11:00:00.000000Z
ETH55.01.02024-01-02T08:00:00.000000Z

KEEP LOWEST price PARTITION BY symbol keeps the cheapest instead (BTC 100.0, ETH 50.0). All rows tied at the extreme are kept, and NULL-valued rows are kept (a NULL is never less than the max).

Keep top-N per group: KEEP N HIGHEST / KEEP N LOWEST

Keep a per-group leaderboard — the N highest (or lowest) rows:

Keep the 2 highest-priced trades per symbol
CREATE MATERIALIZED VIEW trades_top2 AS (
SELECT * FROM trades
) EXPIRE ROWS KEEP 2 HIGHEST price PARTITION BY symbol;

SELECT * FROM trades_top2 ORDER BY symbol, price DESC;
symbolpriceamountts
BTC105.02.02024-01-01T11:00:00.000000Z
BTC102.01.52024-01-02T09:00:00.000000Z
ETH55.01.02024-01-02T08:00:00.000000Z
ETH50.03.02024-01-01T10:30:00.000000Z

BTC keeps its two highest (105, 102) and drops 100; ETH has only two rows, so both survive. Ties are broken by the designated timestamp, so the N-th boundary is deterministic.

Window predicate: the escape hatch

KEEP HIGHEST/LOWEST and KEEP N are shorthand for window predicates. When you need a rule they do not cover, write the window predicate directly in WHEN. For example, this is exactly what KEEP HIGHEST price PARTITION BY symbol expands to:

Equivalent to KEEP HIGHEST, written as a window predicate
CREATE MATERIALIZED VIEW trades_peak_win AS (
SELECT * FROM trades
) EXPIRE ROWS WHEN price < max(price) OVER (PARTITION BY symbol);

A row expires when its price is below its symbol's maximum, so only the peak per symbol survives — the same result as trades_peak above. From here you can express richer rules, for example keeping rows within 5% of the peak (WHEN price < 0.95 * max(price) OVER (PARTITION BY symbol)) or a ranked window (WHEN row_number() OVER (PARTITION BY symbol ORDER BY ts DESC) > 100).

How it works

EXPIRE ROWS has two cooperating parts: an authoritative read-time filter and a best-effort background cleanup.

Read-time filter (authoritative)

Every query against a policied view is transparently rewritten so that only the kept rows are visible — immediately, and regardless of whether cleanup has run. This is what makes results correct at all times:

  • Per-row WHEN keeps rows where the predicate is not TRUE. QuestDB filtering is three-valued, so FALSE and NULL are kept (see NULLs).
  • KEEP LATEST returns the latest row per key using the designated timestamp.
  • KEEP HIGHEST/LOWEST/N and window WHEN compute the keep-set with a window function over the whole view.

Because the filter is applied at query time, a freshly-refreshed row that should be expired is hidden the moment it lands, and a row that should reappear (under a time-based predicate) reappears on the next read.

Physical cleanup (best-effort)

A background job reclaims disk for non-active partitions: a fully-expired partition is removed, and a partially-expired one is compacted down to its survivors. It runs at the CLEANUP EVERY cadence (default 1h) and is best-effort — the read filter is authoritative, so deferred or skipped reclamation only affects disk usage, never query results.

On QuestDB Enterprise, cleanup runs on the primary only, but the reclamation still replicates: the compaction commits are ordinary WAL transactions, so replicas reclaim the identical rows by applying them. A read-only replica neither runs the job nor needs to. Disable the job with cairo.row.expiry.enabled=false in server.conf (reads stay filtered; only reclamation stops).

To observe reclamation, compare the physical row count per partition before and after a sweep:

Physical rows still on disk per partition
SELECT name, numRows FROM table_partitions('trades_latest');

Reclamation defers while a view is being refreshed continuously and resumes on a quiet sweep.

Semantics

NULLs

The keep-set is computed with three-valued logic, so a NULL value is never less than a group maximum (the comparison is UNKNOWN, not TRUE). Therefore:

  • KEEP HIGHEST/LOWEST and value-based WHEN predicates keep rows whose value is NULL.
  • KEEP LATEST uses the designated timestamp, which is never NULL.
  • KEEP N is the exception. It ranks rows with row_number(), and QuestDB has no NULLS LAST, so where a NULL lands is type-dependent: under DESC a floating-point NULL (NaN) sorts first (kept while there is room in N), while an integer/timestamp NULL sorts last (expired first). Use KEEP HIGHEST/LOWEST (no N) when every NULL must be kept regardless of type.

Ties and determinism

KEEP HIGHEST/LOWEST keeps all rows tied at the max/min — deterministic by construction. KEEP N makes the order total by appending the designated timestamp as a tiebreak, so the N-th boundary is deterministic (pair the base table with DEDUP UPSERT KEYS if (col, ts) is not already unique).

Monotonicity and cleanup safety

Physical deletion is only safe when expiry is monotonic: a row that is expired now must stay expired forever. All the relative modes (KEEP LATEST, KEEP HIGHEST/LOWEST, KEEP N) are monotonic by construction, as is a designated-timestamp predicate such as WHEN ts < now() (a row only gets older). A scalar WHEN predicate is arbitrary SQL, so monotonicity is the author's responsibility.

warning

A non-monotonic predicate such as WHEN ts > now() expires future rows that un-expire as now() advances. The read filter recomputes now() on every read and stays correct, but the cleanup job assumes monotonicity and may physically delete a row that a later read would otherwise show (recoverable only by a full refresh). Write WHEN predicates that expire things in the past (ts < now()) or against fixed thresholds — never rows that the passage of time will later keep.

Inspecting a policy

SHOW CREATE MATERIALIZED VIEW renders the clause as written:

SHOW CREATE MATERIALIZED VIEW trades_latest;
-- ... EXPIRE ROWS KEEP LATEST PARTITION BY symbol

The materialized_views() function exposes the policy in the expire_predicate and expire_cleanup_every columns (both NULL when no policy is set):

List EXPIRE ROWS policies
SELECT view_name, expire_predicate, expire_cleanup_every
FROM materialized_views();
view_nameexpire_predicateexpire_cleanup_every
trades_sizedamount < 1.51h
trades_latestKEEP LATEST PARTITION BY symbol1h
trades_top2KEEP 2 HIGHEST price ...1h

Changing or removing a policy

Set, change, or drop a policy on an existing passthrough view — see ALTER MATERIALIZED VIEW SET EXPIRE:

-- set or replace the policy
ALTER MATERIALIZED VIEW trades_latest SET EXPIRE ROWS KEEP LATEST PARTITION BY symbol;

-- remove it (keeps all rows again)
ALTER MATERIALIZED VIEW trades_latest DROP EXPIRE;

SET EXPIRE ROWS validates the new policy against the view's columns before applying it, so an invalid predicate or an unknown column is rejected up front rather than breaking subsequent reads.

Limitations and operational notes

  • Reads recompute the keep-set. A relative/window policy computes its keep-set over the whole physical view on every read. KEEP LATEST on an indexed symbol key is cheap; the window modes (and non-indexed keep-latest) scan the view. A tighter CLEANUP EVERY keeps the physical residue — and therefore the read cost — small.
  • Cleanup defers under continuous refresh. Reclamation only proceeds when the view is quiescent and fully applied, so a view being refreshed continuously defers reclamation to a quiet sweep. The read filter stays authoritative meanwhile.
  • KEEP LATEST [ON ts]. The optional ON ts is accepted for familiarity but the view's designated timestamp is always used; naming a different column is rejected.
  • Non-monotonic WHEN predicates are unsupported for cleanup — see monotonicity above.