
Uncovering Event Patterns in SQL with Match_Recognize
by Paul SymonsContents
Summary
MATCH_RECOGNIZE
is a SQL clause that helps you identify sequential patterns in your datasets,
using concise and readable syntax. It has innumerable use cases, but popular ones include:
- Detecting patterns of fraud in transactional or game play data
- Triggering alarm states in response to sequences of IOT device events
- Identifying the duration of repeating event sequences (e.g. WA Fuelwatch cycles)
Though MATCH_RECOGNIZE
was added to the SQL:2016 standard some years ago,
only a handful of systems support it, notably:
- Oracle
- Trino (and so also Amazon Athena)
- Snowflake
- Flink
- Azure Stream Analytics
In this blog, we will demonstrate its usefulness by
- Illustrating a simplified dataset where patterns can be identified
- Exploring the syntax of the
MATCH_RECOGNIZE
clause - Identify further learning opportunities for complex use cases
I have used the open-source Trino engine with DuckDB for the examples here, but the results should be similar elsewhere.
Our Dataset - WA Fuelwatch
In Western Australia, there is a legislated and government-run service called FuelWatch. Alongside the legislation - which requires service stations to maintain consistent pricing for 24 hour windows - the website service allows consumers to find the current price of a service station’s fuel, and (after 2pm) also tomorrow’s price of fuel.
I’ve previously experimented with this data, for a presentation at a local data meetup group in Perth, to demonstrate how useful in combination DuckDB and Evidence.dev are for analytics processing and Business Intelligence.
A curious phenomenon observed here with Unleaded Petrol (ULP) is that many retailers (especially in the Perth metropolitan area) engage in price cycling; that is, over a period of days, they raise and lower the prices gradually.
Desired Analysis
We want to understand how the duration of the price cycle varies over time, for each service station.
This is easy to understand visually, when looking at a single station on a graph:
However, with over 800 service stations in the dataset, graphing every service station is not a sustainable solution, and also doesn’t help us to compare cycle durations across comparable dimension keys, such as postcode, or petrol brand.
We will show how MATCH_RECOGNIZE
helps us quickly identify the dates and price range for each successive cycle.
Data Structure
We have a single table called fuelwatch
which is comprised of historical CSV files from the Fuelwatch website.
Here is a random sample of the data:
1 | 2 | 3 | 4 | 5 | |
---|---|---|---|---|---|
publish_date | 2025-01-01 | 2025-01-01 | 2025-01-01 | 2025-01-01 | 2025-01-01 |
trading_name | 7-Eleven Dunsborough | Ampol Foodary Butler | Ampol Foodary Carnarvon | Ampol Foodary Doubleview | Ampol Foodary High Wycombe |
brand_description | 7-Eleven | Ampol | Ampol | Ampol | Ampol |
production_description | ULP | PULP | 98 RON | 98 RON | ULP |
product_price | 179.9 | 208.9 | 223.9 | 198.9 | 164.9 |
address | 100 Commonage Rd | 11 Headingly Cr | 595 Robinson St | 365 Scarborough Beach Rd | 967 Abernethy Rd |
location | DUNSBOROUGH | BUTLER | KINGSFORD | DOUBLEVIEW | HIGH WYCOMBE |
postcode | 6281 | 6036 | 6701 | 6018 | 6057 |
area_description | Busselton-Shire | North of River | Carnarvon | North of River | East/Hills |
region_description | South-West | Metro | Gascoyne | Metro | Metro |
From now on, we will focus on the following key elements:
- publish_date - the date when a price is applicable
- trading_name - the name of the service station
- product_description - we will only focus on the
ULP
value (unleaded petrol) - product_price - the price on the given day
Data Sample
We are going to focus on the Ampol Foodary Thornlie Square Service Station for the month-to-date of March 2025. It has the following values for ULP:
The image below shows this data with some cycle start and end points marked, based on the above data. Green circles represent each cycle start, with the red circles representing the cycle end:
Using MATCH_RECOGNIZE
There’s no getting around the fact that MATCH_RECOGNIZE
is a very powerful, but also complex SQL clause.
Let’s build it up in stages.
Clause
It starts as follows:
SELECT * FROM fuelwatch
MATCH_RECOGNIZE(
-- options
)
Note that the order in which options appear in the
MATCH_RECOGNIZE
clause are important; the statement may not compile if they are out of order.
All options are specified within the clause itself.
The output of your query with MATCH_RECOGNIZE
will differ depending on the options you specify. More on that later.
Partitions and Ordering
Like window functions, MATCH_RECOGNIZE
can be applied partition-wise;
this means patterns will be matched for rows that are grouped by distinct values of the partition key.
Ordering is also important, as patterns are expected to be matched sequentially. Omitting the order will often produce non-deterministic results.
Below we add the partition key trading_name (the name of the service station), and order by the publish_date of the price:
SELECT * FROM fuelwatch
MATCH_RECOGNIZE(
PARTITION BY trading_name
ORDER BY publish_date
-- more options
)
Neither PARTITION BY
nor ORDER BY
are required, however it is difficult to imagine a use case that warrants omission of ORDER BY
.
Declaring a Pattern
To specify how we will match sequences of rows, we introduce a pattern. This comprises a sequence of custom identifiers that may optionally include regex semantics.
For example, if I specify a pattern identifier LOWPRICE
, that alone is valid. But I can additionally
specify:
LOWPRICE{3,}
to indicate 3 or more LOWPRICE pattern matchesLOWPRICE+
to indicate at least one LOWPRICE pattern matchLOWPRICE*
for 0 or more LOWPRICE pattern matches
… and so on.
In our fuelwatch example, we want to match when the price goes down, optionally match when it plateaus, and finally, match when it goes up:
SELECT * FROM fuelwatch
MATCH_RECOGNIZE(
PARTITION BY trading_name
ORDER BY publish_date
PATTERN ( DOWN_OR_PLATEAU* BOTTOM)
-- more options
)
Defining Pattern Identifiers
Once a pattern has been specified, identifiers in the pattern can be defined.
To define identifiers, we use the DEFINE
keyword, following by a comma separated list of identifiers and their definitions:
SELECT * FROM fuelwatch
MATCH_RECOGNIZE(
PARTITION BY trading_name
ORDER BY publish_date
PATTERN ( DOWN_OR_PLATEAU* BOTTOM)
DEFINE
DOWN_OR_PLATEAU AS NEXT(product_price) <= product_price,
BOTTOM AS NEXT(product_price) > product_price
-- more options)
By now you are probably wondering, what is NEXT()
for? The answer, it is one of many possible
navigation functions
that allow you to compare the current row to another row(s). In our case, we are
comparing to a value in the row immediately following. These navigation patterns are incredible
powerful and worth studying in detail to meet your use case.
It is not actually necessary to define every identifier, but if left undefined, an identifier will match any row relative to where it is referenced in the
PATTERN()
.For example, if we modified our pattern to be
PATTERN ( START DOWN_OR_PLATEAU* BOTTOM)
the
START
identifier would always match the row before aDOWN_OR_PLATEAU
sequence was encountered.This may seem useful, and it often is; but be careful, as the row matched by
START
will be included in match result. If yourMATCH_RECOGNIZE
clause is configured to skip past the last matching row - which is the default - yourSTART
row may steal the place of what should be your firstDOWN_OR_PLATEAU
row.
Adding Measures
Finally, we add some measures that will constitute our match results.
You can see more navigation functions here - e.g. FIRST()
and LAST()
SELECT * FROM fuelwatch
MATCH_RECOGNIZE(
PARTITION BY trading_name
ORDER BY publish_date
MEASURES
FIRST(DOWN_OR_PLATEAU.publish_date) AS start_date
, LAST(BOTTOM.publish_date) AS end_date
, FIRST(DOWN_OR_PLATEAU.product_price) AS start_price
, LAST(BOTTOM.product_price) AS end_price
, DATE_DIFF('day', FIRST(DOWN_OR_PLATEAU.publish_date), LAST(BOTTOM.publish_date)) AS days
PATTERN ( DOWN_OR_PLATEAU* BOTTOM)
DEFINE
DOWN_OR_PLATEAU AS NEXT(product_price) <= product_price,
BOTTOM AS NEXT(product_price) > product_price
)
This creates a result set that simplifies the desired facets of the input table shown previously:
trading_name | start_date | end_date | start_price | end_price | days |
---|---|---|---|---|---|
Ampol Foodary Thornlie Square | 2025-03-01 | 2025-03-02 | 171.9 | 170.9 | 1 |
Ampol Foodary Thornlie Square | 2025-03-03 | 2025-03-04 | 171.9 | 163.9 | 1 |
Ampol Foodary Thornlie Square | 2025-03-05 | 2025-03-09 | 188.9 | 170.9 | 4 |
Ampol Foodary Thornlie Square | 2025-03-10 | 2025-03-11 | 172.9 | 163.9 | 1 |
Ampol Foodary Thornlie Square | 2025-03-12 | 2025-03-18 | 187.9 | 160.9 | 6 |
Ampol Foodary Thornlie Square | 2025-03-19 | 2025-03-25 | 199.9 | 159.9 | 6 |
Further Learning
We’ve barely scratched the surface of what MATCH_RECOGNIZE
is capable of, and we’ve also relied on several defaults.
I’ll explain these briefly below.
Rows per match
There are two options for how to generate match results from your query:
ONE ROW PER MATCH
ALL ROWS PER MATCH
(and options)
The first option is the default, and creates a summary record for each complete PATTERN
match, based on the your defined MEASURES
.
The second option will return all rows included in each match, and optionally, other rows that matched nothing. In this mode, additional functions
like CLASSIFIER()
become more useful in MEASURES
, notating the PATTERN
identifier that this row matches, e.g. DOWN_OR_PLATEAU
, BOTTOM
, etc.
After Match Skip
When a full pattern has been matched, you can control where the “cursor” is placed next. This is specified with phrases beginning AFTER MATCH SKIP...
The default is AFTER MATCH SKIP PAST LAST ROW
- this will look for new matches from the row immediately following the last row from your previous match.
One of a few alternatives include AFTER MATCH SKIP TO NEXT ROW
, which has more of a sliding window effect.
Conclusion
MATCH_RECOGNIZE
takes some commitment to learn, but it pays back in dividends.
I’ve found one of the most useful ways to make progress is to start with ALL ROWS PER MATCH WITH UNMATCHED ROWS
,
and specify a single MEASURE
that shows the output of the CLASSIFIER()
function. This allows
you to focus on labelling rows correctly. Once rows are labelled correctly, adding further measures
becomes much more straight forward.
I hope you’ve enjoyed this brief introduction to the MATCH_RECOGNIZE
clause - it
is a very powerful way to analyse sequences of events in your data!
References
- Fuelwatch MATCH_RECOGNIZE example with Trino and DuckDB
- Row pattern recognition with MATCH_RECOGNIZE
- DuckDB Roadmap -
MATCH_RECOGNIZE
support is planned