
Big JSON documents with DuckDB
by Paul Symons
Contents
- Summary
- Preface: Why people like DuckDB
- The Uncomfortable Truth
- How we typically work with big data
- The AEMO (WEM) example
- Processing Large JSON documents with DuckDB
- Unnesting the SCADA data
- Wrapping Up
Summary
In this blog we are going to discuss how DuckDB can help you make sense of very large JSON documents using only its CLI.
Skip the feathers if you want to get straight to the eggs
Preface: Why people like DuckDB
The short version of why people like DuckDB is that it lets them focus on the problem to be solved, in contrast to overly opinionated solutions where mastery requires an abundance of precious learning hours.
The Why DuckDB page explains this in greater detail, but the short answer is that DuckDB is:
- Simple - Download the binary and run; no dependencies, no complex install
- Portable - runs on multiple platforms and architectures (e.g. Windows / OSX / Linux, x86 / ARM)
- Feature-Rich - supports a large subset of the functionality of commercial and other open source analytical databases
- Fast - both performance of analytical operations, and more broadly - in time to value
- Extensible - supports community extensions whilst keeping the core lean
- Free - from both licensing and beer perspectives
The Uncomfortable Truth
Sometimes we have to work with very large JSON documents, and the size can introduce challenges. Most commonly, the size of documents can make them slow to work with, difficult to format, and as a result the schema can be obscured from us, as we struggle to navigate the data. And if your document doesn’t have line endings, then your day probably just got even worse.
How we typically work with big data
The term Big Data - mostly out of vogue these days - had its moment in the sun in the 2010s. It was popularised by the growth in parallel technologies - both software and software-defined infrastructure - built to handle large data sets that couldn’t be processed in a timely manner with conventional hardware and software, e.g. single-node databases.
These days, we are accustomed to working with proprietary and open source OLAP systems that appear to scale without limits: technologies like Spark have provided a neat interface and programming model, distributing the work of transforming large datasets across multiple units of compute.
Yet many of these technologies can still be stymied by a single large JSON document in the 100s of megabytes or even gigabytes.
Note the difference between a file containing a single large JSON document and a file containing millions or billions of smaller JSON documents separated by line endings (JSONL); the latter can be processed with ease, while the former is typically more challenging.
This is documented in services like Amazon Athena, which is subject to both row and column size limitations (32Mb) and input text file line length limitations (200Mb); Snowflake also has column size limitations of around 16Mb. Whilst I don’t believe Spark has a particular limitation, a very large document object will still be processed on a single node, so you would need to ensure your cluster configuration is appropriate to the size of your documents to avoid out-of-memory errors.
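To make the distinction concrete, here is a hypothetical illustration of the two layouts (the field names are made up, not taken from the AEMO data):

A JSONL file - one complete document per line, easy to split and parallelise:
{"tag": "meter.1", "value": 1.0}
{"tag": "meter.2", "value": 2.0}

A single JSON document - one object spanning the entire file, however large:
{"data": {"records": [{"tag": "meter.1", "value": 1.0}, {"tag": "meter.2", "value": 2.0}]}, "errors": []}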
Does it have to be so large?
Wouldn’t it be easier to de-normalize the documents into smaller, constituent parts?
Sure it would! However, that is not always within your control, especially when the data is authored by a third party.
The AEMO (WEM) example
The Australian Energy Market Operator is a co-ordinating agency that is responsible for many of the day-to-day operational functions in the regulated electricity markets it participates in:
- National Electricity Market (NEM - East Coast)
- Wholesale Electricity Market (WEM - WA SWIS)
These regulated markets are governed by rules that require some of the operational data to be made available to market participants.
In the case of the WEM, this data is published in the public domain; one particular data source, the case input data for the dispatch engine solver, produces a daily data dump in a JSON file:
- 400Mb in size approximately
- no line endings
- single JSON document
We can’t control the size, schema or contents of this document, and no schema is published - so we need to find a way to learn more about it. Let’s get started!
Processing Large JSON documents with DuckDB
Spoiler alert - here’s what we’ll focus on to help you make peace with JSON Bigs:
- DESCRIBE a query and create a table from its result
- Use different output modes to improve schema visibility
- Use a SQL formatter to understand your captured schema
- Unnest document fragments to sample the schema
I assume you have duckdb installed and an obtusely large JSON file.
Start duckdb
Jump into the duckdb console - if you don’t specify a database file, an in-memory database will be used. Using a database file can help you save your work for another time.
$ duckdb
v1.2.0 5f5512b827
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D
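If you want your tables and views kept for another session, pass a database filename when starting instead - a quick sketch (the filename is my own choice):
$ duckdb bigjson.duckdb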
Try selecting from your obtusely large file
D SELECT * FROM
read_json('/home/paul/Downloads/ReferenceDispatchCase.json');
Invalid Input Error:
"maximum_object_size" of 16777216 bytes exceeded while reading
file "ReferenceDispatchCase_202502050800.json" (>33554428 bytes).
Try increasing "maximum_object_size".
By default, duckdb won’t load JSON documents larger than 16Mb.
Try again, correcting for the error. The number you provide should be bigger than the size of a single JSON document, but it may be simpler to just pick something greater than the size of the file.
D select * from
read_json('ReferenceDispatchCase.json',
maximum_object_size = 405000000);
┌──────────────────────────────────────────────────┬────────┬──────────┬────────┬──────────────────────────────────────┐
│ data │ errors │ warnings │ infos │ transactionId │
│ struct(dispatchdataissueid varchar, casedata s… │ json[] │ json[] │ json[] │ uuid │
├──────────────────────────────────────────────────┼────────┼──────────┼────────┼──────────────────────────────────────┤
│ {'dispatchDataIssueID': 20250207113008192_4668… │ [] │ [] │ [] │ 96dd2a59-1465-448f-a415-7f577a836045 │
└──────────────────────────────────────────────────┴────────┴──────────┴────────┴──────────────────────────────────────┘
Phew! That statement above took around 26 seconds to complete on my laptop - reading a 375Mb file.
We have 5 top level properties: data, errors, warnings, infos and transactionId.
The first column, data, appears to be the substantial payload; its truncated STRUCT type needs some more exploration.
Let’s speed up our feedback loop by capturing the schema of our query to its own table.
Capture the converted JSON schema
We can view the schema of the JSON document (albeit converted to SQL types) by simply using the DESCRIBE keyword ahead of our query:
D DESCRIBE select * from
read_json('ReferenceDispatchCase.json',
maximum_object_size = 405000000);
┌─────────────┬────────────────────────────────────────────────────────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name │ column_type │ null │ key │ default │ extra │
│ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │
├─────────────┼────────────────────────────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────┤
│ data │ STRUCT(dispatchDataIssueID VARCHAR, caseData STRUCT(status V… │ YES │ NULL │ NULL │ NULL │
└─────────────┴────────────────────────────────────────────────────────────────┴─────────┴─────────┴─────────┴─────────┘
OS caching does improve the query time somewhat, if you have enough memory - however it is still a bit slow to re-read a large JSON file for every query. duckdb allows us to store the schema in a table of its own, if we want to.
D CREATE OR REPLACE TABLE rdcschema AS SELECT * FROM (
DESCRIBE select * from
read_json('ReferenceDispatchCase.json',
maximum_object_size = 405000000)
);
D SELECT * FROM rdcschema;
┌───────────────┬──────────────────────────────────────────────────────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name │ column_type │ null │ key │ default │ extra │
│ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │
├───────────────┼──────────────────────────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────┤
│ data │ STRUCT(dispatchDataIssueID VARCHAR, caseData STRUCT(status… │ YES │ NULL │ NULL │ NULL │
│ errors │ JSON[] │ YES │ NULL │ NULL │ NULL │
│ warnings │ JSON[] │ YES │ NULL │ NULL │ NULL │
│ infos │ JSON[] │ YES │ NULL │ NULL │ NULL │
│ transactionId │ UUID │ YES │ NULL │ NULL │ NULL │
└───────────────┴──────────────────────────────────────────────────────────────┴─────────┴─────────┴─────────┴─────────┘
Querying the rdcschema table is instantaneous now; but we still can’t see the full schema of the data column!
Change the duckdb output mode
duckdb has a number of different output modes - these determine how the results from your queries are formatted. This is a very useful feature: it means you can do things like issue a query and have the result set returned as a parseable JSON document. It also offers different modes for presenting in the CLI. The default mode is duckbox, which does terminal-friendly things like truncating the width of columns, and also the number of columns included in the output if there are many. This helps duckdb present a consistent output, but sometimes you need the full output. Let’s explore some of the different modes.
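For example, a quick sketch of sending the next result to a file as JSON - .mode and .once are CLI dot commands, and the filename is my own choice:
D .mode json
D .once results.json
D SELECT * FROM rdcschema;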
There are over 15 different output modes, but for now we are going to switch to line mode. To switch modes, use the .mode command followed by the mode you want. You can then run your query to see how its presentation has changed:
D .mode line
D select * from rdcschema;
column_name = data
column_type = STRUCT(dispatchDataIssueID VARCHAR, caseData STRUCT(status VARCHAR, scenario VARCHAR, dispatchInterval VARCHAR, primaryDispatchInterval VARCHAR, parameters STRUCT("name" VARCHAR, "value" DOUBLE)[], constraintViolationParameters STRUCT("variable" VARCHAR, price DOUBLE)[], scada STRUCT(tag VARCHAR, "value" VARCHAR, valueDataType VARCHAR, qualityFlag VARCHAR, dataSource VARCHAR, asAtTimeStamp VARCHAR)[], constraintLookUp STRUCT("forecast.ess-regulationRaise-expected" DOUBLE, "forecast.ess-regulationLower-expected" DOUBLE, "forecast.ess-contingencyLower-expected" DOUBLE, "forecast.energy-forecastOperationalDemand-expected" DOUBLE, "forecast.energy-forecastOperationalDemand-high" DOUBLE, "forecast.energy-forecastOperationalDemand-low" DOUBLE, "forecast.regionalForecast-easternGoldfields-expected" DOUBLE, "forecast.regionalForecast-southWestGreatSouthernRegionBaseLoad-expected" DOUBLE, "forecast.regionalForecast-northCountryRegionLoad-expected" DOUBLE, "forecast.regionalForecast-northOfThreeSprings-expected" DOUBLE, "forecast.regionalForecast-southWestGreatSouthernRegionLoad-expected" DOUBLE, "forecast.regionalForecast-northCountryRegionBaseLoad-expected" DOUBLE, "forecast.ess-contingencyLowerOffset-expected" DOUBLE, "forecast.energy-forecastDPV-expected" DOUBLE, "forecast.nsgForecast-albanyWindFarm-bomAccessA-expected" DOUBLE, "forecast.nsgForecast-albanyWindFarm-bomAccessA-low" DOUBLE, "forecast.nsgForecast-albanyWindFarm-bomAccessA-high" DOUBLE, "solver.minutesSincePrimary" DOUBLE), "constraints" STRUCT(constraintSet STRUCT(id VARCHAR, constraintEquations VARCHAR[], "version" BIGINT, description VARCHAR, "comments" VARCHAR)[], constraintEquations STRUCT(id VARCHAR, description VARCHAR, "comments" VARCHAR, leftHandSide STRUCT(term VARCHAR, coefficient DOUBLE, termType VARCHAR, "index" BIGINT)[], "operator" VARCHAR, rightHandSide STRUCT(term VARCHAR, coefficient DOUBLE, termType VARCHAR, "index" BIGINT)[], rightHandSideScript VARCHAR, required BOOLEAN, limitType VARCHAR, limitAdviceId VARCHAR, isInterventionEvent BOOLEAN, constraintType VARCHAR, defaultRHS DOUBLE, violationPenalty DOUBLE, contingency VARCHAR, monitoredElement VARCHAR, "version" BIGINT, systemConfiguration VARCHAR)[]), dfcmData STRUCT(dfcmId VARCHAR, dfcmSchemaVersion VARCHAR, frequencyLimits STRUCT(rocofMax DOUBLE, rocofMin DOUBLE, fmin DOUBLE, fmax DOUBLE, fss DOUBLE, marginForOperationCRR DOUBLE), tauValuesforCRPerformanceFactors DOUBLE[], dfcmValidationDetails STRUCT(inertiaRangeMinimum DOUBLE, inertiaRangeMaximum DOUBLE, inertiaRangeStep DOUBLE, largestContingencySizeMinimum DOUBLE, largestContingencySizeMaximum DOUBLE, largestContingencySizeStep DOUBLE, underlyingDemandMinimum DOUBLE, underlyingDemandMaximum DOUBLE, underlyingDemandStep DOUBLE, loadReliefValues DOUBLE[], dpvMinimum DOUBLE, dpvMaximum DOUBLE, dpvStep DOUBLE)), unconstrainedForecast STRUCT(facilityCode VARCHAR, unconstrainedInjectionForecastMw DOUBLE, unconstrainedWithdrawalForecastMw DOUBLE)[], registrationData STRUCT(facilityCode VARCHAR, nol BOOLEAN, facilityStandingData STRUCT(facilityRegistrationStatus VARCHAR, facilityClass VARCHAR, storageConstraints BOOLEAN, inertia DOUBLE, tauFactor DOUBLE, effectiveDispatchIntervalFrom VARCHAR, unconstrainedForecastSource VARCHAR))[], facilityLossFactors STRUCT(facilityCode VARCHAR, distributionLossFactor DOUBLE, transmissionLossFactor DOUBLE)[], markets STRUCT(energy STRUCT(marketService VARCHAR, baseForecastRequirement DOUBLE, overrideForecastRequirement JSON, facilities STRUCT(maxInjectionCapacity 
DOUBLE, maxWithdrawalCapacity DOUBLE, unconstrainedInjectionForecast DOUBLE, unconstrainedWithdrawalForecast DOUBLE, inflexibleFlag BOOLEAN, maxUpwardRampRate DOUBLE, downwardsRampRateValueUsed DOUBLE, maxDownwardRampRate DOUBLE, upwardsRampRateValueUsed DOUBLE, fsip STRUCT(t1 DOUBLE, t2 DOUBLE, t3 DOUBLE, t4 DOUBLE, minimumLoad DOUBLE), facilityCode VARCHAR, submissionId UUID, submissionCode VARCHAR, tranches STRUCT(tranche BIGINT, fuelType VARCHAR, quantity DOUBLE, submittedPrice DOUBLE, lfaPrice DOUBLE, capacityType VARCHAR, noticeTime BIGINT)[], dspUnconstrainedWithdrawalQuantity DOUBLE, dspConstrainedWithdrawalQuantity DOUBLE)[]), regulationRaise STRUCT(marketService VARCHAR, baseForecastRequirement DOUBLE, overrideForecastRequirement JSON, facilities STRUCT(enablementMinimum DOUBLE, enablementMinimumValueUsed DOUBLE, lowBreakpoint DOUBLE, lowBreakpointValueUsed DOUBLE, highBreakpoint DOUBLE, highBreakpointValueUsed DOUBLE, enablementMaximum DOUBLE, enablementMaximumValueUsed DOUBLE, maximumCapacity DOUBLE, facilityCode VARCHAR, submissionId UUID, submissionCode VARCHAR, tranches STRUCT(tranche BIGINT, fuelType VARCHAR, quantity DOUBLE, submittedPrice DOUBLE, lfaPrice DOUBLE, capacityType VARCHAR, noticeTime BIGINT)[])[]), regulationLower STRUCT(marketService VARCHAR, baseForecastRequirement DOUBLE, overrideForecastRequirement JSON, facilities STRUCT(enablementMinimum DOUBLE, enablementMinimumValueUsed DOUBLE, lowBreakpoint DOUBLE, lowBreakpointValueUsed DOUBLE, highBreakpoint DOUBLE, highBreakpointValueUsed DOUBLE, enablementMaximum DOUBLE, enablementMaximumValueUsed DOUBLE, maximumCapacity DOUBLE, facilityCode VARCHAR, submissionId UUID, submissionCode VARCHAR, tranches STRUCT(tranche BIGINT, fuelType VARCHAR, quantity DOUBLE, submittedPrice DOUBLE, lfaPrice DOUBLE, capacityType VARCHAR, noticeTime BIGINT)[])[]), contingencyRaise STRUCT(marketService VARCHAR, baseForecastRequirement DOUBLE, overrideForecastRequirement JSON, facilities STRUCT(enablementMinimum DOUBLE, enablementMinimumValueUsed DOUBLE, lowBreakpoint DOUBLE, lowBreakpointValueUsed DOUBLE, highBreakpoint DOUBLE, highBreakpointValueUsed DOUBLE, enablementMaximum DOUBLE, enablementMaximumValueUsed DOUBLE, maximumCapacity DOUBLE, facilityCode VARCHAR, submissionId UUID, submissionCode VARCHAR, tranches STRUCT(tranche BIGINT, fuelType VARCHAR, quantity DOUBLE, submittedPrice DOUBLE, lfaPrice DOUBLE, capacityType VARCHAR, noticeTime BIGINT)[])[]), contingencyLower STRUCT(marketService VARCHAR, baseForecastRequirement DOUBLE, overrideForecastRequirement JSON, facilities STRUCT(enablementMinimum DOUBLE, enablementMinimumValueUsed DOUBLE, lowBreakpoint DOUBLE, lowBreakpointValueUsed DOUBLE, highBreakpoint DOUBLE, highBreakpointValueUsed DOUBLE, enablementMaximum DOUBLE, enablementMaximumValueUsed DOUBLE, maximumCapacity DOUBLE, facilityCode VARCHAR, submissionId UUID, submissionCode VARCHAR, tranches STRUCT(tranche BIGINT, fuelType VARCHAR, quantity DOUBLE, submittedPrice DOUBLE, lfaPrice DOUBLE, capacityType VARCHAR, noticeTime BIGINT)[])[]), rocof STRUCT(marketService VARCHAR, baseForecastRequirement DOUBLE, overrideForecastRequirement JSON, facilities STRUCT(enablementMinimum DOUBLE, enablementMinimumValueUsed DOUBLE, lowBreakpoint DOUBLE, lowBreakpointValueUsed DOUBLE, highBreakpoint DOUBLE, highBreakpointValueUsed DOUBLE, enablementMaximum DOUBLE, enablementMaximumValueUsed DOUBLE, maximumCapacity DOUBLE, facilityCode VARCHAR, submissionId UUID, submissionCode VARCHAR, tranches STRUCT(tranche BIGINT, fuelType VARCHAR, 
quantity DOUBLE, submittedPrice DOUBLE, lfaPrice DOUBLE, capacityType VARCHAR, noticeTime BIGINT)[])[])), rcmData STRUCT(facilityCode VARCHAR, rcoq DOUBLE)[], marketServicesPriceFlag STRUCT(energyPriceAtCap BOOLEAN, regulationRaisePriceAtCap BOOLEAN, regulationLowerPriceAtCap BOOLEAN, contingencyLowerPriceAtCap BOOLEAN, contingencyRaisePriceAtCap BOOLEAN, rocofPriceAtCap BOOLEAN))[])
null = YES
key = NULL
default = NULL
extra = NULL
column_name = errors
column_type = JSON[]
null = YES
key = NULL
default = NULL
extra = NULL
column_name = warnings
column_type = JSON[]
null = YES
key = NULL
default = NULL
extra = NULL
column_name = infos
column_type = JSON[]
null = YES
key = NULL
default = NULL
extra = NULL
column_name = transactionId
column_type = UUID
null = YES
key = NULL
default = NULL
extra = NULL
D
This is essentially the same schema information as before, but now unabridged. It’s still a bit hard to read though.
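Rather than copy-pasting that value out of the terminal, you can capture the full column_type into a file straight from the CLI - a sketch using the .mode, .headers and .once dot commands (schema.sql is just my chosen filename):
D .mode list
D .headers off
D .once schema.sql
D SELECT column_type FROM rdcschema WHERE column_name = 'data';
D .mode line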
Seeing the schema for the first time
To make the schema more readable, I’m going to use VSCode and an extension called SQLTools to format the document. With the STRUCT saved in a file called schema.sql, I use the SQLTools extension to format it. The end result looks like this:
STRUCT(
dispatchDataIssueID VARCHAR,
caseData STRUCT(
status VARCHAR,
scenario VARCHAR,
dispatchInterval VARCHAR,
primaryDispatchInterval VARCHAR,
parameters STRUCT("name" VARCHAR, "value" DOUBLE) [],
constraintViolationParameters STRUCT("variable" VARCHAR, price DOUBLE) [],
scada STRUCT(
tag VARCHAR,
"value" VARCHAR,
valueDataType VARCHAR,
qualityFlag VARCHAR,
dataSource VARCHAR,
asAtTimeStamp VARCHAR
) [],
constraintLookUp STRUCT(
"forecast.ess-regulationRaise-expected" DOUBLE,
"forecast.ess-regulationLower-expected" DOUBLE,
"forecast.ess-contingencyLower-expected" DOUBLE,
"forecast.energy-forecastOperationalDemand-expected" DOUBLE,
"forecast.energy-forecastOperationalDemand-high" DOUBLE,
"forecast.energy-forecastOperationalDemand-low" DOUBLE,
"forecast.regionalForecast-easternGoldfields-expected" DOUBLE,
"forecast.regionalForecast-southWestGreatSouthernRegionBaseLoad-expected" DOUBLE,
"forecast.regionalForecast-northCountryRegionLoad-expected" DOUBLE,
"forecast.regionalForecast-northOfThreeSprings-expected" DOUBLE,
"forecast.regionalForecast-southWestGreatSouthernRegionLoad-expected" DOUBLE,
"forecast.regionalForecast-northCountryRegionBaseLoad-expected" DOUBLE,
"forecast.ess-contingencyLowerOffset-expected" DOUBLE,
"forecast.energy-forecastDPV-expected" DOUBLE,
"forecast.nsgForecast-albanyWindFarm-bomAccessA-expected" DOUBLE,
"forecast.nsgForecast-albanyWindFarm-bomAccessA-low" DOUBLE,
"forecast.nsgForecast-albanyWindFarm-bomAccessA-high" DOUBLE,
"solver.minutesSincePrimary" DOUBLE
),
"constraints" STRUCT(
constraintSet STRUCT(
id VARCHAR,
constraintEquations VARCHAR [],
"version" BIGINT,
description VARCHAR,
"comments" VARCHAR
) [],
constraintEquations STRUCT(
id VARCHAR,
description VARCHAR,
"comments" VARCHAR,
leftHandSide STRUCT(
term VARCHAR,
coefficient DOUBLE,
termType VARCHAR,
"index" BIGINT
) [],
"operator" VARCHAR,
rightHandSide STRUCT(
term VARCHAR,
coefficient DOUBLE,
termType VARCHAR,
"index" BIGINT
) [],
rightHandSideScript VARCHAR,
required BOOLEAN,
limitType VARCHAR,
limitAdviceId VARCHAR,
isInterventionEvent BOOLEAN,
constraintType VARCHAR,
defaultRHS DOUBLE,
violationPenalty DOUBLE,
contingency VARCHAR,
monitoredElement VARCHAR,
"version" BIGINT,
systemConfiguration VARCHAR
) []
),
dfcmData STRUCT(
dfcmId VARCHAR,
dfcmSchemaVersion VARCHAR,
frequencyLimits STRUCT(
rocofMax DOUBLE,
rocofMin DOUBLE,
fmin DOUBLE,
fmax DOUBLE,
fss DOUBLE,
marginForOperationCRR DOUBLE
),
tauValuesforCRPerformanceFactors DOUBLE [],
dfcmValidationDetails STRUCT(
inertiaRangeMinimum DOUBLE,
inertiaRangeMaximum DOUBLE,
inertiaRangeStep DOUBLE,
largestContingencySizeMinimum DOUBLE,
largestContingencySizeMaximum DOUBLE,
largestContingencySizeStep DOUBLE,
underlyingDemandMinimum DOUBLE,
underlyingDemandMaximum DOUBLE,
underlyingDemandStep DOUBLE,
loadReliefValues DOUBLE [],
dpvMinimum DOUBLE,
dpvMaximum DOUBLE,
dpvStep DOUBLE
)
),
unconstrainedForecast STRUCT(
facilityCode VARCHAR,
unconstrainedInjectionForecastMw DOUBLE,
unconstrainedWithdrawalForecastMw DOUBLE
) [],
registrationData STRUCT(
facilityCode VARCHAR,
nol BOOLEAN,
facilityStandingData STRUCT(
facilityRegistrationStatus VARCHAR,
facilityClass VARCHAR,
storageConstraints BOOLEAN,
inertia DOUBLE,
tauFactor DOUBLE,
effectiveDispatchIntervalFrom VARCHAR,
unconstrainedForecastSource VARCHAR
)
) [],
facilityLossFactors STRUCT(
facilityCode VARCHAR,
distributionLossFactor DOUBLE,
transmissionLossFactor DOUBLE
) [],
markets STRUCT(
energy STRUCT(
marketService VARCHAR,
baseForecastRequirement DOUBLE,
overrideForecastRequirement JSON,
facilities STRUCT(
maxInjectionCapacity DOUBLE,
maxWithdrawalCapacity DOUBLE,
unconstrainedInjectionForecast DOUBLE,
unconstrainedWithdrawalForecast DOUBLE,
inflexibleFlag BOOLEAN,
maxUpwardRampRate DOUBLE,
downwardsRampRateValueUsed DOUBLE,
maxDownwardRampRate DOUBLE,
upwardsRampRateValueUsed DOUBLE,
fsip STRUCT(
t1 DOUBLE,
t2 DOUBLE,
t3 DOUBLE,
t4 DOUBLE,
minimumLoad DOUBLE
),
facilityCode VARCHAR,
submissionId UUID,
submissionCode VARCHAR,
tranches STRUCT(
tranche BIGINT,
fuelType VARCHAR,
quantity DOUBLE,
submittedPrice DOUBLE,
lfaPrice DOUBLE,
capacityType VARCHAR,
noticeTime BIGINT
) [],
dspUnconstrainedWithdrawalQuantity DOUBLE,
dspConstrainedWithdrawalQuantity DOUBLE
) []
),
regulationRaise STRUCT(
marketService VARCHAR,
baseForecastRequirement DOUBLE,
overrideForecastRequirement JSON,
facilities STRUCT(
enablementMinimum DOUBLE,
enablementMinimumValueUsed DOUBLE,
lowBreakpoint DOUBLE,
lowBreakpointValueUsed DOUBLE,
highBreakpoint DOUBLE,
highBreakpointValueUsed DOUBLE,
enablementMaximum DOUBLE,
enablementMaximumValueUsed DOUBLE,
maximumCapacity DOUBLE,
facilityCode VARCHAR,
submissionId UUID,
submissionCode VARCHAR,
tranches STRUCT(
tranche BIGINT,
fuelType VARCHAR,
quantity DOUBLE,
submittedPrice DOUBLE,
lfaPrice DOUBLE,
capacityType VARCHAR,
noticeTime BIGINT
) []
) []
),
regulationLower STRUCT(
marketService VARCHAR,
baseForecastRequirement DOUBLE,
overrideForecastRequirement JSON,
facilities STRUCT(
enablementMinimum DOUBLE,
enablementMinimumValueUsed DOUBLE,
lowBreakpoint DOUBLE,
lowBreakpointValueUsed DOUBLE,
highBreakpoint DOUBLE,
highBreakpointValueUsed DOUBLE,
enablementMaximum DOUBLE,
enablementMaximumValueUsed DOUBLE,
maximumCapacity DOUBLE,
facilityCode VARCHAR,
submissionId UUID,
submissionCode VARCHAR,
tranches STRUCT(
tranche BIGINT,
fuelType VARCHAR,
quantity DOUBLE,
submittedPrice DOUBLE,
lfaPrice DOUBLE,
capacityType VARCHAR,
noticeTime BIGINT
) []
) []
),
contingencyRaise STRUCT(
marketService VARCHAR,
baseForecastRequirement DOUBLE,
overrideForecastRequirement JSON,
facilities STRUCT(
enablementMinimum DOUBLE,
enablementMinimumValueUsed DOUBLE,
lowBreakpoint DOUBLE,
lowBreakpointValueUsed DOUBLE,
highBreakpoint DOUBLE,
highBreakpointValueUsed DOUBLE,
enablementMaximum DOUBLE,
enablementMaximumValueUsed DOUBLE,
maximumCapacity DOUBLE,
facilityCode VARCHAR,
submissionId UUID,
submissionCode VARCHAR,
tranches STRUCT(
tranche BIGINT,
fuelType VARCHAR,
quantity DOUBLE,
submittedPrice DOUBLE,
lfaPrice DOUBLE,
capacityType VARCHAR,
noticeTime BIGINT
) []
) []
),
contingencyLower STRUCT(
marketService VARCHAR,
baseForecastRequirement DOUBLE,
overrideForecastRequirement JSON,
facilities STRUCT(
enablementMinimum DOUBLE,
enablementMinimumValueUsed DOUBLE,
lowBreakpoint DOUBLE,
lowBreakpointValueUsed DOUBLE,
highBreakpoint DOUBLE,
highBreakpointValueUsed DOUBLE,
enablementMaximum DOUBLE,
enablementMaximumValueUsed DOUBLE,
maximumCapacity DOUBLE,
facilityCode VARCHAR,
submissionId UUID,
submissionCode VARCHAR,
tranches STRUCT(
tranche BIGINT,
fuelType VARCHAR,
quantity DOUBLE,
submittedPrice DOUBLE,
lfaPrice DOUBLE,
capacityType VARCHAR,
noticeTime BIGINT
) []
) []
),
rocof STRUCT(
marketService VARCHAR,
baseForecastRequirement DOUBLE,
overrideForecastRequirement JSON,
facilities STRUCT(
enablementMinimum DOUBLE,
enablementMinimumValueUsed DOUBLE,
lowBreakpoint DOUBLE,
lowBreakpointValueUsed DOUBLE,
highBreakpoint DOUBLE,
highBreakpointValueUsed DOUBLE,
enablementMaximum DOUBLE,
enablementMaximumValueUsed DOUBLE,
maximumCapacity DOUBLE,
facilityCode VARCHAR,
submissionId UUID,
submissionCode VARCHAR,
tranches STRUCT(
tranche BIGINT,
fuelType VARCHAR,
quantity DOUBLE,
submittedPrice DOUBLE,
lfaPrice DOUBLE,
capacityType VARCHAR,
noticeTime BIGINT
) []
) []
)
),
rcmData STRUCT(facilityCode VARCHAR, rcoq DOUBLE) [],
marketServicesPriceFlag STRUCT(
energyPriceAtCap BOOLEAN,
regulationRaisePriceAtCap BOOLEAN,
regulationLowerPriceAtCap BOOLEAN,
contingencyLowerPriceAtCap BOOLEAN,
contingencyRaisePriceAtCap BOOLEAN,
rocofPriceAtCap BOOLEAN
)
) []
)
I can now see that most of the interesting stuff is in the array-type element called caseData!
The scada data looks interesting - let’s investigate that further.
Unnesting the SCADA data
Before we get to unnesting, I’m going to do some tidying up - reverting the mode to duckbox, and saving a copy of the JSON to a duckdb table to simplify the queries.
D .mode duckbox
D CREATE OR REPLACE TABLE rdcdata AS
select * from
read_json('ReferenceDispatchCase.json',
maximum_object_size = 405000000);
D
Our first unnest
Now, let’s run a query to unpack the caseData array:
D WITH cd as (
SELECT UNNEST(data.caseData) "caseData" FROM rdcdata
) SELECT caseData from cd limit 3;
┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ caseData │
│ struct(status varchar, scenario varchar, dispatchinterval varchar, primarydispatchinterval varchar, parameters str… │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ {'status': Normal, 'scenario': Reference, 'dispatchInterval': 2025-02-05T08:00:00+08:00, 'primaryDispatchInterval'… │
│ {'status': Normal, 'scenario': Reference, 'dispatchInterval': 2025-02-05T08:05:00+08:00, 'primaryDispatchInterval'… │
│ {'status': Normal, 'scenario': Reference, 'dispatchInterval': 2025-02-05T08:10:00+08:00, 'primaryDispatchInterval'… │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
D
This is useful, but I only got a single column with a STRUCT object in it. We can unnest further by modifying the original unnest to either specify the max_depth, or simply to unnest recursively. While we’re at it, let’s create a view to represent this query:
A recursive unnest
D CREATE OR REPLACE VIEW casedata AS WITH cd as (
SELECT UNNEST(data.caseData, recursive := true)
FROM rdcdata
) SELECT * from cd;
D select * from casedata limit 3;
┌─────────┬───────────┬──────────────────────┬───┬──────────────────────┬──────────────────────┬─────────────────┐
│ status │ scenario │ dispatchInterval │ … │ contingencyLowerPr… │ contingencyRaisePr… │ rocofPriceAtCap │
│ varchar │ varchar │ varchar │ │ boolean │ boolean │ boolean │
├─────────┼───────────┼──────────────────────┼───┼──────────────────────┼──────────────────────┼─────────────────┤
│ Normal │ Reference │ 2025-02-05T08:00:0… │ … │ false │ false │ false │
│ Normal │ Reference │ 2025-02-05T08:05:0… │ … │ false │ false │ false │
│ Normal │ Reference │ 2025-02-05T08:10:0… │ … │ false │ false │ false │
├─────────┴───────────┴──────────────────────┴───┴──────────────────────┴──────────────────────┴─────────────────┤
│ 3 rows 83 columns (6 shown) │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
This is an improvement: the unnest has exploded out the STRUCT fields to their own columns; 83 of them, in fact!
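If you want to see all 83 column names rather than just the 6 shown, DESCRIBE works on the view as well - a quick sketch:
D DESCRIBE casedata;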
Unpacking SCADA data
We now simply repeat the unnest process to access the SCADA data. I’ll set the output mode to table for some ascii-art goodness too.
D .mode table
D WITH vw_scada AS (Select unnest(scada, max_depth:=2) FROM casedata) SELECT * from vw_scada limit 5;
+------------------------------------+-----------+---------------+-------------+------------+-----------------------------------+
| tag | value | valueDataType | qualityFlag | dataSource | asAtTimeStamp |
+------------------------------------+-----------+---------------+-------------+------------+-----------------------------------+
| dispatchCondition.loadRelief | 2 | number | good | SCADA | 2025-02-05T07:59:05.02744+08:00 |
| dispatchCondition.systemInertia | 21370.863 | number | good | SCADA | 2025-02-05T07:59:05.0274425+08:00 |
| dispatchCondition.demand | 2364.1733 | number | good | SCADA | 2025-02-05T07:59:05.027435+08:00 |
| dispatchCondition.underlyingDemand | 3053.1206 | number | good | SCADA | 2025-02-05T07:59:05.0274437+08:00 |
| dispatchCondition.loadInertia | 5146.7617 | number | good | SCADA | 2025-02-05T07:59:05.0274388+08:00 |
+------------------------------------+-----------+---------------+-------------+------------+-----------------------------------+
D
Finally, some real data! Want to write it to a CSV?
D COPY (
WITH vw_scada AS (
Select unnest(scada, max_depth:=2) FROM casedata
) SELECT * from vw_scada
) TO 'scada.csv';
D
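If you’d rather have Parquet than CSV, the same COPY statement can write that too - a sketch (the filename and format clause are my own choice):
D COPY (
    WITH vw_scada AS (
      Select unnest(scada, max_depth:=2) FROM casedata
    ) SELECT * from vw_scada
  ) TO 'scada.parquet' (FORMAT PARQUET);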
Wrapping Up
A quack re-cap of what we’ve learned:
- How to load a chonkin great JSON file
- Determining the schema of a large JSON file
- Unnesting arrays and objects into table rows
- Writing results out to a CSV
If you haven’t tried DuckDB for data mangling, give it a try!