
Big JSON documents with DuckDB
by Paul Symons
Contents
- Summary
- Preface: Why people like DuckDB
- The Uncomfortable Truth
- How we typically work with big data
- The AEMO (WEM) example
- Processing Large JSON documents with DuckDB
- Unnesting the SCADA data
- Wrapping Up
Summary
In this blog we are going to discuss how DuckDB can help you make sense of very large JSON documents using only its CLI.
Skip the feathers if you want to get straight to the eggs
Preface: Why people like DuckDB
The short version of why people like DuckDB is that it lets them focus on the problem to be solved, in contrast to overly opinionated solutions where mastery requires an abundance of precious learning hours.
The Why DuckDB page explains this in greater detail, but the short answer is that DuckDB is:
- Simple - Download the binary and run; no dependencies, no complex install
- Portable - runs on multiple platforms and architectures (e.g. Windows / OSX / Linux, x86 / ARM)
- Feature-Rich - supports a large subset of the functionality of commercial and other open source analytical databases
- Fast - both performance of analytical operations, and more broadly - in time to value
- Extensible - supports community extensions whilst keeping the core lean
- Free - from both licensing and beer perspectives
The Uncomfortable Truth
Sometimes we have to work with very large JSON documents, and the size can introduce challenges. Most commonly, the size of documents can make them slow to work with, difficult to format, and as a result the schema can be obscured from us, as we struggle to navigate the data. And if your document doesn’t have line endings, then your day probably just got even worse.
How we typically work with big data
The term Big Data - mostly out of vogue these days - had its moment in the sun in the 2010s. It was popularised by the growth in parallel technologies - both software and software-defined infrastructure - built to handle large data sets that couldn’t be processed in a timely manner with conventional hardware and software, e.g. single-node databases.
These days, we are accustomed to working with proprietary and open source OLAP systems that appear to scale without limits: technologies like Spark have provided a neat interface and programming model, distributing the work of transforming large datasets across multiple units of compute.
Yet many of these technologies can still be stymied by a single large JSON document in the 100s of megabytes or even gigabytes.
Note the difference between a file containing a single large JSON document and a file containing millions or billions of smaller JSON documents separated by line endings (JSONL); the latter can be processed with ease, while the former is typically more challenging.
This is documented in services like Amazon Athena, which is subject to both row and column size limitations (32Mb) and input text file line length limitations (200Mb); Snowflake also has column size limitations of around 16Mb. Whilst I don’t believe Spark has a particular limitation, a very large document object will still be processed on a single node, so you would need to ensure your cluster configuration is appropriate to the size of your documents to avoid out-of-memory errors.
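To make the distinction concrete, here is a hypothetical illustration of the two layouts (the field names are made up, not taken from the AEMO data):

A JSONL file - one complete document per line, easy to split and parallelise:
{"tag": "meter.1", "value": 1.0}
{"tag": "meter.2", "value": 2.0}

A single JSON document - one object spanning the entire file, however large:
{"data": {"records": [{"tag": "meter.1", "value": 1.0}, {"tag": "meter.2", "value": 2.0}]}, "errors": []}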
Does it have to be so large?
Wouldn’t it be easier to de-normalize the documents into smaller, constituent parts?
Sure it would! However, that is not always within your control, especially when the data is authored by a third party.
The AEMO (WEM) example
The Australian Energy Market Operator is a co-ordinating agency that is responsible for many of the day-to-day operational functions in the regulated electricity markets it participates in:
- National Electricity Market (NEM - East Coast)
- Wholesale Electricity Market (WEM - WA SWIS)
These regulated markets are governed by rules that require some of the operational data to be made available to market participants.
In the case of the WEM, this data is published in the public domain; one particular data source, the case input data for the dispatch engine solver, produces a daily data dump in a JSON file:
- 400Mb in size approximately
- no line endings
- single JSON document
We can’t control the size, schema or contents of this document, and no schema is published - so we need to find a way to learn more about it. Let’s get started!
Processing Large JSON documents with DuckDB
Spoiler alert - here’s what we’ll focus on to help you make peace with JSON Bigs:
- DESCRIBE a query and create a table from its result
- Use different output modes to improve schema visibility
- Use a SQL formatter to understand your captured schema
- Unnest document fragments to sample the schema
I assume you have duckdb installed and an obtusely large JSON file.
Start duckdb
Jump into the duckdb console - if you don’t specify a database file, an in-memory database will be used. Using a database file can help you save your work for another time.
$ duckdb
v1.2.0 5f5512b827
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D
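If you want your tables and views kept for another session, pass a database filename when starting instead - a quick sketch (the filename is my own choice):
$ duckdb bigjson.duckdb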
Try selecting from your obtusely large file
D SELECT * FROM
read_json('/home/paul/Downloads/ReferenceDispatchCase.json');
Invalid Input Error:
"maximum_object_size" of 16777216 bytes exceeded while reading
file "ReferenceDispatchCase_202502050800.json" (>33554428 bytes).
Try increasing "maximum_object_size".
By default, duckdb won’t load JSON documents larger than 16Mb.
Try again, correcting for the error. The number you provide should be bigger than the size of a single JSON document, but it may be simpler to just pick something greater than the size of the file.
D select * from
read_json('ReferenceDispatchCase.json',
maximum_object_size = 405000000);
┌──────────────────────────────────────────────────┬────────┬──────────┬────────┬──────────────────────────────────────┐
│ data │ errors │ warnings │ infos │ transactionId │
│ struct(dispatchdataissueid varchar, casedata s… │ json[] │ json[] │ json[] │ uuid │
├──────────────────────────────────────────────────┼────────┼──────────┼────────┼──────────────────────────────────────┤
│ {'dispatchDataIssueID': 20250207113008192_4668… │ [] │ [] │ [] │ 96dd2a59-1465-448f-a415-7f577a836045 │
└──────────────────────────────────────────────────┴────────┴──────────┴────────┴──────────────────────────────────────┘
Phew! That statement above took around 26 seconds to complete on my laptop - reading a 375Mb file.
We have 5 top level properties: data, errors, warnings, infos and transactionId.
The first column, data, appears to be the substantial payload; its truncated STRUCT type needs some more exploration.
Let’s speed up our feedback loop by capturing the schema of our query to its own table.
Capture the converted JSON schema
We can view the schema of the JSON document (albeit converted to SQL types) by simply using the DESCRIBE keyword ahead of our query:
D DESCRIBE select * from
read_json('ReferenceDispatchCase.json',
maximum_object_size = 405000000);
┌─────────────┬────────────────────────────────────────────────────────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name │ column_type │ null │ key │ default │ extra │
│ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │
├─────────────┼────────────────────────────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────┤
│ data │ STRUCT(dispatchDataIssueID VARCHAR, caseData STRUCT(status V… │ YES │ NULL │ NULL │ NULL │
└─────────────┴────────────────────────────────────────────────────────────────┴─────────┴─────────┴─────────┴─────────┘
OS caching does improve the query time somewhat, if you have enough memory - however it is still a bit slow to re-read a large JSON file for every query. duckdb allows us to store the schema in a table of its own, if we want to.
D CREATE OR REPLACE TABLE rdcschema AS SELECT * FROM (
DESCRIBE select * from
read_json('ReferenceDispatchCase.json',
maximum_object_size = 405000000)
);
D SELECT * FROM rdcschema;
┌───────────────┬──────────────────────────────────────────────────────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name │ column_type │ null │ key │ default │ extra │
│ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │
├───────────────┼──────────────────────────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────┤
│ data │ STRUCT(dispatchDataIssueID VARCHAR, caseData STRUCT(status… │ YES │ NULL │ NULL │ NULL │
│ errors │ JSON[] │ YES │ NULL │ NULL │ NULL │
│ warnings │ JSON[] │ YES │ NULL │ NULL │ NULL │
│ infos │ JSON[] │ YES │ NULL │ NULL │ NULL │
│ transactionId │ UUID │ YES │ NULL │ NULL │ NULL │
└───────────────┴──────────────────────────────────────────────────────────────┴─────────┴─────────┴─────────┴─────────┘
Querying the rdcschema table is instantaneous now; but we still can’t see the full schema of the data column!
Change the duckdb output mode
duckdb has a number of different output modes - these determine how the results from your queries are formatted. This is a very useful feature: it means you can do things like issue a query and have the result set returned as a parseable JSON document. It also offers different modes for presenting in the CLI. The default mode is duckbox, which does terminal-friendly things like truncating the width of columns, and also the number of columns included in the output if there are many. This helps duckdb present a consistent output, but sometimes you need the full output. Let’s explore some of the different modes.
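For example, a quick sketch of sending the next result to a file as JSON - .mode and .once are CLI dot commands, and the filename is my own choice:
D .mode json
D .once results.json
D SELECT * FROM rdcschema;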
There are over 15 different output modes, but for now we are going to switch to line mode. To switch modes, use the .mode command followed by the mode you want. You can then run your query to see how its presentation has changed:
D .mode line
D select * from rdcschema;
column_name = data
column_type = STRUCT(dispatchDataIssueID VARCHAR, caseData STRUCT(status VARCHAR, scenario VARCHAR, dispatchInterval VARCHAR, primaryDispatchInterval VARCHAR, parameters STRUCT("name" VARCHAR, "value" DOUBLE)[], constraintViolationParameters STRUCT("variable" VARCHAR, price DOUBLE)[], scada STRUCT(tag VARCHAR, "value" VARCHAR, valueDataType VARCHAR, qualityFlag VARCHAR, dataSource VARCHAR, asAtTimeStamp VARCHAR)[], constraintLookUp STRUCT("forecast.ess-regulationRaise-expected" DOUBLE, "forecast.ess-regulationLower-expected" DOUBLE, "forecast.ess-contingencyLower-expected" DOUBLE, "forecast.energy-forecastOperationalDemand-expected" DOUBLE, "forecast.energy-forecastOperationalDemand-high" DOUBLE, "forecast.energy-forecastOperationalDemand-low" DOUBLE, "forecast.regionalForecast-easternGoldfields-expected" DOUBLE, "forecast.regionalForecast-southWestGreatSouthernRegionBaseLoad-expected" DOUBLE, "forecast.regionalForecast-northCountryRegionLoad-expected" DOUBLE, "forecast.regionalForecast-northOfThreeSprings-expected" DOUBLE, "forecast.regionalForecast-southWestGreatSouthernRegionLoad-expected" DOUBLE, "forecast.regionalForecast-northCountryRegionBaseLoad-expected" DOUBLE, "forecast.ess-contingencyLowerOffset-expected" DOUBLE, "forecast.energy-forecastDPV-expected" DOUBLE, "forecast.nsgForecast-albanyWindFarm-bomAccessA-expected" DOUBLE, "forecast.nsgForecast-albanyWindFarm-bomAccessA-low" DOUBLE, "forecast.nsgForecast-albanyWindFarm-bomAccessA-high" DOUBLE, "solver.minutesSincePrimary" DOUBLE), "constraints" STRUCT(constraintSet STRUCT(id VARCHAR, constraintEquations VARCHAR[], "version" BIGINT, description VARCHAR, "comments" VARCHAR)[], constraintEquations STRUCT(id VARCHAR, description VARCHAR, "comments" VARCHAR, leftHandSide STRUCT(term VARCHAR, coefficient DOUBLE, termType VARCHAR, "index" BIGINT)[], "operator" VARCHAR, rightHandSide STRUCT(term VARCHAR, coefficient DOUBLE, termType VARCHAR, "index" BIGINT)[], rightHandSideScript VARCHAR, required BOOLEAN, limitType VARCHAR, limitAdviceId VARCHAR, isInterventionEvent BOOLEAN, constraintType VARCHAR, defaultRHS DOUBLE, violationPenalty DOUBLE, contingency VARCHAR, monitoredElement VARCHAR, "version" BIGINT, systemConfiguration VARCHAR)[]), dfcmData STRUCT(dfcmId VARCHAR, dfcmSchemaVersion VARCHAR, frequencyLimits STRUCT(rocofMax DOUBLE, rocofMin DOUBLE, fmin DOUBLE, fmax DOUBLE, fss DOUBLE, marginForOperationCRR DOUBLE), tauValuesforCRPerformanceFactors DOUBLE[], dfcmValidationDetails STRUCT(inertiaRangeMinimum DOUBLE, inertiaRangeMaximum DOUBLE, inertiaRangeStep DOUBLE, largestContingencySizeMinimum DOUBLE, largestContingencySizeMaximum DOUBLE, largestContingencySizeStep DOUBLE, underlyingDemandMinimum DOUBLE, underlyingDemandMaximum DOUBLE, underlyingDemandStep DOUBLE, loadReliefValues DOUBLE[], dpvMinimum DOUBLE, dpvMaximum DOUBLE, dpvStep DOUBLE)), unconstrainedForecast STRUCT(facilityCode VARCHAR, unconstrainedInjectionForecastMw DOUBLE, unconstrainedWithdrawalForecastMw DOUBLE)[], registrationData STRUCT(facilityCode VARCHAR, nol BOOLEAN, facilityStandingData STRUCT(facilityRegistrationStatus VARCHAR, facilityClass VARCHAR, storageConstraints BOOLEAN, inertia DOUBLE, tauFactor DOUBLE, effectiveDispatchIntervalFrom VARCHAR, unconstrainedForecastSource VARCHAR))[], facilityLossFactors STRUCT(facilityCode VARCHAR, distributionLossFactor DOUBLE, transmissionLossFactor DOUBLE)[], markets STRUCT(energy STRUCT(marketService VARCHAR, baseForecastRequirement DOUBLE, overrideForecastRequirement JSON, facilities STRUCT(maxInjectionCapacity 
DOUBLE, maxWithdrawalCapacity DOUBLE, unconstrainedInjectionForecast DOUBLE, unconstrainedWithdrawalForecast DOUBLE, inflexibleFlag BOOLEAN, maxUpwardRampRate DOUBLE, downwardsRampRateValueUsed DOUBLE, maxDownwardRampRate DOUBLE, upwardsRampRateValueUsed DOUBLE, fsip STRUCT(t1 DOUBLE, t2 DOUBLE, t3 DOUBLE, t4 DOUBLE, minimumLoad DOUBLE), facilityCode VARCHAR, submissionId UUID, submissionCode VARCHAR, tranches STRUCT(tranche BIGINT, fuelType VARCHAR, quantity DOUBLE, submittedPrice DOUBLE, lfaPrice DOUBLE, capacityType VARCHAR, noticeTime BIGINT)[], dspUnconstrainedWithdrawalQuantity DOUBLE, dspConstrainedWithdrawalQuantity DOUBLE)[]), regulationRaise STRUCT(marketService VARCHAR, baseForecastRequirement DOUBLE, overrideForecastRequirement JSON, facilities STRUCT(enablementMinimum DOUBLE, enablementMinimumValueUsed DOUBLE, lowBreakpoint DOUBLE, lowBreakpointValueUsed DOUBLE, highBreakpoint DOUBLE, highBreakpointValueUsed DOUBLE, enablementMaximum DOUBLE, enablementMaximumValueUsed DOUBLE, maximumCapacity DOUBLE, facilityCode VARCHAR, submissionId UUID, submissionCode VARCHAR, tranches STRUCT(tranche BIGINT, fuelType VARCHAR, quantity DOUBLE, submittedPrice DOUBLE, lfaPrice DOUBLE, capacityType VARCHAR, noticeTime BIGINT)[])[]), regulationLower STRUCT(marketService VARCHAR, baseForecastRequirement DOUBLE, overrideForecastRequirement JSON, facilities STRUCT(enablementMinimum DOUBLE, enablementMinimumValueUsed DOUBLE, lowBreakpoint DOUBLE, lowBreakpointValueUsed DOUBLE, highBreakpoint DOUBLE, highBreakpointValueUsed DOUBLE, enablementMaximum DOUBLE, enablementMaximumValueUsed DOUBLE, maximumCapacity DOUBLE, facilityCode VARCHAR, submissionId UUID, submissionCode VARCHAR, tranches STRUCT(tranche BIGINT, fuelType VARCHAR, quantity DOUBLE, submittedPrice DOUBLE, lfaPrice DOUBLE, capacityType VARCHAR, noticeTime BIGINT)[])[]), contingencyRaise STRUCT(marketService VARCHAR, baseForecastRequirement DOUBLE, overrideForecastRequirement JSON, facilities STRUCT(enablementMinimum DOUBLE, enablementMinimumValueUsed DOUBLE, lowBreakpoint DOUBLE, lowBreakpointValueUsed DOUBLE, highBreakpoint DOUBLE, highBreakpointValueUsed DOUBLE, enablementMaximum DOUBLE, enablementMaximumValueUsed DOUBLE, maximumCapacity DOUBLE, facilityCode VARCHAR, submissionId UUID, submissionCode VARCHAR, tranches STRUCT(tranche BIGINT, fuelType VARCHAR, quantity DOUBLE, submittedPrice DOUBLE, lfaPrice DOUBLE, capacityType VARCHAR, noticeTime BIGINT)[])[]), contingencyLower STRUCT(marketService VARCHAR, baseForecastRequirement DOUBLE, overrideForecastRequirement JSON, facilities STRUCT(enablementMinimum DOUBLE, enablementMinimumValueUsed DOUBLE, lowBreakpoint DOUBLE, lowBreakpointValueUsed DOUBLE, highBreakpoint DOUBLE, highBreakpointValueUsed DOUBLE, enablementMaximum DOUBLE, enablementMaximumValueUsed DOUBLE, maximumCapacity DOUBLE, facilityCode VARCHAR, submissionId UUID, submissionCode VARCHAR, tranches STRUCT(tranche BIGINT, fuelType VARCHAR, quantity DOUBLE, submittedPrice DOUBLE, lfaPrice DOUBLE, capacityType VARCHAR, noticeTime BIGINT)[])[]), rocof STRUCT(marketService VARCHAR, baseForecastRequirement DOUBLE, overrideForecastRequirement JSON, facilities STRUCT(enablementMinimum DOUBLE, enablementMinimumValueUsed DOUBLE, lowBreakpoint DOUBLE, lowBreakpointValueUsed DOUBLE, highBreakpoint DOUBLE, highBreakpointValueUsed DOUBLE, enablementMaximum DOUBLE, enablementMaximumValueUsed DOUBLE, maximumCapacity DOUBLE, facilityCode VARCHAR, submissionId UUID, submissionCode VARCHAR, tranches STRUCT(tranche BIGINT, fuelType VARCHAR, 
quantity DOUBLE, submittedPrice DOUBLE, lfaPrice DOUBLE, capacityType VARCHAR, noticeTime BIGINT)[])[])), rcmData STRUCT(facilityCode VARCHAR, rcoq DOUBLE)[], marketServicesPriceFlag STRUCT(energyPriceAtCap BOOLEAN, regulationRaisePriceAtCap BOOLEAN, regulationLowerPriceAtCap BOOLEAN, contingencyLowerPriceAtCap BOOLEAN, contingencyRaisePriceAtCap BOOLEAN, rocofPriceAtCap BOOLEAN))[])
null = YES
key = NULL
default = NULL
extra = NULL
column_name = errors
column_type = JSON[]
null = YES
key = NULL
default = NULL
extra = NULL
column_name = warnings
column_type = JSON[]
null = YES
key = NULL
default = NULL
extra = NULL
column_name = infos
column_type = JSON[]
null = YES
key = NULL
default = NULL
extra = NULL
column_name = transactionId
column_type = UUID
null = YES
key = NULL
default = NULL
extra = NULL
D
This is essentially the same schema information as before, but now unabridged. It’s still a bit hard to read though.
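Rather than copy-pasting that value out of the terminal, you can capture the full column_type into a file straight from the CLI - a sketch using the .mode, .headers and .once dot commands (schema.sql is just my chosen filename):
D .mode list
D .headers off
D .once schema.sql
D SELECT column_type FROM rdcschema WHERE column_name = 'data';
D .mode line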
Seeing the schema for the first time
To make the schema more readable, I’m going to use VSCode and an extension called SQLTools to format the document. With the STRUCT saved in a file called schema.sql, I use the SQLTools extension to format it. The end result looks like this:
STRUCT(
dispatchDataIssueID VARCHAR,
caseData STRUCT(
status VARCHAR,
scenario VARCHAR,
dispatchInterval VARCHAR,
primaryDispatchInterval VARCHAR,
parameters STRUCT("name" VARCHAR, "value" DOUBLE) [],
constraintViolationParameters STRUCT("variable" VARCHAR, price DOUBLE) [],
scada STRUCT(
tag VARCHAR,
"value" VARCHAR,
valueDataType VARCHAR,
qualityFlag VARCHAR,
dataSource VARCHAR,
asAtTimeStamp VARCHAR
) [],
constraintLookUp STRUCT(
"forecast.ess-regulationRaise-expected" DOUBLE,
"forecast.ess-regulationLower-expected" DOUBLE,
"forecast.ess-contingencyLower-expected" DOUBLE,
"forecast.energy-forecastOperationalDemand-expected" DOUBLE,
"forecast.energy-forecastOperationalDemand-high" DOUBLE,
"forecast.energy-forecastOperationalDemand-low" DOUBLE,
"forecast.regionalForecast-easternGoldfields-expected" DOUBLE,
"forecast.regionalForecast-southWestGreatSouthernRegionBaseLoad-expected" DOUBLE,
"forecast.regionalForecast-northCountryRegionLoad-expected" DOUBLE,
"forecast.regionalForecast-northOfThreeSprings-expected" DOUBLE,
"forecast.regionalForecast-southWestGreatSouthernRegionLoad-expected" DOUBLE,
"forecast.regionalForecast-northCountryRegionBaseLoad-expected" DOUBLE,
"forecast.ess-contingencyLowerOffset-expected" DOUBLE,
"forecast.energy-forecastDPV-expected" DOUBLE,
"forecast.nsgForecast-albanyWindFarm-bomAccessA-expected" DOUBLE,
"forecast.nsgForecast-albanyWindFarm-bomAccessA-low" DOUBLE,
"forecast.nsgForecast-albanyWindFarm-bomAccessA-high" DOUBLE,
"solver.minutesSincePrimary" DOUBLE
),
"constraints" STRUCT(
constraintSet STRUCT(
id VARCHAR,
constraintEquations VARCHAR [],
"version" BIGINT,
description VARCHAR,
"comments" VARCHAR
) [],
constraintEquations STRUCT(
id VARCHAR,
description VARCHAR,
"comments" VARCHAR,
leftHandSide STRUCT(
term VARCHAR,
coefficient DOUBLE,
termType VARCHAR,
"index" BIGINT
) [],
"operator" VARCHAR,
rightHandSide STRUCT(
term VARCHAR,
coefficient DOUBLE,
termType VARCHAR,
"index" BIGINT
) [],
rightHandSideScript VARCHAR,
required BOOLEAN,
limitType VARCHAR,
limitAdviceId VARCHAR,
isInterventionEvent BOOLEAN,
constraintType VARCHAR,
defaultRHS DOUBLE,
violationPenalty DOUBLE,
contingency VARCHAR,
monitoredElement VARCHAR,
"version" BIGINT,
systemConfiguration VARCHAR
) []
),
dfcmData STRUCT(
dfcmId VARCHAR,
dfcmSchemaVersion VARCHAR,
frequencyLimits STRUCT(
rocofMax DOUBLE,
rocofMin DOUBLE,
fmin DOUBLE,
fmax DOUBLE,
fss DOUBLE,
marginForOperationCRR DOUBLE
),
tauValuesforCRPerformanceFactors DOUBLE [],
dfcmValidationDetails STRUCT(
inertiaRangeMinimum DOUBLE,
inertiaRangeMaximum DOUBLE,
inertiaRangeStep DOUBLE,
largestContingencySizeMinimum DOUBLE,
largestContingencySizeMaximum DOUBLE,
largestContingencySizeStep DOUBLE,
underlyingDemandMinimum DOUBLE,
underlyingDemandMaximum DOUBLE,
underlyingDemandStep DOUBLE,
loadReliefValues DOUBLE [],
dpvMinimum DOUBLE,
dpvMaximum DOUBLE,
dpvStep DOUBLE
)
),
unconstrainedForecast STRUCT(
facilityCode VARCHAR,
unconstrainedInjectionForecastMw DOUBLE,
unconstrainedWithdrawalForecastMw DOUBLE
) [],
registrationData STRUCT(
facilityCode VARCHAR,
nol BOOLEAN,
facilityStandingData STRUCT(
facilityRegistrationStatus VARCHAR,
facilityClass VARCHAR,
storageConstraints BOOLEAN,
inertia DOUBLE,
tauFactor DOUBLE,
effectiveDispatchIntervalFrom VARCHAR,
unconstrainedForecastSource VARCHAR
)
) [],
facilityLossFactors STRUCT(
facilityCode VARCHAR,
distributionLossFactor DOUBLE,
transmissionLossFactor DOUBLE
) [],
markets STRUCT(
energy STRUCT(
marketService VARCHAR,
baseForecastRequirement DOUBLE,
overrideForecastRequirement JSON,
facilities STRUCT(
maxInjectionCapacity DOUBLE,
maxWithdrawalCapacity DOUBLE,
unconstrainedInjectionForecast DOUBLE,
unconstrainedWithdrawalForecast DOUBLE,
inflexibleFlag BOOLEAN,
maxUpwardRampRate DOUBLE,
downwardsRampRateValueUsed DOUBLE,
maxDownwardRampRate DOUBLE,
upwardsRampRateValueUsed DOUBLE,
fsip STRUCT(
t1 DOUBLE,
t2 DOUBLE,
t3 DOUBLE,
t4 DOUBLE,
minimumLoad DOUBLE
),
facilityCode VARCHAR,
submissionId UUID,
submissionCode VARCHAR,
tranches STRUCT(
tranche BIGINT,
fuelType VARCHAR,
quantity DOUBLE,
submittedPrice DOUBLE,
lfaPrice DOUBLE,
capacityType VARCHAR,
noticeTime BIGINT
) [],
dspUnconstrainedWithdrawalQuantity DOUBLE,
dspConstrainedWithdrawalQuantity DOUBLE
) []
),
regulationRaise STRUCT(
marketService VARCHAR,
baseForecastRequirement DOUBLE,
overrideForecastRequirement JSON,
facilities STRUCT(
enablementMinimum DOUBLE,
enablementMinimumValueUsed DOUBLE,
lowBreakpoint DOUBLE,
lowBreakpointValueUsed DOUBLE,
highBreakpoint DOUBLE,
highBreakpointValueUsed DOUBLE,
enablementMaximum DOUBLE,
enablementMaximumValueUsed DOUBLE,
maximumCapacity DOUBLE,
facilityCode VARCHAR,
submissionId UUID,
submissionCode VARCHAR,
tranches STRUCT(
tranche BIGINT,
fuelType VARCHAR,
quantity DOUBLE,
submittedPrice DOUBLE,
lfaPrice DOUBLE,
capacityType VARCHAR,
noticeTime BIGINT
) []
) []
),
regulationLower STRUCT(
marketService VARCHAR,
baseForecastRequirement DOUBLE,
overrideForecastRequirement JSON,
facilities STRUCT(
enablementMinimum DOUBLE,
enablementMinimumValueUsed DOUBLE,
lowBreakpoint DOUBLE,
lowBreakpointValueUsed DOUBLE,
highBreakpoint DOUBLE,
highBreakpointValueUsed DOUBLE,
enablementMaximum DOUBLE,
enablementMaximumValueUsed DOUBLE,
maximumCapacity DOUBLE,
facilityCode VARCHAR,
submissionId UUID,
submissionCode VARCHAR,
tranches STRUCT(
tranche BIGINT,
fuelType VARCHAR,
quantity DOUBLE,
submittedPrice DOUBLE,
lfaPrice DOUBLE,
capacityType VARCHAR,
noticeTime BIGINT
) []
) []
),
contingencyRaise STRUCT(
marketService VARCHAR,
baseForecastRequirement DOUBLE,
overrideForecastRequirement JSON,
facilities STRUCT(
enablementMinimum DOUBLE,
enablementMinimumValueUsed DOUBLE,
lowBreakpoint DOUBLE,
lowBreakpointValueUsed DOUBLE,
highBreakpoint DOUBLE,
highBreakpointValueUsed DOUBLE,
enablementMaximum DOUBLE,
enablementMaximumValueUsed DOUBLE,
maximumCapacity DOUBLE,
facilityCode VARCHAR,
submissionId UUID,
submissionCode VARCHAR,
tranches STRUCT(
tranche BIGINT,
fuelType VARCHAR,
quantity DOUBLE,
submittedPrice DOUBLE,
lfaPrice DOUBLE,
capacityType VARCHAR,
noticeTime BIGINT
) []
) []
),
contingencyLower STRUCT(
marketService VARCHAR,
baseForecastRequirement DOUBLE,
overrideForecastRequirement JSON,
facilities STRUCT(
enablementMinimum DOUBLE,
enablementMinimumValueUsed DOUBLE,
lowBreakpoint DOUBLE,
lowBreakpointValueUsed DOUBLE,
highBreakpoint DOUBLE,
highBreakpointValueUsed DOUBLE,
enablementMaximum DOUBLE,
enablementMaximumValueUsed DOUBLE,
maximumCapacity DOUBLE,
facilityCode VARCHAR,
submissionId UUID,
submissionCode VARCHAR,
tranches STRUCT(
tranche BIGINT,
fuelType VARCHAR,
quantity DOUBLE,
submittedPrice DOUBLE,
lfaPrice DOUBLE,
capacityType VARCHAR,
noticeTime BIGINT
) []
) []
),
rocof STRUCT(
marketService VARCHAR,
baseForecastRequirement DOUBLE,
overrideForecastRequirement JSON,
facilities STRUCT(
enablementMinimum DOUBLE,
enablementMinimumValueUsed DOUBLE,
lowBreakpoint DOUBLE,
lowBreakpointValueUsed DOUBLE,
highBreakpoint DOUBLE,
highBreakpointValueUsed DOUBLE,
enablementMaximum DOUBLE,
enablementMaximumValueUsed DOUBLE,
maximumCapacity DOUBLE,
facilityCode VARCHAR,
submissionId UUID,
submissionCode VARCHAR,
tranches STRUCT(
tranche BIGINT,
fuelType VARCHAR,
quantity DOUBLE,
submittedPrice DOUBLE,
lfaPrice DOUBLE,
capacityType VARCHAR,
noticeTime BIGINT
) []
) []
)
),
rcmData STRUCT(facilityCode VARCHAR, rcoq DOUBLE) [],
marketServicesPriceFlag STRUCT(
energyPriceAtCap BOOLEAN,
regulationRaisePriceAtCap BOOLEAN,
regulationLowerPriceAtCap BOOLEAN,
contingencyLowerPriceAtCap BOOLEAN,
contingencyRaisePriceAtCap BOOLEAN,
rocofPriceAtCap BOOLEAN
)
) []
)
I can now see that most of the interesting stuff is in the array-type element called caseData!
The scada data looks interesting - let’s investigate that further.
Unnesting the SCADA data
Before we get to unnesting, I’m going to do some tidying up - reverting the mode to duckbox, and saving a copy of the JSON to a duckdb table to simplify the queries.
D .mode duckbox
D CREATE OR REPLACE TABLE rdcdata AS
select * from
read_json('ReferenceDispatchCase.json',
maximum_object_size = 405000000);
D
Our first unnest
Now, let’s run a query to unpack the caseData array:
D WITH cd as (
SELECT UNNEST(data.caseData) "caseData" FROM rdcdata
) SELECT caseData from cd limit 3;
┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ caseData │
│ struct(status varchar, scenario varchar, dispatchinterval varchar, primarydispatchinterval varchar, parameters str… │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ {'status': Normal, 'scenario': Reference, 'dispatchInterval': 2025-02-05T08:00:00+08:00, 'primaryDispatchInterval'… │
│ {'status': Normal, 'scenario': Reference, 'dispatchInterval': 2025-02-05T08:05:00+08:00, 'primaryDispatchInterval'… │
│ {'status': Normal, 'scenario': Reference, 'dispatchInterval': 2025-02-05T08:10:00+08:00, 'primaryDispatchInterval'… │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
D
This is useful, but I only got a single column with a STRUCT object in it. We can unnest further by modifying the original unnest to either specify the max_depth, or simply to unnest recursively. While we’re at it, let’s create a view to represent this query:
A recursive unnest
D CREATE OR REPLACE VIEW casedata AS WITH cd as (
SELECT UNNEST(data.caseData, recursive := true)
FROM rdcdata
) SELECT * from cd;
D select * from casedata limit 3;
┌─────────┬───────────┬──────────────────────┬───┬──────────────────────┬──────────────────────┬─────────────────┐
│ status │ scenario │ dispatchInterval │ … │ contingencyLowerPr… │ contingencyRaisePr… │ rocofPriceAtCap │
│ varchar │ varchar │ varchar │ │ boolean │ boolean │ boolean │
├─────────┼───────────┼──────────────────────┼───┼──────────────────────┼──────────────────────┼─────────────────┤
│ Normal │ Reference │ 2025-02-05T08:00:0… │ … │ false │ false │ false │
│ Normal │ Reference │ 2025-02-05T08:05:0… │ … │ false │ false │ false │
│ Normal │ Reference │ 2025-02-05T08:10:0… │ … │ false │ false │ false │
├─────────┴───────────┴──────────────────────┴───┴──────────────────────┴──────────────────────┴─────────────────┤
│ 3 rows 83 columns (6 shown) │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
This is an improvement: the unnest has exploded out the STRUCT fields to their own columns; 83 of them, in fact!
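If you want to see all 83 column names rather than just the 6 shown, DESCRIBE works on the view as well - a quick sketch:
D DESCRIBE casedata;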
Unpacking SCADA data
We now simply repeat the unnest process to access the SCADA data. I’ll set the output mode to table for some ascii-art goodness too.
D .mode table
D WITH vw_scada AS (Select unnest(scada, max_depth:=2) FROM casedata) SELECT * from vw_scada limit 5;
+------------------------------------+-----------+---------------+-------------+------------+-----------------------------------+
| tag | value | valueDataType | qualityFlag | dataSource | asAtTimeStamp |
+------------------------------------+-----------+---------------+-------------+------------+-----------------------------------+
| dispatchCondition.loadRelief | 2 | number | good | SCADA | 2025-02-05T07:59:05.02744+08:00 |
| dispatchCondition.systemInertia | 21370.863 | number | good | SCADA | 2025-02-05T07:59:05.0274425+08:00 |
| dispatchCondition.demand | 2364.1733 | number | good | SCADA | 2025-02-05T07:59:05.027435+08:00 |
| dispatchCondition.underlyingDemand | 3053.1206 | number | good | SCADA | 2025-02-05T07:59:05.0274437+08:00 |
| dispatchCondition.loadInertia | 5146.7617 | number | good | SCADA | 2025-02-05T07:59:05.0274388+08:00 |
+------------------------------------+-----------+---------------+-------------+------------+-----------------------------------+
D
Finally, some real data! Want to write it to a CSV?
D COPY (
WITH vw_scada AS (
Select unnest(scada, max_depth:=2) FROM casedata
) SELECT * from vw_scada
) TO 'scada.csv';
D
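If you’d rather have Parquet than CSV, the same COPY statement can write that too - a sketch (the filename and format clause are my own choice):
D COPY (
    WITH vw_scada AS (
      Select unnest(scada, max_depth:=2) FROM casedata
    ) SELECT * from vw_scada
  ) TO 'scada.parquet' (FORMAT PARQUET);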
Wrapping Up
A quack re-cap of what we’ve learned:
- How to load a chonkin great JSON file
- Determining the schema of a large JSON file
- Unnesting arrays and objects into table rows
- Writing results out to a CSV
If you haven’t tried DuckDB for data mangling, give it a try!