A duck standing on a rock at the edge of a clean, pristine lake with alpine mountains in the background

Bringing the Duck(DB) to the Data Lake

by Paul Symons

Summary

This post explains the process of transforming JSON data emitted by an AWS Lambda function into a DuckDB database, used in the AEMO WEM Data Inventory project I’ve created for visualising some public energy market data.

Generative AI has been dominating the tech press for some time now, but before it did, DuckDB made a big splash as it captured the imagination of a Data Engineering community worn down by the complexity and cost of modern OLAP data systems.

DuckDB offers a lot in a small package:

  • Small, portable and fast with no dependencies!
  • Integrates well with existing tooling
  • Excellent and powerful SQL language support

I am going to walk through my experience of generating a simple DuckDB database from a JSON-producing AWS Lambda function.

Data Producer

In my last blog post, I discussed a simple Python application that I run in Lambda (source code here).

The Lambda function crawls some webpages and produces around 40,000 JSON Lines (JSONL) payloads similar to the sample below:

{"dt": "2024-06-09T08:03:00", "url": "https://data.wa.aemo.com.au/public/infographic/", "depth": 1, "fileName": "facility-weeks-last52.csv", "fileUrl": "https://data.wa.aemo.com.au/public/infographic/facility-weeks-last52.csv", "fileSize": 98768, "is_directory": false, "_created_at": "2024-06-13T10:53:18"}
{"dt": "2024-06-13T10:53:00", "url": "https://data.wa.aemo.com.au/public/infographic/", "depth": 1, "fileName": "generation.csv", "fileUrl": "https://data.wa.aemo.com.au/public/infographic/generation.csv", "fileSize": 21161, "is_directory": false, "_created_at": "2024-06-13T10:53:18"}
{"dt": "2024-05-22T16:00:00", "url": "https://data.wa.aemo.com.au/public/infographic/neartime/", "fileCount": 0, "is_directory": true, "_created_at": "2024-06-13T10:53:18"}
{"dt": "2024-06-13T10:45:00", "url": "https://data.wa.aemo.com.au/public/infographic/", "depth": 1, "fileName": "outage-summary.csv", "fileUrl": "https://data.wa.aemo.com.au/public/infographic/outage-summary.csv", "fileSize": 42409, "is_directory": false, "_created_at": "2024-06-13T10:53:18"}

Now, you could run DuckDB in Lambda itself and insert data directly into it, either from temporary local files or via direct inserts, but I prefer not to couple so tightly, for a few reasons:

  • Running DuckDB complicates the Lambda setup - perhaps requiring a layer, or a Docker image configuration
  • I may want to use the data for purposes beyond DuckDB
  • Changing the JSON schema may also force an update of the DuckDB insert code

So instead, I’m going to write the data to S3 object storage.

Storing Data in S3

I have a few principles I stick to when writing semi-structured data to S3:

  1. Never write objects directly to the root of the bucket
  2. Don’t overwrite
  3. Always partition, preferably with a relatively low cardinality value
  4. Always compress
  5. Set up lifecycling from the start

1. Don’t write to the root

If you write at the root of the bucket, it is difficult to apply lifecycle rules safely. By always storing in discrete prefixes, you can more confidently apply prefix filters.

There are other filter options, such as minimum and maximum object size, and more usefully - object tags. But I think most people favour prefix filters.

2. Don’t overwrite

Introducing mutability by overwriting objects re-introduces ACID concerns that I’d rather leave behind. It’s often better (and definitely simpler) to be write-only, and filter logically on read.

Thankfully, running out of space is not something to concern ourselves with these days, at least not with S3 - so I prefer to make all writes immutable, and manage retention with lifecycle policies.

3. Always Partition

Partitioning assists both writing and reading.

In our adorable use case, we will never exhaust S3 PUT limits, but it’s worth remembering that S3 quotas limit write requests to 3,500 per second, and read requests to 5,500 per second, per prefix.

Therefore, partitioning data can help avoid rate limiting, as long as you distribute reads and writes across multiple prefixes.

4. Always Compress

JSON data (especially JSONL, where object keys are repeated on every line) compresses really well, often achieving a 10-15 times reduction in size from plain text.

Let’s frame this another way. How do you feel about paying 10-15 times more per Gigabyte of storage on S3?

Reading and writing compressed data on S3 only requires a couple of extra lines of code and will save you a lot in the long run. There are few reasons not to do it, so just do it.
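
To illustrate what those couple of extra lines look like, here is a minimal Python sketch (the records list is a made-up stand-in for the crawler’s output, not the project’s actual code) that builds a JSONL payload and gzips it before upload:

import gzip
import json

# Stand-in for the list of dicts the crawler would produce
records = [
    {"dt": "2024-06-13T10:45:00", "url": "https://data.wa.aemo.com.au/public/infographic/",
     "fileName": "outage-summary.csv", "is_directory": False},
]

# Join the records into a single JSONL payload
jsonl = "\n".join(json.dumps(r) for r in records).encode("utf-8")

# The couple of extra lines: compress before writing to S3
compressed = gzip.compress(jsonl)

print(f"plain: {len(jsonl)} bytes, gzipped: {len(compressed)} bytes")

With a realistic volume of repetitive JSONL (rather than the single record above), the ratio approaches the 10-15 times quoted earlier.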

5. Lifecycle from the start

Lifecycle Policies require you to have a conversation - even if only with yourself. It’s basic Data Lifecycle Management - what is your plan for this data?

  • How long do I need to keep it?
  • Is it sensitive?
  • Will the structure of the data change?
  • Should I archive it or just delete it?

Your use case and system design will help you determine what lifecycle policies are appropriate. But make sure you consider it ahead of time, before you have 15 million small files in a single folder burning a hole in your pocket.
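
As a sketch of what that conversation might produce, the following boto3 call (the 90-day retention is a hypothetical answer, not the project’s actual policy) expires raw JSONL objects under a single prefix - which only works cleanly because nothing is written to the bucket root:

import boto3

s3 = boto3.client("s3")

# Hypothetical policy: delete raw JSONL objects 90 days after creation
s3.put_bucket_lifecycle_configuration(
    Bucket="my-bucketname",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-raw-jsonl",
                "Filter": {"Prefix": "website_inventory/jsonl/"},
                "Status": "Enabled",
                "Expiration": {"Days": 90},
            }
        ]
    },
)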

My S3 layout

In my application, I’ve decided to write my Lambda output to a bucket, under a prefix starting with website_inventory/jsonl/

From there, my application writes the following two additional prefixes:

  • the YYYY-MM-DD date of processing
  • the aws_request_id from the Lambda invocation context, which is searchable in CloudWatch Logs
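
A minimal sketch of how such a key might be assembled and written inside the handler (the function and variable names are illustrative; the real project code may differ):

import gzip
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

def write_inventory(jsonl_bytes: bytes, aws_request_id: str) -> str:
    # Processing Time date component, e.g. 2024-06-13
    run_date = datetime.now(timezone.utc).strftime("%Y-%m-%d")

    # website_inventory/jsonl/<processing date>/<lambda request id>/<file name>
    key = (
        f"website_inventory/jsonl/{run_date}/{aws_request_id}/"
        f"{run_date}.aemo-inventory.jsonl.gz"
    )

    # Immutable, compressed write: a re-run gets a new request id, so nothing is overwritten
    s3.put_object(
        Bucket="my-bucketname",
        Key=key,
        Body=gzip.compress(jsonl_bytes),
        ContentEncoding="gzip",
    )
    return key

In a real handler, aws_request_id comes from the context object passed to the Lambda entry point.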

Listing my bucket demonstrates this more clearly:

$ aws s3 ls --recursive s3://my-bucketname/website_inventory/jsonl
website_inventory/jsonl/2024-06-11/48073d6d-e937-48a6-b9be-729b0c49cc68/2024-06-11.aemo-inventory.jsonl.gz
website_inventory/jsonl/2024-06-11/b48965c7-de67-4465-8ea1-917d5b986094/2024-06-11.aemo-inventory.jsonl.gz
website_inventory/jsonl/2024-06-11/b5c3804c-f636-4d6f-94ee-633b0bfd37c3/2024-06-11.aemo-inventory.jsonl.gz
website_inventory/jsonl/2024-06-11/f7592e5b-9dec-4fa1-aee2-65c94e4d8512/2024-06-11.aemo-inventory.jsonl.gz
website_inventory/jsonl/2024-06-12/03d7335e-09fb-4309-9d12-8c7eb6c8aee2/2024-06-12.aemo-inventory.jsonl.gz
website_inventory/jsonl/2024-06-12/6086b428-bbc2-48f4-9c39-e493f3587d7f/2024-06-12.aemo-inventory.jsonl.gz
website_inventory/jsonl/2024-06-12/8586e608-8b34-445c-854c-8b34fdf58ac3/2024-06-12.aemo-inventory.jsonl.gz
website_inventory/jsonl/2024-06-12/ed50ed7e-b028-41e1-a839-9742bc654d22/2024-06-12.aemo-inventory.jsonl.gz
website_inventory/jsonl/2024-06-13/41392392-2c12-4e42-8f13-ffb603c0446d/2024-06-13.aemo-inventory.jsonl.gz

Processing Time vs Event Time

If you are familiar with stream processing, you’ll be aware of the difference between Processing Time and Event Time:

  • Event Time is a timestamp as experienced by the event itself, typically at origination
  • Processing Time is a timestamp associated with when the event data is processed

It’s an important distinction to make, especially considering typical scenarios such as clock drift, network congestion (late arriving data) or processing delays. Critically, use of Event Time typically requires introspection of the data (i.e. read and parse), whereas Processing Time does not.

In our example above, the date component of the Lambda execution’s Processing Time is used. Theoretically, some late arriving data from 11th June could be written to the 2024-06-12 prefix.

This is a trade-off I am happy to wear:

  • it keeps code simple
  • it ensures data from a single run will always be in the same prefix
  • downstream consumers can filter appropriately to collect the subset of data they need (as shown in the sketch below)

The aim of partitioning is not really to neatly organise things into compartments. It’s to reduce the amount of interrogation and read activity required on blob storage systems.
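
For example, here is a hedged sketch (working on hypothetical local copies of the compressed files) of a downstream reader that scans two Processing Time prefixes but filters on the Event Time field dt, so late-arriving 11th June data that landed under 2024-06-12 is still picked up:

import gzip
import json
from datetime import date

# Hypothetical local copies of objects from two Processing Time prefixes
files = [
    "2024-06-11.aemo-inventory.jsonl.gz",
    "2024-06-12.aemo-inventory.jsonl.gz",
]

wanted = date(2024, 6, 11)
rows = []
for path in files:
    with gzip.open(path, "rt") as fh:
        for line in fh:
            record = json.loads(line)
            # Filter on Event Time (dt), not on the prefix the data arrived under
            if date.fromisoformat(record["dt"][:10]) == wanted:
                rows.append(record)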

Immutability

I have partitioned the data by Processing Time date, and by the Lambda invocation’s aws_request_id. This guarantees that if my Lambda fires twice instead of once, or I trigger it manually for some reason, my data won’t overwrite itself.

Creating a DuckDB database

Working with DuckDB is very straightforward; it offers many install channels to suit your platform or integration method, such as the CLI or an embedded library.

It also supports a number of extensions:

  • some are pre-installed, others are installed on demand
  • some will be autoloaded when required, others must be explicitly loaded by calling load <extension-name>

The CLI app ships as a single pre-compiled binary, and running it drops you into a REPL much like the psql or mysql clients. But of course, you can also redirect input to the binary to execute a SQL file against a database:

$ duckdb aemo.duckdb < migrations.sql

The above command uses duckdb to open a database stored in the file aemo.duckdb in the current directory, and sends it the commands from the file migrations.sql, also in the current directory.

My migration script

The script I use looks like this:

install httpfs;
install aws;
load aws;
call load_aws_credentials();

create table inventories (
    dt timestamp, url varchar, depth int, filecount int,
    is_directory boolean, _created_at timestamp,
    filename varchar, fileurl varchar, filesize bigint);

insert into inventories select
    dt, url, depth, filecount,
    is_directory, _created_at,
    filename, fileurl, filesize
 from "${s3_location}/**/*.jsonl.gz";

A few things to note:

  • Technically, you don’t have to explicitly install and load the httpfs extension; DuckDB will do it automatically. But as a migration script (and in case DuckDB changes the defaults), I think it’s good practice to include it
  • You need to call the load_aws_credentials() function to pick up your AWS credentials from the environment, otherwise it will attempt anonymous requests
  • Explicitly reference SELECT columns instead of using * when doing INSERT INTO SELECT... so that if new fields appear in the JSONL data, your INSERT will not break
  • You could also create a view instead of a table to dynamically read from S3 whenever a query references the view. However, for our use case it is preferable to ingest during migration, rather than at query time.
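
The ${s3_location} placeholder in the script implies some templating before it reaches DuckDB. One way of doing that (a sketch using Python’s string.Template and a subprocess call to the duckdb CLI; the bucket path is the one from the listing earlier) might be:

import subprocess
from string import Template

# Resolve the ${s3_location} placeholder before handing the script to DuckDB
with open("migrations.sql") as fh:
    sql = Template(fh.read()).substitute(
        s3_location="s3://my-bucketname/website_inventory/jsonl"
    )

# Equivalent to: duckdb aemo.duckdb < migrations.sql, with the placeholder resolved
subprocess.run(["duckdb", "aemo.duckdb"], input=sql, text=True, check=True)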

Overall Architecture

The following diagram demonstrates how the database is created on a schedule.

Stay tuned to find out what the DuckDB database is used for!