Bringing the Duck(DB) to the Data Lake
by Paul Symons

Contents
- Summary
- Data Producer
- Storing Data in S3
- My S3 layout
- Creating a DuckDB database
- Overall Architecture
Summary
This post explains the process of transforming JSON data emitted by a Lambda function into a DuckDB database, used in the AEMO WEM Data Inventory project I’ve created for visualising some public energy market data.
Generative AI has been dominating the tech press for some time now, but before it did, DuckDB made a big splash as it captured the imagination of a Data Engineering community worn down by the complexity and cost of modern OLAP data systems.
DuckDB offers a lot in a small package:
- Small, portable and fast with no dependencies!
- Integrates well with existing tooling
- Excellent and powerful SQL language support
I am going to walk through my experience of generating a simple DuckDB database from a JSON-producing AWS Lambda function.
Data Producer
In my last blog post, I discussed a simple Python application that I run in Lambda (source code here).
The Lambda function crawls some webpages and produces around 40,000 JSON Lines (JSONL) records, similar to the samples below:
{"dt": "2024-06-09T08:03:00", "url": "https://data.wa.aemo.com.au/public/infographic/", "depth": 1, "fileName": "facility-weeks-last52.csv", "fileUrl": "https://data.wa.aemo.com.au/public/infographic/facility-weeks-last52.csv", "fileSize": 98768, "is_directory": false, "_created_at": "2024-06-13T10:53:18"}
{"dt": "2024-06-13T10:53:00", "url": "https://data.wa.aemo.com.au/public/infographic/", "depth": 1, "fileName": "generation.csv", "fileUrl": "https://data.wa.aemo.com.au/public/infographic/generation.csv", "fileSize": 21161, "is_directory": false, "_created_at": "2024-06-13T10:53:18"}
{"dt": "2024-05-22T16:00:00", "url": "https://data.wa.aemo.com.au/public/infographic/neartime/", "fileCount": 0, "is_directory": true, "_created_at": "2024-06-13T10:53:18"}
{"dt": "2024-06-13T10:45:00", "url": "https://data.wa.aemo.com.au/public/infographic/", "depth": 1, "fileName": "outage-summary.csv", "fileUrl": "https://data.wa.aemo.com.au/public/infographic/outage-summary.csv", "fileSize": 42409, "is_directory": false, "_created_at": "2024-06-13T10:53:18"}
Now, you could run DuckDB inside the Lambda function itself and insert data into it directly, either from temporary local files or with direct inserts, but I prefer not to couple things so tightly, for a few reasons:
- Running DuckDB complicates the Lambda setup - perhaps requiring a layer, or a Docker image configuration
- I may want to use the data for purposes beyond DuckDB
- Changing the JSON schema may also force an update of the DuckDB insert code
So instead, I’m going to write the data to S3 object storage.
Storing Data in S3
I have a few principles I stick to when writing semi-structured data to S3:
- Never write objects directly to the root of the bucket
- Don’t overwrite
- Always partition, preferably with a relatively low cardinality value
- Always compress
- Set up lifecycling from the start
1. Don’t write to the root
If you write at the root of the bucket, it is difficult to apply lifecycle rules safely. By always storing in discrete prefixes, you can more confidently apply prefix filters.
There are other filter options, such as minimum and maximum object size and, more usefully, object tags - but I think most people favour prefix filters.
2. Don’t overwrite
Introducing mutability by overwriting objects re-introduces ACID concerns that I’d rather leave behind. It’s often better (and definitely simpler) to be write-only, and filter logically on read.
Thankfully, running out of space is not something to concern ourselves with these days, at least not with S3 - so I prefer to make all writes immutable, and manage retention with lifecycle policies.
3. Always Partition
Partitioning assists both writing and reading.
In our adorable use case, we will never exhaust S3 put limits, but it’s worth remembering that S3 quotas limit write requests to 3,500 per second, and read requests to 5,500 per second, per prefix.
Therefore, partitioning data can help avoid rate limiting, as long as you distribute reads and writes across multiple prefixes.
4. Always Compress
JSON data (especially JSONL, where object keys are repeated on every line) compresses really well, often achieving a 10-15x reduction from the plain text size.
Let’s frame this another way. How do you feel about paying 10-15 times more per Gigabyte of storage on S3?
Reading and writing compressed data on S3 only requires a couple of extra lines of code and will save you a lot in the long run. There are few reasons not to do it, so just do it.
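As a rough illustration (not the project’s actual code), those extra lines with boto3 and Python’s gzip module look something like this; the bucket and key below are placeholders:

import gzip
import boto3

s3 = boto3.client("s3")

# writing: compress the JSONL payload before uploading (placeholder bucket and key)
jsonl_payload = '{"dt": "2024-06-13T10:53:00", "is_directory": false}\n'
s3.put_object(
    Bucket="my-bucketname",
    Key="website_inventory/jsonl/2024-06-13/example.jsonl.gz",
    Body=gzip.compress(jsonl_payload.encode("utf-8")),
    ContentType="application/json",
    ContentEncoding="gzip",
)

# reading: decompress on the way back out
obj = s3.get_object(
    Bucket="my-bucketname",
    Key="website_inventory/jsonl/2024-06-13/example.jsonl.gz",
)
lines = gzip.decompress(obj["Body"].read()).decode("utf-8").splitlines()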
5. Lifecycle from the start
Lifecycle Policies require you to have a conversation - even if only with yourself. It’s basic Data Lifecycle Management - what is your plan for this data?
- How long do I need to keep it?
- Is it sensitive?
- Will the structure of the data change?
- Should I archive it or just delete it?
Your use case and system design will help you determine what lifecycle policies are appropriate. But make sure you consider it ahead of time, before you have 15 million small files in a single folder burning a hole in your pocket.
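For illustration, a lifecycle rule scoped to this project’s prefix could be set with boto3 along these lines; the 90-day expiry is an arbitrary assumption rather than the policy I actually use:

import boto3

s3 = boto3.client("s3")

# expire raw JSONL objects under the website_inventory/jsonl/ prefix after 90 days
# (the retention period is illustrative only)
s3.put_bucket_lifecycle_configuration(
    Bucket="my-bucketname",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-website-inventory-jsonl",
                "Status": "Enabled",
                "Filter": {"Prefix": "website_inventory/jsonl/"},
                "Expiration": {"Days": 90},
            }
        ]
    },
)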
My S3 layout
In my application, I’ve decided to write my Lambda output to a bucket, under a prefix starting with website_inventory/jsonl/
From there, my application will write the following two additional prefixes:
- the YYYY-MM-DD date of processing
- the aws_request_id parameter from the Lambda invocation, which is searchable in CloudWatch Logs
Listing my bucket demonstrates this more clearly:
$ aws s3 ls --recursive s3://my-bucketname/website_inventory/jsonl
website_inventory/jsonl/2024-06-11/48073d6d-e937-48a6-b9be-729b0c49cc68/2024-06-11.aemo-inventory.jsonl.gz
website_inventory/jsonl/2024-06-11/b48965c7-de67-4465-8ea1-917d5b986094/2024-06-11.aemo-inventory.jsonl.gz
website_inventory/jsonl/2024-06-11/b5c3804c-f636-4d6f-94ee-633b0bfd37c3/2024-06-11.aemo-inventory.jsonl.gz
website_inventory/jsonl/2024-06-11/f7592e5b-9dec-4fa1-aee2-65c94e4d8512/2024-06-11.aemo-inventory.jsonl.gz
website_inventory/jsonl/2024-06-12/03d7335e-09fb-4309-9d12-8c7eb6c8aee2/2024-06-12.aemo-inventory.jsonl.gz
website_inventory/jsonl/2024-06-12/6086b428-bbc2-48f4-9c39-e493f3587d7f/2024-06-12.aemo-inventory.jsonl.gz
website_inventory/jsonl/2024-06-12/8586e608-8b34-445c-854c-8b34fdf58ac3/2024-06-12.aemo-inventory.jsonl.gz
website_inventory/jsonl/2024-06-12/ed50ed7e-b028-41e1-a839-9742bc654d22/2024-06-12.aemo-inventory.jsonl.gz
website_inventory/jsonl/2024-06-13/41392392-2c12-4e42-8f13-ffb603c0446d/2024-06-13.aemo-inventory.jsonl.gz
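The writer itself lives in the linked repository, but a sketch of how such a key could be assembled in the handler looks roughly like this (the aemo-inventory.jsonl.gz suffix comes from the listing above; everything else is illustrative):

from datetime import datetime, timezone

def build_key(aws_request_id: str, prefix: str = "website_inventory/jsonl") -> str:
    """Build an S3 key partitioned by processing date and Lambda request id."""
    # using UTC here; the real code may well use a local timezone
    processing_date = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    return f"{prefix}/{processing_date}/{aws_request_id}/{processing_date}.aemo-inventory.jsonl.gz"

# inside the Lambda handler, context.aws_request_id supplies the second partition
# key = build_key(context.aws_request_id)

The upload itself can then reuse the gzip and put_object pattern from the compression sketch above.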
Processing Time vs Event Time
If you are familiar with stream processing, you’ll be aware of the difference between Processing Time and Event Time:
- Event Time is a timestamp as experienced by the event itself, typically at origination
- Processing Time is a timestamp associated with when the event data is processed
It’s an important distinction to make, especially considering typical scenarios such as clock drift, network congestion (late arriving data) or processing delays. Critically, using Event Time typically requires introspection of the data (i.e. reading and parsing it), whereas Processing Time does not.
In the example above, the date component of the Lambda execution’s Processing Time is used.
Theoretically, some late arriving data from 11th June could be written to the prefix 2024-06-12.
This is a trade off I am happy to wear:
- it keeps code simple
- it ensures data from a single run will always be in the same prefix
- downstream consumers can filter on event time to collect the subset of data they need, as sketched below
The aim of partitioning is not really to neatly organise things into compartments. It’s to reduce the amount of interrogation and read activity required on blob storage systems.
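In practice, a downstream query can hedge against late arrivals by reading a small window of processing-date prefixes and then filtering on the event-time column dt. Here is a sketch using DuckDB’s Python API (placeholder bucket name, and assuming the httpfs and aws extensions plus S3 credentials are available):

import duckdb

con = duckdb.connect()
# pick up S3 credentials from the environment
con.execute("call load_aws_credentials()")

# read both the 11th and 12th June processing-date prefixes,
# then keep only rows whose event time (dt) falls on 11th June
rows = con.execute("""
    select *
    from read_json_auto([
        's3://my-bucketname/website_inventory/jsonl/2024-06-11/**/*.jsonl.gz',
        's3://my-bucketname/website_inventory/jsonl/2024-06-12/**/*.jsonl.gz'
    ])
    where cast(dt as date) = date '2024-06-11'
""").fetchall()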
Immutability
I have partitioned the data by Processing Time date, and by the Lambda invocation’s aws_request_id.
This guarantees that if my Lambda fires twice instead of once, or I trigger it manually for some reason, my data won’t overwrite itself.
Creating a DuckDB database
Working with DuckDB is very straightforward; it has lots of install channels to suit your platform or integration method, such as CLI or embedded library.
It also supports a number of extensions:
- some are pre-installed, others are installed on demand
- some will be autoloaded when required, others must be explicitly loaded by running load <extension-name>
The CLI app ships as a single pre-compiled binary, and running it drops you into a REPL much like the psql or mysql clients. But of course, you can also redirect input to the binary to execute a SQL file against a database:
$ duckdb aemo.duckdb < migrations.sql
The above command uses duckdb to open a database stored in a file called aemo.duckdb in the current directory, and sends it the commands from the file migrations.sql, also in the current directory.
My migration script
The script I use looks like this:
-- httpfs and aws provide S3 access; aws also supplies the credential helper below
install httpfs;
install aws;
load aws;
-- pick up credentials from the environment / default AWS credential chain
call load_aws_credentials();

create table inventories (
  dt timestamp, url varchar, depth int, filecount int,
  is_directory boolean, _created_at timestamp,
  filename varchar, fileurl varchar, filesize bigint);

-- ${s3_location} is a placeholder for the S3 prefix, substituted before the script runs
insert into inventories select
  dt, url, depth, filecount,
  is_directory, _created_at,
  filename, fileurl, filesize
from "${s3_location}/**/*.jsonl.gz";
A couple of things to note:
- Technically, you don’t have to explicitly install and load the httpfs extension; DuckDB will do it automatically. But as this is a migration script (and in case DuckDB changes the defaults), I think it’s good practice to include it
- You need to call the load_aws_credentials() function to pick up your environment credentials, otherwise DuckDB will attempt anonymous requests
- Explicitly reference SELECT columns instead of using * when doing INSERT INTO SELECT ..., so that if new fields appear in the JSONL data, your INSERT will not break
- You could also create a view instead of a table, to dynamically read from S3 whenever a query references the view. However, for our use case it is preferable to ingest during migration, rather than at query time (a sketch of the view approach follows below).
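For completeness, the view-based variant might look something like this sketch, again with a placeholder S3 prefix, using DuckDB’s Python API:

import duckdb

con = duckdb.connect("aemo.duckdb")
con.execute("install httpfs")
con.execute("install aws")
con.execute("load aws")
con.execute("call load_aws_credentials()")

# a view defers the S3 read until query time, instead of ingesting during the migration
con.execute("""
    create or replace view inventories_live as
    select dt, url, depth, filecount,
           is_directory, _created_at,
           filename, fileurl, filesize
    from 's3://my-bucketname/website_inventory/jsonl/**/*.jsonl.gz'
""")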
Overall Architecture
The following diagram demonstrates how the database is created on a schedule.
Stay tuned to find out what the DuckDB database is used for!