AI generated image (StableDiffusionWeb) of a cat holding a hose, spraying what could almost be construed as "Icebergs"

Airwaves to Data Lake in 60 Seconds!

by Paul Symons


Summary

In my last blog, Capture And Upload Data With Vector, we covered some examples of using Vector to read, transform and send data to AWS S3.

This blog takes that idea further, using Vector, Amazon Data Firehose and Amazon S3 Tables to write it directly to Iceberg format tables, ready for querying in your data lakehouse.

All code created for this blog is available in the flight-tracking-readsb repository.

Architecture

From top to bottom, this architecture describes:

  1. A USB software defined radio being used with readsb console application to capture ADS-B aircraft transmissions as JSON documents
  2. Vector used as a data pipeline, to transform and ship data from the laptop to AWS Data Firehose
  3. Records that cannot be delivered will be written by the Firehose to a bucket used specifically for failed records
  4. Successful records will be inserted into an Iceberg table hosted on AWS S3Tables
  5. The S3Tables iceberg table is federated and accessible to AWS Analytics services such as Athena via LakeFormation.

If you prefer not to use AWS Analytics services to query your data, you can instead use the S3Tables Iceberg REST Catalog.

Prerequisites

Unfortunately, the S3Tables Destination for Amazon Data Firehose currently has a hard dependency on LakeFormation, even though your delivery stream is configured with a service role that could use IAM permissions to permit fine-grained table operations. If you don’t want to use LakeFormation, then this is not the solution for you.

To make LakeFormation aware of S3Tables as a catalog, you must tick the Enable Integration setting. You can’t miss this prompt; it nags you all the way through the S3 Table Buckets console until you acquiesce to it.

Once you have done this, each Table Bucket you create will be available as an alternative catalog in services such as LakeFormation, Athena, etc. The default catalog remains the original Glue Catalog, which is named after your AWS Account Number.

AWS Infrastructure

We will stand up the AWS Infrastructure using Terraform.

S3Tables: Bucket, Namespace and Table

Creating these resources is straightforward:

resource "aws_s3tables_table_bucket" "flight_tracking" {
  name = var.flight_tracking_bucket_name
  encryption_configuration = {
    sse_algorithm = "AES256"
    # shouldn't need to specify the following as null, 
    # but terraform complained otherwise
    kms_key_arn   = null
  }
}

resource "aws_s3tables_namespace" "adsb" {
  table_bucket_arn = aws_s3tables_table_bucket.flight_tracking.arn
  namespace        = var.flight_tracking_namespace
}

resource "aws_s3tables_table" "aircraft" {
  table_bucket_arn = aws_s3tables_table_bucket.flight_tracking.arn
  namespace        = aws_s3tables_namespace.adsb.namespace
  name             = var.flight_tracking_aircraft_table
  format           = "ICEBERG"

  metadata {
    iceberg {
      schema {
        field {
          name     = "_adsb_message_count"
          type     = "int"
          required = false
        }
        # ... remaining fields elided ...
      }
    }
  }
}

I like that you can configure the schema for your table in Terraform; unfortunately, it does not support defining partitions at this time.

Previously, the glue_catalog_table Terraform Resource, and the AWS Glue create_table API also suffered from this shortcoming, but both now appear to support partition definition. Hopefully the S3Tables Create Table API will catch up soon.
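
For contrast, here is a hedged sketch of what partition definition looks like on the Glue table resource today. It is not part of this build, and the names, location and SerDe are purely illustrative:

# Illustrative only: a Hive-style Glue table with a partition key,
# shown to contrast with the S3Tables table resource above.
resource "aws_glue_catalog_table" "aircraft_by_day_example" {
  name          = "aircraft_by_day" # hypothetical
  database_name = "adsb"            # hypothetical
  table_type    = "EXTERNAL_TABLE"

  partition_keys {
    name = "capture_date"
    type = "string"
  }

  storage_descriptor {
    location      = "s3://example-bucket/aircraft/" # hypothetical
    input_format  = "org.apache.hadoop.mapred.TextInputFormat"
    output_format = "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"

    ser_de_info {
      serialization_library = "org.openx.data.jsonserde.JsonSerDe"
    }
  }
}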

Delivery Errors Bucket

The Terraform code includes the definition of a regular S3 bucket to receive delivery errors. When configuring the Data Firehose, you can choose whether to send only failed records, or simply all records to the delivery errors bucket - you might want to do this to satisfy audit requirements, for example.
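
For reference, the errors bucket itself is nothing special - a minimal sketch, with a purely illustrative bucket name:

# Plain S3 bucket that receives Firehose delivery errors (name is hypothetical).
resource "aws_s3_bucket" "bad_records_bucket" {
  bucket = "flight-tracking-delivery-errors"
}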

Configuration of the delivery errors bucket is mandatory - without it you cannot create the Firehose delivery stream. It turns out to be pretty useful for catching two distinct kinds of problem:

  • Your configuration is incorrect (e.g. Lakeformation permissions, Firehose Role inadequacies)
  • Your data differs from the Iceberg table definition (e.g. additional data, or conflicting or unsupported types)

Improving the accessibility of delivery errors will make your life much easier, so in the repository we define an external table (not iceberg) that points to where the delivery errors are stored; this allows you to simply query the errors with Athena to debug the cause of your misfortune.
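
As a rough illustration only - the repository's definition will differ, and every column besides rawdata is an assumption about the shape of the error records - such a table could be declared along these lines:

# Hypothetical external table over the Firehose error output prefix,
# so delivery errors can be queried directly from Athena.
resource "aws_glue_catalog_table" "iceberg_stream_errors" {
  name          = "iceberg_stream_errors"
  database_name = "adsb" # hypothetical database
  table_type    = "EXTERNAL_TABLE"

  storage_descriptor {
    # assumed to match the error_output_prefix on the delivery stream
    location      = "s3://flight-tracking-delivery-errors/iceberg-stream/"
    input_format  = "org.apache.hadoop.mapred.TextInputFormat"
    output_format = "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"

    ser_de_info {
      serialization_library = "org.openx.data.jsonserde.JsonSerDe"
    }

    columns {
      name = "rawdata" # base64 encoded payload, queried below
      type = "string"
    }

    columns {
      name = "errormessage" # assumed field name in the error record
      type = "string"
    }
  }
}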

Note that the actual payload is included - it is base64 encoded, so if you want to read the payload you’ll need a query like

SELECT
    FROM_UTF8(FROM_BASE64(rawdata)) AS payload
FROM iceberg_stream_errors

Amazon Data Firehose

Once you have all the supporting resources in place - roles, buckets, log groups etc - defining the delivery stream itself is rather straightforward.
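
As an aside, the log group and log stream referenced by the delivery stream below could be declared roughly as follows - names and retention are illustrative, not necessarily what the repository uses:

# CloudWatch log group and stream for Firehose error logging (names are hypothetical).
resource "aws_cloudwatch_log_group" "firehose_logs" {
  name              = "/aws/firehose/iceberg-stream"
  retention_in_days = 14
}

resource "aws_cloudwatch_log_stream" "firehose_log_stream" {
  name           = "DestinationDelivery"
  log_group_name = aws_cloudwatch_log_group.firehose_logs.name
}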

resource "aws_kinesis_firehose_delivery_stream" "iceberg_stream" {
  name        = local.firehose_name
  destination = "iceberg"

  iceberg_configuration {
    cloudwatch_logging_options {
      enabled         = true
      log_stream_name = local.log_stream_name
      log_group_name  = aws_cloudwatch_log_group.firehose_logs.name
    }

    role_arn           = aws_iam_role.firehose_role.arn
    # this is the fully qualified catalog arn for your s3tables
    # table bucket,  which you have to construct yourself, e.g.
    # "arn:aws:glue:${region}:${account}:catalog/s3tablescatalog/${var.flight_tracking_bucket_name}"
    catalog_arn        = local.catalog_arn
    buffering_interval = 60

    destination_table_configuration {
      database_name = aws_s3tables_namespace.adsb.namespace
      table_name    = aws_s3tables_table.aircraft.name
    }

    s3_backup_mode = "FailedDataOnly"
    s3_configuration {
      bucket_arn          = aws_s3_bucket.bad_records_bucket.arn
      role_arn            = aws_iam_role.firehose_role.arn
      error_output_prefix = "${local.firehose_name}/"
    }
  }
}

If you decide you do want to capture all of your data events to the regular S3 bucket, modify the s3_backup_mode parameter to the value AllData instead of FailedDataOnly.

The Amazon Data Firehose service does a lot of pre-validation before letting you create a delivery stream. So whilst it’s straightforward to configure a delivery stream, ensuring the delivery stream’s role policy is correct is critical. Consider the role definition in the attached repository a minimum viable policy.
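
As an indication only, the kinds of statements that policy needs look roughly like the following - ARNs and scoping are placeholders, and the repository plus the Firehose documentation remain the authoritative reference:

# Indicative Firehose role permissions only; tighten resources for real use.
data "aws_iam_policy_document" "firehose_minimum" {
  statement {
    sid       = "GlueTableAccess"
    actions   = ["glue:GetDatabase", "glue:GetTable", "glue:UpdateTable"]
    resources = ["*"] # scope to the s3tablescatalog catalog, namespace and table
  }

  statement {
    sid       = "LakeFormationAccess"
    actions   = ["lakeformation:GetDataAccess"]
    resources = ["*"]
  }

  statement {
    sid = "DeliveryErrorsBucket"
    actions = [
      "s3:AbortMultipartUpload",
      "s3:GetBucketLocation",
      "s3:GetObject",
      "s3:ListBucket",
      "s3:ListBucketMultipartUploads",
      "s3:PutObject",
    ]
    resources = [
      aws_s3_bucket.bad_records_bucket.arn,
      "${aws_s3_bucket.bad_records_bucket.arn}/*",
    ]
  }

  statement {
    sid       = "Logging"
    actions   = ["logs:PutLogEvents"]
    resources = ["${aws_cloudwatch_log_group.firehose_logs.arn}:*"]
  }
}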

Configuring Vector

Once your Firehose is set up and running, configuring Vector is very easy. Simply take the name of your Firehose stream and add it as a new sink - Vector will do the rest:

sinks:
  my-firehose:
    type: aws_kinesis_firehose
    inputs:
      - unnested
    stream_name: iceberg-stream
    encoding:
      codec: json
    batch:
      # flush a partially filled batch once it is 60 seconds old
      timeout_secs: 60
    region: ap-southeast-2

Considerations

LakeFormation: Terraform Support for S3TablesCatalog in Permissions

No shade on Terraform here - they had support for S3Tables at launch and my anecdotal experience is that the project does a great job of keeping up with changes to the hundreds of AWS APIs.

The “AWS Analytics Services Integration” mentioned previously makes S3Tables Table Buckets available as a new form of Glue Catalog. This is an important change, because for the last 5 years or so, the only valid CatalogId for LakeFormation permissions would have been a 12 digit AWS Account ID.

The new integration changes that by allowing CatalogId in the following form:

123456789012:s3tablescatalog/<table-bucket-name>

At the time of writing, however, Terraform does not yet accept a CatalogId in this form for LakeFormation permissions; an issue exists and a pull request has already been prepared, so this gap will likely disappear very soon.

LakeFormation is mandatory and requires special permissions

When granting permissions to the Firehose role to update your S3Tables table, you have no choice but to grant it Super (or ALL) permissions.

Ideally, I’d prefer to grant it SELECT, INSERT, and DESCRIBE. But even if you tick every single permission, Firehose will raise an error, saying that it does not have access to update your table. Your only remedy is to grant it the Super permission.

I personally don’t like the idea that the Firehose role would have the DROP permission on my table, and for this reason I would not use this feature in a production workload. Unlike Hive tables in the Glue Catalog, which are external tables, Iceberg tables are managed tables, meaning that if your table is deleted, the data is also deleted.
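
For illustration, once Terraform accepts the new CatalogId form, the grant might look roughly like this - a sketch only, using the account id placeholder from above and the Super/ALL permission discussed here:

# Illustrative grant of Super (ALL) to the Firehose role on the S3Tables table.
# The catalog_id format assumes the s3tablescatalog integration described earlier.
resource "aws_lakeformation_permissions" "firehose_table_access" {
  principal   = aws_iam_role.firehose_role.arn
  permissions = ["ALL"] # lesser grants are rejected by Firehose

  table {
    catalog_id    = "123456789012:s3tablescatalog/${var.flight_tracking_bucket_name}"
    database_name = aws_s3tables_namespace.adsb.namespace
    name          = aws_s3tables_table.aircraft.name
  }
}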

Compression

As far as I can see, it is not possible to send (e.g. gzip) compressed data to a Data Firehose delivery stream that has an S3Tables Iceberg destination.

When ingestion is priced on the volume of bytes received, this can make a considerable difference, given that even gzip compression commonly achieves a 5x reduction in size.

Cost

I was surprised to learn that Amazon Data Firehose pricing for Iceberg destinations is a little different, and simplified. Instead of worrying about the 5KB increments common to other Firehose destinations, a flat fee per GB ingested is charged. This makes it easier to reason about the cost of ingestion, and I appreciate this.

Something to keep front of mind is that Firehose is just one part of your cost. You also have S3Tables costs for storage, table maintenance, etc. - OneTable’s article on S3Tables pricing did a good job of calling this out, and not long after it was published, AWS actually revised their pricing downwards.

The irony is that Iceberg was designed for petabyte-scale volumes of data. If you take the pricing example from AWS:

It would take around 23 years to store 1 petabyte.

A more realistic example that would reach a petabyte in four months would aim to ingest around 100MiB per second. To supply the Firehose with this volume of data would require many shards in a Kinesis Data Stream, which would itself cost around $1,500 per month.

The corresponding monthly Firehose cost for data ingest - 100MiB per second works out to roughly 250TiB per month - would be over $23,000.

When working at this scale, you want to consider all of your costs, not just the service costs. And if you actually plan to send 100MiB per second to AWS, you might want to phone ahead first.

Conclusion

Thanks for reading about my favourite AWS Service that I rarely use in production!

The following references were helpful to set up the Delivery Stream to Iceberg:
