AI generated image (Gemini) simulating the packaging and despatching of JSON documents up to clouds in the sky

Capture and Upload Data with Vector

by Paul Symons


Summary

Vector is an open-source log shipping / data pipeline command line application used to read data from a source and write it to a sink, optionally transforming it along the way. Or, in their own words:

A lightweight, ultra-fast tool for building observability pipelines

It is similar in functionality to other open source solutions such as Fluent Bit and Benthos (RIP, now Redpanda Connect).

Vector supports a multitude of sources and sinks such as Kafka, AWS S3 and compatibles, InfluxDB, Splunk, Datadog etc. and is actively maintained, making it a reliable choice for first mile data integration.

Despite its description as a logging or observability shipper, it can source and ship a variety of data in many formats and codecs.
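Before diving into real scenarios, here is a minimal sketch of what a Vector pipeline looks like, using the built-in demo_logs source and console sink; the component names demo and out are arbitrary:

# minimal vector.yaml sketch: generate sample JSON events and print them to stdout
sources:
  demo:
    type: demo_logs
    format: json
    interval: 1

sinks:
  out:
    type: console
    inputs:
      - demo
    encoding:
      codec: json

Running vector --config vector.yaml against a file like this is typically enough to see events flowing on stdout.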

First Mile Data Collection

When capturing data from databases and especially internet-hosted SaaS applications, it is common to reach first for familiar Extract and Load (EL) tools, e.g. Fivetran, Estuary, Azure Data Factory, or even a self-hosted Airbyte if you are feeling spicy.

But what if you can’t access or run your normal extract and load tool where your data is coming from? Or what if your data source is hardware-based in origin, and runs on physically adjacent compute?

This is a common scenario with IoT, and there are many vendor-specific solutions out there: AWS IoT Greengrass will fulfil this need and much more - but if you are not targeting AWS, or are contained within an Operational Technology network, you need something else. This is where Vector becomes very useful.

My Scenarios

Most of my uses of Vector have been focused on efficiently moving data from isolated or hardware-coupled locations to cloud object storage, where it can be further processed.

Working with de-coupled hardware (Beckhoff TwinCAT)


In a previous job, we collaborated with a team that designed a process control system that operated in a hazardous environment, managing high voltage electricity as well as pressurised, combustible gas.

Systems like these are often not data loggers, though they can be augmented with modules that support IoT integrations such as MQTT and AMQP; early testing showed the MQTT modules struggled to keep up with the frequency of data emitted from the numerous channels. Whilst logging was desirable for lookback on testing runs, the primary responsibility was process control.

Using Vector allowed us to de-couple the responsibility of shipping logs to the cloud from the TwinCAT programming environment, by instead having it append its log data to files on a NAS. The simplified architecture looked as follows:

Working with attached hardware (RTL2832U Software Radio)


Have you ever used FlightRadar24 or airplanes.live? They track active aircraft all around the globe - updating in near realtime - showing information such as position, heading, air speed, altitude, etc.

Each aircraft (and even some ground vehicles) has what’s known as an ADS-B transponder, which regularly broadcasts its critical data on regulated frequencies, as well as a receiver that alerts it to other broadcasting aircraft in the area - an integral part of modern collision avoidance systems. There are also many ground stations receiving these signals, and that is typically where the data behind those APIs comes from.

Hobbyists can also use a USB-dongle-based software-defined radio to receive ADS-B signals from passing aircraft, and that is what the rest of this blog will focus on.

Collecting ADS-B data

To collect the data, we use a popular command line tool called readsb; it uses the USB software radio to tune in to the ADS-B frequencies, capture the messages, and decode them into a JSON document. The setup looks something like this:

You can download and compile it manually, but I have preferred to run it (and Vector) in Docker instead.

The Dockerfile I made for readsb is shown below:

# Stage 1: Build
FROM debian:bookworm AS builder

RUN apt-get update && apt-get install -y \
    git \
    wget \
    build-essential \
    libusb-1.0-0-dev \
    librtlsdr-dev \
    pkg-config \
    curl \
    ca-certificates \
    libncurses-dev zlib1g-dev libzstd-dev debhelper

# Clone readsb source
WORKDIR /src
RUN git clone --depth 20 https://github.com/wiedehopf/readsb.git
RUN wget -O /src/aircraft.csv.gz https://github.com/wiedehopf/tar1090-db/raw/csv/aircraft.csv.gz
WORKDIR /src/readsb

# Build readsb
ENV DEB_BUILD_OPTIONS=noddebs
RUN dpkg-buildpackage -b -Prtlsdr -ui -uc -us

# Stage 2: Runtime
FROM debian:bookworm-slim

# Install runtime dependencies (minimal)
RUN apt-get update && apt-get install -y \
    libusb-1.0-0 \
    librtlsdr0 \
 && apt-get clean && rm -rf /var/lib/apt/lists/*

# Copy binary from builder
COPY --from=builder /src/readsb*.deb /tmp
COPY --from=builder /src/aircraft.csv.gz /tmp
RUN apt-get update -y && apt-get install -y --no-install-recommends libncurses6 && dpkg -i /tmp/readsb*.deb

# Define default command (can be overridden with `docker run`)
ENTRYPOINT ["/usr/bin/readsb"]
CMD ["--help"]

and to run it, I use docker compose with a services file as follows:

services:
  readsb:
    build: readsb/
    devices:
      - /dev/bus/usb
    volumes:
      - tmpfs:/data
    entrypoint: ["readsb","--device-type","rtlsdr","--write-json","/data","--db-file=/tmp/aircraft.csv.gz","--write-json-every=0.5","--metric","--quiet"]
    restart: unless-stopped
    
volumes:
  tmpfs:
    driver: local
    driver_opts:
      type: tmpfs
      device: tmpfs
      o: size=512m,uid=0,gid=0

There are a few points worth making about how we are running readsb:

  • we are giving Docker access to /dev/bus/usb so that readsb can access the USB dongle
  • we are creating a tmpfs volume for writing to. Later, Vector will read from the same volume
  • we’ve put the Dockerfile shown above in the readsb/ folder so the image can be initially built

We use a tmpfs volume because readsb will continually write to disk, and it will only serve as a staging ground before Vector uploads it somewhere else; there’s no point adding unnecessary wear to a good SSD for the sake of an experiment.

If you want to more easily inspect the output from readsb, you could use a bind mount instead of the tmpfs mount; additionally, you can run it interactively with the --interactive flag, which will produce a top-like curses interface.
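As a sketch of that bind-mount variant, the readsb service's volumes stanza might look like this, assuming a local ./readsb-out directory next to the compose file (the directory name is a placeholder); everything else in the service stays the same:

services:
  readsb:
    # build, devices, entrypoint and restart policy unchanged from above
    volumes:
      - ./readsb-out:/data   # hypothetical local folder in place of the tmpfs volume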

Getting Started with Vector

To start working with Vector, we will add another service to our docker compose services file, using a public image for Vector:

services:
  ...
  vector:
    image: timberio/vector:0.48.0-debian
    volumes:
      - tmpfs:/data
      - ./vector-config/vector.yaml:/etc/vector/vector.yaml:ro
    environment:
      - AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID}
      - AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY}
      - AWS_SESSION_TOKEN=${AWS_SESSION_TOKEN}
      - AWS_DEFAULT_REGION=ap-southeast-2

Some things to note:

  • tmpfs volume is mounted so Vector can read the output files from readsb
  • We use a bind mount to add our Vector configuration file (see below) into the default location
  • We pass through some standard AWS environment variables used for testing AWS S3 uploads

Vector configuration can be in either TOML or YAML format - the latter is becoming the default, so you should probably stick with that.

Sources

To start our configuration, we edit the file vector-config/vector.yaml and add

sources:
  readsb:
    type: file
    include:
      - /data/aircraft.json
    fingerprint:
      strategy: "device_and_inode"
    multiline:
      start_pattern: "^\\{"
      mode: halt_with
      condition_pattern: "^\\}"
      timeout_ms: 500

So there’s a lot going on here - let’s unpack it step by step.

  1. sources - a top level keyword to declare input sources. Other top level keywords include transforms and sinks. I can add multiple named sources in a single configuration if I want to (see the sketch after this list).
  2. readsb - this is a named source. You can call it whatever you like; I am naming it after the program that generates our data.
  3. type: file - this specifies the component type of the source; in our case, we are using the file source. Once the type of a component has been declared, most properties that follow will be specific to that component.
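For example, a configuration declaring two named file sources side by side might look like this sketch (the second source and its path are hypothetical):

sources:
  readsb:
    type: file
    include:
      - /data/aircraft.json
  # a second, hypothetical source reading a different file
  receiver_stats:
    type: file
    include:
      - /data/stats.json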

An interesting behaviour of readsb is that it simply overwrites the output aircraft.json file - it does not append to the file continuously, nor does it write the JSON document on a single line; instead it (somewhat) pretty-prints it:

{ "now" : 1755134708.501,
 "messages" : 2238,
 "aircraft" : [
{"hex":"7c7d19","type":"adsb_icao","flight":"YZV     ","r":"VH-YZV","t":"PA44","alt_baro":400,"alt_geom":600,"gs":89.9,"track":295.71,"geom_rate":-64,"squawk":"3000","emergency":"none","category":"A1","lat":-32.078123,"lon":115.921681,"nic":8,"rc":186,"seen_pos":41.346,"version":2,"nac_v":2,"sil_type":"perhour","alert":0,"spi":0,"mlat":[],"tisb":[],"messages":366,"seen":35.0,"rssi":-33.7}
 ]
}

By default, the file component expects each line to correspond to an event; when reading JSON, that means it effectively expects JSON Lines / NDJSON. So, to fix that up, we use the multiline syntax above to help Vector identify what counts as a single event. This can take a bit of trial and error to get right.

The fingerprint.strategy setting allows us to deal with the issue of the file being continuously overwritten in place by readsb. The two options here are checksum and device_and_inode — checksum will track the file contents, whilst device_and_inode will track the replacing of files, typically via rename operations.

The mdash (—) in the previous paragraph was legitimately my own work

Finally, we can advise the file component to monitor a single file, a list of files, or a glob of files by using the include keyword. Globs can be very useful when another process is continuously writing new files according to a pattern, e.g. a datetime based partitioning scheme.
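As a hedged sketch of that glob-based usage, a file source watching a hypothetical date-partitioned layout could look something like this (the path and layout are illustrative):

sources:
  partitioned:
    type: file
    include:
      # hypothetical layout such as /data/logs/date=2025-08-15/part-0001.json
      - /data/logs/date=*/*.json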

Shipping with Vector

So, now we have set Vector up to read our input files and generate events from them - but it is not actually doing anything with that data. Ultimately, we want it to upload that data to AWS S3 or something similar, but before doing that, we want to transform the data a little bit.

Namely I want to:

  • filter out events where there is no flight data
  • add and remove some metadata
  • unnest and flatten the nested aircraft array element to re-focus the event data on actual aircraft updates

Transforms

As mentioned above, we can use the top level keyword transforms to add any number of transformation steps that will help us refine our events.

Filtering

The first one I am going to add - filtered - will filter out events for which there is no flight data:

transforms:
  filtered:
    type: filter
    inputs:
      - readsb
    condition:
      type: vrl
      source: "length!(parse_json!(.message).aircraft) > 0"

From the excerpt above, we can see:

  • the type of transform is a filter which aims to conditionally allow only events that match a test that you provide
  • we pass the input from our previously defined source, readsb
  • the condition will use vector remap language (VRL)
  • the source contains the VRL code that the filter will evaluate for each event

Vector Remap Language (VRL) is an expression-oriented language designed for transforming observability data (logs and metrics) in a safe and performant manner. It features a simple syntax and a rich set of built-in functions tailored specifically to observability use cases.

You don’t have to use VRL to get value from Vector, but you may get a lot more out of it if you do. It is easy to learn and once you get the basics, you’ll probably get by with the function reference alone.

Translating our VRL source above

length!(parse_json!(.message).aircraft) > 0

  • the event read from the file is stringified inside a top-level key called message
  • we parse the stringified event to recreate it as JSON
  • we test the length of the aircraft key - it should be an array with more than zero elements

You may be wondering what the exclamation marks are for. VRL is a fail-safe language, which means it will fail to compile when code contains unhandled errors.

Using the exclamation mark acknowledges the error can occur and will cause events to be dropped when raised; this is similar to the YOLO non-null assertion operator (also an exclamation mark) in TypeScript. In a non-trivial example, you might want to handle the error and set a default or null value instead of throwing away the event.
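As a sketch of that alternative, a hypothetical remap transform could handle the parse error and record it on the event instead of dropping it (the transform name and the parse_error / aircraft_count fields are purely illustrative):

transforms:
  parsed:                      # hypothetical alternative that keeps failing events
    type: remap
    inputs:
      - readsb
    source: |-
      # assign the error instead of aborting with parse_json!
      parsed, err = parse_json(.message)
      if err != null {
        # keep the event and record why parsing failed, rather than dropping it
        .parse_error = err
        .aircraft_count = 0
      } else {
        .aircraft_count = length(parsed.aircraft) ?? 0
      }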

Flattening

The second transform has the aim of turning each element of the aircraft array into its own event.

  flattened:
    type: remap
    inputs: 
      - filtered
    source: |-
      json = parse_json!(.message)
      . = flatten(json.aircraft)

This time, the transform is of type remap - instead of being a conditional, this transform will explicitly modify the event according to the source code you provide.

In the example above, we set a temporary local variable, json, to the parsed value of the stringified message. Then, we use the flatten function to unpack the aircraft array into new Vector events. In essence, it will take a single event such as:

{ 
 "now" : 1755134708.501,
 "messages" : 2238,
 "aircraft" : [
   {"hex":"7c7d19","type":"adsb_icao","flight":"YZV","r":"VH-YZV"},
   {"hex":"7c6d76", "type": "adsb_icao", "flight":"FD624J", "r":"VH-VWO"},
   {"hex":"7c806f", "type": "adsb_icao", "flight":"QFA33", "r":"VH-ZNL"}    
 ]
}

and break it into 3 discrete events as follows:

{"hex":"7c7d19","type":"adsb_icao","flight":"YZV","r":"VH-YZV"}
{"hex":"7c6d76", "type": "adsb_icao", "flight":"FD624J", "r":"VH-VWO"}
{"hex":"7c806f", "type": "adsb_icao", "flight":"QFA33", "r":"VH-ZNL"} 

This is a really powerful feature that allows you to take complex messages and refine them into separate event streams that you can do with as you wish.

Sinks

At this point, we have sourced our data, we have filtered and transformed it - now we should do something useful with it.

When I start working on Vector pipelines, I typically use a file sink as the output, as it is easy to inspect and debug, and doesn’t cost you anything.
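A file sink for that kind of local debugging might look like the sketch below (the sink name and output path are placeholders):

sinks:
  debug_file:                  # hypothetical sink for local debugging
    type: file
    inputs:
      - flattened
    path: /tmp/vector-debug-%F.log
    encoding:
      codec: json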

More commonly though, you want to ship this data somewhere other than the location where Vector is running. In this example, I am going to demonstrate using the AWS S3 Sink.

S3 Sink

Adding a sink is very simple - like sources and transforms, sinks is a top level keyword where you can gather your named sinks:

sinks:
  my_s3:
    type: aws_s3
    compression: gzip
    encoding:
      codec: json
    inputs:
      - flattened
    bucket: my-demo-bucket
    batch:
      timeout_secs: 300
    key_prefix: "flights/date=%F/"

In this example, I am configuring:

  • the type of the sink as aws_s3
  • file objects uploaded to AWS S3 should be gzip compressed
  • Vector events should be written as json (could be avro, csv, or something else)
  • input to this sink is the flattened transform we described earlier
  • destination bucket is called my-demo-bucket
  • batch settings - if we haven’t sent a file to S3 for over 300 seconds, send any buffered events in a new file
  • put the uploaded files in an S3 prefix such as flights/date=2025-08-15/ - using strftime specifiers

I don’t put AWS credentials in the configuration, because the sink will use the default AWS credential provider, allowing you to fall back on your normal credential supply mechanisms.

Conclusion

Hopefully you will find Vector very useful in your own scenarios, though this blog really only skims the surface of what it is capable of. Check out the documentation to find out more!

Working with Vector produces few code artifacts, whilst still remaining configurable, reliable and generally self-documenting.

Why not try it out?
