
Capture and Upload Data with Vector
by Paul Symons
Contents
- Summary
- First Mile Data Collection
- My Scenarios
- Collecting ADS-B data
- Getting Started with Vector
- Shipping with Vector
- Conclusion
Summary
Vector is an open-source log shipping / data pipeline command line application used to read data from a source and write it to a sink, optionally transforming it along the way. Or, in their own words:
A lightweight, ultra-fast tool for building observability pipelines
It is similar in functionality to other open source solutions such as Fluent Bit and Benthos (RIP, now Redpanda Connect).
Vector supports a multitude of sources and sinks such as Kafka, AWS S3 and compatibles, InfluxDB, Splunk, Datadog etc. and is actively maintained, making it a reliable choice for first mile data integration.
Despite its description as a logging or observability shipper, it can source and ship a variety of data in many formats and codecs.
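To make the shape of a pipeline concrete, here is a minimal, hypothetical configuration (the component names and path are mine, not from a real deployment) that tails a log file and prints each event to the console as JSON:

sources:
  demo_in:
    type: file
    include:
      - /var/log/demo.log      # hypothetical file to tail

sinks:
  demo_out:
    type: console              # print events to stdout
    inputs:
      - demo_in
    encoding:
      codec: json

Swap the console sink for an object store or messaging sink and you have the first mile of a data pipeline.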
First Mile Data Collection
When capturing data from databases and especially internet-hosted SaaS applications, it is common to reach first for familiar Extract and Load (EL) tools, e.g. Fivetran, Estuary, Azure Data Factory, or even a self-hosted Airbyte if you are feeling spicy.
But what if you can’t access or run your normal extract and load tool where your data is coming from? Or what if your data source is hardware-based in origin, and runs on physically adjacent compute?
This is a common scenario with IoT, and there are many vendor-specific solutions out there: AWS IoT Greengrass will fulfil this need and much more - but if you are not targeting AWS, or are contained within an Operational Technology network, you need something else. This is where Vector becomes very useful.
My Scenarios
Most of my uses of Vector have been focused on efficiently moving data from isolated or hardware-coupled locations to cloud object storage, where it can be further processed.
Working with de-coupled hardware (Beckhoff Twincat)
In a previous job, we collaborated with a team that designed a process control system that operated in a hazardous environment, managing high voltage electricity as well as pressurised, combustible gas.
Often these systems are not data loggers, though they can be augmented with modules that support IoT integrations such as MQTT and AMQP. In our case, early testing showed the MQTT modules struggled to keep up with the frequency of data emitted from the numerous channels. Whilst logging was desirable for looking back over testing runs, the primary responsibility of the system was process control.
Using Vector allowed us to de-couple the responsibility of shipping logs to the cloud from the Twincat programming environment, by instead having it append its log data to files on a NAS. The simplified architecture looked as follows:
Working with attached hardware (RTL2832U Software Radio)
Have you ever used FlightRadar24 or airplanes.live? They track active aircraft all around the globe - updating in near realtime - showing information such as position, heading, air speed, altitude, etc.
Each aircraft (and even some ground vehicles) has what's known as an ADS-B transponder, which regularly broadcasts its critical data on regulated frequencies, as well as a receiver that alerts it to other broadcasting aircraft in the area - an integral part of modern collision avoidance systems. There are also many ground stations receiving these signals, which is typically where API data comes from.
Hobbyists can also use a USB-dongle-based software-defined radio to receive ADS-B signals from passing aircraft, and that is what the rest of this blog will focus on.
Collecting ADS-B data
To collect the data, we use a popular command line tool called readsb; it uses the USB software radio to tune in to the ADS-B frequencies, capture the messages, and decode them into a JSON document. The setup looks something like this:
You can download and compile it manually, but I have preferred to run it (and Vector) in Docker instead.
The Dockerfile I made for readsb is shown below:
# Stage 1: Build
FROM debian:bookworm AS builder
RUN apt-get update && apt-get install -y \
git \
wget \
build-essential \
libusb-1.0-0-dev \
librtlsdr-dev \
pkg-config \
curl \
ca-certificates \
libncurses-dev zlib1g-dev libzstd-dev debhelper
# Clone readsb source
WORKDIR /src
RUN git clone --depth 20 https://github.com/wiedehopf/readsb.git
RUN wget -O /src/aircraft.csv.gz https://github.com/wiedehopf/tar1090-db/raw/csv/aircraft.csv.gz
WORKDIR /src/readsb
# Build readsb
ENV DEB_BUILD_OPTIONS=noddebs
RUN dpkg-buildpackage -b -Prtlsdr -ui -uc -us
# Stage 2: Runtime
FROM debian:bookworm-slim
# Install runtime dependencies (minimal)
RUN apt-get update && apt-get install -y \
libusb-1.0-0 \
librtlsdr0 \
&& apt-get clean && rm -rf /var/lib/apt/lists/*
# Copy binary from builder
COPY --from=builder /src/readsb*.deb /tmp
COPY --from=builder /src/aircraft.csv.gz /tmp
RUN apt-get update -y && apt-get install -y --no-install-recommends libncurses6 && dpkg -i /tmp/readsb*.deb
# Define default command (can be overridden with `docker run`)
ENTRYPOINT ["/usr/bin/readsb"]
CMD ["--help"]
and to run it, I use docker compose with a services file as follows:
services:
  readsb:
    build: readsb/
    devices:
      - /dev/bus/usb
    volumes:
      - tmpfs:/data
    entrypoint: ["readsb","--device-type","rtlsdr","--write-json","/data","--db-file=/tmp/aircraft.csv.gz","--write-json-every=0.5","--metric","--quiet"]
    restart: unless-stopped

volumes:
  tmpfs:
    driver: local
    driver_opts:
      type: tmpfs
      device: tmpfs
      o: size=512m,uid=0,gid=0
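With both files in place, something like the following brings the capture service up in the background (a sketch of the usual docker compose workflow):

# build the readsb image and start the service detached
docker compose up --build -d readsb

# follow its logs to check the SDR is being read
docker compose logs -f readsb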
There are a few points worth making about how we are running readsb:
- we are giving Docker access to the /dev/bus/usb folder so that readsb can access the USB dongle
- we are creating a tmpfs volume for readsb to write to; later, Vector will read from the same volume
- we've put the Dockerfile shown above in the readsb/ folder so the image can be initially built
We use a tmpfs volume because readsb will continually write to disk, and it will only serve as a staging ground before Vector uploads it somewhere else; there's no point adding unnecessary wear to a good SSD for the sake of an experiment.
If you want to more easily inspect the output from readsb, you could use a bind mount instead of the tmpfs mount; additionally, you can run it interactively with the --interactive flag, which will produce a top-like curses interface.
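For example, something along these lines should work with the compose service above (an untested sketch; it overrides the long-running entrypoint so readsb runs in the foreground with its curses display):

# run readsb interactively, reusing the device mapping from the compose file
docker compose run --rm --entrypoint readsb readsb --device-type rtlsdr --interactive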
Getting Started with Vector
To start working with Vector, we will add another service to our docker compose services file, using a public image for Vector:
services:
  ...
  vector:
    image: timberio/vector:0.48.0-debian
    volumes:
      - tmpfs:/data
      - ./vector-config/vector.yaml:/etc/vector/vector.yaml:ro
    environment:
      - AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID}
      - AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY}
      - AWS_SESSION_TOKEN=${AWS_SESSION_TOKEN}
      - AWS_DEFAULT_REGION=ap-southeast-2
Some things to note:
- the tmpfs volume is mounted so Vector can read the output files from readsb
- we use a bind mount to add our Vector configuration file (see below) into the default location
- we pass through some standard AWS environment variables used for testing AWS S3 uploads
Vector configuration can be in either toml or yaml format - the latter is becoming the default, so you should probably stick with that.
Sources
To start our configuration, we edit the file vector-config/vector.yaml and add:
sources:
  readsb:
    type: file
    include:
      - /data/aircraft.json
    fingerprint:
      strategy: "device_and_inode"
    multiline:
      start_pattern: "^\\{"
      mode: halt_with
      condition_pattern: "^\\}"
      timeout_ms: 500
So there’s a lot going on here - let’s unpack it step by step.
- sources - a top level keyword to declare input sources. Other top level keywords include transforms and sinks. I can add multiple named sources in a single configuration, if I want to.
- readsb - this is a named source. You can call it whatever you like - I am naming it after the program that generates our data.
- type: file - this specifies the component type of the source; in our case, we are using the file source. Once the type of a component has been declared, most properties that follow will then be specific to that component.
An interesting behaviour of readsb is that it simply overwrites the output aircraft.json file - it does not append to the file continuously; nor does it write the JSON document on a single line, it (somewhat) pretty prints it instead:
{ "now" : 1755134708.501,
"messages" : 2238,
"aircraft" : [
{"hex":"7c7d19","type":"adsb_icao","flight":"YZV ","r":"VH-YZV","t":"PA44","alt_baro":400,"alt_geom":600,"gs":89.9,"track":295.71,"geom_rate":-64,"squawk":"3000","emergency":"none","category":"A1","lat":-32.078123,"lon":115.921681,"nic":8,"rc":186,"seen_pos":41.346,"version":2,"nac_v":2,"sil_type":"perhour","alert":0,"spi":0,"mlat":[],"tisb":[],"messages":366,"seen":35.0,"rssi":-33.7}
]
}
By default, the file component expects each line to correspond to an event. When reading JSON, that means we actually expect json lines / ndjson. So, to fix that up, we use the multiline syntax above to help Vector identify what counts as a single event. This can take a bit of trial and error to get right.
The fingerprint.strategy setting allows us to deal with the issue of the file being continuously overwritten in place by readsb. The two options here are checksum and device_and_inode — checksum will track the file contents, whilst device_and_inode will track the replacing of files, typically via rename operations.
The mdash (—) in the previous paragraph was legitimately my own work
Finally, we can advise the file component to monitor a single file, a list of files, or a glob of files by using the include keyword. Globs can be very useful when another process is continuously writing new files according to a pattern, e.g. a datetime-based partitioning scheme.
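As an illustration, a hypothetical source watching a date-partitioned drop directory could use a glob like this (the source name and paths are made up for the example):

sources:
  partitioned_files:
    type: file
    include:
      - /data/ingest/date=*/hour=*/*.json   # picks up new files as partitions appear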
Shipping with Vector
So, now we have set Vector up to read our input files and generate events from them - but it is not actually doing anything with that data. Ultimately, we want it to upload that data to AWS S3 or something similar, but before doing that, we want to transform the data a little bit.
Namely I want to:
- filter out events where there is no flight data
- add and remove some metadata
- unnest and flatten the nested aircraft array element to re-focus the event data on actual aircraft updates
Transforms
As mentioned above, we can use the top level keyword transforms to add any number of transformation steps that will help us refine our events.
Filtering
The first one I am going to add - filtered - will filter out events for which there is no flight data:
transforms:
  filtered:
    type: filter
    inputs:
      - readsb
    condition:
      type: vrl
      source: "length!(parse_json!(.message).aircraft) > 0"
From the excerpt above, we can see:
- the type of the transform is a filter, which aims to conditionally allow only events that match a test that you provide
- we pass the input from our previously defined source, readsb
- the condition will use Vector Remap Language (VRL)
- the condition source (code) contains the VRL script that the filter will evaluate
Vector Remap Language (VRL) is an expression-oriented language designed for transforming observability data (logs and metrics) in a safe and performant manner. It features a simple syntax and a rich set of built-in functions tailored specifically to observability use cases.
You don’t have to use VRL to get value from Vector, but you may get a lot more out of it if you do. It is easy to learn and once you get the basics, you’ll probably get by with the function reference alone.
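A handy way to pick up those basics is the VRL REPL that ships with the Vector binary; with the compose setup above, you could start it with something like:

# open an interactive VRL REPL inside the Vector container
docker compose run --rm vector vrl

# then experiment with expressions at the prompt, for example:
# length!(parse_json!(s'{"aircraft": [{"hex":"7c7d19"}]}').aircraft) > 0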
Translating our VRL source above:
length!(parse_json!(.message).aircraft) > 0
- the event read from file is stringified inside a top level key called message
- we parse the stringified event to recreate it as JSON
- we test the length of the aircraft key - it should be an array of size greater than zero
You may be wondering what the exclamation marks are for. VRL is a fail-safe language, which means it will fail to compile when code contains unhandled errors. Using the exclamation mark acknowledges the error can occur and will cause events to be dropped when it is raised; this is similar to the YOLO non-null assertion operator (also an exclamation mark) in TypeScript. In a non-trivial example, you might want to handle the error and set a default or null value, instead of throwing away the event.
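For instance, in a remap transform (which we cover next) a sketch of handling the parse error might look like this - the field names here are illustrative, not part of the readsb output:

parsed, err = parse_json(.message)
if err != null {
  # record the problem and keep the event instead of dropping it
  .parse_error = err
  .aircraft = []
} else {
  # fall back to an empty list if .aircraft is missing or not an array
  .aircraft = array(parsed.aircraft) ?? []
}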
Flattening
The second transform has the aim of turning each element of the aircraft array into its own event.
  flattened:
    type: remap
    inputs:
      - filtered
    source: |-
      json = parse_json!(.message)
      . = flatten(json.aircraft)
This time, the transform is of type remap - instead of being a conditional, this transform will explicitly modify the event according to the source code you provide.
In the example above, we set the temporary local json variable to the parsed JSON value of the stringified JSON message. Then, we use the flatten function to unpack the aircraft array and assign the result to the event root (.); when the root of a remap transform is assigned an array, each element is emitted as its own event. In essence, it will take a single event such as:
{
  "now" : 1755134708.501,
  "messages" : 2238,
  "aircraft" : [
    {"hex":"7c7d19","type":"adsb_icao","flight":"YZV","r":"VH-YZV"},
    {"hex":"7c6d76", "type": "adsb_icao", "flight":"FD624J", "r":"VH-VWO"},
    {"hex":"7c806f", "type": "adsb_icao", "flight":"QFA33", "r":"VH-ZNL"}
  ]
}
and break it into 3 discrete events as follows:
{"hex":"7c7d19","type":"adsb_icao","flight":"YZV","r":"VH-YZV"}
{"hex":"7c6d76", "type": "adsb_icao", "flight":"FD624J", "r":"VH-VWO"}
{"hex":"7c806f", "type": "adsb_icao", "flight":"QFA33", "r":"VH-ZNL"}
This is a really powerful feature that allows you to take complex messages and refine them into separate event streams that you can do with as you wish.
Sinks
At this point, we have sourced our data, we have filtered and transformed it - now we should do something useful with it.
When I start working on Vector pipelines, I typically use a file sink as the output, as it is easy to inspect and debug, and doesn’t cost you anything.
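A debugging sink of that kind might look something like the following (the sink name and path are arbitrary; flattened refers to the transform defined above):

sinks:
  debug_file:
    type: file
    inputs:
      - flattened
    path: /data/debug-%Y-%m-%d.log   # lands on the shared tmpfs volume for easy inspection
    encoding:
      codec: json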
More commonly though, you want to ship this data somewhere other than the location where Vector is running. In this example, I am going to demonstrate using the AWS S3 Sink.
S3 Sink
Adding a sink is very simple - like sources and transforms, sinks is a top level keyword where you can gather your named sinks:
sinks:
  my_s3:
    type: aws_s3
    compression: gzip
    encoding:
      codec: json
    inputs:
      - flattened
    bucket: my-demo-bucket
    batch:
      timeout_secs: 300
    key_prefix: "flights/date=%F/"
In this example, I am configuring:
- the type of the sink as aws_s3
- file objects uploaded to AWS S3 should be gzip compressed
- Vector events should be written as json (could be avro, csv, or something else)
- input to this sink is the flattened transform we described earlier
- the destination bucket is called my-demo-bucket
- batch settings - if we haven't sent a file to S3 for over 300 seconds, send any buffered events in a new file
- put the uploaded files in an S3 prefix such as flights/date=2025-08-15/, using strftime specifiers
I don't put AWS credentials in the configuration, because the sink will use the default AWS credential provider, allowing you to fall back on your normal credential supply mechanisms.
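Before letting the pipeline run for real, it is worth sanity-checking the assembled configuration; with the compose setup from earlier, something like this should do it:

# validate the mounted configuration without starting the pipeline
docker compose run --rm vector validate /etc/vector/vector.yaml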
Conclusion
Hopefully you will find Vector very useful in your own scenarios, though this blog really only skims the surface of what it is capable of. Check out the documentation to find out more!
Working with Vector produces few code artifacts, whilst still remaining configurable, reliable and generally self-documenting.
Why not try it out?