Creating a time series sample CSV file with a bash script

Demonstrating a pragmatic way for generating time series sample data in CSV files with a bash script. Useful for Proof-of-Concepts, unit and performance testing and many other scenarios.

Time series are a widespread data structure for many use cases in different domains, e.g. stock or commodity prices, sensor data from IoT devices, measured values and so on. Getting required sample data for those values might be tricky in some situations. So let’s explore an option on how to solve that.

Basic time series structure

In general, a time series is simply a set of chronological values for a specified key. The key identifies the entity or process the given data is related to, e.g. the ID of a sensor, a share or a process step. Mostly, every time series has a fixed time granularity, meaning values are present for every minute, hour, week etc.

So a minimal time series data set usually consists of three fields: a key, a timestamp and a value. E.g. for a hourly granularity:

keytimestampvalue
sensor-12021-10-07 08:00:00110.07
sensor-12021-10-07 09:00:00112.10
sensor-12021-10-07 10:00:00111.37
sensor-12021-10-07 11:00:00109.98

The challenge of providing time series data

Providing time series data as a needed input will be a challenge in most non-production environments and situations where you don’t have any real world data available – for example in isolated environments, during a Proof-of-Concept or others.

Typically, when dealing with time series data we are talking about mass data where hundreds of thousands or millions of data sets are needed to create nearly realistic situations.

Luckily, in most scenarios it is absolutely sufficient if you would have synthetic sample data in a CSV file at your hand. To generate such files, a simple bash script can be used. So let’s get to it.

Bash-script for creating time-series data

To generate the needed time series sample data you could perfectly use any spreadsheet office app and export to CSV. But there are some things to keep in mind:

  • Typically, no office app is directly available on the target machine/server, so extra effort for connecting, copying files etc. will kick in.
  • Since we are talking about mass data, most spreadsheet office apps will come to their limits quite fast (max. number of lines…).

Taking all this into account, a more efficient way would be to create the sample data for your time series directly on the target machine. Before getting into the details, here’s the script that can serve you as a good basis to achieve this.

#!/bin/bash

declare -a timeseries=( "TS1|50|100" "TS2|100|120" "TS3|5|15" "TS4|10000|13000" )
amount=$((10*365*24))
start="06:00:00 2017-01-01"

create_ts_values () {
  i=0
  while [[ $i -lt $amount ]]
  do
    echo -ne "$1... ${i}"\\r
    echo "$1;$(date -d"$start +${i} hours" +"%Y-%m-%d %H:%M:%S");$(($2 + $RANDOM % ($3-$2)))" >> ts.csv
    i=$((i+1))
  done
}

rm -f ts.csv

echo "Generating timeseries values..."
for val in ${timeseries[@]}; do
  ts=$(echo $val | cut -d '|' -f 1) 
  from=$(echo $val | cut -d '|' -f 2)
  to=$(echo $val | cut -d '|' -f 3)
  create_ts_values $ts $from $to
  echo "$ts... $amount"
done
echo "Done."

Feel free to copy the script code or download it from here.

Now, what does this script do? It will create a file ts.csv containing random hourly time series values for ten years (= 10 * 365 * 24 values) starting from 01-01-2017 06:00:00 for four series with the following ranges:

  • Key: TS1, value range: 50..100
  • Key: TS2, value range: 100..120
  • Key: TS3, value range: 5..15
  • Key: TS4, value range: 10.000..13.000

In total, the resulting CSV file contains 350.400 entries.

$ head ts.csv 
TS1;2017-01-01 06:00:00;65
TS1;2017-01-01 07:00:00;62
TS1;2017-01-01 08:00:00;88
TS1;2017-01-01 09:00:00;83
TS1;2017-01-01 10:00:00;69
TS1;2017-01-01 11:00:00;50
TS1;2017-01-01 12:00:00;69
TS1;2017-01-01 13:00:00;79
TS1;2017-01-01 14:00:00;99
TS1;2017-01-01 15:00:00;85

The resulting file is a perfect starting point for importing into popular databases, e.g.:

Understanding and customizing the script

Time series keys and ranges

The time series to generate the sample values for are specified in the array timeseries.

declare -a timeseries=( "TS1|50|100" "TS2|100|120" "TS3|5|15" "TS4|10000|13000" )

For each series a tuple of key, range-from and range-to delimited by pipes (‘|’) is present. So "TS1|50|100" means “create a series with key “TS1″ and random values between 50 and 100”.

To change/add/remove time series, simply edit the array definition.

Changing time granularity and/or generated fields

The given script generates sample data on an hourly basis. To change this or if you have to add/change fields of the resulting CSV, edit the following line that produces the output.

echo "\"$1\";$(date -d"$start +${i} hours" +"%Y-%m-%d %H:%M:%S");$(($2 + $RANDOM % ($3-$2)))" >> ts.csv

For example, to switch to daily values this line would have to be changed to this.

echo "\"$1\";$(date -d"$start +${i} days" +"%Y-%m-%d %H:%M:%S");$(($2 + $RANDOM % ($3-$2)))" >> ts.csv

For manipulating the date and time, the linux command date is used. To learn about all the features please reach out to man date and also info date for all possible string formats.

Happy coding 🙂

Useful links