Flexible backend system for processing sensor data

In the Meetjestad project, we have a basic environmental measurement platform going, with just over a hundred nodes deployed in the field, using TTN for our data transfer. We have basic data storage in place now, but we’re running into problems in terms of scalability and flexibility. We have some ideas on how to fix this (details below), but before starting to reinvent our own set of wheels, we’re wondering: Is anyone else running into these same problems? How are others solving their data processing and storage needs? Are there existing platforms that would handle (part of) our needs? Would a solution like we describe be interesting to others too?

Current situation

We have a number of nodes running in the field, measuring temperature and humidity. These send their data to TTN, which is read by a custom python script that decodes the packets and writes them to our database. From there, we have a number of PHP scripts that make the data available online.
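
For context, such a forwarder can be quite small. The sketch below is not our actual script: it assumes the paho-mqtt client, the pre-V3 TTN MQTT broker, the base64 payload_raw / dev_id fields of the TTN uplink JSON, and a made-up fixed packet format.

    import base64, json, struct
    import paho.mqtt.client as mqtt

    APP_ID = "my-ttn-app"            # placeholder TTN application ID
    ACCESS_KEY = "ttn-account-..."   # placeholder access key

    def on_message(client, userdata, msg):
        uplink = json.loads(msg.payload)
        raw = base64.b64decode(uplink["payload_raw"])
        # Made-up packet format: int16 temperature*100, uint8 humidity*2
        temperature, humidity = struct.unpack(">hB", raw)
        print(uplink["dev_id"], temperature / 100.0, humidity / 2.0)
        # ... decode further and write to the database here ...

    client = mqtt.Client()
    client.username_pw_set(APP_ID, ACCESS_KEY)
    client.on_message = on_message
    client.connect("eu.thethings.network", 1883, 60)
    client.subscribe("+/devices/+/up")
    client.loop_forever()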

Problems

  • When we experiment with different boards, sensors, etc., we need to change the packet format. Currently, this also means changing the python forwarder script, creating a database table, and updating the data export scripts. This is problematic, because:
  • This requires administrator access to the server, which means involving one of a few people within our organisation that have access.
  • This requires a fairly advanced and broad skillset.
  • Other communities are interested in reusing our designs and systems. The simplest option is to just copy our designs, hardware and server backend, but that requires quite some investment on their part. Alternatively, they can use our systems as-is, but then they have no flexibility in customizing the data collected.
  • Ideally, they could get started with our hardware and systems right away, doing all the needed customization through well-defined interfaces (either in the firmware on the node itself, or through some web interface).

Intended solution

  • Build a generic data collection system, where (in short) a node sends a metadata packet on startup that describes the data that will be sent in subsequent packets.
  • This metadata packet defines a number of channels/variables that will be sent, along with metadata about these variables (measured quantity, sensor type, etc.); see the sketch after this list.
  • On the server side, all data is stored and can be flexibly grouped, downloaded, processed, etc.
  • Additionally, the owner / maintainer of a node should be able to adapt metadata about the node (e.g. what enclosure is it in? Is it in a shady or sunny spot? If it is on a wall, what direction does it face?)
  • Additionally, data should be stored in a flexible way that allows processing of data, and reprocessing older data (e.g. with updated ideas of what constitutes invalid data)
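
To make the metadata-packet idea concrete, here is a rough sketch. Everything in it is invented for illustration: the quantity/sensor identifiers, the byte layout, and the choice to reuse struct format characters as the per-channel data type.

    import struct

    # Invented identifiers, for illustration only
    QUANTITY_TEMPERATURE = 1
    QUANTITY_HUMIDITY = 2
    SENSOR_SI7021 = 10

    # One entry per channel: (measured quantity, sensor type, struct format char)
    channels = [
        (QUANTITY_TEMPERATURE, SENSOR_SI7021, "h"),  # signed 16-bit, e.g. 0.01 degC
        (QUANTITY_HUMIDITY,    SENSOR_SI7021, "B"),  # unsigned 8-bit, e.g. 0.5 %RH
    ]

    def encode_metadata(channels):
        # Metadata packet sent once on startup: channel count, then one
        # (quantity, sensor, format) descriptor per channel.
        out = bytearray([len(channels)])
        for quantity, sensor, fmt in channels:
            out += struct.pack(">BBc", quantity, sensor, fmt.encode())
        return bytes(out)

    def decode_data(channels, payload):
        # Data packets contain only the packed values, in the declared order.
        fmt = ">" + "".join(f for _, _, f in channels)
        quantities = [q for q, _, _ in channels]
        return dict(zip(quantities, struct.unpack(fmt, payload)))

    data_packet = struct.pack(">hB", 2137, 108)   # 21.37 degC, 54 %RH
    print(decode_data(channels, data_packet))     # {1: 2137, 2: 108}

The point is that the server never needs code changes for a new packet layout: it learns the layout from the metadata packet, and the data packets themselves stay as compact as a hand-rolled format.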

From my experience with different meteorological stations, when we transferred data or saw the packets being transferred in other instances, there was a standard length regardless of what was attached measuring different variables. This at least allowed us to check the length to determine the station type, but it obviously assumes that you aren’t measuring more than a certain number of things.

In terms of more homemade sensors, we basically had to have a way, in the code on the station/node, to compartmentalize the data into a format we were already using, or we would have to add to the server side a way to understand the new station. That was less than ideal, but in some cases necessary when we introduced a new node type.

As for the metadata idea: the location and type of data sent were almost always handled on our end, because we could get the GPS location from our stations’ data, but any other details about siting could change from one growing season to another.

So after all of that I’m not sure anything I said is of any help but maybe something will have been beneficial.

If I understand your case correctly, I think you are searching for an open source solution to convert, calibrate and validate your data?
Maybe ConCaVa fits your needs?

I would look at using something along the lines of Cayenne Low Power Payload. The format allows sending some data with different intervals so packets do not have to be the same size on every transmission. It does not depend on sending a meta data packet that might be lost.
I would not copy the format verbatim as there are some data formats ‘missing’. (Particle counts and a generic 16 bit integer value for instance)
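
For reference, the LPP idea boils down to one channel byte, one type byte and a value per field, so a packet only carries the channels that have something to report. A minimal sketch, handling only temperature and humidity; the type codes below follow the commonly published Cayenne LPP documentation, but verify them against the spec before relying on this:

    import struct

    LPP_TEMPERATURE = 0x67   # 2 bytes, signed, 0.1 degC resolution
    LPP_HUMIDITY    = 0x68   # 1 byte, unsigned, 0.5 %RH resolution

    def encode(readings):
        """readings: list of (channel, type_code, value) tuples."""
        buf = bytearray()
        for channel, type_code, value in readings:
            buf += bytes([channel, type_code])
            if type_code == LPP_TEMPERATURE:
                buf += struct.pack(">h", int(round(value * 10)))
            elif type_code == LPP_HUMIDITY:
                buf += struct.pack(">B", int(round(value * 2)))
            else:
                raise ValueError("type code not handled in this sketch")
        return bytes(buf)

    # Two sensors on channels 1 and 2; other transmissions can include more
    # or fewer channels, so packet sizes vary.
    print(encode([(1, LPP_TEMPERATURE, 21.3), (2, LPP_HUMIDITY, 54.0)]).hex())
    # -> 016700d502686c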

For the server side, consider storing values along the same lines. So for each (extracted) measurement insert a record with time stamp, ‘channel’, data type and value (blob) or use different tables for the different formats to allow defining the value type to be set to the correct type.
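
Stored row by row, that could look something like the following sketch (SQLite here just to keep it self-contained; the table and column names are made up):

    import sqlite3, time

    db = sqlite3.connect("measurements.db")
    db.execute("""
        CREATE TABLE IF NOT EXISTS measurements (
            node_id   TEXT,
            ts        INTEGER,   -- unix timestamp
            channel   INTEGER,
            data_type TEXT,      -- e.g. 'temperature', 'humidity'
            value     BLOB       -- raw value; or use typed tables/columns instead
        )
    """)

    def store(node_id, channel, data_type, value):
        db.execute("INSERT INTO measurements VALUES (?, ?, ?, ?, ?)",
                   (node_id, int(time.time()), channel, data_type, value))
        db.commit()

    store("node-42", 1, "temperature", 21.3)
    store("node-42", 2, "humidity", 54.0)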

Am I correct in understanding your database schema is tailored to your sensor data?

On your site I see:

So you have one table that stores everything?

If so, you have to normalize your database, so it can accept any “ping”
and you do not have to change anything in your backend when you add new sensor types

something like:
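
For example, a layout along these lines (sketched with SQLite; this is not the actual schema, just the general shape): new sensor types become new rows in the sensors table instead of new tables or code changes.

    import sqlite3

    db = sqlite3.connect("pings.db")
    db.executescript("""
        CREATE TABLE IF NOT EXISTS devices (
            device_id INTEGER PRIMARY KEY,
            dev_eui   TEXT UNIQUE
        );
        CREATE TABLE IF NOT EXISTS sensors (
            sensor_id INTEGER PRIMARY KEY,
            name      TEXT UNIQUE      -- e.g. 'temperature', 'rssi', 'battery'
        );
        CREATE TABLE IF NOT EXISTS pings (
            ping_id   INTEGER PRIMARY KEY,
            device_id INTEGER REFERENCES devices(device_id),
            received  TEXT             -- ISO 8601 timestamp
        );
        CREATE TABLE IF NOT EXISTS sensorvalues (
            ping_id   INTEGER REFERENCES pings(ping_id),
            sensor_id INTEGER REFERENCES sensors(sensor_id),
            value     TEXT
        );
    """)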

I have PHP/MySQL code for this available (MIT license) which even creates the whole database schema on the first call with an HTTP integration… if you want to take a look contact me with a personal message.

That looks interesting, thanks! At first glance, it still looks like it needs server-side configuration for different packet formats, but if this is easy enough, that might be ok. I’m a bit worried that this project hasn’t seen any activity for the last 1.5 years, though.

We’ve been discussing this, and while it looks promising, the format does not seem verbose enough in terms of metadata for what we need. At the same time, a separate metadata packet before the data packets, like we envision, allows even more compact data packets than LPP has.

Yeah, that’s also what we had in mind.

Currently yes, but that was just a minimal “let’s get going” approach.

That also looks interesting. AFAICS this still needs configuring sensors inside the database, but at least that doesn’t require modifying the processing code.

What encoding are you using for your data packets? Do these explicitly refer to the _sensorid or something like that?

Hi @matthijskooijman, I have worked in “Digital Oilfield” for more than 30 years and have seen it passing through all the pain points.

A typical offshore oil and gas facility in the North Sea (there are 400+ offshore facilities) collects about 10,000 data points with timing ranging from 1 minute to 15 minutes. Many are simple, such as temperature, pressure, flow, etc. but many are complex, such as Fast Fourier Transforms from vibration-sensing equipment.

The data is kept very long term, many petabytes. The most common platform for storing the data is the OSIsoft PI (Process Information) system. All data points belong to a “tag”, which is a scalable naming/numbering scheme. OSIsoft PI typically runs on an Oracle cluster.

Each PI system is typically in 3 parts; “raw tags”, “cleansed tags” and “computed tags”. The raw tag data is exactly that, raw data as it arrived. Cleansed tag data has been processed to remove anything marked as invalid. Computed tag data is secondary calculations based on the cleansed tags. The raw data is typically collected into offshore databases and then replicated to the big onshore PI systems.
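
As a toy illustration of that three-tier idea (nothing here is the actual PI system; the tag names, the validity rule and the computed tag are all invented):

    # Raw tags: data exactly as it arrived, including bad samples
    raw = {
        "41-TT-1001": [(0, 21.3), (60, -999.0), (120, 21.5)],  # (seconds, degC)
        "41-PT-1001": [(0, 12.1), (60, 12.2), (120, 12.4)],    # bar
    }

    def cleanse(series):
        # Drop samples marked invalid; the -999.0 sentinel is an assumption
        return [(t, v) for t, v in series if v != -999.0]

    cleansed = {tag: cleanse(series) for tag, series in raw.items()}

    def cumulative_mean(series):
        # A computed tag: a secondary calculation based on a cleansed tag
        return [(t, sum(v for _, v in series[:i + 1]) / (i + 1))
                for i, (t, _) in enumerate(series)]

    computed = {"41-TT-1001.MEAN": cumulative_mean(cleansed["41-TT-1001"])}
    print(computed)   # {'41-TT-1001.MEAN': [(0, 21.3), (120, 21.4)]}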

The big advances that have been made have come from standardisation where the standards came from external bodies such as Lloyds Register, DNV, SFI, POSC/Energistics, etc. and not from commercial companies.

The big advances have been:
  • The SFI standards for all aspects of ship/rig information. See https://en.wikipedia.org/wiki/SFI_Coding_and_Classification_System
  • The POSC/Energistics standards for data formats, etc. See https://en.wikipedia.org/wiki/Energistics
  • Large-scale industrial buyers of services mandating in procurement documents that the standards are used. See https://www.logic-oil.com/standard-contracts
  • Investment into applications and marketplaces based on the availability of standardised data. See http://www.intelligentplant.com/ as just one example.

The real power lies with the organisations that classify offshore facilities for insurance purposes. They mandate use of the standards and follow up with an inspection and certification regime. The automation and data systems are just a small part of this. No certificate = no insurance… which is a bit of a problem if you’ve invested a couple of $bn building an offshore oilfield.

This has all taken >25 years in offshore Oil & Gas and is moving much more quickly in offshore wind farms. I predict that this will all happen in IoT - at least industrial use of IoT - but not yet.


No, it is not MQTT, which requires authentication per device.

All you need is one HTTP Integration WebHook (and one TTN Console configuration per application)
The backend doesn’t care how many devices there are, it just registers every “Ping” (anything that pings; it is just a WebHook)

So it stores any sensors/sensorvalues. It does not do Users/Applications
(but come to think of it; a simple addition would be to add an email address as QueryString variable in the HTTP Integration URI)
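
The core of such a receiver is small; roughly along these lines (a sketch using Flask, not the actual ITpings code; the endpoint name and stored fields are assumptions):

    import json, sqlite3, time
    from flask import Flask, request

    app = Flask(__name__)
    db = sqlite3.connect("pings.db", check_same_thread=False)
    db.execute("CREATE TABLE IF NOT EXISTS pings (received INTEGER, body TEXT)")

    @app.route("/ttn", methods=["POST"])
    def ttn_webhook():
        # Store whatever arrives; decoding/normalising can happen later
        db.execute("INSERT INTO pings VALUES (?, ?)",
                   (int(time.time()), json.dumps(request.get_json(force=True))))
        db.commit()
        return "", 204

    if __name__ == "__main__":
        app.run(port=8080)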

I built it to be a personal Dashboard for any of my applications/devices

Working on V2 now, to be published on GitHub (and then V3 to be in line with TTN’s V3)
Never intended it to become a consumer dashboard like Cayenne.
That requires some adaptation to optimize the back-end

It is a simple dashboard for an Application Developer,
and thus only runs in Chrome for all the CSS4 and ES6 Custom Elements goodies,
so the front-end HTML for a dashboard similar to the screenshot above is:

    <itpings-chart class="top_left" sensor="snr" title="SNR" interval="6H"></itpings-chart>
    <itpings-table class="top_center" query="PingedDevices" title="Pinged Devices"></itpings-table>
    <itpings-chart class="top_right" sensor="Temperature" title="Temperatures" interval="6H"></itpings-chart>

    <!--row 2-->
    <itpings-table class="middle_center" query="PingedGateways" title="Pinged Gateways"></itpings-table>
    <itpings-chart class="middle_right" sensor="Luminosity" title="Light" interval="6H"></itpings-chart>

    <!--row 3-->
    <itpings-table class="bottom_left" query="Events" title="ITpings Events" nocache limit="200"></itpings-table>
    <itpings-chart class="bottom_center_right" sensor="Battery" title="Battery" interval="6H"></itpings-chart>
    <itpings-chart class="bottom_right" sensor="rssi" title="RSSI" interval="6H"></itpings-chart>

Could you not use something like protobuf if you need full control of your payload format?

Protobuf is certainly something we’re considering, but that doesn’t really solve the problem, as you’ll still need to have (out-of-band) agreement on both sides about what protobuf format you will be using (AFAIK you can encode a protobuf descriptor using protobuf and then send that in advance, but then you’re essentially doing what I propose).