The very lovely @cslorabox isn't a fan of that, not one bit, and I fully understand why.
But the higher-end RAK gateways do this once set up and tested. They do it by holding the unsubmitted messages in order and passing them on once connectivity is restored, so it's very useful for GSM/LTE connections. However, I believe there are issues with timestamps that I haven't got a handle on; see: https://forum.rakwireless.com/t/rak7258-lte-automatic-data-recovery/2193/10
However, due to the security design, particularly the frame counters, this can go wrong, so it should be a nice-to-have in your design and not something that absolutely must work.
RAK's challenge isn't so much an implementation one as the classic one of diving into writing code without a strategy that solves the actual problem, and then, when users unsurprisingly have issues, trying to fix a systemic problem in the code of one component.
Upon reconnection, if they start acting as a gateway immediately, the archive gets submitted out of order and a LoRaWAN server enforcing frame counters rejects it.
Conversely, if they drain the archive in order before they start being a gateway again, the duration of the outage of real-time data (during which ADR keeps falling back due to loss) is magnified. And should any other gateway get a packet in, it's all for naught.
I think if you really want to do this, you're better off taking any gateway where you control the code and writing your own backup scheme which uses the storage media in a way you feel is safe. Then get your node keys and do your own offline decoding of historical data outside of TTN, and write those packets to your historic log of application data with some "recovered" flag on them.
That still doesn't mean it is a good solution. It will work for gateways out in the boonies with infrequently transmitting nodes and shorter connection interruptions.
In my opinion this tries to solve a fundamental issue of LoRaWAN: the assumption that the gateway is always connected to the backend. Solving this requires a rewrite of (parts of) the specification, not hacking the packet forwarder and hoping to get away with it. (Btw, it seems Kerlink's CPF buffers packets as well, so RAK is in good company.)
I fully appreciate that a metropolitan gateway is going to be hammered if it tries to store & forward, but if it's in that sort of area, there should be overlapping coverage from other gateways.
As you say Jac, it's for the boonies, when the mobile network goes offline because it's raining hard or it's foggy (something I've actually lived with).
Meh… the real market for RAK (and Dragino) boxes running stock firmware is people who want to just plug something in, click through some GUI menus, and have it work.
For a user with reliability concerns, they're more a demonstration that an MT76x8 chip (or the competing AR9331 that Dragino uses) can be a decent, inexpensive gateway platform, potentially robustly booting from NOR flash.
You then have the choice of putting custom software on the offered boxes (via changes captured in the overlay filesystem, or by rebuilding OpenWrt from source) and/or making a custom hardware platform that adds in key things that were left out, such as a USB hub to allow using the sole host port for more than just the LTE modem.
NAND to Tetris has nothing on you! When you need a chip, do you go to the beach for the raw materials?
I suspect the real problem here is that the memo about the fail-fast, product-to-market philosophy that companies now use hasn't reached the users, who aren't aware that features are included that may not work until we have tested them and moved them out of beta (if we are lucky; sometimes we move them out of alpha). It's not unusual for me to include a button I think should be on a device to see if anyone needs it, and then wait for them to press it and tell me that it didn't work. Most of the buttons don't get pressed.
No, I buy the same parts but shuffle them around until they are connected in the right order… for example, putting in the missing USB hub. I didn't even bother routing the SoC to the DDR; that's a submodule, as of course are the concentrator and the LTE.
It's not about doing everything yourself, it's about re-doing the things that need fixing.
That I can agree with, but iterating by buying successive generations of boxes can be costly. And frequent change in offerings is bad for fleet deployment, too.
Or won't work at all… often the big problem is software written by people who didn't start from a clear vision of what it needed to do in the overall system context. That, and features invented by a marketing department similarly lacking contextual awareness.
There's a big difference between room for future expansion where the user/customer/client/licensee has the engineering materials to run with these ideas and turn them into something workable, vs. where they'd have to tear substantial parts up and rebuild them from scratch to move forward.
I think if someone wants packet backup on a gateway, they need to validate how it's going to fit in architecturally and then write their own software to do it. I personally prefer to have that as an entirely separate subsystem, outside the packet forwarder and the live backhaul and decoding path.
Given the price of memory / FRAM / flash / grains of sand, I'd cache data on the device with my own overarching mechanism to tell it that it's OK to purge. Or, more likely, as I'd anticipate having weeks' worth of data stored, just FIFO it and have a re-send range mechanism: if there is a gap, send a message with the range of data to resend.
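To make the FIFO-plus-resend-range idea concrete, here is a minimal sketch (names and capacity are my own invention, not from any gateway firmware): locally assigned sequence numbers let the backend spot a gap and ask for exactly that range back.

```python
from collections import deque

class PacketCache:
    """FIFO cache of uplink records; oldest evicted first.
    Sequence numbers are assigned locally so the backend can
    request a resend range when it spots a gap."""

    def __init__(self, capacity=100_000):
        self.buf = deque(maxlen=capacity)  # oldest dropped automatically
        self.seq = 0

    def store(self, payload):
        """Record one raw uplink; returns its local sequence number."""
        self.seq += 1
        self.buf.append((self.seq, payload))
        return self.seq

    def resend_range(self, first, last):
        """Return cached records with first <= seq <= last,
        i.e. the gap the backend reported."""
        return [(s, p) for s, p in self.buf if first <= s <= last]
```

The point of the local sequence number is that it is independent of LoRaWAN frame counters, so the backend's gap detection doesn't interact with the network server's replay protection at all.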
@descartes Hi, the timestamp in the RAK store-and-forward backup is a time delta in microseconds between the last packet sent and the current one. So you have to do a bit of work on your backend to re-hydrate the correct timestamps after a backhaul outage.
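Assuming the deltas really do chain from one known-good timestamp (which is my reading of the claim, not something RAK documents), the backend re-hydration is just a running sum:

```python
def rehydrate(records, last_known_epoch_us):
    """records: list of (delta_us, payload), where delta_us is the
    microsecond gap since the previous packet. Returns a list of
    (absolute_epoch_us, payload), anchored on the last packet whose
    wall-clock time we knew before the outage."""
    out = []
    t = last_known_epoch_us
    for delta_us, payload in records:
        t += delta_us
        out.append((t, payload))
    return out
```

Note that any single lost or corrupted record shifts every later timestamp, which is exactly why a chained-delta scheme is fragile.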
The issue for many of us who use cellular backhaul by default is to get around the following. But that brings its own issues.
// Corp
IT: Gateway can't be on the corp LAN/WAN, must be hidden (often in a ceiling void)
Cyber security: Gateway can never be on the LAN/WAN, or only after 3 months of security evaluation
// Medium sized enterprise:
You will have to talk to Dave, he's real busy, always got a lot on…
// Small biz
Oh, my niece does the IT after school on Thursdays
However, even when you have a wired backhaul, I've encountered the following issues…
Gateway stolen
Power turned off at weekends
Hairy arsed electricians pulling out power cables to gateways as it was needed for something else
Had a client that would call to say the server was down around 5pm each day; turned out the cleaner was removing the note about not unplugging it…
So for all those situations where the gateway goes AWOL for whatever reason, my scheme of keeping data points on the device (subject to power & cost issues) seems like a plan.
I'd say it's not worth the trouble of the duplicated administration required for decryption. And above all: LoRaWAN being radio in license-free spectrum, maybe one should not rely on the data to start with?
Probably not. I suspect this is a confusion based on the free-running microsecond counter of the gateway concentrator chip, which is what the server uses to time downlink replies. It does not measure the time "since" a previous packet, as that would break in the case of backhaul packet loss; it is merely a local counter stamp. And it's the same thing a gateway normally sends when backhaul is live.
It does indeed take some doing to convert it into an actual time: you have to have a sample of both at a point in the same run of the same packet forwarder program, and enough sense of the time to know how many times it has rolled over. And if the concentrator / packet forwarder has been restarted, that breaks the meaning of the counter unless you have a record of the value right before the restart. Uplink frame counts may make as much sense.
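As a sketch of what that conversion involves (assuming a 32-bit microsecond counter, which is how the Semtech packet forwarder's `tmst` field behaves, and assuming you somehow know the rollover count since your reference sample):

```python
ROLLOVER = 1 << 32  # the concentrator counter wraps every ~71.6 minutes

def counter_to_epoch_us(tmst, ref_tmst, ref_epoch_us, rollovers=0):
    """Convert a concentrator counter value to wall-clock microseconds,
    given one reference pair (ref_tmst, ref_epoch_us) captured during
    the same run of the packet forwarder, plus the number of full
    counter wraps that have happened since that reference."""
    delta = (tmst - ref_tmst) % ROLLOVER  # modulo handles one wrap
    return ref_epoch_us + rollovers * ROLLOVER + delta
```

The fragility the post describes is visible in the signature: lose the reference pair (packet forwarder restart) or miscount `rollovers`, and every converted time is silently wrong.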
But this is also why someone who really wants a packet backup on the gateway should probably write their own. Including an RTC time is quite simple (and yes, if you're going to this trouble you want a battery-backed RTC, not a situation where you only have time after you can connect to an NTP server).
If you're really stuck, put an RTC in a node and have it announce the time (it doesn't have to be right, just monotonic) and use that as a time standard to measure other nodes' packets by.
Hi, yes, I've been using frame counts and appreciate the counter rollovers, but the free-running counter has been consistent enough in terms of delta T to give approximates, which is all I'm after. In my case with the RAK it was part of an evaluation: they make the claim but don't provide any details on how you would use the feature in a practical way. So one might say it's a rather nebulous feature, but it does show gateway manufacturers are considering the issue.
I've used Multitech AEP gateways with onboard LNS & app server, where we can store & forward before posting to, say, AWS IoT.
@cslorabox already explained why queueing is going to be a nightmare, and won't work with the idea that the TTN public network can have multiple gateways that receive the uplinks, today or in the future. I just wanted to add that anyone trying to backup/decrypt on the gateway itself, rather than queueing/forwarding, will find that such is a nightmare too. (And, above all: one should expect missing packets anyway.)
Hi everyone, I'm completely with @arjanvanb on this one. If higher resilience to outages is required, then LoRaWAN needs to deploy using the standard distributed control systems architecture of the last 25 years:
Full system stack at the edge; for LoRaWAN this would be a small cluster of gateways at the edge combined with dual micro LNS with applications and failover.
Continuous asynchronous replication of data from the edge to the centre.
Intelligence in the centre to handle fresh data differently from stale data.
Intelligence in the centre to replicate identity data out to the edge.
A few lost packets because it's raining hard is one thing, but having a framework that can store on the node and re-send after someone has chopped through a comms cable and taken out a network for 36+ hours isn't such a bad thing, if it doesn't cost much and is simple to manage.
If you are collecting data for a study, larger gaps in data can be irritating at best and ruin the data set at worst. But it's not time critical, so this would be a useful example, although in some situations you could just go and get an SD card from the node.
If you are running something system-critical, then that's a totally different set of circumstances where it would be prudent to implement one or more additional channels to complement LoRaWAN, transports such as GSM/LTE/4G, Sigfox, NB-IoT or even Iridium. In that respect I'd use LoRaWAN for the monitoring data as notionally cost-free transport, but keep command & control on other channels, while still leaving room for a downlink if the other channel(s) manage to go offline.
Backing up and decrypting are two completely different things.
It makes no sense to decrypt on the gateway; it's both useless and severely compromises key management. (And even if the gateway had only the network session key, it still couldn't autonomously reply to a node to maintain an ADR path, since it has no way of knowing if another gateway still in contact with the network server has been asked to.)
However, if one wants to do a backup on the gateway, then it's probably necessary to feed the packets to a centralized decryptor parallel to TTN's server, because such stale packets are not LoRaWAN-compliant.
Building such is not all that complicated. However, the debug tool that got built eventually became its own network server.
Very true. But assuming we're still talking about doing this on the gateway: backing up and forwarding are two completely different things too.
I very much agree.
That's the same waste of time, I feel.
Only if one has the (OTAA) session keys that were applicable at the time the packets were received, and if one either (intelligently) brute-forces the 16 MSBs of the frame counters or keeps track of the frame counters too. Like I wrote, that's not as easy as one may think, I feel.
My impression is that you can get the current ones via an API from TTN, though I could be wrong. When I did it, while trying to figure out why loraserver was losing session keys, I grabbed them from the join accepts, since I was able to intercept the backhaul traffic to all gateways, something obviously not the case on TTN.
And I think(?) you get an application-level message for a join accept when the keys change, so it's not like you have to randomly poll for changes.
In a well-functioning network nodes should almost never join anew.
It's about what a network server normally does; there aren't that many realistic possibilities: {MSB=MSB, MSB=MSB-1, MSB=MSB+1, MSB=0}. Like I said, my debug monitor eventually became its own network server, since in the end that was the shortest path to the functionality we needed and let us stop fighting the "features" that only caused endless trouble.
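The candidate-set trick above can be sketched like this. The function and the `mic_ok` callback are my own illustrative names; in a real decryptor `mic_ok` would recompute the packet's MIC via AES-CMAC with the network session key and the candidate 32-bit counter (omitted here):

```python
def expand_fcnt(fcnt_lsb, last_fcnt32, mic_ok):
    """Recover the 32-bit frame counter from the 16 bits carried on air.
    Try the realistic candidates for the upper half: same as the last
    counter seen, one below, one above, or zero (node started over).
    mic_ok(fcnt32) must verify the packet's MIC with that candidate."""
    msb = last_fcnt32 >> 16
    for cand_msb in (msb, msb - 1, msb + 1, 0):
        if cand_msb < 0:
            continue  # can't go below zero
        cand = (cand_msb << 16) | fcnt_lsb
        if mic_ok(cand):
            return cand
    return None  # no candidate verified: wrong keys or a counter reset
```

This is essentially the bookkeeping a network server already does for every uplink, which is why building an out-of-band decryptor tends to grow into one.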