Discussion: Recover data after network outage

Could this be resolved between the node and the application without changing the gateway?

Perhaps have a 256-message buffer on the node and transmit an unsigned byte sequence number in each message.

The application can detect any gap and send a downlink to ask for the missing uplinks.

Please send uplinks 253 to 005. This will trigger the node to add a second transmission in each cycle. So it sends 007 & 253, 008 & 254, … 014 & 004, 015 & 005.

The application is then responsible for re-assembling the missing data in the correct order.
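
As a rough illustration of that scheme, a node-side sketch in C (the fixed payload size and all names here are assumptions, nothing TTN-specific):

```c
#include <stdint.h>
#include <string.h>

#define BUF_SLOTS   256   /* one slot per 8-bit sequence number */
#define PAYLOAD_LEN 12    /* assumed fixed payload size */

static uint8_t history[BUF_SLOTS][PAYLOAD_LEN];
static uint8_t seq;       /* free-running, wraps 255 -> 0 */

/* Store the payload under its sequence number and build the uplink:
 * first byte = sequence number, rest = payload. */
static void buffer_and_tag(const uint8_t *payload, uint8_t *out)
{
    memcpy(history[seq], payload, PAYLOAD_LEN);
    out[0] = seq;
    memcpy(&out[1], payload, PAYLOAD_LEN);
    seq++;                /* uint8_t arithmetic wraps naturally */
}

/* Second transmission per cycle while back-filling a requested range.
 * Returns non-zero once the whole range has been replayed. */
static int replay_next(uint8_t from, uint8_t to, uint8_t *cursor,
                       uint8_t *out)
{
    uint8_t s = (uint8_t)(from + *cursor);
    out[0] = s;
    memcpy(&out[1], history[s], PAYLOAD_LEN);
    (*cursor)++;
    return s == to;       /* wrap-safe: 253, 254, 255, 000, ... */
}
```

The application then only has to watch for a jump in the first byte of consecutive uplinks and downlink the missing range.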


Doh! We already have a frame counter! Just use the 8 LSBs.

It would be fairly simple, however I refuse to do this as it will break the current LoRaWAN model, as explained by Johan. Only when those concerns are addressed will I take requests regarding this into consideration.

As an exercise in C programming for yourself? Fine. Please do not distribute it, as it will break the TTN back-end (as in: you will trigger rate limiters, effectively making sure the gateway will be off-line even longer from TTN's point of view).

If someone comes up with a decent proposal that can be implemented in a way the TTN back-end can handle (and that lets it know these are ‘old’ messages, including forwarding that status to the application level so it knows not to respond to them either), and can convince TTN to implement the back-end part, I’ll gladly enhance MP forwarder.
I would suggest basing a proposal on the ttn-gateway-connector protocol and not on UDP. (MP doesn’t do BasicStation and I have no intention to implement it, as that would involve a major rewrite.)


I’d never consider trying to replay the uplinks, as I’ve read all the articles that explain why this isn’t a good idea, so I’m not thinking of impacting any LoRaWAN infrastructure.

What I have in mind is storing the uplinks received by the gateway locally. If there is an outage, then when it ends, the end-user's application server posts a request that is picked up periodically by another process on the gateway.

At which point this other process extracts the uplinks as received from the data store and sends them directly to the end-user's infrastructure, which would then have to decrypt them, but would have access to the various session keys via the TTS API to do that.

The only wrinkle in this is that it requires a gateway that can run the altered gateway software and this other process.

I think this solution would be best for many devices in range of a gateway. If there are only a handful, then @AndyG’s outline is pretty much what I have going on, but as I’ve found, I can run out of memory or have too many devices, plus other wrinkles yet to be debugged.

If there is an outage… you won't have just the one node to worry about recapturing data for… there could be thousands! A non-starter IMHO, and likely to break e.g. the FUP w.r.t. the rate of dumping catch-up messages in quick succession, unless done over a very long time period… in which case the next outage or load-shedding event might already have happened… cascade issues. Also, how do you prioritise live messages coming in vs what is queued up?… I see big nightmares ahead… this is more in line with what I proposed earlier as likely the only practical suggestion, even if it does create ‘islands’… extreme solutions for extreme cases? :wink:

Possibly, but equally there could be a use case where there are 50 sensors and one gateway in some far-off field. It may be preferable to have the whole lot, but the firmware could send averages as required and, if there are any interesting data points, be asked to send particular actuals.

So for SA, where water is metered, you can live with the average soil moisture for the last 24 hours in 3-hour averages. And some of the data can be collapsed further if there is a mechanism for indicating a repeat or sending deltas from a baseline.
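
As a hedged sketch of that 'deltas from a baseline' idea in C (the units, window size and names are all made up for illustration): transmit a full reading periodically and one signed byte in between:

```c
#include <stdint.h>

#define DELTA_WINDOW 8   /* hypothetical: full reading every 8th sample */

typedef struct {
    uint16_t baseline;   /* last absolute reading transmitted */
    uint8_t  count;      /* samples since the last baseline */
} delta_enc_t;

/* Encode one reading (e.g. soil moisture in 0.1% steps).
 * Returns bytes written: 2 for an absolute reading, 1 for a delta. */
static uint8_t encode_reading(delta_enc_t *e, uint16_t reading, uint8_t *out)
{
    int32_t d = (int32_t)reading - (int32_t)e->baseline;

    if (e->count == 0 || d < -127 || d > 127) {
        out[0] = (uint8_t)(reading >> 8);     /* resync: send absolute, */
        out[1] = (uint8_t)(reading & 0xFF);   /* big-endian             */
        e->baseline = reading;
        e->count = 1;
        return 2;
    }
    out[0] = (uint8_t)(int8_t)d;              /* one signed byte */
    e->count = (uint8_t)((e->count + 1) % DELTA_WINDOW);
    return 1;
}
```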

And if you have multiple sensors in an area, you may need just one sensor's data to get a rough idea.

So it may be that there are only one or two extra uplinks required, albeit requiring one downlink per device; it's about cherry-picking the first few sensors you want to backfill, along with how much data.

But in this example, it’s a bit irrelevant - the current value is the one of interest, so you can crack on with that. So the use case may not need an un-interrupted data set.

Perhaps @jpmeijers could chime in with a few example use cases for us to explore.

If you want a solution that can allow the gateway owner to get the raw traffic of their own nodes for backup decoding outside of TTN, there are a lot of ways of doing that.

On the crude end, you can add/reactivate debug logging and have something scrape it out of syslog.

After some early experiments with what became ChirpStack, I decided the part I liked was the compartmentalisation of tasks, so I similarly use UDP forwarders which pass the data to a task on the same or an adjacent machine that encapsulates it in something more appropriate to transit the Internet to the server. That custom bridge component is a decent place to “tee” a feed.

One of the nice things about UDP vs a lot of other IPC is that it can't back up and harm the source. Even if the collector task locks up for some reason (say, storage media failure), the source task is just throwing things at a target painted on the wall, with no care about, or even way of knowing, what happens next. The official UDP protocol has acks, but it's easy to send a second copy to another port blindly, at either the packet forwarder level or in the subsequent UDP-to-Internet task.
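
That blind second copy is about as simple as networking code gets; a sketch (the collector address and port are assumptions):

```c
#include <arpa/inet.h>
#include <string.h>
#include <sys/socket.h>

/* Fire-and-forget tee: duplicate every datagram to a local collector.
 * No acks, no retries; if the collector is down, copies just vanish
 * and the primary path is completely unaffected. */
static int tee_fd = -1;
static struct sockaddr_in tee_dst;

static void tee_init(void)
{
    tee_fd = socket(AF_INET, SOCK_DGRAM, 0);
    memset(&tee_dst, 0, sizeof tee_dst);
    tee_dst.sin_family = AF_INET;
    tee_dst.sin_port = htons(1701);              /* hypothetical port */
    inet_pton(AF_INET, "127.0.0.1", &tee_dst.sin_addr);
}

static void tee_send(const void *pkt, size_t len)
{
    if (tee_fd >= 0)
        (void)sendto(tee_fd, pkt, len, 0,
                     (struct sockaddr *)&tee_dst, sizeof tee_dst);
}
```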

If the chosen gateway can't support custom components, then it can be paired with, and configured to talk via, a local embedded system which can do so, and that can probably give a pipe into the gateway's serial or local-net management console too, as well as monitoring the batteries and whatever other fun stuff makes the site reliable.

As I'm sure you know, but some readers may be less aware, the problem with feeding even two true LoRaWAN network servers comes about if they can both talk. If one can't (either by design or because there's literally no way for a downlink command to reach the gateway), then there's no real issue beyond needing to get the current device addresses and keys in an OTAA situation, where they can change.

A proper official way to do this is the best approach in my opinion. All of the other suggestions I read here are basically a re-implementation of LoRaWAN, making the use of LoRaWAN redundant.

As a first step to get official support, I have filed this feature request on the Basic Station repo:


Erm, how so a re-implementation? In the event of loss of connectivity the devices are still transmitting using LoRa to the gateway. The only wrinkle with fishing out the un-relayed uplinks is decryption, but that is a solved problem. If there were a proposal to make LoRaWAN redundant, we'd have to find another low-power / long-range radio system for the devices.

@johan can officially comment on what constitutes official when related to LoRaWAN, as he sits on / contributes to the technical committees, but I don't think data caching falls under Semtech's domain - more the LoRa Alliance's - and I doubt you'll see an RFC on a draft proposal in the near future.

And there is still the unholy nightmare of weaving the uplinks into the network server state. Unless the gateway continues to cache & replay the current uplinks until the older uplinks have been relayed on, then when the gateway gets connectivity you are still running blind until it's all caught up. Or you allow live uplinks through and signal the NS that there are out-of-sequence uplinks about to arrive. And any & all MAC states & downlinks need to be handled. So as well as changing BasicStation, TTI, ChirpStack etc etc have to implement this into the heart of the mission-critical network server code.

If you do change your mind about the route to go, or want to hedge your bets by having an alternative solution, please share your most typical scenarios - duration of downtime, number of devices, number of uplinks lost, typical size of payload etc - and perhaps we can come up with some solution in the interim.

As an aside, I'm playing with a Pi CM4 on a daughter board with a concentrator on top, so having a highly accessible platform for development that is also robust is solved, and if the code is kept generic enough, folding it into OpenWRT & other gateway base OSes shouldn't be too problematic.

Exactly this :point_up:

The out-of-order arrival of LoRaWAN frames is basically what makes this the most challenging. Matching devices is already one of the trickiest things to do fast and efficiently. First, sessions are filtered by DevAddr and filtered and sorted by FCntUp (accounting for rollover of the 16-bit value in the frame). Then the NS tries to identify the device by checking the MIC with the NwkSKey.

If you were to allow LoRaWAN frames arriving out-of-order at the NS, you can't filter and sort by FCntUp anymore. This makes everything extremely inefficient, as you'd have to MIC-check all sessions that have a matching DevAddr and where a frame with the concerning FCntUp hasn't been seen yet, whilst accounting for rollover. So this means that you basically need a massive boolean array to record, for each session and each 32-bit FCntUp, whether the frame has been received in the session. Otherwise you're open to replay attacks. Now, this is 4 gigabytes of state per session. And you probably need to keep previous sessions of the end device as well, as the device may have rejoined in the meantime.
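
To make that state problem concrete, here is roughly what the per-session bookkeeping would have to look like (illustrative only; the 4 GB above corresponds to one byte per flag, and even packed one bit per counter value it is still 2^32 / 8 = 512 MB per session):

```c
#include <stdbool.h>
#include <stdint.h>

/* A bitmap recording which 32-bit FCntUp values have already been
 * accepted for one session -- clearly impractical at this scale. */
typedef struct {
    uint8_t *seen;   /* (1ULL << 32) / 8 bytes = 512 MB, per session */
} session_state_t;

static bool accept_frame(session_state_t *s, uint32_t fcnt)
{
    uint8_t mask = (uint8_t)(1u << (fcnt & 7u));
    if (s->seen[fcnt >> 3] & mask)
        return false;            /* already seen: duplicate or replay */
    s->seen[fcnt >> 3] |= mask;  /* mark received, then MIC-check it */
    return true;
}
```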

So, yes, in-order late delivery of frames, where the forwarder simply annotates the frames that it has buffered, that is all fine. But when gateway A that hears the device at SF7 goes offline, and the device keeps sending priceless uplink messages while using a correct and recommended ADR algorithm to reduce its data rate to SF10 after 100 frames, and gateway B picks up the frame and sends it in real time to the NS, the first 100 frames are lost forever. Because the frames from gateway A, whenever it comes online, will be out-of-order. So it's not just per gateway, it is really from the perspective of the Network Server.

And this is why you have the biggest brain and the best T-shirts - I hadn't thought of this; I had a picture in my head of a lone gateway out in the dust plains with nothing for miles & miles.

So as soon as the link check / ADR recommendations cut in, it’s sort of game over if there are any other gateways in range. Probably easier to handle this on a back channel, given that it won’t require anything like the resources that would need to be applied to even vaguely start doing the job.

Perhaps this could be streamlined in LoRaWAN 2.0 with TTS v5?

Wouldn't the out-of-order issue just be solved by buffering on the node and only transmitting from the buffer? A confirmed downlink message (not efficient) could be used to tell the node that it was a successful transmit. If the buffer has more than one message, it could schedule catch-up transmits over a 24-hour period to space things out.

That's not ordinarily viable, because it costs the network too much to send downlinks to ack transmissions. Downlinks are much more expensive than uplinks; you should not use confirmed uplinks with TTN.

What you can sometimes do with cleverness is keep some recent history on the node and downlink a request for a sparse summary of it when you start receiving from the node again through the infrastructure after the end of some kind of gap.
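
A sketch of the firmware side of that idea; the port number, the two-byte request format and the history accessor are all invented for illustration:

```c
#include <stdint.h>

/* Hypothetical downlink on FPort 10: byte 0 = first sequence number,
 * byte 1 = number of stored readings to average into one summary. */
#define SUMMARY_PORT 10

extern uint16_t history_read(uint8_t seq);   /* assumed history accessor */

/* Returns the reply length to queue as the next uplink, or 0 to ignore. */
static uint8_t handle_downlink(uint8_t port, const uint8_t *buf,
                               uint8_t len, uint8_t *reply)
{
    if (port != SUMMARY_PORT || len < 2 || buf[1] == 0)
        return 0;

    uint8_t first = buf[0], count = buf[1];
    uint32_t sum = 0;

    for (uint8_t i = 0; i < count; i++)
        sum += history_read((uint8_t)(first + i));   /* wraps at 256 */

    uint16_t avg = (uint16_t)(sum / count);
    reply[0] = first;                /* echo the range so the server */
    reply[1] = count;                /* can match reply to request   */
    reply[2] = (uint8_t)(avg >> 8);
    reply[3] = (uint8_t)(avg & 0xFF);
    return 4;
}
```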

Every time a node transmits there's an opportunity to send a reply, but network design and capacity mean you need to take only a small fraction of those opportunities.

Makes sense, but where do you place this code? I don't see that level of customization on TTN.

In the device firmware

From my experience and the work I do, I can only give the following comments (not naming providers - or can I?):

There are 4 GSM/LTE providers in SA - Mr V, Mr M, Mr C and Mr T.

Mr C piggybacks on Mr V and Mr M.

Three of them have essentially their own infrastructure while sharing some resources.

Very often Mr V, M and T will share sites (mast and fenced land, own container and power).

Mr V have all their own backhaul and batteries at nearly all sites (in KZN).

Mr M have mostly their own backhaul and try to have batteries at all sites. At certain points they use Mr T to supply additional backhaul backup.

Mr T have all their own backhaul, and their batteries disappear before the site is commissioned.

Conclusion: try to use a minimum of two gateways, and check whether they fall in different load-shedding areas (EskomSePush app).

For the backhaul I would also try to get the gateways to connect to GSM sites that fall in different load-shedding areas, although Mr V is most probably your best bet with battery backup (not saying Mr M or T do not have it). Also, a dual-SIM router might be an option.

I meant the code that sends the downlink. It seems like a custom downlink is required to make this work.

The stack / servers have no knowledge of what a downlink does - they take the byte array and organise its transmission to the device. The message formatting is up to you.

If by custom downlink you mean the firmware has to respond to a particular downlink, then yes, it is a function of the firmware, in just the same way that many, many devices have different firmware from each other for uplinks.

The main point is that it's nothing to do with TTS or any other LoRaWAN server code; it's all about the firmware. No gateways need be involved either.

I guess I don't see how what he proposed will work without some sort of downlink message being generated to tell the node something about the last “n” transmissions received. The node somehow has to know what the server last received before the outage if you aren't going to do confirmed uplinks. Excuse my ignorance on this, but I'm trying to figure out how one automates this downlink generation from the server end, so an appropriate downlink message can be sent to the node periodically as he described. The node firmware is trivial, but something must complement it on the other end to tell the node something in an automated fashion.

A device will never know there has been an outage - it should be fat, dumb & happy, sending out standard uplinks into the void, and every so often, depending on the regional settings, the MAC will send an uplink as confirmed to check connectivity.

All the action is on your server (note: not the TTS server, YOUR server). Your data store will (or should) have the uplink count (f_cnt), so seeing which uplinks are missing should be simple enough.
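
The gap scan itself is a few lines; a sketch, assuming your store hands back the counters already sorted:

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdio.h>

/* Walk a sorted list of received f_cnt values and report each gap;
 * every gap is a candidate range for a backfill request. */
static void find_gaps(const uint32_t *fcnt, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        if (fcnt[i] > fcnt[i - 1] + 1)
            printf("missing f_cnt %" PRIu32 "..%" PRIu32 "\n",
                   fcnt[i - 1] + 1, fcnt[i] - 1);
    }
}
```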

Then there is a world of possibilities for sending a request to the device - either to get some decimated (summarised) copies, or interleaved copies (like an old-fashioned progressive JPEG - you get the rough detail and then it comes into focus), or cherry-pick certain ones, or just replay them. These have to be packaged so they aren't misinterpreted as live uplinks, and they need some sort of sequence counter if the f_cnt isn't directly available.
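
The interleaved option can be as simple as requesting every Nth missing frame and halving the stride on each pass; a sketch (a real version would also skip values fetched in earlier passes):

```c
#include <stddef.h>
#include <stdint.h>

/* Progressive backfill: pass 0 requests every 8th missing frame,
 * pass 1 every 4th, then every 2nd, then the rest, so the data set
 * comes into focus rather than filling strictly left to right. */
static size_t pick_pass(const uint32_t *missing, size_t n,
                        unsigned pass, uint32_t *out)
{
    size_t stride = (size_t)8 >> (pass < 3u ? pass : 3u);  /* 8,4,2,1 */
    size_t k = 0;
    for (size_t i = 0; i < n; i += stride)
        out[k++] = missing[i];
    return k;   /* number of f_cnt values to request this round */
}
```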

To aid this, there are compression libraries that allow you to stuff payloads in and get compressed data out without worrying about sleeping. I've been getting around half the size, so with a bit of RAM, or even using some internal or external flash/FRAM, you can have quite a history stored in a circular buffer.