Discussion: Recover data after network outage

Down here in South Africa, where my colleagues and I deploy LoRaWAN sensors, we have periodic electricity outages of 2h30m at a time.


Problem 1: gateway loses power

During these times it is very likely that the LoRa gateway goes offline.

Solution

Therefore we normally add a DC UPS to our gateways (Takealot).


Problem 2: internet

In most locations we depend either on GSM/LTE or on long-range WiFi networks. These internet services don’t always have backup power, and when they do, it doesn’t always last through the whole power outage.

Solution

We started using RAK Wireless gateways because they have an “automatic data recovery” option. This feature allows the gateway to buffer all received packets and send them to the network server when the internet connection is restored.


Buffering packets on the gateway was possible on TTN V2 while using the Semtech UDP packet forwarder. If one switches to Basic Station, this feature does not exist.

On TTN V3 there is a long discussion about why buffering data on the gateway is bad. It can corrupt the MAC state. See this issue:

Normally two suggestions are made to handle the scenario where network connectivity is lost:

  1. Buffer the data on the device, and send it to the network when a working connection is available.
  2. Add redundant coverage that uses different backhaul internet connectivity.

Both of these have their own problems.

  1. As a user I cannot add data buffering to the firmware of all the LoRaWAN devices I am using, and off the shelf there are very few devices that actually offer this feature. But even with this feature, sending the buffered data in a burst will break the network’s fair use policy (see the airtime sketch below this list).
  2. The areas we are working in barely have a single internet connection. There is no option to install a gateway on a redundant internet connection.
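
To put a number on the fair-use point, here is a rough sketch using Semtech’s LoRa airtime formula (explicit header, CRC on, coding rate 4/5); the outage length, uplink interval, payload size and spreading factor are just illustrative numbers, not from any real deployment:

```python
import math

def lora_airtime_ms(payload_len: int, sf: int, bw_khz: int = 125) -> float:
    t_sym = (2 ** sf) / bw_khz                       # symbol time, ms
    de = 1 if sf >= 11 and bw_khz == 125 else 0      # low-data-rate optimization
    n_payload = 8 + max(
        math.ceil((8 * payload_len - 4 * sf + 28 + 16) / (4 * (sf - 2 * de))) * 5, 0)
    return (12.25 + n_payload) * t_sym               # preamble + payload symbols

# A 2h30m outage with 10-minute uplinks buffers 15 messages; flushing
# them at SF12 with a 20-byte payload costs:
print(15 * lora_airtime_ms(20, 12) / 1000, "s")      # ~19.8 s of airtime
# in one burst, most of TTN's ~30 s/day fair-use budget, on top of the
# day's live traffic
```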

Looking at my dilemma, which does not seem to be far fetched, it looks like LoRaWAN is moving in a direction that only supports urban areas: places that already have other options for connectivity (Sigfox, NB-IoT). LoRa (long range) is ideal for connectivity in rural areas, but the limitations being introduced in LoRaWAN make it very difficult to use there.

As an end user I do not have contact with the LoRa Alliance. It would however be great to hear their recommendation on how to use LoRaWAN in scenarios with power outages like the one described above.

Does anyone have a reasonable alternative to buffering data on the gateway that can solve intermittent network outages?


Hi JP, I have seen many situations with similar issues over the years (usually flaky backhaul at remote sites rather than S.A. rolling blackouts). Frankly, whilst buffering messages to send on later is technically possible, it is not always practical for the reasons listed: not least out-of-order delivery, MAC state confusion, missed joins, missed confirmations etc. Where this is a regular problem, the only viable solution isn’t to walk away from LoRaWAN but rather to embrace it harder… look at the option of putting the LNS on the gateway or as part of a local cluster. Then, when the internet backhaul comes back up, you pipe any saved decoded messages (not the original LoRaWAN messages) or payloads (or leave them encrypted for later decoding at the target app server) directly to the target collection/display/analysis point or database, using normal internet protocols and message-passing techniques. That does move away from TTN use and goes out of scope here. (The old Semtech IoT SX1301 Starter Kits, obsolete for ~3 years now, and the early Node-RED-enabled Multitech gateways were good for this. I think it may still be an option with some of the RAK outdoor gateways and others?)
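
Very roughly, the replay side of that could look something like this sketch: it assumes the local LNS appends decoded uplinks to a JSON-lines spool file, and the spool path, ingest URL and pacing below are all made up for illustration:

```python
import json, os, time
import requests

SPOOL = "/var/spool/lns/uplinks.jsonl"     # assumed spool written by local LNS
INGEST = "https://example.com/api/ingest"  # hypothetical application endpoint

def flush_spool():
    if not os.path.exists(SPOOL):
        return
    with open(SPOOL) as f:
        pending = [json.loads(line) for line in f if line.strip()]
    sent = 0
    for record in pending:
        try:
            requests.post(INGEST, json=record, timeout=10).raise_for_status()
            sent += 1
            time.sleep(0.2)   # pace the replay; this is plain HTTP, so no
                              # LoRaWAN duty cycle or FUP in play
        except requests.RequestException:
            break             # backhaul still down; retry next round
    with open(SPOOL, "w") as f:            # keep only the unsent tail
        f.writelines(json.dumps(r) + "\n" for r in pending[sent:])

while True:
    flush_spool()
    time.sleep(60)
```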

It may be that there is a way to go ‘hybrid’ with a combination of a local LNS that then swaps messages with TTN or other TTI instances (or ChirpStack, given your recent direction?! :wink: ) via Packet Broker, but that is beyond my immediate knowledge and skill set. Just thinking aloud!

“Island of communication” is what I call a standalone LoRaWAN network which does not necessarily have internet connectivity. It is indeed a possible workaround, but still not an elegant solution. The first problem with this is that I’ll need to write custom code to do the buffering and sending. That is firstly a lot of work, secondly deviates from the standard solutions, and thirdly not always possible, as gateway manufacturers do not want you to run custom firmware on their devices.

Another issue I have with this is the case where I have a farm covered intermittently by a high site hosting a TTN gateway, with an indoor LTE gateway as backup, which is also intermittent. If I move the indoor gateway over to a local network server, I need to register my devices on this server, losing access to the high site. This also hints at another administrative problem: devices are no longer registered in a central location.

On TTN V2 I had a workaround for this: register the devices as ABP on TTN, and then also register them using the same DevAddr, AppSKey and NwkSKey on the local ChirpStack instance running on the gateway. With this approach both networks received the message and forwarded it to my backend. This indeed brings in the problem of corrupted MAC state, something we were blissfully unaware of on TTN V2.

@kersing at one of the previous online conferences there was a presentation by RAK Wireless, mentioning the “Automatic data recovery” option they have. If I remember correctly you made a comment on this feature in the chat. Do you maybe have some insights on how to solve the scenario I sketched in my original post?

@johan and @htdvisser, as the go-to authorities on LoRaWAN, it would be great to hear your opinions on this matter too.

I understand the use case. Gateways can be offline and they can buffer frames. This buffer can be flushed later. This is relevant for gateway downtime but also for satellites buffering frames until they reach a ground station. This obviously needs to be supported by the packet forwarder, and open source forwarders like the infamous UDP packet forwarder and Basic Station don’t support this (yet).

Currently, NS doesn’t support out-of-order arrival of messages, it always tries a class A downlink, and different levels of rate limiters in our infrastructure prevent gateways from flushing high volumes of messages in a short time.

Messages may have been received by other gateways, or, while the gateway is slowly flushing its buffer, new messages may be forwarded immediately, so that messages arrive out of order at the NS. This could mean that valuable information in buffered messages, like telemetry, is discarded because the NS already has a higher FCnt.

Finally, when an uplink message arrives late at the NS, the NS needs to be aware that there’s no need to attempt a class A downlink. Otherwise it is a waste of resources, especially when you flush thousands of messages, for each of which the NS tries to schedule a class A downlink over and over.
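
As a toy illustration (not actual TTS code) of the FCnt problem: the NS keeps the highest FCntUp per session and treats anything at or below it as a duplicate or replay, so a buffer flushed after live traffic has been heard is discarded wholesale:

```python
last_fcnt = {}   # session (keyed by DevAddr here) -> highest FCntUp seen

def accept(dev_addr: str, fcnt_up: int) -> bool:
    if fcnt_up <= last_fcnt.get(dev_addr, -1):
        return False                  # treated as duplicate/replay, dropped
    last_fcnt[dev_addr] = fcnt_up
    return True

accept("260B1234", 100)   # gateway B forwards frame 100 live -> True
# gateway A comes back online and flushes its buffer of frames 0..99:
print(sum(accept("260B1234", n) for n in range(100)))   # 0 accepted
```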

We’re tracking Support downstream resending traffic out-of-order · Issue #2708 · TheThingsNetwork/lorawan-stack · GitHub for this.

If you need highly available use cases at the edge (including the application), please have a look at Hylke’s TTC talk on installing The Things Stack on a Raspberry Pi connected to Packet Broker. This gives you the best of both worlds: a local deployment with bidirectional TTN coverage when internet is available. See https://www.thethingsnetwork.org/article/deploy-the-things-stack-in-your-local-network


If you need a quick win: from seeing discussions on here, fiddling with my RAK Edge gateways and following the various discussions on the RAK forum about implementing it, I think you will be disappointed. But I think you know that.

However, the Other Big Boss (Wienke) would like buffering to be implemented in the GNSE, and this is a problem for which I have some very rough solutions that sort of work, but they are all in firmware.

However, I know @Jac would be delighted for us all to have a proper discussion about how to design and implement some components that can be used to build a solution: either by migrating to devices that can buffer, or to gateways that record uplinks with some system for moving those uplinks up to the data store (i.e. not via the NS/AS, but to a side server; we’d have the keys, so decryption can be done there).

I know you are busy with many projects, but perhaps you (@jpmeijers) could come up with a spec?

@Jac, my perception of your gateway software is that adding code to output the uplinks to a file should be fairly straightforward (not asking you to do it).

If that’s the case, then it falls under the banner of one of my T-shirts (hacker that can’t remember what programming language he’s using today) and I’ll happily have a stab at it.

I already have “clean up the code” for uplink buffering on my to-do list. It ran into a mess due to trying to make it too generic for too many MCUs; reworking it as an API with a storage driver seems to be the way to go, but that expands the work needed to finish it by a fair chunk.

Could this be resolved between the node and the application without changing the gateway?

Perhaps have a 256-message buffer on the node and transmit an unsigned-byte sequence number in each message.

The application can detect any gap and send a downlink to ask for the missing uplinks.

Please send uplinks 253 to 005. This will trigger the node to add a second transmission in each cycle. So it sends 007 & 253, 008 & 254, … 014 & 004, 015 & 005.

The application is then responsible for re-assembling the missing data in the correct order.
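
A sketch of the application-side bookkeeping for this scheme; the two-byte downlink payload format (first and last missing sequence numbers) is invented here for illustration:

```python
expected = None   # next sequence number we expect from the node

def on_uplink(seq: int):
    global expected
    if expected is not None and seq != expected:
        request_backfill(expected, (seq - 1) % 256)   # gap detected
    expected = (seq + 1) % 256

def request_backfill(first: int, last: int):
    # e.g. first=253, last=005 asks for 253,254,255,000...005 (wraps mod 256)
    payload = bytes([first, last])
    print(f"queue downlink: resend {first:03d}..{last:03d} ->", payload.hex())

on_uplink(252)   # in sequence, nothing to do
on_uplink(6)     # gap detected: requests 253..005
```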


Doh! We already have a frame counter! Just use the 8 least-significant bits.

It would be fairly simple, however I refuse to do this as it will break the current LoRaWAN model, as explained by Johan. Only when those concerns are addressed will I take requests regarding this into consideration.

As an exercise in C programming for yourself? Fine. Please do not distribute it, as it will break the TTN back-end. (As in: you will trigger rate limiters, effectively making sure the gateway will be off-line even longer from TTN’s point of view.)

If someone comes up with a decent proposal that can be implemented in a way the TTN back-end can handle (and that lets it know these are ‘old’ messages, including forwarding that status to the application level so it knows not to respond to them either), and can convince TTN to implement the back-end part, I’ll gladly enhance the MP forwarder.
I would suggest basing a proposal on the ttn-gateway-connector protocol and not on UDP. (MP doesn’t do Basic Station and I have no intention to implement it, as that would involve a major rewrite.)


I’d never consider trying to replay the uplinks, as I’ve read all the articles that explain why this isn’t a good idea, so I’m not thinking of impacting any LoRaWAN infrastructure.

What I have in mind is storing the uplinks received by the gateway locally. If there is an outage, then when it ends, the end-user’s application server posts a request that is picked up periodically by some other process on the gateway.

At that point this other process extracts the uplinks as received from the data store and sends them directly to the end-user’s infrastructure, which would then have to decrypt them, but would have access to the various session keys via the TTS API to do that.
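
For the decryption step, something along these lines should work for LoRaWAN 1.0.x payloads once the AppSKey has been fetched via the TTS API. This is a sketch using pycryptodome; the key handling and framing around it are left out:

```python
from Crypto.Cipher import AES
import struct

def decrypt_frmpayload(app_skey: bytes, dev_addr: int, fcnt: int,
                       frm_payload: bytes, uplink: bool = True) -> bytes:
    """AES-128 CTR-like scheme from the LoRaWAN 1.0.x specification."""
    aes = AES.new(app_skey, AES.MODE_ECB)
    direction = 0 if uplink else 1
    out = bytearray()
    for i in range(0, len(frm_payload), 16):
        block = (b"\x01" + b"\x00" * 4 + bytes([direction]) +
                 struct.pack("<I", dev_addr) +      # DevAddr, little-endian
                 struct.pack("<I", fcnt) +          # 32-bit frame counter
                 b"\x00" + bytes([i // 16 + 1]))    # block index, 1-based
        keystream = aes.encrypt(block)              # S_i key-stream block
        out += bytes(c ^ k for c, k in zip(frm_payload[i:i + 16], keystream))
    return bytes(out)
```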

The only wrinkle in this is that it requires a gateway that can run the altered gateway software and this other process.

I think this solution would be best for many devices in range of a gateway. If there are only a handful, then @AndyG’s outline is pretty much what I have going on, but as I’ve found, I can run out of memory or have too many devices, plus other wrinkles yet to be debugged.

If there is an outage… you won’t just have the one node to worry about recapturing data for… there could be thousands! A non-starter IMHO, and likely to break e.g. the FUP with respect to the rate of dumping catch-up messages in quick succession, unless done over a very long time period… in which case the next outage or load-shedding event might already have happened: cascade issues. Also, how do you prioritise live messages coming in vs what is queued up? I see big nightmares ahead… this is more in line with what I proposed earlier as likely the only practical suggestion, even if it does create ‘islands’… extreme solutions for extreme cases? :wink:

Possibly, but equally there could be a use case where there are 50 sensors and one gateway in some far-off field. It may be preferable to have the whole lot, but the firmware could send averages as required and, if there are any interesting data points, be asked to send particular actuals.

So for SA, where water is metered, you can live with the average soil moisture for the last 24 hours in 3-hour averages. And some of the data can be collapsed further if there is a mechanism for indicating repeats or sending deltas from a baseline.
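
As a sketch of what the firmware side of that could look like (written in Python for readability; the sample rate, scaling and payload layout are all made up):

```python
def pack_daily_summary(samples: list[float]) -> bytes:
    """samples: one soil-moisture reading per 15 minutes, 96 values for 24 h."""
    per_window = len(samples) // 8           # eight 3-hour windows
    averages = [sum(samples[i * per_window:(i + 1) * per_window]) / per_window
                for i in range(8)]
    base = round(averages[0] * 2)            # 0.5% resolution, one byte
    deltas = [round((a - averages[0]) * 2) & 0xFF   # signed byte,
              for a in averages[1:]]                # two's complement
    return bytes([base] + deltas)            # 8-byte uplink payload

payload = pack_daily_summary([42.0 + 0.05 * i for i in range(96)])
print(payload.hex())   # one small uplink instead of 96
```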

And if you have multiple sensors in an area, you may need just one sensor’s data to get a rough idea.

So it may be that only one or two extra uplinks are required, albeit requiring one downlink per device; it comes down to cherry-picking the first few sensors you want to backfill, along with how much data.

But in this example it’s a bit irrelevant: the current value is the one of interest, so you can crack on with that. The use case may not need an uninterrupted data set.

Perhaps @jpmeijers could chime in with a few example use cases for us to explore.

If you want a solution that can allow the gateway owner to get the raw traffic of their own nodes for backup decoding outside of TTN, there are a lot of ways of doing that.

On the crude end, you can add/reactivate debug logging and have something scrape it out of syslog.

After some early experiments with what became ChirpStack, I decided the part I liked was the compartmentalization of tasks, so I similarly use UDP forwarders which pass the data to a task on the same or an adjacent machine that encapsulates it in something more appropriate for transiting the Internet to the server. That custom bridge component is a decent place to “tee” a feed.

One of the nice things about UDP vs a lot of other IPC is that it can’t back up and harm the source. Even if the collector task locks up for some reason (say, storage media failure), the source task is just throwing things at a target painted on the wall, with no care about, or even way of knowing, what happens next. The official UDP protocol has acks, but it’s easy to blindly send a second copy to another port, at either the packet forwarder level or in the subsequent UDP-to-Internet task.
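
A blind tee can be as small as this sketch. The ports here are made up, and it assumes the packet forwarder is configured to send to the tee instead of the real bridge:

```python
import socket

LISTEN = ("127.0.0.1", 1700)    # where the packet forwarder now sends
PRIMARY = ("127.0.0.1", 1701)   # the real UDP-to-Internet bridge
TAP = ("127.0.0.1", 1702)       # local collector; nobody acks this copy

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(LISTEN)

forwarder = None
while True:
    data, addr = sock.recvfrom(65535)
    if addr == PRIMARY:
        if forwarder:
            sock.sendto(data, forwarder)   # relay acks/downlinks back
    else:
        forwarder = addr                   # remember the forwarder's port
        sock.sendto(data, PRIMARY)         # real path, acks flow back
        sock.sendto(data, TAP)             # blind second copy
```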

If the chosen gateway can’t support custom components, then it can be paired with, and configured to talk via, a local embedded system which can. That system can probably also give a pipe into the gateway’s serial or local-network management console, as well as monitoring the batteries and whatever other fun stuff makes the site reliable.

As I’m sure you know, but some readers may be less aware, the problem with feeding even two true LoRaWAN network servers comes about if they can both talk. If one can’t (either by design or because there’s literally no way for a downlink command to reach the gateway), then there’s no real issue beyond needing to get the current device addresses and keys in an OTAA situation, where they can change.

A proper official way to do this is the best approach in my opinion. The other suggestions I read here are basically re-implementations of LoRaWAN, making the use of LoRaWAN redundant.

As a first step to get official support, I have filed this feature request on the Basic Station repo:


Erm, how so a re-implementation? In the event of loss of connectivity the devices are still transmitting using LoRa to the gateway. The only wrinkle with fishing out the un-relayed uplinks is decryption, but that is a solved problem. If there were a proposal to make LoRaWAN redundant, we’d have to find another low-power, long-range radio system for the devices.

@johan can officially comment on what constitutes “official” when related to LoRaWAN, as he sits on and contributes to the technical committees, but I don’t think data caching falls under Semtech’s domain; it is more the LoRa Alliance’s, and I doubt you’ll see an RFC on a draft proposal in the near future.

And there is still the unholy nightmare of weaving the uplinks into the network server state. Unless the gateway continues to cache and replay the current uplinks until the older uplinks have been relayed on, then when the gateway regains connectivity, you are still running blind until it’s all caught up. Or you allow live uplinks through and signal the NS that there are out-of-sequence uplinks about to arrive. And any and all MAC states and downlinks need to be handled. So as well as changing Basic Station, TTI, ChirpStack etc. would have to implement this in the heart of the mission-critical network server code.

If you do change your mind about the route to go, or want to hedge your bets by having an alternative solution, please share your most typical scenarios: duration of downtime, number of devices, number of uplinks lost, typical size of payload, etc. Perhaps we can come up with some solution in the interim.

As an aside, I’m playing with a Pi CM4 on a daughter board with a concentrator on top, so having a highly accessible development platform that is also robust is solved; and if the code is kept generic enough, folding it into OpenWrt and other gateway base OSes shouldn’t be too problematic.

Exactly this :point_up:

The out-of-order arrival of LoRaWAN frames is basically what makes this the most challenging. Matching devices is already one of the trickiest things to do fast and efficiently. First, sessions are filtered by DevAddr and filtered and sorted by FCntUp (accounting for rollover of the 16-bit value in the frame). Then, the NS tries to identify the device by checking the MIC with the NwkSKey.

If you were to allow LoRaWAN frames to arrive out of order at the NS, you couldn’t filter and sort by FCntUp anymore. This makes everything extremely inefficient, as you’d have to MIC-check all sessions that have a matching DevAddr and where a frame with the concerning FCntUp hasn’t been seen yet, whilst accounting for rollover. This means you basically need a massive boolean array to record, for each session and each 32-bit FCntUp, whether the frame has been received in the session; otherwise you’re open to replay attacks. That is 4 gigabytes of state per session. And you probably need to keep previous sessions of the end device as well, as the device may have rejoined in the meantime.
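
For the back-of-envelope:

```python
# Tracking "seen" per 32-bit FCntUp value, per session:
fcnts = 2 ** 32
print(fcnts / 2**30, "GB as one-byte booleans")    # 4.0
print(fcnts / 8 / 2**20, "MB as a packed bitmap")  # 512.0
```

Even packed as a bitmap rather than one-byte booleans, that is still half a gigabyte per session.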

So, yes, in-order late delivery of frames, where the forwarder simply annotates the frames as buffered, is all fine. But when gateway A, which hears the device at SF7, goes offline, and the device keeps sending priceless uplink messages while using a correct and recommended ADR algorithm that reduces its data rate to SF10 after 100 frames, and gateway B picks up that frame and sends it in real time to the NS, the first 100 frames are lost forever. The frames from gateway A, whenever it comes back online, will be out of order. So it’s not just per gateway; it is really from the perspective of the Network Server.

And this is why you have the biggest brain and the best T-shirts. I hadn’t thought of this; I had a picture in my head of a lone gateway out on the dust plains with nothing for miles and miles.

So as soon as the link check / ADR recommendations cut in, it’s sort of game over if there are any other gateways in range. It is probably easier to handle this on a back channel, given that it won’t require anything like the resources that would be needed to even vaguely start doing the job.

Perhaps this could be streamlined in LoRaWAN 2.0 with TTS v5?

Wouldn’t the out-of-order issue just be solved by buffering on the node and only transmitting from the buffer? A confirmed downlink message (not efficient) could be used to tell the node that a transmit was successful. If the buffer has more than one message, the node could schedule catch-up transmits over a 24-hour period to space things out.

That’s not ordinarily viable because it costs the network too much to send downlinks to ack transmissions. Downlinks are much more expensive than uplinks; you should not use confirmed uplinks with TTN.

What you can sometimes do with cleverness is keep some recent history on the node and downlink a request for a sparse summary of it when you start receiving from the node again through the infrastructure after the end of some kind of gap.

Every time a node transmits there’s an opportunity to send a reply, but network design and capacity mean you should only take a small fraction of those opportunities.

Makes sense, but where do you place this code? I don’t see that level of customization available on TTN.