OTAA LMIC node no longer showing in TTN console after 65535 frames up

I have fielded a handful of air quality sensors around the local city starting a couple years ago. The project uses an ESP32 + LMIC for communication w/ TTN. The software collects readings from attached sensors and sends the telemetry to TTN every minute. If any problems are encountered, the device reboots itself. This masked an issue that I'm just now getting my hands around. Once the local network issues had been sorted out by the group managing the gateways, my devices were able to stay online for months on end. Then I eventually noticed that my devices would stop sending data after reaching 65535 uplinks (~45 days). That number is clearly the result of a 16-bit overflow condition... somewhere.

These are outdoor devices mounted on signal poles, which makes pulling debug data difficult. I've been working on a new version of the software to add some features and to switch to the MCCI LMIC port, in the hope that I'm just seeing some 16-bit issue that has since been solved (despite finding no GitHub issues about this in the original IBM LMIC port). I set up a test device to send data every ~3 sec, which has allowed me to observe this error after only a few days.

What I was expecting was some sort of error shown on the device, a hang, maybe a WDT reset, something like that. What I woke up to today is TTN showing my device as unresponsive at exactly 65535 frames sent:
[Image: TTN console showing the device stuck at exactly 65535 uplink frames]

However, the device itself is online and believes it is sending data. I have it spraying debug logs via serial, and it's happily cranking away, currently on uplink sequence number 72260. Both of my local gateways see the device, but no data is reported to TTN, and the device has been showing dark for the past 7 hours (per the image above).

Hereā€™s one of the gateway logs:
[Image: gateway log showing uplinks from the device with a frame counter in the 6700s]

Note the frame counter at 6700-something and counting. Add 65535 to that and you get the frame counter my device thinks it has, which is reported via LMIC.seqnoUp. Another classic sign of 32 bits of data in a 16-bit bucket.
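As a quick sanity check of that arithmetic (my own addition here, just masking the number the node reports):

```python
# The same 32-bit counter seen through the 16-bit window the gateway reports
seqno_up = 72260                 # value reported by LMIC.seqnoUp on my node
print(seqno_up & 0xFFFF)         # 6724 -> the "6700-something" in the gateway log
print(seqno_up >> 16)            # 1    -> the lower 16 bits have wrapped exactly once
```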

While I'm not much of an embedded programmer, I don't see how sloppy code on my side is going to impact some internal counter within the LMIC library. Outside of the debug code printing the result of LMIC.seqnoUp, there's nothing on my side that should be interacting w/ the sequence numbers directly. As these are full-time powered devices I'm not persisting frame counters to local storage (as one might do on an OTAA device that is entering deep sleep modes).

Has anyone seen anything like this?

Check the device settings page and switch "Frame counter width" to the other setting.

One thing to note is that you are violating the TTN fair access policy with each of these devices. You are allowed just 30 seconds of airtime per device per day. Sending too often means you are using an inappropriate amount of a shared medium (the few frequencies LoRaWAN uses).


That was exactly the problem! Changing that setting for my benchtop device in the TTN console immediately restored communication with the device. Thanks man!

Also, totally understood on the airtime thing. The "report every 3 seconds" device is in my basement on my own gateway, which should hopefully limit the scope of the dumb crap I'm doing here. Now that this is sorted I can go back to 5-minute reports which, given my packet size and spreading factor, should fit within the 1% airtime fairness rules.

Thanks @kersing!

If there are 65535 packets in 45 days, then that suggests you have the nodes set to send a packet once every 60 seconds.

If SF9 is typical then that's 388 seconds of air time per day; the fair access limit is 30 seconds per day.
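Back-of-the-envelope (the ~0.27 s per packet below is an assumption; the actual time on air depends on payload size):

```python
packets, days = 65535, 45
per_day = packets / days                   # ~1456 uplinks/day, i.e. roughly one per minute
print(round(86400 / per_day))              # ~59 seconds between uplinks
print(round(1440 * 0.27))                  # ~389 s/day airtime at one uplink/minute, ~0.27 s each at SF9
```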


For testing, one can also use the ttnctl command line tool to set the frame counter to some specific value, and then make the test node start with that value too, rather than starting at zero. This might also be needed when the difference between device and TTN exceeds MAX_FCNT_GAP, being 16,384. (But when adhering to the Fair Access Policy that should only happen when erroneously resetting the frame counters in TTN Console.)

Just for future reference: what is the correct setting (for your LMIC device)? (I'd assume 32 bits, which is also the default in TTN Console.)

The correct setting was 32 bit, and every single device I had created (which happened a couple years ago) was set to 16 bit. I don't recall ever having a reason to change that; was the default different some time ago? IIRC (this is going back a bit), I might have used ttnctl to create the device definitions. Maybe something I messed up w/ the command there?

You are mixing two things. First there is the legal 1% limit applicable in Europe for the frequencies used for EU868. Second there is the fair access policy of TTN which states every node is allowed an average airtime of 30 seconds and 10 downlinks each day.

To stay within the 30 seconds a node can send about 10 bytes every 3 minutes at SF7. Larger packets, more frequent transmissions, or SF8 and above will result in a node exceeding the allowed airtime.
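To put a number on that, here is a rough time-on-air calculator using the standard Semtech SX127x datasheet formula (a sketch only, not TTN's own calculator; it assumes an explicit header, CRC on, coding rate 4/5 and an 8-symbol preamble):

```python
import math

def lora_airtime_s(phy_payload_len, sf, bw_hz=125_000, cr=1, preamble_syms=8,
                   explicit_header=True, crc=True, low_dr_opt=None):
    """Time-on-air in seconds, per the Semtech SX127x datasheet formula."""
    if low_dr_opt is None:
        # LoRaWAN enables low data rate optimization for SF11/SF12 at 125 kHz
        low_dr_opt = sf >= 11 and bw_hz == 125_000
    t_sym = (2 ** sf) / bw_hz
    t_preamble = (preamble_syms + 4.25) * t_sym
    num = 8 * phy_payload_len - 4 * sf + 28 + 16 * crc - 20 * (not explicit_header)
    n_payload = 8 + max(math.ceil(num / (4 * (sf - 2 * low_dr_opt))) * (cr + 4), 0)
    return t_preamble + n_payload * t_sym

# 10 bytes of application payload -> ~23-byte PHYPayload (13 bytes of LoRaWAN overhead)
toa = lora_airtime_s(23, sf=7)
print(round(toa * 1000, 1))                 # ~61.7 ms per uplink
print(round((24 * 60 / 3) * toa, 1))        # ~29.6 s/day at one uplink every 3 minutes
```

Which is how the "10 bytes every 3 minutes at SF7" rule of thumb lands just under the 30 second allowance.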


Hi @Allen, as @Kersing states you are potentially conflating the two issues, and I would add that whilst the regs allow higher duty cycles, there have been many posts on the Forum and elsewhere as to why you should not take that as permission wrt social responsibility, let alone wrt the TTN FUP, e.g. this recent post from yours truly :wink: if you wouldn't mind taking a moment to read it and note the can vs should element :slight_smile:

Understood and acknowledged. We can live with 5min intervals and that was the plan from the start. Thanks for the help guys, this sort of guidance is really useful!


If you're receiving at SF9, you need to be using a 12.5 minute interval.

Hi,
I have a node that I'm using for a battery / low power test. It has a 16-bit frame counter which reached 65535 a few days ago, after about 8 months. It's ABP on TTN v2.

Traffic is not showing in the Application Data in the V2 console any more. I can see that the counter has rolled over by looking at the gateway traffic. Frame counter checks were and are disabled.

I've tried changing the device setting to a 32-bit frame counter, leaving it overnight, and changing it back.

Not too worried, the test wasn't far from ending. The last voltage I saw was 3.33V, and the node should shut down at 3.2V. With 20/20 hindsight it would probably have been a good idea to set up an integration and log the data properly, but it seemed an easier option just to check on progress from time to time in the console.

Any suggestions?

I trust you verified the code to come to that conclusion?

Keep in mind that just means the 16 bits in the LoRaWAN packet rolled over. The code may be using 32 bits internally.

A packet is valid if the received count exceeds the last count by at most 16384. The first check (the counter must increase) is skipped if the frame counter check is disabled; I don't expect the second (the maximum gap) will be skipped as well.
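In rough pseudo-code the usual reconstruction looks something like this (a sketch of the general approach, not TTN's actual implementation):

```python
MAX_FCNT_GAP = 16384   # from the LoRaWAN 1.0.x specification

def extend_fcnt(last_fcnt32, received_fcnt16):
    """Rebuild the full 32-bit counter from the 16 bits sent over the air,
    then apply the max-gap check; returns the new counter or None if rejected."""
    candidate = (last_fcnt32 & ~0xFFFF) | received_fcnt16
    if candidate <= last_fcnt32:       # lower 16 bits wrapped -> assume the next 64k block
        candidate += 0x10000
    return candidate if candidate - last_fcnt32 <= MAX_FCNT_GAP else None

print(extend_fcnt(65530, 5))           # 65541 -> rollover handled transparently
print(extend_fcnt(65530, 60000))       # None  -> too big a forward jump, frame rejected
```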

I've got a very long running node that passed the 1 million packet mark some time ago, so TTN V2 is certainly able to handle wrapping counters if the 16/32 bit settings match.

If you really want to know what is happening, you can get the network session key from TTN, take the raw packet from the gateway, and by doing the calculations outside of TTN see for which (if any) value of the untransmitted upper 16 bits the cryptographic checksum (the MIC) validates.
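Something along these lines would do it (a sketch only, assuming a LoRaWAN 1.0.x uplink and the Python cryptography package; phy_payload, nwk_skey and dev_addr are placeholders for the raw frame from the gateway log and the key/address from the console, with dev_addr written MSB first as the console shows it):

```python
from cryptography.hazmat.primitives.cmac import CMAC
from cryptography.hazmat.primitives.ciphers import algorithms

def find_fcnt_msb(phy_payload: bytes, nwk_skey: bytes, dev_addr: bytes):
    """Try every value of the upper 16 bits of FCnt; return the one whose MIC matches."""
    msg, mic = phy_payload[:-4], phy_payload[-4:]        # MIC is the last 4 bytes
    fcnt_lsb = int.from_bytes(msg[6:8], "little")        # FCnt field in the FHDR
    for msb in range(0x10000):
        fcnt32 = (msb << 16) | fcnt_lsb
        # B0 block per LoRaWAN 1.0.x:
        # 0x49 | 4x 0x00 | dir (0 = uplink) | DevAddr (LE) | FCnt32 (LE) | 0x00 | len(msg)
        b0 = (bytes([0x49, 0, 0, 0, 0, 0])
              + dev_addr[::-1]                           # console shows MSB first -> wire is LE
              + fcnt32.to_bytes(4, "little")
              + bytes([0, len(msg)]))
        cmac = CMAC(algorithms.AES(nwk_skey))
        cmac.update(b0 + msg)
        if cmac.finalize()[:4] == mic:
            return msb
    return None
```

If it only validates for msb = 1, the node really has gone exactly one rollover past what the console last accepted.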

The node not handling rollover of the lower 16 bits in the way it was believed it would is definitely a strong possibility.

Another question in my mind at least worth asking is if TTN's ability to handle this may be altered by the frame counter disable checkbox - that's not meant to be a routine mode of operation so it seems theoretically possible it may have unintended interactions with other things. In particular, part of the reason for the 16384 limit is to aid in distinguishing a forward skip across the rollover from something else. If you allow illegal-under-protocol frame counter resets, that gets harder still (though in theory checking both options would be possible, at the cost of complexity, computation, and maybe being a path that doesn't see much testing).


Thanks, good advice from both of you.

I must admit I never checked the width of the frame counter used by the Heltec code. I never got to 65535 before. I'll change it to 32 bits in the V2 console and enable the frame counter check just to see what happens.

There have been a series of tests starting with an Adafruit 32u4 Feather and TinyLora, then an M0 Feather, and I've since moved on to the Heltec dev boards. I've reused the same devAddr for each test, and never changed the settings in the V2 console. Most of the tests were pretty short, 10 to 50 days. The Feathers are good general purpose hobby boards, but it is difficult to get low power out of them without getting the soldering iron out to remove bits. I still use an M0 Feather in my GPS tracker.

My tests are probably over. I think the Heltec dev boards should do fine for my intended applications. The frame rate will be significantly lower than my test devices so I should see some further improvement in battery life. All will be less than 3km from the gateways. Most will be easily accessible. So not a hassle to swap them out.

It is probably time to set up a V3 Application and do things properly with OTAA devices, and have a go with MQTT, InfluxDB, and Grafana.