I fielded a handful of air quality sensors around my local city starting a couple of years ago. The project uses an ESP32 + LMIC for communication w/ TTN. The software collects readings from the attached sensors and sends the telemetry to TTN every minute. If any problems are encountered, the device reboots itself. That masked an issue that I’m only now getting a handle on. Once the local network issues had been sorted out by the group managing the gateways, my devices were able to stay online for months on end. Eventually I noticed that my devices would stop sending data after reaching 65535 uplinks (~45 days). The number is clearly the result of a 16-bit overflow condition… somewhere.
These are outdoor devices mounted on signal poles, which makes pulling debug data difficult. I’ve been working on a new version of the software to add some features and to switch to the MCCI LMIC port, in the hope that I’m just seeing some 16-bit issue that has since been fixed there (despite finding no GitHub issues about this for the original IBM LMIC port). I set up a test device to send data every ~3 sec, which now lets me observe this error after a few days.
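For anyone checking the timing, here is a rough sketch of the arithmetic in C. The function name `days_to_reach` is made up for illustration, and it assumes a perfectly steady send interval with no reboots in between.

```c
#include <stdint.h>

/* Hypothetical helper: days needed to send `uplinks` frames when one
 * frame goes out every `interval_s` seconds (steady rate assumed). */
double days_to_reach(uint32_t uplinks, uint32_t interval_s) {
    const double seconds_per_day = 86400.0;
    return ((double)uplinks * (double)interval_s) / seconds_per_day;
}
```

Calling `days_to_reach(65535, 60)` gives roughly 45.5 days for the one-minute field devices, and `days_to_reach(65535, 3)` gives roughly 2.3 days for the test device, which lines up with both observations.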
What I was expecting was some sort of error shown on the device, a hang, maybe a WDT reset, something like that. What I woke up to today is TTN showing my device as not responding, with exactly 65535 frames sent:
However, the device itself is online and believes itself to be sending data. I have it spraying debug logs via serial, and it’s happily cranking away, currently on uplink sequence number 72260. Both of my local gateways see the device, but no data is reported to TTN, and the device has been showing dark for the past 7 hours (per the image above).
Here’s one of the gateway logs:
Note the frame counter at 6700-something and counting. Add 65536 (2^16) to that and you get the frame counter my device thinks it has, which is reported via LMIC.seqnoUp. Another classic sign of 32 bits of data in a 16-bit bucket.
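To make the relationship concrete, here is a minimal sketch in C. It assumes the gateway log is showing the 16-bit FCnt field from the LoRaWAN frame header, i.e. just the low 16 bits of the 32-bit counter the device keeps. The function name `fcnt_on_air` is hypothetical and is not part of LMIC.

```c
#include <stdint.h>

/* Hypothetical helper: the 16-bit frame counter value that would appear
 * on air for a given 32-bit device-side counter (low 16 bits only). */
uint16_t fcnt_on_air(uint32_t seqno_up) {
    return (uint16_t)(seqno_up & 0xFFFFu);
}
```

With the device-side value of 72260 from my serial log, `fcnt_on_air(72260)` works out to 72260 - 65536 = 6724, which matches the 6700-something the gateway reports.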
While I’m not much of an embedded programmer, I don’t see how sloppy code on my side is going to impact some internal counter within the LMIC library. Outside of the debug code printing the result of LMIC.seqnoUp, there’s nothing on my side that should be interacting w/ the sequence numbers directly. As these are full-time powered devices I’m not persisting frame counters to local storage (as one might do on an OTAA device that is entering deep sleep modes).
Has anyone seen anything like this?