Network and application session keys the same among all my ABP devices

As you can read in the explanation about the DevAddr I linked to above, TTN uses the combination of DevAddr and NwkSKey. So I don’t think the non-unique NwkSKey is the cause of your problem.

However: any chance the current values of the uplink counter (for each device) are lower than the last known values, caused by the devices somehow resetting? (If they use solar power, then maybe their batteries go too low at night?) If that’s the cause, then the gateway’s Traffic page should still show the uplinks just fine (including the current counter), but they would not be routed to your application. (Unless you disable the frame counter security.)

Beware that, when using TTN Console, changing any setting for an ABP device will reset its counters.


Since I am using ABP and I power-cycle the end node: does this mean that it is a new session, and that I should have programmed new keys?

Okay, thanks.
I had hoped that this was the cause of my issues, but I didn’t really believe it was either.

I’ll keep combing through my code.

For ABP you don’t need to program new keys every time you reset a device. But you should reset the counters in TTN Console (or using the ttnctl command line) at the same time you reset the ABP device. (Or disable the security, but as linked to in an earlier answer: that might need handling in the device as well, for the downlink counters.)
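If I remember the ttnctl flags correctly, resetting the counters from the command line looks something like this (the device ID is a placeholder, and the flag names are from memory, so double-check with `ttnctl devices set --help`):

```
ttnctl devices set my-device --fcnt-up 0 --fcnt-down 0
```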

I think the key question is: do you see the uplinks in the gateway’s Traffic?

As far as I know, the DevAddr is only used with OTAA, to generate a session key.
When using ABP, you essentially replace that step by generating that session key for that single node.
N.B. you can even do it “manually” by first performing an OTAA and then somehow copying those session keys into an ABP configuration. But that is even more cumbersome than generating an ABP session key in the first place. (And it will probably become void if the original node does a new OTAA with the same DevAddr.)

If all nodes use the same session key, but not at the same time, then you would still run into the problem that the frame counters have already been used.

So if you disable the frame counter check in the console, you will be able to see those “duplicate” messages again.
I don’t know what the TTN network (or maybe a gateway) does when it detects that a session key is used on several nodes, nor whether that can be detected at all. I guess it can be detected when two nodes send using the same session key at the same time and are received by different gateways (or on different channels at the same time).

Anyway, I would suggest generating unique ABP keys per node.

Nope. The DevAddr is also present in every uplink and downlink message, for both ABP and OTAA devices. And it’s used when calculating the MIC, along with other details such as the frame counter.

The session keys are not part of the message, but the MIC is. However, validation of the MIC starts with a list of all devices of a given DevAddr, and that’s not the same for all devices in this case. See also How does a network know a received packet is for them? - #2 by htdvisser
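To make that concrete, here’s a rough sketch (my reading of the linked explanation, not TTN’s actual code) of how a network server can resolve a DevAddr that is shared by several ABP devices. All names are illustrative, and computeMic() is a non-cryptographic stand-in for the real AES-128 CMAC:

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>

struct Session {
    const char* devId;
    uint8_t     nwkSKey[16];
    uint32_t    lastFCntUp;   // highest frame counter seen so far
};

// Stand-in only, NOT cryptographic: the real MIC is aes128_cmac(nwkSKey,
// B0 | MHDR | FHDR | FPort | FRMPayload) truncated to 4 bytes, where the
// B0 block contains the DevAddr and the frame counter.
uint32_t computeMic(const uint8_t key[16], uint32_t devAddr,
                    uint32_t fCntUp, const uint8_t* frm, size_t len) {
    uint32_t acc = devAddr ^ fCntUp;
    for (size_t i = 0; i < len; ++i) acc = (acc << 1) ^ frm[i] ^ key[i % 16];
    return acc;
}

// Returns the matching session, or nullptr if the uplink can't be routed.
const Session* resolveUplink(const std::vector<Session>& candidates,
                             uint32_t devAddr, uint32_t fCntUp,
                             uint32_t micInFrame,
                             const uint8_t* frm, size_t len) {
    for (const auto& s : candidates) {
        // The first NwkSKey that reproduces the MIC wins; later candidates
        // are never checked (hence "2 MIC checks" for 3 devices, below).
        if (computeMic(s.nwkSKey, devAddr, fCntUp, frm, len) == micInFrame) {
            // With the frame counter check enabled, a counter that is not
            // higher than the last known value gets the uplink dropped.
            return (fCntUp > s.lastFCntUp) ? &s : nullptr;
        }
    }
    return nullptr;
}
```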

For security: yes, absolutely. But I doubt this problem is caused by using the same secrets. And if it is: the messages should then still be visible in the gateway’s log, if one has access to that. (For many gateways the Traffic page in TTN Console is not working right now.)

Aside: in an earlier, meanwhile deleted, topic @verdiagriculture wrote that the frame counter checks were disabled. Also, the number of messages per day is low, so no problems are to be expected with, e.g., 16-bit counters.

Ah, maybe you meant the DevEUI, not the DevAddr? Indeed, an ABP device does not know that, and an OTAA device only uses it during its join (along with the AppEUI and the secret AppKey). For uplinks, the DevEUI as known in TTN Console will still be added by the network and passed along in every MQTT API message and integration (labeled hardware_serial), but I don’t know if it’s used otherwise.
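For illustration, a v2 uplink as published on the MQTT API looks roughly like this (all values below are made up, and fields such as payload_fields and the gateways metadata are omitted); hardware_serial carries the DevEUI:

```json
{
  "app_id": "my-app",
  "dev_id": "my-device",
  "hardware_serial": "0004A30B001C0530",
  "port": 1,
  "counter": 6,
  "payload_raw": "AQI=",
  "metadata": {
    "time": "2021-02-09T12:34:56.789Z",
    "frequency": 903.9,
    "data_rate": "SF7BW125"
  }
}
```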

Yep, the HW address of the node. That’s the one I meant.


Thanks for all the replies.
I don’t have immediate access to gateway traffic. But I am getting a new gateway soon, and with the 8 or so nodes that I have, I hope to recreate the bug and solve it.

I finally caught some gateway traffic on the console. It looks horrendously long. I see deduplicate commands, and it got 3 devices from the network server and performed 2 MIC checks. Maybe this is the issue that is stopping my devices from uplinking properly. A log like this isn’t normal, is it?
[screenshot: downlink trace from the gateway Traffic page in TTN Console]

Wow, that’s a messy post. I see duplicate details in the screenshots, and what are we even looking at? Most of it looks like a downlink trace?

Also, is this about traffic that, though shown in the gateway’s Traffic, did not arrive in your actual application? If so: how is that application getting its data from TTN? If it is not using MQTT: what does a command-line MQTT client receive? How does a trace of dropped uplinks compare to uplinks that are handled correctly?

(Just to be sure: if you’re only looking at the Data page in TTN Console, then that often stops showing data without any clear reason. So: look at what your actual application is receiving. Or enable the Data Storage integration for debugging.)
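For the command-line MQTT check: on the v2 stack, something like the following should print everything the application receives (the region, application ID and access key below are placeholders):

```
mosquitto_sub -v -h us-west.thethings.network -p 1883 \
  -u 'my-app-id' -P 'ttn-account-v2.XXXXXXXX' \
  -t '+/devices/+/up'
```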

What are the other details of the uplink that triggered that downlink? The one shown in the screenshot shows FCntUp 6. But the downlink says reason: initial, so it seems to be a US915 initial ADR command, which should only be sent once in the lifetime of an ABP device, emphasis mine:

There are several moments when an ADR request is scheduled or sent:

  1. The initial ADR Request (for US915 and AU915). This is sent immediately after join and is mainly used to set the channel mask of the device. This one is a bit tricky, because we don’t have enough measurements for setting an accurate data rate. To avoid silencing the device, we use an extra “buffer” of a few dB here. This request is only needed with pre-LoRaWAN 1.1 on our v2 stack. With LoRaWAN 1.1 devices on our v3 stack, we can set the channel mask in the JoinAccept message. ABP devices pre-LoRaWAN 1.1 will only get this message once, if they reset after that, they won’t get the message again; this issue is also solved by LoRaWAN 1.1.

However, now that gateway traffic is (partly) routed through V3 components, maybe that’s no longer only sent once.

Do you see downlinks for every uplink?

Yes, all as expected. (The log entry duplicates: 1 actually indicates there is only a single occurrence, not two.)

So, the same DevAddr is used by 3 ABP devices in your region. After 2 MIC checks it found your device, hence it does not need to check the 3rd device as well. All just as expected, as explained in one of the links in my earlier answers. (This is not a problem for you, as you wrote that all your devices have a different DevAddr along with your single NwkSKey. So, the other 2 devices are someone else’s ABP devices, using a different NwkSKey.)

I still think that the key question is: do you see the missing uplinks in the gateway’s Traffic? That is, do you still see a device’s traffic in the gateway after that device stopped working at the 10-day mark?

Whilst figuring out what goes on with duplicated keys may be interesting, is this not like figuring out the exact dynamics of dropping lighted matches into a petrol filler on a car? If you don’t try that and drive normally (aka configure the nodes correctly), does it work?

Yes, I do apologize for the messy post :sweat_smile:, I will put more work into making my posts. I just got really excited because I haven’t seen a downlink (yes, it’s a downlink trace) in a long time. The uplink associated with this downlink was received by TTN, so one device broke through a 5-day period of silence.

There is a lot of speculation as to what the issue is right now, and I don’t like making these kinds of wishy-washy posts. In one to two weeks’ time, I will have more concrete data. I will hold off posting until I have that data to solve the problem with.

Your comments about what the trace log meant were very helpful! I’ll keep the initial ADR problem in mind as I debug.


Hey Descartes,
First, I love your username.
I meant recreate the uplink bug. My main goal is to make my devices uplink consistently, and since I don’t understand things very well, the only way I can make sure the problem is fixed is if I:

Recreate uplink bug -> implement fix -> uplink bug gone -> remove fix -> uplink bug reappears

Only then can I be confident that the fix addresses the core issue.

As per the other thread, I should be able to solve this for you with my tested Arduino + RFM95W code. I’ve four nodes based on this: one in the office, one at home, one outside with four DS18B20s, and the original low-power test. Apart from one cat-related incident, they just run; three are on 50,000+ uplinks, the original is on 122,341.

That’s pretty impressive. After scouring the internet, I found a new prime suspect which may be the cause of my issues: LMIC timing. It seems the Arduino’s millis counter is causing timing issues, as previously mentioned by CongducPham and matthijskooijman.
Does your Arduino implementation use a similar syncing scheme?

I spent some time reading the source but overall I’ve not “implemented” it, more just “included” it.

Changing the timing windows is part of the settings - my take on it is that it allows you to tweak the load your application places on the stack. I bypass this by using flags/semaphores to turn off my code, trigger an uplink, and only turn my code back on when it’s all done. This isn’t as energy efficient as it could be, as we could sleep whilst waiting for the downlink windows, but as the LMIC code is heavily reliant on a mini-RTOS, that really messes with it.

I’ll get the code into a repository shortly.

Be nice:


I took a look at TinyThing.ino. The code looks pretty clean and it has a good debug serial print structure =)

I do not see how you bypassed the LMIC millis timer issue though. The core difference I see between your code and mine is that you have a call to os_runloop_once() after every 8 seconds of sleep while I call os_runloop_once() only after I finish sleeping for the desired time (which can be hours).

Why did you decide to do it this way (calling os_runloop_once() after every 8 seconds of sleep)? The obvious (and probably noobish/incorrect) way is to do what I did. Is there something that necessitates frequent calls to os_runloop_once()?

I appreciate your help btw =) thanks for being patient with me.

There’s a lot of science and a lot of “Woo Hoo, it works” and a heap of pragmatism in how I arrived at this point. Scroll to the bottom for the TL;DR

After lots of wasted time tweaking the % values for the millis issue, once I stopped doing things that could take processing time away from LMIC during a send, it wasn’t an issue.

This is coupled with os_runloop_once(). It’s not totally clear in my code, but once I want to send, setting sleepOK = false effectively prevents any of my code from running so LMIC can do its work - unless we reach the next send interval, when it all gets overrun by a new cycle (but it copes, I tested).
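In outline, the pattern looks something like this (a hedged sketch assuming the MCCI LMIC library, not the actual TinyThing code; setup() with os_init(), LMIC_reset(), LMIC_setSession() and the lmic_pins mapping are omitted, and sleep8s() is the watchdog routine sketched a little further down):

```cpp
#include <lmic.h>
#include <hal/hal.h>

volatile bool sleepOK = true;          // false = LMIC is busy sending/receiving
const uint16_t TX_INTERVAL_TICKS = 75; // ~10 minutes, in 8 s watchdog ticks
uint16_t ticks = 0;

void sleep8s();                        // watchdog power-down, see further down

void onEvent(ev_t ev) {
  if (ev == EV_TXCOMPLETE) {
    sleepOK = true;                    // both RX windows done, safe to sleep
  }
}

void sendUplink() {
  static uint8_t payload[2] = { 0x01, 0x02 };       // placeholder reading
  sleepOK = false;                     // keep user code out of LMIC's way
  LMIC_setTxData2(1, payload, sizeof(payload), 0);  // port 1, unconfirmed
}

void loop() {
  os_runloop_once();                   // give LMIC's mini-scheduler a turn

  if (sleepOK) {
    sleep8s();                         // sleep in 8 s chunks...
    if (++ticks >= TX_INTERVAL_TICKS) {
      ticks = 0;
      sendUplink();
    }
    // ...and fall through, so os_runloop_once() runs after every chunk
  }
}
```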

I found a whole pile of commentary about the LMIC needing its internal timer to be updated after waking from sleep, and various suggestions on how to do so, most of which do not appear to have worked for people, as they were trying to patch the OS module’s timer values with complex calculations which seemed to mostly upset it.

Rather than using a library for sleep, I followed Nick Gammon’s detailed info on setting the sleep mode exactly as I want, along with his very useful tips on turning off the IO and the ADC.
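For an ATmega328P, that recipe boils down to roughly this (a sketch based on Gammon’s published notes, not the actual TinyThing code):

```cpp
#include <avr/sleep.h>
#include <avr/wdt.h>

ISR(WDT_vect) {
  wdt_disable();                      // just wake up; stop the watchdog
}

void sleep8s() {
  ADCSRA = 0;                         // turn the ADC off to save power
  MCUSR = 0;                          // clear reset flags
  WDTCSR = bit(WDCE) | bit(WDE);      // timed sequence: allow changes
  WDTCSR = bit(WDIE) | bit(WDP3) | bit(WDP0);  // interrupt-only mode, ~8 s
  wdt_reset();

  set_sleep_mode(SLEEP_MODE_PWR_DOWN);
  noInterrupts();
  sleep_enable();
  MCUCR = bit(BODS) | bit(BODSE);     // timed sequence: disable brown-out
  MCUCR = bit(BODS);                  // ...detection during sleep only
  interrupts();                       // next instruction is guaranteed to run
  sleep_cpu();                        // zzz... the watchdog interrupt wakes us
  sleep_disable();
}
```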

Having printed out and read the source code for MCCI LMIC, my personal take is that the use of the OS module to provide timing for the receive windows ended up with it spreading across the whole system, and from code review, it’s very much embedded in the design. So much so that I’ve seen ESP libraries that incorporate an RTOS for the whole system whilst the LMIC has its own mini-scheduler inside!

Part of this development journey saw the very first hand-soldered TinyThing last May using the osjob_t sendjob with its call to do_send(), and whilst I tried to use a similar mechanism for sensor checks and a feedback LED, it was getting messy. I don’t mind callbacks, I don’t mind a task scheduler (having written one in assembler for PIC), but I do like my code to follow a journey. As I wanted something that would be quick & simple for implementing a new sensor, considering some need a warm-up time (like the DHT11 & BME680), scheduling jobs for sensor turn-on, warm-up, read, turn-off and so forth was filling memory, creating code spaghetti, and wasn’t inspiring confidence.

Whilst I was doing the above, I was also trying other versions of MCCI, forks & implementations, and when I looked into the feasibility of tracing the LMIC internal code to check on memory & what it was actually up to, I lost DAYS of time and my sense of humour.

So, by a process of simplification-by-elimination over last summer, I entirely removed the user-facing jobs so I wasn’t relying on the mini-OS to run anything I wanted to control, but decided that giving the LMIC a chance to do anything I’d not spotted between sleeps was probably for the best. I appreciate this isn’t totally scientific, but as I’ve been running a software & electronics business for 26+ years and cutting code for over 40 years now, I know the power of pragmatism.

So, having hacked on this May-Sept, I needed to ship something. I had a test Thing running on my desk for some time by then, so I could see I’d reached some level of stability. I moved on to low-power mods and ended up with a test Thing running on solar, blocking the view out of my office window. I created the PCB, made the first desk Thing, and then built what we call the AA Power Test and put it in a plant pot on the decking outside.

From my time spent watching the Arduino Serial Monitor whilst spamming both my gateway & TTN with repeated small uplinks, I suspect memory management can’t keep up in the face of using various libraries - many of which, upon investigation, suffer badly from bloat. I think the MCCI LMIC library has suffered too, but until I’ve written my own LoRaWAN stack, I’ll keep my detailed opinions to myself. All I’d say is: no jobs, use a timer interrupt!

There have been some cleanups & tweaks and a fair number of deployments, but as soon as I added the BME680 library for a project, I knew I wasn’t heading in the right direction, so started thinking about where to go. I do use an Arduino Nano Every for projects that need to be Arduino + LoRa chip. But now that I have a TinyThing Mk2 with a RAK4200 in place of the RFM95W, I get most of my flash & RAM back. It’s a toss-up between using a RAK4260 or going ATmega4809 + LoRa chip; I’ll probably do both.

Having woken up this part of my coding brain, I will look again at some of the details, but I’m not inclined to burn up time like I did previously on this. It may be that we can eliminate the os_runloop_once() during sleep; that’s simple enough to try, and I can see what impact it has on the battery. I have more hardware now that I can use to measure & log current consumption far more accurately, so I may take some readings to bring some science to the power consumption for this combination. From my experience using batteries on balloons, I sort of just know that the AA Power Test is doing very well, considering the LED and serial port.

TL;DR: No millis tweaking appears to be needed if you aren’t running anything else during a send; research & experimentation found that os_runloop_once() every 8 seconds worked, and I’ve already wasted too much time to investigate further.
