MAC commands being sent each time after Join Process

Hi all,

Recently, all my devices have been receiving a downlink of 0 bytes, which I am assuming is a MAC command, as the LMIC library is sending a message back to TTN, as per the verbose messages here:

I made the assumption that it’s to do with ADR, but I’ve now turned off ADR in the general settings for my device and I still get this additional message. However, this verbose mode output is still talking about ADR :confused:
[images: device verbose log output referencing ADR]

Has something changed recently?

My firmware has the command:
ttn_set_adr_enabled(false);
and has used that command for well over a year now and never experienced these ADR MAC commands before.
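(A minimal sketch of where that call sits, assuming the ttn-esp32 v4 C API; the function names other than ttn_set_adr_enabled are from memory and should be treated as assumptions:)

// Sketch: ADR is disabled before the join, so every uplink should carry FCtrl.ADR = 0.
// ttn_provision()/ttn_join() names assumed from the ttn-esp32 v4 C API.
ttn_provision(dev_eui, app_eui, app_key);   // identifiers defined elsewhere
ttn_set_adr_enabled(false);                 // the call quoted above
if (ttn_join()) {
    // joined; subsequent uplinks should show FCtrl.ADR = false in the console
}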

I am starting to see more devices perform this extra step, which adds around 10 seconds of waiting per device at a time when our users are likely to be impatient. 10 extra seconds is a long time when you’re in a rush.

Main question: How do I stop this additional ADR/MAC command from coming through?

Cheers,
Dylan

What are your RSSI values?

Lab testing it’s coming in at -11 when I’m under 1 m from the gateway; field deployments sit between -90 and -120. It happens in all situations, consistently after the first join.

How does the downlink add an extra 10 seconds? And given that the expected metric for overall packet loss is 10%, what on earth is this device doing that people are in such a rush?

There is another thread discussing this and I am performing tests with additional logging to investigate.

Eshi, how long do you think your gateway or node is going to last? Devices are built for roughly the -80 to -135 range or thereabouts.

The users will check the devices every 1-6 months depending on conditions. When a device is checked, they press a button which resets the device and initiates the join process (it’s common for the device to be moved upon inspection, potentially even to an entirely different location). Once the device has joined, it sends a data packet with generic information about the sensors with no load applied to them. The device is then loaded and the button is pressed again, sending the sensor data for the loaded device.

There may be dozens of devices per site, so for usability we are trying to shorten this start-up sequence as much as possible. We could potentially skip the join sequence when the device is reset, but a reset seems the best time to join (the device could have been offline for weeks, just provisioned, etc.), as this is being designed to be as simple as possible to use: a single button and as few presses as possible.

Just timed it: instead of my device finishing its transmission sequence and entering its next provisioning stage (where the user loads the device), it has about 10 seconds of additional time before it’s finished transmitting.

It takes about 11 seconds to go through the transmit/wait for downlink window process without anything else happening. It takes around 18 seconds when the additional command is sent.

In response to user feedback on our device, we cut one of the button presses in our provisioning process, saving 10 seconds (woohoo…), and now that 10 seconds has returned in a way that I can’t seem to remove.

I understand that 10 seconds may not seem like a lot, but we have tight requirements to meet and technically we should be able to meet them.

I hope this explains what I am trying to achieve and justifies the steps I’m taking.

Cheers,
Dylan

Is the 10 sec in the field or under these ‘let’s shout in each other’s ears and leak channels and distort signal’ conditions :wink: You should be bench testing with <<-40, preferably <-50 signal levels; once you get above -35 to -40 you are in problem territory, and once above -20 to -25, depending on device/design, you are even at risk of damaging input stages…

Search the forum; the recommendation is >3 m range, better yet >5 m, and if possible go >10 m with an absorber such as a wall or decent window in between to limit the signal magnitude and the risks above. Also, so close in, you are almost operating in the near field rather than the far field from an RF perspective. Run several tests where you are at around -70 to -85 RSSI as seen by the GW Console and let us know if it is still a repeatable issue…

You’ve not come up with any good reason for a MAC downlink to add 10 seconds - it sends a Join (not good, but we’ll run with it for now), it has to wait 5 seconds for the response & another 1 second if nothing comes in the first Rx window. It takes a millisecond to process the MAC command and then in theory it is available to do an uplink, depending on regional settings, as dwell time may preclude an uplink immediately after a join. This is a legal thing.

Whereas @Jeff-UK’s suggestion is far more likely to be the cause of delays. There’s no point testing in such an unrealistic way.

As for the Join, you could join once per site, i.e. at the start of the day, or just use ABP.

It’s a bit hard to read your opening paragraph as we use the words device & node interchangeably, but it looks like you are using device & device to refer to two different items. I think you are saying that the node is plugged into a box, the button is pressed, it sends a no-load uplink, you add sensors to the box and then do another uplink? Doing a Join should not require extra buttons.

Like many queries here, if you give us just a little more detail we may be able to make far better suggestions - like for instance a status LED that gives the user some feedback via blinking or colour change. And the considerations of using ABP.

I’m not sure how I haven’t got a good reason for a MAC downlink to be adding 10 seconds? That’s what it’s doing?

You can see from the timestamps in the images I’ve uploaded that the MAC payload is scheduled at 21:12:55 and the last receive is at 21:13:08. That’s 13 seconds later than when the node otherwise would have moved into its next state.

As for @Jeff-UK’s comments, this is now happening on our devices regardless of distance; our field-deployed devices are 100-1000 m away from the gateway. We haven’t experienced this before.

One of my questions was: has something recently changed (in the context of TTN)?
It’s sounding like it hasn’t?

We have LEDs to indicate state changes and they work well. This is how the users know they have to wait…

I understand how ADR works, which is why I’m here asking why I’m getting MAC/ADR commands sent as a downlink when I’ve disabled ADR in the firmware and on TTN. I shouldn’t be getting any commands sent, regardless of the RSSI.

There are reasonably regular updates to TTS(CE) - the most recent was a week last Monday morning when it updated to V3.18.0. When did your devices’ behaviour change? Which specific hardware (radio/MCU) and software/firmware/LoRaWAN stack are you using? You mentioned LMIC earlier - can you link to the source/library used and confirm which revision you are running, please? Have you made any changes to the stack? (e.g. deleted unneeded sections to fit in memory constraints, etc.)

I’m using this library:

I haven’t made any changes to the stack (there are some Git issues I submitted with suggested changes; however, I have no current issues with the stack that required any modifications).
I am using an ESP32-S2 MCU and the SX1276 LoRa chip.
Using esp-idf V4.4.
I don’t recall when I first noticed the behaviour I’m seeing now, but it was within the last two weeks.
I do firmware updates regularly, so I thought it was something on the firmware end, however I have returned to much older firmware on a few devices and the behaviour remains.

I’m still not understanding how the Network Server scheduling a downlink will cause the device to encounter some sort of delay.

There is an uplink at 21:12:55, then some processing which the device will be oblivious to, as it’s happening on servers far, far away, but it occurs in time for the Rx window. So when the device uplinks at 21:13:01 it includes an acknowledgement of the request along with the payload; the NS processes & queues another MAC command that gets to the device in time, which, when it uplinks again at 21:13:08, is acknowledged along with the third payload.

The MAC processing is independent of the payload and has no impact on your application; MAC commands are solely for getting the best parameters for the network setup on the device. The device has to hold the Rx windows open - we don’t know if the downlinks arrived in Rx1 or Rx2, but if we assume Rx2, then that would tally with the 13 second timescale from first to third uplink in the logs.
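A rough reconstruction under that Rx2 assumption (taking TTS’s default RX1 delay of 5 seconds, so Rx2 opens 6 seconds after each uplink) would be:

21:12:55  uplink #1
21:13:01  Rx2 downlink with a MAC request, answered straight away by uplink #2
21:13:07  Rx2 downlink with the next MAC request
21:13:08  uplink #3 carrying the final acknowledgement

which accounts for the 13 seconds between the first and third uplinks.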

There is no computational load on a device for the MAC commands, they are literally just copied over in to the internal data structures, so would not add a multi-second delay to the next uplink.

Most significantly, you are expecting just two uplinks whereas the device ends up queuing three. So you need to figure out what each of the three payloads contains and why it may have taken an additional uplink for the device to reach the state you are expecting.

The backend can’t magically make a node transmit; there has to be a downlink which contains something that makes the node decide it needs to transmit. The only thing I can think of would be a confirmed downlink, and it would very much surprise me if the backend created those.

The best way to analyse this is to capture the entire set of join packets and uplinks/downlinks in the console, export that, and then look at the contents of all packets with a LoRaWAN packet decoder to see what is actually being exchanged - decode the MAC command contents as well. The console doesn’t show the juicy details required to tackle this issue properly.


I am seeing something similar. I have disabled ADR in LMIC and it doesn’t look like I’m seeing that any more. So what’s left seems to be some other link setup. I am sure I was getting 2 uplinks I didn’t initiate before I disabled ADR. It’s too late for me to check on that again today.
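(For reference, the call that disables ADR in LMIC is LMIC_setAdrMode(); a minimal sketch, with the placement after LMIC_reset() being my assumption:)

#include <lmic.h>
#include <hal/hal.h>

void lorawan_setup() {
    os_init();           // initialise the LMIC run-time
    LMIC_reset();        // reset MAC state; session and pending data are discarded
    LMIC_setAdrMode(0);  // disable ADR so uplinks carry FCtrl.ADR = false
    // ... start the join / first uplink as usual
}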

I’ve downloaded the messages from the TTN console and am using the packet decoder linked to above to examine them. I am doing my best to decode the FOpts; please forgive me if I’m wrong.

I’m wondering what happens if I don’t let LMIC free-run and send these uplinks. Our old LMIC code doesn’t use the LMIC task scheduling and only calls os_runloop_once in a tight loop between us initiating an uplink and getting a TX_COMPLETE event, so LMIC never gets a chance to think about its internal state, aside from joining.
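(A minimal sketch of that tight-loop pattern; txComplete is a hypothetical flag name:)

#include <lmic.h>

static volatile bool txComplete = false;

void onEvent(ev_t ev) {
    if (ev == EV_TXCOMPLETE) {
        txComplete = true;          // uplink done and both Rx windows closed
    }
}

void send_and_wait(uint8_t port, uint8_t *data, uint8_t len) {
    txComplete = false;
    LMIC_setTxData2(port, data, len, 0);  // 0 = unconfirmed uplink
    while (!txComplete) {
        os_runloop_once();          // LMIC only gets CPU time while we spin here
    }
    // the device sleeps after this, so anything LMIC schedules later never runs
}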

Anyway, here are the messages.

My first uplink. I think the 0D in FOpts is my network time request.

TS: 2022-04-11T04:50:26.477855356Z

Assuming base64-encoded packet
QL6EDSYBAAANCs6EZFZUZQ==

Message Type = Data
  PHYPayload = 40BE840D260100000D0ACE8464565465

( PHYPayload = MHDR[1] | MACPayload[..] | MIC[4] )
        MHDR = 40
  MACPayload = BE840D260100000D0ACE84
         MIC = 64565465

( MACPayload = FHDR | FPort | FRMPayload )
        FHDR = BE840D260100000D
       FPort = 0A
  FRMPayload = CE84

      ( FHDR = DevAddr[4] | FCtrl[1] | FCnt[2] | FOpts[0..15] )
     DevAddr = 260D84BE (Big Endian)
       FCtrl = 01
        FCnt = 0000 (Big Endian)
       FOpts = 0D

Message Type = Unconfirmed Data Up
   Direction = up
        FCnt = 0
   FCtrl.ACK = false
   FCtrl.ADR = false
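
(For reference, the DeviceTimeReq above would have been queued with MCCI LMIC’s network-time API, roughly as below; the callback name is hypothetical:)

// Sketch: queue a DeviceTimeReq (CID 0x0D) to ride along in the next uplink's FOpts.
static void onNetworkTime(void *pUserData, int flagSuccess) {
    lmic_time_reference_t ref;
    if (flagSuccess && LMIC_getNetworkTimeReference(&ref)) {
        // ref.tNetwork holds GPS time; ref.tLocal is the matching local tick
    }
}

LMIC_requestNetworkTime(onNetworkTime, nullptr);  // before the next uplink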

The response from the server. FOpts has 0D again, with 5 bytes 24767E4F46, which I guess is the time. Then 07, a new channel req? If so, that takes another 5 bytes 07A0AF8C50, which leaves 0905: TxParamSetupReq with its single byte of data.

TS: 2022-04-11T04:50:26.699742103Z

Assuming base64-encoded packet
YL6EDSaOAAANJHZ+T0YHB6CvjFAJBXKZBn8=

Message Type = Data
  PHYPayload = 60BE840D268E00000D24767E4F460707A0AF8C5009057299067F

( PHYPayload = MHDR[1] | MACPayload[..] | MIC[4] )
        MHDR = 60
  MACPayload = BE840D268E00000D24767E4F460707A0AF8C500905
         MIC = 7299067F

( MACPayload = FHDR | FPort | FRMPayload )
        FHDR = BE840D268E00000D24767E4F460707A0AF8C500905
       FPort = 
  FRMPayload = 

      ( FHDR = DevAddr[4] | FCtrl[1] | FCnt[2] | FOpts[0..15] )
     DevAddr = 260D84BE (Big Endian)
       FCtrl = 8E
        FCnt = 0000 (Big Endian)
       FOpts = 0D24767E4F460707A0AF8C500905

Message Type = Unconfirmed Data Down
   Direction = down
        FCnt = 0
   FCtrl.ACK = false
   FCtrl.ADR = true
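
A quick way to sanity-check this kind of FOpts walk is a table of the network-to-device MAC command payload lengths from LoRaWAN 1.0.x; a minimal sketch (the function and table names are mine):

#include <stdint.h>
#include <stdio.h>

// Payload length per downlink CID (LoRaWAN 1.0.x); -1 = not handled here.
static const int8_t dn_payload_len[16] = {
    -1, -1,
     2,   // 0x02 LinkCheckAns
     4,   // 0x03 LinkADRReq
     1,   // 0x04 DutyCycleReq
     4,   // 0x05 RXParamSetupReq
     0,   // 0x06 DevStatusReq
     5,   // 0x07 NewChannelReq
     1,   // 0x08 RXTimingSetupReq
     1,   // 0x09 TxParamSetupReq
     4,   // 0x0A DlChannelReq
    -1, -1,
     5,   // 0x0D DeviceTimeAns
    -1, -1,
};

void walk_fopts(const uint8_t *fopts, int len) {
    int i = 0;
    while (i < len) {
        uint8_t cid = fopts[i++];
        int8_t plen = (cid < 16) ? dn_payload_len[cid] : -1;
        if (plen < 0 || i + plen > len) {
            printf("unknown or truncated CID %02X\n", cid);
            return;
        }
        printf("CID %02X with %d payload byte(s)\n", cid, plen);
        i += plen;
    }
}

Run on the FOpts above, it reports 0D (5 bytes), 07 (5 bytes), 09 (1 byte), matching the DeviceTimeAns / NewChannelReq / TxParamSetupReq reading.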

Here is the uplink I didn’t initiate. 0703 could be the new channel answer, leaving 09 as the tx params setup answer, which has no payload. This was sent by LMIC while my code was just letting it run via os_runloop_once() being called every time loop() is called - so a tight loop. There is no port or payload; my code is running the standard LMIC-node, so it uses port 10 with a 2-byte counter as payload.

TS: 2022-04-11T04:50:59.583994131Z

Assuming base64-encoded packet
QL6EDSYDAQAHAwnGfM01

Message Type = Data
  PHYPayload = 40BE840D26030100070309C67CCD35

( PHYPayload = MHDR[1] | MACPayload[..] | MIC[4] )
        MHDR = 40
  MACPayload = BE840D26030100070309
         MIC = C67CCD35

( MACPayload = FHDR | FPort | FRMPayload )
        FHDR = BE840D26030100070309
       FPort = 
  FRMPayload = 

      ( FHDR = DevAddr[4] | FCtrl[1] | FCnt[2] | FOpts[0..15] )
     DevAddr = 260D84BE (Big Endian)
       FCtrl = 03
        FCnt = 0001 (Big Endian)
       FOpts = 070309

Message Type = Unconfirmed Data Up
   Direction = up
        FCnt = 1
   FCtrl.ACK = false
   FCtrl.ADR = false

The next uplink. Another of mine, with the expected port & payload and no FOpts.

TS: 2022-04-11T04:51:32.382424763Z

Assuming base64-encoded packet
QL6EDSYAAgAKiSTtuQUF

Message Type = Data
  PHYPayload = 40BE840D260002000A8924EDB90505

( PHYPayload = MHDR[1] | MACPayload[..] | MIC[4] )
        MHDR = 40
  MACPayload = BE840D260002000A8924
         MIC = EDB90505

( MACPayload = FHDR | FPort | FRMPayload )
        FHDR = BE840D26000200
       FPort = 0A
  FRMPayload = 8924

      ( FHDR = DevAddr[4] | FCtrl[1] | FCnt[2] | FOpts[0..15] )
     DevAddr = 260D84BE (Big Endian)
       FCtrl = 00
        FCnt = 0002 (Big Endian)
       FOpts = 

Message Type = Unconfirmed Data Up
   Direction = up
        FCnt = 2
   FCtrl.ACK = false
   FCtrl.ADR = false

Depends on how you have your device set up - ABP or OTAA.

An OTAA Join Accept can configure most things. With ABP, some items depend on how you have the device configured in the console - if you’ve told the console all the details, you’ll get fewer MAC commands.

But also see this: https://www.thethingsnetwork.org/forum/forum/t/adr-features-recently-introduced-to-tts-march-2022/55662?u=descartes

You definitely need to let LMIC contemplate its internal state, and you need to either operate its scheduler or replace it with something giving similar functionality to support LMIC’s network-needs logic.

In terms of making a low power sleep interact, what you want to do is check if LMIC has anything scheduled, and how far out that is. The functions for this are a little bit awkward, but you can reshuffle them to be more convenient and just give you a time to the next event, or some large number if there is none, then sleep until the sooner of what LMIC wants or what you want, perhaps gated by a minimum below which you might as well just busy-wait.
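In MCCI LMIC that check exists as os_queryTimeCriticalJobs(); a minimal sketch of the gating described above (the 15-second window and the sleep routine are placeholders):

// Only sleep if LMIC has nothing time-critical scheduled inside the window.
const ostime_t window = sec2osticks(15);

if (!os_queryTimeCriticalJobs(window)) {
    enter_low_power_sleep_ms(15000);   // hypothetical platform sleep call
    // on wake, correct the tick source for the time slept (see further down)
} else {
    os_runloop_once();                 // let LMIC run its pending job(s)
}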

In terms of configuration downlinks: while in some regions and LoRaWAN versions an OTAA node can perhaps be completely configured in the join accept, in others it cannot, and so a few additional configuration downlinks will be sent. Given that joins, or loss of configured state under ABP, should be very rare in the service life of a device, the need is to make sure that these are correctly handled, rather than to necessarily handle them with power efficiency. If they’re repeating, there’s a bug, probably in the node failing to ack or losing state, though possibly, in unique cases, in the network.

While a configuration downlink must be acked, technically there is no time deadline to do so - the node could wait a few days to respond. What’s key is that it doesn’t send another uplink that fails to ack the configuration downlink, as that’s what the network would treat as the trigger for a resend. Though it’s unclear how delaying the ack beyond what’s needed for duty-cycle considerations would help much - especially given how rare these events would be for a node properly retaining session state, sending an ‘empty’ uplink with only MAC acks but no useful application payload isn’t a tragedy when it’s such a rare situation.

To more simply summarize:

  • you can arbitrarily delay anything that LMIC wants to do other than enact the receive windows
  • but you can’t deny it, you have to let LMIC do what it wants to

Thanks cslorabox.

I know of the technique where you ask LMIC if it is busy for some amount of time, but I need to examine the source more closely to answer the questions below.

I’m on leave ATM and spent the whole of my first day looking into it, so will probably not do any more until next week.

I’ve seen other people doing the same as us - issuing an LMIC_setTxData2, going into a tight loop calling os_runloop_once() based upon a flag, setting that flag to break from the loop in the TX_COMPLETE event handler, then going to sleep until we next want to measure and send.

So it would be good to get a clear idea of whether this works properly! Perhaps it worked with TTN v2 and v3 doesn’t like it as much, or perhaps it’s fine if the config is all done before TX_COMPLETE is issued, and config answers are queued for the next uplink.

  1. Does it take the config reply uplinks into account? I.e. are they scheduled as jobs that can be seen by the job queries?

  2. Are the config queries handled like the time req, where you set a flag and the answer is included in the next uplink? If so, then #1 doesn’t matter because the answers to the config reqs will be sent on my next uplink.

  3. Are the internal state updates requested by the server handled before TX_COMPLETE is issued?

  4. Does LMIC want to do anything between issuing a TX_COMPLETE event and the next call to os_runloop_once?

  5. If I wake up and call os_runloop_once(), is that enough to get LMIC doing what it needs to do, and into a state I can check so I know whether to let it keep going until some defined and visible end state where I can sleep again?

The simple answer remains that you need to check if LMIC has something scheduled, and not sleep in a way that would prevent that from happening when scheduled. Once you determine how far out the next scheduled job is, you can sleep for (almost) that long.

People who try to defeat LMIC’s scheduler without understanding it enough to fully replace it are asking for trouble.

Terry confirmed in an LMIC repo issue that the way to go is to check the schedule, as others have suggested, which I read as meaning it will take into account any uplinks it wants to send itself.

I tried this with a time of 15 mins, and it only ever said I could sleep after a join completion. Even when nothing was left to do, 15 mins didn’t come back as ok to sleep.

I then tried 10 and 15 seconds and had better results, with an okay to sleep result most times.

But assuming LMIC needs the Arduino millis() result kept up to date, there’s more involved than just calling os_queryTimeCriticalJobs(sec2osticks(15)); if we want to sleep and keep LMIC happy.

Modifying hal_ticks() to use its own source that survives sleep mode might be easier than keeping millis() up-to-date.

That’s why it’s been suggested to reshuffle things so that you can find out the actual time until the next job, rather than play guessing games.

But assuming LMIC needs the Arduino millis() result kept up to date

Yes, you need to fix up the clock to account for the time slept
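A minimal sketch of that fixup, assuming an Arduino-style HAL where hal_ticks() is derived from millis(); account_for_sleep() and corrected_ticks() are hypothetical names:

#include <Arduino.h>
#include <lmic.h>

// millis() freezes during deep sleep, so track the slept time and add it back
// in when converting to OS ticks, keeping LMIC's clock monotonic across sleep.
static uint32_t slept_ms = 0;

void account_for_sleep(uint32_t ms) {    // call immediately after waking
    slept_ms += ms;
}

// A modified hal_ticks() would then scale the corrected millisecond count
// instead of raw millis():
u4_t corrected_ticks() {
    return (u4_t)(((uint64_t)millis() + slept_ms) * OSTICKS_PER_SEC / 1000);
}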