TTIG - No downlinks after a few hours, until power cycling the gateway

novag · December 4, 2019, 9:52am

Hi,

my TTIG runs in an environment where it receives a packet every 30 to 40 seconds so luckily I do not suffer from the force disconnect problems.

Nevertheless I can observe that a few hours after powering the TTIG I can’t receive any downlink message from it anymore. I can reproduce it using an OTAA node with ADR enabled.
The initial Join procedure is working perfectly and I can send downlink messages to the node in the next few minutes without problems. I can access a debug console on the node so I know exactly what it’s doing and even see if it received a corrupted packet.
What I can observe is that the first ADR downlink after 64 packets (in this case approximately 2 hours) is received and confirmed by the node. After 4 hours another ADR downlink is forced but this time it’s not received by node and TTN starts to resend it after every uplink but it’s never received.
Using another node I can confirm that the TTIG did not send any packet. I also can’t receive any manually scheduled packets anymore. However all the packets are visible on the gateways traffic page.

Power cycling the TTIG resolves the issue for the next few hours. Afterwards it stops working again.

I can rule out that the TTIG runs into problems with the 1% regulation or TTN’s fair-use policy. At the beginning of the repeated downlinks there is plenty of air time left.

Has anyone been able to observe this behavior as well? What could be the reason? What exactly does a downlink packet appearing on the gateway traffic page mean?

Thank you very much!

arjanvanb · December 4, 2019, 12:02pm

If you’re into opening your TTIG, you might add an UART debug connector to see its logs as well.

Did you ever let it run for hours after the downlinks stopped working? Did you ever see it resolve automatically after another few hours?

And just to be sure: when power cycling the TTIG, you do not restart the node as well, right?

Background:

If it resolves automatically, then it might be some overflow error. The Kickstarter gateway also suffers from that, but only for a very small moment in time, erroneously rejecting just a single downlink for being “too late”, and then being fine again. But maybe there’s some more severe mismatch in the Basic Station code. Code like the following is suspect for missing out on overflows, as it does not compare a time difference between two values of the same type, which would automatically handle rollover:

if( txjob->txtime < earliest )

However, the above Basic Station code applies to the gateway calculating some alternative downlink time itself, so does not apply to your problem. Also, maybe the number types allow for very large values, not running into any rollover at all. Still then, if the problem resolves after some time (and comes back some time later), then it might be caused by code that is not rollover-safe.

novag · December 4, 2019, 1:29pm

Yes thanks, I will do that when I find some time.

Yes, I let it run for some hours but it did not change anything. I had to powercycle it to get it working again.

Yes of course, the node was running all the time and nothing was changed on the nodes side.

Do you know if the TTIG is running the basicstation code base? It’s interesting what you brought up, I will further look into that.
Anyway I assume that there are some buffer/stack overflow bugs in the current firmware as my TTIG sometimes, after powering up, advertises itself using DHCP as ‘ESP_D29…’ (which I assume is fine) but most of the time it announces itself as ‘/s2/tc.key’ which seems to be a string used in the firmware but should not be announced using DHCP as its network name.

Do you know who is responsible for the firmware? Does TTI own the sources or is another company “maintaining” the TTIG?

arjanvanb · December 4, 2019, 4:51pm

Yes, it is. (Along with some server-side code that translates the Basic Forwarder protocol into something suitable for the community stack which is at V2.)

See:

novag · December 5, 2019, 11:41am

I had some time to attach a UART console to the TTIG.

Yesterday at around 7:30 pm I booted the node and the TTIG. In the first few minutes I sent a few dummy downlink messages to verify that the downlink is working. The node sends an uplink message every 5 minutes.
After 64 uplink messages the node forced an ADR request which was sent through the TTIG and successfully processed. I can see in the logs that the Node switched to SF7.
After the next 64 uplink messages the node again forced an ADR request which was sent by TTN but never received on the node and thus TTN repeated the ADR until I woke up in the morning.
Now it gets pretty strange and I’m happy for any idea that could shed some light on this.

The following log of the TTIG shows a dummy message I sent yesterday to check if downlink works:
Uplink: 407E2301268003000117E557EABB56F570042D
Dowlink: 607E2301268501000341FF0001BBF92DB0390B

2019-12-04 18:31:27.052 [any:INFO] HW pipe: PowerFolder node: [1337]-[F7DQCVTLdrGprAbCfys1]
2019-12-04 18:31:33.444 [SYN:INFO] MCU/SX1301 drift stats: min: 0.0ppm q50: 1.9ppm q80: 2.9ppm max: 3.3ppm - threshold q90: 2.9ppm
2019-12-04 18:31:33.449 [SYN:INFO] Avg MCU drift vs SX1301#0: 1.0ppm
2019-12-04 18:31:42.590 [S2E:VERB] RX 868.3MHz DR2 SF10/BW125 snr=10.5 rssi=-10 xtime=0x7F0000394564DC - updf mhdr=40 DevAddr=2601237E FCtrl=80 FCnt=3 FOpts= 0117E557…BB56 mic=755265781 (19 bytes)
2019-12-04 18:31:43.855 [S2E:DEBU] ::0 diid=6678 [ant#0] - next TX start ahead by 716ms365us
2019-12-04 18:31:44.562 [S2E:VERB] ::0 diid=6678 [ant#0] - starting TX in 19ms903us
2019-12-04 18:31:44.577 [S2E:INFO] TX ::0 diid=6678 [ant#0] - dntxed: 869.525MHz 27.0dBm ant#0(0) DR3 SF9/BW125 frame=607E2301268501000341FF00…2DB0390B
2019-12-04 18:31:44.736 [S2E:DEBU] Tx done diid=6678

The log looks sane and the message was received.

This morning I check what happened with the repeated ADR request downlinks that are not received anymore:
Uplink: 407E23012680A00001956AC0556A50E43CFAAC
Downlink: 607E2301268531000355070001E3CC1762

2019-12-05 07:48:19.726 [SYN:VERB] Time sync rejected: quality=1070 threshold=1059
2019-12-05 07:48:23.932 [SYN:INFO] Time sync qualities: min=1049 q90=1069 max=1445 (previous q90=1059)
2019-12-05 07:48:24.654 [S2E:VERB] RX 868.1MHz DR0 SF12/BW125 snr=10.2 rssi=-25 xtime=0xC2000ADA06D8C4 - updf mhdr=40 DevAddr=2601237E FCtrl=80 FCnt=160 FOpts= 01956AC0…6A50 mic=-1392886556 (19 bytes)
2019-12-05 07:48:46.238 [S2E:VERB] ::0 diid=43454 [ant#0] - starting TX in 19ms822us
2019-12-05 07:48:46.254 [S2E:INFO] TX ::0 diid=43454 [ant#0] - dntxed: 869.525MHz 27.0dBm ant#0(0) DR3 SF9/BW125 frame=607E230126A5230003550700…FA35372F
2019-12-05 07:48:46.413 [S2E:DEBU] Tx done diid=43454
2019-12-05 07:48:46.415 [S2E:DEBU] ::0 diid=43695 [ant#0] - next TX start ahead by 5m4s
2019-12-05 07:48:47.061 [SYN:VERB] Time sync rejected: quality=2068 threshold=1069

Look at the timestamps when the downlink is sent. It is sent 21.6 seconds after the uplink packet is received. If you look at the downlink frame payload you can see that it does not match the downlink TTN sent. The frame counter value of the TTN downlink is 49 whereas the value of the frame that is sent is 35.

It also schedules a downlink for diid 43695 5 minutes after diid 43454. This is also suspicious. How does it know that the node will send again in 5 minutes?

By checking the MIC I can confirm that both payloads are for the current node. The missing byte replaced by ‘…’ is 01. It looks like the TTIG completely scrambled the downlink packets and the downlink times.

Here’s another example:
Uplink: 407E23012680A20001B902A47B94F34C88D329
Downlink: 607E2301268533000355070001D77A767F

2019-12-05 07:58:34.658 [SYN:INFO] Time sync qualities: min=1046 q90=1070 max=1602 (previous q90=1059)
2019-12-05 07:58:35.543 [SYS:DEBU] Free Heap: 18824 (min=15488) wifi=5 mh=7 cups=8 tc=4
2019-12-05 07:58:35.574 [S2E:VERB] RX 868.3MHz DR0 SF12/BW125 snr=8.5 rssi=-24 xtime=0xC2000AFE70C3FC - updf mhdr=40 DevAddr=2601237E FCtrl=80 FCnt=162 FOpts= 01B902A4…94F3 mic=701728844 (19 bytes)
2019-12-05 07:58:56.052 [S2E:VERB] ::0 diid=43965 [ant#0] - starting TX in 19ms689us
2019-12-05 07:58:56.068 [S2E:INFO] TX ::0 diid=43965 [ant#0] - dntxed: 869.525MHz 27.0dBm ant#0(0) DR3 SF9/BW125 frame=607E23012685250003550700…6C009B58
2019-12-05 07:58:56.227 [S2E:DEBU] Tx done diid=43965
2019-12-05 07:58:56.229 [S2E:DEBU] ::0 diid=44211 [ant#0] - next TX start ahead by 5m5s

I also found the following lines in the log:

2019-12-05 07:59:07.241 [SYN:INFO] MCU/SX1301 drift stats: min: 0.5ppm q50: 1.9ppm q80: 2.9ppm max: 5.2ppm - threshold q90: 3.3ppm
2019-12-05 07:59:07.247 [SYN:INFO] Avg MCU drift vs SX1301#0: 1.0ppm
2019-12-05 07:59:09.352 [SYN:ERRO] Repeated excessive clock drifts between MCU/SX1301#0 (3 retries): 3.6ppm (threshold 3.3ppm)

From the uplink times I think the general timing is okay. I’m not sure what exactly is going wrong here. I will keep looking into it.

Thanks!

arjanvanb · December 5, 2019, 12:30pm

Using the MQTT API you might also see the downlinks, including the time they were queued, though I’m not sure ADR is included in that (maybe that only includes application downlinks). That might also tell you if things are queueing up (for which the backend should set the “downlink pending” flag to make the node transmit another uplink if it wants to).

As the MIC validates, I guess we agree you’re seeing old (queued, repeated, …?) downlinks, given the lower frame counters. Next, I wonder if the 2nd line below tells one when the TTIG received the downlink from TTN:

It seems these lines are not logged when a “dnmsg” command is received by the TTIG, but only when it’s handling its internal TX queue. So it seems one cannot tell when the TTIG receives commands from TTN.

If lines like this are logged upon arrival of the downlink commands, then the time those arrive at the TTIG clearly is weird. If, like for other forwarders, the Basic Station protocol includes some relative time with each uplink, and the backend just adds 1,000,000 or 2,000,000 microseconds to that value to tell the TTIG when to transmit, then I could imagine that relative timestamps in the metadata of old queued downlinks mess up the time a downlink is actually transmitted, like due to rollover of the timers. But would that also make the backend (or the intermediate bridge that sits between the Basic Station protocol and the V2 backend) forward the downlink to the TTIG so late after an uplink?

Any chance there’s something different for the first downlink that went wrong? Or maybe sometimes the very same downlink is forwarded to the TTIG multiple times?

Also, I think we need @KrishnaIyerEaswaran2 here.

novag · August 8, 2020, 10:45am

The problem was fixed on the server side a few months ago.

pomplesiegel · August 11, 2020, 4:03pm

I believe I’m seeing this exact behavior happening on my Indoor Gateway! I had a few nodes totally unable to see downlink messages (even though TTN v2 thought they were sent). Once I unplugged and replugged the Things Indoor Gateway from the wall, the messages went through without issue.

I was also having a join problem for a node which was resolved by power cycling it too.

Any idea what could be going on here?

pomplesiegel · August 12, 2020, 12:04pm

@novag, has this totally gone away for you? For me it is alive and well

novag · August 12, 2020, 12:46pm

@pomplesiegel Yes I think it’s gone. I would suggest you to enable the USB-C serial console or attaching a UART adapter like this: https://tinkerman.cat/post/hacking-the-tti-indoor-gateway/#peaking-into-the-firmware. Both requires opening the TTIG.

pomplesiegel · August 12, 2020, 3:12pm

@novag, ok I’ll look into that. Thank you!

In the meantime I am having sensors that cannot receive downlink messages after running overnight again. I just plugged in my original “The Things Gateway” and unplugged the “The Things Indoor Gateway”, resent the message and it went through, no problem.

I’ll keep testing overnight with the “The Things Gateway”, but this is so frustrating!