Missing messages between gateway and application

I’ve got a curious problem: I can see messages arriving at the gateway, but all but a couple of them never reach the application.

I’ve tried a bunch of things to understand what might be going on, including spending hours with the sensor manufacturer’s application engineer, and we still have no idea what is happening. But at least the problem is reproducible!

This is with a new-to-the-market internal air quality sensor, so I don’t want to mention the manufacturer straight away, as this could just be teething trouble with a new device. I’ve tried both OTAA and ABP with the same results, and tried with ADR on and off.

What’s the best way to try and find the cause of this packet loss?

Setup:
The sensor is turned on and working as it expects, sending (in this case) every 2 minutes (although I started at every 20 minutes). Using OTAA the join works as expected, and using ABP the sensor starts sending as expected. The gateway local to me (on the roof of the building the sensor is in) shows every expected message being received, the message cnt matches the sensor’s expectations, and I can decode all the LoRaWAN messages correctly using lora-packet with the appkey and devkey from the application. Yet only 2 or 3 of the packets actually arrive at the application; the rest go missing somewhere between the gateway and the application.

Expected Result:
Every packet for the application that the gateway receives should be decoded and arrive at the application.

Current Result:
Only 2 or 3 of the messages arrive at the application, the rest go missing.

These screenshots are from rebooting the sensor just now and watching the sensor traffic only (uses ABP):

image

Sometimes, packet 1 makes it through, and occasionally I’ve seen number 4, but never anything beyond that.

Payloads:

2: 0175640367FF0004686106659300C8004700056A2B00077DF203087D000009735F2703670A0104684E06659300C8004700
3: 056A2B00077DF203087D000009735F27

And the local gateway traffic, filtered by device ID:

image

Detail:

Device Address: 260117E6
Network Session Key: 0583B50A5A0EEE11A1504A5C20822694
Application Session Key: FCB12639D8E7341EDFAAD71711DD5E2E

Physical payloads as seen by the gateway:

1: 40E61701260101000D55369CDEDAAA342470773F2637282DB54DD92F3829526465A60D01EE6ECB3DB58543
2: 40E6170126000200555350BCB55AD3E62C49247A1633491D52EB9828596734FFCF5B7BA601DE35527196BEC33DD3EB3677155019BEFCB88ED8CA246395B8
3: 40E6170126000300554FB8CD427EB690DDADF5AE07D942D066FDD34396
4: 40E61701260104000D553A548BE9C7E26CAF9D76635F57A0B33DB81530F6E4ACB392E35C8F83DAD7101B23A402
5: 40E61701260105000D559A94A00DD7FF486EC7DCF6715467A26EAC05F4FA5548A7A54EB34D8FCDC5FA3B53DB4F
6: 40E61701260106000D55769EA7804EFECD84656A8C920E8DC0220D8F05283A12F9AADE422CA2B974CCC7D5F086
7: 40E61701260107000D5587360B69BFBB7C854D80F4F3FCF463D5CD9B2121E8D519ABF76FBF30BC0EBBC91F638D
8: 40E61701260108000D553DCB5B7D3EC7A8D7BE4E4D25F4826BC531B56772DD592922BD7F4AB14F45464931FA7F
9: 40E61701260109000D550A7B826F8E5D4AF1682DBC327D8F065ABFD12D53599B21BC1483C59E17FD3956A2D422
10: 40E6170126010A000D559055A1F59326FE9D9E743B11850183EA824362F4BC94DFB28318AD72B93898694630E0

And the decoded versions, using the keys above:

1, 2, 3, 4, 5, 6, 7, 8, 9, 10

The payload data is a little esoteric, but it’s using a variant of the Cayenne format, except for message 1, which encodes device/firmware info. What’s confusing is that all of the packet data can be decoded properly.

What am I missing here?

What’s the best way to go about troubleshooting to find these missing messages?


Hi @DefProc, please can I suggest that you take a look at the trace data in the web console / gateway / traffic page.

This first image selects a specific uplink by clicking on the blue up marker that I have shown with a red line.

ttn-forum-01
Then scroll down using the scroll bar that I have marked with a red line in this image.

ttn-forum-02
This should take you down to the Trace section and you should be able to see if the system encountered problems routing the uplink.


Are you only looking at TTN Console? Or do you have some actual application outside of TTN that also only gets the first few uplinks? If you don’t have such an application, then I’d very much recommend using an MQTT client during debugging as well. Also, in TTN Console, the device’s “Status” might reveal whether TTN actually routed the uplinks without TTN Console showing them in the Data pages.

Do you have any integration enabled? (I don’t recall if failing integrations might cause the Data page to not show the uplinks.)

Anything different in the event data when clicking the different uplinks in the gateway Traffic?

Aside: I failed to see any uplinks in the Data page at all when I started to write this answer. But the device’s Status was updated just fine, and MQTT worked just fine at that time too. Also querying the Data Integration worked as expected, though I did not see “historical data” in TTN Console’s Data page either. Meanwhile, all is fine again for me, but the Data and Traffic pages in TTN Console fail on me a lot.

I have the Data Storage integration enabled. For example, a query for the last 12 hours gives the same 2 messages as above:

[
  {
    "device_id": "6128a1757429",
    "raw": "AXVkA2f/AARoYQZlkwDIAEcABWorAAd98gMIfQAACXNfJwNnCgEEaE4GZZMAyABHAA==",
    "time": "2020-05-22T11:44:12.975415847Z"
  },
  {
    "device_id": "6128a1757429",
    "raw": "BWorAAd98gMIfQAACXNfJw==",
    "time": "2020-05-22T11:45:27.367185239Z"
  }
]
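To check that these two stored records really are payloads 2 and 3 from the first post, the base64 `raw` field can be converted back to hex. This is only a minimal sketch: the helper name `raw_to_hex` is made up for illustration, and the records are copied from the query result above.

```python
import base64

# Records copied from the Data Storage query result above.
records = [
    {
        "raw": "AXVkA2f/AARoYQZlkwDIAEcABWorAAd98gMIfQAACXNfJwNnCgEEaE4GZZMAyABHAA==",
        "time": "2020-05-22T11:44:12.975415847Z",
    },
    {
        "raw": "BWorAAd98gMIfQAACXNfJw==",
        "time": "2020-05-22T11:45:27.367185239Z",
    },
]


def raw_to_hex(raw_b64):
    """Hypothetical helper: base64 payload -> upper-case hex string."""
    return base64.b64decode(raw_b64).hex().upper()


for record in records:
    print(record["time"], raw_to_hex(record["raw"]))
```

The second record decodes to 056A2B00077DF203087D000009735F27, which is payload 3 as listed in the first post.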

I’ve watched traffic over MQTT using mosquitto_sub, and it only receives the same messages as the console.

The device status seems to match the last received message (currently showing: 22/05/2020 12:45:27, the same receive time as packet 3), and the frames up counter is on 3.

The gateway info doesn’t seem to show anything unusual (this is the most recent failed packet #153):

image


It looks to be routing correctly, this is packet 153, which failed to appear at the application:

image

Has it always been the same application and device in TTN Console? So: did you ever start from scratch in TTN Console? (Don’t delete the gateway!)

Any chance you can program a different DevEUI and AppEUI in the device? Then you could create an additional application, with a new device, and see if that behaves the same. (You could then keep the existing thing as is, in TTN Console.) I know, it shouldn’t matter, but maybe there’s some duplicate device, somehow…?

Are you using the EU handler (ttn-handler-eu) for the application, and EU router (ttn-router-eu) for the gateway, and has that always been like that? I’ve seen reports about changing the router not always being effective. (In those cases yielding weird downlink frequencies.)

Or program the current DevEUI, AppEUI and AppKey into a totally different device? (If you don’t have any other devices, then I could do that too, if you’d like.)

You are missing the packets where FCtrl = 1 indicates that FOpts contains one byte, and where that byte is 0x0D.

0x0D is “DeviceTimeReq” which is a LoRaWAN v1.1 MAC command not supported by TTN - there are several other threads on that here.
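This can be checked against the physical payloads from the first post without any keys, because the frame header is not encrypted. The sketch below assumes the LoRaWAN 1.0.x uplink layout (MHDR, little-endian DevAddr, FCtrl, little-endian FCnt, then FOpts and FPort) and does not verify the MIC or decrypt anything. The function name `parse_uplink_header` is made up for illustration.

```python
# Physical payloads 1..10 as seen by the gateway (copied from the first post).
FRAMES = [
    "40E61701260101000D55369CDEDAAA342470773F2637282DB54DD92F3829526465A60D01EE6ECB3DB58543",
    "40E6170126000200555350BCB55AD3E62C49247A1633491D52EB9828596734FFCF5B7BA601DE35527196BEC33DD3EB3677155019BEFCB88ED8CA246395B8",
    "40E6170126000300554FB8CD427EB690DDADF5AE07D942D066FDD34396",
    "40E61701260104000D553A548BE9C7E26CAF9D76635F57A0B33DB81530F6E4ACB392E35C8F83DAD7101B23A402",
    "40E61701260105000D559A94A00DD7FF486EC7DCF6715467A26EAC05F4FA5548A7A54EB34D8FCDC5FA3B53DB4F",
    "40E61701260106000D55769EA7804EFECD84656A8C920E8DC0220D8F05283A12F9AADE422CA2B974CCC7D5F086",
    "40E61701260107000D5587360B69BFBB7C854D80F4F3FCF463D5CD9B2121E8D519ABF76FBF30BC0EBBC91F638D",
    "40E61701260108000D553DCB5B7D3EC7A8D7BE4E4D25F4826BC531B56772DD592922BD7F4AB14F45464931FA7F",
    "40E61701260109000D550A7B826F8E5D4AF1682DBC327D8F065ABFD12D53599B21BC1483C59E17FD3956A2D422",
    "40E6170126010A000D559055A1F59326FE9D9E743B11850183EA824362F4BC94DFB28318AD72B93898694630E0",
]

DEVICE_TIME_REQ = 0x0D  # the CID discussed above


def parse_uplink_header(phy_hex):
    """Hypothetical helper: split out the unencrypted header fields."""
    raw = bytes.fromhex(phy_hex)
    fctrl = raw[5]
    fopts_len = fctrl & 0x0F  # low nibble of FCtrl is FOptsLen
    return {
        "mhdr": raw[0],
        "dev_addr": raw[1:5][::-1].hex().upper(),  # little-endian on air
        "fctrl": fctrl,
        "fcnt": int.from_bytes(raw[6:8], "little"),
        "fopts": raw[8:8 + fopts_len],
        "fport": raw[8 + fopts_len],
    }


headers = [parse_uplink_header(frame) for frame in FRAMES]

# Frame counters of the uplinks that carry 0x0D in FOpts.
flagged = [h["fcnt"] for h in headers if DEVICE_TIME_REQ in h["fopts"]]
clean = [h["fcnt"] for h in headers if not h["fopts"]]

for h in headers:
    print(h["fcnt"], h["dev_addr"], f"FCtrl=0x{h['fctrl']:02X}", h["fopts"].hex().upper())
```

With these ten frames, only frame counters 2 and 3 have an empty FOpts, which lines up with the two payloads that reached the application.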

Because MAC commands are variable in length, it is in general impossible for an implementation to safely skip over unknown ones, as it does not know how many bytes to skip. In theory an implementation could simply skip processing anything remaining in FOpts once encountering something undefined and move on to the rest of the packet (since the total length of FOpts is known), however it would appear that the packet is being rejected entirely, and that may well be what the specification formally expects in the case of undefined input.

In practical terms, you will need to work with the device manufacturer to get its firmware to cease sending DeviceTimeReq if you wish to use it on a pre-1.1 network such as TTN. Hopefully this would be part of a more general “make it operate in legacy LoRaWAN mode” as otherwise there may well be other areas where you will trip over incompatibilities between LoRaWAN 1.1 and 1.0x too.


Great find! (And I feel bad for not peeking into the actual messages, while @DefProc provided such good details!)

The 1.0.2 specifications state to stop processing:

Note: The length of a MAC command is not explicitly given and must be implicitly known by the MAC implementation. Therefore unknown MAC commands cannot be skipped and the first unknown MAC command terminates the processing of the MAC command sequence.

…and for nodes:

It is therefore advisable to order MAC commands according to the version of the LoRaWAN specification which has introduced a MAC command for the first time. This way all MAC commands up to the version of the LoRaWAN specification implemented can be processed even in the presence of MAC commands specified only in a version of the LoRaWAN specification newer than that implemented.
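As a rough illustration of the rule quoted above, here is a sketch of a 1.0.2-style parser for uplink FOpts that stops at the first CID it does not know. The length table lists the device-to-network commands of LoRaWAN 1.0.2 as I understand them, and `parse_fopts` is a made-up name. This is not TTN's actual code, which (per this thread) seems to drop the whole uplink rather than just stop MAC processing.

```python
# Payload length (excluding the CID byte) of each uplink MAC command
# known to a LoRaWAN 1.0.2 network server.
UPLINK_MAC_PAYLOAD_LEN = {
    0x02: 0,  # LinkCheckReq
    0x03: 1,  # LinkADRAns
    0x04: 0,  # DutyCycleAns
    0x05: 1,  # RXParamSetupAns
    0x06: 2,  # DevStatusAns
    0x07: 1,  # NewChannelAns
    0x08: 0,  # RXTimingSetupAns
    0x09: 0,  # TxParamSetupAns
    0x0A: 1,  # DlChannelAns
}


def parse_fopts(fopts):
    """Return (parsed_commands, unprocessed_bytes).

    Parsing stops at the first unknown CID, because its length is
    unknown and the following bytes cannot be interpreted safely.
    """
    commands = []
    i = 0
    while i < len(fopts):
        cid = fopts[i]
        length = UPLINK_MAC_PAYLOAD_LEN.get(cid)
        if length is None or i + 1 + length > len(fopts):
            # Unknown (or truncated) command: terminate MAC processing.
            return commands, bytes(fopts[i:])
        commands.append((cid, bytes(fopts[i + 1:i + 1 + length])))
        i += 1 + length
    return commands, b""


# The FOpts seen in this thread: a single 0x0D, so nothing is parsed.
print(parse_fopts(bytes([0x0D])))
# A known command ahead of the unknown one is still processed.
print(parse_fopts(bytes.fromhex("020D")))
```

The second call shows why the specification advises ordering commands by the version that introduced them: the LinkCheckReq ahead of 0x0D is still handled.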

Indeed, some have claimed not seeing the data in TTN Console when DeviceTimeReq was used. But others say data was still displayed, though reading that claim again today raises new questions:

Still, just to be sure: there is no Decoder involved in the Payload Formats in TTN Console, right? (The successful uplinks show no sign of that.)

In short: of course @cslorabox is very likely right that this is the culprit, even though earlier reports differ. I’ve heard that V3 makes debugging easier…

(In an earlier version of this answer I referred to my November 2018 issue report #748, in which I thought TTN was not stopping processing. Looking at the code again, I now see that the actual parsing was already done elsewhere, not in the code I referred to. So while I might have reported the symptoms in that issue, I think I wrongly interpreted their cause.)


This is fantastic information, thank you both (@arjanvanb and @cslorabox). From that analysis, it looks like the device is requesting network time information with all but those two messages. Even as a stopgap measure, it looks like just not requesting the time with every message would get me most of the data in this case.

Correct, there was no decoder for these messages. I also tried it with one and got the same results.