Data loss in the backend?

EFthings01 · October 23, 2019, 7:16am

Hi there,

We are logging data losses for some test nodes, and over the last few days the loss rates have increased. The following log was written in Node-RED; each entry gives the node's packet counter, the time in seconds since the last reception, a timestamp and, where packets are missing, the number lost since the previous entry:

{"arduino_test":10377,"arduino_test_delta":2463.921,"timestamp":"2019-10-22 14:13:19","arduino_test_lost":40}
{"arduino_test":10416,"arduino_test_delta":2336.506,"timestamp":"2019-10-22 14:52:16","arduino_test_lost":38}
{"arduino_test":10427,"arduino_test_delta":664.953,"timestamp":"2019-10-22 15:03:20","arduino_test_lost":10}
{"arduino_test":10428,"arduino_test_delta":57.415,"timestamp":"2019-10-22 15:04:18"}
{"arduino_test":10454,"arduino_test_delta":1557.818,"timestamp":"2019-10-22 15:30:16","arduino_test_lost":25}
{"arduino_test":10496,"arduino_test_delta":2519.84,"timestamp":"2019-10-22 16:12:16","arduino_test_lost":41}
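For anyone who wants to reproduce the loss figures, here is a minimal sketch of how they can be recomputed from the packet counter alone. It assumes the counter increases by one per transmitted packet; the function name `count_lost` is made up for illustration and is not part of the Node-RED flow.

```python
import json

def count_lost(records, counter_key="arduino_test"):
    """Return (counter, lost) pairs; lost is None for the first record."""
    result = []
    previous = None
    for record in records:
        counter = record[counter_key]
        if previous is None:
            lost = None  # no predecessor, so the gap is unknown
        else:
            # A jump of N between consecutive counters means N - 1 packets
            # never arrived. Clamp at 0 so a counter reset is not counted.
            lost = max(counter - previous - 1, 0)
        result.append((counter, lost))
        previous = counter
    return result

# Four consecutive entries copied from the log above.
log_lines = [
    '{"arduino_test":10377,"arduino_test_delta":2463.921,"timestamp":"2019-10-22 14:13:19","arduino_test_lost":40}',
    '{"arduino_test":10416,"arduino_test_delta":2336.506,"timestamp":"2019-10-22 14:52:16","arduino_test_lost":38}',
    '{"arduino_test":10427,"arduino_test_delta":664.953,"timestamp":"2019-10-22 15:03:20","arduino_test_lost":10}',
    '{"arduino_test":10428,"arduino_test_delta":57.415,"timestamp":"2019-10-22 15:04:18"}',
]
records = [json.loads(line) for line in log_lines]
losses = count_lost(records)
```

The recomputed gaps match the `arduino_test_lost` values in the log, which suggests the counter itself is a sufficient basis for loss tracking.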

Checking the gateway logs, I found that during the “downtime” lots of data had been sent:

Oct 22 15:48:45 klk-wifc-040187 local1.info lorafwd[3036]: <6> Uplink message (EF17) sent
Oct 22 15:48:45 klk-wifc-040187 local1.info lorafwd[3036]: <6> Uplink message (EF17) acknowledged in 39.104 ms
Oct 22 15:48:48 klk-wifc-040187 local1.info lorafwd[3036]: <6> Received uplink message: 
Oct 22 15:48:48 klk-wifc-040187 local1.info lorafwd[3036]: <6> | lora uplink (7B305FE4), payload 27 B, channel 868.5 MHz, crc ok, bw 125 kHz, sf 7, cr 4/5
Oct 22 15:48:48 klk-wifc-040187 local1.info lorafwd[3036]: <6> | Unconfirmed Data Up, DevAddr 27008866, FCtrl [ADR], FCnt 4895, FPort 1
Oct 22 15:48:48 klk-wifc-040187 local1.info lorafwd[3036]: <6> |  - radio (00000107)
Oct 22 15:48:48 klk-wifc-040187 local1.info lorafwd[3036]: <6> |   - demodulator counter 2126864916, UTC time 2019-10-22T13:48:48.059476Z, rssi -62.2 dB, snr 4< 7 <9.75 dB
Oct 22 15:48:48 klk-wifc-040187 local1.info lorafwd[3036]: <6> Uplink message (EF18) sent
Oct 22 15:48:48 klk-wifc-040187 local1.info lorafwd[3036]: <6> Uplink message (EF18) acknowledged in 37.5419 ms

So it looks like the data have been transmitted but do not show up in the backend. We are using ttn-contrib-nod-red to receive data in Node-RED.
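One way to make this comparison systematic is to pull the frame counters out of the gateway log and check them against what the application received. The following is only a sketch: the regular expression is fitted to the log lines quoted above, and `backend_fcnts` is hypothetical data standing in for the counters seen in Node-RED.

```python
import re

# Matches lines such as:
# "| Unconfirmed Data Up, DevAddr 27008866, FCtrl [ADR], FCnt 4895, FPort 1"
FRAME_RE = re.compile(
    r"DevAddr (?P<devaddr>[0-9A-Fa-f]{8}), .*?FCnt (?P<fcnt>\d+), FPort (?P<fport>\d+)"
)

def gateway_frames(log_text):
    """Extract (DevAddr, FCnt) pairs from lorafwd log text."""
    frames = []
    for line in log_text.splitlines():
        match = FRAME_RE.search(line)
        if match:
            frames.append((match["devaddr"], int(match["fcnt"])))
    return frames

def missing_in_backend(gateway_fcnts, backend_fcnts):
    """Frame counters the gateway forwarded but the application never saw."""
    return sorted(set(gateway_fcnts) - set(backend_fcnts))

sample_log = (
    "Oct 22 15:48:48 klk-wifc-040187 local1.info lorafwd[3036]: <6> "
    "| Unconfirmed Data Up, DevAddr 27008866, FCtrl [ADR], FCnt 4895, FPort 1"
)
seen_by_gateway = [fcnt for _, fcnt in gateway_frames(sample_log)]

# Hypothetical: frame counters that actually reached the application.
backend_fcnts = [4893, 4894]
lost_after_gateway = missing_in_backend(seen_by_gateway, backend_fcnts)
```

Any counter in `lost_after_gateway` was forwarded by the gateway, so its loss happened somewhere between the gateway and the application.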

I know there is a status page, but it only shows the current status. Is there a status history, so we can see whether data losses are caused by downtime or maintenance?

EFthings01 · October 25, 2019, 10:48pm

Follow-up: after lots of debugging we caught one case. This is the gateway traffic:

[Screenshot: gateway traffic]

this is the device traffic:

[Screenshot: device traffic]

The platform had been running for hours without any issue, but suddenly lots of data got lost in the backend.

Any explanation for this?

EFthings01 · October 27, 2019, 3:42pm

Current state below. Is this the “normal” operation?

[Screenshot: gateway traffic and device traffic side by side]

Left side: gateway traffic for a device; right side: device traffic. Nearly every second message is lost.
To be honest, not all losses are caused by the backend, but currently there does not seem to be a stable transmission. These are the loss rates for the last week (1 means 1 packet lost):

[Screenshot: loss rates for the last week]

Nobody with similar issues?

bsiege · October 27, 2019, 6:42pm

Since it does not seem random, I would expect such behavior if some FUP (Fair Use Policy) enforcement were active.

Jeff-UK · October 27, 2019, 8:16pm

Which gateway and TTN handler/router are being used?

Which packet forwarder is running? If it is legacy/UDP based (which I suspect, as the gateway name/EUI is the usual alphanumeric jumble rather than a more human-friendly name), that may be an issue: UDP traffic is generally more likely to get lost ‘on the net’, with no retry/authentication mechanism, even before reaching the TTN backend.

redwirelessus · October 27, 2019, 9:41pm

Not sure if these are related, but we’re experiencing the same thing over in the US, last week and today as well: our ‘typically running fine’ nodes all of a sudden do not receive a Join Accept (JA) from TTN, even in the best of RF environments and without any recent changes to any variable:

We’ve checked both the uplink and downlink spectrum with a spectrum analyzer and everything is ‘clean’, with no interference. I’m not surprised, given that: 1) you can see the messages getting to TTN, just not being ‘replied’ to by TTN; 2) the messages are 6 to 10+ times above the noise (SNR 8 to 10 dB), so that’s not it. Something has been going on lately in the backend. I had a conversation with TTN last week (10/23); they confirmed there was a ‘hiccup’ for 30 minutes, but I see this happening a lot more often than that and for a lot longer. Although this is an ‘as is’ service for all to enjoy, work, play and collaborate on, I wish there was a better way of knowing if there’s a ‘downtime’ happening, especially when for lots of people out there (us included) TTN is a springboard to TTI.
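As a quick check of the ‘6 to 10+ times above the noise’ figure, the sketch below converts an SNR in dB to a linear power ratio with the standard relation ratio = 10^(dB/10). The function name is only illustrative.

```python
def db_to_power_ratio(db):
    """Convert a power level difference in dB to a linear power ratio."""
    return 10 ** (db / 10)

# SNR values quoted in the post above.
for snr_db in (8, 10):
    print(f"{snr_db} dB -> {db_to_power_ratio(snr_db):.1f}x the noise power")
```

So 8 dB is roughly 6.3 times and 10 dB is 10 times the noise power, consistent with the numbers quoted.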

Here’s more info from our end:
LoRa Broker: TTN US West
Nodes: OTC RadioBridge
GWs: Laird Sentrius RG191
Spec: US 902-928 MHz (LoRa sub-band 2, typically used for OTAA: 903.9 MHz channel 8 / 125 kHz to 905.3 MHz channel 15 / 125 kHz, plus center channel 904.6 MHz / 500 kHz):
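For reference, the sub-band 2 frequencies listed above follow from the US915 uplink channel plan in the LoRaWAN regional parameters: 125 kHz channels 0 to 63 start at 902.3 MHz with 200 kHz spacing, and 500 kHz channels 64 to 71 start at 903.0 MHz with 1.6 MHz spacing. A small sketch, with function names chosen only for illustration:

```python
def us915_uplink_khz(channel):
    """Center frequency in kHz of a US915 uplink channel."""
    if 0 <= channel <= 63:
        # 125 kHz channels: 902.3 MHz + 200 kHz per channel index.
        return 902_300 + 200 * channel
    if 64 <= channel <= 71:
        # 500 kHz channels: 903.0 MHz + 1.6 MHz per channel index above 64.
        return 903_000 + 1_600 * (channel - 64)
    raise ValueError(f"US915 has no uplink channel {channel}")

def sub_band_channels(sub_band):
    """Channels of a 1-based sub-band: eight 125 kHz channels plus one 500 kHz channel."""
    if not 1 <= sub_band <= 8:
        raise ValueError("sub_band must be between 1 and 8")
    first = 8 * (sub_band - 1)
    return list(range(first, first + 8)) + [64 + sub_band - 1]

# Channel plan for sub-band 2, in kHz.
plan = {channel: us915_uplink_khz(channel) for channel in sub_band_channels(2)}
```

For sub-band 2 this yields channels 8 to 15 (903.9 to 905.3 MHz) and channel 65 (904.6 MHz), matching the list above.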

EFthings01 · October 28, 2019, 12:09am

Hi Jeff,

As the packets show up in the gateway traffic console, the UDP connection cannot be the issue here (the console only shows packets that have passed the UDP link). And the Fair Use Policy is not being enforced at the moment (there is not even an airtime counter in TTN).

I heard this might be an overload issue, but I have no details.

Sure, TTN is a springboard to TTI, but if TTN is losing most of the packets, that is not good promotion for TTI either.

redwirelessus · October 28, 2019, 1:49pm

It seems we’ve been back to normal since about 2:40 AM EST, after 4 hours of downtime last night:

You can see communication was in/out around 1 PM yesterday, then 5 PM, then 10 PM until almost 3 AM. I let TTN know and they’re looking into it as of last night.

DeltaQ · October 30, 2019, 3:42pm

Same here…
We have a gateway with 30 devices sending updates every 5 minutes, and a random 10% of those updates do arrive in the Gateway Traffic list (correctly labelled, indistinguishable from other payloads, etc.) but AFAICT not in the Application Data list and not on the MQTT API.

DeltaQ · October 30, 2019, 3:47pm

same issue?

ferfersan6 · October 31, 2019, 8:24am

Yes, it seems the same. I am working with a GPS and it has been losing packets since I started it (3 days ago). Sometimes it loses more, sometimes less. But all packets appear in the GW backend. I don’t know where the problem could be.

DeltaQ · November 6, 2019, 11:22am

FYI: what we did to “solve” the issue was to add a Data Storage integration to the TTN Application and use the REST API on that storage instead of the MQTT interface.
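A rough sketch of what polling that storage could look like is below. Treat the details as assumptions rather than a verified API description: the host pattern `<app-id>.data.thethingsnetwork.org`, the `/api/v2/query` path, the `last` parameter and the `key` authorization scheme are written from memory of the TTN v2 Data Storage integration and should be checked against the API page of your own integration. The application ID and access key are placeholders.

```python
import json
import urllib.parse
import urllib.request

def build_query_request(app_id, access_key, last="1h", device_id=None):
    """Build a GET request for stored uplinks (URL layout is an assumption)."""
    # ASSUMPTION: host pattern, path and "last" parameter as recalled from
    # the TTN v2 Data Storage integration; verify before relying on them.
    url = f"https://{app_id}.data.thethingsnetwork.org/api/v2/query"
    if device_id:
        url += "/" + urllib.parse.quote(device_id, safe="")
    url += "?" + urllib.parse.urlencode({"last": last})
    headers = {
        "Accept": "application/json",
        # ASSUMPTION: access key passed with a "key" authorization scheme.
        "Authorization": f"key {access_key}",
    }
    return urllib.request.Request(url, headers=headers)

def fetch_records(request, timeout=10):
    """Send the request and decode the JSON body (empty body -> empty list)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = response.read()
    return json.loads(body) if body else []

# Placeholder credentials; no network call is made here.
request = build_query_request("my-app", "ttn-account-v2.example", last="2d")
# records = fetch_records(request)  # uncomment to actually query the storage
```

Polling like this trades latency for completeness: records are fetched after the fact, so a short interruption on the MQTT side no longer loses them, provided the storage itself received the uplinks.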

EFthings01 · November 7, 2019, 5:06pm

Hi,

thank you very much, I did not know it existed. Did it really help?