TTIG problems: no location data, wrong date/time, wrong channel and stability issues

After verifying the formatted posts above … there’s a lost line between the first two lines of the second log, here’s the original line:

2019-09-08 09:44:30.300 [S2E:VERB] RX 867.3MHz DR5 SF7/BW125 snr=9.8 rssi=-33 xtime=0xFA00033F1AF0FB - updf mhdr=40 DevAddr=26011CB2 FCtrl=80 FCnt=28612 FOpts=[] 018C458F mic=1566300477 (16 bytes)

Before receiving this packet the free heap drops from 18872 to 17272.
Something strange happened here.

Could you boot the gateway and leave it in error mode for 30 minutes to see if it recovers by itself?

OK, I’ll boot the gateway again and leave it in the failure state for at least 30 min. I’ll report back.
I’ve done that a few times in the past, but the GW didn’t recover.
Only pressing setup for 10 sec, waiting a few seconds until it blinks slowly red, and then pressing setup for 5 sec recovered it.

@kersing: TTN Console reports: Last Seen 35 minutes ago for this GW
it didn’t recover …
I’ll compare the bootlogs for a “good” boot with GW alive and a “bad” boot with GW not forwarding any packets …


The only difference in Bootlogs is as follows:

1970-01-01 00:00:05.276 [SYS:DEBU]   Free Heap: 55888 (min=55456) wifi=1 mh=3 cups=0 tc=0
scandone
state: 0 -> 2 (b0)
state: 2 -> 2 (8a0)
state: 2 -> 0 (2)
reconnect

Sorry for my poor English and for the bad descriptions.

09.09.2019:
I’ve tested the TTIG connected to a mobile Hotspot (Smartphone) this morning.
Same problems as above …
After a power cycle the connection gets lost after sending the first packet, LED constant green.
Only pressing setup for 10 sec, waiting a few seconds until it blinks slowly red, and then pressing setup for 5 sec recovered it.
So I think it isn’t a matter of my WLAN Setup
Same Results if connected to Fritzbox, Cisco AP, Asus AP and mobile Hotspot

Any Suggestions ?
Should I complain to RS Components a second time ?
Or throw away the damned thing as suggested in the first posts ?

If I terminate and reenable the mobile Hotspot, things go wrong again …
same as power cycling the TTIG

After losing the connection the TTIG reboots (as seen in the debug output) and immediately rescans the WLAN. If the AP is available in this scan, things go wrong as described above.
If I switch on the AP AFTER this first rescan, the TTIG rescans the WLAN a second time after about 30 s (without rebooting), finds the AP and things go right … the TTIG functions properly.

But there’s no workaround in case of

  1. short power loss
  2. short wifi loss

Why the reboot if WiFi is lost for more than 1 sec ?
Maybe a bug …
Anybody out there who can verify these reboots ?

@bei: I’m not allowed to post a reply for the next 5 hours …
Here’s my Setup:
@bei: I have a simple TTN node (Arduino+RFM95) which sends a temperature packet every 15 secs. After the described faulty reboot the TTIG forwards exactly 1 packet … seen in the TTN Console of both gateway and device. Logging the DEBUG output of the TTIG doesn’t show any differences between the “good” and the “bad” state. The TTIG receives all packets sent by the node but stops forwarding them after the first forwarded packet. This is the condition ‘connection gets lost after sending the first packet’. If I activate my DIY 1ch RASPI GW, that GW forwards exactly every 8th packet (1 channel of 8) without any problems over the same infrastructure.

@bei: I think I should wait the 4 hours until I’m permitted to reply …
TTIG Debug Output in failure state doesn’t differ from “good” state:

  • 2019-09-09 08:37:08.883 [S2E:VERB] RX 867.7MHz DR5 SF7/BW125 snr=10.0 rssi=-12 xtime=0x77000005C08C3C - updf mhdr=40 DevAddr=26011CB2 FCtrl=80 FCnt=870 FOpts=[] 01AFB05F mic=2133244814 (16 bytes)
  • 2019-09-09 08:37:09.298 [SYS:DEBU] Free Heap: 18872 (min=17064) wifi=5 mh=7 cups=8 tc=4

And yes, I’m looking at the TTN Console traffic page of the GW.
Packet capture maybe next weekend, I have to recover an old WRT54G first. Or is it possible to do a capture on my Android smartphone ?
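To check the claim that the debug output does not differ between the “good” and the “bad” state, the status lines can be compared field by field instead of by eye. A minimal sketch in Python, assuming the `[SYS:DEBU] Free Heap` line always has the layout quoted in this thread. The helper names `parse_heap_line` and `diff_status` are made up for illustration, and the meaning of the `wifi`, `mh`, `cups` and `tc` values is not explained in this thread, so they are only treated as opaque integers:

```python
import re

# Layout taken from the status lines quoted in this thread (an assumption
# that it never varies beyond the amount of whitespace after the tag).
HEAP_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) \[SYS:DEBU\]\s+"
    r"Free Heap: (?P<free>\d+) \(min=(?P<min>\d+)\) "
    r"wifi=(?P<wifi>\d+) mh=(?P<mh>\d+) cups=(?P<cups>\d+) tc=(?P<tc>\d+)"
)


def parse_heap_line(line):
    """Return the fields of a 'Free Heap' status line, or None for other lines."""
    # Tolerate the forum's bullet prefix and surrounding whitespace.
    match = HEAP_RE.match(line.strip().lstrip("•").strip())
    if match is None:
        return None
    fields = match.groupdict()
    timestamp = fields.pop("ts")
    return {"ts": timestamp, **{key: int(value) for key, value in fields.items()}}


def diff_status(first, second):
    """Return {field: (first value, second value)} for fields that differ."""
    return {
        key: (first[key], second[key])
        for key in first
        if key != "ts" and first[key] != second[key]
    }
```

Feeding it the boot-time status line (timestamp 1970-01-01) and the later line quoted above reports every field as changed, so the comparison is only meaningful between lines taken at the same point after boot in a “good” and a “bad” run.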

After a power cycle the connection gets lost after sending the first packet, LED constant green.
Only pressing setup for 10 sec, waiting a few seconds until it blinks slowly red, and then pressing setup for 5 sec recovered it.

Do you have logs which show that? How do you determine the condition ‘connection gets lost after sending the first packet’?

EDIT 1:
@Franz_Refle Thanks for the clarification. So, the TTIG logs do not indicate connection loss in what you call ‘bad state’, correct? If the TTIG was unable to forward received LoRa packets to the LNS, the internal packet buffer would fill up and you would see log messages indicating that (see basicstation/src/ral_lgw.c at master · lorabasics/basicstation · GitHub). How do you determine that ‘the TTIG receives all Packets sent by the Node but stops forwarding them after the first sent packet’? My guess is that you are looking at the TTN console. The robust way to determine that is to do an IP packet capture on the websocket connection. Do you have the means to do that?

EDIT 2:
I was able to reproduce the issue on my TTIG. Let me try to formalize:

The failure condition

In the failure condition, the TTIG has an active TCP connection to the LNS back-end (solid GREEN LED) but LoRa uplinks do not appear in the TTN console. The DEBUG logs indicate that LoRa messages are received and forwarded to the LNS via the websocket connection. An IP packet capture on the websocket connection supports the observation that TCP packets are sent to the LNS back-end: on the TCP layer, the LNS acks all packets sent by the TTIG. This shows that the TCP connection is healthy and that forwarded packets are received by the LNS websocket server. However, the LoRa uplink messages do not make it to the LNS’s packet routing logic. In low-activity scenarios, the LNS will reset the websocket connection if no TCP packets are transmitted over the connection for a certain amount of time. After this server-initiated reset and re-connect, everything is back to normal. In high-activity scenarios the websocket connection will not be reset by the LNS because it keeps seeing TCP packets coming in. In that case, the failure condition persists in steady state.

How the failure condition is triggered

The failure condition occurs whenever there is a ‘fast’ TTIG-initiated reconnect.
The TTN LNS regularly performs server-initiated connection resets whenever no activity is detected on the websocket connection (apparently ‘activity’ in this context is measured on the TCP level and not on the LoRa-packet level). These server-initiated re-connects do not trigger the failure condition.
In some scenarios the client (TTIG) may reset and re-establish its connection quickly, for example due to a ‘short’ power loss or a ‘short’ loss of wifi connectivity. After re-connection, the LNS receives and routes the first LoRa packets (they show up in the TTN console). After a short time (around 10-15s after re-connect) the failure condition kicks in: the packets stop appearing in the console, although the TCP connection is alive and healthy.

A wild guess at the root cause

To me this looks like a race condition in the connection reset logic of the LNS websocket server. Apparently, the LNS websocket server measures connection activity on the TCP level and resets the connection after no activity is detected for a particular timeout T. Probably this logic also triggers the destruction of the data structures representing the gateway in order to free up resources associated with the just-closed connection. In the case of an (unclean) client-side connection reset, the same resource cleanup logic is triggered after timeout T on the dead connection. If the client re-connects before T has expired, there will be two TCP-level connections tied to the same internal logical gateway representation. After timeout T hits due to inactivity on the first (dead) connection, the gateway data structure is destroyed and any packets coming in through the second (healthy) connection will not have the required context to properly route the LoRa packets up the stack.
As said, this is a wild guess, but to me this could explain the behavior we are seeing. Hopefully this helps to locate the true issue in the back-end.
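The guessed mechanism can be played through in a toy model. The following Python sketch is not LNS code: `ToyLns`, `simulate`, the 15 s idle timeout and the 5 s uplink interval are all assumptions chosen only to illustrate the hypothesis that cleaning up a dead connection also destroys the gateway context shared with a newer connection.

```python
class ToyLns:
    """Toy model of the *guessed* server behaviour; not real LNS code."""

    def __init__(self, idle_timeout):
        self.idle_timeout = idle_timeout
        self.gateways = {}     # gateway id -> routing context
        self.connections = {}  # connection id -> (gateway id, last activity)

    def connect(self, conn_id, gw_id, now):
        # A reconnect reuses the existing gateway context if there is one.
        self.gateways.setdefault(gw_id, {"routed": 0})
        self.connections[conn_id] = (gw_id, now)

    def uplink(self, conn_id, now):
        """Return True if the packet is routed, False if it is only acked."""
        gw_id, _ = self.connections[conn_id]
        self.connections[conn_id] = (gw_id, now)
        context = self.gateways.get(gw_id)
        if context is None:
            return False  # connection is alive, but the context is gone
        context["routed"] += 1
        return True

    def tick(self, now):
        for conn_id, (gw_id, last) in list(self.connections.items()):
            if now - last >= self.idle_timeout:
                del self.connections[conn_id]
                # Suspected bug: the context is dropped even if another
                # connection still refers to the same gateway.
                self.gateways.pop(gw_id, None)


def simulate(reconnect_at, idle_timeout=15, step=5, end=40):
    """Old connection dies silently at t=0; the client reconnects later."""
    lns = ToyLns(idle_timeout)
    lns.connect("old", "gw", now=0)
    lns.uplink("old", now=0)
    routed = []
    for now in range(1, end + 1):
        lns.tick(now)
        if now == reconnect_at:
            lns.connect("new", "gw", now)
        elif now > reconnect_at and (now - reconnect_at) % step == 0:
            routed.append(lns.uplink("new", now))
    return routed


print(simulate(reconnect_at=3))   # fast reconnect, before the timeout
print(simulate(reconnect_at=20))  # slow reconnect, after the timeout
```

In this model the fast reconnect routes the first uplinks and then silently stops once the timeout of the dead connection fires, while the slow reconnect keeps routing. If the real server behaves like this, a fix on that side could be to release the gateway context only when no other connection still refers to it.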


@bei: full ack to your description of the failure condition and trigger.
I only wonder why the TTIG is rebooting if the WLAN outage lasts longer than a second …
My Findings are as follows:

  • if WLAN outage < 1 sec: reconnect, no problems
  • if WLAN outage > 1 sec and shorter than about 10 sec: reboot, reconnect, problems
  • if WLAN outage > 10 sec: reboot, 1st reconnect fails, 2nd reconnect succeeds, no problems

Is this reboot after 1 sec of WLAN outage normal behaviour of the TTIG or is it caused by a flaw ?
The problem seems to me to be the 1st reconnect after a reboot, but I’m not a big SW engineer who knows the TTIG FW in detail … only a hobbyist. Logs of this behaviour are reproducible.
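The three cases listed above can be written down as a small lookup, for example to label test runs with the outcome one would expect. This is only a sketch in Python of the observations in this post: the function name `expected_outcome` is made up, the thresholds of 1 s and about 10 s are the approximate values observed here, and what happens exactly at the boundaries is not known.

```python
def expected_outcome(outage_s, reboot_after_s=1.0, safe_after_s=10.0):
    """Map a WLAN outage duration (seconds) to the behaviour observed above.

    The thresholds are approximate observations from this thread,
    not documented TTIG parameters.
    """
    if outage_s < reboot_after_s:
        return "reconnect without reboot, forwarding OK"
    if outage_s < safe_after_s:
        return "reboot, first reconnect succeeds, forwarding stops"
    return "reboot, first reconnect fails, second succeeds, forwarding OK"


for seconds in (0.5, 5, 30):
    print(seconds, "->", expected_outcome(seconds))
```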

This is why I did not answer yet :wink: . It needs some time, however I will do it soon.
However, I see there is definitely a batch of TTIGs with problems. Mine was one of the first, although not taken at the conference.

By the way, for me the NOC provides coordinates, but the metadata in the packets does not.

I don’t see any problem on the TTIG side. Dropping the connection uncleanly and re-establishing it shortly thereafter is not an uncommon case in networking which the server should be able to handle gracefully.

@bei: full ack, server should be able to handle this …
I’m only in doubt whether there’s a need to reboot the whole gateway 1 sec after WLAN is lost …
Is this behavior normal ?

Both events, a short power loss and also a short WLAN loss, may occur sometimes in real life …
But those two events aren’t handled correctly by either the server or the gateway … that’s my problem

Can’t find a workaround

Hi Guys,

I just installed my brand new TTIG and have problems getting my test device to join.

If it is placed in a location where only the TTIG can receive it, it doesn’t join, even though I can see all the Join Requests and also the Accept answers in the console.

Then I placed the device in a different location where it could be heard by other gateways, and it could join there.
The only difference I can see is that the date on my gateway is 1970… Could this be the reason for not being able to join?

(Before I came across this 1970 problem I created this post, sorry.)

Thanks for help
Wolf.

TTN Mapper has been updated to use the gateway-data API and not the noc API. Things Indoor Gateways should therefore start to appear on TTN Mapper.


joining works flawlessly on my TTIG
I’ve joined a PAX-Counter without any problems

Yeah, very good quickfix solution :slight_smile:

@bei: if there’s no problem on the TTIG side, will the problem be fixed on server side in the future ?
Otherwise I’m thinking of the following workaround: I’ll build a little ATTiny14 or something similar into the TTIG, which listens to the debug output. If it detects a reboot of the TTIG it will switch off the WLAN AP for about 10 seconds, so that the first reconnect fails … the second reconnect will then succeed and all is fine. That’s what I’ve tested manually. In German one would say: “Von hinten durch die Brust ins Auge … aber funktioniert” (literally “from behind through the chest into the eye … but it works”, i.e. a very roundabout solution).
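The logic of this workaround could look roughly like the following Python sketch (a real ATtiny would of course need the same idea in C). Everything here is an assumption for illustration: a reboot is guessed from debug timestamps jumping back to 1970, as seen in the boot log earlier in the thread, and `ap_off` / `ap_on` are hypothetical callbacks standing in for whatever actually switches the access point.

```python
import re
import time

TIMESTAMP_RE = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")


def watch(lines, ap_off, ap_on, hold_s=10.0, sleep=time.sleep):
    """Switch the AP off for hold_s seconds each time a reboot is detected.

    Assumption: after a reboot the debug timestamps start at 1970 until
    the gateway has synced its clock. Returns the number of triggers.
    """
    in_boot = False
    triggers = 0
    for line in lines:
        if not TIMESTAMP_RE.match(line):
            continue  # lines such as 'scandone' carry no timestamp
        booting = line.startswith("1970-")
        if booting and not in_boot:
            # First 1970 line after normal operation: force the first
            # reconnect to fail by hiding the AP for a while.
            ap_off()
            sleep(hold_s)
            ap_on()
            triggers += 1
        in_boot = booting
    return triggers
```

`lines` can be any iterable of debug output lines, for example a serial port opened as a text stream. Passing a fake `sleep` makes the function testable without waiting.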

:grin:

Thanks for the fix, my TTIG showed up on ttnmapper.org just now (although it is still marked offline despite being online, but at least it’s visible at all).
Edit: switched to online just now, so ttnmapper seems fine now.


My problem with “Joining” is solved: I needed to set ATS220=3 on my Adeunis Tester
(extend RX Windows Timing for Testhouse certification).

Now everything works fine with the TTIG (I also can’t see any regular reboots etc.)

Thanks
Wolf.


I’d also like to know this.

This is why we’ve chosen to ‘wait’ and/or use something a bit more ‘robust’ like Laird Sentrius for indoor commercial applications…
There, I couldn’t read :smiley: