After a power cycle, the connection gets lost after sending the first packet; the LED stays constant green.
The only thing that recovered it was pressing SETUP for 10 s, waiting a few seconds until the LED blinked slowly red, and then pressing SETUP for 5 s.
Do you have logs which show that? How do you determine the condition ‘connection gets lost after sending the first packet’?
@Franz_Refle Thanks for the clarification. So, the TTIG logs do not indicate connection loss in what you call ‘bad state’, correct? If the TTIG was unable to forward received LoRa packets to the LNS, the internal packet buffer would fill up and you would see log messages indicating that (see https://github.com/lorabasics/basicstation/blob/master/src/ral_lgw.c#L240). How do you determine that ‘the TTIG receives all Packets sent by the Node but stops forwarding them after the first sent packet’? My guess is that you are looking at the TTN console. The robust way to determine that is to do an IP packet capture on the websocket connection. Do you have the means to do that?
I was able to reproduce the issue on my TTIG. Let me try to formalize:
The failure condition
In the failure condition, the TTIG has an active TCP connection to the LNS back-end (solid GREEN LED) but LoRa uplinks do not appear in the TTN console. The DEBUG logs indicate that LoRa messages are received and forwarded to the LNS via the websocket connection. An IP packet capture on the websocket connection supports the observation that TCP packets are sent to the LNS back-end: on the TCP layer, the LNS acks all packets sent by the TTIG. This shows that the TCP connection is healthy and that forwarded packets are received by the LNS websocket server. However, the LoRa uplink messages do not make it to the LNS’s packet routing logic. In low-activity scenarios, the LNS will reset the websocket connection if no TCP packets are transmitted over the connection for a certain amount of time. After this server-initiated reset and re-connect, everything is back to normal. In high-activity scenarios the websocket connection will not be reset by the LNS because it keeps seeing TCP packets coming in. In that case, the failure condition persists indefinitely.
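To make the low-activity vs. high-activity distinction concrete, here is a minimal Python sketch. It only models the idle-timer behavior described above; the function name `first_server_reset` and the timeout value used in the example are hypothetical, since the actual timeout of the TTN LNS is not known to me.

```python
from typing import Optional, Sequence


def first_server_reset(tcp_packet_times: Sequence[float],
                       idle_timeout: float) -> Optional[float]:
    """Return the time at which an idle-based server reset would fire.

    tcp_packet_times: sorted timestamps (seconds) of TCP packets seen by
                      the server on one connection.
    idle_timeout:     assumed inactivity timeout in seconds (hypothetical
                      value, not taken from the TTN back-end).

    Returns None if no gap between consecutive packets exceeds the
    timeout, i.e. the connection is never reset within the observed
    window and the failure condition would persist.
    """
    for prev, nxt in zip(tcp_packet_times, tcp_packet_times[1:]):
        if nxt - prev > idle_timeout:
            return prev + idle_timeout
    return None


# Low activity: a long gap lets the server reset the connection (recovery).
low_activity = first_server_reset([0, 10, 20, 100], idle_timeout=60)

# High activity: traffic never pauses long enough, so no reset happens.
high_activity = first_server_reset([0, 10, 20, 30], idle_timeout=60)
```

With these made-up numbers, the low-activity trace is reset 60 s after the packet at t=20, while the high-activity trace is never reset.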
How the failure condition is triggered
The failure condition occurs whenever there is a ‘fast’ TTIG-initiated reconnect.
The TTN LNS regularly performs server-initiated connection resets whenever no activity is detected on the websocket connection (apparently ‘activity’ in this context is measured on the TCP level and not on the LoRa-packet level). These server-initiated re-connects do not trigger the failure condition.
In some scenarios the client (TTIG) may reset and re-establish its connection quickly, for example due to a ‘short’ power loss or a ‘short’ loss of Wi-Fi connectivity. After re-connection, the LNS receives and routes the first LoRa packets (they show up in the TTN console). After a short time (around 10-15s after re-connect) the failure condition kicks in: the packets stop appearing in the console, although the TCP connection is alive and healthy.
A wild guess at the root cause
To me this looks like a race condition in the connection reset logic of the LNS websocket server. Apparently, the LNS websocket server measures connection activity on the TCP level and resets the connection when no activity has been detected for a particular timeout T. Probably this logic also triggers the destruction of the data structures representing the gateway in order to free up resources associated with the just-closed connection. In the case of an (unclean) client-side connection reset, the same resource cleanup logic is triggered after timeout T on the dead connection. If the client re-connects before T has expired, there will be two TCP-level connections tied to the same internal logical gateway representation. When timeout T hits due to inactivity on the first (dead) connection, the gateway data structure is destroyed, and any packets coming in through the second (healthy) connection will not have the context required to route the LoRa packets up the stack.
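The hypothesis above can be sketched as a small simulation. This is purely a toy model of my guess and not the actual TTN back-end code: the class `LnsModel`, its methods, the gateway ID and the timeout of 15 s are all hypothetical, chosen only to mirror the 10-15s observed after re-connect.

```python
class LnsModel:
    """Toy model of the suspected server-side cleanup race (hypothetical)."""

    def __init__(self, idle_timeout):
        self.idle_timeout = idle_timeout  # timeout T, assumed value
        self.gateways = {}     # gateway_id -> routing context
        self.connections = {}  # conn_id -> gateway_id and last activity

    def connect(self, conn_id, gateway_id, now):
        # A re-connect reuses the existing gateway context if present.
        self.gateways.setdefault(gateway_id, {"routed": []})
        self.connections[conn_id] = {"gateway_id": gateway_id,
                                     "last_activity": now}

    def uplink(self, conn_id, payload, now):
        """Return True if the uplink was routed, False if it was dropped."""
        conn = self.connections[conn_id]
        conn["last_activity"] = now  # the TCP level still looks healthy
        ctx = self.gateways.get(conn["gateway_id"])
        if ctx is None:
            return False  # received and acked, but no context to route it
        ctx["routed"].append(payload)
        return True

    def tick(self, now):
        """Run the idle check; cleanup is keyed by gateway ID (suspected bug)."""
        for conn_id, conn in list(self.connections.items()):
            if now - conn["last_activity"] >= self.idle_timeout:
                del self.connections[conn_id]
                # Destroys the context even if another connection uses it.
                self.gateways.pop(conn["gateway_id"], None)


def run_scenario():
    lns = LnsModel(idle_timeout=15)
    results = []
    lns.connect("old", "gw-1", now=0)
    results.append(lns.uplink("old", "p0", now=1))
    # t=2: short power loss, the old connection dies without a clean close.
    lns.connect("new", "gw-1", now=5)               # fast re-connect, before T
    results.append(lns.uplink("new", "p1", now=6))  # still routed
    lns.tick(now=10)                                # old connection not yet idle for T
    results.append(lns.uplink("new", "p2", now=12))
    lns.tick(now=16)                                # T expires on the dead connection
    results.append(lns.uplink("new", "p3", now=17))  # dropped
    return results, lns


if __name__ == "__main__":
    print(run_scenario()[0])
```

In this model the healthy connection stays registered and keeps its activity timestamp fresh, so it is never reset, yet every uplink after the cleanup is dropped. That matches the symptom of a solid GREEN LED with nothing showing up in the console.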
As said, this is a wild guess, but to me this could explain the behavior we are seeing. Hopefully this helps to locate the true issue in the back-end.