Kerlink IFemtocell Packet loss

With a gateway running a real operating system (ie, where you can get a shell via a serial port or SSH) you can probably pull the raw packets out of a log file, or modify the software to put them there, or point it at an intermediate log server which forwards to TTN.

If you know the ABP keys (or current OTAA keys device address) you can decode traffic yourself.

Even if you don’t have the session keys, you can readily analyze raw packets to see their type and node address.

I do most of my development with nodes scattered around the building each connected to a raspberry pi that both collects its serial debug output, and can flash new firmware (there are versions of OpenOCD that can use a pi’s gpio’s as an SWD adapter). It’s been hugely useful to be able to open up one window with a remote session to the pi babysitting a node some distance away and see its perspective, while having another window running a python script that analyzes raw traffic from the gateway’s or network’s.

Was a packet sent? Was it received? When you have a way into both a test node’s view and a gateway’s, these questions become easier to answer. And when you aren’t yet worrying about performance at distance (or oddities of being too close, such as receiving weak copies on adjacent channels in addition to the true one), you can just use a serial cable to your PC.

In the SPF-logs I found the following message several times (about every 10 Minutes, but varying):

Sep 7 07:11:04 klk-wifc-040187 local1.notice spf: INFO: concentrator stopped successfully
Sep 7 07:11:04 klk-wifc-040187 local1.notice spf: INFO: Exiting packet forwarder program

Some of the packet losses seem to correspond to this messages. Seems the SPF is restarted by the watchdog, so it is running again.

We get no answer form KERLINK, only that we should ask the company that sold the gateway…

Hmm, you should try to figure out why it is getting restarted.

Could be a watchdog as you suspect, in which case you need to figure out why the watchdog feeding condition is not being met.

Or could be a hardware glitch like an overclocked SPI bus (though that is much more common on DIY gateways) leading to a failure of communication between the SX1301 and the host and a restart.

See if there aren’t more verbose log messages somewhere…

The lines you quote show the software exiting, but there should be additional information on why the software stops in the lines above the quoted ones. Could you post (correctly formatted) 50 to 100 lines of log ending in the lines you already posted?

Hy Jac, thanks for your support! We started investigation becaus of high packet loss rates on the kerlink, significantly higher than on the TTN gateways.

Here is a complete cycle (one stop to the next), I could not see any good reason for the stop. We can see that the process is started again by the watchdog, but inbetween packets are lost.

I tried to install previous versions of the SPF/Keros and found the same issue in spf_3.1.0-klk18_4.1.3-klk11_klk_wifc with Keros 3.4.4.

Now the gateway is running spf_3.1.0-klk11_4.1.3-klk3_klk_wifc from May 2017 and liveburner_3.1.14_klk-wifc, here we did not recognize any stops til now.

Here is part of a protocol from the latest version for the iFemtocell (referring to kerlink) spf_3.1.0-klk18_4.1.3-klk12_klk_wifc:

Sep  6 07:33:43 klk-wifc-040187 local1.notice spf: INFO: Exiting packet forwarder program
Sep  6 07:33:52 klk-wifc-040187 local1.notice spf: I/ Programming FPGA with spectral scan firmware
Sep  6 07:33:52 klk-wifc-040187 local1.notice spf: *** Beacon Packet Forwarder for Lora Gateway ***
Sep  6 07:33:52 klk-wifc-040187 local1.notice spf: Version: 3.1.0-klk18
Sep  6 07:33:52 klk-wifc-040187 local1.notice spf: *** Lora concentrator HAL library version info ***
Sep  6 07:33:52 klk-wifc-040187 local1.notice spf: Version: 4.1.3-klk12;
Sep  6 07:33:52 klk-wifc-040187 local1.notice spf: ***
Sep  6 07:33:52 klk-wifc-040187 local1.notice spf: INFO: Little endian host
Sep  6 07:33:52 klk-wifc-040187 local1.notice spf: I/ Using default Gateway EUID: 7276FF0039040187
Sep  6 07:33:52 klk-wifc-040187 local1.notice spf: INFO: found global configuration file global_conf.json, parsing it
Sep  6 07:33:52 klk-wifc-040187 local1.notice spf: INFO: found local configuration file local_conf.json, parsing it
Sep  6 07:33:52 klk-wifc-040187 local1.notice spf: INFO: redefined parameters will overwrite global parameters
Sep  6 07:33:52 klk-wifc-040187 local1.notice spf: INFO: Using Gateway EUID: 7276FF0039040187
Sep  6 07:33:52 klk-wifc-040187 local1.notice spf: INFO: FPGA supported features: [TX filter]  [Spectral Scan] 
Sep  6 07:33:55 klk-wifc-040187 local1.notice spf: INFO: [main] concentrator started, packet can now be received
Sep  6 07:33:55 klk-wifc-040187 local1.notice spf: INFO: Disabling GPS mode for concentrator's counter...
Sep  6 07:33:55 klk-wifc-040187 local1.notice spf: INFO: host/sx1301 time offset=(1567755232s:745020µs) - drift=1975391804µs
Sep  6 07:33:55 klk-wifc-040187 local1.notice spf: INFO: Enabling GPS mode for concentrator's counter.
Sep  6 07:33:56 klk-wifc-040187 local1.notice spf: INFO: Received pkt from mote: 26012558 (fcnt=2385)
Sep  6 07:33:56 klk-wifc-040187 local1.notice spf: JSON up: {"rxpk":[{"tmst":3313227,"time":"2019-09-06T07:33:56Z","chan":2,"rfch":0,"freq":867.500000,"stat":1,"modu":"LORA","datr":"SF7BW125","codr":"4/5","lsnr":7.5,"rssi":-71,"size":17,"data":"QFglASYAUQkCZvemmQncMVI="}]}
Sep  6 07:33:56 klk-wifc-040187 local1.notice spf: INFO: [up] PUSH_ACK received in 41 ms
Sep  6 07:34:09 klk-wifc-040187 local1.notice spf: INFO: Received pkt from mote: 26012DE2 (fcnt=141)
Sep  6 07:34:09 klk-wifc-040187 local1.notice spf: JSON up: {"rxpk":[{"tmst":17196428,"time":"2019-09-06T07:34:09Z","chan":6,"rfch":1,"freq":868.300000,"stat":1,"modu":"LORA","datr":"SF7BW125","codr":"4/5","lsnr":9.5,"rssi":-63,"size":17,"data":"QOItASaAjQABwICmkL889sQ="}]}
Sep  6 07:34:09 klk-wifc-040187 local1.notice spf: INFO: [up] PUSH_ACK received in 44 ms
Sep  6 07:34:16 klk-wifc-040187 local1.notice spf: INFO: Received pkt from mote: 26012558 (fcnt=2386)
Sep  6 07:34:16 klk-wifc-040187 local1.notice spf: JSON up: {"rxpk":[{"tmst":23704443,"time":"2019-09-06T07:34:16Z","chan":3,"rfch":0,"freq":867.700000,"stat":1,"modu":"LORA","datr":"SF7BW125","codr":"4/5","lsnr":11.0,"rssi":-72,"size":17,"data":"QFglASYAUgkCwYeaEq1EtlE="}]}
Sep  6 07:34:16 klk-wifc-040187 local1.notice spf: INFO: [up] PUSH_ACK received in 42 ms
Sep  6 07:34:25 klk-wifc-040187 local1.notice spf: INFO: [down] the last 3 PULL_DATA were not ACKed, exiting application
Sep  6 07:34:25 klk-wifc-040187 local1.notice spf: INFO: End of downstream thread
Sep  6 07:34:25 klk-wifc-040187 local1.notice spf: INFO: End of upstream thread
Sep  6 07:34:25 klk-wifc-040187 local1.notice spf: ##### 2019-09-06 07:34:25 GMT #####
Sep  6 07:34:25 klk-wifc-040187 local1.notice spf: ### [UPSTREAM] ###
Sep  6 07:34:25 klk-wifc-040187 local1.notice spf: # RF packets received by concentrator: 3
Sep  6 07:34:25 klk-wifc-040187 local1.notice spf: # CRC_OK: 100.00%, CRC_FAIL: 0.00%, NO_CRC: 0.00%
Sep  6 07:34:25 klk-wifc-040187 local1.notice spf: # RF packets forwarded: 3 (51 bytes)
Sep  6 07:34:25 klk-wifc-040187 local1.notice spf: # PUSH_DATA datagrams sent: 3 (678 bytes)
Sep  6 07:34:25 klk-wifc-040187 local1.notice spf: # PUSH_DATA acknowledged: 100.00%
Sep  6 07:34:25 klk-wifc-040187 local1.notice spf: ### [DOWNSTREAM] ###
Sep  6 07:34:25 klk-wifc-040187 local1.notice spf: # PULL_DATA sent: 3 (0.00% acknowledged, ping 0.00 ms)
Sep  6 07:34:25 klk-wifc-040187 local1.notice spf: # PULL_RESP(onse) datagrams received: 0 (0 bytes)
Sep  6 07:34:25 klk-wifc-040187 local1.notice spf: # RF packets sent to concentrator: 0 (0 bytes)
Sep  6 07:34:25 klk-wifc-040187 local1.notice spf: # TX errors: 0
Sep  6 07:34:25 klk-wifc-040187 local1.notice spf: # BEACON queued: 0
Sep  6 07:34:25 klk-wifc-040187 local1.notice spf: # BEACON sent so far: 0
Sep  6 07:34:25 klk-wifc-040187 local1.notice spf: # BEACON rejected: 0
Sep  6 07:34:25 klk-wifc-040187 local1.notice spf: ### [JIT] ###
Sep  6 07:34:25 klk-wifc-040187 local1.notice spf: /home/drd/jenkins/workspace/spf_release/lora_pkt_fwd/src/jitqueue.c:448:jit_print_queue(): INFO: [jit] queue is empty
Sep  6 07:34:25 klk-wifc-040187 local1.notice spf: ### [GPS] ###
Sep  6 07:34:25 klk-wifc-040187 local1.notice spf: # GPS sync is disabled
Sep  6 07:34:25 klk-wifc-040187 local1.notice spf: ##### END #####
Sep  6 07:34:25 klk-wifc-040187 local1.notice spf: INFO: concentrator stopped successfully
Sep  6 07:34:25 klk-wifc-040187 local1.notice spf: INFO: Exiting packet forwarder program

I can. It is actually mentioned in the log:

Granted, reading logs ie easier when you know what to look for, but you could have spotted this yourself…

What ports have you put in the configuration?

And finally, please don’t dump large amounts of text in the forum without using some formatting.

Hy Jac,

thank you for your hints.

The setup is just what we got from kerlink. We have been facing too much packet loss on TTN and the support form kerlink is more or less non existing. So we had to digg a bit deeper.

I do nor really understand why the SPF quits on missing ACK, but I think this is a job for Kerlink. For us it is important to know, that there is a reason for packet loss and we can find a solution.

Many thanks for your help!

Best regards, Eckehard

The missing ACKs will result in downlink issues so the software restarting is actually good as it is trying to recover from a failure state that way. What is the configuration you are using? (Can you include it in your reply?) Is the downlink port set to 1700?
Is there a firewall with small timeout values for related UDP packets between the gateway and the internet?

In spf\etc\global_conf.json we find:

"gateway_conf": {
		"server_address": "router.eu.thethings.network",
		"serv_port_up": 1700,                                               
		"serv_port_down": 1700,
		"keepalive_interval": 10,
		"stat_interval": 30,
		"push_timeout_ms": 100,
		"forward_crc_valid": true,
		"forward_crc_error": false,
		"forward_crc_disabled": false,
		"autoquit_threshold": 3
	}

As mentioned, the old Version from 2017 works without errors. I compared the configuration and it is more or less identical.

I have no information about a firewall. Externally the gateway is connected to a fritz.box, there are very little information about the internal configuration of the KERLINK.

The configuration looks sound. If the previous software works Kerlink has some explaining to do about what they changed.

Maybe you can try it with a different connection to the Internet.

Especially one that goes by way of a router on which you can run tcpdump or tshark. Or maybe you can run that on the Kerlink - the goal would be to see what is going on at raw UDP level.

Is it possible to just build a different packet forwarder such as from Semtech’s repo (or Kersing’s) and run that instead of the Kerlink software?

Thank you all for your support! We are only users, so it´s Kerlinks work to get their device ready. We are using two ttn gatways that work perfect, the just cannot log in to WPA-EAP-Networks that use SSID / Identity / Password. We got this running with the iFemtocell, but it was really painful to find the necessary information.

With your help we came far enough to find the source of the high packet losses of the kerlink. Maybe it can help you to improve the TTN forwarder too.

Best regards,
Eckehard

That really depends on what the issue is. If something in the connection you are providing to the Internet and thus TTN servers is dropping the UDP ack packets, that is your problem, not Kerlink’s.

Unlikely… We use different gateways in different locations/netowks without problems. The iFemtocell works with the old firmware too, so it´s pretty cleard the packet loss ist related to the current set up.

Even with the old firmwre, there are plenty packet losses. From the spf logs we see:

Sep 16 15:45:07 klk-wifc-040187 local1.notice spf: INFO: Received pkt from mote: 26012395 (fcnt=16298)
Sep 16 15:45:07 klk-wifc-040187 local1.notice spf: JSON up: {"rxpk":[{"tmst":552221707,"chan":1,"rfch":0,"freq":867.300000,"stat":1,"modu":"LORA","datr":"SF7BW125","codr":"4/5","lsnr":9.0,"rssi":-63,"size":21,"data":"QJUjASaAqj8Bk616AQ+63HXICLhD"}]}
Sep 16 15:45:07 klk-wifc-040187 local1.notice spf: ##### 2019-09-16 15:45:07 GMT #####
Sep 16 15:45:07 klk-wifc-040187 local1.notice spf: ### [UPSTREAM] ###
Sep 16 15:45:07 klk-wifc-040187 local1.notice spf: # RF packets received by concentrator: 1
Sep 16 15:45:07 klk-wifc-040187 local1.notice spf: # CRC_OK: 100.00%, CRC_FAIL: 0.00%, NO_CRC: 0.00%
Sep 16 15:45:07 klk-wifc-040187 local1.notice spf: # RF packets forwarded: 1 (21 bytes)
Sep 16 15:45:07 klk-wifc-040187 local1.notice spf: # PUSH_DATA datagrams sent: 2 (312 bytes)
Sep 16 15:45:07 klk-wifc-040187 local1.notice spf: # PUSH_DATA acknowledged: 0.00%

As visible in the protocol above, the usual message should be:

PUSH_DATA acknowledged: 100.00%

Possibly a similar issue that was reported here: https://github.com/gotthardp/lorawan-server/issues/227? Could this be a reason for the SPF to quit?

global_conf.json -> “push_timeout_ms”: 100, possibly to short on some networks?

Possibly solved the mystery!?!

Looking through the logs I found this (more than once):

Sep 17 23:33:30 klk-wifc-040187 local1.notice spf: INFO: [up] PUSH_ACK received in 276 ms

The semtech UDP protocol defines, that an uplink communication between gateway and server is acknowlaged by a PUSH_ACK message. Same is true for downlink/keepAliveSignal, which gets an PULL_ACK. (For more details on the UDP protocol see mobilefish, LoRa/LoRaWAN tutorial 29.)

Most Messages are acknowledeged within ~40ms, but there are a lot of messages with 70ms, 90 ms or more. In global_conf.json a timeout of 100ms is set, which will frequently be hit.

We also see: “autoquit_threshold”: 3. Autoquit is used to restart the SPF from a watchdog. It may happen, that after a reconect the gateway gets a new IP adress. As the UDP forwarder cannot handle this situation, it has to be restarted.

In our case, this gives a fatal mix. If the PUSH_ACK is too slow, it will generate a timeout. After 3 timeouts the gateway is restarted, which makes no sense, if there is only a response time issue on the backend. Autoquit makes it even worse, while during restart the forwarder does not respond.

So, now we will try to increase both, timeout and autoquit_treshold and see what´s happening.

Followup:

Timeout was set to 500ms, autoquit_treshold is 10.

We can see, that this prevents the SPF from quitting. But it shows also, that the cause is an instable connection to the TTN backend. Here is the statistics:

From 1140 messages we lost 6 with CRC Errors, but 228 messages got lost between Gateway and backend. Generally we have no issues with our internet connection…

Do you follow the forum on a regular basis? If so you will have noticed several messages concerning capacity issues for the EU region when it comes to UDP based connections. (That includes TTIG because of the UDP bridge it uses)
If not, did you check the TTN status page? That lists performance issues with the UDP based Gateway API as well.

Bottom line, the community network in Europe can’t handle the load when it comes to UDP connected gateways.

That was my suspicion, thank you for clarification. That´s even worse as the semtech PF does not use the PUSH_ACK for retransmission.

But what can we do to make things reliable as possible?

a) Send all kerlink gateways back until they preset a new Packet forwarder ?
b) Ask TTI for a professional account (Possibly with similar issues on UDP?)?
c) Skip TTN and use another platform?
d) Install a different SPF on the iFemtoCell? But which one, the TTN forwarder is a.f.a.i.k not supported?!?

We see, that the loss rates are a bit lower on the TTN Gateway, but the TTIG performs very similar to the SPF. Does TTIG also use UDP? Thank you for your advice.

Found some documentation of the different options here:

https://www.thethingsnetwork.org/docs/gateways/kerlink/config.html

For the iFemtocell there is the old SPF (Keros 3.4.5) and the CPF (Keros 4.0.4) als installation packets available. While Kerlink had some confusion about their hardware versions we could install both versions on the lates iFemtocells.

Now we can confirm, that the CPF is working together with TTN. The difference is not that big, but CPF can retransmit packets if no PUSH_ACK is received. This does not eleminate Packet loss, but reduces the loss.