The NOC experienced unexpected timeouts while writing to its database. Instead of backing off, the NOC started even more processes trying to write to the database (an unintended positive feedback loop). This resulted in an explosion of goroutines and memory usage and eventually the NOC just kept crashing/restarting/…
The Proxy
We have an SSL-terminating proxy in front of the NOC. When the NOC crashed and restarted within a second, the proxy did not close existing connections.
The Routing Services
Because the proxy did not close the connections, the routing services (router, broker, …) that forward metadata to the NOC did not back off to give the NOC time to recover. Instead, they buffered these messages and dumped a flood of them onto the NOC each time it came back after a crash. For a reason that is not yet clear, these components also spawned more goroutines, leading to extremely high memory usage and slower message processing.
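One common way to avoid this kind of flood is to bound the buffer and drop messages once it is full. The sketch below shows that idea in Go with a buffered channel and a non-blocking send. It is an assumption about what a fix could look like, not how the router or broker actually work; `boundedQueue`, `Enqueue` and `Dropped` are hypothetical names.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// boundedQueue buffers at most a fixed number of messages.
// When it is full, new messages are dropped instead of piling up in memory.
type boundedQueue struct {
	dropped int64 // accessed atomically; counts rejected messages
	ch      chan string
}

func newBoundedQueue(size int) *boundedQueue {
	return &boundedQueue{ch: make(chan string, size)}
}

// Enqueue never blocks. It reports whether msg was accepted.
func (q *boundedQueue) Enqueue(msg string) bool {
	select {
	case q.ch <- msg:
		return true
	default:
		atomic.AddInt64(&q.dropped, 1)
		return false
	}
}

// Dropped returns how many messages have been rejected so far.
func (q *boundedQueue) Dropped() int64 {
	return atomic.LoadInt64(&q.dropped)
}

func main() {
	q := newBoundedQueue(2)
	for i := 0; i < 5; i++ {
		q.Enqueue(fmt.Sprintf("gateway-status-%d", i))
	}
	fmt.Println("buffered:", len(q.ch), "dropped:", q.Dropped())
}
```

Dropping is a reasonable trade-off for gateway status metadata, since a newer status message supersedes an older one. For data that must not be lost, a different strategy would be needed.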
Mitigation
We temporarily disabled forwarding metadata to the NOC. As a result, gateways now appear as offline on the console and maps.
Resolution
We are still working on reproducing the issue in a controlled environment, and will post an update when we know more. We aim to re-enable NOC forwarding within a couple of hours, after which the gateway pages should display the correct gateway status again.
It still doesn’t appear when I run the CLI command “ttnctl gateways status …”, and it shows as inactive with “ttnctl gateways info …”, which I would normally consider more accurate.
My gateway isn’t displayed on the map of TTN-mapper.
“Location”, “Status” and “Owner” are all public in the Gateway > Settings > Privacy tab.
There is no marker at all on the location of my gateway (EUI: eui-b827ebfffe52009e)
Gateways appear as offline when the Router is set to ttn-router-asia-se.
Gateways appear as online if the Router is set to ttn-router-us-west.
I am in Taiwan.