Watchdog monitoring on the Pycom Pygate gateway

We’re having a lot of success with the Pycom 8-channel ‘Pygate’ gateways, both the LoRa + Ethernet (PoE-powered) and LoRa + WiFi (USB-powered) variants. We’ve built 10 and are in the process of building another 10.

Anyone got advice on improving the default ‘main.py’ for the Pygate to maximise reliability?

It would be super-helpful to know if anyone else has (a) watchdog code running on the Pygate that can at least reboot it if it’s obviously not sending sensor data, and (b) if you have that code on the Pygate, how you check that the gateway is functioning at some basic level, recognising that the Pygate code runs in a parallel thread. For example, if there’s no suitable call to check on the Pygate code, could we check bytes sent through the Ethernet or WiFi interface without conflicting with the Pygate code?

In general we’d like these to run reliably unattended for months. For example, we could have them reboot every night if that improves reliability, but more useful would be advice on writing watchdog code in main.py that could detect the gateway getting stuck or having some other kind of problem, so we could (for example) force a reboot in the hope that clears it.
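By a nightly reboot I mean something as simple as the sketch below (untested; the 03:00 time is arbitrary, and it assumes NTP is reachable and that this loop would be folded into whatever runs in main.py after the gateway starts):

```python
# Minimal nightly-reboot sketch for a Pycom board (assumes the network is up
# when main.py runs). Syncs the RTC over NTP, then resets once a day.
import time
import machine
from machine import RTC

rtc = RTC()
rtc.ntp_sync("pool.ntp.org")      # Pycom RTC NTP sync; time.localtime() is UTC

boot_time = time.time()

while True:
    hour = time.localtime()[3]
    # only reboot if we've been up more than an hour, to avoid a reset loop
    if hour == 3 and (time.time() - boot_time) > 3600:
        machine.reset()
    time.sleep(60)
```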

Please note I’m not looking for advice on how to work out whether the gateway is OK via the TTN API or by interpreting incoming sensor data (we can do all that). More important is to minimise the chance of the gateway having a problem that means it just sits there not working until we go and find it and kick it back into life.

This is where proprietary solutions can get a bit icky, though my impression is that enough of their source is published that you could probably hook into status information.

One thing you could do is add your own Internet configuration/maintenance server outside of TTN, with logic on the board to connect to it independently, and ideally the ability to pull down a firmware update to replace what’s running. You’re going to have to be more explicit, though, about the capabilities you build into a system without an interactive shell.
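Very roughly, the “pull down a replacement” part might look like this sketch (everything here is an assumption: the URL, the use of the community urequests module copied onto the board, and the idea of overwriting /flash/main.py in place):

```python
# Sketch: fetch a replacement main.py from a maintenance server and reboot into it.
import machine
import urequests   # assumed to have been copied onto the board; not in stock firmware

UPDATE_URL = "http://example.com/pygate/main.py"   # hypothetical endpoint

def try_update():
    try:
        r = urequests.get(UPDATE_URL)
        new_code = r.text if r.status_code == 200 else None
        r.close()
        if new_code:
            with open("/flash/main.py", "w") as f:
                f.write(new_code)
            machine.reset()          # boot into the replacement code
    except Exception:
        pass                         # server unreachable; carry on with current code
```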

You could rig things so that if your custom server can’t ever be reached, the gateway reboots once every 24 hours - enough to fix a permanent issue (and hopefully let you in between the reboot and when it breaks again), but not so often as to be disruptive if it’s your server infrastructure that’s broken. Then, if the gateway is able to hit your endpoint, maybe you tell it to go into a mode where the watchdog interval is much, much shorter (and it could stay that way for half a day, even across reboots). And then on a continuous basis you can loop what TTN is reporting about connectivity of the gateway (though this is less than perfectly defined in the absence of uplink packets to report) back to the watchdog interval you’re commanding at your custom endpoint, or set a “please reboot now” flag there.
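Sketched very roughly in MicroPython with only the built-in socket module (nothing here is an existing service - the host, the path and the plain-text “REBOOT” flag are made up):

```python
# Sketch: poll an independent maintenance endpoint; reboot if the operator has
# left a "REBOOT" flag there, or if the server hasn't been reachable for 24 h.
import usocket as socket
import time
import machine

SERVER = "maintenance.example.com"       # hypothetical server
FAIL_LIMIT = 24 * 60 * 60                # reboot after 24 h with no contact

last_ok = time.time()

def poll_server():
    s = socket.socket()
    try:
        s.settimeout(10)
        s.connect(socket.getaddrinfo(SERVER, 80)[0][-1])
        s.send(b"GET /gateway-status HTTP/1.0\r\nHost: " + SERVER.encode() + b"\r\n\r\n")
        return s.recv(512)
    finally:
        s.close()

while True:
    try:
        reply = poll_server()
        last_ok = time.time()
        if b"REBOOT" in reply:           # operator-set flag at the endpoint
            machine.reset()
    except Exception:
        if time.time() - last_ok > FAIL_LIMIT:
            machine.reset()              # no contact for 24 h: reboot once
    time.sleep(600)                      # poll every 10 minutes
```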

It’s also worth considering whether a reboot will actually fix all local state. While perhaps not a factor for your backhaul connections, one thing I found with embedded Linux platforms hosting LTE modems was that rebooting the host Linux often wouldn’t fix a stuck modem; instead the modem needed to be power cycled, not just re-enumerated as a USB device. So the recovery logic would first attempt to re-dial a few times, then power cycle the modem, and finally, as a last resort, reboot Linux.
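Translated to the Pygate’s WiFi backhaul, the same escalation ladder might look like this sketch (the SSID/password are placeholders and the timings are guesses; the point is the ordering of the steps, not the details):

```python
# Escalation-ladder sketch for a Pycom WiFi backhaul: retry the connection a few
# times, then re-initialise the radio, and only reboot the board as a last resort.
import time
import machine
from network import WLAN

SSID, PASSWORD = "my-ssid", "my-password"     # placeholders

def recover_wifi(wlan):
    # step 1: a few plain reconnect attempts
    for _ in range(3):
        wlan.connect(SSID, auth=(WLAN.WPA2, PASSWORD))
        time.sleep(20)
        if wlan.isconnected():
            return True
    # step 2: tear the interface down and bring it back up
    wlan.deinit()
    time.sleep(5)
    wlan.init(mode=WLAN.STA)
    wlan.connect(SSID, auth=(WLAN.WPA2, PASSWORD))
    time.sleep(20)
    if wlan.isconnected():
        return True
    # step 3: last resort - reboot the whole board
    machine.reset()

# usage idea: wlan = WLAN(mode=WLAN.STA); if not wlan.isconnected(): recover_wifi(wlan)
```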

Another option could be to put a LoRaWAN node with a 50 ohm surface-mount resistor instead of an antenna right at the gateway, have it transmit at a long interval, and have it act as a watchdog for the gateway. You could potentially have a short-interval watchdog that needs to be fed by software on the Pycom blipping a GPIO or sending a serial message, and a day-scale one that needs to be fed by eventually getting a LoRaWAN downlink. Naturally you can run this at the highest data rate supported in your region to minimise the gateway airtime you are taking. Since it’s periodically sending a packet, you also know that there should always be a feed of data through the gateway, unless the gateway, backhaul, or TTN is broken.
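A rough sketch of the downlink-watchdog half of that on a spare Pycom LoRaWAN node (the OTAA keys, the EU868 region, the reset-pin wiring and the 24-hour threshold are all assumptions; downlinks would have to be scheduled server-side):

```python
# Canary-node sketch for a LoPy-class board sitting next to the gateway.
# It sends a tiny uplink at a long interval and, if no downlink has been seen
# for a day, pulses a pin assumed to be wired to the gateway's reset line.
import time
import socket
import ubinascii
from machine import Pin
from network import LoRa

lora = LoRa(mode=LoRa.LORAWAN, region=LoRa.EU868)
app_eui = ubinascii.unhexlify('0000000000000000')                  # placeholder keys
app_key = ubinascii.unhexlify('00000000000000000000000000000000')
lora.join(activation=LoRa.OTAA, auth=(app_eui, app_key), timeout=0)
while not lora.has_joined():
    time.sleep(5)

s = socket.socket(socket.AF_LORA, socket.SOCK_RAW)
s.setsockopt(socket.SOL_LORA, socket.SO_DR, 5)     # highest standard EU868 data rate
s.setblocking(False)

reset_pin = Pin('P8', mode=Pin.OUT)                # assumed wiring, assumed active-low
reset_pin(1)
last_downlink = time.time()

while True:
    s.send(b'\x01')                                # tiny uplink
    time.sleep(5)
    if s.recv(64):                                 # any downlink means the path is OK
        last_downlink = time.time()
    if time.time() - last_downlink > 24 * 3600:
        reset_pin(0); time.sleep(1); reset_pin(1)  # pulse the gateway reset line
        last_downlink = time.time()
    time.sleep(30 * 60)                            # long uplink interval (30 min)
```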

Context being everything: are you experiencing reliability issues, and if so, how often, how do they present, what do you think is the cause, and how do you fix them currently?

As much as you may like the Pygate, it’s not an unreasonable expectation for a gateway to run for many, many months, and if that’s not happening, why start putting bandaids on when you could use something that just keeps on trucking and doesn’t need C22H30N6O4S extras added to keep it up.

For any generic monitoring of a gateway, the Norwich Football Club Canaries approach suggested above is the way to go - it runs on battery and validates the entire path from the concentrator on the gateway through to your backend.

We’re new to the Pygate gateways, having bought the parts for 10 and built them over the past 3 weeks, so we’re still learning how they behave. We’ve been running LoRaWAN with other gateways (in particular Multitech) for many years; those are extremely reliable, so they make a lot of sense on the roofs of some of our buildings and other remote locations. The Pygates are part of an in-building project and are attractively inexpensive, plus with the WiFi LoRaWAN flavour they are really easy to deploy and move around the building as we decide what we’re actually doing.

From 10 Pygates, we’ve had two that periodically coredump (with core panic IllegalInstruction and LoadProhibited/StoreProhibited errors). If these then reboot successfully, the issue isn’t obvious to spot from the TTN side (we have many overlapping gateways), but we know we could detect that with suitable server-side programming. The Pygates are running the Pycom ‘pygate’ firmware, which executes asynchronously after machine.pygate_init(..) is called in main.py, which leaves the question of whether anyone has written suitable Python code that runs after pygate_init and could detect the gateway having problems.
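To make the question concrete, this is the sort of main.py structure I’m imagining (untested - the config filename follows Pycom’s example, the timeouts are guesses, and the “is the WiFi still up” test is just a placeholder for whatever the right health check turns out to be):

```python
# Sketch: start the Pygate packet forwarder as in Pycom's example main.py, then
# feed a hardware watchdog only while a basic health check passes.
import time
import machine
from machine import WDT
from network import WLAN

# start the packet forwarder (adjust the config filename to whatever yours uses)
with open('/flash/global_config.json', 'r') as fp:
    buf = fp.read()
machine.pygate_init(buf)

wdt = WDT(timeout=120000)        # 2 minutes; board resets if we stop feeding it

wlan = WLAN()                    # WiFi flavour: the interface was brought up earlier

while True:
    # placeholder health check - is the backhaul still connected?
    if wlan.isconnected():
        wdt.feed()
    # if not connected, we deliberately stop feeding and the WDT reboots the board
    time.sleep(30)
```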

I’m not clear on the idea of a ‘canary’. I fully understand that I could monitor the incoming sensor uplink messages from our 500 sensors connected to TTN, look at the ‘gateways’ metadata, and detect any of our 20 gateways dropping off the network, but I’m unclear how that helps me reboot the Pygate that has failed, i.e. a dead canary doesn’t traditionally kick something else back into life. I guess the idea is to have the canary toggle a reboot pin on the Pygate if it doesn’t receive a downlink message? So you’d have a second independent Pycom LoRaWAN node bolted onto each Pygate gateway, and each gateway becomes a double LoPy device? I guess that would work, but I’d prefer to start with the MicroPython built-in watchdog timer support if that’s feasible.

This is the crux of my question - I wondered if anyone had written watchdog-feeding software on the Pygate, given that it has to co-exist with the asynchronously executing Pygate software, which doesn’t seem to provide any API you could use to monitor its status.

In general I’d say the Pygate is a really neat package at an excellent price for a WiFi or PoE LoRaWAN gateway (LTE backhaul would also be possible using the GPy or FiPy modules), but understandably it isn’t wrapped in the comprehensive management and recovery code you’d find in a much more expensive commercial product. I reckon we’ll get these things running reliably for weeks or months, but in the event of a glitch (Ethernet hanging for some reason, WiFi hanging for some reason, the clock having drifted, or something else) I’d rather it at least rebooted and reported the problem than sit there broken until I go and visit the gateway.

I referred to it as generic monitoring - not really a particular solution. Context is everything - like your prior experience with Multitech and the fact that you have 10 Pygates.

What does Pycom have to say on the matter?

That sounds like an interesting challenge for the community to take up - it would need to monitor the state variables of the gateway code if you feel it’s going AWOL before it falls over.

Otherwise, use something like an ATtiny that is poked by a code loop at regular intervals and, if not poked, resets the gateway, so when it does flatline it comes back within a minute or so.
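The poke loop on the Pycom side could be as small as this (the pin choice and the 30-second interval are just placeholders for whatever the external chip expects):

```python
# Feeding an external hardware watchdog (ATtiny or similar) from the main loop:
# toggle whatever pin it is wired to at an interval shorter than its timeout.
import time
from machine import Pin

wdt_pin = Pin('P9', mode=Pin.OUT)     # assumed wiring to the external watchdog input

while True:
    wdt_pin(1)
    time.sleep_ms(10)
    wdt_pin(0)
    time.sleep(30)                    # must be shorter than the ATtiny's timeout
```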

That’s why I suggested having the watchdog code check connectivity to, and commands at, an independent online server. That would automatically cover complete network failures, and if you have server-side automation that feeds dissatisfaction with the data flow through TTN back as a “hey, please reboot” command left at that endpoint, then you get that too.

The other path would be to dig around in the code Pycom has published and maybe start using a customized version of it. The core dumps sound like there may be unresolved design issues there.