Not seen gateway status on ttn

Jeff-UK · April 28, 2021, 2:39pm

@descartes Yep tried a few more and refresh - BANG!

Just logging into Slack OPS and haven’t seen anyone flagging it, so a recent problem?

Update: just seen Nick post to OPS - types faster than me

descartes · April 28, 2021, 2:48pm

Many apologies for not spotting this: tap = tab - I hate autocorrect!

Although the first screen shot didn’t show this section of the web page so I didn’t figure it out.

In the meantime you can see your data at the application or device level.

afsaneh_akhbari · April 28, 2021, 2:52pm

thank you
Do you mean the gateway part won’t be fixed anymore, and I can only see data in the application part?

descartes · April 28, 2021, 2:53pm

No, this is an issue that we would expect to have fixed - it has been reported.

If you search the forum you will find that there is an issue with the part of the console that you took a screenshot of that doesn’t update the last connected or the status.

Which is why I thought it was that you were referring to.

afsaneh_akhbari · April 28, 2021, 2:56pm

Thank you so much for your efforts.
I look forward to seeing this issue fixed soon.

htdvisser · April 28, 2021, 4:16pm

Other comments in this topic already give a pretty good explanation of the situation, but I can also add some more details.

We recently updated the v2 Console to make it less confusing. Instead of showing incorrect information or empty pages, the Console now tries to hide parts of the user interface that rely on the “NOC” component when this “NOC” component isn’t available. Gateway status and live gateway traffic are the main features for which data is provided by the “NOC” component.

The problem here is that the “NOC” component is having issues. This is unfortunately something that happens more often. We made a lot of crucial mistakes when we initially designed the architecture of the NOC. As a result of these wrong architecture decisions, we’ve had a lot of difficulty scaling the NOC as The Things Network grew. We’ve had to disable some functionality (such as event history, searching for events, events from applications and end devices) to prevent it from crashing completely. Suffice to say, the NOC is always having a hard time, and it’s barely working most of the time.

In the meantime, the core team is now spending all its time on The Things Stack (v3), so we’re not going to fix v2 issues anymore. The core team still does its best to resolve downtime of critical components in v2 deployments. Critical components are the components that are required for routing LoRaWAN traffic between Gateways and Applications. The “NOC” component is not considered critical, so we’re not going to spend a lot of time figuring out what’s wrong. Sometimes switching some things off and on again helps, but that’s all we’re going to do.

I would indeed like to take this opportunity to push you guys to The Things Stack (v3). Anyone who’s just getting started and doesn’t have working applications or end devices, and still relies on live gateway traffic, should really be using The Things Stack (v3). Anyone with working applications and end devices can use the data view of their application or end devices. But you should still migrate those to The Things Stack (v3) as soon as you can.

Jeff-UK · April 28, 2021, 5:12pm

Ok, appreciate the explanation BUT this causes HUGE problems for many.

Whilst appreciating changes due to NOC issues, what you have just done is rendered much of V2 unusable… and with little forewarning or discussion of effect or mitigation.

E.g. besides the change to individual GW status and traffic (and the value that has for helping debug device issues or GW problems), it seems this has also impacted the GW overview page. Listing all GWs is the closest thing to a simple individual/user/group NOC, and as you can see the connected/not connected status is no longer listed.

With, in my personal vs community or clients case, several dozen GWs to monitor and manage, a quick scroll up or down would tell me if any had gone offline or not - if lots were suddenly offline then I could guess it was the NOC problem and then, if concerned, dive in to individual devices/applications to see if traffic was still coming through and all was ok with routing, or if indeed the GW was down and needing attention. It is not practical (too time consuming) to step through ALL of them in the absence of an overview.

Yes, we know, we get the message… BUT v3 seems, looking at the posts we see on the forum and from our own testing and experiments/experiences, still not ready for prime time for many: there are still missing integrations and documentation needs, and there is a steeper learning curve compared with V2 usage. TTN Mapper integration has only recently become available and there is still no sign of My Devices/Cayenne coming on board, requiring major discussion and evaluation of potential alternatives/replacements (and likely significant costs - both monetary and opportunity/manpower - in considering, evaluating, learning and then implementing alternatives).

In many cases migration requires visits to site to recover/reprogramme devices and/or GWs…

In case you missed it we are in the middle of a global pandemic, and travel (domestic or international), in-country movement and visiting sites - be they commercial, governmental or private locations - is often not possible and in many cases not desirable… forcing the issue doesn’t provide a solution. We are in ‘limp along mode’ for the coming months and TTN/TTI needs to respect and allow for that… please!

Having V2 data forwarded to V3 helps, but we still have 2 issues to resolve. The first is physically migrating the end devices and applications from V2 to V3. The second is that, where applicable, it’s not possible to look at received metadata to verify which (V2) GWs are handling the traffic, either to help debug system issues or to see which GWs are most effectively carrying the traffic in order to prioritise the GW migration task in a given area, as PacketBroker anonymises the GW ID when passing the traffic. And as per other posts to the forum, GW owners have a responsibility not to just migrate their GWs from V2 to V3 for selfish reasons, but rather need to consider possible local users who may still be using them for V2 devices and V2 apps (that they may not be able to deal with in the current situation, even if motivated to migrate) - we do not want to cut these guys and gals off at the knees…

It is well documented that the NOC/V2 console is causing problems, and it is oft posted to the forum, with users regularly calling out for help (and sadly not using search to get an understanding first!)… but it is far better to leave it running showing some data than cripple it even further… even if as Mods, and other active users, we then have to step in and answer user questions and point them to prior responses where V2 causes problems.

@wienkegiezeman @laurens @johan please ensure V2 can limp on for a time, and in a usable fashion, for as long as practical and consult with your community users before taking such impactful decisions and implementing without a heads up…

descartes · April 28, 2021, 5:50pm

Just to be clear @htdvisser, as I understand it you or someone close to you took a conscious decision to remove the traffic tab and you didn’t think to tell anyone as you were doing it??

So instead me, @Jeff-UK and @kersing spend a few minutes checking, cross checking, spinning up browsers to make sure it’s not Safari/Chrome/FireFox on Win10/Win7/iOS/macOS/Linux, CLI, API etc etc and I post on Slack. BTW, not everyone uses the forum, so I’ll expand on your answer two hours after the fact.

I can respect that you have to take tactical decisions that we are not in a position to have reversed as we (on here at least) aren’t paying the bills. Or indeed even consult us.

But if TTN is going to continue to be the huge PoC for v3 that demonstrates to the corporate world you can run a large installed base, please communicate with us so we don’t end up spinning our wheels looking for answers. Because if you do it to us, how big do you have to be before you give a customer heads up on a short term critical change?

I think it would be fair to say many of us can figure out even the most cryptic of posts that we’d be happy to expand upon - so a terse “v2 gateway traffic tab is going to have to go, too much pressure on the NOC” would have eyebrows raised but at least we’d know not to go hunting for answers.

hphillip · April 28, 2021, 6:11pm

Curiously the V2 websocket is still showing heartbeat messages.
But I see no traffic

Roberto69 · April 28, 2021, 7:17pm

Sorry but I can’t buy such advice. It’s unfair to make such a change without prior warning. I agree with @descartes that we are not in a position to have it reversed as we (on here at least) aren’t paying the bills. But such a practice could quickly crop up in TTI as well, and that is the last thing I want. You know, bad habits at home, bad habits everywhere.

Again, as @Jeff-UK wrote: "Ok, appreciate the explanation BUT this causes HUGE problems for many." You have taken this step at least half a year too early.

descartes · April 28, 2021, 7:35pm

Is that really the bit you intended to quote - @Jeff-UK doesn’t appear to be giving advice there, more observations.

AndyG · April 28, 2021, 8:04pm

If all TTN V2 gateways are now reporting as offline will this affect TTN Mapper?

Jeff-UK · April 28, 2021, 8:09pm

Let’s ask JP? @jpmeijers

If the Mapper scrapes directly from the NOC that might be an issue, as I’ve also noticed over the last couple of hours that the noc url doesn’t respond and the connection times out - either for all V2 gateways or when selecting known individual GWs - same result: time out… Combine that with the changes above and the fact that the individual GW page no longer shows a last connected item, and there is now no direct way to check the online status of V2 GWs.

Maj · April 29, 2021, 12:05am

I don’t think TTI are turning off the NOC just yet. Judging by Hylke’s quote above, it seems that the NOC will continue to exist (with all its problems), but it just won’t cause the gateway page to show “disconnected” when the NOC is down.

If that’s the case then I agree with it, since telling people their gateway is down when it’s actually up causes confusion every time the NOC crashes.

But like everyone else here, we absolutely need some way of telling whether gateways are connected or not. As gateway owners we don’t know if gateways are routing packets so we need the backend to tell us. I can’t think how to do this without the NOC, or running some code on the gateways themselves.

htdvisser · April 29, 2021, 6:44am

We’re indeed not turning off the NOC, nor are we removing live gateway traffic or gateway statuses from the Console. The only recent change is that the console hides NOC functionality when the NOC is down.

When this happens, you can use ttnctl to get the status of your gateway directly from the v2 Routers:

ttnctl gateways status your-gateway-id --router-id ttn-router-eu

I just restarted a bunch of servers, hopefully that improves the situation for now, but I can’t promise that it won’t go down again.
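For anyone with more than a handful of gateways, the ttnctl command above can be looped over a list. The following is only a rough sketch: it assumes ttnctl is installed and logged in, that the router ID is the ttn-router-eu quoted above, and that ttnctl exits with a non-zero code when it logs a FATAL error (an assumption, not something confirmed in this thread). The function names are made up for illustration.

```python
import subprocess

# Assumption: the router ID quoted in the post above; adjust for other regions.
ROUTER_ID = "ttn-router-eu"


def status_command(gateway_id, router_id=ROUTER_ID):
    """Build the ttnctl invocation quoted above for one gateway."""
    return ["ttnctl", "gateways", "status", gateway_id, "--router-id", router_id]


def check_gateways(gateway_ids, run=subprocess.run, timeout=30):
    """Return {gateway_id: bool}, True when ttnctl exited with code 0.

    Assumption: a zero exit code means the router returned a status.
    It does not by itself prove the gateway is currently connected.
    """
    results = {}
    for gw in gateway_ids:
        try:
            proc = run(status_command(gw), capture_output=True,
                       text=True, timeout=timeout)
            results[gw] = proc.returncode == 0
        except (OSError, subprocess.TimeoutExpired):
            # ttnctl missing from PATH, or the call hung past the timeout.
            results[gw] = False
    return results
```

Calling check_gateways(["your-gateway-id"]) gives a quick pass/fail per gateway; the text output of ttnctl would still need reading to see the actual status details.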

kersing · April 29, 2021, 7:25am

I understand not showing information that isn’t available. However now people are confused by suddenly missing tabs and fields. Would it be possible to keep those in place and display a message instead of the values? That makes for a more consistent user interface.

descartes · April 29, 2021, 8:36am

Your first post-implementation message was just far too subtle for all of us.

The second is much clearer but is an egregious breach of good UI design - as Jac suggests, just put a message saying that data isn’t available rather than have us refresh our browsers to see if we’ve won the NOC lottery.

At best, show the last known data with the timestamp the info was last available.

Could the offending server processes be set to restart at midnight UTC so we have some info at some point?
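To illustrate the UI suggestion made in the last two posts (keep the field in place, show a message, and ideally the last known value with its timestamp), here is a minimal sketch. It is not how the v2 Console works; the function name, parameters and message wording are all hypothetical.

```python
from datetime import datetime, timezone


def render_status(noc_available, status=None, last_known=None, last_known_at=None):
    """Return the text for a gateway status field.

    Hypothetical helper: the field is never hidden, it degrades to the
    last known value (with timestamp) or to a plain notice.
    last_known_at must be a timezone-aware datetime.
    """
    if noc_available and status is not None:
        return status
    if last_known is not None and last_known_at is not None:
        stamp = last_known_at.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
        return f"{last_known} (last known, {stamp}; NOC unavailable)"
    return "NOC unavailable, no status data"


if __name__ == "__main__":
    seen = datetime(2021, 4, 28, 14, 0, tzinfo=timezone.utc)
    print(render_status(True, status="connected"))
    print(render_status(False, last_known="connected", last_known_at=seen))
    print(render_status(False))
```

The point of the design is that the page layout stays the same whether or not the NOC is up, so nobody has to guess whether a missing tab is a browser problem.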

descartes · April 29, 2021, 8:57am

Console showing status but:

user@descaaa6:~# ttnctl gateways list

 	ID                  	Activated	Frequency Plan	Coordinates                
1	eui-58awtffffe8017ec	false    	EU_863_870    	(0.000000, 0.000000, 0)    
plus some more ...

user@descaaa6:~# ttnctl gateways info eui-58awtffffe8017ec
  INFO Found gateway                           

          Gateway ID: eui-58awtffffe8017ec
      Frequency Plan: EU_863_870
              Router: ttn-router-eu
         Auto Update: on
               Owner: descartes
        Owner Public: yes
     Location Public: no
       Status Public: no

               Brand: The Things Network
               Model: Indoor Gateway
           Placement: indoor
        AntennaModel: Built in
         Description: TTN Indoor Gateway on-the-go
          Access Key: ttn-account-v2.CWuW2lg1-kvawoNwtfffe8017ecwyjoxxbnww

Collaborators:
       - Username: descartes
         Rights: gateway:settings, gateway:collaborators, gateway:status, gateway:delete, gateway:location, gateway:owner, gateway:messages

user@descaaa6:~# ttnctl gateways status eui-58awtffffe8017ec  --router-id ttn-router-eu
  INFO Discovering Router...                   
  INFO Connecting with Router...               
  INFO Connected to Router                     
 FATAL Could not get status of gateway.         GatewayID=eui-58awtffffe8017ec error=unavailable: connection error: desc = "transport: Error while dialing dial tcp 52.169.76.255:1901: i/o timeout"

I’ll put this on Slack …

descartes · April 29, 2021, 8:58am

I see that things are broken - I’ll go and put some washing on, do the hoovering etc and come back later.

https://status.thethings.network/

Jeff-UK · April 29, 2021, 9:09am

Thanks Hylke, I checked over breakfast earlier and saw ‘Last Seen’, Traffic tab and GW Overview Connected/Not Connected columns all back.

Are there other items we should be aware of that get hidden in these circumstances (to save us trawling or having to respond to user cries for help)? We can get them documented and flagged on the forum so users know not to panic!

So to clarify what you are saying is if NOC goes down the various pages will automagically remove the page elements related to NOC - as called out - last seen, traffic tab and for GW overview page the con/not con column? When NOC back up they automagically re-appear?

As Jac says this is an inconsistent UI/experience, so it would be good if the respective elements remained in place but instead called out something like ‘sorry there is a noc issue’ so the user knows it’s not their GW or their browser or whatever. E.g. the Traffic tab could still be shown but with a line saying ‘noc is down, no current data available for display’ or some such, and GW overview lines could substitute ‘noc down’ for connected/not connected.

As mentioned on other threads and #ops last night, the noc also stopped responding to browsers with a connection timed out error, suggesting the noc was down/not responding… can that be scripted and checked with a simple watchdog, and then if down for more than say 10 mins (to allow for some off time for maintenance/updates) automagically trigger a server/noc restart? That would save you the hassle of having to go beat it with a stick as and when needed - appreciate this may not be the fun part of your job or a priority, therefore automating makes sense? Doing this automatically would also stop the flood of forum or #ops posts when one of the pages shows the noc has thrown a wobbly again. (Note: though data is coming through on the console, I checked the noc url (noc.thethingsnetwork.org:8085/api/v2/gateways) for the overview of gateways and for some individual known good GWs earlier and am still getting the connection time out problem.)
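The watchdog idea above could be sketched roughly as follows. This is only an illustration of the logic (poll the URL, and ask for a restart once it has been down for more than 10 minutes). The http:// scheme in front of the noc url, the polling approach, and whatever the restart action would actually be on the TTN side are all assumptions; the class and function names are made up.

```python
import http.client
import urllib.request

# Assumption: the noc url quoted above, served over plain http.
NOC_URL = "http://noc.thethingsnetwork.org:8085/api/v2/gateways"
GRACE_SECONDS = 10 * 60  # the "say 10 mins" allowance for maintenance/updates


def noc_responds(url=NOC_URL, timeout=10):
    """Return True if the URL answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Covers connection refused, timeouts, HTTP errors and bad responses.
        return False


class Watchdog:
    """Tracks how long the service has been down and decides when to restart."""

    def __init__(self, grace=GRACE_SECONDS):
        self.grace = grace
        self.down_since = None

    def observe(self, is_up, now):
        """Record one check at time `now` (seconds); True means trigger a restart."""
        if is_up:
            self.down_since = None
            return False
        if self.down_since is None:
            self.down_since = now
            return False
        if now - self.down_since > self.grace:
            # Restart requested: begin a fresh grace window for the restart itself.
            self.down_since = now
            return True
        return False
```

A small loop calling wd.observe(noc_responds(), time.monotonic()) once a minute, and running a restart hook whenever it returns True, would complete the picture.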

As Andrew says we absolutely have to have a way of monitoring GW status. I would say ~1/4-1/3 of my personal deployed GWs do not carry traffic for me regularly (hourly/daily) - some may see my traffic within a given month but that is no use for status monitoring via data received (they are deployed for community benefit). These days I try to follow best practice of locating a canary (often a chosen functional data gathering node) close to a GW so the full path to the backend can be monitored and GW status verified, but only adopted that approach after some time on TTN and after a few harsh lessons - and of course those early canaryless deployments are all on V2!
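As a rough sketch of the canary approach described above: if you record, per gateway, when its nearby canary node was last heard via that gateway, flagging silent gateways is a simple comparison. How those timestamps are collected is left open here, and the gateway IDs and the 2 hour threshold below are invented for illustration.

```python
from datetime import datetime, timedelta, timezone


def stale_gateways(last_heard, now, max_silence=timedelta(hours=2)):
    """Return sorted gateway IDs whose canary has been silent too long.

    last_heard maps a gateway ID to the datetime its canary was last
    received through it, or None if it has never been heard.
    """
    return sorted(
        gw for gw, seen in last_heard.items()
        if seen is None or now - seen > max_silence
    )


if __name__ == "__main__":
    now = datetime(2021, 4, 29, 9, 0, tzinfo=timezone.utc)
    # Hypothetical gateway IDs and timestamps.
    heard = {
        "gw-office": now - timedelta(minutes=20),
        "gw-church-tower": now - timedelta(hours=5),
        "gw-new-site": None,
    }
    print(stale_gateways(heard, now))
```

This only works for gateways that have a canary within range, which is exactly the gap with the early canaryless deployments.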