Dealing with incidents within TTN

@kersing's quote of my message is missing quite an important bit of context:

With our current v2 infrastructure, we indeed need to avoid adding clusters (specifically broker+networkserver combinations) unless they are really needed. In regions that are already being served by our existing infrastructure there is indeed no real need, and more importantly, adding a cluster would lower the quality of service. This is a technical problem with v2, and not because we don't want it.

I agree that it would be good to discuss the future of operating public community network clusters, and we indeed don't need to wait for The Things Conference for that.

3 Likes

For me the failures are not the biggest problem; I know TTN is provided "as is" and depends on some hardworking volunteers solving the failures. But if I am experimenting with my gateway or nodes and I encounter a problem, I would like to be able to rule out that something is wrong with the TTN services. More than once it took me many frustrating hours to figure out what was wrong with my nodes or gateway, while it later turned out that there was a failure at TTN, often a failure that was not reported on #OPS or on the status page. The status page sometimes even reported that there were no problems while there certainly were issues. So a really up-to-date status report would be highly appreciated and would save me a lot of frustration.

4 Likes

I mean addressing the growth of the community / network and the problems in general that we have all seen this year.
A presentation of ideas and future implementations on how to cope?
A more or less automated, actual status page with input from devices worldwide, combined with operator data, would for example be very useful.
What can we, as a community, do to help improve things… I think we need more guidance from TTI on where to start :sunglasses:

1 Like

I presume TTI benefits from 24/7 engineer alerts and possibly out-of-hours fixes.

One idea for the list might be to treat TTN users as TTI customers and figure out the cost of providing some sort of SLA on a paid basis. I am sure there are enough of us on TTN to make a regular 'donation' to make this business model work. At the very least, such an SLA would have alerted a TTI engineer to the 15+ hour complete outage (which in this case, I presume, would have been a simple fix).

The problem, no doubt, will be the scope of such an SLA in the context of an 'as is' free service.
I'm thinking 'critical issues only' at this stage.

Also, up-vote for a more automated status page.

1 Like

To be honest, I think the only really viable solution is going to be a two-level one, where someone with a fleet of gateways manages their own network server with whatever level of reliability effort they require, and anything unknown to that then gets bumped up to a regional server that handles "roaming", along with individually owned gateways that don't have their own serving infrastructure.

An organization that invests in putting up a decent number of gateways typically has an application need that means it isn't going to be able to justify depending on an external service with no SLA, in a situation where the failure of that service can take out even purely local usage.

My general sense is that what's missing to make this work is an ability for a gateway (or an intermediate server proxying for many) to indicate back to a server that it won't be able to accept a transmit request, because it is already busy transmitting something else at that timeslot. Lengthening the standard RX1 delay would greatly increase flexibility for "who is going to send it" negotiations, with little obvious downside.
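
To make that concrete, here is a rough sketch (in Python, purely illustrative; the class, timing model and numbers are my own assumptions, not anything in the LoRaWAN spec or in TTN today) of the kind of per-gateway check I mean, where the gateway rejects a downlink whose transmit window overlaps one it has already committed to:

```python
# Illustrative sketch only: a gateway-side downlink scheduler that refuses a
# transmit request when the requested window overlaps an already-scheduled one,
# so the network server can try another gateway instead.
from dataclasses import dataclass, field

@dataclass
class GatewayTxScheduler:
    scheduled: list = field(default_factory=list)  # list of (start, end) times in seconds

    def request_tx(self, start: float, airtime: float) -> bool:
        """Accept and schedule the downlink if the window is free; otherwise
        report 'busy' so the server can pick a different gateway."""
        end = start + airtime
        for s, e in self.scheduled:
            if start < e and s < end:  # the two windows overlap
                return False
        self.scheduled.append((start, end))
        return True

gw = GatewayTxScheduler()
print(gw.request_tx(100.0, 0.2))  # True: window is free, downlink accepted
print(gw.request_tx(100.1, 0.2))  # False: overlaps, server should try elsewhere
```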

Ultimately, if TTN can't be architected in a way that allows local and fleet self-reliance, the result is likely to be lots of uncoordinated private networks serving only the needs of those paying for the infrastructure.

It's a fine idea in theory, but what happens if in one period (a month?) the 'donations' don't cover the cost: no SLA support, donations returned, etc.?

Complicated.

What are the numbers for known TTI nodes and known TTN nodes ?

Given the number of TTN users (many of whom will be businesses) I would expect donations to exceed the actual TTI cost, building up a buffer. Until TTI throw some figures at this it's difficult to say (how cash-strapped are they?), but £5-10/month per donation would seem reasonable, plus some [more] goodwill from TTI.

I am sure TTI do WANT to keep things running smoothly, as TTN is a good test platform for them and a good news story for TTI sales.

Anyway, just a thought as it would bring a basic SLA to the TTN platform.

2 Likes

As someone who manages their own cluster (Australia), I watched the comments unfold in the #ops channel yesterday. It was frustrating to watch, even though our region wasn't even the one impacted.

On one hand I can't work out how it can take 15+ hours for TTI to respond to a problem of this magnitude when #ops was buzzing with comments every few minutes. On the other hand, TTI staff are entitled to enjoy their weekend uninterrupted by work.

Even though TTN is ultimately provided to us as a best-effort free service, for some of us it's become part of what we do for a living. At a country-wide level, for some of you it will make sense to invest in the infrastructure and services required to manage your own cluster (when V3 becomes available).

For Australia this has worked well. We have become largely immune to the issues that affect the global network, mainly because we're smaller and it's more manageable, but mostly because we can decide how and when to respond to issues that arise and we're available in our own timezone. For the most part we can fix the issues ourselves. (Occasionally though, we can't, and we have had long outages too.)

But it seems to me there was one simple thing that could have made yesterday's outage more tolerable for those affected… acknowledgement. Just something notifying everyone that TTI are aware of the problem. It could be an automated Slack message, a status page update, a simple "we're aware of a problem" message. That would make "best effort" even better :slight_smile:

6 Likes

What about the status pages? As far as I could see, https://status.thethings.network/ did not show any change during the latest incident. That was one reason for irritation…

As someone who used to provide 24/7 support for large computer networks, I can fully understand why it took TTI that long to respond to a problem for which they had no SLA.

Once those who are responsible for the 24/7 support for TTI start dealing with TTN issues, because it's seen as a good thing to do, the response becomes expected by the community (for free) even though it's outside of the SLA.

When I was being paid (in my old job) to look after a network, I did indeed keep an eye on systems, but I was careful not to deal with issues for which there was no SLA and for which I was not being paid. Is that just mean, or practical?

If the TTI guys were to provide a 24/7 SLA for TTN, I guess someone would have to pay for that service.

1 Like

Marshall Rosenberg said: "If we are able to hear others and express our own needs, the solution is only 20 minutes away."

Let me try to give a short summary of what I understood so far:

  • TTI wants to keep autonomy over their own hosted systems
  • the v2 stack is not really suited for a federated/community-based approach, but v3 will be
  • the community could help in monitoring / setting up automated monitoring and reporting systems (federated, like the mapper service?)
  • clarity about the causes of issues should be available quickly and in an automated way
  • clarity that someone is working on the problem
  • a certain amount of downtime is acceptable
  • the long-term development of TTN should be discussed within and with the whole community (conference workshop?)

I am still uncertain about the following questions:

  • how much downtime is acceptable and by whom?
  • how can money help and can it even help?

Thank you for all your opinions so far!

2 Likes

First of all, apologies for the late response from my side. I was traveling last week and off yesterday, but I've been following internal and public (#ops, Twitter, here) messages closely. I appreciate the constructive feedback, nice words and support a lot; thanks @LanMarc77 for raising this, and also "special thanks" to the person who spent a few hours on a Twitter post with the TTN logo in flames and a skull suggesting we're hacked. It shows, positively and negatively, that people care. And that's a good thing after all.

There's an unfortunate mix of circumstances that caused the long downtime:

  • what: the busiest cluster (EU, operated by The Things Network Foundation *)
  • why: issues with component interconnections that are notoriously vulnerable and hard to get right due to the design of V2; add to that the year-on-year doubling of the network and the shift of focus to V3 over the past two years
  • when: the weekend (and we were actually enjoying it)

* A small note on operations: yes, the TTN core team is on the TTI payroll, but TTN is not a TTI-operated service and will not be. We allow for spending time on non-profit Foundation activities, including operations, but it is best effort.

That being said, I see a task for myself in bridging the gap between the limited operational availability of TTI staff and the seemingly unlimited willingness of the TTN community to maximize service uptime.

I'm replying to a few suggestions above to steer the discussion in that direction. Please feel free to make other suggestions.

This assumption is correct; I would say that "fixing" over 90% of operational issues (in V2) is simply a matter of restarting Docker containers in the right order. Unfortunately, we cannot give direct access to Docker, as that implies system root access and access to keys and passwords, which implies access to user data, which results in privacy and GDPR issues, etc.
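
For illustration, the kind of restart sequence I mean looks roughly like the sketch below. The container names, the ordering and the settle time are placeholders, not our actual deployment; it is only meant to show why ordering and access control matter here.

```python
# Rough sketch: restart a V2-style cluster's containers in dependency order.
# The names and order below are illustrative placeholders, not TTN's real setup.
import subprocess
import time

RESTART_ORDER = ["discovery", "broker", "networkserver", "handler", "router"]

def restart_in_order(containers, settle_seconds=10):
    for name in containers:
        print(f"restarting {name} ...")
        subprocess.run(["docker", "restart", name], check=True)
        # Give each component time to re-establish its connections before
        # restarting the next one that depends on it.
        time.sleep(settle_seconds)

if __name__ == "__main__":
    restart_in_order(RESTART_ORDER)
```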

How about selecting 10-15 active and trusted community members, spread across timezones, who have access to a custom Slackbot command (e.g. /restart ttn-eu) that triggers a predefined restart sequence? The restart only actually happens if, say, 3 individual members execute that command independently within 15 minutes. And to avoid abuse, at most once every hour. I know it's monkey patching, but it gives control.
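
A minimal sketch of what that quorum logic could look like (the command, the thresholds and the restart trigger are just the proposal above; nothing like this exists yet):

```python
# Sketch of the quorum behind a hypothetical "/restart ttn-eu" Slack command:
# 3 distinct trusted members within 15 minutes, at most one restart per hour.
import time
from collections import defaultdict

QUORUM = 3
VOTE_WINDOW = 15 * 60  # seconds
COOLDOWN = 60 * 60     # seconds

class RestartVotes:
    def __init__(self):
        self.votes = defaultdict(dict)  # cluster -> {member: vote timestamp}
        self.last_restart = {}          # cluster -> timestamp of last restart

    def vote(self, member, cluster, now=None):
        if now is None:
            now = time.time()
        if now - self.last_restart.get(cluster, 0) < COOLDOWN:
            return "cooldown: this cluster was restarted less than an hour ago"
        # Keep only votes still inside the 15-minute window, then add this one.
        self.votes[cluster] = {m: t for m, t in self.votes[cluster].items()
                               if now - t < VOTE_WINDOW}
        self.votes[cluster][member] = now
        if len(self.votes[cluster]) >= QUORUM:
            self.votes[cluster].clear()
            self.last_restart[cluster] = now
            return f"quorum reached: triggering restart sequence for {cluster}"
        return f"vote recorded ({len(self.votes[cluster])}/{QUORUM})"

bot = RestartVotes()
print(bot.vote("alice", "ttn-eu"))  # vote recorded (1/3)
print(bot.vote("bob", "ttn-eu"))    # vote recorded (2/3)
print(bot.vote("carol", "ttn-eu"))  # quorum reached: triggering restart sequence for ttn-eu
```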

Yes, and I'm happy to cover this topic at the other 2019 Conferences where I'll be (India and Adelaide).

Right. We need to improve this as well. So, as we speak, we're spending time to improve (real-time) communications on TTN operations, including a new status page. If automated reporting works well, we'll be using the @ttnstatus Twitter account too.

TTN is a special network and is operationally very different from TTI networks: TTN clusters are fully connected and the TTN Foundation only operates a part of the network (a big part, but not all), there's heavy load, and there's lots of experimentation and testing at device and gateway level that would flag alerts in an enterprise network (and hence put unnecessary burden on TTI ops), etc. That being said, as we migrate to V3, we will converge the operational setup of the TTN Foundation's clusters and TTI Cloud Hosted. It will not be the same infrastructure, but it will be more manageable.

Yes. This is what Packet Broker is going to provide. In V3, Packet Broker is the backbone of LoRaWAN traffic between public TTN clusters, private TTI clusters, anyone running the open-source The Things Stack, and even other network servers (through LoRaWAN stateless passive roaming or by implementing the Packet Broker protocol).

As we've been promising for a long time, it will become easier for the community to operate TTN clusters as well (cc @kersing). I have several requests from all parts of the world for operating one. There is nothing wrong with overlapping regions. However, we do need Packet Broker for this. We also need to start structurally measuring the uptime of all TTN clusters, in a way that lets the community choose which TTN cluster to pick for connecting a gateway and registering devices. If it turns out that the public TTN Groningen community cluster has better uptime than the TTN Foundation's EU cluster, so be it. We'll hand out an award for the best community-contributed cluster at the Conference. We'll figure it out; willingness is not the issue, but the technology is not there yet.
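
As an illustration of what structural uptime measurement could look like, a simple probe along these lines would do (a sketch only; the URLs are placeholders, not real cluster endpoints, and a real setup would persist the results somewhere public):

```python
# Sketch: periodically probe each cluster's health endpoint and report uptime,
# so the community can compare clusters. URLs below are placeholders.
import urllib.request

CLUSTERS = {
    "ttn-eu": "https://eu.example.invalid/health",
    "community-cluster-x": "https://x.example.invalid/health",
}

results = {name: {"ok": 0, "total": 0} for name in CLUSTERS}

def probe_once():
    for name, url in CLUSTERS.items():
        results[name]["total"] += 1
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    results[name]["ok"] += 1
        except Exception:
            pass  # any failure counts as downtime

def uptime_report():
    for name, r in results.items():
        pct = 100.0 * r["ok"] / r["total"] if r["total"] else 0.0
        print(f"{name}: {pct:.2f}% up over {r['total']} probes")

# e.g. call probe_once() every minute from a scheduler, then print uptime_report()
```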

Now, I hear you thinking: "but who operates Packet Broker and how can we make that redundant?" Packet Broker will be TTI-operated. But the APIs are open and V3 peering allows for redundancy (i.e. multiple independent upstreams). We are open to working with neutral parties (e.g. universities, internet exchanges, domain registrars, etc.) to provide redundancy for peering in the TTN community network. That is independent of the community-contributed TTN clusters described above: both provide decentralization.

I don't think that throwing money at it will help. Also, to whom would TTI provide the SLA? What if we don't meet the SLA? This, plus the operational differences outlined above, makes me believe that we should keep TTN truly a community initiative, where TTI is a contributor on a best-effort basis, like, hopefully, many other contributors that will offer a public TTN cluster. Sure, the TTN Foundation-operated clusters are a showcase of TTI's commercial abilities, but only to a certain extent, just as we do and will allow other community network contributors to optionally (!) upsell services.

I'm not against donations though; in fact I think it's good to consider them to cover the operational costs of the TTN Foundation. Currently, TTI pays the exponentially increasing Azure bills and pays for the time spent on ops, and it would be great if TTN could become financially self-sustaining (through donations) and operationally independent (through decentralization and potentially Slackbots?).

10 Likes

:clap::+1:

I think Reading UK comes up first - can we get started there with TTN Core/TTI delegate(s) even if you aren't there yourself? It may be useful to continue the process (from this thread) of gathering ideas/needs/concerns/issues/strawman solution suggestions.

Sure, please do, and post any findings or things to discuss here. @wienkegiezeman is attending and is a good delegate of the TTN core team and TTI as well :wink: I'll keep an eye on this thread too.

2 Likes

This has been an interesting and shocking thread. I've not paid much attention in recent months, but to me this thread clearly states that I cannot rely on TTN at all. It's useful for doing experiments and that's it.

As someone who has built and operated a large cloud service with a significant free offering, I'm rather shocked by the statement "our internal on-call system only alerts the team about incidents with the public community network during working hours, and not at night or during weekends". It would never have occurred to us to leave the free offering broken for a whole weekend and then just say "hey, best effort!".

What I would expect is for the free offering to be the lowest priority, and of course there's no recourse/compensation, nor a 24x7 support line to call, etc. But to leave it completely abandoned, wow.

I suppose what it really means is that TTN has become a pure liability to TTI, i.e., it provides zero benefits, and that's why TTI can't provide a minimum of commitment?

The best thing that could come out of this discussion would be a true community network, operated by volunteers who care about the mission, that peers with TTN. Then get a majority of gateways to switch to the community network and let TTN die…

OK, I'm sure I'm now going to get flamed off the forum…

1 Like

And do you see that happening?

I am still impressed by how this community grows and how TTI grows worldwide. I see it as an accomplishment and I accept the 'growing pains'; that's the reason I am still a volunteer. :sunglasses:

4 Likes

In the sense that the alternative, for those who need to be sure they can fix anything that may break, is local networks that don't peer with TTN, you are correct: peering is the best outcome.

1 Like