Dealing with incidents within TTN

@kersing's quote of my message is missing quite an important bit of context:

With our current v2 infrastructure, we indeed need to avoid adding clusters (specifically broker+networkserver combinations) unless they are really needed. In regions that are already being served by our existing infrastructure there is indeed no real need, and more importantly, adding a cluster would lower the quality of service. This is a technical problem with v2, and not because we don't want it.

I agree that it would be good to discuss the future of operating public community network clusters, and we indeed don't need to wait for The Things Conference for that.

3 Likes

For me the failures are not the biggest problem; I know TTN is provided "as is" and depends on some hardworking volunteers solving the failures. But if I am experimenting with my gateway or nodes and I encounter a problem, I would like to be able to rule out that something is wrong with the TTN services. More than once it took me many frustrating hours to figure out what was wrong with my nodes or gateway, while it later turned out that there was a failure at TTN, often a failure that was not reported on #OPS or on the status page. The status page sometimes even reported that there were no problems while there certainly were issues. So a really up-to-date status report would be highly appreciated and would save me a lot of frustration.

4 Likes

I mean addressing the growth of the community / network and the problems in general that we have all seen this year.
A presentation of ideas and future implementations on how to cope?
A more or less automated, actual status page with input from devices worldwide, combined with operator data, would for example be very useful.
What can we, as a community, do to help improve things… I think we need more guidance from TTI on where to start :sunglasses:

1 Like

I presume TTI benefits from 24/7 engineer alerts and possibly out-of-hours fixes.

One idea for the list might be to treat TTN users as TTI customers and figure out the cost of providing some sort of SLA on a paid basis. I am sure there are enough of us on TTN to make a regular 'donation' to make this business model work. At the very least, such an SLA would have alerted a TTI engineer to the 15+ hour complete outage (which in this case, I presume, would have been a simple fix).

The problem, no doubt, will be the scope of such an SLA in the context of an 'as is' free service.
I'm thinking 'critical issues only' at this stage.

Also, up-vote for a more automated status page.

1 Like

To be honest, I think the only really viable solution is going to be a two-level one, where someone with a fleet of gateways manages their own network server with whatever level of reliability effort they require, and anything unknown to that then gets bumped up to a regional server that handles "roaming", along with individually owned gateways that don't have their own serving infrastructure.

An organization that invests in putting up a decent number of gateways typically has an application need that means it isn't going to be able to justify depending on an external service with no SLA, in a situation where the failure of that service can take out even purely local usage.

My general sense is that what's missing to make this work is an ability for a gateway (or an intermediate server proxying for many) to indicate back to a server that it won't be able to accept a transmit request, because it is already busy transmitting something else at that timeslot. Lengthening the standard RX1 delay would greatly increase flexibility for "who is going to send it" negotiations, with little obvious downside.
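
To make that concrete, here is a rough sketch (in Python, purely illustrative; the class, timing model and numbers are my own assumptions, not anything in the LoRaWAN spec or in TTN today) of the kind of per-gateway check I mean, where the gateway rejects a downlink whose transmit window overlaps one it has already committed to:

```python
# Illustrative sketch only: a gateway-side downlink scheduler that refuses a
# transmit request when the requested window overlaps an already-scheduled one,
# so the network server can try another gateway instead.
from dataclasses import dataclass, field

@dataclass
class GatewayTxScheduler:
    scheduled: list = field(default_factory=list)  # list of (start, end) times in seconds

    def request_tx(self, start: float, airtime: float) -> bool:
        """Accept and schedule the downlink if the window is free; otherwise
        report 'busy' so the server can pick a different gateway."""
        end = start + airtime
        for s, e in self.scheduled:
            if start < e and s < end:  # the two windows overlap
                return False
        self.scheduled.append((start, end))
        return True

gw = GatewayTxScheduler()
print(gw.request_tx(100.0, 0.2))  # True: window is free, downlink accepted
print(gw.request_tx(100.1, 0.2))  # False: overlaps, server should try elsewhere
```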

Ultimately, if TTN can't be architected in a way that allows local and fleet self-reliance, the result is likely to be lots of uncoordinated private networks serving only the needs of those paying for the infrastructure.

It's a fine idea in theory, but what happens if in one period (a month?) the 'donations' don't cover the cost: no SLA support, donations returned, etc.?

Complicated.

What are the numbers for known TTI nodes and known TTN nodes ?

Given the number of TTN users (many of whom will be businesses) I would expect donations to exceed the actual TTI cost, building up a buffer. Until TTI throw some figures at this it's difficult to say (how cash-strapped are they?), but £5-10/month per donation would seem reasonable, plus some [more] goodwill from TTI.

I am sure TTI do WANT to keep things running smoothly, as TTN is a good test platform for them and a good news story for TTI sales.

Anyway, just a thought as it would bring a basic SLA to the TTN platform.

2 Likes

As someone who manages their own cluster (Australia), I watched the comments unfold in the #ops channel yesterday. It was frustrating to watch, even though our region wasn't even the one impacted.

On one hand I can't work out how it can take 15+ hours for TTI to respond to a problem of this magnitude when #ops was buzzing with comments every few minutes. On the other hand, TTI staff are entitled to enjoy their weekend uninterrupted by work.

Even though TTN is ultimately provided to us as a best-effort free service, for some of us it's become part of what we do for a living. At a country-wide level, for some of you it will make sense to invest in the infrastructure and services required to manage your own cluster (when V3 becomes available).

For Australia this has worked well. We have become largely immune to the issues that affect the global network, mainly because we're smaller and it's more manageable, but mostly because we can decide how and when to respond to issues that arise and we're available in our own timezone. For the most part we can fix the issues ourselves. (Occasionally though, we can't, and we have had long outages too.)

But it seems to me there was one simple thing that could have made yesterday's outage more tolerable for those affected… acknowledgement. Just something notifying everyone that TTI are aware of the problem. It could be an automated Slack message, a status page update, a simple "we're aware of a problem" message. That would make "best effort" even better :slight_smile:

6 Likes

What about the status pages? As far as I could see, https://status.thethings.network/ did not show any change during the latest incident. That was one reason for irritation…

As someone who used to provide 24/7 support for large computer networks, I can fully understand why it took TTI that long to respond to a problem for which they had no SLA.

Once those who are responsible for the 24/7 support for TTI start dealing with TTN issues, because it's seen as a good thing to do, the response becomes expected by the community (for free) even though it's outside of the SLA.

When I was being paid (in my old job) to look after a network, I did indeed keep an eye on systems, but I was careful not to deal with issues for which there was no SLA and for which I was not being paid. Is that just mean, or practical?

If the TTI guys were to provide a 24/7 SLA for TTN, I guess someone would have to pay for that service.

1 Like

Marshall Rosenberg said: "If we are able to hear others and express our own needs, the solution is only 20 minutes away."

Let me try to give a short summary of what I understood so far:

  • TTI wants to keep autonomy over their own hosted systems
  • the v2 stack is not really suited for a federated/community-based approach, but v3 will be
  • the community could help in monitoring / setting up automated monitoring and reporting systems (federated, like the mapper service?)
  • clarity about the causes of issues should be available quickly and in an automated way
  • clarity that someone is working on the problem
  • a certain amount of downtime is acceptable
  • the long-term development of TTN should be discussed within and with the whole community (conference workshop?)

I am still uncertain about the following questions:

  • how much downtime is acceptable and by whom?
  • how can money help and can it even help?

Thank you for all your opinions so far!

2 Likes

First of all, apologies for the late response from my side. I was traveling last week and off yesterday, but I've been following internal and public (#ops, Twitter, here) messages closely. I appreciate the constructive feedback, nice words and support a lot; thanks @LanMarc77 for raising this, and also "special thanks" to the person who spent a few hours on a Twitter post with the TTN logo in flames and a skull suggesting we're hacked. It shows, positively and negatively, that people care. And that's a good thing after all.

There's an unfortunate mix of circumstances that caused the long downtime:

  • what: the busiest cluster (EU, operated by The Things Network Foundation *)
  • why: issues with component interconnections that are notoriously vulnerable and hard to get right due to the design of V2; add to that the year-on-year doubling of the network and the shift of focus to V3 over the past two years
  • when: the weekend (and we were actually enjoying it)

* A small note on operations: yes, the TTN core team is on the TTI payroll, but TTN is not a TTI-operated service and will not be. We allow for spending time on non-profit Foundation activities, including operations, but it is best effort.

That being said, I see a task for myself in bridging the gap between the limited operational availability of TTI staff and the seemingly unlimited willingness of the TTN community to maximize service uptime.

I'm replying to a few suggestions above to steer the discussion in that direction. Please feel free to make other suggestions.

This assumption is correct; I would say that "fixing" over 90% of operational issues (in V2) is simply a matter of restarting Docker containers in the right order. Unfortunately, we cannot give direct access to Docker, as that implies system root access and access to keys and passwords, which implies access to user data, which results in privacy and GDPR issues, etc.
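
For illustration, the kind of restart sequence I mean looks roughly like the sketch below. The container names, the ordering and the settle time are placeholders, not our actual deployment; it is only meant to show why ordering and access control matter here.

```python
# Rough sketch: restart a V2-style cluster's containers in dependency order.
# The names and order below are illustrative placeholders, not TTN's real setup.
import subprocess
import time

RESTART_ORDER = ["discovery", "broker", "networkserver", "handler", "router"]

def restart_in_order(containers, settle_seconds=10):
    for name in containers:
        print(f"restarting {name} ...")
        subprocess.run(["docker", "restart", name], check=True)
        # Give each component time to re-establish its connections before
        # restarting the next one that depends on it.
        time.sleep(settle_seconds)

if __name__ == "__main__":
    restart_in_order(RESTART_ORDER)
```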

How about selecting 10-15 active and trusted community members, spread across timezones, who have access to a custom Slackbot command (e.g. /restart ttn-eu) that triggers a predefined restart sequence? The restart only actually happens if, say, 3 individual members execute that command independently within 15 minutes. And to avoid abuse, at most once every hour. I know it's monkey patching, but it gives control.
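
A minimal sketch of what that quorum logic could look like (the command, the thresholds and the restart trigger are just the proposal above; nothing like this exists yet):

```python
# Sketch of the quorum behind a hypothetical "/restart ttn-eu" Slack command:
# 3 distinct trusted members within 15 minutes, at most one restart per hour.
import time
from collections import defaultdict

QUORUM = 3
VOTE_WINDOW = 15 * 60  # seconds
COOLDOWN = 60 * 60     # seconds

class RestartVotes:
    def __init__(self):
        self.votes = defaultdict(dict)  # cluster -> {member: vote timestamp}
        self.last_restart = {}          # cluster -> timestamp of last restart

    def vote(self, member, cluster, now=None):
        if now is None:
            now = time.time()
        if now - self.last_restart.get(cluster, 0) < COOLDOWN:
            return "cooldown: this cluster was restarted less than an hour ago"
        # Keep only votes still inside the 15-minute window, then add this one.
        self.votes[cluster] = {m: t for m, t in self.votes[cluster].items()
                               if now - t < VOTE_WINDOW}
        self.votes[cluster][member] = now
        if len(self.votes[cluster]) >= QUORUM:
            self.votes[cluster].clear()
            self.last_restart[cluster] = now
            return f"quorum reached: triggering restart sequence for {cluster}"
        return f"vote recorded ({len(self.votes[cluster])}/{QUORUM})"

bot = RestartVotes()
print(bot.vote("alice", "ttn-eu"))  # vote recorded (1/3)
print(bot.vote("bob", "ttn-eu"))    # vote recorded (2/3)
print(bot.vote("carol", "ttn-eu"))  # quorum reached: triggering restart sequence for ttn-eu
```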

Yes, and I'm happy to cover this topic at the other 2019 Conferences where I'll be (India and Adelaide).

Right. We need to improve this as well. So, as we speak, we're spending time to improve (real-time) communications on TTN operations, including a new status page. If automated reporting works well, we'll be using the @ttnstatus Twitter account too.

TTN is a special network and is operationally very different from TTI networks: TTN clusters are fully connected and the TTN Foundation only operates a part of the network (a big part, but not all), there's heavy load, and there's lots of experimentation and testing at device and gateway level that would flag alerts in an enterprise network (and hence put unnecessary burden on TTI ops), etc. That being said, as we migrate to V3, we will converge the operational setup of the TTN Foundation's clusters and TTI Cloud Hosted. It will not be the same infrastructure, but it will be more manageable.

Yes. This is what Packet Broker is going to provide. In V3, Packet Broker is the backbone of LoRaWAN traffic between public TTN clusters, private TTI clusters, anyone running the open-source The Things Stack, and even other network servers (through LoRaWAN stateless passive roaming or by implementing the Packet Broker protocol).

As we've been promising for a long time, it will become easier for the community to operate TTN clusters as well (cc @kersing). I have several requests from all parts of the world for operating one. There is nothing wrong with overlapping regions. However, we do need Packet Broker for this. We also need to start structurally measuring the uptime of all TTN clusters, in a way that lets the community choose which TTN cluster to pick for connecting a gateway and registering devices. If it turns out that the public TTN Groningen community cluster has better uptime than the TTN Foundation's EU cluster, so be it. We'll hand out an award for the best community-contributed cluster at the Conference. We'll figure it out; willingness is not the issue, but the technology is not there yet.
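
As an illustration of what structural uptime measurement could look like, a simple probe along these lines would do (a sketch only; the URLs are placeholders, not real cluster endpoints, and a real setup would persist the results somewhere public):

```python
# Sketch: periodically probe each cluster's health endpoint and report uptime,
# so the community can compare clusters. URLs below are placeholders.
import urllib.request

CLUSTERS = {
    "ttn-eu": "https://eu.example.invalid/health",
    "community-cluster-x": "https://x.example.invalid/health",
}

results = {name: {"ok": 0, "total": 0} for name in CLUSTERS}

def probe_once():
    for name, url in CLUSTERS.items():
        results[name]["total"] += 1
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    results[name]["ok"] += 1
        except Exception:
            pass  # any failure counts as downtime

def uptime_report():
    for name, r in results.items():
        pct = 100.0 * r["ok"] / r["total"] if r["total"] else 0.0
        print(f"{name}: {pct:.2f}% up over {r['total']} probes")

# e.g. call probe_once() every minute from a scheduler, then print uptime_report()
```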

Now, I hear you thinking: "but who operates Packet Broker and how can we make that redundant?" Packet Broker will be TTI-operated. But the APIs are open and V3 peering allows for redundancy (i.e. multiple independent upstreams). We are open to working with neutral parties (e.g. universities, internet exchanges, domain registrars, etc.) to provide redundancy for peering in the TTN community network. That is independent of the community-contributed TTN clusters described above: both provide decentralization.

I don't think that throwing money at it will help. Also, to whom would TTI provide the SLA? What if we don't meet the SLA? This, plus the operational differences outlined above, makes me believe that we should keep TTN truly a community initiative, where TTI is a contributor on a best-effort basis, like, hopefully, many other contributors that will offer a public TTN cluster. Sure, the TTN Foundation-operated clusters are a showcase of TTI's commercial abilities, but only to a certain extent, just as we do and will allow other community network contributors to optionally (!) upsell services.

I'm not against donations though; in fact I think it's good to consider them to cover the operational costs of the TTN Foundation. Currently, TTI pays the exponentially increasing Azure bills and pays for the time spent on ops, and it would be great if TTN could become financially self-sustaining (through donations) and operationally independent (through decentralization and potentially Slackbots?).

10 Likes

:clap::+1:

I think Reading UK comes up first - can we get started there with TTN Core/TTI delegate(s) even if you aren't there yourself? It may be useful to continue the process (from this thread) of gathering ideas/needs/concerns/issues/strawman solution suggestions.

Sure, please do, and post any findings or things to discuss here. @wienkegiezeman is attending and is a good delegate of the TTN core team and TTI as well :wink: I'll keep an eye on this thread too.

2 Likes

This has been an interesting and shocking thread. I've not paid much attention in recent months, but to me this thread clearly states that I cannot rely on TTN at all. It's useful for doing experiments and that's it.

As someone who has built and operated a large cloud service with a significant free offering, I'm rather shocked by the statement "our internal on-call system only alerts the team about incidents with the public community network during working hours, and not at night or during weekends". It would never have occurred to us to leave the free offering broken for a whole weekend and then just say "hey, best effort!".

What I would expect is for the free offering to be the lowest priority, and of course there's no recourse/compensation, nor a 24x7 support line to call, etc. But to leave it completely abandoned, wow.

I suppose what it really means is that TTN has become a pure liability to TTI, i.e., it provides zero benefits, and that's why TTI can't provide a minimum of commitment?

The best thing that could come out of this discussion would be a true community network, operated by volunteers who care about the mission, that peers with TTN. Then get a majority of gateways to switch to the community network and let TTN die…

OK, I'm sure I'm now going to get flamed off the forum…

1 Like

And do you see that happening?

I am still impressed by how this community grows and how TTI grows worldwide. I see it as an accomplishment and I accept the 'growing pains'; that's the reason I am still a volunteer. :sunglasses:

4 Likes

In the sense that the alternative, for those who need to be sure they can fix anything that may break, is local networks that don't peer with TTN, you are correct: peering is the best outcome.

1 Like