Dealing with incidents within TTN

Just so you know up front: I cannot deal very well with no-information situations. I have a strong need for clarity and am more of a "We need to do something" type of guy than an "OK, let's wait" one.

The recent incident was reported here and in the Slack #ops channel, and IamBatman fixed it. I assume it took him 5 minutes, based on his first reaction in the channel and the messages coming in again.

TTN is provided "as is"; there is no SLA (service level agreement). We all know this and, by joining, also agreed to it. TTI people are probably on shifts to provide the service they can, even on weekends and holidays. Thank you for your service.

I assume that there are certain error classes that are easily identified by trained and trusted humans and can be fixed with a predetermined set of options (e.g. restarting a service). If we, the community, want the service to be more stable (I do!), then I think we as a community need to invest.

I can imagine a TTN watchtower group that consists of community people willing to give part of their day to watch over TTN. If they then have a little bit of training (quality assurance!) and the ability to restart services (e.g. via TTI-provided access), they could fix a lot of incidents quickly and "escalate" the others to TTI.
I see this as a very logical step in the growth process of TTN, and very much in line with the manifesto that inspired us all:
We believe that this power should not be restricted to a few people, companies or nations. Instead this should be distributed over as many people as possible without the possibility to be taken away by anyone.

First of all, I'm glad that @KrishnaIyerEaswaran2 managed to resolve this particular incident, and I would like to thank him for taking care of it. I haven't spoken to him about what exactly was wrong and what he did to resolve it. We'll discuss that tomorrow and update the incident page with some details.

We understand that many here are relying on the public community network services being available at all times, and I think we can all agree that outages like this one already don't happen often. Since the public community network doesn't have an SLA, our internal on-call system only alerts the team about incidents with the public community network during working hours, and not at night or during weekends. Outside working hours, there's often someone from the team available to take action, but this time none of us was available for a long time.

The Things Network's backend (v2) was designed with decentralization in mind. The public community network consists of multiple clusters worldwide, some of them operated by The Things Industries, others operated by partners such as Meshed, Switch and Digital Catapult. As a bonus, there is (usually) traffic exchange between the regions. The idea was that this decentralization allows us to operate a global network that doesn't rely on a single party (TTI) for everything. So far that model has worked quite well, and also with the recent incident with our EU cluster, we can see that other clusters were still fully operational.


I'm pretty sure that giving external people access to TTI-hosted clusters isn't going to happen. Instead our goal is - and always has been - to involve more operators in the public community network. We believe that large and active communities should be able to operate their own clusters instead of relying on TTI-hosted clusters. Unfortunately our v2 implementation doesn't properly deal with unreliable or malicious operators (we learned that the hard way), and we decided to stop adding partner-operated clusters. With v3 this is going to be possible again, and we hope to involve communities in hosting their own clusters.

5 Likes

I think it would be helpful to add external monitoring to TTN, aided by community members.

1 Like

I know I don't have to watch, as my automated monitoring provides that information readily. Yesterday at 17:15 CEST it reported there was a problem, and it reported things were up and running again this morning, within 5 minutes of the issue being fixed.
The only thing I need to do when I get monitoring messages is check whether the monitoring itself might have an issue, but that has happened only once over a 3-year period.
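
To give an idea of the kind of watchdog meant here (a minimal sketch, not my actual monitoring; the application ID, access key and threshold are placeholders, and it assumes the v2 EU MQTT endpoint and a node that uplinks regularly):

```python
# Minimal uplink watchdog sketch for a TTN v2 application (placeholder credentials).
# Assumption: at least one of your own nodes uplinks regularly, so prolonged
# silence means either your devices or the TTN cluster has a problem.
import time
import threading
import paho.mqtt.client as mqtt

APP_ID = "my-app-id"                 # placeholder application ID
ACCESS_KEY = "ttn-account-v2.xxxx"   # placeholder access key
MAX_SILENCE = 30 * 60                # alert after 30 minutes without uplinks

last_uplink = time.time()

def on_message(client, userdata, msg):
    global last_uplink
    last_uplink = time.time()        # any uplink counts as "the route works for me"

def watchdog():
    while True:
        silence = time.time() - last_uplink
        if silence > MAX_SILENCE:
            print(f"ALERT: no uplinks for {silence / 60:.0f} min - check nodes or TTN status")
        time.sleep(60)

client = mqtt.Client()
client.username_pw_set(APP_ID, ACCESS_KEY)
client.on_message = on_message
client.connect("eu.thethings.network", 1883)
client.subscribe(f"{APP_ID}/devices/+/up")

threading.Thread(target=watchdog, daemon=True).start()
client.loop_forever()
```

A recovery message could be sent the same way the moment uplinks start arriving again.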

That is interesting, because I inquired about what would be needed to implement this for our Dutch community at least 3 times, only to be told there is no need as TTN covers NL quite adequately.

Hmmm, speaking to different core team members at this year's TTN conference regarding this, my take on the answers was that I was being discouraged (again) because there is "no need". Did I interpret the responses that badly???

6 Likes

I hope this subject will be on the agenda for the TTN Conference 2020 in some form.

1 Like

Why wait that long?! :wink: Add it to Reading and all the pending Q4 regional Confs! :slight_smile:

@kersing's quote of my message is missing quite an important bit of context:

With our current v2 infrastructure, we indeed need to avoid adding clusters (specifically broker+networkserver combinations) if not needed. In regions that are already being served by our existing infrastructure, there is indeed no real need, but more importantly, it will lower the quality of service. This is a technical problem with v2, and not because we don't want it.

I agree that it would be good to discuss the future of operating public community network clusters, and we indeed don't need to wait for The Things Conference with that.

3 Likes

For me the failures are not the biggest problem; I know TTN is provided "as is" and depends on some hardworking volunteers resolving the failures. But if I am experimenting with my gateway or nodes and I encounter a problem, I would like to be able to rule out that something is wrong with the TTN services. More than once it took me many frustrating hours to figure out what was wrong with my nodes or gateway, while it later turned out that there was a failure at TTN, often one that was not reported on #ops or on the status page. The status page sometimes even reported that there were no problems while there certainly were issues. So a really up-to-date status report would be highly appreciated and would save me a lot of frustration.

4 Likes

I mean addressing the growth of the community / network and the problems in general, as we all can see this year.
A presentation of ideas and future implementations on how to cope?
A more or less automated, actual status page with input from devices worldwide, combined with operator data, for example, would be very useful (see the sketch below).
What can we, as a community, do to help improve… I think we need more guidance from TTI on where to start :sunglasses:
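
To make that idea a bit more concrete: this is only a sketch of the aggregation logic, not an existing TTN service, and the thresholds and cluster names are made up.

```python
# Sketch: community monitors report "last uplink seen" per cluster, and the
# status page marks a cluster degraded when most reporters have gone silent.
import time
from collections import defaultdict

STALE_AFTER = 20 * 60           # a reporter counts as silent after 20 minutes (assumption)
reports = defaultdict(dict)     # cluster -> {reporter_id: last_seen_epoch}

def report(cluster, reporter_id, last_seen):
    """Called whenever a community monitor sees traffic on a cluster."""
    reports[cluster][reporter_id] = last_seen

def cluster_status(cluster, now=None):
    now = time.time() if now is None else now
    seen = reports.get(cluster, {})
    if not seen:
        return "unknown"
    silent = sum(1 for t in seen.values() if now - t > STALE_AFTER)
    ratio = silent / len(seen)
    if ratio > 0.8:
        return "down"           # almost every reporter lost traffic
    if ratio > 0.3:
        return "degraded"       # a significant share lost traffic
    return "operational"

# Example: three reporters on the EU cluster, two of them silent for an hour
report("ttn-eu", "gw-nl-01", time.time() - 3600)
report("ttn-eu", "gw-de-07", time.time() - 3600)
report("ttn-eu", "gw-uk-03", time.time() - 60)
print(cluster_status("ttn-eu"))   # -> "degraded"
```

Combining this with operator data would then just be a matter of overriding the derived status with whatever the cluster operator reports.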

1 Like

I presume TTI benefits from 24/7 engineer alerts and possibly out of hours fixes.

One idea for the list might be to treat TTN users as a TTI customer and figure out the cost of providing some sort of SLA on a paid basis. I am sure there are enough of us on TTN to make a regular 'donation' to make this business model work. At the very least, such an SLA would have alerted a TTI engineer to the 15+ hour complete outage (which in this case, I presume, would have been a simple fix).

The problem, no doubt, will be the scope of such an SLA in the context of an 'as is' free service.
I'm thinking 'critical issues only' at this stage.

Also, up-vote for a more automated status page.

1 Like

To be honest, I think the only really viable solution is going to be a two-level one, where someone with a fleet of gateways manages their own network server with whatever level of reliability effort they require, and anything unknown to that then gets bumped up to a regional server that handles "roaming" along with individually owned gateways not having their own serving infrastructure.

An organization that invests in putting up a decent number of gateways typically has an application need that means it isn't going to be able to justify depending on an external service with no SLA, in a situation where the failure of that service can take out even purely local usage.

My general sense is that what's missing to make this work is an ability for a gateway (or an intermediate server proxying for many) to indicate back to a server that it won't be able to accept a transmit request because it is already busy transmitting something else in that timeslot. Lengthening the standard RX1 delay would greatly increase flexibility for "who is going to send it" negotiations, with little obvious downside.
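
For what it's worth, the Semtech UDP forwarder can already report a rejected downlink after the fact through TX_ACK error codes such as COLLISION_PACKET; what I'm describing would have to happen earlier in the exchange. Purely as an illustration, and not something that exists in any current protocol - all names and fields below are made up:

```python
# Hypothetical "downlink busy" negotiation messages - nothing like this exists
# in the current gateway protocols; field names are invented for illustration.
import json

downlink_request = {
    "token": "a1b2c3",              # correlation token for this downlink attempt
    "tmst": 123456789,              # requested transmit time (concentrator time)
    "freq": 868.1,
    "datr": "SF9BW125",
}

busy_reply = {
    "token": "a1b2c3",
    "accepted": False,
    "reason": "tx_scheduled",       # already committed to another transmission
    "busy_until_tmst": 123900000,   # lets the server pick another gateway or slot
}

print(json.dumps(busy_reply, indent=2))
```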

Ultimately, if TTN can't be architected in a way that allows local and fleet self-reliance, the result is likely to be lots of uncoordinated private networks serving only the needs of those paying for the infrastructure.

It's a fine idea in theory, but what happens if in one period (a month?) the 'donations' don't cover the cost? No SLA support, donations returned, etc.?

Complicated.

What are the numbers for known TTI nodes and known TTN nodes?

Given the number of TTN users (many of which will be businesses), I would expect donations to exceed the actual TTI cost, building up a buffer. Until TTI throw some figures at this it's difficult to say (how cash-strapped are they?), but £5-10/month per donation would seem reasonable, plus some [more] goodwill from TTI.

I am sure TTI do WANT to keep things running smoothly, as TTN is a good test platform for them and a good news story for TTI sales.

Anyway, just a thought as it would bring a basic SLA to the TTN platform.

2 Likes

As someone who manages their own cluster (Australia), I watched the comments unfold in the #ops channel yesterday. It was frustrating to watch, even though our region wasn't even the one impacted.

On one hand, I can't work out how it can take 15+ hours for TTI to respond to a problem of this magnitude when #ops was buzzing with comments every few minutes. On the other hand, TTI staff are entitled to enjoy their weekend uninterrupted by work.

Even though TTN is ultimately provided to us as a best-effort free service, for some of us it's become part of what we do for a living. At a country-wide level, for some of you it will make sense to invest in the infrastructure and services required to manage your own cluster (when V3 becomes available).

For Australia this has worked well. We have become largely immune to the issues that affect the global network, mainly because we're smaller and it's more manageable, but mostly because we can decide how and when to respond to issues that arise and we're available in our own timezone. For the most part we can fix the issues ourselves. (Occasionally, though, we can't, and we have had long outages too.)

But it seems to me there was one simple thing that could have made yesterday's outage more tolerable for those affected… acknowledgement. Just something notifying everyone that TTI are aware of the problem. It could be an automated Slack message, a status page update, a simple "we're aware of a problem" message. That would make "best effort" even better :slight_smile:
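
For illustration, the automated Slack message part could be as small as this; just a sketch, with a placeholder incoming-webhook URL, and the detection side (whatever monitoring notices the outage) left out:

```python
# Post an automated acknowledgement to the #ops channel via a Slack incoming
# webhook. The webhook URL is a placeholder; the trigger is out of scope here.
import json
import urllib.request

WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

def acknowledge(cluster):
    payload = {
        "text": f":warning: We are aware of an issue on {cluster} and are looking into it."
    }
    req = urllib.request.Request(
        WEBHOOK_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

acknowledge("ttn-eu")
```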

6 Likes

What about the status pages? As far as I could see https://status.thethings.network/ did not show any change during the latest incident. That was one reason for irritation…

As someone who used to provide 24/7 support for large computer networks, I can fully understand why it took TTI that long to respond to a problem for which they had no SLA.

Once those who are responsible for the 24/7 support for TTI start dealing with TTN issues, because it's seen as a good thing to do, the response becomes expected by the community (for free) even though it's outside of the SLA.

Now, when I was being paid (in my old job) to look after a network, I did indeed keep an eye on systems, but I was careful not to deal with issues for which there was no SLA and for which I was not being paid. Is that just mean, or practical?

If the TTI guys were to provide a 24/7 SLA for TTN, I guess someone would have to pay for that service.

1 Like

Marshall Rosenberg said that if we are able to hear others and express our own needs, the solution is only 20 minutes away.

Let me try to give a short summary of what I understood so far:

  • TTI wants to keep autonomy over their own hosted systems
  • the v2 stack is not really suited for a federated/community-based approach, but v3 will be
  • the community could help in monitoring / setting up automated monitoring and reporting systems (federated, like the mapper service?)
  • clarity about the causes of issues should be available quickly and in an automated way
  • clarity that someone is working on the problem
  • a certain amount of downtime is acceptable
  • the long-term development of TTN should be discussed within, and with, the whole community (conference workshop?)

I am still uncertain about the following questions:

  • how much downtime is acceptable and by whom?
  • how can money help and can it even help?

Thank you for all your opinions so far!

2 Likes

First of all, apologies for the late response from my side. I was traveling last week and off yesterday, but I've been following internal and public (#ops, Twitter, here) messages closely. I appreciate the constructive feedback, nice words and support a lot. Thanks @LanMarc77 for raising this, and also "special thanks" to the person that spent a few hours on a Twitter post with the TTN logo in flames and a skull, suggesting we're hacked. It shows, positively and negatively, that people care. And that's a good thing after all.

There's an unfortunate mix of circumstances that caused the long downtime:

  • what: the busiest cluster (EU operated by The Things Network Foundation *)
  • why: issues with component interconnections that are notoriously vulnerable and hard to get right due to the design of V2; add to that the network doubling year-on-year and the focus having shifted to V3 for two years now
  • when: weekend (and actually enjoying it)

* Small note on operations: yes, the TTN core team is on the TTI payroll, but TTN is not a TTI-operated service and will not be. We allow for spending time on non-profit Foundation activities, including operations, but it is best effort.

That being said, I see a task for me in bridging the gap between the limited operational availability of TTI staff and the seemingly unlimited willingness of the TTN community to maximize service uptime.

I'm replying to a few suggestions above to steer the discussion in that direction. Please feel free to make other suggestions.

This assumption is correct; I would say that "fixing" over 90% of operational issues (in V2) is simply a matter of restarting Docker containers in the right order. Unfortunately, we cannot give direct access to Docker, as that implies system root access and access to keys and passwords, which implies access to user data, which results in privacy and GDPR issues, etc.

How about selecting 10-15 active and trusted community members, spread across timezones, who have access to a custom Slackbot command (e.g. /restart ttn-eu) that triggers a predefined restart sequence? It only actually happens if, say, 3 individual members execute that command independently within 15 minutes. And to avoid abuse, at most once every hour. I know it's monkey patching, but it gives control.
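
Roughly, the quorum rule I have in mind would look like this; the Slack command plumbing is omitted and the container names are placeholders, not our actual V2 service names:

```python
# Quorum rule: a restart only fires when three different trusted members
# request it within 15 minutes, and never more than once per hour.
import subprocess
import time

QUORUM = 3
WINDOW = 15 * 60          # seconds within which the requests must arrive
COOLDOWN = 60 * 60        # at most one restart per hour
RESTART_ORDER = ["ttn-discovery", "ttn-broker", "ttn-networkserver", "ttn-handler"]  # placeholders

pending = {}              # member -> timestamp of their /restart request
last_restart = 0.0

def handle_restart_command(member):
    """Called when a trusted member issues `/restart ttn-eu` in Slack."""
    global last_restart
    now = time.time()
    if now - last_restart < COOLDOWN:
        return "A restart already ran within the last hour, please wait."
    # Drop stale requests, then record this one.
    for m, t in list(pending.items()):
        if now - t > WINDOW:
            del pending[m]
    pending[member] = now
    if len(pending) < QUORUM:
        return f"Noted ({len(pending)}/{QUORUM}). Waiting for more members to confirm."
    pending.clear()
    last_restart = now
    for name in RESTART_ORDER:
        # Restart order matters in V2, so it is a fixed, predefined sequence.
        subprocess.run(["docker", "restart", name], check=True)
    return "Restart sequence executed."
```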

Yes, and I'm happy to cover this topic at the other 2019 Conferences where I'll be (India and Adelaide).

Right. We need to improve this as well. So, as we speak, we're spending time improving (real-time) communications on TTN operations, including a new status page. If automated reporting works well, we'll be using the @ttnstatus Twitter account too.

TTN is a special network and is operationally very different from TTI networks: TTN clusters are fully connected and the TTN Foundation only operates a part of the network (a big part, but not all), there's heavy load, and there's lots of experimentation and testing on device and gateway level that would flag alerts in an enterprise network (and hence put an unnecessary burden on TTI ops), etc. That being said, as we migrate to V3, we will converge the operational setup of the TTN Foundation's clusters and TTI Cloud Hosted. It will not be the same infrastructure, but it will be more manageable.

Yes. This is what Packet Broker is going to provide. In V3, Packet Broker is the backbone of LoRaWAN traffic between public TTN clusters, private TTI clusters, anyone running the open-source The Things Stack, and even other network servers (through LoRaWAN stateless passive roaming or by implementing the Packet Broker protocol).

As we've been promising for a long time, it will become easier for the community to operate TTN clusters as well (cc @kersing). I have several requests from all parts of the world for operating one. There is nothing wrong with overlapping regions. However, we do need Packet Broker for this. Also, we need to structurally start measuring the uptime of all TTN clusters, in a way that lets the community choose which TTN cluster to pick for connecting a gateway and registering devices. If it turns out that the public TTN Groningen community cluster has better uptime than the TTN Foundation's EU cluster, so be it. We'll hand out an award for the best community-contributed cluster at the Conference. We'll figure it out; willingness is not the issue, but the technology is not there yet.
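
To illustrate what structurally measuring uptime could look like: a sketch only, with placeholder cluster names and probe targets; a real probe would rather check end-to-end uplink delivery than a plain TCP connect.

```python
# Probe each cluster on a fixed interval and track its success ratio.
import socket
import time
from collections import defaultdict

CLUSTERS = {                                   # placeholder probe targets
    "ttn-eu": ("eu.thethings.network", 1883),
    "ttn-au": ("au.example-cluster.org", 1883),
}
results = defaultdict(lambda: {"ok": 0, "total": 0})

def probe(host, port, timeout=5):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def run_once():
    for name, (host, port) in CLUSTERS.items():
        r = results[name]
        r["total"] += 1
        r["ok"] += probe(host, port)

def uptime(name):
    r = results[name]
    return 100.0 * r["ok"] / r["total"] if r["total"] else 0.0

while True:
    run_once()
    for name in CLUSTERS:
        print(f"{name}: {uptime(name):.2f}% over {results[name]['total']} probes")
    time.sleep(60)
```

Published per cluster, those percentages would be enough for the community to decide where to connect gateways and register devices.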

Now, I hear you thinking: "but who operates Packet Broker, and how can we make that redundant?" PB will be TTI-operated. But the APIs are open and V3 peering allows for redundancy (i.e. multiple independent upstreams). We are open to working with neutral parties (e.g. universities, internet exchanges, domain registrars, etc.) to provide redundancy for peering in the TTN community network. That is regardless of community-contributed TTN clusters as described above: both provide decentralization.

I don't think that throwing money at it will help. Also, to whom would TTI provide the SLA? What if we don't meet the SLA? This, plus the operational differences outlined above, makes me believe that we should keep TTN a real community initiative, where TTI is a contributor on a best-effort basis, like, hopefully, many other contributors that will offer a public TTN cluster. Sure, TTN Foundation-operated clusters are a showcase of our TTI commercial abilities, but only to a certain extent, just as we do and will allow other community network contributors to optionally (!) upsell services.

I'm not against donations though; in fact I think it's good to consider them to cover the operational costs of the TTN Foundation. Currently, TTI pays the exponentially increasing Azure bills and pays for the time spent on ops, and it would be great if TTN could be financially self-sustaining (through donations) and operationally independent (through decentralization and potentially Slackbots?).

10 Likes