Just so you know up front: I cannot deal very well with no-information situations. I have a strong need for clarity and am more of a “We need to do something” type of guy than an “OK, let’s wait” one.
The recent incident was reported here and in the Slack #ops channel, and IamBatman fixed it. I assume it took him 5 minutes, based on his first reaction in the channel and the messages coming in again.
TTN is provided “as is”; there is no SLA (service level agreement). We all know this and agreed to it by joining. TTI people are probably on shifts to provide what service they can, even on weekends and holidays. Thank you for your service.
I assume that there are certain error classes that are easily identified by trained and trusted humans and can be fixed with a predetermined set of options (e.g. restarting a service). If we, the community, want the service to be more stable (I do!), then I think we as a community need to invest.
I can imagine a TTN watchtower group that consists of community people willing to give part of their day to watch over TTN. With a little bit of training (quality assurance!) and the ability to restart services (e.g. via TTI-provided access), they could quickly fix a lot of incidents and “escalate” the others to TTI.
I see this as a very logical step in the growth process of TTN, and very much in line with the manifesto that inspired us all:
We believe that this power should not be restricted to a few people, companies or nations. Instead this should be distributed over as many people as possible without the possibility to be taken away by anyone.
First of all, I’m glad that @KrishnaIyerEaswaran2 managed to resolve this particular incident, and I would like to thank him for taking care of it. I haven’t spoken to him about what exactly was wrong and what he did to resolve it. We’ll discuss that tomorrow and update the incident page with some details.
We understand that many here are relying on the public community network services being available at all times, and I think we can all agree that outages like this one already don’t happen often. Since the public community network doesn’t have an SLA, our internal on-call system only alerts the team about incidents with the public community network during working hours, and not at night or during weekends. Outside working hours, there’s often someone from the team available to take action, but this time none of us was available for a long time.
The Things Network’s backend (v2) was designed with decentralization in mind. The public community network consists of multiple clusters worldwide, some of them operated by The Things Industries, others operated by partners such as Meshed, Switch and Digital Catapult. As a bonus, there is (usually) traffic exchange between the regions. The idea was that this decentralization allows us to operate a global network that doesn’t rely on a single party (TTI) for everything. So far that model has worked quite well, and also during the recent incident with our EU cluster we could see that the other clusters were still fully operational.
I’m pretty sure that giving external people access to TTI-hosted clusters isn’t going to happen. Instead, our goal is, and always has been, to involve more operators in the public community network. We believe that large and active communities should be able to operate their own clusters instead of relying on TTI-hosted clusters. Unfortunately, our v2 implementation doesn’t properly deal with unreliable or malicious operators (we learned that the hard way), and we decided to stop adding partner-operated clusters. With v3 this is going to be possible again, and we hope to involve communities in hosting their own clusters.
I know I don’t have to watch, as my automated monitoring provides that information readily. Yesterday at 17:15 CEST it reported there was a problem, and this morning it reported things were up and running again, within 5 minutes of the issue being fixed.
The only thing I need to do when I get monitoring messages is to check whether the monitoring itself might have an issue, but that has happened only once over a 3-year period.
That is interesting, because I inquired about what would be needed to implement this for our Dutch community at least 3 times, only to be told there is no need as TTN covers NL quite adequately.
Hmmm, speaking to different core team members at this year’s TTN Conference about this, my take on the answers was that I was discouraged (again), as there is no need. Did I interpret the responses that badly???
@kersing’s quote of my message is missing quite an important bit of context:
With our current v2 infrastructure, we indeed need to avoid adding clusters (specifically broker+networkserver combinations) if not needed. In regions that are already being served by our existing infrastructure, there is indeed no real need, but more importantly, it will lower the quality of service. This is a technical limitation of v2, not a matter of us not wanting it.
I agree that it would be good to discuss the future of operating public community network clusters, and we indeed don’t need to wait for The Things Conference with that.
For me the failures are not the biggest problem; I know TTN is provided “as is” and depends on some hardworking volunteers to resolve failures. But if I am experimenting with my gateway or nodes and I encounter a problem, I would like to be able to rule out that something is wrong with the TTN services. More than once it took me many frustrating hours to figure out what was wrong with my nodes or gateway, while it later turned out that there was a failure at TTN, often a failure that was not reported on #ops or on the status page. The status page sometimes even reported that there were no problems while there certainly were issues. So a really up-to-date status report would be highly appreciated and would save me a lot of frustration.
I mean addressing the growth of the community/network, and the problems in general that we have all seen this year.
A presentation of ideas and planned implementations on how to cope?
A more or less automated, real-time status page with input from devices worldwide, combined with operator data, would for example be very useful.
What can we, as a community, do to help improve… I think we need more guidance from TTI on where to start.
I presume TTI benefits from 24/7 engineer alerts and possibly out-of-hours fixes.
One idea for the list might be to treat TTN users as a TTI customer and figure out the cost of providing some sort of SLA on a paid basis. I am sure there are enough of us on TTN to make a regular “donation” to make this business model work. At the very least, such an SLA would have alerted a TTI engineer to the 15+ hour complete outage (which in this case, I presume, would have been a simple fix).
The problem, no doubt, will be the scope of such an SLA in the context of an “as is” free service.
I’m thinking “critical issues only” at this stage.
To be honest, I think the only really viable solution is going to be a two-level one, where someone with a fleet of gateways manages their own network server with whatever level of reliability effort they require, and anything unknown to that then gets bumped up to a regional server that handles “roaming” along with individually owned gateways not having their own serving infrastructure.
An organization that invests in putting up a decent number of gateways typically has an application need that means it isn’t going to be able to justify depending on an external service with no SLA, in a situation where the failure of that service can take out even purely local usage.
My general sense is that what’s missing to make this work is the ability for a gateway (or an intermediate server proxying for many) to indicate back to a server that it won’t be able to accept a transmit request, because it is already busy transmitting something else in that timeslot. Lengthening the standard RX1 delay would greatly increase flexibility for “who is going to send it” negotiations, with little obvious downside.
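The gateway-side refusal suggested above could be sketched roughly as follows: the gateway keeps a schedule of accepted transmit windows and refuses (NACKs) any downlink request whose airtime overlaps one already accepted, so the server can try another gateway. The class name and the simple start/duration model are illustrative assumptions, not part of any TTN implementation:

```python
class TxScheduler:
    """Tracks a gateway's pending transmissions (hypothetical sketch)."""

    def __init__(self) -> None:
        self._slots: list[tuple[float, float]] = []  # (start, end) in seconds

    def request(self, start: float, airtime: float) -> bool:
        """Accept a downlink if its window is free; return False ('busy') otherwise."""
        end = start + airtime
        for s, e in self._slots:
            if start < e and s < end:  # windows overlap
                return False  # server should pick a different gateway
        self._slots.append((start, end))
        return True
```

A longer RX1 delay gives the server more time to run exactly this accept/refuse round-trip with several candidate gateways before the receive window opens.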
Ultimately, if TTN can’t be architected in a way that allows local and fleet self-reliance, the result is likely to be lots of uncoordinated private networks serving only the needs of those paying for the infrastructure.
It’s a fine idea in theory, but what happens if in one period (a month?) the “donations” don’t cover the cost? No SLA support, donations returned, etc.?
Given the number of TTN users (many of which will be businesses), I would expect donations to exceed the actual TTI cost, building up a buffer. Until TTI throw some figures at this it’s difficult to say (how cash-strapped are they?), but £5-10/month per donation would seem reasonable, plus some [more] goodwill from TTI.
I am sure TTI do WANT to keep things running smoothly, as TTN is a good test platform for them and a good news story for their sales.
Anyway, just a thought as it would bring a basic SLA to the TTN platform.
As someone who manages their own cluster (Australia), I watched the comments unfold in the #ops channel yesterday. It was frustrating to watch, even though our region wasn’t even the one impacted.
On one hand I can’t work out how it can take 15+ hours for TTI to respond to a problem of this magnitude when #ops was buzzing with comments every few minutes. On the other hand, TTI staff are entitled to enjoy their weekend uninterrupted by work.
Even though TTN is ultimately provided to us as a best-effort free service, for some of us it’s become part of what we do for a living. At a country-wide level, for some of you it will make sense to invest in the infrastructure and services required to manage your own cluster (when V3 becomes available).
For Australia this has worked well. We have become largely immune to the issues that affect the global network, mainly because we’re smaller and it’s more manageable, but mostly because we can decide how and when to respond to issues that arise and we’re available in our own timezone. For the most part we can fix the issues ourselves. (Occasionally though, we can’t, and we have had long outages too.)
But it seems to me there was one simple thing that could have made yesterday’s outage more tolerable for those affected… acknowledgement. Just something notifying everyone that TTI are aware of the problem. It could be an automated Slack message, a status page update, a simple “we’re aware of a problem” message. That would make “best effort” even better.
What about the status pages? As far as I could see, https://status.thethings.network/ did not show any change during the latest incident. That was one reason for irritation…
As someone who used to provide 24/7 support for large computer networks, I can fully understand why it took TTI that long to respond to a problem for which they had no SLA.
Once those who are responsible for 24/7 support at TTI start dealing with TTN issues, because it’s seen as a good thing to do, the response becomes expected by the community (for free), even though it’s outside of the SLA.
Now, when I was being paid (in my old job) to look after a network, I did indeed keep an eye on systems, but I was careful not to deal with issues for which there was no SLA and for which I was not being paid. Is that just mean, or practical?
If the TTI guys were to provide a 24/7 SLA for TTN, I guess someone would have to pay for that service.
First of all, apologies for the late response from my side. I was traveling last week and off yesterday, but I’ve been following internal and public (#ops, Twitter, here) messages closely. I appreciate the constructive feedback, nice words and support a lot. Thanks @LanMarc77 for raising this, and “special thanks” to the person who spent a few hours on a Twitter post with the TTN logo in flames and a skull, suggesting we’re hacked. It shows, positively and negatively, that people care. And that’s a good thing after all.
There’s an unfortunate mix of circumstances that caused the long downtime:
what: the busiest cluster (EU operated by The Things Network Foundation *)
why: issues with component interconnections that are notoriously fragile and hard to get right due to the design of V2; add to that the year-on-year doubling of the network and the shift of focus to V3 over the past two years
when: weekend (and actually enjoying it)
* A small note on operations: yes, the TTN core team is on the TTI payroll, but TTN is not a TTI-operated service and will not be. We allow for spending time on non-profit Foundation activities, including operations, but it is best effort.
That being said, I see a task for myself in bridging the gap between the limited operational availability of TTI staff and the seemingly unlimited willingness of the TTN community to maximize service uptime.
I’m replying to a few suggestions above to steer the discussion in that direction. Please feel free to make other suggestions.
This assumption is correct; I would say that “fixing” over 90% of operational issues (in V2) is simply a matter of restarting Docker containers in the right order. Unfortunately, we cannot give direct access to Docker, as that implies system root access and access to keys and passwords, which implies access to user data, which results in privacy and GDPR issues, etc.
How about selecting 10-15 active and trusted community members, spread across timezones, who have access to a custom Slackbot command (e.g. /restart ttn-eu) that triggers a predefined restart sequence? Only if, say, 3 individual members execute that command independently within 15 minutes does the restart actually happen. And to avoid abuse, at most once every hour. I know it’s monkey patching, but it gives control.
Yes, and I’m happy to cover this topic at the other 2019 Conferences where I’ll be (India and Adelaide).
Right. We need to improve this as well. So, as we speak, we’re spending time improving (real-time) communication on TTN operations, including a new status page. If automated reporting works well, we’ll be using the @ttnstatus Twitter account too.
TTN is a special network and is operationally very different from TTI networks: TTN clusters are fully connected and the TTN Foundation only operates a part of it (a big part, but not all), there’s heavy load, and there’s lots of experimentation and testing at the device and gateway level that would flag alerts in an enterprise network (and hence put unnecessary burden on TTI ops), etc. That being said, as we migrate to V3, we will converge the operational setup of the TTN Foundation’s clusters and TTI Cloud Hosted. It will not be the same infrastructure, but it will be more manageable.
Yes. This is what Packet Broker is going to provide. In V3, Packet Broker is the backbone of LoRaWAN traffic between public TTN clusters, private TTI clusters, anyone running the open source The Things Stack, and even other network servers (through LoRaWAN stateless passive roaming or implementing the Packet Broker protocol).
As we’ve been promising for a long time, it will be easier for the community to operate TTN clusters as well (cc @kersing). I have several requests from all parts of the world for operating one. There is nothing wrong with overlapping regions. However, we do need Packet Broker for this. We also need to start structurally measuring the uptime of all TTN clusters, in a way that lets the community choose which TTN cluster to pick for connecting a gateway and registering devices. If it turns out that the public TTN Groningen community cluster has better uptime than the TTN Foundation’s EU cluster, so be it. We’ll hand out an award for the best community-contributed cluster at the Conference. We’ll figure it out; willingness is not the issue, but the technology is not there yet.
Now, I hear you thinking: “but who operates Packet Broker, and how can we make that redundant?” PB will be TTI-operated. But the APIs are open, and V3 peering allows for redundancy (i.e. multiple independent upstreams). We are open to working with neutral parties (e.g. universities, internet exchanges, domain registrars) to provide redundancy for peering in the TTN community network. That is regardless of community-contributed TTN clusters as described above: both provide decentralization.
I don’t think that throwing money at it will help. Also, to whom would TTI provide the SLA? What if we don’t meet the SLA? This, plus the operational differences outlined above, makes me believe that we should keep TTN really a community initiative, where TTI is a contributor on a best-effort basis, like, hopefully, many other contributors that will offer a public TTN cluster. Sure, TTN Foundation-operated clusters are a showcase of our TTI commercial abilities, but only to a certain extent, just as we do and will allow other community network contributors to optionally (!) upsell services.
I’m not against donations though; in fact, I think it’s good to consider them to cover the operational costs of the TTN Foundation. Currently, TTI pays the exponentially increasing Azure bills and pays for the time spent on ops, and it would be great if TTN could be financially self-sustaining (through donations) and operationally independent (through decentralization and potentially Slackbots?).