Linuxoid
Matrix - @saint:group.lt
They cut all such scenes and pasted them into The Boys, Mark Twain style: “Sprinkle these around as you see fit!”.
I liked the book as well. The show had a similar feeling in some ways, but also had a distinct character of its own.
Reread it again today; some highlights:
The riskiness of a mitigation should scale with the severity of the outage
We, here in SRE, have had some interesting experiences in choosing a mitigation with more risks than the outage it’s meant to resolve.
We learned the hard way that during an incident, we should monitor and evaluate the severity of the situation and choose a mitigation path whose riskiness is appropriate for that severity.
Recovery mechanisms should be fully tested before an emergency
An emergency fire evacuation in a tall city building is a terrible opportunity to use a ladder for the first time.
Testing recovery mechanisms has a fun side effect of reducing the risk of performing some of these actions. Since this messy outage, we’ve doubled down on testing.
We were pretty sure that it would not lead to anything bad. But pretty sure is not 100% sure.
A “Big Red Button” is a unique but highly practical safety feature: it should kick off a simple, easy-to-trigger action that reverts whatever triggered the undesirable state to (ideally) shut down whatever’s happening.
Unit tests alone are not enough - integration testing is also needed
This lesson was learned during a Calendar outage in which our testing didn’t follow the same path as real use, resulting in plenty of testing… that didn’t help us assess how a change would perform in reality.
Teams were expecting to be able to use Google Hangouts and Google Meet to manage the incident. But when 350M users were logged out of their devices and services… relying on these Google services was, in retrospect, kind of a bad call.
It’s easy to think of availability as either “fully up” or “fully down” … but being able to offer a continuous minimum functionality with a degraded performance mode helps to offer a more consistent user experience.
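A minimal sketch of what such a degraded mode could look like in code, not anything taken from the article: try the full-featured path first and, if it fails, serve a reduced but still useful answer while flagging it as degraded. The names `fetch_with_fallback`, `primary` and `fallback` are made up for illustration.

```python
from typing import Callable, List, Tuple


def fetch_with_fallback(
    primary: Callable[[], List[str]],
    fallback: Callable[[], List[str]],
) -> Tuple[List[str], bool]:
    """Return (results, degraded).

    `primary` is the full-featured path (hypothetical, e.g. a live query);
    `fallback` is a cheaper path (hypothetical, e.g. a stale cache).
    """
    try:
        return primary(), False
    except Exception:
        # The full path failed: serve minimum functionality instead of an error.
        return fallback(), True


def live_search() -> List[str]:
    # Stand-in for a backend that is currently down.
    raise TimeoutError("search backend unavailable")


def cached_search() -> List[str]:
    # Stand-in for stale but still useful results.
    return ["cached result"]


results, degraded = fetch_with_fallback(live_search, cached_search)
```

The caller can use the `degraded` flag to show a banner or skip optional features, so the user sees "slower and simpler" rather than "down".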
This next lesson is a recommendation to ensure that your last-line-of-defense system works as expected in extreme scenarios, such as natural disasters or cyber attacks, that result in loss of productivity or service availability.
A useful activity can also be sitting your team down and working through how some of these scenarios could theoretically play out—tabletop game style. This can also be a fun opportunity to explore those terrifying “What Ifs”, for example, “What if part of your network connectivity gets shut down unexpectedly?”.
In such instances, you can reduce your mean time to resolution (MTTR), by automating mitigating measures done by hand. If there’s a clear signal that a particular failure is occurring, then why can’t that mitigation be kicked off in an automated way? Sometimes it is better to use an automated mitigation first and save the root-causing for after user impact has been avoided.
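A rough sketch of the "clear signal, automated mitigation" idea, under my own assumptions: the signal is an error rate, the trigger is a few consecutive samples above a threshold, and the mitigation is an arbitrary callback. The class name, threshold and callback are hypothetical, not from the article.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class AutoMitigation:
    threshold: float                 # error rate above which a sample counts as a breach
    consecutive_needed: int          # breaches in a row required before acting
    mitigate: Callable[[], None]     # hypothetical action, e.g. drain traffic or roll back
    breaches: int = 0
    triggered: bool = False

    def observe(self, error_rate: float) -> bool:
        """Feed one sample; return True if this sample fired the mitigation."""
        if self.triggered:
            return False  # fire at most once; a human takes over from here
        if error_rate > self.threshold:
            self.breaches += 1
        else:
            self.breaches = 0  # a healthy sample resets the streak
        if self.breaches >= self.consecutive_needed:
            self.triggered = True
            self.mitigate()
            return True
        return False


actions: List[str] = []
guard = AutoMitigation(threshold=0.05, consecutive_needed=3,
                       mitigate=lambda: actions.append("drain"))
fired = [guard.observe(rate) for rate in [0.1, 0.1, 0.01, 0.1, 0.1, 0.1]]
```

Requiring several consecutive breaches is one simple way to avoid firing on a single noisy sample; root-causing still happens afterwards, once user impact has stopped.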
Having long delays between rollouts, especially in complex, multiple component systems, makes it extremely difficult to reason out the safety of a particular change. Frequent rollouts—with the proper testing in place— lead to fewer surprises from this class of failure.
Having only one particular model of device to perform a critical function can make for simpler operations and maintenance. However, it means that if that model turns out to have a problem, that critical function is no longer being performed.
Latent bugs in critical infrastructure can lurk undetected until a seemingly innocuous event triggers them. Maintaining a diverse infrastructure, while incurring costs of its own, can mean the difference between a troublesome outage and a total one.
not a bug, but a feature :))
the source code of a game ;))
thank you, actually it seems that it is https://en.m.wikipedia.org/wiki/The_Sliced-Crosswise_Only-On-Tuesday_World, which inspired Dayworld :)
looks interesting, but not this one.
from the logs it would seem that synapse went down not due to the sheer volume of traffic, but due to specially malformed usernames - so it seems a different pattern was used (if it was an attack)
I am not sure if that is related, but technically Matrix uses a different protocol from ActivityPub, so it had to be targeted specifically
Video debunking the report: https://yewtu.be/watch?v=7CD_Nl3iwhE
can do, if you could provide the link to the debunking source - would be great!
nice, thank you.
This might be FUD, but… Vastaamo hacker traced via ‘untraceable’ Monero transactions, police says. (Edit) - A video debunking the police report - https://yewtu.be/watch?v=7CD_Nl3iwhE
Yes, seems so from the article.
Agree, but five nines are not 100% ;) Anyway - this discussion reminds me of Technical Report 85.7 - Jim Gray, which might be of interest to some of you.
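For a sense of scale, a quick back-of-the-envelope calculation of how much downtime a given availability target still allows. This assumes a 365-day year and counts all downtime equally; the function name is just for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Minutes of downtime per year permitted by an availability target."""
    return (100.0 - availability_percent) / 100.0 * MINUTES_PER_YEAR


for target in (99.9, 99.99, 99.999):
    print(f"{target}% -> {allowed_downtime_minutes(target):.2f} min/year")
```

Five nines works out to roughly 5.26 minutes a year: small, but not zero.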
a lot of things are possible if you are lucky enough ;)
well, this is probably PR, as no system has 100% uptime, nor can one be built. not to mention that network engineers rarely work with servers :)
Not anymore. Nowadays I feel guilty reading non-fiction, and I understand the Lindy effect on books much better (be it fiction or non-fiction).