Analysis Internet services in the US on Thursday were far more stable than those in Ukraine and Russia, but even so, reports of problems surfaced.

DownDetector.com, which tracks service outages using reports from individuals and real-time data analysis, showed spikes reflecting connectivity issues for Amazon Web Services and other platforms, such as Etrade. We note that Heroku, which is hosted on AWS, stumbled during Amazon's wobble, disrupting, among other things, the Rust programming language's Crates.io, which relies on Heroku and Amazon's cloud.

Yet, as was the case on Tuesday this week, when Slack experienced an outage and AWS appeared to be having less serious issues, an Amazon spokesperson today insisted all is well with AWS.

“I can confirm there are no issues with AWS services,” an Amazon spokesperson told The Register.

[Graph: Downdetector data for AWS on February 24, 2022, showing Thursday's spike in complaints from netizens about AWS availability. Source: Downdetector]

You can see as much from the AWS status page: no recent events are noted and every listed service shows a green check icon to indicate normal operations.

How then to reconcile reports of problems with the insistence there are no problems? Amazon believes DownDetector.com's data is unreliable. And Luke Deryckx, CTO of Ookla (which owns DownDetector.com), has reportedly said as much with regard to the issues reported on Tuesday. But there are other interpretations.

Tim Perry, who develops HTTP Toolkit, recently published a Status Page Status Page to highlight when the official status for AWS, GitHub, and Slack differs from what users of those services experience.

At the time this article was filed, it signaled "many users reporting issues" for AWS, while the AWS status page reported everything was okay. There have been other efforts along these lines, such as stop.lying.cloud.

“Outages are inconvenient of course, but they’re inevitable for any major service, something is always going to break eventually – these are complex and constantly changing systems under pressure from heavy usage,” Perry told The Register.

“The status page situation is much more frustrating though. It’s very common to see status pages fail to match reality, and I made statuspagestatuspage.com specifically because for big tech services like Slack, GitHub and AWS the delay in updating the status page has become such a running joke.”

Perry said it's absurd that small sites like downdetector.com can tell when AWS is having issues but AWS cannot.

“I strongly suspect this is because the published status is linked to contractual SLAs [Service Level Agreements] for enterprise clients, and financial penalties in those contracts for the service provider,” said Perry. “These SLAs create disincentives to proactively update the status of course, but worse they really discourage many useful improvements for issue detection that would fix this completely.

“Publishing any indication of downtime has a major and direct financial impact, so automated anomaly detection and reporting is out, crowdsourced reporting is out, and everything has to run at the speed of manual confirmation by somebody high up enough in the company to sign off on the consequences.”
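To make the contrast concrete, here is a minimal sketch, in Python, of the sort of automated anomaly detection Perry is describing: watch a service's error rate against a rolling baseline and flag degradation without waiting for a human sign-off. The thresholds, sample figures, and red/yellow/green mapping are illustrative assumptions, not anything AWS, GitHub, or Slack actually runs.

```python
# Minimal sketch of automated status detection: classify the current error
# rate against a rolling baseline instead of waiting for manual sign-off.
# Thresholds and sample data are illustrative assumptions only.
from collections import deque
from statistics import mean, stdev


def status_from_error_rates(samples: deque, current: float) -> str:
    """Classify the current error rate against a rolling baseline."""
    baseline = mean(samples)
    spread = stdev(samples) if len(samples) > 1 else 0.0
    if current > baseline + 4 * spread:
        return "red"      # severe, sustained deviation
    if current > baseline + 2 * spread:
        return "yellow"   # degraded
    return "green"


# Example: the last 30 one-minute error-rate samples (fraction of failed requests)
history = deque([0.002, 0.003, 0.002, 0.004] * 7 + [0.003, 0.002], maxlen=30)
print(status_from_error_rates(history, current=0.05))  # -> "red"
```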

Perry pointed to the GitHub status page as an example, noting that it used to show automated statistics about performance, failures per second, and other metrics, such as recent views.

“That was removed a few years ago, stripping it down to a simple manually controlled red-yellow-green status for their set of services, which reliably lags behind the reality,” he said.

Behind the curtain

There’s some support for Perry’s contention that AWS isn’t being as upfront as it could be in reporting what’s going on. In a post to Hacker News in 2017, infrastructure engineer Nick Humrich described his experience working as a software engineer on the Elastic Beanstalk team at AWS in 2015.

“When I was on an AWS team, posting a ‘non-green’ status to the status page was actually a manager decision,” he wrote. “That means that it is in no way real time, and it’s possible the status might say ‘everything is ok’ when it’s really not because a manager doesn’t think it’s a big enough deal.

“Also there is a status called green-i which is the green icon with a little ‘i’ information bubble. Almost every time something was seriously wrong, our status was green-i, because the ‘level’ of warning is ultimately a manager’s decision. So they avoid yellow and red as much as possible. So the statuses aren’t very accurate. That being said, if you do see a yellow or red status on AWS, you know things are REALLY bad.”

Humrich confirmed the accuracy of his post in an email to The Register.

“Does SLA avoidance explain it? Maybe, though if so, it’s indirect,” he said via email.

“The reality is it’s explained by two aspects: first, the page is a human decision, not automated. Second, managers want to look good. AWS is a massive place and ‘number of outages’ is probably the best ‘measure’ of a team.”

The deeper question, he said, is why the page isn’t automated.

“Maybe because Amazon assumes on-call teams are responding quickly enough. Maybe it’s because they learn that SLA is lower? Hard to know, or prove. I don’t think SLA avoidance would really be too big of a factor, though, because large enterprise companies almost always get their money back if they ask for it. Doesn’t even require much proof.”

Corey Quinn, chief cloud economist of The Duckbill Group, a cloud service consultancy, told The Register that there are two sets of issues with the AWS status page. 

“First, it’s just a useless ‘sea of green dots’ that doesn’t tell us anything useful; that’s why I built stop.lying.cloud,” he said, explaining that this service strips cruft from the AWS status page and escalates the severity to better reflect reality.
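As a rough illustration of what such a filter does, the sketch below pulls a provider's published status feed, drops everything that reports green, and bumps the severity of whatever remains one level. The feed URL and JSON schema here are hypothetical assumptions made for the example; stop.lying.cloud's actual implementation and the real AWS feed format may differ.

```python
# Rough sketch of a "status page filter": fetch the status feed, strip the
# sea of green dots, and escalate the severity of what's left.
# The endpoint and event schema below are assumptions for illustration.
import json
from urllib.request import urlopen

ESCALATE = {"informational": "degraded", "degraded": "outage", "outage": "outage"}


def summarize(feed_url: str) -> list:
    with urlopen(feed_url) as resp:
        events = json.load(resp)          # assumed: a list of event dicts
    noteworthy = []
    for event in events:
        severity = event.get("severity", "green")
        if severity == "green":
            continue                      # strip the green dots entirely
        noteworthy.append({
            "service": event.get("service", "unknown"),
            "severity": ESCALATE.get(severity, severity),
            "summary": event.get("summary", ""),
        })
    return noteworthy


if __name__ == "__main__":
    # Hypothetical endpoint, shown only to make the sketch runnable in principle
    print(summarize("https://example.com/aws-status/data.json"))
```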

It’s just a useless ‘sea of green dots’ that doesn’t tell us anything useful

“Second, it’s hard to comprehend the scale of hyperscale cloud providers. US-East-1 in Virginia is something like a hundred or so buildings spread across a number of different towns, but it’s expressed as six availability zones. If an entire building falls off the internet, some customers see their service explode, others have no idea that anything’s wrong whatsoever. The question isn’t ‘is AWS down?’ so much as it is ‘at this point in time, how down is it?’ 

“At scale, something is always broken; building durable and reliable systems that can survive those breakages is what the game is all about. The communication challenge is that when the service you use is down for your environment, the sea of green dots is infuriating. Conversely if they showed every outage they experienced on their status page, it’d be an equally useless but far more alarming sea of red dots instead.”

Quinn illustrated the problem with excessive transparency by pointing to how Slack used to provide highly detailed outage data, which Reuters then used as the basis for a story questioning Slack's stability.

He also argued that SLA compliance isn’t a plausible explanation for why people perceive problems that the AWS status page does not reflect. Enterprises, he said, have access to their own uptime metrics and generally just want things to work.

“SLA credits are effectively useless for companies,” he said. “It’s not enough money to move the needle on the lost business opportunity for most companies.”
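A back-of-the-envelope calculation shows why. The sketch below uses made-up revenue figures and a simplified credit schedule loosely modelled on typical cloud SLA tiers (the exact tiers are an assumption): a two-hour outage on a $10,000-a-month bill earns roughly a $1,000 credit while costing an order of magnitude more in lost business.

```python
# Illustration of why SLA credits rarely offset lost business.
# Credit tiers are a simplified assumption; revenue figures are made up.
HOURS_PER_MONTH = 730


def monthly_uptime(downtime_hours: float) -> float:
    return 100 * (1 - downtime_hours / HOURS_PER_MONTH)


def credit_percent(uptime: float) -> float:
    # Assumed tiers: 10% credit below 99.99%, 30% below 99%, 100% below 95%
    if uptime < 95.0:
        return 100.0
    if uptime < 99.0:
        return 30.0
    if uptime < 99.99:
        return 10.0
    return 0.0


cloud_bill = 10_000          # monthly cloud spend, USD (illustrative)
revenue_per_hour = 5_000     # revenue lost per hour of downtime (illustrative)
outage_hours = 2

uptime = monthly_uptime(outage_hours)                # ~99.73%
credit = cloud_bill * credit_percent(uptime) / 100   # $1,000 credit
lost_revenue = revenue_per_hour * outage_hours       # $10,000 lost
print(f"uptime={uptime:.2f}%  credit=${credit:,.0f}  lost=${lost_revenue:,.0f}")
```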

Quinn said AWS' approach to making its status data more meaningful to customers has been to provide an account-specific "Personal Health Dashboard" that offers a far more granular view of current service status. But, he added, this isn't well known because it requires a login and is customer-specific.
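For what it's worth, that dashboard is backed by the AWS Health API, so account-specific events can also be pulled programmatically. Below is a minimal boto3 sketch; note the Health API requires a Business or Enterprise support plan and is served from the us-east-1 endpoint.

```python
# Minimal sketch: query the AWS Health API (the programmatic side of the
# Personal Health Dashboard) for events currently affecting this account.
# Requires a Business or Enterprise support plan; the API lives in us-east-1.
import boto3

health = boto3.client("health", region_name="us-east-1")

response = health.describe_events(
    filter={"eventStatusCodes": ["open", "upcoming"]}
)

for event in response.get("events", []):
    print(event.get("service"), event.get("region"),
          event.get("statusCode"), event.get("startTime"))
```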

It’s not enough money to move the needle on the lost business opportunity

The existence of performance monitoring services suggests cloud providers aren’t always up front about what’s going on. After all, there would be no need to involve a third-party service to verify service availability and SLA compliance if cloud companies reported everything completely accurately.

Monitoring service Ably, for example, contends, “Amazon AWS flat out lies” on its status page.

Malik Zakaria, managing director at managed IT service provider ExterNetworks, described the issue in more diplomatic terms.

In a phone interview with The Register, he said, “When we talk to our customers, SLA monitoring and compliance monitoring is required to ensure that the SLA agreement is being met.”

There are times, he said, when service outages occur and are not reported, and per agreements his company helps resolve them. But 99.99 per cent of the time, he said, ExterNetworks is able to restore service within the agreed-upon window.

In general, Zakaria said, the system works very well, and the support staff at cloud providers are very good and work around the clock.

The Register spoke at length to an Amazon spokesperson about the aspersions cast on the AWS status page. Amazon disputes the notion that SLA concerns affect its status reporting and questions the accuracy of crowd-sourced reporting that may reflect service disruptions linked to ISPs or network-layer providers.

“Third parties speculating on AWS availability almost always get it wrong,” an Amazon spokesperson told us.

“Just this week, Downdetector walked back its own false reporting by saying, ‘we do not believe there was a widespread service issue on AWS’s platform.’ The AWS Service Health Dashboard (SHD) is the only reliable source of AWS availability data, providing customers with timely and accurate information on AWS services and regions.

“It is not connected to our Service Level Agreements (SLAs) in any way. Our SHD provides more details and transparency on service availability than any other cloud provider.” ®

