Server failures: how to recover faster

Server outages happen, even to the best of us. Believing otherwise is like driving a car without airbags because the manufacturer has promised that its cars will never crash.

 

In 2017, Amazon Web Services (AWS) experienced an outage of roughly four hours that affected many companies using AWS as their back-end provider. Four hours may not seem like a long time to restore a system of this magnitude. However, for AWS customers whose services must be available 24/7, such as Netflix, those four hours were costly.

 

So how do you protect your organization and the customers who depend on your availability? When working with an availability solution provider, it is important to determine which system will provide the fastest recovery time. Better yet, ask what kind of system ensures that your clients do not even notice when a server crashes.

 

The Avoiding Downtime Buyer's Guide poses six questions to ask when looking to avoid downtime, including downtime caused by server crashes. The guide recommends asking questions such as: "What is the process of restoring applications to normal operation in the event of a server failure, and how long will it take?" The guide also compares the levels of downtime that can be expected from specific types of systems.
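To put "how long will it take?" into perspective, recovery time can be converted into a yearly availability figure. The snippet below is a minimal sketch of that arithmetic, not something taken from the guide; the function names are hypothetical, and it assumes a 365-day year and treats all downtime in a year as one total.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def availability_percent(downtime_hours_per_year: float) -> float:
    """Return yearly availability (in percent) for a given amount of downtime."""
    if not 0 <= downtime_hours_per_year <= HOURS_PER_YEAR:
        raise ValueError("downtime must be between 0 and 8760 hours")
    return 100.0 * (1 - downtime_hours_per_year / HOURS_PER_YEAR)


def allowed_downtime_hours(target_percent: float) -> float:
    """Return the yearly downtime budget (in hours) for a target availability."""
    if not 0 <= target_percent <= 100:
        raise ValueError("target availability must be between 0 and 100 percent")
    return HOURS_PER_YEAR * (1 - target_percent / 100.0)


if __name__ == "__main__":
    # A single four-hour outage in an otherwise perfect year
    print(f"{availability_percent(4):.3f}%")                      # 99.954%
    # Downtime budget for a "four nines" (99.99%) target
    print(f"{allowed_downtime_hours(99.99) * 60:.1f} minutes")    # 52.6 minutes
```

Under these assumptions, one four-hour outage already uses up more than four times the yearly downtime budget of a 99.99% availability target.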

 

“If you rely on standalone servers, recovery times can range from minutes to days, given the high level of human interaction required to restore applications and data from backup, and that assumes you back up your system regularly.

 

In HA clusters, processing is interrupted during a server failure, and recovery can take anywhere from minutes to hours, depending on how long it takes to verify file integrity, roll back databases, and replay transaction logs once availability is restored. If the cluster was properly sized during planning, users should not experience application performance degradation while the failed server is down; however, they may need to rerun some transactions using the log file after normal processing resumes.

 

Fault-tolerant solutions proactively prevent downtime with fully replicated components that eliminate any single point of failure. Some platforms automatically manage their replicated components, performing all processing in sync.

 

Since the replicated components execute the same instructions at the same time, processing is not interrupted even if one of the components fails. This means that, unlike a standalone server or HA cluster, a fault-tolerant solution keeps working while any issues are being resolved.”
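The idea of replicated components running in sync can be illustrated with a toy model. The sketch below is only an illustration under simplifying assumptions: real platforms typically replicate at the hardware or hypervisor level, and the `Replica` and `LockstepPair` classes and the running-total workload are hypothetical names invented for this example.

```python
class Replica:
    """One copy of a component that keeps a running total (hypothetical workload)."""

    def __init__(self, name):
        self.name = name
        self.total = 0
        self.healthy = True

    def apply(self, amount):
        if not self.healthy:
            raise RuntimeError(f"replica {self.name} is down")
        self.total += amount
        return self.total


class LockstepPair:
    """Feeds every operation to all healthy replicas and returns a single answer."""

    def __init__(self, replicas):
        self.replicas = list(replicas)

    def apply(self, amount):
        # Every healthy replica executes the same operation on the same input
        results = [r.apply(amount) for r in self.replicas if r.healthy]
        if not results:
            raise RuntimeError("all replicas have failed")
        # Replicas that saw identical inputs must hold identical state
        if len(set(results)) != 1:
            raise RuntimeError("replicas diverged")
        return results[0]


def run_demo():
    a, b = Replica("a"), Replica("b")
    pair = LockstepPair([a, b])
    outputs = []
    for step, amount in enumerate([10, 20, 30, 40]):
        if step == 2:
            a.healthy = False  # simulate a fault in one replica mid-stream
        outputs.append(pair.apply(amount))
    return outputs


if __name__ == "__main__":
    print(run_demo())  # [10, 30, 60, 100]
```

Because the surviving replica already holds the same state as the failed one, the caller sees an uninterrupted sequence of results; there is no log replay or rollback step, which is the contrast with the HA cluster case described above.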
