[glob2-devel] Fw: WestHost 2.0 Apology / Disk Space Increases


From: Gabriel Walt
Subject: [glob2-devel] Fw: WestHost 2.0 Apology / Disk Space Increases
Date: Wed, 1 Oct 2003 03:07:23 +0200

Hello,

A note from our dear hosting provider. Worth noting (in the middle of the mail)
that they seem to have run into serious bugs that Linux carried along from 2.4.9
to 2.4.20, and that they managed to track them down with the help of Intel,
RedHat and other experts on the kernel and its drivers. Which goes to show: if
even the most stable operating systems have this kind of problem, I don't even
dare imagine what it's like with the OSes whose source we can't access... ;)

In short, our disk quotas have doubled!
- Disk quota: 4 TB???? At least that's what our management interface says...
- Bandwidth quota: 12 GB... That's a more reasonable figure.

Gabriel

----

Begin forwarded message:

Date: Sat, 27 Sep 2003 15:43:07 -0500
From: "WestHost Inc." <address@hidden>
To: address@hidden
Subject: WestHost 2.0 Apology / Disk Space Increases


Dear WestHost Clients,

We want to thank each of our clients personally for the enormous patience they
have shown during this very difficult time for WestHost.  We realize that for
some, this has been merely an inconvenience, yet for others, it has seemed like 
a disaster.

As we continue to work through remaining client support requests and return to 
normal operations, we felt it was necessary to offer this letter as a formal 
explanation of the problems that many have experienced as a result of our 
transition to WestHost 2.0.  We also wanted to show our appreciation for your 
continued support.  This weekend, we will be doubling the disk space quotas for 
all current WestHost sites and again want to say thank you for choosing 
WestHost!

But first, what went wrong?
============================
We believe there are two major factors that caused the problems in our first 
phase of site upgrades:  
1. Infrastructure problems resulting in major instability.
2. Bugs in the upgrade process.

Although many have attributed these breakdowns to insufficient testing, it is 
not as if the platform was launched untested. Indeed, the new platform 
underwent several months of rigorous testing prior to launch. We believe the 
breakdown occurred because no amount of preparation could have produced a test 
environment that resembled the real world. Current load-simulation technologies,
while close, are simply incapable of reproducing real-world traffic at this
volume and variety. To the degree we were able, extensive testing was performed
and the green light was given that everything was stable, even though it was not.

The second major breakdown occurred during the upgrade preparation. Several
bugs made it through our testing and quality assurance procedures because they
only presented themselves when large numbers of sites were being upgraded, or
only in a very small percentage of sites.

The decision was made at the end of the testing period to proceed with the 
launch as all metrics were well within the parameters established as 
_exceptional_. Our technology team had invested literally thousands of 
man-hours into the project and felt strongly that we were in a good position to 
launch. We proceeded by launching the transition with our smallest server in 
hopes that if there were problems with the procedures or the infrastructure, 
they would be noticed right away and impact the smallest possible number of 
clients. Immediately following the transition, each site was tested and 
verified using an automated testing tool.

A few days later, the transition of the first server was deemed a success. The
number of support requests generated as a result was minimal (most were related
to confusion about how to use the new platform, which taught us valuable lessons
about how to present the help documents, and corrections were made). The few
requests that did relate to problems with clients' websites were investigated
thoroughly, and changes to the transition procedures were made and tested to
ensure we didn't hit the same problems going forward. At this point, we felt
safe in moving 7 additional servers.

The upgrade of the next 7 servers, despite rigorous testing and careful 
transitional plans, proved that there were several bugs in our upgrade 
procedures that needed correction. We also realized that many factors we 
couldn't control (such as having to change passwords) were going to cause more 
overhead than we anticipated.  Immediately, additional resources were brought 
in and trained as fast as possible. Further plans were formulated to help deal 
with the volume of support requests we were receiving and special assignments 
were made to make sure we could quickly identify _patterns_ in requests that 
would indicate bugs. 

At this point we found it necessary to discontinue our phone and live chat
support in an effort to improve efficiency and productivity in informing clients
of changes to their accounts and in resolving problems directly related to the
2.0 upgrade (both have since been reactivated).  Rollback plans were discussed
but, due to the nature of hosting, were not practical, as they can result in
corrupt or missing data.

Over the next several days our staff members worked around the clock and gained
renewed confidence in the upgrade procedures as new bugs were identified and
corrected. The decision was made a few days later to test the new procedures on
a very small number of servers before proceeding any further. A small group of
servers was tested, things looked good, small corrections were made, and we
proceeded with the next large block.

It was at this point that major instability issues became apparent in our 
hosting platform. We had approximately 9,000 hosting accounts on the new system 
and servers were crashing. Combine this with overall slowness and we had a 
full-blown disaster on our hands.

The decision was made immediately to halt the transition of any new servers 
until:
1. All bugs were removed from the transition procedures.
2. The platform could be stabilized (Linux, touted as one of the most stable
operating systems on the planet, was behaving in exactly the opposite way).
3. Outstanding client support issues could be resolved.

Multiple engineers were brought in to assist with the instability problem. 
Driver engineers from Intel, Linux Kernel developers, RedHat engineers, and 
various other industry experts were employed around the clock to resolve the 
problems relating to the kernel crashes we were experiencing. At this point, we 
were well aware of the consequences we faced with every minute that passed 
without a resolution. Finally, the stability problem was narrowed down to a bug
in the entire 2.4.20 series of Linux kernels (including some that had been out
for almost a year, and the 2.4.9 series from RedHat Advanced Server).

The major cause of the performance issue was resolved a few days later when it
was learned that the Linux NFS v3 code had a previously unknown bug that caused
very poor NFS performance under heavy load. Once this was discovered, we
reverted to NFS v2 and the majority of the performance issues were resolved.
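
(The letter does not say how the revert to NFS v2 was carried out; on Linux
systems of that era, one common way is the client-side nfsvers=2 mount option.
The sketch below is only an illustration, assuming a hypothetical file server
and export path.)

    # Hypothetical /etc/fstab entry forcing an NFSv2 mount instead of NFSv3
    # (server name and paths are placeholders)
    fileserver:/export/sites   /var/www   nfs   nfsvers=2,hard,intr   0 0

    # Equivalent one-off mount on a running system:
    #   mount -t nfs -o nfsvers=2,hard,intr fileserver:/export/sites /var/www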

We also reverted to an older kernel version and found immediate stability. We are
currently working with RedHat and the Linux kernel development team to resolve 
these issues going forward.

At this point, resources previously dedicated to stabilizing the platform could
be redirected to resolving client issues (most of which were resolved once the
speed and stability issues were under control). All efforts were, and are,
concentrated on working around the clock to answer each and every support
request we have received.  In the last few weeks we have received five months'
worth of support requests.  Nonetheless, we are gaining control and will soon
have resolutions for everyone.  Toll-free phone support is once again active, as
is live chat.

The remaining accounts to be migrated can expect a far better experience than 
our _first batch_. We have learned valuable lessons in communicating with our 
clients, the infrastructure is solid and stable, and our tech support
department has gained experience solving complex problems on a relatively new
platform.  We plan to upgrade servers one by one rather than in
groups, and we now have a firm understanding of the human resources required to 
manage an upgrade of this size.

Please understand that the reasoning behind our decision to upgrade to WestHost
2.0 has always been driven by our dedication to offering clients the best
solution possible.  This transition has involved new hardware, new software, new
infrastructure, and innovative concepts in hosting.  Nowhere in the industry
will you find the same level of service and features at the prices we have set.
For current WestHost clients, the overall benefits are huge.  As an example:
today we are one of the only hosting companies to offer true high-availability
hosting, meaning we can withstand a full hardware failure on one of our servers
without resorting to tape backups.

There are many things we could have done differently with this upgrade.  We've
made mistakes along the way and sincerely apologize to those who have been
affected.  We genuinely value each of our clients and want you to be happy with
your decision to stay with WestHost.  We are making every effort to return to 
our normal high standards of service and support, and are committed to putting 
in place the resources, tools and services to ensure the same level of 
reliability you have come to expect from WestHost.  Please accept our apologies 
and enjoy the increased disk space we will be adding to your account.

Sincerely, 

Brian Shellabarger 
Chief Technology Officer 
WestHost Inc. 

When you expect more from your web host 
<http://www.westhost.com/>




