[glob2-devel] Fw: WestHost 2.0 Apology / Disk Space Increases
From: Gabriel Walt
Subject: [glob2-devel] Fw: WestHost 2.0 Apology / Disk Space Increases
Date: Wed, 1 Oct 2003 03:07:23 +0200
Hello,
A note from our dear hosting provider. Note (mid-way through the mail) that they
seem to have run into serious bugs that Linux carried along from 2.4.9 to 2.4.20,
and that they managed to track them down with the help of Intel, RedHat, and other
experts on the kernel and its drivers. Which goes to show: if even the most stable
OSes have this kind of problem, I don't dare imagine what it's like with the OSes
whose sources we have no access to... ;)
In short, our disk quotas have doubled!
- Disk quota: 4 TB?! At least that's what our management interface says...
- Bandwidth quota: 12 GB... That's a more reasonable figure.
Gabriel
----
Begin forwarded message:
Date: Sat, 27 Sep 2003 15:43:07 -0500
From: "WestHost Inc." <address@hidden>
To: address@hidden
Subject: WestHost 2.0 Apology / Disk Space Increases
Dear WestHost Clients,
We want to thank each of our clients personally for the enormous amount of
patience during this very difficult time for WestHost. We realize that for
some, this has been merely an inconvenience, yet for others, it has seemed like
a disaster.
As we continue to work through remaining client support requests and return to
normal operations, we felt it was necessary to offer this letter as a formal
explanation of the problems that many have experienced as a result of our
transition to WestHost 2.0. We also wanted to show our appreciation for your
continued support. This weekend, we will be doubling the disk space quotas for
all current WestHost sites and again want to say thank you for choosing
WestHost!
But first, what went wrong?
============================
We believe there are two major factors that caused the problems in our first
phase of site upgrades:
1. Infrastructure problems resulting in major instability.
2. Bugs in the upgrade process.
Although many have attributed these breakdowns to insufficient testing, it is
not as if the platform was launched untested. Indeed, the new platform
underwent several months of rigorous testing prior to launch. We believe the
breakdown occurred because no amount of preparation could have produced a test
environment that resembled the real world. Current load-simulation technologies,
while close, are simply incapable of reproducing real-world conditions at this
volume. To the degree we were capable, extensive testing was performed and the
green light was given that everything was stable, even though it was not.
The second major breakdown occurred during the upgrade preparation. Several bugs
made it through our testing and quality assurance procedures because they only
presented themselves when large quantities of sites were being upgraded, or in a
very small percentage of sites.
The decision was made at the end of the testing period to proceed with the
launch as all metrics were well within the parameters established as
_exceptional_. Our technology team had invested literally thousands of
man-hours into the project and felt strongly that we were in a good position to
launch. We proceeded by launching the transition with our smallest server in
hopes that if there were problems with the procedures or the infrastructure,
they would be noticed right away and impact the smallest possible number of
clients. Immediately following the transition, each site was tested and
verified using an automated testing tool.
A few days later, the transition of the first server was deemed a success. The
number of support requests generated as a result was minimal (most were related
to confusion about how to use the new platform, which taught us valuable lessons
in how to present the help documents, and corrections were made). The few that
were related to problems with clients' websites were investigated thoroughly, and
changes to the transition procedures were made and tested to ensure we didn't hit
the same problems going forward. At this point, we felt safe in moving 7
additional servers.
The upgrade of the next 7 servers, despite rigorous testing and careful
transitional plans, proved that there were several bugs in our upgrade
procedures that needed correction. We also realized that many factors we
couldn't control (such as having to change passwords) were going to cause more
overhead than we anticipated. Immediately, additional resources were brought
in and trained as fast as possible. Further plans were formulated to help deal
with the volume of support requests we were receiving and special assignments
were made to make sure we could quickly identify _patterns_ in requests that
would indicate bugs.
At this point we found it necessary to discontinue our phone and live chat
support in an effort to improve efficiency and productivity in informing
clients of changes to their account and resolving problems directly related to
the 2.0 upgrade (they have since been reactivated). Rollback plans were discussed
but, due to the nature of hosting, were deemed impractical, as they can result in
corrupted or missing data.
Over the next several days our staff members worked around the clock and gained
renewed confidence in the upgrade procedures as new bugs were identified and
corrected. The decision was made a few days later to test the new procedures on
a very small number of servers before proceeding any further. A small group of
servers was tested, things looked good, small corrections were made, and we
proceeded with the next large block.
It was at this point that major instability issues became apparent in our
hosting platform. We had approximately 9,000 hosting accounts on the new system
and servers were crashing. Combine this with overall slowness and we had a
full-blown disaster on our hands.
The decision was made immediately to halt the transition of any new servers
until:
1. All bugs were removed from the transition procedures.
2. The platform could be stabilized (Linux, touted to be one of the most stable
operating systems on the planet, was behaving exactly the opposite).
3. Outstanding client support issues could be resolved.
Multiple engineers were brought in to assist with the instability problem.
Driver engineers from Intel, Linux Kernel developers, RedHat engineers, and
various other industry experts were employed around the clock to resolve the
problems relating to the kernel crashes we were experiencing. At this point, we
were well aware of the consequences we faced with every minute that passed
without a resolution. Finally, the stability problem was narrowed down to a bug
in the entire 2.4.20 series of Linux kernels (even some that had been out for
almost a year, and including the 2.4.9 series from RedHat Advanced Server).
The major cause of the performance issue was resolved a few days later when it
was learned that the Linux NFS v3 code had a previously unknown bug that caused
very poor NFS performance under heavy load. Once this was discovered, we
reverted to NFS v2 and the majority of the performance issues were resolved.
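For readers curious what such a downgrade looks like in practice, here is a
minimal sketch of pinning a Linux NFS client of that era to protocol version 2;
the server name and export path are hypothetical placeholders, not WestHost's
actual configuration.

```shell
# Hypothetical sketch: forcing an NFS client onto protocol version 2.
# "fileserver" and "/export/home" are placeholder names for illustration only.

# One-off mount forcing NFSv2 instead of letting the client negotiate v3:
mount -t nfs -o nfsvers=2,rsize=8192,wsize=8192 fileserver:/export/home /mnt/home

# Equivalent persistent entry in /etc/fstab:
#   fileserver:/export/home  /mnt/home  nfs  nfsvers=2,rsize=8192,wsize=8192  0 0
```

The `nfsvers=2` mount option tells the client to speak the v2 protocol even when
the server advertises v3, sidestepping the v3 code path entirely.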
We also reverted to an older kernel version and found immediate stability. We
are currently working with RedHat and the Linux kernel development team to
resolve these issues going forward.
At this point, resources previously dedicated to stabilizing the platform could
be redirected to resolving client issues (most of which were resolved once the
speed and stability issues were under control). All efforts were, and are,
concentrated on working around the clock to answer each and every support
request we receive. In the last few weeks we have received five months' worth of
support requests. Nonetheless, we are gaining control and will soon have
resolutions for everyone. Toll-free phone support is once again active, as is
live chat.
The remaining accounts to be migrated can expect a far better experience than
our _first batch_. We have learned valuable lessons in communicating with our
clients, the infrastructure is solid and stable, and our tech support
department has been exposed to solving complex problems on a relatively new
platform. We plan to upgrade each server on a one-by-one basis rather than in
groups, and we now have a firm understanding of the human resources required to
manage an upgrade of this size.
Please understand that our reasoning behind our decision to upgrade to WestHost
2.0 has always been driven by our dedication to offer clients the best solution
possible. This transition has involved new hardware, new software, new
infrastructure, and innovative concepts in hosting. Nowhere in the industry
will you find the same level of service and features at the prices we have set.
As a current WestHost client, the overall benefits are huge. As an example:
Today we are one of the only hosting companies to offer true high-availability
hosting, meaning we can withstand a full hardware failure on one of our servers
without resorting to tape backups.
There are many things we could have done differently with this upgrade. We've
made mistakes along the way and sincerely apologize to those who have been
affected. We genuinely value each of our clients and want you to be happy with
your decision to stay with WestHost. We are making every effort to return to
our normal high standards of service and support, and are committed to putting
in place the resources, tools and services to ensure the same level of
reliability you have come to expect from WestHost. Please accept our apologies
and enjoy the increased disk space we will be adding to your account.
Sincerely,
Brian Shellabarger
Chief Technology Officer
WestHost Inc.
When you expect more from your web host
<http://www.westhost.com/>