l4-hurd

Re: Persistence Pros and Cons


From: Jonathan S. Shapiro
Subject: Re: Persistence Pros and Cons
Date: Mon, 07 Nov 2005 20:38:11 -0500

On Mon, 2005-11-07 at 01:20 +0100, address@hidden wrote:
> I'm sure you mentioned it already at some point, but I don't remember,
> and I guess others may be interested as well: Could you please explain
why exactly you planned to back off on persistence, and what made you
> undo this decision? I guess this can give valuable insight into the
> implications.

Actually, I don't recall that I *did* say. I'll try... :-)


The EROS kernel persistence mechanism was complicated, and I was going
through an extended period of architectural frustration. A variety of
irritations had accumulated, and it was time to start over. In the
process, I mis-diagnosed the real source of the complexity in the EROS
checkpoint mechanism, and I was going through a phase when most of our
work was embedded and persistence wasn't being used. So I thought to
myself: hey, if we can remove this, we can simplify lots of stuff.

So I spent some time looking at persistence. I concluded that if we
removed it we would lose a fair amount of efficiency, and a small number
of design patterns (some of which were heavily exploited), but I thought
I could replace the design patterns in most cases. The big thing we
would lose was the machine-wide consistency story, and I was thinking a
lot about the distributed system case at that time. In a distributed
system, you don't have consistency across machines, so some applications
need to manage this issue explicitly.

I still didn't like non-persistence, but I started to think to myself:

  One bad persistence mechanism is better than one bad
  persistence mechanism plus one good persistence mechanism.

I spent a lot of time wearing circles in my office carpet trying to sort
this issue out. Eventually I got re-energized and was able to focus. I
concluded that (a) this really isn't true, because most interactions are
still local, (b) as a result, the problem actually does NOT pervade
applications, and (c) the transactional consistency properties of a good
persistence implementation are very very powerful, and should not be
abandoned.

So I went back to look at the sources of complexity in EROS again. There
were two kinds of complexity:

  1. The logic needed to page nodes in and out consistently was
     significant.

  2. The fact that address spaces and processes needed an underlying
     representation as nodes was quite ugly from an implementation
     perspective.

I eventually realized that the first issue *could not* go away -- all of
the issues that were hiding here are going to exist in any object paging
system, whether or not it is persistent. Given an object paging layer,
the marginal logic for in-kernel persistence is just the snapshot and
generation support, and that is fairly simple. I *was* able to figure
out how to move a bunch of this logic out of the kernel.
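As a rough illustration of why the marginal persistence logic is small once an object paging layer exists: a snapshot can simply freeze the current generation and copy an object only when it is next modified, so the frozen generation can be written out in the background. A toy copy-on-write sketch, with names and structure that are my own rather than anything from the EROS sources:

```python
class ObjectStore:
    """Toy copy-on-write checkpoint: snapshot() freezes a generation;
    later writes preserve the frozen version of each touched object."""

    def __init__(self):
        self.live = {}       # oid -> current contents
        self.frozen = None   # oid -> contents at snapshot (lazily filled)

    def write(self, oid, value):
        if self.frozen is not None and oid not in self.frozen:
            # First write since the snapshot: save the old version
            # (None marks an object that did not exist yet).
            self.frozen[oid] = self.live.get(oid)
        self.live[oid] = value

    def snapshot(self):
        self.frozen = {}     # freeze the current generation, lazily

    def checkpoint(self):
        """Return a consistent image of the snapshotted generation."""
        gen = dict(self.live)
        for oid, old in self.frozen.items():
            if old is None:
                gen.pop(oid, None)   # created after the snapshot
            else:
                gen[oid] = old       # modified after the snapshot
        self.frozen = None
        return gen

store = ObjectStore()
store.write("a", 1)
store.snapshot()
store.write("a", 2)          # copy-on-write preserves a=1
store.write("b", 9)          # born after the snapshot: not in image
print(store.checkpoint())    # {'a': 1}
```

The point of the sketch is that everything difficult here (finding dirty objects, paging them consistently) already belongs to the object paging layer; the snapshot and generation bookkeeping on top is a few lines.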

The second part was the big one. In KeyKOS there was a design rule that
the kernel should never rely on the correct behavior of application code
in order to ensure the correct behavior of kernel code. Not even
application code from the system TCB. If the space bank got something
wrong, the kernel should not contribute to further error as a result.

Given this design rule, the fact that processes were not really a "first
class" data structure led to some ugly corner cases:

  - since an EROS process is constructed from nodes, it is possible
    that the kernel might encounter a process that is missing one
    of its parts. Behavior in this case must be well defined.

  - a node might be misused in such a way that it appears simultaneously
    as part of a process and as part of an address space. About six
    months ago we noticed a case where the KeyKOS and EROS kernels had
    independently implemented the same live-lock bug in one corner case.
    It could not happen in practice, because an application-level
    guarantee prevented the situation, but it was still a kernel bug.

Overall, the big complication was the fact that nodes could be cached
several ways, and this made kernel (software) cache management and
coherency more difficult than it really needed to be.

To simplify the cache management, what we really needed to do was to
make processes be first class objects at the disk layer. Unfortunately,
this was very tricky. We wanted to keep the number of different object
sizes that needed to be stored on the disk very small, and we
*desperately* wanted to avoid having any of the disk object sizes be
machine dependent if the kernel needed to manage them. We already had
pages and nodes. Two object types seemed like the limit we could handle
without adding indirection in the object store. Indirection would mean
indirect blocks, which would mean seeks, which would create performance
issues for checkpoint. A lot of the arguments in the single store
evolution paper would be hopelessly lost.

Eventually, I realized that I needed to give up on the "multiple object
sizes" issue. Instead of focusing on why it was a problem, I started
focusing on how to solve it. Once I did this, I realized that moving the
logic to user mode created some options that we would not have
considered feasible in the old kernel. In particular, it enabled a new
disk store layout that manages to address most of my concerns without a
large increase in seeks.

One of the key enablers for this realization was endpoints -- or more
precisely, the decision to replace EROS "resume capabilities" with
endpoints.



In EROS, the "CALL" operation (approximately an L4 send) creates
something called a "resume capability" as a side effect. The CALL
operation blocks, and can only be awakened by an invocation of the
resume capability (or one of its copies). So in EROS a resume capability
names a process in a closed wait and a start capability can invoke a
process only when it is in an open wait.

All copies of a resume capability are destroyed when any copy is used.
This ensures that every CALL gets at most one reply, and it has the
effect of severing the session between client and server when the server
replies. It helps eliminate cases where a buggy or compromised server
might send stray messages to a client after the client thinks that the
discussion is over.
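The at-most-one-reply property falls out of a simple invariant: every copy of a resume capability shares one validity cell, and invoking any copy clears it. A minimal sketch of that semantics (the class and method names are mine, not EROS's):

```python
class ResumeCapability:
    """Toy model of an EROS resume capability: every copy shares one
    validity cell, so invoking any copy destroys all of them."""

    def __init__(self):
        self._valid = [True]          # shared cell; copies alias it

    def copy(self):
        dup = ResumeCapability.__new__(ResumeCapability)
        dup._valid = self._valid      # alias the same cell, not a fresh one
        return dup

    def invoke(self, reply):
        if not self._valid[0]:
            raise RuntimeError("stale resume capability")
        self._valid[0] = False        # sever the session for every copy
        return reply

caller_cap = ResumeCapability()
server_copy = caller_cap.copy()
print(server_copy.invoke("reply"))    # the one permitted reply
print(caller_cap._valid[0])           # False: original is dead too
```

Once the server replies through any copy, a later stray invocation through a different copy fails rather than reaching the client, which is exactly the post-reply isolation described above.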

The problem with resume capabilities is that they require a second
version number in every node (the "call count"). This counter makes it a
bit challenging to "re-type" a disk region, and it makes switching to an
incremental collector for capability rescind impractical (explaining why
is a long story). Finally, because it is a frequently updated counter
that is closely tied to a disk-based object (the node), it has the
unintended side effect of restricting the space of feasible disk object
layouts.
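The call count works the same way as the allocation count used for ordinary capability rescind: a capability records the counter's value when it is created and remains valid only while the node's counter still matches, so a single increment rescinds every outstanding copy at once. A minimal sketch of that versioning scheme (the class and field names are my own, not the kernel's):

```python
class Node:
    """Toy model of an EROS node carrying two version counters."""
    def __init__(self):
        self.alloc_count = 0   # bumped when the node is reallocated
        self.call_count = 0    # bumped when a resume capability is used

class Capability:
    """Valid only while its snapshot of the chosen counter still
    matches the node's current value."""
    def __init__(self, node, field):
        self.node = node
        self.field = field                     # which counter to check
        self.snapshot = getattr(node, field)

    def is_valid(self):
        return self.snapshot == getattr(self.node, self.field)

node = Node()
resume = Capability(node, "call_count")
print(resume.is_valid())   # True: counter unchanged since creation
node.call_count += 1       # reply delivered: rescind every resume cap
print(resume.is_valid())   # False: all outstanding copies are stale
```

The cost the text describes follows directly: the second counter is written on every reply, and because it must persist with the node, it constrains how node frames can be laid out on disk.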

As I evaluated the move to endpoints for Coyotos, one of the important
questions was: "Did we actually need resume capabilities?" My answer, in
the end, was "we can live without them", but I have a couple of tricks
in mind to make endpoint severing efficient. If those tricks work, then
they provide most of the post-IPC isolation that resume capabilities
provided.

So I would say that what really happened is that I got caught during a
time of confusion in one of EROS's uniquely twisty architectural mazes,
and I temporarily lost track of where the door needed to be.

I don't think that Coyotos is going to have a similarly twisty maze,
which is one of the things I really like about it.



Let me close this with two meta-observations about microkernels.

We talk about removing from the microkernel everything that we can. In
practice, this means removing anything where (a) we can identify some
vaguely modular interface and (b) removing it won't create a protection
problem. And then we talk about how nice it is to have a system
structure where all of these new modules have enforced encapsulation so
that we can debug them and/or replace them.

All of this is true, but it's important to ask: "but what happens to the
kernel?" Think about it. When you are done stripping out all of the
stuff that you can feasibly modularize, what you are left with is this
hopelessly intertwined ball of stuff that is too complicated to tease
apart. This ball of stuff necessarily has lots of globally interacting
design constraints, and a small change in one place can lead to a big
bleeding ulcer someplace else. Especially so if security is a major
design goal.

So: to the extent that you are successful in shrinking the microkernel,
you discover that you have fewer and fewer design options in the
microkernel itself. The design reaches a certain kind of architectural
low-energy fixpoint, and it has a way of remaining there, because the
architectural activation energy needed to make any serious change is
very large and usually has incompatible results.

The really interesting question to ask is: why did L3/Eumel/L4 and
Gnosis/KeyKOS/EROS survive for 25 years when the other microkernels
died? In some sense, both systems have managed to survive their original
architects (Norm Hardy lives, but I think it would be fair to describe
him as "actively retired").

I think the answer is that both systems were based firmly and
uncompromisingly on principle-driven design. There are portions of L4
that I argue about (and portions of EROS that *they* argue about), but
the active and elegant role of mathematics and principle in the design
of L4 is striking and beautiful. The same, I think, is true in KeyKOS. I
would like to believe that I preserved that attribute in EROS. We shall
have to see whether that remains true with Coyotos. I hope that it will.

I do not know how to convey exactly how rare a property this is in *any*
software system, but especially so in such a performance-critical and
complex system as a microkernel.


shap




