
Re: [Gnu-arch-users] Arch Cache & cached archives


From: Tom Lord
Subject: Re: [Gnu-arch-users] Arch Cache & cached archives
Date: Wed, 15 Sep 2004 14:42:31 -0700 (PDT)

    > From: Aaron Bentley <address@hidden>

    > Although I'm using the term "cache", it's really about memoizing
    > data that can be time-consuming to produce, but will always be
    > equivalent once produced.

I think that caching vs. memoizing should (at least ultimately) be a
finely tunable parameter.   It is worth planning at least a little bit
for that because if the two can be simply unified (and I see no reason
why they cannot) then it is worth doing so.
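
As a rough illustration of "one tunable parameter", here is a minimal
sketch in which a single retention limit covers the whole range from
"off" through "bounded cache" to "pure memoization".  All of the names
(store_policy, store_mode_of, store_must_evict) are hypothetical and
are not part of arch or of the proposal being discussed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical policy knob: how many answers the store may retain.
   0 retains nothing, SIZE_MAX never evicts (pure memoization), and
   anything in between behaves like a bounded cache. */
struct store_policy
{
  size_t max_entries;
};

enum store_mode { STORE_OFF, STORE_CACHE, STORE_MEMO };

/* Classify a policy by the behavior it implies. */
static enum store_mode
store_mode_of (const struct store_policy *p)
{
  assert (p != NULL);
  if (p->max_entries == 0)
    return STORE_OFF;
  if (p->max_entries == SIZE_MAX)
    return STORE_MEMO;
  return STORE_CACHE;
}

/* Nonzero if an old entry must be dropped before storing a new one.
   Only a bounded cache ever evicts; a memo table never does, and a
   disabled store has nothing to evict. */
static int
store_must_evict (const struct store_policy *p, size_t current_entries)
{
  return store_mode_of (p) == STORE_CACHE
         && current_entries >= p->max_entries;
}
```

The point of the sketch is only that client code never sees the mode;
it is derived from configuration in one place.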


    > Layer 1: The Arch Cache
    > =======================

    > The Arch Cache abstraction connects "query paths" with streams,
    > or things that streams can represent.  Query paths look
    > suspiciously like POSIX pathnames.  Convenience functions are
    > available for use with strings.

The structure of the namespace of cachable items is a critical design
point:

First: That namespace has some quasi-algebraic structure.  For
example, some cachable things may be small parts of larger cachable
things (a containment relationship).   Hierarchical pathnames capture
the structure of "containment" well enough but what we lack here is
any evidence that containment is the right structure overall.

Second: paths are ideal for naming parameters to a referentially
transparent function of a single parameter but far from ideal for
functions of 2 or more parameters.  Certainly gymnastics to map tuples
into paths are possible but are they either necessary or desirable?

Third: paths are ideal for naming parameters to referentially
transparent functions whose arguments are strings.   But what about
functions whose arguments are of some other type?   At the very least,
we ought to have a clear statement about the domain of cachable
functions and how that domain maps onto paths.

In short: you are introducing a namespace extension and the structure
of the addition ought to be explicitly documented and considered.
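
To make the second and third points concrete, here is a sketch of the
kind of "gymnastics" needed to map a tuple of string arguments onto a
path.  It assumes a fixed number of components per query type and
percent-escapes '/' and '%' so that a component cannot be mistaken
for a path separator.  The function names and the "/deltas" prefix
are hypothetical, chosen only for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Append "/" plus one tuple component to the NUL-terminated string in
   `out`, escaping '/' and '%' as %2F and %25.  Returns 0 on success.
   On overflow returns -1 and restores `out` to its previous contents. */
static int
append_component (char *out, size_t out_size, const char *component)
{
  size_t start;
  size_t used;
  const char *p;

  assert (out != NULL && component != NULL);
  start = strlen (out);
  used = start;

  if (used + 1 >= out_size)
    return -1;
  out[used++] = '/';
  out[used] = '\0';

  for (p = component; *p; ++p)
    {
      int escape = (*p == '/' || *p == '%');
      size_t need = escape ? 3 : 1;

      if (used + need >= out_size)
        {
          out[start] = '\0';
          return -1;
        }
      if (escape)
        snprintf (out + used, out_size - used, "%%%02X",
                  (unsigned int) (unsigned char) *p);
      else
        {
          out[used] = *p;
          out[used + 1] = '\0';
        }
      used += need;
    }
  return 0;
}

/* Hypothetical path for a function of three parameters: the changeset
   in `archive` that transforms `from_rev` into `to_rev`. */
static int
delta_query_path (char *out, size_t out_size, const char *archive,
                  const char *from_rev, const char *to_rev)
{
  if (out_size == 0)
    return -1;
  out[0] = '\0';
  if (append_component (out, out_size, "deltas") != 0
      || append_component (out, out_size, archive) != 0
      || append_component (out, out_size, from_rev) != 0
      || append_component (out, out_size, to_rev) != 0)
    return -1;
  return 0;
}
```

Even this small example has to pick an argument order and an escaping
rule, and it says nothing about arguments that are not strings.  Those
are exactly the decisions that ought to be written down.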


    > There is a test for whether the cache is enabled:
    > extern int
    > arch_cache_active (void)

    > Attempts to use the cache when it is not enabled will cause panics.

On one side of a wall is the user's configuration: how the cache is
tuned.

On the other side of that wall are the core algorithms of arch.

I am alarmed that the algorithms care how the cache is tuned.  For
example, why isn't a "put" to a "non-enabled" cache simply a (cheap)
noop?
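
A minimal sketch of that alternative, using a toy in-memory table
rather than anything from the proposed API (toy_cache, toy_cache_put
and toy_cache_get are hypothetical names): a "put" to a disabled cache
returns immediately, and a "get" simply misses.

```c
#include <assert.h>
#include <string.h>

#define TOY_SLOTS 8

struct toy_cache
{
  int enabled;                    /* set from user configuration */
  int used;                       /* number of occupied slots */
  const char *keys[TOY_SLOTS];
  const char *values[TOY_SLOTS];
};

/* Storing into a disabled (or full) cache is a cheap no-op, not a
   panic.  The caller does not need to test for either condition. */
static void
toy_cache_put (struct toy_cache *c, const char *key, const char *value)
{
  assert (c != NULL && key != NULL);
  if (!c->enabled || c->used == TOY_SLOTS)
    return;
  c->keys[c->used] = key;
  c->values[c->used] = value;
  c->used++;
}

/* Returns the stored value, or NULL on a miss.  A disabled cache
   always misses. */
static const char *
toy_cache_get (const struct toy_cache *c, const char *key)
{
  int i;

  if (!c->enabled)
    return NULL;
  for (i = 0; i < c->used; ++i)
    if (strcmp (c->keys[i], key) == 0)
      return c->values[i];
  return NULL;
}
```

With this shape the core algorithms call put and get unconditionally,
and the "enabled" flag stays on the configuration side of the wall.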


    > extern int
    > arch_cache_put (t_uchar **tmp_name, t_uchar *rel_query_path)

    > To add something to the cache, we use arch_cache_put.  This returns a 
    > file descriptor that we'll have to close, and a tmp_name that we'll 
    > ultimately need to free.

    > extern void
    > arch_cache_commit (t_uchar *tmp_name, t_uchar *rel_query_path)

    > After we have written the answer to the file descriptor, we must commit 
    > it, before the answer can become active.  This step is not required for 
    > the string wrappers.

    > extern int
    > arch_cache_has_answer (t_uchar * rel_query_path)

    > We can use arch_cache_has_answer to find out whether the cache has an 
    > answer for a particular query.


Why is that query not internal to the caching mechanism?

If client code asks for some thing, unless there is a really
compelling reason otherwise, the client code should get that thing
whether it comes from the cache or not.

That is why I suggested thinking about ways to link computation rules
(how a given cachable entity can be computed) to the namespace
structure, so that clients don't have to care whether or not the
cache exists and those who work on caching code can concentrate on the
problem of optimizing client queries by any means necessary.
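
Here is a sketch of what linking computation rules to the namespace
could look like.  It assumes rules are selected by the last component
of the query path and uses a one-entry cache to stay short.  Every
name in it (rule, compute_log, answer_query) is hypothetical, and
compute_log is a stand-in that does no real work.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* A rule says how to compute the answer for paths ending in `leaf`. */
typedef int (*compute_fn) (const char *query_path, char *out,
                           size_t out_size);

struct rule
{
  const char *leaf;
  compute_fn compute;
};

static int compute_calls;       /* counts real computations */

/* Stand-in for an expensive computation such as fetching a patchlog. */
static int
compute_log (const char *query_path, char *out, size_t out_size)
{
  int n;

  ++compute_calls;
  n = snprintf (out, out_size, "log for %s", query_path);
  return (n >= 0 && (size_t) n < out_size) ? 0 : -1;
}

/* One-entry cache, private to the query mechanism. */
static char cached_path[128];
static char cached_answer[128];
static int cache_valid;

/* The only entry point clients use.  Returns 0 and fills `out` whether
   the answer came from the cache or from a rule; returns -1 if no
   rule matches or the computation fails. */
static int
answer_query (const struct rule *rules, size_t n_rules,
              const char *query_path, char *out, size_t out_size)
{
  const char *slash = strrchr (query_path, '/');
  const char *leaf = slash ? slash + 1 : query_path;
  size_t i;

  if (cache_valid && strcmp (cached_path, query_path) == 0
      && strlen (cached_answer) < out_size)
    {
      strcpy (out, cached_answer);
      return 0;
    }

  for (i = 0; i < n_rules; ++i)
    if (strcmp (rules[i].leaf, leaf) == 0)
      {
        if (rules[i].compute (query_path, out, out_size) != 0)
          return -1;
        /* Remember the answer only if it fits without truncation. */
        if (strlen (query_path) < sizeof cached_path
            && strlen (out) < sizeof cached_answer)
          {
            strcpy (cached_path, query_path);
            strcpy (cached_answer, out);
            cache_valid = 1;
          }
        return 0;
      }
  return -1;
}
```

Client code here never asks whether an answer is cached.  It asks for
the answer, and the hit-or-miss decision stays inside answer_query.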


    > extern int
    > arch_cache_get (t_uchar * rel_query_path)

    > We can use arch_cache_get to retrieve the answer for a query.  It will 
    > panic if no answer is available for that query.  This is where the smart 
    > caching functionality Tom's mentioned could hook in.  One possible 
    > implementation would be to register a set of query handlers, and invoke 
    > them in sequence until one of them produced an answer.

I don't understand the prototype of that function, probably because
you haven't said anything about the range of cachable functions.
Presumably the range is not `int' but I don't see any other return
parameter there.


    > Layer 2: Namespace
    > ==================
    > The current namespace looks like this:

    > /archives : data for archives, but not for specific locations

    > /archives/$ARCHIVE: data for a particular archive

    > /archives/$ARCHIVE/$REVISION: data for a particular archive revision. 
    > I'm not sure I want to keep it this way.  For scalability reasons, this 
    > might be better: /archives/$ARCHIVE/$VERSION/$DATATYPE/$PATCHLEVEL. 
    > That way, listing data would scale with the number of patchlevels (which 
    > have cached queries) in the version, not version*patchlevels.

    > /archives/$ARCHIVE/$REVISION/full-tree.tar.gz: The full tree (same 
    > contents as a cacherev or import) for the revision

    > /archives/$ARCHIVE/$REVISION/log: The patchlog for the revision

    > /archives/$ARCHIVE/$REVISION/delta.tar.gz: The changeset between the 
    > revision and its direct ancestor

    > /archives/$ARCHIVE/$REVISION1/delta-from-$REVISION2.tar.gz: (not 
    > implemented) The changeset that transforms $REVISION2 into $REVISION1

    > /archives/$ARCHIVE/$REVISION/ancestor: (not implemented) The direct 
    > ancestor of the revision

    > /archives/$ARCHIVE/$REVISION/type: (not implemented) The type of the 
    > revision ("import", "simple" or "continuation")

    > /locations/$MANGLED_URL/NAME: (not implemented) The official name 
    > associated with an archive location.  Required for disconnected 
    > operation or lazy initialization, but may occasionally change.

Rationale?  (See my comments above about namespace design.)


    > Cached Archives
    > ===============
    > Cached archives are the first clients of the Arch Cache.  


One could implement cached archives without changing arch in the
slightest.

Such an implementation would be very clean.

Why isn't that approach better?

I suspect that the answer is because you want to cache more than just
archives.

Great.  But in that case, the namespace, range, domain, and layering
issues I've raised deserve more design attention.

-t




