gnu-arch-users
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Gnu-arch-users] A Truly Transactional Filesystem (API description)


From: Thomas Lord
Subject: [Gnu-arch-users] A Truly Transactional Filesystem (API description)
Date: Thu, 12 Jan 2006 17:22:30 -0800

  Building a file system on top `vumnd' (the page-oriented, persistent
  transactional memory system I presented in my last message) is a
  well-understood problem.  An inode layer can be done in a few
  thousand line of code and a `vu_' system call layer with a few
  thousand more.  I hope to get such a thing working soon, if I can.

  This note is about *extra* functionality that can be built into 
  a file system at the same time.


* Constant Files, Symbolic Links, and Devices.

  The new file system call `int conx (int fd)' turns inodes which are
  not directories into *constant inodes*.

  The link count of a constant inode can change but no other aspect of
  the inode can ever again change, regardless of the node's
  permissions.  There is no inverse to the `conx' operation.


* Constant Directories

  Constant directories work a little bit differently.   If `conx'
  is applied to a directory, all of the contents of the directory
  must themselves be constants (except `..').   The directory
  is made constant and its device number (normally 0) is changed.

  Cross device links are permitted into and and out of constant
  directories.   If a link is made to a node in a constant tree
  the resulting link may or may not be to the same device and
  inode as found in the constant tree.


* Cheap Device Cloning

  The new file system call `clone (...)' makes a space and time
  efficient copy of a constant directory.  The copy has a new device
  number (and possibly a new inode) for all of its subdirectories.
  All corresponding non-directory nodes in the two trees have the same
  device and inode number.  The effect cloning has on link counts of
  non-directories is not specified.

  Note that a clone is always itself a constant tree and is always
  a clone of a constant tree.


* Transaction Devices

  The new file system call `dbase (...)' makes a space and
  time efficient copy of a constant directory, again with a new
  device number.

  Unlike a clone, a transaction copy can be modified in ordinary
  ways by file system calls -- none of the files or directories is
  initially constant.

  Also unlike a clone, in a copy created by `dbase', even
  non-directory nodes have a new device number.

  `conx' works on files and directories on transaction nodes in the
  usual way.

  The new call `relink (dir_fd, name)' replaces the link for a given
  name and directory, where the named node is constant and resides on
  the transaction tree, with a link to an equivalent constant node
  that does not reside on the transaction device.  For a given
  constant node on the transaction device, `relink' always returns the
  same result.

  The new call `revert (dir_fd, name)' replaces the named subtree of a
  transaction tree with either a link (to an original file) or to a
  new clone (to the original directory).


* Expanded `rename' Capabilities

  `rename' can overwrite a link to a constant directory even
  if that directory is not empty.

  `soft_rename' acts like `rename' except that it never overwrites
  an existing link.


* Example Application: Revision Control Whole-file Storage

  The above features have many uses in revision control.  An easy
  example to see is storage management for tree snapshots:

  Suppose that we have a constant tree representing an immediate 
  ancestor revision and want to `commit', creating a new constant
  tree representing a new revision.  We can:

        1. `dbase' (make a transactional copy) of the ancestor
           revision.  This is O(1) in time and space.

        2. In the transactional copy, modify changed files
           and directories.  This performs about the same
           as modifying regular files.

        3. In the transactional copy, `revert' all of the largest
           unmodified subtrees. This is O(1) for each such subtree,
           O(N) for N largest unmodified subtrees.

        4. From leafs towards the root, `relink' all entries
           of each *modified* directory and `conx' that directory.

  When the top-level of the transactional copy is made constant
  by `conx', the result is a constant tree for the new revision,
  sharing unchanged state with the tree for the previous revision.

  `commit' time and space in this system is proportionate to the size
  of the file and directory modifications being committed.


* Example Application: A Persistent MRU Delta Cache

  Storing whole trees optimizes for some important access patterns but
  pessimizes others.  For example, clients sometimes need optimized
  access to arbitrary whole-tree deltas -- diffs between two
  revisions.

  Reasonable performance characteristics can be obtained from a system
  that computes diffs between revisions "on-demand" but also caches
  recently used deltas (perhaps weighted by the expense of recomputing
  those deltas). 

  Using our new features we can represent the cache as a pair: a
  constant directory (for cache readers) and a transactional copy 
  of that constant directory (for cache writers).

  Cache writers update the transaction copy, make it constant, rename
  it to replace the reader's copy of the cache, and create a new
  transaction copy for the next write transaction.

  Cache readers open the root directory of the constant copy of the
  cache and access cache contents relative to that directory.  Thus,
  readers are isolated from concurrent writes.

* Other Applications

  Revision control is hardly the only application.

* Implementation

  I've already hinted at large parts of it (`vudev' and `vumnd' in
  the previous two messages).   With luck and support I'll show how
  to implement these features by Sept.   (support: address@hidden via
  paypal, thanks).

* Project Plans

  I plan to leverage my strengths and avoid my weaknesses.  So:

  With some encouragement, I'll start publishing code snapshots.

  With greater encouragement, I'll publish more code snapshots
  and more detailed descriptions and documentation.

  While I welcome helpful kibbitzing including bug reports I make 
  no promises to respond in any way.  When it gets to the point that
  people want to start using the result in projects of their own,
  someone *else* will have to start maintaining fork for that purpose.

-t







reply via email to

[Prev in Thread] Current Thread [Next in Thread]