Review of Thomas's >2GB ext2fs proposal

From: Neal H. Walfield
Subject: Review of Thomas's >2GB ext2fs proposal
Date: Mon, 16 Aug 2004 17:28:15 +0100
User-agent: Wanderlust/2.8.1 (Something) SEMI/1.14.3 (Ushinoya) FLIM/1.14.3 (Unebigoryōmae) APEL/10.6 Emacs/21.2 (i386-debian-linux-gnu) MULE/5.0 (SAKAKI)


I would like to take a moment to examine Thomas's proposal for fixing
the 2GB ext2fs limitation and highlight what I view as potential
shortcomings in it as well as some noteworthy parts.  The only email I can
find where he elaborates his plan in any detail is:


Thomas, like Roland, would continue to use a single memory object (or,
rather, array of 4GB memory objects) to span the entire disk.
However, rather than mapping the whole disk as we do now, we would
only map what is required at any given time (and maintain some cache
of mappings).  Unlike Roland, Thomas does not have multiple metadata
pagers.  The metadata pager remains ignorant of its content and no
effort is made to build a simple array out of the inode table or block
list for a given file, etc.  The result is that the file block lookup
logic remains as is (i.e. in the front end and not integrated into the

In order to access metadata, we continue to use accessor functions.
Rather than applying a simple offset into a one-to-one mapping, a
cache of active mappings is consulted.  If a mapped region is found
which contains the page, it is returned.  Otherwise, a region
containing the page is mapped into the address space, stored in the
cache and returned.  Regions may be as small as a page or several
megabytes in size.  This parameter would need tuning based on
empirical data.
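
A minimal sketch of such an accessor, under stated assumptions: the
names (disk_page_ptr, map_region, REGION_PAGES, CACHE_SLOTS) are
hypothetical, the cache is a simple direct-mapped table, and calloc
stands in for mapping part of the disk memory object with vm_map so
that the example runs anywhere.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define PAGE_BYTES    4096u   /* assumed; the real value is vm_page_size */
#define REGION_PAGES  16u     /* tunable region size, in pages */
#define CACHE_SLOTS   64u     /* tunable number of cached mappings */

struct region
{
  int valid;
  unsigned long first_page;   /* first disk page covered by the region */
  char *addr;                 /* where the region is (notionally) mapped */
};

static struct region cache[CACHE_SLOTS];
static unsigned long map_calls;   /* counts simulated mapping operations */

/* Stand-in for mapping REGION_PAGES pages of the disk memory object.
   A real translator would call vm_map here; this sketch just
   allocates zeroed memory. */
static char *
map_region (unsigned long first_page)
{
  (void) first_page;
  map_calls++;
  return calloc (REGION_PAGES, PAGE_BYTES);
}

/* Return the address of disk page PAGE, mapping its region on a miss.
   Returns NULL if the region cannot be mapped. */
char *
disk_page_ptr (unsigned long page)
{
  unsigned long first = page - page % REGION_PAGES;
  struct region *r = &cache[(first / REGION_PAGES) % CACHE_SLOTS];

  assert (first <= page);
  if (!r->valid || r->first_page != first)
    {
      /* Miss: replace whatever occupied the slot. */
      if (r->valid)
        free (r->addr);         /* stands in for unmapping the region */
      r->valid = 0;
      r->addr = map_region (first);
      if (r->addr == NULL)
        return NULL;
      r->first_page = first;
      r->valid = 1;
    }
  return r->addr + (page - first) * PAGE_BYTES;
}
```

Note that this sketch replaces a region on a slot collision without
checking for outstanding users, which is only safe once the accessors
are reference counted.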

As I explained in my mail reviewing Roland's proposal, metadata
represents approximately 3% of the file system.  As such, it is
imperative that we provide a mechanism to drain the mapping cache and
keep it at a reasonable size.  (This is less important on small file
systems even though indirect blocks are part of the main block pool
and given sufficient time, all blocks could have been potentially used
as indirect blocks.  However, indirect blocks can be special cased:
when they are freed, we remove any mapping straight away.)

At the very least, then, we need a list of the mappings.  It is
insufficient, however, to wait until vm_map fails to drain the cache
as the address space is shared with entities (i.e. anything which uses
vm_allocate or vm_map) which are unaware of the cache and would cause
the program to fail miserably on what in reality is a soft error.
Unmapping cannot be haphazard: we cannot release a mapping as long as
there are users which expect the address to refer to specific disk
blocks.  Therefore, we need to reference count the regions, which
means pairing each block accessor with a release function.  This
system is clearly rather fragile, yet I see no simpler alternative.
Happily, however, the locations where this must be done have been
identified by Ogi and that knowledge is easily transferable to any
other system.
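
The accessor/release pairing might look roughly like the following.
This is only a sketch: struct region and the three function names are
hypothetical, and the actual unmapping is reduced to clearing a flag.

```c
#include <assert.h>
#include <stddef.h>

struct region
{
  unsigned long first_block;  /* first disk block covered */
  unsigned int refs;          /* users relying on the address */
  int mapped;                 /* nonzero while the mapping exists */
};

/* Called by every block accessor before it hands out an address
   inside R. */
void
region_ref (struct region *r)
{
  r->refs++;
}

/* Must be paired with each region_ref once the caller no longer
   dereferences the address. */
void
region_release (struct region *r)
{
  assert (r->refs > 0);
  r->refs--;
}

/* Unmap R if and only if nobody holds a reference.  Returns 1 if the
   region was unmapped and 0 if it is still in use. */
int
region_try_unmap (struct region *r)
{
  if (r->refs > 0)
    return 0;
  r->mapped = 0;              /* a real version would unmap here */
  return 1;
}
```

A forgotten region_release pins the mapping forever and an extra one
trips the assertion, which is exactly the fragility noted above.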

We must also consider what the eviction mechanism will look like.
Thomas makes no suggestions of which I am aware.  If we just evict
mappings when the reference count drops to zero, the only mappings
which will profit from the cache are pages being accessed
concurrently.  Although I have done no experiments to suggest that
this is never the case, we need only consider a sequential read of a
file to realize that it is often not the case: a client sends a
request of X blocks to the file system.  The server replies and then,
after the client processes the returned blocks, the client asks for
more.  Clearly, the inode will be consulted again.  This scenario
would have elided a vm_unmap and vm_map had the mapping remained in
the cache.  Given this, I see a strong theoretical motivation to make
cache entries more persistent.

If we make cache entries semi-persistent, a mechanism needs to be in
place to drain the cache when it is full.  The easiest place to
trigger the draining is just before a new mapping is created.  But how
do we find the best region to evict?  The information we are able to
account for is: when a given region is mapped into the address space,
the last time the kernel requested pages from a region, and when the
last references to a region were added.  Except for the last bit, this
information is already in the kernel.  One way we could take advantage
of this is to use the pager_notify_eviction mechanism which Ogi is
using and I described in a previous email [1].  If the kernel does not
have a copy (and there are no extant user references), then the page
likely makes a good eviction candidate.  This data can be augmented by
the number of recent references in conjunction with a standard clock
aging algorithm.  But really, that final bit is unnecessary: once the
kernel has dropped a page, the only way we can get the data back is by
reading it from disk, making an extra vm_unmap and vm_map rather cheap
by comparison.  Strictly following this offers another advantage: the
cache data in the file system remains proportional to the amount of
data cached in the kernel.  This, it seems to me, is a good argument
to keep the region size equal to vm_page_size, as I have in my
proposal.
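
To make the candidate selection concrete, here is a rough sketch that
includes the optional aging step.  All names are hypothetical; in
particular, kernel_has_copy is assumed to be cleared by whatever
handler receives the kernel's eviction notification.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct region
{
  unsigned int refs;      /* extant user references */
  int kernel_has_copy;    /* cleared when the kernel reports eviction */
  int referenced;         /* set on each access since the last tick */
  uint8_t age;            /* aging counter; larger = used more recently */
};

/* Periodic aging step: shift each counter right and record in the
   top bit whether the region was referenced since the last tick. */
void
clock_tick (struct region *r, size_t n)
{
  for (size_t i = 0; i < n; i++)
    {
      r[i].age = (uint8_t) ((r[i].age >> 1) | (r[i].referenced ? 0x80 : 0));
      r[i].referenced = 0;
    }
}

/* Return the index of the best eviction candidate, or -1 if every
   region is either in use or still cached by the kernel. */
long
pick_victim (const struct region *r, size_t n)
{
  long best = -1;

  assert (r != NULL || n == 0);
  for (size_t i = 0; i < n; i++)
    {
      if (r[i].refs > 0 || r[i].kernel_has_copy)
        continue;
      if (best < 0 || r[i].age < r[best].age)
        best = (long) i;
    }
  return best;
}
```

Dropping the age field and clock_tick leaves the simpler policy: evict
any unreferenced region the kernel no longer holds.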

So far, Thomas's proposal is strikingly similar to what I have
suggested [2,3] (or rather, my proposal is strikingly similar to his).
The major difference lies in what we are caching: Thomas has a large
memory object of which small parts are mapped into the address space;
I have a single small memory object of which the entire contents are
mapped into the address space.  Thomas multiplexes the address space
keeping the contents of the memory object fixed; I multiplex the
contents of the memory object keeping the address space fixed.  In my
proposal, the mapping database is only in the task and not also in
Mach.  More concretely, we both require two hashes: Thomas hashes file
system blocks to address space mappings and vice versa; I hash file
system blocks to address space locations and vice versa.  So, Thomas
has lots of small mappings in the address space which are associated
with disk blocks; I track the contents of a single large mapping.  The
advantage in my proposal, I believe, is that it is much easier on Mach
(as far as I understand Mach's internals; I am sure that Thomas has a
much better insight into Mach's machinery and I hope he will confirm
this perceived advantage as either a fantasy or a reality): with
only one mapping, Mach uses less memory.  Since we both require two
hashes anyway, my mapping database then consumes no additional memory.
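
The two lookups in the fixed-address-space variant could be sketched
as follows.  This illustrates the idea rather than the code from
either proposal: the names and sizes are hypothetical, a slot is a
page-sized location in the single mapping (its address would be base +
slot * vm_page_size), and the block-to-slot hash uses chaining through
the slot array.

```c
#include <assert.h>
#include <stddef.h>

#define NSLOTS   8u        /* pages in the single fixed mapping */
#define NBUCKETS 16u       /* buckets in the block -> slot hash */
#define NO_BLOCK (~0ul)
#define NO_SLOT  (-1)

static unsigned long slot_block[NSLOTS];  /* slot -> block */
static int bucket_head[NBUCKETS];         /* hash bucket -> first slot */
static int slot_next[NSLOTS];             /* next slot in the same bucket */

void
cache_init (void)
{
  for (size_t i = 0; i < NSLOTS; i++)
    {
      slot_block[i] = NO_BLOCK;
      slot_next[i] = NO_SLOT;
    }
  for (size_t i = 0; i < NBUCKETS; i++)
    bucket_head[i] = NO_SLOT;
}

/* Which slot currently holds BLOCK?  NO_SLOT if it is not cached. */
int
block_to_slot (unsigned long block)
{
  for (int s = bucket_head[block % NBUCKETS]; s != NO_SLOT; s = slot_next[s])
    if (slot_block[s] == block)
      return s;
  return NO_SLOT;
}

/* Which block does SLOT currently hold?  NO_BLOCK if it is empty. */
unsigned long
slot_to_block (int slot)
{
  return slot_block[slot];
}

/* Remove SLOT from its hash chain, if it holds a block. */
static void
slot_unlink (int slot)
{
  unsigned long block = slot_block[slot];
  int *p;

  if (block == NO_BLOCK)
    return;
  p = &bucket_head[block % NBUCKETS];
  while (*p != slot)
    p = &slot_next[*p];
  *p = slot_next[slot];
  slot_next[slot] = NO_SLOT;
  slot_block[slot] = NO_BLOCK;
}

/* Rebind SLOT to BLOCK: the address stays fixed, the contents change.
   BLOCK must not already be cached in another slot. */
void
slot_assign (int slot, unsigned long block)
{
  slot_unlink (slot);
  assert (block_to_slot (block) == NO_SLOT);
  slot_block[slot] = block;
  slot_next[slot] = bucket_head[block % NBUCKETS];
  bucket_head[block % NBUCKETS] = slot;
}
```

In the other variant the same two tables would map blocks to separate
vm_map regions instead of to slots in one mapping.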

Hopefully, I have given an accurate representation of Thomas's
proposal.  If I have anything wrong, I hope you will point it out so
that we can find the closest approximation of the ideal fix for this


[1] http://lists.gnu.org/archive/html/bug-hurd/2004-08/msg00005.html
[2] http://lists.gnu.org/archive/html/bug-hurd/2002-12/msg00055.html
[3] http://lists.gnu.org/archive/html/bug-hurd/2003-05/msg00024.html
