limitation of `cp -a's hard-link-preserving code
From: Jim Meyering
Subject: limitation of `cp -a's hard-link-preserving code
Date: Sun, 07 Dec 2003 12:28:30 +0100
KAMOSAWA Masao <address@hidden> wrote:
> I'm using a backup tool "pdumpfs", takes and maintains everyday
Thank you for reporting that.
Your situation does highlight a limitation of `cp -a's
hard-link-preserving code.
In order for `cp -a' to preserve hard links, it stores a
triple <source device number, source inode, dest file name> for
each candidate so that if it would copy another file with identical
source device and inode, then it can instead simply create a hard link
to the saved destination file name.
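The bookkeeping described above can be sketched in Python (this is an illustrative sketch, not GNU cp's actual C implementation; the function name is made up):

```python
import os
import shutil

def copy_preserving_links(src_root, dst_root):
    """Copy a tree, recreating hard links among regular files.

    Keeps a map from (source device, source inode) to the first
    destination name, mirroring the triples described above.
    """
    seen = {}  # (st_dev, st_ino) -> destination file name
    for dirpath, dirnames, filenames in os.walk(src_root):
        rel = os.path.relpath(dirpath, src_root)
        dst_dir = dst_root if rel == '.' else os.path.join(dst_root, rel)
        os.makedirs(dst_dir, exist_ok=True)
        for name in filenames:
            src = os.path.join(dirpath, name)
            dst = os.path.join(dst_dir, name)
            st = os.lstat(src)
            key = (st.st_dev, st.st_ino)
            if st.st_nlink > 1 and key in seen:
                os.link(seen[key], dst)   # same inode seen before: link
            else:
                shutil.copy2(src, dst)    # first sighting: real copy
                if st.st_nlink > 1:
                    seen[key] = dst       # remember candidates only
```

Note that `seen` grows by one entry (plus one stored destination name) per distinct multi-linked inode, which is exactly where the memory goes.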
The problem you've noticed is that when there are many candidates
(many files with link count greater than 1) then cp must store
many of those triples.
Note that `cp -a' has no problem copying a directory containing
a single file and a million hard links to that file, but it requires
a lot more memory to copy a directory containing a million files
(distinct inodes) where each has a link count of 2.
Knowing that, it's not surprising that `cp -a' ran out of memory
copying your hierarchy.
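To make that asymmetry concrete, here is a toy count of the map entries that bookkeeping of this kind would keep (pure arithmetic, not cp itself):

```python
def entries_needed(files):
    """Count map entries needed for cp-style hard-link bookkeeping.

    `files` is a sequence of (dev, ino, nlink) tuples, one per
    directory entry.  Only the first sighting of a multi-linked
    inode costs a map entry.
    """
    seen = set()
    for dev, ino, nlink in files:
        if nlink > 1:
            seen.add((dev, ino))
    return len(seen)

# One inode with a million names: a single map entry.
one_file = [(1, 42, 1_000_000)] * 1_000_000
# A million distinct inodes, each with link count 2: a million entries.
many_files = [(1, ino, 2) for ino in range(1_000_000)]
```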
How could you mitigate the problem?
1) use a destination path that is as short as possible.
E.g., rather than your
cp -a /backup/2003/* /backup2/2003
do this:
cd /backup2/2003; cp -a /backup/2003/* .
2) copy smaller pieces at a time, realizing that some
files will be copied rather than hard-linked; later, use
a tool that identifies and links files with identical contents.
But even that's not perfect, since it might end up linking files
that were identical but not linked in the original.
3) try another tool, like rsync with its --hard-links (-H) option.
Maybe its implementation uses less memory.
4) Get more RAM :-)
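The cleanup pass from mitigation 2 might look like this hash-based sketch (real tools of this kind do more checking, e.g. of permissions and timestamps; the function name is made up):

```python
import hashlib
import os

def link_identical_files(paths):
    """Replace files with identical contents by hard links.

    Caveat from mitigation 2: this links any files whose contents
    are equal, even if they were distinct inodes in the original.
    """
    by_digest = {}  # content digest -> first path seen with it
    for path in paths:
        with open(path, 'rb') as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        first = by_digest.setdefault(digest, path)
        if first != path and os.lstat(first).st_ino != os.lstat(path).st_ino:
            os.unlink(path)
            os.link(first, path)  # swap the copy for a hard link
```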
How might cp handle this situation better?
1) make the source-dev+ino--to--dest-filename map more compact, e.g.
by saving the destination file names more intelligently, eliminating
duplication in prefix and directory components when there are
many names. The encoding used by frcode (described in documentation
for locatedb) could reduce the size of the map by a factor of 4 or 5.
2) use a more memory-efficient map implementation, e.g.,
http://judy.sourceforge.net/
3) find a (safe) way to avoid having to store device numbers,
since they're almost always the same. Hmm... this would be easy.
Add an option to say `I expect all source files to be on the same
device' (unfortunately this is not the same as -x). Then require
that, and don't save device numbers.
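The frcode-style front coding mentioned in improvement 1 stores, for each name in a sorted list, only the length of the prefix it shares with its predecessor plus the differing suffix. A minimal sketch (frcode itself stores the prefix length as a delta and uses a binary format; this uses absolute lengths for clarity):

```python
import os

def front_encode(sorted_names):
    """Front-code a sorted list of strings: each entry is
    (shared-prefix length with predecessor, remaining suffix)."""
    out = []
    prev = ''
    for name in sorted_names:
        common = len(os.path.commonprefix([prev, name]))
        out.append((common, name[common:]))
        prev = name
    return out

def front_decode(encoded):
    """Reverse front_encode, rebuilding each full name from the
    previous one."""
    names = []
    prev = ''
    for common, suffix in encoded:
        prev = prev[:common] + suffix
        names.append(prev)
    return names
```

Since backup snapshots contain long runs of names sharing deep directory prefixes, the stored suffixes are typically a small fraction of the full names.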
...
> The only tool I could use was "cp -a", for its ability to
> handle hardlinks while preserving all attributes.
>
> The filesystem contains about 200 days' snapshots of my whole
> system, including a mirrored, indexed "web cache", so thousands
> (how could I get the number?) of files (mostly hardlinks)
> and directory entries are in each day's dir.