limitation of `cp -a's hard-link-preserving code


From: Jim Meyering
Subject: limitation of `cp -a's hard-link-preserving code
Date: Sun, 07 Dec 2003 12:28:30 +0100

KAMOSAWA Masao <address@hidden> wrote:
> I'm using a backup tool "pdumpfs", takes and maintains everyday

Thank you for reporting that.
Your situation does highlight a limitation of `cp -a's
hard-link-preserving code.

In order for `cp -a' to preserve hard links, it stores a
triple <source device number, source inode, dest file name> for
each candidate so that if it would copy another file with identical
source device and inode, then it can instead simply create a hard link
to the saved destination file name.
The problem you've noticed is that when there are many candidates
(many files with link count greater than 1), cp must store many
of those triples.
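
To make that concrete, here is a minimal sketch in C of such a
table.  It is an illustration of the idea only; the names are
invented and cp's real code is organized differently:

    #include <stddef.h>      /* size_t, NULL */
    #include <sys/types.h>   /* dev_t, ino_t */

    /* One remembered copy: a source file with link count > 1
       that has already been copied.  */
    struct copied
    {
      dev_t dev;         /* source device number */
      ino_t ino;         /* source inode number */
      char *dest_name;   /* name of the destination copy */
    };

    /* If <dev, ino> has been copied before, return the saved
       destination name so the caller can link(2) to it instead
       of copying the data again; otherwise return NULL.  */
    const char *
    already_copied (struct copied const *tab, size_t n,
                    dev_t dev, ino_t ino)
    {
      for (size_t i = 0; i < n; i++)
        if (tab[i].dev == dev && tab[i].ino == ino)
          return tab[i].dest_name;
      return NULL;
    }

Every entry must stay live until the whole copy finishes, and the
dest_name strings dominate the cost.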

Note that `cp -a' has no problem copying a directory containing
a single file and a million hard links to that file, but it requires
a lot more memory to copy a directory containing a million files
(distinct inodes) where each has a link count of 2.
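
To put rough numbers on it: a million distinct inodes of link count
2 means a million saved triples, each alive until the copy finishes.
At, say, 16 bytes for the device and inode numbers plus a 50-byte
destination name, that is on the order of 60-70 megabytes before
any per-entry allocator and table overhead.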

Knowing that, it's not surprising that `cp -a' ran out of memory
copying your hierarchy.

How could you mitigate the problem?

  1) use a destination path that is as short as possible.
  E.g., rather than your
    cp -a /backup/2003/* /backup2/2003
  do this:
    cd /backup2/2003; cp -a /backup/2003/* .

  2) copy smaller pieces at a time, realizing that some
  files will be copied rather than hard-linked; later, use
  a tool that identifies and links files with identical contents
  (a sketch of such a helper follows this list).
  But even that's not perfect, since it might end up linking files
  that were identical but not linked in the original.

  3) try another tool, like rsync with its --hard-links (-H) option.
  Maybe its implementation uses less memory.

  4) Get more RAM :-)
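
Regarding 2), a tool of the kind described might look like this
minimal C sketch (a hypothetical helper, not an existing program):
it replaces its second argument with a hard link to the first iff
the two files' contents are byte-for-byte identical.

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* Return 1 if files A and B have identical contents, else 0.  */
    static int
    same_contents (const char *a, const char *b)
    {
      FILE *fa = fopen (a, "rb");
      FILE *fb = fopen (b, "rb");
      int same = 0;
      if (fa && fb)
        {
          char ba[8192], bb[8192];
          size_t na, nb;
          same = 1;
          do
            {
              na = fread (ba, 1, sizeof ba, fa);
              nb = fread (bb, 1, sizeof bb, fb);
              if (na != nb || memcmp (ba, bb, na) != 0)
                {
                  same = 0;
                  break;
                }
            }
          while (na == sizeof ba);
        }
      if (fa)
        fclose (fa);
      if (fb)
        fclose (fb);
      return same;
    }

    int
    main (int argc, char **argv)
    {
      if (argc != 3)
        {
          fprintf (stderr, "usage: %s FILE1 FILE2\n", argv[0]);
          return 1;
        }
      /* Replace FILE2 with a hard link to FILE1 when the contents
         match.  Note that this discards FILE2's own attributes and
         would also link files that merely happen to be identical;
         that is the imperfection mentioned in 2) above.  */
      if (same_contents (argv[1], argv[2])
          && unlink (argv[2]) == 0
          && link (argv[1], argv[2]) == 0)
        return 0;
      return 1;
    }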

How might cp handle this situation better?

  1) make the source-dev+ino-to-dest-filename map more compact, e.g.,
  by saving the destination file names more intelligently, eliminating
  duplication in prefix and directory components when there are
  many names.  The encoding used by frcode (described in the
  documentation for locatedb) could reduce the size of the map by
  a factor of 4 or 5.  (A sketch of the idea follows this list.)

  2) use a more memory-efficient map implementation, e.g.,
  Judy arrays: http://judy.sourceforge.net/

  3) find a (safe) way to avoid having to store device numbers,
  since they're almost always the same.  Hmm... this would be easy.
  Add an option to say `I expect all source files to be on the same
  device' (unfortunately this is not the same as -x).  Then require
  that, and don't save device numbers.  (Also sketched after this
  list.)
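
Regarding 1), here is a rough C sketch of the front-coding idea.
It mimics frcode's technique, not its exact byte format: since
consecutive destination names share long directory prefixes, each
name is stored as the length of the prefix it shares with its
predecessor plus the differing suffix.

    #include <stdio.h>
    #include <string.h>

    /* Length of the common prefix of A and B.  */
    static size_t
    shared_prefix (const char *a, const char *b)
    {
      size_t i = 0;
      while (a[i] && a[i] == b[i])
        i++;
      return i;
    }

    int
    main (void)
    {
      const char *names[] = {
        "/backup2/2003/1201/etc/passwd",
        "/backup2/2003/1201/etc/profile",
        "/backup2/2003/1202/etc/passwd",
      };
      char prev[4096] = "";

      /* Emit <shared-length, suffix> pairs; a decoder rebuilds
         each full name from the previous one.  */
      for (size_t i = 0; i < sizeof names / sizeof *names; i++)
        {
          size_t n = shared_prefix (prev, names[i]);
          printf ("%lu\t%s\n", (unsigned long) n, names[i] + n);
          strcpy (prev, names[i]);
        }
      return 0;
    }

(frcode itself encodes the change in the shared length, which
usually fits in one byte.)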
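
And regarding 3), a sketch of the same-device check; the option
name and helper are invented for illustration.  Once the user
promises that all sources live on one device, the map key shrinks
from <dev, ino> to the inode number alone.

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/stat.h>

    static dev_t first_dev;
    static int have_first_dev;

    /* With a hypothetical `--same-device' option in effect, verify
       the promise and return the inode alone as the map key.  */
    static ino_t
    key_for (const char *name, const struct stat *st)
    {
      if (!have_first_dev)
        {
          first_dev = st->st_dev;
          have_first_dev = 1;
        }
      else if (st->st_dev != first_dev)
        {
          fprintf (stderr, "%s: not on the expected device\n", name);
          exit (EXIT_FAILURE);
        }
      return st->st_ino;   /* device number omitted from the key */
    }

    int
    main (int argc, char **argv)
    {
      for (int i = 1; i < argc; i++)
        {
          struct stat st;
          if (stat (argv[i], &st) == 0)
            printf ("%s -> key %lu\n", argv[i],
                    (unsigned long) key_for (argv[i], &st));
        }
      return 0;
    }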

...
> Only the tool I could use was "cp -a", for handling capability
> of hardlinks and it preserves all attributes.
>
> The filesystem contains about 200 days' snapshots of my whole
> system, including a mirrored, indexed "web cache", so thousands
> (how could I get the number?) of files (mostly hardlinks)
> and directory entries are in each day's dir.



