bug-tar

tar is creating corrupt archives when soft links are present


From: Timothe Litt
Subject: tar is creating corrupt archives when soft links are present
Date: Thu, 1 Dec 2022 09:25:01 -0500
User-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101 Thunderbird/102.5.0

tar has multiple bad behaviors when creating an archive, resulting in an archive that incorrectly represents soft links.  Restoring from such an archive results in an unusable system.

These are serious problems - I've just spent 4 days reconstructing disks from faulty backups exhibiting these issues.  I was lucky that it was ONLY 4 days; there were >30,000 files affected.  I currently have NO reliable backups.

I have come up with a fairly small reproducer.   Key observations: 

Here is the subject data; /bin on a fairly old system.  I'm showing just the links to keep this small.  Note: NO hard links.

# ls -l /bin | grep '^[lh]'
lrwxrwxrwx 1 root root       4 Nov 28 08:45 awk -> gawk
lrwxrwxrwx 1 root root      21 Nov 28 08:45 bash -> ../usr/local/bin/bash
lrwxrwxrwx 1 root root       4 Nov 28 08:45 csh -> tcsh
lrwxrwxrwx 1 root root       8 Nov 28 08:45 dnsdomainname -> hostname
lrwxrwxrwx 1 root root       8 Nov 28 08:45 domainname -> hostname
lrwxrwxrwx 1 root root       4 Nov 28 08:45 egrep -> grep
lrwxrwxrwx 1 root root       2 Nov 28 08:45 ex -> vi
lrwxrwxrwx 1 root root       4 Nov 28 08:45 fgrep -> grep
lrwxrwxrwx 1 root root       3 Nov 28 08:45 gtar -> tar
lrwxrwxrwx 1 root root       4 Nov 28 08:45 mailx -> mail
lrwxrwxrwx 1 root root       8 Nov 28 08:45 nisdomainname -> hostname
lrwxrwxrwx 1 root root      13 Nov 28 08:45 perl -> /usr/bin/perl
lrwxrwxrwx 1 root root       2 Nov 28 08:45 rvi -> vi
lrwxrwxrwx 1 root root       2 Nov 28 08:45 rview -> vi
lrwxrwxrwx 1 root root       4 Nov 28 08:45 sh -> bash
lrwxrwxrwx 1 root root      10 Nov 28 08:45 traceroute6 -> traceroute
lrwxrwxrwx 1 root root      10 Nov 28 08:45 tracert -> traceroute
lrwxrwxrwx 1 root root       2 Nov 28 08:45 view -> vi
lrwxrwxrwx 1 root root       8 Nov 28 08:45 ypdomainname -> hostname
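
A side note on the listing above: as far as I know, ls -l has no 'h' file type character (that letter is a tar -tv listing convention), so the grep pattern can only ever match soft links.  In ls output, hard-linked files show up as ordinary files with a link count above 1.  A minimal sketch of a cross-check follows.  It runs in a scratch directory with stand-in names (gunzip, zcat, awk, plain are placeholders), not on the real /bin:

```shell
# Scratch directory with stand-in names; an illustration, not the real /bin.
dir=$(mktemp -d)
echo data > "$dir/gunzip"
ln "$dir/gunzip" "$dir/zcat"      # hard link: a second name for the same inode
ln -s gawk "$dir/awk"             # soft link (dangling here, which is fine)
echo other > "$dir/plain"         # ordinary file with link count 1

# Regular files that have more than one name, i.e. hard-linked files.
hard=$(find "$dir" -type f -links +1 | sort)
# Soft links only.
soft=$(find "$dir" -type l)

echo "hard-linked names:"; echo "$hard"
echo "soft links:";        echo "$soft"
rm -rf "$dir"
```

On a real system the equivalent check would be something like find /bin -type f -links +1, which lists every regular file that has additional names.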

It shouldn't matter, but FWIW the filesystem is ext3.

Here's what happens with tar 1.34, which is the current release on ftp.gnu.org.  I create an archive (xz is invoked explicitly so that the test can be run the same way with the older version).

Note that 'bin/*' is the key to global merging; ('cd / && tar -cf - bin') will not fail in the same way, as shown later.

Note that in the following example, all the links are converted to hard links in addition to being misdirected.  It's actually more common for most of the soft links to remain soft links, but all pointing to the first soft link target encountered.  (Think libc.so => vi...)  I don't have a small reproducer for the latter.

Also, note that the output order differs.  My guess is that tar is processing soft links as if they were hard links, and caching in order to merge names linked to a common inode.  The directory is not changing; /bin is stable.

# /usr/local/bin/tar --version | head -n1
tar (GNU tar) 1.34

# ( cd / && /usr/local/bin/tar  -cf - bin/* | xz --stdout >/root/test.1.34.tar.xz )

# tar -tvf /root/test.1.34.tar.xz | grep -- ' -> \| link to '
lrwxrwxrwx root/root         0 2022-11-29 14:37 bin/awk -> gawk
hrwxrwxrwx root/root         0 2022-11-29 14:37 bin/bash link to bin/awk
hrwxrwxrwx root/root         0 2022-11-29 14:37 bin/csh link to bin/awk
hrwxrwxrwx root/root         0 2022-11-29 14:37 bin/dnsdomainname link to bin/awk
hrwxrwxrwx root/root         0 2022-11-29 14:37 bin/domainname link to bin/awk
hrwxrwxrwx root/root         0 2022-11-29 14:37 bin/egrep link to bin/awk
hrwxrwxrwx root/root         0 2022-11-29 14:37 bin/ex link to bin/awk
hrwxrwxrwx root/root         0 2022-11-29 14:37 bin/fgrep link to bin/awk
hrwxrwxrwx root/root         0 2022-11-29 14:37 bin/gtar link to bin/awk
hrwxr-xr-x root/root         0 2006-10-01 16:22 bin/gzip link to bin/gunzip
hrwxrwxrwx root/root         0 2022-11-29 14:37 bin/mailx link to bin/awk
hrwxrwxrwx root/root         0 2022-11-29 14:37 bin/nisdomainname link to bin/awk
hrwxrwxrwx root/root         0 2022-11-29 14:37 bin/perl link to bin/awk
hrwxr-xr-x root/root         0 2007-01-18 06:59 bin/red link to bin/ed
hrwxrwxrwx root/root         0 2022-11-29 14:37 bin/rvi link to bin/awk
hrwxrwxrwx root/root         0 2022-11-29 14:37 bin/rview link to bin/awk
hrwxrwxrwx root/root         0 2022-11-29 14:37 bin/sh link to bin/awk
hrwxrwxrwx root/root         0 2022-11-29 14:37 bin/traceroute6 link to bin/awk
hrwxrwxrwx root/root         0 2022-11-29 14:37 bin/tracert link to bin/awk
hrwxrwxrwx root/root         0 2022-11-29 14:37 bin/view link to bin/awk
hrwxrwxrwx root/root         0 2022-11-29 14:37 bin/ypdomainname link to bin/awk
hrwxr-xr-x root/root         0 2006-10-01 16:22 bin/zcat link to bin/gunzip
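
One way to probe the guess above (that soft links are being treated as names for a common inode) is to look at the inode numbers reported for the links themselves.  On a healthy filesystem each soft link should have its own inode; if several of them reported the same number, merging by inode would be consistent with the listing.  A sketch under that assumption, using a scratch directory and stand-in link names:

```shell
# Scratch directory with stand-in soft links; not the real /bin.
dir=$(mktemp -d)
ln -s gawk "$dir/awk"
ln -s tcsh "$dir/csh"
ln -s vi   "$dir/ex"

# ls -i prints the inode number of each directory entry.  For a soft link this
# is the link's own inode, not the target's, because ls does not follow links
# inside a directory listing unless -L is given.
inodes=$(ls -1i "$dir" | awk '{print $1}')

total=$(echo "$inodes" | wc -l | tr -d ' ')
unique=$(echo "$inodes" | sort -u | wc -l | tr -d ' ')
echo "links: $total, distinct inodes: $unique"
rm -rf "$dir"
```

If the two counts differ when run against the affected directory, that would point at the filesystem (or whatever sits between it and tar) rather than at tar alone.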

Here is a run without the wildcard.  The merging does not occur, but tar still converts some soft links to hard links, which is not semantically equivalent.  (E.g., consider b soft linked to a.  Update a; b gets the new version.  Convert b to a hard link and update a: now a is the new version, and b is the old.)  I have flagged the hard links with !! hard so they stand out from the clutter.

# ( cd / && /usr/local/bin/tar  -cf - bin | xz --stdout >/root/test.1.34.tar.xz )

# tar -tvf /root/test.1.34.tar.xz | grep -- ' -> \| link to '
lrwxrwxrwx root/root         0 2022-11-29 14:37 bin/nisdomainname -> hostname
lrwxrwxrwx root/root         0 2022-11-29 14:37 bin/traceroute6 -> traceroute
lrwxrwxrwx root/root         0 2022-11-29 14:37 bin/dnsdomainname -> hostname
lrwxrwxrwx root/root         0 2022-11-29 14:37 bin/bash -> ../usr/local/bin/bash
lrwxrwxrwx root/root         0 2022-11-29 14:37 bin/gtar -> tar
lrwxrwxrwx root/root         0 2022-11-29 14:37 bin/awk -> gawk
lrwxrwxrwx root/root         0 2022-11-29 14:37 bin/sh -> bash
lrwxrwxrwx root/root         0 2022-11-29 14:37 bin/ex -> vi
lrwxrwxrwx root/root         0 2022-11-29 14:37 bin/fgrep -> grep
lrwxrwxrwx root/root         0 2022-11-29 14:37 bin/csh -> tcsh
lrwxrwxrwx root/root         0 2022-11-29 14:37 bin/rview -> vi
lrwxrwxrwx root/root         0 2022-11-29 14:37 bin/egrep -> grep
lrwxrwxrwx root/root         0 2022-11-29 14:37 bin/mailx -> mail
lrwxrwxrwx root/root         0 2022-11-29 14:37 bin/view -> vi
lrwxrwxrwx root/root         0 2022-11-29 14:37 bin/tracert -> traceroute
hrwxr-xr-x root/root         0 2006-10-01 16:22 bin/gzip link to bin/zcat  !! hard
hrwxr-xr-x root/root         0 2007-01-18 06:59 bin/ed link to bin/red     !! hard
lrwxrwxrwx root/root         0 2022-11-29 14:37 bin/rvi -> vi
lrwxrwxrwx root/root         0 2022-11-29 14:37 bin/ypdomainname -> hostname
lrwxrwxrwx root/root         0 2022-11-29 14:37 bin/domainname -> hostname
hrwxr-xr-x root/root         0 2006-10-01 16:22 bin/gunzip link to bin/zcat  !! hard
lrwxrwxrwx root/root         0 2022-11-29 14:37 bin/perl -> /usr/bin/perl
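
The semantic difference described above (b linked to a, then a is updated) can be reproduced in a scratch directory.  This is a minimal sketch; the file names are hypothetical, and it assumes the update replaces a by writing a new file and renaming it into place, which is how installers commonly do it:

```shell
# Scratch directory; nothing here touches real system files.
dir=$(mktemp -d)
cd "$dir" || exit 1

echo "old" > a
ln -s a b_soft      # soft link: stores the name "a"
ln a b_hard         # hard link: a second name for a's current inode

# Update a by writing a new file and renaming it over the old one.
echo "new" > a.tmp
mv a.tmp a

soft_content=$(cat b_soft)   # resolves the name "a", so it sees the new file
hard_content=$(cat b_hard)   # still refers to the old inode
echo "b_soft: $soft_content"
echo "b_hard: $hard_content"
```

After the rename, the soft link tracks the new a while the hard link keeps the old contents, which is why the two kinds of link are not interchangeable in a backup.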

tar 1.15.1 exhibits the link conversion, but not the merging.  Here is a sample:

# /bin/tar --version
tar (GNU tar) 1.15.1

# ( cd /bin && /bin/tar  -cf - * | xz --stdout >/root/test.1.15.1.tar.xz )
[root@overkill:~]# tar -tvf /root/test.1.15.1.tar.xz | grep -- ' -> \| link to '
lrwxrwxrwx root/root         0 2022-11-28 08:45 awk -> gawk
lrwxrwxrwx root/root         0 2022-11-28 08:45 bash -> ../usr/local/bin/bash
lrwxrwxrwx root/root         0 2022-11-28 08:45 csh -> tcsh
lrwxrwxrwx root/root         0 2022-11-28 08:45 dnsdomainname -> hostname
lrwxrwxrwx root/root         0 2022-11-28 08:45 domainname -> hostname
lrwxrwxrwx root/root         0 2022-11-28 08:45 egrep -> grep
lrwxrwxrwx root/root         0 2022-11-28 08:45 ex -> vi
lrwxrwxrwx root/root         0 2022-11-28 08:45 fgrep -> grep
lrwxrwxrwx root/root         0 2022-11-28 08:45 gtar -> tar
hrwxr-xr-x root/root         0 2006-10-01 16:22 gzip link to gunzip    !! hard
lrwxrwxrwx root/root         0 2022-11-28 08:45 mailx -> mail
lrwxrwxrwx root/root         0 2022-11-28 08:45 nisdomainname -> hostname
lrwxrwxrwx root/root         0 2022-11-28 08:45 perl -> /usr/bin/perl
hrwxr-xr-x root/root         0 2007-01-18 06:59 red link to ed         !! hard
lrwxrwxrwx root/root         0 2022-11-28 08:45 rvi -> vi
lrwxrwxrwx root/root         0 2022-11-28 08:45 rview -> vi
lrwxrwxrwx root/root         0 2022-11-28 08:45 sh -> bash
lrwxrwxrwx root/root         0 2022-11-28 08:45 traceroute6 -> traceroute
lrwxrwxrwx root/root         0 2022-11-28 08:45 tracert -> traceroute
lrwxrwxrwx root/root         0 2022-11-28 08:45 view -> vi
lrwxrwxrwx root/root         0 2022-11-28 08:45 ypdomainname -> hostname
hrwxr-xr-x root/root         0 2006-10-01 16:22 zcat link to gunzip    !! hard

Without the wildcard, links remain distinct, but different files are selected for bogus conversion to hard links.

# ( cd / && /bin/tar  -cf - bin | xz --stdout >/root/test.1.15.1.tar.xz )
[root@overkill:~]# tar -tvf /root/test.1.15.1.tar.xz | grep -- ' -> \| link to '
lrwxrwxrwx root/root         0 2022-11-28 08:45 bin/nisdomainname -> hostname
lrwxrwxrwx root/root         0 2022-11-28 08:45 bin/traceroute6 -> traceroute
lrwxrwxrwx root/root         0 2022-11-28 08:45 bin/dnsdomainname -> hostname
lrwxrwxrwx root/root         0 2022-11-28 08:45 bin/bash -> ../usr/local/bin/bash
lrwxrwxrwx root/root         0 2022-11-28 08:45 bin/gtar -> tar
lrwxrwxrwx root/root         0 2022-11-28 08:45 bin/awk -> gawk
lrwxrwxrwx root/root         0 2022-11-28 08:45 bin/sh -> bash
lrwxrwxrwx root/root         0 2022-11-28 08:45 bin/ex -> vi
lrwxrwxrwx root/root         0 2022-11-28 08:45 bin/fgrep -> grep
lrwxrwxrwx root/root         0 2022-11-28 08:45 bin/csh -> tcsh
lrwxrwxrwx root/root         0 2022-11-28 08:45 bin/rview -> vi
lrwxrwxrwx root/root         0 2022-11-28 08:45 bin/egrep -> grep
lrwxrwxrwx root/root         0 2022-11-28 08:45 bin/mailx -> mail
lrwxrwxrwx root/root         0 2022-11-28 08:45 bin/view -> vi
lrwxrwxrwx root/root         0 2022-11-28 08:45 bin/tracert -> traceroute
hrwxr-xr-x root/root         0 2006-10-01 16:22 bin/gzip link to bin/zcat   !! hard
hrwxr-xr-x root/root         0 2007-01-18 06:59 bin/ed link to bin/red      !! hard
lrwxrwxrwx root/root         0 2022-11-28 08:45 bin/rvi -> vi
lrwxrwxrwx root/root         0 2022-11-28 08:45 bin/ypdomainname -> hostname
lrwxrwxrwx root/root         0 2022-11-28 08:45 bin/domainname -> hostname
hrwxr-xr-x root/root         0 2006-10-01 16:22 bin/gunzip link to bin/zcat !! hard
lrwxrwxrwx root/root         0 2022-11-28 08:45 bin/perl -> /usr/bin/perl
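
For anyone wanting to check their own backups, a rough sanity test is to compare the number of soft links on disk with the number of 'l' entries in the archive listing (the listings in this report show soft links as lines starting with 'l' and hard links as lines starting with 'h').  The sketch below uses a scratch tree with stand-in names; test.tar and nawk are hypothetical, and a matching count is only a necessary condition, not proof that every link target is right:

```shell
# Scratch tree with two soft links to the same target; names are stand-ins.
dir=$(mktemp -d)
mkdir "$dir/bin"
echo x > "$dir/bin/gawk"
ln -s gawk "$dir/bin/awk"
ln -s gawk "$dir/bin/nawk"

# Archive the directory by name, as in the non-wildcard runs above.
( cd "$dir" && tar -cf test.tar bin )

# Count soft links on disk and soft-link entries in the archive listing.
on_disk=$(find "$dir/bin" -type l | wc -l | tr -d ' ')
in_archive=$(tar -tvf "$dir/test.tar" | grep -c '^l')
echo "soft links on disk: $on_disk, in archive: $in_archive"
rm -rf "$dir"
```

A mismatch between the two numbers would indicate that some soft links were stored as something else.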

If you are wondering why many links have recent dates - that's an artifact of recovering the correct links after restoring from a corrupt archive.


Timothe Litt
ACM Distinguished Engineer
--------------------------
This communication may not represent the ACM or my employer's views,
if any, on the matters discussed. 
