Lost commit bug and a fix

bug-cvs



From:	Rahul Bhargava
Subject:	Lost commit bug and a fix
Date:	Sun, 25 Sep 2005 15:19:55 -0700
User-agent:	Mozilla Thunderbird 1.0.5 (Windows/20050711)

We have run into an issue with the cvs server (we have tried 1.11.17 through 1.12.12; all versions suffer from this) whereby a "committed" transaction can be lost from disk due to a power outage or an unclean shutdown of the machine. The OSes on which we have experienced this were Linux kernels 2.4.21-27 (RHEL3), 2.6.5, and 2.6.9, but the problem can occur on other OSes as well.

When cvs commits a change to the RCS files, it invokes filesubr.c:rename_file(), which in turn invokes the STDIO rename() function. Prior to invoking rename(), rcs.c:rcs_internal_unlockfile() invokes fclose() on the ,<file>, file, which only ensures that changes in the user-space buffers provided by the C library are flushed to kernel buffers. It does not flush kernel buffers to disk. The current cvs code does not invoke fsync() on the ,<file>, file descriptor. Invoking rename(",<file>,", "<file>,v") on Linux and almost all UNIXes only flushes the inode for the target file to disk; it does not guarantee a flush of the kernel buffers allocated for ,<file>,. Depending upon the load on the machine, the Linux kernel's flush daemon may not flush them for a while. In the meantime, the cvs transaction could have been declared committed to the end CVS user (the cvs process has returned the final "OK"). If the machine crashes before the changes are synced to disk, the committed transaction can be lost.

In production environments we have seen this happen several times. The CVS server-side process needs to guarantee not just atomicity via rename but also durability of the transaction. The solution is to fsync(outfile fd) prior to rename(). The MTA/sendmail community seems to be aware of this issue. For example, we looked at the source code of sendmail-8.13.5/sendmail/queue.c and confirmed that they use fsync and rename pairs to guarantee changes to the files are written to disk prior to returning OK to the client. In the database community we ensure we first write to a durable transaction log before declaring victory on the given transaction that we are committing. Without fsyncing, it's like playing Russian roulette with the cvs commit. Most of the time it will work, but as we found in production, every now and then changes can be lost.
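The fsync-before-rename ordering described above can be sketched outside of CVS as a small standalone C function. This is only an illustration under stated assumptions: durable_replace() and its parameters are hypothetical names, not anything in the cvs source, and the sketch simply follows the sequence fflush(), fsync(), fclose(), rename() on a POSIX system.

```c
#define _POSIX_C_SOURCE 200809L  /* for fileno() and fsync() */
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper, not part of CVS: write `data` to tmp_path, force it
   to disk, then rename it over final_path.
   Returns 0 on success, -1 on failure. */
int durable_replace(const char *tmp_path, const char *final_path,
                    const char *data)
{
    assert(tmp_path != NULL && final_path != NULL && data != NULL);

    FILE *fp = fopen(tmp_path, "w");
    if (fp == NULL)
        return -1;

    size_t len = strlen(data);
    if (fwrite(data, 1, len, fp) != len)
        goto fail;

    /* Step 1: move data from the C library's user-space buffers
       into kernel buffers. */
    if (fflush(fp) != 0)
        goto fail;

    /* Step 2: ask the kernel to write its buffers for this file to disk. */
    if (fsync(fileno(fp)) < 0)
        goto fail;

    if (fclose(fp) != 0) {
        remove(tmp_path);
        return -1;
    }

    /* Step 3: only after the data has been synced, swap the new file
       into place. */
    if (rename(tmp_path, final_path) != 0) {
        remove(tmp_path);
        return -1;
    }
    return 0;

fail:
    /* Clean up the partially written temporary file. */
    fclose(fp);
    remove(tmp_path);
    return -1;
}
```

A caller would report success to the client only after durable_replace() returns 0, which mirrors the point that "OK" should not be sent before the data has been synced.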

Note that a journaling file system like ext3 doesn't help, as without fsync/fdatasync calls the journal may not record an event for data changes. Our production environments were all using ext3 on Linux when they ran into the lost commit problem.

The fix to rcs.c:

[ccvs/src]$ cvs diff -w -rcvs1-12-12  rcs.c
Index: rcs.c
===================================================================
RCS file: /cvs/ccvs/src/rcs.c,v
retrieving revision 1.345
diff -w -r1.345 rcs.c
8450a8451,8458
> /* Rahul: start fix, fsync the file to disk else rename can cause data loss */
>     if (fflush(fp) != 0)
>         error (1, errno, "error flushing file %s to kernel buffers", rcs_lockfile);
> /* Now that we have xfered to kernel buffers, lets call fsync to get to disk */
>     if (fsync(rcs_lockfd) < 0)
>         error (1, errno, "error fsyncing file %s", rcs_lockfile);
> /* Rahul: end fix */
>

--
Rahul Bhargava,
CTO, WANdisco
(650) 242-8352
Mountain View, CA
http://www.wandisco.com
