
To: cipe-l,AT,inka,DOT,de
Subject: ciped crashes with kxchg: read(r): Interrupted system call error
From: Michael Fischer <fischer-michael,AT,cs,DOT,yale,DOT,edu>
Date: Fri, 05 Dec 2003 22:18:45 -0500

I have used cipe for a year or two under Red Hat Linux 8.0 and 9. After a recent upgrade to Fedora Core 1, the cipe client daemon began dying every few days after logging the message "kxchg: read(r): Interrupted system call". Here is the excerpt from the log:

Nov 26 11:50:56 daphne ciped-cb[19769]: kxchg: read(r): Interrupted system call
Nov 26 11:50:56 daphne ciped-cb[19769]: Interface stats  1780808    7285   22    0    0    22       0         0  1347332    6289    0    0    0     0       0          0
Nov 26 11:50:56 daphne ciped-cb[19769]: KX stats: rreq=1, req=643, ind=640, indb=0, ack=640, ackb=0, unknown=0
Nov 26 11:50:57 daphne ciped-cb[19769]: cipcb0: daemon exiting

Fedora ships with cipe-1.4.5-18.i386.rpm, which seems to be a slightly patched version of cipe 1.4.5.

Somebody mentioned in a news posting that this was probably due to EINTR being returned by read() and not being handled properly. I patched the source code to retry the read() in case of an EINTR error or a successful return of fewer than the requested number of bytes. My patched version has been up for five days now without a crash.

While version 1.4.5 is fairly old on the development tree, this same bug is still present in version 1.5.4.

Here's the patch I'm using:

*** ciped.c.orig        Sun Nov 30 14:44:38 2003
--- ciped.c     Sun Nov 30 14:50:41 2003
***************
*** 807,815 ****
       break;
     case NK_REQ:
       kx_typ=NK_IND;
!       if (read(r, &LM->skey, userKeySize)!=userKeySize) {
!           logerr(LOG_ERR, "kxchg: read(r)");
!           return -1;
       }
       memcpy(kx_nkind_key, LM->skey, userKeySize);
 #ifdef VER_CRC32
--- 807,824 ----
       break;
     case NK_REQ:
       kx_typ=NK_IND;
!       {
!         int n = 0;            /* number of chars read */
!         int ret;
!         do {
!           ret = read(r, &LM->skey[n], userKeySize-n);
!           if ( ret == -1 ) {
!             if ( errno == EINTR ) continue;
!             logerr(LOG_ERR, "kxchg: read(r): %m");
!             return -1;
!           }
!           n += ret;
!         } while ( n < userKeySize );
       }
       memcpy(kx_nkind_key, LM->skey, userKeySize);
 #ifdef VER_CRC32
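
The same retry logic can also be factored into a standalone helper. What follows is a sketch, not part of the patch above and not code from cipe; read_full is a hypothetical name, and the choice of EIO for the end-of-file case is arbitrary. Unlike the loop in the patch, it treats a return value of 0 (end of file) as an error, so it cannot spin forever if the other end closes before all the bytes have arrived.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: read exactly len bytes from fd into buf.
 * Retries when read() is interrupted (EINTR) and when it returns
 * fewer bytes than requested.
 * Returns 0 on success, -1 on a real error or premature end of file.
 */
int read_full(int fd, void *buf, size_t len)
{
    char *p = buf;
    size_t n = 0;                    /* number of bytes read so far */

    while (n < len) {
        ssize_t ret = read(fd, p + n, len - n);
        if (ret == -1) {
            if (errno == EINTR)
                continue;            /* interrupted by a signal: retry */
            return -1;               /* real error, errno set by read() */
        }
        if (ret == 0) {
            errno = EIO;             /* EOF before len bytes; arbitrary errno for this sketch */
            return -1;
        }
        n += (size_t)ret;
    }
    return 0;
}
```

With such a helper, the call site in the NK_REQ case would reduce to a single test of read_full(r, LM->skey, userKeySize) against -1, assuming LM->skey is a byte array as the patch implies.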

I can't explain why the problem only began showing up after the upgrade to Fedora Core 1, but my guess is that the newer Linux kernel (2.4.22 rather than 2.4.20) allows more concurrency and therefore more opportunities for read() to be interrupted. Looking back, ciped did die occasionally even before the upgrade, but so rarely (once every couple of months at most) that I never paid much attention to it.

Sincerely,
--Michael Fischer

--
==================================================
| Michael Fischer <fischer-michael,AT,cs,DOT,yale,DOT,edu>  |
| Professor of Computer Science                  |
==================================================

