
To: Olaf Titz <olaf,AT,bigred,DOT,inka,DOT,de>
Subject: Re: ciped crashes with kxchg: read(r): Interrupted system call error
From: Michael Fischer <fischer-michael,AT,cs,DOT,yale,DOT,edu>
Date: Wed, 07 Jan 2004 23:30:23 -0500
Cc: cipe-l,AT,inka,DOT,de
In-reply-to: <E1AeJeN-0003pM-00@bigred.inka.de>
References: <20031228123556.2854e613.skraw@ithnet.com> <E1AcspX-00031W-00@bigred.inka.de> <20040104125148.5a021467.skraw@ithnet.com> <E1AdxGK-00057Y-00@bigred.inka.de> <20040106215032.1450805a.skraw@ithnet.com> <E1AeJeN-0003pM-00@bigred.inka.de>

Dear Olaf,

Last month, I posted a patch for the problem of ciped dying after logging the message "kxchg: read(r): Interrupted system call". I've attached a copy to this message so that you will have it on hand when you start sorting the fixes out. While this patch applies to version 1.4.5, the same bug is still present in the current development version from CVS, although the old code has been cleaned up a bit and looks somewhat different now. I've been using the patched code for a month and have not had any further problems with the daemon dying.

I, like others on this mailing list, am very grateful to you for your efforts in building and supporting cipe.

Sincerely,
--Mike

Olaf Titz wrote:

>> Did you incorporate the ignoredf/forcemtu patch (from long ago :-)? It really
>> does work and all mtu-related issues are gone afterwards.
>
> Yes, I did incorporate that as soon as it came up.
>
>> It would be nice to release some 1.5.X stable for 2.2/2.4 kernel use, just to
>> have something like a clear cut. 1.5.4 is not really up-to-date any more.
>
> Let's see if I can sort all the fixes out.
>
> Olaf


--
Message sent by the cipe-l,AT,inka,DOT,de mailing list.
Unsubscribe: mail majordomo,AT,inka,DOT,de, "unsubscribe cipe-l" in body
Other commands available with "help" in body to the same address.
CIPE info and list archive: <URL:http://sites.inka.de/~bigred/devel/cipe.html>



--
==================================================
| Michael Fischer <fischer-michael,AT,cs,DOT,yale,DOT,edu>  |
| Professor of Computer Science                  |
==================================================

--- Begin Message ---
To: cipe-l,AT,inka,DOT,de
Subject: ciped crashes with kxchg: read(r): Interrupted system call error
From: Michael Fischer <fischer-michael,AT,cs,DOT,yale,DOT,edu>
Date: Fri, 05 Dec 2003 22:18:45 -0500
Delivered-to: fischer-michael@cs.yale.edu

I have used cipe for a year or two under Red Hat 8.0 and Red Hat 9 Linux. After a recent upgrade to Fedora Core 1, the cipe client daemon began dying every few days after logging the message "kxchg: read(r): Interrupted system call". Here is the excerpt from the log:

Nov 26 11:50:56 daphne ciped-cb[19769]: kxchg: read(r): Interrupted system call
Nov 26 11:50:56 daphne ciped-cb[19769]: Interface stats  1780808    7285   22    0    0    22       0         0  1347332    6289    0    0    0     0       0          0
Nov 26 11:50:56 daphne ciped-cb[19769]: KX stats: rreq=1, req=643, ind=640, indb=0, ack=640, ackb=0, unknown=0
Nov 26 11:50:57 daphne ciped-cb[19769]: cipcb0: daemon exiting

Fedora ships with cipe-1.4.5-18.i386.rpm, which seems to be a slightly-patched version of cipe 1.4.5.

Somebody mentioned in a news posting that this was probably due to EINTR being returned by read() and not being handled properly. I patched the source code to retry the read() in case of an EINTR error or a successful return of fewer than the requested number of bytes. My patched version has been up for five days now without a crash.

While version 1.4.5 is fairly old on the development tree, this same bug is still present in version 1.5.4.

Here's the patch I'm using:

*** ciped.c.orig        Sun Nov 30 14:44:38 2003
--- ciped.c     Sun Nov 30 14:50:41 2003
***************
*** 807,815 ****
       break;
     case NK_REQ:
       kx_typ=NK_IND;
!       if (read(r, &LM->skey, userKeySize)!=userKeySize) {
!           logerr(LOG_ERR, "kxchg: read(r)");
!           return -1;
       }
       memcpy(kx_nkind_key, LM->skey, userKeySize);
 #ifdef VER_CRC32
--- 807,824 ----
       break;
     case NK_REQ:
       kx_typ=NK_IND;
!       {
!         int n = 0;            /* number of chars read */
!         int ret;
!         do {
!           ret = read(r, &LM->skey[n], userKeySize-n);
!           if ( ret == -1 ) {
!             if ( errno == EINTR ) continue;
!             logerr(LOG_ERR, "kxchg: read(r): %m");
!             return -1;
!           }
!           n += ret;
!         } while ( n < userKeySize );
       }
       memcpy(kx_nkind_key, LM->skey, userKeySize);
 #ifdef VER_CRC32
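
The retry logic in the patch can also be written as a small standalone helper. What follows is only a sketch, not code from ciped: the name read_full and the choice of EIO for a premature end of file are assumptions made for illustration. It retries on EINTR and continues after short reads as the patch does. It also stops when read() returns 0, a case in which the loop above would keep spinning because n never advances.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not part of ciped): read exactly len bytes from fd.
 * Returns 0 on success and -1 on failure with errno set.
 * A premature end of file is reported as EIO, an arbitrary choice
 * made for this sketch. */
static int read_full(int fd, void *buf, size_t len)
{
    unsigned char *p = buf;
    size_t n = 0;               /* number of bytes read so far */

    while (n < len) {
        ssize_t ret = read(fd, p + n, len - n);
        if (ret == -1) {
            if (errno == EINTR)
                continue;       /* interrupted by a signal: retry */
            return -1;          /* real error: errno set by read() */
        }
        if (ret == 0) {
            errno = EIO;        /* end of file before len bytes arrived */
            return -1;
        }
        n += (size_t)ret;       /* short read: keep going */
    }
    return 0;
}
```

With such a helper, the NK_REQ case would reduce to a single call along the lines of read_full(r, LM->skey, userKeySize), followed by the existing logerr() and return -1 on failure.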

I don't know how to explain why the problem only began manifesting itself after the upgrade to Fedora Core 1, but I'm guessing that the newer Linux kernel (2.4.22 rather than 2.4.20) allows more concurrency and greater possibilities for an interrupt during read(). Looking back, ciped did die occasionally even before the upgrade, but it was so rare (only once every couple of months at most) that I never paid much attention to it.

Sincerely,
--Michael Fischer

--
==================================================
| Michael Fischer <fischer-michael,AT,cs,DOT,yale,DOT,edu>  |
| Professor of Computer Science                  |
==================================================

--- End Message ---
