
To: "Mark Smith" <mark.smith,AT,avcosystems,DOT,co,DOT,uk>
Subject: RE: CIPE 1.5.4 / NAT / iptables issue
From: "Dick St.Peters" <stpeters,AT,NetHeaven,DOT,com>
Date: Fri, 11 Jul 2003 14:18:32 -0400
Cc: <cipe-l,AT,inka,DOT,de>
In-reply-to: <001301c347ce$1e744fc0$d100010a@lyta>
References: <16142.55310.676075.458946@saint.heaven.net><001301c347ce$1e744fc0$d100010a@lyta>

When PMTUD is broken, it's nearly always broken at the
server end, and there's nothing you can do to fix it.  You
just have to work around it.

What happens is this:
client == routerC1 -- routerC2 {internet} routerS == server
The routerC1-routerC2 link has a reduced MTU, but the client
and server don't know that.  The client asks for something
from the server, something large enough to take multiple
packets.  The server starts sending big packets with the DF
bit set.  The DF bit means "If this packet won't fit, don't
fragment it, instead notify me with an ICMP."  When the
packet reaches routerC2 it won't fit, so routerC2 dutifully
drops it and sends back an ICMP.  However, the server site
has its ingress router routerS configured to block ICMPs, so
the server doesn't get the notification it requested.

The server also gets no ACK from the client, because the
client never got the big packet.  After a while with no ACK,
the server tries again, another ICMP is sent back and is
blocked again.
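
A rough sketch of that retry loop may make the failure easier to see.
It is a toy model only: the function names, the 1400-byte link MTU,
and the retry limit are made-up assumptions, not taken from any real
TCP stack.

```python
# Toy model of one DF-flagged packet crossing a reduced-MTU link.
# All names and numbers are illustrative assumptions.

def send_with_df(packet_size, link_mtu, icmp_blocked):
    """Return (delivered, server_learns_mtu) for one DF-flagged packet."""
    if packet_size <= link_mtu:
        return True, False          # packet fits and is forwarded
    # Too big and DF is set: the router drops the packet and sends an
    # ICMP back toward the server.  The server only sees it if nothing
    # on the way filters ICMP.
    return False, not icmp_blocked


def transfer(packet_size, link_mtu, icmp_blocked, max_tries=5):
    """Simulate retransmissions; return the try that got through, or None."""
    for attempt in range(1, max_tries + 1):
        delivered, learned = send_with_df(packet_size, link_mtu, icmp_blocked)
        if delivered:
            return attempt
        if learned:
            packet_size = link_mtu  # server shrinks its packets to fit
    return None                     # every retry was dropped: the stall


print(transfer(1500, 1400, icmp_blocked=False))  # 2
print(transfer(1500, 1400, icmp_blocked=True))   # None
```

With the ICMP getting through, the second try succeeds.  With it
blocked, the server keeps resending the same oversized packet.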

In other words, the server site configuration causes the
problem, but it only affects clients behind reduced-MTU
links (usually VPN tunnels or PPPoE-on-DSL connections).
Typically the server site admin can't be convinced anything
is wrong at his end, because most clients can talk to the
server.  He thinks anything affecting only some clients must
be something wrong with those clients.

The only time such clients get a chance to tell the server
to use a reduced packet size is at the very beginning of a
TCP connection.  The initial SYN from the client tells the
server what the client thinks the MSS is, and the SYN/ACK
from the server has the server's view.  tcpdump will print
what the values in these packets are.  Usually they're both
1460, the 1500-byte MTU of ethernet minus the 40 bytes of
TCP/IP overhead.
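
The arithmetic behind those numbers, as a minimal sketch.  It assumes
IPv4 and TCP headers with no options (20 bytes each); the function
name is just for illustration.

```python
IP_HEADER = 20   # bytes, IPv4 header without options (assumption)
TCP_HEADER = 20  # bytes, TCP header without options (assumption)


def mss_for_mtu(mtu):
    """Largest TCP payload that fits in one packet of size mtu."""
    return mtu - IP_HEADER - TCP_HEADER


client_mss = mss_for_mtu(1500)  # what the client puts in its SYN
server_mss = mss_for_mtu(1500)  # what the server puts in its SYN/ACK
print(client_mss, server_mss)   # 1460 1460

# The server sizes its segments from the MSS the client advertised, so
# its full-sized packets are 1500 bytes no matter how small a link in
# the middle of the path may be.
print(client_mss + IP_HEADER + TCP_HEADER)  # 1500
```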

However, if you set the "mss" to 1460 with the route
command, the MSS in the SYN will be 1420.  I'd be interested
in a tcpdump trace of what goes on with your
--clamp-mss-to-pmtu, to see if it really sets the MSS
correctly.  The name suggests that it's equally wrong,
because setting the MSS to the MTU won't work; it leaves no
room for the tcp/ip overhead.  (My impression is that the
MSS vs. MTU confusion is actually in the kernel and is
spreading from there through the Linux community.  It can't
be everywhere in the kernel though; if it were, Linux
networking wouldn't work.)
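
To make the distinction concrete, here is a sketch of what a correct
clamp has to compute, next to the mistake of setting the MSS equal to
the MTU.  It is not a description of what iptables or the kernel
actually does; the 1400-byte path MTU and the helper names are
assumptions for illustration.

```python
TCPIP_OVERHEAD = 40  # bytes: 20 IPv4 + 20 TCP, assuming no options


def clamp_mss(advertised_mss, path_mtu):
    """Lower the MSS in a SYN so full-sized segments fit the path MTU."""
    return min(advertised_mss, path_mtu - TCPIP_OVERHEAD)


def largest_packet(mss):
    """Size of a full-sized TCP/IP packet for a given MSS."""
    return mss + TCPIP_OVERHEAD


path_mtu = 1400                   # hypothetical tunnel MTU
good = clamp_mss(1460, path_mtu)  # leaves room for the headers
bad = min(1460, path_mtu)         # MSS set equal to the MTU

print(good, largest_packet(good) <= path_mtu)  # 1360 True
print(bad, largest_packet(bad) <= path_mtu)    # 1400 False
```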

Dick St.Peters, stpeters,AT,NetHeaven,DOT,com 

> From: "Mark Smith" <mark.smith,AT,avcosystems,DOT,co,DOT,uk>
> Sender: owner-cipe-l,AT,inka,DOT,de
> To: <cipe-l,AT,inka,DOT,de>
> Subject: RE: CIPE 1.5.4 / NAT / iptables issue
> Date: Fri, 11 Jul 2003 18:01:48 +0100
> That bit I understand, and the version of iptables I'm using lets me modify
> the real MSS:
> iptables -A FORWARD -i eth0 -o cipcb0 -p tcp -m tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
> However, this obviously will only work when it is Linux on the end of the
> tunnel.  I have the same problem with a Win32 client.  What I'm after is why
> PMTUD isn't working when it should be.  I'd rather fix that than fudge the
> MSS, as above.
> --
> Mark Smith - Avco Systems Ltd
> email: mark.smith,AT,avcosystems,DOT,co,DOT,uk
> Tel: +44 (0)1784 430996 Fax: +44 (0)1784 431078
> > -----Original Message-----
> > From: owner-cipe-l,AT,inka,DOT,de [mailto:owner-cipe-l,AT,inka,DOT,de] On Behalf Of
> > Dick St.Peters
> > Sent: 11 July 2003 16:30
> > To: Mark Smith
> > Cc: cipe-l,AT,inka,DOT,de
> > Subject: RE: CIPE 1.5.4 / NAT / iptables issue
> >
> >
> > Mark Smith writes:
> > > I run NAT over CIPE 1.5.4 on 2.4.18 without any apparent problem apart
> > > from large packets and PMTUD.  I'm still hoping that the person that
> > > suggested using iptables to clamp mss to pmtu can provide more
> > > information either where they found that out, or if they know, why it
> > > works and what it changes.
> >
> > I don't think I'm the person who suggested that, but I can
> > provide some explanation.  At the beginning of a TCP
> > session, each end tells the other the largest packet size it
> > can send or receive.  To see what this implies, it helps to
> > consider network diagrams, beginning with a trivial one, two
> > hosts A and B directly connected by a network link:
> >         A <-link-> B
> > Both A and B know the largest packet the link will carry
> > because they are connected to it.  This is true even if the
> > link is a virtual link, such as a CIPE tunnel.
> >
> > Now advance to a more complex diagram:
> >         A <-link 1-> X <-link 2-> Y <-link 3-> B
> > If link 2 is a virtual link, its maximum packet size will be
> > reduced by the tunnel overhead and will be smaller than for
> > links 1 and 3.  Normally neither A nor B will know that.  A
> > will only know the largest packet link 1 can carry.  B will
> > only know the largest packet link 3 can carry.
> >
> > However, A and B have to learn about link 2's smaller
> > maximum size to talk efficiently.  One way for A to learn
> > the link 2 size is for the owner/user of A to clamp A's
> > maximum packet size to the maximum size for link 2.  Then
> > when A tells B the maximum size A can send/receive, it will
> > give the clamped size, not the link 1 maximum size.
> >
> > There's no way to clamp the size with iptables that I know
> > of, but you can do it with the Linux route command.
> > However, there's an error in the implementation: the "mss"
> > route parameter actually sets the MTU, not the MSS.
> > Clamping the real MSS to the path MTU would be wrong, but
> > clamping what the route command calls mss to the path MTU is
> > correct.
> >
> > (MTU is the maximum packet size, MSS is the maximum TCP
> > payload size - i.e., MTU minus TCP/IP overhead.)
> >
> > --
> > Dick St.Peters, stpeters,AT,NetHeaven,DOT,com
> >
> > --
> > Message sent by the cipe-l,AT,inka,DOT,de mailing list.
> > Unsubscribe: mail majordomo,AT,inka,DOT,de, "unsubscribe cipe-l" in body
> > Other commands available with "help" in body to the same address.
> > CIPE info and list archive:
> > <URL:http://sites.inka.de/~bigred/devel/cipe.html>
> >
