On 02/15/2010 01:11 PM, Eric Dumazet wrote:
On Monday 15 February 2010 at 12:41 -0800, David Daney wrote:
On 02/15/2010 12:27 PM, Eric Dumazet wrote:
On Monday 15 February 2010 at 12:13 -0800, David Daney wrote:
If we wait for the once-per-second cleanup to free transmit SKBs,
sockets with small transmit buffer sizes might spend most of their
time blocked waiting for the cleanup.
Normally we do a cleanup for each transmitted packet. We add a
watchdog type timer so that we also schedule a timeout for 150uS after
a packet is transmitted. The watchdog is reset for each transmitted
packet, so for high packet rates, it never expires. At these high
rates, the cleanups are done for each packet so the extra watchdog
initiated cleanups are not needed.
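
The re-arming behaviour described above can be illustrated with a small model. This is only a sketch, not the driver's code: the function name watchdog_expirations and the packet timings are made up for illustration, and the only figure taken from the description is the 150 us timeout.

```python
WATCHDOG_US = 150  # timeout quoted in the patch description


def watchdog_expirations(tx_times_us, timeout_us=WATCHDOG_US):
    """Return the times (in us) at which the modelled watchdog expires.

    Every transmit re-arms the timer to fire timeout_us later, so the
    timer only expires when no further packet arrives before the deadline.
    """
    fired = []
    deadline = None
    for t in sorted(tx_times_us):
        if deadline is not None and t >= deadline:
            fired.append(deadline)   # gap was long enough: timer expired
        deadline = t + timeout_us    # each transmit re-arms the timer
    if deadline is not None:
        fired.append(deadline)       # nothing re-arms it after the last packet
    return fired


# Hypothetical high rate: one packet every 10 us for 1 ms.
fast = watchdog_expirations(range(0, 1000, 10))
# Hypothetical low rate: one packet every 500 us.
slow = watchdog_expirations(range(0, 2000, 500))
print(len(fast), len(slow))
```

In this model the fast sender sees a single expiry, after its last packet, while the slow sender sees one expiry per packet, which matches the claim that the watchdog only matters at low packet rates.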
or perhaps s/are not needed/are neither needed nor fired/
Hmm, but re-arming a timer for each transmitted packet must have a cost?
The cost is fairly low (less than 10 processor clock cycles). We didn't
add this for amusement; people actually do things like only send UDP
packets from userspace. Since we can fill the transmit queue faster
than it is emptied, the socket transmit buffer is quickly consumed. If
we don't free the SKBs in short order, the transmitting process gets to
take a long sleep (until our previous once per second clean up task was
I understand this, but traditionally, NIC drivers don't use a timer, but a
'TX complete' interrupt, that usually fires a few us after packet
submission at Gigabit speed.
Indeed. Lacking this type of interrupt, the watchdog seemed the best
short term solution.
I am investigating the possibility of feeding TX complete notifications
back through the RX path where it is possible to generate interrupts.
The drawback to this is that it takes a lot more CPU cycles as well as
added cache pressure.
A fast program could try to send X small UDP packets in less than 150
us, X being greater than the size of your TX ring.
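
To put a rough number on this scenario, here is a minimal sketch of the submission rate it would require. It assumes the 1000-entry TX queue and the 150 us watchdog timeout quoted in this thread; the helper name min_rate_to_overrun is made up for illustration, and the model ignores any packets the hardware drains during the window.

```python
QUEUE_LEN = 1000    # TX queue depth quoted in this thread
WATCHDOG_US = 150   # watchdog timeout from the patch description


def min_rate_to_overrun(queue_len=QUEUE_LEN, window_us=WATCHDOG_US):
    """Packets per second needed to submit more than queue_len packets
    within window_us microseconds (simplified: no draining modelled)."""
    return (queue_len + 1) * 1_000_000 / window_us


rate_pps = min_rate_to_overrun()
print(f"{rate_pps / 1e6:.2f} Mpps")
```

Under these assumptions the sender would need to sustain several million packets per second from userspace before X exceeds the queue size inside one watchdog period.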
My TX queue (it is not a ring) size can be made arbitrarily large
(currently 1000). 64 bytes * 1000 packets * 10 bits/byte / 1e9
bits/sec == 640uS. My watchdog will fire after less than 1/4 of the
queue capacity is freed.
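
The arithmetic above can be checked with a short sketch. It assumes a 1 Gbit/s link (the rate at which the quoted figures come out to 640 us) and 10 bits on the wire per byte, as used in the message; the variable names are illustrative only.

```python
PACKET_BYTES = 64
QUEUE_LEN = 1000
WIRE_BITS_PER_BYTE = 10       # per-byte figure used in the message
LINK_BPS = 1_000_000_000      # assumption: 1 Gbit/s port
WATCHDOG_US = 150             # watchdog timeout from the patch description

# Time to transmit a full queue of minimum-size packets, in microseconds.
queue_bits = PACKET_BYTES * QUEUE_LEN * WIRE_BITS_PER_BYTE
drain_us = queue_bits * 1_000_000 / LINK_BPS

# Share of the queue that has gone out on the wire when the watchdog fires.
fraction_sent = WATCHDOG_US / drain_us

print(drain_us, fraction_sent)
```

With these inputs the full queue takes 640 us to drain and the watchdog fires when a little under a quarter of it has been sent.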
So your patch makes the window smaller, but it still is there (at
physical layer, we'll see a burst of packets, a ~100us delay, then a
With this patch, there will be no burstiness using default socket buffer
sizes and packets of arbitrary size on a standard 1gig port.
On the 10gig ports there is the possibility for burstiness as you aptly
explain. However, in practice it would be difficult to arrange things
to achieve sufficiently high packet rates, so we can live with it like this.