24 Jun 2016
Intro
In this post I’ll take a look at the flow of a network packet through a current Linux kernel, from wire to socket. My tools will be my preferred editor and a fairly recent kernel source tree; the 4.6 branch will be fine. Docs from the net will help a lot, and I’ll list references at the end of the post.
NAPI theory
[todo]
Look @ source
In this post I’ll use this network card as a sample:
- Realtek rtl8169 (in `drivers/net/ethernet/realtek/r8169.c`)
Look at the new `module_pci_driver()` macro that wraps the `module_init()` and `module_exit()` calls, in `drivers/net/ethernet/realtek/r8169.c`:
The `rtl8169_pci_driver` struct defines the functions used by the kernel to init/destroy the PCI device:
I’m interested in the `.probe` function, which the kernel uses to initialize the device. For Realtek, the `.probe` function is `rtl_init_one()`.
rtl_init_one()
This function does most of the work of device initialization.
Let’s start by highlighting the function `netif_napi_add()`: it initializes a core struct of the NAPI system, `struct napi_struct` (`tp->napi`). Here is the struct:
This struct contains device-specific parameters that are fundamental later on to consume packets: there is an instance of `struct napi_struct` for each device ring queue (and so one for each interrupt), and each instance contains a poll function (`rtl8169_poll`) that will be responsible for processing incoming packets.
`netif_napi_add()` registers a poll function inside the `struct napi_struct` and initializes the `poll_list` and a weight value (remember these, they will be important later).
Continuing the analysis of `rtl_init_one()`, I find the `struct net_device_ops`: this struct contains the device operations, callback functions invoked after some action on the device (setup, MAC address change, etc.). In `rtl_init_one()` it is initialized and all the device operations are registered:
Now, when the device is activated (using `ifconfig dev up`), the `.ndo_open` callback (`rtl_open()`) is called.
rtl_open()
The `.ndo_open` callback function is interesting:
- Create Rx ring buffer
The Rx ring buffer is the queue that stores network packets received from the wire. Packets are written to it directly by the NIC using DMA. If packets arrive from the network faster than they are processed, the queue fills up, and once it is full, new packets are dropped.
So after this init we have:
- 256 network packet buffers (each 16383 bytes)
- an Rx ring buffer (RxDescArray) with 256 slots, each slot pointing to the physical address of a packet buffer
- an array of 256 pointers to the packet buffers (virtual addresses)
| RxDescArray | | packet buffer | | Rx_databuff |
|---|---|---|---|---|
| slot 0 | -- PhysAddr --> | 16383 bytes | <-- VirtAddr -- | array 0 |
| slot 1 | -- PhysAddr --> | 16383 bytes | <-- VirtAddr -- | array 1 |
| … | | … | | … |
| slot 255 | -- PhysAddr --> | 16383 bytes | <-- VirtAddr -- | array 255 |
- Register interrupt handler
The `request_irq()` function registers the IRQ handler `rtl8169_interrupt`, using the IRQ number obtained earlier from the system.
If MSI interrupts are available they are used, falling back to a legacy shared interrupt line (`IRQF_SHARED`) if not.
MSI interrupts are better, especially on multi-ring-queue devices, where each ring queue can have its own IRQ assigned and be handled by a specific CPU (using irqbalance or smp_affinity).
Here is the code for the interrupt handler registration:
- Enable NAPI
Enable the NAPI subsystem: it simply clears a bit in the `state` member of `struct napi_struct`.
- Enable interrupts
Finally, enable interrupts on the device. From now on, incoming packets start to be received. Here is the code:
- Start NAPI queue
This code starts the NAPI queue:
Incoming packets
IRQ handler
When a packet arrives, if the receive ring buffer is not full, the packet is written to the ring buffer using DMA, then an IRQ is fired and the IRQ handler is called.
The IRQ handler is very simple and fast, because other interrupts are blocked while it runs.
The packet processing is not executed in this context: as we’ll see, it is scheduled for execution by a softirq.
The IRQ handler simply disables further NAPI IRQs and schedules the execution (`napi_schedule`, in `net/core/dev.c`):
`napi_schedule()` retrieves the `struct softnet_data` associated with the current CPU, adds the `napi_struct` associated with the IRQ to the `softnet_data` linked list, and raises the softirq `NET_RX_SOFTIRQ`.
These actions are the core of the NAPI system: the softirq will cycle over the `struct softnet_data` and grab every `napi_struct` it finds on that list.
It is important to notice that the IRQ handler wakes up the NAPI softirq on the same CPU that ran the IRQ handler.