The TCP Offload Engine
In high-speed computing, the host processors often become a bottleneck as the network interconnect speed increases, so more CPU cycles are spent in the TCP/IP protocol stack than in the business-critical applications that are running. The growing TCP/IP overhead causes severe performance degradation as network speeds rise, especially with IP storage protocols like iSCSI and NFS. In such cases, the TCP Offload Engine (TOE) relieves the host CPU of the overhead of TCP/IP processing. TOEs allow the operating system to push all TCP/IP-specific traffic to specialized hardware on the network adapter, leaving only the TCP/IP control decisions to the host CPU.

Inherent drawbacks of the standard TCP/IP stack
TCP is a connection-oriented protocol, meaning that the two hosts must establish a session with each other before transferring any data. It also ensures that the receiving host gets all packets in order and without duplication, and it provides multiplexing and flow control. As a result, when an application sends data on the network, several data movement and protocol processing steps occur. The stack also generates a series of interrupts to segment the data into packets and to receive acknowledgements. These and other TCP activities consume critical host resources. Thus, the three main causes of TCP/IP overhead on the host are: CPU interrupt processing, memory copies and protocol processing.
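To make the cost concrete, here is a minimal C sketch of an ordinary TCP send. The address 192.0.2.10:5001 and the 1 MB buffer are placeholders chosen for illustration; the point is that the application makes a single send() call, after which the host stack must copy the data into socket buffers, segment it into roughly 700 MTU-sized packets, compute checksums and service the interrupts and acknowledgements for every one of them.

/* Minimal TCP sender: one send() hands a large buffer to the kernel
 * stack, which then copies it into socket buffers, segments it,
 * computes checksums and handles ACKs, all on the host CPU.
 * The address 192.0.2.10:5001 is only a placeholder. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    struct sockaddr_in peer = {0};
    peer.sin_family = AF_INET;
    peer.sin_port = htons(5001);
    inet_pton(AF_INET, "192.0.2.10", &peer.sin_addr);

    if (connect(fd, (struct sockaddr *)&peer, sizeof(peer)) < 0) {
        perror("connect");
        return 1;
    }

    /* One application-level send of 1 MB; without offload, the stack
     * breaks this into roughly 700 MTU-sized packets and services the
     * resulting interrupts and ACKs on the host CPU. */
    char *buf = calloc(1, 1 << 20);
    ssize_t sent = send(fd, buf, 1 << 20, 0);
    printf("queued %zd bytes for transmission\n", sent);

    free(buf);
    close(fd);
    return 0;
}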
Traditional ways to reduce TCP/IP overhead
The two most popular methods to reduce the CPU overhead that TCP/IP processing incurs are TCP/IP checksum offload and large send offload.
The TCP/IP checksum offload technique moves the calculation of the TCP and IP checksums from the host CPU to the network adapter. This approach, however, gives only a modest reduction in CPU utilization.
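What the adapter takes over here is essentially the 16-bit one's complement sum defined in RFC 1071. The routine below is an illustrative C version of the loop that the host stack would otherwise run over every IP header and TCP segment it transmits or receives.

/* A sketch of the 16-bit one's complement Internet checksum (RFC 1071)
 * that checksum offload moves from the host CPU to the adapter. */
#include <stddef.h>
#include <stdint.h>

uint16_t internet_checksum(const void *data, size_t len)
{
    const uint16_t *p = data;
    uint32_t sum = 0;

    /* Sum the buffer as 16-bit words. */
    while (len > 1) {
        sum += *p++;
        len -= 2;
    }
    /* Fold in a trailing odd byte, if any. */
    if (len == 1)
        sum += *(const uint8_t *)p;

    /* Fold the carries back into the low 16 bits. */
    while (sum >> 16)
        sum = (sum & 0xFFFF) + (sum >> 16);

    return (uint16_t)~sum;
}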
Large Send Offload (LSO), also known as TCP Segmentation Offload (TSO), relieves the host CPU of the overhead of segmenting the application's transmit data into MTU-sized chunks. In this mode, TCP is allowed to hand the network adapter a buffer larger than the MTU. The adapter then divides it into MTU-sized chunks and uses the prototype TCP and IP headers of the send buffer to create the TCP/IP headers for each packet in preparation for transmission. This method, however, is effective only when transmitting large messages. Also, in environments where packets are frequently dropped and connections are lost, connection management and maintenance consume a significant amount of host CPU time.
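Conceptually, the adapter's segmentation step looks something like the C sketch below. The proto_hdrs structure and the emit_packet() hook are purely illustrative stand-ins for the real descriptors and DMA machinery; the point is that each MSS-sized chunk receives a copy of the prototype headers with the sequence number advanced by the chunk's offset.

/* Simplified sketch of LSO/TSO segmentation. Field names and the
 * emit_packet() hook are illustrative, not a real driver API. */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct proto_hdrs {          /* prototype headers supplied by the stack */
    uint32_t seq;            /* starting TCP sequence number            */
    /* remaining IP/TCP fields elided for brevity                       */
};

/* Stub transmit hook for illustration; a real adapter would DMA the
 * packet out and fill in the per-packet checksums in hardware. */
static void emit_packet(const struct proto_hdrs *h,
                        const uint8_t *payload, size_t len)
{
    printf("packet: seq=%u len=%zu\n", (unsigned)h->seq, len);
    (void)payload;
}

void lso_segment(struct proto_hdrs proto, const uint8_t *buf,
                 size_t total_len, size_t mss)
{
    size_t off = 0;
    while (off < total_len) {
        size_t chunk = total_len - off;
        if (chunk > mss)
            chunk = mss;

        struct proto_hdrs h = proto;        /* copy the prototype headers   */
        h.seq = proto.seq + (uint32_t)off;  /* advance the sequence number  */
        emit_packet(&h, buf + off, chunk);  /* hand one MTU-sized packet on */

        off += chunk;
    }
}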
How TOEs reduce TCP overheads
TOEs can delegate all the send and receive related processing to the network adapters, leaving the host server’s CPU more available for business applications.
Using TOEs, the host CPU can process an entire application I/O in one interrupt, unlike the standard TCP/IP stack, which requires a series of interrupts to segment the data and acknowledge the packets.
The TOE model also eliminates the need for multiple copies of messages throughout the stack by using zero-copy algorithms, which move data directly from the NIC buffers into the application's memory without intermediate copies to system buffers.
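On a TOE, this zero-copy path is implemented by the adapter itself, which places received data straight into application buffers. The same idea of skipping the intermediate copy can be seen on any Linux host on the transmit side with sendfile(); the sketch below is not the TOE mechanism itself, only an analogous host-side example that pushes a file out of a connected socket without ever staging it in a user-space buffer.

/* Zero-copy-style transmit using sendfile(): the file contents go from
 * the page cache to the socket without a user-space copy. */
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Send an entire file over an already connected socket with no
 * user-space staging buffer. Returns bytes sent, or -1 on error.
 * A production version would loop, since sendfile() may transmit
 * fewer bytes than requested. */
ssize_t send_file_zero_copy(int sock_fd, const char *path)
{
    int file_fd = open(path, O_RDONLY);
    if (file_fd < 0)
        return -1;

    struct stat st;
    if (fstat(file_fd, &st) < 0) {
        close(file_fd);
        return -1;
    }

    off_t offset = 0;
    ssize_t sent = sendfile(sock_fd, file_fd, &offset, st.st_size);

    close(file_fd);
    return sent;
}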
With respect to the protocol processing problem in the traditional TCP/IP stack, it is a common observation that the faster the network, the more protocol processing the CPU has to perform. Multi-homed hosts with multiple Gigabit Ethernet NICs compound the problem: throughput does not scale linearly when using multiple NICs in the same server, because a single host TCP/IP stack processes all the traffic. TOEs distribute network transaction processing across multiple TOE-enabled network adapters. This provides a huge reduction in end-to-end transmission latency, essentially speeding up application response times.
Implementations of TOEs
A standard implementation of a TOE consists of two components: network adapters that are capable of TCP/IP processing operations, and extensions to the TCP/IP software stack that offload specified operations to the network adapter. Together, these components let the OS move all the TCP/IP traffic to the TOE-enabled firmware.
TOE implementations can be differentiated into two broad categories: processor-based and chip-based. Processor-based methods can use off-the-shelf network adapters that have a built-in processor and memory. A chip-based implementation uses an application-specific integrated circuit (ASIC) that is designed into the network adapter. ASIC-based implementations can offer better performance than off-the-shelf processor-based implementations because they are customized to perform the TCP offload. However, because ASICs are manufactured for a certain set of operations, adding new features may not always be possible.
TOEs can also be differentiated based on the amount of processing that is offloaded: partial and full offloading.