{{Short description|Network packet distribution with multiple cores}}
[[Network packet]] steering of transmitted and received traffic for [[Multi-core_processor|multi-core architectures]] is needed in modern network computing environments, especially in [[Data_center|data centers]], where high bandwidth and heavy loads would easily congest a single core's [[Queueing theory|queue]].<ref name="RSS++">{{Cite book |last1=Barbette |first1=Tom |last2=Katsikas |first2=Georgios P. |last3=Maguire |first3=Gerald Q. |last4=Kostić |first4=Dejan |chapter=RSS++: Load and state-aware receive side scaling |date=2019-12-03 |title=Proceedings of the 15th International Conference on Emerging Networking Experiments and Technologies |chapter-url=https://dl.acm.org/doi/10.1145/3359989.3365412 |pages=318–333 |___location=New York, NY, USA |publisher=Association for Computing Machinery |doi=10.1145/3359989.3365412 |hdl=2078.1/262641 |isbn=978-1-4503-6998-5 }}</ref>
[[File:Simple NIC and cores architecture.png|thumb|upright=1.7|Diagram showing the path incoming packets travel to reach the cores' queues]]
For this reason, many techniques, both in hardware and in software, are leveraged to distribute the incoming packet load across the cores of the [[Central processing unit|processor]].
On the traffic-receiving side, the most notable techniques are RSS, aRFS, RPS and RFS; for transmission, the main technique is XPS.<ref name="General intro">{{Citation |last=Madden |first=Michael M. |title=Challenges Using the Linux Network Stack for Real-Time Communication |date=2019-01-06 |work=AIAA Scitech 2019 Forum |url=https://arc.aiaa.org/doi/10.2514/6.2019-0503 |access-date=2025-07-10 |series=AIAA SciTech Forum |publisher=American Institute of Aeronautics and Astronautics |doi=10.2514/6.2019-0503 |isbn=978-1-62410-578-4 |url-access=subscription }}</ref><ref>{{Cite web |last=Herbert |first=Tom |date=2025-02-24 |title=The alphabet soup of receive packet steering: RSS, RPS, RFS, and aRFS |url=https://medium.com/@tom_84912/the-alphabet-soup-of-receive-packet-steering-rss-rps-rfs-and-arfs-c84347156d68 |access-date=2025-07-10 |website=Medium |language=en}}</ref><br>
As shown in the adjacent figure, packets coming into the [[Network_interface_controller|network interface card (NIC)]] are processed and loaded into the receiving queues managed by the cores (which are usually implemented as [[Circular buffer|ring buffers]] within the [[User space and kernel space|kernel space]]).
The main objective is to leverage all the cores available within the [[Central processing unit|CPU]] to process incoming packets, while also improving performance metrics like [[Latency (engineering)|latency]] and [[Network throughput|throughput]].<ref name="RSS kernel linux docs">{{Cite web|title=RSS kernel linux docs|url=https://www.kernel.org/doc/html/v5.1/networking/scaling.html#rss-receive-side-scaling|access-date=2025-07-08|website=kernel.org|publisher=The Linux Kernel documentation|language=en-US}}</ref><ref name="RSS overview by microsoft">{{Cite web|title=RSS overview by microsoft|url=https://learn.microsoft.com/en-us/windows-hardware/drivers/network/introduction-to-receive-side-scaling|access-date=2025-07-08|website=learn.microsoft.com|language=en-US}}</ref><ref name="RSS++" /><ref>{{Cite journal |last1=Wu |first1=Wenji |last2=DeMar |first2=Phil |last3=Crawford |first3=Matt |date=2011-02-01 |title=Why Can Some Advanced Ethernet NICs Cause Packet Reordering? |url=https://ieeexplore.ieee.org/document/5673999/ |journal=IEEE Communications Letters |volume=15 |issue=2 |pages=253–255 |doi=10.1109/LCOMM.2011.122010.102022 |arxiv=1106.0443 |bibcode=2011IComL..15..253W |issn=1558-2558}}</ref>
 
== Hardware techniques ==
Hardware-accelerated techniques like RSS and aRFS are used to route and load balance incoming [[Network_packet|packets]] across the multiple cores' queues of a processor.<ref name="RSS++" /><br>
These hardware-supported methods achieve extremely low latencies and reduce the load on the CPU compared to the software-based ones. However, they require specialized hardware integrated within the [[Network_interface_controller|network interface controller]], which is usually available on more advanced cards, like the [[Data_processing_unit|SmartNIC]].<ref name="RSS++" /><ref name="aRFS by redhat">{{Cite web|title=aRFS by redhat|url=https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/6/html/performance_tuning_guide/network-acc-rfs|access-date=2025-07-08|website=docs.redhat.com|publisher=Red Hat Documentation|language=en-US}}</ref><ref name="aRFS by nvidea">{{Cite web|title=aRFS by nvidea|url=https://docs.nvidia.com/networking/display/mlnxofedv23070512/flow+steering#src-2396583156_safe-id-Rmxvd1N0ZWVyaW5nLUFjY2VsZXJhdGVkUmVjZWl2ZUZsb3dTdGVlcmluZyhhUkZTKQ|access-date=2025-07-08|website=docs.nvidia.com|publisher=NVIDIA Documentation Hub|language=en-US}}</ref>
 
=== RSS ===
[[File:RSS architecture.png|upright=1.7|thumb|Simple view of the receive side scaling architecture]]
Receive Side Scaling (RSS) is a hardware-supported technique that leverages an [[indirection|indirection table]] indexed by the last bits of the result of a [[hash function]], which takes as input the [[Header (computing)|header fields]] of each packet.
The hash function input is usually customizable, and the header fields used can vary between use cases and implementations.
Some notable examples of header fields chosen as keys for the hash are the [[Internet Protocol|layer 3 IP]] source and destination addresses, the protocol and the [[transport layer|layer 4]] source and destination ports.
In this way, packets belonging to the same flow are directed to the same receiving queue, preserving the original order and avoiding [[Out-of-order delivery|out-of-order delivery]]. Moreover, all incoming flows are [[Load balancing (computing)|load balanced]] across the available cores thanks to the hash function's properties.<ref name="RSS overview by microsoft" /><br>
Another important feature introduced by the indirection table is the capability of changing the mapping of flows to cores without changing the hash function, by simply updating the table entries.<ref>{{Cite web|title=RSS intel doc|url=https://www.intel.com/content/dam/support/us/en/documents/network/sb/318483001us2.pdf|access-date=2025-07-08|website=intel.com|language=en-US}}</ref><ref name="RSS overview by microsoft" /><ref>{{Cite web|title=RSS by redhat|url=https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/6/html/performance_tuning_guide/network-rss|access-date=2025-07-08|website=docs.redhat.com|publisher=Red Hat Documentation|language=en-US}}</ref><ref name="RSS kernel linux docs" /><ref>{{Cite web|title=Receive-side scaling enhancements in windows server 2008|url=https://download.microsoft.com/download/a/d/f/adf1347d-08dc-41a4-9084-623b1194d4b2/rss_server2008.docx|access-date=2025-07-08|website=microsoft.com|publisher=Microsoft|language=en-US}}</ref>
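The hash-plus-indirection-table mechanism can be illustrated with a minimal Python sketch. All names and sizes here are made up for illustration, and CRC-32 stands in for the Toeplitz hash typically used by real RSS hardware:

```python
# Illustrative RSS-style steering (not a real driver): a flow hash indexes
# an indirection table whose entries name the target receive queue.
import zlib

NUM_QUEUES = 4
TABLE_SIZE = 128  # indirection table, indexed by the low bits of the hash

# Initially spread the table entries round-robin across the queues.
indirection_table = [i % NUM_QUEUES for i in range(TABLE_SIZE)]

def flow_hash(src_ip, dst_ip, proto, src_port, dst_port):
    """Hash the header fields identifying a flow (CRC-32 as a stand-in)."""
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    return zlib.crc32(key)

def steer(src_ip, dst_ip, proto, src_port, dst_port):
    """Pick the receive queue: the low bits of the hash index the table."""
    h = flow_hash(src_ip, dst_ip, proto, src_port, dst_port)
    return indirection_table[h % TABLE_SIZE]

# Packets of the same flow always land on the same queue...
q = steer("10.0.0.1", "10.0.0.2", "tcp", 40000, 443)
assert q == steer("10.0.0.1", "10.0.0.2", "tcp", 40000, 443)

# ...and a flow can be remapped without touching the hash function,
# by rewriting the table entry it hashes to.
idx = flow_hash("10.0.0.1", "10.0.0.2", "tcp", 40000, 443) % TABLE_SIZE
indirection_table[idx] = (q + 1) % NUM_QUEUES
assert steer("10.0.0.1", "10.0.0.2", "tcp", 40000, 443) != q
```

The final two lines show the remapping property described above: only a table entry changes, never the hash function itself.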
 
=== aRFS ===
[[File:ARFS architecture.png|upright=1.7|thumb|Simple view of the accelerated receive flow steering architecture]]
Accelerated Receive Flow Steering (aRFS) is another hardware-supported technique, born with the idea of leveraging [[Locality of reference|cache locality]] to improve performance by routing incoming packet flows to specific cores.
Unlike RSS, which is implemented entirely in hardware, aRFS needs to interface with the software (the [[Kernel (operating system)|kernel]]) to function properly.<ref name="aRFS kernel linux docs">{{Cite web|title=aRFS kernel linux docs|url=https://www.kernel.org/doc/html/v5.1/networking/scaling.html#accelerated-rfs|access-date=2025-07-08|website=kernel.org|publisher=The Linux Kernel documentation|language=en-US}}</ref><br>
RSS simply load balances incoming traffic across the cores; however, if a packet flow is directed to ''core i'' (as a result of the hash function) while the application consuming the received packets is running on ''core j'', many cache misses could be avoided by simply forcing ''i=j'', so that packets are received exactly where they are needed and consumed.<ref name="aRFS by nvidea" /><br>
To do this, aRFS does not steer packets directly from the result of the hash function; instead, it uses a configurable routing table (which can be filled and updated, for instance, by the [[Scheduling (computing)|scheduler]] through an [[API]]) so that packet flows can be steered to the specific consuming core.<ref name="aRFS by nvidea" /><ref name="aRFS by redhat" /><ref name="aRFS kernel linux docs">{{Cite web|title=aRFS kernel linux docs|url=https://www.kernel.org/doc/html/v5.1/networking/scaling.html#accelerated-rfs|access-date=2025-07-08|website=kernel.org|publisher=The Linux Kernel documentation|language=en-US}}</ref>
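The interaction between the steering table and the hash-based fallback can be sketched as follows. This is a hypothetical Python illustration (the function names and the dictionary-based table are made up; in real hardware the table lives on the NIC and is programmed by the kernel):

```python
# Illustrative aRFS-style steering: a flow table, filled by the scheduler,
# overrides the default hash-based queue choice; unknown flows fall back
# to RSS-style hashing.
import zlib

NUM_CORES = 4
flow_table = {}  # flow hash -> core running the consuming application

def flow_hash(*fields):
    return zlib.crc32("|".join(map(str, fields)).encode())

def on_app_scheduled(flow_fields, core):
    """Scheduler hook: the consumer of this flow now runs on `core`."""
    flow_table[flow_hash(*flow_fields)] = core

def steer(*flow_fields):
    h = flow_hash(*flow_fields)
    # Steer to the consumer's core if known, else RSS-style fallback.
    return flow_table.get(h, h % NUM_CORES)

flow = ("10.0.0.1", "10.0.0.2", "tcp", 40000, 443)
on_app_scheduled(flow, 2)
assert steer(*flow) == 2  # packets now arrive where they are consumed
```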
 
== Software techniques ==
Software techniques like RPS and RFS employ one of the CPU cores to steer incoming packets across the other cores of the processor. This comes at the cost of introducing additional [[Inter-processor interrupt|inter-processor interrupts (IPIs)]]; however, the number of hardware interrupts will not increase and, by employing an [[Interrupt coalescing|interrupt aggregation]] technique, it could even be reduced.<ref name="RPS kernel linux docs">{{Cite web|title=RPS kernel linux docs|url=https://www.kernel.org/doc/html/v5.1/networking/scaling.html#rps-receive-packet-steering|access-date=2025-07-08|website=kernel.org|publisher=The Linux Kernel documentation|language=en-US}}</ref><br>
The benefit of software solutions is the ease of implementation: no component of the existing architecture (like the [[Network_interface_controller|NIC]]) needs to be changed, as deploying the proper [[Loadable kernel module|kernel module]] suffices. This can be crucial where the server machine cannot be customized or accessed (like in [[Cloud computing#Infrastructure as a service (IaaS)|cloud computing]] environments), even if network performance is reduced compared to the hardware-supported techniques.<ref name="RPS linux news (LWM)">{{Cite web|last1=Corbet |first1=Jonathan |title=RPS linux news (LWM)|url=https://lwn.net/Articles/362339/|access-date=2025-07-08|website=lwn.net|date=17 November 2009 |publisher=Linux Weekly News|language=en-US}}</ref><ref name="RPS by redhat">{{Cite web|title=RPS by redhat|url=https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/6/html/performance_tuning_guide/network-rps|access-date=2025-07-08|website=docs.redhat.com|publisher=Red Hat Documentation|language=en-US}}</ref><ref name="RFS by nvidea" />
 
=== RPS ===
[[File:RPS logic.png|upright=1.7|thumb|Diagram showing how RPS load balances incoming packets across the CPU cores]]
Receive Packet Steering (RPS) is the software counterpart of RSS. All packets received by the NIC are load balanced between the cores' queues by a hash function computed, in the same fashion as RSS, over configurable header fields (like the layer 3 source and destination IP addresses and the layer 4 source and destination ports).
Moreover, thanks to the hash function's properties, packets belonging to the same flow will always be steered to the same core.<ref name="RPS by redhat" /><br>
This is usually done in the kernel, right above the NIC driver: once the network interrupt has been handled, and before the packet is processed, the packet is sent to the receiving queue of a core, which is then notified through an inter-processor interrupt.<ref name="RPS linux news (LWM)" /><br>
RPS can be used in conjunction with RSS when the number of queues managed by the hardware is lower than the number of cores. In this case, after RSS has distributed the incoming packets across the hardware queues, a pool of cores can be assigned to each queue and RPS will spread the incoming flows across the corresponding pool.<ref name="RPS linux news (LWM)" /><ref name="RPS by redhat" /><ref name="RPS kernel linux docs" />
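The two-level combination of RSS and RPS described above can be sketched in Python. All names, pool assignments and the use of differently seeded CRC-32 hashes are illustrative, not taken from the kernel implementation:

```python
# Illustrative RSS+RPS layering when hardware queues < cores: RSS picks a
# hardware queue, then RPS spreads that queue's flows across its core pool.
import zlib

HW_QUEUES = 2
core_pools = {0: [0, 1], 1: [2, 3]}  # pool of cores assigned to each queue

def flow_hash(fields, seed=0):
    # Stand-in hash; a different seed makes the two levels independent.
    return zlib.crc32("|".join(map(str, fields)).encode(), seed)

def rss_queue(fields):
    """First level: RSS spreads flows across the hardware queues."""
    return flow_hash(fields) % HW_QUEUES

def rps_core(fields):
    """Second level: RPS spreads each queue's flows across its pool."""
    pool = core_pools[rss_queue(fields)]
    return pool[flow_hash(fields, 1) % len(pool)]

flow = ("10.0.0.1", "10.0.0.2", "tcp", 40000, 443)
assert rps_core(flow) in core_pools[rss_queue(flow)]
assert rps_core(flow) == rps_core(flow)  # same flow, same core
```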
 
=== RFS ===
[[File:RFS logic.png|upright=1.7|thumb|Diagram showing how the RFS logic distributes each incoming packet to the core running the corresponding application]]
Receive Flow Steering (RFS) extends RPS in the same direction as the hardware-based aRFS solution.
By routing packet flows to the CPU core running the consuming application, cache locality can be leveraged, avoiding many misses and reducing the latency introduced by retrieving data from [[Memory hierarchy|central memory]].<ref name="RFS by redhat">{{Cite web|title=RFS by redhat|url=https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/6/html/performance_tuning_guide/network-rfs|access-date=2025-07-08|website=docs.redhat.com|publisher=Red Hat Documentation|language=en-US}}</ref><br>
To do this, after having computed the hash of the header fields for the current packet, the result is used to index a lookup table.
This table is managed by the scheduler, which updates its entries when the application processes are moved between the cores.<ref name="RFS kernel linux docs">{{Cite web|title=RFS kernel linux docs|url=https://www.kernel.org/doc/html/v5.1/networking/scaling.html#rfs-receive-flow-steering|access-date=2025-07-08|website=kernel.org|publisher=The Linux Kernel documentation|language=en-US}}</ref><br>
The overall CPU load distribution is balanced as long as the applications in [[User space and kernel space|user-space]] are evenly distributed across the multiple cores.<ref name="RFS by redhat" /><ref name="RFS by nvidea">{{Cite web|title=RFS by nvidea|url=https://docs.nvidia.com/networking/display/mlnxofedv23070512/flow+steering|access-date=2025-07-08|website=docs.nvidia.com|publisher=NVIDIA Documentation Hub|language=en-US}}</ref><ref name="RFS kernel linux docs" />
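The scheduler-maintained lookup table described above can be sketched as follows. This is a hypothetical Python illustration (the array-based table, its size and the hook names are invented; the kernel's actual structures differ):

```python
# Illustrative RFS-style steering: a hash-indexed lookup table, maintained
# by the scheduler, maps each flow to the core where its consumer runs.
import zlib

NUM_CORES = 4
TABLE_SIZE = 256
flow_to_core = [None] * TABLE_SIZE  # entries filled by the scheduler

def flow_hash(*fields):
    return zlib.crc32("|".join(map(str, fields)).encode()) % TABLE_SIZE

def scheduler_moved(flow_fields, new_core):
    """Scheduler hook: the consumer of this flow now runs on `new_core`."""
    flow_to_core[flow_hash(*flow_fields)] = new_core

def steer(*fields):
    h = flow_hash(*fields)
    core = flow_to_core[h]
    return core if core is not None else h % NUM_CORES  # RPS-style fallback

flow = ("10.0.0.1", "10.0.0.2", "tcp", 40000, 443)
scheduler_moved(flow, 1)
assert steer(*flow) == 1
scheduler_moved(flow, 3)    # the application migrates to another core...
assert steer(*flow) == 3    # ...and its packets follow it
```

The last two lines show the property that distinguishes RFS from plain RPS: when the scheduler moves the application, the table entry is updated and subsequent packets of the flow follow it.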
 
=== XPS (in transmission) ===
Transmit Packet Steering (XPS) operates on the transmission side, as opposed to the other techniques presented so far, which operate on reception. When packets need to be loaded on one of the transmission queues exposed by the NIC, there are again many possible optimizations.<ref>{{Cite web|title=XPS linux news (LWM)|url=https://lwn.net/Articles/412062/|access-date=2025-07-08|website=lwn.net|publisher=Linux Weekly News|language=en-US}}</ref><br>
For instance, if multiple transmission queues are assigned to a single core, a hash function can be used to load balance outgoing packets across those queues (similarly to what RPS does in reception).
Moreover, in order to improve cache locality and hit rate (similarly to what RFS does), XPS ensures that applications producing outgoing traffic on ''core i'' will favor the transmit queues associated with the same ''core i''. This reduces inter-core communication and cache-coherency protocol overheads, resulting in better performance under heavy load.<ref>{{Cite web|title=XPS intel overview|url=https://www.intel.com/content/www/us/en/docs/programmable/683517/21-4/transmit-packet-steering-xps.html|access-date=2025-07-08|website=intel.com|publisher=Intel corp|language=en-US}}</ref><ref>{{Cite web|title=XPS linux news (LWM)|url=https://lwn.net/Articles/412062/|access-date=2025-07-08|website=lwn.net|publisher=Linux Weekly News|language=en-US}}</ref><ref>{{Cite web|title=XPS kernel linux docs|url=https://www.kernel.org/doc/html/v5.1/networking/scaling.html#xps-transmit-packet-steering|access-date=2025-07-08|website=kernel.org|publisher=The Linux Kernel documentation|language=en-US}}</ref><ref name="General intro" />
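The core-to-queue mapping can be sketched in Python. The mapping of two transmit queues per core is an arbitrary illustration, not a fixed rule of XPS:

```python
# Illustrative XPS-style transmit-queue selection: each core owns a set of
# TX queues; a flow produced on core i is hashed onto one of core i's own
# queues, keeping the flow on a single queue to preserve ordering.
import zlib

# Hypothetical layout: 4 cores, 8 TX queues, two queues per core.
tx_queues_of_core = {c: [2 * c, 2 * c + 1] for c in range(4)}

def flow_hash(*fields):
    return zlib.crc32("|".join(map(str, fields)).encode())

def select_tx_queue(core, *flow_fields):
    """Only the producing core's queues are eligible (the XPS constraint);
    the flow hash load balances across them (the RPS-like part)."""
    queues = tx_queues_of_core[core]
    return queues[flow_hash(*flow_fields) % len(queues)]

q = select_tx_queue(2, "10.0.0.2", "10.0.0.9", "tcp", 555, 80)
assert q in tx_queues_of_core[2]  # traffic from core 2 stays on its queues
```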
 
== See also ==
{{Div col|colwidth=25em}}
* [[Cloud computing]]
* [[Load balancing (computing)|Load balancing]]
* [[Multi-core processor|Multi-core architectures]]
* [[Network packet|Network packets]]
* [[Network interface controller|NIC]]
* [[Packet processing]]
* [[Data processing unit|SmartNIC]]
{{div col end}}
 
== References ==
{{reflist}}
 
== Further reading ==
* {{Cite book |last1=Enberg |first1=Pekka |last2=Rao |first2=Ashwin |last3=Tarkoma |first3=Sasu |chapter=Partition-Aware Packet Steering Using XDP and eBPF for Improving Application-Level Parallelism |date=2019-12-09 |title=Proceedings of the 1st ACM CoNEXT Workshop on Emerging in-Network Computing Paradigms |chapter-url=https://dl.acm.org/doi/10.1145/3359993.3366766 |pages=27–33 |___location=New York, NY, USA |publisher=Association for Computing Machinery |doi=10.1145/3359993.3366766 |hdl=10138/326309 |isbn=978-1-4503-7000-4}}
* {{Cite book |last1=Helbig |first1=Maike |last2=Kim |first2=Younghoon |chapter=IAPS: Decreasing Software-Based Packet Steering Overhead Through Interrupt Reduction |date=2025-01-01 |pages=127–130 |title=2025 International Conference on Information Networking (ICOIN) |doi=10.1109/ICOIN63865.2025.10993154 |isbn=979-8-3315-0694-0 }}
* {{Cite book |last1=Kumar |first1=Ashwin |last2=Katkam |first2=Rajneesh |last3=Chaudhary |first3=Pranav |last4=Naik |first4=Priyanka |last5=Vutukuru |first5=Mythili |chapter=AppSteer: Framework for Improving Multicore Scalability of Network Functions via Application-aware Packet Steering |date=2024-05-01 |pages=18–27 |title=2024 IEEE 24th International Symposium on Cluster, Cloud and Internet Computing (CCGrid) |doi=10.1109/CCGrid59990.2024.00012 |isbn=979-8-3503-9566-2 }}
* {{Cite journal |last1=Tyunyayev |first1=Nikita |last2=Delzotti |first2=Clément |last3=Eran |first3=Haggai |last4=Barbette |first4=Tom |date=2025-06-09 |title=ASNI: Redefining the Interface Between SmartNICs and Applications |url=https://dl.acm.org/doi/10.1145/3730966 |journal= Proceedings of the ACM on Networking|volume=3 |issue= |pages=1–22 |doi=10.1145/3730966|url-access=subscription }}
 
== External links ==
* {{YouTube|id=GP6kSs6vCH8|title=Receive Side Scaling (RSS) with eBPF in QEMU and virtio-net}}
* {{YouTube|id=BmhqBY2AQoc|title=Packet Steering for Multicore Virtual Network Applications over DPDK}}
* {{YouTube|id=dANekxZZems|title=Offloading Network Traffic Classification to Hardware}}
* [https://fast.dpdk.org/events/slides/DPDK-2017-04-VNF.pdf Packet Steering for Multicore Virtual Network Applications over DPDK]
 
{{Parallel computing|state=collapsed}}
{{Operating system|state=collapsed}}
{{Basic computer components|state=collapsed}}
 
[[Category:Networking hardware]]
[[Category:Network flow problem]]
[[Category:Manycore processors]]
[[Category:Load balancing (computing)]]
[[Category:Cache (computing)]]