RoCEv2 ECN & CNP Deep Dive
- Pualaman
- 12 hours ago
Overview
This document provides a deep-dive analysis of ECN (Explicit Congestion Notification) and CNP (Congestion Notification Packet) in RoCEv2 networks, based on real packet captures from an 8-node RDMA AI cluster.
What You'll Learn:
- How ECN marks packets during congestion
- How CNP packets trigger rate reduction
- How the DCQCN algorithm helps keep the fabric lossless
- Real packet analysis with Wireshark
Why This Matters
Modern AI/ML training requires lossless networking for GPU-to-GPU communication.

Key Statistics from Production AI Clusters:
- RDMA achieves 2x the bandwidth of TCP
- Latency reduced by 91% (161 µs to 14 µs)
- Packet drops reduced by 96%
The Three-Layer Congestion Control

ECN Deep Dive
What is ECN?
ECN (Explicit Congestion Notification) is an IP-layer mechanism that allows switches to mark packets instead of dropping them.
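To make "mark instead of drop" concrete, here is a minimal Python sketch of a switch-side marking decision. It assumes a RED-style profile with a lower threshold, an upper threshold, and a maximum marking probability. The names `kmin_kb`, `kmax_kb`, and `pmax` and their default values are illustrative assumptions, not settings taken from the cluster in this article.

```python
import random

# ECN codepoints (two low-order bits of the IP TOS / DS field).
NOT_ECT, ECT1, ECT0, CE = 0b00, 0b01, 0b10, 0b11


def mark_probability(queue_kb: float, kmin_kb: float, kmax_kb: float, pmax: float) -> float:
    """Probability of marking a packet CE at the given egress queue depth.

    Below kmin nothing is marked, above kmax everything is marked, and in
    between the probability rises linearly from 0 to pmax.
    """
    if queue_kb <= kmin_kb:
        return 0.0
    if queue_kb >= kmax_kb:
        return 1.0
    return pmax * (queue_kb - kmin_kb) / (kmax_kb - kmin_kb)


def forward(ecn: int, queue_kb: float, rng: random.Random,
            kmin_kb: float = 100.0, kmax_kb: float = 400.0, pmax: float = 0.2) -> int:
    """Return the ECN value the packet carries when it leaves the switch.

    Thresholds are hypothetical. Only ECN-capable packets (ECT(0) or ECT(1))
    may be rewritten to CE; Not-ECT packets pass through unchanged here.
    """
    if ecn in (ECT0, ECT1) and rng.random() < mark_probability(queue_kb, kmin_kb, kmax_kb, pmax):
        return CE
    return ecn
```

With these assumed thresholds, a packet arriving at a nearly empty queue keeps ECT(0), while a packet arriving at a queue deeper than `kmax_kb` always leaves as CE.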
ECN Bits in IP Header
IP TOS Byte (8 bits):
+-------------------+--------------+
| 7  6  5  4  3  2  | 1  0         |
| DSCP (6 bits)     | ECN (2 bits) |
+-------------------+--------------+
ECN Values
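The DSCP/ECN split of the TOS byte and the four ECN codepoints defined in RFC 3168 can be sketched in a few lines of Python. The helper name `split_dsfield` is made up for illustration. The three sample bytes (0x02, 0x03, 0xc2) are the Differentiated Services Field values that appear in the captures in this article.

```python
# ECN codepoints from RFC 3168 (two low-order bits of the TOS / DS byte).
ECN_NAMES = {
    0b00: "Not-ECT",  # sender is not ECN-capable
    0b01: "ECT(1)",   # ECN-capable transport
    0b10: "ECT(0)",   # ECN-capable transport
    0b11: "CE",       # congestion experienced
}


def split_dsfield(dsfield: int) -> tuple:
    """Split the 8-bit DS field into (DSCP, ECN)."""
    if not 0 <= dsfield <= 0xFF:
        raise ValueError("DS field must fit in one byte")
    return dsfield >> 2, dsfield & 0b11


for value in (0x02, 0x03, 0xC2):
    dscp, ecn = split_dsfield(value)
    print(f"0x{value:02x}: DSCP={dscp} ECN={ecn} ({ECN_NAMES[ecn]})")
# 0x02: DSCP=0 ECN=2 (ECT(0))
# 0x03: DSCP=0 ECN=3 (CE)
# 0xc2: DSCP=48 ECN=2 (ECT(0))
```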

Real ECN Example from Capture
Normal Packet (No Congestion):
Frame 1: RDMA Write
  IP Header:
    Differentiated Services Field: 0x02
      0000 00.. = DSCP: Default (0)
      .... ..10 = ECN: ECT(0)   <-- ECN CAPABLE, NO CONGESTION
Congested Packet (CE Marked):
Frame 48: RDMA Write (same flow, 0.5ms later)
  IP Header:
    Differentiated Services Field: 0x03
      0000 00.. = DSCP: Default (0)
      .... ..11 = ECN: CE       <-- CONGESTION EXPERIENCED!
CNP Deep Dive
What is CNP?
CNP (Congestion Notification Packet) is a special RoCEv2 packet that tells the sender to slow down.
Key Facts:
- Generated by the receiver NIC (not the application)
- Sent when the receiver sees CE-marked packets
- Contains the QP number that identifies which flow to slow down
- Marked with DSCP 48 for high priority
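The receiver-side behaviour in the list above can be sketched as a small Python state machine. This is a simplified model, not NIC firmware. The class name `NotificationPoint`, the dictionary it returns, and the 50 µs minimum gap between CNPs for the same QP are assumptions chosen for illustration; real NICs expose their own (vendor-specific) pacing settings.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

CE = 0b11  # ECN "congestion experienced" codepoint


@dataclass
class NotificationPoint:
    """Toy model of the receiver NIC deciding when to emit a CNP."""

    min_gap_us: float = 50.0  # assumed pacing interval per QP (hypothetical)
    last_cnp_us: Dict[int, float] = field(default_factory=dict)

    def on_packet(self, now_us: float, ecn: int, sender_qp: int) -> Optional[dict]:
        """Return a CNP description if one should be sent, else None."""
        if ecn != CE:
            return None  # no congestion signalled on this packet
        last = self.last_cnp_us.get(sender_qp)
        if last is not None and now_us - last < self.min_gap_us:
            return None  # a CNP for this QP was sent too recently
        self.last_cnp_us[sender_qp] = now_us
        # Fields mirror the CNP described in this article.
        return {"opcode": 0x81, "dest_qp": sender_qp, "dscp": 48}
```

Pacing CNPs per QP keeps the feedback channel small even when many consecutive data packets arrive CE-marked.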
CNP Packet Structure
+----------------------------------------------------+
| Ethernet Header (14 bytes)                         |
|   Src: Receiver MAC                                |
|   Dst: Sender MAC                                  |
+----------------------------------------------------+
| IP Header (20 bytes)                               |
|   Src: Receiver IP                                 |
|   Dst: Sender IP                                   |
|   DSCP: 48 (CS6)       <-- HIGH PRIORITY           |
|   ECN: ECT(0)                                      |
+----------------------------------------------------+
| UDP Header (8 bytes)                               |
|   Src Port: 0          <-- SPECIAL CNP SIGNATURE   |
|   Dst Port: 4791 (RoCEv2)                          |
+----------------------------------------------------+
| InfiniBand BTH (12 bytes)                          |
|   OpCode: 0x81 (129)   <-- CNP IDENTIFIER          |
|   Dest QP: 0x000d1e    <-- WHICH FLOW TO SLOW DOWN |
|   PSN: 0                                           |
+----------------------------------------------------+
| CNP Payload (16 bytes) + ICRC (4 bytes)            |
+----------------------------------------------------+
Total: 74 bytes (minimal overhead)
Real CNP Example from Capture
Frame 42818: CNP Packet (74 bytes)
  Ethernet II
    Src: 00:50:56:af:0d:ec
    Dst: 00:50:56:af:39:dc
  Internet Protocol Version 4
    Src: 192.168.250.114 (Receiver sending CNP back)
    Dst: 192.168.250.117 (Original sender)
    Differentiated Services Field: 0xc2
      1100 00.. = DSCP: CS6 (48)   <-- HIGH PRIORITY
      .... ..10 = ECN: ECT(0)
  User Datagram Protocol
    Src Port: 0                    <-- CNP SIGNATURE
    Dst Port: 4791
  InfiniBand BTH
    Opcode: 129 (0x81)             <-- CNP OPCODE
    Dest QP: 0x000d1e              <-- FLOW TO SLOW DOWN
    PSN: 0
Real Packet Analysis
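The fields highlighted in the CNP capture (UDP destination port 4791, BTH opcode 0x81, 24-bit destination QP) can also be checked programmatically. The sketch below assumes an untagged Ethernet frame carrying IPv4. The function names are hypothetical, and `synthetic_cnp` builds a zero-filled stand-in with the same layout and length as the CNP described above, not the real bytes from the pcap.

```python
import struct

ROCEV2_UDP_PORT = 4791
CNP_OPCODE = 0x81
ETH_LEN, UDP_LEN, BTH_LEN = 14, 8, 12


def parse_rocev2(frame: bytes):
    """Return key RoCEv2 fields, or None if the frame is not IPv4 RoCEv2.

    Assumes no VLAN tag (EtherType sits at byte offset 12).
    """
    if len(frame) < ETH_LEN + 20 + UDP_LEN + BTH_LEN:
        return None
    if struct.unpack_from("!H", frame, 12)[0] != 0x0800:  # not IPv4
        return None
    ip = ETH_LEN
    ihl = (frame[ip] & 0x0F) * 4          # IPv4 header length in bytes
    if ihl < 20 or frame[ip + 9] != 17:   # protocol 17 = UDP
        return None
    udp = ip + ihl
    bth = udp + UDP_LEN
    if len(frame) < bth + BTH_LEN:
        return None
    if struct.unpack_from("!H", frame, udp + 2)[0] != ROCEV2_UDP_PORT:
        return None
    tos = frame[ip + 1]
    opcode = frame[bth]
    return {
        "dscp": tos >> 2,
        "ecn": tos & 0b11,
        "opcode": opcode,
        "dest_qp": int.from_bytes(frame[bth + 5:bth + 8], "big"),  # 24-bit QP
        "psn": int.from_bytes(frame[bth + 9:bth + 12], "big"),     # 24-bit PSN
        "is_cnp": opcode == CNP_OPCODE,
    }


def synthetic_cnp(dest_qp: int) -> bytes:
    """Build a zero-filled 74-byte frame laid out like the CNP above."""
    eth = bytes(12) + b"\x08\x00"                                   # 14 bytes
    ip = bytes([0x45, 0xC2]) + bytes(7) + bytes([17]) + bytes(10)   # 20 bytes, TOS 0xc2
    udp = struct.pack("!HHHH", 0, ROCEV2_UDP_PORT, UDP_LEN + BTH_LEN + 16 + 4, 0)
    # Opcode, flags, placeholder P_Key 0xffff, reserved byte, dest QP, PSN 0.
    bth = bytes([CNP_OPCODE, 0, 0xFF, 0xFF, 0]) + dest_qp.to_bytes(3, "big") + bytes(4)
    return eth + ip + udp + bth + bytes(16) + bytes(4)              # payload + ICRC


frame = synthetic_cnp(0x000D1E)
print(len(frame), parse_rocev2(frame))
```

The synthetic frame is 14 + 20 + 8 + 12 + 16 + 4 = 74 bytes, matching the total given for the CNP structure.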
Complete DCQCN Flow from Capture
Timeline (from actual pcap):
T=0.000ms  Frame 1: Normal RDMA Write, ECN=2 (ECT)
    |
T=0.278ms  Frame 26: RDMA Write First, ECN=2 (ECT)
    |          Starting 64KB transfer
    |
T=0.517ms  Frame 48: RDMA Write Middle, ECN=3 (CE)   <-- FIRST CONGESTION!
    |          Switch marked this packet
    |
T=0.587ms  Frame 55: RDMA Write First, ECN=3 (CE)
    |          Congestion continues
    |
   ...     (more CE-marked packets)
    |
T=196.7ms  Frame 42818: CNP generated!
    |          DSCP=48, OpCode=0x81
    |          Dest QP=0x000d1e
    |          "Slow down QP 0x000d1e!"
    |
    v
Sender NIC receives CNP
Rate reduced for QP 0x000d1e
Packet Comparison Table

Lab Results
Capture Statistics

Wireshark Filters
Essential Filters
# All RoCEv2 traffic
udp.port == 4791
# ECN Congestion Experienced (CE)
ip.dsfield.ecn == 3
# CNP packets (DSCP 48)
ip.dsfield.dscp == 48
# CNP by OpCode
infiniband.bth.opcode == 129
# Specific QP traffic
infiniband.bth.destqp == 0x000d1d
tshark Commands
# Count ECN states
tshark -r capture.pcap -T fields -e ip.dsfield.ecn | sort | uniq -c
# Count DSCP values
tshark -r capture.pcap -T fields -e ip.dsfield.dscp | sort | uniq -c
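# List the first 10 CE-marked packets
# (with -r, tshark's -c limits packets read, not packets matched, so use head)
tshark -r capture.pcap -Y "ip.dsfield.ecn == 3" | head -n 10
# List the first 10 CNP packets
tshark -r capture.pcap -Y "ip.dsfield.dscp == 48" | head -n 10
Key Takeaways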
- ECN signals congestion by changing the ECN bits from 2 (ECT(0), binary 10) to 3 (CE, binary 11)
- CNP is a real packet - not just a header flag
- CNP contains the QP number - the sender knows exactly which flow to slow
- DSCP 48 ensures CNP priority - feedback must arrive fast
- Rate limiting happens at the sender NIC - not at the switch
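The last takeaway, rate limiting at the sender NIC, can be illustrated with the rate-decrease step of the published DCQCN algorithm. On each CNP the sender remembers its current rate as the target, cuts the current rate by a factor of (1 - alpha/2), and raises alpha; when no CNP arrives for a while, alpha decays. This is a minimal sketch: the class name `ReactionPoint`, the 100 Gbps starting rate, and g = 1/256 are assumptions for illustration, and the rate-recovery phases of DCQCN are deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class ReactionPoint:
    """Toy model of per-QP rate state at the sender NIC (decrease path only)."""

    rate_gbps: float            # current sending rate
    target_gbps: float = 0.0    # rate to recover toward later (recovery not modelled)
    alpha: float = 1.0          # running estimate of congestion severity
    g: float = 1.0 / 256        # assumed smoothing gain

    def on_cnp(self) -> None:
        """Apply the multiplicative decrease when a CNP arrives for this QP."""
        self.target_gbps = self.rate_gbps
        self.rate_gbps *= 1 - self.alpha / 2
        self.alpha = (1 - self.g) * self.alpha + self.g

    def on_alpha_timer(self) -> None:
        """Decay alpha when a timer period passes without any CNP."""
        self.alpha = (1 - self.g) * self.alpha


qp = ReactionPoint(rate_gbps=100.0)  # hypothetical 100 Gbps flow
qp.on_cnp()
print(qp.rate_gbps, qp.target_gbps)  # 50.0 100.0
```

Because alpha starts at 1, the first CNP halves the rate; as alpha decays during quiet periods, later CNPs cause smaller cuts.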
The Complete Picture
+---------------------------------------------------------------+
|                        DCQCN Operation                        |
|                                                               |
|  Sender                   Switch                 Receiver     |
|    |                         |                       |        |
|    |-- RDMA Data (ECN=2) --->|                       |        |
|    |                         |                       |        |
|    |                   [Queue fills]                 |        |
|    |                   [Mark ECN=3]                  |        |
|    |                         |                       |        |
|    |                         |-- Data (ECN=3) ------>|        |
|    |                         |                       |        |
|    |                         |                 [See CE mark]  |
|    |                         |                [Generate CNP]  |
|    |                         |                       |        |
|    |<--- CNP (DSCP 48) ------|<----------------------|        |
|    |     OpCode 0x81         |   Strict Priority     |        |
|    |     QP=0x000d1e         |                       |        |
|    |                         |                       |        |
| [Reduce rate                 |                       |        |
|  for QP 0x000d1e]            |                       |        |
|                                                               |
+---------------------------------------------------------------+
Thanks For Reading

