
RoCEv2 ECN & CNP Deep Dive

Overview

This document provides a deep-dive analysis of ECN (Explicit Congestion Notification) and CNP (Congestion Notification Packet) in RoCEv2 networks, based on real packet captures from an 8-node RDMA AI cluster.


What You'll Learn:

  • How ECN marks packets during congestion

  • How CNPs trigger rate reduction

  • How the DCQCN algorithm maintains lossless operation

  • Real packet analysis with Wireshark


Why This Matters

Modern AI/ML training requires lossless networking for GPU-to-GPU communication.



Key Statistics from Production AI Clusters:

  • RDMA achieves 2x the bandwidth of TCP

  • Latency reduced by 91% (161 µs to 14 µs)

  • Packet drops reduced by 96%


The Three-Layer Congestion Control



ECN Deep Dive

What is ECN?


ECN (Explicit Congestion Notification) is an IP-layer mechanism that allows switches to mark packets instead of dropping them.
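As a rough illustration of the marking decision, the sketch below models a RED-style policy of the kind commonly paired with DCQCN: no marking below a minimum queue depth, certain marking above a maximum, and a linear ramp in between. The thresholds (`kmin_kb`, `kmax_kb`, `pmax`) and the function names are hypothetical and are not taken from the switches in this cluster.

```python
import random

NOT_ECT = 0b00  # sender did not negotiate ECN
CE = 0b11       # Congestion Experienced


def mark_probability(queue_kb, kmin_kb, kmax_kb, pmax):
    """Probability that the switch sets CE on an ECN-capable packet."""
    if queue_kb <= kmin_kb:
        return 0.0
    if queue_kb >= kmax_kb:
        return 1.0
    # Linear ramp from 0 at kmin up to pmax just below kmax.
    return pmax * (queue_kb - kmin_kb) / (kmax_kb - kmin_kb)


def forward(ecn, queue_kb, kmin_kb=100, kmax_kb=400, pmax=0.2, rng=random.random):
    """Return the ECN value the packet carries when it leaves the switch."""
    if ecn in (NOT_ECT, CE):
        # Not-ECT packets cannot be marked; CE packets stay marked.
        return ecn
    if rng() < mark_probability(queue_kb, kmin_kb, kmax_kb, pmax):
        return CE
    return ecn


if __name__ == "__main__":
    for depth in (50, 250, 500):
        print(depth, "KB ->", mark_probability(depth, 100, 400, 0.2))
```

Passing `rng` as a parameter keeps the random draw replaceable, which makes the behaviour easy to check at fixed queue depths.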


ECN Bits in IP Header

IP TOS Byte (8 bits):
+----------------------------------------+
| 7   6   5   4   3   2 | 1   0          |
|       DSCP (6 bits)   | ECN (2 bits)   |
+----------------------------------------+

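ECN Values

Binary   Decimal   Name      Meaning
00       0         Not-ECT   Sender is not ECN-capable
01       1         ECT(1)    ECN-capable transport
10       2         ECT(0)    ECN-capable transport (the value on the data packets in this capture)
11       3         CE        Congestion Experienced, set by a congested switch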


Real ECN Example from Capture


Normal Packet (No Congestion):

Frame 1: RDMA Write
    IP Header:
        Differentiated Services Field: 0x02
            0000 00.. = DSCP: Default (0)
            .... ..10 = ECN: ECT(0)  <-- ECN CAPABLE, NO CONGESTION

Congested Packet (CE Marked):

Frame 48: RDMA Write (same flow, 0.5ms later)
    IP Header:
        Differentiated Services Field: 0x03
            0000 00.. = DSCP: Default (0)
            .... ..11 = ECN: CE  <-- CONGESTION EXPERIENCED!
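The DS field values in the two frames above can be decoded by hand with a shift and a mask, following the bit layout shown earlier (upper six bits DSCP, lower two bits ECN). The helper name `split_ds_field` is made up for this sketch.

```python
# ECN codepoint names for the two low-order bits of the DS field.
ECN_NAMES = {0b00: "Not-ECT", 0b01: "ECT(1)", 0b10: "ECT(0)", 0b11: "CE"}


def split_ds_field(ds):
    """Split the 8-bit DS field into (dscp, ecn, ecn_name)."""
    dscp = ds >> 2    # upper 6 bits
    ecn = ds & 0b11   # lower 2 bits
    return dscp, ecn, ECN_NAMES[ecn]


if __name__ == "__main__":
    # 0x02 and 0x03 are the DS field values of Frame 1 and Frame 48.
    for value in (0x02, 0x03):
        dscp, ecn, name = split_ds_field(value)
        print(f"0x{value:02x}: DSCP={dscp} ECN={ecn} ({name})")
```

The same helper applies to any other DS field value, for example the 0xc2 carried by the CNP later in this document.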

CNP Deep Dive

What is CNP?

CNP (Congestion Notification Packet) is a special RoCEv2 packet that tells the sender to slow down.


Key Facts:

  • Generated by receiver NIC (not application)

  • Sent when receiver sees CE-marked packets

  • Contains QP number to identify which flow to slow

  • Marked with DSCP 48 for high priority
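The receiver-side behaviour in the list above can be sketched as a small state machine. This is only a model under assumptions: the 50 µs minimum gap between CNPs for the same flow follows the published DCQCN description rather than anything measured in this capture, and the class name, method names, and the idea of passing in the sender's QP number (which a real NIC would look up in its QP context) are hypothetical.

```python
CE = 0b11
CNP_OPCODE = 0x81
CNP_DSCP = 48


class NotificationPoint:
    """Toy model of a receiver NIC deciding when to emit a CNP."""

    def __init__(self, min_interval_us=50):
        self.min_interval_us = min_interval_us  # assumed pacing interval
        self._last_cnp_us = {}                  # (sender_ip, sender_qp) -> time

    def on_packet(self, now_us, sender_ip, sender_qp, ecn):
        """Return a dict describing the CNP to send, or None."""
        if ecn != CE:
            return None  # only CE-marked packets trigger feedback
        key = (sender_ip, sender_qp)
        last = self._last_cnp_us.get(key)
        if last is not None and now_us - last < self.min_interval_us:
            return None  # a CNP for this flow was sent recently
        self._last_cnp_us[key] = now_us
        return {
            "dst_ip": sender_ip,
            "dest_qp": sender_qp,
            "opcode": CNP_OPCODE,
            "dscp": CNP_DSCP,
        }


if __name__ == "__main__":
    nic = NotificationPoint()
    for t in (0, 10, 60):
        print(t, nic.on_packet(t, "192.168.250.117", 0x000D1E, CE))
```

Pacing per flow means a burst of CE-marked packets produces a single CNP rather than one CNP per marked packet.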


CNP Packet Structure

+----------------------------------------------------+
| Ethernet Header (14 bytes)                         |
|   Src: Receiver MAC                                |
|   Dst: Sender MAC                                  |
+----------------------------------------------------+
| IP Header (20 bytes)                               |
|   Src: Receiver IP                                 |
|   Dst: Sender IP                                   |
|   DSCP: 48 (CS6) <-- HIGH PRIORITY                |
|   ECN: ECT(0)                                      |
+----------------------------------------------------+
| UDP Header (8 bytes)                               |
|   Src Port: 0 <-- SPECIAL CNP SIGNATURE           |
|   Dst Port: 4791 (RoCEv2)                          |
+----------------------------------------------------+
| InfiniBand BTH (12 bytes)                          |
|   OpCode: 0x81 (129) <-- CNP IDENTIFIER           |
|   Dest QP: 0x000d1e <-- WHICH FLOW TO SLOW DOWN   |
|   PSN: 0                                           |
+----------------------------------------------------+
| CNP Payload (16 bytes) + ICRC (4 bytes)            |
+----------------------------------------------------+
Total: 74 bytes (minimal overhead)
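To check the 74-byte total and see how the BTH fields fit into 12 bytes, here is a minimal sketch using Python's `struct` module. The partition key `0xFFFF` is an assumption (the structure above does not show a P_Key), all flag bits are left at zero, and `pack_bth` is a hypothetical helper name.

```python
import struct

CNP_OPCODE = 0x81

# Per-layer sizes in bytes, as listed in the structure above.
LAYER_SIZES = {
    "ethernet": 14,
    "ipv4": 20,
    "udp": 8,
    "bth": 12,
    "cnp_payload": 16,
    "icrc": 4,
}


def pack_bth(opcode, dest_qp, psn=0, pkey=0xFFFF):
    """Pack a 12-byte Base Transport Header with all flag bits cleared."""
    return struct.pack(
        "!BBHII",
        opcode,
        0,                   # SE, M, PadCnt, TVer
        pkey,                # partition key (assumed default)
        dest_qp & 0xFFFFFF,  # top byte reserved, low 24 bits = Dest QP
        psn & 0xFFFFFF,      # top byte = AckReq + reserved, low 24 bits = PSN
    )


bth = pack_bth(CNP_OPCODE, 0x000D1E)
assert len(bth) == LAYER_SIZES["bth"]

if __name__ == "__main__":
    print("BTH bytes:", bth.hex())
    print("Frame size:", sum(LAYER_SIZES.values()), "bytes")
```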

Real CNP Example from Capture

Frame 42818: CNP Packet (74 bytes)

Ethernet II
    Src: 00:50:56:af:0d:ec
    Dst: 00:50:56:af:39:dc

Internet Protocol Version 4
    Src: 192.168.250.114 (Receiver sending CNP back)
    Dst: 192.168.250.117 (Original sender)

    Differentiated Services Field: 0xc2
        1100 00.. = DSCP: CS6 (48) <-- HIGH PRIORITY
        .... ..10 = ECN: ECT(0)

User Datagram Protocol
    Src Port: 0 <-- CNP SIGNATURE
    Dst Port: 4791

InfiniBand BTH
    Opcode: 129 (0x81) <-- CNP OPCODE
    Dest QP: 0x000d1e <-- FLOW TO SLOW DOWN
    PSN: 0
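Going the other way, the fields Wireshark shows for this frame can be pulled out of the raw BTH bytes with a few slices. The sample bytes below are reconstructed from the decoded fields, not copied from the pcap, and the P_Key value in them is an assumption; `parse_bth` and `is_cnp` are hypothetical helper names.

```python
ROCEV2_UDP_PORT = 4791
CNP_OPCODE = 0x81


def parse_bth(bth):
    """Extract opcode, destination QP, and PSN from a 12-byte BTH."""
    if len(bth) < 12:
        raise ValueError("BTH must be at least 12 bytes")
    return {
        "opcode": bth[0],
        "dest_qp": int.from_bytes(bth[5:8], "big"),  # 24-bit field
        "psn": int.from_bytes(bth[9:12], "big"),     # 24-bit field
    }


def is_cnp(udp_dst_port, bth):
    """True if the packet is RoCEv2 and its BTH opcode is the CNP opcode."""
    return udp_dst_port == ROCEV2_UDP_PORT and parse_bth(bth)["opcode"] == CNP_OPCODE


if __name__ == "__main__":
    # Reconstructed example bytes: opcode 0x81, Dest QP 0x000d1e, PSN 0.
    sample = bytes.fromhex("8100ffff00000d1e00000000")
    fields = parse_bth(sample)
    print(f"opcode={fields['opcode']} dest_qp=0x{fields['dest_qp']:06x} psn={fields['psn']}")
    print("CNP:", is_cnp(4791, sample))
```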

Real Packet Analysis

Complete DCQCN Flow from Capture

Timeline (from actual pcap):

T=0.000ms    Frame 1:    Normal RDMA Write, ECN=2 (ECT)
             |
T=0.278ms    Frame 26:   RDMA Write First, ECN=2 (ECT)
             |           Starting 64KB transfer
             |
T=0.517ms    Frame 48:   RDMA Write Middle, ECN=3 (CE) <-- FIRST CONGESTION!
             |           Switch marked this packet
             |
T=0.587ms    Frame 55:   RDMA Write First, ECN=3 (CE)
             |           Congestion continues
             |
             ... (more CE-marked packets)
             |
T=196.7ms    Frame 42818: CNP generated!
             |            DSCP=48, OpCode=0x81
             |            Dest QP=0x000d1e
             |            "Slow down QP 0x000d1e!"
             |
             v
             Sender NIC receives CNP
             Rate reduced for QP 0x000d1e
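What "rate reduced" means at the sender can be sketched with the update rules from the published DCQCN algorithm: each CNP cuts the current rate by a factor of (1 - alpha/2) and raises the congestion estimate alpha, while quiet periods decay alpha and let the rate climb back toward its pre-cut target. This is a simplified model, not the firmware of the NICs in this capture: the 100 Gbps line rate is hypothetical, g = 1/256 and the initial alpha of 1 are assumed defaults, and timers, byte counters, and the additive and hyper increase phases are omitted.

```python
class ReactionPoint:
    """Toy model of per-QP DCQCN rate control at the sender NIC."""

    def __init__(self, line_rate_gbps, g=1 / 256):
        self.current_rate = line_rate_gbps  # Rc
        self.target_rate = line_rate_gbps   # Rt
        self.alpha = 1.0                    # congestion estimate
        self.g = g                          # alpha update gain (assumed)

    def on_cnp(self):
        """CNP received: remember the old rate, then cut."""
        self.target_rate = self.current_rate
        self.current_rate *= 1 - self.alpha / 2
        self.alpha = (1 - self.g) * self.alpha + self.g

    def on_quiet_interval(self):
        """No CNP arrived during the alpha update interval."""
        self.alpha = (1 - self.g) * self.alpha

    def on_recovery_step(self):
        """Fast recovery: move halfway back toward the target rate."""
        self.current_rate = (self.current_rate + self.target_rate) / 2


if __name__ == "__main__":
    qp = ReactionPoint(line_rate_gbps=100.0)
    qp.on_cnp()
    print("after CNP:", qp.current_rate, "Gbps")
    qp.on_recovery_step()
    print("after one recovery step:", qp.current_rate, "Gbps")
```

With alpha starting at 1, the first CNP halves the rate; later cuts are gentler if alpha has had time to decay.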

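Packet Comparison Table

Field        Normal data (Frame 1)   CE-marked data (Frame 48)   CNP (Frame 42818)
DS field     0x02                    0x03                        0xc2
DSCP         0 (Default)             0 (Default)                 48 (CS6)
ECN          2, ECT(0)               3, CE                       2, ECT(0)
BTH OpCode   RDMA Write              RDMA Write Middle           0x81 (CNP)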


Lab Results


Capture Statistics


Wireshark Filters


Essential Filters

# All RoCEv2 traffic
udp.port == 4791

# ECN Congestion Experienced (CE)
ip.dsfield.ecn == 3

# CNP packets (DSCP 48)
ip.dsfield.dscp == 48

# CNP by OpCode
infiniband.bth.opcode == 129

# Specific QP traffic
infiniband.bth.destqp == 0x000d1e

tshark Commands

# Count ECN states
tshark -r capture.pcap -T fields -e ip.dsfield.ecn | sort | uniq -c

# Count DSCP values
tshark -r capture.pcap -T fields -e ip.dsfield.dscp | sort | uniq -c

# List CE-marked packets
tshark -r capture.pcap -Y "ip.dsfield.ecn == 3" -c 10

# List CNP packets
tshark -r capture.pcap -Y "ip.dsfield.dscp == 48" -c 10
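If the counts need further processing, the `sort | uniq -c` step can be replaced by a few lines of Python that read the `-T fields` output. The sketch assumes tshark prints one decimal ECN value per line and an empty line for frames without an IP header; the sample lines in it are made up for illustration and are not results from this capture. `tally_ecn` is a hypothetical helper name.

```python
from collections import Counter

ECN_NAMES = {0: "Not-ECT", 1: "ECT(1)", 2: "ECT(0)", 3: "CE"}


def tally_ecn(field_lines):
    """Count ECN codepoints from lines of `tshark -T fields -e ip.dsfield.ecn`."""
    counts = Counter()
    for line in field_lines:
        value = line.strip()
        if not value:
            continue  # frame had no IP header, so the field is empty
        counts[ECN_NAMES.get(int(value), value)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample output, not taken from the capture.
    sample_lines = ["2", "2", "3", "", "2", "3"]
    for name, count in tally_ecn(sample_lines).most_common():
        print(f"{count:6d} {name}")
```

On a real capture the same function could be fed `sys.stdin` by piping the tshark command into the script.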

Key Takeaways


  1. A congested switch signals congestion by rewriting the ECN bits from 2 (ECT(0)) to 3 (CE)

  2. CNP is a real packet - not just a header flag

  3. CNP contains QP number - sender knows exactly which flow to slow

  4. DSCP 48 ensures CNP priority - feedback must arrive fast

  5. Rate limiting happens at sender NIC - not at switch


The Complete Picture

+---------------------------------------------------------------+
|                      DCQCN Operation                           |
|                                                                 |
|  Sender                    Switch                    Receiver  |
|    |                         |                          |       |
|    | -- RDMA Data (ECN=2) -->|                          |       |
|    |                         |                          |       |
|    |                    [Queue fills]                   |       |
|    |                    [Mark ECN=3]                    |       |
|    |                         |                          |       |
|    |                         |-- Data (ECN=3) --------->|       |
|    |                         |                          |       |
|    |                         |              [See CE mark]       |
|    |                         |              [Generate CNP]      |
|    |                         |                          |       |
|    |<--- CNP (DSCP 48) ------|<--------------------------|       |
|    |     OpCode 0x81         |     Strict Priority      |       |
|    |     QP=0x000d1e         |                          |       |
|    |                         |                          |       |
| [Reduce rate                 |                          |       |
|  for QP 0x000d1e]            |                          |       |
|                                                                 |
+---------------------------------------------------------------+

Thanks For Reading


©2020, Founded by NetworkTcpIP.
