

Harsh Kapadia's Computer Networking knowledge base.



Table of Contents


Final report/summary: HTML, PDF

Data Center Environment


A typical Data Center cluster

Traffic Properties

Traffic Control Challenges

Network Requirements

A Data Center environment should have:

Protocol Requirements

Some of the important features a Data Center transport protocol should have:

Message vs Packet

Problems with TCP

The following features of TCP cause it problems in the Data Center:

Stream Orientation

Connection Orientation

Fair Scheduling

Sender-Driven Congestion Control

In-Order Packet Delivery

Sender vs Receiver

A peer can be a sender or a receiver and can have multiple RPCs.
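This relationship can be sketched with hypothetical, simplified types (not the module's actual `struct homa_peer`/`struct homa_rpc`): a peer tracks multiple RPCs at once, and for each RPC the host plays either the client role (sender of the request) or the server role (receiver of it).

```c
#include <stddef.h>

/* Hypothetical, simplified records; the module's real structs carry far more state. */
enum rpc_role { RPC_CLIENT, RPC_SERVER }; /* Client sends the request; server receives it. */

struct rpc {
    unsigned long id;   /* RPC identifier. */
    enum rpc_role role; /* Role this host plays for this particular RPC. */
    struct rpc *next;
};

struct peer {
    struct rpc *rpcs;   /* A peer can have multiple RPCs in flight at once. */
};

/* Count the RPCs a peer currently has, in either role. */
size_t peer_rpc_count(const struct peer *p)
{
    size_t n = 0;
    for (const struct rpc *r = p->rpcs; r != NULL; r = r->next)
        n++;
    return n;
}
```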

Packet Types


Homa’s packet types:

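The packet types can be sketched as an enum matching the names in the kernel module's `enum homa_packet_type`; the numeric values shown here are indicative and should be checked against `homa_impl.h` for the commit being studied.

```c
/* Homa's on-the-wire packet types, as named in the kernel module
 * (numeric values indicative, not authoritative). */
enum homa_packet_type {
    DATA     = 0x10, /* Carries message payload (unscheduled or scheduled bytes). */
    GRANT    = 0x11, /* Receiver permits the sender to send more scheduled bytes. */
    RESEND   = 0x12, /* Receiver asks the sender to retransmit a byte range. */
    UNKNOWN  = 0x13, /* Sent when a packet arrives for an RPC unknown to the host. */
    BUSY     = 0x14, /* Sender signals it is still alive, preventing timeouts. */
    CUTOFFS  = 0x15, /* Carries updated unscheduled priority cutoffs. */
    FREEZE   = 0x16, /* Debugging aid: freezes the timetrace. */
    NEED_ACK = 0x17, /* Server asks the client to acknowledge completed RPCs. */
    ACK      = 0x18  /* Client acknowledges completed RPCs so state can be reaped. */
};
```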
Overview of the Homa protocol

Homa’s features:

Message Orientation

Connectionless Protocol


Receiver-Driven Congestion Control

Out-of-Order Packet Tolerance

No Per-Packet Acknowledgements

At-Least-Once Semantics

Design Principles

Homa’s Design Principles:

Linux Internals

How Homa works in Linux:

Working of Homa in the Linux kernel

Streaming vs Messages

NOTE: Message vs Packet

The Problem with TCP Streaming

How Messages Help
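As a toy illustration (not Homa code) of why message boundaries matter: over a TCP byte stream the application must add its own framing, e.g. a length prefix, and reassemble each message out of an undifferentiated byte sequence, whereas a message-oriented protocol hands over each message whole.

```c
#include <string.h>
#include <stdint.h>
#include <stddef.h>

/* Write one length-prefixed message into `buf`; returns bytes consumed. */
size_t frame_write(uint8_t *buf, const char *msg, uint32_t len)
{
    memcpy(buf, &len, sizeof(len));      /* 4-byte length prefix. */
    memcpy(buf + sizeof(len), msg, len); /* Payload bytes. */
    return sizeof(len) + len;
}

/* Recover the next message from the stream; returns bytes consumed. */
size_t frame_read(const uint8_t *buf, char *out, uint32_t *len)
{
    memcpy(len, buf, sizeof(*len));
    memcpy(out, buf + sizeof(*len), *len);
    out[*len] = '\0';
    return sizeof(*len) + *len;
}
```

With message orientation, none of this application-level bookkeeping is needed: the transport itself knows where one message ends and the next begins.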



  • These are the functions that implement the Homa API visible to applications. They are intended to be a part of the user-level run-time library.
  • Source: homa_api.c file

Homa’s API:

Message Sequence Scenarios

NOTE: Sender vs Receiver

Homa’s Message Sequence Scenarios:

Error-Free Communication

Homa message sequence diagram

Scheduled Data Loss

Homa message sequence diagram

Aborted RPC

Homa message sequence diagram

Unscheduled Data Loss

Homa message sequence diagram


Some of the algorithms that Homa implements:

All details as seen in commit 9c9a1ff of the Homa Linux kernel module (31st March 2023).

Sorting RPCs and Peers


/* - The overall Homa state maintains one sorted grantable peer list (`homa_state->grantable_peers`).
 * - Every peer in that list maintains a sorted grantable RPC list (`peer->grantable_rpcs`).
 * - The algorithm below first positions an RPC in its peer's `grantable_rpcs` list and then, due to that change, positions that peer in the `grantable_peers` list.
 * - Logic derived from the Homa kernel module.
 */

position_rpc(homa_state, rpc_to_check) {
	peer_to_check = rpc_to_check->peer;

	while(rpc in peer_to_check->grantable_rpcs) {
		if(rpc->bytes_remaining > rpc_to_check->bytes_remaining) { // RPC with fewest bytes wins.
			// Add `rpc_to_check` before `rpc`.

			// Due to the above change, `peer_to_check` might have to be re-positioned in `homa_state->grantable_peers`.
			position_peer(homa_state, peer_to_check);
			break;
		} else if(rpc->bytes_remaining == rpc_to_check->bytes_remaining) {
			if(rpc->birth > rpc_to_check->birth) { // Tie breaker: Oldest RPC wins.
				// Add `rpc_to_check` before `rpc`.

				// Due to the above change, `peer_to_check` might have to be re-positioned in `homa_state->grantable_peers`.
				position_peer(homa_state, peer_to_check);
				break;
			}
		}
	}
}

position_peer(homa_state, peer_to_check) {
	first_rpc_in_peer_to_check = get_list_head(peer_to_check);

	while(peer in homa_state->grantable_peers) {
		first_rpc_in_peer = get_list_head(peer);

		if(first_rpc_in_peer->bytes_remaining > first_rpc_in_peer_to_check->bytes_remaining) { // Peer whose first RPC has the fewest bytes wins.
			// Add `peer_to_check` before `peer`.
			break;
		} else if(first_rpc_in_peer->bytes_remaining == first_rpc_in_peer_to_check->bytes_remaining) {
			if(first_rpc_in_peer->birth > first_rpc_in_peer_to_check->birth) { // Tie breaker: Peer with the oldest first RPC wins.
				// Add `peer_to_check` before `peer`.
				break;
			}
		}
	}
}

RPC and Peer Sorting Timing

Homa's RPC and peer sorting function call tree

Calculating Scheduled Priorities

Homa's priority level structure


// Logic derived from the Homa kernel module.

set_scheduled_data_priority(homa_state) {
	rank = 0; // Just a counter variable.
	max_grants = 10; // Hard-coded value for the maximum number of grants that can be issued in one function call.
	max_scheduled_priority = homa_state->max_sched_prio; // Fixed integer value set using the `unsched_cutoffs` array.
	num_grantable_peers = homa_state->num_grantable_peers; // No. of peers in the `homa_state->grantable_peers` list.

	while(peer in homa_state->grantable_peers) {
		rank = rank + 1; // 1-based position of this peer in the list.

		priority = max_scheduled_priority - (rank - 1); // Calculates the max. priority possible.

		/* `extra_levels` is calculated to try to later adjust the priority to the lowest possible value for SRPT.
		 * It helps in pushing priorities as low as possible, so that newer, higher-priority messages can be accommodated easily.
		 */
		total_levels = max_scheduled_priority + 1;
		extra_levels = total_levels - num_grantable_peers;

		if(extra_levels >= 0) { // `extra_levels` < 0 implies congestion or too much overcommitment.
			priority = priority - extra_levels; // Adjust priority to the lowest possible value for SRPT.
		}
		if(priority < 0) {
			priority = 0;
		}

		// Assign `priority` to the `GRANT` packet.

		if(rank == max_grants) {
			break;
		}
	}
}
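The per-peer priority computation can be isolated into a runnable sketch (hypothetical helper, not the module's code; e.g. a `max_sched_prio` of 5 gives six levels, 0 through 5):

```c
/* Compute the priority for the peer at 1-based position `rank` in the
 * grantable-peers list. */
int scheduled_priority(int rank, int max_scheduled_priority, int num_grantable_peers)
{
    int priority = max_scheduled_priority - (rank - 1); /* Highest priority possible for this rank. */
    int total_levels = max_scheduled_priority + 1;
    int extra_levels = total_levels - num_grantable_peers;

    if (extra_levels >= 0)        /* Fewer peers than levels: push the priorities */
        priority -= extra_levels; /* down into the lowest band usable for SRPT.  */
    if (priority < 0)
        priority = 0;
    return priority;
}
```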


Calculating Unscheduled Priorities

Unscheduled Priority Setting and Transmission Timing


Calculating rtt_bytes

Calculating GRANT Offset


// Logic derived from the Homa kernel module.

set_data_offset(homa_state) {
	counter = 0;
	max_grants = 10; // Hard-coded value for the maximum number of grants that can be issued in one function call.
	total_bytes_available_to_grant = homa_state->max_incoming - homa_state->total_incoming;

	if(total_bytes_available_to_grant <= 0) {
		return; // No room to grant any more incoming data.
	}

	while(peer in homa_state->grantable_peers) {
		counter = counter + 1;

		first_rpc_in_peer = get_list_head(peer);
		rpc_bytes_received = first_rpc_in_peer->total_msg_length - first_rpc_in_peer->msg_bytes_remaining;
		offset = rpc_bytes_received + homa_state->rtt_bytes; // Keeps an RTT's worth of data in flight, ensuring optimal link utilization.

		if(offset > first_rpc_in_peer->total_msg_length) {
			offset = first_rpc_in_peer->total_msg_length; // Never grant past the end of the message.
		}

		increment = offset - first_rpc_in_peer->total_incoming_bytes;

		if(increment <= 0) {
			continue; // No need to grant the same amount of data as already expected.
		}
		if(total_bytes_available_to_grant <= 0) {
			break; // Nothing left to grant.
		}
		if(increment > total_bytes_available_to_grant) {
			increment = total_bytes_available_to_grant;
			offset = first_rpc_in_peer->total_incoming_bytes + increment;
		}

		first_rpc_in_peer->total_incoming_bytes = offset;
		total_bytes_available_to_grant = total_bytes_available_to_grant - increment;

		// Assign `offset` to the `GRANT` packet.

		if(counter == max_grants) {
			break;
		}
	}
}

/* homa_state->max_incoming = homa_state->max_overcommit * homa_state->rtt_bytes;
 * `max_overcommit` is the maximum number of messages to which Homa will send grants at any given point in time.
 * As seen in the Homa kernel module.
 */
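The per-RPC offset update can be condensed into a runnable sketch (hypothetical helper, not the module's code; the byte counts in the usage below are illustrative):

```c
/* Compute the new GRANT offset for one RPC and deduct the granted bytes
 * from the remaining grant budget `*available`. */
int grant_offset(int total_msg_length, int msg_bytes_remaining,
                 int total_incoming_bytes, int rtt_bytes, int *available)
{
    int received = total_msg_length - msg_bytes_remaining;
    int offset = received + rtt_bytes; /* Keep roughly one RTT of data in flight. */

    if (offset > total_msg_length)
        offset = total_msg_length;     /* Never grant past the end of the message. */

    int increment = offset - total_incoming_bytes;
    if (increment <= 0 || *available <= 0)
        return total_incoming_bytes;   /* Nothing new to grant. */
    if (increment > *available) {      /* Clamp the grant to the remaining budget. */
        increment = *available;
        offset = total_incoming_bytes + increment;
    }
    *available -= increment;
    return offset;
}
```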

Offset and Scheduled Priority Calculation Timing

Homa's offset and scheduled priority calculation function call tree

The callers of the homa_send_grants function are shown in the call tree above.

Calculating Unscheduled Bytes Offset


// Logic derived from the Homa kernel module.

set_unscheduled_byte_offset(rpc) {
	/* Network stack layers: Ethernet(IP(Homa(rpc_msg_data_bytes)))
	 * `mss` = Maximum Segment Size (max. payload of a Homa `DATA` packet).
	 * `mtu` = Maximum Transmission Unit (max. payload of an Ethernet frame).
	 * `data_header_length` = Length of the header of a Homa `DATA` packet.
	 */
	mss = mtu - ip_header_length - data_header_length;

	if(rpc->total_msg_length <= mss) {
		// Message fits in a single packet, so no need for Generic Segmentation Offload (GSO).
		rpc->unscheduled_bytes = rpc->total_msg_length;
	} else {
		gso_pkt_data = pkts_per_gso * mss; // `pkts_per_gso` = No. of Homa `DATA` packets batched into one GSO packet.

		rpc->unscheduled_bytes = rpc->rtt_bytes + gso_pkt_data - 1; // Not clear about the rationale behind this formula.

		// Rounding down to a quantized number of packets.
		extra_bytes = rpc->unscheduled_bytes % gso_pkt_data;
		rpc->unscheduled_bytes = rpc->unscheduled_bytes - extra_bytes;

		if(rpc->unscheduled_bytes > rpc->total_msg_length) {
			rpc->unscheduled_bytes = rpc->total_msg_length;
		}
	}
}
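The rounding logic can be isolated as a runnable sketch (hypothetical helper, not the module's code; the `mss` and `pkts_per_gso` values in the usage below are illustrative):

```c
/* Compute the unscheduled-bytes limit for a message, rounded down to a
 * whole number of GSO packets' worth of data. */
int unscheduled_bytes(int total_msg_length, int rtt_bytes, int mss, int pkts_per_gso)
{
    if (total_msg_length <= mss)
        return total_msg_length;           /* Fits in one packet: send it all unscheduled. */

    int gso_pkt_data = pkts_per_gso * mss; /* Data bytes carried by one GSO packet. */
    int unsched = rtt_bytes + gso_pkt_data - 1;
    unsched -= unsched % gso_pkt_data;     /* Round down to whole GSO packets. */

    if (unsched > total_msg_length)
        unsched = total_msg_length;        /* Never exceed the message length. */
    return unsched;
}
```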

Unscheduled Bytes Offset Calculation Timing