

Harsh Kapadia's Computer Networking knowledge base.



Table of Contents


Final report/summary: HTML, PDF

Data Center Environment


A typical Data Center cluster

Traffic Properties

Traffic Control Challenges

Network Requirements

A Data Center environment should have:

Protocol Requirements

Some of the important features a Data Center transport protocol should have:

Message vs Packet

Problems with TCP

The following features of TCP cause it problems in the Data Center:

Stream Orientation

Connection Orientation

Fair Scheduling

Sender-Driven Congestion Control

In-Order Packet Delivery

Sender vs Receiver

A peer can be a sender or a receiver and can have multiple RPCs.
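This relationship can be sketched with hypothetical, simplified types (not the module's actual `struct homa_peer`/`struct homa_rpc`): a peer tracks multiple RPCs at once, and for each RPC the host plays either the client role (sender of the request) or the server role (receiver of it).

```c
#include <stddef.h>

/* Hypothetical, simplified records; the module's real structs carry far more state. */
enum rpc_role { RPC_CLIENT, RPC_SERVER }; /* Client sends the request; server receives it. */

struct rpc {
    unsigned long id;   /* RPC identifier. */
    enum rpc_role role; /* Role this host plays for this particular RPC. */
    struct rpc *next;
};

struct peer {
    struct rpc *rpcs;   /* A peer can have multiple RPCs in flight at once. */
};

/* Count the RPCs a peer currently has, in either role. */
size_t peer_rpc_count(const struct peer *p)
{
    size_t n = 0;
    for (const struct rpc *r = p->rpcs; r != NULL; r = r->next)
        n++;
    return n;
}
```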

Packet Types


Homa’s packet types:

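The packet types can be sketched as an enum matching the names in the kernel module's `enum homa_packet_type`; the numeric values shown here are indicative and should be checked against `homa_impl.h` for the commit being studied.

```c
/* Homa's on-the-wire packet types, as named in the kernel module
 * (numeric values indicative, not authoritative). */
enum homa_packet_type {
    DATA     = 0x10, /* Carries message payload (unscheduled or scheduled bytes). */
    GRANT    = 0x11, /* Receiver permits the sender to send more scheduled bytes. */
    RESEND   = 0x12, /* Receiver asks the sender to retransmit a byte range. */
    UNKNOWN  = 0x13, /* Sent when a packet arrives for an RPC unknown to the host. */
    BUSY     = 0x14, /* Sender signals it is still alive, preventing timeouts. */
    CUTOFFS  = 0x15, /* Carries updated unscheduled priority cutoffs. */
    FREEZE   = 0x16, /* Debugging aid: freezes the timetrace. */
    NEED_ACK = 0x17, /* Server asks the client to acknowledge completed RPCs. */
    ACK      = 0x18  /* Client acknowledges completed RPCs so state can be reaped. */
};
```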
Overview of the Homa protocol

Homa’s features:

Message Orientation

Connectionless Protocol


Receiver-Driven Congestion Control

Out-of-Order Packet Tolerance

No Per-Packet Acknowledgements

At-Least-Once Semantics

Design Principles

Homa’s Design Principles:

Linux Internals

How Homa works in Linux:

Working of Homa in the Linux kernel

Streaming vs Messages

NOTE: Message vs Packet

The Problem with TCP Streaming

How Messages Help
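As a toy illustration (not Homa code) of why message boundaries matter: over a TCP byte stream the application must add its own framing, e.g. a length prefix, and reassemble each message out of an undifferentiated byte sequence, whereas a message-oriented protocol hands over each message whole.

```c
#include <string.h>
#include <stdint.h>
#include <stddef.h>

/* Write one length-prefixed message into `buf`; returns bytes consumed. */
size_t frame_write(uint8_t *buf, const char *msg, uint32_t len)
{
    memcpy(buf, &len, sizeof(len));      /* 4-byte length prefix. */
    memcpy(buf + sizeof(len), msg, len); /* Payload bytes. */
    return sizeof(len) + len;
}

/* Recover the next message from the stream; returns bytes consumed. */
size_t frame_read(const uint8_t *buf, char *out, uint32_t *len)
{
    memcpy(len, buf, sizeof(*len));
    memcpy(out, buf + sizeof(*len), *len);
    out[*len] = '\0';
    return sizeof(*len) + *len;
}
```

With message orientation, none of this application-level bookkeeping is needed: the transport itself knows where one message ends and the next begins.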



  • These are the functions that implement the Homa API visible to applications. They are intended to be a part of the user-level run-time library.
  • Source: homa_api.c file

Homa’s API:

Message Sequence Scenarios

NOTE: Sender vs Receiver

Homa’s Message Sequence Scenarios:

Error-Free Communication

Homa message sequence diagram

Scheduled Data Loss

Homa message sequence diagram

Aborted RPC

Homa message sequence diagram

Unscheduled Data Loss

Homa message sequence diagram


Some of the algorithms that Homa implements:

All details as seen in commit 9c9a1ff of the Homa Linux kernel module (31st March 2023).

Sorting RPCs and Peers


/* - The overall Homa state maintains one sorted grantable peer list (`homa_state->grantable_peers`).
 * - Every peer in that list maintains a sorted grantable RPC list (`peer->grantable_rpcs`).
 * - The algorithm below first positions an RPC in its peer's `grantable_rpcs` list and then, due to that change, positions that peer in the `grantable_peers` list.
 * - Logic derived from the Homa kernel module.
 */

position_rpc(homa_state, rpc_to_check) {
	peer_to_check = rpc_to_check->peer;

	while(rpc in peer_to_check->grantable_rpcs) {
		if(rpc->bytes_remaining > rpc_to_check->bytes_remaining) { // RPC with fewest bytes wins.
			// Add `rpc_to_check` before `rpc`.

			// Due to the above change, `peer_to_check` might have to be re-positioned in `homa_state->grantable_peers`.
			position_peer(homa_state, peer_to_check);
			break;
		} else if(rpc->bytes_remaining == rpc_to_check->bytes_remaining) {
			if(rpc->birth > rpc_to_check->birth) { // Tie breaker: Oldest RPC wins.
				// Add `rpc_to_check` before `rpc`.

				// Due to the above change, `peer_to_check` might have to be re-positioned in `homa_state->grantable_peers`.
				position_peer(homa_state, peer_to_check);
				break;
			}
		}
	}
}

position_peer(homa_state, peer_to_check) {
	first_rpc_in_peer_to_check = get_list_head(peer_to_check);

	while(peer in homa_state->grantable_peers) {
		first_rpc_in_peer = get_list_head(peer);

		if(first_rpc_in_peer->bytes_remaining > first_rpc_in_peer_to_check->bytes_remaining) { // Peer whose first RPC has the fewest bytes wins.
			// Add `peer_to_check` before `peer`.
			break;
		} else if(first_rpc_in_peer->bytes_remaining == first_rpc_in_peer_to_check->bytes_remaining) {
			if(first_rpc_in_peer->birth > first_rpc_in_peer_to_check->birth) { // Tie breaker: Peer with the oldest first RPC wins.
				// Add `peer_to_check` before `peer`.
				break;
			}
		}
	}
}

RPC and Peer Sorting Timing

Homa's RPC and peer sorting function call tree

Calculating Scheduled Priorities

Homa's priority level structure


// Logic derived from the Homa kernel module.

set_scheduled_data_priority(homa_state) {
	rank = 0; // Just a counter variable.
	max_grants = 10; // Hard-coded value for the maximum number of grants that can be issued in one function call.
	max_scheduled_priority = homa_state->max_sched_prio; // Fixed integer value set using the `unsched_cutoffs` array.
	num_grantable_peers = homa_state->num_grantable_peers; // No. of peers in the `homa_state->grantable_peers` list.

	while(peer in homa_state->grantable_peers) {
		rank = rank + 1; // 1-based position of this peer in the list.

		priority = max_scheduled_priority - (rank - 1); // Calculates the max. priority possible.

		/* `extra_levels` is calculated to try to later adjust the priority to the lowest possible value for SRPT.
		 * It helps in pushing priorities as low as possible, so that newer, higher-priority messages can be accommodated easily.
		 */
		total_levels = max_scheduled_priority + 1;
		extra_levels = total_levels - num_grantable_peers;

		if(extra_levels >= 0) { // `extra_levels` < 0 implies congestion or too much overcommitment.
			priority = priority - extra_levels; // Adjust priority to the lowest possible value for SRPT.
		}
		if(priority < 0) {
			priority = 0;
		}

		// Assign `priority` to the `GRANT` packet.

		if(rank == max_grants) {
			break;
		}
	}
}
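The per-peer priority computation can be isolated into a runnable sketch (hypothetical helper, not the module's code; e.g. a `max_sched_prio` of 5 gives six levels, 0 through 5):

```c
/* Compute the priority for the peer at 1-based position `rank` in the
 * grantable-peers list. */
int scheduled_priority(int rank, int max_scheduled_priority, int num_grantable_peers)
{
    int priority = max_scheduled_priority - (rank - 1); /* Highest priority possible for this rank. */
    int total_levels = max_scheduled_priority + 1;
    int extra_levels = total_levels - num_grantable_peers;

    if (extra_levels >= 0)        /* Fewer peers than levels: push the priorities */
        priority -= extra_levels; /* down into the lowest band usable for SRPT.  */
    if (priority < 0)
        priority = 0;
    return priority;
}
```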


Calculating Unscheduled Priorities

Unscheduled Priority Setting and Transmission Timing


Calculating rtt_bytes

Calculating GRANT Offset


// Logic derived from the Homa kernel module.

set_data_offset(homa_state) {
	counter = 0;
	max_grants = 10; // Hard-coded value for the maximum number of grants that can be issued in one function call.
	total_bytes_available_to_grant = homa_state->max_incoming - homa_state->total_incoming;

	if(total_bytes_available_to_grant <= 0) {
		return; // No room to grant any more incoming data.
	}

	while(peer in homa_state->grantable_peers) {
		counter = counter + 1;

		first_rpc_in_peer = get_list_head(peer);
		rpc_bytes_received = first_rpc_in_peer->total_msg_length - first_rpc_in_peer->msg_bytes_remaining;
		offset = rpc_bytes_received + homa_state->rtt_bytes; // Keeps an RTT's worth of data in flight, ensuring optimal link utilization.

		if(offset > first_rpc_in_peer->total_msg_length) {
			offset = first_rpc_in_peer->total_msg_length; // Never grant past the end of the message.
		}

		increment = offset - first_rpc_in_peer->total_incoming_bytes;

		if(increment <= 0) {
			continue; // No need to grant the same amount of data as already expected.
		}
		if(total_bytes_available_to_grant <= 0) {
			break; // Nothing left to grant.
		}
		if(increment > total_bytes_available_to_grant) {
			increment = total_bytes_available_to_grant;
			offset = first_rpc_in_peer->total_incoming_bytes + increment;
		}

		first_rpc_in_peer->total_incoming_bytes = offset;
		total_bytes_available_to_grant = total_bytes_available_to_grant - increment;

		// Assign `offset` to the `GRANT` packet.

		if(counter == max_grants) {
			break;
		}
	}
}

/* homa_state->max_incoming = homa_state->max_overcommit * homa_state->rtt_bytes;
 * `max_overcommit` is the maximum number of messages to which Homa will send grants at any given point in time.
 * As seen in the Homa kernel module.
 */
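The per-RPC offset update can be condensed into a runnable sketch (hypothetical helper, not the module's code; the byte counts in the usage below are illustrative):

```c
/* Compute the new GRANT offset for one RPC and deduct the granted bytes
 * from the remaining grant budget `*available`. */
int grant_offset(int total_msg_length, int msg_bytes_remaining,
                 int total_incoming_bytes, int rtt_bytes, int *available)
{
    int received = total_msg_length - msg_bytes_remaining;
    int offset = received + rtt_bytes; /* Keep roughly one RTT of data in flight. */

    if (offset > total_msg_length)
        offset = total_msg_length;     /* Never grant past the end of the message. */

    int increment = offset - total_incoming_bytes;
    if (increment <= 0 || *available <= 0)
        return total_incoming_bytes;   /* Nothing new to grant. */
    if (increment > *available) {      /* Clamp the grant to the remaining budget. */
        increment = *available;
        offset = total_incoming_bytes + increment;
    }
    *available -= increment;
    return offset;
}
```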

Offset and Scheduled Priority Calculation Timing

Homa's offset and scheduled priority calculation function call tree

The callers of the homa_send_grants function are shown in the call tree above.

Calculating Unscheduled Bytes Offset


// Logic derived from the Homa kernel module.

set_unscheduled_byte_offset(rpc) {
	/* Network stack layers: Ethernet(IP(Homa(rpc_msg_data_bytes)))
	 * `mss` = Maximum Segment Size (max. payload of a Homa `DATA` packet).
	 * `mtu` = Maximum Transmission Unit (max. payload of an Ethernet frame).
	 * `data_header_length` = Length of the header of a Homa `DATA` packet.
	 */
	mss = mtu - ip_header_length - data_header_length;

	if(rpc->total_msg_length <= mss) {
		// Message fits in a single packet, so no need for Generic Segmentation Offload (GSO).
		rpc->unscheduled_bytes = rpc->total_msg_length;
	} else {
		gso_pkt_data = pkts_per_gso * mss; // `pkts_per_gso` = No. of Homa `DATA` packets batched into one GSO packet.

		rpc->unscheduled_bytes = rpc->rtt_bytes + gso_pkt_data - 1; // Not clear about the rationale behind this formula.

		// Rounding down to a quantized number of packets.
		extra_bytes = rpc->unscheduled_bytes % gso_pkt_data;
		rpc->unscheduled_bytes = rpc->unscheduled_bytes - extra_bytes;

		if(rpc->unscheduled_bytes > rpc->total_msg_length) {
			rpc->unscheduled_bytes = rpc->total_msg_length;
		}
	}
}
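The rounding logic can be isolated as a runnable sketch (hypothetical helper, not the module's code; the `mss` and `pkts_per_gso` values in the usage below are illustrative):

```c
/* Compute the unscheduled-bytes limit for a message, rounded down to a
 * whole number of GSO packets' worth of data. */
int unscheduled_bytes(int total_msg_length, int rtt_bytes, int mss, int pkts_per_gso)
{
    if (total_msg_length <= mss)
        return total_msg_length;           /* Fits in one packet: send it all unscheduled. */

    int gso_pkt_data = pkts_per_gso * mss; /* Data bytes carried by one GSO packet. */
    int unsched = rtt_bytes + gso_pkt_data - 1;
    unsched -= unsched % gso_pkt_data;     /* Round down to whole GSO packets. */

    if (unsched > total_msg_length)
        unsched = total_msg_length;        /* Never exceed the message length. */
    return unsched;
}
```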

Unscheduled Bytes Offset Calculation Timing