How to design a round robin arbiter

Work Conserving Round Robin Arbiter block diagram

The logic design of an efficient and fast round robin arbiter in Verilog or any other HDL language relies on the capability to find the next requestor to grant without losing cycles and with minimal logical stages. Using the fastest logic constructs like Parallel Prefix Computation (PPC) and the most suitable architecture will result in a fast and efficient arbiter suitable for pipeline integration of a multiple queue structure.

First we need to start with a clear definition of what we actually want the arbiter to do. The round robin arbiter design requirement can be summed up in a list:

  1. One cycle calculation, so the arbiter can grant different requestors in each cycle.
  2. Wraparound functionality, meaning that the arbiter does not loose cycles at the end of each round when moving from a grant to the last active requestor, back to the first one.
  3. Work conserving functionality, so no cycles are lost on requestors that are inactive.


The figure shows a design that covers all those points, given it is implemented using the right logic.

round robin arbiter block diagram

The wrap around functionality, results from the two priority arbiters. Once the masked request vector has no active requests, the mux will cause the grant to be generated from the non-masked priority, which will naturally start at the first requestor. The two priority arbiters implement simple find-first-set priority encoding, returning the first bit which is set in their respective input vectors.

The following figure illustrates the wrap around process for a case where the request vector is stable during the whole time with 3 request bits set. The figure shows a process occurring over 3 cycles and then wraps around back to the first requestor.

Round robin wrap around illustration

In the first cycle, the masked grant vector is all 0, so the mux selects the unmasked grant and therefore grants the first requestor (bit #2). The mask logic calculates the mask for the next cycle, by masking all requestors below the selected one as well as the selected requestor itself, so bits 0-2 are masked and all the rest are enabled.

At the second cycle, the masked request vector which is the result of an AND between the first cycle “Next mask vector” and the request vector is not all 0, so the resulting grant is bit #3 which is the lowest bit set in the “masked request vector”.

Following the same principle, at the third cycle, the selected grant is bit #7, and the Next mask vector has only bit 8 and 9 set.

At the forth cycle, the AND of the request vector and the mask vector result in an all 0 masked request vector so the arbiter uses the unmasked grant vector, returning to grant the first active requestor which is bit #2. The result is a clean wrap around.

The find-first-set priority logic, makes the arbiter “work conserving” meaning that no cycles are lost between consecutive grants.

The more challenging part of the design is to meet the single cycle timing requirement. As shown above, the logical solution is clear, all you need to do is update the mask according to the latest selected grant and you will get the next grant in the following cycle.

As the demonstrated arbiter has combinational path from input to output, it is clear that minimizing this path through the priority arbiters would result in better overall timing, also for the internal path coming from the mask register.

The calculation of the first bit set of a vector, in its basic form is a ripple issue, where each bit of the output is affected by all the preceding bits of the input. For example, if bit #5 is set, the value of the output vector bit #5 would be affected by bits 0-4 but not by bit #7.

The best solutions for fast computation of a find-first-set are Parallel Prefix Computation (PPC) algorithms, which result in low gate count and a low number of logical stages.

Parallel prefix computation (PPC)

PPC OR computation

Parallel prefix computation is a type of parallel calculation serving to reduce the number of stages it takes to calculate a function on multiple inputs by using parallel computations where possible. The example below is for an OR computation resulting in a thermometer decoded vector. From the first bit which is set onwards, all bits in the output vector would be set.

The OR parallel prefix computation enables the arbitration logic to find quickly the first request which is asserted, resulting in the ability to find the next valid requestor with the minimal possible logic cones.

The function of an OR PPC is demonstrated in the following table:

Request vector

PPC output









An additional advantage of using the OR PPC logic for this arbiter is the thermometer output vector which enables the calculation of the next mask register without additional logic (a shift left is only a bit manipulation). This further reduces the required logic.

The grant of the arbiter can be converted from the thermometer decoding into a one-hot encoding using simple shift and XOR logic.


The round robin arbiter is a basic building block of many applications. The design has some important aspects that must be met for the arbiter to be easy for integration and timing closure. Using the PPC logical structure, enables an optimal solution for gate count and timing.

The round robin arbiter logic design, Verilog code and documentation can be found in the RTLery library arbiters section, where you can find the design and Verilog source code for a variety of round robin arbiters such as deficit round robin arbiter, weighted round robin arbiter and more. Please see the arbiters section form more details.


Subscribe to RTLery articles feed:

Leave a comment

Leave a comment