the area of the crossbar, and also avoiding the problem of routing clock signals into the crossbar array.

This implementation can also reduce power. Observe that the longest distance any signal will travel on the horizontal buses is from one side of the small crossbar to the opposite side of the chip. For a 20 mm chip and a 10 mm crossbar centered in the middle of the chip, this distance is 15 mm. On average, signals travel only 7.5 mm; however, the drivers must always charge the entire 20 mm width of the bus. Since the L/S unit always knows whether it is sending signals to the left or right, or receiving signals from the left or right, we can break up the bus at each hardwired crosspoint and charge only the portion of the bus on the side in use. Using this scheme, each driver has to drive only 10 mm of wire on average, which reduces average power by half.

Although this scheme is also possible with the large crossbar, it would require the signal to pass through two switches instead of one, which increases delay. Also, each control signal would have to drive three times as many switches.
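The wire-length figures for the split bus can be checked with a short calculation. The sketch below is illustrative only: the function names are hypothetical, it assumes that left-going and right-going transfers are equally likely, and it takes the charged wire length as a stand-in for power, ignoring driver and switch capacitance.

```python
CHIP_WIDTH_MM = 20.0      # chip width quoted in the text
CROSSBAR_WIDTH_MM = 10.0  # small crossbar, centered on the chip


def longest_run_mm(chip: float = CHIP_WIDTH_MM,
                   xbar: float = CROSSBAR_WIDTH_MM) -> float:
    """Distance from one side of the centered crossbar to the opposite chip edge."""
    margin = (chip - xbar) / 2  # gap between a crossbar edge and the chip edge
    return chip - margin


def charged_length_mm(crosspoint_x: float, direction: str,
                      chip: float = CHIP_WIDTH_MM, split: bool = True) -> float:
    """Wire length one driver charges for a transfer at crosspoint_x.

    crosspoint_x is measured from the left chip edge (assumed convention).
    """
    if direction not in ("left", "right"):
        raise ValueError("direction must be 'left' or 'right'")
    if not split:
        return chip  # unbroken bus: the full width is always charged
    # Split bus: only the segment on the side in use is charged.
    return crosspoint_x if direction == "left" else chip - crosspoint_x


def average_charged_length_mm(crosspoint_x: float,
                              chip: float = CHIP_WIDTH_MM,
                              split: bool = True) -> float:
    """Average over the two directions, assumed equally likely."""
    left = charged_length_mm(crosspoint_x, "left", chip, split)
    right = charged_length_mm(crosspoint_x, "right", chip, split)
    return (left + right) / 2
```

Under these assumptions, `average_charged_length_mm(x)` is 10 mm for any crosspoint position `x`, against 20 mm for the unbroken bus (`split=False`), which matches the factor-of-two reduction stated above.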

4.3 Self Routing Crossbar

Another problem with a single crossbar (Figure 4.1) is the need to route control signals to all of the switches. The control signals originate in the L/S units, so if data is coming from the memory section and the switch it passes through is also near the memory section, the data must wait for the control signal to propagate down the height of the crossbar before the switch is activated. This problem can be alleviated by placing the Load buses close to the vector unit and the Store buses close to the memory sections, thereby overlapping the control signal propagation time with the data signal propagation time.

Another approach is illustrated in Figure 4.4. In this implementation, the control signals are not routed globally but are instead generated locally. Each switch point decodes the appropriate address bits and, if there is a match, activates a tristate buffer that takes the incoming data, arriving on M2, and drives it onto the outgoing M1 line.

Figure 4.4: System diagram for the self routing crossbar. At each switch point, a portion of the address bits is decoded to determine whether the incoming data is driven onto the outgoing line.
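The local decode at each switch point can be sketched behaviorally as follows. This is a minimal model, not the circuit itself: the field position (`shift`), field width (`width`), the use of the outgoing line index as the hardwired switch ID, and the use of `None` to represent a high-impedance output are all assumptions made for illustration.

```python
from typing import List, Optional


def decode_field(address: int, shift: int, width: int) -> int:
    """Extract the portion of the address bits that a switch point examines."""
    return (address >> shift) & ((1 << width) - 1)


def switch_point(address: int, data: int, switch_id: int,
                 shift: int, width: int) -> Optional[int]:
    """Model one switch point with a locally generated control signal.

    On a match, the tristate buffer drives the incoming (M2) data onto the
    outgoing (M1) line; otherwise the output is left undriven (None).
    """
    if decode_field(address, shift, width) == switch_id:
        return data
    return None  # tristate buffer disabled: high impedance


def route(address: int, data: int, num_outputs: int,
          shift: int, width: int) -> List[Optional[int]]:
    """Apply one incoming word to every switch point along its M2 line."""
    return [switch_point(address, data, line, shift, width)
            for line in range(num_outputs)]
```

For example, with a hypothetical 2-bit field at bit position 2, `route(0b1011, 0xAB, 4, shift=2, width=2)` decodes the field as 2 and drives only outgoing line 2. Because every switch point on the line compares the same field against a distinct ID, at most one buffer is enabled per incoming word, so no global control wire is needed.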