Technical details
When we first connected 13 GPUs, the system refused to boot. After discussing this with ASUS, we concluded that the GPUs required a larger block of physical address space than the BIOS was able to provide. The 32-bit BIOS can only map PCI devices (including PCI-E devices) below the 4GB boundary, which left at most roughly 3GB of address space for the devices. Because each GPU requires one block of 16MB, one of 32MB and one of 256MB, only 8 or 9 GPUs worked, depending on how many on-board devices we disabled in the BIOS setup. Adding more cards than that caused a boot failure.
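The arithmetic behind this limit can be checked with a short back-of-the-envelope sketch. It uses only the figures quoted above (16MB + 32MB + 256MB per GPU, roughly 3GB usable below 4GB); the variable names are illustrative. The calculation ignores alignment requirements and the address space taken by on-board devices, which is presumably why only 8 or 9 GPUs worked in practice rather than the naive upper bound it computes.

```python
# Back-of-the-envelope check of the address space budget below 4GB.
# Figures come from the text; alignment and on-board devices are ignored.
BLOCKS_PER_GPU_MB = [16, 32, 256]   # the three blocks each GPU requests
AVAILABLE_MB = 3 * 1024             # "roughly 3GB" usable below 4GB (approximate)
NUM_GPUS = 13

per_gpu_mb = sum(BLOCKS_PER_GPU_MB)           # address space per GPU
needed_mb = NUM_GPUS * per_gpu_mb             # address space for all GPUs
naive_max_gpus = AVAILABLE_MB // per_gpu_mb   # upper bound, ignoring alignment

print(f"per GPU: {per_gpu_mb} MB")
print(f"needed for {NUM_GPUS} GPUs: {needed_mb} MB, available: about {AVAILABLE_MB} MB")
print(f"naive upper bound: {naive_max_gpus} GPUs")
```

Even under these optimistic assumptions, 13 GPUs need considerably more than the space available below 4GB.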
ASUS was extremely helpful with solving this, and they provided a custom BIOS for our motherboard that skipped the address space allocation of the GTX295 cards entirely. This is also the reason we have a single GTX275 card in the FASTRA II: it is the one card that is fully initialized by the BIOS and can provide graphics output to the monitor.
With this custom BIOS, the system booted successfully, but without working GTX295 cards since those were not initialized yet. To enable these cards, we modified a Linux 2.6.29.1 kernel (the latest at the time) to allocate physical address space to the GPUs manually. Since the kernel is 64-bit, we could map the large 256MB resource blocks above the 4GB boundary, thereby ensuring there was plenty of room for them. The smaller 16MB and 32MB blocks easily fit below 4GB, where the GPU required them.
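To illustrate the idea of splitting the blocks around the 4GB boundary, here is a minimal sketch of a naturally aligned allocator. It is not the actual kernel modification: the function names are made up for illustration, `LOW_BASE` is a hypothetical start address for a free window below 4GB, and real hardware imposes further constraints (bridge windows, other devices) that are left out.

```python
# Illustrative sketch only: NOT the actual kernel modification.
# It assigns naturally aligned address blocks for the 12 GTX295 GPUs,
# placing the 256MB blocks above 4GB and the 16MB/32MB blocks below it.
MB = 1 << 20
GB = 1 << 30

HIGH_BASE = 4 * GB       # large blocks start at the 4GB boundary
LOW_BASE = 0xC0000000    # hypothetical start of a free window below 4GB
NUM_GPUS = 12            # GPUs on the six GTX295 cards


def align_up(addr, size):
    """Round addr up to a multiple of size (size must be a power of two)."""
    return (addr + size - 1) & ~(size - 1)


def allocate(base, sizes):
    """Assign consecutive, naturally aligned blocks starting at base."""
    blocks, addr = [], base
    # Placing the largest blocks first avoids alignment gaps.
    for size in sorted(sizes, reverse=True):
        addr = align_up(addr, size)
        blocks.append((addr, size))
        addr += size
    return blocks, addr


high_blocks, high_end = allocate(HIGH_BASE, [256 * MB] * NUM_GPUS)
low_blocks, low_end = allocate(LOW_BASE, [16 * MB, 32 * MB] * NUM_GPUS)

for addr, size in high_blocks[:2] + low_blocks[:2]:
    print(f"{addr:#014x}  {size // MB:4d} MB")
print(f"low window ends at {low_end:#x}, below 4GB: {low_end <= 4 * GB}")
```

Under these assumptions the twelve 256MB blocks occupy 3GB starting at the 4GB boundary, while the smaller blocks add up to only 576MB and fit comfortably below it.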
The remaining problem was unexpected: each GPU requires a 4KB block of I/O port space, for which only 64KB is available in total. With low-level system devices and devices such as the network and USB controllers also taking up I/O space, this was a very tight fit. We had to re-map inefficiently allocated system devices and disable as many devices as possible, such as the RAID controller and the second network controller. Later experiments lead us to suspect that it may only be necessary to allocate this 4KB block of I/O ports for the primary VGA controller, but we haven’t verified that.
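The tightness of the I/O port budget follows directly from the numbers above. The sketch below assumes that all 13 GPUs each receive a 4KB block; the variable names are illustrative.

```python
# I/O port budget, using the figures from the text.
IO_SPACE_KB = 64     # total I/O port space
IO_PER_GPU_KB = 4    # block requested per GPU
NUM_GPUS = 13        # assumption: every GPU gets its own 4KB block

gpu_io_kb = NUM_GPUS * IO_PER_GPU_KB     # I/O space used by the GPUs
remaining_kb = IO_SPACE_KB - gpu_io_kb   # what is left for everything else

print(f"GPUs: {gpu_io_kb} KB, remaining for other devices: {remaining_kb} KB")
```

That leaves only a small remainder for all low-level system devices and controllers combined.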
Below is the tree of PCI devices. On the far right you can see the 12 GPUs on the 6 GTX295 cards behind their PCI-E bridges. The first 3 cards are on the first NF200 chipset (together with the GTX275), and the other 3 on the second NF200.
$ lspci -t
-[0000:00]-+-00.0
           +-01.0-[0000:01]--
           +-03.0-[0000:02-10]----00.0-[0000:03-10]--+-00.0-[0000:04]----00.0
           |                                         +-01.0-[0000:05-08]----00.0-[0000:06-08]--+-00.0-[0000:07]----00.0
           |                                         |                                         \-02.0-[0000:08]----00.0
           |                                         +-02.0-[0000:09-0c]----00.0-[0000:0a-0c]--+-00.0-[0000:0b]----00.0
           |                                         |                                         \-02.0-[0000:0c]----00.0
           |                                         \-03.0-[0000:0d-10]----00.0-[0000:0e-10]--+-00.0-[0000:0f]----00.0
           |                                                                                   \-02.0-[0000:10]----00.0
           +-07.0-[0000:11-1e]----00.0-[0000:12-1e]--+-00.0-[0000:13-16]----00.0-[0000:14-16]--+-00.0-[0000:15]----00.0
           |                                         |                                         \-02.0-[0000:16]----00.0
           |                                         +-01.0-[0000:17-1a]----00.0-[0000:18-1a]--+-00.0-[0000:19]----00.0
           |                                         |                                         \-02.0-[0000:1a]----00.0
           |                                         \-02.0-[0000:1b-1e]----00.0-[0000:1c-1e]--+-00.0-[0000:1d]----00.0
           |                                                                                   \-02.0-[0000:1e]----00.0
           +-14.0
           +-14.1
           +-14.2
           +-14.3
           +-1a.0
           +-1a.1
           +-1a.2
           +-1a.7
           +-1c.0-[0000:21]--
           +-1c.1-[0000:20]----00.0
           +-1c.3-[0000:1f]----00.0
           +-1d.0
           +-1d.1
           +-1d.2
           +-1d.7
           +-1e.0-[0000:22]--
           +-1f.0
           +-1f.2
           +-1f.3
           \-1f.5
The list of PCI devices in the system is below. The “nVidia Corporation Unknown device 05e6” is the GTX275, the “nVidia Corporation Unknown device 05e0” devices are the GPUs on the 2-PCB GTX295 cards, and the “nVidia Corporation Unknown device 05eb” devices are the GPUs on the newer 1-PCB GTX295 cards. The remaining NVIDIA devices in the list are PCI bridges.
$ lspci
00:00.0 Host bridge: Intel Corporation X58 I/O Hub to ESI Port (rev 12)
00:01.0 PCI bridge: Intel Corporation X58 I/O Hub PCI Express Root Port 1 (rev 12)
00:03.0 PCI bridge: Intel Corporation X58 I/O Hub PCI Express Root Port 3 (rev 12)
00:07.0 PCI bridge: Intel Corporation X58 I/O Hub PCI Express Root Port 7 (rev 12)
00:14.0 PIC: Intel Corporation X58 I/O Hub System Management Registers (rev 12)
00:14.1 PIC: Intel Corporation X58 I/O Hub GPIO and Scratch Pad Registers (rev 12)
00:14.2 PIC: Intel Corporation X58 I/O Hub Control Status and RAS Registers (rev 12)
00:14.3 PIC: Intel Corporation X58 I/O Hub Throttle Registers (rev 12)
00:1a.0 USB Controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #4
00:1a.1 USB Controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #5
00:1a.2 USB Controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #6
00:1a.7 USB Controller: Intel Corporation 82801JI (ICH10 Family) USB2 EHCI Controller #2
00:1c.0 PCI bridge: Intel Corporation 82801JI (ICH10 Family) PCI Express Port 1
00:1c.1 PCI bridge: Intel Corporation 82801JI (ICH10 Family) PCI Express Port 2
00:1c.3 PCI bridge: Intel Corporation 82801JI (ICH10 Family) PCI Express Port 4
00:1d.0 USB Controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #1
00:1d.1 USB Controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #2
00:1d.2 USB Controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #3
00:1d.7 USB Controller: Intel Corporation 82801JI (ICH10 Family) USB2 EHCI Controller #1
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 90)
00:1f.0 ISA bridge: Intel Corporation 82801JIR (ICH10R) LPC Interface Controller
00:1f.2 IDE interface: Intel Corporation 82801JI (ICH10 Family) 4 port SATA IDE Controller
00:1f.3 SMBus: Intel Corporation 82801JI (ICH10 Family) SMBus Controller
00:1f.5 IDE interface: Intel Corporation 82801JI (ICH10 Family) 2 port SATA IDE Controller
02:00.0 PCI bridge: nVidia Corporation Unknown device 05b1 (rev a3)
03:00.0 PCI bridge: nVidia Corporation Unknown device 05b1 (rev a3)
03:01.0 PCI bridge: nVidia Corporation Unknown device 05b1 (rev a3)
03:02.0 PCI bridge: nVidia Corporation Unknown device 05b1 (rev a3)
03:03.0 PCI bridge: nVidia Corporation Unknown device 05b1 (rev a3)
04:00.0 VGA compatible controller: nVidia Corporation Unknown device 05e6 (rev a1)
05:00.0 PCI bridge: nVidia Corporation Unknown device 05b8 (rev a3)
06:00.0 PCI bridge: nVidia Corporation Unknown device 05b8 (rev a3)
06:02.0 PCI bridge: nVidia Corporation Unknown device 05b8 (rev a3)
07:00.0 3D controller: nVidia Corporation Unknown device 05e0 (rev a1)
08:00.0 VGA compatible controller: nVidia Corporation Unknown device 05e0 (rev a1)
09:00.0 PCI bridge: nVidia Corporation Unknown device 05b8 (rev a3)
0a:00.0 PCI bridge: nVidia Corporation Unknown device 05b8 (rev a3)
0a:02.0 PCI bridge: nVidia Corporation Unknown device 05b8 (rev a3)
0b:00.0 3D controller: nVidia Corporation Unknown device 05e0 (rev a1)
0c:00.0 VGA compatible controller: nVidia Corporation Unknown device 05e0 (rev a1)
0d:00.0 PCI bridge: nVidia Corporation Unknown device 05b8 (rev a3)
0e:00.0 PCI bridge: nVidia Corporation Unknown device 05b8 (rev a3)
0e:02.0 PCI bridge: nVidia Corporation Unknown device 05b8 (rev a3)
0f:00.0 3D controller: nVidia Corporation Unknown device 05eb (rev a1)
10:00.0 VGA compatible controller: nVidia Corporation Unknown device 05eb (rev a1)
11:00.0 PCI bridge: nVidia Corporation Unknown device 05b1 (rev a3)
12:00.0 PCI bridge: nVidia Corporation Unknown device 05b1 (rev a3)
12:01.0 PCI bridge: nVidia Corporation Unknown device 05b1 (rev a3)
12:02.0 PCI bridge: nVidia Corporation Unknown device 05b1 (rev a3)
13:00.0 PCI bridge: nVidia Corporation Unknown device 05b8 (rev a3)
14:00.0 PCI bridge: nVidia Corporation Unknown device 05b8 (rev a3)
14:02.0 PCI bridge: nVidia Corporation Unknown device 05b8 (rev a3)
15:00.0 3D controller: nVidia Corporation Unknown device 05e0 (rev a1)
16:00.0 VGA compatible controller: nVidia Corporation Unknown device 05e0 (rev a1)
17:00.0 PCI bridge: nVidia Corporation Unknown device 05b8 (rev a3)
18:00.0 PCI bridge: nVidia Corporation Unknown device 05b8 (rev a3)
18:02.0 PCI bridge: nVidia Corporation Unknown device 05b8 (rev a3)
19:00.0 3D controller: nVidia Corporation Unknown device 05eb (rev a1)
1a:00.0 VGA compatible controller: nVidia Corporation Unknown device 05eb (rev a1)
1b:00.0 PCI bridge: nVidia Corporation Unknown device 05b8 (rev a3)
1c:00.0 PCI bridge: nVidia Corporation Unknown device 05b8 (rev a3)
1c:02.0 PCI bridge: nVidia Corporation Unknown device 05b8 (rev a3)
1d:00.0 3D controller: nVidia Corporation Unknown device 05e0 (rev a1)
1e:00.0 VGA compatible controller: nVidia Corporation Unknown device 05e0 (rev a1)
1f:00.0 IDE interface: Marvell Technology Group Ltd. 88SE6121 SATA II Controller (rev b2)
20:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 02)
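A listing like this can be tallied automatically. The sketch below counts GPUs per model from `lspci` text, using the device IDs explained above. The function and constant names are illustrative, and `SAMPLE` is only a short excerpt of the listing; in practice the text would come from running `lspci` on the machine.

```python
import re
from collections import Counter

# Device IDs and their meaning, as described in the text.
GPU_IDS = {"05e6": "GTX275", "05e0": "GTX295 (2-PCB)", "05eb": "GTX295 (1-PCB)"}

# Short excerpt of the listing; the full text would come from `lspci`.
SAMPLE = """\
04:00.0 VGA compatible controller: nVidia Corporation Unknown device 05e6 (rev a1)
05:00.0 PCI bridge: nVidia Corporation Unknown device 05b8 (rev a3)
07:00.0 3D controller: nVidia Corporation Unknown device 05e0 (rev a1)
08:00.0 VGA compatible controller: nVidia Corporation Unknown device 05e0 (rev a1)
0f:00.0 3D controller: nVidia Corporation Unknown device 05eb (rev a1)
10:00.0 VGA compatible controller: nVidia Corporation Unknown device 05eb (rev a1)
"""

# Matches only VGA/3D controller lines and captures the 4-digit device ID.
LINE_RE = re.compile(
    r"^[0-9a-f]{2}:[0-9a-f]{2}\.[0-7] (?:VGA compatible|3D) controller: "
    r"nVidia Corporation Unknown device ([0-9a-f]{4})",
    re.MULTILINE,
)


def count_gpus(lspci_text):
    """Count GPU functions per model; bridges and other devices are skipped."""
    ids = LINE_RE.findall(lspci_text)
    return Counter(GPU_IDS[i] for i in ids if i in GPU_IDS)


counts = count_gpus(SAMPLE)
for model, n in sorted(counts.items()):
    print(f"{model}: {n}")
```

Applied to the full listing, this should report one GTX275, eight GPUs on 2-PCB GTX295 cards and four GPUs on 1-PCB GTX295 cards, 13 in total.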
Finally, this is output from the NVIDIA binary kernel module being loaded on boot and recognizing all 13 GPUs.
$ dmesg | grep -i nvidia
nvidia: module license 'NVIDIA' taints kernel.
0000:04:00.0: PCI INT A -> GSI 24 (level, low) -> IRQ 24
nvidia 0000:04:00.0: setting latency timer to 64
nvidia 0000:07:00.0: enabling device (0000 -> 0003)
nvidia 0000:07:00.0: PCI INT A -> GSI 34 (level, low) -> IRQ 34
nvidia 0000:07:00.0: setting latency timer to 64
nvidia 0000:08:00.0: enabling device (0000 -> 0003)
nvidia 0000:08:00.0: PCI INT A -> GSI 36 (level, low) -> IRQ 36
nvidia 0000:08:00.0: setting latency timer to 64
nvidia 0000:0b:00.0: enabling device (0000 -> 0003)
nvidia 0000:0b:00.0: PCI INT A -> GSI 35 (level, low) -> IRQ 35
nvidia 0000:0b:00.0: setting latency timer to 64
nvidia 0000:0c:00.0: enabling device (0000 -> 0003)
nvidia 0000:0c:00.0: PCI INT A -> GSI 24 (level, low) -> IRQ 24
nvidia 0000:0c:00.0: setting latency timer to 64
nvidia 0000:0f:00.0: PCI INT A -> GSI 36 (level, low) -> IRQ 36
nvidia 0000:0f:00.0: setting latency timer to 64
nvidia 0000:10:00.0: enabling device (0000 -> 0003)
nvidia 0000:10:00.0: PCI INT A -> GSI 34 (level, low) -> IRQ 34
nvidia 0000:10:00.0: setting latency timer to 64
nvidia 0000:15:00.0: enabling device (0000 -> 0003)
nvidia 0000:15:00.0: PCI INT A -> GSI 30 (level, low) -> IRQ 30
nvidia 0000:15:00.0: setting latency timer to 64
nvidia 0000:16:00.0: enabling device (0000 -> 0003)
nvidia 0000:16:00.0: PCI INT A -> GSI 39 (level, low) -> IRQ 39
nvidia 0000:16:00.0: setting latency timer to 64
nvidia 0000:19:00.0: PCI INT A -> GSI 37 (level, low) -> IRQ 37
nvidia 0000:19:00.0: setting latency timer to 64
nvidia 0000:1a:00.0: enabling device (0000 -> 0003)
nvidia 0000:1a:00.0: PCI INT A -> GSI 38 (level, low) -> IRQ 38
nvidia 0000:1a:00.0: setting latency timer to 64
nvidia 0000:1d:00.0: enabling device (0000 -> 0003)
nvidia 0000:1d:00.0: PCI INT A -> GSI 39 (level, low) -> IRQ 39
nvidia 0000:1d:00.0: setting latency timer to 64
nvidia 0000:1e:00.0: enabling device (0000 -> 0003)
nvidia 0000:1e:00.0: PCI INT A -> GSI 30 (level, low) -> IRQ 30
nvidia 0000:1e:00.0: setting latency timer to 64
NVRM: loading NVIDIA UNIX x86_64 Kernel Module 190.32 Wed Sep 2 02:23:20 PDT 2009