Follow-up on Colony West

The system has been up and running for about a week now, so it’s time for a follow-up on how it’s performing.

Temperatures

First, I couldn’t be happier with the water cooling loop. The radiators have no problem keeping the GPUs cool. MilkyWay@Home tasks tended to produce the highest temperatures on the cards while air cooled, so seeing temperatures in the low 30s C on water is just phenomenal. That was with the fans running at full voltage, though.

To quiet the system, I used the voltage step-down I already have (read about it here) to take the fans down to around 8 volts. I didn’t realize the voltage was set that low when I plugged it in, but the temperatures only went up a few degrees, hovering around the mid 30s C. The fans are inaudible compared to the fans in the graphics host and graphics enclosure, which is odd considering the fans in both of those chassis are undervolted as well.

Either way, that is phenomenal performance. I’ll see what I can do about the other fan noise later. If it’s the power supplies in either instance, then there’s nothing I can do about it.

Stability

Along with temperatures, I also need to talk about stability, and that is where I’ve had some concerns so far. In short, I think I need to revisit the water blocks. Not to clean them, but to remount them, at least on one of the GTX 660s.

For some reason, one of the GTX 660s does not want to remain stable. The Xid error code thrown by the NVIDIA driver is 62, which is labeled as an “Internal micro-controller halt”. There are three possible causes: hardware error/failure (unlikely), driver error (also unlikely), or thermal issues. With the first two unlikely, a thermal issue is what’s left, and that is what makes me think the block needs to be remounted.
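To confirm which card is throwing the error, the Xid lines can be pulled out of the kernel log. Here is a minimal Python sketch of that. The log line layout in the regular expression is an assumption based on how the driver typically reports Xid events on Linux, and the sample lines are made up for illustration, not taken from Colony West.

```python
import re

# Assumed layout: "NVRM: Xid (PCI:0000:02:00): 62, ..." with the "PCI:"
# prefix being optional (older drivers tend to omit it).
XID_PATTERN = re.compile(r"NVRM: Xid \((?:PCI:)?([0-9A-Fa-f:.]+)\): (\d+)")


def find_xid_events(log_lines):
    """Return a list of (pci_address, xid_code) for every Xid line found."""
    events = []
    for line in log_lines:
        match = XID_PATTERN.search(line)
        if match:
            events.append((match.group(1), int(match.group(2))))
    return events


# Hypothetical sample lines, for illustration only.
sample_log = [
    "[129600.123456] NVRM: Xid (PCI:0000:02:00): 62, 0000(0000) 00000000 00000000",
    "[129601.000000] usb 1-1: new high-speed USB device number 3",
]

for address, code in find_xid_events(sample_log):
    print(f"GPU at {address} reported Xid {code}")
```

In practice the lines would come from something like the saved output of dmesg rather than a hard-coded list, and the PCI address can then be matched against the card’s slot.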

Now if the temperatures on the GPUs are in the mid 30s, how can this be a thermal concern? Simple. The GPU core isn’t the only thing the block cools, so the issue is likely either the memory or the VRMs not making good contact. I just hope I have enough thermal pad material left to remount the block.

What’s odd is the driver crash would happen consistently after about 36 hours of continuous load. So I reconfigured BOINC to take a 15 minute “break” at midnight every day. This alleviated the driver crashes, but it didn’t keep the issues completely at bay. So re-seating the water block is a necessity, meaning I get to test how well I can drain the system.

BOINC

Now for the real meat of the project: distributed computing performance.

Recall that there are four graphics cards: two GTX 680s and two GTX 660s. When the system remains stable, it’s able to clear over 100,000 points in a day.

Folding@Home

After clearing the 5000 ranking for MilkyWay@Home, I decided to turn the system over to Folding@Home to see what kind of performance I could expect. Since this is a headless system, I needed to manually configure the client through the config.xml file. First, I used the --configure command-line option to set up a few basic options, then I manually edited the file to provide four slots, one for each graphics card.

<config>
  <user value='Colony_West'/>
  <team value='0'/>
  <passkey value='********************************'/>
  <smp value='true'/>
  <gpu value='true'/>

  <slot id='0' type='GPU'><gpus v='0' /></slot>
  <slot id='1' type='GPU'><gpus v='1' /></slot>
  <slot id='2' type='GPU'><gpus v='2' /></slot>
  <slot id='3' type='GPU'><gpus v='3' /></slot>
</config>
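Typing out one slot per card gets tedious if the card count changes, so here is a rough Python sketch that generates the slot entries. The element and attribute names (slot, gpus, v) are simply copied from the config above. I haven’t checked them against the client’s documentation, and build_gpu_slots is a made-up helper name.

```python
import xml.etree.ElementTree as ET


def build_gpu_slots(gpu_count):
    """Build one <slot> element per GPU, mirroring the layout shown above."""
    slots = []
    for index in range(gpu_count):
        slot = ET.Element("slot", {"id": str(index), "type": "GPU"})
        # "gpus" and "v" are taken from the config above, not from the
        # client's reference documentation.
        ET.SubElement(slot, "gpus", {"v": str(index)})
        slots.append(slot)
    return slots


# Print the four slot lines, ready to paste inside <config>.
for slot in build_gpu_slots(4):
    print(ET.tostring(slot, encoding="unicode"))
```

The output uses double quotes around attribute values instead of single quotes, which is equivalent as far as XML is concerned.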

So how well does this perform? As of the time I wrote this, the client estimates it can clear 260k points per day (PPD), sometimes estimating over 270k. Given published numbers I’ve seen for the GTX 680 and GTX 660, this likely means the GPUs are being held back a little by either the CPU or memory. The 660s should be able to pull between 100k and 115k PPD combined, while the GTX 680s should be able to net 210k to 225k PPD combined, meaning this setup should be able to pull between 310k and 340k PPD.
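To put a number on the shortfall, here is the arithmetic as a short Python sketch. The only inputs are the per-pair ranges and the client estimates quoted above. The variable names are mine.

```python
# Published per-pair ranges quoted above, in points per day (PPD).
gtx660_pair = (100_000, 115_000)
gtx680_pair = (210_000, 225_000)

# What the client actually estimates on this system.
observed = (260_000, 270_000)

# Expected total is the sum of the low ends and the sum of the high ends.
expected_low = gtx660_pair[0] + gtx680_pair[0]
expected_high = gtx660_pair[1] + gtx680_pair[1]

# Best case: highest observed vs. lowest expected.
smallest_gap = expected_low - observed[1]
# Worst case: lowest observed vs. highest expected.
largest_gap = expected_high - observed[0]

print(f"Expected: {expected_low:,} to {expected_high:,} PPD")
print(f"Shortfall: {smallest_gap:,} to {largest_gap:,} PPD")
```

So even in the best case the system is leaving about 40k PPD on the table, and possibly as much as 80k.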

So the GPUs are being held back, most likely because the system is running on a 10-year-old dual-core CPU. As the Folding@Home FAQ says, GPU tasks are still heavily CPU dependent, though that’s something they aim to change later. That means even a single GPU is going to be held back with the current client.

So for now, Folding@Home is out. Perhaps when the AMD mainboard in either Absinthe or Beta Orionis is freed up, I’ll try Folding@Home again. For now, Colony West will be doing just BOINC.