Maya 2011 and Mental Ray Satellite – always benchmark first! (part 2)

In part 1, I did some testing with a pretty simple scene, where you were looking at multiple frames rendered per minute – a frame every 10-20 seconds, give or take. The diminishing returns when dishing out a simple scene to be rendered across the network are HUGE. In fact, I did another test with an AMD 6-core machine as the master, and found that repeating the tests from part 1 resulted in a DECREASE in frames rendered per minute regardless of the combination of machines used. Solo was always better with the 6-core, at a steady rate of 7 frames per minute (or one every 8-9 seconds). Adding any slaves dropped the rate to 5-6 frames per minute. Eight combinations were tried in addition to the solo method, and NONE yielded an improvement.

This time I’m looking at a more complex scene (or at least… a more complex render of it). I took the same scene, chose a section of frames to render that tended to be slowest, and cranked up a few settings in the mental ray options. I bumped up the resolution and added motion blur. Now instead of seconds, I was talking 2-3 minutes to render each frame.
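For the curious, this kind of settings bump can also be scripted rather than clicked through the Render Settings. Below is a rough Maya Python sketch. The resolution values are just placeholders (not the exact settings I used), and the mental ray attribute name is from memory, so verify it against your own version before trusting it.

```python
# Rough sketch (Maya Python) of the settings bump described above.
# Placeholder values; the miDefaultOptions attribute name is from memory
# and may differ between Maya/mental ray versions.
import maya.cmds as cmds

# Raise the output resolution on the shared defaultResolution node.
cmds.setAttr("defaultResolution.width", 1920)
cmds.setAttr("defaultResolution.height", 1080)

# Turn on mental ray motion blur (the node exists once mental ray is loaded;
# the enum is roughly 0 = off, 1 = linear/no deformation, 2 = full).
if cmds.objExists("miDefaultOptions"):
    cmds.setAttr("miDefaultOptions.motionBlur", 2)
```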

The goal was to reduce the effect of network overhead on the results: with more time spent actually rendering, a much smaller percentage of each frame’s time would be taken up by Mental Ray Satellite’s network distribution.

The results were… interesting.

The same 36 frames were rendered for each run, and I used the results from 35 of them (the 1st served only as the starting time stamp).

Note: ignore the last 4 columns – they’re just data I’d added to the chart to look for correlation/scaling. Focus on the 4th data column (“improvement as % of solo render”) to see the benefit (or penalty) of additional machines. I apologize for not simply highlighting that column in the image.
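For what it’s worth, that column is just simple arithmetic on the frame-completion times. Here’s a quick Python sketch of the calculation, using made-up timestamps rather than my actual data:

```python
# Toy sketch of the math behind "improvement as % of solo render".
# The timestamps below are made up for illustration; they are NOT my test data.

def frames_per_minute(finish_times_sec):
    """Rate over a run. The first frame's finish time is used only as the
    starting stamp, so a 36-frame run yields 35 measured frames."""
    elapsed_min = (finish_times_sec[-1] - finish_times_sec[0]) / 60.0
    return (len(finish_times_sec) - 1) / elapsed_min

solo_run  = [0, 160, 330, 480, 650]   # fake frame-completion times (seconds)
slave_run = [0, 100, 210, 310, 420]   # fake times for a run with slaves added

improvement = frames_per_minute(slave_run) / frames_per_minute(solo_run) * 100
print("Render rate is %.0f%% of the solo rate" % improvement)
```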

A few notes as they relate to the above chart:

  • The “Master” was the AMD 6-core, clocked at 3.5 GHz.
  • For the “1-slave” test, the overclocked 2.32 GHz dual-core machine from part 1 was used as the slave.
  • For the “2-slave” test, the i3s from part 1 (3.2 and 3.06 GHz) were used as the slaves. The two are similar performance-wise.
  • For the “3-slave” test, the 3 machines from above were used as the slaves.
  • For the “4-slave” test, a 2.26 GHz Core 2 Duo (also from part 1 of the write-up) was added to the above. This additional machine is similar in performance to the machine from the “1-slave” test.

Unfortunately, the test results are riddled with inconsistencies, and you MUST keep that in mind when comparing for your own purposes. The biggest issue is that every machine was different. That means you can’t really compare the results between “1 slave” and “4 slaves” in terms of how much value you’ll get in going from 1 to 4 machines.

Really, you’ve got:

  1. An AMD 6-core machine at the highest clock of the bunch. No hyperthreading, and an AMD architecture.
  2. Two Intel machines on the i3 architecture. Both have hyperthreading, and they’re similar in performance to each other.
  3. Two Intel machines on the Core/Core 2 architecture. No hyperthreading, and they’re also similar in performance to each other.

You’d have to be insane to take the “core improvement” and “GHz improvement” numbers to heart. They’re there simply for informational purposes, and to look for correlation between the work done and the cores/GHz provided. They’re in no way meant as a set-in-stone “for every X% GHz improvement you will get Y% more frames done” rule.

Looking at the results (even though we know they’re not perfect)

The first thing to note is that we *still* hit a wall in terms of improvement, where things simply get worse once enough machines are added. 3 slaves gave no improvement, and 4 was slightly worse. This could be due to the two Core/Core 2 machines being slower than both the master and the i3s. However, unlike part 1 (where the drop was so bad that it matched the speed of a solo render), we still exceeded the speed of the Master rendering solo this time around, so while adding those machines wasn’t helpful, it at least didn’t destroy the efficiency of the others.

One interesting thing to note is how well the i3s look to have scaled on paper (the 2-slave test). You had a GHz/core improvement of 160%/167%, and in terms of render speed, you were looking at 156%. Really, not that bad. Mind you, hyperthreading may have thrown this off a bit (and made things look better than they are). However, when using a single machine (doing the simple render), the AMD was roughly twice as fast as a single i3. This isn’t mentioned in the data, so you’ll have to take my word for it. In any case, based on that tidbit, you would actually expect the 2-slave test using both i3s to have brought a number closer to 200%. It didn’t quite make that.
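As a quick aside on where the 160%/167% figures come from: assuming both i3s are dual-core parts (consistent with the hyperthreading note above, but an assumption on my part here), the back-of-the-envelope math works out roughly like this:

```python
# Rough reconstruction of the "GHz improvement" / "core improvement" figures
# for the 2-slave test. The i3 core counts are an assumption (dual-core with
# hyperthreading); only physical cores are counted.

master_cores, master_ghz = 6, 3.5      # AMD 6-core master at 3.5 GHz
slaves = [(2, 3.2), (2, 3.06)]         # the two i3 slaves: (cores, GHz)

total_cores = master_cores + sum(c for c, _ in slaves)
total_ghz = master_cores * master_ghz + sum(c * g for c, g in slaves)

print("core improvement: %.0f%%" % (100.0 * total_cores / master_cores))              # ~167%
print("GHz improvement:  %.0f%%" % (100.0 * total_ghz / (master_cores * master_ghz)))  # ~160%
```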

Further tests…?

While I’m done testing for the time being (it’s time-consuming, and I’ve used up all the machines available to me), it would certainly be interesting to see what a few other tests show. The tests I would like to run if I had the time/resources are as follows:

  1. Testing more than 4 machines, rendering scenes that take a very large base time to render (30 minutes per frame minimum).
  2. All machines (master and slaves) the same spec, testing scaling from 0 slaves all the way up to a large number of slaves. This would help determine the exact scaling in an ideal environment, and also indicate whether some of the slaves being slower was a large contributor to the drop seen during my 4-slave tests (both here and in part 1).
  3. Reducing the threads on the master to leave 1 core free. The goal would be to see if the 1 open core would handle the network distribution, and if so, whether letting it have its own core would result in an overall improvement.
  4. Turning off the option on the master to render on the local machine (only rendering on the network machines). Goal similar to the above. I’d try this on both a fast and a slow master to see what (if any) effect there is, and whether it’s optimal to have a master dedicated solely to distributing the render.
  5. Disabling/enabling hyperthreading on all machines (in a mixed environment similar to the one I’ve tested with) to see what impact, if any, it has on the network render.

Final Thoughts:

In any case, the benefit of benchmarking before sending out a full-scene satellite render still applies. However, it doesn’t appear to be as critical for longer/more complex scenes – even though utilizing all the machines at the same time wasn’t optimal in my case, it didn’t hamper things as badly as it did for the simple scenes in part 1.

I’m more curious than ever as to why 3 and 4 slaves gave such a poor result. All the machines *were* working, and the effect of network overhead should have been greatly reduced.