mattgadient.com

Maya 2011 and Mental Ray Satellite – always benchmark first! (part 2)

In part 1, I did some testing with a pretty simple scene, where you were looking at multiple frames rendered within a minute – a frame rendered every 10-20 seconds, give or take. The diminishing returns when dishing out a simple scene to be rendered across the network are HUGE. In fact, I did another test with an AMD 6-core machine as the master, and found that repeating the tests from part 1 resulted in a DECREASE in frames rendered per minute regardless of the combination of machines used. Solo was always better with the 6-core, at a steady rate of 7 frames per minute (or one every 8-9 seconds). Adding any slaves reduced the speed to between 5 and 6 frames per minute. 8 combinations were tried in addition to the solo run, and NONE yielded an improvement.

This time I’m looking at a more complex scene (or at least… a more complex render of it). I took the same scene, chose a section of frames to render that tended to be slowest, and cranked up a few settings in the mental ray options. I bumped up the resolution and added motion blur. Now instead of seconds, I was talking 2-3 minutes to render each frame.

The goal was to reduce the effect of network overhead on the results, with more time spent rendering, and a much smaller percent being taken up by Mental Ray Satellite’s network distribution.
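The intuition behind this can be sketched with a toy model (my own simplification, not anything Mental Ray documents): treat each frame as a chunk of render work that splits across machines, plus a fixed distribution overhead that doesn't shrink as machines are added. The numbers below are illustrative assumptions, not measurements:

```python
# Toy throughput model (an illustrative assumption, not measured Mental Ray
# behavior): each frame does `render_s` seconds of work that divides evenly
# across machines, plus a fixed distribution overhead that only appears
# once slaves are involved and does not parallelize.

def frames_per_minute(render_s, overhead_s, machines):
    per_frame = render_s / machines + (overhead_s if machines > 1 else 0.0)
    return 60.0 / per_frame

# Simple scene (part 1): ~9 s/frame. With ~6 s of assumed overhead,
# 4 machines barely beat solo (roughly 7.3 vs 6.7 frames/min).
print(frames_per_minute(9, 6, 1), frames_per_minute(9, 6, 4))

# Complex scene (this part): ~150 s/frame. The same overhead is now a
# rounding error, so 4 machines approach a genuine 3.4x speedup.
print(frames_per_minute(150, 6, 1), frames_per_minute(150, 6, 4))
```

The model is crude (real overhead varies per frame and per machine), but it captures why longer frames should distribute better.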

The results were… interesting.

The same 36 frames were rendered for each run, and I used the results from 35 of them (the first frame's completion served only as the starting timestamp).

Note: ignore the last 4 columns – they’re just data I added to the chart to look for correlation/scaling. Focus on the 4th data column (“improvement as % of solo render”) to see the benefit or penalty of additional machines. I apologize for not simply highlighting that column in the image.
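Concretely, the “improvement as % of solo render” figure is just total-time arithmetic over the 35 measured frames. A minimal sketch with made-up per-frame times (the real timings are only in the chart, so these are placeholders):

```python
# Hypothetical per-frame averages in seconds -- placeholders, not real data.
FRAMES = 35                 # the first of the 36 frames only marks the start
solo_avg = 150.0            # e.g. the master rendering alone
two_slave_avg = 96.0        # e.g. master + both i3 slaves

improvement = (FRAMES * solo_avg) / (FRAMES * two_slave_avg) * 100.0
print(f"{improvement:.0f}% of solo speed")  # 156% with these placeholder times
```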

A few notes as they relate to the above chart:

  • The “Master” was the AMD 6-core, clocked at 3.5 GHz.
  • For the “1-slave” test, the overclocked 2.32 GHz dual-core machine from part 1 was used as the slave.
  • For the “2-slave” test, the i3s from part 1 (3.2 and 3.06 GHz) were used as the slaves. Both are similar performance-wise.
  • For the “3-slave” test, the 3 machines above were used as the slaves.
  • For the “4-slave” test, a 2.26 GHz Core 2 Duo (also from part 1 of the write-up) was added to the above. This additional machine is similar in performance to the machine from the “1-slave” test.

Unfortunately, the test results are riddled with inconsistencies, and you MUST keep that in mind when comparing for your own purposes. The biggest issue is that every machine was different. That means you can’t really compare the results between “1 slave” and “4 slaves” in terms of how much value you’ll get in going from 1 to 4 machines.

Really, you’ve got:

  1. An AMD 6-core machine at the highest clock. No hyperthreading, and an AMD architecture.
  2. Two Intels on the i3 architecture. Both have hyperthreading. Similar in performance to each other.
  3. Two Intels on the Core/Core 2 architecture. No hyperthreading. Similar in performance to each other.

You’d have to be insane to take the “core improvement” and “GHz improvement” columns to heart. They’re there simply for informational purposes, and to look for correlation between the work done and the cores/GHz provided. They’re in no way meant as a set-in-stone “for every X% GHz improvement you will get Y% more frames done” rule.

Looking at the results (even though we know they’re not perfect)

The first thing to note is that we *still* hit a wall where adding enough machines simply makes things worse. 3 slaves gave no improvement, and 4 was slightly worse. This could be due to the Core/Core 2 machines being slower than both the master and the i3s. However, unlike part 1 (where the drop was so bad that it matched the speed of a solo render), we still exceeded the speed of the master rendering solo this time around – so while the extra machines weren’t helpful, they at least didn’t destroy the efficiency of the others.

One interesting thing to note is how well the i3s look to have scaled on paper (the 2-slave test). You had a GHz/core improvement of 160%/167%, and in terms of render speed, you were looking at 156%. Really, not bad. Mind you, hyperthreading may have thrown this off a bit (and made things look better). However, when using a single machine (doing the simple render), the AMD was roughly twice as fast as a single i3. This isn’t mentioned in the data, so you’ll have to take my word for it. In any case, based on that tidbit, you would actually expect the 2-slave test using both i3s to have brought a number closer to 200%. It didn’t quite make that.
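That expectation is simple arithmetic; here it is spelled out (the 0.5 relative i3 speed comes from the “roughly twice as fast” remark, so it’s an estimate, not a measurement):

```python
# Per-machine throughput normalized to the AMD 6-core master = 1.0.
master = 1.0
i3 = 0.5          # the AMD was "roughly twice as fast as a single i3"

# If work split perfectly in proportion to speed, master + two i3 slaves
# would deliver:
expected = (master + 2 * i3) / master * 100
observed = 156    # what the 2-slave test actually delivered

print(f"expected ~{expected:.0f}%, observed {observed}%")
```

The gap between the 200% ideal and the observed 156% is one rough measure of Satellite’s distribution cost for this scene.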

Further tests…?

While I’m done testing for the time being (it’s time-consuming and I’ve used up all the machines available to me), it would certainly be interesting to see what a few other tests show. Tests I would like to run if I had the time/resources:

  1. Testing more than 4 machines, rendering scenes that take a very large base time to render (30 minutes per frame minimum).
  2. All machines (master and slaves) the same spec, testing scaling from 0-slaves all the way up to a large number of slaves. This would help to determine the exact scaling in an ideal environment, and also indicate whether some of the slaves being slower contributed largely to the drop seen during my 4-slave tests (both here and in part 1).
  3. Reducing the threads on the master to leave 1 core free. The goal would be to see if the 1 open core would handle the network distribution, and if so, whether letting it have its own core would result in an overall improvement.
  4. Turning off the option on the master to render on the local machine (only render on the network machines). Goal similar to the above. Trying this on both a fast and a slow master to see what (if any) effect there is, and whether it’s optimal to have a master dedicated solely to distributing the render.
  5. Disabling/enabling hyperthreading on all machines (in a mixed environment similar to the one I’ve tested with) to see if there’s an effect/impact on the network render.
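Whatever combination of tests you run, the decision procedure behind “always benchmark first” is easy to automate: time a handful of representative frames under each setup, and only commit to the one that actually wins. A hypothetical sketch (render_with is a stand-in for whatever launches your render – it is not a Maya or Satellite API):

```python
import time

def benchmark(render_frames, frame_list):
    """Time a short representative batch; render_frames is a stand-in
    for whatever launches your renderer (hypothetical, not a Maya API)."""
    start = time.time()
    render_frames(frame_list)
    return (time.time() - start) / len(frame_list)

def pick_setup(setups, render_with, frames):
    """Return the setup name (e.g. 'solo', '2 slaves') with the lowest
    seconds-per-frame over a small sample, plus all the timings."""
    timed = {name: benchmark(lambda f: render_with(name, f), frames)
             for name in setups}
    return min(timed, key=timed.get), timed
```

In practice, “frames” should be drawn from the slowest part of the sequence (as done above), since that’s where the render-vs-overhead ratio is most favorable to distribution.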

Final Thoughts:

In any case, the benefit of benchmarking before sending out a full-scene satellite render still applies. However, it doesn’t appear to be as critical for longer / more complex scenes – even though utilizing all the machines at the same time wasn’t optimal in my case, it didn’t hamper things as badly as it did for the simple scenes in part 1.

I’m even more curious than ever as to why 3 and 4 slaves gave a poor result. All the machines *were* working, and the effect of network overhead should have been largely reduced.

  • Alec Istomin

    Did you capture the master/slave utilization for non-obvious resources like disk and network? I recommend capturing perfmon counters if you’re running Windows and analyzing them later.
    Did you use Backburner to facilitate the network tests? I’m interested in infrastructure and performance testing, especially distributed – any info on your test setup and process will be very interesting.

    • Alec:

      I didn’t check disk/network resources, but checking perfmon is definitely a good idea. If I get the time for further testing I may give it a shot. One of the other things I’d like to try is to simply limit the threads so that 1 core is free to take care of disk & network I/O (and any other processes going on in the system) and see if it makes any difference. Unfortunately, I don’t think you can tune the slave machines (running Satellite, anyway) in this regard, so if Maya Satellite is choking on I/O on the slaves, there’s not much that can be done short of either forcing processor affinity of the satellite service/process through the OS or (failing that) using a VM.

      In regards to Backburner, I haven’t used or tried it. From what I gather, it probably works similarly to MR standalone (in that I believe it assigns chunks of frames to each machine), as opposed to Maya Satellite, where all machines work on the same frame. I’d assume both Backburner and MR standalone fare a lot better with multiple machines, but I haven’t used either of them (I’ve only used Satellite), so it’s just a guess.

      One thing I probably should have tried is playing around with the task size in the batch render options. If Maya’s just picking really bad task sizes by default, that could be the main issue right there.

  • Zao

    This made me laugh – Mental Ray Satellite has been total (expletive removed) for years, I mean years. Try another renderer like V-Ray / finalRender etc… all of them have FARRRRR better distributed CPU rendering performance, where you’ll actually see the gains of chucking more hardware into the distributed network for rendering. I know because I noticed the same problems with MR years ago and reported it – I keep checking every new Maya release to see if they’ve fixed (expletive removed), but they never do. It’s just funny seeing new people come along and have the same problems, only to be fobbed off thinking it’s something in their network config etc. It’s not…

    Mental ray is just rubbish, and its satellite rendering is beyond crap. I don’t even know why they update it every new release (really, the amount of work they do seems to be just changing the version number and repackaging). It’s got nothing to do with network speed/CPU – the bottleneck is in MR; you’ll see the CPU% usage on your other network nodes is barely anything… the retards who programmed it are just lazy idiots who can’t do (expletive removed) right at all, and tbh I wouldn’t be surprised if they do it like that on purpose to try to sell more MR standalone licenses by crippling Satellite!… In fact, MR gets more attention on 3ds Max, lol, because its main vocal developer who hangs about on CGTalk (expletives removed) is a prick… I forget his name.

  • josie

    i agree with you. mental ray satellite is a joke…..autodesk should be sued.