Programming in Parallel
New programming techniques for parallel processing boost performance of CAD/CAM and simulation software
By Patrick Waurzyniak
Parallel processing in CAD/CAM and simulation has the potential to dramatically speed up compute-intensive tasks in manufacturing. With the latest CAD/CAM and simulation packages, manufacturers now perform parallel and background processing on computers equipped with newer dual-core and multicore microprocessors, resulting in substantially reduced programming times and increased shop-floor productivity.
Performance improvements of up to four times over previous methods have been reported on PCs incorporating dual-core or multicore processors such as the new Intel Core i7, although in many cases software developers say the gains in calculation times will be much more modest.
Parallel processing with multicore computers presents obstacles for developers. A current Intel Core i7 processor contains four cores, or individual processors, presenting eight hardware threads, across which programmers can divide tasks to run simultaneously in parallel, or run them as background processes so the system can continue with other programming work. Graphics-intensive packages like CAD/CAM or simulation/verification software can split up tasks or assign them to run on different cores within the newer processors, and therefore free up computing power to work on other jobs.
In programming, a thread is a part of a program that can execute independently of other parts. Multithreading is the ability of an operating system to run multiple threads at once: on a single core the threads are interleaved by time-division multiplexing, while on a multicore processor they can execute truly simultaneously. A program written for this environment can fork, or split, into two or more threads of execution.
Several CAD/CAM and simulation software developers have released versions of their software with parallel-processing capabilities that claim significant performance gains through the use of both multithreading and background processing of NC toolpath calculations, or simulation and verification of NC programs.
"Multicore programming is very much about analyzing algorithms to identify when parallel operations could take place," notes Colin Jones, PowerMILL software development manager, Delcam plc (Birmingham, UK), developer of the PowerMILL CAD/CAM package. "It's then a case of reorganizing the code and managing or synchronizing the data flow through these parallel branches."
In its latest release of PowerMILL 10, Delcam has developed two techniques that take advantage of the latest multicore processor technology. The first involved rewriting machining algorithms so they could be performed in parallel. The second involved calculating toolpaths in a background process, allowing the user to continue working in PowerMILL in the foreground. Both techniques deliver substantial performance gains. A Delcam developer, Mark Jacobs, recently published a white paper, available at the company's Web site, www.delcam.com, outlining what programmers can expect when implementing these techniques on multicore processors.
"We identified a core algorithm (raster scanning) that's used throughout the majority of toolpath calculations," Jones states. "By multithreading this code, we are able to speed up the majority of toolpath calculations. It's important to stress that this is just the beginning. There are lots of algorithms that will benefit from multithreading—these are being identified and will appear in future releases. It's also important to add that Delcam owns all its machining code so we are free to optimize any areas we can find."
When adding multicore support, Delcam uses a mixture of Microsoft's version of the OpenMP (http://openmp.org) library, and its own multithreading libraries. "The OpenMP libraries are useful when dealing with simple aspects of the code that don't have synchronization issues," Jones says. "For the more complicated areas we use our own code."
One of the core strategies in PowerMILL is a raster strategy. "When the toolpath goes over a part from left to right, we call them raster scans," Jones says. In PowerMILL 10, the code that calculates how a tool runs over the model also uses parallel processing, enabling raster machining calculations to run almost entirely in parallel. Other strategies using this code include constant Z, 3-D offset, area clearance, interleaved constant Z, optimized constant Z, and boundary calculations. In addition, the calculation to apply a toolpath to the stock model runs entirely in parallel, and parallel processing happens automatically if the user's computer is suitable.
"In a normal single-processing operation with Raster, for example, the toolpath is generated sequentially across the part. With PowerMILL 10 we are able to split the operation into segments, which can be machined in parallel with one segment on each core being calculated at the same time," notes Mark Forth, engineering technology product manager. "Those underlying algorithms are actually used throughout a lot of the toolpath calculations. So if we can improve the raster algorithm to work in a multithreaded way, it means that we'll see significant benefits across the board."
Benchmarks on three-axis parts done by Delcam show major speed improvements for raster machining on multicore systems. In its white paper, the company shows that dual-core and multicore PCs equipped with 2–8 GB of RAM delivered shortened processing times. In general, the more cores the better, along with plenty of RAM; some users may also consider 64-bit operating systems, which can address memory beyond the 2-GB per-process maximum of 32-bit systems.
"From the perspective of a user on the shop floor, he wants to generate a toolpath quickly and check out something else," Forth states. "He's going to have a lot more control over that process, rather than wait for something to be calculated and then do something else.
"More importantly, rather than just think in terms of programming, there may be different ways in which CAD/CAM companies interpret multithreading capability. The way that we interpret multithreading or parallelization is that we take one process and try and calculate that on as many cores as possible. But some of our competitors, for example, will take one process and dedicate one core of activity to calculate it; they'll take a second process and calculate that on a second core, so overall toolpaths will be generated faster with both methods. We feel the benefit in our approach, however, is that individual toolpaths will calculate faster, enabling programmers to edit the toolpath once it's complete, or post it ready to be machined. Couple this with the new background processing functionality in PowerMILL 10, where you can perform both foreground and background operations using multithreading, and end users will notice significant efficiency gains."
Complex machining tasks can benefit the most from parallelism, if those toolpaths can be multithreaded to execute on multiple processor cores. "The multicore processors give us a computer that is capable of doing more than one thing at the same time, with more than one central processing unit," notes Bill Gibbs, president, Gibbs and Associates (Moorpark, CA). "In principle, the idea is that if you can do four things at the same time, you ought to be able to complete the task faster. But even the ultimate theoretical gain is less than four times. A computer has to do other things besides just compute.
"The most common other things it does are move data around, access memory, and move information on and off hard drives," he adds. "As it does these things, the computer is constrained by a data bus—another part of a computer's architecture. Further, it's constrained by memory speeds, access times, and memory amounts, because if you have enough memory you don't have to go to the hard drive, and the hard drive is many times slower than RAM. It's a huge factor.
"As we look at CAM software, what we want to look for are things that are slow enough so that people want them to be faster, and things that can be made faster by parallel processing," Gibbs says. "In the world of NC toolpath computation, the most common place to begin with is a very complex three-axis toolpath, and another place that is slow is a very complex five-axis toolpath. These are related things, and when I say big and complex, these usually involve larger parts with great part complexity. Cutting a three-axis toolpath over a sphere is pretty fast. Cutting the mold for the dashboard of a car can get to be a very complex, intricate thing to machine. So, some years ago, GibbsCAM customers wanted to do these tasks faster, and we looked to parallel processing."
In GibbsCAM 2008, the company added some multicore support of background processing for three-axis machining, and Gibbs is currently working on multithreading and background processing for five-axis machining. "We usually use the term multithreaded to indicate a single process that can do multiple things at the same time," Gibbs explains. "Now, many processes cannot be multithreaded, because they have to be performed sequentially—you have to finish step one before you can start step two. But other processes and types of processes lend themselves to multiple threading. Software processes that have been written to be multithreaded can take advantage of multiple cores very easily."
Multicore support in the current GibbsCAM allows starting a very complex three-axis toolpath creation, then starting more similar processes, Gibbs says. "All are running in parallel on a multicore processor, and you can still use GibbsCAM to keep programming other aspects of the part, while these three processes are running simultaneously on a quad core," he says. "In our next release, we're looking to bring the same multiprocessed background programming and quad-core support to our five-axis toolpaths, because this is another area where people wish the toolpath would take less time. We've had very good responses from our customers about our ability to utilize their dual cores and their quad cores, and I'm sure next year we're going to have 16-core machines, as the next hot hardware thing becomes ever bigger and faster.
"For customers who are doing large complex programs, where creating the toolpath is slow, parallel-processing multicores have the potential to dramatically and significantly speed up those processes by allowing the CAM software to do more than one thing at a time. If your CAM software and what you're using it for is already fast enough, then you probably won't notice much difference."
Large, complex jobs with higher accuracies can benefit greatly from multicore systems using more than 2 GB of memory, notes Chuck Mathews, vice president, DP Technology Corp. (Camarillo, CA), which added multicore support for its Esprit 2009 CAD/CAM package. "We went in and looked at where a lot of time is being used, and it was in the three-axis toolpath generation and five-axis toolpath generation. Those two areas are where there's a lot of calculation time; we rewrote all of those machining cycles that calculate the toolpaths, and parallelized them, which allows them to run on two, four, or eight processors, and on average, we have a gain of 36% each time you double the number of processors."
Memory requirements vary by workpiece size and tolerance requirements, but larger pieces at high tolerances can benefit not just from multicore but also from going to the higher memory levels offered with a 64-bit operating system, Mathews adds. "If it's a large workpiece at a very high tolerance, the memory requirements are much larger than a small piece at low tolerance," he says. "In the more-complex applications, we actually need to go to 64-bit computers. Geometry's becoming more complex, and people are asking for tighter accuracies. The other area where it's very significant also is in the area of simulation, where users want to visualize the toolpath on the screen and the material removal processes—there's a big demand for making that visualization faster as well, which has not only to do with multicore, but it also has to do with graphics cards and graphics engines. In both areas, there's still an unmet demand. Customers would be happy if we were 10 times faster than we are today."
Simulation/verification software like Vericut from CGTech Corp. (Irvine, CA) potentially can greatly benefit from parallel processing with multicore systems, but only under certain circumstances, according to Bill Hasenjaeger, CGTech product manager. "Application software does not itself specifically invoke the use of multiple processors, also known as multiple-core processors, on a computer," Hasenjaeger points out. "An application program may use 'process threads' to run specific processing tasks. The operating system can decide to assign a process thread to a processor. If multiple processors are available, and if more than one process thread can run concurrently, then the operating system, typically Windows, may choose to assign them to different processors. The operating system typically assigns process threads to processors based on the current workload of the processors available.
"The challenge for an application program, especially a user-interactive program such as a CAD/CAM application or simulation program like Vericut, is to find multiple computation tasks that can run independent of each other; run at the same time as each other; and run concurrently and independent of each other long enough to overcome the operating system overhead required to manage multiple process threads," Hasenjaeger states. "If all three of the above requirements cannot be met, then the application program cannot keep multiple processors busy at the same time, and using multiple threads may actually result in running more slowly than a single process thread, due to the thread-management overhead."
Many CAD/CAM systems are programmed in C++, and developers use OpenMP to help write multithreading-enabling algorithms for multicore. Vericut is a Java application, and it uses the process-thread feature in Java to take advantage of multiple concurrent process threads, Hasenjaeger notes. "However, the big implementation challenge is not the standard or the language, but finding processing conditions that can truly execute faster by running concurrent processes on multiple processors. Vericut uses Java's ability to create process threads, then execute C code within those threads. We've done this since 2002."
To fully use multicore systems, Hasenjaeger notes, the operating system must manage multiple process threads, which requires some computer resources, generally referred to as overhead. "To overcome the overhead of multiple concurrent process threads, the concurrent processes must run long enough, and be compute-intensive enough, to benefit from running each on its own processor. If not, the resources needed for the overhead eliminate any processing speed gained by using multiple processors," he says. "Additionally, processes in a user-interactive application must synchronize frequently, which means even though one process may run very quickly on its own processor, it still has to wait for one or more other process threads to complete before continuing.
"Between waiting for other threads to sync and the overhead of managing process threads, the gains from multiple processors are often neutralized. It is obviously not possible to get anything near two times faster by using two processors versus one processor. We have seen improvements in the range of 10–30% when using multiple processors with Vericut," Hasenjaeger observes. "However, it seems that by far the biggest gains from multiple-processor computers are to be had by people who run multiple software applications concurrently and frequently switch between them. For example, our users typically run a large CAD/CAM program, like CATIA, NX, or Pro/Engineer, simultaneously with Vericut, plus an e-mail tool such as Outlook, a Web browser, and maybe a couple of other utility programs. With a multiprocessor computer, each application program responds much better because of more available processing resources. One application program does not have to contend with another program for compute resources."
Programming and debugging multithreaded programs is much more complex, notes Joe Dionise, director, R&D, for Delmia (Auburn Hills, MI), a unit of Dassault Systèmes SA (Paris). "With multithreaded applications, the programmer must account for race conditions and deadlocks, which occur when multiple threads access the same data concurrently," Dionise says. "Access to global data must be carefully serialized to prevent data corruption and inconsistent results. To prevent such errors, it's important to provide complete multithreading documentation and training. Further, we recommend minimizing the number of developers who have to write thread-safe code, by providing data-management components."
Because Delmia's manufacturing software is written primarily in C++, which does not provide native multithreading capabilities, Dionise adds, Dassault Systèmes has developed portable C++ threading classes for client applications that run on Microsoft Windows and UNIX (such as IBM AIX and Sun Solaris); the classes use Win32 threads on Windows and POSIX threads (pthreads) on UNIX.
"In this way, the same application code can be used on all platforms," Dionise says. "The Delmia Manufacturing Hub is deployed only on Microsoft Windows Server, so it uses Win32 functions directly. In general, Dassault Systèmes software is designed to take advantage of multiple processors. Client applications can use multithreading to load multiple parts simultaneously, improve the performance of certain rendering algorithms, manage background client/server communication, and perform finite-element analysis more efficiently.
"In addition, Delmia clients use a multithreaded kernel to simulate complex factory models in near real-time. By default, 3-D rendering is executed on the main thread, while each factory resource [human, robot, machine, conveyor] is simulated on a 'worker' thread. To accurately model the motion of a resource, several thousand floating-point operations must be performed during each simulation step. On a multicore computer, many resources can be simulated concurrently with good interactive response."
Dionise also notes that the Delmia Manufacturing Hub uses multithreading techniques to efficiently service client requests (such as load/save models, query objects, apply configuration filters, propagate database changes). "Customers typically deploy the Manufacturing Hub on a Windows Server platform with two to four processors," he adds. "Most application servers, such as the Manufacturing Hub, use multithreaded programs to process client requests. In the digital manufacturing domain, Delmia is unique in its use of multithreading for 3-D factory simulation."
Multithreaded applications require a system architecture that is "thread safe," notes Steven Mastrangelo, director of software development, CNC Software Inc. (Tolland, CT), which recently added multithreading capability to its Mastercam X4 release. "This basically means that the processing modules that perform the heavy math modeling, toolpath generation, graphics, and database work have to be able to exist in a processing thread without accessing outside user interaction controls and data in the main processing thread, and cross-accessing each other," Mastrangelo says. "When multithread processing, care must be taken to ensure that a process occurring in one thread doesn't 'get ahead' of a parallel process that must be completed first, otherwise the system falls apart. With CAD/CAM, there are so many interdependencies that the system has to be architected with this control arrangement in mind."
"One of the difficulties of turning several processes loose to calculate an NC job is the in-process workpiece dependencies that relate roughing cuts to re-roughing cuts to finishing cuts," notes Vynce Paradise, of Siemens PLM Software (Plano, TX), developer of NX CAM software. "NX has always handled the intelligent progression of a workpiece through the programming process, so it was important that we make sure that these relationships were not jeopardized by making the processing of many toolpaths more parallel in nature.
"NX toolpath calculation can be performed separately from the interactive programming session on any computer architecture, so the implementation is as flexible as possible," Paradise adds. "The more resources the machine has at its disposal, the more calculations can be performed simultaneously. The advantage to our implementation is that the user can request any machining operation to be submitted for processing while he or she carries on programming, maintaining a flexible and interactive system.
"The productivity advantage that is offered by multi-CPU machines is apparent, so it is not surprising that CAM vendors are pursuing it," he notes. "Our approach with NX is to virtually eliminate the question of calculation speed by providing a parallel process for calculation and further interactive NC programming. Of course, we maintain the intelligence of the system in terms of the inter-operation dependencies. By eliminating wait times and maintaining relationships between operations, we've found that it's possible to have overall throughput improvements of more than 2x. The programmer can carry on working while the system calculates toolpaths. The benefit and convenience and overall time saving can be very valuable, especially on big jobs such as large mold-and-die programs."
This article was first published in the October 2009 edition of Manufacturing Engineering magazine.