Presented by 北京福思营销顾问有限公司
High-Performance Computing for Finite Element and Fluid Simulation
Best Practice Cases
杨柱 (Yang Zhu), Senior Fluid Engineer
Agenda
• HPC Terminology
• ANSYS Workflow
• Hardware Considerations
• Modelling guidelines
HPC Hardware Terminology
[Diagram: Machine 1 (Node 1) through Machine N (Node N), each with Processor 1 and Processor 2 (Socket 1 and Socket 2) plus a GPU, connected by an interconnect (GigE or InfiniBand).]
Shared Memory Parallel
• Shared Memory Parallel (SMP) systems share a single global memory image that may be distributed physically across multiple cores, but is globally addressable.
• OpenMP is the industry standard (see the sketch below).
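As a concrete illustration of shared-memory parallelism with OpenMP, here is a minimal, hypothetical C sketch (not ANSYS source code): every thread on the machine updates the same globally addressable array.

/* smp_demo.c - minimal OpenMP sketch (illustrative only, not ANSYS code).
 * All threads of one machine work on a single shared array.
 * Build: gcc -fopenmp -O2 smp_demo.c -o smp_demo
 */
#include <omp.h>
#include <stdio.h>

int main(void)
{
    enum { N = 1000000 };
    static double x[N];          /* one global memory image, visible to all threads */

    #pragma omp parallel for     /* loop iterations are split across the cores */
    for (int i = 0; i < N; ++i)
        x[i] = 2.0 * i;

    printf("max threads: %d, x[N-1] = %.1f\n", omp_get_max_threads(), x[N - 1]);
    return 0;
}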
Distributed Memory Parallel
• Distributed memory parallel processing (DMP) assumes that physical
memory for each process is separate from all other processes.
• Parallel processing on such a system requires some form of message
passing software to exchange data between the cores.
• MPI (Message Passing Interface) is the industry standard for this (see the sketch below).
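A matching, hypothetical MPI sketch in C (again, not ANSYS code): each rank owns its own memory, and partial results are combined explicitly by message passing.

/* dmp_demo.c - minimal MPI sketch (illustrative only, not ANSYS code).
 * Each process keeps a private partial sum in its own memory and the
 * pieces are combined by passing messages.
 * Build: mpicc -O2 dmp_demo.c -o dmp_demo
 * Run:   mpirun -np 4 ./dmp_demo
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nproc;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    double partial = 0.0;                       /* lives in this rank's memory only */
    for (int i = rank; i < 1000000; i += nproc)
        partial += 2.0 * i;

    double total = 0.0;                         /* message passing combines the results */
    MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("total from %d ranks: %.1f\n", nproc, total);

    MPI_Finalize();
    return 0;
}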
Typical HPC Growth Path
[Diagram: HPC growth path spanning Desktop Users, Workstation and/or Server Users, Cluster Users, and a Cloud Solution.]
Guidelines:
• Know your hardware lifecycle.
• Have a goal in mind for what you want to achieve.
• Use licensing productively.
• Use ANSYS-provided processes effectively.
Understanding the effect of clock speed
- ANSYS CFD
Impact of CPU Clock on Application Performance
Processor: Xeon X5600 Series
Hyper Threading: OFF, TURBO: ON
Active cores: 12/node; Memory speed: 1333 MHz
(performance measure is improvement relative to CPU Clock 2.66 GHz)
[Chart: improvement due to clock speed for ANSYS/FLUENT models eddy_417K, aircraft_2M, turbo_500K, sedan_4M, and truck_14M at 2.66, 2.93, and 3.47 GHz; higher is better.]
Understanding the effect of clock speed
- ANSYS Mechanical
• Effect of increased core operating frequencies on the DMP
benchmarks running on 8 cores
• Influence is highest for sparse solver benchmarks
Using higher clock speed is always
helpful to realize productivity gains
Understanding the effect of memory bandwidth
- Is 24 Cores Equal to 24 Cores?
[Diagram: 3 nodes x (2 sockets x 4 cores) of Xeon X5570 = 24 cores versus 2 nodes x (2 sockets x 6 cores) of Xeon X5670 = 24 cores.]
Consider memory per core! (A rough calculation follows.)
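As a rough illustration, assuming three DDR3 memory channels per socket on these Xeon 5500/5600-series processors: the 3-node X5570 layout provides 3 x 2 x 3 = 18 memory channels feeding 24 cores (0.75 channels per core), while the 2-node X5670 layout provides 2 x 2 x 3 = 12 channels feeding the same 24 cores (0.5 channels per core), i.e. roughly 50% more memory bandwidth per core in the 3-node configuration.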
Understanding the effect of memory bandwidth
- Is 16 Cores Equal to 16 Cores?
[Diagram: 16 cores as 2 nodes x (2 sockets x 4 cores) of Xeon X5570 versus 16 cores as 2 nodes x (2 sockets x 4 cores) of Xeon X5670.]
Using fewer cores per node can be helpful to realize productivity gains.
Understanding the effect of memory
speed
- ANSYS CFD
• We can see here the effect of memory speed.
• This has implications for how you build your hardware.
• Some processor types have slower memory speeds by default.
• On other processors, non-optimally filling the memory channels can slow the memory speed.
Impact of DIMM speed on ANSYS/FLUENT Application Performance
(Intel Xeon x5670, 2.93 GHz)
Hyper Threading: OFF, TURBO: ON
Active threads per node: 12
(performance measure improvement is relative to memory speed of 1066 MHz)
[Chart: impact of memory speed (1066 MHz vs. 1333 MHz) on ANSYS/FLUENT models eddy_417K, turbo_500K, aircraft_2M, sedan_4M, and truck_14M.]
Using higher memory speed can be
helpful to realize productivity gains
Understanding the effect of memory
speed
- ANSYS Mechanical
• Memory speed matters for ANSYS Mechanical as well: some processor types have slower memory speeds by default, and on others non-optimally filled memory channels can reduce the effective memory speed.
Impact of Memory Speed on Benchmark Speed
Using higher memory speed can be
helpful to realize productivity gains
Turbo Boost (Intel) / Turbo Core (AMD)
- ANSYS CFD
• Turbo Boost (Intel) / Turbo Core (AMD) is a form of over-clocking that raises the clock frequency of individual cores when others are idle.
• On Intel processors we have seen variable performance with this, ranging from 0-8% improvement depending on the number of cores in use.
• The graph below shows CFX on an Intel X5550; it sees a maximum improvement of only 2.5%.
Turbo Boost (Intel) / Turbo Core (AMD)
- ANSYS Mechanical
• Effect of Turbo Boost on the SMP benchmarks using 1, 2, 4
and 8 out of 8 physical cores of 1 node
• Turbo Boost most efficient for the lower core counts
[Chart: impact of Turbo Boost on speed versus number of cores.]
Using Turbo Boost / Core can be
helpful to realize productivity gains
- particularly for lower core counts
Hyper-threading – ANSYS Fluent
Evaluation of Hyperthreading on ANSYS/FLUENT Performance
iDataplex M3 (Intel Xeon x5670, 2.93 GHz)
TURBO: ON
(measurement is improvement relative to Hyper-Threading OFF)
[Chart: improvement due to Hyper-Threading for ANSYS/FLUENT models eddy_417K, turbo_500K, aircraft_2M, sedan_4M, and truck_14M; HT OFF (12 threads on 12 physical cores) vs. HT ON (24 threads on 12 physical cores); higher is better.]
Hyper-threading – ANSYS Mechanical
Hyper-threading is NOT recommended
Understanding the effect of the interconnect
• Need fast interconnects to feed fast processors
– Two main characteristics for each interconnect: latency and bandwidth (a small measurement sketch follows the listing below)
– Distributed ANSYS is highly bandwidth bound
+--------- D I S T R I B U T E D A N S Y S S T A T I S T I C S ------------+
Release: 14.5 Build: UP20120802 Platform: LINUX x64
Date Run: 08/09/2012 Time: 23:07
Processor Model: Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz
Total number of cores available : 32
Number of physical cores available : 32
Number of cores requested : 4 (Distributed Memory Parallel)
MPI Type: INTELMPI
Core Machine Name Working Directory
----------------------------------------------------
0 hpclnxsmc00 /data1/ansyswork
1 hpclnxsmc00 /data1/ansyswork
2 hpclnxsmc01 /data1/ansyswork
3 hpclnxsmc01 /data1/ansyswork
Latency time from master to core 1 = 1.171 microseconds
Latency time from master to core 2 = 2.251 microseconds
Latency time from master to core 3 = 2.225 microseconds
Communication speed from master to core 1 = 7934.49 MB/sec Same machine
Communication speed from master to core 2 = 3011.09 MB/sec QDR Infiniband
Communication speed from master to core 3 = 3235.00 MB/sec QDR Infiniband
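As a rough way to check latency and bandwidth figures like those above on your own cluster, here is a minimal, hypothetical MPI ping-pong sketch in C (not part of ANSYS): it times round trips between rank 0 and rank 1 for one small and one large message.

/* pingpong.c - minimal MPI ping-pong sketch (illustrative only).
 * Small messages estimate latency, large messages estimate bandwidth.
 * Build: mpicc -O2 pingpong.c -o pingpong
 * Run:   mpirun -np 2 ./pingpong   (place the two ranks on different nodes
 *                                   via your scheduler or host file)
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    size_t sizes[2] = { 8, 4u * 1024 * 1024 };   /* 8 B for latency, 4 MB for bandwidth */
    char *buf = malloc(sizes[1]);

    for (int s = 0; s < 2; ++s) {
        size_t n = sizes[s];
        int reps = (n <= 1024) ? 1000 : 50;
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < reps; ++i) {
            if (rank == 0) {
                MPI_Send(buf, (int)n, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, (int)n, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, (int)n, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, (int)n, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double one_way = (MPI_Wtime() - t0) / (2.0 * reps);  /* seconds per one-way trip */
        if (rank == 0)
            printf("%8zu bytes: %.2f us one-way, %.1f MB/s\n",
                   n, one_way * 1e6, n / one_way / 1e6);
    }
    free(buf);
    MPI_Finalize();
    return 0;
}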
Understanding the effect of the interconnect
- ANSYS Fluent
ANSYS/FLUENT Performance
iDataplex M3 (Intel Xeon x5670, 12C 2.93 GHz)
Network: Gigabit, 10-Gigabit, 4X QDR Infiniband (QLogic, Voltaire)
Hyperthreading: OFF, TURBO: ON
Models: truck_14M
[Chart: FLUENT rating versus number of cores used by a single job (12-768) for QLogic and Voltaire InfiniBand, 10-Gigabit, and Gigabit networks; higher is better.]
[Chart: interconnect performance - rating (runs/day) versus cores (8-128) for Gigabit Ethernet and DDR InfiniBand.]
Understanding the effect of the
interconnect
- ANSYS Mechanical
V13sp-5 Model: turbine geometry, 2,100K DOF, SOLID187 FEs, static nonlinear analysis, one iteration, direct sparse solver, Linux cluster (8 cores per node)
Understanding the effect of the interconnect
- ANSYS Mechanical
3 million DOF using the direct sparse solver
SOLID95 elements, worst case for a direct solver
[Chart: TrueScale InfiniBand versus GigE - wall time (s) versus cores (16-128), in-core memory mode.]
Using faster interconnects can be
helpful to realize productivity gains
- particularly at higher core/node counts
Understanding the effect of the disks/storage
- ANSYS Mechanical
• Need fast hard drives to feed fast processors
– Check the bandwidth specs
– ANSYS Mechanical can be highly I/O bandwidth bound
– The sparse solver in the out-of-core memory mode does lots of I/O
– Distributed ANSYS can be highly I/O latency bound
– Seek time to read/write each set of files causes overhead
– Consider SSDs: high bandwidth and extremely low seek times
– Consider RAID configurations:
RAID 0 – for speed
RAID 1, 5 – for redundancy
RAID 10 – for speed and redundancy
Understanding the effect of the disks/storage
- ANSYS Mechanical
Using faster disks can be
helpful to realize productivity gains
- particularly at higher core/node counts
Is Your Hardware Ready for HPC?
- ANSYS Mechanical
[Chart: recommended I/O bandwidth (100-1200 MB/s; 1x/2x SAS up to 1x/2x SSD) and RAM (8-128 GB) as a function of model size (0.2, 2, 4, and > 6 MDOF).]
GPU Accelerator Capability
- ANSYS Mechanical
Supports majority of ANSYS structural mechanics solvers:
• Covers both sparse direct and PCG iterative solvers
Ease of use:
• Requires at least one supported GPU card to be installed
• Requires at least one HPC Pack license
• No rebuild, no additional installation steps
Performance:
• Offers significantly faster time to solution
• Should never slow down your simulation
[Chart: influence of GPU accelerator on speedup - 5.9x, 3.7x, and 2.4x.]
ANSYS Mechanical Model – Impeller
Impeller geometry of ~2M DOF, solid FEs
Normal modes analysis using cyclic symmetry
ANSYS Mechanical SMP and Block Lanczos solver
[Chart: Impeller, 2M DOF, normal modes - 4 cores + GPU = 2.4x speedup vs. 4 cores.]
ANSYS Mechanical Model – Speaker
Speaker geometry of ~0.7M DOF, solid FEs
Vibroacoustic harmonic analysis for one frequency
ANSYS Mechanical distributed sparse solver
[Chart: Speaker, 0.7M DOF, harmonic analysis - 4 cores + GPU = 2.7x speedup vs. 4 cores.]
Some Recommendations for GPU Acceleration
Models with "enough" solver work will accelerate most with GPUs
• Solid FE models with > 500K DOFs are recommended for best speedups
GPU and system memories both play important roles in performance
• ANSYS Mechanical, for both CPU-only and CPU+GPU, performs best with solutions that run in-core so that scratch disk I/O is eliminated
• Sparse solver:
– Bulkier and/or higher-order FE models are good and will be accelerated. If the model exceeds 8M DOF, either add another GPU or it will be processed on the CPU.
– The GPU works really well with complex or unsymmetric matrices
• PCG/JCG solver:
– Models with lower Lev_Diff values are good and will be accelerated the most.
– Before R14.5, set MSAVE,OFF (the default is ON) or the GPU will be disabled. At R14.5, MSAVE,AUTO will only set MSAVE,ON when the criteria for MSAVE are met and the model size is > 3M DOFs.
Model Suited for HPC?
- ANSYS Mechanical
• If you are solving small- or medium-sized models (e.g., # DOFs < 500,000), but the solution takes a long time because hundreds of iterations (calculations) are performed, HPC may not provide significant benefits to you.
– This is because HPC helps to speed up large models
– HPC does not reduce the total number of calculations that need to be performed.
Joints Suited for GPU Acceleration?
- ANSYS Mechanical
• Joints in Workbench Mechanical utilize a special solution technique that cannot be used with the GPU Accelerator.
• In these cases, use of Distributed ANSYS can still speed up the solution.
Sparse or Iterative Solvers for HPC?
- ANSYS Mechanical
Solver type | Distributed / Shared Memory | Pros | Cons
SPARSE (direct) | DMP/SMP | Robust | High I/O
PCG (iterative) | DMP/SMP | Less I/O than sparse; suited for large models | Does not support all functionality; no guaranteed solution
LANB (direct, modal) | SMP | Robust; suited for an accurate, high number of modes | Very high I/O
LANPCG (iterative, modal) | DMP | Suited for large-DOF modal analysis; fully DMP | -
SNODE | SMP | Suited for a large number of modes; reduced I/O | Not as efficient for a small number of modes
Get it in-core!
- ANSYS Mechanical
[Chart: sparse solver elapsed time (s) for in-core, optimal, and minimum memory modes with 24 GB RAM, and optimal and minimum modes with 4 GB RAM.]
Check the PCG level!
- ANSYS Mechanical
Balancing the Load: a Key to Efficiency
- ANSYS Mechanical
A Consequence for Contact Users
- ANSYS Mechanical
Remote Load or Displacements
- ANSYS Mechanical
[Figure: a point moment distributed to the internal surface of a hole; deformed shape shown.]
All nodes connected to one RBE3 node have to be grouped into the same domain. This hurts load balance! Try to reduce the number of RBE3 nodes.
Parallelization: Bottlenecks & Strategies
- ANSYS CFD
Domain Decomposition & Physical Models
• Some physical models show load-balancing difficulties with Domain Decomposition
• DPM model: injection points and trajectories are not evenly distributed across the default partitions
• Radiation models: S2S radiation faces are not evenly distributed
• VOF: the phase interface may travel through partitions
• Dynamic mesh: partitions hosting dynamic meshes may change in size
• Combustion: the main reaction terms and gradients may be concentrated in a small area
Dynamic re-partitioning or manual physics weighting may be advantageous - and ANSYS CFD allows this! (A small load-imbalance sketch follows.)
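To make the load-balancing point concrete, here is a small, hypothetical C sketch (not ANSYS code) that estimates parallel efficiency when one partition carries much more work, for example because most DPM particles land in it.

/* imbalance.c - illustrative (non-ANSYS) load-imbalance estimate.
 * Assumes one partition per core and a run time set by the busiest partition.
 */
#include <stdio.h>

int main(void)
{
    /* hypothetical work units (cells plus particle load) per partition */
    double work[4] = { 100.0, 100.0, 100.0, 400.0 };
    int n = 4;

    double total = 0.0, worst = 0.0;
    for (int i = 0; i < n; ++i) {
        total += work[i];
        if (work[i] > worst) worst = work[i];
    }

    /* ideal time = total/n; actual time = slowest (largest) partition */
    double efficiency = (total / n) / worst;
    printf("load-balance efficiency: %.0f%%\n", 100.0 * efficiency);
    /* prints 44%: three cores sit idle while the overloaded partition finishes */
    return 0;
}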
Parallelization: Bottlenecks & Strategies
- ANSYS CFD
Domain Decomposition & General Issues
• Domain Decomposition introduces partition boundaries
• In rare cases these partition boundaries are badly located and create numerical overhead
• If strange scaling behavior is observed, using a slightly different number of parallel partitions may resolve such an accidental misfit of partition boundaries
Manually overriding the default partitioning approach may be advantageous - and ANSYS CFD allows this!
Tuning Your Software for Client/Server
- ANSYS CFD
Fluent:
• Specify distributed memory system
• Physics-aware partitioning
• Fluent does architecture-aware partitioning by default
• Use asynchronous I/O
CFX:
• Specify distributed parallel system
General:
• Core allocation strategy is related to the job scheduler system
• Remote visualization speed is influenced by the network
• Reduce file I/O tasks, if possible
• Check compatibility of tuned MPI libraries
Tuning Your Software
- ANSYS CFD
Asynchronous I/O for Linux Fluent

Mesh | File | Location | Async I/O | Time
15M  | Cas  | NFS      | OFF       | 217 s
15M  | Cas  | NFS      | ON        | 62 s
15M  | Dat  | NFS      | OFF       | 113 s
15M  | Dat  | NFS      | ON        | 8 s
30M  | Cas  | NFS      | OFF       | 207 s
30M  | Cas  | NFS      | ON        | 75 s
30M  | Dat  | NFS      | OFF       | 144 s
30M  | Dat  | NFS      | ON        | 10 s

On average, total write time is 3-5x faster over NFS.
Even larger speed-ups on bigger cases and local disk (up to 10x); a generic sketch of asynchronous I/O follows.
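As a generic illustration of the underlying idea (overlapping file writes with ongoing computation), here is a minimal, hypothetical C sketch using POSIX AIO; it is not Fluent's implementation.

/* aio_demo.c - generic asynchronous-write sketch (illustrative only,
 * not Fluent's implementation).
 * Build: gcc -O2 aio_demo.c -o aio_demo -lrt
 */
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    static char data[1 << 20];                 /* 1 MB of result data */
    memset(data, 'x', sizeof data);

    int fd = open("results.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    struct aiocb cb;
    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    cb.aio_buf    = data;
    cb.aio_nbytes = sizeof data;
    cb.aio_offset = 0;

    if (aio_write(&cb) != 0) { perror("aio_write"); return 1; }

    /* a solver could keep iterating here while the write is in flight */

    while (aio_error(&cb) == EINPROGRESS)      /* wait for completion */
        usleep(1000);

    printf("wrote %zd bytes asynchronously\n", aio_return(&cb));
    close(fd);
    return 0;
}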
Information Available
- ANSYS IT Webcast Series
Planned webinars in 2013:
• Accelerating Time-to-Results with Parallel I/O
• Workstation Upgrade ROI — Productivity Gains
with Latest Intel Processors and GPUs
• Tuning InfiniBand for Peak ANSYS Performance on
Latest Intel Processors
• Enterprise Simulation — Best Practice Deployment
Solution Architectures
Recorded webinars from 2012:
• Scalable Storage and Data Management for Engineering Simulation
• Optimizing Remote Access to Simulation
• Understanding Hardware Selection for Structural Mechanics
• Methodology and Tools for Compute Performance at Any Scale
• Extreme Scalability for High-Fidelity CFD Simulations
Thanks