Presented by 北京福思营销顾问有限公司
High-Performance Computing for Finite Element and Fluid Simulation
Best Practice Cases
杨柱 (Yang Zhu), Senior Fluid Engineer
Agenda
• HPC Terminology
• ANSYS Workflow
• Hardware Considerations
• Modelling guidelines
HPC Hardware Terminology
[Diagram: Machine 1 (Node 1) through Machine N (Node N), each with Processor 1 and Processor 2 (Socket 1 and Socket 2) plus a GPU, connected by an interconnect (GigE or InfiniBand).]
Shared Memory Parallel
• Shared Memory Parallel (SMP) systems share a single global memory image that may be distributed physically across multiple cores, but is globally addressable.
• OpenMP is the industry standard (see the sketch below).
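As a concrete illustration of shared-memory parallelism with OpenMP, here is a minimal, hypothetical C sketch (not ANSYS source code): every thread on the machine updates the same globally addressable array.

/* smp_demo.c - minimal OpenMP sketch (illustrative only, not ANSYS code).
 * All threads of one machine work on a single shared array.
 * Build: gcc -fopenmp -O2 smp_demo.c -o smp_demo
 */
#include <omp.h>
#include <stdio.h>

int main(void)
{
    enum { N = 1000000 };
    static double x[N];          /* one global memory image, visible to all threads */

    #pragma omp parallel for     /* loop iterations are split across the cores */
    for (int i = 0; i < N; ++i)
        x[i] = 2.0 * i;

    printf("max threads: %d, x[N-1] = %.1f\n", omp_get_max_threads(), x[N - 1]);
    return 0;
}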
Distributed Memory Parallel
• Distributed memory parallel processing (DMP) assumes that physical
memory for each process is separate from all other processes.
• Parallel processing on such a system requires some form of message
passing software to exchange data between the cores.
• MPI (Message Passing Interface) is the industry standard for this (see the sketch below).
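A matching, hypothetical MPI sketch in C (again, not ANSYS code): each rank owns its own memory, and partial results are combined explicitly by message passing.

/* dmp_demo.c - minimal MPI sketch (illustrative only, not ANSYS code).
 * Each process keeps a private partial sum in its own memory and the
 * pieces are combined by passing messages.
 * Build: mpicc -O2 dmp_demo.c -o dmp_demo
 * Run:   mpirun -np 4 ./dmp_demo
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nproc;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    double partial = 0.0;                       /* lives in this rank's memory only */
    for (int i = rank; i < 1000000; i += nproc)
        partial += 2.0 * i;

    double total = 0.0;                         /* message passing combines the results */
    MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("total from %d ranks: %.1f\n", nproc, total);

    MPI_Finalize();
    return 0;
}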
Typical HPC Growth Path
[Diagram: HPC growth path spanning Desktop Users, Workstation and/or Server Users, Cluster Users, and a Cloud Solution.]
Guidelines:
• Know your hardware lifecycle.
• Have a goal in mind for what you want to achieve.
• Use licensing productively.
• Use ANSYS-provided processes effectively.
Understanding the effect of clock speed
- ANSYS CFD
Impact of CPU Clock on Application Performance
Processor: Xeon X5600 Series
Hyper Threading: OFF, TURBO: ON
Active cores: 12/node; Memory speed: 1333 MHz
(performance measure is improvement relative to CPU Clock 2.66 GHz)
[Chart: improvement due to clock speed for ANSYS/FLUENT models eddy_417K, aircraft_2M, turbo_500K, sedan_4M, and truck_14M at 2.66, 2.93, and 3.47 GHz; higher is better.]
Understanding the effect of clock speed
- ANSYS Mechanical
• Effect of increased core operating frequencies on the DMP
benchmarks running on 8 cores
• Influence is highest for sparse solver benchmarks
Using higher clock speed is always
helpful to realize productivity gains
Understanding the effect of memory bandwidth
- Is 24 Cores Equal to 24 Cores?
[Diagram: 3 nodes x (2 sockets x 4 cores) of Xeon X5570 = 24 cores versus 2 nodes x (2 sockets x 6 cores) of Xeon X5670 = 24 cores.]
Consider memory per core! (A rough calculation follows.)
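As a rough illustration, assuming three DDR3 memory channels per socket on these Xeon 5500/5600-series processors: the 3-node X5570 layout provides 3 x 2 x 3 = 18 memory channels feeding 24 cores (0.75 channels per core), while the 2-node X5670 layout provides 2 x 2 x 3 = 12 channels feeding the same 24 cores (0.5 channels per core), i.e. roughly 50% more memory bandwidth per core in the 3-node configuration.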
Understanding the effect of memory bandwidth
- Is 16 Cores Equal to 16 Cores?
[Diagram: 16 cores as 2 nodes x (2 sockets x 4 cores) of Xeon X5570 versus 16 cores as 2 nodes x (2 sockets x 4 cores) of Xeon X5670.]
Using fewer cores per node can be helpful to realize productivity gains.
Understanding the effect of memory
speed
- ANSYS CFD
• We can see here the effect of memory speed.
• This has implications for how you build your hardware.
• Some processor types have slower memory speeds by default.
• On other processors, non-optimally filling the memory channels can slow the memory speed.
Impact of DIMM speed on ANSYS/FLUENT Application Performance
(Intel Xeon x5670, 2.93 GHz)
Hyper Threading: OFF, TURBO: ON
Active threads per node: 12
(performance measure improvement is relative to memory speed of 1066 MHz)
[Chart: impact of memory speed (1066 MHz vs. 1333 MHz) on ANSYS/FLUENT models eddy_417K, turbo_500K, aircraft_2M, sedan_4M, and truck_14M.]
Using higher memory speed can be
helpful to realize productivity gains
Understanding the effect of memory
speed
- ANSYS Mechanical
• Memory speed matters for ANSYS Mechanical as well: some processor types have slower memory speeds by default, and on others non-optimally filled memory channels can reduce the effective memory speed.
Impact of Memory Speed on Benchmark Speed
Using higher memory speed can be
helpful to realize productivity gains
Turbo Boost (Intel) / Turbo Core (AMD)
- ANSYS CFD
• Turbo Boost (Intel) / Turbo Core (AMD) is a form of over-clocking that raises the clock frequency of individual cores when others are idle.
• On Intel processors we have seen variable performance with this, ranging from 0-8% improvement depending on the number of cores in use.
• The graph below shows CFX on an Intel X5550; it sees a maximum improvement of only 2.5%.
Turbo Boost (Intel) / Turbo Core (AMD)
- ANSYS Mechanical
• Effect of Turbo Boost on the SMP benchmarks using 1, 2, 4
and 8 out of 8 physical cores of 1 node
• Turbo Boost most efficient for the lower core counts
[Chart: impact of Turbo Boost on speed versus number of cores.]
Using Turbo Boost / Core can be
helpful to realize productivity gains
- particularly for lower core counts
Hyper-threading – ANSYS Fluent
Evaluation of Hyperthreading on ANSYS/FLUENT Performance
iDataplex M3 (Intel Xeon x5670, 2.93 GHz)
TURBO: ON
(measurement is improvement relative to Hyper-Threading OFF)
[Chart: improvement due to Hyper-Threading for ANSYS/FLUENT models eddy_417K, turbo_500K, aircraft_2M, sedan_4M, and truck_14M; HT OFF (12 threads on 12 physical cores) vs. HT ON (24 threads on 12 physical cores); higher is better.]
Hyper-threading – ANSYS Mechanical
Hyper-threading is NOT recommended
Understanding the effect of the interconnect
• Need fast interconnects to feed fast processors
– Two main characteristics for each interconnect: latency and bandwidth (a small measurement sketch follows the listing below)
– Distributed ANSYS is highly bandwidth bound
+--------- D I S T R I B U T E D A N S Y S S T A T I S T I C S ------------+
Release: 14.5 Build: UP20120802 Platform: LINUX x64
Date Run: 08/09/2012 Time: 23:07
Processor Model: Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz
Total number of cores available : 32
Number of physical cores available : 32
Number of cores requested : 4 (Distributed Memory Parallel)
MPI Type: INTELMPI
Core Machine Name Working Directory
----------------------------------------------------
0 hpclnxsmc00 /data1/ansyswork
1 hpclnxsmc00 /data1/ansyswork
2 hpclnxsmc01 /data1/ansyswork
3 hpclnxsmc01 /data1/ansyswork
Latency time from master to core 1 = 1.171 microseconds
Latency time from master to core 2 = 2.251 microseconds
Latency time from master to core 3 = 2.225 microseconds
Communication speed from master to core 1 = 7934.49 MB/sec Same machine
Communication speed from master to core 2 = 3011.09 MB/sec QDR Infiniband
Communication speed from master to core 3 = 3235.00 MB/sec QDR Infiniband
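As a rough way to check latency and bandwidth figures like those above on your own cluster, here is a minimal, hypothetical MPI ping-pong sketch in C (not part of ANSYS): it times round trips between rank 0 and rank 1 for one small and one large message.

/* pingpong.c - minimal MPI ping-pong sketch (illustrative only).
 * Small messages estimate latency, large messages estimate bandwidth.
 * Build: mpicc -O2 pingpong.c -o pingpong
 * Run:   mpirun -np 2 ./pingpong   (place the two ranks on different nodes
 *                                   via your scheduler or host file)
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    size_t sizes[2] = { 8, 4u * 1024 * 1024 };   /* 8 B for latency, 4 MB for bandwidth */
    char *buf = malloc(sizes[1]);

    for (int s = 0; s < 2; ++s) {
        size_t n = sizes[s];
        int reps = (n <= 1024) ? 1000 : 50;
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < reps; ++i) {
            if (rank == 0) {
                MPI_Send(buf, (int)n, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, (int)n, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, (int)n, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, (int)n, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double one_way = (MPI_Wtime() - t0) / (2.0 * reps);  /* seconds per one-way trip */
        if (rank == 0)
            printf("%8zu bytes: %.2f us one-way, %.1f MB/s\n",
                   n, one_way * 1e6, n / one_way / 1e6);
    }
    free(buf);
    MPI_Finalize();
    return 0;
}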
Understanding the effect of the interconnect
- ANSYS Fluent
ANSYS/FLUENT Performance
iDataplex M3 (Intel Xeon x5670, 12C 2.93 GHz)
Network: Gigabit, 10-Gigabit, 4X QDR Infiniband (QLogic, Voltaire)
Hyperthreading: OFF, TURBO: ON
Models: truck_14M
[Chart: FLUENT rating versus number of cores used by a single job (12-768) for QLogic and Voltaire InfiniBand, 10-Gigabit, and Gigabit networks; higher is better.]
[Chart: interconnect performance - rating (runs/day) versus cores (8-128) for Gigabit Ethernet and DDR InfiniBand.]
Understanding the effect of the
interconnect
- ANSYS Mechanical
V13sp-5 Model: turbine geometry, 2,100K DOF, SOLID187 FEs, static nonlinear analysis, one iteration, direct sparse solver, Linux cluster (8 cores per node)
Understanding the effect of the interconnect
- ANSYS Mechanical
3 million DOF using the direct sparse solver
SOLID95 elements, worst case for a direct solver
[Chart: TrueScale InfiniBand versus GigE - wall time (s) versus cores (16-128), in-core memory mode.]
Using faster interconnects can be
helpful to realize productivity gains
- particularly at higher core/node counts
Understanding the effect of the disks/storage
- ANSYS Mechanical
• Need fast hard drives to feed fast processors
– Check the bandwidth specs
– ANSYS Mechanical can be highly I/O bandwidth bound
– The sparse solver in the out-of-core memory mode does lots of I/O
– Distributed ANSYS can be highly I/O latency bound
– Seek time to read/write each set of files causes overhead
– Consider SSDs: high bandwidth and extremely low seek times
– Consider RAID configurations:
RAID 0 – for speed
RAID 1, 5 – for redundancy
RAID 10 – for speed and redundancy
Understanding the effect of the disks/storage
- ANSYS Mechanical
Using faster disks can be
helpful to realize productivity gains
- particularly at higher core/node counts
Is Your Hardware Ready for HPC?
- ANSYS Mechanical
[Chart: recommended I/O bandwidth (100-1200 MB/s; 1x/2x SAS up to 1x/2x SSD) and RAM (8-128 GB) as a function of model size (0.2, 2, 4, and > 6 MDOF).]
GPU Accelerator Capability
- ANSYS Mechanical
Supports majority of ANSYS structural mechanics solvers:
• Covers both sparse direct and PCG iterative solvers
Ease of use:
• Requires at least one supported GPU card to be installed
• Requires at least one HPC Pack license
• No rebuild, no additional installation steps
Performance:
• Offers significantly faster time to solution
• Should never slow down your simulation
[Chart: influence of GPU accelerator on speedup - 5.9x, 3.7x, and 2.4x.]
ANSYS Mechanical Model – Impeller
Impeller geometry of ~2M DOF, solid FEs
Normal modes analysis using cyclic symmetry
ANSYS Mechanical SMP and Block Lanczos solver
[Chart: Impeller, 2M DOF, normal modes - 4 cores + GPU = 2.4x speedup vs. 4 cores.]
ANSYS Mechanical Model – Speaker
Speaker geometry of ~0.7M DOF, solid FEs
Vibroacoustic harmonic analysis for one frequency
ANSYS Mechanical distributed sparse solver
[Chart: Speaker, 0.7M DOF, harmonic analysis - 4 cores + GPU = 2.7x speedup vs. 4 cores.]
Some Recommendations for GPU Acceleration
Models with "enough" solver work will accelerate most with GPUs
• Solid FE models with > 500K DOFs are recommended for best speedups
GPU and system memories both play important roles in performance
• ANSYS Mechanical, for both CPU-only and CPU+GPU, performs best with solutions that run in-core so that scratch disk I/O is eliminated
• Sparse solver:
– Bulkier and/or higher-order FE models are good and will be accelerated. If the model exceeds 8M DOF, either add another GPU or it will be processed on the CPU.
– The GPU works really well with complex or unsymmetric matrices
• PCG/JCG solver:
– Models with lower Lev_Diff values are good and will be accelerated the most.
– Before R14.5, set MSAVE,OFF (the default is ON) or the GPU will be disabled. At R14.5, MSAVE,AUTO will only set MSAVE,ON when the criteria for MSAVE are met and the model size is > 3M DOFs.
Model Suited for HPC?
- ANSYS Mechanical
• If you are solving small- or medium-sized models (e.g., # DOFs < 500,000), but the solution takes a long time because hundreds of iterations (calculations) are performed, HPC may not provide significant benefits to you.
– This is because HPC helps to speed up large models
– HPC does not reduce the total number of calculations that need to be performed.
Joints Suited for GPU Acceleration?
- ANSYS Mechanical
• Joints in Workbench Mechanical utilize a special solution technique that cannot be used with the GPU Accelerator.
• In these cases, use of Distributed ANSYS can still speed up the solution.
Sparse or Iterative Solvers for HPC?
- ANSYS Mechanical
Solver type | Distributed / Shared Memory | Pros | Cons
SPARSE (direct) | DMP/SMP | Robust | High I/O
PCG (iterative) | DMP/SMP | Less I/O than sparse; suited for large models | Does not support all functionality; no guaranteed solution
LANB (direct, modal) | SMP | Robust; suited for an accurate, high number of modes | Very high I/O
LANPCG (iterative, modal) | DMP | Suited for large-DOF modal analysis; fully DMP | -
SNODE | SMP | Suited for a large number of modes; reduced I/O | Not as efficient for a small number of modes
Get it in-core!
- ANSYS Mechanical
[Chart: sparse solver elapsed time (s) for in-core, optimal, and minimum memory modes with 24 GB RAM, and optimal and minimum modes with 4 GB RAM.]
Check the PCG level!
- ANSYS Mechanical
Balancing the Load: a Key to Efficiency
- ANSYS Mechanical
A Consequence for Contact Users
- ANSYS Mechanical
Remote Load or Displacements
- ANSYS Mechanical
[Figure: a point moment distributed to the internal surface of a hole; deformed shape shown.]
All nodes connected to one RBE3 node have to be grouped into the same domain. This hurts load balance! Try to reduce the number of RBE3 nodes.
Parallelization: Bottlenecks & Strategies
- ANSYS CFD
Domain Decomposition & Physical Models
• Some physical models show load-balancing difficulties with Domain Decomposition
• DPM model: injection points and trajectories are not evenly distributed across the default partitions
• Radiation models: S2S radiation faces are not evenly distributed
• VOF: the phase interface may travel through partitions
• Dynamic mesh: partitions hosting dynamic meshes may change in size
• Combustion: the main reaction terms and gradients may be concentrated in a small area
Dynamic re-partitioning or manual physics weighting may be advantageous - and ANSYS CFD allows this! (A small load-imbalance sketch follows.)
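To make the load-balancing point concrete, here is a small, hypothetical C sketch (not ANSYS code) that estimates parallel efficiency when one partition carries much more work, for example because most DPM particles land in it.

/* imbalance.c - illustrative (non-ANSYS) load-imbalance estimate.
 * Assumes one partition per core and a run time set by the busiest partition.
 */
#include <stdio.h>

int main(void)
{
    /* hypothetical work units (cells plus particle load) per partition */
    double work[4] = { 100.0, 100.0, 100.0, 400.0 };
    int n = 4;

    double total = 0.0, worst = 0.0;
    for (int i = 0; i < n; ++i) {
        total += work[i];
        if (work[i] > worst) worst = work[i];
    }

    /* ideal time = total/n; actual time = slowest (largest) partition */
    double efficiency = (total / n) / worst;
    printf("load-balance efficiency: %.0f%%\n", 100.0 * efficiency);
    /* prints 44%: three cores sit idle while the overloaded partition finishes */
    return 0;
}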
Parallelization: Bottlenecks & Strategies
- ANSYS CFD
Domain Decomposition & General Issues
• Domain Decomposition introduces partition boundaries
• In rare cases these partition boundaries are badly located and create numerical overhead
• If strange scaling behavior is observed, using a slightly different number of parallel partitions may resolve such an accidental misfit of partition boundaries
Manually overriding the default partitioning approach may be advantageous - and ANSYS CFD allows this!
Tuning Your Software for Client/Server
- ANSYS CFD
Fluent:
• Specify distributed memory system
• Physics-aware partitioning
• Fluent does architecture-aware partitioning by default
• Use asynchronous I/O
CFX:
• Specify distributed parallel system
General:
• Core allocation strategy is related to the job scheduler system
• Remote visualization speed is influenced by the network
• Reduce file I/O tasks, if possible
• Check compatibility of tuned MPI libraries
Tuning Your Software
- ANSYS CFD
Asynchronous I/O for Linux Fluent

Mesh | File | Location | Async I/O | Time
15M  | Cas  | NFS      | OFF       | 217 s
15M  | Cas  | NFS      | ON        | 62 s
15M  | Dat  | NFS      | OFF       | 113 s
15M  | Dat  | NFS      | ON        | 8 s
30M  | Cas  | NFS      | OFF       | 207 s
30M  | Cas  | NFS      | ON        | 75 s
30M  | Dat  | NFS      | OFF       | 144 s
30M  | Dat  | NFS      | ON        | 10 s

On average, total write time is 3-5x faster over NFS.
Even larger speed-ups on bigger cases and local disk (up to 10x); a generic sketch of asynchronous I/O follows.
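As a generic illustration of the underlying idea (overlapping file writes with ongoing computation), here is a minimal, hypothetical C sketch using POSIX AIO; it is not Fluent's implementation.

/* aio_demo.c - generic asynchronous-write sketch (illustrative only,
 * not Fluent's implementation).
 * Build: gcc -O2 aio_demo.c -o aio_demo -lrt
 */
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    static char data[1 << 20];                 /* 1 MB of result data */
    memset(data, 'x', sizeof data);

    int fd = open("results.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    struct aiocb cb;
    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    cb.aio_buf    = data;
    cb.aio_nbytes = sizeof data;
    cb.aio_offset = 0;

    if (aio_write(&cb) != 0) { perror("aio_write"); return 1; }

    /* a solver could keep iterating here while the write is in flight */

    while (aio_error(&cb) == EINPROGRESS)      /* wait for completion */
        usleep(1000);

    printf("wrote %zd bytes asynchronously\n", aio_return(&cb));
    close(fd);
    return 0;
}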
Information Available
- ANSYS IT Webcast Series
Planned webinars in 2013:
• Accelerating Time-to-Results with Parallel I/O
• Workstation Upgrade ROI — Productivity Gains
with Latest Intel Processors and GPUs
• Tuning InfiniBand for Peak ANSYS Performance on
Latest Intel Processors
• Enterprise Simulation — Best Practice Deployment
Solution Architectures
Recorded webinars from 2012:
• Scalable Storage and Data Management for Engineering Simulation
• Optimizing Remote Access to Simulation
• Understanding Hardware Selection for Structural Mechanics
• Methodology and Tools for Compute Performance at Any Scale
• Extreme Scalability for High-Fidelity CFD Simulations
Thanks