Quick-Reference Guide to Optimization with Intel® Compilers version 10.x
For IA-32 processors, Intel® 64¹ processors,
and IA-64² processors.
Intel® Software Development Products
Application Performance
A Step-by-Step Approach to Application Tuning with Intel® Compilers
Before you begin performance tuning, you may want to check correctness of
your application by building it without optimization using /Od (-O0).
1. Use the General Optimization Options (Windows* /O1, /O2 or /O3; Linux*
   and Mac OS* -O1, -O2, or -O3) and determine which one works best for your
   application by measuring performance with each. Most users should start at
   /O2 (-O2), the default, before trying more advanced optimizations. Next, try
   /O3 (-O3) for loop-intensive applications, especially on IA-64-based systems.
2. Fine-tune performance to target systems based on IA-32 and Intel® 64 with
   processor-specific options such as /QxT (-xT) for the Intel® Core™2 processor
   family. For a complete list of recommended options for specific processors,
   see the table “Recommended Processor-Specific Optimization Options for
   IA-32 and Intel® 64 Architectures”. For Dual-Core Intel® Itanium® 2 9000
   Sequence processors, set /G2-p9000 (-mtune=itanium2-p9000).
3. Use the Intel® VTune™ Performance Analyzer to help you identify performance
   “hotspots” so that you know which specific parts of your application could
   benefit from further tuning. The Intel® Compilers’ optimization reports also help
   by showing where the compiler could benefit from your help.
4. Add in interprocedural optimization (IPO), /Qipo (-ipo), and/or profile-guided
   optimization (PGO), /Qprof-gen and /Qprof-use (-prof-gen and -prof-use),
   then measure performance again to determine whether your application
   benefits from one or both of them.
5. Optimize your application for multi-core, multi-processor, or Hyper-Threading
   Technology (HT Technology)-capable systems using the parallel performance
   options (/Qparallel (-parallel), /Qopenmp (-openmp)), or by using the Intel®
   Performance Libraries or the Intel® Threading Building Blocks.
6. Use Intel® Thread Profiler to help you understand the structure of your
   threaded applications and maximize their performance. Use Intel® Thread
   Checker to reduce the time to market for threaded applications by diagnosing
   threading errors and speeding up the development process. Both threading
   tools work with binary instrumentation. Using the Intel Compiler with source
   code instrumentation will give you more complete source code information.
Please consult the Compiler Documentation and the Optimizing Applications
with the Intel® C++ & Fortran Compilers white paper for more details.
¹ Intel® 64 = Intel® Processors with Extended Memory 64 Technology [EM64T]
² IA-64 = Intel® Itanium® Processors
Included in this Guide:
General Optimization Options
Before you begin performance tuning, you may want to check correctness of your application
by building it without optimization using /Od (-O0). Begin performance tuning with /O1, /O2, or
/O3 (-O1, -O2, or -O3). These are general optimization options that should be at the heart of any
application tuning for all 32-bit and 64-bit Intel processors. Measure your performance before
proceeding with more advanced options.
Parallel Performance
For systems with Hyper-Threading Technology, multiple cores, and/or multiple processors, Intel compilers
support development of multi-threaded applications through two mechanisms: auto-parallelization with
/Qparallel (-parallel) and OpenMP* with /Qopenmp (-openmp).
If you are using Intel® Thread Profiler and Intel® Thread Checker to tune your threaded application,
use /Qtcheck (-tcheck) to enable source instrumentation for Intel® Thread Checker and /Qtprofile
(-tprofile) to enable source instrumentation for Intel® Thread Profiler.
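As a rough illustration of the OpenMP* mechanism, the sketch below marks a reduction loop for parallel execution. The function name sum_of_squares is hypothetical and not part of any Intel library. The directive takes effect only when OpenMP is enabled, for example with /Qopenmp (-openmp); otherwise the compiler ignores the pragma and the loop runs serially with the same result.

```c
#include <stddef.h>

/* Sum of squares of n doubles.
 * With OpenMP enabled, loop iterations are divided among threads and the
 * per-thread partial sums are combined by the reduction clause.
 * Without OpenMP, the pragma is ignored and the loop runs serially. */
double sum_of_squares(const double *x, size_t n)
{
    double sum = 0.0;
    long i; /* signed loop index, as older OpenMP versions require */

    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < (long)n; ++i) {
        sum += x[i] * x[i];
    }
    return sum;
}
```

A loop this simple may also be a candidate for /Qparallel (-parallel), which needs no directive; whether either approach pays off depends on the amount of work per iteration, so measure both.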
Recommended Processor-Specific Optimization Options
for IA-32 and Intel® 64¹ Architectures
Use /QxT (-xT on Linux* and Mac OS*) for best performance on the Intel® Core™2 processor
family, and /QxP (-xP on Linux*) on older Intel-based systems that support SSE3 instructions. We
recommend /QaxT /QxW (-axT -xW on Linux*) for best performance on the Intel® Core™2 processor
family, and good performance on other systems that support SSE2 including those from AMD. For
best performance on non-Intel processors that support SSE3 instructions, we recommend using
/QxO (-xO) in place of /QxW (-xW). For recommended options for older processors, see the table
entitled “Recommended Optimization Options for Specific Intel® Processors”.
These options allow you to tune performance for specific Intel processors. As with each previous
step, measure the performance benefit of each option to guide your decisions. Use the Intel
compilers’ optimization reports to assist in determining whether you can provide more help to the
compiler to resolve possible dependencies or aliases.
IA-64 (Intel® Itanium®) Processor-Specific Optimization Options
In general, using /O3 (-O3), IPO and/or PGO, in conjunction with the optimization reports (described
in the Fine-Tuning section of this document), to help resolve possible aliases and improve memory
utilization provides the best performance for IA-64-based systems.
Interprocedural Optimization (IPO) and Profile-Guided Optimization (PGO) Options
IPO includes function-inlining to reduce function call overhead and expose more optimization opportunities.
PGO provides runtime feedback to guide optimization decisions about data and code layout to improve
instruction-cache efficiency, paging and branch prediction. However, IPO can increase code size. Be sure to
measure your execution performance, compile time, and code size tradeoffs with these options. IPO is best
used in conjunction with PGO to guide which functions to inline.
Floating-Point Arithmetic Options
The Intel® compilers provide options for enhancing the consistency or precision of floating-point results on all
Intel® architectures, at some cost in performance. Refer to the Compiler Options section of the Intel® C++ and
Fortran Compiler Documentation for detailed information on floating-point options.
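One reason consistency options exist is that floating-point addition is not associative, so an optimization that reorders a sum can change its result. The following sketch, with hypothetical function names, shows two mathematically equal groupings that round differently in IEEE double precision.

```c
/* Adds three values left to right: (a + b) + c. */
double sum_left(double a, double b, double c)
{
    return (a + b) + c;
}

/* Adds the same three values with the other grouping: a + (b + c).
 * If b is huge and c is small, c can be lost when b + c is rounded. */
double sum_right(double a, double b, double c)
{
    return a + (b + c);
}
```

With a = 1e20, b = -1e20 and c = 1.0, the first grouping cancels the large terms before adding c, while the second rounds c away first. Value-safe settings keep the compiler from switching between such groupings.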
Fine-Tuning (All Processors)
Once you have identified performance hot-spots, you may need to provide the compiler with more
information to fine-tune specific functions. The optimization and vectorization reports may show places
where loops could not be optimized fully due to pointer aliasing or memory-access overlaps, for example.
The Intel® C++ and Fortran Compiler Documentation includes details on other #pragmas, directives, and
intrinsics that can be used to control software-pipelining, loop unrolling, vectorization, and prefetching for
further fine-tuning within your application code.
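A common way to give the compiler the aliasing information mentioned above is the C99 restrict qualifier. The sketch below is an assumption-laden illustration: scaled_add is a hypothetical function, and the code assumes a compiler mode that accepts the C99 restrict keyword.

```c
#include <stddef.h>

/* Computes y[i] += a * x[i] for i in [0, n).
 * The restrict qualifiers are a promise by the caller that x and y never
 * overlap, so the compiler does not have to assume that a store to y[i]
 * could change a later x[j]. Breaking that promise is undefined behavior. */
void scaled_add(size_t n, double a,
                const double *restrict x, double *restrict y)
{
    size_t i;
    for (i = 0; i < n; ++i) {
        y[i] += a * x[i];
    }
}
```

Whether the qualifier actually changes the generated code depends on the loop and the target; the vectorization report is the place to check.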
General Optimization Options
Windows*    Linux*, Mac OS*    Comment
/Od    -O0
    No optimization. Used during the early stages of application
    development and debugging. Use a higher setting when the application
    is working correctly.
/O1    -O1
    Optimize for size. Omits optimizations that tend to increase object size.
    Creates the smallest optimized code in most cases.
    This option is useful in many large server/database applications where
    memory paging due to larger code size is an issue.
/O2    -O2
    Maximize speed. Default setting. Creates faster code than /O1 (-O1) in
    most cases.
/O3    -O3
    Enables /O2 (-O2) optimizations plus more aggressive loop and memory-access
    optimizations, such as scalar replacement, loop unrolling, code
    replication to eliminate branches, loop blocking to allow more efficient use of
    cache and, on IA-64-based systems only, additional data prefetching.
    The /O3 (-O3) option is particularly recommended for applications that have
    loops that heavily use floating-point calculations or process large data sets.
    These aggressive optimizations may occasionally slow down other types of
    applications compared to /O2 (-O2).
/Zi    -g
    Generates debug information for use with any of the common platform
    debuggers. This option turns off /O2 (-O2) and makes /Od (-O0) the default
    unless /O2 (-O2) (or another O option) is specified.
/debug:full    -debug full
    Allows easier debugging of optimized code by adding full symbol information,
    including the local symbol table information, regardless of the optimization
    level. This may result in minor performance degradation.
    If this option is specified for an application that makes calls to C library
    routines that will be debugged, the option /dbglibs must also be specified to
    link the appropriate C debug library.
Parallel Performance
Windows*    Linux*, Mac OS*    Comment
/Qopenmp    -openmp
    Enables the parallelizer to generate multi-threaded code based on the
    OpenMP* directives.
/Qopenmp-report{0|1|2}    -openmp-report{0|1|2}
    Controls the OpenMP parallelizer’s diagnostic levels. The default is
    /Qopenmp-report1.
/Qparallel    -parallel
    Detects simply structured loops capable of being executed safely in
    parallel and automatically generates multi-threaded code for these loops.
/Qpar-report{0|1|2|3}    -par-report{0|1|2|3}
    Controls the auto-parallelizer’s diagnostic levels as follows:
    0 – Displays no diagnostic information.
    1 – Indicates loops successfully parallelized (default).
    2 – Adds information on loops that were not parallelized.
    3 – Adds information about any proven or assumed dependencies
        inhibiting auto-parallelization (reasons for not parallelizing).
/Qpar-threshold[n]    -par-threshold[n]
    Sets a threshold for the auto-parallelization of loops based on the
    probability of profitable execution of the loop in parallel, n=0 to 100.
    Default: n=100.
    0 – Parallelize loops regardless of computation work volume.
    100 – Parallelize loops only if profitable parallel execution is almost certain.
    Must be used in conjunction with /Qparallel (-parallel).
/Qtprofile    -tprofile
    Enables source instrumentation to capture information about the
    structure of threaded applications for use in tuning them to maximize
    performance. This option creates a binary which will generate results
    that can be viewed with Intel® Thread Profiler.
/Qtcheck    -tcheck
    Enables source instrumentation to capture information for diagnosing
    threading errors in threaded applications. This option creates a binary which
    will generate diagnostics that can be viewed with Intel® Thread Checker.
/Qopt-mem-bandwidth (IA-64 only)    -opt-mem-bandwidth (IA-64 only)
    Restricts certain optimizations that may increase memory bandwidth
    requirements.
    /Qopt-mem-bandwidth0 (-opt-mem-bandwidth0) – no restriction
        (default for serial compilation)
    /Qopt-mem-bandwidth1 (-opt-mem-bandwidth1) – restricts
        optimizations for loops in OpenMP parallel regions (default with
        /Qparallel (-parallel) or /Qopenmp (-openmp))
    /Qopt-mem-bandwidth2 (-opt-mem-bandwidth2) – restricts
        optimizations for all loops. May be useful for MPI or other parallel
        applications.
    Note: For Mac OS*, this option is not supported.
Recommended Processor-Specific Optimization Options for IA-32
and Intel® 64¹ Architectures
Windows*    Linux*, Mac OS*    Comment
/Qx{S|T|P|O|N|W|K}    -x{S|T|P|O|N|W|K}
    Processor-specific targeting. Generates specialized code for the indicated
    processor and enables vectorization. The executable should only be run on the
    targeted compatible processors.
    S – May generate SSE4, SSSE3, SSE3, SSE2, and SSE instructions for Intel
        processors. Optimizes for a future Intel® processor that supports SSE4 Vectorizing
        Compiler and Media Accelerators.
    T – May generate SSSE3, SSE3, SSE2, and SSE instructions for Intel processors.
        Optimizes for the Intel® Core™2 Duo Processor family, Quad-Core Intel® Xeon®
        processors, and Dual-Core Intel® Xeon® 5300, 5100 and 3000 series processors.
    P – May generate SSE3, SSE2, and SSE instructions for Intel processors. Optimizes
        for Intel® Core™ microarchitecture, Intel® Pentium® 4 processors with SSE3, Intel®
        Xeon® processors with SSE3, Intel® Pentium® dual-core processor T2060, Intel®
        Pentium® Extreme Edition processor, and Intel® Pentium® D processor. Performs
        optimizations not enabled with /QxO (-xO).
    O – May generate SSE3, SSE2, and SSE instructions. Optimizes for the Intel® Core™
        microarchitecture, Intel® Pentium® 4 processors with SSE3, Intel® Xeon® processors
        with SSE3, Intel® Pentium® dual-core processor T2060, Intel® Pentium® Extreme Edition
        processor, and Intel® Pentium® D processor. Code path may execute on Intel® and
        non-Intel processors which support SSE3*.
    N – May generate SSE2 and SSE instructions for Intel processors. Optimizes for the
        Intel® Pentium® 4 processor, Intel® Xeon® processor with SSE2, and Intel® Pentium® M
        processor. Performs optimizations not enabled with /QxW (-xW).
    W – May generate SSE2 and SSE instructions. Optimizes for the Intel® Pentium® 4
        processor and Intel® Xeon® processor with SSE2. Code path may execute on Intel® and
        non-Intel processors which support SSE2 and SSE*.
    K – May generate SSE instructions. Optimizes for the Intel® Pentium® III processor
        and Intel® Pentium® III Xeon® processor. Code path may execute on Intel® and
        non-Intel processors which support SSE*.
    Note: On Mac OS*, options O, N, W and K are not supported. For Mac OS* systems
    using IA-32 architecture, -xP is default. For Mac OS* systems using Intel® 64
    architecture, -xT is default.
/Qax{S|T|P|N|W|K}    -ax{S|T|P|N|W|K}
    Automatic Processor Dispatch. Generates specialized code and enables vectorization for
    the indicated processors while also generating non-processor-specific code. You can use
    more than one letter to tune for multiple processors in the same executable.
    For example, for best performance on the Intel® Core™2 Duo Processor family,
    Quad-Core Intel® Xeon® processors, and Dual-Core Intel® Xeon® 5300, 5100 and 3000 series
    processors while also running well on an AMD processor that supports only SSE2, use
    /QaxT /QxW (-axT -xW on Linux*) to generate a binary that will utilize SSSE3 and be
    tuned for non-SSSE3 x86-64 processors via CPU dispatch.
    In this example, the /QaxT /QxW (-axT -xW on Linux*) combination will produce binaries
    with two code paths, using the processor-dispatch technology. One code path will take full
    advantage of the Intel® Core™2 Duo Processor family, Quad-Core Intel® Xeon® processors,
    and Dual-Core Intel® Xeon® 5300, 5100 and 3000 series processors. The other code
    path also takes advantage of the capabilities provided by the Intel processor and will also
    run on processors that do not support SSE3. At runtime, the application automatically
    identifies the Intel processor on which it is running and selects the appropriate
    implementation, either specialized or generic.
    Notes: Option O is not supported for /Qax (-ax). On Mac OS*, options P, N, W and K
    are not supported.
/Qvec-report[n]    -vec-report[n]
    n = 0: no information
    n = 1: indicates vectorized loops (default)
    n = 2: indicates vectorized and non-vectorized loops
    n = 3: indicates vectorized loops and explains why non-vectorized loops were
           not vectorized
* The option values O, W, and K produce binaries that should run on processors not made by Intel, such as AMD
processors, that implement the same capabilities as the corresponding Intel processors. P and N option values perform
additional optimizations that are not enabled with option values O and W.
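The idea behind automatic processor dispatch, as described for /Qax (-ax), can be pictured as a run-time choice between two implementations of the same routine. The sketch below is only a hand-written analogy under stated assumptions: all names are hypothetical, the "specialized" path is an ordinary unrolled loop standing in for processor-specific code, and the feature check is passed in as a plain flag rather than read from the CPU. It does not show how the compiler implements dispatch internally.

```c
#include <stddef.h>

typedef long (*sum_fn)(const int *, size_t);

/* Generic code path: a plain loop that runs on any processor. */
static long sum_generic(const int *v, size_t n)
{
    long total = 0;
    size_t i;
    for (i = 0; i < n; ++i) {
        total += v[i];
    }
    return total;
}

/* Stand-in for a specialized code path: unrolled by four.
 * A real specialized path would use processor-specific instructions. */
static long sum_specialized(const int *v, size_t n)
{
    long t0 = 0, t1 = 0, t2 = 0, t3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        t0 += v[i];
        t1 += v[i + 1];
        t2 += v[i + 2];
        t3 += v[i + 3];
    }
    for (; i < n; ++i) { /* leftover elements */
        t0 += v[i];
    }
    return t0 + t1 + t2 + t3;
}

/* Hypothetical selector: cpu_has_feature stands in for a run-time
 * processor check performed once at program start. */
sum_fn select_sum(int cpu_has_feature)
{
    return cpu_has_feature ? sum_specialized : sum_generic;
}
```

Both paths must produce the same answer; only their speed on a given processor differs. That is why a dispatched binary can be larger than a single-path one.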
IA-64² Processor-Specific Optimization Options
Windows*    Linux*    Comment
/G2    -mtune=itanium2
    Targets optimization for the Intel Itanium 2 processor. Generated code
    is also compatible with the older IA-64 processor (default).
/G2-p9000    -mtune=itanium2-p9000
    Targets optimizations for Dual-Core Intel® Itanium® 2 9000 Sequence
    processors. Generated code is also compatible with all IA-64 processors,
    unless the user program calls intrinsic functions specific to the Dual-Core
    Intel Itanium 2 9000 Sequence processors.
/QIPF-fma[-]    -IPF-fma[-]
    Enables [disables] the combining of floating-point multiply operations
    and add/subtract operations. (Enabled by default)
/Qivdep-parallel    -ivdep-parallel
    Indicates that there is no forward or backward loop-carried memory
    dependency in the loop where the IVDEP directive is specified.
    Typically used in conjunction with /Qparallel (-parallel).
/Qprefetch[-]    -prefetch[-]
    Enables or disables prefetch insertion.
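To show where the IVDEP directive mentioned for /Qivdep-parallel (-ivdep-parallel) would sit in C source, here is a minimal sketch. The function scatter_add is hypothetical, and the example assumes the caller guarantees that the index array contains no repeated values; compilers that do not recognize the pragma simply ignore it.

```c
#include <stddef.h>

/* Adds b[i] into a[idx[i]] for i in [0, n).
 * The compiler cannot prove that idx holds distinct values, so it must
 * assume a possible loop-carried dependency through a[]. The ivdep pragma
 * is the programmer's assertion that no such dependency exists.
 * Assumption: the caller passes an idx array with no duplicates. */
void scatter_add(double *a, const double *b, const int *idx, size_t n)
{
    size_t i;

    #pragma ivdep
    for (i = 0; i < n; ++i) {
        a[idx[i]] += b[i];
    }
}
```

The pragma is an assertion, not a check: if idx does contain duplicates, an optimized build may give wrong results.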
Interprocedural Optimization (IPO) and Profile-Guided Optimization (PGO) Options
Windows*    Linux*, Mac OS*    Comment
/Qip    -ip
    Single file optimization. Interprocedural optimizations, including selective inlining,
    within the current source file.
    Caution: For large files, this option may sometimes significantly increase compile
    time and code size.
/Qipo[value]    -ipo[value]
    Permits inlining and other interprocedural optimizations among multiple
    source files. The optional value argument controls the maximum number
    of link-time compilations (or number of object files) spawned. Default for
    value is 0 (the compiler chooses).
    Caution: This option can in some cases significantly increase compile time
    and code size.
/Qipo-jobs[n]    -ipo-jobs[n]
    Specifies the number of commands (jobs) to be executed simultaneously
    during the link phase of Interprocedural Optimization (IPO). The default is
    1 job.
/Ob2    -finline-functions, -finline-level=2
    This option enables function inlining within the current source file at the
    compiler’s discretion. This option is enabled by default at /O2 and /O3 (-O2
    and -O3).
    Caution: For large files, this option may sometimes significantly increase
    compile time and code size. It can be disabled by /Ob0 (-fno-inline-functions
    on Linux* and Mac OS*).
/Qinline-factor=n    -finline-factor=n
    This option scales the total and maximum sizes of functions that can be
    inlined. The default value of n is 100, i.e., 100% or a scale factor of one.
/Qprof-gen    -prof-gen
    Instruments a program for profiling.
/Qprof-use    -prof-use
    Enables the use of profiling information during optimization.
/Qprof-dir dir    -prof-dir dir
    Specifies a directory for the profiling output files, *.dyn and *.dpi.
Floating-Point Arithmetic Options
Windows*    Linux*, Mac OS*    Comment
/fp:name    -fp-model name
    This method of controlling the consistency of floating-point results by restricting
    certain optimizations is recommended in preference to the /Op (-mp) and
    /Qprec (-mp1) switches. The possible values of name are:
    precise – Enables only value-safe optimizations on floating-point code.
    double/extended/source – Implies precise and causes intermediates to be
        computed in double, extended or source precision.
        The double and extended options are not available for Intel® Fortran.
    fast=[1|2] – Allows more aggressive optimizations at a slight cost in accuracy or
        consistency. (fast=1 is the default)
    except – Enables floating-point exception semantics.
    strict – Strictest mode of operation, enables both the precise
        and except options and disables fma contractions.
    Recommendation: /fp:source (-fp-model source) is the recommended form
    for the majority of situations on IA-64 processors, on processors supporting
    Intel® 64, and on IA-32 when SSE instructions are enabled with /QxW (-xW) or higher,
    when enhanced floating-point consistency and reproducibility are needed.
/Qfp-speculation mode    -fp-speculation mode
    Enables floating-point speculations with one of the following modes:
    fast – Speculate floating-point operations. (default)
    off – Disables speculation of floating-point operations.
    safe – Do not speculate if this could expose a floating-point exception.
    strict – This is the same as specifying off.
/Qftz[-]    -ftz[-]
    When the main program or dll main is compiled with this option, denormal results
    are flushed to zero for the whole program (dll). Setting this option does not
    guarantee that all denormals in a program are flushed to zero. It only causes
    denormals generated at run time to be flushed to zero.
    On IA-64-based systems, the default is off ex
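For readers unfamiliar with the denormal values that /Qftz (-ftz) concerns, the sketch below produces one and detects it with the standard C99 classification macro. The function names are hypothetical. The comments describe the default IEEE behavior (gradual underflow); in a program where flush-to-zero is in effect, the halved value would be expected to become 0.0 instead.

```c
#include <float.h>
#include <math.h>

/* Returns nonzero when x is a denormal (subnormal) number. */
int is_denormal(double x)
{
    return fpclassify(x) == FP_SUBNORMAL;
}

/* Halves the smallest normal double at run time.
 * Under default IEEE gradual underflow the result is a denormal;
 * volatile keeps the division from being folded at compile time. */
double half_of_smallest_normal(void)
{
    volatile double tiny = DBL_MIN;
    return tiny / 2.0;
}
```

Arithmetic on denormals can be much slower than on normal values on some processors, which is the usual motivation for flushing them to zero when the lost precision is acceptable.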