Performance Profiling Example

This example demonstrates best practices for optimizing application performance with oneDNN.

Example code: performance_profiling.cpp

This example uses DNNL_VERBOSE trace output to tune oneDNN code so that it aligns with best practices.
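
If setting environment variables is inconvenient, the verbose level can also be set programmatically. A minimal sketch, assuming the oneDNN v2.x C++ API (`dnnl::set_verbose()` wraps the C API's `dnnl_set_verbose()`):

@code{cpp}
#include "dnnl.hpp"

int main() {
    // Equivalent to exporting DNNL_VERBOSE=1 before launch:
    // print information about every primitive execution to stdout.
    dnnl::set_verbose(1);

    // ... create and execute primitives here; timings appear on stdout ...
    return 0;
}
@endcode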

The example assumes knowledge of memory formats and their usage in oneDNN. You can read more about this topic in the Memory Format Propagation example.

Additionally, see the article on the recommended environment for running benchmarks. The example has three different implementations of the mathematical operation:

1. *Naive implementation* executes 2D convolution followed by ReLU on the data in **NCHW** format. This implementation does not align with oneDNN best practices and results in suboptimal performance.
2. *Blocked format implementation* executes the same sequence of operations on a **blocked format** optimized for convolution performance. This implementation uses `format_tag=ANY` to create a convolution memory descriptor so that the library determines the data format optimal for the convolution implementation. It then **propagates the blocked format** to the non-intensive ReLU. This implementation results in better overall performance than the naive implementation.
3. *Fused implementation* executes convolution fused with ReLU on the blocked data format. This implementation uses `format_tag=ANY` to create a convolution memory descriptor, and then adds ReLU as a **post-op** to the convolution primitive. This version implements all of the best practices for inference, resulting in the best overall performance.

@section performance_profiling_cpp_walkthrough Walkthrough

The program in \ref performance_profiling.cpp includes all three implementations introduced above. You can select a specific implementation using command line options. After compilation, you can execute each implementation with:

@code{sh}
./program.exe [cpu|gpu] [implementation]
@endcode

Before you run the program, set your `DNNL_VERBOSE` environment variable to 1:

@code{sh}
export DNNL_VERBOSE=1
@endcode

The program starts by creating oneDNN memory objects in **NCHW** format. These are called `user_` because they represent the user's source data entering oneDNN in the NCHW format.

@snippet performance_profiling.cpp Set dimensions
@note Here the library allocates memory.

@snippet performance_profiling.cpp Create memory objects
@note You can change the batch size to easily increase/decrease the workload.

The following descriptions of each implementation reference each other and are meant to be read in order.

@section performance_profiling_cpp_implementation1 Naive Implementation

This implementation is launched with the following shell code:

@code{sh}
./program.exe cpu naive
@endcode

The program calls the implementation defined in the function `conv_relu_naive()`.

First it sets the dimensions and format for the convolution memory descriptors (`_md`) to match the `user_` values: one `md` each for source, destination, and weights data. Then it uses those `md` to create the convolution descriptor `conv_d`, which tells oneDNN to use the plain format (NCHW) for the convolution.

@snippet performance_profiling.cpp Create mem_desc
@snippet performance_profiling.cpp Create conv_desc

Next the program creates a convolution primitive descriptor `conv_pd`, which inherits the NCHW format from the `md` by way of `conv_d`. Finally it creates the convolution primitive `conv`, adds it to the stream `s`, and then executes the `create_and_execute_relu(user_dst)` function.

@snippet performance_profiling.cpp Create conv_prim_desc
@snippet performance_profiling.cpp Create conv_primitive
@snippet performance_profiling.cpp Add to stream
@snippet performance_profiling.cpp Create and execute relu

@note The function for creation and execution of the ReLU primitive is defined elsewhere to keep this example clean. It is a non-intensive operation, so the `create_and_execute_relu()` function uses whatever the input data format is at the time it is called.
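
For reference, the whole naive flow can be condensed into one self-contained function. This is a minimal sketch assuming the oneDNN v2.x C++ API; the function name and the use of library-allocated (uninitialized) buffers are illustrative, while the tensor shapes are taken from the verbose output shown below:

@code{cpp}
#include <unordered_map>
#include "dnnl.hpp"
using namespace dnnl;

void naive_conv_relu_sketch() {
    engine eng(engine::kind::cpu, 0);
    stream s(eng);

    // Shapes match the convolution in the verbose output below:
    // mb128, ic3 -> oc96, 227x227 -> 55x55, 11x11 kernel, stride 4, no padding.
    const memory::dims src_dims = {128, 3, 227, 227};
    const memory::dims wei_dims = {96, 3, 11, 11};
    const memory::dims dst_dims = {128, 96, 55, 55};
    const memory::dims strides = {4, 4}, padding = {0, 0};

    // Plain-format descriptors: nchw/oihw pin the layout explicitly.
    memory::desc src_md(src_dims, memory::data_type::f32, memory::format_tag::nchw);
    memory::desc wei_md(wei_dims, memory::data_type::f32, memory::format_tag::oihw);
    memory::desc dst_md(dst_dims, memory::data_type::f32, memory::format_tag::nchw);

    // The library allocates the buffers here (contents left uninitialized).
    memory user_src(src_md, eng), user_wei(wei_md, eng), user_dst(dst_md, eng);

    // v2.x flow: op descriptor -> primitive descriptor -> primitive.
    convolution_forward::desc conv_d(prop_kind::forward_inference,
            algorithm::convolution_direct, src_md, wei_md, dst_md,
            strides, padding, padding);
    convolution_forward::primitive_desc conv_pd(conv_d, eng);
    convolution_forward(conv_pd).execute(s,
            {{DNNL_ARG_SRC, user_src}, {DNNL_ARG_WEIGHTS, user_wei},
                    {DNNL_ARG_DST, user_dst}});

    // ReLU runs as a separate eltwise primitive, in place on the plain layout.
    eltwise_forward::desc relu_d(prop_kind::forward_inference,
            algorithm::eltwise_relu, dst_md, 0.f);
    eltwise_forward::primitive_desc relu_pd(relu_d, eng);
    eltwise_forward(relu_pd).execute(s,
            {{DNNL_ARG_SRC, user_dst}, {DNNL_ARG_DST, user_dst}});
    s.wait();
}
@endcode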
Using the NCHW data format may result in suboptimal performance for compute-intensive primitives, as shown in the following DNNL_VERBOSE output by the convolution and ReLU execution times of 38.3 and 2.9 milliseconds, respectively.

DNNL_VERBOSE output (see configuration notice*):
@code{sh}
dnnl_verbose,exec,cpu,convolution,gemm:jit,forward_inference,src_f32::blocked:abcd:f0 wei_f32::blocked:abcd:f0 bia_undef::undef::f0 dst_f32::blocked:abcd:f0,,alg:convolution_direct,mb128_ic3oc96_ih227oh55kh11sh4dh0ph0_iw227ow55kw11sw4dw0pw0,38.314
dnnl_verbose,exec,cpu,eltwise,jit:avx512_common,forward_inference,data_f32::blocked:abcd:f0 diff_undef::undef::f0,,alg:eltwise_relu alpha:0 beta:0,128x96x55x55,2.87695
@endcode

In the *blocked format implementation*, we will incorporate the best practice of letting oneDNN determine the optimal format for the convolution primitive.

@section performance_profiling_cpp_implementation2 Blocked format implementation

This implementation is launched with the following shell code:

@code{sh}
./program.exe cpu blocked
@endcode

The program calls the implementation defined in the function `conv_relu_blocked()`.

First it creates the `md` as in the *naive implementation*. Next it changes the dnnl::memory::format_tag for each `md` to `ANY`. Then it uses those `md` to create the convolution descriptor `conv_d`, which tells oneDNN to use whatever format it recommends for the convolution. oneDNN will choose a friendly blocked format.

@snippet performance_profiling.cpp Create mem_desc with tag=any
@snippet performance_profiling.cpp Create conv_desc implementation2

Next the program creates a convolution primitive descriptor `conv_pd` and convolution primitive `conv`, as in the naive implementation. However, in this implementation the structs will inherit the blocked format from the `md` by way of `conv_d`.

@snippet performance_profiling.cpp Create conv_prim_desc implementation2

Since the resulting convolution primitive expects blocked source data, conditional reorders are inserted to convert the input data to the blocked format if required. The input data `user_src` is NCHW, so this conditional is triggered:

@note The reorders are applied using the oneDNN `reorder` primitive.

@snippet performance_profiling.cpp Conditionally create and execute reorder prims

Finally it creates the convolution primitive `conv` and adds it to the stream `s` with the reordered data (`conv_src`, `conv_wei`, `conv_dst1`) as inputs, and then executes the `create_and_execute_relu(conv_dst)` function.

@snippet performance_profiling.cpp Create conv_primitive implementation2
@snippet performance_profiling.cpp Add to stream implementation2
@snippet performance_profiling.cpp Create and execute relu implementation2
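
The crux of this implementation is the combination of `format_tag::any` with a reorder guarded by a layout check. The following is a minimal sketch of that pattern, assuming the oneDNN v2.x C++ API; the helper names `any_md` and `reorder_if_needed` are illustrative and not part of the example's source:

@code{cpp}
#include "dnnl.hpp"
using namespace dnnl;

// Descriptors with format_tag::any let the convolution implementation pick
// its preferred (typically blocked) layout.
memory::desc any_md(const memory::dims &dims) {
    return memory::desc(dims, memory::data_type::f32, memory::format_tag::any);
}

// Reorder user data into the layout the implementation chose, but only
// when the two layouts actually differ.
memory reorder_if_needed(const memory::desc &want, memory &user_mem,
        engine &eng, stream &s) {
    if (want == user_mem.get_desc()) return user_mem; // layouts already match
    memory blocked(want, eng);                        // library-chosen layout
    reorder(user_mem, blocked).execute(s, user_mem, blocked);
    return blocked;
}
@endcode

After building `conv_pd` from descriptors created with `any_md()`, the convolution inputs would be prepared with calls such as `reorder_if_needed(conv_pd.src_desc(), user_src, eng, s)`, and likewise for the weights.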
Blocked memory format is recommended for oneDNN primitive execution and provides better performance, as shown in the DNNL_VERBOSE output by the convolution and ReLU execution times of 18.3 and 2.7 milliseconds (down from 38.3 and 2.9 in the *naive implementation*), respectively.

In this implementation, there are additional reorder operations that execute before and after the conv + relu. This small cost is worth the gain from executing in the blocked format. In fact, it becomes negligible when chaining together multiple oneDNN operations in succession: in those situations, you can do one reorder at the beginning and one at the end of the chain, and pay the reorder penalty only at those points in the execution.

DNNL_VERBOSE output (see configuration notice*):
@code{sh}
dnnl_verbose,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:abcd:f0 dst_f32::blocked:Acdb16a:f0,,,96x3x11x11,0.0310059
dnnl_verbose,exec,cpu,convolution,jit:avx512_common,forward_inference,src_f32::blocked:abcd:f0 wei_f32::blocked:Acdb16a:f0 bia_undef::undef::f0 dst_f32::blocked:aBcd16b:f0,,alg:convolution_direct,mb128_ic3oc96_ih227oh55kh11sh4dh0ph0_iw227ow55kw11sw4dw0pw0,18.3101
dnnl_verbose,exec,cpu,eltwise,jit:avx512_common,forward_inference,data_f32::blocked:aBcd16b:f0 diff_undef::undef::f0,,alg:eltwise_relu alpha:0 beta:0,128x96x55x55,2.66895
dnnl_verbose,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:aBcd16b:f0 dst_f32::blocked:abcd:f0,,,128x96x55x55,4.80396
@endcode

This inference implementation is closer to best practices than the *naive implementation* because it uses the oneDNN recommended memory format. The *fused implementation* will further optimize performance by fusing convolution with ReLU using oneDNN @ref dev_guide_attributes_post_ops "post-ops".

@section performance_profiling_cpp_implementation3 Fused Implementation

This implementation is launched with the following shell code:

@code{sh}
./program.exe cpu fused
@endcode

The program calls the implementation defined in the function `conv_relu_fused()`.

First the memory descriptors and the convolution descriptor are created as in the *naive implementation*. Then, in preparation for the convolution primitive descriptor, a ReLU post-op is built and added to the primitive attribute `attr`:

@snippet performance_profiling.cpp Create post_op attr with relu

The convolution primitive descriptor is then created with the ReLU post-op by way of the attributes `attr`:

@snippet performance_profiling.cpp Create prim_desc with attr

Then conditional reorders are applied as in the *blocked format implementation* to convert the `user_` format NCHW to blocked. Finally, it creates the convolution primitive `conv` and adds it to the stream `s` with the reordered data (`conv_src`, `conv_wei`, `conv_dst1`).

@note There is no separate addition to the stream for the ReLU operation because it has been added as a post-op to the `conv` primitive.

@snippet performance_profiling.cpp Create conv_primitive implementation3
@snippet performance_profiling.cpp Add to stream implementation3

This implementation complies with best practices for f32 inference by using the oneDNN recommended blocked format for convolution and adding ReLU as a post-op to execute a fused version of conv + ReLU.
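
The fusion itself is carried by the primitive attributes. A minimal sketch of building them, assuming the oneDNN v2.x C++ API; the helper name `make_fused_conv_pd` is illustrative:

@code{cpp}
#include "dnnl.hpp"
using namespace dnnl;

// Build a convolution primitive descriptor with ReLU fused in as a post-op.
convolution_forward::primitive_desc make_fused_conv_pd(
        const convolution_forward::desc &conv_d, engine &eng) {
    post_ops ops;
    // v2.x signature: append_eltwise(scale, algorithm, alpha, beta);
    // alpha is the negative slope (0 => standard ReLU), beta is unused.
    ops.append_eltwise(1.0f, algorithm::eltwise_relu, 0.0f, 0.0f);

    primitive_attr attr;
    attr.set_post_ops(ops);

    // Passing attr bakes ReLU into the convolution itself, so no separate
    // eltwise primitive needs to be submitted to the stream.
    return convolution_forward::primitive_desc(conv_d, attr, eng);
}
@endcode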
The consequence of following best practices can be seen in the execution time of the fused primitive: 18.0 milliseconds.

DNNL_VERBOSE output (see configuration notice*):
@code{sh}
dnnl_verbose,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:abcd:f0 dst_f32::blocked:Acdb16a:f0,,,96x3x11x11,0.0148926
dnnl_verbose,exec,cpu,convolution,jit:avx512_common,forward_inference,src_f32::blocked:abcd:f0 wei_f32::blocked:Acdb16a:f0 bia_undef::undef::f0 dst_f32::blocked:aBcd16b:f0,post_ops:'eltwise_relu;';,alg:convolution_direct,mb128_ic3oc96_ih227oh55kh11sh4dh0ph0_iw227ow55kw11sw4dw0pw0,17.968
dnnl_verbose,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:aBcd16b:f0 dst_f32::blocked:abcd:f0,,,128x96x55x55,4.66797
@endcode

@section performance_profiling_cpp_roundup Performance summary

| Implementation | Time, ms | Cumulative speedup |
|:---------------|---------:|-------------------:|
| Naive          | 41.2     | 1.0                |
| Blocked format | 21.0     | 2.0                |
| Fused          | 18.0     | 2.3                |

@section performance_profiling_cpp_config Configuration Notice

@note This example is meant to demonstrate oneDNN best practices. It is not meant for benchmarking purposes. The platform is not fully optimized, so the primitive execution times are only relevant in relation to the other times in this example.

Runtime Settings:

- OMP_NUM_THREADS=14
- KMP_AFFINITY=granularity=fine,compact

Platform:

- CPU: Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz
- Thread(s) per core: 1
- Core(s) per socket: 28
- Socket(s): 2
- NUMA node(s): 2
- RAM (DDR4): 192 GB