1<h1>TensorFlow Profiler and Advisor</h1> 2 3* [Features](#features) 4* [Quick Start](#quick-start) 5* [Demo](#demo) 6* [Feature Request and Bug Report](#feature-request-and-bug-report) 7 8 9### Features 10 11* Profile model architectures. 12 * parameters, tensor shapes, float operations, device placement, etc. 13* Profile multiple-steps model performance. 14 * execution time, memory consumption. 15* Auto profile and advise. 16 * accelerator utilization check 17 * expensive operation check 18 * operation configuration check 19 * distributed runtime check (Not OSS) 20 21### Quick Start 22 23```python 24# When using high-level API, session is usually hidden. 25# 26# Under the default ProfileContext, run a few hundred steps. 27# The ProfileContext will sample some steps and dump the profiles 28# to files. Users can then use command line tool or Web UI for 29# interactive profiling. 30with tf.contrib.tfprof.ProfileContext('/tmp/train_dir') as pctx: 31 # High level API, such as slim, Estimator, etc. 32 train_loop() 33 34bazel-bin/tensorflow/core/profiler/profiler \ 35 --profile_path=/tmp/train_dir/profile_xx 36tfprof> op -select micros,bytes,occurrence -order_by micros 37 38# To be open sourced... 39bazel-bin/tensorflow/python/profiler/profiler_ui \ 40 --profile_path=/tmp/profiles/profile_1 41``` 42![ProfilerUI](g3doc/profiler_ui.jpg) 43 44```python 45# When using lower-level APIs with a Session object. User can have 46# explicit control of each step. 47# 48# Create options to profile the time and memory information. 49builder = tf.profiler.ProfileOptionBuilder 50opts = builder(builder.time_and_memory()).order_by('micros').build() 51# Create a profiling context, set constructor argument `trace_steps`, 52# `dump_steps` to empty for explicit control. 53with tf.contrib.tfprof.ProfileContext('/tmp/train_dir', 54 trace_steps=[], 55 dump_steps=[]) as pctx: 56 with tf.Session() as sess: 57 # Enable tracing for next session.run. 58 pctx.trace_next_step() 59 # Dump the profile to '/tmp/train_dir' after the step. 60 pctx.dump_next_step() 61 _ = session.run(train_op) 62 pctx.profiler.profile_operations(options=opts) 63``` 64 65```python 66# For more advanced usage, user can control the tracing steps and 67# dumping steps. User can also run online profiling during training. 68# 69# Create options to profile time/memory as well as parameters. 70builder = tf.profiler.ProfileOptionBuilder 71opts = builder(builder.time_and_memory()).order_by('micros').build() 72opts2 = tf.profiler.ProfileOptionBuilder.trainable_variables_parameter() 73 74# Collect traces of steps 10~20, dump the whole profile (with traces of 75# step 10~20) at step 20. The dumped profile can be used for further profiling 76# with command line interface or Web UI. 77with tf.contrib.tfprof.ProfileContext('/tmp/train_dir', 78 trace_steps=range(10, 20), 79 dump_steps=[20]) as pctx: 80 # Run online profiling with 'op' view and 'opts' options at step 15, 18, 20. 81 pctx.add_auto_profiling('op', opts, [15, 18, 20]) 82 # Run online profiling with 'scope' view and 'opts2' options at step 20. 83 pctx.add_auto_profiling('scope', opts2, [20]) 84 # High level API, such as slim, Estimator, etc. 85 train_loop() 86``` 87 88 89<b>Detail Tutorials</b> 90 91* [Python API](g3doc/python_api.md) 92* [Command Line Interface](g3doc/command_line.md) 93* [Profile Time](g3doc/profile_time.md) 94* [Profile Memory](g3doc/profile_memory.md) 95* [Profile Model Architecture](g3doc/profile_model_architecture.md) 96* [Auto Detect and Advise](g3doc/advise.md) 97 98<b>Detail Documentation</b> 99 100* [Options](g3doc/options.md) 101 102## Demo 103 104### Attribute TensorFlow graph running time to your Python codes. 105```shell 106tfprof> code -max_depth 1000 -show_name_regexes .*model_analyzer.*py.* -select micros -account_type_regexes .* -order_by micros 107_TFProfRoot (0us/22.44ms) 108 model_analyzer_test.py:149:run_filename_as_m...:none (0us/22.44ms) 109 model_analyzer_test.py:33:_run_code_in_main:none (0us/22.44ms) 110 model_analyzer_test.py:208:<module>:test.main() (0us/22.44ms) 111 model_analyzer_test.py:132:testComplexCodeView:x = lib.BuildFull... (0us/22.44ms) 112 model_analyzer_testlib.py:63:BuildFullModel:return sgd_op.min... (0us/21.83ms) 113 model_analyzer_testlib.py:58:BuildFullModel:cell, array_ops.c... (0us/333us) 114 model_analyzer_testlib.py:54:BuildFullModel:seq.append(array_... (0us/254us) 115 model_analyzer_testlib.py:42:BuildSmallModel:x = nn_ops.conv2d... (0us/134us) 116 model_analyzer_testlib.py:46:BuildSmallModel:initializer=init_... (0us/40us) 117 ... 118 model_analyzer_testlib.py:61:BuildFullModel:loss = nn_ops.l2_... (0us/28us) 119 model_analyzer_testlib.py:60:BuildFullModel:target = array_op... (0us/0us) 120 model_analyzer_test.py:134:testComplexCodeView:sess.run(variable... (0us/0us) 121``` 122 123### Show your model variables and the number of parameters. 124``` 125tfprof> scope -account_type_regexes VariableV2 -max_depth 4 -select params 126_TFProfRoot (--/930.58k params) 127 global_step (1/1 params) 128 init/init_conv/DW (3x3x3x16, 432/864 params) 129 pool_logit/DW (64x10, 640/1.28k params) 130 pool_logit/DW/Momentum (64x10, 640/640 params) 131 pool_logit/biases (10, 10/20 params) 132 pool_logit/biases/Momentum (10, 10/10 params) 133 unit_last/final_bn/beta (64, 64/128 params) 134 unit_last/final_bn/gamma (64, 64/128 params) 135 unit_last/final_bn/moving_mean (64, 64/64 params) 136 unit_last/final_bn/moving_variance (64, 64/64 params) 137``` 138 139### Show the most expensive operation types. 140``` 141tfprof> op -select micros,bytes,occurrence -order_by micros 142node name | requested bytes | total execution time | accelerator execution time | cpu execution time | op occurrence (run|defined) 143SoftmaxCrossEntropyWithLogits 36.58MB (100.00%, 0.05%), 1.37sec (100.00%, 26.68%), 0us (100.00%, 0.00%), 1.37sec (100.00%, 30.75%), 30|30 144MatMul 2720.57MB (99.95%, 3.66%), 708.14ms (73.32%, 13.83%), 280.76ms (100.00%, 41.42%), 427.39ms (69.25%, 9.62%), 2694|3450 145ConcatV2 741.37MB (96.29%, 1.00%), 389.63ms (59.49%, 7.61%), 31.80ms (58.58%, 4.69%), 357.83ms (59.63%, 8.05%), 4801|6098 146Mul 3957.24MB (95.29%, 5.33%), 338.02ms (51.88%, 6.60%), 80.88ms (53.88%, 11.93%), 257.14ms (51.58%, 5.79%), 7282|9427 147Add 740.05MB (89.96%, 1.00%), 321.76ms (45.28%, 6.28%), 13.50ms (41.95%, 1.99%), 308.26ms (45.79%, 6.94%), 1699|2180 148Sub 32.46MB (88.97%, 0.04%), 216.20ms (39.00%, 4.22%), 241us (39.96%, 0.04%), 215.96ms (38.85%, 4.86%), 1780|4372 149Slice 708.07MB (88.92%, 0.95%), 179.88ms (34.78%, 3.51%), 25.38ms (39.92%, 3.74%), 154.50ms (33.99%, 3.48%), 5800|7277 150AddN 733.21MB (87.97%, 0.99%), 158.36ms (31.26%, 3.09%), 50.10ms (36.18%, 7.39%), 108.26ms (30.51%, 2.44%), 4567|5481 151Fill 954.27MB (86.98%, 1.28%), 138.29ms (28.17%, 2.70%), 16.21ms (28.79%, 2.39%), 122.08ms (28.08%, 2.75%), 3278|9686 152Select 312.33MB (85.70%, 0.42%), 104.75ms (25.47%, 2.05%), 18.30ms (26.40%, 2.70%), 86.45ms (25.33%, 1.95%), 2880|5746 153ApplyAdam 231.65MB (85.28%, 0.31%), 92.66ms (23.43%, 1.81%), 0us (23.70%, 0.00%), 92.66ms (23.38%, 2.09%), 27|27 154``` 155 156### Auto-profile. 157 158```shell 159tfprof> advise 160Not running under xxxx. Skip JobChecker. 161 162AcceleratorUtilizationChecker: 163device: /job:worker/replica:0/task:0/device:GPU:0 low utilization: 0.03 164device: /job:worker/replica:0/task:0/device:GPU:1 low utilization: 0.08 165device: /job:worker/replica:0/task:0/device:GPU:2 low utilization: 0.04 166device: /job:worker/replica:0/task:0/device:GPU:3 low utilization: 0.21 167 168OperationChecker: 169Found operation using NHWC data_format on GPU. Maybe NCHW is faster. 170 171JobChecker: 172 173ExpensiveOperationChecker: 174top 1 operation type: SoftmaxCrossEntropyWithLogits, cpu: 1.37sec, accelerator: 0us, total: 1.37sec (26.68%) 175top 2 operation type: MatMul, cpu: 427.39ms, accelerator: 280.76ms, total: 708.14ms (13.83%) 176top 3 operation type: ConcatV2, cpu: 357.83ms, accelerator: 31.80ms, total: 389.63ms (7.61%) 177top 1 graph node: seq2seq/loss/sampled_sequence_loss/sequence_loss_by_example/SoftmaxCrossEntropyWithLogits_11, cpu: 89.92ms, accelerator: 0us, total: 89.92ms 178top 2 graph node: train_step/update_seq2seq/output_projection/w/ApplyAdam, cpu: 84.52ms, accelerator: 0us, total: 84.52ms 179top 3 graph node: seq2seq/loss/sampled_sequence_loss/sequence_loss_by_example/SoftmaxCrossEntropyWithLogits_19, cpu: 73.02ms, accelerator: 0us, total: 73.02ms 180seq2seq_attention_model.py:360:build_graph:self._add_seq2seq(), cpu: 3.16sec, accelerator: 214.84ms, total: 3.37sec 181 seq2seq_attention_model.py:293:_add_seq2seq:decoder_outputs, ..., cpu: 2.46sec, accelerator: 3.25ms, total: 2.47sec 182 seq2seq_lib.py:181:sampled_sequence_...:average_across_ti..., cpu: 2.46sec, accelerator: 3.24ms, total: 2.47sec 183 seq2seq_lib.py:147:sequence_loss_by_...:crossent = loss_f..., cpu: 2.46sec, accelerator: 3.06ms, total: 2.46sec 184 seq2seq_lib.py:148:sequence_loss_by_...:log_perp_list.app..., cpu: 1.33ms, accelerator: 120us, total: 1.45ms 185 seq2seq_attention_model.py:192:_add_seq2seq:sequence_length=a..., cpu: 651.56ms, accelerator: 158.92ms, total: 810.48ms 186 seq2seq_lib.py:104:bidirectional_rnn:sequence_length, ..., cpu: 306.58ms, accelerator: 73.54ms, total: 380.12ms 187 core_rnn.py:195:static_rnn:state_size=cell.s..., cpu: 306.52ms, accelerator: 73.54ms, total: 380.05ms 188 seq2seq_lib.py:110:bidirectional_rnn:initial_state_bw,..., cpu: 296.21ms, accelerator: 73.54ms, total: 369.75ms 189 core_rnn.py:195:static_rnn:state_size=cell.s..., cpu: 296.11ms, accelerator: 73.54ms, total: 369.65ms 190 seq2seq_lib.py:113:bidirectional_rnn:outputs = [tf.con..., cpu: 46.88ms, accelerator: 3.87ms, total: 50.75ms 191 seq2seq_attention_model.py:253:_add_seq2seq:initial_state_att..., cpu: 32.48ms, accelerator: 50.01ms, total: 82.50ms 192 seq2seq.py:693:attention_decoder:attns = attention..., cpu: 11.73ms, accelerator: 38.41ms, total: 50.14ms 193 seq2seq.py:653:attention:s = math_ops.redu..., cpu: 2.62ms, accelerator: 17.80ms, total: 20.41ms 194 seq2seq.py:658:attention:array_ops.reshape..., cpu: 1.90ms, accelerator: 12.08ms, total: 13.98ms 195 seq2seq.py:655:attention:a = nn_ops.softma..., cpu: 4.15ms, accelerator: 4.25ms, total: 8.40ms 196 seq2seq.py:686:attention_decoder:cell_output, stat..., cpu: 14.43ms, accelerator: 4.85ms, total: 19.27ms 197 seq2seq.py:696:attention_decoder:output = linear([..., cpu: 3.04ms, accelerator: 2.88ms, total: 5.93ms 198 core_rnn_cell_impl.py:1009:_linear:res = math_ops.ma..., cpu: 2.33ms, accelerator: 2.71ms, total: 5.04ms 199seq2seq_attention_model.py:363:build_graph:self._add_train_o..., cpu: 1.28sec, accelerator: 462.93ms, total: 1.74sec 200 seq2seq_attention_model.py:307:_add_train_op:tf.gradients(self..., cpu: 967.84ms, accelerator: 462.88ms, total: 1.43sec 201 gradients_impl.py:563:gradients:grad_scope, op, f..., cpu: 692.60ms, accelerator: 390.75ms, total: 1.08sec 202 gradients_impl.py:554:gradients:out_grads[i] = co..., cpu: 164.71ms, accelerator: 16.21ms, total: 180.92ms 203 control_flow_ops.py:1314:ZerosLikeOutsideL...:return array_ops...., cpu: 121.85ms, accelerator: 16.21ms, total: 138.05ms 204 control_flow_ops.py:1313:ZerosLikeOutsideL...:zeros_shape = arr..., cpu: 22.85ms, accelerator: 0us, total: 22.85ms 205 control_flow_ops.py:1312:ZerosLikeOutsideL...:switch_val = swit..., cpu: 20.02ms, accelerator: 0us, total: 20.02ms 206 gradients_impl.py:515:gradients:out_grads = _Aggr..., cpu: 108.69ms, accelerator: 51.92ms, total: 160.61ms 207 gradients_impl.py:846:_AggregatedGrads:out_grads[i] = _M..., cpu: 107.99ms, accelerator: 50.05ms, total: 158.04ms 208 gradients_impl.py:856:_AggregatedGrads:array_ops.concat(..., cpu: 340us, accelerator: 1.87ms, total: 2.21ms 209 seq2seq_attention_model.py:322:_add_train_op:zip(grads, tvars)..., cpu: 307.56ms, accelerator: 0us, total: 307.56ms 210 optimizer.py:456:apply_gradients:update_ops.append..., cpu: 307.43ms, accelerator: 0us, total: 307.43ms 211 optimizer.py:102:update_op:return optimizer...., cpu: 222.66ms, accelerator: 0us, total: 222.66ms 212 optimizer.py:97:update_op:return optimizer...., cpu: 84.76ms, accelerator: 0us, total: 84.76ms 213``` 214 215### Visualize time and memory 216 217```shell 218# The following example generates a timeline. 219tfprof> graph -step -1 -max_depth 100000 -output timeline:outfile=<filename> 220 221generating trace file. 222 223****************************************************** 224Timeline file is written to <filename>. 225Open a Chrome browser, enter URL chrome://tracing and load the timeline file. 226****************************************************** 227``` 228 229![Timeline](g3doc/graph_timeline.png) 230 231```shell 232# The following example generates a pprof graph (only supported by code view). 233# Since TensorFlow runs the graph instead of Python code, the pprof graph 234# doesn't profile the statistics of Python, but the TensorFlow graph 235# nodes created by the Python call stack. 236# Nevertheless, it pops critical Python code path for us. 237# 238# `-trim_name_regexes` trims the some traces that have no valuable information. 239# `-select accelerator_micros` pick accelerator time for pprof graph. User 240# can also generate memory profile using `-select bytes` 241tfprof> code -select accelerator_micros -max_depth 100000 -output pprof:outfile=<filename> -trim_name_regexes .*apply_op.* 242 243# Use google-pprof, from the google-perftools package to visualize the generated file. 244# On Ubuntu you can install it with `apt-get install it google-perftools`. 245google-pprof --pdf --nodecount=100 <filename> 246``` 247 248![PprofGraph](g3doc/pprof.jpg) 249 250### Feature Request and Bug Report 251 252Contact: xpan@google.com 253 254Providing `GraphDef` and `RunMetadata` file will greatly help 255bug fix. `OpLogProto` is a good plus if it is used. 256 257 258#### Teams 259 260* Xin Pan 261* Chris Antaki 262* Yao Zhang 263* Jon Shlens 264