# Benchmarks

## Overview

A selection of image classification models were tested across multiple platforms
to create a point of reference for the TensorFlow community. The
[Methodology](#methodology) section details how the tests were executed and has
links to the scripts used.

## Results for image classification models

InceptionV3 ([arXiv:1512.00567](https://arxiv.org/abs/1512.00567)), ResNet-50
([arXiv:1512.03385](https://arxiv.org/abs/1512.03385)), ResNet-152
([arXiv:1512.03385](https://arxiv.org/abs/1512.03385)), VGG16
([arXiv:1409.1556](https://arxiv.org/abs/1409.1556)), and
[AlexNet](http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf)
were tested using the [ImageNet](http://www.image-net.org/) data set. Tests were
run on Google Compute Engine, Amazon Elastic Compute Cloud (Amazon EC2), and an
NVIDIA® DGX-1™. Most of the tests were run with both synthetic and real data.
Testing with synthetic data was done by using a `tf.Variable` set to the same
shape as the data expected by each model for ImageNet. We believe it is
important to include real data measurements when benchmarking a platform. This
load tests both the underlying hardware and the framework at preparing data for
actual training. We start with synthetic data to remove disk I/O as a variable
and to set a baseline. Real data is then used to verify that the TensorFlow
input pipeline and the underlying disk I/O can keep the compute units saturated.
Results below are reported as training throughput in images per second.

### Training with NVIDIA® DGX-1™ (NVIDIA® Tesla® P100)

<div style="width:95%; margin:auto; margin-bottom:10px; margin-top:20px;">
  <img style="width:80%" src="../images/perf_summary_p100_single_server.png">
</div>

Details and additional results are in the [Details for NVIDIA® DGX-1™ (NVIDIA®
Tesla® P100)](#details_for_nvidia_dgx-1tm_nvidia_tesla_p100) section.

### Training with NVIDIA® Tesla® K80

<div style="width:95%; margin:auto; margin-bottom:10px; margin-top:20px;">
  <img style="width:80%" src="../images/perf_summary_k80_single_server.png">
</div>

Details and additional results are in the [Details for Google Compute Engine
(NVIDIA® Tesla® K80)](#details_for_google_compute_engine_nvidia_tesla_k80) and
[Details for Amazon EC2 (NVIDIA® Tesla®
K80)](#details_for_amazon_ec2_nvidia_tesla_k80) sections.

### Distributed training with NVIDIA® Tesla® K80

<div style="width:95%; margin:auto; margin-bottom:10px; margin-top:20px;">
  <img style="width:80%" src="../images/perf_summary_k80_aws_distributed.png">
</div>

Details and additional results are in the [Details for Amazon EC2 Distributed
(NVIDIA® Tesla® K80)](#details_for_amazon_ec2_distributed_nvidia_tesla_k80)
section.
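Before comparing synthetic and real data results, it helps to be concrete about
what a synthetic-data run measures. The following is a minimal sketch, not the
benchmark script itself, of the approach described in the overview: the input
pipeline is replaced by a `tf.Variable` holding an ImageNet-shaped batch. The
224×224 image size and the TensorFlow 1.x calls are assumptions for illustration
(InceptionV3, for example, takes 299×299 inputs).

```python
import tensorflow as tf

# Hypothetical illustration of a synthetic-data input: a tf.Variable with the
# same shape as a real ImageNet batch, so no disk I/O or preprocessing occurs.
batch_size = 64                          # per-GPU batch size used for most models
image_shape = [batch_size, 224, 224, 3]  # assumed input size (e.g. ResNet-50 / VGG16)

synthetic_images = tf.Variable(
    tf.random_normal(image_shape, dtype=tf.float32, stddev=1e-1),
    trainable=False, name='synthetic_images')
synthetic_labels = tf.Variable(
    tf.random_uniform([batch_size], minval=0, maxval=1000, dtype=tf.int32),
    trainable=False, name='synthetic_labels')

# A model built on these tensors trains without touching the input pipeline,
# which removes disk I/O as a variable and sets a compute-only baseline.
```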
### Compare synthetic with real data training

**NVIDIA® Tesla® P100**

<div style="width:95%; margin:auto; margin-bottom:10px; margin-top:20px;">
  <img style="width:35%" src="../images/perf_summary_p100_data_compare_inceptionv3.png">
  <img style="width:35%" src="../images/perf_summary_p100_data_compare_resnet50.png">
</div>

**NVIDIA® Tesla® K80**

<div style="width:95%; margin:auto; margin-bottom:10px; margin-top:20px;">
  <img style="width:35%" src="../images/perf_summary_k80_data_compare_inceptionv3.png">
  <img style="width:35%" src="../images/perf_summary_k80_data_compare_resnet50.png">
</div>

## Details for NVIDIA® DGX-1™ (NVIDIA® Tesla® P100)

### Environment

* **Instance type:** NVIDIA® DGX-1™
* **GPU:** 8x NVIDIA® Tesla® P100
* **OS:** Ubuntu 16.04 LTS with tests run via Docker
* **CUDA / cuDNN:** 8.0 / 5.1
* **TensorFlow GitHub hash:** b1e174e
* **Benchmark GitHub hash:** 9165a70
* **Build Command:** `bazel build -c opt --copt=-march="haswell" --config=cuda //tensorflow/tools/pip_package:build_pip_package`
* **Disk:** Local SSD
* **DataSet:** ImageNet
* **Test Date:** May 2017

Batch size and optimizer used for each model are listed in the table below. In
addition to the batch sizes listed in the table, InceptionV3, ResNet-50,
ResNet-152, and VGG16 were tested with a batch size of 32. Those results are in
the *Other Results* section.

Options            | InceptionV3 | ResNet-50 | ResNet-152 | AlexNet | VGG16
------------------ | ----------- | --------- | ---------- | ------- | -----
Batch size per GPU | 64          | 64        | 64         | 512     | 64
Optimizer          | sgd         | sgd       | sgd        | sgd     | sgd

Configuration used for each model:

Model       | variable_update        | local_parameter_device
----------- | ---------------------- | ----------------------
InceptionV3 | parameter_server       | cpu
ResNet-50   | parameter_server       | cpu
ResNet-152  | parameter_server       | cpu
AlexNet     | replicated (with NCCL) | n/a
VGG16       | replicated (with NCCL) | n/a

### Results

<div style="width:95%; margin:auto; margin-bottom:10px; margin-top:20px;">
  <img style="width:80%" src="../images/perf_summary_p100_single_server.png">
</div>

<div style="width:95%; margin:auto; margin-bottom:10px; margin-top:20px;">
  <img style="width:35%" src="../images/perf_dgx1_synth_p100_single_server_scaling.png">
  <img style="width:35%" src="../images/perf_dgx1_real_p100_single_server_scaling.png">
</div>

**Training synthetic data**

GPUs | InceptionV3 | ResNet-50 | ResNet-152 | AlexNet | VGG16
---- | ----------- | --------- | ---------- | ------- | -----
1    | 142         | 219       | 91.8       | 2987    | 154
2    | 284         | 422       | 181        | 5658    | 295
4    | 569         | 852       | 356        | 10509   | 584
8    | 1131        | 1734      | 716        | 17822   | 1081

**Training real data**

GPUs | InceptionV3 | ResNet-50 | ResNet-152 | AlexNet | VGG16
---- | ----------- | --------- | ---------- | ------- | -----
1    | 142         | 218       | 91.4       | 2890    | 154
2    | 278         | 425       | 179        | 4448    | 284
4    | 551         | 853       | 359        | 7105    | 534
8    | 1079        | 1630      | 708        | N/A     | 898

Training AlexNet with real data on 8 GPUs was excluded from the graph and table
above because it maxed out the input pipeline.
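As a rough guide to reading the scaling charts above, the 1-GPU and 8-GPU rows
of the synthetic-data table can be turned into a scaling-efficiency figure
(8-GPU throughput divided by eight times the 1-GPU throughput). The snippet
below is only a worked example over the numbers in the table; it is not part of
the benchmark scripts.

```python
# Worked example: scaling efficiency on 8 GPUs, computed from the
# synthetic-data table above (values are images/sec).
one_gpu   = {'InceptionV3': 142, 'ResNet-50': 219, 'ResNet-152': 91.8,
             'AlexNet': 2987, 'VGG16': 154}
eight_gpu = {'InceptionV3': 1131, 'ResNet-50': 1734, 'ResNet-152': 716,
             'AlexNet': 17822, 'VGG16': 1081}

for model, base in sorted(one_gpu.items()):
    efficiency = eight_gpu[model] / (8.0 * base)
    print('%-11s: %.0f%% of linear scaling' % (model, 100 * efficiency))
```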
### Other Results

The results below are all with a batch size of 32.

**Training synthetic data**

GPUs | InceptionV3 | ResNet-50 | ResNet-152 | VGG16
---- | ----------- | --------- | ---------- | -----
1    | 128         | 195       | 82.7       | 144
2    | 259         | 368       | 160        | 281
4    | 520         | 768       | 317        | 549
8    | 995         | 1485      | 632        | 820

**Training real data**

GPUs | InceptionV3 | ResNet-50 | ResNet-152 | VGG16
---- | ----------- | --------- | ---------- | -----
1    | 130         | 193       | 82.4       | 144
2    | 257         | 369       | 159        | 253
4    | 507         | 760       | 317        | 457
8    | 966         | 1410      | 609        | 690

## Details for Google Compute Engine (NVIDIA® Tesla® K80)

### Environment

* **Instance type:** n1-standard-32-k80x8
* **GPU:** 8x NVIDIA® Tesla® K80
* **OS:** Ubuntu 16.04 LTS
* **CUDA / cuDNN:** 8.0 / 5.1
* **TensorFlow GitHub hash:** b1e174e
* **Benchmark GitHub hash:** 9165a70
* **Build Command:** `bazel build -c opt --copt=-march="haswell" --config=cuda //tensorflow/tools/pip_package:build_pip_package`
* **Disk:** 1.7 TB Shared SSD persistent disk (800 MB/s)
* **DataSet:** ImageNet
* **Test Date:** May 2017

Batch size and optimizer used for each model are listed in the table below. In
addition to the batch sizes listed in the table, InceptionV3 and ResNet-50 were
tested with a batch size of 32. Those results are in the *Other Results*
section.

Options            | InceptionV3 | ResNet-50 | ResNet-152 | AlexNet | VGG16
------------------ | ----------- | --------- | ---------- | ------- | -----
Batch size per GPU | 64          | 64        | 32         | 512     | 32
Optimizer          | sgd         | sgd       | sgd        | sgd     | sgd

The configuration used for each model was `variable_update` equal to
`parameter_server` and `local_parameter_device` equal to `cpu`.

### Results

<div style="width:95%; margin:auto; margin-bottom:10px; margin-top:20px;">
  <img style="width:35%" src="../images/perf_gce_synth_k80_single_server_scaling.png">
  <img style="width:35%" src="../images/perf_gce_real_k80_single_server_scaling.png">
</div>

**Training synthetic data**

GPUs | InceptionV3 | ResNet-50 | ResNet-152 | AlexNet | VGG16
---- | ----------- | --------- | ---------- | ------- | -----
1    | 30.5        | 51.9      | 20.0       | 656     | 35.4
2    | 57.8        | 99.0      | 38.2       | 1209    | 64.8
4    | 116         | 195       | 75.8       | 2328    | 120
8    | 227         | 387       | 148        | 4640    | 234

**Training real data**

GPUs | InceptionV3 | ResNet-50 | ResNet-152 | AlexNet | VGG16
---- | ----------- | --------- | ---------- | ------- | -----
1    | 30.6        | 51.2      | 20.0       | 639     | 34.2
2    | 58.4        | 98.8      | 38.3       | 1136    | 62.9
4    | 115         | 194       | 75.4       | 2067    | 118
8    | 225         | 381       | 148        | 4056    | 230

### Other Results

**Training synthetic data**

GPUs | InceptionV3 (batch size 32) | ResNet-50 (batch size 32)
---- | --------------------------- | -------------------------
1    | 29.3                        | 49.5
2    | 55.0                        | 95.4
4    | 109                         | 183
8    | 216                         | 362

**Training real data**

GPUs | InceptionV3 (batch size 32) | ResNet-50 (batch size 32)
---- | --------------------------- | -------------------------
1    | 29.5                        | 49.3
2    | 55.4                        | 95.3
4    | 110                         | 186
8    | 216                         | 359

## Details for Amazon EC2 (NVIDIA® Tesla® K80)

### Environment

* **Instance type:** p2.8xlarge
* **GPU:** 8x NVIDIA® Tesla® K80
* **OS:** Ubuntu 16.04 LTS
* **CUDA / cuDNN:** 8.0 / 5.1
* **TensorFlow GitHub hash:** b1e174e
* **Benchmark GitHub hash:** 9165a70
* **Build Command:** `bazel build -c opt --copt=-march="haswell" --config=cuda //tensorflow/tools/pip_package:build_pip_package`
* **Disk:** 1 TB Amazon EFS (burst 100 MiB/sec for 12 hours, continuous 50
  MiB/sec)
* **DataSet:** ImageNet
* **Test Date:** May 2017

Batch size and optimizer used for each model are listed in the table below. In
addition to the batch sizes listed in the table, InceptionV3 and ResNet-50 were
tested with a batch size of 32. Those results are in the *Other Results*
section.

Options            | InceptionV3 | ResNet-50 | ResNet-152 | AlexNet | VGG16
------------------ | ----------- | --------- | ---------- | ------- | -----
Batch size per GPU | 64          | 64        | 32         | 512     | 32
Optimizer          | sgd         | sgd       | sgd        | sgd     | sgd

Configuration used for each model:

Model       | variable_update           | local_parameter_device
----------- | ------------------------- | ----------------------
InceptionV3 | parameter_server          | cpu
ResNet-50   | replicated (without NCCL) | gpu
ResNet-152  | replicated (without NCCL) | gpu
AlexNet     | parameter_server          | gpu
VGG16       | parameter_server          | gpu

### Results

<div style="width:95%; margin:auto; margin-bottom:10px; margin-top:20px;">
  <img style="width:35%" src="../images/perf_aws_synth_k80_single_server_scaling.png">
  <img style="width:35%" src="../images/perf_aws_real_k80_single_server_scaling.png">
</div>

**Training synthetic data**

GPUs | InceptionV3 | ResNet-50 | ResNet-152 | AlexNet | VGG16
---- | ----------- | --------- | ---------- | ------- | -----
1    | 30.8        | 51.5      | 19.7       | 684     | 36.3
2    | 58.7        | 98.0      | 37.6       | 1244    | 69.4
4    | 117         | 195       | 74.9       | 2479    | 141
8    | 230         | 384       | 149        | 4853    | 260

**Training real data**

GPUs | InceptionV3 | ResNet-50 | ResNet-152 | AlexNet | VGG16
---- | ----------- | --------- | ---------- | ------- | -----
1    | 30.5        | 51.3      | 19.7       | 674     | 36.3
2    | 59.0        | 94.9      | 38.2       | 1227    | 67.5
4    | 118         | 188       | 75.2       | 2201    | 136
8    | 228         | 373       | 149        | N/A     | 242

Training AlexNet with real data on 8 GPUs was excluded from the graph and table
above because our EFS setup did not provide enough throughput.

### Other Results

**Training synthetic data**

GPUs | InceptionV3 (batch size 32) | ResNet-50 (batch size 32)
---- | --------------------------- | -------------------------
1    | 29.9                        | 49.0
2    | 57.5                        | 94.1
4    | 114                         | 184
8    | 216                         | 355

**Training real data**

GPUs | InceptionV3 (batch size 32) | ResNet-50 (batch size 32)
---- | --------------------------- | -------------------------
1    | 30.0                        | 49.1
2    | 57.5                        | 95.1
4    | 113                         | 185
8    | 212                         | 353

## Details for Amazon EC2 Distributed (NVIDIA® Tesla® K80)

### Environment

* **Instance type:** p2.8xlarge
* **GPU:** 8x NVIDIA® Tesla® K80
* **OS:** Ubuntu 16.04 LTS
* **CUDA / cuDNN:** 8.0 / 5.1
* **TensorFlow GitHub hash:** b1e174e
* **Benchmark GitHub hash:** 9165a70
* **Build Command:** `bazel build -c opt --copt=-march="haswell" --config=cuda //tensorflow/tools/pip_package:build_pip_package`
* **Disk:** 1.0 TB EFS (burst 100 MB/sec for 12 hours, continuous 50 MB/sec)
* **DataSet:** ImageNet
* **Test Date:** May 2017

The batch size and optimizer used for the tests are listed in the table below.
In addition to the batch sizes listed in the table, InceptionV3 and ResNet-50
were tested with a batch size of 32. Those results are in the *Other Results*
section.
Options            | InceptionV3 | ResNet-50 | ResNet-152
------------------ | ----------- | --------- | ----------
Batch size per GPU | 64          | 64        | 32
Optimizer          | sgd         | sgd       | sgd

Configuration used for each model:

Model       | variable_update        | local_parameter_device | cross_replica_sync
----------- | ---------------------- | ---------------------- | ------------------
InceptionV3 | distributed_replicated | n/a                    | True
ResNet-50   | distributed_replicated | n/a                    | True
ResNet-152  | distributed_replicated | n/a                    | True

To simplify server setup, EC2 instances (p2.8xlarge) running worker servers also
ran parameter servers. Equal numbers of parameter servers and worker servers were
used, with the following exceptions:

* InceptionV3: 8 instances / 6 parameter servers
* ResNet-50 (batch size 32): 8 instances / 4 parameter servers
* ResNet-152: 8 instances / 4 parameter servers

### Results

<div style="width:95%; margin:auto; margin-bottom:10px; margin-top:20px;">
  <img style="width:80%" src="../images/perf_summary_k80_aws_distributed.png">
</div>

<div style="width:95%; margin:auto; margin-bottom:10px; margin-top:20px;">
  <img style="width:70%" src="../images/perf_aws_synth_k80_distributed_scaling.png">
</div>

**Training synthetic data**

GPUs | InceptionV3 | ResNet-50 | ResNet-152
---- | ----------- | --------- | ----------
1    | 29.7        | 52.4      | 19.4
8    | 229         | 378       | 146
16   | 459         | 751       | 291
32   | 902         | 1388      | 565
64   | 1783        | 2744      | 981

### Other Results

<div style="width:95%; margin:auto; margin-bottom:10px; margin-top:20px;">
  <img style="width:50%" src="../images/perf_aws_synth_k80_multi_server_batch32.png">
</div>

**Training synthetic data**

GPUs | InceptionV3 (batch size 32) | ResNet-50 (batch size 32)
---- | --------------------------- | -------------------------
1    | 29.2                        | 48.4
8    | 219                         | 333
16   | 427                         | 667
32   | 820                         | 1180
64   | 1608                        | 2315

## Methodology

This
[script](https://github.com/tensorflow/benchmarks/tree/master/scripts/tf_cnn_benchmarks)
was run on the various platforms to generate the above results.
@{$performance_models$High-Performance Models} details techniques in the script
along with examples of how to execute the script.

In order to create results that are as repeatable as possible, each test was run
5 times and then the times were averaged together. GPUs are run in their default
state on the given platform. For NVIDIA® Tesla® K80 this means leaving on [GPU
Boost](https://devblogs.nvidia.com/parallelforall/increase-performance-gpu-boost-k80-autoboost/).
For each test, 10 warmup steps are done and then the next 100 steps are
averaged.
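For readers who want to reproduce the averaging scheme outside of the benchmark
scripts, the following is a minimal sketch assuming a `run_step` callable that
executes one training step (for example a `session.run` call); it is not the
code used to generate the results above.

```python
import time

def images_per_sec(run_step, batch_size, num_gpus,
                   warmup_steps=10, measured_steps=100):
    """Sketch of the timing scheme described above: discard the warmup steps,
    average the next `measured_steps`, and convert to images/sec."""
    for _ in range(warmup_steps):       # warmup steps are not timed
        run_step()
    start = time.time()
    for _ in range(measured_steps):     # timed steps
        run_step()
    avg_step_time = (time.time() - start) / measured_steps
    return batch_size * num_gpus / avg_step_time
```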