///
/// Copyright (c) 2017-2020 Arm Limited.
///
/// SPDX-License-Identifier: MIT
///
/// Permission is hereby granted, free of charge, to any person obtaining a copy
/// of this software and associated documentation files (the "Software"), to
/// deal in the Software without restriction, including without limitation the
/// rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
/// sell copies of the Software, and to permit persons to whom the Software is
/// furnished to do so, subject to the following conditions:
///
/// The above copyright notice and this permission notice shall be included in all
/// copies or substantial portions of the Software.
///
/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
/// SOFTWARE.
///
namespace arm_compute
{
/** @mainpage Introduction

@tableofcontents

The Computer Vision and Machine Learning library is a set of functions optimised for both ARM CPUs and GPUs using SIMD technologies.

Several builds of the library are available using various configurations:
 - OS: Linux, Android or bare metal.
 - Architecture: armv7a (32bit) or arm64-v8a (64bit)
 - Technology: NEON / OpenCL / GLES_COMPUTE / NEON and OpenCL and GLES_COMPUTE
 - Debug / Asserts / Release: Use a build with asserts enabled to debug your application and enable extra validation. Once you are sure your application works as expected, you can switch to a release build of the library for maximum performance.
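
The configurations above map onto build options that are baked into each binary (they are the same options that appear in the version string shown in the Contact section below). As a hedged illustration only — the exact option names and defaults should be checked against the build documentation for your release — a debug/validation build and a release build of the same configuration might be requested like this:

```shell
# Illustrative only: option names taken from the build-options string shown
# in the Contact / Support section; verify against the SCons build docs.

# Debug/validation build (asserts enabled, extra validation):
scons os=android arch=armv7a neon=0 opencl=1 embed_kernels=1 asserts=1 debug=0 Werror=1

# Release build of the same configuration, for maximum performance:
scons os=android arch=armv7a neon=0 opencl=1 embed_kernels=1 asserts=0 debug=0 Werror=1
```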

@section S0_1_contact Contact / Support

Please email developer@arm.com

In order to facilitate the work of the support team, please provide the build information of the library you are using. To get the version of the library you are using, simply run:

    $ strings android-armv7a-cl-asserts/libarm_compute.so | grep arm_compute_version
    arm_compute_version=v16.12 Build options: {'embed_kernels': '1', 'opencl': '1', 'arch': 'armv7a', 'neon': '0', 'asserts': '1', 'debug': '0', 'os': 'android', 'Werror': '1'} Git hash=f51a545d4ea12a9059fe4e598a092f1fd06dc858

@section S0_2_prebuilt_binaries Pre-built binaries

For each release we provide some pre-built binaries of the library [here](https://github.com/ARM-software/ComputeLibrary/releases)

These binaries have been built using the following toolchains:
 - Linux armv7a: gcc-linaro-6.3.1-2017.05-x86_64_arm-linux-gnueabihf
 - Linux arm64-v8a: gcc-linaro-6.3.1-2017.05-x86_64_aarch64-linux-gnu
 - Android armv7a: clang++ / libc++ NDK r18b
 - Android arm64-v8a: clang++ / libc++ NDK r18b

@warning Make sure to use a compatible toolchain to build your application, or you will get some std::bad_alloc errors at runtime.
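
When reporting an issue from a script, the version number can be parsed out of that string. A minimal sketch, assuming the `arm_compute_version=...` format shown above (a sample string is hardcoded here in place of the real `strings` output):

```shell
# Parse the library version out of an arm_compute_version string.
# The sample below stands in for the output of:
#   strings libarm_compute.so | grep arm_compute_version
info="arm_compute_version=v16.12 Build options: {'asserts': '1'} Git hash=f51a545d"

# Keep only the vYY.MM[.N] token at the start of the string.
version=$(printf '%s\n' "$info" | sed -n 's/^arm_compute_version=\(v[0-9.]*\).*/\1/p')
echo "$version"   # v16.12
```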

@section S1_file_organisation File organisation

This archive contains:
 - The arm_compute header and source files
 - The latest Khronos OpenCL 1.2 C headers from the <a href="https://www.khronos.org/registry/cl/">Khronos OpenCL registry</a>
 - The latest Khronos cl2.hpp from the <a href="https://www.khronos.org/registry/cl/">Khronos OpenCL registry</a> (API version 2.1 when this document was written)
 - The latest Khronos OpenGL ES 3.1 C headers from the <a href="https://www.khronos.org/registry/gles/">Khronos OpenGL ES registry</a>
 - The latest Khronos EGL 1.5 C headers from the <a href="https://www.khronos.org/registry/gles/">Khronos EGL registry</a>
 - The sources for a stub version of libOpenCL.so, libGLESv1_CM.so, libGLESv2.so and libEGL.so to help you build your application.
 - An examples folder containing a few examples to compile and link against the library.
 - A @ref utils folder containing headers with some boilerplate code used by the examples.
 - This documentation.

For detailed information about the file organisation, please refer to the Files -> File List section of this documentation.

@section S2_versions_changelog Release versions and changelog

@subsection S2_1_versions Release versions

All releases are numbered vYY.MM, where YY are the last two digits of the year and MM is the month number.
If there is more than one release in a month, an extra sequential number is appended at the end:

    v17.03 (First release of March 2017)
    v17.03.1 (Second release of March 2017)
    v17.04 (First release of April 2017)

@note We aim to release one major public release with new features per quarter. All releases in between will only contain bug fixes.

@subsection S2_2_changelog Changelog

v20.11 Public major release
 - Various bug fixes.
 - Various optimisations.
 - Performance regressions can be noted when executing Depthwise Convolution on Neon with a depth multiplier > 1 for quantized data types.
   This is planned to be resolved in the 21.02 release.
 - Added new data type QASYMM8_SIGNED support for @ref NEROIAlignLayer.
 - Added new data type S32 support for:
   - @ref NEArithmeticSubtraction
   - @ref NEArithmeticSubtractionKernel
   - @ref NEPixelWiseMultiplication
   - @ref NEPixelWiseMultiplicationKernel
   - @ref NEElementwiseDivision
   - @ref NEDivisionOperationKernel
 - Interface change
   - Properly support softmax axis to have the same meaning as other major frameworks. That is, axis now defines the dimension
     on which Softmax/Logsoftmax is performed. E.g. for input of shape 4x5x6 and axis=1, softmax will be applied to 4x6=24 vectors of size 5.
     The supported value range of axis is [-rank, rank).
     This change applies to the following functions:
     - @ref NESoftmaxLayer
     - @ref NELogSoftmaxLayer
     - @ref CLSoftmaxLayer
     - @ref CLLogSoftmaxLayer
     - @ref GCSoftmaxLayer
 - New OpenCL kernels / functions:
   - @ref CLGEMMLowpQuantizeDownInt32ScaleByFixedPointKernel
   - @ref CLLogicalNot
   - @ref CLLogicalAnd
   - @ref CLLogicalOr
 - New NEON kernels / functions:
   - @ref NELogicalNot
   - @ref NELogicalAnd
   - @ref NELogicalOr
 - Removed padding from NEON kernels:
   - @ref NEComplexPixelWiseMultiplicationKernel
   - @ref NENonMaximaSuppression3x3Kernel
   - @ref NERemapKernel
   - @ref NEGEMMInterleave4x4Kernel
   - @ref NEDirectConvolutionLayerKernel
   - @ref NEScaleKernel
   - @ref NELocallyConnectedMatrixMultiplyKernel
   - @ref NEGEMMLowpOffsetContributionKernel
   - @ref NEGEMMTranspose1xWKernel
   - @ref NEPoolingLayerKernel
   - @ref NEConvolutionKernel
   - @ref NEDepthwiseConvolutionLayerNativeKernel
   - @ref NEGEMMLowpMatrixMultiplyKernel
   - @ref NEGEMMMatrixMultiplyKernel
   - @ref NEDirectConvolutionLayerOutputStageKernel
   - @ref NEReductionOperationKernel
   - @ref NEGEMMLowpMatrixAReductionKernel
   - @ref NEGEMMLowpMatrixBReductionKernel
 - Removed padding from OpenCL kernels:
   - @ref CLBatchConcatenateLayerKernel
   - @ref CLElementwiseOperationKernel
   - @ref CLBatchNormalizationLayerKernel
   - @ref CLPoolingLayerKernel
   - @ref CLWinogradInputTransformKernel
   - @ref CLGEMMLowpMatrixMultiplyNativeKernel
   - @ref CLGEMMLowpMatrixAReductionKernel
   - @ref CLGEMMLowpMatrixBReductionKernel
   - @ref CLGEMMLowpOffsetContributionOutputStageKernel
   - @ref CLGEMMLowpOffsetContributionKernel
   - @ref CLWinogradOutputTransformKernel
   - @ref CLGEMMLowpMatrixMultiplyReshapedKernel
   - @ref CLFuseBatchNormalizationKernel
   - @ref CLDepthwiseConvolutionLayerNativeKernel
   - @ref CLDepthConvertLayerKernel
   - @ref CLCopyKernel
   - @ref CLDepthwiseConvolutionLayer3x3NHWCKernel
   - @ref CLActivationLayerKernel
   - @ref CLWinogradFilterTransformKernel
   - @ref CLWidthConcatenateLayerKernel
   - @ref CLWidthConcatenate4TensorsKernel
   - @ref CLWidthConcatenate2TensorsKernel
   - @ref CLLogits1DMaxShiftExpSumKernel
   - @ref CLLogits1DNormKernel
   - @ref CLHeightConcatenateLayerKernel
   - @ref CLGEMMMatrixMultiplyKernel
   - @ref CLGEMMLowpQuantizeDownInt32ScaleKernel
   - @ref CLGEMMLowpQuantizeDownInt32ScaleByFloatKernel
   - @ref CLGEMMLowpMatrixMultiplyReshapedOnlyRHSKernel
   - @ref CLDepthConcatenateLayerKernel
   - @ref CLGEMMLowpQuantizeDownInt32ScaleByFixedPointKernel
 - Removed OpenCL kernels / functions:
   - CLGEMMLowpQuantizeDownInt32ToInt16ScaleByFixedPointKernel
   - CLGEMMLowpQuantizeDownInt32ToInt8ScaleByFixedPointKernel
   - CLGEMMLowpQuantizeDownInt32ToUint8ScaleByFixedPointKernel
 - Deprecated OpenCL kernels / functions (if a kernel is used only by the function that is being deprecated, the kernel is deprecated together):
   - CLLocallyConnectedLayer
   - CLLocallyConnectedMatrixMultiplyKernel
   - CLAbsoluteDifference
   - CLAbsoluteDifferenceKernel
   - CLAccumulate
   - CLAccumulateKernel
   - CLAccumulateSquared
   - CLAccumulateSquaredKernel
   - CLAccumulateWeighted
   - CLAccumulateWeightedKernel
   - CLAccumulateWeightedFP16Kernel
   - CLBox3x3
   - CLBox3x3Kernel
   - CLBox3x3FP16Kernel
   - CLCannyEdge
   - CLChannelCombine
   - CLChannelCombineKernel
   - CLChannelExtract
   - CLChannelExtractKernel
   - CLColorConvert
   - CLColorConvertKernel
   - CLConvolution3x3
   - CLConvolutionRectangle
   - CLConvolutionRectangleKernel
   - CLConvolutionSquare
   - CLConvolutionKernel
   - CLDerivative
   - CLDerivativeKernel
   - CLDilate
   - CLDilateKernel
   - CLEqualizeHistogram
   - CLErode
   - CLErodeKernel
   - CLFastCorners
   - CLFastCornersKernel
   - CLGaussian3x3
   - CLGaussian3x3Kernel
   - CLGaussian5x5
   - CLGaussian5x5HorKernel
   - CLGaussian5x5VertKernel
   - CLGaussianPyramid
   - CLGaussianPyramidHalf
   - CLGaussianPyramidOrb
   - CLHarrisCorners
   - CLHarrisScoreKernel
   - CLHarrisScoreFP16Kernel
   - CLHistogram
   - CLHistogramKernel
   - CLHOGOrientationBinningKernel
   - CLHOGBlockNormalizationKernel
   - CLHOGDetectorKernel
   - CLHOGNonMaximaSuppressionKernel
   - CLHOGDescriptor
   - CLHOGDetector
   - CLHOGGradient
   - CLHOGMultiDetection
   - CLIntegralImage
   - CLIntegralImageKernel
   - CLLaplacianReconstruct
   - CLLaplacianPyramid
   - CLMagnitude
   - CLMagnitudePhaseKernel
   - CLMedian3x3
   - CLMedian3x3Kernel
   - CLMinMaxLocation
   - CLMinMaxLocationKernel
   - CLNonLinearFilter
   - CLNonLinearFilterKernel
   - CLNonMaximaSuppression3x3
   - CLNonMaximaSuppression3x3FP16Kernel
   - CLNonMaximaSuppression3x3Kernel
   - CLOpticalFlow
   - CLPhase
   - CLRemap
   - CLRemapKernel
   - CLScharr3x3
   - CLScharr3x3Kernel
   - CLSobel3x3
   - CLSobel3x3Kernel
   - CLSobel5x5
   - CLSobel5x5HorKernel
   - CLSobel5x5VertKernel
   - CLSobel7x7
   - CLSobel7x7HorKernel
   - CLSobel7x7VertKernel
   - CLThreshold
   - CLThresholdKernel
   - CLWarpAffine
   - CLWarpAffineKernel
   - CLWarpPerspective
   - CLWarpPerspectiveKernel
 - Deprecated NEON kernels / functions (if a kernel is used only by the function that is being deprecated, the kernel is deprecated together):
   - NELocallyConnectedLayer
   - NELocallyConnectedMatrixMultiplyKernel
   - NEAbsoluteDifference
   - NEAbsoluteDifferenceKernel
   - NEAccumulate
   - NEAccumulateKernel
   - NEAccumulateSquared
   - NEAccumulateSquaredKernel
   - NEAccumulateWeighted
   - NEAccumulateWeightedKernel
   - NEAccumulateWeightedFP16Kernel
   - NEBox3x3
   - NEBox3x3Kernel
   - NEBox3x3FP16Kernel
   - NECannyEdge
   - NEChannelCombine
   - NEChannelCombineKernel
   - NEChannelExtract
   - NEChannelExtractKernel
   - NEColorConvert
   - NEColorConvertKernel
   - NEConvolution3x3
   - NEConvolutionRectangle
   - NEConvolutionRectangleKernel
   - NEConvolutionSquare
   - NEConvolutionKernel
   - NEDerivative
   - NEDerivativeKernel
   - NEDilate
   - NEDilateKernel
   - NEEqualizeHistogram
   - NEErode
   - NEErodeKernel
   - NEFastCorners
   - NEFastCornersKernel
   - NEGaussian3x3
   - NEGaussian3x3Kernel
   - NEGaussian5x5
   - NEGaussian5x5HorKernel
   - NEGaussian5x5VertKernel
   - NEGaussianPyramid
   - NEGaussianPyramidHalf
   - NEGaussianPyramidOrb
   - NEHarrisCorners
   - NEHarrisScoreKernel
   - NEHarrisScoreFP16Kernel
   - NEHistogram
   - NEHistogramKernel
   - NEHOGOrientationBinningKernel
   - NEHOGBlockNormalizationKernel
   - NEHOGDetectorKernel
   - NEHOGNonMaximaSuppressionKernel
   - NEHOGDescriptor
   - NEHOGDetector
   - NEHOGGradient
   - NEHOGMultiDetection
   - NEIntegralImage
   - NEIntegralImageKernel
   - NELaplacianReconstruct
   - NELaplacianPyramid
   - NEMagnitude
   - NEMagnitudePhaseKernel
   - NEMedian3x3
   - NEMedian3x3Kernel
   - NEMinMaxLocation
   - NEMinMaxLocationKernel
   - NENonLinearFilter
   - NENonLinearFilterKernel
   - NENonMaximaSuppression3x3
   - NENonMaximaSuppression3x3FP16Kernel
   - NENonMaximaSuppression3x3Kernel
   - NEOpticalFlow
   - NEPhase
   - NERemap
   - NERemapKernel
   - NEScharr3x3
   - NEScharr3x3Kernel
   - NESobel3x3
   - NESobel3x3Kernel
   - NESobel5x5
   - NESobel5x5HorKernel
   - NESobel5x5VertKernel
   - NESobel7x7
   - NESobel7x7HorKernel
   - NESobel7x7VertKernel
   - NEThreshold
   - NEThresholdKernel
   - NEWarpAffine
   - NEWarpAffineKernel
   - NEWarpPerspective
   - NEWarpPerspectiveKernel
 - Deprecated GLES kernels / functions (if a kernel is used only by the function that is being deprecated, the kernel is deprecated together):
   - GCAbsoluteDifference
   - GCActivationLayer
   - GCArithmeticAddition
   - GCBatchNormalizationLayer
   - GCConcatenateLayer
   - GCConvolutionLayer
   - GCDepthwiseConvolutionLayer
   - GCDirectConvolutionLayer
   - GCDropoutLayer
   - GCFillBorder
   - GCFullyConnectedLayer
   - GCGEMM
   - GCGEMMInterleave4x4
   - GCGEMMTranspose1xW
   - GCNormalizationLayer
   - GCNormalizePlanarYUVLayer
   - GCPixelWiseMultiplication
   - GCPoolingLayer
   - GCScale
   - GCSoftmaxLayer
   - GCTensorShift
   - GCTranspose

v20.08 Public major release
 - Various bug fixes.
 - Various optimisations.
 - Added new data type QASYMM8_SIGNED support for:
   - @ref CLArgMinMaxLayer
   - @ref CLArgMinMaxLayerKernel
 - Added new data type U8 support for:
   - @ref NECropKernel
   - @ref CLCropKernel
 - Added align_corners support for nearest neighbor interpolation in:
   - @ref NEScaleKernel
   - @ref CLScaleKernel
 - New OpenCL kernels / functions:
   - @ref CLMaxUnpoolingLayerKernel
 - New NEON kernels / functions:
   - @ref NEMaxUnpoolingLayerKernel
 - New graph example:
   - graph_yolov3_output_detector
 - GEMMTuner improvements:
   - Added fp16 support
   - Output json files for easier integration
   - Enabled tuning for the export_to_cl_image_rhs option for RHS tensors
   - More robust script for running benchmarks
 - Removed padding from:
   - @ref NEPixelWiseMultiplicationKernel
   - @ref NEHeightConcatenateLayerKernel
   - @ref NEThresholdKernel
   - @ref NEBatchConcatenateLayerKernel
   - @ref NETransposeKernel
   - @ref NEBatchNormalizationLayerKernel
   - @ref NEArithmeticSubtractionKernel
   - @ref NEBoundingBoxTransformKernel
   - @ref NELogits1DMaxKernel
   - @ref NELogits1DSoftmaxKernel
   - @ref NEROIPoolingLayerKernel
   - @ref NEROIAlignLayerKernel
   - @ref NEYOLOLayerKernel
   - @ref NEUpsampleLayerKernel
   - @ref NEFloorKernel
   - @ref NEWidthConcatenateLayerKernel
   - @ref NEDepthConcatenateLayerKernel
   - @ref NENormalizationLayerKernel
   - @ref NEL2NormalizeLayerKernel
   - @ref NEFillArrayKernel
   - @ref NEDepthConvertLayerKernel
   - @ref NERangeKernel
   - @ref NEPriorBoxLayer
 - Removed OpenCL kernels / functions:
   - CLGEMMLowpQuantizeDownInt32ToUint8Scale
   - CLGEMMLowpQuantizeDownInt32ToUint8ScaleByFloat
 - Removed NEON kernels / functions:
   - NEGEMMLowpQuantizeDownInt32ToUint8Scale
   - NEGEMMMatrixAccumulateBiasesKernel
 - Deprecated functions / interfaces:
   - Non-descriptor based interfaces for @ref NEThreshold, @ref CLThreshold
   - Non-descriptor based interfaces for @ref NEScale, @ref CLScale and @ref GCScale
   - In @ref NESoftmaxLayer, @ref NELogSoftmaxLayer, @ref CLSoftmaxLayer, @ref CLLogSoftmaxLayer and @ref GCSoftmaxLayer :
     The default "axis" value for @ref CLSoftmaxLayer, @ref CLLogSoftmaxLayer and @ref GCSoftmaxLayer is changed from 1 to 0.
     Only axis 0 is supported.
     The default "axis" value for @ref NESoftmaxLayer and @ref NELogSoftmaxLayer is changed from 1 to 0.
     Only axis 0 is supported.
 - The support for quantized data types has been removed from @ref CLLogSoftmaxLayer due to implementation complexity.
 - Removed the padding requirement for the input (e.g. LHS of GEMM) and output in @ref CLGEMMMatrixMultiplyNativeKernel, @ref CLGEMMMatrixMultiplyReshapedKernel, @ref CLGEMMMatrixMultiplyReshapedOnlyRHSKernel and @ref CLIm2ColKernel (NHWC only)
   - This change makes it possible to use @ref CLGEMMConvolutionLayer without extra padding for the input and output.
   - Only the weights/bias of @ref CLGEMMConvolutionLayer could require padding for the computation.
   - Only on Arm Mali Midgard GPUs, @ref CLGEMMConvolutionLayer could require padding, since @ref CLGEMMMatrixMultiplyKernel is called and currently requires padding.
 - Added support for exporting the OpenCL buffer object to the OpenCL image object in @ref CLGEMMMatrixMultiplyReshapedKernel and @ref CLGEMMMatrixMultiplyReshapedOnlyRHSKernel.
   - This makes it possible to export the OpenCL buffer used for the reshaped RHS matrix to the OpenCL image object.
   - The padding requirement for the OpenCL image object is taken into account in @ref CLGEMMReshapeRHSMatrixKernel.
   - The reshaped RHS matrix stores the weights when GEMM is used to accelerate @ref CLGEMMConvolutionLayer.

v20.05 Public major release
 - Various bug fixes.
 - Various optimisations.
 - Updated recommended NDK version to r18b.
 - Updated recommended gcc version to Linaro 6.3.1.
 - Added Bfloat16 type support
 - Added Bfloat16 support in:
   - @ref NEWeightsReshapeKernel
   - @ref NEConvolutionLayerReshapeWeights
   - @ref NEIm2ColKernel
   - @ref NEIm2Col
   - @ref NEDepthConvertLayerKernel
   - @ref NEDepthConvertLayer
   - @ref NEGEMMConvolutionLayer
   - @ref NEGEMMAssemblyDispatch
 - Added new data type QASYMM8_SIGNED support for:
   - @ref CLDirectConvolutionLayer
   - @ref CLDeconvolutionLayer
   - @ref CLDirectDeconvolutionLayer
   - @ref CLGEMMDeconvolutionLayer
   - @ref CLGEMMLowpMatrixMultiplyReshapedKernel
   - @ref CLGEMMLowpQuantizeDownInt32ScaleKernel
   - @ref CLGEMMLowpQuantizeDownInt32ScaleByFloatKernel
   - @ref CLReductionOperation
   - @ref CLReduceMean
   - @ref NEScale
   - @ref NEScaleKernel
   - @ref NEUpsampleLayer
   - @ref NECast
   - @ref NEReductionOperation
   - @ref NEReduceMean
   - @ref NEArgMinMaxLayer
   - @ref NEDeconvolutionLayer
   - @ref NEGEMMLowpQuantizeDownInt32ScaleKernel
   - @ref CPPBoxWithNonMaximaSuppressionLimit
   - @ref CPPDetectionPostProcessLayer
   - @ref CPPPermuteKernel
   - @ref CPPPermute
   - @ref CPPTopKVKernel
   - @ref CPPTopKV
   - @ref CPPUpsample
   - @ref CPPUpsampleKernel
 - New OpenCL kernels / functions:
   - @ref CLQLSTMLayer
   - @ref CLQLSTMLayerNormalizationKernel
 - New NEON kernels / functions:
   - @ref NEQLSTMLayer
   - @ref NEQLSTMLayerNormalizationKernel
 - Added HARD_SWISH support in:
   - @ref CLActivationLayerKernel
   - @ref NEActivationLayerKernel
 - Deprecated OpenCL kernels / functions:
   - CLGEMMLowpQuantizeDownInt32ToUint8Scale
   - CLGEMMLowpQuantizeDownInt32ToUint8ScaleByFloat
 - Deprecated NEON kernels / functions:
   - NEGEMMLowpQuantizeDownInt32ToUint8Scale
 - Removed CPP kernels / functions:
   - CPPFlipWeightsKernel
 - Removed PoolingLayerInfo constructors without Data Layout.
 - Removed CLDepthwiseConvolutionLayer3x3
 - Removed NEDepthwiseConvolutionLayerOptimized
 - Added support for Winograd 3x3,4x4 on NEON FP16:
   - @ref NEWinogradConvolutionLayer
   - @ref NEWinogradLayerTransformInputKernel
   - @ref NEWinogradLayerTransformOutputKernel
   - @ref NEWinogradLayerTransformWeightsKernel
 - Added CLCompileContext
 - Added NEON GEMM kernel with 2D window support

v20.02.1 Maintenance release
 - Added Android-NN build script.

v20.02 Public major release
 - Various bug fixes.
 - Various optimisations.
 - Added new data type QASYMM8_SIGNED support for:
   - @ref CLDepthwiseConvolutionLayer
   - CLDepthwiseConvolutionLayer3x3
   - @ref CLGEMMConvolutionLayer
   - @ref CLGEMMLowpMatrixMultiplyCore
   - @ref CLGEMMLowpMatrixMultiplyReshapedOnlyRHSKernel
   - @ref CLGEMMLowpMatrixMultiplyNativeKernel
   - @ref NEActivationLayer
   - @ref NEComparisonOperationKernel
   - @ref NEConvolutionLayer
   - @ref NEDepthwiseConvolutionLayer
   - NEDepthwiseConvolutionLayer3x3Kernel
   - @ref NEDirectConvolutionLayerOutputStageKernel
   - @ref NEElementwiseComparison
   - @ref NEElementwiseMax
   - @ref NEElementwiseMin
   - @ref NEElementwiseSquaredDiff
   - @ref NEFullyConnectedLayer
   - NEGEMMMatrixVectorMultiplyKernel
   - @ref NEPixelWiseMultiplication
   - @ref NEPoolingLayer
   - @ref NEPReluLayer
 - Added support for QSYMM8_PER_CHANNEL in:
   - NEDepthwiseConvolutionLayer3x3Kernel
 - Added support for split sizes in:
   - @ref CLSplit
   - @ref NESplit
 - New OpenCL kernels / functions:
   - @ref CLFill
   - CLGEMMLowpQuantizeDownInt32ToInt8ScaleByFixedPointKernel / @ref CLGEMMLowpQuantizeDownInt32ToInt8ScaleByFixedPoint
 - New NEON kernels / functions:
   - @ref NEFill
   - @ref NEGEMMLowpQuantizeDownInt32ToInt8ScaleByFixedPointKernel / @ref NEGEMMLowpQuantizeDownInt32ToInt8ScaleByFixedPoint
 - Deprecated functions / interfaces:
   - CLDepthwiseConvolutionLayer3x3
   - NEDepthwiseConvolutionLayerOptimized
   - PoolingLayerInfo constructors without Data Layout.
 - Added support for quantization with multiplier greater than 1 on NEON and CL.
 - Added support for quantized inputs of type QASYMM8_SIGNED and QASYMM8 to @ref CLQuantizationLayer.
 - Added the ability to build bootcode for bare metal.
 - Added support for generating synthetic QASYMM8 graphs.
 - Added support for the F16 datatype in VGG16.
 - Removed pre-built binaries for GLES.

v19.11.1 Public maintenance release
 - Fix offset calculation in NEReductionOperationKernel.
 - Fix data layout in NEScaleKernel for nhwc.
 - Retain configuration step data layout to avoid side-effects.
 - Perform sqrt in double domain for L2 pooling.
 - Fix output shape calculation for Reduce Mean.
 - Restrict cases where optimized NEPadLayer runs.

v19.11 Public major release
 - Various bug fixes.
 - Various optimisations.
 - Updated recommended NDK version to r17c.
 - Deprecated OpenCL kernels / functions:
   - CLDepthwiseConvolutionLayerReshapeWeightsGenericKernel
   - CLDepthwiseIm2ColKernel
   - CLDepthwiseSeparableConvolutionLayer
   - CLDepthwiseVectorToTensorKernel
   - CLDirectConvolutionLayerOutputStageKernel
 - Deprecated NEON kernels / functions:
   - NEDepthwiseWeightsReshapeKernel
   - NEDepthwiseIm2ColKernel
   - NEDepthwiseSeparableConvolutionLayer
   - NEDepthwiseVectorToTensorKernel
   - NEDepthwiseConvolutionLayer3x3
 - New OpenCL kernels / functions:
   - @ref CLInstanceNormalizationLayerKernel / @ref CLInstanceNormalizationLayer
   - @ref CLDepthwiseConvolutionLayerNativeKernel to replace the old generic depthwise convolution (see Deprecated OpenCL kernels / functions)
   - @ref CLLogSoftmaxLayer
 - New NEON kernels / functions:
   - @ref NEBoundingBoxTransformKernel / @ref NEBoundingBoxTransform
   - @ref NEComputeAllAnchorsKernel / @ref NEComputeAllAnchors
   - @ref NEDetectionPostProcessLayer
   - @ref NEGenerateProposalsLayer
   - @ref NEInstanceNormalizationLayerKernel / @ref NEInstanceNormalizationLayer
   - @ref NELogSoftmaxLayer
   - @ref NEROIAlignLayerKernel / @ref NEROIAlignLayer
 - Added QASYMM8 support for:
   - @ref CLGenerateProposalsLayer
   - @ref CLROIAlignLayer
   - @ref CPPBoxWithNonMaximaSuppressionLimit
 - Added QASYMM16 support for:
   - @ref CLBoundingBoxTransform
 - Added FP16 support for:
   - @ref CLGEMMMatrixMultiplyReshapedKernel
 - Added new data type QASYMM8_PER_CHANNEL support for:
   - @ref CLDequantizationLayer
   - @ref NEDequantizationLayer
 - Added new data type QSYMM8_PER_CHANNEL support for:
   - @ref CLConvolutionLayer
   - @ref NEConvolutionLayer
   - @ref CLDepthwiseConvolutionLayer
   - @ref NEDepthwiseConvolutionLayer
 - Added FP16 mixed-precision support for:
   - @ref CLGEMMMatrixMultiplyReshapedKernel
   - @ref CLPoolingLayerKernel
 - Added FP32 and FP16 ELU activation for:
   - @ref CLActivationLayer
   - @ref NEActivationLayer
 - Added SYMMETRIC and REFLECT modes for @ref CLPadLayerKernel / @ref CLPadLayer.
 - Added asymmetric padding support for:
   - @ref CLDirectDeconvolutionLayer
   - @ref CLGEMMDeconvolutionLayer
   - @ref NEDeconvolutionLayer
 - Replaced the calls to @ref NECopyKernel and @ref NEMemsetKernel with @ref NEPadLayer in @ref NEGenerateProposalsLayer.
 - Replaced the calls to @ref CLCopyKernel and @ref CLMemsetKernel with @ref CLPadLayer in @ref CLGenerateProposalsLayer.
 - Improved performance for CL Inception V3 - FP16.
 - Improved accuracy for CL Inception V3 - FP16 by enabling FP32 accumulator (mixed-precision).
 - Improved NEON performance by enabling fusing batch normalization with convolution and depth-wise convolution layer.
 - Improved NEON performance for MobileNet-SSD by improving the output detection performance.
 - Optimized @ref CLPadLayer.
 - Optimized CL generic depthwise convolution layer by introducing @ref CLDepthwiseConvolutionLayerNativeKernel.
 - Reduced memory consumption by implementing weights sharing.

v19.08.1 Public maintenance release
 - Fix offset calculation in NEReductionOperationKernel.
 - Fix data layout in NEScaleKernel for nhwc.
 - Retain configuration step data layout to avoid side-effects.
 - Perform sqrt in double domain for L2 pooling.
 - Fix output shape calculation for Reduce Mean.
 - Fix broadcast CLPixelwiseMultiplication with 5D tensors.

v19.08 Public major release
 - Various bug fixes.
 - Various optimisations.
 - Deprecated NEON functions:
   - NEDepthConcatenateLayer
   - NEWidthConcatenateLayer
 - Deprecated OpenCL kernels / functions:
   - CLDepthConcatenateLayer
   - CLGEMMInterleave4x4Kernel / CLGEMMInterleave4x4
   - CLGEMMTranspose1xWKernel / CLGEMMTranspose1xW
   - CLWidthConcatenateLayer
 - New NEON kernels / functions:
   - @ref NEAbsLayer
   - @ref NECast
   - @ref NEElementwisePower
   - @ref NELogLayer
   - @ref NELSTMLayerQuantized
   - @ref NENegLayer
   - @ref NEPReluLayer
   - @ref NESinLayer
   - @ref NEBatchConcatenateLayerKernel
   - @ref NEDepthToSpaceLayerKernel / @ref NEDepthToSpaceLayer
   - @ref NEDepthwiseConvolutionLayerNativeKernel
   - @ref NEGEMMLowpQuantizeDownInt32ToInt16ScaleByFixedPointKernel
   - @ref NEMeanStdDevNormalizationKernel / @ref NEMeanStdDevNormalizationLayer
   - @ref NESpaceToDepthLayerKernel / @ref NESpaceToDepthLayer
 - New OpenCL kernels / functions:
   - @ref CLAbsLayer
   - @ref CLElementwisePower
   - @ref CLLogLayer
   - @ref CLLSTMLayerQuantized
   - @ref CLNegLayer
   - @ref CLPReluLayer
   - @ref CLSinLayer
   - @ref CLBatchConcatenateLayerKernel
   - @ref CLDepthToSpaceLayerKernel / @ref CLDepthToSpaceLayer
   - @ref CLGEMMLowpMatrixMultiplyNativeKernel
   - CLGEMMLowpQuantizeDownInt32ToInt16ScaleByFixedPointKernel
   - @ref CLGEMMMatrixMultiplyNativeKernel
   - @ref CLMeanStdDevNormalizationKernel / @ref CLMeanStdDevNormalizationLayer
   - @ref CLSpaceToDepthLayerKernel / @ref CLSpaceToDepthLayer
 - New examples:
   - neon_opticalflow
   - cl_cache
   - neon_permute
 - Added support for FP16 in @ref NEDeconvolutionLayer
 - Added support for FP16 in @ref CLDeconvolutionLayer
 - Added support for REDUCE_MIN and REDUCE_MAX in @ref ReductionOperation
 - Enabled the fusion of batch normalization with convolution and depthwise convolution layer for FP32 in the graph API (OpenCL only)
 - Added support for fusing activation function and broadcast addition with the matrix multiplication for FP32 (OpenCL only)
 - Re-factored the depthwise convolution layer kernel on NEON for generic cases
 - Added an optimized depthwise convolution layer kernel for 5x5 filters (NEON only)
 - Added support to enable OpenCL kernel cache. Added an example showing how to load the prebuilt OpenCL kernels from a binary cache file
 - Altered the @ref QuantizationInfo interface to support per-channel quantization.
 - CLDepthwiseConvolutionLayer3x3 will be included by @ref CLDepthwiseConvolutionLayer to accommodate future optimizations.
 - NEDepthwiseConvolutionLayerOptimized will be included by @ref NEDepthwiseConvolutionLayer to accommodate future optimizations.
 - Removed the inner_border_right and inner_border_top parameters from the @ref CLDeconvolutionLayer interface
 - Removed the inner_border_right and inner_border_top parameters from the @ref NEDeconvolutionLayer interface
 - Optimized the NEON assembly kernel for GEMMLowp. The new implementation fuses the output stage and quantization with the matrix multiplication kernel

v19.05 Public major release
 - Various bug fixes.
 - Various optimisations.
 - New Neon kernels / functions:
   - @ref NEBatchToSpaceLayerKernel / @ref NEBatchToSpaceLayer
   - @ref NEComplexPixelWiseMultiplicationKernel / @ref NEComplexPixelWiseMultiplication
   - @ref NECropKernel / @ref NECropResize
   - @ref NEDepthwiseConvolutionAssemblyDispatch
   - @ref NEFFTDigitReverseKernel
   - @ref NEFFTRadixStageKernel
   - @ref NEFFTScaleKernel
   - @ref NEGEMMLowpOffsetContributionOutputStageKernel
   - @ref NEHeightConcatenateLayerKernel
   - @ref NESpaceToBatchLayerKernel / @ref NESpaceToBatchLayer
   - @ref NEFFT1D
   - @ref NEFFT2D
   - @ref NEFFTConvolutionLayer
 - New OpenCL kernels / functions:
   - @ref CLComplexPixelWiseMultiplicationKernel / @ref CLComplexPixelWiseMultiplication
   - @ref CLCropKernel / @ref CLCropResize
   - @ref CLDeconvolutionReshapeOutputKernel
   - @ref CLFFTDigitReverseKernel
   - @ref CLFFTRadixStageKernel
   - @ref CLFFTScaleKernel
   - @ref CLGEMMLowpMatrixMultiplyReshapedOnlyRHSKernel
   - @ref CLGEMMMatrixMultiplyReshapedOnlyRHSKernel
   - @ref CLHeightConcatenateLayerKernel
   - @ref CLDirectDeconvolutionLayer
   - @ref CLFFT1D
   - @ref CLFFT2D
   - @ref CLFFTConvolutionLayer
   - @ref CLGEMMDeconvolutionLayer
 - New OpenGLES kernels / functions:
   - @ref GCConcatenateLayer
 - Deprecated functions / interfaces:
   - GCDepthConcatenateLayer
   - NEWidthConcatenateLayer
   - NEDepthConcatenateLayer
   - CLWidthConcatenateLayer
   - CLDepthConcatenateLayer
   - CLGEMMInterleave4x4
   - CLGEMMTranspose1xW
 - Support different quantization info in CLConcatLayer.
 - Added checks for cases where different input/output quantization info is not supported.
 - Tensors may have different quantization information.
 - Add FP16 support checks.
 - Fix output quantization of CLDepthwiseConv3x3 when activation is fused.
 - New graph examples:
    - graph_convolution
    - graph_fully_connected
    - graph_depthwise_convolution
    - Deepspeech v0.4.1
 - Add support for QASYMM8 in NEArithmeticSubtractionKernel.
 - Add support for QASYMM8 in NEPixelWiseMultiplicationKernel.
 - Add support for QASYMM8 in NEDeconvolution.
 - Add support for DequantizationLayer for NEON/CL.
 - Add support for dilation in CLDepthwiseConvolution.
 - Fuse offset contribution with the output stage when we use NEGEMMLowpMatrixMultiplyCore.
 - Optimize CLDeconvolution.
 - Add StackLayer to the graph API.
 - Add support for "reflect" padding mode in NEPad.
 - Winograd 7x7 NHWC on OpenCL.
 - Rework CL ML layers to run exclusively on CL.
 - Support different quantization info in PoolingLayer.
 - Implement and test import memory interfaces.
 - Added new tests and removed old ones.
 - Various clang-tidy fixes.

v19.02 Public major release
 - Various bug fixes.
 - Various optimisations.
 - New Neon kernels / functions:
    - @ref NETileKernel / @ref NETile
    - @ref NEFuseBatchNormalizationKernel / @ref NEFuseBatchNormalization
    - @ref NEElementwiseOperationKernel
      - @ref NEElementwiseMax
      - @ref NEElementwiseMin
      - @ref NEElementwiseSquaredDiff
    - @ref NESelectKernel / @ref NESelect
    - @ref NESplit
    - @ref NESlice
    - @ref NEUnstack
    - @ref NEStridedSliceKernel / @ref NEStridedSlice
    - @ref NEElementwiseUnaryKernel
      - @ref NERsqrtLayer
      - @ref NEExpLayer
    - @ref NEReverseKernel / @ref NEReverse
    - @ref NEArgMinMaxLayer
    - @ref NEStackLayerKernel / @ref NEStackLayer
    - @ref NERangeKernel / @ref NERange
    - @ref NEPadLayer
    - @ref NEMemsetKernel
    - @ref NEGatherKernel / @ref NEGather
    - @ref NEElementwiseComparison
    - @ref NEElementwiseComparisonStatic
    - @ref NEComparisonOperationKernel
    - @ref NEElementwiseDivision
 - New OpenCL kernels / functions:
    - @ref CLSelectKernel / @ref CLSelect
    - @ref CLTileKernel / @ref CLTile
    - @ref CLComparisonKernel / @ref CLComparison
    - @ref CLArgMinMaxLayer
    - @ref CLElementwiseMax
    - @ref CLElementwiseMin
    - @ref CLElementwiseSquaredDiff
    - @ref CLStackLayerKernel / @ref CLStackLayer
    - @ref CLReverse / @ref CLReverseKernel
    - @ref CLRsqrtLayer
    - @ref CLExpLayer
    - @ref CLElementWiseUnaryLayerKernel
    - @ref CLGEMMReshapeLHSMatrixKernel
    - @ref CLGEMMReshapeRHSMatrixKernel
    - @ref CLGEMMMatrixMultiplyReshapedKernel
    - @ref CLRangeKernel / @ref CLRange
    - @ref CLUnstack
    - @ref CLGatherKernel / @ref CLGather
    - @ref CLGEMMLowpMatrixMultiplyReshapedKernel
 - New CPP kernels / functions:
    - @ref CPPDetectionOutputLayer
    - @ref CPPTopKV / @ref CPPTopKVKernel
 - Added new examples:
    - graph_ssd_mobilenet.cpp
    - graph_mobilenet_v2.cpp
    - graph_resnet12.cpp
    - graph_srcnn955.cpp
    - graph_vgg_vdsr.cpp
    - graph_inception_resnet_v1.cpp
 - Add 4D tensors support to
    - @ref NESoftmaxLayer
 - Fused activation in @ref CLWinogradConvolutionLayer
 - Extended @ref NEPermute to support more cases
 - Added NEON/SVE GEMM Hybrid kernels
 - Added u8 and s8 hybrid assembly kernels
 - Introduced GEMM strategy name in NEGEMMAssemblyWrapper
 - Improved @ref CLTuner
 - Fused the bias addition within @ref CLGEMM
 - Added support for QASYMM8 LOGISTIC activation in @ref NEActivationLayer
 - Added NHWC data layout support to:
    - @ref NEScale for F16
    - @ref CLNormalizationLayer IN_MAP_2D for FP32/FP16
    - @ref NEL2NormalizeLayer for FP32/FP16
    - @ref NENormalizationLayer IN_MAP_2D for FP32/FP16
    - @ref CLROIAlignLayer
    - @ref CLGenerateProposalsLayer
 - Added QASYMM8 support to the following kernels:
    - @ref NEArithmeticAdditionKernel
    - @ref NEScale
 - Added new tests and improved validation and benchmarking suites.
 - Deprecated functions/interfaces
    - Usage of inner_border_right and inner_border_top has been deprecated in @ref CLDeconvolutionLayer and @ref NEDeconvolutionLayer

v18.11 Public major release
 - Various bug fixes.
 - Various optimisations.
 - New Neon kernels / functions:
    - @ref NEChannelShuffleLayer / @ref NEChannelShuffleLayerKernel
    - @ref NEReduceMean
    - @ref NEReorgLayer / @ref NEReorgLayerKernel
    - @ref NEPriorBoxLayer / @ref NEPriorBoxLayerKernel
    - @ref NEUpsampleLayer / @ref NEUpsampleLayerKernel
    - @ref NEYOLOLayer / @ref NEYOLOLayerKernel
 - New OpenCL kernels / functions:
    - @ref CLBatchToSpaceLayer / @ref CLBatchToSpaceLayerKernel
    - @ref CLBoundingBoxTransform / @ref CLBoundingBoxTransformKernel
    - @ref CLComputeAllAnchorsKernel
    - @ref CLGenerateProposalsLayer
    - @ref CLNormalizePlanarYUVLayer / @ref CLNormalizePlanarYUVLayerKernel
    - @ref CLReorgLayer / @ref CLReorgLayerKernel
    - @ref CLSpaceToBatchLayer / @ref CLSpaceToBatchLayerKernel
    - @ref CLPadLayer
    - @ref CLReduceMean
    - @ref CLPriorBoxLayer / @ref CLPriorBoxLayerKernel
    - @ref CLROIAlignLayer / @ref CLROIAlignLayerKernel
    - @ref CLSlice
    - @ref CLSplit
    - @ref CLStridedSlice / @ref CLStridedSliceKernel
    - @ref CLUpsampleLayer / @ref CLUpsampleLayerKernel
    - @ref CLYOLOLayer / @ref CLYOLOLayerKernel
 - New CPP kernels / functions:
    - @ref CPPBoxWithNonMaximaSuppressionLimit / @ref CPPBoxWithNonMaximaSuppressionLimitKernel
 - Added the validate method in:
    - @ref NEDepthConvertLayer
    - @ref NEFloor / @ref CLFloor
    - @ref NEGEMMMatrixAdditionKernel
    - @ref NEReshapeLayer / @ref CLReshapeLayer
    - @ref CLScale
 - Added new examples:
    - graph_shufflenet.cpp
    - graph_yolov3.cpp
 - Added documentation on how to add a new function or kernel.
 - Improved doxygen documentation adding a list of the existing functions.
 - Add 4D tensors support to
    - CLWidthConcatenateLayer
    - @ref CLFlattenLayer
    - @ref CLSoftmaxLayer
 - Add dot product support for @ref CLDepthwiseConvolutionLayer3x3NHWCKernel non-unit stride
 - Add SVE support
 - Fused batch normalization into convolution layer weights in @ref CLFuseBatchNormalization
 - Fused activation in @ref CLDepthwiseConvolutionLayer3x3NCHWKernel, @ref CLDepthwiseConvolutionLayer3x3NHWCKernel and @ref NEGEMMConvolutionLayer
 - Added NHWC data layout support to:
    - @ref CLChannelShuffleLayer
    - @ref CLDeconvolutionLayer
    - @ref CLL2NormalizeLayer
 - Added QASYMM8 support to the following kernels:
    - @ref CLScaleKernel
    - NEDepthwiseConvolutionLayer3x3Kernel
    - @ref CLPixelWiseMultiplicationKernel
 - Added FP16 support to the following kernels:
    - @ref CLDepthwiseConvolutionLayer3x3NHWCKernel
    - NEDepthwiseConvolutionLayer3x3Kernel
    - @ref CLNormalizePlanarYUVLayerKernel
    - @ref CLWinogradConvolutionLayer (5x5 kernel)
 - More tests added to both validation and benchmarking suites.

v18.08 Public major release
 - Various bug fixes.
 - Various optimisations.
 - Updated recommended NDK version to r17b.
 - Removed support for QS8/QS16 data types.
 - Added support for grouped convolution in @ref CLConvolutionLayer.
 - Added NHWC data layout support to:
    - NEDepthConcatenateLayer / CLDepthConcatenateLayer
    - @ref NEWinogradConvolutionLayer / @ref CLWinogradConvolutionLayer
    - @ref CLDepthwiseConvolutionLayer
    - @ref CLDirectConvolutionLayer
    - @ref CLConvolutionLayer
    - @ref CLScale
    - @ref CLIm2ColKernel
 - New Neon kernels / functions:
    - @ref NERNNLayer
 - New OpenCL kernels / functions:
    - @ref CLArithmeticDivision
 - Introduced prepare() stage support in the graph API for GLES.
 - Added support for memory reuse when trying to allocate smaller CLTensors.
 - Enabled NHWC execution on graph examples.
 - Added JPEG accessor for validation purposes.
 - Added validate methods to some kernels / functions.

v18.05 Public major release
 - Various bug fixes.
 - Various optimisations.
 - Major redesign in the interface for the neon kernels implemented in assembly.
 - Removed arm_compute::NEGEMMLowpAArch64A53Kernel / arm_compute::NEGEMMLowpAArch64Kernel / arm_compute::NEGEMMLowpAArch64V8P4Kernel / arm_compute::NEGEMMInterleavedBlockedKernel / arm_compute::NEGEMMLowpAssemblyMatrixMultiplyCore / arm_compute::NEHGEMMAArch64FP16Kernel
 - Added NEGEMMAssemblyWrapper and AssemblyKernelGlue which are used to execute assembly kernels in neon functions.
 - Minor changes to the CPUInfo type to make it compatible with the new assembly gemm interface.
 - Moved neon assembly kernels to the folder src/core/NEON/kernels/arm_gemm.
 - Improved doxygen documentation.
 - Improved memory management for layer's transitions.
 - Added support for NHWC data layout in tensors.
 - Added NHWC data layout support to:
    - @ref NEGEMMConvolutionLayer
    - @ref NEDirectConvolutionLayer
    - @ref NEPoolingLayer / @ref CLPoolingLayer
    - @ref NEBatchNormalizationLayer / @ref CLBatchNormalizationLayer
    - @ref NEDepthwiseConvolutionLayer
    - @ref NEScale
    - @ref NEIm2Col
 - Added support for dilated convolutions in @ref NEConvolutionLayer and @ref CLConvolutionLayer.
 - New OpenCL kernels / functions:
    - @ref CLChannelShuffleLayer / @ref CLChannelShuffleLayerKernel
    - @ref CLConvertFullyConnectedWeightsKernel / @ref CLConvertFullyConnectedWeights
    - @ref CLCopy / @ref CLCopyKernel
    - @ref CLLSTMLayer
    - @ref CLRNNLayer
    - CLWidthConcatenateLayer / @ref CLWidthConcatenateLayerKernel
    - @ref CLWinogradFilterTransformKernel / @ref CLWinogradInputTransformKernel / @ref CLWinogradConvolutionLayer
    - @ref CLWinogradInputTransformKernel / @ref CLWinogradInputTransform
 - New Neon kernels / functions:
    - @ref NEConvertFullyConnectedWeightsKernel / @ref NEConvertFullyConnectedWeights.
 - Created the validate method in @ref CLDepthwiseConvolutionLayer.
 - Beta and gamma are no longer mandatory arguments in @ref NEBatchNormalizationLayer and @ref CLBatchNormalizationLayer.
 - Added depth multiplier support in @ref NEDepthwiseConvolutionLayer and @ref CLDepthwiseConvolutionLayer.
 - Added broadcast multiply support in @ref NEPixelWiseMultiplication / @ref NEPixelWiseMultiplicationKernel.
 - Port mobilenet example to NHWC data layout.
 - Enabled Winograd method in @ref CLConvolutionLayer.
 - Renamed NEWinogradLayer to @ref NEWinogradConvolutionLayer.
 - Updated @ref NEWinogradConvolutionLayer to use highly optimised assembly kernels in src/core/NEON/kernels/arm_gemm.
 - Added memory manager support in GLES functions.
 - Major refactoring of the graph API.
 - Added GLES backend in the graph API.
 - Added support for the memory manager in the graph API.
 - Enabled Winograd Convolution method in the graph API.
 - Added support for grouped convolutions in the graph API.
 - Replaced NEDeconvolutionLayerUpsampleKernel with @ref NEScaleKernel in @ref NEDeconvolutionLayer.
 - Added fast maths flag in @ref CLConvolutionLayer.
 - Added new tests and benchmarks in validation and benchmark frameworks
 - Merge Activation layer with Convolution Layer (NEON, CL, GLES)
 - Added support for OpenCL 2.0 SVM
 - Added support to import memory in OpenCL tensors.
 - Added the prepare() method to perform any one-off pre-processing before running the function.
 - Added new examples:
    - graph_inception_v4.cpp
    - graph_resnext50.cpp
 - Added memory measurement instrument for CL.

v18.03 Public maintenance release
 - Various bug fixes.
 - Fixed bug in @ref NEActivationLayer
 - Fix in @ref CLTuner when using batches.
 - Updated recommended NDK version to r16b (And fixed warnings).
 - Fixed bug in validation code.
 - Added Inception v4 graph example.
 - Renamed NEWinogradLayer.cpp to @ref NEWinogradConvolutionLayer

v18.02 Public major release
 - Various NEON / OpenCL / GLES optimisations.
 - Various bug fixes.
 - Changed default number of threads on big LITTLE systems.
 - Refactored examples and added:
    - graph_mobilenet_qassym8
    - graph_resnet
    - graph_squeezenet_v1_1
 - Renamed @ref CLConvolutionLayer into @ref CLGEMMConvolutionLayer and created a new @ref CLConvolutionLayer to select the fastest convolution method.
 - Renamed @ref NEConvolutionLayer into @ref NEGEMMConvolutionLayer and created a new @ref NEConvolutionLayer to select the fastest convolution method.
 - Added in place support to:
    - @ref CLActivationLayer
    - @ref CLBatchNormalizationLayer
 - Added QASYMM8 support to:
    - @ref CLActivationLayer
    - @ref CLDepthwiseConvolutionLayer
    - @ref NEDepthwiseConvolutionLayer
    - @ref NESoftmaxLayer
 - Added FP16 support to:
    - CLDepthwiseConvolutionLayer3x3
    - @ref CLDepthwiseConvolutionLayer
 - Added broadcasting support to @ref NEArithmeticAddition / @ref CLArithmeticAddition / @ref CLPixelWiseMultiplication
 - Added fused batched normalization and activation to @ref CLBatchNormalizationLayer and @ref NEBatchNormalizationLayer
 - Added support for non-square pooling to @ref NEPoolingLayer and @ref CLPoolingLayer
 - New OpenCL kernels / functions:
    - CLDirectConvolutionLayerOutputStageKernel
 - New NEON kernels / functions
    - Added name() method to all kernels.
    - Added support for Winograd 5x5.
    - @ref NEPermuteKernel / @ref NEPermute
    - @ref NEWinogradLayerTransformInputKernel / NEWinogradLayer
    - @ref NEWinogradLayerTransformOutputKernel / NEWinogradLayer
    - @ref NEWinogradLayerTransformWeightsKernel / NEWinogradLayer
    - Renamed NEWinogradLayerKernel into NEWinogradLayerBatchedGEMMKernel
 - New GLES kernels / functions:
    - @ref GCTensorShiftKernel / @ref GCTensorShift

v18.01 Public maintenance release
 - Various bug fixes
 - Added some of the missing validate() methods
 - Added @ref CLDeconvolutionLayerUpsampleKernel / @ref CLDeconvolutionLayer @ref CLDeconvolutionLayerUpsample
 - Added @ref CLPermuteKernel / @ref CLPermute
 - Added method to clean the programs cache in the CL Kernel library.
 - Added @ref GCArithmeticAdditionKernel / @ref GCArithmeticAddition
 - Added @ref GCDepthwiseConvolutionLayer3x3Kernel / @ref GCDepthwiseConvolutionLayer3x3
 - Added @ref GCNormalizePlanarYUVLayerKernel / @ref GCNormalizePlanarYUVLayer
 - Added @ref GCScaleKernel / @ref GCScale
 - Added @ref GCWeightsReshapeKernel / @ref GCConvolutionLayer
 - Added FP16 support to the following GLES compute kernels:
    - @ref GCCol2ImKernel
    - @ref GCGEMMInterleave4x4Kernel
    - @ref GCGEMMTranspose1xWKernel
    - @ref GCIm2ColKernel
 - Refactored NEON Winograd (NEWinogradLayerKernel)
 - Added @ref NEDirectConvolutionLayerOutputStageKernel
 - Added QASYMM8 support to the following NEON kernels:
    - NEDepthwiseConvolutionLayer3x3Kernel
    - @ref NEFillBorderKernel
    - @ref NEPoolingLayerKernel
 - Added new examples:
    - graph_cl_mobilenet_qasymm8.cpp
    - graph_inception_v3.cpp
    - gc_dc.cpp
 - More tests added to both validation and benchmarking suites.

v17.12 Public major release
 - Most machine learning functions on OpenCL support the new data type QASYMM8
 - Introduced logging interface
 - Introduced opencl timer
 - Reworked GEMMLowp interface
 - Added new NEON assembly kernels for GEMMLowp, SGEMM and HGEMM
 - Added validation method for most Machine Learning kernels / functions
 - Added new graph examples such as googlenet, mobilenet, squeezenet, vgg16 and vgg19
 - Added sgemm example for OpenCL
 - Added absolute difference example for GLES compute
 - Added new tests and benchmarks in validation and benchmark frameworks
 - Added new kernels / functions for GLES compute

 - New OpenGL ES kernels / functions
    - @ref GCAbsoluteDifferenceKernel / @ref GCAbsoluteDifference
    - @ref GCActivationLayerKernel / @ref GCActivationLayer
    - @ref GCBatchNormalizationLayerKernel / @ref GCBatchNormalizationLayer
    - @ref GCCol2ImKernel
    - @ref GCDepthConcatenateLayerKernel / GCDepthConcatenateLayer
    - @ref GCDirectConvolutionLayerKernel / @ref GCDirectConvolutionLayer
    - @ref GCDropoutLayerKernel / @ref GCDropoutLayer
    - @ref GCFillBorderKernel / @ref GCFillBorder
    - @ref GCGEMMInterleave4x4Kernel / @ref GCGEMMInterleave4x4
    - @ref GCGEMMMatrixAccumulateBiasesKernel / @ref GCGEMMMatrixAdditionKernel / @ref GCGEMMMatrixMultiplyKernel / @ref GCGEMM
    - @ref GCGEMMTranspose1xWKernel / @ref GCGEMMTranspose1xW
    - @ref GCIm2ColKernel
    - @ref GCNormalizationLayerKernel / @ref GCNormalizationLayer
    - @ref GCPixelWiseMultiplicationKernel / @ref GCPixelWiseMultiplication
    - @ref GCPoolingLayerKernel / @ref GCPoolingLayer
    - @ref GCLogits1DMaxKernel / @ref GCLogits1DShiftExpSumKernel / @ref GCLogits1DNormKernel / @ref GCSoftmaxLayer
    - @ref GCTransposeKernel / @ref GCTranspose

 - New NEON kernels / functions
    - arm_compute::NEGEMMLowpAArch64A53Kernel / arm_compute::NEGEMMLowpAArch64Kernel / arm_compute::NEGEMMLowpAArch64V8P4Kernel / arm_compute::NEGEMMInterleavedBlockedKernel / arm_compute::NEGEMMLowpAssemblyMatrixMultiplyCore
    - arm_compute::NEHGEMMAArch64FP16Kernel
    - NEDepthwiseConvolutionLayer3x3Kernel / NEDepthwiseIm2ColKernel / NEGEMMMatrixVectorMultiplyKernel / NEDepthwiseVectorToTensorKernel / @ref NEDepthwiseConvolutionLayer
    - @ref NEGEMMLowpOffsetContributionKernel / @ref NEGEMMLowpMatrixAReductionKernel / @ref NEGEMMLowpMatrixBReductionKernel / @ref NEGEMMLowpMatrixMultiplyCore
    - @ref NEGEMMLowpQuantizeDownInt32ToUint8ScaleByFixedPointKernel / @ref NEGEMMLowpQuantizeDownInt32ToUint8ScaleByFixedPoint
    - NEWinogradLayer / NEWinogradLayerKernel

 - New OpenCL kernels / functions
    - @ref CLGEMMLowpOffsetContributionKernel / @ref CLGEMMLowpMatrixAReductionKernel / @ref CLGEMMLowpMatrixBReductionKernel / @ref CLGEMMLowpMatrixMultiplyCore
    - CLGEMMLowpQuantizeDownInt32ToUint8ScaleByFixedPointKernel / @ref CLGEMMLowpQuantizeDownInt32ToUint8ScaleByFixedPoint

 - New graph nodes for NEON and OpenCL
    - graph::BranchLayer
    - graph::DepthConvertLayer
    - graph::DepthwiseConvolutionLayer
    - graph::DequantizationLayer
    - graph::FlattenLayer
    - graph::QuantizationLayer
    - graph::ReshapeLayer

v17.10 Public maintenance release
 - Bug fixes:
    - Check the maximum local workgroup size supported by OpenCL devices
    - Minor documentation updates (Fixed instructions to build the examples)
 - Introduced a graph::GraphContext
 - Added a few new Graph nodes, support for branches and grouping.
 - Automatically enable cl_printf in debug builds
 - Fixed bare metal builds for armv7a
 - Added AlexNet and cartoon effect examples
 - Fixed library builds: libraries are no longer built as supersets of each other (applications using the runtime part of the library now need to link against both libarm_compute_core and libarm_compute).

v17.09 Public major release
 - Experimental Graph support: initial implementation of a simple stream API to easily chain machine learning layers.
 - Memory Manager (@ref BlobLifetimeManager, @ref BlobMemoryPool, @ref ILifetimeManager, @ref IMemoryGroup, @ref IMemoryManager, @ref IMemoryPool, @ref IPoolManager, @ref MemoryManagerOnDemand, @ref PoolManager)
 - New validation and benchmark frameworks (Boost and Google frameworks replaced by homemade framework).
 - Most machine learning functions support both fixed point 8 and 16 bit (QS8, QS16) for both NEON and OpenCL.
 - New NEON kernels / functions:
    - arm_compute::NEGEMMAssemblyBaseKernel arm_compute::NEGEMMAArch64Kernel
    - @ref NEDequantizationLayerKernel / @ref NEDequantizationLayer
    - @ref NEFloorKernel / @ref NEFloor
    - @ref NEL2NormalizeLayerKernel / @ref NEL2NormalizeLayer
    - @ref NEQuantizationLayerKernel @ref NEMinMaxLayerKernel / @ref NEQuantizationLayer
    - @ref NEROIPoolingLayerKernel / @ref NEROIPoolingLayer
    - @ref NEReductionOperationKernel / @ref NEReductionOperation
    - @ref NEReshapeLayerKernel / @ref NEReshapeLayer

 - New OpenCL kernels / functions:
    - @ref CLDepthwiseConvolutionLayer3x3NCHWKernel @ref CLDepthwiseConvolutionLayer3x3NHWCKernel CLDepthwiseIm2ColKernel CLDepthwiseVectorToTensorKernel CLDepthwiseWeightsReshapeKernel / CLDepthwiseConvolutionLayer3x3 @ref CLDepthwiseConvolutionLayer CLDepthwiseSeparableConvolutionLayer
    - @ref CLDequantizationLayerKernel / @ref CLDequantizationLayer
    - @ref CLDirectConvolutionLayerKernel / @ref CLDirectConvolutionLayer
    - @ref CLFlattenLayer
    - @ref CLFloorKernel / @ref CLFloor
    - CLGEMMTranspose1xW
    - @ref CLGEMMMatrixVectorMultiplyKernel
    - @ref CLL2NormalizeLayerKernel / @ref CLL2NormalizeLayer
    - @ref CLQuantizationLayerKernel @ref CLMinMaxLayerKernel / @ref CLQuantizationLayer
    - @ref CLROIPoolingLayerKernel / @ref CLROIPoolingLayer
    - @ref CLReductionOperationKernel / @ref CLReductionOperation
    - @ref CLReshapeLayerKernel / @ref CLReshapeLayer

v17.06 Public major release
 - Various bug fixes
 - Added support for fixed point 8 bit (QS8) to the various NEON machine learning kernels.
 - Added unit tests and benchmarks (AlexNet, LeNet)
 - Added support for sub tensors.
 - Added infrastructure to provide GPU specific optimisation for some OpenCL kernels.
 - Added @ref OMPScheduler (OpenMP) scheduler for NEON
 - Added @ref SingleThreadScheduler scheduler for NEON (For bare metal)
 - Users can specify their own scheduler by implementing the @ref IScheduler interface.
 - New OpenCL kernels / functions:
    - @ref CLBatchNormalizationLayerKernel / @ref CLBatchNormalizationLayer
    - @ref CLDepthConcatenateLayerKernel / CLDepthConcatenateLayer
    - @ref CLHOGOrientationBinningKernel @ref CLHOGBlockNormalizationKernel, @ref CLHOGDetectorKernel / @ref CLHOGDescriptor @ref CLHOGDetector @ref CLHOGGradient @ref CLHOGMultiDetection
    - @ref CLLocallyConnectedMatrixMultiplyKernel / @ref CLLocallyConnectedLayer
    - @ref CLWeightsReshapeKernel / @ref CLConvolutionLayerReshapeWeights
 - New C++ kernels:
    - @ref CPPDetectionWindowNonMaximaSuppressionKernel
 - New NEON kernels / functions:
    - @ref NEBatchNormalizationLayerKernel / @ref NEBatchNormalizationLayer
    - @ref NEDepthConcatenateLayerKernel / NEDepthConcatenateLayer
    - @ref NEDirectConvolutionLayerKernel / @ref NEDirectConvolutionLayer
    - @ref NELocallyConnectedMatrixMultiplyKernel / @ref NELocallyConnectedLayer
    - @ref NEWeightsReshapeKernel / @ref NEConvolutionLayerReshapeWeights

v17.05 Public bug fixes release
 - Various bug fixes
 - Ported the remaining functions to use accurate padding.
 - The library no longer links against OpenCL (it uses dlopen / dlsym at runtime instead to determine whether OpenCL is available).
 - Added "free" method to allocator.
 - Minimum version of g++ required for armv7 Linux changed from 4.8 to 4.9

v17.04 Public bug fixes release

 The following functions have been ported to use the new accurate padding:
 - @ref CLColorConvertKernel
 - @ref CLEdgeNonMaxSuppressionKernel
 - @ref CLEdgeTraceKernel
 - @ref CLGaussianPyramidHorKernel
 - @ref CLGaussianPyramidVertKernel
 - @ref CLGradientKernel
 - @ref NEChannelCombineKernel
 - @ref NEFillArrayKernel
 - @ref NEGaussianPyramidHorKernel
 - @ref NEGaussianPyramidVertKernel
 - NEHarrisScoreFP16Kernel
 - @ref NEHarrisScoreKernel
 - @ref NEHOGDetectorKernel
 - @ref NELogits1DMaxKernel
 - NELogits1DShiftExpSumKernel
 - NELogits1DNormKernel
 - @ref NENonMaximaSuppression3x3FP16Kernel
 - @ref NENonMaximaSuppression3x3Kernel

v17.03.1 First Major public release of the sources
 - Renamed the library to arm_compute
 - New CPP target introduced for C++ kernels shared between NEON and CL functions.
 - New padding calculation interface introduced and ported most kernels / functions to use it.
 - New OpenCL kernels / functions:
    - CLGEMMLowpMatrixMultiplyKernel / CLGEMMLowp
 - New NEON kernels / functions:
    - @ref NENormalizationLayerKernel / @ref NENormalizationLayer
    - @ref NETransposeKernel / @ref NETranspose
    - @ref NELogits1DMaxKernel, NELogits1DShiftExpSumKernel, NELogits1DNormKernel / @ref NESoftmaxLayer
    - @ref NEIm2ColKernel, @ref NECol2ImKernel, NEConvolutionLayerWeightsReshapeKernel / @ref NEConvolutionLayer
    - NEGEMMMatrixAccumulateBiasesKernel / @ref NEFullyConnectedLayer
    - @ref NEGEMMLowpMatrixMultiplyKernel / NEGEMMLowp

v17.03 Sources preview
 - New OpenCL kernels / functions:
    - @ref CLGradientKernel, @ref CLEdgeNonMaxSuppressionKernel, @ref CLEdgeTraceKernel / @ref CLCannyEdge
    - GEMM refactoring + FP16 support: CLGEMMInterleave4x4Kernel, CLGEMMTranspose1xWKernel, @ref CLGEMMMatrixMultiplyKernel, CLGEMMMatrixAdditionKernel / @ref CLGEMM
    - CLGEMMMatrixAccumulateBiasesKernel / @ref CLFullyConnectedLayer
    - @ref CLTransposeKernel / @ref CLTranspose
    - @ref CLLKTrackerInitKernel, @ref CLLKTrackerStage0Kernel, @ref CLLKTrackerStage1Kernel, @ref CLLKTrackerFinalizeKernel / @ref CLOpticalFlow
    - @ref CLNormalizationLayerKernel / @ref CLNormalizationLayer
    - @ref CLLaplacianPyramid, @ref CLLaplacianReconstruct
 - New NEON kernels / functions:
    - @ref NEActivationLayerKernel / @ref NEActivationLayer
    - GEMM refactoring + FP16 support (Requires armv8.2 CPU): @ref NEGEMMInterleave4x4Kernel, @ref NEGEMMTranspose1xWKernel, @ref NEGEMMMatrixMultiplyKernel, @ref NEGEMMMatrixAdditionKernel / @ref NEGEMM
    - @ref NEPoolingLayerKernel / @ref NEPoolingLayer

v17.02.1 Sources preview
 - New OpenCL kernels / functions:
    - CLLogits1DMaxKernel, CLLogits1DShiftExpSumKernel, @ref CLLogits1DNormKernel / @ref CLSoftmaxLayer
    - @ref CLPoolingLayerKernel / @ref CLPoolingLayer
    - @ref CLIm2ColKernel, @ref CLCol2ImKernel, CLConvolutionLayerWeightsReshapeKernel / @ref CLConvolutionLayer
    - @ref CLRemapKernel / @ref CLRemap
    - @ref CLGaussianPyramidHorKernel, @ref CLGaussianPyramidVertKernel / @ref CLGaussianPyramid, @ref CLGaussianPyramidHalf, @ref CLGaussianPyramidOrb
    - @ref CLMinMaxKernel, @ref CLMinMaxLocationKernel / @ref CLMinMaxLocation
    - @ref CLNonLinearFilterKernel / @ref CLNonLinearFilter
 - New NEON FP16 kernels (Requires armv8.2 CPU)
    - @ref NEAccumulateWeightedFP16Kernel
    - @ref NEBox3x3FP16Kernel
    - @ref NENonMaximaSuppression3x3FP16Kernel

v17.02 Sources preview
 - New OpenCL kernels / functions:
    - @ref CLActivationLayerKernel / @ref CLActivationLayer
    - @ref CLChannelCombineKernel / @ref CLChannelCombine
    - @ref CLDerivativeKernel / @ref CLChannelExtract
    - @ref CLFastCornersKernel / @ref CLFastCorners
    - @ref CLMeanStdDevKernel / @ref CLMeanStdDev
 - New NEON kernels / functions:
    - HOG / SVM: @ref NEHOGOrientationBinningKernel, @ref NEHOGBlockNormalizationKernel, @ref NEHOGDetectorKernel, NEHOGNonMaximaSuppressionKernel / @ref NEHOGDescriptor, @ref NEHOGDetector, @ref NEHOGGradient, @ref NEHOGMultiDetection
    - @ref NENonLinearFilterKernel / @ref NENonLinearFilter
 - Introduced a CLScheduler to manage the default context and command queue used by the runtime library and create synchronisation events.
 - Switched all the kernels / functions to use tensors instead of images.
 - Updated documentation to include instructions to build the library from sources.

v16.12 Binary preview release
 - Original release

@section S3_how_to_build How to build the library and the examples

@subsection S3_1_build_options Build options

scons 2.3 or above is required to build the library.
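If you are unsure whether your installed SCons is recent enough, you can compare version strings with `sort -V`. This is an illustrative sketch: the `"3.0.1"` value is a placeholder you would replace with the version reported by `scons --version`.

```shell
# Hypothetical version gate: substitute the version printed by `scons --version`.
required="2.3"
have="3.0.1"
# sort -V orders version strings numerically; if the required version sorts
# first, the installed version is at least as new as the requirement.
lowest=$(printf '%s\n%s\n' "$required" "$have" | sort -V | head -n1)
if [ "$lowest" = "$required" ]; then
    echo "SCons is new enough"
else
    echo "SCons is too old, please upgrade"
fi
```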
To see the build options available simply run ```scons -h```:

    debug: Debug (yes|no)
        default: False
        actual: False

    asserts: Enable asserts (this flag is forced to 1 for debug=1) (yes|no)
        default: False
        actual: False

    arch: Target Architecture (armv7a|arm64-v8a|arm64-v8.2-a|x86_32|x86_64)
        default: armv7a
        actual: armv7a

    os: Target OS (linux|android|bare_metal)
        default: linux
        actual: linux

    build: Build type (native|cross_compile|embed_only)
        default: cross_compile
        actual: cross_compile

    examples: Build example programs (yes|no)
        default: True
        actual: True

    Werror: Enable/disable the -Werror compilation flag (yes|no)
        default: True
        actual: True

    opencl: Enable OpenCL support (yes|no)
        default: True
        actual: True

    neon: Enable Neon support (yes|no)
        default: False
        actual: False

    gles_compute: Enable OpenGL ES Compute Shader support (yes|no)
        default: False
        actual: False

    embed_kernels: Embed OpenCL kernels and OpenGL ES compute shaders in library binary (yes|no)
        default: True
        actual: True

    set_soname: Set the library's soname and shlibversion (requires SCons 2.4 or above) (yes|no)
        default: False
        actual: False

    openmp: Enable OpenMP backend (yes|no)
        default: False
        actual: False

    cppthreads: Enable C++11 threads backend (yes|no)
        default: True
        actual: True

    build_dir: Specify sub-folder for the build ( /path/to/build_dir )
        default: .
        actual: .

    extra_cxx_flags: Extra CXX flags to be appended to the build command
        default:
        actual:

    pmu: Enable PMU counters (yes|no)
        default: False
        actual: False

    mali: Enable Mali hardware counters (yes|no)
        default: False
        actual: False

    validation_tests: Build validation test programs (yes|no)
        default: False
        actual: False

    benchmark_tests: Build benchmark test programs (yes|no)
        default: False
        actual: False

@b debug / @b asserts:
 - With debug=1 asserts are enabled, and the library is built with symbols and no optimisations enabled.
 - With debug=0 and asserts=1: optimisations are enabled and symbols are removed, however all the asserts are still present (this is about 20% slower than the release build).
 - With debug=0 and asserts=0: all optimisations are enabled and no validation is performed; if the application misuses the library it is likely to result in a crash (only use this mode once you are sure your application works as expected).

@b arch: The x86_32 and x86_64 targets can only be used with neon=0 and opencl=1.

@b os: Choose the operating system you are targeting: Linux, Android or bare metal.
@note bare metal can only be used for NEON (not OpenCL), only static libraries get built and NEON's multi-threading support is disabled.

@b build: you can either build directly on your device (native) or cross compile from your desktop machine (cross-compile). In both cases make sure the compiler is available in your path.

@note If you want to natively compile for 32bit on a 64bit ARM device running a 64bit OS then you will have to use cross-compile too.

There is also an 'embed_only' option which will generate all the .embed files for the OpenCL kernels and / or OpenGLES compute shaders. This might be useful if using a different build system to compile the library.
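As a rough illustration of what embedding amounts to (an assumption about the mechanism, not output from the actual build): the kernel source ends up stored as a quoted C string that can be compiled into the binary. The file name and sed recipe below are purely for demonstration.

```shell
# Sketch: turn a kernel source file into a quoted C string literal, similar in
# spirit to the .embed files generated by embed_kernels=1 / build=embed_only.
printf '%s\n' '__kernel void dummy() {}' > /tmp/dummy.cl
# Escape backslashes and quotes, then wrap each line in "..." with a trailing \n.
sed -e 's/\\/\\\\/g' -e 's/"/\\"/g' -e 's/^/"/' -e 's/$/\\n"/' /tmp/dummy.cl
```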

@b Werror: If you are compiling using the same toolchains as the ones used in this guide then there shouldn't be any warning and therefore you should be able to keep Werror=1. If the library fails to build with a different compiler version because warnings are treated as errors, and you are sure the warnings are not important, you can try building with Werror=0 (but please report the issue either on Github or by email to developer@arm.com so that it can be addressed).

@b opencl / @b neon / @b gles_compute: Choose which SIMD technology you want to target. (NEON for ARM Cortex-A CPUs or OpenCL / GLES_COMPUTE for ARM Mali GPUs)

@b embed_kernels: For OpenCL / GLES_COMPUTE only: set embed_kernels=1 if you want the OpenCL / GLES_COMPUTE kernels to be built into the library's binaries instead of being read from separate ".cl" / ".cs" files. If embed_kernels is set to 0 then the application can set the path to the folder containing the OpenCL / GLES_COMPUTE kernel files by calling CLKernelLibrary::init() / GCKernelLibrary::init(). By default the path is set to "./cl_kernels" / "./cs_shaders".

@b set_soname: Do you want to build the versioned version of the library?

If enabled the library will contain a SONAME and SHLIBVERSION and some symlinks will automatically be created between the objects.
Example:

    libarm_compute_core.so -> libarm_compute_core.so.1.0.0
    libarm_compute_core.so.1 -> libarm_compute_core.so.1.0.0
    libarm_compute_core.so.1.0.0

@note This option is disabled by default as it requires SCons version 2.4 or above.

@b extra_cxx_flags: Custom CXX flags which will be appended to the end of the build command.

@b build_dir: Build the library in a subfolder of the "build" folder (allows building several configurations in parallel).

@b examples: Whether to build the example programs.

@b validation_tests: Enable the build of the validation suite.
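The symlink layout that set_soname produces can be reproduced by hand if you want to see how the unversioned names resolve. This is only an illustrative sketch using empty dummy files in a temporary directory, not output from the actual build:

```shell
# Mimic the set_soname=1 layout with dummy files (illustration only).
demo=$(mktemp -d)
cd "$demo"
touch libarm_compute_core.so.1.0.0                            # the real, versioned object
ln -s libarm_compute_core.so.1.0.0 libarm_compute_core.so.1   # SONAME the loader resolves
ln -s libarm_compute_core.so.1.0.0 libarm_compute_core.so     # name the linker (-larm_compute_core) uses
readlink libarm_compute_core.so
```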
@b benchmark_tests: Enable the build of the benchmark tests.

@b pmu: Enable the PMU cycle counter to measure execution time in benchmark tests. (Your device needs to support it.)

@b mali: Enable the collection of Mali hardware counters to measure execution time in benchmark tests. (Your device needs to have a Mali driver that supports it.)

@b openmp: Build in the OpenMP scheduler for NEON.

@note Only works when building with g++, not clang++

@b cppthreads: Build in the C++11 scheduler for NEON.

@sa Scheduler::set

@subsection S3_2_linux Building for Linux

@subsubsection S3_2_1_library How to build the library ?

For Linux, the library was successfully built and tested using the following Linaro GCC toolchains:

 - gcc-linaro-6.3.1-2017.05-x86_64_arm-linux-gnueabihf
 - gcc-linaro-6.3.1-2017.05-x86_64_aarch64-linux-gnu

To cross-compile the library in debug mode, with NEON only support, for Linux 32bit:

    scons Werror=1 -j8 debug=1 neon=1 opencl=0 os=linux arch=armv7a

To cross-compile the library in asserts mode, with OpenCL only support, for Linux 64bit:

    scons Werror=1 -j8 debug=0 asserts=1 neon=0 opencl=1 embed_kernels=1 os=linux arch=arm64-v8a

To cross-compile the library in asserts mode, with GLES_COMPUTE only support, for Linux 64bit:

    scons Werror=1 -j8 debug=0 asserts=1 neon=0 opencl=0 gles_compute=1 embed_kernels=1 os=linux arch=arm64-v8a

You can also compile the library natively on an ARM device by using <b>build=native</b>:

    scons Werror=1 -j8 debug=0 neon=1 opencl=0 os=linux arch=arm64-v8a build=native
    scons Werror=1 -j8 debug=0 neon=1 opencl=0 os=linux arch=armv7a build=native

@note g++ for ARM is mono-arch, therefore if you want to compile for Linux 32bit on a Linux 64bit platform you will have to use a cross compiler.
For example on a 64bit Debian based system you would have to install <b>g++-arm-linux-gnueabihf</b>

    apt-get install g++-arm-linux-gnueabihf

Then run

    scons Werror=1 -j8 debug=0 neon=1 opencl=0 os=linux arch=armv7a build=cross_compile

or simply remove the build parameter, as build=cross_compile is the default value:

    scons Werror=1 -j8 debug=0 neon=1 opencl=0 os=linux arch=armv7a

@subsubsection S3_2_2_examples How to manually build the examples ?

The examples get automatically built by scons as part of the build process of the library described above. This section just describes how you can build and link your own application against our library.

@note The following command lines assume the arm_compute libraries are present in the current directory or in the system library path. If this is not the case you can specify the location of the pre-built libraries with the compiler option -L. When building the OpenCL example the commands below assume that the CL headers are located in the include folder where the command is executed.

To cross compile a NEON example for Linux 32bit:

    arm-linux-gnueabihf-g++ examples/neon_convolution.cpp utils/Utils.cpp -I. -Iinclude -std=c++11 -mfpu=neon -L. -larm_compute -larm_compute_core -o neon_convolution

To cross compile a NEON example for Linux 64bit:

    aarch64-linux-gnu-g++ examples/neon_convolution.cpp utils/Utils.cpp -I. -Iinclude -std=c++11 -L. -larm_compute -larm_compute_core -o neon_convolution

(notice the only difference with the 32 bit command is that we don't need the -mfpu option and the compiler's name is different)

To cross compile an OpenCL example for Linux 32bit:

    arm-linux-gnueabihf-g++ examples/cl_convolution.cpp utils/Utils.cpp -I. -Iinclude -std=c++11 -mfpu=neon -L. -larm_compute -larm_compute_core -o cl_convolution -DARM_COMPUTE_CL

To cross compile an OpenCL example for Linux 64bit:

    aarch64-linux-gnu-g++ examples/cl_convolution.cpp utils/Utils.cpp -I. -Iinclude -std=c++11 -L. -larm_compute -larm_compute_core -o cl_convolution -DARM_COMPUTE_CL

To cross compile a GLES example for Linux 32bit:

    arm-linux-gnueabihf-g++ examples/gc_absdiff.cpp utils/Utils.cpp -I. -Iinclude/ -L. -larm_compute -larm_compute_core -std=c++11 -mfpu=neon -DARM_COMPUTE_GC -Iinclude/linux/ -o gc_absdiff

To cross compile a GLES example for Linux 64bit:

    aarch64-linux-gnu-g++ examples/gc_absdiff.cpp utils/Utils.cpp -I. -Iinclude/ -L. -larm_compute -larm_compute_core -std=c++11 -DARM_COMPUTE_GC -Iinclude/linux/ -o gc_absdiff

(notice the only difference with the 32 bit command is that we don't need the -mfpu option and the compiler's name is different)

To cross compile the examples with the Graph API, such as graph_lenet.cpp, you need to link the examples against arm_compute_graph.so too.

i.e. to cross compile the "graph_lenet" example for Linux 32bit:

    arm-linux-gnueabihf-g++ examples/graph_lenet.cpp utils/Utils.cpp utils/GraphUtils.cpp utils/CommonGraphOptions.cpp -I. -Iinclude -std=c++11 -mfpu=neon -L. -larm_compute_graph -larm_compute -larm_compute_core -Wl,--allow-shlib-undefined -o graph_lenet

i.e. to cross compile the "graph_lenet" example for Linux 64bit:

    aarch64-linux-gnu-g++ examples/graph_lenet.cpp utils/Utils.cpp utils/GraphUtils.cpp utils/CommonGraphOptions.cpp -I. -Iinclude -std=c++11 -L. -larm_compute_graph -larm_compute -larm_compute_core -Wl,--allow-shlib-undefined -o graph_lenet

(notice the only difference with the 32 bit command is that we don't need the -mfpu option and the compiler's name is different)

@note If compiling using static libraries, this order must be followed when linking: arm_compute_graph_static, arm_compute, arm_compute_core

To compile natively (i.e directly on an ARM device) for NEON for Linux 32bit:

    g++ examples/neon_convolution.cpp utils/Utils.cpp -I. -Iinclude -std=c++11 -mfpu=neon -larm_compute -larm_compute_core -o neon_convolution

To compile natively (i.e directly on an ARM device) for NEON for Linux 64bit:

    g++ examples/neon_convolution.cpp utils/Utils.cpp -I. -Iinclude -std=c++11 -larm_compute -larm_compute_core -o neon_convolution

(notice the only difference with the 32 bit command is that we don't need the -mfpu option)

To compile natively (i.e directly on an ARM device) for OpenCL for Linux 32bit or Linux 64bit:

    g++ examples/cl_convolution.cpp utils/Utils.cpp -I. -Iinclude -std=c++11 -larm_compute -larm_compute_core -o cl_convolution -DARM_COMPUTE_CL

To compile natively (i.e directly on an ARM device) for GLES for Linux 32bit or Linux 64bit:

    g++ examples/gc_absdiff.cpp utils/Utils.cpp -I. -Iinclude/ -L. -larm_compute -larm_compute_core -std=c++11 -DARM_COMPUTE_GC -Iinclude/linux/ -o gc_absdiff

To compile the examples with the Graph API natively, such as graph_lenet.cpp, you need to link the examples against arm_compute_graph.so too.

i.e. to natively compile the "graph_lenet" example for Linux 32bit:

    g++ examples/graph_lenet.cpp utils/Utils.cpp utils/GraphUtils.cpp utils/CommonGraphOptions.cpp -I. -Iinclude -std=c++11 -mfpu=neon -L. -larm_compute_graph -larm_compute -larm_compute_core -Wl,--allow-shlib-undefined -o graph_lenet

i.e. to natively compile the "graph_lenet" example for Linux 64bit:

    g++ examples/graph_lenet.cpp utils/Utils.cpp utils/GraphUtils.cpp utils/CommonGraphOptions.cpp -I. -Iinclude -std=c++11 -L. -larm_compute_graph -larm_compute -larm_compute_core -Wl,--allow-shlib-undefined -o graph_lenet

(notice the only difference with the 32 bit command is that we don't need the -mfpu option)

@note If compiling using static libraries, this order must be followed when linking: arm_compute_graph_static, arm_compute, arm_compute_core

@note These two commands assume libarm_compute.so is available in your library path, if not add the path to it using -L (e.g. -Llib/linux-arm64-v8a-neon-cl-asserts/)
@note You might need to export the path to the OpenCL library as well in your LD_LIBRARY_PATH if Compute Library was built with OpenCL enabled.

To run the built executable simply run:

    LD_LIBRARY_PATH=build ./neon_convolution

or

    LD_LIBRARY_PATH=build ./cl_convolution

@note Examples accept different types of arguments; to find out what they are, run the example with \a --help as an argument. If no arguments are specified then random values will be used to execute the graph.

For example:

    LD_LIBRARY_PATH=. ./graph_lenet --help

Below is a list of the common parameters among the graph examples:
@snippet utils/CommonGraphOptions.h Common graph examples parameters

@subsection S3_3_android Building for Android

For Android, the library was successfully built and tested using Google's standalone toolchains:
 - clang++ from NDK r18b for armv7a
 - clang++ from NDK r18b for arm64-v8a
 - clang++ from NDK r18b for arm64-v8.2-a with FP16 support

Here is a guide to <a href="https://developer.android.com/ndk/guides/standalone_toolchain.html">create your Android standalone toolchains from the NDK</a>:

- Download the NDK r18b from here: https://developer.android.com/ndk/downloads/index.html to directory $NDK
- Make sure you have Python 2.7 installed on your machine.
- Generate the 32bit and/or 64bit toolchains by running the following commands, installing to your toolchain directory $MY_TOOLCHAINS:


    $NDK/build/tools/make_standalone_toolchain.py --arch arm64 --install-dir $MY_TOOLCHAINS/aarch64-linux-android-ndk-r18b --stl libc++ --api 21
    $NDK/build/tools/make_standalone_toolchain.py --arch arm --install-dir $MY_TOOLCHAINS/arm-linux-android-ndk-r18b --stl libc++ --api 21

@attention We used to use gnustl but as of NDK r17 it is deprecated so we switched to libc++

@note Make sure to add the toolchains to your PATH:

    export PATH=$PATH:$MY_TOOLCHAINS/aarch64-linux-android-ndk-r18b/bin:$MY_TOOLCHAINS/arm-linux-android-ndk-r18b/bin

@subsubsection S3_3_1_library How to build the library ?
To cross-compile the library in debug mode, with NEON only support, for Android 32bit:

    CXX=clang++ CC=clang scons Werror=1 -j8 debug=1 neon=1 opencl=0 os=android arch=armv7a

To cross-compile the library in asserts mode, with OpenCL only support, for Android 64bit:

    CXX=clang++ CC=clang scons Werror=1 -j8 debug=0 asserts=1 neon=0 opencl=1 embed_kernels=1 os=android arch=arm64-v8a

To cross-compile the library in asserts mode, with GLES_COMPUTE only support, for Android 64bit:

    CXX=clang++ CC=clang scons Werror=1 -j8 debug=0 asserts=1 neon=0 opencl=0 gles_compute=1 embed_kernels=1 os=android arch=arm64-v8a

@subsubsection S3_3_2_examples How to manually build the examples ?

The examples get automatically built by scons as part of the build process of the library described above. This section just describes how you can build and link your own application against our library.

@note The following command lines assume the arm_compute libraries are present in the current directory or in the system library path. If this is not the case you can specify the location of the pre-built libraries with the compiler option -L. When building the OpenCL example the commands below assume that the CL headers are located in the include folder where the command is executed.

Once you've got your Android standalone toolchain built and added to your path you can do the following:

To cross compile a NEON example:

    #32 bit:
    arm-linux-androideabi-clang++ examples/neon_convolution.cpp utils/Utils.cpp -I. -Iinclude -std=c++11 -larm_compute-static -larm_compute_core-static -L. -o neon_convolution_arm -static-libstdc++ -pie
    #64 bit:
    aarch64-linux-android-clang++ examples/neon_convolution.cpp utils/Utils.cpp -I. -Iinclude -std=c++11 -larm_compute-static -larm_compute_core-static -L. -o neon_convolution_aarch64 -static-libstdc++ -pie

To cross compile an OpenCL example:

    #32 bit:
    arm-linux-androideabi-clang++ examples/cl_convolution.cpp utils/Utils.cpp -I. -Iinclude -std=c++11 -larm_compute-static -larm_compute_core-static -L. -o cl_convolution_arm -static-libstdc++ -pie -DARM_COMPUTE_CL
    #64 bit:
    aarch64-linux-android-clang++ examples/cl_convolution.cpp utils/Utils.cpp -I. -Iinclude -std=c++11 -larm_compute-static -larm_compute_core-static -L. -o cl_convolution_aarch64 -static-libstdc++ -pie -DARM_COMPUTE_CL

To cross compile a GLES example:

    #32 bit:
    arm-linux-androideabi-clang++ examples/gc_absdiff.cpp utils/Utils.cpp -I. -Iinclude -std=c++11 -larm_compute-static -larm_compute_core-static -L. -o gc_absdiff_arm -static-libstdc++ -pie -DARM_COMPUTE_GC
    #64 bit:
    aarch64-linux-android-clang++ examples/gc_absdiff.cpp utils/Utils.cpp -I. -Iinclude -std=c++11 -larm_compute-static -larm_compute_core-static -L. -o gc_absdiff_aarch64 -static-libstdc++ -pie -DARM_COMPUTE_GC

To cross compile the examples with the Graph API, such as graph_lenet.cpp, you also need to link against the arm_compute_graph library.

    #32 bit:
    arm-linux-androideabi-clang++ examples/graph_lenet.cpp utils/Utils.cpp utils/GraphUtils.cpp utils/CommonGraphOptions.cpp -I. -Iinclude -std=c++11 -Wl,--whole-archive -larm_compute_graph-static -Wl,--no-whole-archive -larm_compute-static -larm_compute_core-static -L. -o graph_lenet_arm -static-libstdc++ -pie -DARM_COMPUTE_CL
    #64 bit:
    aarch64-linux-android-clang++ examples/graph_lenet.cpp utils/Utils.cpp utils/GraphUtils.cpp utils/CommonGraphOptions.cpp -I. -Iinclude -std=c++11 -Wl,--whole-archive -larm_compute_graph-static -Wl,--no-whole-archive -larm_compute-static -larm_compute_core-static -L. -o graph_lenet_aarch64 -static-libstdc++ -pie -DARM_COMPUTE_CL

@note Due to some issues in older versions of the Mali OpenCL DDK (<= r13p0), we recommend linking arm_compute statically on Android.
@note When linked statically the arm_compute_graph library currently needs the --whole-archive linker flag in order to work properly

Then all you need to do is upload the executable and the shared library to the device using ADB:

    adb push neon_convolution_arm /data/local/tmp/
    adb push cl_convolution_arm /data/local/tmp/
    adb push gc_absdiff_arm /data/local/tmp/
    adb shell chmod 777 -R /data/local/tmp/

And finally to run the example:

    adb shell /data/local/tmp/neon_convolution_arm
    adb shell /data/local/tmp/cl_convolution_arm
    adb shell /data/local/tmp/gc_absdiff_arm

For 64bit:

    adb push neon_convolution_aarch64 /data/local/tmp/
    adb push cl_convolution_aarch64 /data/local/tmp/
    adb push gc_absdiff_aarch64 /data/local/tmp/
    adb shell chmod 777 -R /data/local/tmp/

And finally to run the example:

    adb shell /data/local/tmp/neon_convolution_aarch64
    adb shell /data/local/tmp/cl_convolution_aarch64
    adb shell /data/local/tmp/gc_absdiff_aarch64

@note Examples accept different types of arguments; to find out what they are, run the example with \a --help as an argument. If no arguments are specified then random values will be used to execute the graph.

For example:

    adb shell /data/local/tmp/graph_lenet --help

In this case the first argument of LeNet (like in all the graph examples) is the target (i.e 0 to run on NEON, 1 to run on OpenCL if available, 2 to run on OpenCL using the CLTuner), the second argument is the path to the folder containing the npy files for the weights and finally the third argument is the number of batches to run.
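Following the positional argument order just described, a hypothetical invocation would look like the one below (the assets path and batch count are illustrative assumptions, not files shipped with the library):

```shell
# 1 = run on OpenCL; weights folder and batch count are placeholder values
adb shell /data/local/tmp/graph_lenet 1 /data/local/tmp/assets 10
```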
@subsection S3_4_bare_metal Building for bare metal

For bare metal, the library was successfully built using Linaro's latest (gcc-linaro-6.3.1-2017.05) bare metal toolchains:
 - arm-eabi for armv7a
 - aarch64-elf for arm64-v8a

Download the Linaro toolchain for <a href="https://releases.linaro.org/components/toolchain/binaries/6.3-2017.05/arm-eabi/">armv7a</a> and <a href="https://releases.linaro.org/components/toolchain/binaries/6.3-2017.05/aarch64-elf/">arm64-v8a</a>.

@note Make sure to add the toolchains to your PATH: export PATH=$PATH:$MY_TOOLCHAINS/gcc-linaro-6.3.1-2017.05-x86_64_aarch64-elf/bin:$MY_TOOLCHAINS/gcc-linaro-6.3.1-2017.05-x86_64_arm-eabi/bin

@subsubsection S3_4_1_library How to build the library ?

To cross-compile the library with NEON support for bare metal arm64-v8a:

    scons Werror=1 -j8 debug=0 neon=1 opencl=0 os=bare_metal arch=arm64-v8a build=cross_compile cppthreads=0 openmp=0 standalone=1

@subsubsection S3_4_2_examples How to manually build the examples ?

Examples are disabled when building for bare metal. If you want to build the examples you need to provide a custom bootcode depending on the target architecture and link against the compute library. More information about bare metal bootcode can be found <a href="http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dai0527a/index.html">here</a>.

@subsection S3_5_windows_host Building on a Windows host system

Using `scons` directly from the Windows command line is known to cause problems. The reason seems to be that if `scons` is set up for cross-compilation it gets confused about Windows style paths (using backslashes). Thus it is recommended to follow one of the options outlined below.
@subsubsection S3_5_1_ubuntu_on_windows Bash on Ubuntu on Windows

The best and easiest option is to use <a href="https://msdn.microsoft.com/en-gb/commandline/wsl/about">Ubuntu on Windows</a>. This feature is still marked as *beta* and thus might not be available. However, if it is, building the library is as simple as opening a *Bash on Ubuntu on Windows* shell and following the general guidelines given above.

@subsubsection S3_5_2_cygwin Cygwin

If the Windows subsystem for Linux is not available, <a href="https://www.cygwin.com/">Cygwin</a> can be used to install and run `scons`; the minimum supported Cygwin version is 3.0.7. In addition to the default packages installed by Cygwin, `scons` has to be selected in the installer. (`git` might also be useful but is not strictly required if you already have got the source code of the library.) Linaro provides pre-built versions of <a href="http://releases.linaro.org/components/toolchain/binaries/">GCC cross-compilers</a> that can be used from the Cygwin terminal. When building for Android the compiler is included in the Android standalone toolchain. After everything has been set up in the Cygwin terminal the general guide on building the library can be followed.

@subsection S3_6_cl_requirements OpenCL DDK Requirements

@subsubsection S3_6_1_cl_hard_requirements Hard Requirements

Compute Library requires OpenCL 1.1 and above with support for non-uniform workgroup sizes, which is officially supported in the Mali OpenCL DDK r8p0 and above as an extension (the respective compiler flag is \a -cl-arm-non-uniform-work-group-size).

Enabling 16-bit floating point calculations requires the \a cl_khr_fp16 extension to be supported. All Mali GPUs with compute capabilities have native support for half precision floating points.
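The extensions a device exposes can be checked before building. As a minimal standalone sketch (not part of Compute Library; it assumes the standard OpenCL headers and an ICD/driver are installed on the target system), the OpenCL C API can be used to look for \a cl_khr_fp16 in the device's extension string:

```cpp
// Query the first GPU device's extension string and check for cl_khr_fp16.
// Build with something like: g++ check_fp16.cpp -lOpenCL -o check_fp16
#include <CL/cl.h>
#include <cstdio>
#include <cstring>
#include <vector>

int main()
{
    cl_platform_id platform = nullptr;
    cl_device_id   device   = nullptr;
    if(clGetPlatformIDs(1, &platform, nullptr) != CL_SUCCESS ||
       clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr) != CL_SUCCESS)
    {
        std::printf("No OpenCL GPU device found\n");
        return 1;
    }

    // First call obtains the required buffer size, second call fills it
    size_t size = 0;
    clGetDeviceInfo(device, CL_DEVICE_EXTENSIONS, 0, nullptr, &size);
    std::vector<char> extensions(size);
    clGetDeviceInfo(device, CL_DEVICE_EXTENSIONS, size, extensions.data(), nullptr);

    const bool has_fp16 = std::strstr(extensions.data(), "cl_khr_fp16") != nullptr;
    std::printf("cl_khr_fp16 %s\n", has_fp16 ? "supported" : "not supported");
    return 0;
}
```

The same check applies to the other extensions mentioned in this section (e.g. \a cl_khr_int64_base_atomics) by changing the string passed to strstr.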
Use of the @ref CLMeanStdDev function requires 64-bit atomics support, thus \a cl_khr_int64_base_atomics should be supported in order to use it.

@subsubsection S3_6_2_cl_performance_requirements Performance improvements

Integer dot product built-in function extensions (and therefore optimized kernels) are available with Mali OpenCL DDK r22p0 and above for the following GPUs: G71, G76. The relevant extensions are \a cl_arm_integer_dot_product_int8, \a cl_arm_integer_dot_product_accumulate_int8 and \a cl_arm_integer_dot_product_accumulate_int16.

OpenCL kernel level debugging can be simplified with the use of printf; this requires the \a cl_arm_printf extension to be supported.

SVM allocations are supported for all the underlying allocations in Compute Library. To enable this, OpenCL 2.0 and above is a requirement.

@subsection S3_7_cl_tuner OpenCL Tuner

The OpenCL tuner, a.k.a. CLTuner, is a module of Arm Compute Library that can improve the performance of the OpenCL kernels by tuning the Local-Workgroup-Size (LWS).
The optimal LWS for each unique OpenCL kernel configuration is stored in a table. This table can be either imported from or exported to a file.
The OpenCL tuner runs the same OpenCL kernel for a range of local workgroup sizes and keeps the local workgroup size of the fastest run to use in subsequent calls to the kernel. It supports three modes of tuning with different trade-offs between the time taken to tune and the kernel execution time achieved using the best LWS found. In the Exhaustive mode, it searches all the supported values of LWS. This mode takes the longest time to tune and is the most likely to find the optimal LWS. Normal mode searches a subset of LWS values to yield a good approximation of the optimal LWS. It takes less time to tune than Exhaustive mode. Rapid mode takes the shortest time to tune and finds an LWS value that is at least as good as or better than the default LWS value. The mode affects only the search for the optimal LWS and has no effect when the LWS value is imported from a file.
In order for the performance numbers to be meaningful you must disable the GPU power management and set it to a fixed frequency for the entire duration of the tuning phase.

If you wish to know more about LWS and its important role in improving GPU cache utilization, we suggest having a look at the presentation "Even Faster CNNs: Exploring the New Class of Winograd Algorithms" available at the following link:

https://www.embedded-vision.com/platinum-members/arm/embedded-vision-training/videos/pages/may-2018-embedded-vision-summit-iodice

Tuning a network from scratch can take a long time and can considerably affect the execution time of the first run of your network. It is therefore recommended to store the CLTuner's results in a file to amortize this time when you either re-use the same network or functions with the same configurations. The tuning is performed only once for each OpenCL kernel.

CLTuner looks for the optimal LWS for each unique OpenCL kernel configuration. Since a function (i.e. Convolution Layer, Pooling Layer, Fully Connected Layer ...) can be called multiple times but with different parameters, we associate an "id" (called "config_id") to each kernel to distinguish the unique configurations.
 #Example: 2 unique Matrix Multiply configurations
@code{.cpp}
    TensorShape a0 = TensorShape(32,32);
    TensorShape b0 = TensorShape(32,32);
    TensorShape c0 = TensorShape(32,32);
    TensorShape a1 = TensorShape(64,64);
    TensorShape b1 = TensorShape(64,64);
    TensorShape c1 = TensorShape(64,64);

    CLTensor a0_tensor;
    CLTensor b0_tensor;
    CLTensor c0_tensor;
    CLTensor a1_tensor;
    CLTensor b1_tensor;
    CLTensor c1_tensor;

    a0_tensor.allocator()->init(TensorInfo(a0, 1, DataType::F32));
    b0_tensor.allocator()->init(TensorInfo(b0, 1, DataType::F32));
    c0_tensor.allocator()->init(TensorInfo(c0, 1, DataType::F32));
    a1_tensor.allocator()->init(TensorInfo(a1, 1, DataType::F32));
    b1_tensor.allocator()->init(TensorInfo(b1, 1, DataType::F32));
    c1_tensor.allocator()->init(TensorInfo(c1, 1, DataType::F32));

    CLGEMM gemm0;
    CLGEMM gemm1;

    // Configuration 0: 32x32 matrices
    gemm0.configure(&a0_tensor, &b0_tensor, nullptr, &c0_tensor, 1.0f, 0.0f);

    // Configuration 1: 64x64 matrices
    gemm1.configure(&a1_tensor, &b1_tensor, nullptr, &c1_tensor, 1.0f, 0.0f);
@endcode

@subsubsection S3_7_1_cl_tuner_how_to How to use it

All the graph examples in the Compute Library's "examples" folder and the arm_compute_benchmark accept an argument to enable the OpenCL tuner and an argument to export/import the LWS values to/from a file:

    #Enable CL tuner
    ./graph_mobilenet --enable-tuner --target=CL
    ./arm_compute_benchmark --enable-tuner

    #Export/Import to/from a file
    ./graph_mobilenet --enable-tuner --target=CL --tuner-file=acl_tuner.csv
    ./arm_compute_benchmark --enable-tuner --tuner-file=acl_tuner.csv

If you are importing the CLTuner's results from a file, the new tuned LWS values will be appended to it.
Whether you are benchmarking the graph examples or the test cases in the arm_compute_benchmark, remember to:

 -# Disable the power management
 -# Keep the GPU frequency constant
 -# Run the network multiple times (i.e. 10).

If you are not using the graph API or the benchmark infrastructure you will need to manually pass a CLTuner object to CLScheduler before configuring any function.

@code{.cpp}
CLTuner tuner;

// Setup Scheduler
CLScheduler::get().default_init(&tuner);
@endcode

After the first run, the CLTuner's results can be exported to a file using the method "save_to_file()".
- tuner.save_to_file("results.csv");

This file can also be imported using the method "load_from_file("results.csv")".
- tuner.load_from_file("results.csv");
*/
} // namespace arm_compute