Platform: NVIDIA CUDA
  Device: NVIDIA A100-SXM4-40GB
    Driver version  : 515.48.07 (Linux x64)
    Compute units   : 108
    Clock frequency : 1410 MHz

    Global memory bandwidth (GBPS)
      float   : 1292.96
      float2  : 1377.07
      float4  : 1419.22
      float8  : 1443.80
      float16 : 1464.44

    Single-precision compute (GFLOPS)
      float   : 19352.89
      float2  : 19386.64
      float4  : 19367.97
      float8  : 19285.78
      float16 : 19110.61

    No half precision support! Skipped

    Double-precision compute (GFLOPS)
      double   : 9723.09
      double2  : 9707.90
      double4  : 9686.22
      double8  : 9636.75
      double16 : 9547.99

    Integer compute (GIOPS)
      int   : 19278.83
      int2  : 19335.19
      int4  : 19273.49
      int8  : 19357.90
      int16 : 19346.59

    Integer compute Fast 24bit (GIOPS)
      int   : 19289.10
      int2  : 19293.80
      int4  : 19278.51
      int8  : 19233.96
      int16 : 19053.93

    Transfer bandwidth (GBPS)
      enqueueWriteBuffer              : 14.54
      enqueueReadBuffer               : 13.04
      enqueueWriteBuffer non-blocking : 7.79
      enqueueReadBuffer non-blocking  : 7.21
      enqueueMapBuffer(for read)      : 20.03
        memcpy from mapped ptr        : 20.65
      enqueueUnmap(after write)       : 26.77
        memcpy to mapped ptr          : 20.86

    Kernel launch latency : 5.76 us

  Device: NVIDIA A100-SXM4-40GB
    Driver version  : 515.48.07 (Linux x64)
    Compute units   : 108
    Clock frequency : 1410 MHz

    Global memory bandwidth (GBPS)
      float   : 1293.90
      float2  : 1377.21
      float4  : 1419.86
      float8  : 1443.43
      float16 : 1464.50

    Single-precision compute (GFLOPS)
      float   : 19349.87
      float2  : 19381.94
      float4  : 19362.20
      float8  : 19280.28
      float16 : 19105.41

    No half precision support! Skipped

    Double-precision compute (GFLOPS)
      double   : 9716.58
      double2  : 9701.24
      double4  : 9680.46
      double8  : 9631.46
      double16 : 9543.48

    Integer compute (GIOPS)
      int   : 19275.95
      int2  : 19324.45
      int4  : 19268.48
      int8  : 19351.87
      int16 : 19343.15

    Integer compute Fast 24bit (GIOPS)
      int   : 19283.22
      int2  : 19287.28
      int4  : 19272.75
      int8  : 19230.13
      int16 : 19047.88

    Transfer bandwidth (GBPS)
      enqueueWriteBuffer              : 14.50
      enqueueReadBuffer               : 13.05
      enqueueWriteBuffer non-blocking : 7.71
      enqueueReadBuffer non-blocking  : 7.27
      enqueueMapBuffer(for read)      : 19.83
        memcpy from mapped ptr        : 19.54
      enqueueUnmap(after write)       : 26.77
        memcpy to mapped ptr          : 20.55

    Kernel launch latency : 5.65 us

  Device: NVIDIA A100-SXM4-40GB
    Driver version  : 515.48.07 (Linux x64)
    Compute units   : 108
    Clock frequency : 1410 MHz

    Global memory bandwidth (GBPS)
      float   : 1304.11
      float2  : 1376.87
      float4  : 1419.82
      float8  : 1444.07
      float16 : 1465.06

    Single-precision compute (GFLOPS)
      float   : 19350.41
      float2  : 19382.11
      float4  : 19363.12
      float8  : 19281.61
      float16 : 19108.25

    No half precision support! Skipped

    Double-precision compute (GFLOPS)
      double   : 9719.24
      double2  : 9704.38
      double4  : 9682.93
      double8  : 9633.92
      double16 : 9544.74

    Integer compute (GIOPS)
      int   : 19277.98
      int2  : 19332.19
      int4  : 19269.01
      int8  : 19352.73
      int16 : 19343.15

    Integer compute Fast 24bit (GIOPS)
      int   : 19283.32
      int2  : 19288.03
      int4  : 19273.28
      int8  : 19231.30
      int16 : 19048.40

    Transfer bandwidth (GBPS)
      enqueueWriteBuffer              : 14.37
      enqueueReadBuffer               : 13.13
      enqueueWriteBuffer non-blocking : 7.50
      enqueueReadBuffer non-blocking  : 6.90
      enqueueMapBuffer(for read)      : 19.81
        memcpy from mapped ptr        : 20.73
      enqueueUnmap(after write)       : 26.77
        memcpy to mapped ptr          : 20.62

    Kernel launch latency : 5.75 us

  Device: NVIDIA A100-SXM4-40GB
    Driver version  : 515.48.07 (Linux x64)
    Compute units   : 108
    Clock frequency : 1410 MHz

    Global memory bandwidth (GBPS)
      float   : 1303.89
      float2  : 1376.82
      float4  : 1419.15
      float8  : 1444.89
      float16 : 1465.04

    Single-precision compute (GFLOPS)
      float   : 19339.44
      float2  : 19388.10
      float4  : 19371.42
      float8  : 19289.58
      float16 : 19115.54

    No half precision support! Skipped

    Double-precision compute (GFLOPS)
      double   : 9724.83
      double2  : 9710.39
      double4  : 9689.79
      double8  : 9641.13
      double16 : 9552.76

    Integer compute (GIOPS)
      int   : 19285.03
      int2  : 19313.19
      int4  : 19286.42
      int8  : 19361.56
      int16 : 19347.78

    Integer compute Fast 24bit (GIOPS)
      int   : 19292.73
      int2  : 19297.12
      int4  : 19282.58
      int8  : 19238.22
      int16 : 19056.33

    Transfer bandwidth (GBPS)
      enqueueWriteBuffer              : 14.48
      enqueueReadBuffer               : 13.16
      enqueueWriteBuffer non-blocking : 7.18
      enqueueReadBuffer non-blocking  : 6.98
      enqueueMapBuffer(for read)      : 19.99
        memcpy from mapped ptr        : 19.35
      enqueueUnmap(after write)       : 26.77
        memcpy to mapped ptr          : 20.63

    Kernel launch latency : 5.70 us