Lines Matching refs:warp

82         ReduceData that is already reduced within a warp to a lane in the first
83 warp with minimal shared memory footprint. This is an essential step to
96 On the warp level, we have three versions of the algorithms:
110 algorithm being used here, is set to 0 to signify full warp reduction.
123 An illustration of this algorithm operating on a hypothetical 8-lane full-warp
162 located in a contiguous subset of threads in a warp starting from lane 0.
171 warp would be:
222 warp). This particular version of the shuffle intrinsic we use accepts only
231 where the first half of the (partial) warp is reduced with the second half
232 of the (partial) warp. This is because the mapping
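The halving pattern described above (reducing the first half of the warp with the second half via a shuffle-down, then repeating on the surviving half) can be sketched as a host-side simulation. This is not the runtime's actual device code; it models what each lane of a single warp would compute, with the indexed read standing in for what a shuffle-down intrinsic such as `__shfl_down_sync` would return on the device:

```cpp
#include <cstddef>
#include <vector>

// Simulates the halving full-warp reduction: at each step, lane i adds
// the value currently held by lane i + delta (the shuffle-down result),
// so the first half of the warp absorbs the second half. After
// log2(warpSize) steps, lane 0 holds the reduced value for the warp.
int warp_reduce_sum(std::vector<int> lanes) {
    const std::size_t warpSize = lanes.size();  // assumed a power of two
    for (std::size_t delta = warpSize / 2; delta > 0; delta /= 2) {
        for (std::size_t lane = 0; lane + delta < warpSize; ++lane) {
            // Reads lanes[lane + delta] before that slot is overwritten
            // in this pass, matching the parallel shuffle semantics.
            lanes[lane] += lanes[lane + delta];
        }
    }
    return lanes[0];  // the warp master's reduced value
}
```

For a hypothetical 8-lane warp holding 1..8, three halving steps leave 36 in lane 0.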
261 //full warp reduction
265 //partial warp reduction
270 //Gather all the reduced values from each warp
271 //to the first warp
277 //This is to reduce data gathered from each "warp master".
287 to various versions of the warp-reduction functions. It first reduces
288 ReduceData warp by warp; in the end, we end up with the number of
290 block. We then proceed to gather all such ReduceData to the first warp.
293 which copies data from each of the "warp master" (0th lane of each warp, where
294 a warp-reduced ReduceData is held) to the 0th warp. This step reduces (in a
295 mathematical sense) the problem of reduction across warp masters in a block to
296 the problem of warp reduction which we already have solutions to.
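The two-phase block reduction sketched in the lines above (reduce warp by warp, then copy each warp master's value to the first warp and warp-reduce again) can be modeled on the host as follows. Names and the toy warp width are illustrative assumptions, not the runtime's actual identifiers; the per-warp sums stand in for the shared-memory slots the warp masters would write:

```cpp
#include <cstddef>
#include <vector>

constexpr std::size_t kWarpSize = 8;  // toy warp width for the sketch

// Phase 1: reduce each warp so its lane-0 "warp master" holds the
// warp's sum. Phase 2: gather those masters into a small buffer
// (modeling shared memory read by the first warp) and reduce it,
// turning cross-warp reduction back into a plain warp reduction.
int reduce_block(const std::vector<int>& block) {
    std::size_t numWarps = (block.size() + kWarpSize - 1) / kWarpSize;
    std::vector<int> shared(numWarps, 0);  // one slot per warp master
    for (std::size_t w = 0; w < numWarps; ++w)
        for (std::size_t lane = 0; lane < kWarpSize; ++lane) {
            std::size_t t = w * kWarpSize + lane;
            if (t < block.size()) shared[w] += block[t];  // warp reduce
        }
    int total = 0;  // first warp reduces the gathered master values
    for (std::size_t w = 0; w < numWarps; ++w) total += shared[w];
    return total;
}
```

The partial-warp case (a trailing warp with fewer active lanes) is handled here by the bounds check on `t`, mirroring why the runtime needs a separate partial-warp reduction path.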
425 a warp are active (i.e., number of threads in the parallel region is a
473 // done reducing to one value per warp, now reduce across warps