Workload descriptor format
==========================

ctx.engine.duration_us.dependency.wait,...
<uint>.<str>.<uint>[-<uint>]|*.<int <= 0>[/<int <= 0>][...].<0|1>,...
B.<uint>
M.<uint>.<str>[|<str>]...
P|S|X.<uint>.<int>
d|p|s|t|q|a|T.<int>,...
b.<uint>.<str>[|<str>].<str>
f

For duration a range can be given from which a random value will be picked
before every submit. Since this and seqno management require CPU access to
objects, care needs to be taken to ensure the submit queue is deep enough
that these operations do not affect the execution speed unless that is
desired.

Additional workload steps are also supported:

 'd' - Adds a delay (in microseconds).
 'p' - Adds a delay relative to the start of the previous loop so that each
       loop starts execution with a given period.
 's' - Synchronises the pipeline to a batch relative to the step.
 't' - Throttle every n batches.
 'q' - Throttle to n max queue depth.
 'f' - Create a sync fence.
 'a' - Advance the previously created sync fence.
 'B' - Turn on context load balancing.
 'b' - Set up engine bonds.
 'M' - Set up engine map.
 'P' - Context priority.
 'S' - Context SSEU configuration.
 'T' - Terminate an infinite batch.
 'X' - Context preemption control.

Engine ids: DEFAULT, RCS, BCS, VCS, VCS1, VCS2, VECS

Example (leading spaces must not be present in the actual file):
----------------------------------------------------------------

  1.VCS1.3000.0.1
  1.RCS.500-1000.-1.0
  1.RCS.3700.0.0
  1.RCS.1000.-2.0
  1.VCS2.2300.-2.0
  1.RCS.4700.-1.0
  1.VCS2.600.-1.1
  p.16000

The above workload described in human language works like this:

  1. A batch is sent to the VCS1 engine which will be executing for 3ms on the
     GPU and userspace will wait until it is finished before proceeding.
  2-4. Now three batches are sent to RCS with durations of 0.5-1ms (random
     duration range), 3.7ms and 1ms respectively.
     The first batch has a data dependency on the preceding VCS1 batch, and
     the last of the group depends on the first from the group.
  5. Now a 2.3ms batch is sent to VCS2, with a data dependency on the 3.7ms
     RCS batch.
  6. This is followed by a 4.7ms RCS batch with a data dependency on the 2.3ms
     VCS2 batch.
  7. Then a 0.6ms VCS2 batch is sent depending on the previous RCS one. In the
     same step the tool is told to wait until the batch completes before
     proceeding.
  8. Finally the tool is told to wait long enough to ensure the next iteration
     starts 16ms after the previous one has started.

When workload descriptors are provided on the command line, commas must be used
instead of new lines.

Multiple dependencies can be given separated by forward slashes.

Example:

  1.VCS1.3000.0.1
  1.RCS.3700.0.0
  1.VCS2.2300.-1/-2.0

In this case the last step has a data dependency on both the first and second
steps.

Batch durations can also be specified as infinite by using the '*' in the
duration field. Such batches must be ended by the terminate command ('T'),
otherwise they will cause a GPU hang to be reported.

Sync (fd) fences
----------------

Sync fences are also supported as dependencies.

To use them put an "f<N>" token in the step dependency list. N is in this case
the same relative step offset to the dependee batch, but instead of the data
dependency an output fence will be emitted at the dependee step and passed in
as a dependency in the current step.

Example:

  1.VCS1.3000.0.0
  1.RCS.500-1000.-1/f-1.0

In this case the second step will have both a data dependency and a sync fence
dependency on the previous step.

Example:

  1.RCS.500-1000.0.0
  1.VCS1.3000.f-1.0
  1.VCS2.3000.f-2.0

VCS1 and VCS2 batches will have a sync fence dependency on the RCS batch.
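The relative dependency tokens shown above can be illustrated with a short sketch. This is not the actual gem_wsim parser; the function name and return shape are hypothetical, and it only handles the data-dependency and "f<N>" forms described so far:

```python
def parse_dependencies(field):
    """Split a descriptor dependency field such as "-1/f-2" into two lists
    of negative relative step offsets: plain data dependencies and sync
    fence ("f") dependencies.  A field of "0" means no dependency."""
    data_deps, fence_deps = [], []
    if field == "0":
        return data_deps, fence_deps
    for token in field.split("/"):
        if token.startswith("f"):
            # "f-1" -> output fence emitted at the step one before this one
            fence_deps.append(int(token[1:]))
        else:
            # "-1" -> data dependency on the previous step
            data_deps.append(int(token))
    return data_deps, fence_deps
```

For example, the field "-1/f-1" from the first sync fence example resolves to one data dependency and one fence dependency, both on the immediately preceding step.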
Example:

  1.RCS.500-1000.0.0
  f
  2.VCS1.3000.f-1.0
  2.VCS2.3000.f-2.0
  1.RCS.500-1000.0.1
  a.-4
  s.-4
  s.-4

VCS1 and VCS2 batches have an input sync fence dependency on the standalone
fence created at the second step. They are submitted ahead of time while still
not runnable. When the second RCS batch completes, the standalone fence is
signalled, which allows the two VCS batches to be executed. Finally we wait
until both VCS batches have completed before starting the (optional) next
iteration.

Submit fences
-------------

Submit fences are a type of input fence which are signalled when the
originating batch buffer is submitted to the GPU. (In contrast to normal sync
fences, which are signalled when the originating batch completes.)

Submit fences have identical syntax to sync fences, with the lower-case 's'
being used to select them. Eg:

  1.RCS.500-1000.0.0
  1.VCS1.3000.s-1.0
  1.VCS2.3000.s-2.0

Here VCS1 and VCS2 batches will only be submitted for execution once the RCS
batch enters the GPU.

Context priority
----------------

  P.1.-1
  1.RCS.1000.0.0
  P.2.1
  2.BCS.1000.-2.0

Context 1 is marked as low priority (-1) and then a batch buffer is submitted
against it. Context 2 is marked as high priority (1) and then a batch buffer
is submitted against it which depends on the batch from context 1.

The context priority command is executed at workload runtime and is valid until
overridden by another (optional) priority change on the same context. Actual
driver ioctls are executed only if the priority level has changed for the
context.

Context preemption control
--------------------------

  X.1.0
  1.RCS.1000.0.0
  X.1.500
  1.RCS.1000.0.0

Context 1 is marked as non-preemptable and a batch is sent against it.
The same context is then marked to have batches which can be preempted every
500us and another batch is submitted.

As with context priority, context preemption commands are valid until
optionally overridden by another preemption control change on the same context.

Engine maps
-----------

Engine maps are a per-context feature which changes the way engine selection is
done in the driver.

Example:

  M.1.VCS1|VCS2

This sets up context 1 with an engine map containing the VCS1 and VCS2 engines.
Submissions to this context can now only reference these two engines.

Engine maps can also be defined based on an engine class like VCS.

Example:

  M.1.VCS

This sets up the engine map to all available VCS class engines.

Context load balancing
----------------------

Context load balancing (aka Virtual Engine) is an i915 feature where the driver
will pick the best (most idle) engine to submit to, given the previously
configured engine map.

Example:

  B.1

This enables load balancing for context number one.

Engine bonds
------------

Engine bonds are extensions on load balanced contexts. They allow expressing
rules of engine selection between two co-operating contexts tied with submit
fences. In other words, the rule expression is telling the driver: "If you pick
this engine for context one, then you have to pick that engine for context
two".

Syntax is:

  b.<context>.<engine_list>.<master_engine>

Engine list is a list of one or more sibling engines separated by a pipe
character (eg. "VCS1|VCS2").

There can be multiple bonds tied to the same context.

Example:

  M.1.RCS|VECS
  B.1
  M.2.VCS1|VCS2
  B.2
  b.2.VCS1.RCS
  b.2.VCS2.VECS

This tells the driver that if it picked RCS for context one, it has to pick
VCS1 for context two.
And if it picked VECS for context one, it has to pick VCS2 for context two.

If we extend the above example with more workload directives:

  1.DEFAULT.1000.0.0
  2.DEFAULT.1000.s-1.0

We get a fully functional example where two batch buffers are submitted in a
load balanced fashion, telling the driver they should run simultaneously and
that valid engine pairs are either RCS + VCS1 (for the two contexts
respectively), or VECS + VCS2.

This can also be extended using sync fences to improve the chances of the first
submission not reaching the hardware after the second one. The second block
would then look like:

  f
  1.DEFAULT.1000.f-1.0
  2.DEFAULT.1000.s-1.0
  a.-3

Context SSEU configuration
--------------------------

  S.1.1
  1.RCS.1000.0.0
  S.2.-1
  2.RCS.1000.0.0

Context 1 is configured to run with one enabled slice (slice mask 1) and a
batch is submitted against it. Context 2 is configured to run with all slices
(this is the default so the command could also be omitted) and a batch is
submitted against it.

This shows the dynamic SSEU reconfiguration cost between two contexts competing
for the render engine.

A slice mask of -1 has the special meaning of "all slices". Otherwise any
integer can be specified as the slice mask, but beware that any value apart
from 1 and -1 can make the workload non-portable between different GPUs.
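The special -1 value can be understood as a sketch like the following. This is an illustration of the semantics only, not gem_wsim code; the function name and the idea of expanding against a known slice count are assumptions:

```python
def resolve_slice_mask(mask, num_slices):
    """Return the effective slice mask for a GPU with num_slices slices.

    A mask of -1 has the special meaning "all slices", so it expands to a
    bitmask with one bit set per available slice.  Any other value is used
    as-is, which is why masks other than 1 and -1 may not be portable
    between GPUs with different slice counts."""
    if mask == -1:
        return (1 << num_slices) - 1
    return mask
```

For instance, on a hypothetical three-slice part, "S.2.-1" would behave like an explicit slice mask of 0b111, while "S.1.1" enables only the first slice regardless of the part.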