# 'linalg' Dialect

[TOC]

## Rationale

<img width="90" align="left" alt="MLIR Codegen Flow" src="https://user-images.githubusercontent.com/10148468/73613629-c5586580-45c5-11ea-94b7-074aeea94c7b.png">

Linalg is designed to solve the High-level Hierarchical Optimization
(HHO box) in MLIR and to interoperate nicely within a
*Mixture Of Expert Compilers* environment (i.e. the *CGSel* box).

The [Rationale Document](../Rationale/RationaleLinalgDialect.md)
goes into significantly more design and architectural decision details.

## Set of Key Transformations<a name="key_transformations"></a>

The following key transformations have been central to driving the design of
Linalg. They are all implemented in terms of the properties of the
`linalg.generic` OpInterface and avoid the pitfall of relying on hardcoded
one-off op knowledge.

The textual form description of these transformations is left for future
work. Still, it is useful to at least list the key transformations that are
performed on the Linalg IR and that have influenced its design:
1. Progressive Buffer Allocation.
1. Parametric Tiling.
1. Promotion to Temporary Buffer in Fast Memory.
1. Tiled Producer-Consumer Fusion with Parametric Tile-And-Fuse.
1. Map to Parallel and Reduction Loops and Hardware.
1. Vectorization: Rewrite in Vector Form.
1. Lower to Loops (Affine, Generic, and Parallel).
1. Lower to Library Calls or Special Instructions, Intrinsics or ISA.
1. Partially Lower to Iterations Over a Finer-Grained Linalg Op.

## High-Level Description of Linalg Ops<a name="linalg_ops"></a>
Linalg takes at least some inspiration from all previously [listed prior
art](#prior_art). The design enables the definition of ***CustomOps*** with
generic properties that enable [key transformations](#key_transformations),
including lowering to scalar load/store and other operations or to external
library calls and intrinsics.

These ops can have ***either tensor or buffer operands***, subject to
[conventions and limitations](#tensors_and_buffers).

### Payload-Carrying Ops<a name="payload_ops"></a>
Linalg defines two payload-carrying operations that implement the [structured ops](
https://docs.google.com/presentation/d/1P-j1GrH6Q5gLBjao0afQ-GfvcAeF-QU4GXXeSy0eJ9I/edit#slide=id.p
) abstraction on tensors and buffers. This is architected as two generic operations
`linalg.generic` (resp. `linalg.indexed_generic`) that can express custom
operations with *index-free semantics* (resp. *indexing semantics*).
The properties of these generic ops are the result of applying the
guiding principles described in the [Rationale Document](../Rationale/RationaleLinalgDialect.md).
They are listed next, with a brief example and discussion for each.

#### Property 1: Input and Output Operands Define The Iteration Space<a name="prop1"></a>
A `linalg.generic` op fully *derives* the specification of its iteration space
from its operands.
The property enforces that a localized IR element (the op) *has* all the information
needed to synthesize the control-flow required to iterate over its operands,
according to their type. This notion of IR localization bears some resemblance
to [URUK](http://icps.u-strasbg.fr/~bastoul/research/papers/GVBCPST06-IJPP.pdf).

Consider the following fully specified `linalg.generic` example.
Here, the first operand is a `memref` of `f32` scalar elements that
has an ordinary identity layout, and the second one is a `memref` of
4-element vectors with a 2-strided, 1-offset layout.

```mlir
// File name: example1.mlir
#accesses = [
  affine_map<(m) -> (m)>,
  affine_map<(m) -> (m)>
]
#attrs = {
  args_in = 1,
  args_out = 1,
  indexing_maps = #accesses,
  iterator_types = ["parallel"]
}
// memory layouts
#identity = affine_map<(d0) -> (d0)>

func @example(%A: memref<?xf32, #identity>,
              %B: memref<?xvector<4xf32>, offset: 1, strides: [2]>) {
  linalg.generic #attrs %A, %B {
  ^bb0(%a: f32, %b: vector<4xf32>):
    %c = "some_compute"(%a, %b): (f32, vector<4xf32>) -> (vector<4xf32>)
    linalg.yield %c: vector<4xf32>
  } : memref<?xf32, #identity>, memref<?xvector<4xf32>, offset: 1, strides: [2]>
  return
}
```

The property "*Input and Output Operands Define The Iteration Space*" is
materialized by a lowering into a form that will resemble:

```mlir
// Run: mlir-opt example1.mlir -allow-unregistered-dialect -convert-linalg-to-loops
// This converted representation is in the `scf` dialect.
// Its syntax can be found here: https://mlir.llvm.org/docs/Dialects/SCFDialect/
#map0 = affine_map<(d0) -> (d0 * 2 + 1)>

func @example(%arg0: memref<?xf32>, %arg1: memref<?xvector<4xf32>, #map0>) {
  %c0 = constant 0 : index
  %c1 = constant 1 : index
  %0 = dim %arg0, %c0 : memref<?xf32>
  scf.for %arg2 = %c0 to %0 step %c1 {
    %1 = load %arg0[%arg2] : memref<?xf32>
    %2 = load %arg1[%arg2] : memref<?xvector<4xf32>, #map0>
    %3 = "some_compute"(%1, %2) : (f32, vector<4xf32>) -> vector<4xf32>
    store %3, %arg1[%arg2] : memref<?xvector<4xf32>, #map0>
  }
  return
}
```

The property participates in simplifying analyses and transformations. For
instance, it guarantees no out-of-bounds access can occur by construction
(assuming dynamic operand dimensions agree with each other, which is the
purpose of the `assert` runtime check).

Before lowering to loop form, loop induction variables and iterators are *not yet
materialized*. This is a necessary property if we want an abstraction that
works on both tensor values and buffers because ***values don’t escape
loops/nesting***.

The main implications are that:
1. The semantics of the ops are *restricted to operate on structured data
types*, on which we can define an iterator.
2. This does not model arbitrary code with side-effects.

We do not think these are serious limitations in practice because MLIR is all
about mixing different levels of abstractions in the same IR. As long as
Linalg can progressively lower to the next level of abstraction, it can also
be just bypassed for things that do not fit.

At the same time, conditioning op semantics on structured data types is a very
promising path towards extensibility to non-dense tensors as experience with
LIFT abstractions for
[sparse](https://www.lift-project.org/publications/2016/harries16sparse.pdf)
and [position-dependent
arrays](https://www.lift-project.org/publications/2019/pizzuti19positiondependentarrays.pdf),
as well as [TACO](http://tensor-compiler.org/), has shown.

#### Property 2: Reversible Mappings Between Control and Data Structures<a name="prop2"></a>
A `linalg.generic` *defines* the mapping between the iteration space (i.e. the
loops) and the data.

Consider the following fully specified `linalg.generic` example.
Here, the first `memref` is a 2-strided one on both of its dimensions,
and the second `memref` uses an identity layout.

```mlir
// File name: example2.mlir
#indexing_maps = [
  affine_map<(i, j) -> (j, i)>,
  affine_map<(i, j) -> (j)>
]
#attrs = {
  args_in = 1,
  args_out = 1,
  indexing_maps = #indexing_maps,
  iterator_types = ["parallel", "parallel"]
}

func @example(%A: memref<8x?xf32, offset: 0, strides: [2, 2]>,
              %B: memref<?xvector<4xf32>>) {
  linalg.generic #attrs %A, %B {
  ^bb0(%a: f32, %b: vector<4xf32>):
    %c = "some_compute"(%a, %b): (f32, vector<4xf32>) -> (vector<4xf32>)
    linalg.yield %c: vector<4xf32>
  }: memref<8x?xf32, offset: 0, strides: [2, 2]>, memref<?xvector<4xf32>>
  return
}
```

The property "*Reversible Mappings Between Control and Data Structures*" is
materialized by a lowering into a form that will resemble:
```mlir
// Run: mlir-opt example2.mlir -allow-unregistered-dialect -convert-linalg-to-loops
#map0 = affine_map<(d0, d1) -> (d0 * 2 + d1 * 2)>

func @example(%arg0: memref<8x?xf32, #map0>, %arg1: memref<?xvector<4xf32>>) {
  %c8 = constant 8 : index
  %c0 = constant 0 : index
  %c1 = constant 1 : index
  %0 = dim %arg0, %c1 : memref<8x?xf32, #map0>
  scf.for %arg2 = %c0 to %0 step %c1 {
    scf.for %arg3 = %c0 to %c8 step %c1 {
      %1 = load %arg0[%arg3, %arg2] : memref<8x?xf32, #map0>
      %2 = load %arg1[%arg3] : memref<?xvector<4xf32>>
      %3 = "some_compute"(%1, %2) : (f32, vector<4xf32>) -> vector<4xf32>
      store %3, %arg1[%arg3] : memref<?xvector<4xf32>>
    }
  }
  return
}
```

This mapping needs to be reversible because we want to be
able to go back and forth between the two and answer questions such as:
- Given a subset of the iteration space, what subset of data does it read and
write?
- Given a subset of data read or written, what subset of the iteration space
is responsible for this read or write?

Answering these two questions is one of the main analyses that Linalg uses to
implement transformations such as tiling, tiled producer-consumer fusion, and
promotion to temporary buffers in fast memory.
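
To make the two directions concrete, here is a minimal sketch (the names
`#forward` and `#inverse` are purely illustrative and not part of the dialect):
the forward map of example 2 sends an iteration-space point to the coordinates
it touches in `%A`, and its inverse answers the converse question.

```mlir
// Illustrative sketch only. With the indexing map of %A in example2:
//   iteration (i, j) reads %A[j, i];
//   conversely, element %A[p, q] is touched only by iteration (q, p).
#forward = affine_map<(i, j) -> (j, i)>
#inverse = affine_map<(p, q) -> (q, p)>
```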

In the current implementation, `linalg.generic` uses a list of [AffineMaps](https://mlir.llvm.org/docs/LangRef/#affinemap-attribute) (see the `#indexing_maps` attribute in the previous examples).
This is a pragmatic short-term solution, but in the longer term note that
this property could even be evaluated dynamically, similarly to
inspector-executor algorithms.

#### Property 3: The Type Of Iterators is Defined Explicitly<a name="prop3"></a>
A `linalg.generic` op fully *declares* the type of its iterators. This
information is used in transformations.

These properties are derived from established practice in the field and mirror
the properties from Ken Kennedy's [Optimizing Compilers for Modern Architectures](
https://www.elsevier.com/books/optimizing-compilers-for-modern-architectures/allen/978-0-08-051324-9).
The key idea of legality of loop transformations expressed by Kennedy is
that ***the lexicographic order of all dependence vectors must be
preserved***.

This can be better captured directly at the loop level thanks to specific
iterator types, among which:
*parallel*, *reduction*, *partition*, *permutable/monotonic*, *sequential*,
*dependence distance*, ...

These types are traditionally the result of complex dependence analyses and
have been referred to as "*bands*" in the polyhedral community (e.g. *parallel
bands*, *permutable bands*, etc., in
[ISL](https://en.wikipedia.org/wiki/Integer_set_library) schedule tree
parlance).

Specifying the information declaratively in a `linalg.generic` allows
conveying properties that may be hard (or even impossible) to derive from
lower-level information. These properties can be brought all the way to the
moment when they are useful for transformations, used and then discarded.

Additionally, these properties may also be viewed as a contract that the
frontend/user guarantees and that the compiler may take advantage of. The
common example is the use of data-dependent reduction semantics for
specifying histogram computations. If the frontend has additional knowledge
that proper atomic operations are available, it may be better to specify
parallel semantics and use the special atomic in the computation region.

At this time, Linalg only has an explicit use for *parallel* and *reduction*
loops, but previous experience shows that the abstraction generalizes.
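
As an illustrative sketch, written in the same pre-`ins`/`outs` generic form as
the examples above (treat the exact spelling as assumed rather than normative),
a dot-product-style computation declares its single loop as a `"reduction"`
iterator and accumulates into a 0-d output:

```mlir
// Hypothetical sketch: a 1-D dot-product-style reduction. The only iterator is
// declared "reduction" and the 0-d output %C is indexed by the empty affine map.
#dot_accesses = [
  affine_map<(i) -> (i)>,
  affine_map<(i) -> (i)>,
  affine_map<(i) -> ()>
]
#dot_attrs = {
  args_in = 2,
  args_out = 1,
  indexing_maps = #dot_accesses,
  iterator_types = ["reduction"]
}
func @dot(%A: memref<?xf32>, %B: memref<?xf32>, %C: memref<f32>) {
  linalg.generic #dot_attrs %A, %B, %C {
  ^bb0(%a: f32, %b: f32, %c: f32):
    %0 = mulf %a, %b : f32
    %1 = addf %c, %0 : f32
    linalg.yield %1 : f32
  }: memref<?xf32>, memref<?xf32>, memref<f32>
  return
}
```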

#### Property 4: The Compute Payload is Specified With a Region<a name="prop4"></a>
A `linalg.generic` op has a compute payload that is fully generic thanks to
the use of
[Regions](https://github.com/llvm/llvm-project/blob/58265ad42a90ae8905be6a447cb42e53529a54a0/mlir/docs/LangRef.md#regions).

The region takes as arguments the scalar elemental types of the tensor or
buffer operands of the `linalg.generic`. For flexibility and ability to match
library calls, additional special values may be passed. For instance, a
`linalg.fill` operation takes a buffer and an additional scalar value.
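
A minimal sketch of such a use could look as follows (the exact `linalg.fill`
syntax below is assumed for illustration):

```mlir
// Hypothetical sketch: filling a buffer with a scalar constant. The scalar %cst
// is an extra value passed to the op in addition to the buffer being written.
func @fill_with_zero(%buf: memref<?x?xf32>) {
  %cst = constant 0.0 : f32
  linalg.fill(%buf, %cst) : memref<?x?xf32>, f32
  return
}
```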

At this time there are no additional restrictions to the region
semantics. This is meant to allow the exploration of various design tradeoffs
at the intersection of regions and iterator types.
In particular, the frontend is responsible for ensuring that the iterator types
are consistent with the operations inside the region: the region can capture
buffers arbitrarily and write into them. If this conflicts with some parallel
iterator requirement, this is undefined behavior.

Previous examples already elaborate compute payloads with an unregistered function `"some_compute"`. The following code snippet shows what the result will be when using a concrete operation `addf`:
```mlir
// File name: example3.mlir
#indexing_maps = [
  affine_map<(i, j) -> (i, j)>,
  affine_map<(i, j) -> (i, j)>,
  affine_map<(i, j) -> (i, j)>
]
#attrs = {
  args_in = 2,
  args_out = 1,
  indexing_maps = #indexing_maps,
  iterator_types = ["parallel", "parallel"]
}
func @example(%A: memref<?x?xf32>, %B: memref<?x?xf32>, %C: memref<?x?xf32>) {
  linalg.generic #attrs %A, %B, %C {
  ^bb0(%a: f32, %b: f32, %c: f32):
    %d = addf %a, %b : f32
    linalg.yield %d : f32
  }: memref<?x?xf32>, memref<?x?xf32>, memref<?x?xf32>
  return
}
```

This function element-wise adds two matrices (`%A` and `%B`) and stores the result into a third one (`%C`).

The property "*The Compute Payload is Specified With a Region*" is
materialized by a lowering into a form that will resemble:
```mlir
// Run: mlir-opt example3.mlir -convert-linalg-to-loops
func @example(%arg0: memref<?x?xf32>, %arg1: memref<?x?xf32>, %arg2: memref<?x?xf32>) {
  %c0 = constant 0 : index
  %c1 = constant 1 : index
  %0 = dim %arg0, %c0 : memref<?x?xf32>
  %1 = dim %arg0, %c1 : memref<?x?xf32>
  scf.for %arg3 = %c0 to %0 step %c1 {
    scf.for %arg4 = %c0 to %1 step %c1 {
      %2 = load %arg0[%arg3, %arg4] : memref<?x?xf32>
      %3 = load %arg1[%arg3, %arg4] : memref<?x?xf32>
      %4 = addf %2, %3 : f32
      store %4, %arg2[%arg3, %arg4] : memref<?x?xf32>
    }
  }
  return
}
```

In the process of lowering to loops and lower-level constructs, similar
requirements are encountered, as discussed in the [inlined call op
proposal](https://llvm.discourse.group/t/introduce-std-inlined-call-op-proposal/282/2).
We expect to be able to reuse the common lower-level infrastructure provided
it evolves to support both region arguments and captures.

#### Property 5: May Map To an External Library Call<a name="prop5"></a>
A `linalg.generic` op may map to an external library call by specifying a
`SymbolAttr`. At this level of abstraction, the important glue is the ability
to perform transformations that preserve the structure necessary to ***call
the external library after different transformations have been applied***.

This involves considerations related to preservation of op semantics
and integration at the ABI level. Regardless of whether one wants to use
external library calls or a custom ISA, the problem for codegen is similar:
preservation of a fixed granularity.

Consider the following example that adds an additional attribute `library_call="pointwise_add"`
that specifies the name of an external library call we intend to use:
```mlir
// File name: example4.mlir
#indexing_maps = [
  affine_map<(i, j) -> (i, j)>,
  affine_map<(i, j) -> (i, j)>,
  affine_map<(i, j) -> (i, j)>
]
#attrs = {
  args_in = 2,
  args_out = 1,
  indexing_maps = #indexing_maps,
  iterator_types = ["parallel", "parallel"],
  library_call = "pointwise_add"
}
func @example(%A: memref<?x?xf32>, %B: memref<?x?xf32>, %C: memref<?x?xf32>) {
  linalg.generic #attrs %A, %B, %C {
  ^bb0(%a: f32, %b: f32, %c: f32):
    %d = addf %a, %b : f32
    linalg.yield %d : f32
  }: memref<?x?xf32>, memref<?x?xf32>, memref<?x?xf32>
  return
}
```

The property "*Map To an External Library Call*" is
materialized by a lowering into a form that will resemble:

```mlir
// Run: mlir-opt example4.mlir -convert-linalg-to-std
// Note that we lower the Linalg dialect directly to the Standard dialect.
// See this doc: https://mlir.llvm.org/docs/Dialects/Standard/

#map0 = affine_map<(d0, d1)[s0, s1, s2] -> (d0 * s1 + s0 + d1 * s2)>

func @example(%arg0: memref<?x?xf32>, %arg1: memref<?x?xf32>, %arg2: memref<?x?xf32>) {
  %0 = memref_cast %arg0 : memref<?x?xf32> to memref<?x?xf32, #map0>
  %1 = memref_cast %arg1 : memref<?x?xf32> to memref<?x?xf32, #map0>
  %2 = memref_cast %arg2 : memref<?x?xf32> to memref<?x?xf32, #map0>
  call @pointwise_add(%0, %1, %2) : (memref<?x?xf32, #map0>, memref<?x?xf32, #map0>, memref<?x?xf32, #map0>) -> ()
  return
}
func @pointwise_add(memref<?x?xf32, #map0>, memref<?x?xf32, #map0>, memref<?x?xf32, #map0>) attributes {llvm.emit_c_interface}
```

Which, after lowering to LLVM, resembles:
```mlir
// Run: mlir-opt example4.mlir -convert-linalg-to-std | mlir-opt -convert-std-to-llvm
// Some generated code is omitted here.
func @example(%arg0: !llvm<"float*">, ...) {
  ...
  llvm.call @pointwise_add(...) : (!llvm<"float*">, ...) -> ()
  return
}

llvm.func @pointwise_add(%arg0: !llvm<"float*">, ...) attributes {llvm.emit_c_interface} {
  ...
  llvm.call @_mlir_ciface_pointwise_add(%9, %19, %29) : (!llvm<"{ float*, float*, i64, [2 x i64], [2 x i64] }*">, !llvm<"{ float*, float*, i64, [2 x i64], [2 x i64] }*">, !llvm<"{ float*, float*, i64, [2 x i64], [2 x i64] }*">) -> ()
  llvm.return
}
llvm.func @_mlir_ciface_pointwise_add(!llvm<"{ float*, float*, i64, [2 x i64], [2 x i64] }*">, !llvm<"{ float*, float*, i64, [2 x i64], [2 x i64] }*">, !llvm<"{ float*, float*, i64, [2 x i64], [2 x i64] }*">) attributes {llvm.emit_c_interface}
```

##### Convention For External Library Interoperability
The `linalg` dialect adopts a convention that is similar to `BLAS` when
offloading operations to fast library implementations: pass a non-owning
pointer to input and output data with additional metadata. This convention
is also found in libraries such as `MKL`, `OpenBLAS`, `BLIS`, `cuBLAS`,
`cuDNN`, etc., and more generally at interface points across language
boundaries (e.g. C++ / Python).

Generally, `linalg` passes non-owning pointers to View data structures
to pre-compiled library calls linked externally.

There is an [ongoing
discussion](https://llvm.discourse.group/t/lowering-optional-attributes-in-linalg-structuredops-to-standard-dialect/333/3)
on the topic of extending interoperability in the presence of key attributes.

#### Property 6: Perfectly Nested Writes To The Whole Output Operands<a name="prop6"></a>
Perfectly nested loops form a particularly important class of structure that
enables key loop transformations such as tiling and mapping to library calls.
Unfortunately, this type of structure is easily broken by transformations such
as partial loop fusion. Tiling and mapping to library calls become more
challenging, or even infeasible. Linalg ops adopt perfect-nestedness
as a first-class property: the structure cannot be broken and is
transported in the IR by construction.

A `linalg.generic` op represents a perfectly nested loop nest that writes the
entire memory region. This is a structural constraint across regions and
loops that has proven to be key in simplifying transformations.

One particular point to mention is that converting imperfectly nested code
into perfectly nested code can often be done with enough loop distribution
and embedding of conditionals down to the innermost loop level.
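
As a schematic illustration (hypothetical code, not the output of any Linalg
pass; the unregistered ops `"init_row"` and `"compute"` stand in for arbitrary
work), a statement that only applies to one iteration of the inner loop can be
sunk under a conditional so that the surrounding nest becomes perfect:

```mlir
// Before: imperfectly nested, with per-row work between the two loops.
func @imperfect(%n: index) {
  %c0 = constant 0 : index
  %c1 = constant 1 : index
  scf.for %i = %c0 to %n step %c1 {
    "init_row"(%i) : (index) -> ()
    scf.for %j = %c0 to %n step %c1 {
      "compute"(%i, %j) : (index, index) -> ()
    }
  }
  return
}

// After: perfectly nested, with the per-row work predicated on %j == 0.
func @perfect(%n: index) {
  %c0 = constant 0 : index
  %c1 = constant 1 : index
  scf.for %i = %c0 to %n step %c1 {
    scf.for %j = %c0 to %n step %c1 {
      %is_first = cmpi "eq", %j, %c0 : index
      scf.if %is_first {
        "init_row"(%i) : (index) -> ()
      }
      "compute"(%i, %j) : (index, index) -> ()
    }
  }
  return
}
```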

Previous experience with Tensor Comprehensions gave us the intuition that
forcing innermost control-flow nesting is a lot like writing data-parallel
code with arrays of boolean values and predication.
This type of trick has also been used before in polyhedral compilers to
convert non-affine control into affine compute dependencies.

While it may be possible to automate such rewrites from generic IR,
`linalg.generic` just forces the semantics for now.

The key implication is that this conversion to deep predication needs to be
undone once we are done with Linalg transformations.
After iterators and induction variables are materialized (i.e. after lowering
out of `linalg.generic` has occurred), the overall performance will be greatly
influenced by the quality of canonicalizations, foldings, and *Loop-Invariant
Code Motion* (LICM).

In the grander scheme, the reliance on late LICM was deemed a necessary risk.

#### Putting it Together<a name="summary"></a>
As it stands, the six properties above define the semantics of a
`linalg.generic` op. It is an open question whether all of these semantics are
strictly necessary in practice and whether some should or could be derived
automatically while still maintaining the [core guiding
principles](#guiding_principles).

For the time being, we have settled on the combination of these properties
because of empirical evidence from building and working on multiple high-level
compilers. As we lay those down and engage more with the community, we expect
multiple rounds of discussions and design changes to the original architecture.

### Tensors and Buffers: Conventions and Limitations<a name="tensors_and_buffers"></a>

Tensors are immutable SSA values; buffers are mutable regions of memory subject
to side-effects and aliasing. As a consequence, output buffers are passed as
operands whereas output tensors are new SSA values corresponding to op results.
Inputs can be arbitrary tensors or buffers and are always passed as operands.

The following convention is currently in-flight and is in the process of
replacing other existing conventions. It currently applies to "named" structured
ops, which are auto-generated by the linalg-ods tool.

The convention adopted is as follows:

1.  A first block of `ins` op operands holds read-only inputs of ShapedType.
2.  An optional second block of `outs` op operands holds read-write output
    buffers of MemRefType.
3.  An optional third block of `init` operands holds initialization tensors of
    RankedTensorType. Such tensors can appear when the op performs a reduction
    and returns a tensor.

Structured ops with fully parallel semantics have an empty `init`. They may either
write in-place into `outs` buffers or return new tensors.

Structured ops with reduction semantics and output tensor(s) however have
additional restrictions:

1.  They can only return a single tensor for now.
2.  They cannot have any output buffer operand (i.e. `outs` is empty).
3.  They have exactly one `init` tensor of the same type as the unique output
    tensor. Such an `init` tensor does not have an explicitly associated indexing
    map. Instead, the map of the result tensor is used to signify that the `init`
    and the `result` are "tied".

Points 1. and 2. keep the complexity of the representation in check by allowing
only a single result tensor when reductions are present.

Point 3. is related to the fact that SSA values cannot represent in-place
updates. Instead, linalg adopts a convention similar to the one in e.g.
`vector.outerproduct`: the value that is reduced into is passed as an explicit
argument and a new result of the same shape is produced.

It is expected that buffer allocation will fold this last input onto the result
in a single output buffer argument, which is why the same indexing map is
required: the last input operand is said to be "tied" to the result.

Alternative, more complex representations would allow for:

1.  Multiple results and `init` tensors in arbitrary orders, which could be
    captured by an extra ArrayAttr of position pairs.
2.  Relaxing the conditions on the indexing map equalities on each pair and,
    e.g., allowing implicit broadcasts of the input.

These representations are deemed unnecessarily complex for now and are left for
future discussion.

As an illustration, the syntax for a `linalg.matmul` writing into a buffer is:

```mlir
linalg.matmul ins(%a, %b : memref<?x?xf32>, tensor<?x?xf32>)
             outs(%c : memref<?x?xf32>)
```

whereas the syntax for a `linalg.matmul` returning a new tensor is:

```mlir
%d = linalg.matmul ins(%a, %b : tensor<?x?xf32>, memref<?x?xf32>)
                  init(%c : tensor<?x?xf32>)
                    -> tensor<?x?xf32>
```

### Data Representation: Views<a name="views"></a>
The current implementation uses the [Strided MemRef (a.k.a View)](
https://groups.google.com/a/tensorflow.org/forum/#!topic/mlir/MaL8m2nXuio)
abstraction. The name *View* is used interchangeably in `linalg` to signify
*Strided MemRef*.
In the future we expect to use other structured data types and
support ragged, mixed-sparse and other types. We expect to draw on the
experience from existing LIFT abstractions for
[sparse](https://www.lift-project.org/publications/2016/harries16sparse.pdf)
and [position-dependent
arrays](https://www.lift-project.org/publications/2019/pizzuti19positiondependentarrays.pdf).

### Metadata Ops<a name="metadata_ops"></a>
A set of ops that manipulate metadata but do not move memory. These ops take
`view` operands + extra attributes and return new `view`s. The returned
`view`s generally alias the operand `view`. At the moment the existing ops
are:

    * `std.view`,
    * `std.subview`,
    * `std.transpose`,
    * `linalg.range`,
    * `linalg.slice`,
    * `linalg.reshape`.

Future ops are added on a per-need basis but should include:

    * `linalg.tile`,
    * `linalg.intersection`,
    * `linalg.convex_union`,
    * `linalg.difference` (would need to work on a list of views).

These additional operations correspond to abstractions that have been known to
work in the field of large-scale distributed stencil computations.

In a longer-term future, the abstractions from the [Legion data-centric
programming model](https://legion.stanford.edu/overview/) seem generally
appealing.

### Named Payload-Carrying Ops<a name="named_ops"></a>
Additionally, `linalg` provides a small subset of commonly named operations:

    * `linalg.copy`,
    * `linalg.fill`,
    * `linalg.dot`,
    * `linalg.matmul`,
    * `linalg.conv`.

These named operations adhere to the `linalg.generic` op interface. Work is in
progress to define declarative mechanisms to automatically generate named ops
from a description in terms of only the generic op interface.

This is the main reason there are only a small number of ops today: we expect
them to be auto-generated from Tablegen soon.

### Named Payload Ops Specification

Linalg provides a declarative specification and a generation tool
(`mlir-linalg-ods-gen`) to automatically produce named ops from a notation that
is inspired by Einstein notation.

The syntax and semantics used in `mlir-linalg-ods-gen` are very much in flight
and borrow from Tensor Comprehensions (TC) but differ in a few dimensions, to
better adapt to Linalg:

1.  The input and output tensor parameters are specified as `id :
    type(symbolic-affine-expression-list)` (e.g. `A : f32(M, N + M)`) and each
    new symbol is discovered eagerly. TC on the other hand does not allow
    general symbolic affine expressions.
1.  The output shapes are specified explicitly; in TC they are always derived
    from the input shapes.
1.  The operations used to specify computations use EDSC intrinsics so that they
    can easily be parsed and emitted into a simple region builder without
    resorting to more general MLIR parsing.
1.  Reduction dimensions are specified with angle bracket notation on the
    operation they apply to (e.g. `std_add<k>` specifies that `k` is a reduction
    dimension). In TC, a reduction is specified with the `op=` operator and the
    reduction dimensions are inferred.
1.  The parallel and reduction dimensions are ordered by textual program
    order. For instance, in the comprehension `O(i, j) = std_add<k, l>(...)`,
    `i` (resp. `j`) is a parallel iterator encoded by an affine dimension of
    position `0` (resp. `1`); `k` (resp. `l`) is a reduction iterator encoded by
    an affine dimension of position `2` (resp. `3`).

These decisions and syntax are subject to evolution and change. In particular,
op-specific attributes, dynamic ranks, some form of templating, shape
calculation function specification, etc. may be added in the future.

At this time, the following restrictions are imposed on the syntax and
semantics:

1.  Each def may only contain a single comprehension but each comprehension may
    perform multiple updates.
2.  Each tensor may only be used with a single indexing expression.

The following specification may be used to define a named `batchmatmul` op:

```
def batchmatmul(A: f32(Batch, M, K), B: f32(K, N)) -> (C: f32(Batch, M, N)) {
  C(b, m, n) = std_addf<k>(std_mulf(A(b, m, k), B(k, n)));
}
```

When `mlir-linalg-ods-gen -gen-ods-decl=1` is called, the following ODS is
produced:

```
def batchmatmulOp : LinalgNamedStructured_Op<"batchmatmul", [
  NInputs<2>,
  NOutputs<1>,
  NamedStructuredOpTrait]> { ... }
```

When `mlir-linalg-ods-gen -gen-impl=1` is called, the following C++ is produced:

```
llvm::Optional<SmallVector<StringRef, 8>> batchmatmul::referenceIterators() {
  return SmallVector<StringRef, 8>{
    getParallelIteratorTypeName(),
    getParallelIteratorTypeName(),
    getParallelIteratorTypeName(),
    getReductionIteratorTypeName() };
}
llvm::Optional<SmallVector<AffineMap, 8>> batchmatmul::referenceIndexingMaps() {
  MLIRContext *context = getContext();
  AffineExpr d0, d1, d2, d3;
  bindDims(context, d0, d1, d2, d3);
  return SmallVector<AffineMap, 8>{
      AffineMap::get(4, 0, {d0, d1, d3}),
      AffineMap::get(4, 0, {d3, d2}),
      AffineMap::get(4, 0, {d0, d1, d2}) };
}
void batchmatmul::regionBuilder(ArrayRef<BlockArgument> args) {
  using namespace edsc;
  using namespace intrinsics;
  Value _0(args[0]), _1(args[1]), _2(args[2]);
  Value _4 = std_mulf(_0, _1);
  Value _5 = std_addf(_2, _4);
  (linalg_yield(ValueRange{ _5 }));
}
```

## Open Issues and Design Alternatives<a name="open_issues"></a>
Multiple open issues and design alternatives are in flight and it is time to
lay them out for the community to discuss and pick apart:
1. Should `linalg.generic` support nesting?
1. Should `linalg.generic` regions take views or only scalars?
1. Should we try to solve automatic differentiation at this level of
abstraction?
1. Are all six properties really necessary?
1. Is this relying too much on declarative specification and would we be
better off relying more on analyses?
1. Is this general enough for the community's needs? If not, how should this be
extended, if at all?
...

These key questions (and many more) should really be thought of in the general
context of MLIR, in which different levels of IR interoperate seamlessly. In
practice, it is not necessary (or beneficial) to try to solve all problems in the
same IR.

## Operations

[include "Dialects/LinalgOps.md"]
