Text Properties API
=============
A variety of (internationally correct) text processing requires knowing the *properties* of Unicode characters.
For example, where in a string are the word boundaries (needed for line-breaking), and which runs need to be ordered
right-to-left or left-to-right?

We propose a batch call that **characterizes** the code-points in a string. The method will return an array
of bitfields, each packed in a 32-bit unsigned long, containing the results for all of the requested options.

## Functional Requirement

One measure of the value/completeness of the API is the following:

*For sophisticated apps or frameworks (e.g. Flutter or Lottie) that need...*
- text shaping
- line breaking
- word and grapheme boundaries

Certainly this API could include **more** than is strictly required for those use cases, but it is important that it include **at least** enough to allow them to function without increasing their (WASM) download size
by having to include a copy of ICU (or its equivalent).

## Ergonomics

Alongside the Functional Requirement above, another driver for the shape of the API is efficiency, especially when called by **WASM** clients. There is a real cost for each JS <--> WASM call, more than the equivalent
sequence between JS and the browser.
- Minimize the number of calls needed for a block of text
- Prefer homogeneous arrays to sequences of objects

Given this, implementations are encouraged to use a **Uint32Array** typed array for the result.

```WebIDL
// Bulk call to characterize the code-points in a string.
// This can return a number of different properties per code-point, so to maximize performance,
// it only computes the properties that were requested (see optional boolean request fields).
//
interface TextProperties {
    const unsigned long BidiLevelMask  = 31;        // 0..31 bidi level

    const unsigned long GraphemeBreak  = 1 << 5;
    const unsigned long IntraWordBreak = 1 << 6;
    const unsigned long WordBreak      = 1 << 7;
    const unsigned long SoftLineBreak  = 1 << 8;
    const unsigned long HardLineBreak  = 1 << 9;

    const unsigned long IsControl      = 1 << 10;
    const unsigned long IsSpace        = 1 << 11;
    const unsigned long IsWhiteSpace   = 1 << 12;

    attribute boolean bidiLevel?;
    attribute boolean graphemeBreak?;
    attribute boolean wordBreak?;   // returns Word and IntraWord break properties
    attribute boolean lineBreak?;   // returns Soft and Hard linebreak properties

    attribute boolean isControl?;
    attribute boolean isSpace?;
    attribute boolean isWhiteSpace?;

    // Returns an array the same length as the input string. Each returned value contains the
    // bitfield results for the corresponding code-point in the string. For surrogate pairs
    // in the input, the results will be in the first output value, and the 2nd output value
    // will be zero.
    //
    // Bitfields that are currently unused, or which correspond to an Option attribute that
    // was not requested, will be set to zero.
    //
    sequence<unsigned long> characterize(DOMString inputString,
                                         DOMString bcp47?);
};
```

## Example

```js
// Options to request. `properties` stands in for a TextProperties object
// with these attributes set.
const properties = {
    isWhiteSpace: true,
    lineBreak:    true,
};

const text = "Because I could not stop for Death\nHe kindly stopped for me";

const results = properties.characterize(text);

// expected results
//
// results[7,9,15,19,24,28,37,44,52,56] --> IsWhiteSpace | SoftLineBreak
// results[34]                          --> HardLineBreak
```

## Related

Some facilities for characterizing Unicode already exist, either as part of ECMAScript or the Web API. See [Intl.Segmenter](https://github.com/tc39/proposal-intl-segmenter). This
proposal acknowledges these, but suggests that any potential overlap in functionality is OK,
given the design constraint spelled out in the [Ergonomics](#Ergonomics) section.

Similar to the contrast between canvas2d and webgl, this proposal seeks to provide very efficient,
lower-level access to Unicode properties, specifically for sophisticated (possibly native code ported to WASM)
frameworks and apps. It is not intended to replace existing facilities (e.g. Segmenter), but rather
to offer an alternative interface more suited to high-performance clients.

We also propose a higher-level interface specifically aimed at [Text Shaping](text_overview.md).

## Contributors:
 [mikerreed](https://github.com/mikerreed)
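
As a closing illustration of the bitfield format described in the interface above, here is a minimal sketch of how a client might unpack the values returned by `characterize()`. The constants mirror the proposed `TextProperties` bit layout; the helper names `decodeProperties` and `indicesWith` are hypothetical and not part of the proposal, and the sample array is hand-built rather than produced by a real implementation.

```js
// Bit layout mirrored from the proposed TextProperties constants.
const BidiLevelMask  = 31;       // low 5 bits hold the bidi level (0..31)
const GraphemeBreak  = 1 << 5;
const IntraWordBreak = 1 << 6;
const WordBreak      = 1 << 7;
const SoftLineBreak  = 1 << 8;
const HardLineBreak  = 1 << 9;
const IsControl      = 1 << 10;
const IsSpace        = 1 << 11;
const IsWhiteSpace   = 1 << 12;

// Hypothetical helper: unpack one 32-bit result value into a plain object.
function decodeProperties(value) {
    return {
        bidiLevel:      value & BidiLevelMask,
        graphemeBreak:  (value & GraphemeBreak) !== 0,
        intraWordBreak: (value & IntraWordBreak) !== 0,
        wordBreak:      (value & WordBreak) !== 0,
        softLineBreak:  (value & SoftLineBreak) !== 0,
        hardLineBreak:  (value & HardLineBreak) !== 0,
        isControl:      (value & IsControl) !== 0,
        isSpace:        (value & IsSpace) !== 0,
        isWhiteSpace:   (value & IsWhiteSpace) !== 0,
    };
}

// Hypothetical helper: indices whose value has every bit of `mask` set.
function indicesWith(results, mask) {
    const out = [];
    for (let i = 0; i < results.length; i++) {
        if ((results[i] & mask) === mask) out.push(i);
    }
    return out;
}

// Hand-built stand-in for a characterize() result (assumed values only).
const sample = new Uint32Array(5);
sample[1] = IsWhiteSpace | SoftLineBreak;
sample[3] = HardLineBreak;

console.log(indicesWith(sample, SoftLineBreak).join(","));   // 1
console.log(decodeProperties(sample[3]).hardLineBreak);      // true
```

Scanning the flat array with a mask, rather than decoding every entry into an object, keeps the per-code-point work to a single bitwise test, which fits the efficiency goal stated in the Ergonomics section.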