<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Design Discussion</title>
<link rel="stylesheet" href="../../../doc/src/boostbook.css" type="text/css">
<meta name="generator" content="DocBook XSL Stylesheets V1.79.1">
<link rel="home" href="../index.html" title="The Boost C++ Libraries BoostBook Documentation Subset">
<link rel="up" href="../program_options.html" title="Chapter 30. Boost.Program_options">
<link rel="prev" href="howto.html" title="How To">
<link rel="next" href="s06.html" title="Acknowledgements">
</head>
<body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF">
<table cellpadding="2" width="100%"><tr>
<td valign="top"><img alt="Boost C++ Libraries" width="277" height="86" src="../../../boost.png"></td>
<td align="center"><a href="../../../index.html">Home</a></td>
<td align="center"><a href="../../../libs/libraries.htm">Libraries</a></td>
<td align="center"><a href="http://www.boost.org/users/people.html">People</a></td>
<td align="center"><a href="http://www.boost.org/users/faq.html">FAQ</a></td>
<td align="center"><a href="../../../more/index.htm">More</a></td>
</tr></table>
<hr>
<div class="spirit-nav">
<a accesskey="p" href="howto.html"><img src="../../../doc/src/images/prev.png" alt="Prev"></a><a accesskey="u" href="../program_options.html"><img src="../../../doc/src/images/up.png" alt="Up"></a><a accesskey="h" href="../index.html"><img src="../../../doc/src/images/home.png" alt="Home"></a><a accesskey="n" href="s06.html"><img src="../../../doc/src/images/next.png" alt="Next"></a>
</div>
<div class="section">
<div class="titlepage"><div><div><h2 class="title" style="clear: both">
<a name="program_options.design"></a>Design Discussion</h2></div></div></div>
<div class="toc"><dl class="toc"><dt><span class="section"><a
href="design.html#program_options.design.unicode">Unicode Support</a></span></dt></dl></div>
<p>This section focuses on some of the design questions.
</p>
<div class="section">
<div class="titlepage"><div><div><h3 class="title">
<a name="program_options.design.unicode"></a>Unicode Support</h3></div></div></div>
<p>Unicode support was one of the features specifically requested
during the formal review. Throughout this document, "Unicode support" is
a synonym for "wchar_t" support, assuming that "wchar_t" always uses
Unicode encoding. Also, when we talk about "ascii" (in lowercase), we do
not mean strict 7-bit ASCII encoding, but rather "char" strings in the local
8-bit encoding.
</p>
<p>
Generally, "Unicode support" can mean
many things, but for the program_options library it means that:
</p>
<div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; ">
<li class="listitem"><p>Each parser should accept either <code class="computeroutput">char*</code>
or <code class="computeroutput">wchar_t*</code>, correctly split the input into option
names and option values, and return the data.
</p></li>
<li class="listitem"><p>For each option, it should be possible to specify whether the conversion
from string to value uses ascii or Unicode.
</p></li>
<li class="listitem">
<p>The library guarantees that:
</p>
<div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: circle; ">
<li class="listitem"><p>ascii input is passed to an ascii value without change
</p></li>
<li class="listitem"><p>Unicode input is passed to a Unicode value without change</p></li>
<li class="listitem"><p>ascii input passed to a Unicode value, and Unicode input
passed to an ascii value, will be converted using a codecvt
facet (which may be specified by the user).
</p></li>
</ul></div>
<p>
</p>
</li>
</ul></div>
<p>
</p>
<p>The important point is that it's possible to have some "ascii
options" together with "Unicode options". There are two reasons for
this. First, for a given type you might not have the code to extract the
value from a Unicode string, and it's not good to require that such code be written.
Second, imagine a reusable library which has some options and exposes
its options description in its interface. If <span class="emphasis"><em>all</em></span>
options must be either ascii or Unicode, and the library does not use any
Unicode strings, then the author is likely to use ascii options, making
the library unusable inside Unicode
applications. Essentially, it would be necessary to provide two versions
of the library -- ascii and Unicode.
</p>
<p>Another important point is that ascii strings are passed through
without modification. In other words, it's not possible to just convert
ascii to Unicode and process the Unicode further. The problem is that the
default conversion mechanism -- the <code class="computeroutput">codecvt</code> facet -- might
not work with 8-bit input without additional setup.
</p>
<p>The Unicode support outlined above is not complete. For example, we
don't support Unicode option names. Unicode support is hard and
requires a Boost-wide solution. Even comparing two arbitrary Unicode
strings is non-trivial. Finally, using Unicode in option names is
related to internationalization, which has its own
complexities. For example, if option names depend on the current locale, then all
parts of the program, and any other components which use those names, must be
internationalized too.
</p>
<p>The primary question in implementing the Unicode support is whether
to use templates and <code class="computeroutput">std::basic_string</code> or to use some
internal encoding and convert between internal and external encodings on
the interface boundaries.
</p>
<p>The choice, mostly, is between code size and execution
speed. A templated solution would either link library code into every
application that uses the library (thereby making a shared library
impossible), or provide explicit instantiations in the shared library
(increasing its size). The solution based on an internal encoding would
necessarily perform conversions in a number of places and would be somewhat slower.
Since speed is generally not an issue for this library, the second
solution looks more attractive, but we'll take a closer look at
individual components.
</p>
<p>For the parsers component, we have three choices:
</p>
<div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; ">
<li class="listitem"><p>Use a fully templated implementation: given a string of a
certain type, a parser will return a <code class="computeroutput"><a class="link" href="reference.html#boost.program_options.parsed_options">parsed_options</a></code> instance
with strings of the same type (i.e. the <code class="computeroutput"><a class="link" href="reference.html#boost.program_options.parsed_options">parsed_options</a></code> class
will be templated).</p></li>
<li class="listitem"><p>Use an internal encoding: same as above, but strings will be converted to and
from the internal encoding.</p></li>
<li class="listitem"><p>Use and partly expose the internal encoding: same as above,
but the strings in the <code class="computeroutput"><a class="link" href="reference.html#boost.program_options.parsed_options">parsed_options</a></code> instance will be in the
internal encoding.
This might avoid a conversion if the
<code class="computeroutput"><a class="link" href="reference.html#boost.program_options.parsed_options">parsed_options</a></code> instance is passed directly to other components,
but can also be dangerous or confusing for a user.
</p></li>
</ul></div>
<p>
</p>
<p>The second solution appears to be the best -- it does not increase
the code size much and is cleaner than the third. To avoid extra
conversions, the Unicode version of <code class="computeroutput"><a class="link" href="reference.html#boost.program_options.parsed_options">parsed_options</a></code> can also store
strings in the internal encoding.
</p>
<p>For the options descriptions component, we don't have much
choice. Since it's not desirable to have either all options use ascii or all
of them use Unicode, but rather to have some ascii and some Unicode options, the
interface of <code class="computeroutput"><a class="link" href="../boost/program_options/value_semantic.html" title="Class value_semantic">value_semantic</a></code> must work with both. The only way is
to pass an additional flag telling whether strings use ascii or the internal encoding.
The instance of <code class="computeroutput"><a class="link" href="../boost/program_options/value_semantic.html" title="Class value_semantic">value_semantic</a></code> can then convert into some
other encoding if needed.
</p>
<p>For the storage component, the only affected function is <code class="computeroutput"><a class="link" href="../boost/program_options/store_1_3_31_9_11_1_1_5.html" title="Function store">store</a></code>.
For Unicode input, the <code class="computeroutput"><a class="link" href="../boost/program_options/store_1_3_31_9_11_1_1_5.html" title="Function store">store</a></code> function should convert the value to the
internal encoding.
It should also inform the <code class="computeroutput"><a class="link" href="../boost/program_options/value_semantic.html" title="Class value_semantic">value_semantic</a></code> class
about the encoding used.
</p>
<p>Finally, what internal encoding should we use? The
alternatives are
<code class="computeroutput">std::wstring</code> (using UCS-4 encoding) and
<code class="computeroutput">std::string</code> (using UTF-8 encoding). The differences between the
alternatives are:
</p>
<div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; ">
<li class="listitem"><p>Speed: UTF-8 is a bit slower</p></li>
<li class="listitem"><p>Space: UTF-8 takes less space when input is ascii</p></li>
<li class="listitem"><p>Code size: UTF-8 requires additional conversion code. However,
it allows one to use existing parsers without converting them to
<code class="computeroutput">std::wstring</code>, and such a conversion is likely to create a
number of new instantiations.
</p></li>
</ul></div>
<p>
There's no clear leader, but the last point seems important, so UTF-8
will be used.
</p>
<p>Choosing the UTF-8 encoding allows the use of existing parsers,
because 7-bit ascii characters retain their values in UTF-8,
so searching for 7-bit strings is simple. However, there are
two subtle issues:
</p>
<div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; ">
<li class="listitem"><p>We need to assume that character literals use ascii encoding
and that inputs use Unicode encoding.</p></li>
<li class="listitem"><p>A Unicode character (say '=') can be followed by a 'composing
character', and the combination is not the same as just '=', so a
simple search for '=' might find the wrong character.
</p></li>
</ul></div>
<p>
Neither of these issues appears to be critical in practice, since ascii is
an almost universal encoding and since composing characters following '=' (and
other characters with special meaning to the library) are not likely to appear.
</p>
</div>
</div>
<table xmlns:rev="http://www.cs.rpi.edu/~gregod/boost/tools/doc/revision" width="100%"><tr>
<td align="left"></td>
<td align="right"><div class="copyright-footer">Copyright © 2002-2004 Vladimir Prus<p>Distributed under the Boost Software License, Version 1.0.
(See accompanying file <code class="filename">LICENSE_1_0.txt</code> or copy at
<a href="http://www.boost.org/LICENSE_1_0.txt" target="_top">http://www.boost.org/LICENSE_1_0.txt</a>)
</p>
</div></td>
</tr></table>
<hr>
<div class="spirit-nav">
<a accesskey="p" href="howto.html"><img src="../../../doc/src/images/prev.png" alt="Prev"></a><a accesskey="u" href="../program_options.html"><img src="../../../doc/src/images/up.png" alt="Up"></a><a accesskey="h" href="../index.html"><img src="../../../doc/src/images/home.png" alt="Home"></a><a accesskey="n" href="s06.html"><img src="../../../doc/src/images/next.png" alt="Next"></a>
</div>
</body>
</html>