//
//  Copyright (c) 2009-2011 Artyom Beilis (Tonkikh)
//  Copyright (c) 2019-2020 Alexander Grund
//
//  Distributed under the Boost Software License, Version 1.0. (See
//  accompanying file LICENSE or copy at
//  http://www.boost.org/LICENSE_1_0.txt)
//


/*!

\mainpage Boost.Nowide

\ref changelog_page

Table of Contents:

- \ref main
- \ref main_rationale
    - \ref main_the_problem
    - \ref main_the_solution
    - \ref main_wide
    - \ref alternative
    - \ref main_reading
- \ref using
    - \ref using_standard
    - \ref using_custom
    - \ref using_windows_h
    - \ref using_integration
- \ref technical
    - \ref technical_imple
    - \ref technical_cio
- \ref qna
- \ref standalone_version
- \ref sources

\section main What is Boost.Nowide

Boost.Nowide is a library originally implemented by Artyom Beilis
that makes cross-platform, Unicode-aware programming easier.

The library provides implementations of standard C and C++ library
functions whose inputs are UTF-8 aware on Windows, without
requiring the use of the Wide API.
On non-Windows/POSIX platforms the standard library equivalents are aliased instead,
so no conversion is performed there, as UTF-8 is commonly used already.


Hence you can use the Boost.Nowide functions with the same names as their
std counterparts with narrow strings on all platforms and have them just work.


\section main_rationale Rationale
\subsection main_the_problem The Problem

Consider a simple application that splits a big file into chunks so that
they can be sent by e-mail. It requires a few very simple tasks:

- Access command line arguments: <code>int main(int argc,char **argv)</code>
- Open an input file, open several output files: <code>std::fstream::open(const char*,std::ios::openmode m)</code>
- Remove the files in case of fault: <code>std::remove(const char* file)</code>
- Print a progress report onto the console: <code>std::cout << file_name </code>

Unfortunately it is impossible to implement this simple task in plain C++
if the file names contain non-ASCII characters.

A simple program that uses these APIs would work on systems that use UTF-8
internally -- the vast majority of Unix-like operating systems: Linux, Mac OS X,
Solaris, BSD. But it would fail on files like <code>War and Peace - Война и мир - מלחמה ושלום.zip</code>
under Microsoft Windows because the native Windows Unicode-aware API is the Wide API -- UTF-16.

This seemingly trivial task is very hard to implement in a cross-platform manner.

\subsection main_the_solution The Solution

Boost.Nowide provides a set of standard library functions that are UTF-8 aware on Windows
and make Unicode-aware programming easier.

The library provides:

- Easy to use functions for converting UTF-8 to/from UTF-16
- A class to make the \c argc, \c argv and \c env parameters of \c main use UTF-8
- UTF-8 aware functions
    - \c cstdio functions:
        - \c fopen
        - \c freopen
        - \c remove
        - \c rename
    - \c cstdlib functions:
        - \c system
        - \c getenv
        - \c setenv
        - \c unsetenv
        - \c putenv
    - \c fstream
        - \c filebuf
        - \c fstream/ofstream/ifstream
    - \c iostream
        - \c cout
        - \c cerr
        - \c clog
        - \c cin

All these functions are available in Boost.Nowide in headers of the same name.
So instead of including \c cstdio and using \c std::fopen
you simply include \c boost/nowide/cstdio.hpp and use \c boost::nowide::fopen.
The functions accept the same arguments as their \c std counterparts;
in fact, on non-Windows builds they are just aliases for those.
But on Windows Boost.Nowide does its magic: the narrow string arguments are
interpreted as UTF-8, converted to wide strings (UTF-16) and passed to the wide
API, which handles non-ASCII characters correctly.

If the passed string contains invalid UTF-8 sequences, the conversion
replaces them with a replacement character (default: \c U+FFFD), similar to
what the NT kernel does.
This means invalid UTF-8 sequences will not roundtrip from narrow->wide->narrow,
resulting in e.g. failure to open a file if the filename is ill-formed.
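
The replacement policy can be illustrated with a small standalone decoder.
This is a sketch only and not the Boost.Nowide implementation: the function name \c utf8_to_utf16_lossy is hypothetical, and the choice to emit one \c U+FFFD per offending byte and resume at the next byte is an assumption of this sketch, so the library may produce a different number of replacement characters for the same input.

```cpp
#include <cstddef>
#include <string>

// Illustrative only, NOT Boost.Nowide's code: decode UTF-8 to UTF-16 and
// substitute U+FFFD for anything that is not well-formed UTF-8.
std::u16string utf8_to_utf16_lossy(const std::string& in)
{
    const char16_t replacement = 0xFFFD;
    std::u16string out;
    std::size_t i = 0;
    while(i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::size_t len = 0; // length of the sequence announced by the lead byte
        char32_t cp = 0;     // code point decoded so far
        if(lead < 0x80) { len = 1; cp = lead; }
        else if(lead >= 0xC2 && lead <= 0xDF) { len = 2; cp = lead & 0x1F; }
        else if(lead >= 0xE0 && lead <= 0xEF) { len = 3; cp = lead & 0x0F; }
        else if(lead >= 0xF0 && lead <= 0xF4) { len = 4; cp = lead & 0x07; }
        else { out += replacement; ++i; continue; } // stray continuation or invalid lead

        bool valid = i + len <= in.size(); // truncated sequences are invalid
        for(std::size_t k = 1; valid && k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if((c & 0xC0) != 0x80)
                valid = false; // not a continuation byte
            else
                cp = (cp << 6) | (c & 0x3F);
        }
        // Reject overlong encodings, surrogates and values beyond U+10FFFF
        if(valid && ((len == 3 && cp < 0x800) || (len == 4 && cp < 0x10000)
                     || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)))
            valid = false;
        if(!valid) { out += replacement; ++i; continue; }

        if(cp < 0x10000) {
            out += static_cast<char16_t>(cp);
        } else { // encode as a surrogate pair
            cp -= 0x10000;
            out += static_cast<char16_t>(0xD800 + (cp >> 10));
            out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        }
        i += len;
    }
    return out;
}
```

Because the original bytes of an invalid sequence are discarded, converting the result back to UTF-8 yields the encoding of \c U+FFFD rather than the input, which is why such a name no longer matches the file it came from.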

\subsection main_wide Why Not Narrow and Wide?

Why not provide both Wide and Narrow implementations so the
developer can choose to use Wide characters on Unix-like platforms?

Several reasons:

- \c wchar_t is not really portable, it can be 2 bytes, 4 bytes or even 1 byte making Unicode aware programming harder
- The C and C++ standard libraries use narrow strings for OS interactions. This library follows the same general rule. There is
  no such thing as <code>fopen(const wchar_t*, const wchar_t*)</code> in the standard library, so it is better
  to stick to the standards rather than re-implement Wide API in "Microsoft Windows Style"
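
The portability issue of \c wchar_t can be observed with a few lines of standard C++.
The function name \c wide_units_for_emoji is hypothetical; the sketch counts how many \c wchar_t units the compiler needs for a single code point outside the Basic Multilingual Plane.

```cpp
#include <cstddef>
#include <string>

// Number of wchar_t units used for the single code point U+1F600.
// With a 16-bit wchar_t (Windows) the literal is stored as a surrogate pair,
// with a 32-bit wchar_t (typical on Linux) it is stored as one unit.
std::size_t wide_units_for_emoji()
{
    const std::wstring s = L"\U0001F600";
    return s.size();
}
```

Code that indexes or measures wide strings therefore behaves differently across platforms, whereas a UTF-8 encoded narrow string has the same byte layout everywhere.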


\subsection alternative Alternatives

Since the May 2019 update (version 1903), Windows 10 supports UTF-8 for narrow strings via a manifest file.
Setting "UTF-8" as the active code page allows using the narrow API with UTF-8 encoded strings without any other changes.
See <a href="https://docs.microsoft.com/en-us/windows/uwp/design/globalizing/use-utf8-code-page">the documentation</a> for details.

Since April 2018 there is a (Beta) function available in Windows 10 to use UTF-8 code pages by default via a user setting.

Both methods work but have a major drawback: they are not fully reliable for the app developer.
The code page via manifest method falls back to a legacy code page when a Windows version older than 1903 is used.
Hence it is only usable if the targeted system is Windows 10 after May 2019.
The second method relies on user interaction prior to starting the program.
Obviously this is not reliable when expecting only UTF-8 in the code.

Hence under some circumstances (and hopefully always at some point in the future) this library will not be required and even Windows I/O can be used with UTF-8 encoded text.

\subsection main_reading Further Reading

- <a href="http://www.utf8everywhere.org/">www.utf8everywhere.org</a>
- <a href="http://alfps.wordpress.com/2011/11/22/unicode-part-1-windows-console-io-approaches/">Windows console I/O approaches</a>

\section using Using The Library
\subsection using_standard Standard Features

As a developer you are expected to use \c boost::nowide functions instead of the functions available in the
\c std namespace.

For example, here is a Unicode unaware implementation of a line counter:
\code
#include <fstream>
#include <iostream>

int main(int argc,char **argv)
{
    if(argc!=2) {
        std::cerr << "Usage: file_name" << std::endl;
        return 1;
    }

    std::ifstream f(argv[1]);
    if(!f) {
        std::cerr << "Can't open " << argv[1] << std::endl;
        return 1;
    }
    int total_lines = 0;
    while(f) {
        if(f.get() == '\n')
            total_lines++;
    }
    f.close();
    std::cout << "File " << argv[1] << " has " << total_lines << " lines" << std::endl;
    return 0;
}
\endcode

To make this program handle Unicode properly, we make the following changes:

\code
#include <boost/nowide/args.hpp>
#include <boost/nowide/fstream.hpp>
#include <boost/nowide/iostream.hpp>

int main(int argc,char **argv)
{
    boost::nowide::args a(argc,argv); // Fix arguments - make them UTF-8
    if(argc!=2) {
        boost::nowide::cerr << "Usage: file_name" << std::endl; // Unicode aware console
        return 1;
    }

    boost::nowide::ifstream f(argv[1]); // argv[1] - is UTF-8
    if(!f) {
        // the console can display UTF-8
        boost::nowide::cerr << "Can't open " << argv[1] << std::endl;
        return 1;
    }
    int total_lines = 0;
    while(f) {
        if(f.get() == '\n')
            total_lines++;
    }
    f.close();
    // the console can display UTF-8
    boost::nowide::cout << "File " << argv[1] << " has " << total_lines << " lines" << std::endl;
    return 0;
}
\endcode

This very simple and straightforward approach helps in writing Unicode-aware programs.

Note the use of \c boost::nowide::args, \c boost::nowide::ifstream and \c boost::nowide::cerr/cout.
On non-Windows platforms these do nothing special, but on Windows the following happens:

- \c boost::nowide::args retrieves UTF-16 arguments from the Windows API, converts them to UTF-8,
and temporarily replaces the original \c argv (and optionally \c env) with pointers to those internally stored
UTF-8 strings for the lifetime of the instance.
- \c boost::nowide::ifstream converts the passed filename (which is now valid UTF-8) to UTF-16
and calls the Windows Wide API to open the file stream which can then be used as usual.
- Similarly \c boost::nowide::cerr and \c boost::nowide::cout use an underlying stream buffer
that converts the UTF-8 string to UTF-16 and uses another Wide API function to write it to the console.

\subsection using_custom Custom API

Of course, this simple set of functions does not cover all needs. If you need
to access Wide API from a Windows application that uses UTF-8 internally you can use
the functions \c boost::nowide::widen and \c boost::nowide::narrow.

For example:
\code
CopyFileW(  boost::nowide::widen(existing_file).c_str(),
            boost::nowide::widen(new_file).c_str(),
            TRUE);
\endcode

The conversion is done at the last stage, and you continue using UTF-8
strings everywhere else. You only switch to the Wide API at glue points.

\c boost::nowide::widen returns \c std::wstring. Sometimes
it is useful to prevent allocation and use on-stack buffers
instead. Boost.Nowide provides the \c boost::nowide::basic_stackstring
class for this purpose.

The example above could be rewritten as:

\code
boost::nowide::basic_stackstring<wchar_t,char,64> wexisting_file(existing_file), wnew_file(new_file);
CopyFileW(wexisting_file.get(),wnew_file.get(),TRUE);
\endcode

\note There are a few convenience typedefs: \c stackstring and \c wstackstring using
256-character buffers, and \c short_stackstring and \c wshort_stackstring using 16-character
buffers. If the string is longer, they fall back to heap memory allocation.

\subsection using_windows_h The windows.h header

The library does not include \c windows.h in order to prevent namespace pollution with numerous
defines and types. Instead, the library defines the prototypes of the Win32 API functions.

However, you may request the use of the \c windows.h header by defining \c BOOST_USE_WINDOWS_H
before including any of the Boost.Nowide headers.

\subsection using_integration Integration with Boost.Filesystem

Boost.Filesystem supports selection of narrow encoding.
Unfortunately the default narrow encoding on Windows isn't UTF-8.
But you can enable UTF-8 as the default encoding of Boost.Filesystem by calling
`boost::nowide::nowide_filesystem()` at the beginning of your program, which
imbues a locale with a UTF-8 conversion facet to convert between \c char and \c wchar_t.
This interprets all narrow strings passed to and from \c boost::filesystem::path as UTF-8
when converting them to wide strings (as required for internal storage).
On POSIX this usually has no effect, as no conversion is done due to narrow strings being
used as the storage format.


\section technical Technical Details
\subsection technical_imple Windows vs POSIX

For Microsoft Windows, the library provides UTF-8 aware variants of some \c std:: functions in the \c boost::nowide namespace.
For example, \c std::fopen becomes \c boost::nowide::fopen.

On POSIX platforms, the functions in \c boost::nowide are aliases of their standard library counterparts:

\code
namespace boost {
namespace nowide {
#ifdef BOOST_WINDOWS
inline FILE *fopen(const char* name, const char* mode)
{
    ...
}
#else
using std::fopen;
#endif
} // nowide
} // boost
\endcode

There is also a \c std::filebuf compatible implementation provided for Windows which supports UTF-8 file paths
for \c open and otherwise behaves identically (API-wise).

On all systems the \c std::fstream class and friends are provided as custom implementations supporting
\c std::string and \c \*\::filesystem::path as well as \c wchar_t\* (Windows only) overloads for the
constructor and \c open.
This is done so users can use e.g. \c boost::filesystem::path with \c boost::nowide::fstream without
depending on C++17 support.
Furthermore any path-like class is supported if it matches the interface of \c std::filesystem::path "enough".

Note that there is no universal support for \c path and \c std::string in \c boost::nowide::filebuf.
This is because the std variant is used on non-Windows systems, which might be faster in some cases.
As \c filebuf is rarely used by user code directly but rather indirectly through \c fstream, not having string or
path support seems a small price to pay, especially as C++11 adds \c std::string support, C++17 adds \c path support,
and usage via \c string_or_path.c_str() is still possible and portable.

\subsection technical_cio Console I/O

Console I/O is implemented as a wrapper around ReadConsoleW/WriteConsoleW when the stream goes to the "real" console.
When the stream is piped/redirected, the standard \c cin/cout is used instead.

This approach eliminates the need for manual code page handling.
If TrueType fonts are used, Unicode-aware input and output work as intended.

\section qna Q & A

<b>Q: What happens to invalid UTF passed through Boost.Nowide? For example Windows using UCS-2 instead of UTF-16.</b>

A: The policy of Boost.Nowide is to always yield valid UTF encoded strings.
So invalid UTF characters are replaced by the replacement character \c U+FFFD.

This happens in both directions:\n
When passing a (presumably) UTF-8 encoded string to Boost.Nowide it will convert it to UTF-16 and replace every invalid character before passing it to the OS.\n
On retrieval of a value from the OS (e.g. \c boost::nowide::getenv or command line arguments through \c boost::nowide::args) the value is assumed to be UTF-16 and converted to UTF-8 replacing any invalid character.

This means that if one somehow manages to create an invalid UTF-16 filename in Windows it will be **impossible** to handle it with Boost.Nowide.
But as Microsoft switched from UCS-2 (aka strings with arbitrary 2 Byte values) to UTF-16 in Windows 2000 it won't be a problem in most environments.

<b>Q: What kind of error reporting is used?</b>

A: There are in fact 3:

- Invalid UTF encoded strings are handled by replacing invalid chars by the replacement character U+FFFD
- API calls mirroring the standard API use the same error reporting as their standard counterparts, e.g. by returning a non-zero value on failure
- Non-continuable errors are reported by standard exceptions. The main example is failure to get the command line parameters via the WinAPI

<b>Q: Why doesn't the library convert the string to/from the locale's encoding (instead of UTF-8) on POSIX systems?</b>

A: It is inherently incorrect to convert strings to/from locale encodings on POSIX platforms.

You can create a file named "\xFF\xFF.txt" (invalid UTF-8), remove it, pass its name as a parameter to a program
and it would work whether the current locale is UTF-8 or not.
Also, changing the locale from let's say \c en_US.UTF-8 to \c en_US.ISO-8859-1 would not magically change all
files in the OS or the strings a user may pass to the program (which is different on Windows).

POSIX OSs treat strings as null-terminated cookies, i.e. opaque byte sequences.

So altering their content according to the locale would actually lead to incorrect behavior.

For example, this is a naive implementation of the standard program "rm":

\code
#include <cstdio>

int main(int argc,char **argv)
{
    for(int i=1;i<argc;i++)
        std::remove(argv[i]);
    return 0;
}
\endcode

It would work with ANY locale and changing the strings would lead to incorrect behavior.

The meaning of a locale under POSIX and Windows platforms is different and has very different effects.

\section standalone_version Standalone Version

It is possible to use the Nowide library without having the huge Boost project as a dependency. There is a standalone version that has all the functionality in the \c nowide namespace instead of \c boost::nowide. The line counter example from \ref using_standard would look like this:

\code
#include <nowide/args.hpp>
#include <nowide/fstream.hpp>
#include <nowide/iostream.hpp>

int main(int argc,char **argv)
{
    nowide::args a(argc,argv); // Fix arguments - make them UTF-8
    if(argc!=2) {
        nowide::cerr << "Usage: file_name" << std::endl; // Unicode aware console
        return 1;
    }

    nowide::ifstream f(argv[1]); // argv[1] - is UTF-8
    if(!f) {
        // the console can display UTF-8
        nowide::cerr << "Can't open a file " << argv[1] << std::endl;
        return 1;
    }
    int total_lines = 0;
    while(f) {
        if(f.get() == '\n')
            total_lines++;
    }
    f.close();
    // the console can display UTF-8
    nowide::cout << "File " << argv[1] << " has " << total_lines << " lines" << std::endl;
    return 0;
}
\endcode

\section sources Sources and Downloads

The upstream sources can be found at GitHub: <a href="https://github.com/boostorg/nowide">https://github.com/boostorg/nowide</a>

You can download the latest sources there:

- Standard Version: <a href="https://github.com/boostorg/nowide/archive/master.zip">nowide-master.zip</a>

*/
