MBRTOC16(3C)            Standard C Library Functions            MBRTOC16(3C)
NAME
     mbrtoc16, mbrtoc32, mbrtowc, mbrtowc_l - convert characters to wide
     characters
SYNOPSIS
     #include <wchar.h>

     size_t
     mbrtowc(wchar_t *restrict pwc, const char *restrict str, size_t len,
         mbstate_t *restrict ps);

     #include <wchar.h>
     #include <xlocale.h>

     size_t
     mbrtowc_l(wchar_t *restrict pwc, const char *restrict str, size_t len,
         mbstate_t *restrict ps, locale_t loc);

     #include <uchar.h>

     size_t
     mbrtoc16(char16_t *restrict p16c, const char *restrict str, size_t len,
         mbstate_t *restrict ps);

     size_t
     mbrtoc32(char32_t *restrict p32c, const char *restrict str, size_t len,
         mbstate_t *restrict ps);
DESCRIPTION
     The mbrtoc16(), mbrtoc32(), mbrtowc(), and mbrtowc_l() functions
     convert character sequences, which may contain multi-byte characters,
     into different character formats.  The functions work in the following
     formats:

     mbrtoc16()
                A UTF-16 code sequence, where every code point is
                represented by one or two char16_t values.  The UTF-16
                encoding represents certain Unicode code points as a pair of
                16-bit values, commonly referred to as a surrogate pair.

     mbrtoc32()
                A UTF-32 code sequence, where every code point is
                represented by a single char32_t.

     mbrtowc(), mbrtowc_l()
                Wide characters, being a 32-bit value where every code point
                is represented by a single wchar_t.  While wchar_t and
                char32_t are different types, in this implementation they
                use similar encodings.
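     The surrogate pair arithmetic mentioned for mbrtoc16() can be sketched
     as follows.  The helper utf16_encode() is hypothetical and is not part
     of the library; it only illustrates how a code point above 0xffff maps
     to two char16_t values under the standard UTF-16 rules.

```c
#include <uchar.h>

/*
 * Hypothetical helper, not part of libc.  Encodes the code point cp as
 * UTF-16 into out.  Returns the number of char16_t values written (1 or
 * 2), or 0 if cp cannot be encoded (a lone surrogate or a value beyond
 * 0x10ffff).
 */
int
utf16_encode(char32_t cp, char16_t out[2])
{
        if (cp > 0x10ffff || (cp >= 0xd800 && cp <= 0xdfff))
                return (0);

        if (cp < 0x10000) {
                /* Fits in a single 16-bit value. */
                out[0] = (char16_t)cp;
                return (1);
        }

        /* Split the remaining 20 bits into a high and a low surrogate. */
        cp -= 0x10000;
        out[0] = (char16_t)(0xd800 + (cp >> 10));
        out[1] = (char16_t)(0xdc00 + (cp & 0x3ff));
        return (2);
}
```

     For the code point 0x1f4a9 this yields the pair 0xd83d 0xdca9, which
     matches the output of Example 2 in the EXAMPLES section.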
     The functions consume up to len bytes from the string str and
     accumulate them in ps until a valid character is found.  What
     constitutes a valid character is determined by the LC_CTYPE category
     of the current locale.  For example, in the C locale, only ASCII
     characters are recognized, while in a UTF-8 based locale like
     en_US.UTF-8, UTF-8 multi-byte character sequences that represent
     Unicode code points are recognized.  The mbrtowc_l() function uses the
     locale passed in loc rather than the locale of the current thread.
     When a valid character sequence has been found, it is converted to
     either a 16-bit character sequence for mbrtoc16() or a 32-bit character
     sequence for mbrtoc32() and will be stored in p16c and p32c
     respectively.
     The ps argument represents a multi-byte conversion state which can be
     used across multiple calls to a given function (but not mixed between
     functions).  This allows a character to be assembled from bytes that
     arrive in subsequent buffers, e.g. different values of str.  The
     functions may be called from multiple threads as long as they use
     unique values for ps.  If ps is NULL, then a function-specific buffer
     will be used for the conversion state; however, this buffer is shared
     between all threads and its use is not recommended.
     When using these functions, more than one character may be output for a
     given set of consumed input bytes.  An example of this is when a given
     code point is represented as a surrogate pair in UTF-16, which requires
     two 16-bit characters to represent a code point.  When this occurs, a
     subsequent call returns the special value (size_t)-3 to deliver the
     remaining character without consuming any input.
     The functions all have a special behavior when NULL is passed for str.
     They instead behave as though pwc, p16c, or p32c were NULL, str had
     been passed as the empty string, "", and len were 1.  In other words,
     the functions behave as though they had been called as:

           mbrtowc(NULL, "", 1, ps)
           mbrtowc_l(NULL, "", 1, ps, loc)
           mbrtoc16(NULL, "", 1, ps)
           mbrtoc32(NULL, "", 1, ps)
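     The equivalence can be checked with a short sketch.  The helper
     null_str_is_equivalent() is hypothetical and not part of the library;
     it compares the two call forms for mbrtowc() starting from an initial
     conversion state.

```c
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper, not part of libc.  Returns non-zero if passing NULL
 * for str gives the same result as passing the empty string with a length
 * of 1, starting from an initial conversion state.
 */
int
null_str_is_equivalent(void)
{
        mbstate_t a, b;
        size_t ra, rb;

        (void) memset(&a, 0, sizeof (a));
        (void) memset(&b, 0, sizeof (b));

        /* With str set to NULL, pwc and len are ignored. */
        ra = mbrtowc(NULL, NULL, 0, &a);
        rb = mbrtowc(NULL, "", 1, &b);

        /* Both consume the null character and stay in the initial state. */
        return (ra == rb && ra == 0 && mbsinit(&a) != 0 && mbsinit(&b) != 0);
}
```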
   Locale Details
     Not all locales in the system are Unicode based locales.  For example,
     ISO 8859 family locales have code points with values that do not match
     their counterparts in Unicode.  When using these functions with non-
     Unicode based locales, the code points returned will be those
     determined by the locale.  They will not be converted to the
     corresponding Unicode code point.  For example, if using the Euro sign
     in ISO 8859-15, these functions might return the code point 0xa4 and
     not the Unicode value 0x20ac.
     Regardless of the locale, the characters returned will be encoded as
     though the code point were the corresponding value in Unicode.  This
     means that if a locale returns a value that would be a surrogate pair
     in the UTF-16 encoding, it will still be encoded as a UTF-16 character.
     This behavior of the mbrtoc16() and mbrtoc32() functions should not be
     relied upon, is not portable, and is subject to change for non-Unicode
     locales.
RETURN VALUES
     The mbrtoc16(), mbrtoc32(), mbrtowc(), and mbrtowc_l() functions return
     the following values:

     0           len or fewer bytes of str were consumed and the null wide
                 character was written into the wide character buffer (pwc,
                 p16c, p32c).

     between 1 and len
                 The specified number of bytes were consumed and a single
                 character was written into the wide character buffer (pwc,
                 p16c, p32c).

     (size_t)-1  An encoding error has occurred.  The next len bytes of str
                 do not contribute to a valid character.  errno has been set
                 to EILSEQ.  No data was written into the wide character
                 buffer (pwc, p16c, p32c).

     (size_t)-2  len bytes of str were consumed, but a complete multi-byte
                 character sequence has not been found and no data was
                 written into the wide character buffer (pwc, p16c, p32c).

     (size_t)-3  A character has been written into the wide character buffer
                 (pwc, p16c, p32c).  This character was from a previous call
                 (such as another part of a UTF-16 surrogate pair) and no
                 input was consumed.  This is limited to the mbrtoc16() and
                 mbrtoc32() functions.
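     Because three of these values are negative numbers cast to size_t, they
     must be compared against explicitly before the result is used as a byte
     count.  One way to sketch this is shown below; the enumeration and the
     helper mb_classify() are hypothetical names and not part of the
     library.

```c
#include <stddef.h>

/* Hypothetical classification of the return values listed above. */
enum mb_result {
        MB_NUL,         /* 0: the null character was stored */
        MB_CHAR,        /* 1 to len: that many bytes were consumed */
        MB_ILSEQ,       /* (size_t)-1: encoding error, errno is set */
        MB_INCOMPLETE,  /* (size_t)-2: more input is needed */
        MB_DEFERRED     /* (size_t)-3: output from a previous call */
};

/*
 * Hypothetical helper, not part of libc.  The special values are checked
 * first because, as unsigned numbers, they are larger than any valid byte
 * count.
 */
enum mb_result
mb_classify(size_t ret)
{
        if (ret == (size_t)-1)
                return (MB_ILSEQ);
        if (ret == (size_t)-2)
                return (MB_INCOMPLETE);
        if (ret == (size_t)-3)
                return (MB_DEFERRED);
        if (ret == 0)
                return (MB_NUL);
        return (MB_CHAR);
}
```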
EXAMPLES
     Example 1 Using the mbrtoc32() function to convert a multibyte string.
     #include <locale.h>
     #include <stdlib.h>
     #include <string.h>
     #include <err.h>
     #include <stdio.h>
     #include <uchar.h>
     int
     main(void)
     {
             mbstate_t mbs;
             char32_t out;
             size_t ret;
             const char *uchar_str = "\xe5\x85\x89";
             (void) memset(&mbs, 0, sizeof (mbs));
             (void) setlocale(LC_CTYPE, "en_US.UTF-8");
             ret = mbrtoc32(&out, uchar_str, strlen(uchar_str), &mbs);
             if (ret != strlen(uchar_str)) {
                     errx(EXIT_FAILURE, "failed to convert string, got %zd",
                         ret);
             }
             (void) printf("Converted %zu bytes into UTF-32 character "
                 "0x%x\n", ret, out);
             return (0);
     }
     When compiled and run, this produces:
           $ ./a.out
           Converted 3 bytes into UTF-32 character 0x5149

     Example 2 Handling surrogate pairs from the mbrtoc16() function.
     #include <locale.h>
     #include <stdlib.h>
     #include <string.h>
     #include <err.h>
     #include <stdio.h>
     #include <uchar.h>
     int
     main(void)
     {
             mbstate_t mbs;
             char16_t first, second;
             size_t ret;
             const char *uchar_str = "\xf0\x9f\x92\xa9";
             (void) memset(&mbs, 0, sizeof (mbs));
             (void) setlocale(LC_CTYPE, "en_US.UTF-8");
             ret = mbrtoc16(&first, uchar_str, strlen(uchar_str), &mbs);
             if (ret != strlen(uchar_str)) {
                     errx(EXIT_FAILURE, "failed to convert string, got %zd",
                         ret);
             }
             ret = mbrtoc16(&second, "", 0, &mbs);
             if (ret != (size_t)-3) {
                     errx(EXIT_FAILURE, "didn't get second surrogate, "
                         "got %zd", ret);
             }
             (void) printf("UTF-16 surrogates: 0x%x 0x%x\n", first, second);
             return (0);
     }
     When compiled and run, this produces:
           $ ./a.out
           UTF-16 surrogates: 0xd83d 0xdca9
ERRORS
     The mbrtoc16(), mbrtoc32(), mbrtowc(), and mbrtowc_l() functions will
     fail if:

     EINVAL             The conversion state in ps is invalid.
     EILSEQ             An invalid character sequence has been detected.
MT-LEVEL
     The mbrtoc16(), mbrtoc32(), mbrtowc(), and mbrtowc_l() functions are
     MT-Safe as long as different mbstate_t structures are passed in ps.  If
     ps is NULL or different threads use the same value for ps, then the
     functions are Unsafe.
INTERFACE STABILITY
     Committed
SEE ALSO
     c16rtomb(3C), c32rtomb(3C), newlocale(3C), setlocale(3C),
     uselocale(3C), wcrtomb(3C), uchar.h(3HEAD), environ(7)

illumos                         June 5, 2023                         illumos