MBRTOC16(3C) Standard C Library Functions MBRTOC16(3C)
NAME
mbrtoc16,
mbrtoc32,
mbrtowc,
mbrtowc_l - convert characters to wide
characters
SYNOPSIS
#include <wchar.h>

size_t mbrtowc(wchar_t *restrict pwc, const char *restrict str,
    size_t len, mbstate_t *restrict ps);

#include <wchar.h>
#include <xlocale.h>

size_t mbrtowc_l(wchar_t *restrict pwc, const char *restrict str,
    size_t len, mbstate_t *restrict ps, locale_t loc);

#include <uchar.h>

size_t mbrtoc16(char16_t *restrict p16c, const char *restrict str,
    size_t len, mbstate_t *restrict ps);

size_t mbrtoc32(char32_t *restrict p32c, const char *restrict str,
    size_t len, mbstate_t *restrict ps);
DESCRIPTION
The
mbrtoc16(),
mbrtoc32(),
mbrtowc(), and
mbrtowc_l() functions
convert character sequences, which may contain multi-byte characters,
into different character formats. The functions work in the following
formats:
mbrtoc16()
A UTF-16 code sequence, where every code point is
represented by one or two
char16_t. The UTF-16 encoding
will encode certain Unicode code points as a pair of two
16-bit code sequences, commonly referred to as a surrogate
pair.
mbrtoc32()
A UTF-32 code sequence, where every code point is
represented by a single
char32_t.
mbrtowc(),
mbrtowc_l()
Wide characters, each a 32-bit value, where every code point
is represented by a single wchar_t. Although wchar_t and
char32_t are different types, in this implementation they
are encoded similarly.
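The surrogate pair arithmetic mentioned for mbrtoc16() can be sketched as
follows. This only illustrates the UTF-16 encoding rule itself.
split_surrogates() is a hypothetical helper introduced for this sketch and
is not part of the library.

```c
#include <assert.h>
#include <uchar.h>

/*
 * Hypothetical helper: split a code point in the range
 * U+10000..U+10FFFF into its UTF-16 high and low surrogates.
 */
void
split_surrogates(char32_t cp, char16_t *high, char16_t *low)
{
	assert(cp >= 0x10000 && cp <= 0x10FFFF);

	cp -= 0x10000;					/* 20 bits remain */
	*high = (char16_t)(0xD800 + (cp >> 10));	/* upper 10 bits */
	*low = (char16_t)(0xDC00 + (cp & 0x3FF));	/* lower 10 bits */
}
```

For U+1F4A9 this yields 0xd83d and 0xdca9, the same pair that Example 2
prints.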
The functions consume up to len bytes from the string str and
accumulate them in ps until a valid character is found. What
constitutes a valid character is determined by the LC_CTYPE category
of the current locale. For example, in the
C locale, only ASCII characters are recognized, while
in a
UTF-8 based locale like
en_US.UTF-8, UTF-8 multi-byte character
sequences that represent Unicode code points are recognized. The
mbrtowc_l() function uses the locale passed in
loc rather than the
locale of the current thread.
When a valid character sequence has been found, it is converted to
either a 16-bit character sequence for
mbrtoc16() or a 32-bit character
sequence for
mbrtoc32() and will be stored in
p16c and
p32c respectively.
The ps argument represents a multi-byte conversion state which can be
used across multiple calls to a given function (but must not be mixed
between functions). This allows a character to be assembled from bytes
supplied in successive buffers, that is, in different values of str.
The functions may be called from multiple threads as long as each
thread uses its own ps. If ps is NULL, then a function-specific
internal buffer is used for the conversion state; however, this buffer
is shared between all threads and its use is not recommended.
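As a rough sketch of carrying the conversion state across buffers, the
following feeds a multi-byte character to mbrtowc() one byte at a time, as
if each byte had arrived in a separate read. The helpers
select_utf8_locale() and decode_bytewise() are hypothetical names used only
for this illustration, and the locale names tried are assumptions about
what is installed on the system.

```c
#include <assert.h>
#include <locale.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: try a few common UTF-8 locale names. */
int
select_utf8_locale(void)
{
	static const char *names[] = { "en_US.UTF-8", "C.UTF-8" };
	size_t i;

	for (i = 0; i < sizeof (names) / sizeof (names[0]); i++) {
		if (setlocale(LC_CTYPE, names[i]) != NULL)
			return (1);
	}
	return (0);
}

/*
 * Hypothetical helper: pass buf to mbrtowc() one byte per call.
 * Returns how many calls reported (size_t)-2 before a character was
 * completed and stored in *out, or -1 on an encoding error or if the
 * input ended in the middle of a character.
 */
int
decode_bytewise(const char *buf, size_t len, wchar_t *out)
{
	mbstate_t mbs;
	int partial = 0;
	size_t i;

	assert(buf != NULL && out != NULL);
	(void) memset(&mbs, 0, sizeof (mbs));

	for (i = 0; i < len; i++) {
		size_t ret = mbrtowc(out, &buf[i], 1, &mbs);

		if (ret == (size_t)-2) {
			partial++;	/* byte was saved in mbs */
			continue;
		}
		if (ret == (size_t)-1)
			return (-1);	/* invalid sequence */
		return (partial);	/* character completed */
	}
	return (-1);
}
```

In a UTF-8 locale, the three bytes "\xe5\x85\x89" would be expected to give
two (size_t)-2 results followed by a completed character, because the same
mbs is passed to every call.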
When using these functions, more than one character may be output for a
given set of consumed input characters. An example of this is when a
given code point is represented as a surrogate pair in UTF-16, which
requires two 16-bit characters to represent one code point. When this
occurs, the functions return the special return value (size_t)-3 for
the second character.
The functions all behave specially when NULL is passed for str. In
that case the values passed for pwc, p16c, or p32c and for len are
ignored, and the call is treated as though the output pointer were
NULL, str were the empty string "", and len were 1. In other words,
the functions behave as though they had been called as:
    mbrtowc(NULL, "", 1, ps)
    mbrtowc_l(NULL, "", 1, ps, loc)
    mbrtoc16(NULL, "", 1, ps)
    mbrtoc32(NULL, "", 1, ps)
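The equivalence described above can be checked with a short sketch. It
calls mbrtowc() once with a NULL str and once in the explicit form, each on
its own zeroed conversion state, and compares the results.
null_str_matches_explicit() is a hypothetical helper name, and the sketch
only covers a state that starts out in the initial state.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper: returns 1 if a NULL str behaves like the
 * explicit call mbrtowc(NULL, "", 1, ps) on a fresh state.
 */
int
null_str_matches_explicit(void)
{
	mbstate_t a, b;
	size_t r1, r2;

	(void) memset(&a, 0, sizeof (a));
	(void) memset(&b, 0, sizeof (b));

	/* The len argument is ignored when str is NULL. */
	r1 = mbrtowc(NULL, NULL, 0, &a);
	r2 = mbrtowc(NULL, "", 1, &b);

	/* Converting the null character leaves the initial state. */
	assert(mbsinit(&a) && mbsinit(&b));

	return (r1 == 0 && r2 == 0);
}
```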
Locale Details
Not all locales in the system are Unicode based locales. For example,
ISO 8859 family locales have code points with values that do not match
their counterparts in Unicode. When using these functions with non-
Unicode based locales, the code points returned will be those
determined by the locale. They will not be converted to the
corresponding Unicode code point. For example, if using the Euro sign
in ISO 8859-15, these functions might return the code point 0xa4 and
not the Unicode value 0x20ac.
Regardless of the locale, the characters returned will be encoded as
though the code point were the corresponding value in Unicode. This
means that if a locale returns a value that would be a surrogate pair
in the UTF-16 encoding, it will still be encoded as a UTF-16 character.
This behavior of the mbrtoc16() and mbrtoc32() functions should not be
relied upon, is not portable, and is subject to change for non-Unicode
locales.
RETURN VALUES
The
mbrtoc16(),
mbrtoc32(),
mbrtowc(), and
mbrtowc_l() functions return
the following values:
0
    len or fewer bytes of str were consumed and the null wide
    character was written into the wide character buffer (pwc,
    p16c, p32c).

between 1 and len
    The specified number of bytes were consumed and a single
    character was written into the wide character buffer (pwc,
    p16c, p32c).

(size_t)-1
    An encoding error has occurred. The next len bytes of str do
    not contribute to a valid character. errno has been set to
    EILSEQ. No data was written into the wide character buffer
    (pwc, p16c, p32c).

(size_t)-2
    len bytes of str were consumed, but a complete multi-byte
    character sequence has not been found and no data was written
    into the wide character buffer (pwc, p16c, p32c).

(size_t)-3
    A character has been written into the wide character buffer
    (pwc, p16c, p32c). This character was from a previous call
    (such as another part of a UTF-16 surrogate pair) and no input
    was consumed. This is limited to the mbrtoc16() and mbrtoc32()
    functions.
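The return values above can be combined into a single conversion loop. The
following is a minimal sketch, not a definitive implementation. to_utf16()
and select_utf8_locale() are hypothetical helpers, the locale names tried
are assumptions about the system, and treating a return value of 0 as the
end of the input is a design choice made for this sketch.

```c
#include <assert.h>
#include <locale.h>
#include <string.h>
#include <uchar.h>

/* Hypothetical helper: try a few common UTF-8 locale names. */
int
select_utf8_locale(void)
{
	static const char *names[] = { "en_US.UTF-8", "C.UTF-8" };
	size_t i;

	for (i = 0; i < sizeof (names) / sizeof (names[0]); i++) {
		if (setlocale(LC_CTYPE, names[i]) != NULL)
			return (1);
	}
	return (0);
}

/*
 * Hypothetical helper: convert len bytes of str into UTF-16 code
 * units.  Returns the number of units stored in out, or (size_t)-1
 * if the input is invalid or truncated, or if out is too small.
 */
size_t
to_utf16(const char *str, size_t len, char16_t *out, size_t cap)
{
	mbstate_t mbs;
	size_t nout = 0;
	size_t ret;
	char16_t c;

	assert(str != NULL && out != NULL);
	(void) memset(&mbs, 0, sizeof (mbs));

	while (len > 0) {
		ret = mbrtoc16(&c, str, len, &mbs);
		if (ret == (size_t)-1 || ret == (size_t)-2)
			return ((size_t)-1);	/* invalid or truncated */
		if (ret == 0)
			break;			/* null character: stop */
		if (nout == cap)
			return ((size_t)-1);	/* output buffer full */
		out[nout++] = c;
		if (ret != (size_t)-3) {	/* -3 consumes no input */
			str += ret;
			len -= ret;
		}
	}

	/*
	 * If the last character needed a surrogate pair, the low
	 * surrogate is still pending; fetch it as Example 2 does.
	 */
	if (mbrtoc16(&c, "", 0, &mbs) == (size_t)-3) {
		if (nout == cap)
			return ((size_t)-1);
		out[nout++] = c;
	}
	return (nout);
}
```

A caller would first select a locale, for instance with
select_utf8_locale(), and then call to_utf16(buf, strlen(buf), out, cap).
The final call with a length of 0 is needed because the loop stops as soon
as the input is exhausted, which can happen directly after a high surrogate
has been written.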
EXAMPLES
Example 1 Using the
mbrtoc32() function to convert a multibyte string.
#include <locale.h>
#include <stdlib.h>
#include <string.h>
#include <err.h>
#include <stdio.h>
#include <uchar.h>
int
main(void)
{
    mbstate_t mbs;
    char32_t out;
    size_t ret;
    const char *uchar_str = "\xe5\x85\x89";

    /* Start from the initial conversion state. */
    (void) memset(&mbs, 0, sizeof (mbs));
    (void) setlocale(LC_CTYPE, "en_US.UTF-8");

    ret = mbrtoc32(&out, uchar_str, strlen(uchar_str), &mbs);
    if (ret != strlen(uchar_str)) {
        errx(EXIT_FAILURE, "failed to convert string, got %zd",
            ret);
    }

    (void) printf("Converted %zu bytes into UTF-32 character "
        "0x%x\n", ret, out);
    return (0);
}
When compiled and run, this produces:
$ ./a.out
Converted 3 bytes into UTF-32 character 0x5149
Example 2 Handling surrogate pairs from the
mbrtoc16() function.
#include <locale.h>
#include <stdlib.h>
#include <string.h>
#include <err.h>
#include <stdio.h>
#include <uchar.h>
int
main(void)
{
    mbstate_t mbs;
    char16_t first, second;
    size_t ret;
    const char *uchar_str = "\xf0\x9f\x92\xa9";

    /* Start from the initial conversion state. */
    (void) memset(&mbs, 0, sizeof (mbs));
    (void) setlocale(LC_CTYPE, "en_US.UTF-8");

    /* The first call consumes all input and stores one surrogate. */
    ret = mbrtoc16(&first, uchar_str, strlen(uchar_str), &mbs);
    if (ret != strlen(uchar_str)) {
        errx(EXIT_FAILURE, "failed to convert string, got %zd",
            ret);
    }

    /* The second call consumes nothing and stores the other one. */
    ret = mbrtoc16(&second, "", 0, &mbs);
    if (ret != (size_t)-3) {
        errx(EXIT_FAILURE, "didn't get second surrogate pair, "
            "got %zd", ret);
    }

    (void) printf("UTF-16 surrogates: 0x%x 0x%x\n", first, second);
    return (0);
}
When compiled and run, this produces:
$ ./a.out
UTF-16 surrogates: 0xd83d 0xdca9
ERRORS
The
mbrtoc16(),
mbrtoc32(),
mbrtowc(), and
mbrtowc_l() functions will
fail if:
EINVAL The conversion state in
ps is invalid.
EILSEQ An invalid character sequence has been detected.
MT-LEVEL
The
mbrtoc16(),
mbrtoc32(),
mbrtowc(), and
mbrtowc_l() functions are
MT-Safe as long as different
mbstate_t structures are passed in
ps. If
ps is NULL or different threads use the same value for
ps, then the
functions are
Unsafe.
INTERFACE STABILITY
Committed
SEE ALSO
c16rtomb(3C),
c32rtomb(3C),
newlocale(3C),
setlocale(3C),
uselocale(3C),
wcrtomb(3C),
uchar.h(3HEAD),
environ(7)
illumos June 5, 2023 illumos