diff --git a/reference/mbstring/encodings.xml b/reference/mbstring/encodings.xml new file mode 100644 index 0000000000..4786b26678 --- /dev/null +++ b/reference/mbstring/encodings.xml @@ -0,0 +1,879 @@ + + +
+ Summaries of supported encodings + + UCS-4 + Name in the IANA character set registry + Underlying character set + Description + Additional note + + ISO-10646-UCS-4 + ISO 10646 + + The Universal Character Set with 31-bit code space, standardized as UCS-4 + by ISO/IEC 10646. It is kept synchronized with the latest version of the + Unicode code map. + + + If this name is used in the encoding conversion facility, + the converter uses the preceding BOM (byte order mark) + to determine the byte order in which the subsequent bytes + are represented. + + + + + UCS-4BE + Name in the IANA character set registry + Underlying character set + Description + Additional note + + ISO-10646-UCS-4 + UCS-4 + + See above. + + + In contrast to UCS-4, strings are always assumed + to be in big-endian form. + + + + + UCS-4LE + Name in the IANA character set registry + Underlying character set + Description + Additional note + + ISO-10646-UCS-4 + UCS-4 + + See above. + + + In contrast to UCS-4, strings are always assumed + to be in little-endian form. + + + + + UCS-2 + Name in the IANA character set registry + Underlying character set + Description + Additional note + + ISO-10646-UCS-2 + UCS-2 + + The Universal Character Set with 16-bit code space, standardized as UCS-2 + by ISO/IEC 10646. It is kept synchronized with the latest version of the + Unicode code map. + + + If this name is used in the encoding conversion facility, + the converter uses the preceding BOM (byte order mark) + to determine the byte order in which the subsequent bytes + are represented. + + + + + UCS-2BE + Name in the IANA character set registry + Underlying character set + Description + Additional note + + ISO-10646-UCS-2 + UCS-2 + + See above. + + + In contrast to UCS-2, strings are always assumed + to be in big-endian form. + + + + + UCS-2LE + Name in the IANA character set registry + Underlying character set + Description + Additional note + + ISO-10646-UCS-2 + UCS-2 + + See above.
+ + + In contrast to UCS-2, strings are always assumed + to be in little-endian form. + + + + + UTF-32 + Name in the IANA character set registry + Underlying character set + Description + Additional note + + UTF-32 + Unicode + + Unicode Transformation Format of 32-bit unit width, whose encoding space + refers to the Unicode codeset standard. This encoding scheme is not + identical to UCS-4 because the code space of Unicode is limited to + a 21-bit value. + + + If this name is used in the encoding conversion facility, + the converter uses the preceding BOM (byte order mark) + to determine the byte order in which the subsequent bytes + are represented. + + + + + UTF-32BE + Name in the IANA character set registry + Underlying character set + Description + Additional note + + UTF-32BE + Unicode + See above. + + In contrast to UTF-32, strings are always assumed + to be in big-endian form. + + + + + UTF-32LE + Name in the IANA character set registry + Underlying character set + Description + Additional note + + UTF-32LE + Unicode + See above. + + In contrast to UTF-32, strings are always assumed + to be in little-endian form. + + + + + UTF-16 + Name in the IANA character set registry + Underlying character set + Description + Additional note + + UTF-16 + Unicode + + Unicode Transformation Format of 16-bit unit width. It is worth noting + that UTF-16 is no longer the same specification as UCS-2, because the + surrogate mechanism was introduced in Unicode 2.0 and + UTF-16 now refers to a 21-bit code space. + + + If this name is used in the encoding conversion facility, + the converter uses the preceding BOM (byte order mark) + to determine the byte order in which the subsequent bytes + are represented. + + + + + UTF-16BE + Name in the IANA character set registry + Underlying character set + Description + Additional note + + UTF-16BE + Unicode + + See above. + + + In contrast to UTF-16, strings are always assumed + to be in big-endian form.
+ + + + + UTF-16LE + Name in the IANA character set registry + Underlying character set + Description + Additional note + + UTF-16LE + Unicode + + See above. + + + In contrast to UTF-16, strings are always assumed + to be in little-endian form. + + + + + UTF-8 + Name in the IANA character set registry + Underlying character set + Description + Additional note + + UTF-8 + Unicode / UCS + + Unicode Transformation Format of 8-bit unit width. + + (none) + + + + UTF-7 + Name in the IANA character set registry + Underlying character set + Description + Additional note + + UTF-7 + Unicode + + A mail-safe transformation format of Unicode, specified in + RFC 2152. + + (none) + + + + UTF7-IMAP + Name in the IANA character set registry + Underlying character set + Description + Additional note + + (none) + Unicode + + A variant of UTF-7 which is specialized for use in the + IMAP protocol. + + (none) + + + + ASCII + Name in the IANA character set registry + Underlying character set + Description + Additional note + + + US-ASCII (preferred MIME name) / iso-ir-6 / ANSI_X3.4-1986 / + ISO_646.irv:1991 / ASCII / ISO646-US / us / IBM367 / CP367 / csASCII + + ASCII / ISO 646 + + American Standard Code for Information Interchange is a commonly used + 7-bit encoding. It is also standardized as an international standard, ISO 646. + + (none) + + + + EUC-JP + Name in the IANA character set registry + Underlying character set + Description + Additional note + + + EUC-JP (preferred MIME name) / + Extended_UNIX_Code_Packed_Format_for_Japanese / csEUCPkdFmtJapanese + + + Compound of US-ASCII / JIS X0201:1997 (hankaku kana part) / + JIS X0208:1990 / JIS X0212:1990 + + + As the name suggests, being an abbreviation of Extended UNIX Code + Packed Format for Japanese, this encoding is mostly used on UNIX and + similar platforms. The original encoding scheme, Extended UNIX Code, is + designed on the basis of ISO 2022.
+ + + The character set referred to by EUC-JP is different from IBM932 / CP932, + which are used by OS/2® and Microsoft® Windows®. + For information interchange with those platforms, use EUCJP-WIN instead. + + + + + SJIS + Name in the IANA character set registry + Underlying character set + Description + Additional note + + Shift_JIS (preferred MIME name) / MS_Kanji / csShift_JIS + Compound of JIS X0201:1997 / JIS X0208:1997 + + Shift_JIS was developed in the early 1980s, when personal Japanese word + processors were brought onto the market, in order to maintain + compatibility with the legacy encoding scheme JIS X 0201:1976. + According to the IANA definition, the codeset of Shift_JIS is slightly + different from IBM932 / CP932. However, the names "SJIS" / "Shift_JIS" are + often wrongly used to refer to these codesets. + + For the CP932 codemap, use SJIS-WIN instead. + + + + EUCJP-WIN + Name in the IANA character set registry + Underlying character set + Description + Additional note + + (none) + + Compound of JIS X0201:1997 / JIS X0208:1997 / IBM extensions / NEC extensions + + + While this "encoding" uses the same encoding scheme as EUC-JP, + the underlying character set is different. That is, some code points map + to different characters than in EUC-JP. + + (none) + + + + SJIS-win + Name in the IANA character set registry + Underlying character set + Description + Additional note + + Windows-31J / csWindows31J + + Compound of JIS X0201:1997 / JIS X0208:1997 / IBM extensions / NEC extensions + + + While this "encoding" uses the same encoding scheme as + Shift_JIS, the underlying character set is different. That is, some code + points map to different characters than in Shift_JIS.
+ + (none) + + + + ISO-2022-JP + Name in the IANA character set registry + Underlying character set + Description + Additional note + + ISO-2022-JP (preferred MIME name) / csISO2022JP + + US-ASCII / JIS X0201:1976 / JIS X0208:1978 / JIS X0208:1983 + + RFC 1468 + (none) + + + + JIS + Name in the IANA character set registry + Underlying character set + Description + Additional note + + + + + + + + + ISO-8859-1 + Name in the IANA character set registry + Underlying character set + Description + Additional note + + + + + + + + + ISO-8859-2 + Name in the IANA character set registry + Underlying character set + Description + Additional note + + + + + + + + + ISO-8859-3 + Name in the IANA character set registry + Underlying character set + Description + Additional note + + + + + + + + + ISO-8859-4 + Name in the IANA character set registry + Underlying character set + Description + Additional note + + + + + + + + + ISO-8859-5 + Name in the IANA character set registry + Underlying character set + Description + Additional note + + + + + + + + + ISO-8859-6 + Name in the IANA character set registry + Underlying character set + Description + Additional note + + + + + + + + + ISO-8859-7 + Name in the IANA character set registry + Underlying character set + Description + Additional note + + + + + + + + + ISO-8859-8 + Name in the IANA character set registry + Underlying character set + Description + Additional note + + + + + + + + + ISO-8859-9 + Name in the IANA character set registry + Underlying character set + Description + Additional note + + + + + + + + + ISO-8859-10 + Name in the IANA character set registry + Underlying character set + Description + Additional note + + + + + + + + + ISO-8859-13 + Name in the IANA character set registry + Underlying character set + Description + Additional note + + + + + + + + + ISO-8859-14 + Name in the IANA character set registry + Underlying character set + Description + Additional note + + + + + + + + + ISO-8859-15 + Name in the IANA
character set registry + Underlying character set + Description + Additional note + + + + + + + + + byte2be + Name in the IANA character set registry + Underlying character set + Description + Additional note + + + + + + + + + byte2le + Name in the IANA character set registry + Underlying character set + Description + Additional note + + + + + + + + + byte4be + Name in the IANA character set registry + Underlying character set + Description + Additional note + + + + + + + + + byte4le + Name in the IANA character set registry + Underlying character set + Description + Additional note + + + + + + + + + BASE64 + Name in the IANA character set registry + Underlying character set + Description + Additional note + + + + + + + + + HTML-ENTITIES + Name in the IANA character set registry + Underlying character set + Description + Additional note + + + + + + + + + 7bit + Name in the IANA character set registry + Underlying character set + Description + Additional note + + + + + + + + + 8bit + Name in the IANA character set registry + Underlying character set + Description + Additional note + + + + + + + + + EUC-CN + Name in the IANA character set registry + Underlying character set + Description + Additional note + + + + + + + + + CP936 + Name in the IANA character set registry + Underlying character set + Description + Additional note + + + + + + + + + HZ + Name in the IANA character set registry + Underlying character set + Description + Additional note + + + + + + + + + EUC-TW + Name in the IANA character set registry + Underlying character set + Description + Additional note + + + + + + + + + CP950 + Name in the IANA character set registry + Underlying character set + Description + Additional note + + + + + + + + + BIG-5 + Name in the IANA character set registry + Underlying character set + Description + Additional note + + + + + + + + + EUC-KR + Name in the IANA character set registry + Underlying character set + Description + Additional note + + + + + + + + + UHC
(CP949) + Name in the IANA character set registry + Underlying character set + Description + Additional note + + + + + + + + + ISO-2022-KR + Name in the IANA character set registry + Underlying character set + Description + Additional note + + + + + + + + + Windows-1251 (CP1251) + Name in the IANA character set registry + Underlying character set + Description + Additional note + + + + + + + + + Windows-1252 (CP1252) + Name in the IANA character set registry + Underlying character set + Description + Additional note + + + + + + + + + CP866 (IBM866) + Name in the IANA character set registry + Underlying character set + Description + Additional note + + + + + + + + + KOI8-R + Name in the IANA character set registry + Underlying character set + Description + Additional note + + + + + + + +
+ + diff --git a/reference/mbstring/reference.xml b/reference/mbstring/reference.xml index 5f4357344b..5b929ac34d 100644 --- a/reference/mbstring/reference.xml +++ b/reference/mbstring/reference.xml @@ -1,8 +1,8 @@ - + - Multi-Byte String Functions - Multi-Byte String + Multibyte String Functions + Multibyte String
@@ -110,7 +110,6 @@ JIS, SJIS, ISO-2022-JP, BIG-5 scanner and the character encoding. - If you have some database connected with PHP, it is recommended that @@ -148,13 +147,13 @@ JIS, SJIS, ISO-2022-JP, BIG-5 - In PHP 4.3.2 or earlier versions, mbstring - there is a limitation in this functionality that - mbstring does not perform character encoding - conversion in POST data if the enctype attribute in - the form element is set to - multipart/form-data. So you have to convert - the incoming data by yourself in this case if necessary. + In PHP 4.3.2 and earlier versions, there was a limitation in this + functionality: mbstring did not perform + character encoding conversion on POST data if the + enctype attribute of the form + element was set to multipart/form-data. + In that case, you have to convert the incoming data yourself + if necessary. Beginning with PHP 4.3.3, if enctype for HTML form is @@ -257,300 +256,306 @@ ob_start('mb_output_handler');
-
- Supported Character Encodings - - Currently the following character encodings are supported by the - mbstring module. Any of those Character encodings - can be specified in the encoding parameter of - mbstring functions. - - - The following character encoding is supported in this PHP - extension: - - - UCS-4 - UCS-4BE - UCS-4LE - UCS-2 - UCS-2BE - UCS-2LE - UTF-32 - UTF-32BE - UTF-32LE - UTF-16 - UTF-16BE - UTF-16LE - UTF-7 - UTF7-IMAP - UTF-8 - ASCII - EUC-JP - SJIS - eucJP-win - SJIS-win - ISO-2022-JP - JIS - ISO-8859-1 - ISO-8859-2 - ISO-8859-3 - ISO-8859-4 - ISO-8859-5 - ISO-8859-6 - ISO-8859-7 - ISO-8859-8 - ISO-8859-9 - ISO-8859-10 - ISO-8859-13 - ISO-8859-14 - ISO-8859-15 - byte2be - byte2le - byte4be - byte4le - BASE64 - HTML-ENTITIES - 7bit - 8bit - EUC-CN - CP936 - HZ - EUC-TW - CP950 - BIG-5 - EUC-KR - UHC (CP949) - ISO-2022-KR - Windows-1251 (CP1251) - Windows-1252 (CP1252) - CP866 (IBM866) - KOI8-R - - - &php.ini; entry, which accepts encoding name, - accepts "auto" and - "pass" also. - mbstring functions, which accepts encoding - name, and accepts "auto". - - - If "pass" is set, no character - encoding conversion is performed. - - - If "auto" is set, it is expanded to - the list of encodings defined per the NLS. - For instance, if the NLS is set to Japanese, - the value is assumed to be - "ASCII,JIS,UTF-8,EUC-JP,SJIS". - - - See also mb_detect_order - +
+ Supported Character Encodings + + Currently the following character encodings are supported by the + mbstring module. Any of these character encodings + can be specified in the encoding parameter of + mbstring functions. + + + The following character encodings are supported in this PHP + extension: + + + UCS-4 + UCS-4BE + UCS-4LE + UCS-2 + UCS-2BE + UCS-2LE + UTF-32 + UTF-32BE + UTF-32LE + UTF-16 + UTF-16BE + UTF-16LE + UTF-7 + UTF7-IMAP + UTF-8 + ASCII + EUC-JP + SJIS + eucJP-win + SJIS-win + ISO-2022-JP + JIS + ISO-8859-1 + ISO-8859-2 + ISO-8859-3 + ISO-8859-4 + ISO-8859-5 + ISO-8859-6 + ISO-8859-7 + ISO-8859-8 + ISO-8859-9 + ISO-8859-10 + ISO-8859-13 + ISO-8859-14 + ISO-8859-15 + byte2be + byte2le + byte4be + byte4le + BASE64 + HTML-ENTITIES + 7bit + 8bit + EUC-CN + CP936 + HZ + EUC-TW + CP950 + BIG-5 + EUC-KR + UHC (CP949) + ISO-2022-KR + Windows-1251 (CP1251) + Windows-1252 (CP1252) + CP866 (IBM866) + KOI8-R + + + Any &php.ini; entry that accepts an encoding name + also accepts "auto" and + "pass". + Any mbstring function that accepts an encoding + name also accepts "auto". + + + If "pass" is set, no character + encoding conversion is performed. + + + If "auto" is set, it is expanded to + the list of encodings defined per the NLS. + For instance, if the NLS is set to Japanese, + the value is assumed to be + "ASCII,JIS,UTF-8,EUC-JP,SJIS". + + + See also mb_detect_order. +
- - Function Overloading Feature - - - You might often find it difficult to get an existing PHP application - work in a given multibyte environment. That's mostly because lots of - PHP applications out there are written with the standard - string functions such as substr, which are - known to not properly handle multibyte-encoded strings. - - - mbstring supports 'function overloading' feature which enables - you to add multibyte awareness to such an application without - code modification by overloading multibyte counterparts on - the standard string functions. For example, - mb_substr is called instead of - substr if function overloading is enabled. - This feature makes it easy to port applications that only support - single-byte encodings to a multibyte environment in many cases. - - - To use the function overloading, set - mbstring.func_overload in &php.ini; to a - positive value that represents a combination of bitmasks specifying - the categories of functions to be overloaded. It should be set - to 1 to overload the mail function. 2 for string - functions, 4 for regular expression functions. For example, - if is set for 7, mail, strings and regular expression functions should - be overloaded. The list of overloaded functions are shown below. - - Functions to be overloaded - - - - value of mbstring.func_overload - original function - overloaded function - - - - - 1 - mail - mb_send_mail - - - 2 - strlen - mb_strlen - - - 2 - strpos - mb_strpos - - - 2 - strrpos - mb_strrpos - - - 2 - substr - mb_substr - - - 2 - strtolower - mb_strtolower - - - 2 - strtoupper - mb_strtoupper - - - 2 - substr_count - mb_substr_count - - - 4 - ereg - mb_ereg - - - 4 - eregi - mb_eregi - - - 4 - ereg_replace - mb_ereg_replace - - - 4 - eregi_replace - mb_eregi_replace - - - 4 - split - mb_split - - - -
-
+ + Function Overloading Feature + + + You might often find it difficult to get an existing PHP application + to work in a given multibyte environment. This is mostly because many + PHP applications are written with the standard + string functions such as substr, which are + known not to handle multibyte-encoded strings properly. + + + mbstring supports a 'function overloading' feature which enables + you to add multibyte awareness to such an application without + code modification by overloading the standard string functions + with their multibyte counterparts. For example, + mb_substr is called instead of + substr if function overloading is enabled. + This feature makes it easy to port applications that only support + single-byte encodings to a multibyte environment in many cases. + + + To use function overloading, set + mbstring.func_overload in &php.ini; to a + positive value that represents a combination of bitmasks specifying + the categories of functions to be overloaded. Set it + to 1 to overload the mail function, 2 for string + functions, and 4 for regular expression functions. For example, + if it is set to 7, the mail, string and regular expression functions are + all overloaded. The list of overloaded functions is shown below. + + Functions to be overloaded + + + + value of mbstring.func_overload + original function + overloaded function + + + + + 1 + mail + mb_send_mail + + + 2 + strlen + mb_strlen + + + 2 + strpos + mb_strpos + + + 2 + strrpos + mb_strrpos + + + 2 + substr + mb_substr + + + 2 + strtolower + mb_strtolower + + + 2 + strtoupper + mb_strtoupper + + + 2 + substr_count + mb_substr_count + + + 4 + ereg + mb_ereg + + + 4 + eregi + mb_eregi + + + 4 + ereg_replace + mb_ereg_replace + + + 4 + eregi_replace + mb_eregi_replace + + + 4 + split + mb_split + + + +
+
+ + + Using the function overloading option in the per-directory + context is not recommended, because it has not yet been confirmed to be + stable enough for a production environment and may lead to undefined + behaviour. + +
- Basics of Japanese multi-byte encodings - - It is often said quite hard to figure out how Japanese texts are - handled in the computer. This is not only because Japanese characters - can only be represented by multibyte encodings, but because different - encoding standards are adopted for different purposes / platforms. - Moreover, not a few character set standards are used there, which - are slightly different from one another. Those facts have often led - developers to inevitable mess-up. - - - To create a working web application that would be put in the Japanese - environment, it is important to use the proper character encoding and - character set for the task in hand. - - - - - Storage for a character can be up to six bytes - - - - Most of multibyte characters often appear twice as wide as - a single-byte character on display. Those characters are called - "zen-kaku" in Japanese which means "full width", and the other - (narrower) characters are called "han-kaku" - means half width. - However the graphical properties of the characters depend on - the glyphs of the type faces used to display them or print them out. - - - - - Some character encodings use shift(escape) sequences defined - in ISO2022 to switch the code map of the specific code area - (00h to 7fh). - - - - - ISO-2022-JP should be used in SMTP/NNTP, and headers and entities - should be reencoded as per RFC requirements. Although those are not - requisites, it's still a good idea because several popular user - agents cannot recognize any other encoding methods. - - - - - Webpages created for mobile phone services such as - i-mode, - Vodafone live!, or ezweb - are supposed to use Shift_JIS. - - - - + Basics of Japanese multibyte encodings + + It is often said to be quite hard to figure out how Japanese text is + handled by computers.
This is not only because Japanese characters + can only be represented by multibyte encodings, but also because different + encoding standards are adopted for different purposes and platforms. + Moreover, quite a few character set standards are in use, which + are slightly different from one another. These facts have often led + developers into confusion. + + + To create a working web application that would be put in the Japanese + environment, it is important to use the proper character encoding and + character set for the task at hand. + + + + + Storage for a character can be up to six bytes. + + + + Most multibyte characters appear twice as wide as + a single-byte character on display. Those characters are called + "zen-kaku" in Japanese, which means "full width", and the other + (narrower) characters are called "han-kaku", which means "half width". + However, the graphical properties of the characters depend on + the glyphs of the typefaces used to display them or print them out. + + + + + Some character encodings use shift (escape) sequences defined + in ISO 2022 to switch the code map of the specific code area + (00h to 7fh). + + + + + ISO-2022-JP should be used in SMTP/NNTP, and headers and entities + should be reencoded as per RFC requirements. Although this is not + strictly required, it is still a good idea because several popular user + agents cannot recognize any other encoding method. + + + + + Webpages created for mobile phone services such as + i-mode, + Vodafone live!, or EZweb + are supposed to use Shift_JIS. + + + +
- References - - Multibyte character encoding schemes and the related issues are very - complicated. There should be too few space to cover in sufficient details. - Please refer to the following URLs and other resources for - further readings. - - - - Unicode materials - - - &url.unicode; - - - - - Japanese/Korean/Chinese character information - - - - - ftp://ftp.ora.com/pub/examples/nutshell/ujip/doc/cjk.inf - - - - - - + References + + Multibyte character encoding schemes and the related issues are very + complicated. There is too little space here to cover them in sufficient detail. + Please refer to the following URLs and other resources for + further reading. + + + + Unicode materials + + + &url.unicode; + + + + + Japanese/Korean/Chinese character information + + + http://examples.oreilly.com/cjkvinfo/doc/cjk.inf + + + +
+&reference.mbstring.encodings; + &reference.mbstring.functions;
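The byte order notes in the encoding summaries (UCS-4, UCS-2, UTF-32 and UTF-16 versus their BE/LE variants) can be illustrated with a short PHP sketch. It is not part of the patch and only assumes that the mbstring extension is loaded; the sample character and byte strings are arbitrary. It shows how an explicit BE or LE suffix fixes the byte order, while the plain name leaves it to a leading BOM.

```php
<?php
// Sketch only: assumes the mbstring extension is loaded.
// The sample character and byte strings are arbitrary illustrations.
$text = "A"; // U+0041

// With an explicit byte order in the encoding name, nothing has to be detected.
echo bin2hex(mb_convert_encoding($text, 'UTF-16BE', 'UTF-8')), "\n"; // 0041
echo bin2hex(mb_convert_encoding($text, 'UTF-16LE', 'UTF-8')), "\n"; // 4100
echo bin2hex(mb_convert_encoding($text, 'UCS-4BE', 'UTF-8')), "\n";  // 00000041
echo bin2hex(mb_convert_encoding($text, 'UCS-4LE', 'UTF-8')), "\n";  // 41000000

// Decoding works the same way: the name states how the bytes are ordered.
echo mb_convert_encoding("\x41\x00", 'UTF-8', 'UTF-16LE'), "\n"; // A

// With the plain name, the converter is expected to look at a leading BOM.
// Here FF FE marks little-endian input. No output is claimed for this line:
// whether the BOM itself is dropped or kept may depend on the PHP version.
$decoded = mb_convert_encoding("\xFF\xFE\x41\x00", 'UTF-8', 'UTF-16');
echo bin2hex($decoded), "\n";
```

When the byte order of the input is known in advance, naming it explicitly (for example UTF-16LE) avoids relying on BOM detection at all.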