diff --git a/reference/mbstring/configure.xml b/reference/mbstring/configure.xml index 45963649c0..a9e6d89dff 100644 --- a/reference/mbstring/configure.xml +++ b/reference/mbstring/configure.xml @@ -1,12 +1,12 @@ - +
&reftitle.install; - mbstring is an extended module. You must - enable the module with the configure script. - Refer to the Install section for - details. + mbstring is a non-default extension. This means it + is not enabled by default. You must explicitly enable the module with + the configure option. See the + Install section for details. The following configure options are related to the @@ -57,7 +57,7 @@ As of PHP 4.3.0, the option - will be eliminated and replaced with + was eliminated and replaced with the runtime setting mbstring.encoding_translation. HTTP input character encoding conversion is enabled when this is set to On diff --git a/reference/mbstring/ini.xml b/reference/mbstring/ini.xml index 4edc605e97..fc1490919b 100644 --- a/reference/mbstring/ini.xml +++ b/reference/mbstring/ini.xml @@ -1,70 +1,70 @@ - +
&reftitle.runtime; &extension.runtime; - - Multi-Byte String configuration options - - - - Name - Default - Changeable - - - - - mbstring.language - "neutral" - PHP_INI_SYSTEM | PHP_INI_PERDIR - - - mbstring.detect_order - NULL - PHP_INI_ALL - - - mbstring.http_input - "pass" - PHP_INI_ALL - - - mbstring.http_output - "pass" - PHP_INI_ALL - - - mbstring.internal_encoding - NULL - PHP_INI_ALL - - - mbstring.script_encoding - NULL - PHP_INI_ALL - - - mbstring.substitute_character - NULL - PHP_INI_ALL - - - mbstring.func_overload - "0" - PHP_INI_SYSTEM | PHP_INI_PERDIR - - - mbstring.encoding_translation - "0" - PHP_INI_SYSTEM | PHP_INI_PERDIR - - - -
- For further details and definition of the PHP_INI_* constants see - ini_set. + + mbstring configuration options + + + + Name + Default + Changeable + + + + + mbstring.language + "neutral" + PHP_INI_SYSTEM | PHP_INI_PERDIR + + + mbstring.detect_order + NULL + PHP_INI_ALL + + + mbstring.http_input + "pass" + PHP_INI_ALL + + + mbstring.http_output + "pass" + PHP_INI_ALL + + + mbstring.internal_encoding + NULL + PHP_INI_ALL + + + mbstring.script_encoding + NULL + PHP_INI_ALL + + + mbstring.substitute_character + NULL + PHP_INI_ALL + + + mbstring.func_overload + "0" + PHP_INI_SYSTEM | PHP_INI_PERDIR + + + mbstring.encoding_translation + "0" + PHP_INI_SYSTEM | PHP_INI_PERDIR + + + +
+ For the definition of the PHP_INI_* constants, please refer to + ini_set.
&ini.descriptions.title; @@ -73,37 +73,36 @@ - mbstring.language defines - default language used in mbstring. - Note that this option defines - mbstring.internal_encoding - and mbstring.internal_encoding - should be placed after mbstring.language - in &php.ini; + mbstring.language is the default national + language setting (NLS) used in mbstring. Note that this option + automatically defines mbstring.internal_encoding, so + mbstring.internal_encoding should be placed + after mbstring.language in &php.ini; - mbstring.encoding_translation enables - HTTP input character encoding detection and translation into + mbstring.encoding_translation enables the + transparent character encoding filter for the incoming HTTP queries, + which performs detection and conversion of the input encoding to the internal character encoding. - mbstring.internal_encoding defines default + mbstring.internal_encoding defines the default internal character encoding. - mbstring.http_input defines default HTTP + mbstring.http_input defines the default HTTP input character encoding. - mbstring.http_output defines default HTTP + mbstring.http_output defines the default HTTP output character encoding. @@ -122,40 +121,31 @@ - mbstring.func_overloadoverload(replace) single byte - functions by mbstring functions. mail, - ereg, etc. are overloaded by - mb_send_mail, mb_ereg, etc. - Possible values are 0, 1, 2, 4 or a combination of them. - For example, 7 for overload everything. - 0: No overload, 1: Overload mail function, - 2: Overload str*() functions, 4: Overload ereg*() functions. + mbstring.func_overload overloads a set of single-byte + functions with their mbstring counterparts. See + Function overloading for more + information. - Web Browsers are supposed to use the same character encoding - when submitting form. However, browsers may not use the same - character encoding. See mb_http_input to - detect character encoding used by browsers. 
+ According to the HTML 4.01 specification, + Web browsers are allowed to encode a form being submitted with a character + encoding different from the one used for the page. + See mb_http_input to detect the character encoding + used by browsers. - If enctype is set to - multipart/form-data in HTML forms, - mbstring does not convert character encoding - in POST data. The user must convert them in the script, if - conversion is needed. - - - Although, browsers are smart enough to detect character encoding - in HTML. charset is better to be set in HTTP - header. Change default_charset according to - character encoding. + Although browsers are smart enough to detect the character encoding + of a given HTML document by using heuristics, it is better to set the + charset parameter in the Content-Type + HTTP header to the appropriate value with header or + the default_charset ini setting. - &php.ini; setting example + &php.ini; setting examples - + Multi-Byte String Functions Multi-Byte String @@ -8,94 +8,123 @@
&reftitle.intro; - There are many languages in which all characters can be expressed - by single byte. Multi-byte character codes are used to express - many characters for many languages. mbstring - is developed to handle Japanese characters. However, many - mbstring functions are able to handle - character encoding other than Japanese. + While there are many languages in which every necessary character can + be represented by a one-to-one mapping to an 8-bit value, there are also + several languages which require so many characters for written + communication that they cannot be contained within the range a single byte can + encode. Multibyte character encoding schemes were developed to express + that many (more than 256) characters in the regular bytewise coding + system. - A multi-byte character encoding represents single character with - consecutive bytes. Some character encoding has shift(escape) - sequences to start/end multi-byte character strings. Therefore, a - multi-byte character string may be destroyed when it is divided - and/or counted unless multi-byte character encoding safe method - is used. This module provides multi-byte character safe string - functions and other utility functions such as conversion - functions. + When you manipulate (trim, split, splice, etc.) strings encoded in a + multibyte encoding, you need to use special functions since two or more + consecutive bytes may represent a single character in such encoding + schemes. Otherwise, if you apply a non-multibyte-aware string function + to the string, it will probably fail to detect the beginning or end of + a multibyte character and produce a corrupted string that + most likely loses its original meaning. - Since PHP is basically designed for ISO-8859-1, some multi-byte - character encoding does not work well with PHP. Therefore, it is - important to set - mbstring.language to appropriate language - (i.e. 
"Japanese" for Japanese) and - mbstring.internal_encoding to a character - encoding that works with PHP. + mbstring provides these multibyte specific + string functions that help you deal with multibyte encodings in PHP, + which is basically supposed to be used with single byte encodings. + In addition to that, mbstring handles character + encoding conversion between the possible encoding pairs. - PHP 4 Character Encoding Requirements + mbstring is also designed to handle Unicode-based + encodings such as UTF-8 and UCS-2 and many single-byte encodings + for convenience (listed below), whereas mbstring was + originally developed for use in Japanese web pages. - - - - - Per byte encoding - - - - - Single byte characters in range of 00h-7fh - which is compatible with ASCII - - - - - Multi-byte characters without 00h-7fh - - - - - - These are examples of internal character encoding that works with - PHP and does NOT work with PHP. - - - + PHP Character Encoding Requirements + + Encodings of the following types are safely used with PHP. + + + + A singlebyte encoding, + + + + which has ASCII-compatible (ISO646 compatible) mappings for the + characters in range of 00h to + 7fh. + + + + + + + + A multibyte encoding, + + + + which has ASCII-compatible mappings for the characters in range of + 00h to 7fh. + + + + + which don't use ISO2022 escape sequences. + + + + + which don't use a value from 00h to + 7fh in any of the compounded bytes + that represents a single character. + + + + + + + + + These are examples of character encodings that are unlikely to work + with PHP. + + + - - - - - Character encoding, that does not work with PHP, may be converted - with mbstring's HTTP input/output conversion - feature/function. - - - - SJIS should not be used for internal encoding unless the reader - is familiar with parser/compiler, character encoding and - character encoding issues. 
+ + - - - If you use databases with PHP, it is recommended that you use the - same character encoding for both database and internal - encoding for ease of use and better performance. + Although PHP scripts written in any of those encodings might not work, + especially in the case where encoded strings appear as identifiers + or literals in the script, you can usually avoid using these encodings + by setting up mbstring's transparent encoding + filter function for incoming HTTP queries. + + + + Using SJIS, BIG5, CP936, CP949 and GB18030 for + the internal encoding is strongly discouraged unless you are familiar with the parser, the + scanner and the character encoding. - - If you are using PostgreSQL, it supports character - encoding that is different from backend character encoding. See - the PostgreSQL manual for details. - - + + + + + If you use a database with PHP, it is recommended that + you use the same character encoding for both the database and the + internal encoding for ease of use and better + performance. + + + If you are using PostgreSQL, the character encoding used in the + database and the one used in PHP may differ, because PostgreSQL supports + automatic character set conversion between the backend and the frontend. + + +
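The introduction's warning about applying byte-oriented string functions to multibyte text can be illustrated with a short sketch. The sample string, the variable names and the choice of UTF-8 are assumptions made for this example only; they are not part of the patch. The string is written with byte escapes so that the result does not depend on the encoding of the script file.

```php
<?php
// Illustrative sketch only; names and sample data are assumptions.
// Three Japanese characters encoded as UTF-8, three bytes each.
$text = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E";

// Byte-oriented functions count and cut bytes, not characters.
$byteLength = strlen($text);        // number of bytes, not characters
$cutInside  = substr($text, 0, 4);  // first character plus one stray lead byte

// The trailing byte is the start of the second character, cut off from the
// rest of its sequence, so $cutInside is no longer valid UTF-8.
$danglingByte = bin2hex(substr($cutInside, 3));

echo "bytes: $byteLength\n";
echo "dangling byte: $danglingByte\n";

// When mbstring is loaded, the multibyte-aware counterparts operate on
// whole characters instead.
if (function_exists('mb_substr')) {
    echo "characters: " . mb_strlen($text, 'UTF-8') . "\n";
    echo "first two: " . bin2hex(mb_substr($text, 0, 2, 'UTF-8')) . "\n";
}
```

The `function_exists()` guard is there because mbstring is not enabled by default, as the configure section of this patch notes.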
&reference.mbstring.configure; @@ -119,25 +148,21 @@ JIS, SJIS
- For PHP 4.3.2 or earlier, - if enctype for HTML form is set to - multipart/form-data, - mbstring does not convert character encoding - in POST data. If it is the case, strings are needed to be - converted to internal character encoding. + In PHP 4.3.2 or earlier versions, + there is a limitation in this functionality: + mbstring does not perform character encoding + conversion in POST data if the enctype attribute in + the form element is set to + multipart/form-data. In that case you have to convert + the incoming data yourself if necessary. - - - Since PHP 4.3.3, - if enctype for HTML form is set to - multipart/form-data, and, - mbstring.encoding_translation is set to - On in &php.ini; - POST variables and uploaded filename will be converted to - internal character encoding. - But, characters specified in 'name' of HTML form will not be - converted. + Beginning with PHP 4.3.3, if the enctype of the HTML form is + set to multipart/form-data and + mbstring.encoding_translation is set to On + in &php.ini;, the POSTed variables and the names of uploaded files + will be converted to the internal character encoding as well. + However, the conversion isn't applied to the query keys. @@ -166,9 +191,8 @@ mbstring.encoding_translation = Off When using PHP as an Apache module, it is possible to - override PHP ini setting per Virtual Host in - &httpd.conf; or per directory with - &htaccess;. Refer to the Configuration section and Apache Manual for details. @@ -186,7 +210,7 @@ mbstring.encoding_translation = Off - For PHP3-i18n users, mbstring's output + PHP3-i18n users should note that mbstring's output conversion differs from PHP3-i18n. Character encoding is converted using output buffer. @@ -236,51 +260,101 @@ ob_start('mb_output_handler');
Supported Character Encodings - Currently, the following character encoding is supported by the - mbstring module. Character encoding may - be specified for mbstring functions' - encoding parameter. + Currently, the following character encodings are supported by the + mbstring module. Any of these character encodings + can be specified in the encoding parameter of + mbstring functions. The following character encoding is supported in this PHP extension: - - UCS-4, UCS-4BE, - UCS-4LE, UCS-2, - UCS-2BE, UCS-2LE, - UTF-32, UTF-32BE, - UTF-32LE, UCS-2LE, - UTF-16, UTF-16BE, - UTF-16LE, UTF-8, - UTF-7, ASCII, - EUC-JP, SJIS, - eucJP-win, SJIS-win, - ISO-2022-JP, JIS, - ISO-8859-1, ISO-8859-2, - ISO-8859-3, ISO-8859-4, - ISO-8859-5, ISO-8859-6, - ISO-8859-7, ISO-8859-8, - ISO-8859-9, ISO-8859-10, - ISO-8859-13, ISO-8859-14, - ISO-8859-15, byte2be, - byte2le, byte4be, - byte4le, BASE64, - 7bit, 8bit and - UTF7-IMAP. - - - As of PHP 4.3.0, the following character encoding support will be added - experimentally : - EUC-CN, CP936, HZ, - EUC-TW, CP950, BIG-5, - EUC-KR, UHC (CP949), - ISO-2022-KR, - Windows-1251 (CP1251), - Windows-1252 (CP1252), - CP866, - KOI8-R. 
- + + UCS-4 + UCS-4BE + + UCS-4LE + UCS-2 + + UCS-2BE + UCS-2LE + + UTF-32 + UTF-32BE + + UTF-32LE + UCS-2LE + + UTF-16 + UTF-16BE + + UTF-16LE + UTF-8 + + UTF-7 + ASCII + + EUC-JP + SJIS + + eucJP-win + SJIS-win + + ISO-2022-JP + JIS + + ISO-8859-1 + ISO-8859-2 + + ISO-8859-3 + ISO-8859-4 + + ISO-8859-5 + ISO-8859-6 + + ISO-8859-7 + ISO-8859-8 + + ISO-8859-9 + ISO-8859-10 + + ISO-8859-13 + ISO-8859-14 + + ISO-8859-15 + byte2be + + byte2le + byte4be + + byte4le + BASE64 + + 7bit + 8bit + UTF7-IMAP + EUC-CN + CP936 + HZ + + EUC-TW + CP950 + BIG-5 + + EUC-KR + UHC (CP949) + + ISO-2022-KR + + Windows-1251 (CP1251) + + Windows-1252 (CP1252) + + CP866 + + KOI8-R + + &php.ini; entry, which accepts encoding name, accepts "auto" and @@ -294,56 +368,48 @@ ob_start('mb_output_handler'); If "auto" is set, it is expanded to + the list of encodings defined per the NLS. + For instance, if the NLS is set to Japanese, + the value is assumed to be "ASCII,JIS,UTF-8,EUC-JP,SJIS". See also mb_detect_order - - - "Supported character encoding" does not mean that it - works as internal character code. - -
- Overloading PHP string functions with multi byte string functions + Function Overloading Feature - Because almost PHP application written for language using - single-byte character encoding, there are some difficulties for - multibyte string handling including Japanese. Most PHP string - functions such as substr do not support - multibyte strings. + You might often find it difficult to get an existing PHP application + to work in a given multibyte environment. This is mostly because many + PHP applications are written with the standard + string functions such as substr, which are + known not to handle multibyte-encoded strings properly. - Multibyte extension (mbstring) has some PHP string functions - with multibyte support (ex. substr supports - mb_substr). + mbstring supports a 'function overloading' feature, which enables + you to add multibyte awareness to such an application without + code modification by overloading + the standard string functions with their multibyte counterparts. For example, + mb_substr is called instead of + substr if function overloading is enabled. + In many cases this feature makes it easy to port applications that only support + single-byte encodings to a multibyte environment. - Multibyte extension (mbstring) also supports 'function - overloading' to add multibyte string functionality without - code modification. Using function overloading, some PHP string - functions will be overloaded multibyte string functions. - For example, mb_substr is called - instead of substr if function overloading - is enabled. Function overload makes easy to port application - supporting only single-byte encoding for multibyte application. - - - mbstring.func_overload in &php.ini; should be - set some positive value to use function overloading. - The value should specify the category of overloading functions, - should be set 1 to enable mail function overloading. 2 to enable - string functions, 4 to regular expression functions. 
For - example, if is set for 7, mail, strings, regex functions should - be overloaded. The list of overloaded functions are shown in - below. + To use function overloading, set + mbstring.func_overload in &php.ini; to a + positive value that represents a combination of bitmasks specifying + the categories of functions to be overloaded. It should be set + to 1 to overload the mail function, 2 for string + functions, and 4 for regular expression functions. For example, + if it is set to 7, the mail, string and regular expression functions are all + overloaded. The list of overloaded functions is shown below. - Functions to be overloaded + Functions to be overloaded @@ -417,7 +483,7 @@ ob_start('mb_output_handler'); 4 split mb_split - +
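The way a mbstring.func_overload value combines the category bits 1, 2 and 4 can be sketched as below. The helper name describe_func_overload and its labels are hypothetical and exist only for this illustration; mbstring provides no such function, and the sketch only decodes an integer without changing any setting.

```php
<?php
// Hypothetical helper for illustration; it is not part of mbstring.
// The category bits follow the values described in the patched text:
// 1 = mail function, 2 = string functions, 4 = regular expression functions.
function describe_func_overload($value)
{
    $categories = array(
        1 => 'mail',
        2 => 'string',
        4 => 'regex',
    );

    $enabled = array();
    foreach ($categories as $bit => $label) {
        // A category is selected when its bit is set in the combined value.
        if (($value & $bit) === $bit) {
            $enabled[] = $label;
        }
    }
    return $enabled;
}

// 7 is 1 | 2 | 4, so every category is selected.
echo implode(', ', describe_func_overload(7)), "\n";
// 5 is 1 | 4: mail and regular expression functions only.
echo implode(', ', describe_func_overload(1 | 4)), "\n";
```

Treating the setting as a bitmask rather than a list of discrete values is what makes intermediate combinations such as 3 or 6 meaningful.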
@@ -425,46 +491,58 @@ ob_start('mb_output_handler');
- Basics of Japanese multi-byte characters + Basics of Japanese multi-byte encodings - Most Japanese characters need more than 1 byte per character. In - addition, several character encoding schemes are used under a - Japanese environment. There are EUC-JP, Shift_JIS(SJIS) and - ISO-2022-JP(JIS) character encoding. As Unicode becomes popular, - UTF-8 is used also. To develop Web applications for a Japanese - environment, it is important to use the character set for the - task in hand, whether HTTP input/output, RDBMS and E-mail. + It is often said to be quite hard to figure out how Japanese text is + handled by computers. This is not only because Japanese characters + can only be represented by multibyte encodings, but also because different + encoding standards are adopted for different purposes / platforms. + Moreover, several character set standards are in use, which + differ slightly from one another. These facts have often led + developers into confusion. + + + To create a working web application for a Japanese + environment, it is important to use the proper character encoding and + character set for the task at hand. - Storage for a character can be up to six - bytes + Storage for a character can be up to six bytes - A multi-byte character is usually twice of the width compared - to single-byte characters. Wider characters are called - "zen-kaku" - meaning full width, narrower characters are - called "han-kaku" - meaning half width. "zen-kaku" characters - are usually fixed width. + Most multibyte characters appear twice as wide as + a single-byte character when displayed. Those characters are called + "zen-kaku" in Japanese, which means "full width", and the other + (narrower) characters are called "han-kaku", which means "half width". + However, the graphical properties of the characters depend on + the glyphs of the typefaces used to display or print them. 
- Some character encoding defines shift(escape) sequence for - entering/exiting multi-byte character strings. + Some character encodings use shift(escape) sequences defined + in ISO2022 to switch the code map of the specific code area + (00h to 7fh). - ISO-2022-JP must be used for SMTP/NNTP. + ISO-2022-JP should be used in SMTP/NNTP, and headers and entities + should be reencoded as per RFC requirements. Although those are not + requisites, it's still a good idea because several popular user + agents cannot recognize any other encoding methods. - - "i-mode" web site is supposed to use SJIS. - + + Webpages created for mobile phone services such as + i-mode, + Vodafone live!, or ezweb + are supposed to use Shift_JIS. + @@ -473,14 +551,14 @@ ob_start('mb_output_handler');
References - Multi-byte character encoding and its related issues are very - complex. It is impossible to cover in sufficient detail - here. Please refer to the following URLs and other resources for + Multibyte character encoding schemes and the related issues are very + complicated. There is too little space here to cover them in sufficient detail. + Please refer to the following URLs and other resources for further readings. - Unicode/UTF/UCS/etc + Unicode materials &url.unicode; @@ -488,13 +566,14 @@ ob_start('mb_output_handler'); - Japanese/Korean/Chinese character - information + Japanese/Korean/Chinese character information - - ftp://ftp.ora.com/pub/examples/nutshell/ujip/doc/cjk.inf - + + + ftp://ftp.ora.com/pub/examples/nutshell/ujip/doc/cjk.inf + +