diff --git a/reference/mbstring/configure.xml b/reference/mbstring/configure.xml
index 45963649c0..a9e6d89dff 100644
--- a/reference/mbstring/configure.xml
+++ b/reference/mbstring/configure.xml
@@ -1,12 +1,12 @@
-
+
&reftitle.install;
- mbstring is an extended module. You must
- enable the module with the configure script.
- Refer to the Install section for
- details.
+ mbstring is a non-default extension. This means it
+ is not enabled by default. You must explicitly enable the module with
+ the configure option. See the
+ Install section for details.
The following configure options are related to the
@@ -57,7 +57,7 @@
As of PHP 4.3.0, the option
- will be eliminated and replaced with
+ was eliminated and replaced with the runtime setting
mbstring.encoding_translation.
HTTP input character encoding conversion is enabled
when this is set to On
diff --git a/reference/mbstring/ini.xml b/reference/mbstring/ini.xml
index 4edc605e97..fc1490919b 100644
--- a/reference/mbstring/ini.xml
+++ b/reference/mbstring/ini.xml
@@ -1,70 +1,70 @@
-
+
&reftitle.runtime;
&extension.runtime;
-
+ For the definition of the PHP_INI_* constants, please refer to
+ ini_set.
&ini.descriptions.title;
@@ -73,37 +73,36 @@
- mbstring.language defines
- default language used in mbstring.
- Note that this option defines
- mbstring.internal_encoding
- and mbstring.internal_encoding
- should be placed after mbstring.language
- in &php.ini;
+ mbstring.language is the default national
+ language setting (NLS) used in mbstring. Note that this option
+ automatically defines mbstring.internal_encoding, so
+ mbstring.internal_encoding should be placed
+ after mbstring.language in &php.ini;.
- mbstring.encoding_translation enables
- HTTP input character encoding detection and translation into
+ mbstring.encoding_translation enables the
+ transparent character encoding filter for incoming HTTP queries,
+ which detects the input encoding and converts it to the
internal character encoding.
- mbstring.internal_encoding defines default
+ mbstring.internal_encoding defines the default
internal character encoding.
- mbstring.http_input defines default HTTP
+ mbstring.http_input defines the default HTTP
input character encoding.
- mbstring.http_output defines default HTTP
+ mbstring.http_output defines the default HTTP
output character encoding.
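The ordering constraint described in this hunk can be illustrated with a short &php.ini; fragment. This is only a sketch: the directive names come from the text above, while the specific values (Japanese, EUC-JP, SJIS) are assumptions chosen for illustration.

```ini
; Illustrative values only; pick the encodings that match your environment.

; mbstring.language implicitly sets mbstring.internal_encoding,
; so it has to appear first or it will override the explicit setting below.
mbstring.language = Japanese
mbstring.internal_encoding = EUC-JP

; Detect the encoding of incoming HTTP data and convert it
; to the internal encoding.
mbstring.encoding_translation = On
mbstring.http_input = auto

; Encoding used for HTTP output conversion.
mbstring.http_output = SJIS
```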
@@ -122,40 +121,31 @@
- mbstring.func_overloadoverload(replace) single byte
- functions by mbstring functions. mail,
- ereg, etc. are overloaded by
- mb_send_mail, mb_ereg, etc.
- Possible values are 0, 1, 2, 4 or a combination of them.
- For example, 7 for overload everything.
- 0: No overload, 1: Overload mail function,
- 2: Overload str*() functions, 4: Overload ereg*() functions.
+ mbstring.func_overload overloads a set of single-byte
+ functions with their mbstring counterparts. See
+ Function overloading for more
+ information.
- Web Browsers are supposed to use the same character encoding
- when submitting form. However, browsers may not use the same
- character encoding. See mb_http_input to
- detect character encoding used by browsers.
+ According to the HTML 4.01 specification,
+ Web browsers are allowed to encode a submitted form with a character
+ encoding different from the one used for the page.
+ See mb_http_input to detect the character encoding
+ used by browsers.
- If enctype is set to
- multipart/form-data in HTML forms,
- mbstring does not convert character encoding
- in POST data. The user must convert them in the script, if
- conversion is needed.
-
-
- Although, browsers are smart enough to detect character encoding
- in HTML. charset is better to be set in HTTP
- header. Change default_charset according to
- character encoding.
+ Although browsers are smart enough to detect the character encoding
+ of a given HTML document by using heuristics, it is better to set the
+ charset parameter in the Content-Type
+ HTTP header to the appropriate value with header or the
+ default_charset ini setting.
- &php.ini; setting example
+ &php.ini; setting examples
-
+
 Multi-Byte String Functions
 Multi-Byte String
@@ -8,94 +8,123 @@
&reftitle.intro;
- There are many languages in which all characters can be expressed
- by single byte. Multi-byte character codes are used to express
- many characters for many languages. mbstring
- is developed to handle Japanese characters. However, many
- mbstring functions are able to handle
- character encoding other than Japanese.
+ While there are many languages in which every necessary character can
+ be represented by a one-to-one mapping to an 8-bit value, there are also
+ several languages which require so many characters for written
+ communication that they cannot be contained within the range a single
+ byte can encode. Multibyte character encoding schemes were developed to
+ express more than 256 characters in the regular bytewise coding
+ system.
- A multi-byte character encoding represents single character with
- consecutive bytes. Some character encoding has shift(escape)
- sequences to start/end multi-byte character strings. Therefore, a
- multi-byte character string may be destroyed when it is divided
- and/or counted unless multi-byte character encoding safe method
- is used. This module provides multi-byte character safe string
- functions and other utility functions such as conversion
- functions.
+ When you manipulate (trim, split, splice, etc.) strings encoded in a
+ multibyte encoding, you need to use special functions since two or more
+ consecutive bytes may represent a single character in such encoding
+ schemes. Otherwise, if you apply a non-multibyte-aware string function
+ to the string, it will probably fail to detect the beginning or end of
+ a multibyte character and produce a corrupted string that has
+ most likely lost its original meaning.
- Since PHP is basically designed for ISO-8859-1, some multi-byte
- character encoding does not work well with PHP. Therefore, it is
- important to set
- mbstring.language to appropriate language
- (i.e. "Japanese" for Japanese) and
- mbstring.internal_encoding to a character
- encoding that works with PHP.
+ mbstring provides multibyte-specific
+ string functions that help you deal with multibyte encodings in PHP,
+ which was basically designed for single-byte encodings.
+ In addition, mbstring handles character
+ encoding conversion between supported pairs of encodings.
- PHP 4 Character Encoding Requirements
+ mbstring is also designed to handle Unicode-based
+ encodings such as UTF-8 and UCS-2 and, for convenience, many
+ single-byte encodings (listed below), although mbstring was
+ originally developed for use in Japanese web pages.
-
-
-
-
- Per byte encoding
-
-
-
-
- Single byte characters in range of 00h-7fh
- which is compatible with ASCII
-
-
-
-
- Multi-byte characters without 00h-7fh
-
-
-
-
-
- These are examples of internal character encoding that works with
- PHP and does NOT work with PHP.
-
-
-
+ PHP Character Encoding Requirements
+
+ Encodings of the following types can be used safely with PHP.
+
+
+
+ A single-byte encoding,
+
+
+
+ which has ASCII-compatible (ISO646 compatible) mappings for the
+ characters in range of 00h to
+ 7fh.
+
+
+
+
+
+
+
+ A multibyte encoding,
+
+
+
+ which has ASCII-compatible mappings for the characters in range of
+ 00h to 7fh.
+
+
+
+
+ which doesn't use ISO2022 escape sequences.
+
+
+
+
+ which doesn't use a value from 00h to
+ 7fh in any of the bytes
+ that make up a single multibyte character.
+
+
+
+
+
+
+
+
+ These are examples of character encodings that are unlikely to work
+ with PHP.
+
+
+
-
-
-
-
- Character encoding, that does not work with PHP, may be converted
- with mbstring's HTTP input/output conversion
- feature/function.
-
-
-
- SJIS should not be used for internal encoding unless the reader
- is familiar with parser/compiler, character encoding and
- character encoding issues.
+
+
-
-
- If you use databases with PHP, it is recommended that you use the
- same character encoding for both database and internal
- encoding for ease of use and better performance.
+ Although PHP scripts written in any of those encodings might not work,
+ especially in the case where encoded strings appear as identifiers
+ or literals in the script, you can almost always avoid using these
+ encodings by setting up mbstring's transparent encoding
+ filter for incoming HTTP queries.
+
+
+
+ Using SJIS, BIG5, CP936, CP949 and GB18030 for the internal encoding
+ is strongly discouraged unless you are familiar with the parser, the
+ scanner and the character encoding.
-
- If you are using PostgreSQL, it supports character
- encoding that is different from backend character encoding. See
- the PostgreSQL manual for details.
-
-
+
+
+
+
+ If you connect to a database from PHP, it is recommended that
+ you use the same character encoding for both the database and the
+ internal encoding for ease of use and better
+ performance.
+
+
+ If you are using PostgreSQL, the character encoding used in the
+ database and the one used in PHP may differ, because PostgreSQL supports
+ automatic character set conversion between the backend and the frontend.
+
+
+
&reference.mbstring.configure;
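The introduction in this hunk warns that byte-oriented string functions can cut a multibyte character in half. A minimal sketch of that failure, assuming UTF-8 input and a PHP build with mbstring enabled (the sample string and variable names are illustrative only):

```php
<?php
// "日本語" written as explicit UTF-8 bytes, so the example does not
// depend on the encoding in which this script file is saved.
$text = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E";

// strlen() counts bytes, mb_strlen() counts characters.
$byteLength = strlen($text);              // 9
$charLength = mb_strlen($text, 'UTF-8');  // 3

// substr() works on bytes: taking 4 bytes keeps the first character
// (3 bytes) plus the first byte of the second one.
$broken = substr($text, 0, 4);

// mb_substr() works on characters: this keeps two whole characters.
$intact = mb_substr($text, 0, 2, 'UTF-8');

var_dump(mb_check_encoding($broken, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($intact, 'UTF-8')); // bool(true)
```

The byte-oriented result is no longer valid UTF-8, which is the kind of corruption the paragraph describes.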
@@ -119,25 +148,21 @@ JIS, SJIS
- For PHP 4.3.2 or earlier,
- if enctype for HTML form is set to
- multipart/form-data,
- mbstring does not convert character encoding
- in POST data. If it is the case, strings are needed to be
- converted to internal character encoding.
+ In PHP 4.3.2 and earlier versions,
+ there is a limitation in this functionality:
+ mbstring does not perform character encoding
+ conversion in POST data if the enctype attribute in
+ the form element is set to
+ multipart/form-data. In that case you have to convert
+ the incoming data yourself if necessary.
-
-
- Since PHP 4.3.3,
- if enctype for HTML form is set to
- multipart/form-data, and,
- mbstring.encoding_translation is set to
- On in &php.ini;
- POST variables and uploaded filename will be converted to
- internal character encoding.
- But, characters specified in 'name' of HTML form will not be
- converted.
+ Beginning with PHP 4.3.3, if enctype for an HTML form is
+ set to multipart/form-data and
+ mbstring.encoding_translation is set to On
+ in &php.ini;, the POSTed variables and the names of uploaded files
+ will be converted to the internal character encoding as well.
+ However, the conversion isn't applied to the query keys.
@@ -166,9 +191,8 @@ mbstring.encoding_translation = Off
When using PHP as an Apache module, it is possible to
- override PHP ini setting per Virtual Host in
- &httpd.conf; or per directory with
- &htaccess;. Refer to the Configuration section and
Apache Manual for details.
@@ -186,7 +210,7 @@ mbstring.encoding_translation = Off
- For PHP3-i18n users, mbstring's output
+ PHP3-i18n users should note that mbstring's output
conversion differs from PHP3-i18n. Character encoding is
converted using output buffer.
@@ -236,51 +260,101 @@ ob_start('mb_output_handler');
Supported Character Encodings
- Currently, the following character encoding is supported by the
- mbstring module. Character encoding may
- be specified for mbstring functions'
- encoding parameter.
+ Currently the following character encodings are supported by the
+ mbstring module. Any of these character encodings
+ can be specified in the encoding parameter of
+ mbstring functions.
The following character encoding is supported in this PHP
extension:
-
- UCS-4, UCS-4BE,
- UCS-4LE, UCS-2,
- UCS-2BE, UCS-2LE,
- UTF-32, UTF-32BE,
- UTF-32LE, UCS-2LE,
- UTF-16, UTF-16BE,
- UTF-16LE, UTF-8,
- UTF-7, ASCII,
- EUC-JP, SJIS,
- eucJP-win, SJIS-win,
- ISO-2022-JP, JIS,
- ISO-8859-1, ISO-8859-2,
- ISO-8859-3, ISO-8859-4,
- ISO-8859-5, ISO-8859-6,
- ISO-8859-7, ISO-8859-8,
- ISO-8859-9, ISO-8859-10,
- ISO-8859-13, ISO-8859-14,
- ISO-8859-15, byte2be,
- byte2le, byte4be,
- byte4le, BASE64,
- 7bit, 8bit and
- UTF7-IMAP.
-
-
- As of PHP 4.3.0, the following character encoding support will be added
- experimentally :
- EUC-CN, CP936, HZ,
- EUC-TW, CP950, BIG-5,
- EUC-KR, UHC (CP949),
- ISO-2022-KR,
- Windows-1251 (CP1251),
- Windows-1252 (CP1252),
- CP866,
- KOI8-R.
-
+
+ UCS-4
+ UCS-4BE
+
+ UCS-4LE
+ UCS-2
+
+ UCS-2BE
+ UCS-2LE
+
+ UTF-32
+ UTF-32BE
+
+ UTF-32LE
+ UCS-2LE
+
+ UTF-16
+ UTF-16BE
+
+ UTF-16LE
+ UTF-8
+
+ UTF-7
+ ASCII
+
+ EUC-JP
+ SJIS
+
+ eucJP-win
+ SJIS-win
+
+ ISO-2022-JP
+ JIS
+
+ ISO-8859-1
+ ISO-8859-2
+
+ ISO-8859-3
+ ISO-8859-4
+
+ ISO-8859-5
+ ISO-8859-6
+
+ ISO-8859-7
+ ISO-8859-8
+
+ ISO-8859-9
+ ISO-8859-10
+
+ ISO-8859-13
+ ISO-8859-14
+
+ ISO-8859-15
+ byte2be
+
+ byte2le
+ byte4be
+
+ byte4le
+ BASE64
+
+ 7bit
+ 8bit
+ UTF7-IMAP
+ EUC-CN
+ CP936
+ HZ
+
+ EUC-TW
+ CP950
+ BIG-5
+
+ EUC-KR
+ UHC (CP949)
+
+ ISO-2022-KR
+
+ Windows-1251 (CP1251)
+
+ Windows-1252 (CP1252)
+
+ CP866
+
+ KOI8-R
+
+
&php.ini; entry, which accepts encoding name,
accepts "auto" and
@@ -294,56 +368,48 @@ ob_start('mb_output_handler');
If "auto" is set, it is expanded to
+ the list of encodings defined per the NLS.
+ For instance, if the NLS is set to Japanese,
+ the value is assumed to be
"ASCII,JIS,UTF-8,EUC-JP,SJIS".
See also mb_detect_order
-
-
- "Supported character encoding" does not mean that it
- works as internal character code.
-
-
- Overloading PHP string functions with multi byte string functions
+ Function Overloading Feature
- Because almost PHP application written for language using
- single-byte character encoding, there are some difficulties for
- multibyte string handling including Japanese. Most PHP string
- functions such as substr do not support
- multibyte strings.
+ You might often find it difficult to get an existing PHP application
+ to work in a given multibyte environment. That's mostly because many
+ PHP applications are written with the standard
+ string functions such as substr, which are
+ known not to handle multibyte-encoded strings properly.
- Multibyte extension (mbstring) has some PHP string functions
- with multibyte support (ex. substr supports
- mb_substr).
+ mbstring supports a 'function overloading' feature, which enables
+ you to add multibyte awareness to such an application without
+ code modification by overloading the standard string functions
+ with their multibyte counterparts. For example,
+ mb_substr is called instead of
+ substr if function overloading is enabled.
+ This feature makes it easy to port applications that only support
+ single-byte encodings to a multibyte environment in many cases.
- Multibyte extension (mbstring) also supports 'function
- overloading' to add multibyte string functionality without
- code modification. Using function overloading, some PHP string
- functions will be overloaded multibyte string functions.
- For example, mb_substr is called
- instead of substr if function overloading
- is enabled. Function overload makes easy to port application
- supporting only single-byte encoding for multibyte application.
-
-
- mbstring.func_overload in &php.ini; should be
- set some positive value to use function overloading.
- The value should specify the category of overloading functions,
- should be set 1 to enable mail function overloading. 2 to enable
- string functions, 4 to regular expression functions. For
- example, if is set for 7, mail, strings, regex functions should
- be overloaded. The list of overloaded functions are shown in
- below.
+ To use function overloading, set
+ mbstring.func_overload in &php.ini; to a
+ positive value that represents a combination of bitmasks specifying
+ the categories of functions to be overloaded. Use
+ 1 to overload the mail function, 2 for string
+ functions and 4 for regular expression functions. For example,
+ if it is set to 7, the mail, string and regular expression functions
+ are all overloaded. The list of overloaded functions is shown below.
- Functions to be overloaded
+ Functions to be overloaded
@@ -417,7 +483,7 @@ ob_start('mb_output_handler');
4splitmb_split
-
+
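The bitmask semantics of mbstring.func_overload described in this section can be sketched in a few lines of PHP. The helper describe_func_overload() and the category labels are hypothetical names introduced here for illustration; they are not part of mbstring. Only the values 1, 2 and 4 and their meanings come from the text.

```php
<?php
// Hypothetical lookup table: bit value => overloaded function category.
const OVERLOAD_CATEGORIES = [
    1 => 'mail',
    2 => 'string',
    4 => 'regex',
];

/**
 * Hypothetical helper: list the categories that a given
 * mbstring.func_overload value would enable.
 */
function describe_func_overload(int $value): array
{
    $enabled = [];
    foreach (OVERLOAD_CATEGORIES as $bit => $name) {
        // A category is active when its bit is set in the value.
        if (($value & $bit) !== 0) {
            $enabled[] = $name;
        }
    }
    return $enabled;
}

echo implode(', ', describe_func_overload(7)), "\n"; // prints "mail, string, regex"
echo implode(', ', describe_func_overload(6)), "\n"; // prints "string, regex"
```

Because the categories are independent bits, any sum of 1, 2 and 4 selects exactly the corresponding subset.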
@@ -425,46 +491,58 @@ ob_start('mb_output_handler');
- Basics of Japanese multi-byte characters
+ Basics of Japanese multi-byte encodings
- Most Japanese characters need more than 1 byte per character. In
- addition, several character encoding schemes are used under a
- Japanese environment. There are EUC-JP, Shift_JIS(SJIS) and
- ISO-2022-JP(JIS) character encoding. As Unicode becomes popular,
- UTF-8 is used also. To develop Web applications for a Japanese
- environment, it is important to use the character set for the
- task in hand, whether HTTP input/output, RDBMS and E-mail.
+ It is often said to be quite hard to figure out how Japanese text is
+ handled on computers. This is not only because Japanese characters
+ can only be represented by multibyte encodings, but also because different
+ encoding standards are adopted for different purposes and platforms.
+ Moreover, quite a few character set standards are in use, which
+ differ slightly from one another. These facts have often led
+ developers into an inevitable mess.
+
+
+ To create a working web application for a Japanese
+ environment, it is important to use the proper character encoding and
+ character set for the task at hand.
- Storage for a character can be up to six
- bytes
+ Storage for a character can be up to six bytes
- A multi-byte character is usually twice of the width compared
- to single-byte characters. Wider characters are called
- "zen-kaku" - meaning full width, narrower characters are
- called "han-kaku" - meaning half width. "zen-kaku" characters
- are usually fixed width.
+ Most multibyte characters appear twice as wide as
+ a single-byte character on display. Those characters are called
+ "zen-kaku" in Japanese, which means "full width", and the other
+ (narrower) characters are called "han-kaku", which means "half width".
+ However, the graphical properties of the characters depend on
+ the glyphs of the typefaces used to display or print them.
- Some character encoding defines shift(escape) sequence for
- entering/exiting multi-byte character strings.
+ Some character encodings use shift (escape) sequences defined
+ in ISO2022 to switch the code map of the specific code area
+ (00h to 7fh).
- ISO-2022-JP must be used for SMTP/NNTP.
+ ISO-2022-JP should be used in SMTP/NNTP, and headers and entities
+ should be reencoded as per RFC requirements. Although these are not
+ strict requirements, it's still a good idea because several popular user
+ agents cannot recognize any other encoding methods.
-
- "i-mode" web site is supposed to use SJIS.
-
+
+ Web pages created for mobile phone services such as
+ i-mode,
+ Vodafone live!, or ezweb
+ are supposed to use Shift_JIS.
+
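The notes above on ISO2022 escape sequences and on using ISO-2022-JP for mail can be sketched as follows. This assumes mbstring is enabled and that the application works in UTF-8 internally; the sample text and variable names are illustrative only.

```php
<?php
// Assumption: the application uses UTF-8 as its internal encoding.
mb_internal_encoding('UTF-8');

// "日本語" as explicit UTF-8 bytes.
$subject = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E";

// Encode the text as a MIME header word in ISO-2022-JP using
// Base64 ("B") transfer encoding.
$header = mb_encode_mimeheader($subject, 'ISO-2022-JP', 'B');

// Convert the same text to ISO-2022-JP for use as a message body.
// The result switches code maps with escape sequences, so every
// byte stays in the 7-bit range (00h to 7fh).
$body = mb_convert_encoding($subject, 'ISO-2022-JP', 'UTF-8');

// Converting back recovers the original UTF-8 text.
$roundTrip = mb_convert_encoding($body, 'UTF-8', 'ISO-2022-JP');

echo $header, "\n";
echo bin2hex($body), "\n";
```

The 7-bit property is also why such an encoding is hard to use as an internal encoding: its multibyte characters are built from bytes in the ASCII range.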
@@ -473,14 +551,14 @@ ob_start('mb_output_handler');
References
- Multi-byte character encoding and its related issues are very
- complex. It is impossible to cover in sufficient detail
- here. Please refer to the following URLs and other resources for
+ Multibyte character encoding schemes and the related issues are very
+ complicated. There is too little space here to cover them in sufficient detail.
+ Please refer to the following URLs and other resources for
further readings.
- Unicode/UTF/UCS/etc
+ Unicode materials
&url.unicode;
@@ -488,13 +566,14 @@ ob_start('mb_output_handler');
- Japanese/Korean/Chinese character
- information
+ Japanese/Korean/Chinese character information
-
- ftp://ftp.ora.com/pub/examples/nutshell/ujip/doc/cjk.inf
-
+
+
+ ftp://ftp.ora.com/pub/examples/nutshell/ujip/doc/cjk.inf
+
+