Added a few grammatical fixes and provided a more in-depth explanation of why we need mbstring because of the limitations of a byte.

git-svn-id: https://svn.php.net/repository/phpdoc/en/trunk@210636 c90b9560-bf6c-de11-be94-00142212c4b1
Derek Ford 2006-04-03 21:39:59 +00:00
parent 65201dc069
commit 7adb369ad7


@@ -1,5 +1,5 @@
<?xml version="1.0" encoding="iso-8859-1"?>
<!-- $Revision: 1.22 $ -->
<!-- $Revision: 1.23 $ -->
<!-- Purpose: international -->
<!-- Membership: bundled -->
@@ -12,12 +12,14 @@
&reftitle.intro;
<para>
While there are many languages in which every necessary character can
be represented by a one-to-one mapping to a 8-bit value, there are also
be represented by a one-to-one mapping to an 8-bit value, there are also
several languages which require so many characters for written
communication that cannot be contained within the range a mere byte can
code. Multibyte character encoding schemes were developed to express
that many (more than 256) characters in the regular bytewise coding
system.
communication that they cannot be contained within the range a mere byte
can code. (A byte is made up of eight bits. Each bit can contain only two
distinct values, one or zero. Because of this, a byte can only represent
256 unique values: two to the power of eight.) Multibyte character
encoding schemes were developed to express more than 256 characters
in the regular bytewise coding system.
</para>
<para>
When you manipulate (trim, split, splice, etc.) strings encoded in a
@@ -29,17 +31,12 @@
most likely loses its original meaning.
</para>
<para>
<literal>mbstring</literal> provides these multibyte specific
string functions that help you deal with multibyte encodings in PHP,
which is basically supposed to be used with single byte encodings.
In addition to that, <literal>mbstring</literal> handles character
encoding conversion between the possible encoding pairs.
</para>
<para>
<literal>mbstring</literal> is also designed to handle Unicode-based
encodings such as UTF-8 and UCS-2 and many single-byte encodings
for convenience (listed below), whereas <literal>mbstring</literal> was
originally developed for use in Japanese web pages.
<literal>mbstring</literal> provides multibyte-specific string functions
that help you deal with multibyte encodings in PHP. In addition to that,
<literal>mbstring</literal> handles character encoding conversion between
the possible encoding pairs. <literal>mbstring</literal> is designed to
handle Unicode-based encodings such as UTF-8 and UCS-2 and many
single-byte encodings for convenience (listed below).
</para>
<section id="mbstring.php4.req">
@@ -115,14 +112,14 @@ JIS, SJIS, ISO-2022-JP, BIG-5
</note>
<note>
<para>
If you have some database connected with PHP, it is recommended that
you use the same character encoding for both database and the
If you are connecting to a database with PHP, it is recommended that
you use the same character encoding for both the database and the
<literal>internal encoding</literal> for ease of use and better
performance.
</para>
<para>
If you are using PostgreSQL, the character encoding used in the
database and the one used in the PHP may differ as it supports
database and the one used in PHP may differ as it supports
automatic character set conversion between the backend and the frontend.
</para>
</note>
@@ -175,7 +172,7 @@ JIS, SJIS, ISO-2022-JP, BIG-5
</simpara>
<para>
There is no way to control HTTP input character
conversion from PHP script. To disable HTTP input character
conversion from a PHP script. To disable HTTP input character
conversion, it has to be done in &php.ini;.
<example>
<title>
@@ -207,14 +204,14 @@ mbstring.encoding_translation = Off
There are several ways to enable output character encoding
conversion. One is using &php.ini;, another
is using <function>ob_start</function> with
<function>mb_output_handler</function> as
<function>mb_output_handler</function> as the
<literal>ob_start</literal> callback function.
</para>
<note>
<para>
PHP3-i18n users should note that <literal>mbstring</literal>'s output
conversion differs from PHP3-i18n. Character encoding is
converted using output buffer.
converted using an output buffer.
</para>
</note>
</listitem>
@@ -268,7 +265,7 @@ ob_start('mb_output_handler');
<literal>mbstring</literal> functions.
</simpara>
<para>
The following character encoding is supported in this PHP
The following character encodings are supported in this PHP
extension:
</para>
<itemizedlist>
@@ -330,11 +327,11 @@ ob_start('mb_output_handler');
<listitem><simpara>KOI8-R</simpara></listitem>
</itemizedlist>
<para>
&php.ini; entry, which accepts encoding name,
accepts &quot;<literal>auto</literal>&quot; and
&quot;<literal>pass</literal>&quot; also.
<literal>mbstring</literal> functions, which accepts encoding
name, and accepts &quot;<literal>auto</literal>&quot;.
Any &php.ini; entry which accepts an encoding name
can also use the values &quot;<literal>auto</literal>&quot; and
&quot;<literal>pass</literal>&quot;.
<literal>mbstring</literal> functions which accept an encoding
name can also use the value &quot;<literal>auto</literal>&quot;.
</para>
<para>
If &quot;<literal>pass</literal>&quot; is set, no character
@@ -358,13 +355,13 @@ ob_start('mb_output_handler');
</title>
<para>
You might often find it difficult to get an existing PHP application
work in a given multibyte environment. That's mostly because lots of
PHP applications out there are written with the standard
string functions such as <function>substr</function>, which are
known to not properly handle multibyte-encoded strings.
to work in a given multibyte environment. This happens because most
PHP applications out there are written with the standard string
functions such as <function>substr</function>, which are known to
not properly handle multibyte-encoded strings.
</para>
<para>
mbstring supports 'function overloading' feature which enables
mbstring supports a 'function overloading' feature which enables
you to add multibyte awareness to such an application without
code modification by overloading multibyte counterparts on
the standard string functions. For example,
@@ -374,13 +371,13 @@ ob_start('mb_output_handler');
single-byte encodings to a multibyte environment in many cases.
</para>
<para>
To use the function overloading, set
To use function overloading, set
<literal>mbstring.func_overload</literal> in &php.ini; to a
positive value that represents a combination of bitmasks specifying
the categories of functions to be overloaded. It should be set
to 1 to overload the <function>mail</function> function. 2 for string
functions, 4 for regular expression functions. For example,
if is set for 7, mail, strings and regular expression functions should
if it is set to 7, mail, strings and regular expression functions will
be overloaded. The list of overloaded functions is shown below.
<table>
<title>Functions to be overloaded</title>
@@ -475,18 +472,13 @@ ob_start('mb_output_handler');
<section id="mbstring.ja-basic">
<title>Basics of Japanese multi-byte encodings</title>
<para>
It is often said quite hard to figure out how Japanese texts are
handled in the computer. This is not only because Japanese characters
can only be represented by multibyte encodings, but because different
encoding standards are adopted for different purposes / platforms.
Moreover, not a few character set standards are used there, which
are slightly different from one another. Those facts have often led
developers to inevitable mess-up.
</para>
<para>
To create a working web application that would be put in the Japanese
environment, it is important to use the proper character encoding and
character set for the task in hand.
Japanese characters can only be represented by multibyte encodings,
and multiple encoding standards are used depending on platform and
text purpose. To make matters worse, these encoding standards
differ slightly from one another. In order to create a web
application which would be usable in a Japanese environment, a
developer has to keep these complexities in mind to ensure that the
proper character encodings are used.
</para>
<para>
<itemizedlist>
@@ -495,18 +487,19 @@ ob_start('mb_output_handler');
</listitem>
<listitem>
<simpara>
Most of multibyte characters often appear twice as wide as
a single-byte character on display. Those characters are called
"zen-kaku" in Japanese which means "full width", and the other
(narrower) characters are called "han-kaku" - means half width.
However the graphical properties of the characters depend on
the glyphs of the type faces used to display them or print them out.
Most Japanese multibyte characters appear twice as wide as
single-byte characters. These characters are called
&quot;zen-kaku&quot; in Japanese, which means &quot;full width&quot;.
Other, narrower, characters are called &quot;han-kaku&quot;,
which means &quot;half width&quot;. The graphical properties
of the characters, however, depend upon the typefaces used
to display them.
</simpara>
</listitem>
<listitem>
<simpara>
Some character encodings use shift(escape) sequences defined
in ISO2022 to switch the code map of the specific code area
in ISO-2022 to switch the code map of the specific code area
(<literal>00h</literal> to <literal>7fh</literal>).
</simpara>
</listitem>
@@ -533,10 +526,10 @@ ob_start('mb_output_handler');
<section id="mbstring.ref">
<title>References</title>
<para>
Multibyte character encoding schemes and the related issues are very
complicated. There should be too few space to cover in sufficient details.
Please refer to the following URLs and other resources for
further readings.
Multibyte character encoding schemes and their related issues are
fairly complicated, and are beyond the scope of this documentation.
Please refer to the following URLs and other resources for further
information regarding these topics.
<itemizedlist>
<listitem>
<para>