mirror of
https://github.com/sigmasternchen/php-doc-en
synced 2025-03-16 08:58:56 +00:00
Added a few grammatical fixes and provided a more in-depth explanation of why we need mbstring because of the limitations of a byte.
git-svn-id: https://svn.php.net/repository/phpdoc/en/trunk@210636 c90b9560-bf6c-de11-be94-00142212c4b1
parent 65201dc069
commit 7adb369ad7
1 changed files with 52 additions and 59 deletions
@@ -1,5 +1,5 @@
 <?xml version="1.0" encoding="iso-8859-1"?>
-<!-- $Revision: 1.22 $ -->
+<!-- $Revision: 1.23 $ -->
 <!-- Purpose: international -->
 <!-- Membership: bundled -->
 
@@ -12,12 +12,14 @@
 &reftitle.intro;
 <para>
 While there are many languages in which every necessary character can
-be represented by a one-to-one mapping to a 8-bit value, there are also
+be represented by a one-to-one mapping to an 8-bit value, there are also
 several languages which require so many characters for written
-communication that cannot be contained within the range a mere byte can
-code. Multibyte character encoding schemes were developed to express
-that many (more than 256) characters in the regular bytewise coding
-system.
+communication that they cannot be contained within the range a mere byte
+can code (A byte is made up of eight bits. Each bit can contain only two
+distinct values, one or zero. Because of this, a byte can only represent
+256 unique values (two to the power of eight)). Multibyte character
+encoding schemes were developed to express more than 256 characters
+in the regular bytewise coding system.
 </para>
 <para>
 When you manipulate (trim, split, splice, etc.) strings encoded in a
@@ -29,17 +31,12 @@
 most likely loses its original meaning.
 </para>
 <para>
-<literal>mbstring</literal> provides these multibyte specific
-string functions that help you deal with multibyte encodings in PHP,
-which is basically supposed to be used with single byte encodings.
-In addition to that, <literal>mbstring</literal> handles character
-encoding conversion between the possible encoding pairs.
-</para>
-<para>
-<literal>mbstring</literal> is also designed to handle Unicode-based
-encodings such as UTF-8 and UCS-2 and many single-byte encodings
-for convenience (listed below), whereas <literal>mbstring</literal> was
-originally developed for use in Japanese web pages.
+<literal>mbstring</literal> provides multibyte specific string functions
+that help you deal with multibyte encodings in PHP. In addition to that,
+<literal>mbstring</literal> handles character encoding conversion between
+the possible encoding pairs. <literal>mbstring</literal> is designed to
+handle Unicode-based encodings such as UTF-8 and UCS-2 and many
+single-byte encodings for convenience (listed below).
 </para>
 
 <section id="mbstring.php4.req">
@@ -115,14 +112,14 @@ JIS, SJIS, ISO-2022-JP, BIG-5
 </note>
 <note>
 <para>
-If you have some database connected with PHP, it is recommended that
-you use the same character encoding for both database and the
+If you are connecting to a database with PHP, it is recommended that
+you use the same character encoding for both the database and the
 <literal>internal encoding</literal> for ease of use and better
 performance.
 </para>
 <para>
 If you are using PostgreSQL, the character encoding used in the
-database and the one used in the PHP may differ as it supports
+database and the one used in PHP may differ as it supports
 automatic character set conversion between the backend and the frontend.
 </para>
 </note>
@@ -175,7 +172,7 @@ JIS, SJIS, ISO-2022-JP, BIG-5
 </simpara>
 <para>
 There is no way to control HTTP input character
-conversion from PHP script. To disable HTTP input character
+conversion from a PHP script. To disable HTTP input character
 conversion, it has to be done in &php.ini;.
 <example>
 <title>
@@ -207,14 +204,14 @@ mbstring.encoding_translation = Off
 There are several ways to enable output character encoding
 conversion. One is using &php.ini;, another
 is using <function>ob_start</function> with
-<function>mb_output_handler</function> as
+<function>mb_output_handler</function> as the
 <literal>ob_start</literal> callback function.
 </para>
 <note>
 <para>
 PHP3-i18n users should note that <literal>mbstring</literal>'s output
 conversion differs from PHP3-i18n. Character encoding is
-converted using output buffer.
+converted using an output buffer.
 </para>
 </note>
 </listitem>
@@ -268,7 +265,7 @@ ob_start('mb_output_handler');
 <literal>mbstring</literal> functions.
 </simpara>
 <para>
-The following character encoding is supported in this PHP
+The following character encodings are supported in this PHP
 extension:
 </para>
 <itemizedlist>
@@ -330,11 +327,11 @@ ob_start('mb_output_handler');
 <listitem><simpara>KOI8-R</simpara></listitem>
 </itemizedlist>
 <para>
-&php.ini; entry, which accepts encoding name,
-accepts "<literal>auto</literal>" and
-"<literal>pass</literal>" also.
-<literal>mbstring</literal> functions, which accepts encoding
-name, and accepts "<literal>auto</literal>".
+Any &php.ini; entry which accepts an encoding name
+can also use the values "<literal>auto</literal>" and
+"<literal>pass</literal>".
+<literal>mbstring</literal> functions which accept an encoding
+name can also use the value "<literal>auto</literal>".
 </para>
 <para>
 If "<literal>pass</literal>" is set, no character
@@ -358,13 +355,13 @@ ob_start('mb_output_handler');
 </title>
 <para>
 You might often find it difficult to get an existing PHP application
-work in a given multibyte environment. That's mostly because lots of
-PHP applications out there are written with the standard
-string functions such as <function>substr</function>, which are
-known to not properly handle multibyte-encoded strings.
+to work in a given multibyte environment. This happens because most
+PHP applications out there are written with the standard string
+functions such as <function>substr</function>, which are known to
+not properly handle multibyte-encoded strings.
 </para>
 <para>
-mbstring supports 'function overloading' feature which enables
+mbstring supports a 'function overloading' feature which enables
 you to add multibyte awareness to such an application without
 code modification by overloading multibyte counterparts on
 the standard string functions. For example,
@@ -374,13 +371,13 @@ ob_start('mb_output_handler');
 single-byte encodings to a multibyte environment in many cases.
 </para>
 <para>
-To use the function overloading, set
+To use function overloading, set
 <literal>mbstring.func_overload</literal> in &php.ini; to a
 positive value that represents a combination of bitmasks specifying
 the categories of functions to be overloaded. It should be set
 to 1 to overload the <function>mail</function> function. 2 for string
 functions, 4 for regular expression functions. For example,
-if is set for 7, mail, strings and regular expression functions should
+if it is set to 7, mail, strings and regular expression functions will
 be overloaded. The list of overloaded functions are shown below.
 <table>
 <title>Functions to be overloaded</title>
@@ -475,18 +472,13 @@ ob_start('mb_output_handler');
 <section id="mbstring.ja-basic">
 <title>Basics of Japanese multi-byte encodings</title>
 <para>
-It is often said quite hard to figure out how Japanese texts are
-handled in the computer. This is not only because Japanese characters
-can only be represented by multibyte encodings, but because different
-encoding standards are adopted for different purposes / platforms.
-Moreover, not a few character set standards are used there, which
-are slightly different from one another. Those facts have often led
-developers to inevitable mess-up.
-</para>
-<para>
-To create a working web application that would be put in the Japanese
-environment, it is important to use the proper character encoding and
-character set for the task in hand.
+Japanese characters can only be represented by multibyte encodings,
+and multiple encoding standards are used depending on platform and
+text purpose. To make matters worse, these encoding standards
+differ slightly from one another. In order to create a web
+application which would be usable in a Japanese environment, a
+developer has to keep these complexities in mind to ensure that the
+proper character encodings are used.
 </para>
 <para>
 <itemizedlist>
@@ -495,18 +487,19 @@ ob_start('mb_output_handler');
 </listitem>
 <listitem>
 <simpara>
-Most of multibyte characters often appear twice as wide as
-a single-byte character on display. Those characters are called
-"zen-kaku" in Japanese which means "full width", and the other
-(narrower) characters are called "han-kaku" - means half width.
-However the graphical properties of the characters depend on
-the glyphs of the type faces used to display them or print them out.
+Most Japanese multibyte characters appear twice as wide as
+single-byte characters. These characters are called "
+zen-kaku" in Japanese, which means "full width".
+Other, narrower, characters are called "han-kaku",
+which means "half width". The graphical properties
+of the characters, however, depends upon the type faces used
+to display them.
 </simpara>
 </listitem>
 <listitem>
 <simpara>
 Some character encodings use shift(escape) sequences defined
-in ISO2022 to switch the code map of the specific code area
+in ISO-2022 to switch the code map of the specific code area
 (<literal>00h</literal> to <literal>7fh</literal>).
 </simpara>
 </listitem>
@@ -533,10 +526,10 @@ ob_start('mb_output_handler');
 <section id="mbstring.ref">
 <title>References</title>
 <para>
-Multibyte character encoding schemes and the related issues are very
-complicated. There should be too few space to cover in sufficient details.
-Please refer to the following URLs and other resources for
-further readings.
+Multibyte character encoding schemes and their related issues are
+fairly complicated, and are beyond the scope of this documentation.
+Please refer to the following URLs and other resources for further
+information regarding these topics.
 <itemizedlist>
 <listitem>
 <para>