mirror of
https://github.com/sigmasternchen/php-doc-en
synced 2025-03-19 10:28:54 +00:00

git-svn-id: https://svn.php.net/repository/phpdoc/en/trunk@165127 c90b9560-bf6c-de11-be94-00142212c4b1
<?xml version="1.0" encoding="iso-8859-1"?>
|
|
<!-- $Revision: 1.21 $ -->
|
|
<reference id="ref.mbstring">
|
|
<title>Multibyte String Functions</title>
|
|
<titleabbrev>Multibyte String</titleabbrev>
|
|
<partintro>
|
|
|
|
<section id="mbstring.intro">
|
|
&reftitle.intro;
|
|
<para>
|
|
While there are many languages in which every necessary character can
be represented by a one-to-one mapping to an 8-bit value, there are also
several languages which require so many characters for written
communication that they cannot be contained within the range a single
byte can encode. Multibyte character encoding schemes were developed to
express more than 256 characters in the regular bytewise coding system.
|
|
</para>
|
|
<para>
|
|
When you manipulate (trim, split, splice, etc.) strings encoded in a
multibyte encoding, you need to use special functions, since two or more
consecutive bytes may represent a single character in such encoding
schemes. If you apply a non-multibyte-aware string function to such a
string, it will probably fail to detect where a multibyte character
begins or ends, and will produce a corrupted string that most likely
loses its original meaning.
|
|
</para>
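<para>
The difference can be sketched as follows. This is only an illustration:
the sample string is an assumption, and it is written as UTF-8 byte
escapes so that the script source itself stays plain ASCII.
<example>
<title>Byte-oriented versus multibyte-aware string functions</title>
<programlisting role="php">
<![CDATA[
<?php

// Three Japanese characters (U+65E5, U+672C, U+8A9E) encoded in UTF-8,
// three bytes each
$str = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E";

// Byte-oriented functions count and cut bytes, not characters
echo strlen($str), "\n";                 // 9
echo bin2hex(substr($str, 0, 4)), "\n";  // e697a5e6 (second character is cut apart)

// Multibyte-aware functions count and cut whole characters
echo mb_strlen($str, 'UTF-8'), "\n";                 // 3
echo bin2hex(mb_substr($str, 0, 2, 'UTF-8')), "\n";  // e697a5e69cac

?>
]]>
</programlisting>
</example>
</para>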
|
|
<para>
|
|
<literal>mbstring</literal> provides multibyte-specific string
functions that help you deal with multibyte encodings in PHP, whose
standard string functions basically assume single-byte encodings.
In addition, <literal>mbstring</literal> handles character encoding
conversion between the possible encoding pairs.
|
|
</para>
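<para>
As a minimal sketch of such a conversion, the following converts a string
from UTF-8 to EUC-JP and back with
<function>mb_convert_encoding</function>. The sample string and the choice
of encodings are assumptions made for the illustration.
<example>
<title>Converting between two encodings</title>
<programlisting role="php">
<![CDATA[
<?php

// Three Japanese characters encoded in UTF-8 (three bytes each)
$utf8 = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E";

// Convert to EUC-JP, where each of these characters takes two bytes
$eucjp = mb_convert_encoding($utf8, 'EUC-JP', 'UTF-8');
echo strlen($utf8), ' -> ', strlen($eucjp), "\n"; // 9 -> 6

// Converting back yields the original byte sequence
$back = mb_convert_encoding($eucjp, 'UTF-8', 'EUC-JP');
var_dump($back === $utf8); // bool(true)

?>
]]>
</programlisting>
</example>
</para>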
|
|
<para>
|
|
Although <literal>mbstring</literal> was originally developed for use
in Japanese web pages, it is also designed to handle Unicode-based
encodings such as UTF-8 and UCS-2 and, for convenience, many
single-byte encodings (listed below).
|
|
</para>
|
|
|
|
<section id="mbstring.php4.req">
|
|
<title>PHP Character Encoding Requirements</title>
|
|
<para>
|
|
Encodings of the following types can be safely used with PHP.
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>
|
|
A single-byte encoding,
|
|
<itemizedlist>
|
|
<listitem>
|
|
<simpara>
|
|
which has ASCII-compatible (ISO646 compatible) mappings for the
|
|
characters in range of <literal>00h</literal> to
|
|
<literal>7fh</literal>.
|
|
</simpara>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
A multibyte encoding,
|
|
<itemizedlist>
|
|
<listitem>
|
|
<simpara>
|
|
which has ASCII-compatible mappings for the characters in range of
|
|
<literal>00h</literal> to <literal>7fh</literal>.
|
|
</simpara>
|
|
</listitem>
|
|
<listitem>
|
|
<simpara>
|
|
which doesn't use ISO2022 escape sequences.
|
|
</simpara>
|
|
</listitem>
|
|
<listitem>
|
|
<simpara>
|
|
which doesn't use a value from <literal>00h</literal> to
<literal>7fh</literal> in any of the bytes that together
represent a single character.
|
|
</simpara>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
<para>
|
|
These are examples of character encodings that are unlikely to work
|
|
with PHP.
|
|
<informalexample>
|
|
<programlisting>
|
|
<![CDATA[
JIS, SJIS, ISO-2022-JP, BIG-5
]]>
|
|
</programlisting>
|
|
</informalexample>
|
|
</para>
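<para>
The third multibyte requirement above can be illustrated with SJIS. The
sketch below is only an example: the sample character (KATAKANA LETTER SO,
U+30BD) is an assumption chosen because its second SJIS byte is
<literal>5ch</literal>, the ASCII backslash, which byte-oriented code can
mistake for an escape character.
<example>
<title>An SJIS character containing a byte in the ASCII range</title>
<programlisting role="php">
<![CDATA[
<?php

// KATAKANA LETTER SO (U+30BD) in UTF-8, written as byte escapes
$utf8 = "\xE3\x82\xBD";

// The same character in SJIS is the byte pair 83h 5ch
$sjis = mb_convert_encoding($utf8, 'SJIS', 'UTF-8');
echo bin2hex($sjis), "\n"; // 835c

// A byte-oriented search finds a "backslash" inside the SJIS character
var_dump(strpos($sjis, '\\')); // int(1)

// The UTF-8 form contains no byte in the range 00h to 7fh
var_dump(strpos($utf8, '\\')); // bool(false)

?>
]]>
</programlisting>
</example>
</para>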
|
|
<para>
|
|
Although PHP scripts written in any of those encodings might not work,
especially when encoded strings appear as identifiers
or literals in the script, you can almost always avoid using these
encodings by setting up <literal>mbstring</literal>'s transparent encoding
filter function for incoming HTTP queries.
|
|
</para>
|
|
<note>
|
|
<para>
|
|
Using SJIS, BIG5, CP936, CP949 or GB18030 as
the internal encoding is strongly discouraged unless you are familiar
with the parser, the scanner and the character encoding.
|
|
</para>
|
|
</note>
|
|
<note>
|
|
<para>
|
|
If PHP is connected to a database, it is recommended that
you use the same character encoding for both the database and the
<literal>internal encoding</literal> for ease of use and better
performance.
|
|
</para>
|
|
<para>
|
|
If you are using PostgreSQL, the character encoding used in the
database and the one used in PHP may differ, because PostgreSQL supports
automatic character set conversion between the backend and the frontend.
|
|
</para>
|
|
</note>
|
|
</section>
|
|
</section>
|
|
|
|
&reference.mbstring.configure;
|
|
|
|
&reference.mbstring.ini;
|
|
|
|
<section id="mbstring.resources">
|
|
&reftitle.resources;
|
|
&no.resource;
|
|
</section>
|
|
|
|
&reference.mbstring.constants;
|
|
|
|
<section id="mbstring.http">
|
|
<title>HTTP Input and Output</title>
|
|
<para>
|
|
HTTP input/output character encoding conversion may also convert
binary data. Users are expected to control character
encoding conversion themselves if binary data is used for HTTP
input/output.
|
|
</para>
|
|
<note>
|
|
<para>
|
|
In PHP 4.3.2 and earlier versions, this functionality had a
limitation: <literal>mbstring</literal> does not perform
character encoding conversion on POST data if the
<literal>enctype</literal> attribute of the <literal>form</literal>
element is set to <literal>multipart/form-data</literal>.
In that case you have to convert the incoming data yourself
if necessary.
|
|
</para>
|
|
<para>
|
|
Beginning with PHP 4.3.3, if the <literal>enctype</literal> of the HTML
form is set to <literal>multipart/form-data</literal> and
<literal>mbstring.encoding_translation</literal> is set to On
in &php.ini;, the POSTed variables and the names of uploaded files
are converted to the internal character encoding as well.
However, the conversion is not applied to the query keys.
|
|
</para>
|
|
</note>
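<para>
Where the conversion is not applied automatically, incoming values can be
converted by hand. The following is only a minimal sketch:
<literal>convert_input</literal> is a hypothetical helper, not part of
<literal>mbstring</literal>, and the SJIS source and EUC-JP target
encodings are assumptions. Like the built-in translation, it leaves the
keys untouched.
<example>
<title>Converting incoming data manually</title>
<programlisting role="php">
<![CDATA[
<?php

// Hypothetical helper: converts every string value of an input array
// (recursively) from one encoding to another; keys are not converted
function convert_input(array $input, $to, $from)
{
    foreach ($input as $key => $value) {
        if (is_array($value)) {
            $input[$key] = convert_input($value, $to, $from);
        } elseif (is_string($value)) {
            $input[$key] = mb_convert_encoding($value, $to, $from);
        }
    }
    return $input;
}

// Assumption: the form was submitted in SJIS and the script works in EUC-JP
$_POST = convert_input($_POST, 'EUC-JP', 'SJIS');

?>
]]>
</programlisting>
</example>
</para>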
|
|
<para>
|
|
<itemizedlist>
|
|
<listitem>
|
|
<simpara>
|
|
HTTP Input
|
|
</simpara>
|
|
<para>
|
|
There is no way to control HTTP input character
conversion from a PHP script. HTTP input character
conversion can only be disabled in &php.ini;.
|
|
<example>
|
|
<title>
|
|
Disable HTTP input conversion in &php.ini;
|
|
</title>
|
|
<programlisting role="php">
|
|
<![CDATA[
;; Disable HTTP Input conversion
mbstring.http_input = pass
;; Disable HTTP Input conversion (PHP 4.3.0 or higher)
mbstring.encoding_translation = Off
]]>
|
|
</programlisting>
|
|
</example>
|
|
</para>
|
|
<para>
|
|
When using PHP as an Apache module, it is possible to
|
|
override those settings in each Virtual Host directive in
|
|
&httpd.conf; or per directory with &htaccess;. Refer to the <link
linkend="configuration">Configuration</link> section and
the Apache manual for details.
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<simpara>
|
|
HTTP Output
|
|
</simpara>
|
|
<para>
|
|
There are several ways to enable output character encoding
conversion. One is to use &php.ini;; another
is to call <function>ob_start</function> with
<function>mb_output_handler</function> as the
<literal>ob_start</literal> callback function.
|
|
</para>
|
|
<note>
|
|
<para>
|
|
PHP3-i18n users should note that <literal>mbstring</literal>'s output
conversion differs from that of PHP3-i18n: the character encoding is
converted using an output buffer.
|
|
</para>
|
|
</note>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
<para>
|
|
<example>
|
|
<title>&php.ini; setting example</title>
|
|
<programlisting>
|
|
<![CDATA[
;; Enable output character encoding conversion for all PHP pages

;; Enable Output Buffering
output_buffering = On

;; Set mb_output_handler to enable output conversion
output_handler = mb_output_handler
]]>
|
|
</programlisting>
|
|
</example>
|
|
</para>
|
|
<para>
|
|
<example>
|
|
<title>Script example</title>
|
|
<programlisting role="php">
|
|
<![CDATA[
<?php

// Enable output character encoding conversion only for this page

// Set HTTP output character encoding to SJIS
mb_http_output('SJIS');

// Start buffering and specify "mb_output_handler" as
// callback function
ob_start('mb_output_handler');

?>
]]>
|
|
</programlisting>
|
|
</example>
|
|
</para>
|
|
</section>
|
|
|
|
<section id="mbstring.supported-encodings">
|
|
<title>Supported Character Encodings</title>
|
|
<simpara>
|
|
Currently the following character encodings are supported by the
<literal>mbstring</literal> module. Any of these character encodings
can be specified in the <literal>encoding</literal> parameter of
<literal>mbstring</literal> functions.
|
|
</simpara>
|
|
<para>
|
|
The following character encodings are supported by this PHP
extension:
|
|
</para>
|
|
<itemizedlist>
|
|
<listitem><simpara>UCS-4</simpara></listitem>
|
|
<listitem><simpara>UCS-4BE</simpara></listitem>
|
|
<listitem><simpara>UCS-4LE</simpara></listitem>
|
|
<listitem><simpara>UCS-2</simpara></listitem>
|
|
<listitem><simpara>UCS-2BE</simpara></listitem>
|
|
<listitem><simpara>UCS-2LE</simpara></listitem>
|
|
<listitem><simpara>UTF-32</simpara></listitem>
|
|
<listitem><simpara>UTF-32BE</simpara></listitem>
|
|
<listitem><simpara>UTF-32LE</simpara></listitem>
|
|
<listitem><simpara>UTF-16</simpara></listitem>
|
|
<listitem><simpara>UTF-16BE</simpara></listitem>
|
|
<listitem><simpara>UTF-16LE</simpara></listitem>
|
|
<listitem><simpara>UTF-7</simpara></listitem>
|
|
<listitem><simpara>UTF7-IMAP</simpara></listitem>
|
|
<listitem><simpara>UTF-8</simpara></listitem>
|
|
<listitem><simpara>ASCII</simpara></listitem>
|
|
<listitem><simpara>EUC-JP</simpara></listitem>
|
|
<listitem><simpara>SJIS</simpara></listitem>
|
|
<listitem><simpara>eucJP-win</simpara></listitem>
|
|
<listitem><simpara>SJIS-win</simpara></listitem>
|
|
<listitem><simpara>ISO-2022-JP</simpara></listitem>
|
|
<listitem><simpara>JIS</simpara></listitem>
|
|
<listitem><simpara>ISO-8859-1</simpara></listitem>
|
|
<listitem><simpara>ISO-8859-2</simpara></listitem>
|
|
<listitem><simpara>ISO-8859-3</simpara></listitem>
|
|
<listitem><simpara>ISO-8859-4</simpara></listitem>
|
|
<listitem><simpara>ISO-8859-5</simpara></listitem>
|
|
<listitem><simpara>ISO-8859-6</simpara></listitem>
|
|
<listitem><simpara>ISO-8859-7</simpara></listitem>
|
|
<listitem><simpara>ISO-8859-8</simpara></listitem>
|
|
<listitem><simpara>ISO-8859-9</simpara></listitem>
|
|
<listitem><simpara>ISO-8859-10</simpara></listitem>
|
|
<listitem><simpara>ISO-8859-13</simpara></listitem>
|
|
<listitem><simpara>ISO-8859-14</simpara></listitem>
|
|
<listitem><simpara>ISO-8859-15</simpara></listitem>
|
|
<listitem><simpara>byte2be</simpara></listitem>
|
|
<listitem><simpara>byte2le</simpara></listitem>
|
|
<listitem><simpara>byte4be</simpara></listitem>
|
|
<listitem><simpara>byte4le</simpara></listitem>
|
|
<listitem><simpara>BASE64</simpara></listitem>
|
|
<listitem><simpara>HTML-ENTITIES</simpara></listitem>
|
|
<listitem><simpara>7bit</simpara></listitem>
|
|
<listitem><simpara>8bit</simpara></listitem>
|
|
<listitem><simpara>EUC-CN</simpara></listitem>
|
|
<listitem><simpara>CP936</simpara></listitem>
|
|
<listitem><simpara>HZ</simpara></listitem>
|
|
<listitem><simpara>EUC-TW</simpara></listitem>
|
|
<listitem><simpara>CP950</simpara></listitem>
|
|
<listitem><simpara>BIG-5</simpara></listitem>
|
|
<listitem><simpara>EUC-KR</simpara></listitem>
|
|
<listitem><simpara>UHC (CP949)</simpara></listitem>
|
|
<listitem><simpara>ISO-2022-KR</simpara></listitem>
|
|
<listitem><simpara>Windows-1251 (CP1251)</simpara></listitem>
|
|
<listitem><simpara>Windows-1252 (CP1252)</simpara></listitem>
|
|
<listitem><simpara>CP866 (IBM866)</simpara></listitem>
|
|
<listitem><simpara>KOI8-R</simpara></listitem>
|
|
</itemizedlist>
|
|
<para>
|
|
Any &php.ini; entry that accepts an encoding name
also accepts "<literal>auto</literal>" and
"<literal>pass</literal>".
<literal>mbstring</literal> functions that accept an encoding
name also accept "<literal>auto</literal>".
|
|
</para>
|
|
<para>
|
|
If "<literal>pass</literal>" is set, no character
|
|
encoding conversion is performed.
|
|
</para>
|
|
<para>
|
|
If "<literal>auto</literal>" is set, it is expanded to
|
|
the list of encodings defined per the <link linkend="mbstring.configuration">NLS</link>.
|
|
For instance, if the NLS is set to <literal>Japanese</literal>,
|
|
the value is assumed to be
|
|
"<literal>ASCII,JIS,UTF-8,EUC-JP,SJIS</literal>".
|
|
</para>
|
|
<para>
|
|
See also <function>mb_detect_order</function>.
|
|
</para>
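<para>
As a rough sketch, the detection order can also be set and inspected from
a script. The order used here is the list quoted above for Japanese; the
sample string is an assumption, and the detected name is printed rather
than assumed.
<example>
<title>Setting and inspecting the detection order</title>
<programlisting role="php">
<![CDATA[
<?php

// Set the detection order explicitly to the list quoted above
mb_detect_order('ASCII,JIS,UTF-8,EUC-JP,SJIS');

// Without arguments, mb_detect_order() returns the current order as an array
print_r(mb_detect_order());

// Detect the encoding of a sample string using that order
$sample = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E"; // UTF-8 bytes
echo mb_detect_encoding($sample), "\n";

// mb_list_encodings() lists the encoding names supported by this build
var_dump(in_array('UTF-8', mb_list_encodings(), true));

?>
]]>
</programlisting>
</example>
</para>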
|
|
</section>
|
|
|
|
<section id="mbstring.overload">
|
|
<title>
|
|
Function Overloading Feature
|
|
</title>
|
|
<para>
|
|
You might often find it difficult to get an existing PHP application
to work in a given multibyte environment. That's mostly because many
PHP applications are written with the standard
string functions such as <function>substr</function>, which are
known not to handle multibyte-encoded strings properly.
|
|
</para>
|
|
<para>
|
|
mbstring supports a 'function overloading' feature which enables
you to add multibyte awareness to such an application without
code modification, by overloading the standard string functions
with their multibyte counterparts. For example,
<function>mb_substr</function> is called instead of
<function>substr</function> if function overloading is enabled.
In many cases, this feature makes it easy to port applications that
only support single-byte encodings to a multibyte environment.
|
|
</para>
|
|
<para>
|
|
To use function overloading, set
<literal>mbstring.func_overload</literal> in &php.ini; to a
positive value that represents a combination of bitmasks specifying
the categories of functions to be overloaded. Set it
to 1 to overload the <function>mail</function> function, 2 for string
functions, and 4 for regular expression functions. For example,
if it is set to 7, the mail, string and regular expression functions are
all overloaded. The list of overloaded functions is shown below.
|
|
<table>
|
|
<title>Functions to be overloaded</title>
|
|
<tgroup cols="3">
|
|
<thead>
|
|
<row>
|
|
<entry>value of mbstring.func_overload</entry>
|
|
<entry>original function</entry>
|
|
<entry>overloaded function</entry>
|
|
</row>
|
|
</thead>
|
|
<tbody>
|
|
<row>
|
|
<entry>1</entry>
|
|
<entry><function>mail</function></entry>
|
|
<entry><function>mb_send_mail</function></entry>
|
|
</row>
|
|
<row>
|
|
<entry>2</entry>
|
|
<entry><function>strlen</function></entry>
|
|
<entry><function>mb_strlen</function></entry>
|
|
</row>
|
|
<row>
|
|
<entry>2</entry>
|
|
<entry><function>strpos</function></entry>
|
|
<entry><function>mb_strpos</function></entry>
|
|
</row>
|
|
<row>
|
|
<entry>2</entry>
|
|
<entry><function>strrpos</function></entry>
|
|
<entry><function>mb_strrpos</function></entry>
|
|
</row>
|
|
<row>
|
|
<entry>2</entry>
|
|
<entry><function>substr</function></entry>
|
|
<entry><function>mb_substr</function></entry>
|
|
</row>
|
|
<row>
|
|
<entry>2</entry>
|
|
<entry><function>strtolower</function></entry>
|
|
<entry><function>mb_strtolower</function></entry>
|
|
</row>
|
|
<row>
|
|
<entry>2</entry>
|
|
<entry><function>strtoupper</function></entry>
|
|
<entry><function>mb_strtoupper</function></entry>
|
|
</row>
|
|
<row>
|
|
<entry>2</entry>
|
|
<entry><function>substr_count</function></entry>
|
|
<entry><function>mb_substr_count</function></entry>
|
|
</row>
|
|
<row>
|
|
<entry>4</entry>
|
|
<entry><function>ereg</function></entry>
|
|
<entry><function>mb_ereg</function></entry>
|
|
</row>
|
|
<row>
|
|
<entry>4</entry>
|
|
<entry><function>eregi</function></entry>
|
|
<entry><function>mb_eregi</function></entry>
|
|
</row>
|
|
<row>
|
|
<entry>4</entry>
|
|
<entry><function>ereg_replace</function></entry>
|
|
<entry><function>mb_ereg_replace</function></entry>
|
|
</row>
|
|
<row>
|
|
<entry>4</entry>
|
|
<entry><function>eregi_replace</function></entry>
|
|
<entry><function>mb_eregi_replace</function></entry>
|
|
</row>
|
|
<row>
|
|
<entry>4</entry>
|
|
<entry><function>split</function></entry>
|
|
<entry><function>mb_split</function></entry>
|
|
</row>
|
|
</tbody>
|
|
</tgroup>
|
|
</table>
|
|
</para>
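<para>
The bitmask arithmetic described above can be sketched as follows. The
constant names and the <literal>overloaded_categories</literal> helper are
hypothetical and exist only for this illustration; only the values 1, 2
and 4 come from the description above.
<example>
<title>Combining and decoding func_overload values</title>
<programlisting role="php">
<![CDATA[
<?php

// Hypothetical constant names, used only in this illustration
define('OVERLOAD_MAIL',   1);
define('OVERLOAD_STRING', 2);
define('OVERLOAD_REGEX',  4);

// Combining all three categories gives the value 7 mentioned above
echo OVERLOAD_MAIL | OVERLOAD_STRING | OVERLOAD_REGEX, "\n"; // 7

// Hypothetical helper: lists the categories enabled in a setting value
function overloaded_categories($value)
{
    $names = array();
    if ($value & OVERLOAD_MAIL) {
        $names[] = 'mail';
    }
    if ($value & OVERLOAD_STRING) {
        $names[] = 'string';
    }
    if ($value & OVERLOAD_REGEX) {
        $names[] = 'regex';
    }
    return $names;
}

echo implode(',', overloaded_categories(6)), "\n"; // string,regex

// ini_get() returns false when a directive is not available;
// the cast turns that into 0, meaning nothing is overloaded
$current = (int) ini_get('mbstring.func_overload');
echo implode(',', overloaded_categories($current)), "\n";

?>
]]>
</programlisting>
</example>
</para>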
|
|
<note>
|
|
<para>
|
|
Using the function overloading option in
the per-directory context is not recommended, because it has not yet
been confirmed to be stable enough for a production environment and
may lead to undefined behaviour.
|
|
</para>
|
|
</note>
|
|
</section>
|
|
|
|
<section id="mbstring.ja-basic">
|
|
<title>Basics of Japanese multibyte encodings</title>
|
|
<para>
|
|
It is often said to be quite hard to figure out how Japanese text is
handled by computers. This is not only because Japanese characters
can only be represented by multibyte encodings, but also because different
encoding standards are adopted for different purposes and platforms.
Moreover, quite a few character set standards are in use, and they
differ slightly from one another. These facts have often led
developers into confusion.
|
|
</para>
|
|
<para>
|
|
To create a working web application for use in a Japanese
environment, it is important to use the proper character encoding and
character set for the task at hand.
|
|
</para>
|
|
<para>
|
|
<itemizedlist>
|
|
<listitem>
|
|
<simpara>Storage for a character can be up to six bytes</simpara>
|
|
</listitem>
|
|
<listitem>
|
|
<simpara>
|
|
Most multibyte characters appear twice as wide as
a single-byte character on display. Those characters are called
"zen-kaku" in Japanese, which means "full width", and the other
(narrower) characters are called "han-kaku", which means "half width".
However, the graphical properties of the characters depend on
the glyphs of the typefaces used to display or print them.
|
|
</simpara>
|
|
</listitem>
|
|
<listitem>
|
|
<simpara>
|
|
Some character encodings use shift (escape) sequences defined
in ISO2022 to switch the code map of a specific code area
(<literal>00h</literal> to <literal>7fh</literal>).
|
|
</simpara>
|
|
</listitem>
|
|
<listitem>
|
|
<simpara>
|
|
ISO-2022-JP should be used in SMTP/NNTP, and headers and entities
should be reencoded as per RFC requirements. Although these are not
strict requirements, following them is still a good idea because several
popular user agents cannot recognize any other encoding methods.
|
|
</simpara>
|
|
</listitem>
|
|
<listitem>
|
|
<simpara>
|
|
Webpages created for mobile phone services such as
|
|
<ulink url="&url.imode;">i-mode</ulink>,
|
|
<ulink url="&url.vlife;">Vodafone live!</ulink>, or <ulink url="&url.ezweb;">EZweb</ulink>
|
|
are supposed to use Shift_JIS.
|
|
</simpara>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</para>
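<para>
The "zen-kaku" and "han-kaku" distinction above can be observed with
<function>mb_strwidth</function> and
<function>mb_convert_kana</function>. This is only a sketch; the two
sample characters (the full-width and half-width forms of katakana A) are
assumptions, written as UTF-8 byte escapes.
<example>
<title>Full-width and half-width characters</title>
<programlisting role="php">
<![CDATA[
<?php

$zenkaku = "\xE3\x82\xA2"; // full-width katakana A (U+30A2) in UTF-8
$hankaku = "\xEF\xBD\xB1"; // half-width katakana A (U+FF71) in UTF-8

// Both are a single character of three bytes in UTF-8 ...
echo mb_strlen($zenkaku, 'UTF-8'), ' ', strlen($zenkaku), "\n"; // 1 3
echo mb_strlen($hankaku, 'UTF-8'), ' ', strlen($hankaku), "\n"; // 1 3

// ... but their display widths differ
echo mb_strwidth($zenkaku, 'UTF-8'), "\n"; // 2
echo mb_strwidth($hankaku, 'UTF-8'), "\n"; // 1

// The "K" option converts han-kaku katakana to zen-kaku katakana
var_dump(mb_convert_kana($hankaku, 'K', 'UTF-8') === $zenkaku); // bool(true)

?>
]]>
</programlisting>
</example>
</para>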
|
|
</section>
|
|
|
|
<section id="mbstring.ref">
|
|
<title>References</title>
|
|
<para>
|
|
Multibyte character encoding schemes and the related issues are very
complicated, and there is too little space here to cover them in
sufficient detail. Please refer to the following URLs and other
resources for further reading.
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>
|
|
Unicode materials
|
|
</para>
|
|
<para>
|
|
<ulink url="&url.unicode;">&url.unicode;</ulink>
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
Japanese/Korean/Chinese character information
|
|
</para>
|
|
<para>
|
|
<ulink url="&url.oreilly.cjk-inf;">&url.oreilly.cjk-inf;</ulink>
|
|
</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
</section>
|
|
&reference.mbstring.encodings;
|
|
|
|
</partintro>
|
|
|
|
&reference.mbstring.functions;
|
|
|
|
</reference>
|
|
<!-- Keep this comment at the end of the file
|
|
Local variables:
|
|
mode: sgml
|
|
sgml-omittag:t
|
|
sgml-shorttag:t
|
|
sgml-minimize-attributes:nil
|
|
sgml-always-quote-attributes:t
|
|
sgml-indent-step:1
|
|
sgml-indent-data:t
|
|
indent-tabs-mode:nil
|
|
sgml-parent-document:nil
|
|
sgml-default-dtd-file:"../../../manual.ced"
|
|
sgml-exposed-tags:nil
|
|
sgml-local-catalogs:nil
|
|
sgml-local-ecat-files:nil
|
|
End:
|
|
vim600: syn=xml fen fdm=syntax fdl=2 si
|
|
vim: et tw=78 syn=sgml
|
|
vi: ts=1 sw=1
|
|
-->
|