mirror of
https://github.com/sigmasternchen/php-doc-en
synced 2025-03-19 18:38:55 +00:00

- All id attributes are now xml:id - Add docbook namespace to all root elements - Replace <ulink /> with <link xlink:href /> - Minor markup fixes here and there git-svn-id: https://svn.php.net/repository/phpdoc/en/trunk@238160 c90b9560-bf6c-de11-be94-00142212c4b1
579 lines
21 KiB
XML
579 lines
21 KiB
XML
<?xml version="1.0" encoding="iso-8859-1"?>
|
|
<!-- $Revision: 1.26 $ -->
|
|
<!-- Purpose: international -->
|
|
<!-- Membership: bundled -->
|
|
|
|
<reference xml:id="ref.mbstring" xmlns="http://docbook.org/ns/docbook" xmlns:xlink="http://www.w3.org/1999/xlink">
|
|
<title>Multibyte String Functions</title>
|
|
<titleabbrev>Multibyte String</titleabbrev>
|
|
<partintro>
|
|
|
|
<section xml:id="mbstring.intro">
|
|
&reftitle.intro;
|
|
<para>
|
|
While there are many languages in which every necessary character can
|
|
be represented by a one-to-one mapping to an 8-bit value, there are also
|
|
several languages which require so many characters for written
|
|
communication that they cannot be contained within the range a mere byte
|
|
can code (A byte is made up of eight bits. Each bit can contain only two
|
|
distinct values, one or zero. Because of this, a byte can only represent
|
|
256 unique values (two to the power of eight)). Multibyte character
|
|
encoding schemes were developed to express more than 256 characters
|
|
in the regular bytewise coding system.
|
|
</para>
|
|
<para>
|
|
When you manipulate (trim, split, splice, etc.) strings encoded in a
|
|
multibyte encoding, you need to use special functions since two or more
|
|
consecutive bytes may represent a single character in such encoding
|
|
schemes. Otherwise, if you apply a non-multibyte-aware string function
|
|
to the string, it probably fails to detect the beginning or ending of
|
|
the multibyte character and ends up with a corrupted garbage string that
|
|
most likely loses its original meaning.
|
|
</para>
|
|
<para>
|
|
<literal>mbstring</literal> provides multibyte specific string functions
|
|
that help you deal with multibyte encodings in PHP. In addition to that,
|
|
<literal>mbstring</literal> handles character encoding conversion between
|
|
the possible encoding pairs. <literal>mbstring</literal> is designed to
|
|
handle Unicode-based encodings such as UTF-8 and UCS-2 and many
|
|
single-byte encodings for convenience (listed below).
|
|
</para>
|
|
|
|
<section xml:id="mbstring.php4.req">
|
|
<title>PHP Character Encoding Requirements</title>
|
|
<para>
|
|
Encodings of the following types are safely used with PHP.
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>
|
|
A singlebyte encoding,
|
|
<itemizedlist>
|
|
<listitem>
|
|
<simpara>
|
|
which has ASCII-compatible (ISO646 compatible) mappings for the
|
|
characters in range of <literal>00h</literal> to
|
|
<literal>7fh</literal>.
|
|
</simpara>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
A multibyte encoding,
|
|
<itemizedlist>
|
|
<listitem>
|
|
<simpara>
|
|
which has ASCII-compatible mappings for the characters in range of
|
|
<literal>00h</literal> to <literal>7fh</literal>.
|
|
</simpara>
|
|
</listitem>
|
|
<listitem>
|
|
<simpara>
|
|
which don't use ISO2022 escape sequences.
|
|
</simpara>
|
|
</listitem>
|
|
<listitem>
|
|
<simpara>
|
|
which don't use a value from <literal>00h</literal> to
|
|
<literal>7fh</literal> in any of the compounded bytes
|
|
that represents a single character.
|
|
</simpara>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
<para>
|
|
These are examples of character encodings that are unlikely to work
|
|
with PHP.
|
|
<informalexample>
|
|
<programlisting>
|
|
<![CDATA[
|
|
JIS, SJIS, ISO-2022-JP, BIG-5
|
|
]]>
|
|
</programlisting>
|
|
</informalexample>
|
|
</para>
|
|
<para>
|
|
Although PHP scripts written in any of those encodings might not work,
|
|
especially in the case where encoded strings appear as identifiers
|
|
or literals in the script, you can almost avoid using these encodings
|
|
by setting up the <literal>mbstring</literal>'s transparent encoding
|
|
filter function for incoming HTTP queries.
|
|
</para>
|
|
<note>
|
|
<para>
|
|
It's highly discouraged to use SJIS, BIG5, CP936, CP949 and GB18030 for
|
|
the internal encoding unless you are familiar with the parser, the
|
|
scanner and the character encoding.
|
|
</para>
|
|
</note>
|
|
<note>
|
|
<para>
|
|
If you are connecting to a database with PHP, it is recommended that
|
|
you use the same character encoding for both the database and the
|
|
<literal>internal encoding</literal> for ease of use and better
|
|
performance.
|
|
</para>
|
|
<para>
|
|
If you are using PostgreSQL, the character encoding used in the
|
|
database and the one used in PHP may differ as it supports
|
|
automatic character set conversion between the backend and the frontend.
|
|
</para>
|
|
</note>
|
|
</section>
|
|
</section>
|
|
|
|
&reference.mbstring.configure;
|
|
|
|
&reference.mbstring.ini;
|
|
|
|
<section xml:id="mbstring.resources">
|
|
&reftitle.resources;
|
|
&no.resource;
|
|
</section>
|
|
|
|
&reference.mbstring.constants;
|
|
|
|
<section xml:id="mbstring.http">
|
|
<title>HTTP Input and Output</title>
|
|
<para>
|
|
HTTP input/output character encoding conversion may convert
|
|
binary data also. Users are supposed to control character
|
|
encoding conversion if binary data is used for HTTP
|
|
input/output.
|
|
</para>
|
|
<note>
|
|
<para>
|
|
In PHP 4.3.2 or earlier versions, there was a limitation in this
|
|
functionality that <literal>mbstring</literal> does not perform
|
|
character encoding conversion in POST data if the
|
|
<literal>enctype</literal> attribute in the <literal>form</literal>
|
|
element is set to <literal>multipart/form-data</literal>.
|
|
So you have to convert the incoming data by yourself in this case
|
|
if necessary.
|
|
</para>
|
|
<para>
|
|
Beginning with PHP 4.3.3, if <literal>enctype</literal> for HTML form is
|
|
set to <literal>multipart/form-data</literal> and
|
|
<literal>mbstring.encoding_translation</literal> is set to On
|
|
in &php.ini; the POST'ed variables and the names of uploaded files
|
|
will be converted to the internal character encoding as well.
|
|
However, the conversion isn't applied to the query keys.
|
|
</para>
|
|
</note>
|
|
<para>
|
|
<itemizedlist>
|
|
<listitem>
|
|
<simpara>
|
|
HTTP Input
|
|
</simpara>
|
|
<para>
|
|
There is no way to control HTTP input character
|
|
conversion from a PHP script. To disable HTTP input character
|
|
conversion, it has to be done in &php.ini;.
|
|
<example>
|
|
<title>
|
|
Disable HTTP input conversion in &php.ini;
|
|
</title>
|
|
<programlisting role="php">
|
|
<![CDATA[
|
|
;; Disable HTTP Input conversion
|
|
mbstring.http_input = pass
|
|
;; Disable HTTP Input conversion (PHP 4.3.0 or higher)
|
|
mbstring.encoding_translation = Off
|
|
]]>
|
|
</programlisting>
|
|
</example>
|
|
</para>
|
|
<para>
|
|
When using PHP as an Apache module, it is possible to
|
|
override those settings in each Virtual Host directive in
|
|
&httpd.conf; or per directory with &htaccess;. Refer to the <link
|
|
linkend="configuration">Configuration</link> section and
|
|
Apache Manual for details.
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<simpara>
|
|
HTTP Output
|
|
</simpara>
|
|
<para>
|
|
There are several ways to enable output character encoding
|
|
conversion. One is using &php.ini;, another
|
|
is using <function>ob_start</function> with
|
|
<function>mb_output_handler</function> as the
|
|
<literal>ob_start</literal> callback function.
|
|
</para>
|
|
<note>
|
|
<para>
|
|
PHP3-i18n users should note that <literal>mbstring</literal>'s output
|
|
conversion differs from PHP3-i18n. Character encoding is
|
|
converted using an output buffer.
|
|
</para>
|
|
</note>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
<para>
|
|
<example>
|
|
<title>&php.ini; setting example</title>
|
|
<programlisting>
|
|
<![CDATA[
|
|
;; Enable output character encoding conversion for all PHP pages
|
|
|
|
;; Enable Output Buffering
|
|
output_buffering = On
|
|
|
|
;; Set mb_output_handler to enable output conversion
|
|
output_handler = mb_output_handler
|
|
]]>
|
|
</programlisting>
|
|
</example>
|
|
</para>
|
|
<para>
|
|
<example>
|
|
<title>Script example</title>
|
|
<programlisting role="php">
|
|
<![CDATA[
|
|
<?php
|
|
|
|
// Enable output character encoding conversion only for this page
|
|
|
|
// Set HTTP output character encoding to SJIS
|
|
mb_http_output('SJIS');
|
|
|
|
// Start buffering and specify "mb_output_handler" as
|
|
// callback function
|
|
ob_start('mb_output_handler');
|
|
|
|
?>
|
|
]]>
|
|
</programlisting>
|
|
</example>
|
|
</para>
|
|
</section>
|
|
|
|
<section xml:id="mbstring.supported-encodings">
|
|
<title>Supported Character Encodings</title>
|
|
<simpara>
|
|
Currently the following character encodings are supported by the
|
|
<literal>mbstring</literal> module. Any of those Character encodings
|
|
can be specified in the <literal>encoding</literal> parameter of
|
|
<literal>mbstring</literal> functions.
|
|
</simpara>
|
|
<para>
|
|
The following character encodings are supported in this PHP
|
|
extension:
|
|
</para>
|
|
<itemizedlist>
|
|
<listitem><simpara>UCS-4</simpara></listitem>
|
|
<listitem><simpara>UCS-4BE</simpara></listitem>
|
|
<listitem><simpara>UCS-4LE</simpara></listitem>
|
|
<listitem><simpara>UCS-2</simpara></listitem>
|
|
<listitem><simpara>UCS-2BE</simpara></listitem>
|
|
<listitem><simpara>UCS-2LE</simpara></listitem>
|
|
<listitem><simpara>UTF-32</simpara></listitem>
|
|
<listitem><simpara>UTF-32BE</simpara></listitem>
|
|
<listitem><simpara>UTF-32LE</simpara></listitem>
|
|
<listitem><simpara>UTF-16</simpara></listitem>
|
|
<listitem><simpara>UTF-16BE</simpara></listitem>
|
|
<listitem><simpara>UTF-16LE</simpara></listitem>
|
|
<listitem><simpara>UTF-7</simpara></listitem>
|
|
<listitem><simpara>UTF7-IMAP</simpara></listitem>
|
|
<listitem><simpara>UTF-8</simpara></listitem>
|
|
<listitem><simpara>ASCII</simpara></listitem>
|
|
<listitem><simpara>EUC-JP</simpara></listitem>
|
|
<listitem><simpara>SJIS</simpara></listitem>
|
|
<listitem><simpara>eucJP-win</simpara></listitem>
|
|
<listitem><simpara>SJIS-win</simpara></listitem>
|
|
<listitem><simpara>ISO-2022-JP</simpara></listitem>
|
|
<listitem><simpara>JIS</simpara></listitem>
|
|
<listitem><simpara>ISO-8859-1</simpara></listitem>
|
|
<listitem><simpara>ISO-8859-2</simpara></listitem>
|
|
<listitem><simpara>ISO-8859-3</simpara></listitem>
|
|
<listitem><simpara>ISO-8859-4</simpara></listitem>
|
|
<listitem><simpara>ISO-8859-5</simpara></listitem>
|
|
<listitem><simpara>ISO-8859-6</simpara></listitem>
|
|
<listitem><simpara>ISO-8859-7</simpara></listitem>
|
|
<listitem><simpara>ISO-8859-8</simpara></listitem>
|
|
<listitem><simpara>ISO-8859-9</simpara></listitem>
|
|
<listitem><simpara>ISO-8859-10</simpara></listitem>
|
|
<listitem><simpara>ISO-8859-13</simpara></listitem>
|
|
<listitem><simpara>ISO-8859-14</simpara></listitem>
|
|
<listitem><simpara>ISO-8859-15</simpara></listitem>
|
|
<listitem><simpara>byte2be</simpara></listitem>
|
|
<listitem><simpara>byte2le</simpara></listitem>
|
|
<listitem><simpara>byte4be</simpara></listitem>
|
|
<listitem><simpara>byte4le</simpara></listitem>
|
|
<listitem><simpara>BASE64</simpara></listitem>
|
|
<listitem><simpara>HTML-ENTITIES</simpara></listitem>
|
|
<listitem><simpara>7bit</simpara></listitem>
|
|
<listitem><simpara>8bit</simpara></listitem>
|
|
<listitem><simpara>EUC-CN</simpara></listitem>
|
|
<listitem><simpara>CP936</simpara></listitem>
|
|
<listitem><simpara>HZ</simpara></listitem>
|
|
<listitem><simpara>EUC-TW</simpara></listitem>
|
|
<listitem><simpara>CP950</simpara></listitem>
|
|
<listitem><simpara>BIG-5</simpara></listitem>
|
|
<listitem><simpara>EUC-KR</simpara></listitem>
|
|
<listitem><simpara>UHC (CP949)</simpara></listitem>
|
|
<listitem><simpara>ISO-2022-KR</simpara></listitem>
|
|
<listitem><simpara>Windows-1251 (CP1251)</simpara></listitem>
|
|
<listitem><simpara>Windows-1252 (CP1252)</simpara></listitem>
|
|
<listitem><simpara>CP866 (IBM866)</simpara></listitem>
|
|
<listitem><simpara>KOI8-R</simpara></listitem>
|
|
</itemizedlist>
|
|
<para>
|
|
Any &php.ini; entry which accepts an encoding name
|
|
can also use the values "<literal>auto</literal>" and
|
|
"<literal>pass</literal>".
|
|
<literal>mbstring</literal> functions which accept an encoding
|
|
name can also use the value "<literal>auto</literal>".
|
|
</para>
|
|
<para>
|
|
If "<literal>pass</literal>" is set, no character
|
|
encoding conversion is performed.
|
|
</para>
|
|
<para>
|
|
If "<literal>auto</literal>" is set, it is expanded to
|
|
the list of encodings defined per the <link linkend="mbstring.configuration">NLS</link>.
|
|
For instance, if the NLS is set to <literal>Japanese</literal>,
|
|
the value is assumed to be
|
|
"<literal>ASCII,JIS,UTF-8,EUC-JP,SJIS</literal>".
|
|
</para>
|
|
<para>
|
|
See also <function>mb_detect_order</function>
|
|
</para>
|
|
</section>
|
|
|
|
<section xml:id="mbstring.overload">
|
|
<title>
|
|
Function Overloading Feature
|
|
</title>
|
|
<para>
|
|
You might often find it difficult to get an existing PHP application
|
|
to work in a given multibyte environment. This happens because most
|
|
PHP applications out there are written with the standard string
|
|
functions such as <function>substr</function>, which are known to
|
|
not properly handle multibyte-encoded strings.
|
|
</para>
|
|
<para>
|
|
mbstring supports a 'function overloading' feature which enables
|
|
you to add multibyte awareness to such an application without
|
|
code modification by overloading multibyte counterparts on
|
|
the standard string functions. For example,
|
|
<function>mb_substr</function> is called instead of
|
|
<function>substr</function> if function overloading is enabled.
|
|
This feature makes it easy to port applications that only support
|
|
single-byte encodings to a multibyte environment in many cases.
|
|
</para>
|
|
<para>
|
|
To use function overloading, set
|
|
<literal>mbstring.func_overload</literal> in &php.ini; to a
|
|
positive value that represents a combination of bitmasks specifying
|
|
the categories of functions to be overloaded. It should be set
|
|
to 1 to overload the <function>mail</function> function. 2 for string
|
|
functions, 4 for regular expression functions. For example,
|
|
if it is set to 7, mail, strings and regular expression functions will
|
|
be overloaded. The list of overloaded functions are shown below.
|
|
<table>
|
|
<title>Functions to be overloaded</title>
|
|
<tgroup cols="3">
|
|
<thead>
|
|
<row>
|
|
<entry>value of mbstring.func_overload</entry>
|
|
<entry>original function</entry>
|
|
<entry>overloaded function</entry>
|
|
</row>
|
|
</thead>
|
|
<tbody>
|
|
<row>
|
|
<entry>1</entry>
|
|
<entry><function>mail</function></entry>
|
|
<entry><function>mb_send_mail</function></entry>
|
|
</row>
|
|
<row>
|
|
<entry>2</entry>
|
|
<entry><function>strlen</function></entry>
|
|
<entry><function>mb_strlen</function></entry>
|
|
</row>
|
|
<row>
|
|
<entry>2</entry>
|
|
<entry><function>strpos</function></entry>
|
|
<entry><function>mb_strpos</function></entry>
|
|
</row>
|
|
<row>
|
|
<entry>2</entry>
|
|
<entry><function>strrpos</function></entry>
|
|
<entry><function>mb_strrpos</function></entry>
|
|
</row>
|
|
<row>
|
|
<entry>2</entry>
|
|
<entry><function>substr</function></entry>
|
|
<entry><function>mb_substr</function></entry>
|
|
</row>
|
|
<row>
|
|
<entry>2</entry>
|
|
<entry><function>strtolower</function></entry>
|
|
<entry><function>mb_strtolower</function></entry>
|
|
</row>
|
|
<row>
|
|
<entry>2</entry>
|
|
<entry><function>strtoupper</function></entry>
|
|
<entry><function>mb_strtoupper</function></entry>
|
|
</row>
|
|
<row>
|
|
<entry>2</entry>
|
|
<entry><function>substr_count</function></entry>
|
|
<entry><function>mb_substr_count</function></entry>
|
|
</row>
|
|
<row>
|
|
<entry>4</entry>
|
|
<entry><function>ereg</function></entry>
|
|
<entry><function>mb_ereg</function></entry>
|
|
</row>
|
|
<row>
|
|
<entry>4</entry>
|
|
<entry><function>eregi</function></entry>
|
|
<entry><function>mb_eregi</function></entry>
|
|
</row>
|
|
<row>
|
|
<entry>4</entry>
|
|
<entry><function>ereg_replace</function></entry>
|
|
<entry><function>mb_ereg_replace</function></entry>
|
|
</row>
|
|
<row>
|
|
<entry>4</entry>
|
|
<entry><function>eregi_replace</function></entry>
|
|
<entry><function>mb_eregi_replace</function></entry>
|
|
</row>
|
|
<row>
|
|
<entry>4</entry>
|
|
<entry><function>split</function></entry>
|
|
<entry><function>mb_split</function></entry>
|
|
</row>
|
|
</tbody>
|
|
</tgroup>
|
|
</table>
|
|
</para>
|
|
<note>
|
|
<para>
|
|
It is not recommended to use the function overloading option in
|
|
the per-directory context, because it's not confirmed yet to be
|
|
stable enough in a production environment and may lead to undefined
|
|
behaviour.
|
|
</para>
|
|
</note>
|
|
</section>
|
|
|
|
<section xml:id="mbstring.ja-basic">
|
|
<title>Basics of Japanese multi-byte encodings</title>
|
|
<para>
|
|
Japanese characters can only be represented by multibyte encodings,
|
|
and multiple encoding standards are used depending on platform and
|
|
text purpose. To make matters worse, these encoding standards
|
|
differ slightly from one another. In order to create a web
|
|
application which would be usable in a Japanese environment, a
|
|
developer has to keep these complexities in mind to ensure that the
|
|
proper character encodings are used.
|
|
</para>
|
|
<para>
|
|
<itemizedlist>
|
|
<listitem>
|
|
<simpara>Storage for a character can be up to six bytes</simpara>
|
|
</listitem>
|
|
<listitem>
|
|
<simpara>
|
|
Most Japanese multibyte characters appear twice as wide as
|
|
single-byte characters. These characters are called
|
|
"zen-kaku" in Japanese, which means
|
|
"full width". Other, narrower, characters are called
|
|
"han-kaku", which means "half width". The
|
|
graphical properties of the characters, however, depends upon
|
|
the type faces used to display them.
|
|
</simpara>
|
|
</listitem>
|
|
<listitem>
|
|
<simpara>
|
|
Some character encodings use shift(escape) sequences defined
|
|
in ISO-2022 to switch the code map of the specific code area
|
|
(<literal>00h</literal> to <literal>7fh</literal>).
|
|
</simpara>
|
|
</listitem>
|
|
<listitem>
|
|
<simpara>
|
|
ISO-2022-JP should be used in SMTP/NNTP, and headers and entities
|
|
should be reencoded as per RFC requirements. Although those are not
|
|
requisites, it's still a good idea because several popular user
|
|
agents cannot recognize any other encoding methods.
|
|
</simpara>
|
|
</listitem>
|
|
<listitem>
|
|
<simpara>
|
|
Web pages created for mobile phone services such as
|
|
<link xlink:href="&url.imode;">i-mode</link>,
|
|
<link xlink:href="&url.vlife;">Vodafone live!</link>, or <link xlink:href="&url.ezweb;">EZweb</link>
|
|
are supposed to use Shift_JIS.
|
|
</simpara>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
</section>
|
|
|
|
<section xml:id="mbstring.ref">
|
|
<title>References</title>
|
|
<para>
|
|
Multibyte character encoding schemes and their related issues are
|
|
fairly complicated, and are beyond the scope of this documentation.
|
|
Please refer to the following URLs and other resources for further
|
|
information regarding these topics.
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>
|
|
Unicode materials
|
|
</para>
|
|
<para>
|
|
<link xlink:href="&url.unicode;">&url.unicode;</link>
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
Japanese/Korean/Chinese character information
|
|
</para>
|
|
<para>
|
|
<link xlink:href="&url.oreilly.cjk-inf;">&url.oreilly.cjk-inf;</link>
|
|
</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
</section>
|
|
&reference.mbstring.encodings;
|
|
|
|
</partintro>
|
|
|
|
&reference.mbstring.functions;
|
|
|
|
</reference>
|
|
<!-- Keep this comment at the end of the file
|
|
Local variables:
|
|
mode: sgml
|
|
sgml-omittag:t
|
|
sgml-shorttag:t
|
|
sgml-minimize-attributes:nil
|
|
sgml-always-quote-attributes:t
|
|
sgml-indent-step:1
|
|
sgml-indent-data:t
|
|
indent-tabs-mode:nil
|
|
sgml-parent-document:nil
|
|
sgml-default-dtd-file:"../../../manual.ced"
|
|
sgml-exposed-tags:nil
|
|
sgml-local-catalogs:nil
|
|
sgml-local-ecat-files:nil
|
|
End:
|
|
vim600: syn=xml fen fdm=syntax fdl=2 si
|
|
vim: et tw=78 syn=sgml
|
|
vi: ts=1 sw=1
|
|
-->
|