<?xml version="1.0" encoding="iso-8859-1"?> <!-- $Revision: 1.11 $ -->
<reference id="ref.mbstring"> <title>Multi-Byte String Functions</title> <titleabbrev>Multi-Byte String</titleabbrev>
<partintro>
<section id="mbstring.intro"> &reftitle.intro;
<para> In many languages, every character can be expressed in a single byte. Multi-byte character encodings are used to express the much larger character sets that many other languages need. <literal>mbstring</literal> was developed to handle Japanese characters. However, many <literal>mbstring</literal> functions can also handle character encodings other than Japanese ones. </para>
<para> A multi-byte character encoding represents a single character with consecutive bytes. Some character encodings use shift (escape) sequences to start and end multi-byte character strings. Therefore, a multi-byte character string may be corrupted when it is split or counted, unless a method that is safe for multi-byte character encodings is used. This module provides multi-byte-safe string functions and other utility functions, such as conversion functions. </para>
<para> Since PHP is basically designed for ISO-8859-1, some multi-byte character encodings do not work well with PHP. Therefore, it is important to set <literal>mbstring.internal_encoding</literal> to a character encoding that works with PHP. </para>
<para> PHP 4 Character Encoding Requirements </para>
<para> <itemizedlist>
<listitem> <simpara> Per-byte encoding </simpara> </listitem>
<listitem> <simpara> Single-byte characters in the range <literal>00h-7fh</literal>, which is compatible with <literal>ASCII</literal> </simpara> </listitem>
<listitem> <simpara> Multi-byte characters that do not use bytes in the range <literal>00h-7fh</literal> </simpara> </listitem>
</itemizedlist> </para>
<para> The following are examples of internal character encodings that work with PHP and of ones that do NOT work with PHP.
<informalexample> <programlisting>
<![CDATA[
Character encodings that work with PHP:
ISO-8859-*, EUC-JP, UTF-8

Character encodings that do NOT work with PHP:
JIS, SJIS
]]>
</programlisting> </informalexample> </para>
<para> Character encodings that do not work with PHP may be converted with <literal>mbstring</literal>'s HTTP input/output conversion feature. </para>
<note> <para> SJIS should not be used as the internal encoding unless you are familiar with the parser/compiler, character encodings, and character encoding issues. </para> </note>
<note> <para> If you use databases with PHP, it is recommended that you use the same character encoding for both the database and the <literal>internal encoding</literal>, for ease of use and better performance. </para>
<para> If you are using PostgreSQL, it supports a character encoding that is different from the backend character encoding. See the PostgreSQL manual for details. </para> </note>
</section>
&reference.mbstring.configure; &reference.mbstring.ini;
<section id="mbstring.resources"> &reftitle.resources; &no.resource; </section>
&reference.mbstring.constants;
<section id="mbstring.http"> <title>HTTP Input and Output</title>
<para> HTTP input/output character encoding conversion may also convert binary data. Users are expected to control character encoding conversion if binary data is used for HTTP input/output. </para>
<para> If the <literal>enctype</literal> of an HTML form is set to <literal>multipart/form-data</literal>, <literal>mbstring</literal> does not convert the character encoding of POST data. In that case, the strings need to be converted to the internal character encoding. </para>
<para> <itemizedlist> <listitem> <simpara> HTTP Input </simpara>
<para> HTTP input character conversion cannot be controlled from a PHP script. To disable HTTP input character conversion, change the setting in &php.ini;.
<example> <title> Disable HTTP input conversion in &php.ini; </title> <programlisting role="php">
<![CDATA[
;; Disable HTTP Input conversion
mbstring.http_input = pass

;; Disable HTTP Input conversion (PHP 4.3.0 or higher)
mbstring.encoding_translation = Off
]]>
</programlisting> </example> </para>
<para> When using PHP as an Apache module, it is possible to override the PHP ini settings per virtual host in &httpd.conf; or per directory with &htaccess;. Refer to the <link linkend="configuration">Configuration</link> section and the Apache manual for details. </para> </listitem>
<listitem> <simpara> HTTP Output </simpara>
<para> There are several ways to enable output character encoding conversion. One is to configure it in &php.ini;. Another is to call <function>ob_start</function> with <function>mb_output_handler</function> as the <literal>ob_start</literal> callback function. </para>
<note> <para> For PHP3-i18n users: <literal>mbstring</literal>'s output conversion differs from that of PHP3-i18n. The character encoding is converted using an output buffer. </para> </note>
</listitem> </itemizedlist> </para>
<para> <example> <title>&php.ini; setting example</title> <programlisting>
<![CDATA[
;; Enable output character encoding conversion for all PHP pages

;; Enable Output Buffering
output_buffering = On

;; Set mb_output_handler to enable output conversion
output_handler = mb_output_handler
]]>
</programlisting> </example> </para>
<para> <example> <title>Script example</title> <programlisting role="php">
<![CDATA[
<?php

// Enable output character encoding conversion only for this page

// Set HTTP output character encoding to SJIS
mb_http_output('SJIS');

// Start buffering and specify "mb_output_handler" as
// callback function
ob_start('mb_output_handler');

?>
]]>
</programlisting> </example> </para>
</section>
<section id="mbstring.encodings"> <title>Supported Character Encodings</title>
<simpara> Currently, the following character encodings are supported by the <literal>mbstring</literal> module.
A character encoding may be specified in the <literal>encoding</literal> parameter of <literal>mbstring</literal> functions. </simpara>
<para> The following character encodings are supported in this PHP extension: </para>
<para> <literal>UCS-4</literal>, <literal>UCS-4BE</literal>, <literal>UCS-4LE</literal>, <literal>UCS-2</literal>, <literal>UCS-2BE</literal>, <literal>UCS-2LE</literal>, <literal>UTF-32</literal>, <literal>UTF-32BE</literal>, <literal>UTF-32LE</literal>, <literal>UTF-16</literal>, <literal>UTF-16BE</literal>, <literal>UTF-16LE</literal>, <literal>UTF-8</literal>, <literal>UTF-7</literal>, <literal>ASCII</literal>, <literal>EUC-JP</literal>, <literal>SJIS</literal>, <literal>eucJP-win</literal>, <literal>SJIS-win</literal>, <literal>ISO-2022-JP</literal>, <literal>JIS</literal>, <literal>ISO-8859-1</literal>, <literal>ISO-8859-2</literal>, <literal>ISO-8859-3</literal>, <literal>ISO-8859-4</literal>, <literal>ISO-8859-5</literal>, <literal>ISO-8859-6</literal>, <literal>ISO-8859-7</literal>, <literal>ISO-8859-8</literal>, <literal>ISO-8859-9</literal>, <literal>ISO-8859-10</literal>, <literal>ISO-8859-13</literal>, <literal>ISO-8859-14</literal>, <literal>ISO-8859-15</literal>, <literal>byte2be</literal>, <literal>byte2le</literal>, <literal>byte4be</literal>, <literal>byte4le</literal>, <literal>BASE64</literal>, <literal>7bit</literal>, <literal>8bit</literal> and <literal>UTF7-IMAP</literal>. </para>
<para> As of PHP 4.3.0, experimental support is added for the following character encodings: <literal>EUC-CN</literal>, <literal>CP936</literal>, <literal>HZ</literal>, <literal>EUC-TW</literal>, <literal>CP950</literal>, <literal>BIG-5</literal>, <literal>EUC-KR</literal>, <literal>UHC</literal> (<literal>CP949</literal>), <literal>ISO-2022-KR</literal>, <literal>Windows-1251</literal> (<literal>CP1251</literal>), <literal>Windows-1252</literal> (<literal>CP1252</literal>), <literal>CP866</literal>, <literal>KOI8-R</literal>.
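As a minimal sketch of how an encoding name from the list above might be passed to the <literal>encoding</literal> parameter, the following example converts a string from EUC-JP to UTF-8 and then counts its characters. The sample bytes and the variable names are illustrative assumptions, and the byte count assumes that function overloading is not enabled.
<example> <title>Passing an encoding name to mbstring functions (illustrative sketch)</title> <programlisting role="php">
<![CDATA[
<?php

// Assumed sample data: the Japanese word "nihongo" (three kanji)
// encoded in EUC-JP, two bytes per character
$euc = "\xC6\xFC\xCB\xDC\xB8\xEC";

// Convert from EUC-JP to UTF-8, naming both encodings explicitly
$utf8 = mb_convert_encoding($euc, 'UTF-8', 'EUC-JP');

// Count characters using the encoding parameter
$chars = mb_strlen($utf8, 'UTF-8');   // 3 characters

// Count bytes with the single-byte function
// (assumes strlen() is not overloaded)
$bytes = strlen($utf8);               // 9 bytes

echo $chars, ' characters, ', $bytes, ' bytes';

?>
]]>
</programlisting> </example>
Naming the encoding explicitly avoids depending on the current <literal>mbstring.internal_encoding</literal> setting.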
</para>
<para> &php.ini; entries that accept an encoding name also accept "<literal>auto</literal>" and "<literal>pass</literal>". <literal>mbstring</literal> functions that accept an encoding name also accept "<literal>auto</literal>". </para>
<para> If "<literal>pass</literal>" is set, no character encoding conversion is performed. </para>
<para> If "<literal>auto</literal>" is set, it is expanded to "<literal>ASCII,JIS,UTF-8,EUC-JP,SJIS</literal>". </para>
<para> See also <function>mb_detect_order</function>. </para>
<note> <para> "Supported character encoding" does not mean that the encoding works as an internal character encoding. </para> </note>
</section>
<section id="mbstring.overload"> <title> Overloading PHP string functions with multi-byte string functions </title>
<para> Because most PHP applications are written for languages that use single-byte character encodings, handling multi-byte strings, including Japanese ones, presents some difficulties. Most PHP string functions, such as <function>substr</function>, do not support multi-byte strings. </para>
<para> The multi-byte extension (mbstring) provides multi-byte versions of some PHP string functions (for example, <function>mb_substr</function> is the multi-byte version of <function>substr</function>). </para>
<para> The multi-byte extension (mbstring) also supports 'function overloading' to add multi-byte string functionality without code modification. With function overloading, some PHP string functions are overloaded by their multi-byte counterparts. For example, <function>mb_substr</function> is called instead of <function>substr</function> if function overloading is enabled. Function overloading makes it easy to port an application that supports only single-byte encodings to a multi-byte application. </para>
<para> <literal>mbstring.func_overload</literal> in &php.ini; should be set to a positive value to use function overloading. The value specifies the categories of functions to overload. It should be set to 1 to enable mail function overloading.
Set it to 2 to overload string functions, and to 4 to overload regular expression functions. For example, if it is set to 7, the mail, string, and regular expression functions are all overloaded. The overloaded functions are listed below.
<table> <title>Functions to be overloaded</title> <tgroup cols="3">
<thead> <row> <entry>value of mbstring.func_overload</entry> <entry>original function</entry> <entry>overloaded function</entry> </row> </thead>
<tbody>
<row> <entry>1</entry> <entry><function>mail</function></entry> <entry><function>mb_send_mail</function></entry> </row>
<row> <entry>2</entry> <entry><function>strlen</function></entry> <entry><function>mb_strlen</function></entry> </row>
<row> <entry>2</entry> <entry><function>strpos</function></entry> <entry><function>mb_strpos</function></entry> </row>
<row> <entry>2</entry> <entry><function>strrpos</function></entry> <entry><function>mb_strrpos</function></entry> </row>
<row> <entry>2</entry> <entry><function>substr</function></entry> <entry><function>mb_substr</function></entry> </row>
<row> <entry>2</entry> <entry><function>strtolower</function></entry> <entry><function>mb_strtolower</function></entry> </row>
<row> <entry>2</entry> <entry><function>strtoupper</function></entry> <entry><function>mb_strtoupper</function></entry> </row>
<row> <entry>2</entry> <entry><function>substr_count</function></entry> <entry><function>mb_substr_count</function></entry> </row>
<row> <entry>4</entry> <entry><function>ereg</function></entry> <entry><function>mb_ereg</function></entry> </row>
<row> <entry>4</entry> <entry><function>eregi</function></entry> <entry><function>mb_eregi</function></entry> </row>
<row> <entry>4</entry> <entry><function>ereg_replace</function></entry> <entry><function>mb_ereg_replace</function></entry> </row>
<row> <entry>4</entry> <entry><function>eregi_replace</function></entry> <entry><function>mb_eregi_replace</function></entry> </row>
<row> <entry>4</entry> <entry><function>split</function></entry>
<entry><function>mb_split</function></entry> </row>
</tbody> </tgroup> </table> </para>
</section>
<section id="mbstring.ja-basic"> <title>Basics of Japanese multi-byte characters</title>
<para> Most Japanese characters need more than 1 byte per character. In addition, several character encoding schemes are used in a Japanese environment: EUC-JP, Shift_JIS (SJIS) and ISO-2022-JP (JIS). As Unicode becomes popular, UTF-8 is also used. To develop Web applications for a Japanese environment, it is important to use the appropriate character set for the task at hand, whether that is HTTP input/output, an RDBMS or e-mail. </para>
<para> <itemizedlist>
<listitem> <simpara>Storage for a character can be up to six bytes.</simpara> </listitem>
<listitem> <simpara> A multi-byte character is usually twice the width of a single-byte character. Wider characters are called "zen-kaku", meaning full width; narrower characters are called "han-kaku", meaning half width. "zen-kaku" characters are usually fixed width. </simpara> </listitem>
<listitem> <simpara> Some character encodings define shift (escape) sequences for entering and exiting multi-byte character strings. </simpara> </listitem>
<listitem> <simpara> ISO-2022-JP must be used for SMTP/NNTP. </simpara> </listitem>
<listitem> <para> An "i-mode" web site is supposed to use SJIS. </para> </listitem>
</itemizedlist> </para>
</section>
<section id="mbstring.ref"> <title>References</title>
<para> Multi-byte character encodings and their related issues are very complex. It is impossible to cover them in sufficient detail here.