From a5509853fdb90395f39f490a6e376d88cd5775e5 Mon Sep 17 00:00:00 2001 From: Rui Hirokawa Date: Fri, 29 Jun 2001 03:20:28 +0000 Subject: [PATCH] fixed some typos. git-svn-id: https://svn.php.net/repository/phpdoc/en/trunk@50334 c90b9560-bf6c-de11-be94-00142212c4b1 --- functions/mbstring.xml | 772 ++++++++++++++++++++++++++++++++++------- 1 file changed, 642 insertions(+), 130 deletions(-) diff --git a/functions/mbstring.xml b/functions/mbstring.xml index 4843317bf8..f0d8f3e351 100644 --- a/functions/mbstring.xml +++ b/functions/mbstring.xml @@ -1,117 +1,305 @@ Multi-Byte String Functions - Multi-Byte String + + Multi-Byte String + &warn.experimental; Introduction - This module is EXPERIMENTAL. Function name/API is subject to be - changed. Current conversion filter supports Japanese only. + This module is EXPERIMENTAL. Function name/API is subject to + change. Current conversion filter supports Japanese only. - There are many languages that all characters cannot be expressed + There are many languages in which all characters can be expressed by single byte. Multi-byte character codes are used to express many characters for many languages. mbstring is developed to handle Japanese characters. However, many mbstring functions are able to handle - character codes other than Japanese. + character encoding other than Japanese. - Multi-byte character encoding represents single character with + A multi-byte character encoding represents single character with consecutive bytes. Some character encoding has shift(escape) - sequences to start/end multi-byte character string. Therefore, + sequences to start/end multi-byte character strings. Therefore, a multi-byte character string may be destroyed when it is divided - and/or counted, unless multi-byte character encoding safe method - is used. mbstring functions support multi-byte - character safe string functions and other utility functions such - as conversion functions. 
+ and/or counted unless multi-byte character encoding safe method + is used. This module provides multi-byte character safe string + functions and other utility functions such as conversion + functions. + + Since PHP is basically designed for ISO-8859-1, some multi-byte + character encoding does not work well with PHP. Therefore, it is + important to set mbstring.internal_encoding to + a character encoding that works with PHP. + + + PHP4 Character Encoding Requirements + + + + + + Per byte encoding + + + + + Single byte characters in range of 00h-7fh + which is compatible with ASCII + + + + + Multi-byte characters without 00h-7fh + + + + + + These are examples of internal character encoding that works with + PHP and does NOT work with PHP. + + - - Basics for Japanese multi-byte character +Character encodings work with PHP: +ISO-8859-*, EUC-JP, UTF-8 + + +Character encodings do NOT work with PHP: +JIS, SJIS + + + + + Character encoding, that does not work with PHP, may be converted + with mbstring's HTTP input/output conversion + feature/function. + + - Most Japanese characters need more than 1 byte for a - character. In addition to this, several character encodings are - used under Japanese environment. There are EUC-JP, Shift_JIS and - ISO-2022-JP character encoding. As Unicode is getting popular, - UTF-8 is used also. To develop Web application for Japanese - environment, it is important to use these character codes depend - on its purpose, HTTP input/output, RDBMS and E-mail. + SJIS should not be used for internal encoding unless the reader + is familiar with parser/compiler, character encoding and + character encoding issues. + + + + If you use database with PHP, it is recommended that you use the + same character encoding for both database and internal + encoding for ease of use and better performance. + + + If you are using PostgreSQL, it supports character + encoding that is different from backend character encoding. See + the PostgreSQL manual for details. 
+ + + + + How to Enable mbstring + + mbstring is an extended module. You must + enable module with configure script. Refer + to the Install section for + details. + + + The following configure options are related to + mbstring module. + - - Storage for a character can be upto four bytes - - - - - A multi-byte character usually has twice of width compare to - single byte characters. Wider character is called "zen-kaku" - - meaning full width, narrower character called "han-kaku" - - meaning half width. "zen-kaku" characters are fixed width - usually. - - - - - Some character encoding defines shift sequence for - entering/exiting multi-byte character strings. - - - - - Database may allocate storage for characters that differs - from size used in PHP even if the same character encoding is - used. (For example, PostgreSQL) - - - - - E-mail is supposed to use ISO-2022-JP. - + + : Enable + mbstring functions. This option is + required to use mbstring functions. + - "i-mode" web site is supposed to use Shift_JIS. + : + Enable HTTP input character encoding conversion using + mbstring conversion engine. If this + feature is enabled, HTTP input character encoding may be + converted to mbstring.internal_encoding + automatically. - - Supported character encodings + + HTTP Input and Output - Following character encodings are supported in this PHP - extension : UCS-4, - UCS-4BE, UCS-4LE, - UCS-2, UCS-2BE, - UCS-2LE, UTF-32, - UTF-32BE, UTF-32LE, - UCS-2LE, UTF-16, - UTF-16BE, UTF-16LE, - UTF-8, UTF-7, - ASCII, EUC-JP, - SJIS, eucJP-win, - SJIS-win, - ISO-2022-JP(JIS), + HTTP input/output character encoding conversion may convert + binary data also. Users are supposed to control character + encoding conversion if binary data is used for HTTP + input/output. + + + If enctype for HTML form is set to + multipart/form-data, + mbstring does not convert character encoding + in POST data. If it is the case, strings are needed to be + converted to internal character encoding. 
+ + + + + + HTTP Input + + There is no way to control HTTP input character + conversion from PHP script. To disable HTTP input character + conversion, it has to be done in php.ini. + + + Disable HTTP input conversion in php.ini + + + +;; Disable HTTP Input conversion +mbstring.http_input = pass + + + + + When using PHP as an Apache module, it is possible to + override PHP ini setting per Virtual Host in + httpd.conf or per directory with + .htaccess. Refer to the Configuration section and + Apache Manual for details. + + + + + HTTP Output + + + There are several ways to enable output character encoding + conversion. One is using php.ini, another + is using ob_start with + mb_output_handler as + ob_start callback function. + + + + For PHP3-i18n users, mbstring's output + conversion differs from PHP3-i18n. Character encoding is + converted using output buffer. + + + + + + + + <literal>php.ini</literal> setting example + + +;; Enable output character encoding conversion for all PHP pages + +;; Enable Output Buffering +output_buffering = On + +;; Set mb_output_handler to enable output conversion +output_handler = mb_output_handler + + + + + + Script example + + +<?php + +// Enable output character encoding conversion only for this page + +// Set HTTP output character encoding to SJIS +mb_http_output('SJIS'); + +// Start buffering and specify "mb_output_handler" as +// callback function +ob_start('mb_output_handler'); + +?> + + + + + + + Supported Character Encoding + + Currently, the following character encoding is supported by + mbstring module. Character encoding may + be specified for mbstring functions' + encoding parameter. 
+ + The following character encoding is supported in this PHP + extension : + + + UCS-4, UCS-4BE, + UCS-4LE, UCS-2, + UCS-2BE, UCS-2LE, + UTF-32, UTF-32BE, + UTF-32LE, UCS-2LE, + UTF-16, UTF-16BE, + UTF-16LE, UTF-8, + UTF-7, ASCII, + EUC-JP, SJIS, + eucJP-win, SJIS-win, + ISO-2022-JP, JIS, ISO-8859-1, ISO-8859-2, ISO-8859-3, ISO-8859-4, ISO-8859-5, ISO-8859-6, ISO-8859-7, ISO-8859-8, ISO-8859-9, ISO-8859-10, ISO-8859-13, ISO-8859-14, - ISO-8859-15. + ISO-8859-15, byte2be, + byte2le, byte4be, + byte4le, BASE64, + 7bit, 8bit and + UTF7-IMAP. + + php.ini entry, which accepts encoding name, + accepts "auto" and + "pass" also. + mbstring functions, which accepts encoding + name, and accepts "auto". + + + If "pass" is set, no character + encoding conversion is performed. + + + If "auto" is set, it is expanded to + "ASCII,JIS,UTF-8,EUC-JP,SJIS". + + + See also mb_detect_order + + + + "Supported character encoding" does not mean that it + works as internal character code. + + - php.ini settings + php.ini settings @@ -122,63 +310,311 @@ - mbstring.http_input defines default HTTP input - character encoding. + mbstring.http_input defines default HTTP + input character encoding. - mbstring.http_output defines default HTTP output - character encoding. + mbstring.http_output defines default HTTP + output character encoding. - mbstring.detect_order defines default character - encoding detection order. + mbstring.detect_order defines default + character code detection order. See also + mb_detect_order. - mbstring.substitute_character defines character - to substitute for invalid character codes. + mbstring.substitute_character defines + character to substitute for invalid character encoding. + + Web Browsers are supposed to use the same character encoding + when submitting form. However, browsers may not use the same + character encoding. See mb_http_input to + detect character encoding used by browsers. 
+ + + If enctype is set to + multipart/form-data in HTML forms, + mbstring does not convert character encoding + in POST data. The user must convert them in the script, if + conversion is needed. + + + Although, browsers are smart enough to detect character encoding + in HTML. charset is better to be set in HTTP + header. Change default_charset according to + character encoding. + <literal>php.ini</literal> setting example - + + ;; Set default internal encoding +;; Note: Make sure to use character encoding works with PHP mbstring.internal_encoding = UTF-8 ; Set internal encoding to UTF-8 -;; Set default HTTP input character code -mbstring.http_input = auto ; Set HTTP input to auto -; or -; mbstring.http_input = SJIS ; Set HTTP input to SJIS -; mbstring.http_input = eucjp-win, sjis-win, UTF-8 ; Specify order +;; Set default HTTP input character encoding +;; Note: Script cannot change http_input setting. +mbstring.http_input = pass ; No conversion. +mbstring.http_input = auto ; Set HTTP input to auto + ; "auto" is expanded to "ASCII,JIS,UTF-8,EUC-JP,SJIS" +mbstring.http_input = SJIS ; Set HTTP input to SJIS +mbstring.http_input = UTF-8,SJIS,EUC-JP ; Specify order -;; Set default HTTP output character code -mbstring.http_output = UTF-8 ; Set HTTP output encoding to UTF-8 +;; Set default HTTP output character encoding +mbstring.http_output = pass ; No conversion +mbstring.http_output = UTF-8 ; Set HTTP output encoding to UTF-8 -;; Set default character code detection order -mbstring.detect_order = auto ; Set HTTP output to auto -; or -; mbstring.detect_order = eucjp-win, sjis-win, UTF-8 ; Specify order +;; Set default character encoding detection order +mbstring.detect_order = auto ; Set detect order to auto +mbstring.detect_order = ASCII,JIS,UTF-8,SJIS,EUC-JP ; Specify order ;; Set default substitute character -mbstring.substitute_character = 12307 ; Specify character code -; or -; mbstring.substitute_character = none ; Null character -; mbstring.substitute_character 
= long ; Long +mbstring.substitute_character = 12307 ; Specify Unicode value +mbstring.substitute_character = none ; Do not print character +mbstring.substitute_character = long ; Long Example: U+3000,JIS+7E7E + + + + + + <literal>php.ini</literal> setting for <literal>EUC-JP</literal> users + + +;; Disable Output Buffering +output_buffering = Off + +;; Set HTTP header charset +default_charset = EUC-JP + +;; Set HTTP input encoding conversion to auto +mbstring.http_input = auto + +;; Convert HTTP output to EUC-JP +mbstring.http_output = EUC-JP + +;; Set internal encoding to EUC-JP +mbstring.internal_encoding = EUC-JP + +;; Do not print invalid characters +mbstring.substitute_character = none + + + + + + <literal>php.ini</literal> setting for <literal>SJIS</literal> users + + +;; Enable Output Buffering +output_buffering = On + +;; Set mb_output_handler to enable output conversion +output_handler = mb_output_handler + +;; Set HTTP header charset +default_charset = Shift_JIS + +;; Set http input encoding conversion to auto +mbstring.http_input = auto + +;; Convert to SJIS +mbstring.http_output = SJIS + +;; Set internal encoding to EUC-JP +mbstring.internal_encoding = EUC-JP + +;; Do not print invalid characters +mbstring.substitute_character = none + + + Basics for Japanese multi-byte character + + Most Japanese characters need more than 1 byte per character. In + addition, several character encoding schemas are used under a + Japanese environment. There are EUC-JP, Shift_JIS(SJIS) and + ISO-2022-JP(JIS) character encoding. As Unicode becomes popular, + UTF-8 is used also. To develop Web applications for a Japanese + environment, it is important to use the character set for the + task in hand, whether HTTP input/output, RDBMS and E-mail. + + + + + Storage for a character can be up to four + bytes + + + + A multi-byte character is usually twice of the width compared + to single-byte characters. 
Wider characters are called + "zen-kaku" - meaning full width, narrower characters are + called "han-kaku" - meaning half width. "zen-kaku" characters + are usually fixed width. + + + + + Some character encoding defines shift(escape) sequence for + entering/exiting multi-byte character strings. + + + + + ISO-2022-JP must be used for SMTP/NNTP. + + + + + "i-mode" web site is supposed to use SJIS. + + + + + + + + References + + Multi-byte character encoding and its related issues are very + complex. It is impossible to cover in sufficient detail + here. Please refer to the following URLs and other resources for + further readings. + + + + Unicode/UTF/UCS/etc + + + http://www.unicode.org/ + + + + + Japanese/Korean/Chinese character + information + + + + ftp://ftp.ora.com/pub/examples/nutshell/ujip/doc/cjk.inf + + + + + + + + + + mb_language + + Set/Get current language + + + + Description + + + string + mb_language + string + language + + + + mb_language sets language. If + language is omitted, it returns current + language as string. + + + language setting is used for encoding + e-mail messages. Valid languages are "Japanese", + "ja","English","en" and "uni" + (UTF-8). mb_send_mail uses this setting to + encode e-mail. + + Language and its setting is ISO-2022-JP/Base64 for + Japanese, UTF-8/Base64 for uni, ISO-8859-1/quoted printable for + English. + + + Return Value: If language is set and + language is valid, it returns + TRUE. Otherwise, it returns FALSE. When + language is omitted, it returns language + name as string. If no language is set previously, it returns + FALSE. + + + See also mb_send_mail. + + + + + + + mb_parse_str + + Parse GET/POST/COOKIE data and set global variable + + + + Description + + + string + mb_parse_str + + string + encoded_string + + array + result + + + + + mb_parse_str parses GET/POST/COOKIE data and + sets global variables. Since PHP does not provide raw POST/COOKIE + data, it can only used for GET data for now. 
It parses URL + encoded data, detects encoding, converts coding to internal + encoding and set values to result array or + global variables. + + + encoded_string: URL encoded data. + + + result: Array contains decoded and + character encoding converted values. + + + Return Value: It returns TRUE for success or FALSE for failure. + + + See also mb_detect_order, + mb_internal_encoding. + + + + mb_internal_encoding @@ -211,7 +647,7 @@ mbstring.substitute_character = 12307 ; Specify character code encoding: Character encoding name - Return Value: If encoding is + Return Value: If encoding is set,mb_internal_encoding returns TRUE for success, otherwise returns FALSE. If encoding is @@ -232,7 +668,7 @@ echo mb_internal_encoding(); See also mb_http_input, mb_http_output, - mb_detect_order + mb_detect_order. @@ -270,7 +706,7 @@ echo mb_internal_encoding(); See also mb_internal_encoding, mb_http_output, - mb_detect_order + mb_detect_order. @@ -294,9 +730,10 @@ echo mb_internal_encoding(); If encoding is set, mb_http_output sets HTTP output character encoding to encoding. Output after this - function is converted to encoding. - mb_http_output returns TRUE for success and - FALSE for failure. + function is converted to encoding. + mb_http_output returns + TRUE for success and FALSE + for failure. If encoding is omitted, @@ -306,7 +743,7 @@ echo mb_internal_encoding(); See also mb_internal_encoding, mb_http_input, - mb_detect_order + mb_detect_order. @@ -331,11 +768,12 @@ echo mb_internal_encoding(); mb_detect_order sets automatic character encoding detection order to encoding-list. - It returns TRUE for success, FALSE for failure. + It returns TRUE for success, + FALSE for failure. encoding-list is array or comma separated - list of character encodings. ("auto" is expanded to + list of character encoding. ("auto" is expanded to "ASCII, JIS, UTF-8, EUC-JP, SJIS") @@ -346,6 +784,42 @@ echo mb_internal_encoding(); This setting affects mb_detect_encoding and mb_send_mail. 
+ + + mbstring currently implements following + encoding detection filters. If there is a invalid byte sequence + for following encoding, encoding detection will fail. + + + UTF-8, UTF-7, + ASCII, + EUC-JP,SJIS, + eucJP-win, SJIS-win, + JIS, ISO-2022-JP + + + For ISO-8859-*, mbstring + always detects as ISO-8859-*. + + + For UTF-16, UTF-32, + UCS2 and UCS4, encoding + detection will fail always. + + + + Useless detect order example + +; Always detect as ISO-8859-1 +detect_order = ISO-8859-1, UTF-8 + +; Always detect as UTF-8, since ASCII/UTF-7 values are +; valid for UTF-8 +detect_order = UTF-8, ASCII, UTF-7 + + + + <function>mb_detect_order</function> examples @@ -368,7 +842,7 @@ echo implode(", ", mb_detect_order()); See also mb_internal_encoding, mb_http_input, mb_http_output - mb_send_mail + mb_send_mail. @@ -393,7 +867,7 @@ echo implode(", ", mb_detect_order()); substitution character when input character encoding is invalid or character code is not exist in output character encoding. Invalid characters may be substituted null(no output), - string or hex value (Unicode character code value). + string or integer value (Unicode character code value). This setting affects mb_detect_encoding @@ -410,16 +884,17 @@ echo implode(", ", mb_detect_order()); - "long" : Output hex value (Example: U+3000,JIS+7E7E) + "long" : Output character code value (Example: + U+3000,JIS+7E7E) Return Value: If substchar is set, it - returns TRUE for success, otherwise returns FALSE. If - substchar is not set, it returns Unicode - value or + returns TRUE for success, otherwise returns + FALSE. If substchar is + not set, it returns Unicode value or "none"/"long". @@ -461,7 +936,27 @@ echo mb_substitute_character(); ob_start callback function. mb_output_handler converts characters in output buffer from internal character encoding to - HTTP output character encoding. + HTTP output character encoding. 
+ + + 4.0.7 or later version, this handler adds charset HTTP header + when following conditions are met: + + + + + Does not set Content-Type by + header() + + + Default MIME type begins with + text/ + + + http_output setting is other than + pass + + contents : Output buffer contents @@ -483,8 +978,8 @@ ob_start("mb_output_handler"); - If you want to output some binary data such as image from php - script, you must set output encoding to "pass" using + If you want to output some binary data such as image from PHP + script, you must set output encoding to "pass" using mb_http_output. @@ -520,7 +1015,7 @@ ob_start("mb_output_handler"); $outputenc = "sjis-win"; mb_http_output($outputenc); ob_start("mb_output_handler"); -Header("Content-Type: text/html; charset=" . mb_preferred_mime_name($outputenc)); +header("Content-Type: text/html; charset=" . mb_preferred_mime_name($outputenc)); @@ -549,6 +1044,11 @@ Header("Content-Type: text/html; charset=" . mb_preferred_mime_name($outputenc)) encoding. A multi-byte character is counted as 1. + + encoding is character encoding for + str. If encoding is + omitted, internal character encoding is used. See also mb_internal_encoding, strlen. @@ -567,7 +1067,7 @@ Header("Content-Type: text/html; charset=" . mb_preferred_mime_name($outputenc)) Description - string mb_strpos + int mb_strpos string haystack string needle int @@ -605,7 +1105,7 @@ Header("Content-Type: text/html; charset=" . mb_preferred_mime_name($outputenc)) encoding is character encoding name. If it - is not specified, internal character encoding is used. + is omitted, internal character encoding is used. See also mb_strpos, @@ -626,7 +1126,7 @@ Header("Content-Type: text/html; charset=" . mb_preferred_mime_name($outputenc)) Description - string mb_strrpos + int mb_strrpos string haystack string needle string @@ -649,7 +1149,7 @@ Header("Content-Type: text/html; charset=" . mb_preferred_mime_name($outputenc)) 0. Second character position is 1. 
- If encoding is not set, internal encoding + If encoding is omitted, internal encoding is assumed. mb_strrpos accepts string for needle where strrpos accepts only character. @@ -709,7 +1209,7 @@ Header("Content-Type: text/html; charset=" . mb_preferred_mime_name($outputenc)) omitted, internal character encoding is used. - See also mb_struct, + See also mb_strcut, mb_internal_encoding. @@ -822,7 +1322,7 @@ Header("Content-Type: text/html; charset=" . mb_preferred_mime_name($outputenc)) Description - string mb_strmwidth + string mb_strimwidth string str int start int width @@ -833,7 +1333,7 @@ Header("Content-Type: text/html; charset=" . mb_preferred_mime_name($outputenc)) - mb_strmwidth truncates string + mb_strimwidth truncates string str to specified width. It returns truncated string. @@ -1163,6 +1663,12 @@ echo $addr; to-encoding. It returns character encoding before conversion for success, FALSE for failure. + + mb_convert_variables join strings in Array + or Object to detect encoding, since encoding detection tends to + fail for short strings. Therefore, it is impossible to mix + encoding in single array or object. + It from-encoding is specified by array or comma separated string, it tries to detect encoding from @@ -1172,7 +1678,9 @@ echo $addr; vars (3rd and larger) is reference to - variable to be converted. String, Array and Object are accepted. + variable to be converted. String, Array and Object are accepted. + mb_convert_variables assumes all parameters + have the same encoding. @@ -1296,7 +1804,8 @@ $str = mb_encode_numericentity($str, $convmap, "sjis-win"); convert. - encoding is character encoding. + encoding is character encoding. If it is + omitted, internal character encoding is used. @@ -1323,7 +1832,7 @@ $convmap = array ( mb_send_mail - Send mail with ISO-2022-JP character code. (Japanese specific) + Send encoded mail. @@ -1344,7 +1853,8 @@ $convmap = array ( mb_send_mail sends email. 
Headers and - message are converted and encoded in ISO-2022-JP. + message are converted and encoded according to + mb_language setting. mb_send_mail is wrapper function of mail. See mail for details. @@ -1361,21 +1871,23 @@ $convmap = array ( message is mail message. - string additional_headers is inserted at - the end of the header. This is typically used to add - extra headers. Multiple extra headers are separated with a + additional_headers is inserted at + the end of the header. This is typically used to add extra + headers. Multiple extra headers are separated with a newline(\n). - It returns TRUE for success, otherwise it returns FALSE. + additional_parameter is a MTA command line + parameter. It is useful when setting the correct Return-Path + header when using sendmail. - additional_parameter is added this - data to the call to the mailer by PHP. This is useful when - setting the correct Return-Path header when using sendmail. + It returns TRUE for success, otherwise it + returns FALSE. - See also: mail. + See also: mb_language, + mail.