mirror of https://github.com/sigmasternchen/php-doc-en (synced 2025-03-16 00:48:54 +00:00)
Improve utf8_decode() and utf8_encode() documentation

This rewrites the descriptions of both to clarify that they convert specifically between ISO-8859-1 and UTF-8, adds a warning about confusion with Windows-1252, and adds helpful "See also" links to other character set conversion functions. Additionally, the behaviour for invalid characters in utf8_decode() was clarified, and the description of the UTF-8 binary encoding was removed.

git-svn-id: https://svn.php.net/repository/phpdoc/en/trunk@340506 c90b9560-bf6c-de11-be94-00142212c4b1
parent a9d0248cd3
commit 838941f6cc
2 changed files with 45 additions and 51 deletions
@@ -16,9 +16,27 @@
 <methodparam><type>string</type><parameter>data</parameter></methodparam>
 </methodsynopsis>
 <para>
-This function decodes <parameter>data</parameter>, assumed to be
-<literal>UTF-8</literal> encoded, to <literal>ISO-8859-1</literal>.
+This function converts the string <parameter>data</parameter> from the
+<literal>UTF-8</literal> encoding to <literal>ISO-8859-1</literal>. Bytes
+in the string which are not valid <literal>UTF-8</literal>, and
+<literal>UTF-8</literal> characters which do not exist in
+<literal>ISO-8859-1</literal> (that is, characters above
+<literal>U+00FF</literal>) are replaced with <literal>?</literal>.
 </para>
+<note>
+<para>
+Many web pages marked as using the <literal>ISO-8859-1</literal> character
+encoding actually use the similar <literal>Windows-1252</literal> encoding,
+and web browsers will interpret <literal>ISO-8859-1</literal> web pages as
+<literal>Windows-1252</literal>. <literal>Windows-1252</literal> features
+additional printable characters, such as the Euro sign
+(<literal>€</literal>) and curly quotes (<literal>“</literal>
+<literal>”</literal>), instead of certain <literal>ISO-8859-1</literal>
+control characters. This function will not convert such
+<literal>Windows-1252</literal> characters correctly. Use a different
+function if <literal>Windows-1252</literal> conversion is required.
+</para>
+</note>
 </refsect1>

 <refsect1 role="parameters">
@@ -29,7 +47,7 @@
 <term><parameter>data</parameter></term>
 <listitem>
 <para>
-An UTF-8 encoded string.
+A UTF-8 encoded string.
 </para>
 </listitem>
 </varlistentry>
@@ -48,7 +66,10 @@
 &reftitle.seealso;
 <para>
 <simplelist>
-<member><function>utf8_encode</function> (contains an explanation of UTF-8 encoding)</member>
+<member><function>utf8_encode</function> - Performs the reverse conversion</member>
+<member><function>mb_convert_encoding</function> - Converts between various character encodings, including UTF-8, ISO-8859-1 and Windows-1252</member>
+<member><function>iconv</function> - Converts between various character encodings</member>
+<member><function>recode_string</function> - Converts between various character encodings</member>
 </simplelist>
 </para>
 </refsect1>
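
The decoding behaviour described in the new utf8_decode() text can be illustrated with a short PHP sketch. This is not part of the commit; the sample strings are arbitrary, the last call assumes the mbstring extension is loaded, and on PHP 8.2 and later utf8_decode() additionally emits a deprecation notice.

```php
<?php
// U+00E9 ("é") exists in ISO-8859-1, so it becomes the single byte 0xE9.
echo bin2hex(utf8_decode("caf\u{E9}")), "\n"; // 636166e9

// U+20AC (Euro sign) is above U+00FF, so utf8_decode() replaces it with "?".
echo utf8_decode("\u{20AC}"), "\n"; // ?

// Windows-1252 does have a Euro sign (byte 0x80). Naming the target
// encoding explicitly with mb_convert_encoding() preserves it.
echo bin2hex(mb_convert_encoding("\u{20AC}", 'Windows-1252', 'UTF-8')), "\n"; // 80
```

The third call is one possible reading of the note's "use a different function"; iconv() would be another option.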

@@ -13,53 +13,23 @@
 <methodparam><type>string</type><parameter>data</parameter></methodparam>
 </methodsynopsis>
 <para>
-This function encodes the string <parameter>data</parameter> to
-<literal>UTF-8</literal>, and returns the encoded version.
-<literal>UTF-8</literal> is a standard mechanism used by
-<acronym>Unicode</acronym> for encoding <glossterm>wide
-character</glossterm> values into a byte stream.
-<literal>UTF-8</literal> is transparent to plain <abbrev>ASCII</abbrev>
-characters, is self-synchronized (meaning it is possible for a program to
-figure out where in the bytestream characters start) and can be used with
-normal string comparison functions for sorting and such. PHP encodes
-<literal>UTF-8</literal> characters in up to four bytes, like this:
-<table>
-<title>UTF-8 encoding</title>
-<tgroup cols="3">
-<thead>
-<row>
-<entry>bytes</entry>
-<entry>bits</entry>
-<entry>representation</entry>
-</row>
-</thead>
-<tbody>
-<row>
-<entry>1</entry>
-<entry>7</entry>
-<entry>0bbbbbbb</entry>
-</row>
-<row>
-<entry>2</entry>
-<entry>11</entry>
-<entry>110bbbbb 10bbbbbb</entry>
-</row>
-<row>
-<entry>3</entry>
-<entry>16</entry>
-<entry>1110bbbb 10bbbbbb 10bbbbbb</entry>
-</row>
-<row>
-<entry>4</entry>
-<entry>21</entry>
-<entry>11110bbb 10bbbbbb 10bbbbbb 10bbbbbb</entry>
-</row>
-</tbody>
-</tgroup>
-</table>
-Each <replaceable>b</replaceable> represents a bit that can be
-used to store character data.
+This function converts the string <parameter>data</parameter> from the
+<literal>ISO-8859-1</literal> encoding to <literal>UTF-8</literal>.
 </para>
+<note>
+<para>
+Many web pages marked as using the <literal>ISO-8859-1</literal> character
+encoding actually use the similar <literal>Windows-1252</literal> encoding,
+and web browsers will interpret <literal>ISO-8859-1</literal> web pages as
+<literal>Windows-1252</literal>. <literal>Windows-1252</literal> features
+additional printable characters, such as the Euro sign
+(<literal>€</literal>) and curly quotes (<literal>“</literal>
+<literal>”</literal>), instead of certain <literal>ISO-8859-1</literal>
+control characters. This function will not convert such
+<literal>Windows-1252</literal> characters correctly. Use a different
+function if <literal>Windows-1252</literal> conversion is required.
+</para>
+</note>
 </refsect1>

 <refsect1 role="parameters">
@@ -89,7 +59,10 @@
 &reftitle.seealso;
 <para>
 <simplelist>
-<member><function>utf8_decode</function></member>
+<member><function>utf8_decode</function> - Performs the reverse conversion</member>
+<member><function>mb_convert_encoding</function> - Converts between various character encodings, including UTF-8, ISO-8859-1 and Windows-1252</member>
+<member><function>iconv</function> - Converts between various character encodings</member>
+<member><function>recode_string</function> - Converts between various character encodings</member>
 </simplelist>
 </para>
 </refsect1>
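
The Windows-1252 caveat in the utf8_encode() note can be illustrated the same way. This is a minimal sketch rather than part of the commit: the input bytes are arbitrary samples, the last call assumes the mbstring extension is loaded, and on PHP 8.2 and later utf8_encode() additionally emits a deprecation notice.

```php
<?php
// Byte 0xE9 is "é" in ISO-8859-1; utf8_encode() emits the two-byte
// UTF-8 sequence c3 a9 for U+00E9.
echo bin2hex(utf8_encode("\xE9")), "\n"; // c3a9

// Byte 0x80 is the Euro sign in Windows-1252 but a control character in
// ISO-8859-1, so utf8_encode() yields U+0080 (c2 80), not U+20AC.
echo bin2hex(utf8_encode("\x80")), "\n"; // c280

// Naming the source encoding explicitly gives the Euro sign, U+20AC,
// whose UTF-8 form is the three bytes e2 82 ac.
echo bin2hex(mb_convert_encoding("\x80", 'UTF-8', 'Windows-1252')), "\n"; // e282ac
```

The two- and three-byte outputs match the 110bbbbb 10bbbbbb and 1110bbbb 10bbbbbb 10bbbbbb layouts from the table this commit removes.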