Improve utf8_decode() and utf8_encode() documentation

This rewrites the descriptions of both functions to clarify that they convert specifically between ISO-8859-1 and UTF-8, adds a warning about confusion with Windows-1252, and adds "See also" links to other character set conversion functions. It also clarifies the behaviour of utf8_decode() for invalid characters and removes the description of the UTF-8 binary encoding.

git-svn-id: https://svn.php.net/repository/phpdoc/en/trunk@340506 c90b9560-bf6c-de11-be94-00142212c4b1
2025-03-17 01:18:55 +00:00 · 2016-10-17 14:31:39 +00:00 · 2016-10-17 14:31:39 +00:00 · 838941f6cc
commit 838941f6cc
parent a9d0248cd3
2 changed files with 45 additions and 51 deletions
--- a/reference/xml/functions/utf8-decode.xml
+++ b/reference/xml/functions/utf8-decode.xml
@@ -16,9 +16,27 @@
   <methodparam><type>string</type><parameter>data</parameter></methodparam>
  </methodsynopsis>
  <para>
-   This function decodes <parameter>data</parameter>, assumed to be
-   <literal>UTF-8</literal> encoded, to <literal>ISO-8859-1</literal>.
+   This function converts the string <parameter>data</parameter> from the
+   <literal>UTF-8</literal> encoding to <literal>ISO-8859-1</literal>. Bytes
+   in the string which are not valid <literal>UTF-8</literal>, and
+   <literal>UTF-8</literal> characters which do not exist in
+   <literal>ISO-8859-1</literal> (that is, characters above
+   <literal>U+00FF</literal>) are replaced with <literal>?</literal>.
  </para>
+  <note>
+   <para>
+    Many web pages marked as using the <literal>ISO-8859-1</literal> character
+    encoding actually use the similar <literal>Windows-1252</literal> encoding,
+    and web browsers will interpret <literal>ISO-8859-1</literal> web pages as
+    <literal>Windows-1252</literal>. <literal>Windows-1252</literal> features
+    additional printable characters, such as the Euro sign
+    (<literal>€</literal>) and curly quotes (<literal>“</literal>
+    <literal>”</literal>), instead of certain <literal>ISO-8859-1</literal>
+    control characters. This function will not convert such
+    <literal>Windows-1252</literal> characters correctly. Use a different
+    function if <literal>Windows-1252</literal> conversion is required.
+   </para>
+  </note>
 </refsect1>

 <refsect1 role="parameters">
@@ -29,7 +47,7 @@
     <term><parameter>data</parameter></term>
     <listitem>
      <para>
-       An UTF-8 encoded string.
+       A UTF-8 encoded string.
      </para>
     </listitem>
    </varlistentry>
@@ -48,7 +66,10 @@
  &reftitle.seealso;
  <para>
   <simplelist>
-    <member><function>utf8_encode</function> (contains an explanation of UTF-8 encoding)</member>
+    <member><function>utf8_encode</function> - Performs the reverse conversion</member>
+    <member><function>mb_convert_encoding</function> - Converts between various character encodings, including UTF-8, ISO-8859-1 and Windows-1252</member>
+    <member><function>iconv</function> - Converts between various character encodings</member>
+    <member><function>recode_string</function> - Converts between various character encodings</member>
   </simplelist>
  </para>
 </refsect1>
--- a/reference/xml/functions/utf8-encode.xml
+++ b/reference/xml/functions/utf8-encode.xml
@@ -13,53 +13,23 @@
   <methodparam><type>string</type><parameter>data</parameter></methodparam>
  </methodsynopsis>
  <para>
-   This function encodes the string <parameter>data</parameter> to
-   <literal>UTF-8</literal>, and returns the encoded version.
-   <literal>UTF-8</literal> is a standard mechanism used by
-   <acronym>Unicode</acronym> for encoding <glossterm>wide
-   character</glossterm> values into a byte stream.
-   <literal>UTF-8</literal> is transparent to plain <abbrev>ASCII</abbrev>
-   characters, is self-synchronized (meaning it is possible for a program to
-   figure out where in the bytestream characters start) and can be used with
-   normal string comparison functions for sorting and such. PHP encodes
-   <literal>UTF-8</literal> characters in up to four bytes, like this:
-   <table>
-    <title>UTF-8 encoding</title>
-    <tgroup cols="3">
-     <thead>
-      <row>
-       <entry>bytes</entry>
-       <entry>bits</entry>
-       <entry>representation</entry>
-      </row>
-     </thead>
-     <tbody>
-      <row>
-       <entry>1</entry>
-       <entry>7</entry>
-       <entry>0bbbbbbb</entry>
-      </row>
-      <row>
-       <entry>2</entry>
-       <entry>11</entry>
-       <entry>110bbbbb 10bbbbbb</entry>
-      </row>
-      <row>
-       <entry>3</entry>
-       <entry>16</entry>
-       <entry>1110bbbb 10bbbbbb 10bbbbbb</entry>
-      </row>
-      <row>
-       <entry>4</entry>
-       <entry>21</entry>
-       <entry>11110bbb 10bbbbbb 10bbbbbb 10bbbbbb</entry>
-      </row>
-     </tbody>
-    </tgroup>
-   </table>
-   Each <replaceable>b</replaceable> represents a bit that can be
-   used to store character data.
+   This function converts the string <parameter>data</parameter> from the
+   <literal>ISO-8859-1</literal> encoding to <literal>UTF-8</literal>.
  </para>
+  <note>
+   <para>
+    Many web pages marked as using the <literal>ISO-8859-1</literal> character
+    encoding actually use the similar <literal>Windows-1252</literal> encoding,
+    and web browsers will interpret <literal>ISO-8859-1</literal> web pages as
+    <literal>Windows-1252</literal>. <literal>Windows-1252</literal> features
+    additional printable characters, such as the Euro sign
+    (<literal>€</literal>) and curly quotes (<literal>“</literal>
+    <literal>”</literal>), instead of certain <literal>ISO-8859-1</literal>
+    control characters. This function will not convert such
+    <literal>Windows-1252</literal> characters correctly. Use a different
+    function if <literal>Windows-1252</literal> conversion is required.
+   </para>
+  </note>
 </refsect1>

 <refsect1 role="parameters">
@@ -89,7 +59,10 @@
  &reftitle.seealso;
  <para>
   <simplelist>
-    <member><function>utf8_decode</function></member>
+    <member><function>utf8_decode</function> - Performs the reverse conversion</member>
+    <member><function>mb_convert_encoding</function> - Converts between various character encodings, including UTF-8, ISO-8859-1 and Windows-1252</member>
+    <member><function>iconv</function> - Converts between various character encodings</member>
+    <member><function>recode_string</function> - Converts between various character encodings</member>
   </simplelist>
  </para>
 </refsect1>
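
The conversions described in the documentation above can be sketched as follows. This is a Python illustration, not PHP, and the helper names (latin1_to_utf8, utf8_to_latin1, cp1252_to_utf8) are hypothetical. It only mimics the documented behaviour: ISO-8859-1 to UTF-8 and back, with unrepresentable characters replaced by "?", and it shows why Windows-1252 input needs a different conversion. The exact number of "?" that PHP emits for a run of invalid bytes is not reproduced here.

```python
# Hypothetical helpers that mimic the documented conversions; not PHP's implementation.

def latin1_to_utf8(data: bytes) -> bytes:
    """ISO-8859-1 -> UTF-8, the direction documented for utf8_encode()."""
    return data.decode("iso-8859-1").encode("utf-8")


def utf8_to_latin1(data: bytes) -> bytes:
    """UTF-8 -> ISO-8859-1, the direction documented for utf8_decode().

    Invalid UTF-8 and characters above U+00FF end up as '?'.
    """
    text = data.decode("utf-8", errors="replace")
    return text.encode("iso-8859-1", errors="replace")


def cp1252_to_utf8(data: bytes) -> bytes:
    """Windows-1252 -> UTF-8, for input that is only labelled ISO-8859-1."""
    return data.decode("cp1252").encode("utf-8")


# 'é' (U+00E9) exists in ISO-8859-1, so it converts cleanly in both directions.
e_acute_utf8 = latin1_to_utf8(b"\xe9")            # b"\xc3\xa9"

# The Euro sign (U+20AC) is above U+00FF, so it is replaced with '?'.
lossy = utf8_to_latin1("é€".encode("utf-8"))      # b"\xe9?"

# Byte 0x80 is the Euro sign in Windows-1252 but a control character in ISO-8859-1.
as_latin1 = latin1_to_utf8(b"\x80")               # b"\xc2\x80" (U+0080)
as_cp1252 = cp1252_to_utf8(b"\x80")               # b"\xe2\x82\xac" (U+20AC)
```

The last two lines show the pitfall from the note: treating Windows-1252 bytes as ISO-8859-1 produces a valid but wrong UTF-8 string, so no error signals the mistake.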