
Jul 10, 2013

Character encoding and decoding


Q. What do you understand by the terms character encoding and decoding?
A. To a computer, text characters are just symbols. Each symbol is assigned a number (an integer) so that it can be stored in memory. Encoding is the scheme by which characters are mapped to these numbers (and ultimately to bytes). Decoding is the reverse: converting the numbers back to characters.
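As a minimal sketch of the round trip (the class and method names are my own, and UTF-8 is just the charset I picked for illustration):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class EncodeDecodeSketch {

    // Encoding: characters -> numbers (bytes) under a named charset.
    static byte[] encode(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Decoding: the same numbers -> characters, using the same charset.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] bytes = encode("Hi");
        System.out.println(Arrays.toString(bytes)); // [72, 105]
        System.out.println(decode(bytes));          // Hi
    }
}
```

The important part is that encode and decode agree on the charset. If they do not, the numbers are still there but they map to the wrong characters.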


Q. What is the difference between ASCII and UTF-8?
A.

ASCII was one of the earliest widely used encodings. It is a 7-bit encoding that supports only 128 characters: the English letters, digits, punctuation and some control characters, on the assumption that computers were built for English-speaking users. Google for an ASCII table to see those characters. This made sense at a time when computer memory was limited as well.

UTF-8 has the added benefit of supporting characters beyond the ASCII range. UTF-8 is an encoding of Unicode, which has a different way of thinking about characters. Unicode has several encodings (UTF-8, UTF-16 and UTF-32), and UTF-8 is the most popular one on the internet because it is backward compatible with ASCII: every ASCII character is stored as a single byte whose leading bit is 0.
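A small sketch of the difference, assuming Java 7+ for StandardCharsets (the class name, helper names and sample strings are mine):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class AsciiVersusUtf8 {

    static byte[] ascii(String s) { return s.getBytes(StandardCharsets.US_ASCII); }
    static byte[] utf8(String s)  { return s.getBytes(StandardCharsets.UTF_8); }

    public static void main(String[] args) {
        // Plain English text: both encodings produce the same bytes.
        System.out.println(Arrays.equals(ascii("Java"), utf8("Java"))); // true

        // U+00E9 (e with an acute accent) is outside the ASCII range.
        String accented = "\u00E9";
        System.out.println(Arrays.toString(ascii(accented))); // [63], i.e. '?'
        System.out.println(Arrays.toString(utf8(accented)));  // [-61, -87], i.e. 0xC3 0xA9
    }
}
```

String.getBytes silently substitutes '?' for characters the charset cannot represent, so encoding non-English text as ASCII loses information without any exception being thrown.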




Every letter in every alphabet is assigned a magic number called a "code point". ASCII has 128 code points, whereas Unicode comprises 1,114,112 code points in the range 0 (hex) to 10FFFF (hex). Google for a Unicode table (e.g. http://unicode-table.com/en/). In UTF-8, every code point from 0-127 is stored in a single byte. Only code points 128 and above are stored using 2, 3 or 4 bytes (the original UTF-8 design allowed up to 6 bytes, but the current standard, RFC 3629, limits it to 4). This means English text looks exactly the same in UTF-8 as it did in ASCII. Characters are referred to by their "Unicode code point". Unicode code points are written in hexadecimal (to keep the numbers shorter), preceded by "U+". The character A is represented as U+0041.
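To make the code point notation and the variable byte length concrete, here is a rough sketch. The class name, the helper methods and the sample code points are my own choices for illustration.

```java
import java.nio.charset.StandardCharsets;

class CodePointSketch {

    // Formats a code point in the conventional U+XXXX notation.
    static String notation(int codePoint) {
        return String.format("U+%04X", codePoint);
    }

    // Number of bytes UTF-8 needs for a single code point.
    static int utf8Length(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // A, the copyright sign, the euro sign and an emoji
        int[] samples = {0x41, 0xA9, 0x20AC, 0x1F600};
        for (int cp : samples) {
            System.out.println(notation(cp) + " -> " + utf8Length(cp) + " byte(s)");
        }
        // U+0041 -> 1 byte(s)
        // U+00A9 -> 2 byte(s)
        // U+20AC -> 3 byte(s)
        // U+1F600 -> 4 byte(s)
    }
}
```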


Q. What is the significance of character encoding?
A. Every string has an encoding associated with it. There is no such thing as plain text. If you have a string in memory, in a file, in a web browser or in an email message, you have to know what encoding it is in, or you cannot interpret it or display it to users correctly. If you open a document and it looks like a jumble of garbled characters, there is one and only one reason for it: your text editor, browser, word processor or whatever else is trying to read the document is assuming the wrong encoding. The document is not broken, and you simply need to select the right encoding to display it.
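The "wrong encoding" situation can be reproduced in a few lines. This is only a sketch: the class and method names are hypothetical, and ISO-8859-1 stands in for whatever wrong charset a reader might assume.

```java
import java.nio.charset.StandardCharsets;

class WrongEncodingSketch {

    // Encodes text as UTF-8, then decodes it as ISO-8859-1, which is
    // what a program does when it assumes the wrong encoding.
    static String misread(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.ISO_8859_1);
    }

    // Recovers the underlying bytes and decodes them with the right charset.
    static String repair(String misreadText) {
        byte[] rawBytes = misreadText.getBytes(StandardCharsets.ISO_8859_1);
        return new String(rawBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "caf\u00E9"; // 4 characters
        String garbled = misread(original);

        System.out.println(garbled.length());                  // 5: one character became two
        System.out.println(repair(garbled).equals(original));  // true
    }
}
```

The repair step works here because ISO-8859-1 maps every byte to exactly one character, so no bytes are lost by the wrong guess. That is the sense in which the document itself is not broken.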

For an email message, you are expected to have a string in the header of the form

    Content-Type: text/plain; charset="UTF-8"


In HTML 5

   <meta charset="UTF-8">

In HTML 4

   <meta http-equiv="Content-type" content="text/html;charset=UTF-8">
  
Setting the encoding to UTF-8 with Java Servlets (response is the HttpServletResponse):

    response.setContentType("text/html;charset=utf-8");


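Q. Does the java.lang.String class in Java have an encoding?
A. No. A String represents Unicode text (internally as UTF-16 code units). An encoding comes into play only when the String is converted to or from a byte[] array, so it is the byte[] array that has an encoding. For example: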

package com.mycompany.app;

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharacterEncoding
{
    public static void main(String[] args)
    {
        // the JVM's default charset (depends on the platform and launch settings)
        System.out.println(Charset.defaultCharset());

        String str = "Copy Right \u00A9";

        // encodes and decodes with the default charset
        System.out.println(new String(str.getBytes()));

        // encodes and decodes with UTF-8; both directions must use the same charset
        System.out.println(new String(str.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8));
    }
}


When I ran the above in Eclipse, I got the following output.

windows-1252
Copy Right ©
Copy Right ©


This was using the Eclipse IDE's default text file encoding.

 
When I changed this encoding to UTF-8 via Window --> Preferences --> General --> Workspace, the output changed as shown below.

UTF-8
Copy Right ©
Copy Right ©
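
One more way to see that the encoding belongs to the byte[] and not to the String: encode the same String with different charsets and compare the array lengths. This is a sketch with names of my own choosing, and the three charsets are picked only for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class BytesHaveEncoding {

    // Length of the byte[] produced when a String is encoded with the given charset.
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String str = "Copy Right \u00A9"; // 12 characters

        System.out.println(encodedLength(str, StandardCharsets.ISO_8859_1)); // 12
        System.out.println(encodedLength(str, StandardCharsets.UTF_8));      // 13
        System.out.println(encodedLength(str, StandardCharsets.UTF_16BE));   // 24
    }
}
```

The String is identical in all three calls. Only the bytes differ, which is why a byte[] is meaningless unless you also know the charset that produced it.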
