Character encoding

From DreamHost

A character encoding or character set consists of a code that pairs a sequence of characters from a given set with something else, such as a sequence of natural numbers, octets or electrical pulses, in order to facilitate the storage of text in computers and the transmission of text through telecommunication networks. In early computing, the most common form was the American Standard Code for Information Interchange (ASCII), and that is still in use today.
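As a small illustration of that pairing, the following Python sketch maps a few characters to their ASCII numbers and to the octets that would be stored or transmitted. The sample string is arbitrary and chosen only for illustration.

```python
# Map each character of a sample string to its number and its octet.
text = "Hi!"  # arbitrary sample text, not taken from any standard

code_numbers = [ord(ch) for ch in text]  # character -> natural number
octets = text.encode("ascii")            # characters -> sequence of octets

for ch, number in zip(text, code_numbers):
    # Show the character, its code number, and the same number as 8 bits.
    print(f"{ch!r} -> {number} -> {number:08b}")

# Decoding reverses the pairing and recovers the original text.
assert octets.decode("ascii") == text
```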

Common character encodings

ASCII, generally pronounced "ass-key", is a character encoding based on the English alphabet. ASCII codes represent text in computers, communications equipment, and other devices that work with text. Most modern character encodings, which support many more characters, have a historical basis in ASCII.
ISO 8859-2, more formally cited as ISO/IEC 8859-2:1987 and less formally as Latin-2, is part 2 of ISO/IEC 8859, a standard character encoding defined by ISO. It encodes what it refers to as Latin alphabet no. 2, consisting of 191 characters from the Latin script, each encoded as a single 8-bit code value.
UTF-8 (8-bit Unicode Transformation Format) is a variable-length character encoding for Unicode. It can represent any character in the Unicode standard, yet it remains backward compatible with ASCII: the 128 ASCII characters are encoded as the same single bytes in UTF-8. For these reasons, it is steadily becoming the preferred encoding for e-mail, web pages, and other places where characters are stored or streamed.
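The practical differences between these three encodings show up when the same text is encoded with each of them. The sketch below relies on the codecs that ship with Python's standard library (ascii, iso8859_2 and utf-8). The helper name encode_or_none and the sample strings are made up for this illustration.

```python
def encode_or_none(text, encoding):
    """Return the encoded bytes, or None if the encoding lacks a character."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None

# Arbitrary samples: plain English, Polish (covered by Latin-2), Japanese.
samples = ["cat", "Łódź", "日本"]

for text in samples:
    for encoding in ("ascii", "iso8859_2", "utf-8"):
        data = encode_or_none(text, encoding)
        if data is None:
            print(f"{text!r} cannot be represented in {encoding}")
        else:
            print(f"{text!r} in {encoding}: {len(data)} byte(s)")
```

Plain ASCII text yields identical bytes in all three encodings. The Polish sample needs one byte per character in ISO 8859-2 but more in UTF-8, and only UTF-8 can represent the Japanese sample at all.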

Methods of defining character encoding for web documents

Please note that in most cases, these examples also set the MIME type.

With .htaccess

# Serves all files ending in .html as UTF-8
AddCharset UTF-8 .html

# Serves specific file as UTF-8
<Files "example.php">
AddCharset UTF-8 .php
</Files>

With PHP

Output the header before any part of the actual page.

header("Content-Type: text/html;charset=UTF-8");

With Perl

Output the correct header before any part of the actual page.

print "Content-Type: text/html; charset=utf-8\n\n";

With Python

Output the correct header before any part of the actual page. In Python 3, print is a function and appends one newline of its own, so a single \n in the string is enough to produce the blank line that ends the headers.

print("Content-Type: text/html; charset=utf-8\n")
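The header only declares the encoding; the bytes of the page body must actually be encoded the same way. The following is a minimal sketch for Python 3, assuming a CGI-style script that writes its response to standard output. The helper name build_response and the sample page are hypothetical.

```python
import sys

def build_response(body, charset="utf-8"):
    """Return the header and body as bytes, both matching the declared charset."""
    header = f"Content-Type: text/html; charset={charset}\n\n"
    # Header names and values are plain ASCII; the body uses the declared charset.
    return header.encode("ascii") + body.encode(charset)

if __name__ == "__main__":
    page = "<p>Zażółć gęślą jaźń</p>"  # sample body containing non-ASCII text
    # Write bytes directly so Python's own output encoding cannot interfere.
    sys.stdout.buffer.write(build_response(page))
```

Writing through sys.stdout.buffer keeps the declared charset and the emitted bytes in agreement, regardless of the locale the script runs under.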


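With HTML

Place this tag inside the document's head element. It does not change the HTTP header, so a charset sent by the server takes precedence over it.

<meta http-equiv="content-type" content="text/html;charset=utf-8">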

Checking character encoding

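Use Web Sniffer to examine the HTTP request and response of a URL. The encoding declared by the server appears as the charset parameter of the Content-Type response header.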
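The same check can be scripted. The sketch below uses only Python's standard library to read the charset parameter from a Content-Type value. The function names are hypothetical, and charset_of_url makes a real network request to whatever URL it is given.

```python
from email.message import Message
from urllib.request import urlopen

def charset_from_header(content_type):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    # Returns the charset in lowercase, or None if no charset is declared.
    return msg.get_content_charset()

def charset_of_url(url):
    """Fetch a URL and report the charset declared in its response header."""
    with urlopen(url) as response:
        return response.headers.get_content_charset()

print(charset_from_header("text/html; charset=UTF-8"))
print(charset_from_header("text/html"))
```

A result of None means the server sent no charset, in which case the browser falls back to other hints such as a meta tag in the page.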