UTF-8
February 8, 2017
UTF-8 is the byte-oriented encoding form of Unicode. UTF stands for Unicode Transformation Format. It was originally designed by Ken Thompson and Rob Pike. The encoding is variable-length and uses 8-bit code units: the ‘8’ means that each character is represented by a sequence of 8-bit blocks (bytes). The number of blocks needed to represent a character varies from 1 to 4.
First, in the 1960s, ASCII was created as a standard for information encoding. ASCII defines 128 characters, including lowercase and uppercase letters, digits, punctuation and control characters, each one encoded using 7 bits.
Then came "extended ASCII", which used all 8 bits to accommodate more characters such as á, é, ü and so on. Many different code pages, such as latin1 and windows-1252, were used to fill those extra 128 character slots (i.e. there is no single correspondence chart for those extra 128 characters; it depends on region, language, operating system, etc.). As web technologies spread around the world, it became apparent that neither 128 (7-bit) nor 256 (8-bit) slots were enough to represent such a large number of characters consistently, so Unicode was created as a standard to represent characters from nearly all writing systems. Its code space has room for more than 1,000,000 code points (1,114,112 in total, from U+0000 to U+10FFFF), although only a fraction of them are assigned to characters so far. Code points are written with the prefix "U+".
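The code page problem described above can be illustrated with a short Python sketch. It assumes only the codecs that ship with the standard library (`latin-1` and `cp1252`); the variable names are made up for the illustration. The same two bytes are decoded under both code pages, and only one of them agrees between the two.

```python
# Two bytes from the "upper half" (values 128-255) of an 8-bit encoding.
raw = bytes([0xE9, 0x80])

# The meaning of each byte depends on which code page the reader assumes.
as_latin1 = raw.decode("latin-1")
as_cp1252 = raw.decode("cp1252")

# 0xE9 is "e with acute accent" in both code pages, but 0x80 is a control
# character (U+0080) in latin-1 and the euro sign (U+20AC) in windows-1252.
for name, text in (("latin-1", as_latin1), ("cp1252", as_cp1252)):
    print(name, [f"U+{ord(ch):04X}" for ch in text])
```

Nothing in the bytes themselves says which code page was intended, which is why text exchanged between systems with different defaults could come out garbled.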
UTF-8 is a method for encoding the code points. A character in UTF-8 can be made up of one to four bytes. The encoding of the first 128 code points is identical to their ASCII counterpart: a single byte whose highest bit is 0. Further code points are represented using more than one byte. The first byte of such a sequence starts with as many 1 bits as there are bytes in the sequence (110, 1110 or 11110), and each further byte of the same character starts with the bit sequence 10 to signal that it’s still the same character.
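The byte layout described above can be sketched as a small hand-written encoder. This is an illustrative sketch, not how any particular library implements it; `utf8_encode` is a hypothetical helper name, and in practice Python's built-in `str.encode("utf-8")` does the same job.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (illustrative sketch)."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if code_point < 0x80:
        # 1 byte: 0xxxxxxx (same as ASCII)
        return bytes([code_point])
    if code_point < 0x800:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# Sample code points needing 1, 2, 3 and 4 bytes respectively:
# U+0041 (A), U+00E9 (e acute), U+20AC (euro sign), U+1F600 (an emoji).
for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    encoded = utf8_encode(cp)
    bits = " ".join(f"{byte:08b}" for byte in encoded)
    print(f"U+{cp:04X} -> {len(encoded)} byte(s): {bits}")
```

Printing the bytes in binary makes the prefixes visible: every byte after the first one in a sequence begins with 10, so a decoder can always tell a continuation byte from the start of a new character.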
UTF-8 is a compromise character encoding: it is as compact as ASCII when the file is just plain English text, but it can also contain any Unicode character, at the cost of some increase in file size.
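The size trade-off can be checked with a few lines of Python. The sample strings below are invented for the illustration; the sketch simply compares the number of characters with the number of bytes after encoding.

```python
# Plain English text: every character is in the ASCII range.
english = "Hello, world"

# Text with two accented letters, written with escapes so the source stays ASCII.
mixed = "na\u00efve caf\u00e9"

for text in (english, mixed):
    encoded = text.encode("utf-8")
    print(f"{len(text)} characters -> {len(encoded)} bytes")

# For pure ASCII text, the UTF-8 bytes are exactly the ASCII bytes.
same_as_ascii = english.encode("utf-8") == english.encode("ascii")
print(same_as_ascii)
```

The English string takes one byte per character, while each accented letter in the second string takes two bytes, so the encoded form is slightly longer than the character count.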
Unicode is a standard for representing a great variety of characters from many languages. UTF-8 is the preferred encoding for e-mail and web pages.