HTML, ASCII, UTF8, Unicode: Weird characters on the loose �

Are you seeing lots of weird characters like � or ▯ in your documents or web pages?

Check out the block of characters and symbols below. Anything look wrong?

ā ē ī ō ū
á æ ð é í ó ö þ ú ý
☎ ✆ ✈ ✉ ✍ ✎
½ ¾ ≈ ≤ ∞ ±

Problems with older browsers, encoding and fonts can sometimes lead to display errors. It's all about 'character sets' and the codes they use.

For instance, the � replacement character is from a character set called 'Unicode'. It indicates that your system couldn't recognise a piece of code and didn't know how to turn it into a symbol.
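One way to see the replacement character appear is to decode bytes with the wrong character set. The snippet below is a minimal sketch in Python; the sample word "café" and the variable names are assumptions made up for illustration, not something from a real page.

```python
# "café" stored in the older ISO-8859-1 (Latin-1) character set.
# In Latin-1 the letter "é" is the single byte 0xE9.
latin1_bytes = "café".encode("latin-1")

# Reading those bytes as UTF-8 fails on 0xE9, so we ask Python to
# substitute the Unicode replacement character (U+FFFD) instead of raising.
decoded = latin1_bytes.decode("utf-8", errors="replace")

print(decoded)                 # caf�
print(hex(ord(decoded[-1])))   # 0xfffd
```

A browser does much the same thing when a page declares one encoding but is actually saved in another.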

Similarly, the ▯ Unicode character is likely to show up here and there when you're cutting and pasting between different types of document. It indicates a mismatch between the character you have pasted and those supported by the font.

And then there's a whole bunch of stuff that occasionally turns up in web page titles, like &nbsp; &trade; &copy;. These are HTML symbols gone wrong. Depending on the version of HTML you are using you might argue they come from the UTF-8, ASCII or ISO-8859-1 character set. They are code that should have been translated into symbols: &trade; is ™, &nbsp; is a blank (non-breaking) space and &copy; is ©.
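If a title has ended up showing the raw codes (&trade;, &nbsp; and &copy; are the ones in question here), they can be turned back into symbols with Python's standard `html` module. This is only a sketch: the `broken_title` string is a made-up example.

```python
import html

# A hypothetical page title where the entity codes were never translated.
broken_title = "Acme&trade; Widgets &copy; 2020"

# html.unescape converts named and numeric entities into real characters.
fixed_title = html.unescape(broken_title)
print(fixed_title)  # Acme™ Widgets © 2020

# &nbsp; becomes the non-breaking space, U+00A0 (it looks like a normal space).
nbsp = html.unescape("&nbsp;")
print(hex(ord(nbsp)))  # 0xa0

# Going the other way: write a symbol as a numeric entity that is safe
# to store anywhere that only accepts plain ASCII.
as_entity = "™".encode("ascii", errors="xmlcharrefreplace").decode("ascii")
print(as_entity)  # &#8482;
```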

If you're seeing this kind of error then the characters have probably been typed or pasted somewhere (usually involving a database) that only accepts 'alphanumeric' characters. It's like speaking French and expecting your audience to hear Spanish.
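The effect of that kind of mismatch can be reproduced in a few lines of Python. The sketch below assumes a store that only keeps ASCII, and a reader that guesses the wrong character set; the sample text is invented for illustration.

```python
text = "café"

# A store that only accepts ASCII has nowhere to put "é",
# so the character is lost and a "?" is written in its place.
ascii_only = text.encode("ascii", errors="replace")
print(ascii_only)  # b'caf?'

# Text saved as UTF-8 but read back as ISO-8859-1 (Latin-1):
# the two bytes of "é" are shown as two separate wrong characters.
garbled = text.encode("utf-8").decode("latin-1")
print(garbled)  # cafÃ©

# If nothing was lost, reversing the wrong step recovers the original.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)  # café
```

The first case cannot be undone because the information is gone; the second usually can, which is why it pays to work out which mistake you are looking at.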

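Then there are those awful hyperlinks that contain stuff like %6f%62%73%63%75%72%65. That's percent-encoding: each byte is written as a % sign followed by its value in hexadecimal (this one spells "obscure"). It's how URLs carry characters they aren't allowed to contain directly.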
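Codes like these are usually called percent-encoding, and Python's standard `urllib.parse` module can translate them in both directions. A minimal sketch, using the string from the paragraph above and the "Wētā Workshop" name as sample input:

```python
from urllib.parse import quote, unquote

# Decode: each %xx pair is one byte, written in hexadecimal.
decoded = unquote("%6f%62%73%63%75%72%65")
print(decoded)  # obscure

# Encode: characters that are not allowed in a URL are converted to
# UTF-8 bytes, and each byte becomes a %XX code.
encoded = quote("Wētā Workshop")
print(encoded)  # W%C4%93t%C4%81%20Workshop

# The round trip gives back the original text.
print(unquote(encoded))  # Wētā Workshop
```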

Don't underestimate the importance of using the right characters. Film-maker Peter Jackson's "Weta Workshop" is a very different thing from "Wētā Workshop" (the correct spelling, with macrons). Without the macrons, the word means excrement.

Anglophones often neglect things like this because they seldom use accented characters. You'll get a sense of the difference it can make if you look at the difference an underscore might have made in the following domain names (all real examples):

If you're setting out to become a web or database developer, you're going to come across this kind of issue again and again. On the bright side, the people who design browsers and databases are learning to make allowances. It's much less of a problem than it used to be. And you'll quickly learn how to recognise and fix the problems. Here are some resources that'll make troubleshooting easier.

Explanations and resources

List of special characters to create symbols in HTML
List of characters for use in URLs
List of Unicode characters (click for HTML equivalents)
ASCII character set + hex, binary & HTML equivalents
ASCII to HEX converter
HEX to ASCII converter