Character Sets and Document Encoding

Thursday 2nd July, 2009

I've just spent an embarrassingly long time trying to solve a problem with normalizing a string (i.e. replacing accented and other non-ASCII characters with their plain ASCII equivalents).

I needed to take the string "L'Autopsie Phénoménale De Dieu" and remove the é's, replacing them with e's - this was so the string could be used as part of a URL.

I tried all sorts of things, from the obvious (str_replace / strtr functions) to the ridiculously complicated (all sorts of string encoding, decoding, regex replacements) but nothing seemed to work as expected.

Eventually I discovered that the problem had nothing to do with the PHP code - it was in fact the document encoding. All pages on my site are served with the standard UTF-8 charset, but the PHP files that were actually performing the string replace functions were saved with a 'Western European' encoding type. That meant the é typed into the source file was stored as a different byte sequence from the é in the UTF-8 string, so the search never matched. Re-saving the files as UTF-8 solved the problem.
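To make the mismatch concrete, here is a minimal sketch. It assumes 'Western European' means ISO-8859-1 (or Windows-1252), where é is the single byte 0xE9, whereas UTF-8 stores é as the two bytes 0xC3 0xA9. The bytes are written as escapes so the example behaves the same whatever encoding the file itself is saved in; the variable names are just for illustration.

```php
<?php
// The title as UTF-8 bytes (é = 0xC3 0xA9), as it would arrive from a UTF-8 page.
$title = "L'Autopsie Ph\xC3\xA9nom\xC3\xA9nale De Dieu";

// What a literal 'é' becomes in a source file saved as ISO-8859-1: one byte.
$needleLatin1 = "\xE9";

// What a literal 'é' becomes in a source file saved as UTF-8: two bytes.
$needleUtf8 = "\xC3\xA9";

// The Latin-1 needle never occurs in the UTF-8 string, so nothing is replaced.
$unchanged = str_replace($needleLatin1, 'e', $title);

// The UTF-8 needle matches both occurrences.
$fixed = str_replace($needleUtf8, 'e', $title);

var_dump($unchanged === $title); // bool(true)
echo $fixed, "\n";               // L'Autopsie Phenomenale De Dieu
```

str_replace compares raw bytes and knows nothing about character sets, which is why the call fails silently rather than raising an error.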

In the end I used a great function by allixsenos from the PHP.net comments area.
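That function isn't reproduced here. As a rough sketch of the same general idea (not allixsenos's code), something like the following would turn the title into a URL-friendly string. The function name slugify_sketch and the tiny character map are hypothetical, and a real map would need to cover far more characters.

```php
<?php
// Hypothetical helper: maps a few accented letters to ASCII, then builds a slug.
function slugify_sketch(string $text): string
{
    // Keys are UTF-8 byte sequences, written as escapes so this file's
    // own encoding cannot reintroduce the mismatch described above.
    $map = [
        "\xC3\xA9" => 'e', // é
        "\xC3\xA8" => 'e', // è
        "\xC3\xA0" => 'a', // à
        "\xC3\xB4" => 'o', // ô
    ];

    // Replace the mapped characters in a single pass.
    $ascii = strtr($text, $map);

    // Lowercase, then collapse every run of other characters into one hyphen.
    $slug = preg_replace('/[^a-z0-9]+/', '-', strtolower($ascii));

    // Drop any leading or trailing hyphens.
    return trim($slug, '-');
}

echo slugify_sketch("L'Autopsie Ph\xC3\xA9nom\xC3\xA9nale De Dieu"), "\n";
// l-autopsie-phenomenale-de-dieu
```

Anything not in the map simply becomes a hyphen, so this is only suitable where losing unmapped characters is acceptable.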
