Thu, 30 Sep 2010

2:29 PM - How to test for UTF-8 characters

One of the problems on the web is the variety of character encodings.  Computers can represent text as bytes in many different ways.  Some encodings can handle multiple languages; others cannot.  UTF-8 is one that can.  You can test whether input to your web application is valid UTF-8 using this regular expression from the W3C:

http://www.w3.org/International/questions/qa-forms-utf-8

 

$field =~
  m/\A(
     [\x09\x0A\x0D\x20-\x7E]            # ASCII
   | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
   |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
   | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
   |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
   |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
   | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
   |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
  )*\z/x;
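
To try the pattern out, here is a minimal sketch that wraps it in a subroutine and runs it against a few byte strings. The name is_valid_utf8 and the sample inputs are my own assumptions for illustration; they are not part of the W3C page. It assumes the input is a raw byte string (for example, an undecoded form field), not a string Perl has already decoded into characters.

```perl
use strict;
use warnings;

# Hypothetical helper (the name is an assumption, not from the W3C page).
# Returns 1 if $bytes is a raw byte string that matches the pattern,
# 0 otherwise.
sub is_valid_utf8 {
    my ($bytes) = @_;
    return $bytes =~ m/\A(?:
         [\x09\x0A\x0D\x20-\x7E]            # ASCII
       | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
       |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
       | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
       |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
       |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
       | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
       |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
    )*\z/x ? 1 : 0;
}

# Sample inputs (made up for this sketch), written as explicit bytes.
my @samples = (
    [ "caf\xC3\xA9",  'UTF-8 bytes for an accented e' ],
    [ "caf\xE9",      'Latin-1 byte for an accented e' ],
    [ "\xC0\xAF",     'overlong 2-byte sequence' ],
    [ "\xED\xA0\x80", 'UTF-16 surrogate encoded as 3 bytes' ],
);

for my $sample (@samples) {
    my ($bytes, $label) = @$sample;
    printf "%-40s %s\n", $label, is_valid_utf8($bytes) ? 'valid' : 'invalid';
}
```

Only the first sample should be reported as valid.  Note that the ASCII line of the pattern accepts only tab, line feed, carriage return, and the printable range, so a string containing other control characters (a NUL byte, for instance) is rejected even though it is technically well-formed UTF-8.  The empty string passes, because the group may repeat zero times.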

tags: character encoding
