Dealing with UTF-8 in PHP 5

Published: 03/05/10 01:36 PM


I recently worked on a search-engine indexer that crawls UTF-8 files and builds an index of the words they contain. I ran into a few issues that are easy to solve once you know about them.

First of all, I worked with UTF-8 files without a BOM.

To find candidate search words, I used regular expressions.

The indexed words were supposed to be stored in lower case.

Here are my suggestions if you have a similar task:

Reading the files with file_get_contents and writing the index file with file_put_contents caused no problems, as all of my files, including the PHP sources, were encoded in UTF-8.
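As a minimal sketch of that round trip: file_put_contents and file_get_contents pass the bytes through unchanged, so UTF-8 text comes back exactly as it was written. The temporary file and the sample text are made up for illustration, and the script itself is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical temporary file, used only for this illustration.
$path = tempnam(sys_get_temp_dir(), 'idx');

// Sample UTF-8 content (assumes this source file is saved as UTF-8).
$text = "grüße\nçà et là\n";

// Both functions work on raw bytes; no encoding conversion takes place.
file_put_contents($path, $text);
$read = file_get_contents($path);

var_dump($read === $text);                    // the bytes are identical
var_dump(mb_check_encoding($read, 'UTF-8'));  // and still valid UTF-8

unlink($path);
```

If the input files may come from elsewhere, a check like mb_check_encoding before indexing is probably worthwhile.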

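Make sure to use preg_split or preg_replace with the '/u' modifier, for example preg_split('/\b/u', $haystack). This tells PHP to treat both the pattern and the subject string as UTF-8 (see the PCRE pattern modifiers page on php.net for more information). Make sure to use the lowercase 'u'; the uppercase 'U' is a different modifier that makes quantifiers ungreedy.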
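Here is a small sketch of word splitting with the '/u' modifier. Note that it is not the exact pattern from my indexer: instead of \b it splits on runs of non-letter, non-digit characters using the Unicode properties \p{L} and \p{N}, which is just one possible approach. The sample text and variable names are made up, and the script is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical sample text (assumes this source file is saved as UTF-8).
$haystack = 'Über Straßen, naïve café';

// Split on every run of characters that are neither letters nor digits.
// The 'u' modifier makes PCRE read pattern and subject as UTF-8, so
// \p{L} (any letter) and \p{N} (any number) match whole characters,
// not single bytes.
$words = preg_split('/[^\p{L}\p{N}]+/u', $haystack, -1, PREG_SPLIT_NO_EMPTY);

print_r($words);  // four words: Über, Straßen, naïve, café
```

With '/u', preg_split returns false if the subject is not valid UTF-8, so it is worth checking the return value before looping over it.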

Using mb_split instead of preg_split didn't work for me: the results differed from what I expected. I didn't look into it any further, though.

To convert UTF-8 strings to lowercase, don't use strtolower; use mb_convert_case($word, MB_CASE_LOWER, 'UTF-8') instead.
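A short sketch of the lowercase conversion, assuming the mbstring extension is loaded and the script is saved as UTF-8. The sample word is made up; mb_strtolower is shown only as an equivalent shorthand.

```php
<?php
// Hypothetical sample word (assumes this source file is saved as UTF-8).
$word = 'ÜBERGRÖSSE';

// Convert to lowercase with an explicit encoding, so multibyte
// characters such as Ü and Ö are handled as whole characters.
$lower = mb_convert_case($word, MB_CASE_LOWER, 'UTF-8');

// mb_strtolower is a shorthand for the same conversion.
$alsoLower = mb_strtolower($word, 'UTF-8');

echo $lower, "\n";  // übergrösse
```

Passing the encoding explicitly avoids depending on whatever internal encoding mbstring happens to be configured with.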

That’s it. If you have any further tips related to this topic, I’d appreciate your comment.

