Bolyard: Accented characters are not indexed in Sphinx

Friday, 27 September 2013

Accented characters are not indexed in Sphinx


I have problems searching for words that contain accented characters. I
use Sphinx 2.1.1 on Linux, with MS SQL Server 2005 accessed via ODBC (FreeTDS).
Here is my sphinx.conf:
source parentSource
{
    type = odbc
    ...
}

index parentIndex
{
    morphology = stem_en
    charset_type = utf-8
    charset_table = 0..9, a..z, A..Z->a..z, ... (mapping taken from
    http://sphinxsearch.com/wiki/doku.php?id=charset_tables for common, A-Z)
    ...
}
After changing the config, I reindexed all indexes and restarted searchd.
When I search for "Muller", I get results that contain only "Muller".
When I search for "Müller", I also get only "Muller" results. But there
are also "Müller" records in the db that are not indexed properly. The
mapping for ü (U+00FC->u) is present in the config. As I understand it,
after I added the accented characters to charset_table, they are converted
when I search, but not when the content is indexed.
When I ran indexer with the --buildstops option, I found the following
entry in the output file: "mller". And yes, when I search for "mller", I
get "Müller" results (but no "Muller", of course).
What do I need to do so that a search for "Muller" or "Müller" returns
results for both "Muller" and "Müller"?
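
To make the expected behaviour concrete, here is a minimal Python sketch of the folding that the charset_table rules quoted above should produce. It is an illustration only, not Sphinx's implementation: the helpers build_table and tokenize are hypothetical, only the rules named in this post (0..9, a..z, A..Z->a..z, U+00FC->u) are modelled, and treating unmapped characters as separators is an assumption about how the tokenizer behaves.

```python
# Hypothetical model of charset_table folding; not Sphinx's actual code.
# Only the rules quoted in the post are modelled:
#   0..9, a..z, A..Z->a..z, U+00FC->u

def build_table():
    """Return a dict mapping each accepted character to its folded form."""
    table = {}
    for cp in range(ord("0"), ord("9") + 1):
        table[chr(cp)] = chr(cp)              # digits map to themselves
    for cp in range(ord("a"), ord("z") + 1):
        table[chr(cp)] = chr(cp)              # a..z map to themselves
        table[chr(cp).upper()] = chr(cp)      # A..Z -> a..z
    table["\u00fc"] = "u"                     # U+00FC (ü) -> u
    return table


def tokenize(text, table):
    """Fold mapped characters; treat every unmapped character as a separator (assumption)."""
    tokens, current = [], []
    for ch in text:
        if ch in table:
            current.append(table[ch])
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


TABLE = build_table()
print(tokenize("Muller", TABLE))        # ['muller']
print(tokenize("M\u00fcller", TABLE))   # ['muller']
```

If both the indexed text and the query pass through the same table, both spellings end up as the single keyword "muller", which is the result being asked for.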
PS: the collation used for the column (and for the whole database) is
SQL_LATIN1_GENERAL_CP1_CI_AS. I changed the column type from varchar to
nvarchar, but it didn't help. "Müller" records are displayed properly on
the site (without ???) and when I run indexer with --dump-rows.
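
One hypothesis that would fit the "mller" keyword, sketched in Python: the text may reach the indexer as single-byte CP1252 rather than UTF-8, so ü arrives as the lone byte 0xFC, which is not valid UTF-8. Both that encoding and the idea that invalid bytes get skipped are assumptions, not something confirmed here; the helper names are hypothetical.

```python
# Hypothesis only: the column text reaches the indexer as single-byte CP1252
# instead of UTF-8. Nothing here is Sphinx or FreeTDS code.

def is_valid_utf8(data: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def drop_invalid_utf8(data: bytes) -> str:
    """Decode as UTF-8 and silently skip invalid bytes (assumed indexer behaviour)."""
    return data.decode("utf-8", errors="ignore")


raw_cp1252 = "M\u00fcller".encode("cp1252")   # b'M\xfcller'
raw_utf8 = "M\u00fcller".encode("utf-8")      # b'M\xc3\xbcller'

print(is_valid_utf8(raw_cp1252))               # False
print(drop_invalid_utf8(raw_cp1252).lower())   # mller
print(is_valid_utf8(raw_utf8))                 # True
```

A check like is_valid_utf8 on the raw bytes that the ODBC driver returns would show whether this hypothesis applies. If it does, the fix would lie in the connection's character set, not in charset_table.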
