Re: [i18n-no] Forbetring av e-postmottak på speling.org

26 Feb 2006


      Petter Reinholdtsen wrote:
...
[Jacob Sparre Andersen]
...
...
That the relevant people get an account on Tyge and become 
members of the "speling(nb|nn)" groups on the machine.
...
OK, I will get that done.  I hope the other 
project participants do the same.
Fine.
...
...
PS: Spell checking and Unicode go very poorly together.
    Unicode's normalisation rules can easily make a mess of
    the texts, since they are not bijective.
Can you repeat this in English?
Sure, I can do that.
It is problematic to do spell checking of texts encoded in an 
ISO-10646/Unicode encoding (of which UTF-8 is the best 
known).  This is because ISO-10646/Unicode contains some 
normalisation rules which only work one way, and that way 
happens to be the wrong way: letters are converted to 
graphics.
One of the effects of this is that you may see a spell 
checking tool posing this question "The word 'blåbærgrød' is 
misspelled.  Did you mean 'blåbærgrød'?".
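
A minimal sketch of how that question can arise, assuming the word list holds the precomposed spelling and the text being checked holds a decomposed one. Python's standard `unicodedata` module is used purely for illustration; the thread does not name a tool, and the variable and function names are made up for this example.

```python
import unicodedata

# The same visible word, written with escapes so the two forms cannot be
# confused by an editor that silently normalises the source file.
composed = "bl\u00e5b\u00e6rgr\u00f8d"     # 'å' as the single code point U+00E5
decomposed = "bla\u030ab\u00e6rgr\u00f8d"  # 'å' as 'a' + combining ring U+030A


def same_word(a: str, b: str) -> bool:
    """Compare two words after bringing both to the composed form (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# A naive lookup compares code points, so the two spellings differ ...
print(composed == decomposed)
# ... although they are the same word once both are normalised.
print(same_word(composed, decomposed))
```

A checker that compares raw code points treats the second spelling as unknown and then suggests the first, which looks identical on screen.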
...
When you write unicode, do you mean unicode, ISO 10646 or 
UTF-8?
I mean ISO-10646 and Unicode in general.  The specific 
choice of encoding does not matter, since the normalisation 
rules (AFAIK) are common for all of them.
...
In any case, we are talking about the storage format of 
the word database, not spell checking as such.  Not sure 
how your statement relates to that, which is part of the 
reason I am confused and ask for more info.
The problem is that with UTF-8 coding of the database, you 
can have eight different UTF-8 strings looking like 
"blåbærgrød" in the database.
We can sort of work around the problem by introducing some 
language specific normalisation rules on top of the Unicode 
rules.  We will still see problems, but they will be limited 
to special cases.
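
As a rough sketch of that first workaround: normalise every entry to the composed form and then apply a small table of extra rules. The two extra rules below are assumptions chosen only to illustrate the idea; the thread does not say which language-specific rules would be needed, and `normalise_entry` is a hypothetical name.

```python
import unicodedata

# Hypothetical extra rules applied on top of Unicode normalisation.
# These two are examples only, not rules taken from the project.
EXTRA_RULES = str.maketrans({
    "\u00ad": None,  # drop soft hyphens
    "\u2019": "'",   # typographic apostrophe -> ASCII apostrophe
})


def normalise_entry(word: str) -> str:
    """Compose the word (NFC), then apply the extra rules."""
    return unicodedata.normalize("NFC", word).translate(EXTRA_RULES)


# A decomposed 'å' comes out as the single code point U+00E5.
print(normalise_entry("bla\u030ab\u00e6rgr\u00f8d") == "bl\u00e5b\u00e6rgr\u00f8d")
```

Running every word through one such function before it is stored or looked up keeps a single spelling per word in the database.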
Another way to work around the problem is to run a program 
which tags graphics coded strings in the database as likely 
errors, so they can be weeded out quickly.
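
Such a tagging program could look roughly like the sketch below. It assumes "graphics coded" means a string that contains combining characters or is otherwise not in the composed form (NFC); `find_suspect_entries` is a hypothetical helper, not an existing tool on the project machine.

```python
import unicodedata


def find_suspect_entries(words):
    """Return the words that are likely errors: not in composed form (NFC)
    or containing at least one combining character."""
    suspects = []
    for word in words:
        not_composed = unicodedata.normalize("NFC", word) != word
        has_combining = any(unicodedata.combining(ch) for ch in word)
        if not_composed or has_combining:
            suspects.append(word)
    return suspects


words = [
    "bl\u00e5b\u00e6rgr\u00f8d",   # composed, fine
    "bla\u030ab\u00e6rgr\u00f8d",  # decomposed 'å', should be tagged
    "hus",                          # plain ASCII, fine
]
for word in find_suspect_entries(words):
    # repr() makes the combining character visible in the report
    print("suspect:", repr(word))
```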
Linux tools are generally nice and don't use the Unicode 
normalisation rules.  Mac OS X, on the other hand, always 
uses the normalisation rules.
Jacob
-- 
»If on top of the two direct hits we have 3 misses, we will
  make a profit«                                      -- Kurt
