Petter Reinholdtsen wrote:
[Jacob Sparre Andersen]
That the relevant people get an account on Tyge and become members of the "speling(nb|nn)" groups on the machine.
OK, I will get that done. I hope the other project participants do the same.
Good.
PS: Spell checking and Unicode go together very badly. Unicode's normalisation rules can easily make a mess of the texts, since they are not bijective.
Can you repeat this in English?
Sure, I can do that.
It is problematic to spell check texts encoded in an ISO-10646/Unicode encoding (of which UTF-8 is the best known). This is because ISO-10646/Unicode contains some normalisation rules which only work one way, and that way happens to be the wrong way: letters are converted to graphics.
One of the effects of this is that you may see a spell checking tool posing this question "The word 'blåbærgrød' is misspelled. Did you mean 'blåbærgrød'?".
When you write unicode, do you mean unicode, ISO 10646 or UTF-8?
I mean ISO-10646 and Unicode in general. The specific choice of encoding does not matter, since the normalisation rules (AFAIK) are common for all of them.
In any case, we are talking about the storage format of the word database, not spell checking as such. Not sure how your statement relates to that, which is part of the reason I am confused and ask for more info.
The problem is that with a UTF-8 encoded database, you can have eight different UTF-8 strings in the database that all look like "blåbærgrød".
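To illustrate the point, here is a minimal Python sketch using the standard unicodedata module. The function name spellings is made up for this example. Note that under the canonical Unicode normalisation forms only 'å' has a decomposition ('æ' and 'ø' do not), so this sketch produces two distinct UTF-8 strings and does not try to reproduce the count of eight.

```python
import unicodedata


def spellings(word):
    """Return the distinct UTF-8 byte strings for the NFC and NFD forms of word.

    Hypothetical helper for illustration only.
    """
    forms = {unicodedata.normalize(form, word) for form in ("NFC", "NFD")}
    return {s.encode("utf-8") for s in forms}


word = "bl\u00e5b\u00e6rgr\u00f8d"                  # precomposed letters
decomposed = unicodedata.normalize("NFD", word)     # 'å' becomes 'a' + U+030A

print(word == decomposed)             # False, although both render the same
print(len(word), len(decomposed))     # 10 11
for encoded in sorted(spellings(word)):
    print(encoded)
```

A word database that compares raw strings or bytes would treat these as two separate entries.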
We can sort of work around the problem by introducing some language specific normalisation rules on top of the Unicode rules. We will still see problems, but they will be limited to special cases.
Another way to work around the problem is to run a program which tags graphics-coded strings in the database as likely errors, so they can be weeded out quickly.
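Such a tagging program could look roughly like the sketch below. It assumes the database can be read as a plain list of strings, and it treats an entry as "graphics coded" if it contains a combining mark or is not in NFC form. The names is_suspect and flag_suspects are invented for this example, and the real database format may call for something different.

```python
import unicodedata


def is_suspect(entry):
    """True if entry contains combining marks or is not in NFC form."""
    if unicodedata.normalize("NFC", entry) != entry:
        return True
    # unicodedata.combining() returns a non-zero class for combining marks.
    return any(unicodedata.combining(ch) != 0 for ch in entry)


def flag_suspects(entries):
    """Return the entries that should be reviewed as likely errors."""
    return [entry for entry in entries if is_suspect(entry)]


entries = [
    "bl\u00e5b\u00e6rgr\u00f8d",     # precomposed 'å'
    "bla\u030ab\u00e6rgr\u00f8d",    # 'a' followed by combining ring above
    "hest",
]
for entry in flag_suspects(entries):
    print("likely error:", entry.encode("unicode_escape").decode("ascii"))
```

Printing the escaped form makes the otherwise invisible difference visible to whoever does the weeding.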
Linux tools are generally nice and don't use the Unicode normalisation rules. Mac OS X, on the other hand, always uses the normalisation rules.
Jacob