[Kevin Patrick Scannell]
Yes, I believe Børre Gaup wrote to me about this last year some time.
Right. I suspect he got too busy to follow up on it. :)
I don't make the word/frequency lists available on the web because (ironically) I intend them only to be used for open source projects.
Heh. Quite ironic, yes, when one considers the freedom aspect of it all. :)
Does this mean that you are not interested in making the information available to non-free projects? I plan to talk to a university group working on the Norwegian dictionary, to try to get access to their database of words. They are not into free software and open source at all (yet, at least. :). I assume they would be interested in getting access to a larger corpus of web documents. If you are against sharing with them, I need to know.
Can you tell me a bit about the licensing you'll be using for the spell checkers? As I recall there was some kind of morphological back end being written for Saami - will that be open source also or will you use it to generate a large word list offline? Are you writing affix files too?
The Norwegian (bokmål and nynorsk) spell checking package is GPL licensed. I've been told that the Saami spell checker will be free software, but that initial versions will depend on some non-free software because the people working on it do not know of any free alternatives. But I am not directly involved in the Saami work, so you will have to take this answer with a grain of salt. Several of that project's members get a copy of this email (to i18n-sme@), so I guess they will answer when they find time.
I read on their project page that their first alpha version of the Northern Saami spell checker will be released now, so I am curious to see if they are on schedule or not. :)
More info on the Saami project is available from URL:http://divvun.no/english.html.
Anyway, if you send me your latest word lists for all three languages (with affix flags expanded, if any) I can send lists of "best candidates" for addition that are determined via some naive statistics.
I'm not quite sure what you are asking for here, but I assume the 'words' files normally stored in /usr/share/dicts/ are what you mean. I generated them for bokmål and nynorsk and made them available from URL:http://folk.uio.no/pre/words.norwegian.tar.gz. Can you make updated frequency information for bokmål and nynorsk available as well? At the moment we only have outdated frequency info for bokmål, and this makes the logic to select words unreliable.
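The thread does not say which "naive statistics" are meant. One simple reading, sketched here purely as an illustration, is to take words that are frequent in the crawled text but missing from the existing word list. The function name `best_candidates`, the `min_count` threshold and the sample data are all hypothetical, not taken from either project.

```python
def best_candidates(freq, known_words, min_count=2):
    """Return (word, count) pairs that are absent from known_words.

    freq        -- dict mapping word -> corpus frequency (hypothetical input)
    known_words -- iterable of words already in the spell checker's list
    min_count   -- assumed cutoff to filter out rare noise and typos
    """
    known = {w.lower() for w in known_words}
    missing = [
        (word, count)
        for word, count in freq.items()
        if word.lower() not in known and count >= min_count
    ]
    # Most frequent first; ties broken alphabetically for stable output.
    return sorted(missing, key=lambda item: (-item[1], item[0]))


# Made-up example data.
freq = {"hund": 50, "katt": 40, "nettside": 12, "xyzzy": 1}
known_words = ["hund", "katt"]
candidates = best_candidates(freq, known_words)
print(candidates)
```

With outdated frequency data the counts feeding such a filter would be unreliable, which is the problem described above.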
The crawler runs for Bokmål also, but the language has a substantial enough web presence that it doesn't qualify for "minority" status (and so is not listed on the page).
Right. I guess we are very active on the web, though there are fewer people speaking bokmål than some of the other languages listed. At least I've been told that, for example, Catalan has more users than the population of Norway. :)
In practical terms, this means that I don't let the crawler run to completion, but gather just enough text to use for frequency lists, 3-gram models, etc.
Right. We could really use the frequency list. As you can probably tell, I am not very skilled with linguistic stuff, so I do not know what 3-gram models are. A quick Google search and a few reads later, I assume it is the frequency of three words following each other. We do not have a way to use such information yet, as far as I know. Perhaps one of the others on the project knows if it is useful for us or not.
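A minimal sketch of that reading of 3-gram models, assuming word-level trigrams. Character-level 3-grams are also common, and the thread does not say which is meant. The function name and the sample sentence are made up for illustration.

```python
from collections import Counter


def word_trigram_counts(text):
    """Count every run of three consecutive words in text."""
    # Naive tokenisation: lowercase and split on whitespace only.
    words = text.lower().split()
    # zip over three offset views of the list yields consecutive triples.
    return Counter(zip(words, words[1:], words[2:]))


counts = word_trigram_counts("the cat sat on the mat and the cat sat down")
print(counts.most_common(1))
```

Here the triple ("the", "cat", "sat") occurs twice and every other triple once.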
Anyway, thank you for your positive response. I am eager to see the results from your system. :)