On 10:02 Mon 02 Jan, Petter Reinholdtsen wrote:
I don't make the word/frequency lists available on the web because (ironically) I intend them only to be used for open source projects.
Heh. Quite ironic, yes, when one considers the freedom aspect of it all. :)
Does this mean that you are not interested in making the information available to non-free projects? I plan to talk to a university group working on the Norwegian dictionary, to try to get access to their database of words. They are not into free software and open source at all (yet, at least. :).
Right - generally speaking I only provide the data to open source projects. What we've done for some languages in similar situations is convince the dictionary project to make some of their data available in return for access to the corpora - maybe you could explore this possibility with them.
My Norwegian (bokmål and nynorsk) spell checking package is GPL licensed.
Great!
I'm not quite sure what you are asking for here, but I assume the 'words' files normally stored in /usr/share/dicts/ are the ones you mean. I
Actually I meant the latest development version of your spell checking packages, word list + affix file, etc. I used the existing aspell dictionaries to train the web crawler, but it looks like those were created by someone else and are outdated.
I can send you raw frequency lists but those aren't all that useful since they contain a lot of "pollution" - my software works in part by trying to filter out the pollution by statistical means. This is why having your latest version is useful.
Right. We could really use the frequency list. As you can probably tell, I am not very skilled with linguistic stuff, so I do not know what 3-gram models are. A quick Google search and a few reads later, I assume it is the frequency of three words following each other.
Yes - but it can also mean sequences of three characters, which is what I mean here. They're used for language recognition and also as part of the pollution filters.
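To make the character-level reading concrete, here is a minimal sketch in Python of counting character 3-grams and comparing two texts by their 3-gram profiles. It is only an illustration and not the crawler's actual code: the function names `char_trigrams` and `similarity`, the space padding at word edges, and the use of cosine similarity are all assumptions made for this example.

```python
from collections import Counter


def char_trigrams(text):
    """Count overlapping three-character sequences in text.

    The text is lowercased and padded with one space on each side so that
    sequences at the start and end of the text are counted too (an
    assumption of this sketch, not a requirement of 3-gram models).
    """
    padded = f" {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def similarity(a, b):
    """Cosine similarity between two 3-gram count profiles (0.0 to 1.0)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sum(c * c for c in a.values()) ** 0.5
    norm_b = sum(c * c for c in b.values()) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Tiny hypothetical reference samples; real profiles need far more text.
norwegian = char_trigrams("jeg vet ikke hva det er")
english = char_trigrams("i do not know what it is")

query = char_trigrams("hva er det")
print(similarity(query, norwegian), similarity(query, english))
```

A text is then guessed to be in whichever language's profile it most resembles. With samples this small the scores are only suggestive; a usable recognizer would build its profiles from a large corpus per language.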
Kevin
P.S. Sorry about the delay in replying - I'm just back from out of town.