Petter Reinholdtsen wrote on 2 Jan 2006 at 10:02:
[Kevin Patrick Scannell]
Yes, I believe Børre Gaup wrote to me about this last year some time.
Right. I suspect he got too busy to follow up on it. :)
He will return to this; we are following your website.
I don't make the word/frequency lists available on the web because (ironically) I intend them only to be used for open source projects.
Heh. Quite ironic, yes, when one considers the freedom aspect of it all. :) Does this mean that you are not interested in making the information available to non-free projects?
Kevin may answer himself, but I read him as saying that he would like to give the lists to projects he knows adhere to open source. In that case it is easy, since all projects involved in this discussion (both the nno/nob spellers and the Sámi speller project) are open source projects (but cf. below).
Can you tell me a bit about the licensing you'll be using for the spell checkers? As I recall there was some kind of morphological back end being written for Saami - will that be open source also or will you use it to generate a large word list offline? Are you writing affix files too?
There are two Sámi projects, http://divvun.no and http://giellatekno.uit.no. Both are open source (GPL), and both will make everything available. We just don't think things are ready enough to put up a download link yet, but interested parties may already get copies of the source code. The one notable proviso is that the core analysers are compiled with Xerox compilers (twolc, lexc, xfst); these compilers belong to Xerox and are not open source. We do not have access to their source code, only to the binaries. But the binaries are available to all buyers of the http://www.fsmbook.com/ book, so the open-source Sámi morphological transducers may be modified, compiled and run by anyone.
As for affix files, those belong to a different technology, the Xspell family. The Divvun project's alpha version was made on schedule, as an aspell spellchecker, in August, but we haven't distributed it, since we would like to get past some more basic problems before we invite testers to look at it. Interested parties may get a version, though. This is all documented on our web pages, cf. http://divvun.no/doc/proof/spelling/X-spell/aspell.html.
Anyway, if you send me your latest word lists for all three languages (with affix flags expanded, if any) I can send lists of "best candidates" for addition that are determined via some naive statistics.
As is clear from the above, the Sámi projects do not work like that. We have a transducer with a lexicon and a morphological component. We also have corpora from which it is possible to make lists of word forms (not words, i.e., not lemmas, and also not full paradigms). So on this point some elaboration on what you mean is probably needed.
What we would like to get access to is the text corpus you have gathered from the web, but I take it that my colleague Børre will return to you on that issue.
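To make concrete what a list of word forms from a corpus could look like, here is a minimal Python sketch. It is only an illustration under stated assumptions, not the method Kevin uses (his statistics are not described in this thread) and not part of the Sámi tools: the tokenizer, the function names `wordform_frequencies` and `candidate_additions`, and the `min_count` threshold are all hypothetical. It counts surface forms in raw text and lists the ones a given set of known forms does not cover, most frequent first.

```python
import re
from collections import Counter

# Assumption: a "word form" is a run of letters, compared case-insensitively.
# A real tokenizer for Sámi or Norwegian text would need more care.
WORD_RE = re.compile(r"[^\W\d_]+")

def wordform_frequencies(text):
    """Count surface word forms (not lemmas) in raw corpus text."""
    return Counter(m.group(0).lower() for m in WORD_RE.finditer(text))

def candidate_additions(freqs, known_forms, min_count=2):
    """Return (form, count) pairs for forms missing from known_forms,
    seen at least min_count times, sorted by descending frequency."""
    unknown = [(form, n) for form, n in freqs.items()
               if form not in known_forms and n >= min_count]
    return sorted(unknown, key=lambda item: (-item[1], item[0]))
```

With a transducer instead of a flat word list, the test `form not in known_forms` would be replaced by asking the analyser whether it accepts the form; that step is left out here.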
Trond.
----------------------------------------------------------------------
Trond Trosterud                                         t +47 7764 4763
Institutt for språkvitskap, Det humanistiske fakultet   m +47 950 70140
N-9037 Universitetet i Tromsø, Noreg                    f +47 7764 5216
Trond.Trosterud (a) hum.uit.no       http://www.hum.uit.no/a/trond/
----------------------------------------------------------------------