Hello, and happy new year!
I believe Torstein Dybdahl was in contact with you earlier regarding the Norwegian languages and your corpus-building web crawler, URL:http://borel.slu.edu/crubadan/. He has since become busy with real life and was unable to continue this effort. I am part of the group working with Torstein on the Norwegian spell checking systems. I'm very pleased to discover that Norwegian Nynorsk (nn) and Northern Saami (se) are listed on the status page with lots of files and words registered.
We here in Norway are now in the process of revitalizing the Norwegian Bokmål and Nynorsk spell checking package, URL:http://no.speling.org/. This is a volunteer project. To do a good job with this, we need to find updated frequency information for the Norwegian words. At the moment, we have access to neither a corpus nor frequency information for either of these languages.
In addition, a related group of people is funded by the Norwegian government to create spell checking systems for several of the Saami languages, URL:http://divvun.no/english.html. This work is organized by the University of Tromsø, and this group has access to a corpus, but could use more words.
If I understood Torstein correctly, you are willing to share your collection of words with us. But I've checked your web pages and been unable to find links to the word collection. Where can I find the list of words, preferably with frequency information? Is the collection of files available on the web somewhere?
Norwegian Bokmål is missing from your status page. Would you be willing to collect documents for that language as well? If not, how hard is it to set up your software system on our servers so we can collect words for this language ourselves?
Cc to the Norwegian and Saami translators mailing list, which is read by the people working on the spell checking systems.
Friendly,
Hi Petter,
We here in Norway are now in the process of revitalizing the Norwegian Bokmål and Nynorsk spell checking package, URL:http://no.speling.org/.
Excellent - I just saw the project announcement come across on freshmeat.
In addition, a related group of people is funded by the Norwegian government to create spell checking systems for several of the Saami languages, URL:http://divvun.no/english.html. This work is organized by the University of Tromsø, and this group has access to a corpus, but could use more words.
Yes, I believe Børre Gaup wrote to me about this last year some time.
Is the collection of files available on the web somewhere?
I don't make the word/frequency lists available on the web because (ironically) I intend them only to be used for open source projects. Can you tell me a bit about the licensing you'll be using for the spell checkers? As I recall there was some kind of morphological back end being written for Saami - will that be open source also or will you use it to generate a large word list offline? Are you writing affix files too?
Anyway, if you send me your latest word lists for all three languages (with affix flags expanded, if any) I can send lists of "best candidates" for addition that are determined via some naive statistics.
Norwegian Bokmål is missing from your status page. Would you be willing to collect documents for that language as well?
The crawler runs for Bokmål also, but the language has a substantial enough web presence that it doesn't qualify for "minority" status (and so is not listed on the page). In practical terms, this means that I don't let the crawler run to completion, but gather just enough text to use for frequency lists, 3-gram models, etc.
I'll be travelling for the next week, but should have access to a computer - so please be patient if I'm slow in responding.
-Kevin
[Kevin Patrick Scannell]
Yes, I believe Børre Gaup wrote to me about this last year some time.
Right. I suspect he got too busy to follow up on it. :)
I don't make the word/frequency lists available on the web because (ironically) I intend them only to be used for open source projects.
Heh. Quite ironic, yes, when one considers the freedom aspect of it all. :)
Does this mean that you are not interested in making the information available to non-free projects? I plan to talk to a university group working on the Norwegian dictionary, to try to get access to their database of words. They are not into free software and open source at all (yet, at least. :). I assume they would be interested in getting access to a larger corpus of web documents. If you are against sharing with them, I need to know.
Can you tell me a bit about the licensing you'll be using for the spell checkers? As I recall there was some kind of morphological back end being written for Saami - will that be open source also or will you use it to generate a large word list offline? Are you writing affix files too?
The Norwegian (Bokmål and Nynorsk) spell checking package is GPL licensed. I've been told that the Saami spell checker will be free software, but that initial versions will depend on some non-free software because the people working on it do not know of any free alternatives. But I am not directly involved in the Saami work, so you will have to take this answer with a grain of salt. Several of that project's members get a copy of this email (to i18n-sme@), so I guess they will answer when they find time.
I read on their project page that the first alpha version of the Northern Saami spell checker will be released now, so I am curious to see if they are on schedule or not. :)
More info on the Saami project is available from URL:http://divvun.no/english.html.
Anyway, if you send me your latest word lists for all three languages (with affix flags expanded, if any) I can send lists of "best candidates" for addition that are determined via some naive statistics.
I'm not quite sure what you are asking for here, but I assume the 'words' files normally stored in /usr/share/dict/ are what you mean. I generated them for Bokmål and Nynorsk and made them available from URL:http://folk.uio.no/pre/words.norwegian.tar.gz. Can you make updated frequency information for Bokmål and Nynorsk available as well? At the moment we only have outdated frequency info for Bokmål, and this makes the logic to select words unreliable.
The crawler runs for Bokmål also, but the language has a substantial enough web presence that it doesn't qualify for "minority" status (and so is not listed on the page).
Right. I guess we are very active on the web, though there are fewer people speaking Bokmål than some of the other languages listed. At least I've been told that Catalan, for example, has more speakers than the population of Norway. :)
In practical terms, this means that I don't let the crawler run to completion, but gather just enough text to use for frequency lists, 3-gram models, etc.
Right. We could really use the frequency list. As you can probably tell, I am not very skilled with linguistic stuff, so I do not know what 3-gram models are. After a quick Google search and a bit of reading, I assume it is the frequency of three words following each other. We do not have a way to use such information yet, as far as I know. Perhaps one of the others on the project knows whether it is useful for us or not.
Anyway, thank you for your positive response. I am eager to see the results from your system. :)
Petter Reinholdtsen wrote on 2 Jan 2006 at 10:02:
[Kevin Patrick Scannell]
Yes, I believe Børre Gaup wrote to me about this last year some time.
Right. I suspect he got too busy to follow up on it. :)
He will return to this; we are following your website.
I don't make the word/frequency lists available on the web because (ironically) I intend them only to be used for open source projects.
Heh. Quite ironic, yes, when one considers the freedom aspect of it all. :) Does this mean that you are not interested in making the information available to non-free projects?
Kevin may answer for himself, but I read him as wanting to give the lists to projects he knows adhere to open source. In that case, it is easy, since all projects involved in this discussion (both the nno/nob spellers and the Sámi speller project) are open source projects (but cf. below).
Can you tell me a bit about the licensing you'll be using for the spell checkers? As I recall there was some kind of morphological back end being written for Saami - will that be open source also or will you use it to generate a large word list offline? Are you writing affix files too?
There are two Sámi projects, http://divvun.no and http://giellatekno.uit.no. Both are open source, GPL, and both will make everything available. We just don't think things are ready enough to put up a download link, but interested parties may get copies of the source code already now. The one notable proviso is that the core analysers are compiled by Xerox compilers (twolc, lexc, xfst); these compilers belong to Xerox and are not open source. We do not have access to the source code of these compilers, only to their binary versions. But they are accessible (as binaries) to all buyers of the http://www.fsmbook.com/ book, so the open-source Sámi morphological transducers may be modified, compiled and run by anyone.
As for affix files, those belong to a different technology, the Xspell family. The divvun project alpha version was made on schedule, as an aspell spellchecker, in August, but we haven't distributed it, since we would like to get past some more basic problems before we invite testers to look at it. Interested parties may get a version, though. This is all documented on our web pages, cf. http://divvun.no/doc/proof/spelling/X-spell/aspell.html.
Anyway, if you send me your latest word lists for all three languages (with affix flags expanded, if any) I can send lists of "best candidates" for addition that are determined via some naive statistics.
As is clear from the above, the Sámi projects do not work like that. We have a transducer with a lexicon and a morphological component. We also have corpora from which it is possible to make lists of wordforms (not words, i.e., not lemmas, and also not full paradigms). So on this point some elaboration on what you mean is probably needed.
What we would like to get access to is the text corpus you have gathered from the web, but I take it that my colleague Børre will return to you on that issue.
Trond.
----------------------------------------------------------------------
Trond Trosterud                                        t +47 7764 4763
Institutt for språkvitskap, Det humanistiske fakultet  m +47 950 70140
N-9037 Universitetet i Tromsø, Noreg                   f +47 7764 5216
Trond.Trosterud (a) hum.uit.no          http://www.hum.uit.no/a/trond/
----------------------------------------------------------------------
On 12:54 Mon 02 Jan , Trond Trosterud wrote:
There are two Sámi projects, http://divvun.no and http://giellatekno.uit.no. Both are open source, GPL, and both will make everything available. We just don't think things are ready enough to put up a download link, but interested parties may get copies of the source code already now.
I see - I'd definitely like to see what you have so far.
And I understand about the Xerox stuff. There's a similar situation for Irish - a morphological analyzer was developed at the Irish Linguistics Institute (ITÉ) with the Xerox tools. I developed a separate, completely open source version to use for my grammar checking stuff.
So as long as you make the Sámi transducer sources available freely, I'm satisfied (and I'll be excited to look at what you've done since I have a particular interest in morphology).
As is clear from the above, the Sámi projects do not work like that. We have a transducer with a lexicon and a morphological component. We also have corpora from which it is possible to make lists of wordforms (not words, i.e., not lemmas, and also not full paradigms). So on this point some elaboration on what you mean is probably needed.
My code is designed to work with an existing Xspell package (even if it amounts to a simple word list with no affix file, which is how most languages start out). So for me to offer any non-trivial help I'd have to work your transducer into my system, which mightn't be too hard if you're interested.
What we would like to get access to is the text corpus you have gathered from the web, but I take it that my colleague Børre will return to you on that issue.
OK, well just sending you raw data is the easiest thing for me! Let me know.
Best Kevin
Kevin Patrick Scannell wrote on 9 Jan 2006 at 06:07:
I see - I'd definitely like to see what you have so far.
I'll send you a tarball. Since it seems you have been using Xerox tools you will also be able to compile our stuff. The documentation is found under "TechDoc" at our site.
And I understand about the Xerox stuff. There's a similar situation for Irish - a morphological analyzer was developed at the Irish Linguistics Institute (ITÉ) with the Xerox tools.
Yes, I know Elaine very well; since we both use Xerox tools, we cooperated on some work during the development phase. Send her my greetings if you meet her.
I developed a separate, completely open source version to use for my grammar checking stuff.
If what you made for Irish is based upon her code but uses open-source compilers (or whatever), then you have already done what we would like to do. We would thus be interested in seeing how you have utilized her stuff, and if you could send us your separate open source version, we would be interested in looking at it.
So as long as you make the Sámi transducer sources available freely, I'm satisfied (and I'll be excited to look at what you've done since I have a particular interest in morphology).
Feedback welcome.
My code is designed to work with an existing Xspell package (even if it amounts to a simple word list with no affix file, which is how most languages start out). So for me to offer any non-trivial help I'd have to work your transducer into my system, which mightn't be too hard if you're interested.
The divvun project made an Xspell version (alpha version, no compounding) in August. It was not done without difficulties, and here too we would be interested in input.
OK, well just sending you raw data is the easiest thing for me! Let me know.
Nothing could be better than copies of what you have. Børre's address is boerre@skolelinux.no.
Trond.
On 09:12 Mon 09 Jan , Trond Trosterud wrote:
Yes, I know Elaine very well; since we both use Xerox tools, we cooperated on some work during the development phase. Send her my greetings if you meet her.
I will!
If what you made for Irish is based upon her code but uses open-source compilers (or whatever), then you have already done what we would like to do.
No, it's not based on Elaine's work (which I don't even have a copy of).
What I tried to do as part of my "grammar checking engine" http://borel.slu.edu/gramadoir/ is come up with a formalism somewhere between the Xspell affix file and a full-blown transducer. I don't think it's been such a great success - most people find the syntax too arcane to be easily usable. But I like it and find it quite powerful.
The whole system is documented here: http://borel.slu.edu/gramadoir/manual/index.html
and morphology is section 3.1.4.
Note that in practice, the Irish grammar checker only uses this in a limited way -- stuff like verbal morphology (which I'm guessing makes up a lot of what Elaine did) is done "offline" when generating the part-of-speech-tagged lexicon.
Nothing could be better than copies of what you have. Børre's address is boerre@skolelinux.no.
OK Kevin
On 10:02 Mon 02 Jan , Petter Reinholdtsen wrote:
I don't make the word/frequency lists available on the web because (ironically) I intend them only to be used for open source projects.
Heh. Quite ironic, yes, when one considers the freedom aspect of it all. :)
Does this mean that you are not interested in making the information available to non-free projects? I plan to talk to a university group working on the Norwegian dictionary, to try to get access to their database of words. They are not into free software and open source at all (yet, at least. :).
Right - generally speaking I only provide the data to open source projects. What we've done for some languages in similar situations is convince the dictionary project to make some of their data available in return for access to the corpora - maybe you could explore this possibility with them.
The Norwegian (Bokmål and Nynorsk) spell checking package is GPL licensed.
Great!
I'm not quite sure what you are asking for here, but I assume the 'words' files normally stored in /usr/share/dict/ are what you mean.
Actually I meant the latest development version of your spell checking packages, word list + affix file, etc. I used the existing aspell dictionaries to train the web crawler, but it looks like those were created by someone else and are outdated.
I can send you raw frequency lists but those aren't all that useful since they contain a lot of "pollution" - my software works in part by trying to filter out the pollution by statistical means. This is why having your latest version is useful.
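To make the idea concrete, here is a minimal Python sketch of the general approach described above: build a raw frequency list from corpus text, then use an existing word list to separate known words from unknown "candidates". This is only an illustration under stated assumptions, not the crawler's actual filtering code, which is not described in this thread. The function names `raw_frequencies` and `candidates` and the `min_count` threshold are hypothetical.

```python
import re
from collections import Counter


def raw_frequencies(text):
    """Count lowercased word tokens (runs of letters) in a text."""
    # [^\W\d_]+ matches runs of Unicode letters, so "bokmål" stays whole.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(tokens)


def candidates(freqs, known_words, min_count=2):
    """Return (word, count) pairs for words missing from known_words.

    Words seen fewer than min_count times are dropped as likely noise
    (an assumed, deliberately naive stand-in for a real pollution filter).
    The result is sorted by descending count, then alphabetically.
    """
    unknown = [
        (word, count)
        for word, count in freqs.items()
        if word not in known_words and count >= min_count
    ]
    return sorted(unknown, key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    freqs = raw_frequencies("Hus hus bil bil bil xyzzy the the the")
    for word, count in candidates(freqs, known_words={"hus", "bil"}):
        print(word, count)
```

The more complete the existing word list, the smaller and cleaner the candidate list becomes, which is one way to see why the latest development version of the spell checker is more useful here than an outdated dictionary.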
Right. We could really use the frequency list. As you can probably tell, I am not very skilled with linguistic stuff, so I do not know what 3-gram models are. After a quick Google search and a bit of reading, I assume it is the frequency of three words following each other.
Yes - but it can also mean sequences of three characters, which is what I mean here. They're used for language recognition and also as part of the pollution filters.
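As a rough illustration of character 3-grams used for language recognition, the Python sketch below builds a trigram count profile per language and picks the profile most similar to a sample. It is a generic textbook-style approach, not the crawler's actual implementation. The function names, the use of cosine similarity, and the tiny training sentences are all assumptions made for the example; real profiles would be built from much larger texts.

```python
import math
from collections import Counter


def char_trigrams(text):
    """Count overlapping 3-character sequences, padding with spaces."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def similarity(a, b):
    """Cosine similarity between two trigram count profiles."""
    dot = sum(a[gram] * b[gram] for gram in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def guess_language(text, profiles):
    """Return the key of the profile most similar to the text."""
    sample = char_trigrams(text)
    return max(profiles, key=lambda lang: similarity(sample, profiles[lang]))


if __name__ == "__main__":
    # Toy profiles from one sentence each (illustrative only).
    profiles = {
        "nb": char_trigrams("jeg vet ikke hva det heter"),
        "nn": char_trigrams("eg veit ikkje kva det heiter"),
    }
    print(guess_language("ikkje", profiles))
    print(guess_language("ikke", profiles))
```

The same kind of profile can serve as a crude pollution filter: a token or line whose trigrams match the target language poorly is a candidate for removal.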
Kevin
ps sorry about the delay in replying - I'm just back from out of town.
[Kevin Patrick Scannell]
Right - generally speaking I only provide the data to open source projects.
OK. The problem with this is that we as a free software project have no way to keep the info we get from you away from non-free software projects. To make sure all people willing to contribute to your project have the means to do it, we need to make sure all our sources and background material are publicly available. We already tried the alternative when the last maintainer went missing: no one else had the background material, so the spell checker could not really be updated for five years. If you want us to keep your data non-public, it will in reality be unavailable to all potential (and probably some current) contributors to the spell checking project. So what do you mean when you say you only provide data to open source projects? Are projects like ours supposed to keep the information you provide away from non-open source projects, or can we make it publicly available to everyone?
What we've done for some languages in similar situations is convince the dictionary project to make some of their data available in return for access to the corpora - maybe you could explore this possibility with them.
I will for sure explore that possibility, but it would only solve part of the problem. :)
Actually I meant the latest development version of your spell checking packages, word list + affix file, etc.
Right. Those are available from URL:https://alioth.debian.org/projects/spell-norwegian/. The build system is a bit special, so you will have to extract the bokmål and nynorsk words from norsk.words. :)
I used the existing aspell dictionaries to train the web crawler, but it looks like those were created by someone else and are outdated.
Yes. The aspell package floating around is based on the old source of the spell checking package.
I can send you raw frequency lists but those aren't all that useful since they contain a lot of "pollution" - my software works in part by trying to filter out the pollution by statistical means. This is why having your latest version is useful.
Having a look at the raw frequency list would be useful for me, to see which words in the current package could use an updated frequency value. Please post the URL to i18n-no@.
On 08:34 Mon 09 Jan , Petter Reinholdtsen wrote:
[Kevin Patrick Scannell]
Right - generally speaking I only provide the data to open source projects.
OK. The problem with this is that we as a free software project have no way to keep the info we get from you away from non-free software projects.
Sure you do: just don't put all of the texts on the web. Run-of-the-mill contributors to the project have no need for the unprocessed corpora. You're welcome to put frequency lists, etc. up for others to use, or word lists for contributors to check.
Right. Those are available from URL:https://alioth.debian.org/projects/spell-norwegian/. The build system is a bit special, so you will have to extract the bokmål and nynorsk words from norsk.words. :)
thanks, I'll have a look.
I can send you raw frequency lists but those aren't all that useful since they contain a lot of "pollution" - my software works in part by trying to filter out the pollution by statistical means. This is why having your latest version is useful.
Having a look at the raw frequency list would be useful for me, to see which words in the current package could use an updated frequency value. Please post the URL to i18n-no@.
Temporary link:
http://borel.slu.edu/obair/nbnnse.zip
Frequencies are based on corpora of 1.38M words (nb), 3.06M words (nn), and 1.99M words (se).
-Kevin
On Mon, 9 January 2006 at 14:44, Kevin Patrick Scannell wrote:
On 08:34 Mon 09 Jan , Petter Reinholdtsen wrote:
[Kevin Patrick Scannell]
Right - generally speaking I only provide the data to open source projects.
OK. The problem with this is that we as a free software project have no way to keep the info we get from you away from non-free software projects.
Sure you do: just don't put all of the texts on the web. Run-of-the-mill contributors to the project have no need for the unprocessed corpora. You're welcome to put frequency lists, etc. up for others to use, or word lists for contributors to check.
Right. Those are available from URL:https://alioth.debian.org/projects/spell-norwegian/. The build system is a bit special, so you will have to extract the bokmål and nynorsk words from norsk.words. :)
thanks, I'll have a look.
I can send you raw frequency lists but those aren't all that useful since they contain a lot of "pollution" - my software works in part by trying to filter out the pollution by statistical means. This is why having your latest version is useful.
Having a look at the raw frequency list would be useful for me, to see which words in the current package could use an updated frequency value. Please post the URL to i18n-no@.
Temporary link:
http://borel.slu.edu/obair/nbnnse.zip
Frequencies are based on corpora of 1.38M words (nb), 3.06M words (nn), and 1.99M words (se).
Thank you, Kevin! I'll run the Sámi part of the list through our transducer and report the results to you.
[Kevin Patrick Scannell]
Sure you do: just don't put all of the texts on the web. Run-of-the-mill contributors to the project have no need for the unprocessed corpora.
My point is that we have no way to know who is going to be deeply involved and who will be just a passing contributor. And to make sure all those willing and capable of becoming deeply involved can do so without having to be trained by those in the know and given extra access, all the information used to maintain the spell checker needs to be publicly available. Yes, of course we could have a secret archive of extra information, but then we run the risk of killing the project when the few with access to the secret archive disappear from the project.
This has already happened once with this spell checking project, when Rune Kleveland started working and lost access to the project web page. This stopped development for almost five years. He had (and probably still has) access to lots of extra data, and as no one else had this data we had a really hard time continuing development. I do not want us to end up in that situation again, and thus believe we should base the spell checking work only on publicly available sources.
You're welcome to put frequency lists, etc. up for others to use, or word lists for contributors to check.
Good. If that is the public info we can get from you, it will come in very handy. :)
Temporary link:
Thank you. I've downloaded it, and will put it on the web pages soon. Will need to massage the scripts before I can use the numbers to update the frequency info in norsk.words. :)