[i18n-no] Re: Access to your Norwegian and Northern Saami word collection?

2 Jan 2006


      [Kevin Patrick Scannell]
...
Yes, I believe Børre Gaup wrote to me about this last year some
time.
Right.  I suspect he got too busy to follow up on it. :)
...
I don't make the word/frequency lists available on the web because
(ironically) I intend them only to be used for open source projects.
Heh.  Quite ironic, yes, when one considers the freedom aspect of it
all. :)
Does this mean that you are not interested in making the information
available to non-free projects?  I plan to talk to a university
group working on the Norwegian dictionary, to try to get access to
their database of words.  They are not into free software and open
source at all (yet, at least. :).  I assume they would be interested
in getting access to a larger corpus of web documents.  If you are
against sharing with them, I need to know.
...
Can you tell me a bit about the licensing you'll be using for the
spell checkers?  As I recall there was some kind of morphological
back end being written for Saami - will that be open source also or
will you use it to generate a large word list offline?  Are you
writing affix files too?
The Norwegian (bokmål and nynorsk) spell checking package is GPL
licensed.  I've been told that the Saami spell checker will be free
software, but that initial versions will depend on some non-free
software because the people working on it do not know of any free
alternatives.  But I am not directly involved in the Saami work, so
you will have to take this answer with a grain of salt.  Several of
that project's members get a copy of this email (to i18n-sme@), so I
guess they will answer when they find time.
I read on their project page that their first alpha version of the
Northern Saami spell checker will be released now, so I am curious to
see if they are on schedule or not. :)
More info on the Saami project is available from
URL:http://divvun.no/english.html.
...
Anyway, if you send me your latest word lists for all three
languages (with affix flags expanded, if any) I can send lists of
"best candidates" for addition that are determined via some naive
statistics.
I'm not quite sure what you are asking for here, but I assume the 'words'
file normally stored in /usr/share/dict/ is the one you mean.  I
generated these for bokmål and nynorsk, and made them available from
URL:http://folk.uio.no/pre/words.norwegian.tar.gz.  Can you make
updated frequency information for bokmål and nynorsk available as
well?  At the moment we only have outdated frequency info for bokmål,
and this makes the logic to select words unreliable.
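The "naive statistics" mentioned above are not spelled out in this thread.
One plausible reading, sketched here in Python, is to rank the words seen
in the crawled text by frequency and keep those missing from the existing
word list.  The function name, the min_count threshold and the sample data
are all hypothetical, and this is not necessarily how the crawler does it.

```python
def best_candidates(frequencies, known_words, min_count=2):
    """Return (word, count) pairs that are frequent in the corpus but
    missing from the word list, most frequent first.

    frequencies: dict mapping word -> corpus count (assumed input format)
    known_words: iterable of words already in the 'words' file
    min_count:   hypothetical cut-off to skip one-off typos and noise
    """
    known = {word.lower() for word in known_words}
    missing = [
        (word, count)
        for word, count in frequencies.items()
        if word.lower() not in known and count >= min_count
    ]
    # Sort by descending count, then alphabetically for a stable order.
    return sorted(missing, key=lambda item: (-item[1], item[0]))


# Made-up sample data, only to show the shape of the input.
sample_frequencies = {"hus": 10, "bil": 7, "epost": 5, "xyzzy": 1}
sample_known = ["hus", "bil"]
candidates = best_candidates(sample_frequencies, sample_known)
```

With outdated frequency counts the ranking is only as good as the counts,
which is why updated frequency information matters for selecting words.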
...
The crawler runs for Bokmål also, but the language has a substantial
enough web presence that it doesn't qualify for "minority" status
(and so is not listed on the page).
Right.  I guess we are very active on the web, though there are fewer
people speaking bokmål than some of the other languages listed.  At
least I've been told that for example Catalan has more users than the
population of Norway. :)
...
In practical terms, this means that I don't let the crawler run to
completion, but gather just enough text to use for frequency lists,
3-gram models, etc.
Right.  We could really use the frequency list.  As you can probably
tell, I am not very skilled with linguistic stuff, so I do not know
what 3-gram models are.  A quick Google search and a few reads later,
I assume it is the frequency of three words following each other.  We
do not have a way to use such information yet, as far as I know.
Perhaps one of the others on the project knows if it is useful for us or
not.
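A minimal sketch of that reading of 3-grams, in Python: count every run of
three consecutive words.  This assumes word-level 3-grams split on
whitespace; 3-gram models are also commonly built over characters, and the
thread does not say which kind the crawler produces.  The function name
and the sample sentence are made up for illustration.

```python
from collections import Counter


def word_trigrams(text):
    """Count each run of three consecutive words in text."""
    words = text.lower().split()
    # zip over three offset views of the word list yields every
    # consecutive triple (w[i], w[i+1], w[i+2]).
    return Counter(zip(words, words[1:], words[2:]))


counts = word_trigrams("det er en bok og det er en penn")
```

Here the triple ("det", "er", "en") occurs twice and every other triple
once.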
Anyway, thank you for your positive response.  I am eager to see the
results from your system. :)
