"Metadata is a love note to the future"

Thomas Sødring Thomas.Sodring at hioa.no
Wed Jun 21 11:26:28 CEST 2017


On 06/21/2017 10:17 AM, Petter Reinholdtsen wrote:

[Thomas Sødring]


When it comes to machine learning, I would like to auto-classify new
incoming documents based on existing documents. Take, for example, the
Hallingdal municipalities that I worked with. I generated seven database
dump extractions with related documents. The classification codes, other
metadata, and the documents are linked, so we could run a machine
learning training pass on two of the extractions and verify the
algorithm on the other five. There are so many factors that we can
fine-tune and play with during such experiments.


That would be cool.  How much work would it be to get access to these
extractions?  What kind of volume are we talking about?

We're talking about ~400 000 documents; the actual figure can be calculated from the N4OK report.

I created the extractions and understand how it is all related, so with a little understanding of Noark 4 it is relatively easy to get started. The problem is that the extractions contain a lot of sensitive case files and documents. I have spoken to the relevant IKA about this, and they agree, in principle, that it is possible, provided we don't look at the documents.
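
To make the training/verification split above concrete, here is a rough sketch of what such an experiment could look like. Everything in it is an assumption: load_extraction() and the extraction names are hypothetical placeholders, and the baseline model (TF-IDF features plus a linear classifier) is just one of the many factors we could fine-tune.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline

def load_extraction(name):
    # Hypothetical helper: read one Noark 4 dump extraction and return
    # two parallel lists: document texts and their classification codes.
    raise NotImplementedError

# Train on two extractions, verify on the remaining five (placeholder names).
train_names = ["extraction1", "extraction2"]
test_names = ["extraction3", "extraction4", "extraction5",
              "extraction6", "extraction7"]

train_texts, train_codes = [], []
for name in train_names:
    texts, codes = load_extraction(name)
    train_texts += texts
    train_codes += codes

# Baseline: TF-IDF features and a linear classifier.
model = make_pipeline(TfidfVectorizer(max_features=50000),
                      LogisticRegression(max_iter=1000))
model.fit(train_texts, train_codes)

# Check how well the classification codes transfer to the unseen extractions.
for name in test_names:
    texts, codes = load_extraction(name)
    print(name, accuracy_score(codes, model.predict(texts)))

The interesting part, of course, is what load_extraction() would actually pull out of the Noark 4 dumps, and which metadata fields beyond the document text we feed the model.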



The limiting factor here is privacy law. We could do this without ever
looking at the sensitive data, but it would require a strict technical
setup at an IKA.


I do not believe it makes sense to talk about 'without ever looking'
when we are writing software to look at the document content. To me,
the fact that the content is made available for processing by people or
computers controlled by an entity is the privacy-challenging part. I do
not buy into the idea that 'only machines will look at the information'
is less intrusive than having people look at the personal information.

While I agree, my understanding is that the law will not allow us to do this. We cannot simply undertake research on data about citizens, as the data was not collected for that purpose. Forvaltningsloven (the Norwegian Public Administration Act) does open for statistical analysis of such data, but in principle I believe our hands are tied. From discussions with the IKA, I believe we might be able to take such an approach if it becomes part of the process of handling data at an IKA. It can be argued that gaining a deeper understanding of the material and cross-verifying it across datasets is a task an IKA should undertake, so if we approach the problem from that point of view, we might be able to do something like this.

But then the IKA needs a machine, some man-hours for supervision, etc. We will never get our hands on the database or the data. It is possible to try such an approach, but there will be strict rules!


So if we head down this path, we need to consider carefully what we do
and how we do it.



I agree. There are so many avenues opening up around nikita at the moment, and we have the capacity to pursue only a few of them. But this is something that is very relevant and worth considering going forward.


 - Tom