Re: How should email be stored in a NOARK5 database?

Thomas Sødring thomas.sodring at hioa.no
Wed Dec 21 13:41:34 CET 2016


Hi,

I did some work on email when I worked on creating a Noark 4 extraction.
The database contained information from 1999 to 2012. A lot of the
emails were plain text and those were left as plain text. At some point
a lot of emails became HTML and I used an open source PHP library to
render the HTML and convert it to PDF/A. There were lots of EML files as
well and to be honest I can't remember how I handled them.

None of this is an appropriate way of handling email! Message IDs,
reply IDs and signatures all fall away, but as you see in the response
you got from the national archive, they do not want to think about it.
They avoid getting involved in the detail and rely on the vendors to
sort it out. The "arkivskaper" (records creator) is incapable of solving
this. The vendors don't really care about long term preservation, and
will probably be happy to simply convert it to a long term preservation
format (PDF/A). In fact, when I teach Noark to students I still describe
emails as an "utradisjonell kilde" (non-traditional source) of
documents.

In the Noark thinking, an email will be registered in a registryEntry
(no:journalpost). The registryEntry is just like an envelope so it can
contain documents and attachments. It's up to the case handler to take
the contents of the email and register them as the primary document and
attachments. Sometimes an attachment might be a company logo and you may
not want to register that information. So it's not as simple as
registering everything in an email. Sometimes a person might send
documents belonging to two different cases in the same email. Automation
here is difficult because sometimes people do stupid things.
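
To make the "envelope" idea concrete, here is a minimal sketch using
Python's standard email package. It only lists the parts of a message
that a case handler could choose to register; it does not talk to any
Noark system. The sample message and the function name
candidate_documents are made up for illustration.

```python
from email import message_from_string, policy

# Hypothetical sample message: a text body, a PDF and an inline logo.
RAW = """\
From: sender@example.org
To: postmottak@example.org
Subject: Application
Message-ID: <abc@example.org>
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="BOUND"

--BOUND
Content-Type: text/plain; charset="utf-8"

Please find the application attached.
--BOUND
Content-Type: application/pdf
Content-Disposition: attachment; filename="application.pdf"
Content-Transfer-Encoding: base64

JVBERi0xLjQK
--BOUND
Content-Type: image/png
Content-Disposition: inline; filename="logo.png"
Content-Transfer-Encoding: base64

iVBORw0KGgo=
--BOUND--
"""


def candidate_documents(raw):
    """Return (role, content type, filename) for each registrable part."""
    msg = message_from_string(raw, policy=policy.default)
    candidates = []
    # The text body is the natural candidate for the primary document.
    body = msg.get_body(preferencelist=("plain", "html"))
    if body is not None:
        candidates.append(("body", body.get_content_type(), None))
    # Everything else is offered as a possible attachment; a human
    # still decides what is worth registering (e.g. skip the logo).
    for part in msg.iter_attachments():
        candidates.append(
            ("attachment", part.get_content_type(), part.get_filename())
        )
    return candidates


for role, content_type, filename in candidate_documents(RAW):
    print(role, content_type, filename)
```

The sketch deliberately stops at producing a list of candidates, since
deciding which parts belong to which case is the step that is hard to
automate.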

There is not really any room to handle the In-Reply-To field in Noark.
There is a cross-reference mechanism in Noark that probably could be
used to do that, but in this case it would only be used to reference
other registryEntries or caseFiles. I think the In-Reply-To and
References fields are important authenticity information that should
not be lost. You might be interested to read about the Capstone project
from the US [1]. Datatilsynet might not allow a similar project in
Norway, but I really like the idea of capturing all emails and archiving
them. If we did this the In-Reply-To and References fields would be very
useful, especially for capturing a lot of the information that
"disappears" within government!
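
As a rough illustration of what would have to be captured, the sketch
below pulls the threading headers out of a raw message with Python's
standard email package. The RFC 5322 header names are Message-ID,
In-Reply-To and References; the sample message and the function name
threading_headers are made up for illustration, and how the values
would be mapped onto Noark metadata is left open.

```python
from email import message_from_string, policy

# Hypothetical reply message used only for this example.
REPLY = """\
From: a@example.org
Message-ID: <reply-1@example.org>
In-Reply-To: <orig-1@example.org>
References: <root-1@example.org> <orig-1@example.org>
Subject: Re: Application

Thanks.
"""


def _text(value):
    """Convert a header object to a plain string, keeping None as None."""
    return None if value is None else str(value).strip()


def threading_headers(raw):
    """Extract the headers needed to rebuild a thread or a correct reply."""
    msg = message_from_string(raw, policy=policy.default)
    return {
        "message_id": _text(msg["Message-ID"]),
        "in_reply_to": _text(msg["In-Reply-To"]),
        # References is a whitespace separated list of message ids.
        "references": (_text(msg["References"]) or "").split(),
    }


print(threading_headers(REPLY))
```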

Signatures are something that the archive wants to keep away from. The
reason being that there is no reliable way of dealing with (PKI) key
management in a long term preservation perspective. In 100 years, we
might have moved away from PKI to something else that we are unable to
see at the moment. The public keys simply are not there or may have been
revoked. So Noark has fields that say the document was signed and the
signature was verified on a particular date as correct or incorrect by
some system.
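
The kind of record described above could look roughly like the sketch
below. The class and field names are hypothetical and are not the
actual Noark metadata names; the point is only that the outcome of a
verification is stored, not the keys needed to repeat it.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class SignatureRecord:
    """Hypothetical record of a signature check (not real Noark names)."""

    signed: bool
    verified: Optional[bool] = None        # None means never checked
    verified_date: Optional[date] = None   # when the check was done
    verified_by: Optional[str] = None      # system that did the check

    def summary(self) -> str:
        if not self.signed:
            return "not signed"
        if self.verified is None:
            return "signed, not verified"
        outcome = "valid" if self.verified else "invalid"
        return (
            f"signed, found {outcome} on {self.verified_date} "
            f"by {self.verified_by}"
        )


record = SignatureRecord(True, True, date(2016, 12, 21), "mail gateway")
print(record.summary())
```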

None of these preservation "problems" place limitations on how we can do
record-keeping, so there are things we can do in the core to try and
handle the important issue you are looking at.

Reproducing emails is very important and given your description I think
we could have a container format that packages the various attachments
in a suitable long term preservation format along with important
metadata. It would be a half-way solution as you accept that some of the
metadata will become useless in the future, but you have an authentic
representation of the email. This is an important issue for archives as
they simply do not want to deal with signatures.

As an aside, I think we are today not capturing enough metadata! I think
we will look back at this era in the future and lament the fact that we
threw away so much information. So maybe we might move to a situation
where we actually start to increase the amount of metadata.

In many ways that means creating an authenticity envelope. It's up to
the archive then whether they just want the email or the authenticity
envelope. I guess the NA are even open to this as they say you can store
the email as XML, provided it has an XSD.
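
A minimal sketch of what such an authenticity envelope could look like
is given below, using only Python's standard library. The element and
attribute names (emailEnvelope, headers, parts and so on) are invented
for this example and there is no XSD behind them; a real format would
need one. The idea shown is simply to keep every header plus checksums
of the raw message and of each part alongside the preserved documents.

```python
import hashlib
import xml.etree.ElementTree as ET
from email import message_from_bytes, policy

# Hypothetical sample message used only for this example.
SAMPLE = (
    b"From: sender@example.org\r\n"
    b"Message-ID: <m1@example.org>\r\n"
    b"In-Reply-To: <m0@example.org>\r\n"
    b"Subject: Test\r\n"
    b"\r\n"
    b"Hello\r\n"
)


def authenticity_envelope(raw: bytes) -> str:
    """Build an XML envelope (hypothetical layout) for a raw message."""
    msg = message_from_bytes(raw, policy=policy.default)
    root = ET.Element("emailEnvelope")
    # Checksum of the untouched original, to tie the envelope to it.
    ET.SubElement(root, "sha256").text = hashlib.sha256(raw).hexdigest()
    # Keep every header, including Message-ID and In-Reply-To.
    headers = ET.SubElement(root, "headers")
    for name, value in msg.items():
        ET.SubElement(headers, "header", name=name).text = str(value)
    # One entry per leaf part, with a checksum of its decoded content.
    parts = ET.SubElement(root, "parts")
    for part in msg.walk():
        if part.is_multipart():
            continue
        payload = part.get_payload(decode=True) or b""
        ET.SubElement(
            parts,
            "part",
            contentType=part.get_content_type(),
            sha256=hashlib.sha256(payload).hexdigest(),
        )
    return ET.tostring(root, encoding="unicode")


print(authenticity_envelope(SAMPLE))
```

Storing the raw message itself next to the envelope would be what makes
recreating the email possible; the envelope alone only documents it.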

So this is something we definitely should look at implementing, once the
core becomes useful enough to play around with.

I think another standard that might be of interest here is METS [2].
It's overly complicated for this, but might be something to think about.
They have a part that allows you to link things together. But we really
need to avoid https://xkcd.com/927/

 - Tom


[1]
https://www.archives.gov/records-mgmt/email-management/capstone-training-and-resources.html
[2] https://www.loc.gov/standards/mets/

On 12/21/2016 11:10 AM, Petter Reinholdtsen wrote:
> Hi
>
> Two and a half years ago I wrote
> <URL: http://people.skolelinux.org/pere/blog/Hvordan_b_r_RFC_822_formattert_epost_lagres_i_en_NOARK5_database_.html >,
> while I was trying to find out how email should be stored in a
> NOARK5 database.  I came to think of it today when I thought of a
> use case for nikita - simple archiving of email.  How can email
> easily be archived in a NOARK5 system, and how can one ensure that it
> can be recreated when needed?  Recreation is for example necessary if
> one wants to check crypto signatures that were part of the email,
> wants to export an email thread in mbox format, or wants to reply to
> the email with correct In-Reply-To and References fields.
>
> Do any of you know of relevant specifications?  Perhaps we should
> document how we recommend that email is stored in nikita?
>


