How should email be stored in a NOARK5 database?

Petter Reinholdtsen pere at hungry.com
Mon Apr 3 11:58:48 CEST 2017


I've been thinking some more about how to store emails in a Noark 5
database, and in addition to the XML idea presented earlier, what about
storing emails as PDF, with the original email content (header and body)
included in the PDF?  I believe the PDF keyword for this feature is 'PDF
attachment', and the PDF format allows any file to be included in the
PDF as an attachment.  This is of course problematic from an archiving
point of view, as it is perfectly possible to store proprietary and
undocumented content as attachments in a PDF.  Such content may well be
unreadable in the future.

Anyway, back to the emails as PDF attachments idea.  To prepare the
email for upload, the email would first need to be formatted as a PDF,
preferably including any attachments.  Key parts of the header and the
entire content would be formatted as PDF pages.  In the normal case,
this should be fairly trivial, but it will cause problems for more
exotic file attachments like audio or 3D models, which have no obvious
page representation.  I guess the process should refuse to convert
emails containing such exotic formats.
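
The rejection step could be sketched roughly like this in Python, using
only the standard library email module.  The allow-list of renderable
types and the function name unrenderable_parts are my own assumptions
for illustration; nothing here is taken from Noark 5.

```python
from email import message_from_bytes, policy

# Assumption: content types we believe can be rendered as PDF pages.
# This allow-list is purely illustrative.
RENDERABLE_MAIN_TYPES = {"text", "image"}
RENDERABLE_FULL_TYPES = {"application/pdf"}


def unrenderable_parts(raw_email: bytes) -> list:
    """Return the content types of leaf MIME parts that cannot be rendered."""
    msg = message_from_bytes(raw_email, policy=policy.default)
    rejected = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no content of their own
        ctype = part.get_content_type()
        if (part.get_content_maintype() not in RENDERABLE_MAIN_TYPES
                and ctype not in RENDERABLE_FULL_TYPES):
            rejected.append(ctype)
    return rejected
```

An email would only be passed on to the PDF conversion when the returned
list is empty; otherwise the list says which parts caused the rejection.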

Next, the email headers and body could be added to the PDF using pdftk,
for example like this:

  pdftk email-content.pdf \
    attach_files email-content.mbox \
    output email-combined.pdf

Here I suggest using the mbox email format, which is very close to how
the email was actually transferred to the client.
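
If the email is only available as raw bytes, a one-message mbox file
could be produced with the Python standard library mailbox module.  This
is a minimal sketch; the function name write_single_message_mbox is
hypothetical, and it assumes the target file does not already exist
(otherwise the message is appended to it).

```python
import mailbox


def write_single_message_mbox(raw_email: bytes, mbox_path: str) -> None:
    """Store one raw email message as an entry in an mbox file."""
    box = mailbox.mbox(mbox_path, create=True)
    try:
        # The mailbox module adds the leading "From " separator line itself
        box.add(raw_email)
        box.flush()
    finally:
        box.close()
```

Note that the mailbox module escapes body lines starting with "From "
as ">From ", so the stored body is not always byte-identical to what
was received.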

Are any of the existing archive systems doing something like this for
email?  Is it allowed to have attachments like this in PDF/A?  If not,
this idea is going to cause problems for the extraction process.

I am unsure how a random reader of the PDF would be able to identify the
content format of the attachment, given that pdftk does not ask for the
file format or MIME type.  It will prove problematic for future readers
if they cannot know whether the attachment was intended to be an email
or not.  Perhaps the PDF Collection feature can be used for this?  I do
not know the PDF specification well enough to tell.
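
Without a recorded type, a future reader would be left guessing from the
bytes alone.  The following Python sketch shows the kind of heuristic
that would be needed once the attachment has been extracted from the
PDF.  The function name looks_like_mbox_email and the choice of required
headers are my own assumptions, and a guess like this can of course be
wrong in both directions.

```python
from email import message_from_bytes


def looks_like_mbox_email(data: bytes) -> bool:
    """Guess whether extracted attachment bytes are a single mbox entry."""
    # An mbox entry starts with a "From " separator line
    if not data.startswith(b"From "):
        return False
    # Skip the separator line; the rest should parse as an email message
    _, _, rest = data.partition(b"\n")
    msg = message_from_bytes(rest)
    # Assumption: a real email carries at least From and Date headers
    return msg["From"] is not None and msg["Date"] is not None
```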

-- 
Happy hacking
Petter Reinholdtsen

