Recipes for converting PDF to PDF/A

Petter Reinholdtsen pere at hungry.com
Mon May 22 11:37:19 CEST 2017


[Thomas Sødring]
> Hi,
>
> I'd like to go a little microservices on this. I was thinking that we
> need a REST service that takes a document and returns its archive
> format equivalent. This can even autodetect the mimetype using e.g
> JHOVE [1]. There are other tools as well.

Sounds interesting, even though I intuitively fear such a microservice
architecture would lead to a centralised converter microservice which
would get access to way too much sensitive data.

But my question was not really about that part of the process, more
whether the conversion should be done by the API service or by an API
client.  Both could use a microservice to do the actual conversion.
What makes most sense?  We do not want to block the API while doing
conversion, so if it is done by the API service, it must be done in
the background when there are spare resources to do it.
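
To illustrate the "do it in the background" option, here is a minimal
sketch in Python, assuming the API service hands conversion jobs to a
single worker thread.  The names BackgroundConverter and
convert_to_archive_format are hypothetical, and the converter is only
a placeholder for whatever tool ends up doing the real work.  It does
not measure spare resources; it only keeps the request handler from
waiting on the conversion.

```python
import queue
import threading


def convert_to_archive_format(document: bytes) -> bytes:
    """Hypothetical placeholder for the real converter call."""
    return b"PDF/A:" + document


class BackgroundConverter:
    """Run conversions on one worker thread so callers never block."""

    def __init__(self, convert=convert_to_archive_format):
        self._convert = convert
        self._jobs = queue.Queue()
        self.results = {}  # doc_id -> converted bytes, or the exception
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, doc_id, document):
        """Called from the API request handler; returns immediately."""
        self._jobs.put((doc_id, document))

    def _run(self):
        while True:
            doc_id, document = self._jobs.get()
            try:
                self.results[doc_id] = self._convert(document)
            except Exception as exc:
                # Keep the failure so it can be reported to the user.
                self.results[doc_id] = exc
            finally:
                self._jobs.task_done()

    def wait(self):
        """Block until all queued jobs are done (for tests/shutdown)."""
        self._jobs.join()
```

An API client doing the conversion itself would not need any of this,
which is part of why the choice between the two matters.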

Who should 'own' a converted document and what should its timestamps be?
If an API client generates and uploads it, I guess that metadata will no
longer reflect the uploader of the original document.  Is there a way to
avoid that?

Btw, JHOVE and JHOVE2 seem like nice projects.  I'm unable to figure out
if they support ODF and the many formats from MS Office, which would be
vital for checking whether the original is well formed.  Using them for
format identification seems like overkill, when the command line tool
'file' seems to work well?
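
As a rough sketch of what identification with 'file' could look like
from a service written in Python: the function name detect_mime_type
is made up for this example, and it assumes the 'file' command is
installed and on PATH (it returns None otherwise).  The --brief and
--mime-type options make 'file' print only the MIME type.

```python
import shutil
import subprocess
from typing import Optional


def detect_mime_type(path: str) -> Optional[str]:
    """Return the MIME type reported by file(1), or None if missing."""
    if shutil.which("file") is None:
        return None  # the 'file' command is not available here
    result = subprocess.run(
        ["file", "--brief", "--mime-type", path],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()
```

This only identifies the format.  It says nothing about whether the
document is well formed, which is the part where JHOVE might help.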

> I'd like to support ODF natively via LibreOffice, where the service
> has LibreOffice running and converts all documents to PDF/A.

That sounds good.

> The documents should be automatically converted to archive format when
> the case file is closed.

Why wait so long?  Isn't it best to discover any conversion problems
early, while someone is still working on the original files?

> For teaching purposes, I will not support MS Office.

What do you mean?  MS Office isn't a document format, and produces lots
of different formats.  Do you mean using MS Office as the document
converter?

-- 
Happy hacking
Petter Reinholdtsen

