Recipes for converting PDF to PDF/A

Thu May 18 19:26:29 CEST 2017

Hi,

I'd like to take a microservices approach to this. I think we need a REST service that takes a document and returns its archive-format equivalent. It could even autodetect the MIME type using e.g. JHOVE [1]; there are other tools as well.
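
To make the autodetection idea concrete, here is a minimal sketch in Python. It is not JHOVE: it only sniffs the leading bytes, which is enough to tell PDF from ODF. The routing table and the converter names in it are hypothetical, just to show how the service could pick a backend.

```python
import io
import zipfile


def detect_mimetype(data: bytes) -> str:
    """Guess a MIME type from the document content (a stand-in for JHOVE)."""
    if data.startswith(b"%PDF-"):
        return "application/pdf"
    if data.startswith(b"PK\x03\x04"):
        # ODF packages are ZIP files that carry their type in a 'mimetype' entry.
        try:
            with zipfile.ZipFile(io.BytesIO(data)) as package:
                return package.read("mimetype").decode("ascii").strip()
        except (KeyError, zipfile.BadZipFile, UnicodeDecodeError):
            return "application/zip"
    return "application/octet-stream"


# Hypothetical routing table: MIME type -> name of the converter backend.
CONVERTERS = {
    "application/pdf": "ghostscript",
    "application/vnd.oasis.opendocument.text": "libreoffice",
    "application/vnd.oasis.opendocument.spreadsheet": "libreoffice",
    "application/vnd.oasis.opendocument.presentation": "libreoffice",
}


def choose_converter(data: bytes) -> str:
    """Return the backend name for a document, or raise if it is unsupported."""
    mimetype = detect_mimetype(data)
    try:
        return CONVERTERS[mimetype]
    except KeyError:
        raise ValueError(f"no archive-format converter for {mimetype}") from None
```

A real service would probably want a proper identification tool behind detect_mimetype(), but the dispatch step would look much the same.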

I'd like to support ODF natively via LibreOffice, where the service has LibreOffice running and converts all documents to PDF/A.
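
As a rough sketch of the LibreOffice step, the service could shell out to a headless soffice. The --headless, --convert-to and --outdir options are standard; the JSON filter option selecting PDF/A (SelectPdfVersion set to 1) is an assumption about the installed LibreOffice version, since older releases use a different filter-option syntax, and writer_pdf_Export only applies to text documents. The function names are made up for the example.

```python
import json
import subprocess
from pathlib import Path


def libreoffice_pdfa_command(source: Path, outdir: Path,
                             binary: str = "soffice") -> list[str]:
    """Build a headless LibreOffice command line for a PDF/A export."""
    # Assumption: the installed LibreOffice accepts JSON filter options and
    # treats SelectPdfVersion=1 as PDF/A-1. Check this against your version.
    filter_options = json.dumps(
        {"SelectPdfVersion": {"type": "long", "value": "1"}}
    )
    return [
        binary,
        "--headless",
        "--convert-to", f"pdf:writer_pdf_Export:{filter_options}",
        "--outdir", str(outdir),
        str(source),
    ]


def convert_with_libreoffice(source: Path, outdir: Path,
                             timeout: float = 300.0) -> Path:
    """Run the conversion and return the path LibreOffice writes to."""
    subprocess.run(libreoffice_pdfa_command(source, outdir),
                   check=True, timeout=timeout)
    return outdir / (source.stem + ".pdf")
```

The result should still be run through a PDF/A validator before it is stored as the archive version.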

PDF to PDF/A could be done as Petter suggests.

I would take a microservices-style approach and offload the conversion to another REST service, since users will sometimes upload big files. The largest PDF file I have seen in a Noark system is about 250 MB, and when I converted 800 000 documents to archive format a few years ago, we saw many random crashes. So I really think it's worthwhile to offload conversion from the core.
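
Wherever the converter ends up running, the crash problem suggests wrapping every conversion in its own child process with a timeout, so that a crash or hang becomes an error report instead of taking the service down. A minimal sketch, with a hypothetical helper name:

```python
import subprocess
import sys


def run_isolated(command: list[str], timeout: float) -> tuple[bool, str]:
    """Run a converter in a child process; return (ok, error message)."""
    try:
        result = subprocess.run(command, capture_output=True, text=True,
                                timeout=timeout)
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child before raising this.
        return False, f"timed out after {timeout} s"
    except OSError as exc:
        return False, f"could not start converter: {exc}"
    if result.returncode != 0:
        # A negative return code means the child was killed by a signal.
        return False, f"exit status {result.returncode}: {result.stderr.strip()}"
    return True, ""


if __name__ == "__main__":
    # Stand-in for a crashing converter: a child that exits with status 3.
    print(run_isolated([sys.executable, "-c", "import sys; sys.exit(3)"], 10.0))
```

The caller can then decide whether to retry, put the document in a queue for later, or flag it for manual handling.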

The documents should be automatically converted to archive format when the case file is closed.

For teaching purposes, I will not support MS Office. Many of my students are unaware of LibreOffice, so I think it's worthwhile exposing them only to LibreOffice via nikita. However, if we were to support MS Office, we would need a queue system that can talk to a PixEdit server to do the conversion. PixEdit works really nicely and can run on a per-core basis, so it scales well.

Creating a REST service to convert the documents would be a nice project for someone to attempt.

[1] https://en.wikipedia.org/wiki/JHOVE

________________________________
From: nikita-noark-bounces at nuug.no <nikita-noark-bounces at nuug.no> on behalf of Petter Reinholdtsen <pere at hungry.com>
Sent: Thursday, May 18, 2017 19:11
To: nikita-noark at nuug.no
Subject: Recipes for converting PDF to PDF/A

One task we need to implement in the core is converting PDF files to
PDF/A if they are not already in PDF/A form.  I had a quick look and
found this recipe on
<URL: https://unix.stackexchange.com/questions/79516/converting-pdf-to-pdf-a >:

  gs -sDEVICE=pdfwrite -q -dNOPAUSE -dBATCH -dNOSAFER     \
    -dPDFA -dUseCIEColor -sProcessColorModel=DeviceCMYK       \
    -sOutputFile=Out_PDFA.pdf PDFA_def.ps pdfmarks IN_PDF.pdf

and

  java FixPrintFlag Out_PDFA.pdf New_verifiablePDFA.pdf

Based on a recipe available from
<URL: http://thisthatisnot.blogspot.no/2010/04/free-way-to-convert-existing-pdf-to.html >.
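
If the two steps above were to be driven from code, a sketch in Python could look like this. The options are copied unchanged from the recipe; the function names and default file names are only illustrative, and PDFA_def.ps, pdfmarks and the FixPrintFlag class are assumed to be present in the working directory.

```python
import subprocess


def ghostscript_pdfa_command(source: str, target: str,
                             pdfa_def: str = "PDFA_def.ps",
                             pdfmarks: str = "pdfmarks") -> list[str]:
    """Build the Ghostscript command from the recipe, options unchanged."""
    return [
        "gs", "-sDEVICE=pdfwrite", "-q", "-dNOPAUSE", "-dBATCH", "-dNOSAFER",
        "-dPDFA", "-dUseCIEColor", "-sProcessColorModel=DeviceCMYK",
        f"-sOutputFile={target}", pdfa_def, pdfmarks, source,
    ]


def fix_print_flag_command(source: str, target: str) -> list[str]:
    """Build the second step of the recipe (the FixPrintFlag Java class)."""
    return ["java", "FixPrintFlag", source, target]


def convert_pdf_to_pdfa(source: str, intermediate: str, target: str) -> None:
    """Run both steps in order, stopping at the first failure."""
    subprocess.run(ghostscript_pdfa_command(source, intermediate), check=True)
    subprocess.run(fix_print_flag_command(intermediate, target), check=True)
```

For example, convert_pdf_to_pdfa("IN_PDF.pdf", "Out_PDFA.pdf", "New_verifiablePDFA.pdf") runs the same two commands as shown above.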

I wonder, should this be a task done by an API client, or a task done
internally in the server?  What do the rest of you think?  I suspect
doing it via the API will either lose some metadata or must be done
using operations that allow us to set metadata that normally should not
be modifiable via the API.

--
Happy hacking
Petter Reinholdtsen
_______________________________________________
nikita-noark mailing list
nikita-noark at nuug.no
https://lists.nuug.no/mailman/listinfo/nikita-noark