How can consistency check results

Thomas Sødring Thomas.Sodring at hioa.no
Sun Jun 11 22:43:30 CEST 2017


On 06/02/2017 09:06 PM, Petter Reinholdtsen wrote:

After skimming through
<URL: http://edu.hioa.no/ark2100/current/syllabus/DQ%20Ouzounov.pdf >,
and reading a paper Thomas is writing on archive quality control, I
started thinking about how a Noark 5 service interface implementation
could conduct various consistency checks and report the findings back
to an API user.  There is nothing about this in the current service
interface specification.

I drafted an idea in
<URL: https://github.com/petterreinholdtsen/noark5-tester/blob/master/mangelmelding/2017-06-02-konsistenssjekk.md >,
and would like to get your feedback on this defect report.

Let's pick a simple example consistency check (the defect report has
more examples).  When creating or modifying a journalpost object in the
database, the API could check that the journal date is after the
document date.  It should probably conduct checks like this to ensure
the metadata in the database makes sense.  But how could it report its
findings back to API users?  The change should be accepted, of course,
as it might be correct (in this case, creating the journalpost for a
document that is still being written and has a deadline in the future).
But the issue should also be reported to the user to allow a closer
look, in case it proves to be a typo instead.
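To make the example concrete, a minimal sketch in Java of such a check could look like this (the parameter names and the CheckResult shape are made up for illustration; the real Noark 5 entity classes may differ):

  import java.time.LocalDate;
  import java.util.Optional;

  // Minimal sketch of the journal-date-after-document-date check.
  // The names here are invented for illustration, not taken from
  // any existing codebase.
  public class JournalDateCheck {

      public record CheckResult(String checkType, String severity,
                                String details) {}

      public Optional<CheckResult> check(LocalDate journalDate,
                                         LocalDate documentDate) {
          if (journalDate != null && documentDate != null
                  && journalDate.isBefore(documentDate)) {
              // The change is still accepted; we only record a warning,
              // since the document may legitimately be dated in the
              // future (e.g. a deadline still being written towards).
              return Optional.of(new CheckResult(
                      "journalpost-date-order",
                      "warning",
                      "journalDate " + journalDate
                              + " is before documentDate " + documentDate));
          }
          return Optional.empty();
      }
  }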

I suggest a new relation is defined for consistency check results, and
that this relation is returned in _links if one or more consistency
checks fail when creating or modifying an object.  Fetching the href
for such a relation should return a list of all the failing checks.
I am not quite sure what the list should contain, but I suspect check
type, severity and details about the detected issue should be included.
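As a sketch of what the list behind such a relation might contain, here is a hypothetical Java model of the payload.  Every name in it, including the relation name, is invented for illustration and is not in any specification:

  import java.util.List;

  // Following the proposed relation in _links, e.g.
  //   "_links": { "konsistenssjekk": { "href": ".../konsistenssjekk" } }
  // could return a list like this.  All names are invented; nothing
  // here is specified anywhere yet.
  public class ConsistencyReport {

      // One failing check: its type, severity and a human-readable
      // description of the detected issue.
      public record Failure(String checkType, String severity,
                            String details) {}

      private final String systemID;         // the object checked
      private final List<Failure> failures;  // all checks that failed

      public ConsistencyReport(String systemID, List<Failure> failures) {
          this.systemID = systemID;
          this.failures = List.copyOf(failures);
      }

      public String getSystemID() { return systemID; }
      public List<Failure> getFailures() { return failures; }
  }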

What do the rest of you think?  Is this a useful proposal to suggest
as an extension for the service interface specification?

I agree that this is something that should be done. It's very much in
keeping with the 2011 DQ project and applying its results to the
interface spec. This was one of the reasons for developing an open
source Noark core: so we could move the quality of Noark-based archives
forward based on DQ R&D.

I have two sets of slides that discuss this. One covers DQ in general and can be accessed here [1]. Slide 45 is where we should focus our efforts. Ravn and Høed (who are behind slide 45 ... strangely, the reference is missing) reduce Wang's dimensions down to something more tangible, and this is what I think we should start to focus on. [2] discusses the issue in relation to n5 entities, but I see I could update it a bit. When we focused on data quality, we dreamed that DIFI would have a DQ monitor for all government computer systems, so that every time data was exchanged, a DQ analysis would be undertaken and reported to DIFI.

As an aside, I sent an application to Forskningsrådet a few years ago to try and pick up the research in archival systems, proposing to study the semantic web, data quality and extractions. The project was not considered relevant to the VERDIKT programme. I think that's because research is often seen as something more theoretical, while my approach to research is too practical in nature. Sometimes I think we would rather think about how to solve problems than actually solve them :) It reminds me of Bent Høie complaining this week that he's tired of seeing so many health project concepts never make it out of the lab. My research approach has been to stay as close to industry and the profession as possible.

I do think that we need to move DQ to something beyond consistency. I have pushed for DQ at the objective level because it's something that can be studied from the database, but really it is semantic or pragmatic data quality that is important. A lot of information is in the archive but not accessible. A case in point: at a recent meeting with municipal archivists there was a discussion about a zoning issue related to a beach in a town centre, where no-one really knew how to deal with a building application because no-one knew how that area was zoned. Then a lady who was about to retire remembered that she had sat in a municipal executive board meeting (formannskapsmøte) in June of 198X where that particular area was discussed, and they were able to find the relevant information. So DQ in that municipality is actually terrible! When she retires, they lose a lot of information. We need to raise the issue of DQ beyond objective dimensions! That is a difficult issue!


On the server implementation side, we should make sure it is possible
to extend the server with extra consistency check modules, to allow new
checks to be tested without having to reprogram the server.  I guess
this is possible with some plugin mechanism in place, similar to how,
for example, the Java program Minecraft allows 'mods' to change almost
anything in the game.  This would also allow students and others to
easily test out ideas on what to check in the archive.
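One way to get such a plugin mechanism in Java could be the standard java.util.ServiceLoader, which discovers implementations on the classpath without recompiling the server.  A sketch, under the assumption that checks operate on a generic entity object (the interface and names are invented):

  import java.util.ArrayList;
  import java.util.List;
  import java.util.Optional;
  import java.util.ServiceLoader;

  // Sketch of a pluggable consistency-check mechanism.  New checks are
  // dropped onto the classpath (with a META-INF/services entry) and
  // picked up without changing the server.  The ConsistencyCheck
  // interface is illustrative, not from any existing codebase.
  interface ConsistencyCheck {
      String checkType();
      // Returns a failure description if the entity fails the check.
      Optional<String> run(Object entity);
  }

  class CheckRunner {
      public List<String> runAll(Object entity) {
          List<String> failures = new ArrayList<>();
          for (ConsistencyCheck check
                  : ServiceLoader.load(ConsistencyCheck.class)) {
              check.run(entity).ifPresent(
                      details -> failures.add(
                              check.checkType() + ": " + details));
          }
          return failures;
      }
  }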

Yes, and so much easier with an open source version of n5 :)


It would probably be good if a global and complete list of failing
consistency checks were available via the REST API (and the href
mentioned above could do a search for a specific systemID in the list
to show only the results relevant for a given object), to get a general
idea of the number of consistency check failures in the archive, split
out by check type and severity.
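A sketch of what such a global endpoint could look like, assuming a Spring-style controller; the path, parameters and backing store are placeholders, not an existing API:

  import java.util.List;
  import org.springframework.web.bind.annotation.GetMapping;
  import org.springframework.web.bind.annotation.RequestParam;
  import org.springframework.web.bind.annotation.RestController;

  // Hypothetical endpoint exposing all failing consistency checks,
  // optionally filtered by systemID, check type or severity.
  @RestController
  public class ConsistencyCheckController {

      public record Failure(String systemID, String checkType,
                            String severity, String details) {}

      // Assumed persistence layer; invented for illustration.
      public interface FailureStore {
          List<Failure> find(String systemID, String checkType,
                             String severity);
      }

      private final FailureStore store;

      public ConsistencyCheckController(FailureStore store) {
          this.store = store;
      }

      @GetMapping("/api/konsistenssjekk")
      public List<Failure> list(
              @RequestParam(required = false) String systemID,
              @RequestParam(required = false) String checkType,
              @RequestParam(required = false) String severity) {
          return store.find(systemID, checkType, severity);
      }
  }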



One of the things we wanted to do was to be able to drill down and
measure DQ on a per-case, per-employee, per-series etc. basis. We
demonstrated this in a GUI that Dimitar created and attached to a
dashboard. I also know that ACOS have made something similar.

With time I landed on the understanding that, when it comes to DQ and Noark, it ultimately has to be done from the perspective of the extraction. The end game is the extraction, and as such DQ should at a minimum ensure that an extraction can be made. That's not to say we shouldn't do more, but we must start there.

Adding additional links showing this is a great idea, but it should
be possible to pull the data out on multiple dimensions as above.
It should also be possible to pull out data for the entire
organisation so citizens can actually see a single figure for their
municipality's archive.
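Computing such roll-ups from stored failures is straightforward; here is a sketch using Java streams that groups invented failure records by whatever dimension the caller picks:

  import java.util.List;
  import java.util.Map;
  import java.util.function.Function;
  import static java.util.stream.Collectors.counting;
  import static java.util.stream.Collectors.groupingBy;

  // Sketch of rolling failing checks up along a chosen dimension
  // (per series, per employee, per check type ...).  The Failure
  // record is invented for illustration.
  public class DqRollup {

      public record Failure(String series, String employee,
                            String checkType, String severity) {}

      // Count failures per value of whatever dimension is picked.
      public static Map<String, Long> countBy(
              List<Failure> failures,
              Function<Failure, String> dimension) {
          return failures.stream()
                  .collect(groupingBy(dimension, counting()));
      }

      // One figure for the whole organisation: the share of failing
      // checks, assuming the total number of checks run is known.
      public static double failureRate(long failing, long totalRun) {
          return totalRun == 0 ? 0.0 : (double) failing / totalRun;
      }
  }

Calling countBy(failures, Failure::series) would then give the per-series counts, and the same call with Failure::employee the per-employee ones.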

I'm probably not answering exactly how we would develop low-level checks. CPUs are cheap, people are not, so any change to an entity should result in a DQ analysis. It could be real time, but I don't think that's necessary. For real-time analysis to have value, we'd need to reject a change, and that would probably cause people to try to avoid using the system. I think I'd like a Coverity-style approach to the issue, where every day a cron job analyses the Noark database and reports a list of issues to the records manager. If people took this issue seriously, we could build up a collection of quality dimensions: what they look like, how to measure them, how to fix them, etc. I have come to see that if it's not part of the record keeper's / archivist's job description, then it's not worth spending time on!
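Such a nightly, Coverity-style batch run could be as simple as a scheduled task; a sketch in plain Java, where the analysis and reporting steps are placeholders:

  import java.util.concurrent.Executors;
  import java.util.concurrent.ScheduledExecutorService;
  import java.util.concurrent.TimeUnit;

  // Sketch of a nightly batch DQ run.  analyseDatabase() and
  // mailReportToRecordsManager() stand in for the real work.
  public class NightlyDqJob {

      private final ScheduledExecutorService scheduler =
              Executors.newSingleThreadScheduledExecutor();

      public void start() {
          // Run once a day; in production the system crontab (or a
          // cron expression) would control the exact time.
          scheduler.scheduleAtFixedRate(this::runOnce, 0, 24,
                  TimeUnit.HOURS);
      }

      private void runOnce() {
          String report = analyseDatabase();
          mailReportToRecordsManager(report);
      }

      private String analyseDatabase() {
          return "no issues found";  // placeholder
      }

      private void mailReportToRecordsManager(String report) {
          System.out.println("DQ report: " + report);  // placeholder
      }
  }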

Remember also that the Noark metadata is so simple that it's not really in the n5 metadata that we can increase DQ, but probably in the contents of the documents. That's the limitation! But we have to start somewhere.

As always, pere, you are two steps ahead of me :) ... I'm so stuck in implementation detail that it's still difficult to raise my head and take the bird's-eye view. DQ is an important issue and one that deserves further attention ... but I need to finish the core. Getting students to do work related to this is a possibility.

 - Tom

[1] http://edu.hioa.no/ark2100/current/slides/week3/day3/
[2] http://edu.hioa.no/ark2100/current/slides/week4/day1/Measuring%20data%20quality%20within%20a%20Noark%20perspective.odp