How can consistency check results

Petter Reinholdtsen pere at hungry.com
Mon Jun 12 10:02:57 CEST 2017


[Thomas Sødring]
> I have two sets of slides that discuss this. One is DQ in general and
> can be accessed here [1]. Slide 45 is where we should focus our
> efforts. This is what should be looked at. In terms of Ravn and Høed
> (who are behind slide 45 ... strange reference missing), they reduce
> Wang's dimensions down to something more "håndfast" (tangible), and this is what I
> think we should start to focus on.

Well, I do not find the pie chart on page 45 very illuminating.  The
quality controls I would like to focus on at the start do not really fit
into those categories; they are more logical consistency checks that
fall into validity, consistency and integrity at the same time.  For
example, checking that the dates in the metadata make sense
chronologically, and that the law references for graded documents make
sense given the dates and content.
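To make the first of these concrete, here is a minimal sketch of such a
date check in Python, using hypothetical field names rather than the
actual Noark 5 metadata names:

```python
from datetime import date

def check_dates(record):
    """Return a list of warnings for dates that do not make sense
    chronologically.  The field names are hypothetical, not the
    actual Noark 5 metadata names."""
    warnings = []
    document_date = record.get("document_date")
    journal_date = record.get("journal_date")
    if document_date and journal_date and journal_date < document_date:
        warnings.append("journal date is before document date")
    if document_date and document_date > date.today():
        warnings.append("document date is in the future")
    return warnings

# A record journaled before its document was supposedly written:
record = {"document_date": date(2017, 6, 10),
          "journal_date": date(2017, 6, 1)}
print(check_dates(record))  # -> ['journal date is before document date']
```

Each warning here is the kind of 'dubious but not fatal' finding that
should be reported rather than rejected.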

> [2] is used to discuss the issue in relation to n5 entities, but I see
> I could update it a bit. When we focused on Data Quality we dreamed
> that DIFI had a DQ monitor for all government computer systems and
> every time data was exchanged, a DQ analysis was undertaken and
> reported to DIFI.

Well, this slide has very little information about the concrete quality
indicators which I believe we need.

> I do think that we need to move DQ to something beyond consistency. I
> have pushed for DQ at the objective level because it's something that
> can be studied from the database.

What does 'at the objective level' mean?

> But really it is the semantic or pragmatic data quality that is
> important. A lot of information is sometimes in the archive, but not
> accessible. A case in point: at a recent meeting with municipal
> archivists there was a discussion about a zoning issue related to a
> beach in a town centre where no-one really knew how to deal with a
> building application, because no-one knew how that part was zoned. Then a
> lady who was going to retire was able to remember that she sat in a
> formanskapsmøte (municipal executive board meeting) in June of 198X, where that particular area was
> discussed and they were able to find the relevant information. So DQ
> in that municipality is actually terrible! When she retires, they
> lose a lot of information. So we need to raise the issue of DQ beyond
> objective dimensions! That is a difficult issue!

Sure it is, and for this to work well I believe a lot of the metadata
needs to be extracted automatically from the documents.  Otherwise
people will forget to add geographic location references like the one
that could have saved the day in your story. :)

> With time I landed on an understanding that when it comes to DQ and
> Noark, ultimately it has to be done from the perspective of the
> extraction. The end game is the extraction and as such DQ should at a
> minimum ensure that an extraction can be made. That's not to say we
> shouldn't do more, but we must start here.

How come?

I believe the only sensible perspective on data quality is how it helps
day-to-day use.  That ensures there is an incentive in the organisation
to improve the quality when errors are found, and that someone cares
about the results of any automated checks for consistency and
correctness in the archive.  No-one, in their busy day at work, is going
to give priority to helping someone in the far future by ensuring the
extraction of the archive is of high quality.  If the DQ perspective is
the extraction, the data quality is going to be horrible.

> Adding additional links showing this is a great idea, but it should be
> possible to pull the data out on multiple dimensions as above.

I had not considered this, so it was a nice viewpoint to consider.  I am
reluctant to provide blame lists (i.e. data quality measurements per
case handler), as I suspect that would provide an incentive for people
to keep documents away from the archive for fear of making mistakes.

Also, it would be hard to attribute errors to users.  Say one user
creates a record with a POST, entering a bogus journal date and document
date, and another user later corrects the document date to reflect
reality.  This in turn triggers a consistency check, as the journal
date is now before the document date.  Should the issue be attributed
to the user whose change triggered it, or to the original submitter?
How could we tell which is the case?  Without blaming individuals,
everyone has an equal incentive to improve the quality of the archive.
That seems like a good idea to me.

> It should also be able to pull out data for the entire organisation so
> citizens actually can see a single figure for their municipality's
> archive.

That would be very cool indeed. :)

> I'm probably not answering exactly how we would develop low-level
> checks. CPUs are cheap, people are not. So any change in an entity
> should result in a DQ analysis.  It could be real time, but I don't
> think it's necessary. For real-time to have value, we'd need to reject
> a change. That would probably cause people to try and avoid using the
> system.

I do not agree that this kind of real-time consistency check should
cause the system to reject changes (errors should be rejected, but
dubious data values should not).  It should instead make 'warnings'
available to the user, allowing the user to decide whether the check is
right or wrong, and making it possible to ignore it.  This is why I
suggest adding a relation to _links when something strange is added to
the archive, instead of rejecting the POST or PUT.
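As a sketch of that idea, assuming a hypothetical relation name and
entity shape (the actual Noark 5 REST specification would define its
own), the accepted entity could carry the warnings in _links like this:

```python
import json

# Hypothetical relation name for consistency warnings; the real
# specification would define its own.
WARNING_REL = "https://example.org/rel/consistency-warning"

def accept_with_warnings(entity, warnings):
    """Accept the POST/PUT anyway, but expose the consistency
    warnings in _links so a client can show them to the user, who
    may choose to ignore them."""
    links = entity.setdefault("_links", {})
    links[WARNING_REL] = [{"title": w} for w in warnings]
    return entity

entity = accept_with_warnings(
    {"tittel": "Byggesak strandsone"},
    ["journal date is before document date"])
print(json.dumps(entity, indent=2, ensure_ascii=False))
```

The change is stored either way; the warning only travels back to the
client alongside the usual hypermedia links.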

> I think I'd like to have a Coverity approach to the issue, where every
> day a cron-job analyses the noark database and reports a list of
> issues to the record manager.

This should definitely be in place, yes, with links to the global list
of issues I talked about in my proposal.
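A sketch of such a daily job, with a hypothetical fetch_all_records()
standing in for the actual database or API query; a real version would
mail the report to the records manager rather than print it:

```python
#!/usr/bin/env python3
# Coverity-style daily consistency report; run from cron, e.g.:
#   0 6 * * * /usr/local/bin/noark-dq-report

def fetch_all_records():
    """Placeholder: would query the Noark database or REST API."""
    return [{"id": "rec-1",
             "journal_date": "2017-06-01",
             "document_date": "2017-06-10"}]

def find_issues(record):
    """ISO 8601 date strings compare correctly lexicographically."""
    issues = []
    journal = record.get("journal_date")
    document = record.get("document_date")
    if journal and document and journal < document:
        issues.append("journal date is before document date")
    return issues

report = ["%s: %s" % (record["id"], issue)
          for record in fetch_all_records()
          for issue in find_issues(record)]
print("\n".join(report) if report else "no issues found")
```

Keeping it in cron rather than in the write path matches the point
above: nothing is rejected, and CPU time is spent overnight instead of
in the user's face.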

> Remember also the Noark metadata is so simple that it's not really the
> n5 metadata where we can increase DQ, but probably in the contents of
> the documents. That's the limitation! But we have to start somewhere.

Well, I believe there are quite a few checks we can do with only the
metadata, and I believe we should start there.  If you have ideas on
what to check, please follow up at
<URL: https://lists.nuug.no/pipermail/nikita-noark/2017-June/000294.html >.

-- 
Happy hacking
Petter Reinholdtsen

