Test result from my runtest script for Nikita and n5test

Petter Reinholdtsen pere at hungry.com
Tue Feb 7 09:33:24 CET 2017


[Thomas Sødring]
> Hmmm! I see! I have definitely missed a few things that you have caught!
> I am beginning to think you understand this standard better than most
> ... including me ...

Nah, I am only bluffing.  Some details are still fresh in my mind
because I read the start of the spec recently, and my PDF viewer has a
great search facility. :)

> Yeah. Search is important, but OData is wrecking my head! It's the
> only thing that isn't easy. The rest of the core is just plugging in
> libraries and figuring things out. OData might require some
> algorithmic work.

For me, freetext search and case ID lookup are the most important
features.

> That's a tricky one. One of the earlier versions of the core has
> upload functionality with automatic SHA-256 checksum generation. The
> bufferedreader is a checksum bufferedreader. So I'm probably only a
> day away. The 'problem' I find with this is that the core is blocking
> and uploading a 250MB document is not an unreasonable request so I
> need to think a little more about this.

I guess single threading isn't a good idea, then. :)
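Computing the checksum while the upload stream is being copied avoids a
second pass over the file, which matters for those 250MB documents.  A
minimal Python sketch of the idea (the function name is mine, not from
the nikita code):

```python
import hashlib


def copy_with_checksum(src, dst, chunk_size=64 * 1024):
    """Copy src to dst in chunks, computing a SHA-256 checksum on the fly.

    Returns the hex digest once the whole stream has been copied.
    """
    digest = hashlib.sha256()
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        digest.update(chunk)
        dst.write(chunk)
    return digest.hexdigest()
```

The same wrap-the-stream approach is what a checksumming BufferedReader
would do in Java with a DigestInputStream.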

> Perhaps a secondary service for uploading documents is the way to
> go. Another issue is whether or not I want some kind of database to
> handle documents. This would be especially useful for handling
> versions of documents! Also I should be able to handle interrupts in
> document uploads so someone should be able to resume the upload of a
> large document that has previously been cancelled.

From a backup perspective it is useful to have everything in the
database, allowing a SQL dump of the database to provide a complete and
consistent backup.  But storing variable size data in SQL is not always
a good idea.

What about storing the documents on disk with file names derived from
their checksum?  This way multiple versions of a file can be stored on
disk, and a directory hierarchy can be created from the hash name to
avoid too many files in one directory.  Say the hash is abcd12345, the
file could be stored as a/b/c/d/12345 (or a/b/c/d/abcd12345).  If the
hash function is good, the directory content would become automatically
balanced.
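A small Python sketch of such a hash-derived path (the helper name and
the four-level depth are just illustration):

```python
def hash_to_path(hexdigest, depth=4):
    """Derive a directory path from a hash name.

    Each of the first `depth` characters becomes one directory level,
    and the full digest is kept as the file name, so abcd12345 becomes
    a/b/c/d/abcd12345.  A good hash spreads files evenly over the tree.
    """
    parts = list(hexdigest[:depth]) + [hexdigest]
    return "/".join(parts)
```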

If the file is received in a temporary location, moved into its final
location once transfer is complete and the database is updated with the
hash name after the file is stored in its final location, the database
dump would always be consistent if the database is backed up before the
file area.  It would also make it easy to keep multiple versions of a
file (but very hard to keep track of which file on disk belongs to
which database entry without a lookup in the database).
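The ordering described above (write to a temporary file, move it into
place, and only then record the hash name in the database) could be
sketched like this in Python, where all the names are hypothetical:

```python
import hashlib
import os
import tempfile


def store_document(stream, storage_root, chunk_size=64 * 1024):
    """Receive a stream into a temp file, then move it to its
    hash-derived final location under storage_root.

    Returns the hex digest, which the caller records in the database
    only after the file is safely in place, so a database dump taken
    before the file area is backed up never references a missing file.
    """
    digest = hashlib.sha256()
    fd, tmp_path = tempfile.mkstemp(dir=storage_root)
    with os.fdopen(fd, "wb") as tmp:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
            tmp.write(chunk)
    hexdigest = digest.hexdigest()
    final_path = os.path.join(storage_root, *hexdigest[:4], hexdigest)
    os.makedirs(os.path.dirname(final_path), exist_ok=True)
    # os.replace is atomic when source and target share a filesystem,
    # which the temp file does because it was created in storage_root.
    os.replace(tmp_path, final_path)
    return hexdigest
```

An interrupted upload leaves only an orphaned temp file, which can be
cleaned up without touching the database.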

-- 
Happy hacking
Petter Reinholdtsen


More information about the nikita-noark mailing list