For fun, and to see how well artificial idiocy can add entirely new
functionality to an existing code base, I asked my gibberish robot to
add optional S3 support to Nikita, while taking into account that some
of the background file processing needs local files to work on (OCR,
conversion to PDF). After a few hours of activity it came up with a
draft change that implements this. I asked it to create two S3 client
variants, one using AWS and one using JetS3T. I have not had time to
either read the proposed code closely or test it against a real S3
service, but thought the idea and the draft might be of interest to
others, and therefore share the status here.

The proposed change
<URL: https://gitlab.com/OsloMet-ABI/nikita-noark5-core/-/merge_requests/491 >
contains two commits, first one that restructures the storage code to
go through a new abstraction layer where one can choose local file
storage or S3, and then one that replaces AWS with JetS3T.

What do you think of the approach and the choice of S3 client library?
I do not know which of these has the best chance of working well in the
future, but I observe that JetS3T has a Debian package while the AWS
variant does not, which makes me lean towards JetS3T, as it avoids
making the job of getting Nikita into Debian even harder.

The new abstraction layer should make it possible to add more storage
protocols in the future. Perhaps it should be made extensible without
code changes (aka dynamic loading), so that third parties can support
for example Ceph, NFS, CIFS or DPC some time in the future?

--
Best regards
Petter Reinholdtsen
I have been playing a bit with artificial idiocy lately, and as an
experiment I tasked it with analysing the Nikita code base to see what
is needed to get Nikita in as an official Debian package, in practice
what is needed to build Nikita without an Internet connection. Here is
what it came up with. As expected, quite a lot has to be in place
before we reach the goal, but I thought it would be good for more
people to be familiar with the details of the challenge.
# Deserted Island Build Proposal for nikita-noark5-core
## Goal
Enable building and running the project with **zero Internet access**
--- only a local Debian mirror (Testing) on disk and the source
repository. The "deserted island" test: can you get this up and running
without any network connection?
------------------------------------------------------------------------
## Current Situation
### What Gets Downloaded Today
------------------------------------------------------------------------------
When             What                           From            Count/Size
---------------- ------------------------------ --------------- --------------
**Build**        \~450 JAR artifacts (Spring    Maven Central   \~200 MB
(`make build`)   Boot BOM + transitive deps)

**Build**        Additional test-scoped         Maven Central   \~100 MB
(`make check`)   dependencies (Mockito, JUnit
                 engine, etc.)

**Build**        `antlr4-maven-plugin`,         Maven Central   \~50 MB
(plugins)        `asciidoctor-maven-plugin`,
                 `spring-boot-maven-plugin`,
                 `maven-surefire-plugin` +
                 transitive deps

**Test setup**   Keycloak 26.0.6 binary         GitHub          \~130 MB
                 tarball                        Releases
                 (`keycloak-setup-start.sh`)

**Runtime        Language detection model       Maven Central   \~5 MB
(Tika)**         (`langdetect` models); may     / CDN
                 download on first use if not
                 bundled
------------------------------------------------------------------------------

The `maven-repo/` directory exists in the working tree but is **not
committed to git** (untracked files), so every fresh clone downloads
everything from scratch.
### Why Debian Packages Alone Aren't Enough
Spring Boot 3.4.5 and its entire ecosystem (spring-boot-starter-web,
-data-jpa, -security, -oauth2-resource-server, -amqp, etc.) are **not
packaged for Debian**. Many individual libraries *are* available as
`lib*-java` packages in Debian Testing:
--------------------------------------------------------------------------------
Dependency           Available as Debian pkg?   Package name
-------------------- -------------------------- --------------------------------
ANTLR 4 runtime      ✅ Yes                     `libantlr4-runtime-java`

commons-lang3        ✅ Yes                     `libcommons-lang3-java`

Guava                ✅ Yes                     `libguava-java` (32.0.1)

H2 database          ✅ Yes                     `libh2-java`

PostgreSQL JDBC      ✅ Yes                     `libpostgresql-jdbc-java`

Joda-Time            ✅ Yes                     `libjoda-time-java`

JAXB runtime         ✅ Yes                     `libjaxb-java`

ByteBuddy            ✅ Yes                     `libbyte-buddy-java`

Reflections          ✅ Yes                     `libreflections-java`

RabbitMQ client      ✅ Yes                     `librabbitmq-client-java`

JSON (org.json)      ⚠ Partially                `libjson-java` is different
                                                library (gson-based, not
                                                org.json)

Apache POI           ❌ No                      Not packaged in Debian

**Spring Boot**      ❌ No                      Not packaged at all

Spring Security      ❌ No                      Not packaged at all

springdoc-openapi    ❌ No                      Not packaged at all

Tika parsers         ❌ No                      Not packaged for Debian

AsciiDoctor (Ruby    ✅ Yes                     `asciidoctor` package provides
CLI)                                            the Ruby-based CLI tool

AsciiDoctorJ (Maven  ❌ No                      The JVM wrapper
plugin bridge)                                  (`asciidoctor-maven-plugin`)
                                                is not in Debian. Can be
                                                replaced by calling the Debian
                                                `asciidoctor` CLI from
                                                Makefile, or skipped entirely
                                                since docs are cosmetic.
--------------------------------------------------------------------------------

**Bottom line**: Replacing Maven with Debian packages alone is **not
feasible** because Spring Boot, POI, springdoc-openapi, and Tika have no
Debian equivalents. The only realistic path is to make Maven work
offline.
------------------------------------------------------------------------
## Proposal: Three-Part Approach
### Part 1 --- Build-Time Offline (Maven)
#### Option A: Commit `maven-repo/` to git (Recommended for simplicity)
**Changes needed:**

1. Add `maven-repo/*.jar`, `maven-repo/*.pom`, `maven-repo/*.sha1`,
`maven-repo/*.lastUpdated` to `.gitattributes` with
`export-ignore = false` and commit the directory.
2. Alternatively, use a sparse checkout or git-lfs for large binaries.
3. Add an offline guard to the Makefile:
``` makefile
# In Makefile, add:
MVNOPTS := -Dmaven.repo.local=$(CURDIR)/maven-repo --offline

# Optional: fail-fast if network would be needed:
.PHONY: verify-offline
verify-offline:
	@echo "Verifying offline build capability..."
	$(MVN) $(MVNOPTS) dependency:resolve -DskipTests || \
		(echo "ERROR: Missing dependencies. Run 'make populate-maven-repo' online first."; exit 1)

build: verify-offline
	$(MVN) $(MVNOPTS) clean validate install
```
4. Provide a one-time online bootstrap target for maintainers to keep
`maven-repo/` current:
``` makefile
.PHONY: populate-maven-repo
populate-maven-repo:
	@echo "Populating local Maven repo (requires Internet)..."
	$(MVN) -Dmaven.repo.local=$(CURDIR)/maven-repo dependency:go-offline
	$(MVN) -Dmaven.repo.local=$(CURDIR)/maven-repo dependency:resolve-plugins
```
**Pros**: Simple, works immediately, no build-system changes needed.
**Cons**: Bloated git repository (\~300 MB of JARs). Consider git-lfs or
a separate tarball artifact instead.
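
Before deciding between committing the directory, git-lfs or a tarball, it may help to measure what is actually in the local repository. The following is a minimal sketch, assuming the repository lives in `maven-repo/` as described above; the function names are made up for illustration.

```bash
#!/bin/bash
# Sketch: report how many JARs a local Maven repository holds and how much
# disk space it uses. The default directory name follows the proposal.

# Print the number of *.jar files below the directory given as $1.
count_jars() {
    find "$1" -type f -name '*.jar' | wc -l | tr -d ' '
}

# Print the total size of the directory given as $1 in kilobytes.
repo_size_kb() {
    du -sk "$1" | cut -f1
}

repo="${1:-maven-repo}"
if [ -d "$repo" ]; then
    echo "$repo: $(count_jars "$repo") JARs, $(repo_size_kb "$repo") kB"
else
    echo "$repo: directory not found" >&2
fi
```

Running it once after `make populate-maven-repo` gives a concrete number to compare against the estimates in the table of downloads.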
#### Option B: Debian policy-compliant approach (Recommended for packaging)
For proper Debian packaging (`dpkg-buildpackage`), the standard approach
is:
1. **List the upstream release location in `debian/watch`** and use
`uscan` to fetch it, or manually manage bundled binaries in
`debian/source/include-binaries`.
2. **Download all JARs during package build** from Maven Central using
the `debian/rules` target, with checksums pinned in a separate
file.
3. Use **Maven's offline mode** (`--offline`) pointing at a
pre-populated local repo that was assembled during the online phase
of the Debian build.
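
The checksum pinning mentioned in step 2 could be sketched roughly as follows. This is an illustration only, assuming GNU coreutils and findutils; the function names and the manifest file name used in the usage comment are hypothetical and not part of any Debian tooling.

```bash
#!/bin/bash
# Sketch: pin and verify SHA-256 checksums for a pre-populated Maven repo.
# Assumes GNU find, sort, xargs and sha256sum are available.

# Write "<sha256>  <relative path>" for every file under repo $1 into
# manifest $2. Paths are sorted so the manifest is stable between runs.
write_manifest() {
    (cd "$1" && find . -type f -print0 | LC_ALL=C sort -z \
        | xargs -0 -r sha256sum) > "$2"
}

# Return non-zero if any file listed in manifest $2 is missing or altered
# in repo $1. Files present in the repo but absent from the manifest are
# not detected by this check.
verify_manifest() {
    local manifest
    manifest="$(cd "$(dirname "$2")" && pwd)/$(basename "$2")"
    (cd "$1" && sha256sum --check --quiet "$manifest")
}

# Example use (paths are placeholders):
#   write_manifest maven-repo maven-repo.sha256    # online, by a maintainer
#   verify_manifest maven-repo maven-repo.sha256   # before the offline build
```

Keeping the manifest outside the repository directory avoids the manifest listing itself.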
Debian Java Policy recommends:

- Each upstream JAR dependency should be either (a) packaged in Debian
as `lib*-java`, or (b) downloaded and built from source within the
package build process.
- For option (b), use `download-maven-poms` helper script or similar to
fetch artifacts.

However, since Spring Boot is not in Debian, **this project cannot be
packaged purely from Debian packages**. The practical approach:
``` makefile
# debian/rules snippet:
%:
	dh $@ --with javahelper,maven

override_dh_auto_build:
	# Ensure offline build
	dh_auto_build -- --offline -Dmaven.repo.local=$(CURDIR)/deps-maven-repo
```
With `debian/control` Build-Depends including all available Debian
packages:

    Build-Depends: debhelper-compat (=13),
     dh-buildupdate, default-jdk (>= 17), maven-debian-helper, maven-repo-helper,
     libantlr4-runtime-java, libcommons-lang3-java, libguava-java, libh2-java,
     libpostgresql-jdbc-java, libjoda-time-java, libjaxb-java, libbyte-buddy-java,
     librabbitmq-client-java, libreflections-java

Then supplement with direct downloads for non-packaged dependencies
(Spring Boot BOM, POI, Tika, springdoc-openapi). The `maven-repo-helper`
tools can download these from Maven Central during the online build
phase.
### Part 2 --- Test-Time Offline (Keycloak)
The file `scripts/keycloak-setup-start.sh` downloads Keycloak from
GitHub:
``` bash
wget https://github.com/keycloak/keycloak/releases/download/${ver}/keycloak-${ve…
```
**Fixes:**
1. **Bundle Keycloak tarball in the repo** (or ship as a separate
artifact):
- Download `keycloak-26.0.6.tar.gz` and place it in
`scripts/keycloak/`.
- Modify script to check for local file first:
``` bash
if [ -f "scripts/keycloak/keycloak-${ver}.tar.gz" ]; then
    echo "Using bundled Keycloak tarball."
else
    wget https://github.com/keycloak/keycloak/releases/download/${ver}/keycloak-${ve…
fi
```
2. **Alternatively**, package Keycloak from Debian:
`libkeycloak-admin-rest-client-java` exists, but the full Keycloak
server is not packaged. The bundled tarball approach is simpler.
3. **Or skip Keycloak** for unit tests entirely --- run only
`make check` (which runs JUnit tests that don't need Keycloak) and
document that integration tests require the online setup:
``` makefile
check-offline:
	$(MVN) $(MVNOPTS) test -Dtest='!*IntegrationTest,!*IT'
```
### Part 3 --- Runtime Offline (Tika Language Detection)
Apache Tika's `tika-langdetect` module downloads language detection
models on first use. This happens at runtime, not build time.
**Fix:** Set the system property to prevent online download:
``` java
// In application configuration or startup code:
System.setProperty("org.apache.tika.language.detect.model", "/path/to/local/model.jar");
```
Or exclude the language detection module from Tika's dependency tree if
it's not needed. Check `FileHandlingService.java` and
`FileUtilsService.java` --- they use `Tika` for MIME detection and text
extraction, which works fine without language detection models. The
models are only needed for `LanguageDetector`.
**In pom.xml**, add exclusions if language detection is not used:
``` xml
<dependency>
<groupId>org.apache.tika</groupId>
<artifactId>tika-parsers-standard-package</artifactId>
<exclusions>
<exclusion>
<groupId>com.github.pemistahl</groupId>
<artifactId>lingua-language-detector</artifactId>
</exclusion>
</exclusions>
</dependency>
```
------------------------------------------------------------------------
## Proposed Debian Packages to Install
### Required Build Dependencies

    default-jdk (>= 21)     # Java 21+ compiler and runtime (see JDK note below)
    maven                   # Maven build tool
    libantlr4-runtime-java  # ANTLR runtime (already in deps, but available as pkg)
    git                     # Version control (for git-commit-id plugin if used)

**JDK version bump required:** `pom.xml` currently sets
`<java.version>17</java.version>`. JDK 17 has been removed from Debian
Testing --- only JDK 21 (`openjdk-21-jdk`) and JDK 25 (`openjdk-25-jdk`)
remain. The fix is straightforward:
``` xml
<!-- In pom.xml, change: -->
<properties>
<java.version>21</java.version>
</properties>
```
Spring Boot 3.x supports JDK 17 through 21+ and the codebase doesn't use
any Java 17-specific APIs that would break on 21. The fix is a one-line
change in `pom.xml` plus an update to `.gitlab-ci.yml` (which references
`openjdk-17-jdk`).
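
A build script could guard against an outdated JDK before Maven is even started. The sketch below is one possible way to do that, assuming `javac -version` prints a line of the form `javac 21.0.2`; the function names are invented for this example and the threshold of 21 simply follows the proposal.

```bash
#!/bin/bash
# Sketch: fail early when the installed JDK is older than a required major
# version.

# Extract the major version from strings such as "21.0.2", "17", "25-ea"
# or the legacy form "1.8.0_292".
java_major() {
    local v="$1"
    case "$v" in
        1.*) v="${v#1.}" ;;   # legacy scheme: 1.8.0_292 -> 8.0_292
    esac
    # Drop everything from the first non-digit character onwards.
    echo "${v%%[!0-9]*}"
}

# Succeed only if javac reports at least the major version given as $1.
require_jdk() {
    local out ver
    out=$(javac -version 2>&1) || return 1
    ver=$(java_major "${out#javac }")
    [ -n "$ver" ] && [ "$ver" -ge "$1" ]
}

# Example use:
#   require_jdk 21 || { echo "JDK 21 or newer is required" >&2; exit 1; }
```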
### Required Runtime Dependencies

    default-jre             # Java runtime
    postgresql              # Database backend for production use
    rabbitmq-server         # Message queue (if using AMQP integration profile)
    keycloak                # Auth provider (NOT in Debian — bundle or skip)
    tesseract-ocr-nor       # OCR language data (used by Tika, from .gitlab-ci.yml)
    unoconv                 # Document conversion (from .gitlab-ci.yml)
    libreoffice-core        # For document format conversion via unoconv
    python3                 # Test scripts use Python 3
    curl                    # Health checks and API testing
    jq                      # JSON processing in test scripts

### Optional / Nice to Have

    asciidoctor             # For documentation generation (alternative to maven plugin)
    libh2-java              # Embedded database for demo/testing mode

------------------------------------------------------------------------
## Proposed Dependencies That Could Be Dropped
-------------------------------------------------------------------------------------------------------------
Dependency                         Used For           Can Drop?         Notes
---------------------------------- ------------------ ----------------- -------------------------------------
`spring-boot-starter-amqp`         RabbitMQ           **Yes,            Only needed if mail queue integration
                                   integration        conditionally**   is used. Profile-gated via
                                                                        `application-queueintegration.yml`.

`spring-boot-starter-validation`   Bean validation    No                Core functionality depends on it.
                                   (`@NotNull`)

`springdoc-openapi-*`              Swagger/OpenAPI    **Yes**           Cosmetic/documentation only. Can be
                                   UI docs                              excluded for minimal build.

`asciidoctor-maven-plugin`         API documentation  **Yes**           Only generates HTML docs during
                                   generation                           package phase, not needed to run the
                                                                        app.

`spring-restdocs-*` (test scope)   REST API           **Yes**           Test-time doc generation only.
                                   documentation
                                   tests

`junit-vintage-engine` (test       JUnit 3/4          Maybe             Only if all tests migrate to JUnit 5.
scope)                             compatibility in
                                   tests

`spring-boot-starter-webflux`      Reactive web       Maybe             Depends on which tests use it.
(test scope)                       client for tests

Tika language detection models     Language           **Yes**           MIME type detection works without it.
                                   identification of                    Exclude from classpath or set offline
                                   documents                            mode property.
-------------------------------------------------------------------------------------------------------------

------------------------------------------------------------------------
## Summary: Concrete Steps to Achieve "Deserted Island" Build
### Immediate (low effort, high impact)
1. **Add `--offline` flag to Maven in Makefile**:
``` makefile
MVNOPTS := -Dmaven.repo.local=$(CURDIR)/maven-repo --offline
```
2. **Commit `maven-repo/` contents** (or create a tarball artifact):
``` bash
mvn dependency:go-offline -Dmaven.repo.local=$(pwd)/maven-repo
mvn dependency:resolve-plugins -Dmaven.repo.local=$(pwd)/maven-repo
# Then commit or tar the directory
```
3. **Bundle Keycloak** in `scripts/keycloak/` and update
`keycloak-setup-start.sh`.
4. **Guard against Tika model download** via system property: Add to
`application.yml`:
``` yaml
spring:
  main:
    add-application-context-initializer: true
---
# Or set JVM flag: -Dorg.apache.tika.language.detect.model=none
```
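
For the tarball alternative mentioned in step 2, it can be useful if the archive is byte-for-byte identical whenever the repository content is unchanged, so that a single pinned checksum stays valid. The following is a rough sketch of that idea, assuming GNU tar 1.28 or newer and gzip; the function name and the file names in the usage comment are placeholders.

```bash
#!/bin/bash
# Sketch: pack a pre-populated Maven repo into a tarball whose bytes depend
# only on file names, contents and modes, not on build time or user.

# Usage: bundle_repo <repo-dir> <output.tar.gz>
bundle_repo() {
    local parent name
    parent=$(dirname "$1")
    name=$(basename "$1")
    # --sort=name fixes the member order, --mtime pins all timestamps to the
    # epoch and the owner options drop the local user and group. gzip -n
    # leaves the compression timestamp out of the gzip header.
    tar --sort=name --mtime='@0' --owner=0 --group=0 --numeric-owner \
        -C "$parent" -cf - "$name" | gzip -n > "$2"
}

# Example use:
#   bundle_repo maven-repo maven-repo-bundle.tar.gz
```

Two runs over the same content should then produce the same file, which makes it practical to publish the checksum next to the source release.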
### Medium-term (proper Debian packaging)
5. **Create proper `debian/` directory** with:
- `debian/control` listing all Build-Depends
- `debian/rules` using `dh-sequence-maven` or manual Maven
invocation with offline mode
- `debian/source/include-binaries` for bundled JARs that can't be
Debian-packaged
6. **Document the offline build process** in a new
`docs/general/OfflineBuild.md`.
### Long-term: Get Into Debian Main (reduce external dependency count)
This section addresses the fundamental conflict between Maven's
"download JARs" model and Debian's requirement that all code in `main`
be built from source available within Debian.
#### The Problem: Bundled Binary JARs vs DFSG Compliance
Debian Free Software Guidelines (DFSG) and [Java
Policy](https://www.debian.org/doc/packaging-manuals/policies/java-policy/)
require:

1. **All code must be available in source form**: binary-only JAR blobs
are not acceptable for `main`
2. **Build dependencies must themselves be Debian packages**:
downloading from Maven Central during build is only acceptable for
`non-free`
3. **Each upstream component should be packaged separately** as its own
`lib*-java` package

Currently, \~450 JARs are downloaded from Maven Central. Even if their
licenses are DFSG-compatible (Apache 2.0, MIT, LGPL), they violate the
"buildable from source" requirement because:

- They're pre-built binaries shipped alongside our source
- Their build chain is external to Debian and not reproducible within
the archive

#### Three Paths Forward
**Path A: Package Each Dependency Individually (Required for `main`)**
Every non-packaged dependency must become its own Debian package. The
ones we need to package are:
-------------------------------------------------------------------------
Dependency           License         Packaging Difficulty
-------------------- --------------- ------------------------------------
Spring Boot 3.4.x    Apache 2.0      **Very high**: \~50 transitive
                                     modules, each needs separate
                                     packaging; depends on Jakarta EE
                                     APIs not all in Debian

Apache POI 5.4.0     Apache 2.0      Medium: single project but
                                     large; may already exist as
                                     `libpoi-java` (check)

Tika parsers 2.8.0   Apache 2.0      **Very high**: huge dependency
                                     tree including Lucene, XML
                                     libraries, etc.

springdoc-openapi    Apache 2.0      Low: single project,
1.6.x                                straightforward Maven build

Keycloak server      Apache 2.0      High: not needed in `main` if
(test only)                          test-only; bundle as VCS artifact
                                     or skip
-------------------------------------------------------------------------

This is a **multi-year effort**. Each package needs proper metadata,
patches, and maintenance. Spring Boot alone has dozens of modules that
would need individual packaging.
**Path B: Use Maven Offline with Online Download Phase (Acceptable for
`non-free`)**
For `contrib`/`non-free`, the approach is simpler:

1. `debian/rules` downloads all JARs from Maven Central during online
build phase
2. Checksums are verified against pinned values in a checksum file
3. Build runs with `--offline` pointing at pre-populated local repo

This still requires Internet access during package build, but is
acceptable for non-free. The "deserted island" test would pass because
the Debian mirror includes these downloaded artifacts as part of the
built package.
**Path C: Hybrid Approach (Recommended Near-Term)**
Use a combination:

1. **Replace packaged dependencies with Debian packages** where
available (ANTLR, Guava, H2, etc.), which reduces the JAR count from
\~450 to \~350
2. **Package the critical non-packaged ones ourselves**:
springdoc-openapi (Apache 2.0, easy), any others that are small and
well-maintained
3. **Bundle remaining JARs** for now with proper licensing
documentation, targeting `non-free` initially
4. **Work upstream** to get Spring Boot packaged in Debian, as this is
the blocker for everything

#### Concrete Steps for Path C
``` makefile
# debian/rules approach:
override_dh_auto_build:
	# Use Debian packages where available via classpath
	# Download remaining from Maven Central (online phase only)
	dh_auto_build -- -Dmaven.repo.local=$(CURDIR)/.deps-repo --offline
```
With `debian/control`:

    # Replace Maven deps with Debian packages where available
    Build-Depends: debhelper-compat (=13), default-jdk (>= 21), maven-debian-helper,
     libantlr4-runtime-java, libguava-java, libh2-java, libpostgresql-jdbc-java,
     libjoda-time-java, libjaxb-java, libbyte-buddy-java, librabbitmq-client-java,
     libreflections-java
    Standards-Version: 4.6.2

And `debian/copyright` documenting all bundled JAR licenses.
#### Impact on "Deserted Island" Goal
The "deserted island" goal is **achievable now** for building and
running the application, even without Debian main packaging:

- A complete Debian mirror + our source repo + pre-populated
`maven-repo/` = offline build works
- The barrier to Debian `main` is separate from the ability to
build/run offline

The two goals should be tracked separately:

1. **Offline build capability** (this document): achievable with
bundled artifacts
2. **Debian main compliance**: requires packaging all dependencies or
dropping Spring Boot

7. **Evaluate dropping springdoc-openapi and asciidoctor-maven-plugin**
from default build, moving them to an optional profile.
8. **Profile-gate AMQP integration** more clearly so it's not pulled in
by default.
9. **Replace Tika with Debian-packaged alternatives** where possible
(e.g., `file` command for MIME detection, `tesseract-ocr` for OCR)
--- significant refactoring needed.
------------------------------------------------------------------------
## Verification Checklist
To verify the "deserted island" build works:
``` bash
# 1. Start with fresh clone + bundled maven-repo tarball
git clone <repo> && cd nikita-noark5-core-upstream
tar xzf ../maven-repo-bundle.tar.gz  # or already in git

# 2. Block outbound HTTP and HTTPS (requires root)
iptables -A OUTPUT -p tcp --dport 80 -j DROP
iptables -A OUTPUT -p tcp --dport 443 -j DROP

# 3. Build offline
make build check

# 4. Verify success (ls copes with the glob matching more than one file)
ls target/nikita-noark5-core-*.jar >/dev/null 2>&1 && echo "BUILD SUCCESS"

# 5. Restore network
iptables -D OUTPUT -p tcp --dport 80 -j DROP
iptables -D OUTPUT -p tcp --dport 443 -j DROP
```
If `make build check` succeeds with outbound HTTP and HTTPS blocked,
the deserted island test passes.
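
The firewall rules in the checklist only cover TCP ports 80 and 443 and need root. A complementary check that needs neither is to scan a saved build log for remote fetches, since Maven logs a `Downloading from <repo-id>: <url>` line for each artifact it retrieves. This is a small sketch under the assumption that the build output was captured in a file (the name `build.log` is a placeholder) and that transfer logging was not suppressed, for example with `--no-transfer-progress`.

```bash
#!/bin/bash
# Sketch: detect remote fetches in a saved Maven build log.

# Print the number of "Downloading from" lines found in log file $1.
# The optional "[INFO] " prefix covers Maven's batch-mode output.
count_downloads() {
    grep -c -E '^(\[INFO\] )?Downloading from ' "$1"
}

# Succeed only if the log shows no remote fetches at all.
log_is_offline_clean() {
    [ "$(count_downloads "$1")" -eq 0 ]
}

# Example use:
#   make build check 2>&1 | tee build.log
#   log_is_offline_clean build.log && echo "no downloads attempted"
```

Running this against a build made with the network available shows whether the `--offline` flag and the local repository are really in effect.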
--
Best regards
Petter Reinholdtsen