Two systems sharing one query language. Explicit memory (an RDF-star triplestore, exact answers) plus
implicit memory (a transformer trained on the same triples, plausible answers). Generated facts
write back into the store as RDF-star annotations with propositionInferredFrom citation edges.
Loka began as a lean RDF-star triplestore with native vector indexing. Over time the purpose
shifted: the triplestore is now one half of a neuro-symbolic world-model engine. The other half is a small role-aware
transformer trained from scratch on the same triples (with English labels substituted for opaque QIDs/PIDs). Both
expose the same SPARQL+ interface. A query reaches both systems, and the caller cannot tell which one answered,
except via the propositionInferredFrom RDF-star edges that thread every model-generated triple back to the
context that informed it. The engine retains the Loka name, and the project as a whole carries it too.
The loop is closed. Generated triples land in the store flagged propositionGenerated true.
The next training-corpus extraction's SPARQL-star FILTER excludes them, so the model never trains on its own output.
Inference can be re-run repeatedly to grow the citation graph without polluting the training distribution.
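The exclusion is a single SPARQL-star filter clause. A minimal sketch, written here as a query constant such an extraction script could use — the constant name is hypothetical and the repo's actual query may be phrased differently:

```python
# Hypothetical illustration of the training-corpus exclusion filter described above;
# the real extraction tooling may phrase it differently. A triple survives into the
# corpus only if no RDF-star annotation marks it as model output.
EXCLUDE_GENERATED_QUERY = """
PREFIX prov: <http://loka.dev/provenance/>
PREFIX xsd:  <http://www.w3.org/2001/XMLSchema#>

SELECT ?s ?p ?o WHERE {
  ?s ?p ?o .
  FILTER NOT EXISTS {
    << ?s ?p ?o >> prov:propositionGenerated "true"^^xsd:boolean .
  }
}
"""
```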
When the inference layer accepts a candidate (S, P) and emits a predicted object "X", it
writes a fixed-shape annotation block. The block's subject is the quoted generated triple; its predicates are the
provenance metadata fields plus one or more propositionInferredFrom edges, each of whose objects is
another quoted triple, a cited piece of context the prediction was conditioned on:
<S> <P> "X" .
<<S P "X">> prov:propositionGenerated "true"^^xsd:boolean .
<<S P "X">> prov:propositionGeneratedBy "loka-wikidata-v14" .
<<S P "X">> prov:propositionConfidence "0.43"^^xsd:decimal .
<<S P "X">> prov:propositionInferredFrom <<S existing_p1 existing_o1>> .
<<S P "X">> prov:propositionInferredFrom <<S existing_p2 existing_o2>> .
# ...one per cited context triple (default 10)
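As a concrete illustration, a small helper that renders this block might look like the sketch below. This is hypothetical code, not the project's emitter: the predicate IRIs come from the schema above, while the function name and the simplifying assumption that cited objects are IRIs are illustrative only.

```python
# Hypothetical sketch of an annotation-block emitter mirroring the fixed shape above.
PROV = "http://loka.dev/provenance/"
XSD = "http://www.w3.org/2001/XMLSchema#"

def render_annotation_block(s, p, obj, model_id, confidence, cited, max_citations=10):
    """Render the primary triple plus its RDF-star provenance block as Turtle-like lines.

    `s` and `p` are IRIs, `obj` is the predicted literal, and `cited` is a list of
    (subject, predicate, object) context triples; cited objects are assumed to be IRIs
    here to keep the sketch short.
    """
    quoted = f'<< <{s}> <{p}> "{obj}" >>'
    lines = [
        f'<{s}> <{p}> "{obj}" .',
        f'{quoted} <{PROV}propositionGenerated> "true"^^<{XSD}boolean> .',
        f'{quoted} <{PROV}propositionGeneratedBy> "{model_id}" .',
        f'{quoted} <{PROV}propositionConfidence> "{confidence}"^^<{XSD}decimal> .',
    ]
    for cs, cp, co in cited[:max_citations]:
        lines.append(f'{quoted} <{PROV}propositionInferredFrom> << <{cs}> <{cp}> <{co}> >> .')
    return "\n".join(lines)
```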
prov: expands to http://loka.dev/provenance/. This whole namespace is reserved:
the world model never sees, proposes, or emits one of its predicates. Three independent guards enforce that — corpus
stripping at extraction time, candidate-predicate filtering during inference, and an emit-time guard before each primary
triple is written. Verbose names (propositionInferredFrom, not inferredFrom) make the rule
scannable by humans and collision-resistant against real-world predicates.
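The emit-time guard in particular is small enough to sketch. A minimal, hypothetical version (not the repo's code) is just a namespace-prefix check:

```python
# Hypothetical sketch of the emit-time guard: the reserved provenance namespace must
# never appear as the predicate of a primary, model-generated triple.
RESERVED_PROV_NS = "http://loka.dev/provenance/"

def assert_emittable_predicate(predicate_iri: str) -> str:
    """Raise before writing if the model proposed a reserved provenance predicate."""
    if predicate_iri.startswith(RESERVED_PROV_NS):
        raise ValueError(f"reserved provenance predicate emitted by the model: {predicate_iri}")
    return predicate_iri
```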
Each propositionInferredFrom edge is still a transparent RDF-star row pointing at a
concrete context triple: auditable, filterable like any generated triple, and often informative about what the
model's implied reasoning is. The schema does the work; we don't need elaborate guards on top of it.
Masked-S/P/O training produces models that "know" the answer category (university, museum, https-URL) but degenerate
during greedy decoding into fillers like of of of of or museum museum. We don't fix this at
training time. We fix it at decode time:
Every emission of a token t increments a per-token counter. At each later masked position we divide the logit
of every already-emitted token by repetition_penalty^count. Three emissions of "of" at penalty 3.0
multiply its divisor by 27 (3.0^3), reliably dropping it below the per-token floor and breaking the cascade. A genuinely
needed re-use can still win on its first repeat; only loops collapse.
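A minimal sketch of that rule, assuming a token-to-score mapping (the actual decoder lives in training/infer_with_citations.py and may differ in detail):

```python
from collections import Counter

def penalize_repeats(logits, emitted_tokens, penalty=3.0):
    """Cumulative repetition penalty at decode time.

    `logits` maps token -> raw score; `emitted_tokens` is everything emitted so far.
    Each previously emitted token's score is divided by penalty ** count, so three
    prior emissions of "of" at penalty 3.0 divide its score by 27. The sketch
    assumes positive scores; a sign-aware variant would multiply negative scores.
    """
    counts = Counter(emitted_tokens)
    return {
        tok: score / (penalty ** counts[tok]) if counts[tok] else score
        for tok, score in logits.items()
    }

# "of" has already been emitted three times, so its score falls by a factor of 27.
print(penalize_repeats({"of": 9.0, "halle": 4.0}, ["university", "of", "of", "of"]))
# -> {'of': 0.333..., 'halle': 4.0}
```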
| Subject / predicate | No penalty | Cumulative penalty 3.0 |
|---|---|---|
| Comtesse de Die / educated at | university of of of of of of of | university of halle (correctly identifies Halle, where she studied) |
| canton of Romilly / Commons category | canton of of sur sur | canton of (clean truncation) |
| Zudar / area | (didn't pass threshold) | 33 (numeric — model picked up that area is a number) |
| Abbas Mirza / has works in collection | 1 http www w3 org 2001 xmlschema decimal | metropolitan museum of museum (the Met genuinely holds Abbas Mirza pieces) |
From v5 onward all checkpoints share the same architecture: a 44.5 M-parameter role-aware masked S/P/O transformer, with a BPE tokenizer from v6 onward. v3–v10 trained on Wikidata slices ingested into Loka; v11 onward trains on the standalone normalized-wikidata HF dataset, built by tools/preprocess_from_hf.py streaming philippesaade/wikidata directly, with no Loka in the data path. Pull any tag from EmmaLeonhart/loka:
| Tag | Final ppl | Corpus | Notes |
|---|---|---|---|
| v3 | 53.4 | 757 k noisy | 16 M params. First end-to-end run; misleading ppl from datatype-suffix memorisation. |
| v4 | 92.5 | 757 k cleaned | 16 M params. Datatype-suffix bug fixed. |
| v5 | 84.85 | 757 k cleaned | 44.5 M params (3× scale-up). Picks specific real entities where v4 fell back to fillers. |
| v6-bpe | 194.98 | 757 k cleaned | BPE tokenizer added. Catalog hallucinations dominant — led to v7 corpus cleanup. |
| v7 | 192.63 | 184 k v7-cleaned | Catalog datatypes dropped (~76 % of v6 corpus removed). Catalog-format leak gone. |
| v8 | 64.65 | 184 k v7-cleaned | 20 epochs on v7 corpus. 3× ppl improvement over v7. |
| v9 | 57.15 | 94 k cleaned | Fresh 2 M-triple slice; 97 % semantic-predicate share on Q42 propgen. |
| v10 | 55.52 | 94 k cleaned | First fully-automated cron cycle. 100 % semantic-predicate share — cleanest run yet. |
| v11 | 279.12 | 350 k from v11-50k | First model on the no-Loka preprocessing pipeline. 3 of 20 epochs (CUDA OOM at batch 32 on the laptop's 8 GB VRAM); future runs use batch 16. Different operating regime — not directly comparable to v10. |
| v12 | 250.82 | 672 k from v12-100k | Epoch-6 snapshot (training disrupted by shared-GPU LLaMA contention; epoch 4 hit 226.86). Per-epoch tags v12.1…v12.7 also available. |
| v13 | training | 2.5 M from v13-500k | 10 epochs, batch 16, exclusive GPU. Per-epoch tags v13.1…v13.10 via tools/epoch_snapshot_pusher.py. |
| v14 | 202.01 | 4,021,409 from v14-1M | Series best. Epoch-4 ppl 202.01 is canonical; per-epoch tags v14.1…v14.7 on HF, continuing to v14.10. Corpus-scale lever confirmed: v11 350 k→279 to v14 4 M→202. The fresh-Adam continuation (ep6 206, ep7 210) is not beating epoch 4 — ~202 is the from-scratch floor on this corpus/hardware; a clean 10-epoch contributor run is the path lower. |
From v11 onward the training corpus is its own Hugging Face dataset, EmmaLeonhart/normalized-wikidata: clean text-form Wikidata triples in four scale tiers from 50 k to 1 M entity rows, published as a standalone artifact (CC-BY-SA 4.0). Pull either the corpus, the model, or both.
from huggingface_hub import hf_hub_download

# Pinned model checkpoint (v12 tag shown; see the table above for other tags)
ckpt = hf_hub_download(
    repo_id="EmmaLeonhart/loka", repo_type="dataset",
    filename="checkpoints/wikidata_v12.pt", revision="v12",
)
tok = hf_hub_download(
    repo_id="EmmaLeonhart/loka", repo_type="dataset",
    filename="corpus/tokenizer_bpe.json", revision="v12",
)

# Or just the corpus, standalone
corpus = hf_hub_download(
    repo_id="EmmaLeonhart/normalized-wikidata", repo_type="dataset",
    filename="triples_normalized.txt", revision="v13-500k",
)
# 1. Engine (optional — v11+ training doesn't need Loka serve)
cargo build --release -p loka-cli
# 2. Pull the latest model + tokenizer (otherwise they auto-download on first inference)
python training/loader.py
# 3. Generative-citation inference
python training/infer_with_citations.py \
--bpe-tokenizer training/data/tokenizer_bpe.json \
--max-subjects 50 \
--confidence 0.4 \
--repetition-penalty 3.0
# add --post to write predictions back into a running Loka store