Two systems sharing one query language. Explicit memory (an RDF-star triplestore, exact answers) plus
implicit memory (a transformer trained on the same triples, plausible answers). Generated facts
write back into the store as RDF-star annotations with propositionInferredFrom citation edges.
Loka began as a lean RDF-star triplestore with native vector indexing. Over time the purpose
shifted: the triplestore is now one half of a neuro-symbolic world-model engine. The other half is a small role-aware
transformer trained from scratch on the same triples (with English labels substituted for opaque QIDs/PIDs). Both
expose the same SPARQL+ interface. A query reaches both systems, and the caller cannot tell which one answered,
except via the propositionInferredFrom RDF-star edges that thread every model-generated triple back to the
context that informed it. The engine retains the Loka name, and the project as a whole carries it too.
The loop is closed. Generated triples land in the store flagged propositionGenerated true.
The next training-corpus extraction's SPARQL-star FILTER excludes them, so the model never trains on its own output.
Inference can be re-run repeatedly to grow the citation graph without polluting the training distribution.
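The exclusion is a single SPARQL-star filter clause. A minimal sketch, written here as a query constant such an extraction script could use — the constant name is hypothetical and the repo's actual query may be phrased differently:

```python
# Hypothetical illustration of the training-corpus exclusion filter described above;
# the real extraction tooling may phrase it differently. A triple survives into the
# corpus only if no RDF-star annotation marks it as model output.
EXCLUDE_GENERATED_QUERY = """
PREFIX prov: <http://loka.dev/provenance/>
PREFIX xsd:  <http://www.w3.org/2001/XMLSchema#>

SELECT ?s ?p ?o WHERE {
  ?s ?p ?o .
  FILTER NOT EXISTS {
    << ?s ?p ?o >> prov:propositionGenerated "true"^^xsd:boolean .
  }
}
"""
```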
When the inference layer accepts a candidate (S, P) and emits a predicted object "X", it
writes a fixed-shape annotation block. The block's subject is the quoted generated triple; its predicates are the
provenance metadata fields plus one or more propositionInferredFrom edges, each of whose objects is
another quoted triple, a cited piece of context the prediction was conditioned on:
<S> <P> "X" .
<<S P "X">> prov:propositionGenerated "true"^^xsd:boolean .
<<S P "X">> prov:propositionGeneratedBy "loka-wikidata-v14" .
<<S P "X">> prov:propositionConfidence "0.43"^^xsd:decimal .
<<S P "X">> prov:propositionInferredFrom <<S existing_p1 existing_o1>> .
<<S P "X">> prov:propositionInferredFrom <<S existing_p2 existing_o2>> .
# ...one per cited context triple (default 10)
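As a concrete illustration, a small helper that renders this block might look like the sketch below. This is hypothetical code, not the project's emitter: the predicate IRIs come from the schema above, while the function name and the simplifying assumption that cited objects are IRIs are illustrative only.

```python
# Hypothetical sketch of an annotation-block emitter mirroring the fixed shape above.
PROV = "http://loka.dev/provenance/"
XSD = "http://www.w3.org/2001/XMLSchema#"

def render_annotation_block(s, p, obj, model_id, confidence, cited, max_citations=10):
    """Render the primary triple plus its RDF-star provenance block as Turtle-like lines.

    `s` and `p` are IRIs, `obj` is the predicted literal, and `cited` is a list of
    (subject, predicate, object) context triples; cited objects are assumed to be IRIs
    here to keep the sketch short.
    """
    quoted = f'<< <{s}> <{p}> "{obj}" >>'
    lines = [
        f'<{s}> <{p}> "{obj}" .',
        f'{quoted} <{PROV}propositionGenerated> "true"^^<{XSD}boolean> .',
        f'{quoted} <{PROV}propositionGeneratedBy> "{model_id}" .',
        f'{quoted} <{PROV}propositionConfidence> "{confidence}"^^<{XSD}decimal> .',
    ]
    for cs, cp, co in cited[:max_citations]:
        lines.append(f'{quoted} <{PROV}propositionInferredFrom> << <{cs}> <{cp}> <{co}> >> .')
    return "\n".join(lines)
```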
prov: expands to http://loka.dev/provenance/. This whole namespace is reserved:
the world model never sees, proposes, or emits one of its predicates. Three independent guards enforce that — corpus
stripping at extraction time, candidate-predicate filtering during inference, and an emit-time guard before each primary
triple is written. Verbose names (propositionInferredFrom, not inferredFrom) make the rule
scannable by humans and collision-resistant against real-world predicates.
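The emit-time guard in particular is small enough to sketch. A minimal, hypothetical version (not the repo's code) is just a namespace-prefix check:

```python
# Hypothetical sketch of the emit-time guard: the reserved provenance namespace must
# never appear as the predicate of a primary, model-generated triple.
RESERVED_PROV_NS = "http://loka.dev/provenance/"

def assert_emittable_predicate(predicate_iri: str) -> str:
    """Raise before writing if the model proposed a reserved provenance predicate."""
    if predicate_iri.startswith(RESERVED_PROV_NS):
        raise ValueError(f"reserved provenance predicate emitted by the model: {predicate_iri}")
    return predicate_iri
```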
Each propositionInferredFrom edge is still a transparent RDF-star row pointing at a
concrete context triple: auditable, filterable like any generated triple, and often informative about what the
model's implied reasoning is. The schema does the work; we don't need elaborate guards on top of it.
Masked-S/P/O training produces models that "know" the answer category (university, museum, https-URL) but degenerate
during greedy decoding into fillers like of of of of or museum museum. We don't fix this at
training time. We fix it at decode time:
Every emission of a token t increments a per-token counter. At each later masked position we divide the logit
of every already-emitted token by repetition_penalty^count. Three emissions of "of" at penalty 3.0
multiply its divisor by 27 (3.0^3), reliably dropping it below the per-token floor and breaking the cascade. A genuinely
needed re-use can still win on its first repeat; only loops collapse.
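A minimal sketch of that rule, assuming a token-to-score mapping (the actual decoder lives in training/infer_with_citations.py and may differ in detail):

```python
from collections import Counter

def penalize_repeats(logits, emitted_tokens, penalty=3.0):
    """Cumulative repetition penalty at decode time.

    `logits` maps token -> raw score; `emitted_tokens` is everything emitted so far.
    Each previously emitted token's score is divided by penalty ** count, so three
    prior emissions of "of" at penalty 3.0 divide its score by 27. The sketch
    assumes positive scores; a sign-aware variant would multiply negative scores.
    """
    counts = Counter(emitted_tokens)
    return {
        tok: score / (penalty ** counts[tok]) if counts[tok] else score
        for tok, score in logits.items()
    }

# "of" has already been emitted three times, so its score falls by a factor of 27.
print(penalize_repeats({"of": 9.0, "halle": 4.0}, ["university", "of", "of", "of"]))
# -> {'of': 0.333..., 'halle': 4.0}
```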
| Subject / predicate | No penalty | Cumulative penalty 3.0 |
|---|---|---|
| Comtesse de Die / educated at | university of of of of of of of | university of halle (correctly identifies Halle, where she studied) |
| canton of Romilly / Commons category | canton of of sur sur | canton of (clean truncation) |
| Zudar / area | (didn't pass threshold) | 33 (numeric — model picked up that area is a number) |
| Abbas Mirza / has works in collection | 1 http www w3 org 2001 xmlschema decimal | metropolitan museum of museum (the Met genuinely holds Abbas Mirza pieces) |
From v5 onward all checkpoints share the same architecture: a 44.5 M-parameter role-aware masked S/P/O transformer, with a BPE tokenizer from v6 onward. v3–v10 trained on Wikidata slices ingested into Loka; v11 onward trains on the standalone normalized-wikidata HF dataset, built by tools/preprocess_from_hf.py streaming philippesaade/wikidata directly, with no Loka in the data path. Pull any tag from EmmaLeonhart/loka:
| Tag | Final ppl | Corpus | Notes |
|---|---|---|---|
| v3 | 53.4 | 757 k noisy | 16 M params. First end-to-end run; misleading ppl from datatype-suffix memorisation. |
| v4 | 92.5 | 757 k cleaned | 16 M params. Datatype-suffix bug fixed. |
| v5 | 84.85 | 757 k cleaned | 44.5 M params (3× scale-up). Picks specific real entities where v4 fell back to fillers. |
| v6-bpe | 194.98 | 757 k cleaned | BPE tokenizer added. Catalog hallucinations dominant — led to v7 corpus cleanup. |
| v7 | 192.63 | 184 k v7-cleaned | Catalog datatypes dropped (~76 % of v6 corpus removed). Catalog-format leak gone. |
| v8 | 64.65 | 184 k v7-cleaned | 20 epochs on v7 corpus. 3× ppl improvement over v7. |
| v9 | 57.15 | 94 k cleaned | Fresh 2 M-triple slice; 97 % semantic-predicate share on Q42 propgen. |
| v10 | 55.52 | 94 k cleaned | First fully-automated cron cycle. 100 % semantic-predicate share — cleanest run yet. |
| v11 | 279.12 | 350 k from v11-50k | First model on the no-Loka preprocessing pipeline. 3 of 20 epochs (CUDA OOM at batch 32 on the laptop's 8 GB VRAM); future runs use batch 16. Different operating regime — not directly comparable to v10. |
| v12 | 250.82 | 672 k from v12-100k | Epoch-6 snapshot (training disrupted by shared-GPU LLaMA contention; epoch 4 hit 226.86). Per-epoch tags v12.1…v12.7 also available. |
| v13 | training | 2.5 M from v13-500k | 10 epochs, batch 16, exclusive GPU. Per-epoch tags v13.1…v13.10 via tools/epoch_snapshot_pusher.py. |
| v14 | 202.01 | 4,021,409 from v14-1M | Series best. Epoch-4 ppl 202.01 is canonical; per-epoch tags v14.1…v14.7 on HF, continuing to v14.10. Corpus-scale lever confirmed: v11 350 k→279 to v14 4 M→202. The fresh-Adam continuation (ep6 206, ep7 210) is not beating epoch 4 — ~202 is the from-scratch floor on this corpus/hardware; a clean 10-epoch contributor run is the path lower. |
From v11 onward the training corpus is its own Hugging Face dataset, EmmaLeonhart/normalized-wikidata: clean text-form Wikidata triples in four scale tiers from 50 k to 1 M entity rows, published as a standalone artifact (CC-BY-SA 4.0). Pull either the corpus, the model, or both.
from huggingface_hub import hf_hub_download

# Pinned model checkpoint (v12 tag shown; see the table above for other tags)
ckpt = hf_hub_download(
    repo_id="EmmaLeonhart/loka", repo_type="dataset",
    filename="checkpoints/wikidata_v12.pt", revision="v12",
)
tok = hf_hub_download(
    repo_id="EmmaLeonhart/loka", repo_type="dataset",
    filename="corpus/tokenizer_bpe.json", revision="v12",
)

# Or just the corpus, standalone
corpus = hf_hub_download(
    repo_id="EmmaLeonhart/normalized-wikidata", repo_type="dataset",
    filename="triples_normalized.txt", revision="v13-500k",
)
# 1. Engine (optional — v11+ training doesn't need Loka serve)
cargo build --release -p loka-cli
# 2. Pull the latest model + tokenizer (otherwise they auto-download on first inference)
python training/loader.py
# 3. Generative-citation inference
python training/infer_with_citations.py \
--bpe-tokenizer training/data/tokenizer_bpe.json \
--max-subjects 50 \
--confidence 0.4 \
--repetition-penalty 3.0
# add --post to write predictions back into a running Loka store