Loka — a neuro-symbolic world model

Two systems sharing one query language. Explicit memory (an RDF-star triplestore, exact answers) plus implicit memory (a transformer trained on the same triples, plausible answers). Generated facts write back into the store as RDF-star annotations with propositionInferredFrom citation edges.

RDF-star · SPARQL+ & SPARQL-star · Self-citing inference · v0.4.0 — World-model release

The pivot in one paragraph

Loka started as a lean RDF-star triplestore with native vector indexing. Over time the purpose shifted: that engine is now one half of a neuro-symbolic world model. The other half is a small role-aware transformer trained from scratch on the same triples (with English labels substituted for opaque QIDs/PIDs). Both expose the same SPARQL+ interface. A query reaches both systems and the caller doesn't pick which one answered — except via propositionInferredFrom RDF-star edges that thread every model-generated triple back to the context that informed it. The Loka name covers both the engine and the project as a whole.

The two-system loop

┌───────────────────┐
│  Curated triples  │   (Wikidata, philippesaade/wikidata HF parquet)
│     (RDF-star)    │
└─────────┬─────────┘
          ▼
┌───────────────────┐               ┌──────────────────────┐
│    Loka store     │ ────────────► │   Training corpus    │
│    (.sdb file)    │    SPARQL+    │ (label-substituted)  │
│                   │  SPARQL-star  │                      │
└─────────▲─────────┘               └──────────┬───────────┘
          │                                    ▼
          │                         ┌──────────────────────┐
          │                         │      Role-aware      │
          │                         │     transformer      │
          │                         │ (44.5M params, v5+)  │
          │                         └──────────┬───────────┘
          │                                    ▼
          │                         ┌──────────────────────┐
          │                         │    Inference loop    │
          │                         │  + cumulative rep.   │
          │                         │   penalty decoder    │
          │                         │  + RDF-star write-   │
          │                         │    back to store     │
          │                         └──────────┬───────────┘
          │                                    ▼
          │                         ┌──────────────────────┐
          └─────────────────────────┤ Generated triples +  │
                                    │ propositionInferred  │
                                    │      From edges      │
                                    └──────────────────────┘

The loop is closed. Generated triples land in the store flagged propositionGenerated true. The next training-corpus extraction's SPARQL-star FILTER excludes them, so the model never trains on its own output. Inference can be re-run repeatedly to grow the citation graph without polluting the training distribution.
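
The filter itself is one SPARQL-star pattern. Here is a sketch of the extraction query, written as a Python string for convenience — only the FILTER shape over the quoted triple is the point; the query text is illustrative, not the repository's actual extraction code:

# Illustrative extraction filter: keep only triples whose quoted form does not
# carry the generated flag, so model output never re-enters the training corpus.
EXTRACTION_QUERY = """
PREFIX prov: <http://loka.dev/provenance/>
SELECT ?s ?p ?o WHERE {
  ?s ?p ?o .
  FILTER NOT EXISTS { << ?s ?p ?o >> prov:propositionGenerated true }
}
"""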

Generative citation

When the inference layer accepts a candidate (S, P) and emits a predicted object "X", it writes a fixed-shape annotation block. The block's subject is the quoted generated triple. Its predicates carry the metadata plus one or more propositionInferredFrom edges whose objects are further quoted triples — the cited pieces of context the prediction was conditioned on:

<S> <P> "X" .
<<S P "X">>  prov:propositionGenerated     "true"^^xsd:boolean .
<<S P "X">>  prov:propositionGeneratedBy   "loka-wikidata-v14" .
<<S P "X">>  prov:propositionConfidence    "0.43"^^xsd:decimal .
<<S P "X">>  prov:propositionInferredFrom  <<S existing_p1 existing_o1>> .
<<S P "X">>  prov:propositionInferredFrom  <<S existing_p2 existing_o2>> .
   # ...one per cited context triple (default 10)

prov: expands to http://loka.dev/provenance/. This whole namespace is reserved: the world model never sees, proposes, or emits one of its predicates. Three independent guards enforce that — corpus stripping at extraction time, candidate-predicate filtering during inference, and an emit-time guard before each primary triple is written. Verbose names (propositionInferredFrom, not inferredFrom) make the rule scannable by humans and collision-resistant against real-world predicates.
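
A sketch of that rule as code: the prefix is the reserved namespace above, and the three helpers mirror the three guards the paragraph names, but their names and signatures are illustrative rather than the repository's API.

RESERVED_PREFIX = "http://loka.dev/provenance/"

def is_reserved(predicate_iri: str) -> bool:
    """True for any predicate the world model must never see, propose, or emit."""
    return predicate_iri.startswith(RESERVED_PREFIX)

def strip_for_corpus(triples):
    """Guard 1: drop provenance rows before the training corpus is extracted."""
    return [(s, p, o) for (s, p, o) in triples if not is_reserved(p)]

def filter_candidates(predicates):
    """Guard 2: never offer a reserved predicate as an inference candidate."""
    return [p for p in predicates if not is_reserved(p)]

def assert_emittable(predicate_iri: str) -> None:
    """Guard 3: refuse to write a primary triple under a reserved predicate."""
    if is_reserved(predicate_iri):
        raise ValueError(f"reserved predicate must not be emitted: {predicate_iri}")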

Why hallucinated citations aren't a blocker. A fabricated propositionInferredFrom edge is still a transparent RDF-star row pointing at a concrete context triple — auditable, filterable like any generated triple, often informative about what the model thinks the reasoning is. The schema does the work; we don't need elaborate guards.
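
Auditing that graph is one SPARQL-star query. Again a sketch as a Python string — the predicate names come from the annotation block above, but the query text and any endpoint wiring around it are illustrative:

# Each result row pairs a generated triple with one context triple it cites.
AUDIT_QUERY = """
PREFIX prov: <http://loka.dev/provenance/>
SELECT ?s ?p ?o ?cs ?cp ?co WHERE {
  << ?s ?p ?o >> prov:propositionGenerated true ;
                 prov:propositionInferredFrom << ?cs ?cp ?co >> .
}
"""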

Cumulative repetition penalty

Masked-S/P/O training produces models that "know" the answer category (university, museum, https-URL) but degenerate during greedy decoding into fillers like of of of of or museum museum. We don't fix this at training time. We fix it at decode time:

Every emission of token t increments a per-token counter. At each later masked position we divide the logit of every emitted token by repetition_penalty^count. Three emissions of of at penalty 3.0 multiply its divisor by 27, reliably dropping it below the per-token floor and breaking the cascade. A genuinely-needed re-use can still win on its first repeat; only loops collapse.
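
A decode-loop sketch of that rule, with assumptions flagged: the greedy loop, plain-list logits, and the sign handling for negative logits are illustrative, only the divide-by-repetition_penalty^count step comes from the paragraph above, and the per-token floor it mentions is omitted.

def greedy_with_cumulative_penalty(step_logits, repetition_penalty=3.0):
    counts = {}                       # token id -> times already emitted
    decoded = []
    for logits in step_logits:        # one raw logit vector per masked position
        scored = list(logits)
        for tok, n in counts.items():
            factor = repetition_penalty ** n      # 3 emissions at 3.0 -> divisor 27
            # Shrink anything already emitted; negative logits are multiplied
            # instead of divided so the penalty still pushes them down.
            scored[tok] = scored[tok] / factor if scored[tok] > 0 else scored[tok] * factor
        best = max(range(len(scored)), key=scored.__getitem__)
        counts[best] = counts.get(best, 0) + 1
        decoded.append(best)
    return decoded

# Toy check: token 1 dominates the raw logits but loses by the third position.
print(greedy_with_cumulative_penalty([[0.5, 2.0], [0.5, 2.0], [0.5, 2.0]]))  # [1, 1, 0]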

Same checkpoint, different decoder

Subject / predicate | No penalty | Cumulative penalty 3.0
Comtesse de Die / educated at | university of of of of of of of | university of halle (correctly identifies Halle, where she studied)
canton of Romilly / Commons category | canton of of sur sur | canton of (clean truncation)
Zudar / area | (didn't pass threshold) | 33 (numeric — model picked up that area is a number)
Abbas Mirza / has works in collection | 1 http www w3 org 2001 xmlschema decimal | metropolitan museum of museum (the Met genuinely holds Abbas Mirza pieces)

Trained checkpoints

All checkpoints from v5 onward share the same architecture: 44.5 M parameters, role-aware masked-S/P/O transformer, with a BPE tokenizer from v6 onward. v3–v10 trained on Wikidata slices ingested into Loka; v11 onward trains on the standalone normalized-wikidata HF dataset, built by tools/preprocess_from_hf.py streaming philippesaade/wikidata directly — no Loka in the data path. Pull any tag from EmmaLeonhart/loka:

Tag | Final ppl | Corpus | Notes
v3 | 53.47 | 57 k noisy | 16 M params. First end-to-end run; misleading ppl from datatype-suffix memorisation.
v4 | 92.57 | 57 k cleaned | 16 M params. Datatype-suffix bug fixed.
v5 | 84.85 | 757 k cleaned | 44.5 M params (3× scale-up). Picks specific real entities where v4 fell back to fillers.
v6-bpe | 194.98 | 757 k cleaned | BPE tokenizer added. Catalog hallucinations dominant — led to v7 corpus cleanup.
v7 | 192.63 | 184 k v7-cleaned | Catalog datatypes dropped (~76 % of v6 corpus removed). Catalog-format leak gone.
v8 | 64.65 | 184 k v7-cleaned | 20 epochs on v7 corpus. 3× ppl improvement over v7.
v9 | 57.15 | 94 k cleaned | Fresh 2 M-triple slice; 97 % semantic-predicate share on Q42 propgen.
v10 | 55.52 | 94 k cleaned | First fully-automated cron cycle. 100 % semantic-predicate share — cleanest run yet.
v11 | 279.12 | 350 k from v11-50k | First model on the no-Loka preprocessing pipeline. 3 of 20 epochs (CUDA OOM at batch 32 on the laptop's 8 GB VRAM); future runs use batch 16. Different operating regime — not directly comparable to v10.
v12 | 250.82 | 672 k from v12-100k | Epoch-6 snapshot (training disrupted by shared-GPU LLaMA contention; epoch 4 hit 226.86). Per-epoch tags v12.1–v12.7 also available.
v13 | training | 2.5 M from v13-500k | 10 epochs, batch 16, exclusive GPU. Per-epoch tags v13.1–v13.10 via tools/epoch_snapshot_pusher.py.
v14 | 202.01 | 4,021,409 from v14-1M | Series best. Epoch-4 ppl 202.01 is canonical; per-epoch tags v14.1–v14.7 on HF, continuing to v14.10. Corpus-scale lever confirmed: v11 350 k → 279 vs v14 4 M → 202. The fresh-Adam continuation (ep6 206, ep7 210) is not beating epoch 4 — ~202 is the from-scratch floor on this corpus/hardware; a clean 10-epoch contributor run is the path lower.

From v11 onward, the training corpus is its own Hugging Face dataset: EmmaLeonhart/normalized-wikidata — clean text-form Wikidata triples, four scale tiers from 50 k to 1 M entity rows, published as a standalone artifact (CC-BY-SA 4.0). Pull either the corpus, the model, or both.

from huggingface_hub import hf_hub_download

# Pinned model checkpoint (v12)
ckpt = hf_hub_download(
    repo_id="EmmaLeonhart/loka", repo_type="dataset",
    filename="checkpoints/wikidata_v12.pt", revision="v12",
)
tok = hf_hub_download(
    repo_id="EmmaLeonhart/loka", repo_type="dataset",
    filename="corpus/tokenizer_bpe.json", revision="v12",
)

# Or just the corpus, standalone
corpus = hf_hub_download(
    repo_id="EmmaLeonhart/normalized-wikidata", repo_type="dataset",
    filename="triples_normalized.txt", revision="v13-500k",
)

Run the full loop

# 1. Engine (optional — v11+ training doesn't need Loka serve)
cargo build --release -p loka-cli

# 2. Pull the latest model + tokenizer (auto-downloads on first inference)
python training/loader.py

# 3. Generative-citation inference
python training/infer_with_citations.py \
    --bpe-tokenizer training/data/tokenizer_bpe.json \
    --max-subjects 50 \
    --confidence 0.4 \
    --repetition-penalty 3.0
# add --post to write predictions back into a running Loka store

Read more

The engine keeps the Loka name in code (loka-core, loka-hnsw, etc.), and the project as a whole — engine + corpus + transformer + inference layer — carries it too.