Computational Linguistics — Overview

Computational linguistics studies how computers can process and model natural language. It includes tasks such as parsing, tagging, machine translation, information extraction, and language generation. This section introduces core concepts and practical techniques for building language-aware systems.

Suggested reading: Speech and Language Processing (Jurafsky & Martin).

Framing in 2016

These ideas were first published in early and mid‑2016 (see the Jazenga posts linked from related pages). At that time the NLP and localisation landscape was in transition: phrase‑based statistical machine translation and translation memory workflows were widely used, while neural machine translation was rapidly emerging but not yet ubiquitous. Semantic web formats (RDF/SKOS/OWL) were established technically but adoption for multilingual application design was inconsistent.

The approaches described on this site emphasised practical, low‑infrastructure solutions — using language‑independent concept identifiers, non‑linguistic parameters for disambiguation, and small embedded translation memory files — which fit well with offline or low‑resource deployment constraints common in 2016. These ideas were intended as pragmatic alternatives and complements to large server‑side TMS/MT systems of the time.

Relevant posts: Polylingual Ontologies (Feb 2016) and Embedded Translation Memory (Aug 2016).

Tokenization & Preprocessing

Tokenization splits raw text into meaningful units (tokens). Preprocessing steps include lowercasing, normalization, stemming or lemmatization, and removal of stopwords. Proper preprocessing improves downstream model performance and evaluation.

  • Whitespace/token regex — simplest approach for many languages
  • Subword tokenization — BPE / SentencePiece for open vocab
  • Normalization — Unicode, punctuation, diacritics
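The preprocessing steps above can be sketched in a few lines of Python. This is a minimal illustration of regex tokenization plus Unicode normalization, lowercasing, and diacritic stripping — not a production tokenizer, and the regex is a deliberate simplification:

```python
import re
import unicodedata

def normalize(text):
    # Lowercase, decompose to NFD, strip combining marks (diacritics),
    # then recompose to NFC.
    text = unicodedata.normalize('NFD', text.lower())
    text = ''.join(c for c in text if unicodedata.category(c) != 'Mn')
    return unicodedata.normalize('NFC', text)

def tokenize(text):
    # Simple regex tokenizer: runs of word characters, or single
    # non-space punctuation marks.
    return re.findall(r"\w+|[^\w\s]", normalize(text))

print(tokenize("Café crème, s'il vous plaît!"))
# -> ['cafe', 'creme', ',', 's', "'", 'il', 'vous', 'plait', '!']
```

Note that stripping diacritics is language-dependent: it helps for fuzzy matching in some languages but destroys meaning in others, so treat it as an optional step.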

More: Tokenization examples (Python)

Language Models

Language models estimate probabilities over sequences of words or tokens. Approaches range from n-gram models to modern neural models (RNNs, Transformers). These models are central to tasks like text completion, translation, and classification.

  • N‑gram models: simple, fast, interpretable
  • Neural LM: RNNs, LSTMs, Transformers (BERT/GPT family)
  • Evaluation: perplexity, BLEU, ROUGE, accuracy
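As a minimal sketch of the n-gram approach, the following trains an add-one (Laplace) smoothed bigram model on a toy corpus and scores sentences by perplexity. The corpus and vocabulary are purely illustrative:

```python
import math
from collections import Counter

def train_bigram(corpus):
    # Count unigrams and bigrams over tokenized sentences,
    # with sentence-boundary markers.
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        tokens = ['<s>'] + sent + ['</s>']
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def perplexity(sent, unigrams, bigrams, vocab_size):
    # Add-one smoothed bigram probabilities, combined into perplexity:
    # exp(-mean log P(w_i | w_{i-1})).
    tokens = ['<s>'] + sent + ['</s>']
    log_prob = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / (len(tokens) - 1))

corpus = [['the', 'cat', 'sat'], ['the', 'dog', 'sat']]
uni, bi = train_bigram(corpus)
V = len(uni)
print(perplexity(['the', 'cat', 'sat'], uni, bi, V))
```

A sentence seen in training receives lower perplexity than an unseen one — the basic property that evaluation by perplexity relies on.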

Hugging Face — models & datasets

Embedded Translation Memory — Overview

Embedded translation memory (TM) integrates translation units directly within a software project or application, enabling runtime or build-time lookup of canonical phrases, context-aware matches, and preservation of terminology. This approach supports multilingual UIs, offline translation, and consistent terminology across platforms.

Inspired by Multilingual Software Design — Embedded Translation Memory (Jazenga).

Framing (2016)

When first published in 2016, embedded TM and small JSON-based TM workflows offered a practical way to add multilingual support for offline and low-resource apps. At that time most industry workflows relied on server-side TMS and TMX exchanges; embedding a compact TM was a pragmatic, developer-friendly alternative.

Benefits
  • Consistent translations across the app via reusable segments.
  • Faster localisation cycles (reuse previously-translated segments).
  • Context-aware fallbacks and pluralisation handling.
  • Ability to ship with a default TM for initial offline translations.
Practical Implementation
  1. Extract translatable strings with context keys (msg id, domain, context) during development.
  2. Store TM entries as JSON/TMX/RDF with canonical IDs, language tags, and optional context metadata.
  3. Include a small runtime lookup library to resolve keys to target language strings with placeholder substitution.
  4. Provide tooling to merge new translations into the TM and to export/import TMX when needed.
  5. Include provenance metadata and versioning for TM entries so updates are traceable.
Example: Simple JSON TM lookup (Python)
import json

# Load the embedded TM shipped with the app (small example)
with open('tm-example.json', 'r', encoding='utf-8') as f:
    TM = json.load(f)

def lookup(key, lang='en', **params):
    """Resolve a context key to a target-language string, with optional
    placeholder substitution and graceful fallback to the key itself."""
    entry = TM.get(key)
    if not entry:
        return key  # no TM entry: fall back to the key
    text = entry.get(lang) or entry.get('default') or key
    return text.format(**params) if params else text

# Usage
print(lookup('greeting.hello', 'fr'))  # -> 'Bonjour'

See tm-example.json in this folder for a minimal TM example.

Formats & Tools
  • TMX — translation memory exchange (standard for TM interop).
  • JSON — lightweight, great for embedded lookups in apps.
  • Tools: OmegaT, Okapi, custom scripts for extraction/merge.
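The TMX export mentioned above can be sketched as a small conversion from the JSON-style TM dict to a minimal TMX 1.4 document. This is only a sketch: real interoperability needs richer header metadata, and the header values below are placeholders:

```python
import xml.etree.ElementTree as ET

def tm_to_tmx(tm, src_lang="en"):
    """Convert a simple {key: {lang: text}} TM dict into a minimal
    TMX 1.4 document (a sketch, not a full TMX implementation)."""
    root = ET.Element("tmx", version="1.4")
    # Required TMX header attributes, filled with placeholder values.
    ET.SubElement(root, "header", {
        "creationtool": "embedded-tm", "creationtoolversion": "0.1",
        "segtype": "phrase", "o-tmf": "json", "adminlang": "en",
        "srclang": src_lang, "datatype": "plaintext",
    })
    body = ET.SubElement(root, "body")
    for key, langs in tm.items():
        # One translation unit per TM key, one variant per language.
        tu = ET.SubElement(body, "tu", tuid=key)
        for lang, text in langs.items():
            tuv = ET.SubElement(tu, "tuv", {"xml:lang": lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(root, encoding="unicode")

tm = {"greeting.hello": {"en": "Hello", "fr": "Bonjour"}}
print(tm_to_tmx(tm))
```

Using the TM key as the `tuid` preserves the canonical ID across export and re-import, which keeps provenance traceable.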
Polylingual Ontologies — Overview

Polylingual ontologies provide a language-independent conceptual backbone that connects equivalent terms and concepts across languages using non-linguistic parameters such as stable identifiers, taxonomic position, properties and relations. This enables robust cross-lingual mapping, search, and data integration without depending solely on string translations.

This page is inspired by and summarises ideas from Polylingual Ontologies — Jazenga (2016).

Framing (2016)

At the time of publication in 2016, statistical MT and TM workflows were widespread and neural MT was just becoming mainstream. Semantic web standards were available but adoption varied; the approach here emphasised practical non‑linguistic parameters and embedded TM to address ambiguity and support low‑infrastructure deployments.

Why use them?
  • Disambiguate meaning by mapping words to language-independent concepts (URIs) rather than strings.
  • Support multilingual applications: search, machine translation, cross-lingual knowledge graphs.
  • Preserve provenance and cultural or domain-specific nuances via structured metadata.
Key Concepts & Approaches
  • Canonical concept identifiers: assign URIs or stable IDs to concepts (independent of language).
  • Non-linguistic parameters: taxonomic position, properties, instances, and formal definitions used to link lexicalisations across languages.
  • Ontology alignment: map existing ontologies or lexical resources together using equivalence and subsumption relations.
  • SKOS / OWL / RDF: use semantic web formats for representation and interoperation.
Practical Steps to Build or Use Polylingual Ontologies
  1. Define stable concepts with URIs and canonical labels or language-neutral glosses.
  2. Add multilingual lexicalisations: labels, altLabels and language tags for each concept.
  3. Capture language-agnostic features (properties, relations, example instances) useful for alignment.
  4. Align or merge ontologies using automated tools, then refine manually for ambiguous cases.
  5. Publish as RDF/SKOS/OWL and provide provenance metadata and versioning information.
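The steps above can be sketched without any RDF library by serialising a small concept record — a stable ID, multilingual labels, and a language-agnostic broader relation — to SKOS Turtle by hand. The namespace, concept IDs, and labels are hypothetical, and a real project would use an RDF library rather than string building:

```python
# Concept records: stable language-neutral IDs plus multilingual
# lexicalisations and a taxonomic feature usable for alignment.
CONCEPTS = {
    "C0017": {
        "prefLabel": {"en": "cat", "fr": "chat", "de": "Katze"},
        "broader": "C0001",  # e.g. the 'mammal' concept
    },
}

def to_turtle(concepts, base="http://example.org/concepts/"):
    # Emit a minimal SKOS Turtle serialisation of the concept records.
    lines = ["@prefix skos: <http://www.w3.org/2004/02/skos/core#> ."]
    for cid, c in concepts.items():
        lines.append(f"<{base}{cid}> a skos:Concept ;")
        for lang, label in c["prefLabel"].items():
            lines.append(f'    skos:prefLabel "{label}"@{lang} ;')
        lines.append(f'    skos:broader <{base}{c["broader"]}> .')
    return "\n".join(lines)

print(to_turtle(CONCEPTS))
```

Because every label hangs off the same URI, adding a new language is a data change (one more `skos:prefLabel`), not a schema change.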
Tools & Resources
  • BabelNet — multilingual semantic network combining WordNet and Wikipedia.
  • SKOS — simple knowledge organization systems (good for thesauri).
  • OWL / RDF — for formal ontologies.
  • Ontology alignment tools — e.g., AgreementMaker, LogMap, AML.

Practical examples include mapping domain vocabularies to a shared upper ontology or using BabelNet/DBpedia identifiers as interlingual anchors.

Examples & Use Cases
  • Multilingual search — index by concept IDs to return results across languages.
  • Data integration across international datasets where labels differ but concepts match.
  • Translation memory and terminology management with precise sense mapping.
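The multilingual-search use case above can be sketched as a concept-indexed lookup: documents are indexed by concept ID, and a query term in any language is first resolved to its concept. The lexicon, concept IDs, and document IDs below are invented for illustration:

```python
# Surface form (any language) -> language-independent concept ID.
LEXICON = {
    "cat": "C0017", "chat": "C0017", "katze": "C0017",
    "dog": "C0023", "chien": "C0023",
}

# Concept ID -> documents mentioning that concept, in any language.
INDEX = {
    "C0017": ["doc-en-1", "doc-fr-4"],
    "C0023": ["doc-de-2"],
}

def search(term):
    # Resolve the term to a concept, then return all matching documents
    # regardless of the language they are written in.
    concept = LEXICON.get(term.lower())
    return INDEX.get(concept, []) if concept else []

print(search("chat"))  # French query
print(search("cat"))   # English query returns the same documents
```

The key property is that `search("chat")` and `search("cat")` return identical results, because both resolve to the same concept ID.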
Resources & Tools

A short list of useful resources and libraries:

See also: Stanford NLP

Jazenga Educational

Jazenga Educational was created to:

We continue to develop and add content.

About Peter Dunne - Owner/Creator of Jazenga

Professional Summary:

Peter Dunne is a multifaceted engineer, inventor, linguist, programmer, and scientist with a passion for creating advanced technologies...

Expertise:

Mission at Jazenga:

Peter is committed to fostering a community of learning and innovation through Jazenga...

Contact: Email 📧 peterjazenga@gmail.com
WhatsApp 💬 +44 7932 676 847   |   Signal 🔐 +44 7932 676 847

Copyright © Peter Ivan Dunne, all rights reserved