Computational linguistics studies how computers can process and model natural language. It includes tasks such as parsing, tagging, machine translation, information extraction, and language generation. This section introduces core concepts and practical techniques for building language-aware systems.
Suggested reading: Speech and Language Processing (Jurafsky & Martin).
These ideas were first published in early and mid‑2016 (see the Jazenga posts linked from related pages). At that time the NLP and localisation landscape was in transition: phrase‑based statistical machine translation and translation memory workflows were widely used, while neural machine translation was rapidly emerging but not yet ubiquitous. Semantic web formats (RDF/SKOS/OWL) were established technically but adoption for multilingual application design was inconsistent.
The approaches described on this site emphasised practical, low‑infrastructure solutions — using language‑independent concept identifiers, non‑linguistic parameters for disambiguation, and small embedded translation memory files — which fit well with offline or low‑resource deployment constraints common in 2016. These ideas were intended as pragmatic alternatives and complements to large server‑side TMS/MT systems of the time.
Relevant posts: Polylingual Ontologies (Feb 2016) and Embedded Translation Memory (Aug 2016).
Tokenization splits raw text into meaningful units (tokens). Preprocessing steps include lowercasing, normalization, stemming or lemmatization, and removal of stopwords. Proper preprocessing improves downstream model performance and evaluation.
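The preprocessing steps above can be sketched with the standard library alone; this is a deliberately crude illustration (the regex tokenizer and the tiny stopword list are assumptions for the example), and real systems typically use libraries such as NLTK or spaCy.

```python
import re

# Tiny illustrative stopword list (real lists are much longer)
STOPWORDS = {'the', 'a', 'an', 'is', 'and', 'of'}

def preprocess(text):
    # Lowercase, then split on runs of letters/digits (a crude tokenizer)
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Drop stopwords
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess("The parser is a core component of the pipeline."))
# -> ['parser', 'core', 'component', 'pipeline']
```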
Language models estimate probabilities over sequences of words or tokens. Approaches range from n-gram models to modern neural models (RNNs, Transformers). These models are central to tasks like text completion, translation, and classification.
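As a concrete illustration of the n-gram end of that spectrum, the sketch below builds a toy bigram model with maximum-likelihood estimates, P(w | w_prev) = count(w_prev, w) / count(w_prev). The corpus is an invented example; real models need smoothing for unseen bigrams.

```python
from collections import Counter

# Toy corpus, pre-tokenized (illustrative assumption)
corpus = "the cat sat on the mat . the cat ran .".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_bigram(w_prev, w):
    # Maximum-likelihood estimate: count(w_prev, w) / count(w_prev)
    return bigrams[(w_prev, w)] / unigrams[w_prev]

print(p_bigram('the', 'cat'))  # 2 of the 3 occurrences of 'the' precede 'cat' -> 0.666...
```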
Embedded translation memory (TM) integrates translation units directly within a software project or application, enabling runtime or build-time lookup of canonical phrases, context-aware matches, and preservation of terminology. This approach supports multilingual UIs, offline translation, and consistent terminology across platforms.
Inspired by Multilingual Software Design — Embedded Translation Memory (Jazenga).
When first published in 2016, embedded TM and small JSON-based TM workflows offered a practical way to add multilingual support for offline and low-resource apps. At that time most industry workflows relied on server-side TMS and TMX exchanges; embedding a compact TM was a pragmatic, developer-friendly alternative.
import json

# Load the translation memory (small example)
with open('tm-example.json', 'r', encoding='utf-8') as f:
    TM = json.load(f)

def lookup(key, lang='en'):
    entry = TM.get(key)
    if not entry:
        return key  # fall back to the key itself
    return entry.get(lang) or entry.get('default') or key

# Usage
print(lookup('greeting.hello', 'fr'))  # -> 'Bonjour'
See tm-example.json in this folder for a minimal TM example.
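For readers without the file to hand, the sketch below shows a TM structure consistent with the lookup logic above (keyed entries with per-language strings and a 'default' fallback). The keys and languages here are illustrative assumptions; the actual tm-example.json may differ.

```python
# Illustrative TM structure matching the lookup() logic shown earlier
tm = {
    "greeting.hello": {"default": "Hello", "en": "Hello", "fr": "Bonjour", "de": "Hallo"},
    "action.save":    {"default": "Save",  "fr": "Enregistrer"},
}

def lookup(tm, key, lang='en'):
    entry = tm.get(key)
    if not entry:
        return key  # unknown key: fall back to the key itself
    return entry.get(lang) or entry.get('default') or key

print(lookup(tm, 'greeting.hello', 'de'))  # -> 'Hallo'
print(lookup(tm, 'action.save', 'de'))     # no German entry, falls back -> 'Save'
```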
Polylingual ontologies provide a language-independent conceptual backbone that connects equivalent terms and concepts across languages using non-linguistic parameters such as stable identifiers, taxonomic position, properties and relations. This enables robust cross-lingual mapping, search, and data integration without depending solely on string translations.
This page is inspired by and summarises ideas from Polylingual Ontologies — Jazenga (2016).
At the time of publication in 2016, statistical MT and TM workflows were widespread and neural MT was just becoming mainstream. Semantic web standards were available but adoption varied; the approach here emphasised practical non‑linguistic parameters and embedded TM to address ambiguity and support low‑infrastructure deployments.
Practical examples include mapping domain vocabularies to a shared upper ontology or using BabelNet/DBpedia identifiers as interlingual anchors.
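A minimal sketch of that idea: a language-independent concept identifier (the DBpedia-style anchors below are hypothetical examples) carries non-linguistic parameters such as taxonomic position alongside per-language labels, so translation becomes concept-to-label lookup rather than string-to-string mapping.

```python
# Hypothetical interlingual anchors with non-linguistic parameters and labels
concepts = {
    "dbpedia:Bank_(finance)": {
        "broader": "dbpedia:Financial_institution",  # taxonomic position
        "labels": {"en": "bank", "fr": "banque", "de": "Bank"},
    },
    "dbpedia:Bank_(geography)": {
        "broader": "dbpedia:Landform",
        "labels": {"en": "bank", "fr": "rive", "de": "Ufer"},
    },
}

def translate(concept_id, lang):
    # The identifier, not the surface string, disambiguates the sense,
    # so the ambiguous English 'bank' resolves correctly in each case.
    return concepts[concept_id]["labels"][lang]

print(translate("dbpedia:Bank_(finance)", "fr"))    # -> 'banque'
print(translate("dbpedia:Bank_(geography)", "fr"))  # -> 'rive'
```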
Useful resources and libraries: see also Stanford NLP.
Jazenga Educational was created to share these ideas and resources, and we continue to develop and add content.
Professional Summary:
Peter Dunne is a multifaceted engineer, inventor, linguist, programmer, and scientist with a passion for creating advanced technologies...
Expertise:
Mission at Jazenga:
Peter is committed to fostering a community of learning and innovation through Jazenga...
Contact:
Email 📧 peterjazenga@gmail.com
WhatsApp 💬 +44 7932 676 847
Signal 🔐 +44 7932 676 847
Copyright © Peter Ivan Dunne, all rights reserved