When is something a (domain-specific) language?

Markus Voelter
Jun 18, 2022

Customers often ask me: what is a DSL? How is a language different from … and then they mention all kinds of other terms. This is a great topic for discussions at academic workshops or evenings over drinks (the latter probably more productive), so I could write pages and pages here. Let me try something succinct, and you tell me if it makes sense and where you disagree.

I will sort the ideas in order of increasing languaginess and start with glossaries.

Glossaries

When trying to figure out a domain, one usually starts with a glossary. This defines the most important terms used (and agreed upon!) in the domain and explains them in prose. It’s a list of words, with a few definitional sentences for each. It is very useful to make verbal and whiteboard conversations about the domain more structured and less ambiguous, for example, when you’re trying to understand the domain to build a DSL.

Obvious, but I’ll say it: a glossary is not a DSL!

Structured Glossary

When you want to be one step more precise, you’ll start defining the relationships between the glossary terms with a set of well-defined relationship types. These types are familiar to anyone with even a basic object-oriented background: contains, refers to and is-a. You can also write sentence fragments on the contains and refers-to arrows that characterize the relationship. Often it makes sense to represent the relationship network in some graphical, UML-ish notation.

I call something like this a structured glossary. It starts to look suspiciously like a (meta)model, but the crucial thing is that this thing is intended for human consumption, not for processing with a tool. Again: useful, but not a DSL.

The term ubiquitous language from the space of domain-driven design (DDD) is IMHO somewhere between the glossary and the structured glossary.

Domain models and metamodels

Now let’s move on to the term domain model. It is more or less similar to the structured glossary, but it is intended for tools (in the widest sense). This means that the semantics conveyed by the sentence fragments in the structured glossary must be moved into additional model constructs and/or program code. A domain model can be implemented as a bunch of classes in a programming language and then serve as the backbone for capturing data or building a UI.
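To make the "bunch of classes" idea concrete, here is a minimal sketch in Python. The domain (orders, customers, products) and all class and field names are hypothetical and chosen only for illustration; the point is how the three relationship types of the structured glossary (contains, refers to, is-a) turn into program constructs.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Party:
    """Base concept (hypothetical)."""
    name: str


@dataclass
class Customer(Party):
    """is-a: a Customer is a Party."""


@dataclass
class Product:
    name: str


@dataclass
class OrderLine:
    product: Product   # refers to: the line points at a Product it does not own
    quantity: int


@dataclass
class Order:
    customer: Customer                                     # refers to
    lines: List[OrderLine] = field(default_factory=list)   # contains


# Building an instance of the domain model.
alice = Customer(name="Alice")
widget = Product(name="Widget")
order = Order(customer=alice, lines=[OrderLine(product=widget, quantity=3)])
```

Whether a field is a containment or a reference is only visible in the comments here. A real modeling infrastructure would make that distinction explicit in the formalism.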

If it is intended to be used in a modeling infrastructure, such a formal domain model is usually called a metamodel. A metamodel can be implemented in many kinds of technical formalisms: an XML schema, a JSON schema, an EMF Ecore file, MPS structure definitions and yes, also Java classes, although in that case I usually call it a domain model rather than a metamodel.

Attention, pet peeve alarm: any model can be a metamodel. “Meta” characterizes a relationship (to another model) and is not an inherent property of the model (this is not true for some technical spaces, but I digress).

We are slowly moving into the domain of languages. A meta-model defines the abstract syntax of a language — aka, the kinds of things you have available for defining sentences (or models). A meta-model is not a DSL, but it is one of the ingredients of a DSL.

Even when using the meta-model standalone (without the other ingredients of languages) it serves as a well-defined “truth” regarding the structure of the domain. It can be the basis for the definition of data exchange formats. It’s definitely useful.

Validations

However, only the most trivial domains can be defined with structure only. You usually have to define validations on them. The simplest ones are cardinality constraints (which are kinda structural), but there are also validations that check name uniqueness, and rules of the kind:

if the X contains a Y,
then this A over there cannot have more than 2 children of type B.

All validations must hold for a model to be a valid instance of the metamodel. Any tool that writes or reads models must know and be able to process such validations.
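The two kinds of validation mentioned above can be sketched as plain functions over a model instance. This is a minimal illustration, not a real validation framework: the generic `Node` class and the function names are assumptions, and the concept names X, Y, A and B are taken from the rule above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A generic model element (hypothetical): a concept name, a name, children."""
    concept: str
    name: str
    children: List["Node"] = field(default_factory=list)


def check_unique_names(parent: Node) -> List[str]:
    """Name uniqueness: no two children of the same parent share a name."""
    seen = set()
    errors = []
    for child in parent.children:
        if child.name in seen:
            errors.append(f"duplicate name '{child.name}' in {parent.name}")
        seen.add(child.name)
    return errors


def check_conditional_limit(x: Node, a: Node) -> List[str]:
    """If the X contains a Y, then the A cannot have more than 2 children of type B."""
    has_y = any(c.concept == "Y" for c in x.children)
    b_count = sum(1 for c in a.children if c.concept == "B")
    if has_y and b_count > 2:
        return [f"{a.name} has {b_count} children of type B, at most 2 allowed"]
    return []


def validate(x: Node, a: Node) -> List[str]:
    """A model is valid only if every validation reports no error."""
    return check_unique_names(x) + check_unique_names(a) + check_conditional_limit(x, a)
```

Note that both checks only inspect structure and values that are already in the model; they compute nothing beyond a list of error messages.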

Some people call a meta-model with validations a language (or DSL, if it is specific to a domain). I don’t. But this is not a value judgement — this thing is useful, and as I have said, it is a part of a DSL. But I want to be clear about terms: so no, it’s not a DSL.

Syntax

The next ingredient of a language is the syntax. This is a topic that is a bit hard to grasp. Is XML a syntax? If you encode your data in an XML document that is compliant with your metamodel expressed in XML schema, are you using a syntax? Same question for JSON. Many people will argue yes. And technically they are probably right. However, that syntax is not metamodel-specific. It is meta-metamodel-specific. This means that everything defined in your metamodel is encoded the same way, based on rules defined for the meta-metamodel. To go back to XML: all XML elements you define with your schema are expressed in the well-known nested-angle-bracket-plus-attributes syntax.

I call such a meta-metamodel-defined syntax a serialization format. What makes a “real” syntax different is that you define a specific syntax for each of (or for groups of) your metamodel elements (aka metaclasses). It does not matter whether that syntax is textual, tabular, graphical, form-like or a mix of all of these, as we like to do with MPS.

Just to drive home this difference, let’s assume we have a language that contains function calls with arguments and plus operators with two arguments. With a real (metamodel-specific) syntax, you would perhaps write these two as

As you can see, we use the familiar parens-and-comma syntax for function calls and an infix notation for plus; a syntax specific for each concept. If we were to use a meta-metamodel-specific syntax, every concept would use the same syntax; we use XML here:

You can see two things: first, each concept uses the same approach for encoding (angle bracket, concept name, properties as attributes, children as nested XML), with nothing specific (aka different) for PlusOp or FunCall. Second, this kind of syntax is useless for human consumption except for very simple configuration languages (even if you use a less verbose syntax than XML), such as this less noisy one:

I repeat: a meta-model with a serialization format is not a language. It’s just a metamodel with a serialization syntax. It’s useful, but distinct from a language.
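The difference between a metamodel-specific syntax and a serialization format can also be sketched in code. The following is a minimal illustration in Python, not the original example: the concept names FunCall and PlusOp come from the text, while `VarRef`, the field names and the two functions `to_text` and `to_xml` are assumptions made for this sketch. `to_text` has one case per concept; `to_xml` applies a single generic rule to every concept.

```python
from dataclasses import dataclass, fields, is_dataclass
from typing import List


@dataclass
class VarRef:
    name: str


@dataclass
class PlusOp:
    left: object
    right: object


@dataclass
class FunCall:
    function: str
    args: List[object]


def to_text(node) -> str:
    """Metamodel-specific syntax: each concept gets its own notation."""
    if isinstance(node, VarRef):
        return node.name
    if isinstance(node, PlusOp):
        return f"{to_text(node.left)} + {to_text(node.right)}"     # infix
    if isinstance(node, FunCall):
        args = ", ".join(to_text(a) for a in node.args)
        return f"{node.function}({args})"                          # parens and commas
    raise TypeError(f"unknown concept: {type(node).__name__}")


def to_xml(node) -> str:
    """Serialization format: the same rule for every concept.

    Properties become attributes, children become nested elements.
    (No escaping and no child role names; this is only a sketch.)
    """
    attrs, children = [], []
    for f in fields(node):
        value = getattr(node, f.name)
        if is_dataclass(value):
            children.append(to_xml(value))
        elif isinstance(value, list):
            children.extend(to_xml(v) for v in value)
        else:
            attrs.append(f' {f.name}="{value}"')
    tag = type(node).__name__
    if not children:
        return f"<{tag}{''.join(attrs)}/>"
    return f"<{tag}{''.join(attrs)}>{''.join(children)}</{tag}>"


expr = FunCall("f", [VarRef("a"), PlusOp(VarRef("b"), VarRef("c"))])
print(to_text(expr))  # f(a, b + c)
print(to_xml(expr))
# <FunCall function="f"><VarRef name="a"/><PlusOp><VarRef name="b"/><VarRef name="c"/></PlusOp></FunCall>
```

Adding a new concept to the metamodel requires a new case in `to_text` but no change at all in `to_xml`, which is exactly why the latter is a property of the meta-metamodel and not of the language.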

Type Systems

There are two more things we need to discuss: type systems and semantics (bear with me for a more precise definition of the terms). Type systems first.

Is a bunch of validation rules a type system? Some people argue yes. And again, maybe they are technically right. For me, validation rules are just validation rules. But how is a type system different? My criterion for distinction is this: as soon as your validations are so complicated that you compute additional data structures (i.e., types) for your model elements and then perform computations on these data structures, you have a type system. Validations usually don’t do this; they just inspect the structure and values in the model.
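Here is a minimal sketch of that criterion in Python. The concepts `NumLit` and `BoolLit`, the type names "int", "real" and "bool", and the widening rule are all assumptions made for illustration. What matters is that `type_of` computes a new piece of data (a type) for every node and that the check for PlusOp is a computation on those types, not an inspection of the model itself.

```python
from dataclasses import dataclass


@dataclass
class NumLit:
    value: float  # may hold an int or a float


@dataclass
class BoolLit:
    value: bool


@dataclass
class PlusOp:
    left: object
    right: object


# Numeric types ordered from narrower to wider (hypothetical rule).
NUMERIC_RANK = {"int": 0, "real": 1}


class TypeCheckError(Exception):
    pass


def type_of(node) -> str:
    """Compute the type of a node from the types of its children."""
    if isinstance(node, BoolLit):
        return "bool"
    if isinstance(node, NumLit):
        return "int" if isinstance(node.value, int) else "real"
    if isinstance(node, PlusOp):
        left, right = type_of(node.left), type_of(node.right)
        if left not in NUMERIC_RANK or right not in NUMERIC_RANK:
            raise TypeCheckError(f"cannot add {left} and {right}")
        # Computation on types: the result is the wider of the two operand types.
        return max(left, right, key=NUMERIC_RANK.__getitem__)
    raise TypeCheckError(f"unknown concept: {type(node).__name__}")
```

For example, `type_of(PlusOp(NumLit(1), NumLit(2.5)))` yields "real", while adding a `BoolLit` to a `NumLit` raises a `TypeCheckError`.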

Are type systems needed for something to be a language? IMHO not. There are meaningful languages that don’t need a type system (and can make do with validations). But many (interesting, useful) languages do have a type system, because many (interesting, useful) languages require expressions. And expressions require type checking (unless you defer to runtime type checking, which, for reasons I don’t want to go into here, is at odds with DSLs).

Semantics

Talking about runtime: the final ingredient of a language is the (formal) definition of semantics. While “formal” sounds like Greek letters and deduction and proofs, in practice this is typically achieved by transforming (generating, compiling) your language to some other language or formalism whose semantics is known, or by writing an interpreter. Note that the goal of the semantics is not necessarily execution; it might also be some form of sophisticated analysis (which is why type checking is a form of semantics …), but many DSLs indeed have execution semantics because you want to “run” the program (again, by generation or interpretation).
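The two routes to execution semantics, interpretation and generation, can be sketched for a tiny expression language. This is a minimal illustration in Python; the concepts `NumLit` and `VarRef` and the function names are assumptions, and Python itself plays the role of the target language whose semantics is known.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class NumLit:
    value: int


@dataclass
class VarRef:
    name: str


@dataclass
class PlusOp:
    left: object
    right: object


def interpret(node, env: Dict[str, int]) -> int:
    """Interpreter: walk the model and compute the result directly."""
    if isinstance(node, NumLit):
        return node.value
    if isinstance(node, VarRef):
        return env[node.name]
    if isinstance(node, PlusOp):
        return interpret(node.left, env) + interpret(node.right, env)
    raise TypeError(f"unknown concept: {type(node).__name__}")


def generate(node) -> str:
    """Generator: translate the model to Python source text."""
    if isinstance(node, NumLit):
        return repr(node.value)
    if isinstance(node, VarRef):
        return node.name
    if isinstance(node, PlusOp):
        return f"({generate(node.left)} + {generate(node.right)})"
    raise TypeError(f"unknown concept: {type(node).__name__}")


expr = PlusOp(VarRef("x"), PlusOp(NumLit(2), NumLit(3)))
print(interpret(expr, {"x": 10}))          # 15
print(generate(expr))                      # (x + (2 + 3))
print(eval(generate(expr), {}, {"x": 10})) # 15
```

Both routes must agree on the result for the same program and inputs; the last two lines run the generated code with Python's built-in `eval` to show that they do for this example.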

Wrap up

So where does this leave us? When do we have a DSL?

  • Glossary? No.
  • Structured Glossary? No.
  • Metamodel? No.
  • Metamodel + Validations? No.
  • Metamodel + Validations + Metamodel-specific Syntax? Yes!
  • Metamodel + Type System + Metamodel-specific Syntax + Execution Semantics? Double yes!

There’s a bit of a caveat to my nice, incrementally building story: what about metamodel + type system + execution semantics, but no metamodel-specific syntax? Many people will argue that this is a language; they consider a formal semantics (plus the necessary metamodel) the main ingredient. And indeed, many languages, especially those used internally in tools, e.g., as intermediate representations in compilers, don’t really need a “real” syntax because no human ever writes them; they are just used by tools. I understand the perspective. However, I avoid the term language for those. I call them models, intermediate formats, whatever.

So where does this leave us? Essentially, I think a metamodel-specific syntax is the deciding ingredient that makes something a language, because it makes the thing useful for human consumption. And that’s the core thing: a language is a formalism that can be written, read and understood by humans and computers, not only by computers or only by humans.

Markus Voelter

Software (language) engineer, science & engineering podcaster, cross-country glider pilot. On Medium mostly for the software stuff.