% Formal Grammars and Languages

\newcommand\emptystring{\varepsilon}
\newcommand\snoc{\smalltriright}
\newcommand\concat{\curl}
\newcommand\AB{\mathfrak{S}}
\newcommand\lettera{\mathcal{A}}
\newcommand\letterb{\mathcal{B}}
\newcommand\letterc{\mathcal{C}}

### The free monoid over an alphabet ###

As we said, strings are built up out of a collection of primitive "letters" or "words." The standard term in formal discussions for that collection of primitives is an **alphabet**, and the set of all strings that can be built up from a given alphabet $\Sigma$ is written $\Sigma^*$.
On the page about [Algebraic structures](algebra.htm), I used the symbol $\AB$ to refer to the set $\set{\lettera,\letterb}^*$.

The superscript in $\Sigma^*$ is related to some other superscript notations in use. $\Sigma^n$ is the set of strings whose length is exactly $n$. So if $\Sigma = \set{\lettera,\letterb}$, then $\Sigma^0 = \set{\emptystring}$, $\Sigma^1 = \set{\emptystring\snoc\lettera, \emptystring\snoc\letterb}$, $\Sigma^2 = \set{\emptystring\snoc\lettera\snoc\lettera, \emptystring\snoc\lettera\snoc\letterb, \emptystring\snoc\letterb\snoc\lettera, \emptystring\snoc\letterb\snoc\letterb}$, and so on. Then we can define:

> $\Sigma^* = \Union_{n\in\N} \Sigma^n$

Sometimes people talk about the **free semigroup** $\Sigma^+$. This is just $\Sigma^* - \set{\emptystring}$, that is, $\Sigma^*$ without the empty string.

We've now seen a variety of superscript notations. There's (a) $\Gamma^3$ to mean $\Gamma\times\Gamma\times\Gamma$, that is, the set of 3-tuples all of whose elements are from $\Gamma$. Then there's (b) $f^3$ to mean $f\circ f\circ f$. Then there's (c) $f^{-1}$ to mean the inverse of $f$, when it exists. Now finally there's (d) $\Sigma^3$ to mean the set of all strings over the alphabet $\Sigma$ whose length is exactly three. You just have to pay attention to track which of these is meant in a given context. (There are also still further common superscript notations; but I'm suppressing them.) Epp suggests on p. 781 that usage (d) is really the same as usage (a), that is, that strings of length 3 are just 3-tuples from the set $\Sigma$. That involves conflating strings and tuples in the way I discouraged at the start of term.
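If it helps to see the $\Sigma^n$ construction concretely, here's a rough Python sketch. (The names `alphabet` and `sigma_n` are mine, and I'm using ordinary Python strings to stand in for our strings.)

```python
# A rough sketch of the Sigma^n construction, using ordinary Python strings
# for our strings and the one-character strings "a", "b" as the letters.
alphabet = ["a", "b"]

def sigma_n(n):
    """Return the set of all strings over `alphabet` whose length is exactly n."""
    level = {""}                       # Sigma^0 contains just the empty string
    for _ in range(n):
        # build the next level by snoc-ing each letter onto each string so far
        level = {s + letter for s in level for letter in alphabet}
    return level

print(sigma_n(0))    # {''}
print(sigma_n(2))    # {'aa', 'ab', 'ba', 'bb'} (in some order)

# Sigma^* is the union of all the Sigma^n.  We can't list all of it, but we
# can list an initial fragment, say every string of length at most 3:
fragment = set().union(*(sigma_n(n) for n in range(4)))
print(sorted(fragment, key=lambda s: (len(s), s)))
```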
Note that $\Sigma^*$ consists only of *finite* strings, though they may be arbitrarily long: there is no upper bound on their (finite) lengths. The kind of sequence that we defined strings to be has no infinitely long instances. One *can* sensibly define other notions of sequence that do have infinitely long instances. (Consider a function $f\colon \N \to \Sigma$, where $f(0)$ is the start of the sequence, $f(1)$ comes next, and so on.) But $\Sigma^*$ only contains finite sequences.

$\Sigma^*$ is formally known as the **free monoid** generated by the alphabet $\Sigma$. People talk that way --- and sometimes I will too --- but really the proper thing to say would be that it's not the *set* $\Sigma^*$ that's the monoid, but instead $\tuple{\Sigma^*, \concat, \emptystring}$. The meaning of "generated" here is similar to the meaning I explained in the [advanced notes on groups](algebra2.htm#generate). The meaning of "free monoid" is that every other monoid with the same (or isomorphic) set of generators will be a homomorphic image of $\Sigma^*$. That is, they will either be isomorphic to $\Sigma^*$ or they'll be monoids that "simplify" or "partly model" $\Sigma^*$ by discarding some of its structure. (There, I just used $\Sigma^*$ three times to refer to a monoid, when officially I defined it as just being the *set* that is that monoid's universe.)
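A quick way to convince yourself that this triple behaves like a monoid: Python's own strings, with `+` as concatenation and `""` as the empty string, satisfy the monoid laws. Here are some spot-checks (an illustration, not a proof):

```python
# Python strings under + behave like <Sigma*, concat, emptystring>:
# + is associative and "" is a two-sided identity.  Spot-checks:
s, t, u = "ab", "cba", "b"

assert (s + t) + u == s + (t + u)    # concatenation is associative
assert "" + s == s and s + "" == s   # the empty string is an identity
```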
Here's something that might help several ideas you've been looking at recently click together. (We discussed this in class on 17 Sept.) Remember the length function we defined on strings? If our strings are built from alphabet $\Sigma$, then that length function will be a homomorphism from the monoid $\tuple{\Sigma^*,\concat,\emptystring}$ to the monoid $\tuple{\N,+,0}$.

*Huh?* Sure, think it through. The length function takes us from elements of the monoid of strings (that is, particular strings) to elements of the monoid of numbers (particular numbers). It does so in a way that preserves the structure imposed on the strings by the $\concat$ operation, with its identity $\emptystring$. So the length of $\emptystring$ is $0$, and the length of $S\concat T$ will be $\operatorname{length}S + \operatorname{length}T$. In general that will only be a *homo*morphism, not an *iso*morphism, because when $\Sigma$ has more than one letter, distinct strings will be mapped onto the same length. So the homomorphism is many-one, that is, not injective.
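Here's the same point as spot-checks in Python, with the built-in `len` playing the role of the length function:

```python
# len is a homomorphism from <Sigma*, concat, emptystring> to <N, +, 0>:
# it sends the identity to the identity and concatenation to addition.
s, t = "abb", "ca"

assert len("") == 0                   # identity goes to identity
assert len(s + t) == len(s) + len(t)  # structure is preserved

# It isn't an isomorphism: distinct strings can share a length,
# so the mapping is many-one (not injective).
assert "ab" != "ba" and len("ab") == len("ba")
```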
At the end of class on Sept 17, I mentioned two ways in which the kinds of strings we've been working with so far might leave us unsatisfied.

The first issue is that typically we won't want to be working with *all* the strings in the free monoid $\Sigma^*$. For example, if $\Sigma$ is an alphabet of English words, we might want to work with only that subset of $\Sigma^*$ which consists of grammatical English sentences. Theorists use the expression **formal language** as follows: they say that any subset of $\Sigma^*$ counts as a formal language over the alphabet $\Sigma$. This is not the ordinary folk notion of a language. It doesn't say anything about meanings or even anything about syntax. It's just a set of strings. Another example: perhaps I want to work with that subset of $\set{\lettera,\letterb}^*$ which consists of strings that begin and end with the same letter. That set would be a different formal language.

HOMEWORK EXERCISE: (44) Is the formal language $\set{\emptystring}$ the same or different from the formal language $\emptyset$? Defend your answer.

The second issue is that the strings we're working with are *flat*. They all have exactly the same structure: either they're the empty string, or they have an initial string and a final letter. If I concatenate $\lettera\letterb$ and $\letterc\letterb\lettera$, I get just the same result as if I concatenate $\lettera\letterb\letterc$ and $\letterb\lettera$. For some purposes, that's convenient. For other purposes, it's not. We'd rather keep track of more internal structure in the string. Just look at all the complex structure in the following string:

> $[John, [[whose~[blue~car]]~[was~[in~[the~garage]]]], [walked~[to~[the~grocery~store]]]]$

I don't mean the brackets to be part of the string; they're just signs to help you see the structure. We especially need to keep track of more structure when strings are syntactically ambiguous. For example, consider the string $They~fed~her~rat~poison.$ What this means depends on whether we understand it as grouped like this:

> $They~fed~[her~rat]~[poison]$

or instead like this:

> $They~fed~[her]~[rat~poison]$

How can we keep track of different string structures of this sort?

### Formal grammars ###

There is a unified, but complex, solution to both of our issues. The centerpiece of this solution is the notion of a **formal grammar**. In general terms, this will be a mathematically precise description of how larger expressions can be built out of smaller ones. More specifically, it will be a set of (possibly, and usually, recursive) rules for generating strings from a given alphabet. The rules I first presented you with for how to build strings were an example. But typically formal grammars will be more specialized, so that they don't end up generating *all* the strings in the free monoid for that alphabet. They should end up generating just the strings in the formal language we're interested in (all and only those strings).

I spoke of grammars as *generating* a set of strings; it's also possible to think of the grammars as providing patterns that *match* strings in that set (and fail to match other strings). It'll become clearer why both of these perspectives are available as we proceed.

Let's have an example. First let me define, in English, a set of strings built using the alphabet $\set{\lettera, \letterb, \letterc}$. Let a $\mathrm{center}$ be defined so as to include: (i) $\emptystring$; (ii) the string $\letterb$ concatenated with a $\mathrm{center}$; and (iii) nothing else.
Let a $\mathrm{surround}$ be defined so as to include (iv) the string $\lettera$ concatenated with a $\mathrm{center}$ concatenated with the string $\letterc\letterc$; and (v) nothing else. In other words, $\mathrm{surround}$s are all the strings with an $\lettera$ followed by 0 or more $\letterb$s followed by $\letterc\letterc$. This set of strings would sometimes be written as $\set{\lettera\letterb^n\letterc\letterc \where n \ge 0}$.

Here's what a formal grammar generating that set of strings will look like:

> $\mathrm{center} \longrightarrow \emptystring$
> $\mathrm{center} \longrightarrow \letterb \: \mathrm{center}$
> $\mathbf{surround} \longrightarrow \lettera \: \mathrm{center} \: \letterc\letterc$

What do we have here? There are three rules, whose left-hand sides and right-hand sides are separated by a $\longrightarrow$ symbol. There is a lot of variation in what symbol is used here. Gamut use $\Rightarrow$. I've also seen $\mathbin{::=}$, $\longleftarrow$, and plain $=$. These are just notational choices. Their role is just to separate the left-hand sides of the rules from the right-hand sides. These rules are typically called the grammar's **rewrite rules** or **production rules**.

On the right-hand side of each rule, we see a sequence of symbols. Some of these are just literal strings over our alphabet. These are called **terminal symbols**. The rest are called **nonterminal symbols**, or **syntactic variables**, or sometimes **syntactic categories**. They play a role akin to my schematic Greek letters. They are "holes" in a template that can be filled in with strings. Which strings? Any string that is ultimately generated by or matches the syntactic variable in question. Each of these syntactic variables has to have one or more defining rules in the grammar: that is, rules where that symbol is on the left-hand side. Those rules will specify (perhaps in a complex recursive way) which strings count as generated by or match that symbol. For example, in the above grammar, the symbol $\mathrm{center}$ counts as matching any string consisting of 0 or more $\letterb$s. Any such string can fill in the hole marked by the symbol $\mathrm{center}$ on the right-hand side of any rule.

A few more details:

* As you probably already assumed, the sequencing of symbols on the right-hand side means that the corresponding strings should be concatenated.
* Also, it's implicit in these grammars that nothing counts as a $\mathrm{center}$ or a $\mathrm{surround}$ except what the listed rules imply.
* One of the syntactic variables always gets designated as the primary, or topmost, or "starting" symbol. I've indicated that here by making $\mathbf{surround}$ bold.
* The top-to-bottom order of the rules has no significance. When a symbol appears on the left-hand side of more than one rule, that means that all of the provided right-hand sides are legitimate ways for that symbol to be expanded or unpacked.
* If a syntactic variable appears multiple times on a given rhs, it is *not* required that it be unpacked into the same string in each position.
* If there are any strings that can be generated in more than one way by a given grammar, those strings count as "ambiguous" on that grammar. We'll come back to this later.

Here are some more examples.
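First, though, here's a rough Python sketch of how the center/surround grammar above can be read as a recognizer, with one small matching function per syntactic variable. (The function names are my own labels, not part of the grammar notation; using the shorthand patterns introduced later, Python's `re.fullmatch(r"ab*cc", s)` would do the same job.)

```python
# A sketch of the center/surround grammar as a recognizer.

def match_center(s):
    """center -> emptystring | 'b' center.
    Strip a maximal run of b's from the front of s and return what's left.
    (Being greedy is safe here, since the only thing that can follow a
    center in this grammar is 'cc', which contains no b's.)"""
    if s.startswith("b"):
        return match_center(s[1:])    # rule: a 'b' followed by a center
    return s                          # rule: the empty center

def match_surround(s):
    """surround -> 'a' center 'cc'."""
    if not s.startswith("a"):
        return False
    rest = match_center(s[1:])        # fill the center "hole"
    return rest == "cc"

assert match_surround("acc")          # zero b's
assert match_surround("abbbcc")       # three b's
assert not match_surround("abbc")     # only one trailing c
assert not match_surround("bacc")     # doesn't start with 'a'
```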
> $\mathbf{pre} \longrightarrow \lettera \: \mathrm{pre}$
> $\mathbf{pre} \longrightarrow \mathrm{post}$
> $\mathrm{post} \longrightarrow \letterb \: \mathrm{post}$
> $\mathrm{post} \longrightarrow \emptystring$

The strings this grammar generates or matches consist of 0 or more $\lettera$s followed by 0 or more $\letterb$s. Sometimes this set of strings is written as $\set{\lettera^m \letterb^n \where m \ge 0, n \ge 0}$.

Here is a slightly but importantly different set of rules:

> $\mathbf{early} \longrightarrow \lettera \: \mathrm{early} \: \letterb$
> $\mathbf{early} \longrightarrow \mathrm{late}$
> $\mathrm{late} \longrightarrow \emptystring$

Before I explain this grammar, see if you can figure out for yourself what the difference is between the set of strings it generates or matches, and those generated or matched by the preceding grammar.

The answer is that this grammar generates the set of strings $\set{\lettera^m \letterb^m \where m \ge 0}$: that is, strings that consist of 0 or more $\lettera$s followed by *the same number* of $\letterb$s.

These two grammars, and the two languages they generate, are importantly different. The simplest grammars are called **regular**. These have to have a form where every rhs consists entirely of terminal symbols, optionally followed by a single syntactic variable. (Alternatively, you could have the syntactic variables coming at the beginning; but whichever style you choose has to be applied consistently across the whole grammar. If the syntactic variables sometimes come at the start of a rhs, and other times at the end, then the grammar is not regular.) Our grammar for the $\lettera^m\letterb^n$ language (where the syntactic variables were named $\mathrm{pre}$ and $\mathrm{post}$) meets these criteria. (The rhs of the second rule can be rewritten as $\emptystring \: \mathrm{post}$, thus satisfying the letter of the requirement that the syntactic variable *follow* some terminals.) But our grammar for the $\lettera^m\letterb^m$ language does not meet them.

Now, it can happen that two different grammars end up generating all the same strings. So the mapping from grammars to the formal languages they determine can be many-one. This is important because grammars can have different theoretically interesting properties; so you can't directly read those properties off of a formal language. Potentially, that language might be generated by some grammars that have the property in question, and also by some grammars that lack it. So in particular, it could be, for all we've seen so far, that though our $\mathrm{early}$/$\mathrm{late}$ grammar is not regular, there is some *other* grammar that *is* regular and that generates or matches exactly the same language, the set of strings $\set{\lettera^m \letterb^m \where m \ge 0}$. However, an important result in this field is that this is not so. No regular grammar can generate or match exactly that set of strings. Hence that set is called a **non-regular language**.

Our $\mathrm{early}$/$\mathrm{late}$ grammar does match those strings, though; and grammars of that form are called **context-free grammars** (sometimes abbreviated CFG). The "context" here doesn't have anything to do with what's needed to fix the reference of English expressions like "now" and "that dog." It means the rules for each syntactic variable can be given in a way that ignores their surrounding *linguistic* context or neighborhood, that is, what symbols come before or after them. We can just say how the symbol $\mathrm{late}$ should be expanded or unpacked, period. We don't have to say: it should be expanded this way when it comes after symbol so-and-so, but unpacked this other way when it comes after symbol such-and-such.
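To get a feel for the difference between the two languages, here is a rough sketch of recognizers for them in Python (the helper names are mine). Checking membership in the $\lettera^m\letterb^n$ language only requires scanning left to right and remembering which "phase" you're in, which is the kind of bookkeeping a regular grammar can express; checking membership in the $\lettera^m\letterb^m$ language additionally requires comparing two unbounded counts.

```python
# Sketches of recognizers for the two languages (helper names are mine).

def is_pre_post(s):
    """a^m b^n, with m, n >= 0: scan left to right; once we've seen a 'b',
    we may not see another 'a'.  No counting is needed."""
    seen_b = False
    for ch in s:
        if ch == "a":
            if seen_b:
                return False          # an 'a' after a 'b': reject
        elif ch == "b":
            seen_b = True
        else:
            return False              # some letter other than 'a' or 'b'
    return True

def is_early_late(s):
    """a^m b^m, with m >= 0: the same shape as above, except the number of
    a's must equal the number of b's, which means keeping an unbounded count."""
    return is_pre_post(s) and s.count("a") == s.count("b")

assert is_pre_post("aabbb") and is_pre_post("") and not is_pre_post("ba")
assert is_early_late("aabb") and not is_early_late("aabbb")
```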
The rhs of the rules in a context-free grammar can contain any sequence of terminal symbols and syntactic variables. As you might guess, for some sets of strings even this will not be adequate. There are some sets of strings that can't be generated even by a context-free grammar. Here are some examples:

> $\set{\lettera^m\letterb^m\letterc^m \where m \ge 0}$
> $\set{s\concat s \where s\in \set{\lettera,\letterb}^*}$

There are also more powerful categories of grammars; two well-known ones are called **context-sensitive** and **type 0** grammars. These categories come from Chomsky. There are also various finer subdivisions of the categories we've described so far, and some kinds of grammars (for example, **PEG** grammars) that cross-cut some of the categories laid out here. You can read up about the details on your own if you're interested. Lucas Champollion in linguistics may teach a class on these issues someday soon.

### Patterns ###

There are some notational shortcuts or conveniences that one often sees in formal grammars. Recall our first grammar:

> $\mathrm{center} \longrightarrow \emptystring$
> $\mathrm{center} \longrightarrow \letterb \: \mathrm{center}$
> $\mathbf{surround} \longrightarrow \lettera \: \mathrm{center} \: \letterc\letterc$

One shortcut would be to write the two rules for $\mathrm{center}$ on a single line, like this:

> $\mathrm{center} \longrightarrow \emptystring \mid \letterb \: \mathrm{center}$

The $|$ divider is part of the grammar's notation. It doesn't mean the literal string or letter $|$, which may or may not be part of our alphabet. If we wanted to refer to that literal letter, we'd have to use some other convention; for instance we might write:

> $\mathbf{fence} \longrightarrow \emptystring \mid~``|\latin{''} \: \mathrm{fence}$

or:

> $\mathbf{fence} \longrightarrow \emptystring \mid \pmb{\color{blue}|} \: \mathrm{fence}$

to say that $\mathrm{fence}$s are strings of 0 or more sequential $|$s. But to keep things simple, I will suppose that none of the special characters in our grammar's notation also appear in our alphabet.

Another shorthand we might adopt is to use the symbol "$*$" to mean *0 or more of the preceding*. And we might add in parentheses as well, so that "the preceding" can mean a chunk of stuff. Using "$*$", our rule for $\mathrm{center}$ can be written as:

> $\mathrm{center} \longrightarrow \letterb{*}$

Sometimes the symbol "$+$" is used to mean *1 or more of the preceding*. (Compare the superscripts on $\Sigma^*$ and $\Sigma^+$.) Each of the three following rules would then be equivalent:

> $\mathrm{longcenter} \longrightarrow \letterb \mid \letterb \: \mathrm{longcenter}$
> $\mathrm{longcenter} \longrightarrow \letterb\letterb{*}$
> $\mathrm{longcenter} \longrightarrow \letterb{+}$

Similarly, sometimes the symbol "$?$" is used to mean *0 or 1 of the preceding*. And, as I mentioned, parentheses can sometimes be used to save us from having to write a separate rule. So the following grammar:

> $\mathrm{tail} \longrightarrow \lettera\letterb \: \mathrm{tail}$
> $\mathrm{tail} \longrightarrow \emptystring$
> $\mathbf{head} \longrightarrow \letterc \: \mathrm{tail}$

could equivalently be expressed as:

> $\mathbf{head} \longrightarrow \letterc (\lettera\letterb){*}$
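These shortcut symbols are, in essence, what Python's `re` module implements (in its own ASCII spelling). Here's a rough illustration, using the ordinary letters `a`, `b`, `c` in place of $\lettera$, $\letterb$, $\letterc$:

```python
import re

# The shortcut notation above, spelled the way Python's re module spells it.
# re.fullmatch returns a match object when the *whole* string fits the
# pattern, and None otherwise.

assert re.fullmatch(r"b*", "")             # center: 0 or more b's
assert re.fullmatch(r"b+", "bbb")          # longcenter: 1 or more b's
assert not re.fullmatch(r"b+", "")         # ...so the empty string doesn't count
assert re.fullmatch(r"ab?cc", "acc")       # ?: 0 or 1 of the preceding
assert re.fullmatch(r"c(ab)*", "cabab")    # head: a 'c', then 0 or more 'ab' chunks
assert not re.fullmatch(r"c(ab)*", "caba")
```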
With enough of these conventions, you may be able to express many grammars all in a single rule. We can call these "patterns." For some purposes, having everything as a single pattern may be more useful; for other purposes, having everything as a set of rules with few of these extra shorthands will be easier to theorize about.

Epp describes a pattern format starting on p. 783 of her selection. It does not essentially differ from the system with the "shortcuts" described above. (It's just that some of those shortcuts would be defined in terms of others.) Epp calls these "regular expression patterns," often abbreviated as **regexes** or **regexps**. The patterns Epp describes, introduced by the logician Kleene, are interdefinable with the notion of a regular grammar that we introduced above. Note that you should *only* assume this to be true for that specific variety of regex. Many contemporary programming languages or interfaces provide facilities they call "regular expressions" but which are more powerful than regular grammars. (The regex patterns in the Perl programming language are in fact Turing complete. Don't worry if you don't yet know what that means; we'll explain it later.)

HOMEWORK EXERCISES:

(45) Here is a formal grammar:

> $\mathbf{S} \longrightarrow \lettera \: \mathrm{S} \: \lettera$
> $\mathbf{S} \longrightarrow \letterb \: \mathrm{S} \: \letterb$
> $\mathbf{S} \longrightarrow \emptystring$

Describe in English what set of strings it generates. Is this grammar regular? If not, can you write a regular grammar that generates the same set of strings? (Maybe that's too mean.)

(46) Here is a pattern describing a set of strings: $(\lettera\mid \letterb){*}\lettera$. Describe those strings in English. Write a formal grammar that generates or matches those strings in the form we did at the beginning, with no shortcuts. Can you make it a regular grammar?

(47) Here is one pattern: $(\lettera\mid\emptystring){*}$. Here is another: $\lettera{*}$. Do these generate the same language? Defend your answer. If they don't, give examples of strings that belong to one of the languages but not the other.

(48) Here is one pattern: $(\lettera{*} \mid \letterb{*})$. Here is another: $(\lettera \mid \letterb){*}$. Do these generate the same language? Defend your answer. If they don't, give examples of strings that belong to one of the languages but not the other.

Let's take a look at pp. 25-26 of Sider. He lists some "primitive vocabulary" for the language of propositional logic (PL). This is an alphabet. In fact it contains infinitely many letters, including $P, P_1, P_2, \dots$. To make things a bit easier for us, I'll vary this and instead use $P, P', P'', \dots$.

Next Sider gives us a "definition of wffs" for PL. Wtf is a wff? Well, what we're doing with our grammars is we're taking the whole big unconstrained set of strings $\Sigma^*$ and we're selecting a proper subset of them as targets of interest. We can call those the *well-formed* strings or expressions. When we're doing logic, we're often interested especially in the expressions we call **formulas**. These are things whose syntactic role is like a sentence, but they are allowed to contain "free variables" like $x$.
(Another sort of expression is the **term**, whose syntactic role is like a name or pronoun.) When logicians are talking about well-formed formulas, they abbreviate that as **wff** (pronounced "whoof"). Here is Sider's definition:

> (i) Every sentence letter is a PL-wff.
> (ii) If $\phi$ and $\psi$ are PL-wffs, then $(\phi \horseshoe \psi)$ and $\neg\phi$ are also PL-wffs.
> (iii) Only strings that can be shown to be PL-wffs using (i) and (ii) are PL-wffs.

How would we write this using the apparatus we've been looking at? Here's one way:

> $\mathrm{sent} \longrightarrow P \mid Q \mid R$
> $\mathrm{sent} \longrightarrow \mathrm{sent} \pmb{\color{blue}\prime}$
> $\mathbf{wff} \longrightarrow \mathrm{sent} \mid \pmb{\color{blue}(} \mathrm{wff} \pmb{\color{blue}\horseshoe} \mathrm{wff} \pmb{\color{blue})} \mid \pmb{\color{blue}\neg} \mathrm{wff}$

I've made the $\prime$, $($, $\horseshoe$, $)$, and $\neg$ symbols bold and blue to signal that these are literal strings, not parts of our grammar notation. The alphabet we're using here only has a finite number of letters. We include $P$, $Q$, and $R$ in the alphabet, and also the single letter $\prime$. $P$ counts as a $\mathrm{sent}$, but so too do $P'$, $P''$, and so on.

HOMEWORK EXERCISES:

(49) Write a formal grammar for simple arithmetic expressions involving literal numerals (denoting members of $\N$) and the operations $+, -, *, /$. Some things to think about:

* Will you simplify the task by using Polish (prefix) or Reverse Polish (postfix) notation? Or will you use standard infix notation for operators?
* Will you choose to allow numerals like $0012$? Or only $12$? Does your grammar allow $0$?
* Will you choose to allow expressions like $5+3+2$? Or will you insist on $5+(3+2)$? Or $(5+(3+2))$?
* Will you choose to allow expressions like $5-3-2$? If so, will your grammar generate or match the $5-3$ as a unit that forms part of the larger expression with $-2$, rather than (incorrectly) matching the $3-2$ as a unit that forms part of the larger expression with $5-$?
* Will you choose to allow expressions like $5+3*2$? If so, will your grammar generate or match the $3*2$ as a unit that forms part of the larger expression with $5+$, rather than (incorrectly) matching the $5+3$ as a unit that forms part of the larger expression with $*2$?

(50) Some questions about "postfix" or Reverse Polish Notation (see Epp, p. 782). Translate the following into more familiar infix notation: (a) $w \: x * y \:z - -$; (b) $w \: x * y \:z * +$; (c) $w \: x * y \:z + *$; (d) $w \: x * y + z *$.
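Finally, to connect the grammar apparatus back to something you can run, here is a rough sketch of a recognizer for the PL-wff grammar given above, with one small function per syntactic variable. (The function names and the use of a plain apostrophe for the prime mark are my own choices; the left-recursive rule for $\mathrm{sent}$ is handled with a simple loop. This is just an illustration, not part of Sider's apparatus.)

```python
# A sketch of a recognizer for the PL-wff grammar above.  Each function tries
# to match its syntactic variable starting at position i of s, and returns the
# position just past what it matched, or None if nothing matches there.

def match_sent(s, i):
    """sent -> 'P' | 'Q' | 'R', then zero or more prime marks (the
    left-recursive rule sent -> sent ' is handled by the while loop)."""
    if i < len(s) and s[i] in "PQR":
        i += 1
        while i < len(s) and s[i] == "'":
            i += 1
        return i
    return None

def match_wff(s, i):
    """wff -> sent | '(' wff '⊃' wff ')' | '¬' wff."""
    j = match_sent(s, i)
    if j is not None:
        return j
    if i < len(s) and s[i] == "¬":
        return match_wff(s, i + 1)
    if i < len(s) and s[i] == "(":
        j = match_wff(s, i + 1)
        if j is not None and j < len(s) and s[j] == "⊃":
            k = match_wff(s, j + 1)
            if k is not None and k < len(s) and s[k] == ")":
                return k + 1
    return None

def is_wff(s):
    """A string is a wff iff the whole of it matches the wff category."""
    return match_wff(s, 0) == len(s)

assert is_wff("P''")
assert is_wff("(¬P⊃(Q'⊃R))")
assert not is_wff("(P⊃)")
assert not is_wff("P⊃Q")     # the outer parentheses are required
```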