Phil 455: Strings

What we’re going to call “strings” are also sometimes called (finite) sequences, or lists, or arrays, or various other names.

In the background we need to have an understanding of what the elements being sequenced are. These can be anything: characters of the Latin alphabet, punctuation, words of the English language, sentences in Spanish, numbers, people, planets. Whatever you like. I’m just going to use characters of the Latin alphabet. Sometimes formal discussions of this are explicit about what the pool of elements is. In those cases they call the set of possible elements the “alphabet.” This is a natural label when the elements are characters like a, b, and c. It’s less natural when the elements are words, or numbers, or planets. In our discussion we’ll just take the pool of elements as understood. We’ll only talk about them informally. After these opening paragraphs, our formalism won’t refer to the elements themselves. Instead, we’ll only be referring to strings/sequences that contain them.

If you’ve got a sequence of numbers, you might think of it and write it (in some formalism, but not ours) like this: [3, 1, 4, 1]. Note that the same number can appear multiple times in the sequence. Also these sequences can be as long as you like, but they have to have some finite length. It’s possible to think about sequences that are (always or sometimes) infinite. But that involves some differences; and it won’t be needed in this course. When I talk about the “length” of a sequence, I mean the number of elements it has. So the “length” of the sequence [3, 1, 4, 1] would be 4. If a sequence uses words as elements, like [𝘵𝘩𝘦, 𝘤𝘢𝘵, 𝘴𝘭𝘦𝘱𝘵], the length would be 3. You could define different notions which counted things other than the number of elements. Perhaps you’d want to talk about the length′ of the first list, which instead would give 3+1+4+1 = 9. Or the length″ of the second list, which instead would give 3+3+5 = 11. But the way I’m understanding “length,” these sequences have length 4 and 3.
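To make the difference between these notions concrete, here is a small Python sketch. Modeling the sequences as Python lists, and the names `length_prime` and `length_double_prime`, are assumptions made for illustration; they aren't part of our formalism.

```python
# Sequences modeled as Python lists (an illustrative assumption).
numbers = [3, 1, 4, 1]
words = ["the", "cat", "slept"]

# "Length" in the sense used here: the number of elements.
length_numbers = len(numbers)
length_words = len(words)

# The alternative notions: add up the numbers, or count the characters in the words.
length_prime = sum(numbers)
length_double_prime = sum(len(word) for word in words)

print(length_numbers, length_words)        # 4 3
print(length_prime, length_double_prime)   # 9 11
```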

Suppose your pool of elements contained the word 𝘤𝘢𝘵 and the word 𝘭𝘪𝘬𝘦 and also the word 𝘤𝘢𝘵𝘭𝘪𝘬𝘦. We’re going to be talking about the “concatenation” of strings/sequences. That’s an operation that for example takes the sequences [𝘤𝘢𝘵] and [𝘭𝘪𝘬𝘦] as arguments, and returns [𝘤𝘢𝘵, 𝘭𝘪𝘬𝘦]. We want to be sure that we don’t confuse [𝘤𝘢𝘵, 𝘭𝘪𝘬𝘦], which has a length of 2, with the sequence [𝘤𝘢𝘵𝘭𝘪𝘬𝘦], which has the length of 1. In the notation we’re considering right now, with brackets [ ] and commas, that confusion might not arise. But it’s common in discussions of strings/sequences to use a different notation instead. (Which is what I used in class, and we’ll be using below.) This just sequences everything together inside quotation marks. When our elements are just characters, this is simple and natural. So one string (of length 1) is "d", another is "o", a third is "g", and their concatenation is the (length 3) string "dog". If we used that notation when our elements aren’t just characters, though, things could get confusing. If "𝘤𝘢𝘵" were an atomic string, built out of only one element, and "𝘭𝘪𝘬𝘦" were another atomic string, their concatenation shouldn’t be conflated with "𝘤𝘢𝘵𝘭𝘪𝘬𝘦" which we said was yet another atomic string. We might instead represent their concatenation as "𝘤𝘢𝘵 𝘭𝘪𝘬𝘦", with a separating space. But then if some of our potential elements could also include spaces, we’d have to deal with that.
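A short Python sketch of the confusion just described, assuming we model word-level elements as items of a Python list. The bracket-and-comma notation keeps the two sequences apart; running the elements together inside quotes does not.

```python
# Word-level elements modeled as Python list items (an illustrative assumption).
cat_like = ["cat"] + ["like"]   # concatenation of two length-1 sequences
catlike = ["catlike"]           # a sequence with a single element

print(len(cat_like), len(catlike))   # 2 1
print(cat_like == catlike)           # False

# Running everything together inside quotes loses the distinction.
print("".join(cat_like) == "".join(catlike))   # True
```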

These are issues you’d need to be aware of and work around in some contexts, but for our purposes in this course we can make assumptions that avoid them. As I said, we’ll only be working with strings that are built out of characters, and we won’t be formally referring to the characters themselves, but only to strings that contain them.

Sometimes we’ll write a string literally, as when I wrote the strings "d" and "dog". When I write a string literally, I might sometimes include spaces in it for readability. But they’ll just be for readability. If I write "d o g", or "d og", that should be understood as the same string as "dog".

If you need to have special characters in a literal string, we could have a convention where you “escape” those characters. Say we want to have a string which is the concatenation of "do" and then a quote mark, and then the string "g". So it should be a string of length 4, where the third element is the quote mark. You can’t write that like this: "do"g", because we’d understand the second " as closing the string that starts with the first ", and then who knows how we should interpret the trailing g" afterwards. Instead, you’d have to represent the string you want like this: "do\"g". The backslash in \" indicates that what comes next shouldn’t have its regular interpretation (ending the literal string), but instead be interpreted as just a raw character. If we do this, then we want to also allow ourselves to escape the backslash too, so that "do\\" could be interpreted as the string "do" followed by a raw backslash, and then the final quote will mark the end of the literal string. In this case, you don’t read the \" as meaning an escaped quote. The \ before the " has already been consumed by the \ before it.
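Python's string literals happen to follow this same convention for quotes and backslashes, so you can check the two examples there. This is only an illustration in a real language, not a definition of our notation.

```python
quote_inside = "do\"g"     # d, o, a raw quote mark, g
backslash_last = "do\\"    # d, o, a raw backslash

print(len(quote_inside), quote_inside[2])      # 4 "
print(len(backslash_last), backslash_last[2])  # 3 \
```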

In addition to escaping quotes and backslashes, it’s typical to allow some other characters to be escaped too. Since I’m saying that spaces in literal strings are just for readability, we might say that if you did want to include a space in a literal string, you’d do it using some escape code like "do\_g". That would be a length 4 string where the third element is the space character. (A different way to handle this would be to have some special constant space in the language, which was stipulated to refer to length 1 strings containing only the space character. Then you could write "do" ⁀ space ⁀ "g", where ⁀ is the symbol we’ll introduce below for concatenation. Here you can talk about and work with strings containing spaces; you just don’t have a way to include a space in the quoted string literals.) It’s standard in many programming languages to use \n in a string literal as an escape code for a line break. Later in our class, we’ll be giving dollar signs a special interpretation. So if you wanted to include a literal dollar sign in a string, you’d have to escape it, as in "do\$g". There the dollar sign wouldn’t have a special interpretation.

What if you write some escape code that hasn’t been given a special meaning? For example, suppose no special convention has been made about backslashes followed by periods. Then what should "do\.g" mean? Some programming languages would say, the default is that escaping anything gives you the character escaped, so that this is a length 4 string where the third element is a period. Others say that when backslash is followed by an arbitrary character that hasn’t been given a special meaning, then the backslash is interpreted as a raw character, so that this would instead be understood to be a length 5 string where the third element is a backslash and the fourth is a period. If we encounter this, I’ll instead go with a third strategy, which is to say that such notation is ill-formed or broken. It’d be like if you wrote "do and left off the closing quotation mark. It has no defined meaning.
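Here is a minimal Python sketch of the third strategy. The function name `unescape` and the table `ESCAPES` are hypothetical, invented for this illustration; the table lists only the escape codes mentioned above for quotes, backslashes, spaces, and dollar signs. The function takes the text written between the outer quotation marks and returns the string it denotes, or reports that the notation is ill-formed.

```python
# Hypothetical escape table for this sketch: maps the character after a
# backslash to the raw character it stands for.
ESCAPES = {'"': '"', "\\": "\\", "_": " ", "$": "$"}

def unescape(body):
    """Decode the text between the quotes of a literal; reject unknown escapes."""
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == "\\":
            # A backslash must be followed by a character with an assigned meaning.
            if i + 1 >= len(body) or body[i + 1] not in ESCAPES:
                raise ValueError("ill-formed literal: unknown escape")
            out.append(ESCAPES[body[i + 1]])
            i += 2
        elif ch == " ":
            i += 1  # unescaped spaces are only there for readability
        else:
            out.append(ch)
            i += 1
    return "".join(out)

print(unescape(r"do\_g"))   # do g
print(unescape("d o g"))    # dog
```

Calling `unescape(r"do\.g")` raises a `ValueError`, which is this sketch's way of saying the literal has no defined meaning.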

To keep things simple for now, though, I’ll assume that things like spaces and quotes and dollar signs and backslashes and the like aren’t part of our pool of elements. So when you see "d o g", you know I have to mean a length 3 string (if I mean anything at all) because spaces won’t be part of the string anyway.

So we’ve seen literal strings, like "d" and "dog". Some of these we assume are made up out of only a single character. These are the ones I’ve been calling “length 1 strings”; though that’s just to help you informally fix your ideas. We will be defining the notion of length below, not taking it as basic. We can call these atomic strings or letters. (The label “letters” wouldn’t be so happy for some decisions about what the pool of elements should be.) We will also sometimes want to talk about the empty string, which is a sequence of no elements. We could write that like this: "" (or like this, for more readability: " "). Formal treatments of strings instead often use the symbol ɛ. That’s a lowercase Greek epsilon. That symbol is also used for the set-membership relation, which we’ll encounter later, though then it’s used in a different typeface, so it looks like ∈. Until you get used to these conventions, I think it’s confusing that epsilon is used in both of these ways. So I will instead say that we’ll talk about the empty string using a special constant empty. You can also write a literal "" if you want; that’s also correct. But I will use empty.

In addition to the string literals, and the special constant empty, we’ll also want to use variables to refer to strings. I won’t insist on any special nomenclature for these. You can use anything that looks like a variable: x and y, or s₁ and s₂, or s1 and s2, or firstString and second_string. I will usually use variables that end with an s, for example xs and ys. This is an informal convention in some programming languages for when you’re referring to a collection of things, like a sequence or set. But you don’t have to do the same.

Primitive relations and operations

So we have strings, which we can specify directly using literals or empty, or refer to using variables (which are variously described as “ranging over,” “designating,” or “being bound to” strings). What can we do with these strings? One primitive operation we’ll allow ourselves is to test whether two strings are the same. We’ll do that using the = symbol. That will allow us to say things like "dog" = empty (which is false). Or xs = "dog" (which may be true or false, depending on what string value the variable xs is bound to or assigned).

We’ll also allow ourselves to say more complex things using logical connectives like and, or, and not, as in: xs = "dog" and not (ys = empty). We can abbreviate that as xs = "dog" and ys ≠ empty.

Don’t confuse "dog" (which is a literal, length 3 string) and dog (which I would understand to be a variable, although one with a weird name). Also don’t try to use things like empty as variables.

Another primitive operation we’ll allow ourselves is concatenation, which I’ll symbolize with ⁀. The easiest thing for you to type is just to use the caret symbol ^ that appears over the 6 on US keyboards. In different contexts concatenating is symbolized differently. Off the top of my head, some programming languages use + or ++ or @. But we’ll use ⁀.

With concatenation, if you take "d" and concatenate it with "og", you get "dog". That is, "d" ⁀ "og" = "dog". Given what I said about spaces and readability, you can also say "d" ⁀ "og" = "d og".

You can concatenate the empty string onto things, but it won’t make a difference. That is, empty ⁀ "dog" = "dog" and "dog" ⁀ empty = "dog".

Some terminology

The concatenation operation is associative. We say that an operation ⊛ is associative when for any arguments x, y, and z: (x ⊛ y) ⊛ z = x ⊛ (y ⊛ z). Since concatenation is associative, we don’t need to worry about parentheses. Should "d" ⁀ "o" ⁀ "g" be interpreted as ("d" ⁀ "o") ⁀ "g" or as "d" ⁀ ("o" ⁀ "g")? It doesn’t matter. The result is always the same.

The concatenation operation is not commutative. We say that an operation ⊛ is commutative when for any arguments x and y: x ⊛ y = y ⊛ x. It’s clear that this doesn’t hold for concatenation: "d" ⁀ "og" ≠ "og" ⁀ "d". There can be particular arguments where the order doesn’t matter. For example, if xs was bound to the string value "d", then "d" ⁀ xs and xs ⁀ "d" would have the same value, namely "dd". But since order sometimes matters for concatenation, we say that this operation is not commutative.

Addition is both associative and commutative.

Subtraction is neither. 5 - (3 - 2) is not the same as (5 - 3) - 2, so it’s not associative. And 5 - 3 is not the same as 3 - 5, so it’s not commutative.
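These claims can be spot-checked with a short Python sketch. The helper names `is_associative` and `is_commutative` are made up for this illustration, and Python's `+` on strings stands in for ⁀. Keep in mind that testing a finite list of samples can refute associativity or commutativity, but a passing check doesn't prove that the property holds for all arguments.

```python
from itertools import product
from operator import add, sub

def is_associative(op, samples):
    """Check (x op y) op z == x op (y op z) for every triple drawn from samples."""
    return all(op(op(x, y), z) == op(x, op(y, z))
               for x, y, z in product(samples, repeat=3))

def is_commutative(op, samples):
    """Check x op y == y op x for every pair drawn from samples."""
    return all(op(x, y) == op(y, x) for x, y in product(samples, repeat=2))

def concat(xs, ys):
    return xs + ys   # Python's + on strings plays the role of concatenation

strings = ["", "d", "og"]
numbers = [2, 3, 5]

print(is_associative(concat, strings), is_commutative(concat, strings))  # True False
print(is_associative(add, numbers), is_commutative(add, numbers))        # True True
print(is_associative(sub, numbers), is_commutative(sub, numbers))        # False False
```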

With a string like "abc", we say that "a" and "ab" are prefixes of the string. What about the empty string, or "abc" itself? These are edge cases, and you could make a reasonable argument for saying various things about them. But the standard practice is to count these as also being prefixes of "abc".

Similarly, we say that "c" and "bc" (and also the empty string and "abc" itself) are suffixes of "abc".
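As a quick illustration, here is a Python sketch that lists all the prefixes and suffixes of a string under the standard practice just described, with the empty string and the whole string included. The function names are assumptions of this sketch, and it treats each Python character as one element.

```python
def prefixes(s):
    """All prefixes of s, from the empty string up to s itself."""
    return [s[:i] for i in range(len(s) + 1)]

def suffixes(s):
    """All suffixes of s, from s itself down to the empty string."""
    return [s[i:] for i in range(len(s) + 1)]

print(prefixes("abc"))   # ['', 'a', 'ab', 'abc']
print(suffixes("abc"))   # ['abc', 'bc', 'c', '']
```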

Dissecting

We talked about the notion of a pattern, as in our earlier definition of factorial:

factorial =def {
    if the argument is 0, then 1;
    if the argument is k+1, then (k+1)⋅factorial(k)
}

The part where we say “if the argument is …”, and try to match the argument, uses a pattern. Sometimes the pattern is just a literal value, as in the first clause, where the pattern is just 0. Sometimes the pattern combines a variable and some extra structure, as in the second clause, where the pattern is k+1. k is a variable and +1 is what I’m calling “extra structure.” Notice that then we go on to use k in what comes later in that clause. We’re supposing that k will get bound or assigned to the relevant matching part of the argument that the pattern k+1 matches.

Using patterns in this way is implicit in lots of mathematical and logical practice. Some programming languages (like OCaml, Haskell, Scheme, and recent versions of Python) are more explicit about it. It may be confusing at first, but in the long run, I think it’ll help us to be more explicit about it from the beginning.

For clarity, when we’re giving a pattern, I will preface it with the Greek lowercase lambda (λ). (This should also start helping you think about λ in the way it’s explicitly used in “lambda calculi”, though often in those systems the patterns are restricted to fewer options than we’ll allow here.) The λ preface will help us distinguish between literal values when they’re being used in patterns and when they’re being used in ordinary expressions. For example, using patterns the definition of factorial would be written like this:

factorial =def {
    λ 0. 1;
    λ k+1. (k+1)⋅factorial(k)
}

In the first clause, the 0 is part of the pattern, but the 1 is not. It’s being used as an ordinary expression. The period . marks where the pattern ends and the result (which is an ordinary expression) begins.

Some extra bits we added were that patterns may also have “guards” on them which are written as if ... with some sentence filling in the .... The factorial definition could also be written like this:

factorial =def {
    λ k if k = 0. 1;
    λ k+1. (k+1)⋅factorial(k)
}

Another extra bit is that sometimes we need to ensure that some clauses have higher precedence or priority than others. We said we’d indicate that by using exclamation marks (!). If the pattern part of a clause ends with a period (.), it has 0 exclamation marks. But it could instead end with ! or !!. The more exclamation marks, the higher the priority. It’s fine if two clauses have the same priority, so long as their patterns and guards never match the same arguments. (Actually, it’s really only a problem if they match the same argument and then try to give different results.)

Instead of writing the factorial definition like above, it could also be written like this:

factorial (n) =def dissect n { λ 0. 1; λ k+1. (k+1)⋅factorial(k) }

Sometimes this long-hand version will be more useful (for instance, if the function takes multiple arguments, but we only want to dissect some of them). But often we’ll use the shorter version.

If a dissection is (or might be) attempted on an argument, but no clause has a pattern and guard expression that match the argument, the dissection is ill-formed or broken. We’ll usually want to make some assumptions about what the possible arguments to the dissection will be. For instance, we’re assuming that the factorial function is only defined on natural numbers. Not on negative numbers, or real numbers like π, or on strings, or whatnot. Sometimes we’ll be explicit about these assumptions. Other times we’ll leave them implicit.

Now about dissecting strings. One thing you’ll often want to do is to match the empty string. That can be done using a pattern like λ "" or λ empty. Remember that empty is not a variable! It’s a special constant, like 0 is.

If we have a pattern like λ xs, consisting of just a variable, that can match any string, including the empty string. You can restrict which argument you want the clause to match by adding a guard expression, like this: λ xs if xs = "abc". That would just match the string "abc". That could also be written like this, putting the literal string directly into the pattern: λ "abc". In that case, you’d no longer need a guard expression.

If you have a pattern like λ "abc" ⁀ xs (or λ ys ⁀ xs if ys = "abc"), that will match any string that has "abc" as a prefix, and then the variable xs will be bound to whatever comes after "abc", which might only be the empty string.

What if you have a pattern with multiple variables in it, like λ ys ⁀ xs? Then any way of breaking the argument into two strings, where the variable ys gets bound to or assigned the first, and the variable xs gets bound to or assigned the second, counts. It may be that one (or even both!) of these strings is empty. Typically you will want to put further restrictions in a guard clause. (That’s what we did above when we had λ ys ⁀ xs if ys = "abc".)

Remember that our dissections are broken if no clause (together with its guard expression, if any) matches an argument. It’s also broken if there’s more than one clause (of the same priority) that matches an argument, and gives different results. Another way for them to be broken is if a pattern has multiple variables in it, like λ ys ⁀ xs, and depending on how the argument gets broken up, you’d get different results. So for example, a clause like λ ys ⁀ xs. ys would usually be a bad idea.
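To see why that last clause is a bad idea, it helps to list every way a string can be broken in two. The Python sketch below does that; `splits` and `results_of_first_part` are hypothetical helper names, and each Python character is treated as one element.

```python
def splits(s):
    """All ways of breaking s into a pair (ys, xs) whose concatenation is s."""
    return [(s[:i], s[i:]) for i in range(len(s) + 1)]

def results_of_first_part(s):
    """Every result that a clause returning ys could give for the argument s."""
    return {ys for ys, xs in splits(s)}

print(splits("ab"))                       # [('', 'ab'), ('a', 'b'), ('ab', '')]
print(len(results_of_first_part("ab")))   # 3
print(results_of_first_part(""))          # {''}
```

For the argument "ab" the clause could give three different results, so the dissection is broken. Only for the empty string is there a single way to break the argument up, which is why such a clause is "usually" rather than always a problem.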

Let’s look at some examples.

One notion it will be really useful to have defined is that of being an atomic string (intuitively, a string of length 1, though we’re not taking this notion of “length” as a primitive). One way to describe these is as strings formed out of exactly one element. But I’m not going to provide the resources in our formalism to refer to the string elements (characters of the Latin alphabet). We’re only going to talk about the strings they make up. Instead, we’ll define the notion of an atomic string like this. I’ll call them “letters”:

letter =def {
    λ empty! false;
    λ ys ⁀ xs if ys ≠ empty and xs ≠ empty! false;
    λ xs. true
}

What’s going on here? We’re defining a function whose arguments are strings and whose results are truth-values. (Functions whose results are truth-values are standardly called — or equated with — predicates.) If the function letter gets as argument an empty string, or a string which can be dissected into two non-empty strings, then the argument is counted as not being a letter. Otherwise the argument does have to be a letter. Notice that we put priority marks on the first two clauses. Otherwise they would conflict with the third clause, which matches everything. We didn’t have to give the first two clauses different priorities, because no argument could match both clauses. (The empty string could match the pattern λ ys ⁀ xs at the start of the second clause, but then the clause’s guard expression would exclude the match.) If you wanted to give one of the first two clauses higher priority than the other, that’s harmless.
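Here is one way the definition might be mirrored in Python, as a sketch only. It assumes each Python character is one element, uses a hypothetical helper `splits` to enumerate the ways of breaking up the argument, and models the priority marks by testing the two high-priority clauses before falling through to the catch-all.

```python
def splits(s):
    """All ways of breaking s into a pair (ys, xs) whose concatenation is s."""
    return [(s[:i], s[i:]) for i in range(len(s) + 1)]

def letter(s):
    # First clause: the empty string is not a letter.
    if s == "":
        return False
    # Second clause: some split into two non-empty strings exists.
    if any(ys != "" and xs != "" for ys, xs in splits(s)):
        return False
    # Third clause: everything else is a letter.
    return True

print(letter(""), letter("d"), letter("dog"))   # False True False
```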

We’re not counting the empty string as a “letter.” But sometimes we’ll want to talk about arguments being either a letter or the empty string. I’ll call such strings “short.” One way to define a predicate for these strings would be like this:

short =def { λ empty! true; λ ys ⁀ xs if ys ≠ empty and xs ≠ empty! false; λ xs. true }

Here we just changed the result of the first clause of the definition of letter. But you could also define this predicate like this:

short (xs) =def letter(xs) or xs = empty

You don’t always need to use the “dissect” apparatus in a definition.
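A Python sketch can spot-check that the two definitions of short agree. The names `short_dissect` and `short_direct` are invented here to keep the two versions apart, `splits` is a hypothetical helper, and each Python character is treated as one element. Agreement on a handful of samples is evidence, not proof.

```python
def splits(s):
    """All ways of breaking s into a pair (ys, xs) whose concatenation is s."""
    return [(s[:i], s[i:]) for i in range(len(s) + 1)]

def has_two_nonempty_parts(s):
    return any(ys != "" and xs != "" for ys, xs in splits(s))

def letter(s):
    return s != "" and not has_two_nonempty_parts(s)

def short_dissect(s):
    """Mirrors the three-clause definition, higher-priority clauses first."""
    if s == "":
        return True
    if has_two_nonempty_parts(s):
        return False
    return True

def short_direct(s):
    """Mirrors the one-line definition in terms of letter."""
    return letter(s) or s == ""

samples = ["", "d", "do", "dog"]
agree = all(short_dissect(s) == short_direct(s) for s in samples)
print(agree)   # True
```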

Armed with a predicate letter, we can now define the notion of length that we’ve been using informally, like this:

length =def {
    λ empty. 0;
    λ ys ⁀ xs if letter(ys). 1 + length(xs)
}

Or like this:

length =def {
    λ empty. 0;
    λ ys ⁀ xs if letter(xs). length(ys) + 1
}

Either will work. Note that we don’t have to give the clauses different priorities because no argument could ever match the patterns and guards of both clauses.

Using the definition, length(empty) will directly give us 0. length("d") will match the second clause, and the result will be 1 + length(empty), which is 1+0, that is 1. If the argument is a string like "dog", its length will depend on how that string is understood. I’ve been understanding it as the concatenation of three atomic strings "d", "o", and "g". So then this definition will match the second clause, and say that its length is 1 + length("og"), which if we keep following the definition will turn out to be 3.

On the other hand, nothing says that our strings can only be made of Latin characters. Earlier we talked about a string 𝘵𝘩𝘦 𝘤𝘢𝘵 𝘴𝘭𝘦𝘱𝘵 which we understood to be the concatenation of three atomic strings 𝘵𝘩𝘦, 𝘤𝘢𝘵, and 𝘴𝘭𝘦𝘱𝘵, not the concatenation of "t", "h", and so on. The length of that string should then just be 3. But we’ll continue to work with strings whose elements are just single Latin characters.
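Both definitions of length can be sketched in Python. The names `length_first` and `length_last` are made up to keep the two versions apart. The sketch relies on Python slicing to pick out the one split in which the first (or last) part is a letter, and it works the same way on strings of characters and on tuples of words, which illustrates how the result depends on what the elements are taken to be.

```python
def length_first(s):
    """Peel a letter off the front: 1 + length of the rest."""
    if not s:                    # the empty sequence
        return 0
    ys, xs = s[:1], s[1:]        # the only split in which ys is a letter
    return 1 + length_first(xs)

def length_last(s):
    """Peel a letter off the end: length of the rest + 1."""
    if not s:
        return 0
    ys, xs = s[:-1], s[-1:]      # the only split in which xs is a letter
    return length_last(ys) + 1

print(length_first("dog"), length_last("dog"))     # 3 3
print(length_first(("the", "cat", "slept")))       # 3
```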

In class we next defined a predicate startsWith (hs, ps). This should be true iff the argument ps is (since ps is a variable, we mean this variable is bound to or assigned) some string that is a prefix of hs (that is, is a prefix of the string that hs is bound to or assigned). As we said earlier, we’ll count the empty string and a string itself as among that string’s prefixes. We can define this predicate like this:

startsWith (hs, ps) =def dissect hs {
    λ ys ⁀ xs if ys = ps! true;
    λ xs. false
}

A few things to note here. First, we only need to dissect the first argument of startsWith, so we use the longer form startsWith (hs, ps) =def dissect hs {...} rather than the shorthand startsWith =def {...}. That lets us say which argument is being dissected. Second, we need to give the first clause of the definition higher priority than the second, else the second will match all the cases that the first does (and more too, but that’s not the problem). Third, in the guard expression to the first clause, we use the variables ys and ps. The first of these gets assigned some value based on how the argument hs that’s being dissected can match the pattern λ ys ⁀ xs. The variable ps on the other hand gets assigned some value based on what the defined function startsWith gets applied to. So if we say startsWith("abc", "og"), then ps will be bound to the string "og".

(As I said in class, you can’t say λ ps ⁀ xs; because in that case the variable ps won’t be understood as a variable-with-an-existing-meaning, but instead as a variable-waiting-to-be-assigned-to-part-of-the-dissected-string that matches the clause. Instead, we say λ ys ⁀ xs if ys = ps.)
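A Python sketch of this predicate, under the same assumptions as before (each Python character is one element, and `splits` is a hypothetical helper). Python's `match` statement has the same wrinkle as our patterns: a bare name inside a `case` pattern is always a fresh capture, not a reference to an existing variable, so comparisons against existing values go in a guard. The sketch avoids the issue by comparing `ys` with `ps` explicitly.

```python
def splits(s):
    """All ways of breaking s into a pair (ys, xs) whose concatenation is s."""
    return [(s[:i], s[i:]) for i in range(len(s) + 1)]

def starts_with(hs, ps):
    # High-priority clause: some split of hs has ps as its first part.
    if any(ys == ps for ys, xs in splits(hs)):
        return True
    # Catch-all clause.
    return False

print(starts_with("abc", "ab"))    # True
print(starts_with("abc", ""))      # True
print(starts_with("abc", "abc"))   # True
print(starts_with("abc", "og"))    # False
```

On these inputs the result agrees with Python's built-in `str.startswith` method.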

In truth, the (hs, ps) to the left of the =def in this definition are patterns just like the λ ys ⁀ xs and so on! These are working fundamentally the same way. We could instead have written:

startsWith =def {
    λ (hs, ps) . dissect hs {...}
}

Where the ... is replaced by what’s inside the { } in the first version. This new definition has one dissect expression embedded inside another, where the outer one has only a single clause, and its pattern is λ (hs, ps). We’ll talk about this more later. For now, we’ll stick with the earlier style of definition.

Now you’re in a position to start this week’s homework.