Why Infinity is Strange

Infinite Sets I think I finally understand Cantor’s proof about non-countable (or non-enumerable) sets (I’ve been slowly working through Boolos and Jeffrey’s Computability and Logic; I find their notation and general approach to be quite hard to follow, so it’s been a struggle). This particular result has to do with infinity, which is a conceptually difficult topic to grasp, even when considered non-mathematically. Part of what Cantor discovered is that infinite sets can have different cardinalities (where cardinality here corresponds to the number of items in a given set). For example, the cardinality of the natural (or counting) numbers $\mathbb{N}$ is much smaller than that of the real numbers $\mathbb{R}$ (we won’t talk directly about the real number, but instead about the power set of the natural numbers $\mathcal{P}(\mathbb{N})$, which has the same cardinality as $\mathbb{R}$). More generally, Cantor showed that infinity is not a singular concept and that there are in fact many different types of infinity (which, at first glance, is a bewildering proposition). Let’s start with some definitions. Somewhat informally, we will call an enumerable set as any set that can be aligned to a subset of the natural counting numbers (e.g., the set of chairs in my kitchen, of which there are 4). A countably infinite set, then, is a set that can be aligned to all of the natural counting numbers. For example, let’s imagine we have a simple formal language $\mathcal{L} = { a^{*} }$; we can argue that the string set associated with this language is countably infinite by doing the following alignment (such that the number of $a$s in each string is mapped uniquely to each associated counting number): $$\lambda \to 0, a \to 1, aa \to 2, aaa \to 3, aaaa \to 4, …$$ More formally, you can define an infinite set as a function from values in that set to the natural numbers, e.g., $$f : \mathcal{L} \to \mathbb{N}$$ To use some terminology that will become useful later, we will say that for an infinite set to be countably infinite, there must exist some bijective relation, or one-to-one mapping, between the given set and the natural numbers (i.e., for each item $x$ in the domain of $f$, which in this case would would be our infinite set, there should exist a single item in the co-domain $f(x) \in \mathbb{N}$ and vice versa). With these ideas, let’s now prove a relatively simple, yet puzzling, theorem about countably infinite sets. Theorem 1 For any two countably infinite sets $A$ and $B$ (for convenience, we will assume that $A$ and $B$ are disjoint, i.e., $A \cap B = \emptyset$), the union of $A$ and $B$, $A \cup B$, is also countably infinite. Proof (rough sketch following the discussion here) Given that $A$ and $B$ are countably infinite, we know that there exist two bijective functions $f : A \to \mathbb{N}$ and $f’ : B \to \mathbb{N}$. To prove the theorem, we need to come up with a new bijective function $g : (A \cup B) \to \mathbb{N}$. Below defines such a function: $$ g(x) = \begin{cases} 2f(x) & \text{if }x \in A, \\ 2f’(x)+1 & \text {if }x \in B \end{cases} $$ which simply assigns items in $A$ to the even numbers (of which there are a countably infinite amount) and items in $B$ to odd numbers (of which there also are a countably infinite amount) □ Now why is this result puzzling? Well, typically when we combine two of anything, the result is larger than the individual pieces we combined. For example, if I combine the set of chairs in my kitchen with the set of chairs in your kitchen (assuming that you have chairs), the resulting set is the sum of our two sets and hence is larger. Countable infinity doesn’t work this way; when you combine two countably infinite sets, the resulting set has exactly the same cardinality. In the 17th century, Galileo had made several observations about infinity, even without the set-theoretic machinery that was developed by Cantor much later in the 19th century. What he noticed is that if you can create a one-to-one mapping between square numbers (i.e., the numbers that are the product of other natural numbers, e.g., $0,1,4,9,16,…$) and the natural numbers (this is similar in spirit to what we did in Theorem 1). More technically, he made the observation that the function $ f : n \to n^{2} $ is bijective for all $n \in \mathbb{N}$. This led him to say the following about infinity in his book Two New Sciences: This is one of the difficulties which arise when we attempt, with our finite minds, to discuss the infinite, assigning to it those properties which we give to the finite and limited; but this I think is wrong, for we cannot speak of infinite quantities as being the one greater or less than or equal to another. [He then goes on to discuss his result on square numbers]. As a mathematical matter, it turns out that his last point about how we cannot speak of infinite quantities as being smaller or greater than one another is not quite right. Even though countably infinite sets are all equal to one another, there are infinities with larger cardinalities. This is the result that we turn to next. Cantor’s Theorem Cantor’s Theorem is about the relative cardinality of sets and their power sets, as mentioned at the beginning. More specifically, the theorem states that the cardinality of a given set $A$ is strictly less than its power set (i.e., the set of all sets constructed from $A$, plus the empty set $\emptyset$). Below states this more formally (where $|\cdot|$ denotes the cardinality of a set, and $\mathcal{P}(\cdot)$ denotes the power set): Cantor’s Theorem $$\text{For any set } A, |A| \lt | \mathcal{P}(A)|$$ The Finite case: Notice that we didn’t specify whether the set $A$ is finite or infinite, meaning that this result applies to both. For finite sets, this is straightforward to prove; if you attempt to construct a power set from a given set of cardinality $n$, you will notice that the cardinality of the resulting power set will always be $2^{n}$ (it’s encouraged that you try to do this yourself). For example, imagining the set $A = \{ a , b\}$ of size 2, the power set is the following: Example 1 $$\mathcal{P}(A) = \{ \{ a \} , \{ b \} , \{ a , b \} , \emptyset \} $$ (note the existence of the empty set $\emptyset$, which is always a subset of every set). Clearly $n \lt 2^{n}$ is always true for all $n \in \mathbb{N}$ (the full proof requires a bit more work, but it relates to what comes next related to proving this for infinite sets). The Infinite case: The more surprising part of Cantor’s theorem is that this applies to infinite sets. We will specifically try to prove the following claim: Claim 1 $$| \mathbb{N} | \lt | \mathcal{P}(\mathbb{N}) |$$ In keeping with our more general theme, what this claim says is that the cardinality of $\mathbb{N}$, which we know to be countably infinite, is smaller than something else (hence, there is an infinity larger than $| \mathbb{N} |$!). To demonstrate this, we will do the following: first, we will start with a definition of a set that is undeniably in $\mathcal{P}(\mathbb{N})$, then show that if we try to assign it to a specific natural number (as is required to make the mapping bijective), we will run into a contradiction. Let’s imagine that we were to map the natural numbers to its power set. What we would have are individual mappings such as those shown below: $$ 0 \to \{ 0, 1 \} \\ 1 \to \{ 3, 4 \} \\ 2 \to \{ 5, 6 \} \\ 3 \to \{ 3, 5 \} \\ … $$ Notice that some of the sets on the right side contain the numbers on the left to which they are indexed (i.e., for $0$ and $3$, these numbers are included in their corresponding sets $\{ 0,1 \}$ and $\{ 3, 5\}$), whereas the others do not. Based on this, let’s make the following definition: Definition Let’s define the function $d : \mathbb{N} \to \mathcal{P}(\mathbb{N})$, and the set D of all numbers that are not included in their corresponding set: $$ D = \bigg\{ x \in \mathbb{N} \mid x \notin d(x) \bigg\} $$ Using again the example mapping above, $D$ would include the numbers $1$ and $2$ (i.e., $\{ 1, 2\} \subseteq D$). Since $D$ is a set of positive numbers, it clearly must appear in the power set of $\mathbb{N}$, or $\mathcal{P}(\mathbb{N})$. That is, there should exist some mapping: $$ j \to D $$ This definition, however, leads to a contradiction that makes it impossible for $d$ to be bijective (thus making Claim 1 true). To organize this a bit, we will first prove that $d$ is not bijective in the following lemma, then prove Claim 1 as a separate theorem: Lemma 1 The function $d$ defined above is not bijective/one-to-one. Proof (sketch) We will prove this in relation to the set $D$ defined above. Let’s assume that $d$ is bijective. Let’s then imagine that we find some set $S_{j} \in \mathcal{P}(\mathbb{N})$ that is identical to $D$ and, since $d$ is bijective, maps backwards to the number $j$, i.e., $S_{j} = D$ and $d(j) = S_{j}$. By definition of $D$, the following must be true: $$j \in D \text{ iff } j \notin S_{j}$$ If we assume the equivalence $S_{j} = D$, however, we get the following contradiction by substituting $D$ for $S_{j}$ in the statement above: $$j \in S_{j} \text{ iff } j \notin S_{j}$$ This shows that It is not possible to assign a given number $j$ to $D$ (which again, is undeniably in $\mathcal{P}(\mathbb{N})$). Therefore, $d$ cannot be bijective. □ To be honest, this result still boggles my mind a bit. What this shows is that if we try to construct a bijective relationship between $\mathbb{N}$ and $\mathcal{P}(\mathbb{N})$, we will find a hole in the form of a missing set such as $S_{j}$ that is excluded in our mapping. As Boolos and Jeffrey remark, even if we try to repair this hole by trying to add $S_{j}$, this will only have the effect of creating a new hole. With this, we can prove the main result related to Claim 1: Theorem 2 Claim 1 is true, i.e., $| \mathbb{N} | \lt | \mathcal{P}(\mathbb{N}) |$ Proof (rough sketch) Let’s assume that this claim isn’t true, and that $\mathbb{N}$ and $\mathcal{P}(\mathbb{N})$ have the same cardinality. Since we already know that $\mathbb{N}$ is countably infinite, this would mean that $\mathcal{P}(\mathbb{N})$ is also countably infinite, and hence should have a bijective/one-to-one mapping to $\mathbb{N}$. Through Lemma 1, we know that such a bijection is not possible since it is possible to come up with a set in the power set that can’t be indexed with some $j \in \mathbb{N}$. Given the result above, we still need to exclude the possibility that $\mathcal{P}(\mathbb{N})$ is smaller than $\mathbb{N}$. To do this, it suffices to show that some strict subset of $\mathcal{P}(\mathbb{N})$ is greater or equal to $\mathbb{N}$. Consider the subset $D’ = \{ \{ x \} \mid x \in \mathbb{N} \}$. Clearly $D’$ is a strict subset of $\mathcal{P}(\mathbb{N})$, and clearly it can have a one-to-one mapping to $\mathbb{N}$ (i.e., by indexing each $x \in \mathbb{N}$ with its singleton set which will give it the same size) □ A natural question to ask at this point is whether there is an infinity that has a greater cardinality than $\mathcal{P}(\mathbb{N})$ (perhaps this is the largest infinity). Using similar arguments, we can prove that $| \mathcal{P}(\mathbb{N}) | \lt |\mathcal{P}(\mathcal{P}(\mathbb{N})) |$, which in turn is less than $ | \mathcal{P}(\mathcal{P}(\mathcal{P}(\mathbb{N}))) |$, which is less than $ | \mathcal{P}(\mathcal{P}(\mathcal{P}(\mathcal{P}(\mathbb{N})))) |$, which is less than….

What is Kolmogorov Complexity?

The Basic Idea I’ve been reading about Kolmogorov complexity, with the aim of understanding certain metamathematical results (including general incompleteness). I seem to be on the verge of understanding specifically Chaitin’s reformulation of Gödel’s results using the Berry paradox. First, before getting to the paradox let’s define Kolmogorov complexity (or Kolmogorov-Chaitin-Solomonoff complexity; so-called after the 3 individauals who independently discovered it). Assuming a given string (or string rendering of a particular problem), the Kolmogorov complexity of that string is defined as the size of the smallest program that is needed to generate that string. For example (I’m using some of the examples from here), let’s assume that we have the following two strings of size $n=8$: $$ x = 01010101 \\ x’ = 11101001 $$ In the first case, a reasonably compact Python program for generating $x$ is the following: print("01"*4) which, excluding the Python parts ",print, (,), * and the prefix "01" (since these wil stay constant for all such patterns longer than $n=8$), has a size of $4$ (or $n/2$). In bits this is equivalent to the following (using the laws of number-bit conversion; this part about the conversion tripped me up a bit at first, but is essential for the result in the next Section): $$ \left\lfloor \log_{2}(4) \right\rfloor + c $$ where $c (\ge 1)$ is the constant that covers the programming language parts and the constant pattern 01. Now what about the second string $x’$? Clearly, this pattern is more complex, in that sense that it is difficult to come up with a more compact program than the following: print("11101001") which simply returns the total pattern "11101001", and is therefore of size equal to the size of the input (for now, forget the point about the $log_{2}$ bit conversion). With these ideas in mind, we can then quantify the notion of a random string as one whose smallest program is greater than or equal to the size of the input string plus this constant $c$ (in other words, we cannot find a program smaller than the one that simply encodes and returns the full string). We can further define the set $R$ of all random strings as follows (where $K(x)$ stands for Kolmogorov complexity): $$R = \bigg\{ x \mid K(x) \ge | x | \bigg\}$$ I find this definition of randomness to be very satisfying. In the simplest terms, it says that a string is random if we cannot come up with a clever pattern to describe or generate it (alternatively, you can think about a random string as one that we cannot compress down to something smaller; in other presentations, $R$ is sometimes referred to as the set of uncompressable strings). To me it is very easy to imagine many such strings, though this notion of randomness turns out to be fundamentally problematic, as described next. The Result Now we can ask the question: can we come up with a general algorithm to find this set $R$ and determine if a given $x$ is random? Somewhat shockingly, it turns out we can’t according to the following result (it took me some time to find a readable proof; this one is based on this blog post already cited above, these notes here and this wonderful textbook, which I’m still working through): Theorem $R$ is not decidable. Proof. Let’s imagine a computer program $M$ (e.g., a Turing machine, Python program, whatever; as it turns out, the choice of programming language is not so important), that computes $R$. If such a program exists, then we can use it for storing all random strings in sorted order (we can further represent this program $M$ as a string of size $| M |^{\texttt{bits}}$, where $| \cdot |^{\texttt{bits}}$ is the size of the program string in bits; this exploits the fact that we can encode Turing machines as binary strings). To find if a particular string $x \in R$ (where $|x|=n$), we can then use another program $M’$ that will simply do a look up on $R$ for all strings of size $n$ and return true if it finds the input $x$. Imagine that we represent this latter program as a string $s_{n}$ built from a string rendering of $M’$ (of size $| M’ |^{\texttt{bits}}$) concatenated with the number $n$ (where again, the bit representation of $| n |^{\texttt{bits}}$ is approximately $log_{2}(n)$, as discussed above). In other words: $$| s_{n} |^{\texttt{bits}} = | M’ |^{\texttt{bits}} + | n |^{\texttt{bits}}$$ The problem is that according to our definition of randomness, the following must hold: $$K(s_{n}) \ge n $$ whereas $s_{n}$ will have a length of $c + \log_{2} n$ (where $M,M’$ get stuffed into our constant $c$). Therefore, we have the following: $$n \le \log_{2}(n) + c$$ which cannot hold for most $n$. Therefore $R$ cannot be computed in the general case. □ As once remarked by Marvin Minsky, Kolmogorov complexity therefore has the fatal flaw that, in the general case, it is not possible to compute exactly what the theory is designed to compute! Why Does This Happen? As mentioned at the onset, this result can be used to derive the famous Gödel incompleteness results (see Chaitin’s paper for more details). The source of these results is the so-called Berry Paradox, which was first published by Bertrand Russell but named after an obscure Oxford University librarian who first posed the paradox to Russell (as Chaitin describes, it can be viewed as a variant of the liar paradox that Gödel relied on to prove his famous results). The paradox can be understood by thinking about the following description of a number: The smallest positive integer than cannot be described in less than 1 billion words in English. While we can imagine the particular number being described, the paradox is that we just described such a number using far less than 1 billion English words (more exactly, we used only 16 English words!). This is the essense of the proof above, which involves this $\log_{2}$ bit compression trick.