What Is Information? (Part 3: Everything is Conditional)


This is the third part of the "What is Information?" series. Part one is here, and clicking this way gets you to Part 2. 

We are rolling along here, without any indication of how many parts this series may take. Between you and me, I have no idea. We may be here for a while. Or you may run out of steam before me. Or just run. 

Let me remind you why I am writing this series. (Perhaps I should have put this paragraph at the beginning of part 1, not part 3? No matter). 

I believe that Shannon's theory of information is a profound addition to the canon of theoretical physics. Yes, I said theoretical physics. I can't get into the details of why I think this in this blog (but if you are wondering about this you can find my musings here). But if this theory is so fundamental (as I claim) then we should make an effort to understand the basic concepts in walks of life that are not strictly theoretical physics. I tried this for molecular biology here, and evolutionary biology here.  

But even though the theory of information is so fundamental to several areas of science, I find that it is also one of the most misunderstood theories. It seems, almost, that because "everybody knows what information is", a significant number of people (including professional scientists) use the word, but do not bother to learn the concepts behind it. 

But you really have to. You end up making terrible mistakes if you don't.

The theory of information, in the end, teaches you to think about knowledge, and prediction. I'll try to give you the entry ticket to all that. Here's the quick synopsis of what we have learned in the first two parts.

1.) It makes no sense to ask what the entropy of any physical system is. Because technically, it is infinite. It is only when you specify what questions you will be asking (by specifying the measurement device that you will use in order to determine the state of the random variable in question) that entropy (a.k.a. uncertainty) is finite, and defined.

2.) When you are asked to calculate the entropy of a mathematical (as opposed to physical) random variable, you are usually handed a bunch of information you didn't realize you have. Like, what's the number of possible states to expect, what those states are, and possibly even what the likelihood is of experiencing those states. But given those, your prime directive is to predict the state of the random variable as accurately as you can. And the more information you have, the better your prediction is going to be.

Now that we've got these preliminaries out of the way, it seems like high time that we get to the concept of information in earnest. I mean, how long can you dwell on the concept of entropy, really?

Actually, just a bit longer as it turns out. 

I think I confused you a bit in the first two posts. One time, I write that the entropy is just $\log N$, the logarithm of the number of states the system can take on, and later I write Shannon's formula for the entropy of random variable $X$ that can take on states $x_i$ with probability $p_i$ as

                                     $H(X)=-\sum_{i=1}^N p_i\log p_i$   (1)

And then I went on to tell you that the first was "just a special case" of the second. And because I yelled at one reader, you probably didn't question me on it. But I think I need to clear up what happened here.

In the second part, I talked about the fact that you really are given some information when a mathematician defines a random variable. Like, for example, in Eq. (1) above. If you know nothing about the random variable, you don't know the $p_i$. You may not even know the range of $i$. If that's the case, we are really up the creek, with paddle in absentia. Because you wouldn't even have any idea about how much you don't know. So in the following, let's assume that you know at least how many states to expect, that is, you know $N$.

If you don't know anything else about a probability distribution, then you have to assume that each state appears with equal probability. Actually, this isn't a law or anything. I just don't know how you would assign probabilities to states if you have zero information. Nada. You just have to assume that your uncertainty is maximal in that case. And this happens to be a celebrated principle: the "maximum entropy principle". The uncertainty (1) is maximized if $p_i=1/N$ for all $i$.  And if you plug in $p_i=1/N$ in (1), you get

                                                 $H_{\rm max}=\log N$.   (2)
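Spelled out, in case you want to see the single line of algebra behind that claim:

$$H_{\rm max}=-\sum_{i=1}^N \frac{1}{N}\log\frac{1}{N}=N\cdot\frac{1}{N}\log N=\log N.$$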

It's that simple. So let me recapitulate. If you don't know the probability distribution, the entropy is (2). If you do know it, it is (1). The difference between the two entropies is knowledge. Uncertainty (2) does not depend on knowledge, but the entropy (1) does. One of them is conditional on knowledge, the other isn't.

A technical note for you physicists out there is imminent. All you physics geeks, read on. Everybody else, cover your eyes and go: "La La La La!"



<geeky physics> Did you realize how Eq. (2) is really just like the entropy in statistical physics when using the microcanonical ensemble, while Eq. (1) is the Boltzmann-Gibbs entropy in the canonical ensemble, where $p_i$ is given by the Boltzmann distribution? </geeky physics>
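Still for the geeks, spelled out in standard statistical-physics notation ($Z$ is the partition function, $\Omega$ the number of accessible microstates, $k_B$ Boltzmann's constant):

$$p_i=\frac{e^{-\beta E_i}}{Z},\qquad S=-k_B\sum_i p_i\ln p_i,$$

which is Eq. (1) up to Boltzmann's constant and the choice of natural logarithm. Set all $\Omega$ accessible states equally likely, $p_i=1/\Omega$, and you recover the microcanonical $S=k_B\ln\Omega$, that is, Eq. (2).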

You can start reading again. If you had your fingers in your ears while going "La La La La!", don't worry: you're nearly normal, because reading silently means nothing to your brain.

Note, by the way, that I've been using the words "entropy" and "uncertainty" interchangeably. I did this on purpose, because they are one and the same thing here. You should use one or the other interchangeably too. 

So, getting back to the narrative, one of the entropies is conditional on knowledge. But, you think while scratching your head, wasn't there something in Shannon's work about "conditional entropies"?

Indeed, and those are the subject of this part 3. The title kinda gave it away, I'm afraid.

To introduce conditional entropies more formally, and then connect to (1)--which completely innocently looks like an ordinary Shannon entropy--we first have to talk about conditional probabilities. 

What's a conditional probability? I know, some of you groan "I've known what a conditional probability is since I was seven!" But even you may learn something. After all, you learned something reading this blog even though you're somewhat of an expert? Right? Why else would you still be reading? 

                               "Infinite patience", you say? Moving on. 

A conditional probability characterizes the likelihood of an event, when another event has happened at the same time. So, for example, there is a (generally small) probability that you will crash your car. The probability that you will crash your car while you are texting at the same time is considerably higher. On the other hand, the probability that you will crash your car while it is Tuesday at the same time, is probably unchanged, that is, unconditional on the "Tuesday" variable. (Unless Tuesday is your texting day, that is.)

So, the probability of events depends on what else is going on at the same time. "Duh", you say. But while this is obvious, understanding how to quantify this dependence is key to understanding information.  

In order to quantify the dependence between "two things that happen at the same time", we just need to look at two random variables. In the case I just discussed, one random variable describes whether (and how) you crash your car, and the other describes whether you are texting. The two are not always independent, you see. The problems occur when the two occur simultaneously.

You know, if this was another blog (like, the one where I veer off to discuss topics relevant only to theoretical physicists) I would now begin to remind you that the concept of simultaneity is totally relative, so that the concept of a conditional probability cannot even be unambiguously defined in relativistic physics. But this is not that blog, so I will just let it go.  I didn't even warn you about geeky physics this time.   

OK, here we go: $X$ is one random variable (think: $p_i$ is the likelihood that you crash your car while you conduct maneuver $X=x_i$). The other random variable is $Y$. That variable has only two states: either you are texting ($Y=1$), or you are not ($Y=0$). And those two states have probabilities $q_1$ (texting) and $q_0$ (not texting) associated with them. 

I can then write down the formula for the uncertainty of crashing your car while texting, using the probability distribution

                                                    $P(X=x_i|Y=1)$ .

This you can read as "the probability that random variable $X$ is in state $x_i$ given that, at the same time, random variable $Y$ is in state $Y=1$." 

This vertical bar "|" is always read as "given".

So, let me write  $P(X=x_i|Y=1)=p(x_i|1)$. I can also define $P(X=x_i|Y=0)=p(x_i|0)$. $p(x_i|1)$ and $p(x_i|0)$ are two probability distributions that may be different (but they don't have to be, if my driving is unaffected by texting). Fat chance for the latter, by the way. 

I can then write the entropy while texting as

                               $H(X|{\rm texting})=-\sum_{i=1}^N p(x_i|1)\log p(x_i|1)$.  (3)

On the other hand, the entropy of the driving variable while not texting is 

                           $H(X|{\rm not\  texting})=-\sum_{i=1}^N p(x_i|0)\log p(x_i|0)$.  (4)

Now, compare Eqs (3) and (4) to Eq. (1). The first two are conditional entropies, conditional in this case on the co-occurrence of another event, here texting. They look just like the Shannon formula for entropy, which I told you was the one where "you already knew something", like the probability distribution. In the case of (3) and (4), you know exactly what it is that you know, namely whether the driver (random variable $Y$) is texting while driving, or not. 

So here's the gestalt idea that I want to get across. Probability distributions are born being uniform. In that case, you know nothing about the variable, except perhaps the number of states it can take on. Because if you didn't know that, then you wouldn't even know how much you don't know. Those would be the "unknown unknowns" that a certain political figure once injected into the national discourse. 

These probability distributions become non-uniform (that is, some states are more likely than others) once you acquire information about the states. This information is manifested by conditional probabilities. You really only know that a state is more or less likely than the random expectation if you at the same time know something else (like in the case discussed, whether the driver is texting or not). 

Put another way, what I'm trying to tell you here is that any probability distribution that is not uniform (same probability for all states) is necessarily conditional. When someone hands you such a probability distribution, you may not know what it is conditioned on. But I assure you that it is conditional. I'll state it as a theorem:

All probability distributions that are not uniform are in fact conditional probability distributions.

This is not what your standard textbook will tell you, but it is the only interpretation of "what do we know" that makes sense to me. "Everything is conditional" thus, as the title of this blog post promised.

But let me leave you with one more definition, which we will need in the next post, when we finally get to define information. 

Don't groan, I'm doing this for you.

We can write down what the average uncertainty for crashing your car is, given your texting status. It is simply the average of the uncertainty while texting and the uncertainty while not texting, weighted by the probability that you engage in any of the two behaviors.  Thus, the conditional entropy $H(X|Y)$, that is the uncertainty of crashing your car given your texting status, is

                           $H(X|Y)=q_0H(X|Y=0)+q_1H(X|Y=1)$   (5).

That's obvious, right? $q_1$ being the probability that you are texting (while executing any maneuver $i$), and $q_0$ the probability that you are not (while executing any maneuver).
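If you like to see the bookkeeping spelled out, here is a minimal sketch in Python. The three maneuver states and all the numbers are made up purely for illustration; only the formulas follow Eqs. (3)-(5).

```python
import math

def entropy(p, base=2):
    """Shannon entropy H = -sum_i p_i log p_i (zero-probability states contribute nothing)."""
    return -sum(q * math.log(q, base) for q in p if q > 0)

# Hypothetical numbers, for illustration only: distributions over the states x_i of X
p_texting     = [0.70, 0.20, 0.10]   # p(x_i|1), i.e. while texting
p_not_texting = [0.34, 0.33, 0.33]   # p(x_i|0), i.e. while not texting
q1, q0 = 0.2, 0.8                    # probability of texting / not texting

H_texting     = entropy(p_texting)                    # Eq. (3)
H_not_texting = entropy(p_not_texting)                # Eq. (4)
H_X_given_Y   = q0 * H_not_texting + q1 * H_texting   # Eq. (5)

print(H_texting, H_not_texting, H_X_given_Y)
```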

With this definition of the entropy of one random variable given another, we can now finally tackle information.

In the next installment of the "What is Information" series, of course!









What is Information? (Part 2: The Things We Know)

In the first part of this series I talked to you mostly about entropy. How the entropy of a physical system (such as a die, a coin, or a book) depends on the measurement device that you will use for querying that system. That, come to think of it, the uncertainty (or entropy) of any physical object really is infinite, and made finite only by the finiteness of our measurement devices. If you start to think about it, of course the things you could possibly know about any physical object are infinite! Think about it! Look at any object near to you. OK, the screen in front of you. Just imagine a microscope zooming in on the area framing the screen, revealing the intricate details of the material. The variations that the manufacturing process left behind, making each and every computer screen (or iPad or iPhone) essentially unique.

If this was another blog, I would now launch into a discussion of how there is a precise parallel (really!) to renormalization theory in quantum field theory... but it isn't. So, let's instead delve head first into the matter, and finally discuss the concept of information.

What does it even mean to have information? Yes, of course, it means that you know something. About something. Let's make this more precise. I'll conjure up the old "urn". The urn has things in it. You have to tell me what they are.

Credit: www.dystopiafunction.com


So, now imagine that.....

"Hold on, hold on. Who told you that the urn has things in it? Isn't that information already? Who told you that?"

OK, fine, good point. But you know, the urn is really just a stand-in for what we call "random variables" in probability theory. A random variable is a "thing" that can take on different states. Kind of like the urn, that you draw something from? When I draw a blue ball, say, then the "state of the urn" is blue. If I draw a red ball, then the "state of the urn" is red. So, "urn=random variable". OK?

"OK, fine, but you haven't answered my question. Who told you that there are blue and red balls in it? Who?"

You really are interrupting my explanations here. Who are you anyway? Never mind. Let me think about this. Here's the thing. When a mathematician defines a random variable, they tell you which states it can take on, and with what probability. Like: "A fair coin is a random variable with two states. Each state can be taken on with equal probability one-half." When they give you an urn, they also tell you how likely it is to get a blue or a red ball from it. They just don't tell you what you will actually get when you pull one out. 

"But is this how real systems are? That you know the alternatives before asking questions?"

All right, all right. I'm trying to teach you information theory, the way it is taught in any school you would set your foot in. I concede, when I define a random variable, then I tell you how many states it can take on, and what the probability is that you will see each of these states, when you "reach into the random variable". Let's say that this info is magically conferred upon you. Happy now?

"Not really."

OK, let's just imagine that you spend a long time with this urn, and after a while of messing with it, you do realize that:

A) This urn has balls in it.
B) From what you can tell, they are blue and red.
C) Reds occur more frequently than blues, but you're still working on what the ratio is.

Is this enough?

"At least now we're talking. Do you know that you assume a lot when you say "random variable"?

I wanted to tell you about information, and we got bogged down in this discussion about random variables instead. Really, you're getting in the way of some valuable instruction here. Could you just go away?

"You want to tell me what it means to 'know something', and you use urns, which you say are just code for random variables, and I find out that there is all this hidden information in there! Who is getting in the way of instruction here??? Just sayin'!"

....

OK.

....

All right, you're making this more difficult than I intended it to be. According to standard lore, it appears that you're allowed to assume that you know something about the things you know nothing about. Let's just call these things "common sense": the things that you kinda know without ever having performed dedicated experiments to ascertain the state of the variables. Like, that a coin has two sides. That's common knowledge, right? And the things you don't know about the random variable are the things that go beyond common sense.

"And urns have red and blue balls in it? What about red and green?"

You're kinda pushing it now. Shut up.

Soooo. Here we are. Excuse this outburst.  Moving on.

We have this urn. It's got red and blue balls in it. (This is common knowledge.) They could be any pair of colors, you do realize. How much don't you know about it? 

Easily answered using our good buddy Shannon's insight. How much you don't know is quantified by the "entropy" of the urn. That's calculated from the fraction of blue balls known to be in the urn, and the fraction of red balls in the urn. You know, these fractions that are common knowledge. So, let's say the fraction of blue is $p$. The fraction of red then is of course (you do the math) $1-p$. And the entropy of the urn is

                                $H(X)=-p\log p-(1-p)\log(1-p)$          (1)

Now you're gonna ask me about the logarithm aren't you? Like, what base are you using?

You should. The mathematical logarithm function needs a base. Without it, its value is undefined. But given the base, the entropy function defined above gets more than just a value: it gets units. So, for example, if the base is 2, then the units are "bits". If the base is $e$, then the units are "nats". We are mostly going to be using bits, so base 2 it is.
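If you want to play with Eq. (1) yourself, here is a minimal sketch in Python (base 2, so the answer comes out in bits; the values of $p$ fed to it are just examples):

```python
import math

def urn_entropy(p, base=2):
    """Entropy of an urn with a fraction p of blue balls and 1-p of red balls, Eq. (1)."""
    if p in (0.0, 1.0):       # a certain outcome carries no uncertainty
        return 0.0
    return -p * math.log(p, base) - (1 - p) * math.log(1 - p, base)

print(urn_entropy(0.5))       # 1.0 bit: maximal uncertainty
print(urn_entropy(0.9))       # ~0.469 bits: you already know quite a bit
```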

"In part 1 you wrote that the entropy is $\log N$, where $N$ is the number of states of the system. Are you changing definitions on me?"

I'm not, actually. I just used a special case of the entropy to get across the point that the uncertainty/entropy is additive. It was the special case where each possible state occurs equally likely. In that case, the probability $p$ is equal to $1/N$, and the above formula (1)  turns into the first one. 

But let's get back to our urn. I mean random variable. And let's try to answer the question: 

"How much is there to know (about it)? "

Assuming that we know the common knowledge stuff, that the urn only has red and blue balls in it, then what we don't know is the identity of the next ball that we will draw. This drawing of balls is our experiment. We would love to be able to predict the outcome of this experiment exactly, but in order to pull off this feat, we would have to have some information about the urn. I mean, the contents of the urn. 

If we know nothing else about this urn, then the uncertainty is equal to the log of the number of possible states, as I wrote before. Because there are only red and blue balls, that would be log 2. And if the base of the log is two, then the result is $\log_2 2=1$ bit.  So, if there are red and blue balls only in an urn, then I can predict the outcome of an experiment (pulling a ball from the urn) just as well as I can predict whether a fair coin lands on heads or tails. If I correctly predict the outcome (I will be able to do this about half the time, on average) I am correct purely by chance. Information is that which allows you to make a correct prediction with accuracy better than chance, which in this case means, more than half of the time. 

"How can you do this, for the case of the fair coin, or the urn with equal numbers of red and blue balls?"

Well, you can't unless you cheat. I should say, the case of the urn and of the fair coin are somewhat different. For the fair coin, I could use the knowledge of the state of the coin before flipping, and the forces acting on it during the flip, to calculate how it is going to land, at least approximately. This is a sophisticated way to use extra information to make predictions (the information here is the initial condition of the coin) but something akin to that has been used by a bunch of physics grad students to predict the outcome of casino roulette in the late 70s. (And incidentally I know a bunch of them!)

The coin is different from the urn because for the urn, you won't be able to get any "extraneous" information. But suppose the urn has blue and red balls in unequal proportions. If you knew what these proportions were [the $p$ and $1-p$ in Eq. (1) above] then you could reduce the uncertainty of 1 bit to $H(X)$. A priori (that is, before performing any measurements on the probability distribution of blue and red balls), the distribution is of course given by $p=1/2$, which is what you have to assume in the absence of information. That means your uncertainty is 1 bit. But keep in mind (from part 1: The Eye of the Beholder) that it is only one bit because you have decided that the color of the ball (blue or red) is what you are interested in predicting.

If you start drawing balls from the urn (and then replacing them, and noting down the result, of course) you would be able to estimate $p$ from the frequencies of blue and red balls. So, for example, if you end up seeing 9 times as many red balls as blue balls, you should adjust your prediction strategy to "The next one will be red". And you would likely be right about 90% of the time, quite a bit better than the 50/50 prior.
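Here is a toy version of that experiment in Python. The true fraction of red balls (0.9), the random seed, and the number of draws are all assumptions of the sketch, chosen only to make the point:

```python
import random

random.seed(1)
p_red = 0.9                                   # true (hidden) fraction of red balls

def draw():
    """Draw one ball from the urn, with replacement."""
    return 'red' if random.random() < p_red else 'blue'

# Estimate p from the observed frequencies.
sample = [draw() for _ in range(1000)]
p_hat = sample.count('red') / len(sample)     # should land near 0.9

# Prediction strategy learned from the sample: "the next one will be red".
hits = sum(draw() == 'red' for _ in range(1000))
print(p_hat, hits / 1000)                     # both around 0.9, well above the 50/50 prior
```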

"So what you are telling me, is that the entropy formula (1) assumes a whole lot of things, such as that you already know to expect a bunch of things, namely what the possible alternatives of the measurement are, and even what the frequency distribution is, which you can really only know if you have divine inspiration, or else made a ton of measurements!"

Yes, dear reader, that's what I'm telling you. You already come equipped with some information (your common sense) and if you can predict with accuracy better than chance (because somebody told you the $p$ and it is not one half), then you have some more info. And yes, most people won't tell you that. But if you want to know about information, you first need to know.... what it is that you already know.

Where do thinking machines come from?

We've been waiting for these thinking machines for a long time now. We've read about them, and seen them in countless movies. They are just technology, right? And we've gotten really good at this technology thing! But where are the machines?

In a previous post I've hinted at the big problem in serious Artificial Intelligence (AI) research: if the theory of consciousness based on the concept of integrated information is right, then thinking machines are essentially undesignable. 

Mind you, we do have smart machines. We have machines that outperform humans in playing chess, we have self-driving cars that process close to 1Gbit per second of data, and we have machines that can beat pretty much anybody at Jeopardy! But neither you nor I would call these smart machines intelligent. We do not take that word lightly: if you're just good at doing one particular job, then you're smart at that, but you are not intelligent. Google's car cannot play chess (nor can Watson), and neither Deep Blue nor Watson should be allowed behind the wheel of a car. 

What's going on here? 

Here's the most important thing you need to know about what it takes to be intelligent. You have to be able to create worlds inside your brain. Literally. You have to be able to imagine worlds, and you have to be able to examine these worlds. Walk around in them, linger. 

This is important because you live in this world, the one you are also imagining. This world is complex, it is dangerous, and it is often unpredictable. It is precisely this unpredictability that is dangerous: you can be lunch if you don't understand the tell-tale signs of the lurking tiger. 

Yes I know, your chances of being eaten by a tiger are fairly low, but I'm not talking about today: I'm talking about the time when we (as a species) "grew up", that is, when we came down from the trees and ventured into the open fields of the savannah. To survive in this world, we have to make accurate predictions about the future state of the world. (Not just in the next five minutes, but also on the scale of months, seasons, years.)

How do we make these predictions? Why, we imagine the world, and in our minds imagine what happens. These imaginings, juxtaposed with the things that really do happen, allow us to hone a very important skill: we can represent an abstract version of the world in our heads, and use it to understand it. Understanding means removing surprises, the things that usually kill you.

Thinking about an object thus means creating an abstract representation of this object in your head, and playing around with it. If you can't do that, then you cannot think. You cannot be intelligent.

Are workers in the field of Artificial Intelligence oblivious about this absolutely crucial, essential aspect of intelligence?

Absolutely not. They are perfectly aware of it. In the heyday of AI research, that's pretty much all people did: they tried to cram as many facts about the real world into a computer's memory as they could. This, by the way, is still pretty much the way Watson is programmed, but he has a smarter retrieval system than what was possible in those days, based on Bayesian inference. 

But in the end, the programmers had to give up. No matter how much information they crammed into these brains, this information was not integrated: it did not produce an impression of the object that allowed the machine to make new inferences about the object that were not already programmed in. But that is precisely what is needed: your model of the world has to be good enough so that (when thinking about it)  you can make predictions about things you didn't already know.

So what did AI researchers do? Some gave up, and left the field. Others decided that they could do without these pesky imagined worlds. That you could create intelligence without representation. (The linked article is available beyond the paywall all over the internet, for example here. Tells you something about paywalls.) NOTE: This was available until recently! Also tells you something about paywalls. 

Given all that I just told you, you ought to at least be baffled. It all seemed so convincing! You can do without internal models? How?

The idea that you could do away with representations for the purpose of Artificial Intelligence is due to Rodney Brooks, then Professor of Robotics at MIT. Brooks is no slouch, mind you. His work has influenced a generation of roboticists. But he decided that robots should not make plans, because, well, the best laid plans, you know....

Rather Brooks argued that robots should react to the world. The world already contains all the complexity! Why not use that? Why program something that you have direct access to?

Why indeed? Brooks was quite successful with this approach, creating reactive robots with a subsumption architecture. Reactive robots are indeed robust: they can act appropriately given the current state of the world, because they take the world seriously: the world is all they have. 

But I think we can all agree that these robots, agile as they are, won't ever be intelligent. They won't be able to make plans. Because plans require good internal models, which we don't know how to program.  

So where will our intelligent machines come from? 

The avid reader of Spherical Harmonics (should such a person actually exist), already knows the answer to this question. Evolution is the tool to create the undesignable! If you can't program it, evolve it! After all, that's where we came from.

Now, I've hinted at this before: evolve it! Can you actually evolve representations? 

Yeah, we can, and we've shown it. And there is a paper that just came out in the journal Neural Computation that documents it. That's right, you've been reading a blog post that is an advertisement for a journal article that is behind a paywall! 

Relax, there is a version of the article on the AdamiLab web site. Or go get it from arxiv.org here

Now back to the specifics: "You've evolved representations, you say? Prove it!"

Ah! Now, a can of worms opens.  How can you show that any evolved anything actually represents the world inside its.... bits? What are representations anyway? Can you measure them?

Now here's a good question. It's the question the empiricist asks, when he is entangled in a philosophical discussion. And lo and behold, the concept of representation is a big one in the field of philosophy. Countless articles have been written about it. I'm not going to review them here. I have this sneaking suspicion that I am, again, engaged in writing an overly long blog post. If you're into this sort of thing (reading about philosophy, as opposed to writing overly long blog posts), you can read about philosophers talking about representation here, for example. Or here. I could go on. 

Philosophers have defined "representation" as something that "stands in" for the real thing. Something we would call a model. So we're all on the same wavelength here. But can you measure it? What we have done in the article I'm blogging about, is to propose an information-theoretic measure for how much a brain represents. And then we evolve a brain that must represent to win, and measure that thing we call representation. But then we go one better: we also measure what it is that these brains represent.

We literally measure what these brains are thinking about when they make their predictions. 

How do we do that? So, first of all, we understand that when you represent something, then this something must be information. Your model of the world is a compressed representation of the world, compressed into the essential bits only. But importantly, you're not allowed to get those bits from looking at the world. Staring at it, if you will. If you have a model of the world, you can have that model with your eyes closed. And ears. All sensors. Because if you could not, you would just be a reactive machine. So, a representation is those bits of the world that you can't see in your sensors. Can you measure that?

Hell yes! Claude Shannon, that genius of geniuses, taught us how! Here is the informational Venn diagram between the world (W), the sensors (S) that see the world (they represent it, albeit in a trivial manner),  and the Brain (B):



What we call "representation" (R) is the information that the brain knows about the world (information shared between W and B) given the sensor states (S). "Given", in the language of information theory, means that these states (the sensor states) do not contribute to your uncertainty. It also means that the "given" states do not contribute to the information (shared entropy) between W and B. That's why the "intersection triangle" between W, B, and S does not contribute to R: we have to subtract it because it also belongs to S. (I will talk about these concepts in more detail in part 2 of my "What is Information? series) So, R is what the brain knows about the world without sneaking a peek at what the world currently looks like in the sensors. It is what you truly know.

Now that we have defined representation quantitatively (so that we can measure it), how does it evolve?

Splendidly, as you may have surmised. To test this, we designed a task (that a simulated agent must solve) that requires building a model of the task, in your brain. This task is relatively simple: you are a machine that catches blocks. Blocks rain down from the sky (falling diagonally) but there are two kinds of blocks in the world. Small ones (that you should definitely catch) and large ones (that you should definitely avoid). To make things interesting, your vision is kind of shoddy. You have a blind spot in the middle of your retina, so that a big block may look like a small block (and vice versa), for a while.




In this image, a large block is falling diagonally to the left. This is a tough nut to crack for our agent, because he hasn't even seen it. He is moving in the right direction (perhaps by chance) but once the block appears in the agent's sensors, he has to make a decision quickly. You have to determine size, direction of motion, and relative location (is the block to my left? right above me? to my right?). You have to integrate several informational streams in order to "recognize" what you are dealing with. And the agent's actions will tell us whether he has "understood" what it is that he is dealing with. That's what makes this task cool.

We can in fact evolve agents that solve this task perfectly, that is, they determine the right move for each of the 80 possible scenarios. Why 80? Well, the falling block can be in 20 different positions in the top row. It can be small or large. It can fall to the left or to the right: 20 x 2 x 2 = 80. You say that I'm neglecting the 20 possible relative positions of the catcher? No I'm not: because the game "wraps" in the horizontal direction. So if the block falls off the screen on the left, it reappears, as if by magic, on the right. The agent also reappears on the left/right if he disappears on the right/left. As a consequence, we only have to count the 20 relative positions between falling block and catching agent.
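If you like your counting done by machine, here is the enumeration (a toy sketch of the scenario count, not the actual simulation code):

```python
from itertools import product

positions  = range(20)              # relative position of block and catcher (the world wraps)
sizes      = ('small', 'large')
directions = ('left', 'right')

scenarios = list(product(positions, sizes, directions))
print(len(scenarios))               # 20 x 2 x 2 = 80
```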

As the agents become more proficient at catching (and avoiding) blocks, our measure R increases steadily. But not only can we measure how much of this world is represented in the agent's brain, we can literally figure out what they are thinking about!

Is this magic?

Not at all, it is information theory. The way we do this, is by defining a few (binary) concepts that we think may be important for the agent, such as:

Is the block to my left or to my right?
Is the block moving left or right?
Is the block currently triggering one of my sensors?
Is the block large or small?

Granted, the world itself can be in 1,600 different possible states. (Yes, we counted). These 4 concepts only cover two to the power of 4, or 16 possible states. But we believe that the agent may want to think about these four concepts in order to come to a decision; that these are essential concepts in this task. 

Of course, we may be wrong.

But we can measure which of the twelve neurons encode each of the four concepts, and we can even determine the time when they have become adapted to this feature. So, do the agents pay attention to these four concepts as they learn how to make a living in this world?

Not exactly, actually. That would be too simple. These concepts are important to a bunch of neurons, to be sure. But it is not like a single neuron evolves to pay attention to "big or small" while another tells the agent whether the block is moving left or right. Rather, these concepts are "smeared" across a bunch of neurons, and there is synergy between concepts. Synergy means that if two (or more) neurons are encoding a concept together synergistically, then together they have more information about it than the sum of the information that each one has by itself. 
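In symbols (my shorthand, not necessarily the paper's notation): two neurons $N_1$ and $N_2$ encode a concept $C$ synergistically if

$$I(N_1,N_2;C) > I(N_1;C) + I(N_2;C),$$

that is, if the pair carries more information about the concept jointly than the sum of what each carries alone.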

So what does all of this teach us?

It means (and of course I'm biased here) that we have learned a great deal about representation. We can measure how much a brain represents about its world within its states information-theoretically, and we can (with some astute guessing) even spy on what concepts the brain uses to make decisions. We can even see these concepts form as the brain is processing the information. At the first time step, the brain is pretty much clueless: what it sees can lead to anything. After the second time step, it can rule out a bunch of different scenarios, and as time goes by, the idea of what the agent is looking at forms. It is a hazy picture at first, for sure. But as more and more information is integrated, the point in time arrives where the agent's mental image is crystal clear: THIS is what I'm dealing with, and this is why I move THAT way. 

It is but a small step, for sure. Do brains really work like this? Can we measure representation in real biological brains? Figure out what an organism thinks about, and how decisions are made? 

If any of our information theory is correct, it is just a matter of technology to get the kind of data that will provide answers to these questions. That technology is far from trivial. In order to determine what we know about the brains that we evolve, we have to have the time series of neuronal firing (000010100010 etc.) for all neurons, for a considerable amount of time (such as the entire history of experiencing all 80 experimental conditions). That's fine for our simple little world, but it is not at all OK for any realistic system. Obtaining this type of resolution for animals is almost completely unheard of. Daniel Wagenaar (formerly at Caltech and now at the University of Cincinnati) can do this for 400 neurons in the ganglion of the medicinal leech. Yes, the thing seen on the left. Don't judge, it has very big neurons!

And, we are hoping to use Daniel's data to peer into the leech's brain, see what it is thinking about. We expect that food and mating are the variables we find. Not very original, I know. But wouldn't that be a new world? Not only can we measure how much a brain represents, we can also see what it is representing! As long as we have any idea about what the concepts could be that the animals are thinking about, that is. 

I do understand, from watching current politics, that this may be impossible for humans. But yet, we are undeterred! 

Article reference: L. Marstaller, A. Hintze, and C. Adami (2013). The evolution of representation in simple cognitive networks. Neural Computation 25:2079-2107.













What is Information? (Part I: The Eye of the Beholder)


Information is a central concept in our daily life. We rely on information in order to make sense of the world: to make "informed" decisions. We use information technology in our daily interactions with people and machines. Even though most people are perfectly comfortable with their day-to-day understanding of information, the precise definition of information, along with its properties and consequences, is not always as well understood. I want to argue in this series of blog posts that a precise understanding of the concept of information is crucial to a number of scientific disciplines. Conversely, a vague understanding of the concept can lead to profound misunderstandings, within daily life and within the technical scientific literature. My purpose is to introduce the concept of information, mathematically defined, to a broader audience, with the express intent of eliminating a number of common misconceptions that have plagued the progress of information science in different fields.

What is information? Simply put, information is that which allows you (who is in possession of that information) to make predictions with accuracy better than chance. Even though the previous sentence appears glib, it captures the concept of information fairly succinctly. But the concepts introduced in this sentence need to be clarified. What do I mean by prediction? What is "accuracy better than chance"? Predictions of what? 

We all understand that information is useful. When was the last time you found information to be counterproductive? Perhaps it was the last time you watched the News. I will argue that, when you thought that the information you were given was not useful, then what you were exposed to was most likely not information. That stuff, instead, was mostly entropy (with a little bit of information thrown in here or there). Entropy, in case you have not yet come across the term, is just a word we use to quantify how much you don't know. Actually, how much anybody doesn't know. (I'm not just picking on you).

But, isn't entropy the same as information?

One of the objectives of these posts is to make the distinction between the two as clear as I can. Information and entropy are two very different objects. They may have been used synonymously (even by Claude Shannon—the father of information theory—thus being responsible in part for a persistent myth) but they are fundamentally different. If the only thing you will take away from this article is your appreciation of the difference between entropy and information, then I will have succeeded.

But let us go back to our colloquial description of what information is, in terms of predictions. "Predictions of what"? you should ask. Well, in general, when we make predictions, it is about a system that we don't already know. In other words, an other system. This other system can be anything: the stock market, a book, the behavior of another person. But I've told you that we will make the concept of information mathematically precise. In that case, I have to specify this "other system" as precisely as I can. I have to specify, in particular, which states the system can take on. This is, in most cases, not particularly difficult. If I'm interested in quantifying how much I don't know about a phone book, say, I just need to tell you the number of phone numbers in it. Or, let's take a more familiar example (as phone books may appeal, conceptually, only to the older crowd among us), such as the six-sided fair die. What I don't know about this system is which side is going to be up when I throw it next. What you do know is that it has six sides. How much don't you know about this die? The answer is not six. This is because information (or the lack thereof) is not defined in terms of the number of unknown states. Rather, it is given by the logarithm of the number of unknown states. 

"Why on Earth introduce that complication?", you ask.

Well, think of it this way. Let's quantify your uncertainty (that is, how much you don't know) about a system (System One) by the number of states it can be in. Say this is $N_1$. Imagine that there is another system (System Two), and that one can be in $N_2$ different states. How many states can the joint system (System One And Two Combined) be in? Well, for each state of System One, there can be $N_2$ number of states. So the total number of states of the joint system must be $N_1\times N_2$. But our uncertainty about the joint system is not $N_1\times N_2$. Our uncertainty adds, it does not multiply. And fortunately the logarithm is that one function where the log of a product of elements is the sum of the logs of the elements. So, the uncertainty about the system $N_1\times N_2$ is the logarithm of the number of states
$$H(N_1N_2)=\log(N_1N_2)=\log(N_1) + \log(N_2).$$
I had to assume here that you knew about the properties of the log function. If this is a problem for you, please consult Wikipedia and continue after you digested that content.
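To make this concrete with two systems we will keep coming back to: take a fair coin ($N_1=2$) and a six-sided die ($N_2=6$). The joint system has $2\times 6=12$ states, and

$$H(2\times 6)=\log 12=\log 2+\log 6,$$

so the number of states multiplies while the uncertainties add, just as promised.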

Phew, I'm glad we got this out of the way. But, we were talking about a six-sided die. You know, the type you've known all your life. What you don't know about the state of this die (your uncertainty) before throwing it is $\log 6$. When you peek at the number that came up, you have reduced your uncertainty (about the outcome of this throw) to zero. This is because you made a perfect measurement. (In an imperfect measurement, you only got a glimpse of the surface that rules out a "1" and a "2", say.) 

What if the die wasn't fair? Well that complicates things. Let us for the sake of the argument assume that the die is so unfair that one of the six sides (say, the "six") can never be up. You might argue that the a priori uncertainty of the die (the uncertainty before measurement) should now be $\log 5$, because only five of the states can be the outcome of the measurement. But how are you supposed to know this? You were not told that the die is unfair in this manner, so as far as you are concerned, your uncertainty is still $\log 6$. 

Absurd, you say? You say that the entropy of the die is whatever it is, and does not depend on the state of the observer? Well, I'm here to say that if you think that, then you are mistaken. Physical objects do not have an intrinsic uncertainty. I can easily convince you of that. You say the fair die has an entropy of $\log 6$? Let's look at an even simpler object: the fair coin. Its entropy is $\log 2$, right? What if I told you that I'm playing a somewhat different game, one where I'm not just counting whether the coin comes up heads or tails, but am also counting the angle that the face has made with a line that points towards True North. And in my game, I allow four different quadrants, like so:


Suddenly, the coin has $2\times4$ possible states, just because I told you that in my game the angle that the face makes with respect to a circle divided into 4 quadrants is interesting to me. It's the same coin, but I decided to measure something that is actually measurable (because the coin's faces can be in different orientations, as opposed to, say, a coin with a plain face but two differently colored sides). And you immediately realize that I could have divided the circle into as many sectors as I can possibly resolve by eye. 

Alright fine, you say, so the entropy is $\log(2\times N)$ where $N$ is the number of resolvable angles. But you know, what is resolvable really depends on the measurement device you are going to use. If you use a microscope instead of your eyes, you could probably resolve many more states. Actually, let's follow this train of thought. Let's imagine I have a very sensitive thermometer that can sense the temperature of the coin. When throwing it high, the energy the coin absorbs when hitting the surface will raise the temperature of the coin slightly, compared to one that was tossed gently. If I so choose, I could include this temperature as another characteristic, and now the entropy is $\log(2\times N\times M)$, where $M$ is the number of different temperatures that can be reliably measured by the device. And you know that I can drive this to the absurd, by deciding to consider the excitation states of the molecules that compose the coin, or of the atoms composing the molecules, or nuclei, the nucleons, the quarks and gluons? 

The entropy of a physical object, it dawns on you, is not defined unless you tell me which degrees of freedom are important to you. In other words, it is defined by the number of states that can be resolved by the measurement that you are going to be using to determine the state of the physical object. If it is heads or tails that counts for you, then $\log 2$ is your uncertainty. If you play the "4-quadrant" game, the entropy of the coin is $\log 8$, and so on. Which brings us back to the six-sided die that has been mysteriously manipulated to never land on "six". You (who do not know about this mischievous machination) expect six possible states, so this dictates your uncertainty. Incidentally, how do you even know the die has six sides it can land on? You know this from experience with dice, and having looked at the die you are about to throw. This knowledge allowed you to quantify your a priori uncertainty in the first place. 

Now, you start throwing this weighted die, and after about twenty throws or so without a "six" turning up, you start to become suspicious. You write down the results of a longer set of trials, and note this curious pattern of "six" never showing up, but the other five outcomes appearing with roughly equal frequency. What happens now is that you adjust your expectation. You now hypothesize that it is a weighted die with five equally likely outcomes, and one that never occurs. Now your expected uncertainty is $\log 5$. (Of course, you can't be 100% sure.)

But you did learn something through all these measurements. You gained information. How much? Easy! It's your uncertainty before you started to be suspicious, minus the uncertainty after it dawned on you. The information you gained is just $\log 6-\log 5$. How much is that? Well, you can calculate it yourself. You didn't give me the base of the logarithm, you say? 

Well, that's true. Without specifying the logarithm's base, the information gained is not specified. It does not matter which base you choose: each base just gives units to your information gain. It's kind of like asking how much you weigh. Well, my weight is one thing. The number I give you depends on whether you want to know it in kilograms, or pounds. Or stones, for all it matters.

If you choose the base of the logarithm to be 2, well then your units will be called "bits" (which is what we all use in information theory land). But you may choose Euler's number $e$ as your base. That makes your logarithms "natural", but your units of information (or entropy, for that matter) will be called "nats". You can define other units (and we may get to that), but we'll keep it at that for the moment. 

So, if you choose base 2 (bits), your information gain is $\log_2(6/5)\approx 0.263$ bits. That may not sound like much, but in a Vegas-type setting this gain of information might be worth, well, a lot. Information that you have (and those you play with do not) can be moderately valuable (for example, in a stock market setting), or it could mean the difference between life and death (in a predator/prey setting). In any case, we should value information.  
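A one-line check, if you want to verify that number yourself (Python, base-2 logarithms):

```python
import math
print(math.log2(6) - math.log2(5))   # 0.263... bits gained
```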

As an aside, this little example where we used a series of experiments to "inform" us that one of the six sides of the die will not, in all likelihood, ever show up, should have convinced you that we can never know the actual uncertainty that we have about any physical object, unless the statistics of the possible measurement outcomes of that physical object are for some reason known with infinite precision (which you cannot attain in a finite lifetime). It is for that reason that I suggest to the reader to give up thinking about the uncertainty of any physical object, and be only concerned with differences between uncertainties (before and after a measurement, for example). 

The uncertainties themselves we call entropy. Differences between entropies (for example before and after a measurement) are called information. Information, you see, is real. Entropy on the other hand: in the eye of the beholder.

In this series on the nature of information, I expect the next posts to feature more conventional definitions of  entropy and information—meaning, those that Claude Shannon has introduced—(with some examples from physics and biology), then moving on to communication, and the concept of the channel capacity.

Part 2: The Things We Know


The evolution of the circle of empathy

What is the circle of empathy? Empathy, as we all know, is the capacity to feel (or at least recognize) emotions in other entities that have emotions. Many people believe that this capacity is in fact shared by many types of animals. The "circle of empathy" is  a boundary within which each individual places the things he or she empathizes with. Usually, this only includes people and possibly certain animals, but is unlikely to include inanimate objects, and very rarely plants or microbes. This circle is intensely personal, however. (Psychopaths, for example, seem to have no circle of empathy whatsoever.) Incidentally, I thought I had invented the term, but it turns out that Jaron Lanier has used it before me in a similar fashion, as has the bioethicist Peter Singer. What I would like to discuss here is the evolution of our circle of empathy over time, what this trend says about us, and think about where this might lead us in the long run.

When we go way, way back in time, life was different. There wasn't what we now call "society", or even "civilization". There were people, animals, and plants. And there was the sun rising predictably in the morning, and setting in the evening just as expected. But everything else was less predictable. Life was "fraught with perils" (as a lazy writer would write). Life was uncertain. What is the best mode of survival in this world?

"Trust no-one", the X-files may exhort you, but in truth, you've got to trust somebody. The life (and survival) of the Lone Ranger is not predicated on loneliness; he too must rely on the kindness of strangers and companions.  Life is more predictable when you can trust. But who do you trust, then? Of course, you trust family first: this is the primal empathic circle: you feel for your family, and expect they feel for you. Emotions are almost sure-fire guarantors of behavior. From this point of view, emotions protect, and make life a little more predictable. 

As we evolve, we learn that expanding the circle of empathy is beneficial. When it comes to protecting the family, as well as the things we have gained, it is beneficial to gang up with those that have an equal amount to lose. "Let us forge a band of brothers that is not strictly limited to brothers and sisters; we who defend the same stake, let us stand as one against those that strive to tear us down".

Thus, through ongoing conflicts, new bonds are forged. We may not be related in the familial manner, but we are alike, and our costs and benefits are aligned. The circle of empathy has widened. 

Time, relentlessly, goes on. And the circle of empathy inevitably widens (on average). Yes, don't get me wrong, I fully understand that human history is nothing but a wildly careening battle between the forces that compel us to love our fellow man, and the urge to destroy those who are perceived to interfere with our plans of advancement. Throughout history, the circle of empathy may widen for a while, then restrict. People perceived to be different  (often, in fact, perceived as inferior) may be admitted to the circle for some (sometimes even most), but just as often dismissed. Yet, over time, the circle appears to inexorably widen. 

There is no doubt about this trend, really.  From the family, the circle expanded to encompass the clans that were probably closely related. From those, the circle expanded to cities, city states, and finally countries. At this point it was just a matter of time until humans expressed their empathy with respect to all humankind. "We are all one", the idealist would invariably exclaim (mindful that not everyone on Earth has evolved to be quite as magnificent, or magnanimous).  Our many differences aside, the widening of the circle of empathy is palpable. The tragedy of September 11th 2001, for example, was genuinely felt to be a tragedy by the majority of people on the globe. 

It is also clear that the evolution of the circle's radius proceeds by a widening in a few individuals first, who then spend a good portion of their lives convincing their fellow humans that they ought to widen their circles just as much. Civil rights struggles and equal rights campaigns can be subsumed this way. Anti-abortion crusaders would like everyone to include the unborn fetus into their circle of empathy. Many vegetarians have chosen not to eat meat for the simple reason that they have included all animals within their circle of empathy.

Given that the dynamics of the widening of the circle on average is driven by a few pioneers who widened theirs ahead of everyone else, how far should we expect to widen our own circles? For example, I am not a vegetarian. I do empathize with animals, but like most people I know, my empathy has its limits. I generally do not kill animals, but when insects find their way into my house I consider that a territorial transgression. Given the nervous system of most insects, it is unlikely that they perceive pain in any manner comparable with how we perceive it.

And this is probably the line of empathy that will most likely be drawn by the majority of people at some point in the future: if animals can perceive pain just as we do, then we are likely to include them into our circle. The more complex they are cognitively, the more likely we would have them in our circle.

The trouble is, the cognitive complexity of animals isn't easily accessible to us. We empathize with the great apes (the group of primates that, besides the gorilla, chimpanzee, and orangutan, also includes us) in part because they are so similar to us. But cetaceans (the group of animals that includes whales and dolphins) have at least as complex a cognitive system as the great apes, yet appear on far fewer people's radar.


Bottlenose dolphin. (Source: Wikimedia)

The neuroscientist Lori Marino, for example (who together with Diana Reiss first published evidence that bottlenose dolphins can recognize themselves in a mirror) has been pushing for the ethical treatment of cetaceans (and therefore for a widening of our circle of empathy to include cetaceans) using scientific arguments based on both behavioral and neuroanatomical evidence. She (as well as people like the lawyer Steven Wise) has been pushing for "non-human legal rights" for certain groups of animals, thus enshrining the widened circle into law. From this point of view, the recent analysis of the methods used by Japanese dolphin hunters to round up and kill dolphins is another stark reminder of how different the radius of the circle can be among fellow humans (and how culture and ethnic heritage affect it).

All this leads me back to a thought I have touched upon in a previous post: if higher cognitive capacities are associated with things we call "consciousness" and "self-awareness", perhaps we need to be able to better capture them mathematically, and therefore ultimately make them measurable. If we were to achieve this, then we may end up with a scale that gives us clear guidelines on what the radius of our circle of empathy should be, rather than waiting for more enlightened people to show us the path.

It is unlikely that this circle will encompass plants, microbes, or even insects. But there are surely animals out there who, on account of their not being able to talk to us, have been suffering tremendously. Looking at this from the vantage point of our future more enlightened selves, we should really figure out how to draw the line, somehow, sooner rather than later. I don't know where that line is, but I'm pretty sure that my line will evolve in time, and yours will too.