What is Information? (Part 4: Information!)

This is the 4th (and last) installment of the "What is Information" series. The one where I finally get to define for you what information is. The first three parts can be found here: Part 1, Part 2, Part 3. Let me quickly summarize the take-home points of parts 1-3, just in case you are reading this temporally removed from them.

1.) Entropy, also known as "uncertainty", is something that is mathematically defined for a "random variable". But physical objects aren't mathematical. They are messy, complicated things. They become mathematical when observed through the looking glass of a measurement device that has a finite resolution. We then understand that a physical object does not "have an entropy". Rather, its entropy is defined by the measurement device we choose to examine it with. Information theory is a theory of the relative state of measurement devices.

2.) Entropy, also known as uncertainty, quantifies how much you don't know about something (a random variable). But in order to quantify how much you don't know, you have to know something about the thing you don't know. These are the hidden assumptions in probability theory and information theory. These are the things you didn't know you knew.

3.) Shannon's entropy is written in terms of "$\log p$", but these $p$ are really conditional probabilities whenever they are not uniform (that is, not all the same for all states). They are non-uniform only given something else that you know, such as the very fact that they are not uniformly distributed. Duh.

Alright, now we're up to speed. So we have the unconditional entropy, which is the one where we know nothing about the system that the random variable describes. We call that $H_{\rm max}$, because an unconditional entropy must be maximal: it tells us how much we don't know if we don't know anything. Then there is the conditional entropy $H=-\sum_i p_i\log p_i$, where the $p_i$ are conditional probabilities. They are conditional on some knowledge. Thus, $H$ tells you what remains to be known. Information is "what you don't know minus what remains to be known given what you know". There it is. Clear?
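To make this concrete, here is a minimal sketch (in Python with NumPy, my own illustration rather than anything from the original series) for a loaded six-sided die: the unconditional entropy is $H_{\rm max}=\log_2 6$, the conditional entropy $H$ is computed from a non-uniform distribution that encodes what we know, and the information is their difference.

```python
import numpy as np

def entropy(p):
    """Shannon entropy -sum_i p_i log2 p_i, in bits (terms with p_i = 0 are dropped)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# A loaded six-sided die: the non-uniform p_i encode what we already know about it.
p = np.array([0.5, 0.1, 0.1, 0.1, 0.1, 0.1])

H_max = np.log2(len(p))   # unconditional entropy: knowing nothing, all 6 states are equally likely
H = entropy(p)            # conditional entropy: what remains to be known, given the p_i
I = H_max - H             # information: what you don't know minus what remains to be known

print(f"H_max = {H_max:.3f} bits, H = {H:.3f} bits, I = {I:.3f} bits")
```

For this particular distribution the difference comes out to about 0.42 bits: that is how much the non-uniform $p_i$ already tell you about the die.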

"Hold on, hold on. Hold on for just a minute"

What?

"This is not what I've been reading in textbooks."

You read textbooks?

"Don't condescend. What I mean is that this is not what I would read in, say, Wikipedia, that you link to all the time. For example, in this link to `mutual information'. You're obviously wrong. 

Because, you know, it's Wiki!"

So tell me what it is that you read.

"It says there that the mutual information is the difference between the entropy of random variable $X$,  $H(X)$, and the conditional entropy $H(X|Y)$, which is the conditional entropy of variable $X$ given you know the state of variable $Y$."

"Come to think of it, you yourself defined that conditional entropy at the end of your part 3

. I think it is Equation (5) there!" And there is this Venn diagram on Wiki. It looks like this:"

[Venn diagram of the entropies $H(X)$ and $H(Y)$ and their overlap. Source: Wikimedia]

Ah, yes. That's a good diagram. Two variables $X$ and $Y$. The red circle represents the entropy of $X$, the blue circle the entropy of $Y$. The purple thing in the middle is the shared entropy $I(X:Y)$, which is what $X$ knows about $Y$. Also what $Y$ knows about $X$. They are the same thing.

"You wrote $I(X:Y)$ but Wiki says $I(X;Y)$. Is your semicolon key broken?"

Actually, there are two notations for the shared entropy (a.k.a information) in the literature. One uses the colon, the other the semicolon. Thanks for bringing this up. It confuses people. In fact, I wanted to bring up this other thing....

"Hold on again. You also keep on saying "shared entropy" when Wiki says "shared information". You really ought to pay more attention."

Well, you see, that's a bit of a pet peeve of mine. Just look at the diagram above. The thing in the middle, the purple area, is a shared entropy. Information is shared entropy. "Shared information" would be, like, shared shared entropy. That's a bit ridiculous, don't you think?

"Well, if you put it like this. I see your point. But why do I read "shared information" everywhere?"

That is, dear reader, because people are confused about what to call entropy, and what to call information. A sizable fraction of the literature calls what we have been calling "entropy" (or uncertainty) "information". You can see this even in the book by Shannon and Weaver (which, come to think of it, was edited by Weaver, not Shannon). When you do this, then what is shared by the "informations" is "shared information". But that does not make any sense, right?

"I don't understand. Why would anybody call entropy information? Entropy is what you don't know, information is what you know. How could you possibly confuse the two?"

I'm with you there. Entropy is "potential information". It quantifies "what you could possibly know". But it is not what you actually know. I think, between you and me, that it was just sloppy writing at first, which then ballooned into a massive misunderstanding. Both entropy and information are measured in bits, and so people would just flippantly say "a coin has two bits of information" when they mean to say "two bits of entropy". And it's all downhill from there.

I hope I've made my point here. Being precise about entropy and information really matters. Colon vs. semicolon does not. Information is "unconditional entropy minus conditional entropy". When cast as a relationship between two random variables $X$ and $Y$, we can write it as $I(X:Y)=H(X)-H(X|Y)$.

And because information is symmetric in the one who measures and the one who is being measured (remember: a theory of the relative state of measurement devices...) this can also be written as $I(X:Y)=H(Y)-H(Y|X)$.

And both formulas can be verified by looking at the Venn diagram above.
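If you prefer to check the symmetry numerically rather than by staring at circles, here is a small sketch (Python/NumPy again, with a joint distribution I made up for illustration; none of this is from the post itself). It computes both sides from a joint distribution $p(x,y)$, using the chain rule $H(X|Y)=H(X,Y)-H(Y)$.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits; zero-probability entries are ignored."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# A made-up joint distribution p(x, y) for two binary variables X and Y.
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])

p_x = p_xy.sum(axis=1)     # marginal distribution of X
p_y = p_xy.sum(axis=0)     # marginal distribution of Y

H_X, H_Y, H_XY = entropy(p_x), entropy(p_y), entropy(p_xy)

H_X_given_Y = H_XY - H_Y   # chain rule: H(X|Y) = H(X,Y) - H(Y)
H_Y_given_X = H_XY - H_X   # chain rule: H(Y|X) = H(X,Y) - H(X)

I_1 = H_X - H_X_given_Y    # I(X:Y) = H(X) - H(X|Y)
I_2 = H_Y - H_Y_given_X    # I(X:Y) = H(Y) - H(Y|X)

print(f"I(X:Y) = {I_1:.4f} bits = {I_2:.4f} bits")   # both give the same number
```

Whichever marginal you start from, you end up in the same purple overlap of the diagram.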

"OK, this is cool."

"Hold on, hold on!"

What is it again?

"I just remembered. This was all a discussion that came after I brought up Wikipedia, that says that information was $I(X:Y)=H(X)-H(X|Y)$, while you said it was $H_{\rm max}-H$, where the $H$ was clearly an entropy that you write as $H=-\sum_i p_i\log p_i$. All you have to do is scroll up, I'm not dreaming this!"

So you are saying that textbooks write
$$I=H(X)-H(X|Y), \qquad (1)$$
while I write instead
$$I=H_{\rm max}-H(X), \qquad (2)$$
where $H(X)=-\sum_i p_i\log p_i$. Is that what you're objecting to?

"Yes. Yes it is."

Well, here it is in a nutshell. In (1), information is defined as the actual observed entropy of $X$ minus the actual observed entropy of $X$ given that I know the state of $Y$ (whatever that state may be).

In (2), information is defined as the difference between what I don't know about $X$ when I know nothing at all (not even the things I may implicitly know already), and the actual uncertainty of $X$. The latter does not mention a system $Y$. It quantifies my knowledge of $X$ without specifying where that knowledge comes from. If the probability distribution with which I describe $X$ is not uniform, then I know something about $X$. My $I$ in Eq. (2) quantifies that. Eq. (1) quantifies what I know about $X$ above and beyond what I already know via Eq. (2): it quantifies specifically the information that $Y$ conveys about $X$. So you could say that the total information that I have about $X$, given that I also know the state of $Y$, would be $I_{\rm total}=H_{\rm max}-H(X) + H(X)-H(X|Y)=H_{\rm max}-H(X|Y)$.
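A quick numerical sanity check of that bookkeeping (again Python/NumPy, with a hypothetical joint distribution and variable names of my own choosing): the piece from Eq. (2), the piece from Eq. (1), and their sum $H_{\rm max}-H(X|Y)$.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits; zero-probability entries are ignored."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Hypothetical joint distribution p(x, y); the marginal of X is non-uniform,
# so both pieces of the information are non-zero.
p_xy = np.array([[0.50, 0.10],
                 [0.05, 0.35]])

p_x = p_xy.sum(axis=1)                        # marginal of X: [0.6, 0.4]
p_y = p_xy.sum(axis=0)                        # marginal of Y

H_max = np.log2(len(p_x))                     # what I don't know, knowing nothing about X
H_X = entropy(p_x)                            # what remains unknown given p(x)
H_X_given_Y = entropy(p_xy) - entropy(p_y)    # H(X|Y) = H(X,Y) - H(Y)

I_prior  = H_max - H_X                        # Eq. (2): information carried by p(x) itself
I_from_Y = H_X - H_X_given_Y                  # Eq. (1): what Y tells me about X on top of that
I_total  = H_max - H_X_given_Y                # the two pieces add up to this

print(f"{I_prior:.3f} + {I_from_Y:.3f} = {I_total:.3f} bits")
```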

So the difference between what I would write and what textbooks write is really only in the unconditional term: it should be maximal. But in the end, Eqs. (1) and (2) simply refer to different informations. (2) is information, but I may not be aware how I got into possession of that information. (1) tells me exactly the source of my information: the variable $Y$. 

Isn't that clear?

"I'll have to get back to you on that. I'm still reading. I think I have to read it again. It sort of takes some getting used to." 

I know what you mean. It took me a while to get to that place. But, as I hinted at in the introduction to this series, it pays off big time to have your perspective adjusted, so that you know what you are talking about when you say "information". I will be writing a good number of blog posts that reference "information", and many of those are a consequence of research that was only possible by understanding the concept precisely. I wrote a series on information in black holes already (starting here). That's just the beginning. There will be more to come, for example on information and cooperation: how you can only fruitfully engage in the latter if you have the former.

I know it sounds more like a threat than a promise. I really mean it to be a promise.