The Importance of Asking Why — “The Book of Why: The New Science of Cause and Effect”
“The lack of progress in AGI is due to a severe logjam of misconceptions. Without Popperian epistemology, one cannot even begin to guess what detailed functionality must be achieved to make an AGI. And Popperian epistemology is not widely known, let alone understood well enough to be applied.”
So claims physicist and researcher David Deutsch near the end of his article on why we are on the wrong path to Artificial General Intelligence.
I believe Deutsch is correct: we aren’t yet headed in the right direction. To understand how to create a true Artificial General Intelligence that is as capable of learning as a human being will require researching and understanding how we humans can come up with creative conjectures to test and criticize in the first place.
The Scientific Method
Consider, for example, a fairly typical version of the ‘the scientific method’ (in this case from the very respectable Khan Academy):
- Make an observation.
- Ask a question.
- Form a hypothesis, or testable explanation.
- Make a prediction based on the hypothesis.
- Test the prediction.
- Iterate: use the results to make new hypotheses or predictions.
Most likely some version of the scientific method is being taught in schools to your children today, despite the fact that philosopher Karl Popper demonstrated that the real ‘scientific method’ does not start with observations, but rather with problems to be solved – problems arising from previous beliefs or theories – and then proceeds by conjecturing ideas that might solve those problems. Popper showed that the conjectures (hypotheses) that withstand criticism (including experimental testing) are the ones that survive and become “scientific theories,” while their competitors become “falsified” and die out.
Yes, a problem is an ‘observation’ of sorts, I suppose, but merely telling people science starts with observations is as misleading as telling teenagers that the best way to get a first job is to list all their work experience. Popper used to demonstrate this to his students by asking them to ‘observe’ and then waiting for the confusion to set in. “What should we be observing?” they’d ask. They had no idea how to comply with his request. (Conjectures and Refutations, p. 61) And of course they didn’t, since what we ‘observe’ is contingent on our understanding of the world. Our view of the world is ‘theory-impregnated’, he’d explain. (The Myth of the Framework, p. 53) There is therefore “no such thing as a ‘pure’ observation.” (p. 86) (1)
Judea Pearl is a huge name in Artificial Intelligence, particularly in Bayesian probability theory. In fact, he coined the term Bayesian Networks. So I was a bit surprised to discover that this giant of Bayesian probability has abandoned his belief that Bayesian theory is a path to Artificial General Intelligence (AGI). He wrote “The Book of Why” in part to explain why he feels Bayesian theory is insufficient and what he thinks is the right path forward. (2) Interestingly, Pearl’s most recent book is practically (perhaps unintentionally) a study in where we’ve gone wrong in Artificial Intelligence and why Deutsch is right that until we learn the lessons of Popper we’re doomed to be unable to create Artificial General Intelligences.
Causality vs Correlation
Pearl spends a great deal of time documenting the history – or lack thereof – of causality in statistics. This is a fascinating story with a strong moral lesson in the blindness due to preconceived notions. His story of the difficulties the scientific community had even just stating the rather obvious “smoking causes cancer” makes you want to scream in frustration at the scientific community of yesteryear.
And yet, in what sense does smoking “cause” cancer? Yes, the two strongly correlate, but every first-year student knows that ‘correlation is not causation.’ Pearl points out the distinct difference between trying to claim smoking causes cancer vs., say, demonstrating that a lack of vitamin C causes scurvy. In the second case, the correlation is 100%. If you don’t get enough vitamin C, you get scurvy.
Smoking isn’t like this because some people smoke all their lives without getting lung cancer and some people get lung cancer never having smoked. The scientific community’s hesitance to declare smoking a “cause” of cancer becomes more understandable in that light. The skeptic merely had to ask, “what if something else is causing both smoking and cancer?” The scientific community did not know how to respond.
To put it more technically, let’s imagine we have an input X representing whether or not a person smokes. Let’s say we set X to 1 if they do smoke and 0 if they don’t. What we want to know is whether X causes Y – a variable that is 1 if the person gets lung cancer and 0 if they do not. Pearl would graph this out like this:
Here we’re showing that smoking causes cancer; that is to say, if you take up smoking your chances of cancer go up because smoking affects your body such that it’s more likely to get cancer. But it’s impossible to come up with this explanation just by looking at the data. All we really know is that Smoking and Cancer correlate strongly. To put this into the form of probability theory, we’re saying that the probability of getting cancer is higher if we know the person smokes. In probability theory, we would write this as P(Cancer), which reads “the probability of having cancer.” For the sake of keeping things easy, let’s say that one out of every 100 people gets cancer during their life. We’d write this as P(Cancer) = 0.01.
But what if we knew that a person was a smoker? Let’s say that we knew, from data, that 1 out of every 10 people gets cancer if they smoke. The way we’d write this is P(Cancer|Smoker) = 0.1, which can be read as “the probability of getting cancer, given that the person is a smoker, is 0.1.” The variable “Smoker” after the “|” is usually called “evidence.” If you have evidence that someone is a smoker, you know their probability of cancer is (using our simple numbers) 10 times higher. This is called “conditioning” on being a smoker.
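As a concrete (toy) illustration, here is how conditioning works on raw counts. The population numbers below are invented purely to match the article’s illustrative figures (P(Cancer) = 0.01 overall and P(Cancer|Smoker) = 0.1):

```python
# Hypothetical population counts, made up only to match the article's
# illustrative numbers. Keys are (smoker, cancer) pairs; values are
# people out of a population of 100,000.
counts = {
    (1, 1): 800,    # smokers who get cancer
    (1, 0): 7200,   # smokers who do not
    (0, 1): 200,    # non-smokers who get cancer
    (0, 0): 91800,  # non-smokers who do not
}

total = sum(counts.values())

def p(cancer=None, smoker=None):
    """Marginal or joint probability estimated from the counts."""
    match = lambda key: ((cancer is None or key[1] == cancer)
                         and (smoker is None or key[0] == smoker))
    return sum(n for key, n in counts.items() if match(key)) / total

def p_cond(cancer, smoker):
    """P(Cancer = cancer | Smoker = smoker): conditioning on the evidence."""
    joint = counts[(smoker, cancer)] / total
    return joint / p(smoker=smoker)

print(p(cancer=1))          # P(Cancer): the 0.01 base rate
print(p_cond(1, smoker=1))  # P(Cancer|Smoker): 10x the base rate
```

Conditioning just narrows the denominator from the whole population to the smokers; nothing in this calculation touches causation.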
Now it’s tempting to say that because smokers are ten times more likely to get cancer that this must be ‘proof’ or at least ‘evidence’ that smoking causes cancer.
However, consider that if you know someone has lung cancer, you also know it’s more likely that they were a smoker. Or in other words P(Smoker|Cancer) > P(Smoker): which means that the probability of someone being a smoker, if you know they have lung cancer, is greater than the probability of just some random person off the street being a smoker. But does that then mean that cancer causes smoking? Of course not. But how do you know that for sure?
Notice that the correlation runs both ways: P(Cancer|Smoker) > P(Cancer) and P(Smoker|Cancer) > P(Smoker). Correlation is symmetric, so it really isn’t possible to determine causation using correlation at all. How, then, can you be sure that Smoking causes Cancer rather than Cancer causing Smoking?
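Using the same made-up numbers as before, Bayes’ theorem flips the conditional and recovers P(Smoker|Cancer) directly. Note that nothing in this arithmetic says anything about which way causation runs:

```python
# Illustrative numbers only (same toy population as before):
p_cancer = 0.01               # P(Cancer)
p_smoker = 0.08               # P(Smoker)
p_cancer_given_smoker = 0.1   # P(Cancer|Smoker)

# Bayes' theorem: P(Smoker|Cancer) = P(Cancer|Smoker) * P(Smoker) / P(Cancer)
p_smoker_given_cancer = p_cancer_given_smoker * p_smoker / p_cancer

print(p_smoker_given_cancer)  # ~0.8, far above the 0.08 base rate
```

Both conditionals are inflated above their base rates by the same underlying correlation; the direction of the “|” is a bookkeeping choice, not a causal claim.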
But worse yet, isn’t it possible that there is some third variable causing both smoking and cancer? Maybe there is a gene that causes cancer and also causes people to crave smoking. This alternative hypothesis would be graphed like this:
In this second graph, the “smoking gene” (if it were real) is what we call a “Confounder,” which is a variable that causes Cancer and causes a person to take up smoking. If this were true, then the fact that smoking correlates with cancer would not mean that smoking actually causes cancer.
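We can make the skeptic’s point vivid with a small simulation. In this sketch (all probabilities invented), the hypothetical gene raises both the chance of smoking and the chance of cancer, while smoking itself has no causal effect on cancer at all — and yet smoking and cancer still correlate strongly in the resulting data:

```python
import random

random.seed(0)

def simulate_person():
    """One person under the (hypothetical) 'smoking gene' model.

    The gene drives both smoking and cancer; smoking does NOTHING here.
    All probabilities are made up for illustration.
    """
    gene = random.random() < 0.1
    smoker = random.random() < (0.8 if gene else 0.1)
    cancer = random.random() < (0.3 if gene else 0.01)  # depends on gene only!
    return smoker, cancer

people = [simulate_person() for _ in range(200_000)]

cancer_rate_overall = sum(c for _, c in people) / len(people)
cancer_rate_smokers = (sum(c for s, c in people if s)
                       / sum(1 for s, _ in people if s))

# Smokers get cancer far more often, even though smoking causes nothing.
print(cancer_rate_overall, cancer_rate_smokers)
```

This is exactly why the data alone could not answer the skeptics: a confounded world and a causal world can generate the same correlations.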
The smoking skeptics were, in essence, asking those who believed smoking caused cancer: “How can you be sure there isn’t such a Confounder creating the correlation between Cancer and Smoking? How can you eliminate that possibility?” And the scientific community had no answer to the skeptics, so for decades they just left it as an open question.
Testing Rival Explanations
In fact, there is no way you can ever be certain that such a factor doesn’t exist. The key problem here is that most people see science as seeking justified knowledge or somehow ‘proving’ certain ‘scientific facts.’ But Popper’s philosophy of science (called Critical Rationalism) demonstrated that this was all wrong. Science can never justify any sort of certain knowledge, nor does it need to.
Instead, Popper suggests that while we can never, with certainty, prove beyond doubt that such a Confounder doesn’t exist (nor that cancer does not cause smoking), we can look at both ‘hypotheses’ above and try to criticize both and see which one survives criticism better. If one of our theories survives all criticism and the rest do not, then that is our best theory by definition and we should tentatively accept it.
One way we might do this is to set up an (unethical) experiment where we take 100 people and randomly force half of them to smoke all their lives and the other half not to. This is the famous “Randomized Controlled Trial.” If we did this, then one of the competing hypotheses will start to face increasing problems that it can’t explain. Why? Because we actually intervened and changed the system. If in a regular population people with a ‘smoker’s gene’ were naturally more drawn to smoking and more likely to get cancer, we just eliminated that as a factor by forcing half our test population to smoke and half not to, regardless of the gene. In essence we forced the graph to look like this:
Notice that this is very different than “conditioning” on the same variable, which changes nothing on the graph, but “merely narrows our focus to the subset of cases in which the variable takes the value we are interested in…” (Causal Inference in Statistics, p. 54)
If it turns out that, even with this intervention, a smoker is 10 times more likely to get cancer, then we have created a problem for the other competing hypotheses. How will the “smoker’s gene causes cancer and smoking” (or for that matter “cancer causes smoking”) hypothesis explain this result? But no equivalent problem has been created for the “smoking causes cancer” hypothesis. Thus it’s our ‘surviving’ theory, and thus our best one.
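To see why the intervention settles the question, we can rerun the same (hypothetical) gene-only model from before, but assign smoking by coin flip, as a randomized trial would. Severing the gene-to-smoker arrow makes the spurious correlation vanish:

```python
import random

random.seed(1)

def trial_person():
    """Same invented gene-only model, but smoking is ASSIGNED at random."""
    gene = random.random() < 0.1
    smoker = random.random() < 0.5                      # coin flip, not the gene
    cancer = random.random() < (0.3 if gene else 0.01)  # still gene-only
    return smoker, cancer

people = [trial_person() for _ in range(200_000)]

def cancer_rate(smoked):
    group = [c for s, c in people if s == smoked]
    return sum(group) / len(group)

# If smoking truly did nothing, the two arms should match -- and they do.
print(cancer_rate(True), cancer_rate(False))
```

If instead the real world kept showing a 10x gap under forced assignment, the gene-only model could not account for it, while “smoking causes cancer” could.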
But also notice that ‘intervening’ like this is not the same as conditioning on the data. Or put another way, we cannot find ‘evidence’ from data alone. Indeed, to make sense of the data, we first need one or more hypotheses (i.e. explanations) to work with. (Causal Inference in Statistics, p. 54)
To address this insight, Pearl uses the notation P(Cancer|do(Smoker)), which reads “the probability of getting cancer if you intervene and force the person to smoke.” The notation “do(Smoker)” is called the do-operator, and it is a new addition to probability theory introduced by Pearl and his team. P(Cancer|Smoker) is not the same as P(Cancer|do(Smoker)) because they give different values altogether for some explanations (or causality graphs) vs others. Specifically, if Smoking causes Cancer then P(Cancer|Smoker) = P(Cancer|do(Smoker)). But if there is secretly a Confounder causing the correlation (our “smoker’s gene”) then P(Cancer|Smoker) != P(Cancer|do(Smoker)). (Where “!=” means “does not equal.”)
Pearl points out that this is actually the first time we’ve been able to formalize what a Confounder actually is, i.e. when P(X|Y) != P(X|do(Y)). And through this experiment to test if P(Cancer|Smoker) != P(Cancer|do(Smoker)) we can determine which hypothesis is the better one.
The Need for Scientific Explanations
Of course there is an obvious problem here. It’s unethical to force people to smoke (and presumably give them cancer) for the sake of our experiment. But as we’ve seen, merely measuring the likelihood of getting cancer for someone that chooses on their own to smoke vs one that we forced to smoke, isn’t going to give the same result if there is a Confounder like our “smoker’s gene” causing some people to both smoke and get cancer.
So how do we deal with this problem?
Pearl’s response to this is pretty much straight out of Popper: we use our scientific reasoning to determine which graph (i.e. explanation) is the better one first. (3) The experiment I described in the last section was unethical, but the same basic approach of conjecture and refutation can be used. We can criticize our various competing explanations and find the best one. Pearl gives an example of a study showing that people who had smoked and then stopped reduced their risk by a factor of two. (The Book of Why, p. 174) This is a problem for the “Smoking Gene” hypothesis (or the “Cancer Causes Smoking” hypothesis), but not for the “Smoking Causes Cancer” hypothesis. Or even better, scientists used mathematics to show that if a ‘smoker’s gene’ were to explain the difference in cancer rates between smokers and non-smokers, it would have to be 9 times as common among smokers. Put another way, if 11% of non-smokers had the gene, then 99% of the smokers must have it. (The Book of Why, p. 175) This seemed like an unreasonable result, and thus was a problem for the “Smoker Gene” hypothesis, but not for the “Smoking Causes Cancer” hypothesis.
In other words, we can criticize the competing theories and see which one survives. In any case, Critical Rationalism shows us that we don’t need to know with certainty that there is no Confounder to know that one hypothesis is a better explanation than the other hypothesis. Therefore, we have good reason to accept that smoking causes cancer as the better explanation compared to the alternative theories we’re considering.
The Adjustment Formula: The Power of Explanations
But here is where things get interesting. Now that we understand the difference between P(X|Y) and P(X|do(Y)), can we mathematically work out one from the other based on our scientific explanation (as captured in our graph)?
So for example, let’s say that smoking did cause cancer, but also that there is a ‘smoker’s gene’ that both makes cancer more likely and also increases the chances that you’ll crave smoking. (4) That would change the causal diagram to look like this:
Would it be possible to determine directly how much of the cancer is determined by smoking alone? Well, obviously you could do the unethical experiment we already mentioned and force half your study to smoke and half not to. That would again erase the arrow between “Smoker Gene” and “Smoker.” But what if we wanted to know the results of P(Cancer|do(Smoker)) but all we had was data on P(Cancer|Smoker)? In other words, could we somehow tease out the answer mathematically to this new graph, but without having to do the unethical experiment?
So long as we have good scientific reasons (i.e. the other competing hypotheses have failed to survive criticism) to believe the above explanation contained in the graph is correct, then in fact, we can.
To be clear, what we’re looking to do is determine what P(Cancer|Smoker) is under the second graph (i.e. Figure 4, the one with the arrow between Gene and Smoker removed) even though all our data is for P(Cancer|Smoker) under the first graph (i.e. Figure 3, with the arrow still present).
To put this into notation, we’re saying that P(Y|do(X)) by definition is equal to Pm(Y|X) where “Pm” means probability on the manipulated graph. Therefore, by definition:
Formula 1: P(Y|do(X)) = Pm(Y|X)
Now consider that under P(Cancer|do(Smoker)) – which removes the arrow from “Smoker Gene” to “Smoker” – P(Cancer|Smoker, Gene) is the same under either graph, because Cancer changes with Smoker and Gene regardless of how we’re manipulating the relationship of Gene and Smoker. Therefore:
Formula 2: Pm(Y|X, Z) = P(Y|X, Z)
Also note that P(Gene) is the same on both graphs because nothing is affecting that variable. Therefore:
Formula 3: Pm(Z) = P(Z)
Now using Formula 1 and the Law of Total Probability (which is a standard rule of probability), summing over the values of Z, we can say:
Formula 4: P(Y|do(X)) = SUM( Pm(Y|X, Z) * Pm(Z|X) )
However, in the modified model, Smoker (i.e. X) and Smoker Gene (i.e. Z) are now independent. And since Pm(Z|X) = Pm(Z) when they are independent, we now have:
Formula 5: P(Y|do(X)) = SUM( Pm(Y|X, Z) * Pm(Z) )
Now taking Formula 5 and combining it with Formulas 2 and 3, we now have:
Formula 6: P(Y|do(X)) = SUM( P(Y|X, Z) * P(Z) )
Notice that we now have P(Y|do(X)) defined in terms of the non-manipulated model.
That means we can now just take data from the unmanipulated model (from non-experimental data!) and come up with what P(Y|do(X)) would be. This is called “The Adjustment Formula.”
So our final formula is:
Adjustment Formula: P(Cancer|do(Smoker)) = SUM( P(Cancer|Smoker, Smoker Gene) * P(Smoker Gene) )
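The adjustment formula can be checked with exact arithmetic on a small made-up model of the Figure 3 graph (Gene causes Smoker, Gene causes Cancer, and Smoker causes Cancer). All the probabilities below are invented for illustration; the point is that the observational conditional and the adjusted (interventional) quantity come apart, exactly as the presence of a confounder predicts:

```python
# Exact arithmetic on a toy confounded graph. All numbers are invented.
p_z = {1: 0.1, 0: 0.9}                      # P(Gene = z)
p_x_given_z = {1: 0.8, 0: 0.2}              # P(Smoker = 1 | Gene = z)
p_y_given_xz = {(1, 1): 0.40, (1, 0): 0.15, # P(Cancer = 1 | Smoker = x, Gene = z)
                (0, 1): 0.30, (0, 0): 0.05}

# Observational conditioning: P(Cancer = 1 | Smoker = 1).
# The gene is over-represented among smokers, inflating their cancer rate.
p_x1 = sum(p_z[z] * p_x_given_z[z] for z in (0, 1))
p_obs = sum(p_y_given_xz[(1, z)] * (p_z[z] * p_x_given_z[z] / p_x1)
            for z in (0, 1))

# Adjustment formula: P(Cancer = 1 | do(Smoker = 1))
#   = SUM over z of P(Cancer | Smoker = 1, Gene = z) * P(Gene = z)
p_do = sum(p_y_given_xz[(1, z)] * p_z[z] for z in (0, 1))

print(p_obs)  # ~0.227: inflated by the gene's confounding
print(p_do)   # ~0.175: the effect of smoking alone, no experiment needed
```

The adjustment weights each gene stratum by its population frequency P(Gene), rather than by its frequency among smokers, which is precisely what randomly assigning smoking would have done physically.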
In short, combining the power of scientific explanations and the power of causal inference we can just use regular population data to mathematically perform the unethical experiment without having to actually perform it!
Impact on Artificial Intelligence Research
Pearl is not shy about the ramifications of his research for the state of the art in Machine Learning today: Deep Learning. In essence, we now have a demonstration that Deep Learning – which is just learning correlations from data – cannot, on its own, produce the human equivalent of scientific explanations. In short, Deep Learning is not and can never be intelligence, just like David Deutsch said in the opening quote. (5)
The Power of Causal Inference
Hopefully this simple example whets your appetite for what else can be done with good explanations coupled with causal inference. In fact, there are far more exciting things that can be done that I’ll cover in a future post. For example, Causal Inference can be used to model not only interventions but also counterfactuals, such as “what would have happened to sales if we had run this ad.” It can also solve Simpson’s Paradox consistently.
But probably my favorite ability of Causal Inference is that it solves one of the biggest problems of science, namely how to set up a randomized controlled trial when you know your test population wasn’t totally random. Pearl claims “If we understand the mechanism by which we recruit subjects for [a] study, we can recover from bias…” of not having a truly random selection process. Indeed, Pearl goes on to claim Causal Inference seems poised to “allow us to exploit causal logic and Big Data to perform miracles that were previously inconceivable.”
I feel that there are interesting touch points between discoveries in the Causal Inference community and the epistemology of Karl Popper (especially as amended by David Deutsch).
This post is a first broad attempt to paint these touch points, but a lot more study is needed, and I feel we’re just scratching the surface of something new and exciting. Hopefully in future posts I can explain in more depth the power of Causal Inference.
(1) There are other problems with the “scientific method” as outlined by Khan Academy. For example, Step 3 (i.e. Form a hypothesis, or testable explanation) is presumably equivalent to Popper’s “making conjectures.” What’s missing is that what we really conjecture (form hypotheses) about is problems we’re interested in and are trying to solve. Step 2 (Ask a question) seems trivial at best. Not surprisingly, all the heavy lifting is really in Step 3, where we form our hypothesis, and Step 4, where we test it via experiment. But even this isn’t quite right. In reality, we probably start criticizing our conjectures (hypotheses) right away (as part of Step 3, when initially forming hypotheses) and get rid of the ones that don’t survive our initial criticisms. The few that survive that process are the ones we’ll bother to actually spend time setting up an experiment for. So Step 3 probably deserves more sub-steps. Further, this version of the “scientific method” leaves out some really important things that science requires, like a community of scientists with a culture of criticism. The process of actually publishing to that community is absent from KA’s version of the scientific method, but is actually one of the most important parts in real life. KA’s scientific method makes science seem like a primarily individual process whereas Popper would claim it is primarily a communal process.
(2) See The Book of Why, p. 46: “It is because of this robustness [in causal diagrams], I conjecture, that human intuition is organized around causal, not statistical, relationships.”
It should be noted, however, that David Deutsch (from the opening quote), who agrees with Pearl that Bayesian probability theory can’t be a path towards AGI, probably wouldn’t agree with Pearl’s assessment that causal diagrams are the right path forward. As Deutsch states in his book, The Fabric of Reality (p. 24): “…I must mention another way in which reductionism misrepresents the structure of scientific knowledge. Not only does it assume that explanation always consists of analyzing a system into smaller simpler systems, it also assumes that all explanation is of later events in terms of earlier events; in other words, that the only way of explaining something is to state its causes.”
However, there is at least some overlap between Pearl and Deutsch on this. For one thing, causes are types or modes of explanation, and Pearl acknowledges this throughout his book, even if, at times, his concept of ‘explanation’ seems somewhat primitive compared to the full breadth of possible scientific explanations. This is less obvious to me, but it’s possible that Pearl may incorrectly understand an ‘explanation’ to merely be a synonym for his causation graphs. For example, if smoking (probabilistically) causes cancer, then the explanation for lung cancer is (often) smoking. While this may be technically correct, it’s certainly not a deep understanding of the real potential of explanations.
Hernán, Hsu, and Healy, in “Data science is science’s second chance to get causal inference right. A classification of data science tasks” (p. 5), point out that this is a common mistake among Causal Inference methodologists: “Some methodologists have referred to the causal inference task as “explanation”, but this is a somewhat misleading term because causal effects may be quantified while remaining unexplained (e.g., randomized trials identify causal effects even if the causal mechanisms that explain them are unknown).”
So I think Pearl may be on the right track, but I suspect we’re still looking at ‘explanations’ too narrowly here. What we really need is a deeper theory of explanations, not just causation.
However, this post details where I do see a number of touch points between Popper’s epistemology (as extended by Deutsch) and Pearl’s theories that I feel are worth noting. But I think Deutsch would likely still have two objections. First, that Pearl is focusing too narrowly on “Causation” (which is merely one possible type of “Explanation”). And second, that this view of “causation as explanations” is really still primarily Bayesian probability theory, which Deutsch does not accept as a valid understanding of how science works. (And thus, as per Deutsch’s argument from the opening quote, he’d not see it as a path to AGI.)
However, even if Causal Inference isn’t a path to AGI, it’s an interesting study in its own right about the problems of current Machine Learning techniques, especially Bayesian probability theory. And I believe it will turn out to be its own fruitful field — just like Artificial Intelligence itself started out as an attempt to invent AGI, failed at that attempt, but turned out to be its own fruitful field. Failure is our path forward.
(3) See The Book of Why, p 79. Here I’m using the language of Popper rather than Pearl, because I believe it is more accurate to scientific epistemology. The key thing here is that Pearl acknowledges that “Causal Discovery”, which is the study of how to “prove that X is a cause of Y or else to find the cause of Y from scratch” is something distinct from his area of study, which is how to “[represent] plausible causal knowledge in some mathematical language, combining it with empirical data, and answering causal queries of practical value.” In short, Pearl is arguing that you must first use “Causal Discovery” to create a causal theory and then you can use that theory to do “Causal Inference” in useful ways. I am conjecturing that “Causal Discovery” is really just a subset of Popper’s epistemology.
(4) And Pearl notes that we eventually did find that there are certain genes that both make it more likely that you’ll get addicted to smoking and also more likely that you’ll get lung cancer if you do, so the skeptics weren’t entirely wrong!
(5) Pearl states it this way: “One aspect of deep learning does interest me: the theoretical limitations of these systems, primarily limitations that stem from their inability to go beyond rung one of the Ladder of Causation. … In technical terms, machine-learning methods today provide us with an efficient way of going from finite sample estimates to probability distributions, and we still need to get from distributions to cause-effect relations.” I did not, in the article, define the ladder of causation, but basically Pearl defines Causation as having three levels: 1) Association (as used by Machine Learning today using standard statistical theory), 2) Intervention (with the Do-Operator), and 3) Counterfactuals. Therefore Pearl is really saying that because existing Machine Learning today only uses Association, it can’t do Intervention and Counterfactuals like humans can do.
However, I suspect the shortfall of narrow AI today isn’t a lack of ability to climb the ladder of causation, but that we have no idea how to implement the ability to do Popper’s conjecture and refutation in software. Or, in Pearl’s language, we don’t know how to implement “Causal Discovery” in software, which is a subset of Popper’s epistemology. The real key to AGI will be computational creativity — the ability for programs to be creative.
As an interesting side note, Hernán, Hsu, and Healy, in “Data science is science’s second chance to get causal inference right. A classification of data science tasks” (pp. 9-11), do a good job of defining when current Machine Learning techniques can in fact predict counterfactual states using just association data. They argue that normally “expert knowledge” is needed to deal with counterfactuals (where I take “expert knowledge” to also be a crypto reference to scientific theories and explanations) unless the rules of what the system is trying to predict are “governed by a set of known game rules… or physical laws…” In short, Hernán and company are arguing that Machine Learning can deal with things higher up Pearl’s ladder of causation without any knowledge of causal inference, but that this is only possible in certain limiting conditions where the rules / laws are well understood so that we can create sufficient data to train on. (Such as in AlphaGo or self-driving cars.) This is an interesting side theory.