12 November 2015

Learning from Interpretations

The LFI-ProbLog algorithm, for learning the parameters of probabilistic logic programs.

This is a brief summary of what I learned from [Gutmann et al., Learning the Parameters of Probabilistic Logic Programs from Interpretations, 2011]. It's also an abridged record of what I presented in the probabilistic programming reading group at Oxford. Well, abridged in content but not in presentation. The presentation here was done in one go. On the first go, I tend to write rather chatty text, which I don't like. But, others say they actually prefer it! Go figure. Anyway …

ProbLog can be seen as a concise way to describe probability distributions over bitvectors. Here is an example:

  p :: foo(X).
  bar() :- foo(X).

In addition to the above, we know, from the type of $X$, that $X \in \{1,2\}$. (In ProbLog it is more complicated, but I'll just assume the types of variables are given and fixed.) The probability distribution this program describes has the type $\bigl(\{{\it foo}(1), {\it foo}(2), {\it bar}() \} \to 2\bigr) \to [0,1]$. Earlier I said ‘bitvector’ because this type is isomorphic to $2^3 \to [0,1]$, once we fix an order on the atoms ${\it foo}(1)$, ${\it foo}(2)$, ${\it bar}()$. More precisely, the distribution is the following:

  world        000   001   010   011   100   101   110   111
  probability  $qq$  $0$   $0$   $qp$  $0$   $pq$  $0$   $pp$

Here, $p$ is the parameter that occurs in the ProbLog program, and I used $q$ to denote $1-p$. The table is produced as follows. First, identify the input atoms: these are the groundings of those facts labeled with probabilities. In our case, the input atoms are ${\it foo}(1)$ and ${\it foo}(2)$. Second, pick a truth assignment for the input atoms and compute its probability. For example, if we set ${\it foo}(1)=1$ and ${\it foo}(2)=0$, then the probability is $p \times (1-p)$: the $p$ for ${\it foo}(1)$ and the $1-p$ for ${\it foo}(2)$. Finally, complete the world: this is done by applying the derivation rules until a fixed-point is reached. This will be a least fixed-point, so we say we use least fixed-point semantics. All the worlds that can't be generated by this process (pick inputs arbitrarily, then do least fixed-point) get probability $0$.
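To make this concrete, here is a minimal Python sketch of the generate-all-worlds procedure for the toy program above. The concrete value of $p$ and the function name are mine:

  from itertools import product

  p = 0.3   # some concrete value for the parameter; the real p is learned later
  q = 1 - p

  def world_distribution():
      """Enumerate truth assignments to the input atoms foo(1), foo(2),
      complete each world with the rule bar() :- foo(X), and collect the
      distribution over bitvectors (foo1, foo2, bar)."""
      dist = {}
      for foo1, foo2 in product([0, 1], repeat=2):
          weight = (p if foo1 else q) * (p if foo2 else q)
          bar = 1 if (foo1 or foo2) else 0   # the least fixed-point of the rule
          dist[(foo1, foo2, bar)] = weight
      return dist   # every world not produced here has probability 0

  print(world_distribution())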

Note that inputs should not appear as the head of any derivation rule. Otherwise, you're in a bit of a pickle when you try to apply the procedure from above.

OK, now we have a way to describe probability distributions. In fact, parameterized probability distributions because $p$ is a parameter. The next task is to do learning. And by learning I mean MLE (maximum likelihood estimation).

What is MLE? In MLE you observe some event which you assume comes from some parameterized distribution, and you want to set the parameters to maximize the probability of the observed event. For example, if we observe ${\it foo}(1)=1$ and ${\it foo}(2)=0$, then we compute the probability of this event to be $p(1-p)$, and we maximize it by taking $p=1/2$. Or, we could observe just ${\it bar}()=1$, which has the probability $1-(1-p)^2=p(2-p)$, maximized by taking $p=1$. In this latter case, we say that ${\it foo}$s are latent (or hidden) variables: we need to think about them to compute the probability, but we don't observe them.
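As a quick numeric check of both claims (a grid search, nothing more):

  import numpy as np

  ps = np.linspace(0, 1, 10001)
  lik_foo = ps * (1 - ps)        # likelihood of observing foo(1)=1, foo(2)=0
  lik_bar = 1 - (1 - ps) ** 2    # likelihood of observing bar()=1
  print(ps[np.argmax(lik_foo)])  # 0.5
  print(ps[np.argmax(lik_bar)])  # 1.0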

MLE is very simple in principle, but the expression that you're supposed to optimize becomes unwieldy for examples of even moderate size, especially if latent variables are involved. One general optimization strategy is the EM algorithm. This is an iterative algorithm, which I'll illustrate on some examples.

Look at this table which lists for each observation of ${\it foo}(1),{\it foo}(2)$ how we should pick $p$:

  observation   best $p$
  00            $0$
  01            $1/2$
  10            $1/2$
  11            $1$

The best $p$ is the average of ${\it foo}(1)$ and ${\it foo}(2)$! But what do we do if we don't observe the $\it foo$s? Well, we still have some expectation for their values, given what we observed.

Let's see what this means for the case in which we observed ${\it bar}()=1$. First, we guess $p=1/2$. Under this guess, $\mathop{\rm E}\bigl({\it foo}(1)\mid {\it bar}()=1\bigr)$ is $$\begin{align*} \frac{pq+pp}{qp+pq+pp} = \frac{q+p}{q+q+p} = \frac{1}{1+q} = \frac{2}{3} \end{align*}$$

By symmetry, $\mathop{\rm E}\bigl({\it foo}(2)\mid {\it bar}()=1\bigr)$ is also $2/3$. So, their average is also $2/3$. (Remember: We take averages of expectations.) Thus, we update $p:=2/3$ (and $q=1/3$).

In the next iteration, we do the same, but with the new value of $p$: $$\begin{align*} \frac{pq+pp}{qp+pq+pp} = \frac{1}{1+q} = \frac{3}{4} \end{align*}$$

If we keep doing this, then we end up with $p=1$, which is the same solution we got when we applied MLE directly, by maximizing $1-(1-p)^2$. (You can check that the recurrence from above defines a sequence $q_n=1/n$. So, when $n\to\infty$ we have $q_n \to 0$ and $p_n \to 1$.) For the general case, you can find a proof on Wikipedia that applying one EM step doesn't reduce the likelihood (likelihood = the probability of the observed event). The proof boils down to Gibbs' inequality. Since the likelihood doesn't decrease, we may hope it increases.
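For this particular example, the whole EM loop fits in a few lines, using the closed form $1/(1+q)$ derived above (a sketch of this example only, not of the general algorithm):

  # EM for the observation bar() = 1, starting from the guess p = 1/2.
  p = 0.5
  for n in range(10):
      q = 1 - p
      p = 1 / (1 + q)   # average of E(foo(1)|bar=1) and E(foo(2)|bar=1), both equal
      print(n, p)       # 2/3, 3/4, 4/5, ... -> 1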

The EM algorithm can be seen as some sort of gradient ascent, specialized for maximizing likelihoods. In particular, it is a numeric algorithm, not a symbolic one.

Let's see where we are. We have a language for describing parametrized distributions. We have a numeric algorithm for estimating good values for the parameters. Each iteration of the numeric algorithm works by averaging some expectations. The rest is about how to compute the expectations efficiently. For this, I'll switch to a slightly more complicated example — the one from the paper.

  0.1 :: burglary.
  0.2 :: earthquake.
  0.7 :: awake(X).  % X is one of { mary, john }
  alarm :- burglary.
  alarm :- earthquake.
  calls(X) :- awake(X), alarm.

Suppose we observe that ${\it alarm}()=1$ and ${\it calls}({\rm John})=0$. First, we ground the program. For efficiency, we also throw away stuff that can't influence what we observed. In this case, we throw away mary.

  0.1 :: burglary.
  0.2 :: earthquake.
  0.7 :: awake(john).
  alarm :- burglary.
  alarm :- earthquake.
  calls(john) :- awake(john), alarm.

Second, we build a formula whose models are the least fixed-points of the above program; that is, the worlds that can have nonzero probability. This step is very easy if there are no cyclic dependencies. $$\bigl({\it alarm} \leftrightarrow ({\it burglary} \lor {\it earthquake})\bigr) \land \bigl({\it calls\_john} \leftrightarrow ({\it awake\_john} \land {\it alarm})\bigr) $$
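This step is essentially Clark's completion of the ground program: one ‘head iff disjunction of bodies’ conjunct per derived atom. A hypothetical sketch, with the rule table and the string encoding being my own:

  # Sketch: build the completion formula for an acyclic ground program.
  # `rules` maps each derived atom to the list of its rule bodies.
  rules = {
      'alarm': [['burglary'], ['earthquake']],
      'calls_john': [['awake_john', 'alarm']],
  }

  def completion(rules):
      """One conjunct 'head <-> (body1 | body2 | ...)' per derived atom."""
      conjuncts = []
      for head, bodies in rules.items():
          disjunction = ' | '.join('(' + ' & '.join(body) + ')' for body in bodies)
          conjuncts.append(f'({head} <-> ({disjunction}))')
      return ' & '.join(conjuncts)

  print(completion(rules))
  # (alarm <-> ((burglary) | (earthquake))) & (calls_john <-> ((awake_john & alarm)))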

Third, we simplify the formula according to the observation. $$\begin{align*} &\bigl(1 \leftrightarrow ({\it burglary} \lor {\it earthquake})\bigr) \land \bigl(0 \leftrightarrow ({\it awake\_john} \land 1)\bigr) \\ &\quad=({\it burglary} \lor {\it earthquake}) \land \lnot{\it awake\_john} \end{align*}$$

Let $\phi$ be this formula, corresponding to our observation. We want to compute the expectation $\mathop{\rm E}(f\mid \phi=1)$ for $f={\it burglary}$ and for $f={\it earthquake}$. By the definition of conditional probability (and using $\mathop{\rm E}X=\Pr(X=1)$), $$ \mathop{\rm E} (f \mid \phi=1) = \Pr(f=1\mid \phi=1) = \frac{\Pr(\phi=1\mid f=1) \Pr(f=1)}{\Pr(\phi=1)} = \frac{\mathop{\rm E}(\phi\land f)}{\mathop{\rm E}\phi} $$

In other words, the task is to evaluate the reliability polynomial of $\phi$ and of $\phi\land f$. Both of these tasks are easy once we have $\phi$ represented as a BDD. More precisely, they are linear in the size of the BDD. I expect it is obvious why this is so. :-) In our case, if the BDD has size $n$, then we would update the parameters in $\sim 4n$ steps:

  1. compute $\mathop{\rm E}\phi$ in $n$ steps
  2. compute $\mathop{\rm E}(\phi \land {\it burglary})$ in $n$ steps
  3. compute $\mathop{\rm E}(\phi \land {\it earthquake})$ in $n$ steps
  4. compute $\mathop{\rm E}(\phi \land {\it awake\_john})$ in $n$ steps
  5. update $p_{\it burglary}:= \frac{\mathop{\rm E}(\phi \land {\it burglary})}{\mathop{\rm E}\phi}$ in $1$ step
  6. update $p_{\it earthquake}:= \frac{\mathop{\rm E}(\phi \land {\it earthquake})}{\mathop{\rm E}\phi}$ in $1$ step
  7. update $p_{\it awake}:= \frac{1}{2} \biggl( \frac{\mathop{\rm E}(\phi \land {\it awake\_john})}{\mathop{\rm E}\phi} +p_{\it awake} \biggr)$ in $1$ step. (Here we averaged two expectations: $\mathop{\rm E}({\it awake\_john}\mid \phi=1)$ and $\mathop{\rm E}({\it awake\_mary}\mid \phi=1)$. The latter is just $\mathop{\rm E}({\it awake\_mary})=p_{\it awake}$.)
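As code, steps 5–7 are just this (the function and argument names are mine):

  def em_update(e_phi, e_burglary, e_earthquake, e_awake_john, p_awake):
      """One parameter update from the four expectations of steps 1-4.
      awake_mary is unobserved, so its conditional expectation is p_awake itself."""
      return {
          'burglary':   e_burglary / e_phi,
          'earthquake': e_earthquake / e_phi,
          'awake':      0.5 * (e_awake_john / e_phi + p_awake),
      }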

The final trick is the observation that the four conjoined expectations — $\mathop{\rm E}(\phi\land f)$ for $f\in\{{\it burglary}, {\it earthquake}, {\it awake\_john}, {\it awake\_mary}\}$ — can all be computed in time linear in the size of the BDD. More precisely, the time is $O(m+n)$, where $m$ is the number of expectations being computed, and $n$ is the size of the BDD. In the first traversal you construct a slightly improper BDD that has each of ${\it burglary}$, $\it earthquake$, $\it awake\_john$, $\it awake\_mary$ on every path from root to leaves. (The paper doesn't do this. In fact, in an implementation you wouldn't do it either. But, the equivalent computation that avoids constructing this pseudo-BDD is slightly annoying to describe.) Then, you tag each node with two numbers, $\alpha$ and $\beta$. Both can be understood in terms of a downward random walk which, at a node labeled by $\ell$, takes the 1-branch with probability $\Pr(\ell=1)$. For example, at a node labeled by $\it burglary$ we take the 1-branch with probability $p_{\it burglary}$. Thinking in terms of this random walk, the $\alpha$ of node $x$ is the probability that, starting at $x$, you end up at a $1$-leaf. Clearly, these numbers can be filled in by one bottom-up traversal of the BDD. The $\beta$ of node $x$ is the probability that, starting at the root, we end up visiting $x$. Clearly, these numbers can be filled in by one top-down traversal of the BDD.

How are these $\alpha$ and $\beta$ tags to be used? Well, the $\alpha$ on the root is $\mathop{\rm E}\phi$, which we wanted to know. It is the probability that the formula $\phi$ evaluates to $1$. We can decompose this probability into a sum over all paths from the root to a $1$-leaf. When we want to evaluate $\mathop{\rm E}(\phi\land f)$, we are interested in summing only over those paths that have $f=1$; that is, those that leave a node labeled by $f$ along its 1-branch. Since we made sure that all paths test the value of $f$ exactly once, we can just look at each node labeled by $f$: $$ \mathop{\rm E}(\phi\land f) = \sum_{\text{$x$ labeled by $f$}} \beta(x) \cdot \Pr(f=1) \cdot \alpha(x_1) $$ Here, $x_1$ is the node reached from $x$ along the 1-branch.
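To put the whole thing together, here is a minimal Python sketch of the $\alpha$/$\beta$ computation, run on the padded BDD of our $\phi=({\it burglary}\lor{\it earthquake})\land\lnot{\it awake\_john}$. The node encoding and all names are mine, not the paper's:

  # A node is (var, lo, hi); leaves are the booleans False/True.
  # prob[var] is the current estimate of Pr(var = 1).

  def alpha(node, prob, memo):
      """Probability that the downward random walk from `node` hits a 1-leaf."""
      if isinstance(node, bool):
          return 1.0 if node else 0.0
      if id(node) not in memo:
          var, lo, hi = node
          memo[id(node)] = ((1 - prob[var]) * alpha(lo, prob, memo)
                            + prob[var] * alpha(hi, prob, memo))
      return memo[id(node)]

  def topological(root):
      """Internal nodes, ordered so that parents come before children."""
      order, seen = [], set()
      def dfs(node):
          if isinstance(node, bool) or id(node) in seen:
              return
          seen.add(id(node))
          dfs(node[1]); dfs(node[2])
          order.append(node)
      dfs(root)
      return list(reversed(order))

  def betas(root, prob):
      """Probability that the walk from the root visits each node (top-down)."""
      beta = {id(root): 1.0}
      for node in topological(root):
          var, lo, hi = node
          b = beta.get(id(node), 0.0)
          for child, weight in ((lo, 1 - prob[var]), (hi, prob[var])):
              if not isinstance(child, bool):
                  beta[id(child)] = beta.get(id(child), 0.0) + weight * b
      return beta

  def expectations(root, prob):
      """E(phi), and E(phi & f) for every variable f, via the formula above."""
      memo, beta = {}, betas(root, prob)
      e_phi_and = {f: 0.0 for f in prob}
      for node in topological(root):
          var, lo, hi = node
          e_phi_and[var] += beta[id(node)] * prob[var] * alpha(hi, prob, memo)
      return alpha(root, prob, memo), e_phi_and

  # The padded BDD for (burglary | earthquake) & ~awake_john,
  # with variable order burglary < earthquake < awake_john.
  a_yes = ('awake_john', True, False)   # phi holds iff awake_john = 0
  a_no  = ('awake_john', False, False)  # phi already false; pad the path anyway
  e_hi  = ('earthquake', a_yes, a_yes)  # burglary = 1: earthquake is irrelevant
  e_lo  = ('earthquake', a_no, a_yes)   # burglary = 0: need earthquake = 1
  root  = ('burglary', e_lo, e_hi)

  prob = {'burglary': 0.1, 'earthquake': 0.2, 'awake_john': 0.7}
  e_phi, e_phi_and = expectations(root, prob)
  print(e_phi)      # 0.084 = (1 - 0.9*0.8) * 0.3
  print(e_phi_and)  # burglary: 0.03, earthquake: 0.06, awake_john: 0.0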

Some comments. After reading the paper, I was worried about two things. First, what do you do if there are cycles in the program? Second, are these BDDs small enough in practice? So, I looked up subsequent work, and I saw that both issues are better addressed in [Fierens et al., Inference and Learning in Probabilistic Logic Programs using Weighted Boolean Formulas, 2015] . I only skimmed this paper (it's much longer!), so what I say below about it may be wrong.

Cycles. I was worrying about cycles because they gave me a headache recently, while working on [Grigore, Yang, Abstraction Refinement Guided by a Learnt Probabilistic Model, 2016]. The paper by Fierens et al. gives references to two standard ways to handle cycles. I don't think it gives more detail. Hongseok and I did use one of those methods, but we had to throw in a few more approximations and ideas to get something that works in any reasonable time. The reference I like is [Lee, A model-theoretic counterpart of loop formulas, 2005]. (The ‘novel’ part of this paper is supposed to be what to do with cycles if you also have negations. We didn't use that part. But, the paper also reviews what you do with cycles when you don't have negations, and I think that review is very readable.)

By the way, now that I have read this article carefully, I am convinced that it is very closely related to the learning part of the article by Hongseok and me. I should write up a proper, in-depth comparison.

BDD size. I was worried about size partly because the problem and a solution (automatic theory splitting) are mentioned in the paper I summarize here, and partly because I tend to be worried about size when the word ‘BDD’ is mentioned. Automatic theory splitting simply means that you decompose the formula into a conjunction of parts that don't share variables, and build one BDD for each part, rather than one BDD for the whole thing. I suspect it's not too often that you can actually do this. Also, this amounts to using a restricted form of decision-DNNF. To remind you, a decision-DNNF has two types of nodes: decision nodes, which behave like those in a BDD, and conjunction nodes. A conjunction node requires that its two (decision-DNNF) children don't share variables. In general, conjunction nodes and decision nodes can be interspersed. Automatic theory splitting amounts to disallowing conjunctions below decisions. The paper by Fierens et al. doesn't have this limitation — it uses more results from the area of knowledge compilation to build decision-DNNFs.

Right, I didn't say what an ‘interpretation’ is. It's what I called ‘observation’ above.
