Hidden Markov Models, how to train using Baum-Welch?

TL;DR: My uni is messing up my life with a horribly disorganized AI class with unhelpful teachers. Just want some confirmation that I’m on the right track. See my questions at the end.

I’ve been really busy with school lately due to my AI class wrapping up. The class is a clusterf*ck. The online resources are lacking at best and some of the teachers are the worst kind. One of them was asked a question about a project assignment and responded with “I wrote that project description and frankly I’m offended by that question.” and then ignored the student. The same teacher forced me to come in to school on a day I had no classes just so he would believe that I didn’t get an email with some instructions.

Our final assignment is to write an automatic player for a small game they designed. In addition to being vastly undocumented and using very confusing concepts (objects move in directions but don’t have positions, what?), the entire assignment reeks of “use this screwdriver to hammer in this nail”, as the assignment can be solved much more easily using Markov chains, but no, we have to use Hidden Markov Models. So here I am sitting with a completely undocumented HMM implementation they’ve supplied, a half-described game concept and a VERY high-level explanation of HMMs and the Baum-Welch algorithm. It’s not entirely about game development, but since this is technically a game AI and I REALLY just want this class to be over, I’m asking here anyway.

The entire problem is taking in a stream of emissions/observations and predicting the next emission. The underlying states are irrelevant. A correct guess ends the game and gives you 1 point, while every wrong guess makes you lose 1 point (you don’t have to guess if you’re not certain). So I basically have a stream of emissions that I want to train my HMM with, and from it calculate a probability distribution for the next emission. In addition, the code skeleton we’re forced to use is not compatible with a debugger, hell, not even with a frigging IDE, so debugging is a huge mess. I have two questions:

  • Baum-Welch seems to suck. If I repeatedly train my HMM with the entire sequence of emissions/observations, it eventually explodes into an HMM filled with 0s. The scarce information I’ve found says that this is due to overfitting. The Baum-Welch implementation was supplied by them, and despite other people complaining about the same problem, the teachers insist that we’re “using it wrong” (again, completely undocumented). What is the optimal way of training an HMM given a stream of data?

  • Given the sequence of emissions [a, b, c], to calculate the probability of each possible next emission I compute the probability of the sequence [a, b, c] occurring given the current HMM, and then the probabilities of [a, b, c, X] occurring, where X is each possible emission. The conditional probability of each emission is then
    p([a, b, c, X]) / p([a, b, c])
    Is this the correct way of calculating the probability of the next emission? (I’ve sketched this computation below the list.)
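
For concreteness, here’s roughly what I mean in the second question, rewritten as a self-contained NumPy sketch (the supplied skeleton looks nothing like this; the matrix names and the forward function are just my stand-ins):

[code]
import numpy as np

def sequence_likelihood(obs, A, B, pi):
    # P(obs | model) via the forward algorithm.
    # A:  (n_states, n_states) transition matrix, rows sum to 1
    # B:  (n_states, n_symbols) emission matrix, rows sum to 1
    # pi: (n_states,) initial state distribution
    alpha = pi * B[:, obs[0]]          # joint P(state, first symbol)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]  # propagate through A, weigh by emission
    return alpha.sum()

def next_emission_distribution(obs, A, B, pi):
    # P(next = x | obs) = p(obs + [x]) / p(obs) for every symbol x
    p_obs = sequence_likelihood(obs, A, B, pi)
    return np.array([sequence_likelihood(obs + [x], A, B, pi)
                     for x in range(B.shape[1])]) / p_obs
[/code]

One thing I noticed while writing this down: the raw forward pass underflows on long sequences unless alpha is rescaled at each step, which might be related to the all-zeros blowup in my first question.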

From my understanding you’re supposed to repeat the iterations until you start getting a good percentage of correct guesses. How good is the guessing just before it goes to all zeros?

I think I managed to figure it out. My calculation of the next emission probability was definitely faulty, and I found a workaround for the NaN problem.

The overfitting (?) problem I solved by adding a small amount of noise to all the matrices before retraining whenever a new emission comes in, to make sure the model can still adapt to the new information. The problem seemed to occur when some matrix elements dropped to 0, at which point a value that wasn’t checked for division by 0 ended up being infinity, and so on. The random noise avoided this completely.
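
For anyone who hits the same thing, the workaround is roughly this (a NumPy stand-in, not the supplied skeleton; the noise level is just what happened to work for me):

[code]
import numpy as np

rng = np.random.default_rng()

def jitter_rows(M, eps=1e-3):
    # Add a little uniform noise to a row-stochastic matrix and
    # renormalize so every row still sums to 1.
    M = M + eps * rng.random(M.shape)
    return M / M.sum(axis=1, keepdims=True)

# Before re-running Baum-Welch after each new emission:
# A  = jitter_rows(A)                      # transition matrix
# B  = jitter_rows(B)                      # emission matrix
# pi = jitter_rows(pi[np.newaxis, :])[0]   # initial distribution
[/code]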

It turns out that the correct way of calculating the next emission is not what I described in my previous post. You’re supposed to calculate the state probabilities and multiply them by the emission (B) matrix to get the probability of each emission. The state probabilities are calculated in Baum-Welch (the forward pass), so I adapted the code for that. I now get around 60% accuracy, which at least makes it worth using.
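
The prediction step now looks roughly like this (again a NumPy stand-in for the supplied code; note that the state probabilities have to be pushed one step through the transition matrix A before applying B, since it’s the *next* step you’re emitting from, and I rescale alpha at every step, which is the standard fix for the underflow that was producing my NaNs):

[code]
import numpy as np

def predict_next_emission(obs, A, B, pi):
    # Distribution over the next emission given obs, via the forward pass.
    alpha = pi * B[:, obs[0]]
    alpha /= alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        alpha /= alpha.sum()        # rescale every step to avoid underflow
    state_probs = alpha @ A         # where the chain will be at the next step
    return state_probs @ B          # each state's emission row, weighted

# Guess argmax(predict_next_emission(...)); skip guessing when the maximum
# is low, per the +1/-1 scoring rule.
[/code]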

The score I’m getting should be more than enough for a pass at least.

You mean it’s irrelevant to your post because they’ve given you a structure? Out of interest, how many states are there and how many transitions with non-zero probability? If it’s up to you to make the structure then this is a pretty critical step.

[quote]Baum-Welch seems to suck. If I repeatedly train my HMM with the entire sequence of emissions/observations, it eventually explodes into an HMM filled with 0s. The scarce information I’ve found says that this is due to overfitting.
[/quote]
I don’t think that should be possible. The transition probabilities out of each state have to sum to 1, and that includes the “stay” transition back into the state itself. Overfitting can generate some states with only a single non-zero-probability transition in and out, but you’d either need tons of states or to be training it on the same sequence again and again.
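
A cheap way to tell whether the supplied Baum-Welch is genuinely collapsing (rather than you “using it wrong”) is to assert after every iteration that the rows still sum to 1. Hypothetical names, since I haven’t seen your skeleton:

[code]
import numpy as np

def check_row_stochastic(M, name, tol=1e-8):
    # Every row of a transition or emission matrix must sum to 1.
    rows = M.sum(axis=1)
    assert np.all(np.abs(rows - 1.0) < tol), f"{name} rows sum to {rows}"

# After each Baum-Welch iteration:
# check_row_stochastic(A, "A")
# check_row_stochastic(B, "B")
[/code]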

[quote]What is the optimal way of training an HMM given a stream of data?
[/quote]
Baum-Welch is pretty good. I’d stick with that until you get it to work.

[quote]Given the sequence of emissions [a, b, c], to calculate the probability of each possible next emission I compute the probability of the sequence [a, b, c] occurring given the current HMM, and then the probabilities of [a, b, c, X] occurring, where X is each possible emission. The conditional probability of each emission is then
p([a, b, c, X]) / p([a, b, c])
Is this the correct way of calculating the probability of the next emission?
[/quote]
After seeing [a,b,c] you could be in any of your states. Call the set of states N. You first need to calculate the probability of being in each n in N: P([a,b,c,n]). This is the “forward” step that you’ll find documented (Wikipedia is fine for this).

In each of the states there will also be a probability of emitting each x in X: P(x|n). That’s straight from the current model parameters.

For an emission x, P([a,b,c,x]) is the sum of all the ways x could be emitted, i.e. the sum over n of P([a,b,c,n])*P(x|n).
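
To make that sum concrete, here’s a tiny two-state example where the forward computation of P([a,b,c,x]) matches a brute-force enumeration over every state path (all the numbers are made up):

[code]
import numpy as np
from itertools import product

A  = np.array([[0.7, 0.3], [0.4, 0.6]])            # transitions
B  = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])  # emissions for a, b, c
pi = np.array([0.6, 0.4])
a, b, c = 0, 1, 2
obs = [a, b, c]

# Forward: P([a,b,c,n]) for each state n the chain occupies at the next step.
alpha = pi * B[:, obs[0]]
for o in obs[1:]:
    alpha = (alpha @ A) * B[:, o]
p_obs_and_next = alpha @ A      # P([a,b,c], next state = n)
p_next = p_obs_and_next @ B     # sum over n of P([a,b,c,n]) * P(x|n)

# Brute force: enumerate every length-4 state path.
check = np.zeros(B.shape[1])
for path in product(range(2), repeat=4):
    p = pi[path[0]] * B[path[0], obs[0]]
    for t in (1, 2):
        p *= A[path[t - 1], path[t]] * B[path[t], obs[t]]
    p *= A[path[2], path[3]]
    check += p * B[path[3]]     # times P(x | final state), for all x at once

assert np.allclose(p_next, check)
print(p_next)  # joint P([a,b,c,x]); divide by p_next.sum() == P([a,b,c])
               # to get the conditional from the original question
[/code]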

If you follow through with the method above to predict just the next state, I think you’ll essentially be doing the forward half of Forward/Backward. If all you want is the single most likely state sequence then it’s much simpler to do Viterbi, which is just plain old dynamic programming applied to the HMM (and, for me at least, easier to understand).
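
For reference, a minimal Viterbi sketch (same row-stochastic conventions as the example above, done in log space so the zeros/underflow issue from earlier can’t bite):

[code]
import numpy as np

def viterbi(obs, A, B, pi):
    # Most likely state path for obs, via dynamic programming in log space.
    logA, logB = np.log(A), np.log(B)  # zero entries become -inf, which is fine
    T, n = len(obs), A.shape[0]
    delta = np.log(pi) + logB[:, obs[0]]   # best log-prob ending in each state
    back = np.zeros((T, n), dtype=int)     # argmax predecessor per state
    for t in range(1, T):
        scores = delta[:, None] + logA     # scores[i, j]: leave i, land in j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + logB[:, obs[t]]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
[/code]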