- Slide 1
- Hidden Markov Models: Variants
  Conditional Random Fields
  [Figure: trellis of states 1, 2, ..., K at each position of the observations $x_1, x_2, x_3, \ldots, x_K$]
- Slide 2
- CS262 Lecture 7, Win06, Batzoglou
  Two learning scenarios
  1. Estimation when the right answer is known
     Examples:
     GIVEN: a genomic region $x = x_1 \ldots x_{1,000,000}$ where we have good (experimental) annotations of the CpG islands
     GIVEN: the casino player allows us to observe him one evening, as he changes dice and produces 10,000 rolls
  2. Estimation when the right answer is unknown
     Examples:
     GIVEN: the porcupine genome; we don't know how frequent the CpG islands are there, nor do we know their composition
     GIVEN: 10,000 rolls of the casino player, but we don't see when he changes dice
  QUESTION: Update the parameters $\theta$ of the model to maximize $P(x \mid \theta)$
- Slide 3
- 1. When the true parse is known
  Given $x = x_1 \ldots x_N$ for which the true parse $\pi = \pi_1 \ldots \pi_N$ is known,
  simply count up the # of times each transition & emission is taken!
  Define:
  $A_{kl}$ = # times the $k \to l$ transition occurs in $\pi$
  $E_k(b)$ = # times state $k$ in $\pi$ emits $b$ in $x$
  We can show that the maximum likelihood parameters $\theta$ (maximize $P(x \mid \theta)$) are:
  $a_{kl} = \dfrac{A_{kl}}{\sum_i A_{ki}}$, $\quad e_k(b) = \dfrac{E_k(b)}{\sum_c E_k(c)}$
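
The counting estimate above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's code: the function name `estimate_parameters`, the single-character state labels `F`/`L`, and the short roll sequence are made up for the example, and no pseudocounts are added, so unseen transitions simply do not appear in the output.

```python
from collections import defaultdict

def estimate_parameters(x, pi):
    """Maximum-likelihood HMM parameters from one labeled sequence.

    x  : observed symbols; pi : true state at each position (same length).
    Returns (a, e) with a[k][l] = A_kl / sum_i A_ki and
    e[k][b] = E_k(b) / sum_c E_k(c).
    """
    if len(x) != len(pi):
        raise ValueError("x and pi must have the same length")
    A = defaultdict(lambda: defaultdict(int))  # transition counts A_kl
    E = defaultdict(lambda: defaultdict(int))  # emission counts E_k(b)
    for i, (state, symbol) in enumerate(zip(pi, x)):
        E[state][symbol] += 1
        if i + 1 < len(pi):
            A[state][pi[i + 1]] += 1
    # Normalize each row of counts into probabilities.
    a = {k: {l: n / sum(row.values()) for l, n in row.items()} for k, row in A.items()}
    e = {k: {b: n / sum(row.values()) for b, n in row.items()} for k, row in E.items()}
    return a, e

# Hypothetical toy data: F = fair die, L = loaded die.
rolls = "1662666123"
states = "FFLLLLLFFF"
a, e = estimate_parameters(rolls, states)
```

With this toy input the Fair state is left 4 times, once into Loaded, so `a["F"]["L"]` comes out as 1/4.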
- Slide 4
- 2. When the true parse is unknown
  Baum-Welch Algorithm
  Compute the expected # of times each transition & emission is taken!
  Initialization: Pick the best guess for the model parameters (or arbitrary)
  Iteration:
  1. Forward
  2. Backward
  3. Calculate $A_{kl}$, $E_k(b)$, given $\theta_{\text{CURRENT}}$
  4. Calculate the new model parameters $\theta_{\text{NEW}}$: $a_{kl}$, $e_k(b)$
  5. Calculate the new log-likelihood $\log P(x \mid \theta_{\text{NEW}})$
     GUARANTEED NOT TO DECREASE, BY EXPECTATION-MAXIMIZATION
  Until $P(x \mid \theta)$ does not change much
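
One Baum-Welch iteration might look like the sketch below. It makes several assumptions that are not in the slides: a single training sequence, an initial state distribution `init` that is held fixed, per-position rescaling of the forward and backward tables to avoid underflow, and no pseudocounts. The function names and the casino parameters are illustrative only.

```python
import math

def forward_backward(x, states, init, a, e):
    """Scaled forward/backward tables; log P(x | theta) = sum(log(scale[i]))."""
    n = len(x)
    f, b, scale = [], [None] * n, []
    for i in range(n):
        row = {}
        for l in states:
            inflow = init[l] if i == 0 else sum(f[i - 1][k] * a[k][l] for k in states)
            row[l] = inflow * e[l][x[i]]
        c = sum(row.values())
        f.append({l: v / c for l, v in row.items()})
        scale.append(c)
    b[n - 1] = {k: 1.0 for k in states}
    for i in range(n - 2, -1, -1):
        b[i] = {k: sum(a[k][l] * e[l][x[i + 1]] * b[i + 1][l] for l in states) / scale[i + 1]
                for k in states}
    return f, b, scale

def baum_welch_step(x, states, alphabet, init, a, e):
    """Return (new_a, new_e, log-likelihood under the parameters passed in)."""
    f, b, scale = forward_backward(x, states, init, a, e)
    A = {k: {l: 0.0 for l in states} for k in states}    # expected A_kl
    E = {k: {c: 0.0 for c in alphabet} for k in states}  # expected E_k(b)
    for i in range(len(x)):
        for k in states:
            E[k][x[i]] += f[i][k] * b[i][k]
            if i + 1 < len(x):
                for l in states:
                    A[k][l] += f[i][k] * a[k][l] * e[l][x[i + 1]] * b[i + 1][l] / scale[i + 1]
    new_a = {k: {l: A[k][l] / sum(A[k].values()) for l in states} for k in states}
    new_e = {k: {c: E[k][c] / sum(E[k].values()) for c in alphabet} for k in states}
    return new_a, new_e, sum(math.log(c) for c in scale)

# Hypothetical starting guess for the dishonest casino (F = fair, L = loaded).
states, alphabet = "FL", "123456"
init = {"F": 0.5, "L": 0.5}
a = {"F": {"F": 0.9, "L": 0.1}, "L": {"F": 0.1, "L": 0.9}}
e = {"F": {c: 1 / 6 for c in alphabet},
     "L": {c: (0.5 if c == "6" else 0.1) for c in alphabet}}
x = "1662666123315666646"
logliks = []
for _ in range(10):
    a, e, loglik = baum_welch_step(x, states, alphabet, init, a, e)
    logliks.append(loglik)
```

Each entry of `logliks` is the log-likelihood of the parameters that went into that iteration, so the list should never decrease from one entry to the next.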
- Slide 5
- Variants of HMMs
- Slide 6
- Higher-order HMMs
  How do we model memory larger than one time point?
  $P(\pi_{i+1} = l \mid \pi_i = k)$: $a_{kl}$
  $P(\pi_{i+1} = l \mid \pi_i = k, \pi_{i-1} = j)$: $a_{jkl}$
  A second-order HMM with $K$ states is equivalent to a first-order HMM with $K^2$ states
  [Figure: states H and T with transitions $a_{HT}(\text{prev} = H)$, $a_{HT}(\text{prev} = T)$, $a_{TH}(\text{prev} = H)$, $a_{TH}(\text{prev} = T)$; the equivalent states HH, HT, TH, TT with transitions $a_{HHT}$, $a_{TTH}$, $a_{HTT}$, $a_{THH}$, $a_{THT}$, $a_{HTH}$]
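
The equivalence between a second-order HMM and a first-order HMM over pairs of states can be sketched as follows. The function name `to_first_order` and the numeric probabilities are hypothetical; the only idea taken from the slide is that each new state remembers the previous and the current original state.

```python
from itertools import product

def to_first_order(states, a2):
    """Expand a2[(j, k, l)] = P(next = l | prev = j, cur = k) into
    first-order transitions between pair states (j, k) -> (k, l)."""
    pair_states = list(product(states, repeat=2))
    a1 = {}
    for (j, k) in pair_states:
        for (k2, l) in pair_states:
            # A move (j, k) -> (k2, l) is only possible when the shared state agrees.
            a1[(j, k), (k2, l)] = a2[(j, k, l)] if k == k2 else 0.0
    return pair_states, a1

# Hypothetical second-order transition probabilities for a coin (H, T).
a2 = {
    ("H", "H", "H"): 0.7, ("H", "H", "T"): 0.3,
    ("H", "T", "H"): 0.4, ("H", "T", "T"): 0.6,
    ("T", "H", "H"): 0.5, ("T", "H", "T"): 0.5,
    ("T", "T", "H"): 0.2, ("T", "T", "T"): 0.8,
}
pair_states, a1 = to_first_order(("H", "T"), a2)
```

With K = 2 original states this yields the K^2 = 4 pair states HH, HT, TH, TT, and every row of `a1` still sums to 1.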
- Slide 7
- Modeling the Duration of States
  Length distribution of region X: $E[l_X] = 1/(1-p)$
  Geometric distribution, with mean $1/(1-p)$
  This is a significant disadvantage of HMMs
  Several solutions exist for modeling different length distributions
  [Figure: states X and Y with self-loop probabilities $p$ and $q$ and exit probabilities $1-p$ and $1-q$]
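
To make the geometric length distribution concrete: a state with self-loop probability p stays for exactly m steps with probability p^(m-1) (1-p). The short check below (function name and the value p = 0.9 are chosen for illustration) confirms the mean 1/(1-p) numerically by summing a long prefix of the series.

```python
def geometric_duration_pmf(p, m):
    """P(l_X = m): stay m - 1 times with probability p, then leave with 1 - p."""
    return p ** (m - 1) * (1 - p)

p = 0.9
# Truncated sum; the neglected tail is far below floating-point precision.
mean = sum(m * geometric_duration_pmf(p, m) for m in range(1, 2000))
most_likely = max(range(1, 2000), key=lambda m: geometric_duration_pmf(p, m))
```

Here `mean` is approximately 10 = 1/(1 - 0.9), while `most_likely` is 1: whatever p is, the shortest duration is always the most probable one, which is why this distribution fits features such as exon lengths poorly.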
- Slide 8
- Example: exon lengths in genes
- Slide 9
- Solution 1: Chain several states
  Disadvantage: still very inflexible
  $l_X = C + \text{geometric with mean } 1/(1-p)$
  [Figure: a chain of X states leading to Y, with self-loop probabilities $p$, $q$ and exit probabilities $1-p$, $1-q$]
- Slide 10
- Solution 2: Negative binomial distribution
  Duration in X: $m$ turns, where
  during the first $m-1$ turns, exactly $n-1$ arrows to the next state are followed;
  during the $m$th turn, an arrow to the next state is followed
  $P(l_X = m) = \binom{m-1}{n-1} (1-p)^{n-1+1}\, p^{(m-1)-(n-1)} = \binom{m-1}{n-1} (1-p)^{n}\, p^{m-n}$
  [Figure: chain $X^{(n)} \to \cdots \to X^{(2)} \to X^{(1)} \to Y$, each X state with self-loop probability $p$ and forward probability $1-p$]
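
The negative binomial duration formula above is easy to check numerically. The sketch below uses Python's `math.comb`; the function name and the values p = 0.9, n = 3 are illustrative choices, and the sums are truncated at m = 2000, where the remaining tail is negligible.

```python
from math import comb

def negative_binomial_duration_pmf(p, n, m):
    """P(l_X = m) for a chain of n copies of X, each with self-loop probability p."""
    if m < n:
        return 0.0  # at least one turn is spent in each of the n copies
    return comb(m - 1, n - 1) * (1 - p) ** n * p ** (m - n)

p, n = 0.9, 3
total = sum(negative_binomial_duration_pmf(p, n, m) for m in range(1, 2000))
mean = sum(m * negative_binomial_duration_pmf(p, n, m) for m in range(1, 2000))
```

`total` comes out as approximately 1 and `mean` as approximately n/(1-p) = 30. With n = 1 the formula reduces to the geometric distribution of a single state.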
- Slide 11
- Example: genes in prokaryotes
  EasyGene: prokaryotic gene-finder (Larsen TS, Krogh A)
  Negative binomial with n = 3
- Slide 12
- Solution 3: Duration modeling
  Upon entering a state:
  1. Choose duration $d$, according to a probability distribution
  2. Generate $d$ letters according to the emission probs
  3. Take a transition to the next state according to the transition probs
  Disadvantage: increase in the complexity of Viterbi:
  Time: by a factor of $O(D)$
  Space: by a factor of $O(1)$
  where $D$ = maximum duration of a state
  [Figure: a state F with duration $d$]
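
The three generative steps above can be sketched as a small sampler. Everything concrete here is an assumption made for illustration: the function name `sample_segments`, the two states `exon` and `intron`, and all of the duration, emission, and transition tables.

```python
import random

def sample_segments(n_segments, start, duration, emit, trans, rng):
    """Generate (state, emitted string) segments from an explicit-duration model."""
    state, segments = start, []
    for _ in range(n_segments):
        lengths, length_probs = zip(*duration[state].items())
        d = rng.choices(lengths, weights=length_probs)[0]          # step 1: pick d
        symbols, symbol_probs = zip(*emit[state].items())
        letters = rng.choices(symbols, weights=symbol_probs, k=d)  # step 2: emit d letters
        segments.append((state, "".join(letters)))
        targets, target_probs = zip(*trans[state].items())
        state = rng.choices(targets, weights=target_probs)[0]      # step 3: move on
    return segments

# Hypothetical tables; each state has its own explicit length distribution.
duration = {"exon": {3: 0.2, 6: 0.5, 9: 0.3}, "intron": {2: 0.5, 4: 0.5}}
emit = {"exon": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
        "intron": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}
trans = {"exon": {"intron": 1.0}, "intron": {"exon": 1.0}}
segments = sample_segments(4, "exon", duration, emit, trans, random.Random(0))
```

Because the duration is drawn explicitly, the length distribution of a state can be any table over 1..D rather than a geometric one; the cost shows up at decoding time, where Viterbi has to consider every candidate duration.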
- Features that depend on many pos. in x
  What do we put in $g(k, l, x, i)$?
  The higher $g(k, l, x, i)$, the more we like going from $k$ to $l$ at position $i$
  Richer models using this additional power
  Examples:
  Casino player looks at the previous 100 posns; if > 50 6s, he likes to go to Fair
  $g(\text{Loaded}, \text{Fair}, x, i)$ += $1[x_{i-100}, \ldots, x_{i-1} \text{ has} > 50 \text{ 6s}] \times w_{\text{DONT\_GET\_CAUGHT}}$
  Genes are close to CpG islands; for any state $k$,
  $g(k, \text{exon}, x, i)$ += $1[x_{i-1000}, \ldots, x_{i+1000} \text{ has} > 1/16 \text{ CpG}] \times w_{\text{CG\_RICH\_REGION}}$
  [Figure: states $\pi_{i-1}$, $\pi_i$ connected to the observations $x_1, \ldots, x_{10}$]
- Slide 19
- Features that depend on many pos. in x
  Conditional Random Fields: Features
  1. Define a set of features that you think are important
     All features should be functions of current state, previous state, $x$, and position $i$
     Example:
     Old features: transition $k \to l$, emission $b$ from state $k$
     Plus new features: prev 100 letters have 50 6s
     Number the features $1 \ldots n$: $f_1(k, l, x, i), \ldots, f_n(k, l, x, i)$
     Features are indicator true/false variables
     Find appropriate weights $w_1, \ldots, w_n$ for when each feature is true
     Weights are the parameters of the model
  2. Let's assume for now each feature $f_j$ has a weight $w_j$
     Then, $g(k, l, x, i) = \sum_{j=1}^{n} f_j(k, l, x, i) \times w_j$
  [Figure: states connected to the observations $x_1, \ldots, x_{10}$]
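
A minimal sketch of features as indicator functions and of g as their weighted sum is given below. The helper names, the 0-based position convention, and the three weights are assumptions for illustration; the "more than 50 sixes in the previous 100 rolls" feature follows the casino example in the slides.

```python
def transition_feature(src, dst):
    """Old-style feature: 1 when the move is src -> dst."""
    return lambda k, l, x, i: int(k == src and l == dst)

def emission_feature(state, symbol):
    """Old-style feature: 1 when the current state emits the given symbol."""
    return lambda k, l, x, i: int(l == state and x[i] == symbol)

def more_than_50_sixes(k, l, x, i):
    """New non-local feature: Loaded -> Fair after > 50 sixes in the previous 100 rolls."""
    window = x[max(0, i - 100):i]
    return int(k == "Loaded" and l == "Fair" and window.count("6") > 50)

def g(k, l, x, i, features, weights):
    """g(k, l, x, i) = sum_j f_j(k, l, x, i) * w_j."""
    return sum(w * f(k, l, x, i) for f, w in zip(features, weights))

features = [transition_feature("Loaded", "Fair"),
            emission_feature("Fair", "3"),
            more_than_50_sixes]
weights = [-2.0, -1.8, 1.5]      # hypothetical values; training would set these
x = "6" * 60 + "1" * 40 + "3"    # 60 sixes among the first 100 rolls, then a 3
switch_score = g("Loaded", "Fair", x, 100, features, weights)
stay_score = g("Fair", "Fair", x, 100, features, weights)
```

For this input all three features fire on the Loaded to Fair move, so `switch_score` is -2.0 - 1.8 + 1.5, while only the emission feature fires for Fair to Fair.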
- Slide 20
- Features that depend on many pos. in x
  Define $V_k(i)$: optimal score of parsing $x_1 \ldots x_i$ and ending in state $k$
  Then, assuming $V_k(i)$ is optimal for every $k$ at position $i$, it follows that
  $V_l(i+1) = \max_k \left[ V_k(i) + g(k, l, x, i+1) \right]$
  Why? Even though at posn $i+1$ we look at arbitrary positions in $x$, we are only affected by the choice of ending state $k$
  Therefore, the Viterbi algorithm again finds the optimal (highest scoring) parse for $x_1 \ldots x_N$
  [Figure: states connected to the observations $x_1, \ldots, x_{10}$]
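
The recurrence above translates directly into code. The sketch below assumes 0-based positions and a separate `start_score(l, x)` function for the first position, a convention the slides do not specify; the scoring function `g` mixes HMM-style log-probabilities with one made-up non-local bonus to show that looking at a window of x does not change the algorithm.

```python
import math
from itertools import product

def viterbi(x, states, g, start_score):
    """Highest-scoring parse under V_l(i+1) = max_k [V_k(i) + g(k, l, x, i+1)]."""
    V = [{l: start_score(l, x) for l in states}]
    back = []
    for i in range(1, len(x)):
        row, ptr = {}, {}
        for l in states:
            best_k = max(states, key=lambda k: V[-1][k] + g(k, l, x, i))
            row[l] = V[-1][best_k] + g(best_k, l, x, i)
            ptr[l] = best_k
        V.append(row)
        back.append(ptr)
    last = max(states, key=lambda k: V[-1][k])
    path = [last]
    for ptr in reversed(back):  # follow back-pointers from the end
        path.append(ptr[path[-1]])
    path.reverse()
    return path, V[-1][last]

# Hypothetical casino scores (F = fair, L = loaded).
states = ("F", "L")
log_a = {("F", "F"): math.log(0.95), ("F", "L"): math.log(0.05),
         ("L", "F"): math.log(0.10), ("L", "L"): math.log(0.90)}

def log_e(l, c):
    return math.log(1 / 6) if l == "F" else math.log(0.5 if c == "6" else 0.1)

def g(k, l, x, i):
    score = log_a[k, l] + log_e(l, x[i])
    # Made-up non-local feature: reward L -> F after 3 or more sixes in the last 5 rolls.
    if k == "L" and l == "F" and x[max(0, i - 5):i].count("6") >= 3:
        score += 1.0
    return score

def start_score(l, x):
    return math.log(0.5) + log_e(l, x[0])

x = "316664662131"
path, score = viterbi(x, states, g, start_score)
```

On a sequence this short the result can be checked by enumerating all 2^12 parses with `itertools.product` and confirming that none scores higher than `score`.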
- Slide 21
- Features that depend on many pos. in x
  Score of a parse depends on all of $x$ at each position
  Can still do Viterbi because state $\pi_i$ only looks at prev. state $\pi_{i-1}$ and the constant sequence $x$
  [Figure: HMM and CRF graphical models over states $\pi_1, \ldots, \pi_6$ and observations $x_1, \ldots, x_6$]
- Slide 22
- How many parameters are there, in general?
  Arbitrarily many parameters!
  For example, let $f_j(k, l, x, i)$ depend on $x_{i-5}, x_{i-4}, \ldots, x_{i+5}$
  Then, we would have up to $K\,|\Sigma|^{11}$ parameters!
  Advantage: powerful, expressive model
  Example: if there are more than 50 6s in the last 100 rolls, but in the surrounding 18 rolls there are at most 3 6s, this is evidence we are in the Fair state
  Interpretation: the casino player is afraid to be caught, so he switches to Fair when he sees too many 6s
  Example: if there are any CG-rich regions in the vicinity (window of 2000 pos), then favor predicting lots of genes in this region
  Question: how do we train these parameters?
- Slide 23
- Conditional Training
  Hidden Markov Model training:
  Given training sequence $x$ and true parse $\pi$, maximize $P(x, \pi)$
  Disadvantage: $P(x, \pi) = P(\pi \mid x)\, P(x)$
  $P(\pi \mid x)$: quantity we care about so as to get a good parse
  $P(x)$: quantity we don't care so much about, because $x$ is always given
- Slide 24
- Conditional Training
  $P(x, \pi) = P(\pi \mid x)\, P(x)$
  $P(\pi \mid x) = P(x, \pi) / P(x)$
  Recall $F(j, x, \pi)$ = # times feature $f_j$ occurs in $(x, \pi)$ = $\sum_{i=1}^{N} f_j(k, l, x, i)$ with $k = \pi_{i-1}$, $l = \pi_i$; count $f_j$ in $x, \pi$
  In HMMs, let's denote by $w_j$ the weight of the $j$th feature: $w_j = \log(a_{kl})$ or $\log(e_k(b))$
  Then,
  HMM: $P(x, \pi) = \exp\left[ \sum_{j=1}^{n} w_j F(j, x, \pi) \right]$
  CRF: $\text{Score}(x, \pi) = \exp\left[ \sum_{j=1}^{n} w_j F(j, x, \pi) \right]$
- Slide 25
- Conditional Training
  In HMMs,
  $P(\pi \mid x) = P(x, \pi) / P(x)$
  $P(x, \pi) = \exp\left[ \sum_{j=1}^{n} w_j F(j, x, \pi) \right]$
  $P(x) = \sum_{\pi} \exp\left[ \sum_{j=1}^{n} w_j F(j, x, \pi) \right] =: Z$
  Then, in CRF we can do the same to normalize $\text{Score}(x, \pi)$ into a prob.
  $P_{\text{CRF}}(\pi \mid x) = \exp\left[ \sum_{j=1}^{n} w_j F(j, x, \pi) \right] / Z$
  QUESTION: Why is this a probability???
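
One way to see why the normalized score is a probability is to compute Z by brute force on a tiny example: every score is positive, and dividing by their sum over all parses makes the values add up to 1. The sketch below uses a made-up scoring function `g` with the weights already folded in, and passes `None` as the previous state at the first position, which is an assumed convention.

```python
import math
from itertools import product

def g(k, l, x, i):
    """Hypothetical hand-set score for moving from state k to l at position i."""
    score = 0.0
    if k == l:
        score += 0.8              # "stay in the same state" feature
    if l == "L" and x[i] == "6":
        score += 1.2              # Loaded likes sixes
    if l == "F" and x[i] != "6":
        score += 0.5              # Fair likes everything else
    return score

def score(pi, x):
    """exp of the summed g values, i.e. exp[sum_j w_j F(j, x, pi)]."""
    return math.exp(sum(g(pi[i - 1] if i else None, pi[i], x, i) for i in range(len(x))))

def crf_probability(pi, x, states):
    """P_CRF(pi | x) = Score(x, pi) / Z, with Z summed over every possible parse."""
    Z = sum(score(path, x) for path in product(states, repeat=len(x)))
    return score(pi, x) / Z

x, states = "1662", ("F", "L")
probs = {path: crf_probability(path, x, states)
         for path in product(states, repeat=len(x))}
```

Summing `probs.values()` gives 1 up to rounding. Enumerating all K^N parses is only feasible for toy inputs, which is why Z is computed with a sum-of-paths recursion in practice.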
- Slide 26
- Conditional Training
  1. We need to be given a set of sequences $x$ and true parses $\pi$
  2. Calculate $Z$ by a sum-of-paths algorithm similar to HMM
     We can then easily calculate $P(\pi \mid x)$
  3. Calculate the partial derivative of $P(\pi \mid x)$ w.r.t. each parameter $w_j$ (not covered; akin to forward/backward)
     Update each parameter with gradient descent!
  4. Continue until convergence to the optimal set of weights
  $P(\pi \mid x) = \exp\left[ \sum_{j=1}^{n} w_j F(j, x, \pi) \right] / Z$: its negative log, $-\log P(\pi \mid x)$, is convex in the weights!!!
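
The training loop above can be sketched on a toy problem. Two caveats: the slides do not cover the analytic derivative, so this sketch substitutes a central finite difference for it, and it performs gradient ascent on log P(π | x), which is the same as gradient descent on the negative log. The feature set, the learning rate, and the labeled example are all made up for illustration.

```python
import math

def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def g(k, l, x, i, features, w):
    return sum(wj * f(k, l, x, i) for f, wj in zip(features, w))

def log_Z(x, states, features, w):
    """Sum-of-paths (forward) recursion in log space; k is None at position 0."""
    f = {l: g(None, l, x, 0, features, w) for l in states}
    for i in range(1, len(x)):
        f = {l: logsumexp([f[k] + g(k, l, x, i, features, w) for k in states])
             for l in states}
    return logsumexp(list(f.values()))

def log_prob(pi, x, states, features, w):
    """log P(pi | x) = sum_j w_j F(j, x, pi) - log Z."""
    path_score = sum(g(pi[i - 1] if i else None, pi[i], x, i, features, w)
                     for i in range(len(x)))
    return path_score - log_Z(x, states, features, w)

def gradient_step(pi, x, states, features, w, rate=0.05, h=1e-5):
    """One ascent step on log P(pi | x), using numerical partial derivatives."""
    grad = []
    for j in range(len(w)):
        up = w[:j] + [w[j] + h] + w[j + 1:]
        down = w[:j] + [w[j] - h] + w[j + 1:]
        grad.append((log_prob(pi, x, states, features, up)
                     - log_prob(pi, x, states, features, down)) / (2 * h))
    return [wj + rate * gj for wj, gj in zip(w, grad)]

states = ("F", "L")
features = [
    lambda k, l, x, i: int(k == l),                    # stay in the same state
    lambda k, l, x, i: int(l == "L" and x[i] == "6"),  # Loaded emits a six
    lambda k, l, x, i: int(l == "F" and x[i] != "6"),  # Fair emits a non-six
]
x, pi = "16662316", "FLLLFFFL"   # hypothetical labeled training example
w = [0.0, 0.0, 0.0]
history = [log_prob(pi, x, states, features, w)]
for _ in range(20):
    w = gradient_step(pi, x, states, features, w)
    history.append(log_prob(pi, x, states, features, w))
```

With all weights at zero every one of the 2^8 parses is equally likely, so `history[0]` equals -8 log 2; each step then raises the conditional log-probability of the true parse.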
- Slide 27
- Conditional Random Fields: Summary
  1. Ability to incorporate complicated non-local feature sets
     Do away with some independence assumptions of HMMs
     Parsing is still equally efficient
  2. Conditional training
     Train parameters that are best for parsing, not modeling
     Need labeled examples: sequences $x$ and true parses $\pi$
     (Can train on unlabeled sequences; however, it is unreasonable to train too many parameters this way)
     Training is significantly slower: many iterations of forward/backward