First, let's define the sample function:
from numpy import cumsum
from numpy.random import uniform

def sample(dist, num_samples=1):
    """
    Uses the inverse CDF method to return samples drawn from an
    (unnormalized) discrete distribution.
    Arguments:
    dist -- (unnormalized) distribution
    Keyword arguments:
    num_samples -- number of samples to draw
    """
    cdf = cumsum(dist)
    r = uniform(size=num_samples) * cdf[-1]
    return cdf.searchsorted(r)
As we can see, sample takes two parameters: dist, which may be an unnormalized distribution, and num_samples, the number of samples to draw.
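As a quick check of the inverse CDF method, the sketch below redefines sample so that it runs on its own and draws from the unnormalized weights [1, 2, 7]. The weights, the seed, and the number of draws are illustrative assumptions, not values from the original post. The empirical frequencies should land near 0.1, 0.2, and 0.7.

```python
import numpy as np

def sample(dist, num_samples=1):
    """Inverse CDF sampling from an (unnormalized) discrete distribution."""
    cdf = np.cumsum(dist)
    # Scale uniform draws to [0, total mass) so no normalization is needed.
    r = np.random.uniform(size=num_samples) * cdf[-1]
    # The index of the first CDF entry >= r is the sampled outcome.
    return cdf.searchsorted(r)

np.random.seed(0)  # illustrative seed, for repeatability only
weights = np.array([1.0, 2.0, 7.0])  # illustrative unnormalized weights
draws = sample(weights, num_samples=10000)
freqs = np.bincount(draws, minlength=len(weights)) / len(draws)
print(freqs)  # roughly weights / weights.sum()
```

Because the uniform draws are multiplied by cdf[-1], the total mass, the caller never has to normalize dist first.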
Let's see how to generate a corpus for the Dirichlet--multinomial unigram language model:

from numpy import array
from numpy.random.mtrand import dirichlet

def generate_corpus(beta, mean, N):
    """
    Returns a corpus of tokens drawn from a Dirichlet--multinomial
    unigram language model. Each token is an instance of one of V
    unique word types, represented by indices 0, ..., V - 1.
    Arguments:
    beta -- concentration parameter for the Dirichlet prior
    mean -- V-dimensional mean of the Dirichlet prior
    N -- number of tokens to generate
    """
    # Draw phi ~ Dirichlet(beta * mean); the result has shape (1, V),
    # which cumsum inside sample flattens to length V.
    phi = dirichlet(beta * array(mean), size=1)
    # Draw N tokens from phi.
    return sample(phi, N)
Keep in mind that dirichlet is imported with "from numpy.random.mtrand import dirichlet" (the same function is also available as numpy.random.dirichlet), and that the parameter it receives is beta*array(mean), where beta is the concentration parameter and mean is a vector that sums to 1.
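To see what the beta*array(mean) parameterization does, here is a minimal sketch that draws from the Dirichlet with a small and a large concentration. The mean vector, the two beta values, and the seed are illustrative assumptions. Every draw is a probability vector, the draws average out to mean, and a larger beta pulls them closer to it.

```python
import numpy as np

np.random.seed(0)  # illustrative seed, for repeatability only
mean = np.array([0.5, 0.3, 0.2])  # illustrative mean vector, sums to 1

# Each row is one probability vector drawn from Dirichlet(beta * mean).
small = np.random.dirichlet(1.0 * mean, size=2000)    # beta = 1
large = np.random.dirichlet(100.0 * mean, size=2000)  # beta = 100

for name, draws in (("beta=1", small), ("beta=100", large)):
    # Column means stay near `mean`; the spread shrinks as beta grows.
    print(name, draws.mean(axis=0), draws.std(axis=0))
```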
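Another way to generate the corpus is to use the predictive property of the model, in which phi is integrated out:
P(w_{N+1} = v | D, H) = (N_v + beta * n_v) / (N + beta)
Here N_v is the number of tokens of word type v generated so far, N is the total number of tokens generated so far, and n_v is the v-th component of the prior mean (mean[v] in the code).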
from numpy import array, zeros

def generate_corpus_collapsed(beta, mean, N):
    """
    Returns a corpus of tokens drawn from a Dirichlet--multinomial
    unigram language model using the 'collapsed' generative process
    (i.e., phi is not explicitly represented). Each token is an
    instance of one of V unique word types.
    Arguments:
    beta -- concentration parameter for the Dirichlet prior
    mean -- V-dimensional mean of the Dirichlet prior
    N -- number of tokens to generate
    """
    V = len(mean) # vocabulary size
    corpus = zeros(N, dtype=int) # corpus
    Nv = zeros(V, dtype=int) # counts for each word type
    for n in range(N):
        # Predictive distribution given the n tokens generated so far.
        corpus[n] = sample((Nv + beta * array(mean)) / (n + beta), 1)[0]
        Nv[corpus[n]] += 1
    return corpus
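A small worked example may help with the expression passed to sample in generate_corpus_collapsed, (Nv + beta*array(mean)) / (n + beta). The helper name predictive_probs and the numbers below are hypothetical, chosen only for illustration. With counts [3, 1], beta = 2, and a uniform mean [0.5, 0.5], the probabilities are (3 + 1) / 6 and (1 + 1) / 6.

```python
import numpy as np

def predictive_probs(counts, beta, mean):
    """Hypothetical helper: P(next token = v | counts) for every word type v."""
    counts = np.asarray(counts, dtype=float)
    mean = np.asarray(mean, dtype=float)
    # (N_v + beta * mean_v) / (N + beta), where N is the number of tokens so far.
    return (counts + beta * mean) / (counts.sum() + beta)

# Illustrative values: 4 tokens seen so far, 3 of type 0 and 1 of type 1.
probs = predictive_probs(counts=[3, 1], beta=2.0, mean=[0.5, 0.5])
print(probs)        # 4/6 and 2/6
print(probs.sum())  # the predictive distribution is already normalized
```

Since the counts sum to N and mean sums to 1, the numerators sum to N + beta, so the division yields a proper distribution. The sample function would also accept the unnormalized numerators alone.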
Let's see how to generate a corpus for the mixture of Dirichlet--multinomial unigram language models:

def generate_corpus(alpha, m, beta, n, D, Nd):
    """
    Returns a grouped corpus drawn from a mixture of
    Dirichlet--multinomial unigram language models.
    Arguments:
    alpha -- concentration parameter for the Dirichlet prior over theta
    m -- T-dimensional mean of the Dirichlet prior over theta
    beta -- concentration parameter for the Dirichlet prior over phis
    n -- V-dimensional mean of the Dirichlet prior over phis
    D -- number of documents to generate
    Nd -- number of tokens to generate per document
    """
    # GroupedCorpus is assumed to be defined elsewhere; it is not shown in this post.
    corpus = GroupedCorpus()
    # theta: distribution over the T topics, drawn from Dirichlet(alpha * m)
    theta = dirichlet(alpha * array(m), 1)
    # phis: one distribution over words per topic, each drawn from Dirichlet(beta * n)
    phis = dirichlet(beta * array(n), len(m))
    for d in range(D):
        [t] = sample(theta, 1)  # one topic per document
        corpus.add(str(d), str(t), [str(x) for x in sample(phis[t, :], Nd)])
    return corpus
Note that there are T topics (groups).
phis=dirichlet(beta*array(n),len(m)) draws T word distributions from the Dirichlet prior, and documents with the same topic t should use the same distribution phis[t,:].
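The following is a minimal, self-contained sketch of the same mixture process that returns plain Python tuples instead of a GroupedCorpus, so that it can be run without that class. The function name generate_mixture, the return format, and all argument values are assumptions made for illustration only.

```python
import numpy as np

def sample(dist, num_samples=1):
    """Inverse CDF sampling from an (unnormalized) discrete distribution."""
    cdf = np.cumsum(dist)
    r = np.random.uniform(size=num_samples) * cdf[-1]
    return cdf.searchsorted(r)

def generate_mixture(alpha, m, beta, n, D, Nd):
    """Hypothetical variant: returns a list of (topic, tokens), one per document."""
    # Topic proportions, a single draw of length T.
    theta = np.random.dirichlet(alpha * np.asarray(m))
    # One word distribution per topic: T rows, each of length V.
    phis = np.random.dirichlet(beta * np.asarray(n), size=len(m))
    docs = []
    for d in range(D):
        t = int(sample(theta)[0])  # one topic per document
        # Every document with topic t draws its tokens from the same row phis[t].
        docs.append((t, sample(phis[t], Nd)))
    return docs

np.random.seed(0)  # illustrative seed, for repeatability only
docs = generate_mixture(alpha=1.0, m=[0.5, 0.5], beta=1.0,
                        n=[0.25, 0.25, 0.25, 0.25], D=5, Nd=8)
for t, tokens in docs:
    print(t, tokens)
```

Drawing phis once, outside the document loop, is what makes documents that share a topic also share a word distribution.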
posted on 2012-10-28 10:13
luis