{"title": "Replicated Softmax: an Undirected Topic Model", "book": "Advances in Neural Information Processing Systems", "page_first": 1607, "page_last": 1614, "abstract": "We show how to model documents as bags of words using family of two-layer, undirected graphical models. Each member of the family has the same number of binary hidden units but a different number of ``softmax visible units. All of the softmax units in all of the models in the family share the same weights to the binary hidden units. We describe efficient inference and learning procedures for such a family. Each member of the family models the probability distribution of documents of a specific length as a product of topic-specific distributions rather than as a mixture and this gives much better generalization than Latent Dirichlet Allocation for modeling the log probabilities of held-out documents. The low-dimensional topic vectors learned by the undirected family are also much better than LDA topic vectors for retrieving documents that are similar to a query document. The learned topics are more general than those found by LDA because precision is achieved by intersecting many general topics rather than by selecting a single precise topic to generate each word.", "full_text": "Replicated Softmax: an Undirected Topic Model\n\nRuslan Salakhutdinov\n\nBrain and Cognitive Sciences and CSAIL\n\nMassachusetts Institute of Technology\n\nGeoffrey Hinton\n\nDepartment of Computer Science\n\nUniversity of Toronto\n\nrsalakhu@mit.edu\n\nhinton@cs.toronto.edu\n\nAbstract\n\nWe introduce a two-layer undirected graphical model, called a \u201cReplicated Soft-\nmax\u201d, that can be used to model and automatically extract low-dimensional latent\nsemantic representations from a large unstructured collection of documents. 
We present efficient learning and inference algorithms for this model, and show how a Monte-Carlo based method, Annealed Importance Sampling, can be used to produce an accurate estimate of the log-probability the model assigns to test data. This allows us to demonstrate that the proposed model is able to generalize much better than Latent Dirichlet Allocation in terms of both the log-probability of held-out documents and the retrieval accuracy.

1 Introduction

Probabilistic topic models [2, 9, 6] are often used to analyze and extract semantic topics from large text collections. Many of the existing topic models are based on the assumption that each document is represented as a mixture of topics, where each topic defines a probability distribution over words. The mixing proportions of the topics are document specific, but the probability distribution over words, defined by each topic, is the same across all documents.

All these models can be viewed as graphical models in which latent topic variables have directed connections to observed variables that represent words in a document. One major drawback is that exact inference in these models is intractable, so one has to resort to slow or inaccurate approximations to compute the posterior distribution over topics. A second major drawback, shared by all mixture models, is that these models can never make predictions for words that are sharper than the distributions predicted by any of the individual topics. They are unable to capture the essential idea of distributed representations, which is that the distributions predicted by individual active features get multiplied together (and renormalized) to give the distribution predicted by a whole set of active features. This allows individual features to be fairly general but their intersection to be much more precise.
For example, distributed representations allow the topics “government”, “mafia” and “playboy” to combine to give very high probability to a word “Berlusconi” that is not predicted nearly as strongly by each topic alone.

To date, there has been very little work on developing topic models using undirected graphical models. Several authors [4, 17] used two-layer undirected graphical models, called Restricted Boltzmann Machines (RBMs), in which word-count vectors are modeled as a Poisson distribution. While these models are able to produce distributed representations of the input and perform well in terms of retrieval accuracy, they are unable to properly deal with documents of different lengths, which makes learning very unstable and hard. This is perhaps the main reason why these potentially powerful models have not found their application in practice. Directed models, on the other hand, can easily handle unobserved words (by simply ignoring them), which allows them to easily deal with different-sized documents. For undirected models, marginalizing over unobserved variables is generally a non-trivial operation, which makes learning far more difficult. Recently, [13] attempted to fix this problem by proposing a Constrained Poisson model that would ensure that the mean Poisson rates across all words sum up to the length of the document. While the parameter learning has been shown to be stable, the introduced model no longer defines a proper probability distribution over the word counts.

In the next section we introduce a “Replicated Softmax” model. The model can be efficiently trained using Contrastive Divergence, it has a better way of dealing with documents of different lengths, and computing the posterior distribution over the latent topic values is easy.
We will also demonstrate that the proposed model is able to generalize much better than a popular Bayesian mixture model, Latent Dirichlet Allocation (LDA) [2], in terms of both the log-probability on previously unseen documents and the retrieval accuracy.

2 Replicated Softmax: A Generative Model of Word Counts

Consider modeling discrete visible units v using a restricted Boltzmann machine that has a two-layer architecture, as shown in Fig. 1. Let v \in \{1, ..., K\}^D, where K is the dictionary size and D is the document size, and let h \in \{0, 1\}^F be binary stochastic hidden topic features. Let V be a K \times D observed binary matrix with v_i^k = 1 if visible unit i takes on the kth value. We define the energy of the state \{V, h\} as follows:

E(V, h) = - \sum_{i=1}^{D} \sum_{j=1}^{F} \sum_{k=1}^{K} W_{ij}^k h_j v_i^k - \sum_{i=1}^{D} \sum_{k=1}^{K} v_i^k b_i^k - \sum_{j=1}^{F} h_j a_j,   (1)

where \{W, a, b\} are the model parameters: W_{ij}^k is a symmetric interaction term between visible unit i that takes on value k, and hidden feature j, b_i^k is the bias of unit i that takes on value k, and a_j is the bias of hidden feature j (see Fig. 1). The probability that the model assigns to a visible binary matrix V is:

P(V) = \frac{1}{Z} \sum_h \exp(-E(V, h)),   Z = \sum_V \sum_h \exp(-E(V, h)),   (2)

where Z is known as the partition function or normalizing constant. The conditional distributions are given by softmax and logistic functions:

p(v_i^k = 1 | h) = \frac{\exp(b_i^k + \sum_{j=1}^{F} h_j W_{ij}^k)}{\sum_{q=1}^{K} \exp(b_i^q + \sum_{j=1}^{F} h_j W_{ij}^q)},   (3)

p(h_j = 1 | V) = \sigma\Big(a_j + \sum_{i=1}^{D} \sum_{k=1}^{K} v_i^k W_{ij}^k\Big),   (4)

where \sigma(x) = 1/(1 + \exp(-x)) is the logistic function.

Now suppose that for each document we create a separate RBM with as many softmax units as there are words in the document.
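Before moving to the replicated case, the per-position conditionals of Eqs. 3 and 4 can be sketched numerically. This is an illustrative NumPy sketch with toy sizes and random parameters (none of the values come from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

K, D, F = 5, 3, 4                      # toy vocabulary size, document size, hidden units (assumed)
W = rng.normal(0.0, 0.1, (K, D, F))    # W[k, i, j]: interaction between visible unit i at value k and hidden j
b = rng.normal(0.0, 0.1, (K, D))       # b[k, i]: visible biases
a = np.zeros(F)                        # a[j]: hidden biases

def p_v_given_h(h):
    """Eq. 3: each visible unit i is a softmax over its K possible values."""
    logits = b + np.einsum('kij,j->ki', W, h)
    logits -= logits.max(axis=0, keepdims=True)      # stabilize the exponentials
    p = np.exp(logits)
    return p / p.sum(axis=0, keepdims=True)          # each column sums to 1

def p_h_given_V(V):
    """Eq. 4: logistic function of the summed input from the K x D one-hot matrix V."""
    return 1.0 / (1.0 + np.exp(-(a + np.einsum('kij,ki->j', W, V))))

h = rng.integers(0, 2, F).astype(float)
pv = p_v_given_h(h)                    # (K, D): one distribution over words per position

V = np.zeros((K, D))
V[0, :] = 1.0                          # a document whose every word is word 0
ph = p_h_given_V(V)                    # (F,): posterior activation probabilities
```

Note that the posterior over the hidden topic features is exact and factorial, which is the "easy inference" property claimed for this family.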
Assuming we can ignore the order of the words, all of these softmax units can share the same set of weights, connecting them to binary hidden units. Consider a document that contains D words. In this case, we define the energy of the state \{V, h\} to be:

E(V, h) = - \sum_{j=1}^{F} \sum_{k=1}^{K} W_j^k h_j \hat{v}^k - \sum_{k=1}^{K} \hat{v}^k b^k - D \sum_{j=1}^{F} h_j a_j,   (5)

where \hat{v}^k = \sum_{i=1}^{D} v_i^k denotes the count for the kth word. Observe that the bias terms of the hidden units are scaled up by the length of the document. This scaling is crucial and allows hidden topic units to behave sensibly when dealing with documents of different lengths.

Given a collection of N documents \{V_n\}_{n=1}^{N}, the derivative of the log-likelihood with respect to parameters W takes the form:

\frac{1}{N} \sum_{n=1}^{N} \frac{\partial \log P(V_n)}{\partial W_j^k} = E_{P_{data}}[\hat{v}^k h_j] - E_{P_{Model}}[\hat{v}^k h_j],

where E_{P_{data}}[\cdot] denotes an expectation with respect to the data distribution P_{data}(h, V) = p(h|V) P_{data}(V), with P_{data}(V) = \frac{1}{N} \sum_n \delta(V - V_n) representing the empirical distribution, and E_{P_{Model}}[\cdot] is an expectation with respect to the distribution defined by the model.

Figure 1: Replicated Softmax model. The top layer represents a vector h of stochastic, binary topic features and the bottom layer represents softmax visible units v. All visible units share the same set of weights, connecting them to binary hidden units. Left: The model for a document containing two and three words. Right: A different interpretation of the Replicated Softmax model, in which D softmax units with identical weights are replaced by a single multinomial unit which is sampled D times.
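The energy in Eq. 5 depends on the document only through the count vector, which makes it cheap to evaluate. A minimal sketch, with hypothetical toy shapes (here W is stored as an F x K matrix with W[j, k] = W_j^k):

```python
import numpy as np

def replicated_softmax_energy(v_hat, h, W, b, a):
    """Eq. 5: E = -sum_jk W[j,k] h[j] v_hat[k] - sum_k b[k] v_hat[k] - D * sum_j a[j] h[j],
    where D = v_hat.sum() is the document length that scales the hidden biases."""
    D = v_hat.sum()
    return float(-(h @ W @ v_hat) - b @ v_hat - D * (a @ h))

# A two-word toy document over a three-word vocabulary (illustrative values):
v_hat = np.array([2.0, 0.0, 0.0])      # both words are word 0
W = np.ones((2, 3))                    # F = 2 hidden units, K = 3 words
b = np.zeros(3)
a = np.ones(2)
E = replicated_softmax_energy(v_hat, np.array([1.0, 1.0]), W, b, a)
# weight term = 4, visible-bias term = 0, scaled hidden-bias term = 2 * 2 = 4, so E = -8
```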
Exact maximum likelihood learning in this model is intractable because exact computation of the expectation E_{P_{Model}}[\cdot] takes time that is exponential in \min\{D, F\}, i.e. the number of visible or hidden units. To avoid computing this expectation, learning is done by following an approximation to the gradient of a different objective function, called “Contrastive Divergence” (CD) [7]:

\Delta W_j^k = \alpha \Big( E_{P_{data}}[\hat{v}^k h_j] - E_{P_T}[\hat{v}^k h_j] \Big),   (6)

where \alpha is the learning rate and P_T represents a distribution defined by running the Gibbs chain, initialized at the data, for T full steps. The special bipartite structure of RBMs allows for quite an efficient Gibbs sampler that alternates between sampling the states of the hidden units independently given the states of the visible units, and vice versa (see Eqs. 3, 4). Setting T = \infty recovers maximum likelihood learning.

The weights can now be shared by the whole family of different-sized RBMs that are created for documents of different lengths (see Fig. 1). We call this the “Replicated Softmax” model. A pleasing property of this model is that computing the approximate gradients of the CD objective (Eq. 6) for a document that contains 100 words is computationally not much more expensive than computing the gradients for a document that contains only one word. A key observation is that using D softmax units with identical weights is equivalent to having a single multinomial unit which is sampled D times, as shown in Fig. 1, right panel.
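One CD update with T = 1 can be sketched using the multinomial view from the right panel of Fig. 1: a single Gibbs step samples the hidden units, then redraws all D words from one multinomial. This is an illustrative sketch (toy sizes, made-up learning rate, mean-field hidden probabilities in the statistics), not the authors' exact implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
F, K = 4, 6                    # illustrative numbers of hidden units and vocabulary words
W = rng.normal(0.0, 0.01, (F, K))
a = np.zeros(F)                # hidden biases, scaled by D as in Eq. 5
b = np.zeros(K)                # visible biases

def hidden_probs(v_hat):
    D = v_hat.sum()
    return 1.0 / (1.0 + np.exp(-(D * a + W @ v_hat)))

def sample_counts(h, D):
    """Redraw the document: D draws from one multinomial (equivalent to D tied softmax units)."""
    logits = b + W.T @ h
    p = np.exp(logits - logits.max())
    return rng.multinomial(D, p / p.sum()).astype(float)

def cd1_step(v_hat, lr=0.1):
    """Eq. 6 with T = 1: data statistics minus one-step reconstruction statistics."""
    global W
    D = int(v_hat.sum())
    p0 = hidden_probs(v_hat)
    h0 = (rng.random(F) < p0).astype(float)
    v1 = sample_counts(h0, D)              # one full Gibbs step: h | v, then v | h
    p1 = hidden_probs(v1)
    W += lr * (np.outer(p0, v_hat) - np.outer(p1, v1))
    return v1

doc = np.array([3.0, 2.0, 0.0, 1.0, 0.0, 0.0])   # a six-word toy document
v1 = cd1_step(doc)
```

Because the reconstruction is drawn with the same number of draws D, the negative-phase statistics automatically have the same scale as the positive phase, regardless of document length.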
If instead of sampling, we use real-valued softmax probabilities multiplied by D, we exactly recover the learning algorithm of a Constrained Poisson model [13], except for the scaling of the hidden biases with D.

3 Evaluating Replicated Softmax as a Generative Model

Assessing the generalization performance of probabilistic topic models plays an important role in model selection. Much of the existing literature, particularly for undirected topic models [4, 17], uses extremely indirect performance measures, such as information retrieval or document classification. More broadly, however, the ability of the model to generalize can be evaluated by computing the probability that the model assigns to previously unseen documents, which is independent of any specific application.

For undirected models, computing the probability of held-out documents exactly is intractable, since computing the global normalization constant requires enumeration over an exponential number of terms. Evaluating the same probability for directed topic models is also difficult, because there are an exponential number of possible topic assignments for the words.

Recently, [14] showed that a Monte Carlo based method, Annealed Importance Sampling (AIS) [12], can be used to efficiently estimate the partition function of an RBM. We also find AIS attractive because it not only provides a good estimate of the partition function in a reasonable amount of computer time, but it can also just as easily be used to estimate the probability of held-out documents for directed topic models, including Latent Dirichlet Allocation (for details see [16]). This will allow us to properly measure and compare generalization capabilities of Replicated Softmax and LDA models. We now show how AIS can be used to estimate the partition function of a Replicated Softmax model.

Algorithm 1 Annealed Importance Sampling (AIS) run.
1: Initialize 0 = \beta_0 < \beta_1 < ... < \beta_S = 1.
2: Sample V_1 from p_0.
3: for s = 1 : S - 1 do
4:   Sample V_{s+1} given V_s using T_s(V_{s+1} \leftarrow V_s).
5: end for
6: Set w_{AIS} = \prod_{s=1}^{S} p^*_s(V_s) / p^*_{s-1}(V_s).

3.1 Annealed Importance Sampling

Suppose we have two distributions: p_A(x) = p^*_A(x)/Z_A and p_B(x) = p^*_B(x)/Z_B. Typically p_A(x) is defined to be some simple proposal distribution with known Z_A, whereas p_B represents our complex target distribution of interest. One way of estimating the ratio of normalizing constants is to use a simple importance sampling method:

\frac{Z_B}{Z_A} = \sum_x \frac{p^*_B(x)}{p^*_A(x)} p_A(x) = E_{p_A}\left[\frac{p^*_B(x)}{p^*_A(x)}\right] \approx \frac{1}{N} \sum_{i=1}^{N} \frac{p^*_B(x^{(i)})}{p^*_A(x^{(i)})},   (7)

where x^{(i)} \sim p_A. However, if p_A and p_B are not close enough, the estimator will be very poor. In high-dimensional spaces, the variance of the importance sampling estimator will be very large, or possibly infinite, unless p_A is a near-perfect approximation to p_B.

Annealed Importance Sampling can be viewed as simple importance sampling defined on a much higher dimensional state space. It uses many auxiliary variables in order to make the proposal distribution p_A be closer to the target distribution p_B. AIS starts by defining a sequence of intermediate probability distributions: p_0, ..., p_S, with p_0 = p_A and p_S = p_B. One general way to define this sequence is to set:

p_k(x) \propto p^*_A(x)^{1-\beta_k} p^*_B(x)^{\beta_k},   (8)

with “inverse temperatures” 0 = \beta_0 < \beta_1 < ... < \beta_S = 1 chosen by the user.
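The estimator of Eq. 7 and the geometric path of Eq. 8 can be checked on a toy problem where the answer is known. The following sketch uses a hypothetical pair of unnormalized Gaussians (not from the paper) whose normalizers are equal, so the true ratio is exactly 1:

```python
import numpy as np

rng = np.random.default_rng(0)

# Unnormalized densities: p*_A(x) = exp(-x^2/2), p*_B(x) = exp(-(x-0.5)^2/2).
# Both integrate to sqrt(2*pi), so the true ratio Z_B / Z_A is exactly 1.
log_pA = lambda x: -x**2 / 2.0
log_pB = lambda x: -(x - 0.5)**2 / 2.0

x = rng.normal(0.0, 1.0, 200_000)                 # x ~ p_A
ratio = np.mean(np.exp(log_pB(x) - log_pA(x)))    # Eq. 7: average of p*_B / p*_A

def log_p_intermediate(x, beta):
    """Eq. 8: the geometric bridge between p*_A (beta = 0) and p*_B (beta = 1)."""
    return (1.0 - beta) * log_pA(x) + beta * log_pB(x)
```

Here the two distributions are close, so plain importance sampling already works; as the text notes, the point of annealing is that when they are far apart the weights p*_B/p*_A become extremely heavy-tailed, while each consecutive pair along the bridge stays close.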
For each intermediate distribution, a Markov chain transition operator T_k(x'; x) that leaves p_k(x) invariant must also be defined.

Using the special bipartite structure of RBMs, we can devise a better AIS scheme [14] for estimating the model's partition function. Let us consider a Replicated Softmax model with D words. Using Eq. 5, the joint distribution over \{V, h\} is defined as¹:

p(V, h) = \frac{1}{Z} \exp\Big( \sum_{j=1}^{F} \sum_{k=1}^{K} W_j^k h_j \hat{v}^k \Big),   (9)

where \hat{v}^k = \sum_{i=1}^{D} v_i^k denotes the count for the kth word. By explicitly summing out the latent topic units h we can easily evaluate an unnormalized probability p^*(V). The sequence of intermediate distributions, parameterized by \beta, can now be defined as follows:

p_s(V) = \frac{1}{Z_s} p^*_s(V) = \frac{1}{Z_s} \sum_h p^*_s(V, h) = \frac{1}{Z_s} \prod_{j=1}^{F} \Big( 1 + \exp\Big( \beta_s \sum_{k=1}^{K} W_j^k \hat{v}^k \Big) \Big).   (10)

Note that for s = 0, we have \beta_s = 0, and so p_0 represents a uniform distribution, whose partition function evaluates to Z_0 = 2^F, where F is the number of hidden units. Similarly, when s = S, we have \beta_s = 1, and so p_S represents the distribution defined by the Replicated Softmax model. For intermediate values of s, we will have some interpolation between the uniform and target distributions. Using Eqs. 3, 4, it is also straightforward to derive an efficient Gibbs transition operator that leaves p_s(V) invariant.

¹We have omitted the bias terms for clarity of presentation.

A single run of the AIS procedure is summarized in Algorithm 1. It starts by first sampling from a simple uniform distribution p_0(V) and then applying a series of transition operators T_1, T_2, ..., T_{S-1} that “move” the sample through the intermediate distributions p_s(V) towards the target distribution p_S(V).
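Putting Algorithm 1 and Eq. 10 together, the whole scheme can be sketched on a toy Replicated Softmax that is small enough to check against brute-force enumeration. All sizes and parameter values below are illustrative, and biases are omitted as in Eq. 9:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)

K, D, F, S = 3, 2, 2, 500          # tiny toy model (assumed sizes) so Z is tractable exactly
W = rng.normal(0.0, 0.5, (F, K))
betas = np.linspace(0.0, 1.0, S + 1)

def log_pstar(v_hat, beta):
    """Eq. 10 with biases omitted: log p*_s(V) = sum_j log(1 + exp(beta * W_j . v_hat))."""
    return float(np.sum(np.log1p(np.exp(beta * (W @ v_hat)))))

def gibbs(v_hat, beta):
    """Transition leaving p_s invariant: sample h | v, then redraw all D words | h."""
    ph = 1.0 / (1.0 + np.exp(-beta * (W @ v_hat)))
    h = (rng.random(F) < ph).astype(float)
    logits = beta * (W.T @ h)
    p = np.exp(logits - logits.max())
    return rng.multinomial(D, p / p.sum()).astype(float)

def ais_run():
    """One run of Algorithm 1; returns log w_AIS."""
    v = rng.multinomial(D, np.ones(K) / K).astype(float)   # V_1 ~ p_0 (words uniform)
    logw = 0.0
    for s in range(1, S + 1):
        logw += log_pstar(v, betas[s]) - log_pstar(v, betas[s - 1])
        if s < S:
            v = gibbs(v, betas[s])
    return logw

def exact_Z(beta):
    """Brute force: enumerate all K^D ordered documents."""
    return sum(np.exp(log_pstar(np.bincount(w, minlength=K).astype(float), beta))
               for w in product(range(K), repeat=D))

est_ratio = np.mean([np.exp(ais_run()) for _ in range(200)])   # estimates Z_S / Z_0
true_ratio = exact_Z(1.0) / exact_Z(0.0)
```

Note that the weights and the transitions depend on V only through the count vector, so the chain can be run directly on counts.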
Note that there is no need to compute the normalizing constants of any intermediate distributions. After performing M runs of AIS, the importance weights w^{(i)}_{AIS} can be used to obtain an unbiased estimate of our model's partition function Z_S:

\frac{Z_S}{Z_0} \approx \frac{1}{M} \sum_{i=1}^{M} w^{(i)}_{AIS},   (11)

where Z_0 = 2^F. Observe that the Markov transition operators do not necessarily need to be ergodic. In particular, if we were to choose dumb transition operators that do nothing, T_s(V' \leftarrow V) = \delta(V' - V) for all s, we simply recover the simple importance sampling procedure of Eq. 7.

When evaluating the probability of a collection of several documents, we need to perform a separate AIS run per document if those documents are of different lengths. This is because each different-sized document can be represented as a separate RBM that has its own global normalizing constant.

4 Experimental Results

In this section we present experimental results on three text datasets: NIPS proceedings papers, 20-newsgroups, and Reuters Corpus Volume I (RCV1-v2) [10], and report the generalization performance of Replicated Softmax and LDA models.

4.1 Description of Datasets

The NIPS proceedings dataset² contains 1740 NIPS papers. We used the first 1690 documents as training data and the remaining 50 documents as test. The dataset was already preprocessed, with each document represented as a vector containing 13,649 word counts.

The 20-newsgroups corpus contains 18,845 postings taken from the Usenet newsgroup collection. The corpus is partitioned fairly evenly into 20 different newsgroups, each corresponding to a separate topic.³ The data was split by date into 11,314 training and 7,531 test articles, so the training and test sets were separated in time. We further preprocessed the data by removing common stopwords, stemming, and then only considering the 2000 most frequent words in the training dataset.
As a result, each posting was represented as a vector containing 2000 word counts. No other preprocessing was done.

The Reuters Corpus Volume I is an archive of 804,414 newswire stories⁴ that have been manually categorized into 103 topics. The topic classes form a tree which is typically of depth 3. For this dataset, we define the relevance of one document to another to be the fraction of the topic labels that agree on the two paths from the root to the two documents. The data was randomly split into 794,414 training and 10,000 test articles. The available data was already in the preprocessed format, where common stopwords were removed and all documents were stemmed. We again only considered the 10,000 most frequent words in the training dataset.

For all datasets, each word count w_i was replaced by log(1 + w_i), rounded to the nearest integer, which slightly improved retrieval performance of both models. Table 1 shows a description of all three datasets.

4.2 Details of Training

For the Replicated Softmax model, to speed up learning, we subdivided datasets into minibatches, each containing 100 training cases, and updated the parameters after each minibatch. Learning was carried out using Contrastive Divergence by starting with one full Gibbs step and gradually increasing to five steps during the course of training, as described in [14]. For all three datasets, the total number of parameter updates was set to 100,000, which took several hours to train. For the LDA model, we used the Gibbs sampling implementation of the Matlab Topic Modeling Toolbox⁵ [5]. The hyperparameters were optimized using stochastic EM as described by [15]. For the 20-newsgroups and NIPS datasets, the number of Gibbs updates was set to 100,000. For the large Reuters dataset, it was set to 10,000, which took several days to train.

²Available at http://psiexp.ss.uci.edu/research/programs data/toolbox.htm.
³Available at http://people.csail.mit.edu/jrennie/20Newsgroups (20news-bydate.tar.gz).
⁴Available at http://trec.nist.gov/data/reuters/reuters.html

Table 1: Results for LDA using 50 and 200 topics, and the Replicated Softmax model that uses 50 topics. K is the vocabulary size, D-bar is the mean document length, St. Dev. is the estimated standard deviation in document length. The last four columns give the average test perplexity per word.

Data set | Train docs | Test docs | K      | D-bar | St. Dev. | LDA-50 | LDA-200 | R. Soft-50 | Unigram
NIPS     | 1,690      | 50        | 13,649 | 98.0  | 245.3    | 3576   | 3391    | 3405       | 4385
20-news  | 11,314     | 7,531     | 2,000  | 51.8  | 70.8     | 1091   | 1058    | 953        | 1335
Reuters  | 794,414    | 10,000    | 10,000 | 94.6  | 69.3     | 1437   | 1142    | 988        | 2208

Figure 2: The average test perplexity scores for each of the 50 held-out documents under the learned 50-dimensional Replicated Softmax and LDA that uses 50 topics. (Three scatter plots, one each for the NIPS Proceedings, 20-newsgroups, and Reuters datasets, with LDA perplexity plotted against Replicated Softmax perplexity.)

4.3 Assessing Topic Models as Generative Models

For each of the three datasets, we estimated the log-probability for 50 held-out documents.⁶ For both the Replicated Softmax and LDA models we used 10,000 inverse temperatures \beta_s, spaced uniformly from 0 to 1. For each held-out document, the estimates were averaged over 100 AIS runs.
The average test perplexity per word was then estimated as \exp\big(-\frac{1}{N} \sum_{n=1}^{N} \frac{1}{D_n} \log p(v_n)\big), where N is the total number of documents, and D_n and v_n are the total number of words and the observed word-count vector for document n.

Table 1 shows that for all three datasets the 50-dimensional Replicated Softmax consistently outperforms the LDA with 50 topics. For the NIPS dataset, the undirected model achieves an average test perplexity of 3405, improving upon LDA's perplexity of 3576. The LDA with 200 topics performed much better on this dataset compared to the LDA-50, but its performance only slightly improved upon the 50-dimensional Replicated Softmax model. For the 20-newsgroups dataset, even with 200 topics, the LDA could not match the perplexity of the Replicated Softmax model with 50 topic units.

The difference in performance is particularly striking for the large Reuters dataset, whose vocabulary size is 10,000. LDA achieves an average test perplexity of 1437, substantially reducing it from 2208, achieved by a simple smoothed unigram model. The Replicated Softmax further reduces the perplexity down to 986, which is comparable in magnitude to the improvement produced by the LDA over the unigram model. LDA with 200 topics does improve upon LDA-50, achieving a perplexity of 1142.
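The perplexity measure used in these comparisons reduces to a one-liner; a quick sanity check (with hypothetical numbers) is that a uniform unigram model over a K-word vocabulary gives a per-word perplexity of exactly K:

```python
import numpy as np

def avg_test_perplexity(log_probs, doc_lengths):
    """exp(-(1/N) * sum_n (1/D_n) * log p(v_n)), as defined in the text."""
    lp = np.asarray(log_probs, dtype=float)
    D = np.asarray(doc_lengths, dtype=float)
    return float(np.exp(-np.mean(lp / D)))

# One document of 10 words scored by a uniform model over a 50-word vocabulary:
ppl = avg_test_perplexity([10 * np.log(1.0 / 50)], [10])   # -> 50.0
```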
However, its performance is still considerably worse than that of the Replicated Softmax model.

⁵The code is available at http://psiexp.ss.uci.edu/research/programs data/toolbox.htm
⁶For the 20-newsgroups and Reuters datasets, the 50 held-out documents were randomly sampled from the test sets.

Figure 3: Precision-Recall curves for the 20-newsgroups and Reuters datasets, when a query document from the test set is used to retrieve similar documents from the training corpus. Results are averaged over all 7,531 (for 20-newsgroups) and 10,000 (for Reuters) possible queries. (Each panel compares the 50-dimensional Replicated Softmax against the 50-dimensional LDA.)

Figure 2 further shows three scatter plots of the average test perplexity per document. Observe that for almost all test documents, the Replicated Softmax achieves a better perplexity compared to the corresponding LDA model. For the Reuters dataset, as expected, there are many documents that are modeled much better by the undirected model than an LDA. Clearly, the Replicated Softmax is able to generalize much better.

4.4 Document Retrieval

We used the 20-newsgroups and Reuters datasets to evaluate model performance on a document retrieval task. To decide whether a retrieved document is relevant to the query document, we simply check if they have the same class label. This is the only time that the class labels are used.
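This retrieval pipeline (map each document to its latent topic vector, then rank training documents by cosine similarity to the query's topic vector) can be sketched as follows. Shapes and parameter values are illustrative, not the trained model:

```python
import numpy as np

def topic_vectors(counts, W, a):
    """Map rows of word counts to latent topic features: a single matrix multiply plus a
    componentwise sigmoid, with hidden biases scaled by each document's length (Eq. 5)."""
    D = counts.sum(axis=1, keepdims=True)
    return 1.0 / (1.0 + np.exp(-(counts @ W.T + D * a)))

def retrieve(query_topics, train_topics, top_k=3):
    """Rank training documents by cosine similarity to the query's topic vector."""
    q = query_topics / np.linalg.norm(query_topics)
    T = train_topics / np.linalg.norm(train_topics, axis=1, keepdims=True)
    return np.argsort(-(T @ q))[:top_k]

rng = np.random.default_rng(0)
W, a = rng.normal(0.0, 0.1, (8, 20)), np.zeros(8)       # F = 8 topics, K = 20 words (toy)
train = rng.poisson(1.0, (50, 20)).astype(float)        # 50 fake training count vectors
feats = topic_vectors(train, W, a)
ranks = retrieve(topic_vectors(train[7:8], W, a)[0], feats)
# A document used as its own query should come back ranked first.
```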
For the Replicated Softmax, the mapping from a word-count vector to the values of the latent topic features is fast, requiring only a single matrix multiplication followed by a componentwise sigmoid non-linearity. For the LDA, we used 1000 Gibbs sweeps per test document in order to get an approximate posterior over the topics. Figure 3 shows that when we use the cosine of the angle between two topic vectors to measure their similarity, the Replicated Softmax significantly outperforms LDA, particularly when retrieving the top few documents.

5 Conclusions and Extensions

We have presented a simple two-layer undirected topic model that can be used to model and automatically extract distributed semantic representations from large collections of text corpora. The model can be viewed as a family of different-sized RBMs that share parameters. The proposed model has several key advantages: the learning is easy and stable, it can model documents of different lengths, and computing the posterior distribution over the latent topic values is easy. Furthermore, using stochastic gradient descent, scaling up learning to billions of documents would not be particularly difficult. This is in contrast to directed topic models, where most of the existing inference algorithms are designed to be run in a batch mode. Therefore one would have to make further approximations, for example by using particle filtering [3]. We have also demonstrated that the proposed model is able to generalize much better than LDA in terms of both the log-probability on held-out documents and the retrieval accuracy.

In this paper we have only considered the simplest possible topic model, but the proposed model can be extended in several ways. For example, similar to supervised LDA [1], the proposed Replicated Softmax can be easily extended to modeling the joint distribution over words and a document label, as shown in Fig. 4, left panel.
Recently, [11] introduced a Dirichlet-multinomial regression model, where a prior on the document-specific topic distributions was modeled as a function of observed metadata of the document. Similarly, we can define a conditional Replicated Softmax model, where the observed document-specific metadata, such as author, references, etc., can be used to influence the states of the latent topic units, as shown in Fig. 4, right panel.

Figure 4: Left: A Replicated Softmax model that models the joint distribution of words and document label. Right: Conditional Replicated Softmax model where the observed document-specific metadata affects binary states of the hidden topic units.

Finally, as argued by [13], a single layer of binary features may not be the best way to capture the complex structure in the count data. Once the Replicated Softmax has been trained, we can add more layers to create a Deep Belief Network [8], which could potentially produce a better generative model and further improve retrieval accuracy.

Acknowledgments
This research was supported by NSERC, CFI, and CIFAR.

References

[1] D. Blei and J. McAuliffe. Supervised topic models. In NIPS, 2007.
[2] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022, 2003.
[3] K. Canini, L. Shi, and T. Griffiths. Online inference of topics with latent Dirichlet allocation. In Proceedings of the International Conference on Artificial Intelligence and Statistics, volume 5, 2009.
[4] P. Gehler, A. Holub, and M. Welling. The Rate Adapting Poisson (RAP) model for information retrieval and object recognition. In Proceedings of the 23rd International Conference on Machine Learning, 2006.
[5] T. Griffiths and M. Steyvers.
Finding scientific topics. In Proceedings of the National Academy of Sciences, volume 101, pages 5228-5235, 2004.
[6] T. Griffiths and M. Steyvers. Finding scientific topics. PNAS, 101(suppl. 1), 2004.
[7] G. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1711-1800, 2002.
[8] G. Hinton, S. Osindero, and Y. W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527-1554, 2006.
[9] T. Hofmann. Probabilistic latent semantic analysis. In Proceedings of the 15th Conference on Uncertainty in AI, pages 289-296, San Francisco, California, 1999. Morgan Kaufmann.
[10] D. Lewis, Y. Yang, T. Rose, and F. Li. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361-397, 2004.
[11] D. Mimno and A. McCallum. Topic models conditioned on arbitrary features with Dirichlet-multinomial regression. In UAI, pages 411-418, 2008.
[12] R. Neal. Annealed importance sampling. Statistics and Computing, 11:125-139, 2001.
[13] R. Salakhutdinov and G. Hinton. Semantic Hashing. In SIGIR workshop on Information Retrieval and applications of Graphical Models, 2007.
[14] R. Salakhutdinov and I. Murray. On the quantitative analysis of deep belief networks. In Proceedings of the International Conference on Machine Learning, volume 25, pages 872-879, 2008.
[15] H. Wallach. Topic modeling: beyond bag-of-words. In ICML, volume 148, pages 977-984, 2006.
[16] H. Wallach, I. Murray, R. Salakhutdinov, and D. Mimno. Evaluation methods for topic models. In Proceedings of the 26th International Conference on Machine Learning (ICML 2009), 2009.
[17] E. Xing, R. Yan, and A. Hauptmann. Mining associated text and images with dual-wing harmoniums. In Proceedings of the 21st Conference on Uncertainty in Artificial Intelligence (UAI-2005), 2005.
", "award": [], "sourceid": 817, "authors": [{"given_name": "Geoffrey", "family_name": "Hinton", "institution": null}, {"given_name": "Russ", "family_name": "Salakhutdinov", "institution": null}]}