Latent Semantic Analysis (LSA) tools


Authors
  1. Derbentseva, N.
  2. Kwantes, P.J.
  3. Terhaar, P.
Corporate Authors
Defence R&D Canada - Toronto, Toronto ONT (CAN)
Abstract
Latent Semantic Analysis (LSA; Landauer & Dumais, 1997; Landauer, Foltz, & Laham, 1998) is a computational model that uses a large collection of unstructured documents to construct semantic representations for words. Each representation is based on a statistical analysis of the term's occurrences within and across documents and takes the form of a vector. The semantic similarity between two word representations is measured by the cosine of the angle between their vectors. After an LSA space is created, it can be queried for word-to-word, word-to-document, or document-to-document comparisons, each of which uses the same cosine measure. When comparing words to documents (and sometimes documents to each other), LSA uses what is called a “bag of words” approach to represent the semantic content of a document. In a “bag of words”, the order of terms in a document does not matter, and the semantic representation of a document is formed by summing the vectors of all its content words. LSA requires a relatively large set of short documents to generate a semantic space (i.e., from several hundred to tens of thousands). The large number helps to ensure that many words from a variety of contexts (operationally defined as documents) are available for analysis. Before our version of LSA constructs a semantic space for the terms in the corpus, the documents need to be prepared. Punctuation is removed fro
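The two operations the abstract describes, comparing vectors by their cosine and forming a document vector by summing word vectors, can be illustrated with a minimal sketch. It is not the tool described in the report: the three-dimensional vectors in WORD_VECTORS are hand-written, hypothetical values standing in for a real LSA space, and the function names cosine and document_vector are assumptions made for this example.

```python
import math

# Hypothetical 3-dimensional word vectors. In a real LSA space these would be
# derived from a statistical analysis of a large corpus, not written by hand.
WORD_VECTORS = {
    "ship": [0.9, 0.1, 0.0],
    "boat": [0.8, 0.2, 0.1],
    "ocean": [0.7, 0.3, 0.0],
    "tree": [0.0, 0.2, 0.9],
    "forest": [0.1, 0.1, 0.8],
}


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def document_vector(text, vectors=WORD_VECTORS):
    """Bag-of-words document vector: the sum of the vectors of known words.

    Word order is ignored, and words missing from the space are skipped.
    """
    dims = len(next(iter(vectors.values())))
    total = [0.0] * dims
    for word in text.lower().split():
        if word in vectors:
            total = [t + x for t, x in zip(total, vectors[word])]
    return total


# Word-to-word, word-to-document, and document-to-document comparisons
# all reduce to the same cosine call.
word_to_word = cosine(WORD_VECTORS["ship"], WORD_VECTORS["boat"])
word_to_doc = cosine(WORD_VECTORS["boat"], document_vector("ship ocean"))
doc_to_doc = cosine(document_vector("ship ocean"), document_vector("tree forest"))
```

Because the document vector is a plain sum, "ship ocean" and "ocean ship" yield the same representation, which is the order-independence the "bag of words" description refers to.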
Report Number
DRDC-TORONTO-TN-2012-079 — Technical Note
Date of publication
01 Jul 2012
Number of Pages
35
DSTKIM No
CA036670
CANDIS No
536947
Format(s):
Electronic Document (PDF)
