Wikipedia Subcorpora Tool (WiST) – A tool for creating customized document collections for training unsupervised models of lexical semantics


  1. Derbentseva, N.
  2. Kwantes, P.
  3. Dennis, S.
  4. Stone, B.
Corporate Authors
Defence Research and Development Canada, Toronto Research Centre, Toronto ON (CAN)
One of the most important advances in cognitive science over the past 20 years is the invention of computer models that can form semantic representations for words by analysing the patterns with which words are used in documents. Generally speaking, the models need to be ‘trained’ on tens of thousands of documents to form representations that are recognizable as the meaning or ‘gist’ of a term or document. Because the models derive meaning from words’ usage across contexts/documents, the ways that words are used will drive the meaning. In this report, we describe the Wikipedia Subcorpora Tool (WiST), a tool for creating custom document corpora for the purpose of training models of lexical semantics. The tool is unique in that it allows the user to control the kinds of documents that comprise a corpus. For example, one might want to train a model to be an expert on medical topics, so the user can use the WiST to select a collection of medical documents on which to train the model. In this report, we detail the functionalities of the tool.
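The idea that a corpus's composition shapes the learned word representations can be illustrated with a minimal sketch. It is not WiST's implementation and not a full LSA model (there is no dimension reduction): it only assumes that documents are filtered by topic terms and that each word is represented by its counts across documents. The toy documents and the names `DOCS`, `select_subcorpus`, `word_vectors`, and `cosine` are hypothetical, invented for this illustration.

```python
import math
from collections import Counter

# Toy documents standing in for Wikipedia articles (invented for illustration).
DOCS = {
    "Aspirin": "aspirin is a drug used to treat pain and fever",
    "Ibuprofen": "ibuprofen is a drug used to treat pain and inflammation",
    "Python": "python is a language used to write software",
    "Rust": "rust is a language used to write fast software",
}


def select_subcorpus(docs, topic_terms):
    """Keep only the documents that mention at least one topic term."""
    topic = set(topic_terms)
    return {title: text for title, text in docs.items()
            if topic & set(text.split())}


def word_vectors(docs):
    """Represent each word by its counts across documents (one dimension per document)."""
    titles = sorted(docs)
    counts = {t: Counter(docs[t].split()) for t in titles}
    vocab = set().union(*(c.keys() for c in counts.values()))
    return {w: [counts[t][w] for t in titles] for w in vocab}


def cosine(u, v):
    """Cosine similarity between two count vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Word vectors from the whole collection versus a medical-only subcorpus.
full = word_vectors(DOCS)
medical = word_vectors(select_subcorpus(DOCS, ["drug", "pain"]))

print(cosine(full["used"], full["drug"]))
print(cosine(medical["used"], medical["drug"]))
```

In this toy setting, "used" and "drug" look more alike in the medical subcorpus than in the full collection, because the documents that separated them have been filtered out. A model of the kind the report describes would typically add a dimension-reduction step on top of such counts.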

A French-language abstract is available here.

Unsupervised models of lexical semantics; training corpora; automated corpora generation; context-specific; Latent Semantic Analysis (LSA)
Report Number
DRDC-RDDC-2016-R100 — Scientific Report
Date of publication
01 Jun 2016
Number of Pages
Electronic Document (PDF)
