Latent semantic analysis
Latent Semantic Analysis (LSA) is a technique used in natural language processing that analyzes the relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. The technique was developed in 1988. Scott Deerwester, who played a major role in the development of LSA, describes it in his paper "Indexing by Latent Semantic Analysis" as taking "advantage of implicit higher-order structure in the association of terms with documents (“semantic structure”) in order to improve the detection of relevant documents on the basis of terms found in queries."
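As a brief sketch of where the "concepts" come from: LSA is usually described as a truncated singular value decomposition (SVD) of the term-document matrix. The notation below is an assumption made for illustration and is not taken from the paper quoted above: A is the matrix of m terms by n documents, and k is the number of dimensions (concepts) that are kept.

```latex
A \approx A_k = U_k \Sigma_k V_k^{\top},
\qquad
A \in \mathbb{R}^{m \times n},\;
U_k \in \mathbb{R}^{m \times k},\;
\Sigma_k \in \mathbb{R}^{k \times k},\;
V_k \in \mathbb{R}^{n \times k}
```

In this reading, each row of U_k is a term vector and each row of V_k is a document vector, both expressed in the same k-dimensional concept space.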
Working With the Software
To use semanticvectors, follow these instructions:
First, download Apache Lucene, which is an open source text search engine. Download the "lucene-3.0.3.tar.gz" version from here: http://mirrors.devlib.org/apache/lucene/java/3.0.3/ and extract it into the directory you will be working in. The Archive Utility on Mac sometimes gets stuck in a loop, turning the .gz file into a .cpgz file and the .cpgz file back into a .gz file. If you are having this issue, make sure the download has fully completed before you attempt to uncompress it, try another program such as The Unarchiver or StuffIt Expander, or download the "lucene-3.0.3.zip" file from the same website.
Download the "semanticvectors-2.2.jar" file from the semanticvectors download page here: http://code.google.com/p/semanticvectors/downloads/list and place it in the directory you will be working in.
Add these paths to your Java classpath: "lucene-core-3.0.3.jar:lucene-demos-3.0.3.jar:semanticvectors-2.2.jar:." If you are using bash, you can type 'export CLASSPATH="path:goes:here"' in the terminal or add that command to your .bashrc file.
After this, semanticvectors should be up and running.
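The classpath step can be sketched as a small shell helper. This is a minimal sketch that assumes all three jars have been copied into a single directory; the function name build_classpath is hypothetical and only used for illustration.

```shell
# Assumption: the three jars named in the instructions sit together in one
# directory. Pass a different directory if yours live elsewhere.
build_classpath() {
  jar_dir="$1"
  printf '%s' "$jar_dir/lucene-core-3.0.3.jar:$jar_dir/lucene-demos-3.0.3.jar:$jar_dir/semanticvectors-2.2.jar:."
}

# Use the current working directory and export the result for java to pick up.
CLASSPATH="$(build_classpath "$PWD")"
export CLASSPATH
echo "$CLASSPATH"
```

Using absolute paths means the classpath keeps working if you change directories later; the trailing "." still adds whatever directory you run java from.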
To build an index, place your corpus files in a folder inside your working directory, and use this command:
java org.apache.lucene.demo.IndexFiles <folder name>
This will create an "index" folder in your current directory. Then run LSA on the index directory by using this command:
java pitt.search.semanticvectors.LSA -dimension <# dimensions> index
For the number of dimensions, use the number of separate files your corpus is composed of.
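The two steps above can be chained in a short script. This is a sketch under stated assumptions: the function names and the JAVA override are hypothetical additions for illustration, and the dimension count simply follows the rule above by counting the files in the corpus folder.

```shell
# JAVA may be overridden (for example JAVA=echo) to preview the commands
# without running them. This override is an addition for illustration.
JAVA="${JAVA:-java}"

# Count the regular files under the corpus folder. Note that this also
# counts hidden files, so clean the folder first if that matters.
count_corpus_files() {
  find "$1" -type f | wc -l | tr -d ' '
}

# Index the corpus, then run LSA on the resulting index directory.
build_lsa_model() {
  corpus_dir="$1"
  if [ ! -d "$corpus_dir" ]; then
    echo "corpus folder not found: $corpus_dir" >&2
    return 1
  fi
  dimensions="$(count_corpus_files "$corpus_dir")"
  # Step 1: build the Lucene index (written to ./index).
  "$JAVA" org.apache.lucene.demo.IndexFiles "$corpus_dir" || return 1
  # Step 2: run LSA over the index directory.
  "$JAVA" pitt.search.semanticvectors.LSA -dimension "$dimensions" index
}
```

With a corpus folder named, say, corpus, the call would be build_lsa_model corpus. The script stops after the indexing step if that step fails, so LSA is never run against a missing or stale index.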
After this, you can run searches and queries on your LSA matrix. For example, if you wanted to rank the documents in your corpus by their cosine similarity to one particular document, you would use the command
java pitt.search.semanticvectors.Search -queryvectorfile svd_docvectors.bin -searchvectorfile svd_docvectors.bin -matchcase <document name>
To learn more about using documents in your queries, visit this page: http://code.google.com/p/semanticvectors/wiki/DocumentSearch
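For reference, the cosine value mentioned above is the standard cosine similarity between two vectors. The symbols here are assumptions for illustration: a and b are two document vectors in the reduced space, and k is the number of dimensions.

```latex
\cos(\mathbf{a}, \mathbf{b})
  = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
  = \frac{\sum_{i=1}^{k} a_i b_i}{\sqrt{\sum_{i=1}^{k} a_i^2}\,\sqrt{\sum_{i=1}^{k} b_i^2}}
```

A value near 1 means the two vectors point in nearly the same direction, and a value near 0 means they are close to orthogonal.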
You can also search your LSA model by using this command:
java pitt.search.semanticvectors.Search QUERYTERMS
You can also compare two concepts using this command:
java pitt.search.semanticvectors.CompareTerms "QUERYTERMS1" "QUERYTERMS2"
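As a hedged illustration of how quoting affects these two commands, the wrappers below pass search terms as separate arguments and keep each concept as a single quoted argument. The function names and the JAVA override are hypothetical and exist only for this sketch.

```shell
# JAVA may be overridden (for example JAVA=echo) to preview the commands.
JAVA="${JAVA:-java}"

# Search the model; every query term is passed as its own argument.
search_terms() {
  "$JAVA" pitt.search.semanticvectors.Search "$@"
}

# Compare two concepts; each concept is one argument and may hold several
# words, which is why the quotes around each concept matter.
compare_terms() {
  "$JAVA" pitt.search.semanticvectors.CompareTerms "$1" "$2"
}
```

For example, with made-up terms, search_terms river bank searches for two terms, while compare_terms "river bank" "savings account" compares two multi-word concepts.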
http://www.cs.csustan.edu/~mmartin/LDS/index.html - The CSU Stanislaus research project.
http://lsa.colorado.edu/ - Home of the University of Colorado's LSA project.
http://www.cs.csustan.edu/~mmartin/LDS/Deerwester-et-al.pdf - The Deerwester paper about LSA.
http://knowledgesearch.org/download.html - An open source program released under a public license. Documentation is sparse, but there is enough to be useful.
http://sourceforge.net/projects/phplsa/ - A PHP LSA extension that looks like it's geared more toward smaller experiments.
http://tedlab.mit.edu/~dr/SVDLIBC/ - An SVD library written in C.
http://web.eecs.utk.edu/research/lsi/ - A good list of free LSA programs, but some of the links on this site are dead.
http://www.kdnuggets.com/software/text.html - Another list, this one of commercial LSI software. On a quick perusal I didn't see any free/educational license options.