jml.topics
Class Corpus

java.lang.Object
  extended by jml.topics.Corpus

public class Corpus
extends java.lang.Object

A class that models a corpus. Term indices always start from 0 and index elements of a 2D integer array. Term IDs always start from 1 and are used in a Vector of termID sequences.

Version:
Jan. 17th, 2013
Author:
Mingjie Qian

Field Summary
private  java.util.Vector<java.util.Vector<java.lang.Integer>> corpus
          A Vector of termID sequences.
 java.util.ArrayList<java.util.TreeMap<java.lang.Integer,java.lang.Integer>> docTermCountArray
          An ArrayList of TreeMaps storing the doc-term-count matrix.
 int[][] documents
          2D integer array carrying the doc-term count matrix.
static int IdxStart
          The starting index for LDA_Blei input data.
 int nDoc
          Number of documents in the corpus.
 int nTerm
          Vocabulary size.
 
Constructor Summary
Corpus()
          Constructor for the class Corpus.
 
Method Summary
 void clearCorpus()
          Clear corpus for class Corpus.
 void clearDocTermCountArray()
          Clear docTermCountArray.
static int[][] corpus2Documents(java.util.Vector<java.util.Vector<java.lang.Integer>> corpus)
          Convert a Vector of termID sequences into a 2D doc-term count array.
static org.apache.commons.math.linear.RealMatrix documents2Matrix(int[][] documents)
          Convert a 2D doc-term count array into a matrix.
 int[][] getDocuments()
          Get the documents.
static int getVocabularySize(int[][] documents)
          Get the vocabulary size.
 void readCorpusFromDocTermCountArray(java.util.ArrayList<java.util.TreeMap<java.lang.Integer,java.lang.Integer>> docTermCountArray)
          Load corpus and documents from an ArrayList<TreeMap<Integer, Integer>> instance.
 void readCorpusFromDocTermCountFile(java.lang.String docTermCountFilePath)
          Load corpus and documents from a text file located at String docTermCountFilePath.
 void readCorpusFromLDAInputFile(java.lang.String LDAInputDataFilePath)
          Load corpus and documents from an LDAInput file.
 void readCorpusFromMatrix(org.apache.commons.math.linear.RealMatrix X)
          Load corpus and documents from a RealMatrix instance.
static void setLDATermIndexStart(int IdxStart)
          Set the starting term index for the LDA input file.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

IdxStart

public static int IdxStart
The starting index for LDA_Blei input data. Default is 0.


corpus

private java.util.Vector<java.util.Vector<java.lang.Integer>> corpus
A Vector of termID sequences. Each element of the vector is the sequence of termIDs (starting from 1) for one document. Each termID represents the corresponding term in the vocabulary. For example, if a term occurs ten times in a document, its termID appears ten times in that document's sequence.


docTermCountArray

public java.util.ArrayList<java.util.TreeMap<java.lang.Integer,java.lang.Integer>> docTermCountArray
An ArrayList of TreeMaps storing the doc-term-count matrix. Each TreeMap maps a termID to its observed count in the corresponding document.
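The relationship between docTermCountArray and the corpus field can be illustrated with a minimal sketch. The class DocTermCountSketch and its method names are hypothetical and are not part of jml; the sketch only assumes what the field descriptions state, namely that each TreeMap maps a termID (starting from 1) to a count and that a termID is repeated in the sequence once per occurrence.

```java
import java.util.ArrayList;
import java.util.Map;
import java.util.TreeMap;
import java.util.Vector;

// Hypothetical helper, not part of jml.topics.
final class DocTermCountSketch {

    /** Builds a two-document example: termID -> count for each document. */
    static ArrayList<TreeMap<Integer, Integer>> exampleArray() {
        ArrayList<TreeMap<Integer, Integer>> docs = new ArrayList<TreeMap<Integer, Integer>>();

        TreeMap<Integer, Integer> doc0 = new TreeMap<Integer, Integer>();
        doc0.put(1, 2); // termID 1 occurs twice
        doc0.put(3, 1); // termID 3 occurs once
        docs.add(doc0);

        TreeMap<Integer, Integer> doc1 = new TreeMap<Integer, Integer>();
        doc1.put(2, 3); // termID 2 occurs three times
        docs.add(doc1);

        return docs;
    }

    /** Expands one termID -> count map into a termID sequence. */
    static Vector<Integer> toTermIdSequence(TreeMap<Integer, Integer> termCounts) {
        Vector<Integer> sequence = new Vector<Integer>();
        for (Map.Entry<Integer, Integer> entry : termCounts.entrySet()) {
            // Repeat the termID once per observed occurrence.
            for (int k = 0; k < entry.getValue(); k++) {
                sequence.add(entry.getKey());
            }
        }
        return sequence;
    }
}
```

In this sketch the first example document expands to the sequence [1, 1, 3]. The order of termIDs within a sequence follows the TreeMap key order here; whether jml preserves any particular order is not stated.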


documents

public int[][] documents
2D integer array carrying the doc-term count matrix. documents[i][j] is the number of occurrences of the j-th vocabulary term in the i-th document. Term indices start from 0 for documents.


nTerm

public int nTerm
Vocabulary size.


nDoc

public int nDoc
Number of documents in the corpus.

Constructor Detail

Corpus

public Corpus()
Constructor for the class Corpus.

Method Detail

clearCorpus

public void clearCorpus()
Clear corpus for class Corpus.


clearDocTermCountArray

public void clearDocTermCountArray()
Clear docTermCountArray.


getDocuments

public int[][] getDocuments()
Get the documents.

Returns:
documents.

readCorpusFromLDAInputFile

public void readCorpusFromLDAInputFile(java.lang.String LDAInputDataFilePath)
Load corpus and documents from an LDAInput file.

Parameters:
LDAInputDataFilePath - The path of the LDAInput file.
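The file format is not described on this page. As a rough sketch only, assume each line follows the layout used by Blei's lda-c input data, "M term_1:count_1 ... term_M:count_M", where M is the number of distinct terms in the document. The class LdaLineSketch, the method parseLine, and the shift from file indices to termIDs using IdxStart are all assumptions, not the documented behavior of readCorpusFromLDAInputFile.

```java
import java.util.TreeMap;

// Hypothetical helper, not part of jml.topics.
final class LdaLineSketch {

    /**
     * Parses one line of the assumed "M term:count ..." layout into a
     * termID -> count map, shifting indices so that termIDs start from 1.
     */
    static TreeMap<Integer, Integer> parseLine(String line, int idxStart) {
        String[] tokens = line.trim().split("\\s+");
        TreeMap<Integer, Integer> counts = new TreeMap<Integer, Integer>();
        // tokens[0] is the number of distinct terms; the pairs follow it.
        for (int k = 1; k < tokens.length; k++) {
            String[] pair = tokens[k].split(":");
            int termId = Integer.parseInt(pair[0]) - idxStart + 1; // assumed shift
            int count = Integer.parseInt(pair[1]);
            counts.put(termId, count);
        }
        return counts;
    }
}
```

Under these assumptions, parsing the line "2 0:3 4:1" with idxStart 0 gives termID 1 with count 3 and termID 5 with count 1.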

readCorpusFromDocTermCountFile

public void readCorpusFromDocTermCountFile(java.lang.String docTermCountFilePath)
Load corpus and documents from a text file located at String docTermCountFilePath.

Parameters:
docTermCountFilePath - A String specifying the location of the text file holding doc-term-count matrix data.

readCorpusFromDocTermCountArray

public void readCorpusFromDocTermCountArray(java.util.ArrayList<java.util.TreeMap<java.lang.Integer,java.lang.Integer>> docTermCountArray)
Load corpus and documents from an ArrayList<TreeMap<Integer, Integer>> instance. Each element of the ArrayList is a doc-term count mapping.

Parameters:
docTermCountArray - An ArrayList<TreeMap<Integer, Integer>> instance; each element of the ArrayList records the doc-term count mapping for the corresponding document.

readCorpusFromMatrix

public void readCorpusFromMatrix(org.apache.commons.math.linear.RealMatrix X)
Load corpus and documents from a RealMatrix instance.

Parameters:
X - a matrix in which each column is the term count vector of a document, with X(i, j) being the number of occurrences of the i-th vocabulary term in the j-th document

corpus2Documents

public static int[][] corpus2Documents(java.util.Vector<java.util.Vector<java.lang.Integer>> corpus)
Convert a Vector of termID sequences into a 2D doc-term count array. Term IDs always start from 1.

Parameters:
corpus - a Vector of termID sequences
Returns:
a 2D integer array carrying the doc-term count matrix
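The conversion described above can be sketched as follows. This is an illustration under stated assumptions, not the jml implementation: the class Corpus2DocumentsSketch and the method toCountArray are hypothetical, the vocabulary size is assumed to be the largest termID seen, and termID t is assumed to land at 0-based term index t - 1.

```java
import java.util.Vector;

// Hypothetical helper, not part of jml.topics.
final class Corpus2DocumentsSketch {

    /** Counts termID occurrences per document; termIDs start from 1. */
    static int[][] toCountArray(Vector<Vector<Integer>> corpus) {
        // Assumption: the vocabulary size equals the largest termID.
        int nTerm = 0;
        for (Vector<Integer> doc : corpus) {
            for (int termId : doc) {
                nTerm = Math.max(nTerm, termId);
            }
        }

        int[][] counts = new int[corpus.size()][nTerm];
        for (int i = 0; i < corpus.size(); i++) {
            for (int termId : corpus.get(i)) {
                counts[i][termId - 1]++; // termID t maps to term index t - 1
            }
        }
        return counts;
    }
}
```

For the corpus [[1, 1, 3], [2, 2, 2]] this sketch yields [[2, 0, 1], [0, 3, 0]].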

documents2Matrix

public static org.apache.commons.math.linear.RealMatrix documents2Matrix(int[][] documents)
Convert a 2D doc-term count array into a matrix.

Parameters:
documents - a 2D integer array carrying the doc-term count matrix
Returns:
a matrix in which each column is the term count vector of a document, with X(i, j) being the number of occurrences of the i-th vocabulary term in the j-th document
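The orientation change is the main point of this method: rows of the count array are documents, while columns of the matrix are documents. A minimal sketch using a plain double[][] in place of RealMatrix keeps the example free of dependencies; the class Documents2MatrixSketch and the method toTermByDocument are hypothetical, and the input is assumed to be rectangular.

```java
// Hypothetical helper, not part of jml.topics.
final class Documents2MatrixSketch {

    /** Transposes a document-by-term count array into a term-by-document array. */
    static double[][] toTermByDocument(int[][] documents) {
        int nDoc = documents.length;
        int nTerm = (nDoc == 0) ? 0 : documents[0].length; // assumes equal row lengths
        double[][] x = new double[nTerm][nDoc];
        for (int j = 0; j < nDoc; j++) {
            for (int i = 0; i < nTerm; i++) {
                x[i][j] = documents[j][i]; // term i, document j
            }
        }
        return x;
    }
}
```

For the count array [[2, 0, 1], [0, 3, 0]] the sketch returns the 3-by-2 array [[2, 0], [0, 3], [1, 0]]. With Commons Math 2.x on the classpath, such an array could presumably be wrapped as new Array2DRowRealMatrix(x) to obtain a RealMatrix.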

getVocabularySize

public static int getVocabularySize(int[][] documents)
Get the vocabulary size.

Parameters:
documents - a 2D integer array where documents[m][n] is the term index in the vocabulary for the n-th word of the m-th document. Indices always start from 0.
Returns:
vocabulary size
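Given the parameter description, where documents[m][n] holds a 0-based term index, one plausible way to obtain the vocabulary size is the largest index plus one. That rule is an assumption, as are the class VocabularySizeSketch and the method vocabularySize; the page does not say how jml computes the value.

```java
// Hypothetical helper, not part of jml.topics.
final class VocabularySizeSketch {

    /** Returns the largest 0-based term index plus one, or 0 for empty input. */
    static int vocabularySize(int[][] documents) {
        int maxIndex = -1;
        for (int[] doc : documents) {
            for (int termIndex : doc) {
                maxIndex = Math.max(maxIndex, termIndex);
            }
        }
        return maxIndex + 1;
    }
}
```

For the documents [[0, 2, 2], [1]] the sketch returns 3. Documents may have different lengths here, since each row is a word sequence rather than a count vector.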

setLDATermIndexStart

public static void setLDATermIndexStart(int IdxStart)
Set the starting term index for the LDA input file.

Parameters:
IdxStart - the starting term index for LDA input data