Package elki.datasource.parser
Class TermFrequencyParser<V extends SparseNumberVector>
- java.lang.Object
  - elki.datasource.parser.AbstractStreamingParser
    - elki.datasource.parser.NumberVectorLabelParser<V>
      - elki.datasource.parser.TermFrequencyParser<V>
- All Implemented Interfaces:
BundleStreamSource, Parser, StreamingParser
public class TermFrequencyParser<V extends SparseNumberVector> extends NumberVectorLabelParser<V>
A parser to load term frequency data, which are essentially sparse vectors with text keys.
Parse a file containing term frequencies. The expected format is:
rowlabel1 term1 <freq> term2 <freq> ...
rowlabel2 term1 <freq> term3 <freq> ...
Terms must not contain the separator character!
If your data does not contain frequencies, you may be able to use SimpleTransactionParser instead.
- Since:
- 0.4.0
- Author:
- Erich Schubert
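The expected input format can be illustrated with a small standalone sketch. It is not the ELKI implementation: the class name `TermFrequencySketch` and its `parseLine` method are hypothetical, the separator is assumed to be whitespace, and normalization is assumed to mean dividing each frequency by the row's sum, which this page does not specify.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustration of the input format only; NOT the ELKI implementation. */
final class TermFrequencySketch {
  /** Term to dimension index, assigned on first sight (hypothetical stand-in for keymap). */
  final Map<String, Integer> keymap = new HashMap<>();

  /**
   * Parse one line of the form "rowlabel term1 freq1 term2 freq2 ...".
   * Returns dimension index to value; the row label (first token) is skipped.
   */
  Map<Integer, Double> parseLine(String line, boolean normalize) {
    String[] tok = line.trim().split("\\s+"); // assumption: whitespace separator
    if((tok.length - 1) % 2 != 0) {
      throw new IllegalArgumentException("Expected term/frequency pairs after the label: " + line);
    }
    Map<Integer, Double> values = new LinkedHashMap<>();
    double sum = 0.;
    for(int i = 1; i < tok.length; i += 2) {
      String term = tok[i];
      double freq = Double.parseDouble(tok[i + 1]);
      // Assign the next free dimension to terms not seen before.
      Integer dim = keymap.get(term);
      if(dim == null) {
        dim = keymap.size();
        keymap.put(term, dim);
      }
      values.merge(dim, freq, Double::sum); // repeated terms accumulate
      sum += freq;
    }
    // Assumption: normalize rescales the row so that its values sum to 1.
    if(normalize && sum > 0.) {
      for(Map.Entry<Integer, Double> e : values.entrySet()) {
        e.setValue(e.getValue() / sum);
      }
    }
    return values;
  }
}
```

For example, `new TermFrequencySketch().parseLine("doc1 apple 2 pear 2", true)` maps dimensions 0 and 1 to 0.5 each. Reusing the same instance for further rows keeps the term-to-dimension assignment consistent across the file.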
-
-
Nested Class Summary
Nested Classes (Modifier and Type, Class, Description):
static class TermFrequencyParser.Par<V extends SparseNumberVector>
Parameterization class.
-
Nested classes/interfaces inherited from interface elki.datasource.bundle.BundleStreamSource
BundleStreamSource.Event
-
-
Field Summary
Fields (Modifier and Type, Field, Description):
(package private) it.unimi.dsi.fastutil.objects.Object2IntOpenHashMap<java.lang.String> keymap
Map.
(package private) java.util.ArrayList<java.lang.String> labels
(Reused) label buffer.
private static Logging LOG
Class logger.
(package private) boolean normalize
Normalize.
(package private) int numterms
Number of different terms observed.
private SparseNumberVector.Factory<V> sparsefactory
Same as NumberVectorLabelParser.factory, but subtype.
(package private) it.unimi.dsi.fastutil.ints.Int2DoubleOpenHashMap values
(Reused) set of values for the number vector.
-
Fields inherited from class elki.datasource.parser.NumberVectorLabelParser
attributes, columnnames, curlbl, curvec, factory, haslabels, maxdim, meta, mindim, nextevent, unique, warnedDim, warnedPrecision
-
Fields inherited from class elki.datasource.parser.AbstractStreamingParser
reader, tokenizer
-
-
Constructor Summary
Constructors (Constructor, Description):
TermFrequencyParser(boolean normalize, SparseNumberVector.Factory<V> factory)
Constructor.
TermFrequencyParser(boolean normalize, CSVReaderFormat format, long[] labelIndices, SparseNumberVector.Factory<V> factory)
Constructor.
-
Method Summary
Methods (Modifier and Type, Method, Description):
protected Logging getLogger()
Get the logger for this class.
protected SimpleTypeInformation<V> getTypeInformation(int mindim, int maxdim)
Get a prototype object for the given dimensionality.
protected boolean parseLineInternal()
Internal method for parsing a single line.
-
Methods inherited from class elki.datasource.parser.NumberVectorLabelParser
buildMeta, cleanup, createVector, data, getMeta, initStream, isLabelColumn, nextEvent
-
Methods inherited from class elki.datasource.parser.AbstractStreamingParser
asMultipleObjectsBundle, assignDBID, hasDBIDs, parse
-
-
-
-
Field Detail
-
LOG
private static final Logging LOG
Class logger.
-
numterms
int numterms
Number of different terms observed.
-
keymap
it.unimi.dsi.fastutil.objects.Object2IntOpenHashMap<java.lang.String> keymap
Map.
-
normalize
boolean normalize
Normalize.
-
sparsefactory
private SparseNumberVector.Factory<V extends SparseNumberVector> sparsefactory
Same as NumberVectorLabelParser.factory, but subtype.
-
values
it.unimi.dsi.fastutil.ints.Int2DoubleOpenHashMap values
(Reused) set of values for the number vector.
-
labels
java.util.ArrayList<java.lang.String> labels
(Reused) label buffer.
-
-
Constructor Detail
-
TermFrequencyParser
public TermFrequencyParser(boolean normalize, SparseNumberVector.Factory<V> factory)
Constructor.
- Parameters:
normalize - Normalize
factory - Vector type
-
TermFrequencyParser
public TermFrequencyParser(boolean normalize, CSVReaderFormat format, long[] labelIndices, SparseNumberVector.Factory<V> factory)
Constructor.
- Parameters:
normalize - Normalize
format - Input format
labelIndices - Indices to use as labels
factory - Vector type
-
-
Method Detail
-
parseLineInternal
protected boolean parseLineInternal()
Description copied from class: NumberVectorLabelParser
Internal method for parsing a single line. Used by both line-based parsing and block parsing. This avoids building the metadata for each line.
- Overrides:
parseLineInternal in class NumberVectorLabelParser<V extends SparseNumberVector>
- Returns:
true when a valid line was read, false on a label row.
-
getTypeInformation
protected SimpleTypeInformation<V> getTypeInformation(int mindim, int maxdim)
Description copied from class: NumberVectorLabelParser
Get a prototype object for the given dimensionality.
- Overrides:
getTypeInformation in class NumberVectorLabelParser<V extends SparseNumberVector>
- Parameters:
mindim - Minimum dimensionality
maxdim - Maximum dimensionality
- Returns:
- Prototype object
-
getLogger
protected Logging getLogger()
Description copied from class: AbstractStreamingParser
Get the logger for this class.
- Overrides:
getLogger in class NumberVectorLabelParser<V extends SparseNumberVector>
- Returns:
- Logger.