Package elki.datasource.parser
The general use case for any parser is to create objects from an InputStream (e.g., by reading a data file). The objects are packed into a MultipleObjectsBundle, which in turn is used by a DatabaseConnection object to fill a Database containing the corresponding objects.
By default (i.e., if the user does not request anything else), any KDDTask will use the StaticArrayDatabase, which in turn will use a FileBasedDatabaseConnection and a NumberVectorLabelParser to parse the specified data file, creating a StaticArrayDatabase containing DoubleVector objects.
Thus, the standard procedure for using a data set from a real-valued vector space is to prepare the data set in a file of the following format (as expected by NumberVectorLabelParser):
- One point per line, attributes separated by whitespace.
- Several labels may be given per point. A label must not be parseable as a double.
- Lines starting with "#" will be ignored.
- An index can be specified to identify the entry to be treated as the class label. The index counts all entries (numeric values and labels alike), starting at 0.
- Files can be gzip compressed.
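The format rules above can be illustrated with a small, self-contained reader. This is only a sketch using the Java standard library, not ELKI's NumberVectorLabelParser: the class `VectorFileSketch`, its `Row` type, and the methods `parse` and `maybeGunzip` are hypothetical names introduced for illustration. It assumes UTF-8 input, skips blank lines (an assumption not stated above), and treats a negative class label index as "no class label".

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.GZIPInputStream;

/** Illustrative reader for the file format described above; not ELKI code. */
class VectorFileSketch {
  /** One parsed line: numeric attributes, remaining labels, optional class label. */
  static final class Row {
    final double[] values;
    final List<String> labels;
    final String classLabel; // null if no class label index was given

    Row(double[] values, List<String> labels, String classLabel) {
      this.values = values;
      this.labels = labels;
      this.classLabel = classLabel;
    }
  }

  /** Wraps the stream in a GZIPInputStream if it starts with the gzip magic bytes 0x1f 0x8b. */
  static InputStream maybeGunzip(InputStream in) throws IOException {
    PushbackInputStream pb = new PushbackInputStream(in, 2);
    int b0 = pb.read();
    if (b0 < 0) {
      return pb; // empty stream
    }
    int b1 = pb.read();
    if (b1 >= 0) {
      pb.unread(b1);
    }
    pb.unread(b0); // pushed back last, so it is read first again
    return (b0 == 0x1f && b1 == 0x8b) ? new GZIPInputStream(pb) : pb;
  }

  /** Parses all lines; classLabelIndex counts all entries from 0, negative means none. */
  static List<Row> parse(InputStream in, int classLabelIndex) throws IOException {
    List<Row> rows = new ArrayList<>();
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(maybeGunzip(in), StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        if (line.startsWith("#") || line.trim().isEmpty()) {
          continue; // comment line, or blank line (assumption)
        }
        String[] entries = line.trim().split("\\s+");
        List<Double> numbers = new ArrayList<>();
        List<String> labels = new ArrayList<>();
        String classLabel = null;
        for (int i = 0; i < entries.length; i++) {
          if (i == classLabelIndex) {
            classLabel = entries[i];
            continue;
          }
          try {
            numbers.add(Double.parseDouble(entries[i]));
          } catch (NumberFormatException e) {
            labels.add(entries[i]); // not parseable as double, so it is a label
          }
        }
        double[] values = new double[numbers.size()];
        for (int i = 0; i < values.length; i++) {
          values[i] = numbers.get(i);
        }
        rows.add(new Row(values, labels, classLabel));
      }
    }
    return rows;
  }

  public static void main(String[] args) throws IOException {
    String data = "# two attributes, then labels\n1.0 2.5 foo bar\n3 4 classA\n";
    InputStream in = new ByteArrayInputStream(data.getBytes(StandardCharsets.UTF_8));
    for (Row row : parse(in, -1)) {
      System.out.println(Arrays.toString(row.values) + " " + row.labels);
    }
  }
}
```

Note that `Double.parseDouble` also accepts tokens such as `NaN` or `1d`, which is one reason a label must not be parseable as a double: such a token would end up among the numeric attributes.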
Interface Summary
- Parser: A Parser shall provide a ParsingResult by parsing an InputStream.
- StreamingParser: Interface for streaming parsers, that may be much more efficient in combination with filters.
Class Summary
- AbstractStreamingParser: Base class for streaming parsers.
- AbstractStreamingParser.Par: Parameterization class.
- ArffParser: Parser to load WEKA .arff files into ELKI.
- ArffParser.Par: Parameterization class.
- BitVectorLabelParser: Parser for parsing one BitVector per line, bits separated by whitespace.
- BitVectorLabelParser.Par: Parameterization class.
- CategorialDataAsNumberVectorParser<V extends NumberVector>: A very simple parser for categorial data, which will then be encoded as numbers.
- CategorialDataAsNumberVectorParser.Par<V extends NumberVector>: Parameterization class.
- ClusteringVectorParser: Parser for simple clustering results in vector form, as written by ClusteringVectorDumper.
- ClusteringVectorParser.Par: Parameterization class.
- CSVReaderFormat: Basic format factory for parsing CSV-like formats.
- CSVReaderFormat.Par: Parameterization class.
- LibSVMFormatParser<V extends SparseNumberVector>: Parser to read libSVM format files.
- LibSVMFormatParser.Par<V extends SparseNumberVector>: Parameterization class.
- NumberVectorLabelParser<V extends NumberVector>: Parser for a simple CSV type of format, with columns separated by the given pattern (default: whitespace).
- NumberVectorLabelParser.Par<V extends NumberVector>: Parameterization class.
- SimplePolygonParser: Parser to load polygon data (2D and 3D only) from a simple format.
- SimplePolygonParser.Par: Parameterization class.
- SimpleTransactionParser: Simple parser for transactional data, such as market baskets.
- SimpleTransactionParser.Par: Parameterization class.
- SparseNumberVectorLabelParser<V extends SparseNumberVector>: Parser for parsing one point per line, attributes separated by whitespace.
- SparseNumberVectorLabelParser.Par<V extends SparseNumberVector>: Parameterization class.
- StringParser: Parser that loads a text file for use with string similarity measures.
- StringParser.Par: Parameterization class.
- TermFrequencyParser<V extends SparseNumberVector>: A parser to load term frequency data, which essentially are sparse vectors with text keys.
- TermFrequencyParser.Par<V extends SparseNumberVector>: Parameterization class.