Package elki.datasource.parser

Parsers for different file formats and data types.

The general use case for any parser is to create objects from an InputStream (e.g., by reading a data file). The objects are packed into a MultipleObjectsBundle, which in turn is used by a DatabaseConnection to fill a Database with the corresponding objects.

By default (i.e., if the user does not specify otherwise), a KDDTask uses a StaticArrayDatabase, which in turn uses a FileBasedDatabaseConnection and a NumberVectorLabelParser to parse the specified data file, yielding a database that contains DoubleVector objects.

Thus, the standard procedure for using a data set from a real-valued vector space is to prepare it as a file in the following format (as expected by NumberVectorLabelParser):

  • One point per line, attributes separated by whitespace.
  • Several labels may be given per point. A label must not be parseable as a double.
  • Lines starting with "#" will be ignored.
  • An index can be specified to identify the entry to be treated as the class label. This index counts all entries (numeric values and labels alike), starting at 0.
  • Files can be gzip compressed.
This file format is also suitable for, e.g., gnuplot.
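
To make the format rules above concrete, here is a minimal sketch of a reader for them, written against the Java standard library only. It is not the NumberVectorLabelParser implementation and does not use any ELKI classes; the names LabeledVectorFormatSketch, Row, and parse are hypothetical and exist only for this illustration. It assumes that an entry counts as numeric exactly when Double.parseDouble accepts it (which may differ from the parser's own number handling) and that blank lines are skipped.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

/** Illustration of the file format only; NOT ELKI's NumberVectorLabelParser. */
class LabeledVectorFormatSketch {
  /** One parsed line: numeric attributes, plain labels, optional class label. */
  static final class Row {
    final List<Double> values = new ArrayList<>();
    final List<String> labels = new ArrayList<>();
    String classLabel; // stays null if no entry sits at the class label index
  }

  /**
   * Reads whitespace-separated lines. A negative classLabelIndex means that
   * no entry is treated as the class label.
   */
  static List<Row> parse(BufferedReader in, int classLabelIndex) throws IOException {
    List<Row> rows = new ArrayList<>();
    String line;
    while ((line = in.readLine()) != null) {
      if (line.startsWith("#")) {
        continue; // comment line
      }
      line = line.trim();
      if (line.isEmpty()) {
        continue; // assumption of this sketch: blank lines are skipped
      }
      Row row = new Row();
      String[] entries = line.split("\\s+");
      // The index counts all entries, numeric values and labels alike, from 0.
      for (int i = 0; i < entries.length; i++) {
        if (i == classLabelIndex) {
          row.classLabel = entries[i];
          continue;
        }
        try {
          row.values.add(Double.parseDouble(entries[i]));
        } catch (NumberFormatException e) {
          row.labels.add(entries[i]); // not parseable as double: a label
        }
      }
      rows.add(row);
    }
    return rows;
  }

  public static void main(String[] args) throws IOException {
    String data = "# x y name class\n"
        + "1.0 2.0 pointA red\n"
        + "3.5 4.5 pointB blue\n";
    for (Row r : parse(new BufferedReader(new StringReader(data)), 3)) {
      System.out.println(r.values + " " + r.labels + " class=" + r.classLabel);
    }
  }
}
```

In this sketch, the entry at index 3 of each line is taken as the class label, the two leading numbers become the vector, and the remaining non-numeric entry is kept as an ordinary label. For a gzip-compressed file, the input stream could be wrapped in java.util.zip.GZIPInputStream before it is handed to the reader.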