Package elki.datasource.parser
Class ArffParser
- java.lang.Object
  - elki.datasource.parser.ArffParser
- All Implemented Interfaces: Parser
@Title("ARFF File Format Parser")
public class ArffParser extends java.lang.Object implements Parser
Parser to load WEKA .arff files into ELKI. This parser is quite hackish and contains a lot of magic that is not yet configurable.
TODO: Allow configuration of the vector types (double, float).
TODO: When encountering integer columns, produce integer vectors.
TODO: Allow optional class labels.
- Since: 0.4.0
- Author: Erich Schubert
-
Nested Class Summary
Nested Classes (modifier and type, class, description):
- static class ArffParser.Par: Parameterization class.
-
Field Summary
Fields (modifier and type, field, description):
- static java.util.regex.Matcher ARFF_COMMENT: Comment pattern.
- static java.util.regex.Matcher ARFF_HEADER_ATTRIBUTE: Arff attribute declaration marker.
- static java.util.regex.Matcher ARFF_HEADER_DATA: Arff data marker.
- static java.util.regex.Matcher ARFF_HEADER_RELATION: Arff file marker.
- static java.util.regex.Matcher ARFF_NUMERIC: Pattern for numeric columns.
- static java.lang.String DEFAULT_ARFF_MAGIC_CLASS: Pattern to auto-convert columns to class labels.
- static java.lang.String DEFAULT_ARFF_MAGIC_EID: Pattern to auto-convert columns to external ids.
- (package private) NumberVector.Factory<?> denseFactory: Factory for dense vectors.
- static java.util.regex.Matcher EMPTY: Empty line pattern.
- (package private) java.util.ArrayList<java.lang.String> labels: (Reused) buffer for building label lists.
- private static Logging LOG: Logger.
- (package private) java.util.regex.Matcher magic_class: Pattern to recognize class label columns.
- (package private) java.util.regex.Matcher magic_eid: Pattern to recognize external ids.
-
Constructor Summary
Constructors (constructor, description):
- ArffParser(java.lang.String magic_eid, java.lang.String magic_class): Constructor.
- ArffParser(java.util.regex.Pattern magic_eid, java.util.regex.Pattern magic_class): Constructor.
-
Method Summary
Methods (modifier and type, method, description):
- void cleanup(): Perform cleanup operations after parsing.
- private java.lang.Object[] loadDenseInstance(java.io.StreamTokenizer tokenizer, int[] dimsize, TypeInformation[] etyp, int outdim)
- private java.lang.Object[] loadSparseInstance(java.io.StreamTokenizer tokenizer, int[] targ, int[] dimsize, TypeInformation[] elkitypes, int metaLength)
- private java.io.StreamTokenizer makeArffTokenizer(java.io.BufferedReader br): Make a StreamTokenizer for the ARFF format.
- private void nextToken(java.io.StreamTokenizer tokenizer): Helper function for token handling.
- MultipleObjectsBundle parse(java.io.InputStream instream): Returns a list of the objects parsed from the specified input stream.
- private void parseAttributeStatements(java.io.BufferedReader br, java.util.ArrayList<java.lang.String> names, java.util.ArrayList<java.lang.String> types): Parse the "@attribute" section of the ARFF file.
- private void processColumnTypes(java.util.ArrayList<java.lang.String> names, java.util.ArrayList<java.lang.String> types, int[] targ, TypeInformation[] etyp, int[] dims): Process the column types (and names!) into ELKI relation style.
- private void readHeader(java.io.BufferedReader br): Read the dataset header part of the ARFF file, to ensure consistency.
- private void setupBundleHeaders(java.util.ArrayList<java.lang.String> names, int[] targ, TypeInformation[] etyp, int[] dimsize, MultipleObjectsBundle bundle, boolean sparse): Set up the headers for the object bundle.
-
-
-
Field Detail
-
LOG
private static final Logging LOG
Logger.
-
ARFF_HEADER_RELATION
public static final java.util.regex.Matcher ARFF_HEADER_RELATION
Arff file marker.
-
ARFF_HEADER_ATTRIBUTE
public static final java.util.regex.Matcher ARFF_HEADER_ATTRIBUTE
Arff attribute declaration marker.
-
ARFF_HEADER_DATA
public static final java.util.regex.Matcher ARFF_HEADER_DATA
Arff data marker.
-
ARFF_COMMENT
public static final java.util.regex.Matcher ARFF_COMMENT
Comment pattern.
-
DEFAULT_ARFF_MAGIC_EID
public static final java.lang.String DEFAULT_ARFF_MAGIC_EID
Pattern to auto-convert columns to external ids.
- See Also: Constant Field Values
-
DEFAULT_ARFF_MAGIC_CLASS
public static final java.lang.String DEFAULT_ARFF_MAGIC_CLASS
Pattern to auto-convert columns to class labels.
- See Also: Constant Field Values
-
ARFF_NUMERIC
public static final java.util.regex.Matcher ARFF_NUMERIC
Pattern for numeric columns.
-
EMPTY
public static final java.util.regex.Matcher EMPTY
Empty line pattern.
-
magic_eid
java.util.regex.Matcher magic_eid
Pattern to recognize external ids.
-
magic_class
java.util.regex.Matcher magic_class
Pattern to recognize class label columns.
-
labels
java.util.ArrayList<java.lang.String> labels
(Reused) buffer for building label lists.
-
denseFactory
NumberVector.Factory<?> denseFactory
Factory for dense vectors.
TODO: Make parameterizable.
-
-
Constructor Detail
-
ArffParser
public ArffParser(java.util.regex.Pattern magic_eid, java.util.regex.Pattern magic_class)
Constructor.
- Parameters:
magic_eid - Magic to recognize external IDs
magic_class - Magic to recognize class labels
-
ArffParser
public ArffParser(java.lang.String magic_eid, java.lang.String magic_class)
Constructor.
- Parameters:
magic_eid - Magic to recognize external IDs
magic_class - Magic to recognize class labels
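The two constructors accept the "magic" either as compiled patterns or as strings, while the corresponding fields are of type java.util.regex.Matcher. The following is a minimal sketch of how such a pair could be wired up with the standard library only; it is not the ELKI source. The class name MagicColumnSketch, the role() helper, the example expressions, and the case-insensitive compilation are all assumptions made for illustration. The actual default expressions are not reproduced on this page.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical stand-in for the constructor pair; not the ELKI implementation. */
class MagicColumnSketch {
  /** Reusable matcher for external ID column names. */
  final Matcher magicEid;

  /** Reusable matcher for class label column names. */
  final Matcher magicClass;

  /** Pattern variant: keep one matcher per pattern and rebind it for every column name. */
  MagicColumnSketch(Pattern eid, Pattern cls) {
    this.magicEid = eid.matcher("");
    this.magicClass = cls.matcher("");
  }

  /** String variant: compile, then delegate. Case-insensitive matching is an assumption. */
  MagicColumnSketch(String eid, String cls) {
    this(Pattern.compile(eid, Pattern.CASE_INSENSITIVE),
        Pattern.compile(cls, Pattern.CASE_INSENSITIVE));
  }

  /** Classify a column by its name (hypothetical helper). */
  String role(String columnName) {
    if (magicEid.reset(columnName).matches()) {
      return "external-id";
    }
    if (magicClass.reset(columnName).matches()) {
      return "class-label";
    }
    return "data";
  }
}
```

Reusing a Matcher through reset(...) avoids allocating a new matcher for every column name, but a Matcher is not thread-safe, so an object built this way should not be shared between threads.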
-
-
Method Detail
-
parse
public MultipleObjectsBundle parse(java.io.InputStream instream)
Description copied from interface: Parser
Returns a list of the objects parsed from the specified input stream.
-
loadSparseInstance
private java.lang.Object[] loadSparseInstance(java.io.StreamTokenizer tokenizer, int[] targ, int[] dimsize, TypeInformation[] elkitypes, int metaLength) throws java.io.IOException
- Throws: java.io.IOException
-
loadDenseInstance
private java.lang.Object[] loadDenseInstance(java.io.StreamTokenizer tokenizer, int[] dimsize, TypeInformation[] etyp, int outdim) throws java.io.IOException
- Throws: java.io.IOException
-
makeArffTokenizer
private java.io.StreamTokenizer makeArffTokenizer(java.io.BufferedReader br)
Make a StreamTokenizer for the ARFF format.
- Parameters:
br - Buffered reader
- Returns: Tokenizer
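As an illustration of what configuring a java.io.StreamTokenizer for ARFF data rows can look like, here is a minimal sketch. The syntax-table settings below are assumptions based on the general ARFF conventions ('%' comments, comma-separated values, quoted strings, braces for sparse rows); the settings ELKI actually applies are not documented on this page. The class ArffTokenizerSketch and the tokens() helper are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical tokenizer setup for ARFF data rows; not the ELKI implementation. */
class ArffTokenizerSketch {
  /** Build a tokenizer with an assumed ARFF-style syntax table. */
  static StreamTokenizer makeTokenizer(BufferedReader br) {
    StreamTokenizer t = new StreamTokenizer(br);
    t.resetSyntax();                // clear the defaults, including number parsing
    t.wordChars(' ' + 1, '\u00FF'); // printable characters form words; numbers stay text
    t.whitespaceChars(0, ' ');      // control characters and blanks separate tokens
    t.whitespaceChars(',', ',');    // so do commas
    t.commentChar('%');             // '%' starts a comment that runs to the end of the line
    t.quoteChar('"');               // quoted values may contain blanks and commas
    t.quoteChar('\'');
    t.ordinaryChar('{');            // braces are returned as single-character tokens
    t.ordinaryChar('}');
    t.eolIsSignificant(true);       // one instance per line
    return t;
  }

  /** Collect all tokens of the input as strings; line ends are reported as "<EOL>". */
  static List<String> tokens(String input) {
    StreamTokenizer t = makeTokenizer(new BufferedReader(new StringReader(input)));
    List<String> out = new ArrayList<>();
    try {
      while (t.nextToken() != StreamTokenizer.TT_EOF) {
        if (t.ttype == StreamTokenizer.TT_EOL) {
          out.add("<EOL>");
        } else if (t.ttype == StreamTokenizer.TT_WORD || t.ttype == '"' || t.ttype == '\'') {
          out.add(t.sval); // word or quoted string
        } else {
          out.add(String.valueOf((char) t.ttype)); // ordinary character such as '{'
        }
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return out;
  }
}
```

Calling resetSyntax() first keeps numeric fields as plain words, which leaves the decision of how to convert them (double, float, integer) to the caller.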
-
setupBundleHeaders
private void setupBundleHeaders(java.util.ArrayList<java.lang.String> names, int[] targ, TypeInformation[] etyp, int[] dimsize, MultipleObjectsBundle bundle, boolean sparse)
Set up the headers for the object bundle.
- Parameters:
names - Attribute names
targ - Target columns
etyp - ELKI type information
dimsize - Number of dimensions in the individual types
bundle - Output bundle
sparse - Flag to create sparse vectors
-
readHeader
private void readHeader(java.io.BufferedReader br) throws java.io.IOException
Read the dataset header part of the ARFF file, to ensure consistency.
- Parameters:
br - Buffered Reader
- Throws: java.io.IOException
-
parseAttributeStatements
private void parseAttributeStatements(java.io.BufferedReader br, java.util.ArrayList<java.lang.String> names, java.util.ArrayList<java.lang.String> types) throws java.io.IOException
Parse the "@attribute" section of the ARFF file.
- Parameters:
br - Input
names - List (to fill) of attribute names
types - List (to fill) of attribute types
- Throws: java.io.IOException
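The idea of filling two parallel lists from the "@attribute" lines can be sketched with the standard library as follows. This is an illustrative sketch, not the ELKI source: the class ArffAttributeSketch, its three regular expressions, and the error handling are assumptions, and the reader is assumed to be positioned after the "@relation" line. Reading stops once the "@data" marker has been consumed.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical "@attribute" section reader; not the ELKI implementation. */
class ArffAttributeSketch {
  // Assumed expressions; the patterns ELKI uses are not reproduced on this page.
  static final Pattern ATTRIBUTE = Pattern.compile(
      "^\\s*@attribute\\s+('[^']*'|\"[^\"]*\"|\\S+)\\s+(.+?)\\s*$", Pattern.CASE_INSENSITIVE);
  static final Pattern DATA = Pattern.compile("^\\s*@data\\b.*$", Pattern.CASE_INSENSITIVE);
  static final Pattern SKIP = Pattern.compile("^\\s*(%.*)?$"); // empty line or comment

  /** Fill names and types from the header lines up to and including "@data". */
  static void parseAttributes(BufferedReader br, List<String> names, List<String> types)
      throws IOException {
    String line;
    while ((line = br.readLine()) != null) {
      if (SKIP.matcher(line).matches()) {
        continue;
      }
      if (DATA.matcher(line).matches()) {
        return; // the data section follows
      }
      Matcher m = ATTRIBUTE.matcher(line);
      if (!m.matches()) {
        throw new IOException("Unrecognized ARFF header line: " + line);
      }
      String name = m.group(1);
      // Strip the surrounding quotes from a quoted attribute name.
      if (name.length() >= 2 && (name.charAt(0) == '\'' || name.charAt(0) == '"')) {
        name = name.substring(1, name.length() - 1);
      }
      names.add(name);
      types.add(m.group(2));
    }
    throw new IOException("ARFF file has no @data section.");
  }
}
```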
-
processColumnTypes
private void processColumnTypes(java.util.ArrayList<java.lang.String> names, java.util.ArrayList<java.lang.String> types, int[] targ, TypeInformation[] etyp, int[] dims)
Process the column types (and names!) into ELKI relation style. Note that this will, for example, merge successive numerical columns into a single vector.
- Parameters:
names - Attribute names
types - Attribute types
targ - Target dimension mapping (ARFF to ELKI), return value
etyp - ELKI type information, return value
dims - Number of successive dimensions, return value
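The merging of successive numerical columns can be sketched as a single pass that fills a target mapping and a per-target dimension count. This is a simplified illustration under stated assumptions, not the ELKI source: it only distinguishes numeric from non-numeric ARFF types (numeric, real, integer), ignores the magic name patterns and the ELKI type information, and the names ColumnGroupingSketch, groupColumns, and isNumeric are hypothetical.

```java
import java.util.List;
import java.util.Locale;

/** Hypothetical column grouping; not the ELKI implementation. */
class ColumnGroupingSketch {
  /**
   * Map each ARFF column to an output column, merging runs of numeric columns.
   *
   * @param types ARFF attribute types, one per input column
   * @param targ output: target column for each input column
   * @param dims output: number of input columns merged into each target column
   * @return number of output columns
   */
  static int groupColumns(List<String> types, int[] targ, int[] dims) {
    int next = 0;
    boolean prevNumeric = false;
    for (int i = 0; i < types.size(); i++) {
      boolean numeric = isNumeric(types.get(i));
      if (numeric && prevNumeric) {
        targ[i] = next - 1; // extend the vector opened by the previous column
      } else {
        targ[i] = next++;   // open a new output column
      }
      dims[targ[i]]++;
      prevNumeric = numeric;
    }
    return next;
  }

  /** True for the numeric ARFF attribute types. */
  static boolean isNumeric(String type) {
    String t = type.trim().toLowerCase(Locale.ROOT);
    return t.equals("numeric") || t.equals("real") || t.equals("integer");
  }
}
```

For the type list numeric, real, {a,b}, numeric, string this yields four output columns: the first two inputs share target 0 with two dimensions, and every other input gets a target of its own.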
-
nextToken
private void nextToken(java.io.StreamTokenizer tokenizer) throws java.io.IOException
Helper function for token handling.
- Parameters:
tokenizer - Tokenizer
- Throws: java.io.IOException
-
-