Package elki.datasource.parser
Class ArffParser
- java.lang.Object
-
- elki.datasource.parser.ArffParser
-
- All Implemented Interfaces:
Parser
@Title("ARFF File Format Parser")
public class ArffParser extends java.lang.Object implements Parser
Parser to load WEKA .arff files into ELKI. This parser is quite hackish and contains a lot of magic that is not yet configurable.
TODO: Allow configuration of the vector types (double, float)
TODO: when encountering integer columns, produce integer vectors.
TODO: allow optional class labels.
- Since:
- 0.4.0
- Author:
- Erich Schubert
-
-
Nested Class Summary
Nested Classes (Modifier and Type, Class, Description):
- static class ArffParser.Par: Parameterization class.
-
Field Summary
Fields (Modifier and Type, Field, Description):
- static java.util.regex.Matcher ARFF_COMMENT: Comment pattern.
- static java.util.regex.Matcher ARFF_HEADER_ATTRIBUTE: Arff attribute declaration marker.
- static java.util.regex.Matcher ARFF_HEADER_DATA: Arff data marker.
- static java.util.regex.Matcher ARFF_HEADER_RELATION: Arff file marker.
- static java.util.regex.Matcher ARFF_NUMERIC: Pattern for numeric columns.
- static java.lang.String DEFAULT_ARFF_MAGIC_CLASS: Pattern to auto-convert columns to class labels.
- static java.lang.String DEFAULT_ARFF_MAGIC_EID: Pattern to auto-convert columns to external ids.
- (package private) NumberVector.Factory<?> denseFactory: Factory for dense vectors.
- static java.util.regex.Matcher EMPTY: Empty line pattern.
- (package private) java.util.ArrayList<java.lang.String> labels: (Reused) buffer for building label lists.
- private static Logging LOG: Logger.
- (package private) java.util.regex.Matcher magic_class: Pattern to recognize class label columns.
- (package private) java.util.regex.Matcher magic_eid: Pattern to recognize external ids.
-
Constructor Summary
Constructors (Constructor, Description):
- ArffParser(java.lang.String magic_eid, java.lang.String magic_class): Constructor.
- ArffParser(java.util.regex.Pattern magic_eid, java.util.regex.Pattern magic_class): Constructor.
-
Method Summary
All Methods, Instance Methods, Concrete Methods (Modifier and Type, Method, Description):
- void cleanup(): Perform cleanup operations after parsing.
- private java.lang.Object[] loadDenseInstance(java.io.StreamTokenizer tokenizer, int[] dimsize, TypeInformation[] etyp, int outdim)
- private java.lang.Object[] loadSparseInstance(java.io.StreamTokenizer tokenizer, int[] targ, int[] dimsize, TypeInformation[] elkitypes, int metaLength)
- private java.io.StreamTokenizer makeArffTokenizer(java.io.BufferedReader br): Make a StreamTokenizer for the ARFF format.
- private void nextToken(java.io.StreamTokenizer tokenizer): Helper function for token handling.
- MultipleObjectsBundle parse(java.io.InputStream instream): Returns a list of the objects parsed from the specified input stream.
- private void parseAttributeStatements(java.io.BufferedReader br, java.util.ArrayList<java.lang.String> names, java.util.ArrayList<java.lang.String> types): Parse the "@attribute" section of the ARFF file.
- private void processColumnTypes(java.util.ArrayList<java.lang.String> names, java.util.ArrayList<java.lang.String> types, int[] targ, TypeInformation[] etyp, int[] dims): Process the column types (and names!) into ELKI relation style.
- private void readHeader(java.io.BufferedReader br): Read the dataset header part of the ARFF file, to ensure consistency.
- private void setupBundleHeaders(java.util.ArrayList<java.lang.String> names, int[] targ, TypeInformation[] etyp, int[] dimsize, MultipleObjectsBundle bundle, boolean sparse): Setup the headers for the object bundle.
-
-
-
Field Detail
-
LOG
private static final Logging LOG
Logger.
-
ARFF_HEADER_RELATION
public static final java.util.regex.Matcher ARFF_HEADER_RELATION
Arff file marker.
-
ARFF_HEADER_ATTRIBUTE
public static final java.util.regex.Matcher ARFF_HEADER_ATTRIBUTE
Arff attribute declaration marker.
-
ARFF_HEADER_DATA
public static final java.util.regex.Matcher ARFF_HEADER_DATA
Arff data marker.
-
ARFF_COMMENT
public static final java.util.regex.Matcher ARFF_COMMENT
Comment pattern.
-
DEFAULT_ARFF_MAGIC_EID
public static final java.lang.String DEFAULT_ARFF_MAGIC_EID
Pattern to auto-convert columns to external ids.
- See Also:
- Constant Field Values
-
DEFAULT_ARFF_MAGIC_CLASS
public static final java.lang.String DEFAULT_ARFF_MAGIC_CLASS
Pattern to auto-convert columns to class labels.
- See Also:
- Constant Field Values
-
ARFF_NUMERIC
public static final java.util.regex.Matcher ARFF_NUMERIC
Pattern for numeric columns.
-
EMPTY
public static final java.util.regex.Matcher EMPTY
Empty line pattern.
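The static fields above are declared as `java.util.regex.Matcher` rather than `Pattern`, which suggests they are reused across lines via `Matcher.reset(CharSequence)`. The following is a minimal sketch of that idiom for telling ARFF line kinds apart. The regular expressions and the class name `ArffLineKinds` are assumptions made for illustration; the expressions `ArffParser` actually compiles are not shown on this page.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ArffLineKinds {
  // Hypothetical patterns, chosen for illustration only.
  static final Matcher RELATION = Pattern.compile("@relation\\s+.*", Pattern.CASE_INSENSITIVE).matcher("");
  static final Matcher ATTRIBUTE = Pattern.compile("@attribute\\s+.*", Pattern.CASE_INSENSITIVE).matcher("");
  static final Matcher DATA = Pattern.compile("@data\\s*", Pattern.CASE_INSENSITIVE).matcher("");
  static final Matcher COMMENT = Pattern.compile("\\s*%.*").matcher("");
  static final Matcher EMPTY = Pattern.compile("\\s*").matcher("");

  /** Classify one line by resetting the shared matchers onto it. */
  static String classify(String line) {
    if (EMPTY.reset(line).matches()) return "empty";
    if (COMMENT.reset(line).matches()) return "comment";
    String trimmed = line.trim();
    if (RELATION.reset(trimmed).matches()) return "relation";
    if (ATTRIBUTE.reset(trimmed).matches()) return "attribute";
    if (DATA.reset(trimmed).matches()) return "data";
    return "other";
  }

  public static void main(String[] args) {
    String[] lines = { "% iris subset", "@RELATION iris", "@attribute sepallength numeric", "", "@data", "5.1,3.5" };
    for (String line : lines) {
      System.out.println(classify(line) + ": " + line);
    }
  }
}
```

A shared `Matcher` carries mutable state, so this idiom is only safe when a single thread uses the matchers at a time.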
-
magic_eid
java.util.regex.Matcher magic_eid
Pattern to recognize external ids.
-
magic_class
java.util.regex.Matcher magic_class
Pattern to recognize class label columns.
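As a rough illustration of how two name patterns such as `magic_eid` and `magic_class` could assign roles to columns, the sketch below matches attribute names against two expressions. The pattern strings, the case-insensitive full match, and the class name `MagicColumns` are all assumptions; the actual values of `DEFAULT_ARFF_MAGIC_EID` and `DEFAULT_ARFF_MAGIC_CLASS` are listed under Constant Field Values, not here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MagicColumns {
  // Assumed example patterns; not taken from this page.
  static final String EXAMPLE_EID = "(External-?ID)";
  static final String EXAMPLE_CLASS = "(Class|Class-?Label)";

  final Matcher magicEid;
  final Matcher magicClass;

  MagicColumns(String eid, String cls) {
    // Assumption: names are compared case-insensitively and must match as a whole.
    magicEid = Pattern.compile(eid, Pattern.CASE_INSENSITIVE).matcher("");
    magicClass = Pattern.compile(cls, Pattern.CASE_INSENSITIVE).matcher("");
  }

  /** Decide how a column should be treated, based on its attribute name only. */
  String role(String attributeName) {
    if (magicEid.reset(attributeName).matches()) return "external-id";
    if (magicClass.reset(attributeName).matches()) return "class-label";
    return "data";
  }

  public static void main(String[] args) {
    MagicColumns magic = new MagicColumns(EXAMPLE_EID, EXAMPLE_CLASS);
    for (String name : new String[] { "ExternalID", "class", "Class-Label", "sepallength" }) {
      System.out.println(name + " -> " + magic.role(name));
    }
  }
}
```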
-
labels
java.util.ArrayList<java.lang.String> labels
(Reused) buffer for building label lists.
-
denseFactory
NumberVector.Factory<?> denseFactory
Factory for dense vectors. TODO: Make parameterizable
-
-
Constructor Detail
-
ArffParser
public ArffParser(java.util.regex.Pattern magic_eid, java.util.regex.Pattern magic_class)
Constructor.
- Parameters:
- magic_eid - Magic to recognize external IDs
- magic_class - Magic to recognize class labels
-
ArffParser
public ArffParser(java.lang.String magic_eid, java.lang.String magic_class)
Constructor.
- Parameters:
- magic_eid - Magic to recognize external IDs
- magic_class - Magic to recognize class labels
-
-
Method Detail
-
parse
public MultipleObjectsBundle parse(java.io.InputStream instream)
Description copied from interface: Parser
Returns a list of the objects parsed from the specified input stream.
-
loadSparseInstance
private java.lang.Object[] loadSparseInstance(java.io.StreamTokenizer tokenizer, int[] targ, int[] dimsize, TypeInformation[] elkitypes, int metaLength) throws java.io.IOException
- Throws:
- java.io.IOException
-
loadDenseInstance
private java.lang.Object[] loadDenseInstance(java.io.StreamTokenizer tokenizer, int[] dimsize, TypeInformation[] etyp, int outdim) throws java.io.IOException
- Throws:
- java.io.IOException
-
makeArffTokenizer
private java.io.StreamTokenizer makeArffTokenizer(java.io.BufferedReader br)
Make a StreamTokenizer for the ARFF format.
- Parameters:
- br - Buffered reader
- Returns:
- Tokenizer
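For orientation, here is one way a `java.io.StreamTokenizer` can be configured for ARFF-style input. This is a sketch under assumptions (comma and whitespace as separators, `%` as comment character, both quote styles, braces as single-character tokens for sparse rows), not the configuration that `makeArffTokenizer` necessarily uses. The class name `ArffTokenizerSketch` and the helper `tokens` are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class ArffTokenizerSketch {
  /** Configure a tokenizer for ARFF-like text (assumed syntax table). */
  static StreamTokenizer makeTokenizer(BufferedReader br) {
    StreamTokenizer t = new StreamTokenizer(br);
    t.resetSyntax();                // drop defaults, including built-in number parsing
    t.whitespaceChars(0, ' ');      // control characters and space separate tokens
    t.wordChars(' ' + 1, '\u00FF'); // everything else starts out as a word character
    t.whitespaceChars(',', ',');    // commas separate values
    t.commentChar('%');             // '%' starts a comment that runs to end of line
    t.quoteChar('"');
    t.quoteChar('\'');
    t.ordinaryChar('{');            // braces delimit sparse instances
    t.ordinaryChar('}');
    t.eolIsSignificant(true);       // report line ends as TT_EOL
    return t;
  }

  /** Collect all tokens of a text as strings; line ends are shown as "<EOL>". */
  static List<String> tokens(String text) {
    StreamTokenizer t = makeTokenizer(new BufferedReader(new StringReader(text)));
    List<String> out = new ArrayList<>();
    try {
      while (t.nextToken() != StreamTokenizer.TT_EOF) {
        switch (t.ttype) {
          case StreamTokenizer.TT_EOL: out.add("<EOL>"); break;
          case StreamTokenizer.TT_WORD:
          case '"':
          case '\'': out.add(t.sval); break;       // words and quoted strings carry sval
          default: out.add(String.valueOf((char) t.ttype)); // ordinary characters
        }
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return out;
  }

  public static void main(String[] args) {
    System.out.println(tokens("5.1,3.5,'Iris setosa' % note\n{1 2.5}\n"));
  }
}
```

Because number parsing is switched off, numeric values arrive as words and have to be converted by the caller, for example with `Double.parseDouble`.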
-
setupBundleHeaders
private void setupBundleHeaders(java.util.ArrayList<java.lang.String> names, int[] targ, TypeInformation[] etyp, int[] dimsize, MultipleObjectsBundle bundle, boolean sparse)
Setup the headers for the object bundle.
- Parameters:
- names - Attribute names
- targ - Target columns
- etyp - ELKI type information
- dimsize - Number of dimensions in the individual types
- bundle - Output bundle
- sparse - Flag to create sparse vectors
-
readHeader
private void readHeader(java.io.BufferedReader br) throws java.io.IOException
Read the dataset header part of the ARFF file, to ensure consistency.
- Parameters:
- br - Buffered Reader
- Throws:
- java.io.IOException
-
parseAttributeStatements
private void parseAttributeStatements(java.io.BufferedReader br, java.util.ArrayList<java.lang.String> names, java.util.ArrayList<java.lang.String> types) throws java.io.IOException
Parse the "@attribute" section of the ARFF file.
- Parameters:
- br - Input
- names - List (to fill) of attribute names
- types - List (to fill) of attribute types
- Throws:
- java.io.IOException
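To make the contract of filling two parallel lists concrete, the following sketch reads `@attribute` lines until `@data` and collects names and types. It is an illustration under assumptions, not the method's actual body: the regular expression, the handling of quoted names, and the names `AttributeLines`, `unquote`, and `summarize` are all hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AttributeLines {
  // Assumed shape of a declaration: @attribute <name or quoted name> <type>
  static final Pattern ATTRIBUTE = Pattern.compile(
      "@attribute\\s+(\"[^\"]*\"|'[^']*'|\\S+)\\s+(.+)", Pattern.CASE_INSENSITIVE);

  /** Read declarations up to the "@data" marker, filling both lists in parallel. */
  static void parseAttributes(BufferedReader br, List<String> names, List<String> types) throws IOException {
    String line;
    while ((line = br.readLine()) != null) {
      line = line.trim();
      if (line.isEmpty() || line.startsWith("%")) continue; // skip blanks and comments
      if (line.equalsIgnoreCase("@data")) break;            // header is complete
      Matcher m = ATTRIBUTE.matcher(line);
      if (!m.matches()) throw new IOException("Unexpected header line: " + line);
      names.add(unquote(m.group(1)));
      types.add(m.group(2).trim());
    }
  }

  /** Strip one pair of matching single or double quotes, if present. */
  static String unquote(String s) {
    if (s.length() >= 2) {
      char q = s.charAt(0);
      if ((q == '"' || q == '\'') && s.charAt(s.length() - 1) == q) {
        return s.substring(1, s.length() - 1);
      }
    }
    return s;
  }

  /** Convenience wrapper for in-memory headers. */
  static String summarize(String header) {
    List<String> names = new ArrayList<>();
    List<String> types = new ArrayList<>();
    try {
      parseAttributes(new BufferedReader(new StringReader(header)), names, types);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return "names=" + names + " types=" + types;
  }

  public static void main(String[] args) {
    System.out.println(summarize(
        "% iris subset\n@attribute id numeric\n@ATTRIBUTE 'sepal length' real\n\n@attribute class {a,b}\n@data\n1,2.5,a\n"));
  }
}
```

The sketch assumes the reader is already positioned after the `@relation` line, which on this page is the job of `readHeader`.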
-
processColumnTypes
private void processColumnTypes(java.util.ArrayList<java.lang.String> names, java.util.ArrayList<java.lang.String> types, int[] targ, TypeInformation[] etyp, int[] dims)
Process the column types (and names!) into ELKI relation style. Note that this will for example merge successive numerical columns into a single vector.
- Parameters:
- names - Attribute names
- types - Attribute types
- targ - Target dimension mapping (ARFF to ELKI), return value
- etyp - ELKI type information, return value
- dims - Number of successive dimensions, return value
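The merging of successive numerical columns can be pictured with the small sketch below, which computes a column mapping in the spirit of `targ` and a per-output-column width in the spirit of `dims`. It is a simplified illustration: it ignores attribute names, external ids, class labels, and `TypeInformation`, and the set of type keywords treated as numeric is an assumption. The class name `ColumnGrouping` and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class ColumnGrouping {
  /**
   * Merge runs of adjacent numeric ARFF columns into one output column each.
   * Fills targ (ARFF column index to output column index) and returns the
   * number of ARFF columns merged into each output column.
   */
  static int[] group(List<String> types, int[] targ) {
    List<Integer> dims = new ArrayList<>();
    boolean previousNumeric = false;
    for (int i = 0; i < types.size(); i++) {
      boolean numeric = isNumeric(types.get(i));
      if (numeric && previousNumeric) {
        int last = dims.size() - 1;
        dims.set(last, dims.get(last) + 1); // extend the current vector
      } else {
        dims.add(1);                        // start a new output column
      }
      targ[i] = dims.size() - 1;
      previousNumeric = numeric;
    }
    return dims.stream().mapToInt(Integer::intValue).toArray();
  }

  // Assumption: these keywords mark numeric attributes.
  static boolean isNumeric(String type) {
    String t = type.trim().toLowerCase(Locale.ROOT);
    return t.equals("numeric") || t.equals("real") || t.equals("integer");
  }

  public static void main(String[] args) {
    List<String> types = Arrays.asList("numeric", "real", "{a,b}", "numeric", "string", "numeric", "numeric");
    int[] targ = new int[types.size()];
    int[] dims = group(types, targ);
    System.out.println("targ=" + Arrays.toString(targ)); // targ=[0, 0, 1, 2, 3, 4, 4]
    System.out.println("dims=" + Arrays.toString(dims)); // dims=[2, 1, 1, 1, 2]
  }
}
```

In this example the first two columns become one two-dimensional vector, while the nominal and string columns each keep a column of their own.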
-
nextToken
private void nextToken(java.io.StreamTokenizer tokenizer) throws java.io.IOException
Helper function for token handling.
- Parameters:
- tokenizer - Tokenizer
- Throws:
- java.io.IOException
-
-