Class ArffParser

  • All Implemented Interfaces:
    Parser

    @Title("ARFF File Format Parser")
    public class ArffParser
    extends java.lang.Object
    implements Parser
    Parser to load WEKA .arff files into ELKI.

    This parser is fairly ad hoc: much of its behavior is hard-coded and not yet configurable.

    TODO: Allow configuration of the vector types (double, float).

    TODO: When encountering integer columns, produce integer vectors.

    TODO: Allow optional class labels.

    Since:
    0.4.0
    Author:
    Erich Schubert
    • Nested Class Summary

      Nested Classes 
      Modifier and Type Class Description
      static class  ArffParser.Par
      Parameterization class.
    • Field Summary

      Fields 
      Modifier and Type Field Description
      static java.util.regex.Matcher ARFF_COMMENT
      Comment pattern.
      static java.util.regex.Matcher ARFF_HEADER_ATTRIBUTE
      ARFF attribute declaration marker.
      static java.util.regex.Matcher ARFF_HEADER_DATA
      ARFF data marker.
      static java.util.regex.Matcher ARFF_HEADER_RELATION
      ARFF file marker.
      static java.util.regex.Matcher ARFF_NUMERIC
      Pattern for numeric columns.
      static java.lang.String DEFAULT_ARFF_MAGIC_CLASS
      Pattern to auto-convert columns to class labels.
      static java.lang.String DEFAULT_ARFF_MAGIC_EID
      Pattern to auto-convert columns to external ids.
      (package private) NumberVector.Factory<?> denseFactory
      Factory for dense vectors.
      static java.util.regex.Matcher EMPTY
      Empty line pattern.
      (package private) java.util.ArrayList<java.lang.String> labels
      (Reused) buffer for building label lists.
      private static Logging LOG
      Logger.
      (package private) java.util.regex.Matcher magic_class
      Pattern to recognize class label columns.
      (package private) java.util.regex.Matcher magic_eid
      Pattern to recognize external ids.
    • Constructor Summary

      Constructors 
      Constructor Description
      ArffParser(java.lang.String magic_eid, java.lang.String magic_class)
      Constructor taking the patterns as regular expression strings.
      ArffParser(java.util.regex.Pattern magic_eid, java.util.regex.Pattern magic_class)
      Constructor taking compiled patterns.
    • Method Summary

      All Methods Instance Methods Concrete Methods 
      Modifier and Type Method Description
      void cleanup()
      Perform cleanup operations after parsing.
      private java.lang.Object[] loadDenseInstance(java.io.StreamTokenizer tokenizer, int[] dimsize, TypeInformation[] etyp, int outdim)
      private java.lang.Object[] loadSparseInstance(java.io.StreamTokenizer tokenizer, int[] targ, int[] dimsize, TypeInformation[] elkitypes, int metaLength)
      private java.io.StreamTokenizer makeArffTokenizer(java.io.BufferedReader br)
      Make a StreamTokenizer for the ARFF format.
      private void nextToken(java.io.StreamTokenizer tokenizer)
      Helper function for token handling.
      MultipleObjectsBundle parse(java.io.InputStream instream)
      Returns a list of the objects parsed from the specified input stream.
      private void parseAttributeStatements(java.io.BufferedReader br, java.util.ArrayList<java.lang.String> names, java.util.ArrayList<java.lang.String> types)
      Parse the "@attribute" section of the ARFF file.
      private void processColumnTypes(java.util.ArrayList<java.lang.String> names, java.util.ArrayList<java.lang.String> types, int[] targ, TypeInformation[] etyp, int[] dims)
      Process the column types (and names!) into ELKI relation style.
      private void readHeader(java.io.BufferedReader br)
      Read the dataset header part of the ARFF file, to ensure consistency.
      private void setupBundleHeaders(java.util.ArrayList<java.lang.String> names, int[] targ, TypeInformation[] etyp, int[] dimsize, MultipleObjectsBundle bundle, boolean sparse)
      Set up the headers for the object bundle.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • LOG

        private static final Logging LOG
        Logger.
      • ARFF_HEADER_RELATION

        public static final java.util.regex.Matcher ARFF_HEADER_RELATION
        ARFF file marker.
      • ARFF_HEADER_ATTRIBUTE

        public static final java.util.regex.Matcher ARFF_HEADER_ATTRIBUTE
        ARFF attribute declaration marker.
      • ARFF_HEADER_DATA

        public static final java.util.regex.Matcher ARFF_HEADER_DATA
        ARFF data marker.
      • ARFF_COMMENT

        public static final java.util.regex.Matcher ARFF_COMMENT
        Comment pattern.
      • DEFAULT_ARFF_MAGIC_EID

        public static final java.lang.String DEFAULT_ARFF_MAGIC_EID
        Pattern to auto-convert columns to external ids.
      • DEFAULT_ARFF_MAGIC_CLASS

        public static final java.lang.String DEFAULT_ARFF_MAGIC_CLASS
        Pattern to auto-convert columns to class labels.
      • ARFF_NUMERIC

        public static final java.util.regex.Matcher ARFF_NUMERIC
        Pattern for numeric columns.
      • EMPTY

        public static final java.util.regex.Matcher EMPTY
        Empty line pattern.
      • magic_eid

        java.util.regex.Matcher magic_eid
        Pattern to recognize external ids.
      • magic_class

        java.util.regex.Matcher magic_class
        Pattern to recognize class label columns.
      • labels

        java.util.ArrayList<java.lang.String> labels
        (Reused) buffer for building label lists.
      • denseFactory

        NumberVector.Factory<?> denseFactory
        Factory for dense vectors. TODO: Make parameterizable.
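The marker fields above are declared as java.util.regex.Matcher rather than Pattern, which suggests that a single matcher is rebound to each input line. The following is a minimal sketch of that idea, not the parser's own code: the class name ArffLineKindSketch, the classify method and all regular expressions are assumptions made for illustration, since this page does not show the expressions ArffParser uses.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch, not part of ELKI: classify ARFF lines with reusable matchers. */
final class ArffLineKindSketch {
  // Assumed expressions; this page does not show the ones ArffParser actually uses.
  static final Matcher RELATION = Pattern.compile("@relation\\s+.*", Pattern.CASE_INSENSITIVE).matcher("");
  static final Matcher ATTRIBUTE = Pattern.compile("@attribute\\s+.*", Pattern.CASE_INSENSITIVE).matcher("");
  static final Matcher DATA = Pattern.compile("@data\\s*", Pattern.CASE_INSENSITIVE).matcher("");
  static final Matcher COMMENT = Pattern.compile("\\s*%.*").matcher("");
  static final Matcher EMPTY = Pattern.compile("\\s*").matcher("");

  /** Returns a coarse label for one line of an ARFF file. */
  static String classify(String line) {
    // reset(line) rebinds the shared matcher to new input; matches() tests the whole line.
    if (EMPTY.reset(line).matches()) return "empty";
    if (COMMENT.reset(line).matches()) return "comment";
    if (RELATION.reset(line).matches()) return "relation";
    if (ATTRIBUTE.reset(line).matches()) return "attribute";
    if (DATA.reset(line).matches()) return "data";
    return "other";
  }
}
```

Reusing a matcher avoids allocating a new one per line, but Matcher instances are not safe for concurrent use, so a sketch like this should only be called from one thread at a time.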
    • Constructor Detail

      • ArffParser

        public ArffParser(java.util.regex.Pattern magic_eid,
                          java.util.regex.Pattern magic_class)
        Constructor.
        Parameters:
        magic_eid - Pattern used to recognize external ID columns
        magic_class - Pattern used to recognize class label columns
      • ArffParser

        public ArffParser(java.lang.String magic_eid,
                          java.lang.String magic_class)
        Constructor.
        Parameters:
        magic_eid - Regular expression used to recognize external ID columns
        magic_class - Regular expression used to recognize class label columns
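The two constructors differ only in how the patterns are supplied. A minimal sketch of how such a pair might fit together is given below, assuming that the String variant compiles its arguments case-insensitively and delegates to the Pattern variant. The class MagicColumnsSketch, the role method and the example expressions in the usage note are hypothetical; the actual default values of DEFAULT_ARFF_MAGIC_EID and DEFAULT_ARFF_MAGIC_CLASS are not shown on this page.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch, not part of ELKI: route attribute names via "magic" patterns. */
final class MagicColumnsSketch {
  final Matcher magicEid;   // recognizes external ID columns
  final Matcher magicClass; // recognizes class label columns

  MagicColumnsSketch(Pattern magicEid, Pattern magicClass) {
    // Keep reusable matchers, mirroring the Matcher-typed fields listed above.
    this.magicEid = magicEid.matcher("");
    this.magicClass = magicClass.matcher("");
  }

  MagicColumnsSketch(String magicEid, String magicClass) {
    // Assumption: case-insensitive matching of attribute names.
    this(Pattern.compile(magicEid, Pattern.CASE_INSENSITIVE),
        Pattern.compile(magicClass, Pattern.CASE_INSENSITIVE));
  }

  /** Returns which kind of column an attribute name would be treated as. */
  String role(String attributeName) {
    if (magicEid.reset(attributeName).matches()) return "external-id";
    if (magicClass.reset(attributeName).matches()) return "class-label";
    return "data";
  }
}
```

With the illustrative expressions "(External-?ID)" and "(Class|Class-?Label)", an attribute named externalid would be routed to the external ID column and one named CLASS to the class label column, while any other name stays a data column.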
    • Method Detail

      • parse

        public MultipleObjectsBundle parse(java.io.InputStream instream)
        Description copied from interface: Parser
        Returns a list of the objects parsed from the specified input stream.
        Specified by:
        parse in interface Parser
        Parameters:
        instream - the stream to parse objects from
        Returns:
        a list containing those objects parsed from the input stream
      • loadSparseInstance

        private java.lang.Object[] loadSparseInstance(java.io.StreamTokenizer tokenizer,
                                                      int[] targ,
                                                      int[] dimsize,
                                                      TypeInformation[] elkitypes,
                                                      int metaLength)
                                               throws java.io.IOException
        Throws:
        java.io.IOException
      • loadDenseInstance

        private java.lang.Object[] loadDenseInstance(java.io.StreamTokenizer tokenizer,
                                                     int[] dimsize,
                                                     TypeInformation[] etyp,
                                                     int outdim)
                                              throws java.io.IOException
        Throws:
        java.io.IOException
      • makeArffTokenizer

        private java.io.StreamTokenizer makeArffTokenizer(java.io.BufferedReader br)
        Make a StreamTokenizer for the ARFF format.
        Parameters:
        br - Buffered reader
        Returns:
        Tokenizer
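How the tokenizer is configured is not documented here. The sketch below shows one plausible java.io.StreamTokenizer setup for the data section of an ARFF file, assuming commas separate values, % starts a comment, values may be quoted, and braces delimit sparse rows. The class ArffTokenizerSketch and its tokens helper are hypothetical and exist only to make the configuration observable.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch, not part of ELKI: a StreamTokenizer set up for ARFF data rows. */
final class ArffTokenizerSketch {
  static StreamTokenizer makeTokenizer(BufferedReader br) {
    StreamTokenizer t = new StreamTokenizer(br);
    t.resetSyntax();             // also disables the built-in number parsing
    t.whitespaceChars(0, ' ');   // control characters and space separate tokens
    t.wordChars(33, 255);        // everything else is part of a word by default
    t.whitespaceChars(',', ','); // commas separate values
    t.commentChar('%');          // % starts a comment that runs to end of line
    t.quoteChar('"');
    t.quoteChar('\'');
    t.ordinaryChar('{');         // braces delimit sparse instances
    t.ordinaryChar('}');
    t.eolIsSignificant(true);    // each instance ends at a line break
    return t;
  }

  /** Collects all tokens of the given text; line ends are reported as "<EOL>". */
  static List<String> tokens(String text) {
    StreamTokenizer t = makeTokenizer(new BufferedReader(new StringReader(text)));
    List<String> out = new ArrayList<>();
    try {
      while (t.nextToken() != StreamTokenizer.TT_EOF) {
        if (t.ttype == StreamTokenizer.TT_EOL) {
          out.add("<EOL>");
        } else if (t.ttype == StreamTokenizer.TT_WORD || t.ttype == '"' || t.ttype == '\'') {
          out.add(t.sval); // words and quoted strings carry their text in sval
        } else {
          out.add(String.valueOf((char) t.ttype)); // ordinary characters such as { and }
        }
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return out;
  }
}
```

Because number parsing is switched off, numeric values arrive as words and would still have to be converted, for example with Double.parseDouble, by the code that loads each instance.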
      • setupBundleHeaders

        private void setupBundleHeaders(java.util.ArrayList<java.lang.String> names,
                                        int[] targ,
                                        TypeInformation[] etyp,
                                        int[] dimsize,
                                        MultipleObjectsBundle bundle,
                                        boolean sparse)
        Set up the headers for the object bundle.
        Parameters:
        names - Attribute names
        targ - Target columns
        etyp - ELKI type information
        dimsize - Number of dimensions in the individual types
        bundle - Output bundle
        sparse - Flag to create sparse vectors
      • readHeader

        private void readHeader(java.io.BufferedReader br)
                         throws java.io.IOException
        Read the dataset header part of the ARFF file, to ensure consistency.
        Parameters:
        br - Buffered reader
        Throws:
        java.io.IOException
      • parseAttributeStatements

        private void parseAttributeStatements(java.io.BufferedReader br,
                                              java.util.ArrayList<java.lang.String> names,
                                              java.util.ArrayList<java.lang.String> types)
                                       throws java.io.IOException
        Parse the "@attribute" section of the ARFF file.
        Parameters:
        br - Input
        names - List (to fill) of attribute names
        types - List (to fill) of attribute types
        Throws:
        java.io.IOException
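As a rough illustration of what filling the names and types lists could look like, the sketch below reads lines until the @data marker and splits each @attribute statement into a name and a type string. It is a simplified stand-in under stated assumptions, not the method's implementation: the class ArffAttributeSketch, its regular expression and the parse convenience wrapper are all hypothetical, and the @relation line is assumed to have been consumed already.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch, not part of ELKI: collect attribute names and types up to @data. */
final class ArffAttributeSketch {
  // Assumed syntax: "@attribute <name> <type>", where the name may be quoted.
  static final Pattern ATTRIBUTE = Pattern.compile(
      "@attribute\\s+('[^']*'|\"[^\"]*\"|\\S+)\\s+(.+)", Pattern.CASE_INSENSITIVE);

  static void parseAttributes(BufferedReader br, List<String> names, List<String> types)
      throws IOException {
    String line;
    while ((line = br.readLine()) != null) {
      String s = line.trim();
      if (s.isEmpty() || s.startsWith("%")) continue; // skip blank lines and comments
      if (s.equalsIgnoreCase("@data")) return;        // the header ends here
      Matcher m = ATTRIBUTE.matcher(s);
      if (!m.matches()) throw new IOException("Unrecognized line in ARFF header: " + line);
      names.add(unquote(m.group(1)));
      types.add(m.group(2));
    }
    throw new IOException("No @data section found.");
  }

  /** Strips one pair of matching single or double quotes, if present. */
  static String unquote(String s) {
    if (s.length() >= 2) {
      char c = s.charAt(0);
      if ((c == '\'' || c == '"') && s.charAt(s.length() - 1) == c) {
        return s.substring(1, s.length() - 1);
      }
    }
    return s;
  }

  /** Convenience wrapper for examples: returns [names, types] for the given text. */
  static List<List<String>> parse(String text) {
    List<String> names = new ArrayList<>();
    List<String> types = new ArrayList<>();
    try {
      parseAttributes(new BufferedReader(new StringReader(text)), names, types);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return Arrays.asList(names, types);
  }
}
```

The type strings are kept verbatim, so nominal declarations such as {a,b} pass through unchanged and are left for a later step to interpret.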
      • processColumnTypes

        private void processColumnTypes(java.util.ArrayList<java.lang.String> names,
                                        java.util.ArrayList<java.lang.String> types,
                                        int[] targ,
                                        TypeInformation[] etyp,
                                        int[] dims)
        Process the column types (and names!) into ELKI relation style. Note that this will, for example, merge successive numerical columns into a single vector.
        Parameters:
        names - Attribute names
        types - Attribute types
        targ - Target dimension mapping (ARFF to ELKI); output parameter filled by this method
        etyp - ELKI type information; output parameter filled by this method
        dims - Number of successive dimensions; output parameter filled by this method
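The merging of successive numerical columns can be sketched as follows. This is a simplified illustration under stated assumptions, not the method itself: it ignores attribute names, the magic patterns and the ELKI type information, and it treats only the ARFF types numeric, real and integer as numerical. The class ColumnMergeSketch and its methods are hypothetical.

```java
import java.util.List;
import java.util.Locale;

/** Hypothetical sketch, not part of ELKI: merge runs of numeric ARFF columns. */
final class ColumnMergeSketch {
  /**
   * Fills targ (ARFF column index to output column index) and dims (number of ARFF
   * columns merged into each output column). Both arrays must be zero-initialized
   * and at least types.size() long. Returns the number of output columns.
   */
  static int merge(List<String> types, int[] targ, int[] dims) {
    int out = -1;
    boolean prevNumeric = false;
    for (int i = 0; i < types.size(); i++) {
      boolean numeric = isNumeric(types.get(i));
      // A numeric column directly after another numeric column joins the same vector;
      // every other column opens a new output column.
      if (!(numeric && prevNumeric)) {
        out++;
      }
      targ[i] = out;
      dims[out]++;
      prevNumeric = numeric;
    }
    return out + 1;
  }

  static boolean isNumeric(String type) {
    String t = type.trim().toLowerCase(Locale.ROOT);
    return t.equals("numeric") || t.equals("real") || t.equals("integer");
  }
}
```

For the column types numeric, REAL, string, integer, {a,b}, this sketch maps the first two columns to one two-dimensional output column and gives each of the remaining three columns an output column of its own.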
      • nextToken

        private void nextToken(java.io.StreamTokenizer tokenizer)
                        throws java.io.IOException
        Helper function for token handling.
        Parameters:
        tokenizer - Tokenizer
        Throws:
        java.io.IOException
      • cleanup

        public void cleanup()
        Description copied from interface: Parser
        Perform cleanup operations after parsing.
        Specified by:
        cleanup in interface Parser