Class DefaultFeatureReportCalculator

    • Method Summary

      Modifier and Type Method Description
      FeatureReport computeDataSourceFeatures(java.lang.String project, java.lang.String sourceName, org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> sourceDf, java.lang.Boolean performCalculations)
      Computes the Features associated with a DataSource from the given Dataset.
      FeatureReport computeFeatureReport(java.lang.String project, java.lang.String sourceName, org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> sampleDf, boolean performCalculations, java.util.List<org.apache.spark.sql.Row> previewCollection, java.lang.String profileGroup)
      Computes the Features associated with a given DataSource and ProfileGroup from a sample of the data.
      FeatureReport computePreviewFeatures(java.lang.String project, java.lang.String sourceName, org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> sourceDf)
      Computes the Features from an explicit sample of the DataSource.
      FeatureReport computeProfileFeatures(java.lang.String project, java.lang.String sourceName, java.lang.String profileGroup, org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> sourceDf, java.lang.Boolean performCalculations)
      Computes the Features associated with a DataSource and specific ProfileGroup.
      org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> sample(org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> df)
      Returns a sample of size MIN_SAMPLE_SIZE taken from the dataset.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • SAMPLE_MOE

        public static final java.lang.Double SAMPLE_MOE
      • SAMPLE_Z_SCORE

        public static final java.lang.Double SAMPLE_Z_SCORE
      • SAMPLE_P

        public static final java.lang.Double SAMPLE_P
      • MIN_SAMPLE_SIZE

        public static final java.lang.Integer MIN_SAMPLE_SIZE
      • PROFILE_ID_FIELD

        public static final java.lang.String PROFILE_ID_FIELD
        See Also:
        Constant Field Values
      • SAMPLING_FRACTIONS

        public static final java.util.List<java.lang.Double> SAMPLING_FRACTIONS
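
The names SAMPLE_Z_SCORE, SAMPLE_P and SAMPLE_MOE match the inputs of the standard sample-size formula for estimating a proportion, n = z^2 * p * (1 - p) / e^2. This page lists neither the constant values nor how the class combines them, so the sketch below is only an illustration of that formula. The class name SampleSizeSketch and the values 1.96, 0.5 and 0.05 are assumptions, not taken from this class.

```java
final class SampleSizeSketch {
    // Hypothetical values; the real constants are not listed on this page.
    static final double Z_SCORE = 1.96; // z-score for roughly 95% confidence
    static final double P = 0.5;        // assumed proportion (most conservative choice)
    static final double MOE = 0.05;     // margin of error

    /** Sample size for a proportion: n = z^2 * p * (1 - p) / e^2, rounded up. */
    static long requiredSampleSize(double z, double p, double moe) {
        if (moe <= 0.0 || p < 0.0 || p > 1.0) {
            throw new IllegalArgumentException("moe must be > 0 and p must be in [0, 1]");
        }
        return (long) Math.ceil(z * z * p * (1.0 - p) / (moe * moe));
    }

    public static void main(String[] args) {
        // With the hypothetical values above, this prints 385.
        System.out.println(requiredSampleSize(Z_SCORE, P, MOE));
    }
}
```

Whether MIN_SAMPLE_SIZE is derived this way, and how SAMPLING_FRACTIONS is used, cannot be determined from this page.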
    • Constructor Detail

      • DefaultFeatureReportCalculator

        public DefaultFeatureReportCalculator()
    • Method Detail

      • computeFeatureReport

        public FeatureReport computeFeatureReport(java.lang.String project,
                                                  java.lang.String sourceName,
                                                  org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> sampleDf,
                                                  boolean performCalculations,
                                                  java.util.List<org.apache.spark.sql.Row> previewCollection,
                                                  java.lang.String profileGroup)
        Description copied from interface: FeatureReportCalculator
        Computes the Features associated with a given DataSource and ProfileGroup from a sample of the data.
        Specified by:
        computeFeatureReport in interface FeatureReportCalculator
        Parameters:
        project - project the DataSource belongs to
        sourceName - Cortex DataSource name
        sampleDf - source data
        performCalculations - whether to perform additional calculations on the source data to fill out feature information. If false, some properties will not be filled.
        previewCollection - explicit preview of the data
        profileGroup - name of the profile group; may be null
        Returns:
        FeatureReport feature information with a reference to the sample the features were inferred from
      • computeDataSourceFeatures

        public FeatureReport computeDataSourceFeatures(java.lang.String project,
                                                       java.lang.String sourceName,
                                                       org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> sourceDf,
                                                       java.lang.Boolean performCalculations)
        Description copied from interface: FeatureReportCalculator
        Computes the Features associated with a DataSource from the given Dataset. Features will not be associated to a specific ProfileGroup.
        Specified by:
        computeDataSourceFeatures in interface FeatureReportCalculator
        Parameters:
        project - project the DataSource belongs to
        sourceName - Cortex DataSource name
        sourceDf - source data
        performCalculations - whether to perform analytic calculations
        Returns:
        FeatureReport feature information with a reference to the sample the features were inferred from.
      • computePreviewFeatures

        public FeatureReport computePreviewFeatures(java.lang.String project,
                                                    java.lang.String sourceName,
                                                    org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> sourceDf)
        Description copied from interface: FeatureReportCalculator
        Computes the Features from an explicit sample of the DataSource. The provided Dataset should already be a sample of the entire dataset: implementations should run calculations on the given Dataset directly and not take a further sub-sample. Features will not be associated with a specific ProfileGroup.
        Specified by:
        computePreviewFeatures in interface FeatureReportCalculator
        Parameters:
        project - project the DataSource belongs to
        sourceName - Cortex DataSource name
        sourceDf - source data
        Returns:
        FeatureReport
      • computeProfileFeatures

        public FeatureReport computeProfileFeatures(java.lang.String project,
                                                    java.lang.String sourceName,
                                                    java.lang.String profileGroup,
                                                    org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> sourceDf,
                                                    java.lang.Boolean performCalculations)
        Description copied from interface: FeatureReportCalculator
        Computes the Features associated with a DataSource and specific ProfileGroup.
        Specified by:
        computeProfileFeatures in interface FeatureReportCalculator
        Parameters:
        project - project the DataSource belongs to
        sourceName - Cortex DataSource name
        profileGroup - profile group name
        sourceDf - source data
        performCalculations - whether to perform analytic calculations
        Returns:
        FeatureReport
      • sample

        public org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> sample(org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> df)
        Returns a sample of size MIN_SAMPLE_SIZE taken from the dataset. If the dataset contains fewer than MIN_SAMPLE_SIZE rows, the dataset is returned as is.
        Parameters:
        df - dataset to sample
        Returns:
        a dataset sample
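
The contract described for sample (return the input unchanged when it is smaller than the minimum, otherwise return a sample of the minimum size) can be sketched on a plain java.util.List so that it runs without Spark. This is an illustration of the documented behaviour only, not the class's implementation. The class name MinSampleSketch, the minSize parameter and the shuffle-based selection are assumptions; the real method operates on a Spark Dataset and may select rows differently.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

final class MinSampleSketch {
    /**
     * Returns rows unchanged when it holds at most minSize elements;
     * otherwise returns a random subset of exactly minSize elements.
     */
    static <T> List<T> sample(List<T> rows, int minSize, Random rng) {
        if (rows.size() <= minSize) {
            return rows; // small input: hand it back as is
        }
        List<T> shuffled = new ArrayList<>(rows); // copy so the caller's list is not reordered
        Collections.shuffle(shuffled, rng);
        return new ArrayList<>(shuffled.subList(0, minSize));
    }

    public static void main(String[] args) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            rows.add(i);
        }
        // minSize = 10 stands in for MIN_SAMPLE_SIZE, whose value is not listed here.
        System.out.println(sample(rows, 10, new Random(42)).size());
    }
}
```

Passing the Random in explicitly keeps the sketch reproducible in tests. Fraction-based sampling on a distributed dataset generally yields only an approximate row count, so an exact-size guarantee like the one in this sketch should not be assumed for the Spark-backed method.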