Class DefaultFeatureReportCalculator
- java.lang.Object
  - com.c12e.cortex.profiles.featurecatalog.DefaultFeatureReportCalculator
- All Implemented Interfaces: FeatureReportCalculator
public class DefaultFeatureReportCalculator extends java.lang.Object implements FeatureReportCalculator
-
-
Field Summary
Fields (Modifier and Type, Field):
- static java.lang.Integer MIN_SAMPLE_SIZE
- static java.lang.String PROFILE_ID_FIELD
- static java.lang.Double SAMPLE_MOE
- static java.lang.Double SAMPLE_P
- static java.lang.Double SAMPLE_Z_SCORE
- static java.util.List<java.lang.Double> SAMPLING_FRACTIONS
- static java.lang.String TIMESTAMP_FIELD
-
Constructor Summary
Constructors:
- DefaultFeatureReportCalculator()
-
Method Summary
Methods (Modifier and Type, Method, Description):
- FeatureReport computeDataSourceFeatures(java.lang.String project, java.lang.String sourceName, org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> sourceDf, java.lang.Boolean performCalculations)
  Computes the Features associated with a DataSource from the given Dataset.
- FeatureReport computeFeatureReport(java.lang.String project, java.lang.String sourceName, org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> sampleDf, boolean performCalculations, java.util.List<org.apache.spark.sql.Row> previewCollection, java.lang.String profileGroup)
  Computes the Features associated with a given DataSource and ProfileGroup from a sample of the data.
- FeatureReport computePreviewFeatures(java.lang.String project, java.lang.String sourceName, org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> sourceDf)
  Computes the Features from an explicit sample of the DataSource.
- FeatureReport computeProfileFeatures(java.lang.String project, java.lang.String sourceName, java.lang.String profileGroup, org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> sourceDf, java.lang.Boolean performCalculations)
  Computes the Features associated with a DataSource and specific ProfileGroup.
- org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> sample(org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> df)
  Returns a sample taken from the dataset of size MIN_SAMPLE_SIZE.
-
-
-
Field Detail
-
SAMPLE_MOE
public static final java.lang.Double SAMPLE_MOE
-
SAMPLE_Z_SCORE
public static final java.lang.Double SAMPLE_Z_SCORE
-
SAMPLE_P
public static final java.lang.Double SAMPLE_P
-
MIN_SAMPLE_SIZE
public static final java.lang.Integer MIN_SAMPLE_SIZE
-
PROFILE_ID_FIELD
public static final java.lang.String PROFILE_ID_FIELD
- See Also:
- Constant Field Values
-
TIMESTAMP_FIELD
public static final java.lang.String TIMESTAMP_FIELD
- See Also:
- Constant Field Values
-
SAMPLING_FRACTIONS
public static final java.util.List<java.lang.Double> SAMPLING_FRACTIONS
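The names SAMPLE_Z_SCORE, SAMPLE_P, and SAMPLE_MOE match the inputs of the standard sample-size formula for estimating a proportion, n = z^2 * p * (1 - p) / e^2. This page does not list the constants' values or state how MIN_SAMPLE_SIZE is obtained, so the following is only a sketch of that formula under assumed values. The class name, method name, and the numbers 1.96, 0.5, and 0.05 are hypothetical and are not taken from this class.

```java
// Sketch only: illustrates the standard sample-size formula for a proportion.
// The constant values below are ASSUMED for illustration; they are not the
// values defined by DefaultFeatureReportCalculator.
final class SampleSizeSketch {
    static final double Z_SCORE = 1.96; // assumed: z-score for ~95% confidence
    static final double P = 0.5;        // assumed: most conservative proportion
    static final double MOE = 0.05;     // assumed: 5% margin of error

    /** Returns ceil(z^2 * p * (1 - p) / moe^2). */
    static int requiredSampleSize(double z, double p, double moe) {
        if (moe <= 0.0 || p < 0.0 || p > 1.0) {
            throw new IllegalArgumentException("moe must be > 0 and p must be in [0, 1]");
        }
        return (int) Math.ceil(z * z * p * (1.0 - p) / (moe * moe));
    }

    public static void main(String[] args) {
        // 1.96^2 * 0.25 / 0.0025 = 384.16, rounded up to 385
        System.out.println(requiredSampleSize(Z_SCORE, P, MOE));
    }
}
```

With these assumed inputs the formula gives 385 rows. Whether the class derives MIN_SAMPLE_SIZE this way is not confirmed by this page.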
-
-
Method Detail
-
computeFeatureReport
public FeatureReport computeFeatureReport(java.lang.String project, java.lang.String sourceName, org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> sampleDf, boolean performCalculations, java.util.List<org.apache.spark.sql.Row> previewCollection, java.lang.String profileGroup)
Description copied from interface: FeatureReportCalculator
Computes the Features associated with a given DataSource and ProfileGroup from a sample of the data.
- Specified by: computeFeatureReport in interface FeatureReportCalculator
- Parameters:
  project - project the DataSource belongs to
  sourceName - Cortex DataSource name
  sampleDf - source data
  performCalculations - whether additional calculations should be performed on the source data to fill out feature information. If false, not all properties will be filled
  previewCollection - explicit preview of the data
  profileGroup - name of the profile group, may be null
- Returns: FeatureReport - feature information with a reference to the sample the features were inferred from
-
computeDataSourceFeatures
public FeatureReport computeDataSourceFeatures(java.lang.String project, java.lang.String sourceName, org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> sourceDf, java.lang.Boolean performCalculations)
Description copied from interface: FeatureReportCalculator
Computes the Features associated with a DataSource from the given Dataset. Features will not be associated with a specific ProfileGroup.
- Specified by: computeDataSourceFeatures in interface FeatureReportCalculator
- Parameters:
  project - project the DataSource belongs to
  sourceName - Cortex DataSource name
  sourceDf - source data
  performCalculations - perform analytic calculations
- Returns: FeatureReport - feature information with a reference to the sample the features were inferred from.
-
computePreviewFeatures
public FeatureReport computePreviewFeatures(java.lang.String project, java.lang.String sourceName, org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> sourceDf)
Description copied from interface: FeatureReportCalculator
Computes the Features from an explicit sample of the DataSource. The provided Dataset should already be a sample of the entire dataset: implementations should run calculations on the given dataset directly and not on a sub-sample of it. Features will not be associated with a specific ProfileGroup.
- Specified by: computePreviewFeatures in interface FeatureReportCalculator
- Parameters:
  project - project the DataSource belongs to
  sourceName - Cortex DataSource name
  sourceDf - source data
- Returns: FeatureReport
-
computeProfileFeatures
public FeatureReport computeProfileFeatures(java.lang.String project, java.lang.String sourceName, java.lang.String profileGroup, org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> sourceDf, java.lang.Boolean performCalculations)
Description copied from interface: FeatureReportCalculator
Computes the Features associated with a DataSource and specific ProfileGroup.
- Specified by: computeProfileFeatures in interface FeatureReportCalculator
- Parameters:
  project - project the DataSource belongs to
  sourceName - DataSource name
  profileGroup - profile group name
  sourceDf - source data
  performCalculations - perform analytic calculations
- Returns: FeatureReport
-
sample
public org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> sample(org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> df)
Returns a sample taken from the dataset of size MIN_SAMPLE_SIZE. If the dataset size is smaller than MIN_SAMPLE_SIZE, then the dataset will be returned as is.
- Parameters:
  df - dataset to sample
- Returns: a dataset sample
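The contract described for sample (return the input unchanged when it is smaller than the minimum size, otherwise return a sample of that size) can be sketched with plain Java collections. This is not the Spark-based implementation, which this page does not show. The class SampleSketch, its parameters minSampleSize and seed, and the shuffle-based selection are all hypothetical stand-ins chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Sketch only: mirrors the documented contract of sample(...) on an in-memory
// list. The real method operates on a Spark Dataset<Row>; its sampling
// strategy is not described on this page and is NOT reproduced here.
final class SampleSketch {
    /**
     * Returns rows unchanged when there are fewer than minSampleSize of them;
     * otherwise returns minSampleSize rows picked uniformly without replacement.
     */
    static <T> List<T> sample(List<T> rows, int minSampleSize, long seed) {
        if (rows.size() < minSampleSize) {
            return rows; // too small to sample: hand back the input as is
        }
        List<T> shuffled = new ArrayList<>(rows); // copy so the input is not reordered
        Collections.shuffle(shuffled, new Random(seed));
        return new ArrayList<>(shuffled.subList(0, minSampleSize));
    }

    public static void main(String[] args) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            rows.add(i);
        }
        System.out.println(sample(rows, 10, 42L).size());             // 10 rows sampled
        System.out.println(sample(rows.subList(0, 3), 10, 42L).size()); // 3 rows, returned as is
    }
}
```

The fixed seed is only there to make the sketch repeatable; the documented method takes no seed argument.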
-
-