A Paradigm Shift for Enterprise Data
Feature Factory is a fundamental shift in how enterprise data science teams develop curated data and accumulate data know-how as reusable assets. Feature spaces and the ability to discover features through a data-centric, programmatic approach lead to enhanced collaboration, better efficiency, increased model quality, and greater reusability, reproducibility, scalability, and transparency. Break down silos and capitalize on the wealth of information at your disposal.
Step 1
Prepare Tables as Dataframes
Connect to multiple data sources, data lakes, or data warehouses and ingest the data as Spark Dataframes in Python
- Load data from modern cloud data marts (including Amazon Redshift, Google BigQuery, Snowflake, MS Azure Synapse), traditional data warehouses (Oracle, Teradata, and MS SQL Server), and flat data sources (CSV files, Tableau Hyper files, etc.) via the Spark Dataframe API
- Automatic data type detection and data schema inference
- Connect multiple data sources together by specifying Dataframe relationships
- Define and configure temporal data relationships for automated temporal feature discovery
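As a rough illustration of what a temporal relationship means in practice, the sketch below aggregates only the source rows that fall inside a lookback window before each target row's timestamp, so no future data leaks into the feature. It is plain Python, not the dotData or Spark API; the function `window_sum`, the column names, and the 30-day window are all assumptions made for this example.

```python
from datetime import datetime, timedelta

def window_sum(targets, transactions, lookback_days=30):
    """For each target row, sum the amounts of that customer's transactions
    in the window [target_time - lookback_days, target_time)."""
    window = timedelta(days=lookback_days)
    features = []
    for t in targets:
        start = t["time"] - window
        total = sum(
            tx["amount"]
            for tx in transactions
            if tx["customer_id"] == t["customer_id"]
            and start <= tx["time"] < t["time"]  # strictly before the target time
        )
        features.append({**t, f"amount_sum_{lookback_days}d": total})
    return features

# Hypothetical sample data.
targets = [
    {"customer_id": 1, "time": datetime(2024, 3, 1)},
    {"customer_id": 2, "time": datetime(2024, 3, 1)},
]
transactions = [
    {"customer_id": 1, "time": datetime(2024, 2, 20), "amount": 50.0},
    {"customer_id": 1, "time": datetime(2024, 1, 5), "amount": 70.0},  # older than the window
    {"customer_id": 1, "time": datetime(2024, 3, 2), "amount": 99.0},  # after the target time
    {"customer_id": 2, "time": datetime(2024, 2, 28), "amount": 10.0},
]
result = window_sum(targets, transactions)
```

Here only the February 20 transaction counts for customer 1: the January one is too old and the March one happened after the prediction time.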
Step 2
Run dotData Feature Factory
Specify your target variable and the source tables as Dataframes you will use to build features. Define your search criteria and run dotData Feature Factory from your favorite Python IDE or notebook
- Resolve data quality issues like illegal values, outliers, data canonicalization, missing values, target label mapping, and more.
- Explore millions of feature hypotheses – including numeric, categorical, time-series, text, and even geospatial data.
- Resolve feature over-fitting, feature collinearity, feature drift, and feature redundancy using dotData’s proprietary algorithms.
- Add custom feature primitives and search criteria to bring your own domain features into the feature exploration space.
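To make the idea of feature collinearity and redundancy concrete, here is a minimal sketch of one common, generic approach: drop any feature whose Pearson correlation with an already kept feature exceeds a threshold. This is not dotData's proprietary algorithm; the helper names, the 0.95 threshold, and the sample columns are assumptions for illustration only.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # constant column: treat as uncorrelated
    return cov / (sx * sy)

def drop_redundant(features, threshold=0.95):
    """Walk features in order and keep one only if its absolute correlation
    with every feature kept so far is at or below the threshold."""
    kept = {}
    for name, values in features.items():
        if all(abs(pearson(values, k)) <= threshold for k in kept.values()):
            kept[name] = values
    return list(kept)

# Hypothetical candidate features.
features = {
    "spend_30d": [10.0, 20.0, 30.0, 40.0],
    "spend_30d_cents": [1000.0, 2000.0, 3000.0, 4000.0],  # same signal, different unit
    "visits_7d": [3.0, 1.0, 4.0, 1.0],
}
selected = drop_redundant(features)
```

The result depends on the order in which features are visited, so a real pipeline would typically rank candidates by relevance first.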
Step 3
Discover Features & Insights
Explore and evaluate discovered features interactively from Python
- Feature leaderboard (feature list) that surfaces features that are the most relevant and correlated with your target variable
- Understand each feature’s business value and construction via an easy-to-understand auto-generated explanation and feature blueprint diagram
- Select your preferred features based on various feature metrics like correlation, feature-wise AUC, permutation importance, feature locality, popularity, and more.
- Extract feature tables as Dataframes and visualize each feature using the built-in visualization tool or any Python visualization library you like.
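One of the metrics named above, feature-wise AUC, can be sketched in a few lines. The example below scores a single numeric feature against a binary target as the probability that a randomly chosen positive row has a higher feature value than a randomly chosen negative row, with ties counting half. The function `feature_auc` and the sample data are hypothetical and only show the idea, not how Feature Factory computes its leaderboard.

```python
def feature_auc(values, labels):
    """AUC of one numeric feature against binary labels (1 = positive)."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))

# Hypothetical target and two candidate features.
labels = [0, 0, 1, 1]
candidates = {
    "strong_feature": [0.1, 0.4, 0.35, 0.8],
    "flat_feature": [0.5, 0.5, 0.5, 0.5],
}
scores = {name: feature_auc(vals, labels) for name, vals in candidates.items()}
# Rank features from most to least discriminative.
leaderboard = sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

An AUC of 0.5 means the feature carries no ranking signal on its own. The pairwise loop is quadratic, so a rank-based formula would be preferable on large tables.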
Step 4
Extract & Iterate Feature Discovery Experiments
Iterate feature discovery experiments to derive better quality and higher-order features. Explore, optimize, and tune features interactively. Choose which features to extract for further analysis, modeling, or reporting from within Python.
- Edit feature descriptors (definitions) to customize discovered features and leverage your domain expertise.
- Natural interface to add new datasets and run new experiments. Combine features from multiple experiments with different granularity.
- All steps and feature space details are reported without any black box
- Modularized execution allows you to run your experiments from any intermediate steps and iterate them faster
Step 5
Deploy Your Features Into Production
Populate feature stores and continuously update features in production applications
- Ingest features and metadata (feature explanation, feature statistics, feature schema) into any feature store (Databricks, Snowflake, AWS SageMaker, and more) and enhance your ML models
- Automatic feature pipeline generation with fully specified query statements, so you can reuse pipelines and eliminate error-prone manual feature query implementation.
- One-command deployment of feature pipelines into dotData Ops. Continuously recalculate feature values with the newest data and monitor feature quality and drift.
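For readers new to drift monitoring, the sketch below shows one widely used generic measure, the Population Stability Index (PSI), which compares how a feature's values are distributed in a baseline sample versus a newer sample. It is not taken from dotData Ops; the function `psi`, the equal-width binning, and the sample values are assumptions for illustration.

```python
import math

def psi(expected, actual, bins=4, eps=1e-6):
    """Population Stability Index between a baseline and a new sample,
    using equal-width bins derived from the baseline's range."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for a constant baseline

    def shares(sample):
        counts = [0] * bins
        for x in sample:
            idx = int((x - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp out-of-range values into the edge bins
            counts[idx] += 1
        # Floor empty bins at eps so the logarithm below stays defined.
        return [max(c / len(sample), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

# Hypothetical feature values at training time and in production.
baseline = [1, 2, 3, 4, 5, 6, 7, 8]
shifted = [5, 6, 7, 8, 7, 8, 8, 8]
no_drift = psi(baseline, baseline)
drift = psi(baseline, shifted)
```

A PSI of zero means the two distributions fall into the bins identically; larger values indicate a bigger shift, and the alert threshold is a choice each team has to make for its own data.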
Amazon EMR
Install dotData Feature Factory on your Amazon EMR cluster to accelerate feature discovery for your data science team.
Pip Install
Quickly deploy dotData Feature Factory via pip install, even on your own personal laptop.
Product Features
How SMBC Discovered 2,000,000 new features
When SMBC, one of the world’s largest banks, wanted to get the maximum value from their feature engineering investment, they turned to dotData. Download the case study and read how they went from 2,000 features a year to over 2,000,000.
Are You Ready for Feature Factory?
Take our five-minute self-assessment to see if your data and organization could benefit from dotData’s Feature Factory revolution.