dotData Feature Factory

Data-Centric & Programmatic Feature Discovery for Data Scientists & Data Engineers

A Paradigm Shift for Enterprise Data

Feature Factory is a fundamental shift in how enterprise data science teams develop curated data and accumulate data know-how as reusable assets. Feature spaces, together with the ability to discover features through a data-centric, programmatic approach, lead to better collaboration, efficiency, and model quality, as well as greater reusability, reproducibility, scalability, and transparency. Break down silos and capitalize on the wealth of information at your disposal.

“Feature engineering is powerful and scalable, even across tens of tables with billions of rows.”

Step 1

Prepare Tables as Dataframes

Connect to multiple data sources, data lakes, or data warehouses and ingest the data as Spark Dataframes in Python.

  • Load data from modern cloud data marts (including Amazon Redshift, Google BigQuery, Snowflake, and MS Azure Synapse), traditional data warehouses (Oracle, Teradata, and MS SQL Server), and flat data sources (CSV files, Tableau Hyper files, etc.) via the Spark Dataframe API
  • Automatic data type detection and data schema inference
  • Connect multiple data sources together by specifying Dataframe relationships
  • Define and configure temporal data relationships for automated temporal feature discovery
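As a rough illustration of what a temporal data relationship means in practice, the sketch below uses plain Python rather than Spark or the dotData API, whose calls are not shown on this page. The table contents, the column names, and the `window_sum` helper are all hypothetical. It relates a target table to a transaction table through `customer_id` and aggregates only the transactions that fall in a lookback window strictly before each target row's timestamp, which is what keeps a temporal feature free of future information.

```python
from datetime import date, timedelta

# Hypothetical target table: one row per prediction point.
targets = [
    {"customer_id": 1, "as_of": date(2024, 3, 1)},
    {"customer_id": 2, "as_of": date(2024, 3, 1)},
]

# Hypothetical source table, related to the target table by customer_id.
transactions = [
    {"customer_id": 1, "ts": date(2024, 2, 20), "amount": 50.0},
    {"customer_id": 1, "ts": date(2024, 1, 5), "amount": 20.0},   # older than 30 days
    {"customer_id": 1, "ts": date(2024, 3, 2), "amount": 999.0},  # after as_of: ignored
    {"customer_id": 2, "ts": date(2024, 2, 28), "amount": 10.0},
    {"customer_id": 2, "ts": date(2024, 2, 10), "amount": 5.0},
]


def window_sum(target, rows, key, days):
    """Sum `amount` over rows that match `key` and fall in [as_of - days, as_of)."""
    start = target["as_of"] - timedelta(days=days)
    return sum(
        r["amount"]
        for r in rows
        if r[key] == target[key] and start <= r["ts"] < target["as_of"]
    )


# One temporal feature per target row: total spend in the prior 30 days.
features = [
    {**t, "amount_sum_30d": window_sum(t, transactions, "customer_id", 30)}
    for t in targets
]
```

In a real deployment the same idea would be expressed over Spark Dataframes; the half-open window is the part worth carrying over.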
Step 2

Run dotData Feature Factory

Specify your target variable and the source tables, as Dataframes, that you will use to build features. Define your search criteria and run dotData Feature Factory from your favorite Python IDE or notebook.

  • Handle data quality tasks such as illegal values, outliers, missing values, data canonicalization, target label mapping, and more.
  • Explore millions of feature hypotheses – including numeric, categorical, time-series, text, and even geospatial data.
  • Resolve feature over-fitting, feature collinearity, feature drift, and feature redundancy with dotData’s proprietary algorithms.
  • Use custom feature primitives and search criteria to add your own domain features to the feature exploration space.
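dotData's algorithms for collinearity and redundancy are proprietary and are not described on this page. To make the idea concrete, the sketch below shows a simple, widely used stand-in: a greedy correlation filter in plain Python. The function names, the 0.95 threshold, and the sample data are assumptions made for illustration only.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / (sx * sy)


def drop_redundant(features, target, threshold=0.95):
    """Greedy filter: visit features in order of decreasing |correlation with
    the target| and keep one only if its |correlation| with every feature
    already kept stays below `threshold`."""
    ranked = sorted(
        features,
        key=lambda name: abs(pearson(features[name], target)),
        reverse=True,
    )
    kept = []
    for name in ranked:
        if all(abs(pearson(features[name], features[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical candidate features for six rows.
candidates = {
    "spend_30d": [10, 20, 30, 40, 50, 60],
    "spend_30d_cents": [1000, 2000, 3000, 4000, 5000, 6000],  # same signal, rescaled
    "visits_7d": [3, 1, 4, 1, 5, 9],
}
target = [0, 0, 0, 1, 1, 1]

selected = drop_redundant(candidates, target)
```

Here only one of the two rescaled spend columns survives, alongside `visits_7d`. Pairwise correlation catches only linear redundancy, so this is a baseline rather than a substitute for a production algorithm.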
Step 3

Discover Features & Insights

Explore and evaluate discovered features interactively from Python

  • Feature leaderboard (feature list) that surfaces the features most relevant to, and most strongly correlated with, your target variable
  • Understand each feature’s business value and construction via an easy-to-understand auto-generated explanation and feature blueprint diagram
  • Select your preferred features based on various feature metrics like correlation, feature-wise AUC, permutation importance, feature locality, popularity, and more.
  • Extract feature tables as Dataframes and visualize each feature using the built-in visualization tool or any Python visualization library you like.
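One of the metrics listed above, feature-wise AUC, is easy to compute by hand. The sketch below is plain Python and not the dotData API; `feature_auc`, the feature names, and the data are hypothetical. It scores each feature as if it were a classifier on its own and sorts the results into a small leaderboard by distance from 0.5, since an AUC far below 0.5 is as informative as one far above it.

```python
def feature_auc(values, labels):
    """AUC of a single feature used directly as a score.

    Computed by comparing every positive row with every negative row
    (the Mann-Whitney formulation); ties count as half a win.
    """
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical feature table (column-oriented) and binary target.
labels = [0, 0, 0, 1, 1, 1]
feature_table = {
    "visits_7d": [3, 1, 4, 1, 5, 9],
    "noise": [5, 1, 3, 3, 1, 5],
    "spend_30d": [10, 20, 30, 40, 50, 60],
}

scores = {name: feature_auc(col, labels) for name, col in feature_table.items()}

# Leaderboard: most discriminative features first.
leaderboard = sorted(scores, key=lambda name: abs(scores[name] - 0.5), reverse=True)
```

The pairwise comparison is quadratic in the number of rows, which is fine for a sketch; a rank-based computation would be the usual choice at scale.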
Step 4

Extract & Iterate Feature Discovery Experiments

Iterate feature discovery experiments to derive better-quality and higher-order features. Explore, optimize, and tune features interactively. Choose which features to extract for further analysis, modeling, or reporting from within Python.

  • Edit feature descriptors (definitions) to customize discovered features and leverage your domain expertise.
  • Natural interface to add new datasets and run new experiments. Combine features from multiple experiments with different granularities.
  • All steps and feature space details are reported transparently, with no black box.
  • Modularized execution lets you rerun experiments from any intermediate step and iterate faster.
Step 5

Deploy Your Features Into Production

Populate feature stores and continuously update features in production applications

  • Ingest features and metadata (feature explanation, feature statistics, feature schema) into any feature store (Databricks, Snowflake, AWS SageMaker, and more) and enhance your ML models.
  • Automatic feature pipeline generation with fully specified query statements for reuse, eliminating error-prone manual implementation of feature queries.
  • One-command deployment of feature pipelines into dotData Ops. Continuously recalculate feature values with the newest data and monitor feature quality and drift.
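This page does not say how dotData Ops measures drift. As one common, generic way to monitor a feature in production, the sketch below computes the Population Stability Index (PSI) between a baseline sample and a newer sample in plain Python. The `psi` function, the bin count, and the sample data are assumptions for illustration, not the product's method.

```python
import math


def psi(expected, actual, bins=4, eps=1e-6):
    """Population Stability Index between a baseline and a new sample.

    Uses equal-width bins derived from the baseline range. Values outside
    that range are clamped into the first or last bin, and empty bins are
    floored at `eps` so the logarithm stays defined.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for a constant baseline

    def shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Hypothetical feature values at training time and after a shift in production.
baseline = list(range(100))
shifted = [v + 40 for v in baseline]

no_drift = psi(baseline, baseline)
drift = psi(baseline, shifted)
```

An identical sample yields a PSI of zero, and the shifted sample yields a clearly positive value. Thresholds such as 0.1 or 0.25 are often quoted as rules of thumb for "investigate" and "significant shift", but the right cut-off depends on the feature and the application.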

Amazon EMR

Install dotData Feature Factory in your AWS EMR instance to accelerate feature discovery for your data science team.

Pip Install

Quickly deploy dotData Feature Factory with pip install, even on your own laptop.

How SMBC Discovered 2,000,000 new features

When SMBC, one of the world’s largest banks, wanted to get the maximum value from its feature engineering investment, it turned to dotData. Download the case study and read how they went from 2,000 features a year to over 2,000,000.

Are You Ready for Feature Factory?

Take our five-minute self-assessment to see if your data and organization could benefit from dotData’s Feature Factory revolution.