dotData Py:

Your Feature Factory for Data Scientists & Data Engineers. Discover 100X more Features and Build Better Models

dotData Py is an enterprise-grade feature discovery platform that helps data science and data engineering teams speed up feature engineering and build production-quality feature pipelines automatically. Its AI algorithm hypothesizes, explores, builds, and validates features automatically, and the resulting AI features augment your feature space so you can build explainable models.

“Feature engineering is powerful and scalable, even across tens of tables with billions of rows.”

Step 1

Prepare Tables as DataFrames

Connect to multiple data sources, data lakes, or data warehouses and ingest the data as Spark DataFrames in Python.

  • Load data from modern cloud data marts (including Amazon Redshift, Google BigQuery, Snowflake, MS Azure Synapse), traditional data warehouses (Oracle, Teradata, and MS SQL Server), and flat data sources (CSV files, Tableau Hyper files, etc.) via the Spark DataFrame API
  • Automatic data type detection and schema inference when importing tables as DataFrames
  • Connect multiple data sources together by specifying DataFrame relationships
  • Define and configure temporal data relationships for automated temporal feature discovery
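
To make the idea of a temporal data relationship concrete, here is a minimal sketch in plain Python. It is not dotData Py's API: the function `windowed_sum`, the column names, and the toy rows are all hypothetical. It shows the kind of rule a temporal relationship encodes, namely that a feature for a target row may only aggregate source rows from a time window that ends before the target's timestamp.

```python
from datetime import datetime, timedelta


def windowed_sum(target_rows, source_rows, key, window_days):
    """Sum source amounts per target row over [target_time - window, target_time).

    Source rows at or after the target time are excluded, so the feature
    never uses information from the future. (Hypothetical helper.)
    """
    window = timedelta(days=window_days)
    features = []
    for target in target_rows:
        start = target["time"] - window
        total = sum(
            src["amount"]
            for src in source_rows
            if src[key] == target[key] and start <= src["time"] < target["time"]
        )
        features.append(total)
    return features


# Hypothetical toy data: one target row per customer, plus a transaction table.
targets = [
    {"customer_id": 1, "time": datetime(2024, 3, 1)},
    {"customer_id": 2, "time": datetime(2024, 3, 1)},
]
transactions = [
    {"customer_id": 1, "time": datetime(2024, 2, 20), "amount": 50.0},
    {"customer_id": 1, "time": datetime(2024, 1, 5), "amount": 30.0},  # older than 30 days
    {"customer_id": 1, "time": datetime(2024, 3, 2), "amount": 99.0},  # after the target time
    {"customer_id": 2, "time": datetime(2024, 2, 28), "amount": 10.0},
]

spend_30d = windowed_sum(targets, transactions, "customer_id", 30)
print(spend_30d)  # [50.0, 10.0]
```

Only the transaction inside each customer's 30-day look-back window contributes; the older row and the row dated after the target time are both ignored.
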
Step 2

Run dotData Py

Specify your target variable and the source tables, as DataFrames, that you will use to build features. Define your search criteria and run dotData Py from your favorite Python IDE or notebook.

  • Handle data quality tasks such as illegal values, outliers, data canonicalization, missing values, target label mapping, and more.
  • Explore millions of feature hypotheses – including numeric, categorical, time-series, text, and even geospatial data.
  • Resolve feature over-fitting, feature collinearity, feature drift, and feature redundancy based on dotData’s proprietary algorithms.
  • Add your own domain features to the feature exploration space with custom feature primitives and search criteria.
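
dotData's algorithms for collinearity and redundancy are proprietary and are not described here. As a generic illustration of the underlying idea only, the sketch below uses a simple greedy Pearson-correlation filter, a swapped-in textbook technique. The names `pearson` and `prune_collinear`, the 0.95 threshold, and the toy features are all assumptions.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def prune_collinear(features, target, threshold=0.95):
    """Greedy filter: visit features from most to least correlated with the
    target, keeping one only if its |correlation| with every feature already
    kept is at most `threshold`. (Hypothetical helper, not dotData's method.)
    """
    ranked = sorted(
        features,
        key=lambda name: abs(pearson(features[name], target)),
        reverse=True,
    )
    kept = []
    for name in ranked:
        if all(abs(pearson(features[name], features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical toy features; the second is an exact rescaling of the first.
features = {
    "spend_30d": [10.0, 20.0, 30.0, 40.0, 50.0],
    "spend_30d_cents": [1000.0, 2000.0, 3000.0, 4000.0, 5000.0],
    "visits_7d": [3.0, 1.0, 4.0, 1.0, 5.0],
}
target = [0.0, 0.0, 1.0, 1.0, 1.0]

kept = prune_collinear(features, target)
print(kept)
```

Of the two spend features, which carry the same information, only one survives; `visits_7d` is kept because it is only weakly correlated with the survivor.
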
Step 3

Discover Features & Insights

Explore and evaluate discovered features interactively from Python.

  • Feature leaderboard (feature list) that surfaces features that are the most relevant and correlated with your target variable
  • Understand each feature’s business value and construction via an easy-to-understand auto-generated explanation and feature blueprint diagram
  • Select your preferred features based on various feature metrics like correlation, feature-wise AUC, permutation importance, feature locality, popularity, and more.
  • Extract feature tables as DataFrames and visualize each feature using the built-in visualization tool or any Python visualization library you like.
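
One of the metrics listed above, feature-wise AUC, can be sketched without any library. The code below is an illustrative assumption rather than dotData Py's implementation: `feature_auc` and `leaderboard` are hypothetical names, and the data is made up. It scores each feature on its own by the probability that a randomly chosen positive row has a higher feature value than a randomly chosen negative row, counting ties as half.

```python
def feature_auc(values, labels):
    """AUC obtained by ranking rows on a single feature (ties count as 0.5)."""
    positives = [v for v, y in zip(values, labels) if y == 1]
    negatives = [v for v, y in zip(values, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("labels must contain both classes")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def leaderboard(features, labels):
    """Sort features by how far their AUC is from 0.5 (no signal)."""
    scored = [(name, feature_auc(vals, labels)) for name, vals in features.items()]
    return sorted(scored, key=lambda item: abs(item[1] - 0.5), reverse=True)


# Hypothetical toy data with a binary target.
labels = [0, 0, 1, 1, 1]
features = {
    "constant": [1.0, 1.0, 1.0, 1.0, 1.0],
    "visits_7d": [3.0, 1.0, 4.0, 1.0, 5.0],
    "spend_30d": [10.0, 20.0, 30.0, 40.0, 50.0],
}

board = leaderboard(features, labels)
for name, auc in board:
    print(f"{name}: {auc:.2f}")
```

Sorting by distance from 0.5 means a feature that ranks rows in reverse (AUC near 0) is treated as just as informative as one with AUC near 1. The pairwise loop is quadratic in the number of rows, so it suits small samples only.
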
Step 4

Extract & Iterate Feature Discovery Experiments

Iterate feature discovery experiments to derive better quality and higher-order features. Explore, optimize, and tune features interactively. Choose which features to extract for further analysis, modeling, or reporting from within Python.

  • Edit feature descriptors (definitions) to customize discovered features and leverage your domain expertise.
  • Natural interface to add new datasets and run new experiments. Combine features from multiple experiments with different granularity.
  • Every step and all feature space details are reported transparently, with no black box
  • Modularized execution lets you rerun experiments from any intermediate step and iterate faster
Step 5

Deploy Your Features Into Production

Populate feature stores and continuously update features in production applications

  • Ingest features and metadata (feature explanation, feature statistics, feature schema) into any feature store (Databricks, Snowflake, AWS SageMaker, and more) and enhance your ML models
  • Automatically generate feature pipelines with fully specified query statements for reuse, eliminating error-prone manual feature query implementation.
  • One-command deployment of feature pipelines into dotData Ops. Continuously recalculate feature values with the newest data and monitor feature quality and drift.
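
How dotData Ops measures drift is not specified on this page. As a rough illustration of what monitoring a feature for drift can involve, the sketch below computes the Population Stability Index (PSI), a widely used generic drift score, between a baseline sample and a newer sample of one feature. The function name `psi`, the equal-width binning, the bin count, and the data are all assumptions.

```python
import math


def psi(baseline, current, bins=4, eps=1e-6):
    """Population Stability Index of `current` relative to `baseline`.

    Uses equal-width bins over the baseline's range; values outside that
    range fall into the first or last bin. Empty bins are floored at `eps`
    so the logarithm is defined. (Hypothetical helper.)
    """
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for a constant baseline

    def shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)
            counts[idx] += 1
        return [max(c / len(values), eps) for c in counts]

    expected = shares(baseline)
    actual = shares(current)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))


# Hypothetical feature values at training time and after a shift in production.
baseline = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
shifted = [5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0]

print(psi(baseline, baseline))  # 0.0
print(psi(baseline, shifted))
```

A PSI of zero means the binned distributions match exactly; a commonly cited rule of thumb treats values above roughly 0.25 as a substantial shift, though the right threshold depends on the feature and the bin count.
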

Amazon EMR

Install dotData Py in your AWS EMR instance to accelerate feature discovery for your data science team.

Pip Install

Quickly deploy dotData Py with pip install – even on your own laptop.

Justin Shoolery - dotData Client

How SMBC Discovered 2,000,000 new features

When SMBC, one of the world’s largest banks, wanted to get the maximum value from its feature engineering investment, it turned to dotData. Download the case study and read how they went from 2,000 features a year to over 2,000,000.

Are You Ready for dotData Py?

Take our five-minute self-assessment to see if your data and organization could benefit from dotData’s Feature Factory revolution.
