
dotData Insight: Melding the Power of AI-Driven Insight Discovery & Generative AI

By Walter Paliska

Introduction Today, we announced the launch of dotData Insight, a new platform that leverages an AI-driven business signal discovery engine augmented with Generative AI - deriving business hypotheses beyond uncovered signals. dotData Insight directly explores millions of possible data signals from convoluted enterprise data, frees data analysts, business intelligence professionals, and power users from weeks or months of repetitive trial-and-error effort, and delivers valuable and unseen insights. https://vimeo.com/891270844/da759762de?share=copy Move From a Single-Threaded Analytics Process to Multi-Threaded Signal Discovery Business Intelligence (BI) systems have been around for decades, yet according to VentureBeat, 90% of executives still struggle to use data to make decisions. The problem is inherent in how BI systems were developed - and in how they were intended to be used. BI systems are ideal for providing business users with information on “what” happened to the business. Whether in scorecard systems, dashboards, or static reports, they provide a static snapshot…

Boost Time-Series Modeling with Effective Temporal Feature Engineering – Part 3

By Sharada Narayanan

Introduction Time-series modeling is a statistical technique used to analyze and predict the patterns and behavior of data that change over time. Part 1 of this blog series explained standard time-series models such as AR models, ARIMA, LSTM, and Prophet and discussed their advantages and disadvantages. Part 2, on the other hand, introduced an alternative approach - feature engineering from temporal datasets - which provides numerous benefits over standard time-series modeling. In Part 3, the last of this blog series, we will examine ARIMA and Prophet models, compare them with an alternative feature engineering approach, and demonstrate the advantages of the feature engineering approaches. Dataset In this blog, we used the dataset from the Prophet quick start demo guide. The data is a time series based on the log of daily page views for the Wikipedia page for Peyton Manning. The data is a periodic time series spanning eight…

Practical Guide for Feature Engineering of Time Series Data

By Joshua Gordon

Introduction Time series modeling is one of the most impactful machine learning use cases with broad applications across industries. Traditional time series modeling techniques, such as ARIMA, often automatically incorporate the time component of the data by using lagged values of the target variable as model inputs. While these techniques provide interpretable coefficients that aid in understanding the contribution of each variable to the forecast, they can be sensitive to outliers, missing data, and changes in the underlying data-generating process over time. As a result, their accuracy may be compromised.  On the other hand, machine learning combined with feature engineering offers a more robust approach to time series modeling. This approach can handle complex, non-linear relationships and is well-suited for large relational datasets with more complex relationships and intricate interdependencies. Feature engineering plays a crucial role in time series modeling, as it involves selecting and transforming raw data into meaningful…
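To make the idea of lagged values as model inputs concrete, here is a minimal sketch of turning a single series into a feature table. It is an illustration only, not the method from the post: the function name `make_lag_features`, the column names, and the sample values are all hypothetical.

```python
from typing import Dict, List


def make_lag_features(series: List[float], lags: List[int], window: int) -> List[Dict[str, float]]:
    """Build one feature row per time step, using only values from before that step."""
    rows = []
    # Skip the first steps, where not enough history exists for every feature.
    start = max(max(lags), window)
    for t in range(start, len(series)):
        # Lag features: the value observed k steps before time t.
        row = {f"lag_{k}": series[t - k] for k in lags}
        # Rolling mean over the `window` values strictly before time t.
        past = series[t - window:t]
        row[f"rolling_mean_{window}"] = sum(past) / window
        # The value to predict at time t.
        row["target"] = series[t]
        rows.append(row)
    return rows


# Hypothetical daily values, made up for this example.
daily_views = [10.0, 12.0, 11.0, 13.0, 15.0, 14.0, 16.0]
features = make_lag_features(daily_views, lags=[1, 2], window=3)
```

Each row can then be fed to any tabular learner. Every feature is computed from values strictly before the target's time step, so the table does not peek at the value it is meant to predict.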

Maintain Model Robustness: Strategies to Combat Feature Drift in Machine Learning

By Yukitaka Kusumura

Introduction Building robust and reliable models in machine learning is of utmost importance for assured decision-making and resilient predictions. While accuracy remains a desirable trait, the stability and durability of these models take precedence in ensuring long-term efficacy. A crucial aspect that bolsters model reliability is the stability of the features, where consistency over time can drastically improve the model's robustness. In this article, we will explore the concept of feature drift and analyze methods to maintain feature stability, thereby enhancing the model's overall robustness and resistance to fluctuating conditions, even at the cost of marginal reductions in accuracy. Challenges of Feature Drift The evolution of features, more commonly referred to as feature drift, is typically categorized into two distinct types: data drift and concept drift. Data Drift Machine learning models are trained to minimize overall input errors, causing the models to be more inclined toward fitting the majority of…

The Hard Truth about Manual Feature Engineering

By Aaron Cheng

The past decade has seen rapid adoption of Artificial Intelligence (AI) and Machine Learning (ML) across different industries and for many successful use cases. Beyond AI's “cute” factor that can automatically differentiate between a dog and a cat in photos, these new technologies are already deployed in real-world applications to generate impactful business outcomes. AI and Machine Learning predict lending risk, provide product recommendations, analyze customer churn behaviors, and manage inventories. While technology giants like Google and Amazon continue harvesting the great benefits of AI, many traditional businesses are still struggling to adopt AI. One key challenge businesses face is that data is often not ready for AI/ML, and preparing it would take too much time and effort, something they cannot afford.  ML input data must be a single flat table, but real-world data is often distributed across different tables and multiple databases. Combining data from disparate tables, generating new…

Feature Factory: A Paradigm Shift for Enterprise Data

By Ryohei Fujimaki, PhD.

The world of enterprise data applications such as Business Intelligence (BI), Machine Learning (ML), and Artificial Intelligence (AI) is becoming increasingly critical for organizations of all sizes. As these technologies advance, businesses face the challenge of choosing the best tools to apply in different situations. The landscape of BI, ML, and AI tools has become commoditized and fragmented, leading to the development of various tools for specific purposes. Despite this, data application development in enterprises often remains siloed, and the full potential of data is not extracted. Enter Feature Factory – a new paradigm that aims to revolutionize open and scalable enterprise BI, ML, and AI development. A Commoditized and Fragmented Landscape The Enterprise ML and AI tools market, as well as the BI tools market, is highly commoditized, with a wide variety of tools available for use, each with its strengths and weaknesses. The tools range from open-source libraries…

Preventing Data Leakage in Feature Engineering: Strategies and Solutions

By Yukitaka Kusumura

Data leakage is a widespread and critical issue that can undermine the reliability of features. In this blog, we will delve into the concept of data leakage, examine how it can transpire during feature engineering, and present various strategies to prevent or mitigate its consequences. Understanding Data Leakage  Data leakage occurs when the feature engineering process unintentionally uses information from the target variable or the validation/test set. This can lead to overly optimistic performance metrics, as the feature appears to perform exceptionally well on the test set. However, when the feature is implemented in real-world applications, its performance is often significantly worse than anticipated. Data Leakage in Feature Engineering Feature engineering is the process of creating new features or transforming existing ones, and it can frequently be a source of data leakage if not managed carefully. Statistical Value Leakage Statistical value leakage arises when you create or transform features before…
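As a small illustration of the definition above (feature engineering that uses information from the validation/test set), the sketch below standardizes a feature with statistics fitted on the training split only. This is one common safeguard, not necessarily the strategy the post goes on to describe; `fit_scaler`, `transform`, and the sample values are hypothetical.

```python
import statistics
from typing import List, Tuple


def fit_scaler(train_values: List[float]) -> Tuple[float, float]:
    """Compute the mean and standard deviation from the training split only."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values)
    if std == 0:
        std = 1.0  # Avoid division by zero for a constant feature.
    return mean, std


def transform(values: List[float], mean: float, std: float) -> List[float]:
    """Standardize values with statistics that were fitted elsewhere."""
    return [(v - mean) / std for v in values]


# Made-up feature values; the last one belongs to the test split.
values = [1.0, 2.0, 3.0, 4.0, 100.0]
train, test = values[:4], values[4:]

# Leakage-safe: the statistics never see the test split.
mean, std = fit_scaler(train)
train_scaled = transform(train, mean, std)
test_scaled = transform(test, mean, std)

# Leaky variant, shown for contrast: the test value shifts the statistics.
leaky_mean = statistics.fmean(values)
```

Here the leaky mean is pulled far away from the training mean by the single test value, which is the kind of information flow the definition warns about.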

Feature Engineering from Geo-Spatial Data

By Yukitaka Kusumura

Geospatial data, combining geographic and spatial information, is becoming increasingly important in various industries, from transportation and logistics to urban planning and environmental monitoring. However, working with geospatial data can be challenging, as it often requires specialized feature engineering and analysis techniques. In this blog, we will explore some of the key concepts and techniques involved in feature engineering for geospatial tables. Geospatial Tables Geospatial tables contain geographic data that can be used to represent geographic features. These tables consist of rows and columns containing information about geographic locations and features like roads, rivers, buildings, and parks. Each row represents a point in space or a specific area. The columns may contain longitude and latitude coordinates and other characteristics like environmental conditions, population density, and land use. For example, geospatial data showing land use in a town or city may have columns for longitudes and latitudes alongside other columns with…

Power your Feature Store using Automated Feature Engineering

By Lulu Liu

Why are enterprise feature stores empty? The notion that “data is the new oil” has existed for a while. The analogy, however, is more appropriate than people often consider. Crude oil serves few concrete purposes in its raw state. It must be refined and processed before gasoline, for example, can power our vehicles. In the same fashion, raw data has little value to an organization. Like crude oil, it must be refined, cleansed, transformed, and often combined with other data elements to elicit business insights and value. In the case of machine learning, the “gasoline” is what is known as “features.” Easy access to diverse and high-performance features is critical for successful machine learning projects. While feature engineering has been around as long as data scientists have built machine learning models, feature stores are a relatively new concept. A feature store is a machine learning-specific system used to centralize storage,…

Feature Engineering from Multi-Valued Categorical Data

By dotData

Introduction Multi-valued categorical data is ubiquitous and critical to businesses across multiple industries. Most real-world data that businesses store is not numerical but categorical in nature. Examples of categorical data include the level of education (elementary, middle school, bachelor's degree, master's degree, Ph.D.), subscription types (basic, premium, business), countries, and countless others that are prevalent in business and necessary for daily operation and analysis of business performance. Although businesses need to rely on categorical data to capture real-world trends, working with such data can be challenging - but there are ways to address such challenges. Overview Categorical columns represent data that can be divided into several different groups. There are two types of categorical columns, ordinal and nominal, and each can be uni-valued (regular) or multi-valued. Multi-valued categorical columns can contain different types of information in a single column. In Fig.1, the payment_type column is a regular categorical column, whereas…

Feature Engineering for Temporal Data – Part 2: Types of Temporal Data

By dotData

Temporal data is one of the most common and essential data types for enterprise AI applications, such as demand forecasting, sales forecasting, price prediction, etc. Analyzing time-series data helps organizations understand underlying patterns in their business over time and allows them to forecast what will happen in the future (a.k.a. time-series forecasting). Part one of this series focused on standard time-series models such as AR models, ARIMA, LSTM, and Prophet. While time-series modeling techniques are still widely used, they have limitations, such as the inability to work with heterogeneous data characteristics or time resolutions, lack of support for temporal transactions, and poor model explainability and transparency. This second part will review an alternative approach, i.e., feature engineering from temporal datasets, which provides many advantages over standard time-series modeling. We will look at three different types of temporal data, the alternative approach of engineering new features from the temporal datasets, an…

Feature Engineering for Temporal Data – Part 1

By dotData

An overview of common time-series modeling techniques Time-series (or temporal) data are among the most common and essential data types across enterprise AI applications, such as demand forecasting, sales forecasting, price prediction, etc. Analyzing time-series data helps enterprises understand the underlying patterns in their business over time and allows them to forecast what is likely to happen in the future (a.k.a. time-series forecasting). Developing good time-series models is an important and challenging process for enterprise data science and analytics teams. This blog series gives an overview of different approaches to developing AI and ML models from time-series and temporal datasets. This first blog in a three-part series will review common time-series modeling techniques and discuss their characteristics, advantages, and limitations. Standard Time-series Modeling Techniques Autoregressive Model The autoregressive (AR) model is one of the most straightforward and traditional time-series modeling techniques. The model is named autoregressive because it applies linear regression to…
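For readers who want to see the autoregressive idea in code, the sketch below fits a first-order model, y[t] = c + phi * y[t-1], by ordinary least squares. It follows the standard textbook definition rather than anything specific to the post; the helper `fit_ar1` and the noise-free toy series are assumptions made for this example.

```python
from typing import List, Tuple


def fit_ar1(series: List[float]) -> Tuple[float, float]:
    """Fit y[t] = c + phi * y[t-1] by least squares on consecutive pairs."""
    x = series[:-1]  # Previous values (the regressor).
    y = series[1:]   # Current values (the response).
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var = sum((a - mean_x) ** 2 for a in x)
    phi = cov / var
    c = mean_y - phi * mean_x
    return c, phi


# Toy series generated without noise from c = 2.0 and phi = 0.5,
# so the fit should recover those values up to rounding.
series = [1.0]
for _ in range(9):
    series.append(2.0 + 0.5 * series[-1])

c, phi = fit_ar1(series)

# One-step-ahead forecast from the last observed value.
forecast = c + phi * series[-1]
```

Real data would include noise and usually more than one lag, so a dedicated time-series library would be the practical choice. The sketch only shows the regression of a series on its own past.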