From Raw Data to ML Models: The Magic Behind AI-Powered Feature Engineering

Understanding the steps involved and how challenges can be addressed at each stage can help organizations move from an ad-hoc machine learning process to a more production line-based approach.

August 26, 2022

The journey from raw data to machine learning models is the essence of artificial intelligence. Ryohei Fujimaki, founder and CEO of dotData, takes a closer look at understanding the journey, the steps and challenges involved, and how new technologies are disrupting the process.

Companies face different problems when moving from raw data to ML models. However, understanding the steps involved and how challenges can be addressed at each stage can help organizations move from an ad-hoc machine learning process to a more production line-based approach. Key new tools like automated feature discovery and evaluation are disrupting the data journey and creating new opportunities.   

Defining the Purpose of the Model 

It is vital to set clear expectations to develop ML models that can solve real-world problems. ML models perform predictive analysis on opportunities and risks and can present insights into alternatives or new perspectives. Organizations and businesses have embraced ML across the world. Fortune Business Insights reports that the ML market will grow from $21.17 billion in 2022 to $209.91 billion by 2029, at a CAGR of 38.8%. 

ML is used in every industry, from e-commerce to security, cybersecurity, health, and supply chain management. With so many applications, it is easy to lose focus on what type of ML model a company needs. The purpose of an AI/ML system should be as specific as possible. The more specific and clear an objective is, the easier the process and the more impactful the outcome.  

While increasingly industrialized, machine learning is still largely artisanal work. It requires creativity, planning, and a scientific and engineering approach to data that is highly experimental and iterative. Even with clearly defined goals for an ML model, the hard work of finding the right data is critical to the project’s overall success. Finding the right data requires extensive experimentation and trial and error to discover exactly the right combination that will lead to a good model.


Raw Data Challenges 

In 2019, the World Economic Forum predicted that by 2025 the world would produce 463 exabytes of data every day – the equivalent of burning 212.7 million DVDs daily. Organizations produce data faster than they can use it. Businesses such as traditional banks – that have been operating for decades – face additional data challenges. They produce large amounts of data and have massive amounts of historical data. 

Since the digital revolution, data has moved from paper to digital formats, to databases and data warehouses. Some of this data is useful; much of it is not. Most of the data companies store goes unused and often unexplored; it is not being leveraged or put to work, and more often than not, it is siloed. One of ML’s biggest challenges is finding valuable data segments in an ocean of raw data. 

Challenges of Building, Testing, and Finding the Right Combinations 

Moving an ML model from experimental to production is one of organizations’ biggest challenges. In January 2022, a KDnuggets survey showed that only 0 to 20% of models generated were deployed. Data scientists explained that 80% of ML models stall before reaching production. 

The work data science teams face when building machine learning models is complex and poorly understood. It is still manual and time-consuming, from discovering and choosing features, to testing different hypotheses, to working with subject matter experts to find the right combination of columns (aka “features”) for any given model. 

Data science teams must choose what data they will use and which data performs best. To do this, they may literally have to “scan” through hundreds of columns and millions of rows of data and decide which features to use and which to discard. This is where the concept of feature stores comes in: systems designed to save useful features for future re-use. Feature discovery and engineering are so complex that a whole category of products exists just to save features for re-use.
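To make the feature-store idea concrete, here is a minimal sketch in Python: a feature is computed once, registered under a name, and reused later without being re-derived. All of the names here (`FeatureStore`, `register`, `get`) are illustrative inventions, not the API of any real product.

```python
class FeatureStore:
    """Toy registry that saves computed feature columns for re-use."""

    def __init__(self):
        self._features = {}

    def register(self, name, values, description=""):
        # Save a computed feature column along with a human-readable note.
        self._features[name] = {"values": values, "description": description}

    def get(self, name):
        # Retrieve a previously computed feature for a new model.
        return self._features[name]["values"]


store = FeatureStore()

# A feature derived once from raw transaction rows...
monthly_spend = {"cust_1": 120.0, "cust_2": 85.5}
store.register("monthly_spend", monthly_spend,
               description="Average monthly card spend per customer")

# ...can later be pulled into any model's training set without re-deriving it.
training_feature = store.get("monthly_spend")
```

Real feature stores add versioning, online/offline serving, and access control on top of this basic register-and-retrieve pattern.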

Feature Discovery and Engineering 

Feature engineering is the process of providing an ML algorithm with the most valuable columns of data it needs to perform at its best. While feature engineering is a well-understood principle of data science, feature discovery is often an even bigger challenge that takes a lot of time and is highly iterative in nature.
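A small sketch of what feature engineering can look like in practice: raw fields such as a timestamp and a transaction amount are transformed into columns a model can actually learn from. The raw rows and derived columns below are invented for illustration.

```python
import math
from datetime import datetime

# Raw transaction records, roughly as a model might first see them.
raw_rows = [
    {"timestamp": "2022-03-14 09:30:00", "amount": 250.0},
    {"timestamp": "2022-03-19 21:10:00", "amount": 12.5},
]


def engineer_features(row):
    """Derive model-ready columns from raw fields."""
    ts = datetime.strptime(row["timestamp"], "%Y-%m-%d %H:%M:%S")
    return {
        "day_of_week": ts.weekday(),              # 0 = Monday ... 6 = Sunday
        "is_weekend": ts.weekday() >= 5,          # simple behavioral flag
        "log_amount": math.log1p(row["amount"]),  # damp the skew of large amounts
    }


features = [engineer_features(r) for r in raw_rows]
```

Each derived column encodes a hypothesis about what drives the outcome; testing many such hypotheses is exactly the iterative work the article describes.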

Feature discovery is about finding the combination of table joins and column selections that creates good features. Often, different combinations of features will be tested, and data science teams must optimize the algorithm’s performance by trying different features. But how do data scientists choose the right ones? Data science teams meet with experts in related fields within an organization to discuss which features to use. 
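The join-and-aggregate step at the heart of feature discovery can be sketched as follows. Two hypothetical warehouse tables are combined into one candidate feature row per customer; in real work, teams would try many such join/aggregation combinations, and the table and column names here are made up.

```python
from collections import defaultdict

# Two raw tables, roughly as they might sit in a warehouse.
customers = [{"cust_id": 1, "segment": "retail"},
             {"cust_id": 2, "segment": "business"}]
transactions = [{"cust_id": 1, "amount": 100.0},
                {"cust_id": 1, "amount": 40.0},
                {"cust_id": 2, "amount": 900.0}]

# One candidate aggregation out of many a team might try:
# per-customer transaction count and total spend.
totals = defaultdict(lambda: {"txn_count": 0, "total_spend": 0.0})
for t in transactions:
    totals[t["cust_id"]]["txn_count"] += 1
    totals[t["cust_id"]]["total_spend"] += t["amount"]

# Join the aggregates back onto the customer table to form feature rows.
feature_table = [{**c, **totals[c["cust_id"]]} for c in customers]
```

Automated feature discovery tools essentially enumerate and evaluate many variants of this pattern (different joins, windows, and aggregation functions) instead of relying on manual trial and error alone.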

Testing algorithms and making sure they are operating effectively is continual work. For example, returning to the case of traditional banking institutions that have operated for decades: their ML models were disrupted during the pandemic as the world went into lockdown and bank branches closed. This is one example of how new consumer behaviors or other factors can cause an ML model to drift. 

Appen’s 2021 State of AI and Machine Learning report reveals the industry’s increased effort to prevent ML models from drifting. “Building an AI-ML model isn’t just one-and-done. The model needs regular evaluation, tuning, and retraining,” the report explains. ML models are dynamic: as data changes, they drift and provide less accurate results. Appen says that 91% of large organizations update their models at least quarterly, and 57% of all organizations update their models at least monthly.
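One simple way such monitoring can work, sketched under assumed numbers: compare a feature's distribution in the training window against recent live data and flag drift when the live mean moves far outside the training spread. The threshold and the lockdown-style example values are illustrative, not from any cited report.

```python
from statistics import mean, stdev


def drifted(train_values, live_values, z_threshold=3.0):
    """Flag drift when the live mean sits more than z_threshold
    training standard deviations away from the training mean."""
    mu, sigma = mean(train_values), stdev(train_values)
    if sigma == 0:
        return mean(live_values) != mu
    return abs(mean(live_values) - mu) / sigma > z_threshold


# Pre-lockdown weekly branch-visit counts vs. lockdown-era counts.
train = [10, 12, 11, 9, 10, 11]
live = [0, 1, 0, 0, 1, 0]
alert = drifted(train, live)  # a shift this large should trip the check
```

Production monitoring typically uses richer distribution tests (for example, the population stability index or a Kolmogorov-Smirnov test), but the principle of comparing reference data to live data is the same.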

Feature Transparency 

Transparency and insight into what an ML model is doing and the data it uses are essential. Transparency is not only vital to meeting ethical and legal standards; it can also help business decision-makers. Business users usually leverage experience and “gut feeling” when judging business decisions. If an ML model is counterintuitive and the business user has no way of understanding it, they will not trust it enough to invest in it. 

Another related concept is traceability: the ability to visually trace where a feature came from, for example, which combination of columns was used to create it.
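Traceability can be as simple as storing provenance metadata next to each feature, so anyone can trace it back to its source tables and columns. The lineage schema and field names below are hypothetical, meant only to show the idea.

```python
# Provenance record stored alongside a feature. The schema is illustrative.
feature_lineage = {
    "name": "total_spend_90d",
    "source_tables": ["customers", "transactions"],
    "source_columns": ["transactions.amount", "transactions.txn_date"],
    "transformation": "SUM(amount) over trailing 90 days, joined on cust_id",
}


def describe(lineage):
    """Render a human-readable trace of where the feature came from."""
    return (f"{lineage['name']} <- {', '.join(lineage['source_columns'])} "
            f"via {lineage['transformation']}")


trace = describe(feature_lineage)
```

With records like this, a reviewer (or regulator) can audit whether any sensitive columns fed into a given model feature.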

It is essential to understand the two distinct concepts of transparency in AI. The first refers to the benefits of making AI easy to understand. The second involves minimizing the use of features that may cause harm, for example, whether a credit company should use the race data of its customers in ML models that predict approval or disapproval of loans. 

AI transparency can be challenging. It is not about the model’s accuracy but about clearly understanding how it operates and minimizing the risks of harm. However, just because a black box AI is not understood by everyone does not imply it is causing harm. 


New Disruptive Technologies

Several new technologies are disrupting the journey from raw data to ML models. They include automated feature discovery, automated feature engineering, feature stores, and automated machine learning (AutoML).

“It has been a common trope that 80% of a data scientist’s valuable time is spent simply finding, cleaning, and organizing data, leaving only 20% to actually perform analysis,” Harvard Business Review reports. 

New innovative and disruptive AI/ML technologies are helping data science teams avoid repetitive tasks and focus on predictive analysis and modeling. The result? ML models that are built faster and better. With advanced AI tools like automated feature discovery, repetitive tasks that are usually done manually and take anywhere from weeks to months can now be done in hours. No-code AI is also becoming popular for its potential to deliver concrete point solutions for AI. It enables less sophisticated users, the so-called “citizen data scientists,” to build AI or ML models. 

From healthcare AI applications that search for new drugs for untreatable illnesses to helping businesses pivot from global supply chain disruptions, inflation, or pandemics, the potential of ML to solve real-world problems has become evident. New models that maximize data and drive predictive analysis are of great value to our global society. And the journey to improving organizations, communities, and global society with AI always starts with raw data and ends with ML models. 



Ryohei Fujimaki
Ryohei Fujimaki is the founder and CEO of dotData. Ryohei was the youngest research fellow ever in NEC Corporation’s 119-year history, an honor bestowed on only six individuals among more than 1,000 researchers. During his tenure at NEC, Ryohei was instrumental in the successful delivery of several high-profile analytical solutions that are now widely used in the industry. Leveraging his expertise and unique outlook as a young researcher, he built dotData as a firm focused on automated data science that delivers new levels of speed, scale, and value in successful deployments across multiple industries, including several Fortune Global 250 clients.