What AWS Redshift ML can do for you

The more data an AI scientist can get hold of, the better

Mon 31 Jan 2022 // 18:00 UTC

Paid Feature Machine learning applications can do amazing things, but for many users, creating them remains a pain. Getting the necessary data and then nursing it through a complex, repetitive training process is daunting and involves many specialist tasks. In May 2021, Amazon released Redshift ML, a service that makes it easier to retrieve data from its Redshift data warehouse and then build automated training workflows that create AI models from it. It joins similar services for other databases, such as Aurora ML and Neptune ML.

Redshift ML focuses on building AI models based on supervised learning, which is the most popular approach to AI today. Unlike alternatives such as unsupervised learning or reinforcement learning, it calls for existing data that someone has already labeled. Images of road signs could be labeled as signs, whereas images of lollipops, books, or semi-trailers with arrows on them that bear a passing resemblance to signs could be labeled as not-signs. Training data could also be snippets of natural language indicating certain sentiments, vibration data from turbines correlating with impending failure, or customer transactions pointing to specific behaviors.

The more of this data an AI scientist can get hold of, the better. It helps to increase accuracy, especially among edge cases where data is easy to mislabel. As accuracy increases, the AI becomes more valuable. Cloud-based AI systems with high accuracy and lots of computing power can make decisions about data at velocity and in volume.

Dealing with complex AI workflows

The problem with supervised learning has always been managing the training workflow. It's a complicated process and keeping track of all the steps is difficult. It begins with selecting the training data, which might come from multiple sources. That data must be cleaned and prepared for consumption by the AI training program, which is often a separate tool or library.

A programmer must choose the most appropriate algorithm for training on that data. These vary according to the type of data you're handling and the outcome you're looking for. For example, deciding how to label a discrete piece of data is more likely to need a classification algorithm. If you're predicting a continuous quantity, such as the price of an asset over time, then a regression algorithm might be more appropriate. There are different types of regression, too. After selecting your algorithm, you'd need to tweak further aspects of your AI training structure, known as hyperparameters, to suit the input data and output.
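The difference between the two families is easiest to see in the shape of the answer. The sketch below is purely illustrative and uses only the Python standard library: a nearest-centroid classifier returns a discrete label, while a least-squares line returns a continuous number. The vibration readings, prices, and function names are invented for the example and have nothing to do with how any AWS service picks its algorithms.

```python
from statistics import mean


def classify(samples, labels, x):
    """Classification: return the label whose class mean lies closest to x."""
    centroids = {
        label: mean(s for s, lab in zip(samples, labels) if lab == label)
        for label in set(labels)
    }
    return min(centroids, key=lambda label: abs(centroids[label] - x))


def fit_line(xs, ys):
    """Regression: ordinary least squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return slope, y_bar - slope * x_bar


# Hypothetical turbine vibration amplitudes with hand-applied labels.
vibration = [0.20, 0.30, 0.25, 0.90, 1.10, 1.00]
status = ["healthy", "healthy", "healthy", "failing", "failing", "failing"]
label = classify(vibration, status, 0.95)  # a discrete answer

# Hypothetical asset prices observed on days 1 to 4.
slope, intercept = fit_line([1, 2, 3, 4], [10, 12, 14, 16])
day_5_price = slope * 5 + intercept  # a continuous answer
```

Both functions learn from labeled examples, which is what makes them supervised; only the kind of output differs.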

Armed with an appropriate program, you must then begin the training. This will often produce subpar results, meaning that you must go back and tinker with your code and/or data, tuning it to produce better results. This takes multiple runs, which is where most of the work lies in the AI modeling process. Even when training in the cloud it's important to refine this as efficiently as possible because training is a compute-intensive process, and each run incurs a computing cost.
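That loop of run, score, adjust, and run again can be sketched in a few lines. The example below is a minimal illustration, not a production recipe: a tiny k-nearest-neighbour classifier stands in for whatever algorithm is being trained, the hyperparameter is k, and the data points are made up. Each pass through the loop corresponds to one training run that would cost compute time in the cloud.

```python
from collections import Counter


def knn_predict(train, x, k):
    """Predict a label for x by majority vote among its k nearest neighbours."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def accuracy(train, validation, k):
    """Fraction of validation examples predicted correctly, between 0 and 1."""
    correct = sum(knn_predict(train, x, k) == label for x, label in validation)
    return correct / len(validation)


def tune_k(train, validation, candidates):
    """Try each candidate k once and keep the one with the best score."""
    runs = [(k, accuracy(train, validation, k)) for k in candidates]
    best_k, best_score = max(runs, key=lambda run: run[1])
    return best_k, best_score, runs


# Hypothetical (feature, label) pairs; the point at 0.60 is deliberately noisy.
train = [(0.10, 0), (0.20, 0), (0.30, 0), (0.55, 1),
         (0.60, 0), (0.80, 1), (0.90, 1), (1.00, 1)]
validation = [(0.15, 0), (0.35, 0), (0.58, 1), (0.85, 1)]

best_k, best_score, runs = tune_k(train, validation, candidates=[1, 3, 5])
```

With this toy data the first run (k=1) is tripped up by the noisy point, and a larger k fixes it. Real tuning involves many more settings and far more data, which is why the number of runs, and the bill, can climb quickly.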

Then comes deployment. When the model is good enough, someone must roll it out and then monitor it as it makes decisions about production data (known as inference). That includes keeping track of cost and validating the accuracy of the model's decisions. Finally, admins must keep the inference running smoothly, allocating sufficient resources to handle the inference process.

How Redshift ML helps

These are specialized jobs that typically take multiple team members. AWS Redshift made it easier by drawing structured and semi-structured data together from multiple sources, including AWS S3, and molding it into different views. Amazon also introduced the SageMaker tool to automate the training process. This smoothed some of the wrinkles in the AI training workflow, but gaps between these tools still put a management burden on teams, forcing them to manage data exports.

Redshift ML created a more joined-up workflow. It enables data scientists to draw the data that they want for their training model directly from Redshift using SQL queries. This makes it easier to experiment with data inputs that might make your AI model more accurate. The product also automates data preparation and selects the most appropriate algorithm using a CREATE MODEL SQL command that builds the model for you.
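To give a feel for what such a statement looks like, here is a small Python helper that renders one as text. It is a sketch under stated assumptions: the model, table, column, function, role, and bucket names are all hypothetical, and the clause layout follows the basic CREATE MODEL form as we understand it from the Redshift documentation. Nothing here connects to AWS; the resulting string would be submitted through whatever SQL client you normally use.

```python
def create_model_sql(model, query, target, function, iam_role, s3_bucket, max_runtime):
    """Render a Redshift ML CREATE MODEL statement as text (nothing is executed)."""
    return (
        f"CREATE MODEL {model}\n"
        f"FROM ({query})\n"            # training data chosen with ordinary SQL
        f"TARGET {target}\n"           # the labeled column the model should predict
        f"FUNCTION {function}\n"       # name of the SQL function created for predictions
        f"IAM_ROLE '{iam_role}'\n"
        f"SETTINGS (S3_BUCKET '{s3_bucket}', MAX_RUNTIME {max_runtime});"
    )


# Every identifier below is a placeholder invented for this example.
statement = create_model_sql(
    model="customer_churn_model",
    query="SELECT age, plan, monthly_minutes, churned FROM customer_activity",
    target="churned",
    function="predict_customer_churn",
    iam_role="arn:aws:iam::123456789012:role/ExampleRedshiftMLRole",
    s3_bucket="example-redshift-ml-bucket",
    max_runtime=3600,  # training time limit in seconds
)
print(statement)
```

Changing the SELECT inside the FROM clause is all it takes to try a different set of input columns, which is the kind of experimentation the SQL-first approach is meant to encourage.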

Redshift ML then exports that data to an S3 bucket, making it accessible to SageMaker behind the scenes. For this, it uses SageMaker Autopilot, which automates a lot of the training work. Autopilot, which can also take tabular data directly from S3, handles the automatic model creation and uses model notebooks to report on model quality. It also provides model leaderboards, allowing users to compare and select models based on the best results.

This won't be a completely hands-off experience. Users will still want to keep an eye on how the training is going. They'll want to make sure that they're not overrunning their training budget, and to review the results of each training round to see how well the model is fitting the training data. They can do all this from a SQL prompt with the SHOW MODEL command, which reports where they are in the training run and how much the training has cost so far. It also reports model accuracy as a score between 0 and 1.

Because Redshift ML runs in AWS, it's easy for Amazon to integrate the training process with deployment into production. The service automatically deploys the trained model to the production environment in the cloud. At that point, people can use SQL queries in Redshift ML to make predictions on production data.

The benefits of a joined-up AI workflow

To summarise, ML workflows can be complex and iterative. Redshift ML simplifies model training. When you run the SQL command to create the model, Amazon Redshift ML securely exports the specified data from Amazon Redshift to Amazon S3 and calls SageMaker Autopilot to automatically prepare the data, select the appropriate pre-built algorithm, and apply the algorithm for model training.

Amazon Redshift ML handles all the interactions between Amazon Redshift, Amazon S3, and SageMaker, abstracting the steps involved in training and compilation. After the model is trained, Amazon Redshift ML makes it available as a SQL function in your Amazon Redshift data warehouse.

Joining up the AI workflow like this brings some significant benefits. One of the biggest is its accessibility. The use of SQL throughout the workflow makes it easier to define the inputs for training models and then to manage that training using a language that every database developer understands. It also reduces or eliminates the use of external tools to manage portions of that workflow.

That SQL capability also extends to predictions, meaning that people can use the language to make predictions directly within the data warehouse. Redshift ML can import the trained model from SageMaker for local inference. This enables people to generate predictions using SQL without having to ship data outside their data warehouse.
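The idea of a model that behaves like any other SQL function can be tried locally. The sketch below uses SQLite from the Python standard library purely as a stand-in for the data warehouse; it is an analogy, not Redshift ML itself. The `predict_churn` function, its hand-written rule, and the `customers` table are all hypothetical. In Redshift ML, the function would instead be the one created when the model was trained.

```python
import sqlite3


def predict_churn(support_calls, monthly_minutes):
    """Hypothetical stand-in for a trained model: returns 1 for likely churn."""
    return 1 if support_calls >= 4 and monthly_minutes < 100 else 0


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER, support_calls INTEGER, monthly_minutes INTEGER)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, 0, 350), (2, 5, 40), (3, 6, 500)],
)

# Register the Python function so that SQL queries can call it by name.
conn.create_function("predict_churn", 2, predict_churn)

# The prediction runs where the data lives; no rows are exported first.
rows = conn.execute(
    "SELECT id, predict_churn(support_calls, monthly_minutes) "
    "FROM customers ORDER BY id"
).fetchall()
conn.close()
```

The query reads like any other SELECT, which is the point: anyone who can write SQL can ask for a prediction alongside the columns they already use.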

Amazon says that Redshift ML can also help to save costs, partly because of the pricing model. Prediction is included in the cost of the customer's Redshift cluster, so the only additional charge is for training.

The automatic algorithm selection feature removes a lot of the development overhead from AI trainers, but Amazon says that it has found a balance between control and usability by allowing more advanced users to specify their problem types, and to pick the algorithms they want to use. They can also tinker with the values used to control the learning process.
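As a rough illustration of that extra control, the helper below renders the optional clauses that sit between the IAM_ROLE and SETTINGS parts of a CREATE MODEL statement. The clause names (MODEL_TYPE, PROBLEM_TYPE, OBJECTIVE) and the example values reflect our reading of the Redshift documentation and should be checked against the current reference; the helper function itself is hypothetical and only builds text.

```python
def advanced_clauses(model_type=None, problem_type=None, objective=None):
    """Render optional CREATE MODEL clauses; any argument left as None is omitted."""
    clauses = []
    if model_type:
        clauses.append(f"MODEL_TYPE {model_type}")      # which algorithm family to use
    if problem_type:
        clauses.append(f"PROBLEM_TYPE {problem_type}")  # what kind of prediction is wanted
    if objective:
        clauses.append(f"OBJECTIVE '{objective}'")      # the metric training should optimise
    return "\n".join(clauses)


# Let the service decide everything: no extra clauses at all.
hands_off = advanced_clauses()

# A more opinionated user pins the algorithm, problem type, and metric.
hands_on = advanced_clauses(
    model_type="XGBOOST",
    problem_type="BINARY_CLASSIFICATION",
    objective="F1",
)
print(hands_on)
```

Leaving every argument unset reproduces the fully automatic behavior, so the same statement template can serve beginners and specialists alike.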

Amazon believes that this simpler access to AI workflows also improves developer productivity and query speed. In fact, it believes that the user base might go beyond traditional developers and data scientists entirely. It hopes that other types of workers, such as line-of-business managers, might want to get in on the act.

Could we be looking at a low-code revolution for AI? One thing is certain: demand for cloud-based AI applications is growing. In 2021, PwC found that a quarter of US businesses reported widespread AI adoption, up from 18 percent the year prior. Another 54 percent reported that they were gearing up quickly to follow suit. The figures imply there are an awful lot of business intelligence analysts keeping busy.

The potential use cases for supervised learning-based AI are varied. Prediction carries strong appeal for companies eager for an early jump on the market. They can use AI models derived from a data warehouse to predict everything from customer churn through to the probability of a sales lead closing.

They can even use machine learning to predict a customer's lifetime value, which has implications when planning marketing and customer support strategies. Amazon also points to demand and revenue predictions and product recommendations as interesting use cases.

Other potential applications for machine learning when combined with data warehousing include fraud detection. Running AI-powered reports to find suspicious activity can help to spot damaging emergent behaviors and save thousands.

Conceptually, AI is over half a century old, but modern machine learning powered by GPUs and the cloud has been around for barely a decade. Tools and services that make it easier to integrate workflows, and that complete some of the steps automatically, promise to open AI applications in the cloud to more people. Redshift ML complements a data warehousing tool that already allows users to easily bring together data from disparate sources into one place. As customers get to grips with this new capability, we can expect users to produce many more supervised learning-based ML applications in AWS.

Sponsored by AWS.
