AWS SageMaker meets Titanic
In previous posts, we saw some examples of ML algorithms running almost entirely on local resources (the exception being Google Colaboratory).
Now let’s take a practical problem to solve (a classification problem: predicting who survived in the Titanic dataset), using the tools available on Amazon Web Services.
The focus is not on how well the tools perform (we’ll do only basic data preprocessing, without tuning) but on the impressions and feel of using them.
The Titanic dataset is a classic classification problem: it contains several pieces of information about the passengers (age, sex, ticket class and so on) and the survival (yes/no) target value.
The goal is to train a model to predict whether a given passenger survived. The data is not ready “as is”, so we’ll need to preprocess it a bit.
The data is already split between train and test, so we’ll use two different files.
Amazon SageMaker is a service that enables a developer to build and train machine learning models for predictive or analytical applications in the Amazon Web Services (AWS) public cloud (cit. Wikipedia).
Reading this, we expect to find several tools to help us.
First, we need to upload the files to S3.
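The upload can be done from the console, but as a rough sketch it could also be scripted with boto3. The bucket name, the key prefix and the helper name below are placeholders of mine, not something taken from the original setup.

```python
from pathlib import Path


def upload_files(s3_client, bucket, local_paths, prefix="titanic/raw"):
    """Upload each local file to s3://<bucket>/<prefix>/<filename>; return the keys."""
    keys = []
    for path in local_paths:
        key = f"{prefix}/{Path(path).name}"
        # upload_file(Filename, Bucket, Key) is the boto3 S3 client call
        s3_client.upload_file(str(path), bucket, key)
        keys.append(key)
    return keys


# Example usage (needs boto3 and AWS credentials; the bucket name is hypothetical):
#   import boto3
#   upload_files(boto3.client("s3"), "my-titanic-bucket", ["train.csv", "test.csv"])
```

Passing the client in as an argument keeps the function easy to try out with a stub before touching a real bucket.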
Let’s start a new notebook instance (using default parameters) to preprocess the data. Make sure the IAM role it uses has the right permissions to access the S3 bucket.
Now we can start an instance on AWS and use Jupyter to work with the data.
Let’s perform some basic data preprocessing (removing some columns, transforming categorical data and so on).
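To give an idea of what this preprocessing might look like, here is a minimal pandas sketch. The exact columns removed and the way missing values are filled are my assumptions (the steps are not listed here); this particular choice happens to leave 10 feature columns plus the target.

```python
import pandas as pd

# Assumption: these are the columns dropped; the post does not list them.
DROP_COLUMNS = ["PassengerId", "Name", "Ticket", "Cabin"]
CATEGORICAL_COLUMNS = ["Sex", "Embarked"]


def preprocess(df):
    """Return a numeric-only copy of the Titanic dataframe."""
    out = df.drop(columns=DROP_COLUMNS, errors="ignore").copy()
    # Fill missing values with simple defaults (median / most frequent value)
    out["Age"] = out["Age"].fillna(out["Age"].median())
    out["Fare"] = out["Fare"].fillna(out["Fare"].median())
    out["Embarked"] = out["Embarked"].fillna(out["Embarked"].mode()[0])
    # One-hot encode the categorical columns as 0/1 integers
    return pd.get_dummies(out, columns=CATEGORICAL_COLUMNS, dtype=int)
```

In the notebook this would be called as `preprocess(pd.read_csv("train.csv"))`, assuming the standard Kaggle column names.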
At this point it would be possible to do some EDA using the usual libraries. Let’s skip that and just upload the data to S3, changing the file a bit to comply with SageMaker (removing the header and moving the target column to the first position).
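The reshaping itself is small. As a sketch, using only the standard csv module so it works on any exported file (the function name is mine, and "Survived" is assumed to be the target column name):

```python
import csv
import io


def to_sagemaker_csv(csv_text, target="Survived"):
    """Move `target` to the first column and drop the header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    features = [name for name in reader.fieldnames if name != target]
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in reader:
        # Target value first, then the features in their original order
        writer.writerow([row[target]] + [row[name] for name in features])
    return buffer.getvalue()
```

With a pandas dataframe already in memory, reordering the columns and calling `to_csv(path, header=False, index=False)` should give the same layout.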
OK, at this point we can train a model using an algorithm already available in SageMaker (the built-in ones, or even one from the dedicated marketplace), either directly from the notebook or through the console interface: let’s choose the latter option.
We’ll use Linear Learner as the out-of-the-box algorithm.
The hyperparameter configuration must specify the kind of output expected: in our case, predictor_type is “binary_classifier” (survived can be yes or no) and feature_dim = 10 (there are now 10 feature columns in the dataframe). We’ll leave the other parameters untouched.
For the input type, it’s necessary to specify text/csv in the channel configuration.
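For reference, the same settings entered in the console map onto the CreateTrainingJob API. The sketch below only builds the request; the job name, image URI, role ARN, S3 URIs, instance type and time limit are all placeholders or assumptions of mine, and only the two hyperparameters and the text/csv content type come from the configuration described above.

```python
def build_training_request(job_name, image_uri, role_arn, train_s3_uri, output_s3_uri):
    """Build keyword arguments for the SageMaker CreateTrainingJob API."""
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,  # region-specific Linear Learner image
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        # Hyperparameter values are passed as strings
        "HyperParameters": {
            "predictor_type": "binary_classifier",
            "feature_dim": "10",
        },
        "InputDataConfig": [{
            "ChannelName": "train",
            "ContentType": "text/csv",
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": train_s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }},
        }],
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "ResourceConfig": {
            "InstanceType": "ml.m5.large",  # assumed instance type
            "InstanceCount": 1,
            "VolumeSizeInGB": 10,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},  # assumed limit
    }
```

The resulting dictionary could then be submitted with `boto3.client("sagemaker").create_training_job(**request)`, which is roughly what the console form does behind the scenes.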
At this point, if there are no errors, the training job is completed.
As you can see, it’s possible to create a model from the training job, which can then be deployed to an endpoint.
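Once an endpoint is up, querying it could look roughly like the sketch below. The endpoint name and helper name are hypothetical, and the parsing assumes the JSON response shape documented for Linear Learner binary classification (a "predictions" list whose items carry "score" and "predicted_label"); worth double-checking against the current documentation.

```python
import json


def predict_survival(runtime_client, endpoint_name, feature_rows):
    """Send feature rows as CSV to an endpoint and return the predicted labels."""
    # One CSV line per passenger, features only (no target, no header)
    body = "\n".join(",".join(str(v) for v in row) for row in feature_rows)
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        Accept="application/json",
        Body=body,
    )
    payload = json.loads(response["Body"].read())
    return [int(p["predicted_label"]) for p in payload["predictions"]]


# Example usage (needs boto3, credentials and a deployed endpoint):
#   import boto3
#   runtime = boto3.client("sagemaker-runtime")
#   predict_survival(runtime, "titanic-endpoint", [[3, 22.0, 1, 0, 7.25, 0, 1, 0, 0, 1]])
```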
To see some KPIs about how the training went, it’s possible to use CloudWatch.
It’s then possible to perform hyperparameter tuning, changing parameters to see how performance is affected.
I’m not particularly impressed, at least for this simple task.
Obviously you have all the infrastructure and services AWS can offer, and you can put them together to build a complete solution, but there is a lack of dedicated focus on this part, and the best approach remains using Jupyter notebooks.
Moreover, it’s quite time-consuming to provision the environment for every run (every time you make a mistake while configuring something, you have to check the log, clone the training job, fix the error and submit everything again), so probably the best approach is to deploy a model that has already been trained.
Overall, the interface is not user-friendly: you have to read the documentation very carefully and, of course, try and try again, or end up like this :)
OK, that’s all for now: in the next posts we’ll see if and how competitors differ when performing the same task, so stay tuned!