A Look into Applied Machine Learning
Naturally, I believe it is exciting for anyone to want to learn about the artificial intelligence (AI) environment and, more specifically, how a system can automate learning based on its own experiences. Humans possess the ability to learn from past experiences, with potential self-improving benefits. Machines, on the other hand, follow the instructions given to them by humans. But what if humans could train the machines to do what humans can do, only more efficiently? Well, this is called machine learning. You can view machine learning as the man behind the curtain of your Spotify Discover Weekly playlist, the voice behind your phone’s virtual personal assistant, the eye behind the facial recognition technology securing your cell phone, or the guard filtering “spam” in your email. These are just a few of many examples of the role machine learning plays in our everyday lives. The goal of this article is to guide you through the basics of machine learning and the tasks involved.
As I mentioned prior, humans improve based on experiences; machines need a set of tasks to complete, along with large amounts of data, to give them the “experiences” they lack. I used a data set that first featured in an article written by Nuno Antonio, Ana Almeida, and Luis Nunes for Data in Brief, Volume 22, February 2019. This data set contains booking information for a city hotel and a resort hotel, including guest information such as the type of hotel, the duration of a guest’s visit, the number of individuals staying at the hotel, any special accommodations, and much more. Each attribute about a guest’s stay at the hotel lives in a column within the data set; we call these columns features. The shape of this data presents 119,390 guests (rows) and 32 features (columns). To explain this seemingly complicated process, I will break the machine learning project down into seven main steps:
- Frame and Understand the Problem
- Import Data
- Explore the data to gain insights
- Feature Engineering/Feature Selection
- Build Model(s)
- Fine-Tune Model(s)
- Report Results
Frame and Understand the Problem
As a data scientist, first you want to define the objective and fully understand how the solution will align with the business. Every problem is unique and comes with its own set of challenges, so not every issue can be viewed in the same light. Keep in mind that the world of data science is a collaborative effort, so in this step it is essential to draw on domain expertise when it is available. Our goal is to determine in which season guests most frequently arrive at a hotel, framing this as a multi-class classification problem, so that a hotel can more effectively and proactively assess the needs of its guests. After fully understanding the problem and potential solutions, it is time to import the data.
Import Data
One of the challenges of machine learning is importing your data using the correct method for its specific format. The hotel data set is a tabular file with a .csv extension, a format used in most spreadsheet programs, such as Excel. The pandas library contains methods to read that file into a variable that can be used throughout the project. If we take a quick look at the structure of our data using the head method and the shape attribute, we can see 119,390 rows and 32 columns spanning various data types and containing missing values. Let’s explore the data more to gather an idea of the amount of data cleaning necessary.
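Here is a minimal sketch of what this step can look like with pandas (the file name hotel_bookings.csv is an assumption; substitute the path to your copy of the data set):

```python
import pandas as pd

# Read the tabular .csv file into a DataFrame
# (the file name "hotel_bookings.csv" is an assumption; use your local path)
hotels = pd.read_csv("hotel_bookings.csv")

# Inspect the first few rows and the overall shape of the data
print(hotels.head())
print(hotels.shape)  # (rows, columns)
```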
Explore the data to gain insights
Data preprocessing is crucial for locating trends, outliers, and correlations, and for flagging potential issues that could arise as you develop your machine learning project. In the hotel data set, 3.4% of the values are missing, primarily located in four columns (“company,” “agent,” “country,” and “children”). Some columns are not correctly formatted, and some categorical columns have high cardinality, or many unique values. Taking the time to explore can save time in the long haul. In this step, it is essential to format the data so it can be easily manipulated without altering the original data frame. Fortunately, this data set did not require much cleaning.
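A quick sketch of this kind of exploration, assuming the hotels data frame from the import step:

```python
# Percentage of missing values per column, largest first
missing = (hotels.isna().mean() * 100).sort_values(ascending=False)
print(missing.head(4))  # company, agent, country, and children top the list

# Overall share of missing cells across the data set
print(f"{hotels.isna().mean().mean() * 100:.1f}% of all cells are missing")

# Cardinality (number of unique values) of each categorical column
print(hotels.select_dtypes(include="object").nunique().sort_values(ascending=False))
```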
Feature Engineering/Feature Selection
Now that the data is better equipped for manipulation, we can use the raw data to extract and create more features (columns).
“Having and engineering good features will allow you to most accurately represent the underlying structure of the data and therefore create the best model. Features can be engineered by decomposing or splitting features, from external data sources, or aggregating or combining features to create new features.”
One important feature to engineer first is our predictor: the value we want the model to predict. In our example, three columns (‘arrival_date_year’, ‘arrival_date_month’, and ‘arrival_date_day_of_month’) need to be combined, after converting them from object types, to create the new attribute. The goal is a column describing the arrival date as a day of the year, expressed by a numeric value ranging from 1 to 365. We can then use the day of the year to assign each stay to a season. The data spans 792 days (July 1, 2015, to August 31, 2017), and once the function has been written, engineering other features from the data set becomes a less daunting task. Or, on the contrary, we can eliminate features, such as ones with high cardinality, an overwhelming number of missing values, or a weak relationship with the solution or predictor.
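Here is a minimal sketch of that combination, assuming pandas and the hotels data frame from the import step; the season boundaries (roughly the equinoxes and solstices) are an assumption:

```python
import pandas as pd

# Combine the three date columns into one datetime, then extract the day of year
dates = pd.to_datetime(
    hotels["arrival_date_year"].astype(str)
    + "-" + hotels["arrival_date_month"]
    + "-" + hotels["arrival_date_day_of_month"].astype(str),
    format="%Y-%B-%d",  # the month column spells the month out, e.g. "July"
)
hotels["arrival_day_of_year"] = dates.dt.dayofyear  # 1-365 (366 in leap years)

def to_season(day_of_year):
    """Map a day of year to a season label (the boundaries are assumptions)."""
    if day_of_year < 80 or day_of_year >= 355:
        return "Winter"
    if day_of_year < 172:
        return "Spring"
    if day_of_year < 264:
        return "Summer"
    return "Fall"

hotels["Season"] = hotels["arrival_day_of_year"].apply(to_season)
```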
In machine learning, feature engineering and feature selection play a pivotal role in minimizing errors in the model when its metrics are evaluated. One significant error for a model’s creator to consider is data leakage: information from outside the training data, such as the test set or the target itself, making its way into the model and inflating its apparent performance. Data leakage in this sense should not be confused with a security data breach, though breaches illustrate how costly leaked data can be. Many of us have seen real-life examples on the news. In September 2018, Facebook announced that 50 million users (out of over 2.6 billion monthly active users as of the first quarter of 2020) had been compromised in an attack against its computer network. The attack left users vulnerable to attackers who could gain access to their accounts and other services. According to an article in MetaCompliance, the security breach was the largest in the company’s 14-year history and could result in fines upwards of $1.63 billion. Hopefully, this stresses the importance of this step.
Build Model(s)
Deciding which model to use can be a hefty task; however, in the first step, we took the time to break down the problem and explore ways we could frame it. The feature “Season” poses the problem of classifying instances into one of four classes. This is an example of a multinomial, or multi-class, classification problem. Had there been instances to sort into one of two categories, we would be working with a binary classification problem. We knew from importing the tabular labeled data that this would be a supervised problem; a lack of labels in the data would have made it an unsupervised one. To take the exploration one step further, we have identified the output as belonging to a multi-class classification problem. Now we can incorporate our hard work into a prediction model.
As mentioned prior, a machine learning model is best judged on data it has not seen before. This is the reason the data is separated into two parts: a training set and a testing set. The training data is what the model(s) will learn from and be evaluated against during development, and the testing data will be the ultimate and last step to ensuring excellent performance. It is the same as starting a new job and having to repetitively complete tasks to understand the expectations. You will struggle at times. You will have to modify the way you perform a task. You will receive feedback comparing your work to the overall standards of the company. This principle is similar to machine learning.
In our case, when you struggle, you have low or conflicting scores in machine learning. When you modify the way you perform a task, you are fine-tuning a model in machine learning. When you receive feedback, you are reading your model(s) evaluation metrics. And, in both worlds, the goal is the same: to perform the best you can with the knowledge you have.
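Before building the models, here is a minimal sketch of that train/test split, assuming scikit-learn and the hotels data frame with the engineered Season column; the 80/20 ratio and the exact list of leakage-prone columns to drop are assumptions:

```python
from sklearn.model_selection import train_test_split

# Drop the target, plus the columns it was derived from, so that
# season information does not leak into the features
leaky = ["Season", "arrival_day_of_year", "arrival_date_year",
         "arrival_date_month", "arrival_date_day_of_month"]
X = hotels.drop(columns=leaky)
y = hotels["Season"]

# Hold out 20% of the rows as the final testing set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
```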
Think of the various prediction models the same way you would a restaurant, where many components work together in a pipeline so you can eat that delicious burger. The “pipeline” in a machine learning model is very similar, and when you build your model(s), remember how you framed your problem. In our case, we used a Logistic Regression, a Decision Tree Classification, and a Random Forest Classification model to compare accuracy scores. Each model or pipeline has its unique purpose and iterative properties for a specific output. The logistic regression model is implemented as a Python class that estimates the parameters of a logistic regression; in the binary case, its output is a probability between 0 and 1.
The decision tree classification model (also a Python class) and the random forest classification model perform similarly in that they both consist of:
- Nodes: Test for the value of a specific attribute.
- Edges/Branches: Correspond to the outcome of a test and connect to the next node or leaf.
- Leaf nodes: Terminal nodes that predict the outcome (represent class labels or class distribution).
In both of these models, there is “a tree built through a process known as binary recursive partitioning. This is an iterative process splitting the data into partitions and then dividing it further on each branch.” One noticeable difference is that the random forest classification model is an ensemble: it builds multiple trees at once, each repeatedly dividing its own subset of the data, and the varying outputs across the forest are combined into a single prediction.
Now we are ready to incorporate other components into our pipeline based on the framing of our problem. In our case, a transformer and an imputer will suffice. The transformer encodes categorical data, altering the shape of the input data, and the imputer fills in missing data so that the appropriate model can then be fit.
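A minimal sketch of such a pipeline, assuming scikit-learn and the X_train/y_train split from earlier; the imputation strategies and encoder settings are illustrative choices, not the only reasonable ones:

```python
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Impute then one-hot encode categorical columns; impute numeric columns
categorical = X_train.select_dtypes(include="object").columns
numeric = X_train.select_dtypes(exclude="object").columns
preprocess = ColumnTransformer([
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical),
    ("num", SimpleImputer(strategy="median"), numeric),
])

# Fit all three models on the same preprocessing steps and compare accuracy
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "random_forest": RandomForestClassifier(random_state=42),
}
for name, model in models.items():
    pipe = Pipeline([("preprocess", preprocess), ("model", model)])
    pipe.fit(X_train, y_train)
    print(name, pipe.score(X_train, y_train))  # training accuracy
```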
Fine-Tune Model(s)
Remember, machine learning focuses on iterative processes, or tasks that are reproducible for optimization. During the first test of each model in our example, the decision tree classifier performed the best, achieving an accuracy score of 38%. From here, we return to previous steps and conduct further feature selection to fine-tune our models and minimize any presence of data leakage.
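One common way to fine-tune is a cross-validated grid search over hyperparameters; here is a sketch, assuming the preprocessing pipeline from the previous step (the parameter grid itself is an assumption, not tuned to this data):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

pipe = Pipeline([
    ("preprocess", preprocess),  # the ColumnTransformer built earlier
    ("model", DecisionTreeClassifier(random_state=42)),
])

# Candidate hyperparameter values (illustrative, not tuned to this data)
param_grid = {
    "model__max_depth": [3, 5, 10, None],
    "model__min_samples_leaf": [1, 5, 20],
}

# 5-fold cross-validated grid search on the training data only
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```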
Report Results
In conclusion, the hotel example demonstrated in the first round of testing and error-metric evaluation that the decision tree classification model performed more consistently than the other two models across the training and validation accuracy scores. Its performance was more sound and less prone to data leakage than the results from the other models. Even though we have built a working machine learning prediction model, we cannot rely on it heavily to determine the best time of year to make reservations at a hotel because of unaccounted-for bias. In real life, hotels are also affected by economic hardship and by the type of clientele they serve. The next step would be to further our understanding of the bias that is not adequately represented and its impact on city and resort hotels from 2015 to 2017. So, the next time you are searching for that vacation hotel, think of all the computational tasks providing you the best, most memorable experience.
“An investment in knowledge pays the best interest.” -Benjamin Franklin
This project assumes the reader has a basic understanding of Python and its standard libraries. Follow along with the example in the accompanying notebook: https://bit.ly/2DriQ0c.