Machine Learning: Data Science essentials
Machine Learning is one of the top-rated industry trends, playing a key role in nearly every vertical, with Big Data and Cloud services as key enablers. This post aims to make it easy to explore the abilities of Machine Learning without going into too much detail. What I will be covering is merely the tip of the iceberg. Follow along, and feel free to post your feedback in the comments, or get back to me on Twitter.
The Big Data pipeline
A simple pipeline is commonly proposed through which data is processed to produce meaningful, conclusive results.
- Acquire Data: Gather any information that is being generated from all available sources. These could be log files, SQL databases, Document storage, Excel sheets.
- Extract, Clean, Annotate: Extract all relevant data from the pool, clean it of erroneous or anomalous entries, and annotate or label the data appropriately.
- Integrate, Aggregate, Represent: Carry out the necessary correlations and present the data in a shape that suits the architecture. For example, flatten the data into a CSV file or a SQL table.
- Analysis / Modelling: Run the purified data through a modelling algorithm.
- Evaluate: Compare the results with real world information to evaluate the accuracy of the model.
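To make the stages concrete, here is a minimal Python sketch of the steps listed above. Everything in it is made up for illustration: the inline CSV, the column names, the per-city average standing in for a "model", and the reference temperatures used for evaluation. A real pipeline would read from actual sources and use a proper modelling algorithm.

```python
import csv
import io

# Hypothetical raw data; two rows are deliberately broken.
RAW_CSV = """city,temperature
Pune,31.0
Pune,
Oslo,4.0
Oslo,bad
Oslo,6.0
"""

def acquire(text):
    """Acquire: read raw rows from a source (here, an in-memory CSV)."""
    return list(csv.DictReader(io.StringIO(text)))

def clean(rows):
    """Extract and clean: drop rows whose temperature cannot be parsed."""
    cleaned = []
    for row in rows:
        try:
            cleaned.append({"city": row["city"],
                            "temperature": float(row["temperature"])})
        except ValueError:
            continue  # missing or non-numeric value
    return cleaned

def aggregate(rows):
    """Integrate and aggregate: group temperatures by city."""
    grouped = {}
    for row in rows:
        grouped.setdefault(row["city"], []).append(row["temperature"])
    return grouped

def model(grouped):
    """Model: a stand-in that predicts the average temperature per city."""
    return {city: sum(temps) / len(temps) for city, temps in grouped.items()}

def evaluate(predictions, actuals):
    """Evaluate: absolute error against real-world reference values."""
    return {city: abs(predictions[city] - actuals[city]) for city in actuals}

rows = clean(acquire(RAW_CSV))
predictions = model(aggregate(rows))
errors = evaluate(predictions, {"Pune": 30.0, "Oslo": 5.5})
print(predictions, errors)
```

Each stage is a plain function, so any one of them can be swapped out without touching the rest.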
This pipeline is pretty standard. Anyone with a Computer Science background and enough exposure would intuitively follow a pipeline of this fashion.
The principle of Machine Learning
Traditionally, we have used computers or machines to produce an output (O) based on an input (I). The relation between O and I is defined by a function f, where O = f(I).
Machine Learning is about using computers to understand the relation between I and O, and produce f. This is what modelling is about. There are a few modelling algorithms that help produce f that we’ll take a look at next.
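As a small illustration of "producing f", the sketch below learns a function from example (I, O) pairs. It assumes f is a straight line and uses ordinary least squares; the function name `fit_line` and the sample data are my own, chosen only to keep the example short.

```python
def fit_line(inputs, outputs):
    """Learn f(I) = slope * I + intercept from example (I, O) pairs."""
    n = len(inputs)
    mean_i = sum(inputs) / n
    mean_o = sum(outputs) / n
    # Ordinary least squares: slope = covariance(I, O) / variance(I)
    cov = sum((i - mean_i) * (o - mean_o) for i, o in zip(inputs, outputs))
    var = sum((i - mean_i) ** 2 for i in inputs)
    slope = cov / var
    intercept = mean_o - slope * mean_i
    return lambda i: slope * i + intercept

# Hypothetical observations generated by O = 2 * I + 1
inputs = [0.0, 1.0, 2.0, 3.0]
outputs = [1.0, 3.0, 5.0, 7.0]

f = fit_line(inputs, outputs)  # the machine produces f
print(f(4.0))                  # and f can then be applied to unseen input
```

The traditional program would have had `2 * i + 1` written by hand; here it is recovered from the data.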
Classification
When the data is used to create a model that predicts a category for an observation, Classification algorithms are used. Each observation is a vector of parameters, and each parameter is weighed individually by the algorithm. A sample scenario would be identifying a Chair, a Cat or a Car from a picture. Sample data would include pictures labelled Chair, Cat or Car.
- Minimizing classification errors can be a hard task.
- Classification algorithms are susceptible to imbalanced data. A model trained on a few observations of a Chair and many observations of a Car will rarely predict a Chair.
- Yes/No or Boolean classification (that is, with 2 categories) is done using Decision trees or Binomial classification algorithms.
- Datasets with more than 2 categories are popularly modelled using Multi-class classification.
The quality of a classification model can be plotted as the True Positive Rate (TPR), i.e. the number of positives rated by the algorithm as positives divided by the total number of positives, against the False Positive Rate (FPR), i.e. the number of negatives classified by the algorithm as positives divided by the total number of negatives. Plotting these rates across decision thresholds creates the Receiver Operating Characteristic (ROC) curve. The area under this curve (AUC) is used to summarize the quality of the model. The diagonal line, with an area of 0.5, represents random guessing. More area under the curve denotes a better model.
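The TPR and FPR definitions translate almost directly into code. The sketch below sweeps a decision threshold over model scores to collect ROC points and then measures the area with the trapezoidal rule. The `labels` and `scores` lists are invented sample data, and the helper names are mine.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points obtained by sweeping the decision threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        # Everything scored at or above the threshold is predicted positive.
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the ROC curve using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical data: 1 = positive, 0 = negative, with the model's scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(labels, scores)
print(points)
print(auc(points))
```

On this sample the area comes out at 8/9, comfortably above the 0.5 of the diagonal.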
Regression
Regression is used when the model predicts a numerical value for an observation. For example, predicting temperature based on observations of city, date, time and humidity. It is necessary to guard against over-fitting and under-fitting of the model to the supplied observations.
- When evaluating a trained model against test data, it is important to note the difference between f(I) and the actual O. So, [O - f(I)] should be close to zero.
- Computer calculations work better with smooth functions, as opposed to absolute values. Hence, the squared difference [O - f(I)]² should be close to zero.
- When this is applied to training and test data, the summation of all squared errors Σ[O - f(I)]², also known as the sum of squares error (SSE), should be close to zero.
- Regression algorithms minimize the SSE by adjusting the values of the baseline variables (the parameters) in the function.
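The points above can be sketched in a few lines. This example assumes f(I) = w * I + b and minimizes the SSE with plain gradient descent; the learning rate, step count and sample data are arbitrary choices of mine, and real regression libraries typically use more robust solvers.

```python
def sse(inputs, outputs, w, b):
    """Sum of squares error for the candidate function f(I) = w * I + b."""
    return sum((o - (w * i + b)) ** 2 for i, o in zip(inputs, outputs))

def minimize_sse(inputs, outputs, learning_rate=0.02, steps=2000):
    """Adjust w and b step by step in the direction that lowers the SSE."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        residuals = [o - (w * i + b) for i, o in zip(inputs, outputs)]
        # Partial derivatives of the SSE with respect to w and b
        grad_w = -2 * sum(r * i for r, i in zip(residuals, inputs))
        grad_b = -2 * sum(residuals)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Hypothetical observations generated by O = 2 * I + 1
inputs = [0.0, 1.0, 2.0, 3.0]
outputs = [1.0, 3.0, 5.0, 7.0]

w, b = minimize_sse(inputs, outputs)
print(w, b, sse(inputs, outputs, w, b))
```

Squaring the error is what makes the gradient well defined everywhere, which is the "smooth function" advantage over absolute values.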
The choice of Regression algorithm is critical. It depends not only on the nature of O, but also on the distribution of O and its relation to I. Some popular regression algorithms are:
- Simple Linear Regression: A simple linear regression model has a single input I. It is usually used when finding a relation between two continuous variables.
- Ridge Regression: Ridge Regression is used when dealing with multiple I’s. It adds a penalty on large coefficients to limit over-fitting, which is a particular risk when dealing with a large number of parameters.
- SVM Regression: Support Vector Machine Regression treats any error within a threshold range as zero. Beyond the threshold, the error grows in a linear fashion.
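What mostly separates these algorithms is the quantity they minimize. The sketch below shows two of them as plain functions: the ridge objective in its common formulation (SSE plus a penalty on coefficient size) and the epsilon-insensitive error used by SVM Regression. The parameter names `alpha` and `epsilon` follow common convention but are my choice here.

```python
def ridge_objective(residuals, weights, alpha):
    """SSE plus a penalty that grows with the size of the coefficients."""
    squared_error = sum(r ** 2 for r in residuals)
    penalty = alpha * sum(w ** 2 for w in weights)
    return squared_error + penalty

def epsilon_insensitive(residual, epsilon):
    """Zero error inside the threshold, linear growth beyond it."""
    return max(0.0, abs(residual) - epsilon)

# Residuals are O - f(I) for each observation; values are illustrative.
print(ridge_objective([1.0, -2.0], weights=[3.0], alpha=0.5))
print(epsilon_insensitive(0.05, epsilon=0.1))  # inside the threshold
print(epsilon_insensitive(0.5, epsilon=0.1))   # beyond the threshold
```

With `alpha` set to zero the ridge objective reduces to the plain SSE.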
Two types of validation are commonly used for Regression algorithms:
- Cross-validation: With cross-validation, the dataset is split into n folds and a model is trained on n-1 of them. The model is then tested on the 1 remaining fold. This process is repeated n times so that every fold is used for testing once.
- Nested cross-validation: Popular for tuning parameters, especially tricky ones. An inner cross-validation is run for every candidate value of a parameter K to pick the best one, and an outer cross-validation then tests the tuned model on data that was not used for tuning.
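Here is a rough sketch of plain n-fold cross-validation. The `fit` and `score` arguments are placeholders for whatever modelling and scoring functions you use; `fit_mean` and `mean_squared_error` below are trivial stand-ins I wrote just to make the example run.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, test_indices), one pair per fold."""
    indices = list(range(n_samples))
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

def cross_validate(inputs, outputs, n_folds, fit, score):
    """Train on n-1 folds, test on the remaining one, average the scores."""
    scores = []
    for train, test in k_fold_indices(len(inputs), n_folds):
        model = fit([inputs[i] for i in train], [outputs[i] for i in train])
        scores.append(score(model,
                            [inputs[i] for i in test],
                            [outputs[i] for i in test]))
    return sum(scores) / len(scores)

def fit_mean(train_inputs, train_outputs):
    """Stand-in model: always predicts the mean of the training outputs."""
    mean = sum(train_outputs) / len(train_outputs)
    return lambda i: mean

def mean_squared_error(model, test_inputs, test_outputs):
    errors = [(o - model(i)) ** 2 for i, o in zip(test_inputs, test_outputs)]
    return sum(errors) / len(errors)

inputs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
outputs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(cross_validate(inputs, outputs, 3, fit_mean, mean_squared_error))
```

Nested cross-validation would wrap a call like `cross_validate` in a loop over candidate parameter values, run inside each training split of an outer set of folds.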
To sum it up, a good regression model is neither under-fitted nor over-fitted to training data. A good model is one that is simple and suits the data in a reasonable manner. Some error tolerances are accounted for.
Clustering
As the name suggests, clustering is used to group similar observations together. An example is grouping customers with similar buying behaviours. Most clustering scenarios lack ground truth, making validation very difficult during application. The only way to check against ground truth is with labelled test data.
- K-means clustering: The K-means algorithm accepts the number of clusters to be created, and simulates a clustering behaviour on the observations based on randomly selected centers. As the simulation progresses, the centers move towards actual centers of newly forming clusters. This is the most popular clustering algorithm.
- Hierarchical Agglomerative Clustering: This algorithm starts with every point in its own cluster, and the simulation grows these clusters by repeatedly merging the two closest ones, based on the distance between their closest points.
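The K-means behaviour described above fits in a short sketch. This version assumes squared Euclidean distance and picks the starting centers at random from the observations; the `seed` parameter and the sample points are my additions so the run is repeatable.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(points, k, seed=0, max_iterations=100):
    """Return (centers, clusters) after the centers stop moving."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # randomly selected starting centers
    for _ in range(max_iterations):
        # Assign every observation to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda c: squared_distance(p, centers[c]))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster (keep it if empty).
        new_centers = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centers[c]
            for c, cluster in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

# Hypothetical observations forming two well-separated groups
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = k_means(points, k=2)
print(sorted(centers))
```

Swapping `squared_distance` for another metric is all it takes to see how much the choice of distance affects the result.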
In all clustering algorithms, distance metrics play a very important role and have a huge impact on the result. User adaptive distance metrics that consider local density of the data are important.
Recommender systems
Recommender systems commonly use matrix factorization. They are primarily used to recommend items to a user based on the user’s own behaviour and the behaviour of users falling in a similar category. We won’t delve into the details of recommender algorithms, as there is a wide variety of approaches, each suiting its own application.
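For a flavour of matrix factorization, here is one simple variant sketched in Python: known ratings are approximated by the dot product of a small user vector and a small item vector, both learned with stochastic gradient descent. The ratings, the hyperparameters and the function names are all illustrative assumptions, and this is only one of the many approaches mentioned above.

```python
import random

def factorize(ratings, n_users, n_items, n_factors=2, learning_rate=0.01,
              regularization=0.02, epochs=2000, seed=0):
    """Learn factors so that rating(u, i) ~ dot(user[u], item[i])."""
    rng = random.Random(seed)
    user_factors = [[rng.random() for _ in range(n_factors)]
                    for _ in range(n_users)]
    item_factors = [[rng.random() for _ in range(n_factors)]
                    for _ in range(n_items)]
    for _ in range(epochs):
        for (u, i), rating in ratings.items():
            predicted = sum(p * q for p, q in
                            zip(user_factors[u], item_factors[i]))
            error = rating - predicted
            for f in range(n_factors):
                p, q = user_factors[u][f], item_factors[i][f]
                # Gradient step on the squared error, with a small penalty
                user_factors[u][f] += learning_rate * (error * q - regularization * p)
                item_factors[i][f] += learning_rate * (error * p - regularization * q)
    return user_factors, item_factors

def rmse(ratings, user_factors, item_factors):
    """Root mean squared error over the known ratings."""
    squared = [
        (rating - sum(p * q for p, q in
                      zip(user_factors[u], item_factors[i]))) ** 2
        for (u, i), rating in ratings.items()
    ]
    return (sum(squared) / len(squared)) ** 0.5

# Hypothetical (user, item) -> rating entries; missing pairs are unrated.
ratings = {(0, 0): 5, (0, 1): 3, (1, 0): 4, (1, 2): 1,
           (2, 1): 1, (2, 2): 5, (3, 0): 1, (3, 2): 4}

users, items = factorize(ratings, n_users=4, n_items=3)
print(rmse(ratings, users, items))
```

Once trained, the dot product for an unrated (user, item) pair serves as the predicted rating, which is what drives the recommendation.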
This ends a very brief overview of Machine Learning algorithms. Feel free to post your feedback or discussions in the comments, or get in touch on Twitter.