Statistical Learning
Statistics and Learning
Statistics refers to the science of collecting and analyzing numerical data in order to draw inferences about an entire population based on a sample of it.
As we know, data has been a part of our lives since primitive times. Demographic records have been kept ever since we understood the importance of collecting data. As we climbed the ladder of social and technological development, the value of data and of extracting meaningful information from it became integral. As the amount of data grew, it became harder to draw information out of it directly. We needed mathematical tools to generalize from the data and to express the likelihood of an event occurring in the future. These tools fall under Statistics. Statistics takes a sample of the population to understand its characteristics, and it relies on concepts from probability such as the Expected Value (E[X]) and the Variance (σ²).
A statistic is the sample equivalent of a characteristic of the underlying population, computed from the data.
Learning represents the task of drawing relevant information from data without being explicitly told how to perform similar tasks. Mathematical models were prominent for centuries before the era of computers began, and even in these modern times, classical methods based on statistics remain remarkably productive and accurate.
Statistical Learning is defined as the set of mathematical tools used to describe a statistical model that helps in understanding data.
As the name suggests, Statistical Learning comes from the science of statistics. Wherever raw data is in abundance, the use of statistics comes in naturally. Computation of Mean, Variance, and Standard Deviation is important for a more thorough understanding of categorical and numerical data to draw inferences. As probability lays the groundwork for statistics, statistics establish the foundation for Data Science.
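As a small illustration in Python (using NumPy purely as an example, with made-up numbers), these summary quantities can be computed directly from a sample:

```python
import numpy as np

# A small hypothetical numeric sample
sample = np.array([12.0, 15.5, 9.8, 14.2, 11.7, 13.4])

mean = sample.mean()            # sample mean, an estimate of E[X]
variance = sample.var(ddof=1)   # unbiased sample variance, an estimate of sigma^2
std_dev = sample.std(ddof=1)    # sample standard deviation

print(f"Mean: {mean:.2f}, Variance: {variance:.2f}, Std Dev: {std_dev:.2f}")
```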
Machine Learning vs Statistical Learning
The debate between Machine Learning and Statistical Learning has existed for a long time. The latter had its name coined decades ago, when learning models first began to rise; supervised and unsupervised learning both use statistical models. As the use of computers has grown, data is increasingly stored in the form of images, videos, and GIFs. In today's world it is no longer feasible to keep all data in a spreadsheet, and in this fast-developing world of Big Data, non-relational databases have become important.
Machine Learning, on the other hand, is a comparatively newer term that encompasses the methods of classical statistical learning along with more modern algorithms developed to work on these newer forms of data, with much more advanced implementations. The terms ML and SL are often used interchangeably, but the argument over a strict definition of each resurfaces from time to time. Therefore, we differentiate them on the basis of the motive behind their implementation.
Machine Learning :- ML models are designed for performance-oriented tasks. The data is divided into two parts - training data and test data. The model is trained on the training data in such a way that it performs best on the test set. Machine Learning focuses on determining an output value, that is, on the prediction of future events.
Statistical Learning :- SL models are designed to draw inferences from the data. They are not primarily concerned with predicting future events, though they are capable of doing so, which is why models such as Linear Regression and Logistic Regression are also used in machine learning. Here the data is not split into separate training and test sets. Instead of prediction, SL models focus on establishing a relationship between the data and the output values.
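The contrast can be sketched in code. The snippet below (a rough illustration on synthetic data, not a prescription) fits the same linear model twice with scikit-learn: once in the machine-learning style, splitting the data into training and test sets and scoring predictive performance, and once in the statistical-learning style, fitting on all of the data and inspecting the estimated coefficients.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: y depends linearly on one feature plus noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + 5.0 + rng.normal(0, 2.0, size=200)

# Machine-learning style: optimise predictive performance on held-out data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
ml_model = LinearRegression().fit(X_train, y_train)
print("Test R^2 (prediction focus):", ml_model.score(X_test, y_test))

# Statistical-learning style: fit on all data and interpret the coefficients
sl_model = LinearRegression().fit(X, y)
print("Estimated slope and intercept (inference focus):", sl_model.coef_[0], sl_model.intercept_)
```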
Although these two terms can be used interchangeably and each circumscribes features of the other, the difference lies at the root and is clearly understood by researchers in both fields.
History of Statistical Learning
Statistics has been around for thousands of years. However, its implementation in scientific modeling across various branches of science, such as economics and population studies, began in the early 19th century.
- Legendre and Gauss published papers on the method of least squares, which implemented the earliest form of Linear Regression. Linear Regression is used to predict quantitative (continuous) values.
- Fisher put forward Linear Discriminant Analysis in 1936 to predict qualitative values, i.e. values that are observed but cannot be measured on a numeric scale. For example, whether a tumor is malignant or benign, whether it will rain before the match or not, or whether the stock market will increase or decrease. Such data is categorical data. In the 1940s, the cumulative efforts of various authors produced an alternative approach known as Logistic Regression.
- In the early 1970s, Nelder and Wedderburn put forth Generalized Linear Models, an entire class of statistical learning models that includes Linear Regression and Logistic Regression as special cases.
- By the end of the 1970s, many more techniques for learning from data were available. However, almost all of them were linear, because fitting non-linear relationships was computationally impracticable at the time.
- Breiman, Friedman, Olshen, and Stone introduced Classification and Regression Trees. They were also among the first to demonstrate the power of a detailed practical implementation on real-life data, including the use of Cross-Validation for model selection.
- In 1986, Hastie and Tibshirani put forward Generalized Additive Models, a class of non-linear extensions to Generalized Linear Models.
Since then, with the advent of machine learning and other advanced disciplines of artificial intelligence, statistical learning has become an integral part of statistics, focusing not just on inferential mechanisms but also on efficient prediction of future outcomes.
Statistical Learning - Mathematical Approach
Now, to build a better and deeper understanding of statistical learning, let us take up a suitable example. Suppose that we, as statistical analysts - today better known as Business Analysts or Data Analysts - work for a company that reviews and sells used motor vehicles. Our role as analysts is to review the historical data on used vehicles and advise the company on the most efficient selling price for each vehicle.
You can access the dataset below:-
The Used Vehicle Dataset consists of the selling prices of 301 cars and bikes along with the attributes of each vehicle, such as Current Price, Kilometers Driven, and Year of Purchase. We aim to design a model to predict the worth of a used vehicle. Therefore, we must determine the relationship between the vehicular attributes and the selling price, which tells us how changes in the attributes affect the price of a vehicle. Below, the dataset is shown with the selling price plotted as a function of each attribute.
The attributes of the vehicles are input variables, while the selling price is the output variable. Typically, X denotes an input, with a number in the subscript: X1 corresponds to Current Price, X2 to Kilometers Driven, and X3 to Year of Purchase. Input variables are also referred to by other names such as predictors, regressors, features, or independent variables, and sometimes simply as inputs. The output variable, Selling Price, goes by the names response or dependent variable and is denoted by the letter Y.
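In Python with pandas, loading and separating the predictors from the response might look like the sketch below. The file name and column names here are assumptions for illustration; adjust them to whatever the downloaded dataset actually uses.

```python
import pandas as pd

# Hypothetical file and column names -- adjust to match the actual dataset
df = pd.read_csv("used_vehicles.csv")

# Predictors (inputs) X1, X2, X3 and the response Y
X = df[["Present_Price", "Kms_Driven", "Year"]]   # assumed column names
y = df["Selling_Price"]                            # assumed column name

print(df.shape)    # expected to report 301 rows, one per vehicle
print(df.head())
```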
Further, the quantitative response Y for m different features, X = {X1, X2, X3, ...., Xm}, is expressed as a function of X. We assume there exists a relationship between them, whose general form is:
Y = f(X) + ε
Here, f is a fixed but unknown function of X, and ε is a random error term that is independent of X and has zero mean. The error term accounts for any variation in the dependent variable that the independent variables cannot explain. The function f represents the systematic information that X provides about Y. This function f is generally unknown and must be estimated from the data points in our dataset.
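A tiny simulation may make the model Y = f(X) + ε concrete. In this sketch f is chosen by us (so it is known), and ε is zero-mean noise that f cannot explain; in a real problem only the (X, Y) pairs would be observed and f would have to be estimated. All numbers are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

def f(x):
    # The "true" systematic relationship -- unknown in real problems
    return 2.5 * x + 4.0

x = rng.uniform(0, 10, size=100)
epsilon = rng.normal(loc=0.0, scale=1.5, size=100)   # random error, zero mean, independent of x
y = f(x) + epsilon                                   # observed response

print("Mean of the error term (close to 0):", epsilon.mean())
```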
Estimating the function f(X) gives us estimated values of Y, which may differ from the actual values in the dataset. Let us understand this with the help of a visualization. Considering the plot of the attribute Present Price against Selling Price, we estimate f(X) to be a linear function (which means we fit a line to our data).
From the above figure, we observe that the linear function does not pass through all of the data points: some of the values lie above the fitted line and some below it. The vertical distance of each data point from the line (i.e. from f(X)) is the error ε. We must always keep in mind that the mean of these errors should be approximately equal to zero.
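The sketch below mimics this step on simulated data (standing in for Present Price versus Selling Price): a straight line is fitted by least squares and the mean of the residuals is checked to be approximately zero. The numbers are invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated stand-in for Present Price (x) vs Selling Price (y)
x = rng.uniform(1, 20, size=150)
y = 0.8 * x + 0.5 + rng.normal(0, 1.0, size=150)

# Estimate f(X) as a straight line (degree-1 polynomial fit by least squares)
slope, intercept = np.polyfit(x, y, deg=1)

# Residuals: vertical distance of each point from the fitted line
residuals = y - (slope * x + intercept)
print("Mean of residuals (approximately zero):", residuals.mean())
```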
Why Estimate f ?
There are two main reasons that we wish to estimate f : Prediction and Inference.
Prediction
In situations where the inputs X are available but the corresponding output Y cannot be easily obtained, we can predict Y using
$$\hat{Y} = \hat{f}(X)$$
Here, $\hat{f}$ is our estimate of the function f, and $\hat{Y}$ represents the resulting prediction of Y. For prediction, the exact form of $\hat{f}$ is not the important aspect, and it is often treated as a black box: we are not concerned with its exact form as long as it yields accurate predictions for Y.
Now let us talk about the accuracy of our prediction $\hat{Y}$. It depends on two quantities - the Reducible Error and the Irreducible Error; a small simulation after the two points below illustrates the difference.
- Reducible Error :- Realistically, $\hat{f}$ is not a perfect estimate of f; there is always room for error. But this error is reducible, because we can use more appropriate learning models to improve our estimate of f.
- Irreducible Error :- Even if it were possible to form a perfect estimate of f, so that $\hat{Y} = f(X)$, there would still be error in our prediction. This is because Y is also a function of ε, which cannot be predicted from X. This is known as the Irreducible Error, and it is independent of how well we have estimated f.
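Here is a rough simulation of the two error sources, with invented numbers. Even when the true f is used for prediction, the mean squared error does not drop to zero, because the noise ε remains; an imperfect estimate adds reducible error on top of that.

```python
import numpy as np

rng = np.random.default_rng(7)

def true_f(x):
    return 3.0 * x + 2.0          # the true (normally unknown) function

x = rng.uniform(0, 5, size=10_000)
y = true_f(x) + rng.normal(0, 2.0, size=x.size)   # irreducible noise with variance 4

# An imperfect estimate of f (wrong slope and intercept): reducible error present
poor_estimate = 2.5 * x + 3.0
mse_poor = np.mean((y - poor_estimate) ** 2)

# Predicting with the true f itself: only the irreducible error remains
mse_perfect = np.mean((y - true_f(x)) ** 2)

print("MSE with imperfect estimate :", mse_poor)      # reducible + irreducible
print("MSE with the true f         :", mse_perfect)   # close to 4, purely irreducible
```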
Inference
Sometimes we want to know how changes in the input variables (X1, X2, X3, ...., Xm) affect the value of the output. In this case we estimate f, but our goal is not to make predictions; our interest is in establishing the relationship between X and Y. We cannot treat $\hat{f}$ as a black box, because we need to know its exact form. This is the essence of statistics: drawing inferences. You should always try to answer the following questions while drawing inferences (a brief sketch in code follows the list).
- Which attributes are associated with the response?
Often a dataset comprises a large number of attributes, but only a few of them are related to Y. It is important to know which inputs contribute most to the response.
- What is the relationship between the response and each attribute?
We need to understand the weight of each attribute in shaping the response. Some inputs are directly proportional to the output, while others have an inverse relationship. In recent times, much of the data is complex, and depending on that complexity, the relationship of an attribute with the response variable may also depend on the values of other input attributes.
- Can the relationship between the response and each predictor be captured by a linear function, or is the relationship much more complicated?
Historical statistical learning methods mostly have a linear form, which sometimes pays off really well. But often the true relationship is more complex or non-linear, in which case linear models do not serve as an accurate representation of the link between input and output.
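As a hedged sketch of this inference workflow on synthetic data (statsmodels is used here as one convenient option; the variables and numbers are invented), we can fit a linear model on all observations and inspect the estimated coefficients and p-values to address the questions above:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 300

# Three synthetic predictors; only the first two truly affect the response
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 10, n)
x3 = rng.uniform(0, 10, n)               # irrelevant predictor
y = 4.0 * x1 - 1.5 * x2 + rng.normal(0, 2.0, n)

X = sm.add_constant(np.column_stack([x1, x2, x3]))
results = sm.OLS(y, X).fit()

# Coefficients indicate which predictors matter and in which direction;
# p-values help judge whether an association is real or just noise.
print(results.params)
print(results.pvalues)
```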
In real-life scenarios, tasks do not fall neatly under the individual categories of prediction or inference, but rather under a combination of the two.
How to Estimate f ?
There are many linear and non-linear approaches to estimating f that we will explore in much more detail later in the course. But these models share certain characteristics that allow us to categorize them; let us give an overview. Before that, here are some conventions that will be used throughout this course.
We will always denote the number of data points in our dataset by 'n'. For example, in the Used Vehicle Dataset, n = 301. These observations are known as Training Data, because we use them to train and teach our model how to estimate f. The total number of predictors or input attributes is denoted by the letter 'm'. For example, the vehicular dataset discussed earlier has 3 input variables, so m = 3. The value of the jth predictor for the ith data point is written $x_{ij}$, where i = {1, 2, 3,...n} and j = {1, 2, 3,...m}. Correspondingly, $y_i$ represents the response or output for the ith observation.
Thus, dataset can be represented as,
$$\{ (x_1, y_1), (x_2, y_2), \dots, (x_n, y_n) \}$$
where $$x_i = (x_{i1}, x_{i2}, x_{i3}, \dots, x_{im})^T$$
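In code, such a training set is typically held as an n × m matrix of predictor values together with a length-n response vector, so that $x_{ij}$ is simply the (i, j) entry. The toy snippet below is a notational sketch with made-up values, not tied to any particular dataset.

```python
import numpy as np

n, m = 5, 3                           # 5 observations, 3 predictors (toy sizes)
rng = np.random.default_rng(0)

X = rng.uniform(0, 10, size=(n, m))   # row i holds x_i = (x_i1, ..., x_im)
y = rng.uniform(0, 10, size=n)        # y_i is the response for observation i

i, j = 1, 2                           # note: Python arrays are 0-indexed
print("x_ij for i=2, j=3:", X[i, j])
print("Observation x_2 as a vector:", X[i])
```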
Our main goal is to apply a learning technique to the training data in order to estimate the unknown function f such that $Y \approx \hat{f}(X)$ for any observation (X, Y). Most learning methods are characterized as either Parametric or Non-parametric.
Parametric Methods
Such models primarily follow a two-step approach.
- First, we make an assumption about the functional form of f. For example, the simplest assumption is that f is linear in X :
$$f ( X ) = β_0 + β_1X_1 + β_2X_2 + ..... + β_mX_m$$ The above equation describes a linear model, which we will discuss in detail further in the course. Having simplified the task by assuming a form for the unknown function f(X), we only need to estimate the (m+1) coefficients $β_0, β_1, β_2, ..., β_m$.
- After selecting the model, we need a procedure that uses the training data to fit, or train, the model. We want to train the model as accurately as possible, so that :
$$Y \approx β_0 + β_1X_1 + β_2X_2 + ..... + β_mX_m$$
One of the most commonly used approaches to fitting the model expressed in step 1 is known as Least Squares. It is only one of many possible ways to fit a linear model.
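A minimal sketch of least squares for the linear model above, on synthetic data with invented coefficients: the estimates of β are chosen to minimise the sum of squared differences between the observed Y and the model's output.

```python
import numpy as np

rng = np.random.default_rng(5)
n, m = 200, 3

X = rng.uniform(0, 10, size=(n, m))
true_beta = np.array([2.0, -1.0, 0.5])              # true coefficients (unknown in practice)
y = 4.0 + X @ true_beta + rng.normal(0, 1.0, n)      # intercept 4.0 plus zero-mean noise

# Add a column of ones so that beta_0 (the intercept) is estimated too
X_design = np.column_stack([np.ones(n), X])

# Least squares: minimise ||y - X_design @ beta||^2
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print("Estimated coefficients (beta_0, ..., beta_3):", beta_hat)
```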
One major disadvantage of the parametric approach is that the model we choose usually does not match the true, unknown form of f. If the chosen model is too far from the true f, then our estimate of the output will be poor.
We can try to fit more flexible models with different functional forms for f, but doing so means estimating more parameters. The more complex the model, the greater the chance of over-fitting, which briefly means that our model follows the errors and noise too closely. This is a very important concept when framing our estimation model, and we will discuss it in much more detail when we begin implementing our first learning model.
Considering our used vehicle dataset as an example, a linear parametric model would look like :
$$\text{Selling Price} = β_0 + β_1 \times \text{Present Price} + β_2 \times \text{Kilometers Driven} + β_3 \times \text{Year of Purchase}$$
Non-Parametric Methods
The major advantage that non-parametric methods have over parametric ones is that they make no explicit assumption about the functional form of f. Because the shape of f is not assumed, a non-parametric model can fit the training data much more closely, removing the risk that our chosen form is too far away from the true form of f.
However, non-parametric approaches suffer from a major disadvantage: since they do not reduce the problem of estimating an arbitrary function f to estimating a small number of parameters, they require a very large number of observations in order to obtain an accurate estimate of f.
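As a contrast to the parametric sketch above, a non-parametric method such as k-nearest-neighbours regression (used here merely as one possible example, on synthetic data) estimates f locally from nearby observations without assuming any functional form:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(11)

# Synthetic data with a clearly non-linear relationship
x = rng.uniform(0, 10, size=(500, 1))
y = np.sin(x[:, 0]) * x[:, 0] + rng.normal(0, 0.5, size=500)

# No functional form is assumed; each prediction averages the 10 nearest observations
knn = KNeighborsRegressor(n_neighbors=10).fit(x, y)
print("Prediction at x = 3:", knn.predict([[3.0]]))
```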
Summary
- Statistics and Statistical Learning have been around for a while now.
- The difference between statistical learning and machine learning lies in the motive behind the estimation, i.e. Prediction versus Inference.
- Most machine learning tasks are prediction-based, whereas historical statistics dealt with drawing inferences.
- The estimation of the function can be done using two methods - Parametric and Non-Parametric.