# R Or Python? Reasons Behind This Cloud War


This article was published as a part of the Data Science Blogathon.

Have we taken enough steps to preserve nature and protect this life-saving gas? Nature forced the world to talk about oxygen through an unseen virus, Covid-19, which drove up the demand for medical oxygen around the world. So it is our responsibility to protect nature, for example by planting saplings, not only as a social cause but for our own sake.


Just as oxygen is life-saving, data is the industry-saving asset of the technology field. The amount of data generated worldwide is growing dramatically day by day, and tech companies are keenly interested in mining valuable insights from it to drive business growth. Since datasets are mostly very large, it is not possible to extract insights from them manually at the speed the data is generated, so industry experts need technical tools to handle it. Among hundreds of such tools, a cloud war is always raging between two in particular: R and Python.

In this article, we are going to discuss the pros and cons of both programming languages in handling the data from a Data Science point of view.

R vs Python: Why this controversy?

In general, both Python and R are among the most preferred programming languages for Data Science learners, from beginners to professionals. The two languages have considerable similarities and produce comparably efficient results.

Both were created around the early 1990s

Since they are open source programming languages, anyone can easily download and access them free of cost.

They have a lot of libraries and specialized algorithmic functions for working on and solving data science and data analytics problems.

Unlike other data analytics tools such as SAS, SPSS, and MATLAB, they do not restrict users in terms of cost or complexity when solving problems.

Both provide a user-friendly working experience that even non-programmers can easily understand.

New features and improvements appear frequently in both tools for tackling problems in Data Science, Machine Learning, Deep Learning, Artificial Intelligence, and more.

Hence neither appears inferior to the other, and that is the root of the R vs Python controversy. Let's take a brief look at each to understand this better.

What are Python and R?


Python was first released in 1991 and was designed by Guido van Rossum. It is an object-oriented, general-purpose programming language built around a philosophy that emphasizes code readability and efficiency.

For programmers and people from a technical background who want to pursue their data science passion by tackling math and statistics concepts, Python is the best partner. Hence it is the favorite programming language of most Data Science learners.

It has dedicated libraries for Machine Learning and Deep Learning, listed in the Python Package Index (PyPI), and documentation for those libraries is available on the official Python site.


Ross Ihaka and Robert Gentleman created R, first released in 1993 as an implementation of the S programming language. It was built to produce effective results in data analysis, statistical methods, and visualisation.


It offers one of the richest environments for performing data analysis. Like Python, it has around 13,000 packages in the Comprehensive R Archive Network (CRAN), many aimed at deep analytics.

It is most popular among scholars and researchers; most projects written in R fall under research. It is commonly used with its own integrated development environment (IDE), RStudio, for a better and more user-friendly experience.

How to choose a better one?


The reasons for choosing either language are largely the same, so you need to be wise when picking between these two. Consider the nature of your domain and your personal preference when selecting between R and Python.

If your work involves more general-purpose coding and less research, prefer Python; if it involves research and conceptual processes, choose R. Python is the programmer's language, whereas R is the language of academicians and researchers.

Everything depends on your interests and passion. Python code is easy to understand and can handle a broader range of data science tasks in general; R code reads like basic academic notation, is easy to learn, and is the most effective analytics tool for visualization.

Key difference


|                       | Python                                                                                                        | R                                                                                                            |
|-----------------------|---------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------|
| What it is            | A general-purpose language for data science                                                                   | The best language for statisticians, researchers, and non-coders                                             |
| First appeared        | Early 1990s                                                                                                   | Early 1990s                                                                                                  |
| Best for              | Deployment and production                                                                                     | Data analysis, statistics, and research                                                                      |
| Dataset handling      | Handles huge datasets easily; accepts all dataset formats (.csv, .xlsx, etc.)                                 | Handles huge datasets easily; accepts all dataset formats (.csv, .xlsx, etc.)                                |
| Primary users         | Programmers and developers                                                                                    | Academicians and researchers                                                                                 |
| Learning curve        | Easy to understand                                                                                            | Easy to learn                                                                                                |
| IDEs                  | Notebook, Spyder, Colab                                                                                       | RStudio                                                                                                      |
| Packages available at | PyPI                                                                                                          | CRAN                                                                                                         |
| Popular libraries     | Pandas (data manipulation), NumPy (scientific computing), Matplotlib (graphics), Scikit-learn (machine learning) | dplyr (data manipulation), stringr (string manipulation), ggplot2 (graphics), caret (machine learning)       |


Advantages of Python:

A production-ready and general-purpose language

Best in class for computation, code readability, speed, and handling functions

Has the best functionalities and packages for deep learning and NLP

Brings together people from different backgrounds

Working in a notebook is simple and easy to share with colleagues

Advantages of R:

Best language for producing graphs and visualization

A user-ready language with a huge number of packages for handling data analysis tasks with great efficiency

Has the best functionalities and packages for handling time-series data

Has a rich ecosystem with cutting-edge packages and an active community

Complex statistical concepts can be solved with simple code

Limitations of Python:

Python does not have as many alternatives for packages as R provides

Python is poorer at visualization and producing graphs when compared to R

Because Python has fewer packages than R, it can be harder for non-programmers to grasp its coding concepts

Limitations of R:

R is comparatively slow, partly due to poorly optimized code, though there are packages that improve its performance

Choosing the right package is time-consuming because of the huge number available

It is not as good as Python for deep learning and NLP

What to Use?

Usage is purely based on the user's need. Python is the most efficient tool for Machine Learning, Deep Learning, Data Science, and deployment needs. Although it has notable libraries for maths, statistics, time series, etc., it often falls short for business analysis, econometrics, and research-oriented work. It is a production-ready language because it can integrate an entire workflow into a single tool.




Both languages have broadly similar pros and cons. Beyond everything else, the better choice between Python and R comes down to the following points:

What is the theme of your work?

What about your colleagues’ programming knowledge?

What is the time period of your work?

And finally your area of interest?

Message from the Author:

Dear Readers,

From this article, I hope you have gained at least a little knowledge of how to choose between Python and R based on your needs.

I request you to share your valuable thoughts about this article. It will be very useful for my future work.

Thanks and Regards

Shankar DK (Data Science Student)

The media shown in this article are not owned by Analytics Vidhya and are used at the Author's discretion.



Tree Based Algorithms: A Complete Tutorial From Scratch (In R & Python)


Explanation of tree based algorithms from scratch in R and python

Learn machine learning concepts like decision trees, random forest, boosting, bagging, ensemble methods

Implementation of these tree based algorithms in R and Python

Introduction to Tree Based Algorithms

Tree based algorithms are considered to be among the best and most widely used supervised learning methods. They empower predictive models with high accuracy, stability, and ease of interpretation. Unlike linear models, they map non-linear relationships quite well. They are adept at solving any kind of problem at hand (classification or regression).

Methods like decision trees, random forest, gradient boosting are being popularly used in all kinds of data science problems. Hence, for every analyst (fresher also), it’s important to learn these algorithms and use them for modeling.

This tutorial is meant to help beginners learn tree based algorithms from scratch. After the successful completion of this tutorial, one is expected to become proficient at using tree based algorithms and build predictive models.

Note: This tutorial requires no prior knowledge of machine learning. However, elementary knowledge of R or Python will be helpful. To get started you can follow full tutorial in R and full tutorial in Python. You can also check out the ‘Introduction to Data Science‘ course covering Python, Statistics and Predictive Modeling.

We will also cover some ensemble techniques using tree-based models below.

1. What is a Decision Tree ? How does it work ?

A decision tree is a type of supervised learning algorithm (having a predefined target variable) that is mostly used in classification problems. It works for both categorical and continuous input and output variables. In this technique, we split the population or sample into two or more homogeneous sets (or sub-populations) based on the most significant splitter/differentiator among the input variables.


Let's say we have a sample of 30 students with three variables: Gender (Boy/Girl), Class (IX/X), and Height (5 to 6 ft). 15 out of these 30 play cricket in their leisure time. Now, I want to create a model to predict who will play cricket during leisure time. In this problem, we need to segregate students who play cricket based on the most significant input variable among the three.

This is where a decision tree helps: it segregates the students based on all values of the three variables and identifies the variable that creates the best homogeneous sets of students (which are heterogeneous to each other). In the snapshot below, you can see that the variable Gender identifies the best homogeneous sets compared to the other two variables.

As mentioned above, a decision tree identifies the most significant variable and the value of it that gives the best homogeneous sets of the population. Now the question arises: how does it identify the variable and the split? To do this, decision trees use various algorithms, which we will discuss in the following section.

Types of Decision Trees

The type of a decision tree is based on the type of target variable we have. It can be of two types:

Categorical Variable Decision Tree: A decision tree with a categorical target variable is called a categorical variable decision tree. Example: in the student scenario above, the target variable was "Student will play cricket or not", i.e. YES or NO.

Continuous Variable Decision Tree: A decision tree with a continuous target variable is called a continuous variable decision tree.

Example: say we want to predict whether a customer will pay his renewal premium with an insurance company (yes/no). We know that a customer's income is a significant variable, but the insurance company does not have income details for all customers. Since this is an important variable, we can build a decision tree to predict customer income based on occupation, product, and various other variables. In this case, we are predicting values of a continuous variable.

Important Terminology related to Tree based Algorithms

Let’s look at the basic terminology used with Decision trees:

Root Node: It represents entire population or sample and this further gets divided into two or more homogeneous sets.

Splitting: It is a process of dividing a node into two or more sub-nodes.

Decision Node: When a sub-node splits into further sub-nodes, then it is called decision node.

Leaf/Terminal Node: Nodes that do not split are called Leaf or Terminal nodes.

Pruning: When we remove sub-nodes of a decision node, the process is called pruning. It is the opposite of splitting.

Branch / Sub-Tree: A sub section of entire tree is called branch or sub-tree.

Parent and Child Node: A node that is divided into sub-nodes is called the parent node of those sub-nodes, and the sub-nodes are its children.


Advantages

Easy to Understand: Decision tree output is very easy to understand, even for people from a non-analytical background. It does not require any statistical knowledge to read and interpret. Its graphical representation is very intuitive, and users can easily relate it to their hypotheses.

Useful in Data Exploration: A decision tree is one of the fastest ways to identify the most significant variables and the relations between two or more variables. With the help of decision trees, we can create new variables/features that have better power to predict the target variable. You can refer to the article (Trick to enhance power of regression model) for one such trick. It can also be used in the data exploration stage: for example, when we have information in hundreds of variables, a decision tree helps identify the most significant ones.

Less Data Cleaning Required: It requires less data cleaning than some other modeling techniques, being fairly robust to outliers and missing values.

Data Type Is Not a Constraint: It can handle both numerical and categorical variables.

Non-Parametric Method: The decision tree is considered a non-parametric method. This means that decision trees make no assumptions about the space distribution or the classifier structure.

Disadvantages

Overfitting: Overfitting is one of the most practical difficulties for decision tree models. It is addressed by setting constraints on model parameters and by pruning (discussed in detail below).

Not Fit for Continuous Variables: While working with continuous numerical variables, a decision tree loses information when it categorizes variables into different buckets.

2. Regression Trees vs Classification Trees

We all know that the terminal nodes (or leaves) lie at the bottom of the decision tree. Decision trees are typically drawn upside down, so that the leaves are at the bottom and the root is at the top (shown below).

Both types of trees work almost identically; let's look at the primary differences and similarities between classification and regression trees:

Regression trees are used when dependent variable is continuous. Classification trees are used when dependent variable is categorical.

In a regression tree, the value assigned to a terminal node is the mean response of the training observations falling in that region. Thus, if an unseen observation falls in that region, we predict the mean value.

In a classification tree, the value (class) assigned to a terminal node is the mode of the training observations falling in that region. Thus, if an unseen observation falls in that region, we predict the mode value.
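This mean-versus-mode rule can be illustrated with a toy terminal node in plain Python; the observation values below are made up for illustration:

```python
from statistics import mean, mode

# Hypothetical observations that fell into one terminal node
regression_leaf = [4.0, 5.0, 6.0, 5.0]      # continuous target
classification_leaf = ["yes", "yes", "no"]  # categorical target

# A regression tree predicts the mean response of the leaf...
print(mean(regression_leaf))      # 5.0
# ...while a classification tree predicts the mode (majority class)
print(mode(classification_leaf))  # yes
```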

Both types of trees divide the predictor space (the independent variables) into distinct and non-overlapping regions. For simplicity, you can think of these regions as high-dimensional boxes.

Both follow a top-down greedy approach known as recursive binary splitting. It is called 'top-down' because it begins at the top of the tree, when all observations are in a single region, and successively splits the predictor space into two new branches down the tree. It is called 'greedy' because the algorithm looks only for the best variable available for the current split, not for future splits that might lead to a better tree.

This splitting process continues until a user-defined stopping criterion is reached. For example, we can tell the algorithm to stop once the number of observations per node becomes less than 50.

In both cases, the splitting process results in a fully grown tree unless a stopping criterion halts it first. But a fully grown tree is likely to overfit the data, leading to poor accuracy on unseen data. This brings in 'pruning', one of the techniques used to tackle overfitting. We'll learn more about it in a later section.

3. How does a tree based algorithm decide where to split?

The decision of making strategic splits heavily affects a tree's accuracy. The decision criteria differ for classification and regression trees.

Decision trees use multiple algorithms to decide whether to split a node into two or more sub-nodes. The creation of sub-nodes increases the homogeneity of the resultant sub-nodes; in other words, the purity of the nodes increases with respect to the target variable. The decision tree splits the nodes on all available variables and then selects the split which results in the most homogeneous sub-nodes.

The algorithm selection is also based on type of target variables. Let’s look at the four most commonly used algorithms in decision tree:

Gini says: if we select two items from a population at random, they must be of the same class, and the probability of this is 1 if the population is pure.

It works with categorical target variable “Success” or “Failure”.

It performs only Binary splits

The higher the value of Gini, the higher the homogeneity.

CART (Classification and Regression Tree) uses Gini method to create binary splits.

Steps to Calculate Gini for a split

Calculate Gini for sub-nodes using the formula: sum of squares of the probabilities of success and failure (p^2 + q^2).

Calculate Gini for split using weighted Gini score of each node of that split

Example: – Referring to example used above, where we want to segregate the students based on target variable ( playing cricket or not ). In the snapshot below, we split the population using two input variables Gender and Class. Now, I want to identify which split is producing more homogeneous sub-nodes using Gini .

Split on Gender:

Calculate, Gini for sub-node Female = (0.2)*(0.2)+(0.8)*(0.8)=0.68

Gini for sub-node Male = (0.65)*(0.65)+(0.35)*(0.35)=0.55

Calculate weighted Gini for Split Gender = (10/30)*0.68+(20/30)*0.55 = 0.59

Similar for Split on Class:

Gini for sub-node Class IX = (0.43)*(0.43)+(0.57)*(0.57)=0.51

Gini for sub-node Class X = (0.56)*(0.56)+(0.44)*(0.44)=0.51

Calculate weighted Gini for Split Class = (14/30)*0.51+(16/30)*0.51 = 0.51

Above, you can see that Gini score for Split on Gender is higher than Split on Class, hence, the node split will take place on Gender.

You might often come across the term ‘Gini Impurity’ which is determined by subtracting the gini value from 1. So mathematically we can say,

Gini Impurity = 1-Gini
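As a quick check, the two steps above can be carried out in plain Python with the numbers from the worked example (the helper function is a sketch, not part of any library):

```python
def gini(p_success, p_failure):
    # Gini score of a node: p^2 + q^2
    return p_success ** 2 + p_failure ** 2

# Split on Gender: 2/10 females and 13/20 males play cricket
g_female = gini(0.2, 0.8)     # 0.68
g_male = gini(0.65, 0.35)     # ~0.55
weighted_gender = (10 / 30) * g_female + (20 / 30) * g_male  # ~0.59

# Split on Class: 6/14 in Class IX and 9/16 in Class X play cricket
g_ix = gini(0.43, 0.57)
g_x = gini(0.56, 0.44)
weighted_class = (14 / 30) * g_ix + (16 / 30) * g_x          # ~0.51

# Gender gives the higher (more homogeneous) weighted Gini, so it wins
print(weighted_gender > weighted_class)  # True
```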


It is an algorithm to find the statistical significance of the differences between sub-nodes and the parent node. We measure it by the sum of squares of the standardized differences between the observed and expected frequencies of the target variable.

It works with categorical target variable “Success” or “Failure”.

It can perform two or more splits.

The higher the Chi-Square value, the higher the statistical significance of the difference between a sub-node and its parent.

Chi-Square of each node is calculated using formula,

Chi-square = ((Actual - Expected)^2 / Expected)^(1/2)

It generates a tree called CHAID (Chi-square Automatic Interaction Detector).

Steps to Calculate Chi-square for a split:

Calculate Chi-square for individual node by calculating the deviation for Success and Failure both

Calculate the Chi-square of the split as the sum of the success and failure Chi-square values of each node of the split

Example: Let’s work with above example that we have used to calculate Gini.

Split on Gender:

First we populate the Female node: the actual values for "Play Cricket" and "Not Play Cricket" are 2 and 8 respectively.

Calculate the expected values for "Play Cricket" and "Not Play Cricket": here it is 5 for both, because the parent node has a 50% probability and we apply the same probability to the Female count (10).

Calculate deviations by using formula, Actual – Expected. It is for “Play Cricket” (2 – 5 = -3) and for “Not play cricket” ( 8 – 5 = 3).

Calculate the Chi-square of the node for "Play Cricket" and "Not Play Cricket" using the formula ((Actual - Expected)^2 / Expected)^(1/2). You can refer to the table below for the calculation.

Follow similar steps for calculating Chi-square value for Male node.

Now add all Chi-square values to calculate Chi-square for split Gender.

Split on Class:

Perform similar steps of calculation for split on Class and you will come up with below table.

Above, you can see that Chi-square also identifies the Gender split as more significant compared to Class.
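The gender-versus-class comparison can be reproduced in plain Python using the article's square-root formula; the per-node counts come from the worked example above (the helper function is a sketch for illustration):

```python
def chi_value(actual, expected):
    # The article's formula: ((Actual - Expected)^2 / Expected)^(1/2)
    return ((actual - expected) ** 2 / expected) ** 0.5

# Gender split: the parent plays cricket 50%, so expected = half of each node size
female = chi_value(2, 5) + chi_value(8, 5)     # play + not play
male = chi_value(13, 10) + chi_value(7, 10)
chi_gender = female + male                     # ~4.58

# Class split: expected = half of 14 (Class IX) and half of 16 (Class X)
ix = chi_value(6, 7) + chi_value(8, 7)
x = chi_value(9, 8) + chi_value(7, 8)
chi_class = ix + x                             # ~1.46

print(chi_gender > chi_class)  # True -> the Gender split is more significant
```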

Information Gain:

Look at the image below and think which node can be described easily. I am sure, your answer is C because it requires less information as all values are similar. On the other hand, B requires more information to describe it and A requires the maximum information. In other words, we can say that C is a Pure node, B is less Impure and A is more impure.

Now we can conclude that a less impure node requires less information to describe it, and a more impure node requires more. Information theory defines this degree of disorganization in a system as Entropy. If the sample is completely homogeneous, the entropy is zero; if the sample is equally divided (50%-50%), it has an entropy of one.

Entropy can be calculated using the formula: Entropy = -p log2(p) - q log2(q)

Here p and q are the probabilities of success and failure, respectively, in that node. Entropy is also used with categorical target variables. The algorithm chooses the split with the lowest entropy compared to the parent node and other candidate splits. The lower the entropy, the better.

Steps to calculate entropy for a split:

Calculate entropy of parent node

Calculate entropy of each individual node of split and calculate weighted average of all sub-nodes available in split.

Example: Let’s use this method to identify best split for student example.

Entropy for parent node = -(15/30) log2 (15/30) - (15/30) log2 (15/30) = 1. Here 1 shows that it is an impure node.

Entropy for Female node = -(2/10) log2 (2/10) – (8/10) log2 (8/10) = 0.72 and for male node,  -(13/20) log2 (13/20) – (7/20) log2 (7/20) = 0.93

Entropy for split Gender = Weighted entropy of sub-nodes = (10/30)*0.72 + (20/30)*0.93 = 0.86

Entropy for Class IX node, -(6/14) log2 (6/14) – (8/14) log2 (8/14) = 0.99 and for Class X node,  -(9/16) log2 (9/16) – (7/16) log2 (7/16) = 0.99.

Entropy for split Class =  (14/30)*0.99 + (16/30)*0.99 = 0.99

Above, you can see that the entropy for the split on Gender is the lowest of all, so the tree will split on Gender. We can derive the information gain of a split as the parent entropy minus the weighted entropy of the split; here the parent entropy is 1, so Information Gain (Gender) = 1 - 0.86 = 0.14.
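The same entropy arithmetic can be verified in a few lines of Python (a sketch using the counts from the example):

```python
from math import log2

def entropy(p, q):
    # Entropy of a node: -p*log2(p) - q*log2(q)
    return -p * log2(p) - q * log2(q)

parent = entropy(15 / 30, 15 / 30)  # 1.0 -> a fully impure node

# Weighted entropy of each candidate split
split_gender = (10 / 30) * entropy(2 / 10, 8 / 10) + (20 / 30) * entropy(13 / 20, 7 / 20)
split_class = (14 / 30) * entropy(6 / 14, 8 / 14) + (16 / 30) * entropy(9 / 16, 7 / 16)

print(round(split_gender, 2))  # 0.86 -> lowest entropy, so split on Gender
print(round(split_class, 2))   # 0.99
```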

Reduction in Variance

Until now, we have discussed algorithms for categorical target variables. Reduction in variance is the algorithm used for continuous target variables (regression problems). It uses the standard variance formula to choose the best split: the split with the lower weighted variance is selected as the criterion to split the population.

Variance = Σ(X - X̄)² / n, where X̄ is the mean of the values, X is an actual value, and n is the number of values.

Steps to calculate Variance:

Calculate variance for each node.

Calculate variance for each split as weighted average of each node variance.

Example:- Let’s assign numerical value 1 for play cricket and 0 for not playing cricket. Now follow the steps to identify the right split:

Variance for Root node, here mean value is (15*1 + 15*0)/30 = 0.5 and we have 15 one and 15 zero. Now variance would be ((1-0.5)^2+(1-0.5)^2+….15 times+(0-0.5)^2+(0-0.5)^2+…15 times) / 30, this can be written as (15*(1-0.5)^2+15*(0-0.5)^2) / 30 = 0.25

Mean of Female node =  (2*1+8*0)/10=0.2 and Variance = (2*(1-0.2)^2+8*(0-0.2)^2) / 10 = 0.16

Mean of Male Node = (13*1+7*0)/20=0.65 and Variance = (13*(1-0.65)^2+7*(0-0.65)^2) / 20 = 0.23

Variance for Split Gender = Weighted Variance of Sub-nodes = (10/30)*0.16 + (20/30) *0.23 = 0.21

Mean of Class IX node =  (6*1+8*0)/14=0.43 and Variance = (6*(1-0.43)^2+8*(0-0.43)^2) / 14= 0.24

Mean of Class X node =  (9*1+7*0)/16=0.56 and Variance = (9*(1-0.56)^2+7*(0-0.56)^2) / 16 = 0.25

Variance for Split Class = (14/30)*0.24 + (16/30)*0.25 = 0.25

Above, you can see that the Gender split has the lower variance (0.21 vs 0.25 for Class), so the split takes place on the Gender variable.
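A minimal Python sketch reproducing these variance numbers (the helper function assumes a 0/1-coded target, as in the example):

```python
def node_variance(n_ones, n_zeros):
    # Variance of a node with 0/1 targets: sum((x - mean)^2) / n
    n = n_ones + n_zeros
    m = n_ones / n
    return (n_ones * (1 - m) ** 2 + n_zeros * (0 - m) ** 2) / n

root = node_variance(15, 15)  # 0.25

# Weighted variance of each candidate split
gender_split = (10 / 30) * node_variance(2, 8) + (20 / 30) * node_variance(13, 7)  # ~0.21
class_split = (14 / 30) * node_variance(6, 8) + (16 / 30) * node_variance(9, 7)    # ~0.25

print(gender_split < class_split)  # True -> split on Gender
```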

So far, we have learnt the basics of decision trees and the decision-making process for choosing the best splits when building a tree model. As mentioned, decision trees can be applied to both regression and classification problems. Let's understand these aspects in detail.

4. What are the key parameters of tree based algorithms and how can we avoid over-fitting in decision trees?

Overfitting is one of the key challenges faced while using tree based algorithms. If no limit is set on a decision tree's size, it will give you 100% accuracy on the training set, because in the worst case it ends up making one leaf per observation. Thus, preventing overfitting is pivotal while modeling a decision tree, and it can be done in 2 ways:

Setting constraints on tree size

Tree pruning

Let’s discuss both of these briefly.

Setting Constraints on tree based algorithms

This can be done by using the various parameters that define a tree. First, let's look at the general structure of a decision tree:

The parameters used for defining a tree are explained below. They are described irrespective of tool; it is important to understand their role in tree modeling. These parameters are available in both R and Python.

Minimum samples for a node split

Defines the minimum number of samples (or observations) which are required in a node to be considered for splitting.

Used to control over-fitting. Higher values prevent a model from learning relations which might be highly specific to the particular sample selected for a tree.

Too high values can lead to under-fitting hence, it should be tuned using CV.

Minimum samples for a terminal node (leaf)

Defines the minimum samples (or observations) required in a terminal node or leaf.

Used to control over-fitting similar to min_samples_split.

Generally lower values should be chosen for imbalanced class problems because the regions in which the minority class will be in majority will be very small.

Maximum depth of tree (vertical depth)

The maximum depth of a tree.

Used to control over-fitting as higher depth will allow model to learn relations very specific to a particular sample.

Should be tuned using CV.

Maximum number of terminal nodes

The maximum number of terminal nodes or leaves in a tree.

Can be defined in place of max_depth. Since binary trees are created, a depth of ‘n’ would produce a maximum of 2^n leaves.

Maximum features to consider for split

The number of features to consider while searching for a best split. These will be randomly selected.

As a rule of thumb, the square root of the total number of features works well, but we should check up to 30-40% of the total number of features.

Higher values can lead to over-fitting but depends on case to case.
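As an illustrative sketch, here is how the constraints above map onto scikit-learn's `DecisionTreeClassifier`; the parameter names are sklearn's, and the tiny dataset is made up:

```python
from sklearn.tree import DecisionTreeClassifier

# Toy data: 8 observations, 2 features (made up for illustration)
X = [[0, 0], [0, 1], [1, 0], [1, 1],
     [2, 0], [2, 1], [3, 0], [3, 1]]
y = [0, 0, 0, 1, 1, 1, 0, 1]

model = DecisionTreeClassifier(
    min_samples_split=2,  # minimum samples for a node split
    min_samples_leaf=1,   # minimum samples for a terminal node (leaf)
    max_depth=3,          # maximum (vertical) depth of the tree
    max_leaf_nodes=8,     # maximum number of terminal nodes (<= 2^3)
    max_features=None,    # features considered per split (None = all)
    random_state=0,
)
model.fit(X, y)
print(model.get_depth() <= 3)  # True: the depth constraint holds
```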

Pruning in tree based algorithms

As discussed earlier, the technique of setting constraints is a greedy approach: it checks for the best split instantaneously and moves forward until one of the specified stopping conditions is reached. Consider the following case when you're driving:

There are 2 lanes:

A lane with cars moving at 80km/h

A lane with trucks moving at 30km/h

At this instant, you are the yellow car and you have 2 choices:

Take a left and overtake the other 2 cars quickly

Keep moving in the present lane

Let's analyze these choices. With the first, you'll immediately overtake the car ahead, reach the back of the truck, and start moving at 30 km/h while looking for an opportunity to move back right; all the cars originally behind you move ahead in the meanwhile. This would be the optimal choice if your objective were to maximize the distance covered in, say, the next 10 seconds. With the second, you sail through at the same speed, pass the trucks, and then overtake depending on the situation ahead. Greedy you!

This is exactly the difference between a normal decision tree and pruning. A decision tree with constraints won't see the truck ahead and adopts the greedy approach of taking a left. With pruning, in effect, we look a few steps ahead before making a choice.

So we know pruning is better. But how to implement it in decision tree? The idea is simple.

We first make the decision tree to a large depth.

Then we start at the bottom and remove leaves that give us negative returns when compared from the top.

Suppose a split gives a gain of, say, -10 (a loss of 10) and the next split on that node gives a gain of 20. A simple decision tree would stop at step 1, but with pruning we see that the overall gain is +10 and keep both splits.

Note that at the time of writing, sklearn's decision tree classifier did not support pruning; newer scikit-learn versions (0.22+) add cost-complexity pruning via the ccp_alpha parameter. Advanced packages like xgboost have adopted tree pruning in their implementations. The rpart library in R also provides a function to prune. Good for R users!
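For illustration, scikit-learn 0.22+ supports cost-complexity pruning through the `ccp_alpha` parameter; here is a minimal sketch on synthetic, made-up data:

```python
import random
from sklearn.tree import DecisionTreeClassifier

# Made-up noisy data so the unconstrained tree grows large
random.seed(0)
X = [[random.random(), random.random()] for _ in range(200)]
y = [int(a + b + random.gauss(0, 0.3) > 1.0) for a, b in X]

full = DecisionTreeClassifier(random_state=0).fit(X, y)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01).fit(X, y)

# Cost-complexity pruning collapses subtrees whose gain does not
# justify their complexity, so the pruned tree is never larger
print(pruned.tree_.node_count <= full.tree_.node_count)  # True
```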

5. Are tree based algorithms better than linear models?

“If I can use logistic regression for classification problems and linear regression for regression problems, why is there a need to use trees”? Many of us have this question. And, this is a valid one too.

Actually, you can use any algorithm. It is dependent on the type of problem you are solving. Let’s look at some key factors which will help you to decide which algorithm to use:

If the relationship between dependent & independent variable is well approximated by a linear model, linear regression will outperform tree based model.

If there is a high non-linearity & complex relationship between dependent & independent variables, a tree model will outperform a classical regression method.

If you need to build a model which is easy to explain to people, a decision tree model will always do better than a linear model. Decision tree models are even simpler to interpret than linear regression!
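The first two factors can be seen on synthetic data. Here is a sketch (the sine-shaped target and all parameter values are made up for illustration) comparing a linear model and a tree on a deliberately non-linear relationship:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 6, 300)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.1, 300)  # non-linear target

lin = LinearRegression().fit(X, y)
tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)

# R^2 scores: the tree can follow the sine curve, the straight line cannot
print(round(lin.score(X, y), 2), round(tree.score(X, y), 2))
```

Reversing the experiment with a genuinely linear target would flip the result in favour of linear regression.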

6. Working with tree based algorithms in R and Python

For R users and Python users, decision tree is quite easy to implement. Let’s quickly look at the set of codes that can get you started with this algorithm. For ease of use, I’ve shared standard codes where you’ll need to replace your data set name and variables to get started.

In fact, you can build the decision tree in Python right here! Here’s a live coding window for you to play around with the code and generate results:

For R users, there are multiple packages available to implement decision tree such as ctree, rpart, tree etc.

> library(rpart)
> x <- cbind(x_train,y_train)

# grow tree
> fit <- rpart(y_train ~ ., data = x, method="class")

#Predict Output
> predicted = predict(fit, x_test)
In the code above:

y_train – represents dependent variable.

x_train – represents independent variable

x – represents training data.

For Python users, below is the code:

#Import Library
#Import other necessary libraries like pandas, numpy...
from sklearn import tree

#Assumed you have, X (predictor) and Y (target) for training data set and x_test(predictor) of test_dataset

# Create tree object
model = tree.DecisionTreeClassifier(criterion='gini')
# for classification, here you can change the algorithm as gini or entropy (information gain). By default it is gini.
# model = tree.DecisionTreeRegressor() for regression

# Train the model using the training sets and check score
model.fit(X, Y)
model.score(X, Y)

#Predict Output
predicted = model.predict(x_test)

7. What are ensemble methods in tree based algorithms ?

The literal meaning of the word ‘ensemble’ is group. Ensemble methods involve a group of predictive models to achieve better accuracy and model stability. Ensemble methods are known to give a supreme boost to tree based models.

Like every other model, a tree based algorithm also suffers from the plague of bias and variance. Bias means, ‘how much on an average are the predicted values different from the actual value.’ Variance means, ‘how different will the predictions of the model be at the same point if different samples are taken from the same population’.

You build a small tree and you will get a model with low variance and high bias. How do you manage to balance the trade-off between bias and variance?

Normally, as you increase the complexity of your model, you will see a reduction in prediction error due to lower bias in the model. As you continue to make your model more complex, you end up over-fitting your model and your model will start suffering from high variance.

A champion model should maintain a balance between these two types of errors. This is known as the trade-off management of bias-variance errors. Ensemble learning is one way to execute this trade off analysis.
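This trade-off can be seen directly by growing trees of increasing depth. A sketch (the noisy toy dataset and the depth values are made up for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# flip_y adds label noise so that a fully grown tree over-fits
X, y = make_classification(n_samples=800, n_features=20, flip_y=0.2,
                           random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=3)

for depth in (2, 6, None):  # increasing model complexity
    m = DecisionTreeClassifier(max_depth=depth, random_state=3).fit(X_tr, y_tr)
    print(depth, round(m.score(X_tr, y_tr), 2), round(m.score(X_te, y_te), 2))
```

Training accuracy keeps climbing with depth while test accuracy stalls or falls: the extra complexity is fitting the label noise, i.e. variance.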

Some of the commonly used ensemble methods include: Bagging, Boosting and Stacking. In this tutorial, we’ll focus on Bagging and Boosting in detail.

8. What is Bagging? How does it work?

Bagging is an ensemble technique used to reduce the variance of our predictions by combining the result of multiple classifiers modeled on different sub-samples of the same data set. The following figure will make it clearer:

The steps followed in bagging are:

Create Multiple DataSets:

Sampling is done with replacement on the original data and new datasets are formed.

The new data sets can have a fraction of the columns as well as rows, which are generally hyper-parameters in a bagging model

Taking row and column fractions less than 1 helps in making robust models, less prone to overfitting

Build Multiple Classifiers:

Classifiers are built on each data set.

Generally the same classifier is modeled on each data set and predictions are made.

Combine Classifiers:

The predictions of all the classifiers are combined using a mean, median or mode value depending on the problem at hand.

The combined values are generally more robust than a single model.

Note that the number of models built here is not a hyper-parameter to be tuned for accuracy: a higher number of models is always better than, or gives similar performance to, a lower number. It can be theoretically shown that the variance of the combined predictions is reduced to 1/n (n: number of classifiers) of the original variance, under some assumptions.
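The steps above can be sketched with scikit-learn’s BaggingClassifier, whose default base model is a decision tree (the fractions and counts below are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# 50 trees, each trained on a bootstrap sample using 80% of the rows
# and 80% of the columns; predictions are combined by majority vote.
bag = BaggingClassifier(n_estimators=50, max_samples=0.8,
                        max_features=0.8, bootstrap=True, random_state=42)
bag.fit(X, y)
print(round(bag.score(X, y), 2))
```

max_samples and max_features are exactly the row and column fractions described in step 1.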

There are various implementations of bagging models. Random forest is one of them and we’ll discuss it next.

9. What is Random Forest ? How does it work?

Random Forest is considered to be a panacea for all data science problems. On a funny note, when you can’t think of any algorithm (irrespective of situation), use random forest!

Random Forest is a versatile machine learning method capable of performing both regression and classification tasks. It also handles dimension reduction, missing values, outlier values and other essential steps of data exploration fairly well. It is a type of ensemble learning method, where a group of weak models combine to form a powerful model.

How does it work?

In Random Forest, we grow multiple trees as opposed to a single tree in CART model (see comparison between CART and Random Forest here, part1 and part2). To classify a new object based on attributes, each tree gives a classification and we say the tree “votes” for that class. The forest chooses the classification having the most votes (over all the trees in the forest) and in case of regression, it takes the average of outputs by different trees.

It works in the following manner. Each tree is planted & grown as follows:

Assume the number of cases in the training set is N. Then, a sample of these N cases is taken at random but with replacement. This sample will be the training set for growing the tree.

If there are M input variables, a number m<M is specified such that at each node, m variables are selected at random out of the M. The best split on these m is used to split the node. The value of m is held constant while we grow the forest.

Each tree is grown to the largest extent possible and  there is no pruning.

Predict new data by aggregating the predictions of the ntree trees (i.e., majority votes for classification, average for regression).

To understand more in detail about this algorithm using a case study, please read this article “Introduction to Random forest – Simplified“.

Advantages of Random Forest

This algorithm can solve both types of problems, i.e. classification and regression, and does a decent estimation on both fronts.

It has an effective method for estimating missing data and maintains accuracy when a large proportion of the data are missing.

It has methods for balancing errors in data sets where classes are imbalanced.

The capabilities of the above can be extended to unlabeled data, leading to unsupervised clustering, data views and outlier detection.

Random Forest involves sampling of the input data with replacement, called bootstrap sampling. Here about one third of the data is not used for training and can be used for testing; these are called the out-of-bag samples. The error estimated on these out-of-bag samples is known as the out-of-bag error. Studies of out-of-bag error estimates give evidence that the out-of-bag estimate is as accurate as using a test set of the same size as the training set. Therefore, using the out-of-bag error estimate removes the need for a set-aside test set.

It surely does a good job at classification, but not as good a job for regression problems, as it does not give precise continuous predictions. In the case of regression, it cannot predict beyond the range of values in the training data, and it may over-fit data sets that are particularly noisy.

Random Forest can feel like a black box approach for statistical modelers – you have very little control on what the model does. You can at best – try different parameters and random seeds!

Python & R implementation

Random forests have commonly known implementations in R packages and Python scikit-learn. Let’s look at the code of loading random forest model in R and Python below:

R Code

> library(randomForest)
> x <- cbind(x_train,y_train)

# Fitting model
> fit <- randomForest(y_train ~ ., data = x, ntree = 500)

#Predict Output
> predicted = predict(fit, x_test)
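For Python users, a counterpart sketch with scikit-learn’s RandomForestClassifier (a toy dataset stands in for your own x_train and y_train):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=5)

# n_estimators trees, each grown on a bootstrap sample; a random
# subset of features is considered at every split.
model = RandomForestClassifier(n_estimators=100, random_state=5)
model.fit(X, y)
predicted = model.predict(X)
print(round(model.score(X, y), 2))
```

For regression, swap in RandomForestRegressor, which averages the trees’ outputs instead of taking a majority vote.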

10. What is Boosting? How does it work?

Definition: The term ‘Boosting’ refers to a family of algorithms which converts weak learner to strong learners.

Let’s understand this definition in detail by solving a problem of spam email identification:

How would you classify an email as SPAM or not? Like everyone else, our initial approach would be to identify ‘spam’ and ‘not spam’ emails using following criteria. If:

Email has only one image file (promotional image), It’s a SPAM

Email has only link(s), It’s a SPAM

Email body consist of sentence like “You won a prize money of $ xxxxxx”, It’s a SPAM

Email from known source, Not a SPAM

Above, we’ve defined multiple rules to classify an email into ‘spam’ or ‘not spam’. But, do you think these rules individually are strong enough to successfully classify an email? No.

Individually, these rules are not powerful enough to classify an email into ‘spam’ or ‘not spam’. Therefore, these rules are called weak learners.

To convert weak learners into a strong learner, we’ll combine the prediction of each weak learner using methods like:

Using average/ weighted average

Considering the prediction with the higher vote

For example: Above, we have defined 5 weak learners. Out of these 5, 3 are voted as ‘SPAM’ and 2 are voted as ‘Not a SPAM’. In this case, by default, we’ll consider the email as SPAM because we have a higher number of votes (3) for ‘SPAM’.

How does it work?

Now we know that boosting combines weak learners, a.k.a. base learners, to form a strong rule. An immediate question which should pop up in your mind is, ‘How does boosting identify weak rules?‘

To find a weak rule, we apply base learning (ML) algorithms with a different distribution each time. Each time a base learning algorithm is applied, it generates a new weak prediction rule. This is an iterative process. After many iterations, the boosting algorithm combines these weak rules into a single strong prediction rule. This is how the ensemble model is built.

Here’s another question which might haunt you, ‘How do we choose a different distribution for each round?’

For choosing the right distribution, here are the following steps:

Step 1: The base learner takes all the distributions and assigns equal weight or attention to each observation.

Step 2: If there is any prediction error caused by the first base learning algorithm, then we pay higher attention to observations having prediction error. Then, we apply the next base learning algorithm.

Step 3: Iterate Step 2 till the limit of the base learning algorithm is reached or higher accuracy is achieved.
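These three steps are essentially the AdaBoost procedure. A sketch with scikit-learn (toy data; the number of rounds is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=400, random_state=0)

# Each round fits a weak learner (a decision stump by default) on
# re-weighted data, paying more attention to earlier mistakes.
boost = AdaBoostClassifier(n_estimators=50, random_state=0)
boost.fit(X, y)
print(round(boost.score(X, y), 2))
```

The final prediction is a weighted vote of all 50 weak learners, which is exactly the weak-to-strong combination described above.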

There are many boosting algorithms which impart additional boost to model’s accuracy. In this tutorial, we’ll learn about the two most commonly used algorithms i.e. Gradient Boosting (GBM) and XGboost.

11. Which is more powerful: GBM or Xgboost?


Regularization:

Standard GBM implementation has no regularization like XGBoost; the regularization in XGBoost therefore helps to reduce overfitting.

In fact, XGBoost is also known as ‘regularized boosting‘ technique.

Parallel Processing:

XGBoost implements parallel processing and is blazingly faster as compared to GBM.

But hang on, we know that boosting is a sequential process, so how can it be parallelized? Each tree can be built only after the previous one, so the trees themselves are not built in parallel; rather, XGBoost parallelizes the construction of each individual tree (for example, the split search across features) using all cores. Check this link out to explore further.

XGBoost also supports implementation on Hadoop.

High Flexibility

XGBoost allows users to define custom optimization objectives and evaluation criteria.

This adds a whole new dimension to the model and there is no limit to what we can do.

Handling Missing Values

XGBoost has an in-built routine to handle missing values.

The user is required to supply a value different from other observations and pass that as a parameter. XGBoost tries different things as it encounters a missing value on each node and learns which path to take for missing values in the future.

Tree Pruning:

A GBM would stop splitting a node when it encounters a negative loss in the split. Thus it is more of a greedy algorithm.

XGBoost, on the other hand, makes splits up to the max_depth specified and then starts pruning the tree backwards, removing splits beyond which there is no positive gain.

Built-in Cross-Validation

XGBoost allows the user to run a cross-validation at each iteration of the boosting process, so it is easy to get the exact optimum number of boosting iterations in a single run.

This is unlike GBM, where we have to run a grid search and only a limited set of values can be tested.

Continue on Existing Model

The user can start training an XGBoost model from the last iteration of a previous run. This can be a significant advantage in certain applications. The GBM implementation of sklearn also has this feature, so they are even on this point.

12. Working with GBM in R and Python

Before we start working, let’s quickly understand the important parameters and the working of this algorithm. This will be helpful for both R and Python users. Below is the overall pseudo-code of GBM algorithm for 2 classes:

1. Initialize the outcome
2. Iterate from 1 to the total number of trees
  2.1 Update the weights for targets based on the previous run (higher for the ones mis-classified)
  2.2 Fit the model on a selected subsample of data
  2.3 Make predictions on the full set of observations
  2.4 Update the output with the current results, taking into account the learning rate
3. Return the final output

This is an extremely simplified (probably naive) explanation of GBM’s working. But it will help beginners understand this algorithm.

Lets consider the important GBM parameters used to improve model performance in Python:


learning_rate

This determines the impact of each tree on the final outcome (step 2.4). GBM works by starting with an initial estimate which is updated using the output of each tree. The learning parameter controls the magnitude of this change in the estimates.

Lower values are generally preferred as they make the model robust to the specific characteristics of each tree, thus allowing it to generalize well.

Lower values would, however, require a higher number of trees to model all the relations and will be computationally expensive.


n_estimators

The number of sequential trees to be modeled (step 2).

Though GBM is fairly robust for a higher number of trees, it can still overfit at a point. Hence, this should be tuned using CV for a particular learning rate.


subsample

The fraction of observations to be selected for each tree. Selection is done by random sampling.

Values slightly less than 1 make the model robust by reducing the variance.

Typical values ~0.8 generally work fine but can be fine-tuned further.

Apart from these, there are certain miscellaneous parameters which affect overall functionality:


loss

It refers to the loss function to be minimized in each split.

It can have various values for the classification and regression cases. Generally the default values work fine. Other values should be chosen only if you understand their impact on the model.


init

This affects initialization of the output.

This can be used if we have made another model whose outcome is to be used as the initial estimates for GBM.


random_state

The random number seed, so that the same random numbers are generated every time.

This is important for parameter tuning. If we don’t fix the random number, then we’ll have different outcomes for subsequent runs on the same parameters and it becomes difficult to compare models.

It can potentially result in overfitting to a particular random sample selected. We can try running models for different random samples, which is computationally expensive and generally not used.


verbose

The type of output to be printed when the model fits. The different values can be:

0: no output generated (default)

1: output generated for trees in certain intervals


warm_start

This parameter has an interesting application and can help a lot if used judiciously.


presort

Select whether to presort data for faster splits.

It makes the selection automatically by default but it can be changed if needed.

I know its a long list of parameters but I have simplified it for you in an excel file which you can download from this GitHub repository.

For R users, using the caret package, there are four main tuning parameters:

n.trees – It refers to the number of iterations, i.e. the number of trees to be grown

interaction.depth – It determines the complexity of the tree i.e. total number of splits it has to perform on a tree (starting from a single node)

 shrinkage – It refers to the learning rate. This is similar to learning_rate in python (shown above).

n.minobsinnode – It refers to minimum number of training samples required in a node to perform splitting

GBM in R (with cross validation)

I’ve shared the standard codes in R and Python. At your end, you’ll be required to change the value of dependent variable and data set name used in the codes below. Considering the ease of implementing GBM in R, one can easily perform tasks like cross validation and grid search with this package.







> library(caret)
> x <- cbind(x_train,y_train)

# Set up cross-validation
> fitControl <- trainControl(method = "cv", number = 10)

# Define the tuning grid (example values; n.trees = 500)
> gbmGrid <- expand.grid(interaction.depth = 2,
                         n.trees = 500,
                         shrinkage = 0.1,
                         n.minobsinnode = 10)

# Fitting model
> fit <- train(y_train ~ ., data = x,
               method = "gbm",
               trControl = fitControl,
               verbose = FALSE,
               tuneGrid = gbmGrid)

#Predict Output
> predicted = predict(fit, x_test)

GBM in Python
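A minimal sketch with scikit-learn’s GradientBoostingClassifier, wiring in the three tuning parameters discussed above (the toy dataset and values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=400, random_state=1)

# learning_rate, n_estimators and subsample are the three main
# performance-tuning parameters discussed above
model = GradientBoostingClassifier(learning_rate=0.1, n_estimators=100,
                                   subsample=0.8, random_state=1)
model.fit(X, y)
print(round(model.score(X, y), 2))
```

For regression problems, GradientBoostingRegressor takes the same tuning parameters.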

13. Working with XGBoost in R and Python

R Tutorial: For R users, this is a complete tutorial on XGboost which explains the parameters along with codes in R. Check Tutorial.

Python Tutorial: For Python users, this is a comprehensive tutorial on XGBoost, good to get you started. Check Tutorial.

14. Where to practice ?

Practice is the one and true method of mastering any concept. Hence, you need to start practicing if you wish to master these algorithms.

Till here, you’ve gained significant knowledge of tree based algorithms along with their practical implementation. It’s time that you start working on them. Here are open practice problems where you can participate and check your live rankings on the leaderboard:

End Notes

Tree based algorithms are important for every data scientist to learn. In fact, tree models are known to provide among the best model performance in the family of machine learning algorithms. In this tutorial, we covered tree based modeling up to GBM and XGBoost. And with this, we come to the end of this tutorial.

We discussed tree based algorithms from scratch. We learnt the importance of decision trees and how that simplistic concept is used in boosting algorithms. For better understanding, I would suggest you continue practicing these algorithms hands-on. Also, do keep note of the parameters associated with boosting algorithms. I hope this tutorial has enriched you with complete knowledge of tree based modeling.

Note – The discussions of this article are going on at AV’s Discuss portal. Join here! You can test your skills and knowledge. Check out Live Competitions and compete with best Data Scientists from all over the world.


Top 8 Reasons Accountants Should Have Accounting Software On The Cloud

We have made significant progress in the last few years towards a digital world. The following data is available.

A study by Statista shows that digital transformation spending reached USD 1.3 trillion by 2023, a 10% increase compared to the previous year. This is not all: global digital transformation spending is predicted to reach USD 2.4 trillion within the next two years.

Technology must be flexible, scalable, and practical as the digitally-driven industries shift to new ways of working. Cloud hosting is one of these comfortable technologies that can help you increase data security, scalability, and accessibility.

Host accounting software in the cloud to increase employee productivity and improve business operations efficiency.

Why should you host your accounting software on the cloud?

Host accounting software in the cloud and you can do everything from collecting large amounts of data to processing high-level tax season information.

Here are Some of The Benefits of Hosting Your Accounting Software in The Cloud

1. Unified Working

You can collaborate in one place if you have accounting software hosted on the cloud. Your application and data are kept in one central location that can be accessed by IT administrators at any time. This allows for fewer errors and better control over the IT team.

2. Remote Accessibility

The best thing about hosting an accounting program online is the possibility to access it remotely. If you have QuickBooks Pro installed, you can only access the software from the dedicated desktop hosting it. The cloud allows you to access the software from any location.

3. Regular Backups

Hosting accounting software in the cloud means that we use services provided by a cloud hosting company. Your CSP usually takes automated backups of your data. The system creates new backups at regular intervals to replace the older ones.

Your backup files are kept on cloud servers for between 15 and 30 days, depending on the provider.

You can access your data from your backup in case of a natural disaster. The cloud uses the principle of redundancy to reduce dependency on office infrastructure, even in the event of a disaster.

4. Third-Party Integrations – Vital for Accounting Software

Third-party integrations with the cloud are easy. For a seamless stream of data, you can integrate your tax and payroll software with your accounting software. This integration allows your team to enjoy seamless data connectivity, improve productivity, and have a more unified working environment.

Also read:

Top 6 Tips to Stay Focused on Your Financial Goals

5. Multi-Device Access

Remote access is possible when accounting software is hosted on the cloud. You can also access data and software from any device: your smartphone, desktop, tablet, or laptop. Any operating system can be used, including Windows, macOS, Android, and iOS.

Suppose, for example, your accounting software is hosted in the cloud and you have previously used it in a Windows environment at your office. The same interface and cloud image can be used to open the same software from a MacBook.

6. Data Security

Host accounting software in the cloud to increase security for your business. These are just a few ways to do it:

The provider physically secures the cloud infrastructure with the aid of a CCTV camera or other equipment.

The cloud employs various security measures, including multi-factor authentication, access control, DDoS prevention, updating, patching, Intrusion Detection, and Prevention.

7. Affordability

Hosting accounting software in the cloud is a great deal. Your IT team will spend more time and resources maintaining desktops in-house than using the cloud. You must also make sure that your desktops are always updated and repurposed. Hardware is expensive.

Cloud computing eliminates the need to manage IT infrastructure. You can also reduce the size and complexity of your IT staff, and upgrade system resources from the cloud so that you don’t need to buy new PCs.

Also read:

Top 10 Best Artificial Intelligence Software

8. Scalability

The cloud is also scalable. Cloud service providers have a lot of resources. You can access resources and storage whenever you need them.

You might spend a lot if you wanted to increase storage and performance on in-house infrastructure, and you can’t scale storage back down when you no longer need it.


Accounting software is used by most accounting firms to automate tasks. It allows you to maintain your books and accounts without making costly errors or complicated calculations. You can improve the features and functionality of this accounting software by hosting it on the cloud.

Cloud computing offers high performance, scalability, and security. To receive high-quality services, you need to find the right host provider.

Swirl Package – Easy Way To “Learn R, In R”

People usually quote a steep learning curve as one of the reasons against R, when comparing R vs. Python. The reality is that people miss out on some easier ways to learn R. In this article, we’ll introduce you with one such way to learn R in a fun and interactive way.

This way is none other than use of Swirl Package to learn R. Its tagline ‘Learn R in R’ gives a clear picture of what this package intends to establish. Yes! this inbuilt package can act as your R programming trainer and make you familiar with basics of R in a simplified manner. Moreover, the actions of this trainer will be based on the choices you enter in R console, like the kind of topic to study or the quiz you want to solve.

Why should you learn R?

Here are the reasons to learn R:

The big data and data science industry is rapidly growing, and so is the need for data scientists and statisticians. By 2023, the United States could face a shortage of 140,000 to 190,000 people with deep analytics skills, and 1.5 million managers and analysts who know how to use big data for information-based decision making (Source: McKinsey Global Institute).

For learning R online, there is a plethora of options, like taking up MOOCs on platforms like Coursera, edX and Udacity, or taking up courses taught by industry experts at online data science training institutes like Jigsaw Academy or Edureka. These options often involve a lot of study time, week on week, and a financial commitment, so if you just want to explore R, then Swirl could be a great option.

Also See: Comprehensive learning path to learn R from scratch

Is Swirl really for YOU?

Here are a few important points, such as for whom this package might be really useful, in which scenarios this package would work best, and when one should look at options other than Swirl for learning R.

Swirl is great for those people who:

are trying to learn R just for fun

are just figuring out what R is about

have no interest in networking on MOOCs

like interactive learning without an instructor

don’t like watching online videos

can overcome technical hurdles with a quick Google search

On the other hand, if you are planning to make a career out of R programming, then learning only with Swirl package would not suffice. You should definitely learn about the business applications that depend on R programming and further you need to understand how it is used in real-time industry scenarios. As such, you should go beyond just learning R with Swirl package if you:

want to take up R programming as a career option

want to learn from industry experts

plan to network on MOOCs

want to know more about the practical applications of R

require external support when stuck with technical hurdles

All in all, depending on whichever situation you are in, definitely there is no harm in starting your learning on R programming with Swirl package.

How to get started with the Swirl Package?

Let’s walk through the steps involved for getting started with Swirl package.

Before we begin, it’s necessary to ensure both R and RStudio are already installed on your system. You can follow the video link for completing the R installation on Windows platforms. If you are on Linux or Mac platforms, plenty of resources are available to help you with the respective R installations.

Though RStudio installation is not mandatory, it’s highly recommended because of the unique GUI interface which will make your experience with R much more enjoyable. Once installation is done, the next step is to install the package and then load it to use its functions. This can be attained by the below commands:

> install.packages("swirl")
> library(swirl)

After loading the swirl package in your R console, the first function in the swirl package you should enter is swirl(). It will prompt you to enter your name, and thereafter all swirl package commands will address you by your name, which gives a feeling of customized training being provided by the computer. After entering the name, it will further prompt you to select any one of the listed course options to begin your training with.

In terms of course selection, you can either enter one of options listed in R console or you can do it manually by downloading the respective course of your interest from Swirl course repository. This task can be achieved using the below commands shown on the help page.

Let’s select option 1 to get started with the “R Programming” course of the Swirl package. This course will automatically get installed on your system. After the course is installed, it will further prompt you to enter which lesson you want to start with. As you can see below, there are different lessons like Basic Building Blocks, Workspace and Files, and Base Graphics under the R Programming course.

If you manage to complete all 15 lessons present under the “R Programming” course, you can successfully call yourself a basic R programmer, which would be your first achievement under R skills. This seems to be a straightforward way for anyone to pick up R programming skills with just the help of the Swirl package and no other external help.

However, this package falls short in providing more data science project examples of the kind widely used in the industry. One can definitely become a good R programmer, but doing data science is altogether a different game compared to picking up a programming tool. It is all about how you can leverage R tool skills in solving a business problem as part of your data science project. That would require another blog post altogether; for now, let us stick with how to acquire R tool skills using the Swirl package.

End Notes

In this article, we discussed the significance of the Swirl package in R, i.e. how this package can be easily adopted by beginners to learn R in a more fun and interactive way. You also learnt about the importance of R programming, whether you are the right match to use Swirl, and how you can get started with the Swirl package in R.

If you like what you just read & want to continue your analytics learning, subscribe to our emails, follow us on twitter or like our facebook page.


Linear Regression In R

Linear regression is a regression model that uses a straight line to describe the relationship between variables. It finds the line of best fit through your data by searching for the value of the regression coefficient(s) that minimizes the total error of the model.

There are two main types of linear regression:

Simple linear regression uses only one independent variable

Multiple linear regression uses two or more independent variables

In this step-by-step guide, we will walk you through linear regression in R using two sample datasets.

Simple linear regression

The first dataset contains observations about income (in a range of $15k to $75k) and happiness (rated on a scale of 1 to 10) in an imaginary sample of 500 people. The income values are divided by 10,000 to make the income data match the scale of the happiness scores (so a value of $2 represents $20,000, $3 is $30,000, etc.)

Multiple linear regression

The second dataset contains observations on the percentage of people biking to work each day, the percentage of people smoking, and the percentage of people with heart disease in an imaginary sample of 500 towns.

Download the sample datasets to try it yourself.

Simple regression dataset Multiple regression dataset

Getting started in R

To install the packages you need for the analysis, run this code (you only need to do this once):


Next, load the packages into your R environment by running this code (you need to do this every time you restart R):


Step 1: Load the data into R

Follow these four steps for each dataset:

In RStudio, go to File > Import dataset > From Text (base).

Choose the data file you have downloaded (the simple or multiple regression dataset), and an Import Dataset window pops up.

In the Data Frame window, you should see an X (index) column and columns listing the data for each of the variables (income and happiness, or biking, smoking, and heart disease).

Click on the Import button, and the file should appear in your Environment tab on the upper right side of the RStudio screen.
After you’ve loaded the data, check that it has been read in correctly using summary().

Simple regression
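The stripped code here was presumably a summary() call on the imported data frame. A self-contained sketch, with simulated stand-in data in place of the downloaded CSV (the name income.data and the variable ranges are inferred from the text):

```r
# Simulated stand-in for the downloaded income.data CSV: 500 people,
# income in $10,000s from 1.5 to 7.5, happiness on a 0-10 scale
set.seed(42)
income.data <- data.frame(income = runif(500, min = 1.5, max = 7.5))
income.data$happiness <- 0.2 + 0.71 * income.data$income + rnorm(500, sd = 0.7)

summary(income.data)  # min, quartiles, median, mean, max for both variables
```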


Because both our variables are quantitative, running this function prints a table in the console with a numeric summary of the data. This tells us the minimum, median, mean, and maximum values of the independent variable (income) and the dependent variable (happiness):

Multiple regression
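The corresponding stripped snippet for the second dataset would be a summary() call as well; simulated stand-in data is used here (the name heart.data and the variable ranges are inferred from the text):

```r
# Simulated stand-in for the downloaded heart.data CSV (500 towns)
set.seed(42)
heart.data <- data.frame(
  biking = runif(500, min = 1, max = 75),
  smoking = runif(500, min = 0.5, max = 30))
heart.data$heart.disease <- 15 - 0.2 * heart.data$biking +
  0.178 * heart.data$smoking + rnorm(500, sd = 0.65)

summary(heart.data)
```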


Again, because the variables are quantitative, running the code produces a numeric summary of the data for the independent variables (smoking and biking) and the dependent variable (heart disease):


Step 2: Make sure your data meet the assumptions

We can use R to check that our data meet the four main assumptions for linear regression.

Simple regression

Independence of observations (aka no autocorrelation)

Because we only have one independent variable and one dependent variable, we don’t need to test for any hidden relationships among variables.

If you know that you have autocorrelation within variables (i.e. multiple observations of the same test subject), then do not proceed with a simple linear regression! Use a structured model, like a linear mixed-effects model, instead.


Normality

To check whether the dependent variable follows a normal distribution, use the hist() function.
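The stripped snippet here would be a histogram of the dependent variable; a self-contained sketch with simulated stand-in data for income.data:

```r
# Stand-in for income.data, simulated so the example runs without the CSV
set.seed(42)
income.data <- data.frame(income = runif(500, 1.5, 7.5))
income.data$happiness <- 0.2 + 0.71 * income.data$income + rnorm(500, sd = 0.7)

hist(income.data$happiness)  # look for a roughly bell-shaped distribution
```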


The observations are roughly bell-shaped (more observations in the middle of the distribution, fewer on the tails), so we can proceed with the linear regression.


Linearity

The relationship between the independent and dependent variable must be linear. We can test this visually with a scatter plot to see whether the distribution of data points could be described with a straight line.

plot(happiness ~ income, data = income.data)

The relationship looks roughly linear, so we can proceed with the linear model.

Homoscedasticity (aka homogeneity of variance)

This means that the prediction error doesn’t change significantly over the range of prediction of the model. We can test this assumption later, after fitting the linear model.

Multiple regression

Independence of observations (aka no autocorrelation)

Use the cor() function to test the relationship between your independent variables and make sure they aren’t too highly correlated.
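A self-contained sketch of the cor() check, using simulated stand-in data (note that the value 0.015 quoted below comes from the real dataset; independently simulated variables will give a different, but similarly small, value):

```r
# Stand-in for heart.data: biking and smoking drawn independently
set.seed(42)
heart.data <- data.frame(
  biking = runif(500, 1, 75),
  smoking = runif(500, 0.5, 30))

cor(heart.data$biking, heart.data$smoking)  # near zero for independent draws
```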


When we run this code, the output is 0.015. The correlation between biking and smoking is small (0.015 is only a 1.5% correlation), so we can include both parameters in our model.


Normality

Use the hist() function to test whether your dependent variable follows a normal distribution.
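As with the simple regression, the stripped snippet was presumably a histogram of the dependent variable; a stand-in sketch:

```r
# Stand-in for heart.data, simulated so the example runs without the CSV
set.seed(42)
heart.data <- data.frame(
  biking = runif(500, 1, 75),
  smoking = runif(500, 0.5, 30))
heart.data$heart.disease <- 15 - 0.2 * heart.data$biking +
  0.178 * heart.data$smoking + rnorm(500, sd = 0.65)

hist(heart.data$heart.disease)
```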


The distribution of observations is roughly bell-shaped, so we can proceed with the linear regression.


Linearity

We can check this using two scatterplots: one for biking and heart disease, and one for smoking and heart disease.

plot(heart.disease ~ biking, data = heart.data)

plot(heart.disease ~ smoking, data = heart.data)

Although the relationship between smoking and heart disease is a bit less clear, it still appears linear. We can proceed with linear regression.


Homoscedasticity

We will check this after we make the model.

Step 3: Perform the linear regression analysis

Now that you’ve determined your data meet the assumptions, you can perform a linear regression analysis to evaluate the relationship between the independent and dependent variables.

Simple regression: income and happiness

Let’s see if there’s a linear relationship between income and happiness in our survey of 500 people with incomes ranging from $15k to $75k, where happiness is measured on a scale of 1 to 10.

To perform a simple linear regression analysis and check the results, you need to run two lines of code. The first line of code makes the linear model, and the second line prints out the summary of the model:

income.happiness.lm <- lm(happiness ~ income, data = income.data)
summary(income.happiness.lm)

The output looks like this:

This output table first presents the model equation, then summarizes the model residuals (see step 4).

The Coefficients section shows:

The estimates (Estimate) for the model parameters – the value of the y-intercept (in this case 0.204) and the estimated effect of income on happiness (0.713).

The standard error of the estimated values (Std. Error).

The test statistic (t value: the parameter estimate divided by its standard error).

The final three lines are model diagnostics – the most important thing to note is the p value (here it is 2.2e-16, or almost zero), which will indicate whether the model fits the data well.

From these results, we can say that there is a significant positive relationship between income and happiness (p value < 0.001), with a 0.713-unit (+/- 0.01) increase in happiness for every unit increase in income.

Multiple regression: biking, smoking, and heart disease

Let’s see if there’s a linear relationship between biking to work, smoking, and heart disease in our imaginary survey of 500 towns. The rates of biking to work range between 1 and 75%, rates of smoking between 0.5 and 30%, and rates of heart disease between 0.5% and 20.5%.

To test the relationship, we first fit a linear model with heart disease as the dependent variable and biking and smoking as the independent variables. Run these two lines of code:

heart.disease.lm <- lm(heart.disease ~ biking + smoking, data = heart.data)
summary(heart.disease.lm)

The output looks like this:

The estimated effect of biking on heart disease is -0.2, while the estimated effect of smoking is 0.178.

This means that for every 1% increase in biking to work, there is a correlated 0.2% decrease in the incidence of heart disease. Meanwhile, for every 1% increase in smoking, there is a 0.178% increase in the rate of heart disease.

The standard errors for these regression coefficients are very small, and the t statistics are very large (-147 and 50.4, respectively). The p values reflect these small errors and large t statistics. For both parameters, there is almost zero probability that this effect is due to chance.

Remember that these data are made up for this example, so in real life these relationships would not be nearly so clear!

Step 4: Check for homoscedasticity

Before proceeding with data visualization, we should make sure that our models fit the homoscedasticity assumption of the linear model.

Simple regression

We can run plot(income.happiness.lm) to check whether the observed data meets our model assumptions:
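A self-contained sketch of the residual-plot check, again using simulated stand-in data for income.data:

```r
# Stand-in data and model so the four diagnostic plots can be reproduced
set.seed(42)
income.data <- data.frame(income = runif(500, 1.5, 7.5))
income.data$happiness <- 0.2 + 0.71 * income.data$income + rnorm(500, sd = 0.7)
income.happiness.lm <- lm(happiness ~ income, data = income.data)

par(mfrow = c(2, 2))   # 2 rows x 2 columns for the four diagnostic plots
plot(income.happiness.lm)
par(mfrow = c(1, 1))   # back to one plot per window
```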


Note that the par(mfrow()) command will divide the Plots window into the number of rows and columns specified in the brackets. So par(mfrow=c(2,2)) divides it up into two rows and two columns. To go back to plotting one graph in the entire window, set the parameters again and replace the (2,2) with (1,1).

These are the residual plots produced by the code:

Residuals are the unexplained variance. They are not exactly the same as model error, but they are calculated from it, so seeing a bias in the residuals would also indicate a bias in the error.

The most important thing to look for is that the red lines representing the mean of the residuals are all basically horizontal and centered around zero. This means there are no outliers or biases in the data that would make a linear regression invalid.

In the Normal Q-Q plot in the top right, we can see that the real residuals from our model form an almost perfectly one-to-one line with the theoretical residuals from a perfect model.

Based on these residuals, we can say that our model meets the assumption of homoscedasticity.

Multiple regression

Again, we should check that our model is actually a good fit for the data, and that we don’t have large variation in the model error, by running this code:
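The stripped code here would be the same diagnostic-plot call for the multiple regression model; a stand-in sketch:

```r
# Stand-in data and model for the multiple regression diagnostics
set.seed(42)
heart.data <- data.frame(
  biking = runif(500, 1, 75),
  smoking = runif(500, 0.5, 30))
heart.data$heart.disease <- 15 - 0.2 * heart.data$biking +
  0.178 * heart.data$smoking + rnorm(500, sd = 0.65)
heart.disease.lm <- lm(heart.disease ~ biking + smoking, data = heart.data)

par(mfrow = c(2, 2))
plot(heart.disease.lm)
par(mfrow = c(1, 1))
```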


The output looks like this:

As with our simple regression, the residuals show no bias, so we can say our model fits the assumption of homoscedasticity.

Step 5: Visualize the results with a graph

Next, we can plot the data and the regression line from our linear regression model so that the results can be shared.

Simple regression

Follow 4 steps to visualize the results of your simple linear regression.

Plot the data points on a graph

income.graph <- ggplot(income.data, aes(x = income, y = happiness)) + geom_point()
income.graph

Add the linear regression line to the plotted data

Add the regression line using geom_smooth() and typing in lm as your method for creating the line. This will add the line of the linear regression as well as the standard error of the estimate (in this case +/- 0.01) as a light grey stripe surrounding the line:

income.graph <- income.graph + geom_smooth(method = "lm", col = "black")
income.graph

Add the equation for the regression line.

income.graph <- income.graph + stat_regline_equation(label.x = 3, label.y = 7)
income.graph

Make the graph ready for publication

We can add some style parameters using theme_bw() and making custom labels using labs().

income.graph + theme_bw() + labs(title = "Reported happiness as a function of income", x = "Income (x$10,000)", y = "Happiness score (0 to 10)")

This produces the finished graph that you can include in your papers:
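Putting the four steps together, a self-contained sketch (ggplot2 required; the equation layer is added only if ggpubr is installed; data simulated as a stand-in for income.data):

```r
library(ggplot2)

# Stand-in for income.data so the plot can be reproduced without the CSV
set.seed(42)
income.data <- data.frame(income = runif(500, 1.5, 7.5))
income.data$happiness <- 0.2 + 0.71 * income.data$income + rnorm(500, sd = 0.7)

income.graph <- ggplot(income.data, aes(x = income, y = happiness)) +
  geom_point() +
  geom_smooth(method = "lm", col = "black") +
  theme_bw() +
  labs(title = "Reported happiness as a function of income",
       x = "Income (x$10,000)",
       y = "Happiness score (0 to 10)")

# stat_regline_equation() comes from ggpubr; add the layer only if available
if (requireNamespace("ggpubr", quietly = TRUE)) {
  income.graph <- income.graph +
    ggpubr::stat_regline_equation(label.x = 3, label.y = 7)
}
income.graph
```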

Multiple regression

The visualization step for multiple regression is more difficult than for simple regression, because we now have two predictors. One option is to plot a plane, but these are difficult to read and not often published.

We will try a different method: plotting the relationship between biking and heart disease at different levels of smoking. In this example, smoking will be treated as a factor with three levels, just for the purposes of displaying the relationships in our data.

There are 7 steps to follow.

Create a new dataframe with the information needed to plot the model

Use the function expand.grid() to create a dataframe with the parameters you supply. Within this function we will:

Create a sequence from the lowest to the highest value of your observed biking data;

Choose the minimum, mean, and maximum values of smoking, in order to make 3 levels of smoking over which to predict rates of heart disease.

plotting.data <- expand.grid(
  biking = seq(min(heart.data$biking), max(heart.data$biking), length.out = 30),
  smoking = c(min(heart.data$smoking), mean(heart.data$smoking), max(heart.data$smoking)))

Predict the values of heart disease based on your linear model

Next we will save our ‘predicted y’ values as a new column in the dataset we just created.

plotting.data$predicted.y <- predict.lm(heart.disease.lm, newdata = plotting.data)

Round the smoking numbers to two decimals

This will make the legend easier to read later on.

plotting.data$smoking <- round(plotting.data$smoking, digits = 2)

Change the ‘smoking’ variable into a factor

This allows us to plot the interaction between biking and heart disease at each of the three levels of smoking we chose.

plotting.data$smoking <- as.factor(plotting.data$smoking)

Plot the original data

heart.plot <- ggplot(heart.data, aes(x = biking, y = heart.disease)) + geom_point()
heart.plot

Add the regression lines

heart.plot <- heart.plot + geom_line(data = plotting.data, aes(x = biking, y = predicted.y, color = smoking), size = 1.25)
heart.plot

Make the graph ready for publication

heart.plot <- heart.plot + theme_bw() + labs(title = "Rates of heart disease (% of population) \n as a function of biking to work and smoking", x = "Biking to work (% of population)", y = "Heart disease (% of population)", color = "Smoking \n (% of population)")
heart.plot

Because this graph has two regression coefficients, the stat_regline_equation() function won’t work here. But if we want to add our regression model to the graph, we can do so like this:

heart.plot + annotate(geom = "text", x = 30, y = 1.75, label = "heart.disease = 15 + (-0.2*biking) + (0.178*smoking)")

This is the finished graph that you can include in your papers!
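Putting steps 1–7 together, a self-contained sketch of the whole multiple-regression plot (data simulated as a stand-in for heart.data; the data-frame name plotting.data is assumed from the steps above):

```r
library(ggplot2)

# Stand-in data and model
set.seed(42)
heart.data <- data.frame(
  biking = runif(500, 1, 75),
  smoking = runif(500, 0.5, 30))
heart.data$heart.disease <- 15 - 0.2 * heart.data$biking +
  0.178 * heart.data$smoking + rnorm(500, sd = 0.65)
heart.disease.lm <- lm(heart.disease ~ biking + smoking, data = heart.data)

# 30 biking values crossed with 3 smoking levels (min, mean, max) = 90 rows
plotting.data <- expand.grid(
  biking = seq(min(heart.data$biking), max(heart.data$biking), length.out = 30),
  smoking = c(min(heart.data$smoking), mean(heart.data$smoking),
              max(heart.data$smoking)))
plotting.data$predicted.y <- predict.lm(heart.disease.lm, newdata = plotting.data)
plotting.data$smoking <- as.factor(round(plotting.data$smoking, digits = 2))

heart.plot <- ggplot(heart.data, aes(x = biking, y = heart.disease)) +
  geom_point() +
  geom_line(data = plotting.data,
            aes(x = biking, y = predicted.y, color = smoking), size = 1.25) +
  theme_bw()
heart.plot
```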

Step 6: Report your results

In addition to the graph, include a brief statement explaining the results of the regression model.

Reporting the results of simple linear regression
We found a significant relationship between income and happiness (p < 0.001, R2 = 0.73 ± 0.0193), with a 0.73-unit increase in reported happiness for every $10,000 increase in income.

Reporting the results of multiple linear regression
In our survey of 500 towns, we found significant relationships between the frequency of biking to work and the frequency of heart disease, and between the frequency of smoking and the frequency of heart disease (p < 0.001 for both).

Specifically we found a 0.2% decrease (± 0.0014) in the frequency of heart disease for every 1% increase in biking, and a 0.178% increase (± 0.0035) in the frequency of heart disease for every 1% increase in smoking.

Other interesting articles

If you want to know more about statistics, methodology, or research bias, make sure to check out some of our other articles with explanations and examples.

Cite this Scribbr article
Bevans, R. Retrieved July 19, 2023.

Fix: ”This Application Requires Directx Version 8.1 Or Greater To Run”








DirectX issues in Windows 10 are a common ache for plenty of users in the gaming world.

One of those errors affects a lot of users who are keen to play older, legacy titles. Allegedly, they keep seeing the ”This application requires DirectX version 8.1 or greater to run” prompt.

This basically means that, even though you have DirectX 11 or 12 installed, it just won’t cut it for older applications. In order to help you resolve this peculiar and rather complex problem, we provided some solutions below. If you’re stuck with this error every time you start the game, make sure to check the list below.

How to fix the ”This application requires DirectX version 8.1 or greater” error in Windows 10

Install DirectX Runtime June 2010

Run the application in a compatibility mode

Reinstall the troubled program

Enable DirectPlay

Solution 1 – Install DirectX Runtime June 2010

For some peculiar reason, older games and applications that depend on DirectX need older DirectX versions in order to run. So, even though you’re positive that you have DirectX 11 or 12 installed, you’ll likely need to obtain and install an older DirectX version to resolve the issue.

For that matter, most of the games come with the matching DirectX installer package and additional redistributables. On the other hand, if you’re unable to locate them within the game installation folder, they can be easily found online and downloaded.

You can download DirectX Runtime installer here.

Solution 2 – Run the application in a compatibility mode

While we’re at it with older games played on Windows 10, let’s try to use compatibility mode to overcome this issue. Compatibility issues are quite frequent with older game titles, like GTA Vice City or I.G.I.-2: Covert Strike played on Windows 10 platform.


Right-click the game’s executable, select Properties, and open the Compatibility tab.

Check the ”Run this program in compatibility mode for” box.

From the drop-down menu, select Windows XP or Windows 7.

Now, check the ”Run this program as an administrator” box.

Save changes and run the application.

On the other hand, if you’re still prompted with the ”This application requires DirectX version 8.1 or greater to run” error, make sure to continue with the steps below.

ALSO READ: How to fix Age of Mythology Extended Edition bugs on Windows 10

Solution 3 – Reinstall the troubling program

Some users managed to resolve the issue by simply reinstalling the application (most of the time, game). Integration problems are also quite common, again, especially with older game titles. So, without further ado, follow the steps below to uninstall the troubling game and install it again:

In the Windows Search bar, type Control and open Control Panel.

Select Category View, then open Uninstall a program under the Programs section.

Locate the troubling game, uninstall it, and restart your PC.

Right-click the game’s installer, select Properties, and open the Compatibility tab.

Check the ”Run this program in compatibility mode for” box.

From the drop-down menu, select Windows XP or Windows 7.

Now, check the ”Run this program as an administrator” box.

Confirm changes and run the installer.

Furthermore, if you’re a Steam user, you can do so within the client, as it has a better success rate.

Solution 4 – Enable DirectPlay

DirectPlay is a legacy component that was excluded from a few latest Windows iterations. But, as we already determined that this problem plagues older games, it’s safe to say that it’s vital to enable this option. Follow the steps below to enable DirectPlay and, hopefully, resolve this issue:

In the Windows Search bar, type Turn Windows and open Turn Windows features on or off.

Scroll down until you reach Legacy Components.

Expand Legacy Components and check the ”DirectPlay” box.

With DirectPlay enabled, you should be able to run all games from the past decade without issues whatsoever.

