Modelling: Random Forest

Random Forests

Random forests are an improvement on decision trees. When analyzing a data set with several predictor variables using a decision tree, we apply a splitting rule to define a region into which a subset of the data falls.

For example, suppose we have a data set of 2000 responses from a survey whose aim is to predict annual income from the number of years of work experience and the highest level of education. Here, annual income is the response variable, and years of experience and highest education level are the predictor variables. We could make the first split on the rule "at most secondary school education". This gives two branches; one branch ends in a terminal node holding the average annual income of all respondents whose highest level of education is secondary school, and this average is the predicted annual income for that group.

For respondents with education beyond secondary school, we could split the branch again on the rule "at least 8 years of experience". We would then have two further terminal nodes: one holding the average annual income of respondents with post-secondary education and fewer than 8 years of work experience, and the other holding the average annual income of respondents with post-secondary education and at least 8 years of experience. These averages become the predicted annual incomes for the two groups. We could clearly create more branches (for example, splitting the respondents with post-secondary education into "at most an undergraduate degree" and "some post-graduate education", and so on). The point of this illustration is simply to convey what a decision tree is; the sketch below works through the same example in code.
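As a concrete sketch of the example above, the Python snippet below fits a depth-two regression tree with scikit-learn on synthetic data standing in for the survey. The education encoding (0 = secondary, 1 = undergraduate, 2 = post-graduate) and the income figures are invented for illustration; each leaf of the fitted tree predicts the mean income of the training rows that reach it, just like the terminal-node averages described above.

    # Illustrative sketch only: synthetic data stands in for the 2000-response
    # survey, and the income formula below is hypothetical.
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor, export_text

    rng = np.random.default_rng(42)
    n = 2000

    education = rng.integers(0, 3, size=n)    # 0 = secondary, 1 = undergrad, 2 = post-grad
    experience = rng.integers(0, 30, size=n)  # years of work experience
    # Hypothetical income: grows with education and experience, plus noise.
    income = 20_000 + 10_000 * education + 1_500 * experience + rng.normal(0, 5_000, size=n)

    X = np.column_stack([education, experience])

    # max_depth=2 keeps the tree close to the two-split illustration above:
    # each leaf predicts the mean income of the training rows that fall in it.
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, income)

    # Print the learned splitting rules, analogous to "at most secondary
    # school education" and "at least 8 years of experience" above.
    print(export_text(tree, feature_names=["education", "experience"]))

Printing the tree with export_text makes the learned splitting rules visible; on this synthetic data the tree picks its own thresholds, but the structure (a split, two branches, terminal nodes holding group averages) mirrors the hand-built example.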
Random forests are an improvement on decision trees. When analyzing a data set with several predictor variables by use of decision trees, we use a splitting rule to define a region wherein a subset of the data will fall. For example, suppose we have a data set of 2000 responses from a survey whose aim is to predict the annual income based on number of years of experience and highest level education. Here, the annual income is the response variable and number of years of experience and highest education level are the predictor variables. We could make the first split based on the rule: "at most secondary school education". We would then have two branches, one branch would have as a terminal node the average annual salary of the total number of persons whose highest level of education is secondary school. This would be the predicted annual income for this group of persons. For those with education level higher than secondary school, we could split this branch based on the rule: "at least 8 years of experience". We would then have two terminal nodes: one having the average annual income for persons with post?secondary level education and less than 8 years of work experience and the other having the average annual income for persons with post?secondary level education and more than 8 years of experience. Thus, we would have the predicted annual incomes for these two groups of respondents. It is certainly clear that we could create more branches (for example, we could split the respondents with post?secondary education into "at most an undergraduate degree" and "some post?graduate education", etc). The main idea has been to give an idea of what decision trees are, based on an illustration.