Week 7 Machine Learning (Supervised Learning)
In previous weeks, we fit linear and logistic regression models using all available data. That is conventional practice for explanatory studies, but not for predictive studies (i.e., machine learning). Here we introduce the practice of splitting data into training and test sets, and demonstrate why this is crucial for evaluating a model’s performance on new data. We will use the 2018 General Social Survey (GSS 2018) data and the same variables as in prior labs (e.g., age, education, sex, and income) to build and compare three types of models:
- A standard linear regression (as a baseline).
- A decision tree regression model.
- A random forest regression model.
We’ll fit each model on a training subset of the GSS data, evaluate its performance on both the training and test sets, and compare predictive accuracy using the Root Mean Squared Error (RMSE). Throughout, we will use the tidyverse and tidymodels frameworks for a consistent, tidy approach to modeling.
7.1 Training vs. Test Data: Why Split?
When developing a predictive model, our goal is not just to describe the patterns in the training data but to make accurate predictions on new, unseen data. If we evaluate a model on the same data used for training, we risk being overly optimistic about its performance – the model may simply be memorizing noise or idiosyncrasies in that dataset, a phenomenon known as overfitting. By setting aside a portion of the data as a test set (also called a hold-out set), we obtain an honest assessment of how the model might perform on future data.

Key idea: We will split the GSS 2018 data into a training set (to fit the models) and a test set (to evaluate them). For example, we might use 80% of the data for training and reserve 20% for testing. The test set acts as new data that the model has not seen during training, allowing us to check how well the model generalizes.

Let’s apply this concept using a linear regression model as an example. We’ll predict real household income (realinc) using respondents’ age, education (educ, in years), and sex – the same variables we explored in earlier regression labs. We will:
- Split the GSS 2018 data into training and test sets.
- Fit a linear regression on the training set.
- Evaluate the model’s performance on both the training set and the test set, comparing metrics to see the difference.
7.1.1 Splitting the GSS 2018 Data
We use initial_split() from the rsample package (part of tidymodels) to randomly split the data. Below, we assign 80% of the GSS data to gss_train and the remaining 20% to gss_test. We set a random seed for reproducibility so that everyone gets the same split:
library(tidyverse)
library(tidymodels)
library(haven)
set.seed(123) # for reproducibility of the random split
gss <- read_dta("GSS2018.dta")
# Clean the variables
gss <- gss %>%
  transmute(
    realinc = as.numeric(realinc),
    age = as.numeric(age),
    educ = as.numeric(educ),
    sex = as_factor(sex)  # labelled -> factor
  ) %>%
  drop_na()
# Create an 80/20 train-test split
gss_split <- initial_split(gss, prop = 0.8)
gss_train <- training(gss_split)
gss_test <- testing(gss_split)
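As an aside, initial_split() also accepts a strata argument for stratified sampling, which keeps the distribution of a chosen variable similar in the training and test sets. A minimal sketch (not used in the rest of this lab):
# Optional: stratify the split on the outcome so that income is
# distributed similarly in the training and test sets (not run)
gss_split_strat <- initial_split(gss, prop = 0.8, strata = realinc)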
We can verify the split:
# Verify the split sizes
nrow(gss_train)
nrow(gss_test)
## [1] 1878
## [1] 470
We have 2,348 observations in GSS 2018 (as seen in Week 1’s exploration), so the training set has 1,878 rows and the test set 470 rows (approximately 80/20 split). Now, we’ll build a linear regression model using the training data.
7.1.2 Linear Regression with Training and Test Data
First, we define a linear regression model specification using tidymodels’ parsnip package. We then fit this model to the training data, predicting realinc (real income) from age, educ, and sex. Finally, we evaluate the model’s performance on both training and test sets.
# Define a linear regression model specification
lin_mod <- linear_reg() %>%
  set_engine("lm") %>%    # use R's linear model engine
  set_mode("regression")  # this is a regression (predicting a numeric outcome)
# Fit the model on the training data
lin_fit <- lin_mod %>% fit(realinc ~ age + educ + sex, data = gss_train)
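The fitted object lin_fit now contains the linear regression results (coefficients, residuals, etc.) trained on gss_train. One quick way to inspect the estimated coefficients is tidy() from the broom package (loaded with tidymodels):
# Inspect the estimated coefficients (output not shown)
tidy(lin_fit)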
Next, we can check how well this model fits the training data versus how well it predicts the test data. We’ll use the following metric:
RMSE (Root Mean Squared Error): the square root of the average squared error. This is in the same units as the outcome (income in dollars) – lower RMSE means better prediction.
Using the yardstick package (loaded via tidymodels), we can easily compute the metric. We’ll generate predictions on both the training set and test set, then calculate RMSE for each:
# Evaluate performance on the training data
lin_train_preds <- predict(lin_fit, gss_train) %>%
bind_cols(gss_train) # bind predictions with true values
rmse(lin_train_preds, truth = realinc, estimate = .pred)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard      28930.
# Evaluate performance on the test data
lin_test_preds <- predict(lin_fit, gss_test) %>%
bind_cols(gss_test)
rmse(lin_test_preds, truth = realinc, estimate = .pred)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard      28140.
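Before interpreting these numbers, note that rmse() is doing nothing mysterious: it simply takes the square root of the mean squared difference between the true and predicted values. You can verify the test-set value by hand:
# RMSE "by hand": square root of the mean squared prediction error
sqrt(mean((lin_test_preds$realinc - lin_test_preds$.pred)^2))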
Interpretation: On the training data, our linear model has an RMSE of about 28,930, meaning that on average its predictions deviate from actual incomes by roughly $29,000. On the test data, the RMSE is about 28,140 – very close to the training value (in fact, slightly lower in this instance, which can happen by chance or when the model is slightly underfitting).
The key point is that the model’s performance on new data is in the same ballpark as on the training data, suggesting that our linear model did not severely overfit. This makes sense – a linear regression with three predictors is a relatively simple model (low variance), so it generalizes reasonably well.
Now that we understand training vs. test evaluation with a familiar model, let’s introduce more flexible machine learning models – decision trees and random forests – and see how they perform on the same task.
7.2 Decision Trees for Regression
Decision trees are intuitive, rule-based models that predict an outcome by learning a hierarchy of if-else rules from the data. For regression (continuous outcomes), a decision tree repeatedly splits the data into subsets based on predictor variables’ values, aiming to create groups that are as homogeneous as possible in the outcome. The result is a tree-like model where each leaf node yields a predicted value (typically the mean outcome of training cases in that leaf).
Why use trees? They can automatically capture non-linear relationships and interactions between predictors. For example, a tree could learn that income increases with education unless the person is above a certain age (retirement age), in which case income might drop – a pattern that a single linear model might miss. Trees are also easy to interpret (we can visualize the splits).
Downside: Decision trees are prone to overfitting if grown too deep. A fully grown tree can memorize the training data (yielding very low training error) but perform poorly on new data. We usually need to prune the tree or set limits on its depth or leaf size to avoid overfitting; we will sketch how to set such limits after evaluating a default tree below.

Let’s fit a decision tree on the same training data (gss_train) to predict realinc. We will use tidymodels again, with the rpart engine (which implements the CART algorithm for decision trees):
# Define a decision tree model specification for regression
tree_mod <- decision_tree() %>%
  set_engine("rpart") %>%  # CART algorithm implementation
  set_mode("regression")
# Fit the decision tree on the training data
tree_fit <- tree_mod %>% fit(realinc ~ age + educ + sex, data = gss_train)
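We now have a trained decision tree model, tree_fit. To see the learned if-else splits described above, you can print the underlying rpart object (output not shown here):
# Print the fitted splits as text (output omitted)
tree_fit$fit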
Next, we evaluate the tree’s RMSE on both the training and test sets, just as we did for the linear model:
# Performance on training data
tree_train_preds <- predict(tree_fit, gss_train) %>% bind_cols(gss_train)
rmse(tree_train_preds, truth = realinc, estimate = .pred)
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 27990.
# Performance on test data
tree_test_preds <- predict(tree_fit, gss_test) %>% bind_cols(gss_test)
rmse(tree_test_preds, truth = realinc, estimate = .pred)
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 26861.
Interpretation: The decision tree’s training RMSE (about 27,990) is lower than the linear model’s (about 28,930), indicating that the tree fit the training data more closely. The same holds on the test set: the tree’s test RMSE (about 26,861) is also lower than the linear model’s (about 28,140), so the extra flexibility improved prediction here rather than hurting generalization.
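If the tree had overfit badly (training RMSE far below test RMSE), the usual remedy is to constrain its growth, as noted earlier. decision_tree() exposes the relevant tuning arguments; the values below are purely illustrative, not recommendations:
# Illustrative only: a more constrained tree specification (values are arbitrary)
tree_mod_small <- decision_tree(
  tree_depth = 4,          # cap how deep the tree can grow
  min_n = 20,              # require at least 20 cases in a node before splitting
  cost_complexity = 0.01   # prune splits that improve fit only marginally
) %>%
  set_engine("rpart") %>%
  set_mode("regression")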
We can also examine which predictors the tree relied on most, using the vip package to plot variable importance:
# If vip isn't installed yet:
# install.packages("vip")
library(vip)
library(ggplot2)
# --- Compute and plot importance ---
vip::vip(tree_fit$fit,        # underlying rpart object
         num_features = 20) + # show up to 20 predictors (adjust as needed)
  ggtitle("Decision tree – variable importance")
7.3 Random Forests
Random forests offer one powerful solution to the overfitting problem of single decision trees. A random forest is essentially an ensemble of many decision trees: the idea is to build a large number of trees (e.g., 100 or 500), each on a slightly different random subset of the data and predictors, and then average their predictions. Key characteristics of random forests:
- Each tree in the forest is trained on a bootstrapped sample of the training data (random sampling with replacement), and at each split the tree considers only a random subset of predictors (this is the “random” part of random forests); see the short illustration after this list.
- Because each tree sees a slightly different view of the data, their errors are not perfectly correlated. Averaging many trees tends to cancel out noise and reduce overfitting, while still capturing complex patterns.
- Random forests usually don’t require much manual tuning of tree depth – the ensemble averaging naturally controls variance. They often achieve better test performance than a single deep tree, at the cost of some interpretability (the model is no longer a single neat set of rules, but hundreds of them aggregated).
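To make the bootstrapped-sample idea concrete, here is a tiny illustration of sampling with replacement (the seed is only for this example; note that running extra random draws will slightly change the random forest results shown below):
# Toy example: a bootstrap sample of 10 row indices (sampling with replacement)
set.seed(1)                              # illustrative seed only
sample(1:10, size = 10, replace = TRUE)  # some indices repeat, others are left out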
We will fit a random forest on the training data using the ranger engine (a fast implementation of random forests). We’ll specify, for example, 100 trees in the forest:
library(ranger)
# Define a random forest model specification
rf_mod <- rand_forest(trees = 100) %>%
  set_engine("ranger", importance = "impurity") %>%  # record variable importance so vip() works below
  set_mode("regression")
# Fit the random forest on the training data
rf_fit <- rf_mod %>% fit(realinc ~ age + educ + sex, data = gss_train)
# Performance on training data
rf_train_preds <- predict(rf_fit, gss_train) %>% bind_cols(gss_train)
rmse(rf_train_preds, truth = realinc, estimate = .pred)
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 27192.
# Performance on test data
rf_test_preds <- predict(rf_fit, gss_test) %>% bind_cols(gss_test)
rmse(rf_test_preds, truth = realinc, estimate = .pred)
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 26977.
Interpretation: The random forest did reasonably well here, but not dramatically better than the simpler models – its test RMSE (about 26,977) is close to the single tree’s (about 26,861) and below the linear model’s. Why only a modest gain? With just three predictors, there may not be much complex structure for the ensemble to exploit. In other datasets with strong non-linear effects or interactions, a random forest would likely excel. Our example highlights that more complex models are not always better – it depends on the data and the underlying patterns.
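To line the three models up directly, we can stack their test-set RMSEs into one small table, reusing the prediction objects created above:
# Compare test-set RMSE across the three models
bind_rows(
  rmse(lin_test_preds, truth = realinc, estimate = .pred) %>% mutate(model = "linear regression"),
  rmse(tree_test_preds, truth = realinc, estimate = .pred) %>% mutate(model = "decision tree"),
  rmse(rf_test_preds, truth = realinc, estimate = .pred) %>% mutate(model = "random forest")
)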
As with the decision tree, we can rank predictors by importance:
vip::vip(rf_fit$fit, num_features = 20) +
ggtitle("Random forest – variable importance")