Supervised Learning: Decision Trees

Introduction to Decision Trees

Decision Trees are a non-linear predictive modeling tool widely used in machine learning for both classification and regression tasks. The model is represented as a tree-like structure in which each internal node tests a feature (or attribute), each branch corresponds to an outcome of that test (a decision rule), and each leaf node holds a prediction. The topmost node in a tree is known as the root node. Decision Trees are popular due to their simplicity, interpretability, and applicability to both numerical and categorical data.

How Decision Trees Work

A Decision Tree infers the class label of a sample by routing it from the root node down to a leaf through a sequence of feature tests. How the algorithm chooses these splits heavily influences performance. Here’s how trees are generally built:

  1. Choosing the Best Split: At each node in the tree, the algorithm selects the best split among all features according to a criterion that measures the impurity of the node. The two most common criteria, sketched in code after this list, are:
    • Gini Impurity (used by the CART algorithm): The probability that a randomly chosen element from the node would be incorrectly labeled if it were labeled at random according to the node’s class distribution.
    • Entropy (used by the ID3, C4.5, and C5.0 algorithms): A measure of the randomness in the information being processed. The higher the entropy, the harder it is to draw conclusions from that information.
  2. Recursive Splitting: After the best split is chosen, the dataset is split into subsets, which then recursively form branch nodes in the tree. This process continues until a stopping criterion is met; for instance:
    • All samples at a node belong to the same class.
    • No remaining attributes are left to split on.
    • The tree has reached a maximum specified depth.
  3. Pruning: This step is crucial to avoid overfitting. It involves removing parts of the tree that don’t provide additional power in classifying instances. Pruning can drastically improve the model’s generalizability.
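
As a concrete illustration of the split criteria in step 1, the following sketch computes Gini impurity and entropy for a node and scores a candidate split as the weighted impurity of its children (plain Python; the function names are ours, not taken from any particular library):

from collections import Counter
from math import log2

def gini(labels):
    # Gini impurity: 1 - sum_k p_k^2, where p_k is the proportion of class k.
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    # Entropy: -sum_k p_k * log2(p_k).
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def split_score(left, right, criterion=gini):
    # Weighted impurity of a candidate split; lower is better.
    n = len(left) + len(right)
    return len(left) / n * criterion(left) + len(right) / n * criterion(right)

print(gini(["a", "a", "b", "b"]))           # 0.5  (maximally impure for two classes)
print(entropy(["a", "a", "b", "b"]))        # 1.0
print(split_score(["a", "a"], ["b", "b"]))  # 0.0  (a perfect split)

The splitting step evaluates such a score for every candidate feature and threshold and keeps the split with the lowest weighted impurity, i.e., the largest impurity reduction.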

Types of Decision Trees

  • Classification Trees: Used when the target variable is categorical. The tree is used to infer the class labels of samples.
  • Regression Trees: Used when the target variable is continuous. Each leaf predicts a numeric value, typically the mean of the training targets that reach it.
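
A minimal sketch of both tree types using scikit-learn (the library choice and the toy data below are ours, purely for illustration):

from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification tree: categorical target (class 0 or 1).
X_clf = [[2.0, 1.0], [1.0, 3.0], [3.5, 0.5], [0.5, 2.5]]
y_clf = [0, 1, 0, 1]
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_clf, y_clf)
print(clf.predict([[2.2, 0.9]]))   # predicted class label

# Regression tree: continuous target.
X_reg = [[1.0], [2.0], [3.0], [4.0]]
y_reg = [1.1, 1.9, 3.2, 3.9]
reg = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_reg, y_reg)
print(reg.predict([[2.5]]))        # predicted numeric value (mean of the matching leaf)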

Advantages of Decision Trees

  • Easy to Understand and Interpret: Trees can be visualized, and the sequence of decisions leading to any prediction can be read directly off the structure, which makes them easy to explain to non-experts (see the sketch after this list).
  • Handles Both Numerical and Categorical Data: Trees can work directly with data types that many other algorithms require extensive preprocessing to handle.
  • Requires Little Data Preparation: Unlike many other algorithms, decision trees do not require feature scaling or normalization, because splits depend only on the ordering of feature values.
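
To illustrate the interpretability point, a fitted tree can be dumped as nested if/else rules; a minimal sketch using scikit-learn and its bundled iris dataset:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# Print the learned decision rules as readable, indented text.
print(export_text(tree, feature_names=list(iris.feature_names)))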

Limitations of Decision Trees

  • Overfitting: Without proper pruning, decision trees can grow overly complex and fail to generalize beyond the training data (see the pruning sketch after this list).
  • Variance: Small changes in the data can result in a completely different tree being generated.
  • Bias: Decision tree learners can create biased trees if some classes dominate, so balancing the dataset before fitting is recommended.
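
To make the overfitting point concrete: scikit-learn supports pre-pruning parameters (e.g., max_depth, min_samples_leaf) and post-pruning via cost-complexity pruning. A minimal sketch follows, with the ccp_alpha value chosen purely for illustration rather than tuned:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A fully grown tree typically memorizes the training set.
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Cost-complexity pruning collapses branches that add little predictive value.
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01).fit(X_train, y_train)

print("full  :", full.score(X_train, y_train), full.score(X_test, y_test))
print("pruned:", pruned.score(X_train, y_train), pruned.score(X_test, y_test))

In practice, ccp_alpha (or the pre-pruning parameters) would be chosen by cross-validation rather than fixed by hand.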