Toegepaste Machine Learning (Applied Machine Learning) - Midterm Summary

Produced by Claude 3.7 Sonnet, © 2025 Anthropic, PBC.

Fixed by Gemini 2.5 Pro, © 2025 Google, LLC.

Table of Contents

  1. Machine Learning Fundamentals
    1. Key Definitions
    2. Supervised vs. Unsupervised Learning
    3. Model Evaluation Approaches
  2. Linear Models
    1. Simple Linear Regression
    2. Multiple Linear Regression
    3. Model Assessment Metrics
  3. Polynomial & Non-Linear Regression
    1. Polynomial Features
    2. Univariate Transforms
  4. k-Nearest Neighbors Regression
    1. Distance-Based Prediction
    2. Parameter Selection
    3. Feature Scaling Importance
  5. Decision Tree Regression
    1. Recursive Binary Splits
    2. Pruning Techniques
  6. Support Vector Regression
    1. ε-Insensitive Loss
    2. Kernel Approaches
    3. Parameter Tuning
  7. Regularization & Complexity Control
    1. Ridge Regression (L2)
    2. Lasso Regression (L1)
    3. Elastic Net
    4. Hyperparameter Selection
  8. Feature Engineering & Representation
    1. Categorical Encoding
    2. Binning/Discretization
    3. Interaction Terms
    4. Scaling Techniques
  9. Pipelines & Workflow
    1. scikit-learn API
    2. Building Pipelines
    3. Grid Search with Pipelines
  10. Practical Skills
    1. Data Loading and Splitting
    2. Cross-validation Techniques
    3. Visualization Methods
  11. Quick-Recall Flashcards

1. Machine Learning Fundamentals

1.1 Key Definitions

1.2 Supervised vs. Unsupervised Learning

1.3 Model Evaluation Approaches

from sklearn.model_selection import train_test_split
  
  X_train, X_test, y_train, y_test = train_test_split(
      X, y, test_size=0.25, random_state=42
  )
import numpy as np
  from sklearn.model_selection import cross_val_score
  from sklearn.linear_model import LinearRegression
  
  model = LinearRegression()
  cv_scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')
  rmse_scores = np.sqrt(-cv_scores)
  print(f"Mean RMSE: {rmse_scores.mean()}, Std: {rmse_scores.std()}")

Study Tips:

2. Linear Models

2.1 Simple Linear Regression

import numpy as np
  from sklearn.linear_model import LinearRegression
  import matplotlib.pyplot as plt
  
  # Example with a single feature
  X = np.array([[1], [2], [3], [4], [5]])
  y = np.array([2, 3.5, 4.8, 6.3, 7.2])
  
  # Create and fit the model
  model = LinearRegression()
  model.fit(X, y)
  
  # Print model parameters
  print(f"Intercept (β₀): {model.intercept_}")
  print(f"Coefficient (β₁): {model.coef_[0]}")
  
  # Make predictions
  y_pred = model.predict(X)
  
  # Visualize
  plt.scatter(X, y, color='blue', label='Data')
  plt.plot(X, y_pred, color='red', label='Linear fit')
  plt.xlabel('x')
  plt.ylabel('y')
  plt.legend()
  plt.show()
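
The fit above corresponds to the standard simple linear regression model; ordinary least squares chooses the intercept β₀ and slope β₁ (printed by the code) that minimize the residual sum of squares:

  ŷ = β₀ + β₁x
  RSS = Σᵢ (yᵢ − (β₀ + β₁xᵢ))²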

2.2 Multiple Linear Regression

from sklearn.linear_model import LinearRegression
  import numpy as np
  
  # Multiple features
  X = np.array([
      [1, 2, 3],  # 3 features per sample
      [4, 5, 6],
      [7, 8, 9],
      [10, 11, 12]
  ])
  y = np.array([10, 20, 30, 40])
  
  # Create and fit model
  multi_model = LinearRegression()
  multi_model.fit(X, y)
  
  # Print model parameters
  print(f"Intercept: {multi_model.intercept_}")
  print(f"Coefficients: {multi_model.coef_}")
  
  # Make a prediction
  new_data = np.array([[5, 6, 7]])
  prediction = multi_model.predict(new_data)
  print(f"Prediction for {new_data}: {prediction}")

2.3 Model Assessment Metrics

import numpy as np
  from sklearn.metrics import mean_squared_error, r2_score
  
  # Calculate metrics
  y_pred = model.predict(X_test)
  rmse = np.sqrt(mean_squared_error(y_test, y_pred))
  r2 = r2_score(y_test, y_pred)
  
  print(f"RMSE: {rmse}")
  print(f"R²: {r2}")

Study Tips:

3. Polynomial & Non-Linear Regression

3.1 Polynomial Features

from sklearn.preprocessing import PolynomialFeatures
  from sklearn.linear_model import LinearRegression
  from sklearn.pipeline import Pipeline
  import numpy as np
  import matplotlib.pyplot as plt
  
  # Generate non-linear data
  X = np.sort(5 * np.random.rand(80, 1), axis=0)
  y = np.sin(X).ravel() + np.random.normal(0, 0.1, X.shape[0])
  
  # Create pipeline with polynomial features
  polynomial_pipeline = Pipeline([
      ('poly', PolynomialFeatures(degree=3)),
      ('linear', LinearRegression())
  ])
  
  # Fit the model
  polynomial_pipeline.fit(X, y)
  
  # Make predictions on a fine grid for plotting
  X_test = np.linspace(0, 5, 100)[:, np.newaxis]
  y_pred = polynomial_pipeline.predict(X_test)
  
  # Plot results
  plt.scatter(X, y, color='blue', label='Data')
  plt.plot(X_test, y_pred, color='red', label='Polynomial fit (degree=3)')
  plt.xlabel('x')
  plt.ylabel('y')
  plt.legend()
  plt.show()
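
PolynomialFeatures(degree=3) expands each input x into the basis 1, x, x², x³, so the linear regression at the end of the pipeline actually fits a cubic:

  ŷ = β₀ + β₁x + β₂x² + β₃x³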

3.2 Univariate Transforms

from sklearn.preprocessing import FunctionTransformer
  from sklearn.pipeline import Pipeline
  from sklearn.linear_model import LinearRegression
  import numpy as np
  
  # Create log transformation
  log_transformer = FunctionTransformer(np.log1p, validate=True)  # log(1+x) to handle zeros
  
  # Create data with a logarithmic relationship, so the log transform of x linearizes it
  X = np.array([[1], [2], [3], [4], [5]])
  y = 2 * np.log1p(X.ravel()) + np.random.normal(0, 0.2, X.shape[0])
  
  # Create and fit pipeline with log transform
  log_pipeline = Pipeline([
      ('log', log_transformer),
      ('regression', LinearRegression())
  ])
  log_pipeline.fit(X, y)
  
  # Print results
  print(f"Intercept: {log_pipeline.named_steps['regression'].intercept_}")
  print(f"Coefficient: {log_pipeline.named_steps['regression'].coef_}")

Study Tips:

4. k-Nearest Neighbors Regression

4.1 Distance-Based Prediction

from sklearn.neighbors import KNeighborsRegressor
  import numpy as np
  import matplotlib.pyplot as plt
  
  # Generate sample data
  X = np.sort(5 * np.random.rand(40, 1), axis=0)
  y = np.sin(X).ravel() + np.random.normal(0, 0.1, X.shape[0])
  
  # Create and fit KNN regressor
  k = 3
  knn_model = KNeighborsRegressor(n_neighbors=k)
  knn_model.fit(X, y)
  
  # Make predictions on a grid
  X_test = np.linspace(0, 5, 100)[:, np.newaxis]
  y_pred = knn_model.predict(X_test)
  
  # Plot results
  plt.scatter(X, y, color='blue', label='Data')
  plt.plot(X_test, y_pred, color='red', label=f'KNN (k={k})')
  plt.xlabel('x')
  plt.ylabel('y')
  plt.legend()
  plt.show()
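
The prediction at a query point x is simply the average target value of its k nearest training points (Euclidean distance by default):

  ŷ(x) = (1/k) Σ yᵢ, summed over the k nearest neighbours of x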

4.2 Parameter Selection

import numpy as np
  from sklearn.model_selection import GridSearchCV
  from sklearn.neighbors import KNeighborsRegressor
  
  # Setup parameter grid
  param_grid = {'n_neighbors': np.arange(1, 20)}
  
  # Setup grid search
  knn = KNeighborsRegressor()
  grid_search = GridSearchCV(knn, param_grid, cv=5, scoring='neg_mean_squared_error')
  grid_search.fit(X, y)
  
  # Print best parameters
  print(f"Best k value: {grid_search.best_params_['n_neighbors']}")
  print(f"Best RMSE: {np.sqrt(-grid_search.best_score_)}")

4.3 Feature Scaling Importance

from sklearn.preprocessing import StandardScaler
  from sklearn.pipeline import Pipeline
  from sklearn.neighbors import KNeighborsRegressor
  from sklearn.model_selection import train_test_split
  from sklearn.datasets import fetch_california_housing
  from sklearn.metrics import mean_squared_error
  import numpy as np
  
  # Load a dataset with multiple features
  # (load_boston was removed from scikit-learn; California housing works the same way)
  data = fetch_california_housing()
  X, y = data.data, data.target
  
  # Split data
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
  
  # Create pipeline with scaling
  knn_pipeline = Pipeline([
      ('scaler', StandardScaler()),
      ('knn', KNeighborsRegressor(n_neighbors=5))
  ])
  
  # Create pipeline without scaling
  knn_no_scaling = Pipeline([
      ('knn', KNeighborsRegressor(n_neighbors=5))
  ])
  
  # Fit and evaluate both models
  knn_pipeline.fit(X_train, y_train)
  knn_no_scaling.fit(X_train, y_train)
  
  # Calculate RMSE
  rmse_with_scaling = np.sqrt(mean_squared_error(y_test, knn_pipeline.predict(X_test)))
  rmse_no_scaling = np.sqrt(mean_squared_error(y_test, knn_no_scaling.predict(X_test)))
  
  print(f"RMSE with scaling: {rmse_with_scaling}")
  print(f"RMSE without scaling: {rmse_no_scaling}")

Study Tips:

5. Decision Tree Regression

5.1 Recursive Binary Splits

from sklearn.tree import DecisionTreeRegressor
  from sklearn.tree import export_graphviz
  import numpy as np
  import matplotlib.pyplot as plt
  
  # Generate sample data
  X = np.sort(5 * np.random.rand(80, 1), axis=0)
  y = np.sin(X).ravel() + np.random.normal(0, 0.1, X.shape[0])
  
  # Create and fit Decision Tree regressor
  dt_model = DecisionTreeRegressor(max_depth=3, random_state=42)
  dt_model.fit(X, y)
  
  # Make predictions on a grid
  X_test = np.linspace(0, 5, 100)[:, np.newaxis]
  y_pred = dt_model.predict(X_test)
  
  # Plot results
  plt.scatter(X, y, color='blue', label='Data')
  plt.plot(X_test, y_pred, color='red', label='Decision Tree (max_depth=3)')
  plt.xlabel('x')
  plt.ylabel('y')
  plt.legend()
  plt.show()
  
  # Export tree visualization (optional)
  export_graphviz(dt_model, out_file='tree.dot', 
                  feature_names=['x'], 
                  filled=True, 
                  rounded=True)
  # Convert to PNG with: dot -Tpng tree.dot -o tree.png
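
Each split is chosen greedily: among all features and thresholds, the tree picks the split that most reduces the squared error of the two resulting regions, and every leaf predicts the mean target of the training samples that fall into it:

  split cost = Σ (yᵢ − ȳ_left)² over the left region + Σ (yᵢ − ȳ_right)² over the right region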

5.2 Pruning Techniques

import numpy as np
  from sklearn.tree import DecisionTreeRegressor
  from sklearn.model_selection import GridSearchCV
  
  # Setup parameter grid for pre-pruning
  param_grid = {
      'max_depth': [2, 3, 4, 5, 6, None],
      'min_samples_leaf': [1, 2, 4, 8],
      'min_samples_split': [2, 4, 6, 8]
  }
  
  # Setup grid search
  dt = DecisionTreeRegressor(random_state=42)
  grid_search = GridSearchCV(dt, param_grid, cv=5, scoring='neg_mean_squared_error')
  grid_search.fit(X, y)
  
  # Print best parameters
  print(f"Best parameters: {grid_search.best_params_}")
  print(f"Best RMSE: {np.sqrt(-grid_search.best_score_)}")

Study Tips:

6. Support Vector Regression

6.1 ε-Insensitive Loss

from sklearn.svm import SVR
  import numpy as np
  import matplotlib.pyplot as plt
  
  # Generate sample data
  X = np.sort(5 * np.random.rand(80, 1), axis=0)
  y = np.sin(X).ravel() + np.random.normal(0, 0.1, X.shape[0])
  
  # Create and fit SVR with linear kernel
  svr_linear = SVR(kernel='linear', C=1.0, epsilon=0.1)
  svr_linear.fit(X, y)
  
  # Make predictions on a grid
  X_test = np.linspace(0, 5, 100)[:, np.newaxis]
  y_pred = svr_linear.predict(X_test)
  
  # Plot results
  plt.scatter(X, y, color='blue', label='Data')
  plt.plot(X_test, y_pred, color='red', label='SVR (linear kernel)')
  plt.xlabel('x')
  plt.ylabel('y')
  plt.legend()
  plt.show()
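
The ε-insensitive loss ignores residuals smaller than ε, so only points on or outside the ε-tube (the support vectors) contribute to the objective, with C weighting those violations:

  L_ε(y, ŷ) = 0               if |y − ŷ| ≤ ε
  L_ε(y, ŷ) = |y − ŷ| − ε     otherwise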

6.2 Kernel Approaches

from sklearn.svm import SVR
  import numpy as np
  import matplotlib.pyplot as plt
  
  # Generate nonlinear data
  X = np.sort(5 * np.random.rand(80, 1), axis=0)
  y = np.sin(X).ravel() + np.random.normal(0, 0.1, X.shape[0])
  
  # Create and fit SVRs with different kernels
  svr_rbf = SVR(kernel='rbf', C=100, gamma=0.1, epsilon=0.1)
  svr_poly = SVR(kernel='poly', C=100, degree=3, epsilon=0.1)
  svr_linear = SVR(kernel='linear', C=100, epsilon=0.1)
  
  svr_rbf.fit(X, y)
  svr_poly.fit(X, y)
  svr_linear.fit(X, y)
  
  # Make predictions
  X_test = np.linspace(0, 5, 100)[:, np.newaxis]
  y_rbf = svr_rbf.predict(X_test)
  y_poly = svr_poly.predict(X_test)
  y_linear = svr_linear.predict(X_test)
  
  # Plot results
  plt.scatter(X, y, color='blue', label='Data')
  plt.plot(X_test, y_rbf, color='red', label='RBF kernel')
  plt.plot(X_test, y_poly, color='green', label='Polynomial kernel')
  plt.plot(X_test, y_linear, color='purple', label='Linear kernel')
  plt.xlabel('x')
  plt.ylabel('y')
  plt.legend()
  plt.show()

6.3 Parameter Tuning

import numpy as np
  from sklearn.svm import SVR
  from sklearn.model_selection import GridSearchCV
  from sklearn.preprocessing import StandardScaler
  from sklearn.pipeline import Pipeline
  
  # Create pipeline with scaling (important for SVR)
  svr_pipeline = Pipeline([
      ('scaler', StandardScaler()),
      ('svr', SVR())
  ])
  
  # Setup parameter grid
  param_grid = {
      'svr__kernel': ['rbf', 'linear'],
      'svr__C': [0.1, 1, 10, 100],
      'svr__epsilon': [0.01, 0.1, 0.2],
      'svr__gamma': ['scale', 'auto', 0.1, 1]
  }
  
  # Setup grid search
  grid_search = GridSearchCV(svr_pipeline, param_grid, cv=5, scoring='neg_mean_squared_error')
  grid_search.fit(X, y)
  
  # Print best parameters
  print(f"Best parameters: {grid_search.best_params_}")
  print(f"Best RMSE: {np.sqrt(-grid_search.best_score_)}")

Study Tips:

7. Regularization & Complexity Control

7.1 Ridge Regression (L2)

from sklearn.linear_model import Ridge
  import numpy as np
  import matplotlib.pyplot as plt
  
  # Generate sample data
  X = np.random.randn(100, 10)
  true_coef = np.array([3, 1.5, 0, 0, 2, 0, 0, 0, 0, 0])
  y = X @ true_coef + np.random.randn(100) * 0.5
  
  # Create and fit Ridge models with different alphas
  alphas = [0, 0.1, 1.0, 10.0]
  coefs = []
  
  for alpha in alphas:
      ridge = Ridge(alpha=alpha)
      ridge.fit(X, y)
      coefs.append(ridge.coef_)
  
  # Plot coefficients for different alphas
  plt.figure(figsize=(10, 6))
  for i, alpha in enumerate(alphas):
      plt.plot(range(10), coefs[i], 'o-', label=f'alpha = {alpha}')
  plt.legend()
  plt.xlabel('Coefficient index')
  plt.ylabel('Coefficient value')
  plt.title('Ridge coefficients as alpha varies')
  plt.axhline(y=0, color='k', linestyle='--')
  plt.show()
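
Ridge adds a squared (L2) penalty on the coefficients; in scikit-learn's parameterization the objective is

  minimize ‖y − Xβ‖² + α ‖β‖₂²

Larger α shrinks all coefficients towards zero but rarely makes any of them exactly zero, which is the pattern the plot above shows.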

7.2 Lasso Regression (L1)

from sklearn.linear_model import Lasso
  import numpy as np
  import matplotlib.pyplot as plt
  
  # Generate sample data with sparse coefficients
  X = np.random.randn(100, 10)
  true_coef = np.array([3, 1.5, 0, 0, 2, 0, 0, 0, 0, 0])
  y = X @ true_coef + np.random.randn(100) * 0.5
  
  # Create and fit Lasso models with different alphas
  alphas = [0.001, 0.01, 0.1, 1.0]
  coefs = []
  
  for alpha in alphas:
      lasso = Lasso(alpha=alpha, max_iter=10000)
      lasso.fit(X, y)
      coefs.append(lasso.coef_)
  
  # Plot coefficients for different alphas
  plt.figure(figsize=(10, 6))
  for i, alpha in enumerate(alphas):
      plt.plot(range(10), coefs[i], 'o-', label=f'alpha = {alpha}')
  plt.legend()
  plt.xlabel('Coefficient index')
  plt.ylabel('Coefficient value')
  plt.title('Lasso coefficients as alpha varies')
  plt.axhline(y=0, color='k', linestyle='--')
  plt.show()
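
Lasso uses an absolute-value (L1) penalty; scikit-learn's Lasso minimizes

  (1 / (2n)) ‖y − Xβ‖² + α ‖β‖₁

The L1 penalty can drive coefficients exactly to zero, so increasing α effectively performs feature selection, visible in the plot as coefficients collapsing onto the zero line.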

7.3 Elastic Net

from sklearn.linear_model import ElasticNet
  import numpy as np
  
  # Generate sample data
  X = np.random.randn(100, 10)
  true_coef = np.array([3, 1.5, 0, 0, 2, 0, 0, 0, 0, 0])
  y = X @ true_coef + np.random.randn(100) * 0.5
  
  # Create and fit ElasticNet
  elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=10000)
  elastic_net.fit(X, y)
  
  # Print coefficients
  print(f"ElasticNet coefficients: {elastic_net.coef_}")
  
  # Compare with Lasso and Ridge
  from sklearn.linear_model import Lasso, Ridge
  
  lasso = Lasso(alpha=0.1, max_iter=10000)
  ridge = Ridge(alpha=0.1)
  
  lasso.fit(X, y)
  ridge.fit(X, y)
  
  print(f"Lasso coefficients: {lasso.coef_}")
  print(f"Ridge coefficients: {ridge.coef_}")

7.4 Hyperparameter Selection

from sklearn.linear_model import Ridge
  from sklearn.model_selection import validation_curve
  import numpy as np
  import matplotlib.pyplot as plt
  
  # Generate data
  X = np.random.randn(100, 5)
  true_coef = np.array([3, 1.5, 0, 2, 0.5])
  y = X @ true_coef + np.random.randn(100) * 0.5
  
  # Calculate validation curve
  param_range = np.logspace(-3, 3, 10)
  train_scores, test_scores = validation_curve(
      Ridge(), X, y, param_name="alpha", param_range=param_range,
      cv=5, scoring="neg_mean_squared_error"
  )
  
  # Convert to RMSE
  train_rmse = np.sqrt(-train_scores).mean(axis=1)
  test_rmse = np.sqrt(-test_scores).mean(axis=1)
  
  # Plot validation curve
  plt.figure(figsize=(10, 6))
  plt.semilogx(param_range, train_rmse, label="Training RMSE")
  plt.semilogx(param_range, test_rmse, label="Validation RMSE")
  plt.xlabel("alpha")
  plt.ylabel("Root Mean Squared Error")
  plt.legend()
  plt.title("Validation Curve for Ridge Regression")
  plt.grid()
  plt.show()
  
  # Grid search example
  from sklearn.model_selection import GridSearchCV
  
  param_grid = {
      'alpha': [0.001, 0.01, 0.1, 1, 10, 100]
  }
  
  grid_search = GridSearchCV(
      Ridge(), param_grid, cv=5, scoring='neg_mean_squared_error'
  )
  grid_search.fit(X, y)
  
  print(f"Best alpha: {grid_search.best_params_['alpha']}")
  print(f"Best RMSE: {np.sqrt(-grid_search.best_score_)}")

Study Tips:

8. Feature Engineering & Representation

8.1 Categorical Encoding

from sklearn.preprocessing import OneHotEncoder
  import numpy as np
  import pandas as pd
  
  # Sample categorical data
  data = np.array([['Male', 'Small'], ['Female', 'Medium'], ['Female', 'Large'], ['Male', 'Medium']])
  df = pd.DataFrame(data, columns=['Gender', 'Size'])
  
  # One-hot encoding
  encoder = OneHotEncoder(sparse_output=False)
  encoded_data = encoder.fit_transform(df)
  
  # Create DataFrame with feature names
  encoded_df = pd.DataFrame(
      encoded_data,
      columns=encoder.get_feature_names_out(['Gender', 'Size'])
  )
  
  print("Original data:")
  print(df)
  print("\nOne-hot encoded data:")
  print(encoded_df)

8.2 Binning/Discretization

from sklearn.preprocessing import KBinsDiscretizer
  import numpy as np
  import matplotlib.pyplot as plt
  
  # Generate continuous data
  X = np.random.randn(100, 1) * 3 + 5  # Mean = 5, Std = 3
  
  # Create different binning strategies
  n_bins = 5
  discretizers = [
      ('uniform', KBinsDiscretizer(n_bins=n_bins, encode='ordinal', strategy='uniform')),
      ('quantile', KBinsDiscretizer(n_bins=n_bins, encode='ordinal', strategy='quantile')),
      ('kmeans', KBinsDiscretizer(n_bins=n_bins, encode='ordinal', strategy='kmeans'))
  ]
  
  # Apply discretization
  plt.figure(figsize=(15, 10))
  for i, (strategy, discretizer) in enumerate(discretizers):
      X_binned = discretizer.fit_transform(X)
      
      # Plot original data vs binned data
      plt.subplot(3, 1, i+1)
      plt.scatter(X, X_binned)
      plt.xlabel('Original Value')
      plt.ylabel('Bin')
      plt.title(f'Binning with {strategy} strategy')
      
      # Add bin edges for uniform and quantile
      if strategy in ['uniform', 'quantile']:
          for edge in discretizer.bin_edges_[0]:
              plt.axvline(edge, color='r', linestyle='--', alpha=0.3)
  
  plt.tight_layout()
  plt.show()

8.3 Interaction Terms

from sklearn.preprocessing import PolynomialFeatures
  import numpy as np
  import pandas as pd
  
  # Sample data with two features
  X = np.array([[1, 2], [3, 4], [5, 6]])
  df = pd.DataFrame(X, columns=['Feature1', 'Feature2'])
  
  # Create interaction terms
  interaction = PolynomialFeatures(degree=2, include_bias=False, interaction_only=True)
  X_interaction = interaction.fit_transform(X)
  
  # Create DataFrame with feature names
  interaction_df = pd.DataFrame(
      X_interaction, 
      columns=interaction.get_feature_names_out(['Feature1', 'Feature2'])
  )
  
  print("Original data:")
  print(df)
  print("\nWith interaction terms:")
  print(interaction_df)

8.4 Scaling Techniques

from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
  import numpy as np
  import matplotlib.pyplot as plt
  
  # Generate data with outliers
  X = np.random.randn(100, 1)
  X[0] = 10  # Add an outlier
  
  # Apply different scaling methods
  scalers = [
      ('Standard', StandardScaler()),
      ('MinMax', MinMaxScaler()),
      ('Robust', RobustScaler())
  ]
  
  plt.figure(figsize=(15, 10))
  
  # Plot original data
  plt.subplot(4, 1, 1)
  plt.hist(X, bins=30)
  plt.title('Original Data')
  
  # Plot scaled data for each scaler
  for i, (name, scaler) in enumerate(scalers):
      X_scaled = scaler.fit_transform(X)
      
      plt.subplot(4, 1, i+2)
      plt.hist(X_scaled, bins=30)
      plt.title(f'{name} Scaled Data')
  
  plt.tight_layout()
  plt.show()
  
  # Compare specific values
  sample = np.array([[0], [1], [10]])  # mean, 1 std dev, outlier
  print("Original values:", sample.ravel())
  
  for name, scaler in scalers:
      scaler.fit(X)
      scaled = scaler.transform(sample)
      print(f"{name} scaled:", scaled.ravel())

Study Tips:

9. Pipelines & Workflow

9.1 scikit-learn API

from sklearn.linear_model import LinearRegression
  from sklearn.preprocessing import StandardScaler
  import numpy as np
  
  # Generate sample data
  X = np.random.rand(100, 2)
  y = 3*X[:, 0] + 2*X[:, 1] + np.random.randn(100) * 0.1
  
  # Example of transformer API
  scaler = StandardScaler()
  scaler.fit(X)  # Learn parameters (mean, std)
  X_scaled = scaler.transform(X)  # Apply transformation
  
  # Example of predictor API
  model = LinearRegression()
  model.fit(X_scaled, y)  # Train model
  y_pred = model.predict(X_scaled)  # Make predictions
  r2 = model.score(X_scaled, y)  # Calculate R²
  
  print(f"Model coefficients: {model.coef_}")
  print(f"Model intercept: {model.intercept_}")
  print(f"R² score: {r2}")

9.2 Building Pipelines

from sklearn.pipeline import Pipeline
  from sklearn.preprocessing import StandardScaler, PolynomialFeatures
  from sklearn.linear_model import Ridge
  import numpy as np
  from sklearn.model_selection import train_test_split
  
  # Generate non-linear data
  X = np.random.rand(100, 1)
  y = np.sin(2 * np.pi * X.ravel()) + np.random.randn(100) * 0.1
  
  # Split data
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
  
  # Create a pipeline with preprocessing and model
  pipeline = Pipeline([
      ('poly', PolynomialFeatures(degree=3)),
      ('scaler', StandardScaler()),
      ('ridge', Ridge(alpha=0.1))
  ])
  
  # Train and evaluate in one step
  pipeline.fit(X_train, y_train)
  score = pipeline.score(X_test, y_test)
  
  print(f"Pipeline R² score: {score}")
  
  # Access individual steps
  print(f"Polynomial features shape: {pipeline.named_steps['poly'].n_output_features_}")
  print(f"Ridge coefficients: {pipeline.named_steps['ridge'].coef_}")

9.3 Grid Search with Pipelines

from sklearn.pipeline import Pipeline
  from sklearn.preprocessing import StandardScaler, PolynomialFeatures
  from sklearn.linear_model import Ridge
  from sklearn.model_selection import GridSearchCV
  import numpy as np
  
  # Generate non-linear data
  X = np.random.rand(100, 1)
  y = np.sin(2 * np.pi * X.ravel()) + np.random.randn(100) * 0.1
  
  # Create pipeline
  pipeline = Pipeline([
      ('poly', PolynomialFeatures()),
      ('scaler', StandardScaler()),
      ('ridge', Ridge())
  ])
  
  # Parameter grid
  param_grid = {
      'poly__degree': [1, 2, 3, 4],
      'ridge__alpha': [0.001, 0.01, 0.1, 1.0, 10.0]
  }
  
  # Grid search
  grid_search = GridSearchCV(
      pipeline, param_grid, cv=5, scoring='neg_mean_squared_error'
  )
  grid_search.fit(X, y)
  
  # Print results
  print(f"Best parameters: {grid_search.best_params_}")
  print(f"Best RMSE: {np.sqrt(-grid_search.best_score_)}")
  
  # Make predictions with best model
  best_model = grid_search.best_estimator_
  y_pred = best_model.predict(X)
  
  # Plot results
  import matplotlib.pyplot as plt
  X_sorted = np.sort(X, axis=0)
  y_pred_sorted = best_model.predict(X_sorted)
  
  plt.scatter(X, y, color='blue', label='Data')
  plt.plot(X_sorted, y_pred_sorted, color='red', label='Best model fit')
  plt.xlabel('x')
  plt.ylabel('y')
  plt.legend()
  plt.title('Best Pipeline Model')
  plt.show()

Study Tips:

10. Practical Skills

10.1 Data Loading and Splitting

import pandas as pd
  import numpy as np
  from sklearn.model_selection import train_test_split
  
  # Load data from CSV (example)
  # df = pd.read_csv('your_data.csv')
  
  # Create sample data
  np.random.seed(42)
  n_samples = 1000
  n_features = 5
  X = np.random.randn(n_samples, n_features)
  y = 2*X[:, 0] + 3*X[:, 1] - X[:, 2] + 0.5*X[:, 3] + np.random.randn(n_samples) * 0.5
  
  # Split data into training and test sets
  X_train, X_test, y_train, y_test = train_test_split(
      X, y, test_size=0.2, random_state=42
  )
  
  print(f"Training set size: {X_train.shape[0]} samples")
  print(f"Test set size: {X_test.shape[0]} samples")

10.2 Cross-validation Techniques

from sklearn.model_selection import cross_val_score, KFold, validation_curve
  from sklearn.linear_model import Ridge
  import numpy as np
  import matplotlib.pyplot as plt
  
  # Generate sample data
  X = np.random.randn(100, 5)
  y = 2*X[:, 0] + 3*X[:, 1] - X[:, 2] + np.random.randn(100) * 0.5
  
  # Basic cross-validation
  model = Ridge(alpha=1.0)
  cv_scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')
  rmse_scores = np.sqrt(-cv_scores)
  print(f"Cross-validation RMSE: {rmse_scores.mean()} ± {rmse_scores.std()}")
  
  # Custom KFold
  kfold = KFold(n_splits=5, shuffle=True, random_state=42)
  cv_custom = cross_val_score(model, X, y, cv=kfold, scoring='neg_mean_squared_error')
  rmse_custom = np.sqrt(-cv_custom)
  print(f"Custom KFold RMSE: {rmse_custom.mean()} ± {rmse_custom.std()}")
  
  # Validation curve
  param_range = np.logspace(-3, 3, 10)
  train_scores, test_scores = validation_curve(
      Ridge(), X, y, param_name="alpha", param_range=param_range,
      cv=5, scoring="neg_mean_squared_error"
  )
  
  # Plot validation curve
  plt.figure(figsize=(10, 6))
  plt.semilogx(param_range, np.sqrt(-train_scores).mean(axis=1), 
           label="Training RMSE")
  plt.semilogx(param_range, np.sqrt(-test_scores).mean(axis=1), 
           label="Validation RMSE")
  plt.xlabel("alpha")
  plt.ylabel("RMSE")
  plt.legend()
  plt.title("Validation Curve for Ridge Regression")
  plt.grid()
  plt.show()

10.3 Visualization Methods

from sklearn.model_selection import learning_curve
  from sklearn.linear_model import LinearRegression
  import numpy as np
  import matplotlib.pyplot as plt
  
  # Generate sample data
  X = np.random.randn(200, 5)
  y = 2*X[:, 0] + 3*X[:, 1] - X[:, 2] + np.random.randn(200) * 0.5
  
  # Learning curve
  train_sizes, train_scores, test_scores = learning_curve(
      LinearRegression(), X, y, cv=5, scoring='neg_mean_squared_error',
      train_sizes=np.linspace(0.1, 1.0, 10)
  )
  
  # Calculate RMSE
  train_rmse = np.sqrt(-train_scores).mean(axis=1)
  test_rmse = np.sqrt(-test_scores).mean(axis=1)
  
  # Plot learning curve
  plt.figure(figsize=(10, 6))
  plt.plot(train_sizes, train_rmse, 'o-', color='r', label='Training RMSE')
  plt.plot(train_sizes, test_rmse, 'o-', color='g', label='Validation RMSE')
  plt.xlabel('Training set size')
  plt.ylabel('RMSE')
  plt.title('Learning Curve for Linear Regression')
  plt.legend()
  plt.grid()
  plt.show()
  
  # Fit model for residual plot
  model = LinearRegression()
  model.fit(X, y)
  y_pred = model.predict(X)
  residuals = y - y_pred
  
  # Residual plot
  plt.figure(figsize=(10, 6))
  plt.scatter(y_pred, residuals)
  plt.axhline(y=0, color='r', linestyle='-')
  plt.xlabel('Predicted values')
  plt.ylabel('Residuals')
  plt.title('Residual Plot')
  plt.grid()
  plt.show()
  
  # Feature importance plot
  plt.figure(figsize=(10, 6))
  plt.bar(range(X.shape[1]), model.coef_)
  plt.xlabel('Feature index')
  plt.ylabel('Coefficient value')
  plt.title('Feature Importance')
  plt.xticks(range(X.shape[1]))
  plt.grid(axis='y')
  plt.show()

Study Tips:

11. Quick-Recall Flashcards

Flashcard 1: Regularization Methods

Q: What are the key differences between Ridge, Lasso, and Elastic Net regularization?
A: Ridge adds an L2 penalty (α·Σβⱼ²), shrinking all coefficients toward zero but rarely to exactly zero. Lasso adds an L1 penalty (α·Σ|βⱼ|), which can set coefficients exactly to zero and therefore performs feature selection. Elastic Net combines both penalties, with l1_ratio controlling the mix, and is useful when features are correlated. In all three, a larger α means stronger regularization and a simpler model.

Flashcard 2: Support Vector Regression

Q: Explain the ε-insensitive loss function in SVR and how the key parameters affect the model.
A: Residuals smaller than ε contribute no loss, so points inside the ε-tube around the prediction are ignored; only points on or outside the tube (the support vectors) influence the fit. ε sets the tube width, C trades off flatness against tube violations (larger C fits the data more closely), and gamma (for the RBF kernel) controls how local the kernel's influence is. Features should be scaled before fitting an SVR.

Flashcard 3: Cross-Validation Techniques

Q: What is k-fold cross-validation and why is it important?
A: The training data is split into k folds; the model is trained k times, each time holding out one fold for validation, and the k scores are averaged. This gives a more reliable performance estimate than a single train/test split, uses every sample for both training and validation, and also reports the variability (standard deviation) across folds.

Flashcard 4: Decision Trees vs. k-Nearest Neighbors

Q: Compare and contrast Decision Tree Regression and k-Nearest Neighbors Regression.
A: Both are non-parametric and produce piecewise-constant predictions. A decision tree learns axis-aligned splits at training time, is insensitive to feature scaling, and predicts quickly; kNN does essentially no training ("lazy learning"), predicts by averaging the targets of the k nearest neighbours, requires feature scaling because it is distance-based, and its prediction cost grows with the size of the training set. Trees are tuned via depth and leaf-size parameters, kNN via the choice of k.

Flashcard 5: Pipelines and Data Leakage

Q: What is data leakage and how do scikit-learn Pipelines help prevent it?
A: Data leakage occurs when information from the test or validation data influences training, for example fitting a scaler or encoder on the full dataset before splitting. A Pipeline bundles preprocessing and the model into a single estimator, so inside cross-validation or GridSearchCV every step is fit only on the training folds and then applied unchanged to the validation fold, preventing this kind of leakage.

Flashcard 6: Learning Curves Interpretation

Q: How do you interpret learning curves to diagnose model problems?
A: If training and validation errors are both high and close together, the model underfits (high bias) and more data will not help much; a large gap between a low training error and a high validation error indicates overfitting (high variance), where more data, regularization, or a simpler model can help. Curves that are still converging as the training set grows suggest that additional data would improve performance.

Flashcard 7: Feature Engineering Techniques

Q: What are the main types of feature engineering techniques and when should you use them?
A: One-hot encoding turns categorical variables into numeric indicator columns; binning/discretization converts a continuous feature into categories so linear models can capture non-linear effects; polynomial and interaction terms add non-linear and joint effects of features; scaling (standard, min-max, robust) is needed for distance-based and regularized models; univariate transforms such as log can linearize skewed relationships.