Model fitting is the process of identifying mathematical or computational models that best explain observed data. Regression, a common type of model fitting, focuses specifically on estimating relationships between variables. These techniques form the foundation of data-driven engineering and scientific discovery.
Fundamental Concepts
Model Components
A typical model fitting process involves:
- Observed Data: A set of measurements or observations $(x_i, y_i)$ for $i = 1, \dots, n$
- Model Function: A mathematical relationship $\hat{y} = f(x; \theta)$ parameterized by $\theta$
- Error Measure: A quantification of the discrepancy between observations and model predictions
- Optimization Problem: Finding the parameters that minimize the error measure
Types of Models
Models can be classified based on their structure:
- Linear Models:
  - Linear in the parameters $\theta$, not necessarily linear in $x$
  - Examples: linear regression, polynomial regression, basis function models
- Nonlinear Models: $f$ is nonlinear in the parameters $\theta$
  - Examples: exponential models, logistic models, power laws, neural networks
- Parametric vs. Nonparametric Models:
  - Parametric: Fixed form with a finite number of parameters
  - Nonparametric: Flexible form, potentially infinite parameters (e.g., kernel methods)
Mathematical Formulation
General Problem Statement
Given data points $(x_i, y_i)$ for $i = 1, \dots, n$, find the parameter vector $\theta$ that minimizes:
$$E(\theta) = \sum_{i=1}^{n} \rho\bigl(y_i - f(x_i; \theta)\bigr)$$
where $\rho$ is an error function measuring the discrepancy between the observed $y_i$ and the predicted $f(x_i; \theta)$.
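As a concrete illustration, the objective above can be evaluated directly once a model function and an error function are chosen. The following minimal Python sketch (the names `sum_of_errors`, `line`, and `squared` are illustrative, not from the source) computes $E(\theta)$ for a straight-line model on a small synthetic dataset:

```python
import numpy as np

def sum_of_errors(theta, x, y, model, error):
    """Evaluate E(theta) = sum_i error(y_i - model(x_i; theta))."""
    residuals = y - model(x, theta)
    return np.sum(error(residuals))

# Example: a straight-line model with squared error
line = lambda x, theta: theta[0] + theta[1] * x
squared = lambda r: r ** 2

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.1, 0.9, 2.1, 2.9])
print(sum_of_errors(np.array([0.0, 1.0]), x, y, line, squared))
```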
Common Error Functions
- Squared Error: $\rho(r) = r^2$
  - Leads to least squares regression
  - Sensitive to outliers
  - Optimal for Gaussian noise
- Absolute Error: $\rho(r) = |r|$
  - Leads to least absolute deviations (LAD) regression
  - More robust to outliers
  - Optimal for Laplace-distributed noise
- Huber Loss: A hybrid approach
  - $\rho(r) = \tfrac{1}{2} r^2$ for $|r| \le \delta$
  - $\rho(r) = \delta \left( |r| - \tfrac{1}{2} \delta \right)$ for $|r| > \delta$
  - Combines the robustness of absolute error with the smoothness of squared error (the three losses are compared in the sketch below)
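A direct way to see how these error functions differ is to implement them side by side. The sketch below is a straightforward rendering of the formulas above; the Huber threshold `delta` is a free parameter chosen here for illustration:

```python
import numpy as np

def squared_error(r):
    return r ** 2

def absolute_error(r):
    return np.abs(r)

def huber_loss(r, delta=1.0):
    # Quadratic near zero, linear in the tails
    quadratic = 0.5 * r ** 2
    linear = delta * (np.abs(r) - 0.5 * delta)
    return np.where(np.abs(r) <= delta, quadratic, linear)

residuals = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(squared_error(residuals))   # heavily penalizes the large residuals
print(absolute_error(residuals))  # penalizes all residuals proportionally
print(huber_loss(residuals))      # quadratic for |r| <= 1, linear beyond
```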
Linear Regression
Simple Linear Regression
The simplest form of regression models the relationship between two variables with a straight line:
$$y = \beta_0 + \beta_1 x + \epsilon$$
where $\epsilon$ represents random error.
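Under least squares, the slope and intercept have simple closed-form estimates in terms of sample means and deviations. A minimal sketch with made-up data (variable names are illustrative):

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Closed-form least squares estimates for y = b0 + b1*x
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
print(b0, b1)  # intercept and slope estimates
```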
Multiple Linear Regression
Extends simple linear regression to multiple independent variables:
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \epsilon$$
Matrix Formulation
The multiple linear regression model can be expressed in matrix notation:
$$\mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon}$$
where:
- $\mathbf{y}$ is the vector of observed dependent variables
- $\mathbf{X}$ is the design matrix containing the independent variables
- $\boldsymbol{\beta}$ is the parameter vector
- $\boldsymbol{\epsilon}$ is the error vector
Least Squares Solution
Under the least squares criterion, the optimal parameter vector is:
$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{y}$$
This closed-form solution exists when $\mathbf{X}^{\top} \mathbf{X}$ is invertible.
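The closed-form solution maps directly to a few lines of NumPy. In practice a dedicated solver such as `np.linalg.lstsq` is usually preferred over forming the normal equations explicitly, since it is numerically more stable; both are shown in this sketch with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # design matrix with intercept column
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + 0.1 * rng.normal(size=n)

# Normal equations (fine for small, well-conditioned problems)
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Numerically preferable: dedicated least squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_normal)
print(beta_lstsq)
```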
Polynomial Regression
A specific case of linear regression where the basis functions are powers of $x$:
$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_m x^m + \epsilon$$
This model is linear in the parameters despite modeling nonlinear relationships in $x$.
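Because the model is linear in the parameters, it can be fit with the same machinery as ordinary least squares by building a Vandermonde-style design matrix; `np.polyfit` wraps the same idea. A brief sketch:

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 30)
y = 0.5 - 2.0 * x + 3.0 * x**2 + 0.05 * np.random.default_rng(1).normal(size=x.size)

# Build the design matrix [1, x, x^2] and solve the linear least squares problem
X = np.vander(x, N=3, increasing=True)
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coeffs)                  # approximately [0.5, -2.0, 3.0]

# Equivalent convenience call (returns highest degree first)
print(np.polyfit(x, y, deg=2))
```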
Nonlinear Regression
Formulation
Nonlinear regression involves models where the parameters appear nonlinearly:
$$y = f(x; \theta) + \epsilon, \quad \text{with } f \text{ nonlinear in } \theta$$
Common examples include:
- Exponential models: $y = a e^{b x}$
- Growth models: $y = \dfrac{L}{1 + e^{-k(x - x_0)}}$
- Sinusoidal models: $y = A \sin(\omega x + \phi)$
Optimization Approaches
Unlike linear regression, nonlinear regression typically requires iterative numerical optimization:
- Gauss-Newton Method: Specialized for least squares problems
- Levenberg-Marquardt Algorithm: Robust modification of Gauss-Newton (used by default in the curve-fitting sketch after this list)
- Gradient Descent: Simple but potentially slow convergence
- Trust Region Methods: Restrict optimization steps to regions where the model is trusted
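As one concrete option, SciPy's `curve_fit` (which uses Levenberg-Marquardt for unconstrained problems) can fit the exponential model above. A sketch with synthetic data; note that nonlinear fits are sensitive to the initial guess `p0`:

```python
import numpy as np
from scipy.optimize import curve_fit

def exponential_model(x, a, b):
    return a * np.exp(b * x)

rng = np.random.default_rng(2)
x = np.linspace(0.0, 2.0, 40)
y = exponential_model(x, 2.0, 1.3) + 0.05 * rng.normal(size=x.size)

# p0 gives the initial guess for the iterative optimizer
params, covariance = curve_fit(exponential_model, x, y, p0=(1.0, 1.0))
print(params)                        # approximately [2.0, 1.3]
print(np.sqrt(np.diag(covariance)))  # standard errors of the estimates
```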
Regularization Techniques
Regularization addresses overfitting by adding penalties to the objective function:
Ridge Regression (L2 Regularization)
- Shrinks parameters toward zero
- Handles multicollinearity well
- Closed-form solution: $\hat{\boldsymbol{\beta}}_{\text{ridge}} = (\mathbf{X}^{\top} \mathbf{X} + \lambda \mathbf{I})^{-1} \mathbf{X}^{\top} \mathbf{y}$
Lasso Regression (L1 Regularization)
- Produces sparse solutions (feature selection)
- No closed-form solution; requires iterative methods such as coordinate descent or quadratic programming
- Effective for high-dimensional data
Elastic Net
Combines L1 and L2 regularization:
$$\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \; \| \mathbf{y} - \mathbf{X} \boldsymbol{\beta} \|_2^2 + \lambda_1 \| \boldsymbol{\beta} \|_1 + \lambda_2 \| \boldsymbol{\beta} \|_2^2$$
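In practice these estimators are usually taken from a library rather than coded by hand. Assuming scikit-learn is available (the penalty strengths below are arbitrary illustrative values, not recommendations), a sketch covering all three penalties:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 10))
beta_true = np.array([3.0, -2.0, 0.0, 0.0, 1.5, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ beta_true + 0.1 * rng.normal(size=100)

ridge = Ridge(alpha=1.0).fit(X, y)                      # shrinks all coefficients toward zero
lasso = Lasso(alpha=0.1).fit(X, y)                      # drives many coefficients exactly to zero
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)    # mixes the two penalties

print(np.round(ridge.coef_, 2))
print(np.round(lasso.coef_, 2))
print(np.round(enet.coef_, 2))
```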
Model Evaluation and Selection
Performance Metrics
- Mean Squared Error (MSE): $\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
- Root Mean Squared Error (RMSE): $\text{RMSE} = \sqrt{\text{MSE}}$
- Mean Absolute Error (MAE): $\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$
- Coefficient of Determination (R²): $R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$
  - Represents the proportion of variance explained by the model
  - Ranges from 0 to 1 for linear least squares models with an intercept (can be negative for nonlinear models)
  - Higher values indicate better fit (all four metrics are computed in the sketch below)
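These metrics are one-liners with NumPy. A minimal sketch, where the predictions `y_hat` are assumed to come from any previously fitted model:

```python
import numpy as np

y = np.array([3.0, 5.0, 7.5, 9.0])
y_hat = np.array([2.8, 5.3, 7.1, 9.4])   # predictions from some fitted model

mse = np.mean((y - y_hat) ** 2)
rmse = np.sqrt(mse)
mae = np.mean(np.abs(y - y_hat))
r2 = 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

print(mse, rmse, mae, r2)
```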
Cross-Validation
Techniques to assess model performance on unseen data:
- k-fold Cross-Validation: Divides data into k subsets, trains on k-1 subsets and tests on the remaining one, rotating through all subsets (sketched after this list)
- Leave-One-Out Cross-Validation: Special case of k-fold where k equals the number of data points
- Train-Test Split: Divides data into training and testing sets (typically 70-30% or 80-20%)
- Hold-out Validation: Sets aside a portion of data for final evaluation
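A hand-rolled k-fold loop makes the procedure explicit (library helpers such as scikit-learn's `KFold` do the same bookkeeping). The sketch below cross-validates an ordinary least squares fit; the data and function name are illustrative:

```python
import numpy as np

def kfold_mse(X, y, k=5, seed=0):
    """Estimate out-of-sample MSE of ordinary least squares by k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(y))
    folds = np.array_split(indices, k)
    errors = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        errors.append(np.mean((y[test] - X[test] @ beta) ** 2))
    return np.mean(errors)

rng = np.random.default_rng(4)
X = np.column_stack([np.ones(60), rng.normal(size=(60, 3))])
y = X @ np.array([1.0, 0.5, -1.0, 2.0]) + 0.2 * rng.normal(size=60)
print(kfold_mse(X, y, k=5))
```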
Model Selection Criteria
Metrics that balance goodness-of-fit with model complexity:
- Akaike Information Criterion (AIC): $\text{AIC} = 2k - 2\ln \hat{L}$
  - $k$ is the number of parameters
  - $\hat{L}$ is the maximized value of the likelihood function
- Bayesian Information Criterion (BIC): $\text{BIC} = k \ln n - 2\ln \hat{L}$
  - Penalizes complexity more severely than AIC (sketched below for a least squares fit)
- Adjusted R²: $\bar{R}^2 = 1 - (1 - R^2) \dfrac{n - 1}{n - p - 1}$
  - Adjusts R² based on the number of predictors
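For a least squares fit with Gaussian errors, the log-likelihood term reduces to a function of the residual sum of squares, which gives the commonly used forms below. This is a sketch under that Gaussian assumption; both criteria are defined only up to an additive constant, so only differences between candidate models are meaningful:

```python
import numpy as np

def aic_bic_least_squares(y, y_hat, k):
    """AIC and BIC for a least squares fit, assuming Gaussian errors.

    Uses the residual-sum-of-squares form; values are comparable only
    across models fit to the same data.
    """
    n = len(y)
    rss = np.sum((y - y_hat) ** 2)
    log_likelihood_term = n * np.log(rss / n)
    aic = log_likelihood_term + 2 * k
    bic = log_likelihood_term + k * np.log(n)
    return aic, bic

y = np.array([1.0, 2.1, 2.9, 4.2, 5.1])
y_hat = np.array([1.1, 2.0, 3.0, 4.0, 5.0])
print(aic_bic_least_squares(y, y_hat, k=2))  # e.g. a straight-line model with 2 parameters
```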
Parameter Uncertainty and Confidence
Parameter Covariance Matrix
For least squares estimation, the parameter covariance matrix is:
$$\text{Cov}(\hat{\boldsymbol{\beta}}) = \sigma^2 (\mathbf{X}^{\top} \mathbf{X})^{-1}$$
where $\sigma^2$ is the variance of the error term.
Confidence Intervals
For parameter $\beta_j$, the 95% confidence interval is:
$$\hat{\beta}_j \pm t_{0.975,\, n-p-1} \, \text{SE}(\hat{\beta}_j)$$
where $\text{SE}(\hat{\beta}_j) = \sqrt{\left[\text{Cov}(\hat{\boldsymbol{\beta}})\right]_{jj}}$ and $t_{0.975,\, n-p-1}$ is the critical value of the t-distribution with $n - p - 1$ degrees of freedom.
Prediction Intervals
For a new observation at $\mathbf{x}_0$, the prediction interval is:
$$\hat{y}_0 \pm t_{0.975,\, n-p-1} \, \hat{\sigma} \sqrt{1 + \mathbf{x}_0^{\top} (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{x}_0}$$
where $\hat{y}_0 = \mathbf{x}_0^{\top} \hat{\boldsymbol{\beta}}$.
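The quantities above can be computed directly from a least squares fit. The sketch below estimates the error variance from the residuals with $n - p - 1$ degrees of freedom and uses SciPy's t-distribution for the critical value; data and names are illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, p = 40, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 2.0, -0.5]) + 0.3 * rng.normal(size=n)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

dof = n - (p + 1)                              # degrees of freedom
residuals = y - X @ beta_hat
sigma2_hat = residuals @ residuals / dof       # estimated error variance
cov_beta = sigma2_hat * np.linalg.inv(X.T @ X) # parameter covariance matrix
se_beta = np.sqrt(np.diag(cov_beta))

t_crit = stats.t.ppf(0.975, dof)
print(np.column_stack([beta_hat - t_crit * se_beta,
                       beta_hat + t_crit * se_beta]))   # 95% confidence intervals

# 95% prediction interval for a new point x0
x0 = np.array([1.0, 0.5, -1.0])
y0_hat = x0 @ beta_hat
half_width = t_crit * np.sqrt(sigma2_hat * (1.0 + x0 @ np.linalg.inv(X.T @ X) @ x0))
print(y0_hat - half_width, y0_hat + half_width)
```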
Specialized Regression Techniques
Weighted Least Squares
Incorporates varying reliability of observations:
$$\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \sum_{i=1}^{n} w_i \left( y_i - \mathbf{x}_i^{\top} \boldsymbol{\beta} \right)^2$$
where $w_i$ are weights, often inversely proportional to the variance of the corresponding observations.
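Weighted least squares reduces to ordinary least squares after rescaling each row by the square root of its weight, which keeps the implementation short. A sketch where the second half of the data is deliberately noisier and therefore down-weighted:

```python
import numpy as np

def weighted_least_squares(X, y, w):
    """Solve the weighted least squares problem by rescaling rows with sqrt(w)."""
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta

rng = np.random.default_rng(6)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])
noise_std = np.where(np.arange(n) < 25, 0.1, 1.0)   # second half of the data is noisier
y = X @ np.array([2.0, 1.5]) + noise_std * rng.normal(size=n)

w = 1.0 / noise_std**2   # weights inversely proportional to the error variance
print(weighted_least_squares(X, y, w))
```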
Robust Regression
Methods less sensitive to outliers:
- M-Estimation: Minimizes a robust loss function (e.g., the Huber loss; see the sketch after this list)
- MM-Estimation: Combines high breakdown point with efficiency
- Least Trimmed Squares: Minimizes the sum of the smallest squared residuals
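One way to realize an M-estimator in code is SciPy's `least_squares` with a robust loss. The sketch below contrasts a Huber-loss fit with an ordinary squared-error fit on data containing gross outliers; the `f_scale` value is an arbitrary illustrative choice:

```python
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(7)
x = np.linspace(0.0, 10.0, 50)
y = 2.0 * x + 1.0 + 0.3 * rng.normal(size=x.size)
y[::10] += 15.0   # inject a few gross outliers

def residuals(beta):
    return y - (beta[0] + beta[1] * x)

# Ordinary least squares is pulled toward the outliers
ols = least_squares(residuals, x0=[0.0, 0.0])
# Huber loss (an M-estimator) down-weights large residuals
robust = least_squares(residuals, x0=[0.0, 0.0], loss="huber", f_scale=1.0)

print(ols.x)      # noticeably biased by the outliers
print(robust.x)   # close to the underlying [1.0, 2.0]
```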
Bayesian Regression
Incorporates prior knowledge about parameters:
- Prior: $p(\boldsymbol{\beta})$ represents knowledge before seeing data
- Likelihood: $p(\mathbf{y} \mid \boldsymbol{\beta})$ represents how well parameters explain data
- Posterior: $p(\boldsymbol{\beta} \mid \mathbf{y}) \propto p(\mathbf{y} \mid \boldsymbol{\beta}) \, p(\boldsymbol{\beta})$ represents updated knowledge after seeing data (a conjugate-Gaussian sketch follows)
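A minimal sketch of the conjugate Gaussian case, assuming a zero-mean isotropic Gaussian prior on the coefficients and a known noise variance (both simplifying assumptions made here for illustration); in this case the posterior is Gaussian with a closed-form mean and covariance:

```python
import numpy as np

def bayesian_linear_regression(X, y, sigma2, tau2):
    """Posterior mean and covariance for prior beta ~ N(0, tau2*I)
    and Gaussian noise with known variance sigma2 (conjugate case)."""
    d = X.shape[1]
    precision = X.T @ X / sigma2 + np.eye(d) / tau2
    cov_post = np.linalg.inv(precision)
    mean_post = cov_post @ X.T @ y / sigma2
    return mean_post, cov_post

rng = np.random.default_rng(8)
X = np.column_stack([np.ones(30), rng.normal(size=30)])
y = X @ np.array([1.0, -2.0]) + 0.2 * rng.normal(size=30)

mean_post, cov_post = bayesian_linear_regression(X, y, sigma2=0.04, tau2=10.0)
print(mean_post)                   # posterior mean (a ridge-like shrinkage estimate)
print(np.sqrt(np.diag(cov_post)))  # posterior standard deviations
```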
Implementation with Optimization Algorithms
Model fitting is fundamentally an optimization problem. Common approaches include:
- Direct Methods: For linear models with closed-form solutions
- Iterative Methods: Required for nonlinear models
- Gauss-Newton: Efficient for moderate nonlinearity
- Levenberg-Marquardt: More robust, combines Gauss-Newton with steepest descent
- Trust Region Methods: Controls step size based on model reliability
- Stochastic Gradient Descent: Effective for large datasets (a minimal sketch follows)
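For large datasets the squared-error gradient can be followed one observation at a time instead of solving the normal equations. A compact sketch of stochastic gradient descent for a linear model; the learning rate and epoch count are arbitrary illustrative choices:

```python
import numpy as np

def sgd_linear_regression(X, y, lr=0.01, epochs=200, seed=0):
    """Fit y ~ X @ beta by stochastic gradient descent on the squared error."""
    rng = np.random.default_rng(seed)
    beta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):          # one observation at a time
            gradient = -2.0 * (y[i] - X[i] @ beta) * X[i]
            beta -= lr * gradient
    return beta

rng = np.random.default_rng(9)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
y = X @ np.array([0.5, 1.0, -1.5]) + 0.1 * rng.normal(size=200)
print(sgd_linear_regression(X, y))   # approaches the least squares solution
```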
Applications in Engineering
- System Identification: Determining mathematical models of dynamic systems
- Empirical Modeling: Creating models based on experimental data
- Parameter Estimation: Determining physical parameters from measurements
- Response Surface Methodology: Optimizing processes using experimental designs
- Calibration: Adjusting simulation models to match observed behavior