Statistics for Health Data Science

16.4 Link Functions

The link function provides the relationship between the systematic component and the mean of the distribution. There are many commonly used link functions; the table below lists four examples with their distributions and mean functions. Here we use matrix notation, where the linear predictor \(\beta_{0} + \beta_{1}X_{i1} + \beta_{2}X_{i2} + \ldots + \beta_{k}X_{ik}\) is represented by \(\mathbf{X}\mathbf{\beta}\).

| Distribution | Data | Link name | Link function | Mean function |
|---|---|---|---|---|
| Normal | real: \((-\infty, +\infty)\) | Identity | \(\mathbf{X}\mathbf{\beta} = \mu\) | \(\mu = \mathbf{X}\mathbf{\beta}\) |
| Poisson | integer: \(0, 1, 2, \ldots\) | Log | \(\mathbf{X}\mathbf{\beta} = \ln(\mu)\) | \(\mu = \exp(\mathbf{X}\mathbf{\beta})\) |
| Binomial | integer: \(0, 1, 2, \ldots, N\) | Logit | \(\mathbf{X}\mathbf{\beta} = \ln\left(\frac{\mu}{N-\mu}\right)\) | \(\mu = \frac{N\exp(\mathbf{X}\mathbf{\beta})}{1 + \exp(\mathbf{X}\mathbf{\beta})}\) |
| Gamma | real: \((0, +\infty)\) | Negative inverse | \(\mathbf{X}\mathbf{\beta} = -\mu^{-1}\) | \(\mu = -(\mathbf{X}\mathbf{\beta})^{-1}\) |
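Each mean function is simply the inverse of its link function. The short sketch below, written in plain Python for illustration, checks this numerically for the four links listed above. The values of `mu` and the number of binomial trials `N = 10` are arbitrary assumptions chosen for the example; for the binomial row, `mu` is taken to be the expected count out of `N` trials, so the mean function carries a factor of `N`.

```python
import math

# Assumed number of binomial trials (illustrative value only).
N = 10

# Each entry pairs a link function g(mu) with its inverse, the mean function.
LINKS = {
    "identity": (lambda mu: mu, lambda eta: eta),
    "log": (lambda mu: math.log(mu), lambda eta: math.exp(eta)),
    "logit": (
        lambda mu: math.log(mu / (N - mu)),
        lambda eta: N * math.exp(eta) / (1 + math.exp(eta)),
    ),
    "negative inverse": (lambda mu: -1 / mu, lambda eta: -1 / eta),
}


def round_trip(name, mu):
    """Apply the link to mu, then the mean function; should return mu."""
    link, mean = LINKS[name]
    return mean(link(mu))


# Arbitrary example means, each inside the valid range for its distribution.
EXAMPLES = [("identity", -2.5), ("log", 3.0), ("logit", 4.0), ("negative inverse", 2.0)]

for name, mu in EXAMPLES:
    eta = LINKS[name][0](mu)
    print(f"{name}: mu = {mu}, linear predictor = {eta:.4f}, "
          f"recovered mu = {round_trip(name, mu):.4f}")
```

Note that the links differ in which values of the linear predictor are admissible: the log and logit links map any real value back to a valid mean, whereas the negative inverse link only returns a positive mean when the linear predictor is negative.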

It is important to note that both linear regression, which is covered in sessions 12 to 14, and logistic regression, which is covered in session 15, can be reproduced through a GLM.

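Recall that linear regression assumes the outcome is normally distributed (given the covariates), so using the identity link function with a normal distribution within the GLM framework will give the same estimated regression coefficients. However, inference (p-values and confidence intervals) is slightly better under ordinary least squares (OLS) than under maximum likelihood estimation: OLS inference is based on the t distribution and is exact in finite samples when the normality assumption holds, whereas likelihood-based inference relies on large-sample approximations. We therefore prefer to fit linear regression models using OLS.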
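To see why the coefficient estimates coincide, consider the following sketch. It assumes \(n\) independent observations \(y_{i}\) with common variance \(\sigma^{2}\), and writes \(\mathbf{x}_{i}^{T}\) for the \(i\)-th row of \(\mathbf{X}\); this notation is introduced here for illustration. With the identity link, the Normal log-likelihood is

\[
\ell(\mathbf{\beta}, \sigma^{2}) = -\frac{n}{2}\ln(2\pi\sigma^{2}) - \frac{1}{2\sigma^{2}}\sum_{i=1}^{n}\left(y_{i} - \mathbf{x}_{i}^{T}\mathbf{\beta}\right)^{2}.
\]

For any fixed \(\sigma^{2}\), maximising \(\ell\) over \(\mathbf{\beta}\) is the same as minimising the residual sum of squares, which is exactly the OLS criterion. Provided \(\mathbf{X}^{T}\mathbf{X}\) is invertible, both approaches therefore give

\[
\hat{\mathbf{\beta}} = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}.
\]

The two approaches differ in how the variance is estimated. Writing \(\mathrm{RSS} = \sum_{i=1}^{n}(y_{i} - \mathbf{x}_{i}^{T}\hat{\mathbf{\beta}})^{2}\), with \(k\) covariates plus an intercept,

\[
\hat{\sigma}^{2}_{\mathrm{ML}} = \frac{\mathrm{RSS}}{n}, \qquad \hat{\sigma}^{2}_{\mathrm{OLS}} = \frac{\mathrm{RSS}}{n - k - 1}.
\]

The maximum likelihood estimator divides by \(n\) and so is biased downwards in small samples, while the OLS estimator is unbiased. The difference vanishes as \(n\) grows.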

If you use the logit link function with a binomial family (recalling that the Bernoulli is a special case of the binomial distribution), you will reproduce the results obtained through standard logistic regression modelling. For binary outcomes, the GLM offers extra flexibility compared with the logistic regression model, because you can also use other link functions, for example the probit, the log-log and the complementary log-log functions. These usually give similar results, but each transforms the expectation of the outcome to the systematic component in a slightly different way, which may suit particular data collection situations better. In this module we focus only on the logit link; if you wish to explore further, more information can be found here: https://aip.scitation.org/doi/pdf/10.1063/1.5139815
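To give a feel for how similar these links are, the sketch below evaluates the inverse of each link (the mapping from the linear predictor back to a probability) at a few arbitrary values. It is written in plain Python for illustration only; the function names are hypothetical, and the log-log link is written in one common convention, \(-\ln(-\ln(\mu))\), although some texts define it with the opposite sign.

```python
import math


def inv_logit(eta):
    """Inverse logit: the standard logistic function."""
    return 1 / (1 + math.exp(-eta))


def inv_probit(eta):
    """Inverse probit: the standard Normal CDF, via the error function."""
    return 0.5 * (1 + math.erf(eta / math.sqrt(2)))


def inv_cloglog(eta):
    """Inverse complementary log-log link, ln(-ln(1 - mu))."""
    return 1 - math.exp(-math.exp(eta))


def inv_loglog(eta):
    """Inverse log-log link, assuming the convention -ln(-ln(mu))."""
    return math.exp(-math.exp(-eta))


INVERSE_LINKS = {
    "logit": inv_logit,
    "probit": inv_probit,
    "cloglog": inv_cloglog,
    "loglog": inv_loglog,
}

# Arbitrary values of the linear predictor, chosen for illustration.
for eta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    row = ", ".join(f"{name} = {f(eta):.3f}" for name, f in INVERSE_LINKS.items())
    print(f"eta = {eta:+.1f}: {row}")
```

All four functions map any real linear predictor to a probability between 0 and 1. The logit and probit are symmetric around a probability of 0.5, whereas the log-log and complementary log-log approach 0 and 1 at different rates, which is why they can be preferable when one outcome is much rarer than the other. Because the scales differ, the estimated coefficients are not directly comparable across links, even when the fitted probabilities are similar.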

By MSc Health Data Science, LSHTM
© Copyright 2021.