Statistics and Health Data Science

We end with some brief remarks about the application of statistics in health data science.

Focus on the research question

The more complicated the statistical analysis becomes, the easier it is to get lost in the technical details. As a data scientist, it is always important to be able to take a step back and re-focus on the underlying research question.

Ask yourself:

  • What is the research question?

  • What assumptions can I reasonably make, given where, when and how the data were collected?

  • Does the proposed statistical analysis answer the research question?

  • How can I assess the robustness of the conclusions of my analysis to the key assumptions I have made?
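
One simple way to approach the last question is a sensitivity analysis: repeat the analysis under different versions of a key assumption and see how much the answer moves. The sketch below is illustrative only. The outcome values are made up, and the three assumptions about missing outcomes (ignore them, treat all as 0, treat all as 1) are just one possible choice, not a recommended method.

```python
# Hypothetical binary outcomes for ten participants; None marks a missing value.
outcomes = [1, 0, 1, None, 0, 1, None, 0, 0, 1]


def proportion(values):
    """Proportion of participants with the outcome (values are 0 or 1)."""
    return sum(values) / len(values)


# Assumption A: missing outcomes can be ignored (complete-case analysis).
complete_case = proportion([y for y in outcomes if y is not None])

# Assumption B: every missing outcome was really 0 (lower bound).
all_missing_zero = proportion([0 if y is None else y for y in outcomes])

# Assumption C: every missing outcome was really 1 (upper bound).
all_missing_one = proportion([1 if y is None else y for y in outcomes])

print(f"Complete case:       {complete_case:.2f}")
print(f"Missing treated as 0: {all_missing_zero:.2f}")
print(f"Missing treated as 1: {all_missing_one:.2f}")
```

If the conclusion would change between the lower and upper bound, it depends heavily on the assumption made about the missing data, and that should be reported alongside the main result.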

Know your data

We cannot overstate the importance of being familiar with your data. Where do they come from? How were they collected? How accurate are the measurements? Do similar biases affect measurements from different units/places/times?

A hugely important step in any data science project is to look at your data. Even the most sophisticated analysis will produce invalid results if it is based on data that contain substantial errors or on datasets that have been assembled incorrectly.
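
As a minimal sketch of what a first look at the data might involve, the example below counts missing values, flags implausible measurements and finds duplicated identifiers. The records, the field names (patient_id, age, systolic_bp) and the plausible ranges are all hypothetical and chosen purely for illustration; in a real project they would come from your knowledge of how the data were collected.

```python
from collections import Counter

# Hypothetical records; in practice these would be read from your dataset.
records = [
    {"patient_id": 1, "age": 54, "systolic_bp": 132.0},
    {"patient_id": 2, "age": 61, "systolic_bp": None},   # missing measurement
    {"patient_id": 3, "age": 430, "systolic_bp": 118.0},  # implausible age
    {"patient_id": 3, "age": 47, "systolic_bp": 125.0},   # repeated identifier
]

# Assumed plausible ranges for each measurement (illustrative only).
PLAUSIBLE_RANGES = {"age": (0, 120), "systolic_bp": (50.0, 300.0)}


def first_look(rows, ranges):
    """Summarise missing values, out-of-range values and duplicated IDs."""
    missing = {
        field: sum(1 for r in rows if r.get(field) is None)
        for field in ranges
    }
    out_of_range = {
        field: [
            r["patient_id"]
            for r in rows
            if r.get(field) is not None and not (low <= r[field] <= high)
        ]
        for field, (low, high) in ranges.items()
    }
    id_counts = Counter(r["patient_id"] for r in rows)
    duplicate_ids = sorted(pid for pid, n in id_counts.items() if n > 1)
    return {
        "missing": missing,
        "out_of_range": out_of_range,
        "duplicate_ids": duplicate_ids,
    }


report = first_look(records, PLAUSIBLE_RANGES)
for check, result in report.items():
    print(check, result)
```

Checks like these do not replace plotting the data or talking to the people who collected them, but they catch many assembly errors before any modelling starts.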

Continue to learn

This module has introduced some key building blocks, concepts and statistical tools that will be very useful for data science projects. However, there are many more statistical techniques that we have not touched on. In your career as a health data scientist, you will continue to learn new methods and approaches.

We hope that this module has provided a solid foundation to build on!