Classifying Data
The first step in any statistical investigation is to correctly classify the data. The type of data determines which graphical displays and summary statistics are appropriate. Misclassifying data leads to invalid conclusions.
Categorical Data
Represents qualities or labels; the values can be words, or numbers acting as labels.
Nominal: Categories with no intrinsic order. (e.g., Hair Colour, Suburb)
Ordinal: Categories with a logical order or ranking. (e.g., T-Shirt Size: S, M, L)
Numerical Data
Represents quantities that are counts or measurements.
Discrete: Can be counted and takes exact values. (e.g., Number of Pets)
Continuous: Can be measured and can take any value within a range. (e.g., Height in cm)
Displaying Distributions
Histograms and Box Plots both show the distribution of numerical data, but they reveal different features. Use the buttons below to toggle the view for a sample dataset of student test scores and see how they compare.
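The contrast can be sketched in code. This is a minimal illustration with hypothetical test scores (the sample dataset in the interactive chart isn't reproduced here): the histogram view counts scores in class intervals, while the box-plot view reduces the same data to a five-number summary.

```python
import statistics

# Hypothetical test scores standing in for the sample dataset.
scores = [52, 58, 61, 64, 66, 68, 70, 71, 73, 75, 77, 80, 84, 91, 95]

# Histogram view: frequency of scores in 10-mark class intervals.
freq = {}
for s in scores:
    interval = (s // 10) * 10          # 52 falls in the 50-59 interval, etc.
    freq[interval] = freq.get(interval, 0) + 1

# Box-plot view: the five-number summary (min, Q1, median, Q3, max).
q1, median, q3 = statistics.quantiles(scores, n=4)
summary = (min(scores), q1, median, q3, max(scores))
```

The histogram keeps the shape of the distribution (where the data piles up); the box plot makes the centre, spread, and potential outliers easier to read at a glance.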
The Normal Distribution
Many variables follow a symmetrical, bell-shaped pattern. For these distributions, we can use the 68-95-99.7% rule to make quick estimations, and z-scores to compare values from different contexts.
The 68-95-99.7% Rule
Click the buttons to see the percentage of data that falls within 1, 2, or 3 standard deviations of the mean in a normal distribution.
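The three percentages aren't arbitrary: they come from the normal cumulative distribution function, where the proportion within k standard deviations of the mean is erf(k/√2). A short sketch confirming the rule:

```python
import math

# Proportion of a normal distribution within k standard deviations
# of the mean: P(|Z| <= k) = erf(k / sqrt(2)).
def within(k):
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} sd: {within(k):.1%}")
# within 1 sd: 68.3%, within 2 sd: 95.5%, within 3 sd: 99.7%
```

So "68-95-99.7" is a rounded statement of these exact proportions.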
Z-Score Calculator
A z-score measures how many standard deviations a value is from the mean. Use it to compare scores from different distributions.
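The calculation behind the widget is a one-liner, z = (value − mean) ÷ standard deviation. A sketch with hypothetical scores:

```python
# z-score: how many standard deviations a value sits from the mean.
def z_score(value, mean, sd):
    return (value - mean) / sd

# Hypothetical example: a Maths score of 80 (class mean 70, sd 5)
# versus an English score of 75 (class mean 65, sd 10).
maths = z_score(80, 70, 5)      # 2.0 -> two sd above the mean
english = z_score(75, 65, 10)   # 1.0 -> one sd above the mean
```

Although 80 and 75 are on different scales, the z-scores show the Maths result is relatively stronger: it sits further above its own class mean.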
Investigating Associations
Here we explore the relationship between two numerical variables using a scatterplot. Pay attention to the direction, form, and strength of the association, and remember the crucial difference between correlation and causation.
Visualising Linear Associations
A scatterplot helps us see the relationship between two numerical variables. Pearson’s Correlation Coefficient ($r$) gives us a number to describe the strength and direction of that relationship. Drag the slider to see how the scatterplot and its description change.
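Behind the slider, $r$ is computed from the paired deviations about the two means. A minimal sketch of the standard formula (the data points are illustrative, not the plotted dataset):

```python
import math

# Pearson's correlation coefficient r for paired numerical data:
# r = S_xy / sqrt(S_xx * S_yy), using deviations from the means.
def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

pearson_r([1, 2, 3, 4], [2, 4, 6, 8])   # 1.0: perfect positive
pearson_r([1, 2, 3, 4], [8, 6, 4, 2])   # -1.0: perfect negative
```

Values near ±1 indicate a strong linear association; values near 0 indicate little or no linear association.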
Correlation is NOT Causation
A strong correlation ($r$ value) between two variables doesn’t mean one causes the other. For example, ice cream sales and drownings are strongly correlated.
This is not because ice cream causes drowning! A third variable, hot weather (a confounding variable), causes an increase in both. Always consider potential confounding variables before claiming causation.
Modelling with Least Squares Regression
Once we see a linear trend, we can model it with a “line of best fit”. This line allows us to make predictions. This section uses data comparing car age and price. Use the tabs to explore the model, make predictions, and check the residuals.
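The model, prediction, and residual steps can be sketched in a few lines. The car data below is hypothetical (the tabs use their own dataset), but the formulas are the standard least squares ones: slope b = S_xy / S_xx and intercept a = ȳ − b·x̄.

```python
# Least squares line y = a + b*x.
def least_squares(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a, b

# Hypothetical car data: age (years) vs price ($'000s).
age   = [1, 2, 3, 4, 5, 6]
price = [26, 24, 21, 19, 17, 13]

a, b = least_squares(age, price)
predict = lambda x: a + b * x                # prediction from the model
residuals = [y - predict(x) for x, y in zip(age, price)]
# Residuals (actual - predicted) always sum to zero for a least
# squares line with an intercept; a pattern in them suggests the
# linear model is not appropriate.
```

Here the negative slope reflects depreciation: each extra year of age lowers the predicted price by about $2,500.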
Time Series Analysis
Time series data is collected at regular intervals over time. We look for trends, seasonal patterns, and other features. This chart shows quarterly ice cream sales. The forecasting wizard then breaks down how to predict future sales.
Analysing a Time Series Plot
Forecasting with Seasonal Data
Step 1: Deseasonalise the Data
Remove the seasonal pattern using seasonal indices to reveal the underlying trend. (i.e., Deseasonalised = Actual ÷ Seasonal Index)
Step 2: Fit a Trend Line
Fit a least squares regression line to the deseasonalised data to get a trend equation.
Step 3: Predict the Deseasonalised Value
Use the trend equation to predict the value for a future time period.
Step 4: Reseasonalise the Forecast
Put the seasonality back in by multiplying the prediction by the correct seasonal index. (i.e., Predicted Deseasonalised × Seasonal Index)
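The four steps above can be sketched end to end. The quarterly sales figures and seasonal indices below are hypothetical (not the chart's data); note the four indices average to 1.

```python
# Hypothetical seasonal indices for quarters 1-4 (they average to 1).
seasonal_index = {1: 1.30, 2: 0.70, 3: 0.80, 4: 1.20}

# Hypothetical quarterly sales over two years: (quarter number t, sales).
sales = [(1, 136.5), (2, 77), (3, 92), (4, 144),
         (5, 162.5), (6, 91), (7, 108), (8, 168)]

quarter = lambda t: (t - 1) % 4 + 1          # map t = 1..8 to quarter 1..4

# Step 1: deseasonalise (actual / seasonal index).
deseason = [(t, y / seasonal_index[quarter(t)]) for t, y in sales]

# Step 2: fit a least squares trend line to the deseasonalised data.
n = len(deseason)
mt = sum(t for t, _ in deseason) / n
my = sum(y for _, y in deseason) / n
b = sum((t - mt) * (y - my) for t, y in deseason) / \
    sum((t - mt) ** 2 for t, _ in deseason)
a = my - b * mt

# Step 3: predict the deseasonalised value for a future quarter.
t_future = 9                                 # quarter 1 of the next year
trend_value = a + b * t_future

# Step 4: reseasonalise (predicted deseasonalised * seasonal index).
forecast = trend_value * seasonal_index[quarter(t_future)]
```

With these numbers the deseasonalised series is exactly linear (trend ≈ 100 + 5t), so the forecast for quarter 9 is the trend value 145 scaled back up by quarter 1's index of 1.30, giving about 188.5.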