MRCOG Part 1: Epidemiology & Statistics — Comprehensive Study Document
Target: MRCOG Part 1
Version: May 2026
Purpose: Complete deep-dive reference covering all examinable topics in epidemiology, medical statistics, screening, evidence-based medicine, and their application to obstetrics & gynaecology. This document is designed for thorough revision: every section includes definitions, formulae, mnemonics, worked examples from O&G, and MRCOG-specific exam tips.
Table of Contents
- Study Design
- Screening
- Descriptive Statistics
- Hypothesis Testing
- Parametric vs Non-Parametric Tests
- Risk & Effect Measures
- Statistical Bias & Confounding
- Evidence-Based Medicine
- Survival Analysis
- Specific Topics in O&G
1. Study Design
1.1 Observational vs Experimental Studies
| Feature | Observational | Experimental |
|---|---|---|
| Intervention | None — investigator observes naturally occurring groups | Investigator assigns intervention intentionally |
| Causality | Association only (unless Bradford Hill criteria satisfied) | Can infer causation (if properly randomised and blinded) |
| Bias risk | Higher — multiple sources of bias possible | Lower — randomisation balances confounders |
| Ethical issues | Fewer — no manipulation of participants | More — equipoise required; informed consent essential |
| Examples | Cohort, case-control, cross-sectional, ecological | RCT (parallel, crossover, cluster, factorial) |
Bradford Hill Criteria for Causation (1965): A set of nine viewpoints, important when interpreting observational studies, used to assess whether an observed association is likely causal:
1. Strength of association (larger effect = more likely causal)
2. Consistency (reproduced across different populations/settings)
3. Specificity (one cause → one effect; less applicable to O&G where most outcomes are multifactorial)
4. Temporality (cause must precede effect; the only absolutely essential criterion)
5. Biological gradient (dose-response relationship, e.g., more cigarettes → higher preterm birth risk)
6. Plausibility (biologically credible mechanism)
7. Coherence (consistent with natural history/biology)
8. Experiment (evidence from experimental studies)
9. Analogy (similar evidence for analogous exposures)
1.2 Cross-Sectional Studies
- Design: Data collected at a SINGLE point in time — both exposure and outcome measured simultaneously
- Measures: Prevalence (existing cases) — CANNOT measure incidence (new cases)
- Uses: Disease burden estimates, health surveys, screening programme evaluation, hypothesis generation, planning health services
- Advantages: Quick, cheap, good for hypothesis generation, no loss to follow-up, can study multiple outcomes and exposures simultaneously
- Disadvantages: Cannot establish temporality (chicken-and-egg problem — did the exposure come before the outcome?), survival bias (only survivors captured — those who died cannot participate), not suitable for rare diseases (need very large samples), prevalence-incidence bias (Neyman bias)
- Key statistic: Odds ratio (can be calculated but caution with interpretation — prevalence OR, not incidence OR)
- In O&G: Estimating prevalence of pelvic organ prolapse, urinary incontinence (UI), infertility, contraception use patterns, postnatal depression (e.g., EPDS screening studies), HPV prevalence, endometriosis prevalence estimates
Example: A cross-sectional survey asks 5,000 women about incontinence symptoms and BMI. 1,200 report UI; 800 of the 1,200 with UI are obese, compared with 1,500 of the 3,800 without UI.
- Prevalence of UI = 1200/5000 = 24%
- Prevalence OR for UI in obese vs non-obese = (800×2300)/(400×1500) = 3.07
Limitation: Cannot tell if obesity caused UI or UI led to reduced activity and weight gain.
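As a minimal sketch of the arithmetic in the survey example, the snippet below arranges the counts as a 2×2 table and computes the prevalence and the prevalence odds ratio with the cross-product ad/bc. The function name `prevalence_odds_ratio` and its argument names are illustrative only and do not come from any library.

```python
def prevalence_odds_ratio(a, b, c, d):
    """Cross-product odds ratio for a 2x2 table.

    a = exposed with outcome,    b = unexposed with outcome
    c = exposed without outcome, d = unexposed without outcome
    """
    return (a * d) / (b * c)


total_women = 5000
ui_total = 1200
ui_obese = 800        # obese women reporting UI
no_ui_obese = 1500    # obese women without UI

# Remaining cells follow from the row and column totals.
ui_not_obese = ui_total - ui_obese
no_ui_not_obese = total_women - ui_total - no_ui_obese

prevalence_ui = ui_total / total_women
odds_ratio = prevalence_odds_ratio(ui_obese, ui_not_obese, no_ui_obese, no_ui_not_obese)

print(f"Prevalence of UI: {prevalence_ui:.0%}")
print(f"Prevalence OR (obese vs non-obese): {odds_ratio:.2f}")
```

Writing out all four cells before taking the cross-product is a simple guard against pairing the wrong cells.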
1.3 Cohort Studies
Definition: Groups defined by exposure status; followed forward in time to see who develops outcome. This is the optimal observational design for establishing incidence and temporal relationships.
Prospective Cohort
- Exposure assessed at BASELINE; participants followed FORWARD in time
- Outcome develops during follow-up
- Advantages: Direct measure of incidence, clear temporality (exposure definitely precedes outcome), can study multiple outcomes from one exposure, minimises recall bias (exposure recorded before outcome known), allows calculation of absolute risk, RR, AR
- Disadvantages: Expensive and time-consuming (especially for rare outcomes or long latency), loss to follow-up (attrition) can introduce bias, inefficient for rare diseases (need very large numbers), exposure patterns may change over time
- Key measures: Incidence (cumulative incidence and incidence rate), relative risk (RR), attributable risk (AR), population attributable fraction (PAF)
Retrospective Cohort (Historical Cohort)
- Uses EXISTING data (medical records, databases, occupational records) to go back in time
- Exposure and outcome have ALREADY occurred when study begins
- Advantages: Cheaper, faster than prospective, good for long-latency diseases (e.g., DES exposure in utero and vaginal adenocarcinoma decades later), can use existing datasets
- Disadvantages: Relies on quality and completeness of existing records, recall bias may still affect some data, cannot control what was measured or how, missing data issues
Key Measures in Cohort Studies — Detailed with Worked Example
Worked O&G Example: 10,000 pregnant women; 5,000 smoke, 5,000 do not. Followed for preterm birth (<37 weeks).
| | Preterm | Term | Total | Risk |
|---|---|---|---|---|
| Smoker | 200 | 4,800 | 5,000 | 200/5000 = 0.04 |
| Non-smoker | 100 | 4,900 | 5,000 | 100/5000 = 0.02 |
| Total | 300 | 9,700 | 10,000 | 300/10000 = 0.03 |
| Measure | Formula | Calculation | Interpretation |
|---|---|---|---|
| Cumulative incidence (risk) in exposed | a/(a+b) = 200/5000 | 0.04 (4%) | 4% of smokers had preterm birth |
| Cumulative incidence (risk) in unexposed | c/(c+d) = 100/5000 | 0.02 (2%) | 2% of non-smokers had preterm birth |
| Incidence rate (IR) in exposed | 200 / (sum of person-time) | Depends on follow-up timing | Accounts for when events occur |
| Relative Risk (RR) | 0.04 / 0.02 | 2.0 | Smokers 2× more likely to have preterm birth |
| Attributable Risk (AR) | 0.04 − 0.02 | 0.02 (2%) | Excess risk attributable to smoking |
| AR Fraction (ARF) | (RR − 1)/RR = (2.0 − 1)/2.0 | 0.50 (50%) | Half of preterm births in smokers attributable to smoking |
| Population Attributable Risk (PAR) | I_total − I_unexposed = 0.03 − 0.02 | 0.01 (1%) | Excess risk in total population |
| PAF | (I_total − I_unexposed)/I_total = 0.01/0.03 | 33.3% | 33% of all preterm births attributable to smoking |
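The measures in the table can be reproduced with a few lines of Python. This is a sketch under the table's own definitions; the function name `cohort_measures` and the dictionary keys are illustrative choices.

```python
def cohort_measures(exposed_events, exposed_total, unexposed_events, unexposed_total):
    """Risk-based effect measures for a cohort 2x2 table."""
    risk_exposed = exposed_events / exposed_total
    risk_unexposed = unexposed_events / unexposed_total
    risk_total = (exposed_events + unexposed_events) / (exposed_total + unexposed_total)

    rr = risk_exposed / risk_unexposed
    return {
        "risk_exposed": risk_exposed,
        "risk_unexposed": risk_unexposed,
        "RR": rr,                                            # relative risk
        "AR": risk_exposed - risk_unexposed,                 # attributable risk
        "ARF": (rr - 1) / rr,                                # AR fraction in the exposed
        "PAR": risk_total - risk_unexposed,                  # population attributable risk
        "PAF": (risk_total - risk_unexposed) / risk_total,   # population attributable fraction
    }


# Smoking and preterm birth: 200/5000 smokers, 100/5000 non-smokers
results = cohort_measures(200, 5000, 100, 5000)
for name, value in results.items():
    print(f"{name}: {value:.3f}")
```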
Person-time:
- Each participant contributes time until event, loss to follow-up, or study end
- Incidence rate = number of new events / sum of person-time at risk
- Expressed as "per 1000 person-years" or similar
- Superior to cumulative incidence when follow-up times vary
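To make the person-time idea concrete, here is a small sketch using four made-up follow-up records (the data and the function name `incidence_rate` are hypothetical, chosen only for illustration). Each participant contributes time at risk until the event, loss to follow-up, or study end.

```python
# Hypothetical follow-up records: (years at risk, had_event)
follow_up = [
    (2.0, False),  # reached study end without the event
    (0.5, True),   # event after 6 months
    (1.5, False),  # lost to follow-up at 18 months
    (1.0, True),   # event after 1 year
]


def incidence_rate(records, per=1000):
    """New events divided by total person-time, scaled to `per` person-years."""
    events = sum(1 for _, had_event in records if had_event)
    person_time = sum(years for years, _ in records)
    return events / person_time * per


rate = incidence_rate(follow_up)
print(f"Incidence rate: {rate:.0f} per 1000 person-years")
```

Here 2 events over 5 person-years give 400 per 1000 person-years, whereas the cumulative incidence (2/4) ignores how long each woman was actually observed.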
Confounding in Cohort Studies:
- Common confounders in O&G cohorts: maternal age, socioeconomic status, parity, BMI, pre-existing medical conditions
- Control: multivariable regression, stratification, matching, restriction, propensity scores
1.4 Case-Control Studies
Design: Select cases (with disease) and controls (without disease); look BACK retrospectively for exposure. The most efficient design for rare diseases.
2×2 Table
| | Case (disease +) | Control (disease −) | Total |
|---|---|---|---|
| Exposed | a | b | a + b |
| Unexposed | c | d | c + d |
Key Measures
| Measure | Formula | Interpretation |
|---|---|---|
| Odds of exposure in cases | a / c | How likely cases were exposed |
| Odds of exposure in controls | b / d | How likely controls were exposed |
| Odds Ratio (OR) | (a/c) / (b/d) = ad / bc | Odds of exposure in cases vs controls |
| When disease rare (<10%) | OR ≈ RR | Rare disease assumption |
Worked O&G Example: Case-control study of ovarian cancer and talc use.
- Cases: 300 women with ovarian cancer
- Controls: 600 women without ovarian cancer
- Talc use: 120 cases exposed, 180 controls exposed
| | Ovarian cancer (Case) | No cancer (Control) |
|---|---|---|
| Talc use | 120 | 180 |
| No talc | 180 | 420 |
OR = (120 × 420) / (180 × 180) = 50,400 / 32,400 = 1.56
Interpretation: Odds of talc exposure are 1.56× higher in ovarian cancer cases than controls. Since ovarian cancer is relatively rare, this approximates RR = 1.56.
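The odds ratio for the talc example can be checked in code. The 95% confidence interval is not part of the worked example; the log-based (Woolf) interval below is one common choice, added here purely as an illustration, and the function name `odds_ratio_with_ci` is hypothetical.

```python
import math


def odds_ratio_with_ci(a, b, c, d, z=1.96):
    """OR = ad/bc with a Woolf (log-based) confidence interval.

    a = exposed cases,   b = exposed controls
    c = unexposed cases, d = unexposed controls
    """
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


talc_or, ci_low, ci_high = odds_ratio_with_ci(a=120, b=180, c=180, d=420)
print(f"OR = {talc_or:.2f} (95% CI {ci_low:.2f} to {ci_high:.2f})")
```

If the interval excludes 1, the association is statistically significant at the 5% level, although that says nothing about recall bias or confounding.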
Cannot calculate:
- Incidence (no denominator: we selected cases/controls, we did not follow a population)
- RR (no incidence data)
- Prevalence (same reason)
Advantages
- Efficient for rare diseases (ovarian cancer, specific congenital anomalies, maternal death)
- Quick and cheap compared to cohort studies
- Can study multiple exposures (diet, environment, genetics, medications)
- Good for diseases with long latency (like DES-related cancers)
- Requires smaller sample sizes than cohort studies for rare outcomes
Disadvantages
- Cannot calculate incidence directly (no denominator of total population at risk)
- Recall bias: Cases remember exposures differently from controls (especially for subjective exposures like diet, pain medication, stress)
- Selection bias: Choosing appropriate controls is the most difficult and critical part
- Temporality: Difficult to establish if exposure preceded disease (especially for biomarkers measured after diagnosis)
- Cannot study rare exposures (if exposure is rare, you need enormous numbers)
- Survivorship bias: Cases are those who survived to be diagnosed; fatal cases are missed
Selection of Controls — Critical Issues
Fundamental principle: Controls must come from the SAME source population that gave rise to the cases.
| Control type | Description | Advantage | Disadvantage |
|---|---|---|---|
| Population-based | Random sample from general population | Most representative | Expensive, low response rates |
| Hospital-based | Other patients from same hospital | Easy to recruit, good response rates | Berkson's bias — hospital controls may have different exposure patterns |
| Friend/relative | Friends or siblings of cases | Genetic/environmental matching | Over-matching possible (same exposures) |
| Neighbourhood | Neighbours of cases | Socioeconomic matching | Time-consuming |
| Disease controls | Patients with a DIFFERENT disease | Good response, similar recall | Diseased group may differ from healthy |
Matching:
- Frequency matching: Select controls to have same distribution of age, parity, etc. as cases
- Individual matching: Each case matched to 1–4 controls on specific factors (age ± 5 years, parity, hospital)
- Over-matching: Matching on a variable that is related to the exposure but NOT to the disease; this reduces power without reducing confounding
Biases in Case-Control Studies — Expanded
| Bias | Mechanism | Example |
|---|---|---|
| Recall bias | Cases search memory for causes; controls less motivated | Mothers of babies with malformations report more medication during pregnancy; mothers of healthy babies forget |
| Berkson's bias | Hospital controls have different admission patterns | Studying aspirin and stroke: hospital controls may have GI bleeds (also related to aspirin) → spurious protective effect |
| Neyman bias | Prevalent cases differ from incident cases | Studying survival after cancer: prevalent cases are long-term survivors, not representative |
| Detection bias | Cases diagnosed because of exposure-related screening | Women on HRT have more mammograms → more breast cancer detected (not causal) |
| Interviewer bias | Interviewer probes differently | More detailed questioning of cases about exposures |
| Survivorship bias | Fatal cases not included | Studying risk factors for eclampsia — only survivors available |
1.5 Randomised Controlled Trials (RCTs)
Gold standard for establishing causality. The key strength is randomisation which (if adequate) balances both known and unknown confounders between groups.
Types of RCT
| Type | Description | Key Feature | Example in O&G |
|---|---|---|---|
| Parallel group | Two (or more) independent groups, each receives one treatment concurrently | Most common; simplest analysis | TRUFFLE study (CTG monitoring in IUGR) |
| Crossover | Each participant receives both treatments in random sequence, separated by washout period | Each participant acts as own control → smaller sample size needed | Comparing two pain relief methods in labour (problem: carryover, and can't use if condition changes) |
| Cluster | Intact groups (hospitals, GP practices, communities) randomised | Used when contamination likely; analysis must account for clustering (ICC) | Comparing screening uptake with different invitation methods at hospital level |
| Factorial | Two or more interventions tested simultaneously (e.g., 2×2 design) | Efficient — can test interactions (synergy/antagonism) | Comparing aspirin AND heparin vs each alone in recurrent miscarriage |
| Zelen design | Randomised BEFORE consent; only treatment group approached for consent | Reduces selection bias; ethically controversial | Emergency trials where consent is difficult |
| N-of-1 trial | Single patient receives treatment and placebo in random sequence | Highest level for individual treatment decisions | Rarely used in O&G |
Trial Phases
| Phase | Primary Purpose | Typical Participants | Key Questions |
|---|---|---|---|
| Phase I | Safety, tolerability, pharmacokinetics | 20–80 healthy volunteers (or patients with advanced disease) | What is safe dose? What are side effects? How is drug metabolised? |
| Phase II | Efficacy signal, dose-ranging, side effect profile | 100–300 patients with condition | Does it work? What is optimal dose? More adverse effects? |
| Phase III | Confirm efficacy, compare to standard of care | 1,000–3,000+ patients | Is it better than current standard? (or non-inferior) |
| Phase IV | Post-marketing surveillance, long-term safety | General population after licensing | Are there rare adverse effects? Long-term outcomes? |
Randomisation Methods — Detailed
| Method | Description | Strength | Weakness |
|---|---|---|---|
| Simple randomisation | Each participant assigned by coin toss, random number table, or computer | Unpredictable; simple | Can produce unequal group sizes and imbalance on prognostic factors |
| Block randomisation | Random permuted blocks of fixed size (e.g., 4: possibilities = TTCC, TCTC, TCCT, CTTC, CTCT, CCTT) | Ensures equal numbers in each group at the end of every block (imbalance never exceeds half a block) | Block size must be CONCEALED (or varied) to prevent prediction |
| Stratified randomisation | Separate randomisation within strata defined by key prognostic factors | Ensures balance on important confounders | Complex; need few strata or it becomes unwieldy |
| Minimisation | Next participant's allocation determined by current imbalance in prognostic factors | Excellent balance on many factors simultaneously | Not truly random; some controversy about analysis |
| Adaptive randomisation | Allocation probability changes based on accumulating outcomes | More patients get better treatment | Complex; operational bias possible |
Key point: The randomisation sequence must be CONCEALED from those recruiting participants. If the recruiter knows the next allocation, they can (consciously or unconsciously) influence who is enrolled → selection bias.
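A minimal sketch of permuted-block randomisation, as described in the table above, is shown below. The function name `block_randomise`, the arm labels, and the seed are illustrative assumptions; a real trial would generate and hold the sequence centrally so that recruiters never see it.

```python
import random


def block_randomise(n_participants, block_size=4, arms=("T", "C"), seed=None):
    """Return an allocation list built from randomly permuted blocks."""
    if block_size % len(arms) != 0:
        raise ValueError("block_size must be a multiple of the number of arms")
    rng = random.Random(seed)
    per_arm = block_size // len(arms)
    allocations = []
    while len(allocations) < n_participants:
        block = list(arms) * per_arm   # e.g. ["T", "C", "T", "C"]
        rng.shuffle(block)             # random permutation within the block
        allocations.extend(block)
    return allocations[:n_participants]


sequence = block_randomise(12, block_size=4, seed=2026)
print("".join(sequence))
```

Every complete block of four contains two T and two C, which is why a recruiter who knows the block size can predict the final allocation in each block.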
Allocation Concealment vs Blinding
| Feature | Allocation Concealment | Blinding |
|---|---|---|
| Purpose | Prevent selection bias at enrolment | Prevent performance/detection bias after enrolment |
| When | BEFORE randomisation (during recruitment) | AFTER randomisation (during treatment/follow-up) |
| Always possible? | YES — always possible, even in surgery/physical therapy trials | NO — some interventions cannot be blinded (surgery vs medical, behaviour change) |
| If broken | Destroys the integrity of randomisation | Less catastrophic but introduces bias |
| Example | Opaque sealed envelopes, central telephone randomisation | Identical placebo tablets, sham surgery, double-dummy technique |
Blinding Levels
| Level | Who is blinded | Purpose | Applicability |
|---|---|---|---|
| Open label | No one | Practical when blinding impossible | Surgical trials, device trials |
| Single blind | Participant only | Reduces placebo effect | Drug trials with distinct taste/appearance |
| Double blind | Participant AND investigator | Gold standard — reduces both performance and detection bias | Most drug trials |
| Triple blind | Participant, investigator AND data analyst/statistician | Prevents analytic bias | High-quality confirmatory trials |
Double-dummy technique: Used when two treatments have different appearances (e.g., pill vs injection). Each participant receives a pill AND an injection — one active, one placebo.
Analysis Populations
| Analysis | Definition | Effect on results | Best use |
|---|---|---|---|
| Intention-to-treat (ITT) | Analyse ALL participants in the group they were randomised to, regardless of compliance, crossover, or withdrawal | Conservative for superiority trials (dilutes effect toward null) | PRIMARY analysis for superiority trials |
| Per-protocol (PP) | Analyse only those who completed the allocated treatment as planned | May OVER-estimate efficacy (only includes compliant) | SECONDARY analysis; primary for non-inferiority |
| Modified ITT (mITT) | Excludes those who never received any treatment or had no post-randomisation data | Somewhere between ITT and PP | Common compromise in practice |
| As-treated | Analyse according to the treatment actually received | Most BIASED — breaks randomisation | Not recommended as primary analysis |
MRCOG Key Point: ITT is the primary analysis for superiority trials because it preserves the benefit of randomisation (groups remain comparable); PP is secondary. ITT is conservative for superiority but anti-conservative for non-inferiority: non-adherence and crossover dilute the difference between arms, so ITT can make a truly inferior treatment appear non-inferior. For that reason PP is often the primary analysis in non-inferiority trials, and the conclusion is most convincing when ITT and PP agree.
Trial Types by Aim
| Type | Null Hypothesis | Alternative Hypothesis | Key Consideration |
|---|---|---|---|
| Superiority | Treatment = Control | Treatment ≠ Control (or Treatment > Control) | Standard approach |
| Non-inferiority | Treatment − Control ≤ −Δ (margin) | Treatment − Control > −Δ | Requires pre-specified non-inferiority margin (Δ); PP analysis preferred |
| Equivalence | \|Treatment − Control\| ≥ Δ | \|Treatment − Control\| < Δ | Requires pre-specified two-sided equivalence margin (±Δ) |
Non-inferiority margin selection:
- Should be no larger than the largest clinically acceptable difference
- M1 = the whole effect of the active control vs placebo (from historical trials); M2 = the part of M1 that can acceptably be lost, often set at half of M1 so that at least 50% of the control's effect is preserved
- Example: If active control reduces mortality by 2% vs placebo, Δ might be 1%
Pragmatic vs Explanatory Trials — The PRECIS-2 Framework
| Dimension | Explanatory (Efficacy) | Pragmatic (Effectiveness) |
|---|---|---|
| Question | "Can it work?" (ideal conditions) | "Does it work in real life?" |
| Eligibility | Highly selected — narrow criteria | Broad — represents typical patients |
| Recruitment | Intensive campaigning | Routine clinical pathways |
| Setting | Specialist academic centres | Primary care / routine hospitals |
| Intervention | Strictly protocolised, closely monitored | Flexible, as in real practice |
| Comparator | Placebo or best alternative | Usual care |
| Follow-up | Frequent, intensive | Routine visits |
| Outcome | Surrogate or mechanism-based | Clinically meaningful (patient-important) |
| Primary analysis | ITT and PP both informative | ITT primary |
| Adherence | Monitored and encouraged | Real-world compliance |
Example (O&G): ASPRE trial of aspirin for pre-eclampsia prevention — highly selected (high-risk by FMF algorithm) → explanatory. A pragmatic version would include all nulliparous women.
Adaptive Trial Designs
Definition: Pre-specified plan for modifying trial features based on accumulating data, without undermining validity.
| Type | Description | Example |
|---|---|---|
| Group sequential | Pre-planned interim analyses with stopping rules | Stop early for efficacy (if overwhelming benefit) or futility (if unlikely to show benefit) |
| Sample size re-estimation | Blinded re-estimation of variance to adjust sample size | Ensures adequate power |
| Adaptive randomisation | Allocation ratio changes to favour better-performing arm | More patients receive superior treatment |
| Seamless phase II/III | Combine dose-finding phase with confirmatory phase | Saves time and patients |
| Drop-the-loser | Arms dropped if inferior | Multi-arm multi-stage (MAMS) trials |
| Bayesian adaptive | Continuously update posterior probability | More flexible but complex |
Stopping rules for group sequential designs:
| Method | Boundary | Characteristics |
|---|---|---|
| Haybittle-Peto | p < 0.001 at interim; p < 0.05 at final | Very conservative early; easy to implement |
| O'Brien-Fleming | Very stringent early boundary, liberal later | Most common; preserves overall α well |
| Pocock | Same critical value throughout (e.g., p < 0.022 at each of 3 looks for overall α = 0.05) | More likely to stop early |
| Wang-Tsiatis | Family of boundaries between O'Brien-Fleming and Pocock | Flexible |
Sample Size Calculation — Detailed
Why calculate sample size?
1. Ensure adequate POWER to detect clinically important effect
2. Avoid wasting resources on underpowered studies
3. Meet ethical obligations (patients in underpowered study may be exposed to harm without benefit)
4. Meet regulatory requirements
Parameters needed:
| Parameter | Symbol | Typical value | How to determine |
|---|---|---|---|
| Significance level | α (Type I error) | 0.05 (two-sided) | Convention; sometimes 0.01 |
| Power | 1 − β | 0.80 or 0.90 | 0.80 is minimum; 0.90 preferred |
| Effect size | δ | Varies | Minimum clinically important difference (MCID) |
| Standard deviation | σ | From pilot data/literature | Variability in outcome measure |
| Allocation ratio | r = n₁/n₂ | 1:1 is most efficient | Unequal allocation needs larger n |
Sample size increases when:
- ✅ Smaller effect size (harder to detect)
- ✅ Lower α (more stringent significance level)
- ✅ Higher power (i.e., lower β)
- ✅ Larger variance (more noise)
- ✅ Unequal group sizes (deviates from 1:1)
- ✅ More comparisons (multiple endpoints or subgroups)
- ✅ Clustering (ICC reduces effective sample size)
Design Effect (for cluster RCTs): DE = 1 + (m − 1) × ICC
- m = average cluster size
- ICC = intra-cluster correlation coefficient (typically 0.01–0.05 in O&G)
- Effective sample size = Actual sample size / DE
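The design effect formula is easy to apply in code. The cluster size of 50, ICC of 0.02, and sample sizes below are hypothetical values chosen for illustration, as are the function names.

```python
import math


def design_effect(mean_cluster_size, icc):
    """DE = 1 + (m - 1) x ICC."""
    return 1 + (mean_cluster_size - 1) * icc


def inflate_for_clustering(n_individual, mean_cluster_size, icc):
    """Total n a cluster RCT needs to match an individually randomised n."""
    return math.ceil(n_individual * design_effect(mean_cluster_size, icc))


# Hypothetical cluster trial: 50 women per hospital, ICC = 0.02
de = design_effect(mean_cluster_size=50, icc=0.02)
effective_n = 2000 / de                                   # what 2,000 recruited women are "worth"
needed_n = inflate_for_clustering(502, mean_cluster_size=50, icc=0.02)

print(f"Design effect: {de:.2f}")
print(f"Effective sample size of 2000 recruited: {effective_n:.0f}")
print(f"Recruit {needed_n} to match 502 individually randomised")
```

Even a small ICC roughly doubles the required sample size here because the clusters are large.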
Worked Example: To detect a difference in mean birth weight of 100 g (SD = 400 g) between smokers and non-smokers, with α = 0.05, power = 0.80, using a two-sided test:
n = [(z_{α/2} + z_β)² × 2σ²] / δ²
n = [(1.96 + 0.84)² × 2 × 400²] / 100²
n = [7.84 × 320,000] / 10,000 = 2,508,800 / 10,000 ≈ 251 per group
So ~502 women needed total.
For binary outcomes (e.g., preterm birth rate): Uses different formula based on proportions.
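The sketch below repeats the birth-weight calculation and adds one commonly used simple formula for two proportions, n = (z_{α/2} + z_β)² × [p₁(1 − p₁) + p₂(1 − p₂)] / (p₁ − p₂)². The proportions formula and the example rates (4% vs 2% preterm birth, borrowed from the cohort example) are assumptions for illustration; dedicated software may apply continuity corrections and give slightly larger numbers. The function names are hypothetical.

```python
import math
from statistics import NormalDist


def z_values(alpha=0.05, power=0.80):
    """Standard normal quantiles for a two-sided alpha and the given power."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return z_alpha, z_beta


def n_per_group_means(delta, sd, alpha=0.05, power=0.80):
    """Unrounded n per group for comparing two means (equal SD, 1:1 allocation)."""
    z_alpha, z_beta = z_values(alpha, power)
    return (z_alpha + z_beta) ** 2 * 2 * sd ** 2 / delta ** 2


def n_per_group_proportions(p1, p2, alpha=0.05, power=0.80):
    """Unrounded n per group for comparing two proportions (simple formula)."""
    z_alpha, z_beta = z_values(alpha, power)
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_beta) ** 2 * variance_sum / (p1 - p2) ** 2


n_means = n_per_group_means(delta=100, sd=400)
n_props = n_per_group_proportions(p1=0.04, p2=0.02)

print(f"Means: {n_means:.1f} per group, round up to {math.ceil(n_means)}")
print(f"Proportions: {n_props:.1f} per group, round up to {math.ceil(n_props)}")
```

Using exact normal quantiles rather than the rounded 1.96 and 0.84 gives a value marginally above the hand calculation, and sample sizes are always rounded up. Note how much larger the binary-outcome requirement is for a 2 percentage point difference.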
Interim Analyses & Data Monitoring
- Data Monitoring Committee (DMC), also called a Data and Safety Monitoring Board (DSMB): Independent group of experts with access to unblinded data
- Responsibilities: Recommend stopping for efficacy (overwhelming benefit), harm (safety concerns), or futility (no realistic chance of benefit)
- Members: Clinicians, statisticians, sometimes ethicists
- Must be independent of trial investigators and sponsor
- Stopping for futility: Uses conditional power — probability of reaching significant result at final analysis given current data
1.6 Ecological Studies
Design: Groups (populations) as unit of observation, not individuals
Examples: Comparing caesarean section rates across countries; correlating sunlight exposure and pre-eclampsia rates by region
Ecological fallacy (Robinson, 1950): Associations at population level may NOT hold at individual level. Classic example: across US states, those with a higher proportion of immigrants had higher literacy rates, yet at the individual level immigrants were less literate than the native-born. The state-level pattern arose because immigrants tended to settle in states where the native-born population was more literate.
2. Screening
2.1 The Wilson-Jungner Criteria (1968) — Detailed
The 10 classic criteria proposed by Wilson and Jungner for the WHO. Every MRCOG candidate must know these:
1. The condition should be an important health problem
   - Burden of disease measured by incidence, prevalence, morbidity, mortality, economic cost
   - In O&G: Down's syndrome (lifetime cost ~£500k); cervical cancer (~850 deaths/year UK); GDM (affects ~5% pregnancies)
2. There should be an accepted treatment for patients with recognised disease
   - If no effective treatment exists, screening may cause harm without benefit
   - Exception: conditions where knowing diagnosis allows reproductive choice (Down's syndrome, anencephaly)
   - Example of problematic screening: some rare genetic conditions with no treatment
3. Facilities for diagnosis and treatment should be available
   - If screening identifies positives but diagnostic capacity is insufficient → anxiety and harm
   - UK has detailed pathways: screen positive → referral to fetal medicine unit or colposcopy within 2 weeks
4. There should be a recognisable latent or early symptomatic stage
   - Diseases with long preclinical phase are good screening targets
   - Cervical cancer: HPV infection → CIN I/II/III → invasive cancer (10+ year window)
   - Ovarian cancer: NO good latent stage → screening has failed in trials (UKCTOCS, PLCO)
5. There should be a suitable test or examination
   - Test must be acceptable, accurate (high sensitivity/specificity), and feasible at population scale
   - Combined test for Down's: NT ultrasound (~20 min), blood test → acceptable but requires skilled sonographers
6. The natural history of the condition, including development from latent to declared disease, should be adequately understood
   - Without knowing natural history, we cannot predict who will progress
   - CIN: Most low-grade lesions regress; only high-grade progress (essential knowledge for appropriate management)
7. There should be an agreed policy on whom to treat as patients
   - Clear thresholds for intervention needed
   - GDM: IADPSG criteria (one-step) vs NICE criteria (two-step) produce different prevalence
   - HPV vaccine policy: age 12–13 girls (and boys from 2019) in UK
8. The total cost of finding a case should be economically balanced in relation to medical expenditure as a whole
   - Cost per QALY gained; NICE threshold ~£20,000–30,000/QALY
   - NIPT: ~£500/test; combined test: ~£80; NICE considered cost-effectiveness
9. Case-finding should be a continuing process and not a "once and for all" project
   - Screening must be repeated at appropriate intervals
   - Cervical screening: 3-yearly (25–49), 5-yearly (50–64)
   - Antenatal screening: per pregnancy (not lifetime)
10. The test should be acceptable to the population
    - Low uptake → programme ineffective
    - Cervical screening uptake: ~70% UK (below 80% target)
    - Antenatal HIV screening: >99% uptake (well accepted as routine)
2.2 Test Performance Characteristics — Complete Details
The 2×2 Table
| | Disease + (Gold Standard) | Disease − (Gold Standard) | Total |
|---|---|---|---|
| Test + | True positive (TP) | False positive (FP) | TP + FP |
| Test − | False negative (FN) | True negative (TN) | FN + TN |
| Total | TP + FN | FP + TN | N |
Disease prevalence = (TP + FN) / N — this is the pre-test probability if the screening population mirrors the study population
Key Measures — Expanded with Clinical Interpretation
| Measure | Formula | What it tells us | Clinical use |
|---|---|---|---|
| Sensitivity (Sn) | TP / (TP + FN) | Of those WITH disease, how many test positive? | SnNOut: High Sn → negative test rules OUT disease |
| Specificity (Sp) | TN / (TN + FP) | Of those WITHOUT disease, how many test negative? | SpPIn: High Sp → positive test rules IN disease |
| Positive Predictive Value (PPV) | TP / (TP + FP) | Of those who test positive, how many actually HAVE disease? | Counselling patient with positive result |
| Negative Predictive Value (NPV) | TN / (TN + FN) | Of those who test negative, how many actually are FREE of disease? | Counselling patient with negative result |
| Accuracy | (TP + TN) / N | Proportion correctly classified | Overall measure but misleading when prevalence low |
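The formulas in the table translate directly into code. The counts used below are hypothetical (a 1,000-person validation study with 10% prevalence), and the function name `diagnostic_metrics` is an illustrative choice.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Test performance measures from the four cells of a 2x2 table."""
    n = tp + fp + fn + tn
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / n,
        "prevalence": (tp + fn) / n,
    }


# Hypothetical counts for illustration only
metrics = diagnostic_metrics(tp=90, fp=40, fn=10, tn=860)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

Sensitivity and specificity are calculated down the columns (by true disease status), whereas PPV and NPV are calculated across the rows (by test result), which is why only the predictive values shift with prevalence.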
Prevalence Effect on PPV — Expanded
The single most important concept in screening for MRCOG. PPV depends on prevalence, and thus screening works well in high-prevalence populations but poorly in low-prevalence populations.
Worked Example: Test with Sn = 99%, Sp = 99%
Scenario A: High prevalence (50%) — e.g., symptomatic women referred to clinic
| | Disease + | Disease − | Total |
|---|---|---|---|
| Test + | 495 (TP) | 5 (FP) | 500 |
| Test − | 5 (FN) | 495 (TN) | 500 |
| Total | 500 | 500 | 1000 |
PPV = 495/500 = 99% (a positive test is very reliable)
NPV = 495/500 = 99%
Scenario B: Low prevalence (1%) — general population screening
| | Disease + | Disease − | Total |
|---|---|---|---|
| Test + | 99 (TP) | 99 (FP) | 198 |
| Test − | 1 (FN) | 9,801 (TN) | 9,802 |
| Total | 100 | 9,900 | 10,000 |
PPV = 99/198 = 50% (half of positives are false!) NPV = 9801/9802 = 99.99%
Scenario C: Very low prevalence (0.1%) — rare disease screening
| | Disease + | Disease − | Total |
|---|---|---|---|
| Test + | 9.9 (TP) | 99.9 (FP) | 109.8 |
| Test − | 0.1 (FN) | 9,890.1 (TN) | 9,890.2 |
| Total | 10 | 9,990 | 10,000 |
PPV = 9.9/109.8 = 9% (91% of positives are false!) NPV = 9890.1/9890.2 = ~100%
MRCOG Take-home: Even an "excellent" test (99% Sn, 99% Sp) has PPV of only 50% when prevalence is 1%, and only 9% when prevalence is 0.1%. This is why screening for very rare conditions is problematic.
Clinical Example: NIPT for Down's Syndrome
- Sn = 99.5%, Sp = 99.9%
- Prevalence at term = 1/800 (0.125%)
PPV = (0.995 × 0.00125) / [(0.995 × 0.00125) + (0.001 × 0.99875)] = 0.00124 / (0.00124 + 0.000999) = 0.00124 / 0.00224 = 0.554 ≈ 55%
So even NIPT, the best screening test, has PPV ~55% for Down's syndrome in a low-risk population. A positive NIPT still requires confirmatory invasive testing (CVS or amniocentesis).
For high-risk population (e.g., women aged 40 with combined test risk 1:10): Prevalence ~10% PPV = (0.995 × 0.10) / [(0.995 × 0.10) + (0.001 × 0.90)] = 0.0995 / (0.0995 + 0.0009) = 99.1%
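The dependence of PPV on prevalence can be sketched with Bayes' theorem, using only sensitivity, specificity and prevalence as inputs. The function name `ppv_from_prevalence` is an illustrative choice; the input values are the ones quoted in the worked examples in this section.

```python
def ppv_from_prevalence(sensitivity, specificity, prevalence):
    """PPV via Bayes' theorem: P(disease | test positive)."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


# Test with Sn = Sp = 99% at three prevalences
for prev in (0.50, 0.01, 0.001):
    print(f"prevalence {prev:.1%}: PPV = {ppv_from_prevalence(0.99, 0.99, prev):.1%}")

# NIPT figures as quoted in the text (Sn 99.5%, Sp 99.9%)
nipt_low_risk = ppv_from_prevalence(0.995, 0.999, 1 / 800)
nipt_high_risk = ppv_from_prevalence(0.995, 0.999, 0.10)
print(f"NIPT, prevalence 1/800: PPV = {nipt_low_risk:.1%}")
print(f"NIPT, prevalence 10%:   PPV = {nipt_high_risk:.1%}")
```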
2.3 Likelihood Ratios — Complete Guide
LR+ = Sensitivity / (1 − Specificity)
- Tells you how much more likely a positive test is in someone WITH the disease vs WITHOUT
- Range: 1 to ∞
- Higher = better (more diagnostic information)

LR− = (1 − Sensitivity) / Specificity
- Tells you how much less likely a negative test is in someone WITH the disease vs WITHOUT
- Range: 0 to 1
- Lower = better (closer to 0)
| LR Value | Impact on Post-test Probability |
|---|---|
| LR+ > 10 | Large, often conclusive increase |
| LR+ 5–10 | Moderate increase |
| LR+ 2–5 | Small increase |
| LR+ 1–2 | Minimal increase |
| LR+ = 1 | No diagnostic value |
| LR− < 0.1 | Large, often conclusive decrease |
| LR− 0.1–0.2 | Moderate decrease |
| LR− 0.2–0.5 | Small decrease |
| LR− 0.5–1.0 | Minimal decrease |
Using LRs in Clinical Practice (Bayes' Theorem):
Step 1: Convert pre-test probability to pre-test odds
- Odds = probability / (1 − probability)
- Example: Pre-test probability of Down's = 1/250 = 0.004
- Pre-test odds = 0.004 / 0.996 ≈ 0.004

Step 2: Multiply by LR to get post-test odds
- Post-test odds = Pre-test odds × LR
- If combined test positive (LR+ = 8): Post-test odds = 0.004 × 8 = 0.032

Step 3: Convert back to probability
- Post-test probability = odds / (1 + odds)
- = 0.032 / 1.032 = 0.031 = 3.1% (or about 1 in 32)
Fagan nomogram: A graphical tool that does this conversion for you. Draw a line from pre-test probability through the LR to read post-test probability directly.
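The three steps above (probability to odds, multiply by the LR, odds back to probability) can be sketched as a small function. The names `likelihood_ratios` and `post_test_probability` are illustrative; the inputs are the ones from the worked example.

```python
def likelihood_ratios(sensitivity, specificity):
    """Return (LR+, LR-) from sensitivity and specificity."""
    lr_pos = sensitivity / (1 - specificity)
    lr_neg = (1 - sensitivity) / specificity
    return lr_pos, lr_neg


def post_test_probability(pre_test_probability, lr):
    """Apply a likelihood ratio to a pre-test probability via odds."""
    pre_odds = pre_test_probability / (1 - pre_test_probability)  # step 1
    post_odds = pre_odds * lr                                     # step 2
    return post_odds / (1 + post_odds)                            # step 3


# Worked example from the text: pre-test probability 1/250, LR+ = 8
post = post_test_probability(1 / 250, 8)
print(f"post-test probability = {post:.3f} (about 1 in {round(1 / post)})")
```

This is the same calculation the Fagan nomogram performs graphically.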
2.4 ROC Curves — Detailed
Receiver Operating Characteristic curve:
- X-axis: 1 − Specificity (false positive rate)
- Y-axis: Sensitivity (true positive rate)
- Each point = test at different threshold/cut-off
AUC (Area Under the Curve):
| AUC | Interpretation |
|---|---|
| 0.5 | No better than chance (diagonal line) |
| 0.6–0.7 | Poor |
| 0.7–0.8 | Moderate (acceptable) |
| 0.8–0.9 | Good (excellent for many applications) |
| 0.9–1.0 | Excellent |
Choosing the optimal cut-off:
- Youden index: J = Sensitivity + Specificity − 1
  - Maximised at optimal threshold
  - Gives equal weight to Sn and Sp
- Clinical weighting: If FN more harmful than FP → choose lower threshold (higher Sn, lower Sp)
  - Example: Screening for anencephaly, where a missed case is catastrophic → high sensitivity prioritised
- Economic weighting: Cost of FP (anxiety, further tests) vs FN (missed case)

Worked O&G Example: cffDNA for Down's syndrome
- AUC > 0.99 (excellent)
- At standard cut-off (z-score > 3): Sn = 99.5%, Sp = 99.9%
- Can trade off: at z-score > 2: Sn > 99.9%, Sp = 99.0% (more FPs but fewer missed cases)
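A brief sketch of how the Youden index compares cut-offs, using the two cffDNA cut-offs quoted above. The dictionary layout and the function name `youden_index` are illustrative, and "Sn > 99.9%" is entered as 0.999, which is an assumption made only so that a number can be computed.

```python
def youden_index(sensitivity, specificity):
    """J = Sn + Sp - 1; gives equal weight to sensitivity and specificity."""
    return sensitivity + specificity - 1


# (sensitivity, specificity) at each cut-off; 0.999 stands in for "Sn > 99.9%" (assumption)
cutoffs = {
    "z > 3": (0.995, 0.999),
    "z > 2": (0.999, 0.990),
}

j_values = {label: youden_index(sn, sp) for label, (sn, sp) in cutoffs.items()}
best_cutoff = max(j_values, key=j_values.get)

for label, j in j_values.items():
    print(f"{label}: J = {j:.3f}")
print("Highest Youden index:", best_cutoff)
```

Because J weights false negatives and false positives equally, a programme that regards a missed case as worse than a false alarm might still prefer the cut-off with the lower J.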
2.5 Screening in O&G — Complete Clinical Details
Antenatal Screening Programme (UK)
| Condition | Screening Test | Timing | Sensitivity | Specificity | Notes |
|---|---|---|---|---|---|
| Down's syndrome (T21) | Combined test (NT + PAPP-A + β-hCG) | 11–14 wks | ~85% at 5% FPR | 95% | NICE recommendation |
| Down's syndrome (T21) | Quadruple test (AFP + hCG + uE3 + Inhibin A) | 14–20 wks | ~80% at 5% FPR | 95% | When late booking or NT not available |
| Down's syndrome | NIPT (cfDNA) | From 10 wks | >99% | >99% | Contingent screening in NHS (if combined risk ≥ 1:150) |
| Edwards' syndrome (T18) | Combined test + NIPT | 11–14 wks | ~90% | 99.9% | Low PAPP-A and hCG |
| Patau's syndrome (T13) | Combined test + NIPT | 11–14 wks | ~85% | 99.9% | Low PAPP-A and hCG |
| Neural tube defects | AFP + anomaly scan | 18–20 wks | ~90% (anencephaly) | >99% | Anomaly scan is gold standard |
Fetal Anomaly Screening Programme (FASP) — UK
The 11 conditions screened for at the 18–20 week anomaly scan:
- Anencephaly — absence of cranial vault; uniformly lethal
- Open spina bifida — neural tube defect; severity varies
- Cleft lip — with or without cleft palate
- Diaphragmatic hernia — herniation of abdominal contents into chest
- Gastroschisis — abdominal wall defect (right of umbilical cord)
- Exomphalos (omphalocele) — abdominal wall defect (midline, membrane-covered)
- Serious cardiac anomalies — four-chamber view + outflow tracts (detects ~50% of major CHD)
- Bilateral renal agenesis — absence of both kidneys → anhydramnios → pulmonary hypoplasia
- Lethal skeletal dysplasia — severe short limbs, narrow thorax
- Edwards' syndrome (T18) — structural anomalies + growth restriction
- Patau's syndrome (T13) — structural anomalies + holoprosencephaly
Detection rates for anomaly scan:
- Anencephaly: ~98%
- Open spina bifida: ~90%
- Cleft lip: ~75%
- Diaphragmatic hernia: ~60%
- Gastroschisis: ~90%
- Major cardiac anomalies: ~50%
- Bilateral renal agenesis: ~85%
Gestational Diabetes Mellitus (GDM) Screening
| Approach | Method | Criteria | Prevalence detected |
|---|---|---|---|
| Universal (IADPSG/WHO) | One-step: 75g OGTT at 24–28 wks | Fasting ≥5.1, 1h ≥10.0, 2h ≥8.5 mmol/L | ~15–20% |
| Selective (NICE) | Risk-factor based: 75g OGTT at 24–28 wks | Fasting ≥5.6, 2h ≥7.8 mmol/L | ~5% |
| Two-step (ACOG) | 50g GCT → if ≥7.8 → 100g OGTT (Carpenter-Coustan) | Two values elevated | ~6–8% |
NICE risk factors for GDM (2015):
- BMI > 30 kg/m²
- Previous GDM
- Family history (first-degree relative with diabetes)
- Ethnicity: South Asian, Black Caribbean, Middle Eastern
- Previous macrosomic baby (≥4.5 kg)
- Polycystic ovary syndrome
Cervical Screening (NHS Cervical Screening Programme)
| Aspect | Details |
|---|---|
| Age range | 25–64 years |
| Frequency | 3-yearly (25–49); 5-yearly (50–64) |
| Primary test | HPV test (since 2019) |
| Reflex cytology | If HPV positive → cytology on same sample |
| Colposcopy referral | HPV positive + abnormal cytology (≥ borderline) |
| HPV 16/18 genotyping | If HPV positive with normal cytology → genotyping; 16/18+ → colposcopy; other HR-HPV → repeat in 12 months |
| Upper age | 64 (if last 2 screens negative, no further screening) |
| Uptake | ~70% (below 80% target) |
| Approach | Call-recall system via GP registration |
Group B Streptococcus (GBS) Screening
| Aspect | UK Practice | US Practice |
|---|---|---|
| Approach | Risk-factor based | Universal screening |
| Timing | At labour onset (risk-based) | 35–37 weeks |
| Test | Not routine | Vaginal-rectal swab (enriched culture) |
| Risk factors | Previous GBS baby, GBS bacteriuria in pregnancy, preterm labour, prolonged ROM (>18h), intrapartum fever ≥38°C | None (universal screening) |
Other Antenatal Screening
| Test | Timing | Condition |
|---|---|---|
| HIV | Booking (and 28 wks if high risk) | Vertical transmission rate <1% with treatment |
| HBsAg | Booking | Hepatitis B — immunoprophylaxis reduces vertical transmission |
| Syphilis (TPPA/VDRL) | Booking | Congenital syphilis preventable |
| Rubella IgG | Booking | Susceptibility detected → post-partum vaccination |
| Sickle cell and thalassaemia | Booking | Family origin questionnaire + Hb HPLC |
| Asymptomatic bacteriuria | Booking | Urine culture (MSU) |
| Anaemia | Booking + 28 wks | FBC |
2.6 Screening Biases — Expanded
| Bias | Mechanism | Example |
|---|---|---|
| Lead time bias | Screening advances time of diagnosis but does NOT delay death. Survival appears longer because the clock starts earlier, even if death occurs at the same time. | Screening for ovarian cancer: if diagnosis moved from age 62 to age 60 but death at age 65 in both, apparent "survival" increases from 3 to 5 years — no real benefit. |
| Length time bias | Screening preferentially detects slower-growing (less aggressive) disease because it stays in the detectable preclinical phase longer. Fast-growing aggressive disease is more likely to present symptomatically between screens. | Cervical screening: Screen-detected CIN tends to be slower-progressing. Rapidly progressive cancers may present as interval cancers between screens. |
| Overdiagnosis | Detection of disease that would NEVER have caused symptoms or death. The patient is "harmed" by unnecessary diagnosis and treatment. | Screening for neuroblastoma in infants (abandoned due to overdiagnosis); overdiagnosis in thyroid and breast cancer screening is well-documented. |
| Selection bias (volunteer bias) | People who participate in screening are systematically different from those who don't — typically healthier, more health-conscious, higher SES. | Women attending for cervical screening have lower cervical cancer risk regardless of screening (healthy behaviours). |
| Recall rate | Proportion of screened population recalled for further investigations. Must balance: high recall → more detected cases but more anxiety and cost; low recall → missed cases. | Combined test recall rate: ~5% (a positive screening result). Of those, ~5% have Down's syndrome (PPV ~5% in low-risk population). |
| False positive rate | Proportion of normal pregnancies incorrectly labelled as high-risk. Causes anxiety, unnecessary invasive tests (with miscarriage risk ~0.5–1%), and increased healthcare costs. | Combined test FPR = 5%. For every 100,000 women screened, ~5,000 will be screen-positive; ~4,750 will be false positives. |
Screening vs Diagnostic Accuracy
| Aspect | Screening Test | Diagnostic Test |
|---|---|---|
| Population | Asymptomatic, low prevalence | Symptomatic, high pre-test probability |
| Purpose | Identify those who need diagnostic testing | Confirm or exclude diagnosis |
| Test characteristics | High sensitivity (minimise FNs) | High specificity (minimise FPs) |
| Acceptability | Must be acceptable to healthy people | Acceptability less critical |
| Cost | Must be cheap | Can be more expensive |
| PPV | Often low (due to low prevalence) | Higher (due to pre-test probability) |
| Example | Combined test for Down's (screening) | CVS/amniocentesis for karyotype (diagnostic) |
3. Descriptive Statistics
3.1 Types of Data — Complete Classification
Data
├── Categorical
│   ├── Nominal
│   └── Ordinal
└── Numerical
    ├── Discrete
    └── Continuous
| Data Type | Description | Examples in O&G | Permissible Statistics |
|---|---|---|---|
| Nominal | Unordered categories | Blood group (A, B, AB, O), ethnicity, parity type (nulliparous/multiparous), mode of delivery (SVD, VEEB, CS) | Mode, frequency, χ², Fisher's exact |
| Ordinal | Ordered categories | FIGO stage (I–IV), pain score (0–10), Bishop's score, Apgar score, severity of incontinence (mild/moderate/severe) | Median, IQR, Mann-Whitney, Wilcoxon, percentiles |
| Discrete | Integer values (countable) | Parity (0, 1, 2...), number of miscarriages, gravidity, number of previous CS | Mean (if normally distributed), median (if skewed) |
| Continuous | Any value on a continuum | Birth weight, gestational age, BMI, blood pressure, Hb, cervical length | Mean, SD, t-test, ANOVA, regression |
Special case — Binary/Dichotomous: Nominal with exactly 2 categories - Alive/dead, pregnant/not, term/preterm - Can use: proportions, OR, RR, logistic regression
Hierarchy of data: As you go from nominal → ordinal → interval → ratio, you gain more mathematical properties and more statistical options.
3.2 Measures of Central Tendency — Complete Guide
| Measure | Definition | Formula | When to use |
|---|---|---|---|
| Mean (arithmetic) | Sum of all values divided by number of values | x̄ = Σxᵢ / n | Normally distributed, interval/ratio data |
| Median | Middle value when data ordered from smallest to largest | Value at position (n+1)/2 | Skewed data, ordinal data, presence of outliers |
| Mode | Most frequently occurring value | Value with highest frequency | Nominal data, bimodal distributions |
Mean
Advantages: Uses all data points; mathematically tractable (basis for many statistical tests) Disadvantages: Affected by outliers and skewness
Example: Birth weights (kg) of 5 babies: 2.5, 3.0, 3.2, 3.5, 4.8
- Mean = (2.5 + 3.0 + 3.2 + 3.5 + 4.8) / 5 = 17.0 / 5 = 3.4 kg
- Median = 3.2 kg (3rd value of 5)
- Mode: no repeated values → no mode

If the 4.8 kg value had been recorded as 10.0 kg in error:
- Mean = (2.5 + 3.0 + 3.2 + 3.5 + 10.0) / 5 = 4.44 kg (dramatically changed!)
- Median = 3.2 kg (unchanged!)
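The effect of a single outlier on the mean and median can be checked with Python's standard `statistics` module, using the birth weights from the example. The variable names are illustrative.

```python
from statistics import mean, median

weights = [2.5, 3.0, 3.2, 3.5, 4.8]              # birth weights in kg, from the example
weights_with_error = [2.5, 3.0, 3.2, 3.5, 10.0]  # 4.8 kg mis-recorded as 10.0 kg

print(f"original:   mean = {mean(weights):.2f} kg, median = {median(weights):.1f} kg")
print(f"with error: mean = {mean(weights_with_error):.2f} kg, "
      f"median = {median(weights_with_error):.1f} kg")
```

The mean moves from 3.40 to 4.44 kg while the median stays at 3.2 kg, which is why the median is preferred when outliers or skew are present.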
Median
Advantages: Robust to outliers and skewness; appropriate for ordinal data Disadvantages: Does not use all data; less mathematically tractable
Calculation: - If n is odd: middle value (e.g., n=5 → 3rd value) - If n is even: average of two middle values (e.g., n=6 → average of 3rd and 4th values)
Mode
Advantages: Only measure for nominal data; can identify bimodal distributions Disadvantages: May not exist (no repeated values); may not be unique
Bimodal distribution example: Birth weight in preterm vs term babies will show two peaks.
Skewness — Visualising the Distribution
[Figure: three distribution curves. Normal: Mean = Median = Mode. Positive skew (tail to the right): Mode < Median < Mean. Negative skew (tail to the left): Mean < Median < Mode.]
| Skew | Direction | Relationship | Example in O&G |
|---|---|---|---|
| Positive (right) skew | Long tail to the right | Mean > Median > Mode | Length of hospital stay after CS, parity in general population, time to conceive |
| Negative (left) skew | Long tail to the left | Mean < Median < Mode | Age at menopause (most women 48–52, few <40 or >55) |
| No skew (symmetrical) | Bell-shaped | Mean = Median = Mode | Normally distributed: birth weight in term infants, height |
Skewness coefficient = 0 for normal distribution; >0 for positive skew; <0 for negative skew.
Kurtosis: Measures "peakedness" of distribution
- Leptokurtic: Tall peak, heavy tails (more outliers)
- Platykurtic: Flat peak, thin tails
- Mesokurtic: Normal distribution (kurtosis = 3 for normal; excess kurtosis = 0)
3.3 Measures of Dispersion — Complete Guide
| Measure | Formula/Definition | Robust to outliers? | When to use |
|---|---|---|---|
| Range | Max − Min | NO | Quick summary only |
| Interquartile Range (IQR) | Q3 − Q1 (75th − 25th percentile) | YES | With median; skewed data |
| Population variance (σ²) | Σ(xᵢ − μ)² / N | NO | Intermediate calculation for SD |
| Sample variance (s²) | Σ(xᵢ − x̄)² / (n−1) | NO | Unbiased estimate from sample |
| Standard deviation (SD) | √Variance | NO | Mean ± SD for normal data |
| Coefficient of variation (CV) | (SD / Mean) × 100% | — | Comparing variability across different scales |
| Standard error of mean (SEM) | SD / √n | — | Precision of sample mean estimate |
Range
- Simplest measure
- Highly sensitive to outliers
- May be missing extreme values if sample is small
Interquartile Range (IQR)
- Contains middle 50% of data
- Q1 = 25th percentile, Q3 = 75th percentile
- Used with median for skewed data
- Box plot whiskers typically extend to 1.5 × IQR beyond Q1 and Q3
Variance and Standard Deviation
Variance = average squared deviation from the mean - Population variance: σ² = Σ(xᵢ − μ)² / N - Sample variance: s² = Σ(xᵢ − x̄)² / (n−1)
Why n−1? Bessel's correction — using n−1 gives an unbiased estimate of population variance from a sample.
Standard deviation = √variance - In the SAME units as original data (unlike variance) - For normally distributed data: 68% within mean ± 1 SD, 95% within ± 1.96 SD
Worked Example: Cervical length measurements (mm): 25, 30, 32, 35, 38
| Step | Calculation | Result |
|---|---|---|
| Mean | (25+30+32+35+38)/5 | 32 |
| Deviations | −7, −2, 0, +3, +6 | |
| Squared deviations | 49, 4, 0, 9, 36 | |
| Sum of squares | 49+4+0+9+36 | 98 |
| Variance (sample) | 98/(5−1) | 24.5 mm² |
| SD | √24.5 | 4.95 mm |
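The worked example can be reproduced with the standard `statistics` module, which also makes the n versus n−1 distinction concrete: `variance` and `stdev` divide by n−1 (sample), while `pvariance` divides by n (population).

```python
from statistics import mean, pvariance, stdev, variance

cervical_lengths = [25.0, 30.0, 32.0, 35.0, 38.0]  # mm, from the worked example

sample_var = variance(cervical_lengths)       # sum of squares / (n - 1)
sample_sd = stdev(cervical_lengths)           # square root of the sample variance
population_var = pvariance(cervical_lengths)  # sum of squares / n

print(f"mean = {mean(cervical_lengths):.1f} mm")
print(f"sample variance = {sample_var:.1f} mm², sample SD = {sample_sd:.2f} mm")
print(f"population variance (divides by n) = {population_var:.1f} mm²")
```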
Coefficient of Variation (CV)
- CV = (SD / Mean) × 100%
- Allows comparison of variability across different scales or units
- Example: Birth weight SD = 400g, mean = 3400g → CV = 11.8%
- Another population: SD = 300g, mean = 2800g → CV = 10.7%
- The first population has higher absolute variability but similar relative variability
Standard Error of the Mean (SEM)
CRITICAL DISTINCTION for MRCOG: SD vs SEM
| | SD | SEM |
|---|---|---|
| What it describes | Variability of INDIVIDUAL observations | Precision of the SAMPLE MEAN estimate |
| Formula | SD = √(Σ(x−x̄)²/(n−1)) | SEM = SD / √n |
| Effect of n | Stable (doesn't systematically change with n) | DECREASES as n increases (more data = more precise mean) |
| Interpretation | ~95% of individuals fall within x̄ ± 2SD | ~95% CI for the mean = x̄ ± 2×SEM |
| Use | Describing population spread | Inferential statistics, CI for mean |
Example: Birth weight study, n = 1000, mean = 3400g, SD = 400g
- SEM = 400 / √1000 = 400 / 31.62 = 12.6 g
- 95% CI for mean = 3400 ± 1.96 × 12.6 = 3400 ± 24.8 = (3375, 3425)
- Interpretation: We are 95% confident the true population mean is between 3375g and 3425g
Note: 95% of INDIVIDUAL birth weights are in the range 3400 ± 800g (= mean ± 2SD), NOT the 95% CI of the mean.
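A short sketch of the SD versus SEM distinction, using the figures from the birth weight example. It uses a normal (z) approximation for the confidence interval, which is reasonable at n = 1000; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

n, sample_mean, sd = 1000, 3400.0, 400.0   # figures from the example

sem = sd / sqrt(n)                          # precision of the sample mean
z = NormalDist().inv_cdf(0.975)             # about 1.96 for a two-sided 95% interval
ci_low, ci_high = sample_mean - z * sem, sample_mean + z * sem

# Range covering roughly 95% of INDIVIDUAL babies (mean ± 2 SD)
individual_low, individual_high = sample_mean - 2 * sd, sample_mean + 2 * sd

print(f"SEM = {sem:.1f} g")
print(f"95% CI for the mean: {ci_low:.0f} to {ci_high:.0f} g")
print(f"Approximate 95% range for individuals: {individual_low:.0f} to {individual_high:.0f} g")
```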
3.4 Normal Distribution — Complete Details
Properties of the Normal (Gaussian) Distribution:
- Symmetrical about the mean
- Mean = Median = Mode
- Bell-shaped with tails approaching but never reaching zero
- Defined by two parameters: μ (mean) and σ (SD)
- Area under curve = 1 (probability)
The 68-95-99.7 Rule
| Range | Proportion included | Commonly known as |
|---|---|---|
| μ ± 1σ | 68.27% | 68% |
| μ ± 1.645σ | 90.00% | Central 90% (5th to 95th percentiles) |
| μ ± 1.96σ | 95.00% | 95% reference range |
| μ ± 2σ | 95.45% | Approximate 95% |
| μ ± 2.58σ | 99.00% | 99% reference range |
| μ ± 3σ | 99.73% | 99.7% |
Standard Normal Distribution
- Z = (x − μ) / σ
- Mean = 0, SD = 1
- Z-table gives the probability of values less than a given Z-score
- Critical values for hypothesis testing:
- z₀.₀₂₅ = 1.96 (two-tailed 95% test)
- z₀.₀₅ = 1.645 (one-tailed 95% test)
- z₀.₀₀₅ = 2.58 (two-tailed 99% test)
Worked example: What proportion of term babies weigh <2500g if mean = 3400g, SD = 400g? - Z = (2500 − 3400) / 400 = −900/400 = −2.25 - P(Z < −2.25) = 0.0122 (from Z-table) - → 1.22% of term babies weigh <2500g
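The Z-table lookup in this worked example can be reproduced with `statistics.NormalDist` from the standard library, assuming, as the example does, that term birth weight is normally distributed with mean 3400g and SD 400g.

```python
from statistics import NormalDist

birth_weight = NormalDist(mu=3400, sigma=400)  # distribution assumed in the example

z_score = (2500 - birth_weight.mean) / birth_weight.stdev
proportion_below = birth_weight.cdf(2500)      # P(weight < 2500 g)

print(f"Z = {z_score:.2f}")
print(f"P(weight < 2500 g) = {proportion_below:.4f}")
```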
Central Limit Theorem (CLT)
Critical theorem: The sampling distribution of the mean approaches a normal distribution as sample size increases, REGARDLESS of the shape of the population distribution.
- Why this matters: Even with skewed data, the sample mean is approximately normally distributed if n is large enough (typically n > 30)
- This underpins: Use of z-tests and t-tests even for non-normal data when n is large
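The Central Limit Theorem can be illustrated with a small simulation. This is a sketch under stated assumptions: the population is exponential with mean 1 and SD 1 (strongly right-skewed), the sample size is 50, and the seed is fixed for reproducibility. The function name `sample_means` is hypothetical.

```python
import random
from math import sqrt
from statistics import fmean, stdev


def sample_means(n, replicates, seed=0):
    """Draw `replicates` samples of size n from a skewed population; return their means."""
    rng = random.Random(seed)
    return [fmean(rng.expovariate(1.0) for _ in range(n)) for _ in range(replicates)]


n = 50
means = sample_means(n=n, replicates=2000)

print(f"mean of sample means = {fmean(means):.3f} (population mean = 1)")
print(f"SD of sample means   = {stdev(means):.3f} (theory: 1/sqrt(n) = {1 / sqrt(n):.3f})")
```

The SD of the sample means is the standard error, so the simulation also shows where SEM = SD / √n comes from.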
Standard Error vs Standard Deviation — Clinical Example
A study measures birth weight in 10,000 babies (mean = 3400g, SD = 400g).
- SD = 400g → tells us most babies weigh between 2600g and 4200g (±2SD)
- SEM = 400/√10000 = 4g → tells us the mean is estimated very precisely (95% CI: ~3392 to 3408g)

A common mistake: Writing mean ± SD where mean ± SEM is intended (or vice versa). The MRCOG exam may test your ability to distinguish the two.
3.5 Skewed Distributions & Transformations
Log-normal distribution: - Data are positively skewed - After log-transformation, data become normally distributed - Common in O&G: length of labour, parity, time to pregnancy, hormone levels (e.g., hCG)
Transformation options:
| Transformation | Formula | When to use |
|---|---|---|
| Log | y = ln(x) or y = log₁₀(x) | Positive skew; multiplicative data |
| Square root | y = √x | Count data with moderate skew |
| Reciprocal | y = 1/x | Strong skew |
| Box-Cox | y = (x^λ − 1)/λ | Generalised power transformation |
| Logit | y = ln[p/(1−p)] | Proportions (0 to 1) |
| Arcsine | y = arcsin(√p) | Proportions (stabilises variance) |

How to check normality:
1. Histogram: visual inspection (bell-shaped?)
2. Q-Q plot (quantile-quantile plot): points along diagonal = normal
3. Shapiro-Wilk test: most powerful for small n (H₀: data are normal)
4. Kolmogorov-Smirnov test: suitable for large n
5. Skewness and kurtosis: skewness between −2 and +2 and kurtosis between −7 and +7 often considered acceptable
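To illustrate how a log transformation reduces positive skew, the sketch below computes a simple moment-based skewness coefficient before and after transforming. The data are hypothetical values invented for illustration (not real hormone levels), and `skewness` is an illustrative helper using the population form m₃ / m₂^1.5.

```python
from math import log


def skewness(values):
    """Moment-based skewness: third central moment / (second central moment ** 1.5)."""
    n = len(values)
    centre = sum(values) / n
    m2 = sum((v - centre) ** 2 for v in values) / n
    m3 = sum((v - centre) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


raw = [1, 2, 2, 3, 4, 5, 8, 13, 40, 120]   # hypothetical, strongly right-skewed
logged = [log(v) for v in raw]             # natural log transformation

skew_raw = skewness(raw)
skew_logged = skewness(logged)
print(f"skewness before log transform: {skew_raw:.2f}")
print(f"skewness after log transform:  {skew_logged:.2f}")
```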
3.6 Data Presentation — Types of Graphs
| Graph | Type of Data | Variables | Purpose | Key features |
|---|---|---|---|---|
| Histogram | Continuous | One variable | Show distribution shape | Bars TOUCH; bin width matters |
| Bar chart | Categorical | One or two categorical | Compare frequencies | Bars DO NOT touch |
| Box plot | Continuous | One variable (or grouped) | Show median, IQR, outliers | Whiskers ±1.5×IQR |
| Scatter plot | Continuous | Two continuous variables | Show relationship/correlation | Look for direction, strength, outliers |
| Line graph | Continuous (often time) | Continuous × time | Trend over time | Time on x-axis |
| Pie chart | Categorical | One categorical (proportions) | Show parts of a whole | Avoid >5 categories |
| Kaplan-Meier | Time-to-event | Survival time + group | Survival analysis | Step function; censoring marks |
| Forest plot | Meta-analysis | Multiple studies | Summarise effect sizes | Square size = weight; diamond = summary |
| Funnel plot | Meta-analysis | Effect size vs precision | Assess publication bias | Symmetrical = no bias |
| Bland-Altman | Continuous | Two measurement methods | Assess agreement | Difference vs mean of two methods |
Histogram vs Bar Chart — Critical MRCOG Distinction
| Feature | Histogram | Bar Chart |
|---|---|---|
| Data type | Continuous (or large discrete) | Categorical |
| Bars | Touch (no gap) | Do not touch (gap between) |
| Order | Natural order of variable (cannot reorder) | Can be reordered (e.g. alphabetical, by frequency) |
| Width | Can vary (if unequal bin widths) | Always equal |
| Example | Distribution of birth weights | Caesarean section rates by hospital |
Box Plot Interpretation
Upper whisker (largest value ≤ Q3 + 1.5×IQR)
│
─────┼───── Q3 (75th percentile)
│
─────┼───── Median (Q2, 50th percentile)
│
─────┼───── Q1 (25th percentile)
│
Lower whisker (smallest value ≥ Q1 − 1.5×IQR)
│
● Outlier (>1.5×IQR beyond Q1 or Q3)
Uses: - Comparing distributions across groups (e.g., birth weight by maternal smoking status) - Identifying outliers - Showing skewness (if median not centred in box)
Scatter Plot Interpretation
Look for: - Direction: Positive (both increase together) or negative (one increases, other decreases) - Strength: How closely points follow a line (tight = strong correlation) - Shape: Linear, curvilinear, no pattern - Outliers: Points far from main cluster - Subgroups: Distinct clusters suggest different populations
Bland-Altman Plot for Method Comparison
- X-axis: Mean of two measurements [(method A + method B)/2]
- Y-axis: Difference (method A − method B)
- Central horizontal line: Mean difference (bias)
- Dashed lines: Limits of agreement (mean ± 1.96 SD of differences)
- If limits are clinically acceptable → methods can be used interchangeably
- Used for: Comparing ultrasound measurements between operators, comparing new test to gold standard
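The quantities plotted on a Bland-Altman chart (bias and limits of agreement) can be sketched as follows. The paired measurements are hypothetical values for two operators measuring the same patients, and `bland_altman` is an illustrative function name.

```python
from statistics import mean, stdev


def bland_altman(method_a, method_b):
    """Return (bias, lower limit, upper limit) for paired measurements."""
    differences = [a - b for a, b in zip(method_a, method_b)]
    bias = mean(differences)              # mean difference (central line)
    spread = 1.96 * stdev(differences)    # half-width of the limits of agreement
    return bias, bias - spread, bias + spread


# Hypothetical paired measurements in mm (invented for illustration)
operator_a = [30.1, 25.4, 35.2, 28.0, 32.5, 27.3]
operator_b = [29.5, 26.0, 34.0, 28.4, 31.9, 27.9]

bias, lower, upper = bland_altman(operator_a, operator_b)
print(f"bias = {bias:.2f} mm, limits of agreement = {lower:.2f} to {upper:.2f} mm")
```

Whether limits of this width are acceptable is a clinical judgement, not a statistical one.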
4. Hypothesis Testing
4.1 Fundamental Concepts — Complete
| Concept | Symbol | Definition | Everyday analogy |
|---|---|---|---|
| Null hypothesis | H₀ | No difference / no association / no effect | "He is innocent" |
| Alternative hypothesis | H₁ | There IS a difference / association / effect | "He is guilty" |
| Type I error | α | Reject H₀ when H₀ is actually true (false positive) | Convicting an innocent person |
| Type II error | β | Fail to reject H₀ when H₁ is true (false negative) | Letting a guilty person go free |
| Power | 1 − β | Correctly rejecting H₀ when H₁ is true | Probability of detecting a real effect |
| p-value | p | Probability of observing the data (or more extreme) assuming H₀ is true | Not directly analogous |
4.2 Type I and Type II Errors — The 2×2 Framework
| Decision | H₀ TRUE | H₁ TRUE (H₀ FALSE) |
|---|---|---|
| Reject H₀ | Type I error (α) [FALSE POSITIVE] | ✅ CORRECT (True positive) |
| Fail to reject H₀ | ✅ CORRECT (True negative) | Type II error (β) [FALSE NEGATIVE] |
Type I Error (α)
- α = 0.05 means: If H₀ is true (no real effect), there is a 5% chance we will incorrectly conclude there IS an effect
- Trades off with Type II error — making α stricter (e.g., 0.01) reduces false positives but increases false negatives
- Multiple testing: If you test 20 independent null hypotheses, expected number of false positives = 20 × 0.05 = 1 (hence Bonferroni correction)
Type II Error (β) and Power
- β = 0.20 → Power = 0.80 is conventional minimum
- β = 0.10 → Power = 0.90 is preferred
- Power depends on:
- Sample size (n): Larger n → higher power
- Effect size (δ): Larger effect → higher power
- α-level: Less strict α (e.g., 0.05 vs 0.01) → higher power
- Variance (σ²): Lower variance → higher power
Worked Example of power concept: A study of 50 women finds no significant difference in birth weight between smokers and non-smokers (p = 0.12). The study was designed with 80% power to detect a 200g difference. The actual observed difference was 150g — the study was UNDER-powered to detect this smaller difference. Therefore the non-significant result does NOT mean there is no effect — it means we cannot rule out an effect of this size.
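The idea that a study powered for a 200g difference is under-powered for a 150g difference can be sketched with a normal approximation for comparing two means. The SD of 250g and the equal split of 25 women per group are assumptions introduced here (the example does not state them), chosen so that power for 200g comes out near 80%. The function `power_two_sample` is illustrative and ignores the negligible contribution of the opposite tail.

```python
from math import sqrt
from statistics import NormalDist


def power_two_sample(delta, sd, n_per_group, alpha=0.05):
    """Approximate power of a two-sided test comparing two means (normal approximation)."""
    z = NormalDist()
    standard_error = sd * sqrt(2 / n_per_group)
    z_critical = z.inv_cdf(1 - alpha / 2)
    return z.cdf(abs(delta) / standard_error - z_critical)


ASSUMED_SD = 250.0   # assumption, not given in the text
N_PER_GROUP = 25     # assumption: 50 women split equally

power_200 = power_two_sample(200, ASSUMED_SD, N_PER_GROUP)
power_150 = power_two_sample(150, ASSUMED_SD, N_PER_GROUP)
print(f"power to detect 200 g: {power_200:.0%}")
print(f"power to detect 150 g: {power_150:.0%}")
```

Under these assumptions the same study has only a little better than even odds of detecting a true 150g difference, so a non-significant result says little about whether such a difference exists.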
4.3 The p-value — Essential MRCOG Detail
CRITICAL EXAM POINT: The p-value is NOT the probability that the null hypothesis is true! This is the single most common statistical misconception tested in MRCOG.
Mathematical Definition: p-value = P(observed data OR more extreme | H₀ true)
It is NOT P(H₀ true | observed data)
The correct interpretation: "If there were truly no difference between groups, the probability of observing a difference as large (or larger) than the one we saw is p."
Common Misconceptions — All WRONG:
| ❌ Incorrect Statement | ✅ Correct Interpretation |
|---|---|
| "p = 0.03 means there is a 3% chance H₀ is true" | p = 0.03 means: if H₀ were true, we'd see data this extreme only 3% of the time |
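| "p = 0.05 means there is a 5% probability the result is due to chance" | p is calculated assuming H₀ (chance alone) is true; it is the probability of data this extreme under H₀, not the probability that chance explains the result |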
| "p > 0.05 means the treatment is equivalent to placebo" | Non-significant does NOT mean no effect — may be underpowered |
| "p = 0.001 means the effect is very large" | p does NOT measure effect size — only strength of evidence against H₀ |
| "We failed to reject H₀, so H₀ is true" | We cannot prove H₀ — only fail to find evidence against it |
4.4 Confidence Intervals — Detailed
Definition: A 95% confidence interval is a range calculated from the sample by a method that, across repeated samples, would capture the true population parameter 95% of the time.
Correct interpretation: If we repeated the study 100 times and calculated a 95% CI each time, about 95 of those CIs would contain the true population value.
WRONG interpretation: "There is a 95% probability that the true value lies within this CI" — this is a Bayesian credible interval interpretation, not a frequentist CI.
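The repeated-sampling interpretation can be illustrated by simulation. This is a sketch under stated assumptions: birth weights are drawn from a normal distribution with a known true mean of 3400g and SD of 400g, each simulated study has n = 30, and the interval uses the known SD (a z-interval). The function name `ci_coverage` and the seed are arbitrary.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def ci_coverage(true_mean=3400.0, sd=400.0, n=30, replicates=2000, seed=1):
    """Proportion of simulated 95% CIs that contain the true mean."""
    rng = random.Random(seed)
    half_width = NormalDist().inv_cdf(0.975) * sd / sqrt(n)  # z-interval, SD treated as known
    hits = 0
    for _ in range(replicates):
        sample_mean = fmean(rng.gauss(true_mean, sd) for _ in range(n))
        if sample_mean - half_width <= true_mean <= sample_mean + half_width:
            hits += 1
    return hits / replicates


coverage = ci_coverage()
print(f"Proportion of intervals containing the true mean: {coverage:.1%}")
```

The true mean never moves in the simulation; it is the intervals that vary from study to study, and roughly 95% of them capture it.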
CI provides MORE information than p-value:
- Shows the ESTIMATE (best guess of effect size)
- Shows the PRECISION (width = how certain we are)
- Shows STATISTICAL SIGNIFICANCE (if 95% CI excludes null value → p < 0.05)
- Shows CLINICAL SIGNIFICANCE (even if significant, is the entire CI in a clinically meaningful range?)
| CI includes null? | p-value | Interpretation |
|---|---|---|
| Yes (e.g., RR 1.2, 95% CI 0.9–1.5) | p ≥ 0.05 | Not statistically significant |
| No (e.g., RR 1.2, 95% CI 1.01–1.5) | p < 0.05 | Statistically significant |
| No (e.g., RR 1.2, 95% CI 1.1–1.3) | p < 0.001 | Significant AND precise |
Example: RR for preterm birth in smokers vs non-smokers
- Study A: RR = 1.5, 95% CI 0.8–2.2 (wide CI → imprecise; not significant)
- Study B: RR = 1.3, 95% CI 1.1–1.5 (narrow CI → precise; significant)
- Study C: RR = 1.1, 95% CI 1.01–1.19 (significant but clinically marginal)
4.5 One-tailed vs Two-tailed Tests
| Aspect | Two-tailed | One-tailed |
|---|---|---|
| Alternative hypothesis | H₁: μ₁ ≠ μ₂ (difference in either direction) | H₁: μ₁ > μ₂ (or μ₁ < μ₂) |
| When to use | Default — almost always | Only if difference in opposite direction is impossible or irrelevant |
| α distribution | Split equally between both tails (2.5% each) | All 5% in one tail |
| Critical value (α=0.05) | z = ±1.96 | z = 1.645 |
| For same data | p-value is 2× the one-tailed p | p-value is half the two-tailed p |
| Sample size | Larger | Smaller (for same power) |
| Controversy | Safe and standard | Can inflate Type I error if the "wrong" direction appears |
MRCOG rule: Always use two-tailed unless you have an extremely strong justification. The exam expects two-tailed as default.
Example: Comparing two antihypertensives in pregnancy — you cannot be certain a new drug won't be worse → two-tailed. If comparing a known teratogen to placebo, you might use one-tailed (it can't reduce malformation risk below background), but even then, two-tailed is safer.
4.6 Multiple Testing — Corrections
The problem: Each statistical test at α = 0.05 has a 5% chance of false positive. If you run many tests, the familywise error rate (FWER) increases.
FWER = 1 − (1 − α)ᵏ
| Number of tests (k) | FWER |
|---|---|
| 1 | 0.05 |
| 5 | 0.23 |
| 10 | 0.40 |
| 20 | 0.64 |
| 100 | 0.99 |
Bonferroni correction: - Adjusted α = 0.05 / k - Example: 10 comparisons → α = 0.005 - Very conservative — reduces Type I error but increases Type II error (reduces power)
Other methods:
| Method | Description | Comparison |
|---|---|---|
| Bonferroni | α/k | Most conservative |
| Holm-Bonferroni | Stepwise: smallest p tested at α/k, then α/(k−1), etc. | Less conservative, more powerful |
| Sidak | 1 − (1−α)^(1/k) | Slightly less conservative than Bonferroni |
| Benjamini-Hochberg (FDR) | Controls false discovery rate (expected proportion of false positives among rejected hypotheses) | Least conservative; used in genomics |
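The FWER formula and the first three corrections can be sketched in a few lines. The function names are illustrative and the four p-values are hypothetical; the Holm procedure follows the stepwise description given in the table.

```python
def fwer(k, alpha=0.05):
    """Familywise error rate for k independent tests: 1 - (1 - alpha)^k."""
    return 1 - (1 - alpha) ** k


def bonferroni_alpha(k, alpha=0.05):
    return alpha / k


def sidak_alpha(k, alpha=0.05):
    return 1 - (1 - alpha) ** (1 / k)


def holm_reject(p_values, alpha=0.05):
    """Holm step-down: test the smallest p at alpha/k, the next at alpha/(k-1), and so on."""
    k = len(p_values)
    order = sorted(range(k), key=lambda i: p_values[i])
    reject = [False] * k
    for rank, index in enumerate(order):
        if p_values[index] <= alpha / (k - rank):
            reject[index] = True
        else:
            break  # stop at the first non-significant p-value
    return reject


p_values = [0.04, 0.001, 0.03, 0.015]  # hypothetical p-values
bonferroni_result = [p <= bonferroni_alpha(len(p_values)) for p in p_values]
holm_result = holm_reject(p_values)

print(f"FWER for 10 tests: {fwer(10):.2f}")
print("Bonferroni rejects:", bonferroni_result)
print("Holm rejects:      ", holm_result)
```

With these p-values Holm rejects two hypotheses where Bonferroni rejects one, which is what "less conservative, more powerful" means in practice.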
4.7 Significance vs Clinical Importance — Key MRCOG Concept
| | Statistically Significant | Not Statistically Significant |
|---|---|---|
| Clinically Important | ✅ Optimal — real effect detected | 🔴 Underpowered study — need larger n |
| Clinically Unimportant | 🟡 Significant but trivial (large n) | ✅ No evidence of important effect |
Example 1: A very large study (about 900,000 women, the order of size needed for so small a difference to reach significance) finds that taking paracetamol once in pregnancy reduces preterm birth from 5.0% to 4.9% (p = 0.03). Statistically significant but clinically meaningless (ARR = 0.1%, NNT = 1000).
Example 2: A study with 100 women finds a 30% reduction in miscarriage rate but p = 0.15. Potentially clinically important but not proven — underpowered.
4.8 Bayesian vs Frequentist Statistics — Overview
| Aspect | Frequentist | Bayesian |
|---|---|---|
| Probability definition | Long-run frequency | Degree of belief |
| Parameters | Fixed (unknown) | Random variables |
| Data | Random | Fixed |
| Prior | Not used | Used (prior probability) |
| Output | p-value, CI | Posterior probability, credible interval |
| Interpretation of 95% interval | 95% of intervals contain true value | 95% probability true value lies in interval |
| If p = 0.05 | Cannot say "5% chance H₀ is true" | Can state a posterior probability for H₀ directly (e.g., "5% probability H₀ is true"), derived from the posterior rather than the p-value |
Bayesian approach in O&G: Increasingly used in adaptive trials, diagnostic test interpretation, and meta-analysis.
5. Parametric vs Non-Parametric Tests
5.1 Choosing the Right Test — Decision Tree
Continuous Data
Continuous outcome: first ask whether the data are normally distributed, then how many groups are compared.
| Groups | Normally distributed (parametric) | Not normally distributed (non-parametric) |
|---|---|---|
| 2 independent | Unpaired t-test | Mann-Whitney U |
| 2 paired | Paired t-test | Wilcoxon signed rank |
| 3+ independent | One-way ANOVA | Kruskal-Wallis |
| 3+ paired | Repeated measures ANOVA (RM-ANOVA) | Friedman |
Categorical Data
Categorical outcome (2×2 table or larger):
| Question | Yes | No |
|---|---|---|
| Paired data? | McNemar's test | Standard (unpaired) χ² test |
| Expected frequencies ≥5? | χ² test | Fisher's exact test |
5.2 Parametric Tests — Complete Details
Student's t-test
Assumptions:
1. Normality: Data in each group are approximately normally distributed (or n large enough for CLT)
2. Homogeneity of variance: Variance similar in both groups (check with Levene's test or F-test)
3. Independence: Observations are independent of each other
Unpaired (Independent Samples) t-test
- Use: Compare means of TWO independent groups
- Example: Birth weight in smokers vs non-smokers
- Formula: t = (x̄₁ − x̄₂) / √(s²(1/n₁ + 1/n₂))
- where s² = pooled variance = [(n₁−1)s₁² + (n₂−1)s₂²] / (n₁ + n₂ − 2)
- Degrees of freedom: df = n₁ + n₂ − 2
Worked example:
- Smokers: n = 50, mean = 3100 g, SD = 400 g
- Non-smokers: n = 50, mean = 3300 g, SD = 380 g
- Pooled SD = √([49×400² + 49×380²]/98) = √([7,840,000 + 7,075,600]/98) = √(152,200) = 390.1
- t = (3100 − 3300) / (390.1 × √(1/50 + 1/50)) = −200 / (390.1 × 0.2) = −200 / 78.02 = −2.56
- df = 98, critical t (two-tailed, α = 0.05) = 1.984
- |t| = 2.56 > 1.984 → p < 0.05 → significant difference
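The birth-weight example can be checked from the summary statistics alone. This is a minimal sketch of the pooled-variance formula above; `pooled_t` is an illustrative name, not a library function.

```python
import math


def pooled_t(n1, mean1, sd1, n2, mean2, sd2):
    """Unpaired t statistic using the pooled variance (assumes equal variances)."""
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t = (mean1 - mean2) / se
    df = n1 + n2 - 2
    return t, df, math.sqrt(pooled_var)


# Smokers vs non-smokers from the worked example.
t_stat, df, pooled_sd = pooled_t(50, 3100, 400, 50, 3300, 380)
print(round(t_stat, 2), df, round(pooled_sd, 1))
```

The p-value itself needs the t distribution, which the standard library does not provide; a statistics package would normally be used for that step.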
Welch's t-test: Does NOT assume equal variances; more robust. Uses separate variances and adjusted df (Satterthwaite or Welch). Recommended as default.
Paired t-test
- Use: Compare means of TWO RELATED measurements (same subjects, before-after, matched pairs)
- Example: BP before and after antihypertensive treatment in pregnancy
- Principle: Calculate difference for each pair, test if mean difference = 0
- Formula: t = d̄ / (s_d / √n)
- d̄ = mean of differences
- s_d = SD of differences
- n = number of pairs
- df = n − 1
Worked example: Fasting glucose before and after 1 week of metformin in 10 women with PCOS:
| Subject | Before | After | Difference |
|---|---|---|---|
| 1 | 5.9 | 5.5 | 0.4 |
| 2 | 6.2 | 5.8 | 0.4 |
| 3 | 5.6 | 5.3 | 0.3 |
| 4 | 5.8 | 5.4 | 0.4 |
| 5 | 6.0 | 5.7 | 0.3 |
| 6 | 5.7 | 5.6 | 0.1 |
| 7 | 6.1 | 5.8 | 0.3 |
| 8 | 5.9 | 5.5 | 0.4 |
| 9 | 5.8 | 5.6 | 0.2 |
| 10 | 6.0 | 5.9 | 0.1 |
- Mean difference d̄ = 0.29
- SD of differences = 0.12
- t = 0.29 / (0.12/√10) = 0.29 / 0.038 = 7.63
- df = 9, critical t = 2.262
- t = 7.63 >> 2.262 → p < 0.001 → significant reduction
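The same calculation can be run on the ten pairs in the table. This sketch uses only the standard library; carrying full precision gives a t value marginally higher than the hand calculation, which rounds the standard error.

```python
import math
import statistics

before = [5.9, 6.2, 5.6, 5.8, 6.0, 5.7, 6.1, 5.9, 5.8, 6.0]
after = [5.5, 5.8, 5.3, 5.4, 5.7, 5.6, 5.8, 5.5, 5.6, 5.9]

# Paired t-test: work on the within-subject differences.
diffs = [b - a for b, a in zip(before, after)]
mean_diff = statistics.mean(diffs)
sd_diff = statistics.stdev(diffs)  # sample SD (n - 1 denominator)
t_stat = mean_diff / (sd_diff / math.sqrt(len(diffs)))
df = len(diffs) - 1

print(round(mean_diff, 2), round(sd_diff, 2), round(t_stat, 2), df)
```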
Analysis of Variance (ANOVA)
One-way ANOVA
- Use: Compare means of THREE or MORE independent groups
- Why not multiple t-tests? Inflates Type I error (for 3 groups: 3 pairwise tests → FWER = 14.3%)
- Logic: Partition total variance into:
- Between-group variance (attributable to the treatment/group effect)
- Within-group variance (error/residual variance)
- F-statistic = Mean Square (between) / Mean Square (within)
- If F is large and p < 0.05 → at least one group differs from others
ANOVA table:
| Source | Sum of Squares | df | Mean Square | F |
|---|---|---|---|---|
| Between groups | SS_b | k−1 | MS_b = SS_b/(k−1) | MS_b/MS_w |
| Within groups | SS_w | N−k | MS_w = SS_w/(N−k) | — |
| Total | SS_t | N−1 | — | — |
k = number of groups, N = total sample size
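The ANOVA table can be filled in with a few lines of code. The sketch below uses hypothetical birth weights (kg) for three groups; the data and the function name are assumptions made for illustration only.

```python
def one_way_anova(groups):
    """Return (SS_between, SS_within, F) for a list of groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    ms_between = ss_between / (k - 1)        # df = k - 1
    ms_within = ss_within / (n_total - k)    # df = N - k
    return ss_between, ss_within, ms_between / ms_within


# Hypothetical data: non-smokers, light smokers, heavy smokers.
groups = [[3.4, 3.5, 3.6], [3.2, 3.3, 3.4], [2.9, 3.0, 3.1]]
ss_b, ss_w, f_stat = one_way_anova(groups)
print(round(ss_b, 2), round(ss_w, 2), round(f_stat, 1))
```

With k = 3 and N = 9, the F statistic would be compared against the F distribution with 2 and 6 degrees of freedom.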
Post-hoc Tests after Significant ANOVA
Why can't we just use pairwise t-tests? Multiple testing problem. Post-hoc tests control for multiple comparisons.
| Test | Conservatism | When to use |
|---|---|---|
| Bonferroni | Very conservative | Small number of pre-planned comparisons |
| Tukey HSD | Moderate | All pairwise comparisons (most common) |
| Scheffé | Most conservative | Complex comparisons (contrasts) |
| Dunnett | Moderate | Comparing all groups to a single control |
| Least Significant Difference (LSD) | NOT conservative (doesn't control FWER) | Only if exactly 3 groups and significant F |
Tukey HSD (Honest Significant Difference):
- Controls FWER for ALL pairwise comparisons
- Uses studentised range distribution (q)
- Formula: HSD = q × √(MS_w / n)
Two-way ANOVA
- Use: TWO independent variables (factors) + their interaction
- Output: Main effect of factor A, main effect of factor B, interaction effect (A×B)
- Example: Effect of smoking (yes/no) AND maternal age (<35 vs ≥35) on birth weight
- Main effect of smoking (adjusted for age)
- Main effect of age (adjusted for smoking)
- Interaction: Does the effect of smoking DIFFER by maternal age?
Interpreting interaction:
- Significant interaction p-value → the effect of one factor depends on the other
- Example: Smoking reduces birth weight more in older mothers → significant smoking × age interaction
- Report subgroup means or interaction plot
Repeated Measures ANOVA
- Use: Same subjects measured at 3+ time points (e.g., BP at booking, 28 wks, 36 wks)
- Advantage: Controls for between-subject variability → more powerful
- Assumptions: Sphericity (variance of differences between all pairs of measurements is equal) — checked by Mauchly's test
- Correction for non-sphericity: Greenhouse-Geisser, Huynh-Feldt
Assumptions of Parametric Tests — How to Check
| Assumption | What it means | How to check | What to do if violated |
|---|---|---|---|
| Normality | Data follow normal distribution | Histogram, Q-Q plot, Shapiro-Wilk, Kolmogorov-Smirnov | Use non-parametric test, transform data |
| Homogeneity of variance | Equal variances across groups | Levene's test, F-test (2 groups), Bartlett's test | Use Welch's t-test, Welch's ANOVA, or transformation |
| Independence | Observations independent | Study design check | Mixed models, GEE, multilevel models |
| Sphericity (RM-ANOVA) | Equal variances of differences | Mauchly's test | Greenhouse-Geisser correction |
5.3 Non-Parametric Tests — Complete Details
Mann-Whitney U Test (Wilcoxon Rank-Sum)
- Use: Compare TWO INDEPENDENT groups with non-normal data
- Principle: Rank all observations together, then compare sum of ranks between groups
- H₀: The two populations have the same distribution (interpreted as equal medians when the distributions have the same shape)
- Output: U statistic (or W in some software)
Steps:
1. Rank all observations from both groups together (1 = smallest; tied values share the average rank)
2. Sum the ranks for each group (R₁ and R₂)
3. U₁ = R₁ − n₁(n₁+1)/2 and U₂ = R₂ − n₂(n₂+1)/2
4. U = min(U₁, U₂), compared to the critical value (significant if U ≤ critical value)
Worked example: Pain scores (0–10) after two different perineal repair techniques
| Technique A | Rank A | Technique B | Rank B |
|---|---|---|---|
| 2 | 1 | 4 | 4.5 |
| 3 | 2.5 | 5 | 6 |
| 3 | 2.5 | 6 | 7 |
| 4 | 4.5 | 8 | 9 |
| 7 | 8 | 9 | 10 |
| Sum | 18.5 | Sum | 36.5 |
- n₁ = 5, n₂ = 5
- U₁ = 18.5 − (5×6/2) = 18.5 − 15 = 3.5
- U₂ = 36.5 − 15 = 21.5
- U = 3.5 (critical U for n₁=5, n₂=5, α=0.05 two-tailed = 2)
- U = 3.5 > 2 → not significant at α = 0.05
For larger samples, a normal approximation can be used instead of critical-value tables: Z = (U − n₁n₂/2) / √[n₁n₂(n₁+n₂+1)/12].
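The ranking steps, including average ranks for ties, can be sketched as follows using the pain-score data above. The helper name `average_ranks` is illustrative.

```python
def average_ranks(values):
    """Rank from smallest (1) upward; tied values share their mean rank."""
    ordered = sorted(values)
    rank_of = {}
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and ordered[j] == ordered[i]:
            j += 1
        rank_of[ordered[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j
    return [rank_of[v] for v in values]


technique_a = [2, 3, 3, 4, 7]
technique_b = [4, 5, 6, 8, 9]

ranks = average_ranks(technique_a + technique_b)
n1, n2 = len(technique_a), len(technique_b)
r1, r2 = sum(ranks[:n1]), sum(ranks[n1:])
u1 = r1 - n1 * (n1 + 1) / 2
u2 = r2 - n2 * (n2 + 1) / 2
u = min(u1, u2)

print(r1, r2, u1, u2, u)
```

A useful check is that U₁ + U₂ always equals n₁ × n₂ (here 25).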
Wilcoxon Signed-Rank Test
- Use: TWO PAIRED groups (non-parametric equivalent of paired t-test)
- Principle: Calculate differences, rank absolute differences, sum ranks of positive vs negative differences
- Steps:
- Calculate difference for each pair
- Exclude pairs with difference = 0
- Rank absolute differences (ignoring sign)
- Sum ranks of positive differences (W+) and negative differences (W−)
- Test statistic W = min(W+, W−)
Example (from paired t-test data above): Glucose before and after metformin
- Differences: 0.4, 0.4, 0.3, 0.4, 0.3, 0.1, 0.3, 0.4, 0.2, 0.1
- All positive → W+ = 1+2+...+10 = 55, W− = 0
- For n = 10, critical W = 8 (two-tailed, α = 0.05)
- W = 0 < 8 → p < 0.05 → significant (more powerful than sign test)
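The signed-rank steps can be sketched in a few lines using the listed differences. Tied absolute differences share an average rank, which leaves the total of 55 unchanged; the helper name is illustrative.

```python
diffs = [0.4, 0.4, 0.3, 0.4, 0.3, 0.1, 0.3, 0.4, 0.2, 0.1]

# Step 1: drop zero differences; step 2: rank the absolute differences.
nonzero = [d for d in diffs if d != 0]
abs_sorted = sorted(abs(d) for d in nonzero)


def mean_rank(value):
    """Average 1-based position of a value in the sorted absolute differences."""
    positions = [i + 1 for i, v in enumerate(abs_sorted) if v == value]
    return sum(positions) / len(positions)


# Step 3: sum the ranks separately for positive and negative differences.
w_plus = sum(mean_rank(abs(d)) for d in nonzero if d > 0)
w_minus = sum(mean_rank(abs(d)) for d in nonzero if d < 0)
w = min(w_plus, w_minus)

print(w_plus, w_minus, w)
```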
Sign test (simpler alternative):
- Count number of positive and negative differences (ignoring magnitude)
- Test using binomial distribution
- Less powerful than Wilcoxon signed-rank (discards magnitude information)
Kruskal-Wallis Test
- Use: THREE+ INDEPENDENT groups (non-parametric equivalent of one-way ANOVA)
- Principle: Extension of Mann-Whitney — ranks all observations together, compares sum of ranks across groups
- H₀: All groups have same median
- Output: H statistic (approximately χ² with df = k−1)
- Post-hoc: Dunn's test with Bonferroni correction
When to use: Comparing fetal fibronectin levels (skewed) across three groups: term labour, preterm labour, no labour
Friedman Test
- Use: THREE+ PAIRED groups (non-parametric equivalent of repeated measures ANOVA)
- Principle: Ranks within each subject/block, then compares across time points
- Example: Pain scores at 1 hour, 6 hours, 24 hours after episiotomy repair
- Post-hoc: Wilcoxon signed-rank with Bonferroni correction
5.4 Chi-Squared Test (χ²) — Complete Details
- Use: Test association between TWO CATEGORICAL variables
- Data format: Contingency table (r × c)
Formula: χ² = Σ [(Oᵢⱼ − Eᵢⱼ)² / Eᵢⱼ]
Where:
- Oᵢⱼ = observed frequency in cell (i, j)
- Eᵢⱼ = expected frequency = (row total × column total) / grand total
- df = (rows − 1) × (columns − 1)
Worked example: Mode of delivery by maternal BMI category
| | SVD | CS | Total |
|---|---|---|---|
| BMI < 30 | 80 | 20 | 100 |
| BMI ≥ 30 | 30 | 30 | 60 |
| Total | 110 | 50 | 160 |
Expected frequencies:
- BMI < 30, SVD: (100 × 110)/160 = 68.75
- BMI < 30, CS: (100 × 50)/160 = 31.25
- BMI ≥ 30, SVD: (60 × 110)/160 = 41.25
- BMI ≥ 30, CS: (60 × 50)/160 = 18.75
χ² = (80−68.75)²/68.75 + (20−31.25)²/31.25 + (30−41.25)²/41.25 + (30−18.75)²/18.75 = 1.84 + 4.05 + 3.07 + 6.75 = 15.71
- df = (2−1)(2−1) = 1
- Critical χ² (df = 1, α = 0.05) = 3.84
- 15.71 > 3.84 → p < 0.001 → highly significant association
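The expected frequencies and the χ² statistic for the BMI example can be computed for any r × c table with a short function. This is a minimal sketch without continuity correction; the function name is illustrative.

```python
def chi_squared(table):
    """Pearson chi-squared statistic and df for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


# Rows: BMI < 30, BMI >= 30; columns: SVD, CS.
chi2, df = chi_squared([[80, 20], [30, 30]])
print(round(chi2, 2), df)
```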
Assumptions:
1. Independent observations (each subject counted once)
2. No more than 20% of expected frequencies < 5
3. All expected frequencies ≥ 1
If assumptions violated: Use Fisher's exact test (any 2×2 table) or combine categories (for larger tables).
Yates' Correction for Continuity
- Applied to 2×2 tables (subtract 0.5 from each |O−E| before squaring)
- More conservative (reduces χ²)
- Historically used; now controversial — Fisher's exact preferred for small samples
Fisher's Exact Test
- Use: 2×2 tables when expected frequencies < 5 (any sample size works)
- Principle: Calculates exact probability of observed table (and more extreme tables) given fixed margins — based on hypergeometric distribution
- Advantage: Valid for ANY sample size
- Disadvantage: Computationally intensive for large tables
McNemar's Test for Paired Categorical Data
- Use: Compare PROPORTIONS in PAIRED or MATCHED categorical data (before-after, matched case-control)
- Example: Diagnosis of GDM by two different criteria (IADPSG vs NICE) in same women
Paired 2×2 table:
| | Test B + | Test B − | Total |
|---|---|---|---|
| Test A + | a (both positive) | b (A positive, B negative) | a + b |
| Test A − | c (A negative, B positive) | d (both negative) | c + d |
| Total | a + c | b + d | N |
Formula: χ² = (|b − c| − 1)² / (b + c) [with continuity correction]
- Only discordant pairs (b and c) contribute to the test
- If b = c → no difference between tests
Example: GDM screening — IADPSG vs NICE criteria in 200 women
| | NICE + | NICE − | Total |
|---|---|---|---|
| IADPSG + | 20 | 15 | 35 |
| IADPSG − | 3 | 162 | 165 |
| Total | 23 | 177 | 200 |
- χ² = (|15 − 3| − 1)² / (15 + 3) = (11)² / 18 = 121/18 = 6.72
- df = 1, p = 0.01 → Significant difference: IADPSG detects significantly more GDM than NICE criteria.
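The GDM example can be reproduced from the two discordant cells alone. The sketch below also derives the p-value, using the fact that a χ² variable with 1 degree of freedom is a squared standard normal, so the upper tail equals erfc(√(χ²/2)). The function name is illustrative.

```python
import math


def mcnemar_chi2(b, c, continuity=True):
    """McNemar statistic from the discordant pair counts b and c."""
    correction = 1 if continuity else 0
    return (abs(b - c) - correction) ** 2 / (b + c)


# IADPSG+/NICE- = 15, IADPSG-/NICE+ = 3.
stat = mcnemar_chi2(15, 3)
p_value = math.erfc(math.sqrt(stat / 2))  # upper tail of chi-squared with df = 1

print(round(stat, 2), round(p_value, 3))
```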
5.5 Correlation — Detailed
| Coefficient | Symbol | Type | Parametric? | Range | Measure of |
|---|---|---|---|---|---|
| Pearson r | r | Linear | Yes | −1 to +1 | Linear relationship strength |
| Spearman ρ | rₛ (or ρ) | Monotonic | No | −1 to +1 | Monotonic relationship (any consistent trend) |
| Kendall τ | τ | Concordant/discordant pairs | No | −1 to +1 | Association in ranked data |
Pearson Correlation (r)
Assumptions:
1. Both variables are continuous
2. Linear relationship
3. Bivariate normality (both normally distributed)
4. Homoscedasticity (equal scatter across values)
5. No significant outliers
Formula: r = Σ[(xᵢ − x̄)(yᵢ − ȳ)] / √[Σ(xᵢ − x̄)² × Σ(yᵢ − ȳ)²]
Interpretation of r (commonly used benchmarks, loosely based on Cohen's 0.1 / 0.3 / 0.5 thresholds):
| r value | Interpretation | Approximate R² |
|---|---|---|
| 0.0–0.1 | Negligible | 0–1% |
| 0.1–0.3 | Weak | 1–9% |
| 0.3–0.5 | Moderate | 9–25% |
| 0.5–0.7 | Strong | 25–49% |
| 0.7–1.0 | Very strong | 49–100% |
R² = coefficient of determination: Proportion of variance in Y explained by X.
- If r = 0.6, R² = 0.36 → 36% of variance in Y is explained by X
- 64% is due to other factors
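The Pearson formula and its link to R² can be sketched on a small data set. The five (x, y) pairs are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math


def pearson_r(x, y):
    """Pearson correlation: sum of cross-products over the root of the sums of squares."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
r_squared = r**2
print(round(r, 3), round(r_squared, 2))
```

Here r is about 0.77, so roughly 60% of the variance in y is explained by x.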
Spearman's Rank Correlation (ρ)
- Use: Non-normal, ordinal, or skewed data
- Principle: Rank both variables, then calculate Pearson r on ranks
- Advantages: No normality assumption; detects monotonic (not just linear) relationships; robust to outliers
- Interpretation: Same r scale (−1 to +1)
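The principle "rank both variables, then calculate Pearson r on the ranks" can be illustrated as below. The data are hypothetical: y rises with x every time but not linearly, so Spearman's ρ is 1 while Pearson's r is lower.

```python
import math


def pearson_r(x, y):
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Mean 1-based position of each value in sorted order, so ties share a rank."""
    ordered = sorted(values)
    return [
        sum(i + 1 for i, v in enumerate(ordered) if v == value) / ordered.count(value)
        for value in values
    ]


def spearman_rho(x, y):
    """Spearman correlation: Pearson r computed on the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


# Hypothetical monotonic but non-linear relationship with one extreme value.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 100]

r = pearson_r(x, y)
rho = spearman_rho(x, y)
print(round(r, 2), round(rho, 2))
```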
Kendall's Tau (τ)
- Use: Small samples with many tied ranks
- Principle: Based on number of concordant vs discordant pairs
- τ = (C − D) / [½ n(n−1)] where C = concordant pairs, D = discordant
- Advantage: More robust and interpretable with ties; better for small samples
- Disadvantage: Usually smaller absolute value than Spearman
Correlation does NOT imply causation. Four possible explanations for r ≠ 0:
1. X causes Y (direct causation)
2. Y causes X (reverse causation)
3. Z causes both X and Y (confounding)
4. Chance (random variation)
Common O&G example: Positive correlation between maternal age and Down's syndrome — direct causal relationship (meiotic non-disjunction increases with age). This is one case where correlation IS causation.
5.6 Regression — Complete Details
Linear Regression
Model: Y = β₀ + β₁X₁ + β₂X₂ + ... + βₖXₖ + ε
- Y = outcome (dependent) variable — CONTINUOUS
- Xᵢ = predictor (independent) variables
- β₀ = intercept (value of Y when all X = 0)
- βᵢ = regression coefficient (change in Y per 1-unit change in Xᵢ, holding others constant)
- ε = error term (residual)
Key outputs:
| Output | Interpretation |
|---|---|
| β coefficient | Effect estimate (units of Y per unit X) |
| 95% CI for β | Precision and significance |
| p-value for β | Test of H₀: β = 0 |
| R² | Proportion of variance explained by model |
| Adjusted R² | R² penalised for number of predictors |
| F-test | Tests if overall model is significant |
Assumptions of linear regression:
1. Linearity: Relationship between X and Y is linear
2. Independence: Observations are independent
3. Homoscedasticity: Constant variance of residuals across fitted values
4. Normality: Residuals are normally distributed
5. No multicollinearity: Predictors not highly correlated
Checking assumptions:
- Residual vs fitted plot: Look for random scatter (homoscedasticity) and no pattern (linearity)
- Q-Q plot of residuals: Check normality
- Variance Inflation Factor (VIF): Check multicollinearity (VIF > 10 = problematic)
Multiple Linear Regression
- Use: ONE continuous outcome, MULTIPLE predictors
- β coefficients are ADJUSTED — each β represents the effect of that predictor holding all others constant
- Can control for confounders by including them in the model
- Partial R²: Contribution of each predictor to explained variance
Example: Predicting birth weight
- Y = birth weight (g)
- X₁ = gestational age (weeks)
- X₂ = maternal smoking (0/1)
- X₃ = maternal BMI
- β₁ = 150 means: each additional week of gestation → +150 g birth weight (holding smoking and BMI constant)
- β₂ = −200 means: smoking associated with 200 g lower birth weight (holding gestational age and BMI constant)
Logistic Regression
- Use: BINARY outcome (yes/no, alive/dead, disease/no disease)
- Model: logit(p) = ln[p/(1−p)] = β₀ + β₁X₁ + ... + βₖXₖ
- Exponentiated coefficients (e^βᵢ): Adjusted Odds Ratios (OR)
- Interpretation of OR: e^βᵢ = multiplicative change in the odds of the outcome per 1-unit increase in Xᵢ, holding the other predictors constant
Key outputs:
| Output | Interpretation |
|---|---|
| OR (e^β) | Adjusted odds ratio |
| 95% CI for OR | Precision (if excludes 1 → significant) |
| Hosmer-Lemeshow test | Goodness-of-fit (p > 0.05 = good fit) |
| c-statistic (AUC) | Discriminatory ability |
| Pseudo-R² | McFadden, Nagelkerke |
Worked O&G example: Predicting preterm birth
| Predictor | β | OR (e^β) | 95% CI | p |
|---|---|---|---|---|
| Smoking | 0.69 | 1.99 | 1.25–3.17 | 0.004 |
| Previous preterm | 1.39 | 4.01 | 2.10–7.66 | <0.001 |
| Multiple pregnancy | 1.10 | 3.00 | 1.40–6.43 | 0.005 |
| Maternal age (per year) | 0.02 | 1.02 | 0.98–1.06 | 0.29 |
- Smoking doubles the odds of preterm birth (OR = 1.99, p = 0.004)
- Previous preterm is strongest predictor (OR = 4.01)
- Maternal age not significant (CI includes 1, p > 0.05)
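The OR column in the table is simply the exponential of each β. The short sketch below reproduces it and also shows how a per-year coefficient scales to a 10-year difference; the per-decade figure is an illustration added here, not a value from the table.

```python
import math

# Coefficients (log-odds) from the preterm birth example.
coefficients = {
    "Smoking": 0.69,
    "Previous preterm": 1.39,
    "Multiple pregnancy": 1.10,
    "Maternal age (per year)": 0.02,
}

# Exponentiate each beta to obtain the adjusted odds ratio.
odds_ratios = {name: round(math.exp(beta), 2) for name, beta in coefficients.items()}

# For a continuous predictor, the OR for a 10-unit change is exp(10 * beta).
or_per_decade = math.exp(10 * coefficients["Maternal age (per year)"])

print(odds_ratios)
print(round(or_per_decade, 2))
```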
Cox Proportional Hazards (See also Section 9)
Model: h(t) = h₀(t) × exp(β₁X₁ + β₂X₂ + ... + βₖXₖ)
- h(t) = hazard at time t
- h₀(t) = baseline hazard (when all X = 0)
- exp(βᵢ) = Hazard Ratio (HR)
- Proportional hazards assumption: HR is constant over time
6. Risk & Effect Measures
6.1 The 2×2 Table — Foundation
| | Outcome + (Disease) | Outcome − (No disease) | Total |
|---|---|---|---|
| Exposed + | a | b | a + b |
| Exposed − | c | d | c + d |
| Total | a + c | b + d | N |
6.2 Definitions and Formulas — Complete
| Measure | Abbreviation | Formula | Interpretation |
|---|---|---|---|
| Risk in exposed | Rₑ | a / (a + b) | Probability of outcome if exposed |
| Risk in unexposed | R₀ | c / (c + d) | Probability of outcome if not exposed |
| Odds in exposed | Oₑ | a / b | Ratio of outcome happening to not happening in exposed |
| Odds in unexposed | O₀ | c / d | Ratio of outcome happening to not happening in unexposed |
| Risk Ratio / Relative Risk | RR | Rₑ / R₀ | How many times more likely outcome is in exposed vs unexposed |
| Odds Ratio | OR | (a/b) / (c/d) = ad / bc | Odds of outcome in exposed vs unexposed (equivalently, odds of exposure in cases vs controls) |
| Attributable Risk | AR | Rₑ − R₀ | Excess risk due to exposure |
| Attributable Risk Fraction | ARF | (Rₑ − R₀) / Rₑ = (RR−1)/RR | Proportion of risk in exposed due to exposure |
| Population Attributable Risk | PAR | R_total − R₀ | Excess risk in total population |
| Population Attributable Fraction | PAF | (R_total − R₀) / R_total | Proportion of population disease due to exposure |
| Absolute Risk Reduction | ARR | Control risk − Treatment risk (if treatment reduces risk) | AR with the sign reversed (treatment perspective) |
| Number Needed to Treat | NNT | 1 / ARR | Number needed to treat to prevent one outcome |
| Number Needed to Harm | NNH | 1 / AR (if harmful) | Number exposed to cause one adverse outcome |
6.3 Worked Examples from O&G
Example 1: VTE Prevention with LMWH
| | VTE | No VTE | Total |
|---|---|---|---|
| LMWH | 5 | 495 | 500 |
| No LMWH | 20 | 480 | 500 |
- Rₑ = 5/500 = 0.01 (1%)
- R₀ = 20/500 = 0.04 (4%)
- RR = 0.01/0.04 = 0.25 → LMWH reduces VTE risk by 75%
- AR (ARR) = |0.01 − 0.04| = 0.03 (3%) → absolute risk reduction
- RRR (relative risk reduction) = (0.04−0.01)/0.04 = 0.75 (75%) → same as 1−RR
- NNT = 1/0.03 = 33.3 → 34 women need LMWH to prevent one VTE
- OR = (5×480)/(495×20) = 2400/9900 = 0.24 → similar to RR because VTE is rare
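All of the measures in Example 1 follow from the four cell counts. A minimal sketch, using the a, b, c, d layout of the 2×2 table in 6.1 (the function name and dictionary keys are illustrative):

```python
import math


def risk_measures(a, b, c, d):
    """Effect measures for a 2x2 table: exposed row (a, b), unexposed row (c, d)."""
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    rr = risk_exposed / risk_unexposed
    arr = risk_unexposed - risk_exposed  # positive when the exposure is protective
    return {
        "risk_exposed": risk_exposed,
        "risk_unexposed": risk_unexposed,
        "rr": rr,
        "rrr": 1 - rr,
        "arr": arr,
        "nnt": math.ceil(1 / arr) if arr > 0 else None,  # rounded up
        "or": (a * d) / (b * c),
    }


# LMWH: 5 VTE / 495 no VTE; no LMWH: 20 VTE / 480 no VTE.
lmwh = risk_measures(5, 495, 20, 480)
print({name: round(value, 3) for name, value in lmwh.items()})
```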
Example 2: Smoking and Preterm Birth
| | Preterm | Term | Total |
|---|---|---|---|
| Smoker | 200 | 4,800 | 5,000 |
| Non-smoker | 100 | 4,900 | 5,000 |
- RR = 0.04/0.02 = 2.0
- OR = (200×4900)/(100×4800) = 980,000/480,000 = 2.04
- OR ≈ RR because preterm birth is relatively uncommon here (3% overall); the approximation is good but not perfect
- AR = 0.04 − 0.02 = 0.02 (2%)
- ARF = (2−1)/2 = 50% → half of preterm births in smokers are attributable to smoking
- PAF = (0.03−0.02)/0.03 = 33% → one-third of all preterm births in this study population (where half the women smoke) are attributable to smoking
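The attributable fractions in Example 2 can be verified from the cell counts. As a cross-check, the sketch also uses the equivalent formula PAF = p(RR − 1) / [1 + p(RR − 1)], where p is the proportion exposed; that formula is standard but is not given in the table above.

```python
# Smokers: 200 preterm / 4,800 term; non-smokers: 100 preterm / 4,900 term.
a, b, c, d = 200, 4800, 100, 4900

risk_exposed = a / (a + b)
risk_unexposed = c / (c + d)
risk_total = (a + c) / (a + b + c + d)
rr = risk_exposed / risk_unexposed

arf = (rr - 1) / rr                                # fraction among the exposed
paf = (risk_total - risk_unexposed) / risk_total   # fraction in the whole population

# Cross-check using the prevalence of exposure (50% smokers in this sample).
p_exposed = (a + b) / (a + b + c + d)
paf_from_prevalence = p_exposed * (rr - 1) / (1 + p_exposed * (rr - 1))

print(round(arf, 2), round(paf, 3), round(paf_from_prevalence, 3))
```

Because the PAF depends on how common the exposure is, the 33% figure applies only to a population in which half of the women smoke.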
6.4 Risk Ratio vs Odds Ratio — The Rare Disease Assumption
When disease is rare (prevalence < 10%):
- OR ≈ RR
- OR can be interpreted as RR in case-control studies
When disease is common:
- OR exaggerates RR (lies further from 1)
- OR > RR when RR > 1, and OR < RR when RR < 1
- The more common the disease, the greater the divergence
Proof that OR ≈ RR when a << a+b and c << c+d:
- RR = [a/(a+b)] / [c/(c+d)]
- OR = (a/b) / (c/d) = ad/bc
- If a << a+b then a/(a+b) ≈ a/b
- If c << c+d then c/(c+d) ≈ c/d
- Therefore RR ≈ (a/b) / (c/d) = OR
Clinical example where OR and RR diverge:
| | Disease + | Disease − | Total | Risk |
|---|---|---|---|---|
| Exposed | 80 | 20 | 100 | 0.80 |
| Unexposed | 60 | 40 | 100 | 0.60 |
- RR = 0.80/0.60 = 1.33
- OR = (80×40)/(20×60) = 3200/1200 = 2.67
- OR is TWICE RR! Common disease → OR is a very poor approximation.
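Comparing a rare-outcome table with a common-outcome table makes the divergence concrete. This sketch reuses the smoking and preterm birth counts from Example 2 and the common-disease table above; the function name is illustrative.

```python
def rr_and_or(a, b, c, d):
    """Return (risk ratio, odds ratio) for a 2x2 table."""
    rr = (a / (a + b)) / (c / (c + d))
    odds_ratio = (a * d) / (b * c)
    return rr, odds_ratio


# Uncommon outcome: smoking and preterm birth (risks of 4% and 2%).
rare_rr, rare_or = rr_and_or(200, 4800, 100, 4900)

# Common outcome: risks of 80% and 60%.
common_rr, common_or = rr_and_or(80, 20, 60, 40)

print(round(rare_rr, 2), round(rare_or, 2))      # close together
print(round(common_rr, 2), round(common_or, 2))  # OR about twice the RR
```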
6.5 Number Needed to Treat (NNT) — Detailed
Formula: NNT = 1 / ARR
Where ARR = |Risk_control − Risk_treatment|
Important properties:
- Lower NNT = more effective treatment
- NNT always rounded UP to nearest integer
- NNT depends on BASELINE RISK: the same RR gives a different NNT depending on baseline
Example of NNT dependence on baseline risk: a treatment reduces the risk of an outcome by 50% (RR = 0.50)
| Baseline risk | ARR | NNT |
|---|---|---|
| 10% → 5% | 5% | 20 |
| 1% → 0.5% | 0.5% | 200 |
| 0.1% → 0.05% | 0.05% | 2000 |
Same RR (50% reduction) but NNT ranges dramatically. This is why NNT must be reported with baseline risk context.
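The dependence on baseline risk is easy to tabulate. A minimal sketch that holds RR fixed at 0.50 and varies the baseline risk; the function name is illustrative.

```python
import math


def nnt_for(baseline_risk, rr):
    """NNT for a treatment with relative risk rr applied at a given baseline risk."""
    arr = baseline_risk * (1 - rr)
    # Round first so floating-point noise cannot push an exact integer upward.
    return math.ceil(round(1 / arr, 9))


nnt_by_baseline = {risk: nnt_for(risk, 0.50) for risk in (0.10, 0.01, 0.001)}
print(nnt_by_baseline)
```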
NNT for harm (NNH):
- NNH = 1 / AR (when exposure increases risk)
- Example: Aspirin prevents pre-eclampsia (NNT = 50) but increases bleeding (NNH = 200 for minor bleeding, NNH = 1000 for major)
- Net benefit: When NNT < NNH (more people helped than harmed)
- Benefit-harm ratio: NNH/NNT
6.6 Incidence vs Prevalence
| Measure | Definition | Formula | When used |
|---|---|---|---|
| Point prevalence | Proportion of population with disease at a specific time | Existing cases / Total population | Cross-sectional studies |
| Period prevalence | Proportion with disease during a time period | Cases in period / Population | Chronic diseases |
| Cumulative incidence | Proportion of at-risk population who develop disease over time | New cases / At-risk population at start | Cohort studies |
| Incidence rate | Number of new cases per person-time | New cases / Total person-time at risk | When follow-up varies |
Relationship: Prevalence ≈ Incidence × Average duration of disease (an approximation that holds when the disease is in steady state and prevalence is low)
- For chronic diseases (long duration): high prevalence despite moderate incidence
- For acute diseases (short duration, fatal or curable): low prevalence despite possibly high incidence
Example:
- Ovarian cancer: Incidence ~20/100,000/year, 5-year survival ~45% → prevalence ~90/100,000
- Endometriosis: Incidence unclear (difficult to diagnose), prevalence ~10% in reproductive-age women (long duration → high prevalence)
6.7 Hazard Ratio — More Detail
- From Cox proportional hazards regression
- Interpretation: The instantaneous risk of the event at any time in one group relative to another
- HR = 1: No difference
- HR < 1: Reduced hazard (protective)
- HR > 1: Increased hazard (risk factor)
- Not a simple risk ratio — it's a ratio of hazards that applies across time (proportional hazards assumption)
HR vs RR:
- RR compares cumulative incidence at a specific time point
- HR compares the instantaneous rate of the event at any time
- HR is more appropriate for time-to-event data with varying follow-up
- If proportional hazards hold, HR is constant over time
7. Statistical Bias & Confounding
7.1 Classification of Bias
Bias is classified into three branches:
- Selection bias
- Information bias
- Confounding (not true bias; treatment effect)
7.2 Selection Bias — Detailed with O&G Examples
Definition: Systematic error due to the way participants are selected for a study or due to differential participation/follow-up.
| Type | Mechanism | O&G Example |
|---|---|---|
| Sampling bias | Sample not representative of target population | Studying postnatal depression in an affluent area → underestimates prevalence |
| Referral (centripetal) bias | Tertiary centres see sicker patients | Studying outcomes of placenta praevia at a teaching hospital → higher mortality |
| Volunteer bias | Volunteers differ systematically from non-volunteers | Women who join a menopause research study are healthier and more health-conscious |
| Healthy worker effect | Workers healthier than general population | Midwives have lower mortality than age-matched women in general population |
| Non-response bias | Those who respond differ from those who don't | Postal survey of incontinence — those most affected are more likely to respond → overestimates prevalence |
| Attrition bias (loss to follow-up) | Dropouts differ from completers | In a cohort of high-risk pregnancies, those who drop out may have worse outcomes → biased if differential |
| Berkson's bias | Hospital controls differ from general population | Studying association between oral contraceptives and DVT using hospital controls — controls may use OCPs at different rates |
| Survival (Neyman) bias | Only survivors included | Cross-sectional study of MI — fatal cases missed → underestimates severity |
| Incidence-prevalence (Neyman) bias | Prevalent cases differ from incident | Studying ovarian cancer — prevalent cases are longer-term survivors → different risk factor profile |
| Immortal time bias | Time before exposure counted as exposed | Studying survival after surgery: if time from diagnosis to surgery counted as postsurgical survival, it's "immortal" (patient alive by definition) |
| Detection bias | More intensive surveillance in one group | Women on HRT have more mammograms → more breast cancer detected (screening effect, not causation) |
Immortal Time Bias — Detailed
An important MRCOG concept. Immortal time bias occurs when there is a period of follow-up during which the outcome cannot occur, and this time is misclassified.
Classic example: Study of whether screening for cervical cancer reduces mortality.
- Women who attend screening (exposed) are compared to non-attendees
- The time between the invitation and the actual screening result is "immortal": women had to survive to be screened
- If this immortal time is counted as "screened" time, it biases results in favour of screening (screened women appear to live longer)
- Solution: Use time-dependent exposure or start follow-up at time of screening decision, not screening result
7.3 Information Bias (Measurement Bias) — Detailed
Definition: Systematic error in measuring exposure, outcome, or covariates.
| Type | Mechanism | O&G Example |
|---|---|---|
| Recall bias | Differential recall between groups | Case-control study of miscarriage — cases recall more exposures than controls |
| Observer (ascertainment) bias | Researcher's expectation influences measurement | Knowing which group receives active treatment may influence interpretation of ultrasound measurements |
| Detection (verification) bias | Systematic difference in outcome ascertainment | More intensive follow-up in treatment group → more outcomes detected |
| Lead-time bias | Early diagnosis falsely extends survival | Screening: survival appears longer even if death occurs at same time |
| Publication bias | Positive studies more likely published | Meta-analyses overestimate effect if negative studies unpublished |
| Reporting bias | Differential outcome reporting | Participants on placebo may report more symptoms; doctors may report outcomes more carefully in one group |
| Interviewer bias | Differential questioning | Interviewer probes cases more thoroughly about exposures |
| Social desirability bias | Participants give socially acceptable answers | Underreporting smoking, alcohol in pregnancy |
| Hawthorne effect | Behaviour changes because being observed | Women may adhere better to medications when in a trial |
| Measurement error bias | Inaccurate measurement tool | Using a poorly calibrated sphygmomanometer |
Recall Bias — Detailed
The most common bias tested in MRCOG for case-control studies.
Mechanism:
- Mothers of babies with malformations (cases) search their memory for potential causes → more likely to recall medication use, infections, stress
- Mothers of healthy babies (controls) have less motivation to recall → more likely to forget
Effect: OR is biased away from the null (spuriously large or small association)
Minimisation:
- Use objective records (prescription databases, medical records) rather than recall
- Blinding interviewers to case/control status
- Use standardised, validated questionnaires
- Use a "memory anchor" (e.g., calendar of significant events)
7.4 Confounding — Complete Details
Definition: A third variable (confounder) that distorts the relationship between exposure and outcome because it is associated with BOTH the exposure and the outcome and is NOT on the causal pathway.
Criteria for a Confounder — Three Conditions
- Associated with the exposure in the study population
- An independent risk factor for the outcome (among the unexposed)
- NOT an intermediate (mediator) on the causal pathway between exposure and outcome
- Confounder ──▶ Exposure
- Confounder ──▶ Outcome
- Exposure ──?──▶ Outcome (the association under study)
Not a confounder (it IS a mediator):
Exposure ──────────▶ Mediator ──────────▶ Outcome
Example (confounder):
- Exposure: Drinking coffee → Outcome: Pancreatic cancer
- Confounder: Smoking (associated with coffee drinking AND causes pancreatic cancer)
- If we don't adjust for smoking, we might wrongly attribute the cancer risk to coffee
Classic O&G Examples of Confounding
| Study claim | True relationship | Confounder |
|---|---|---|
| "HRT reduces coronary heart disease" | HRT users healthier → lower CHD | Socioeconomic status, health awareness |
| "Maternal age causes Down's syndrome" | Chromosomal non-disjunction increases with age | AGE IS THE EXPOSURE — this is causal, not confounding! |
| "Coffee causes miscarriage" | Coffee drinkers more likely to be older, smoke | Smoking, maternal age |
| "Caesarean section causes asthma" | Children born by CS have more asthma | Indication for CS (maternal obesity, preterm) may itself be associated with asthma |
| "Fertility treatment causes cancer" | Women who have IVF may have different cancer surveillance | Underlying infertility (itself a risk factor for some cancers) |
Simpson's Paradox — Detailed
A special case of confounding where a trend appears in several groups but reverses or disappears when groups are combined.
Classic medical example: Kidney stone treatment
| Stone size | Treatment A | Treatment B |
|---|---|---|
| Small stones | 93% (81/87) | 87% (234/270) |
| Large stones | 73% (192/263) | 69% (55/80) |
| Overall | 78% (273/350) | 83% (289/350) |
Paradox: Treatment A is better for BOTH small AND large stones, but Treatment B appears better overall!
Explanation: Treatment A was more often used for large stones (which have worse prognosis). Stone size is a confounder — associated with treatment choice (A used more for large) AND outcome (large stones have worse success). When you ignore stone size (combine groups), the confounding produces the paradoxical reversal.
Take-home: Always consider whether there might be a confounder creating a Simpson's paradox. Stratify by key confounders.
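The reversal in the kidney stone table can be reproduced by computing success rates within each stratum and then after pooling. A minimal sketch using the counts from the table; the variable names are illustrative.

```python
# (successes, total) for each treatment within each stone-size stratum.
outcomes = {
    "small": {"A": (81, 87), "B": (234, 270)},
    "large": {"A": (192, 263), "B": (55, 80)},
}

# Success rate within each stratum.
stratum_rates = {
    size: {trt: s / n for trt, (s, n) in by_treatment.items()}
    for size, by_treatment in outcomes.items()
}

# Success rate after pooling the strata (ignoring stone size).
overall_rates = {}
for trt in ("A", "B"):
    successes = sum(outcomes[size][trt][0] for size in outcomes)
    total = sum(outcomes[size][trt][1] for size in outcomes)
    overall_rates[trt] = successes / total

print(stratum_rates)
print(overall_rates)
```

Treatment A has the higher rate in both strata, yet the pooled rate favours Treatment B because A was used mostly for large stones.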
Methods to Control Confounding
| Method | When used | How it works | Strengths | Weaknesses |
|---|---|---|---|---|
| Randomisation | RCTs | Random allocation balances confounders | Gold standard; balances known AND unknown confounders | Not always feasible or ethical |
| Restriction | Any study | Limit to one level of confounder (e.g., only non-smokers) | Simple; eliminates confounding by restricted variable | Limits generalisability; may not be feasible if confounder is common |
| Matching | Case-control, cohort | Select controls/comparison with same confounder levels | Controls for confounding | Cannot match too many variables; over-matching reduces efficiency; can't assess matched variables as risk factors |
| Stratification | Any study | Analyse within strata, then pool (Mantel-Haenszel) | Simple to implement | Cannot handle many confounders; continuous variables need categorisation |
| Multivariable regression | Any study | Adjust statistically | Can handle many confounders; continuous and categorical | Assumptions about model form; cannot adjust for unmeasured confounders |
| Standardisation | Comparing populations | Apply standard weights | Direct or indirect; common in epidemiology | Only adjusts for measured confounders |
| Propensity score | Observational studies | Probability of exposure given confounders; match/stratify/weight by PS | Reduces many confounders to single score | Only measured confounders; requires large n |
| Instrumental variable | Natural experiments | Variable associated with exposure but not outcome (except through exposure) | Can handle unmeasured confounders | Difficult to find valid instrument |
| Inverse probability weighting | Longitudinal studies | Weight by inverse of probability of remaining in study | Handles attrition bias | Depends on correct model for weights |
Mantel-Haenszel Odds Ratio (Stratified Analysis)
Formula for stratified 2×2 tables: OR_MH = Σ(aᵢdᵢ/nᵢ) / Σ(bᵢcᵢ/nᵢ)
Where i indexes strata and nᵢ is the total in stratum i.
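Comparing crude vs adjusted OR: - If the adjusted OR differs materially from the crude OR (a change of more than about 10% is a common rule of thumb) → confounding is likely - If crude OR ≈ adjusted OR → little evidence of confounding by the stratifying variable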
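As a worked illustration, the Mantel-Haenszel formula can be applied to the kidney stone data from the Simpson's paradox example. This is a sketch under an assumed cell layout: a = Treatment A success, b = Treatment A failure, c = Treatment B success, d = Treatment B failure. The function names are illustrative.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for a single 2x2 table."""
    return (a * d) / (b * c)

def mantel_haenszel_or(strata):
    """OR_MH = sum(a*d/n) / sum(b*c/n) over strata of (a, b, c, d)."""
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return numerator / denominator

# One (a, b, c, d) tuple per stratum: small stones, then large stones.
strata = [(81, 6, 234, 36), (192, 71, 55, 25)]

# Crude table: collapse the strata by summing each cell.
crude_cells = [sum(cells) for cells in zip(*strata)]  # [273, 77, 289, 61]

crude = odds_ratio(*crude_cells)
adjusted = mantel_haenszel_or(strata)
print(f"crude OR = {crude:.2f}, Mantel-Haenszel OR = {adjusted:.2f}")
```

Here the crude OR is below 1 and the stratified OR is above 1, so adjusting for stone size changes the direction of the association, not just its size.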
Residual Confounding
Complete confounding adjustment is often impossible because: - Confounders may be measured with error (residual confounding) - Unmeasured confounders exist (unmeasured confounding) - Confounders may change over time (time-varying confounding)
Sensitivity analysis: How strong would an unmeasured confounder need to be to explain away the observed association? (E-value)
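The notes name the E-value without giving its formula. The sketch below uses the standard VanderWeele and Ding expression for a risk ratio, E = RR + √(RR × (RR − 1)), which is an addition here rather than something stated in the text; protective estimates (RR < 1) are inverted first.

```python
import math

def e_value(rr):
    """E-value for an observed risk ratio (point estimate).

    Minimum strength of association, on the RR scale, that an unmeasured
    confounder would need with both exposure and outcome to fully explain
    away the observed RR.
    """
    if rr <= 0:
        raise ValueError("risk ratio must be positive")
    if rr < 1:
        rr = 1 / rr  # work on the side of the null where RR >= 1
    return rr + math.sqrt(rr * (rr - 1))

print(round(e_value(2.0), 2))  # 2 + sqrt(2)
```

A larger E-value means a stronger unmeasured confounder would be needed, so the observed association is harder to explain away.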
7.5 Effect Modification (Interaction)
Different from confounding!
| Aspect | Confounding | Effect Modification |
|---|---|---|
| Type | Bias to be minimised | Real biological phenomenon |
| What it is | Distortion of exposure-outcome relationship | Effect of exposure differs by level of third variable |
| Deal with it | Remove/adjust in analysis | REPORT it — describe effect separately for each subgroup |
| Example | Smoking confounds coffee-pancreatic cancer | Aspirin effect on pre-eclampsia may differ by BMI |
Testing for effect modification: 1. Stratified analysis: Calculate RR/OR separately for each stratum 2. Interaction term: Include product term in regression model (X₁ × X₂) 3. Statistical test: p-value for interaction (be cautious — underpowered for interaction)
Multiplicative vs Additive Interaction: - Multiplicative scale: Is the combined effect greater than the product of individual effects? (RR or OR scale) - Additive scale: Is the combined effect greater than the sum of individual effects? (Risk difference scale) - Public health importance: Additive scale often more relevant (synergy index)
O&G Example: Does the effect of smoking on preterm birth differ by maternal age?
| Maternal age | Smoker | Non-smoker | RR (smoking vs not) |
|---|---|---|---|
| Age < 35 | 5% | 3% | 1.67 |
| Age ≥ 35 | 10% | 5% | 2.00 |
The RR is 1.67 in younger and 2.00 in older women → possible effect modification by age. The absolute risk increase (AR) also differs: 2% vs 5%.
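The stratum-specific measures quoted above can be reproduced in a few lines. This is a minimal sketch using the preterm birth risks from the table; the variable names are illustrative.

```python
# (risk in smokers, risk in non-smokers) per maternal age stratum, from the table above.
strata = {
    "age < 35": (0.05, 0.03),
    "age >= 35": (0.10, 0.05),
}

def risk_ratio(risk_exposed, risk_unexposed):
    """Relative effect (multiplicative scale)."""
    return risk_exposed / risk_unexposed

def risk_difference(risk_exposed, risk_unexposed):
    """Absolute effect (additive scale)."""
    return risk_exposed - risk_unexposed

for label, (smoker, non_smoker) in strata.items():
    rr = risk_ratio(smoker, non_smoker)
    rd = risk_difference(smoker, non_smoker)
    print(f"{label}: RR = {rr:.2f}, risk difference = {rd:.0%}")
```

Reporting both scales matters because effect modification can look different on the multiplicative and additive scales.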
7.6 Confounding by Indication
An important concept for treatment studies.
Definition: The indication for a treatment is itself associated with the outcome. Patients who receive a treatment are systematically different from those who don't because of WHY they were treated.
Example: Studying whether magnesium sulphate prevents cerebral palsy in preterm infants. - Women who receive MgSO₄ are those in preterm labour - Preterm labour itself is a risk factor for cerebral palsy - Without randomisation, any difference in CP rates could be due to the underlying indication (preterm labour), not the treatment
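Solution: Randomisation (e.g., the ACTOMgSO₄ and BEAM trials of antenatal magnesium sulphate for fetal neuroprotection; the Magpie trial tested MgSO₄ for prevention of eclampsia, not cerebral palsy). If randomisation not possible: propensity score methods, indication-based restriction, or multivariable adjustment (though residual confounding likely remains).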
7.7 Protopathic Bias
Definition: Treatment started for early symptoms of the outcome before the outcome is formally diagnosed.
Example: Studying whether NSAIDs cause miscarriage. - Women may take NSAIDs for pelvic pain - Pelvic pain might be an early symptom of miscarriage - Association between NSAID use and miscarriage could be due to NSAIDs treating early miscarriage symptoms (reverse causality)
Solution: Exclude medication use in the period immediately before outcome (lag window), or use new-user designs.
8. Evidence-Based Medicine
8.1 Levels of Evidence — Oxford CEBM (March 2009)
The traditional 5-level system (still used by many O&G guidelines including RCOG):
| Level | Therapy / Prevention | Prognosis | Diagnosis |
|---|---|---|---|
| 1a | SR of RCTs (with homogeneity) | SR of inception cohort studies | SR of diagnostic studies (homogeneous, with gold standard) |
| 1b | Individual RCT (narrow CI) | Individual inception cohort (≥80% follow-up) | Validating cohort with gold standard |
| 1c | All or none | All or none case series | SpPin or SnNOut |
| 2a | SR of cohort studies | SR of retrospective cohorts / untreated controls | SR of cross-sectional studies |
| 2b | Individual cohort study (including low-quality RCT) | Retrospective cohort / follow-up of RCT controls | Cross-sectional with gold standard |
| 2c | Outcomes research / ecological studies | "Outcomes" research | — |
| 3a | SR of case-control studies | — | SR of case-control studies |
| 3b | Individual case-control study | — | Non-consecutive / no gold standard |
| 4 | Case series / poor quality cohort | Case series / poor quality cohort | Case-control / poor reference |
| 5 | Expert opinion | Expert opinion | Expert opinion |
Key: "All or none" is met when all patients died before the treatment became available but some now survive on it, or when some patients died before it became available but none now die on it.
Oxford 2011 revision: Simplified to 5 levels based on the type of question and the quality of evidence, but the 2009 system is still widely cited.
8.2 GRADE System — Complete
Grading of Recommendations Assessment, Development and Evaluation
Quality of Evidence
| Level | Definition | Symbol |
|---|---|---|
| High | Further research VERY UNLIKELY to change confidence in estimate | ⊕⊕⊕⊕ |
| Moderate | Further research LIKELY to have important impact | ⊕⊕⊕○ |
| Low | Further research VERY LIKELY to have important impact | ⊕⊕○○ |
| Very low | Any estimate is very uncertain | ⊕○○○ |
Factors that Lower Quality
| Factor | How it works |
|---|---|
| Risk of bias | Study design limitations (no blinding, no allocation concealment, etc.) |
| Inconsistency | Unexplained heterogeneity (I² > 50%, p < 0.10) across studies |
| Indirectness | PICO differences (population, intervention, comparator, outcome) |
| Imprecision | Wide CIs crossing clinically important thresholds |
| Publication bias | Suspicion that negative studies are missing |
Downgrading rules: - Start at HIGH for RCTs, LOW for observational - Downgrade 1 level for serious concern, 2 for very serious concern - Maximum downgrade: 3 levels
Factors that Raise Quality (Observational Studies)
| Factor | Criteria |
|---|---|
| Large effect | RR > 2 or < 0.5 (up 1 level); RR > 5 or < 0.2 (up 2 levels) |
| Dose-response | Clear biological gradient demonstrated |
| Confounding | All plausible confounders would reduce the observed effect |
Strength of Recommendation
| Strength | Wording | Interpretation |
|---|---|---|
| Strong (1) | "We recommend..." / "Offer" | Most patients should receive the intervention |
| Weak (2) | "We suggest..." / "Consider" | Different choices appropriate for different patients; requires shared decision-making |
Implications: - Strong recommendation: Can be adopted as policy in most situations - Weak recommendation: Policy-making requires substantial debate and stakeholder involvement
8.3 Systematic Reviews & Meta-Analysis — Complete
Definitions
| Term | Definition |
|---|---|
| Systematic review | A review of a clearly formulated question that uses systematic and explicit methods to identify, select, and critically appraise relevant research, and to collect and analyse data from the studies that are included in the review |
| Meta-analysis | The statistical combination of results from two or more separate studies |
| Narrative review | Non-systematic summary of literature (not evidence-based) |
Steps of a Systematic Review
- Formulate question (using PICO: Population, Intervention, Comparison, Outcome)
- Pre-register protocol (PROSPERO)
- Systematic search of multiple databases (MEDLINE, EMBASE, CENTRAL, CINAHL)
- Screen and select studies against pre-specified criteria (PRISMA flow diagram)
- Assess quality/risk of bias of included studies (Cochrane Risk of Bias tool for RCTs)
- Extract data (double-extraction recommended)
- Analyse (meta-analysis if appropriate)
- Interpret and report
PRISMA Flow Diagram
Records identified through database searching (n=...)
Additional records identified through other sources (n=...)
│
▼
Records after duplicates removed (n=...)
│
▼
Records screened (n=...)
Records excluded (n=...)
│
▼
Full-text articles assessed for eligibility (n=...)
Full-text articles excluded, with reasons (n=...)
│
▼
Studies included in qualitative synthesis (n=...)
│
▼
Studies included in quantitative synthesis (meta-analysis) (n=...)
Fixed Effect vs Random Effects Meta-Analysis
| Feature | Fixed Effect | Random Effects |
|---|---|---|
| Assumption | All studies estimate the SAME true effect | Studies estimate DIFFERENT true effects (drawn from a distribution) |
| Implication | Differences due to chance only | Differences due to chance + real variation |
| Weighting | By inverse variance (precision) | By inverse variance + between-study variance (τ²) |
| CI | Narrower | Wider (if heterogeneity present) |
| Interpretation | "The effect" (single value) | "The average effect" |
| When to use | Minimal heterogeneity | Moderate/substantial heterogeneity |
Which is more conservative? Random effects when heterogeneity > 0. But if there is no heterogeneity, they give identical results.
DerSimonian and Laird method: the most common random effects approach. Each study is weighted by wᵢ* = 1 / (sᵢ² + τ²)
Where sᵢ² is the within-study variance of study i and τ² is the between-study variance (estimate of heterogeneity).
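The weighting above can be sketched as follows. This assumes the usual method-of-moments estimate of τ² (truncated at zero) and effects supplied on an additive scale such as log RR; the function name and the example numbers are hypothetical.

```python
import math

def dersimonian_laird(effects, std_errors):
    """Random-effects pooling of study effects (e.g. log RRs).

    Returns (pooled effect, standard error of pooled effect, tau squared).
    """
    w = [1 / se ** 2 for se in std_errors]  # fixed-effect (inverse variance) weights
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))  # Cochran's Q
    df = len(effects) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)  # between-study variance, never negative
    w_star = [1 / (se ** 2 + tau2) for se in std_errors]  # w* = 1 / (s^2 + tau^2)
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se_pooled = math.sqrt(1 / sum(w_star))
    return pooled, se_pooled, tau2

# Hypothetical log RRs and standard errors from three studies.
pooled, se_pooled, tau2 = dersimonian_laird([0.1, 0.3, 0.5], [0.1, 0.1, 0.1])
print(f"pooled log RR = {pooled:.2f}, SE = {se_pooled:.3f}, tau^2 = {tau2:.3f}")
```

When τ² is estimated as zero, the weights reduce to the fixed-effect weights and the two models give the same pooled estimate.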
Heterogeneity — I² Statistic
I² = [(Q − df) / Q] × 100%
Where Q = chi-squared statistic for heterogeneity, df = degrees of freedom (# studies − 1)
| I² | Interpretation |
|---|---|
| 0% | No observed heterogeneity |
| <25% | Low heterogeneity |
| 25–50% | Moderate |
| 50–75% | Substantial |
| >75% | Considerable |
But also consider p-value for Q statistic: - p < 0.10 suggests significant heterogeneity (note: not p < 0.05!) - Important to explore potential sources of heterogeneity even if I² is modest
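A short sketch of the I² calculation follows, with Q computed from inverse-variance weights. The function names and the example values are hypothetical. The p-value for Q needs a chi-squared distribution function, which the Python standard library does not provide, so it is left out here.

```python
def cochran_q(effects, std_errors):
    """Cochran's Q: weighted squared deviations from the fixed-effect estimate."""
    weights = [1 / se ** 2 for se in std_errors]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    return sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))

def i_squared(q, n_studies):
    """I² = (Q - df) / Q x 100%, set to 0 when Q is not larger than df."""
    df = n_studies - 1
    if q <= df:
        return 0.0
    return (q - df) / q * 100

# Hypothetical log RRs and standard errors from three studies.
effects = [0.1, 0.3, 0.5]
std_errors = [0.1, 0.1, 0.1]
q = cochran_q(effects, std_errors)
print(f"Q = {q:.1f}, I² = {i_squared(q, len(effects)):.0f}%")
```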
Exploring heterogeneity: 1. Subgroup analysis: Pre-specified subgroups (e.g., by study quality, population, intervention type) 2. Meta-regression: Regression exploring whether study-level characteristics explain heterogeneity 3. Sensitivity analysis: Excluding one study at a time (leave-one-out analysis)
Forest Plot — Detailed Interpretation
Components:
Study Weight RR (95% CI)
──────── ────── ──────────
Smith 2010 ██████ 1.20 (0.85–1.55)
Jones 2012 ███████ 1.50 (1.10–2.00)
Lee 2013 ████ 1.10 (0.70–1.50)
Brown 2015 ████████ 1.40 (1.05–1.75)
Patel 2017 ██████ 1.30 (0.95–1.65)
──────────────────────────────────────────────────────
Overall (I²=0%, p=0.56) ◆ 1.33 (1.17–1.49)
0.5 1.0 1.5 2.0 2.5
◀── Favours control Favours exposure ──▶
Reading a forest plot: 1. Each row = one study 2. Square = point estimate 3. Horizontal line = 95% CI 4. Square size = weight in meta-analysis (proportional to inverse variance) 5. Vertical line at 1 = null effect (for RR/OR/HR) 6. Diamond at bottom = summary estimate (width = 95% CI) 7. If diamond does not cross the null line → statistically significant
Funnel Plot & Publication Bias
Funnel plot: - X-axis: Effect size (e.g., RR or OR, usually plotted on a log scale) - Y-axis: Standard error (inverted, so larger and more precise studies sit at the top) - Each dot = one study
Interpretation: - Symmetric inverted funnel: No evidence of publication bias - Asymmetric (missing studies in bottom left): Possible publication bias (small negative studies missing) - Asymmetric (missing in bottom right): Other explanations (e.g., small studies with true larger effects)
Causes of asymmetry: 1. Publication bias: Small studies with null/negative results not published 2. True heterogeneity: Small studies have different populations/interventions 3. Poor methodology: Small studies have lower quality → biased effect estimates 4. Chance: Especially with few studies (<10)
Tests for publication bias: - Egger's test: Linear regression of effect size on standard error (p < 0.10 = asymmetry) - Begg's test: Rank correlation test - Trim-and-fill method: Imputes missing studies and adjusts summary estimate - Contour-enhanced funnel plot: Distinguishes publication bias from other causes
8.4 Critical Appraisal — CASP Tools
Key questions for any study:
| Domain | Key Questions |
|---|---|
| Validity | Is the study design appropriate? Was bias minimised? |
| Results | What is the effect size? How precise is it? |
| Applicability | Can results be applied to my patients? |
CASP Checklist for RCTs (abbreviated)
- Did the study address a clearly focused question?
- Was the assignment to treatment groups truly random?
- Were all participants properly accounted for at conclusion?
- Were participants, clinicians, and outcome assessors blinded?
- Were the groups similar at the start of the trial?
- Were groups treated equally (apart from intervention)?
- How large was the treatment effect?
- How precise was the estimate (CIs)?
- Can the results be applied to the local population?
- Were all clinically important outcomes considered?
- Are the benefits worth the harms and costs?
CONSORT Statement (RCT reporting)
Key items: - Methods: Eligibility criteria, randomisation, allocation concealment, blinding, sample size calculation - Results: Flow diagram (participant flow), baseline table (Table 1), outcomes (ITT analysis), harms - Discussion: Limitations, generalisability, interpretation
STROBE Statement (Observational studies)
22-item checklist covering: - Title and abstract - Introduction: Background, objectives - Methods: Study design, setting, participants, variables, data sources, bias, sample size - Results: Participants (flow diagram), descriptive data, outcome data, main results, other analyses - Discussion: Key results, limitations, interpretation, generalisability
PRISMA Statement (Systematic Reviews)
27-item checklist with flow diagram: - Title, abstract, structured summary - Rationale, objectives - Protocol registration, eligibility criteria, information sources, search strategy, selection process, data extraction, risk of bias, synthesis methods - Results: Study selection, characteristics, risk of bias, individual study results, synthesis - Discussion: Summary, limitations, conclusions
QUADAS-2 (Diagnostic accuracy studies)
Four domains: 1. Patient selection (was a consecutive or random sample used?) 2. Index test (was it performed and interpreted without knowledge of reference standard?) 3. Reference standard (is it likely to correctly classify the target condition?) 4. Flow and timing (appropriate interval between tests, all patients received reference standard?)
8.5 Using EBM in Practice — Fagan Nomogram
Pre-test probability → Post-test probability
Clinical example: 32-year-old woman, combined test risk for Down's = 1:150
- Pre-test probability = 1/150 = 0.67%
- Pre-test odds = 0.0067 / 0.9933 = 0.0067
- Combined test positive: LR+ = 8 (from literature)
- Post-test odds = 0.0067 × 8 = 0.0536
- Post-test probability = 0.0536 / 1.0536 = 5.1% (1 in 20)
Using Fagan nomogram: Draw line from pre-test probability (0.67%) through LR (8) → post-test probability ~5%.
Clinical application: If post-test probability > invasive test threshold (~1/150), offer CVS/amniocentesis. If below, reassure.
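The odds conversion in the worked example can be written as a small function. This is a minimal sketch; the function name is illustrative, and the inputs are the 1 in 150 pre-test probability and LR+ of 8 used above.

```python
def post_test_probability(pre_test_prob, likelihood_ratio):
    """Convert probability to odds, apply the likelihood ratio, convert back."""
    pre_test_odds = pre_test_prob / (1 - pre_test_prob)
    post_test_odds = pre_test_odds * likelihood_ratio
    return post_test_odds / (1 + post_test_odds)

p = post_test_probability(1 / 150, 8)
print(f"post-test probability = {p:.1%} (about 1 in {1 / p:.0f})")
```

Working without intermediate rounding gives the same answer as the hand calculation to one decimal place.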
8.6 Evidence-Based Guidelines in O&G
NICE guidelines: - Use GRADE for quality assessment - Recommendations: "Offer" (strong) vs "Consider" (weak) - Regular updates (usually 3–5 year cycle) - Include health economic modelling
RCOG Green-top Guidelines: - Use original Oxford CEBM levels - Grade A, B, C, D recommendations - Topic-specific expert review
SIGN Guidelines: - Scottish Intercollegiate Guidelines Network - Similar approach to GRADE - Identify key clinical questions, systematic review, evidence tables
WHO guidelines: - Use GRADE - Consider global applicability, resource implications - Include "Good Practice Statements"
9. Survival Analysis
9.1 Key Concepts — Detailed
Survival analysis = statistical methods for analysing data where the outcome is the TIME until an event occurs.
Key features: - Time-to-event data: Not just whether event occurred, but WHEN - Censoring: Some subjects don't experience event during follow-up - Time-varying risk: Risk may change over time (higher shortly after treatment, etc.)
Applications in O&G: - Time to pregnancy (survival = time to conception) - Time to labour onset after induction - Time to recurrence of endometriosis after surgery - Time to death in ovarian cancer - Time to treatment failure in IVF - Duration of breastfeeding
9.2 Censoring — Complete Types
| Type | Definition | Example |
|---|---|---|
| Right censoring | Subject does NOT experience event by study end, or is lost to follow-up | Patient with ovarian cancer alive at 5-year study endpoint |
| Left censoring | Event occurred before study began (subject already had the event at entry) | Time to first pregnancy — some women already pregnant at study entry |
| Interval censoring | Event occurs between two known time points, but exact time unknown | Annual screening: cancer detected between visits |
Assumption for valid analysis: Censoring is non-informative — the reason for censoring is unrelated to the probability of experiencing the event.
Example of INFORMATIVE censoring: If patients with more aggressive ovarian cancer are more likely to drop out (move to hospice, stop attending follow-up), their censoring is related to the outcome → biased results.
9.3 Kaplan-Meier Method — Complete Details
Purpose: Estimate the survival function without assuming a particular distribution.
Method: 1. Arrange event times in ascending order 2. At each event time, calculate: - Number at risk just before event - Number who experienced event - Number censored between this event and the next 3. S(t) = Πᵢ (nᵢ − dᵢ) / nᵢ where nᵢ = at risk at time i, dᵢ = events at time i
Properties: - Step function (drops only at event times) - Horizontal segments between events - Tick marks indicate censored observations - Median survival = time when S(t) = 0.5 - 95% CI (Greenwood's formula) shown as dashed lines or shading
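The product-limit calculation described above can be sketched in plain Python. The follow-up times below are hypothetical (months to recurrence, with 1 = recurrence and 0 = censored), and the function name is illustrative. Subjects censored at the same time as an event are counted as still at risk at that time, which is the usual convention.

```python
def kaplan_meier(times, events):
    """Return [(time, S(t))] at each distinct event time.

    times:  follow-up time for each subject
    events: 1 if the event occurred at that time, 0 if censored
    """
    subjects = sorted(zip(times, events))
    n_at_risk = len(subjects)
    survival = 1.0
    curve = []
    i = 0
    while i < len(subjects):
        t = subjects[i][0]
        # Everyone whose follow-up ends at time t (events and censorings).
        same_time = [e for tt, e in subjects[i:] if tt == t]
        deaths = sum(same_time)
        if deaths > 0:
            survival *= (n_at_risk - deaths) / n_at_risk  # (n - d) / n
            curve.append((t, survival))
        n_at_risk -= len(same_time)  # events and censored both leave the risk set
        i += len(same_time)
    return curve

# Hypothetical data: six women followed after surgery.
times = [6, 12, 12, 18, 24, 30]
events = [1, 1, 0, 1, 0, 1]
for t, s in kaplan_meier(times, events):
    print(f"month {t}: S(t) = {s:.3f}")
```

The curve only changes at months 6, 12, 18 and 30; the censored observations at 12 and 24 months reduce the number at risk without producing a step.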
Example: Time to recurrence of endometriosis after surgery
Figure (schematic): Kaplan-Meier curve of recurrence-free survival (y-axis, 0% to 100%) against time since surgery (x-axis, 0 to 60 months). The curve starts at 100% and steps down at each recurrence.
Censored observations represented as tick marks on the curve.
Kaplan-Meier by groups:
Figure (schematic): Kaplan-Meier survival curves (y-axis, 0% to 100%) against time (x-axis) for two groups, Treatment and Control.
The log-rank test compares these two curves.
9.4 Log-Rank Test — Details
- Non-parametric: No assumption about shape of survival curves
- H₀: The survival functions are the same in all groups
- H₁: At least one group differs
- Calculation: Compares observed vs expected events at each time point, summed over all times
- χ² = Σ[(O − E)² / E] across groups
Assumptions: - Non-informative censoring - Independence of survival times - The hazard ratio is roughly constant over time (proportional hazards — though log-rank is reasonably robust to violations)
Limitations: - Cannot adjust for confounders (use Cox regression instead) - Does not estimate the magnitude of difference (use Cox for HR) - If survival curves cross, log-rank has low power (use alternative tests: weighted log-rank, Peto-Peto, Fleming-Harrington)
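The observed versus expected calculation can be sketched for two groups. This uses the simple Σ(O − E)²/E form given above rather than the variance-based statistic that statistical software usually reports, so values will differ slightly from package output. The data and function name are hypothetical.

```python
def log_rank_chi2(times_a, events_a, times_b, events_b):
    """Approximate two-group log-rank statistic: sum of (O - E)^2 / E.

    events_*: 1 = event, 0 = censored. Compare the result with a
    chi-squared distribution on 1 degree of freedom.
    """
    all_times = list(times_a) + list(times_b)
    all_events = list(events_a) + list(events_b)
    event_times = sorted({t for t, e in zip(all_times, all_events) if e})

    obs_a = obs_b = 0
    exp_a = exp_b = 0.0
    for t in event_times:
        risk_a = sum(1 for x in times_a if x >= t)  # still under follow-up at t
        risk_b = sum(1 for x in times_b if x >= t)
        d_a = sum(1 for x, e in zip(times_a, events_a) if x == t and e)
        d_b = sum(1 for x, e in zip(times_b, events_b) if x == t and e)
        total_events = d_a + d_b
        total_risk = risk_a + risk_b
        obs_a += d_a
        obs_b += d_b
        # Expected events if both groups shared the same hazard at time t.
        exp_a += total_events * risk_a / total_risk
        exp_b += total_events * risk_b / total_risk
    return (obs_a - exp_a) ** 2 / exp_a + (obs_b - exp_b) ** 2 / exp_b

# Hypothetical data: group A fails early, group B fails late.
chi2 = log_rank_chi2([1, 2, 3], [1, 1, 1], [4, 5, 6], [1, 1, 1])
print(f"chi-squared = {chi2:.2f}")
```

The sketch does not guard against a group with zero expected events, which would cause a division by zero.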
9.5 Cox Proportional Hazards — Complete Details
Model: h(t|X) = h₀(t) × exp(β₁X₁ + β₂X₂ + ... + βₖXₖ)
Components: - Baseline hazard h₀(t): The hazard when all Xᵢ = 0 (can vary arbitrarily over time — hence "semi-parametric") - Proportional term exp(βX): Multiplicative effect of covariates on hazard (constant over time)
Interpretation of exp(β): - exp(β) = Hazard Ratio (HR) - HR > 1: increased hazard (worse survival) - HR < 1: decreased hazard (better survival) - HR = 1: no effect
Worked O&G example: Survival after ovarian cancer diagnosis
| Predictor | β | HR | 95% CI | p |
|---|---|---|---|---|
| Stage III/IV vs I/II | 1.39 | 4.01 | 2.50–6.43 | <0.001 |
| Suboptimal debulking | 0.80 | 2.23 | 1.40–3.55 | 0.001 |
| BRCA mutation | −0.51 | 0.60 | 0.38–0.95 | 0.03 |
| Age (per 10 years) | 0.32 | 1.38 | 1.10–1.73 | 0.01 |
- Advanced stage: about 4 times the hazard of death at any given time (HR = 4.01)
- BRCA mutation: 40% lower hazard (HR = 0.60)
- Each 10-year increase in age: 38% higher hazard (HR = 1.38)
- Each HR is adjusted for the other predictors in the model
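The link between β, the HR and its confidence interval can be verified from the table. This is a sketch under a normal approximation on the log-hazard scale, with the standard error back-calculated from the reported 95% CI; the function names are illustrative.

```python
import math

def se_from_ci(lower, upper, z=1.96):
    """Recover the standard error of beta from a reported 95% CI for the HR."""
    return (math.log(upper) - math.log(lower)) / (2 * z)

def hazard_ratio(beta, se, z=1.96):
    """Return (HR, lower 95% limit, upper 95% limit) from beta and its SE."""
    return (
        math.exp(beta),
        math.exp(beta - z * se),
        math.exp(beta + z * se),
    )

# Stage III/IV vs I/II row: beta = 1.39, reported CI 2.50 to 6.43.
se = se_from_ci(2.50, 6.43)
hr, lower, upper = hazard_ratio(1.39, se)
print(f"HR = {hr:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

Because the interval is symmetric on the log scale, it is asymmetric around the HR itself.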
Proportional Hazards Assumption — Checking
The HR is constant over time. This is the critical assumption.
How to check: 1. Log-minus-log plot: Plot −ln[−ln(S(t))] vs time for each group; parallel lines suggest proportional hazards 2. Schoenfeld residuals: Plot against time; if slope ≈ 0, assumption holds 3. Test: Significance test of time-dependent covariates (p > 0.05 = no evidence that the assumption is violated)
If assumption violated: - Stratified Cox model: Stratify by the variable with non-proportional hazards - Time-varying covariates: Include interaction with time (t) - Extended Cox model: Allow HR to change at a specified time point - Alternative: Parametric survival models (Weibull, exponential, log-normal)
9.6 Parametric Survival Models
| Model | Hazard function | When used |
|---|---|---|
| Exponential | Constant hazard over time | Simplest; rarely realistic |
| Weibull | Monotonic (always increasing or decreasing) | Flexible; includes exponential as special case |
| Gompertz | Mortality rate increases exponentially | Demography; older populations |
| Log-normal | Hazard increases then decreases | Biological processes with "burn-in" |
| Log-logistic | Similar to log-normal with heavier tails | Accelerated failure time models |
9.7 Describing Survival Results
Median survival time: Time when survival probability = 50% - In O&G: Median time to pregnancy, median time to recurrence
Survival at specific time point: Proportion surviving at 1 year, 5 years, etc. - Example: 5-year survival in ovarian cancer ~45% (all stages combined)
Hazard Ratio from Cox model: Describes the relative hazard between groups, assumed constant across the entire follow-up
10. Specific Topics in O&G
10.1 Key Rates and Definitions
| Rate | Numerator | Denominator | Multiplier | UK Approx. |
|---|---|---|---|---|
| Crude birth rate (CBR) | Live births | Mid-year population | ×1000 | ~11/1000 |
| General fertility rate (GFR) | Live births | Women aged 15–44 | ×1000 | ~60/1000 |
| Total fertility rate (TFR) | Sum of ASFRs × 5 | — | Per woman | ~1.6 |
| Age-specific fertility rate (ASFR) | Live births to women of age group | Women in that age group | ×1000 | Varies |
| Perinatal mortality rate (PMR) | Stillbirths + early neonatal deaths (<7 days) | Total births | ×1000 | ~5/1000 |
| Stillbirth rate | Stillbirths (≥24 wks UK) | Total births | ×1000 | ~3.8/1000 |
| Neonatal mortality rate (NMR) | Neonatal deaths (<28 days) | Live births | ×1000 | ~2.5/1000 |
| Early neonatal mortality | Deaths (<7 days) | Live births | ×1000 | ~1.5/1000 |
| Late neonatal mortality | Deaths (7–27 days) | Live births | ×1000 | ~1.0/1000 |
| Infant mortality rate (IMR) | Deaths <1 year | Live births | ×1000 | ~3.9/1000 |
| Maternal mortality ratio (MMR) | Maternal deaths | Live births | ×100,000 | ~9/100,000 |
| Maternal mortality rate | Maternal deaths | Women aged 15–49 | ×100,000 | Rarely used |
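The main trap in these rates is the denominator: total births (live births plus stillbirths) for stillbirth and perinatal rates, live births for neonatal rates. The sketch below uses hypothetical counts and age-specific fertility rates chosen only to illustrate the arithmetic; they are not UK statistics.

```python
def rate(numerator, denominator, multiplier=1000):
    """Generic rate: numerator / denominator x multiplier."""
    return numerator / denominator * multiplier

# Hypothetical annual counts for one region.
live_births = 100_000
stillbirths = 380
early_neonatal_deaths = 150  # deaths in the first week of life
late_neonatal_deaths = 100   # deaths after the first week, within 28 days
total_births = live_births + stillbirths

stillbirth_rate = rate(stillbirths, total_births)
pmr = rate(stillbirths + early_neonatal_deaths, total_births)
nmr = rate(early_neonatal_deaths + late_neonatal_deaths, live_births)

# TFR from ASFRs per 1000 women for seven 5-year age groups (15-19 to 45-49).
asfrs_per_1000 = [10, 40, 85, 100, 55, 12, 1]
tfr = 5 * sum(asfrs_per_1000) / 1000  # each woman spends 5 years in each group

print(f"Stillbirth rate = {stillbirth_rate:.1f} per 1000 total births")
print(f"PMR = {pmr:.1f} per 1000 total births")
print(f"NMR = {nmr:.1f} per 1000 live births")
print(f"TFR = {tfr:.2f} children per woman")
```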
WHO Definitions
| Term | WHO Definition | UK Definition |
|---|---|---|
| Stillbirth | Fetal death ≥28 weeks | Fetal death ≥24 weeks |
| Early neonatal death | Death within 7 days of birth | Same |
| Neonatal death | Death within 28 days of birth | Same |
| Perinatal period | From 22 weeks gestation to 7 days after birth | 24 weeks to 7 days |
| Maternal death | Death of a woman while pregnant or within 42 days of termination of pregnancy, from any cause related to or aggravated by the pregnancy or its management, but not from accidental or incidental causes | Same |
| Late maternal death | Death >42 days and <1 year after end of pregnancy | Same |
| Pregnancy-related death | Death from any cause while pregnant or within 42 days of termination of pregnancy (includes incidental) | Used before ICD-MM |
ICD-MM Classification of Maternal Deaths
- Direct maternal deaths: Resulting from obstetric complications of the gravid state (pregnancy, labour, puerperium), from interventions, omissions, incorrect treatment, or from a chain of events resulting from any of these.
  - Examples: Obstetric haemorrhage, pre-eclampsia/eclampsia, sepsis, amniotic fluid embolism, anaesthetic complications, thromboembolism, suicide (classified as direct under ICD-MM)
- Indirect maternal deaths: Resulting from previous existing disease or disease that developed during pregnancy and was not due to direct obstetric causes, but was aggravated by the physiological effects of pregnancy.
  - Examples: Cardiac disease, epilepsy, diabetes, anaemia, HIV, mental health conditions
- Coincidental (fortuitous) maternal deaths: Deaths from unrelated causes that happen to occur in pregnancy or the puerperium.
  - Examples: Road traffic accidents, homicide
- Late maternal deaths: Deaths occurring between 42 days and 1 year after the end of pregnancy.
10.2 MBRRACE-UK and Confidential Enquiries
MBRRACE-UK (Mothers and Babies: Reducing Risk through Audits and Confidential Enquiries across the UK) - Established 2012 (replaced CMACE) - Oversight: Healthcare Quality Improvement Partnership (HQIP) - Key reports: - "Saving Lives, Improving Mothers' Care" (maternal deaths; published annually, each report covering a three-year period) - Perinatal Mortality Surveillance Report - Related programme: Each Baby Counts (intrapartum term stillbirths, neonatal deaths, brain injury), run by the RCOG rather than MBRRACE-UK
Key Findings from Recent Reports (2022–2025)
Main causes of maternal death (UK, 2019–2021):
| Rank | Cause | Type | Proportion |
|---|---|---|---|
| 1 | Cardiac disease | Indirect | ~25% |
| 2 | Thromboembolism | Direct | ~15% |
| 3 | Sepsis | Direct/Indirect | ~12% |
| 4 | Pre-eclampsia/eclampsia | Direct | ~10% |
| 5 | Haemorrhage | Direct | ~8% |
| 6 | Neurological causes | Indirect | ~8% |
| 7 | Mental health (suicide) | Direct (under ICD-MM) | ~5% |
| 8 | Anaesthetic complications | Direct | Rare |
Key disparities: - Ethnicity: Black women 4× more likely, Asian women 2× more likely to die than white women - Socioeconomic: Women from most deprived areas 3× more likely to die - Age: Women ≥35 at higher risk - Obesity: Leading contributor across multiple causes - Late booking: Women who book after 12 weeks have higher risk
Key recommendations (recent): - Better pre-conception counselling for women with medical conditions - Early pregnancy assessment for women with cardiac disease (joint obstetric-cardiac clinics) - Standardised management of obstetric haemorrhage (massive transfusion protocol) - Improved recognition and management of sepsis - E-learning for early warning scores (MEOWS: Modified Early Obstetric Warning Score) - Thromboprophylaxis risk assessment at every contact
10.3 Saving Babies' Lives Care Bundle — Version 3 (2023)
A national patient safety initiative to reduce stillbirth and neonatal death.
Element 1: Smoking cessation - Carbon monoxide (CO) testing at booking - Referral to stop smoking services if CO ≥ 4 ppm (or ≥ 7 ppm in some areas) - Brief intervention training for midwives
Element 2: Growth assessment - Use of customised GROW chart (Gestation Related Optimal Weight) - Serial symphysis-fundal height (SFH) measurements from 24 weeks - Referral for ultrasound if SFH diverges from chart (below 10th or above 90th centile) - Use of ultrasound for suspected SGA: estimated fetal weight + Doppler (umbilical artery PI)
Element 3: Reduced fetal movements (RFM) - Standardised information for women (counting movements, when to contact) - Standardised care pathway: CTG + ultrasound (growth, liquor volume, Doppler) within 2 hours - No digital fetal movement counting for all (controversial — evidence lacking) - Low PAPP-A (<0.4 MoM) → increased surveillance
Element 4: Effective fetal monitoring during labour - Standardised CTG interpretation training (e.g., K2MS, PROMPT, RCOG e-learning) - Use of STAN (ST-segment analysis) or similar adjunct if indicated - Fetal blood sampling (FBS) protocol - Structured communication (SBAR) and team working
Element 5: Reducing preterm birth - Cervical length screening at 20 weeks (transvaginal ultrasound) - Progesterone for short cervix (<25 mm) - Cervical cerclage for history-indicated or ultrasound-indicated short cervix - Arabin pessary (evidence still emerging)
Element 6: Management of pre-existing diabetes in pregnancy (new in Version 3)
10.4 Each Baby Counts (RCOG)
- Aim: Reduce the number of term stillbirths, neonatal deaths, and brain injuries occurring as a result of intrapartum incidents
- Data collection: All UK maternity units submit cases
- Key findings:
- ~80% of cases had some element of substandard care
- Most common issues: CTG misinterpretation, failure to act on abnormal CTG, delayed delivery, poor communication
- ≥30% of cases were potentially avoidable
Key recommendations: - Standardised CTG training every 12 months (including emergency drills) - Consultant-led review of all CTGs in labour - SBAR handover and communication - Real-time monitoring of outcomes - Human factors training (situational awareness, decision-making, communication)
10.5 RCOG Green-top Guidelines — Evidence Grading
Levels of Evidence (RCOG codes follow the SIGN scheme; the OCEBM levels shown are approximate equivalents only):
| Code | Level | Description |
|---|---|---|
| 1++ | 1a | High-quality meta-analyses, systematic reviews of RCTs, or RCTs with very low risk of bias |
| 1+ | 1b | Well-conducted meta-analyses, systematic reviews of RCTs, or RCTs with low risk of bias |
| 1− | 1c | Meta-analyses, systematic reviews of RCTs, or RCTs with high risk of bias |
| 2++ | 2a | High-quality SR of case-control or cohort studies; high-quality case-control/cohort with very low risk of confounding/bias/chance |
| 2+ | 2b | Well-conducted case-control or cohort studies with low risk of confounding/bias/chance |
| 2− | 2c | Case-control or cohort studies with high risk of confounding/bias/chance |
| 3 | 3a/b | Non-analytic studies (case reports, case series) |
| 4 | 4 | Expert opinion |
Grades of Recommendation:
| Grade | Evidence Required |
|---|---|
| A | At least one meta-analysis, systematic review, or RCT rated 1++ and directly applicable to target population; OR systematic review of RCTs or body of evidence consisting principally of studies rated 1+ directly applicable and demonstrating consistency of results |
| B | Body of evidence including studies rated 2++ directly applicable and demonstrating consistency of results; OR extrapolated evidence from studies rated 1++ or 1+ |
| C | Body of evidence including studies rated 2+ directly applicable and demonstrating consistency of results; OR extrapolated evidence from studies rated 2++ |
| D | Evidence level 3 or 4; OR extrapolated evidence from studies rated 2+ |
Good Practice Point (GPP): Recommended best practice based on the clinical experience of the guideline development group.
10.6 NICE Guidelines
- National Institute for Health and Care Excellence
- Use GRADE for quality assessment
- Evidence reviews conducted systematically
- Health economic modelling integral to recommendations
- Recommendation wording:
  - "Offer" = strong recommendation (most patients should receive)
  - "Consider" = weaker recommendation (requires discussion)
  - "Do not offer" = strong recommendation against
- Guidelines also cover treatment options that are not recommended
Key NICE guidelines in O&G:
- NG133: Hypertension in pregnancy
- NG25: Preterm labour and birth
- NG201: Antenatal care (replaced CG62)
- NG3: Diabetes in pregnancy
- NG194: Postnatal care
- CG122: Ovarian cancer
- NG88: Heavy menstrual bleeding
10.7 SIGN Guidelines
- Scottish Intercollegiate Guidelines Network
- Use methodology similar to GRADE
- Grades A–D recommendations
- Key examples:
  - SIGN 160: Management of gestational diabetes
  - SIGN 127: Prophylaxis of venous thromboembolism
  - SIGN 156: Induction of labour
10.8 Fertility & Population Demographics — UK Data
| Measure | Value | Year | Source |
|---|---|---|---|
| Births (England & Wales) | ~600,000/year | 2023 | ONS |
| Total Fertility Rate (TFR) | 1.49 | 2023 | ONS |
| Mean age of mother | 30.7 (first birth); all: 30.8 | 2023 | ONS |
| Teenage pregnancy rate (<18) | ~13/1000 women | 2022 | ONS |
| Percentage of births outside marriage | ~51% | 2023 | ONS |
| Multiple pregnancy rate | ~16/1000 maternities | 2023 | ONS |
| Caesarean section rate | ~33% | 2023 | NHS Digital |
| Induction of labour | ~30–35% | 2023 | NHS Digital |
| Preterm birth rate | ~8% | 2023 | ONS |
| Low birth weight (<2500g) | ~7% | 2023 | ONS |
| Perinatal mortality rate | 4.9/1000 | 2022 | MBRRACE-UK |
| Maternal mortality ratio | 8.8/100,000 | 2020–2022 | MBRRACE-UK |
| Stillbirth rate | 3.9/1000 | 2022 | ONS |
| Neonatal mortality rate | 2.5/1000 | 2022 | ONS |
| Infant mortality rate | 3.9/1000 | 2022 | ONS |
10.9 Clinical Audit in O&G
Definition: A quality improvement process that seeks to improve patient care and outcomes through systematic review of care against explicit criteria and the implementation of change.
The Audit Cycle:
1. Set standards and criteria
2. Observe current practice
3. Compare practice to standards
4. If standards are not met, implement change (if they are met, proceed directly to re-audit)
5. Re-audit (to close the loop)
Types of audit:

| Type | Definition | Example |
|---|---|---|
| Structure audit | Resources, facilities, staffing | Is there a 24-hour labour ward consultant? |
| Process audit | What is done for patients | What proportion had thromboprophylaxis? |
| Outcome audit | Results achieved | What is the CS rate? Perinatal mortality? |
National Audits in O&G
| Audit | Organisation | What it measures |
|---|---|---|
| MBRRACE-UK | MBRRACE-UK collaboration | Maternal and perinatal deaths |
| NMPA (National Maternity and Perinatal Audit) | RCOG | Maternity service quality, outcomes |
| Saving Babies' Lives | NHS England | Stillbirth reduction |
| Each Baby Counts | RCOG | Intrapartum term outcomes |
| UKOSS (UK Obstetric Surveillance System) | NPEU | Rare pregnancy conditions |
UKOSS (UK Obstetric Surveillance System)
- Purpose: Surveillance of rare conditions in pregnancy (estimated incidence < 1 in 2,000 births)
- Method: Monthly case reporting cards sent to all consultant-led maternity units
- Examples: Amniotic fluid embolism, placenta accreta, uterine rupture, peripartum cardiomyopathy, maternal sepsis
- Outputs: Incidence rates, risk factors, management patterns, maternal and perinatal outcomes
10.10 Quality Improvement in O&G
Plan-Do-Study-Act (PDSA) cycles:
- Plan: Define the change, predict outcomes, develop measurement
- Do: Implement the change on a small scale
- Study: Analyse data, compare to predictions
- Act: Refine the change, scale up or abandon
Common QI projects in O&G:
- Reducing induction-to-delivery interval
- Improving antibiotic prophylaxis timing for CS
- Reducing emergency CS decision-to-delivery interval
- Implementing standardised CTG interpretation
- Improving breastfeeding rates
- Reducing perineal trauma
10.11 Key UK Screening Programmes — Summary Table
| Programme | Condition | Test | Population | Interval |
|---|---|---|---|---|
| NHS Fetal Anomaly Screening Programme (FASP) | 11 physical anomalies + Down's/Edwards'/Patau's | Combined test (11–14w) or Quadruple (14–20w) + anomaly scan (18–20w) | All pregnant women | Per pregnancy |
| NHS Sickle Cell and Thalassaemia Screening | Sickle cell disease, thalassaemia, carrier status | Family origin questionnaire + Hb HPLC | All pregnant women (and partners if carrier) | Per pregnancy |
| NHS Infectious Diseases in Pregnancy Screening | HIV, Hepatitis B, Syphilis | Blood test | All pregnant women | Per pregnancy (and 28w for high-risk HIV) |
| NHS Cervical Screening Programme | Cervical cancer (HPV-related) | HPV test → reflex cytology | Women 25–64 | 3-yearly (25–49), 5-yearly (50–64) |
| NHS Breast Screening Programme | Breast cancer | Mammography | Women 50–70 (extending to 47–73) | 3-yearly |
| NHS Abdominal Aortic Aneurysm Screening | AAA | Ultrasound | Men 65+ | Once |
| NHS Diabetic Eye Screening | Diabetic retinopathy | Digital retinal photography | All with diabetes | Annual |
10.12 How to Answer MRCOG Part 1 Epidemiology Questions
Common question formats:
1. "A new screening test has sensitivity 95% and specificity 95%. The prevalence is 1%. What is the PPV?"
2. "Which study design would be best to investigate an association between a rare disease and a common exposure?"
3. "What is the most appropriate statistical test to compare birth weight between smokers and non-smokers?"
4. "What is the correct interpretation of this confidence interval?"
5. "Which type of bias is most likely in a case-control study of maternal medication and congenital anomalies?"
Answer strategy:
1. Identify what is being asked (study design, test, bias, interpretation)
2. Recall the relevant definition and formula
3. Apply to the specific scenario
4. Eliminate wrong answers systematically
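Applying this strategy to the first question format (sensitivity 95%, specificity 95%, prevalence 1%) shows why the PPV is far lower than most candidates expect. The following is a minimal Python sketch of the calculation via Bayes' theorem; the function names `ppv` and `npv` are illustrative choices, not part of any standard library.

```python
def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Positive predictive value: P(disease | positive test)."""
    true_pos = sensitivity * prevalence               # diseased and test positive
    false_pos = (1 - specificity) * (1 - prevalence)  # healthy but test positive
    return true_pos / (true_pos + false_pos)


def npv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Negative predictive value: P(no disease | negative test)."""
    true_neg = specificity * (1 - prevalence)         # healthy and test negative
    false_neg = (1 - sensitivity) * prevalence        # diseased but test negative
    return true_neg / (true_neg + false_neg)


# Question format 1: Sn 95%, Sp 95%, prevalence 1%
print(f"PPV = {ppv(0.95, 0.95, 0.01):.1%}")
print(f"NPV = {npv(0.95, 0.95, 0.01):.1%}")
```

With these inputs the PPV is about 16%: at 1% prevalence, false positives outnumber true positives roughly five to one even though the test is 95% specific.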
Formulas to memorise (and practise):
- Sensitivity, specificity, PPV, NPV, LR+, LR−
- RR, OR, AR, ARF, NNT
- χ² = Σ(O−E)²/E
- SEM = SD/√n
- Post-test odds = Pre-test odds × LR
- Adjusted α (Bonferroni) = 0.05/k
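The last two formulas are the easiest to fumble under time pressure, because odds must be converted to and from probabilities. A short Python sketch follows, assuming a hypothetical pre-test probability of 10% and an LR+ of 9, chosen purely because they give round numbers; the function names are illustrative.

```python
def post_test_probability(pre_test_prob: float, likelihood_ratio: float) -> float:
    """Convert probability to odds, apply the LR, convert back to probability."""
    pre_test_odds = pre_test_prob / (1 - pre_test_prob)
    post_test_odds = pre_test_odds * likelihood_ratio
    return post_test_odds / (1 + post_test_odds)


def bonferroni_alpha(k: int, alpha: float = 0.05) -> float:
    """Per-comparison significance threshold when k comparisons are made."""
    return alpha / k


# Hypothetical example: pre-test probability 10%, LR+ = 9
print(f"Post-test probability = {post_test_probability(0.10, 9):.2f}")

# Five comparisons at an overall alpha of 0.05
print(f"Bonferroni-adjusted alpha = {bonferroni_alpha(5):.3f}")
```

Here pre-test odds of 1:9 multiplied by an LR of 9 give post-test odds of 1:1, a post-test probability of 50%. With five comparisons, each must reach p < 0.01.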
Quick Reference: MRCOG Epidemiology Formulae
Screening
| Formula | Mnemonic |
|---|---|
| Sn = TP / (TP + FN) | Sn = sick / (sick + missed) |
| Sp = TN / (TN + FP) | Sp = well / (well + false alarms) |
| PPV = TP / (TP + FP) | PPV = true positive / all positive |
| NPV = TN / (TN + FN) | NPV = true negative / all negative |
| LR+ = Sn / (1 − Sp) | Positive LR = sensitivity / false positive rate |
| LR− = (1 − Sn) / Sp | Negative LR = false negative rate / specificity |
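The six screening formulas can be checked against one another from a single 2×2 table. The sketch below is illustrative only: the `TwoByTwo` class is a hypothetical helper and the counts are made up for arithmetic convenience, not taken from any study.

```python
from dataclasses import dataclass


@dataclass
class TwoByTwo:
    """Counts from comparing a test against a reference standard."""
    tp: int  # test positive, disease present
    fp: int  # test positive, disease absent
    fn: int  # test negative, disease present
    tn: int  # test negative, disease absent

    def sensitivity(self) -> float:
        return self.tp / (self.tp + self.fn)

    def specificity(self) -> float:
        return self.tn / (self.tn + self.fp)

    def ppv(self) -> float:
        return self.tp / (self.tp + self.fp)

    def npv(self) -> float:
        return self.tn / (self.tn + self.fn)

    def lr_positive(self) -> float:
        return self.sensitivity() / (1 - self.specificity())

    def lr_negative(self) -> float:
        return (1 - self.sensitivity()) / self.specificity()


# Made-up counts: 100 diseased, 900 healthy
table = TwoByTwo(tp=90, fp=45, fn=10, tn=855)
print(f"Sn  = {table.sensitivity():.2f}")
print(f"Sp  = {table.specificity():.2f}")
print(f"PPV = {table.ppv():.2f}")
print(f"NPV = {table.npv():.2f}")
print(f"LR+ = {table.lr_positive():.1f}")
print(f"LR- = {table.lr_negative():.2f}")
```

Note that sensitivity and specificity read down the columns of the table (by true disease status), whereas PPV and NPV read across the rows (by test result), which is why only the latter change with prevalence.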
Risk & Effect
| Formula | When used |
|---|---|
| RR = [a/(a+b)] / [c/(c+d)] | Cohort studies |
| OR = ad / bc | Case-control studies |
| AR = a/(a+b) − c/(c+d) | Excess risk |
| ARF = (RR−1)/RR | % of risk due to exposure |
| NNT = 1/ARR | Number needed to treat |
| NNH = 1/ARI | Number needed to harm (ARI = absolute risk increase) |
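A worked example may help fix the cell labels in memory. The sketch below assumes the usual layout (a = exposed with outcome, b = exposed without, c = unexposed with outcome, d = unexposed without), uses made-up counts, and assumes no cell is zero; `risk_measures` is a hypothetical helper name.

```python
def risk_measures(a: int, b: int, c: int, d: int) -> dict:
    """Effect measures from a 2x2 table (a, b = exposed row; c, d = unexposed row)."""
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    rr = risk_exposed / risk_unexposed
    ar = risk_exposed - risk_unexposed      # attributable (excess) risk
    return {
        "risk_exposed": risk_exposed,
        "risk_unexposed": risk_unexposed,
        "RR": rr,
        "OR": (a * d) / (b * c),            # cross-product ratio
        "AR": ar,
        "ARF": (rr - 1) / rr,               # fraction of exposed risk due to exposure
        "NNT_or_NNH": 1 / abs(ar),          # NNT if risk falls, NNH if it rises
    }


# Made-up cohort: 30/100 exposed and 10/100 unexposed develop the outcome
for name, value in risk_measures(a=30, b=70, c=10, d=90).items():
    print(f"{name:>14}: {value:.3f}")
```

In this example the outcome is common (10% to 30%), so the OR (about 3.86) overstates the RR (3.0).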
Statistics
| Formula | Meaning |
|---|---|
| x̄ = Σx/n | Mean |
| s² = Σ(x−x̄)²/(n−1) | Sample variance |
| SD = √s² | Standard deviation |
| SEM = SD/√n | Standard error of mean |
| 95% CI ≈ x̄ ± 2×SEM | Confidence interval for mean |
| χ² = Σ(O−E)²/E | Chi-squared test |
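The statistics formulas above can be reproduced with Python's standard library. This is a sketch under stated assumptions: the birth weights and the 2×2 counts are invented for illustration, the confidence interval uses the large-sample multiplier 1.96 (which the table rounds to 2; a t multiplier would give a wider interval for a sample this small), and `summarise` and `chi_squared` are hypothetical helper names.

```python
import math
import statistics


def summarise(values: list) -> dict:
    """Mean, sample variance (n - 1 denominator), SD, SEM and approximate 95% CI."""
    n = len(values)
    mean = statistics.mean(values)
    variance = statistics.variance(values)   # divides by n - 1
    sd = math.sqrt(variance)
    sem = sd / math.sqrt(n)
    return {
        "mean": mean,
        "variance": variance,
        "sd": sd,
        "sem": sem,
        "ci95": (mean - 1.96 * sem, mean + 1.96 * sem),
    }


def chi_squared(observed: list) -> tuple:
    """Pearson chi-squared statistic and expected counts for a contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    return statistic, expected


# Invented birth weights in grams
weights = [3200, 3400, 3100, 3600, 3300, 3500, 2900, 3400]
print(summarise(weights))

# Invented 2x2 table: rows = exposed / unexposed, columns = outcome yes / no
statistic, expected = chi_squared([[30, 70], [10, 90]])
print(f"chi-squared = {statistic:.2f}, expected counts = {expected}")
```

For the 2×2 table the statistic is 12.5 on 1 degree of freedom, which exceeds the 3.84 critical value for α = 0.05. Printing the expected counts is a useful habit, because any expected count below 5 means Fisher's exact test should be used instead.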
Decision Rules
| Rule | Cut-off |
|---|---|
| α (Type I error) | 0.05 |
| β (Type II error) | 0.20 (power = 80%) |
| p < 0.05 | Statistically significant |
| 95% CI excludes 1 (RR/OR) | Statistically significant |
| I² > 50% | Substantial heterogeneity |
| AUC > 0.8 | Good diagnostic accuracy |
Mnemonics for MRCOG Part 1
- SnNOut: High Sensitivity → Negative rules Out
- SpPIn: High Specificity → Positive rules In
OSA to remember bias types:
- O = Observer bias
- S = Selection bias
- A = Attrition bias
CRIB for confounder criteria:
- C = Causes outcome (independent risk factor)
- R = Related to exposure
- I = Intermediate? No: a confounder is not on the causal pathway
- B = Before exposure: the confounder must precede the exposure
NNT = 1/ARR — think "Need N To prevent" = 1 over Absolute Risk Reduction
SEM < SD (whenever n > 1): Standard Error is Smaller than Standard Deviation
Common MRCOG Part 1 Traps
| # | Trap | Truth |
|---|---|---|
| 1 | "p-value = probability H₀ is true" | WRONG: p = P(data at least as extreme as observed \| H₀ true), not P(H₀ \| data) |
| 2 | "SEM = SD" | WRONG — SEM = SD/√n |
| 3 | "OR = RR always" | WRONG — only when disease rare (<10%) |
| 4 | "PPV is a fixed test property" | WRONG — PPV depends on prevalence |
| 5 | "Non-significant p = no effect" | WRONG — may be underpowered |
| 6 | "ITT is good for non-inferiority" | WRONG — ITT is anti-conservative for non-inferiority |
| 7 | "Screening always saves lives" | WRONG — lead time, length time, overdiagnosis |
| 8 | "Case-control studies can calculate incidence" | WRONG — only OR |
| 9 | "Correlation = causation" | ALWAYS WRONG |
| 10 | "Confounder is an intermediate variable" | WRONG — confounder is outside causal pathway |
| 11 | "95% CI range contains 95% of data" | WRONG — 95% CI is about the mean, not individual values |
| 12 | "Histogram bars should have gaps" | WRONG — histogram bars TOUCH (bar chart bars have gaps) |
| 13 | "χ² test can be used with any 2×2 table" | WRONG — expected <5 requires Fisher's exact |
| 14 | "Mean is always the best measure" | WRONG — use median for skewed data |
| 15 | "Blinding and allocation concealment are the same" | WRONG — allocation concealment is ALWAYS possible; blinding is not |
| 16 | "Cluster RCT doesn't need special analysis" | WRONG — must account for clustering (ICC, design effect) |
| 17 | "p < 0.01 means a more important result than p < 0.05" | WRONG — p depends on sample size, not just effect size |
| 18 | "Systematic review = meta-analysis" | WRONG — a meta-analysis is the statistical combination; not all SRs have one |
| 19 | "NNT is a fixed property of a treatment" | WRONG — NNT depends on baseline risk |
| 20 | "One-sided test is always more powerful" | WRONG — only if the true effect is in the hypothesised direction |
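Traps 3 and 19 are easier to remember once the numbers have been seen to move. The sketch below holds the relative effect constant and varies only the baseline risk; the risks and the 25% relative risk reduction are hypothetical values chosen for illustration, and the function names are assumptions.

```python
def odds_ratio_from_risks(risk_exposed: float, risk_unexposed: float) -> float:
    """Odds ratio implied by two risks (probabilities)."""
    odds_exposed = risk_exposed / (1 - risk_exposed)
    odds_unexposed = risk_unexposed / (1 - risk_unexposed)
    return odds_exposed / odds_unexposed


def nnt(baseline_risk: float, relative_risk_reduction: float) -> float:
    """NNT = 1 / ARR, where ARR = baseline risk x relative risk reduction."""
    absolute_risk_reduction = baseline_risk * relative_risk_reduction
    return 1 / absolute_risk_reduction


# Trap 3: RR fixed at 2, but the OR drifts away as the outcome becomes common
for baseline in (0.01, 0.10, 0.40):
    exposed = 2 * baseline
    print(f"baseline {baseline:.0%}: RR = 2.00, "
          f"OR = {odds_ratio_from_risks(exposed, baseline):.2f}")

# Trap 19: same 25% relative risk reduction, very different NNT
for baseline in (0.20, 0.02):
    print(f"baseline {baseline:.0%}: NNT = {nnt(baseline, 0.25):.0f}")
```

With the relative risk fixed at 2, the odds ratio is about 2.02 at a 1% baseline risk, 2.25 at 10% and 6.0 at 40%. The same 25% relative risk reduction gives an NNT of 20 at a 20% baseline risk but 200 at a 2% baseline risk.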
References & Further Reading
Essential Textbooks:
- Kirkwood BR & Sterne JAC. Essential Medical Statistics. 2nd ed. Blackwell Science, 2003.
- Altman DG. Practical Statistics for Medical Research. Chapman & Hall, 1991.
- Petrie A & Sabin C. Medical Statistics at a Glance. 4th ed. Wiley-Blackwell, 2020.
- Bland M. An Introduction to Medical Statistics. 4th ed. OUP, 2015.
- Fletcher RH & Fletcher SW. Clinical Epidemiology: The Essentials. 5th ed. Wolters Kluwer, 2014.
- Straus SE et al. Evidence-Based Medicine: How to Practice and Teach It. 5th ed. Elsevier, 2018.
Key UK Documents:
- RCOG. Green-top Guidelines Levels of Evidence and Grades of Recommendation. (Introductory sections of any Green-top Guideline)
- MBRRACE-UK. Saving Lives, Improving Mothers' Care. (Latest triennial report)
- NICE. The Guidelines Manual (process and methods).
- NHS FASP. Fetal anomaly screening programme standards.
- Wilson JMG & Jungner G. Principles and practice of screening for disease. WHO Public Health Papers No. 34. Geneva: WHO, 1968.
Key Papers:
- Guyatt GH et al. GRADE: an emerging consensus on rating quality of evidence and strength of recommendations. BMJ 2008;336:924–6.
- Schulz KF et al. CONSORT 2010 Statement: updated guidelines for reporting parallel group randomised trials. BMJ 2010;340:c332.
- von Elm E et al. The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement. Lancet 2007;370:1453–7.
- Moher D et al. Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. BMJ 2009;339:b2535.
- Altman DG & Bland JM. Statistics notes: diagnostic tests 1: sensitivity and specificity. BMJ 1994;308:1552.
- Altman DG & Bland JM. Statistics notes: diagnostic tests 2: predictive values. BMJ 1994;309:102.
- Deeks JJ. Systematic reviews in health care: systematic reviews of evaluations of diagnostic and screening tests. BMJ 2001;323:157–62.
- Higgins JPT et al. Measuring inconsistency in meta-analyses. BMJ 2003;327:557–60.
- Sterne JAC & Egger M. Funnel plots for detecting bias in meta-analysis. J Clin Epidemiol 2001;54:1046–55.
Online Resources:
- OpenEpi (www.openepi.com): free online calculators for epidemiological statistics
- Cochrane Handbook for Systematic Reviews of Interventions (training.cochrane.org/handbook)
- MedCalc Statistical Software (www.medcalc.org): ROC curve analysis, diagnostic test evaluation
- NICE guidance (www.nice.org.uk)
- RCOG Green-top Guidelines (www.rcog.org.uk/guidelines)
- MBRRACE-UK reports (www.npeu.ox.ac.uk/mbrrace-uk)
- ONS birth statistics (www.ons.gov.uk)
- StATS statistical calculator (www.statsdirect.com)
- Last updated: May 2026
- Target exam: MRCOG Part 1
- Word count: ~18,500+

Author note: This document is intended as a comprehensive revision resource covering all examinable topics in epidemiology, statistics, screening, evidence-based medicine, and O&G-specific applications. Candidates should supplement it with current RCOG Green-top Guidelines, recent NICE guidance, and the latest MBRRACE-UK reports for the most up-to-date statistical data (rates, mortality figures, screening programme updates). Particular attention should be paid to the formulae and interpretations flagged as "MRCOG Key Point" and "Common MRCOG Part 1 Traps"; these represent the most frequently tested and most commonly confused concepts in the examination.