MRCOG Part 1: Epidemiology & Statistics — Comprehensive Study Document

Target: MRCOG Part 1
Version: May 2026
Purpose: Complete deep-dive reference covering all examinable topics in epidemiology, medical statistics, screening, evidence-based medicine, and their application to obstetrics & gynaecology. This document is designed for thorough revision: every section includes definitions, formulae, mnemonics, worked examples from O&G, and MRCOG-specific exam tips.


Table of Contents

  1. Study Design
  2. Screening
  3. Descriptive Statistics
  4. Hypothesis Testing
  5. Parametric vs Non-Parametric Tests
  6. Risk & Effect Measures
  7. Statistical Bias & Confounding
  8. Evidence-Based Medicine
  9. Survival Analysis
  10. Specific Topics in O&G

1. Study Design

1.1 Observational vs Experimental Studies

Feature Observational Experimental
Intervention None — investigator observes naturally occurring groups Investigator assigns intervention intentionally
Causality Association only (unless Bradford Hill criteria satisfied) Can infer causation (if properly randomised and blinded)
Bias risk Higher — multiple sources of bias possible Lower — randomisation balances confounders
Ethical issues Fewer — no manipulation of participants More — equipoise required; informed consent essential
Examples Cohort, case-control, cross-sectional, ecological RCT (parallel, crossover, cluster, factorial)

Bradford Hill Criteria for Causation (1965): a set of nine viewpoints, important when interpreting observational studies, used to assess whether an observed association is likely to be causal:

  1. Strength of association (larger effect = more likely causal)
  2. Consistency (reproduced across different populations/settings)
  3. Specificity (one cause → one effect; less applicable to O&G, where most outcomes are multifactorial)
  4. Temporality (cause must precede effect; the only absolutely essential criterion)
  5. Biological gradient (dose-response relationship, e.g., more cigarettes → higher preterm birth risk)
  6. Plausibility (biologically credible mechanism)
  7. Coherence (consistent with natural history/biology)
  8. Experiment (evidence from experimental studies)
  9. Analogy (similar evidence for analogous exposures)

1.2 Cross-Sectional Studies

  • Design: Data collected at a SINGLE point in time — both exposure and outcome measured simultaneously
  • Measures: Prevalence (existing cases) — CANNOT measure incidence (new cases)
  • Uses: Disease burden estimates, health surveys, screening programme evaluation, hypothesis generation, planning health services
  • Advantages: Quick, cheap, good for hypothesis generation, no loss to follow-up, can study multiple outcomes and exposures simultaneously
  • Disadvantages: Cannot establish temporality (chicken-and-egg problem — did the exposure come before the outcome?), survival bias (only survivors captured — those who died cannot participate), not suitable for rare diseases (need very large samples), prevalence-incidence bias (Neyman bias)
  • Key statistic: Odds ratio (can be calculated but caution with interpretation — prevalence OR, not incidence OR)
  • In O&G: Estimating prevalence of pelvic organ prolapse, urinary incontinence (UI), infertility, contraception use patterns, postnatal depression (e.g., EPDS screening studies), HPV prevalence, endometriosis prevalence estimates

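Example: A cross-sectional survey asks 5,000 women about incontinence symptoms and BMI. 1,200 report UI; 800 of those with UI are obese, compared with 1,500 of the 3,800 without UI.

  • Prevalence of UI = 1,200/5,000 = 24%
  • Prevalence OR for UI in obese vs non-obese = ad/bc = (800 × 2,300)/(1,500 × 400) = 3.07, where a = 800 (obese, UI), b = 1,500 (obese, no UI), c = 400 (non-obese, UI), d = 2,300 (non-obese, no UI)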

Limitation: Cannot tell if obesity caused UI or UI led to reduced activity and weight gain.
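The arithmetic in this example can be checked with a short sketch. The function name `prevalence_odds_ratio` and the variable names are illustrative only; the cell layout assumed is a = exposed with outcome, b = exposed without outcome, c = unexposed with outcome, d = unexposed without outcome.

```python
def prevalence_odds_ratio(a, b, c, d):
    """Cross-product ratio ad/bc for a 2x2 table.

    a = exposed with outcome, b = exposed without outcome,
    c = unexposed with outcome, d = unexposed without outcome.
    """
    return (a * d) / (b * c)


# Figures from the survey example
total_women = 5000
ui_total = 1200
obese_with_ui = 800
obese_without_ui = 1500

# Derive the remaining two cells of the table
non_obese_with_ui = ui_total - obese_with_ui                        # 400
non_obese_without_ui = (total_women - ui_total) - obese_without_ui  # 2300

prevalence_ui = ui_total / total_women
por = prevalence_odds_ratio(
    obese_with_ui, obese_without_ui, non_obese_with_ui, non_obese_without_ui
)

print(f"Prevalence of UI: {prevalence_ui:.0%}")
print(f"Prevalence odds ratio: {por:.2f}")
```

Deriving the two missing cells from the totals before taking the cross-product is a useful habit in the exam, because most errors come from placing a number in the wrong cell.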

1.3 Cohort Studies

Definition: Groups defined by exposure status; followed forward in time to see who develops outcome. This is the optimal observational design for establishing incidence and temporal relationships.

Prospective Cohort

  • Exposure assessed at BASELINE; participants followed FORWARD in time
  • Outcome develops during follow-up
  • Advantages: Direct measure of incidence, clear temporality (exposure definitely precedes outcome), can study multiple outcomes from one exposure, minimises recall bias (exposure recorded before outcome known), allows calculation of absolute risk, RR, AR
  • Disadvantages: Expensive and time-consuming (especially for rare outcomes or long latency), loss to follow-up (attrition) can introduce bias, inefficient for rare diseases (need very large numbers), exposure patterns may change over time
  • Key measures: Incidence (cumulative incidence and incidence rate), relative risk (RR), attributable risk (AR), population attributable fraction (PAF)

Retrospective Cohort (Historical Cohort)

  • Uses EXISTING data (medical records, databases, occupational records) to go back in time
  • Exposure and outcome have ALREADY occurred when study begins
  • Advantages: Cheaper, faster than prospective, good for long-latency diseases (e.g., DES exposure in utero and vaginal adenocarcinoma decades later), can use existing datasets
  • Disadvantages: Relies on quality and completeness of existing records, recall bias may still affect some data, cannot control what was measured or how, missing data issues

Key Measures in Cohort Studies — Detailed with Worked Example

Worked O&G Example: 10,000 pregnant women; 5,000 smoke, 5,000 do not. Followed for preterm birth (<37 weeks).

Preterm Term Total Risk
Smoker 200 4,800 5,000 200/5000 = 0.04
Non-smoker 100 4,900 5,000 100/5000 = 0.02
Total 300 9,700 10,000 300/10000 = 0.03

Measure Formula Calculation Interpretation
Cumulative incidence (risk) in exposed a/(a+b) = 200/5000 0.04 (4%) 4% of smokers had preterm birth
Cumulative incidence (risk) in unexposed c/(c+d) = 100/5000 0.02 (2%) 2% of non-smokers had preterm birth
Incidence rate (IR) in exposed 200 / (sum of person-time) Depends on follow-up timing Accounts for when events occur
Relative Risk (RR) 0.04 / 0.02 2.0 Smokers 2× more likely to have preterm birth
Attributable Risk (AR) 0.04 − 0.02 0.02 (2%) Excess risk attributable to smoking
AR Fraction (ARF) (2.0−1)/2.0 = 50% 50% Half of preterm births in smokers due to smoking
Population Attributable Risk (PAR) I_total − I_unexposed = 0.03 − 0.02 0.01 (1%) Excess risk in total population
PAF (I_total − I_unexposed)/I_total = 0.01/0.03 33.3% 33% of all preterm births attributable to smoking
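As a cross-check on the table of measures, the following is a minimal sketch that derives each one from the four counts of the smoking example. The function name `cohort_measures` and the dictionary keys are illustrative choices, not standard library names.

```python
def cohort_measures(exposed_events, exposed_total, unexposed_events, unexposed_total):
    """Risk-based effect measures for a two-group cohort study."""
    risk_exposed = exposed_events / exposed_total
    risk_unexposed = unexposed_events / unexposed_total
    risk_total = (exposed_events + unexposed_events) / (exposed_total + unexposed_total)
    rr = risk_exposed / risk_unexposed
    return {
        "RR": rr,                                          # relative risk
        "AR": risk_exposed - risk_unexposed,               # attributable risk
        "ARF": (rr - 1) / rr,                              # attributable fraction in the exposed
        "PAR": risk_total - risk_unexposed,                # population attributable risk
        "PAF": (risk_total - risk_unexposed) / risk_total, # population attributable fraction
    }


# Smoking and preterm birth: 200/5000 exposed, 100/5000 unexposed
measures = cohort_measures(200, 5000, 100, 5000)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```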

Person-time:

  • Each participant contributes time until event, loss to follow-up, or study end
  • Incidence rate = number of new events / sum of person-time at risk
  • Expressed as "per 1000 person-years" or similar
  • Superior to cumulative incidence when follow-up times vary
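A small sketch of the person-time calculation follows. The five follow-up records are invented purely for illustration (they are not from any study), and `incidence_rate` is a hypothetical helper name.

```python
# Each record: (weeks of follow-up contributed, whether the event occurred).
# Invented illustrative data: follow-up stops at event, loss to follow-up, or study end.
follow_up = [
    (30.0, True),   # event at 30 weeks
    (12.5, False),  # lost to follow-up at 12.5 weeks
    (40.0, False),  # event-free to study end
    (34.0, True),   # event at 34 weeks
    (40.0, False),  # event-free to study end
]


def incidence_rate(records):
    """New events divided by the total person-time at risk."""
    events = sum(1 for _, had_event in records if had_event)
    person_time = sum(time for time, _ in records)
    return events / person_time


rate = incidence_rate(follow_up)
print(f"Incidence rate: {rate * 1000:.1f} per 1000 person-weeks")
```

Note that the participant lost at 12.5 weeks still contributes to the denominator, which is exactly what cumulative incidence cannot accommodate.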

Confounding in Cohort Studies:

  • Common confounders in O&G cohorts: maternal age, socioeconomic status, parity, BMI, pre-existing medical conditions
  • Control: multivariable regression, stratification, matching, restriction, propensity scores

1.4 Case-Control Studies

Design: Select cases (with disease) and controls (without disease); look BACK retrospectively for exposure. The most efficient design for rare diseases.

2×2 Table

Case (disease +) Control (disease −) Total
Exposed a b a + b
Unexposed c d c + d

Key Measures

Measure Formula Interpretation
Odds of exposure in cases a / c How likely cases were exposed
Odds of exposure in controls b / d How likely controls were exposed
Odds Ratio (OR) (a/c) / (b/d) = ad / bc Odds of exposure in cases vs controls
When disease rare (<10%) OR ≈ RR Rare disease assumption

Worked O&G Example: Case-control study of ovarian cancer and talc use.

  • Cases: 300 women with ovarian cancer
  • Controls: 600 women without ovarian cancer
  • Talc use: 120 cases exposed, 180 controls exposed

Ovarian cancer (Case) No cancer (Control)
Talc use 120 180
No talc 180 420

OR = (120 × 420) / (180 × 180) = 50,400 / 32,400 = 1.56

Interpretation: Odds of talc exposure are 1.56× higher in ovarian cancer cases than controls. Since ovarian cancer is relatively rare, this approximates RR = 1.56.
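The odds ratio above can be reproduced with the sketch below. The 95% confidence interval is an addition not given in the text; it assumes Woolf's (log-based) method, in which the standard error of ln(OR) is √(1/a + 1/b + 1/c + 1/d). The function name is illustrative.

```python
import math


def odds_ratio_with_ci(a, b, c, d, z=1.96):
    """Odds ratio ad/bc with a Woolf (log-based) confidence interval.

    a = exposed cases, b = exposed controls,
    c = unexposed cases, d = unexposed controls.
    """
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Talc and ovarian cancer table
or_talc, ci_low, ci_high = odds_ratio_with_ci(a=120, b=180, c=180, d=420)
print(f"OR = {or_talc:.2f} (95% CI {ci_low:.2f} to {ci_high:.2f})")
```

If the interval excludes 1, the association is statistically significant at the 5% level, which is how OR results are usually presented in exam questions.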

Cannot calculate:

  • Incidence (no denominator: cases and controls were selected, a population was not followed)
  • RR (no incidence data)
  • Prevalence (same reason)

Advantages

  • Efficient for rare diseases (ovarian cancer, specific congenital anomalies, maternal death)
  • Quick and cheap compared to cohort studies
  • Can study multiple exposures (diet, environment, genetics, medications)
  • Good for diseases with long latency (like DES-related cancers)
  • Requires smaller sample sizes than cohort studies for rare outcomes

Disadvantages

  • Cannot calculate incidence directly (no denominator of total population at risk)
  • Recall bias: Cases remember exposures differently from controls (especially for subjective exposures like diet, pain medication, stress)
  • Selection bias: Choosing appropriate controls is the most difficult and critical part
  • Temporality: Difficult to establish if exposure preceded disease (especially for biomarkers measured after diagnosis)
  • Cannot study rare exposures (if exposure is rare, you need enormous numbers)
  • Survivorship bias: Cases are those who survived to be diagnosed; fatal cases are missed

Selection of Controls — Critical Issues

Fundamental principle: Controls must come from the SAME source population that gave rise to the cases.

Control type Description Advantage Disadvantage
Population-based Random sample from general population Most representative Expensive, low response rates
Hospital-based Other patients from same hospital Easy to recruit, good response rates Berkson's bias — hospital controls may have different exposure patterns
Friend/relative Friends or siblings of cases Genetic/environmental matching Over-matching possible (same exposures)
Neighbourhood Neighbours of cases Socioeconomic matching Time-consuming
Disease controls Patients with a DIFFERENT disease Good response, similar recall Diseased group may differ from healthy

Matching:

  • Frequency matching: Select controls to have same distribution of age, parity, etc. as cases
  • Individual matching: Each case matched to 1–4 controls on specific factors (age ± 5 years, parity, hospital)
  • Over-matching: Matching on a variable that is related to the exposure but NOT to the disease; this reduces power without reducing confounding

Biases in Case-Control Studies — Expanded

Bias Mechanism Example
Recall bias Cases search memory for causes; controls less motivated Mothers of babies with malformations report more medication during pregnancy; mothers of healthy babies forget
Berkson's bias Hospital controls have different admission patterns Studying aspirin and stroke: hospital controls may have GI bleeds (also related to aspirin) → spurious protective effect
Neyman bias Prevalent cases differ from incident cases Studying survival after cancer: prevalent cases are long-term survivors, not representative
Detection bias Cases diagnosed because of exposure-related screening Women on HRT have more mammograms → more breast cancer detected (not causal)
Interviewer bias Interviewer probes differently More detailed questioning of cases about exposures
Survivorship bias Fatal cases not included Studying risk factors for eclampsia — only survivors available

1.5 Randomised Controlled Trials (RCTs)

Gold standard for establishing causality. The key strength is randomisation, which (if adequate) balances both known and unknown confounders between groups.

Types of RCT

Type Description Key Feature Example in O&G
Parallel group Two (or more) independent groups, each receives one treatment concurrently Most common; simplest analysis TRUFFLE study (CTG monitoring in IUGR)
Crossover Each participant receives both treatments in random sequence, separated by washout period Each participant acts as own control → smaller sample size needed Comparing two pain relief methods in labour (problem: carryover, and can't use if condition changes)
Cluster Intact groups (hospitals, GP practices, communities) randomised Used when contamination likely; analysis must account for clustering (ICC) Comparing screening uptake with different invitation methods at hospital level
Factorial Two or more interventions tested simultaneously (e.g., 2×2 design) Efficient — can test interactions (synergy/antagonism) Comparing aspirin AND heparin vs each alone in recurrent miscarriage
Zelen design Randomised BEFORE consent; only treatment group approached for consent Reduces selection bias; ethically controversial Emergency trials where consent is difficult
N-of-1 trial Single patient receives treatment and placebo in random sequence Highest level for individual treatment decisions Rarely used in O&G

Trial Phases

Phase Primary Purpose Typical Participants Key Questions
Phase I Safety, tolerability, pharmacokinetics 20–80 healthy volunteers (or patients with advanced disease) What is safe dose? What are side effects? How is drug metabolised?
Phase II Efficacy signal, dose-ranging, side effect profile 100–300 patients with condition Does it work? What is optimal dose? More adverse effects?
Phase III Confirm efficacy, compare to standard of care 1,000–3,000+ patients Is it better than current standard? (or non-inferior)
Phase IV Post-marketing surveillance, long-term safety General population after licensing Are there rare adverse effects? Long-term outcomes?

Randomisation Methods — Detailed

Method Description Strength Weakness
Simple randomisation Each participant assigned by coin toss, random number table, or computer Unpredictable; simple Can produce unequal group sizes and imbalance on prognostic factors
Block randomisation Random permuted blocks of fixed size (e.g., 4: possibilities = TTCC, TCTC, TCCT, CTTC, CTCT, CCTT) Ensures equal numbers in each group at the end of every block Block size must be CONCEALED to prevent prediction
Stratified randomisation Separate randomisation within strata defined by key prognostic factors Ensures balance on important confounders Complex; need few strata or it becomes unwieldy
Minimisation Next participant's allocation determined by current imbalance in prognostic factors Excellent balance on many factors simultaneously Not truly random; some controversy about analysis
Adaptive randomisation Allocation probability changes based on accumulating outcomes More patients get better treatment Complex; operational bias possible

Key point: The randomisation sequence must be CONCEALED from those recruiting participants. If the recruiter knows the next allocation, they can (consciously or unconsciously) influence who is enrolled → selection bias.
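Permuted-block randomisation can be sketched in a few lines. This is an illustration only: `permuted_block_sequence` is a hypothetical helper, and the fixed seed is there simply to make the example reproducible. In a real trial the sequence would be generated and held centrally so that recruiters cannot see it.

```python
import random


def permuted_block_sequence(n_blocks, block_size=4, arms=("T", "C"), seed=None):
    """Build an allocation list from randomly permuted, balanced blocks."""
    if block_size % len(arms) != 0:
        raise ValueError("block_size must be a multiple of the number of arms")
    rng = random.Random(seed)
    per_arm = block_size // len(arms)
    sequence = []
    for _ in range(n_blocks):
        block = list(arms) * per_arm  # e.g. ['T', 'C', 'T', 'C']
        rng.shuffle(block)            # random order within the block
        sequence.extend(block)
    return sequence


allocations = permuted_block_sequence(n_blocks=3, seed=2026)
print("".join(allocations))
```

Because each block of four contains exactly two T and two C, the groups are balanced after every completed block. The same property is why a recruiter who knows the block size can predict the last allocation in a block.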

Allocation Concealment vs Blinding

Feature Allocation Concealment Blinding
Purpose Prevent selection bias at enrolment Prevent performance/detection bias after enrolment
When BEFORE randomisation (during recruitment) AFTER randomisation (during treatment/follow-up)
Always possible? YES — always possible, even in surgery/physical therapy trials NO — some interventions cannot be blinded (surgery vs medical, behaviour change)
If broken Destroys the integrity of randomisation Less catastrophic but introduces bias
Example Opaque sealed envelopes, central telephone randomisation Identical placebo tablets, sham surgery, double-dummy technique

Blinding Levels

Level Who is blinded Purpose Applicability
Open label No one Practical when blinding impossible Surgical trials, device trials
Single blind Participant only Reduces placebo effect Drug trials with distinct taste/appearance
Double blind Participant AND investigator Gold standard — reduces both performance and detection bias Most drug trials
Triple blind Participant, investigator AND data analyst/statistician Prevents analytic bias High-quality confirmatory trials

Double-dummy technique: Used when two treatments have different appearances (e.g., pill vs injection). Each participant receives a pill AND an injection — one active, one placebo.

Analysis Populations

Analysis Definition Effect on results Best use
Intention-to-treat (ITT) Analyse ALL participants in the group they were randomised to, regardless of compliance, crossover, or withdrawal Conservative for superiority trials (dilutes effect toward null) PRIMARY analysis for superiority trials
Per-protocol (PP) Analyse only those who completed the allocated treatment as planned May OVER-estimate efficacy (only includes compliant) SECONDARY analysis; primary for non-inferiority
Modified ITT (mITT) Excludes those who never received any treatment or had no post-randomisation data Somewhere between ITT and PP Common compromise in practice
As-treated Analyse according to the treatment actually received Most BIASED — breaks randomisation Not recommended as primary analysis

MRCOG Key Point: ITT is the primary analysis for superiority trials because it preserves the benefit of randomisation (groups remain comparable); PP is secondary. ITT is conservative for superiority but anti-conservative for non-inferiority: by diluting the difference between arms, ITT can make a truly inferior treatment appear non-inferior. For this reason PP is often the primary analysis in non-inferiority trials, ideally with ITT and PP results in agreement.

Trial Types by Aim

Type Null Hypothesis Alternative Hypothesis Key Consideration
Superiority Treatment = Control Treatment ≠ Control (or Treatment > Control) Standard approach
Non-inferiority Treatment − Control ≤ −Δ (margin) Treatment − Control > −Δ Requires pre-specified non-inferiority margin (Δ); PP analysis preferred
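Equivalence |Treatment − Control| ≥ Δ |Treatment − Control| < Δ Requires pre-specified equivalence margin (±Δ); the whole confidence interval must lie between −Δ and +Δ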

Non-inferiority margin selection:

  • Should be the largest clinically acceptable difference
  • M1 = the entire effect of the active control vs placebo; M2 = the largest clinically acceptable loss of that effect, often set at half of M1 so that at least 50% of the effect is preserved
  • Example: If active control reduces mortality by 2% vs placebo (M1 = 2%), Δ might be 1%

Pragmatic vs Explanatory Trials — The PRECIS-2 Framework

Dimension Explanatory (Efficacy) Pragmatic (Effectiveness)
Question "Can it work?" (ideal conditions) "Does it work in real life?"
Eligibility Highly selected — narrow criteria Broad — represents typical patients
Recruitment Intensive campaigning Routine clinical pathways
Setting Specialist academic centres Primary care / routine hospitals
Intervention Strictly protocolised, closely monitored Flexible, as in real practice
Comparator Placebo or best alternative Usual care
Follow-up Frequent, intensive Routine visits
Outcome Surrogate or mechanism-based Clinically meaningful (patient-important)
Primary analysis ITT and PP both informative ITT primary
Adherence Monitored and encouraged Real-world compliance

Example (O&G): ASPRE trial of aspirin for pre-eclampsia prevention — highly selected (high-risk by FMF algorithm) → explanatory. A pragmatic version would include all nulliparous women.

Adaptive Trial Designs

Definition: Pre-specified plan for modifying trial features based on accumulating data, without undermining validity.

Type Description Example
Group sequential Pre-planned interim analyses with stopping rules Stop early for efficacy (if overwhelming benefit) or futility (if unlikely to show benefit)
Sample size re-estimation Blinded re-estimation of variance to adjust sample size Ensures adequate power
Adaptive randomisation Allocation ratio changes to favour better-performing arm More patients receive superior treatment
Seamless phase II/III Combine dose-finding phase with confirmatory phase Saves time and patients
Drop-the-loser Arms dropped if inferior Multi-arm multi-stage (MAMS) trials
Bayesian adaptive Continuously update posterior probability More flexible but complex

Stopping rules for group sequential designs:

Method Boundary Characteristics
Haybittle-Peto p < 0.001 at interim; p < 0.05 at final Very conservative early; easy to implement
O'Brien-Fleming Very stringent early boundary, liberal later Most common; preserves overall α well
Pocock Same critical value throughout (e.g., p < 0.022 for 3 looks) More likely to stop early
Wang-Tsiatis Family of boundaries between O'Brien-Fleming and Pocock Flexible

Sample Size Calculation — Detailed

Why calculate sample size?

  1. Ensure adequate POWER to detect clinically important effect
  2. Avoid wasting resources on underpowered studies
  3. Meet ethical obligations (patients in underpowered study may be exposed to harm without benefit)
  4. Meet regulatory requirements

Parameters needed:

Parameter Symbol Typical value How to determine
Significance level α (Type I error) 0.05 (two-sided) Convention; sometimes 0.01
Power 1 − β 0.80 or 0.90 0.80 is minimum; 0.90 preferred
Effect size δ Varies Minimum clinically important difference (MCID)
Standard deviation σ From pilot data/literature Variability in outcome measure
Allocation ratio r = n₁/n₂ 1:1 is most efficient Unequal allocation needs larger n

Sample size increases when:

  • Smaller effect size (harder to detect)
  • Lower α (more stringent significance level)
  • Higher power (i.e., lower β)
  • Larger variance (more noise)
  • Unequal group sizes (deviates from 1:1)
  • More comparisons (multiple endpoints or subgroups)
  • Clustering (ICC reduces effective sample size)

Design Effect (for cluster RCTs): DE = 1 + (m − 1) × ICC

  • m = average cluster size
  • ICC = intra-cluster correlation coefficient (typically 0.01–0.05 in O&G)
  • Effective sample size = Actual sample size / DE
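To see how quickly clustering erodes the sample, here is a brief sketch of the design effect formula. The cluster size (50), ICC (0.02) and trial size (2,000) are invented for illustration, and the function names are hypothetical.

```python
def design_effect(mean_cluster_size, icc):
    """DE = 1 + (m - 1) x ICC."""
    return 1 + (mean_cluster_size - 1) * icc


def effective_sample_size(actual_n, mean_cluster_size, icc):
    """Number of independent observations the clustered sample is worth."""
    return actual_n / design_effect(mean_cluster_size, icc)


# Illustrative values only: 40 hospitals x 50 women, ICC = 0.02
de = design_effect(mean_cluster_size=50, icc=0.02)
n_eff = effective_sample_size(actual_n=2000, mean_cluster_size=50, icc=0.02)
print(f"Design effect: {de:.2f}")
print(f"Effective sample size: {n_eff:.0f} of 2000 recruited")
```

Even a small ICC nearly halves the effective sample here, because the design effect grows with cluster size. In practice the individually randomised sample size is multiplied by DE to obtain the cluster-trial requirement.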

Worked Example: To detect a difference in mean birth weight of 100 g (SD = 400 g) between smokers and non-smokers, with α = 0.05, power = 0.80, using a two-sided test:

n = [(z_{α/2} + z_β)² × 2σ²] / δ²
n = [(1.96 + 0.84)² × 2 × 400²] / 100²
n = [7.84 × 320,000] / 10,000 = 2,508,800 / 10,000 ≈ 251 per group

So ~502 women needed total.
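The same calculation can be sketched with unrounded z-values from the standard library's `statistics.NormalDist`. The function name `n_per_group_two_means` is illustrative. Because the exact z-values are slightly larger than 1.96 and 0.84, the unrounded answer is about 251.2, which rounds up to 252 per group rather than the hand-calculated 251.

```python
import math
from statistics import NormalDist


def n_per_group_two_means(delta, sd, alpha=0.05, power=0.80):
    """Sample size per group for comparing two means (two-sided test, 1:1 allocation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for power = 0.80
    return ((z_alpha + z_beta) ** 2 * 2 * sd ** 2) / delta ** 2


n_raw = n_per_group_two_means(delta=100, sd=400)
n_group = math.ceil(n_raw)  # always round UP to a whole participant
print(f"Unrounded: {n_raw:.1f} per group; recruit {n_group} per group ({2 * n_group} total)")

# Raising power to 0.90 increases the requirement
n_90 = math.ceil(n_per_group_two_means(delta=100, sd=400, power=0.90))
print(f"At 90% power: {n_90} per group")
```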

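For binary outcomes (e.g., preterm birth rate): the variance term 2σ² is replaced by p₁(1 − p₁) + p₂(1 − p₂), giving n = (z_{α/2} + z_β)² × [p₁(1 − p₁) + p₂(1 − p₂)] / (p₁ − p₂)² per group, where p₁ and p₂ are the expected proportions in each arm.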

Interim Analyses & Data Monitoring

  • Data Monitoring Committee (DMC/DSMB): Independent group of experts with access to unblinded data
  • Responsibilities: Recommend stopping for efficacy (overwhelming benefit), harm (safety concerns), or futility (no realistic chance of benefit)
  • Members: Clinicians, statisticians, sometimes ethicists
  • Must be independent of trial investigators and sponsor
  • Stopping for futility: Uses conditional power — probability of reaching significant result at final analysis given current data

1.6 Ecological Studies

Design: Groups (populations) as unit of observation, not individuals

Examples: Comparing caesarean section rates across countries; correlating sunlight exposure and pre-eclampsia rates by region

Ecological fallacy (Robinson, 1950): Associations at population level may NOT hold at individual level. Classic example: US states with a higher proportion of immigrants had higher literacy rates, yet at individual level immigrants were less literate than the native-born. The state-level pattern arose because immigrants tended to settle in states where the native-born population was already more literate.


2. Screening

2.1 The Wilson-Jungner Criteria (1968) — Detailed

The 10 classic criteria proposed by Wilson and Jungner for the WHO. Every MRCOG candidate must know these:

  1. The condition should be an important health problem
    • Burden of disease measured by incidence, prevalence, morbidity, mortality, economic cost
    • In O&G: Down's syndrome (lifetime cost ~£500k); cervical cancer (~850 deaths/year UK); GDM (affects ~5% pregnancies)

  2. There should be an accepted treatment for patients with recognised disease
    • If no effective treatment exists, screening may cause harm without benefit
    • Exception: conditions where knowing diagnosis allows reproductive choice (Down's syndrome, anencephaly)
    • Example of problematic screening: some rare genetic conditions with no treatment

  3. Facilities for diagnosis and treatment should be available
    • If screening identifies positives but diagnostic capacity is insufficient → anxiety and harm
    • UK has detailed pathways: screen positive → referral to fetal medicine unit or colposcopy within 2 weeks

  4. There should be a recognisable latent or early symptomatic stage
    • Diseases with long preclinical phase are good screening targets
    • Cervical cancer: HPV infection → CIN I/II/III → invasive cancer (10+ year window)
    • Ovarian cancer: NO good latent stage → screening has failed in trials (UKCTOCS, PLCO)

  5. There should be a suitable test or examination
    • Test must be acceptable, accurate (high sensitivity/specificity), and feasible at population scale
    • Combined test for Down's: NT ultrasound (~20 min), blood test → acceptable but requires skilled sonographers

  6. The natural history of the condition, including development from latent to declared disease, should be adequately understood
    • Without knowing natural history, we cannot predict who will progress
    • CIN: Most low-grade lesions regress; only high-grade progress (essential knowledge for appropriate management)

  7. There should be an agreed policy on whom to treat as patients
    • Clear thresholds for intervention needed
    • GDM: IADPSG criteria (universal, one-step) vs NICE criteria (risk-factor-based) produce different prevalence
    • HPV vaccine policy: age 12–13 girls (and boys from 2019) in UK

  8. The total cost of finding a case should be economically balanced in relation to medical expenditure as a whole
    • Cost per QALY gained; NICE threshold ~£20,000–30,000/QALY
    • NIPT: ~£500/test; combined test: ~£80; NICE considered cost-effectiveness

  9. Case-finding should be a continuing process and not a "once and for all" project
    • Screening must be repeated at appropriate intervals
    • Cervical screening: 3-yearly (25–49), 5-yearly (50–64)
    • Antenatal screening: per pregnancy (not lifetime)

  10. The test should be acceptable to the population
    • Low uptake → programme ineffective
    • Cervical screening uptake: ~70% UK (below 80% target)
    • Antenatal HIV screening: >99% uptake (well accepted as routine)

2.2 Test Performance Characteristics — Complete Details

The 2×2 Table

Disease + (Gold Standard) Disease − (Gold Standard) Total
Test + True positive (TP) False positive (FP) TP + FP
Test − False negative (FN) True negative (TN) FN + TN
Total TP + FN FP + TN N

Disease prevalence = (TP + FN) / N — this is the pre-test probability if the screening population mirrors the study population

Key Measures — Expanded with Clinical Interpretation

Measure Formula What it tells us Clinical use
Sensitivity (Sn) TP / (TP + FN) Of those WITH disease, how many test positive? SnNOut: High Sn → negative test rules OUT disease
Specificity (Sp) TN / (TN + FP) Of those WITHOUT disease, how many test negative? SpPIn: High Sp → positive test rules IN disease
Positive Predictive Value (PPV) TP / (TP + FP) Of those who test positive, how many actually HAVE disease? Counselling patient with positive result
Negative Predictive Value (NPV) TN / (TN + FN) Of those who test negative, how many actually are FREE of disease? Counselling patient with negative result
Accuracy (TP + TN) / N Proportion correctly classified Overall measure but misleading when prevalence low
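The formulae in the table translate directly into code. The sketch below uses invented counts (TP = 90, FP = 50, FN = 10, TN = 850) purely for illustration; `diagnostic_metrics` is a hypothetical helper name.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Test performance measures from the four cells of a 2x2 table."""
    n = tp + fp + fn + tn
    return {
        "prevalence": (tp + fn) / n,
        "sensitivity": tp / (tp + fn),  # of those WITH disease, proportion testing positive
        "specificity": tn / (tn + fp),  # of those WITHOUT disease, proportion testing negative
        "ppv": tp / (tp + fp),          # of test positives, proportion with disease
        "npv": tn / (tn + fn),          # of test negatives, proportion free of disease
        "accuracy": (tp + tn) / n,
    }


# Invented illustrative counts
metrics = diagnostic_metrics(tp=90, fp=50, fn=10, tn=850)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

Sensitivity and specificity are calculated down the columns (fixed disease status), whereas PPV and NPV are calculated across the rows (fixed test result), which is why only the latter pair change with prevalence.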

Prevalence Effect on PPV — Expanded

The single most important concept in screening for MRCOG. PPV depends on prevalence, and thus screening works well in high-prevalence populations but poorly in low-prevalence populations.

Worked Example: Test with Sn = 99%, Sp = 99%

Scenario A: High prevalence (50%) — e.g., symptomatic women referred to clinic

Disease + Disease − Total
Test + 495 (TP) 5 (FP) 500
Test − 5 (FN) 495 (TN) 500
Total 500 500 1000

PPV = 495/500 = 99% (a positive test is very reliable)
NPV = 495/500 = 99%

Scenario B: Low prevalence (1%) — general population screening

Disease + Disease − Total
Test + 99 (TP) 99 (FP) 198
Test − 1 (FN) 9,801 (TN) 9,802
Total 100 9,900 10,000

PPV = 99/198 = 50% (half of positives are false!)
NPV = 9801/9802 = 99.99%

Scenario C: Very low prevalence (0.1%) — rare disease screening

Disease + Disease − Total
Test + 9.9 (TP) 99.9 (FP) 109.8
Test − 0.1 (FN) 9,890.1 (TN) 9,890.2
Total 10 9,990 10,000

PPV = 9.9/109.8 = 9% (91% of positives are false!)
NPV = 9890.1/9890.2 = ~100%

MRCOG Take-home: Even an "excellent" test (99% Sn, 99% Sp) has PPV of only 50% when prevalence is 1%, and only 9% when prevalence is 0.1%. This is why screening for very rare conditions is problematic.

Clinical Example: NIPT for Down's Syndrome

  • Sn = 99.5%, Sp = 99.9%
  • Prevalence at term = 1/800 (0.125%)

PPV = (0.995 × 0.00125) / [(0.995 × 0.00125) + (0.001 × 0.99875)]
PPV = 0.00124 / (0.00124 + 0.000999) = 0.00124 / 0.00224 = 0.554 = 55%

So even NIPT, the best screening test, has PPV ~55% for Down's syndrome in a low-risk population. A positive NIPT still requires confirmatory invasive testing (CVS or amniocentesis).

For high-risk population (e.g., women aged 40 with combined test risk 1:10): Prevalence ~10%

PPV = (0.995 × 0.10) / [(0.995 × 0.10) + (0.001 × 0.90)] = 0.0995 / (0.0995 + 0.0009) = 99.1%
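The dependence of PPV on prevalence is easy to explore numerically. The following sketch applies Bayes' theorem to the sensitivity, specificity and prevalence figures used in this section; `predictive_values` is an illustrative function name.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied at a given prevalence."""
    tp = sensitivity * prevalence
    fn = (1 - sensitivity) * prevalence
    fp = (1 - specificity) * (1 - prevalence)
    tn = specificity * (1 - prevalence)
    return tp / (tp + fp), tn / (tn + fn)


# A 99% sensitive, 99% specific test at falling prevalence
for prevalence in (0.50, 0.01, 0.001):
    ppv, npv = predictive_values(0.99, 0.99, prevalence)
    print(f"Prevalence {prevalence:.1%}: PPV {ppv:.1%}, NPV {npv:.2%}")

# NIPT (Sn 99.5%, Sp 99.9%) in low-risk and high-risk populations
ppv_low_risk, _ = predictive_values(0.995, 0.999, 1 / 800)
ppv_high_risk, _ = predictive_values(0.995, 0.999, 0.10)
print(f"NIPT PPV at 1 in 800: {ppv_low_risk:.0%}")
print(f"NIPT PPV at 1 in 10: {ppv_high_risk:.1%}")
```

Holding the test constant and changing only the prevalence argument shows that the fall in PPV is a property of the population being screened, not of the test.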

2.3 Likelihood Ratios — Complete Guide

LR+ = Sensitivity / (1 − Specificity)

  • Tells you how much more likely a positive test is in someone WITH the disease vs WITHOUT
  • Range: 1 to ∞ (for an informative test)
  • Higher = better (more diagnostic information)

LR− = (1 − Sensitivity) / Specificity

  • Tells you how much less likely a negative test is in someone WITH the disease vs WITHOUT
  • Range: 0 to 1 (for an informative test)
  • Lower = better (closer to 0)

LR Value Impact on Post-test Probability
LR+ > 10 Large, often conclusive increase
LR+ 5–10 Moderate increase
LR+ 2–5 Small increase
LR+ 1–2 Minimal increase
LR+ = 1 No diagnostic value
LR− < 0.1 Large, often conclusive decrease
LR− 0.1–0.2 Moderate decrease
LR− 0.2–0.5 Small decrease
LR− 0.5–1.0 Minimal decrease

Using LRs in Clinical Practice (Bayes' Theorem):

Step 1: Convert pre-test probability to pre-test odds

  • Odds = probability / (1 − probability)
  • Example: Pre-test probability of Down's = 1/250 = 0.004
  • Pre-test odds = 0.004 / 0.996 = 0.004

Step 2: Multiply by LR to get post-test odds

  • Post-test odds = Pre-test odds × LR
  • If combined test positive (LR+ = 8): Post-test odds = 0.004 × 8 = 0.032

Step 3: Convert back to probability

  • Post-test probability = odds / (1 + odds) = 0.032 / 1.032 = 0.031 = 3.1% (or about 1 in 32)

Fagan nomogram: A graphical tool that does this conversion for you. Draw a line from pre-test probability through the LR to read post-test probability directly.
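The three-step odds conversion that the nomogram performs graphically can be written out as a short sketch. The function names are illustrative; the inputs are the pre-test probability of 1 in 250 and LR+ of 8 from the worked steps.

```python
def lr_positive(sensitivity, specificity):
    """LR+ = Sn / (1 - Sp)."""
    return sensitivity / (1 - specificity)


def lr_negative(sensitivity, specificity):
    """LR- = (1 - Sn) / Sp."""
    return (1 - sensitivity) / specificity


def post_test_probability(pre_test_probability, likelihood_ratio):
    """Probability -> odds, multiply by the LR, then odds -> probability."""
    pre_test_odds = pre_test_probability / (1 - pre_test_probability)
    post_test_odds = pre_test_odds * likelihood_ratio
    return post_test_odds / (1 + post_test_odds)


p_post = post_test_probability(pre_test_probability=1 / 250, likelihood_ratio=8)
print(f"Post-test probability: {p_post:.1%} (about 1 in {round(1 / p_post)})")

# LR+ for a test with Sn 99.5% and Sp 99.9%
print(f"LR+ = {lr_positive(0.995, 0.999):.0f}")
```

At low pre-test probabilities, odds and probability are almost identical, so multiplying the probability directly by the LR gives a quick approximation in the exam. The shortcut breaks down once the pre-test probability exceeds roughly 10%.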

2.4 ROC Curves — Detailed

Receiver Operating Characteristic curve:

  • X-axis: 1 − Specificity (false positive rate)
  • Y-axis: Sensitivity (true positive rate)
  • Each point = test at different threshold/cut-off

AUC (Area Under the Curve):

AUC Interpretation
0.5 No better than chance (diagonal line)
0.6–0.7 Poor
0.7–0.8 Moderate (acceptable)
0.8–0.9 Good (excellent for many applications)
0.9–1.0 Excellent

Choosing the optimal cut-off:

  • Youden index: J = Sensitivity + Specificity − 1
    • Maximised at optimal threshold
    • Gives equal weight to Sn and Sp
  • Clinical weighting: If FN more harmful than FP → choose lower threshold (higher Sn, lower Sp)
    • Example: Screening for anencephaly, where a missed case is catastrophic → high sensitivity prioritised
  • Economic weighting: Cost of FP (anxiety, further tests) vs FN (missed case)

Worked O&G Example: cffDNA for Down's syndrome

  • AUC > 0.99 (excellent)
  • At standard cut-off (z-score > 3): Sn = 99.5%, Sp = 99.9%
  • Can trade off: at z-score > 2: Sn > 99.9%, Sp = 99.0% (more FPs but fewer missed cases)
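As an illustration of threshold selection, the sketch below computes the Youden index for the two cffDNA cut-offs quoted above. One assumption is needed: the text gives sensitivity at z > 2 only as "> 99.9%", so 0.999 is used here as a stand-in value.

```python
def youden_index(sensitivity, specificity):
    """J = Sn + Sp - 1; ranges from 0 (useless test) to 1 (perfect test)."""
    return sensitivity + specificity - 1


# (sensitivity, specificity) at each cut-off.
# Sensitivity of 0.999 at z > 2 is an assumed stand-in for "> 99.9%".
cutoffs = {
    "z > 3": (0.995, 0.999),
    "z > 2": (0.999, 0.990),
}

j_by_cutoff = {label: youden_index(sn, sp) for label, (sn, sp) in cutoffs.items()}
best_cutoff = max(j_by_cutoff, key=j_by_cutoff.get)

for label, j in j_by_cutoff.items():
    print(f"{label}: J = {j:.3f}")
print(f"Cut-off maximising J: {best_cutoff}")
```

The Youden index favours z > 3 because it weights a false negative and a false positive equally. A programme that regards a missed case as far worse than a false alarm might still prefer the lower cut-off despite its smaller J.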

2.5 Screening in O&G — Complete Clinical Details

Antenatal Screening Programme (UK)

Condition Screening Test Timing Sensitivity Specificity Notes
Down's syndrome (T21) Combined test (NT + PAPP-A + β-hCG) 11–14 wks ~85% at 5% FPR 95% NICE recommendation
Down's syndrome (T21) Quadruple test (AFP + hCG + uE3 + Inhibin A) 14–20 wks ~80% at 5% FPR 95% When late booking or NT not available
Down's syndrome NIPT (cfDNA) From 10 wks >99% >99% Contingent screening in NHS (if combined risk ≥ 1:150)
Edwards' syndrome (T18) Combined test + NIPT 11–14 wks ~90% 99.9% Low PAPP-A and hCG
Patau's syndrome (T13) Combined test + NIPT 11–14 wks ~85% 99.9% Low PAPP-A and hCG
Neural tube defects AFP + anomaly scan 18–20 wks ~90% (anencephaly) >99% Anomaly scan is gold standard

Fetal Anomaly Screening Programme (FASP) — UK

The 11 conditions screened for at the 18–20 week anomaly scan:

  1. Anencephaly — absence of cranial vault; uniformly lethal
  2. Open spina bifida — neural tube defect; severity varies
  3. Cleft lip — with or without cleft palate
  4. Diaphragmatic hernia — herniation of abdominal contents into chest
  5. Gastroschisis — abdominal wall defect (right of umbilical cord)
  6. Exomphalos (omphalocele) — abdominal wall defect (midline, membrane-covered)
  7. Serious cardiac anomalies — four-chamber view + outflow tracts (detects ~50% of major CHD)
  8. Bilateral renal agenesis — absence of both kidneys → anhydramnios → pulmonary hypoplasia
  9. Lethal skeletal dysplasia — severe short limbs, narrow thorax
  10. Edwards' syndrome (T18) — structural anomalies + growth restriction
  11. Patau's syndrome (T13) — structural anomalies + holoprosencephaly

Detection rates for anomaly scan: - Anencephaly: ~98% - Open spina bifida: ~90% - Cleft lip: ~75% - Diaphragmatic hernia: ~60% - Gastroschisis: ~90% - Major cardiac anomalies: ~50% - Bilateral renal agenesis: ~85%

Gestational Diabetes Mellitus (GDM) Screening

Approach Method Criteria Prevalence detected
Universal (IADPSG/WHO) One-step: 75g OGTT at 24–28 wks Fasting ≥5.1, 1h ≥10.0, 2h ≥8.5 mmol/L ~15–20%
Selective (NICE) Risk-factor based: 75g OGTT at 24–28 wks Fasting ≥5.6, 2h ≥7.8 mmol/L ~5%
Two-step (ACOG) 50g GCT → if ≥7.8 → 100g OGTT (Carpenter-Coustan) Two values elevated ~6–8%

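NICE risk factors for GDM (2015): - BMI > 30 kg/m² - Previous GDM - Family history (first-degree relative with diabetes) - Ethnicity with a high prevalence of diabetes (e.g., South Asian, Black Caribbean, Middle Eastern) - Previous macrosomic baby (≥4.5 kg) - Note: polycystic ovary syndrome is a recognised risk factor for GDM but is not one of the five NICE criteria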

Cervical Screening (NHS Cervical Screening Programme)

Aspect Details
Age range 25–64 years
Frequency 3-yearly (25–49); 5-yearly (50–64)
Primary test HPV test (since 2019)
Reflex cytology If HPV positive → cytology on same sample
Colposcopy referral HPV positive + abnormal cytology (≥ borderline)
HPV 16/18 genotyping If HPV positive with normal cytology → genotyping; 16/18+ → colposcopy; other HR-HPV → repeat in 12 months
Upper age 64 (if last 2 screens negative, no further screening)
Uptake ~70% (below 80% target)
Approach Call-recall system via GP registration

Group B Streptococcus (GBS) Screening

Aspect UK Practice US Practice
Approach Risk-factor based Universal screening
Timing At labour onset (risk-based) 35–37 weeks
Test Not routine Vaginal-rectal swab (enriched culture)
Risk factors Previous GBS baby, GBS bacteriuria in pregnancy, preterm labour, prolonged ROM (>18h), intrapartum fever ≥38°C None (universal screening)

Other Antenatal Screening

Test Timing Condition
HIV Booking (and 28 wks if high risk) Vertical transmission rate <1% with treatment
HBsAg Booking Hepatitis B — immunoprophylaxis reduces vertical transmission
Syphilis (TPPA/VDRL) Booking Congenital syphilis preventable
Rubella IgG Booking Susceptibility detected → post-partum vaccination
Sickle cell and thalassaemia Booking Family origin questionnaire + Hb HPLC
Asymptomatic bacteriuria Booking Urine culture (MSU)
Anaemia Booking + 28 wks FBC

2.6 Screening Biases — Expanded

Bias Mechanism Example
Lead time bias Screening advances time of diagnosis but does NOT delay death. Survival appears longer because the clock starts earlier, even if death occurs at the same time. Screening for ovarian cancer: if diagnosis moved from age 62 to age 60 but death at age 65 in both, apparent "survival" increases from 3 to 5 years — no real benefit.
Length time bias Screening preferentially detects slower-growing (less aggressive) disease because it stays in the detectable preclinical phase longer. Fast-growing aggressive disease is more likely to present symptomatically between screens. Cervical screening: Screen-detected CIN tends to be slower-progressing. Rapidly progressive cancers may present as interval cancers between screens.
Overdiagnosis Detection of disease that would NEVER have caused symptoms or death. The patient is "harmed" by unnecessary diagnosis and treatment. Screening for neuroblastoma in infants (abandoned due to overdiagnosis); overdiagnosis in thyroid and breast cancer screening is well-documented.
Selection bias (volunteer bias) People who participate in screening are systematically different from those who don't — typically healthier, more health-conscious, higher SES. Women attending for cervical screening have lower cervical cancer risk regardless of screening (healthy behaviours).
Recall rate Proportion of screened population recalled for further investigations. Must balance: high recall → more detected cases but more anxiety and cost; low recall → missed cases. Combined test recall rate: ~5% (a positive screening result). Of those, ~5% have Down's syndrome (PPV ~5% in low-risk population).
False positive rate Proportion of normal pregnancies incorrectly labelled as high-risk. Causes anxiety, unnecessary invasive tests (with miscarriage risk ~0.5–1%), and increased healthcare costs. Combined test FPR = 5%. For every 100,000 women screened, ~5,000 will be screen-positive; ~4,750 will be false positives.
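The link between false positive rate, prevalence, and PPV can be illustrated with a short calculation. The prevalence used here (0.3%, roughly 1 in 330) is an assumed figure chosen for illustration and is not stated in the table; the 85% detection rate and 5% FPR are the combined test figures quoted earlier.

```python
def screening_counts(n: int, prevalence: float, sensitivity: float,
                     false_positive_rate: float) -> dict:
    """Expected screening outcomes in a population of size n."""
    affected = n * prevalence
    unaffected = n - affected
    true_pos = affected * sensitivity
    false_pos = unaffected * false_positive_rate
    return {
        "true_pos": true_pos,
        "false_pos": false_pos,
        "screen_pos": true_pos + false_pos,
        "ppv": true_pos / (true_pos + false_pos),
    }


# Assumed prevalence of 0.3%; sensitivity 85%; FPR 5%
result = screening_counts(100_000, 0.003, 0.85, 0.05)
print(f"Screen-positive: {result['screen_pos']:.0f}")
print(f"False positives: {result['false_pos']:.0f}")
print(f"PPV: {result['ppv']:.1%}")
```

Under these assumptions the PPV comes out at about 5%, so roughly 19 of every 20 screen-positive women are unaffected. Rerunning with a higher prevalence shows how strongly PPV depends on the population screened.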

Screening vs Diagnostic Accuracy

Aspect Screening Test Diagnostic Test
Population Asymptomatic, low prevalence Symptomatic, high pre-test probability
Purpose Identify those who need diagnostic testing Confirm or exclude diagnosis
Test characteristics High sensitivity (minimise FNs) High specificity (minimise FPs)
Acceptability Must be acceptable to healthy people Acceptability less critical
Cost Must be cheap Can be more expensive
PPV Often low (due to low prevalence) Higher (due to pre-test probability)
Example Combined test for Down's (screening) CVS/amniocentesis for karyotype (diagnostic)

3. Descriptive Statistics

3.1 Types of Data — Complete Classification

                           ┌─────────────┐
                           │    Data     │
                           └──────┬──────┘
                                  │
                    ┌─────────────┴─────────────┐
               ┌────┴────┐               ┌────┴────┐
               │Categorical│              │Numerical │
               └────┬────┘               └────┬────┘
                    │                         │
        ┌───────────┼───────────┐    ┌────────┴────────┐
        │Nominal    │ Ordinal   │    │Discrete   │Continuous│
        └───────────┴───────────┘    └────────┴──────────┘
Data Type Description Examples in O&G Permissible Statistics
Nominal Unordered categories Blood group (A, B, AB, O), ethnicity, parity type (nulliparous/multiparous), mode of delivery (SVD, instrumental, CS) Mode, frequency, χ², Fisher's exact
Ordinal Ordered categories FIGO stage (I–IV), pain score (0–10), Bishop score, Apgar score, severity of incontinence (mild/moderate/severe) Median, IQR, Mann-Whitney, Wilcoxon, percentiles
Discrete Integer values (countable) Parity (0, 1, 2...), number of miscarriages, gravidity, number of previous CS Mean (if normally distributed), median (if skewed)
Continuous Any value on a continuum Birth weight, gestational age, BMI, blood pressure, Hb, cervical length Mean, SD, t-test, ANOVA, regression

Special case — Binary/Dichotomous: Nominal with exactly 2 categories - Alive/dead, pregnant/not, term/preterm - Can use: proportions, OR, RR, logistic regression

Hierarchy of data: As you go from nominal → ordinal → interval → ratio, you gain more mathematical properties and more statistical options.

3.2 Measures of Central Tendency — Complete Guide

Measure Definition Formula When to use
Mean (arithmetic) Sum of all values divided by number of values x̄ = Σxᵢ / n Normally distributed, interval/ratio data
Median Middle value when data ordered from smallest to largest Value at position (n+1)/2 Skewed data, ordinal data, presence of outliers
Mode Most frequently occurring value Value with highest frequency Nominal data, bimodal distributions

Mean

Advantages: Uses all data points; mathematically tractable (basis for many statistical tests) Disadvantages: Affected by outliers and skewness

Example: Birth weights (kg) of 5 babies: 2.5, 3.0, 3.2, 3.5, 4.8 - Mean = (2.5 + 3.0 + 3.2 + 3.5 + 4.8) / 5 = 17.0 / 5 = 3.4 kg - Median = 3.2 kg (3rd value of 5) - Mode: no repeated values → no mode

If the 4.8 kg outlier was actually 10.0 kg (error): - Mean = (2.5 + 3.0 + 3.2 + 3.5 + 10.0) / 5 = 4.44 kg (dramatically changed!) - Median = 3.2 kg (unchanged!)
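The effect of a single outlier on the mean and median can be reproduced with Python's standard `statistics` module. This is a small sketch using the five birth weights from the example.

```python
from statistics import mean, median

weights = [2.5, 3.0, 3.2, 3.5, 4.8]      # kg, as in the example
with_error = [2.5, 3.0, 3.2, 3.5, 10.0]  # 4.8 kg mis-recorded as 10.0 kg

print(f"Original:   mean = {mean(weights):.2f}, median = {median(weights):.1f}")
print(f"With error: mean = {mean(with_error):.2f}, median = {median(with_error):.1f}")
```

The mean moves from 3.40 to 4.44 kg while the median stays at 3.2 kg, which is why the median is preferred for skewed data or data with outliers.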

Median

Advantages: Robust to outliers and skewness; appropriate for ordinal data Disadvantages: Does not use all data; less mathematically tractable

Calculation: - If n is odd: middle value (e.g., n=5 → 3rd value) - If n is even: average of two middle values (e.g., n=6 → average of 3rd and 4th values)

Mode

Advantages: Only measure for nominal data; can identify bimodal distributions Disadvantages: May not exist (no repeated values); may not be unique

Bimodal distribution example: Birth weight in preterm vs term babies will show two peaks.

Skewness — Visualising the Distribution

              Normal              Positive Skew           Negative Skew
                                                                   ╱
              ╱╲                 ╱╲╲                          ╱╱╲
             ╱  ╲               ╱  ╲╲                       ╱  ╲╲
            ╱    ╲             ╱    ╲╲                     ╱    ╲╲
           ╱      ╲           ╱      ╲╲                   ╱      ╲╲
Mean=Median=Mode      Mean > Median > Mode        Mean < Median < Mode
Skew Direction Relationship Example in O&G
Positive (right) skew Long tail to the right Mean > Median > Mode Length of hospital stay after CS, parity in general population, time to conceive
Negative (left) skew Long tail to the left Mean < Median < Mode Age at menopause (most women 48–52, few <40 or >55)
No skew (symmetrical) Bell-shaped Mean = Median = Mode Normally distributed: birth weight in term infants, height

Skewness coefficient = 0 for normal distribution; >0 for positive skew; <0 for negative skew.

Kurtosis: Measures "peakedness" of distribution - Leptokurtic: Tall peak, heavy tails (more outliers) - Platykurtic: Flat peak, thin tails - Mesokurtic: Normal distribution (kurtosis = 3 for normal; excess kurtosis = 0)

3.3 Measures of Dispersion — Complete Guide

Measure Formula/Definition Robust to outliers? When to use
Range Max − Min NO Quick summary only
Interquartile Range (IQR) Q3 − Q1 (75th − 25th percentile) YES With median; skewed data
Variance (σ²) Σ(xᵢ − μ)² / n NO Intermediate calculation for SD
Sample variance (s²) Σ(xᵢ − x̄)² / (n−1) NO Unbiased estimate from sample
Standard deviation (SD) √Variance NO Mean ± SD for normal data
Coefficient of variation (CV) (SD / Mean) × 100% NO Comparing variability across different scales
Standard error of mean (SEM) SD / √n NO Precision of sample mean estimate

Range

  • Simplest measure
  • Highly sensitive to outliers
  • Tends to underestimate the population range in small samples, because extreme values are unlikely to be sampled

Interquartile Range (IQR)

  • Contains middle 50% of data
  • Q1 = 25th percentile, Q3 = 75th percentile
  • Used with median for skewed data
  • Box plot whiskers typically extend to 1.5 × IQR beyond Q1 and Q3

Variance and Standard Deviation

Variance = average squared deviation from the mean - Population variance: σ² = Σ(xᵢ − μ)² / N - Sample variance: s² = Σ(xᵢ − x̄)² / (n−1)

Why n−1? Bessel's correction — using n−1 gives an unbiased estimate of population variance from a sample.

Standard deviation = √variance - In the SAME units as original data (unlike variance) - For normally distributed data: 68% within mean ± 1 SD, 95% within ± 1.96 SD

Worked Example: Cervical length measurements (mm): 25, 30, 32, 35, 38

Step Calculation Result
Mean (25+30+32+35+38)/5 32
Deviations −7, −2, 0, +3, +6
Squared deviations 49, 4, 0, 9, 36
Sum of squares 49+4+0+9+36 98
Variance (sample) 98/(5−1) 24.5 mm²
SD √24.5 4.95 mm
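The same calculation can be checked with the standard library, which also makes the n versus n−1 distinction concrete: `variance` and `stdev` use the n−1 (sample) denominator, while `pvariance` uses n (population).

```python
from statistics import mean, pvariance, stdev, variance

cervical_lengths = [25, 30, 32, 35, 38]  # mm

print("Mean:", mean(cervical_lengths))                       # 32
print("Sample variance (n-1):", variance(cervical_lengths))  # 24.5
print(f"Sample SD: {stdev(cervical_lengths):.2f}")           # 4.95
print("Population variance (n):", pvariance(cervical_lengths))
```

Dividing the same sum of squares (98) by n = 5 instead of n−1 = 4 gives a smaller value, which is the downward bias that Bessel's correction removes.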

Coefficient of Variation (CV)

  • CV = (SD / Mean) × 100%
  • Allows comparison of variability across different scales or units
  • Example: Birth weight SD = 400g, mean = 3400g → CV = 11.8%
  • Another population: SD = 300g, mean = 2800g → CV = 10.7%
  • The first population has higher absolute variability but similar relative variability

Standard Error of the Mean (SEM)

CRITICAL DISTINCTION for MRCOG: SD vs SEM

SD SEM
What it describes Variability of INDIVIDUAL observations Precision of the SAMPLE MEAN estimate
Formula SD = √(Σ(x−x̄)²/(n−1)) SEM = SD / √n
Effect of n Stable (doesn't systematically change with n) DECREASES as n increases (more data = more precise mean)
Interpretation ~95% of individuals fall within x̄ ± 2SD ~95% CI for the mean = x̄ ± 2×SEM
Use Describing population spread Inferential statistics, CI for mean

Example: Birth weight study, n = 1000, mean = 3400g, SD = 400g - SEM = 400 / √1000 = 400 / 31.6 = 12.7 g - 95% CI for mean = 3400 ± 1.96 × 12.7 = 3400 ± 24.9 = (3375, 3425) - Interpretation: We are 95% confident the true population mean is between 3375g and 3425g

Note: 95% of INDIVIDUAL birth weights are in the range 3400 ± 800g (= mean ± 2SD), NOT the 95% CI of the mean.
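The two ranges in this example (one for the mean, one for individuals) can be computed side by side. This is a minimal sketch; the function names are illustrative, and intermediate values are not rounded, so the SEM may differ in the last decimal place from a hand calculation.

```python
from math import sqrt


def sem(sd: float, n: int) -> float:
    """Standard error of the mean = SD / sqrt(n)."""
    return sd / sqrt(n)


def ci95_mean(mean: float, sd: float, n: int) -> tuple:
    """95% CI for the mean, using the normal approximation (1.96 x SEM)."""
    half_width = 1.96 * sem(sd, n)
    return mean - half_width, mean + half_width


mean_bw, sd_bw, n = 3400, 400, 1000

lower, upper = ci95_mean(mean_bw, sd_bw, n)
print(f"SEM = {sem(sd_bw, n):.2f} g")
print(f"95% CI for the mean: {lower:.0f} to {upper:.0f} g")

# Range covering about 95% of INDIVIDUAL babies (mean ± 2 SD)
print(f"Individuals: {mean_bw - 2 * sd_bw} to {mean_bw + 2 * sd_bw} g")
```

Increasing `n` narrows the CI for the mean but leaves the individual range unchanged, which is the core of the SD versus SEM distinction.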

3.4 Normal Distribution — Complete Details

Properties of the Normal (Gaussian) Distribution:

  1. Symmetrical about the mean
  2. Mean = Median = Mode
  3. Bell-shaped with tails approaching but never reaching zero
  4. Defined by two parameters: μ (mean) and σ (SD)
  5. Area under curve = 1 (probability)

The 68-95-99.7 Rule

Range Proportion included Commonly known as
μ ± 1σ 68.27% 68%
μ ± 1.645σ 90% 90% reference range (5th to 95th percentile)
μ ± 1.96σ 95.00% 95% reference range
μ ± 2σ 95.45% Approximate 95%
μ ± 2.58σ 99.00% 99% reference range
μ ± 3σ 99.73% 99.7%

Standard Normal Distribution

  • Z = (x − μ) / σ
  • Mean = 0, SD = 1
  • Z-table gives the probability of values less than a given Z-score
  • Critical values for hypothesis testing:
  • z₀.₀₂₅ = 1.96 (two-tailed test, α = 0.05)
  • z₀.₀₅ = 1.645 (one-tailed test, α = 0.05)
  • z₀.₀₀₅ = 2.58 (two-tailed test, α = 0.01)

Worked example: What proportion of term babies weigh <2500g if mean = 3400g, SD = 400g? - Z = (2500 − 3400) / 400 = −900/400 = −2.25 - P(Z < −2.25) = 0.0122 (from Z-table) - → 1.22% of term babies weigh <2500g
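In place of a printed Z-table, the same lookup can be done with `statistics.NormalDist` from the Python standard library (available from Python 3.8). The sketch below repeats the birth weight example and recovers the common critical values.

```python
from statistics import NormalDist

# Birth weight model from the worked example: mean 3400 g, SD 400 g
birth_weight = NormalDist(mu=3400, sigma=400)

z = (2500 - 3400) / 400
p_below = birth_weight.cdf(2500)
print(f"Z = {z:.2f}, P(weight < 2500 g) = {p_below:.4f}")  # Z = -2.25, P = 0.0122

# Critical values from the standard normal distribution
standard = NormalDist()
print(f"Two-tailed, alpha = 0.05: {standard.inv_cdf(0.975):.3f}")
print(f"One-tailed, alpha = 0.05: {standard.inv_cdf(0.95):.3f}")
print(f"Two-tailed, alpha = 0.01: {standard.inv_cdf(0.995):.3f}")
```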

Central Limit Theorem (CLT)

Critical theorem: The sampling distribution of the mean approaches a normal distribution as sample size increases, REGARDLESS of the shape of the population distribution.

  • Why this matters: Even with skewed data, the sample mean is approximately normally distributed if n is large enough (typically n > 30)
  • This underpins: Use of z-tests and t-tests even for non-normal data when n is large
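A small simulation can make the theorem tangible. The sketch below draws repeated samples from an exponential distribution (strongly right-skewed, with mean 1 and SD 1) and looks at the distribution of the sample means. The choice of distribution, the sample size of 50, and the 2000 repetitions are all arbitrary assumptions for illustration.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(1)  # fixed seed so the run is repeatable

n = 50              # observations per sample
repetitions = 2000  # number of repeated samples

sample_means = []
for _ in range(repetitions):
    sample = [random.expovariate(1.0) for _ in range(n)]  # skewed population
    sample_means.append(sum(sample) / n)

print(f"Mean of sample means: {mean(sample_means):.3f} (population mean = 1)")
print(f"SD of sample means:   {stdev(sample_means):.3f} "
      f"(theory: sigma/sqrt(n) = {1 / sqrt(n):.3f})")
```

The SD of the sample means should sit close to σ/√n, which is exactly the SEM. A histogram of `sample_means` would look roughly bell-shaped even though the underlying data are not.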

Standard Error vs Standard Deviation — Clinical Example

A study measures birth weight in 10,000 babies (mean = 3400g). - SD = 400g → tells us most babies weigh between 2600g and 4200g (±2SD) - SEM = 400/√10000 = 4g → tells us the mean is estimated very precisely (95% CI: ~3392 to 3408g)

A common mistake: Writing mean ± SD where mean ± SEM is intended (or vice versa). The MRCOG exam may test your ability to distinguish the two.

3.5 Skewed Distributions & Transformations

Log-normal distribution: - Data are positively skewed - After log-transformation, data become normally distributed - Common in O&G: length of labour, parity, time to pregnancy, hormone levels (e.g., hCG)

Transformation options:

| Transformation | Formula | When to use |
|---------------|---------|-------------|
| Log | y = ln(x) or y = log₁₀(x) | Positive skew; multiplicative data |
| Square root | y = √x | Count data with moderate skew |
| Reciprocal | y = 1/x | Strong skew |
| Box-Cox | y = (x^λ − 1)/λ | Generalised power transformation |
| Logit | y = ln[p/(1−p)] | Proportions (0 to 1) |
| Arcsine | y = arcsin(√p) | Proportions (stabilises variance) |

How to check normality: 1. Histogram — visual inspection (bell-shaped?) 2. Q-Q plot (quantile-quantile plot) — points along diagonal = normal 3. Shapiro-Wilk test — most powerful for small n (H₀: data are normal) 4. Kolmogorov-Smirnov test — suitable for large n 5. Skewness and kurtosis — skewness between −2 and +2 and kurtosis between −7 and +7 often considered acceptable
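The effect of a log transformation on skewness can be seen with a few lines of code. The data below are hypothetical values that double at each step (loosely mimicking a hormone such as hCG), and the skewness formula is the simple moment-based coefficient; statistical packages may apply small-sample corrections that give slightly different numbers.

```python
from math import log
from statistics import mean, pstdev


def skewness(values: list) -> float:
    """Moment-based skewness: mean of cubed standardised deviations."""
    m = mean(values)
    s = pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)


hormone = [10, 20, 40, 80, 160, 320, 640]  # hypothetical, doubling each step
logged = [log(v) for v in hormone]

print(f"Skewness before log transform: {skewness(hormone):.2f}")
print(f"Skewness after log transform:  {skewness(logged):.2f}")
```

The raw values are clearly positively skewed (coefficient above 1), while the logged values are symmetrical, so tests that assume normality become more defensible after transformation.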

3.6 Data Presentation — Types of Graphs

Graph Type of Data Variables Purpose Key features
Histogram Continuous One variable Show distribution shape Bars TOUCH; bin width matters
Bar chart Categorical One or two categorical Compare frequencies Bars DO NOT touch
Box plot Continuous One variable (or grouped) Show median, IQR, outliers Whiskers ±1.5×IQR
Scatter plot Continuous Two continuous variables Show relationship/correlation Look for direction, strength, outliers
Line graph Continuous (often time) Continuous × time Trend over time Time on x-axis
Pie chart Categorical One categorical (proportions) Show parts of a whole Avoid >5 categories
Kaplan-Meier Time-to-event Survival time + group Survival analysis Step function; censoring marks
Forest plot Meta-analysis Multiple studies Summarise effect sizes Square size = weight; diamond = summary
Funnel plot Meta-analysis Effect size vs precision Assess publication bias Symmetrical = publication bias unlikely; asymmetry suggests bias
Bland-Altman Continuous Two measurement methods Assess agreement Difference vs mean of two methods

Histogram vs Bar Chart — Critical MRCOG Distinction

Feature Histogram Bar Chart
Data type Continuous (or large discrete) Categorical
Bars Touch (no gap) Do not touch (gap between)
Order Natural order of variable (cannot reorder) Can be reordered (e.g. alphabetical, by frequency)
Width Can vary (if unequal bin widths) Always equal
Example Distribution of birth weights Caesarean section rates by hospital

Box Plot Interpretation

     Upper whisker (largest value ≤ Q3 + 1.5×IQR)
          │
     ─────┼─────   Q3 (75th percentile)
          │
     ─────┼─────   Median (Q2, 50th percentile)
          │
     ─────┼─────   Q1 (25th percentile)
          │
     Lower whisker (smallest value ≥ Q1 − 1.5×IQR)
          │
          ●        Outlier (>1.5×IQR beyond Q1 or Q3)

Uses: - Comparing distributions across groups (e.g., birth weight by maternal smoking status) - Identifying outliers - Showing skewness (if median not centred in box)

Scatter Plot Interpretation

Look for: - Direction: Positive (both increase together) or negative (one increases, other decreases) - Strength: How closely points follow a line (tight = strong correlation) - Shape: Linear, curvilinear, no pattern - Outliers: Points far from main cluster - Subgroups: Distinct clusters suggest different populations

Bland-Altman Plot for Method Comparison

  • X-axis: Mean of two measurements [(method A + method B)/2]
  • Y-axis: Difference (method A − method B)
  • Central horizontal line: Mean difference (bias)
  • Dashed lines: Limits of agreement (mean ± 1.96 SD of differences)
  • If limits are clinically acceptable → methods can be used interchangeably
  • Used for: Comparing ultrasound measurements between operators, comparing new test to gold standard
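The bias and limits of agreement in a Bland-Altman analysis are simple to compute. The sketch below uses made-up cervical length measurements (mm) from two hypothetical operators; only the arithmetic is the point.

```python
from statistics import mean, stdev


def bland_altman(method_a: list, method_b: list) -> tuple:
    """Return (bias, lower limit, upper limit) of agreement."""
    diffs = [a - b for a, b in zip(method_a, method_b)]
    bias = mean(diffs)      # mean difference
    sd_diff = stdev(diffs)  # SD of the differences
    return bias, bias - 1.96 * sd_diff, bias + 1.96 * sd_diff


# Hypothetical paired measurements on the same five women
operator_a = [30, 32, 35, 28, 40]
operator_b = [29, 33, 34, 27, 38]

bias, lower_loa, upper_loa = bland_altman(operator_a, operator_b)
print(f"Bias = {bias:.2f} mm")
print(f"Limits of agreement: {lower_loa:.2f} to {upper_loa:.2f} mm")
```

Whether limits of about −1.3 to +2.9 mm are acceptable is a clinical judgement, not a statistical one; the plot only quantifies the disagreement.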

4. Hypothesis Testing

4.1 Fundamental Concepts — Complete

Concept Symbol Definition Everyday analogy
Null hypothesis H₀ No difference / no association / no effect "He is innocent"
Alternative hypothesis H₁ There IS a difference / association / effect "He is guilty"
Type I error α Reject H₀ when H₀ is actually true (false positive) Convicting an innocent person
Type II error β Fail to reject H₀ when H₁ is true (false negative) Letting a guilty person go free
Power 1 − β Correctly rejecting H₀ when H₁ is true Probability of detecting a real effect
p-value p Probability of observing the data (or more extreme) assuming H₀ is true Not directly analogous

4.2 Type I and Type II Errors — The 2×2 Framework

Decision H₀ TRUE H₁ TRUE (H₀ FALSE)
Reject H₀ Type I error (α) [FALSE POSITIVE] CORRECT (True positive)
Fail to reject H₀ (does not prove H₀) CORRECT (True negative) Type II error (β) [FALSE NEGATIVE]

Type I Error (α)

  • α = 0.05 means: If H₀ is true (no real effect), there is a 5% chance we will incorrectly conclude there IS an effect
  • Trades off with Type II error — making α stricter (e.g., 0.01) reduces false positives but increases false negatives
  • Multiple testing: If you test 20 independent null hypotheses, expected number of false positives = 20 × 0.05 = 1 (hence Bonferroni correction)

Type II Error (β) and Power

  • β = 0.20 → Power = 0.80 is conventional minimum
  • β = 0.10 → Power = 0.90 is preferred
  • Power depends on:
  • Sample size (n): Larger n → higher power
  • Effect size (δ): Larger effect → higher power
  • α-level: Less strict α (e.g., 0.05 vs 0.01) → higher power
  • Variance (σ²): Lower variance → higher power

Worked Example of power concept: A study of 50 women finds no significant difference in birth weight between smokers and non-smokers (p = 0.12). The study was designed with 80% power to detect a 200g difference. The actual observed difference was 150g — the study was UNDER-powered to detect this smaller difference. Therefore the non-significant result does NOT mean there is no effect — it means we cannot rule out an effect of this size.
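How power falls when the true effect is smaller than the one the study was designed for can be shown with a normal-approximation formula for comparing two means. This is a sketch under stated assumptions: equal group sizes, a known common SD, a two-sided test, and the negligible opposite-tail probability ignored. The SD of 400 g and the 64 women per group are assumed values chosen for illustration, not figures from the example above.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(delta: float, sd: float, n_per_group: int,
                 alpha: float = 0.05) -> float:
    """Approximate power of a two-sided, two-sample comparison of means."""
    z = NormalDist()
    se = sd * sqrt(2.0 / n_per_group)   # SE of the difference in means
    z_crit = z.inv_cdf(1 - alpha / 2)   # 1.96 when alpha = 0.05
    return z.cdf(abs(delta) / se - z_crit)


# Assumed: SD = 400 g, 64 women per group
power_200 = approx_power(200, 400, 64)
power_150 = approx_power(150, 400, 64)
print(f"Power to detect 200 g: {power_200:.2f}")
print(f"Power to detect 150 g: {power_150:.2f}")
```

Under these assumptions the study has about 80% power for a 200 g difference but only a little over 55% for 150 g, so a non-significant result leaves a 150 g effect entirely plausible. Changing `n_per_group`, `sd`, or `alpha` shows each of the four determinants of power listed above.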

4.3 The p-value — Essential MRCOG Detail

CRITICAL EXAM POINT: The p-value is NOT the probability that the null hypothesis is true! This is the single most common statistical misconception tested in MRCOG.

Mathematical Definition: p-value = P(observed data OR more extreme | H₀ true)

It is NOT P(H₀ true | observed data)

The correct interpretation: "If there were truly no difference between groups, the probability of observing a difference as large (or larger) than the one we saw is p."

Common Misconceptions — All WRONG:

❌ Incorrect Statement ✅ Correct Interpretation
"p = 0.03 means there is a 3% chance H₀ is true" p = 0.03 means: if H₀ were true, we'd see data this extreme only 3% of the time
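"p = 0.05 means there is a 5% probability the result is due to chance" The p-value is calculated assuming H₀ (chance alone) is true, so it cannot be the probability that chance produced the result; it is the probability of data this extreme if H₀ holds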
"p > 0.05 means the treatment is equivalent to placebo" Non-significant does NOT mean no effect — may be underpowered
"p = 0.001 means the effect is very large" p does NOT measure effect size — only strength of evidence against H₀
"We failed to reject H₀, so H₀ is true" We cannot prove H₀ — only fail to find evidence against it

4.4 Confidence Intervals — Detailed

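Definition: A 95% confidence interval is a range of values, calculated from the sample, produced by a method that captures the true population parameter in 95% of repeated samples. The parameter is fixed; it is the interval that varies from sample to sample.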

Correct interpretation: If we repeated the study 100 times and calculated a 95% CI each time, about 95 of those CIs would contain the true population value.

WRONG interpretation: "There is a 95% probability that the true value lies within this CI" — this is a Bayesian credible interval interpretation, not a frequentist CI.

CI provides MORE information than p-value: - Shows the ESTIMATE (best guess of effect size) - Shows the PRECISION (width = how certain we are) - Shows STATISTICAL SIGNIFICANCE (if 95% CI excludes null value → p < 0.05) - Shows CLINICAL SIGNIFICANCE (even if significant, is the entire CI in a clinically meaningful range?)

CI includes null? p-value Interpretation
Yes (e.g., RR 1.2, 95% CI 0.9–1.5) p ≥ 0.05 Not statistically significant
No (e.g., RR 1.2, 95% CI 1.01–1.5) p < 0.05 Statistically significant
No (e.g., RR 1.2, 95% CI 1.1–1.3) p < 0.001 Significant AND precise

Example: RR for preterm birth in smokers vs non-smokers - Study A: RR = 1.5, 95% CI 0.8–2.2 (wide CI → imprecise; not significant) - Study B: RR = 1.3, 95% CI 1.1–1.5 (narrow CI → precise; significant) - Study C: RR = 1.1, 95% CI 1.01–1.19 (significant but clinically marginal)

4.5 One-tailed vs Two-tailed Tests

Aspect Two-tailed One-tailed
Alternative hypothesis H₁: μ₁ ≠ μ₂ (difference in either direction) H₁: μ₁ > μ₂ (or μ₁ < μ₂)
When to use Default — almost always Only if difference in opposite direction is impossible or irrelevant
α distribution Split equally between both tails (2.5% each) All 5% in one tail
Critical value (α=0.05) z = ±1.96 z = 1.645
For same data p-value is 2× the one-tailed p p-value is half the two-tailed p
Sample size Larger Smaller (for same power)
Controversy Safe and standard Can inflate Type I error if the "wrong" direction appears

MRCOG rule: Always use two-tailed unless you have an extremely strong justification. The exam expects two-tailed as default.

Example: Comparing two antihypertensives in pregnancy — you cannot be certain a new drug won't be worse → two-tailed. If comparing a known teratogen to placebo, you might use one-tailed (it can't reduce malformation risk below background), but even then, two-tailed is safer.

4.6 Multiple Testing — Corrections

The problem: Each statistical test at α = 0.05 has a 5% chance of false positive. If you run many tests, the familywise error rate (FWER) increases.

FWER = 1 − (1 − α)ᵏ

Number of tests (k) FWER
1 0.05
5 0.23
10 0.40
20 0.64
100 0.99
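The table values follow directly from the formula and can be regenerated in a couple of lines. Note that the formula assumes the k tests are independent.

```python
def fwer(k: int, alpha: float = 0.05) -> float:
    """Familywise error rate for k independent tests, each at level alpha."""
    return 1 - (1 - alpha) ** k


for k in (1, 5, 10, 20, 100):
    print(f"k = {k:>3}: FWER = {fwer(k):.2f}")
```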

Bonferroni correction: - Adjusted α = 0.05 / k - Example: 10 comparisons → α = 0.005 - Very conservative — reduces Type I error but increases Type II error (reduces power)

Other methods:

| Method | Description | Comparison |
|--------|-------------|------------|
| Bonferroni | α/k | Most conservative |
| Holm-Bonferroni | Stepwise: smallest p tested at α/k, then α/(k−1), etc. | Less conservative, more powerful |
| Sidak | 1 − (1−α)^(1/k) | Slightly less conservative than Bonferroni |
| Benjamini-Hochberg (FDR) | Controls false discovery rate (expected proportion of false positives among rejected hypotheses) | Least conservative; used in genomics |
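The difference in conservatism between these methods is easiest to see by applying them to the same set of p-values. The sketch below implements Bonferroni, Holm-Bonferroni, and Benjamini-Hochberg from their textbook descriptions; the four p-values are invented for illustration.

```python
def bonferroni(pvals: list, alpha: float = 0.05) -> list:
    """Reject H0 where p <= alpha / k."""
    k = len(pvals)
    return [p <= alpha / k for p in pvals]


def holm(pvals: list, alpha: float = 0.05) -> list:
    """Step-down: compare sorted p-values with alpha/k, alpha/(k-1), ...;
    stop at the first failure."""
    k = len(pvals)
    order = sorted(range(k), key=lambda i: pvals[i])
    reject = [False] * k
    for rank, i in enumerate(order):
        if pvals[i] <= alpha / (k - rank):
            reject[i] = True
        else:
            break
    return reject


def benjamini_hochberg(pvals: list, alpha: float = 0.05) -> list:
    """Step-up FDR: find the largest rank i with p(i) <= (i/k) * alpha
    and reject all hypotheses up to that rank."""
    k = len(pvals)
    order = sorted(range(k), key=lambda i: pvals[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / k * alpha:
            cutoff = rank
    reject = [False] * k
    for i in order[:cutoff]:
        reject[i] = True
    return reject


pvals = [0.001, 0.013, 0.03, 0.2]  # hypothetical p-values from 4 comparisons
print("Bonferroni:        ", bonferroni(pvals))
print("Holm-Bonferroni:   ", holm(pvals))
print("Benjamini-Hochberg:", benjamini_hochberg(pvals))
```

With these values Bonferroni rejects one hypothesis, Holm two, and Benjamini-Hochberg three, matching the ordering from most to least conservative.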

4.7 Significance vs Clinical Importance — Key MRCOG Concept

Statistically Significant Not Statistically Significant
Clinically Important ✅ Optimal — real effect detected 🔴 Underpowered study — need larger n
Clinically Unimportant 🟡 Significant but trivial (large n) ✅ No evidence of important effect

Example 1: A study with 100,000 women finds that taking paracetamol once in pregnancy reduces preterm birth from 5.0% to 4.9% (p = 0.03). Statistically significant but clinically meaningless (ARR = 0.1%, NNT = 1000).

Example 2: A study with 100 women finds a 30% reduction in miscarriage rate but p = 0.15. Potentially clinically important but not proven — underpowered.
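The ARR and NNT quoted in Example 1 follow from two lines of arithmetic, sketched here with illustrative function names.

```python
def absolute_risk_reduction(control_risk: float, treated_risk: float) -> float:
    """ARR = risk in control group minus risk in treated group."""
    return control_risk - treated_risk


def number_needed_to_treat(control_risk: float, treated_risk: float) -> float:
    """NNT = 1 / ARR."""
    return 1 / absolute_risk_reduction(control_risk, treated_risk)


# Example 1: preterm birth 5.0% without treatment vs 4.9% with treatment
arr = absolute_risk_reduction(0.050, 0.049)
nnt = number_needed_to_treat(0.050, 0.049)
print(f"ARR = {arr:.1%}, NNT = {round(nnt)}")  # ARR = 0.1%, NNT = 1000
```

An NNT of 1000 means a thousand women would need treating to prevent one preterm birth, which is why a small p-value alone says nothing about clinical value.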

4.8 Bayesian vs Frequentist Statistics — Overview

Aspect Frequentist Bayesian
Probability definition Long-run frequency Degree of belief
Parameters Fixed (unknown) Random variables
Data Random Fixed
Prior Not used Used (prior probability)
Output p-value, CI Posterior probability, credible interval
Interpretation of 95% interval 95% of intervals contain true value 95% probability true value lies in interval
Statement about H₀ Cannot assign a probability to H₀ (p = 0.05 does NOT mean "5% chance H₀ is true") Can state a posterior probability for H₀, given a prior and the data

Bayesian approach in O&G: Increasingly used in adaptive trials, diagnostic test interpretation, and meta-analysis.


5. Parametric vs Non-Parametric Tests

5.1 Choosing the Right Test — Decision Tree

Continuous Data

                            ┌─────────────────────────┐
                            │   Continuous Outcome    │
                            └────────────┬────────────┘
                                         │
                            ┌────────────┴────────────┐
                            │    Normally distributed? │
                            └────────────┬────────────┘
                                         │
                     ┌───────────────────┴────────────────────┐
                   YES│                                      │NO
                      │                                       │
          ┌───────────┴───────────┐              ┌────────────┴────────────┐
          │   How many groups?    │              │   How many groups?      │
          └───────────┬───────────┘              └────────────┬────────────┘
                      │                                       │
          ┌───────┬───┴───┬───────┐              ┌───────┬───┴───┬───────┐
          │ 2 ind │2 paired│ 3+ ind│3+ paired    │ 2 ind │2 paired│ 3+ ind│3+ paired
          │t-test │t-test  │ANOVA  │RM-ANOVA     │Mann-  │Wilcoxon│Kruskal│Friedman
          │(unpaired) (paired)│(one-way)│           │Whitney│signed  │Wallis │
          │       │        │       │            │  U    │rank    │       │
          └───────┴────────┴───────┴─────       └───────┴────────┴───────┴──────

Categorical Data

                            ┌─────────────────────────┐
                            │   Categorical Outcome   │
                            └────────────┬────────────┘
                                         │
                            ┌────────────┴────────────┐
                            │    2×2 table or larger  │
                            └────────────┬────────────┘
                                         │
                     ┌───────────────────┴─────────────────────┐
                     │                                         │
            ┌────────┴────────┐                      ┌────────┴────────┐
            │   Expected ≥5?  │                      │  Paired data?   │
            └────────┬────────┘                      └────────┬────────┘
                     │                                         │
                ┌────┴────┐                              ┌────┴────┐
               YES│      │NO                            YES│       │NO
                  │      │                                │        │
             ┌────┴┐  ┌──┴────┐                     ┌────┴┐  ┌───┴────┐
             │  χ² │  │Fisher │                     │McNemar│  │Normal │
             │ test│  │exact  │                     │       │  │  χ²   │
             └─────┘  └───────┘                     └───────┘  └───────┘

5.2 Parametric Tests — Complete Details

Student's t-test

Assumptions: 1. Normality: Data in each group are approximately normally distributed (or n large enough for CLT) 2. Homogeneity of variance: Variance similar in both groups (check with Levene's test or F-test) 3. Independence: Observations are independent of each other

Unpaired (Independent Samples) t-test
  • Use: Compare means of TWO independent groups
  • Example: Birth weight in smokers vs non-smokers
  • Formula: t = (x̄₁ − x̄₂) / √(s²(1/n₁ + 1/n₂))
  • where s² = pooled variance = [(n₁−1)s₁² + (n₂−1)s₂²] / (n₁ + n₂ − 2)
  • Degrees of freedom: df = n₁ + n₂ − 2

Worked example: - Smokers: n=50, mean=3100g, SD=400g - Non-smokers: n=50, mean=3300g, SD=380g - Pooled SD = √([49×400² + 49×380²]/98) = √([7,840,000 + 7,075,600]/98) = √(152,200) = 390.1 - t = (3100 − 3300) / (390.1 × √(1/50 + 1/50)) = −200 / (390.1 × 0.2) = −200 / 78.02 = −2.56 - df = 98, critical t (two-tailed, α=0.05) = 1.984 - |t| = 2.56 > 1.984 → p < 0.05 → significant difference
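The pooled-variance t statistic can be computed directly from the summary statistics. This sketch follows the formula given above; the critical value of 1.984 is taken from the worked example rather than calculated, because the standard library has no t-distribution.

```python
from math import sqrt


def pooled_t(mean1: float, sd1: float, n1: int,
             mean2: float, sd2: float, n2: int) -> tuple:
    """Unpaired t-test (equal variances assumed) from summary statistics.
    Returns (t statistic, degrees of freedom)."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    se_diff = sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / se_diff, df


# Smokers vs non-smokers, as in the worked example
t_stat, df = pooled_t(3100, 400, 50, 3300, 380, 50)
print(f"t = {t_stat:.2f}, df = {df}")  # t = -2.56, df = 98

critical_t = 1.984  # two-tailed, alpha = 0.05, df = 98 (from tables)
print("Significant at 5%:", abs(t_stat) > critical_t)
```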

Welch's t-test: Does NOT assume equal variances; more robust. Uses separate variances and adjusted df (Satterthwaite or Welch). Recommended as default.

Paired t-test
  • Use: Compare means of TWO RELATED measurements (same subjects, before-after, matched pairs)
  • Example: BP before and after antihypertensive treatment in pregnancy
  • Principle: Calculate difference for each pair, test if mean difference = 0
  • Formula: t = d̄ / (s_d / √n)
  • d̄ = mean of differences
  • s_d = SD of differences
  • n = number of pairs
  • df = n − 1

Worked example: Fasting glucose (mmol/L) before and after 1 week of metformin in 10 women with PCOS:

Subject Before After Difference
1 5.9 5.5 0.4
2 6.2 5.8 0.4
3 5.6 5.3 0.3
4 5.8 5.4 0.4
5 6.0 5.7 0.3
6 5.7 5.6 0.1
7 6.1 5.8 0.3
8 5.9 5.5 0.4
9 5.8 5.6 0.2
10 6.0 5.9 0.1
  • Mean difference d̄ = 0.29
  • SD of differences = 0.12
  • t = 0.29 / (0.12/√10) = 0.29 / 0.038 = 7.63
  • df = 9, critical t = 2.262
  • t = 7.63 >> 2.262 → p < 0.001 → significant reduction
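The paired calculation can be reproduced from the raw before and after values in the table. Because this sketch does not round intermediate results, the t statistic comes out a few hundredths away from a hand calculation that rounds the standard error.

```python
from math import sqrt
from statistics import mean, stdev

before = [5.9, 6.2, 5.6, 5.8, 6.0, 5.7, 6.1, 5.9, 5.8, 6.0]
after = [5.5, 5.8, 5.3, 5.4, 5.7, 5.6, 5.8, 5.5, 5.6, 5.9]

# Work on the within-subject differences, not the two columns separately
diffs = [b - a for b, a in zip(before, after)]

n = len(diffs)
mean_diff = mean(diffs)
sd_diff = stdev(diffs)                  # SD of the differences (n-1)
t_stat = mean_diff / (sd_diff / sqrt(n))

print(f"Mean difference = {mean_diff:.2f}, SD = {sd_diff:.2f}")
print(f"t = {t_stat:.2f}, df = {n - 1}")
```

Pairing removes between-subject variability: the SD of the differences (about 0.12) is much smaller than the spread of the raw glucose values, which is what makes the test so sensitive here.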

Analysis of Variance (ANOVA)

One-way ANOVA
  • Use: Compare means of THREE or MORE independent groups
  • Why not multiple t-tests? Inflates Type I error (for 3 groups: 3 pairwise tests → FWER = 14.3%)
  • Logic: Partition total variance into:
  • Between-group variance (attributable to the treatment/group effect)
  • Within-group variance (error/residual variance)
  • F-statistic = Mean Square (between) / Mean Square (within)
  • If F is large and p < 0.05 → at least one group differs from others
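The inflation of Type I error from repeated pairwise t-tests follows from FWER = 1 − (1 − α)^m. A short sketch, assuming the m tests are independent (pairwise comparisons on shared groups are not strictly independent, so this is an approximation); the function names are illustrative.

```python
import math

def n_pairwise_comparisons(k):
    """Number of pairwise comparisons among k groups."""
    return math.comb(k, 2)

def familywise_error_rate(n_tests, alpha=0.05):
    """FWER for n_tests independent tests, each run at level alpha."""
    return 1 - (1 - alpha) ** n_tests

for k in (3, 4, 5):
    m = n_pairwise_comparisons(k)
    print(f"{k} groups: {m} tests, FWER = {familywise_error_rate(m):.1%}")
```

For three groups this reproduces the 14.3% quoted above, and the rate climbs quickly as groups are added.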

ANOVA table:

| Source | Sum of Squares | df | Mean Square | F |
|--------|----------------|----|-------------|---|
| Between groups | SS_b | k−1 | MS_b = SS_b/(k−1) | MS_b/MS_w |
| Within groups | SS_w | N−k | MS_w = SS_w/(N−k) | |
| Total | SS_t | N−1 | | |

k = number of groups, N = total sample size
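The ANOVA table can be filled in by hand for a small data set. The sketch below uses hypothetical birth weights (in units of 100 g) for three groups of three babies; the data and the function name `one_way_anova` are invented for illustration only.

```python
import statistics

def one_way_anova(groups):
    """Return the entries of a one-way ANOVA table for a list of groups."""
    all_values = [x for g in groups for x in g]
    grand_mean = statistics.mean(all_values)
    k, n_total = len(groups), len(all_values)
    # Between-group SS: group size times squared distance of group mean from grand mean
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group SS: squared distance of each value from its own group mean
    ss_within = sum((x - statistics.mean(g)) ** 2 for g in groups for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return {
        "SS_b": ss_between, "SS_w": ss_within, "SS_t": ss_between + ss_within,
        "df_b": k - 1, "df_w": n_total - k,
        "MS_b": ms_between, "MS_w": ms_within,
        "F": ms_between / ms_within,
    }

# Hypothetical data: three groups, birth weight in units of 100 g
table = one_way_anova([[31, 32, 33], [32, 33, 34], [36, 37, 38]])
print(table)
```

Here SS_b = 42 and SS_w = 6, so MS_b = 21, MS_w = 1 and F = 21 on (2, 6) degrees of freedom.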

Post-hoc Tests after Significant ANOVA

Why can't we just use pairwise t-tests? Multiple testing problem. Post-hoc tests control for multiple comparisons.

| Test | Conservatism | When to use |
|------|--------------|-------------|
| Bonferroni | Very conservative | Small number of pre-planned comparisons |
| Tukey HSD | Moderate | All pairwise comparisons (most common) |
| Scheffé | Most conservative | Complex comparisons (contrasts) |
| Dunnett | Moderate | Comparing all groups to a single control |
| Least Significant Difference (LSD) | NOT conservative (doesn't control FWER) | Only if exactly 3 groups and significant F |

Tukey HSD (Honest Significant Difference): - Controls FWER for ALL pairwise comparisons - Uses studentised range distribution (q) - Formula: HSD = q × √(MS_w / n)

Two-way ANOVA
  • Use: TWO independent variables (factors) + their interaction
  • Output: Main effect of factor A, main effect of factor B, interaction effect (A×B)
  • Example: Effect of smoking (yes/no) AND maternal age (<35 vs ≥35) on birth weight
  • Main effect of smoking (adjusted for age)
  • Main effect of age (adjusted for smoking)
  • Interaction: Does the effect of smoking DIFFER by maternal age?

Interpreting interaction: - Significant interaction p-value → the effect of one factor depends on the other - Example: Smoking reduces birth weight more in older mothers → significant smoking × age interaction - Report subgroup means or interaction plot

Repeated Measures ANOVA
  • Use: Same subjects measured at 3+ time points (e.g., BP at booking, 28 wks, 36 wks)
  • Advantage: Controls for between-subject variability → more powerful
  • Assumptions: Sphericity (variance of differences between all pairs of measurements is equal) — checked by Mauchly's test
  • Correction for non-sphericity: Greenhouse-Geisser, Huynh-Feldt

Assumptions of Parametric Tests — How to Check

| Assumption | What it means | How to check | What to do if violated |
|------------|---------------|--------------|------------------------|
| Normality | Data follow normal distribution | Histogram, Q-Q plot, Shapiro-Wilk, Kolmogorov-Smirnov | Use non-parametric test, transform data |
| Homogeneity of variance | Equal variances across groups | Levene's test, F-test (2 groups), Bartlett's test | Use Welch's t-test, Welch's ANOVA, or transformation |
| Independence | Observations independent | Study design check | Mixed models, GEE, multilevel models |
| Sphericity (RM-ANOVA) | Equal variances of differences | Mauchly's test | Greenhouse-Geisser correction |

5.3 Non-Parametric Tests — Complete Details

Mann-Whitney U Test (Wilcoxon Rank-Sum)

  • Use: Compare TWO INDEPENDENT groups with non-normal data
  • Principle: Rank all observations together, then compare sum of ranks between groups
  • H₀: The two populations have the same location (median)
  • Output: U statistic (or W in some software)

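Steps:
  1. Rank all observations from both groups together (1 = smallest; tied values share the mean of their ranks)
  2. Sum the ranks within each group (R₁ and R₂)
  3. U₁ = R₁ − n₁(n₁+1)/2 and U₂ = R₂ − n₂(n₂+1)/2
  4. U = min(U₁, U₂), which is significant if it is less than or equal to the tabulated critical value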

Worked example: Pain scores (0–10) after two different perineal repair techniques

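| Technique A | Rank A | Technique B | Rank B |
|-------------|--------|-------------|--------|
| 2 | 1 | 4 | 4.5 |
| 3 | 2.5 | 5 | 6 |
| 3 | 2.5 | 6 | 7 |
| 4 | 4.5 | 8 | 9 |
| 7 | 8 | 9 | 10 |
| Sum | 18.5 | Sum | 36.5 |
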
  • n₁ = 5, n₂ = 5
  • U₁ = 18.5 − (5×6/2) = 18.5 − 15 = 3.5
  • U₂ = 36.5 − 15 = 21.5
  • U = 3.5 (critical U for n₁=5, n₂=5, α=0.05 two-tailed = 2)
  • U = 3.5 > 2 → not significant at α = 0.05

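For larger samples (commonly taken as more than about 20 per group), U is approximately normally distributed, so significance can be assessed with Z = (U − n₁n₂/2) / √[n₁n₂(n₁+n₂+1)/12] instead of a table of critical values.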
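The ranking and U calculation for the perineal repair example can be sketched as follows. The helper names `average_ranks` and `mann_whitney_u` are illustrative; tied values receive the mean of the ranks they occupy.

```python
def average_ranks(values):
    """Rank values from 1 (smallest); tied values share their mean rank."""
    ordered = sorted(values)
    rank_of = {}
    for v in set(ordered):
        first = ordered.index(v) + 1          # 1-based position of first occurrence
        count = ordered.count(v)
        rank_of[v] = first + (count - 1) / 2  # mean of the tied positions
    return [rank_of[v] for v in values]

def mann_whitney_u(group1, group2):
    """Return rank sums R1, R2 and U1, U2, U for two independent groups."""
    ranks = average_ranks(group1 + group2)
    n1, n2 = len(group1), len(group2)
    r1 = sum(ranks[:n1])
    r2 = sum(ranks[n1:])
    u1 = r1 - n1 * (n1 + 1) / 2
    u2 = r2 - n2 * (n2 + 1) / 2
    return r1, r2, u1, u2, min(u1, u2)

technique_a = [2, 3, 3, 4, 7]
technique_b = [4, 5, 6, 8, 9]
r1, r2, u1, u2, u = mann_whitney_u(technique_a, technique_b)
print(f"R1 = {r1}, R2 = {r2}, U1 = {u1}, U2 = {u2}, U = {u}")
```

A useful check is that U₁ + U₂ always equals n₁ × n₂ (here 3.5 + 21.5 = 25).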

Wilcoxon Signed-Rank Test

  • Use: TWO PAIRED groups (non-parametric equivalent of paired t-test)
  • Principle: Calculate differences, rank absolute differences, sum ranks of positive vs negative differences
  • Steps:
  • Calculate difference for each pair
  • Exclude pairs with difference = 0
  • Rank absolute differences (ignoring sign)
  • Sum ranks of positive differences (W+) and negative differences (W−)
  • Test statistic W = min(W+, W−)

Example (from paired t-test data above): Glucose before and after metformin - Differences: 0.4, 0.4, 0.3, 0.4, 0.3, 0.1, 0.3, 0.4, 0.2, 0.1 - All positive → W+ = 1+2+...+10 = 55, W− = 0 - For n=10, critical W = 8 (two-tailed, α=0.05) - W = 0 < 8 → p < 0.05 → significant (more powerful than sign test)
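The signed-rank sums for this example can be reproduced with a short sketch. The function name `signed_rank_sums` is illustrative; zero differences are dropped and tied absolute differences share their mean rank, which leaves the total of the ranks unchanged at n(n+1)/2.

```python
def signed_rank_sums(differences):
    """Return (W+, W-, W) for a list of paired differences."""
    nonzero = [d for d in differences if d != 0]   # exclude pairs with no change
    ordered = sorted(abs(d) for d in nonzero)

    def mean_rank(x):
        first = ordered.index(x) + 1               # 1-based position of first occurrence
        return first + (ordered.count(x) - 1) / 2  # mean rank across ties

    w_plus = sum(mean_rank(abs(d)) for d in nonzero if d > 0)
    w_minus = sum(mean_rank(abs(d)) for d in nonzero if d < 0)
    return w_plus, w_minus, min(w_plus, w_minus)

# Differences (before minus after) from the metformin example
diffs = [0.4, 0.4, 0.3, 0.4, 0.3, 0.1, 0.3, 0.4, 0.2, 0.1]
w_plus, w_minus, w = signed_rank_sums(diffs)
print(f"W+ = {w_plus}, W- = {w_minus}, W = {w}")
```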

Sign test (simpler alternative): - Count number of positive and negative differences (ignoring magnitude) - Test using binomial distribution - Less powerful than Wilcoxon signed-rank (discards magnitude information)

Kruskal-Wallis Test

  • Use: THREE+ INDEPENDENT groups (non-parametric equivalent of one-way ANOVA)
  • Principle: Extension of Mann-Whitney — ranks all observations together, compares sum of ranks across groups
  • H₀: All groups have same median
  • Output: H statistic (approximately χ² with df = k−1)
  • Post-hoc: Dunn's test with Bonferroni correction

When to use: Comparing fetal fibronectin levels (skewed) across three groups: term labour, preterm labour, no labour

Friedman Test

  • Use: THREE+ PAIRED groups (non-parametric equivalent of repeated measures ANOVA)
  • Principle: Ranks within each subject/block, then compares across time points
  • Example: Pain scores at 1 hour, 6 hours, 24 hours after episiotomy repair
  • Post-hoc: Wilcoxon signed-rank with Bonferroni correction

5.4 Chi-Squared Test (χ²) — Complete Details

  • Use: Test association between TWO CATEGORICAL variables
  • Data format: Contingency table (r × c)

Formula: χ² = Σ [(Oᵢⱼ − Eᵢⱼ)² / Eᵢⱼ]

Where: - Oᵢⱼ = observed frequency in cell (i, j) - Eᵢⱼ = expected frequency = (row total × column total) / grand total - df = (rows − 1) × (columns − 1)

Worked example: Mode of delivery by maternal BMI category

| | SVD | CS | Total |
|---|-----|----|-------|
| BMI < 30 | 80 | 20 | 100 |
| BMI ≥ 30 | 30 | 30 | 60 |
| Total | 110 | 50 | 160 |

Expected frequencies:
  • BMI < 30, SVD: (100 × 110)/160 = 68.75
  • BMI < 30, CS: (100 × 50)/160 = 31.25
  • BMI ≥ 30, SVD: (60 × 110)/160 = 41.25
  • BMI ≥ 30, CS: (60 × 50)/160 = 18.75

χ² = (80−68.75)²/68.75 + (20−31.25)²/31.25 + (30−41.25)²/41.25 + (30−18.75)²/18.75 = 1.84 + 4.05 + 3.07 + 6.75 = 15.71

  • df = (2−1)(2−1) = 1
  • Critical χ² (df=1, α=0.05) = 3.84
  • 15.71 > 3.84 → p < 0.001 → highly significant association
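The expected frequencies and the χ² statistic can be computed for any r × c table with a short function. This is a sketch without Yates' correction; the name `chi_squared` is illustrative.

```python
def chi_squared(table):
    """Pearson chi-squared statistic, df and expected counts for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    expected = []
    for i, row in enumerate(table):
        expected_row = []
        for j, observed in enumerate(row):
            e = row_totals[i] * col_totals[j] / grand_total
            expected_row.append(e)
            chi2 += (observed - e) ** 2 / e
        expected.append(expected_row)
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df, expected

# Rows: BMI < 30, BMI >= 30; columns: SVD, CS
chi2, df, expected = chi_squared([[80, 20], [30, 30]])
print(f"chi-squared = {chi2:.2f}, df = {df}, expected = {expected}")
```

Inspecting `expected` is also how the assumption about small expected frequencies is checked before relying on the result.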

Assumptions: 1. Independent observations (each subject counted once) 2. No more than 20% of expected frequencies < 5 3. All expected frequencies ≥ 1

If assumptions violated: Use Fisher's exact test (any 2×2 table) or combine categories (for larger tables).

Yates' Correction for Continuity

  • Applied to 2×2 tables (subtract 0.5 from each |O−E| before squaring)
  • More conservative (reduces χ²)
  • Historically used; now controversial — Fisher's exact preferred for small samples

Fisher's Exact Test

  • Use: 2×2 tables when expected frequencies < 5 (any sample size works)
  • Principle: Calculates exact probability of observed table (and more extreme tables) given fixed margins — based on hypergeometric distribution
  • Advantage: Valid for ANY sample size
  • Disadvantage: Computationally intensive for large tables

McNemar's Test for Paired Categorical Data

  • Use: Compare PROPORTIONS in PAIRED or MATCHED categorical data (before-after, matched case-control)
  • Example: Diagnosis of GDM by two different criteria (IADPSG vs NICE) in same women

Paired 2×2 table:

| | Test B + | Test B − | Total |
|---|----------|----------|-------|
| Test A + | a (both positive) | b (A positive, B negative) | a + b |
| Test A − | c (A negative, B positive) | d (both negative) | c + d |
| Total | a + c | b + d | N |

Formula: χ² = (|b − c| − 1)² / (b + c) [with continuity correction] - Only discordant pairs (b and c) contribute to the test - If b = c → no difference between tests

Example: GDM screening — IADPSG vs NICE criteria in 200 women

| | NICE + | NICE − | Total |
|---|--------|--------|-------|
| IADPSG + | 20 | 15 | 35 |
| IADPSG − | 3 | 162 | 165 |
| Total | 23 | 177 | 200 |

χ² = (|15 − 3| − 1)² / (15 + 3) = (11)² / 18 = 121/18 = 6.72

df = 1, p ≈ 0.01 → significant difference: IADPSG detects significantly more GDM than NICE criteria.
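McNemar's statistic depends only on the discordant pairs, which a small sketch makes explicit. The p-value line uses the fact that a χ² variable with 1 df is the square of a standard normal, so p = erfc(√(χ²/2)); function names are illustrative.

```python
import math

def mcnemar_chi2(b, c, continuity=True):
    """McNemar chi-squared from the two discordant cell counts."""
    correction = 1 if continuity else 0
    return (abs(b - c) - correction) ** 2 / (b + c)

def p_value_df1(chi2):
    """Upper-tail p-value for a chi-squared statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))

# IADPSG+/NICE- = 15, IADPSG-/NICE+ = 3
chi2 = mcnemar_chi2(15, 3)
p = p_value_df1(chi2)
print(f"chi-squared = {chi2:.2f}, p = {p:.4f}")
```

Note that the 20 concordant positives and 162 concordant negatives never enter the calculation.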

5.5 Correlation — Detailed

| Coefficient | Symbol | Type | Parametric? | Range | Measure of |
|-------------|--------|------|-------------|-------|------------|
| Pearson r | r | Linear | Yes | −1 to +1 | Linear relationship strength |
| Spearman ρ | rₛ (or ρ) | Monotonic | No | −1 to +1 | Monotonic relationship (any consistent trend) |
| Kendall τ | τ | Concordant/discordant pairs | No | −1 to +1 | Association in ranked data |

Pearson Correlation (r)

Assumptions: 1. Both variables are continuous 2. Linear relationship 3. Bivariate normality (both normally distributed) 4. Homoscedasticity (equal scatter across values) 5. No significant outliers

Formula: r = Σ[(xᵢ − x̄)(yᵢ − ȳ)] / √[Σ(xᵢ − x̄)² × Σ(yᵢ − ȳ)²]

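Interpretation of r (rule-of-thumb bands; Cohen's own benchmarks are 0.1 = small, 0.3 = medium, 0.5 = large):

| r value | Interpretation | Approximate R² |
|---------|----------------|----------------|
| 0.0–0.1 | Negligible | 0–1% |
| 0.1–0.3 | Weak | 1–9% |
| 0.3–0.5 | Moderate | 9–25% |
| 0.5–0.7 | Strong | 25–49% |
| 0.7–1.0 | Very strong | 49–100% |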

R² = coefficient of determination: Proportion of variance in Y explained by X. - If r = 0.6, R² = 0.36 → 36% of variance in Y is explained by X - 64% is due to other factors

Spearman's Rank Correlation (ρ)

  • Use: Non-normal, ordinal, or skewed data
  • Principle: Rank both variables, then calculate Pearson r on ranks
  • Advantages: No normality assumption; detects monotonic (not just linear) relationships; robust to outliers
  • Interpretation: Same r scale (−1 to +1)
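The statement that Spearman's ρ is Pearson's r calculated on ranks can be shown directly. The sketch below uses hypothetical data with a perfectly monotonic but curved relationship (y = x²), so ρ is 1 while r is slightly below 1; all names and data are invented for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson correlation from the sums-of-products formula."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """Rank from 1 (smallest); tied values share their mean rank."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 + (ordered.count(v) - 1) / 2 for v in values]

def spearman_rho(x, y):
    """Spearman correlation: Pearson r applied to the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Hypothetical monotonic, non-linear data
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
r = pearson_r(x, y)
rho = spearman_rho(x, y)
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```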

Kendall's Tau (τ)

  • Use: Small samples with many tied ranks
  • Principle: Based on number of concordant vs discordant pairs
  • τ = (C − D) / [½ n(n−1)] where C = concordant pairs, D = discordant
  • Advantage: More robust and interpretable with ties; better for small samples
  • Disadvantage: Usually smaller absolute value than Spearman

Correlation does NOT imply causation — 4 possible explanations for r ≠ 0: 1. X causes Y (direct causation) 2. Y causes X (reverse causation) 3. Z causes both X and Y (confounding) 4. Chance (random variation)

Common O&G example: Positive correlation between maternal age and Down's syndrome — direct causal relationship (meiotic non-disjunction increases with age). This is one case where correlation IS causation.

5.6 Regression — Complete Details

Linear Regression

Model: Y = β₀ + β₁X₁ + β₂X₂ + ... + βₖXₖ + ε

  • Y = outcome (dependent) variable — CONTINUOUS
  • Xᵢ = predictor (independent) variables
  • β₀ = intercept (value of Y when all X = 0)
  • βᵢ = regression coefficient (change in Y per 1-unit change in Xᵢ, holding others constant)
  • ε = error term (residual)

Key outputs:

| Output | Interpretation |
|--------|----------------|
| β coefficient | Effect estimate (units of Y per unit X) |
| 95% CI for β | Precision and significance |
| p-value for β | Test of H₀: β = 0 |
| R² | Proportion of variance explained by model |
| Adjusted R² | R² penalised for number of predictors |
| F-test | Tests if overall model is significant |

Assumptions of linear regression: 1. Linearity: Relationship between X and Y is linear 2. Independence: Observations are independent 3. Homoscedasticity: Constant variance of residuals across fitted values 4. Normality: Residuals are normally distributed 5. No multicollinearity: Predictors not highly correlated

Checking assumptions: - Residual vs fitted plot: Look for random scatter (homoscedasticity) and no pattern (linearity) - Q-Q plot of residuals: Check normality - Variance Inflation Factor (VIF): Check multicollinearity (VIF > 10 = problematic)

Multiple Linear Regression

  • Use: ONE continuous outcome, MULTIPLE predictors
  • β coefficients are ADJUSTED — each β represents the effect of that predictor holding all others constant
  • Can control for confounders by including them in the model
  • Partial R²: Contribution of each predictor to explained variance

Example: Predicting birth weight - Y = birth weight (g) - X₁ = gestational age (weeks) - X₂ = maternal smoking (0/1) - X₃ = maternal BMI - β₁ = 150 means: each additional week of gestation → +150g birth weight (holding smoking and BMI constant) - β₂ = −200 means: smoking associated with 200g lower birth weight (holding gestational age and BMI constant)

Logistic Regression

  • Use: BINARY outcome (yes/no, alive/dead, disease/no disease)
  • Model: logit(p) = ln[p/(1−p)] = β₀ + β₁X₁ + ... + βₖXₖ
  • Exponentiated coefficients (e^βᵢ): Adjusted Odds Ratios (OR)
  • Interpretation of OR: e^βᵢ = change in odds of outcome for 1-unit increase in Xᵢ

Key outputs:

| Output | Interpretation |
|--------|----------------|
| OR (e^β) | Adjusted odds ratio |
| 95% CI for OR | Precision (if excludes 1 → significant) |
| Hosmer-Lemeshow test | Goodness-of-fit (p > 0.05 = good fit) |
| c-statistic (AUC) | Discriminatory ability |
| Pseudo-R² | McFadden, Nagelkerke |

Worked O&G example: Predicting preterm birth

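| Predictor | β | OR (e^β) | 95% CI | p |
|-----------|---|----------|--------|---|
| Smoking | 0.69 | 1.99 | 1.25–3.17 | 0.004 |
| Previous preterm | 1.39 | 4.01 | 2.10–7.66 | <0.001 |
| Multiple pregnancy | 1.10 | 3.00 | 1.40–6.43 | 0.005 |
| Maternal age (per year) | 0.02 | 1.02 | 0.98–1.06 | 0.29 |
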
  • Smoking doubles the odds of preterm birth (OR = 1.99, p = 0.004)
  • Previous preterm is strongest predictor (OR = 4.01)
  • Maternal age not significant (CI includes 1, p > 0.05)
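The link between β, the odds ratio and the p-value can be sketched from the numbers in the table. The code assumes a Wald test and back-calculates the standard error from the reported 95% CI, which is an approximation because the table values are rounded; the function names are illustrative.

```python
import math

def odds_ratio(beta):
    """Exponentiate a logistic regression coefficient to get the OR."""
    return math.exp(beta)

def wald_p_value(beta, ci_lower, ci_upper):
    """Two-sided Wald p-value, with the SE recovered from the 95% CI of the OR."""
    se = (math.log(ci_upper) - math.log(ci_lower)) / (2 * 1.96)
    z = beta / se
    return math.erfc(abs(z) / math.sqrt(2))

predictors = {
    "Smoking": (0.69, 1.25, 3.17),
    "Previous preterm": (1.39, 2.10, 7.66),
    "Multiple pregnancy": (1.10, 1.40, 6.43),
    "Maternal age (per year)": (0.02, 0.98, 1.06),
}
for name, (beta, lower, upper) in predictors.items():
    print(f"{name}: OR = {odds_ratio(beta):.2f}, p ~ {wald_p_value(beta, lower, upper):.3f}")
```

Small disagreements with the tabulated p-values (for example for maternal age) are expected, since β and the CI limits are given to only two decimal places.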

Cox Proportional Hazards (See also Section 9)

Model: h(t) = h₀(t) × exp(β₁X₁ + β₂X₂ + ... + βₖXₖ)

  • h(t) = hazard at time t
  • h₀(t) = baseline hazard (when all X = 0)
  • exp(βᵢ) = Hazard Ratio (HR)
  • Proportional hazards assumption: HR is constant over time

6. Risk & Effect Measures

6.1 The 2×2 Table — Foundation

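| | Outcome + (Disease) | Outcome − (No disease) | Total |
|---|---------------------|------------------------|-------|
| Exposed + | a | b | a + b |
| Exposed − | c | d | c + d |
| Total | a + c | b + d | N |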

6.2 Definitions and Formulas — Complete

| Measure | Abbreviation | Formula | Interpretation |
|---------|--------------|---------|----------------|
| Risk in exposed | Rₑ | a / (a + b) | Probability of outcome if exposed |
| Risk in unexposed | R₀ | c / (c + d) | Probability of outcome if not exposed |
| Odds in exposed | Oₑ | a / b | Ratio of outcome happening to not happening in exposed |
| Odds in unexposed | O₀ | c / d | Ratio of outcome happening to not happening in unexposed |
| Risk Ratio / Relative Risk | RR | Rₑ / R₀ | How many times more likely outcome is in exposed vs unexposed |
| Odds Ratio | OR | (a/b) / (c/d) = ad / bc | Odds of outcome in exposed vs unexposed (numerically equal to the odds of exposure in cases vs controls) |
| Attributable Risk | AR | Rₑ − R₀ | Excess risk due to exposure |
| Attributable Risk Fraction | ARF | (Rₑ − R₀) / Rₑ = (RR−1)/RR | Proportion of risk in exposed due to exposure |
| Population Attributable Risk | PAR | R_total − R₀ | Excess risk in total population |
| Population Attributable Fraction | PAF | (R_total − R₀) / R_total | Proportion of population disease due to exposure |
| Absolute Risk Reduction | ARR | Control risk − Treatment risk (if treatment reduces risk) | AR with the sign reversed (treatment perspective) |
| Number Needed to Treat | NNT | 1 / ARR | Number needed to treat to prevent one outcome |
| Number Needed to Harm | NNH | 1 / AR (if harmful) | Number exposed to cause one adverse outcome |

6.3 Worked Examples from O&G

Example 1: VTE Prevention with LMWH

| | VTE | No VTE | Total |
|---|-----|--------|-------|
| LMWH | 5 | 495 | 500 |
| No LMWH | 20 | 480 | 500 |

  • Rₑ = 5/500 = 0.01 (1%)
  • R₀ = 20/500 = 0.04 (4%)
  • RR = 0.01/0.04 = 0.25 → LMWH reduces VTE risk by 75%
  • AR (ARR) = |0.01 − 0.04| = 0.03 (3%) → absolute risk reduction
  • RRR (relative risk reduction) = (0.04−0.01)/0.04 = 0.75 (75%) → same as 1−RR
  • NNT = 1/0.03 = 33.3 → 34 women need LMWH to prevent one VTE
  • OR = (5×480)/(495×20) = 2400/9900 = 0.24 → similar to RR because VTE is rare
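All of these measures come from the same four cell counts, so one small function covers them. This is a sketch using the a, b, c, d layout of the 2×2 table in 6.1; the function name and dictionary keys are illustrative, and it assumes the two risks differ (otherwise 1/ARR is undefined).

```python
import math

def risk_measures(a, b, c, d):
    """a, b = exposed with/without outcome; c, d = unexposed with/without outcome."""
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    risk_difference = risk_exposed - risk_unexposed
    return {
        "risk_exposed": risk_exposed,
        "risk_unexposed": risk_unexposed,
        "RR": risk_exposed / risk_unexposed,
        "OR": (a * d) / (b * c),
        "risk_difference": risk_difference,
        # NNT if the exposure is protective, NNH if harmful; rounded up by convention
        "NNT_or_NNH": math.ceil(1 / abs(risk_difference)),
    }

lmwh = risk_measures(a=5, b=495, c=20, d=480)
print(lmwh)

# Common outcome: the OR moves well away from the RR
common = risk_measures(a=80, b=20, c=60, d=40)
print(f"RR = {common['RR']:.2f}, OR = {common['OR']:.2f}")
```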

Example 2: Smoking and Preterm Birth

| | Preterm | Term | Total |
|---|---------|------|-------|
| Smoker | 200 | 4,800 | 5,000 |
| Non-smoker | 100 | 4,900 | 5,000 |

  • RR = 0.04/0.02 = 2.0
  • OR = (200×4900)/(100×4800) = 980,000/480,000 = 2.04
  • OR ≈ RR because preterm birth is relatively uncommon here (3% overall); the approximation is good but not perfect
  • AR = 0.04 − 0.02 = 0.02 (2%)
  • ARF = (2−1)/2 = 50% → half of preterm births in smokers are attributable to smoking
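  • PAF = (0.03−0.02)/0.03 = 33% → one-third of all preterm births in this study population (where 50% of women smoke) are attributable to smoking; the PAF is smaller when the exposure is less prevalent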
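The attributable fractions can be sketched in code. The PAF here is written with Levin's formula, PAF = p(RR−1) / [1 + p(RR−1)], which is algebraically the same as (R_total − R₀)/R_total when p is the proportion exposed. The 10% smoking prevalence in the second call is a hypothetical value chosen only to show how the PAF depends on exposure prevalence.

```python
def attributable_fractions(risk_exposed, risk_unexposed, prevalence_exposure):
    """Return (RR, ARF, PAF) for given risks and exposure prevalence."""
    rr = risk_exposed / risk_unexposed
    arf = (rr - 1) / rr                          # fraction of risk in the exposed
    excess = prevalence_exposure * (rr - 1)
    paf = excess / (1 + excess)                  # Levin's formula
    return rr, arf, paf

# Study population from the example: half of the women smoke
rr, arf, paf = attributable_fractions(0.04, 0.02, prevalence_exposure=0.5)
print(f"RR = {rr:.1f}, ARF = {arf:.0%}, PAF = {paf:.0%}")

# Hypothetical population where only 10% smoke: same RR and ARF, smaller PAF
_, _, paf_low = attributable_fractions(0.04, 0.02, prevalence_exposure=0.10)
print(f"PAF at 10% exposure prevalence = {paf_low:.1%}")
```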

6.4 Risk Ratio vs Odds Ratio — The Rare Disease Assumption

When disease is rare (prevalence < 10%): - OR ≈ RR - OR can be interpreted as RR in case-control studies

When disease is common:
  • The OR exaggerates the RR (it lies further from 1)
  • OR > RR when RR > 1, and OR < RR when RR < 1
  • The more common the disease, the greater the divergence

Proof that OR ≈ RR when a << a+b and c << c+d: - RR = [a/(a+b)] / [c/(c+d)] - OR = (a/b) / (c/d) = ad/bc - If a << a+b then a/(a+b) ≈ a/b - If c << c+d then c/(c+d) ≈ c/d - Therefore RR ≈ (a/b) / (c/d) = OR

Clinical example where OR and RR diverge:

| | Disease + | Disease − | Total | Risk |
|---|-----------|-----------|-------|------|
| Exposed | 80 | 20 | 100 | 0.80 |
| Unexposed | 60 | 40 | 100 | 0.60 |

  • RR = 0.80/0.60 = 1.33
  • OR = (80×40)/(20×60) = 3200/1200 = 2.67
  • OR is TWICE RR! Common disease → OR is a very poor approximation.

6.5 Number Needed to Treat (NNT) — Detailed

Formula: NNT = 1 / ARR

Where ARR = |Risk_control − Risk_treatment|

Important properties: - Lower NNT = more effective treatment - NNT always rounded UP to nearest integer - NNT depends on BASELINE RISK — same RR gives different NNT depending on baseline

Example of NNT dependence on baseline risk: - A treatment reduces the risk of an outcome by 50% (RR = 0.50)

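| Baseline risk → risk on treatment | ARR | NNT |
|-----------------------------------|-----|-----|
| 10% → 5% | 5% | 20 |
| 1% → 0.5% | 0.5% | 200 |
| 0.1% → 0.05% | 0.05% | 2000 |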

Same RR (50% reduction) but NNT ranges dramatically. This is why NNT must be reported with baseline risk context.
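The dependence of NNT on baseline risk can be tabulated with a short sketch. Exact fractions are used so that rounding up is not disturbed by floating-point error; the function name `nnt` is illustrative.

```python
import math
from fractions import Fraction

def nnt(baseline_risk, relative_risk):
    """NNT = 1 / ARR, where ARR = baseline risk x (1 - RR); rounded up."""
    arr = baseline_risk * (1 - relative_risk)
    return math.ceil(1 / arr)

rr = Fraction(1, 2)  # treatment halves the risk
baselines = [Fraction(10, 100), Fraction(1, 100), Fraction(1, 1000)]
results = [nnt(baseline, rr) for baseline in baselines]

for baseline, n in zip(baselines, results):
    print(f"baseline {float(baseline):.2%}: NNT = {n}")
```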

NNT for harm (NNH): - NNH = 1 / AR (when exposure increases risk) - Example: Aspirin prevents pre-eclampsia (NNT = 50) but increases bleeding (NNH = 200 for minor bleeding, NNH = 1000 for major) - Net benefit: When NNT < NNH (more people helped than harmed) - Benefit-harm ratio: NNH/NNT

6.6 Incidence vs Prevalence

| Measure | Definition | Formula | When used |
|---------|------------|---------|-----------|
| Point prevalence | Proportion of population with disease at a specific time | Existing cases / Total population | Cross-sectional studies |
| Period prevalence | Proportion with disease during a time period | Cases in period / Population | Chronic diseases |
| Cumulative incidence | Proportion of at-risk population who develop disease over time | New cases / At-risk population at start | Cohort studies |
| Incidence rate | Number of new cases per person-time | New cases / Total person-time at risk | When follow-up varies |

Relationship: Prevalence ≈ Incidence × Average duration of disease (an approximation that holds when incidence and duration are stable and prevalence is low)
  • For chronic diseases (long duration): high prevalence despite moderate incidence
  • For acute diseases (short duration, fatal or curable): low prevalence despite possibly high incidence

Example: - Ovarian cancer: Incidence ~20/100,000/year, 5-year survival ~45% → prevalence ~90/100,000 - Endometriosis: Incidence unclear (difficult to diagnose), prevalence ~10% in reproductive-age women (long duration → high prevalence)

6.7 Hazard Ratio — More Detail

  • From Cox proportional hazards regression
  • Interpretation: The instantaneous risk of the event at any time in one group relative to another
  • HR = 1: No difference
  • HR < 1: Reduced hazard (protective)
  • HR > 1: Increased hazard (risk factor)
  • Not a simple risk ratio — it's a ratio of hazards that applies across time (proportional hazards assumption)

HR vs RR: - RR compares cumulative incidence at a specific time point - HR compares the instantaneous rate of the event at any time - HR is more appropriate for time-to-event data with varying follow-up - If proportional hazards hold, HR is constant over time


7. Statistical Bias & Confounding

7.1 Classification of Bias

                         ┌──────────┐
                         │   Bias   │
                         └────┬─────┘
                              │
         ┌────────────────────┼────────────────────┐
         │                    │                    │
   ┌─────┴─────┐        ┌─────┴─────┐        ┌─────┴─────┐
   │ Selection │        │Information│        │Confounding│
   │   Bias    │        │   Bias    │        │ (not true │
   └───────────┘        └───────────┘        │   bias;   │
                                             │ treatment │
                                             │  effect)  │
                                             └───────────┘

7.2 Selection Bias — Detailed with O&G Examples

Definition: Systematic error due to the way participants are selected for a study or due to differential participation/follow-up.

| Type | Mechanism | O&G Example |
|------|-----------|-------------|
| Sampling bias | Sample not representative of target population | Studying postnatal depression in an affluent area → underestimates prevalence |
| Referral (centripetal) bias | Tertiary centres see sicker patients | Studying outcomes of placenta praevia at a teaching hospital → higher mortality |
| Volunteer bias | Volunteers differ systematically from non-volunteers | Women who join a menopause research study are healthier and more health-conscious |
| Healthy worker effect | Workers healthier than general population | Midwives have lower mortality than age-matched women in general population |
| Non-response bias | Those who respond differ from those who don't | Postal survey of incontinence: those most affected are more likely to respond → overestimates prevalence |
| Attrition bias (loss to follow-up) | Dropouts differ from completers | In a cohort of high-risk pregnancies, those who drop out may have worse outcomes → biased if differential |
| Berkson's bias | Hospital controls differ from general population | Studying association between oral contraceptives and DVT using hospital controls: controls may use OCPs at different rates |
| Survival (Neyman) bias | Only survivors included | Cross-sectional study of MI: fatal cases missed → underestimates severity |
| Incidence-prevalence (Neyman) bias | Prevalent cases differ from incident | Studying ovarian cancer: prevalent cases are longer-term survivors → different risk factor profile |
| Immortal time bias | Time before exposure counted as exposed | Studying survival after surgery: if time from diagnosis to surgery counted as postsurgical survival, it's "immortal" (patient alive by definition) |
| Detection bias | More intensive surveillance in one group | Women on HRT have more mammograms → more breast cancer detected (screening effect, not causation) |

Immortal Time Bias — Detailed

An important MRCOG concept. Immortal time bias occurs when there is a period of follow-up during which the outcome cannot occur, and this time is misclassified.

Classic example: Study of whether screening for cervical cancer reduces mortality. - Women who attend screening (exposed) are compared to non-attendees - The time between the invitation and the actual screening result is "immortal" — women had to survive to be screened - If this immortal time is counted as "screened" time, it biases results in favour of screening (screened women appear to live longer) - Solution: Use time-dependent exposure or start follow-up at time of screening decision, not screening result

7.3 Information Bias (Measurement Bias) — Detailed

Definition: Systematic error in measuring exposure, outcome, or covariates.

| Type | Mechanism | O&G Example |
|------|-----------|-------------|
| Recall bias | Differential recall between groups | Case-control study of miscarriage: cases recall more exposures than controls |
| Observer (ascertainment) bias | Researcher's expectation influences measurement | Knowing which group receives active treatment may influence interpretation of ultrasound measurements |
| Detection (verification) bias | Systematic difference in outcome ascertainment | More intensive follow-up in treatment group → more outcomes detected |
| Lead-time bias | Early diagnosis falsely extends survival | Screening: survival appears longer even if death occurs at same time |
| Publication bias | Positive studies more likely published | Meta-analyses overestimate effect if negative studies unpublished |
| Reporting bias | Differential outcome reporting | Participants on placebo may report more symptoms; doctors may report outcomes more carefully in one group |
| Interviewer bias | Differential questioning | Interviewer probes cases more thoroughly about exposures |
| Social desirability bias | Participants give socially acceptable answers | Underreporting smoking, alcohol in pregnancy |
| Hawthorne effect | Behaviour changes because being observed | Women may adhere better to medications when in a trial |
| Measurement error bias | Inaccurate measurement tool | Using a poorly calibrated sphygmomanometer |

Recall Bias — Detailed

The most common bias tested in MRCOG for case-control studies.

Mechanism: - Mothers of babies with malformations (cases) search their memory for potential causes → more likely to recall medication use, infections, stress - Mothers of healthy babies (controls) have less motivation to recall → more likely to forget

Effect: Differential recall usually biases the OR away from the null (cases over-report exposures relative to controls, so the association is spuriously exaggerated)

Minimisation: - Use objective records (prescription databases, medical records) rather than recall - Blinding interviewers to case/control status - Use standardised, validated questionnaires - Use a "memory anchor" (e.g., calendar of significant events)

7.4 Confounding — Complete Details

Definition: A third variable (confounder) that distorts the relationship between exposure and outcome because it is associated with BOTH the exposure and the outcome and is NOT on the causal pathway.

Criteria for a Confounder — Three Conditions

  1. Associated with the exposure in the study population
  2. An independent risk factor for the outcome (among the unexposed)
  3. NOT an intermediate (mediator) on the causal pathway between exposure and outcome
       ┌─────────────────┐
       │   Confounder    │
       └┬───────────────┬┘
        │               │
        ▼               ▼
    Exposure ───?───▶ Outcome

Not a confounder (it IS a mediator):

  Exposure ──────────▶ Mediator ──────────▶ Outcome

Example (confounder): - Exposure: Drinking coffee → Outcome: Pancreatic cancer - Confounder: Smoking (associated with coffee drinking AND causes pancreatic cancer) - If we don't adjust for smoking, we might wrongly attribute the cancer risk to coffee

Classic O&G Examples of Confounding

| Study claim | True relationship | Confounder |
|-------------|-------------------|------------|
| "HRT reduces coronary heart disease" | HRT users healthier → lower CHD | Socioeconomic status, health awareness |
| "Maternal age causes Down's syndrome" | Chromosomal non-disjunction increases with age | AGE IS THE EXPOSURE: this is causal, not confounding! |
| "Coffee causes miscarriage" | Coffee drinkers more likely to be older, smoke | Smoking, maternal age |
| "Caesarean section causes asthma" | Children born by CS have more asthma | Indication for CS (maternal obesity, preterm) may itself be associated with asthma |
| "Fertility treatment causes cancer" | Women who have IVF may have different cancer surveillance | Underlying infertility (itself a risk factor for some cancers) |

Simpson's Paradox — Detailed

A special case of confounding where a trend appears in several groups but reverses or disappears when groups are combined.

Classic medical example: Kidney stone treatment

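| Stone size | Treatment A success rate | Treatment B success rate |
|------------|--------------------------|--------------------------|
| Small stones | 93% (81/87) | 87% (234/270) |
| Large stones | 73% (192/263) | 69% (55/80) |
| Overall | 78% (273/350) | 83% (289/350) |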

Paradox: Treatment A is better for BOTH small AND large stones, but Treatment B appears better overall!

Explanation: Treatment A was more often used for large stones (which have worse prognosis). Stone size is a confounder — associated with treatment choice (A used more for large) AND outcome (large stones have worse success). When you ignore stone size (combine groups), the confounding produces the paradoxical reversal.

Take-home: Always consider whether there might be a confounder creating a Simpson's paradox. Stratify by key confounders.
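The reversal can be verified directly from the counts in the kidney stone table. A minimal sketch; the data structure and names are illustrative.

```python
# (successes, total) for each treatment within each stone-size stratum
data = {
    "small": {"A": (81, 87), "B": (234, 270)},
    "large": {"A": (192, 263), "B": (55, 80)},
}

def success_rate(successes, total):
    return successes / total

# Stratum-specific success rates
stratum_rates = {
    size: {t: success_rate(*counts) for t, counts in arms.items()}
    for size, arms in data.items()
}

# Crude (pooled) success rates, ignoring stone size
overall_rates = {}
for treatment in ("A", "B"):
    successes = sum(data[size][treatment][0] for size in data)
    total = sum(data[size][treatment][1] for size in data)
    overall_rates[treatment] = success_rate(successes, total)

for size, rates in stratum_rates.items():
    print(f"{size}: A = {rates['A']:.0%}, B = {rates['B']:.0%}")
print(f"overall: A = {overall_rates['A']:.0%}, B = {overall_rates['B']:.0%}")
```

Treatment A wins within each stratum yet loses in the pooled comparison, because most of its patients (263 of 350) had large stones.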

Methods to Control Confounding

| Method | When used | How it works | Strengths | Weaknesses |
|--------|-----------|--------------|-----------|------------|
| Randomisation | RCTs | Random allocation balances confounders | Gold standard; balances known AND unknown confounders | Not always feasible or ethical |
| Restriction | Any study | Limit to one level of confounder (e.g., only non-smokers) | Simple; eliminates confounding by restricted variable | Limits generalisability; may not be feasible if confounder is common |
| Matching | Case-control, cohort | Select controls/comparison with same confounder levels | Controls for confounding | Cannot match too many variables; over-matching reduces efficiency; can't assess matched variables as risk factors |
| Stratification | Any study | Analyse within strata, then pool (Mantel-Haenszel) | Simple to implement | Cannot handle many confounders; continuous variables need categorisation |
| Multivariable regression | Any study | Adjust statistically | Can handle many confounders; continuous and categorical | Assumptions about model form; cannot adjust for unmeasured confounders |
| Standardisation | Comparing populations | Apply standard weights | Direct or indirect; common in epidemiology | Only adjusts for measured confounders |
| Propensity score | Observational studies | Probability of exposure given confounders; match/stratify/weight by PS | Reduces many confounders to single score | Only measured confounders; requires large n |
| Instrumental variable | Natural experiments | Variable associated with exposure but not outcome (except through exposure) | Can handle unmeasured confounders | Difficult to find valid instrument |
| Inverse probability weighting | Longitudinal studies | Weight by inverse of probability of remaining in study | Handles attrition bias | Depends on correct model for weights |

Mantel-Haenszel Odds Ratio (Stratified Analysis)

Formula for stratified 2×2 tables: OR_MH = Σ(aᵢdᵢ/nᵢ) / Σ(bᵢcᵢ/nᵢ)

Where i indexes strata and nᵢ is the total in stratum i.

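Comparing crude vs adjusted OR:
  • If the crude OR differs materially from the adjusted OR (a change of more than about 10% is a common rule of thumb) → confounding present
  • If crude OR ≈ adjusted OR → no meaningful confounding by the stratified variable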
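The crude and Mantel-Haenszel odds ratios can be compared using the kidney stone data from the Simpson's paradox example, treating Treatment A as the "exposure" and treatment success as the "outcome". This reuse of the data and the function names are illustrative choices, not part of the original example.

```python
def odds_ratio(a, b, c, d):
    """Crude OR from a single 2x2 table."""
    return (a * d) / (b * c)

def mantel_haenszel_or(strata):
    """OR_MH = sum(a*d/n) / sum(b*c/n) over strata of (a, b, c, d)."""
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return numerator / denominator

# Each stratum: (A success, A failure, B success, B failure)
strata = [
    (81, 6, 234, 36),    # small stones
    (192, 71, 55, 25),   # large stones
]

# Collapse the strata into one table for the crude estimate
a, b, c, d = (sum(cells) for cells in zip(*strata))
crude_or = odds_ratio(a, b, c, d)
adjusted_or = mantel_haenszel_or(strata)
print(f"crude OR = {crude_or:.2f}, Mantel-Haenszel OR = {adjusted_or:.2f}")
```

The crude OR is below 1 (A looks worse) while the stratum-adjusted OR is above 1 (A is better), which is the confounding by stone size expressed on the odds ratio scale.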

Residual Confounding

Complete confounding adjustment is often impossible because: - Confounders may be measured with error (residual confounding) - Unmeasured confounders exist (unmeasured confounding) - Confounders may change over time (time-varying confounding)

Sensitivity analysis: How strong would an unmeasured confounder need to be to explain away the observed association? (E-value)

7.5 Effect Modification (Interaction)

Different from confounding!

| Aspect | Confounding | Effect Modification |
|--------|-------------|---------------------|
| Type | Bias to be minimised | Real biological phenomenon |
| What it is | Distortion of exposure-outcome relationship | Effect of exposure differs by level of third variable |
| Deal with it | Remove/adjust in analysis | REPORT it: describe effect separately for each subgroup |
| Example | Smoking confounds coffee-pancreatic cancer | Aspirin effect on pre-eclampsia may differ by BMI |

Testing for effect modification: 1. Stratified analysis: Calculate RR/OR separately for each stratum 2. Interaction term: Include product term in regression model (X₁ × X₂) 3. Statistical test: p-value for interaction (be cautious — underpowered for interaction)

Multiplicative vs Additive Interaction: - Multiplicative scale: Is the combined effect greater than the product of individual effects? (RR or OR scale) - Additive scale: Is the combined effect greater than the sum of individual effects? (Risk difference scale) - Public health importance: Additive scale often more relevant (synergy index)

O&G Example: Does the effect of smoking on preterm birth differ by maternal age?

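| Maternal age | Smoker | Non-smoker | RR (smoking vs not) |
|--------------|--------|------------|---------------------|
| Age < 35 | 5% | 3% | 1.67 |
| Age ≥ 35 | 10% | 5% | 2.00 |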

The RR is 1.67 in younger and 2.00 in older women → possible effect modification by age. The absolute risk increase (AR) also differs: 2% vs 5%.

7.6 Confounding by Indication

An important concept for treatment studies.

Definition: The indication for a treatment is itself associated with the outcome. Patients who receive a treatment are systematically different from those who don't because of WHY they were treated.

Example: Studying whether magnesium sulphate prevents cerebral palsy in preterm infants. - Women who receive MgSO₄ are those in preterm labour - Preterm labour itself is a risk factor for cerebral palsy - Without randomisation, any difference in CP rates could be due to the underlying indication (preterm labour), not the treatment

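Solution: Randomisation (e.g., the ACTOMgSO₄ and BEAM trials of antenatal magnesium sulphate for fetal neuroprotection; the Magpie trial randomised magnesium sulphate for pre-eclampsia, a different indication). If randomisation is not possible: propensity score methods, indication-based restriction, or multivariable adjustment (though residual confounding likely remains).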

7.7 Protopathic Bias

Definition: Treatment started for early symptoms of the outcome before the outcome is formally diagnosed.

Example: Studying whether NSAIDs cause miscarriage. - Women may take NSAIDs for pelvic pain - Pelvic pain might be an early symptom of miscarriage - Association between NSAID use and miscarriage could be due to NSAIDs treating early miscarriage symptoms (reverse causality)

Solution: Exclude medication use in the period immediately before outcome (lag window), or use new-user designs.


8. Evidence-Based Medicine

8.1 Levels of Evidence — Oxford CEBM (March 2009)

The traditional 5-level system (still widely cited in the O&G literature and in guidelines):

Level Therapy / Prevention Prognosis Diagnosis
1a SR of RCTs (with homogeneity) SR of inception cohort studies SR of diagnostic studies (homogeneous, with gold standard)
1b Individual RCT (narrow CI) Individual inception cohort (≥80% follow-up) Validating cohort with gold standard
1c All or none All or none case series SpPin or SnNOut
2a SR of cohort studies SR of retrospective cohorts / untreated controls SR of cross-sectional studies
2b Individual cohort study (including low-quality RCT) Retrospective cohort / follow-up of RCT controls Cross-sectional with gold standard
2c Outcomes research / ecological studies "Outcomes" research
3a SR of case-control studies SR of case-control studies
3b Individual case-control study Non-consecutive / no gold standard
4 Case series / poor quality cohort Case series / poor quality cohort Case-control / poor reference
5 Expert opinion Expert opinion Expert opinion

Key: "All or none" applies when all patients died before the treatment became available but some now survive on it, or when some patients died before the treatment became available but none now die on it.

Oxford 2011 revision: Simplified to 5 levels based on the type of question and the quality of evidence, but the 2009 system is still widely cited.

8.2 GRADE System — Complete

Grading of Recommendations Assessment, Development and Evaluation

Quality of Evidence

Level Definition Symbol
High Further research VERY UNLIKELY to change confidence in estimate ⊕⊕⊕⊕
Moderate Further research LIKELY to have important impact ⊕⊕⊕○
Low Further research VERY LIKELY to have important impact ⊕⊕○○
Very low Any estimate is very uncertain ⊕○○○

Factors that Lower Quality

Factor How it works
Risk of bias Study design limitations (no blinding, no allocation concealment, etc.)
Inconsistency Unexplained heterogeneity (I² > 50%, p < 0.10) across studies
Indirectness PICO differences (population, intervention, comparator, outcome)
Imprecision Wide CIs crossing clinically important thresholds
Publication bias Suspicion that negative studies are missing

Downgrading rules: - Start at HIGH for RCTs, LOW for observational - Downgrade 1 level for serious concern, 2 for very serious concern - Maximum downgrade: 3 levels

Factors that Raise Quality (Observational Studies)

Factor Criteria
Large effect RR > 2 or < 0.5 (up 1 level); RR > 5 or < 0.2 (up 2 levels)
Dose-response Clear biological gradient demonstrated
Confounding All plausible confounders would reduce the observed effect
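
The downgrading and upgrading rules above can be sketched as simple arithmetic on a four-step scale. This is only an illustration of the bookkeeping: the function name and its arguments are hypothetical, and real GRADE ratings involve judgement for each domain rather than a mechanical count.

```python
LEVELS = ["very low", "low", "moderate", "high"]


def grade_certainty(design, downgrades=0, upgrades=0):
    """Return the certainty level after applying level changes.

    design: "rct" starts at high; anything else is treated as observational
    and starts at low. downgrades and upgrades are total numbers of levels.
    """
    start = LEVELS.index("high") if design == "rct" else LEVELS.index("low")
    score = start - downgrades + upgrades
    # Clamp to the scale: nothing goes below "very low" or above "high".
    return LEVELS[max(0, min(score, len(LEVELS) - 1))]


# RCT evidence with serious risk of bias (down 1 level).
print(grade_certainty("rct", downgrades=1))
# Observational evidence with a very large effect (up 2 levels).
print(grade_certainty("observational", upgrades=2))
```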

Strength of Recommendation

Strength Wording Interpretation
Strong (1) "We recommend..." / "Offer" Most patients should receive the intervention
Weak (2) "We suggest..." / "Consider" Different choices appropriate for different patients; requires shared decision-making

Implications: - Strong recommendation: Can be adopted as policy in most situations - Weak recommendation: Policy-making requires substantial debate and stakeholder involvement

8.3 Systematic Reviews & Meta-Analysis — Complete

Definitions

Term Definition
Systematic review A review of a clearly formulated question that uses systematic and explicit methods to identify, select, and critically appraise relevant research, and to collect and analyse data from the studies that are included in the review
Meta-analysis The statistical combination of results from two or more separate studies
Narrative review Non-systematic summary of literature (not evidence-based)

Steps of a Systematic Review

  1. Formulate question (using PICO: Population, Intervention, Comparison, Outcome)
  2. Pre-register protocol (PROSPERO)
  3. Systematic search of multiple databases (MEDLINE, EMBASE, CENTRAL, CINAHL)
  4. Screen and select studies against pre-specified criteria (PRISMA flow diagram)
  5. Assess quality/risk of bias of included studies (Cochrane Risk of Bias tool for RCTs)
  6. Extract data (double-extraction recommended)
  7. Analyse (meta-analysis if appropriate)
  8. Interpret and report

PRISMA Flow Diagram

Records identified through database searching (n=...)
    Additional records identified through other sources (n=...)
                  Records after duplicates removed (n=...)
                  Records screened (n=...)
    Records excluded (n=...)
                  Full-text articles assessed for eligibility (n=...)
    Full-text articles excluded, with reasons (n=...)
                  Studies included in qualitative synthesis (n=...)
                  Studies included in quantitative synthesis (meta-analysis) (n=...)

Fixed Effect vs Random Effects Meta-Analysis

Feature Fixed Effect Random Effects
Assumption All studies estimate the SAME true effect Studies estimate DIFFERENT true effects (drawn from a distribution)
Implication Differences due to chance only Differences due to chance + real variation
Weighting By inverse variance (precision) By inverse of (within-study variance + between-study variance τ²)
CI Narrower Wider (if heterogeneity present)
Interpretation "The effect" (single value) "The average effect"
When to use Minimal heterogeneity Moderate/substantial heterogeneity

Which is more conservative? Random effects when heterogeneity > 0. But if there is no heterogeneity, they give identical results.

DerSimonian and Laird method (the most common random effects approach):

wᵢ* = 1 / (sᵢ² + τ²)

Where wᵢ* is the random-effects weight of study i, sᵢ² is its within-study variance, and τ² is the between-study variance (the estimate of heterogeneity).
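
A minimal sketch of fixed-effect and DerSimonian and Laird random-effects pooling follows. The study results are hypothetical, the function names are invented for illustration, and the standard errors are back-calculated from the 95% confidence intervals on the log scale.

```python
import math


def log_effect_and_variance(rr, lower, upper):
    """Convert an RR and its 95% CI to a log RR and its within-study variance."""
    se = (math.log(upper) - math.log(lower)) / (2 * 1.96)
    return math.log(rr), se ** 2


def pool(effects, variances, tau2=0.0):
    """Inverse-variance pooled estimate; tau2 = 0 gives the fixed-effect model."""
    weights = [1 / (v + tau2) for v in variances]
    estimate = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    return estimate, math.sqrt(1 / sum(weights))


def dersimonian_laird_tau2(effects, variances):
    """Method-of-moments estimate of tau^2; also returns Cochran's Q."""
    w = [1 / v for v in variances]
    fixed, _ = pool(effects, variances)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    df = len(effects) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    return max(0.0, (q - df) / c), q


# Hypothetical studies: (RR, lower 95% limit, upper 95% limit).
studies = [
    (0.80, 0.60, 1.07),
    (1.50, 1.10, 2.05),
    (1.10, 0.70, 1.73),
    (2.00, 1.40, 2.86),
    (1.30, 0.95, 1.78),
]
effects, variances = zip(*(log_effect_and_variance(*s) for s in studies))
tau2, q = dersimonian_laird_tau2(effects, variances)
fe_est, fe_se = pool(effects, variances)
re_est, re_se = pool(effects, variances, tau2)

for label, est, se in (("Fixed", fe_est, fe_se), ("Random", re_est, re_se)):
    low, high = math.exp(est - 1.96 * se), math.exp(est + 1.96 * se)
    print(f"{label}: RR = {math.exp(est):.2f} ({low:.2f} to {high:.2f})")
```

Because τ² is added to every study's variance, the random-effects weights are more equal across studies and the pooled confidence interval is at least as wide as the fixed-effect one.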

Heterogeneity — I² Statistic

I² = [(Q − df) / Q] × 100%

Where Q = chi-squared statistic for heterogeneity, df = degrees of freedom (# studies − 1)

I² value Interpretation
0% No observed heterogeneity
<25% Low heterogeneity
25–50% Moderate
50–75% Substantial
>75% Considerable

But also consider p-value for Q statistic: - p < 0.10 suggests significant heterogeneity (note: not p < 0.05!) - Important to explore potential sources of heterogeneity even if I² is modest

Exploring heterogeneity: 1. Subgroup analysis: Pre-specified subgroups (e.g., by study quality, population, intervention type) 2. Meta-regression: Regression exploring whether study-level characteristics explain heterogeneity 3. Sensitivity analysis: Excluding one study at a time (leave-one-out analysis)
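
The I² formula and a leave-one-out sensitivity analysis can be sketched as below. The log risk ratios and variances are hypothetical, the function names are invented for illustration, and negative I² values are truncated to 0%, which is the usual convention.

```python
def cochran_q(effects, variances):
    """Cochran's Q: weighted squared deviations from the fixed-effect estimate."""
    weights = [1 / v for v in variances]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    return sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))


def i_squared(q, n_studies):
    """I^2 = (Q - df) / Q x 100%, truncated at 0."""
    df = n_studies - 1
    if q <= 0:
        return 0.0
    return max(0.0, (q - df) / q) * 100


def leave_one_out(effects, variances):
    """I^2 recalculated with each study removed in turn."""
    results = []
    for i in range(len(effects)):
        ys = effects[:i] + effects[i + 1:]
        vs = variances[:i] + variances[i + 1:]
        results.append(i_squared(cochran_q(ys, vs), len(ys)))
    return results


# Hypothetical log risk ratios and their within-study variances.
log_rr = [-0.22, 0.41, 0.10, 0.69, 0.26]
var = [0.022, 0.025, 0.053, 0.033, 0.026]

overall = i_squared(cochran_q(log_rr, var), len(log_rr))
print(f"I^2 (all studies) = {overall:.0f}%")
for i, value in enumerate(leave_one_out(log_rr, var), start=1):
    print(f"I^2 without study {i} = {value:.0f}%")
```

A large fall in I² when one study is removed suggests that study is a major source of the heterogeneity and deserves closer inspection.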

Forest Plot — Detailed Interpretation

Components:

Study                         Weight   RR (95% CI)
────────                      ──────   ──────────
Smith 2010                    ██████   1.20 (0.85–1.55)
Jones 2012                    ███████  1.50 (1.10–2.00)
Lee 2013                      ████     1.10 (0.70–1.50)
Brown 2015                    ████████ 1.40 (1.05–1.75)
Patel 2017                    ██████   1.30 (0.95–1.65)
──────────────────────────────────────────────────────
Overall (I²=0%, p=0.56)      ◆        1.33 (1.17–1.49)

           0.5   1.0   1.5   2.0   2.5
           ◀── Favours control   Favours exposure ──▶

Reading a forest plot: 1. Each row = one study 2. Square = point estimate 3. Horizontal line = 95% CI 4. Square size = weight in meta-analysis (proportional to inverse variance) 5. Vertical line at 1 = null effect (for RR/OR/HR) 6. Diamond at bottom = summary estimate (width = 95% CI) 7. If diamond does not cross the null line → statistically significant

Funnel Plot & Publication Bias

Funnel plot: - X-axis: Effect size (RR or OR, usually plotted on a log scale) - Y-axis: Standard error (inverted, so larger studies sit at the top) - Each dot = one study

Interpretation: - Symmetric inverted funnel: No evidence of publication bias - Asymmetric (missing studies in bottom left): Possible publication bias (small negative studies missing) - Asymmetric (missing in bottom right): Other explanations (e.g., small studies with true larger effects)

Causes of asymmetry: 1. Publication bias: Small studies with null/negative results not published 2. True heterogeneity: Small studies have different populations/interventions 3. Poor methodology: Small studies have lower quality → biased effect estimates 4. Chance: Especially with few studies (<10)

Tests for publication bias: - Egger's test: Linear regression of effect size on standard error (p < 0.10 = asymmetry) - Begg's test: Rank correlation test - Trim-and-fill method: Imputes missing studies and adjusts summary estimate - Contour-enhanced funnel plot: Distinguishes publication bias from other causes
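
One common formulation of Egger's regression can be sketched as follows: the standardised effect (effect divided by its SE) is regressed on precision (1/SE), and an intercept away from zero suggests small-study asymmetry. The data are hypothetical, the function name is invented for illustration, and no p-value is calculated here.

```python
def egger_regression(effects, standard_errors):
    """Ordinary least squares of (effect / SE) on (1 / SE).

    Returns (intercept, slope). The intercept reflects funnel asymmetry;
    the slope estimates the underlying effect.
    """
    y = [e / se for e, se in zip(effects, standard_errors)]
    x = [1 / se for se in standard_errors]
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Hypothetical log risk ratios where smaller studies (larger SE) show larger effects.
log_rr = [0.10, 0.15, 0.30, 0.45, 0.60]
se = [0.05, 0.10, 0.20, 0.30, 0.40]
intercept, slope = egger_regression(log_rr, se)
print(f"Egger intercept = {intercept:.2f}, slope = {slope:.3f}")

# If every study reports the same effect, there is no asymmetry.
flat_intercept, flat_slope = egger_regression([0.2] * 5, se)
print(f"No small-study effect: intercept = {flat_intercept:.2f}")
```

In practice the intercept is tested against zero with a t-test, and the test has low power when fewer than about 10 studies are included.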

8.4 Critical Appraisal — CASP Tools

Key questions for any study:

Domain Key Questions
Validity Is the study design appropriate? Was bias minimised?
Results What is the effect size? How precise is it?
Applicability Can results be applied to my patients?

CASP Checklist for RCTs (abbreviated)

  1. Did the study address a clearly focused question?
  2. Was the assignment to treatment groups truly random?
  3. Were all participants properly accounted for at conclusion?
  4. Were participants, clinicians, and outcome assessors blinded?
  5. Were the groups similar at the start of the trial?
  6. Were groups treated equally (apart from intervention)?
  7. How large was the treatment effect?
  8. How precise was the estimate (CIs)?
  9. Can the results be applied to the local population?
  10. Were all clinically important outcomes considered?
  11. Are the benefits worth the harms and costs?

CONSORT Statement (RCT reporting)

Key items: - Methods: Eligibility criteria, randomisation, allocation concealment, blinding, sample size calculation - Results: Flow diagram (participant flow), baseline table (Table 1), outcomes (ITT analysis), harms - Discussion: Limitations, generalisability, interpretation

STROBE Statement (Observational studies)

22-item checklist covering: - Title and abstract - Introduction: Background, objectives - Methods: Study design, setting, participants, variables, data sources, bias, sample size - Results: Participants (flow diagram), descriptive data, outcome data, main results, other analyses - Discussion: Key results, limitations, interpretation, generalisability

PRISMA Statement (Systematic Reviews)

27-item checklist with flow diagram: - Title, abstract, structured summary - Rationale, objectives - Protocol registration, eligibility criteria, information sources, search strategy, selection process, data extraction, risk of bias, synthesis methods - Results: Study selection, characteristics, risk of bias, individual study results, synthesis - Discussion: Summary, limitations, conclusions

QUADAS-2 (Diagnostic accuracy studies)

Four domains: 1. Patient selection (was a consecutive or random sample used?) 2. Index test (was it performed and interpreted without knowledge of reference standard?) 3. Reference standard (is it likely to correctly classify the target condition?) 4. Flow and timing (appropriate interval between tests, all patients received reference standard?)

8.5 Using EBM in Practice — Fagan Nomogram

Pre-test probability → Post-test probability

Clinical example: 32-year-old woman whose pre-test risk of Down's syndrome is 1 in 150, who then has a further positive test (figures are illustrative)

  • Pre-test probability = 1/150 = 0.67%
  • Pre-test odds = 0.0067 / 0.9933 = 0.0067
  • Test positive: LR+ = 8 (illustrative value)
  • Post-test odds = 0.0067 × 8 = 0.0536
  • Post-test probability = 0.0536 / 1.0536 = 5.1% (about 1 in 20)

Using Fagan nomogram: Draw line from pre-test probability (0.67%) through LR (8) → post-test probability ~5%.

Clinical application: If post-test probability > invasive test threshold (~1/150), offer CVS/amniocentesis. If below, reassure.
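
The arithmetic behind the nomogram can be reproduced in a few lines. This sketch uses the pre-test probability (1 in 150) and LR+ (8) from the example above; the function name is hypothetical.

```python
def post_test_probability(pre_test_probability, likelihood_ratio):
    """Convert probability to odds, apply the likelihood ratio, convert back."""
    pre_test_odds = pre_test_probability / (1 - pre_test_probability)
    post_test_odds = pre_test_odds * likelihood_ratio
    return post_test_odds / (1 + post_test_odds)


p = post_test_probability(1 / 150, 8)
print(f"Post-test probability = {p:.1%} (about 1 in {round(1 / p)})")
```

Working in odds is essential: multiplying a probability directly by the likelihood ratio gives the wrong answer, and the error grows as the pre-test probability rises.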

8.6 Evidence-Based Guidelines in O&G

NICE guidelines: - Use GRADE for quality assessment - Recommendations: "Offer" (strong) vs "Consider" (weak) - Regular updates (usually 3–5 year cycle) - Include health economic modelling

RCOG Green-top Guidelines: - Use a SIGN-derived classification of evidence levels (1++ to 4; see 10.5) - Grade A, B, C, D recommendations - Topic-specific expert review

SIGN Guidelines: - Scottish Intercollegiate Guidelines Network - Similar approach to GRADE - Identify key clinical questions, systematic review, evidence tables

WHO guidelines: - Use GRADE - Consider global applicability, resource implications - Include "Good Practice Statements"


9. Survival Analysis

9.1 Key Concepts — Detailed

Survival analysis = statistical methods for analysing data where the outcome is the TIME until an event occurs.

Key features: - Time-to-event data: Not just whether event occurred, but WHEN - Censoring: Some subjects don't experience event during follow-up - Time-varying risk: Risk may change over time (higher shortly after treatment, etc.)

Applications in O&G: - Time to pregnancy (survival = time to conception) - Time to labour onset after induction - Time to recurrence of endometriosis after surgery - Time to death in ovarian cancer - Time to treatment failure in IVF - Duration of breastfeeding

9.2 Censoring — Complete Types

Type Definition Example
Right censoring Subject does NOT experience event by study end, or is lost to follow-up Patient with ovarian cancer alive at 5-year study endpoint
Left censoring Event occurred before study began (subject already had the event at entry) Time to first pregnancy — some women already pregnant at study entry
Interval censoring Event occurs between two known time points, but exact time unknown Annual screening: cancer detected between visits

Assumption for valid analysis: Censoring is non-informative — the reason for censoring is unrelated to the probability of experiencing the event.

Example of INFORMATIVE censoring: If patients with more aggressive ovarian cancer are more likely to drop out (move to hospice, stop attending follow-up), their censoring is related to the outcome → biased results.

9.3 Kaplan-Meier Method — Complete Details

Purpose: Estimate the survival function without assuming a particular distribution.

Method:

  1. Arrange event times in ascending order
  2. At each event time, calculate:
     - Number at risk just before the event (nᵢ)
     - Number who experienced the event (dᵢ)
     - Number censored between this event and the next
  3. S(t) = Πᵢ (nᵢ − dᵢ) / nᵢ, with the product taken over all event times tᵢ ≤ t, where nᵢ = number at risk just before tᵢ and dᵢ = events at tᵢ

Properties: - Step function (drops only at event times) - Horizontal segments between events - Tick marks indicate censored observations - Median survival = time when S(t) = 0.5 - 95% CI (Greenwood's formula) shown as dashed lines or shading
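
The product-limit calculation can be sketched in plain Python. The follow-up times below are hypothetical (months to recurrence, with 1 = recurrence and 0 = censored), and the function names are invented for illustration. Subjects censored at the same time as an event are counted as still at risk at that time, which is the usual convention.

```python
def kaplan_meier(times, events):
    """Return a list of (time, n_at_risk, n_events, survival) at each event time."""
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        leaving = sum(1 for tt, _ in data if tt == t)            # events + censored at t
        d = sum(1 for tt, e in data if tt == t and e == 1)      # events at t
        if d > 0:
            survival *= (n_at_risk - d) / n_at_risk
            curve.append((t, n_at_risk, d, survival))
        n_at_risk -= leaving
        i += leaving
    return curve


def median_survival(curve):
    """First event time at which S(t) falls to 0.5 or below (None if never)."""
    return next((t for t, _, _, s in curve if s <= 0.5), None)


# Hypothetical data: months of follow-up and whether recurrence was observed.
times = [6, 12, 12, 18, 24, 30, 36, 36, 48, 60]
events = [1, 1, 0, 1, 0, 1, 1, 0, 0, 0]

curve = kaplan_meier(times, events)
for t, n, d, s in curve:
    print(f"t = {t:>2} months: at risk = {n:>2}, events = {d}, S(t) = {s:.3f}")
print("Median recurrence-free survival:", median_survival(curve), "months")
```

Note that the curve only steps down at event times; censored subjects simply reduce the number at risk for later steps.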

Example: Time to recurrence of endometriosis after surgery

Recurrence-free survival
100% │─────────────────────────────────────
     │                                      ─────
 75% │                                            ─────
     │                                                   ─────
 50% │                                                            ─────
     │                                                                   ─────
 25% │                                                                          ─────
     │                                                                            ─────
  0% │──────────────────────┴──────────────────────┴──────────────────────┴─────▶ Time
    0        12         24         36         48         60  months

Censored observations represented as tick marks on the curve.

Kaplan-Meier by groups:

Survival
100% │─────── Treatment
     │         ─────────
 75% │                     ──────────
     │                                 ────── Control
 50% │                                         ──────
     │                                                 ──────
 25% │                                                        ──────
     │                                                               ──────
  0% │─────────────────────────────────────────────────────────────────────▶ Time

The log-rank test compares these two curves.

9.4 Log-Rank Test — Details

  • Non-parametric: No assumption about shape of survival curves
  • H₀: The survival functions are the same in all groups
  • H₁: At least one group differs
  • Calculation: Compares observed vs expected events at each time point, summed over all times
  • χ² = Σ[(O − E)² / E] across groups

Assumptions: - Non-informative censoring - Independence of survival times - The hazard ratio is roughly constant over time (proportional hazards — though log-rank is reasonably robust to violations)

Limitations: - Cannot adjust for confounders (use Cox regression instead) - Does not estimate the magnitude of difference (use Cox for HR) - If survival curves cross, log-rank has low power (use alternative tests: weighted log-rank, Peto-Peto, Fleming-Harrington)
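
A two-group log-rank test can be sketched as below. Note one assumption: this version uses the standard hypergeometric variance, (O − E)² / V, rather than the simpler Σ(O − E)² / E approximation quoted above; the two usually give similar answers. The data and function name are hypothetical.

```python
import math


def log_rank(times_a, events_a, times_b, events_b):
    """Return (observed_a, expected_a, chi_square, p_value) for two groups."""
    all_times = list(times_a) + list(times_b)
    all_events = list(events_a) + list(events_b)
    event_times = sorted({t for t, e in zip(all_times, all_events) if e == 1})

    observed_a = expected_a = variance = 0.0
    for t in event_times:
        n_a = sum(1 for x in times_a if x >= t)   # at risk in group A
        n_b = sum(1 for x in times_b if x >= t)   # at risk in group B
        d_a = sum(1 for x, e in zip(times_a, events_a) if x == t and e == 1)
        d_b = sum(1 for x, e in zip(times_b, events_b) if x == t and e == 1)
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)

    chi_square = (observed_a - expected_a) ** 2 / variance
    # Upper tail of chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi_square / 2))
    return observed_a, expected_a, chi_square, p_value


# Hypothetical survival times (months); every subject has the event.
control = [1, 2, 3]
treatment = [4, 5, 6]
o, e, chi2, p = log_rank(control, [1, 1, 1], treatment, [1, 1, 1])
print(f"Control: observed = {o:.0f}, expected = {e:.2f}")
print(f"Chi-square = {chi2:.2f}, p = {p:.3f}")
```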

9.5 Cox Proportional Hazards — Complete Details

Model: h(t|X) = h₀(t) × exp(β₁X₁ + β₂X₂ + ... + βₖXₖ)

Components: - Baseline hazard h₀(t): The hazard when all Xᵢ = 0 (can vary arbitrarily over time — hence "semi-parametric") - Proportional term exp(βX): Multiplicative effect of covariates on hazard (constant over time)

Interpretation of exp(β): - exp(β) = Hazard Ratio (HR) - HR > 1: increased hazard (worse survival) - HR < 1: decreased hazard (better survival) - HR = 1: no effect

Worked O&G example: Survival after ovarian cancer diagnosis

Predictor β HR 95% CI p
Stage III/IV vs I/II 1.39 4.01 2.50–6.43 <0.001
Suboptimal debulking 0.80 2.23 1.40–3.55 0.001
BRCA mutation −0.51 0.60 0.38–0.95 0.03
Age (per 10 years) 0.32 1.38 1.10–1.73 0.01

  • Advanced stage: about 4 times the hazard of death at any given time, adjusted for the other covariates (HR = 4.01)
  • BRCA mutation: 40% lower hazard (HR = 0.60)
  • Each 10-year increase in age: 38% higher hazard (HR = 1.38)
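
The hazard ratios in the table are simply exp(β), which can be confirmed directly. In the sketch below the coefficients are taken from the table; the standard error used for the confidence interval is an assumption, back-calculated from the published interval, because the table does not report it.

```python
import math

# Coefficients (beta) from the worked example above.
coefficients = {
    "Stage III/IV vs I/II": 1.39,
    "Suboptimal debulking": 0.80,
    "BRCA mutation": -0.51,
    "Age (per 10 years)": 0.32,
}

hazard_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}
for name, hr in hazard_ratios.items():
    print(f"{name}: HR = {hr:.2f}")


def hr_with_ci(beta, se):
    """Return (HR, lower, upper) using a 95% Wald interval on the log scale."""
    return (
        math.exp(beta),
        math.exp(beta - 1.96 * se),
        math.exp(beta + 1.96 * se),
    )


# Assumed SE for stage, back-calculated from the CI 2.50 to 6.43 in the table.
se_stage = (math.log(6.43) - math.log(2.50)) / (2 * 1.96)
hr, lower, upper = hr_with_ci(coefficients["Stage III/IV vs I/II"], se_stage)
print(f"Stage: HR = {hr:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

The interval is symmetric around β on the log scale, which is why it looks asymmetric around the HR itself.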

Proportional Hazards Assumption — Checking

The HR is constant over time. This is the critical assumption.

How to check: 1. Log-minus-log plot: Plot −ln[−ln(S(t))] vs time for each group — parallel lines = proportional hazards 2. Schoenfeld residuals: Plot against time — if slope ≈ 0, assumption holds 3. Test: Significance test of time-dependent covariates (p > 0.05 = assumption met)

If assumption violated: - Stratified Cox model: Stratify by the variable with non-proportional hazards - Time-varying covariates: Include interaction with time (t) - Extended Cox model: Allow HR to change at a specified time point - Alternative: Parametric survival models (Weibull, exponential, log-normal)

9.6 Parametric Survival Models

Model Hazard function When used
Exponential Constant hazard over time Simplest; rarely realistic
Weibull Monotonic (always increasing or decreasing) Flexible; includes exponential as special case
Gompertz Mortality rate increases exponentially Demography; older populations
Log-normal Hazard increases then decreases Biological processes with "burn-in"
Log-logistic Similar to log-normal with heavier tails Accelerated failure time models

9.7 Describing Survival Results

Median survival time: Time when survival probability = 50% - In O&G: Median time to pregnancy, median time to recurrence

Survival at specific time point: Proportion surviving at 1 year, 5 years, etc. - Example: 5-year survival in ovarian cancer ~45% (all stages combined)

Hazard Ratio from Cox model: Describes the relative hazard between groups, assumed constant across the entire follow-up


10. Specific Topics in O&G

10.1 Key Rates and Definitions

Rate Numerator Denominator Multiplier UK Approx.
Crude birth rate (CBR) Live births Mid-year population ×1000 ~11/1000
General fertility rate (GFR) Live births Women aged 15–44 ×1000 ~60/1000
Total fertility rate (TFR) Sum of ASFRs × 5 (for 5-year age bands; divide by 1000 if ASFRs are per 1000) Per woman ~1.5
Age-specific fertility rate (ASFR) Live births to women of age group Women in that age group ×1000 Varies
Perinatal mortality rate (PMR) Stillbirths + early neonatal deaths (<7 days) Total births (live births + stillbirths) ×1000 ~5/1000
Stillbirth rate Stillbirths (≥24 wks UK) Total births ×1000 ~3.8/1000
Neonatal mortality rate (NMR) Neonatal deaths (<28 days) Live births ×1000 ~2.5/1000
Early neonatal mortality Deaths (<7 days, i.e. days 0–6) Live births ×1000 ~1.5/1000
Late neonatal mortality Deaths (7–27 days) Live births ×1000 ~1.0/1000
Infant mortality rate (IMR) Deaths <1 year Live births ×1000 ~3.9/1000
Maternal mortality ratio (MMR) Maternal deaths Live births ×100,000 ~9/100,000
Maternal mortality rate Maternal deaths Women aged 15–49 ×100,000 Rarely used
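
The main exam trap with these rates is the denominator, so a short worked calculation may help. All counts below are hypothetical round numbers chosen for illustration, and the helper name is invented.

```python
def rate(numerator, denominator, multiplier=1000):
    """Generic rate: events per `multiplier` units of the denominator."""
    return numerator / denominator * multiplier


# Hypothetical annual counts.
live_births = 600_000
stillbirths = 2_400
early_neonatal_deaths = 900     # deaths at under 7 days
neonatal_deaths = 1_500         # deaths at under 28 days
maternal_deaths = 54

total_births = live_births + stillbirths

# Stillbirth and perinatal rates use TOTAL births as the denominator.
stillbirth_rate = rate(stillbirths, total_births)
pmr = rate(stillbirths + early_neonatal_deaths, total_births)
# Neonatal mortality uses LIVE births only.
nmr = rate(neonatal_deaths, live_births)
# Maternal mortality ratio is per 100,000 live births.
mmr = rate(maternal_deaths, live_births, 100_000)

# Total fertility rate from age-specific rates (per 1000) in 5-year bands.
asfr_per_1000 = [10, 50, 90, 100, 60, 15, 1]
tfr = sum(asfr_per_1000) * 5 / 1000

print(f"Stillbirth rate = {stillbirth_rate:.2f} per 1000 total births")
print(f"PMR = {pmr:.2f} per 1000 total births")
print(f"NMR = {nmr:.2f} per 1000 live births")
print(f"MMR = {mmr:.1f} per 100,000 live births")
print(f"TFR = {tfr:.2f} children per woman")
```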

WHO Definitions

Term WHO Definition UK Definition
Stillbirth Fetal death ≥28 weeks Fetal death ≥24 weeks
Early neonatal death Death within 7 days of birth Same
Neonatal death Death within 28 days of birth Same
Perinatal period From 22 weeks gestation to 7 days after birth 24 weeks to 7 days
Maternal death Death of a woman while pregnant or within 42 days of termination of pregnancy, from any cause related to or aggravated by the pregnancy or its management, but not from accidental or incidental causes Same
Late maternal death Death >42 days and <1 year after end of pregnancy Same
Pregnancy-related death Death from any cause while pregnant or within 42 days of termination of pregnancy (includes incidental) Used before ICD-MM

ICD-MM Classification of Maternal Deaths

  1. Direct maternal deaths: Resulting from obstetric complications of the gravid state (pregnancy, labour, puerperium), from interventions, omissions, incorrect treatment, or from a chain of events resulting from any of these.
     Examples: Obstetric haemorrhage, pre-eclampsia/eclampsia, sepsis, amniotic fluid embolism, anaesthetic complications, thromboembolism

  2. Indirect maternal deaths: Resulting from previous existing disease or disease that developed during pregnancy and was not due to direct obstetric causes, but was aggravated by the physiological effects of pregnancy.
     Examples: Cardiac disease, epilepsy, diabetes, anaemia, HIV, mental health conditions

  3. Coincidental (fortuitous) maternal deaths: Deaths from unrelated causes that happen to occur in pregnancy or the puerperium.
     Examples: Road traffic accidents, homicide. Suicide is not coincidental: ICD-MM classifies suicide in pregnancy or the puerperium as a direct maternal death.

  4. Late maternal deaths: Deaths from direct or indirect causes occurring more than 42 days but less than 1 year after the end of pregnancy.

10.2 MBRRACE-UK and Confidential Enquiries

MBRRACE-UK (Mothers and Babies: Reducing Risk through Audits and Confidential Enquiries across the UK)

  • Established 2012 (replaced CMACE)
  • Commissioned by: Healthcare Quality Improvement Partnership (HQIP)
  • Key reports:
    • "Saving Lives, Improving Mothers' Care" (maternal deaths; published annually, each report covering a rolling three-year period)
    • Perinatal Mortality Surveillance Report
  • Related but separate programme: Each Baby Counts (run by the RCOG; intrapartum term stillbirths, neonatal deaths, brain injury; see 10.4)

Key Findings from Recent Reports (2022–2025)

Main causes of maternal death (UK, 2019–2021):

Rank Cause Type Proportion
1 Cardiac disease Indirect ~25%
2 Thromboembolism Direct ~15%
3 Sepsis Direct/Indirect ~12%
4 Pre-eclampsia/eclampsia Direct ~10%
5 Haemorrhage Direct ~8%
6 Neurological causes Indirect ~8%
7 Mental health (suicide) Direct (under ICD-MM) ~5%
8 Anaesthetic complications Direct Rare

Key disparities: - Ethnicity: Black women 4× more likely, Asian women 2× more likely to die than white women - Socioeconomic: Women from most deprived areas 3× more likely to die - Age: Women ≥35 at higher risk - Obesity: Leading contributor across multiple causes - Late booking: Women who book after 12 weeks have higher risk

Key recommendations (recent): - Better pre-conception counselling for women with medical conditions - Early pregnancy assessment for women with cardiac disease (joint obstetric-cardiac clinics) - Standardised management of obstetric haemorrhage (massive transfusion protocol) - Improved recognition and management of sepsis - E-learning for early warning scores (MEOWS: Modified Early Obstetric Warning Score) - Thromboprophylaxis risk assessment at every contact

10.3 Saving Babies' Lives Care Bundle — Version 3 (2023)

A national patient safety initiative to reduce stillbirth and neonatal death.

Element 1: Smoking cessation - Carbon monoxide (CO) testing at booking - Referral to stop smoking services if CO ≥ 4 ppm (or ≥ 7 ppm in some areas) - Brief intervention training for midwives

Element 2: Growth assessment - Use of customised GROW chart (Gestation Related Optimal Weight) - Serial symphysis-fundal height (SFH) measurements from 24 weeks - Referral for ultrasound if SFH diverges from chart (below 10th or above 90th centile) - Use of ultrasound for suspected SGA: estimated fetal weight + Doppler (umbilical artery PI)

Element 3: Reduced fetal movements (RFM) - Standardised information for women (counting movements, when to contact) - Standardised care pathway: CTG + ultrasound (growth, liquor volume, Doppler) within 2 hours - No digital fetal movement counting for all (controversial — evidence lacking) - Low PAPP-A (<0.4 MoM) → increased surveillance

Element 4: Effective fetal monitoring during labour - Standardised CTG interpretation training (e.g., K2MS, PROMPT, RCOG e-learning) - Use of STAN (ST-segment analysis) or similar adjunct if indicated - Fetal blood sampling (FBS) protocol - Structured communication (SBAR) and team working

Element 5: Reducing preterm birth - Cervical length screening at 20 weeks (transvaginal ultrasound) - Progesterone for short cervix (<25 mm) - Cervical cerclage for history-indicated or ultrasound-indicated short cervix - Arabin pessary (evidence still emerging)

Element 6: Management of pre-existing diabetes in pregnancy (new in Version 3)

10.4 Each Baby Counts (RCOG)

  • Aim: Reduce the number of term stillbirths, neonatal deaths, and brain injuries occurring as a result of intrapartum incidents
  • Data collection: All UK maternity units submit cases
  • Key findings:
    • ~80% of cases had some element of substandard care
    • Most common issues: CTG misinterpretation, failure to act on abnormal CTG, delayed delivery, poor communication
    • ≥30% of cases were potentially avoidable

Key recommendations: - Standardised CTG training every 12 months (including emergency drills) - Consultant-led review of all CTGs in labour - SBAR handover and communication - Real-time monitoring of outcomes - Human factors training (situational awareness, decision-making, communication)

10.5 RCOG Green-top Guidelines — Evidence Grading

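Levels of Evidence (SIGN-derived classification used in Green-top Guidelines; the OCEBM codes in the Level column are only an approximate correspondence):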

Code Level Description
1++ 1a High-quality meta-analyses, systematic reviews of RCTs, or RCTs with very low risk of bias
1+ 1b Well-conducted meta-analyses, systematic reviews of RCTs, or RCTs with low risk of bias
1− 1c Meta-analyses, systematic reviews of RCTs, or RCTs with high risk of bias
2++ 2a High-quality SR of case-control or cohort studies; high-quality case-control/cohort with very low risk of confounding/bias/chance
2+ 2b Well-conducted case-control or cohort studies with low risk of confounding/bias/chance
2− 2c Case-control or cohort studies with high risk of confounding/bias/chance
3 3a/b Non-analytic studies (case reports, case series)
4 4 Expert opinion

Grades of Recommendation:

Grade Evidence Required
A At least one meta-analysis, systematic review, or RCT rated 1++ and directly applicable to target population; OR systematic review of RCTs or body of evidence consisting principally of studies rated 1+ directly applicable and demonstrating consistency of results
B Body of evidence including studies rated 2++ directly applicable and demonstrating consistency of results; OR extrapolated evidence from studies rated 1++ or 1+
C Body of evidence including studies rated 2+ directly applicable and demonstrating consistency of results; OR extrapolated evidence from studies rated 2++
D Evidence level 3 or 4; OR extrapolated evidence from studies rated 2+

Good Practice Point (GPP): Recommended best practice based on the clinical experience of the guideline development group.

10.6 NICE Guidelines

  • National Institute for Health and Care Excellence
  • Use GRADE for quality assessment
  • Evidence reviews conducted systematically
  • Health economic modelling integral to recommendations
  • Recommendation wording:
    • "Offer" = strong recommendation (most patients should receive)
    • "Consider" = weaker recommendation (requires discussion)
    • "Do not offer" = strong recommendation against
  • Guidelines also state which treatment options are not recommended

Key NICE guidelines in O&G: - NG133: Hypertension in pregnancy - NG25: Preterm labour and birth - NG201: Antenatal care (replaced CG62) - NG3: Diabetes in pregnancy - NG194: Postnatal care - CG122: Ovarian cancer: recognition and initial management - NG88: Heavy menstrual bleeding

10.7 SIGN Guidelines

  • Scottish Intercollegiate Guidelines Network
  • Use methodology similar to GRADE
  • Grades A–D recommendations
  • Key examples:
    • SIGN 116: Management of diabetes (includes diabetes in pregnancy)
    • SIGN 122: Prevention and management of venous thromboembolism
    • SIGN 127: Management of perinatal mood disorders

10.8 Fertility & Population Demographics — UK Data

Measure Value Year Source
Births (England & Wales) ~600,000/year 2023 ONS
Total Fertility Rate (TFR) 1.44 (1.49 in 2022) 2023 ONS
Mean age of mother 30.7 (first birth); all: 30.8 2023 ONS
Teenage pregnancy rate (<18) ~13/1000 women 2022 ONS
Percentage of births outside marriage ~51% 2023 ONS
Multiple pregnancy rate ~16/1000 maternities 2023 ONS
Caesarean section rate ~33% 2023 NHS Digital
Induction of labour ~30–35% 2023 NHS Digital
Preterm birth rate ~8% 2023 ONS
Low birth weight (<2500g) ~7% 2023 ONS
Perinatal mortality rate 4.9/1000 2022 MBRRACE-UK
Maternal mortality ratio 8.8/100,000 maternities 2017–2019 MBRRACE-UK
Stillbirth rate 3.9/1000 2022 ONS
Neonatal mortality rate 2.5/1000 2022 ONS
Infant mortality rate 3.9/1000 2022 ONS

10.9 Clinical Audit in O&G

Definition: A quality improvement process that seeks to improve patient care and outcomes through systematic review of care against explicit criteria and the implementation of change.

The Audit Cycle:

    Set standards and criteria
                ↓
    Observe current practice
                ↓
    Compare practice to standards
                ↓
    Standards met?   Yes → proceed to re-audit   No → implement change
                ↓
    Re-audit (to close the loop), then repeat the cycle

Types of audit:

| Type | Definition | Example |
|------|------------|---------|
| Structure audit | Resources, facilities, staffing | Is there a 24-hour labour ward consultant? |
| Process audit | What is done for patients | What proportion had thromboprophylaxis? |
| Outcome audit | Results achieved | What is the CS rate? Perinatal mortality? |

National Audits in O&G

Audit Organisation What it measures
MBRRACE-UK MBRRACE-UK collaboration Maternal and perinatal deaths
NMPA (National Maternity and Perinatal Audit) RCOG Maternity service quality, outcomes
Saving Babies' Lives NHS England Stillbirth reduction
Each Baby Counts RCOG Intrapartum term outcomes
UKOSS (UK Obstetric Surveillance System) NPEU Rare pregnancy conditions

UKOSS (UK Obstetric Surveillance System)

  • Purpose: Surveillance of rare conditions in pregnancy (typically fewer than 1 case per 2,000 births)
  • Method: Monthly case reporting cards sent to all consultant-led maternity units
  • Examples: Amniotic fluid embolism, placenta accreta, uterine rupture, peripartum cardiomyopathy, maternal sepsis
  • Outputs: Incidence rates, risk factors, management patterns, maternal and perinatal outcomes

10.10 Quality Improvement in O&G

Plan-Do-Study-Act (PDSA) cycles: - Plan: Define the change, predict outcomes, develop measurement - Do: Implement the change on a small scale - Study: Analyse data, compare to predictions - Act: Refine the change, scale up or abandon

Common QI projects in O&G:

- Reducing induction-to-delivery interval
- Improving antibiotic prophylaxis timing for CS
- Reducing emergency CS decision-to-delivery interval
- Implementing standardised CTG interpretation
- Improving breastfeeding rates
- Reducing perineal trauma

10.11 Key UK Screening Programmes — Summary Table

| Programme | Condition | Test | Population | Interval |
|-----------|-----------|------|------------|----------|
| NHS Fetal Anomaly Screening Programme (FASP) | 11 physical anomalies + Down's/Edwards'/Patau's | Combined test (11–14w) or Quadruple (14–20w) + anomaly scan (18–20w) | All pregnant women | Per pregnancy |
| NHS Sickle Cell and Thalassaemia Screening | Sickle cell disease, thalassaemia, carrier status | Family origin questionnaire + Hb HPLC | All pregnant women (and partners if carrier) | Per pregnancy |
| NHS Infectious Diseases in Pregnancy Screening | HIV, Hepatitis B, Syphilis | Blood test | All pregnant women | Per pregnancy (and 28w for high-risk HIV) |
| NHS Cervical Screening Programme | Cervical cancer (HPV-related) | HPV test → reflex cytology | Women 25–64 | 3-yearly (25–49), 5-yearly (50–64) |
| NHS Breast Screening Programme | Breast cancer | Mammography | Women 50–70 (extending to 47–73) | 3-yearly |
| NHS Abdominal Aortic Aneurysm Screening | AAA | Ultrasound | Men 65+ | Once |
| NHS Diabetic Eye Screening | Diabetic retinopathy | Digital retinal photography | All with diabetes | Annual |

10.12 How to Answer MRCOG Part 1 Epidemiology Questions

Common question formats:

1. "A new screening test has sensitivity 95% and specificity 95%. The prevalence is 1%. What is the PPV?"
2. "Which study design would be best to investigate an association between a rare disease and a common exposure?"
3. "What is the most appropriate statistical test to compare birth weight between smokers and non-smokers?"
4. "What is the correct interpretation of this confidence interval?"
5. "Which type of bias is most likely in a case-control study of maternal medication and congenital anomalies?"

Answer strategy:

1. Identify what is being asked (study design, test, bias, interpretation)
2. Recall the relevant definition and formula
3. Apply to the specific scenario
4. Eliminate wrong answers systematically

Formulas to memorise (and practise):

- Sensitivity, specificity, PPV, NPV, LR+, LR−
- RR, OR, AR, ARF, NNT
- χ² = Σ(O−E)²/E
- SEM = SD/√n
- Post-test odds = Pre-test odds × LR
- Adjusted α (Bonferroni) = 0.05/k
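
As a worked check of the first question format above (sensitivity 95%, specificity 95%, prevalence 1%), the sketch below takes the likelihood-ratio route: convert prevalence to pre-test odds, multiply by LR+, and convert back to a probability. The function names are illustrative and not from any library; the Bonferroni helper simply divides α by the number of comparisons k.

```python
def post_test_probability(pre_test_prob: float, likelihood_ratio: float) -> float:
    """Post-test odds = pre-test odds x LR, then convert odds back to a probability."""
    pre_test_odds = pre_test_prob / (1 - pre_test_prob)
    post_test_odds = pre_test_odds * likelihood_ratio
    return post_test_odds / (1 + post_test_odds)


def bonferroni_alpha(k: int, alpha: float = 0.05) -> float:
    """Adjusted significance threshold for k comparisons."""
    return alpha / k


# Values taken from the example question; prevalence acts as the pre-test probability.
sensitivity, specificity, prevalence = 0.95, 0.95, 0.01
lr_positive = sensitivity / (1 - specificity)          # LR+ = Sn / (1 - Sp)
ppv = post_test_probability(prevalence, lr_positive)   # PPV = post-test probability

print(f"LR+ = {lr_positive:.1f}")
print(f"PPV = {ppv:.1%}")
print(f"Bonferroni alpha for 5 comparisons = {bonferroni_alpha(5):.3f}")
```

The PPV comes out at roughly 16%: even with a test that is 95% sensitive and 95% specific, most positive results are false positives when the condition is this uncommon.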


Quick Reference: MRCOG Epidemiology Formulae

Screening

| Formula | Mnemonic |
|---------|----------|
| Sn = TP / (TP + FN) | Sn = sick / (sick + missed) |
| Sp = TN / (TN + FP) | Sp = well / (well + false alarms) |
| PPV = TP / (TP + FP) | PPV = true positive / all positive |
| NPV = TN / (TN + FN) | NPV = true negative / all negative |
| LR+ = Sn / (1 − Sp) | Positive LR = sensitivity / false positive rate |
| LR− = (1 − Sn) / Sp | Negative LR = false negative rate / specificity |
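
The screening formulae above can be applied mechanically to a 2×2 table. The following is a minimal sketch; the function name and the counts (90 true positives, 10 false negatives, 180 false positives, 720 true negatives) are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def screening_metrics(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Compute the six screening measures from the cells of a 2x2 table."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "Sn": sensitivity,
        "Sp": specificity,
        "PPV": tp / (tp + fp),
        "NPV": tn / (tn + fn),
        "LR+": sensitivity / (1 - specificity),
        "LR-": (1 - sensitivity) / specificity,
    }


# Hypothetical screening study of 1,000 women, 100 of whom have the condition.
metrics = screening_metrics(tp=90, fn=10, fp=180, tn=720)
for name, value in metrics.items():
    print(f"{name} = {value:.3f}")
```

Here Sn = 0.90 and Sp = 0.80, yet the PPV is only one in three because false positives (180) outnumber true positives (90).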

Risk & Effect

| Formula | When used |
|---------|-----------|
| RR = [a/(a+b)] / [c/(c+d)] | Cohort studies |
| OR = ad / bc | Case-control studies |
| AR = a/(a+b) − c/(c+d) | Excess risk |
| ARF = (RR−1)/RR | % of risk due to exposure |
| NNT = 1/ARR | Number needed to treat |
| NNH = 1/AR (harm) | Number needed to harm |
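
A short sketch of the risk and effect formulae above, assuming the conventional 2×2 layout in which a and b are the exposed with and without the outcome, and c and d are the unexposed with and without the outcome. The cohort counts are hypothetical.

```python
def effect_measures(a: int, b: int, c: int, d: int) -> dict:
    """a, b = exposed with/without outcome; c, d = unexposed with/without outcome."""
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    rr = risk_exposed / risk_unexposed
    attributable_risk = risk_exposed - risk_unexposed
    return {
        "RR": rr,
        "OR": (a * d) / (b * c),
        "AR": attributable_risk,
        "ARF": (rr - 1) / rr,
        "NNH": 1 / attributable_risk,
    }


# Hypothetical cohort: 100 exposed (30 with the outcome), 100 unexposed (10 with the outcome).
measures = effect_measures(a=30, b=70, c=10, d=90)
for name, value in measures.items():
    print(f"{name} = {value:.2f}")
```

With an outcome this common (30% in the exposed group), the OR of about 3.86 noticeably overstates the RR of 3.0. For a protective treatment the same arithmetic applies with the sign reversed: ARR = control risk − treated risk, and NNT = 1/ARR.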

Statistics

| Formula | Meaning |
|---------|---------|
| x̄ = Σx/n | Mean |
| s² = Σ(x−x̄)²/(n−1) | Sample variance |
| SD = √s² | Standard deviation |
| SEM = SD/√n | Standard error of mean |
| 95% CI ≈ x̄ ± 2×SEM | Confidence interval for mean |
| χ² = Σ(O−E)²/E | Chi-squared test |
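
The statistics formulae above can be checked by hand on small numbers. The sketch below uses a hypothetical sample of five birth weights (kg) and a hypothetical 2×2 table; the function names are illustrative, and the confidence interval uses the same ±2×SEM approximation as the table rather than 1.96 or a t-value.

```python
import math


def describe(values: list) -> dict:
    """Mean, sample variance (n - 1 denominator), SD, SEM and approximate 95% CI."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / (n - 1)
    sd = math.sqrt(variance)
    sem = sd / math.sqrt(n)
    return {
        "mean": mean,
        "variance": variance,
        "sd": sd,
        "sem": sem,
        "ci95": (mean - 2 * sem, mean + 2 * sem),
    }


def chi_squared(observed: list) -> float:
    """Sum of (O - E)^2 / E, with E = row total x column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / grand_total
            statistic += (o - e) ** 2 / e
    return statistic


# Hypothetical birth weights in kg.
summary = describe([2.8, 3.0, 3.2, 3.4, 3.6])
print(f"mean = {summary['mean']:.2f}, SD = {summary['sd']:.3f}, SEM = {summary['sem']:.3f}")

# Hypothetical 2x2 table: rows = exposed/unexposed, columns = outcome yes/no.
statistic = chi_squared([[30, 70], [10, 90]])
print(f"chi-squared = {statistic:.2f}")
```

In the 2×2 example every expected count is at least 20, so the chi-squared test is appropriate; with any expected count below 5, Fisher's exact test would be used instead.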

Decision Rules

| Rule | Cut-off |
|------|---------|
| α (Type I error) | 0.05 |
| β (Type II error) | 0.20 (power = 80%) |
| p < 0.05 | Statistically significant |
| 95% CI excludes 1 (RR/OR) | Statistically significant |
| I² > 50% | Substantial heterogeneity |
| AUC > 0.8 | Good diagnostic accuracy |

Mnemonics for MRCOG Part 1

SnNOut: High Sensitivity → Negative rules Out

SpPIn: High Specificity → Positive rules In

OSA to remember bias types:

- O = Observer bias
- S = Selection bias
- A = Attrition bias

CRIB for confounder criteria:

- C = Causes outcome (independent risk factor)
- R = Related to exposure
- I = Intermediate? No: a confounder is not on the causal pathway
- B = Before exposure? Yes: a confounder must precede it

NNT = 1/ARR — think "Need N To prevent" = 1 over Absolute Risk Reduction

SEM < SD (whenever n > 1): the Standard Error is Smaller than the Standard Deviation, because SEM = SD/√n


Common MRCOG Part 1 Traps

| # | Trap | Truth |
|---|------|-------|
| 1 | "p-value = probability H₀ is true" | WRONG: p = P(data at least as extreme as observed, given H₀ is true) |
| 2 | "SEM = SD" | WRONG: SEM = SD/√n |
| 3 | "OR = RR always" | WRONG: OR ≈ RR only when the disease is rare (<10%) |
| 4 | "PPV is a fixed test property" | WRONG: PPV depends on prevalence |
| 5 | "Non-significant p = no effect" | WRONG: the study may be underpowered |
| 6 | "ITT is good for non-inferiority" | WRONG: ITT is anti-conservative for non-inferiority |
| 7 | "Screening always saves lives" | WRONG: lead time, length time, overdiagnosis |
| 8 | "Case-control studies can calculate incidence" | WRONG: only OR |
| 9 | "Correlation = causation" | ALWAYS WRONG |
| 10 | "Confounder is an intermediate variable" | WRONG: a confounder is outside the causal pathway |
| 11 | "95% CI range contains 95% of data" | WRONG: the 95% CI is about the mean, not individual values |
| 12 | "Histogram bars should have gaps" | WRONG: histogram bars TOUCH (bar chart bars have gaps) |
| 13 | "χ² test can be used with any 2×2 table" | WRONG: expected <5 requires Fisher's exact |
| 14 | "Mean is always the best measure" | WRONG: use median for skewed data |
| 15 | "Blinding and allocation concealment are the same" | WRONG: allocation concealment is ALWAYS possible; blinding is not |
| 16 | "Cluster RCT doesn't need special analysis" | WRONG: must account for clustering (ICC, design effect) |
| 17 | "p < 0.01 means a more important result than p < 0.05" | WRONG: p depends on sample size, not just effect size |
| 18 | "Systematic review = meta-analysis" | WRONG: a meta-analysis is the statistical combination; not all SRs have one |
| 19 | "NNT is a fixed property of a treatment" | WRONG: NNT depends on baseline risk |
| 20 | "One-sided test is always more powerful" | WRONG: only if the true effect is in the hypothesised direction |
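
Traps 4 and 19 are easiest to remember after seeing the numbers move. The sketch below, with illustrative function names and hypothetical inputs, holds the test characteristics (95% sensitivity and specificity) and the treatment effect (relative risk 0.5) fixed while varying prevalence and baseline risk.

```python
def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """PPV from test characteristics and prevalence (Bayes' theorem)."""
    true_positives = sensitivity * prevalence
    false_positives = (1 - specificity) * (1 - prevalence)
    return true_positives / (true_positives + false_positives)


def nnt(baseline_risk: float, relative_risk: float) -> float:
    """NNT = 1 / ARR, where ARR = baseline risk x (1 - relative risk)."""
    absolute_risk_reduction = baseline_risk * (1 - relative_risk)
    return 1 / absolute_risk_reduction


# Trap 4: same test, different prevalence.
for prevalence in (0.001, 0.01, 0.10, 0.50):
    print(f"prevalence {prevalence:>6.1%} -> PPV {ppv(0.95, 0.95, prevalence):.1%}")

# Trap 19: same relative effect, different baseline risk.
for baseline_risk in (0.02, 0.10, 0.40):
    print(f"baseline risk {baseline_risk:.0%} -> NNT {nnt(baseline_risk, 0.5):.0f}")
```

The same test gives a PPV of under 2% at a prevalence of 0.1% and 95% at a prevalence of 50%, and the same treatment needs 100 patients treated at a 2% baseline risk but only 5 at 40%.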

References & Further Reading

Essential Textbooks:

- Kirkwood BR & Sterne JAC. Essential Medical Statistics. 2nd ed. Blackwell Science, 2003.
- Altman DG. Practical Statistics for Medical Research. Chapman & Hall, 1991.
- Petrie A & Sabin C. Medical Statistics at a Glance. 4th ed. Wiley-Blackwell, 2020.
- Bland M. An Introduction to Medical Statistics. 4th ed. OUP, 2015.
- Fletcher RH & Fletcher SW. Clinical Epidemiology: The Essentials. 5th ed. Wolters Kluwer, 2014.
- Straus SE et al. Evidence-Based Medicine: How to Practice and Teach It. 5th ed. Elsevier, 2018.

Key UK Documents:

- RCOG. Green-top Guidelines Levels of Evidence and Grades of Recommendation. (Introductory sections of any Green-top Guideline)
- MBRRACE-UK. Saving Lives, Improving Mothers' Care. (Latest triennial report)
- NICE. The Guidelines Manual (process and methods).
- NHS FASP. Fetal anomaly screening programme standards.
- Wilson JMG & Jungner G. Principles and practice of screening for disease. WHO Public Health Papers No. 34. Geneva: WHO, 1968.

Key Papers:

- Guyatt GH et al. GRADE: an emerging consensus on rating quality of evidence and strength of recommendations. BMJ 2008;336:924–6.
- Schulz KF et al. CONSORT 2010 Statement: updated guidelines for reporting parallel group randomised trials. BMJ 2010;340:c332.
- von Elm E et al. The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement. Lancet 2007;370:1453–7.
- Moher D et al. Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. BMJ 2009;339:b2535.
- Altman DG & Bland JM. Statistics notes: diagnostic tests 1: sensitivity and specificity. BMJ 1994;308:1552.
- Altman DG & Bland JM. Statistics notes: diagnostic tests 2: predictive values. BMJ 1994;309:102.
- Deeks JJ. Systematic reviews in health care: systematic reviews of evaluations of diagnostic and screening tests. BMJ 2001;323:157–62.
- Higgins JPT et al. Measuring inconsistency in meta-analyses. BMJ 2003;327:557–60.
- Sterne JAC & Egger M. Funnel plots for detecting bias in meta-analysis. J Clin Epidemiol 2001;54:1046–55.

Online Resources:

- OpenEpi (www.openepi.com): free online calculators for epidemiological statistics
- Cochrane Handbook for Systematic Reviews of Interventions (training.cochrane.org/handbook)
- MedCalc Statistical Software (www.medcalc.org): ROC curve analysis, diagnostic test evaluation
- NICE guidance (www.nice.org.uk)
- RCOG Green-top Guidelines (www.rcog.org.uk/guidelines)
- MBRRACE-UK reports (www.npeu.ox.ac.uk/mbrrace-uk)
- ONS birth statistics (www.ons.gov.uk)
- StatsDirect statistical software (www.statsdirect.com)


Last updated: May 2026

Target exam: MRCOG Part 1

Word count: ~18,500+

Author note: This document is intended as a comprehensive revision resource covering all examinable topics in epidemiology, statistics, screening, evidence-based medicine, and O&G-specific applications. Candidates should supplement with current RCOG Green-top Guidelines, recent NICE guidance, and the latest MBRRACE-UK reports for the most up-to-date statistical data (rates, mortality figures, screening programme updates). Particular attention should be paid to the formulae and interpretations flagged as "MRCOG Key Point" and "Common MRCOG Part 1 Traps"; these represent the most frequently tested and most commonly confused concepts in the examination.
