
Table of Contents
1. Study Design
1.1 Observational vs Experimental Studies
1.2 Cross-Sectional Studies
1.3 Cohort Studies
1.4 Case-Control Studies
1.5 Randomised Controlled Trials (RCTs)
1.6 Ecological Studies
2. Screening
2.1 The Wilson-Jungner Criteria (1968) — Detailed
2.2 Test Performance Characteristics — Complete Details
2.3 Likelihood Ratios — Complete Guide
2.4 ROC Curves — Detailed
2.5 Screening in O&G — Complete Clinical Details
2.6 Screening Biases — Expanded
3. Descriptive Statistics
3.1 Types of Data — Complete Classification
3.2 Measures of Central Tendency — Complete Guide
3.3 Measures of Dispersion — Complete Guide
3.4 Normal Distribution — Complete Details
3.5 Skewed Distributions & Transformations
3.6 Data Presentation — Types of Graphs
4. Hypothesis Testing
4.1 Fundamental Concepts — Complete
4.2 Type I and Type II Errors — The 2×2 Framework
4.3 The p-value — Essential MRCOG Detail
4.4 Confidence Intervals — Detailed
4.5 One-tailed vs Two-tailed Tests
4.6 Multiple Testing — Corrections
4.7 Significance vs Clinical Importance — Key MRCOG Concept
4.8 Bayesian vs Frequentist Statistics — Overview
5. Parametric vs Non-Parametric Tests
5.1 Choosing the Right Test — Decision Tree
5.2 Parametric Tests — Complete Details
5.3 Non-Parametric Tests — Complete Details
5.4 Chi-Squared Test (χ²) — Complete Details
5.5 Correlation — Detailed
5.6 Regression — Complete Details
6. Risk & Effect Measures
6.1 The 2×2 Table — Foundation
6.2 Definitions and Formulas — Complete
6.3 Worked Examples from O&G
6.4 Risk Ratio vs Odds Ratio — The Rare Disease Assumption
6.5 Number Needed to Treat (NNT) — Detailed
6.6 Incidence vs Prevalence
6.7 Hazard Ratio — More Detail
7. Statistical Bias & Confounding
7.1 Classification of Bias
7.2 Selection Bias — Detailed with O&G Examples
7.3 Information Bias (Measurement Bias) — Detailed
7.4 Confounding — Complete Details
7.5 Effect Modification (Interaction)
7.6 Confounding by Indication
7.7 Protopathic Bias
8. Evidence-Based Medicine
8.1 Levels of Evidence — Oxford CEBM (March 2009)
8.2 GRADE System — Complete
8.3 Systematic Reviews & Meta-Analysis — Complete
8.4 Critical Appraisal — CASP Tools
8.5 Using EBM in Practice — Fagan Nomogram
8.6 Evidence-Based Guidelines in O&G
9. Survival Analysis
9.1 Key Concepts — Detailed
9.2 Censoring — Complete Types
9.3 Kaplan-Meier Method — Complete Details
9.4 Log-Rank Test — Details
9.5 Cox Proportional Hazards — Complete Details
9.6 Parametric Survival Models
9.7 Describing Survival Results
10. Specific Topics in O&G
10.1 Key Rates and Definitions
10.2 MBRRACE-UK and Confidential Enquiries
10.3 Saving Babies' Lives Care Bundle — Version 3 (2023)
10.4 Each Baby Counts (RCOG)
10.5 RCOG Green-top Guidelines — Evidence Grading
10.6 NICE Guidelines
10.7 SIGN Guidelines
10.8 Fertility & Population Demographics — UK Data
10.9 Clinical Audit in O&G
10.10 Quality Improvement in O&G
10.11 Key UK Screening Programmes — Summary Table
10.12 How to Answer MRCOG Part 1 Epidemiology Questions
Quick Reference: MRCOG Epidemiology Formulae
Screening
Risk & Effect
Statistics
Decision Rules
Mnemonics for MRCOG Part 1
Common MRCOG Part 1 Traps
References & Further Reading


MRCOG Part 1: Epidemiology & Statistics — Comprehensive Study Document

Target: MRCOG Part 1
Version: May 2026
Purpose: Complete deep-dive reference covering all examinable topics in epidemiology, medical statistics, screening, evidence-based medicine, and their application to obstetrics & gynaecology. This document is designed for thorough revision: every section includes definitions, formulae, mnemonics, worked examples from O&G, and MRCOG-specific exam tips.


1. Study Design

1.1 Observational vs Experimental Studies

Feature	Observational	Experimental
Intervention	None — investigator observes naturally occurring groups	Investigator assigns intervention intentionally
Causality	Association only (unless Bradford Hill criteria satisfied)	Can infer causation (if properly randomised and blinded)
Bias risk	Higher — multiple sources of bias possible	Lower — randomisation balances confounders
Ethical issues	Fewer — no manipulation of participants	More — equipoise required; informed consent essential
Examples	Cohort, case-control, cross-sectional, ecological	RCT (parallel, crossover, cluster, factorial)

Bradford Hill Criteria for Causation (1965): These are important in interpreting observational studies. They are a set of nine viewpoints used to assess whether an observed association is likely to be causal:
1. Strength of association (larger effect = more likely causal)
2. Consistency (reproduced across different populations/settings)
3. Specificity (one cause → one effect; less applicable to O&G, where most outcomes are multifactorial)
4. Temporality (cause must precede effect; the only absolutely essential criterion)
5. Biological gradient (dose-response relationship, e.g., more cigarettes → higher preterm birth risk)
6. Plausibility (biologically credible mechanism)
7. Coherence (consistent with natural history/biology)
8. Experiment (evidence from experimental studies)
9. Analogy (similar evidence for analogous exposures)

1.2 Cross-Sectional Studies

Design: Data collected at a SINGLE point in time — both exposure and outcome measured simultaneously
Measures: Prevalence (existing cases) — CANNOT measure incidence (new cases)
Uses: Disease burden estimates, health surveys, screening programme evaluation, hypothesis generation, planning health services
Advantages: Quick, cheap, good for hypothesis generation, no loss to follow-up, can study multiple outcomes and exposures simultaneously
Disadvantages: Cannot establish temporality (chicken-and-egg problem — did the exposure come before the outcome?), survival bias (only survivors captured — those who died cannot participate), not suitable for rare diseases (need very large samples), prevalence-incidence bias (Neyman bias)
Key statistic: Odds ratio (can be calculated but caution with interpretation — prevalence OR, not incidence OR)
In O&G: Estimating prevalence of pelvic organ prolapse, urinary incontinence (UI), infertility, contraception use patterns, postnatal depression (e.g., EPDS screening studies), HPV prevalence, endometriosis prevalence estimates

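Example: A cross-sectional survey asks 5,000 women about incontinence symptoms and BMI. 1,200 report UI; 800 of those with UI are obese vs 1,500 of the 3,800 without UI. The four cells are therefore: obese with UI 800, non-obese with UI 400, obese without UI 1,500, non-obese without UI 2,300.
- Prevalence of UI = 1,200/5,000 = 24%
- OR for UI in obese vs non-obese = (800 × 2,300)/(1,500 × 400) = 1,840,000/600,000 = 3.07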

Limitation: Cannot tell if obesity caused UI or UI led to reduced activity and weight gain.
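A quick way to check the arithmetic is to rebuild the four cells of the 2×2 table from the figures quoted in the example. The short Python sketch below does this; the function name is illustrative only and not from any library.

```python
def cross_sectional_summary(total, with_ui, obese_with_ui, obese_without_ui):
    """Return (prevalence of UI, prevalence odds ratio for UI in obese vs non-obese)."""
    without_ui = total - with_ui
    a = obese_with_ui                  # obese, UI
    b = obese_without_ui               # obese, no UI
    c = with_ui - obese_with_ui        # non-obese, UI
    d = without_ui - obese_without_ui  # non-obese, no UI
    prevalence = with_ui / total
    odds_ratio = (a * d) / (b * c)     # cross-product ratio ad/bc
    return prevalence, odds_ratio


prevalence, odds_ratio = cross_sectional_summary(5000, 1200, 800, 1500)
print(f"Prevalence of UI: {prevalence:.0%}")  # Prevalence of UI: 24%
print(f"Prevalence OR: {odds_ratio:.2f}")      # Prevalence OR: 3.07
```

Because exposure and outcome are measured at the same moment, this is a prevalence OR and says nothing about which came first.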

1.3 Cohort Studies

Definition: Groups defined by exposure status; followed forward in time to see who develops outcome. This is the optimal observational design for establishing incidence and temporal relationships.

Prospective Cohort

Exposure assessed at BASELINE; participants followed FORWARD in time
Outcome develops during follow-up
Advantages: Direct measure of incidence, clear temporality (exposure definitely precedes outcome), can study multiple outcomes from one exposure, minimises recall bias (exposure recorded before outcome known), allows calculation of absolute risk, RR, AR
Disadvantages: Expensive and time-consuming (especially for rare outcomes or long latency), loss to follow-up (attrition) can introduce bias, inefficient for rare diseases (need very large numbers), exposure patterns may change over time
Key measures: Incidence (cumulative incidence and incidence rate), relative risk (RR), attributable risk (AR), population attributable fraction (PAF)

Retrospective Cohort (Historical Cohort)

Uses EXISTING data (medical records, databases, occupational records) to go back in time
Exposure and outcome have ALREADY occurred when study begins
Advantages: Cheaper, faster than prospective, good for long-latency diseases (e.g., DES exposure in utero and vaginal adenocarcinoma decades later), can use existing datasets
Disadvantages: Relies on quality and completeness of existing records, recall bias may still affect some data, cannot control what was measured or how, missing data issues

Key Measures in Cohort Studies — Detailed with Worked Example

Worked O&G Example: 10,000 pregnant women; 5,000 smoke, 5,000 do not. Followed for preterm birth (<37 weeks).

	Preterm	Term	Total	Risk
Smoker	200	4,800	5,000	200/5000 = 0.04
Non-smoker	100	4,900	5,000	100/5000 = 0.02
Total	300	9,700	10,000	300/10000 = 0.03

Measure	Formula	Calculation	Interpretation
Cumulative incidence (risk) in exposed	a/(a+b) = 200/5000	0.04 (4%)	4% of smokers had preterm birth
Cumulative incidence (risk) in unexposed	c/(c+d) = 100/5000	0.02 (2%)	2% of non-smokers had preterm birth
Incidence rate (IR) in exposed	200 / (sum of person-time)	Depends on follow-up timing	Accounts for when events occur
Relative Risk (RR)	0.04 / 0.02	2.0	Smokers 2× more likely to have preterm birth
Attributable Risk (AR)	0.04 − 0.02	0.02 (2%)	Excess risk attributable to smoking
AR Fraction (ARF)	(2.0−1)/2.0 = 50%	50%	Half of preterm births in smokers due to smoking
Population Attributable Risk (PAR)	I_total − I_unexposed = 0.03 − 0.02	0.01 (1%)	Excess risk in total population
PAF	(I_total − I_unexposed)/I_total = 0.01/0.03	33.3%	33% of all preterm births attributable to smoking
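All of these measures follow from the four cells of the smoking table. A minimal Python sketch; `cohort_measures` is a hypothetical helper, and a, b, c, d follow the cell labels used in the formulas above.

```python
def cohort_measures(a, b, c, d):
    """a, b = exposed with/without outcome; c, d = unexposed with/without outcome."""
    risk_exposed = a / (a + b)                 # cumulative incidence in exposed
    risk_unexposed = c / (c + d)               # cumulative incidence in unexposed
    risk_total = (a + c) / (a + b + c + d)     # cumulative incidence in whole cohort
    rr = risk_exposed / risk_unexposed
    return {
        "risk_exposed": risk_exposed,
        "risk_unexposed": risk_unexposed,
        "RR": rr,
        "AR": risk_exposed - risk_unexposed,
        "ARF": (rr - 1) / rr,
        "PAR": risk_total - risk_unexposed,
        "PAF": (risk_total - risk_unexposed) / risk_total,
    }


measures = cohort_measures(a=200, b=4800, c=100, d=4900)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```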

Person-time:
- Each participant contributes time until event, loss to follow-up, or study end
- Incidence rate = number of new events / sum of person-time at risk
- Expressed as "per 1,000 person-years" or similar
- Superior to cumulative incidence when follow-up times vary
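A minimal sketch of the person-time calculation, using five made-up participants (the follow-up times and events are invented purely for illustration):

```python
# Each tuple: (years of follow-up contributed, whether the event occurred).
# Follow-up stops at the event, loss to follow-up, or study end.
follow_up = [(2.0, True), (5.0, False), (3.5, False), (1.5, True), (4.0, False)]


def incidence_rate_per_1000(records):
    """New events divided by total person-years at risk, scaled to 1,000 person-years."""
    events = sum(1 for _, had_event in records if had_event)
    person_years = sum(years for years, _ in records)
    return 1000 * events / person_years


print(incidence_rate_per_1000(follow_up))  # 125.0
```

Here 2 events over 16 person-years gives 125 per 1,000 person-years, even though no two participants were followed for the same length of time.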

Confounding in Cohort Studies:
- Common confounders in O&G cohorts: maternal age, socioeconomic status, parity, BMI, pre-existing medical conditions
- Control: multivariable regression, stratification, matching, restriction, propensity scores

1.4 Case-Control Studies

Design: Select cases (with disease) and controls (without disease); look BACK retrospectively for exposure. The most efficient design for rare diseases.

2×2 Table

	Case (disease +)	Control (disease −)	Total
Exposed	a	b	a + b
Unexposed	c	d	c + d

Key Measures

Measure	Formula	Interpretation
Odds of exposure in cases	a / c	How likely cases were exposed
Odds of exposure in controls	b / d	How likely controls were exposed
Odds Ratio (OR)	(a/c) / (b/d) = ad / bc	Odds of exposure in cases vs controls
When disease rare (<10%)	OR ≈ RR	Rare disease assumption

Worked O&G Example: Case-control study of ovarian cancer and talc use.
- Cases: 300 women with ovarian cancer
- Controls: 600 women without ovarian cancer
- Talc use: 120 cases exposed, 180 controls exposed

	Ovarian cancer (Case)	No cancer (Control)
Talc use	120	180
No talc	180	420

OR = (120 × 420) / (180 × 180) = 50,400 / 32,400 = 1.56

Interpretation: Odds of talc exposure are 1.56× higher in ovarian cancer cases than controls. Since ovarian cancer is relatively rare, this approximates RR = 1.56.
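The odds-of-exposure logic in the Key Measures table can be checked against the talc figures. A minimal Python sketch (the helper name is hypothetical) showing that the ratio of the two exposure odds is the same number as the cross-product ad/bc:

```python
def case_control_odds(a, b, c, d):
    """a = exposed cases, b = exposed controls, c = unexposed cases, d = unexposed controls."""
    odds_exposure_cases = a / c
    odds_exposure_controls = b / d
    odds_ratio = odds_exposure_cases / odds_exposure_controls  # algebraically equal to ad/bc
    return odds_exposure_cases, odds_exposure_controls, odds_ratio


cases_odds, controls_odds, talc_or = case_control_odds(a=120, b=180, c=180, d=420)
print(f"{cases_odds:.3f} {controls_odds:.3f} {talc_or:.2f}")  # 0.667 0.429 1.56
```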

Cannot calculate:
- Incidence (no denominator: we selected cases and controls rather than following a population)
- RR (no incidence data)
- Prevalence (same reason)

Advantages

Efficient for rare diseases (ovarian cancer, specific congenital anomalies, maternal death)
Quick and cheap compared to cohort studies
Can study multiple exposures (diet, environment, genetics, medications)
Good for diseases with long latency (like DES-related cancers)
Requires smaller sample sizes than cohort studies for rare outcomes

Disadvantages

Cannot calculate incidence directly (no denominator of total population at risk)
Recall bias: Cases remember exposures differently from controls (especially for subjective exposures like diet, pain medication, stress)
Selection bias: Choosing appropriate controls is the most difficult and critical part
Temporality: Difficult to establish if exposure preceded disease (especially for biomarkers measured after diagnosis)
Inefficient for rare exposures (if the exposure is rare, very large numbers of cases and controls are needed)
Survivorship bias: Cases are those who survived to be diagnosed; fatal cases are missed

Selection of Controls — Critical Issues

Fundamental principle: Controls must come from the SAME source population that gave rise to the cases.

Control type	Description	Advantage	Disadvantage
Population-based	Random sample from general population	Most representative	Expensive, low response rates
Hospital-based	Other patients from same hospital	Easy to recruit, good response rates	Berkson's bias — hospital controls may have different exposure patterns
Friend/relative	Friends or siblings of cases	Genetic/environmental matching	Over-matching possible (same exposures)
Neighbourhood	Neighbours of cases	Socioeconomic matching	Time-consuming
Disease controls	Patients with a DIFFERENT disease	Good response, similar recall	Diseased group may differ from healthy

Matching:
- Frequency matching: select controls to have the same distribution of age, parity, etc. as cases
- Individual matching: each case matched to 1–4 controls on specific factors (age ± 5 years, parity, hospital)
- Over-matching: matching on a variable that is related to the exposure but NOT to the disease; reduces power without reducing confounding

Biases in Case-Control Studies — Expanded

Bias	Mechanism	Example
Recall bias	Cases search memory for causes; controls less motivated	Mothers of babies with malformations report more medication during pregnancy; mothers of healthy babies forget
Berkson's bias	Hospital controls have different admission patterns	Studying aspirin and stroke: hospital controls may have GI bleeds (also related to aspirin) → spurious protective effect
Neyman bias	Prevalent cases differ from incident cases	Studying survival after cancer: prevalent cases are long-term survivors, not representative
Detection bias	Cases diagnosed because of exposure-related screening	Women on HRT have more mammograms → more breast cancer detected (not causal)
Interviewer bias	Interviewer probes differently	More detailed questioning of cases about exposures
Survivorship bias	Fatal cases not included	Studying risk factors for eclampsia — only survivors available

1.5 Randomised Controlled Trials (RCTs)

Gold standard for establishing causality. The key strength is randomisation which (if adequate) balances both known and unknown confounders between groups.

Types of RCT

Type	Description	Key Feature	Example in O&G
Parallel group	Two (or more) independent groups, each receives one treatment concurrently	Most common; simplest analysis	TRUFFLE study (timing of delivery in early-onset fetal growth restriction guided by computerised CTG or ductus venosus Doppler)
Crossover	Each participant receives both treatments in random sequence, separated by a washout period	Each participant acts as own control → smaller sample size needed	Best for stable, chronic conditions; unsuitable for comparing pain relief methods in labour (carryover effects, and the condition changes between periods)
Cluster	Intact groups (hospitals, GP practices, communities) randomised	Used when contamination likely; analysis must account for clustering (ICC)	Comparing screening uptake with different invitation methods at hospital level
Factorial	Two or more interventions tested simultaneously (e.g., 2×2 design)	Efficient — can test interactions (synergy/antagonism)	Comparing aspirin AND heparin vs each alone in recurrent miscarriage
Zelen design	Randomised BEFORE consent; only treatment group approached for consent	Reduces selection bias; ethically controversial	Emergency trials where consent is difficult
N-of-1 trial	Single patient receives treatment and placebo in random sequence	Highest level for individual treatment decisions	Rarely used in O&G

Trial Phases

Phase	Primary Purpose	Typical Participants	Key Questions
Phase I	Safety, tolerability, pharmacokinetics	20–80 healthy volunteers (or patients with advanced disease)	What is safe dose? What are side effects? How is drug metabolised?
Phase II	Efficacy signal, dose-ranging, side effect profile	100–300 patients with condition	Does it work? What is optimal dose? More adverse effects?
Phase III	Confirm efficacy, compare to standard of care	1,000–3,000+ patients	Is it better than current standard? (or non-inferior)
Phase IV	Post-marketing surveillance, long-term safety	General population after licensing	Are there rare adverse effects? Long-term outcomes?

Randomisation Methods — Detailed

Method	Description	Strength	Weakness
Simple randomisation	Each participant assigned by coin toss, random number table, or computer	Unpredictable; simple	Can produce unequal group sizes and imbalance on prognostic factors
Block randomisation	Random permuted blocks of fixed size (e.g., 4: possibilities = TTCC, TCTC, TCCT, CTTC, CTCT, CCTT)	Ensures equal numbers in each group at all times	Block size must be CONCEALED to prevent prediction
Stratified randomisation	Separate randomisation within strata defined by key prognostic factors	Ensures balance on important confounders	Complex; need few strata or it becomes unwieldy
Minimisation	Next participant's allocation determined by current imbalance in prognostic factors	Excellent balance on many factors simultaneously	Not truly random; some controversy about analysis
Adaptive randomisation	Allocation probability changes based on accumulating outcomes	More patients get better treatment	Complex; operational bias possible
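Permuted-block allocation can be sketched in a few lines of Python. This assumes two arms labelled T and C and a fixed block size of 4; the function names are illustrative. A real trial would generate the list centrally and keep both the sequence and the block size concealed from recruiters.

```python
import itertools
import random


def possible_blocks(block_size=4):
    """All distinct orderings of a block containing equal numbers of T and C."""
    base = "T" * (block_size // 2) + "C" * (block_size // 2)
    return sorted({"".join(p) for p in itertools.permutations(base)})


def block_randomisation(n_participants, block_size=4, seed=None):
    """Concatenate randomly chosen permuted blocks until n_participants are allocated."""
    rng = random.Random(seed)
    blocks = possible_blocks(block_size)
    sequence = []
    while len(sequence) < n_participants:
        sequence.extend(rng.choice(blocks))  # adds the block's letters one by one
    return sequence[:n_participants]


print(possible_blocks())  # ['CCTT', 'CTCT', 'CTTC', 'TCCT', 'TCTC', 'TTCC']
print("".join(block_randomisation(12, seed=2026)))
```

After every fourth participant the arms are exactly balanced, which is the strength listed in the table; it is also why a recruiter who knows the block size can often predict the last allocation in each block.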

Key point: The randomisation sequence must be CONCEALED from those recruiting participants. If the recruiter knows the next allocation, they can (consciously or unconsciously) influence who is enrolled → selection bias.

Allocation Concealment vs Blinding

Feature	Allocation Concealment	Blinding
Purpose	Prevent selection bias at enrolment	Prevent performance/detection bias after enrolment
When	BEFORE randomisation (during recruitment)	AFTER randomisation (during treatment/follow-up)
Always possible?	YES — always possible, even in surgery/physical therapy trials	NO — some interventions cannot be blinded (surgery vs medical, behaviour change)
If broken	Destroys the integrity of randomisation	Less catastrophic but introduces bias
Example	Opaque sealed envelopes, central telephone randomisation	Identical placebo tablets, sham surgery, double-dummy technique

Blinding Levels

Level	Who is blinded	Purpose	Applicability
Open label	No one	Practical when blinding impossible	Surgical trials, device trials
Single blind	Participant only	Reduces placebo effect	Drug trials with distinct taste/appearance
Double blind	Participant AND investigator	Gold standard — reduces both performance and detection bias	Most drug trials
Triple blind	Participant, investigator AND data analyst/statistician	Prevents analytic bias	High-quality confirmatory trials

Double-dummy technique: Used when two treatments have different appearances (e.g., pill vs injection). Each participant receives a pill AND an injection — one active, one placebo.

Analysis Populations

Analysis	Definition	Effect on results	Best use
Intention-to-treat (ITT)	Analyse ALL participants in the group they were randomised to, regardless of compliance, crossover, or withdrawal	Conservative for superiority trials (dilutes effect toward null)	PRIMARY analysis for superiority trials
Per-protocol (PP)	Analyse only those who completed the allocated treatment as planned	May OVER-estimate efficacy (only includes compliant participants)	SECONDARY analysis for superiority; often primary (or co-primary with ITT) for non-inferiority
Modified ITT (mITT)	Excludes those who never received any treatment or had no post-randomisation data	Somewhere between ITT and PP	Common compromise in practice
As-treated	Analyse according to the treatment actually received	Most BIASED — breaks randomisation	Not recommended as primary analysis

MRCOG Key Point: ITT is the primary analysis for superiority trials because it preserves the benefit of randomisation (groups remain comparable); PP is secondary. ITT is conservative for superiority but anti-conservative for non-inferiority: by diluting any real difference towards zero, ITT can make an inferior treatment appear non-inferior. In non-inferiority trials PP is therefore often primary, and both analyses should reach the same conclusion.

Trial Types by Aim

Type	Null Hypothesis	Alternative Hypothesis	Key Consideration
Superiority	Treatment = Control	Treatment ≠ Control (or Treatment > Control)	Standard approach
Non-inferiority	Treatment − Control ≤ −Δ (margin)	Treatment − Control > −Δ	Requires pre-specified non-inferiority margin (Δ); PP analysis preferred
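Equivalence	|Treatment − Control| ≥ Δ	|Treatment − Control| < Δ	Requires pre-specified equivalence margin (±Δ); commonly tested with two one-sided tests (TOST)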

Non-inferiority margin selection:
- Should be the largest clinically acceptable difference
- Often chosen to preserve at least half of the active control's effect vs placebo: M1 is the full (conservatively estimated) effect of the active control over placebo, and M2, the margin actually used, is a fraction of M1 (commonly 50%)
- Example: If the active control reduces mortality by 2% vs placebo, Δ might be 1%
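In practice these hypotheses are usually judged by where the confidence interval for the difference (treatment − control) falls relative to the margin. The sketch below assumes a higher difference means a better outcome and uses invented interval limits; the helper names are illustrative. By convention a two-sided 95% CI is used for non-inferiority, and a 90% CI for equivalence tested by TOST at α = 0.05.

```python
def superior(ci_lower):
    """Superiority: the whole CI for (treatment − control) lies above zero."""
    return ci_lower > 0


def non_inferior(ci_lower, margin):
    """Non-inferiority: the lower CI limit lies above −Δ (margin passed as a positive number)."""
    return ci_lower > -margin


def equivalent(ci_lower, ci_upper, margin):
    """Equivalence: the whole CI lies inside (−Δ, +Δ)."""
    return -margin < ci_lower and ci_upper < margin


# Hypothetical results with margin Δ = 1.0 percentage point
print(superior(-0.6), non_inferior(-0.6, 1.0), equivalent(-0.6, 1.4, 1.0))  # False True False
print(equivalent(-0.4, 0.7, 1.0))  # True
```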

Pragmatic vs Explanatory Trials — The PRECIS-2 Framework

Dimension	Explanatory (Efficacy)	Pragmatic (Effectiveness)
Question	"Can it work?" (ideal conditions)	"Does it work in real life?"
Eligibility	Highly selected — narrow criteria	Broad — represents typical patients
Recruitment	Intensive campaigning	Routine clinical pathways
Setting	Specialist academic centres	Primary care / routine hospitals
Intervention	Strictly protocolised, closely monitored	Flexible, as in real practice
Comparator	Placebo or best alternative	Usual care
Follow-up	Frequent, intensive	Routine visits
Outcome	Surrogate or mechanism-based	Clinically meaningful (patient-important)
Primary analysis	ITT and PP both informative	ITT primary
Adherence	Monitored and encouraged	Real-world compliance

Example (O&G): ASPRE trial of aspirin for pre-eclampsia prevention — highly selected (high-risk by FMF algorithm) → explanatory. A pragmatic version would include all nulliparous women.

Adaptive Trial Designs

Definition: Pre-specified plan for modifying trial features based on accumulating data, without undermining validity.

Type	Description	Example
Group sequential	Pre-planned interim analyses with stopping rules	Stop early for efficacy (if overwhelming benefit) or futility (if unlikely to show benefit)
Sample size re-estimation	Blinded re-estimation of variance to adjust sample size	Ensures adequate power
Adaptive randomisation	Allocation ratio changes to favour better-performing arm	More patients receive superior treatment
Seamless phase II/III	Combine dose-finding phase with confirmatory phase	Saves time and patients
Drop-the-loser	Arms dropped if inferior	Multi-arm multi-stage (MAMS) trials
Bayesian adaptive	Continuously update posterior probability	More flexible but complex

Stopping rules for group sequential designs:

Method	Boundary	Characteristics
Haybittle-Peto	p < 0.001 at interim; p < 0.05 at final	Very conservative early; easy to implement
O'Brien-Fleming	Very stringent early boundary, liberal later	Most common; preserves overall α well
Pocock	Same nominal critical value at every look (e.g., p < 0.022 at each of 3 looks)	More likely to stop early
Wang-Tsiatis	Family of boundaries between O'Brien-Fleming and Pocock	Flexible

Sample Size Calculation — Detailed

Why calculate sample size?
1. Ensure adequate POWER to detect a clinically important effect
2. Avoid wasting resources on underpowered studies
3. Meet ethical obligations (patients in an underpowered study may be exposed to harm without benefit)
4. Meet regulatory requirements

Parameters needed:

Parameter	Symbol	Typical value	How to determine
Significance level	α (Type I error)	0.05 (two-sided)	Convention; sometimes 0.01
Power	1 − β	0.80 or 0.90	0.80 is minimum; 0.90 preferred
Effect size	δ	Varies	Minimum clinically important difference (MCID)
Standard deviation	σ	From pilot data/literature	Variability in outcome measure
Allocation ratio	r = n₁/n₂	1:1 is most efficient	Unequal allocation needs larger n

Sample size increases when:
- ✅ Smaller effect size (harder to detect)
- ✅ Lower α (more stringent significance level)
- ✅ Higher power (i.e., lower β)
- ✅ Larger variance (more noise)
- ✅ Unequal group sizes (deviates from 1:1)
- ✅ More comparisons (multiple endpoints or subgroups)
- ✅ Clustering (ICC reduces effective sample size)

Design Effect (for cluster RCTs): DE = 1 + (m − 1) × ICC
- m = average cluster size
- ICC = intra-cluster correlation coefficient (typically 0.01–0.05 in O&G)
- Effective sample size = Actual sample size / DE
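A minimal sketch of the design-effect adjustment; the cluster numbers (20 maternity units of 50 women each, ICC = 0.02) are hypothetical.

```python
def design_effect(mean_cluster_size, icc):
    """DE = 1 + (m − 1) × ICC."""
    return 1 + (mean_cluster_size - 1) * icc


def effective_sample_size(n_total, mean_cluster_size, icc):
    """Number of independent participants the clustered sample is 'worth'."""
    return n_total / design_effect(mean_cluster_size, icc)


de = design_effect(50, 0.02)
print(round(de, 2), round(effective_sample_size(1000, 50, 0.02)))  # 1.98 505
```

Even a small ICC nearly halves the effective size here because each cluster is large; the same 1,000 women in clusters of 10 would give DE = 1.18.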

Worked Example: To detect a difference in mean birth weight of 100 g (SD = 400 g) between smokers and non-smokers, with α = 0.05, power = 0.80, using a two-sided test:

n = [(z_{α/2} + z_β)² × 2σ²] / δ²
n = [(1.96 + 0.84)² × 2 × 400²] / 100²
n = [7.84 × 320,000] / 10,000 = 2,508,800 / 10,000 ≈ 251 per group

So ~502 women needed total.
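The same formula can be evaluated with Python's standard-library `statistics.NormalDist`. Using exact z-values (1.95996 and 0.84162) rather than the rounded 1.96 and 0.84 gives 251.2, which rounds up to 252 per group; the difference from the hand calculation is purely rounding. The function name is illustrative.

```python
import math
from statistics import NormalDist


def n_per_group_means(delta, sd, alpha=0.05, power=0.80):
    """Two-sided comparison of two means, equal allocation."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ≈ 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # ≈ 0.84 for 80% power
    n = (z_alpha + z_beta) ** 2 * 2 * sd ** 2 / delta ** 2
    return math.ceil(n)  # always round UP to a whole participant


print(n_per_group_means(delta=100, sd=400))  # 252
```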

For binary outcomes (e.g., preterm birth rate): Uses different formula based on proportions.
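One commonly used version for two proportions (unpooled variance, no continuity correction) is n per group = (z_{α/2} + z_β)² × [p₁(1 − p₁) + p₂(1 − p₂)] / (p₁ − p₂)². A sketch, assuming we want to detect preterm birth rates of 4% vs 2% (borrowed from the cohort example) with α = 0.05 and 80% power; other variants of the formula give slightly different numbers.

```python
import math
from statistics import NormalDist


def n_per_group_proportions(p1, p2, alpha=0.05, power=0.80):
    """Unpooled-variance approximation for comparing two proportions, equal allocation."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance_sum / (p1 - p2) ** 2)


print(n_per_group_proportions(0.04, 0.02))  # 1139
```

Halving an already rare outcome needs far more participants than the birth-weight example, because the absolute difference (2 percentage points) is small relative to the binomial variance.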

Interim Analyses & Data Monitoring

Data Monitoring Committee (DMC, also called a Data Safety Monitoring Board, DSMB): Independent group of experts with access to unblinded data
Responsibilities: Recommend stopping for efficacy (overwhelming benefit), harm (safety concerns), or futility (no realistic chance of benefit)
Members: Clinicians, statisticians, sometimes ethicists
Must be independent of trial investigators and sponsor
Stopping for futility: Uses conditional power — probability of reaching significant result at final analysis given current data

1.6 Ecological Studies

Design: Groups (populations) as unit of observation, not individuals

Examples: Comparing caesarean section rates across countries; correlating sunlight exposure and pre-eclampsia rates by region

Ecological fallacy (Robinson, 1950): Associations at population level may NOT hold at individual level. Classic example: US states with a higher proportion of immigrants had higher literacy rates, yet at the individual level immigrants were less likely to be literate than native-born Americans; immigrants tended to settle in states whose native-born populations were highly literate.

2. Screening

2.1 The Wilson-Jungner Criteria (1968) — Detailed

The 10 classic criteria proposed by Wilson and Jungner for the WHO. Every MRCOG candidate must know these:

1. The condition should be an important health problem
- Burden of disease measured by incidence, prevalence, morbidity, mortality, economic cost
- In O&G: Down's syndrome (lifetime cost ~£500k); cervical cancer (~850 deaths/year UK); GDM (affects ~5% of pregnancies)
2. There should be an accepted treatment for patients with recognised disease
- If no effective treatment exists, screening may cause harm without benefit
- Exception: conditions where knowing the diagnosis allows reproductive choice (Down's syndrome, anencephaly)
- Example of problematic screening: some rare genetic conditions with no treatment
3. Facilities for diagnosis and treatment should be available
- If screening identifies positives but diagnostic capacity is insufficient → anxiety and harm
- UK has detailed pathways: screen positive → referral to fetal medicine unit or colposcopy within 2 weeks
4. There should be a recognisable latent or early symptomatic stage
- Diseases with a long preclinical phase are good screening targets
- Cervical cancer: HPV infection → CIN I/II/III → invasive cancer (10+ year window)
- Ovarian cancer: NO good latent stage → screening failed to reduce mortality in trials (UKCTOCS, PLCO)
5. There should be a suitable test or examination
- Test must be acceptable, accurate (high sensitivity/specificity), and feasible at population scale
- Combined test for Down's: NT ultrasound (~20 min) plus blood test → acceptable but requires skilled sonographers
6. The test should be acceptable to the population
- Low uptake → programme ineffective
- Cervical screening uptake: ~70% UK (below 80% target)
- Antenatal HIV screening: >99% uptake (well accepted as routine)
7. The natural history of the condition, including development from latent to declared disease, should be adequately understood
- Without knowing natural history, we cannot predict who will progress
- CIN: most low-grade lesions regress, whereas high-grade lesions carry a substantial risk of progression; essential knowledge for appropriate management
8. There should be an agreed policy on whom to treat as patients
- Clear thresholds for intervention needed
- GDM: IADPSG criteria (universal one-step 75 g OGTT) vs NICE criteria (risk-factor-based selection, then 75 g OGTT with different thresholds) produce different prevalence
- HPV vaccine policy: age 12–13 girls (and boys from 2019) in UK
9. The total cost of finding a case should be economically balanced in relation to medical expenditure as a whole
- Cost per QALY gained; NICE threshold ~£20,000–30,000/QALY
- NIPT: ~£500/test; combined test: ~£80; cost-effectiveness was considered by the UK National Screening Committee
10. Case-finding should be a continuing process and not a "once and for all" project
- Screening must be repeated at appropriate intervals
- Cervical screening: 3-yearly (25–49), 5-yearly (50–64)
- Antenatal screening: per pregnancy (not lifetime)

2.2 Test Performance Characteristics — Complete Details

The 2×2 Table

	Disease + (Gold Standard)	Disease − (Gold Standard)	Total
Test +	True positive (TP)	False positive (FP)	TP + FP
Test −	False negative (FN)	True negative (TN)	FN + TN
Total	TP + FN	FP + TN	N

Disease prevalence = (TP + FN) / N — this is the pre-test probability if the screening population mirrors the study population

Key Measures — Expanded with Clinical Interpretation

Measure	Formula	What it tells us	Clinical use
Sensitivity (Sn)	TP / (TP + FN)	Of those WITH disease, how many test positive?	SnNOut: High Sn → negative test rules OUT disease
Specificity (Sp)	TN / (TN + FP)	Of those WITHOUT disease, how many test negative?	SpPIn: High Sp → positive test rules IN disease
Positive Predictive Value (PPV)	TP / (TP + FP)	Of those who test positive, how many actually HAVE disease?	Counselling patient with positive result
Negative Predictive Value (NPV)	TN / (TN + FN)	Of those who test negative, how many actually are FREE of disease?	Counselling patient with negative result
Accuracy	(TP + TN) / N	Proportion correctly classified	Overall measure but misleading when prevalence low

Prevalence Effect on PPV — Expanded

The single most important concept in screening for MRCOG. PPV depends on prevalence, and thus screening works well in high-prevalence populations but poorly in low-prevalence populations.

Worked Example: Test with Sn = 99%, Sp = 99%

Scenario A: High prevalence (50%) — e.g., symptomatic women referred to clinic

	Disease +	Disease −	Total
Test +	495 (TP)	5 (FP)	500
Test −	5 (FN)	495 (TN)	500
Total	500	500	1000

PPV = 495/500 = 99% (a positive test is very reliable)
NPV = 495/500 = 99%

Scenario B: Low prevalence (1%) — general population screening

	Disease +	Disease −	Total
Test +	99 (TP)	99 (FP)	198
Test −	1 (FN)	9,801 (TN)	9,802
Total	100	9,900	10,000

PPV = 99/198 = 50% (half of positives are false!)
NPV = 9,801/9,802 = 99.99%

Scenario C: Very low prevalence (0.1%) — rare disease screening

	Disease +	Disease −	Total
Test +	99 (TP)	999 (FP)	1,098
Test −	1 (FN)	98,901 (TN)	98,902
Total	100	99,900	100,000

PPV = 99/1,098 = 9% (91% of positives are false!)
NPV = 98,901/98,902 ≈ 100%

MRCOG Take-home: Even an "excellent" test (99% Sn, 99% Sp) has PPV of only 50% when prevalence is 1%, and only 9% when prevalence is 0.1%. This is why screening for very rare conditions is problematic.

Clinical Example: NIPT for Down's Syndrome

Sn = 99.5%, Sp = 99.9%
Prevalence at term = 1/800 (0.125%)

PPV = (0.995 × 0.00125) / [(0.995 × 0.00125) + (0.001 × 0.99875)]
PPV = 0.00124 / (0.00124 + 0.000999) = 0.00124 / 0.00224 ≈ 0.55 = 55%

So even NIPT, the best screening test, has PPV ~55% for Down's syndrome in a low-risk population. A positive NIPT still requires confirmatory invasive testing (CVS or amniocentesis).

For a high-risk population (e.g., women aged 40 with combined test risk 1:10), prevalence ≈ 10%:
PPV = (0.995 × 0.10) / [(0.995 × 0.10) + (0.001 × 0.90)] = 0.0995 / (0.0995 + 0.0009) = 99.1%
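All of the PPVs in this section come from Bayes' theorem, PPV = Sn × prev / [Sn × prev + (1 − Sp)(1 − prev)], with NPV as the mirror image. A short sketch that reproduces the scenario and NIPT figures (function names are illustrative):

```python
def ppv(sn, sp, prevalence):
    """Probability of disease given a positive test."""
    true_pos = sn * prevalence
    false_pos = (1 - sp) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


def npv(sn, sp, prevalence):
    """Probability of no disease given a negative test."""
    true_neg = sp * (1 - prevalence)
    false_neg = (1 - sn) * prevalence
    return true_neg / (true_neg + false_neg)


for prev in (0.50, 0.01, 0.001):
    print(f"Sn = Sp = 99%, prevalence {prev:.1%}: PPV {ppv(0.99, 0.99, prev):.1%}")
print(f"NIPT, prevalence 1/800: PPV {ppv(0.995, 0.999, 1 / 800):.1%}")
print(f"NIPT, prevalence 10%: PPV {ppv(0.995, 0.999, 0.10):.1%}")
```

Holding Sn and Sp fixed and varying only the prevalence argument makes the take-home point explicit: the test has not changed, only the population has.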

2.3 Likelihood Ratios — Complete Guide

LR+ = Sensitivity / (1 − Specificity) - Tells you how much more likely a positive test is in someone WITH the disease vs WITHOUT - Range: 1 to ∞ - Higher = better (more diagnostic information)

LR− = (1 − Sensitivity) / Specificity
- How likely a negative result is in someone WITH the disease relative to someone WITHOUT it
- Range: 0 to ∞; an informative test has LR− < 1
- Lower = better (closer to 0)

LR Value	Impact on Post-test Probability
LR+ > 10	Large, often conclusive increase
LR+ 5–10	Moderate increase
LR+ 2–5	Small increase
LR+ 1–2	Minimal increase
LR+ = 1	No diagnostic value
LR− < 0.1	Large, often conclusive decrease
LR− 0.1–0.2	Moderate decrease
LR− 0.2–0.5	Small decrease
LR− 0.5–1.0	Minimal decrease

Using LRs in Clinical Practice (Bayes' Theorem):

Step 1: Convert pre-test probability to pre-test odds
- Odds = probability / (1 − probability)
- Example: pre-test probability of Down's = 1/250 = 0.004
- Pre-test odds = 0.004 / 0.996 ≈ 0.004 (i.e. 1:249)

Step 2: Multiply by the LR to get post-test odds
- Post-test odds = pre-test odds × LR
- If the combined test is positive (illustrative LR+ = 8): post-test odds = 0.004 × 8 = 0.032

Step 3: Convert back to probability
- Post-test probability = odds / (1 + odds)
- = 0.032 / 1.032 = 0.031 = 3.1% (about 1 in 32)
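The three steps translate directly into code. This is a minimal Python sketch with illustrative function names. It also computes LR+ and LR− from the ~85% detection rate at 5% FPR quoted for the combined test in section 2.5, which gives LR+ ≈ 17, a useful comparison with the illustrative LR+ of 8 used above.

```python
def lr_positive(sn, sp):
    return sn / (1 - sp)

def lr_negative(sn, sp):
    return (1 - sn) / sp

def post_test_probability(pre_test_prob, lr):
    pre_odds = pre_test_prob / (1 - pre_test_prob)   # Step 1: probability -> odds
    post_odds = pre_odds * lr                        # Step 2: multiply by the LR
    return post_odds / (1 + post_odds)               # Step 3: odds -> probability

p = post_test_probability(1 / 250, 8)
print(f"Post-test probability = {p:.3f} (about 1 in {1 / p:.0f})")
print(f"Combined test (Sn 85%, Sp 95%): LR+ = {lr_positive(0.85, 0.95):.0f}, "
      f"LR- = {lr_negative(0.85, 0.95):.2f}")
```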

Fagan nomogram: A graphical tool that does this conversion for you. Draw a line from pre-test probability through the LR to read post-test probability directly.

2.4 ROC Curves — Detailed

Receiver Operating Characteristic curve:
- X-axis: 1 − Specificity (false positive rate)
- Y-axis: Sensitivity (true positive rate)
- Each point = the test at a different threshold/cut-off

AUC (Area Under the Curve):

AUC	Interpretation
0.5	No better than chance (diagonal line)
0.6–0.7	Poor
0.7–0.8	Moderate (acceptable)
0.8–0.9	Good
0.9–1.0	Excellent

Choosing the optimal cut-off:
- Youden index: J = Sensitivity + Specificity − 1
  - Maximised at the optimal threshold
  - Gives equal weight to Sn and Sp
- Clinical weighting: if an FN is more harmful than an FP → choose a lower threshold (higher Sn, lower Sp)
  - Example: screening for anencephaly, where a missed case is catastrophic → high sensitivity prioritised
- Economic weighting: cost of FP (anxiety, further tests) vs FN (missed case)
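A sketch of Youden-based cut-off selection plus a trapezoidal estimate of the AUC, in Python. The five (cut-off, Sn, Sp) operating points are invented for illustration; with real data they would come from evaluating the test at each threshold.

```python
# Hypothetical operating points for one test: (cut-off, sensitivity, specificity)
points = [(1.0, 0.98, 0.40), (1.5, 0.92, 0.70), (2.0, 0.85, 0.85),
          (2.5, 0.70, 0.95), (3.0, 0.50, 0.99)]

def youden(sn, sp):
    return sn + sp - 1

best_cutoff, best_sn, best_sp = max(points, key=lambda p: youden(p[1], p[2]))
print(f"Best cut-off by Youden's J: {best_cutoff} (J = {youden(best_sn, best_sp):.2f})")

# ROC coordinates: x = 1 - specificity (FPR), y = sensitivity (TPR), plus the corners (0,0), (1,1)
roc = sorted([(0.0, 0.0), (1.0, 1.0)] + [(1 - sp, sn) for _, sn, sp in points])
# Trapezoidal rule: sum the area of each strip between consecutive ROC points
auc = sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(roc, roc[1:]))
print(f"AUC (trapezoidal) = {auc:.3f}")
```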

Worked O&G Example: cffDNA for Down's syndrome
- AUC > 0.99 (excellent)
- At the standard cut-off (z-score > 3): Sn = 99.5%, Sp = 99.9%
- Trade-off: at z-score > 2, Sn > 99.9% and Sp = 99.0% (more FPs but fewer missed cases)

2.5 Screening in O&G — Complete Clinical Details

Antenatal Screening Programme (UK)

Condition	Screening Test	Timing	Sensitivity	Specificity	Notes
Down's syndrome (T21)	Combined test (NT + PAPP-A + β-hCG)	11–14 wks	~85% at 5% FPR	95%	NICE recommendation
Down's syndrome (T21)	Quadruple test (AFP + hCG + uE3 + Inhibin A)	14–20 wks	~80% at 5% FPR	95%	When late booking or NT not available
Down's syndrome	NIPT (cfDNA)	From 10 wks	>99%	>99%	Contingent screening in NHS (if combined risk ≥ 1:150)
Edwards' syndrome (T18)	Combined test + NIPT	11–14 wks	~90%	99.9%	Low PAPP-A and hCG
Patau's syndrome (T13)	Combined test + NIPT	11–14 wks	~85%	99.9%	Low PAPP-A and hCG
Neural tube defects	AFP + anomaly scan	18–20 wks	~90% (anencephaly)	>99%	Anomaly scan is gold standard

Fetal Anomaly Screening Programme (FASP) — UK

The 11 conditions screened for at the 18–20 week anomaly scan:

Anencephaly — absence of cranial vault; uniformly lethal
Open spina bifida — neural tube defect; severity varies
Cleft lip — with or without cleft palate
Diaphragmatic hernia — herniation of abdominal contents into chest
Gastroschisis — abdominal wall defect (right of umbilical cord)
Exomphalos (omphalocele) — abdominal wall defect (midline, membrane-covered)
Serious cardiac anomalies — four-chamber view + outflow tracts (detects ~50% of major CHD)
Bilateral renal agenesis — absence of both kidneys → anhydramnios → pulmonary hypoplasia
Lethal skeletal dysplasia — severe short limbs, narrow thorax
Edwards' syndrome (T18) — structural anomalies + growth restriction
Patau's syndrome (T13) — structural anomalies + holoprosencephaly

Detection rates for the anomaly scan:
- Anencephaly: ~98%
- Open spina bifida: ~90%
- Cleft lip: ~75%
- Diaphragmatic hernia: ~60%
- Gastroschisis: ~90%
- Major cardiac anomalies: ~50%
- Bilateral renal agenesis: ~85%

Gestational Diabetes Mellitus (GDM) Screening

Approach	Method	Criteria	Prevalence detected
Universal (IADPSG/WHO)	One-step: 75g OGTT at 24–28 wks	Fasting ≥5.1, 1h ≥10.0, 2h ≥8.5 mmol/L	~15–20%
Selective (NICE)	Risk-factor based: 75g OGTT at 24–28 wks	Fasting ≥5.6, 2h ≥7.8 mmol/L	~5%
Two-step (ACOG)	50g GCT → if ≥7.8 → 100g OGTT (Carpenter-Coustan)	Two values elevated	~6–8%

NICE risk factors for GDM (2015):
- BMI > 30 kg/m²
- Previous GDM
- Family history (first-degree relative with diabetes)
- Family origin with a high prevalence of diabetes (e.g. South Asian, Black Caribbean, Middle Eastern)
- Previous macrosomic baby (≥4.5 kg)

Polycystic ovary syndrome is associated with GDM but is not one of the NICE screening criteria.

Cervical Screening (NHS Cervical Screening Programme)

Aspect	Details
Age range	25–64 years
Frequency	3-yearly (25–49); 5-yearly (50–64)
Primary test	HPV test (since 2019)
Reflex cytology	If HPV positive → cytology on same sample
Colposcopy referral	HPV positive + abnormal cytology (≥ borderline)
HPV 16/18 genotyping	If HPV positive with normal cytology → genotyping; 16/18+ → colposcopy; other HR-HPV → repeat in 12 months
Upper age	64 (if last 2 screens negative, no further screening)
Uptake	~70% (below 80% target)
Approach	Call-recall system via GP registration

Group B Streptococcus (GBS) Screening

Aspect	UK Practice	US Practice
Approach	Risk-factor based	Universal screening
Timing	At labour onset (risk-based)	35–37 weeks
Test	Not routine	Vaginal-rectal swab (enriched culture)
Risk factors	Previous GBS baby, GBS bacteriuria in pregnancy, preterm labour, prolonged ROM (>18h), intrapartum fever ≥38°C	None (universal screening)

Other Antenatal Screening

Test	Timing	Notes
HIV	Booking (and 28 wks if high risk)	Vertical transmission rate <1% with treatment
HBsAg	Booking	Hepatitis B — immunoprophylaxis reduces vertical transmission
Syphilis (TPPA/VDRL)	Booking	Congenital syphilis preventable
Rubella IgG	Booking	Susceptibility detected → post-partum vaccination (routine antenatal rubella screening was discontinued in England in 2016)
Sickle cell and thalassaemia	Booking	Family origin questionnaire + Hb HPLC
Asymptomatic bacteriuria	Booking	Urine culture (MSU)
Anaemia	Booking + 28 wks	FBC

2.6 Screening Biases — Expanded

Bias	Mechanism	Example
Lead time bias	Screening advances time of diagnosis but does NOT delay death. Survival appears longer because the clock starts earlier, even if death occurs at the same time.	Screening for ovarian cancer: if diagnosis moved from age 62 to age 60 but death at age 65 in both, apparent "survival" increases from 3 to 5 years — no real benefit.
Length time bias	Screening preferentially detects slower-growing (less aggressive) disease because it stays in the detectable preclinical phase longer. Fast-growing aggressive disease is more likely to present symptomatically between screens.	Cervical screening: Screen-detected CIN tends to be slower-progressing. Rapidly progressive cancers may present as interval cancers between screens.
Overdiagnosis	Detection of disease that would NEVER have caused symptoms or death. The patient is "harmed" by unnecessary diagnosis and treatment.	Screening for neuroblastoma in infants (abandoned due to overdiagnosis); overdiagnosis in thyroid and breast cancer screening is well-documented.
Selection bias (volunteer bias)	People who participate in screening are systematically different from those who don't — typically healthier, more health-conscious, higher SES.	Women attending for cervical screening have lower cervical cancer risk regardless of screening (healthy behaviours).
Recall rate	Proportion of the screened population recalled for further investigation. Must balance: high recall → more cases detected but more anxiety and cost; low recall → missed cases.	Combined test recall rate: ~5% (a screen-positive result). Only a small minority of these, roughly 2–3% depending on the prevalence assumed, have Down's syndrome (low PPV in a low-risk population).
False positive rate	Proportion of unaffected pregnancies incorrectly labelled as high-risk. Causes anxiety, unnecessary invasive tests (with miscarriage risk ~0.5–1%) and increased healthcare costs.	Combined test FPR = 5%. For every 100,000 women screened at a prevalence of 1/800 (125 affected), ~4,994 unaffected pregnancies (5% of 99,875) will be screen-positive, compared with ~106 true positives (85% of 125): PPV ≈ 2%.

Screening vs Diagnostic Accuracy

Aspect	Screening Test	Diagnostic Test
Population	Asymptomatic, low prevalence	Symptomatic, high pre-test probability
Purpose	Identify those who need diagnostic testing	Confirm or exclude diagnosis
Test characteristics	High sensitivity (minimise FNs)	High specificity (minimise FPs)
Acceptability	Must be acceptable to healthy people	Acceptability less critical
Cost	Must be cheap	Can be more expensive
PPV	Often low (due to low prevalence)	Higher (due to pre-test probability)
Example	Combined test for Down's (screening)	CVS/amniocentesis for karyotype (diagnostic)

3. Descriptive Statistics

3.1 Types of Data — Complete Classification

                      ┌──────────┐
                      │   Data   │
                      └────┬─────┘
              ┌────────────┴────────────┐
       ┌──────┴──────┐           ┌──────┴──────┐
       │ Categorical │           │  Numerical  │
       └──────┬──────┘           └──────┬──────┘
        ┌─────┴─────┐             ┌─────┴─────┐
     Nominal     Ordinal      Discrete   Continuous

Data Type	Description	Examples in O&G	Permissible Statistics
Nominal	Unordered categories	Blood group (A, B, AB, O), ethnicity, parity type (nulliparous/multiparous), mode of delivery (SVD, instrumental, CS)	Mode, frequency, χ², Fisher's exact
Ordinal	Ordered categories	FIGO stage (I–IV), pain score (0–10), Bishop's score, Apgar score, severity of incontinence (mild/moderate/severe)	Median, IQR, Mann-Whitney, Wilcoxon, percentiles
Discrete	Integer values (countable)	Parity (0, 1, 2...), number of miscarriages, gravidity, number of previous CS	Mean (if normally distributed), median (if skewed)
Continuous	Any value on a continuum	Birth weight, gestational age, BMI, blood pressure, Hb, cervical length	Mean, SD, t-test, ANOVA, regression

Special case: Binary/Dichotomous data (nominal with exactly 2 categories)
- Examples: alive/dead, pregnant/not pregnant, term/preterm
- Can use: proportions, OR, RR, logistic regression

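Hierarchy of data: as you go from nominal → ordinal → interval → ratio, you gain more mathematical properties and more statistical options. Numerical data can be interval (equal spacing but no true zero, e.g. temperature in °C) or ratio (a true zero, so ratios are meaningful, e.g. birth weight).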

3.2 Measures of Central Tendency — Complete Guide

Measure	Definition	Formula	When to use
Mean (arithmetic)	Sum of all values divided by number of values	x̄ = Σxᵢ / n	Normally distributed, interval/ratio data
Median	Middle value when data ordered from smallest to largest	Value at position (n+1)/2	Skewed data, ordinal data, presence of outliers
Mode	Most frequently occurring value	Value with highest frequency	Nominal data, bimodal distributions

Mean

Advantages: uses all data points; mathematically tractable (basis for many statistical tests)
Disadvantages: affected by outliers and skewness

Example: birth weights (kg) of 5 babies: 2.5, 3.0, 3.2, 3.5, 4.8
- Mean = (2.5 + 3.0 + 3.2 + 3.5 + 4.8) / 5 = 17.0 / 5 = 3.4 kg
- Median = 3.2 kg (3rd of the 5 ordered values)
- Mode: no repeated values → no mode

If the 4.8 kg value had been mis-entered as 10.0 kg:
- Mean = (2.5 + 3.0 + 3.2 + 3.5 + 10.0) / 5 = 22.2 / 5 = 4.44 kg (dramatically changed!)
- Median = 3.2 kg (unchanged!)
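The robustness of the median, and the fragility of the mean, can be checked with Python's standard `statistics` module. The `summarise` helper is illustrative; it reports "none" when no value repeats, matching the "no mode" convention above.

```python
from collections import Counter
from statistics import mean, median

def summarise(values):
    counts = Counter(values)
    top = max(counts.values())
    # A mode only exists if some value occurs more than once
    modes = [v for v, c in counts.items() if c == top] if top > 1 else []
    return mean(values), median(values), modes

original = [2.5, 3.0, 3.2, 3.5, 4.8]
mis_entered = [2.5, 3.0, 3.2, 3.5, 10.0]   # 4.8 kg keyed in as 10.0 kg
for label, data in [("original", original), ("with error", mis_entered)]:
    m, med, modes = summarise(data)
    print(f"{label}: mean = {m:.2f} kg, median = {med} kg, mode = {modes or 'none'}")
```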

Median

Advantages: robust to outliers and skewness; appropriate for ordinal data
Disadvantages: does not use all the data; less mathematically tractable

Calculation:
- If n is odd: the middle value (e.g., n = 5 → 3rd value)
- If n is even: the average of the two middle values (e.g., n = 6 → average of 3rd and 4th values)

Mode

Advantages: the only measure of central tendency for nominal data; can identify bimodal distributions
Disadvantages: may not exist (no repeated values); may not be unique

Bimodal distribution example: birth weights in a mixed population of preterm and term babies may show two peaks.

Skewness — Visualising the Distribution

[Figure: normal (symmetrical), positively skewed (long right tail) and negatively skewed (long left tail) distributions; the mean/median/mode ordering for each is given in the table below.]

Skew	Direction	Relationship	Example in O&G
Positive (right) skew	Long tail to the right	Mean > Median > Mode	Length of hospital stay after CS, parity in general population, time to conceive
Negative (left) skew	Long tail to the left	Mean < Median < Mode	Age at menopause (most women 48–52, few <40 or >55)
No skew (symmetrical)	Bell-shaped	Mean = Median = Mode	Normally distributed: birth weight in term infants, height

Skewness coefficient = 0 for normal distribution; >0 for positive skew; <0 for negative skew.

Kurtosis: measures the "peakedness" (tail weight) of a distribution
- Leptokurtic: tall peak, heavy tails (more outliers)
- Platykurtic: flat peak, thin tails
- Mesokurtic: normal distribution (kurtosis = 3 for normal; excess kurtosis = 0)

3.3 Measures of Dispersion — Complete Guide

Measure	Formula/Definition	Robust to outliers?	When to use
Range	Max − Min	NO	Quick summary only
Interquartile Range (IQR)	Q3 − Q1 (75th − 25th percentile)	YES	With median; skewed data
Population variance (σ²)	Σ(xᵢ − μ)² / N	NO	Intermediate calculation for SD (whole population)
Sample variance (s²)	Σ(xᵢ − x̄)² / (n−1)	NO	Unbiased estimate from sample
Standard deviation (SD)	√Variance	NO	Mean ± SD for normal data
Coefficient of variation (CV)	(SD / Mean) × 100%	—	Comparing variability across different scales
Standard error of mean (SEM)	SD / √n	—	Precision of sample mean estimate

Range

Simplest measure
Highly sensitive to outliers
Tends to underestimate the population range in small samples (extreme values may not be sampled)

Interquartile Range (IQR)

Contains middle 50% of data
Q1 = 25th percentile, Q3 = 75th percentile
Used with median for skewed data
Box plot whiskers typically extend to the most extreme data points lying within 1.5 × IQR of Q1 and Q3

Variance and Standard Deviation

Variance = average squared deviation from the mean
- Population variance: σ² = Σ(xᵢ − μ)² / N
- Sample variance: s² = Σ(xᵢ − x̄)² / (n−1)

Why n−1? Bessel's correction — using n−1 gives an unbiased estimate of population variance from a sample.

Standard deviation = √variance
- In the SAME units as the original data (unlike variance)
- For normally distributed data: 68% of values lie within mean ± 1 SD, 95% within mean ± 1.96 SD

Worked Example: Cervical length measurements (mm): 25, 30, 32, 35, 38

Step	Calculation	Result
Mean	(25+30+32+35+38)/5	32
Deviations (xᵢ − x̄)	25−32, 30−32, 32−32, 35−32, 38−32	−7, −2, 0, +3, +6
Squared deviations	49, 4, 0, 9, 36
Sum of squares	49+4+0+9+36	98
Variance (sample)	98/(5−1)	24.5 mm²
SD	√24.5	4.95 mm
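The same calculation with Python's `statistics` module, whose `variance` and `stdev` functions use the n − 1 (Bessel-corrected) denominator for sample data:

```python
from statistics import mean, variance, stdev

cervical_length_mm = [25, 30, 32, 35, 38]
x_bar = mean(cervical_length_mm)
sum_sq = sum((x - x_bar) ** 2 for x in cervical_length_mm)   # sum of squared deviations
s2 = sum_sq / (len(cervical_length_mm) - 1)                  # Bessel's correction: divide by n - 1

print(f"mean = {x_bar} mm, sum of squares = {sum_sq}, s^2 = {s2} mm^2")
print(f"statistics.variance = {variance(cervical_length_mm)}, "
      f"SD = {stdev(cervical_length_mm):.2f} mm")
```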

Coefficient of Variation (CV)

CV = (SD / Mean) × 100%
Allows comparison of variability across different scales or units
Example: Birth weight SD = 400g, mean = 3400g → CV = 11.8%
Another population: SD = 300g, mean = 2800g → CV = 10.7%
The first population has higher absolute variability but similar relative variability

Standard Error of the Mean (SEM)

CRITICAL DISTINCTION for MRCOG: SD vs SEM

	SD	SEM
What it describes	Variability of INDIVIDUAL observations	Precision of the SAMPLE MEAN estimate
Formula	SD = √(Σ(x−x̄)²/(n−1))	SEM = SD / √n
Effect of n	Stable (doesn't systematically change with n)	DECREASES as n increases (more data = more precise mean)
Interpretation	~95% of individuals fall within x̄ ± 2SD	~95% CI for the mean = x̄ ± 2×SEM
Use	Describing population spread	Inferential statistics, CI for mean

Example: birth weight study, n = 1000, mean = 3400 g, SD = 400 g
- SEM = 400 / √1000 = 400 / 31.6 = 12.7 g
- 95% CI for the mean = 3400 ± 1.96 × 12.7 = 3400 ± 24.9 = (3375, 3425)
- Interpretation: we are 95% confident the true population mean lies between 3375 g and 3425 g

Note: 95% of INDIVIDUAL birth weights are in the range 3400 ± 800g (= mean ± 2SD), NOT the 95% CI of the mean.
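A short sketch contrasting the SEM-based 95% CI for the mean with the spread of individual birth weights, using the figures above. It uses 1.96 rather than the rounded 2, so the individual range comes out at about 2616 to 4184 g rather than 2600 to 4200 g, and the SEM prints as 12.6 g because √1000 is not rounded to 31.6.

```python
import math

n, mean_bw, sd_bw = 1000, 3400, 400          # birth-weight study figures from the text
sem = sd_bw / math.sqrt(n)                    # precision of the sample mean
ci_mean = (mean_bw - 1.96 * sem, mean_bw + 1.96 * sem)
ref_range = (mean_bw - 1.96 * sd_bw, mean_bw + 1.96 * sd_bw)   # spread of individual babies

print(f"SEM = {sem:.1f} g")
print(f"95% CI for the mean: {ci_mean[0]:.0f} to {ci_mean[1]:.0f} g")
print(f"~95% of individual babies: {ref_range[0]:.0f} to {ref_range[1]:.0f} g")
```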

3.4 Normal Distribution — Complete Details

Properties of the Normal (Gaussian) Distribution:

Symmetrical about the mean
Mean = Median = Mode
Bell-shaped with tails approaching but never reaching zero
Defined by two parameters: μ (mean) and σ (SD)
Area under curve = 1 (probability)

The 68-95-99.7 Rule

Range	Proportion included	Commonly known as
μ ± 1σ	68.27%	68%
μ ± 1.645σ	90.00%	90% reference range (5th–95th percentiles)
μ ± 1.96σ	95.00%	95% reference range
μ ± 2σ	95.45%	Approximate 95%
μ ± 2.58σ	99.00%	99% reference range
μ ± 3σ	99.73%	99.7%

Standard Normal Distribution

Z = (x − μ) / σ
Mean = 0, SD = 1
Z-table gives the probability of values less than a given Z-score
Critical values for hypothesis testing:
z₀.₀₂₅ = 1.96 (two-tailed 95% test)
z₀.₀₅ = 1.645 (one-tailed 95% test)
z₀.₀₀₅ = 2.58 (two-tailed 99% test)

Worked example: what proportion of term babies weigh <2500 g if mean = 3400 g and SD = 400 g (assuming normality)?
- Z = (2500 − 3400) / 400 = −900/400 = −2.25
- P(Z < −2.25) = 0.0122 (from Z-table)
- → 1.22% of term babies weigh <2500 g
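Python's `statistics.NormalDist` can replace the Z-table lookup. This sketch assumes, as the example does, that term birth weight is normally distributed with mean 3400 g and SD 400 g.

```python
from statistics import NormalDist

birth_weight = NormalDist(mu=3400, sigma=400)   # assumed normal, as in the example
z = (2500 - birth_weight.mean) / birth_weight.stdev
print(f"z = {z:.2f}")                                                       # -2.25
print(f"P(weight < 2500 g) = {birth_weight.cdf(2500):.4f}")                 # ~0.0122
print(f"Two-tailed 95% critical value: {NormalDist().inv_cdf(0.975):.2f}")  # 1.96
```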

Central Limit Theorem (CLT)

Critical theorem: The sampling distribution of the mean approaches a normal distribution as sample size increases, REGARDLESS of the shape of the population distribution.

Why this matters: Even with skewed data, the sample mean is approximately normally distributed if n is large enough (typically n > 30)
This underpins: Use of z-tests and t-tests even for non-normal data when n is large
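A small simulation illustrates the CLT. The exponential population (mean 1, SD 1, strongly right-skewed) is an arbitrary stand-in for skewed clinical data such as length of stay; the sample size of 50 and the 2,000 replicates are also arbitrary choices for this sketch.

```python
import random
import statistics

random.seed(1)                      # fixed seed so the sketch is reproducible
population_mean = 1.0               # exponential distribution: mean 1, SD 1, right-skewed
n, replicates = 50, 2000

# Draw many samples of size n and keep each sample's mean
sample_means = [
    statistics.fmean(random.expovariate(1 / population_mean) for _ in range(n))
    for _ in range(replicates)
]

print(f"Mean of sample means: {statistics.fmean(sample_means):.3f} (population mean 1.0)")
print(f"SD of sample means:   {statistics.stdev(sample_means):.3f} "
      f"(theory, SD/sqrt(n): {1 / n ** 0.5:.3f})")
```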

Standard Error vs Standard Deviation — Clinical Example

A study measures birth weight in 10,000 babies (mean 3400 g, SD 400 g).
- SD = 400 g → most babies weigh between 2600 g and 4200 g (±2 SD)
- SEM = 400/√10000 = 4 g → the mean is estimated very precisely (95% CI ~3392 to 3408 g)

A common reporting error: writing mean ± SD where mean ± SEM is intended (or vice versa). The MRCOG exam may test your ability to distinguish the two.

3.5 Skewed Distributions & Transformations

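Log-normal distribution:
- Data are positively skewed and strictly positive
- After log-transformation, the data become approximately normally distributed
- Common in O&G: length of labour, time to pregnancy, hormone levels (e.g., hCG)
- Skewed counts that include zeros, such as parity, cannot be log-transformed directly; a square-root transformation or non-parametric methods are used instead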

Transformation options:

Transformation	Formula	When to use
Log	y = ln(x) or y = log₁₀(x)	Positive skew; multiplicative data
Square root	y = √x	Count data with moderate skew
Reciprocal	y = 1/x	Strong skew
Box-Cox	y = (x^λ − 1)/λ	Generalised power transformation
Logit	y = ln[p/(1−p)]	Proportions (0 to 1)
Arcsine	y = arcsin(√p)	Proportions (stabilises variance)

How to check normality:
1. Histogram: visual inspection (bell-shaped?)
2. Q-Q plot (quantile-quantile plot): points along the diagonal = normal
3. Shapiro-Wilk test: most powerful for small n (H₀: data are normal)
4. Kolmogorov-Smirnov test: suitable for large n
5. Skewness and kurtosis: skewness between −2 and +2 and kurtosis between −7 and +7 are often considered acceptable
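A sketch of how a log transformation reduces positive skew, measured with the moment coefficient of skewness. The labour-duration values are hypothetical, and this is a descriptive check only; a formal Shapiro-Wilk test would need a library such as SciPy.

```python
import math
from statistics import fmean

def skewness(values):
    """Moment coefficient of skewness g1 = m3 / m2**1.5 (population moments)."""
    m = fmean(values)
    m2 = fmean((x - m) ** 2 for x in values)
    m3 = fmean((x - m) ** 3 for x in values)
    return m3 / m2 ** 1.5

labour_hours = [2, 3, 3, 4, 4, 5, 6, 8, 12, 20]      # hypothetical, right-skewed
logged = [math.log(x) for x in labour_hours]         # natural-log transformation

print(f"skewness before log: {skewness(labour_hours):.2f}")
print(f"skewness after log:  {skewness(logged):.2f}")
```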

3.6 Data Presentation — Types of Graphs

Graph	Type of Data	Variables	Purpose	Key features
Histogram	Continuous	One variable	Show distribution shape	Bars TOUCH; bin width matters
Bar chart	Categorical	One or two categorical	Compare frequencies	Bars DO NOT touch
Box plot	Continuous	One variable (or grouped)	Show median, IQR, outliers	Whiskers to the most extreme point within 1.5×IQR
Scatter plot	Continuous	Two continuous variables	Show relationship/correlation	Look for direction, strength, outliers
Line graph	Continuous (often time)	Continuous × time	Trend over time	Time on x-axis
Pie chart	Categorical	One categorical (proportions)	Show parts of a whole	Avoid >5 categories
Kaplan-Meier	Time-to-event	Survival time + group	Survival analysis	Step function; censoring marks
Forest plot	Meta-analysis	Multiple studies	Summarise effect sizes	Square size = weight; diamond = summary
Funnel plot	Meta-analysis	Effect size vs precision	Assess publication bias	Asymmetry suggests publication bias or small-study effects
Bland-Altman	Continuous	Two measurement methods	Assess agreement	Difference vs mean of two methods

Histogram vs Bar Chart — Critical MRCOG Distinction

Feature	Histogram	Bar Chart
Data type	Continuous (or large discrete)	Categorical
Bars	Touch (no gap)	Do not touch (gap between)
Order	Natural order of variable (cannot reorder)	Can be reordered (e.g. alphabetical, by frequency)
Width	Can vary (if unequal bin widths)	Always equal
Example	Distribution of birth weights	Caesarean section rates by hospital

Box Plot Interpretation

     Upper whisker (largest value ≤ Q3 + 1.5×IQR)
          │
     ─────┼─────   Q3 (75th percentile)
          │
     ─────┼─────   Median (Q2, 50th percentile)
          │
     ─────┼─────   Q1 (25th percentile)
          │
     Lower whisker (smallest value ≥ Q1 − 1.5×IQR)
          │
          ●        Outlier (>1.5×IQR beyond Q1 or Q3)

Uses:
- Comparing distributions across groups (e.g., birth weight by maternal smoking status)
- Identifying outliers
- Showing skewness (median not centred in the box)

Scatter Plot Interpretation

Look for:
- Direction: positive (both increase together) or negative (one increases as the other decreases)
- Strength: how closely points follow a line (tight = strong correlation)
- Shape: linear, curvilinear, no pattern
- Outliers: points far from the main cluster
- Subgroups: distinct clusters suggest different populations

Bland-Altman Plot for Method Comparison

X-axis: Mean of two measurements [(method A + method B)/2]
Y-axis: Difference (method A − method B)
Central horizontal line: Mean difference (bias)
Dashed lines: Limits of agreement (mean ± 1.96 SD of differences)
If limits are clinically acceptable → methods can be used interchangeably
Used for: Comparing ultrasound measurements between operators, comparing new test to gold standard
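The Bland-Altman quantities are simple to compute. The crown-rump length measurements below are hypothetical; in practice the differences would be plotted against the means, with the bias and limits of agreement drawn as horizontal lines.

```python
from statistics import fmean, stdev

# Hypothetical crown-rump lengths (mm) measured by two sonographers on the same 8 fetuses
operator_a = [45.1, 52.3, 60.0, 38.7, 70.2, 55.5, 48.9, 63.4]
operator_b = [44.0, 53.1, 58.6, 39.5, 68.9, 55.0, 47.7, 62.1]

diffs = [a - b for a, b in zip(operator_a, operator_b)]        # y-axis: A - B
means = [(a + b) / 2 for a, b in zip(operator_a, operator_b)]  # x-axis: (A + B) / 2, for plotting

bias = fmean(diffs)                                            # central line: mean difference
sd_diff = stdev(diffs)
lower, upper = bias - 1.96 * sd_diff, bias + 1.96 * sd_diff    # 95% limits of agreement

print(f"bias = {bias:.2f} mm; limits of agreement {lower:.2f} to {upper:.2f} mm")
```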

4. Hypothesis Testing

4.1 Fundamental Concepts — Complete

Concept	Symbol	Definition	Everyday analogy
Null hypothesis	H₀	No difference / no association / no effect	"He is innocent"
Alternative hypothesis	H₁	There IS a difference / association / effect	"He is guilty"
Type I error	α	Reject H₀ when H₀ is actually true (false positive)	Convicting an innocent person
Type II error	β	Fail to reject H₀ when H₁ is true (false negative)	Letting a guilty person go free
Power	1 − β	Correctly rejecting H₀ when H₁ is true (probability of detecting a real effect)	Convicting a guilty person
p-value	p	Probability of observing the data (or more extreme) assuming H₀ is true	Not directly analogous

4.2 Type I and Type II Errors — The 2×2 Framework

Decision	H₀ TRUE	H₁ TRUE (H₀ FALSE)
Reject H₀	Type I error (α) [FALSE POSITIVE]	✅ CORRECT (True positive)
Fail to reject H₀ (loosely, "accept H₀")	✅ CORRECT (True negative)	Type II error (β) [FALSE NEGATIVE]

Type I Error (α)

α = 0.05 means: If H₀ is true (no real effect), there is a 5% chance we will incorrectly conclude there IS an effect
Trades off with Type II error — making α stricter (e.g., 0.01) reduces false positives but increases false negatives
Multiple testing: if you test 20 independent hypotheses that are all truly null, the expected number of false positives = 20 × 0.05 = 1 (hence the Bonferroni correction)

Type II Error (β) and Power

β = 0.20 → Power = 0.80 is conventional minimum
β = 0.10 → Power = 0.90 is preferred
Power depends on:
Sample size (n): Larger n → higher power
Effect size (δ): Larger effect → higher power
α-level: Less strict α (e.g., 0.05 vs 0.01) → higher power
Variance (σ²): Lower variance → higher power

Worked Example of power concept: A study of 50 women finds no significant difference in birth weight between smokers and non-smokers (p = 0.12). The study was designed with 80% power to detect a 200g difference. The actual observed difference was 150g — the study was UNDER-powered to detect this smaller difference. Therefore the non-significant result does NOT mean there is no effect — it means we cannot rule out an effect of this size.
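Power and sample size are linked by the standard normal-approximation formula for comparing two means, n per group = 2(z₁₋α/₂ + z₁₋β)² σ² / δ². The sketch below assumes a birth-weight SD of 400 g, which the example does not state, and shows how quickly the required sample size grows as the target difference shrinks. The function name is illustrative.

```python
import math
from statistics import NormalDist

def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Approximate n per group to compare two means (normal approximation, two-tailed)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)            # 0.84 for 80% power
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * sd ** 2 / delta ** 2)

# ASSUMPTION: birth-weight SD of 400 g (not given in the example)
for delta in (200, 150):
    print(f"difference {delta} g: about {n_per_group(delta, sd=400)} women per group")
```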

4.3 The p-value — Essential MRCOG Detail

CRITICAL EXAM POINT: The p-value is NOT the probability that the null hypothesis is true! This is the single most common statistical misconception tested in MRCOG.

Mathematical Definition: p-value = P(observed data OR more extreme | H₀ true)

It is NOT P(H₀ true | observed data)

The correct interpretation: "If there were truly no difference between groups, the probability of observing a difference at least as large as the one we saw is p."

Common Misconceptions — All WRONG:

❌ Incorrect Statement	✅ Correct Interpretation
"p = 0.03 means there is a 3% chance H₀ is true"	p = 0.03 means: if H₀ were true, we'd see data this extreme only 3% of the time
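"p = 0.05 means there is a 5% probability the result is due to chance"	p is calculated assuming chance alone (H₀) explains the data; it is the probability of the data given H₀, not the probability that chance is the explanation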
"p > 0.05 means the treatment is equivalent to placebo"	Non-significant does NOT mean no effect — may be underpowered
"p = 0.001 means the effect is very large"	p does NOT measure effect size — only strength of evidence against H₀
"We failed to reject H₀, so H₀ is true"	We cannot prove H₀ — only fail to find evidence against it

4.4 Confidence Intervals — Detailed

Definition: a 95% confidence interval comes from a procedure that, over repeated samples, would produce intervals containing the true population parameter 95% of the time.

Correct interpretation: If we repeated the study 100 times and calculated a 95% CI each time, about 95 of those CIs would contain the true population value.

WRONG interpretation: "There is a 95% probability that the true value lies within this CI" — this is a Bayesian credible interval interpretation, not a frequentist CI.

CI provides MORE information than a p-value:
- Shows the ESTIMATE (best guess of effect size)
- Shows the PRECISION (width = how certain we are)
- Shows STATISTICAL SIGNIFICANCE (if the 95% CI excludes the null value → p < 0.05)
- Shows CLINICAL SIGNIFICANCE (even if significant, is the entire CI in a clinically meaningful range?)

CI includes null?	p-value	Interpretation
Yes (e.g., RR 1.2, 95% CI 0.9–1.5)	p ≥ 0.05	Not statistically significant
No (e.g., RR 1.2, 95% CI 1.01–1.5)	p < 0.05	Statistically significant
No (e.g., RR 1.2, 95% CI 1.1–1.3)	p < 0.001	Significant AND precise
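The link between a CI and a p-value can be made explicit by back-calculating an approximate p-value from a ratio and its 95% CI on the log scale. This sketch assumes the CI was computed symmetrically on the log scale around the estimate, so it is only an approximation for published intervals; the first and third rows of the table are used as inputs, and the function name is illustrative.

```python
import math
from statistics import NormalDist

def p_from_ratio_ci(estimate, lower, upper, level=0.95):
    """Approximate two-sided p-value from a ratio (RR/OR) and its CI, on the log scale."""
    z_crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    se = (math.log(upper) - math.log(lower)) / (2 * z_crit)   # SE of log(ratio)
    z = math.log(estimate) / se                              # null value: log(1) = 0
    return 2 * (1 - NormalDist().cdf(abs(z)))

for lo, hi in [(0.9, 1.5), (1.1, 1.3)]:
    print(f"RR 1.2 (95% CI {lo}-{hi}): p = {p_from_ratio_ci(1.2, lo, hi):.5f}")
```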

Example: RR for preterm birth in smokers vs non-smokers
- Study A: RR = 1.5, 95% CI 0.8–2.2 (wide CI → imprecise; not significant)
- Study B: RR = 1.3, 95% CI 1.1–1.5 (narrow CI → precise; significant)
- Study C: RR = 1.1, 95% CI 1.01–1.19 (significant but clinically marginal)

4.5 One-tailed vs Two-tailed Tests

Aspect	Two-tailed	One-tailed
Alternative hypothesis	H₁: μ₁ ≠ μ₂ (difference in either direction)	H₁: μ₁ > μ₂ (or μ₁ < μ₂)
When to use	Default — almost always	Only if difference in opposite direction is impossible or irrelevant
α distribution	Split equally between both tails (2.5% each)	All 5% in one tail
Critical value (α=0.05)	z = ±1.96	z = 1.645
For same data	p-value is 2× the one-tailed p	p-value is half the two-tailed p
Sample size	Larger	Smaller (for same power)
Controversy	Safe and standard	Inflates the Type I error rate if the direction is chosen after seeing the data, and cannot detect an effect in the unexpected direction

MRCOG rule: Always use two-tailed unless you have an extremely strong justification. The exam expects two-tailed as default.

Example: Comparing two antihypertensives in pregnancy — you cannot be certain a new drug won't be worse → two-tailed. If comparing a known teratogen to placebo, you might use one-tailed (it can't reduce malformation risk below background), but even then, two-tailed is safer.

4.6 Multiple Testing — Corrections

The problem: Each statistical test at α = 0.05 has a 5% chance of false positive. If you run many tests, the familywise error rate (FWER) increases.

FWER = 1 − (1 − α)ᵏ (for k independent tests of true null hypotheses)

Number of tests (k)	FWER
1	0.05
5	0.23
10	0.40
20	0.64
100	0.99

Bonferroni correction:
- Adjusted α = 0.05 / k
- Example: 10 comparisons → α = 0.005
- Very conservative: reduces Type I error but increases Type II error (reduces power)
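The FWER table and the Bonferroni threshold can be reproduced in a few lines of Python (function names are illustrative). The last line checks that, after correction, the familywise rate stays at or below 0.05.

```python
def fwer(k, alpha=0.05):
    """Probability of at least one false positive across k independent tests of true nulls."""
    return 1 - (1 - alpha) ** k

def bonferroni_alpha(k, alpha=0.05):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / k

for k in (1, 5, 10, 20, 100):
    print(f"k={k:3d}  FWER={fwer(k):.2f}  Bonferroni per-test alpha={bonferroni_alpha(k):.4f}")

print(f"FWER with 10 tests at alpha 0.005: {fwer(10, bonferroni_alpha(10)):.3f}")
```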

Other methods:

Method	Description	Comparison
Bonferroni	α/k	Most conservative
Holm-Bonferroni	Stepwise: smallest p tested at α/k, then α/(k−1), etc., stopping at the first non-significant result	Less conservative, more powerful
Šidák	1 − (1−α)^(1/k)	Slightly less conservative than Bonferroni
Benjamini-Hochberg (FDR)	Controls the false discovery rate (expected proportion of false positives among rejected hypotheses)	Least conservative; used in genomics
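The stepwise procedures are easiest to follow in code. Below is a minimal sketch of Holm and Benjamini-Hochberg applied to five hypothetical p-values, alongside the single-threshold Bonferroni and Šidák rules; function names are illustrative.

```python
def holm(p_values, alpha=0.05):
    """Holm step-down: returns a reject flag for each p-value (original order)."""
    k = len(p_values)
    order = sorted(range(k), key=lambda i: p_values[i])
    reject = [False] * k
    for rank, i in enumerate(order):              # rank 0 = smallest p
        if p_values[i] <= alpha / (k - rank):     # alpha/k, then alpha/(k-1), ...
            reject[i] = True
        else:
            break                                 # stop at the first non-significant result
    return reject

def benjamini_hochberg(p_values, alpha=0.05):
    """BH step-up: find the largest rank r with p_(r) <= (r/k)*alpha; reject ranks 1..r."""
    k = len(p_values)
    order = sorted(range(k), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / k * alpha:
            cutoff = rank
    reject = [False] * k
    for i in order[:cutoff]:
        reject[i] = True
    return reject

p = [0.001, 0.008, 0.012, 0.035, 0.20]            # hypothetical p-values, k = 5
sidak = 1 - (1 - 0.05) ** (1 / len(p))
print("Bonferroni:", sum(x <= 0.05 / len(p) for x in p))   # 2 rejections
print("Sidak:     ", sum(x <= sidak for x in p))            # 2
print("Holm:      ", sum(holm(p)))                          # 3
print("BH (FDR):  ", sum(benjamini_hochberg(p)))            # 4
```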

4.7 Significance vs Clinical Importance — Key MRCOG Concept

	Statistically Significant	Not Statistically Significant
Clinically Important	✅ Optimal — real effect detected	🔴 Underpowered study — need larger n
Clinically Unimportant	🟡 Significant but trivial (large n)	✅ No evidence of important effect

Example 1: a study with 1,000,000 women finds that taking paracetamol once in pregnancy reduces preterm birth from 5.0% to 4.9% (p ≈ 0.02). Statistically significant but clinically meaningless (ARR = 0.1%, NNT = 1000); a difference this small only reaches significance in a very large sample.

Example 2: A study with 100 women finds a 30% reduction in miscarriage rate but p = 0.15. Potentially clinically important but not proven — underpowered.
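A pooled two-proportion z-test (normal approximation, equal group sizes assumed) shows why Example 1 needs a very large trial: a 0.1 percentage-point difference is far from significant with 100,000 women in total but reaches p < 0.05 with about a million. The function name is illustrative.

```python
import math
from statistics import NormalDist

def two_proportion_p(p1, p2, n_per_group):
    """Two-sided p-value for a difference in proportions (pooled z-test, equal group sizes)."""
    pooled = (p1 + p2) / 2
    se = math.sqrt(pooled * (1 - pooled) * 2 / n_per_group)
    z = (p1 - p2) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

control, treated = 0.050, 0.049            # preterm birth rates from Example 1
arr = control - treated
print(f"ARR = {arr:.1%}, NNT = {1 / arr:.0f}")
for total in (100_000, 1_000_000):
    print(f"{total:,} women in total: p = {two_proportion_p(control, treated, total // 2):.3f}")
```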

4.8 Bayesian vs Frequentist Statistics — Overview

Aspect	Frequentist	Bayesian
Probability definition	Long-run frequency	Degree of belief
Parameters	Fixed (unknown)	Random variables
Data	Random	Fixed
Prior	Not used	Used (prior probability)
Output	p-value, CI	Posterior probability, credible interval
Interpretation of 95% interval	95% of intervals contain true value	95% probability true value lies in interval
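Probability that H₀ is true	Not available: p = 0.05 does NOT mean a 5% chance H₀ is true	Posterior P(H₀ | data) can be stated, but it depends on the prior and is generally not equal to p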

Bayesian approach in O&G: Increasingly used in adaptive trials, diagnostic test interpretation, and meta-analysis.

5. Parametric vs Non-Parametric Tests

5.1 Choosing the Right Test — Decision Tree

Continuous Data

Design	Normally distributed (parametric)	Not normally distributed (non-parametric)
2 independent groups	Unpaired t-test	Mann-Whitney U
2 paired groups	Paired t-test	Wilcoxon signed-rank
3+ independent groups	One-way ANOVA	Kruskal-Wallis
3+ related groups	Repeated-measures ANOVA	Friedman

Categorical Data

Situation	Test
Independent groups, all expected frequencies ≥5	χ² test
Independent groups, any expected frequency <5 (typically a 2×2 table)	Fisher's exact test
Paired/matched categorical data (e.g. before-after, 2×2)	McNemar's test

5.2 Parametric Tests — Complete Details

Student's t-test

Assumptions:
1. Normality: Data in each group are approximately normally distributed (or n is large enough for the central limit theorem to apply)
2. Homogeneity of variance: Variances are similar in both groups (check with Levene's test or the F-test)
3. Independence: Observations are independent of each other

Unpaired (Independent Samples) t-test

Use: Compare means of TWO independent groups
Example: Birth weight in smokers vs non-smokers
Formula: t = (x̄₁ − x̄₂) / √(s²(1/n₁ + 1/n₂))
where s² = pooled variance = [(n₁−1)s₁² + (n₂−1)s₂²] / (n₁ + n₂ − 2)
Degrees of freedom: df = n₁ + n₂ − 2

Worked example:
- Smokers: n = 50, mean = 3100 g, SD = 400 g
- Non-smokers: n = 50, mean = 3300 g, SD = 380 g
- Pooled SD = √([49 × 400² + 49 × 380²]/98) = √([7,840,000 + 7,075,600]/98) = √152,200 = 390.1 g
- t = (3100 − 3300) / (390.1 × √(1/50 + 1/50)) = −200 / (390.1 × 0.2) = −200 / 78.02 = −2.56
- df = 98; critical t (two-tailed, α = 0.05) = 1.984
- |t| = 2.56 > 1.984 → p < 0.05 → significant difference

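Welch's t-test: Does NOT assume equal variances, so it is more robust when variances (or group sizes) differ. It uses the separate group variances and an adjusted df (the Welch-Satterthwaite approximation). Many statisticians recommend it as the default.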
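Both versions can be reproduced from summary statistics. The minimal sketch below uses only the Python standard library; the function names are illustrative and do not come from any package. It recomputes the birth-weight example above and shows that, with equal group sizes, the Student and Welch t statistics coincide and only the degrees of freedom differ. No p-value is computed, because the standard library has no t distribution. Compare |t| with the tabulated critical value instead.

```python
import math

def pooled_t(m1, s1, n1, m2, s2, n2):
    """Student's unpaired t from summary statistics (assumes equal variances)."""
    # Pooled variance: each group's variance weighted by its degrees of freedom
    sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    se = math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return (m1 - m2) / se, n1 + n2 - 2

def welch_t(m1, s1, n1, m2, s2, n2):
    """Welch's t with Welch-Satterthwaite df (no equal-variance assumption)."""
    v1, v2 = s1**2 / n1, s2**2 / n2          # squared standard error of each mean
    t = (m1 - m2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return t, df

# Birth-weight summary statistics from the worked example (smokers first)
t, df = pooled_t(3100, 400, 50, 3300, 380, 50)
t_w, df_w = welch_t(3100, 400, 50, 3300, 380, 50)
print(f"Student: t = {t:.2f}, df = {df}")
print(f"Welch:   t = {t_w:.2f}, df = {df_w:.1f}")
```

With n₁ = n₂, the pooled standard error equals the unpooled one, so the two t values agree. Welch's df (about 97.7) is slightly smaller than 98.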

Paired t-test

Use: Compare means of TWO RELATED measurements (same subjects, before-after, matched pairs)
Example: BP before and after antihypertensive treatment in pregnancy
Principle: Calculate difference for each pair, test if mean difference = 0
Formula: t = d̄ / (s_d / √n)
d̄ = mean of differences
s_d = SD of differences
n = number of pairs
df = n − 1

Worked example: Fasting glucose before and after 1 week of metformin in 10 women with PCOS:

Subject	Before	After	Difference
1	5.9	5.5	0.4
2	6.2	5.8	0.4
3	5.6	5.3	0.3
4	5.8	5.4	0.4
5	6.0	5.7	0.3
6	5.7	5.6	0.1
7	6.1	5.8	0.3
8	5.9	5.5	0.4
9	5.8	5.6	0.2
10	6.0	5.9	0.1

Mean difference d̄ = 0.29
SD of differences = 0.12
t = 0.29 / (0.12/√10) = 0.29 / 0.038 = 7.63
df = 9, critical t = 2.262
t = 7.63 >> 2.262 → p < 0.001 → significant reduction
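The same calculation as a short standard-library Python sketch, using the ten before/after pairs from the table. With the unrounded SD of the differences, t comes out at about 7.66. The 7.63 above comes from rounding the standard error to 0.038.

```python
import math
import statistics

# Fasting glucose before and after metformin, from the table above
before = [5.9, 6.2, 5.6, 5.8, 6.0, 5.7, 6.1, 5.9, 5.8, 6.0]
after = [5.5, 5.8, 5.3, 5.4, 5.7, 5.6, 5.8, 5.5, 5.6, 5.9]

# A paired analysis works only with the within-subject differences
diffs = [b - a for b, a in zip(before, after)]
n = len(diffs)
d_bar = statistics.mean(diffs)
s_d = statistics.stdev(diffs)        # sample SD of the differences (n - 1 denominator)
t = d_bar / (s_d / math.sqrt(n))     # mean difference divided by its standard error
print(f"mean diff = {d_bar:.2f}, SD = {s_d:.3f}, t = {t:.2f}, df = {n - 1}")
```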

Analysis of Variance (ANOVA)

One-way ANOVA

Use: Compare means of THREE or MORE independent groups
Why not multiple t-tests? Inflates Type I error (for 3 groups: 3 pairwise tests → FWER = 14.3%)
Logic: Partition total variance into:
Between-group variance (attributable to the treatment/group effect)
Within-group variance (error/residual variance)
F-statistic = Mean Square (between) / Mean Square (within)
If F is large and p < 0.05 → at least one group mean differs from the others (ANOVA does not say which; use post-hoc tests)

ANOVA table:

Source	Sum of Squares	df	Mean Square	F
Between groups	SS_b	k−1	MS_b = SS_b/(k−1)	MS_b/MS_w
Within groups	SS_w	N−k	MS_w = SS_w/(N−k)	—
Total	SS_t	N−1	—	—

k = number of groups, N = total sample size
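The table's quantities can be computed directly. Below is a minimal Python sketch of the one-way ANOVA partition on a small hypothetical data set (three groups of three); the function name is illustrative. It returns F together with its two degrees of freedom, k − 1 and N − k.

```python
def one_way_anova(*groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    k = len(groups)
    N = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / N
    means = [sum(g) / len(g) for g in groups]
    # SS_b: spread of the group means around the grand mean, weighted by group size
    ss_b = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # SS_w: spread of each observation around its own group mean
    ss_w = sum(sum((x - m) ** 2 for x in g) for g, m in zip(groups, means))
    ms_b = ss_b / (k - 1)
    ms_w = ss_w / (N - k)
    return ms_b / ms_w, k - 1, N - k

F, df_b, df_w = one_way_anova([1, 2, 3], [4, 5, 6], [7, 8, 9])  # hypothetical data
print(F, df_b, df_w)
```

Here SS_b = 54 and SS_w = 6, so F = 27 on (2, 6) df.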

Post-hoc Tests after Significant ANOVA

Why can't we just use pairwise t-tests? Multiple testing problem. Post-hoc tests control for multiple comparisons.

Test	Conservatism	When to use
Bonferroni	Very conservative	Small number of pre-planned comparisons
Tukey HSD	Moderate	All pairwise comparisons (most common)
Scheffé	Most conservative	Complex comparisons (contrasts)
Dunnett	Moderate	Comparing all groups to a single control
Least Significant Difference (LSD)	NOT conservative (doesn't control FWER)	Only if exactly 3 groups and significant F

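Tukey HSD (Honest Significant Difference):
- Controls FWER for ALL pairwise comparisons
- Uses the studentised range distribution (q)
- Formula: HSD = q × √(MS_w / n), where n = number per group (equal group sizes); two group means differ significantly if their difference exceeds HSD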

Two-way ANOVA

Use: TWO independent variables (factors) + their interaction
Output: Main effect of factor A, main effect of factor B, interaction effect (A×B)
Example: Effect of smoking (yes/no) AND maternal age (<35 vs ≥35) on birth weight
Main effect of smoking (adjusted for age)
Main effect of age (adjusted for smoking)
Interaction: Does the effect of smoking DIFFER by maternal age?

Interpreting interaction:
- A significant interaction p-value → the effect of one factor depends on the level of the other
- Example: smoking reduces birth weight more in older mothers → significant smoking × age interaction
- Report subgroup means or an interaction plot

Repeated Measures ANOVA

Use: Same subjects measured at 3+ time points (e.g., BP at booking, 28 wks, 36 wks)
Advantage: Controls for between-subject variability → more powerful
Assumptions: Sphericity (variance of differences between all pairs of measurements is equal) — checked by Mauchly's test
Correction for non-sphericity: Greenhouse-Geisser, Huynh-Feldt

Assumptions of Parametric Tests — How to Check

Assumption	What it means	How to check	What to do if violated
Normality	Data follow normal distribution	Histogram, Q-Q plot, Shapiro-Wilk, Kolmogorov-Smirnov	Use non-parametric test, transform data
Homogeneity of variance	Equal variances across groups	Levene's test, F-test (2 groups), Bartlett's test	Use Welch's t-test, Welch's ANOVA, or transformation
Independence	Observations independent	Study design check	Mixed models, GEE, multilevel models
Sphericity (RM-ANOVA)	Equal variances of differences	Mauchly's test	Greenhouse-Geisser correction

5.3 Non-Parametric Tests — Complete Details

Mann-Whitney U Test (Wilcoxon Rank-Sum)

Use: Compare TWO INDEPENDENT groups with non-normal data
Principle: Rank all observations together, then compare sum of ranks between groups
H₀: The two populations have the same distribution (interpretable as equal medians when the two distributions have a similar shape)
Output: U statistic (or W in some software)

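Steps:
1. Rank all observations from both groups together (1 = smallest; tied values share the average of their ranks)
2. Sum the ranks for each group (R₁ and R₂)
3. U₁ = R₁ − n₁(n₁+1)/2 and U₂ = R₂ − n₂(n₂+1)/2 (check: U₁ + U₂ = n₁n₂)
4. U = min(U₁, U₂); the result is significant if U ≤ the critical value from tables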

Worked example: Pain scores (0–10) after two different perineal repair techniques

Technique A	Rank A	Technique B	Rank B
2	1	4	4.5
3	2.5	5	6
3	2.5	6	7
4	4.5	8	9
7	8	9	10
Sum	18.5	Sum	36.5

n₁ = 5, n₂ = 5
U₁ = 18.5 − (5×6/2) = 18.5 − 15 = 3.5
U₂ = 36.5 − 15 = 21.5
U = 3.5 (critical U for n₁=5, n₂=5, α=0.05 two-tailed = 2)
U = 3.5 > 2 → not significant at α = 0.05

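For larger samples, U is approximately normal under H₀: z = (U − n₁n₂/2) / √[n₁n₂(n₁ + n₂ + 1)/12]. Here z = (3.5 − 12.5) / 4.79 = −1.88, giving p ≈ 0.06. This agrees with the table-based conclusion, although the approximation is rough at n = 5 per group and ignores the tie correction.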
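The ranking and U calculation can be sketched in Python (standard library only; the function names are illustrative). Ties receive average ranks, as in the worked example. The z statistic uses the large-sample normal approximation z = (U − n₁n₂/2) / √[n₁n₂(n₁ + n₂ + 1)/12] without a tie correction, so it is only a rough check at these sample sizes.

```python
import math

def average_ranks(values):
    """Rank from 1 (smallest); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1                       # extend j over a run of tied values
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def mann_whitney(a, b):
    n1, n2 = len(a), len(b)
    ranks = average_ranks(a + b)         # rank both groups together
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    u2 = sum(ranks[n1:]) - n2 * (n2 + 1) / 2
    u = min(u1, u2)
    # Normal approximation (no tie correction)
    z = (u - n1 * n2 / 2) / math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return u1, u2, z, p

# Pain scores from the perineal repair example
u1, u2, z, p = mann_whitney([2, 3, 3, 4, 7], [4, 5, 6, 8, 9])
print(u1, u2, round(z, 2), round(p, 3))
```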

Wilcoxon Signed-Rank Test

Use: TWO PAIRED groups (non-parametric equivalent of paired t-test)
Principle: Calculate differences, rank absolute differences, sum ranks of positive vs negative differences
Steps:
Calculate difference for each pair
Exclude pairs with difference = 0
Rank absolute differences (ignoring sign)
Sum ranks of positive differences (W+) and negative differences (W−)
Test statistic W = min(W+, W−)

Example (from paired t-test data above): Glucose before and after metformin
- Differences: 0.4, 0.4, 0.3, 0.4, 0.3, 0.1, 0.3, 0.4, 0.2, 0.1
- All positive → W+ = 1 + 2 + ... + 10 = 55 (tied differences receive averaged ranks, but the total is unchanged), W− = 0
- For n = 10, critical W = 8 (two-tailed, α = 0.05)
- W = 0 < 8 → p < 0.05 → significant

Sign test (simpler alternative):
- Count the number of positive and negative differences (ignoring magnitude)
- Test using the binomial distribution
- Less powerful than the Wilcoxon signed-rank test (discards magnitude information)
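Both tests can be sketched on the metformin data in standard-library Python. The differences are rounded to 6 decimal places so that floating-point noise does not break genuine ties; that rounding step is an implementation detail of this sketch, not part of the method. The sign-test p-value is the exact two-sided binomial probability with P(positive) = 0.5.

```python
import math

before = [5.9, 6.2, 5.6, 5.8, 6.0, 5.7, 6.1, 5.9, 5.8, 6.0]
after = [5.5, 5.8, 5.3, 5.4, 5.7, 5.6, 5.8, 5.5, 5.6, 5.9]

# Round away float noise so equal differences tie exactly, then drop zeros
diffs = [round(b - a, 6) for b, a in zip(before, after)]
diffs = [d for d in diffs if d != 0]
n = len(diffs)

abs_sorted = sorted(abs(d) for d in diffs)

def avg_rank(v):
    """Average 1-based rank of |difference| v among all |differences|."""
    positions = [i + 1 for i, x in enumerate(abs_sorted) if x == v]
    return sum(positions) / len(positions)

w_plus = sum(avg_rank(abs(d)) for d in diffs if d > 0)
w_minus = sum(avg_rank(abs(d)) for d in diffs if d < 0)
W = min(w_plus, w_minus)

# Sign test: count positives, exact two-sided binomial p with P(+) = 0.5
k = sum(d > 0 for d in diffs)
tail = sum(math.comb(n, i) for i in range(min(k, n - k) + 1)) / 2**n
p_sign = min(1.0, 2 * tail)
print(w_plus, w_minus, W, p_sign)
```

Here all 10 differences are positive, so the sign test gives p = 2/2¹⁰ ≈ 0.002. The sign test's lower power only shows up when some differences are negative.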

Kruskal-Wallis Test

Use: THREE+ INDEPENDENT groups (non-parametric equivalent of one-way ANOVA)
Principle: Extension of Mann-Whitney — ranks all observations together, compares sum of ranks across groups
H₀: All groups come from the same distribution (interpretable as equal medians when the distributions have a similar shape)
Output: H statistic (approximately χ² with df = k−1)
Post-hoc: Dunn's test with Bonferroni correction

When to use: Comparing fetal fibronectin levels (skewed) across three groups: term labour, preterm labour, no labour
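The rank-sum principle leads to the usual formula H = 12/[N(N+1)] × Σ(Rᵢ²/nᵢ) − 3(N+1), where Rᵢ is the rank sum of group i. The minimal Python sketch below applies it to hypothetical data (three groups of three) and, for simplicity, omits the tie correction that statistical software applies.

```python
def kruskal_wallis_h(*groups):
    """Kruskal-Wallis H statistic (no tie correction)."""
    pooled = sorted(x for g in groups for x in g)
    N = len(pooled)

    def avg_rank(v):
        # Mean of the 1-based positions value v occupies in the pooled ranking
        positions = [i + 1 for i, x in enumerate(pooled) if x == v]
        return sum(positions) / len(positions)

    # Sum over groups of (rank sum)^2 / group size
    total = sum(sum(avg_rank(x) for x in g) ** 2 / len(g) for g in groups)
    return 12 / (N * (N + 1)) * total - 3 * (N + 1)

H = kruskal_wallis_h([1, 2, 3], [4, 5, 6], [7, 8, 9])   # hypothetical data
print(round(H, 2))   # compare with chi-squared on k - 1 = 2 df
```

For df = 2 the χ² upper-tail probability is exactly e^(−H/2), so H = 7.2 gives p ≈ 0.027.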

Friedman Test

Use: THREE+ PAIRED groups (non-parametric equivalent of repeated measures ANOVA)
Principle: Ranks within each subject/block, then compares across time points
Example: Pain scores at 1 hour, 6 hours, 24 hours after episiotomy repair
Post-hoc: Wilcoxon signed-rank with Bonferroni correction

5.4 Chi-Squared Test (χ²) — Complete Details

Use: Test association between TWO CATEGORICAL variables
Data format: Contingency table (r × c)

Formula: χ² = Σ [(Oᵢⱼ − Eᵢⱼ)² / Eᵢⱼ]

Where:
- Oᵢⱼ = observed frequency in cell (i, j)
- Eᵢⱼ = expected frequency = (row total × column total) / grand total
- df = (rows − 1) × (columns − 1)

Worked example: Mode of delivery by maternal BMI category

	SVD	CS	Total
BMI < 30	80	20	100
BMI ≥ 30	30	30	60
Total	110	50	160

Expected frequencies:
- BMI < 30, SVD: (100 × 110)/160 = 68.75
- BMI < 30, CS: (100 × 50)/160 = 31.25
- BMI ≥ 30, SVD: (60 × 110)/160 = 41.25
- BMI ≥ 30, CS: (60 × 50)/160 = 18.75

χ² = (80−68.75)²/68.75 + (20−31.25)²/31.25 + (30−41.25)²/41.25 + (30−18.75)²/18.75 = 1.84 + 4.05 + 3.07 + 6.75 = 15.71

df = (2−1)(2−1) = 1
Critical χ² (df = 1, α = 0.05) = 3.84
15.71 > 3.84 → significant at α = 0.05; 15.71 also exceeds 10.83 (critical value for α = 0.001) → p < 0.001 → highly significant association
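A minimal standard-library Python sketch of the same calculation; the function name is illustrative. It works for any r × c table. The p-value line relies on the fact that, for df = 1 only, χ² is the square of a standard normal variable, so P(χ² > x) = erfc(√(x/2)).

```python
import math

def chi_squared(table):
    """Pearson chi-squared statistic and df for an r x c table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand   # (row x column) / N
            chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df

# Rows: BMI < 30, BMI >= 30; columns: SVD, CS
chi2, df = chi_squared([[80, 20], [30, 30]])
p = math.erfc(math.sqrt(chi2 / 2))   # valid for df = 1 only
print(f"chi2 = {chi2:.2f}, df = {df}, p = {p:.1e}")
```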

Assumptions:
1. Independent observations (each subject counted once)
2. No more than 20% of expected frequencies < 5
3. All expected frequencies ≥ 1

If assumptions violated: Use Fisher's exact test (any 2×2 table) or combine categories (for larger tables).

Yates' Correction for Continuity

Applied to 2×2 tables (subtract 0.5 from each |O−E| before squaring)
More conservative (reduces χ²)
Historically used; now controversial — Fisher's exact preferred for small samples

Fisher's Exact Test

Use: 2×2 tables when one or more expected frequencies are < 5
Principle: Calculates exact probability of observed table (and more extreme tables) given fixed margins — based on hypergeometric distribution
Advantage: Valid for ANY sample size
Disadvantage: Computationally intensive for large tables
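The hypergeometric principle is easy to sketch for a 2×2 table. The Python below (standard library only; the function name and the example table are hypothetical) holds the margins fixed, enumerates every feasible table, and sums the probabilities of those no more probable than the observed one. This is one common two-sided definition. Exact fractions avoid floating-point ties when comparing probabilities.

```python
from fractions import Fraction
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p for the 2x2 table [[a, b], [c, d]]."""
    r1, r2, c1 = a + b, c + d, a + c
    n = r1 + r2

    def prob(x):
        # Hypergeometric P(top-left cell = x) with all margins fixed
        return Fraction(comb(r1, x) * comb(r2, c1 - x), comb(n, c1))

    observed = prob(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)   # feasible values of the top-left cell
    return sum(p for p in (prob(x) for x in range(lo, hi + 1)) if p <= observed)

p = fisher_exact_two_sided(3, 1, 1, 3)   # hypothetical small table
print(p, float(p))
```

For this table the five possible top-left counts (0 to 4) have probabilities 1, 16, 36, 16 and 1 out of 70. The observed table (16/70) and the tables at least as extreme sum to 34/70 ≈ 0.49.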

McNemar's Test for Paired Categorical Data

Use: Compare PROPORTIONS in PAIRED or MATCHED categorical data (before-after, matched case-control)
Example: Diagnosis of GDM by two different criteria (IADPSG vs NICE) in same women

Paired 2×2 table:

	Test B +	Test B −	Total
Test A +	a (both positive)	b (A positive, B negative)	a + b
Test A −	c (A negative, B positive)	d (both negative)	c + d
Total	a + c	b + d	N

Formula (with continuity correction): χ² = (|b − c| − 1)² / (b + c)
- Only the discordant pairs (b and c) contribute to the test
- If b = c → no difference between tests

Example: GDM screening — IADPSG vs NICE criteria in 200 women

	NICE +	NICE −	Total
IADPSG +	20	15	35
IADPSG −	3	162	165
Total	23	177	200

χ² = (|15 − 3| − 1)² / (15 + 3) = 11² / 18 = 121/18 = 6.72
df = 1, p ≈ 0.01 → significant difference: IADPSG classifies significantly more women as having GDM than the NICE criteria (35 vs 23 of 200).
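A minimal Python sketch of McNemar's test with the continuity correction; the function name is illustrative. Only the two discordant counts are needed. Because df = 1, the upper-tail probability is erfc(√(χ²/2)).

```python
import math

def mcnemar(b, c, continuity=True):
    """McNemar chi-squared (df = 1) using only the discordant counts b and c."""
    diff = abs(b - c) - (1 if continuity else 0)
    chi2 = max(diff, 0) ** 2 / (b + c)
    p = math.erfc(math.sqrt(chi2 / 2))   # upper tail of chi-squared with df = 1
    return chi2, p

# b = IADPSG+/NICE- (15 women), c = IADPSG-/NICE+ (3 women)
chi2, p = mcnemar(15, 3)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")
```

The exact p is about 0.0095, which rounds to the p ≈ 0.01 quoted above.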

5.5 Correlation — Detailed

Coefficient	Symbol	Type	Parametric?	Range	Measure of
Pearson r	r	Linear	Yes	−1 to +1	Linear relationship strength
Spearman ρ	rₛ (or ρ)	Monotonic	No	−1 to +1	Monotonic relationship (any consistent trend)
Kendall τ	τ	Concordant/discordant pairs	No	−1 to +1	Association in ranked data

Pearson Correlation (r)

Assumptions:
1. Both variables are continuous
2. Linear relationship
3. Bivariate normality (needed for valid p-values and confidence intervals, not for calculating r itself)
4. Homoscedasticity (equal scatter across values)
5. No significant outliers

Formula: r = Σ[(xᵢ − x̄)(yᵢ − ȳ)] / √[Σ(xᵢ − x̄)² × Σ(yᵢ − ȳ)²]

Interpretation of |r| (a common rule of thumb; Cohen's own conventions are 0.1 = small, 0.3 = medium, 0.5 = large):

|r|	Interpretation	Approximate R²
0.0–0.1	Negligible	0–1%
0.1–0.3	Weak	1–9%
0.3–0.5	Moderate	9–25%
0.5–0.7	Strong	25–49%
0.7–1.0	Very strong	49–100%

R² = coefficient of determination: the proportion of variance in Y explained by X (for a single predictor, R² = r²).
- If r = 0.6, R² = 0.36 → 36% of the variance in Y is explained by X
- The remaining 64% is unexplained (other factors and random variation)

Spearman's Rank Correlation (ρ)

Use: Non-normal, ordinal, or skewed data
Principle: Rank both variables, then calculate Pearson r on ranks
Advantages: No normality assumption; detects monotonic (not just linear) relationships; robust to outliers
Interpretation: Same r scale (−1 to +1)

Kendall's Tau (τ)

Use: Small samples with many tied ranks
Principle: Based on number of concordant vs discordant pairs
τ = (C − D) / [½ n(n−1)] where C = concordant pairs, D = discordant pairs (this is τ-a; τ-b adjusts the denominator for ties)
Advantage: More robust and interpretable with ties; better for small samples
Note: τ is usually smaller in absolute value than Spearman's ρ for the same data, so the two coefficients are not directly comparable
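Counting concordant and discordant pairs is easy to sketch in standard-library Python. The function below implements τ-a as defined above; the data are hypothetical ranks. Of the 10 pairs, 8 are concordant and 2 are discordant, so τ = (8 − 2)/10 = 0.6.

```python
from itertools import combinations

def kendall_tau_a(x, y):
    """Kendall's tau-a = (C - D) / [n(n - 1)/2]; tied pairs count as neither."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:        # both variables move in the same direction
            concordant += 1
        elif product < 0:      # the variables move in opposite directions
            discordant += 1
    n = len(x)
    return (concordant - discordant) / (n * (n - 1) / 2)

tau = kendall_tau_a([1, 2, 3, 4, 5], [1, 3, 2, 5, 4])   # hypothetical ranks
print(tau)
```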

Correlation does NOT imply causation. There are 4 possible explanations for r ≠ 0:
1. X causes Y (direct causation)
2. Y causes X (reverse causation)
3. Z causes both X and Y (confounding)
4. Chance (random variation)

Common O&G example: The positive correlation between maternal age and Down's syndrome is believed to be causal (meiotic non-disjunction increases with maternal age). That conclusion rests on biological evidence and consistent findings across studies, not on the correlation itself.

5.6 Regression — Complete Details

Linear Regression

Model: Y = β₀ + β₁X₁ + β₂X₂ + ... + βₖXₖ + ε

Y = outcome (dependent) variable — CONTINUOUS
Xᵢ = predictor (independent) variables
β₀ = intercept (value of Y when all X = 0)
βᵢ = regression coefficient (change in Y per 1-unit change in Xᵢ, holding others constant)
ε = error term (residual)

Key outputs:

Output	Interpretation
β coefficient	Effect estimate (units of Y per unit X)
95% CI for β	Precision and significance
p-value for β	Test of H₀: β = 0
R²	Proportion of variance explained by model
Adjusted R²	R² penalised for number of predictors
F-test	Tests if overall model is significant

Assumptions of linear regression:
1. Linearity: Relationship between X and Y is linear
2. Independence: Observations are independent
3. Homoscedasticity: Constant variance of residuals across fitted values
4. Normality: Residuals are normally distributed
5. No multicollinearity: Predictors are not highly correlated

Checking assumptions:
- Residual vs fitted plot: look for random scatter (homoscedasticity) and no pattern (linearity)
- Q-Q plot of residuals: check normality
- Variance Inflation Factor (VIF): check multicollinearity (VIF > 10 = problematic)
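For a single predictor, the least-squares estimates have a closed form: β₁ = Sxy/Sxx and β₀ = ȳ − β₁x̄. The minimal standard-library Python sketch below shows that one-predictor case only. Multiple regression needs matrix algebra or a statistics package. The data are hypothetical and lie exactly on y = 1 + 2x, so the residuals are zero and R² = 1. The residuals it returns are the quantities you would plot against fitted values when checking the assumptions above.

```python
def simple_linear_regression(x, y):
    """Least-squares intercept, slope, R^2 and residuals for a single predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    b1 = sxy / sxx                 # slope: change in Y per 1-unit change in X
    b0 = my - b1 * mx              # intercept: predicted Y when X = 0
    residuals = [b - (b0 + b1 * a) for a, b in zip(x, y)]
    ss_res = sum(r ** 2 for r in residuals)
    ss_tot = sum((b - my) ** 2 for b in y)
    return b0, b1, 1 - ss_res / ss_tot, residuals

b0, b1, r2, residuals = simple_linear_regression([1, 2, 3, 4], [3, 5, 7, 9])  # hypothetical
print(b0, b1, r2)
```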

Multiple Linear Regression

Use: ONE continuous outcome, MULTIPLE predictors
β coefficients are ADJUSTED — each β represents the effect of that predictor holding all others constant
Can control for confounders by including them in the model
Partial R²: Contribution of each predictor to explained variance

Example: Predicting birth weight
- Y = birth weight (g)
- X₁ = gestational age (weeks)
- X₂ = maternal smoking (0/1)
- X₃ = maternal BMI
- β₁ = 150 means: each additional week of gestation → +150 g birth weight (holding smoking and BMI constant)
- β₂ = −200 means: smoking is associated with 200 g lower birth weight (holding gestational age and BMI constant)

Logistic Regression

Use: BINARY outcome (yes/no, alive/dead, disease/no disease)
Model: logit(p) = ln[p/(1−p)] = β₀ + β₁X₁ + ... + βₖXₖ
Exponentiated coefficients (e^βᵢ): Adjusted Odds Ratios (OR)
Interpretation of OR: e^βᵢ is the factor by which the odds of the outcome are multiplied for each 1-unit increase in Xᵢ (holding the other predictors constant)

Key outputs:

Output	Interpretation
OR (e^β)	Adjusted odds ratio
95% CI for OR	Precision (if it excludes 1 → significant)
Hosmer-Lemeshow test	Goodness-of-fit (p > 0.05 = no evidence of poor fit)
c-statistic (AUC)	Discriminatory ability
Pseudo-R²	McFadden, Nagelkerke

Worked O&G example: Predicting preterm birth

Predictor	β	OR (e^β)	95% CI	p
Smoking	0.69	1.99	1.25–3.17	0.004
Previous preterm	1.39	4.01	2.10–7.66	<0.001
Multiple pregnancy	1.10	3.00	1.40–6.43	0.005
Maternal age (per year)	0.02	1.02	0.98–1.06	0.29

Smoking roughly doubles the odds of preterm birth (adjusted OR = 1.99, p = 0.004)
Previous preterm birth is the strongest predictor (OR = 4.01)
Maternal age is not significant (CI includes 1, p > 0.05)
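As a consistency check on the smoking row, the standard-library Python sketch below converts β to an OR and back-calculates the standard error from the reported 95% CI. It then recovers the Wald z and two-sided p. The sketch assumes the CI is a symmetric Wald interval on the log-odds scale (β ± 1.96 × SE); that assumption is not stated in the table. The function name is illustrative.

```python
import math

def wald_from_ci(beta, or_lower, or_upper, z_crit=1.96):
    """OR, SE, z and two-sided p for a log-odds coefficient, back-calculated
    from its reported 95% CI (assumes a symmetric Wald interval)."""
    se = (math.log(or_upper) - math.log(or_lower)) / (2 * z_crit)
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2))   # two-sided normal p-value
    return math.exp(beta), se, z, p

# Smoking row: beta = 0.69, 95% CI for OR 1.25 to 3.17
odds_ratio, se, z, p = wald_from_ci(0.69, 1.25, 3.17)
print(f"OR = {odds_ratio:.2f}, SE = {se:.3f}, z = {z:.2f}, p = {p:.4f}")
```

The recovered p (about 0.0037) rounds to the tabulated 0.004, which suggests the row is internally consistent.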

Cox Proportional Hazards (See also Section 9)

Model: h(t) = h₀(t) × exp(β₁X₁ + β₂X₂ + ... + βₖXₖ)

h(t) = hazard at time t
h₀(t) = baseline hazard (when all X = 0)
exp(βᵢ) = Hazard Ratio (HR)
Proportional hazards assumption: HR is constant over time

6. Risk & Effect Measures

6.1 The 2×2 Table — Foundation

	Outcome + (Disease)	Outcome − (No disease)	Total
Exposed +	a	b	a + b
Exposed −	c	d	c + d
Total	a + c	b + d	N

6.2 Definitions and Formulas — Complete

Measure	Abbreviation	Formula	Interpretation
Risk in exposed	Rₑ	a / (a + b)	Probability of outcome if exposed
Risk in unexposed	R₀	c / (c + d)	Probability of outcome if not exposed
Odds in exposed	Oₑ	a / b	Ratio of outcome happening to not happening in exposed
Odds in unexposed	O₀	c / d	Ratio of outcome happening to not happening in unexposed
Risk Ratio / Relative Risk	RR	Rₑ / R₀	How many times more likely outcome is in exposed vs unexposed
Odds Ratio	OR	(a/b) / (c/d) = ad / bc	Odds of outcome in exposed vs unexposed (numerically equal to odds of exposure in cases vs controls)
Attributable Risk	AR	Rₑ − R₀	Excess risk due to exposure
Attributable Risk Fraction	ARF	(Rₑ − R₀) / Rₑ = (RR−1)/RR	Proportion of risk in exposed due to exposure
Population Attributable Risk	PAR	R_total − R₀	Excess risk in total population
Population Attributable Fraction	PAF	(R_total − R₀) / R_total	Proportion of population disease due to exposure
Absolute Risk Reduction	ARR	Control risk − Treatment risk (if treatment reduces risk)	Same magnitude as AR with the sign reversed (treatment perspective)
Number Needed to Treat	NNT	1 / ARR	Number needed to treat to prevent one outcome
Number Needed to Harm	NNH	1 / AR (if harmful)	Number exposed to cause one adverse outcome

6.3 Worked Examples from O&G

Example 1: VTE Prevention with LMWH

	VTE	No VTE	Total
LMWH	5	495	500
No LMWH	20	480	500

Rₑ = 5/500 = 0.01 (1%)
R₀ = 20/500 = 0.04 (4%)
RR = 0.01/0.04 = 0.25 → LMWH reduces VTE risk by 75%
AR (ARR) = |0.01 − 0.04| = 0.03 (3%) → absolute risk reduction
RRR (relative risk reduction) = (0.04−0.01)/0.04 = 0.75 (75%) → same as 1−RR
NNT = 1/0.03 = 33.3 → rounded up to 34: 34 women need to receive LMWH to prevent one VTE
OR = (5×480)/(495×20) = 2400/9900 = 0.24 → similar to RR because VTE is rare

Example 2: Smoking and Preterm Birth

	Preterm	Term	Total
Smoker	200	4,800	5,000
Non-smoker	100	4,900	5,000

RR = 0.04/0.02 = 2.0
OR = (200×4900)/(100×4800) = 980,000/480,000 = 2.04
OR ≈ RR because preterm birth is uncommon in this sample (300/10,000 = 3%); the approximation is good but not perfect (2.04 vs 2.0)
AR = 0.04 − 0.02 = 0.02 (2%)
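ARF = (2−1)/2 = 50% → half of preterm births in smokers are attributable to smoking (assuming the association is causal)
PAF = (0.03−0.02)/0.03 = 33% → one-third of all preterm births in this population (in which half the women smoke) are attributable to smoking, again assuming causality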
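The formulas in the table can be bundled into one function. The standard-library Python sketch below is illustrative, not a standard API. It reproduces both worked examples. ARF and PAF are only meaningful when the exposure increases risk, so their values for the protective LMWH example should be ignored. The `round(..., 9)` before rounding up is there only to absorb floating-point noise.

```python
import math

def two_by_two(a, b, c, d):
    """Effect measures; rows are exposed (a, b) and unexposed (c, d),
    columns are outcome present (a, c) and absent (b, d)."""
    r_exp, r_unexp = a / (a + b), c / (c + d)
    r_total = (a + c) / (a + b + c + d)
    rr = r_exp / r_unexp
    return {
        "RR": rr,
        "OR": (a * d) / (b * c),
        "AR": r_exp - r_unexp,                  # negative when exposure is protective
        "ARF": (rr - 1) / rr,                   # only meaningful when RR > 1
        "PAF": (r_total - r_unexp) / r_total,   # only meaningful when RR > 1
        # 1/|AR| is the NNT if protective, the NNH if harmful; always rounded up
        "NNT_or_NNH": math.ceil(round(1 / abs(r_exp - r_unexp), 9)),
    }

lmwh = two_by_two(5, 495, 20, 480)          # Example 1: LMWH vs no LMWH
smoking = two_by_two(200, 4800, 100, 4900)  # Example 2: smokers vs non-smokers
for name, res in (("LMWH", lmwh), ("Smoking", smoking)):
    print(name, {k: round(v, 3) for k, v in res.items()})
```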

6.4 Risk Ratio vs Odds Ratio — The Rare Disease Assumption

When the disease is rare (risk < ~10%):
- OR ≈ RR
- OR can be interpreted as an approximate RR in case-control studies

When the disease is common:
- OR exaggerates the RR (lies further from 1)
- OR > RR when RR > 1, and OR < RR when RR < 1
- The more common the disease, the greater the divergence

Proof that OR ≈ RR when a << a+b and c << c+d:
- RR = [a/(a+b)] / [c/(c+d)]
- OR = (a/b) / (c/d) = ad/bc
- If a << a+b, then b ≈ a+b, so a/(a+b) ≈ a/b
- If c << c+d, then d ≈ c+d, so c/(c+d) ≈ c/d
- Therefore RR ≈ (a/b) / (c/d) = OR

Clinical example where OR and RR diverge:

	Disease +	Disease −	Total	Risk
Exposed	80	20	100	0.80
Unexposed	60	40	100	0.60

RR = 0.80/0.60 = 1.33
OR = (80×40)/(20×60) = 3200/1200 = 2.67
OR is TWICE RR! Common disease → OR is a very poor approximation.

6.5 Number Needed to Treat (NNT) — Detailed

Formula: NNT = 1 / ARR

Where ARR = |Risk_control − Risk_treatment|

Important properties:
- Lower NNT = more effective treatment
- NNT is always rounded UP to the nearest integer
- NNT depends on BASELINE RISK: the same RR gives a different NNT depending on the baseline risk

Example of NNT dependence on baseline risk: a treatment reduces the risk of an outcome by 50% (RR = 0.50).

Risk: control → treated	ARR	NNT
10% → 5%	5%	20
1% → 0.5%	0.5%	200
0.1% → 0.05%	0.05%	2000

Same RR (50% reduction) but NNT ranges dramatically. This is why NNT must be reported with baseline risk context.
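The table follows from ARR = baseline × (1 − RR). A short standard-library Python sketch reproduces it; the function name is illustrative. The `round(..., 9)` guards against floating-point values such as 19.999999999 before rounding up.

```python
import math

def nnt(baseline_risk, rr):
    """NNT for a treatment with risk ratio rr (< 1) at a given baseline risk."""
    arr = baseline_risk - baseline_risk * rr      # ARR = baseline x (1 - RR)
    return math.ceil(round(1 / arr, 9))           # absorb float noise, then round up

for baseline in (0.10, 0.01, 0.001):
    print(f"baseline {baseline:.1%}: NNT = {nnt(baseline, 0.5)}")
```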

NNT for harm (NNH):
- NNH = 1 / AR (when exposure increases risk)
- Illustrative example: aspirin prevents pre-eclampsia (NNT = 50) but increases bleeding (NNH = 200 for minor bleeding, NNH = 1000 for major bleeding)
- Net benefit: when NNT < NNH (more people helped than harmed)
- Benefit-harm ratio: NNH/NNT

6.6 Incidence vs Prevalence

Measure	Definition	Formula	When used
Point prevalence	Proportion of population with disease at a specific time	Existing cases / Total population	Cross-sectional studies
Period prevalence	Proportion with disease during a time period	Cases in period / Population	Chronic diseases
Cumulative incidence	Proportion of at-risk population who develop disease over time	New cases / At-risk population at start	Cohort studies
Incidence rate	Number of new cases per person-time	New cases / Total person-time at risk	When follow-up varies

Relationship (approximate, for a stable population and a fairly rare condition): Prevalence ≈ Incidence × Average duration of disease
- For chronic diseases (long duration): high prevalence despite moderate incidence
- For acute diseases (short duration, fatal or curable): low prevalence despite possibly high incidence

Examples:
- Ovarian cancer: incidence ~20/100,000/year, 5-year survival ~45% → prevalence ~90/100,000 (by P ≈ I × D, this corresponds to an average duration of about 4.5 years)
- Endometriosis: incidence unclear (difficult to diagnose), prevalence ~10% in reproductive-age women (long duration → high prevalence)

6.7 Hazard Ratio — More Detail

From Cox proportional hazards regression
Interpretation: The ratio of the instantaneous event rate (hazard) in one group to that in another, at any given time
HR = 1: No difference
HR < 1: Reduced hazard (protective)
HR > 1: Increased hazard (risk factor)
Not a simple risk ratio: it is a ratio of hazards, assumed to be constant across time (proportional hazards assumption)

HR vs RR:
- RR compares cumulative incidence at a specific time point
- HR compares the instantaneous rate of the event at any time
- HR is more appropriate for time-to-event data with varying follow-up
- If proportional hazards hold, HR is constant over time

7. Statistical Bias & Confounding

7.1 Classification of Bias

                      ┌──────────┐
                      │   Bias   │
                      └────┬─────┘
                           │
          ┌────────────────┼────────────────┐
          │                │                │
     ┌────┴────┐     ┌─────┴─────┐   ┌──────┴──────┐
     │Selection│     │Information│   │ Confounding │
     │  Bias   │     │   Bias    │   │ (mixing of  │
     └─────────┘     └───────────┘   │  effects)   │
                                     └─────────────┘

7.2 Selection Bias — Detailed with O&G Examples

Definition: Systematic error due to the way participants are selected for a study or due to differential participation/follow-up.

Type	Mechanism	O&G Example
Sampling bias	Sample not representative of target population	Studying postnatal depression in an affluent area → underestimates prevalence
Referral (centripetal) bias	Tertiary centres see sicker patients	Studying outcomes of placenta praevia at a teaching hospital → overestimates mortality
Volunteer bias	Volunteers differ systematically from non-volunteers	Women who join a menopause research study are healthier and more health-conscious
Healthy worker effect	Workers healthier than general population	Midwives have lower mortality than age-matched women in general population
Non-response bias	Those who respond differ from those who don't	Postal survey of incontinence — those most affected are more likely to respond → overestimates prevalence
Attrition bias (loss to follow-up)	Dropouts differ from completers	In a cohort of high-risk pregnancies, those who drop out may have worse outcomes → biased if differential
Berkson's bias	Hospital controls differ from general population	Studying association between oral contraceptives and DVT using hospital controls — controls may use OCPs at different rates
Survival (Neyman) bias	Only survivors included	Cross-sectional study of MI — fatal cases missed → underestimates severity
Incidence-prevalence (Neyman) bias	Prevalent cases differ from incident	Studying ovarian cancer — prevalent cases are longer-term survivors → different risk factor profile
Immortal time bias	Time before exposure counted as exposed	Studying survival after surgery: if time from diagnosis to surgery counted as postsurgical survival, it's "immortal" (patient alive by definition)
Detection bias	More intensive surveillance in one group	Women on HRT have more mammograms → more breast cancer detected (screening effect, not causation)

Immortal Time Bias — Detailed

An important MRCOG concept. Immortal time bias occurs when there is a period of follow-up during which the outcome cannot occur, and this time is misclassified.

Classic example: A study of whether screening for cervical cancer reduces mortality.
- Women who attend screening (exposed) are compared with non-attendees
- The time between the invitation and the actual screening result is "immortal": women had to survive to be screened
- If this immortal time is counted as "screened" time, it biases results in favour of screening (screened women appear to live longer)
- Solution: use a time-dependent exposure, or start follow-up at the time of the screening decision rather than the screening result

7.3 Information Bias (Measurement Bias) — Detailed

Definition: Systematic error in measuring exposure, outcome, or covariates.

Type	Mechanism	O&G Example
Recall bias	Differential recall between groups	Case-control study of miscarriage — cases recall more exposures than controls
Observer (ascertainment) bias	Researcher's expectation influences measurement	Knowing which group receives active treatment may influence interpretation of ultrasound measurements
Detection (verification) bias	Systematic difference in outcome ascertainment	More intensive follow-up in treatment group → more outcomes detected
Lead-time bias	Early diagnosis falsely extends survival	Screening: survival appears longer even if death occurs at same time
Publication bias	Positive studies more likely published	Meta-analyses overestimate effect if negative studies unpublished
Reporting bias	Differential outcome reporting	Participants on placebo may report more symptoms; doctors may report outcomes more carefully in one group
Interviewer bias	Differential questioning	Interviewer probes cases more thoroughly about exposures
Social desirability bias	Participants give socially acceptable answers	Underreporting smoking, alcohol in pregnancy
Hawthorne effect	Behaviour changes because being observed	Women may adhere better to medications when in a trial
Measurement error bias	Inaccurate measurement tool	Using a poorly calibrated sphygmomanometer

Recall Bias — Detailed

The most common bias tested in MRCOG for case-control studies.

Mechanism:
- Mothers of babies with malformations (cases) search their memory for potential causes → more likely to recall medication use, infections, stress
- Mothers of healthy babies (controls) have less motivation to recall → more likely to forget

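Effect: Because the misclassification differs between cases and controls, the OR can in principle be biased in either direction. It is typically biased away from the null (an exaggerated association), because cases report exposures more completely than controls.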

Minimisation:
- Use objective records (prescription databases, medical records) rather than recall
- Blind interviewers to case/control status
- Use standardised, validated questionnaires
- Use a "memory anchor" (e.g., calendar of significant events)

7.4 Confounding — Complete Details

Definition: A third variable (confounder) that distorts the relationship between exposure and outcome because it is associated with BOTH the exposure and the outcome and is NOT on the causal pathway.

Criteria for a Confounder — Three Conditions

Associated with the exposure in the study population
An independent risk factor for the outcome (among the unexposed)
NOT an intermediate (mediator) on the causal pathway between exposure and outcome

     ┌──────────┐
     │Confounder │
     └┬──────┬───┘
      │      │
      ▼      ▼
  Exposure ──?──▶ Outcome

Not a confounder (it IS a mediator):

  Exposure ──────────▶ Mediator ──────────▶ Outcome

Example (confounder):
- Exposure: drinking coffee → Outcome: pancreatic cancer
- Confounder: smoking (associated with coffee drinking AND causes pancreatic cancer)
- If we don't adjust for smoking, we might wrongly attribute the cancer risk to coffee

Classic O&G Examples of Confounding

Study claim	True relationship	Confounder
"HRT reduces coronary heart disease"	HRT users healthier → lower CHD	Socioeconomic status, health awareness
"Maternal age causes Down's syndrome"	Chromosomal non-disjunction increases with age	AGE IS THE EXPOSURE — this is causal, not confounding!
"Coffee causes miscarriage"	Coffee drinkers are more likely to be older and to smoke	Smoking, maternal age
"Caesarean section causes asthma"	Children born by CS have more asthma	Indication for CS (maternal obesity, preterm) may itself be associated with asthma
"Fertility treatment causes cancer"	Women who have IVF may have different cancer surveillance	Underlying infertility (itself a risk factor for some cancers)

Simpson's Paradox — Detailed

A special case of confounding where a trend appears in several groups but reverses or disappears when groups are combined.

Classic medical example: Kidney stone treatment

Stone size	Treatment A	Treatment B
Small stones	93% (81/87)	87% (234/270)
Large stones	73% (192/263)	69% (55/80)
Overall	78% (273/350)	83% (289/350)

Paradox: Treatment A is better for BOTH small AND large stones, but Treatment B appears better overall!

Explanation: Treatment A was more often used for large stones (which have worse prognosis). Stone size is a confounder — associated with treatment choice (A used more for large) AND outcome (large stones have worse success). When you ignore stone size (combine groups), the confounding produces the paradoxical reversal.

Take-home: Always consider whether there might be a confounder creating a Simpson's paradox. Stratify by key confounders.
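The reversal can be checked directly from the kidney stone counts in the table. The short standard-library Python sketch below (helper name illustrative) compares the success rates within each stone-size stratum and then after pooling.

```python
# Kidney stone data from the table: (successes, patients) per stratum
data = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}

def success_rate(*strata):
    """Pooled success proportion across one or more (successes, patients) strata."""
    return sum(s for s, _ in strata) / sum(n for _, n in strata)

for size in ("small", "large"):
    a, b = success_rate(data["A"][size]), success_rate(data["B"][size])
    print(f"{size:>5} stones: A {a:.0%} vs B {b:.0%}")      # A better in each stratum

overall_a = success_rate(*data["A"].values())
overall_b = success_rate(*data["B"].values())
print(f"overall:      A {overall_a:.0%} vs B {overall_b:.0%}")  # B better when pooled
```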

Methods to Control Confounding

Method	When used	How it works	Strengths	Weaknesses
Randomisation	RCTs	Random allocation balances confounders	Gold standard; balances known AND unknown confounders	Not always feasible or ethical
Restriction	Any study	Limit to one level of confounder (e.g., only non-smokers)	Simple; eliminates confounding by restricted variable	Limits generalisability; may not be feasible if confounder is common
Matching	Case-control, cohort	Select controls/comparison with same confounder levels	Controls for confounding	Cannot match too many variables; over-matching reduces efficiency; can't assess matched variables as risk factors
Stratification	Any study	Analyse within strata, then pool (Mantel-Haenszel)	Simple to implement	Cannot handle many confounders; continuous variables need categorisation
Multivariable regression	Any study	Adjust statistically	Can handle many confounders; continuous and categorical	Assumptions about model form; cannot adjust for unmeasured confounders
Standardisation	Comparing populations	Apply standard weights	Direct or indirect; common in epidemiology	Only adjusts for measured confounders
Propensity score	Observational studies	Probability of exposure given confounders; match/stratify/weight by PS	Reduces many confounders to single score	Only measured confounders; requires large n
Instrumental variable	Natural experiments	Variable associated with exposure but not outcome (except through exposure)	Can handle unmeasured confounders	Difficult to find valid instrument
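Inverse probability weighting	Observational and longitudinal studies	Weight each subject by the inverse of the probability of the exposure they received (or, for attrition, of remaining in the study)	Handles measured confounding (including time-varying) and attrition bias	Depends on correct model for weights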

Mantel-Haenszel Odds Ratio (Stratified Analysis)

Formula for stratified 2×2 tables: OR_MH = Σ(aᵢdᵢ/nᵢ) / Σ(bᵢcᵢ/nᵢ)

Where i indexes strata and nᵢ is the total in stratum i.

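Comparing crude vs adjusted OR:
- If the crude and adjusted ORs differ materially (a common rule of thumb is a change of more than 10%) → confounding by the stratifying variable is likely
- If they are similar → little confounding by that variable (other confounders may still exist)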
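A minimal standard-library Python sketch of the Mantel-Haenszel formula. It reuses the kidney stone counts from the Simpson's paradox table, with treatment A as "exposed" and success as the outcome. The crude OR (about 0.75) favours B. The stone-size-adjusted OR_MH (about 1.45) favours A, a large change that signals strong confounding by stone size.

```python
def odds_ratio(a, b, c, d):
    return (a * d) / (b * c)

def mantel_haenszel_or(strata):
    """Pooled OR across strata, each stratum given as (a, b, c, d)."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den

# Per stratum: (A success, A failure, B success, B failure)
small = (81, 6, 234, 36)
large = (192, 71, 55, 25)
crude = odds_ratio(*[x + y for x, y in zip(small, large)])   # collapse the strata
adjusted = mantel_haenszel_or([small, large])
print(f"crude OR = {crude:.2f}, MH OR = {adjusted:.2f}")
```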

Residual Confounding

Complete confounding adjustment is often impossible because:
- Confounders may be measured with error (residual confounding)
- Unmeasured confounders exist (unmeasured confounding)
- Confounders may change over time (time-varying confounding)

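Sensitivity analysis: How strong would an unmeasured confounder need to be to explain away the observed association? The E-value answers this. It is the minimum strength of association, on the risk-ratio scale, that an unmeasured confounder would need with both the exposure and the outcome to fully explain away the observed estimate.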
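For a risk-ratio point estimate, the E-value has a closed form proposed by VanderWeele and Ding: E = RR + √[RR × (RR − 1)], applied after inverting protective estimates (RR < 1). A minimal Python sketch follows. It covers only the point estimate; the E-value for a confidence limit uses the limit closer to 1 and is not shown.

```python
import math

def e_value(rr):
    """E-value for a risk-ratio point estimate; RR < 1 is inverted first."""
    if rr < 1:
        rr = 1 / rr
    return rr + math.sqrt(rr * (rr - 1))

print(round(e_value(2.0), 2))   # an observed RR of 2
print(round(e_value(0.5), 2))   # a protective RR of 0.5 gives the same E-value
```

An observed RR of 2 gives an E-value of about 3.41. An unmeasured confounder would need an RR of at least 3.41 with both exposure and outcome to fully explain the association away.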

7.5 Effect Modification (Interaction)

Different from confounding!

Aspect	Confounding	Effect Modification
Type	Bias to be minimised	Real biological phenomenon
What it is	Distortion of exposure-outcome relationship	Effect of exposure differs by level of third variable
Deal with it	Remove/adjust in analysis	REPORT it — describe effect separately for each subgroup
Example	Smoking confounds coffee-pancreatic cancer	Aspirin effect on pre-eclampsia may differ by BMI

Testing for effect modification:
1. Stratified analysis: Calculate RR/OR separately for each stratum
2. Interaction term: Include product term in regression model (X₁ × X₂)
3. Statistical test: p-value for interaction (be cautious: tests for interaction are usually underpowered)

Multiplicative vs Additive Interaction:
- Multiplicative scale: Is the combined effect greater than the product of the individual effects? (RR or OR scale)
- Additive scale: Is the combined excess risk greater than the sum of the individual excess risks? (Risk difference scale)
- Public health importance: The additive scale is often more relevant (quantified by the relative excess risk due to interaction, RERI, or the synergy index, S)
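The two scales can be compared numerically from three relative risks, all measured against people with neither factor. This is a sketch with hypothetical function and parameter names and made-up RRs; it uses the standard definitions RERI = RR₁₁ − RR₁₀ − RR₀₁ + 1 and S = (RR₁₁ − 1) / [(RR₁₀ − 1) + (RR₀₁ − 1)].

```python
def interaction_measures(rr10: float, rr01: float, rr11: float) -> dict:
    """Interaction on multiplicative and additive scales.

    rr10: RR for factor A alone (vs neither factor)
    rr01: RR for factor B alone (vs neither factor)
    rr11: RR for both factors together (vs neither factor)
    """
    return {
        # > 1: joint effect exceeds the product of the separate effects
        "multiplicative": rr11 / (rr10 * rr01),
        # Relative excess risk due to interaction; > 0 means super-additive
        "RERI": rr11 - rr10 - rr01 + 1,
        # Synergy index; > 1 means super-additive (undefined if both RRs are 1)
        "S": (rr11 - 1) / ((rr10 - 1) + (rr01 - 1)),
    }


m = interaction_measures(rr10=2.0, rr01=3.0, rr11=8.0)
print({k: round(v, 2) for k, v in m.items()})
# {'multiplicative': 1.33, 'RERI': 4.0, 'S': 2.33}
```

With RR₁₀ = 2 and RR₀₁ = 3, a joint RR of 4 would be exactly additive (RERI = 0) and a joint RR of 6 exactly multiplicative, so the same data can show interaction on one scale but not the other.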

O&G Example: Does the effect of smoking on preterm birth differ by maternal age?

Maternal age	Smoker (preterm birth risk)	Non-smoker (preterm birth risk)	RR (smoking vs not)
Age < 35	5%	3%	1.67
Age ≥ 35	10%	5%	2.00

The RR is 1.67 in younger and 2.00 in older women → possible effect modification by age on the multiplicative scale. The absolute risk increase (risk difference) also differs: 2 vs 5 percentage points.
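The stratum-specific figures in the table can be reproduced with a short calculation. This sketch takes the risks straight from the table; the helper name `stratum_effects` is hypothetical.

```python
def stratum_effects(risk_exposed: float, risk_unexposed: float) -> tuple:
    """Return (risk ratio, risk difference) for one stratum."""
    return risk_exposed / risk_unexposed, risk_exposed - risk_unexposed


# Preterm birth risks from the smoking x maternal age table above
strata = {
    "Age < 35": (0.05, 0.03),   # (smoker, non-smoker)
    "Age >= 35": (0.10, 0.05),
}

rrs = {}
for name, (p_smoker, p_non_smoker) in strata.items():
    rr, rd = stratum_effects(p_smoker, p_non_smoker)
    rrs[name] = rr
    print(f"{name}: RR = {rr:.2f}, RD = {rd * 100:.0f} percentage points")

# Ratio of stratum-specific RRs (1 = no effect modification on the RR scale)
print(f"Ratio of RRs = {rrs['Age >= 35'] / rrs['Age < 35']:.2f}")
# Age < 35: RR = 1.67, RD = 2 percentage points
# Age >= 35: RR = 2.00, RD = 5 percentage points
# Ratio of RRs = 1.20
```

Whether a ratio of 1.2 reflects real effect modification or chance would still need a formal test for interaction.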

7.6 Confounding by Indication

An important concept for treatment studies.

Definition: The indication for a treatment is itself associated with the outcome. Patients who receive a treatment are systematically different from those who don't because of WHY they were treated.

Example: Studying whether magnesium sulphate prevents cerebral palsy in preterm infants.
- Women who receive MgSO₄ are those in preterm labour
- Preterm labour itself is a risk factor for cerebral palsy
- Without randomisation, any difference in CP rates could be due to the underlying indication (preterm labour), not the treatment

Solution: Randomisation (e.g., the ACTOMgSO4 and BEAM trials of antenatal magnesium sulphate for fetal neuroprotection). If randomisation is not possible: propensity score methods, indication-based restriction, or multivariable adjustment (though residual confounding is likely to remain).

7.7 Protopathic Bias

Definition: Treatment started for early symptoms of the outcome before the outcome is formally diagnosed.

Example: Studying whether NSAIDs cause miscarriage.
- Women may take NSAIDs for pelvic pain
- Pelvic pain might be an early symptom of miscarriage
- An association between NSAID use and miscarriage could arise because NSAIDs were taken for early symptoms of a miscarriage already under way (reverse causality)

Solution: Exclude medication use in the period immediately before outcome (lag window), or use new-user designs.
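A lag window can be applied when classifying exposure. In this minimal sketch the function name and the 14-day window are hypothetical choices: a woman counts as exposed only if a prescription falls before the window immediately preceding the outcome.

```python
from datetime import date, timedelta
from typing import Iterable


def exposed_before_lag(prescriptions: Iterable[date], outcome_date: date, lag_days: int) -> bool:
    """True if any prescription falls before the lag window preceding the outcome.

    Prescriptions inside the window (the `lag_days` days immediately before the
    outcome) are ignored, because they may have been given for early symptoms
    of the outcome itself (protopathic bias).
    """
    window_start = outcome_date - timedelta(days=lag_days)
    return any(p < window_start for p in prescriptions)


# Hypothetical: NSAID taken 4 days before a miscarriage, with a 14-day lag window
print(exposed_before_lag([date(2024, 3, 1)], date(2024, 3, 5), lag_days=14))    # False
print(exposed_before_lag([date(2024, 1, 10)], date(2024, 3, 5), lag_days=14))   # True
```

The choice of window length is a judgement about how long prodromal symptoms might precede diagnosis, and is often varied in sensitivity analyses.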

8. Evidence-Based Medicine

8.1 Levels of Evidence — Oxford CEBM (March 2009)

The traditional 5-level system (still widely cited in the O&G literature):

Level	Therapy / Prevention	Prognosis	Diagnosis
1a	SR of RCTs (with homogeneity)	SR of inception cohort studies	SR of diagnostic studies (homogeneous, with gold standard)
1b	Individual RCT (narrow CI)	Individual inception cohort (≥80% follow-up)	Validating cohort with gold standard
1c	All or none	All or none case series	SpPin or SnNOut
2a	SR of cohort studies	SR of retrospective cohorts / untreated controls	SR of level 2 diagnostic studies
2b	Individual cohort study (including low-quality RCT)	Retrospective cohort / follow-up of RCT controls	Exploratory cohort with good reference standard
2c	Outcomes research / ecological studies	"Outcomes" research	—
3a	SR of case-control studies	—	SR of case-control studies
3b	Individual case-control study	—	Non-consecutive / no gold standard
4	Case series / poor quality cohort	Case series / poor quality cohort	Case-control / poor reference
5	Expert opinion	Expert opinion	Expert opinion

Key: "All or none" — when all patients died before treatment but some now survive, or when some died before but none die now.

Oxford 2011 revision: Simplified to 5 levels based on the type of question and the quality of evidence, but the 2009 system is still widely cited.

8.2 GRADE System — Complete

Grading of Recommendations Assessment, Development and Evaluation

Quality of Evidence

Level	Definition	Symbol
High	Further research VERY UNLIKELY to change confidence in estimate	⊕⊕⊕⊕
Moderate	Further research LIKELY to have important impact	⊕⊕⊕○
Low	Further research VERY LIKELY to have important impact	⊕⊕○○
Very low	Any estimate is very uncertain	⊕○○○

Factors that Lower Quality

Factor	How it works
Risk of bias	Study design limitations (no blinding, no allocation concealment, etc.)
Inconsistency	Unexplained heterogeneity (I² > 50%, p < 0.10) across studies
Indirectness	PICO differences (population, intervention, comparator, outcome)
Imprecision	Wide CIs crossing clinically important thresholds
Publication bias	Suspicion that negative studies are missing

Downgrading rules:
- Start at HIGH for RCTs, LOW for observational studies
- Downgrade 1 level for a serious concern, 2 for a very serious concern
- Maximum downgrade: 3 levels (i.e., from High to Very low)

Factors that Raise Quality (Observational Studies)

Factor	Criteria
Large effect	RR > 2 or < 0.5 (up 1 level); RR > 5 or < 0.2 (up 2 levels)
Dose-response	Clear biological gradient demonstrated
Confounding	All plausible confounders would reduce the observed effect
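The starting points and the up/down adjustments combine into a simple counting rule. The sketch below is a simplification under stated assumptions: the function name is hypothetical, downgrades are supplied already summed across the five domains, and upgrading is applied only to observational evidence, as in the table above.

```python
LEVELS = ["Very low", "Low", "Moderate", "High"]


def grade_certainty(randomised: bool, downgrades: int = 0, upgrades: int = 0) -> str:
    """Simplified GRADE certainty rating for one outcome.

    randomised: evidence from RCTs starts at High, observational at Low.
    downgrades: total levels removed across risk of bias, inconsistency,
                indirectness, imprecision and publication bias
                (1 per serious, 2 per very serious concern).
    upgrades:   levels added for large effect, dose-response or
                plausible confounding; applied here to observational evidence only.
    """
    start = 3 if randomised else 1
    if randomised:
        upgrades = 0                       # simplification used in this sketch
    score = start - downgrades + upgrades
    return LEVELS[max(0, min(3, score))]  # clamp between Very low and High


print(grade_certainty(randomised=True, downgrades=1))   # Moderate
print(grade_certainty(randomised=False, upgrades=2))    # High
```

In practice GRADE panels make these judgements domain by domain and explain each one; the arithmetic only summarises them.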

Strength of Recommendation

Strength	Wording	Interpretation
Strong (1)	"We recommend..." / "Offer"	Most patients should receive the intervention
Weak (2)	"We suggest..." / "Consider"	Different choices appropriate for different patients; requires shared decision-making

Implications:
- Strong recommendation: Can be adopted as policy in most situations
- Weak recommendation: Policy-making requires substantial debate and stakeholder involvement

8.3 Systematic Reviews & Meta-Analysis — Complete

Definitions

Term	Definition
Systematic review	A review of a clearly formulated question that uses systematic and explicit methods to identify, select, and critically appraise relevant research, and to collect and analyse data from the studies that are included in the review
Meta-analysis	The statistical combination of results from two or more separate studies
Narrative review	Non-systematic summary of the literature (methods for finding and selecting studies are not explicit, so prone to selection bias)

Steps of a Systematic Review

1. Formulate question (using PICO: Population, Intervention, Comparison, Outcome)
2. Pre-register protocol (PROSPERO)
3. Systematic search of multiple databases (MEDLINE, EMBASE, CENTRAL, CINAHL)
4. Screen and select studies against pre-specified criteria (PRISMA flow diagram)
5. Assess quality/risk of bias of included studies (Cochrane Risk of Bias tool for RCTs)
6. Extract data (double-extraction recommended)
7. Analyse (meta-analysis if appropriate)
8. Interpret and report

PRISMA Flow Diagram

Records identified through database searching (n=...)
    Additional records identified through other sources (n=...)
         │
         ▼
Records after duplicates removed (n=...)
         │
         ▼
Records screened (n=...)
    Records excluded (n=...)
         │
         ▼
Full-text articles assessed for eligibility (n=...)
    Full-text articles excluded, with reasons (n=...)
         │
         ▼
Studies included in qualitative synthesis (n=...)
         │
         ▼
Studies included in quantitative synthesis (meta-analysis) (n=...)

Fixed Effect vs Random Effects Meta-Analysis

Feature	Fixed Effect	Random Effects
Assumption	All studies estimate the SAME true effect	Studies estimate DIFFERENT true effects (drawn from a distribution)
Implication	Differences due to chance only	Differences due to chance + real variation
Weighting	By inverse variance (precision)	By inverse of (within-study variance + between-study variance τ²)
CI	Narrower	Wider (if heterogeneity present)
Interpretation	"The effect" (single value)	"The average effect"
When to use	Minimal heterogeneity	Moderate/substantial heterogeneity

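Which is more conservative? Random effects usually gives a wider CI when the estimated between-study variance τ² > 0. However, it gives relatively more weight to small studies, so it is more affected by small-study effects. If the estimated τ² = 0, the two models give identical results.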

DerSimonian and Laird method (the most common random-effects approach): wᵢ* = 1 / (sᵢ² + τ²)

Where sᵢ² is the within-study variance of study i and τ² is the between-study variance (estimate of heterogeneity).
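The DerSimonian-Laird procedure can be sketched in Python. This is a minimal illustration, assuming effects are already on an additive scale (e.g., log RR) with known within-study variances sᵢ²; τ² uses the standard DerSimonian-Laird moment estimator τ² = max(0, (Q − df) / (Σwᵢ − Σwᵢ² / Σwᵢ)). The function name and the five studies are hypothetical.

```python
import math
from typing import Sequence, Tuple


def dersimonian_laird(y: Sequence[float], v: Sequence[float]) -> Tuple[float, float, float, float]:
    """Random-effects pooling (DerSimonian-Laird).

    y: study effect estimates on an additive scale (e.g. log RR)
    v: their within-study variances s_i^2
    Returns (pooled estimate, its SE, tau^2, Q).
    """
    w = [1 / vi for vi in v]                          # fixed-effect weights
    sw = sum(w)
    y_fe = sum(wi * yi for wi, yi in zip(w, y)) / sw  # fixed-effect estimate
    q = sum(wi * (yi - y_fe) ** 2 for wi, yi in zip(w, y))
    df = len(y) - 1
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - df) / c)                     # truncated at zero
    w_star = [1 / (vi + tau2) for vi in v]            # random-effects weights
    sw_star = sum(w_star)
    y_re = sum(wi * yi for wi, yi in zip(w_star, y)) / sw_star
    return y_re, math.sqrt(1 / sw_star), tau2, q


# Hypothetical log risk ratios and their variances from five studies
log_rr = [0.18, 0.41, 0.10, 0.34, 0.26]
var = [0.025, 0.020, 0.040, 0.015, 0.022]
est, se, tau2, q = dersimonian_laird(log_rr, var)
print(f"pooled RR = {math.exp(est):.2f} "
      f"(95% CI {math.exp(est - 1.96 * se):.2f} to {math.exp(est + 1.96 * se):.2f}), "
      f"tau^2 = {tau2:.3f}")
```

When Q ≤ df, τ² is set to zero and the weights, and hence the result, reduce to the fixed-effect analysis.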

Heterogeneity — I² Statistic

I² = [(Q − df) / Q] × 100%

Where Q = chi-squared statistic for heterogeneity, df = degrees of freedom (# studies − 1)

I²	Interpretation
0%	No observed heterogeneity
<25%	Low heterogeneity
25–50%	Moderate
50–75%	Substantial
>75%	Considerable
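I² follows directly from Q and the number of studies. A minimal sketch with hypothetical function names; the band edges are one reading of the table above, where the ranges share their endpoints.

```python
def i_squared(q: float, k: int) -> float:
    """I^2 (%) from Cochran's Q and the number of studies k; truncated at 0."""
    df = k - 1
    if q <= 0:
        return 0.0
    return max(0.0, (q - df) / q) * 100


def describe_i2(i2: float) -> str:
    """Verbal label for I^2 using the bands in the table (edges assumed)."""
    if i2 <= 0:
        return "No observed heterogeneity"
    if i2 < 25:
        return "Low"
    if i2 < 50:
        return "Moderate"
    if i2 <= 75:
        return "Substantial"
    return "Considerable"


value = i_squared(q=10, k=5)
print(f"{value:.0f}% -> {describe_i2(value)}")   # 60% -> Substantial
```

Because Q is compared with its degrees of freedom, I² is set to zero whenever Q is smaller than the number of studies minus one.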

But also consider the p-value for the Q statistic:
- p < 0.10 suggests significant heterogeneity (note: not p < 0.05, because the Q test has low power when few studies are included)
- Explore potential sources of heterogeneity even if I² is modest

Exploring heterogeneity:
1. Subgroup analysis: Pre-specified subgroups (e.g., by study quality, population, intervention type)
2. Meta-regression: Regression exploring whether study-level characteristics explain heterogeneity
3. Sensitivity analysis: Excluding one study at a time (leave-one-out analysis)

Forest Plot — Detailed Interpretation

Components:

Study                         Weight   RR (95% CI)
────────                      ──────   ──────────
Smith 2010                    ██████   1.20 (0.85–1.55)
Jones 2012                    ███████  1.50 (1.10–2.00)
Lee 2013                      ████     1.10 (0.70–1.50)
Brown 2015                    ████████ 1.40 (1.05–1.75)
Patel 2017                    ██████   1.30 (0.95–1.65)
──────────────────────────────────────────────────────
Overall (I²=0%, p=0.56)      ◆        1.33 (1.17–1.49)

           0.5   1.0   1.5   2.0   2.5
           ◀── Favours control   Favours exposure ──▶

Reading a forest plot:
1. Each row = one study
2. Square = point estimate
3. Horizontal line = 95% CI
4. Square size = weight in meta-analysis (proportional to inverse variance)
5. Vertical line at 1 = null effect (for RR/OR/HR)
6. Diamond at bottom = summary estimate (width = 95% CI)
7. If the diamond does not cross the null line → statistically significant (at the 5% level)

Funnel Plot & Publication Bias

Funnel plot:
- X-axis: Effect size (for ratio measures usually on a log scale, e.g., log RR or log OR)
- Y-axis: Standard error (inverted, so larger studies are at the top)
- Each dot = one study

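Interpretation:
- Symmetric inverted funnel: No evidence of publication bias
- Asymmetric, with small studies missing on the side of null or unfavourable results (bottom left when larger effects lie to the right): Possible publication bias (small negative studies missing)
- Asymmetric in the opposite direction (small studies missing on the favourable side): Other explanations are more likely (e.g., small studies with genuinely different effects)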

Causes of asymmetry:
1. Publication bias: Small studies with null/negative results not published
2. True heterogeneity: Small studies have different populations/interventions
3. Poor methodology: Small studies have lower quality → biased effect estimates
4. Chance: Especially with few studies (<10)

Tests for publication bias:
- Egger's test: Regression of the standardised effect (effect/SE) on precision (1/SE); an intercept different from zero indicates asymmetry (p < 0.10)
- Begg's test: Rank correlation between standardised effect sizes and their variances
- Trim-and-fill method: Imputes missing studies and adjusts the summary estimate
- Contour-enhanced funnel plot: Helps distinguish publication bias from other causes
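Egger's regression can be written out with ordinary least squares. This sketch uses the precision form (standardised effect regressed on 1/SE, testing the intercept). The function name and the six studies are hypothetical, and it returns a t statistic rather than a p-value to avoid depending on a statistics library; compare it with a t distribution on k − 2 degrees of freedom.

```python
import math
from typing import Sequence, Tuple


def egger_test(y: Sequence[float], se: Sequence[float]) -> Tuple[float, float]:
    """Egger's regression: OLS of y_i / se_i on 1 / se_i.

    Returns (intercept, t statistic for the intercept). An intercept far
    from 0 suggests funnel-plot asymmetry. Needs at least 3 studies and a
    non-zero residual variance.
    """
    k = len(y)
    x = [1 / s for s in se]                   # precision
    z = [yi / s for yi, s in zip(y, se)]      # standardised effect
    x_bar = sum(x) / k
    z_bar = sum(z) / k
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (zi - z_bar) for xi, zi in zip(x, z)) / sxx
    intercept = z_bar - slope * x_bar
    resid = [zi - (intercept + slope * xi) for xi, zi in zip(x, z)]
    sigma2 = sum(r ** 2 for r in resid) / (k - 2)         # residual variance
    se_intercept = math.sqrt(sigma2 * (1 / k + x_bar ** 2 / sxx))
    return intercept, intercept / se_intercept


# Hypothetical log odds ratios and standard errors from six studies
log_or = [0.45, 0.30, 0.62, 0.20, 0.85, 0.15]
se = [0.30, 0.20, 0.35, 0.12, 0.45, 0.10]
b0, t = egger_test(log_or, se)
print(f"Egger intercept = {b0:.2f}, t = {t:.2f} on {len(se) - 2} df")
```

With only a handful of studies, as here, the test has very little power, which is one reason it is usually reserved for meta-analyses of 10 or more studies.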

8.4 Critical Appraisal — CASP Tools

Key questions for any study:

Domain	Key Questions
Validity	Is the study design appropriate? Was bias minimised?
Results	What is the effect size? How precise is it?
Applicability	Can results be applied to my patients?

CASP Checklist for RCTs (abbreviated)

1. Did the study address a clearly focused question?
2. Was the assignment to treatment groups truly random?
3. Were all participants properly accounted for at conclusion?
4. Were participants, clinicians, and outcome assessors blinded?
5. Were the groups similar at the start of the trial?
6. Were groups treated equally (apart from intervention)?
7. How large was the treatment effect?
8. How precise was the estimate (CIs)?
9. Can the results be applied to the local population?
10. Were all clinically important outcomes considered?
11. Are the benefits worth the harms and costs?

CONSORT Statement (RCT reporting)

Key items:
- Methods: Eligibility criteria, randomisation, allocation concealment, blinding, sample size calculation
- Results: Flow diagram (participant flow), baseline table (Table 1), outcomes (ITT analysis), harms
- Discussion: Limitations, generalisability, interpretation

STROBE Statement (Observational studies)

22-item checklist covering:
- Title and abstract
- Introduction: Background, objectives
- Methods: Study design, setting, participants, variables, data sources, bias, sample size
- Results: Participants (flow diagram), descriptive data, outcome data, main results, other analyses
- Discussion: Key results, limitations, interpretation, generalisability

PRISMA Statement (Systematic Reviews)

27-item checklist with flow diagram:
- Title, abstract, structured summary
- Rationale, objectives
- Protocol registration, eligibility criteria, information sources, search strategy, selection process, data extraction, risk of bias, synthesis methods
- Results: Study selection, characteristics, risk of bias, individual study results, synthesis
- Discussion: Summary, limitations, conclusions

QUADAS-2 (Diagnostic accuracy studies)

Four domains:
1. Patient selection (was a consecutive or random sample used?)
2. Index test (was it performed and interpreted without knowledge of reference standard?)
3. Reference standard (is it likely to correctly classify the target condition?)
4. Flow and timing (appropriate interval between tests, all patients received reference standard?)

8.5 Using EBM in Practice — Fagan Nomogram

Pre-test probability → Post-test probability

Clinical example: a woman whose pre-test (prior) risk of Down's syndrome is 1:150

Pre-test probability = 1/150 = 0.67%
Pre-test odds = 0.0067 / 0.9933 = 0.0067
Screening test positive: LR+ = 8 (illustrative value)
Post-test odds = 0.0067 × 8 = 0.0536
Post-test probability = 0.0536 / 1.0536 = 5.1% (about 1 in 20)

Using Fagan nomogram: Draw line from pre-test probability (0.67%) through LR (8) → post-test probability ~5%.
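The nomogram is a graphical form of Bayes' theorem in odds form, so the same result can be computed directly. A minimal sketch; the function name is hypothetical and the inputs are the 1/150 pre-test probability and LR+ of 8 from the example.

```python
def post_test_probability(pre_test_prob: float, likelihood_ratio: float) -> float:
    """Bayes' theorem in odds form (what the Fagan nomogram does graphically)."""
    pre_odds = pre_test_prob / (1 - pre_test_prob)     # probability -> odds
    post_odds = pre_odds * likelihood_ratio            # update with the LR
    return post_odds / (1 + post_odds)                 # odds -> probability


p = post_test_probability(1 / 150, 8)
print(f"post-test probability = {p:.1%} (about 1 in {1 / p:.0f})")
# post-test probability = 5.1% (about 1 in 20)
```

Working with exact fractions gives 8/157 ≈ 5.1%, matching the rounded hand calculation above.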

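Clinical application: If the post-test probability exceeds the screening cut-off (1 in 150 in the NHS Fetal Anomaly Screening Programme), offer further testing (NIPT, or diagnostic CVS/amniocentesis). If below, the result is reported as lower chance.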

8.6 Evidence-Based Guidelines in O&G

NICE guidelines:
- Use GRADE for quality assessment
- Recommendations: "Offer" (strong) vs "Consider" (weak)
- Regular updates (usually 3–5 year cycle)
- Include health economic modelling

RCOG Green-top Guidelines:
- Use SIGN-derived levels of evidence (1++ to 4)
- Grade A, B, C, D recommendations, plus Good Practice Points
- Topic-specific expert review

SIGN Guidelines:
- Scottish Intercollegiate Guidelines Network
- Similar approach to GRADE
- Identify key clinical questions, systematic review, evidence tables

WHO guidelines:
- Use GRADE
- Consider global applicability, resource implications
- Include "Good Practice Statements"

9. Survival Analysis

9.1 Key Concepts — Detailed

Survival analysis = statistical methods for analysing data where the outcome is the TIME until an event occurs.

Key features:
- Time-to-event data: Not just whether the event occurred, but WHEN
- Censoring: Some subjects don't experience the event during follow-up
- Time-varying risk: Risk may change over time (e.g., higher shortly after treatment)

Applications in O&G:
- Time to pregnancy (survival = time to conception)
- Time to labour onset after induction
- Time to recurrence of endometriosis after surgery
- Time to death in ovarian cancer
- Time to treatment failure in IVF
- Duration of breastfeeding

9.2 Censoring — Complete Types

Type	Definition	Example
Right censoring	Subject does NOT experience event by study end, or is lost to follow-up	Patient with ovarian cancer alive at 5-year study endpoint
Left censoring	Event occurred before study began (subject already had the event at entry)	Time to first pregnancy — some women already pregnant at study entry
Interval censoring	Event occurs between two known time points, but exact time unknown	Annual screening: cancer detected between visits

Assumption for valid analysis: Censoring is non-informative — the reason for censoring is unrelated to the probability of experiencing the event.

Example of INFORMATIVE censoring: If patients with more aggressive ovarian cancer are more likely to drop out (move to hospice, stop attending follow-up), their censoring is related to the outcome → biased results.

9.3 Kaplan-Meier Method — Complete Details

Purpose: Estimate the survival function without assuming a particular distribution.

Method:
1. Arrange event times in ascending order
2. At each event time, record:
   - Number at risk just before the event
   - Number who experienced the event
   - Number censored between this event and the next
3. S(t) = Π (nᵢ − dᵢ) / nᵢ, the product taken over all event times tᵢ ≤ t, where nᵢ = number at risk just before tᵢ and dᵢ = number of events at tᵢ

Properties:
- Step function (drops only at event times)
- Horizontal segments between events
- Tick marks indicate censored observations
- Median survival = smallest time at which S(t) ≤ 0.5
- 95% CI (Greenwood's formula) shown as dashed lines or shading

Example: Time to recurrence of endometriosis after surgery

Recurrence-free survival
100% │─────────────────────────────────────
     │                                      ─────
 75% │                                            ─────
     │                                                   ─────
 50% │                                                            ─────
     │                                                                   ─────
 25% │                                                                          ─────
     │                                                                            ─────
  0% │──────────────────────┴──────────────────────┴──────────────────────┴─────▶ Time
    0        12         24         36         48         60  months

Censored observations represented as tick marks on the curve.

Kaplan-Meier by groups:

Survival
100% │─────── Treatment
     │         ─────────
 75% │                     ──────────
     │                                 ────── Control
 50% │                                         ──────
     │                                                 ──────
 25% │                                                        ──────
     │                                                               ──────
  0% │─────────────────────────────────────────────────────────────────────▶ Time

The log-rank test compares these two curves.

9.4 Log-Rank Test — Details

Non-parametric: No assumption about shape of survival curves
H₀: The survival functions are the same in all groups
H₁: At least one group differs
Calculation: Compares observed vs expected events at each time point, summed over all times
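χ² ≈ Σ[(O − E)² / E] across groups (a simple, conservative approximation); the standard two-group statistic is χ² = (O₁ − E₁)² / V₁ on 1 df, where V₁ is the hypergeometric variance of O₁ summed over event times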
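A two-group log-rank test can be sketched in plain Python. At each distinct event time it accumulates observed minus expected events in group 1 and the hypergeometric variance, then forms (O₁ − E₁)² / V₁. The function names and data are hypothetical; the p-value uses the fact that a χ² variable on 1 df is the square of a standard normal, so P(χ² > x) = erfc(√(x/2)).

```python
import math
from typing import Sequence


def log_rank(times: Sequence[float], events: Sequence[int], groups: Sequence[int]) -> float:
    """Two-group log-rank chi-squared statistic (1 df).

    groups: 0 or 1 for each subject. At each distinct event time the expected
    number of events in group 1 is d * n1 / n, with hypergeometric variance
    d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1).
    """
    data = list(zip(times, events, groups))
    event_times = sorted({t for t, e, _ in data if e == 1})
    o_minus_e = 0.0
    var = 0.0
    for t in event_times:
        n = sum(1 for ti, _, _ in data if ti >= t)                      # at risk
        n1 = sum(1 for ti, _, g in data if ti >= t and g == 1)
        d = sum(1 for ti, e, _ in data if ti == t and e == 1)           # events
        d1 = sum(1 for ti, e, g in data if ti == t and e == 1 and g == 1)
        o_minus_e += d1 - d * n1 / n
        if n > 1:
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return o_minus_e ** 2 / var


def chi2_1df_p(stat: float) -> float:
    """Upper-tail p-value for a chi-squared statistic on 1 df."""
    return math.erfc(math.sqrt(stat / 2))


# Hypothetical months to event; group 1 = treatment, group 0 = control
t = [3, 5, 6, 8, 10, 12, 2, 3, 4, 5, 7, 9]
e = [1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1]
g = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
stat = log_rank(t, e, g)
print(f"log-rank chi2 = {stat:.2f}, p = {chi2_1df_p(stat):.3f}")
```

Summing over every event time is what makes the test sensitive to differences across the whole curve rather than at a single time point.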

Assumptions:
- Non-informative censoring
- Independence of survival times
- Proportional hazards is not needed for the test to be valid, but the test has most power when the hazard ratio is roughly constant over time

Limitations:
- Cannot adjust for confounders (use Cox regression instead)
- Does not estimate the magnitude of difference (use Cox for HR)
- If survival curves cross, log-rank has low power (use alternative tests: weighted log-rank, Peto-Peto, Fleming-Harrington)

9.5 Cox Proportional Hazards — Complete Details

Model: h(t|X) = h₀(t) × exp(β₁X₁ + β₂X₂ + ... + βₖXₖ)

Components:
- Baseline hazard h₀(t): The hazard when all Xᵢ = 0 (can vary arbitrarily over time, hence "semi-parametric")
- Proportional term exp(βX): Multiplicative effect of covariates on hazard (constant over time)

Interpretation of exp(β):
- exp(β) = Hazard Ratio (HR)
- HR > 1: increased hazard (worse survival)
- HR < 1: decreased hazard (better survival)
- HR = 1: no effect

Worked O&G example: Survival after ovarian cancer diagnosis

Predictor	β	HR	95% CI	p
Stage III/IV vs I/II	1.39	4.01	2.50–6.43	<0.001
Suboptimal debulking	0.80	2.23	1.40–3.55	0.001
BRCA mutation	−0.51	0.60	0.38–0.95	0.03
Age (per 10 years)	0.32	1.38	1.10–1.73	0.01

Advanced stage: about 4× the hazard of death at any given time (HR = 4.01), adjusted for the other predictors
BRCA mutation: 40% lower hazard of death (HR = 0.60)
Each 10-year increase in age: 38% higher hazard of death (HR = 1.38)
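The columns of the Cox table are linked by HR = exp(β) and, for a Wald interval, 95% CI = exp(β ± 1.96 × SE). The sketch below inverts that relationship to recover β, its SE and a two-sided p-value from a reported HR and CI. The function names are hypothetical, and published CIs are rounded, so recovered values are approximate.

```python
import math


def hr_from_beta(beta: float) -> float:
    """Hazard ratio from a Cox coefficient."""
    return math.exp(beta)


def beta_se_p_from_ci(hr: float, lower: float, upper: float) -> tuple:
    """Recover beta, its SE and a two-sided Wald p-value from an HR and 95% CI.

    Assumes the CI was computed as exp(beta +/- 1.96 * SE).
    """
    beta = math.log(hr)
    se = (math.log(upper) - math.log(lower)) / (2 * 1.96)
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2))   # two-sided normal p-value
    return beta, se, p


print(f"{hr_from_beta(1.39):.2f}")   # 4.01

for label, hr, lo, hi in [
    ("Stage III/IV vs I/II", 4.01, 2.50, 6.43),
    ("BRCA mutation", 0.60, 0.38, 0.95),
]:
    beta, se, p = beta_se_p_from_ci(hr, lo, hi)
    print(f"{label}: beta = {beta:.2f}, SE = {se:.3f}, p = {p:.3f}")
```

For both rows the recovered β matches the table (1.39 and −0.51), which is a quick consistency check when reading published Cox models.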

Proportional Hazards Assumption — Checking

The model assumes that the HR between any two groups is constant over time. This is the critical assumption.

How to check:
1. Log-minus-log plot: Plot −ln[−ln(S(t))] vs time (or ln time) for each group; parallel lines = proportional hazards
2. Schoenfeld residuals: Plot against time; if the slope ≈ 0, the assumption holds
3. Test: Significance test of time-dependent covariates (p > 0.05 = no evidence that the assumption is violated)

If assumption violated:
- Stratified Cox model: Stratify by the variable with non-proportional hazards
- Time-varying covariates: Include interaction with time (t)
- Extended Cox model: Allow HR to change at a specified time point
- Alternative: Parametric survival models (Weibull, exponential, log-normal)

9.6 Parametric Survival Models

Model	Hazard function	When used
Exponential	Constant hazard over time	Simplest; rarely realistic
Weibull	Monotonic (always increasing or decreasing)	Flexible; includes exponential as special case
Gompertz	Mortality rate increases exponentially	Demography; older populations
Log-normal	Hazard increases then decreases	Biological processes with "burn-in"
Log-logistic	Similar to log-normal with heavier tails	Accelerated failure time models
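The link between the exponential and Weibull rows can be seen from the Weibull hazard h(t) = (k/λ)(t/λ)^(k − 1), where k is the shape and λ the scale. A small sketch with a hypothetical function name and arbitrary parameter values:

```python
def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Weibull hazard h(t) = (k / lam) * (t / lam) ** (k - 1), for t > 0.

    shape k = 1: constant hazard (the exponential model)
    shape k > 1: hazard increases with time
    shape k < 1: hazard decreases with time
    """
    return (shape / scale) * (t / scale) ** (shape - 1)


for k in (0.5, 1.0, 2.0):
    hazards = [round(weibull_hazard(t, k, scale=10.0), 3) for t in (1, 5, 10)]
    print(f"shape {k}: h(1), h(5), h(10) = {hazards}")
```

Setting k = 1 removes the time term entirely, which is why the exponential model is a special case of the Weibull.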

9.7 Describing Survival Results

Median survival time: Time when survival probability = 50%
- In O&G: Median time to pregnancy, median time to recurrence

Survival at specific time point: Proportion surviving at 1 year, 5 years, etc.
- Example: 5-year survival in ovarian cancer ~45% (all stages combined)

Hazard Ratio from Cox model: Summarises the relative hazard between groups across the entire follow-up (assumed constant over time)

10. Specific Topics in O&G

10.1 Key Rates and Definitions

Rate	Numerator	Denominator	Multiplier	UK Approx.
Crude birth rate (CBR)	Live births	Mid-year population	×1000	~11/1000
General fertility rate (GFR)	Live births	Women aged 15–44	×1000	~60/1000
Total fertility rate (TFR)	Sum of ASFRs × 5	—	Per woman	~1.6
Age-specific fertility rate (ASFR)	Live births to women of age group	Women in that age group	×1000	Varies
Perinatal mortality rate (PMR)	Stillbirths + early neonatal deaths (<7 days)	Total births	×1000	~5/1000
Stillbirth rate	Stillbirths (≥24 wks UK)	Total births	×1000	~3.8/1000
Neonatal mortality rate (NMR)	Neonatal deaths (<28 days)	Live births	×1000	~2.5/1000
Early neonatal mortality	Deaths (<7 days)	Live births	×1000	~1.5/1000
Late neonatal mortality	Deaths (7 to <28 days)	Live births	×1000	~1.0/1000
Infant mortality rate (IMR)	Deaths <1 year	Live births	×1000	~3.9/1000
Maternal mortality ratio (MMR)	Maternal deaths	Live births	×100,000	~9/100,000
Maternal mortality rate	Maternal deaths	Women aged 15–49	×100,000	Rarely used
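The table reduces to one pattern: numerator ÷ denominator × multiplier, with the choice of denominator (total births vs live births) being the main trap. A minimal sketch using hypothetical function names and made-up counts; the TFR uses the ASFR × band-width rule for 5-year age bands.

```python
def rate(numerator: float, denominator: float, per: int = 1000) -> float:
    """Generic rate: numerator / denominator x multiplier."""
    return numerator / denominator * per


def total_fertility_rate(asfr_per_1000, band_width: int = 5) -> float:
    """TFR (children per woman) from ASFRs per 1000 women in equal-width age bands."""
    return band_width * sum(asfr_per_1000) / 1000


# Hypothetical one-year counts for a region
live_births = 10_000
stillbirths = 40
early_neonatal_deaths = 15        # deaths in the first 7 days
neonatal_deaths = 25              # deaths in the first 28 days (includes early)
maternal_deaths = 1

total_births = live_births + stillbirths   # PMR and stillbirth rate use total births
print(f"PMR = {rate(stillbirths + early_neonatal_deaths, total_births):.1f} per 1000 total births")
print(f"NMR = {rate(neonatal_deaths, live_births):.1f} per 1000 live births")
print(f"MMR = {rate(maternal_deaths, live_births, per=100_000):.1f} per 100,000 live births")

# Hypothetical ASFRs (per 1000 women) for ages 15-19, 20-24, ..., 45-49
asfr = [10, 50, 90, 100, 50, 10, 1]
print(f"TFR = {total_fertility_rate(asfr):.2f}")
```

Note that the perinatal numerator mixes stillbirths and live-born deaths, which is why its denominator must include stillbirths as well.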

WHO Definitions

Term	WHO Definition	UK Definition
Stillbirth	Fetal death ≥28 weeks	Fetal death ≥24 weeks
Early neonatal death	Death within 7 days of birth	Same
Neonatal death	Death within 28 days of birth	Same
Perinatal period	From 22 weeks gestation to 7 days after birth	24 weeks to 7 days
Maternal death	Death of a woman while pregnant or within 42 days of termination of pregnancy, from any cause related to or aggravated by the pregnancy or its management, but not from accidental or incidental causes	Same
Late maternal death	Death >42 days and <1 year after end of pregnancy	Same
Pregnancy-related death	Death from any cause while pregnant or within 42 days of termination of pregnancy (includes incidental)	Used before ICD-MM

ICD-MM Classification of Maternal Deaths

Direct maternal deaths: Resulting from obstetric complications of the gravid state (pregnancy, labour, puerperium), from interventions, omissions, incorrect treatment, or from a chain of events resulting from any of these.
Examples: Obstetric haemorrhage, pre-eclampsia/eclampsia, sepsis, amniotic fluid embolism, anaesthetic complications, thromboembolism
Indirect maternal deaths: Resulting from previous existing disease or disease that developed during pregnancy and was not due to direct obstetric causes, but was aggravated by the physiological effects of pregnancy.
Examples: Cardiac disease, epilepsy, diabetes, anaemia, HIV, mental health conditions
Coincidental (fortuitous) maternal deaths: Deaths from unrelated causes that happen to occur in pregnancy or the puerperium.
Examples: Road traffic accidents, homicide (under ICD-MM, suicide during pregnancy or within 42 days of its end is classified as a direct maternal death, not a coincidental one)
Late maternal deaths: Deaths occurring between 42 days and 1 year after the end of pregnancy.

10.2 MBRRACE-UK and Confidential Enquiries

MBRRACE-UK (Mothers and Babies: Reducing Risk through Audits and Confidential Enquiries across the UK)
- Established 2012 (replaced CMACE)
- Commissioned by the Healthcare Quality Improvement Partnership (HQIP)
- Key reports:
  - Triennial "Saving Lives, Improving Mothers' Care" (maternal deaths)
  - Perinatal Mortality Surveillance Report
- Each Baby Counts (intrapartum term stillbirths, neonatal deaths, brain injury) is a separate RCOG programme (see 10.4), not an MBRRACE-UK report

Key Findings from Recent Reports (2022–2025)

Main causes of maternal death (UK, 2019–2021):

Rank	Cause	Type	Proportion
1	Cardiac disease	Indirect	~25%
2	Thromboembolism	Direct	~15%
3	Sepsis	Direct/Indirect	~12%
4	Pre-eclampsia/eclampsia	Direct	~10%
5	Haemorrhage	Direct	~8%
6	Neurological causes	Indirect	~8%
7	Mental health (suicide)	Direct (under ICD-MM)	~5%
8	Anaesthetic complications	Direct	Rare

Key disparities:
- Ethnicity: Black women 4× more likely, Asian women 2× more likely to die than white women
- Socioeconomic: Women from most deprived areas 3× more likely to die
- Age: Women ≥35 at higher risk
- Obesity: Leading contributor across multiple causes
- Late booking: Women who book after 12 weeks have higher risk

Key recommendations (recent):
- Better pre-conception counselling for women with medical conditions
- Early pregnancy assessment for women with cardiac disease (joint obstetric-cardiac clinics)
- Standardised management of obstetric haemorrhage (massive transfusion protocol)
- Improved recognition and management of sepsis
- E-learning for early warning scores (MEOWS: Modified Early Obstetric Warning Score)
- Thromboprophylaxis risk assessment at every contact

10.3 Saving Babies' Lives Care Bundle — Version 3 (2023)

A national patient safety initiative to reduce stillbirth and neonatal death.

Element 1: Smoking cessation
- Carbon monoxide (CO) testing at booking
- Referral to stop smoking services if CO ≥ 4 ppm (or ≥ 7 ppm in some areas)
- Brief intervention training for midwives

Element 2: Growth assessment
- Use of customised GROW chart (Gestation Related Optimal Weight)
- Serial symphysis-fundal height (SFH) measurements from 24 weeks
- Referral for ultrasound if SFH diverges from chart (below 10th or above 90th centile)
- Use of ultrasound for suspected SGA: estimated fetal weight + Doppler (umbilical artery PI)

Element 3: Reduced fetal movements (RFM)
- Standardised information for women (counting movements, when to contact)
- Standardised care pathway: CTG + ultrasound (growth, liquor volume, Doppler) within 2 hours
- No digital fetal movement counting for all (controversial; evidence lacking)
- Low PAPP-A (<0.4 MoM) → increased surveillance

Element 4: Effective fetal monitoring during labour
- Standardised CTG interpretation training (e.g., K2MS, PROMPT, RCOG e-learning)
- Use of STAN (ST-segment analysis) or similar adjunct if indicated
- Fetal blood sampling (FBS) protocol
- Structured communication (SBAR) and team working

Element 5: Reducing preterm birth
- Cervical length screening at 20 weeks (transvaginal ultrasound)
- Progesterone for short cervix (<25 mm)
- Cervical cerclage for history-indicated or ultrasound-indicated short cervix
- Arabin pessary (evidence still emerging)

Element 6: Management of pre-existing diabetes in pregnancy (added in Version 3)

10.4 Each Baby Counts (RCOG)

Aim: Reduce the number of term stillbirths, neonatal deaths, and brain injuries occurring as a result of intrapartum incidents
Data collection: All UK maternity units submit cases
Key findings:
~80% of cases had some element of substandard care
Most common issues: CTG misinterpretation, failure to act on abnormal CTG, delayed delivery, poor communication
≥30% of cases were potentially avoidable

Key recommendations:
- Standardised CTG training every 12 months (including emergency drills)
- Consultant-led review of all CTGs in labour
- SBAR handover and communication
- Real-time monitoring of outcomes
- Human factors training (situational awareness, decision-making, communication)

10.5 RCOG Green-top Guidelines — Evidence Grading

Levels of Evidence (derived from the SIGN grading system):

Level	Description
1++	High-quality meta-analyses, systematic reviews of RCTs, or RCTs with very low risk of bias
1+	Well-conducted meta-analyses, systematic reviews of RCTs, or RCTs with low risk of bias
1−	Meta-analyses, systematic reviews of RCTs, or RCTs with high risk of bias
2++	High-quality SR of case-control or cohort studies; high-quality case-control/cohort with very low risk of confounding/bias/chance
2+	Well-conducted case-control or cohort studies with low risk of confounding/bias/chance
2−	Case-control or cohort studies with high risk of confounding/bias/chance
3	Non-analytic studies (case reports, case series)
4	Expert opinion

Grades of Recommendation:

Grade	Evidence Required
A	At least one meta-analysis, systematic review, or RCT rated 1++ and directly applicable to target population; OR systematic review of RCTs or body of evidence consisting principally of studies rated 1+ directly applicable and demonstrating consistency of results
B	Body of evidence including studies rated 2++ directly applicable and demonstrating consistency of results; OR extrapolated evidence from studies rated 1++ or 1+
C	Body of evidence including studies rated 2+ directly applicable and demonstrating consistency of results; OR extrapolated evidence from studies rated 2++
D	Evidence level 3 or 4; OR extrapolated evidence from studies rated 2+

Good Practice Point (GPP): Recommended best practice based on the clinical experience of the guideline development group.

10.6 NICE Guidelines

National Institute for Health and Care Excellence
Use GRADE for quality assessment
Evidence reviews conducted systematically
Health economic modelling integral to recommendations
Recommendation wording:
"Offer" = strong recommendation (most patients should receive)
"Consider" = weaker recommendation (requires discussion)
"Do not offer" = strong against
Also state which treatment options are not recommended

Key NICE guidelines in O&G:
- NG133: Hypertension in pregnancy
- NG25: Preterm labour and birth
- NG201: Antenatal care (replaced CG62)
- NG3: Diabetes in pregnancy
- NG194: Postnatal care
- CG122: Ovarian cancer (recognition and initial management)
- NG88: Heavy menstrual bleeding

10.7 SIGN Guidelines

Scottish Intercollegiate Guidelines Network
Use methodology similar to GRADE
Grades A–D recommendations
Key examples:
SIGN 160: Management of gestational diabetes
SIGN 127: Prophylaxis of venous thromboembolism
SIGN 156: Induction of labour

10.8 Fertility & Population Demographics — UK Data

Measure	Value	Year	Source
Births (England & Wales)	~600,000/year	2023	ONS
Total Fertility Rate (TFR)	1.49	2022	ONS
Mean age of mother	30.7 (first birth); all: 30.8	2023	ONS
Teenage pregnancy rate (<18)	~13/1000 women	2022	ONS
Percentage of births outside marriage	~51%	2023	ONS
Multiple pregnancy rate	~16/1000 maternities	2023	ONS
Caesarean section rate	~33%	2023	NHS Digital
Induction of labour	~30–35%	2023	NHS Digital
Preterm birth rate	~8%	2023	ONS
Low birth weight (<2500g)	~7%	2023	ONS
Perinatal mortality rate	4.9/1000	2022	MBRRACE-UK
Maternal mortality ratio	8.8/100,000	2017–2019	MBRRACE-UK
Stillbirth rate	3.9/1000	2022	ONS
Neonatal mortality rate	2.5/1000	2022	ONS
Infant mortality rate	3.9/1000	2022	ONS

10.9 Clinical Audit in O&G

Definition: A quality improvement process that seeks to improve patient care and outcomes through systematic review of care against explicit criteria and the implementation of change.

The Audit Cycle:

    ┌──────────────────────────────────────┐
    │    Set standards and criteria        │
    └────────────┬─────────────────────────┘
                 │
                 ▼
    ┌──────────────────────────────────────┐
    │    Observe current practice          │
    └────────────┬─────────────────────────┘
                 │
                 ▼
    ┌──────────────────────────────────────┐
    │    Compare practice to standards     │
    └────────────┬─────────────────────────┘
                 │
        ┌────────┴────────┐
        │                 │
      (Met)           (Not met)
        │                 │
        │                 ▼
        │    ┌─────────────────────────┐
        │    │    Implement change     │
        │    └───────────┬─────────────┘
        │                │
        └────────────────┘
                 │
                 ▼
    ┌──────────────────────────────────────┐
    │    Re-audit (to close the loop)      │
    └──────────────────────────────────────┘

Types of audit:

Type	Definition	Example
Structure audit	Resources, facilities, staffing	Is there a 24-hour labour ward consultant?
Process audit	What is done for patients	What proportion had thromboprophylaxis?
Outcome audit	Results achieved	What is the CS rate? Perinatal mortality?

National Audits in O&G

Audit	Organisation	What it measures
MBRRACE-UK	MBRRACE-UK collaboration	Maternal and perinatal deaths
NMPA (National Maternity and Perinatal Audit)	RCOG	Maternity service quality, outcomes
Saving Babies' Lives	NHS England	Stillbirth reduction
Each Baby Counts	RCOG	Intrapartum term outcomes
UKOSS (UK Obstetric Surveillance System)	NPEU	Rare pregnancy conditions

UKOSS (UK Obstetric Surveillance System)

Purpose: Surveillance of rare conditions in pregnancy (incidence < 1 in 10,000)
Method: Monthly case reporting cards sent to all consultant-led maternity units
Examples: Amniotic fluid embolism, placenta accreta, uterine rupture, peripartum cardiomyopathy, maternal sepsis
Outputs: Incidence rates, risk factors, management patterns, maternal and perinatal outcomes

10.10 Quality Improvement in O&G

Plan-Do-Study-Act (PDSA) cycles:
- Plan: Define the change, predict outcomes, decide how to measure them
- Do: Implement the change on a small scale
- Study: Analyse the data and compare with predictions
- Act: Refine the change, then scale up or abandon it

Common QI projects in O&G:
- Reducing induction-to-delivery interval
- Improving antibiotic prophylaxis timing for CS
- Reducing emergency CS decision-to-delivery interval
- Implementing standardised CTG interpretation
- Improving breastfeeding rates
- Reducing perineal trauma

10.11 Key UK Screening Programmes — Summary Table

Programme	Condition	Test	Population	Interval
NHS Fetal Anomaly Screening Programme (FASP)	Down's, Edwards' and Patau's syndromes; 11 conditions at the anomaly scan	Combined test (11–14w) or Quadruple (14–20w) + anomaly scan (18–20w)	All pregnant women	Per pregnancy
NHS Sickle Cell and Thalassaemia Screening	Sickle cell disease, thalassaemia, carrier status	Family origin questionnaire + Hb HPLC	All pregnant women (and partners if carrier)	Per pregnancy
NHS Infectious Diseases in Pregnancy Screening	HIV, Hepatitis B, Syphilis	Blood test	All pregnant women	Per pregnancy (and 28w for high-risk HIV)
NHS Cervical Screening Programme	Cervical cancer (HPV-related)	HPV test → reflex cytology	Women 25–64	3-yearly (25–49), 5-yearly (50–64)
NHS Breast Screening Programme	Breast cancer	Mammography	Women 50–70 (extending to 47–73)	3-yearly
NHS Abdominal Aortic Aneurysm Screening	AAA	Ultrasound	Men 65+	Once
NHS Diabetic Eye Screening	Diabetic retinopathy	Digital retinal photography	All with diabetes	Annual

10.12 How to Answer MRCOG Part 1 Epidemiology Questions

Common question formats:
1. "A new screening test has sensitivity 95% and specificity 95%. The prevalence is 1%. What is the PPV?"
2. "Which study design would be best to investigate an association between a rare disease and a common exposure?"
3. "What is the most appropriate statistical test to compare birth weight between smokers and non-smokers?"
4. "What is the correct interpretation of this confidence interval?"
5. "Which type of bias is most likely in a case-control study of maternal medication and congenital anomalies?"
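Question 1 is the classic prevalence trap. In natural frequencies: of 10,000 women, 100 have the condition and 95 test positive; of the 9,900 without it, 5% (495) also test positive, so PPV = 95/590 ≈ 16.1%. A minimal Python sketch of the same Bayes' theorem calculation (the function name `ppv_npv` is illustrative):

```python
def ppv_npv(sensitivity: float, specificity: float, prevalence: float) -> tuple[float, float]:
    """Return (PPV, NPV) for a test applied at a given prevalence (Bayes' theorem)."""
    true_pos = sensitivity * prevalence                 # P(test+ and disease)
    false_pos = (1 - specificity) * (1 - prevalence)    # P(test+ and no disease)
    true_neg = specificity * (1 - prevalence)           # P(test- and no disease)
    false_neg = (1 - sensitivity) * prevalence          # P(test- and disease)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv

ppv, npv = ppv_npv(0.95, 0.95, 0.01)
print(f"PPV = {ppv:.1%}, NPV = {npv:.2%}")
```

Re-running with a higher prevalence shows why PPV is not a fixed property of the test.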

Answer strategy:
1. Identify what is being asked (study design, test, bias, interpretation)
2. Recall the relevant definition and formula
3. Apply it to the specific scenario
4. Eliminate wrong answers systematically

Formulas to memorise (and practise):
- Sensitivity, specificity, PPV, NPV, LR+, LR−
- RR, OR, AR, ARF, NNT
- χ² = Σ(O−E)²/E
- SEM = SD/√n
- Post-test odds = Pre-test odds × LR
- Adjusted α (Bonferroni) = 0.05/k, where k is the number of comparisons
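Two of these formulas in code form: converting a pre-test probability to a post-test probability via odds, and the Bonferroni-adjusted threshold. With prevalence 1% and LR+ = 0.95/0.05 = 19 (question 1 above), the post-test probability is 19/118 ≈ 16.1%, identical to the PPV, because PPV is the post-test probability after a positive result. Function names are illustrative:

```python
def post_test_probability(pre_test_prob: float, likelihood_ratio: float) -> float:
    """Pre-test probability -> post-test probability using a likelihood ratio."""
    pre_odds = pre_test_prob / (1 - pre_test_prob)   # probability -> odds
    post_odds = pre_odds * likelihood_ratio          # post-test odds = pre-test odds x LR
    return post_odds / (1 + post_odds)               # odds -> probability

def bonferroni_alpha(alpha: float, k: int) -> float:
    """Per-comparison significance threshold when k comparisons are made."""
    return alpha / k

lr_pos = 0.95 / (1 - 0.95)   # LR+ = Sn / (1 - Sp), about 19
print(f"post-test probability: {post_test_probability(0.01, lr_pos):.3f}")
print(f"Bonferroni alpha for 5 comparisons: {bonferroni_alpha(0.05, 5):.3f}")
```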

Quick Reference: MRCOG Epidemiology Formulae

Screening

Formula	Mnemonic
Sn = TP / (TP + FN)	Sn = sick / (sick + missed)
Sp = TN / (TN + FP)	Sp = well / (well + false alarms)
PPV = TP / (TP + FP)	PPV = true positive / all positive
NPV = TN / (TN + FN)	NPV = true negative / all negative
LR+ = Sn / (1 − Sp)	Positive LR = sensitivity / false positive rate
LR− = (1 − Sn) / Sp	Negative LR = false negative rate / specificity
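The screening formulas above can be checked with a small helper built on a 2×2 table of test result against true disease status. The class name and counts are hypothetical, for illustration only:

```python
from dataclasses import dataclass

@dataclass
class TwoByTwo:
    """Screening 2x2 table: rows = test result, columns = true disease status."""
    tp: int  # test positive, disease present
    fp: int  # test positive, disease absent (false alarm)
    fn: int  # test negative, disease present (missed)
    tn: int  # test negative, disease absent

    def sensitivity(self) -> float:
        return self.tp / (self.tp + self.fn)

    def specificity(self) -> float:
        return self.tn / (self.tn + self.fp)

    def ppv(self) -> float:
        return self.tp / (self.tp + self.fp)

    def npv(self) -> float:
        return self.tn / (self.tn + self.fn)

    def lr_positive(self) -> float:
        return self.sensitivity() / (1 - self.specificity())

    def lr_negative(self) -> float:
        return (1 - self.sensitivity()) / self.specificity()

table = TwoByTwo(tp=80, fp=50, fn=20, tn=850)   # hypothetical counts
for name in ("sensitivity", "specificity", "ppv", "npv", "lr_positive", "lr_negative"):
    print(f"{name:12s} {getattr(table, name)():.3f}")
```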

Risk & Effect

Formula	When used
RR = [a/(a+b)] / [c/(c+d)]	Cohort studies
OR = ad / bc	Case-control studies
AR = a/(a+b) − c/(c+d)	Excess risk
ARF = (RR−1)/RR	% of risk due to exposure
NNT = 1/ARR	Number needed to treat
NNH = 1/AR (harm)	Number needed to harm
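The risk and effect formulas above can be sketched as one helper using the same a, b, c, d layout (a = exposed/treated with outcome, b = exposed/treated without, c = unexposed/control with, d = unexposed/control without). Exact fractions are used so that NNT and NNH, conventionally rounded up to a whole number, are not distorted by floating-point error. The function name and counts are hypothetical, and zero cells are not handled:

```python
from fractions import Fraction
import math

def effect_measures(a: int, b: int, c: int, d: int) -> dict[str, float]:
    """Effect measures from a 2x2 table (a, b = exposed row; c, d = unexposed row)."""
    risk_exposed = Fraction(a, a + b)      # a / (a + b)
    risk_unexposed = Fraction(c, c + d)    # c / (c + d)
    rr = risk_exposed / risk_unexposed
    diff = risk_exposed - risk_unexposed   # AR (risk difference)
    out = {"RR": float(rr), "OR": float(Fraction(a * d, b * c)), "AR": float(diff)}
    if diff > 0:
        # Exposure increases risk: attributable risk fraction and number needed to harm
        out["ARF"] = float((rr - 1) / rr)
        out["NNH"] = math.ceil(1 / diff)
    elif diff < 0:
        # Treatment reduces risk: ARR = -diff, NNT = 1 / ARR rounded up
        out["NNT"] = math.ceil(1 / -diff)
    return out

cohort = effect_measures(30, 70, 10, 90)  # hypothetical harmful exposure
trial = effect_measures(10, 90, 30, 70)   # hypothetical protective treatment
print({k: round(v, 3) for k, v in cohort.items()})
print({k: round(v, 3) for k, v in trial.items()})
```

In the hypothetical cohort, RR = 3.0 but OR ≈ 3.86: with a 30% outcome risk in the exposed, the rare-disease assumption fails and the OR overstates the RR.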

Statistics

Formula	Meaning
x̄ = Σx/n	Mean
s² = Σ(x−x̄)²/(n−1)	Sample variance
SD = √s²	Standard deviation
SEM = SD/√n	Standard error of mean
95% CI = x̄ ± 1.96×SEM (≈ 2×SEM)	Confidence interval for mean (large samples)
χ² = Σ(O−E)²/E	Chi-squared test

Decision Rules

Rule	Cut-off
α (Type I error)	0.05
β (Type II error)	0.20 (power = 80%)
p < 0.05	Statistically significant
95% CI excludes 1 (RR/OR)	Statistically significant
I² > 50%	Substantial heterogeneity
AUC > 0.8	Good diagnostic accuracy
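The I² threshold above comes from Higgins et al. (2003, listed under Key Papers), who define I² = (Q − df)/Q, set to 0 when negative, where Q is Cochran's heterogeneity statistic. A minimal sketch assuming inverse-variance (fixed-effect) weights, with effects and standard errors on the same scale (e.g. log risk ratios); the function name and the three studies are hypothetical:

```python
def i_squared(effects: list[float], standard_errors: list[float]) -> float:
    """Higgins' I² from study effect estimates and their standard errors."""
    weights = [1 / se ** 2 for se in standard_errors]                     # inverse-variance weights
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)  # fixed-effect estimate
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))      # Cochran's Q
    df = len(effects) - 1
    if q == 0:
        return 0.0
    return max(0.0, (q - df) / q)   # negative values are truncated to 0

# Three hypothetical studies reporting log risk ratios with equal precision
i2 = i_squared([0.0, 0.5, 1.0], [0.2, 0.2, 0.2])
print(f"I² = {i2:.0%}")
```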

Mnemonics for MRCOG Part 1

SnNOut: High Sensitivity → a Negative result rules Out
SpPIn: High Specificity → a Positive result rules In

OSA to remember bias types:
- O = Observer bias
- S = Selection bias
- A = Attrition bias

CRIB for confounder criteria:
- C = Causes outcome (independent risk factor)
- R = Related to exposure
- I = Intermediate? NO: must not lie on the causal pathway
- B = Before exposure? A confounder must precede the exposure

NNT = 1/ARR — think "Need N To prevent" = 1 over Absolute Risk Reduction

SEM < SD (whenever n > 1, since SEM = SD/√n): Standard Error is Smaller than Standard Deviation

Common MRCOG Part 1 Traps

#	Trap	Truth
1	"p-value = probability H₀ is true"	WRONG: p = P(data at least as extreme as observed | H₀ true)
2	"SEM = SD"	WRONG — SEM = SD/√n
3	"OR = RR always"	WRONG: OR only approximates RR when the outcome is rare (<10%)
4	"PPV is a fixed test property"	WRONG — PPV depends on prevalence
5	"Non-significant p = no effect"	WRONG — may be underpowered
6	"ITT is good for non-inferiority"	WRONG — ITT is anti-conservative for non-inferiority
7	"Screening always saves lives"	WRONG — lead time, length time, overdiagnosis
8	"Case-control studies can calculate incidence"	WRONG — only OR
9	"Correlation = causation"	ALWAYS WRONG
10	"Confounder is an intermediate variable"	WRONG — confounder is outside causal pathway
11	"95% CI range contains 95% of data"	WRONG — 95% CI is about the mean, not individual values
12	"Histogram bars should have gaps"	WRONG — histogram bars TOUCH (bar chart bars have gaps)
13	"χ² test can be used with any 2×2 table"	WRONG — expected <5 requires Fisher's exact
14	"Mean is always the best measure"	WRONG — use median for skewed data
15	"Blinding and allocation concealment are the same"	WRONG — allocation concealment is ALWAYS possible; blinding is not
16	"Cluster RCT doesn't need special analysis"	WRONG — must account for clustering (ICC, design effect)
17	"p < 0.01 means a more important result than p < 0.05"	WRONG — p depends on sample size, not just effect size
18	"Systematic review = meta-analysis"	WRONG — a meta-analysis is the statistical combination; not all SRs have one
19	"NNT is a fixed property of a treatment"	WRONG — NNT depends on baseline risk
20	"One-sided test is always more powerful"	WRONG — only if the true effect is in the hypothesised direction
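Trap 13 turns on expected, not observed, cell counts. A minimal sketch that computes χ² = Σ(O−E)²/E (no Yates continuity correction) with E = row total × column total / N, and reports the smallest expected count so a table needing Fisher's exact test can be flagged. The function name and counts are hypothetical; for 1 degree of freedom the 5% critical value is 3.84:

```python
def chi_squared(observed: list[list[int]]) -> tuple[float, float]:
    """Pearson χ² (no continuity correction) and the smallest expected count."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = 0.0
    min_expected = float("inf")
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n   # E = row total × column total / N
            chi2 += (o - e) ** 2 / e                # Σ (O − E)² / E
            min_expected = min(min_expected, e)
    return chi2, min_expected

# Hypothetical 2x2 tables: rows = exposed/unexposed, columns = outcome yes/no
stat, min_e = chi_squared([[30, 70], [10, 90]])
print(f"chi2 = {stat:.2f} on 1 df (critical value 3.84), smallest expected = {min_e:.0f}")

stat_small, min_e_small = chi_squared([[3, 7], [1, 9]])
if min_e_small < 5:
    print("An expected count is below 5: use Fisher's exact test instead")
```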

References & Further Reading

Essential Textbooks:
- Kirkwood BR & Sterne JAC. Essential Medical Statistics. 2nd ed. Blackwell Science, 2003.
- Altman DG. Practical Statistics for Medical Research. Chapman & Hall, 1991.
- Petrie A & Sabin C. Medical Statistics at a Glance. 4th ed. Wiley-Blackwell, 2020.
- Bland M. An Introduction to Medical Statistics. 4th ed. OUP, 2015.
- Fletcher RH, Fletcher SW & Fletcher GS. Clinical Epidemiology: The Essentials. 5th ed. Wolters Kluwer, 2014.
- Straus SE et al. Evidence-Based Medicine: How to Practice and Teach It. 5th ed. Elsevier, 2018.

Key UK Documents:
- RCOG. Green-top Guidelines Levels of Evidence and Grades of Recommendation. (Introductory sections of any Green-top Guideline)
- MBRRACE-UK. Saving Lives, Improving Mothers' Care. (Latest triennial report)
- NICE. The Guidelines Manual (process and methods).
- NHS FASP. Fetal anomaly screening programme standards.
- Wilson JMG & Jungner G. Principles and practice of screening for disease. WHO Public Health Papers No. 34. Geneva: WHO, 1968.

Key Papers:
- Guyatt GH et al. GRADE: an emerging consensus on rating quality of evidence and strength of recommendations. BMJ 2008;336:924–6.
- Schulz KF et al. CONSORT 2010 Statement: updated guidelines for reporting parallel group randomised trials. BMJ 2010;340:c332.
- von Elm E et al. The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement. Lancet 2007;370:1453–7.
- Moher D et al. Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. BMJ 2009;339:b2535.
- Altman DG & Bland JM. Statistics notes: diagnostic tests 1: sensitivity and specificity. BMJ 1994;308:1552.
- Altman DG & Bland JM. Statistics notes: diagnostic tests 2: predictive values. BMJ 1994;309:102.
- Deeks JJ. Systematic reviews in health care: systematic reviews of evaluations of diagnostic and screening tests. BMJ 2001;323:157–62.
- Higgins JPT et al. Measuring inconsistency in meta-analyses. BMJ 2003;327:557–60.
- Sterne JAC & Egger M. Funnel plots for detecting bias in meta-analysis. J Clin Epidemiol 2001;54:1046–55.

Online Resources:
- OpenEpi (www.openepi.com): free online calculators for epidemiological statistics
- Cochrane Handbook for Systematic Reviews of Interventions (training.cochrane.org/handbook)
- MedCalc Statistical Software (www.medcalc.org): ROC curve analysis, diagnostic test evaluation
- NICE guidance (www.nice.org.uk)
- RCOG Green-top Guidelines (www.rcog.org.uk/guidelines)
- MBRRACE-UK reports (www.npeu.ox.ac.uk/mbrrace-uk)
- ONS birth statistics (www.ons.gov.uk)
- StatsDirect statistical software (www.statsdirect.com)

Last updated: May 2026
Target exam: MRCOG Part 1
Word count: ~18,500+
Author note: This document is intended as a comprehensive revision resource covering all examinable topics in epidemiology, statistics, screening, evidence-based medicine, and O&G-specific applications. Candidates should supplement it with current RCOG Green-top Guidelines, recent NICE guidance, and the latest MBRRACE-UK reports for the most up-to-date statistical data (rates, mortality figures, screening programme updates). Particular attention should be paid to the formulae and interpretations flagged as "MRCOG Key Point" and "Common MRCOG Part 1 Traps". These represent the most frequently tested and most commonly confused concepts in the examination.

Subject	Before	After	Difference
1	5.9	5.5	0.4
2	6.2	5.8	0.4
3	5.6	5.3	0.3
4	5.8	5.4	0.4
5	6.0	5.7	0.3
6	5.7	5.6	0.1
7	6.1	5.8	0.3
8	5.9	5.5	0.4
9	5.8	5.6	0.2
10	6.0	5.9	0.1
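The table above carries no caption or units in the source. Assuming the two columns are paired measurements on the same 10 subjects, the usual analysis is a paired t-test on the Difference column: t = mean difference / (SD of differences / √n), with n − 1 degrees of freedom. A minimal sketch (the function name is illustrative):

```python
import math

# Values copied from the table above (units not stated in the source)
before = [5.9, 6.2, 5.6, 5.8, 6.0, 5.7, 6.1, 5.9, 5.8, 6.0]
after = [5.5, 5.8, 5.3, 5.4, 5.7, 5.6, 5.8, 5.5, 5.6, 5.9]

def paired_t(x: list[float], y: list[float]) -> tuple[float, float, float, int]:
    """Paired t statistic on within-subject differences d = x - y."""
    d = [round(xi - yi, 10) for xi, yi in zip(x, y)]   # strip float noise (e.g. 0.40000000000000036)
    n = len(d)
    mean_d = sum(d) / n
    sd_d = math.sqrt(sum((di - mean_d) ** 2 for di in d) / (n - 1))
    sem_d = sd_d / math.sqrt(n)
    return mean_d, sd_d, mean_d / sem_d, n - 1

mean_d, sd_d, t_stat, df = paired_t(before, after)
print(f"mean difference {mean_d:.2f}, SD {sd_d:.3f}, t = {t_stat:.2f} on {df} df")
# Two-sided 5% critical value of t with 9 df is 2.262
```

Under these assumptions the mean difference is 0.29 with t ≈ 7.7 on 9 df, well above 2.262, so p < 0.05. If the differences were clearly non-normal, the Wilcoxon signed-rank test would be the non-parametric alternative.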
