Annotation of the discussion about AFT loss function:
https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550563
A general Understanding for AFT Loss function
- My notebook using the AFT loss function is "[CV0.665 LB0.666]cat+xgb with AFT loss function", based on @cdeotte's code. Thanks!
- My annotation on the kernel:
- The Accelerated Failure Time (AFT) model is a parametric survival analysis model that describes how covariates influence the survival time of an event.
- Unlike Proportional Hazards (PH) models such as the Cox PH model, which assume that covariates scale the hazard function proportionally, AFT models assume that covariates accelerate or decelerate the course of the survival process by a multiplicative factor on time.
- Detailed explanation about Proportional Hazards model vs. Accelerated Failure Time model
- Proportional Hazards (PH) Model:
- # Hazard-based approach
# Example: Comparing two patients
Patient A's hazard = baseline hazard × 2.0 # twice as risky as baseline
Patient B's hazard = baseline hazard × 0.5 # half as risky as baseline
# Feature: Hazard changes proportionally
- Accelerated Failure Time (AFT) Model:
- # Survival time-based approach
# Example: Comparing two patients
Patient A's survival time = baseline survival time × 0.5 # Progresses 2x faster than baseline
Patient B's survival time = baseline survival time × 2.0 # Progresses 2x slower than baseline
# Feature: Time scale is accelerated/decelerated
- Example:
- Situation: Effect of a specific treatment on disease progression
PH Model Interpretation:
- "Patients receiving this treatment have half the risk of death"
AFT Model Interpretation:
- "Disease progression is 2x slower in patients receiving this treatment"
- i.e., it takes twice as long to reach the same stage
- Key Differences:
- PH Model: Focuses on hazard (risk)
- AFT Model: Focuses on actual survival time
- PH models "how risky"
- AFT models "how fast/slow it progresses"
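The contrast between the two models can be sketched numerically. The snippet below is only an illustration under stated assumptions: the log-normal baseline with a median of 100 days and the function names `baseline_survival`, `ph_survival` and `aft_survival` are hypothetical choices, not taken from the original discussion. It uses two standard identities: under PH the survival curve is S(t) = S0(t) ** hazard_ratio, and under AFT it is S(t) = S0(θ × t).

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, used for the baseline CDF

def baseline_survival(t, mu=math.log(100.0), sigma=0.5):
    """S0(t) for an assumed log-normal baseline whose median is exp(mu) days."""
    return 1.0 - _STD_NORMAL.cdf((math.log(t) - mu) / sigma)

def ph_survival(t, hazard_ratio):
    """PH: the hazard is multiplied by hazard_ratio, so S(t) = S0(t) ** hazard_ratio."""
    return baseline_survival(t) ** hazard_ratio

def aft_survival(t, theta):
    """AFT: the clock runs theta times faster, so S(t) = S0(theta * t)."""
    return baseline_survival(theta * t)

# Compare the baseline with a "half the hazard" patient (PH)
# and a "progresses twice as slowly" patient (AFT, theta = 0.5).
for day in (50, 100, 200):
    print(day,
          round(baseline_survival(day), 3),
          round(ph_survival(day, 0.5), 3),
          round(aft_survival(day, 0.5), 3))
```

With θ = 0.5 the AFT curve reaches 50% survival at day 200 instead of day 100, which is the "it takes twice as long to reach the same stage" reading. The PH curve with hazard ratio 0.5 changes the risk at every time point but does not correspond to a simple rescaling of the time axis for this baseline.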
- Detailed Explanation:
- Basic Model Equation:
- log(T) = Xβ + ε
where:
T = survival time
X = feature variables (age, gender, disease status, etc.)
β = coefficients for each feature (impact)
ε = error term (random variable following probability distribution)
- Acceleration Factor:
- θ = exp(-Xβ)
# Since log(T) = Xβ + ε, we have T = exp(Xβ) × exp(ε) = T₀ / θ, where T₀ = exp(ε) is the baseline survival time
# Interpretation:
θ > 1: survival time decreases (disease progresses faster)
θ < 1: survival time increases (disease progresses slower)
- Example:
- # Example: Modeling treatment effect
X = [treatment_dose]
β = 0.7 # assumed coefficient (positive, so treatment lengthens survival)
# When treatment dose is 1 unit
θ = exp(-1 × 0.7) = exp(-0.7) ≈ 0.50
# Interpretation: survival time is multiplied by 1/θ = exp(0.7) ≈ 2.01, so 1 unit of treatment roughly doubles survival time
# When treatment dose is 2 units
θ = exp(-2 × 0.7) = exp(-1.4) ≈ 0.25
# Interpretation: survival time is multiplied by 1/θ = exp(1.4) ≈ 4.06, so 2 units of treatment roughly quadruple survival time
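The arithmetic behind the acceleration factor can be checked with a few lines of Python. This is a minimal sketch that assumes the definitions given above, log(T) = Xβ + ε and θ = exp(-Xβ), with a single feature. The helper names `acceleration_factor` and `time_multiplier` are hypothetical. Under these definitions a positive β gives θ < 1 (longer survival) and a negative β gives θ > 1 (shorter survival).

```python
import math

def acceleration_factor(x, beta):
    """theta = exp(-x * beta) for a single feature x with coefficient beta."""
    return math.exp(-x * beta)

def time_multiplier(x, beta):
    """Factor applied to the baseline survival time: exp(x * beta) = 1 / theta."""
    return math.exp(x * beta)

for dose in (1, 2):
    theta = acceleration_factor(dose, beta=0.7)
    mult = time_multiplier(dose, beta=0.7)
    print(f"dose={dose}: theta={theta:.2f}, survival time x{mult:.2f}")
# dose=1: theta=0.50, survival time x2.01
# dose=2: theta=0.25, survival time x4.06
```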
- Key Features:
- Reasons for modeling log(T):
- Survival time is always positive
- Log transformation better satisfies normality assumption
- Interpretation becomes easier with multiplicative effects
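- Loss function (negative log-likelihood, combining the two cases described below):
- Loss = -Σᵢ [ δᵢ × log f(tᵢ; μᵢ, σ) + (1 - δᵢ) × log S(tᵢ; μᵢ, σ) ]
- Component Explanations: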
- tᵢ: observed survival time
δᵢ: event occurrence indicator (1=occurred, 0=censored)
μᵢ = Xᵢβ: predicted log-survival time
σ: scale parameter controlling variance
f(t): probability density function (PDF)
S(t): survival function
- How the loss function works:
- When event occurs (δᵢ = 1):
- Loss = -log f(tᵢ; μᵢ, σ)
# Tries to maximize PDF
# Learns to increase probability density at actual occurrence time
- PDF
- PDF represents the probability density of an event occurring at a specific time point
- Example: When a patient dies on day 100
- pdf(t) = density of probability of death at a specific time t
- High pdf value at t=100 = death is likely to occur in a short interval around day 100
- When censored (δᵢ = 0):
- Loss = -log S(tᵢ; μᵢ, σ)
# Tries to maximize survival function
# Learns to increase probability of survival beyond observed time
- Survival function S
- S(t) = P(T > t) = probability of survival beyond time point t
- Characteristics:
- Non-increasing function of time (survival probability never goes back up)
- Initial value S(0) = 1 (everyone is alive at time 0)
- As time approaches infinity, S(t) approaches 0
- Example:
- # Patient A: death at day 100 (δ = 1)
Loss_A = -log f(100; μ_A, σ)
# Learns to increase probability density of death at day 100
- In this case, we know the exact time of death
- So the model learns to predict a high probability of death around day 100
- Thus, "to make accurate predictions at the actual occurrence time", we "learn to increase probability density at actual occurrence time."
- # Patient B: censored at day 80 (δ = 0)
Loss_B = -log S(80; μ_B, σ)
# Learns to increase probability of survival beyond day 80
- In this case, we don't know when (or whether) death occurred after day 80
- We only know for certain they survived until day 80
- Thus, it is reasonable to increase the probability of survival beyond day 80
- Equivalently, the model decreases the predicted probability of death before day 80
- Key points:
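- This loss function properly handles censored data: censored patients contribute through S(t) instead of being dropped or treated as deaths
- Uses the PDF for observed events and the survival function for censored cases, so every row contributes the information it actually carries
- Choice of ε (random term) distribution determines the shape of the baseline survival time T₀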
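The two cases above (−log f for observed events, −log S for censored rows) can be sketched as a per-row loss. This is a minimal illustration that assumes a normal error term on log(T), written with the standard library only. It is not the code from the referenced notebook, and `aft_normal_loss` and `total_loss` are hypothetical names.

```python
import math

def aft_normal_loss(t, delta, mu, sigma):
    """Negative log-likelihood of one row, assuming log(T) ~ N(mu, sigma^2).

    t: observed time (> 0); delta: 1 if the event occurred, 0 if censored.
    """
    z = (math.log(t) - mu) / sigma
    if delta == 1:
        # -log f(t): normal log-density of z plus the 1 / (t * sigma) change-of-variable term
        log_f = -0.5 * z * z - 0.5 * math.log(2.0 * math.pi) - math.log(t * sigma)
        return -log_f
    # -log S(t), with S(t) = 1 - Phi(z) computed via erfc.
    # Note: for extremely large z this underflows to 0; a production version would guard it.
    return -math.log(0.5 * math.erfc(z / math.sqrt(2.0)))

def total_loss(times, events, mus, sigma):
    """Sum of per-row losses over a dataset."""
    return sum(aft_normal_loss(t, d, m, sigma) for t, d, m in zip(times, events, mus))

# Patient A died at day 100: the loss is smallest when mu is close to log(100).
print(aft_normal_loss(100.0, 1, math.log(100.0), 1.0))
print(aft_normal_loss(100.0, 1, math.log(50.0), 1.0))

# Patient B was censored at day 80: the loss keeps shrinking as mu moves beyond log(80).
print(aft_normal_loss(80.0, 0, math.log(80.0), 1.0))
print(aft_normal_loss(80.0, 0, math.log(200.0), 1.0))
```

The asymmetry is the point of the design: an observed death pulls the prediction toward the observed time, while a censored row only pushes the prediction later and never penalizes predicting a long survival.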
- Basic Assumption:
- ε ~ N(0, σ²) # Error term follows normal distribution
- This means:
- log(survival time) follows normal distribution
- Why?
- When Y = a + bX
- If X follows normal distribution N(μ, σ²)
- Then Y follows normal distribution N(a + bμ, b²σ²)
- log(T) = Xβ + ε
# Since ε follows N(0, σ²)
# log(T) follows N(Xβ, σ²)
Because:
- Xβ is constant term (mean shift)
- Coefficient of ε is 1 (variance remains same)
- Actual survival time follows log-normal distribution
- Probability Density Function (PDF):
- f(t; μ, σ) = 1 / (t × σ × √(2π)) × exp(-(log(t) - μ)² / (2σ²))
Components:
- t: observed time
- μ: predicted log-survival time (Xβ)
- σ: parameter controlling variance
- Survival Function:
- S(t; μ, σ) = 1 - Φ((log(t)-μ)/σ)
where:
- Φ: cumulative distribution function (CDF) of standard normal distribution
- Represents probability of survival beyond time point t
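As a sanity check on the two formulas, the PDF should equal the negative slope of the survival function, f(t) = -dS/dt. The sketch below verifies this numerically with a central finite difference. The parameter values (median 100, σ = 0.5) and the function names `lognormal_pdf` and `lognormal_survival` are assumptions made for illustration.

```python
import math

def lognormal_pdf(t, mu, sigma):
    """f(t; mu, sigma) = exp(-(log t - mu)^2 / (2 sigma^2)) / (t * sigma * sqrt(2 pi))."""
    z = (math.log(t) - mu) / sigma
    return math.exp(-0.5 * z * z) / (t * sigma * math.sqrt(2.0 * math.pi))

def lognormal_survival(t, mu, sigma):
    """S(t; mu, sigma) = 1 - Phi((log t - mu) / sigma), computed via erfc."""
    z = (math.log(t) - mu) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2.0))

mu, sigma = math.log(100.0), 0.5  # assumed values: median survival of 100 days
t, h = 80.0, 1e-4

# Central finite difference of S around t approximates dS/dt.
slope = (lognormal_survival(t + h, mu, sigma) - lognormal_survival(t - h, mu, sigma)) / (2.0 * h)

print("pdf at t:    ", lognormal_pdf(t, mu, sigma))
print("-dS/dt at t: ", -slope)
```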
- Use Cases:
- # Suitable cases:
- log(survival time) is roughly symmetrically distributed
- Constant variability (the same σ for all samples)
Example: Component lifetime in manufacturing
- # Unsuitable cases:
- log(survival time) with very long tails
- log(survival time) that is highly asymmetric
- Advantages:
- Intuitive interpretation
- Relatively simple calculations
- Good fit when log(survival time) is roughly symmetric
- Real Example:
- # Predicting medical device lifetime
survival_time = exp(Xβ + ε)
ε ~ N(0, σ²)
# This means lifetime follows log-normal distribution
# i.e., log(lifetime) follows normal distribution
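The device-lifetime example can be simulated to see the claim directly. This is a sketch under assumed values (a linear predictor of log(1000) hours, σ = 0.8, and a fixed random seed), none of which come from the original post. It draws ε from a normal distribution, exponentiates, and compares summary statistics on the log scale and on the original scale.

```python
import math
import random
import statistics

random.seed(0)              # fixed seed so the run is reproducible
x_beta = math.log(1000.0)   # assumed linear predictor: median lifetime of 1000 hours
sigma = 0.8                 # assumed scale of the error term

# survival_time = exp(X beta + eps), with eps ~ N(0, sigma^2)
lifetimes = [math.exp(x_beta + random.gauss(0.0, sigma)) for _ in range(20000)]
log_lifetimes = [math.log(t) for t in lifetimes]

print("mean of log(lifetime): ", statistics.fmean(log_lifetimes))
print("stdev of log(lifetime):", statistics.stdev(log_lifetimes))
print("median lifetime:       ", statistics.median(lifetimes))
print("mean lifetime:         ", statistics.fmean(lifetimes))
```

On the log scale the sample mean and standard deviation should land close to Xβ and σ. On the original scale the median should sit near exp(Xβ) while the mean is noticeably larger, which is the right-skew of a log-normal lifetime.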
- Basic Assumption:
- T = exp(Xβ + ε) ~ Log-Normal(Xβ, σ²)
# Equivalently, the multiplicative error term exp(ε) follows log-normal distribution
# This means survival time T directly follows log-normal distribution
- Log-Normal Distribution Characteristics:
- # Properties:
- Only takes positive values
- Has a heavy right tail
- Asymmetric distribution
- Difference between AFT:Normal and AFT:Log:
- AFT:Normal
- log(survival time) follows normal distribution
- log(survival time) is symmetrically distributed
- Example: Manufacturing component lifetime
- AFT:Log
- Survival time directly follows log-normal distribution
- Survival time is asymmetrically distributed (long tail)
- Example: Cancer patient survival period
- Note: when ε ~ N(0, σ²), these are two views of the same model, one on the log scale and one on the original time scale
- Use Cases:
- # Suitable cases:
- When some patients survive much longer than others
- Biological processes or reliability data
- Cancer patient survival analysis
# Reasons:
- Most show similar survival periods but
- Some show very long survival periods
- Can model such long-tail distributions well
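The "most patients are similar, a few survive much longer" pattern can be quantified with the closed-form summaries of a log-normal distribution: the median is exp(μ), the mean is exp(μ + σ²/2), and the p-th quantile is exp(μ + σ × z_p). The sketch below is illustrative only. The 12-month median and the σ values are assumed, and `lognormal_summary` is a hypothetical helper.

```python
import math
from statistics import NormalDist

def lognormal_summary(mu, sigma):
    """Median, mean and 95th percentile of T when log(T) ~ N(mu, sigma^2)."""
    median = math.exp(mu)
    mean = math.exp(mu + 0.5 * sigma ** 2)
    p95 = math.exp(mu + sigma * NormalDist().inv_cdf(0.95))
    return median, mean, p95

# Same median (12 months, assumed), different spread on the log scale.
for sigma in (0.25, 1.0):
    median, mean, p95 = lognormal_summary(math.log(12.0), sigma)
    print(f"sigma={sigma}: median={median:.1f}, mean={mean:.1f}, p95={p95:.1f} months")
```

A larger σ leaves the median unchanged but pulls the mean and the upper percentiles far to the right, which is how a small group of long-term survivors shows up in this model.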
In simple terms
In simple terms, AFT assumes that different factors (i.e., input variables or features of the model) affect the rate at which events occur by "stretching" or "compressing" the timeline.
It's like adjusting the playback speed while watching a video:
- Double-speed playback (fast forward): time speeds up, so events happen sooner.
- Slow playback (slow motion): time slows down, so events happen later.
Imagine you're studying the survival time of two cancer patients:
- Patient A receives standard treatment.
- Patient B receives a new experimental treatment.
Case 1: θ = 0.5 (Acceleration Factor < 1)
This means the timeline is stretched by 2x for Patient B compared to Patient A.
- If Patient A survives for 1 year, Patient B is expected to survive for 2 years under the new treatment.
Case 2: θ = 2 (Acceleration Factor > 1)
This means the timeline is compressed for Patient B, reducing their survival time by half.
- If Patient A survives for 1 year, Patient B is expected to survive for 6 months.
No pressure, no diamonds.
- Thomas Carlyle -