동선생

[경제] 4. 펀드와 그 종류

dongsunseng — Mon, 19 May 2025 22:34:42 +0900

펀드란?

펀드는 다수의 투자자로부터 자금을 모아 전문 운용사가 주식, 채권, 부동산 등 다양한 자산에 투자하여 발생한 수익을 투자자에게 배분하는 집합투자 상품 입니다.

펀드를 통해 소액 투자자도 전문가의 운용 능력과 분산 투자의 이점을 누릴 수 있게 됩니다.

펀드의 종류

경제 뉴스나 인터넷을 보면 사모 펀드, 주식형 펀드, 헤지 펀드 등등 여러 종류의 펀드에 대해 들어본 적이 있을겁니다. 펀드의 종류에 대해 자세히 알아보겠습니다.

설정 형태에 따른 분류: 공모 펀드 vs. 사모 펀드
- 공모 펀드:
  - 불특정 다수의 투자자를 대상으로 공개적으로 모집
  - 최소 투자 금액이 낮고 접근성이 높음
  - 엄격한 규제와 공시 의무가 있음
  - 예시) 대부분의 뮤추얼 펀드, ETF
- 사모 펀드:
  - 소수의 특정 투자자를 대상으로 비공개적으로 모집
  - 일반적으로 고액 자산가나 기관 투자자 대상
  - 상대적으로 규제가 적고 투자 전략의 자유도가 높음
  - 예시) 헤지펀드, 프라이빗 에쿼티, 벤처캐피털 펀드
    - 각각에 대한 자세한 설명은 아래에 있음
투자 대상에 따른 분류:
1. 주식형 펀드:
  - 주로 주식에 투자
  - 높은 수익 잠재력과 높은 위험
  - 하위 유형: 대형주, 중소형주, 배당주, 섹터(IT, 헬스 케어 등), 국가/지역별
    - 배당주: 주주들에게 정기적으로 현금 배당을 지급하는 기업의 주식
    - 섹터:
      - 섹터는 경제나 주식 시장에서 비슷한 사업 활동이나 제품/서비스를 제공하는 기업들을 묶어 분류한 산업 그룹을 의미합니다.
        
        정보 기술, 금융, 헬스케어, 산업재, 소비자 필수소비재, 임의 소비재, 통신 서비스, 에너지, 유틸리티, 소재, 부동산, ...
      - 섹터 투자란 다양한 섹터에 분산 투자함으로써 위험을 감소시키는 것을 말함
      - 특정 섹터에 집중된 ETF나 펀드에 투자할 수 있고 특정 섹터 내 우량 기업을 선별하여 투자하거나 경기 사이클에 따라 섹터의 비중을 조절하는 등의 전략을 꾀할 수 있습니다.
2. 채권형 펀드:
  - 주로 국채, 회사채, 특수채 등 채권에 투자
    - 채권은 정부, 기업 또는 기타 단체가 자금을 빌리기 위해 발행하는 부채 증서입니다.
    - 채권 소유자는 채권 발행자에게 돈을 빌려주는 것이고, 발행자는 일정 기간동안 정해진 이자를 지급하며 만기일에 원금을 상환하겠다고 약속합니다.
    - 채권은 주식과 달리 소유권이 아닌 채무 관계를 나타내며, 일반적으로 주식보다 위험이 낮은 투자 수단으로 간주됩니다.
    - 주식보다 채권이 위험이 낮은 투자수단인 이유:
      - 상환 우선순위:
        
        기업이 파산할 경우, 채권 투자자는 주주보다 먼저 상환받을 권리가 있습니다.
        
        회사 자산을 청산할 때 채권자가 주주보다 우선순위에 있기 때문입니다.
      - 확정 수익:
        
        채권은 일반적으로 고정된 이자율(쿠폰)을 제공하므로 투자자는 얼마의 수익을 얻을 수 있는지 미리 알 수 있습니다.
        
        반면 주식의 배당금은 회사 성과에 따라 변동되거나 아예 지급되지 않을 수도 있습니다.
      - 원금 보장:
        
        채권은 만기일에 원금을 상환받는 구조로, 발행 기관이 파산하지 않는 한 투자한 원금을 돌려받을 수 있습니다.
        
        주식은 가치가 폭락할 가능성이 상대적으로 높습니다.
      - 가격 변동성:
        
        일반적으로 채권 가격은 주식 가격보다 변동성이 작습니다.
        
        특히 국채와 같은 안전한 채권은 시장 상황에 따른 가격변동이 상대적으로 적습니다.
      - 수익과 위험의 관계:
        
        금융 이론에서 위험이 높을수록 잠재적 수익도 높아지는 경향이 있습니다.
        
        채권은 주식보다 잠재적 수익이 낮은 대신 위험도 낮은 경향이 있습니다.
      - 다만, 모든 채권이 모든 주식보다 안전한 것은 아닙니다.
      - 예를 들어, 신용등급이 낮은 회사의 정크본드(투기등급 채권)는 안정적인 대기업의 주식보다 더 위험할 수 있습니다.
  - 안정적인 이자수익 추구, 상대적으로 낮은 위험
  - 하위 유형: 국공채, 회사채, 하이일드, 신흥국 채권 등
    - 국공채(Government Bonds): 발행주체가 중앙정부, 지방정부, 공공기관
    - 회사채(Corporate Bonds): 발행주체가 일반 기업
    - 하이일드 채권(High Yield Bonds): 발행주체가 신용등급이 낮은 기업
      - 정크본드(Junk Bonds)라고도 불림
      - 높은 이자율 제공(투자등급 채권 대비)
      - 채무불이행(디폴트) 위험이 상대적으로 높음
      - 예시) 신생 기업 채권, 재무상태가 취약한 기업의 채권
      - 따라서, 높은 수익을 추구할 위험을 감수 가능한 투자자들이 투자함
    - 신흥국 채권(Emerging Market Bonds): 발행주체가 신흥국 정부 또는 기업
      - 선진국 채권보다 높은 이자율 제공
      - 정치적, 경제적 불안정성에 따른 추가 위험 존재
      - 통화 위험(환율 변동)에 노출
      - 예시) 브라질, 인도, 남아프리카 등의 국채 또는 기업 채권
3. 혼합형 펀드:
  - 주식과 채권에 동시 투자
  - 균형 잡힌 위험-수익 프로필
  - 하위 유형: 주식 비중에 따라 공격형/중립형/안정형으로 구분
4. 부동산 펀드:
  - 상업용/주거용 부동산, 리츠(REITs) 등에 투자
    - 리츠(REITs):
      - Real Estate Investment Trusts
      - 다수의 투자자로부터 자금을 모아 부동산에 투자하고, 그 수익을 투자자들에게 배당하는 부동산 투자회사 또는 신탁을 말합니다.
      - 구조: 회사 형태로 설립되어 주식처럼 거래소에 상장 가능
      - 투자 대상: 오피스 빌딩, 쇼핑몰, 호텔, 물류센터, 아파트 등의 상업용/주거용 부동산
      - 수익 구조:
        
        임대료 수입(주 수익원)
        
        부당산 매각 시 자본 이득
      - 배당 의무: 대부분의 국가에서 수익의 90% 이상을 투자자에게 배당하도록 법적으로 규정
      - 접근성: 적은 금액으로도 대규모 부동산에 투자 가능
  - 임대 수익과 자산가치 상승으로 수익 추구
  - 실물 자산에 대한 익스포저(exposure) 제공
    - 투자자들이 실제 물리적 자산(부동산, 인프라 등)에 투자할 기회를 얻게 된다는 의미입니다.
5. 원자재 펀드:
  - 금, 원유, 농산물 등 원자재에 투자
  - 인플레이션 헤지용이나 포트폴리오 다각화에 활용
  - 직접 원자재 보유 또는 선물 계약 등 파생상품 활용
6. 대체투자 펀드:
  - 전통적인 자산군 외의 투자 대상
  - 인프라, 사모투자, 헤지펀드, 벤처캐피털 등
    - 인프라: 경제와 사회의 기반이 되는 물리적 구조물과 시설에 대한 투자를 의미합니다.
    - 예시) 교통 인프라(도로, 고속도로, 터널, ...), 에너지 인프라(발전소, 가스 파이프라인, ...), 공공 유틸리티(수도 공급, 폐기물 처리 시설, 통신망, ...), 사회 인프라(병원, 학교, ...)
  - 일반적으로 높은 최소투자금액과 낮은 유동성
7. 머니마켓 펀드:
  - 단기 금융상품(CP, CD, 단기 국채, 콜론, RP 등)에 투자
    - CP(기업어음, Commercial Paper)
      - 우량 기업이 단기 자금 조달을 위해 발행하는 무담보 약속어음
      - 만기: 일반적으로 1-270일(보통 30-90일)
      - 할인 발행 방식(액면가보다 낮은 가격에 발행, 만기에 액면가 상환)
      - 예: A기업이 3개월 후 10억원을 상환하겠다는 약속으로 9억8천만원에 발행
    - CD(양도성 예금증서, Certificate of Deposit)
      - 은행이 발행하는 정기예금 증서로 양도가 가능한 금융상품
      - 만기: 주로 91일, 180일, 270일, 1년 등
      - 은행의 신용도를 기반으로 하는 안전한 상품
      - 예: 시중은행이 발행한 3개월 만기, 연 3% 이자율의 CD
    - 단기 국채(Treasury Bills)
      - 정부가 발행하는 만기 1년 이내의 채무증권
      - 가장 안전한 투자 수단으로 간주됨
      - 할인 발행 방식 사용
      - 예: 정부가 발행한 91일 만기 국고채권
    - 콜론(Call Loan)
      - 금융기관 간 초단기(1일-1주일) 대출
      - 금융시장의 일시적 자금 수급 조절 역할
      - 예: A은행이 B은행에 1일 동안 대출해주는 자금
    - RP(환매조건부채권, Repurchase Agreement)
      - 채권을 일정 기간 후 다시 매수하기로 약정하고 파는 계약
      - 만기: 일반적으로 1일-90일
      - 담보부 대출의 성격
      - 예: 증권사가 채권을 담보로 14일 동안 자금을 빌리는 계약
  - 높은 유동성과 안전성, 낮은 수익률
  - 일시적 자금 대기용으로 활용
운용 방식에 따른 분류: 액티브 펀드 vs. 패시브 펀드
- 액티브 펀드:
  - 펀드 매니저가 적극적으로 종목 선정 및 자산 배분
  - 시장 평균 수익률 초과 달성(알파 창출) 목표
  - 상대적으로 높은 운용 보수
- 패시브 펀드:
  - 특정 지수(인덱스)를 추종하는 투자 전략
  - 시장 평균 수익률 달성 목표
  - 낮은 운용 보수
  - 예) 인덱스 펀드, 대부분의 ETF
투자 지역에 따른 분류
- 국내형: 자국 시장에만 투자
- 해외형: 특정 해외 국가나 지역에 투자
- 글로벌형: 전세계 시장에 분산 투자
- 신흥시장형: 신흥국 시장에 집중 투자
특수 목적 펀드
1. ETF(상장지수펀드)
  - 지수를 추종하며 거래소에 상장되어 주식처럼 거래
  - 높은 유동성과 투명성, 낮은 비용
2. 인덱스 펀드
  - 특정 지수의 성과를 복제하는 패시브 운용 펀드
  - ETF와 유사하나 거래소에 상장되진 않음
3. 헤지펀드
  - 절대수익 추구, 다양한 투자 전략 구사
  - 레버리지(차입)활용, 롱숏 전략 등 활용
  - 높은 최소투자금액과 성과보수 구조
4. 프라이빗 에쿼티(PE)
  - 비상장 기업에 투자하거나 상장기업을 비상장화
    - 비상장 기업 투자:
      - PE 펀드는 성장 가능성이 높은 비상장 기업에 투자하여 지분을 취득합니다
      - 주로 성숙 단계의 중견기업이나 고성장 기업을 대상으로 합니다
      - 벤처캐피털이 초기 스타트업에 집중하는 것과는 달리, PE는 이미 사업 모델이 검증된 기업을 선호합니다
      - 지분 인수 비율은 소수지분부터 경영권 확보가 가능한 다수지분까지 다양합니다
    - 상장기업 비상장화:
      - 공개 시장에서 주식을 매입해 기업을 완전히 인수한 후 상장폐지시키는 과정입니다
      - 이를 '공개매수(Takeover)' 또는 'LBO(Leveraged Buyout, 차입매수)'라고 합니다
      - 보통 기업 가치가 시장에서 저평가되었다고 판단될 때 진행됩니다
      - 비상장화 후에는 단기적 실적 압박 없이 장기적 구조조정과 혁신에 집중할 수 있습니다
  - 장기적 관점에서 기업 가치 개선 후 매각
  - 낮은 유동성, 장기 투자 기간
5. 벤처캐피털 펀드
  - 초기 단계 스타트업에 투자
  - 높은 위험과 높은 수익 잠재력
  - 포트폴리요 접근 방식(여러 기업에 분산투자)
6. 목표일자 펀드
  - 특정 목표 날짜(주로 은퇴)에 맞춰 자산 배분 자동 조정
  - 시간이 지날수록 보수적인 자산 배분으로 변화
7. ESG 펀드
  - 환경(E), 사회(S), 지배구조(G) 기준을 고려한 투자
  - 재무적 수익과 함께 사회적 영향 추구
8. 퇴직연금 펀드
  - 은퇴 대비 장기 저축 목적
  - 세제 혜택 제공
  - 보수적인 자산 배분 경향

There's a tremendous bias against taking risks. Everyone is trying to optimize their ass-covering.
-Elon Musk-

[경제] 3. 세력 & 마켓 메이커

dongsunseng — Mon, 19 May 2025 14:01:10 +0900

투자 시장에 있다 보면 '세력'과 '마켓 메이커'라는 단어를 쉽게 접할 수 있습니다.

두 용어의 개념은 혼재되기 쉬우며 그 차이에 대한 부분이 불분명하게 느껴질 수 있습니다.

세력

금융 시장에서 '세력'이라는 단어는 시장에 큰 영향력을 행사할 수 있는 자본력과 정보력을 갖춘 개인이나 기관(단체)를 지칭합니다.

세력의 주요 특징:

대규모 자본: 시장 가격에 영향을 줄 정도로 충분한 자금력을 보유하고 있습니다.
정보 우위: 일반 투자자보다 더 많은 정보나 내부 정보에 접근할 가능성이 있습니다.
가격 형성 능력: 대량 매수나 매도를 통해 주가나 자산 가격에 일시적인 영향을 미칠 수 있습니다.

세력은 기관 투자자(투자 은행, 헤지 펀드, 연기금 등), 대형 개인 투자자나 자산가 그룹, 기업 내부자나 대주주, 전문 투자 그룹 등을 모두 포함하는 개념입니다.

세력은 위에서 언급했듯이 시장 가격에 의도적으로 영향을 줄 수 있는 자본력이 있기 때문에, 주가 띄우기(pump and dump)와 같은 시세 조작 행위를 통해 일반 투자자(개미)의 매수와 매도를 유도할 수 있습니다.

당연히 갑작스러운 거래량 증가와 가격 변동이 모두 세력때문은 아니지만, 일반 투자자로써 우위를 점하려면 세력의 의도를 파악하려고 노력하여 이를 경계하며 투자하는 것이 이상적입니다.

마켓 메이커 (Market Maker)

마켓 메이커는 금융 시장에서 유동성을 제공하는 것이 주요 역할입니다.

주식, 옵션, 채권, ETF, 암호화폐 등 다양한 시장에서 활동합니다.

기본적으로 마켓 메이커는 증권이나 기타 금융 상품에 대해 항상 매수와 매도 호가를 제시합니다.

이를 통해 모든 투자자들은 언제든지 원하는 자산을 사고팔 수 있도록 시장의 유동성을 확보합니다.

하지만 매수와 매도 호가의 차이가 크면 시장의 유동성이 적다는 의미로 원하는 가격에 매수 혹은 매도를 할 수 없는 불편함이 생깁니다.

따라서, 마켓 메이커는 호가 스프레드를 유지합니다.

다시 말해서, 매수가(bid)와 매도가(ask)사이의 차이(스프레드)를 좁게 유지함으로써 거래 비용을 낮추고 시장 효율성을 높입니다.

갑작스러운 대량 매수나 매도 주문이 들어올 때 반대 포지션을 취함으로써 급격한 가격 변동을 완화하는 등의 역할을 하게 됩니다.

유명한 대형 마켓 메이커로는 시타델(Citadel Securities), Virtu Financial, GTS, IMC 등이 있습니다.

이러한 마켓 메이커들은 시장의 원활한 작동을 위한 필수적인 시장 참가자이지만, 때로는 이들의 활동이 이해 상충이나 시장 조작 우려를 불러일으키기도 하여 규제 당국의 감시 대상이 되기도 합니다.

세력 vs. 마켓 메이커

법적 지위와 규제
- 마켓 메이커: 공식적으로 인정된 시장 참여자로, 규제 기관의 감독을 받으며 특정 의무와 권한을 가집니다.
- 세력: 비공식적인 개념으로, 법적 지위가 없으며 때로는 불법적인 시장 조작 행위와 연관될 수 있습니다.
목적과 동기
- 마켓 메이커: 시장 유동성 제공이 주 목적이며, 매수-매도 스프레드에서 소액의 이익을 지속적으로 얻는 비즈니스 모델입니다.
- 세력: 주로 단기적 가격 변동을 통한 이익 추구가 목적이며, 특정 종목의 가겨을 자신에게 유리하게 움직이려는 의도가 있을 수 있습니다.
거래 방식
- 마켓 메이커: 양방향 호가(매수/매도)를 항상 제시하며 투명하게 운영됩니다.
- 세력: 투명하게 운영될 의무가 없기 때문에 여러 계좌를 통한 분산 매매, 특정 시간대 집중 매매 등 다양한 전략을 사용합니다.
시장 기여도
- 마켓 메이커: 시장 안정화, 유동성 공급, 가격 발견 기능 등 시장의 효율적 작동에 기여합니다.
- 세력: 단기적으로 시장을 왜곡할 수 있으며, 본인들의 이익을 위해 소액 투자자들에게 손실을 줄 가능성이 있습니다.
투명성
- 마켓 메이커: 공식적으로 등록되어 있기 때문에 활동이 비교적 투명합니다.
- 세력: 신원과 활동이 불투명하여 시장에서 그 존재를 명확하게 확인하기 어렵습니다.

개인 투자자들은 세력들을 상대로 어떻게 투자해야 하는가

세력을 경계해야 하는 이유

단타 투기 세력들은 단기간에 주가를 급등시킨 후 고점에서 매도하는 Pump and Dump 전략을 구사합니다.
주로 유동성이 낮은 소형나/테마주, 알트코인 등에서 활동하는데 이는 유동성이 낮아야 적은 자본으로도 주가를 급등시킬 수 있기 때문입니다.
이런 세력들은 여러 계좌를 이용해 분산 매매하기 때문에 추적이 어렵고, 회사 내부 정보 혹은 개인 투자자들은 알기 힘든 시장 정보를 미리 알고 있기 때문에 포지션을 취할 때 상당히 유리합니다.

개인투자자의 방어 전략

유동성 낮은 종목 주의: 암호화폐를 제외한 다른 경우에는 Pump and Dump 등의 주가 조작 전략에 대한 규제가 더 강합니다. 하지만 유동성이 낮은 소형주나 알트 코인의 경우 세력의 가격 조작이 용이하기 때문에 특히 유의하면서 투자해야 합니다.
거래량 주의: 개인 투자자는 절대 세력을 이길 수 없습니다. 따라서, 기본적 & 기술적 분석을 하며 세력의 의도를 파악하는 것이 중요한데 이때 거래량이 엄청나게 중요한 역할을 합니다. 갑작스러운 거래량 증가와 가격 급등 등을 유의하며 투자해야 합니다.

I think it behooves one to have an internal locus of control. You think that you have control overyour own destiny.
-Elon Musk-

[코인 투자] 2. 다우 이론과 6가지 국면

dongsunseng — Mon, 5 May 2025 13:29:48 +0900

다우 이론과 6가지 국면은 결국 프렉탈 이론과 일맥상통하는 부분이 있습니다.

코인 투자 그 이전에 주식 투자에서까지 투자자들이 참고하는 모든 이론들은 과거 차트를 기반으로 형성되었고, 과거 차트의 모양새를 참고하여 전략을 수립하는 방식을 과거 프렉탈을 참고한다라고 합니다.

따라서, 차트의 단기 반등할 지점을 찾는 "기술적 분석"에 있어서 과거 차트를 분석하는 것은 굉장히 중요합니다.

이런 시장의 흐름은 다우 이론을 바탕으로 파악할 수 있습니다.

다우 이론을 바탕으로 국면을 파악한다는 것은 장기적 관점에서 가격의 방향성을 예측해본다는 것에 목적이 있습니다.

예를 들어, 상승 국면이라고 한다면 상승할 확률이 높다는 것을 기반으로 눌림롱을 잡거나 하락 국면이라고 파악이 되면 오름숏을 잡는 등의 전략을 수립할 수 있게 됩니다.

즉, 현재 시점에서의 추세 매매와 역추세 매매가 무엇인지를 파악하는 것 입니다.

다우 이론

다우 이론은 주식 시장의 장기적인 추세를 파악하고 예측하기 위해 개발된 기술적 분석 방법으로, 찰스 다우가 창시한 다우존스 평균 주가를 바탕으로 만들어졌습니다.

다우 이론은 시장이 상승세일 때 고점과 저점이 상승하고, 하락세일 때 고점과 저점이 하락한다라는 개념을 기반으로 시장의 장기적인 추세를 파악할 수 있다는 이론입니다.

시황 분석을 읽다보면 특정 저점이 깨지지 않는 한 상승을 볼겁니다 혹은 특정 고점이 깨지지 않는 이상 하락이 우세할거라고 생각됩니다 와 같은 문장을 많이 보았을텐데, 이는 다우 이론을 기반으로 한 분석이라고 할 수 있습니다.

다우 이론의 대전제

평균은 모든 것을 반영한다
- 이 부분은 아래 링크의 내용을 발췌했습니다 (개인 투자를 하며 매우 중요한 내용이라고 생각됩니다)
- https://www.imfnsec.com/systemtrade/st02090201.jsp
- "다우 이론에 따르면 시장에서 예상되고 있거나 이미 알려진 모든 정보는 시장 평균에 모두 반영되어 있으며, 예상치 못한 하나의 사건이 일어나면 이는 즉각적으로 시장에 반영된다.
- 이것의 의미는 흔히 우리는 어떤 상승 요인이 되는 재료가 발생하더라도 가격이 상승하지 않고 하락하는 것을 흔히 볼 수 있다.
- 이것은 미래 가격에 대한 "예상" 에 따라 시장 가격이 변동되는 것이므로 전혀 이상하거나 잘못된 것이 아니라 오히려 자연스러운 것이다.
- 어떤 종목이나 상품에 대해 상승 요인이 나오는 재료의 성장 기대치가 가령 5% 이었다면 이러한 기대요인에 따라 이미 가격은 형성되어 있는 상태이고 실제 발표는 3% 에 나왔다면 오히려 가격하락의 요인으로 작용할 수 있는 것이기 때문이다.
- 그러므로 다우 이론에 따라 분석한다면 뉴스가 나오는 시점을 잡아 거래하는 방법보다는 앞으로 나올 예상 정보를 바탕으로 하는 통계 분석이 훨씬 신뢰도가 높을 것이라는 것을 제시해 주고 있다."
시장은 참가자의 행동, 심리 상태 등 모든 정보를 반영한다
- 다우 이론은 주가를 통해 시장의 상황, 경제 상황 등을 파악할 수 있다고 주장합니다
시장은 일정한 추세를 갖고 추세는 상승, 하락, 횡보로 나뉜다
- 추세를 정확하게 파악한 투자자들은 향후 주가의 방향성을 예측할 수 있기 때문에 추세에 맞는 전략을 통해서 더 큰 수익을 가져갈 수 있게 됩니다
거래량은 시장 가격 추세 변동에 유용한 정보를 제공한다
- 거래량에 관련된 부분은 추후에 포스트로 작성해보도록 하겠습니다
기존의 가격 추세는 전환될 때까지 계속된다
지수는 상호 연관성이 있고, 관계를 통해 시장 상태를 파악한다
- 다우 지수에 관한 내용입니다
- 다우 지수는 주요 주식들의 가격을 평균하여 계산합니다
- 평균은 다우 이론에 있어서 아주 중요한 개념으로 시장의 전반적인 상황을 보여준다고 주장합니다
- 다우 지수는 투자자들이 시장의 상황을 파악하고 이를 통해 매수 매도 전략을 수립하는 데 중요한 지표로 사용됩니다
- 암호화폐 투자에 활용되는 개념은 아닙니다

다우는 시장 참여자들의 심리 상태를 위와 같이 6가지의 국면(Phase)로 나눠서 파악합니다.

이 6가지 국면이 하나의 사이클을 이뤄서 계속 순환합니다.

다우 이론의 6가지 국면

매집 국면: 시장에 공포심이 막연하고 합리적인 판단을 할 수 없는 시기
- 일반적으로 하락장에서는 평범한 투자자들(개미)은 보유하던 종목을 포기하고 헐값에라도 매도하려는 경향이 강함
- 반대로 전문 투자자들은 개미들이 던지는 물량을 매집하면서 저렴한 가격에 개미들의 물량을 받아먹음
- 따라서 겉으로 보이는 시장 상황은 굉장히 안 좋은 상태로 비춰지게 됨
- 역설적이지만 다우 이론에 다르면 사람들이 패닉셀(panic sell)하는 시점이 강세 시장의 첫 번째 국면임
- 코로나 쇼크가 그 예시
- 공포심이 막연한 것을 매집 국면이라고 파악하고 물량을 매집하는 투자자들은 돈을 벌고, 공포심으로 인해 투자를 포기하는 투자자들은 기회를 못 잡게 되는 것임
상승 국면: 상승에 대한 기대감이 시장에 반영되는 시기
- 매집 국면에서 악재로 작용한 요소들이 하나 둘씩 해소되기 시작함
- 결제 불황이 차츰 해결되고 기업들의 재정 상태도 회복됨
- 따라서, 일반 투자자들의 관심이 높아지기 시작함
- 관심의 증가는 차트에서 거래량으로 나타나고 거래량이 높아짐에 따라 가격도 상승함
- 상승 국면의 절정 부근에서는 신고가를 갱신하는 종목이 나타남
- 상승 국면은 일반 투자자들에게도 기회의 장이 됨
- 투자를 공부하는 이유도 기본적으로 시장 국면을 판단하고 시장의 흐름에 탑승하기 위함에 있음
- 일반 투자자들은 상승 국면에서 크게 두 가지 모습을 보이게 됨:
  1. 매수를 망설이는 부류: 이전 사이클의 하락 국면에서 큰 손실을 입었기 때문에 이에 대한 트라우마로 매수를 망설이게 됨
  2. 적극적으로 매수하는 부류: 매집 국면에서 매집했던 물량들을 이때부터 조금씩 현금화(매도)하기 시작함
- 상승 국면이 고조되기 시작하면 결국 시장은 과열 국면에 집입하게 됨
과열 국면: 시장이 과열된 것을 모른채 엄청 적극적으로 매수하는 투자자들이 많은 시기
- 과열 국면에서는 모든 시장의 지표가 상승을 가리키게 됨
- 투자 경험이 없는 개미들도 뉴스나 주변 이야기를 듣고 적극적으로 시장에 참여하니 내재 가치가 낮은 종목들도 덩달아 가격이 상승하는 모습을 보이게 됨
- 21년도 5월이 그 예시임
- 전문 투자자들은 과열 국면에 대부분의 물량을 정리함
- 곧 터질 폭탄을 전문가들이 일반 개미들에게 떠넘기는 시기
분산 국면: 가격 거품이 터지기 시작하면서 엄청난 낙폭을 보이는 시기
- 과열 국면 시기와는 달리 경제 지표가 좋지 않고 점점 매수하려는 사람은 줄고 매도하려는 사람은 늘어나게 됨
- 가격 거품이 터지기 시작하면서 엄청난 낙폭을 보임
- 따라서 다시금 시장에는 공포가 도래하기 시작함
공포 국면
- 여기서 일부 투자자들은 지금 하락은 그냥 건강한 조정일뿐이라고 생각하고 적극적으로 매수하기도 함
- 전문 투자자들은 하락폭을 가늠하면서 다시 매집 국면을 준비하기 위한 전략을 세움
- 공포 국면이 길어지다 보면 침체 국면이 시작됨
침체 국면
- 지친 개미들의 실망 매물이 계속 나오게 되면서 주가는 하락하고 시장에는 온통 개미들의 곡소리만 가득한 시기가 됨
- 하지만 시장의 사이클은 계속 순환하기 때문에 시간이 지날수록 하락폭은 줄어들며 다시 시장이 전환되는 시기가 오게 됨
- 하락폭이 저점에서 둥글게 말리면서 상승을 위한 변곡이 생기게 됨
- 침체 국면의 마지막 시기가 오면 차트를 봤을 때 가격은 그대로인데 전체 거래량은 많아지게 되는 이상한 현상을 발견하게 됨
- 일반 투자자들은 공포에 빠져서 매도를 이어가고 있지만 전문 투자자들(세력)들이 다시 그 매도 물량을 받아먹고 있다는 뜻임
- 침체 국면이 충분히 진행되면 다시 매집 국면이 시작됨

Reference

다우 이론의 기본 개념

다우 이론은 투자와 경제 분야에서 중요한 개념 중 하나로, 특히 주식 시장에 관한 이론입니다. 이 이론은 찰스 다우(Charles Dow)에 의해 개발되었으며, 다우 지수(Dow…

wikidocs.net

iM증권

주식, 선물, 외환시장 등 다양하게 적용되는 다우이론에 대한 강좌입니다. 다우이론의 개념 실제적으로 모든 기술적 분석의 시작은 다우이론을 알고 난 후에야 시작하는 것이 순서일 것이다. 가

www.imfnsec.com

You should take the approach that you're wrong. Your goal is to be less wrong.
- Elon Musk -

[코인 투자] 1. 내가 다시 보려고 저장해두는 코인 투자 참고 자료

dongsunseng — Mon, 5 May 2025 13:29:37 +0900

1. 피보나치 되돌림

피보나치 되돌림에 대하여

오늘은 기술적 분석기법 중에 지지, 저항을 확인하는 데 수단으로 많이 활용하는 피보나치 되돌림(fibonacci retracement)에 대하여 이야기 해보려 합니다.참고로 어린 시절 수학시간에 배웠던 피보나

www.chartistlab.com

2. 추세 기반 피보나치 확장

추세기반 피보나치 확장이란?

안녕하세요. 홀더 입니다.오늘은 "추세기반 피보나치 확장"에 대하여 다뤄 보려고 합니다."피보나치 되돌림"은 아마 차트 공부를 시작하신 10명 중 8~9명은 들어왔을 거라고 생각됩니다.하지만,

www.chartistlab.com

3. 다우 이론

추세를 확인하는 이론이 있다? Dow Theory

안녕하세요 오늘은 다우이론(Dow Theory)에 대해서 설명을 드리려고 합니다. 다우 이론은 찰스 다우라는 인물이 만들어낸 이론으로써 여러가지 원칙과 의미가 있는대요, Bull market이 오기 전의 매집

www.chartistlab.com

There's a tremendous bias against taking risks. Everyone is trying to optimize their ass-covering.
-Elon Musk-

[경제] 2. 신탁

dongsunseng — Sat, 3 May 2025 15:53:00 +0900

신탁의 정의

신탁(Trust)은 재산 관리를 위한 법적 제도로, 한 사람(위탁자)이 자신의 재산을 다른 사람(수탁자)에게 맡겨서 제3자(수익자)의 이익을 위해 관리하도록 하는 계약입니다.

즉, 신탁은 간단하게 생각하면 남의 돈의 법적 소유권을 받아서 수익자의 이익을 위해 돈을 운용하는 방식의 계약입니다.

주요 특징:

위탁자(Trustor): 재산을 맡기는 사람
수탁자(Trustee): 재산을 관리하는 사람이나 기관(은행, 신탁 회사 등)
수익자(Beneficiary): 신탁에서 발생하는 이익을 받는 사람
신탁 재산: 위탁된 자산(현금, 부동산, 주식 등)

금융권에서의 신탁 상품은 은행이나 증권사가 수탁자가 되어 고객(위탁자)의 자금을 운용하는 방식입니다.

신탁은 자산 관리, 절세, 상속 계획 등 다양한 목적으로 활용됩니다.

이렇게 목적이 굉장히 다양하기 때문에 위탁자와 수익자의 개념을 분리시켜둔 것입니다.

위탁자와 수익자가 다른 경우 뿐만 아니라 수익자가 다수인 경우도 존재합니다.

위탁자와 수익자가 다른 경우에는 부모가 자녀를 위해 신탁을 설정하는 경우나 기업이 직원의 퇴직금을 위한 신탁을 설정하는 경우 등이 있습니다.

일반적인 경우로는 위탁자와 수익자가 동일하고 그 예시는 자신의 노후를 위해 연금 신탁을 설정하거나 일반적인 자산 관리의 목적으로 본인이 수익자가 되는 신탁을 설정하는 경우 등이 있습니다.

일반적인 금융 신탁 상품으로는 금전신탁, 부동산신탁, 증권투자신탁 등이 있습니다.

신탁과 다른 금융 상품의 차이점

신탁 vs. 펀드
- 펀드는 다수 투자자의 자금을 모아 주식, 채권 등에 투자하는 방식이고, 신탁은 그 목적이 다양한만큼 개별 계약에 따라 맞춤형 자산 관리가 가능합니다.
- 맞춤형 관리가 가능한 것이 장점이기 때문에 운용 방법을 누가 지정하는지에 따라서도 신탁의 종류가 나뉘게 됩니다.
- 특정 금전 신탁: 위탁자가 운용방법을 지정하는 금전 신탁
- 불특정 금전 신탁: 수탁자가 운용방법을 지정하는 금전 신탁
신탁 vs. 보험
- 보험은 위험 보장이 주 목적이지만 신탁은 재산 관리와 처분이 주 목적입니다.

신탁 상품의 예시

블랙록, 기관 대상 비트코인 현물 신탁 출시..."지속가능 위한 노력 고무적" By TokenPost

블랙록, 기관 대상 비트코인 현물 신탁 출시..."지속가능 위한 노력 고무적"

kr.investing.com

위의 기사를 보면 2022년 8월 블랙록(자산운용사)이 미국 기관 투자자들을 대상으로 비트코인 현물 개인 신탁을 출시한다고 발표한 내용이다.

자산 운용사인 블랙록은 비트코인 현물 개인 신탁을 출시하여, 기관 투자자들의 돈으로 매집을 계획중인 것으로 해석할 수 있는 것이다.

Reference

미래에셋증권

securities.miraeasset.com

You should take the approach that you're wrong. Your goal is to be less wrong.
- Elon Musk -

[경제] 1. 비트코인 현물, 선물 ETF 상장이 갖는 의미

dongsunseng — Sat, 3 May 2025 15:13:24 +0900

미국이 전략적 비축 자산에 가상자산을 추가한다고 발표한 이후 가상자산 판에서의 입지 확장을 위한 작업을 진행중입니다.

해당 작업의 일환으로 미국은 2021년 10월 20일에 비트코인 선물 ETF를 상장시킵니다.

추후에 2023년 6월에는 자산운용사인 블랙록이 현물 ETF를 신청하게 되고, 2024년 1월 11일에 마침내 미국 첫 번째로 비트코인 현물 ETF 상장에 성공합니다.

여기서 ETF는 무엇인지, 이는 무슨 의미를 갖는지에 대해서 자세하게 알아보겠습니다.

ETF는 무엇인가

ETF는 "상장 지수 펀드" 입니다.

"Exchange Traded Fund의 줄임말로 특정 지수를 추종하는 인덱스 펀드를 거래소에 상장시켜 주식처럼 거래할 수 있도록 만든 펀드를 뜻합니다.

여기서 인덱스 펀드란 특정 주가지수와 동일하거나 유사한 수익률을 목표로 하는 펀드입니다.

즉, KOSPI200, KOSDAQ150, S&P500, NASDAQ 등의 특정 지수의 수익률을 추종하는 펀드이기 때문에 해당 지수가 상승하면 펀드의 수익률도 함께 상승하고, 하락하면 함께 하락하는 방식으로 운용됩니다.

ETF를 통해 투자자는 직접 매수하지 않고도 여러 자산에 베팅할 수 있게 됩니다.

예를 들어, 금과 은 값을 포함하여 하나의 ETF를 구성한다던가 상위권 IT 기업 및 보험회사의 주식을 혼합해 ETF를 구성하는 것도 가능합니다.

주식을 선택하거나 운용하는데에 드는 비용이 없기 때문에 액티브 펀드보다 낮은 운용 비용이 하나의 큰 특징이 됩니다.

지수를 그대로 따라가기 때문에, 액티브 펀드처럼 시장 상황을 예측하거나 투자 전략을 수립할 필요가 없다는 뜻입니다.

또한, 시장의 평균적인 수익률을 반영하기 때문에 개별 주식 선택에 따른 위험을 줄여줍니다.

반대로, 액티브 펀드보다 높은 수익률을 달성하기는 어렵습니다.

하나의 아웃라이어(outlier)가 평균에 큰 영향을 주기는 힘들기 때문입니다.

인덱스 펀드를 거래소에 상장시켜 주식처럼 거래할 수 있도록 만든다는 의미는 일반 펀드와 달리 증권 거래소에 상장되어 있기 때문에, 주식처럼 장중에 실시간으로 매매가 가능하다는 것을 의미합니다.

즉, 투자자는 펀드 판매사를 통하지 않고도 증권 계좌를 통해 주식을 사고파는 것과 동일한 방식으로 ETF를 거래할 수 있게 됩니다.

ETF vs. 일반 펀드

거래방식: 일반 펀드는 하루 한 번 기준가격으로 매매가 이루어지지만, ETF는 장중에 실시간 가격으로 거래됩니다
유동성: ETF는 거래소에서 즉시 매매가 가능하므로 유동성이 높습니다
비용 구조: 일반적으로 ETF는 운용 보수가 낮은 편입니다. 인덱스를 단순히 추종하는 패시브(passive) 전략을 사용하기 때문입니다.
- 패시브 운용(Passive Management):
  - 시장 지수의 성과를 그대로 복제하는 것이 목표임
  - 운용사의 주관적인 판단이나 시장 예측에 의존하지 않음
- 반대 개념은 액티브 운용(Active Management):
  - 펀드 매니저가 적극적으로 종목을 선택하고 매매하여 시장 평균 이상의 수익을 추구하는 방식
투명성: ETF는 매일 포트폴리오 구성이 공개되어 투명성이 높습니다.

일반 펀드의 다른 말로는 뮤추얼 펀드(Mutual Fund - 영어 표현)가 있습니다.

이는 다수의 투자자로부터 자금을 모아 전문 펀드 매니저가 액티브 운용을 통해 투자하는 방식으로, 투자자들은 펀드의 지분(수익증권)을 보유하게 됩니다.

주식회사 형태로 운영되며, 투자자들은 주주가 되어 투자 수익을 배당금 형태로 받게 됩니다.

이 중에서도 종류가 나뉘게 되는데 크게는 개방형과 폐쇄형이 있습니다. 개방형은 언제든지 돈을 찾을 수 있고, 폐쇄형은 만기 전에는 돈을 찾을 수 없다는 차이가 있습니다.

ETF의 장점

분산 투자: 하나의 ETF로 여러 종목에 분산투자 효과를 얻을 수 있습니다.
낮은 비용: 대부분의 ETF는 패시브 운용으로 인해 운용보수가 낮습니다.
세금 효율성: 일부 국가에서는 ETF가 세금 측면에서 유리한 구조를 가집니다.
접근성: 소액으로도 다양한 자산군에 투자할 수 있습니다.

ETF의 종류

주식형 ETF: 주식 시장 지수를 추종(KODEX200, SPY, ...)
채권형 ETF: 채권 지수를 추종
원자재 ETF: 금, 은, 원유 등 원자재 가격을 추종
섹터 ETF: 특정 산업 섹터에 집중(바이오, IT, 금융, ...)
국가/지역 ETF: 특정 국가나 지역의 시장을 추종
레버리지/인버스 ETF: 지수 수익률의 배수 또는 반대 방향으로 움직이는 ETF

비트코인 선물 ETF의 의미

위에서 언급했듯이, 비트코인 선물 ETF란 비트코인 자체가 아닌 비트코인 선물 계약에 투자하는 상장 지수 펀드입니다.

즉, 실제 비트코인을 직접 보유하지 않고, 비트코인의 미래 가격에 대한 계약인 '선물 계약'에 투자하는 개념입니다.

이 선물 계약들은 CME(시카고 상품거래소)와 같은 규제된 거래소에서 거래됩니다.

투자자들이 실제 비트코인을 구매하고 저장하는 복잡한 과정 없이 기존 증권 계좌를 통해 비트코인 시장에 노출될 수 있게 되었다는 것이 비트코인 선물 ETF를 상장시킨 이유입니다.

코인 거래소에서 하는 선물 거래 vs. 선물 ETF

여기서 바이비트와 같은 암호화폐 거래소에서 비트코인 선물 거래를 직접 하는 것과 비트코인 선물 ETF를 사고 파는 것의 차이점이 무엇인지 궁금할 것입니다.

주요 차이점:

일단 규제 환경에서의 차이가 있습니다. 바이비트와 같은 코인 거래소는 국가마다 규제 수준이 다르며, 일부 국가에서는 규제가 미흡하거나 차이가 심할 수도 있습니다. 하지만 ETF는 미국 증권거래위원회(SEC)와 같은 엄격한 금융 규제 기관의 감독을 받으며, 투자자 보호 장치가 확실합니다.
접근성에서의 차이도 분명히 존재합니다. 암호화폐 거래소는 기존 주식 투자자들에게 부가적인 절차를 요구하게 됩니다. 하지만 ETF는 기존 증권 계좌를 통해 주식처럼 매매할 수 있기 때문에 진입 장벽이 낮고 기존 주식 투자자들에게 편리함을 제공합니다.
위험 및 보안에서도 차이를 보입니다. 암호화폐 거래소는 해킹, 거래소 파산, 사기 등의 위험이 있지만 ETF는 규제된 금융 기관이 자산을 관리하므로 이러한 위험이 줄어듭니다.
그외에도 레버리지, 상품 구조 등의 부가적인 차이도 존재합니다.

미국의 비트코인 선물 ETF 상장의 의미

가상 자산이 미국의 전통적인 금융 시스템 내에서 공식적으로 인정받기 시작했다는 상징적인 의미가 있습니다.
SEC(미국 증권거래위원회)의 승인을 받은 상품으로, 일정 수준의 투자자 보호와 규제 감독이 가능해졌다는 의미가 있습니다.

위와 같은 이유도 물론 중요한 의미를 가지지만

일반 투자자들이 기존 증권 계좌를 통해 쉽게 비트코인 관련 투자를 할 수 있게 된 점
법적, 규제적 장벽으로 인해 직접 비트코인 투자를 꺼렸던 기관 투자자들의 시장 참여를 용이하게 된 점

위와 같은 효과를 내면서 결국 위에서 언급했듯이 일반 투자자는 물론 그보다도 훨씬 큰 힘이 있는 기관 투자자들의 비트코인 관련 투자를 활성화시켰다는 점을 주목해야 합니다.

즉, 미국의 기관 투자자들을 비트코인 시장에 참여시켜서 코인 시장 내의 미국의 입지를 확대하려는 목적으로 해석할 수 있습니다.

비트코인 선물 ETF vs. 비트코인 현물 ETF

비트코인 현물 ETF 보다 선물 ETF가 먼저 상장했습니다.

비트코인 선물 ETF는 실제 비트코인을 직접 보유하는 현물 ETF가 아니라 선물 계약에 투자하는 방식이라는 점에서 한계가 있습니다.

선물 계약에는 '만기'가 존재하고, '컨탱고'라고 불리는 롤오버 비용이 발생할 수 있기 때문에 장시적으로는 비트코인 가격 자체의 성과와 차이가 날 수 있게 됩니다.

즉, 현물 ETF는 자금 유입이 실제 비트코인 구매로 이어져 직접적인 수요를 창출하기 때문에 시장 가격에 더 직접적인 영향을 줄 수 있습니다.

다시 말해서, 코인의 가격을 조종하는 Market Maker들의 입장에서는 변수를 줄일 수 있는 현물 ETF도 상장하는 편이 당연합니다.

롤오버 비용(Rollover Cost)?

롤오버 비용은 만기가 다가오는 선물 계약에서 다음 만기의 선물 계약으로 포지션을 이전(롤오버)할 때 발생하는 비용입니다.

선물 ETF와 같은 상품들은 지속적인 노출을 제공하기 위해 이러한 롤오버 과정을 정기적으로 수행해야 합니다.

모든 선물 계약은 특정 만기일을 가지고 있습니다.

여기서 발생하는 현상 두 가지:

컨탱고(Contango): 미래 만기의 선물 가격이 현재 만기의 선물 가격보다 높은 상황
백워데이션(Backwardation): 미래 만기의 선물 가격이 현재 만기의 선물 가격보다 낮은 상황

비트코인 시장은 대부분의 시간동안 컨탱고 상태에 있는 경우가 많습니다.

비트코인 시장이 대부분의 시간동안 컨탱고(Contango) 상태에 있는 이유:

미래 가격 상승 기대감: 많은 투자자들은 비트코인의 장기적 가치 상승을 기대합니다. 이러한 낙관적 전망이 선물 가격을 현물 가격보다 높게 유지합니다.
보유 비용의 부재: 비트코인은 실물 자산과 달리 보관 비용이나 감가상각이 없습니다. 물리적 상품(석유, 농산물 등)은 보관 비용 때문에 백워데이션(선물 가격 < 현물 가격)이 자주 발생하지만, 비트코인은 그런 제약이 없습니다.
이자율 요소: 선물 가격에는 무위험 이자율이 반영됩니다. 투자자들은 현물을 구매하는 대신 그 자금으로 무위험 수익을 얻고 나중에 선물 만기에 비트코인을 구매할 수 있으므로, 이 기회비용이 선물 가격에 반영됩니다.
레버리지 수요: 많은 트레이더들이 더 큰 수익을 위해 레버리지를 사용하는데, 이는 선물 시장에서 이루어집니다. 이런 롱 포지션 수요가 선물 가격을 끌어올립니다.
기관 투자자의 헤지 전략: 기관들이 현물 비트코인을 보유하면서 선물로 헤지하는 전략을 사용할 때, 이런 활동이 컨탱고 상태를 강화할 수 있습니다.
수익률 파밍(Yield Farming): 투자자들이 현물 비트코인을 보유하면서 동시에 선물 시장에서 숏 포지션을 취하는 베이시스 트레이딩을 통해 무위험 수익을 추구합니다. 이러한 차익 거래 활동이 컨탱고 상태를 유지하는 데 기여합니다.

5, 6번에 대한 추가 설명:

기본적으로 5번과 6번은 같은 맥락입니다.
기본적인 헤지 포지션(Hedge Position) 구조:
- 기관 투자자들과 일부 투자자들은 장기 투자 목적으로 현물 비트코인을 구매하여 보유합니다.
- 동시에 가격 하락 위험을 관리하기 위해 선물 시장에서 숏(매도) 포지션을 취합니다.
- 이 전략은 현물 롱 + 선물 숏의 형태로 구성됩니다.
컨탱고를 강화하는 과정:
- 대형 기관이 대량의 현물 비트코인을 구매하면 현물 가격에 상승 압력이 가해집니다 (비트코인의 수량은 정해져있기 때문에 수요와 공급 법칙에 따라 당연한겁니다).
- 이후 이들이 선물 시장에서 숏 포지션을 취하면 이론적으로는 선물 가격에 하락 압력을 줄 수 있습니다.
- 그러나 대부분의 경우, 선물 시장에서의 숏 포지션보다 현물 시장에서의 매수 영향이 더 크게 작용합니다.
수익 창출 메커니즘:
- 기관들은 이 헤지 포지션을 통해 선물과 현물 간의 가격 차이(베이시스)에서 수익을 얻을 수 있습니다.
- 예를 들어, 현물 비트코인이 $50,000이고 3개월 선물이 $52,500이라면, 연간 20%의 무위험 수익률을 얻을 수 있게 됩니다.
- 컨탱고 상태에서는 선물 가격이 만기에 가까워질수록 현물 가격에 수렴합니다(베이시스가 줄어듦).
- 이 수렴 과정에서 헤지 포지션은 안정적인 수익을 창출합니다.
기관 투자자의 영향력:
- 대형 기관들이 이러한 전략을 대규모로 실행할 때, 이들의 거래 행위는 시장 전체 구조에 영향을 미칩니다.
- 특히 현물 비트코인에 대한 수요가 지속적으로 유지되어 현물 가격 지지로 이어집니다.
- 동시에 선물 시장에서의 숏 포지션이 선물 가격의 과도한 상승을 제한합니다.
컨탱고 지속 요인:
- 이러한 헤지 포지션 구축은 현물 비트코인에 대한 지속적인 수요를 창출합니다.
- 현물 매수 + 선물 매도 전략이 널리 채택될수록 현물과 선물 간의 가격 차이(컨탱고)가 유지됩니다.
- 이 가격 차이는 헤지 전략의 수익성을 결정하므로, 수익을 추구하는 다른 투자자들도 유사한 전략을 채택하게 됩니다.
- 베이시스(현물과 선물의 가격 차이)가 좁아지면 수익성이 감소하므로 새로운 참여자들의 진입이 줄어듭니다.
- 반대로 베이시스가 넓어지면 더 많은 투자자들이 이 전략에 참여하게 됩니다.
- 베이시스 트레이딩은 컨탱고 상태를 유지하는 자기 강화적 순환 구조를 형성합니다:
  - 컨탱고가 발생하면 투자자들이 베이시스 트레이딩으로 무위험 수익을 추구합니다.
  - 더 많은 투자자가 현물을 매수하고 선물을 매도합니다.
  - 따라서, 컨탱고 상태가 지속됩니다.
  - 추후에 시장 효율성으로 인해 베이시스는 특정 수준에서 안정화됩니다.
이러한 메커니즘이 순환적으로 작용하여 비트코인 시장에서 컨탱고 상태가 지속되는 데 기여합니다.
시장 참여자들의 이런 행동은 결과적으로 시장의 유동성을 높이고 가격 안정성에도 기여할 수 있게 됩니다.

ETF가 만료되는 선물 계약을 팔고 더 비싼 다음 달 계약을 사게 되면, 그 가격 차이가 비용으로 발생하게 됩니다.

예를 들어 만약 4월 만기 비트코인 선물이 $60,000에 거래되고 5월 만기 선물이 $61,000에 거래된다면, 롤오버할 때 계약당 $1,000의 비용이 발생하게 되는 것입니다.

이러한 롤오버 비용이 지속적으로 발생하면 ETF의 성과가 기초자산(비트코인)의 실제 성과보다 낮아지게 됩니다.

이를 '롤 이익'(Roll yield)의 감소 또는 '롤 손실'(Roll loss)라고 합니다.

롤오버가 투자자에게 미치는 영향

성과 괴리: 위에서 언급했듯이 롤오버 비용으로 인해 장기적으로 비트코인 선물 ETF의 성과는 실제 비트코인 가격 움직임과 괴리가 발생할 수 있습니다.
장기 투자 효율성 감소: 이러한 비용은 시간이 지남에 따라 누적되어 장기 투자 수익률에 부정적인 영향을 미칠 수 있습니다.
비용 가시성: 롤오버 비용은 명시적으로 표시되지 않고 ETF 가격 성과에 내제되어 있어 투자자가 인지하기 어렵기 때문에 투자자에게 더 큰 불편함을 제공합니다.

이러한 이유들로 많은 투자자와 시장 참여자들은 실제 비트코인을 보유할 수 있는 현물 ETF 상장의 승인을 기다렸고 2024년 초 미국 SEC는 결국 비트코인 현물 ETF의 상장을 승인합니다.

Reference

ETF 소개 | ETF 투자기초가이드 | Kodex

ETF 투자의 기초부터 심화까지 알아보세요.

www.samsungfund.com

비트코인 ETF 승인: 환호하는 암호화폐 시장… 그 이유와 의미는? - BBC News 코리아

오랫동안 기다려온 미 금융 당국의 비트코인 현물 ETF 승인 소식에 암호화폐 업계가 들썩이고 있다. 그 이유를 살펴봤다.

www.bbc.com

Work like hell.
I mean you just have to put in 80 to 100 hour weeks every week.
- Elon Musk -

[매매일지] 13. 비트야 멘징 좀 하자

dongsunseng — Tue, 18 Mar 2025 00:51:34 +0900

2025.03.17 - 1) 숏 1차 진입

1. 진입 근거:

순추세는 다시 숏이라고 판단함
618 라인에 세 번 연속 맞고 리테스트가 일어났기 때문에 4번째 618 부근에서 숏 포지션 진입함

2. 포지션 셋업:

진입(EP): 83276
익절(TP):
손절(SL): 83467
손익비(R/R):

3. 결과:

장대 양봉 손절 엔딩
더 올라갔다 내려갈거라고 판단하고 바로 손절침

2025.03.17 - 2) 숏 2차

1. 진입 근거:

거래량은 많이 두번 터지면서 장대 양봉을 쐈지만 사실 주가는 크게 못올림
세번째 터치할때 숏 진입함(고배로)

2. 포지션 셋업:

진입(EP): 83396
익절(TP): 82969 (매물대 상단)
손절(SL): 83735 (최근 고점)
손익비(R/R): 1.4

3. 결과:

고배로 쳐서 꽤나 멘징함

4. 배운점, 느낀점 정리:

마음 급하게 먹지말고 천천히 손익비 좋은 자리만 봐서 매매하면 수익을 본다
다시 말해서, 미리 진입해서 물려있지 말고 진짜 좋은 자리만 들어가는게 좋다

2025.03.17 - 2) 숏 3차

1. 진입 근거:

아래 마지노선으로 지켜주던 매물대를 뚫는 것을 보고 진입함

2. 처음 포지션 셋업:

진입(EP): 82574

3. 결과:

진짜 신이 살렸다....
일단 위에 보이는 것처럼 진입함
근데 이후로 이상하게 스물스물 오름
불안하긴 했지만 버텨봄
내려가겠지 라는 생각으로 남은 증거금까지 끌어와서 한 번 더 쳤음
저시드 고배로 치고 있었는데 남은 증거금을 끌어왔으니까 말도 안되는 고시드 고배 상황
높은 펀비 때문에 조금의 수익만 보자는 생각으로 익절을 걸고 버팀
딱 10틱 안쪽으로 익절 나가고 저렇게 올라버림...
진짜 운이 좋았다..

4. 배운점, 느낀점 정리:

증거금 끌어와서 평단 조정하는 짓은 진짜 그만해야겠음... 너무 쫄림
수익을 봐서 다행이지 하마터면 따로 빼둔 증거금까지 손실볼 뻔 했음..

5. 반성할 점:

제발 뚫을 것 같을 때 진입하지 말고 확인매매 하자... 완전하게 고쳐지지가 않네
오늘 매매는 여기서 그만하는게 좋겠다..

천천히 멘징해나가는중... 화이팅
뻘짓 그만하자..

[매매일지] 12. 역시 롱은 역추세였다..

dongsunseng — Mon, 17 Mar 2025 16:40:42 +0900

2025.03.16 - 1) 롱 진입

1. 진입 근거:

주말 내내 83k 대의 매물대를 마지노선으로 횡보중이었기 때문에 상승을 이어나가려면 이 구간이 깨지지 않아야 한다고 판단하고 롱을 잡았음

2. 포지션 셋업:

진입(EP): 83841
익절(TP): 길게 끌고갈 생각이었음
손절(SL): 83600
손익비(R/R): 길게 끌고갈 생각이었음

3. 결과:

손절
손절 라인이 짧았기 때문에 해볼만한 배팅이었다고 생각함

2025.03.16 - 2) 추격 숏 진입

1. 진입 근거:

빠르게 흐르는걸 보고 추격 숏 진입

2. 포지션 셋업:

진입(EP): 83607.0
익절(TP):
손절(SL): 0.4% 짧은 손절라인
손익비(R/R):

3. 결과:

멘징 + 수익 성공

4. 배운점, 느낀점 정리:

솔직히 기세만 봤을 때는 훨씬 아래로 갈 줄 알았음
혹시 모르니까 그어둔 노란색 하락 추세선이 깨질 떄 한번, 피보나치 0.5 구간에서 완익을 쳤음
갑자기 반등을 이어나가는걸 보고 익절 하길 잘했다고 생각함

2025.03.16 - 3) 숏 재진입

1. 진입 근거:

반등의 세기가 약해질 때 숏을 재진입함

2. 포지션 셋업:

진입(EP): 82915
익절(TP):
손절(SL):
손익비(R/R):

3. 결과:

여기서 문제가 발생함...
더 내려갈거라고 확신하고 손절 라인을 엄청 길게 잡아두고 다른 강의를 듣던 중에 엄청난 장대 양봉 발생함..

4. 배운점, 느낀점 정리:

5. 반성할 점

무조건 손절은 감당 가능할 정도로만 잡기
어떤 상황에서도...
특히 요즘같은 변동성이 심한 장세에 왜 자꾸 이런 말도 안되는 손실을 보는거니..

2025.03.16 - 4) 사팔사팔 멘징

1. 진입 근거:

사팔사팔

2. 포지션 셋업:

3. 결과:

단기봉만 보고 사고 팔며 손실의 25% 정도 멘징함
진짜 정신 나갈뻔...
좋은 방식인지는 모르겠지만 변동성이 심하고 휩쏘가 맞는것같다는 생각이 들면 사팔사팔하며 멘징하는건 나쁘지 않은 것 같기도 함

2025.03.16 - 5) 수면 멘징

1. 진입 근거:

중요 지지라인이 깨지는 것을 보고 전고점을 로스로 잡고 숏 포지션 진입 후 잤음

2. 포지션 셋업:

진입(EP): 83812
익절(TP): 92800 (멘징 완료 라인)
손절(SL): 84100
손익비(R/R):

3. 결과:

처음 확 쏟을 때 전체 멘징 완료..
91k대까지 쏟은 거 보니까 하 익절 라인 좀만 더 길게 잡을걸 이런 생각이 들긴 했지만 그냥 멘징했다는 것으로 위안 삼았음

큰 손실이 나면 멘징하는데는 훨씬 더 큰 노력이 필요하다는 것을 제발 명심하자..

[매매일지] 11. 롱차..?

dongsunseng — Sun, 16 Mar 2025 13:32:33 +0900

2025.03.15

1. 진입 근거:

가파르게 올라오는 것을 보고 아 한번도 쫙 올렸다가 패닉셀을 만드려는거구나 라고 생각을 바꿈
따라서 롱 자리를 보고 있다가 눌림을 보고 진입함
더블바텀 넥 라인 리테스트 후 올리는 것을 충분히 보고 진입함(최근 손절이 많이 아팠어서.. ㅎ)

2. 포지션 셋업:

진입(EP): 84261
익절(TP): 길게 끌고갈 생각
손절(SL): 83922 (제일 가까운 매물대 하단)
손익비(R/R): 길게 끌고갈 생각

3. 결과:

손절 나감..

4. 배운점, 느낀점 정리:

요즘 도통 감을 못잡겠다...
손익비는 자시고 몇연패를 하는 중인건지...
좋은 자리를 기다렸다가 들어가는 것 같은데도 전에 몇번 욕심부려서 손절난게 타격이 많이 큰 것 같다..

살려줘... 비트야

[매매일지] 10. 숏차

dongsunseng — Sat, 15 Mar 2025 22:25:29 +0900

이전 포스트:

[매매일지] 9. 왜 굳이 역추세를 탔니..

이전 포스트: [매매일지] 8. 사실 재진입함..이전 포스트: [매매일지] 7. 김비트 제발이전 포스트: [매매일지] 6. 2연승 추가..?이전 포스트: [매매일지] 5. 쉽지 않은 장세에 조금씩 수익 쌓아가는

dongsunseng.com

2025.03.11

1. 진입 근거:

순추세가 하락인게 자명해지는 가운데, 상승의 1.272 부근에서 엄청난 반등이 일어남
당연히 가짜 반등이라고 판단하고 숏 자리를 보고 있었음
상승 추세선에서 멀어지다가 다시 붙고 있었고 단기 더블탑의 넥라인이 돌파되는것을 확인하고 진입함
3번째 사진: 휩쏘에 당하고 다시 잡음(평단은 비슷)
4시간봉 히든 하락 다이버 컨펌(새벽 1시까지)

2. 포지션 셋업:

진입(EP): 81540
익절(TP): 길게 끌고 갈 생각
손절(SL): 82300(휩쏘 고점)
손익비(R/R):

3. 결과:

쭉쭉 잘 내려가다가 갑자기 오르더니 손절 맞음
물론 분익 + 본절 해둘 수는 있었지만 상황 자체가 처음에 너무 잘 나와서 그러긴 힘들었음
110k 부근에서 부터 잡은 채널의 하단을 빠르게 이탈했기 때문에 반등이 나올줄은 알았지만 채널 중단까지 다시 올라갈줄은 몰랐음
장기적인 하락 추세라고 판단했기 때문에
또한, 상승의 0.5 부근에서 반등한거였는데 786 부근까지는 갈거라고 생각함

2025.03.12

1. 진입 근거:

한참 전에 숏 진입은 다시 해뒀고 휩쏘에 당함(약손절)
휩쏘에 당하고 다시 잡음

2. 포지션 셋업:

진입(EP): 82942
익절(TP): 79330

3. 결과:

익절

2025.03.12

1. 진입 근거:

동일하게 숏

3. 결과:

계속 휩쏘를 당하니까 본절튀 하고 롱을 잡았다가 손절 맞음
이후에 뇌동매매로 숏을 잡았다가 풀고 하다가 결국 멘징은 하긴 했음

4. 배운점, 느낀점 정리:

휩쏘를 당해도 홀드 하자
단기 반등을 노리지 말자 (순추세 매매)

5. 반성할 점:

이슈로 인한 단기 반등을 고배로 먹어보려고 한 점은 정말 반성해야할 점이다
손절 맞은 후에 제대로 된 기준 없는 뇌동매매

처음 작성한 매매에서 큰 채널을 벗어나고 나서 상승이 쭉 이어졌기 때문에 장기적으로는 하락을 보지만 지금은 롱 포지션을 잡아야할 때라고 생각이 바뀜

최근에 멘탈 관리가 잘 안되서 크게 잃고 겨우 멘징하고를 반복하다가 다시 잃은 상황임..
시드가 50% 아래로 떨어진건 처음이라 많이 아픈데 천천히 복구해보자...

[매매일지] 9. 왜 굳이 역추세를 탔니..

dongsunseng — Tue, 11 Mar 2025 22:12:31 +0900

이전 포스트:

[매매일지] 8. 사실 재진입함..

이전 포스트: [매매일지] 7. 김비트 제발이전 포스트: [매매일지] 6. 2연승 추가..?이전 포스트: [매매일지] 5. 쉽지 않은 장세에 조금씩 수익 쌓아가는중일단 장기적인 추세로 하락 추세를 보고

dongsunseng.com

2025.03.10

1. 진입 근거:

피보나치 1.618 부근에서 되돌림을 먹으려고 역추세를 트라이함

2. 포지션 셋업:

진입(EP): 83,631.60
익절(TP): 약익절
손절(SL): 약손절
손익비(R/R): 초단타

3. 결과:

역추세 매매는 원래 지양하는 편인데 이상하게 홀린듯이 들어가버림
3번의 뇌동매매로 이어졌고 약손절 3번 후에 매매 종료해버림

4. 배운점, 느낀점 정리:

역추세는 기본적으로 심법적으로 두배 더 힘든 것 같음
언제 순추세의 움직임이 나온다는 불안감이 있기 때문
약간의 이익을 얻으려고 초보자가 역추세 매매를 하는 것은 가성비가 안나온다는 생각을 함

5. 반성할 점:

개인적으로 멘탈적으로 온전하지 못한 날이었는데 매매를 강행한게 독이 되었던 것 같음
이런 날에는 그냥 쉬자..
어차피 내일이고 모레고 좋은 자리는 오니까

2025.03.10

1. 진입 근거:

매물대 저항 자리라서 초단타 들어감

2. 포지션 셋업:

진입(EP): 80,963.57
익절(TP):
손절(SL):
손익비(R/R):

3. 결과:

위에서 손절 난거 멘징함

4. 배운점, 느낀점 정리:

순추세로 줄먹하는게 마음도 편하고 수익률도 좋은듯

원금을 넘기고 나서는 진입하기가 무서워지는데 기계적 매매하자

[매매일지] 8. 사실 재진입함..

dongsunseng — Sat, 8 Mar 2025 21:37:33 +0900

이전 포스트:

[매매일지] 7. 김비트 제발

이전 포스트: [매매일지] 6. 2연승 추가..?이전 포스트: [매매일지] 5. 쉽지 않은 장세에 조금씩 수익 쌓아가는중일단 장기적인 추세로 하락 추세를 보고 있음 따라서, 순추세를 하락으로 보고 숏

dongsunseng.com

1. 진입 근거:

전 포스트를 보고 오면 알겠지만 크립토써밋에 맞춰서 실망 매물로 인한 하락분을 먹기 위해 숏포지션을 계속 잡으려고 하고 있었음
고점 부근에서 포지션을 잘 잡았고 하락분을 대부분 먹었음
3월 4일부터 이어진 상승에서의 저점과 상승 최고점을 피보나치 되돌림으로 찍어봤을 때 0.618까지 가서 완익을 쳤음
장기적인 하락을 보고 있었기 때문에 786까지도 가지 않을까 싶어서 자기전에 다시 포지션을 잡고 잤음
저번 포스트를 작성했을 때는 무리라고 생각해서 다시 포지션을 안 잡고 자는게 낫다고 생각해서 욕심인 것 같다고 작성했지만 아무리 생각해도 더 큰 하락이 나왔어야 해서 손절을 해당 날에 본 수익을 뱉어낼 정도로만 잡아봄

2. 포지션 셋업:

진입(EP): 87405
익절(TP): 83688
손절(SL): 90000
손익비(R/R): 1.28

3. 결과:

손절 구간을 크게 안잡았으면 자는동안 스탑이 나갈뻔할 정도의 반등이 나옴
수익은 꽤 봤지만 이번에도 완익, 분익 판단이 아쉬웠음

4. 배운점, 느낀점 정리:

스탑이 나갈뻔할정도로 쎄게 반등이 나왔고 이를 예상했어야되는데 아직 부족한 것 같음 (2번째 저항으로 닿는 시점이었기 때문)
반등이 나온 후에 포지션을 잡았으면 수익을 더 볼 수 있었음
또한, 이미 618까지 한번 갔기 때문에 위와 같이 2번 정도 더 618 구간에서 비빈 후에 쭉 하락할 줄 알았는데 주말이라 그런지 계속 횡보중임
주말에는 보통 횡보를 하지만 최근에는 또 그렇지 않았기에 이것까지 예측하기에는 쉽지 않았을 것 같긴함
고점에서 포지션을 잡고 여기까지 끌고 왔다면 평단이 좋아서 괜찮았겠지만 완익후에 다시 잡았기 때문에 적정선에서 정리했음
완익 분익 판단이 아쉬웠던 이유:
- 4번째 618 구간에 닿았을 때는 하락을 보여줬어야 한다고 생각했는데 반등이 나오는걸 보고 바로 정리함
- 그냥 618 구간에 3번째 닿았을 때 반익을 치고, 본절 걸고 지켜봤다면 좀 더 나은 판단이 아니었을까 아쉬움
- 그래도 이정도로 횡보할거라는건 예측불가의 범위였다고 생각하긴함
반등이 나오면 다시 잡던가 해야겠음..

5. 반성할 점:

일단 반등을 생각 못하고 평단이 안좋음에도 끌고 가려고 고집을 부린 점
786까지 갈거라는 고집으로 분익을 칠 생각도 안한 점
개선:
- 일단 이런 고집을 부린 것은 전 매매들에서 반익을 치고 했어도 수익이 너무 아쉬웠어서 그냥 끌고가보자 라는 생각이 컸음
- 애초에 레버리지를 잘못 설정한듯
- 레버리지를 살짝 올려서 반익본절 운영을 하는 방식으로 다시 돌아가는게 나을듯
- 또한, 최근 인풋에 비해 아웃풋이 많은 상황이 반복되면서 매매로 인한 스트레스만 늘고있던 것 같음 (아는 것에 비해 수익을 더 보려고 하니까)
- 공부 시간을 더 늘리고 레버리지 운영을 좀 더 연구해봐야 될듯

아직 많이 미숙한듯..

[매매일지] 7. 김비트 제발

dongsunseng — Sat, 8 Mar 2025 01:17:56 +0900

이전 포스트:

[매매일지] 6. 2연승 추가..?

이전 포스트: [매매일지] 5. 쉽지 않은 장세에 조금씩 수익 쌓아가는중일단 장기적인 추세로 하락 추세를 보고 있음 따라서, 순추세를 하락으로 보고 숏 자리를 보는중임오늘은 포지션을 총 3번

dongsunseng.com

1. 진입 근거:

대망의 크립토 써밋
기대감으로 상승을 보여주고 있긴 하지만 별 내용 없을거라고 생각함 -> 하락

2. 포지션 셋업:

진입(EP): 90136
익절(TP): 87000 (618 부근)
손절(SL): 92100 (청산가 모여있는 부근 위 + 상단 매물대 윗부근)
손익비(R/R): 1.58

3. 결과:

88450 부근에서 반익절 하고 끌고가다가 거래량 실린 양봉보고 바로 정리함
사실 끌고가도 되는 상황이었는데 최근 몇번 매매에서 반익절하고 남은 반절의 수익을 먹은적이 없어서 한번 조금이라도 더 수익을 보고 싶었음
정리후에 다시 잡고싶었지만 피곤하기도 했고 반등이 계속 나오는거 보고 그냥 잤음 ㅋㅋ

4. 배운점, 느낀점 정리:

변동이 심한 장세에는 타점 잘 잡는게 굉장히 중요한듯
진입 후에 91000 넘어서 쭉 상승이 나왔는데 전혀 불안하지 않았음
청산가가 몰려있는 구간은 반드시 터뜨리고 가는 변동성이 큰 장세이기 때문
타점을 더 잘 잡으려고 욕심을 부리는 것은 별로 안좋을거라고 생각이 들어서 그냥 생각한 진입가가 나와서 바로 잡았음

흠.. 계속 먹으니까 오히려 좀 불안하네..

[매매일지] 6. 2연승 추가..?

dongsunseng — Fri, 7 Mar 2025 23:10:00 +0900

이전 포스트:

[매매일지] 5. 쉽지 않은 장세에 조금씩 수익 쌓아가는중

일단 장기적인 추세로 하락 추세를 보고 있음 따라서, 순추세를 하락으로 보고 숏 자리를 보는중임오늘은 포지션을 총 3번 잡았음 1) 박스권 매매1. 진입 근거: 박스권을 만들었다고 생각했고

dongsunseng.com

1. 진입 근거:

필자는 장기적인 하방 추세를 보고 있고(현물러분들한테는 죄송하지만), 따라서 계속 숏으로 큰 파동을 먹어서 시드를 불리려는 생각으로 매매에 임하고 있음

2. 포지션 셋업:

시점: 2025.03.05 10PM 부근
진입(EP): 90442.7
익절(TP):
- 최종 익절 구간은 피보나치 사용해서 1.272 구간으로 잡음
손절(SL): 상단 매물대 부근으로 잡음
손익비(R/R): 2.28

3. 결과:

부분 익절로 89594, 89492, 89171 부근에서 50%, 25%, 25% 이렇게 익절함
최종 익절구간까지 끌고가지 않은 이유는 최근 장세가 변동성이 심하고 거래량이 많은 양봉이 박힌게 쎄해서 정리해버림
결과적으로는 쎄함 감지를 잘한듯
그냥 정리하고 다시 잡자는 생각이었음

4. 배운점, 느낀점 정리:

쎄할때는 그냥 튀고 다시 잡자
변동세가 요즘 많이 심하기 때문

1. 진입 근거:

위와 마찬가지
+ 상승분은 전부 크립토 써밋 때문이라고 생각했고, 끝나면 하락추세를 이어갈거라고 생각함

2. 포지션 셋업:

진입(EP): 91,519.80
익절(TP): 1.272 부근
손절(SL): 전고점(92850)
손익비(R/R): >3

3. 결과:

일단 포지션은 상당히 잘잡은듯
버티는게 쉽진 않았지만 결국 스탑은 안터뜨리고 수익을 보게 됨
분익도 잘 잡았음(위기 감지 능력이 좀 올라가는듯)

4. 배운점, 느낀점 정리:

요즘 변동성이 너무 쎄서 저배 운용을 하고 있는데 안정적이고 좋은 것 같음
포지션을 상당히 잘 잡았기 때문에 수익을 극대화해보려고 분익하고 물량을 추가하고 배율도 중간에 조정해봤는데 아직 원리를 정확하게 몰라서 그런지 생각만큼 잘되지 않은 것 같음 -> 좀 더 연구 필요할듯
위에 사진에서 볼 수 있듯이 아주 아슬아슬하게 스탑을 안건드리고 하락함(자는중이었음) -> 스탑을 잘 생각해서 잡고 소신있게 가자

슬슬 손절 날때 됐으니까 조심하자

[매매일지] 5. 쉽지 않은 장세에 조금씩 수익 쌓아가는중

dongsunseng — Wed, 5 Mar 2025 02:48:24 +0900

일단 장기적인 추세로 하락 추세를 보고 있음

따라서, 순추세를 하락으로 보고 숏 자리를 보는중임

오늘은 포지션을 총 3번 잡았음

1) 박스권 매매

1. 진입 근거:

박스권을 만들었다고 생각했고 롱숏롱숏 먹으려고 하는게 정석적인 무빙
차트를 안보고 있다가 마지막 자리에 진입함..

2. 포지션 셋업:

진입(EP): 82800
익절(TP): 83288
손절(SL): 82437
손익비(R/R): 1.4

3. 결과:

11시반 개장과 함께 나스닥이 쭉 내리면서 비트까지 내려버림
손절 엔딩

4. 배운점, 느낀점 정리:

11시반 개장 타이밍은 항상 조심해야겠다 하면서도 까먹음
나스닥 차트를 보는 것도 기억하자

2) 다시 펌핑

1. 진입 근거:

쭉 펌핑하는 것을 보고 당연히 찐반은 아니라고 생각함
고점을 찍고 내리는 것을 보고 진입

2. 포지션 셋업:

진입(EP): 83,871.20
익절(TP): 82669 (0.786 되돌림)
손절(SL): 85011 (윗 매물대)
손익비(R/R): 1.06

3. 결과:

0.786 부근에서 욕심 안부리고 익절했음

4. 배운점, 느낀점 정리:

숏을 잡을때마다 느끼는거지만 데드캣 때문에 숏은 익절을 욕심안내고 줄먹 하는것이 중요함
데드캣: 쭉 내리고 갑자기 쭉 오르는 반등
순추세가 하락추세인게 명확하더라도 줄먹 해야됨

3) 푸근한 숏 포지션 진입 시도

1. 진입 근거:

4시간봉 채널의 하단에 회귀했다가 약한 반등이 나왔기 때문에 하방 돌파의 가능성이 크다고 생각함
자주색 매물대에 저항을 지속적으로 받았기 때문에 손익비가 좋은 푸근한 숏 포지션 자리라고 생각하고 진입

2. 포지션 셋업:

진입(EP): 82,602.10
익절(TP): 좀 길게 두고 볼 생각이었음: 큰 하방의 가능성
손절(SL): 83466 (윗 매물대 저항)
손익비(R/R):

3. 결과:

쭉 올리면서 손절 엔딩

4. 배운점, 느낀점 정리:

예측하지 못하겠는 변동성이 큰 장세에는 그냥 가만히 있는게 나은 것 같기도 하다 ㅋㅋ
결국 오늘 수익은 20불 남짓..
2번째 매매에서 꽤 짭짤한 수익을 냈지만 도로 뱉어버림

변동성이 커서 쉽지 않은데, 욕심 내지말고 조금이라도 수익 내는것에 만족하면서 공부해야겠다

[매매일지] 4. 진짜 미친 비트..

dongsunseng — Tue, 4 Mar 2025 00:51:45 +0900

[코인 투자] 매매일기 #3 - 첫 수익 + 비트 운전수 폭주

2025.02.25 4시 숏포지션 x30배1. 진입 근거: 강한 매도세살짝만 먹고 빠질 생각으로 진입2. 포지션 셋업 : 진입(EP): 90460 -> 강한 매도세를 한 파동 확인하고 나서 들어감익절(TP): 89196 -> 매도세가 조

dongsunseng.com

트럼프 말한마디에 아주 미쳐 날뛰는중...

현물이 없고 장기 하방 보고있던 난 트럼프가 밉다..

사실 저 위에 올라타려고 해봤다가 몇불 깨졌음 ㅋㅋ

1. 진입 근거:

상승이 끝나고 구라 반등이지 않을까 싶어서 숏을 쳐봄
보통 이렇게 빠르게 올린 반등은 명확한 의도가 있고, 목적 달성 후에는 상승분이 빠지는 경향이 있기 때문에 숏 포지션을 잡은 것임(프렉탈적 관점)

2. 포지션 셋업:

진입(EP): 93,338.00
익절(TP):
손절(SL): 전고점
손익비(R/R):

3. 결과:

변동성이 굉장히 컸음
위 사진을 보면 알겠지만 저 역헤숄 모양쯤부터 잠에서 깨서 봤는데, 계속 저점을 높이는 상승이 나와서 빠르게 정리했음
결과적으로 보면 수익은 봤지만 저점에서 큰 수익을 본 것도 아니고 저점을 높이는 과정을 관망하다가 노란색 하이라이트 부근에서 뒤늦게 나왔기 때문에 엄청 아쉬운 수익만 보게 됨
결과론적으로만 보면 관점이 어느정도 맞았고 큰 수익을 볼 수 있었는데 내가 기회를 걷어차버림

4. 배운점, 느낀점 정리:

한번 관점을 정해서 익손절 라인을 정했으면 음전한다고 마음 조리지말고 내 분석을 어느정도 믿으면서 내 포지션에 끝까지 책임질 줄 아는 자세가 필요한 것 같다

5. 반성할 점:

뇌동매매는 아니지만 내가 보지 못한 강추세가 나오는 경우 포모(FOMO)가 오고 머리가 뜨끈해질 때가 있는데 이 때는 제발 참자
다음 자리 노려도 충분히 수익 볼 수 있다
모든 자리 다 발라먹으려고 하지마

[매매일지] 3. 첫 수익 + 비트 운전수 폭주

dongsunseng — Sat, 1 Mar 2025 12:14:54 +0900

2025.02.25 4시 숏포지션 x30배

1. 진입 근거:

강한 매도세
살짝만 먹고 빠질 생각으로 진입

2. 포지션 셋업 :

진입(EP): 90460 -> 강한 매도세를 한 파동 확인하고 나서 들어감
익절(TP): 89196 -> 매도세가 조금 약해진다 느낄때쯤 익절
손절(SL): 90700 -> 반등 고점
손익비(R/R):

3. 결과:

4. 배운점, 느낀점 정리:

5. 반성할 점:

사실 결과적으로 보면 수익을 봤지만 강한 매도세만을 보고 잠깐 먹고 빠지는 매매도 건강한 매매일지 모르겠음
손절 라인 짧게 진입해도 강한 매도세에 수익을 볼거같은 느낌이 들어서 진입하긴함

2025.03.01

1. 진입 근거:

장기적인 하락을 보고 있음
이유:
- 상승은 공격적으로 나오지만 거래량이 실리지 않음
- 올린 가격에 비해 RSI가 비교적 너무 많이 올라옴
- OBV 보조지표를 보면 선물의 거래량이 현물의 거래량보다 압도적으로 많음
- 즉, 세력(MM)들이 선물로 가격을 올리고 현물을 비싸게 팔려는 의도라고 생각됨

2. 포지션 셋업:

진입(EP):
익절(TP): 일단 1차 익절 82651 (반등의 0.5 되돌림)
손절(SL): 84880 (반등의 고점)
손익비(R/R): 1.09

3. 결과:

손절 라인 닿고 바로 하락함
버텼어도 터질 손절이긴 했음

4. 배운점, 느낀점 정리:

손절 라인은 10틱, 50틱이라도 손해보게 잡아야 한 번 더 버틸 기회가 생김
익절 라인은 욕심을 조금이라도 덜어야 체결이 될 확률이 높아짐

5. 반성할 점:

요즘 비트코인 너무 어렵다..

[매매일지] 2. 도로 다 뱉어버림...

dongsunseng — Sat, 22 Feb 2025 19:33:57 +0900

[코인 투자] 매매일지 #1 - 시장 수업료로 뱉은거 100% 멘징 + 비트 제대로 복수 완료??

기존 코인 투자 포스트들은 기초 내용들을 다뤘지만 이제부턴 매매일지를 꾸준히 작성해보려고 함 일단 필자는 2월 14일부터 실제 돈으로 투자하기 시작한 초보자임 전 한달 반 가량 모의투자

dongsunseng.com

위의 포스트를 보면 알겠지만 아주 아름다운 상승 추세와 함께 스윙을 치며 100%가 넘는 수익을 보고있었음

하하... 이게 뭔

위의 포스트에서도 말했듯이 99k를 뚫고 수익금이 $1,500을 넘었지만, 바이비트 콜드월렛 해킹 이슈 + 중국 관세, 전염병 이슈가 한꺼번에 터지면서 수직 하강하게 되었음...

포지션 최종 결과

1차로 본절이 터졌고
2차로 이성을 잃고 추격매매를 하다 손절 라인이 바로 돌파되면서 -$200...

차트를 보며 배운/느낀 점들 정리

99k까지 상승하면서 추세를 깨지 않았고 98k 부근에서 다시 한번 쓰리마켓 패턴을 그리면서 고점 갱신을 예상했음
쓰리마켓 패턴의 확장 부분에서 위 따고 아래까지 따고 나서 살짝의 반등과 함께 수직하강하게 되었음 (원래였다면 쭉 상승하는게 쓰리마켓패턴)
여기서 배운 것은 진짜 한치 앞을 예상할 수 없다... 임
어떻게 3일동안 꾸득꾸득 올려서 패턴까지 만들어놓고 수직하강을 할 수 있나...
어떻게 대응했어야 되는지 솔직히 잘 모르겠음
차트가 이렇게까지 이쁘게 나왔는데 걍 익절해버리는 것도 어불성설인것 같고..
일단 분할익절은 특히 아직은 초보자인만큼 꼭 해야겠다고 생각했음

반성할 점

본절이 터졌는데 바로 추격매매를 한 것은 여기에 적기도 쪽팔릴 정도로 반성할 점임
포지션 초반에는 무리할 정도의 추가매매부터 추격매매까지 아주 아마추어 티를 팍팍 냈지만, 이런 충동을 잘 억제할 방법을 구축하기로 함
매매를 하면서 볼 체크리스트도 만들었고: https://dongsunseng.com/entry/%EC%BD%94%EC%9D%B8-%ED%88%AC%EC%9E%90-%EB%A7%A4%EB%A7%A4%ED%95%98%EA%B8%B0%EC%A0%84-%EB%AC%B4%EC%A1%B0%EA%B1%B4-%EB%B4%90%EC%95%BC-%ED%95%A0-%EC%B2%B4%ED%81%AC-%EB%A6%AC%EC%8A%A4%ED%8A%B8
이렇게 말도 안되는 억까를 당할 때 어떻게 극복해야될지는 루틴을 좀 만들 필요가 있다고 생각함
투기가 아닌 건강하고 지속 가능한 투자 생활을 위해서..
사람들이 얼마나 볼지는 모르겠지만 요즘 장 진짜 쉽지 않네요.. 다들 화이팅합시다

시드 현황: $2,682

절치부심해서 다시 가보자...

[코인 투자] 0. 매매하기전 무조건 봐야 할 체크 리스트

dongsunseng — Sat, 22 Feb 2025 17:04:21 +0900

1. 개인은 절대 세력이 될 수 없다: "절대 세력은 편하게 개미들이 수익을 보게 하지 않는다"

물론 개인 투자를 하는 사람들 중에 내가 세력이 되겠다 라고 생각하는 무모한 사람은 보기 힘들다고 생각이 듦
내가 이야기 하고 싶은 포인트는 "절대 세력은 편하게 개미들이 수익을 보게 하지 않는다" 라는 점임
내 포지션이 내가 예상한 수익보다 큰 수익을 내고 있더라도 아 수익보면 이 돈으로 뭐하지 생각하면서 김치국 마시지 말고 항상 어딘가 께름칙한 부분은 없는지, 내가 예상한 세력의 움직임이 타당한지, 내 포지션과 반대로 움직일 가능성은 없는지 등에 대해 확인해야 함

께름칙한 부분을 확인할 때 상당한 도움을 주는 것이 Liquidation Heatmap 인 것 같다:

https://www.coinglass.com/pro/futures/LiquidationHeatMap

2. 절대 추격매매 하지 말기: 조급해하지 마라

추격매매하려고 마음 먹을 때는 당연히 롱 포지션에서 갑자기 정치적 이슈가 터지며 고꾸라졌다거나 아쉽게 본절/손절 라인이 터졌다거나 등등 인간 심리상 짜증을 유발할 때임
이럴 때가 가장 위험할 때임을 명심해야 됨
당연한 이야기지만 투자는 얼마나 버는가보다 손실을 덜보는게 훨씬 중요함
인터넷에 떠돌아다니는 손실에 따라 얼마나 이익을 봐야 메꿀 수 있는지 계산한 표만 봐도 알 수 있음
이럴 때는 그냥 시원섭섭하지만 보내주고 다시 포지션을 잡는 것이 더 큰 손실을 막는 방법임
필자 본인도 100% 이상의 수익을 보다가 바이비트 거래소 해킹 논란 + 중국 코로나 ver.2 + 중국 관세 이슈가 한꺼번에 터져서 본절이 터지고 추격매매를 한 적이 있음 -> 당연히 더 큰 손실로 이어졌음

1절만 해도 되지만 굳이 굳이 2절까지 적자면

매매만을 통해서 돈을 벌고 생활하는 전문 트레이더들의 매매를 보면 자 여기서 본절 터져버려서 빡쳐서 추격매매 진행했습니다
이런 경우는 본적이 없다
보통 시원섭섭한 마음으로 리프레쉬를 하고 와서 다시 차트를 보는 경우가 많다
한 분야를 빠르게 습득하기 위해서는 그 분야에서 뛰어난 사람이 어떻게 하는지를 분석하는 것이 가장 빠른 방법이라고 생각한다
코비 브라이언트, 르브론 제임스가 마이클 조던을 분석하며 자기 스타일을 만든 것과 같은 맥락이다
코비 브라이언트 같이 마이클의 스타일을 완벽하게 카피해버리든 르브론 제임스같이 그 속에서 자기 스타일을 재창조하든 그건 다른 문제고, 특히 투자와 같이 리스크가 따르는 분야에서는 겸손한 마음으로 본인이 초보자라고 생각하며 상급자를 참고하는 것이 효과적으로 작용할 때가 많은 것 같다

3. 손절/본절 라인 조정하지 말기: 고집 부리지 마라

경험 상 높은 확률로 손절/본절이 터질 것 같으면 그냥 냅두는게 더 나았을 뻔한 경우가 많음
롱 포지션에서는 매도세가, 숏 포지션에서는 매수세가 강하다는 것을 알면서도 내가 열심히 분석해서 잡은 포지션이 수익을 못보고 시간만 버린 것이 아쉽고 짜증나서 조정하는 경우가 많을 거라고 생각되는데 (내 이야기임) 매매를 진행하기 전 분석에 더 시간을 쏟고 이후 셋업에 대해서는 관망하는 것이 이성적으로 생각했을 때 맞다고 생각이 듦
괜한 고집으로 손실만 늘리지 말고 다시 포지션을 잡는 것이 맞다
관점을 정했고 손절 익절 라인을 정해서 포지션을 잡았으면 그 포지션에 대해 책임감을 갖고 지켜볼 줄 아는 태도가 마지막에 승리하게 만듦

4. 추가매매 조심하기: 한 번의 매매로 인생을 역전하려고 하지 마라

이건 추격매매랑은 좀 다른 경우지만 마찬가지로 상당히 위험함
상승추세 제대로 탔다고 생각이 들어서 시드 풀매수하는 그런 경우를 말하고 싶은건데 이런 경우는 어느정도 감당할 수 있는 선에서 더 큰 이익을 위해 추가매매하는 경우는 필요하다고 생각함
하지만, 평단과 예상 손실등의 리스크를 철저하게 관리하면서 진행해야 함

5. 시장 상황과 시드 현황을 다시 한번 생각해보고 매매하기: 시드 & 레버리지 비율을 꼼꼼하게 설정해라

필자는 투자를 하며 팔랑귀처럼 아 저 유튜버는 매수 들어갔네 나도 들어가야겠다 하는 그런 성격은 아님
그럼 뭘 말하고 싶은거냐?
필자는 풀매수가 나쁘다고 생각이 들진 않음
손절 라인이랑 레버리지만 잘 설정하면 큰 수익을 보는 것은 자명하기 때문
하지만 내가 지금 시드를 어느정도 투입해야 하는 경우인지 정확하게 판단하고 매매를 해야된다는 것을 강조하고 싶음
흔히 강력한 상승추세 혹은 하락 추세를 예측하고 높은 비율의 시드/레버리지와 함께 분할 익절의 비중까지 줄이고 매매를 하는 것을 "스윙"이라고 함
이 때 내가 스윙을 해야되는 상황인지 낮은 시드와 레버리지로 짧게 짧게 먹어야 하는지는 당연히 시장상황에 달려있음
물론 자연재해과 같은 예외적인 경우는 어쩔 수 없지만 이 경우에 대해서 명확한 기준을 갖고 매매를 진행해야 더 좋은 투자가 될 것이라는 부분을 강조하는 것임
또한, 내 시드 현황도 생각하고 매매해야 함: 시드 비중과 레버리지 비율에 따라 청산가, 주가 상승과 하락에 대한 수익과 손실의 비율이 달라지게 됨
내 시드 현황에 맞게 감당할 수 있는 정도로 설정하는 것이 중요함

6. 분할 익절은 필수다: 먹여줄 때 먹어라

아무리 상승 추세가 강하거나 하락 추세가 강하다고 해도 주가는 어떤 예외적인 상황이 나올지 모름
필자의 매매일지 #1, #2를 보면 알 수 있겠지만 비트코인의 상승추세를 보고 롱 포지션을 잡으면서 수익이 예상한 것 보다 많이 발생하고 있던 상황에 바이비트 거래소 해킹 논란 + 중국 코로나 ver2 + 중국 관세 이슈가 한꺼번에 터지며 3일동안 오른 주가가 5시간 가량만에 빠지면서 본절이 터져버린 경험을 했음
이처럼 어떤 상황이 나올지 모르는 상황에 변동성이 큰 코인 단타를 친다면 분할 익절은 필수라는 결론을 냈음

7. 내 성격을 파악해라: 내 자신을 가장 견재해라

개인 투자는 나와 다른 투자자들과의 싸움이 아님
또한, 나와 세력과의 싸움도 아님
우린 그저 세력의 등에 업혀서 콩고물을 조금씩 모아간다고 생각해야 함
개인 투자는 나 자신과의 싸움임
손실을 볼 때는 내 충동성, 짜증과 같은 부정적인 감정을 억제하는 동시에 수익을 볼 때는 들뜨는 마음을 억제하며 의심하는 습관을 들여야 계좌는 우상향함
내 성격적이 부분이 어떤 면에서 가장 취약한지를 파악하고 항상 리마인드해야함
필자 본인이 가장 견제하는 부분은 손해를 보기 싫어하는 성격임
매매를 짧게나마 진행하며 내 성격에 대해 파악할 수 있었는데, 추가 매매를 막 들어가거나 감당하지 못하는 레버리지를 사용하는 무모한 성격은 아니라 긍정적이지만 손해를 극도로 보기 싫어하기 때문에 추격매매를 주의해야겠다는 결론을 낼 수 있었음

8. 나만의 룰을 정해라: 머리가 뜨끈해질 때를 조심해라

위에서 언급한 것 처럼 필자는 개인 투자가 나와의 싸움이라고 생각함
돈과 관련된 내 감정을 컨트롤한다는 것은 정말 쉽지 않을 일임
따라서, 나만의 룰을 정하는 것이 좋다고 생각이 들었음
예를 들자면, 매매 횟수를 하루 2번 정도로 제한한다거나 손절을 두 번 했다면 그 날은 차트를 끄고 쉰다거나 하는 내 돈은 물론이고 내 멘탈과 건강한 투자 생활을 지킬 수 있는 방법을 구축하는 것은 필수적임

Last Updated at: 2025.02.22

[매매일지] 1. 시장 수업료로 뱉은거 100% 멘징 + 비트 제대로 복수 완료??

dongsunseng — Sat, 22 Feb 2025 15:51:34 +0900

기존 코인 투자 포스트들은 기초 내용들을 다뤘지만 이제부턴 매매일지를 꾸준히 작성해보려고 함

일단 필자는 2월 14일부터 실제 돈으로 투자하기 시작한 초보자임

전 한달 반 가량 모의투자와 투딩 챌린지 참여, 여러 강의 및 시황 분석 리딩을 통해 공부했음

예금으로 묶어뒀던 3000달러를 투입했고, 당연히 처음에는 총 시드 1000 달러 정도만 사용함

그 결과 4일에 걸쳐서 200 달러를 잃었음(이 부분은 다다음 포스트에서 다룰 예정)

이후 다시 포지션을 잡았고 이번에는 느낀 점이 많아 포스트를 작성하려고 함

먼저, 200 달러 가량을 날린 후 알트 코인은 접고 일단 비트코인 차트만 열심히 보기로 함 (당연히 알트만 했던 것은 아님)

일단 개인적으로는 일봉과 4시간봉 추세 분석과 다우 이론에 따라서 상승 추세 (근거#1)를 보고 있었음

롱 포지션을 잡던 중에 이번에는 올라가야되는데? 하는 상황이 몇 번 발생하였지만 내 손절가에 맞춰서 털고 살짝 반등하던 것이 반복되서 지치던 상황이었음(이런 식으로 수 차례에 걸쳐서 200달러 털림)

2월 18일부터 진행된 5분봉에서 쓰리마켓 패턴을 발견했고 이번에는 가겠지 라는 생각으로 (근거#2) 다시 포지션을 잡아보기로 함

아래 차트 이미지에 보면 초록색 부분은 강한 매물대로 작용한 것을 볼 수 있음: 5번의 지지를 받은 후에 강한 반등을 예측했음(근거 #3)

결과론적으로 보면 곡선 추세에 저항을 한 번 받고 다시 매물대에 저항을 한 번 더 받은 후에 상승하긴 했음

이 때 강의를 들었던 강사분의 시드를 빠르게 불려야 할 때의 매매법을 참고해서 400달러로 고배(30배 레버리지)를 사용해서 짧게 짧게 상승 추세를 먹어보기로 함

진입 시점은 위에서 보이는대로 2/18 아침 9시반

쓰리마켓 패턴과 함께 두 개의 주황색 추세선, 파란색 곡선 추세선을 그렸었는데 곡선 추세선을 뚫기 전에 진입함

전 고점들의 유동성들을 지지 저항선으로 그려뒀었는데 98,000 라인은 넘기고 익절한다는 생각으로 2.76%를 먹는 라인을 익절 라인, 손절 라인은 전에 지지 저항이 일어났던 부분을 보수적으로 잡았음(손실을 더이상 보기 싫었음...)

이렇게 잡았더니 손익비는 3.86 정도

익절 라인을 공격적으로 잡았던 이유는 이전에 3번 정도를 이번에는 무조건 상승할거라고 생각했지만 밑으로 고꾸라진 경우가 있었어서 만약 이번에 상승 추세로 돌파한다면 100,000까지도 갈 수 있지 않을까 싶었음

https://www.youtube.com/watch?v=-E9kIZA1tcE

위 유튜브는 내가 들었던 강의의 강사님이 운영하는 유튜브 채널임

같은 상승 추세를 보고있어서 참고중이었는데 이 분도 내가 진입하고 나서 상승 추세가 맞다고 생각한다는 영상을 올리셔서 첨부해봄

주요 내용은 이러함:

24년 7월이랑 패턴이 비슷함: 1. 청산빔 2. 강한 반등 3. 꾸득꾸득 내려버림 4. 786 반등 5. 상승 추세 전환
위의 초록색 박스로 표시된 매물대에서 3번의 지지를 받은 후에 개미들이 아 여긴 강력한 지지구간이구나 라고 인식하는 차에 바로 하방 이탈해버린 후에 바로 다시 회귀함
- 개미들은 수차례 지지를 받던 구간이 이탈을 해버렸으니 이제는 강한 저항구간이 되겠구나 라고 생각하고 해당 매물대에서 숏 포지션을 많이 잡았을 것임
- 하지만 세력들은 뻔한 저항구간에서 개미들을 먹여주지 않기 때문에 숏 포지션들을 청산시키는 강한 상승이 나올 것이라고 예상함
oi: 미체결 약정

이번 롱 포지션을 잡으면서 밤낮도 아예 바꿨고 밤 새면서 차트만 봤음

이번에는 무조건 먹고 싶다는 오기가 발동하기도 했고, 승률보다는 손익비가 중요하다지만 너무 자주 지는 것도 결국 좋은 것은 아니라고 생각했음

전에 했던 모의투자와는 무게감이 다른 상태에서 밤새 차트를 보며 분석하다 보니까 실력이 빠르게 향상된다는 것을 나도 느끼고, 전에는 특정 캔들에서 위로 올라갈지 내려갈지 감이 아예 안왔지만 이제는 여기서는 반등하지 않을까? 여기서는 단기 하방을 띌거같은데? 정도의 생각은 들 정도로 발전했음

포지션을 종료하진 않았지만 중간 결과:

P&L: +1500달러를 달리는 중임(99k 정도 기준)
96.7k 부근 등에서 부분 익절을 하지 않았던 이유:
1. 욕심(시드를 빨리 불리고 싶은 마음)
2. 곡선 추세를 이탈한 후에 처음 채널까지 이탈했을 때 몇 틱 차이로 못 찍고 다시 하락함: 이미 추세선을 기준으로 상당히 많이 비빈 구간이기 때문에 확실한 상승이 나온다면 큰 상승일거라는 생각(위의 첨부한 유튜브 차설님의 관점을 참고했음)

차트를 보며 배운 점들 정리:

일단 추세, 지지, 저항, 다우 이론 이렇게 4가지가 기법들, 패턴들 보다 선행되어야 함
저정도 지식만 갖고도 잘 활용할 줄만 안다면 충분히 돈 벌 수 있을 거라고 생각됨
따라서, 저 내용들에 대한 깊은 학습(+실전 경험)이 필요할듯
전체적인 추세 및 변곡 파악이 가장 중요
볼린저 밴드가 이번 포지션만 놓고 봤을 때 상당히 신뢰도가 높았음
사실 아직 매매 경험이 부족하기도 하고 익절, 손절 라인 설정 같은 부분이 미숙했음
큰 프레임에서 봤을 때 상승이 나올거라는 것은 자명하다고 생각이 들었기에 상승이나 하락이 나오면 손절 라인을 조정하는 일이 빈번하게 일어났음
이런 경우 그냥 손절 라인을 명확하게 설정한 후 손절이 나면 털어버리고 다음 포지션을 잡는 것이 이성적으로는 맞다고 생각이 들긴 하지만, 무조건 먹을 수 있다는 확신이 상당히 강하게 든 이번 포지션같은 경우에는 이렇게 하는 것이 지금까지 결론적으로 봤을 때는 좋았음
좀 더 경험을 쌓으면서 내 기준을 설정하는 것이 중요할듯
다시 볼린저 밴드로 돌아가서, 손절 라인 및 익절 라인을 막 조정하면서 가장 도움을 받았던 것이 볼린저 밴드였음
5분봉 같이 단기봉으로 차트를 보면서 지금 상승이 나올지 하락이 나올지 판단을 했어야 되는 상황에 볼린저 밴드가 상단이 먼저 꺾이는지 하단이 먼저 꺾이는지를 봐야함
이후에 해당 추세가 이어지는지 전환이 이루어질지도 볼린저 밴드가 어딜 향하는지를 보는 것이 큰 도움이 됨
자세한 내용은 이 포스트 참고: https://dongsunseng.com/entry/%EC%BD%94%EC%9D%B8-%ED%88%AC%EC%9E%90-14-%EB%B3%BC%EB%A6%B0%EC%A0%80-%EB%B0%B4%EB%93%9C
또한, 엘리어트 파동도 많은 도움이 되었던 것 같음
엘리어트 파동을 처음 배웠을 때는 아주 기초만 배우고 깊은 내용들은 초보자가 배우기 어렵다고 넘어갔었는데 조금은 더 공부해보고 싶은 마음이 생겼음
아주 기초인 충격 파동과 조정 파동만 읽어도 갑자기 하락이 나와도 단기 하방이라고 생각이 들기 때문에 심리적 안정감 측면에서 도움이 되었던 것 같음
마지막으로, 처음 코인 투자를 시작하면 어느 시점에서 어떤 분봉을 봐야되고 이런 부분들이 감이 아예 안잡힐 수 있음(내가 그랬음)
이건 실제로 신경이 쓰일 정도가 되는 금액으로 포지션을 잡아보면 감을 더 빨리 잡을 것 같음
내 피같은 돈이 날아갈지 힘들게 분석한 내 노력이 성과를 볼지에 대한 문제이기 때문에 1분봉부터 시작해서 5분봉 15분봉 30분봉 1시간봉 4시간봉 12시간봉 날봉을 막 돌아가면서 보게 될것임
그러다보면 어? 현 상황에선 4시간봉이 양봉 마감하는 것이 중요하겠구나 혹은 1시간봉부터 4시간봉 날봉이 모두 상승 추세를 띄니까 5분봉 15분봉 정도 체크하면서 단기 하방이 크게 나올 것만 주의하면 되겠구나 이런 부분들이 보이기 시작할 것임
진짜 감이 안 잡힌다하는 경우에는 유튜버 혹은 트레이딩 강사분들의 시황 공유 텔레그램 혹은 카톡방에 들어가서 이 사람은 이렇게 생각하는구나를 참고하면서 매매하는 것도 큰 도움이 됨
이 때 주의해야할 점은 그 사람들은 내가 청산을 당하던 말던 상관이 1도 없음
내 매매는 철저하게 나한테 책임이 있다는 것을 명심해야 하고 한 사람의 말에만 휩쓸리지 않기 위해서는 그런 커뮤니티를 여러 개 가입한 후 의견들을 비교해가며 내 분석과 매칭해보는 것이 합리적인 방법이라고 생각이 듦

차트를 보며 느꼈던 점들 정리:

그냥 어? 이 패턴이네 하고 포지션 잡는 것은 물론 맞을 수도 있겠지만 똑똑한 투자자가 되는 길은 아닌 듯함
여기서 다른 개미들은 어떤 판단을 했을까? 얼마나 털렸을까? 이런 부분들을 함께 생각해야 함
또한, 패턴이나 기법을 무작정 외우지말고 왜 이런 패턴에서는 큰 상승이 나오고 저런 패턴에서는 큰 하락이 나오는지 그 이유를 "이해"하고 매매하는 것이 중요해보임
기법, 패턴, 포지션 타점 모두 중요하지만 나한테 적진 않은 돈인 200 달러를 수 차례에 나눠서 (더 짜증남) 내면서 느낀 점은 리스크 관리가 제일 제일 중요함
과거에 이런 패턴이 있었는지를 확인하는 것도 아주 중요함
주의해야 하는 것은 해당 패턴이랑 비슷했던 적이 있었다고 똑같이 진행되겠지라고 생각하는 것 보단 항상 페이크 아웃이나 비슷한 패턴으로 학습된 개미들을 털어버릴려는 세력의 심산이 아닌지 경계해야 함
위의 주황색 추세선 두 개는 아주 강한 추세선이라고 생각하고 그은 부분임(실제로 아주 강했죠)
캔들을 보면 알겠지만 아주 역겨울 정도로 평행 채널과 두 개의 강한 추세선에게 저항 및 지지를 받으며 오르락 내리락 한 것을 볼 수 있음
이럴 때 내가 생각한 관점이 맞다고 생각하면 관점을 고수하는 멘탈이 필요함

반성할 점들:

처음에는 400달러 고배로 짧게 짧게 먹어보자 라고 생각했던 것이 상승 추세가 쭉 이어지니까 손절 라인만 조정하며 홀딩하는 방식으로 변했음
처음부터 이 점을 파악했으면 시드를 더 넣고 레버리지를 줄이는게 낫지 않았을까 생각됨
사실 이 점이 가장 반성해야할 점인데, 곡선 추세가 상승 이탈하는 부분을 확인하고 1000달러 가량을 추가 매수함
사실 내가 본 관점이 확실하다는 생각이 들어서 추가 매매를 한다 이것만 봤을 때는 합리적으로 보이지만 그럼 1400달러 30배 레버리지 매매를 해버린 것인데 상당히 충동적이었고, 결론은 좋았지만 23살의 나이한테는 상당히 큰 돈을 때려넣어버렸다는 것은 반성하고 조정해야할 부분임이 확실함
코인 투자를 해야겠다고 마음 먹은 계기는 당연히 뭐 큰 돈 벌고 싶다 관심이 오래전부터 있었다 이런건 너무 당연한 이야기이고, 강의를 들으면서 강사님이 가장 중요하게 강조했던 심법에 대해 자신이 있었기 때문이었음
평소에도 업다운도 없고 이성적이라고 생각했던 내가 이런 행동을 했다는 것이 스스로 좀 무섭기도 한데, 앞으로는 매매 기준을 잘 정해서 내가 감당할만큼만 매매해야겠음

이 바닥 심법이 전부다..

CIBMTR - Equity in post-HCT Survival Predictions #14 Feature Engineering Ideas

dongsunseng — Mon, 10 Feb 2025 21:07:41 +0900

Annotation of a discussion post about feature engineering ideas:

https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550863

CIBMTR - Equity in post-HCT Survival Predictions

Improve prediction of transplant survival rates equitably for allogeneic HCT patients

www.kaggle.com

Feature Engineering Ideas

Hi everyone! My current best CV score and LB score (CV 0.683 and LB 0.688) is just ensemble and/or stack various models (without feature engineering).
Each model is trained with different targets and different losses.
I have not performed any feature engineering or data augmentation or external data yet.

Categorical vs. Numerical

Let's discuss feature engineering.
This dataset has 35 categorical features and 22 numerical features (for total 57 features).
However 55 features look like categorical with few unique values.
Only donor_age and act_at_hct look like true numerical with many continuous values.
It is true that 20 of the numerical features may be ordinal (which means that the order of values matters), but for my NN treating all features (except two ages) as categorical worked best.
So we could treat all (except two ages) as categorical and combine them creatively.

Feature Engineering

Let's have a discussion about which feature engineering to try.
One technique is to combine columns with train["new"] = train["col1"].astype("str") + "_" + train["cols2".astype("str").
Then we have a new categorical feature.
We can even combine 3, 4, 5, etc columns.
When we do this the cardinality increases, so we can try advanced techniques like target encoding, count encoding, etc to process the new high cardinality feature
Another idea is to try mathematical combinations like train["new"] = function( train["col1"], train["cols2"] ).
Here the function could just multiply the columns or it can do more advanced techniques like taking a product of the logs OR takes the difference etc etc.

Data Augmentation

Another idea to boost CV and LB is data augmentation.
With tabular data and GBDT, one way to perform data augmentation is to make copies of the train data.
Then for each copy, we can augment (i.e. modify change) the data.
Then we concatenate all the copies and train a GBDT on the new concatenated dataframe.

External Data

Another thing that helps is external data.
Has anyone found any good external data sets?

Recursive Feature Reduction

Another idea is to remove each feature one by one and see if CV score and LB score increases.
Sometimes there are features whose presence hurts CV score and LB score.

Model Hyerparameter Optimization

It is true that optimizing each model's hyperparameters in our ensemble will boost our overall CV score and LB score, but at this time I am more interested in discussing feature engineering.
So let's discuss feature engineering!

Let's Discuss

Let's discuss ideas that we are trying to improve CV and LB score.
So far the only public notebooks are using different models, but nobody has suggested or tried ways to modify, change, or increase the data.

Comments:

뛰어나고 훌륭하게 시작할 필요는 없다. 그러나 훌륭하기 위해서는 시작해야 한다.
- 지그 지글러 -

CIBMTR - Equity in post-HCT Survival Predictions #13 How to make sense of the race group distribution in the data?

dongsunseng — Mon, 10 Feb 2025 20:27:18 +0900

https://dongsunseng.com/entry/CIBMTR-Equity-in-post-HCT-Survival-Predictions-11-ESP-EDA-which-makes-sense-%E2%AD%90%EF%B8%8F%E2%AD%90%EF%B8%8F%E2%AD%90%EF%B8%8F%E2%AD%90%EF%B8%8F%E2%AD%90%EF%B8%8F-AFT-Loss-func-sol-1

CIBMTR - Equity in post-HCT Survival Predictions #11 ESP EDA which makes sense ⭐️⭐️⭐️⭐️⭐️ (AFT Loss func sol

Annotation post about AFT loss function solution:https://www.kaggle.com/code/ambrosm/esp-eda-which-makes-sense ESP EDA which makes sense ⭐️⭐️⭐️⭐️⭐️Explore and run machine learning code with Kaggle Notebooks | Using data from CIBMTR - E

dongsunseng.com

From my other blog post, we discussed about

This blog is about the "further discussion": https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550302

CIBMTR - Equity in post-HCT Survival Predictions

Improve prediction of transplant survival rates equitably for allogeneic HCT patients

www.kaggle.com

How to make sense of the race group distribution in the data ?

Counting values of race groups I get the following:

Having worked on the topic of equity for sensitive applications, I have found one of the main problem to be imbalance in data of interest.
Typically some less represented races will end up with wider estimates.
However the data at hand seems to have been resampled (or generated as balanced).
While this can be achieved on real data by downsampling the majority class, it usually kills representativeness of the population.
I am concerned a model optimised with this metric on this balanced dataset would perform worse on real life 'race imbalanced' data.
How does 'race-balancing' the dataset make sense in an equity competition ?

Comments:

Maybe the idea behind balanced, synthetic data is to accentuate differences in risk prediction due only to the available features, by taking imbalance out of the problem.
- By eliminating racial imbalances in the actual data, one can more clearly see differences in risk predictions that are "purely attributable to available features"
- This allows for more accurate evaluation of actual prediction performance differences rather than differences in population ratios
This could suggest a need for additional predictors if certain groups are more poorly predicted.
- If predictions are less accurate for certain groups, this could indicate that current features don't adequately explain those groups
- This could signal the need for additional predictors that better characterize these groups

완벽하려고 미루는 것보다 지속적으로 고쳐나가는 것이 낫습니다.
- 마크 트웨인 -

CIBMTR - Equity in post-HCT Survival Predictions #12 Deep understanding of (C-index) evaluation measure for better model

dongsunseng — Mon, 10 Feb 2025 20:11:35 +0900

Annotation of this discussion: https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550152

CIBMTR - Equity in post-HCT Survival Predictions

Improve prediction of transplant survival rates equitably for allogeneic HCT patients

www.kaggle.com

Deep understanding of (C-index) evaluation measure for better model

I will try to explain the C-index evaluation measure of the this competition in order to train the model well because 75% of the data is not included in the test data so understanding of the measure is very important.

Lets start with three patients groups:

Group A
Group B
Group C

For each patient, we will predict risk score (higher score means higher risk of early event).

Step 1: Understanding Concordance Index

The Concordance Index (C-index) evaluate how well the model ranks survival times.

Understand with sample data:

Group A has 3 patients with actual survival times and predicted risk scores:

Comparable pairs:

(P1, P2): P2 has a shorter survival time and a higher risk score → Concordant ✅
(P1, P3): P3 has a longer survival time and a lower risk score → Concordant ✅
(P2, P3): P3 has a longer survival time and a lower risk score → Concordant ✅

Total pairs = 3
Total concordant pairs = 3

C-index for Group A = Concordant pairs/Total pairs= 3/3 = 1.0

Step 2: Calculate C-index for All Groups

Repeat the process for all groups.

For now we can assume:

Group A: C-index = 1.0
Group B: C-index = 0.8
Group C: C-index = 0.6

Step 3: Stratified Concordance Index

The Stratified Concordance Index combines the C-index scores of all groups and focusing on the following:

Average performance across groups (mean of C-indices).
Consistency across groups (low standard deviation of C-indices).

Formula:

Stratified C-index = Mean(C-index scores) - Standard Deviation(C-index scores)

Calculate the mean:
Mean=1.0 + 0.8 + 0.6/3 = 0.8
Calculate the standard deviation:
Standard Deviation= sqrt((1.0-0.8)^2 + (0.8-0.8)^2 + (0.6-0.8)^/3) = 0.16
Stratified C-index:
Stratified C-index = 0.8 - 0.16 = 0.64

Step 4: Interpret the Results

A high Stratified C-index means:

The model predicts well overall (high mean C-index).
The model predicts equitably across racial groups (low standard deviation).

Finally we can say:

Group A predictions are perfect (C-index = 1.0).
Group B is decent (C-index = 0.8).
Group C struggles (C-index = 0.6).

The Stratified C-index = 0.64 showing that while predictions are good overall, the model is less consistent across groups.

실패를 미리 두려워할 필요는 없다.
- 버트런드 러셀 -

CIBMTR - Equity in post-HCT Survival Predictions #11 ESP EDA which makes sense ⭐️⭐️⭐️⭐️⭐️ (AFT Loss func sol #1)

dongsunseng — Mon, 10 Feb 2025 19:46:34 +0900

Annotation post about AFT loss function solution:

https://www.kaggle.com/code/ambrosm/esp-eda-which-makes-sense

ESP EDA which makes sense ⭐️⭐️⭐️⭐️⭐️

Explore and run machine learning code with Kaggle Notebooks | Using data from CIBMTR - Equity in post-HCT Survival Predictions

www.kaggle.com

Equity in survival predictions: EDA which makes sense

This notebook shows

An exploratory data analysis
Survival functions and how they differ among race groups
Three types of models: Cox proportional hazards, accelerated failure times, and transformed target models
Cross-validation with metrics per race group

References

Competition: CIBMTR - Equity in post-HCT Survival Predictions
Wikipedia article which describes censoring, survival functions, cumulative hazard etc.
Libraries: scikit-survival, lifelines

%%time
try:
    from lifelines.utils import concordance_index
except ModuleNotFoundError:
    print('Installing lifelines...')
    !pip install -q /kaggle/input/pip-install-lifelines/autograd-1.7.0-py3-none-any.whl
    !pip install -q /kaggle/input/pip-install-lifelines/autograd-gamma-0.5.0.tar.gz
    !pip install -q /kaggle/input/pip-install-lifelines/interface_meta-1.3.0-py3-none-any.whl
    !pip install -q /kaggle/input/pip-install-lifelines/formulaic-1.0.2-py3-none-any.whl
    !pip install -q /kaggle/input/pip-install-lifelines/lifelines-0.30.0-py3-none-any.whl

import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator, FormatStrFormatter, PercentFormatter
import numpy as np
import xgboost
import catboost
import warnings
from lifelines import CoxPHFitter, KaplanMeierFitter
from lifelines.utils import concordance_index
from scipy.stats import rankdata

from sklearn.pipeline import make_pipeline
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import OneHotEncoder, quantile_transform, FunctionTransformer, PolynomialFeatures, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

all_model_scores = {}

Reading the data

We read the data and observe:

The training dataset has 59 columns, many of which are categorical and have missing values.
Two columns are missing from the test dataset: efs and efs_time. These two columns together make up the target.

train = pd.read_csv('/kaggle/input/equity-post-HCT-survival-predictions/train.csv', index_col='ID')
test = pd.read_csv('/kaggle/input/equity-post-HCT-survival-predictions/test.csv', index_col='ID')
data_dictionary = pd.read_csv('/kaggle/input/equity-post-HCT-survival-predictions/data_dictionary.csv')
train.tail()

features = [f for f in test.columns if f != 'ID']

cat_features = list(train.select_dtypes(object).columns)
train[cat_features] = train[cat_features].astype(str).astype('category')

race_groups = np.unique(train.race_group)

features = [f for f in test.columns if f != 'ID']
- Select all columns from test dataset except 'ID'
- Create a list of features that will be used for actual modeling
cat_features = list(train.select_dtypes(object).columns)
- Select columns with dtype 'object' from train data
- This process finds categorical variables in string format
train[cat_features] = train[cat_features].astype(str).astype('category')
- First convert the selected categorical variables to string (str)
- Then convert them to category type
- This is a preprocessing step for memory efficiency and modeling
race_groups = np.unique(train.race_group)
- Extract unique values from the 'race_group' column in train data
- This can be used for race-based analysis or stratification

Race group distribution

In the training data, there are six race groups with about 4800 samples each.
Because in no country of the world these six race groups occur with equal frequencies, we know that some of the groups have been upsampled or downsampled in the dataset.
See this post for further discussion.
- Annotation post can be found from my other blog post

vc = train.race_group.value_counts()
plt.pie(vc, labels=vc.index)
plt.show()

The weirdness of the age distribution

There are only two features with continuous data: donor age and patient age.
The patient age histogram shows that the patient age distribution has five modes(최빈값).
Such a distribution is highly unnatural — it must be an artefact of the synthetic data generation.

plt.figure(figsize=(12, 3))
plt.subplot(1, 2, 1)
plt.hist(train.donor_age, bins=50, color='skyblue')
plt.title('Donor age histogram')
plt.xlabel('donor_age')
plt.ylabel('count')
plt.subplot(1, 2, 2)
plt.title('Patient age histogram')
plt.hist(train.age_at_hct, bins=50, color='skyblue')
plt.xlabel('age_at_hct')
plt.tight_layout()
plt.savefig('a.png')
plt.show()

My first thought was that different race groups had different modes, but the patient age distribution has the same five modes in every race group:

_, axs = plt.subplots(3, 2, sharex=True, sharey=True, figsize=(12, 9))
for race_group, ax in zip(race_groups, axs.ravel()):
    ax.hist(train.age_at_hct[train.race_group == race_group],
            bins=np.linspace(0, 74, 38),
            color='skyblue', alpha=0.5)
    ax.set_title(f'Patient age histogram for {race_group}')
    ax.set_xlabel('age_at_hct')
    ax.set_ylabel('count')
plt.tight_layout()
plt.savefig('b.png')
plt.show()

Even stranger: The age of 0.044 years (i.e., 16 days) occurs 1005 times in the training dataset, whereas every other age occurs at most six times.
Is hematopoietic cell transplantation a treatment which is often done to newborns? Possible.
But I can't believe that these babies are all treated exactly when they are 16 days old.

train.age_at_hct.value_counts().sort_values(ascending=False).head()

The target

The prediction target consists of two parts:

efs_time, always positive, is a time, measured in months.
efs, always zero or one, indicates the presence or absence of an event:
- efs=1 means "patient died exactly at time efs_time.
  - actually not "died" but event occurred is the right expression
- efs=0 means "patient still lives at time efs_time; in other words, "patient dies at an unknown time strictly greater than efs_time"

This situation is called "censored data": Samples of which we know the time of death are uncensored, and if we only know a lower bound for the time of death, the sample is (right-)censored.
Censoring is the main reason that this competition has a special metric and that we need special models.
The competition is a regression task, but we know y_true for only half the samples.
For the other (censored) half, all we know is lower bounds for y_true.
One cannot compute a squared error based on y_true > 100 and y_pred == 120.
RMSE and similar metrics cannot deal with that.
By the way, the column name is misleading: If a column is called "event-free survival", I'd expect that 0 means "patient died" and 1 means "patient lives", but that's wrong.
The data have been obfuscated(애매함).
efs_time is a float with three digits after the decimal point, and I don't think that events such as the death of a patient are recorded with such an exact timestamp.
A histogram of the target values shows that half the patients die within 20 months after the transplantation; but the other half, who survives the first 20 months, has a high probability of living much longer.

plt.figure(figsize=(6, 3))
plt.hist(train.efs_time[train.efs == 0], bins=np.linspace(0, 160, 41), label='efs=0: patient still lives at this time', alpha=0.5)
plt.hist(train.efs_time[train.efs == 1], bins=np.linspace(0, 160, 41), label='efs=1: patient dies at this time', alpha=0.5)
plt.legend()
plt.xlabel('efs_time')
plt.ylabel('count')
plt.title('Target histogram')
plt.show()

Survival function and cumulative hazard function

The survival function shows how many patients survive for how long (Wikipedia: Kaplan–Meier estimator).
At month 0, 100 % of the patients live.
At month 20, only 40 % – 60 % remain, depending on their race group.
Patients with "more than one race" have the highest probability of survival, whites the lowest.
For those who are used to working with cumulative density functions (cdf) of probability distributions, the survival function is nothing else than a top–down mirrored cdf of the time-of-event probability distribution.
- # CDF (Cumulative Distribution Function)
  - Probability that an event occurs by time t
  - Starts at 0 and increases upward (0 → 1)
  # Survival Function
  - Probability of surviving beyond time t
  - Starts at 1 and decreases downward (1 → 0)
  # Relationship
  S(t) = 1 - F(t) where F(t) is CDF

The cumulative hazard is another representation of the same facts; it corresponds to the negative logarithm of the survival function (Wikipedia: Nelson–Aalen estimator).
- H(t) = -log(S(t))
  where:
  - H(t): Cumulative hazard function
  - S(t): Survival function
  - log: Natural logarithm
  Characteristics:
  - H(t) starts at 0 and increases
  - As S(t) decreases, H(t) increases more rapidly
- # Values over time
  Time(t) | S(t) | H(t) = -log(S(t))
  0 | 1.0  | 0
  10 | 0.8  | 0.223
  20 | 0.6  | 0.511
  30 | 0.4  | 0.916
  40 | 0.2  | 1.609
- These two functions express the same information in different ways:
  - Survival function: Directly shows survival probability
  - Cumulative hazard: Shows accumulated risk on a log scale
- These different representations are useful for emphasizing or analyzing different aspects of the data.

# You can use library functions or write the few lines of code yourself
# !pip install -q scikit-survival
# from sksurv.nonparametric import kaplan_meier_estimator, nelson_aalen_estimator
# from lifelines import KaplanMeierFitter

def survival_function(df):
    survival_df = df[['efs', 'efs_time']].groupby('efs_time').agg(['size', 'sum']).droplevel(0, axis=1).astype(int)
    survival_df['n_at_risk'] = survival_df['size'].sum() - survival_df['size'].shift().fillna(0).cumsum().astype(int)
    hazard = survival_df['sum'] / survival_df['n_at_risk'] 
    survival_df['cumulative_hazard'] = np.cumsum(hazard) # nelson_aalen_estimator
    survival_df['survival_probability'] = (1 - hazard).cumprod() # kaplan_meier_estimator
    return survival_df

plt.figure(figsize=(6, 8))

plt.subplot(2, 1, 1)
survival_df = survival_function(train)
plt.step(survival_df.index, survival_df['survival_probability'], c='k', where="post", label='[Overall]')
plt.xlabel('efs_time')
for race_group in race_groups:
    subset = train.query('race_group == @race_group')
    survival_df = survival_function(subset)
    plt.step(survival_df.index, survival_df['survival_probability'], where="post", label=race_group)
plt.xlabel('efs_time')
plt.legend(loc='upper right')
plt.title('Survival function (Kaplan–Meier) by race group')
plt.gca().yaxis.set_major_formatter(PercentFormatter(xmax=1, decimals=0)) # percent of xmax

plt.subplot(2, 1, 2)
survival_df = survival_function(train)
plt.step(survival_df.index, survival_df['cumulative_hazard'], c='k', where="post", label='[Overall]')
plt.xlabel('efs_time')
for race_group in race_groups:
    subset = train.query('race_group == @race_group')
    survival_df = survival_function(subset)
    plt.step(survival_df.index, survival_df['cumulative_hazard'], where="post", label=race_group)
plt.xlabel('efs_time')
plt.legend(loc='lower right')
plt.title('Cumulative hazard (Nelson–Aalen) by race group')

plt.tight_layout()
plt.show()

Cross-validation

This competition is about equity in the predictions.
This means that we score the predictions per race group and then derive the final score from these six sub-scores.
As the official implementation of the competition metric doesn't output the scores per race group, I've written my own implementation, which gives more transparency.
There are two main methods for survival analysis (the proportional hazards model and the accelerated failure time model), and both are implemented in XGBoost and in CatBoost.
The calling conventions are a bit unusual.
We present the cross-validation of six models:

Proportional hazards model (Cox regression) with XGBoost
- This model expects that the two target columns be combined into one (y = np.where(train.efs == 1, train.efs_time, -train.efs_time), negative target values are considered right censored)
Proportional hazards model with CatBoost.
- This model expects the targets in the same format as the XGBoost Cox model.
Accelerated failure time model with XGBoost.
- This model expects the lower and upper bounds for the target in a special form in a DMatrix.
Accelerated failure time model with CatBoost.
- This model expects the lower and upper bounds for the target in the form of a two-column array.
Proportional hazards model with a linear implementation.
- This model expects time and event columns in a dataframe.
MSE regression model with three different target transformations.

You'll find a comparison of the cv scores of these models at the end of the notebook.

Some hyperparameters have been taken from other public notebooks.

# from metric import score # This is the official metric which we don't use here

kf = StratifiedKFold(shuffle=True, random_state=1)

def evaluate_fold(y_va_pred, fold):
    """Compute and print the metrics (concordance index) per race group for a single fold.

    Global variables:
    - train, X_va, idx_va
    - The metrics are saved in the global list all_scores.
    """
    metric_list = []
    for race in race_groups:
        mask = X_va.race_group.values == race
        c_index_race = concordance_index(
            train.efs_time.iloc[idx_va][mask],
            - y_va_pred[mask],
            train.efs.iloc[idx_va][mask]
        )
        # print(f"# {race:42} {c_index_race:.3f}")
        metric_list.append(c_index_race)
    fold_score = np.mean(metric_list) - np.sqrt(np.var(metric_list))
    print(f"# Total fold {fold}:{' ':29} {fold_score:.3f} mean={np.mean(metric_list):.3f} std={np.std(metric_list):.3f}")
    all_scores.append(metric_list)

def display_overall(label):
    """Compute and print the overall metrics (concordance index)"""
    df = pd.DataFrame(all_scores, columns=race_groups)
    df['mean'] = df[race_groups].mean(axis=1)
    df['std'] = np.std(df[race_groups], axis=1)
    df['score'] = df['mean'] - df['std']
    df = df.T
    df['Overall'] = df.mean(axis=1)
    temp = df.drop(index=['std']).values
    print(f"# Overall:                                   {df.loc['score', 'Overall']:.3f} {label}")
    all_model_scores[label] = df.loc['score', 'Overall']
    display(df
            .iloc[:len(race_groups)]
            .style
            .format(precision=3)
            .background_gradient(axis=None, vmin=temp.min(), vmax=temp.max(), cmap="cool")
            .concat(df.iloc[len(race_groups):].style.format(precision=3))
           )

%%time
# XGBoost Cox regression
y = np.where(train.efs == 1, train.efs_time, -train.efs_time)
all_scores = []
for fold, (idx_tr, idx_va) in enumerate(kf.split(train, train.race_group)):
    X_tr = train.iloc[idx_tr][features]
    X_va = train.iloc[idx_va][features]
    y_tr = y[idx_tr]
    
    xgb_cox_params = {'objective': 'survival:cox', 'grow_policy': 'depthwise', 
                      'n_estimators': 700, 'learning_rate': 0.0254, 'max_depth': 8, 
                      'reg_lambda': 0.116, 'reg_alpha': 0.139, 'min_child_weight': 23.8,
                      'colsample_bytree': 0.59, 'subsample': 0.7, 'tree_method': 'hist',
                      'enable_categorical': True}
    model = xgboost.XGBRegressor(**xgb_cox_params)
    model.fit(X_tr, y_tr) # negative values are considered right censored
    y_va_pred = model.predict(X_va) # predicts hazard factor
    evaluate_fold(y_va_pred, fold)
display_overall('Cox Proportional Hazards XGBoost')
# Overall:                                   0.670

%%time
# Catboost Cox regression
y = np.where(train.efs == 1, train.efs_time, -train.efs_time)
all_scores = []
for fold, (idx_tr, idx_va) in enumerate(kf.split(train, train.race_group)):
    X_tr = train.iloc[idx_tr][features]
    X_va = train.iloc[idx_va][features]
    y_tr = y[idx_tr]
    
    cb_cox_params = {'loss_function': 'Cox', 'grow_policy': 'SymmetricTree',
                     'n_estimators': 800, 'learning_rate': 0.092, 'l2_leaf_reg': 2.5,
                     'max_depth': 7, 'colsample_bylevel': 0.84, 'subsample': 0.9, 
                     'random_strength': 0.8, 'verbose': False}
    
    model = catboost.CatBoostRegressor(**cb_cox_params, cat_features=cat_features)
    model.fit(X_tr, y_tr)
    y_va_pred = model.predict(X_va) # predicts log of hazard factor
    evaluate_fold(y_va_pred, fold)
display_overall('Cox Proportional Hazards CatBoost')
# Overall:                                   0.669

%%time
# XGBoost Accelerated failure time model
all_scores = []

# Data split and preparation
for fold, (idx_tr, idx_va) in enumerate(kf.split(train, train.race_group)):
    # K-fold cross-validation stratified by race_group
    X_tr = train.iloc[idx_tr][features]
    X_va = train.iloc[idx_va][features]
    
    # Creating xgboost data matrix
    d_tr = xgboost.DMatrix(X_tr, enable_categorical=True)
    # Setting survival time information for AFT model
    d_tr.set_float_info('label_lower_bound', train.efs_time.iloc[idx_tr])
    d_tr.set_float_info('label_upper_bound', np.where(train.efs.iloc[idx_tr] == 0, np.inf, train.efs_time.iloc[idx_tr]))
    
    d_va = xgboost.DMatrix(X_va, enable_categorical=True)
    d_va.set_float_info('label_lower_bound', train.efs_time.iloc[idx_va])
    d_va.set_float_info('label_upper_bound', np.where(train.efs.iloc[idx_va] == 0, np.inf, train.efs_time.iloc[idx_va]))
    
    # Model parameters setting
    xgboost_aft_params = {'learning_rate': 0.08, 'max_depth': 4, 'reg_lambda': 3, 'aft_loss_distribution_scale': 0.9,
                          'reg_alpha': 0.24, 'gamma': 0.033, 'min_child_weight': 82.58861553592878,
                          'colsample_bytree': 0.5662198438953138, 'max_bin': 53, 'subsample': 0.7456329821182728, 
                          'objective': 'survival:aft', 'grow_policy': 'depthwise', 'tree_method': 'hist',
                          'aft_loss_distribution': 'normal'}
    # Model training
    bst = xgboost.train(xgboost_aft_params,
                        d_tr,
                        num_boost_round=300,
                        # evals=[(d_tr, 'train'), (d_va, 'val')],
                       )
                       
    # Prediction & Evaluation
    y_va_pred = - bst.predict(d_va) # model predicts time of death
    # Taking negative because: converting to risk score
    # Earlier death time means higher risk
    evaluate_fold(y_va_pred, fold)
display_overall('Accelerated Failure Time XGBoost')
# Overall:                                   0.664

d_tr = xgboost.DMatrix(X_tr, enable_categorical=True)
- Creates DMatrix, XGBoost's specialized data format
- enable_categorical=True: Automatically handles categorical variables
d_tr.set_float_info('label_lower_bound', train.efs_time.iloc[idx_tr])
- Setting survival time lower bound = label_lower_bound
- Sets observed time (efs_time) as lower bound for all patients
- Means the patient survived at least until this time, regardless of whether event occurred (efs=1) or was censored (efs=0)
d_tr.set_float_info('label_upper_bound', np.where(train.efs.iloc[idx_tr] == 0, np.inf, train.efs_time.iloc[idx_tr]))
- Setting survival time upper bound = label_upper_bound
- Splits into 2 cases:
  - efs=1 (event occurred):
    - upper_bound = efs_time
      # We know the exact time of death
  - efs=0 (censored):
    - upper_bound = np.inf (infinity)
      # We don't know when death occurred after last observation
Actually setting the upper and lower bound of survival time for validation data is unnecessary
Example:
- # Patient A: died on day 100 (efs=1)
  lower_bound = 100
  upper_bound = 100
  # Means death occurred exactly at 100 days
  
  # Patient B: censored on day 80 (efs=0)
  lower_bound = 80
  upper_bound = inf
  # Means survived at least 80 days, unknown after that

%%time
# CatBoost Accelerated failure time model
y = np.column_stack([train.efs_time,
                     np.where(train.efs == 1, train.efs_time, -1)])
all_scores = []
for fold, (idx_tr, idx_va) in enumerate(kf.split(train, train.race_group)):
    X_tr = train.iloc[idx_tr][features]
    X_va = train.iloc[idx_va][features]
    y_tr = y[idx_tr]
    cb_aft_params = {'loss_function': 'SurvivalAft', 'grow_policy': 'SymmetricTree', 
                     'n_estimators': 800, 'learning_rate': 0.066, 'l2_leaf_reg': 4.4,
                     'max_depth': 5, 'colsample_bylevel': 0.776, 'random_strength': 0.9, 
                     'verbose': False} # 0.67551
    model = catboost.CatBoostRegressor(**cb_aft_params, cat_features=cat_features)
    model.fit(X_tr, y_tr)
    y_va_pred = - model.predict(X_va) # model predicts log of time of death
    evaluate_fold(y_va_pred, fold)
display_overall('Accelerated Failure Time CatBoost')
# Overall:                                   0.664

Target transformation models and regression with mean squared error

The competition task can be interpreted as predicting the order of death of the patients.
Who dies first? Who dies second? ... Who dies last, and who survives?
With a suitable target transformation, we can apply the usual regression algorithms which optimize mse or similar metrics.
In the public notebooks of this competition, we can find various target transformations, but they all are similar.
Patients who die mostly have an efs_time between 0 and 15, whereas most survivors have an efs_time between 15 and 160.
This distribution is an impediment for regression models.
We want predictions to have high discriminative power for the patients who die, but we don't need to distinguish between survivors.
We can achieve this result by stretching the range of the patients who die and compressing the range of the survivors.
The diagram shows how a typical target transformation stretches and compresses the ranges:

def transform_survival_probability(time, event):
    """Transform the target by stretching the range of eventful efs_times and compressing the range of event_free efs_times

    From https://www.kaggle.com/code/cdeotte/gpu-lightgbm-baseline-cv-681-lb-685
    """
    kmf = KaplanMeierFitter()
    kmf.fit(time, event)
    y = kmf.survival_function_at_times(time).values
    return y

y_quantile = transform_survival_probability(time=train.efs_time, event=train.efs)
survival_df = survival_function(train)

fig, axs = plt.subplots(2, 2, figsize=(10, 10), dpi=80)

axs[0, 0].hist(train.efs_time[train.efs == 0], bins=np.linspace(0, 160, 41), label='efs=0: patient still lives at this time', alpha=0.5)
axs[0, 0].hist(train.efs_time[train.efs == 1], bins=np.linspace(0, 160, 41), label='efs=1: patient dies at this time', alpha=0.5)
axs[0, 0].legend()
axs[0, 0].set_xlabel('efs_time')
axs[0, 0].set_ylabel('count')
axs[0, 0].set_title('Original target histogram')

axs[0, 1].set_axis_off()

axs[1, 0].step(survival_df.index, survival_df['survival_probability'], c='k', lw=3, where="post", label='[Overall]')
axs[1, 0].set_xlabel('efs_time')
axs[1, 0].set_ylabel("quantile")
axs[1, 0].set_title("Survival function")
axs[1, 0].yaxis.set_major_formatter(PercentFormatter(xmax=1, decimals=0))

axs[1, 1].hist(y_quantile[train.efs==0], bins=100, label="efs=0", orientation=u'horizontal', alpha=0.5)
axs[1, 1].hist(y_quantile[train.efs==1], bins=100, label="efs=1", orientation=u'horizontal', alpha=0.5)
axs[1, 1].legend()
axs[1, 1].set_ylabel("quantile")
axs[1, 1].set_xlabel("count")
axs[1, 1].set_title("Transformed target histogram (sideways)")
axs[1, 1].yaxis.set_major_formatter(PercentFormatter(xmax=1, decimals=0))

ax = plt.Axes(fig, [0., 0., 1., 1.])
ax.set_axis_off()
ax.set_xlim(0, 1)
ax.set_ylim(0, 1)
fig.add_axes(ax)
ax.arrow(0.2, 0.55, 0, -0.47, length_includes_head=True, width=0.002, color=plt.get_cmap('tab10')(0), alpha=0.5, head_width=0.02, head_length=0.02)
ax.arrow(0.2, 0.082, 0.37, 0, length_includes_head=True, width=0.002, color=plt.get_cmap('tab10')(0), alpha=0.5, head_width=0.02, head_length=0.02)
ax.arrow(0.12, 0.55, 0, -0.3, length_includes_head=True, width=0.002, color=plt.get_cmap('tab10')(1), alpha=0.5, head_width=0.02, head_length=0.02)
ax.arrow(0.12, 0.25, 0.45, 0, length_includes_head=True, width=0.002, color=plt.get_cmap('tab10')(1), alpha=0.5, head_width=0.02, head_length=0.02)

plt.suptitle('Transforming the target', y=0.99, size=20)
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    plt.tight_layout()
plt.show()

What we already saw from my discussion annotation: https://dongsunseng.com/entry/CIBMTR-Equity-in-post-HCT-Survival-Predictions-8-Finding-the-best-target-transformation
We now plot the histograms of five possible transformations and then fit regression models with MSE loss to each of the transformed targets.
You can of course try other loss functions and see what happens.

def transform_partial_hazard(time, event):
    """Transform the target by stretching the range of eventful efs_times and compressing the range of event_free efs_times

    From https://www.kaggle.com/code/andreasbis/cibmtr-eda-ensemble-model
    """
    data = pd.DataFrame({'efs_time': time, 'efs': event, 'time': time, 'event': event})
    cph = CoxPHFitter()
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        cph.fit(data, duration_col='time', event_col='event')
    return cph.predict_partial_hazard(data)

def transform_separate(time, event):
    """Transform the target by separating events from non-events
    
    From https://www.kaggle.com/code/mtinti/cibmtr-lofo-feature-importance-gpu-accelerated"""
    transformed = time.values.copy()
    mx = transformed[event == 1].max() # last patient who dies
    mn = transformed[event == 0].min() # first patient who survives
    transformed[event == 0] = time[event == 0] + mx - mn
    transformed = rankdata(transformed)
    transformed[event == 0] += len(transformed) // 2
    transformed = transformed / transformed.max()
    return - transformed

def transform_rank_log(time, event):
    """Transform the target by stretching the range of eventful efs_times and compressing the range of event_free efs_times
    
    From https://www.kaggle.com/code/cdeotte/nn-mlp-baseline-cv-670-lb-676"""
    transformed = time.values.copy()
    mx = transformed[event == 1].max() # last patient who dies
    mn = transformed[event == 0].min() # first patient who survives
    transformed[event == 0] = time[event == 0] + mx - mn
    transformed = rankdata(transformed)
    transformed[event == 0] += len(transformed) * 2
    transformed = transformed / transformed.max()
    transformed = np.log(transformed)
    return - transformed

def transform_quantile(time, event):
    """Transform the target by stretching the range of eventful efs_times and compressing the range of event_free efs_times
    
    From https://www.kaggle.com/code/ambrosm/esp-eda-which-makes-sense"""
    transformed = np.full(len(time), np.nan)
    transformed_dead = quantile_transform(- time[event == 1].values.reshape(-1, 1)).ravel()
    transformed[event == 1] = transformed_dead
    transformed[event == 0] = transformed_dead.min() - 0.3
    return transformed

# XGBoost: MSE loss with five different target transformations
for transformation in [transform_survival_probability,
                       transform_partial_hazard,
                       transform_separate,
                       transform_rank_log,
                       transform_quantile,
                      ]:
    plt.figure(figsize=(6, 1.5))
    target = transformation(time=train.efs_time, event=train.efs)
    vmin, vmax = 1.09 * target.min() - 0.09 * target.max(), 1.09 * target.max() - 0.09 * target.min()
    plt.hist(target[train.efs == 0], bins=np.linspace(vmin, vmax, 31), density=True, label='efs=0: patient still lives at this time', alpha=0.5)
    plt.hist(target[train.efs == 1], bins=np.linspace(vmin, vmax, 31), density=True, label='efs=1: patient dies at this time', alpha=0.5)
    plt.xlim(vmin, vmax)
    plt.yticks([])
    plt.title('Target histogram: ' + transformation.__name__)
    plt.show()
    
    print(transformation.__name__)

    all_scores = []
    for fold, (idx_tr, idx_va) in enumerate(kf.split(train, train.race_group)):
        X_tr = train.iloc[idx_tr][features]
        X_va = train.iloc[idx_va][features]
        y_tr = transformation(time=train.iloc[idx_tr].efs_time, event=train.iloc[idx_tr].efs)
    
        # from https://www.kaggle.com/code/cdeotte/gpu-lightgbm-baseline-cv-681-lb-685
        model = xgboost.XGBRegressor(
            max_depth=3,  
            colsample_bytree=0.5,  
            subsample=0.8,  
            n_estimators=2000,  
            learning_rate=0.02,  
            enable_categorical=True,
            min_child_weight=80,
        )
        model.fit(X_tr, y_tr)
        y_va_pred = model.predict(X_va) # predicts quantile
        evaluate_fold(y_va_pred, fold)
    display_overall(f'{transformation.__name__} XGBoost (MSE)')
    print()
    
# Overall:                                   0.669 transform_survival_probability
# Overall:                                   0.668 transform_partial_hazard
# Overall:                                   0.666 transform_separate
# Overall:                                   0.672 transform_rank_log
# Overall:                                   0.674 transform_quantile

A linear model

The linear model CoxPHFitter needs one-hot encoding and missing value imputation:

%%time
# see https://lifelines.readthedocs.io/en/latest/Survival%20Regression.html#cox-s-proportional-hazard-model

all_scores = []
for fold, (idx_tr, idx_va) in enumerate(kf.split(train, train.race_group)):
    # Creating preprocessing pipeline - one hot encoding for categorical variables
    preproc = ColumnTransformer([
    # One-hot encoding for categorical variables
    ('ohe', OneHotEncoder(
        drop='first',  # Drop first category (dummy coding)
        sparse_output=False, 
        handle_unknown='ignore'), 
        cat_features),
    ],
    # Replace missing values in numerical variables with median
    remainder=SimpleImputer(strategy='median')
).set_output(transform='pandas')
    
    # Apply data preprocessing
    X_tr = preproc.fit_transform(train.iloc[idx_tr])
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        X_va = preproc.transform(train.iloc[idx_va])
        
    # Create and Train Cox model
    model = CoxPHFitter(penalizer=.01) # Apply L2 regularization
    feats = [f for f in X_tr.columns if f not in ['gvhd_proph_FK+- others(not MMF,MTX)']]
    model.fit(X_tr[feats], duration_col='efs_time', event_col='efs')
    # model.print_summary()
    y_va_pred = model.predict_partial_hazard(X_va[feats])
    X_va['race_group'] = train.race_group.iloc[idx_va]
    evaluate_fold(y_va_pred, fold)
display_overall('Cox Proportional Hazards Linear')
# Overall:                                   0.656

XGBoost Cox vs. CoxPHFitter
- XGBoost Cox:
  - Can learn nonlinear relationships
  - Can capture complex interactions between features
  - Automatically handles missing values as a tree-based model
- CoxPHFitter:
  - Linear model (learns only linear relationships)
  - Features affect hazard independently
  - Requires preprocessing (one-hot encoding, missing value handling)
  - Basic Cox Model Equation:
    - h(t|X) = h₀(t) * exp(β₁X₁ + β₂X₂ + ... + βₙXₙ)
      # h₀(t): baseline hazard function
      # βᵢ: coefficient for each feature
      # Xᵢ: feature value
  - Learns through partial likelihood func:
    - Details in my previous post: https://dongsunseng.com/entry/CIBMTR-Equity-in-post-HCT-Survival-Predictions-2-Understanding-Survival-Analysis

Observation: With most models, the Asian predictions get the highest scores (best concordance index) and the predictions for white patients get the lowest scores (worst concordance).
Insight: As the competition objective (equitability across diverse patient populations) rewards models with similar concordance scores for all six race groups, a possible strategy could be that we artificially make the predictions for Asian patients worse.
- # Stratified C-index = Mean(C-indices) - Std(C-indices)
  # That is, mean of C-indices for each racial group minus their standard deviation
  Example:
  Race A: C-index = 0.70
  Race B: C-index = 0.70
  Race C: C-index = 0.70
  => Mean 0.70, Std 0 -> Final score 0.70
  Race A: C-index = 0.75
  Race B: C-index = 0.65
  Race C: C-index = 0.70
  => Mean 0.70, Std 0.05 -> Final score 0.65
- What this insight suggests:
  - If predictions for Asian patients are more accurate than other races
  - Deliberately lowering the accuracy for Asian predictions
  - To make performance similar across all racial groups
  - Could improve the overall score

Final comparison

For the time being, the gradient-boosted proportional hazard models (Cox regression, blue) and the transformed-target models (pink) win.
Among the target transformations, transform_quantile is best.
The AFT models (green) perhaps need more hyperparameter tuning.

result_df = pd.DataFrame(all_model_scores, index=['score']).T
result_df = result_df.sort_values('score', ascending=False)
# with pd.option_context("display.precision", 3): display(result_df)
plt.figure(figsize=(6, len(result_df) * 0.4))

color = np.where(result_df.index.str.contains('Proportional'),
                 'cyan',
                 np.where(result_df.index.str.contains('Accelerated'), 'lightgreen', 
                          'lightpink'))
bars = plt.barh(np.arange(len(result_df)), result_df.score, color=color)
plt.gca().bar_label(bars, fmt='%.3f')
plt.yticks(np.arange(len(result_df)), result_df.index)
plt.xlim(0.65, 0.68)
plt.xticks([0.65, 0.66, 0.67, 0.68])
plt.gca().invert_yaxis()
plt.xlabel('CV score (higher is better)')
plt.show()

완벽해야 한다는 강박은 시작을 망친다.

CIBMTR - Equity in post-HCT Survival Predictions #10 A general Understanding for AFT Loss function

dongsunseng — Thu, 6 Feb 2025 22:17:59 +0900

Annotation of the discussion about AFT loss function:

https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550563

CIBMTR - Equity in post-HCT Survival Predictions

Improve prediction of transplant survival rates equitably for allogeneic HCT patients

www.kaggle.com

A general Understanding for AFT Loss function

My notebook using AFT Loss function is [CV0.665 LB0.666]cat+xgb with AFT loss function based on Dear @cdeotte's code, thanks!
- My annotation on the kernel:
The Accelerated Failure Time (AFT) model is a parametric survival analysis model that describes how covariates influence the survival time of an event.
Unlike Proportional Hazards (PH) models, including COX ph model, which assume covariates proportionally scale the hazard function, AFT models assume that covariates accelerate or decelerate the life course of a survival process by a multiplicative factor.
- Detailed explanation about Proportional Hazards model vs. Accelerated Failure Time model
  - Proportional Hazard(PH) Model:
    - # Hazard-based approach
      # Example: Comparing two patients
      Patient A's hazard = baseline hazard × 2.0 # 2 times riskier than baseline
      Patient B's hazard = baseline hazard × 0.5 # 0.5 times riskier than baseline
      # Feature: Hazard changes proportionally
  - Accelerated Failure Time(AFT) Model:
    - # Survival time-based approach
      # Example: Comparing two patients
      Patient A's survival time = baseline survival time × 0.5 # Progresses 2x faster than baseline
      Patient B's survival time = baseline survival time × 2.0 # Progresses 2x slower than baseline
      # Feature: Time scale is accelerated/decelerated
  - Example:
    - Situation: Effect of a specific treatment on disease progression
      PH Model Interpretation:
      - "Patients receiving this treatment have half the risk of death"
      AFT Model Interpretation:
      - "Disease progression is 2x slower in patients receiving this treatment"
      - i.e., it takes twice as long to reach the same stage
  - Key Differences:
    - PH Model: Focuses on hazard (risk)
    - AFT Model: Focuses on actual survival time
    - PH models "how risky"
    - AFT models "how fast/slow it progresses"

Detailed Explanation:
- Basic Model Equation:
  - log(T) = Xβ + ε
    where:
    T = survival time
    X = feature variables (age, gender, disease status, etc.)
    β = coefficients for each feature (impact)
    ε = error term (random variable following probability distribution)
- Acceleration Factor:
  - θ = exp(-Xβ)
    # Interpretation:
    θ > 1: survival time decreases (disease progresses faster)
    θ < 1: survival time increases (disease progresses slower)
- Example:
  - # Example: Modeling treatment effect
    X = [treatment_dose]
    β = -0.7 # assumed coefficient
  - # When treatment dose is 1 unit
    θ = exp(-1 × -0.7) = exp(0.7) ≈ 2.01
    # Interpretation: 1 unit of treatment doubles survival time
  - # When treatment dose is 2 units
    θ = exp(-2 × -0.7) = exp(1.4) ≈ 4.06
    # Interpretation: 2 units of treatment quadruples survival time
- Key Features:
  - Reasons for modeling log(T):
    - Survival time is always positive
    - Log transformation better satisfies normality assumption
    - Interpretation becomes easier with multiplicative effects

Detailed Explanation:
- Component Explanations:
  - tᵢ: observed survival time
    δᵢ: event occurrence indicator (1=occurred, 0=censored)
    μᵢ = Xᵢβ: predicted log-survival time
    σ: scale parameter controlling variance
    f(t): probability density function (PDF)
    S(t): survival function
- How log function works:
  - When event occurs (δᵢ = 1):
    - Loss = -log f(tᵢ; μᵢ, σ)
      # Tries to maximize PDF
      # Learns to increase probability density at actual occurrence time
    - PDF
      - PDF represents the probability density of an event occurring at a specific time point
      - Example: When a patient dies on day 100
      - pdf(t) = density of probability of death at a specific time t
      - High pdf value at t=100 = high probability of death around day 100
  - When censored (δᵢ = 0):
    - Loss = -log S(tᵢ; μᵢ, σ)
      # Tries to maximize survival function
      # Learns to increase probability of survival beyond observed time
    - Survival function S
      - S(t) = P(T > t) = probability of survival beyond time point t
      - Characteristics:
        
        Decreasing function over time (monotonically decreasing)
        
        Initial value S(0) = 1 (everyone is alive at time 0)
        
        When time approaches infinity, S(∞) = 0
- Example:
  - # Patient A: death at day 100 (δ = 1)
    Loss_A = -log f(100; μ_A, σ)
    # Learns to increase probability of death at day 100
    - In this case, we know the exact time of death
    - So model learns to predict high probability of death around day 100
    - Thus, "to make accurate predictions at the actual occurrence time", we "learn to increase probability density at actual occurrence time."
  - # Patient B: censored at day 80 (δ = 0)
    Loss_B = -log S(80; μ_B, σ)
    # Learns to increase probability of survival beyond day 80
    - In this case, we don't know when death occurred after day 80
    - we only know for certain they survived until day 80
    - Thus, it is reasonable to increase probability of survival beyond day 80
    - It is correct to decrease probability of death before day 80 and increase survival probability after
- Key points:
  - This loss function properly handles censored data
  - Considers both PDF and survival function for more accurate predictions
  - Choice of ε (random term) distribution affects baseline survival time T₀

Basic Assumption:
- ε ~ N(0, σ²) # Error term follows normal distribution
- This means:
  - log(survival time) follows normal distribution
    - WHY???
      - log(T) = Xβ + ε
      - When Y = a + bX
        - If X follows normal distribution N(μ, σ²)
        - Then Y follows normal distribution N(a + bμ, b²σ²)
      - log(T) = Xβ + ε
        # Since ε follows N(0, σ²)
        # log(T) follows N(Xβ, σ²)
        Because:
        - Xβ is constant term (mean shift)
        - Coefficient of ε is 1 (variance remains same)
  - Actual survival time follows log-symmetric distribution
Probability Density Function (PDF):
- f(t; μ, σ) = (1/tσ√2π) * exp(-(log(t)-μ)²/2σ²)
  Components:
  - t: observed time
  - μ: predicted log-survival time (Xβ)
  - σ: parameter controlling variance
Survival Function:
- S(t; μ, σ) = 1 - Φ((log(t)-μ)/σ)
  where:
  - Φ: cumulative distribution function (CDF) of standard normal distribution
  - Represents probability of survival beyond time point t
Use Cases:
- # Suitable cases:
  - Symmetrically distributed survival times
  - Constant variability
  Example: Component lifetime in manufacturing
- # Unsuitable cases:
  - Distributions with very long tails
  - Highly asymmetric distributions
Advantages:
- Intuitive interpretation
- Relatively simple calculations
- Good fit for symmetric survival time data
Real Example:
- # Predicting medical device lifetime
  survival_time = exp(Xβ + ε)
  ε ~ N(0, σ²)
  # This means lifetime follows log-normal distribution
  # i.e., log(lifetime) follows normal distribution

Basic Assumption:
- ε ~ Log-Normal(μ, σ²)
  # Error term follows log-normal distribution
  # This means survival time T directly follows log-normal distribution
Log-Normal Distribution Characteristics:
- # Properties:
  - Only takes positive values
  - Has a heavy right tail
  - Asymmetric distribution
Difference between AFT:Normal and AFT:Log:
- AFT:Normal
  - log(survival time) follows normal distribution
  - Survival time is symmetrically distributed
  - Example: Manufacturing component lifetime
- AFT:Log
  - Survival time directly follows log-normal distribution
  - Survival time is asymmetrically distributed (long tail)
  - Example: Cancer patient survival period
Use Cases:
- # Suitable cases:
  - When some patients survive much longer than others
  - Biological processes or reliability data
  - Cancer patient survival analysis
  # Reasons:
  - Most show similar survival periods but
  - Some show very long survival periods
  - Can model such long-tail distributions well

In simple terms

In simple terms, AFT assumes that different factors (i.e., input variables or features of the model) affect the rate at which events occur by "stretching" or "compressing" the timeline.

It's like adjusting the playback speed while watching a video:

Double speed play (fast forward) : Time speeds up, events happen faster.
Slow play (slow down) : Time slows down and events occur later.

Imagine you're studying the survival time of two cancer patients:

Patient A receives standard treatment.
Patient B receives a new experimental treatment.

Case 1: $θ = 0.5$ (Acceleration Factor < 1)

This means the timeline is stretched by 2x for Patient B compared to Patient A.

If Patient A survives for 1 year, Patient B is expected to survive for 2 years under the new treatment.

Case 2: $θ = 2$ (Acceleration Factor > 1)

This means the timeline is compressed for Patient B, reducing their survival time by half.

If Patient A survives for 1 year, Patient B is expected to survive for 6 months.

압력 없이는 다이아몬드가 만들어지지 않는다
- 토마스 칼라일 -

CIBMTR - Equity in post-HCT Survival Predictions #9 NN Starter Notebook

dongsunseng — Thu, 6 Feb 2025 13:03:05 +0900

Annotation of discussion and kernel for NN Solution from Chris Deotte

Discussion Link:

https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550343

CIBMTR - Equity in post-HCT Survival Predictions

Improve prediction of transplant survival rates equitably for allogeneic HCT patients

www.kaggle.com

Kernel Link:

https://www.kaggle.com/code/cdeotte/nn-mlp-baseline-cv-670-lb-676

NN (MLP) Baseline - [CV 670 LB 676]

Explore and run machine learning code with Kaggle Notebooks | Using data from CIBMTR - Equity in post-HCT Survival Predictions

www.kaggle.com

NN Starter Notebook CV 0.670 LB 0.676 (Discussion)

I published a starter notebook NN which uses the following simple architecture.
Consider improving architecture to boost CV and LB score!

Preprocessing

There are 57 features with 35 categorical and 22 numerical.
The majority of numerical features appear to be like categorical features with their low unique value count.
Therefore in my NN starter, I convert 55 features into categorical leaving only donor_age and act_at_hct as numerical.
For each categorical, we label encode.
In the NN architecture, we use embeddings for each categorical features.
For each categorical feature, the embedding input size is of course the number of unique values.
The embedding output size is sqrt(number unique)+1.
- About embedding
  - The process of converting categorical data into meaningful continuous vectors
  - Categories with similar meanings learn to have similar vector values
  - Examples:
    - # Example: Disease Type Categories
      Disease A = [0.2, 0.8, -0.3] # Converted to 3D vector
      Disease B = [0.3, 0.7, -0.2] # Similar diseases have similar vector values
      Disease C = [-0.8, -0.2, 0.9] # Different types have different vector values
  - Limitations of one-hot encoding:
    - # One-Hot Encoding
      Disease A = [1, 0, 0, 0]
      Disease B = [0, 1, 0, 0]
      Disease C = [0, 0, 1, 0]
      # All categories are equidistant
      # Cannot express similarity
  - Advantages of Embedding:
    - # Embedding
      Disease A = [0.2, 0.8]  # Compressed to 2D
      Disease B = [0.3, 0.7]  # Similar vector to A
      Disease C = [-0.8, -0.2]  # Different vector from A,B
      # Can express similarity through vector distances
- The practice of setting embedding output size to sqrt(number of unique values) + 1 is a commonly used rule of thumb.
  - Example:
    - categorical_feature = 'disease_type'
      unique_values = 16 # Assuming there are 16 disease types
      embedding_output_size = int(np.sqrt(16)) + 1
      # = 4 + 1 = 5
  - Reasons for this setting:
    - Dimension Reduction:
      - One-hot encoding would require 16 dimensions
      - Embedding can reduce it to 5 dimensions
      - This reduces model complexity and makes learning more efficient
    - Appropriate Expressiveness:
      - Too small embedding dimension: Risk of information loss
      - Too large embedding dimension: Risk of overfitting
      - sqrt(n) + 1 is an empirical method to find balance between these
  - Real Example:
    - # Embedding dimensions for various category sizes
      4 categories -> 3 embedding dimensions (√4 + 1)
      9 categories -> 4 embedding dimensions (√9 + 1)
      16 categories -> 5 embedding dimensions (√16 + 1)
      25 categories -> 6 embedding dimensions (√25 + 1)
Afterward we concatenate all the categorical embeddings together with the numerical features and continue forward with MLP.
For the two numericals, we standardize with feature = (feature - mean)/std because NN like standardized features.

Target Transformation

There are two ways to train a Survival Model:
- We can input both efs and efs_time and use survival loss like Cox.
- Transform efs and efs_time into a single target proxy for risk score and train with regression loss like MSE.
In my NN starter, I employ option 2 above.
I transform the original two targets into a proxy for risk score and train NN with MSE regression loss.
Below shows the original two targets and the new transformed target.
When training with MSE loss, the model likes the target to be like Gaussian distribution.
This was one factor when I invented this new way to transform target:

NN (MLP) Baseline - [CV 670 LB 676] (Kernel)

Intro

In this notebook, we present a Neural Network NN (MLP) baseline.
This NN is very fast to train on GPU! We achieve CV 0.670.
There is a discussion about this notebook here
- Above discussion
We tranform the two train targets (efs and efs_time) into a single target (y) and then train regression NN with MSE loss.
We load Kaggle's official metric code from here and evaluate the CV performance using competition metric Stratified Concordance Index.
In this comp, we need to predict risk score.
There are many different ways to transform the two train targets into a value that mimics risk score and train an NN (or any other regression model like SVR) with regression.
I present one transformation in this notebook and I presented a different one in my XGBoost starter notebook here.
Consider experimenting by creating your own target from efs and efs_time.
Or considering using survival loss directly which uses both efs and efs_time as explained in discussion post here.
Kaggle user MT describes another transformation here called KaplanMeierFitter and gives an example here

Pip Install Libraries for Metric

Since internet must be turned off for submission, we pip install from my other notebook here where I downloaded the WHL files.

!pip install /kaggle/input/pip-install-lifelines/autograd-1.7.0-py3-none-any.whl
!pip install /kaggle/input/pip-install-lifelines/autograd-gamma-0.5.0.tar.gz
!pip install /kaggle/input/pip-install-lifelines/interface_meta-1.3.0-py3-none-any.whl
!pip install /kaggle/input/pip-install-lifelines/formulaic-1.0.2-py3-none-any.whl
!pip install /kaggle/input/pip-install-lifelines/lifelines-0.30.0-py3-none-any.whl

Load Train and Test

import numpy as np, pandas as pd
import matplotlib.pyplot as plt
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)

test = pd.read_csv("/kaggle/input/equity-post-HCT-survival-predictions/test.csv")
print("Test shape:", test.shape )

train = pd.read_csv("/kaggle/input/equity-post-HCT-survival-predictions/train.csv")
print("Train shape:",train.shape)
train.head()

EDA on Train Targets

There are two train targets efs and efs_time.
When efs==1 we know patient had an event and we know time of event is efs_time. When efs==0 we do not know if patient had an event or not, but we do know that patient was without event for at least efs_time.

plt.hist(train.loc[train.efs==1,"efs_time"],bins=100,label="efs=1, Yes Event")
plt.hist(train.loc[train.efs==0,"efs_time"],bins=100,label="efs=0, Maybe Event")
plt.xlabel("Time of Observation, efs_time")
plt.ylabel("Density")
plt.title("Times of Observation. Either time to event, or time observed without event.")
plt.legend()
plt.show()

Transform Two Train Targets into One Target!

Both targets efs and efs_time provide useful information.
We will tranform these two targets into a single target to train our model with.
In this competition we need to predict risk score.
So we will create a target that mimics risk score to train our model.
(Note this is only one out of many ways to transform two targets into one target. Considering experimenting on your own).

# 1. Set initial target value to efs_time
train["y"] = train.efs_time.values
# 2. Find maximum time of event cases (efs=1) and 
# minimum time of censored cases (efs=0)
mx = train.loc[train.efs==1,"efs_time"].max()
mn = train.loc[train.efs==0,"efs_time"].min()
# 3. Adjust time values for censored cases
# Add (mx - mn) to make all censored cases larger than event cases
train.loc[train.efs==0,"y"] = train.loc[train.efs==0,"y"] + mx - mn
# 4. Rank all values (starting from 1)
train.y = train.y.rank()
# 5. Make ranks of censored cases larger
# Add 2 times the data length to clearly differentiate
train.loc[train.efs==0,"y"] += 2*len(train)
# 6. Normalize to values between 0~1
train.y = train.y / train.y.max()
# 7. Apply log transformation
train.y = np.log(train.y)
# 8. Center mean to 0
train.y -= train.y.mean()
# 9. Reverse sign (to interpret as risk score)
train.y *= -1.0

plt.hist(train.loc[train.efs==1,"y"],bins=100,label="efs=1, Yes Event")
plt.hist(train.loc[train.efs==0,"y"],bins=100,label="efs=0, Maybe Event")
plt.xlim((-5,5))
plt.xlabel("Transformed Target y")
plt.ylabel("Density")
plt.title("Transformed Target y using both efs and efs_time.")
plt.legend()
plt.show()

Purpose of this transformation:

Clearly differentiate between censored cases and event cases
Transform values into appropriate range
Make it interpretable as risk scores (multiply by -1 at the end)

As a result:

Event cases (efs=1) have higher risk scores
Censored cases (efs=0) have lower risk scores
Overall distribution becomes normalized

Detailed Explanation about #5 part:

# 5. Make ranks of censored cases larger
# Add 2 times the data length to clearly differentiate
train.loc[train.efs==0,"y"] += 2*len(train)

# Example data: 5 patients
# efs=1: Event occurred (death)
# efs=0: Censored (end of tracking)
# Initial data
Patient A: efs=1, efs_time=10  # Died on day 10
Patient B: efs=1, efs_time=20  # Died on day 20
Patient C: efs=0, efs_time=15  # Survival confirmed until day 15
Patient D: efs=1, efs_time=5   # Died on day 5
Patient E: efs=0, efs_time=25  # Survival confirmed until day 25
# After applying rank()
Patient D: 1  (shortest survival)
Patient A: 2
Patient C: 3
Patient B: 4
Patient E: 5  (longest survival)
# Adding 2*len(train) = 2*5 = 10 to censored cases
Patient D: 1      # efs=1, no change
Patient A: 2      # efs=1, no change
Patient C: 13     # efs=0, 3+10
Patient B: 4      # efs=1, no change
Patient E: 15     # efs=0, 5+10

Reasons for doing this:

Censored cases (efs=0) might have actually lived longer
Therefore, we make their ranks definitively larger
Adding twice the data length creates a large gap between efs=1 and efs=0 cases
This helps the model better distinguish between the two groups

Features

There are a total of 57 features.
From these 35 are categorical and 22 are numerical.
Since most of the numerical features has only a few unique values, we will treat all features except donor_age and act_at_hct as categorical for our NN.
So we will feed our NN 55 categorical features and 2 numerical features.

RMV = ["ID","efs","efs_time","y"]
FEATURES = [c for c in train.columns if not c in RMV]
print(f"There are {len(FEATURES)} FEATURES: {FEATURES}")

# Create empty list CATS - will store categorical variables
CATS = []
# Iterate through each feature (column) in FEATURES list
for c in FEATURES:
    # If the column's data type is "object" (strings etc.)
    if train[c].dtype=="object":
        # Fill missing values with "NAN" in both train and test
        train[c] = train[c].fillna("NAN")
        test[c] = test[c].fillna("NAN")
        # Add this column to CATS list
        CATS.append(c)
    
    # If it's a numerical column not containing "age" in its name    
    elif not "age" in c:
        # Convert numeric values to strings in both train and test
        train[c] = train[c].astype("str")
        test[c] = test[c].astype("str")
        # Add this column to CATS list
        CATS.append(c)
# Print the number and list of features treated as categorical
print(f"In these features, there are {len(CATS)} CATEGORICAL FEATURES: {CATS}")

# Create lists to store categorical variable sizes and embedding dimensions
CAT_SIZE = []  # Number of unique values for each categorical variable
CAT_EMB = []   # Embedding dimensions for each categorical variable
NUMS = []      # List of numerical variables
# Combine train and test data
combined = pd.concat([train,test],axis=0,ignore_index=True)
print("We LABEL ENCODE the CATEGORICAL FEATURES: ")
# Iterate through all features
for c in FEATURES:
    # If it's a categorical variable
    if c in CATS:
        # Perform label encoding using factorize()
        combined[c],_ = combined[c].factorize()
        # Make minimum value 0
        combined[c] -= combined[c].min()
        # Convert to int32 type
        combined[c] = combined[c].astype("int32")
        
        # Calculate number of unique values and range
        n = combined[c].nunique()
        mn = combined[c].min()
        mx = combined[c].max()
        print(f'{c} has ({n}) unique values')
        
        # Store category size (max+1) and embedding dimension (sqrt(max+1))
        CAT_SIZE.append(mx+1)
        CAT_EMB.append( int(np.ceil( np.sqrt(mx+1))) )
    
    # If it's a numerical variable
    else:
        # Convert float64 to float32, int64 to int32 (memory optimization)
        if combined[c].dtype=="float64":
            combined[c] = combined[c].astype("float32")
        if combined[c].dtype=="int64":
            combined[c] = combined[c].astype("int32")
        
        # Perform standardization
        m = combined[c].mean()
        s = combined[c].std()
        combined[c] = (combined[c]-m)/s
        # Fill missing values with 0
        combined[c] = combined[c].fillna(0)
        
        # Add to numerical variables list
        NUMS.append(c)
# Split back into train and test
train = combined.iloc[:len(train)].copy()
test = combined.iloc[len(train):].reset_index(drop=True).copy()

We LABEL ENCODE the CATEGORICAL FEATURES: 
dri_score has (12) unique values
psych_disturb has (4) unique values
cyto_score has (8) unique values
diabetes has (4) unique values
hla_match_c_high has (4) unique values
hla_high_res_8 has (8) unique values
tbi_status has (8) unique values
arrhythmia has (4) unique values
hla_low_res_6 has (6) unique values
graft_type has (2) unique values
vent_hist has (3) unique values
renal_issue has (4) unique values
pulm_severe has (4) unique values
prim_disease_hct has (18) unique values
hla_high_res_6 has (7) unique values
cmv_status has (5) unique values
hla_high_res_10 has (9) unique values
hla_match_dqb1_high has (4) unique values
tce_imm_match has (9) unique values
hla_nmdp_6 has (6) unique values
hla_match_c_low has (4) unique values
rituximab has (3) unique values
hla_match_drb1_low has (3) unique values
hla_match_dqb1_low has (4) unique values
prod_type has (2) unique values
cyto_score_detail has (6) unique values
conditioning_intensity has (7) unique values
ethnicity has (4) unique values
year_hct has (13) unique values
obesity has (4) unique values
mrd_hct has (3) unique values
in_vivo_tcd has (3) unique values
tce_match has (5) unique values
hla_match_a_high has (4) unique values
hepatic_severe has (4) unique values
prior_tumor has (4) unique values
hla_match_b_low has (4) unique values
peptic_ulcer has (4) unique values
hla_match_a_low has (4) unique values
gvhd_proph has (18) unique values
rheum_issue has (4) unique values
sex_match has (5) unique values
hla_match_b_high has (4) unique values
race_group has (6) unique values
comorbidity_score has (12) unique values
karnofsky_score has (8) unique values
hepatic_mild has (4) unique values
tce_div_match has (5) unique values
donor_related has (4) unique values
melphalan_dose has (3) unique values
hla_low_res_8 has (8) unique values
cardiac has (4) unique values
hla_match_drb1_high has (4) unique values
pulm_moderate has (4) unique values
hla_low_res_10 has (8) unique values

TensorFlow NN

We train NN model with CV 0.670

import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense, Dropout, Input, Embedding
from tensorflow.keras.layers import Concatenate, BatchNormalization
import tensorflow.keras.backend as K
from sklearn.model_selection import KFold

print('TF Version',tf.__version__)

Learning Schedule

# Set total 4 epochs
EPOCHS = 4
# Define learning rate for each epoch
LRS = [0.01]*2 + [0.001]*1 + [0.0001]*1
# Written out: LRS = [0.01, 0.01, 0.001, 0.0001]
# Function that returns learning rate for each epoch
def lrfn(epoch):
    return LRS[epoch]
# Create list of epoch numbers (0 to 3)
rng = [i for i in range(EPOCHS)]
# Create list of learning rate values for each epoch
lr_y = [lrfn(x) for x in rng]

plt.figure(figsize=(10, 4))
plt.plot(rng, lr_y, '-o')
print("Learning rate schedule: {:.3g} to {:.3g} to {:.3g}". \
        format(lr_y[0], max(lr_y), lr_y[-1]))
plt.xlabel("Epoch")
plt.ylabel("Learning Rate")
plt.title("Learning Rate Schedule")
plt.show()

lr_callback = tf.keras.callbacks.LearningRateScheduler(lrfn, verbose = False)

Learning Rate Schedule:
- First 2 epochs: 0.01 (fast learning with high learning rate)
- 3rd epoch: 0.001 (decreased learning rate)
- 4th epoch: 0.0001 (fine-tuning with smaller learning rate)
Reasons for gradually decreasing the learning rate:
1. Fast learning with large learning rate initially
2. Fine-tuning with small learning rate in later stages
3. This helps the model converge more stably

Model Definition

We use embedding layers for all label encoded categorical features.
Then we concatenate all categorical embeddings with the numerical features.
We create an MLP with two hidden layers.
Our final output layer has one linear neuron and during training we use MSE loss with Adam optimizer.

def build_model():
    # 1. Handle categorical variables
    # Create input layer for categorical variables
    x_input_cats = Input(shape=(len(CATS),))
    embs = []
    
    # Create embedding layer for each categorical variable
    for j in range(len(CATS)):
        # Create embedding layer (input size: CAT_SIZE[j], output size: CAT_EMB[j])
        e = tf.keras.layers.Embedding(CAT_SIZE[j], CAT_EMB[j])
        # Apply embedding to j-th categorical variable
        x = e(x_input_cats[:,j])
        # Flatten embedding result to 1D
        x = tf.keras.layers.Flatten()(x)
        # Store embedding result
        embs.append(x)
    
    # 2. Handle numerical variables
    # Create input layer for numerical variables
    x_input_nums = Input(shape=(len(NUMS),))
    
    # 3. Combine categorical and numerical features
    # Connect all embedding results and numerical variables
    x = tf.keras.layers.Concatenate(axis=-1)(embs+[x_input_nums])
    
    # 4. Add fully connected layers (Dense)
    # Hidden layer with 256 neurons (ReLU activation)
    x = Dense(256, activation='relu')(x)
    x = Dense(256, activation='relu')(x)
    # Output layer with 1 neuron (linear activation)
    x = Dense(1, activation='linear')(x)
    
    # 5. Create model
    # Input: categorical and numerical variables
    # Output: predicted value
    model = Model(inputs=[x_input_cats,x_input_nums], outputs=x)
    
    return model

linear activation: f(x) = x
- Outputs input value as is
- No transformation applied
- Characteristics:
  - Unlimited output range (-∞ ~ +∞)
  - Commonly used in output layer for regression
  - Suitable for continuous real value prediction
ReLU (Rectified Linear Unit) Activation: f(x) = max(0, x)
- Outputs 0 for negative values, keeps positive values as is
- Characteristics:
  - Output range: [0, ∞)
  - Most commonly used in hidden layers
  - Reduces vanishing gradient problem
  - Simple and fast computation
Linear:        ReLU:
    ↗           ↗
  ↗           _/
↗           _/

%%time

REPEATS = 3
FOLDS = 5
kf = KFold(n_splits=FOLDS, random_state=42, shuffle=True)

oof_nn = np.zeros( len(train) )
pred_nn = np.zeros( len(test) )

#directory = "checkpoints"
#if not os.path.exists(directory):
#    os.makedirs(directory)

for r in range(REPEATS):
    VERBOSE = r==0
    print("#"*25)
    print(f"### REPEAT {r+1} ###")
    print("#"*25)
        
    for i, (train_index, test_index) in enumerate(kf.split(train)):
        
        X_train_cats = train.loc[train_index,CATS].values
        X_train_nums = train.loc[train_index,NUMS].values
        y_train = train.loc[train_index,"y"].values
        y_train2 = train.loc[train_index,"efs"].values
        
        X_valid_cats = train.loc[test_index,CATS].values
        X_valid_nums = train.loc[test_index,NUMS].values
        y_valid = train.loc[test_index,"y"].values
        y_valid2 = train.loc[test_index,"efs"].values
        
        X_test_cats = test[CATS].values
        X_test_nums = test[NUMS].values

        if VERBOSE:
            print(" ","#"*25)
            print(" ",f"### Fold {i+1} ###")
            print(" ","#"*25)
        
        # TRAIN MODEL
        K.clear_session()
        model = build_model()
        model.compile(optimizer=tf.keras.optimizers.Adam(0.001), 
                      loss="mean_squared_error",  
                     )
        v = 2 if VERBOSE else 0
        model.fit([X_train_cats,X_train_nums], [y_train], 
                  validation_data = ([X_valid_cats,X_valid_nums], [y_valid]),
                  callbacks = [lr_callback],
                  batch_size=512, epochs=EPOCHS, verbose=v)
        #model.save_weights(f'{directory}/NN_f{i}_r{r}.weights.h5')
        
        # INFER OOF
        oof_nn[test_index] += model.predict([X_valid_cats,X_valid_nums], verbose=v, batch_size=512).flatten()
        # INFER TEST
        pred_nn += model.predict([X_test_cats,X_test_nums], verbose=v, batch_size=512).flatten()

oof_nn /= REPEATS
pred_nn /= (FOLDS*REPEATS)

Compute Overall Metric

from metric import score

y_true = train[["ID","efs","efs_time","race_group"]].copy()
y_pred = train[["ID"]].copy()
y_pred["prediction"] = oof_nn
m = score(y_true.copy(), y_pred.copy(), "ID")
print(f"\nOverall CV for NN =",m)

Create Submission CSV

sub = pd.read_csv("/kaggle/input/equity-post-HCT-survival-predictions/sample_submission.csv")
sub.prediction = pred_nn
sub.to_csv("submission.csv",index=False)
print("Sub shape:",sub.shape)
sub.head()

게으른 천재는 그냥 게으름뱅이일 뿐이다.

CIBMTR - Equity in post-HCT Survival Predictions #8 Finding the best target transformation

dongsunseng — Wed, 5 Feb 2025 23:55:24 +0900

Annotation post on discussion about finding the best target transformation

https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550835

CIBMTR - Equity in post-HCT Survival Predictions

Improve prediction of transplant survival rates equitably for allogeneic HCT patients

www.kaggle.com

Finding the best target transformation

The competition task can be interpreted as predicting the order of death of the patients.
Who dies first? Who dies second? … Who dies last, and who survives?
With a suitable target transformation, we can apply the usual regression algorithms which optimize mse or similar metrics.
The original target is distributed in such a way that most patients who die have an efs_time between 0 and 15, whereas most survivors have an efs_time between 15 and 160.
This distribution is an impediment(장애) for regression models.
We need predictions which have high discriminative power for the patients who die, but we don't need to distinguish between survivors.
We can achieve this result by stretching the range of the patients who die and compressing the range of the survivors.
The diagram visualizes how a typical target transformation stretches and compresses the ranges:

In the public notebooks of this competition, we can find various target transformations, and most of them are similar.
For a comparison, I've taken three target transformations from public notebooks, added a fourth one, and given them all to XGBRegressor with an mse objective.
The cross-validation scores confirm that the orange part of the histogram must be stretched and the blue part must be condensed:

A comparison with other model types shows that target-transformed mse models (pink) are competitive with Cox proportional hazards models (blue).
My AFT models (green) perhaps need more hyperparameter tuning.

NN starter code annotation here:

Maybe I should check on Nelson-Aalen

Source code is in the EDA which makes sense.

ESP EDA which makes sense ⭐️⭐️⭐️⭐️⭐️

Explore and run machine learning code with Kaggle Notebooks | Using data from CIBMTR - Equity in post-HCT Survival Predictions

www.kaggle.com

My annotation:

지금 당장 꽃을 피우지 못했다고 해서 좌절하지 마세요. 친구와 비교하지도 마세요.
지금은 그저 나의 계절이 아닌 것뿐이에요.
<책 '모든 꽃이 봄에 피지는 않는다'중에서>

CIBMTR - Equity in post-HCT Survival Predictions #7 AFT model

dongsunseng — Wed, 5 Feb 2025 22:12:07 +0900

Introduction post about AFT model

https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550141

CIBMTR - Equity in post-HCT Survival Predictions

Improve prediction of transplant survival rates equitably for allogeneic HCT patients

www.kaggle.com

First saw this conversation about AFT
Host paper link: https://proceedings.mlr.press/v206/norcliffe23a/norcliffe23a.pdf
Above SurvivalXGBoost model objective:
- Objective: Survival: AFT (Accelerated Failure Time)
- Evaluation Metric: AFT Negative Log Likelihood
- AFT Loss Distribution: Normal
- AFT Loss Distribution Scale: 1.0
This is a specialized survival analysis configuration of XGBoost that can be used in the competition.
The AFT (Accelerated Failure Time) model is specialized for predicting survival time, making it a suitable approach for predicting the survival rate of HCT patients, which is the objective of this competition.

https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550141

CIBMTR - Equity in post-HCT Survival Predictions

Improve prediction of transplant survival rates equitably for allogeneic HCT patients

www.kaggle.com

Also metioned here
My discussion annotation:
Notebook example annotation:
- d
- d

인내할 수 있는 사람은 그가 바라는 것은 무엇이든지 손에 넣을 수 있다.
- 벤자민 프랭클린 -

CIBMTR - Equity in post-HCT Survival Predictions #6 How To Train XGBoost with Survival Loss

dongsunseng — Wed, 5 Feb 2025 16:34:30 +0900

Annotation of Chris Deotte's discussion about "How To Train XGBoost with Survival Loss".

https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550141

CIBMTR - Equity in post-HCT Survival Predictions

Improve prediction of transplant survival rates equitably for allogeneic HCT patients

www.kaggle.com

How To Train XGBoost with Survival Loss

This competition involves training survival models.
We need to predict risk scores which are inversely proportional to how long a patient is event free.
XGBoost can train survival models! (This discussion is a continuation of my first discussion here).
- Annotation for the previous discussion here: https://dongsunseng.com/entry/CIBMTR-Equity-in-post-HCT-Survival-Predictions-5-How-To-Get-Started-Understanding-the-Metric

Targets Explained

For patients with efs=1, we observe they had an event and know exactly how long they were event free (namely efs_time).
For patients with efs=0, we observe that they were event free for efs_time but do not know if eventually they will have an event or not.
So we only know they are event free for at least efs_time.
Survival models are new to me so yesterday my starter notebook does not use survival models directly.
Instead I studied the metric and mathematically determined how to transform the two targets efs and efs_time into a single target y and then trained a regression model to predict a proxy for inverse risk score.
My starter discussion is here.
Today I learned that XGBoost and CatBoost can train survival models directly.

XGBoost Survival:Cox Model

Starting from my public starter notebook, we can train XGBoost survival model as follows.
First we make a new column called efs_time2 which includes the information of both efs and efs_time:

train["efs_time2"] = train.efs_time.copy()
train.loc[train.efs==0,"efs_time2"] *= -1

Then remove this new column from features by changing code cell #5 with:

RMV = ["ID","efs","efs_time","y","efs_time2"]

Then we train using this target:

y_train = train.loc[train_index,"efs_time2"]

And we change XGBoost parameters to:

    objective='survival:cox',
    eval_metric='cox-nloglik',

CV Score: 0.672

Horikita Saku's comment about this part:
- He tried:
  train["efs_time2"] = train.efs_time.copy()
  train.loc[train.efs==0,"efs_time2"] *= -1
- and train by:
  x_train = train.loc[train_index, FEATURES].copy()
  y_train = train.loc[train_index, "efs_time2"]
  x_valid = train.loc[test_index, FEATURES].copy()
  y_valid = train.loc[test_index, "efs_time2"]
- the params are:
  eval_metric='cox-nloglik',
  objective='survival:cox',
  boosting_type= "dart",
- ran the eval(scoring) by:
  from metric import score
  y_true = train[["ID","efs","efs_time","race_group"]].copy()
  y_pred = train[["ID"]].copy()
  y_pred["prediction"] = oof_xgb
  m = score(y_true.copy(), y_pred.copy(), "ID")
  print(f"\nOverall CV for XGBoost =",m)
- However, I obtained an Overall CV for XGBoost = 0.9889430880402769, but the LB is 0.58, which seems to be definitely an anomaly. Do you have any ideas on what might be causing this?
- Reply by the author:
  - It is because the lack of this code:
    RMV = ["ID","efs","efs_time","y","efs_time2"]
    FEATURES = [c for c in train.columns if not c in RMV]
  - "I'm guessing that your model is using efs_time2 as both the target and a feature. I will add this to the discussion above. Thanks for discovering this."
    - Details:
    - The core issue here is "Data leakage" problem
    - Where the problem occurred:
      - A new target variable efs_time2 was created, but it was accidentally used as a feature as well
      - As a result, the model received target information as a feature, leading to abnormally high cv scores(0.98)
      - However, since there's no such leakage in the actual test data, the LB score was very low (0.58)
    - Solution:
      - RMV = ["ID","efs","efs_time","y","efs_time2"]
        FEATURES = [c for c in train.columns if not c in RMV]
      - The RMV list specifies columns that should be excluded from features
      - FEATURES selects only columns not included in RMV
      - This explicitly prevents efs_time2 from being used as a feature
    - Why this code is necessary:
      - If the target variable (efs_time2) is included in features, the model essentially "cheats"
      - It ends up using information during training that would never be available in real prediction scenarios
      - This makes it impossible to accurately evaluate the model's generalization performance

CatBoost Survival:Cox Model

For CatBoost, we use the target efs_time2 and loss_function="Cox"

CV Score: 0.670

Starter Notebook

I publish a starter notebook demonstrating this code here.
Using these techniques I achieved CV=0.681 and LB=0.685
- Annotation of the starter notebook here: https://dongsunseng.com/entry/CIBMTR-Equity-in-post-HCT-Survival-Predictions-4-GPU-LightGBM-Baseline-CV-681-LB-685

CV Score: 0.681

UPDATE - We can use Survivial:AFT Model

We can also train XGBoost and CatBoost with Survivial:AFT loss.
See discussions here and notebook examples here and here.

My Annotation & Explanation about AFT model here:

UPDATE - NN Starter Notebook

I published an NN starter notebook here with CV 0.670 and LB 0.676!

My Annotation & Explanation about NN model here:

It's nice to be important, but it's more important to be nice.
- Dwayne Johnson -

CIBMTR - Equity in post-HCT Survival Predictions #5 How To Get Started - Understanding the Metric

dongsunseng — Wed, 5 Feb 2025 16:34:19 +0900

Annotation of Chris Deotte's discussion about "How To Get Started - Understanding the Metric".

https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550003

CIBMTR - Equity in post-HCT Survival Predictions

Improve prediction of transplant survival rates equitably for allogeneic HCT patients

www.kaggle.com

C-Index Explained

The competition metric is Stratified Concordance Index.
Let's explain how C-Index works (and let's ignore stratified for now).
Here is the formula:

Ground Truth and Predictions

Here is an image which will help us understand what this means:

Imagine that there are only 10 rows in the train.csv file shown above as 10 dots.
There are 5 efs=1 and 5 efs=0.
The efs_time is displayed in the plot above.
Points A, B, C, D, E have efs=1, and points F, G, H, I, J have efs=0.
The point A has the least efs_time and the point J has the greatest efs_time.
Each patient with efs=1 had an event, and the time before event was efs_time.
Each patient with efs=0 we do not know if they had an event or did not have an event.
All we know is that they were without event for at least efs_time long.
To summarize, each efs=1 was without event for exactly efs_time.
And each efs=0 was without event for at least efs_time.

How To Compute C-Index Denominator

The C-Index metric is a ranking metric similar to AUC.
The denominator counts all pairs of dots where we know ground truth T_j < T_i where T is the actual time without event (note when efs=0 then actual time without event > efs_time and when efs=1 then actual time without event = efs_time).
- T represents the "actual time without event" of a patient
- efs=1 indicates the occurrence of an event (e.g., death)
- efs=0 indicates a censored case (end of follow-up)
- Actual survival time (T) in two situations:
  - case 1) When efs=1:
    - T = efs_time
    - We can know the exact survival time
    case 2) When efs=0:
    - T > efs_time
    - We only know that they survived beyond the last observation point (efs_time)
- Examples of when we can determine "T_j < T_i":
  - # Example:
    Patient1: efs=1, efs_time=10  → T1=10
    Patient2: efs=0, efs_time=15  → T2>15
    Patient3: efs=1, efs_time=20  → T3=20
    # Comparable pairs:
    - T1 < T3 (10 < 20)
    - T1 < T2 (10 < T2, where T2 is greater than 15)
- Meaning of C-Index denominator:
  - Among all possible patient pairs (i,j)
  - Count the number of pairs where we can definitively determine "who lived longer"
  - The reason for calculating this way is that for censored cases (efs=0), we don't know the exact survival time, so we only include pairs that can be definitively compared in the evaluation.
The variables i and j are indices that range over every dot. In the example above, there are 32 possible pairs that we know T_j < T_i:

Note we do not know if D is less than F because we do not know the actual time without event for F, we only know that F's time without event is at least what it appears in the plot above (because F is efs=0).
Also we do not know if G is less than H because we do not know actual time without event for G nor H (both are efs=0).

How To Compute C-Index Numerator

The C-Index numerator is about our predictions.
For the 32 pairs above, we count how many of our predictions also follow these inequalities.
For example, is our prediction A greater than B? Is our prediction A greater than C? etc etc.
We ask 32 questions.
The last is, is our prediction E greater than J.
If all 32 questions answer yes, then our metric score = 1.
If all questions answer no, then our metric score = 0.
If 22 questions answer yes, then our metric score = 22/32 = 0.6875.
(Note that inequalities in denominator are less than (and about time).
And our predictions for the same pairs are greater than (and about risk).
This is because the denominator represents times being less than.
And our numerator represents risks being greater than.
In other words a patient with a shorter time without event has a greater risk.
And we are predicting risk factor.
(If you get this backwards, just change your predictions with pred = -1 * pred).)

How To Build a Model

If we only use efs as classification 0 or 1, to train our model (like current public notebooks), then our model will not be able to correctly compare A and C which both have efs=1.
If we use efs_time as regression, then our model can be smarter.
And if we use both efs and efs_time to train our model (classification/regression), our model will be smartest!

Starter Notebook

There are two ways to approach this competition and utilize both efs and efs_time:
1. Combine efs and efs_time ourselves into a new single target.
  - Then train a model using either classification or regression (on the new single target).
  - This is what i do in my XGB starter notebook here and NN starter notebook here.
    - XGB starter notebook: https://www.kaggle.com/code/cdeotte/xgboost-catboost-baseline-cv-668-lb-668
      - No annotation due to lower cv, lb score (0.668)
      - Basically https://dongsunseng.com/entry/CIBMTR-Equity-in-post-HCT-Survival-Predictions-4-GPU-LightGBM-Baseline-CV-681-LB-685 without survival model(cox, kaplan meier)
    - My annotation on NN starter notebook:
  - (Note each uses a different transformed target and we can experiment making more transformed targets to find the best!)
2. Use a model that supports survival loss (i.e. Cox or AFT).
  - Then we leave efs and efs_time as is and input both into the model.
  - The model learns from both and predicts a single target for us.
  - More discussion about this here.
    - My annotation on the discussion: https://dongsunseng.com/entry/CIBMTR-Equity-in-post-HCT-Survival-Predictions-6-How-To-Train-XGBoost-with-Survival-Loss

How To Compute Metric in Notebook

To compute the competition metric in your notebook, attached this notebook here which contains WHL files (because we need to pip install with internet off to be able to submit to comp).
Also attach Kaggle's metic notebook here.
Then add the following code in the first cell:

!pip install /kaggle/input/pip-install-lifelines/autograd-1.7.0-py3-none-any.whl
!pip install /kaggle/input/pip-install-lifelines/autograd-gamma-0.5.0.tar.gz
!pip install /kaggle/input/pip-install-lifelines/interface_meta-1.3.0-py3-none-any.whl
!pip install /kaggle/input/pip-install-lifelines/formulaic-1.0.2-py3-none-any.whl
!pip install /kaggle/input/pip-install-lifelines/lifelines-0.30.0-py3-none-any.whl

Afterward to compute the competition metric, run this code where preds are your oof predictions:

from metric import score
y_true = train[["ID","efs","efs_time","race_group"]].copy()
y_pred = train[["ID"]].copy()
y_pred["prediction"] = preds
m = score(y_true.copy(), y_pred.copy(), "ID")
print(f"CV Score = {m}")

Calculating c-index(from turkenm's comment)

Misunderstanding: Thought that all efs=0 patients should have lower risk scores than efs=1 patients
However, the author of the kernel corrected:
- "Not all efs=0 predictions need to be lower than efs=1.
  Only efs=0 cases where efs_time is greater than all efs=1 cases need to have lower risk scores."
Example:
- # Case examples
  Patient A: efs=1, efs_time=10  # Event occurred on day 10
  Patient B: efs=0, efs_time=5   # Censored on day 5
  Patient C: efs=0, efs_time=15  # Censored on day 15
  Patient D: efs=1, efs_time=8   # Event occurred on day 8
  # Only Patient C has efs_time greater than all efs=1
  # Therefore, only Patient C's risk score needs to be lower than efs=1 patients
  # Patient B's risk score can be any value
- Why Patient B's risk score can be any value
  - Patient A: efs=1, efs_time=10  # Death confirmed on day 10
    Patient B: efs=0, efs_time=5   # Observation stopped after day 5
    Patient D: efs=1, efs_time=8   # Death confirmed on day 8
  - From a C-index calculation perspective:
    - C-Index is only included in calculations when survival times between two patients can be compared
    - Patient B has no information after day 5, making clear comparisons with other patients impossible
    - Therefore, Patient B's risk score does not affect C-Index calculation
  - In contrast for patient C:
    1. Survival is confirmed until day 15
    2. Comparable with Patient A (died on day 10)
      - We can definitively know that Patient C lived longer than Patient A
      - Therefore, Patient C's risk score should be lower than Patient A's
    3. Also comparable with Patient D (died on day 8)
      - We can definitively know that Patient C lived longer than Patient D
      - Therefore, Patient C's risk score should also be lower than Patient D's
Special Case (efs=0, efs_time=0):
- These cases are not included in C-Index calculation at all
- Therefore, predictions for such cases don't affect the final score
Not necessary to predict low risk for all censored cases (efs=0)
Can selectively predict low risk considering efs_time
This allows the model to learn more flexibly
This understanding enables creating more effective survival analysis models.

Understanding delta_j

Summary on Daniel's Question:
- Checking if delta_j in C-Index calculation means the efs value (0 or 1)
- Curious about how our predicted risk values are evaluated
- Asking if perfect prediction is possible

Summary of Chris's Answer:
- # Elements used in C-Index calculation:
  - T_j, T_i: represent efs_time values
  - delta_j: represents efs value
  - N_j, N_i: represent our predictions
- Model should not try to directly predict efs_time
  - Reason: efs_time is randomly hidden due to censoring
- Instead, should predict 'risk score'
- Because risk is actually related to features (X features)
- Example:
  - # Wrong approach
    Patient A: survived 10 days -> model tries to predict 10
    Patient B: censored at 5 days -> actual survival unknown
    # Correct approach
    Patient A: high risk -> expect short survival
    Patient B: low risk -> expect long survival

성공한 자의 과거는 비참할수록 아름답다.

CIBMTR - Equity in post-HCT Survival Predictions #4 GPU LightGBM Baseline [CV 681 LB 685)

dongsunseng — Mon, 3 Feb 2025 17:33:35 +0900

This is an annotation of this kernel:

https://www.kaggle.com/code/cdeotte/gpu-lightgbm-baseline-cv-681-lb-685

GPU LightGBM Baseline - [CV 681 LB 685]

Explore and run machine learning code with Kaggle Notebooks | Using data from CIBMTR - Equity in post-HCT Survival Predictions

www.kaggle.com

GPU LightGBM Baseline

In this notebook, we present a GPU LightGBM baseline. In this notebook, compared to my previous starter notebooks we teach 5 new things:

How to tranform efs and efs_time into single target with KaplanMeierFitter.
How to train GPU LightGBM model with KaplanMeierFitter target
How to train XGBoost with Survivial:Cox loss
How to train CatBoost with Survival:Cox loss
How to ensemble 5 models using scipy.stats.rankdata().

Two Competition Approaches

In this competition, there are two ways to train a Survival Model:

We can input both efs and efs_time and train a model that supports survival loss like Cox.
Transform efs and efs_time into a single target proxy for risk score and train any model with regression loss like MSE.

In this notebook, we train 5 models.
The first 3 models (XGBoost, CatBoost, LightGBM) use bullet point two.
And the next 2 models (XGBoost Cox, CatBoost Cox) use bullet point one. Discussion about this notebook is here and here.
Since this competition's metric is a ranking metric, we ensemble the 5 predictions by first converting each into ranks using scipy.stats.rankdata().
Afterward we created a weighted average from the ranks.

Previous Notebooks

My previous starter notebooks are:

XGBoost and CatBoost starter here
NN (MLP) starter here

Associated discussions are here, here, here

Pip Install Libraries for Metric

Since internet must be turned off for submission, we pip install from my other notebook here where I downloaded the WHL files.

!pip install /kaggle/input/pip-install-lifelines/autograd-1.7.0-py3-none-any.whl
!pip install /kaggle/input/pip-install-lifelines/autograd-gamma-0.5.0.tar.gz
!pip install /kaggle/input/pip-install-lifelines/interface_meta-1.3.0-py3-none-any.whl
!pip install /kaggle/input/pip-install-lifelines/formulaic-1.0.2-py3-none-any.whl
!pip install /kaggle/input/pip-install-lifelines/lifelines-0.30.0-py3-none-any.whl

https://www.kaggle.com/code/cdeotte/pip-install-lifelines
There is a discussion explaining how to use these WHL files here.
- Annotation on the details posted on another blog of mine
Below is a quick summary:
- To compute the competition metric in your notebook, attached this notebook (which you are reading) which contains WHL files (because we need to pip install with internet off to be able to submit to comp).
Also attached Kaggle's metic notebook here.

Afterward to compute the competition metric, run this code where preds are your oof predictions:

from metric import score
y_true = train[["ID","efs","efs_time","race_group"]].copy()
y_pred = train[["ID"]].copy()
y_pred["prediction"] = preds
m = score(y_true.copy(), y_pred.copy(), "ID")
print(f"CV Score = {m}")

Load Train and Test

import numpy as np, pandas as pd
import matplotlib.pyplot as plt
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)

test = pd.read_csv("/kaggle/input/equity-post-HCT-survival-predictions/test.csv")
print("Test shape:", test.shape )

train = pd.read_csv("/kaggle/input/equity-post-HCT-survival-predictions/train.csv")
print("Train shape:",train.shape)
train.head()

EDA on Train Targets

There are two train targets efs and efs_time.
When efs==1 we know patient had an event and we know time of event is efs_time.
When efs==0 we do not know if patient had an event or not, but we do know that patient was without event for at least efs_time.

plt.hist(train.loc[train.efs==1,"efs_time"],bins=100,label="efs=1, Yes Event")
plt.hist(train.loc[train.efs==0,"efs_time"],bins=100,label="efs=0, Maybe Event")
plt.xlabel("Time of Observation, efs_time")
plt.ylabel("Density")
plt.title("Times of Observation. Either time to event, or time observed without event.")
plt.legend()
plt.show()

Transform Two Targets into One Target with KaplanMeier

Both targets efs and efs_time provide useful information.
We will tranform these two targets into a single target to train our model with.
In this competition we need to predict risk score.
So we will create a target that mimics risk score to train our model.
(Note this is only one out of many ways to transform two targets into one target.
Considering experimenting on your own).

from lifelines import KaplanMeierFitter
def transform_survival_probability(df, time_col='efs_time', event_col='efs'):
    kmf = KaplanMeierFitter()
    kmf.fit(df[time_col], df[event_col])
    y = kmf.survival_function_at_times(df[time_col]).values
    return y
train["y"] = transform_survival_probability(train, time_col='efs_time', event_col='efs')

plt.hist(train.loc[train.efs==1,"y"],bins=100,label="efs=1, Yes Event")
plt.hist(train.loc[train.efs==0,"y"],bins=100,label="efs=0, Maybe Event")
plt.xlabel("Transformed Target y")
plt.ylabel("Density")
plt.title("KaplanMeier Transformed Target y using both efs and efs_time.")
plt.legend()
plt.show()

Features

There are a total of 57 features.
From these 35 are categorical and 22 are numerical.
We will label encode the categorical features.
Then our XGB and CAT model will accept these as categorical features and process them special internally.
We leave the numerical feature NANs as NANs because GBDT (like XGB and CAT) can handle NAN and will use this information.

RMV = ["ID","efs","efs_time","y"]
FEATURES = [c for c in train.columns if not c in RMV]
print(f"There are {len(FEATURES)} FEATURES: {FEATURES}")

CATS = []
for c in FEATURES:
    if train[c].dtype=="object":
        CATS.append(c)
        train[c] = train[c].fillna("NAN")
        test[c] = test[c].fillna("NAN")
print(f"In these features, there are {len(CATS)} CATEGORICAL FEATURES: {CATS}")

combined = pd.concat([train,test],axis=0,ignore_index=True)
#print("Combined data shape:", combined.shape )

# LABEL ENCODE CATEGORICAL FEATURES
print("We LABEL ENCODE the CATEGORICAL FEATURES: ",end="")
for c in FEATURES:

    # LABEL ENCODE CATEGORICAL AND CONVERT TO INT32 CATEGORY
    if c in CATS:
        print(f"{c}, ",end="")
        combined[c],_ = combined[c].factorize()
        combined[c] -= combined[c].min()
        combined[c] = combined[c].astype("int32")
        combined[c] = combined[c].astype("category")
        
    # REDUCE PRECISION OF NUMERICAL TO 32BIT TO SAVE MEMORY
    else:
        if combined[c].dtype=="float64":
            combined[c] = combined[c].astype("float32")
        if combined[c].dtype=="int64":
            combined[c] = combined[c].astype("int32")
    
train = combined.iloc[:len(train)].copy()
test = combined.iloc[len(train):].reset_index(drop=True).copy()

Details about categorical feature label encoding part:
- combined[c],_ = combined[c].factorize()
  - factorize() is a pandas function that converts categorical data like strings into numbers
  - Example: ['A', 'B', 'A', 'C'] → [0, 1, 0, 2]
  - '_' ignores the second return value (list of unique values)
- combined[c] -= combined[c].min()
  - Subtracts the minimum value to make the sequence start from 0
  - Example: [1, 2, 1, 3] → [0, 1, 0, 2]
- combined[c] = combined[c].astype("int32")
  - Converts data type to 32-bit integer
  - Used for memory efficiency
- combined[c] = combined[c].astype("category")
  - Finally converts back to category type
  - For efficient handling of categorical data in pandas
    - Memory Efficiency
      - Category type internally stores repeating values as integers and maintains only a mapping table
      - Very efficient when strings like ['High', 'Low', 'High', 'Medium'] are repeated
    - Improved Operation Speed
      - Faster operations on categorical data
      - Optimized for tasks like grouping and sorting
    - Metadata Preservation
      - Explicitly expresses that this column is categorical
      - Better represents the meaning of the data
  - Result: Maintains information that this column is categorical, while saving memory
- original = ['High', 'Low', 'High', 'Medium']
  ↓ factorize()
  [0, 1, 0, 2]
  ↓ No need to subtract min() (already starts from 0)
  [0, 1, 0, 2]
  ↓ Convert to int32
  [0, 1, 0, 2] (only data type changes)
  ↓ Convert to category
  [0, 1, 0, 2] (processed as categorical)

XGBoost with KaplanMeier

Trained XGBoost model for 10 folds and achieved CV 0.674

from sklearn.model_selection import KFold
from xgboost import XGBRegressor, XGBClassifier
import xgboost as xgb
print("Using XGBoost version",xgb.__version__)

%%time
FOLDS = 10
kf = KFold(n_splits=FOLDS, shuffle=True, random_state=42) # shuffle=True for randomizing data
    
oof_xgb = np.zeros(len(train)) # out-of-fold predictions
pred_xgb = np.zeros(len(test)) # test data predictions

# k-fold cross validation loop -> Train on 90% and test(validation) on 10%
for i, (train_index, test_index) in enumerate(kf.split(train)):

    print("#"*25)
    print(f"### Fold {i+1}")
    print("#"*25)
    
    x_train = train.loc[train_index,FEATURES].copy()
    y_train = train.loc[train_index,"y"]
    x_valid = train.loc[test_index,FEATURES].copy()
    y_valid = train.loc[test_index,"y"]
    x_test = test[FEATURES].copy()

    model_xgb = XGBRegressor(
        device="cuda", # Use GPU
        max_depth=3,  # Tree depth
        colsample_bytree=0.5, # details below
        subsample=0.8,  # details below
        n_estimators=2000,  # Number of trees
        learning_rate=0.02,  
        enable_categorical=True, # Handle categorical variables
        min_child_weight=80, # details below
        #early_stopping_rounds=25,
    )
    
    model_xgb.fit(
        x_train, y_train,
        eval_set=[(x_valid, y_valid)],  
        verbose=500 
    )

    # INFER OOF: predictions for validation data
    oof_xgb[test_index] = model_xgb.predict(x_valid)
    
    # INFER TEST: predictions for test data
    pred_xgb += model_xgb.predict(x_test)

# COMPUTE AVERAGE TEST PREDS: Average predictions from 10 folds
pred_xgb /= FOLDS

colsample_bytree=0.5
- Meaning: Proportion of features (columns) to use when creating each tree
- Example:
  - If there are 100 features and colsample_bytree=0.5
  - Each tree uses only 50 randomly selected features
- Effects:
  1. Prevents overfitting
  2. Tries various feature combinations
  3. Reduces over-dependence on specific features
- Lower values: More conservative model, reduced overfitting risk
- Higher values: Uses more features, can capture complex patterns
subsample=0.8
- Meaning: Proportion of training data to use for each tree
- Example:
  - If there are 1000 data points and subsample=0.8
  - Each tree learns from 800 randomly selected data points
- Effects:
  1. Prevents overfitting
  2. Improves model generalization
  3. Ensures diversity as each tree learns from slightly different data
- Lower values: More randomness, reduced overfitting risk
- Higher values: Uses more data, stable learning
min_child_weight=80
- Meaning: Minimum sum of weights required to create a leaf node
- Example:
  - If min_child_weight=80
  - Won't split if the resulting node's weight sum would be less than 80
- Effects:
  - Prevents splitting into too small groups
  - Controls overfitting
  - Improves model stability
- Lower values: Allows finer splits, can learn complex patterns
- Higher values: More conservative splits, reduced overfitting risk

Scoring the model:

from metric import score

y_true = train[["ID","efs","efs_time","race_group"]].copy()
y_pred = train[["ID"]].copy()
y_pred["prediction"] = oof_xgb
m = score(y_true.copy(), y_pred.copy(), "ID")
print(f"\nOverall CV for XGBoost KaplanMeier =",m)

feature_importance = model_xgb.feature_importances_
importance_df = pd.DataFrame({
    "Feature": FEATURES,  # Replace FEATURES with your list of feature names
    "Importance": feature_importance
}).sort_values(by="Importance", ascending=False)
plt.figure(figsize=(10, 15))
plt.barh(importance_df["Feature"], importance_df["Importance"])
plt.xlabel("Importance")
plt.ylabel("Feature")
plt.title("XGBoost KaplanMeier Feature Importance")
plt.gca().invert_yaxis()  # Flip features for better readability
plt.show()

CatBoost with KaplanMeier

Trained CatBoost model for 10 folds and achieved CV 0.674

from catboost import CatBoostRegressor, CatBoostClassifier
import catboost as cb
print("Using CatBoost version",cb.__version__)

%%time
FOLDS = 10
kf = KFold(n_splits=FOLDS, shuffle=True, random_state=42)
    
oof_cat = np.zeros(len(train))
pred_cat = np.zeros(len(test))

for i, (train_index, test_index) in enumerate(kf.split(train)):

    print("#"*25)
    print(f"### Fold {i+1}")
    print("#"*25)
    
    x_train = train.loc[train_index,FEATURES].copy()
    y_train = train.loc[train_index,"y"]
    x_valid = train.loc[test_index,FEATURES].copy()
    y_valid = train.loc[test_index,"y"]
    x_test = test[FEATURES].copy()

    model_cat = CatBoostRegressor(
        task_type="GPU",  # Using GPU
        learning_rate=0.1,    
        grow_policy='Lossguide', # Details below
        #early_stopping_rounds=25,
    )
    model_cat.fit(x_train,y_train,
              eval_set=(x_valid, y_valid),
              cat_features=CATS,
              verbose=250)

    # INFER OOF
    oof_cat[test_index] = model_cat.predict(x_valid)
    # INFER TEST
    pred_cat += model_cat.predict(x_test)

# COMPUTE AVERAGE TEST PREDS
pred_cat /= FOLDS

grow_policy='Lossguide'
- Tree growth method: grow tree by selecting leaves that minimize loss

Scoring the model:

y_true = train[["ID","efs","efs_time","race_group"]].copy()
y_pred = train[["ID"]].copy()
y_pred["prediction"] = oof_cat
m = score(y_true.copy(), y_pred.copy(), "ID")
print(f"\nOverall CV for CatBoost KaplanMeier =",m)

feature_importance = model_cat.get_feature_importance()
importance_df = pd.DataFrame({
    "Feature": FEATURES, 
    "Importance": feature_importance
}).sort_values(by="Importance", ascending=False)
plt.figure(figsize=(10, 15))
plt.barh(importance_df["Feature"], importance_df["Importance"])
plt.xlabel("Importance")
plt.ylabel("Feature")
plt.title("CatBoost KaplanMeier Feature Importance")
plt.gca().invert_yaxis()  # Flip features for better readability
plt.show()

LightGBM with KaplanMeier

Trained LightGBM model for 10 folds and achieved CV 0.6725

from lightgbm import LGBMRegressor
import lightgbm as lgb
print("Using LightGBM version",lgb.__version__)

FOLDS = 10
kf = KFold(n_splits=FOLDS, shuffle=True, random_state=42)
    
oof_lgb = np.zeros(len(train))
pred_lgb = np.zeros(len(test))

for i, (train_index, test_index) in enumerate(kf.split(train)):

    print("#"*25)
    print(f"### Fold {i+1}")
    print("#"*25)
    
    x_train = train.loc[train_index,FEATURES].copy()
    y_train = train.loc[train_index,"y"]    
    x_valid = train.loc[test_index,FEATURES].copy()
    y_valid = train.loc[test_index,"y"]
    x_test = test[FEATURES].copy()

    model_lgb = LGBMRegressor(
        device="gpu", 
        max_depth=3, 
        colsample_bytree=0.4,  
        #subsample=0.9, 
        n_estimators=2500, 
        learning_rate=0.02, 
        objective="regression", # detail below
        verbose=-1, # detail below
        #early_stopping_rounds=25,
    )
    model_lgb.fit(
        x_train, y_train,
        eval_set=[(x_valid, y_valid)],
    )
    
    # INFER OOF
    oof_lgb[test_index] = model_lgb.predict(x_valid)
    # INFER TEST
    pred_lgb += model_lgb.predict(x_test)

# COMPUTE AVERAGE TEST PREDS
pred_lgb /= FOLDS

objective="regression"
- Parameter that specifies the learning objective (loss function)
- "regression" is the default setting for regression problems
- Other options:
  - "binary": binary classification
  - "multiclass": multi-class classification
  - "ranking": ranking problems
  - "poisson": Poisson regression
  - "quantile": quantile regression
verbose=-1
- Specifies the level of detail for logs during training
- Meaning of values:
  - -1: no output (completely silent)
  - 0: only warnings and errors
  - 1: basic information
  - 2: detailed information
- Currently set to -1, so no messages will be output during training

Scoring the model:

y_true = train[["ID","efs","efs_time","race_group"]].copy()
y_pred = train[["ID"]].copy()
y_pred["prediction"] = oof_lgb
m = score(y_true.copy(), y_pred.copy(), "ID")
print(f"\nOverall CV for LightGBM KaplanMeier =",m)

feature_importance = model_lgb.feature_importances_ 
importance_df = pd.DataFrame({
    "Feature": FEATURES,
    "Importance": feature_importance
}).sort_values(by="Importance", ascending=False)
plt.figure(figsize=(10, 15))
plt.barh(importance_df["Feature"], importance_df["Importance"], color='skyblue')
plt.xlabel("Importance (Gain)")
plt.ylabel("Feature")
plt.title("LightGBM KaplanMeier Feature Importance")
plt.gca().invert_yaxis()  # Flip features for better readability
plt.show()

XGBoost with Survival:Cox

Trained XGBoost using Survival:Cox loss for 10 folds and achieved CV=672!

# SURVIVAL COX NEEDS THIS TARGET (TO DIGEST EFS AND EFS_TIME)
train["efs_time2"] = train.efs_time.copy()
train.loc[train.efs==0,"efs_time2"] *= -1

Above code prepares the target variable for Cox model
- train["efs_time2"] = train.efs_time.copy()
  - Creates a new column by copying efs_time
- train.loc[train.efs==0,"efs_time2"] *= -1
  - For cases where efs is 0 (no event occurred)
  - Converts the time value to negative
- Reasons for doing this:
  - Cox models use this approach to represent censoring information
  - Negative time → censored case (no event occurred)
  - Positive time → event occurred case
- Example:
  - Original Data:
    efs  | efs_time | efs_time2
    1    | 100      | 100      (death/relapse occurred)
    0    | 150      | -150     (censored)
    1    | 80       | 80       (death/relapse occurred)

FOLDS = 10
kf = KFold(n_splits=FOLDS, shuffle=True, random_state=42)
    
oof_xgb_cox = np.zeros(len(train))
pred_xgb_cox = np.zeros(len(test))

for i, (train_index, test_index) in enumerate(kf.split(train)):

    print("#"*25)
    print(f"### Fold {i+1}")
    print("#"*25)
    
    x_train = train.loc[train_index,FEATURES].copy()
    y_train = train.loc[train_index,"efs_time2"]    
    x_valid = train.loc[test_index,FEATURES].copy()
    y_valid = train.loc[test_index,"efs_time2"]
    x_test = test[FEATURES].copy()

    # same attributes with xgb above except objective and eval-metric
    model_xgb_cox = XGBRegressor(
        device="cuda",
        max_depth=3,  
        colsample_bytree=0.5,  
        subsample=0.8,  
        n_estimators=2000,  
        learning_rate=0.02,  
        enable_categorical=True,
        min_child_weight=80,
        objective='survival:cox',
        eval_metric='cox-nloglik',
    )
    model_xgb_cox.fit(
        x_train, y_train,
        eval_set=[(x_valid, y_valid)],  
        verbose=500  
    )
    
    # INFER OOF
    oof_xgb_cox[test_index] = model_xgb_cox.predict(x_valid)
    # INFER TEST
    pred_xgb_cox += model_xgb_cox.predict(x_test)

# COMPUTE AVERAGE TEST PREDS
pred_xgb_cox /= FOLDS

Scoring the model:

y_true = train[["ID","efs","efs_time","race_group"]].copy()
y_pred = train[["ID"]].copy()
y_pred["prediction"] = oof_xgb_cox
m = score(y_true.copy(), y_pred.copy(), "ID")
print(f"\nOverall CV for XGBoost Survival:Cox =",m)

feature_importance = model_xgb_cox.feature_importances_
importance_df = pd.DataFrame({
    "Feature": FEATURES,  # Replace FEATURES with your list of feature names
    "Importance": feature_importance
}).sort_values(by="Importance", ascending=False)
plt.figure(figsize=(10, 15))
plt.barh(importance_df["Feature"], importance_df["Importance"])
plt.xlabel("Importance")
plt.ylabel("Feature")
plt.title("XGBoost Survival:Cox Feature Importance")
plt.gca().invert_yaxis()  # Flip features for better readability
plt.show()

CatBoost with Survival:Cox

Trained CatBoost using Survival:Cox loss for 10 folds and achieved CV=671!

FOLDS = 10
kf = KFold(n_splits=FOLDS, shuffle=True, random_state=42)
    
oof_cat_cox = np.zeros(len(train))
pred_cat_cox = np.zeros(len(test))

for i, (train_index, test_index) in enumerate(kf.split(train)):

    print("#"*25)
    print(f"### Fold {i+1}")
    print("#"*25)
    
    x_train = train.loc[train_index,FEATURES].copy()
    y_train = train.loc[train_index,"efs_time2"]    
    x_valid = train.loc[test_index,FEATURES].copy()
    y_valid = train.loc[test_index,"efs_time2"]
    x_test = test[FEATURES].copy()

    model_cat_cox = CatBoostRegressor(
        loss_function="Cox",
        #task_type="GPU",   
        iterations=400,   # Total number of trees to train  
        learning_rate=0.1,  
        grow_policy='Lossguide',
        use_best_model=False, # details below
    )
    model_cat_cox.fit(x_train,y_train,
              eval_set=(x_valid, y_valid),
              cat_features=CATS,
              verbose=100)
    
    # INFER OOF
    oof_cat_cox[test_index] = model_cat_cox.predict(x_valid)
    # INFER TEST
    pred_cat_cox += model_cat_cox.predict(x_test)

# COMPUTE AVERAGE TEST PREDS
pred_cat_cox /= FOLDS

use_best_model=False
- Uses all iterations (doesn't use early-stopped optimal model)
- Conversely, when use_best_model=True:
  1. Selects the model from the point showing best performance on validation data
  2. Stops training if performance decreases in subsequent iterations
  3. Acts as a form of early stopping
- Reasons for setting it to False:
  - Later iterations can sometimes be important in Cox models
  - Even if validation performance temporarily worsens, it might help overall survival prediction
  - Especially with lots of censored data, using all iterations might be more stable
- Example:
  - # When True
    iter 100: performance 0.8
    iter 200: performance 0.85 (best)
    iter 300: performance 0.83
    → Uses model from iter 200
    
    # When False
    Uses combined results from all iterations (400)

Scoring the model:

y_true = train[["ID","efs","efs_time","race_group"]].copy()
y_pred = train[["ID"]].copy()
y_pred["prediction"] = oof_cat_cox
m = score(y_true.copy(), y_pred.copy(), "ID")
print(f"\nOverall CV for CatBoost Survival:Cox =",m)

feature_importance = model_cat_cox.get_feature_importance()
importance_df = pd.DataFrame({
    "Feature": FEATURES, 
    "Importance": feature_importance
}).sort_values(by="Importance", ascending=False)
plt.figure(figsize=(10, 15))
plt.barh(importance_df["Feature"], importance_df["Importance"])
plt.xlabel("Importance")
plt.ylabel("Feature")
plt.title("CatBoost Survival:Cox Feature Importance")
plt.gca().invert_yaxis()  # Flip features for better readability
plt.show()

Ensemble CAT and XGB and LGB

We ensemble our XGBoost, CatBoost, LightGBM, XGBoost Cox, and CatBoost Cox using scipy.stats.rankdata() and achieve an amazing CV=0.681 Wow!

from scipy.stats import rankdata 

y_true = train[["ID","efs","efs_time","race_group"]].copy()
y_pred = train[["ID"]].copy()
y_pred["prediction"] = rankdata(oof_xgb) + rankdata(oof_cat) + rankdata(oof_lgb)\
                     + rankdata(oof_xgb_cox) + rankdata(oof_cat_cox)
m = score(y_true.copy(), y_pred.copy(), "ID")
print(f"\nOverall CV for Ensemble =",m)

rankdata: Function that returns the rank of each prediction -> Sums up ranks from five models
How does the function work(Example):
from scipy.stats import rankdata
# Sample data
predictions = [10.5, 5.2, 15.7, 5.2, 8.1]
# Apply rankdata
ranks = rankdata(predictions)
1. First, sort the values in ascending order:
  - 5.2, 5.2, 8.1, 10.5, 15.7
    (1-2), (1-2), (3), (4), (5)
2. Assign ranks to each value:
  - Original: [10.5, 5.2, 15.7, 5.2, 8.1]
    Ranks: [4, 1.5, 5, 1.5, 3]
  - Since 5.2 appears twice, both receive 1.5 (average of 1st and 2nd place)
3. Sum the rank value
  - model1_ranks = rankdata([0.1, 0.2, 0.3])  # [1, 2, 3]
    model2_ranks = rankdata([0.3, 0.1, 0.2])  # [3, 1, 2]
    ensemble = model1_ranks + model2_ranks     # [4, 3, 5]
Why use rankdata? --> Suitable for Survival Analysis
- Well-aligned with rank-based evaluation metrics such as the Concordance index
- The relative risk ranking is often more important than the actual survival time

Create Submission CSV

sub = pd.read_csv("/kaggle/input/equity-post-HCT-survival-predictions/sample_submission.csv")
sub.prediction = rankdata(pred_xgb) + rankdata(pred_cat) + rankdata(pred_lgb)\
                     + rankdata(pred_xgb_cox) + rankdata(pred_cat_cox)
sub.to_csv("submission.csv",index=False)
print("Sub shape:",sub.shape)
sub.head()

Cast all your anxiety on him because he cares for you
<Peter 5:7>

CIBMTR - Equity in post-HCT Survival Predictions #3 Understanding Survival Analysis - 2

dongsunseng — Sat, 1 Feb 2025 16:42:55 +0900

Annotation of modeling & SHAP part of this kernel:

Understanding Survival Analysis

Explore and run machine learning code with Kaggle Notebooks | Using data from CIBMTR - Equity in post-HCT Survival Predictions

www.kaggle.com

XGBoost Model for Survival

We will now use an XGBoost model with Optuna to find the ideal hyperparameters.
This model will be used to submit predictions.
This XGBoost model implements survival analysis using the Cox proportional hazards (CPH) loss function, a widely used approach for time-to-event modeling.
It predicts risk scores for patients undergoing hematopoietic cell transplantation (HCT), leveraging features such as patient demographics and clinical characteristics.
The CPH model ranks patients based on their relative risk of experiencing an event, such as death or relapse.
It evaluates performance using metrics like the concordance index (C-index), which measures the model’s ability to rank patients by their predicted risk correctly.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from xgboost import XGBRegressor
import optuna

# Load the data
train_path = "/kaggle/input/equity-post-HCT-survival-predictions/train.csv"
data_dict = "/kaggle/input/equity-post-HCT-survival-predictions/data_dictionary.csv"

train_df = pd.read_csv(train_path)
data_info_df = pd.read_csv(data_dict)

# Preprocessing
epsilon = 1e-5
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()

for index, row in data_info_df.iterrows():
    if row["type"] == "Categorical":
        # Encode categorical variables as numbers
        train_df[row["variable"]] = label_encoder.fit_transform(train_df[row["variable"]].astype(str))
    else:
        # Fill missing values in numerical variables with -1
        train_df[row["variable"]] = train_df[row["variable"]].fillna(-1)
        
# Define target variable
train_df["y"] = train_df["efs"] / (train_df["efs_time"] + epsilon)

# Define features and target
X = train_df.drop(columns=["efs", "efs_time", "ID", "y"])
y = train_df["y"]

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Optuna objective function
def objective(trial):
    # Hyperparameter search space
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 2000, step=100),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "max_depth": trial.suggest_int("max_depth", 3, 15),
        "subsample": trial.suggest_float("subsample", 0.6, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.6, 1.0),
        "reg_alpha": trial.suggest_float("reg_alpha", 1e-5, 1.0, log=True),
        "reg_lambda": trial.suggest_float("reg_lambda", 1e-5, 1.0, log=True),
        "early_stopping_rounds": 50  # Move early stopping here
    }

    # Train the model
    model = XGBRegressor(random_state=42, **params)
    model.fit(
        X_train, y_train,
        eval_set=[(X_test, y_test)],
        verbose=False
    )
    
    # Predictions and evaluation
    y_pred = model.predict(X_test)
    rmse = mean_squared_error(y_test, y_pred, squared=False)
    return rmse

# Run Optuna
study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)

# Best parameters and RMSE
print("Best parameters:", study.best_params)
print("Best RMSE:", study.best_value)

# Train the final model with the best parameters
best_params = study.best_params
final_model = XGBRegressor(random_state=42, **best_params)
final_model.fit(X_train, y_train)

# Final predictions and evaluation
y_pred = final_model.predict(X_test)
rmse = mean_squared_error(y_test, y_pred, squared=False)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Final RMSE: {rmse:.4f}")
print(f"Final MAE: {mae:.4f}")
print(f"Final R²: {r2:.4f}")

Reason why we set target variable as train_df["y"] = train_df["efs"] / (train_df["efs_time"] + epsilon)

This calculation means:

efs (event-free survival): whether an event (death or relapse) occurred (0 or 1)
efs_time: observation period
epsilon (1e-5): very small number to prevent division by zero

Reasons for this division:

Time normalization
- Same events might have different importance depending on when they occurred
- Example: relapse within 1 year vs relapse after 5 years
Reflects hazard concept
- Higher value means event occurred more quickly
- Examples:
  - case 1: efs=1, time=1 year → y ≈ 1
    case 2: efs=1, time=5 years → y ≈ 0.2
    case 3: efs=0, time=any → y = 0
Reflects survival analysis characteristics
- Censored cases (efs=0) automatically become 0
- Cases with events get different weights based on occurrence time

This target variable becomes an indicator of "event occurrence risk per unit time."

Scoring the submission

from lifelines.utils import concordance_index

# Define the score function
def score(solution: pd.DataFrame, submission: pd.DataFrame, row_id_column_name: str) -> float:
    """
    Calculate C-index for each race group and return the global score.
    """
    del solution[row_id_column_name]
    del submission[row_id_column_name]
    
    event_label = 'efs'
    interval_label = 'efs_time'
    prediction_label = 'prediction'
    for col in submission.columns:
        if not pd.api.types.is_numeric_dtype(submission[col]):
            raise ValueError(f'Submission column {col} must be a number')

    # Merging solution and submission dfs on ID
    merged_df = pd.concat([solution, submission], axis=1)
    merged_df.reset_index(inplace=True)
    merged_df_race_dict = dict(merged_df.groupby(['race_group']).groups)
    metric_list = []
    for race in merged_df_race_dict.keys():
        # Retrieving values from y_test based on index
        indices = sorted(merged_df_race_dict[race])
        merged_df_race = merged_df.iloc[indices]
        # Calculate the concordance index
        c_index_race = concordance_index(
                        merged_df_race[interval_label],
                        -merged_df_race[prediction_label],
                        merged_df_race[event_label])
        metric_list.append(c_index_race)
    return float(np.mean(metric_list) - np.sqrt(np.var(metric_list)))

# Final predictions
y_pred = final_model.predict(X_test)

# Prepare DataFrames for scoring
y_true_df = train_df.iloc[X_test.index][["ID", "efs", "efs_time", "race_group"]].copy()
y_pred_df = train_df.iloc[X_test.index][["ID"]].copy()
y_pred_df["prediction"] = y_pred

# Calculate the stratified C-index
stratified_c_index = score(y_true_df, y_pred_df, "ID")
print(f"Stratified C-index: {stratified_c_index:.4f}")

Stratified C-index: 0.6716

import optuna.visualization as vis

# Plot optimization history (objective value per trial)
fig = vis.plot_optimization_history(study)
fig.show()

SHAP (SHapley Additive exPlanations)

SHAP (SHapley Additive exPlanations) is a unified framework for interpreting machine learning models.
It is based on cooperative game theory and provides insights into the contribution of each feature to a model's predictions.

Formula for Shapley Value (φᵢ) of a specific Feature i:

S: subset of features excluding feature i
N: set of all features
|S|: number of features in S
|N|: total number of features
f(S): model prediction using only features in S
f(S∪{i}): model prediction when feature i is added to S
(|S|!(|N|-|S|-1)!)/(|N|!):
- formula for calculating weights, related to permutations and combinations
- |S|! : factorial of the number of features in S
- (|N|-|S|-1)! : factorial of (total number of features minus S's features minus 1)
- |N|! : factorial of total number of features
- Why this weight is necessary:
  - Contribution can be calculated differently depending on feature order
  - To calculate average contribution considering all possible orders
- If total features are 3 (N={A,B,C}),
  feature i is A, and
  S is {B}:
  |S| = 1 (just B)
  |N| = 3 (A,B,C three features)
  |N|-|S|-1 = 1 (3-1-1)
  Therefore:
  (1! * 1!) / 3! = (1 * 1) / 6 = 1/6
- This weight is used to calculate the average influence of feature A when considering all possible orders in which it can be combined with other features.
This formula calculates "how much feature i contributes to predictions when combined with other features."

Formula for Final Model Prediction (ŷ):

φ₀: base prediction (average of all predictions)
φᵢ: Shapley value of each feature
ŷ: final prediction for a specific instance
This formula means "create the final prediction by adding each feature's contribution to the base prediction."

import shap
import matplotlib.pyplot as plt
from tqdm import tqdm
import numpy as np

# Use only the first 100 rows of X
X = X.iloc[:100, :]

# Clean feature names by replacing special characters
X.columns = (
    X.columns.str.replace(r"\[", "_", regex=True)
             .str.replace(r"\]", "_", regex=True)
             .str.replace(r"<", "_", regex=True)
)

# Initialize SHAP TreeExplainer
explainer = shap.TreeExplainer(final_model)  # Use TreeExplainer with the XGBoost model

# Compute SHAP values for all rows at once
shap_values = explainer.shap_values(X)

# Summary plot: Displays the importance of features
shap.summary_plot(shap_values, X, plot_type="bar")  # Bar plot of mean absolute SHAP values

# Summary plot: Detailed distribution of feature impacts
shap.summary_plot(shap_values, X)  # Beeswarm plot

Creating submission csv

import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBRegressor

# Load the test and sample submission files
test = pd.read_csv('/kaggle/input/equity-post-HCT-survival-predictions/test.csv')
sample_submission = pd.read_csv('/kaggle/input/equity-post-HCT-survival-predictions/sample_submission.csv')

# Load the training data for consistent preprocessing
train = pd.read_csv('/kaggle/input/equity-post-HCT-survival-predictions/train.csv')

# Preprocessing: Handle categorical and numerical variables consistently
label_encoder = LabelEncoder()

for column in train.columns:
    if column in ["efs", "efs_time"]:  # Skip target variables not present in the test set
        continue
    
    if train[column].dtype == 'object':  # Handle categorical variables
        test[column] = test[column].fillna("NAN")
        train[column] = train[column].fillna("NAN")
        label_encoder.fit(pd.concat([train[column], test[column]], axis=0))
        test[column] = label_encoder.transform(test[column])
    else:  # Handle numerical variables
        test[column] = test[column].fillna(-1)  # Replace missing values with -1

# Define features to align with the training data
FEATURES = [col for col in train.columns if col not in ["ID", "efs", "efs_time", "y"]]

# Ensure the test set matches the feature space of the training data
missing_cols = [col for col in FEATURES if col not in test.columns]
for col in missing_cols:
    test[col] = 0  # Add missing columns with default values

test = test[FEATURES]  # Reorder columns to match the training feature space

# Make predictions on the test set
test['predicted_risk'] = final_model.predict(test)

# Prepare the submission file
sample_submission['prediction'] = test['predicted_risk']

# Check for any missing or invalid values in the predictions
if sample_submission['prediction'].isnull().any():
    raise ValueError("The submission file contains NaN values. Please check your predictions.")

# Save the submission file in the correct format
sample_submission.to_csv('submission.csv', index=False)

# Display the first few rows of the submission file to verify
sample_submission.head()

꿈을 크게 가져야 깨져도 그 조각이 크다.

CIBMTR - Equity in post-HCT Survival Predictions #2 Understanding Survival Analysis - 1

dongsunseng — Sat, 1 Feb 2025 02:41:15 +0900

Annotation of this kernel: https://www.kaggle.com/code/benjenkins96/understanding-survival-analysis

Understanding Survival Analysis

Explore and run machine learning code with Kaggle Notebooks | Using data from CIBMTR - Equity in post-HCT Survival Predictions

www.kaggle.com

Initial EDA

# Check the distribution of the target variables
plt.figure(figsize=(10, 5))
sns.countplot(data=train, x='efs', palette='coolwarm')
plt.title('Distribution of Event-Free Survival (efs)')
plt.show()

Event-free survival(efs) is an important outcome measure in medical research, particularly in transplant studies
EFS refers to the period from the start of treatment(transplant) until the occurrence of an "event": probably death in this competition
EFS differs from Overall Survival (OS):
- OS only considers survival/death
- EFS includes not only survival but also various important clinical events related to treatment success

plt.figure(figsize=(10, 5))
sns.histplot(data=train, x='efs_time', bins=30, kde=True, color='blue')
plt.title('Distribution of Time to Event-Free Survival (efs_time)')
plt.show()

plt.figure(figsize=(6, 3))
plt.hist(train.efs_time[train.efs == 0], bins=50, label='efs=0: Patient Still Alive Or Unknown', alpha=0.5)
plt.hist(train.efs_time[train.efs == 1], bins=50, label='efs=1: Patient Dies', alpha=0.5)
plt.legend()
plt.xlabel('Event Free Survival Time')
plt.ylabel('Count')
plt.title('Histogram of Time to Event-Free Survival (efs_time)')
plt.show()

# Explore distribution of key demographic features
demo_features = ['race_group', 'sex_match', 'ethnicity']
for feature in demo_features:
    plt.figure(figsize=(10, 5))
    sns.countplot(data=train, x=feature, palette='viridis', order=train[feature].value_counts().index)
    plt.title(f'Distribution of {feature}')
    plt.xticks(rotation=45)
    plt.show()

sex_match is a variable that indicates the gender match between donor and recipient in Hematopoietic Cell Transplantation (HCT).
It is typically categorized as follows:
- M-M: Male donor → Male recipient
- M-F: Male donor → Female recipient
- F-M: Female donor → Male recipient
- F-F: Female donor → Female recipient

Kaplan-Meier Estimator

The Kaplan-Meier Estimator is a non-parametric statistical method used in survival analysis to estimate the survival function from time-to-event data.
It calculates the probability that an individual will survive beyond a certain point in time, accounting for censored data (cases, where the event of interest has not occurred by the end of the study or the individual, is lost to follow-up).

Key Properties:

The Kaplan-Meier curve is a step function, with drops occurring at times when events are observed.
It handles censoring by only considering individuals at risk just before each event time.

Advantages:

Non-parametric: Makes no assumptions about the distribution of survival times.
Handles Censoring: Incorporates censored data effectively.
Easy Interpretation: Provides intuitive survival probabilities.

Limitations:

Assumes Independence of Censoring: Assumes that the censored individuals have the same survival prospects as those still under observation.
Lack of Multivariable Adjustments: Does not account for the effects of covariates (e.g., age, race). For this, models like Cox regression are used.
Uncertainty at Long Times: If few individuals remain at risk at later time points, the estimates may become less reliable.

Use Case:

In the context of HCT survival analysis:

Kaplan-Meier can estimate survival probabilities for the entire population or subgroups (e.g., race or gender).
It helps visualize differences in survival rates among groups, providing insights into disparities or the impact of certain factors.

Results:

The Kaplan-Meier survival curve represents the probability of remaining event-free (e.g., alive or without relapse) over time, with the y-axis showing survival probability and the x-axis representing time in months.
- In Kaplan-Meier survival curves, "event-free (alive or without relapse)" means satisfying both of these conditions:
  - alive: the patient is living
  - without relapse: the disease has not recurred(재발)
Initially, the curve starts at 1.0 (100% survival) since all individuals are event-free at time zero.
The steep decline in the early months indicates that a significant number of patients experience events, such as death or relapse, shortly after the transplant.
This highlights the high-risk nature of the initial post-transplant period.
As time progresses, the curve begins to level off, particularly after 20-30 months, suggesting that those who survive the initial phase tend to have better long-term outcomes.
The survival probability never reaches zero, indicating that a portion of the population remains event-free throughout the observation period.
The shaded region around the curve represents the confidence interval, which reflects the uncertainty of the survival estimates.
Early on, the confidence intervals are narrow, indicating precise estimates due to a larger sample size.
However, they widen at later time points, reflecting fewer patients being observed (due to censoring), which reduces the precision of the estimates.
Overall, the Kaplan-Meier curve provides insight into the time-dependent risks of events, emphasizing the need for targeted interventions during the early post-transplant period to improve survival outcomes.
The curve also suggests that patients who pass the high-risk early phase may achieve more favorable long-term survival.
Further analysis, such as stratifying the data by race or comorbidity scores, could provide deeper insights into factors influencing survival and potential disparities across subgroups.

from lifelines import KaplanMeierFitter

# Instantiate the Kaplan-Meier fitter
kmf = KaplanMeierFitter()

# Kaplan-Meier fit for the entire dataset
plt.figure(figsize=(10, 6))
kmf.fit(durations=train['efs_time'], event_observed=train['efs'])
kmf.plot_survival_function()
plt.title('Kaplan-Meier Survival Curve for Entire Dataset')
plt.xlabel('Time (Months)')
plt.ylabel('Survival Probability')
plt.grid()
plt.show()

Kaplan-Meier Survival Curve: Stratified by Race

The Kaplan-Meier survival curve below visualizes the survival probabilities for different racial groups over time. Each line represents a specific race group. The shaded areas around the curves represent confidence intervals.

Key Observations:

Early Survival Decline:
- All race groups show a steep initial decline in survival probability, indicating a high risk of adverse events shortly after transplantation.
- The rate of decline varies among groups, suggesting potential disparities in early survival outcomes.
Group Differences in Long-Term Survival:
- Groups like "More than one race" and "Asian" exhibit higher long-term survival probabilities compared to "White" and "Black or African-American" groups.
- "American Indian or Alaska Native" and "Native Hawaiian or other Pacific Islander" groups show moderate survival probabilities.
Confidence Intervals:
- Confidence intervals widen over time, reflecting reduced sample sizes.
- Widening is more pronounced in smaller racial groups, indicating greater uncertainty in survival estimates.
Potential Disparities:
- The observed differences in survival probabilities suggest disparities in post-transplant outcomes that may be influenced by various factors.
- "White" and "Black or African-American" groups consistently have lower survival probabilities, highlighting areas for potential intervention.

# Kaplan-Meier fit for different groups (e.g., race_group)
plt.figure(figsize=(12, 8))
for group in train['race_group'].dropna().unique():
    group_data = train[train['race_group'] == group]
    kmf.fit(durations=group_data['efs_time'], event_observed=group_data['efs'], label=group)
    kmf.plot_survival_function()

plt.title('Kaplan-Meier Survival Curve by Race Group')
plt.xlabel('Time (Months)')
plt.ylabel('Survival Probability')
plt.legend(title='Race Group')
plt.grid()
plt.show()

Kaplan-Meier Survival Curve: Stratified by Donor/Recipient Sex Match

The Kaplan-Meier survival curve below visualizes the survival probabilities for different donor/recipient sex match combinations over time. Each curve represents one of the four possible combinations:

Male-to-Female (M-F)
Female-to-Female (F-F)
Female-to-Male (F-M)
Male-to-Male (M-M)

The shaded areas around the curves indicate confidence intervals.

Key Observations:

Early Decline in Survival:
- All groups show a steep initial decline in survival probability, reflecting the high-risk post-transplant period.
Long-Term Survival Differences:
- F-F and M-M show the highest long-term survival probability.
- F-M and M-F have lower long-term survival probabilities.
Confidence Intervals:
- Confidence intervals widen over time, particularly for M-F and F-M.
- F-F has relatively narrow intervals.

Sex Match Impact:

F-F and M-M transplants tend to have better outcomes.
M-F and F-M groups have lower survival probabilities.

Insights and Implications:

Clinical Relevance:
- The survival advantage for F-F and M-M may reflect better immunological compatibility.
- M-F and F-M groups might benefit from additional clinical interventions.
  - Meaning that M-F, F-M groups may require additional clinical interventions(treatments)
Biological Factors:
- Differences in survival may stem from biological factors like immunological response or GVHD risk.
  - immunological response:
    - Refers to how our body's immune system responds to foreign substances
    - In transplant situations:
      - The immune reaction that occurs when donor cells enter the recipient's body
      - If this response is too strong or too weak, it can negatively affect transplant outcomes
  - GVHD (Graft Versus Host Disease) risk:
    - A condition where transplanted donor immune cells recognize the recipient's body as 'foreign' and attack it
    - Major symptoms:
      - Skin rash
      - Liver damage
      - Digestive system problems
    - A serious complication that can be life-threatening in severe cases
Further Analysis:
- Additional factors should be analyzed alongside sex match.
- Statistical tests can confirm the significance of observed differences.

# Kaplan-Meier fit for a binary feature (e.g., gender)
plt.figure(figsize=(12, 8))
for gender in train['sex_match'].dropna().unique():
    gender_data = train[train['sex_match'] == gender]
    kmf.fit(durations=gender_data['efs_time'], event_observed=gender_data['efs'], label=gender)
    kmf.plot_survival_function()

plt.title('Kaplan-Meier Survival Curve by Sex Match')
plt.xlabel('Time (Months)')
plt.ylabel('Survival Probability')
plt.legend(title='Sex Match')
plt.grid()
plt.show()

Cox Proportional Hazards (CPH) Model

The Cox Proportional Hazards (CPH) model is a widely used method in survival analysis for evaluating the effect of multiple covariates on the time to a specific event, such as death or relapse.
Unlike non-parametric methods like Kaplan-Meier, CPH is a semi-parametric model incorporating covariates to estimate their influence on survival while making no assumptions about the baseline hazard function.
- baseline hazard function: the change in basic risk rate over time

Hazard Function h(t|X) in detail:

Basic Concept:
- Represents the instantaneous probability that someone who has survived until time t will experience an event right after
- Here, 'event' could be death, disease recurrence, etc.
Formula Components:
- h₀(t): baseline hazard function
  - Basic risk rate when all covariates are 0
  - Can change over time
- exp(β₁X₁ + β₂X₂ + ... + βₚXₚ): effect of covariates
  - X₁, X₂, ..., Xₚ: covariates (age, gender, etc.)
  - β₁, β₂, ..., βₚ: coefficients showing the influence of each covariate
  - Why use exp: ensures hazard rate is always positive
Practical Meaning:
- For example, in hematopoietic cell transplant patients:
  - h(t): risk of death/relapse at time t
  - X₁: patient's age
  - X₂: gender matching status
  - β₁: impact of age on risk
  - β₂: impact of gender matching on risk

Proportional Hazards Assumption in detail:

Assumes that the hazard ratio between two patients remains constant over time
Example:
- If a 50-year-old patient has twice the risk of a 30-year-old patient
- This "twice" ratio remains constant whether it's 1 month or 1 year post-transplant
- Therefore "TIME-INDEPENDENT"

Hazard Ratio (HR) in detail:

Calculated as HR = exp(β)
HR > 1: increased risk
- Example: HR = 2 means double the risk
HR < 1: decreased risk
- Example: HR = 0.5 means half the risk
HR = 1: no effect
If β = 0.693 for gender matching:
- HR = exp(0.693) = 2
- This means for gender mismatch:
  - Risk doubles
  - This doubling remains constant at any time post-transplant

Censoring in detail:

What is censoring?

When the event of interest (e.g., death, relapse) doesn't occur during the study period
In other words, when we can't know the patient's final outcome

Cases of right-censoring:

No event occurs until the end of the study
- Example: Patient survives throughout a 5-year follow-up study
Patient drops out during follow-up
- Example: Transfer to another hospital
- Example: Loss of contact
Excluded from study for other reasons
- Example: Patient requests to discontinue participation

Handling in Cox model:

Censored data is included in the analysis
Information up to the censoring point is used for model estimation
Unbiased estimates are calculated through likelihood function

Likelihood function in detail:

What is a likelihood function:

A function that calculates the possibility (probability) that observed data came from a specific statistical model
In other words, it quantifies "how likely this data would come from this model"
Let's assume we have patient survival data:
- Patient A: Died after 2 years
- Patient B: Survived until 3 years (then lost to follow-up)
- Patient C: Died after 5 years

The likelihood function:
1. Calculates the probability of each patient's observed outcome
2. Multiplies all these probabilities
3. The higher this value, the better the model explains the data

In Cox model:

Censored data (e.g., Patient B) is included in the likelihood function
Uses information up to the point of censoring
This enables unbiased parameter estimation

In this way, the likelihood function allows us to effectively use incomplete data (censored data) in the analysis.

Partial Likelihood in detail:

Considers only the order of event occurrences instead of complete time information
In other words, focuses more on "who experienced the event first" rather than "exact timing"
Assume we have three patients:
Patient A: Dies at 2 months
Patient B: Dies at 5 months
Patient C: Survives until 7 months (censored)

Partial likelihood analyzes:
- At 2 months: "Why did A die instead of the others"
- At 5 months: "Why did B die among remaining patients"
Reasons for This Approach:
- No need to specify baseline hazard function (h₀(t))
- Can estimate covariate effects (β) using just event order
- Simpler and more efficient computation
Maximization Process:
- Find β values that maximize the partial likelihood
- These β values are considered to best explain each variable's effect on survival

from lifelines import CoxPHFitter

# Preprocess data
# Select relevant columns for Cox regression
cox_features = ['efs_time', 'efs', 'age_at_hct', 'karnofsky_score', 'comorbidity_score', 'race_group']
train = train[cox_features]

# Convert categorical variables into dummy variables
train = pd.get_dummies(train, columns=['race_group'], drop_first=True)

# Drop rows with missing values (ensure clean data for Cox model)
train = train.dropna()

# Instantiate and fit the Cox Proportional Hazards model
cph = CoxPHFitter()
cph.fit(train, duration_col='efs_time', event_col='efs')

# Show summary of the model
cph.print_summary()

Resultsr

The hazard ratio (HR) plot illustrates the effects of different covariates on the hazard of the event occurring, as estimated by the Cox Proportional Hazards model.
The x-axis represents the hazard ratio, where a value of 1.0 (marked by the dashed vertical line) indicates no effect on the hazard.
Hazard ratios greater than 1.0 indicate an increased risk of the event, while values less than 1.0 suggest a protective effect or reduced risk.
The 95% confidence intervals (CIs) are shown as horizontal lines around each hazard ratio, indicating the uncertainty in the estimates.
If a confidence interval crosses 1.0, the effect of the covariate is not statistically significant.
The analysis reveals several key findings.
Among race groups, "Black or African-American" and "White" have hazard ratios slightly above 1.0, indicating a marginally increased risk compared to the reference group (likely another race, such as "Asian" or "More than one race").
Conversely, the "More than one race" group has an HR less than 1.0, suggesting a protective effect, while "Native Hawaiian or other Pacific Islander" shows little to no impact on the hazard.
The comorbidity score has an HR slightly above 1.0, indicating that patients with more comorbidities are at greater risk of the event.
- Comorbidity: A condition where (a patient) suffers from two chronic diseases simultaneously
Similarly, "age at HCT" has a hazard ratio above 1.0, suggesting that older patients face a slightly higher risk.
In contrast, the Karnofsky performance score has a hazard ratio less than 1.0, reflecting a protective effect where higher scores (indicating better performance status) are associated with reduced risk.
- The Karnofsky performance score (KPS) or Karnofsky performance status scale is a measure to evaluate a patient's overall functional status.
  - Scoring System (0-100):
    - 100: Normal, no symptoms or signs of disease
    - 90: Able to carry on normal activity, minor symptoms/signs
    - 80: Normal activity with effort, some symptoms/signs
    - 70: Cares for self but unable to carry on normal activity or work
    - 60: Requires occasional assistance but can meet most personal needs
    - 50: Requires considerable assistance and frequent medical care
    - 40: Disabled, requires special care and assistance
    - 30: Severely disabled, hospital admission indicated, death not imminent
    - 20: Very sick, hospital admission necessary, active supportive treatment needed
    - 10: Moribund, death imminent
    - 0: Dead
Statistical significance can be inferred from the confidence intervals.
Covariates such as comorbidity score and Karnofsky score likely have statistically significant effects, as their confidence intervals do not cross 1.0.
Some race groups and "age at HCT", however, may not have significant effects, as their intervals overlap with 1.0.
These findings suggest that clinical factors, particularly comorbidity score and performance status, are key predictors of survival outcomes.
Additionally, differences in hazard ratios among race groups point to potential disparities in outcomes that warrant further investigation.
Efforts to reduce comorbidities, improve performance status, and explore the underlying causes of racial disparities could help optimize patient care and outcomes.
This analysis highlights the importance of targeted interventions and provides a foundation for further exploration of survival determinants.

# Visualize the coefficients (hazard ratios)
cph.plot(hazard_ratios=True)
plt.title("Cox Regression - Hazard Ratios")
plt.show()

Survival Curves for Comorbidity Score

The survival curves generated by the Cox Proportional Hazards (CPH) model illustrate the relationship between comorbidity score and survival probabilities over time.
The x-axis represents time (e.g., in months), while the y-axis shows the probability of survival.
Each line corresponds to a specific comorbidity score, ranging from 0 (no comorbidities) to 4 (high comorbidity burden), with a dashed line representing the baseline survival curve.
The results indicate that higher comorbidity scores are associated with lower survival probabilities, as reflected by the descending order of the survival curves.
Patients with a comorbidity score of 0 exhibit the highest survival probabilities, while those with a score of 4 experience the steepest decline and the lowest overall survival.
All survival curves show a steep decline during the early months, reflecting a high-risk period immediately after the transplant.
This decline is more pronounced for patients with higher comorbidity scores, indicating that comorbidities significantly exacerbate early post-transplant risks.
Beyond the initial phase, the survival curves stabilize, but patients with higher comorbidity scores continue to have significantly lower survival probabilities compared to those with lower scores.
The persistent gap between the survival curves suggests that comorbidities have a lasting impact on survival outcomes.
The baseline survival curve aligns closely with a mid-range comorbidity score, representing an "average" patient in the population.
These findings highlight the clinical importance of managing comorbidities before and after transplantation.
Higher comorbidity scores predict worse survival outcomes, emphasizing the need for targeted interventions and closer monitoring for high-risk patients, particularly during the early post-transplant phase.
Even long-term outcomes are worse for patients with higher scores, indicating the necessity of sustained care.
This analysis also underscores the potential for risk stratification, where patients can be categorized by comorbidity scores to prioritize resources and tailor interventions.

cph.plot_partial_effects_on_outcome(covariates='comorbidity_score', values=[0, 1, 2, 3, 4], cmap='coolwarm');

탁월성은 평범함에서 나온다
<GRIT>

CIBMTR - Equity in post-HCT Survival Predictions #1 About the Competition

dongsunseng — Thu, 30 Jan 2025 20:15:04 +0900

Introduction

Basically Survey Analysis Competition
Predicting transplant survival rates for allogeneic HCT patients
allogeneic: transplanting cells, tissues, or organs from a donor of the same species who is not genetically identical to the recipient
HCT: Hematopoietic Stem Cell Transplantation is a treatment method used to fundamentally treat diseases such as leukemia(백혈병) where abnormalities occur during cell differentiation(세포 분화), or conditions like aplastic anemia(재생불량성빈혈) where problems arise due to decreased numbers of hematopoietic stem cells.

Competition Description

"In this competition, you’ll develop models to improve the prediction of transplant survival rates for patients undergoing allogeneic Hematopoietic Cell Transplantation (HCT) — an important step in ensuring that every patient has a fair chance at a successful outcome, regardless of their background."
"Improving survival predictions for allogeneic HCT patients is a vital healthcare challenge. Current predictive models often fall short in addressing disparities related to socioeconomic status, race, and geography. Addressing these gaps is crucial for enhancing patient care, optimizing resource utilization, and rebuilding trust in the healthcare system."
- That's why they put "Equity" on the title of the competition: maybe decreasing those disparities during prediction is the key point of this competition
"This competition aims to encourage participants to advance predictive modeling by ensuring that survival predictions are both precise and fair for patients across diverse groups. By using synthetic data—which mirrors real-world situations while protecting patient privacy—participants can build and improve models that more effectively consider diverse backgrounds and conditions."
"You’re challenged to develop advanced predictive models for allogeneic HCT that enhance both accuracy and fairness in survival predictions. The goal is to address disparities by bridging diverse data sources, refining algorithms, and reducing biases to ensure equitable outcomes for patients across diverse race groups. Your work will help create a more just and effective healthcare environment, ensuring every patient receives the care they deserve."

Evaluation Metric

Evaluation Criteria

The evaluation of prediction accuracy in the competition will involve a specialized metric known as the Stratified Concordance Index (C-index), adapted to consider different racial groups independently. This method allows us to gauge the predictive performance of models in a way that emphasizes equitability across diverse patient populations, particularly focusing on racial disparities in transplant outcomes.

Concordance index

It represents the global assessment of the model discrimination power: this is the model’s ability to correctly provide a reliable ranking of the survival times based on the individual risk scores. It can be computed with the following formula:

The concordance index is a value between 0 and 1 where:

0.5 is the expected result from random predictions,
1.0 is a perfect concordance and,
0.0 is perfect anti-concordance (multiply predictions with -1 to get 1.0)
- If the predicted values are all perfectly opposite to the actual values, resulting in a concordance index of 0.0
- If we multiply these predicted values by -1 (i.e., reverse the signs)
- The predictions will perfectly match the actual values, resulting in a concordance index of 1.0

Stratified Concordance Index

For this competition, we adjust the standard C-index to account for racial stratification, thus ensuring that each racial group's outcomes are weighed equally in the model evaluation. The stratified c-index is calculated as the mean minus the standard deviation of the c-index scores calculated within the recipient race categories, i.e., the score will be better if the mean c-index over the different race categories is large and the standard deviation of the c-indices over the race categories is small. This value will range from 0 to 1, 1 is the theoretical perfect score, but this value will practically be lower due to censored outcomes.

The submitted risk scores will be evaluated using the score function. This evaluation process involves comparing the submitted risk scores against actual observed values (i.e., survival times and event occurrences) from a test dataset. The function specifically calculates the stratified concordance index across different racial groups, ensuring that the predictions are not only accurate overall but also equitable across diverse patient demographics.

Final score = Mean(c-index for each race) - Standard deviation(c-index for each race)

Evaluation metric implementation:

https://www.kaggle.com/code/metric/eefs-concordance-index

"""
To evaluate the equitable prediction of transplant survival outcomes,
we use the concordance index (C-index) between a series of event
times and a predicted score across each race group.
 
It represents the global assessment of the model discrimination power:
this is the model’s ability to correctly provide a reliable ranking
of the survival times based on the individual risk scores.
 
The concordance index is a value between 0 and 1 where:
 
0.5 is the expected result from random predictions,
1.0 is perfect concordance (with no censoring, otherwise <1.0),
0.0 is perfect anti-concordance (with no censoring, otherwise >0.0)

"""

import pandas as pd
import pandas.api.types
import numpy as np
from lifelines.utils import concordance_index

class ParticipantVisibleError(Exception):
    pass


def score(solution: pd.DataFrame, submission: pd.DataFrame, row_id_column_name: str) -> float:
    """
    >>> import pandas as pd
    >>> row_id_column_name = "id"
    >>> y_pred = {'prediction': {0: 1.0, 1: 0.0, 2: 1.0}}
    >>> y_pred = pd.DataFrame(y_pred)
    >>> y_pred.insert(0, row_id_column_name, range(len(y_pred)))
    >>> y_true = { 'efs': {0: 1.0, 1: 0.0, 2: 0.0}, 'efs_time': {0: 25.1234,1: 250.1234,2: 2500.1234}, 'race_group': {0: 'race_group_1', 1: 'race_group_1', 2: 'race_group_1'}}
    >>> y_true = pd.DataFrame(y_true)
    >>> y_true.insert(0, row_id_column_name, range(len(y_true)))
    >>> score(y_true.copy(), y_pred.copy(), row_id_column_name)
    0.75
    """
    
    del solution[row_id_column_name]
    del submission[row_id_column_name]
    
    # Define key columns
    event_label = 'efs' # event occurrence
    interval_label = 'efs_time' # survival time
    prediction_label = 'prediction' # predicted value
    
    # Validate submitted predictions
    for col in submission.columns:
        if not pandas.api.types.is_numeric_dtype(submission[col]):
            raise ParticipantVisibleError(f'Submission column {col} must be a number')
    
    # Merging solution and submission dfs on ID
    merged_df = pd.concat([solution, submission], axis=1)
    merged_df.reset_index(inplace=True)
    merged_df_race_dict = dict(merged_df.groupby(['race_group']).groups)
    
    # Calculate c-index for each racial group
    metric_list = []
    for race in merged_df_race_dict.keys():
        # Retrieving values from y_test based on index
        indices = sorted(merged_df_race_dict[race])
        merged_df_race = merged_df.iloc[indices]
        # Calculate the concordance index
        c_index_race = concordance_index(
                        merged_df_race[interval_label],
                        -merged_df_race[prediction_label],
                        merged_df_race[event_label])
        metric_list.append(c_index_race)
    return float(np.mean(metric_list)-np.sqrt(np.var(metric_list)))

def score(solution: pd.DataFrame, submission: pd.DataFrame, row_id_column_name: str)
- solution: actual answer data including efs, efs_time, race_group columns
- submission: participant's submitted predictions: prediction column
- row_id_column_name: ID column name
del solution[row_id_column_name] del submission[row_id_column_name]
- Removing id column
Validate submitted predictions:
- Check if all predictions are numeric
Merging solution and submission dfs on ID
- First merge solution and submission
- Second create new index
- Last, classify data by racial groups
Calculate c-index for each racial group
- Calculate concordance_index for each racial group
- Note: prediction is multiplied by -1 (high risk score should correlate with low survival time)
return float(np.mean(metric_list)-np.sqrt(np.var(metric_list)))
- Returns the mean of racial c-indices minus their standard deviation
- This considers both overall performance (mean) and performance differences between races (standard deviation)
More about c-index:
- c-index = concordance-index
- basic concept:
  - C-index measures how well a model predicts the "relative risk ranking"
  - When comparing two patients, it evaluates whether the model predicted higher risk for the patient who actually died earlier (or experienced the event)
- concordance_index(actual_survival_time, predicted_risk, event_occurrence)
  - Compares all possible patient pairs
  - Concordant pair: pairs where predicted ranking matches actual ranking
  - C-index = (number of concordant pairs) / (total number of comparable pairs)
- Example:
  - Patient A: Survival time 10 days, deceased
  - Patient B: Survival time 20 days, deceased
  - Patient C: Survival time 15 days, censored
  - Predicted risk scores: A: 0.8 (high risk) B: 0.3 (low risk) C: 0.5 (medium risk)
  - A vs B: concordant (A died earlier + model predicted higher risk for A)
  - A vs C: not comparable (C is censored)
  - B vs C: not comparable (C is censored)
- Why multiply predictions by -1:
  - Originally, high risk score = low survival time
  - Multiplying by -1 aligns directions (high risk = low survival time prediction)

Submission

Participants must submit their predictions for the test dataset as real-valued risk scores. These scores represent the model's assessment of each patient's risk following transplantation. A higher risk score typically indicates a higher likelihood of the target event occurrence.

The submission file must include a header and follow this format:

ID,prediction
28800,0.5
28801,1.2
28802,0.8
etc.

where:

ID refers to the identifier for each patient in the test dataset.
prediction is the corresponding risk score generated by your model.

탁월성은 평범함에서 나온다
<GRIT>

CZII - CryoET Object Identification #4 Making synthetic data for Baseline YOLO11 Solution

dongsunseng — Tue, 28 Jan 2025 21:58:28 +0900

This is an annotation of code that produces datasets for YOLO solution with additional data(synthetic data)

CZII making datasets for YOLO + synthetic data

Modified version of https://www.kaggle.com/code/itsuki9180/czii-making-datasets-for-yolo
Basically generates datasets with additional synthetic data, denoised using Gaussian denoising
Quite controversy whether models trained with synthetic data perform better or not (https://www.kaggle.com/competitions/czii-cryo-et-object-identification/discussion/555247)
However, discovered true for YOLO
"If someone manages to incorporate the original denoise model or IsoNet, I’m sure that better results could be achieved."
Model was trained with TS_5_4, TS_69_2, TS_6_4, TS_6_6 as validation.

1) Install + Import

!pip install zarr opencv-python

import json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import zarr
from tqdm import tqdm
import glob, os
import cv2
import shutil

2) Load + Organize data

runs = sorted(glob.glob('/kaggle/input/czii-cryo-et-object-identification/train/overlay/ExperimentRuns/*'))
print(runs)

runs = [os.path.basename(x) for x in runs]

# Processing additional dataset
additional_runs = sorted(glob.glob('/kaggle/input/czii10441/10441/T*'))
print(additional_runs)
additional_runs = [os.path.basename(x) for x in additional_runs]
runs = runs + additional_runs

# Creating mapping dictionaries
i2r_dict = {i: r for i, r in zip(range(len(runs)), runs)}
r2t_dict = {r: i for i, r in zip(range(len(runs)), runs)}
print("Runs:", i2r_dict)

runs = sorted(glob.glob('/kaggle/input/czii-cryo-et-object-identification/train/overlay/ExperimentRuns/*'))
- Collects paths from the base training dataset
- Uses glob.glob() to get all experiment paths
- Uses sorted() to arrange the paths
runs = [os.path.basename(x) for x in runs]
- Extracting experiment names from paths
- Uses os.path.basename() to extract just the experiment names from full paths
- Processes all paths using list comprehension
Processing additional data part
- Processes paths from additional dataset (czii10441) in the same way
- Merges additional dataset with existing experiment list
Creating mapping dictionaries part
- i2r_dict: Maps indices to experiment names (index to run)
- r2t_dict: Maps experiment names to indices (run to index)
- Used as lookup tables for later data processing or reference

3) Helper function - Normalize function

# Normalize the image to a value between 0 and 255

def convert_to_8bit(x):
    # 1. Calculate percentiles for outlier removal
    lower, upper = np.percentile(x, (0.5, 99.5))
    
    # 2. Remove extreme values (clipping)
    x = np.clip(x, lower, upper)
    
    # 3. Convert to 0-255 range using Min-max normalization
    x = (x - x.min()) / (x.max() - x.min() + 1e-12) * 255
    
    # 4. Convert to 8-bit integer
    return x.round().astype("uint8")

Normalizes image data to 8-bit format (0-255 range)
Crucial preprocessing step in CryoET image processing
Clipping: Reduces the impact of noise and extreme values
Min-max normalization:
- 1e-12: Small value added to prevent division by zero
- * 255: Scales 0-1 range to 0-255 range

4) Information about labels

p2i_dict = {
    'apo-ferritin': 0,
    'beta-amylase': 1,
    'beta-galactosidase': 2,
    'ribosome': 3,
    'thyroglobulin': 4,
    'virus-like-particle': 5
}

i2p = {v: k for k, v in p2i_dict.items()}

particle_radius = {
    'apo-ferritin': 60,
    'beta-amylase': 65,
    'beta-galactosidase': 90,
    'ribosome': 150,
    'thyroglobulin': 130,
    'virus-like-particle': 135,
}

particle_names = ['apo-ferritin', 'beta-amylase', 'beta-galactosidase', 'ribosome', 'thyroglobulin', 'virus-like-particle']

from scipy.ndimage import gaussian_filter, median_filter

def denoise_tomogram(tomogram, method='gaussian', **kwargs):
    """
    Apply denoising to a tomogram.

    Parameters:
        tomogram (np.ndarray): The input tomogram to denoise.
        method (str): The denoising method ('gaussian' or 'median').
        kwargs: Parameters for the respective method.
    
    Returns:
        np.ndarray: The denoised tomogram.
    """
    if method == 'gaussian':
        return gaussian_filter(tomogram, sigma=kwargs.get('sigma', 1))
    elif method == 'median':
        return median_filter(tomogram, size=kwargs.get('size', 3))
    else:
        raise ValueError(f"Unsupported denoising method: {method}")

Removes noise using Gaussian or median filter
Filter parameters can be flexibly adjusted via kwargs

name_map = {
    'apo-ferritin': 'ferritin_complex',
    'beta-amylase': 'beta_amylase',
    'beta-galactosidase': 'beta_galactosidase',
    'ribosome': 'cytosolic_ribosome',
    'thyroglobulin': 'thyroglobulin',
    'virus-like-particle': 'pp7_vlp',
}

def ndjson_to_json(ndjson_path):
    # Check if file exists
    if not os.path.isfile(ndjson_path):
        raise FileNotFoundError(f"The file {ndjson_path} does not exist.")

    data = []
    # Parse each line as JSON object
    try:
        with open(ndjson_path, 'r', encoding='utf-8') as ndjson_file:
            for line_number, line in enumerate(ndjson_file, start=1):
                stripped_line = line.strip()
                if stripped_line:  
                    try:
                        json_object = json.loads(stripped_line)
                        data.append(json_object)
                    except json.JSONDecodeError as e:
                        raise json.JSONDecodeError(
                            f"Error decoding JSON on line {line_number}: {e.msg}",
                            e.doc,
                            e.pos
                        )
    except Exception as e:
        raise e

    return data

Parses NDJSON (Newline Delimited JSON) files
Converts each line to individual JSON objects
Includes error handling and line number tracking

import os
import glob
import json
import pandas as pd
import numpy as np
import zarr
import cv2
from tqdm import tqdm

# Takes experiment name, train/validation flag, synthetic data flag as input
def make_annotate_yolo(run_name, is_train_path=True, is_syntetic=False):
    dataset_split = 'train' if is_train_path else 'val'
    
    # Loading and preprocessing volume data
    # Setting the path to the denoised volume(data)
    if is_syntetic:
        vol_path = glob.glob(f'/kaggle/input/czii10441/10441/{run_name}/**/Tomograms/**/*.zarr', recursive=True)
        if not vol_path:
            print(f"No volume found for run {run_name} in synthetic data.")
            return
        vol_path = vol_path[0]
    else:
        vol_path = f'/kaggle/input/czii-cryo-et-object-identification/train/static/ExperimentRuns/{run_name}/VoxelSpacing10.000/denoised.zarr'
    
    print(f"Volume path: {vol_path}")
    if not os.path.exists(vol_path):
        print(f"Volume file not found: {vol_path}")
        return

    # Read the volume
    vol = zarr.open(vol_path, mode='r') # loads volume data in zarr format
    vol = vol[0]
    if is_syntetic:
        vol = denoise_tomogram(np.array(vol)[:184], method='gaussian', sigma=1)  # Apply denoise for synthetic data
    vol2 = convert_to_8bit(vol) # into 8-bit format
    
    n_imgs = vol2.shape[0]
    print(n_imgs)
    
    # Image generation - CONVERT 3D Volume data into 2D Images that YOLO can process
    for j in range(n_imgs):
        # 1. Extract current slice
        newvol = vol2[j]
        
        # 2. Convert grayscale to RGB
        newvolf = np.stack([newvol]*3, axis=-1)
        
        # 3. Resize to YOLO input size
        newvolf = cv2.resize(newvolf, (640, 640))
        
        # 4. Save image
        image_filename = f'images/{dataset_split}/{run_name}_{j*10}.png'
        cv2.imwrite(image_filename, newvolf)
        
        # 5. Create empty label file
        label_filename = f'labels/{dataset_split}/{run_name}_{j*10}.txt'
        with open(label_filename, 'w') as f:
            pass
    
    # Process each particle type (label processing)
    for p, particle in enumerate(tqdm(particle_names, desc=f"Processing particles for run {run_name}")):
        if particle == "beta-amylase":
            continue
        
        # Load JSON data for each particle
        if is_syntetic:
            particle_name_in_file = name_map.get(particle)
            if not particle_name_in_file:
                print(f"Particle name mapping not found for: {particle}")
                continue
            
            ndjson_each_particle = glob.glob(f'/kaggle/input/czii10441/10441/{run_name}/**/Annotations/**/*.ndjson', recursive=True)
            if not ndjson_each_particle:
                print(f"No NDJSON files found for particle: {particle} in run: {run_name}")
                continue
            
            filtered_ndjson_files = [f for f in ndjson_each_particle if particle_name_in_file in f]
            if not filtered_ndjson_files:
                print(f"No NDJSON files match the particle: {particle} for run: {run_name}")
                continue
            
            json_each_particle = ndjson_to_json(filtered_ndjson_files[0])
            df = pd.DataFrame(json_each_particle)
            
        # Data loading for real data
        else:
            json_each_particle = f"/kaggle/input/czii-cryo-et-object-identification/train/overlay/ExperimentRuns/{run_name}/Picks/{particle}.json"
            
            if not os.path.exists(json_each_particle):
                print(f"JSON file not found: {json_each_particle}")
                continue
            print(f"Loading JSON file: {json_each_particle}")
            try:
                df = pd.read_json(json_each_particle)
            except ValueError as e:
                print(f"Error reading JSON file {json_each_particle}: {e}")
                continue
                
        # Coordinate Extraction Processing 
        if is_syntetic:
            column_name = 'location'
        else:
            column_name = 'points'

        if  column_name not in df.columns:
            print(f"'{column_name}' column not found in DataFrame for particle: {particle}")
            continue
        
        if is_syntetic:
            # Flattens nested JSON data into dataframe format
            normalized_data = pd.json_normalize(df[column_name])
            # *10.012: applies pixel scaling factor(converts to actual physical size)
            df[['x', 'y', 'z']] = normalized_data * 10.012
        
        # For real data: Extracts coordinates for each axis (x, y, z)
        else:      
            for axis in ["x", "y", "z"]:
                df[axis] = df[column_name].apply(lambda x: x["location"][axis] if "location" in x and axis in x["location"] else np.nan)
                print("aquiii",df.head())

        # Missing Value Handling: Removes rows with NaN in any of x, y, z coordinates
        df.dropna(subset=["x", "y", "z"], inplace=True)

        # Get defined radius for each particle type
        radius = particle_radius.get(particle)
        if radius is None:
            print(f"Radius not defined for particle: {particle}")
            continue
        divide_by = 10.012
        
        # Convert to YOLO format
        for i, row in df.iterrows():    
            # Calculate Z-axis range (range of slices where particle is visible)
            start_z = np.round(row['z'] - radius).astype(np.int32)
            start_z = max(0, start_z//10) 
            end_z = np.round(row['z'] + radius).astype(np.int32)
            end_z = min(n_imgs, end_z//10)
            
            # Generate YOLO format labels for each slice
            for j in range(start_z, end_z):
                label_filename = f'labels/{dataset_split}/{run_name}_{j*10}.txt'
                
                # Calculate normalized coordinates
                x_center = row["x"] / divide_by / vol2.shape[1]
                y_center = row["y"] / divide_by / vol2.shape[2]
                box_width = (radius * 2) / divide_by / vol2.shape[1]
                box_height = (radius * 2) / divide_by / vol2.shape[2]
                
                # Save in YOLO format
                # format: class_id center_x center_y width height
                with open(label_filename, 'a') as f:
                    f.write(f'{p2i_dict.get(particle, 0)} {x_center:.6f} {y_center:.6f} {box_width:.6f} {box_height:.6f}\n')

Generating datasets for YOLO training
Overall process:
- Convert 3D coordinates to 2D YOLO format
- Generate labels for all slices within particle's Z-axis range
- Normalize coordinates and box sizes to 0-1 range
- YOLO format: class_id x_center y_center width height
- This code plays a crucial role in converting 3D particle location information into 2D bounding box format that YOLO can understand.
Image generation
- newvolf = np.stack([newvol]*3, axis=-1)
  - [newvol]*3: Replicate the same grayscale image 3 times
  - axis=-1: Stack along the last dimension (creating R,G,B channels)
  - Result: (height, width) -> (height, width, 3)
- Image resizing:
  - 640x640 is YOLOv5's default input size
Label Processing
- Labels here refer to annotation information used for training YOLO object detection models
- Exclude beta-amylase (excluded from competition evaluation)
- Data Loading - For Synthetic Data
  - Synthetic data stored in NDJSON format
  - Filter by matching particle type in filename
  - Convert NDJSON to DataFrame
- Data Loading - For Real Data
  - Real data stored in JSON format
  - Direct JSON file loading
- Coordinate Extraction Processing
  - Different coordinate extraction methods for synthetic/real data
  - Normalize and scale coordinate values

5) Prepare folders

os.makedirs("images/train", exist_ok=True)
os.makedirs("images/val", exist_ok=True)
os.makedirs("labels/train", exist_ok=True)
os.makedirs("labels/val", exist_ok=True)

exist_ok=True: No error if directories already exist

6) Create Dataset

validation_indices = [0, 1, 2, 3]  # TS_5_4, TS_69_2 TS_6_4 TS_6_6

#runs = runs[:7] 
    
for i, r in enumerate(runs):
    # Determine if training or validation
    is_train_path = i not in validation_indices
    
    # Determine if synthetic data (after index 7 is synthetic)
    is_syntetic = i > 7
    
    print(f"Processing Run {i}: {r}, Is Train: {is_train_path}")
    
    # Call dataset generation function
    make_annotate_yolo(r, is_train_path=is_train_path, is_syntetic=is_syntetic)

Generates the dataset by splitting it into training and validation sets

images_train_dir = "images/train"
labels_train_dir = "labels/train"

7) Organize Dataset Folder Structure

# Create top-level dataset directory
os.makedirs('datasets/czii_det2d', exist_ok=True)

# Move image and label files to new locations
shutil.move('images/train', 'datasets/czii_det2d/images/train')
shutil.move('images/val', 'datasets/czii_det2d/images/val')
shutil.move('labels/train', 'datasets/czii_det2d/labels/train')
shutil.move('labels/val', 'datasets/czii_det2d/labels/val')

Reorganizes the generated training data into final directory structure expected by YOLO
Final dir structure:
- datasets/
  └── czii_det2d/
      ├── images/
      │   ├── train/  # Training images
      │   └── val/    # Validation images
      └── labels/
          ├── train/  # Training labels
          └── val/    # Validation labels

8) Create Configuration File for YOLO

config_content = """
path: /kaggle/input/czii-making-datasets-for-yolo/datasets/czii_det2d  # Dataset root path
train: images/train  # Training images path (relative to path)
val: images/val      # Validation images path (relative to path)
# Classes
names:               # Class (particle type) definitions
  0: apo-ferritin
  1: beta-amylase
  2: beta-galactosidase
  3: ribosome
  4: thyroglobulin
  5: virus-like-particle
"""

# Create YAML file
with open("czii_conf.yaml", "w") as f:
    f.write(config_content.strip())

Generates a configuration file (YAML) for YOLO model training

In order to make the impossible possible, you need to change the rules.
- Elon Musk -

CZII - CryoET Object Identification #3 Baseline YOLO11 Solution

dongsunseng — Thu, 16 Jan 2025 17:14:07 +0900

This post is an annotation of baseline YOLO11 solution kernel from @SERGIO ALVAREZ.

https://www.kaggle.com/code/sersasj/czii-yolo11-submission-baseline-with-kdtree-update

CZII YOLO11 Submission Baseline with KDTree Update

Explore and run machine learning code with Kaggle Notebooks | Using data from multiple data sources

www.kaggle.com

CZII YOLO11 Submission Baseline with KDTree Update - LB 0.682

Inspired by https://www.kaggle.com/code/itsuki9180/czii-yolo11-submission-baseline (LB: 0.625)
Problem: "It already takes 10 hours with the YOLO model - if I train a 2D UNET and aggregate the results in a similar way to YOLO, would it be possible to fit within the time limit?"
Introduced the KDTree algorithm for performance improvement
- KDTree is an efficient algorithm for finding nearest neighbors
Also added @min fuka's multi-processing idea
- https://www.kaggle.com/code/minfuka/czii-yolo11-submission-baseline-speed-up-ver
Time with KDTree was reduced to ~6500 seconds and with multiprocessing was reduced to ~4500 seconds.
Used synthetic data for training
- Data: https://www.kaggle.com/datasets/sersasj/czii-yolo-l-trained-with-synthetic-data/data
- Code making synthetic data: https://www.kaggle.com/code/sersasj/czii-making-datasets-for-yolo-synthetic-data#CZII:-Creating-Datasets-for-YOLO-with-Additional-Data
  - My annotation on this code: https://dongsunseng.com/entry/CZII-CryoET-Object-Identification-4-Making-synthetic-data-for-Baseline-YOLO11-Solution#google_vignette
Used TS_5_4, TS_69_2, TS_6_4, and TS_6_6 as validation datasets
Used OPTUNA to optimize the following parameters:
- z_distance
- zy_distance
- first_conf
- conf_coef

Score for TS_5_4: 0.658783812957661,{'apo-ferritin': {'total_tp': 42, 'total_fp': 20, 'total_fn': 2, 'fbeta': 0.9321148825065274}, 'beta-galactosidase': {'total_tp': 5, 'total_fp': 27, 'total_fn': 7, 'fbeta': 0.3794642857142858}, 'ribosome': {'total_tp': 20, 'total_fp': 29, 'total_fn': 10, 'fbeta': 0.6427221172022684}, 'thyroglobulin': {'total_tp': 23, 'total_fp': 104, 'total_fn': 7, 'fbeta': 0.6441515650741352}, 'virus-like-particle': {'total_tp': 11, 'total_fp': 2, 'total_fn': 0, 'fbeta': 0.9894179894179894}}

Score for TS_69_2: 0.8191956150699464,{'apo-ferritin': {'total_tp': 35, 'total_fp': 25, 'total_fn': 0, 'fbeta': 0.9596774193548387}, 'beta-galactosidase': {'total_tp': 13, 'total_fp': 46, 'total_fn': 3, 'fbeta': 0.7015873015873016}, 'ribosome': {'total_tp': 35, 'total_fp': 15, 'total_fn': 2, 'fbeta': 0.926791277258567}, 'thyroglobulin': {'total_tp': 28, 'total_fp': 84, 'total_fn': 6, 'fbeta': 0.7256097560975611}, 'virus-like-particle': {'total_tp': 9, 'total_fp': 1, 'total_fn': 0, 'fbeta': 0.9935064935064936}}

Score for TS_6_4: 0.685180923434018,{'apo-ferritin': {'total_tp': 45, 'total_fp': 34, 'total_fn': 12, 'fbeta': 0.7719475277497477}, 'beta-galactosidase': {'total_tp': 7, 'total_fp': 29, 'total_fn': 5, 'fbeta': 0.5219298245614036}, 'ribosome': {'total_tp': 54, 'total_fp': 59, 'total_fn': 12, 'fbeta': 0.7852865697177076}, 'thyroglobulin': {'total_tp': 24, 'total_fp': 77, 'total_fn': 6, 'fbeta': 0.70223752151463}, 'virus-like-particle': {'total_tp': 8, 'total_fp': 4, 'total_fn': 2, 'fbeta': 0.7906976744186046}}

Score for TS_6_6: 0.7575532250952666,{'apo-ferritin': {'total_tp': 37, 'total_fp': 39, 'total_fn': 2, 'fbeta': 0.8985714285714286}, 'beta-galactosidase': {'total_tp': 8, 'total_fp': 43, 'total_fn': 3, 'fbeta': 0.5991189427312775}, 'ribosome': {'total_tp': 17, 'total_fp': 11, 'total_fn': 6, 'fbeta': 0.7297979797979798}, 'thyroglobulin': {'total_tp': 31, 'total_fp': 120, 'total_fn': 4, 'fbeta': 0.7412095639943742}, 'virus-like-particle': {'total_tp': 19, 'total_fp': 2, 'total_fn': 0, 'fbeta': 0.9938461538461538}}

1) Ultralytics setting for offline env (External kernel linked to the main submission kernel)

Ultralytics is an open-source package for implementing and training YOLO (You Only Look Once) object detection models

https://www.kaggle.com/code/itsuki9180/ultralytics-for-offline-install

!pip download -d ./packages ultralytics
!tar cfvz archive.tar.gz ./packages

!pip download -d ./packages ultralytics
- -d ./packages: Specifies the download location as ./packages directory
- Downloads the package and all its dependencies
- Only downloads wheel files without actual installation
!tar cfvz archive.tar.gz ./packages
- Compresses downloaded packages into a tar file
- c: Create a new archive
- f: Specify filename
- v: Verbose (detailed output)
- z: Use gzip compression
- archive.tar.gz: Name of the compressed file to be created
- ./packages: Directory to be compressed
This is done because internet access is restricted in the competition environment
All necessary packages are downloaded and compressed in advance so they can be installed later in an offline environment
wheel files?
- A binary package that bundles Python packages in an installable form
- Includes compiled code, metadata, and dependency information
- Has the .whl extension

!tar xfvz archive.tar.gz
!pip install --no-index --find-links=./packages ultralytics
!rm -rf ./packages

!tar xfvz archive.tar.gz
- x: Extract
- f: Specify filename
- v: Verbose (detailed output)
- z: Extract gzip compression
- Extracts archive.tar.gz to create ./packages directory
!pip install --no-index --find-links=./packages ultralytics
- --no-index: Don't use PyPI (Python Package Index).
  - This means don't download packages from the internet
- --find-links=./packages: Specify local directory to find packages
- Install ultralytics using locally downloaded wheel files
!rm -rf ./packages
- Delete the temporarily used packages directory after installation
- -r: Delete recursively (including all files in directory)
- -f: Force delete (without confirmation messages)

2) Dependencies (Back to the kernel)

# Installing Ultralytics
!tar xfvz /kaggle/input/ultralytics-for-offline-install/archive.tar.gz
!pip install --no-index --find-links=./packages ultralytics
!rm -rf ./packages

# Installing Zarr package
!cp -r '/kaggle/input/hengck-czii-cryo-et-01/wheel_file' '/kaggle/working/'
!pip install /kaggle/working/wheel_file/asciitree-0.3.3/asciitree-0.3.3
!pip install --no-index --find-links=/kaggle/working/wheel_file zarr

Copy wheel files from another Kaggle dataset to working directory
First install asciitree (a dependency of zarr)
Install zarr package
The reasons for this approach:
1. Kaggle notebooks have restricted internet access
2. Required packages are pre-uploaded as datasets
3. Enables package installation in offline environments Specifically, zarr is a package used for efficient storage and processing of large array data, which will likely be used in this competition for handling 3D image data.

import os
import glob
import time
import sys
import warnings
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import cv2
import torch
from tqdm import tqdm
from ultralytics import YOLO
import zarr
from scipy.spatial import cKDTree
from collections import defaultdict

3) Loading model + Configuration

model_path = '/kaggle/input/czii-yolo-l-trained-with-synthetic-data/best_synthetic.pt'
model = YOLO(model_path)

Loading the 'best_synthetic.pt' file
And then, Uses the YOLO class from Ultralytics to load the model

# Processing experiment data paths
runs_path = '/kaggle/input/czii-cryo-et-object-identification/test/static/ExperimentRuns/*'
runs = sorted(glob.glob(runs_path))
runs = [os.path.basename(run) for run in runs]

# Data Splitting
sp = len(runs)//2
runs1 = runs[:sp]
runs1[:5]

#add by @minfuka
runs2 = runs[sp:]
runs2[:5]

#add by @minfuka - GPU Checking
assert torch.cuda.device_count() == 2

Processing experiment data paths:
- Gets experiment data paths from test dataset
- Uses glob.glob to get all experiment folders
- Uses os.path.basename to extract only folder names from paths
Data splitting:
- Divides all experiments into two groups
- Appears to be preparation for parallel processing
GPU Checking:
- Verifies that 2 GPUs are available
- Uses assert statement to raise error if not 2
- For multi-GPU processing
- This appears to be intended for parallelizing data processing using multiple GPUs
- It's preparing for each GPU to process half of the data

particle_names = [
    'apo-ferritin',
    'beta-amylase',
    'beta-galactosidase',
    'ribosome',
    'thyroglobulin',
    'virus-like-particle'
]

particle_to_index = {
    'apo-ferritin': 0,
    'beta-amylase': 1,
    'beta-galactosidase': 2,
    'ribosome': 3,
    'thyroglobulin': 4,
    'virus-like-particle': 5
}

index_to_particle = {index: name for name, index in particle_to_index.items()}

particle_radius = {
    'apo-ferritin': 60,
    'beta-amylase': 65,
    'beta-galactosidase': 90,
    'ribosome': 150,
    'thyroglobulin': 130,
    'virus-like-particle': 135,
}

4) Helper functions

I think that's the single best piece of advice: constantly think about how you could be doing things better and questioning yourself.
- Elon Musk -

CZII - CryoET Object Identification #2 Baseline UNet Solution

dongsunseng — Wed, 15 Jan 2025 17:09:33 +0900

This post is an annotation of baseline unet solution kernel from "fnands".

https://www.kaggle.com/code/ldywinner/baseline-unet-train-submit/notebook#Baseline-UNet-training-+-prediction/submission

'Baseline UNet train + submit' - LB score 0.529

Literally a baseline solution with no high lb score
Based on 3 notebooks:
Pre-computed the input data and stored them as numpy arrays so they don't have to be extracted every time the notebooks is run:
- My annotation of that part here: https://dongsunseng.com/entry/CZII-CryoET-Object-Identification-1-Training-Data

CZII - CryoET Object Identification #1 - Training Data

This post is an annotation of training data code kernel from "fnands".https://www.kaggle.com/code/fnands/create-numpy-dataset-exp-name Create Numpy dataset exp nameExplore and run machine learning code with Kaggle Notebooks | Using data from CZII - CryoET

dongsunseng.com

1) Installing offline deps

deps_path = '/kaggle/input/czii-cryoet-dependencies'
! cp -r /kaggle/input/czii-cryoet-dependencies/asciitree-0.3.3/ asciitree-0.3.3/
! pip wheel asciitree-0.3.3/asciitree-0.3.3/
! pip install asciitree-0.3.3-py3-none-any.whl
! pip install -q --no-index --find-links {deps_path} --requirement {deps_path}/requirements.txt

Process of installing dependency packages in an offline environment
"As this is a code comp, there is no internet. So we have to do some silly things to get dependencies in here. Why is asciitree such a PITA?"
In Kaggle competitions, internet access is restricted, so necessary packages must be prepared in advance
Kaggle competition environments block internet access for security and fairness.
The asciitree package, in particular, is tricky to install and requires special handling
All dependency packages must be prepared in advance in a locally installable format.

2) Import deps

from typing import List, Tuple, Union
import numpy as np
import torch
from monai.data import DataLoader, Dataset, CacheDataset, decollate_batch
from monai.transforms import (
    Compose, 
    EnsureChannelFirstd, 
    Orientationd,  
    AsDiscrete,  
    RandFlipd, 
    RandRotate90d, 
    NormalizeIntensityd,
    RandCropByLabelClassesd,
)

3) Define some helper functions

Patching helper functions

def calculate_patch_starts(dimension_size: int, patch_size: int) -> List[int]:
    """
    Calculate the starting positions of patches along a single dimension
    with minimal overlap to cover the entire dimension.
    
    Parameters:
    -----------
    dimension_size : int
        Size of the dimension
    patch_size : int
        Size of the patch in this dimension
        
    Returns:
    --------
    List[int]
        List of starting positions for patches
    """
    if dimension_size <= patch_size:
        return [0]
        
    # Calculate number of patches needed
    n_patches = np.ceil(dimension_size / patch_size)
    
    if n_patches == 1:
        return [0]
    
    # Calculate overlap
    total_overlap = (n_patches * patch_size - dimension_size) / (n_patches - 1)
    
    # Generate starting positions
    positions = []
    for i in range(int(n_patches)):
        pos = int(i * (patch_size - total_overlap))
        if pos + patch_size > dimension_size:
            pos = dimension_size - patch_size
        if pos not in positions:  # Avoid duplicates
            positions.append(pos)
    
    return positions

def extract_3d_patches_minimal_overlap(arrays: List[np.ndarray], patch_size: int) -> Tuple[List[np.ndarray], List[Tuple[int, int, int]]]:
    """
    Extract 3D patches from multiple arrays with minimal overlap to cover the entire array.
    
    Parameters:
    -----------
    arrays : List[np.ndarray]
        List of input arrays, each with shape (m, n, l)
    patch_size : int
        Size of cubic patches (a x a x a)
        
    Returns:
    --------
    patches : List[np.ndarray]
        List of all patches from all input arrays
    coordinates : List[Tuple[int, int, int]]
        List of starting coordinates (x, y, z) for each patch
    """
    if not arrays or not isinstance(arrays, list):
        raise ValueError("Input must be a non-empty list of arrays")
    
    # Verify all arrays have the same shape
    shape = arrays[0].shape
    if not all(arr.shape == shape for arr in arrays):
        raise ValueError("All input arrays must have the same shape")
    
    if patch_size > min(shape):
        raise ValueError(f"patch_size ({patch_size}) must be smaller than smallest dimension {min(shape)}")
    
    m, n, l = shape
    patches = []
    coordinates = []
    
    # Calculate starting positions for each dimension
    x_starts = calculate_patch_starts(m, patch_size)
    y_starts = calculate_patch_starts(n, patch_size)
    z_starts = calculate_patch_starts(l, patch_size)
    
    # Extract patches from each array
    for arr in arrays:
        for x in x_starts:
            for y in y_starts:
                for z in z_starts:
                    patch = arr[
                        x:x + patch_size,
                        y:y + patch_size,
                        z:z + patch_size
                    ]
                    patches.append(patch)
                    coordinates.append((x, y, z))
    
    return patches, coordinates

# Note: I should probably averge the overlapping areas, 
# but here they are just overwritten by the most recent one. 

def reconstruct_array(patches: List[np.ndarray], 
                     coordinates: List[Tuple[int, int, int]], 
                     original_shape: Tuple[int, int, int]) -> np.ndarray:
    """
    Reconstruct array from patches.
    
    Parameters:
    -----------
    patches : List[np.ndarray]
        List of patches to reconstruct from
    coordinates : List[Tuple[int, int, int]]
        Starting coordinates for each patch
    original_shape : Tuple[int, int, int]
        Shape of the original array
        
    Returns:
    --------
    np.ndarray
        Reconstructed array
    """
    reconstructed = np.zeros(original_shape, dtype=np.int64)  # To track overlapping regions
    
    patch_size = patches[0].shape[0]
    
    for patch, (x, y, z) in zip(patches, coordinates):
        reconstructed[
            x:x + patch_size,
            y:y + patch_size,
            z:z + patch_size
        ] = patch
        
    
    return reconstructed

"These are mostly used to split large volumes into smaller ones and stitch them back together"
This code implements functions for extracting and reconstructing patches from 3D image data
def calculate_patch_starts(dimension_size: int, patch_size: int) -> List[int]:
- Purpose: Calculates the starting positions of patches in one dimension
- Operation:
  - Returns [0] if dimension size is smaller than patch size
  - Calculates required number of patches: n_patches = ceil(dimension_size / patch_size)
  - Calculates overlap between patches
  - Generates starting positions for each patch considering overlap
- Example: For dimension size 100 and patch size 40, returns list of positions like [0, 30, 60]
def extract_3d_patches_minimal_overlap(arrays: List[np.ndarray], patch_size: int)
- Purpose: Divides 3D arrays into smaller patches
- Key features:
  - Input validation (array shape, size, etc.)
  - Calculates patch starting positions for each dimension (x, y, z)
  - Extracts patches from all possible positions
- Return values:
  - patches: List of all extracted patches
  - coordinates: List of starting coordinates for each patch
def reconstruct_array(patches: List[np.ndarray], coordinates: List[Tuple[int, int, int]], original_shape: Tuple[int, int, int])
- Purpose: Reconstructs patches back into original-sized 3D array
- Operation:
  - Creates empty array of original size
  - Places each patch at its corresponding position
  - Overlapping regions are overwritten by most recent patch
- Note:
  - As mentioned in code comments, using average values for overlapping regions might be better

Submission helper functions

import pandas as pd

def dict_to_df(coord_dict, experiment_name):
    """
    Convert dictionary of coordinates to pandas DataFrame.
    
    Parameters:
    -----------
    coord_dict : dict
        Dictionary where keys are labels and values are Nx3 coordinate arrays
        
    Returns:
    --------
    pd.DataFrame
        DataFrame with columns ['x', 'y', 'z', 'label']
    """
    # Create lists to store data
    all_coords = []
    all_labels = []
    
    # Process each label and its coordinates
    for label, coords in coord_dict.items():
        all_coords.append(coords)
        all_labels.extend([label] * len(coords))
    
    # Concatenate all coordinates
    all_coords = np.vstack(all_coords)
    
    df = pd.DataFrame({
        'experiment': experiment_name,
        'particle_type': all_labels,
        'x': all_coords[:, 0],
        'y': all_coords[:, 1],
        'z': all_coords[:, 2]
    })

    
    return df

Purpose:
- Converts position coordinates of multiple particle types in 3D space into a structured dataframe format
- Structures data to match the submission format for Kaggle competition
Input Parameters:
- coord_dict: A dictionary with particle types as keys and their coordinates (N×3 array) as values
  - Example: {'apo-ferritin': array([[x1,y1,z1], [x2,y2,z2]...]), 'ribosome': array([[x3,y3,z3]...])}
- experiment_name: Experiment name (e.g., 'TS_5_4')

4) Reading in the data

TRAIN_DATA_DIR = "/kaggle/input/create-numpy-dataset-exp-name"
TEST_DATA_DIR = "/kaggle/input/czii-cryo-et-object-identification"

train_names = ['TS_5_4', 'TS_69_2', 'TS_6_6', 'TS_73_6', 'TS_86_3', 'TS_99_9']
valid_names = ['TS_6_4']

train_files = []
valid_files = []

for name in train_names:
    image = np.load(f"{TRAIN_DATA_DIR}/train_image_{name}.npy")
    label = np.load(f"{TRAIN_DATA_DIR}/train_label_{name}.npy")

    train_files.append({"image": image, "label": label})
    

for name in valid_names:
    image = np.load(f"{TRAIN_DATA_DIR}/train_image_{name}.npy")
    label = np.load(f"{TRAIN_DATA_DIR}/train_label_{name}.npy")

    valid_files.append({"image": image, "label": label})

Loading data used in training and validation
For each experiment ID:
- image: 3D volume data (.npy format)
- label: corresponding label data
- stored as dictionary {"image": image, "label": label}

Create the training dataloader

"I should probably find a way to create a dataloader that takes more batches."

# Non-random transforms to be cached
non_random_transforms = Compose([
    EnsureChannelFirstd(keys=["image", "label"], channel_dim="no_channel"),
    NormalizeIntensityd(keys="image"),
    Orientationd(keys=["image", "label"], axcodes="RAS")
])

raw_train_ds = CacheDataset(data=train_files, transform=non_random_transforms, cache_rate=1.0)


my_num_samples = 16
train_batch_size = 1

# Random transforms to be applied during training
random_transforms = Compose([
    RandCropByLabelClassesd(
        keys=["image", "label"],
        label_key="label",
        spatial_size=[96, 96, 96],
        num_classes=7,
        num_samples=my_num_samples
    ),
    RandRotate90d(keys=["image", "label"], prob=0.5, spatial_axes=[0, 2]),
    RandFlipd(keys=["image", "label"], prob=0.5, spatial_axis=0),    
])

# Final Dataset and DataLoader Creation:
train_ds = Dataset(data=raw_train_ds, transform=random_transforms)


# DataLoader remains the same
train_loader = DataLoader(
    train_ds,
    batch_size=train_batch_size,
    shuffle=True,
    num_workers=4,
    pin_memory=torch.cuda.is_available()
)

This code sets up the training data loader using the MONAI library for medical image data processing with transforms and data loaders
Data loader?
- DataLoader is a pipeline that supplies data to the model
- Main features:
  1. Batch Creation: Bundles multiple data samples together
  2. Shuffling: Randomly shuffles the order of data
  3. Parallel Processing: Accelerates data loading using multiple CPU cores
  4. Memory Efficiency: Loads data as needed instead of loading all at once Example:
Non-random Transforms Setup:
- EnsureChannelFirstd: Moves channel dimension to first position
- NormalizeIntensityd: Normalizes image values (standardizes image values)
- Orientationd: Aligns 3D images to RAS (Right-Anterior-Superior) standard
raw_train_ds = CacheDataset(data=train_files, transform=non_random_transforms, cache_rate=1.0)
- Caches transformed data in memory for fast access
- cache_rate=1.0: Caches all data
Random Transforms Setup:
- RandCropByLabelClassesd: Random cropping by label classes (96×96×96 size)
- RandRotate90d: Random 90-degree rotation (50% probability)
- RandFlipd: Random flipping (50% probability)
- Random Transform doesn't replace original data but applies new transformations each time data is loaded
- my_num_samples = 16 # Creates 16 transformed samples per image
  - Creates 16 different transformations from each original image per epoch
  - 6 training images × 16 samples = total of 96 samples used in each epoch
  - New random transformations are applied every epoch
Final Dataset and DataLoader Creation:
- batch_size=1: Number of samples to process at once
- shuffle=True: Shuffle data order each epoch
- num_workers=4: Number of workers for parallel data loading
- pin_memory=True: Memory performance optimization for GPU usage

Create the validation dataloader

"Here I deviate a little from the source notebooks."
"In the source, the validation dataloader also used the random transformations. This is bad practice and will result in noisy validation."
"Here I split the validation dataset in (slightly) overlapping blocks of (96, 96 , 96) so that we can have a consistent validation set that uses all the validation data.

val_images,val_labels = [dcts['image'] for dcts in valid_files],[dcts['label'] for dcts in valid_files]

val_image_patches, _ = extract_3d_patches_minimal_overlap(val_images, 96)
val_label_patches, _ = extract_3d_patches_minimal_overlap(val_labels, 96)

val_patched_data = [{"image": img, "label": lbl} for img, lbl in zip(val_image_patches, val_label_patches)]


valid_ds = CacheDataset(data=val_patched_data, transform=non_random_transforms, cache_rate=1.0)


valid_batch_size = 16
# DataLoader remains the same
valid_loader = DataLoader(
    valid_ds,
    batch_size=valid_batch_size,
    shuffle=False,
    num_workers=4,
    pin_memory=torch.cuda.is_available()
)

valid_batch_size = 16
- Larger batch size than training(1)
shuffle=False
- Maintaining order
Dataloader configuration details:
- Consistency:
  - Random transforms would result in unstable performance evaluation
  - Fixed patches enable consistent evaluation
- Completeness:
  - Using entire data allows more accurate evaluation
  - Slight overlap ensures good evaluation of boundary regions
- Efficiency:
  - Can use larger batch size
  - Faster validation process than training

5) Initializing the model

This model is pretty much directly copied from https://www.kaggle.com/code/zhuowenzhao11/3d-u-net-pytorch-lightning-distributed-training

import lightning.pytorch as pl

from monai.networks.nets import UNet
from monai.losses import TverskyLoss
from monai.metrics import DiceMetric

class Model(pl.LightningModule):
    def __init__(
        self, 
        spatial_dims: int = 3,
        in_channels: int = 1,
        out_channels: int = 7,
        channels: Union[Tuple[int, ...], List[int]] = (48, 64, 80, 80),
        strides: Union[Tuple[int, ...], List[int]] = (2, 2, 1),
        num_res_units: int = 1,
        lr: float=1e-3):
    
        super().__init__()
        self.save_hyperparameters()
        self.model = UNet(
            spatial_dims=self.hparams.spatial_dims,
            in_channels=self.hparams.in_channels,
            out_channels=self.hparams.out_channels,
            channels=self.hparams.channels,
            strides=self.hparams.strides,
            num_res_units=self.hparams.num_res_units,
        )
        self.loss_fn = TverskyLoss(include_background=True, to_onehot_y=True, softmax=True)  # softmax=True for multiclass
        self.metric_fn = DiceMetric(include_background=False, reduction="mean", ignore_empty=True)

        self.train_loss = 0
        self.val_metric = 0
        self.num_train_batch = 0
        self.num_val_batch = 0

    def forward(self, x):
        return self.model(x)

    def training_step(self, batch, batch_idx):
        x, y = batch['image'], batch['label']
        y_hat = self(x)
        loss = self.loss_fn(y_hat, y)
        self.train_loss += loss
        self.num_train_batch += 1
        torch.cuda.empty_cache()
        return loss

    def on_train_epoch_end(self):
        loss_per_epoch = self.train_loss/self.num_train_batch
        #print(f"Epoch {self.current_epoch} - Average Train Loss: {loss_per_epoch:.4f}")
        self.log('train_loss', loss_per_epoch, prog_bar=True)
        self.train_loss = 0
        self.num_train_batch = 0
    
    def validation_step(self, batch, batch_idx):
        with torch.no_grad(): # This ensures that gradients are not stored in memory
            x, y = batch['image'], batch['label'] # Extract images and labels from batch
            y_hat = self(x) # Perform model prediction
            
            # Process predictions
            metric_val_outputs = [AsDiscrete(
                argmax=True,  # Select class with highest probability
                to_onehot=self.hparams.out_channels  # Convert to one-hot encoding
            )(i) for i in decollate_batch(y_hat)]
            
            # Process labels
            metric_val_labels = [AsDiscrete(
                to_onehot=self.hparams.out_channels  # Convert labels to one-hot encoding
            )(i) for i in decollate_batch(y)]

            # compute metric for current iteration
            # Calculate Dice score for current batch
            self.metric_fn(y_pred=metric_val_outputs, y=metric_val_labels)
            # Calculate batch average metric
            metrics = self.metric_fn.aggregate(reduction="mean_batch")
            # Calculate mean across all particle types
            val_metric = torch.mean(metrics) # I used mean over all particle species as the metric. This can be explored.
            
            # Result Accumulation
            self.val_metric += val_metric 
            self.num_val_batch += 1
            
        torch.cuda.empty_cache()
        return {'val_metric': val_metric}

    def on_validation_epoch_end(self):
        metric_per_epoch = self.val_metric/self.num_val_batch
        #print(f"Epoch {self.current_epoch} - Average Val Metric: {metric_per_epoch:.4f}")
        self.log('val_metric', metric_per_epoch, prog_bar=True, sync_dist=False) # sync_dist=True for distributed training
        self.val_metric = 0
        self.num_val_batch = 0
    
    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=self.hparams.lr)

This code implements a 3D U-Net model using PyTorch Lightning
def __init__(self, spatial_dims=3, in_channels=1, out_channels=7, ...):
- UNet Model Configuration:
  - spatial_dims=3: Process 3D data
  - in_channels=1: Grayscale image input
  - out_channels=7: 7 class outputs (background + 6 particle types)
  - channels=(48, 64, 80, 80): Number of channels per layer
  - strides=(2, 2, 1): Stride for each layer
  - num_res_units=1: Number of residual units to include in each encoder and decoder block
    - Residual Unit structure:
      - Input -> Conv3D -> BatchNorm -> ReLU -> Conv3D -> BatchNorm -> Add(Input) -> ReLU -> Output
      - One "unit" consists of:
        
        2 3D convolution layers
        
        2 Batch Normalization layers
        
        ReLU activation function
        
        Skip connection (adding input to output)
    - num_res_units=1 means this structure is repeated once at each level. If num_res_units=2, this entire structure would be repeated twice in sequence
- Loss Function and Evaluation Metrics:
  - TverskyLoss: Loss function robust to class imbalance
    - Tversky Loss is a generalized version of Dice Loss
    - Effective for handling class imbalance problems
    - Parameter explanation:
      - include_background=True: Include background (class 0) in loss calculation
      - to_onehot_y=True: Convert integer labels to one-hot vectors
      - softmax=True: Apply softmax for multi-class classification
  - DiceMetric: Segmentation performance metric
    - Dice coefficient is a standard metric for evaluating segmentation performance
    - Formula: 2|X∩Y| / (|X|+|Y|)
      - X: Predicted region
      - Y: Actual region
    - Parameter explanation:
      - include_background=False: Exclude background class from evaluation
      - reduction="mean": Average Dice scores across all classes
      - ignore_empty=True: Exclude cases where certain classes are absent
Variable initialization
- self.train_loss = 0 # Accumulate training loss
- self.val_metric = 0 # Accumulate validation metric
- self.num_train_batch = 0 # Count processed training batches
- self.num_val_batch = 0 # Count processed validation batches
- These variables are used to calculate average performance during an epoch
- Reset to 0 at the end of each epoch
def forward(self, x):
- Basic inference method for PyTorch models
- Passes input x through the model
- Simple but important roles:
  1. Simplifies model calls (enables self(x) instead of model(x))
  2. Used for model inference in other methods
  3. Integrates with PyTorch Lightning's automated features
def training_step(self, batch, batch_idx):
- Process batch data
- Perform model prediction
- Calculate and accumulate loss
def on_train_epoch_end(self):
- Calculate average loss per epoch
- Perform logging
def validation_step(self, batch, batch_idx):
- OVERALL:
  - Perform validation without gradient calculation
  - Convert predictions to class labels
  - Calculate Dice score
- with torch.no_grad():
  - Turns off gradient calculation as backpropagation isn't needed during validation
  - Reduces memory usage and improves computation speed
- metric_val_outputs
  - decollate_batch(y_hat):
    - Separates batch into individual samples
    - Example: [16 batches] → [sample1, sample2, ..., sample16]
  - AsDiscrete(argmax=True):
    - Selects class with highest probability at each position
    - Example: [0.1, 0.7, 0.2] → 1 (second class)
  - to_onehot=7:
    - Converts selected class to one-hot vector
    - Example: 1 → [0, 1, 0, 0, 0, 0, 0]
- metric_val_labels
  - decollate_batch(y):
    - Separates batch into individual samples
  - AsDiscrete(to_onehot=7):
    - Converts class index to one-hot vector
    - Example: 2 → [0, 0, 1, 0, 0, 0, 0]
def on_validation_epoch_end(self):
- Same with on_train_epoch_end(self)
def configure_optimizers(self):
- Use AdamW optimizer
- Set learning rate

channels = (48, 64, 80, 80)
strides_pattern = (2, 2, 1)       
num_res_units = 1
learning_rate = 1e-3
num_epochs = 100

model = Model(channels=channels, strides=strides_pattern, num_res_units=num_res_units, lr=learning_rate)

6) Training the model

torch.set_float32_matmul_precision('medium')

# Check if CUDA is available and then count the GPUs
if torch.cuda.is_available():
    num_gpus = torch.cuda.device_count()
    print(f"Number of GPUs available: {num_gpus}")
else:
    print("No GPU available. Running on CPU.")
devices = list(range(num_gpus))
print(devices)


trainer = pl.Trainer(
    max_epochs=num_epochs,        # Total number of training epochs (100)
    #strategy="ddp_notebook",     # Distributed training strategy (currently commented)
    accelerator="gpu",           # Use GPU
    devices=[0],                 # Use only first GPU
    num_nodes=1,                 # Use single node
    log_every_n_steps=10,        # Log every 10 steps
    enable_progress_bar=True,    # Enable progress bar
)

trainer.fit(model, train_loader, valid_loader)

torch.set_float32_matmul_precision('medium')
- Sets precision of 32-bit floating-point matrix multiplication to 'medium'
- Balances speed and accuracy
GPU Availability Check
- Checks for CUDA (GPU) availability
- Counts available GPUs
- Creates list of GPU indices (e.g., [0,1,2] for 3 GPUs)
Trainer Setup
- max_epochs: Total number of training iterations
- accelerator: Hardware to use for training (GPU/CPU)
- devices: GPU numbers to use
- num_nodes: Number of nodes for distributed training
- log_every_n_steps: Logging frequency
- enable_progress_bar: Visualize training progress
trainer.fit(model, train_loader, valid_loader)
- Training Cycle:
  - Each epoch loads batch data from train_loader
  - Executes training_step method:
    - Model prediction
    - Loss calculation
    - Backpropagation and weight updates
  - Runs on_train_epoch_end at epoch end
- Validation Cycle:
  - Loads data from valid_loader after each epoch
  - Executes validation_step method:
    - Model prediction
    - Dice score calculation
  - Runs on_validation_epoch_end at epoch end

7) Predicting on the test set

# Model setup
model.eval();
model.to("cuda");

# Configuration File Processing
import json
copick_config_path = TRAIN_DATA_DIR + "/copick.config"

with open(copick_config_path) as f:
    copick_config = json.load(f)

copick_config['static_root'] = '/kaggle/input/czii-cryo-et-object-identification/test/static'

copick_test_config_path = 'copick_test.config'

with open(copick_test_config_path, 'w') as outfile:
    json.dump(copick_config, outfile)

# Copick Setup
import copick

root = copick.from_file(copick_test_config_path)

copick_user_name = "copickUtils"
copick_segmentation_name = "paintedPicks"
voxel_size = 10
tomo_type = "denoised"

Switches the trained model to evaluation mode and prepares settings for test data
Model Setup
- eval(): Switches dropout, batch normalization, etc. to evaluation mode
- to("cuda"): Moves model to GPU memory
Configuration File Processing
- Loads original configuration file
- Updates test data path
- Saves new configuration to file
Copick Setup
- Loads configuration using copick library
- Sets parameters needed for testing

# Setting up Inference Transformations:
# Non-random transforms to be cached
inference_transforms = Compose([
    EnsureChannelFirstd(keys=["image"], channel_dim="no_channel"),
    NormalizeIntensityd(keys="image"),
    Orientationd(keys=["image"], axcodes="RAS")
])

import cc3d

id_to_name = {1: "apo-ferritin", 
              2: "beta-amylase",
              3: "beta-galactosidase", 
              4: "ribosome", 
              5: "thyroglobulin", 
              6: "virus-like-particle"}

Setting up Inference Transformations:
- Unlike training, no random transformations (for consistent predictions)
- Applied transformations:
  - EnsureChannelFirstd: Moves channel dimension to first position
  - NormalizeIntensityd: Normalizes image values
  - Orientationd: Aligns 3D images to RAS standard
cc3d
- Library for Connected Components analysis
- Used to find and label connected regions in 3D images

Iterate over test set:

Read in a run
Split it into patches of size (96, 96, 96)
Create a dataset from the patches
Predict the segmentation mask
Glue the mask back together
Find the connected components for each class
Find the centroids of the connected components
Add to the dataframe

Then do this for all runs.
"This can probably be optimized quite a bit."

BLOB_THRESHOLD = 500
CERTAINTY_THRESHOLD = 0.5

classes = [1, 2, 3, 4, 5, 6]
with torch.no_grad():
    location_df = []
    for run in root.runs:
        print(run)
		
        # "Read in a run"
        tomo = run.get_voxel_spacing(10)
        tomo = tomo.get_tomogram(tomo_type).numpy()

        # "Split into patches"
        tomo_patches, coordinates  = extract_3d_patches_minimal_overlap([tomo], 96)

        # "Create a dataset"
        tomo_patched_data = [{"image": img} for img in tomo_patches]
        tomo_ds = CacheDataset(data=tomo_patched_data, transform=inference_transforms, cache_rate=1.0)

        # "Predict the segmentation mask"
        pred_masks = []

        for i in range(len(tomo_ds)):
            input_tensor = tomo_ds[i]['image'].unsqueeze(0).to("cuda")
            model_output = model(input_tensor)

            probs = torch.softmax(model_output[0], dim=0)
            thresh_probs = probs > CERTAINTY_THRESHOLD
            _, max_classes = thresh_probs.max(dim=0)

            pred_masks.append(max_classes.cpu().numpy())
            
        # "Glue the mask back together"
        reconstructed_mask = reconstruct_array(pred_masks, coordinates, tomo.shape)
        
        location = {}

        for c in classes:
            # "Find the connected components"
            cc = cc3d.connected_components(reconstructed_mask == c)
            stats = cc3d.statistics(cc)
            
            # "Find the centroids"
            zyx=stats['centroids'][1:]*10.012444 #https://www.kaggle.com/competitions/czii-cryo-et-object-identification/discussion/544895#3040071
            zyx_large = zyx[stats['voxel_counts'][1:] > BLOB_THRESHOLD]
            xyz =np.ascontiguousarray(zyx_large[:,::-1])

            location[id_to_name[c]] = xyz

        # "Add to the dataframe"
        df = dict_to_df(location, run.name)
        location_df.append(df)
    
    location_df = pd.concat(location_df)

location_df.insert(
    loc=0,                              # Insert at first position
    column='id',                        # Column name is 'id'
    value=np.arange(len(location_df))   # Sequential numbers starting from 0
)
location_df.to_csv("submission.csv", index=False)

Adding ID Column
- Assigns unique ID to each predicted particle
- Meets Kaggle submission format requirements
Saving to CSV file
- index=False: Excludes DataFrame index from saved file

!cp -r /kaggle/input/hengck-czii-cryo-et-01/* .

from czii_helper import *
from dataset import *
from scipy.optimize import linear_sum_assignment
import matplotlib.pyplot as plt

!cp ~:
- Linux command that copies required files from a Kaggle dataset to the current working directory
hengck-czii-cryo-et-01 includes:
- czii_helper.py: Utility functions for evaluation metric calculations
- dataset.py: Functions for data loading and processing
- PARTICLE: a constant defined in the copied files, containing characteristics of each particle type (name, radius, difficulty level, etc.)

import os
if os.getenv('KAGGLE_IS_COMPETITION_RERUN'):
    MODE = 'submit'
else:
    MODE = 'local'







valid_dir ='/kaggle/input/czii-cryo-et-object-identification/train'
valid_id = ['TS_6_4', ]

def do_one_eval(truth, predict, threshold):
    P=len(predict)
    T=len(truth)

    if P==0:
        hit=[[],[]]
        miss=np.arange(T).tolist()
        fp=[]
        metric = [P,T,len(hit[0]),len(miss),len(fp)]
        return hit, fp, miss, metric

    if T==0:
        hit=[[],[]]
        fp=np.arange(P).tolist()
        miss=[]
        metric = [P,T,len(hit[0]),len(miss),len(fp)]
        return hit, fp, miss, metric

    #---
    distance = predict.reshape(P,1,3)-truth.reshape(1,T,3)
    distance = distance**2
    distance = distance.sum(axis=2)
    distance = np.sqrt(distance)
    p_index, t_index = linear_sum_assignment(distance)

    valid = distance[p_index, t_index] <= threshold
    p_index = p_index[valid]
    t_index = t_index[valid]
    hit = [p_index.tolist(), t_index.tolist()]
    miss = np.arange(T)
    miss = miss[~np.isin(miss,t_index)].tolist()
    fp = np.arange(P)
    fp = fp[~np.isin(fp,p_index)].tolist()

    metric = [P,T,len(hit[0]),len(miss),len(fp)] #for lb metric F-beta copmutation
    return hit, fp, miss, metric


def compute_lb(submit_df, overlay_dir):
    valid_id = list(submit_df['experiment'].unique())
    print(valid_id)

    eval_df = []
    for id in valid_id:
        truth = read_one_truth(id, overlay_dir) #=f'{valid_dir}/overlay/ExperimentRuns')
        id_df = submit_df[submit_df['experiment'] == id]
        for p in PARTICLE:
            p = dotdict(p)
            print('\r', id, p.name, end='', flush=True)
            xyz_truth = truth[p.name]
            xyz_predict = id_df[id_df['particle_type'] == p.name][['x', 'y', 'z']].values
            hit, fp, miss, metric = do_one_eval(xyz_truth, xyz_predict, p.radius* 0.5)
            eval_df.append(dotdict(
                id=id, particle_type=p.name,
                P=metric[0], T=metric[1], hit=metric[2], miss=metric[3], fp=metric[4],
            ))
    print('')
    eval_df = pd.DataFrame(eval_df)
    gb = eval_df.groupby('particle_type').agg('sum').drop(columns=['id'])
    gb.loc[:, 'precision'] = gb['hit'] / gb['P']
    gb.loc[:, 'precision'] = gb['precision'].fillna(0)
    gb.loc[:, 'recall'] = gb['hit'] / gb['T']
    gb.loc[:, 'recall'] = gb['recall'].fillna(0)
    gb.loc[:, 'f-beta4'] = 17 * gb['precision'] * gb['recall'] / (16 * gb['precision'] + gb['recall'])
    gb.loc[:, 'f-beta4'] = gb['f-beta4'].fillna(0)

    gb = gb.sort_values('particle_type').reset_index(drop=False)
    # https://www.kaggle.com/competitions/czii-cryo-et-object-identification/discussion/544895
    gb.loc[:, 'weight'] = [1, 0, 2, 1, 2, 1]
    lb_score = (gb['f-beta4'] * gb['weight']).sum() / gb['weight'].sum()
    return gb, lb_score


#debug
if 1:
    if MODE=='local':
    #if 1:
        submit_df=pd.read_csv(
           'submission.csv'
            # '/kaggle/input/hengck-czii-cryo-et-weights-01/submission.csv'
        )
        gb, lb_score = compute_lb(submit_df, f'{valid_dir}/overlay/ExperimentRuns')
        print(gb)
        print('lb_score:',lb_score)
        print('')


        #show one ----------------------------------
        fig = plt.figure(figsize=(18, 8))

        id = valid_id[0]
        truth = read_one_truth(id,overlay_dir=f'{valid_dir}/overlay/ExperimentRuns')

        submit_df = submit_df[submit_df['experiment']==id]
        for p in PARTICLE:
            p = dotdict(p)
            xyz_truth = truth[p.name]
            xyz_predict = submit_df[submit_df['particle_type']==p.name][['x','y','z']].values
            hit, fp, miss, _ = do_one_eval(xyz_truth, xyz_predict, p.radius)
            print(id, p.name)
            print('\t num truth   :',len(xyz_truth) )
            print('\t num predict :',len(xyz_predict) )
            print('\t num hit  :',len(hit[0]) )
            print('\t num fp   :',len(fp) )
            print('\t num miss :',len(miss) )

            ax = fig.add_subplot(2, 3, p.label, projection='3d')
            if hit[0]:
                pt = xyz_predict[hit[0]]
                ax.scatter(pt[:, 0], pt[:, 1], pt[:, 2], alpha=0.5, color='r')
                pt = xyz_truth[hit[1]]
                ax.scatter(pt[:,0], pt[:,1], pt[:,2], s=80, facecolors='none', edgecolors='r')
            if fp:
                pt = xyz_predict[fp]
                ax.scatter(pt[:, 0], pt[:, 1], pt[:, 2], alpha=1, color='k')
            if miss:
                pt = xyz_truth[miss]
                ax.scatter(pt[:, 0], pt[:, 1], pt[:, 2], s=160, alpha=1, facecolors='none', edgecolors='k')

            ax.set_title(f'{p.name} ({p.difficulty})')

        plt.tight_layout()
        plt.show()
        
        #--- 
        zz=0

Overall comprehensive evaluation and visualization of model predictions
do_one_eval:
- Inputs:
  - truth: actual particle positions
  - predict: predicted particle positions
  - threshold: matching distance threshold
- Main process:
  1. Handle exceptions (P=0 or T=0 cases)
  2. Calculate distances between predictions and truth
  3. Find optimal matching (using linear_sum_assignment)
  4. Filter valid matches based on threshold
  5. Calculate hits/misses/false positives
- Returns:
  - hit: correct prediction indices
  - fp: false prediction indices
  - miss: missed particle indices
  - metric: [P, T, num_hits, num_misses, num_fps]
compute_lb
- Inputs:
  - submit_df: prediction results dataframe to submit
  - overlay_dir: ground truth data path
- Main process:
  1. Evaluate predictions for each experiment ID
  2. Calculate performance per particle type
  3. Calculate precision and recall
  4. Calculate f-beta4 score (beta=4 weights recall)
  5. Apply particle type weights
- Returns:
  - gb: performance metrics per particle type
  - lb_score: final score
read_one_truth: loading the ground truth data
We are scoring the lb score based on the test data we configured: TS_6_4

I think it's very important to have a feedback loop, where you're constantly thinking about what you've done and how you could be doing it better.
- Elon Musk -

CZII - CryoET Object Identification #1 - Training Data

dongsunseng — Tue, 14 Jan 2025 02:09:34 +0900

This post is an annotation of training data code kernel from "fnands".

https://www.kaggle.com/code/fnands/create-numpy-dataset-exp-name

Create Numpy dataset exp name

Explore and run machine learning code with Kaggle Notebooks | Using data from CZII - CryoET Object Identification

www.kaggle.com

Kernel 'Create Numpy dataset exp name'

Overall this kernel is about PREPARING TRAINING DATA

!pip install git+https://github.com/copick/copick-utils.git matplotlib tqdm copick 
!pip install -q "monai-weekly[mlflow]"

This combination of packages creates a complete environment for processing, analyzing, and applying machine learning models to Cryo-electron microscope data.

copick-utils (`git+https://github.com/copick/copick-utils.git`)
- A utility library for processing Cryo-EM (Cryo-electron microscope) data
- Installed directly from GitHub repository
- Provides tools for processing, analyzing, and visualizing electron microscope images
matplotlib
- Python's primary visualization library
- Used for creating and displaying graphs, charts, and images
- Essential tool for visualizing data analysis results
tqdm
- Library that provides progress bars
- Enables real-time monitoring of long-running tasks
- Particularly useful when processing large datasets
copick
- Main library for Cryo-EM data
- Provides functionality for image processing, data management, and analysis
- Serves as the basic framework for utilizing copick-utils features
monai-weekly[mlflow]
- MONAI (Medical Open Network for AI) is a deep learning framework for medical images
- Built on PyTorch and specialized for medical image processing
- Key features:
  - Data preprocessing and augmentation
  - Neural network models for medical images
  - Training and evaluation tools
  - [mlflow] is an optional dependency where:
    - MLflow is a platform for tracking and managing machine learning experiments
    - Records and manages experimental results, models, and parameters
    - Helps compare and reproduce model performance
- The '-q' option means 'quiet' mode, which minimizes installation process output.

!pip install zarr
!pip install copick

zarr:
- Format and library for storing and processing N-dimensional arrays
- Main Features:
  - Chunked compression storage
  - Parallel processing support
  - Hierarchical organization capability
  - Cloud storage compatibility
  - NumPy-compatible interface
- Purpose:
  - Processing large-scale scientific data
  - Data sharing in distributed computing environments
  - Processing datasets larger than available memory
- Advantages:
  - Memory efficient: Can process data without loading entire dataset into memory
  - Fast I/O performance: Efficient data access through chunk-based approach
  - Flexible storage format: Supports various storage options (local disk, cloud, etc.)
  - Parallel processing: Multiple processes can access data simultaneously
- Common Use Cases:
  - Large scientific datasets (e.g., meteorological data, satellite images)
  - Machine learning datasets
  - Biological data (e.g., cryo-electron microscopy data)
- Relationship with asciitree package:
  - asciitree is used to visually represent Zarr data structure
  - Shows hierarchical structure of Zarr arrays in tree format in terminal
  - While asciitree is necessary for visualizing Zarr structures, it can sometimes be challenging to install or use

# Make a copick project
import os
import shutil

# Define configuration for protein structures and project settings
config_blob = """{
   "name": "czii_cryoet_mlchallenge_2024",
   "description": "2024 CZII CryoET ML Challenge training data.",
   "version": "1.0.0",

   "pickable_objects": [
       {
           "name": "apo-ferritin",
           "is_particle": true,
           "pdb_id": "4V1W",
           "label": 1,
           "color": [0, 117, 220, 128],
           "radius": 60,
           "map_threshold": 0.0418
       },
       {
           "name": "beta-amylase",
           "is_particle": true,
           "pdb_id": "1FA2", 
           "label": 2,
           "color": [153, 63, 0, 128],
           "radius": 65,
           "map_threshold": 0.035
       },
       {
           "name": "beta-galactosidase",
           "is_particle": true,
           "pdb_id": "6X1Q",
           "label": 3,
           "color": [76, 0, 92, 128],
           "radius": 90,
           "map_threshold": 0.0578
       },
       {
           "name": "ribosome",
           "is_particle": true,
           "pdb_id": "6EK0",
           "label": 4,
           "color": [0, 92, 49, 128],
           "radius": 150,
           "map_threshold": 0.0374
       },
       {
           "name": "thyroglobulin",
           "is_particle": true,
           "pdb_id": "6SCJ",
           "label": 5,
           "color": [43, 206, 72, 128],
           "radius": 130,
           "map_threshold": 0.0278
       },
       {
           "name": "virus-like-particle",
           "is_particle": true,
           "pdb_id": "6N4V",            
           "label": 6,
           "color": [255, 204, 153, 128],
           "radius": 135,
           "map_threshold": 0.201
       }
   ],

   "overlay_root": "/kaggle/working/overlay",
   "overlay_fs_args": {
       "auto_mkdir": true
   },
   "static_root": "/kaggle/input/czii-cryo-et-object-identification/train/static"
}"""

# Define paths
copick_config_path = "/kaggle/working/copick.config"
output_overlay = "/kaggle/working/overlay"

# Write configuration file
with open(copick_config_path, "w") as f:
   f.write(config_blob)
   
# Update the overlay
# Define source and destination directories
source_dir = '/kaggle/input/czii-cryo-et-object-identification/train/overlay'
destination_dir = '/kaggle/working/overlay'

# Walk through the source directory
for root, dirs, files in os.walk(source_dir):
   # Create corresponding subdirectories in the destination
   relative_path = os.path.relpath(root, source_dir)
   target_dir = os.path.join(destination_dir, relative_path)
   os.makedirs(target_dir, exist_ok=True)
   
   # Copy and rename each file
   for file in files:
       # Add prefix 'curation_0_' if not already present
       if file.startswith("curation_0_"):
           new_filename = file
       else:
           new_filename = f"curation_0_{file}"
       
       # Define full paths for the source and destination files
       source_file = os.path.join(root, file)
       destination_file = os.path.join(target_dir, new_filename)
       
       # Copy the file with the new name
       shutil.copy2(source_file, destination_file)
       print(f"Copied {source_file} to {destination_file}")

This code sets up a project for the competition:
shutil:
- shutil is a Python standard library - it stands for "shell utility"
- It provides high-level file operations such as copying, moving, and removing files and file collections
<<config_blob = """...""">> part
- Contains information about 6 protein structures:
  - apo-ferritin: Iron storage protein
  - beta-amylase: Enzyme protein
  - beta-galactosidase: Sugar breakdown enzyme
  - ribosome: Protein synthesis structure
  - thyroglobulin: Thyroid hormone precursor
  - virus-like-particle: Virus-like particle
- Attributes defined for each structure:
  - name: Structure name
  - is_particle: Particle status
  - pdb_id: Protein Data Bank ID
  - label: Classification label (1-6)
  - color: RGBA color value ([R,G,B,A])
  - radius: Particle radius
  - map_threshold: Mapping threshold
- "overlay_root": "/kaggle/working/overlay"
  - Specifies the root directory where generated data (overlays) will be stored
  - Represents the working directory for use in Kaggle environment
  - /kaggle/working/ is a writable directory in Kaggle notebooks
- "overlay_fs_args": {
  "auto_mkdir": true
  }
  - Sets file system related arguments
  - auto_mkdir: true means it will automatically create directories if they don't exist
  - Creates necessary paths automatically when saving files or data
- "static_root": "/kaggle/input/czii-cryo-et-object-identification/train/static"
  - Specifies the path where original or unchanging static data is stored
  - Path to input data for the Kaggle competition
  - /kaggle/input/ is the read-only data directory provided by Kaggle
- These configurations define in the Kaggle environment:
  - Where to read data from
  - Where to store processed results
  - How to manage the file system
- is_particle: (Particle status):
  - Set to true in the data
  - Indicates whether the object should be treated as an independent particle
  - true means this structure is an individually identifiable, separate particle
  - This affects how the object is handled during image processing and analysis
- pdb_id: (Protein Data Bank ID)
  - Unique identifier like "6N4V"
  - PDB (Protein Data Bank) is a global database storing 3D structural information of proteins and nucleic acids
  - This ID allows access to detailed structural information of the molecule
  - For example, "6N4V" for virus-like-particle is a unique identifier storing atomic-level details of this structure
  - Detailed information can be viewed by searching this ID on the PDB website (rcsb.org)
- Radius:
  - Typically measured in Angstroms (Å) or nanometers (nm)
  - Reflects the actual physical size of virus-like particles
  - Set based on average particle size visible in electron microscope images
- map_threshold:
  - Threshold value for identifying particles in electron density maps
  - Higher values mean stricter particle identification criteria
  - 0.201 is significantly higher than other particles (e.g., apo-ferritin's 0.0418, beta-amylase's 0.035)
  - This might be because virus-like particles show stronger contrast in electron microscope images
- color:
  - RGBA: color values with transparency %
  - Set for visualization purposes, doesn't affect analysis
  - Last value 128 indicates transparency (middle value in 0-255 range)

File System Setup:
- copick_config_path = "/kaggle/working/copick.config"
  output_overlay = "/kaggle/working/overlay"
  - Specifies paths for configuration file and output directory
  - Set up for Kaggle environment
For loop part:
- for root, dirs, files in os.walk(source_dir):
  - Uses os.walk to traverse all files and subdirectories in source directory
  - Creates identical directory structure at destination
- for file in files:
  - if else clause:
    - Adds "curation_0_" prefix to all file names
    - Keeps files that already have the prefix unchanged
  - shutil.copy2(source_file, destination_file):
    - Uses shutil.copy2 to copy files
    - Also copies metadata (creation time, modification time, etc.)
Overall prepares and structures the dataset needed for training machine learning models, specifically for identifying and classifying various protein structures captured by cryo-electron microscopy.

import os
import numpy as np
from pathlib import Path
import torch
import torchinfo
import zarr, copick
from tqdm import tqdm
from monai.data import DataLoader, Dataset, CacheDataset, decollate_batch
from monai.transforms import (
    Compose, 
    EnsureChannelFirstd, 
    Orientationd,  
    AsDiscrete,  
    RandFlipd, 
    RandRotate90d, 
    NormalizeIntensityd,
    RandCropByLabelClassesd,
)
from monai.networks.nets import UNet
from monai.losses import DiceLoss, FocalLoss, TverskyLoss
from monai.metrics import DiceMetric, ConfusionMatrixMetric
import mlflow
import mlflow.pytorch

Preparing the dataset

1. Get copick root

root = copick.from_file(copick_config_path)

copick_user_name = "copickUtils"
copick_segmentation_name = "paintedPicks"
voxel_size = 10
tomo_type = "denoised"

Initializing the basic configuration of the copick project
root = copick.from_file(copick_config_path)
- Initializes a copick object by reading the configuration file from copick_config_path
- Loads settings including protein structure information and paths into this object
copick_user_name = "copickUtils"
- Sets an identifier for the user/tool performing the work
- Used to track and distinguish results
copick_segmentation_name = "paintedPicks"
- Specifies the name for segmentation (image region distinction) results
- Results will be saved and referenced using this name
voxel_size = 10
- Sets the voxel size that defines the resolution of 3D images
- A voxel is the basic unit of 3D images, similar to pixels in 2D images
tomo_type = "denoised"
- Specifies the type of tomogram (3D image) data to use
- "denoised" means using processed images with noise removed
- Noise removal improves image quality and facilitates analysis

2. Generate multi-class segmentation masks from picks, and saved them to the copick overlay directory (one-time)

# Import segmentation-related utilities
from copick_utils.segmentation import segmentation_from_picks
import copick_utils.writers.write as write
from collections import defaultdict

# Just do this once
generate_masks = True

if generate_masks:
    # Stores label and radius information for each particle in a dictionary
    # Only processes objects where is_particle is true
    target_objects = defaultdict(dict)
    for object in root.pickable_objects:
        if object.is_particle:
            target_objects[object.name]['label'] = object.label
            target_objects[object.name]['radius'] = object.radius

    # Process Tomograms and Create Masks
    for run in tqdm(root.runs):
        # Get tomogram data
        tomo = run.get_voxel_spacing(10)
        tomo = tomo.get_tomogram(tomo_type).numpy()
        
        # Create empty target array
        target = np.zeros(tomo.shape, dtype=np.uint8)
        
        # Generate Segmentation Masks
        for pickable_object in root.pickable_objects:
            pick = run.get_picks(object_name=pickable_object.name, user_id="curation")
            if len(pick):  
                target = segmentation_from_picks.from_picks(pick[0], 
                                                            target, 
                                                            target_objects[pickable_object.name]['radius'] * 0.8,
                                                            target_objects[pickable_object.name]['label']
                                                            )
        write.segmentation(run, target, copick_user_name, name=copick_segmentation_name)

from collections import defaultdict
- defaultdict automatically handles default values for missing dictionary keys
generate_masks = True
- Flag that controls whether to generate segmentation masks or not
- Generating segmentation masks is a time-consuming operation
- It only needs to be done once (this is why the comment says "Just do this once")
- A segmentation mask is a binary or multi-class label map used to distinguish specific objects or regions in an image
- It's used to distinguish 6 different protein structures
- Each structure has a unique label (1-6)
- Background is marked as 0
- The mask is in 3D form, indicating which structure each voxel belongs to
for run in tqdm(root.runs):
- Gets tomogram data for each run
- Retrieves data at specified voxel size (10)
- Converts to numpy array for processing
- Creates empty array for storing segmentation masks
for pickable_object in root.pickable_objects:
- For each object:
  - Gets pick information
  - Creates segmentation mask if pick exists
  - Uses 80% of radius (* 0.8) for mask creation
  - Uses object's label
write.segmentation(run, target, copick_user_name, name=copick_segmentation_name)
- Saves generated segmentation masks
- Saves with specified user name and segmentation name
root.runs:
- Represents each experimental run in the dataset
- In this code, we can see there are 7 experimental datasets:
  1. TS_86_3
  2. TS_6_6
  3. TS_6_4
  4. TS_5_4
  5. TS_73_6
  6. TS_99_9
  7. TS_69_2
- Each run represents one electron microscope imaging session
- Therefore, for run in tqdm(root.runs)::
  - For each experimental session (TS_*)
  - Retrieves the tomogram data from that session
  - Locates each protein
  - Generates segmentation masks
- Each run contains:
  - Tomogram data (run.get_tomogram())
  - Protein location information (run.get_picks())
  - Other metadata

3. Get tomograms and their segmentaion masks (from picks) arrays

data_dicts = []  # Create empty list to store data
for run in tqdm(root.runs):  # Iterate over 7 experimental datasets
    # Get tomogram data
    tomogram = run.get_voxel_spacing(voxel_size)  # Get data at resolution set to voxel_size=10
    tomogram = tomogram.get_tomogram(tomo_type)   # Get "denoised" type tomogram
    tomogram = tomogram.numpy()                    # Convert to numpy array

    # Get segmentation masks
    segmentation = run.get_segmentations(
        name=copick_segmentation_name,    # "paintedPicks"
        user_id=copick_user_name,         # "copickUtils"
        voxel_size=voxel_size,           # 10
        is_multilabel=True               # Mask distinguishing multiple classes (proteins)
    )[0].numpy()

    # Add to data dictionary
    data_dicts.append({
        "name": run.name,        # Experiment name (e.g., "TS_86_3")
        "image": tomogram,       # Tomogram data
        "label": segmentation    # Segmentation mask
    })

# Print label values from first data
print(np.unique(data_dicts[0]['label']))  # Outputs [0 1 2 3 4 5 6]

Collects tomograms and segmentation masks from each experimental data
Results explanation:
- [0 1 2 3 4 5 6] are all unique values in the mask:
  - 0: Background
  - 1: apo-ferritin
  - 2: beta-amylase
  - 3: beta-galactosidase
  - 4: ribosome
  - 5: thyroglobulin
  - 6: virus-like-particle
Each dictionary created for experimental data includes:
- Experiment name
- Original image (tomogram)
- Segmentation mask (labels)
This prepared data can be used later for training machine learning models.

# For each of the 7 experimental datasets
for i in range(7):
    # Save image (tomogram) data
    with open(f"train_image_{data_dicts[i]['name']}.npy", 'wb') as f:
        np.save(f, data_dicts[i]['image'])
    
    # Save label (segmentation mask) data    
    with open(f"train_label_{data_dicts[i]['name']}.npy", 'wb') as f:
        np.save(f, data_dicts[i]['label'])

saves the previously created data to files
Specifically:
- Two .npy files are created for each experiment:
  - train_image_TS_XX_X.npy: tomogram data
  - train_label_TS_XX_X.npy: segmentation mask
- File format:
  - .npy: NumPy's array storage format
  - 'wb': open file in binary write mode

I could either watch it happen or be a part of it.
- Elon Musk -

[LLM] 1. Prompt Engineering Basics #1

dongsunseng — Sun, 5 Jan 2025 23:28:14 +0900

This post heavily relies on this lecture:

개발자를 위한 ChatGPT 프롬프트 엔지니어링

2시간 이내에 이 안내 프로젝트를 완료하세요. 채팅 상자를 넘어서세요. API 액세스를 사용하여 자체 애플리케이션에 LLM을 활용하고 맞춤형 챗봇을 구축하는 방법을 배워보세요. 개발자를 위한 C

www.coursera.org

Two types of LLMs

Base LLM
- Predicts next word based on text training data
Instruction Tuned LLM
- Tries to follow instructions
- Fine-tune on instructions and good attempts at following those instructions
- Often further refined using RLHF technique
  - RLHF: Reinforcement Learning with Human
- Trained to be Helpful, Honest, and Harmless
- Thus, likely to be less toxic than Base LLM
- Recommended to be used for practical usages

Guidelines for Prompting

First Principle: Write clear and specific instructions
- Clear prompt doesn't mean short prompt
- Detailed tactics:
  1. Use delimiters to clearly indicate distinct parts of the input
    - Delimiters can be anything like: ```, """, < >, <tag> </tag>, ---, etc
    - Delimiters can also avoid prompt injections
      - prompt injection: if a user is allowed to add some input into your prompt, they might give kind of conflicting instructions to the model that might make it follow the user's instructions rather than doing what you wanted it to do
      - In other words, model can successfully distinguish the input part and the instruction part to avoid confusions
  2. Ask for structured output
    - for example: JSON, HTML
  3. Ask the model to check whether conditions are satisfied
    - Example: If the text does not contain a sequence of instructions, then simply write "No steps provided".
  4. "Few-shot" prompting
Second Principle: Give the model time to THINK
- Detailed tactics:
  1. Specify the steps required to complete a task
  2. Instruct the model to work out its own solution before rushing to a conclusion
    - When we simply provide sample answer and ask if that is correct, the model might skim read it and simply say that it is correct without fully thinking about the answer
    - Therefore, we should first ask to draw its own solution first and then make the model compare its own and the sample answer

Model limitations: Hallucinations

Even though large language models are exposed to a vast amount of knowledge during its training process, it has not perfectly memorized the information it have seen and so it doesn't know the boundary of its knowledge very well.
Thus, those models are likely to make statements that sound plausible but are not true.
How to reduce hallucinations:
- First ask the model to find any relevant quotes from the text and then ask it to use those quotes to answer the question

If you need inspiration, don't do it.
-Elon Musk-

[NLP] 3. How does Transformer Work?

dongsunseng — Fri, 3 Jan 2025 15:52:40 +0900

Background

Transformer, introduced by Google in 2017 for natural language processing, is a language model that's leading innovation in the AI field.
ChatGPT, which first enabled us to use AI through web and API interfaces, is also based on Transformer, as are the language models that companies like Google and Facebook are developing as competitors.
Transformer is expected to achieve state-of-the-art performance not only in natural language processing but also in other fields like computer vision and speech recognition.

Shift from CNN Dominance to Transformer

Deep learning can be traced back to the Perceptron of the 1950s, which was inspired by human neurons.
However, deep learning faced a dark age until the early 2010s due to insufficient computing power and more importantly lack of data for analysis during the 1990s-2000s.
However, in the 2010s, data increased explosively through smartphones and social media.
In 2012, AlexNet, using deep learning, became a breakthrough in the ImageNet Challenge (classifying 1000 images) by improving image classification accuracy by more than 10% from the previous 70-80%.
AlexNet consists of 5 CNN layers and 3 FC layers.
After that, computer vision field development mainly focused on CNN-based models, and ResNet, which emerged in 2015, achieved an image recognition error rate of around 3%, similar to human performance.

Natural Language Processing's History

In contrast, for natural language processing, RNN, which is an artificial neural network for processing sequential data like text, emerged in the 1980s, and its improved version LSTM came out in 1997.
However, they couldn't solve the long-term dependencies problem for a while, where it became difficult to remember previous data as input sentences got longer.
There were also attempts to analyze sentence sentiment by creating embedding vectors using CNN, which was popular at the time.
The Sequence to Sequence language model, introduced in 2014, is considered one of the greatest inventions in natural language processing history.
It could not only convert existing sentences into numerical values but also generate new sentences using these values.
Machine Translation is a typical example, such as generating English sentences from Korean input.
However, the Seq2Seq model still had RNN's chronic problem where it struggled to remember previous information as input sentences got longer, as it used RNN in both the encoder(processing input sentences) and decoder(generating new sentences).
Also, information loss occurred when trying to reconstruct target sentences using only the numerical information from the encoder's last timestep.
This issue was later resolved with the addition of Attention, enabling translation regardless of sentence length.

RNN's Main Problem(Summary)

They process the input data sequentially, one after the other. Such a recurrent process does not make use of modern graphics processing units (GPUs), which were designed for parallel computation and, thus, makes the training of such models quite slow.
They become quite ineffective when elements are distant from one another. This is due to the fact that information is passed at each step and the longer the chain is, the more probable the information is lost along the chain.

Attention?

https://wikidocs.net/22893

The basic idea of Attention is that since the numerical information from the encoder's last timestep alone is sufficient, the decoder refers back to the entire input sentence at every timestep when predicting output words.
However, it doesn't reference all input words equally - instead, it pays more attention to words most relevant to the word being predicted at that timestep.
Mathematically, this involves creating a query by multiplying weights with the decoder's current timestep output (hidden state), then taking the dot product with all encoder timestep outputs, and learning these weights through backpropagation to better reference the words that need to be predicted.
Although the addition of Attention somewhat removed limitations on sentence length, RNN-based Seq2Seq models still produced lower quality translations compared to humans.
However, the emergence of the Transformer brought significant changes to natural language processing.
In 2017, Google introduced the Transformer model through their paper "Attention is All You Need," implementing both encoder and decoder entirely with attention mechanisms, rather than just using attention for corrections.
The Transformer model became not only free from sentence length constraints but also better at understanding input sentences through the encoder and previously generated words through the decoder.
All famous pre-trained language models (PLMs) since then have been Transformer-based.
BERT consists of 12 Transformer encoders and excels at natural language understanding, while GPT-1 consists of 12 Transformer decoders and shows strength in natural language generation.
Subsequent language models have evolved by increasing model size and datasets - GPT-3's largest version has 96 decoders and 175 billion parameters.
ChatGPT is a model fine-tuned from GPT-3, specialized for conversation.

Transformer excelling in image field

The Transformer model is achieving good results not only in natural language processing but also in image processing.
Vision Transformer (ViT), announced in 2020, applies the Transformer model to the vision field.
It divides input images into patches, feeds them into the Transformer's encoder, and can capture interdependencies between different positions of input images and global image features using attention.
Additionally, Transformer is used in popular text-to-image generation models like DALL-E 2 and Stable Diffusion.
These models learn optimal weights for image generation by adding noise to images and restoring them, but instead of blindly restoring images, they find directions for restoration conditioned on given text information.
The Transformer is used not only to understand information between texts but also to model interactions between text and image representations.

More into the Model

Transformer uses attention in both the encoder (which understands input sentences) and the decoder (which generates target sentences).
There are three types of attention in the Transformer:
1. Encoder Self-Attention
  - Used within the encoder for understanding input sentences
2. Decoder Self-Attention (also called Masked Attention)
  - Used within the decoder for understanding the sentence it's generating
  - Called "masked attention" because it masks future tokens during the word-by-word sentence generation process
3. Encoder-Decoder Attention
  - The original purpose of attention
  - Used for the decoder to reference information from the encoder when generating sentences, supplementing any missing information

https://wikidocs.net/31379

Looking at the process step by step from where words enter the encoder:
1. Input sentences are tokenized to create a dictionary
2. Tokens are mapped to integers
3. These pass through the embedding layer
4. This creates embedding values for tokens that the model will learn
The Transformer maintains a consistent dimensionality of 512 for both word embedding vectors and all input/output values within the model.

Let me break down this explanation of Transformer's detailed operation:
1. Multi-head Attention in First Encoding Layer

When generating contextual representations by calculating similarities between input sentence tokens
Instead of calculating similarities between 512-dimensional tokens all at once
Divides into n heads for learning (hence "Multi-head Attention")
Paper used head=8

2. Example Calculation

For a sentence like "나는, 학교, 에, 간다":
Instead of full (4, 512).T x (4, 512) matrix multiplication
Changes weight vector size to 64 dimensions (512/8)
Enables 8 parallel (4, 64).T x (4, 64) matrix operations

3. Efficient Processing

Uses matrix multiplication between input values and model weights
Processes efficiently through:
- Batch matrix operations
- Parallel attention processing via multi-head attention mechanism

4. Subsequent Encoder Blocks

Perform self-attention learning with output from previous block
Each encoder block has different weight parameters
Model's expressiveness improves as layers stack up

5. Decoder Operation

Performs self-attention on masked output sentence tokens
Conducts encoder-decoder attention using:
- Self-attention values
- Values passed through final encoder block
Both self-attention and encoder-decoder attention use parallel-processed multi-head attention

Conclusion

The Transformer achieved several key breakthroughs:
1. Overcame Sentence Length Limitations
  - Through attention mechanisms
  - Improved understanding of both input and generated sentences
2. Efficient Processing
  - Handles massive matrix operations between input values and weights
  - Achieves efficiency through parallel processing of all operations
3. Foundation for Large Language Models
  - Enabled development of large-scale language models like GPT (Generative Pre-trained Transformer)
  - Made it possible to pre-train on massive datasets
  - Achieved superior performance through this architecture
This architecture laid the groundwork for modern large language models and continues to drive innovation in AI.

Reference

Transformer 모델이란? : AI 혁신을 주도하는 트랜스포머 알고리즘

트랜스포머(Transformer)는 구글이 자연어처리를 위해 2017년 발표한 모델로 현재 AI 분야의 혁신을 이끌고 있는 언어모델이다. 우리가 웹이나 API를 통해 AI를 처음 활용하게 된 계기가 된 ChatGPT 역시

blog-ko.superb-ai.com

You can find detailed steps of how transformer gets the sense of the data and generate new data from this blog:

https://www.datacamp.com/tutorial/how-transformers-work

Failure is an option here. If things are not failing, you are not innovating enough.
- Elon Musk -

[Prompt Engineering] 2. Zer0-shot Prompting

dongsunseng — Thu, 2 Jan 2025 17:36:24 +0900

What is Zero-shot prompting?

In artificial intelligence, a "shot" refers to an example.
Therefore, zero-shot means the AI processing a new task without examples, in other words, handling tasks it hasn't specifically learned.
Zero-shot prompting refers to how AI like ChatGPT processes responses to prompts without specifically trained data or example answers.
Simply put, AI performs the requested task without seeing any examples.
It processes responses to prompts using only the knowledge learned during model training.

Examples

1. Text Classification

Classify the following text as either "Business", "Technology", or "Health":
"New research shows that regular meditation can reduce stress levels and improve sleep quality."

2. Sentiment Analysis

Determine if the following customer review expresses a positive, negative, or neutral sentiment:
"After waiting for 45 minutes, the food arrived cold and the waiter was nowhere to be found."

3. Language Translation

Translate the following English text to French, maintaining a formal tone:
"We look forward to meeting you at the conference next week."

4. Question Answering

Based on the following text, answer the question below:

Text: The Industrial Revolution began in Britain in the late 18th century and spread to other parts of Europe and North America during the 19th century. It marked a major turning point in human history.

Question: Where and when did the Industrial Revolution begin?

5. Summarization

Provide a brief summary of the following paragraph in no more than two sentences:

The Great Barrier Reef is the world's largest coral reef system, stretching over 2,300 kilometers along the northeast coast of Australia. It consists of nearly 3,000 individual reefs and 900 islands, supporting an incredibly diverse ecosystem of marine life including over 1,500 species of fish, 400 species of hard coral, and 4,000 types of mollusks.

6. Intent Classification

Identify the user's intent in the following customer service query as either "Request Information", "Technical Support", "Complaint", or "Account Management":

"I've been trying to log into my account for the past hour but keep getting an error message."

Furthermore

This is called zero-shot prompting, where AI interprets prompts and generates results using only pre-trained data without being given example answers.
Most prompts you use daily without examples are classified as zero-shot prompts.
So what about prompting with examples?
The method of performing tasks using a few examples is called Few-shot prompting.
The advantages of zero-shot prompting are:
- Minimizes time spent preparing prompts
- Can quickly utilize existing models without separate training for new fields or tasks

While zero-shot prompting specializes in quick and flexible interaction with AI, it may have lower performance or accuracy compared to models pre-trained for specific tasks.
A representative prompting engineering technique to compensate for this is the Few-shot prompting mentioned in the previous post of mine:

[Prompt Engineering] 1. Few-shot Prompting

What is Few-shot Prompting?In artifical intelligence, a "shot" refers to an exampleTherefore, Few-shot means a few examples.Few-shot prompting is a method that helps AI models better understand and perform new tasks by providing a small number of examples

dongsunseng.com

Great companies are built on great products.
- Elon Musk -

[Prompt Engineering] 1. Few-shot Prompting

dongsunseng — Thu, 2 Jan 2025 16:51:37 +0900

What is Few-shot Prompting?

In artifical intelligence, a "shot" refers to an example
Therefore, Few-shot means a few examples.
Few-shot prompting is a method that helps AI models better understand and perform new tasks by providing a small number of examples when the model needs to perform a new task.
Few-shot prompting is broadly divided into:
- Instructions: Description of the task the model needs to perform
- Examples: Examples for the model to reference when generating responses
- Input data: Optional use depending on whether there is data to analyze
It is common to use 2-5 examples for few-shot prompting

Examples of few-shot prompting

1. Sentiment Analysis

Input: "The food was amazing!" 
Output: Positive

Input: "Terrible service, would not recommend." 
Output: Negative

Input: "It was an okay experience."
Output: Neutral

Input: "The concert exceeded all my expectations!"
Output: [The model should predict: Positive]

2. Text Classification

Input: "How do I reset my password?"
Category: Technical Support

Input: "I'd like to return my recent purchase"
Category: Customer Service

Input: "What are your business hours?"
Category: General Inquiry

Input: "My account is locked, please help"
Category: [The model should predict: Technical Support]

3. Language Translation (Informal -> Formal)

Informal: "Hey, what's up?"
Formal: "Hello, how are you?"

Informal: "Gimme a sec"
Formal: "Please give me a moment"

Informal: "That's awesome!"
Formal: "That is excellent"

Informal: "Can't wait to see ya"
Formal: [The model should predict: "I look forward to seeing you"]

4. Entity Extraction

Text: "John Smith lives in New York"
Person: John Smith
Location: New York

Text: "Apple Inc. is headquartered in Cupertino"
Company: Apple Inc.
Location: Cupertino

Text: "Microsoft CEO Satya Nadella announced"
Person: Satya Nadella
Company: Microsoft

Text: "Tesla opened a new factory in Berlin"
Company: [The model should predict: Tesla]
Location: [The model should predict: Berlin]

Advantages

Few-shot prompting enables AI models to better understand and perform tasks with just a small amount of data.
While it takes longer to write prompts compared to zero-shot prompting, it allows for more precise control of responses.

Limitations

Since few-shot prompting only provides a small number of examples to the AI, if the quality of the given examples is low, there's a higher probability that the AI will produce incorrect results.
Therefore, when using few-shot prompting, it's crucial to carefully check the consistency and quality of the examples.

It is a mistake to underestimate the power of a single individual to change the world.
- Elon Musk -

Child Mind Institute — Problematic Internet Use: The Greatest Shake-Up?

dongsunseng — Mon, 23 Dec 2024 17:01:32 +0900

Child Mind Institute — Problematic Internet Use

Relating Physical Activity to Problematic Internet Use

www.kaggle.com

CRAZY SHAKE-UP HERE

About the Competition

The aim of this competition is to develop a model that predicts problematic internet usage levels based on physical activity and health data from children and adolescents.
Since the current method of measuring problematic internet use requires complex expert evaluation, the goal is to identify it through easily obtainable physical activity indicators instead.
This competition is hosted by the Child Mind Institute and sponsored by Dell Technologies and NVIDIA, with a total prize pool of $60,000. The evaluation metric used is quadratic weighted kappa.

Shake-Up?

What competition organizer says:

Many participants adjusted model hyperparameters, thresholds, and random seeds to improve public leaderboard scores
Submission analysis:
- Top 10 teams on public: Average 212 submissions (median 199)
- Top 10 teams on private: Average 64 submissions (median 25)
This suggests attempts to artificially inflate scores on the public leaderboard
Overfitting almost certainly occurred

Striking differences between missing values proportion in train vs test data

Actual discussion here: Link
The missing percentage of series parquet in test: 80-85%(~60% in train)
The missing percentage of FGC features in test: 70-75%(29.8% in train)
The missing percentage of BIA features in test: 60-65%(33.7% in train)
The missing percentage of PreInt_EduHx-computerinternet_hoursday in test: 40-50%(3% in train)
Might be the reason why KNNImputer works so well on the private test set despite decreasing the CV on the train set - both with leakage and without leakage (much worse)

9th Place Sol: This is not a lottery compettions (LB Rank 9. Best notebook Private score:0.493 )

Why did we observe such dramatic drops in the Leaderboard rankings?
1. Data leakage from the hidden dataset
  - During the competition, he mentioned about "Private dataset leakage.. Cv and PL dont have any corelation."
  - "in my experience, the score above 0.47 only data leakage."
2. KNNImputer process in the shared notebooks contained a bug
  - I believe this is why everyone who made their final submission based on the shared notebook fell significantly in the rankings.
What didn't work:
1. Treating the SII as a classification problem instead of a regression problem.
2. Predicting PCIA-Total.
3. Improving the SII value through post-processing.
4. Testing various methods and optimizing threshold values for the Kappa metric.
5. Calculating the SII value separately for datasets with and without accelerometer data.
What worked:
- I followed a phased approach for missing data:
  a. Used all data (including missing SII's).
  b. Combined train and test datasets to train the model.
  c. Predicted missing values separately for train and test datasets.
  d. Dropped rows with missing data for SII or PCIA1–PCIA19.
  e. Removed columns with a very high proportion of missing values.
  f. For feature engineering, I computed mean, standard deviation, kurtosis, and skew values using a
  windowing method for both accelerometer data and columns within the same category.
  g. Conducted feature selection for an LGBM model, reducing the features from 200–300 to 50–60.
  h- Used voting and stacking ensemble techniques with LGBM (GBDT, GOSS, and DART) and CatBoost
  models.
  i- For the final results, I selected the most common label.
  - Using the Windowing method:
    - Data is divided into fixed-size intervals (windows)
    - Statistical values were calculated for each interval
  - Calculated statistical values:
    - Mean: Central tendency of the data
    - Standard deviation: Measure of data dispersion
    - Kurtosis: Measure of how peaked/flat the data distribution is
    - Skew: Measure of data distribution asymmetry
  - This calculation was applied to two types of data:
    - accelerometer data
    - columns within the same category
Highest Private Leaderboard Score: 0.493
- Unlike my selected solution, I used the PCIA7 value from the train dataset as it was.
- For the test set, I used the predictions made by the model.
- Why did I do this?
  - Because after predicting all PCIA values, I observed that the Kappa score for PCIA7 was
    significantly higher than for the others.
  - For this reason, I decided to proceed with this approach.
- However, I didn’t include this in my final submission because I noticed this three days before the deadline. -
- The two notebooks I tested showed little difference, with private leaderboard scores of 0.482 and 0.485.
- I had set a CV threshold of 0.450, so I chose not to submit these.
- Among the three notebooks I didn’t submit, one had a CV score of 0.451 and a private leaderboard score of 0.493.
Conclusion: This competition was absolutely not a matter of luck for me.

16th Place Solution

Code: Link
Main points:
1. imputation of missing values with IterativeImputer
2. feature engineering from parquet files
3. LightGBM training with custom QWK objective and metric
4. performing 10 x 10 nested cross-validation to get reliable validation scores and stable test predictions
5. performing threshold optimization only once using the overall predictions from the nested cross-validation. (GRID SEARCH)

feature engineering from parquet files

Discussion question #1 about imputation: What kind of reasoning led to filling in the missing values? Some may argue that the fact that the data is missing itself is valuable information and should not be filled in. Especially since LightGBM can train without handling missing values.
- Author's answer: Indeed, as you mentioned, I don't have a clear perspective on the reason of the effectiveness of missing value imputation either. However, I think that when the missing feature is strongly correlated with the target (in this competition, for instance, PreInt_EduHx-computerinternet_hoursday), it might have been better to impute the missing values rather than indirectly predicting the target from other features.
Discussion question #2 about number of folds: I think large numbers of folds may lead overfit to validation data especially in small data, but does the nested CV prevent this ? Why do you choose 10folds?
- Author's answer: Yes. Since no optimization was performed on the test data for each fold, I believe there is no risk of overfitting by increasing the number of folds.

10th Solution

Used only 5 features with hierarchical bayes model
code: https://www.kaggle.com/code/junpeimorioka/10th-place-5-features-hierarchical-bayes

Final Conclusion

So, most of the rankers realized that the CV-LB relationship was weak in this competition.
Indicating that we should stick on the CV score.
1. threshold optimization leading to unstable results
2. Using simpler model due to unstable lb-cv score
3. Data leakage problem
4. SII vs. PCIAT total as target label
5. Using mean, std, ... values instead of autoencoder(led to lower cv score)
6. various imputations
7. more common to get rid of optimizations for simpler model

My Solution

It was my first competition but I knew the problem of the CV-LB score in this comp.
However, I wasn't able to establish stable standard of CV just as the high solutions did.
I also extensively took the idea of high LB score solutions into account: for example, autoencoder and tabnet.
- However I found out simpler model with both of them can score up to 0.439 among my solutions which is a bronze medal score.

Persistence is very important. You should not give up unless you are forced to give up.
- Elon Musk -

[Kaggle Study] #13 Mercari Price Suggestion Challenge

dongsunseng — Fri, 6 Dec 2024 01:07:52 +0900

Twelveth competition following Youhan Lee's curriculum. Natural Language Processing competition.

First Kernel: Mercari Interactive EDA + Topic Modelling

EDA kernel with matplotlib.

Insight / Summary:

1. Log transformation on target var

Our response or target variables is the price we are suggesting to the Mercari's marketplace sellers.
The median price of all the items in the training is about $267 but given the existence of some extreme values of over $100 and the maximum at $2,009, the distribution of the variables is heavily skewed to the left.
So let's make log-transformation on the price (we added +1 to the value before the transformation to avoid zero and negative values).

plt.subplot(1, 2, 1)
(train['price']).plot.hist(bins=50, figsize=(20,10), edgecolor='white',range=[0,250])
plt.xlabel('price+', fontsize=17)
plt.ylabel('frequency', fontsize=17)
plt.tick_params(labelsize=15)
plt.title('Price Distribution - Training Set', fontsize=17)

plt.subplot(1, 2, 2)
np.log(train['price']+1).plot.hist(bins=50, figsize=(20,10), edgecolor='white')
plt.xlabel('log(price+1)', fontsize=17)
plt.ylabel('frequency', fontsize=17)
plt.tick_params(labelsize=15)
plt.title('Log(Price) Distribution - Training Set', fontsize=17)
plt.show()

2. Dealing with Item Description feature

It will be more challenging to parse through this particular item since it's unstructured data.
Does it mean a more detailed and lengthy description will result in a higher bidding price?
We will strip out all punctuations, remove some english stop words (i.e. redundant words such as "a", "the", etc.) and any other words with a length less than 3:

def wordCount(text):
    # convert to lower case and strip regex
    try:
         # convert to lower case and strip regex
        text = text.lower()
        regex = re.compile('[' +re.escape(string.punctuation) + '0-9\\r\\t\\n]')
        txt = regex.sub(" ", text)
        # tokenize
        # words = nltk.word_tokenize(clean_txt)
        # remove words in stop words
        words = [w for w in txt.split(" ") \
                 if not w in stop_words.ENGLISH_STOP_WORDS and len(w)>3]
        return len(words)
    except: 
        return 0

# add a column of word counts to both the training and test set
train['desc_len'] = train['item_description'].apply(lambda x: wordCount(x))
test['desc_len'] = test['item_description'].apply(lambda x: wordCount(x))

We also need to check if there are any missing values in the item description (4 observations don't have a description) andl remove those observations from our training set.

train.item_description.isnull().sum()
# result: 4

# remove missing values in item description
train = train[pd.notnull(train['item_description'])]

3. Pre-processing: tokenization

Most of the time, the first steps of an NLP project is to "tokenize" your documents, which main purpose is to normalize our texts.
The three fundamental stages will usually include:
- break the descriptions into sentences and then break the sentences into tokens
- remove punctuation and stop words
- lowercase the tokens
- herein, I will also only consider words that have length equal to or greater than 3 characters

stop = set(stopwords.words('english'))
def tokenize(text):
    """
    sent_tokenize(): segment text into sentences
    word_tokenize(): break sentences into words
    """
    try: 
        regex = re.compile('[' +re.escape(string.punctuation) + '0-9\\r\\t\\n]')
        text = regex.sub(" ", text) # remove punctuation
        
        tokens_ = [word_tokenize(s) for s in sent_tokenize(text)]
        tokens = []
        for token_by_sent in tokens_:
            tokens += token_by_sent
        tokens = list(filter(lambda t: t.lower() not in stop, tokens))
        filtered_tokens = [w for w in tokens if re.search('[a-zA-Z]', w)]
        filtered_tokens = [w.lower() for w in filtered_tokens if len(w)>=3]
        
        return filtered_tokens
            
    except TypeError as e: print(text,e)

# apply the tokenizer into the item descriptipn column
train['tokens'] = train['item_description'].map(tokenize)
test['tokens'] = test['item_description'].map(tokenize)

train.reset_index(drop=True, inplace=True)
test.reset_index(drop=True, inplace=True)

4. WordCloud package

We could aso use the package WordCloud to easily visualize which words has the highest frequencies within each category:

# build dictionary with key=category and values as all the descriptions related.
cat_desc = dict()
for cat in general_cats: 
    text = " ".join(train.loc[train['general_cat']==cat, 'item_description'].values)
    cat_desc[cat] = tokenize(text)


# find the most common words for the top 4 categories
women100 = Counter(cat_desc['Women']).most_common(100)
beauty100 = Counter(cat_desc['Beauty']).most_common(100)
kids100 = Counter(cat_desc['Kids']).most_common(100)
electronics100 = Counter(cat_desc['Electronics']).most_common(100)

def generate_wordcloud(tup):
    wordcloud = WordCloud(background_color='white',
                          max_words=50, max_font_size=40,
                          random_state=42
                         ).generate(str(tup))
    return wordcloud

fig,axes = plt.subplots(2, 2, figsize=(30, 15))

ax = axes[0, 0]
ax.imshow(generate_wordcloud(women100), interpolation="bilinear")
ax.axis('off')
ax.set_title("Women Top 100", fontsize=30)

ax = axes[0, 1]
ax.imshow(generate_wordcloud(beauty100))
ax.axis('off')
ax.set_title("Beauty Top 100", fontsize=30)

ax = axes[1, 0]
ax.imshow(generate_wordcloud(kids100))
ax.axis('off')
ax.set_title("Kids Top 100", fontsize=30)

ax = axes[1, 1]
ax.imshow(generate_wordcloud(electronics100))
ax.axis('off')
ax.set_title("Electronic Top 100", fontsize=30)

5. Pre-processing: tf-idf

tf-idf is the acronym for Term Frequency-Inverse Document Frequency.
It quantifies the importance of a particular word in relative to the vocabulary of a collection of documents or corpus.
The metric depends on two factors:
- Term Frequency: the occurences of a word in a given document (i.e. bag of words)
- Inverse Document Frequency: the reciprocal number of times a word occurs in a corpus of documents
Think about of it this way: If the word is used extensively in all documents, its existence within a specific document will not be able to provide us much specific information about the document itself.
So the second term could be seen as a penalty term that penalizes common words such as "a", "the", "and", etc. tf-idf can therefore, be seen as a weighting scheme for words relevancy in a specific document.

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(min_df=10,
                             max_features=180000,
                             tokenizer=tokenize,
                             ngram_range=(1, 2))
                             
all_desc = np.append(train['item_description'].values, test['item_description'].values)
vz = vectorizer.fit_transform(list(all_desc))

vz is a tfidf matrix where:
- the number of rows is the total number of descriptions
- the number of columns is the total number of unique tokens across the descriptions
Given the high dimension of our tfidf matrix, we need to reduce their dimension using the Singular Value Decomposition (SVD) technique.
And to visualize our vocabulary, we could next use t-SNE to reduce the dimension from 50 to 2.
t-SNE is more suitable for dimensionality reduction to 2 or 3.

SVD (Singular Value Decomposition):

Linear dimensionality reduction
Preserves major patterns in data
Relatively fast computation
Suitable for reduction to larger dimensions (e.g., 50 dimensions)

t-SNE (t-Distributed Stochastic Neighbor Embedding):

Non-linear dimensionality reduction
Preserves similarity relationships between data points
Optimized for visualization (2-3 dimensions)
Effectively reveals cluster structures

6. t-Distributed Stochastic Neighbor Embedding (t-SNE)

t-SNE is a technique for dimensionality reduction that is particularly well suited for the visualization of high-dimensional datasets.
The goal is to take a set of points in a high-dimensional space and find a representation of those points in a lower-dimensional space, typically the 2D plane.
It is based on probability distributions with random walk on neighborhood graphs to find the structure within the data.
But since t-SNE complexity is significantly high, usually we'd use other high-dimension reduction techniques before applying t-SNE.
First, let's take a sample from the both training and testing item's description since t-SNE can take a very long time to execute.
We can then reduce the dimension of each vector from to n_components (50) using SVD.

trn = train.copy()
tst = test.copy()
trn['is_train'] = 1
tst['is_train'] = 0

sample_sz = 15000

combined_df = pd.concat([trn, tst])
combined_sample = combined_df.sample(n=sample_sz)
vz_sample = vectorizer.fit_transform(list(combined_sample['item_description']))

from sklearn.decomposition import TruncatedSVD

n_comp=30
svd = TruncatedSVD(n_components=n_comp, random_state=42)
svd_tfidf = svd.fit_transform(vz_sample)

# Dimension from 50 to 2
from sklearn.manifold import TSNE
tsne_model = TSNE(n_components=2, verbose=1, random_state=42, n_iter=500)

tsne_tfidf = tsne_model.fit_transform(svd_tfidf)

7. tf-idf clustering of the item description

plot_tfidf.scatter(x='x', y='y', source=tfidf_df, alpha=0.7)
hover = plot_tfidf.select(dict(type=HoverTool))
hover.tooltips={"description": "@description", "tokens": "@tokens", "category":"@category"}
show(plot_tfidf)

8. K-Means Clustering

K-means clustering objective is to minimize the average squared Euclidean distance of the document / description from their cluster centroids.

from sklearn.cluster import MiniBatchKMeans

num_clusters = 30 # need to be selected wisely
kmeans_model = MiniBatchKMeans(n_clusters=num_clusters,
                               init='k-means++',
                               n_init=1,
                               init_size=1000, batch_size=1000, verbose=0, max_iter=1000)

kmeans = kmeans_model.fit(vz)
kmeans_clusters = kmeans.predict(vz)
kmeans_distances = kmeans.transform(vz)

sorted_centroids = kmeans.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()

for i in range(num_clusters):
    print("Cluster %d:" % i)
    aux = ''
    for j in sorted_centroids[i, :10]:
        aux += terms[j] + ' | '
    print(aux)
    print()

In order to plot these clusters, first we will need to reduce the dimension of the distances to 2 using tsne:

# repeat the same steps for the sample
kmeans = kmeans_model.fit(vz_sample)
kmeans_clusters = kmeans.predict(vz_sample)
kmeans_distances = kmeans.transform(vz_sample)
# reduce dimension to 2 using tsne
tsne_kmeans = tsne_model.fit_transform(kmeans_distances)

#combined_sample.reset_index(drop=True, inplace=True)
kmeans_df = pd.DataFrame(tsne_kmeans, columns=['x', 'y'])
kmeans_df['cluster'] = kmeans_clusters
kmeans_df['description'] = combined_sample['item_description']
kmeans_df['category'] = combined_sample['general_cat']
#kmeans_df['cluster']=kmeans_df.cluster.astype(str).astype('category')

plot_kmeans = bp.figure(plot_width=700, plot_height=600,
                        title="KMeans clustering of the description",
    tools="pan,wheel_zoom,box_zoom,reset,hover,previewsave",
    x_axis_type=None, y_axis_type=None, min_border=1)

source = ColumnDataSource(data=dict(x=kmeans_df['x'], y=kmeans_df['y'],
                                    color=colormap[kmeans_clusters],
                                    description=kmeans_df['description'],
                                    category=kmeans_df['category'],
                                    cluster=kmeans_df['cluster']))

plot_kmeans.scatter(x='x', y='y', color='color', source=source)
hover = plot_kmeans.select(dict(type=HoverTool))
hover.tooltips={"description": "@description", "category": "@category", "cluster":"@cluster" }
show(plot_kmeans)

9. Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA) is an algorithms used to discover the topics that are present in a corpus.
LDA starts from a fixed number of topics.
Each topic is represented as a distribution over words, and each document is then represented as a distribution over topics.
Although the tokens themselves are meaningless, the probability distributions over words provided by the topics provide a sense of the different ideas contained in the documents.
Its input is a bag of words, i.e. each document represented as a row, with each columns containing the count of words in the corpus.

cvectorizer = CountVectorizer(min_df=4,
                              max_features=180000,
                              tokenizer=tokenize,
                              ngram_range=(1,2))
                              
cvz = cvectorizer.fit_transform(combined_sample['item_description'])

lda_model = LatentDirichletAllocation(n_components=20,
                                      learning_method='online',
                                      max_iter=20,
                                      random_state=42)
                                      
X_topics = lda_model.fit_transform(cvz)

n_top_words = 10
topic_summaries = []

topic_word = lda_model.components_  # get the topic words
vocab = cvectorizer.get_feature_names()

for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
    topic_summaries.append(' '.join(topic_words))
    print('Topic {}: {}'.format(i, ' | '.join(topic_words)))

# reduce dimension to 2 using tsne
tsne_lda = tsne_model.fit_transform(X_topics)

unnormalized = np.matrix(X_topics)
doc_topic = unnormalized/unnormalized.sum(axis=1)

lda_keys = []
for i, tweet in enumerate(combined_sample['item_description']):
    lda_keys += [doc_topic[i].argmax()]

lda_df = pd.DataFrame(tsne_lda, columns=['x','y'])
lda_df['description'] = combined_sample['item_description']
lda_df['category'] = combined_sample['general_cat']
lda_df['topic'] = lda_keys
lda_df['topic'] = lda_df['topic'].map(int)

plot_lda = bp.figure(plot_width=700,
                     plot_height=600,
                     title="LDA topic visualization",
    tools="pan,wheel_zoom,box_zoom,reset,hover,previewsave",
    x_axis_type=None, y_axis_type=None, min_border=1)

source = ColumnDataSource(data=dict(x=lda_df['x'], y=lda_df['y'],
                                    color=colormap[lda_keys],
                                    description=lda_df['description'],
                                    topic=lda_df['topic'],
                                    category=lda_df['category']))

plot_lda.scatter(source=source, x='x', y='y', color='color')
hover = plot_kmeans.select(dict(type=HoverTool))
hover = plot_lda.select(dict(type=HoverTool))
hover.tooltips={"description":"@description",
                "topic":"@topic", "category":"@category"}

pyLDAvis is a powerful tool that gives us an interactive visualization for LDA.
It's a shame that by putting the HTML of the visualization using pyLDAvis, it will distort the layout of the kernel, I won't upload in here.
But if you follow the below code, there should be an HTML file generated with very interesting interactive bubble chart that visualizes the space of your topic clusters and the term components within each topic.

def prepareLDAData():
    data = {
        'vocab': vocab,
        'doc_topic_dists': doc_topic,
        'doc_lengths': list(lda_df['len_docs']),
        'term_frequency':cvectorizer.vocabulary_,
        'topic_term_dists': lda_model.components_
    } 
    return data
    
import pyLDAvis

lda_df['len_docs'] = combined_sample['tokens'].map(len)
ldadata = prepareLDAData()
pyLDAvis.enable_notebook()
prepared_data = pyLDAvis.prepare(**ldadata)

Second Kernel: A simple nn solution with Keras (~0.48611 PL)

Kernel using neural network for modeling.

Insight / Summary:

1. Metric

def rmsle(y, y_pred):
    assert len(y) == len(y_pred)
    to_sum = [(math.log(y_pred[i] + 1) - math.log(y[i] + 1)) ** 2.0 for i,pred in enumerate(y_pred)]
    return (sum(to_sum) * (1.0/len(y))) ** 0.5
#Source: https://www.kaggle.com/marknagelberg/rmsle-function

2. Missing value

#HANDLE MISSING VALUES
print("Handling missing values...")
def handle_missing(dataset):
    dataset.category_name.fillna(value="missing", inplace=True)
    dataset.brand_name.fillna(value="missing", inplace=True)
    dataset.item_description.fillna(value="missing", inplace=True)
    return (dataset)

train = handle_missing(train)
test = handle_missing(test)

3. Categorical data - label encoding

#PROCESS CATEGORICAL DATA
print("Handling categorical variables...")
le = LabelEncoder()

le.fit(np.hstack([train.category_name, test.category_name]))
train.category_name = le.transform(train.category_name)
test.category_name = le.transform(test.category_name)

le.fit(np.hstack([train.brand_name, test.brand_name]))
train.brand_name = le.transform(train.brand_name)
test.brand_name = le.transform(test.brand_name)
del le

train.head(3)

4. raw text - tokenization

#PROCESS TEXT: RAW
print("Text to seq process...")
from keras.preprocessing.text import Tokenizer
raw_text = np.hstack([train.item_description.str.lower(), train.name.str.lower()])

print("   Fitting tokenizer...")
tok_raw = Tokenizer()
tok_raw.fit_on_texts(raw_text)
print("   Transforming text to seq...")

train["seq_item_description"] = tok_raw.texts_to_sequences(train.item_description.str.lower())
test["seq_item_description"] = tok_raw.texts_to_sequences(test.item_description.str.lower())
train["seq_name"] = tok_raw.texts_to_sequences(train.name.str.lower())
test["seq_name"] = tok_raw.texts_to_sequences(test.name.str.lower())

train.head(3)

5. Scaling target variable

#SCALE target variable
train["target"] = np.log(train.price+1)
target_scaler = MinMaxScaler(feature_range=(-1, 1))
train["target"] = target_scaler.fit_transform(train.target.reshape(-1,1))
pd.DataFrame(train.target).hist()

6. Modeling GRU NN

1) Finding max values for NN

#EMBEDDINGS MAX VALUE
#Base on the histograms, we select the next lengths
MAX_NAME_SEQ = 10
MAX_ITEM_DESC_SEQ = 75
MAX_TEXT = np.max([np.max(train.seq_name.max())
                   , np.max(test.seq_name.max())
                  , np.max(train.seq_item_description.max())
                  , np.max(test.seq_item_description.max())])+2
MAX_CATEGORY = np.max([train.category_name.max(), test.category_name.max()])+1
MAX_BRAND = np.max([train.brand_name.max(), test.brand_name.max()])+1
MAX_CONDITION = np.max([train.item_condition_id.max(), test.item_condition_id.max()])+1

2) Actual modeling

#KERAS MODEL DEFINITION
from keras.layers import Input, Dropout, Dense, BatchNormalization, Activation, concatenate, GRU, Embedding, Flatten, BatchNormalization
from keras.models import Model
from keras.callbacks import ModelCheckpoint, Callback, EarlyStopping
from keras import backend as K

def get_callbacks(filepath, patience=2):
    es = EarlyStopping('val_loss', patience=patience, mode="min")
    msave = ModelCheckpoint(filepath, save_best_only=True)
    return [es, msave]

def rmsle_cust(y_true, y_pred):
    first_log = K.log(K.clip(y_pred, K.epsilon(), None) + 1.)
    second_log = K.log(K.clip(y_true, K.epsilon(), None) + 1.)
    return K.sqrt(K.mean(K.square(first_log - second_log), axis=-1))

def get_model():
    #params
    dr_r = 0.1
    
    #Inputs
    name = Input(shape=[X_train["name"].shape[1]], name="name")
    item_desc = Input(shape=[X_train["item_desc"].shape[1]], name="item_desc")
    brand_name = Input(shape=[1], name="brand_name")
    category_name = Input(shape=[1], name="category_name")
    item_condition = Input(shape=[1], name="item_condition")
    num_vars = Input(shape=[X_train["num_vars"].shape[1]], name="num_vars")
    
    #Embeddings layers
    emb_name = Embedding(MAX_TEXT, 50)(name)
    emb_item_desc = Embedding(MAX_TEXT, 50)(item_desc)
    emb_brand_name = Embedding(MAX_BRAND, 10)(brand_name)
    emb_category_name = Embedding(MAX_CATEGORY, 10)(category_name)
    emb_item_condition = Embedding(MAX_CONDITION, 5)(item_condition)
    
    #rnn layer
    rnn_layer1 = GRU(16) (emb_item_desc)
    rnn_layer2 = GRU(8) (emb_name)
    
    #main layer
    main_l = concatenate([
        Flatten() (emb_brand_name)
        , Flatten() (emb_category_name)
        , Flatten() (emb_item_condition)
        , rnn_layer1
        , rnn_layer2
        , num_vars
    ])
    main_l = Dropout(dr_r) (Dense(128) (main_l))
    main_l = Dropout(dr_r) (Dense(64) (main_l))
    
    #output
    output = Dense(1, activation="linear") (main_l)
    
    #model
    model = Model([name, item_desc, brand_name
                   , category_name, item_condition, num_vars], output)
    model.compile(loss="mse", optimizer="adam", metrics=["mae", rmsle_cust])
    
    return model

    
model = get_model()
model.summary()

Third Kernel: Ridge (LB 0.41943)

Mainly using Ridge model kernel.

Insight / Summary:

1. Overall Summary

This code implements a machine learning pipeline for product price prediction.
Here are the main steps:
- Data Preprocessing:
  - Remove data with zero prices
  - Clean text data including categories, brand names, product names, and descriptions
  - Fill missing brand names using the SymSpell algorithm based on similarity
  - Split categories into major/medium/minor classifications
  - Combine text data to create rich features
- Feature Engineering:
  - Vectorize text data using HashingVectorizer and CountVectorizer
  - Encode categorical variables using OneHotEncoder
  - Apply TF-IDF transformation to reflect text feature importance
  - Select only features common to both training and test sets
- Modeling:
  - Use Ridge regression for price prediction
  - Use log-transformed prices as targets
  - Generate final prices through exponential transformation of predictions
The main characteristics of this implementation are:
- Focus on Text Data Processing:
  - Use HashingVectorizer for memory-efficient handling of large vocabularies
  - Extract context information through n-gram based features
  - Reflect word importance through TF-IDF transformation
- Efficient Memory Management:
  - Immediately release unnecessary data from memory
  - Optimize memory usage with HashingVectorizer
  - Maintain only features common to training/test sets
- Robust Preprocessing:
  - Intelligent filling of missing brand names
  - Hierarchical use of category information
  - Text data normalization and combination
- Scalable Structure:
  - Modularization using Pipeline and FeatureUnion
  - Flexibility through custom transformer classes
  - Support for multiprocessing

2. Code Details

# Import essential libraries
import multiprocessing as mp  # Library for parallel processing
import pandas as pd  # pandas for data processing 
from time import time  # For measuring execution time
from scipy.sparse import csr_matrix  # For sparse matrix operations
import os  # For OS related functionality
from sklearn.linear_model import Ridge  # Ridge regression model
from sklearn.pipeline import FeatureUnion, Pipeline  # For feature processing pipeline
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer, TfidfTransformer  # Text processing
from sklearn.metrics import mean_squared_log_error  # Evaluation metric
from sklearn.preprocessing import OneHotEncoder  # For encoding categorical variables
import numpy as np  # For numerical operations
import gc  # Garbage collection
from sklearn.base import BaseEstimator, TransformerMixin  # For creating custom transformers
import re  # Regular expressions
from pandas.api.types import is_numeric_dtype, is_categorical_dtype  # For checking data types

# Multithreading configuration
os.environ['MKL_NUM_THREADS'] = '4'  # Limit Intel Math Kernel Library threads
os.environ['OMP_NUM_THREADS'] = '4'  # Limit OpenMP threads
os.environ['JOBLIB_START_METHOD'] = 'forkserver'  # Set joblib parallel processing method

# Set input data path
INPUT_PATH = r'../input'

# Function to calculate Damerau-Levenshtein distance
def dameraulevenshtein(seq1, seq2):
   """Calculate the Damerau-Levenshtein distance between sequences.

    This method has not been modified from the original.
    Source: http://mwh.geek.nz/2009/04/26/python-damerau-levenshtein-distance/

    This distance is the number of additions, deletions, substitutions,
    and transpositions needed to transform the first sequence into the
    second. Although generally used with strings, any sequences of
    comparable objects will work.

    Transpositions are exchanges of *consecutive* characters; all other
    operations are self-explanatory.

    This implementation is O(N*M) time and O(M) space, for N and M the
    lengths of the two sequences.

    >>> dameraulevenshtein('ba', 'abc')
    2
    >>> dameraulevenshtein('fee', 'deed')
    2

    It works with arbitrary sequences too:
    >>> dameraulevenshtein('abcd', ['b', 'a', 'c', 'd', 'e'])
    2
    """
    # codesnippet:D0DE4716-B6E6-4161-9219-2903BF8F547F
    # Conceptually, this is based on a len(seq1) + 1 * len(seq2) + 1 matrix.
    # However, only the current and two previous rows are needed at once,
    # so we only store those.
   # Implementation maintained as original using dynamic programming
   # Stores only current row and previous two rows for memory efficiency
   oneago = None
   thisrow = list(range(1, len(seq2) + 1)) + [0]
   for x in range(len(seq1)):
       twoago, oneago, thisrow = (oneago, thisrow, [0] * len(seq2) + [x + 1])
       for y in range(len(seq2)):
           delcost = oneago[y] + 1  # Deletion cost
           addcost = thisrow[y - 1] + 1  # Addition cost
           subcost = oneago[y - 1] + (seq1[x] != seq2[y])  # Substitution cost
           thisrow[y] = min(delcost, addcost, subcost)
           # Handle transpositions
           if (x > 0 and y > 0 and seq1[x] == seq2[y - 1]
                   and seq1[x - 1] == seq2[y] and seq1[x] != seq2[y]):
               thisrow[y] = min(thisrow[y], twoago[y - 2] + 1)
   return thisrow[len(seq2) - 1]
   
class SymSpell:
   """
   A class implementing the SymSpell algorithm.
   This algorithm provides an efficient method for spell correction.
   """
   
   def __init__(self, max_edit_distance=3, verbose=0):
       """
       Parameters:
       max_edit_distance: Maximum edit distance (how many edits to allow)
       verbose: Verbosity level (0: top suggestion only, 1: all suggestions with minimal distance, 2: all possible suggestions)
       """
       self.max_edit_distance = max_edit_distance
       self.verbose = verbose
       self.dictionary = {}  # Word dictionary
       self.longest_word_length = 0  # Length of longest word

   def get_deletes_list(self, w):
       """
       Generates all possible combinations of the word with characters deleted up to max_edit_distance.
       Example: "word" -> ["ord", "wrd", "wod", "wor"]
       """
       deletes = []
       queue = [w]
       for d in range(self.max_edit_distance):
           temp_queue = []
           for word in queue:
               if len(word) > 1:
                   for c in range(len(word)):
                       word_minus_c = word[:c] + word[c + 1:]
                       if word_minus_c not in deletes:
                           deletes.append(word_minus_c)
                       if word_minus_c not in temp_queue:
                           temp_queue.append(word_minus_c)
           queue = temp_queue
       return deletes

   def create_dictionary_entry(self, w):
       """
       Adds a word and its derived deletion variants to the dictionary.
       Returns:
       bool: Whether a new real word was added
       """
       new_real_word_added = False
       if w in self.dictionary:
           # If word exists, increase frequency
           self.dictionary[w] = (self.dictionary[w][0], self.dictionary[w][1] + 1)
       else:
           # Add new word
           self.dictionary[w] = ([], 1)
           self.longest_word_length = max(self.longest_word_length, len(w))
           
       if self.dictionary[w][1] == 1:
           # If this is the first occurrence of the word in the corpus
           new_real_word_added = True
           
       deletes = self.get_deletes_list(w)
       for item in deletes:
           if item in self.dictionary:
               # Add original word to the deletion's entry
               self.dictionary[item][0].append(w)
           else:
               # Add new deletion form
               self.dictionary[item] = ([w], 0)
               
       return new_real_word_added
       
def create_dictionary_from_arr(self, arr, token_pattern=r'[a-z]+'):
   """
   Creates a word dictionary from an array.
   Parameters:
   arr: Array containing words
   token_pattern: Regular expression pattern for extracting words
   Returns:
   dictionary: Generated word dictionary
   """
   total_word_count = 0  # Total words processed
   unique_word_count = 0  # Number of unique words
   
   for line in arr:
       # Split words by non-alphabetic characters
       words = re.findall(token_pattern, line.lower())
       for word in words:
           total_word_count += 1
           if self.create_dictionary_entry(word):
               unique_word_count += 1
   
   # Print processing results
   print("total words processed: %i" % total_word_count)
   print("total unique words in corpus: %i" % unique_word_count)
   print("total items in dictionary (corpus words and deletions): %i" % len(self.dictionary))
   print(" edit distance for deletions: %i" % self.max_edit_distance)
   print(" length of longest word in corpus: %i" % self.longest_word_length)
   
   return self.dictionary

def create_dictionary(self, fname):
   """
   Creates a word dictionary from a file.
   
   Parameters:
   fname: Path to the file to read
   
   Returns:
   dictionary: Generated word dictionary
   
   How it works:
   1. Reads the file line by line.
   2. Extracts words containing only alphabetic characters from each line.
   3. Converts each word to lowercase and adds it to the dictionary.
   4. Prints processing results.
   """
   total_word_count = 0      # Total number of words processed
   unique_word_count = 0     # Number of unique words
   with open(fname) as file:  # Open file with context manager
       for line in file:
           # Split words by non-alphabetic characters
           # [a-z]+ pattern finds one or more consecutive lowercase letters
           words = re.findall('[a-z]+', line.lower())
           
           for word in words:
               total_word_count += 1  # Increase total word count
               # If create_dictionary_entry returns True (new word added)
               # Increase unique word count
               if self.create_dictionary_entry(word):
                   unique_word_count += 1
                   
   # Print processing results
   print("total words processed: %i" % total_word_count)           # Total words processed
   print("total unique words in corpus: %i" % unique_word_count)   # Unique words
   print("total items in dictionary (corpus words and deletions): %i" % len(self.dictionary))  # Dictionary size
   print("  edit distance for deletions: %i" % self.max_edit_distance)  # Maximum edit distance
   print("  length of longest word in corpus: %i" % self.longest_word_length)  # Length of longest word
   
   return self.dictionary  # Return generated dictionary

def get_suggestions(self, string, silent=False):
        """return list of suggested corrections for potentially incorrectly
           spelled word"""
        if (len(string) - self.longest_word_length) > self.max_edit_distance:
            if not silent:
                print("no items in dictionary within maximum edit distance")
            return []

        suggest_dict = {}
        min_suggest_len = float('inf')

        queue = [string]
        q_dictionary = {}  # items other than string that we've checked

        while len(queue) > 0:
            q_item = queue[0]  # pop
            queue = queue[1:]

            # early exit
            if ((self.verbose < 2) and (len(suggest_dict) > 0) and
                    ((len(string) - len(q_item)) > min_suggest_len)):
                break

            # process queue item
            if (q_item in self.dictionary) and (q_item not in suggest_dict):
                if self.dictionary[q_item][1] > 0:
                    # word is in dictionary, and is a word from the corpus, and
                    # not already in suggestion list so add to suggestion
                    # dictionary, indexed by the word with value (frequency in
                    # corpus, edit distance)
                    # note q_items that are not the input string are shorter
                    # than input string since only deletes are added (unless
                    # manual dictionary corrections are added)
                    assert len(string) >= len(q_item)
                    suggest_dict[q_item] = (self.dictionary[q_item][1],
                                            len(string) - len(q_item))
                    # early exit
                    if (self.verbose < 2) and (len(string) == len(q_item)):
                        break
                    elif (len(string) - len(q_item)) < min_suggest_len:
                        min_suggest_len = len(string) - len(q_item)

                # the suggested corrections for q_item as stored in
                # dictionary (whether or not q_item itself is a valid word
                # or merely a delete) can be valid corrections
                for sc_item in self.dictionary[q_item][0]:
                    if sc_item not in suggest_dict:

                        # compute edit distance
                        # suggested items should always be longer
                        # (unless manual corrections are added)
                        assert len(sc_item) > len(q_item)

                        # q_items that are not input should be shorter
                        # than original string
                        # (unless manual corrections added)
                        assert len(q_item) <= len(string)

                        if len(q_item) == len(string):
                            assert q_item == string
                            item_dist = len(sc_item) - len(q_item)

                        # item in suggestions list should not be the same as
                        # the string itself
                        assert sc_item != string

                        # calculate edit distance using, for example,
                        # Damerau-Levenshtein distance
                        item_dist = dameraulevenshtein(sc_item, string)

                        # do not add words with greater edit distance if
                        # verbose setting not on
                        if (self.verbose < 2) and (item_dist > min_suggest_len):
                            pass
                        elif item_dist <= self.max_edit_distance:
                            assert sc_item in self.dictionary  # should already be in dictionary if in suggestion list
                            suggest_dict[sc_item] = (self.dictionary[sc_item][1], item_dist)
                            if item_dist < min_suggest_len:
                                min_suggest_len = item_dist

                        # depending on order words are processed, some words
                        # with different edit distances may be entered into
                        # suggestions; trim suggestion dictionary if verbose
                        # setting not on
                        if self.verbose < 2:
                            suggest_dict = {k: v for k, v in suggest_dict.items() if v[1] <= min_suggest_len}

            # now generate deletes (e.g. a substring of string or of a delete)
            # from the queue item
            # as additional items to check -- add to end of queue
            assert len(string) >= len(q_item)

            # do not add words with greater edit distance if verbose setting
            # is not on
            if (self.verbose < 2) and ((len(string) - len(q_item)) > min_suggest_len):
                pass
            elif (len(string) - len(q_item)) < self.max_edit_distance and len(q_item) > 1:
                for c in range(len(q_item)):  # character index
                    word_minus_c = q_item[:c] + q_item[c + 1:]
                    if word_minus_c not in q_dictionary:
                        queue.append(word_minus_c)
                        q_dictionary[word_minus_c] = None  # arbitrary value, just to identify we checked this

        # queue is now empty: convert suggestions in dictionary to
        # list for output
        if not silent and self.verbose != 0:
            print("number of possible corrections: %i" % len(suggest_dict))
            print("  edit distance for deletions: %i" % self.max_edit_distance)

        # output option 1
        # sort results by ascending order of edit distance and descending
        # order of frequency
        #     and return list of suggested word corrections only:
        # return sorted(suggest_dict, key = lambda x:
        #               (suggest_dict[x][1], -suggest_dict[x][0]))

        # output option 2
        # return list of suggestions with (correction,
        #                                  (frequency in corpus, edit distance)):
        as_list = suggest_dict.items()
        # outlist = sorted(as_list, key=lambda (term, (freq, dist)): (dist, -freq))
        outlist = sorted(as_list, key=lambda x: (x[1][1], -x[1][0]))

        if self.verbose == 0:
            return outlist[0]
        else:
            return outlist

        '''
        Option 1:
        ['file', 'five', 'fire', 'fine', ...]

        Option 2:
        [('file', (5, 0)),
         ('five', (67, 1)),
         ('fire', (54, 1)),
         ('fine', (17, 1))...]  
        '''

def best_word(self, s, silent=False):
   """
   Returns the best correction for a given word.
   Parameters:
   s: Word to check
   silent: If True, don't print progress
   Returns:
   tuple or None: (corrected word, (frequency, edit distance)) or None if failed
   """
   try:
       return self.get_suggestions(s, silent)[0]
   except:
       return None
 
class ItemSelector(BaseEstimator, TransformerMixin):
   """
   A transformer for selecting specific fields from a pandas DataFrame and converting them to appropriate format.
   This is a custom transformer for use in scikit-learn Pipelines.
   """
   def __init__(self, field, start_time=time()):
       self.field = field  # Column name to select from DataFrame
       self.start_time = start_time  # Start time for processing time measurement

   def fit(self, x, y=None):
       return self

   def transform(self, dataframe):
       """
       Selects and transforms specific fields from the DataFrame.
       - Categorical data is converted to codes
       - Numeric data is kept as is 
       - Other data is treated as text
       """
       print(f'[{time()-self.start_time}] select {self.field}')
       dt = dataframe[self.field].dtype
       if is_categorical_dtype(dt):
           return dataframe[self.field].cat.codes[:, None]
       elif is_numeric_dtype(dt):
           return dataframe[self.field][:, None]
       else:
           return dataframe[self.field]

class DropColumnsByDf(BaseEstimator, TransformerMixin):
   """
   A transformer that filters features (columns) based on document frequency
   """
   def __init__(self, min_df=1, max_df=1.0):
       """
       Parameters:
       min_df: Minimum document frequency (features below this are removed)
       max_df: Maximum document frequency ratio (features above this are removed)
       """
       self.min_df = min_df
       self.max_df = max_df

   def fit(self, X, y=None):
       """
       Calculates document frequency for given data and determines which columns to filter.
       """
       # Convert to CSC (Compressed Sparse Column) format
       m = X.tocsc()
   
       # Process minimum document frequency (min_df) condition
       # (m != 0).sum(axis=0): Calculate number of non-zero values in each column 
   	   # >= self.min_df: Check if it's greater than minimum document frequency
       # .A1: Flatten array to 1 dimension
       self.nnz_cols = ((m != 0).sum(axis=0) >= self.min_df).A1
   
       # Process maximum document frequency (max_df) condition
       if self.max_df < 1.0:
           # Calculate maximum allowed number of documents
           max_df = m.shape[0] * self.max_df
           # AND operation with maximum document frequency condition
           self.nnz_cols = self.nnz_cols & ((m != 0).sum(axis=0) <= max_df).A1
       
   	   return self

   def transform(self, X, y=None):
       """
       Selects features according to the determined filtering criteria.
       """
       m = X.tocsc()
       # Select columns according to conditions determined in fit (self.nnz_cols)
       return m[:, self.nnz_cols]
 
def get_rmsle(y_true, y_pred):
    return np.sqrt(mean_squared_log_error(np.expm1(y_true), np.expm1(y_pred)))


def split_cat(text):
    try:
        cats = text.split("/")
        return cats[0], cats[1], cats[2], cats[0] + '/' + cats[1]
    except:
        print("no category")
        return 'other', 'other', 'other', 'other/other'

# Function to fill in missing brand names
# Uses the SymSpell algorithm to find and fill brand names from product names and descriptions
# Processes single-word and multi-word brand names separately
def brands_filling(dataset):
    vc = dataset['brand_name'].value_counts()
    brands = vc[vc > 0].index
    brand_word = r"[a-z0-9*/+\-'’?!.,|&%®™ôèéü]+"

    many_w_brands = brands[brands.str.contains(' ')]
    one_w_brands = brands[~brands.str.contains(' ')]

    ss2 = SymSpell(max_edit_distance=0)
    ss2.create_dictionary_from_arr(many_w_brands, token_pattern=r'.+')

    ss1 = SymSpell(max_edit_distance=0)
    ss1.create_dictionary_from_arr(one_w_brands, token_pattern=r'.+')

    two_words_re = re.compile(r"(?=(\s[a-z0-9*/+\-'’?!.,|&%®™ôèéü]+\s[a-z0-9*/+\-'’?!.,|&%®™ôèéü]+))")

    def find_in_str_ss2(row):
        for doc_word in two_words_re.finditer(row):
            print(doc_word)
            suggestion = ss2.best_word(doc_word.group(1), silent=True)
            if suggestion is not None:
                return doc_word.group(1)
        return ''

    def find_in_list_ss1(list):
        for doc_word in list:
            suggestion = ss1.best_word(doc_word, silent=True)
            if suggestion is not None:
                return doc_word
        return ''

    def find_in_list_ss2(list):
        for doc_word in list:
            suggestion = ss2.best_word(doc_word, silent=True)
            if suggestion is not None:
                return doc_word
        return ''

    print(f"Before empty brand_name: {len(dataset[dataset['brand_name'] == ''].index)}")

    n_name = dataset[dataset['brand_name'] == '']['name'].str.findall(
        pat=r"^[a-z0-9*/+\-'’?!.,|&%®™ôèéü]+\s[a-z0-9*/+\-'’?!.,|&%®™ôèéü]+")
    dataset.loc[dataset['brand_name'] == '', 'brand_name'] = [find_in_list_ss2(row) for row in n_name]

    n_desc = dataset[dataset['brand_name'] == '']['item_description'].str.findall(
        pat=r"^[a-z0-9*/+\-'’?!.,|&%®™ôèéü]+\s[a-z0-9*/+\-'’?!.,|&%®™ôèéü]+")
    dataset.loc[dataset['brand_name'] == '', 'brand_name'] = [find_in_list_ss2(row) for row in n_desc]

    n_name = dataset[dataset['brand_name'] == '']['name'].str.findall(pat=brand_word)
    dataset.loc[dataset['brand_name'] == '', 'brand_name'] = [find_in_list_ss1(row) for row in n_name]

    desc_lower = dataset[dataset['brand_name'] == '']['item_description'].str.findall(pat=brand_word)
    dataset.loc[dataset['brand_name'] == '', 'brand_name'] = [find_in_list_ss1(row) for row in desc_lower]

    print(f"After empty brand_name: {len(dataset[dataset['brand_name'] == ''].index)}")

    del ss1, ss2
    gc.collect()


def preprocess_regex(dataset, start_time=time()):
    karats_regex = r'(\d)([\s-]?)(karat|karats|carat|carats|kt)([^\w])'
    karats_repl = r'\1k\4'

    unit_regex = r'(\d+)[\s-]([a-z]{2})(\s)'
    unit_repl = r'\1\2\3'

    dataset['name'] = dataset['name'].str.replace(karats_regex, karats_repl)
    dataset['item_description'] = dataset['item_description'].str.replace(karats_regex, karats_repl)
    print(f'[{time() - start_time}] Karats normalized.')

    dataset['name'] = dataset['name'].str.replace(unit_regex, unit_repl)
    dataset['item_description'] = dataset['item_description'].str.replace(unit_regex, unit_repl)
    print(f'[{time() - start_time}] Units glued.')


def preprocess_pandas(train, test, start_time=time()):
    train = train[train.price > 0.0].reset_index(drop=True)
    print('Train shape without zero price: ', train.shape)

    nrow_train = train.shape[0]
    y_train = np.log1p(train["price"])
    merge: pd.DataFrame = pd.concat([train, test])

    del train
    del test
    gc.collect()

    merge['has_category'] = (merge['category_name'].notnull()).astype('category')
    print(f'[{time() - start_time}] Has_category filled.')

    merge['category_name'] = merge['category_name'] \
        .fillna('other/other/other') \
        .str.lower() \
        .astype(str)
    merge['general_cat'], merge['subcat_1'], merge['subcat_2'], merge['gen_subcat1'] = \
        zip(*merge['category_name'].apply(lambda x: split_cat(x)))
    print(f'[{time() - start_time}] Split categories completed.')

    merge['has_brand'] = (merge['brand_name'].notnull()).astype('category')
    print(f'[{time() - start_time}] Has_brand filled.')

    merge['gencat_cond'] = merge['general_cat'].map(str) + '_' + merge['item_condition_id'].astype(str)
    merge['subcat_1_cond'] = merge['subcat_1'].map(str) + '_' + merge['item_condition_id'].astype(str)
    merge['subcat_2_cond'] = merge['subcat_2'].map(str) + '_' + merge['item_condition_id'].astype(str)
    print(f'[{time() - start_time}] Categories and item_condition_id concancenated.')

    merge['name'] = merge['name'] \
        .fillna('') \
        .str.lower() \
        .astype(str)
    merge['brand_name'] = merge['brand_name'] \
        .fillna('') \
        .str.lower() \
        .astype(str)
    merge['item_description'] = merge['item_description'] \
        .fillna('') \
        .str.lower() \
        .replace(to_replace='No description yet', value='')
    print(f'[{time() - start_time}] Missing filled.')

    preprocess_regex(merge, start_time)

    brands_filling(merge)
    print(f'[{time() - start_time}] Brand name filled.')

    merge['name'] = merge['name'] + ' ' + merge['brand_name']
    print(f'[{time() - start_time}] Name concancenated.')

    merge['item_description'] = merge['item_description'] \
                                + ' ' + merge['name'] \
                                + ' ' + merge['subcat_1'] \
                                + ' ' + merge['subcat_2'] \
                                + ' ' + merge['general_cat'] \
                                + ' ' + merge['brand_name']
    print(f'[{time() - start_time}] Item description concatenated.')

    merge.drop(['price', 'test_id', 'train_id'], axis=1, inplace=True)

    return merge, y_train, nrow_train


def intersect_drop_columns(train: csr_matrix, valid: csr_matrix, min_df=0):
    t = train.tocsc()
    v = valid.tocsc()
    nnz_train = ((t != 0).sum(axis=0) >= min_df).A1
    nnz_valid = ((v != 0).sum(axis=0) >= min_df).A1
    nnz_cols = nnz_train & nnz_valid
    res = t[:, nnz_cols], v[:, nnz_cols]
    return res


if __name__ == '__main__':
    mp.set_start_method('forkserver', True)

    start_time = time()

    train = pd.read_table(os.path.join(INPUT_PATH, 'train.tsv'),
                          engine='c',
                          dtype={'item_condition_id': 'category',
                                 'shipping': 'category'}
                          )
    test = pd.read_table(os.path.join(INPUT_PATH, 'test.tsv'),
                         engine='c',
                         dtype={'item_condition_id': 'category',
                                'shipping': 'category'}
                         )
    print(f'[{time() - start_time}] Finished to load data')
    print('Train shape: ', train.shape)
    print('Test shape: ', test.shape)

    submission: pd.DataFrame = test[['test_id']]

    merge, y_train, nrow_train = preprocess_pandas(train, test, start_time)

    meta_params = {'name_ngram': (1, 2),
                   'name_max_f': 75000,
                   'name_min_df': 10,

                   'category_ngram': (2, 3),
                   'category_token': '.+',
                   'category_min_df': 10,

                   'brand_min_df': 10,

                   'desc_ngram': (1, 3),
                   'desc_max_f': 150000,
                   'desc_max_df': 0.5,
                   'desc_min_df': 10}

    stopwords = frozenset(['the', 'a', 'an', 'is', 'it', 'this', ])
    # 'i', 'so', 'its', 'am', 'are'])

    vectorizer = FeatureUnion([
        ('name', Pipeline([
            ('select', ItemSelector('name', start_time=start_time)),
            ('transform', HashingVectorizer(
                ngram_range=(1, 2),
                n_features=2 ** 27,
                norm='l2',
                lowercase=False,
                stop_words=stopwords
            )),
            ('drop_cols', DropColumnsByDf(min_df=2))
        ])),
        ('category_name', Pipeline([
            ('select', ItemSelector('category_name', start_time=start_time)),
            ('transform', HashingVectorizer(
                ngram_range=(1, 1),
                token_pattern='.+',
                tokenizer=split_cat,
                n_features=2 ** 27,
                norm='l2',
                lowercase=False
            )),
            ('drop_cols', DropColumnsByDf(min_df=2))
        ])),
        ('brand_name', Pipeline([
            ('select', ItemSelector('brand_name', start_time=start_time)),
            ('transform', CountVectorizer(
                token_pattern='.+',
                min_df=2,
                lowercase=False
            )),
        ])),
        ('gencat_cond', Pipeline([
            ('select', ItemSelector('gencat_cond', start_time=start_time)),
            ('transform', CountVectorizer(
                token_pattern='.+',
                min_df=2,
                lowercase=False
            )),
        ])),
        ('subcat_1_cond', Pipeline([
            ('select', ItemSelector('subcat_1_cond', start_time=start_time)),
            ('transform', CountVectorizer(
                token_pattern='.+',
                min_df=2,
                lowercase=False
            )),
        ])),
        ('subcat_2_cond', Pipeline([
            ('select', ItemSelector('subcat_2_cond', start_time=start_time)),
            ('transform', CountVectorizer(
                token_pattern='.+',
                min_df=2,
                lowercase=False
            )),
        ])),
        ('has_brand', Pipeline([
            ('select', ItemSelector('has_brand', start_time=start_time)),
            ('ohe', OneHotEncoder())
        ])),
        ('shipping', Pipeline([
            ('select', ItemSelector('shipping', start_time=start_time)),
            ('ohe', OneHotEncoder())
        ])),
        ('item_condition_id', Pipeline([
            ('select', ItemSelector('item_condition_id', start_time=start_time)),
            ('ohe', OneHotEncoder())
        ])),
        ('item_description', Pipeline([
            ('select', ItemSelector('item_description', start_time=start_time)),
            ('hash', HashingVectorizer(
                ngram_range=(1, 3),
                n_features=2 ** 27,
                dtype=np.float32,
                norm='l2',
                lowercase=False,
                stop_words=stopwords
            )),
            ('drop_cols', DropColumnsByDf(min_df=2)),
        ]))
    ], n_jobs=1)

    sparse_merge = vectorizer.fit_transform(merge)
    print(f'[{time() - start_time}] Merge vectorized')
    print(sparse_merge.shape)

    tfidf_transformer = TfidfTransformer()

    X = tfidf_transformer.fit_transform(sparse_merge)
    print(f'[{time() - start_time}] TF/IDF completed')

    X_train = X[:nrow_train]
    print(X_train.shape)

    X_test = X[nrow_train:]
    del merge
    del sparse_merge
    del vectorizer
    del tfidf_transformer
    gc.collect()

    X_train, X_test = intersect_drop_columns(X_train, X_test, min_df=1)
    print(f'[{time() - start_time}] Drop only in train or test cols: {X_train.shape[1]}')
    gc.collect()

    ridge = Ridge(solver='auto', fit_intercept=True, alpha=0.4, max_iter=200, normalize=False, tol=0.01)
    ridge.fit(X_train, y_train)
    print(f'[{time() - start_time}] Train Ridge completed. Iterations: {ridge.n_iter_}')

    predsR = ridge.predict(X_test)
    print(f'[{time() - start_time}] Predict Ridge completed.')

    submission.loc[:, 'price'] = np.expm1(predsR)
    submission.loc[submission['price'] < 0.0, 'price'] = 0.0
    submission.to_csv("submission_ridge.csv", index=False)

3. Damerau-Levenshtein distance

Basic Concept:
- Calculates the minimum number of edits needed to transform one string into another
- It's an extension of the Levenshtein distance that adds the operation of "transposing adjacent characters"
Allowed Edit Operations:
- Character insertion
- Example: "cat" → "cart" (insert r)
- Character deletion
- Example: "cart" → "cat" (delete r)
- Character substitution
- Example: "cat" → "cut" (substitute a with u)
- Adjacent character transposition
- Example: "cloud" → "could" (swap u and l)
Use Cases:
- Spell checking
- Similar word search
- Measuring similarity between brand names or product names
- Text matching in natural language processing
Example:

# Converting "kitten" to "sitting":
# 1. k → s (substitution)
# 2. e → i (substitution)
# 3. n → ng (insertion)
# Total distance: 3

In this code, it's used to find missing brand names by calculating similarity with existing brand names.
For example, it can help correct a misspelled brand name like "Nkie" to "Nike".

4. Getting suggestion code detail

get_suggestions(self, string, silent=False) function:
- Purpose: Generates a list of correction suggestions for potentially misspelled words

Main features:
- Finds similar words to the input word in the dictionary
- Calculates edit distance (character deletion/addition/change)
- Finds and sorts all possible correction words
- Provides (frequency, edit distance) information for each suggested word
Return values:
- When verbose=0: Returns only the best suggestion
- When verbose>0: Returns all possible suggestions in the form (word, (frequency, edit distance))
best_word(self, s, silent=False) function:
- Purpose: Finds the optimal correction word for a given word

Main features:
- Calls get_suggestions function to get suggestion list
- Selects the most appropriate word (smallest edit distance and highest frequency) from the suggestion list
Return values:
- On success: (corrected word, (frequency, edit distance))
- On failure: None

5. More about Ridge Regression

Basic Concept:
- It's an advanced form of linear regression
- Created to prevent overfitting
- A regression model that uses L2 regularization
How it works:
- Basic linear regression equation: y = w₁x₁ + w₂x₂ + ... + wₙxₙ + b
- Ridge regression adds a penalty term
- Penalty term: α(w₁² + w₂² + ... + wₙ²)
- α is a hyperparameter that controls regularization strength
- Applies penalty to the sum of squared weights
Advantages:
- Can solve multicollinearity problems
- Example: Handles strongly correlated features like 'height' and 'weight' well
- Prevents overfitting and improves model generalization
- Maintains all features while adjusting their influence
Difference from Regular Linear Regression:

# Cost function for regular linear regression
Cost = Σ(y - ŷ)²
# Cost function for Ridge regression
Cost = Σ(y - ŷ)² + α(w₁² + w₂² + ... + wₙ²)

When to use:
- Data with many features
- When features have strong correlations
- When overfitting is a concern
For example, in the given code, Ridge regression is used because text vectorization creates many features that might be correlated:

ridge = Ridge(
    solver='auto',           # Automatically choose best solver
    fit_intercept=True,      # Use intercept term
    alpha=0.4,              # Regularization strength (α value)
    max_iter=200,           # Maximum iterations
    normalize=False,         # Whether to normalize
    tol=0.01                # Convergence tolerance
)

This configured Ridge regression model helps predict prices while appropriately adjusting the influence of many features.

Fourth Kernel: LGB and FM [18th Place - 0.40604]

18th place solution using lightgbm and FM_FTRL.
0.33 * FM_FTRL + 0.67 * LGB

Insight / Summary:

1. Overall Summary

WordBatch is a Python package designed for fast processing of large-scale text data.
FM_FTRL is a model that combines Factorization Machines with the Follow-The-Regularized-Leader algorithm.
- Looking at each component:
  - FM (Factorization Machines)
    - A method for modeling interactions between features in high-dimensional sparse data
    - Particularly effective in tasks like recommendation systems and click-through rate (CTR) prediction
    - Represents potential interactions between features as low-dimensional vectors
  - FTRL (Follow-The-Regularized-Leader)
    - A type of online learning algorithm
    - Effectively handles L1 and L2 regularization
    - Particularly effective in learning sparse models
      - A sparse model refers to a model where many parameters (weights) are zero
    - Memory efficient and capable of incremental learning
- Main advantages of this combination:
  - Suitable for large-scale sparse data processing
    - Data is very sparse due to numerous text and categorical variables
    - FM effectively learns feature interactions in sparse data
  - Automatically learns feature interactions
    - FM automatically learns second-order feature interactions
    - Example: Captures the impact of brand and category combinations on price
  - Enables online learning for real-time updates
    - FTRL is an online learning algorithm that is memory efficient
    - Can effectively train on large-scale datasets
  - High prediction performance
- Commonly used in tasks such as advertising click-through rate prediction, recommendation systems, and tasks requiring real-time prediction.
Feature Engineering
- Extraction of various statistical features from text data MANUALLY
- Text vectorization using WordBatch and TF-IDF
- Label encoding for categorical variables
Model Ensemble
- FM_FTRL: Factorization Machine effective for sparse data
- Ridge Regression: Basic regression for text features
- LightGBM: High-performance gradient boosting model
Ridge model?
- The Ridge model is used to perform basic regression analysis on text data (product names and descriptions)
- Basic linear regression model with L2 regularization applied.
Memory Optimization
- Use of sparse matrices
- Periodic garbage collection
- Removal of unnecessary variables
Final Prediction
- Weighted average of FM_FTRL and LightGBM predictions
- Save results after reversing log transformation

2. Code Details

# Record start time for execution time measurement
import time
start_time = time.time()

# Set submission mode (True: train on full data, False: use validation split)
SUBMIT_MODE = True

# Import required libraries
import pandas as pd
import numpy as np
import time
import gc  # For garbage collection
import string
import re

# Use NLTK stopwords
from nltk.corpus import stopwords

# Import scipy for sparse matrix handling
from scipy.sparse import csr_matrix, hstack
# Import sklearn for text vectorization
from sklearn.feature_extraction.text import TfidfVectorizer
# Import sklearn for feature selection
from sklearn.feature_selection.univariate_selection import SelectKBest, f_regression
# Import sklearn for label encoding
from sklearn.preprocessing import LabelBinarizer

# Import WordBatch related modules (for text processing)
import wordbatch
from wordbatch.extractors import WordBag
from wordbatch.models import FM_FTRL

# Import sklearn model related modules
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.naive_bayes import MultinomialNB
import lightgbm as lgb

# Define RMSE calculation function
def rmse(predicted, actual):
    """
    Calculate Root Mean Squared Error between predicted and actual values
    Args:
        predicted: array of predicted values
        actual: array of actual values
    Returns:
        RMSE value
    """
    return np.sqrt(((predicted - actual) ** 2).mean())

# Category splitting function
def split_cat(text):
    """
    Split category string into subcategories
    Args:
        text: Category string with '/' delimiter
    Returns:
        Tuple of 3 subcategories (returns 'No Label' if missing)
    """
    try:
        return text.split("/")
    except:
        return ("No Label", "No Label", "No Label")

# Define Target Encoding class
class TargetEncoder:
    """
    Class for performing target encoding on categorical variables
    Numerically encodes categories based on mean target values
    """
    def __repr__(self):
        return 'TargetEncoder'

    def __init__(self, cols, smoothing=1, min_samples_leaf=1, noise_level=0, keep_original=False):
        """
        Args:
            cols: List of columns to encode
            smoothing: Smoothing parameter
            min_samples_leaf: Minimum number of samples
            noise_level: Level of noise to add
            keep_original: Whether to keep original columns
        """
        self.cols = cols
        self.smoothing = smoothing
        self.min_samples_leaf = min_samples_leaf
        self.noise_level = noise_level
        self.keep_original = keep_original

    @staticmethod
    def add_noise(series, noise_level):
        """
        Add noise to prevent overfitting
        """
        return series * (1 + noise_level * np.random.randn(len(series)))

    def encode(self, train, test, target):
        """
        Perform target encoding on categorical columns in given dataframe
        """
        for col in self.cols:
            if self.keep_original:
                train[col + '_te'], test[col + '_te'] = self.encode_column(train[col], test[col], target)
            else:
                train[col], test[col] = self.encode_column(train[col], test[col], target)
        return train, test

    def encode_column(self, trn_series, tst_series, target):
        """
        Perform target encoding on a single column
        """
        temp = pd.concat([trn_series, target], axis=1)
        # Calculate target means
        averages = temp.groupby(by=trn_series.name)[target.name].agg(["mean", "count"])
        # Calculate smoothing
        smoothing = 1 / (1 + np.exp(-(averages["count"] - self.min_samples_leaf) / self.smoothing))
        # Calculate overall mean
        prior = target.mean()
        # Calculate smoothed means
        averages[target.name] = prior * (1 - smoothing) + averages["mean"] * smoothing
        averages.drop(['mean', 'count'], axis=1, inplace=True)
        
        # Apply encoding to train/test data
        ft_trn_series = pd.merge(
            trn_series.to_frame(trn_series.name),
            averages.reset_index().rename(columns={'index': target.name, target.name: 'average'}),
            on=trn_series.name,
            how='left')['average'].rename(trn_series.name + '_mean').fillna(prior)
        ft_trn_series.index = trn_series.index
        
        ft_tst_series = pd.merge(
            tst_series.to_frame(tst_series.name),
            averages.reset_index().rename(columns={'index': target.name, target.name: 'average'}),
            on=tst_series.name,
            how='left')['average'].rename(trn_series.name + '_mean').fillna(prior)
        ft_tst_series.index = tst_series.index
        
        return self.add_noise(ft_trn_series, self.noise_level), self.add_noise(ft_tst_series, self.noise_level)

# Numeric value processing functions
def to_number(x):
    """
    Convert string to number (limit to 100 if greater than 100)
    """
    try:
        if not x.isdigit():
            return 0
        x = int(x)
        if x > 100:
            return 100
        else:
            return x
    except:
        return 0

def sum_numbers(desc):
    """
    Calculate sum of numbers in description text
    """
    if not isinstance(desc, str):
        return 0
    try:
        return sum([to_number(s) for s in desc.split()])
    except:
        return 0

# Set regex and stopwords for text preprocessing
stopwords = {x: 1 for x in stopwords.words('english')}
non_alphanums = re.compile(u'[^A-Za-z0-9]+')
non_alphanumpunct = re.compile(u'[^A-Za-z0-9\.?!,; \(\)\[\]\'\"\$]+')
RE_PUNCTUATION = '|'.join([re.escape(x) for x in string.punctuation])

def normalize_text(text):
    """
    Text normalization:
    - Convert to lowercase
    - Remove special characters
    - Remove stopwords
    - Remove short words
    """
    return u" ".join(
        [x for x in [y for y in non_alphanums.sub(' ', text).lower().strip().split(" ")] \
         if len(x) > 1 and x not in stopwords])

def clean_name(x):
    """
    Extract first word from name
    """
    if len(x):
        x = non_alphanums.sub(' ', x).split()
        if len(x):
            return x[0].lower()
    return ''

# Load data
print('[{}] Finished defining stuff'.format(time.time() - start_time))

# Load training data
train = pd.read_table('../input/train.tsv', engine='c', 
                      dtype={'item_condition_id': 'category',
                             'shipping': 'category',
                            }, 
                     converters={'category_name': split_cat})
# Load test data
test = pd.read_table('../input/test.tsv', engine='c', 
                      dtype={'item_condition_id': 'category',
                             'shipping': 'category',
                            },
                    converters={'category_name': split_cat})
print('[{}] Finished load data'.format(time.time() - start_time))

# Add flag for train/test data distinction
train['is_train'] = 1
test['is_train'] = 0
print('[{}] Compiled train / test'.format(time.time() - start_time))
print('Train shape: ', train.shape)
print('Test shape: ', test.shape)

# Remove data with price 0
train = train[train.price != 0].reset_index(drop=True)
print('[{}] Removed zero price'.format(time.time() - start_time))
print('Train shape: ', train.shape)
print('Test shape: ', test.shape)

# Log transform target variable (price)
y = np.log1p(train['price'])
nrow_train = train.shape[0]

# Merge train/test data
merge = pd.concat([train, test])
submission = test[['test_id']]
print('[{}] Compiled merge'.format(time.time() - start_time))
print('Merge shape: ', merge.shape)

# Remove unnecessary columns and clear memory
del train
del test
merge.drop(['train_id', 'test_id', 'price'], axis=1, inplace=True)
gc.collect()
print('[{}] Garbage collection'.format(time.time() - start_time))

# Split and process categories
merge['gencat_name'] = merge['category_name'].str.get(0).replace('', 'missing').astype('category')
merge['subcat1_name'] = merge['category_name'].str.get(1).fillna('missing').astype('category')
merge['subcat2_name'] = merge['category_name'].str.get(2).fillna('missing').astype('category')
merge.drop('category_name', axis=1, inplace=True)
print('[{}] Split categories completed.'.format(time.time() - start_time))

# Handle missing values
merge['item_condition_id'] = merge['item_condition_id'].cat.add_categories(['missing']).fillna('missing')
merge['shipping'] = merge['shipping'].cat.add_categories(['missing']).fillna('missing')
merge['item_description'].fillna('missing', inplace=True)
merge['brand_name'] = merge['brand_name'].fillna('missing').astype('category')
print('[{}] Handle missing completed.'.format(time.time() - start_time))

# Start feature engineering
# Name-related features
merge['name_first'] = merge['name'].apply(clean_name)
print('[{}] FE 1/37'.format(time.time() - start_time))
merge['name_first_count'] = merge.groupby('name_first')['name_first'].transform('count')
print('[{}] FE 2/37'.format(time.time() - start_time))

# Category-related features
merge['gencat_name_count'] = merge.groupby('gencat_name')['gencat_name'].transform('count')
print('[{}] FE 3/37'.format(time.time() - start_time))
merge['subcat1_name_count'] = merge.groupby('subcat1_name')['subcat1_name'].transform('count')
print('[{}] FE 4/37'.format(time.time() - start_time))
merge['subcat2_name_count'] = merge.groupby('subcat2_name')['subcat2_name'].transform('count')
print('[{}] FE 5/37'.format(time.time() - start_time))
merge['brand_name_count'] = merge.groupby('brand_name')['brand_name'].transform('count')
print('[{}] FE 6/37'.format(time.time() - start_time))

# Text-related features
merge['NameLower'] = merge.name.str.count('[a-z]')
print('[{}] FE 7/37'.format(time.time() - start_time))
merge['DescriptionLower'] = merge.item_description.str.count('[a-z]')
print('[{}] FE 8/37'.format(time.time() - start_time))
merge['NameUpper'] = merge.name.str.count('[A-Z]')
print('[{}] FE 9/37'.format(time.time() - start_time))
merge['DescriptionUpper'] = merge.item_description.str.count('[A-Z]')
print('[{}] FE 10/37'.format(time.time() - start_time))

# Length-related features
merge['name_len'] = merge['name'].apply(lambda x: len(x))
print('[{}] FE 11/37'.format(time.time() - start_time))
merge['des_len'] = merge['item_description'].apply(lambda x: len(x))
print('[{}] FE 12/37'.format(time.time() - start_time))
merge['name_desc_len_ratio'] = merge['name_len']/merge['des_len']
print('[{}] FE 13/37'.format(time.time() - start_time))

# Word count related features
merge['desc_word_count'] = merge['item_description'].apply(lambda x: len(x.split()))
print('[{}] FE 14/37'.format(time.time() - start_time))
merge['mean_des'] = merge['item_description'].apply(lambda x: 0 if len(x) == 0 else float(len(x.split())) / len(x)) * 10
print('[{}] FE 15/37'.format(time.time() - start_time))
merge['name_word_count'] = merge['name'].apply(lambda x: len(x.split()))
print('[{}] FE 16/37'.format(time.time() - start_time))
merge['mean_name'] = merge['name'].apply(lambda x: 0 if len(x) == 0 else float(len(x.split())) / len(x)) * 10
print('[{}] FE 17/37'.format(time.time() - start_time))

# Characters per word features
merge['desc_letters_per_word'] = merge['des_len'] / merge['desc_word_count']
print('[{}] FE 18/37'.format(time.time() - start_time))
merge['name_letters_per_word'] = merge['name_len'] / merge['name_word_count']
print('[{}] FE 19/37'.format(time.time() - start_time))

# Upper/lowercase ratio features
merge['NameLowerRatio'] = merge['NameLower'] / merge['name_len']
print('[{}] FE 20/37'.format(time.time() - start_time))
merge['DescriptionLowerRatio'] = merge['DescriptionLower'] / merge['des_len']
print('[{}] FE 21/37'.format(time.time() - start_time))
merge['NameUpperRatio'] = merge['NameUpper'] / merge['name_len']
print('[{}] FE 22/37'.format(time.time() - start_time))
merge['DescriptionUpperRatio'] = merge['DescriptionUpper'] / merge['des_len']
print('[{}] FE 23/37'.format(time.time() - start_time))

# Punctuation related features
merge['NamePunctCount'] = merge.name.str.count(RE_PUNCTUATION)
print('[{}] FE 24/37'.format(time.time() - start_time))
merge['DescriptionPunctCount'] = merge.item_description.str.count(RE_PUNCTUATION)
print('[{}] FE 25/37'.format(time.time() - start_time))
merge['NamePunctCountRatio'] = merge['NamePunctCount'] / merge['name_word_count']
print('[{}] FE 26/37'.format(time.time() - start_time))
merge['DescriptionPunctCountRatio'] = merge['DescriptionPunctCount'] / merge['desc_word_count']
print('[{}] FE 27/37'.format(time.time() - start_time))

# Number related features
merge['NameDigitCount'] = merge.name.str.count('[0-9]')
print('[{}] FE 28/37'.format(time.time() - start_time))
merge['DescriptionDigitCount'] = merge.item_description.str.count('[0-9]')
print('[{}] FE 29/37'.format(time.time() - start_time))
merge['NameDigitCountRatio'] = merge['NameDigitCount'] / merge['name_word_count']
print('[{}] FE 30/37'.format(time.time() - start_time))
merge['DescriptionDigitCountRatio'] = merge['DescriptionDigitCount']/merge['desc_word_count']
print('[{}] FE 31/37'.format(time.time() - start_time))

# Stopword and special character related features
merge['stopword_ratio_desc'] = merge['item_description'].apply(lambda x: len([w for w in x.split() if w in stopwords])) / merge['desc_word_count']
print('[{}] FE 32/37'.format(time.time() - start_time))
merge['num_sum'] = merge['item_description'].apply(sum_numbers)  # Sum of numbers in description
print('[{}] FE 33/37'.format(time.time() - start_time))
merge['weird_characters_desc'] = merge['item_description'].str.count(non_alphanumpunct)  # Count of special characters
print('[{}] FE 34/37'.format(time.time() - start_time))
merge['weird_characters_name'] = merge['name'].str.count(non_alphanumpunct)
print('[{}] FE 35/37'.format(time.time() - start_time))

# Price related keyword features
merge['prices_count'] = merge['item_description'].str.count('[rm]')  # Count of price indicator characters (rm)
print('[{}] FE 36/37'.format(time.time() - start_time))
merge['price_in_name'] = merge['item_description'].str.contains('[rm]', regex=False).astype('int')  # Price indicator presence
print('[{}] FE 37/37'.format(time.time() - start_time))

# Feature normalization
cols = set(merge.columns.values)
basic_cols = {'name', 'item_condition_id', 'brand_name',
 'shipping', 'item_description', 'gencat_name',
 'subcat1_name', 'subcat2_name', 'name_first', 'is_train'}

# Separate columns to normalize and keep 
cols_to_normalize = cols - basic_cols - {'price_in_name'}
other_cols = basic_cols | {'price_in_name'}

# Perform Min-Max normalization
merge_to_normalize = merge[list(cols_to_normalize)]
merge_to_normalize = (merge_to_normalize - merge_to_normalize.mean()) / (merge_to_normalize.max() - merge_to_normalize.min())
print('[{}] FE Normalized'.format(time.time() - start_time))

# Merge normalized features and basic features
merge = merge[list(other_cols)]
merge = pd.concat([merge, merge_to_normalize], axis=1)
print('[{}] FE Merged'.format(time.time() - start_time))

# Memory cleanup
del(merge_to_normalize)
gc.collect()
print('[{}] Garbage collection'.format(time.time() - start_time))

# Split train/test data
df_test = merge.loc[merge['is_train'] == 0]
df_train = merge.loc[merge['is_train'] == 1]
del merge
gc.collect()
df_test = df_test.drop(['is_train'], axis=1)
df_train = df_train.drop(['is_train'], axis=1)

# Split validation data (if not in submit mode)
if SUBMIT_MODE:
   y_train = y
   del y
   gc.collect()
else:
   df_train, df_test, y_train, y_test = train_test_split(df_train, y, test_size=0.2, random_state=144)

print('[{}] Splitting completed.'.format(time.time() - start_time))

# Process name text using WordBatch
wb = wordbatch.WordBatch(normalize_text, extractor=(WordBag, {
   "hash_ngrams": 2,  # Use up to 2-grams
   "hash_ngrams_weights": [1.5, 1.0],  # Weights for unigram and bigram
   "hash_size": 2 ** 29,  # Hash size
   "norm": None,  # No normalization
   "tf": 'binary',  # Use binary TF
   "idf": None,  # Don't use IDF
}), procs=8)
wb.dictionary_freeze = True
X_name_train = wb.fit_transform(df_train['name'])
X_name_test = wb.transform(df_test['name'])
del(wb)

# Remove low frequency features
mask = np.where(X_name_train.getnnz(axis=0) > 3)[0]
X_name_train = X_name_train[:, mask]
X_name_test = X_name_test[:, mask]
print('[{}] Vectorize `name` completed.'.format(time.time() - start_time))

# Process item description text using WordBatch
wb = wordbatch.WordBatch(normalize_text, extractor=(WordBag, {
   "hash_ngrams": 2,
   "hash_ngrams_weights": [1.0, 1.0],
   "hash_size": 2 ** 28,
   "norm": "l2",  # Use L2 normalization
   "tf": 1.0,  # Use actual frequency
   "idf": None
}), procs=8)
wb.dictionary_freeze = True
X_description_train = wb.fit_transform(df_train['item_description'])
X_description_test = wb.transform(df_test['item_description'])
del(wb)

# Remove low frequency features
mask = np.where(X_description_train.getnnz(axis=0) > 3)[0]
X_description_train = X_description_train[:, mask]
X_description_test = X_description_test[:, mask]
print('[{}] Vectorize `item_description` completed.'.format(time.time() - start_time))

# Ridge regression modeling for description text
# Split data in half for cross validation
X_train_1, X_train_2, y_train_1, y_train_2 = train_test_split(X_description_train, y_train,
                                                             test_size=0.5,
                                                             shuffle=False)
print('[{}] Finished splitting'.format(time.time() - start_time))

# First Ridge model
model = Ridge(solver="sag", fit_intercept=True, random_state=205, alpha=3.3)
model.fit(X_train_1, y_train_1)
print('[{}] Finished to train desc ridge (1)'.format(time.time() - start_time))
desc_ridge_preds1 = model.predict(X_train_2)
desc_ridge_preds1f = model.predict(X_description_test)
print('[{}] Finished to predict desc ridge (1)'.format(time.time() - start_time))

# Second Ridge model
model = Ridge(solver="sag", fit_intercept=True, random_state=205, alpha=3.3)
model.fit(X_train_2, y_train_2)
print('[{}] Finished to train desc ridge (2)'.format(time.time() - start_time))
desc_ridge_preds2 = model.predict(X_train_1)
desc_ridge_preds2f = model.predict(X_description_test)
print('[{}] Finished to predict desc ridge (2)'.format(time.time() - start_time))

# Combine Ridge predictions
desc_ridge_preds_oof = np.concatenate((desc_ridge_preds2, desc_ridge_preds1), axis=0)
desc_ridge_preds_test = (desc_ridge_preds1f + desc_ridge_preds2f) / 2.0
print('RMSLE OOF: {}'.format(rmse(desc_ridge_preds_oof, y_train)))
if not SUBMIT_MODE:
   print('RMSLE TEST: {}'.format(rmse(desc_ridge_preds_test, y_test)))

# Ridge regression modeling for name text (same process as above)
X_train_1, X_train_2, y_train_1, y_train_2 = train_test_split(X_name_train, y_train,
                                                             test_size=0.5,
                                                             shuffle=False)
print('[{}] Finished splitting'.format(time.time() - start_time))

model = Ridge(solver="sag", fit_intercept=True, random_state=205, alpha=3.3)
model.fit(X_train_1, y_train_1)
print('[{}] Finished to train name ridge (1)'.format(time.time() - start_time))
name_ridge_preds1 = model.predict(X_train_2)
name_ridge_preds1f = model.predict(X_name_test)
print('[{}] Finished to predict name ridge (1)'.format(time.time() - start_time))

model = Ridge(solver="sag", fit_intercept=True, random_state=205, alpha=3.3)
model.fit(X_train_2, y_train_2)
print('[{}] Finished to train name ridge (2)'.format(time.time() - start_time))
name_ridge_preds2 = model.predict(X_train_1)
name_ridge_preds2f = model.predict(X_name_test)
print('[{}] Finished to predict name ridge (2)'.format(time.time() - start_time))

name_ridge_preds_oof = np.concatenate((name_ridge_preds2, name_ridge_preds1), axis=0)
name_ridge_preds_test = (name_ridge_preds1f + name_ridge_preds2f) / 2.0
print('RMSLE OOF: {}'.format(rmse(name_ridge_preds_oof, y_train)))
if not SUBMIT_MODE:
   print('RMSLE TEST: {}'.format(rmse(name_ridge_preds_test, y_test)))

# Memory cleanup
del X_train_1
del X_train_2
del y_train_1
del y_train_2
del name_ridge_preds1
del name_ridge_preds1f
del name_ridge_preds2
del name_ridge_preds2f
del desc_ridge_preds1
del desc_ridge_preds1f
del desc_ridge_preds2
del desc_ridge_preds2f
gc.collect()
print('[{}] Finished garbage collection'.format(time.time() - start_time))

# Process categorical variables
# Label encode brand names
lb = LabelBinarizer(sparse_output=True)
X_brand_train = lb.fit_transform(df_train['brand_name'])
X_brand_test = lb.transform(df_test['brand_name'])
print('[{}] Finished label binarize `brand_name`'.format(time.time() - start_time))

# Label encode categories
X_cat_train = lb.fit_transform(df_train['gencat_name'])
X_cat_test = lb.transform(df_test['gencat_name'])
X_cat1_train = lb.fit_transform(df_train['subcat1_name'])
X_cat1_test = lb.transform(df_test['subcat1_name'])
X_cat2_train = lb.fit_transform(df_train['subcat2_name'])
X_cat2_test = lb.transform(df_test['subcat2_name'])
print('[{}] Finished label binarize categories'.format(time.time() - start_time))

# Create dummy variables for numeric features
X_dummies_train = csr_matrix(
   pd.get_dummies(df_train[list(cols - (basic_cols - {'item_condition_id', 'shipping'}))],
                  sparse=True).values)
print('[{}] Create dummies completed - train'.format(time.time() - start_time))

X_dummies_test = csr_matrix(
   pd.get_dummies(df_test[list(cols - (basic_cols - {'item_condition_id', 'shipping'}))],
                  sparse=True).values)
print('[{}] Create dummies completed - test'.format(time.time() - start_time))

# Combine all feature matrices
sparse_merge_train = hstack((X_dummies_train, X_description_train, X_brand_train, X_cat_train,
                            X_cat1_train, X_cat2_train, X_name_train)).tocsr()
del X_description_train, lb, X_name_train, X_dummies_train
gc.collect()
print('[{}] Create sparse merge train completed'.format(time.time() - start_time))

sparse_merge_test = hstack((X_dummies_test, X_description_test, X_brand_test, X_cat_test,
                            X_cat1_test, X_cat2_test, X_name_test)).tocsr()
del X_description_test, X_name_test, X_dummies_test
gc.collect()
print('[{}] Create sparse merge test completed'.format(time.time() - start_time))

# Set number of iterations for FM_FTRL model training
if SUBMIT_MODE:
   iters = 3
else:
   iters = 1
   rounds = 3

# Define FM_FTRL model
model = FM_FTRL(alpha=0.035, beta=0.001, L1=0.00001, L2=0.15, D=sparse_merge_train.shape[1],
               alpha_fm=0.05, L2_fm=0.0, init_fm=0.01,
               D_fm=100, e_noise=0, iters=iters, inv_link="identity", threads=4)

# Train and predict with FM_FTRL model
if SUBMIT_MODE:
   model.fit(sparse_merge_train, y_train)
   print('[{}] Train FM completed'.format(time.time() - start_time))
   predsFM = model.predict(sparse_merge_test)
   print('[{}] Predict FM completed'.format(time.time() - start_time))
else:
   # In validation mode, repeat multiple times to check performance
   for i in range(rounds):
       model.fit(sparse_merge_train, y_train)
       predsFM = model.predict(sparse_merge_test)
       print('[{}] Iteration {}/{} -- RMSLE: {}'.format(time.time() - start_time, i + 1, rounds, rmse(predsFM, y_test)))

del model
gc.collect()
if not SUBMIT_MODE:
   print("FM_FTRL dev RMSLE:", rmse(predsFM, y_test))

# Feature selection (SelectKBest)
fselect = SelectKBest(f_regression, k=48000)
train_features = fselect.fit_transform(sparse_merge_train, y_train)
test_features = fselect.transform(sparse_merge_test)
print('[{}] Select best completed'.format(time.time() - start_time))

# Memory cleanup
del sparse_merge_train
del sparse_merge_test
gc.collect()
print('[{}] Garbage collection'.format(time.time() - start_time))

# TF-IDF vectorization (name)
tv = TfidfVectorizer(max_features=250000,
                    ngram_range=(1, 3),
                    stop_words=None)
X_name_train = tv.fit_transform(df_train['name'])
print('[{}] Finished TFIDF vectorize `name` (1/2)'.format(time.time() - start_time))
X_name_test = tv.transform(df_test['name'])
print('[{}] Finished TFIDF vectorize `name` (2/2)'.format(time.time() - start_time))

# TF-IDF vectorization (item description)
tv = TfidfVectorizer(max_features=500000,
                    ngram_range=(1, 3),
                    stop_words=None)
X_description_train = tv.fit_transform(df_train['item_description'])
print('[{}] Finished TFIDF vectorize `item_description` (1/2)'.format(time.time() - start_time))
X_description_test = tv.transform(df_test['item_description'])
print('[{}] Finished TFIDF vectorize `item_description` (2/2)'.format(time.time() - start_time))

# Prepare dataset for LightGBM model
d_train = lgb.Dataset(train_features, label=y_train)
del train_features; gc.collect()
if SUBMIT_MODE:
   watchlist = [d_train]
else:
   d_valid = lgb.Dataset(test_features, label=y_test)
   watchlist = [d_train, d_valid]

# Set LightGBM parameters
params = {
   'learning_rate': 0.15,
   'application': 'regression',
   'max_depth': 13,
   'num_leaves': 400,
   'verbosity': -1,  # Don't print training progress
   'metric': 'RMSE',
   'data_random_seed': 1,
   'bagging_fraction': 0.8,  # Data sampling ratio for bagging
   'feature_fraction': 0.6,  # Feature ratio to use in each tree
   'nthread': 4,  # Number of CPU threads to use
   'lambda_l1': 10,  # L1 regularization
   'lambda_l2': 10   # L2 regularization
}
print('[{}] Finished compiling LGB'.format(time.time() - start_time))

# Train LightGBM model
modelL = lgb.train(params,
                 train_set=d_train,
                 num_boost_round=1350,  # Number of boosting iterations
                 valid_sets=watchlist,
                 verbose_eval=50)  # Print evaluation results every 50 iterations

# LightGBM prediction
predsL = modelL.predict(test_features)
predsL[predsL < 0] = 0  # Adjust negative predictions to 0

if not SUBMIT_MODE:
   print("LGB RMSLE:", rmse(predsL, y_test))

# Memory cleanup
del d_train
del modelL
if not SUBMIT_MODE:
   del d_valid
gc.collect()

# Combine FM_FTRL and LightGBM predictions (weighted average)
preds_final = predsFM * 0.33 + predsL * 0.67
if not SUBMIT_MODE:
   print('Final RMSE: ', rmse(preds_final, y_test))

# Save final prediction results
if SUBMIT_MODE:
   preds_final = np.expm1(preds_final)  # Reverse log transformation
   submission['price'] = preds_final
   submission.to_csv('lgb_and_fm_separate_train_test.csv', index=False)
   print('[{}] Writing submission done'.format(time.time() - start_time))

3. Combining Ridge predictions

desc_ridge_preds_oof = np.concatenate((desc_ridge_preds2, desc_ridge_preds1), axis=0)
desc_ridge_preds_test = (desc_ridge_preds1f + desc_ridge_preds2f) / 2.0

OOF Predictions:
- Each data point is predicted by a model that didn't use it for training
- Can obtain predictions for the entire training data without overfitting
- These predictions can be used as features for the next level models
Test Predictions Average:
- More stable predictions by averaging predictions from two models
- Offsets errors from individual models
- Applies the basic principle of ensemble learning
However, this part in this kernel is only used for printing sample prediction result with simple ridge model in the middle of the process. NOT FOR FINAL PREDICTION!!!

Stay focused on your goals and don't let distractions derail you from your path.
- Max Holloway -

[Kaggle Study] #15 2017 Kaggle Machine Learning & Data Science Survey

dongsunseng — Thu, 5 Dec 2024 00:57:58 +0900

Fourteenth(Last) course following Youhan Lee's curriculum. Not competition.

First Kernel: Novice to Grandmaster

The biggest problem that we might face is fake and bogus responses.
As it is a survey, not everyone will answer with proper credentials, and thus I assume that there will be a lot many outlier.

Second Kernel: What do Kagglers say about Data Science ?

EDA Kernel with trying some prediction with modeling techniques.

Insight / Summary:

1. Dimensionality reduction and 2D-plotting

The most known / used dimensionality reduction technique has to be PCA.
The problem with PCA is that it works best for numerical / continuous variables which is not the case here.
A similar technique, Multi Correspondence Analysis (MCA), is used to achieve dimensionality reduction for categorical data.
Simply put, It's a technique that use chi-2 independence tests to create a distance between row points that will be further contained in a matrix.
Each of the eigenvalues of this matrix has an inertia (similar to expressed variance for PCA) and the process to obtain the 2D visualization is the same.

### NOT WORKING ON KAGGLE SERVERS (no module prince)####
#import prince
#np.random.seed(42)
#mca = prince.MCA(data_viz, n_components=2,use_benzecri_rates=True)
#mca.plot_rows(show_points=True, show_labels=False, color_by='CompensationAmount', ellipse_fill=True)

Third Kernel: PLOTLY TUTORIAL - 1

Literally plotting plots analyzing response data using PLOTLY.

The first step is to establish that something is possible; then probability will occur.
- Elon Musk -

[Kaggle Study] #14 Toxic Comment Classification Challenge

dongsunseng — Wed, 4 Dec 2024 16:53:14 +0900

Thirteenth competition following Youhan Lee's curriculum. Natural Language Processing competition.

First Kernel: [For Beginners] Tackling Toxic Using Keras

Kernel using keras LSTM.

Insight / Summary:

1. Checking null values

train.isnull().any(),test.isnull().any()

2. Tokenization

list_tokenized_train = tokenizer.texts_to_sequences(list_sentences_train)
list_tokenized_test = tokenizer.texts_to_sequences(list_sentences_test)

Keras has turned our words into index representation for us:

[[688,
  75,
  1,
  126,
  130,
  177,
  29,
  672,
  4511,
  12052,
  1116,
  ...
  ]]

3. We have to feed a stream of data that has a consistent length(fixed number of features) -> Padding

We could make the shorter sentences as long as the others by filling the shortfall by zeros.
But on the other hand, we also have to trim the longer ones to the same length(maxlen) as the short ones.
In this case, we have set the max length to be 200.

maxlen = 200
X_t = pad_sequences(list_tokenized_train, maxlen=maxlen)
X_te = pad_sequences(list_tokenized_test, maxlen=maxlen)

How do you know what is the best "maxlen" to set?
If you put it too short, you might lose some useful feature that could cost you some accuracy points down the path.
If you put it too long, your LSTM cell will have to be larger to store the possible values or states.
One of the ways to go about it is to see the distribution of the number of words in sentences.

totalNumWords = [len(one_comment) for one_comment in list_tokenized_train]

plt.hist(totalNumWords,bins = np.arange(0,410,10))#[0,50,100,150,200,250,300,350,400])#,450,500,550,600,650,700,750,800,850,900])
plt.show()

As we can see, most of the sentence length is about 30+.
We could set the "maxlen" to about 50, but I'm being paranoid so I have set to 200.
Then again, it sounds like something you could experiment and see what is the magic number.

4. LSTM Modeling Details

Before we could pass the output to a normal layer, we need to reshape the 3D tensor into a 2D one.
We reshape carefully to avoid throwing away data that is important to us, and ideally we want the resulting data to be a good representative of the original data.
Therefore, we use a Global Max Pooling layer which is traditionally used in CNN problems to reduce the dimensionality of image data. In simple terms, we go through each patch of data, and we take the maximum values of each patch.
These collection of maximum values will be a new set of down-sized data we can use.

5. Additional tips and tricks

1) If you have hit some roadblocks, especially when it starts returning dimension related errors, a good idea is to run "model.summary()" because it lists out all your layer outputs, which is pretty useful for diagnosis.

model.summary()

2) While adding more layers, and doing more fancy transformations, it's a good idea to check if the outputs are performing as you have expected. You can reveal the output of a particular layer by:

from keras import backend as K

# with a Sequential model
get_3rd_layer_output = K.function([model.layers[0].input],
                                  [model.layers[2].output])
layer_output = get_3rd_layer_output([X_t[:1]])[0]
layer_output.shape
# print layer_output to see the actual data

# result: (1, 200, 60)

Second Kernel: Stop the S@#$ - Toxic Comments EDA

EDA Kernel.

Insight / Summary:

1. Multi-tagging

There are ~95k comments in the training dataset and there are ~21 k tags and ~86k clean comments
This is only possible when multiple tags are associated with each comment (eg) a comment can be classified as both toxic and obscene.

x=rowsums.value_counts()

#plot
plt.figure(figsize=(8,4))
ax = sns.barplot(x.index, x.values, alpha=0.8,color=color[2])
plt.title("Multiple tags per comment")
plt.ylabel('# of Occurrences', fontsize=12)
plt.xlabel('# of tags ', fontsize=12)

#adding the text labels
rects = ax.patches
labels = x.values
for rect, label in zip(rects, labels):
    height = rect.get_height()
    ax.text(rect.get_x() + rect.get_width()/2, height + 5, label, ha='center', va='bottom')

plt.show()

2. Feature Engineering

1) Direct features: Features which are a directly due to words/content.We would be exploring the following techniques

Word frequency features
- Count features
- Bigrams
- Trigrams
Vector distance mapping of words (Eg: Word2Vec)
Sentiment scores

2) Indirect features: Some more experimental features.

count of sentences
count of words
count of unique words
count of letters
count of punctuations
count of uppercase words/letters
count of stop words
Avg length of each word

3) Leaky features:

From the example, we know that the comments contain identifier information (eg: IP, username,etc.). We can create features out of them but, it will certainly lead to overfitting to this specific Wikipedia use-case.

toxic IP scores
toxic users

Note: Creating the indirect and leaky features first. There are two reasons for this:

Count features(Direct features) are useful only if they are created from a clean corpus
Also the indirect features help compensate for the loss of information when cleaning the dataset

3. Indirect features

## Indirect features

#Sentense count in each comment:
    #  '\n' can be used to count the number of sentences in each comment
df['count_sent']=df["comment_text"].apply(lambda x: len(re.findall("\n",str(x)))+1)
#Word count in each comment:
df['count_word']=df["comment_text"].apply(lambda x: len(str(x).split()))
#Unique word count
df['count_unique_word']=df["comment_text"].apply(lambda x: len(set(str(x).split())))
#Letter count
df['count_letters']=df["comment_text"].apply(lambda x: len(str(x)))
#punctuation count
df["count_punctuations"] =df["comment_text"].apply(lambda x: len([c for c in str(x) if c in string.punctuation]))
#upper case words count
df["count_words_upper"] = df["comment_text"].apply(lambda x: len([w for w in str(x).split() if w.isupper()]))
#title case words count
df["count_words_title"] = df["comment_text"].apply(lambda x: len([w for w in str(x).split() if w.istitle()]))
#Number of stopwords
df["count_stopwords"] = df["comment_text"].apply(lambda x: len([w for w in str(x).lower().split() if w in eng_stopwords]))
#Average length of the words
df["mean_word_len"] = df["comment_text"].apply(lambda x: np.mean([len(w) for w in str(x).split()]))

#derived features
#Word count percent in each comment:
df['word_unique_percent']=df['count_unique_word']*100/df['count_word']
#derived features
#Punct percent in each comment:
df['punct_percent']=df['count_punctuations']*100/df['count_word']

4. Leaky features

Caution: Even though including these features might help us perform better in this particular scenario, it will not make sence to add them in the final model/general purpose model.
Here we are creating our own custom count vectorizer to create count variables that match our regex condition.

#Leaky features
df['ip']=df["comment_text"].apply(lambda x: re.findall("\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}",str(x)))
#count of ip addresses
df['count_ip']=df["ip"].apply(lambda x: len(x))

#links
df['link']=df["comment_text"].apply(lambda x: re.findall("http://.*com",str(x)))
#count of links
df['count_links']=df["link"].apply(lambda x: len(x))

#article ids
df['article_id']=df["comment_text"].apply(lambda x: re.findall("\d:\d\d\s{0,5}$",str(x)))
df['article_id_flag']=df.article_id.apply(lambda x: len(x))

#username
##              regex for     Match anything with [[User: ---------- ]]
# regexp = re.compile("\[\[User:(.*)\|")
df['username']=df["comment_text"].apply(lambda x: re.findall("\[\[User(.*)\|",str(x)))
#count of username mentions
df['count_usernames']=df["username"].apply(lambda x: len(x))
#check if features are created
#df.username[df.count_usernames>0]

# Leaky Ip
cv = CountVectorizer()
count_feats_ip = cv.fit_transform(df["ip"].apply(lambda x : str(x)))


# Leaky usernames

cv = CountVectorizer()
count_feats_user = cv.fit_transform(df["username"].apply(lambda x : str(x)))

5. Direct Features

1) Count based features(for unigrams):

Lets create some features based on frequency distribution of the words.
Initially lets consider taking words one at a time (ie) Unigrams
Python's SKlearn provides 3 ways of creating count features.
All three of them first create a vocabulary(dictionary) of words and then create a sparse matrix of word counts for the words in the sentence that are present in the dictionary.
A brief description of them:
- CountVectorizer
  - Creates a matrix with frequency counts of each word in the text corpus
- TF-IDF Vectorizer
  - TF - Term Frequency -- Count of the words(Terms) in the text corpus (same of Count Vect)
  - IDF - Inverse Document Frequency -- Penalizes words that are too frequent. We can think of this as regularization
- HashingVectorizer
  - Creates a hashmap(word to number mapping based on hashing technique) instead of a dictionary for vocabulary
  - This enables it to be more scalable and faster for larger text coprus
  - Can be parallelized across multiple threads
- Using TF-IDF here.
- Note: Using the concatenated dataframe "merge" which contains both text from train and test dataset to ensure that the vocabulary that we create does not missout on the words that are unique to testset.

### Unigrams -- TF-IDF 
# using settings recommended here for TF-IDF -- https://www.kaggle.com/abhishek/approaching-almost-any-nlp-problem-on-kaggle

#some detailed description of the parameters
# min_df=10 --- ignore terms that appear lesser than 10 times 
# max_features=None  --- Create as many words as present in the text corpus
    # changing max_features to 10k for memmory issues
# analyzer='word'  --- Create features from words (alternatively char can also be used)
# ngram_range=(1,1)  --- Use only one word at a time (unigrams)
# strip_accents='unicode' -- removes accents
# use_idf=1,smooth_idf=1 --- enable IDF
# sublinear_tf=1   --- Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf)


#temp settings to min=200 to facilitate top features section to run in kernals
#change back to min=10 to get better results
start_unigrams=time.time()
tfv = TfidfVectorizer(min_df=200,  max_features=10000, 
            strip_accents='unicode', analyzer='word',ngram_range=(1,1),
            use_idf=1,smooth_idf=1,sublinear_tf=1,
            stop_words = 'english')
tfv.fit(clean_corpus)
features = np.array(tfv.get_feature_names())

train_unigrams =  tfv.transform(clean_corpus.iloc[:train.shape[0]])
test_unigrams = tfv.transform(clean_corpus.iloc[train.shape[0]:])

#https://buhrmann.github.io/tfidf-analysis.html
def top_tfidf_feats(row, features, top_n=25):
    ''' Get top n tfidf values in row and return them with their corresponding feature names.'''
    topn_ids = np.argsort(row)[::-1][:top_n]
    top_feats = [(features[i], row[i]) for i in topn_ids]
    df = pd.DataFrame(top_feats)
    df.columns = ['feature', 'tfidf']
    return df

def top_feats_in_doc(Xtr, features, row_id, top_n=25):
    ''' Top tfidf features in specific document (matrix row) '''
    row = np.squeeze(Xtr[row_id].toarray())
    return top_tfidf_feats(row, features, top_n)

def top_mean_feats(Xtr, features, grp_ids, min_tfidf=0.1, top_n=25):
    ''' Return the top n features that on average are most important amongst documents in rows
        indentified by indices in grp_ids. '''
    
    D = Xtr[grp_ids].toarray()

    D[D < min_tfidf] = 0
    tfidf_means = np.mean(D, axis=0)
    return top_tfidf_feats(tfidf_means, features, top_n)
    
# modified for multilabel milticlass
def top_feats_by_class(Xtr, features, min_tfidf=0.1, top_n=20):
    ''' Return a list of dfs, where each df holds top_n features and their mean tfidf value
        calculated across documents with the same class label. '''
    dfs = []
    cols=train_tags.columns
    for col in cols:
        ids = train_tags.index[train_tags[col]==1]
        feats_df = top_mean_feats(Xtr, features, ids, min_tfidf=min_tfidf, top_n=top_n)
        feats_df.label = label
        dfs.append(feats_df)
    return dfs

#get top n for unigrams
tfidf_top_n_per_lass=top_feats_by_class(train_unigrams,features)

end_unigrams=time.time()

print("total time in unigrams",end_unigrams-start_unigrams)
print("total time till unigrams",end_unigrams-start_time)

# result: total time in unigrams 85.26099634170532
#         total time till unigrams 366.4286904335022

Third Kernel: Logistic regression with words and char n-grams

Literally kernel using logistic regression for modeling with both words features and char features.

Insight / Summary:

1. Summary

This code implements a machine learning model for classifying toxic comments.
Here are its key features and operational methods:
1. Data Processing Approach:
  - Analyzes comment text at two levels (word and character)
  - Word-level analysis captures individual word meanings
  - Character-level analysis can capture typos and special expressions
2. Feature Extraction Method:
  - Uses TF-IDF (Term Frequency-Inverse Document Frequency) vectorization
  - Word features: Extracts up to 10,000 unigram features
  - Character features: Extracts up to 50,000 features from 2-6 character sequences
  - Combines both features to create rich text representation
3. Modeling Approach:
  - Creates separate binary classification models for each of 6 toxic categories
  - Uses logistic regression to predict probabilities for each category
  - Evaluates model performance using 3-fold cross-validation
  - Measures performance using ROC-AUC score
4. Optimization Considerations:
  - Sets sublinear_tf=True to reduce impact of extreme frequency values
  - Sets stop_words='english' to remove stop words
  - Prevents overfitting through L2 regularization (C=0.1)
  - Optimizes large-scale data processing using SAG (Stochastic Average Gradient) optimizer
This implementation demonstrates a practical approach to text classification, particularly effective for analyzing toxicity in comments from multiple angles.

2. Code Analysis

# Import required libraries
import numpy as np
import pandas as pd
# scikit-learn libraries for text processing and modeling
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from scipy.sparse import hstack
# Define toxic comment categories for classification
class_names = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
# Load data and handle missing values with empty spaces
train = pd.read_csv('../input/train.csv').fillna(' ')
test = pd.read_csv('../input/test.csv').fillna(' ')
# Extract comment text from training and test data
train_text = train['comment_text']
test_text = test['comment_text']
all_text = pd.concat([train_text, test_text])
# Word-level TF-IDF vectorization settings
word_vectorizer = TfidfVectorizer(
    sublinear_tf=True,      # Apply log scale to TF values
    strip_accents='unicode', # Remove accents
    analyzer='word',         # Word-level analysis
    token_pattern=r'\w{1,}', # Recognize one or more word characters as tokens
    stop_words='english',    # Remove English stop words
    ngram_range=(1, 1),     # Use single words only (unigram)
    max_features=10000)      # Use maximum of 10000 features
word_vectorizer.fit(all_text)
train_word_features = word_vectorizer.transform(train_text)
test_word_features = word_vectorizer.transform(test_text)
# Character-level TF-IDF vectorization settings
char_vectorizer = TfidfVectorizer(
    sublinear_tf=True,      # Apply log scale to TF values
    strip_accents='unicode', # Remove accents
    analyzer='char',         # Character-level analysis
    stop_words='english',    # Remove English stop words
    ngram_range=(2, 6),     # Use 2-6 character sequences
    max_features=50000)      # Use maximum of 50000 features
char_vectorizer.fit(all_text)
train_char_features = char_vectorizer.transform(train_text)
test_char_features = char_vectorizer.transform(test_text)
# Horizontally combine word and character features
train_features = hstack([train_char_features, train_word_features])
test_features = hstack([test_char_features, test_word_features])
# Train models and make predictions for each toxic category
scores = []
submission = pd.DataFrame.from_dict({'id': test['id']})
for class_name in class_names:
    # Extract target data for current category
    train_target = train[class_name]
    # Initialize logistic regression model (L2 regularization, SAG optimizer)
    classifier = LogisticRegression(C=0.1, solver='sag')
    # Calculate ROC-AUC score using 3-fold cross-validation
    cv_score = np.mean(cross_val_score(classifier, train_features, train_target, cv=3, scoring='roc_auc'))
    scores.append(cv_score)
    print('CV score for class {} is {}'.format(class_name, cv_score))
    # Train model on full training data and predict test data
    classifier.fit(train_features, train_target)
    submission[class_name] = classifier.predict_proba(test_features)[:, 1]
# Calculate average score across all categories
print('Total CV score is {}'.format(np.mean(scores)))
# Save predictions to CSV file
submission.to_csv('submission.csv', index=False)

Fourth Kernel: Classifying multi-label comments (0.9741 lb)

Also using logistic regression.

Insight / Summary:

1. Unlabelled data

As the mean values are very small (some way below 0.05), there would be many not labelled as positive in the six categories.
From this I guess that there would be many comments which are not labelled in any of the six categories.

unlabelled_in_all = train_df[(train_df['toxic']!=1) & (train_df['severe_toxic']!=1) & (train_df['obscene']!=1) & 
                            (train_df['threat']!=1) & (train_df['insult']!=1) & (train_df['identity_hate']!=1)]
print('Percentage of unlabelled comments is ', len(unlabelled_in_all)/len(train_df)*100)

# result: Percentage of unlabelled comments is  89.83211235124176

# Let's look at the character length for the rows in the training data and record these
train_df['char_length'] = train_df['comment_text'].apply(lambda x: len(str(x)))

# look at the histogram plot for text length
sns.set()
train_df['char_length'].hist()
plt.show()

Most of the text length are within 500 characters, with some up to 5,000 characters long.

2. Manually cleaning comment text

def clean_text(text):
    text = text.lower()
    text = re.sub(r"what's", "what is ", text)
    text = re.sub(r"\'s", " ", text)
    text = re.sub(r"\'ve", " have ", text)
    text = re.sub(r"can't", "cannot ", text)
    text = re.sub(r"n't", " not ", text)
    text = re.sub(r"i'm", "i am ", text)
    text = re.sub(r"\'re", " are ", text)
    text = re.sub(r"\'d", " would ", text)
    text = re.sub(r"\'ll", " will ", text)
    text = re.sub(r"\'scuse", " excuse ", text)
    text = re.sub('\W', ' ', text)
    text = re.sub('\s+', ' ', text)
    text = text.strip(' ')
    return text
    
# clean the comment_text in train_df [Thanks to Pulkit Jha for the useful pointer.]
train_df['comment_text'] = train_df['comment_text'].map(lambda com : clean_text(com))

# clean the comment_text in test_df [Thanks, Pulkit Jha.]
test_df['comment_text'] = test_df['comment_text'].map(lambda com : clean_text(com))

3. Problem Transformation

One way to approach a multi-label classification problem is to transform the problem into separate single-class classifier problems.
This is known as 'problem transformation'. There are three methods:
- Binary Relevance. This is probably the simplest which treats each label as a separate single classification problems. The key assumption here though, is that there are no correlation among the various labels.
- Classifier Chains. In this method, the first classifier is trained on the input X. Then the subsequent classifiers are trained on the input X and all previous classifiers' predictions in the chain. This method attempts to draw the signals from the correlation among preceding target variables.
- Label Powerset. This method transforms the problem into a multi-class problem where the multi-class labels are essentially all the unique label combinations. In our case here, where there are six labels, Label Powerset would in effect turn this into a 2^6 or 64-class problem. {Thanks Joshua for pointing out.}

1) Binary Relevance

# import and instantiate the Logistic Regression model
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
logreg = LogisticRegression(C=12.0)

# create submission file
submission_binary = pd.read_csv('../input/sample_submission.csv')

for label in cols_target:
    print('... Processing {}'.format(label))
    y = train_df[label]
    # train the model using X_dtm & y
    logreg.fit(X_dtm, y)
    # compute the training accuracy
    y_pred_X = logreg.predict(X_dtm)
    print('Training accuracy is {}'.format(accuracy_score(y, y_pred_X)))
    # compute the predicted probabilities for X_test_dtm
    test_y_prob = logreg.predict_proba(test_X_dtm)[:,1]
    submission_binary[label] = test_y_prob

2) Classifier Chains

# create submission file
submission_chains = pd.read_csv('../input/sample_submission.csv')

# create a function to add features
def add_feature(X, feature_to_add):
    '''
    Returns sparse feature matrix with added feature.
    feature_to_add can also be a list of features.
    '''
    from scipy.sparse import csr_matrix, hstack
    return hstack([X, csr_matrix(feature_to_add).T], 'csr')

for label in cols_target:
    print('... Processing {}'.format(label))
    y = train_df[label]
    # train the model using X_dtm & y
    logreg.fit(X_dtm,y)
    # compute the training accuracy
    y_pred_X = logreg.predict(X_dtm)
    print('Training Accuracy is {}'.format(accuracy_score(y,y_pred_X)))
    # make predictions from test_X
    test_y = logreg.predict(test_X_dtm)
    test_y_prob = logreg.predict_proba(test_X_dtm)[:,1]
    submission_chains[label] = test_y_prob
    # chain current label to X_dtm
    X_dtm = add_feature(X_dtm, y)
    print('Shape of X_dtm is now {}'.format(X_dtm.shape))
    # chain current label predictions to test_X_dtm
    test_X_dtm = add_feature(test_X_dtm, test_y)
    print('Shape of test_X_dtm is now {}'.format(test_X_dtm.shape))

The only way to fail is to never try. Take risks, learn from your mistakes, and keep pushing forward.
- Max Holloway -

[Kaggle Study] #12 Spooky Author Identification

dongsunseng — Wed, 4 Dec 2024 00:49:37 +0900

Eleventh competition following Youhan Lee's curriculum. Natural language processing competition.

Spooky Author Identification

Share code and discuss insights to identify horror authors from their writings

www.kaggle.com

First Kernel: Spooky NLP and Topic Modelling tutorial

Topic modeling: the process in which we try uncover abstract themes or "topics" based on the underlying documents and words in a corpus of text
Two standard topic modeling techniques:
- Latent Dirichlet Allocation (LDA)
- Non-negative Matrix Factorization (NMF)

Insight / Summary:

1. Top 50 (Uncleaned) Word Frequency in Training set

all_words = train['text'].str.split(expand=True).unstack().value_counts()
data = [go.Bar(
            x = all_words.index.values[2:50],
            y = all_words.values[2:50],
            marker= dict(colorscale='Jet',
                         color = all_words.values[2:100]
                        ),
            text='Word counts'
    )]

layout = go.Layout(
    title='Top 50 (Uncleaned) Word frequencies in the training dataset'
)

fig = go.Figure(data=data, layout=layout)

py.iplot(fig, filename='basic-bar')

These words are all so commonly occuring words which you could find just anywhere else. Not just in spooky stories and novels by our three authors but also in newspapers, kid book, religious texts - really almost every other english text.
Therefore we must find some way to preprocess our dataset first to strip out all these commonly occurring words which do not bring much to the table.

2. WordClouds to visualise each author's work

One very handy visualization tool for a data scientist when it comes to any sort of natural language processing is plotting "Word Cloud".
A word cloud (as the name suggests) is an image that is made up of a mixture of distinct words which may make up a text or book and where the size of each word is proportional to its word frequency in that text (number of times the word appears).
Here instead of dealing with an actual book or text, our words can simply be taken from the column "text"

1) Store the text of each author in a Python list

eap = train[train.author=="EAP"]["text"].values
hpl = train[train.author=="HPL"]["text"].values
mws = train[train.author=="MWS"]["text"].values

2) Encoding image and imported

from wordcloud import WordCloud, STOPWORDS

Generating a normal wordcloud is rather boring so I would like to introduce to you a technique of importing pictures (something relevant) and using the outline of that picture as a mask for our wordclouds.
Therefore the pictures that I have chosen are the ones I feel most representative for their authors:
- The Raven for Edgar Allen Poe
- Octopus Cthulu-thingy for HP Lovecraft
- Frankenstein for Mary Shelly
The way I am loading in the pictures on Kaggle is a sort of a feature hack although readers familiar to my work know this trick.
I first derive the Base64 encoding of whatever images I want to use and then use that particular encoding and re-convert the picture back on the notebook.

3) Decoding image using codecs module

import codecs
# Generate the Mask for EAP
f1 = open("eap.png", "wb")
f1.write(codecs.decode(eap_64,'base64'))
f1.close()
img1 = imread("eap.png")
# img = img.resize((980,1080))
hcmask = img1

f2 = open("mws.png", "wb")
f2.write(codecs.decode(mws_64,'base64'))
f2.close()
img2 = imread("mws.png")
hcmask2 = img2

f3 = open("hpl.png", "wb")
f3.write(codecs.decode(hpl_64,'base64'))
f3.close()
img3 = imread("hpl.png")
hcmask3 = img3;

4) Finally wordcloud

# The wordcloud of Cthulhu/squidy thing for HP Lovecraft
plt.figure(figsize=(16,13))
wc = WordCloud(background_color="black", max_words=10000, 
               mask=hcmask3, stopwords=STOPWORDS, max_font_size= 40)
wc.generate(" ".join(hpl))
plt.title("HP Lovecraft (Cthulhu-Squidy)", fontsize=20)
# plt.imshow(wc.recolor( colormap= 'Pastel1_r' , random_state=17), alpha=0.98)
plt.imshow(wc.recolor( colormap= 'Pastel2' , random_state=17), alpha=0.98)
plt.axis('off')

3. Text preprocessing

Tokenization - Segregation of the text into its individual constitutent words.
Stopwords - Throw away any words that occur too frequently as its frequency of occurrence will not be useful in helping detecting relevant texts. (as an aside also consider throwing away words that occur very infrequently).
Stemming - combine variants of words into a single parent word that still conveys the same meaning
Vectorization - Converting text into vector format. One of the simplest is the famous bag-of-words approach, where you create a matrix (for each document or text in the corpus). In the simplest form, this matrix stores word frequencies (word counts) and is often referred to as vectorization of the raw text.

4. Tokenization using NLTK module

The concept of tokenization is the act of taking a sequence of characters (think of Python strings) in a given document and dicing it up into its individual constituent pieces, which are the eponymous "tokens" of this method.
One could loosely think of them as singular words in a sentence. One could naively implement the "split( )" method on a string which separates it into a python list based on the identifier in the argument. It is actually not that trivial to.
Here we split the first sentence of the text in the training data just on a space as follows:

# Storing the first text element as a string
first_text = train.text.values[0]
print(first_text)
print("="*90)
print(first_text.split(" "))

However as you can see from this first attempt at tokenization, the segregation(분리) of the sentence into its individual elements (or terms) is not entirely accurate.
As an example, look at the second element of the list which contains the term "process,".
The punctuation mark (comma) has also been included and is being treated along with the word "process" as a term in itself.
Ideally we would like the comma and the word to be in two different and separate elements of the list.
Trying to do this with pure python list operations will be quite complex so this is where the NLTK library comes into play.
There is a convenient method "word_tokenize( )" (TreebankWord tokenizer) which strips out singular words as well as punctuations into separate elements automatically as follows:

first_text_list = nltk.word_tokenize(first_text)
print(first_text_list)

5. Stopword Removal

As alluded to above stopwords are generally words that appear so commonly and at such a high frequency in the corpus that they don't actually contribute much to the learning or predictive process as a learning model would fail to distinguish it from other texts.
Stopwords include terms such as "to" or "the" and therefore, it would be to our benefit to remove them during the pre-processing phase.
Conveniently, NLTK comes with a predefined list of 153 english stopwords.

stopwords = nltk.corpus.stopwords.words('english')
len(stopwords)

# result: 179

Filtering out stopwords from our tokenized list of words:

first_text_list_cleaned = [word for word in first_text_list if word.lower() not in stopwords]
print(first_text_list_cleaned)
print("="*90)
print("Length of original list: {0} words\n"
      "Length of list after stopwords removal: {1} words"
      .format(len(first_text_list), len(first_text_list_cleaned)))

6. Stemming and Lemmatization

The work at this stage attempts to reduce as many different variations of similar words into a single term ( different branches all reduced to single word stem).
Therefore if we have "running", "runs" and "run", you would really want these three distinct words to collapse into just the word "run". (However of course you lose granularity of the past, present or future tense).
We can turn to NLTK again which provides various stemmers which include variants such as the Porter stemming algorithm, the lancaster stemmer and the Snowball stemmer.
In the following example, I will create a porter stemmer instance as follows:

stemmer = nltk.stem.PorterStemmer()

print("The stemmed form of running is: {}".format(stemmer.stem("running")))
print("The stemmed form of runs is: {}".format(stemmer.stem("runs")))
print("The stemmed form of run is: {}".format(stemmer.stem("run")))

# The stemmed form of running is: run
# The stemmed form of runs is: run
# The stemmed form of run is: run

As we can see, the stemmer has successfully reduced the given words above into a base form and this will be most in helping us reduce the size of our dataset of words when we come to learning and classification tasks.
However there is one flaw with stemming and that is the fact that the process involves quite a crude heuristic in chopping off the ends of words in the hope of reducing a particular word into a human recognizable base form.
Therefore this process does not take into account vocabulary or word forms when collapsing words as this example will illustrate:

print("The stemmed form of leaves is: {}".format(stemmer.stem("leaves")))

# result: The stemmed form of leaves is: leav

Lemmatization

Therefore we turn to another that we could use in lieu of stemming.
This method is called lemmatization which aims to achieve the same effect as the former method.
However unlike a stemmer, lemmatizing the dataset aims to reduce words based on an actual dictionary or vocabulary (the Lemma) and therefore will not chop off words into stemmed forms that do not carry any lexical meaning.
Here we can utilize NLTK once again to initialize a lemmatizer (WordNet variant) and inspect how it collapses words as follows:

from nltk.stem import WordNetLemmatizer
lemm = WordNetLemmatizer()
print("The lemmatized form of leaves is: {}".format(lemm.lemmatize("leaves")))

# result: The lemmatized form of leaves is: leaf

7. Vectorizing Raw Text

In the vast collection of NLP literature, there are many different purposes for analyzing raw text, where in some cases you would like to compare the similarity of one body of text to another (Clustering techniques/Distance measurements), text classification (the purpose of this competition) as well as uncovering the topics that comprise a body of text (the aim of this notebook).
With the purpose of uncovering topics at the back of our minds we must now think of how to feed the raw text into a machine learning model.
Having already discussed tokenization, stopword removals and stemming (or maybe lemmatizing) we have now arrived at a reasonably cleaner text dataset then we started out with.
However at this juncture, our raw text though human readable is still unfortunately not yet machine readable.
A machine can read in bits and numbers and therefore we will first need to convert our text into numbers for which we utilise a very common approach known as the Bag-of-Words

The Bag of Words approach

This approach uses the counts of words as a starting block and records the occurrence of each word (from the entire text) in a vector specific to that particular word.
For example given these two sentences "I love to eat Burgers", "I love to eat Fries", we first tokenize to obtain our vocabulary of 6 words from which we can get the word counts for - [I, love, to, eat, Burgers, Fries].
Vectorizing the text via the Bag of Words approach, we get six distinct vectors one for each word.
So you ask since we now have rows consisting of numbers (instead of text) what forms the columns (or features)?
Well each word now becomes an individual feature/column in this new transformed dataset.
To illustrate this point, I shall utilize the Scikit-learn library to implement a vectorizer that generates a vector of word counts (term frequencies) - via the CountVectorizer method as follows.

# Defining our sentence
sentence = ["I love to eat Burgers", 
            "I love to eat Fries"]
vectorizer = CountVectorizer(min_df=0)
sentence_transform = vectorizer.fit_transform(sentence)

Fitting the vectorizer to the dataset

Here we initialize and create a simple term frequency object via the CountVectorizer function simply called "vectorizer".
The parameters that I have provided explicitly (the rest are left as default) are the bare minimum.
Here "min_df" in the parameter refers to the minimum document frequency and the vectorizer will simply drop all words that occur less than that value set (either integer or in fraction form).
Finally we apply the fit_transform method is actually comprised of two steps.
The first step is the fit method where the vectorizer is mapped to the dataset that you provide.
Once this is done, the actual vectorizing operation is performed via the transform method where the raw text is turned into its vector form as shown below:

print("The features are:\n {}".format(vectorizer.get_feature_names()))
print("\nThe vectorized array looks like:\n {}".format(sentence_transform.toarray()))

Sparse matrix vector ouptuts

From the output of the vectorized text, we can see that the features consist of the words in the corpus of text that we fed into the vectorizer (here the corpus being the two sentences we defined earlier).
Simply call the get_feature_names attribute from the vectorizer to inspect it.
With regards to the transformed text, one would be tempted to inspect the values by simplying calling it.
However when you try to call it you really just get a message which states "sparse matrix of type class 'numpy.int64' with 8 stored elements in Compressed Sparse Row format".
Therefore this means that the vectorizer returns the transformed raw text as a matrix where most of its values are zero or almost negligible, hence the term sparse.
Thinking about this, it does make sense that our returned matrices contain quite a high degree of sparsity due to the fact that most words in a language appear relatively infrequently in any given text.

8. Topic modeling

Latent Dirichlet Allocation - Probabilistic, generative model which uncovers the topics latent to a dataset by assigning weights to words in a corpus, where each topic will assign different probability weights to each word.
Non-negative Matrix Factorization - Approximation method that takes an input matrix and approximates the factorization of this matrix into two other matrices, with the caveat that the values in the matrix be non-negative.

When you vectorize the raw text with CountVectorizer, the dual stages of tokenizing and stopwords filtering are automatically included as a high-level component.
Here unlike the NLTK tokenizer that you were introduced to in the Section 2a earlier, Sklearn's tokenizer discards all single character terms like ('a', 'w' etc) and also lower cases all terms by default.
Filtering out stopwords in Sklearn is as convenient as passing the value 'english' into the argument "stop_words" where a built-in English stopword list is automatically used.
Unfortunately, there is no built-in lemmatizer in the vectorizer so we are left with a couple of options.
Either implementing it separately everytime before feeding the data for vectorizing or somehow extend the sklearn implementation to include this functionality.
Luckily for us, we have the latter option where we can extend the CountVectorizer class by overwriting the "build_analyzer" method as follows:

lemm = WordNetLemmatizer()
class LemmaCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        analyzer = super(LemmaCountVectorizer, self).build_analyzer()
        return lambda doc: (lemm.lemmatize(w) for w in analyzer(doc))

# Storing the entire training text in a list
text = list(train.text.values)
# Calling our overwritten Count vectorizer
tf_vectorizer = LemmaCountVectorizer(max_df=0.95, 
                                     min_df=2,
                                     stop_words='english',
                                     decode_error='ignore')
tf = tf_vectorizer.fit_transform(text)

Latent Dirichlet Allocation

There are a couple of different implements of this LDA algorithm but in this notebook, I will be using Sklearn's implementation.
Another very well-known LDA implementation is Radim Rehurek's gensim, so check it out as well.
In LDA, the modelling process revolves around three things: the text corpus, its collection of documents, D and the words W in the documents.
Therefore the algorithm attempts to uncover K topics from this corpus via the following way (illustrated by the diagram).

Model each topic, $\kappa$ via a Dirichlet prior distribution given by $\beta_{k}$:
Model each document d by another Dirichlet distribution parameterized by $\alpha$:
Subsequently for document d, we generate a topic via a multinomial distribution which we then backtrack and use to generate the correspondings words related to that topic via another multinomial distribution:
1. The LDA algorithm first models documents via a mixture model of topics.
2. From these topics, words are then assigned weights based on the probability distribution of these topics.
3. It is this probabilistic assignment over words that allow a user of LDA to say how likely a particular word falls into a topic.
4. Subsequently from the collection of words assigned to a particular topic, are we thus able to gain an insight as to what that topic may actually represent from a lexical point of view.
From a standard LDA model, there are really a few key parameters that we have to keep in mind and consider programmatically tuning before we invoke the model:
- n_components: The number of topics that you specify to the model
- $\alpha$ parameter: This is the dirichlet parameter that can be linked to the document topic prior
- $\beta$ parameter: This is the dirichlet parameter linked to the topic word prior
To invoke the algorithm, we simply create an LDA instance through the Sklearn's LatentDirichletAllocation function.
The various parameters would ideally have been obtained through some sort of validation scheme.
In this instance, the optimal value of n_components (or topic number) was found by conducting a KMeans + Latent Semantic Analysis(LSA) Scheme (as shown in this paper here) whereby the number of Kmeans clusters and number of LSA dimensions were iterated through and the best silhouette mean score.

lda = LatentDirichletAllocation(n_components=11, max_iter=5,
                                learning_method = 'online',
                                learning_offset = 50.,
                                random_state = 0)
lda.fit(tf)

Second Kernel: Approaching (Almost) Any NLP Problem on Kaggle

Trying out various modeling techniques:
- tfidf
- count features
- logistic regression
- naive bayes
- svm
- xgboost
- grid search
- word vectors
- LSTM
- GRU
- Ensembling

Insight / Summary:

1. Metric

def multiclass_logloss(actual, predicted, eps=1e-15):
    """Multi class version of Logarithmic Loss metric.
    :param actual: Array containing the actual target classes
    :param predicted: Matrix with class predictions, one probability per class
    """
    # Convert 'actual' to a binary array if it's not already:
    if len(actual.shape) == 1:
        actual2 = np.zeros((actual.shape[0], predicted.shape[1]))
        for i, val in enumerate(actual):
            actual2[i, val] = 1
        actual = actual2

    clip = np.clip(predicted, eps, 1 - eps)
    rows = actual.shape[0]
    vsota = np.sum(actual * np.log(clip))
    return -1.0 / rows * vsota

For this particular problem, Kaggle has specified multi-class log-loss as evaluation metric.

2. TF-IDF + Logistic Regression

# Always start with these features. They work (almost) everytime!
tfv = TfidfVectorizer(min_df=3,  max_features=None, 
            strip_accents='unicode', analyzer='word',token_pattern=r'\w{1,}',
            ngram_range=(1, 3), use_idf=1,smooth_idf=1,sublinear_tf=1,
            stop_words = 'english')

# Fitting TF-IDF to both training and test sets (semi-supervised learning)
tfv.fit(list(xtrain) + list(xvalid))
xtrain_tfv =  tfv.transform(xtrain) 
xvalid_tfv = tfv.transform(xvalid)

# Fitting a simple Logistic Regression on TFIDF
clf = LogisticRegression(C=1.0)
clf.fit(xtrain_tfv, ytrain)
predictions = clf.predict_proba(xvalid_tfv)

print ("logloss: %0.3f " % multiclass_logloss(yvalid, predictions))

# result: logloss: 0.626

3. Word Count as feature + Logistic Regression

ctv = CountVectorizer(analyzer='word',token_pattern=r'\w{1,}',
            ngram_range=(1, 3), stop_words = 'english')

# Fitting Count Vectorizer to both training and test sets (semi-supervised learning)
ctv.fit(list(xtrain) + list(xvalid))
xtrain_ctv =  ctv.transform(xtrain) 
xvalid_ctv = ctv.transform(xvalid)

# Fitting a simple Logistic Regression on Counts
clf = LogisticRegression(C=1.0)
clf.fit(xtrain_ctv, ytrain)
predictions = clf.predict_proba(xvalid_ctv)

print ("logloss: %0.3f " % multiclass_logloss(yvalid, predictions))

# result: logloss: 0.528

Instead of using TF-IDF, we can also use word counts as features.
This can be done easily using CountVectorizer from scikit-learn.

4. Naive Bayes + TF-IDF

# Fitting a simple Naive Bayes on TFIDF
clf = MultinomialNB()
clf.fit(xtrain_tfv, ytrain)
predictions = clf.predict_proba(xvalid_tfv)

print ("logloss: %0.3f " % multiclass_logloss(yvalid, predictions))

# result: logloss: 0.578

5. Naive Bayes + Word Count

# Fitting a simple Naive Bayes on Counts
clf = MultinomialNB()
clf.fit(xtrain_ctv, ytrain)
predictions = clf.predict_proba(xvalid_ctv)

print ("logloss: %0.3f " % multiclass_logloss(yvalid, predictions))

# result: logloss: 0.485

6. SVM + TF-IDF

Since SVMs take a lot of time, we will reduce the number of features from the TF-IDF using Singular Value Decomposition before applying SVM.
Also, note that before applying SVMs, we must standardize the data.

# Apply SVD, I chose 120 components. 120-200 components are good enough for SVM model.
svd = decomposition.TruncatedSVD(n_components=120)
svd.fit(xtrain_tfv)
xtrain_svd = svd.transform(xtrain_tfv)
xvalid_svd = svd.transform(xvalid_tfv)

# Scale the data obtained from SVD. Renaming variable to reuse without scaling.
scl = preprocessing.StandardScaler()
scl.fit(xtrain_svd)
xtrain_svd_scl = scl.transform(xtrain_svd)
xvalid_svd_scl = scl.transform(xvalid_svd)

# Fitting a simple SVM
clf = SVC(C=1.0, probability=True) # since we need probabilities
clf.fit(xtrain_svd_scl, ytrain)
predictions = clf.predict_proba(xvalid_svd_scl)

print ("logloss: %0.3f " % multiclass_logloss(yvalid, predictions))

# result: logloss: 0.741

7. XGBoost + TF-IDF

# Fitting a simple xgboost on tf-idf
clf = xgb.XGBClassifier(max_depth=7, n_estimators=200, colsample_bytree=0.8, 
                        subsample=0.8, nthread=10, learning_rate=0.1)
clf.fit(xtrain_tfv.tocsc(), ytrain)
predictions = clf.predict_proba(xvalid_tfv.tocsc())

print ("logloss: %0.3f " % multiclass_logloss(yvalid, predictions))

# result: logloss: 0.782

# Fitting a simple xgboost on tf-idf
clf = xgb.XGBClassifier(max_depth=7, n_estimators=200, colsample_bytree=0.8, 
                        subsample=0.8, nthread=10, learning_rate=0.1)
clf.fit(xtrain_ctv.tocsc(), ytrain)
predictions = clf.predict_proba(xvalid_ctv.tocsc())

print ("logloss: %0.3f " % multiclass_logloss(yvalid, predictions))

# result: logloss: 0.772

# Fitting a simple xgboost on tf-idf svd features
clf = xgb.XGBClassifier(max_depth=7, n_estimators=200, colsample_bytree=0.8, 
                        subsample=0.8, nthread=10, learning_rate=0.1)
clf.fit(xtrain_svd, ytrain)
predictions = clf.predict_proba(xvalid_svd)

print ("logloss: %0.3f " % multiclass_logloss(yvalid, predictions))

# result: logloss: 0.768

# Fitting a simple xgboost on tf-idf svd features
clf = xgb.XGBClassifier(nthread=10)
clf.fit(xtrain_svd, ytrain)
predictions = clf.predict_proba(xvalid_svd)

print ("logloss: %0.3f " % multiclass_logloss(yvalid, predictions))

# result: logloss: 0.816

8. Word Embedding - Using GloVe

# this function creates a normalized vector for the whole sentence
def sent2vec(s):
    words = str(s).lower().decode('utf-8')
    words = word_tokenize(words)
    words = [w for w in words if not w in stop_words]
    words = [w for w in words if w.isalpha()]
    M = []
    for w in words:
        try:
            M.append(embeddings_index[w])
        except:
            continue
    M = np.array(M)
    v = M.sum(axis=0)
    if type(v) != np.ndarray:
        return np.zeros(300)
    return v / np.sqrt((v ** 2).sum())

# create sentence vectors using the above function for training and validation set
xtrain_glove = [sent2vec(x) for x in tqdm(xtrain)]
xvalid_glove = [sent2vec(x) for x in tqdm(xvalid)]

xtrain_glove = np.array(xtrain_glove)
xvalid_glove = np.array(xvalid_glove)

9. Using Neural Network

# scale the data before any neural net:
scl = preprocessing.StandardScaler()
xtrain_glove_scl = scl.fit_transform(xtrain_glove)
xvalid_glove_scl = scl.transform(xvalid_glove)

# we need to binarize the labels for the neural net
ytrain_enc = np_utils.to_categorical(ytrain)
yvalid_enc = np_utils.to_categorical(yvalid)

# create a simple 3 layer sequential neural net
model = Sequential()

model.add(Dense(300, input_dim=300, activation='relu'))
model.add(Dropout(0.2))
model.add(BatchNormalization())

model.add(Dense(300, activation='relu'))
model.add(Dropout(0.3))
model.add(BatchNormalization())

model.add(Dense(3))
model.add(Activation('softmax'))

# compile the model
model.compile(loss='categorical_crossentropy', optimizer='adam')

model.fit(xtrain_glove_scl, y=ytrain_enc, batch_size=64, 
          epochs=5, verbose=1, 
          validation_data=(xvalid_glove_scl, yvalid_enc))

10. LSTM

With LSTMs we need to tokenize the text data

# using keras tokenizer here
token = text.Tokenizer(num_words=None)
max_len = 70

token.fit_on_texts(list(xtrain) + list(xvalid))
xtrain_seq = token.texts_to_sequences(xtrain)
xvalid_seq = token.texts_to_sequences(xvalid)

# zero pad the sequences
xtrain_pad = sequence.pad_sequences(xtrain_seq, maxlen=max_len)
xvalid_pad = sequence.pad_sequences(xvalid_seq, maxlen=max_len)

word_index = token.word_index

# create an embedding matrix for the words we have in the dataset
embedding_matrix = np.zeros((len(word_index) + 1, 300))
for word, i in tqdm(word_index.items()):
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

# A simple LSTM with glove embeddings and two dense layers
model = Sequential()
model.add(Embedding(len(word_index) + 1,
                     300,
                     weights=[embedding_matrix],
                     input_length=max_len,
                     trainable=False))
model.add(SpatialDropout1D(0.3))
model.add(LSTM(100, dropout=0.3, recurrent_dropout=0.3))

model.add(Dense(1024, activation='relu'))
model.add(Dropout(0.8))

model.add(Dense(1024, activation='relu'))
model.add(Dropout(0.8))

model.add(Dense(3))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')

model.fit(xtrain_pad, y=ytrain_enc, batch_size=512, epochs=100, verbose=1, validation_data=(xvalid_pad, yvalid_enc))

11. Version with early stopping

# A simple LSTM with glove embeddings and two dense layers
model = Sequential()
model.add(Embedding(len(word_index) + 1,
                     300,
                     weights=[embedding_matrix],
                     input_length=max_len,
                     trainable=False))
model.add(SpatialDropout1D(0.3))
model.add(LSTM(300, dropout=0.3, recurrent_dropout=0.3))

model.add(Dense(1024, activation='relu'))
model.add(Dropout(0.8))

model.add(Dense(1024, activation='relu'))
model.add(Dropout(0.8))

model.add(Dense(3))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')

# Fit the model with early stopping callback
earlystop = EarlyStopping(monitor='val_loss', min_delta=0, patience=3, verbose=0, mode='auto')
model.fit(xtrain_pad, y=ytrain_enc, batch_size=512, epochs=100, 
          verbose=1, validation_data=(xvalid_pad, yvalid_enc), callbacks=[earlystop])

12. Bi-directional LSTM

# A simple bidirectional LSTM with glove embeddings and two dense layers
model = Sequential()
model.add(Embedding(len(word_index) + 1,
                     300,
                     weights=[embedding_matrix],
                     input_length=max_len,
                     trainable=False))
model.add(SpatialDropout1D(0.3))
model.add(Bidirectional(LSTM(300, dropout=0.3, recurrent_dropout=0.3)))

model.add(Dense(1024, activation='relu'))
model.add(Dropout(0.8))

model.add(Dense(1024, activation='relu'))
model.add(Dropout(0.8))

model.add(Dense(3))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')

# Fit the model with early stopping callback
earlystop = EarlyStopping(monitor='val_loss', min_delta=0, patience=3, verbose=0, mode='auto')
model.fit(xtrain_pad, y=ytrain_enc, batch_size=512, epochs=100, 
          verbose=1, validation_data=(xvalid_pad, yvalid_enc), callbacks=[earlystop])

13. GRU

# GRU with glove embeddings and two dense layers
model = Sequential()
model.add(Embedding(len(word_index) + 1,
                     300,
                     weights=[embedding_matrix],
                     input_length=max_len,
                     trainable=False))
model.add(SpatialDropout1D(0.3))
model.add(GRU(300, dropout=0.3, recurrent_dropout=0.3, return_sequences=True))
model.add(GRU(300, dropout=0.3, recurrent_dropout=0.3))

model.add(Dense(1024, activation='relu'))
model.add(Dropout(0.8))

model.add(Dense(1024, activation='relu'))
model.add(Dropout(0.8))

model.add(Dense(3))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')

# Fit the model with early stopping callback
earlystop = EarlyStopping(monitor='val_loss', min_delta=0, patience=3, verbose=0, mode='auto')
model.fit(xtrain_pad, y=ytrain_enc, batch_size=512, epochs=100, 
          verbose=1, validation_data=(xvalid_pad, yvalid_enc), callbacks=[earlystop])

14. Ensemble

# this is the main ensembling class. how to use it is in the next cell!
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold, KFold
import pandas as pd
import os
import sys
import logging

logging.basicConfig(
    level=logging.DEBUG,
    format="[%(asctime)s] %(levelname)s %(message)s",
    datefmt="%H:%M:%S", stream=sys.stdout)
logger = logging.getLogger(__name__)


class Ensembler(object):
    def __init__(self, model_dict, num_folds=3, task_type='classification', optimize=roc_auc_score,
                 lower_is_better=False, save_path=None):
        """
        Ensembler init function
        :param model_dict: model dictionary, see README for its format
        :param num_folds: the number of folds for ensembling
        :param task_type: classification or regression
        :param optimize: the function to optimize for, e.g. AUC, logloss, etc. Must have two arguments y_test and y_pred
        :param lower_is_better: is lower value of optimization function better or higher
        :param save_path: path to which model pickles will be dumped to along with generated predictions, or None
        """

        self.model_dict = model_dict
        self.levels = len(self.model_dict)
        self.num_folds = num_folds
        self.task_type = task_type
        self.optimize = optimize
        self.lower_is_better = lower_is_better
        self.save_path = save_path

        self.training_data = None
        self.test_data = None
        self.y = None
        self.lbl_enc = None
        self.y_enc = None
        self.train_prediction_dict = None
        self.test_prediction_dict = None
        self.num_classes = None

    def fit(self, training_data, y, lentrain):
        """
        :param training_data: training data in tabular format
        :param y: binary, multi-class or regression
        :return: chain of models to be used in prediction
        """

        self.training_data = training_data
        self.y = y

        if self.task_type == 'classification':
            self.num_classes = len(np.unique(self.y))
            logger.info("Found %d classes", self.num_classes)
            self.lbl_enc = LabelEncoder()
            self.y_enc = self.lbl_enc.fit_transform(self.y)
            kf = StratifiedKFold(n_splits=self.num_folds)
            train_prediction_shape = (lentrain, self.num_classes)
        else:
            self.num_classes = -1
            self.y_enc = self.y
            kf = KFold(n_splits=self.num_folds)
            train_prediction_shape = (lentrain, 1)

        self.train_prediction_dict = {}
        for level in range(self.levels):
            self.train_prediction_dict[level] = np.zeros((train_prediction_shape[0],
                                                          train_prediction_shape[1] * len(self.model_dict[level])))

        for level in range(self.levels):

            if level == 0:
                temp_train = self.training_data
            else:
                temp_train = self.train_prediction_dict[level - 1]

            for model_num, model in enumerate(self.model_dict[level]):
                validation_scores = []
                foldnum = 1
                for train_index, valid_index in kf.split(self.train_prediction_dict[0], self.y_enc):
                    logger.info("Training Level %d Fold # %d. Model # %d", level, foldnum, model_num)

                    if level != 0:
                        l_training_data = temp_train[train_index]
                        l_validation_data = temp_train[valid_index]
                        model.fit(l_training_data, self.y_enc[train_index])
                    else:
                        l0_training_data = temp_train[0][model_num]
                        if type(l0_training_data) == list:
                            l_training_data = [x[train_index] for x in l0_training_data]
                            l_validation_data = [x[valid_index] for x in l0_training_data]
                        else:
                            l_training_data = l0_training_data[train_index]
                            l_validation_data = l0_training_data[valid_index]
                        model.fit(l_training_data, self.y_enc[train_index])

                    logger.info("Predicting Level %d. Fold # %d. Model # %d", level, foldnum, model_num)

                    if self.task_type == 'classification':
                        temp_train_predictions = model.predict_proba(l_validation_data)
                        self.train_prediction_dict[level][valid_index,
                        (model_num * self.num_classes):(model_num * self.num_classes) +
                                                       self.num_classes] = temp_train_predictions

                    else:
                        temp_train_predictions = model.predict(l_validation_data)
                        self.train_prediction_dict[level][valid_index, model_num] = temp_train_predictions
                    validation_score = self.optimize(self.y_enc[valid_index], temp_train_predictions)
                    validation_scores.append(validation_score)
                    logger.info("Level %d. Fold # %d. Model # %d. Validation Score = %f", level, foldnum, model_num,
                                validation_score)
                    foldnum += 1
                avg_score = np.mean(validation_scores)
                std_score = np.std(validation_scores)
                logger.info("Level %d. Model # %d. Mean Score = %f. Std Dev = %f", level, model_num,
                            avg_score, std_score)

            logger.info("Saving predictions for level # %d", level)
            train_predictions_df = pd.DataFrame(self.train_prediction_dict[level])
            train_predictions_df.to_csv(os.path.join(self.save_path, "train_predictions_level_" + str(level) + ".csv"),
                                        index=False, header=None)

        return self.train_prediction_dict

    def predict(self, test_data, lentest):
        self.test_data = test_data
        if self.task_type == 'classification':
            test_prediction_shape = (lentest, self.num_classes)
        else:
            test_prediction_shape = (lentest, 1)

        self.test_prediction_dict = {}
        for level in range(self.levels):
            self.test_prediction_dict[level] = np.zeros((test_prediction_shape[0],
                                                         test_prediction_shape[1] * len(self.model_dict[level])))
        self.test_data = test_data
        for level in range(self.levels):
            if level == 0:
                temp_train = self.training_data
                temp_test = self.test_data
            else:
                temp_train = self.train_prediction_dict[level - 1]
                temp_test = self.test_prediction_dict[level - 1]

            for model_num, model in enumerate(self.model_dict[level]):

                logger.info("Training Fulldata Level %d. Model # %d", level, model_num)
                if level == 0:
                    model.fit(temp_train[0][model_num], self.y_enc)
                else:
                    model.fit(temp_train, self.y_enc)

                logger.info("Predicting Test Level %d. Model # %d", level, model_num)

                if self.task_type == 'classification':
                    if level == 0:
                        temp_test_predictions = model.predict_proba(temp_test[0][model_num])
                    else:
                        temp_test_predictions = model.predict_proba(temp_test)
                    self.test_prediction_dict[level][:, (model_num * self.num_classes): (model_num * self.num_classes) +
                                                                                        self.num_classes] = temp_test_predictions

                else:
                    if level == 0:
                        temp_test_predictions = model.predict(temp_test[0][model_num])
                    else:
                        temp_test_predictions = model.predict(temp_test)
                    self.test_prediction_dict[level][:, model_num] = temp_test_predictions

            test_predictions_df = pd.DataFrame(self.test_prediction_dict[level])
            test_predictions_df.to_csv(os.path.join(self.save_path, "test_predictions_level_" + str(level) + ".csv"),
                                       index=False, header=None)

        return self.test_prediction_dict

# specify the data to be used for every level of ensembling:
train_data_dict = {0: [xtrain_tfv, xtrain_ctv, xtrain_tfv, xtrain_ctv], 1: [xtrain_glove]}
test_data_dict = {0: [xvalid_tfv, xvalid_ctv, xvalid_tfv, xvalid_ctv], 1: [xvalid_glove]}

model_dict = {0: [LogisticRegression(), LogisticRegression(), MultinomialNB(alpha=0.1), MultinomialNB()],

              1: [xgb.XGBClassifier(silent=True, n_estimators=120, max_depth=7)]}

ens = Ensembler(model_dict=model_dict, num_folds=3, task_type='classification',
                optimize=multiclass_logloss, lower_is_better=True, save_path='')

ens.fit(train_data_dict, ytrain, lentrain=xtrain_glove.shape[0])
preds = ens.predict(test_data_dict, lentest=xvalid_glove.shape[0])

Third Kernel: Simple Feature Engg Notebook - Spooky Author

Create different features that will help us in identifying the spooky authors.
1. Meta features - features that are extracted from the text like number of words, number of stop words, number of punctuations etc
2. Text based features - features directly based on the text / words like frequency, svd, word2vec etc.

Insight / Summary:

1. Meta Features

Number of words in the text
Number of unique words in the text
Number of characters in the text
Number of stopwords
Number of punctuations
Number of upper case words
Number of title case words
Average length of the words

## Number of words in the text ##
train_df["num_words"] = train_df["text"].apply(lambda x: len(str(x).split()))
test_df["num_words"] = test_df["text"].apply(lambda x: len(str(x).split()))

## Number of unique words in the text ##
train_df["num_unique_words"] = train_df["text"].apply(lambda x: len(set(str(x).split())))
test_df["num_unique_words"] = test_df["text"].apply(lambda x: len(set(str(x).split())))

## Number of characters in the text ##
train_df["num_chars"] = train_df["text"].apply(lambda x: len(str(x)))
test_df["num_chars"] = test_df["text"].apply(lambda x: len(str(x)))

## Number of stopwords in the text ##
train_df["num_stopwords"] = train_df["text"].apply(lambda x: len([w for w in str(x).lower().split() if w in eng_stopwords]))
test_df["num_stopwords"] = test_df["text"].apply(lambda x: len([w for w in str(x).lower().split() if w in eng_stopwords]))

## Number of punctuations in the text ##
train_df["num_punctuations"] =train_df['text'].apply(lambda x: len([c for c in str(x) if c in string.punctuation]) )
test_df["num_punctuations"] =test_df['text'].apply(lambda x: len([c for c in str(x) if c in string.punctuation]) )

## Number of title case words in the text ##
train_df["num_words_upper"] = train_df["text"].apply(lambda x: len([w for w in str(x).split() if w.isupper()]))
test_df["num_words_upper"] = test_df["text"].apply(lambda x: len([w for w in str(x).split() if w.isupper()]))

## Number of title case words in the text ##
train_df["num_words_title"] = train_df["text"].apply(lambda x: len([w for w in str(x).split() if w.istitle()]))
test_df["num_words_title"] = test_df["text"].apply(lambda x: len([w for w in str(x).split() if w.istitle()]))

## Average length of the words in the text ##
train_df["mean_word_len"] = train_df["text"].apply(lambda x: np.mean([len(w) for w in str(x).split()]))
test_df["mean_word_len"] = test_df["text"].apply(lambda x: np.mean([len(w) for w in str(x).split()]))

2. Text-based features

1) tf-idf values of the words present in the text

### Fit transform the tfidf vectorizer ###
tfidf_vec = TfidfVectorizer(stop_words='english', ngram_range=(1,3))
full_tfidf = tfidf_vec.fit_transform(train_df['text'].values.tolist() + test_df['text'].values.tolist())
train_tfidf = tfidf_vec.transform(train_df['text'].values.tolist())
test_tfidf = tfidf_vec.transform(test_df['text'].values.tolist())

The tfidf output is a sparse matrix and so if we have to use it with other dense features, we have couple of choices.

We can choose to get the top 'n' features (depending on the system config) from the tfidf vectorizer, convert it into dense format and concat with other features.
Build a model using just the sparse features and then use the predictions as one of the features along with other dense features.

Based on the dataset, one might perform better than the other. Here we can use the second approach since there are some very good scoring kernels using all the features of tfidf.
Also it seems that, Naive Bayes is performing better in this dataset. So we could build a naive bayes model using tfidf features as it is faster to train.

def runMNB(train_X, train_y, test_X, test_y, test_X2):
    model = naive_bayes.MultinomialNB()
    model.fit(train_X, train_y)
    pred_test_y = model.predict_proba(test_X)
    pred_test_y2 = model.predict_proba(test_X2)
    return pred_test_y, pred_test_y2, model

cv_scores = []
pred_full_test = 0
pred_train = np.zeros([train_df.shape[0], 3])
kf = model_selection.KFold(n_splits=5, shuffle=True, random_state=2017)
for dev_index, val_index in kf.split(train_X):
    dev_X, val_X = train_tfidf[dev_index], train_tfidf[val_index]
    dev_y, val_y = train_y[dev_index], train_y[val_index]
    pred_val_y, pred_test_y, model = runMNB(dev_X, dev_y, val_X, val_y, test_tfidf)
    pred_full_test = pred_full_test + pred_test_y
    pred_train[val_index,:] = pred_val_y
    cv_scores.append(metrics.log_loss(val_y, pred_val_y))
print("Mean cv score : ", np.mean(cv_scores))
pred_full_test = pred_full_test / 5.

Success is not guaranteed, but it is worth fighting for.
- Max Holloway -

[Kaggle Study] #11 Credit Card Fraud Detection

dongsunseng — Tue, 3 Dec 2024 02:52:01 +0900

Tenth competition following Youhan Lee's curriculum. Anomaly detection competition using tabular data.

Credit Card Fraud Detection

Anonymized credit card transactions labeled as fraudulent or genuine

www.kaggle.com

First Kernel: In depth skewed data classif. (93% recall acc now)

Testing different methods on skewed data.
The idea is to compare if preprocessing techniques work better when there is an overwhelming majority class that can disrupt the efficiency of our predictive model.

Insight / Summary:

1. Methodologies for dealing with unbalanced data

There are several ways to approach this classification problem taking into consideration this unbalance.
- Collect more data? Nice strategy but not applicable in this case
- Changing the performance metric:
  - Use the confusion matrix to calculate Precision, Recall
  - F1score (weighted average of precision recall)
  - Use Kappa - which is a classification accuracy normalized by the imbalance of the classes in the data
  - ROC curves - calculates sensitivity/specificity ratio.
- Resampling the dataset
  - Essentially this is a method that will process the data to have an approximate 50-50 ratio.
  - One way to achieve this is by OVER-sampling, which is adding copies of the under-represented class (better when you have little data)
  - Another is UNDER-sampling, which deletes instances from the over-represented class (better when he have lot's of data)

2. Nice approach that can be applied to other anomaly detection problems as well

Approach

We are not going to perform feature engineering in first instance. The dataset has been downgraded in order to contain 30 features (28 anonymized + time + amount).
- This means that the 28 features in the dataset have been anonymized so that their actual names and meanings cannot be known.
- For example, if the original feature names were "age," "gender," "occupation," etc., they have been changed to neutral names like "V1," "V2," "V3," and so on.
We will then compare what happens when using resampling and when not using it. We will test this approach using a simple logistic regression classifier.
- When the result is happy with the resampling dataset, we will then apply the same hyperparameter to the whole dataset.
We will evaluate the models by using some of the performance metrics mentioned above.
We will repeat the best resampling/not resampling method, by tuning the parameters in the logistic regression classifier.
We will finally perform classifications model using other classification algorithms.(actually not in this kernel)

3. Resampling process

As we mentioned earlier, there are several ways to resample skewed data.
Apart from under and over sampling, there is a very popular approach called SMOTE (Synthetic Minority Over-Sampling Technique), which is a combination of oversampling and undersampling, but the oversampling approach is not by replicating minority class but constructing new minority class data instance via an algorithm.
In this notebook, we will use traditional UNDER-sampling.
The way we will under sample the dataset will be by creating a 50/50 ratio.
This will be done by randomly selecting "x" amount of sample from the majority class, being "x" the total number of records with the minority class.

X = data.ix[:, data.columns != 'Class']
y = data.ix[:, data.columns == 'Class']

# Number of data points in the minority class
number_records_fraud = len(data[data.Class == 1])
fraud_indices = np.array(data[data.Class == 1].index)

# Picking the indices of the normal classes
normal_indices = data[data.Class == 0].index

# Out of the indices we picked, randomly select "x" number (number_records_fraud)
random_normal_indices = np.random.choice(normal_indices, number_records_fraud, replace = False)
random_normal_indices = np.array(random_normal_indices)

# Appending the 2 indices
under_sample_indices = np.concatenate([fraud_indices,random_normal_indices])

# Under sample dataset
under_sample_data = data.iloc[under_sample_indices,:]

X_undersample = under_sample_data.ix[:, under_sample_data.columns != 'Class']
y_undersample = under_sample_data.ix[:, under_sample_data.columns == 'Class']

# Showing ratio
print("Percentage of normal transactions: ", len(under_sample_data[under_sample_data.Class == 0])/len(under_sample_data))
print("Percentage of fraud transactions: ", len(under_sample_data[under_sample_data.Class == 1])/len(under_sample_data))
print("Total number of transactions in resampled data: ", len(under_sample_data))

Result:

Percentage of normal transactions:  0.5
Percentage of fraud transactions:  0.5
Total number of transactions in resampled data:  984

4. Recall Metric

We are very interested in the recall score, because that is the metric that will help us try to capture the most fraudulent transactions.
If you think how Accuracy, Precision and Recall work for a confusion matrix, recall would be the most interesting:
- Accuracy = (TP+TN)/total
- Precision = TP/(TP+FP)
- Recall = TP/(TP+FN)
As we know, due to the imbalacing of the data, many observations could be predicted as False Negatives, being, that we predict a normal transaction, but it is in fact a fraudulent one. Recall captures this.
Obviously, trying to increase recall, tends to come with a decrease of precision.
However, in our case, if we predict that a transaction is fraudulent and turns out not to be, is not a massive problem compared to the opposite.
- Misclassifying a fraudulent transaction as legitimate (False Negative) is a bigger problem
- Than misclassifying a legitimate transaction as fraudulent (False Positive)
We could even apply a cost function when having FN and FP with different weights for each type of error, but let's leave that aside for now.

5. Result checking process

The model is offering an 93.2% recall accuracy on the generalised unseen data (test set).
Not a bad percentage to be the first try.
However, recall this is a 93.2% recall accuracy measure on the undersampled test set.
Being happy with this result, let's apply the model we fitted and test it on the whole data.

Still a very decent recall accuracy when applying it to a much larger and skewed dataset.
So, we now move on to checking various metrics to evaluate the performance.

6. Plotting ROC curve and Precision-Recall curve

Found precision-recall curve much more convenient in this case as our problems relies on the "positive" class being more interesting than the negative class, but as we have calculated the recall precision, I am not going to plot the precision recall curves yet.
AUC and ROC curve are also interesting to check if the model is also predicting as a whole correctly and not making many errors

# ROC CURVE
lr = LogisticRegression(C = best_c, penalty = 'l1')
y_pred_undersample_score = lr.fit(X_train_undersample,y_train_undersample.values.ravel()).decision_function(X_test_undersample.values)

fpr, tpr, thresholds = roc_curve(y_test_undersample.values.ravel(),y_pred_undersample_score)
roc_auc = auc(fpr,tpr)

# Plot ROC
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b',label='AUC = %0.2f'% roc_auc)
plt.legend(loc='lower right')
plt.plot([0,1],[0,1],'r--')
plt.xlim([-0.1,1.0])
plt.ylim([-0.1,1.01])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

An additional comment that would be interesting to do is to initialize multiple undersampled datasets and repeat the process in loop.
Remember that, to create an undersample data, we randomly got records from the majority class.
Even though this is a valid technique, is doesn't represent the real population, so it would be interesting to repeat the process with different undersample configurations and check if the previous chosen parameters are still the most effective.
In the end, the idea is to use a wider random representation of the whole dataset and rely on the averaged best parameters.

7. Now testing on skewed data after resampled data

Having tested our previous approach, I find really interesting to test the same process on the skewed data.
Our intuition is that skewness will introduce issues difficult to capture, and therefore, provide a less effective algorithm.
To be fair, taking into account the fact that the train and test datasets are substantially bigger than the undersampled ones, I believe a K-fold cross validation is necessary.
I guess that by splitting the data with 60% in training set, 20% cross validation and 20% test should be enough... but let's take the same approach as before (no harm on this, it's just that K-fold is computationally more expensive)

Therefore by undersampling the data, our algorithm does a much better job at detecting fraud.

8. Threshold Tuning

I wanted also to show how can we tweak our final classification by changing the thresold.
Initially, you build the classification model and then you predict unseen data using it.
We previously used the "predict()" method to decided whether a record should belong to "1" or "0".
There is another method "predict_proba()".
This method returns the probabilities for each class.
The idea is that by changing the threshold to assign a record to class 1, we can control precision and recall.
Let's check this using the undersampled data (best C_param = 0.01)

lr = LogisticRegression(C = 0.01, penalty = 'l1')
lr.fit(X_train_undersample,y_train_undersample.values.ravel())
y_pred_undersample_proba = lr.predict_proba(X_test_undersample.values)

thresholds = [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]

plt.figure(figsize=(10,10))

j = 1
for i in thresholds:
    y_test_predictions_high_recall = y_pred_undersample_proba[:,1] > i
    
    plt.subplot(3,3,j)
    j += 1
    
    # Compute confusion matrix
    cnf_matrix = confusion_matrix(y_test_undersample,y_test_predictions_high_recall)
    np.set_printoptions(precision=2)

    print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

    # Plot non-normalized confusion matrix
    class_names = [0,1]
    plot_confusion_matrix(cnf_matrix
                          , classes=class_names
                          , title='Threshold >= %s'%i)

The pattern is very clear: the more you lower the required probability to put a certain in the class "1" category, more records will be put in that bucket.
This implies an increase in recall (we want all the "1"s), but at the same time, a decrease in precision (we misclassify many of the other class).
Therefore, even though recall is our goal metric (do not miss a fraud transaction), we also want to keep the model being accurate as a whole.
There is an option I think could be quite interesting to tackle this.
We could assign cost to misclassifications, but being interested in classifying "1s" correctly, the cost for misclassifying "1s" should be bigger than "0" misclassifications.
- Incorrectly classifying an actual fraudulent transaction (1) as legitimate (0) (False Negative)
- This case should have a higher cost
- Incorrectly classifying a legitimate transaction (0) as fraudulent (1) (False Positive)
- This case should have a relatively lower cost
After that, the algorithm would select the threshold which minimises the total cost.
A drawback I see is that we have to manually select the weight of each cost... therefore, I will leave this know as a thought.
Going back to the threshold changing, there is an option which is the Precision-Recall curve.
By visually seeing the performance of the model depending on the threshold we choose, we can investigate a sweet spot where recall is high enough whilst keeping a high precision value.

9. Investigate Precision-Recall curve and area under this curve

from itertools import cycle

lr = LogisticRegression(C = 0.01, penalty = 'l1')
lr.fit(X_train_undersample,y_train_undersample.values.ravel())
y_pred_undersample_proba = lr.predict_proba(X_test_undersample.values)

thresholds = [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
colors = cycle(['navy', 'turquoise', 'darkorange', 'cornflowerblue', 'teal', 'red', 'yellow', 'green', 'blue','black'])

plt.figure(figsize=(5,5))

j = 1
for i,color in zip(thresholds,colors):
    y_test_predictions_prob = y_pred_undersample_proba[:,1] > i
    
    precision, recall, thresholds = precision_recall_curve(y_test_undersample,y_test_predictions_prob)
    
    # Plot Precision-Recall curve
    plt.plot(recall, precision, color=color,
                 label='Threshold: %s'%i)
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.ylim([0.0, 1.05])
    plt.xlim([0.0, 1.0])
    plt.title('Precision-Recall example')
    plt.legend(loc="lower left")

Second Kernel

Disappeared.... :(

Third Kernel: Semi-Supervised Anomaly Detection Survey

Explore some anomaly detection techniques.

Insight / Summary:

1. Three types of anomalies

1) Point Anomaly

"an individual data instance can be considered as anomalous with respect to the rest of data"
In the image above, instance ( o_1 ) and ( o_2 ) and all instances in ( O_3 ) are point anomalies since they lie outside the normal regions.
As another example, consider credit card transaction data, with information only about amount spent.
Then, a high transaction compared to the rest for a particular individual is an anomaly.

2) Contextual Anomaly

In this case, the data must have features regarding some contextual attribute (e.g. time, space) and some features regarding behavioral attributes.
The anomaly is then determined within a given context.
As an example, consider again credit card transactions, but now we have both information about the amount spend and day of the year.
Now, a high amount transaction might be considered normal if it occurred in the week before Christmas, but the same amount transaction in July might be suspicious.
We could also have information about the location the client is when performing transactions, and then expect high amounts if we detect he/she is somewhere far from home, as in a vacation.

3) Collective Anomaly

In this case, some related data instances are anomalous with respect to the entire data set, but each individual instances may not be considered anomalous.
As an example, consider the stock of a retailer.
We expect to see its volume fluctuating in time, with low values followed by high values.
However, a low stock for a long period of time is a anomaly.
Note that the low volume per se is not an anomaly, but it persistence is.

2. Summary

Note that the last two types assume some relation among data instances, that is, they are not independent identically distributed (i.i.d).
In the present work, we have credit card transaction information and time is one of the features, so we could treat this problem as contextual anomaly detection.
However, we only have two days of data, making it almost impossible to determine a useful temporal context.
Hence, we will only consider point anomalies techniques to avoid the burden in the extra work of defining a context.
Nonetheless, we will keep time as a feature, so in some sense the contextual information will be considered, although no directly modeled.

3. Challenges

One straightforward approach to anomaly detection would be to simply define a region where the normal data lies and classify anything out of that region as an anomaly.
This is most easily said than done and there are some major challenges that often arise in anomaly detection problem:
- Modeling a normal region that captures all normal behavior is extremely difficult and the boundary between normal an abnormal is often blurred.
- Anomalies might be the result of malicious actions. Then, the malicious adversaries are always trying to adapt to make anomalous observations seem normal.
- The normal behavior can change, and then a current notion of normal might not be valid in the future.
- As we've seen, the notion of an anomaly varies for different application domains, and there is no algorithm that can handle all of them equally well.
- Labeled data for training/validation of models used by anomaly detection techniques is usually a major issue, being either extremely scarce or non existent.
- If the data contains a lot of noise, it is difficult to distinguish noisy instances from anomalies.

4. Metric

A system with high recall but low precision returns many results, but most of its predicted labels are incorrect when compared to the training labels.
A system with high precision but low recall is just the opposite, returning very few results, but most of its predicted labels are correct when compared to the training labels.
An ideal system with high precision and high recall will return many results, with all results labeled correctly.
Since we are in a scenario of credit card fraud detection, failing to detect a fraud has a higher cost than assigning as fraudulent a normal transaction.
Hence, we are more concerned with a high recall metric, as this shows that our system can consistently detect frauds, even if this means getting a few false positives.
Nonetheless, we don't want to have a lot of false positives, since there is also a cost in verifying to much transactions assigned as frauds.
So we can summarize our model's performance in a single metric, we will use the (( F_2 )) score, which places more importance in recall than precision. Formally, it is defined as:

5. Statistical Anomaly Detection Techniques

We assume that Normal data instances occur in high probability regions of a stochastic model, while anomalies occur in the low probability regions of the stochastic model.
In the statistical model techniques we fit a statistical model and perform statistical inference to decide if an unseen observation comes from the model distribution or not.
One advantage of this methods is that we can associate a confidence interval to each prediction, which can help when deciding on a course of action to deal with the anomalies.
Another advantages is that if the model is robust to anomalies, it can be used in an unsupervised fashion, without needing any labeled data.

1) Gaussian Model Based

from scipy.stats import multivariate_normal

mu = train.drop('Class', axis=1).mean(axis=0).values
sigma = train.drop('Class', axis=1).cov().values
model = multivariate_normal(cov=sigma, mean=mu, allow_singular=True)

print(np.median(model.logpdf(valid[valid['Class'] == 0].drop('Class', axis=1).values))) 
print(np.median(model.logpdf(valid[valid['Class'] == 1].drop('Class', axis=1).values)))

2) Histogram Based

class hist_model(object):
    
    def __init__(self, bins=50):
        self.bins = bins
        
    def fit(self, X):
        
        bin_hight, bin_edge = [], []
        
        for var in X.T:
            # get bins hight and interval
            bh, bedge = np.histogram(var, bins=self.bins)
            bin_hight.append(bh)
            bin_edge.append(bedge)
        
        self.bin_hight = np.array(bin_hight)
        self.bin_edge = np.array(bin_edge)
   

    def predict(self, X):
        
        scores = []
        for obs in X:
            obs_score = []
            for i, var in enumerate(obs):
                # find wich bin obs is in
                bin_num = (var > self.bin_edge[i]).argmin()-1
                obs_score.append(self.bin_hight[i, bin_num]) # find bin hitght
            
            scores.append(np.mean(obs_score))
        
        return np.array(scores)
                

        
model = hist_model()
model.fit(train.drop('Class', axis=1).values)
print(np.median(model.predict(valid[valid['Class'] == 0].drop('Class', axis=1).values))) 
print(np.median(model.predict(valid[valid['Class'] == 1].drop('Class', axis=1).values)))

6. Cluster based Technique

from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=3, n_init=4, random_state=42)
gmm.fit(train.drop('Class', axis=1).values)
print(gmm.score(valid[valid['Class'] == 0].drop('Class', axis=1).values))
print(gmm.score(valid[valid['Class'] == 1].drop('Class', axis=1).values))

7. SVM based Technique

from sklearn.svm import OneClassSVM
np.random.seed(42)

model = OneClassSVM(gamma=0.000562, nu=.95, kernel='rbf')
model.fit(train.drop('Class', axis=1).values)
print(model.decision_function(valid[valid['Class'] == 0].drop('Class', axis=1).values).mean())
print(model.decision_function(valid[valid['Class'] == 1].drop('Class', axis=1).values).mean())

8. Tree based Technique

from sklearn.ensemble import IsolationForest
np.random.seed(42)

model = IsolationForest(random_state=42, n_jobs=4, max_samples=train.shape[0], bootstrap=True, n_estimators=50)
model.fit(train.drop('Class', axis=1).values)
print(model.decision_function(valid[valid['Class'] == 0].drop('Class', axis=1).values).mean())
print(model.decision_function(valid[valid['Class'] == 1].drop('Class', axis=1).values).mean())

9. Neural based Technique: AutoEncoder

Embrace challenges as opportunities for growth and transformation.
- Max Holloway -

[Kaggle Study] #10 Zillow Prize: Zillow’s Home Value Prediction (Zestimate)

dongsunseng — Fri, 29 Nov 2024 18:59:27 +0900

Nineth competition following Youhan Lee's curriculum. Regression competition using tabular data.

Zillow Prize: Zillow’s Home Value Prediction (Zestimate)

Can you improve the algorithm that changed the world of real estate?

www.kaggle.com

First Kernel: Simple Exploration Notebook - Zillow Prize

EDA kernel focused on univariate correlation analysis.

Insight / Summary:

1. Removing outliers

ulimit = np.percentile(train_df.logerror.values, 99)
llimit = np.percentile(train_df.logerror.values, 1)
train_df['logerror'].ix[train_df['logerror']>ulimit] = ulimit
train_df['logerror'].ix[train_df['logerror']<llimit] = llimit

Second Kernel: Simple XGBoost Starter (~0.0655)

Literally simple baseline kernel using xgboost.

Insight / Summary:

1. XGBoost Dmatrix

d_train = xgb.DMatrix(x_train, label=y_train)
d_valid = xgb.DMatrix(x_valid, label=y_valid)

DMatrix is a special data structure used in XGBoost.
It's an object that converts regular numpy arrays or pandas DataFrames into a format that XGBoost can process efficiently.
The main reasons for using DMatrix are:
- Memory efficiency: Stores data in an optimized format to save memory
- Training speed: Prepares data in an optimized format so XGBoost can train quickly
- Sparse matrix support: Can efficiently handle data when it's sparse

Third Kernel: Zillow EDA On Missing Values & Multicollinearity

EDA focused on missing values and multicollinearity.

- Missing Value Analysis
- Correlation Analysis
- Top Contributing Features (Through XGBoost)
- Correlation Analysis 
- Multicollinearity Analysis
- Univariate Analysis 
- Bivariate Analysis

Insight / Summary:

1. Multicollinearity Analysis

# Import function for calculating VIF (Variance Inflation Factor)
from statsmodels.stats.outliers_influence import variance_inflation_factor  
# Hide warning messages
import warnings
warnings.filterwarnings("ignore")
# Define function for calculating VIF
def calculate_vif_(X):
    variables = list(X.columns)
    # Calculate VIF scores for each variable and return as dictionary
    vif = {variable:variance_inflation_factor(exog=X.values, exog_idx=ix) 
           for ix,variable in enumerate(list(X.columns))}
    return vif
# Select numerical columns only
numericalCol = []
for f in merged.columns:
    # Select columns that are not object type and exclude specific columns (parcelid, transactiondate, logerror)
    if merged[f].dtype!='object' and f not in ["parcelid", "transactiondate", "logerror"]:
        numericalCol.append(f)
# Create dataframe with missing values filled with -999
mergedFilterd = merged[numericalCol].fillna(-999)
# Calculate VIF scores
vifDict = calculate_vif_(mergedFilterd)
# Convert VIF results to dataframe
vifDf = pd.DataFrame()
vifDf['variables'] = vifDict.keys()
vifDf['vifScore'] = vifDict.values()
# Sort by VIF score in descending order
vifDf.sort_values(by=['vifScore'],ascending=False,inplace=True)
# Variables with VIF score ≤ 5 (no multicollinearity)
validVariables = vifDf[vifDf["vifScore"]<=5]
# Variables with VIF score > 5 (with multicollinearity)
variablesWithMC  = vifDf[vifDf["vifScore"]>5]
# Create subplots for visualization
fig,(ax1,ax2) = plt.subplots(ncols=2)
fig.set_size_inches(20,8)
# Visualize VIF scores for variables without multicollinearity
sn.barplot(data=validVariables,x="vifScore",y="variables",ax=ax1,orient="h",color="#34495e")
# Visualize VIF scores for top 5 variables with multicollinearity
sn.barplot(data=variablesWithMC.head(5),x="vifScore",y="variables",ax=ax2,orient="h",color="#34495e")
# Set graph titles and labels
ax1.set(xlabel='VIF Scores', ylabel='Features',title="Valid Variables Without Multicollinearity")
ax2.set(xlabel='VIF Scores', ylabel='Features',title="Variables Which Exhibit Multicollinearity")

Overall explanation: This code demonstrates the process of analyzing multicollinearity between features in the dataset. Multicollinearity refers to strong correlations between independent variables, which can degrade model performance.

Main steps:

Selects only numerical variables for analysis
Calculates VIF (Variance Inflation Factor) scores, which measure multicollinearity
Generally, VIF scores above 5 or 10 indicate multicollinearity; this code uses 5 as the threshold
Visualizes results in two graphs:
- Left graph: Variables without multicollinearity (VIF ≤ 5)
- Right graph: Top 5 variables with multicollinearity (VIF > 5)

This analysis helps identify which variables have strong correlations with each other, which is valuable information for preprocessing steps like feature selection or dimensionality reduction.

Fourth Kernel: XGBoost, LightGBM, and OLS and NN

Kernel ensembling various prediction methods: XGBoost, LightGBM, OLS(Linear Regression), and NN.

Insight / Summary:

1. Summary

LightGBM Model
1. Data preprocessing
2. Set LightGBM parameters
3. Train model and make predictions
XGBoost Model
1. Data reprocessing (different from LightGBM)
2. Remove outliers
3. Train two different XGBoost models
4. Combine predictions from both models

Neural Network Model
1. Data preprocessing (standardization, handling missing values)
2. Network structure:
  - 4 hidden layers (400 → 160 → 64 → 26 units)
  - PReLU activation function
  - Use Dropout and BatchNormalization
3. Train model and make predictions
OLS Model: OLS(Ordinary Least Squares) is the basic form of linear regression
1. Feature engineering
2. Train LinearRegression
3. Make predictions for multiple dates
Final Prediction Combination
1. Combine predictions from each model using weights
2. Apply FUDGE_FACTOR for final adjustment
3. Save results to CSV file

2. FUDGE_FACTOR

pred = FUDGE_FACTOR * (OLS_WEIGHT*reg.predict(get_features(test)) + (1-OLS_WEIGHT)*pred0)

This coefficient is applied after combining predictions from all models and increases the final prediction value by 12% (1.12 times).
It is used for the following purposes:
- Prediction Bias Correction: To correct when models tend to systematically underpredict
- Calibration: To adjust predictions based on validation set or previous submission results
- Systematic Error Correction: To correct systematic errors due to data characteristics or model limitations
This is an empirically determined value, and the optimal value was likely found through validation dataset or leaderboard performance.

Success is not about luck, but about hard work, dedication, and sacrifice.
- Max Holloway -

[Kaggle Study] #9 New York City Taxi Trip Duration

dongsunseng — Fri, 29 Nov 2024 15:43:25 +0900

Eighth competition following Youhan Lee's curriculum. Regression competition using tabular data.

New York City Taxi Trip Duration

Share code and data to improve ride time predictions

www.kaggle.com

First Kernel: Dynamics of New York city - Animation

Use K-means clustering to cluster New York into different groups based on location, and analyze the traffic into and out of every cluster as a function of the time along the day

Insight / Summary:

1. Clustering Code Example: cluster New York City based on the pick-up and drop-off points of each taxi ride

kmeans = KMeans(n_clusters=15, random_state=2, n_init = 10).fit(loc_df)
loc_df['label'] = kmeans.labels_

loc_df = loc_df.sample(200000)
plt.figure(figsize = (10,10))
for label in loc_df.label.unique():
    plt.plot(loc_df.longitude[loc_df.label == label],loc_df.latitude[loc_df.label == label],'.', alpha = 0.3, markersize = 0.3)

plt.title('Clusters of New York')
plt.show()

2. Plotting cluster center

fig,ax = plt.subplots(figsize = (10,10))
for label in loc_df.label.unique():
    ax.plot(loc_df.longitude[loc_df.label == label],loc_df.latitude[loc_df.label == label],'.', alpha = 0.4, markersize = 0.1, color = 'gray')
    ax.plot(kmeans.cluster_centers_[label,0],kmeans.cluster_centers_[label,1],'o', color = 'r')
    ax.annotate(label, (kmeans.cluster_centers_[label,0],kmeans.cluster_centers_[label,1]), color = 'b', fontsize = 20)
ax.set_title('Cluster Centers')
plt.show()

3. Plotting taxi rides from one cluster to another

Absolute traffic:

fig, ax = plt.subplots(1, 1, figsize = (10,10))

def animate(hour):
    ax.clear()
    ax.set_title('Absolute Traffic - Hour ' + str(int(hour)) + ':00')    
    plt.figure(figsize = (10,10));
    for label in loc_df.label.unique():
        ax.plot(loc_df.longitude[loc_df.label == label],loc_df.latitude[loc_df.label == label],'.', alpha = 1, markersize = 2, color = 'gray');
        ax.plot(kmeans.cluster_centers_[label,0],kmeans.cluster_centers_[label,1],'o', color = 'r');


    for label in clusters.label:
        for dest_label in clusters.label:
            num_of_rides = len(df[(df.pickup_cluster == label) & (df.dropoff_cluster == dest_label) & (df.pickup_hour == hour)])
            dist_x = clusters.x[clusters.label == label].values[0] - clusters.x[clusters.label == dest_label].values[0]
            dist_y = clusters.y[clusters.label == label].values[0] - clusters.y[clusters.label == dest_label].values[0]
            pct = np.true_divide(num_of_rides,len(df))
            arr = Arrow(clusters.x[clusters.label == label].values, clusters.y[clusters.label == label].values, -dist_x, -dist_y, edgecolor='white', width = 15*pct)
            ax.add_patch(arr)
            arr.set_facecolor('g')


ani = animation.FuncAnimation(fig,animate,sorted(df.pickup_hour.unique()), interval = 1000)
plt.close()
ani.save('animation.gif', writer='imagemagick', fps=2)
filename = 'animation.gif'
video = io.open(filename, 'r+b').read()
encoded = base64.b64encode(video)
HTML(data='''<img src="data:image/gif;base64,{0}" type="gif" />'''.format(encoded.decode('ascii')))

Relative traffic:

fig, ax = plt.subplots(1, 1, figsize = (10,10))

def animate(hour):
    ax.clear()
    ax.set_title('Relative Traffic - Hour ' + str(int(hour)) + ':00')    
    plt.figure(figsize = (10,10))
    for label in loc_df.label.unique():
        ax.plot(loc_df.longitude[loc_df.label == label],loc_df.latitude[loc_df.label == label],'.', alpha = 1, markersize = 2, color = 'gray')
        ax.plot(kmeans.cluster_centers_[label,0],kmeans.cluster_centers_[label,1],'o', color = 'r')


    for label in clusters.label:
        for dest_label in clusters.label:
            num_of_rides = len(df[(df.pickup_cluster == label) & (df.dropoff_cluster == dest_label) & (df.pickup_hour == hour)])
            dist_x = clusters.x[clusters.label == label].values[0] - clusters.x[clusters.label == dest_label].values[0]
            dist_y = clusters.y[clusters.label == label].values[0] - clusters.y[clusters.label == dest_label].values[0]
            pct = np.true_divide(num_of_rides,len(df[df.pickup_hour == hour]))
            arr = Arrow(clusters.x[clusters.label == label].values, clusters.y[clusters.label == label].values, -dist_x, -dist_y, edgecolor='white', width = pct)
            ax.add_patch(arr)
            arr.set_facecolor('g')


ani = animation.FuncAnimation(fig,animate,sorted(df.pickup_hour.unique()), interval = 1000)
plt.close()
ani.save('animation.gif', writer='imagemagick', fps=2)
filename = 'animation.gif'
video = io.open(filename, 'r+b').read()
encoded = base64.b64encode(video)
HTML(data='''<img src="data:image/gif;base64,{0}" type="gif" />'''.format(encoded.decode('ascii')))

Second Kernel: EDA + Baseline Model(0.40 RMSE)

Literally EDA + making baseline model with decent LB.

Insight / Summary:

1. Calculating Haversine Distance using latitude, longitude

def calculateDistance(row):
    R=6373.0 # approximate radius of earth in km
    pickup_lat=radians(row['pickup_latitude'])
    pickup_lon=radians(row['pickup_longitude'])
    dropoff_lat=radians(row['dropoff_latitude'])
    dropoff_lon=radians(row['dropoff_longitude'])
    dlon = dropoff_lon - pickup_lon
    dlat = dropoff_lat - pickup_lat
    a = sin(dlat / 2)**2 + cos(pickup_lat) * cos(dropoff_lat) * sin(dlon / 2)**2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))
    distance = R * c
    return distance

2. Bearing

Bearing (also called azimuth) is the angle between the direction of travel and true north, measured clockwise from north. In other words, it tells you which direction you're heading:
- 0° (or 360°) = North
- 90° = East
- 180° = South
- 270° = West
The formula is: θ = atan2( sin Δλ ⋅ cos φ2 , cos φ1 ⋅ sin φ2 − sin φ1 ⋅ cos φ2 ⋅ cos Δλ ) λ is the longitude

def calculateBearing(lat1,lng1,lat2,lng2):
    R = 6371 
    lng_delta_rad = np.radians(lng2 - lng1)
    lat1, lng1, lat2, lng2 = map(np.radians, (lat1, lng1, lat2, lng2))
    y = np.sin(lng_delta_rad) * np.cos(lat2)
    x = np.cos(lat1) * np.sin(lat2) - np.sin(lat1) * np.cos(lat2) * np.cos(lng_delta_rad)
    return np.degrees(np.arctan2(y, x))

Third Kernel: Beat the benchmark!

Similar kernel but XGBoost used.

Believe in your abilities, even when others doubt you. Your belief will carry you through.
- Max Holloway -