<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>동선생</title>
    <link>https://dongsunseng.tistory.com/</link>
    <description></description>
    <language>ko</language>
    <pubDate>Sat, 23 May 2026 10:07:20 +0900</pubDate>
    <generator>TISTORY</generator>
    <ttl>100</ttl>
    <managingEditor>dongsunseng</managingEditor>
    <image>
      <title>동선생</title>
      <url>https://tistory1.daumcdn.net/tistory/6446949/attach/925792b74bd54d1c8ee0407eb4af2954</url>
      <link>https://dongsunseng.tistory.com</link>
    </image>
    <item>
      <title>[경제] 4. 펀드와 그 종류</title>
      <link>https://dongsunseng.tistory.com/entry/%EA%B2%BD%EC%A0%9C-4-%ED%8E%80%EB%93%9C%EC%99%80-%EA%B7%B8-%EC%A2%85%EB%A5%98</link>
      <description>&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;펀드란?&lt;/b&gt;&lt;/h4&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;펀드는 다수의 투자자로부터 자금을 모아 전문 운용사가 주식, 채권, 부동산 등 다양한 자산에 투자하여 발생한 수익을 투자자에게 배분하는 &lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;u&gt;&lt;b&gt;&lt;i&gt;집합투자&lt;/i&gt;&lt;/b&gt;&lt;i&gt;&lt;b&gt; 상품 &lt;/b&gt;&lt;/i&gt;&lt;/u&gt;&lt;/span&gt;입니다.&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;펀드를 통해 소액 투자자도 전문가의 운용 능력과 분산 투자의 이점을 누릴 수 있게 됩니다.&amp;nbsp;&lt;/p&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;펀드의 종류&lt;/b&gt;&lt;/h4&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;경제 뉴스나 인터넷을 보면 사모 펀드, 주식형 펀드, 헤지 펀드 등등 여러 종류의 펀드에 대해 들어본 적이 있을겁니다. 펀드의 종류에 대해 자세히 알아보겠습니다.&amp;nbsp;&lt;/p&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;&lt;b&gt;설정 형태에 따른 분류: 공모 펀드 vs. 사모 펀드&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;공모 펀드:&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;불특정 다수의 투자자를 대상으로 공개적으로 모집&lt;/li&gt;
&lt;li&gt;최소 투자 금액이 낮고 접근성이 높음&lt;/li&gt;
&lt;li&gt;엄격한 규제와 공시 의무가 있음&lt;/li&gt;
&lt;li&gt;예시) 대부분의 뮤추얼 펀드, ETF&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;사모 펀드:&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;소수의 특정 투자자를 대상으로 비공개적으로 모집&lt;/li&gt;
&lt;li&gt;일반적으로 고액 자산가나 기관 투자자 대상&lt;/li&gt;
&lt;li&gt;상대적으로 규제가 적고 투자 전략의 자유도가 높음&lt;/li&gt;
&lt;li&gt;예시) 헤지펀드, 프라이빗 에쿼티, 벤처캐피털 펀드
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;각각에 대한 자세한 설명은 아래에 있음&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;투자 대상에 따른 분류:&lt;/b&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;&lt;b&gt;주식형 펀드:&amp;nbsp;&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;주로 주식에 투자&lt;/li&gt;
&lt;li&gt;높은 수익 잠재력과 높은 위험&lt;/li&gt;
&lt;li&gt;하위 유형: 대형주, 중소형주, 배당주, 섹터(IT, 헬스 케어 등), 국가/지역별
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;배당주&lt;/b&gt;: 주주들에게 정기적으로 현금 배당을 지급하는 기업의 주식&lt;/li&gt;
&lt;li&gt;&lt;b&gt;섹터&lt;/b&gt;:&amp;nbsp;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;섹터는 경제나 주식 시장에서 비슷한 사업 활동이나 제품/서비스를 제공하는 기업들을 묶어 분류한 산업 그룹을 의미합니다.
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;정보 기술, 금융, 헬스케어, 산업재, 소비자 필수소비재, 임의 소비재, 통신 서비스, 에너지, 유틸리티, 소재, 부동산, ...&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;섹터 투자란 다양한 섹터에 분산 투자함으로써 위험을 감소시키는 것을 말함&lt;/li&gt;
&lt;li&gt;특정 섹터에 집중된 ETF나 펀드에 투자할 수 있고 특정 섹터 내 우량 기업을 선별하여 투자하거나 경기 사이클에 따라 섹터의 비중을 조절하는 등의 전략을 꾀할 수 있습니다.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;채권형 펀드:&amp;nbsp;&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;주로 국채, 회사채, 특수채 등 채권에 투자
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;채권은 정부, 기업 또는 기타 단체가 자금을 빌리기 위해 발행하는 부채 증서입니다.&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;채권 소유자는 채권 발행자에게 돈을 빌려주는 것이고, 발행자는 일정 기간동안 정해진 이자를 지급하며 만기일에 원금을 상환하겠다고 약속합니다.&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;채권은 주식과 달리 소유권이 아닌 채무 관계를 나타내며, 일반적으로 주식보다 위험이 낮은 투자 수단으로 간주됩니다.&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;&lt;span style=&quot;color: #333333; text-align: left;&quot;&gt;주식보다 채권이 위험이 낮은 투자수단인 이유:&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #fafafa; color: #333333; text-align: left;&quot;&gt;상환 우선순위:&amp;nbsp;&lt;/span&gt;&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #fafafa; color: #333333; text-align: left;&quot;&gt;기업이 파산할 경우, 채권 투자자는 주주보다 먼저 상환받을 권리가 있습니다.&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #fafafa; color: #333333; text-align: left;&quot;&gt;회사 자산을 청산할 때 채권자가 주주보다 우선순위에 있기 때문입니다.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #fafafa; color: #333333; text-align: left;&quot;&gt;&lt;b&gt;확정 수익:&lt;/b&gt; &lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #fafafa; color: #333333; text-align: left;&quot;&gt;&lt;span style=&quot;background-color: #fafafa; color: #333333; text-align: left;&quot;&gt;채권은 일반적으로 고정된 이자율(쿠폰)을 제공하므로 투자자는 얼마의 수익을 얻을 수 있는지 미리 알 수 있습니다.&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #fafafa; color: #333333; text-align: left;&quot;&gt;&lt;span style=&quot;background-color: #fafafa; color: #333333; text-align: left;&quot;&gt;반면 주식의 배당금은 회사 성과에 따라 변동되거나 아예 지급되지 않을 수도 있습니다. &lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;원금 보장:&amp;nbsp;&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #fafafa; color: #333333; text-align: left;&quot;&gt;&lt;span style=&quot;background-color: #fafafa; color: #333333; text-align: left;&quot;&gt;채권은 만기일에 원금을 상환받는 구조로, 발행 기관이 파산하지 않는 한 투자한 원금을 돌려받을 수 있습니다.&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #fafafa; color: #333333; text-align: left;&quot;&gt;&lt;span style=&quot;background-color: #fafafa; color: #333333; text-align: left;&quot;&gt;주식은 가치가 폭락할 가능성이 상대적으로 높습니다.&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #fafafa; color: #333333; text-align: left;&quot;&gt;&lt;span style=&quot;background-color: #fafafa; color: #333333; text-align: left;&quot;&gt;가격 변동성:&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #fafafa; color: #333333; text-align: left;&quot;&gt;&lt;span style=&quot;background-color: #fafafa; color: #333333; text-align: left;&quot;&gt;일반적으로 채권 가격은 주식 가격보다 변동성이 작습니다.&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #fafafa; color: #333333; text-align: left;&quot;&gt;&lt;span style=&quot;background-color: #fafafa; color: #333333; text-align: left;&quot;&gt;특히 국채와 같은 안전한 채권은 시장 상황에 따른 가격변동이 상대적으로 적습니다.&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #fafafa; color: #333333; text-align: left;&quot;&gt;&lt;span style=&quot;background-color: #fafafa; color: #333333; text-align: left;&quot;&gt;수익과 위험의 관계:&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #fafafa; color: #333333; text-align: left;&quot;&gt;&lt;span style=&quot;background-color: #fafafa; color: #333333; text-align: left;&quot;&gt;금융 이론에서 위험이 높을수록 잠재적 수익도 높아지는 경향이 있습니다.&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #fafafa; color: #333333; text-align: left;&quot;&gt;&lt;span style=&quot;background-color: #fafafa; color: #333333; text-align: left;&quot;&gt;채권은 주식보다 잠재적 수익이 낮은 대신 위험도 낮은 경향이 있습니다.&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;&lt;span style=&quot;background-color: #fafafa; color: #333333; text-align: left;&quot;&gt;&lt;span style=&quot;background-color: #fafafa; color: #333333; text-align: left;&quot;&gt;다만, 모든 채권이 모든 주식보다 안전한 것은 아닙니다.&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/i&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #fafafa; color: #333333; text-align: left;&quot;&gt;&lt;span style=&quot;background-color: #fafafa; color: #333333; text-align: left;&quot;&gt;예를 들어, 신용등급이 낮은 회사의 정크본드(투기등급 채권)는 안정적인 대기업의 주식보다 더 위험할 수 있습니다.&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;span style=&quot;background-color: #fafafa; color: #333333; text-align: left;&quot;&gt;&lt;span style=&quot;background-color: #fafafa; color: #333333; text-align: left;&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;안정적인 이자수익 추구, 상대적으로 낮은 위험&lt;/li&gt;
&lt;li&gt;하위 유형: 국공채, 회사채, 하이일드, 신흥국 채권 등
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;국공채(Government Bonds)&lt;/b&gt;: 발행주체가 중앙정부, 지방정부, 공공기관&amp;nbsp;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;회사채(Corporate Bonds)&lt;/b&gt;: 발행주체가 일반 기업&lt;/li&gt;
&lt;li&gt;&lt;b&gt;하이일드 채권(High Yield Bonds)&lt;/b&gt;: 발행주체가 신용등급이 낮은 기업
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;정크본드(Junk Bonds)라고도 불림&lt;/li&gt;
&lt;li&gt;높은 이자율 제공(투자등급 채권 대비)&lt;/li&gt;
&lt;li&gt;채무불이행(디폴트) 위험이 상대적으로 높음&lt;/li&gt;
&lt;li&gt;예시) 신생 기업 채권, 재무상태가 취약한 기업의 채권&lt;/li&gt;
&lt;li&gt;따라서, 높은 수익을 추구할 위험을 감수 가능한 투자자들이 투자함&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;신흥국 채권(Emerging Market Bonds)&lt;/b&gt;: 발행주체가 신흥국 정부 또는 기업
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;선진국 채권보다 높은 이자율 제공&lt;/li&gt;
&lt;li&gt;정치적, 경제적 불안정성에 따른 추가 위험 존재&lt;/li&gt;
&lt;li&gt;통화 위험(환율 변동)에 노출&lt;/li&gt;
&lt;li&gt;예시) 브라질, 인도, 남아프리카 등의 국채 또는 기업 채권&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;혼합형 펀드:&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;주식과 채권에 동시 투자&lt;/li&gt;
&lt;li&gt;균형 잡힌 위험-수익 프로필&lt;/li&gt;
&lt;li&gt;하위 유형: 주식 비중에 따라 공격형/중립형/안정형으로 구분&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;부동산 펀드:&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;상업용/주거용 부동산, 리츠(REITs) 등에 투자
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;리츠(REITs):&amp;nbsp;&lt;/b&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Real Estate Investment Trusts&lt;/li&gt;
&lt;li&gt;&lt;b&gt;다수의 투자자로부터 자금을 모아 부동산에 투자하고, 그 수익을 투자자들에게 배당하는 부동산 투자회사 또는 신탁을 말합니다.&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;구조: 회사 형태로 설립되어 주식처럼 거래소에 상장 가능&lt;/li&gt;
&lt;li&gt;투자 대상: 오피스 빌딩, 쇼핑몰, 호텔, 물류센터, 아파트 등의 상업용/주거용 부동산&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;수익 구조:&amp;nbsp;&lt;/b&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;임대료 수입(주 수익원)&lt;/li&gt;
&lt;li&gt;부당산 매각 시 자본 이득&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;배당 의무: 대부분의 국가에서 수익의 90% 이상을 투자자에게 배당하도록 법적으로 규정&lt;/li&gt;
&lt;li&gt;&lt;b&gt;접근성: 적은 금액으로도 대규모 부동산에 투자 가능&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;임대 수익과 자산가치 상승으로 수익 추구&lt;/li&gt;
&lt;li&gt;&lt;b&gt;실물 자산에 대한 익스포저(exposure) 제공&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;투자자들이 실제 물리적 자산(부동산, 인프라 등)에 투자할 기회를 얻게 된다는 의미입니다.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;원자재 펀드:&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;금, 원유, 농산물 등 원자재에 투자&lt;/li&gt;
&lt;li&gt;인플레이션 헤지용이나 포트폴리오 다각화에 활용&lt;/li&gt;
&lt;li&gt;직접 원자재 보유 또는 선물 계약 등 파생상품 활용&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;대체투자 펀드:&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;전통적인 자산군 외의 투자 대상&lt;/li&gt;
&lt;li&gt;인프라, 사모투자, 헤지펀드, 벤처캐피털 등
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;인프라: 경제와 사회의 기반이 되는 물리적 구조물과 시설에 대한 투자를 의미합니다.&lt;/li&gt;
&lt;li&gt;예시) 교통 인프라(도로, 고속도로, 터널, ...), 에너지 인프라(발전소, 가스 파이프라인, ...), 공공 유틸리티(수도 공급, 폐기물 처리 시설, 통신망, ...), 사회 인프라(병원, 학교, ...)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;일반적으로 높은 최소투자금액과 낮은 유동성&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;머니마켓 펀드:&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;단기 금융상품(CP, CD, 단기 국채, 콜론, RP 등)에 투자
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;CP(기업어음, Commercial Paper)&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;우량 기업이 단기 자금 조달을 위해 발행하는 무담보 약속어음&lt;/li&gt;
&lt;li&gt;만기: 일반적으로 1-270일(보통 30-90일)&lt;/li&gt;
&lt;li&gt;할인 발행 방식(액면가보다 낮은 가격에 발행, 만기에 액면가 상환)&lt;/li&gt;
&lt;li&gt;예: A기업이 3개월 후 10억원을 상환하겠다는 약속으로 9억8천만원에 발행&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;CD(양도성 예금증서, Certificate of Deposit)&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;은행이 발행하는 정기예금 증서로 양도가 가능한 금융상품&lt;/li&gt;
&lt;li&gt;만기: 주로 91일, 180일, 270일, 1년 등&lt;/li&gt;
&lt;li&gt;은행의 신용도를 기반으로 하는 안전한 상품&lt;/li&gt;
&lt;li&gt;예: 시중은행이 발행한 3개월 만기, 연 3% 이자율의 CD&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;단기 국채(Treasury Bills)&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;정부가 발행하는 만기 1년 이내의 채무증권&lt;/li&gt;
&lt;li&gt;가장 안전한 투자 수단으로 간주됨&lt;/li&gt;
&lt;li&gt;할인 발행 방식 사용&lt;/li&gt;
&lt;li&gt;예: 정부가 발행한 91일 만기 국고채권&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;콜론(Call Loan)&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;금융기관 간 초단기(1일-1주일) 대출&lt;/li&gt;
&lt;li&gt;금융시장의 일시적 자금 수급 조절 역할&lt;/li&gt;
&lt;li&gt;예: A은행이 B은행에 1일 동안 대출해주는 자금&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;RP(환매조건부채권, Repurchase Agreement)&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;채권을 일정 기간 후 다시 매수하기로 약정하고 파는 계약&lt;/li&gt;
&lt;li&gt;만기: 일반적으로 1일-90일&lt;/li&gt;
&lt;li&gt;담보부 대출의 성격&lt;/li&gt;
&lt;li&gt;예: 증권사가 채권을 담보로 14일 동안 자금을 빌리는 계약&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;높은 유동성과 안전성, 낮은 수익률&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;i&gt;&lt;b&gt;일시적 자금 대기용으로 활용&lt;/b&gt;&lt;/i&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;운용 방식에 따른 분류: 액티브 펀드 vs. 패시브 펀드&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;액티브 펀드:&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;펀드 매니저가 적극적으로 종목 선정 및 자산 배분&lt;/li&gt;
&lt;li&gt;시장 평균 수익률 초과 달성(알파 창출) 목표&lt;/li&gt;
&lt;li&gt;상대적으로 높은 운용 보수&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;패시브 펀드:&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;특정 지수(인덱스)를 추종하는 투자 전략&lt;/li&gt;
&lt;li&gt;시장 평균 수익률 달성 목표&lt;/li&gt;
&lt;li&gt;낮은 운용 보수&lt;/li&gt;
&lt;li&gt;예) 인덱스 펀드, 대부분의 ETF&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;투자 지역에 따른 분류&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;국내형: 자국 시장에만 투자&lt;/li&gt;
&lt;li&gt;해외형: 특정 해외 국가나 지역에 투자&lt;/li&gt;
&lt;li&gt;글로벌형: 전세계 시장에 분산 투자&lt;/li&gt;
&lt;li&gt;신흥시장형: 신흥국 시장에 집중 투자&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;특수 목적 펀드&lt;/b&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;&lt;b&gt;ETF(상장지수펀드)&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;지수를 추종하며 거래소에 상장되어 주식처럼 거래&lt;/li&gt;
&lt;li&gt;높은 유동성과 투명성, 낮은 비용&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;인덱스 펀드&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;특정 지수의 성과를 복제하는 패시브 운용 펀드&lt;/li&gt;
&lt;li&gt;ETF와 유사하나 거래소에 상장되진 않음&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;헤지펀드&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;절대수익 추구, 다양한 투자 전략 구사&lt;/li&gt;
&lt;li&gt;레버리지(차입)활용, 롱숏 전략 등 활용&lt;/li&gt;
&lt;li&gt;높은 최소투자금액과 성과보수 구조&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;프라이빗 에쿼티(PE)&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;비상장 기업에 투자하거나 상장기업을 비상장화&lt;/span&gt;&lt;/b&gt;&lt;/i&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;비상장 기업 투자:&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;PE 펀드는 성장 가능성이 높은 비상장 기업에 투자하여 지분을 취득합니다&lt;/li&gt;
&lt;li&gt;주로&amp;nbsp;성숙&amp;nbsp;단계의&amp;nbsp;중견기업이나&amp;nbsp;고성장&amp;nbsp;기업을&amp;nbsp;대상으로&amp;nbsp;합니다&lt;/li&gt;
&lt;li&gt;벤처캐피털이&amp;nbsp;초기&amp;nbsp;스타트업에&amp;nbsp;집중하는&amp;nbsp;것과는&amp;nbsp;달리,&amp;nbsp;PE는&amp;nbsp;이미&amp;nbsp;사업&amp;nbsp;모델이&amp;nbsp;검증된&amp;nbsp;기업을&amp;nbsp;선호합니다&lt;/li&gt;
&lt;li&gt;지분 인수 비율은 소수지분부터 경영권 확보가 가능한 다수지분까지 다양합니다&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;u&gt;&lt;b&gt;상장기업 비상장화:&lt;/b&gt;&lt;/u&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;공개 시장에서 주식을 매입해 기업을 완전히 인수한 후 상장폐지시키는 과정입니다&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #99cefa;&quot;&gt;이를&amp;nbsp;'공개매수(Takeover)'&amp;nbsp;또는&amp;nbsp;'LBO(Leveraged&amp;nbsp;Buyout,&amp;nbsp;차입매수)'라고&amp;nbsp;합니다&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;보통&amp;nbsp;기업&amp;nbsp;가치가&amp;nbsp;시장에서&amp;nbsp;저평가되었다고&amp;nbsp;판단될&amp;nbsp;때&amp;nbsp;진행됩니다&lt;/li&gt;
&lt;li&gt;비상장화&amp;nbsp;후에는&amp;nbsp;단기적&amp;nbsp;실적&amp;nbsp;압박&amp;nbsp;없이&amp;nbsp;장기적&amp;nbsp;구조조정과&amp;nbsp;혁신에&amp;nbsp;집중할&amp;nbsp;수&amp;nbsp;있습니다&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;장기적 관점에서 기업 가치 개선 후 매각&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;낮은 유동성, 장기 투자 기간&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;벤처캐피털 펀드&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;초기 단계 스타트업에 투자&lt;/li&gt;
&lt;li&gt;높은 위험과 높은 수익 잠재력&lt;/li&gt;
&lt;li&gt;포트폴리요 접근 방식(여러 기업에 분산투자)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;목표일자 펀드&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;특정 목표 날짜(주로 은퇴)에 맞춰 자산 배분 자동 조정&lt;/li&gt;
&lt;li&gt;시간이 지날수록 보수적인 자산 배분으로 변화&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;ESG 펀드&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;환경(E), 사회(S), 지배구조(G) 기준을 고려한 투자&lt;/li&gt;
&lt;li&gt;재무적 수익과 함께 사회적 영향 추구&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;퇴직연금 펀드&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;은퇴 대비 장기 저축 목적&lt;/li&gt;
&lt;li&gt;세제 혜택 제공&lt;/li&gt;
&lt;li&gt;보수적인 자산 배분 경향&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style5&quot; /&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #333333; text-align: center;&quot;&gt;There's a tremendous bias against taking risks. Everyone is trying to optimize their ass-covering.&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #333333; text-align: center;&quot;&gt;-Elon Musk-&lt;/span&gt;&lt;/span&gt;&lt;/blockquote&gt;</description>
      <category>경제</category>
      <category>경제</category>
      <category>투자</category>
      <category>펀드</category>
      <author>dongsunseng</author>
      <guid isPermaLink="true">https://dongsunseng.tistory.com/152</guid>
      <comments>https://dongsunseng.tistory.com/entry/%EA%B2%BD%EC%A0%9C-4-%ED%8E%80%EB%93%9C%EC%99%80-%EA%B7%B8-%EC%A2%85%EB%A5%98#entry152comment</comments>
      <pubDate>Mon, 19 May 2025 22:34:42 +0900</pubDate>
    </item>
    <item>
      <title>[경제] 3. 세력 &amp;amp; 마켓 메이커</title>
      <link>https://dongsunseng.tistory.com/entry/%EA%B2%BD%EC%A0%9C-3-%EC%84%B8%EB%A0%A5-%EB%A7%88%EC%BC%93-%EB%A9%94%EC%9D%B4%EC%BB%A4</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;투자 시장에 있다 보면 '세력'과 '마켓 메이커'라는 단어를 쉽게 접할 수 있습니다.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;두 용어의 개념은 혼재되기 쉬우며 그 차이에 대한 부분이 불분명하게 느껴질 수 있습니다.&amp;nbsp;&lt;/p&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;세력&lt;/b&gt;&lt;/h4&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;금융 시장에서 '세력'이라는 단어는 시장에 &lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;i&gt;&lt;b&gt;큰 영향력을 행사할 수 있는 자본력과 정보력을 갖춘 개인이나 기관(단체)&lt;/b&gt;&lt;/i&gt;&lt;/span&gt;를 지칭합니다.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;세력의 주요 특징:&lt;/b&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;대규모 자본: 시장 가격에 영향을 줄 정도로 충분한 자금력을 보유하고 있습니다.&lt;/li&gt;
&lt;li&gt;정보 우위: 일반 투자자보다 더 많은 정보나 &lt;b&gt;내부 정보&lt;/b&gt;에 접근할 가능성이 있습니다.&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;가격 형성 능력&lt;/b&gt;&lt;/span&gt;: 대량 매수나 매도를 통해 주가나 자산 가격에 일시적인 영향을 미칠 수 있습니다.&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;세력은 기관 투자자(투자 은행, 헤지 펀드, 연기금 등), 대형 개인 투자자나 자산가 그룹, 기업 내부자나 대주주, 전문 투자 그룹 등을 모두 포함하는 개념입니다.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;세력은 위에서 언급했듯이 시장 가격에 의도적으로 영향을 줄 수 있는 자본력이 있기 때문에, 주가 띄우기(pump and dump)와 같은 시세 조작 행위를 통해 일반 투자자(개미)의 매수와 매도를 유도할 수 있습니다.&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;당연히 갑작스러운 거래량 증가와 가격 변동이 모두 세력때문은 아니지만, 일반 투자자로써 우위를 점하려면 세력의 의도를 파악하려고 노력하여 이를 경계하며 투자하는 것이 이상적입니다.&amp;nbsp;&lt;/p&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;마켓 메이커 (Market Maker)&lt;/b&gt;&lt;/h4&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;마켓 메이커는 금융 시장에서 유동성을 제공하는 것이 주요 역할입니다.&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;주식, 옵션, 채권, ETF, 암호화폐 등 다양한 시장에서 활동합니다.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;기본적으로 마켓 메이커는 증권이나 기타 금융 상품에 대해 항상 매수와 매도 호가를 제시합니다.&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;이를 통해 모든 투자자들은 언제든지 원하는 자산을 사고팔 수 있도록 시장의 유동성을 확보합니다.&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;하지만 매수와 매도 호가의 차이가 크면 시장의 유동성이 적다는 의미로 원하는 가격에 매수 혹은 매도를 할 수 없는 불편함이 생깁니다.&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;따라서, 마켓 메이커는 호가 스프레드를 유지합니다.&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;다시 말해서, 매수가(bid)와 매도가(ask)사이의 차이(스프레드)를 좁게 유지함으로써 거래 비용을 낮추고 시장 효율성을 높입니다.&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;갑작스러운 대량 매수나 매도 주문이 들어올 때 반대 포지션을 취함으로써 급격한 가격 변동을 완화하는 등의 역할을 하게 됩니다.&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;유명한 대형 마켓 메이커로는 시타델(Citadel Securities), Virtu Financial, GTS, IMC 등이 있습니다.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;이러한 마켓 메이커들은 시장의 원활한 작동을 위한 필수적인 시장 참가자이지만, 때로는 이들의 활동이 이해 상충이나 시장 조작 우려를 불러일으키기도 하여 규제 당국의 감시 대상이 되기도 합니다.&lt;/p&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;세력 vs. 마켓 메이커&lt;/b&gt;&lt;/h4&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;&lt;b&gt;법적 지위와 규제&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;마켓 메이커: 공식적으로 인정된 시장 참여자로, 규제 기관의 감독을 받으며 특정 의무와 권한을 가집니다.&lt;/li&gt;
&lt;li&gt;세력: 비공식적인 개념으로, 법적 지위가 없으며 때로는 불법적인 시장 조작 행위와 연관될 수 있습니다.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;목적과 동기&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;마켓 메이커: 시장 유동성 제공이 주 목적이며, 매수-매도 스프레드에서 소액의 이익을 지속적으로 얻는 비즈니스 모델입니다.&lt;/li&gt;
&lt;li&gt;세력: 주로 단기적 가격 변동을 통한 이익 추구가 목적이며, 특정 종목의 가겨을 자신에게 유리하게 움직이려는 의도가 있을 수 있습니다.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;거래 방식&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;마켓 메이커: 양방향 호가(매수/매도)를 항상 제시하며 투명하게 운영됩니다.&lt;/li&gt;
&lt;li&gt;세력: 투명하게 운영될 의무가 없기 때문에 여러 계좌를 통한 분산 매매, 특정 시간대 집중 매매 등 다양한 전략을 사용합니다.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;시장 기여도&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;마켓 메이커: 시장 안정화, 유동성 공급, 가격 발견 기능 등 시장의 효율적 작동에 기여합니다.&lt;/li&gt;
&lt;li&gt;세력: 단기적으로 시장을 왜곡할 수 있으며, 본인들의 이익을 위해 소액 투자자들에게 손실을 줄 가능성이 있습니다.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;투명성&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;마켓 메이커: 공식적으로 등록되어 있기 때문에 활동이 비교적 투명합니다.&lt;/li&gt;
&lt;li&gt;세력: 신원과 활동이 불투명하여 시장에서 그 존재를 명확하게 확인하기 어렵습니다.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;개인 투자자들은 세력들을 상대로 어떻게 투자해야 하는가&lt;/b&gt;&lt;/h4&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;세력을 경계해야 하는 이유&lt;/b&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;단타 투기 세력들은 단기간에 주가를 급등시킨 후 고점에서 매도하는 Pump and Dump 전략을 구사합니다.&lt;/li&gt;
&lt;li&gt;주로 유동성이 낮은 소형나/테마주, 알트코인 등에서 활동하는데 이는 유동성이 낮아야 적은 자본으로도 주가를 급등시킬 수 있기 때문입니다.&amp;nbsp;&lt;/li&gt;
&lt;li&gt;이런 세력들은 여러 계좌를 이용해 분산 매매하기 때문에 추적이 어렵고, 회사 내부 정보 혹은 개인 투자자들은 알기 힘든 시장 정보를 미리 알고 있기 때문에 포지션을 취할 때 상당히 유리합니다.&amp;nbsp;&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;개인투자자의 방어 전략&lt;/b&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;&lt;b&gt;유동성 낮은 종목 주의&lt;/b&gt;&lt;/span&gt;: 암호화폐를 제외한 다른 경우에는 Pump and Dump 등의 주가 조작 전략에 대한 규제가 더 강합니다. 하지만 유동성이 낮은 소형주나 알트 코인의 경우 세력의 가격 조작이 용이하기 때문에 특히 유의하면서 투자해야 합니다.&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;거래량 주의&lt;/b&gt;&lt;/span&gt;: 개인 투자자는 절대 세력을 이길 수 없습니다. 따라서, 기본적 &amp;amp; 기술적 분석을 하며 세력의 의도를 파악하는 것이 중요한데 이때 거래량이 엄청나게 중요한 역할을 합니다. 갑작스러운 거래량 증가와 가격 급등 등을 유의하며 투자해야 합니다.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style5&quot; /&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #333333; text-align: center;&quot;&gt;I think it behooves one to have an internal locus of control. You think that you have control overyour own destiny.&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #333333; text-align: center;&quot;&gt;-Elon Musk-&lt;/span&gt;&lt;/span&gt;&lt;/blockquote&gt;</description>
      <category>경제</category>
      <category>경제</category>
      <category>마켓 메이커</category>
      <category>세력</category>
      <category>투자</category>
      <author>dongsunseng</author>
      <guid isPermaLink="true">https://dongsunseng.tistory.com/151</guid>
      <comments>https://dongsunseng.tistory.com/entry/%EA%B2%BD%EC%A0%9C-3-%EC%84%B8%EB%A0%A5-%EB%A7%88%EC%BC%93-%EB%A9%94%EC%9D%B4%EC%BB%A4#entry151comment</comments>
      <pubDate>Mon, 19 May 2025 14:01:10 +0900</pubDate>
    </item>
    <item>
      <title>[코인 투자] 2. 다우 이론과 6가지 국면</title>
      <link>https://dongsunseng.tistory.com/entry/%EC%BD%94%EC%9D%B8-%ED%88%AC%EC%9E%90-2-%EB%8B%A4%EC%9A%B0-%EC%9D%B4%EB%A1%A0%EA%B3%BC-6%EA%B0%80%EC%A7%80-%EA%B5%AD%EB%A9%B4</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;다우 이론과 6가지 국면은 결국 프렉탈 이론과 일맥상통하는 부분이 있습니다.&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;코인 투자 그 이전에 주식 투자에서까지 투자자들이 참고하는 모든 이론들은 과거 차트를 기반으로 형성되었고, 과거 차트의 모양새를 참고하여 전략을 수립하는 방식을 과거 프렉탈을 참고한다라고 합니다.&amp;nbsp;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;따라서, 차트의 단기 반등할 지점을 찾는 &quot;기술적 분석&quot;에 있어서 과거 차트를 분석하는 것은 굉장히 중요합니다.&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;이런 시장의 흐름은 다우 이론을 바탕으로 파악할 수 있습니다.&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;다우 이론을 바탕으로 국면을 파악한다는 것은 장기적 관점에서 가격의 방향성을 예측해본다는 것에 목적이 있습니다.&amp;nbsp;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;예를 들어, 상승 국면이라고 한다면 상승할 확률이 높다는 것을 기반으로 눌림롱을 잡거나 하락 국면이라고 파악이 되면 오름숏을 잡는 등의 전략을 수립할 수 있게 됩니다.&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;즉, 현재 시점에서의 추세 매매와 역추세 매매가 무엇인지를 파악하는 것 입니다.&amp;nbsp;&lt;/p&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;다우 이론&lt;/b&gt;&lt;/h4&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;다우 이론은 주식 시장의 장기적인 추세를 파악하고 예측하기 위해 개발된 기술적 분석 방법으로, 찰스 다우가 창시한 다우존스 평균 주가를 바탕으로 만들어졌습니다.&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;다우 이론은 시장이 상승세일 때 고점과 저점이 상승하고, 하락세일 때 고점과 저점이 하락한다라는 개념을 기반으로 시장의 장기적인 추세를 파악할 수 있다는 이론입니다.&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;시황 분석을 읽다보면 특정 저점이 깨지지 않는 한 상승을 볼겁니다 혹은 특정 고점이 깨지지 않는 이상 하락이 우세할거라고 생각됩니다 와 같은 문장을 많이 보았을텐데, 이는 다우 이론을 기반으로 한 분석이라고 할 수 있습니다.&lt;/p&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;다우 이론의 대전제&lt;/b&gt;&lt;/h4&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;평균은 모든 것을 반영한다&lt;/b&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;/span&gt;이 부분은 아래 링크의 내용을 발췌했습니다 (개인 투자를 하며 매우 중요한 내용이라고 생각됩니다)&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.imfnsec.com/systemtrade/st02090201.jsp&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://www.imfnsec.com/systemtrade/st02090201.jsp&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&quot;다우 이론에 따르면 시장에서 예상되고 있거나 이미 알려진 모든 정보는 시장 평균에 모두 반영되어 있으며, 예상치 못한 하나의 사건이 일어나면 이는 즉각적으로 시장에 반영된다.&amp;nbsp;&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;이것의 의미는 흔히 우리는 어떤 상승 요인이 되는 재료가 발생하더라도 가격이 상승하지 않고 하락하는 것을 흔히 볼 수 있다.&lt;/b&gt; &lt;/i&gt;&lt;/li&gt;
&lt;li&gt;이것은 미래 가격에 대한 &quot;예상&quot; 에 따라 시장 가격이 변동되는 것이므로 전혀 이상하거나 잘못된 것이 아니라 오히려 자연스러운 것이다.&lt;/li&gt;
&lt;li&gt;어떤 종목이나 상품에 대해 상승 요인이 나오는 재료의 성장 기대치가 가령 5% 이었다면 이러한 기대요인에 따라 이미 가격은 형성되어 있는 상태이고 실제 발표는 3% 에 나왔다면 오히려 가격하락의 요인으로 작용할 수 있는 것이기 때문이다.&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;그러므로 다우 이론에 따라 분석한다면 뉴스가 나오는 시점을 잡아 거래하는 방법보다는 앞으로 나올 예상 정보를 바탕으로 하는 통계 분석이 훨씬 신뢰도가 높을 것이라는 것을 제시해 주고 있다.&quot;&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;시장은 참가자의 행동, 심리 상태 등 모든 정보를 반영한다&lt;/b&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;다우 이론은 주가를 통해 시장의 상황, 경제 상황 등을 파악할 수 있다고 주장합니다&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;시장은 일정한 추세를 갖고 추세는 상승, 하락, 횡보로 나뉜다&lt;/b&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;추세를 정확하게 파악한 투자자들은 향후 주가의 방향성을 예측할 수 있기 때문에 추세에 맞는 전략을 통해서 더 큰 수익을 가져갈 수 있게 됩니다&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;거래량은 시장 가격 추세 변동에 유용한 정보를 제공한다&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;거래량에 관련된 부분은 추후에 포스트로 작성해보도록 하겠습니다&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;기존의 가격 추세는 전환될 때까지 계속된다&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;지수는 상호 연관성이 있고, 관계를 통해 시장 상태를 파악한다
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;다우 지수에 관한 내용입니다&lt;/li&gt;
&lt;li&gt;다우 지수는 주요 주식들의 가격을 평균하여 계산합니다&lt;/li&gt;
&lt;li&gt;평균은 다우 이론에 있어서 아주 중요한 개념으로 시장의 전반적인 상황을 보여준다고 주장합니다&lt;/li&gt;
&lt;li&gt;다우 지수는 투자자들이 시장의 상황을 파악하고 이를 통해 매수 매도 전략을 수립하는 데 중요한 지표로 사용됩니다&lt;/li&gt;
&lt;li&gt;암호화폐 투자에 활용되는 개념은 아닙니다&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;tabtrader_academy_the_dow_theory_bull_market_and_bear_market_phases_3f8155a538.png&quot; data-origin-width=&quot;1280&quot; data-origin-height=&quot;720&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/tQkhz/btsNJYgqqKi/Nz8K7uJ0KB7M0noLEAvOQ0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/tQkhz/btsNJYgqqKi/Nz8K7uJ0KB7M0noLEAvOQ0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/tQkhz/btsNJYgqqKi/Nz8K7uJ0KB7M0noLEAvOQ0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FtQkhz%2FbtsNJYgqqKi%2FNz8K7uJ0KB7M0noLEAvOQ0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1280&quot; height=&quot;720&quot; data-filename=&quot;tabtrader_academy_the_dow_theory_bull_market_and_bear_market_phases_3f8155a538.png&quot; data-origin-width=&quot;1280&quot; data-origin-height=&quot;720&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;다우는 시장 참여자들의 심리 상태를 위와 같이 6가지의 국면(Phase)로 나눠서 파악합니다.&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;이 6가지 국면이 하나의 사이클을 이뤄서 계속 순환합니다.&lt;/p&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;다우 이론의 6가지 국면&lt;/b&gt;&lt;/h4&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;&lt;b&gt;매집 국면: 시장에 &lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;공포심&lt;/span&gt;이 막연하고 합리적인 판단을 할 수 없는 시기&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;일반적으로 하락장에서는 평범한 투자자들(개미)은 보유하던 종목을 포기하고 헐값에라도 매도하려는 경향이 강함&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;반대로 전문 투자자들은 개미들이 던지는 물량을 매집하면서 저렴한 가격에 개미들의 물량을 받아먹음&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;따라서 겉으로 보이는 시장 상황은 굉장히 안 좋은 상태로 비춰지게 됨&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;역설적이지만 다우 이론에 다르면 사람들이 패닉셀(panic sell)하는 시점이 강세 시장의 첫 번째 국면임&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;코로나 쇼크가 그 예시&lt;/li&gt;
&lt;li&gt;공포심이 막연한 것을 매집 국면이라고 파악하고 물량을 매집하는 투자자들은 돈을 벌고, 공포심으로 인해 투자를 포기하는 투자자들은 기회를 못 잡게 되는 것임&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;상승 국면: &lt;span style=&quot;background-color: #ffc9af;&quot;&gt;상승에 대한 기대감&lt;/span&gt;이 시장에 반영되는 시기&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;매집 국면에서 악재로 작용한 요소들이 하나 둘씩 해소되기 시작함&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;결제 불황이 차츰 해결되고 기업들의 재정 상태도 회복됨&lt;/li&gt;
&lt;li&gt;따라서, 일반 투자자들의 관심이 높아지기 시작함&lt;/li&gt;
&lt;li&gt;관심의 증가는 차트에서 &lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;거래량&lt;/b&gt;&lt;/span&gt;으로 나타나고 거래량이 높아짐에 따라 가격도 상승함&lt;/li&gt;
&lt;li&gt;상승 국면의 절정 부근에서는 신고가를 갱신하는 종목이 나타남&lt;/li&gt;
&lt;li&gt;상승 국면은 일반 투자자들에게도 기회의 장이 됨&lt;/li&gt;
&lt;li&gt;&lt;b&gt;투자를 공부하는 이유도 기본적으로 시장 국면을 판단하고 시장의 흐름에 탑승하기 위함에 있음&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;일반 투자자들은 상승 국면에서 크게 두 가지 모습을 보이게 됨:
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;매수를 망설이는 부류: 이전 사이클의 하락 국면에서 큰 손실을 입었기 때문에 이에 대한 트라우마로 매수를 망설이게 됨&lt;/li&gt;
&lt;li&gt;적극적으로 매수하는 부류: 매집 국면에서 매집했던 물량들을 이때부터 조금씩 현금화(매도)하기 시작함&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;상승 국면이 고조되기 시작하면 결국 시장은 과열 국면에 집입하게 됨&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;과열 국면: &lt;span style=&quot;background-color: #ffc9af;&quot;&gt;시장이 과열된 것을 모른채 엄청 적극적으로 매수하는 투자자들이 많은 시기&lt;/span&gt;&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;과열 국면에서는 &lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;모든 시장의 지표가 상승을 가리키게 됨&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;투자 경험이 없는 개미들도 뉴스나 주변 이야기를 듣고 적극적으로 시장에 참여하니 &lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;내재 가치가 낮은 종목들도 덩달아 가격이 상승하는 모습을 보이게 됨&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;21년도 5월이 그 예시임&lt;/li&gt;
&lt;li&gt;전문 투자자들은 과열 국면에 대부분의 물량을 정리함&lt;/li&gt;
&lt;li&gt;&lt;b&gt;곧 터질 폭탄을 전문가들이 일반 개미들에게 떠넘기는 시기&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;분산 국면: 가격 거품이 터지기 시작하면서 &lt;span style=&quot;background-color: #ffc9af;&quot;&gt;엄청난 낙폭&lt;/span&gt;을 보이는 시기&lt;/b&gt;&lt;br /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;과열 국면 시기와는 달리 경제 지표가 좋지 않고 점점 매수하려는 사람은 줄고 매도하려는 사람은 늘어나게 됨&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;가격 거품이 터지기 시작하면서 엄청난 낙폭을 보임&lt;/li&gt;
&lt;li&gt;따라서 다시금 시장에는&lt;b&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt; 공포&lt;/span&gt;가 도래하기 시작함&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;공포 국면&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;여기서 일부 투자자들은 지금 하락은 그냥 건강한 조정일뿐이라고 생각하고 적극적으로 매수하기도 함&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;전문 투자자들은 하락폭을 가늠하면서 다시 매집 국면을 준비하기 위한 전략을 세움&lt;/li&gt;
&lt;li&gt;공포 국면이 길어지다 보면 &lt;b&gt;침체&lt;/b&gt; 국면이 시작됨&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;침체 국면&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;지친 개미들의 &lt;b&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;실망 매물&lt;/span&gt;&lt;/b&gt;이 계속 나오게 되면서 주가는 하락하고 &lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;시장에는 온통 개미들의 곡소리만 가득한 시기&lt;/b&gt;&lt;/span&gt;가 됨&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;하지만 시장의 사이클은 계속 순환하기 때문에 시간이 지날수록 하락폭은 줄어들며 다시 시장이 전환되는 시기가 오게 됨&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;하락폭이 저점에서 둥글게 말리면서 상승을 위한 변곡이 생기게 됨&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;침체 국면의 마지막 시기가 오면 차트를 봤을 때 가격은 그대로인데 전체 거래량은 많아지게 되는 이상한 현상을 발견하게 됨&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;일반 투자자들은 공포에 빠져서 매도를 이어가고 있지만 전문 투자자들(세력)들이 다시 그 매도 물량을 받아먹고 있다는 뜻임&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;침체 국면이 충분히 진행되면 다시 매집 국면이 시작됨&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style3&quot; /&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;Reference&lt;/b&gt;&lt;/h4&gt;
&lt;figure id=&quot;og_1746417002628&quot; contenteditable=&quot;false&quot; data-ke-type=&quot;opengraph&quot; data-ke-align=&quot;alignCenter&quot; data-og-type=&quot;website&quot; data-og-title=&quot;다우 이론의 기본 개념&quot; data-og-description=&quot;다우 이론은 투자와 경제 분야에서 중요한 개념 중 하나로, 특히 주식 시장에 관한 이론입니다. 이 이론은 찰스 다우(Charles Dow)에 의해 개발되었으며, 다우 지수(Dow&amp;hellip;&quot; data-og-host=&quot;wikidocs.net&quot; data-og-source-url=&quot;https://wikidocs.net/213438&quot; data-og-url=&quot;https://wikidocs.net/213438&quot; data-og-image=&quot;https://scrap.kakaocdn.net/dn/H25xU/hyYMgDLvEv/VhX0KNkaf6bInNN1tfddGK/img.png?width=259&amp;amp;height=337&amp;amp;face=0_0_259_337&quot;&gt;&lt;a href=&quot;https://wikidocs.net/213438&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot; data-source-url=&quot;https://wikidocs.net/213438&quot;&gt;
&lt;div class=&quot;og-image&quot; style=&quot;background-image: url('https://scrap.kakaocdn.net/dn/H25xU/hyYMgDLvEv/VhX0KNkaf6bInNN1tfddGK/img.png?width=259&amp;amp;height=337&amp;amp;face=0_0_259_337');&quot;&gt;&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;og-text&quot;&gt;
&lt;p class=&quot;og-title&quot; data-ke-size=&quot;size16&quot;&gt;다우 이론의 기본 개념&lt;/p&gt;
&lt;p class=&quot;og-desc&quot; data-ke-size=&quot;size16&quot;&gt;다우 이론은 투자와 경제 분야에서 중요한 개념 중 하나로, 특히 주식 시장에 관한 이론입니다. 이 이론은 찰스 다우(Charles Dow)에 의해 개발되었으며, 다우 지수(Dow&amp;hellip;&lt;/p&gt;
&lt;p class=&quot;og-host&quot; data-ke-size=&quot;size16&quot;&gt;wikidocs.net&lt;/p&gt;
&lt;/div&gt;
&lt;/a&gt;&lt;/figure&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;figure id=&quot;og_1746418767364&quot; contenteditable=&quot;false&quot; data-ke-type=&quot;opengraph&quot; data-ke-align=&quot;alignCenter&quot; data-og-type=&quot;website&quot; data-og-title=&quot;iM증권&quot; data-og-description=&quot;주식, 선물, 외환시장 등 다양하게 적용되는 다우이론에 대한 강좌입니다. 다우이론의 개념 실제적으로 모든 기술적 분석의 시작은 다우이론을 알고 난 후에야 시작하는 것이 순서일 것이다. 가&quot; data-og-host=&quot;www.imfnsec.com&quot; data-og-source-url=&quot;https://www.imfnsec.com/systemtrade/st02090201.jsp&quot; data-og-url=&quot;https://www.imfnsec.com/systemtrade/st02090201.jsp&quot; data-og-image=&quot;&quot;&gt;&lt;a href=&quot;https://www.imfnsec.com/systemtrade/st02090201.jsp&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot; data-source-url=&quot;https://www.imfnsec.com/systemtrade/st02090201.jsp&quot;&gt;
&lt;div class=&quot;og-image&quot; style=&quot;background-image: url();&quot;&gt;&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;og-text&quot;&gt;
&lt;p class=&quot;og-title&quot; data-ke-size=&quot;size16&quot;&gt;iM증권&lt;/p&gt;
&lt;p class=&quot;og-desc&quot; data-ke-size=&quot;size16&quot;&gt;주식, 선물, 외환시장 등 다양하게 적용되는 다우이론에 대한 강좌입니다. 다우이론의 개념 실제적으로 모든 기술적 분석의 시작은 다우이론을 알고 난 후에야 시작하는 것이 순서일 것이다. 가&lt;/p&gt;
&lt;p class=&quot;og-host&quot; data-ke-size=&quot;size16&quot;&gt;www.imfnsec.com&lt;/p&gt;
&lt;/div&gt;
&lt;/a&gt;&lt;/figure&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style5&quot; /&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #333333; text-align: center;&quot;&gt;You should take the approach that you're wrong. Your goal is to be less wrong.&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #333333; text-align: center;&quot;&gt;- Elon Musk -&lt;/span&gt;&lt;/span&gt;&lt;/blockquote&gt;</description>
      <category>투자</category>
      <category>국면</category>
      <category>다우 이론</category>
      <category>추세</category>
      <category>코인</category>
      <category>투자</category>
      <author>dongsunseng</author>
      <guid isPermaLink="true">https://dongsunseng.tistory.com/150</guid>
      <comments>https://dongsunseng.tistory.com/entry/%EC%BD%94%EC%9D%B8-%ED%88%AC%EC%9E%90-2-%EB%8B%A4%EC%9A%B0-%EC%9D%B4%EB%A1%A0%EA%B3%BC-6%EA%B0%80%EC%A7%80-%EA%B5%AD%EB%A9%B4#entry150comment</comments>
      <pubDate>Mon, 5 May 2025 13:29:48 +0900</pubDate>
    </item>
    <item>
      <title>[코인 투자] 1. 내가 다시 보려고 저장해두는 코인 투자 참고 자료</title>
      <link>https://dongsunseng.tistory.com/entry/%EC%BD%94%EC%9D%B8-%ED%88%AC%EC%9E%90-1-%EB%82%B4%EA%B0%80-%EB%8B%A4%EC%8B%9C-%EB%B3%B4%EB%A0%A4%EA%B3%A0-%EC%A0%80%EC%9E%A5%ED%95%B4%EB%91%90%EB%8A%94-%EC%BD%94%EC%9D%B8-%ED%88%AC%EC%9E%90-%EC%B0%B8%EA%B3%A0-%EC%9E%90%EB%A3%8C</link>
      <description>&lt;h4 data-ke-size=&quot;size20&quot;&gt;1. 피보나치 되돌림&lt;/h4&gt;
&lt;figure id=&quot;og_1746419417429&quot; contenteditable=&quot;false&quot; data-ke-type=&quot;opengraph&quot; data-ke-align=&quot;alignCenter&quot; data-og-type=&quot;article&quot; data-og-title=&quot;피보나치 되돌림에 대하여&quot; data-og-description=&quot;오늘은 기술적 분석기법 중에 지지, 저항을 확인하는 데 수단으로 많이 활용하는 피보나치 되돌림(fibonacci retracement)에 대하여 이야기 해보려 합니다.참고로 어린 시절 수학시간에 배웠던 피보나&quot; data-og-host=&quot;www.chartistlab.com&quot; data-og-source-url=&quot;https://www.chartistlab.com/post/%ED%94%BC%EB%B3%B4%EB%82%98%EC%B9%98-%EB%90%98%EB%8F%8C%EB%A6%BC%EC%97%90-%EB%8C%80%ED%95%98%EC%97%AC&quot; data-og-url=&quot;https://www.chartistlab.com/post/피보나치-되돌림에-대하여&quot; data-og-image=&quot;https://scrap.kakaocdn.net/dn/jfLRW/hyYM3w6YAM/RNeriWcfWi9nkpDh1eRPk0/img.png?width=768&amp;amp;height=474&amp;amp;face=0_0_768_474,https://scrap.kakaocdn.net/dn/bDIrce/hyYRj58qit/C2IZ6JjdlJ64Immm5abgD0/img.png?width=768&amp;amp;height=474&amp;amp;face=0_0_768_474&quot;&gt;&lt;a href=&quot;https://www.chartistlab.com/post/%ED%94%BC%EB%B3%B4%EB%82%98%EC%B9%98-%EB%90%98%EB%8F%8C%EB%A6%BC%EC%97%90-%EB%8C%80%ED%95%98%EC%97%AC&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot; data-source-url=&quot;https://www.chartistlab.com/post/%ED%94%BC%EB%B3%B4%EB%82%98%EC%B9%98-%EB%90%98%EB%8F%8C%EB%A6%BC%EC%97%90-%EB%8C%80%ED%95%98%EC%97%AC&quot;&gt;
&lt;div class=&quot;og-image&quot; style=&quot;background-image: url('https://scrap.kakaocdn.net/dn/jfLRW/hyYM3w6YAM/RNeriWcfWi9nkpDh1eRPk0/img.png?width=768&amp;amp;height=474&amp;amp;face=0_0_768_474,https://scrap.kakaocdn.net/dn/bDIrce/hyYRj58qit/C2IZ6JjdlJ64Immm5abgD0/img.png?width=768&amp;amp;height=474&amp;amp;face=0_0_768_474');&quot;&gt;&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;og-text&quot;&gt;
&lt;p class=&quot;og-title&quot; data-ke-size=&quot;size16&quot;&gt;피보나치 되돌림에 대하여&lt;/p&gt;
&lt;p class=&quot;og-desc&quot; data-ke-size=&quot;size16&quot;&gt;오늘은 기술적 분석기법 중에 지지, 저항을 확인하는 데 수단으로 많이 활용하는 피보나치 되돌림(fibonacci retracement)에 대하여 이야기 해보려 합니다.참고로 어린 시절 수학시간에 배웠던 피보나&lt;/p&gt;
&lt;p class=&quot;og-host&quot; data-ke-size=&quot;size16&quot;&gt;www.chartistlab.com&lt;/p&gt;
&lt;/div&gt;
&lt;/a&gt;&lt;/figure&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;2. 추세 기반 피보나치 확장&lt;/b&gt;&lt;/span&gt;&lt;/h4&gt;
&lt;figure id=&quot;og_1746419439860&quot; contenteditable=&quot;false&quot; data-ke-type=&quot;opengraph&quot; data-ke-align=&quot;alignCenter&quot; data-og-type=&quot;article&quot; data-og-title=&quot;추세기반 피보나치 확장이란?&quot; data-og-description=&quot;안녕하세요. 홀더 입니다.오늘은 &amp;quot;추세기반 피보나치 확장&amp;quot;에 대하여 다뤄 보려고 합니다.&amp;quot;피보나치 되돌림&amp;quot;은 아마 차트 공부를 시작하신 10명 중 8~9명은 들어왔을 거라고 생각됩니다.하지만, &quot; data-og-host=&quot;www.chartistlab.com&quot; data-og-source-url=&quot;https://www.chartistlab.com/post/%EC%B6%94%EC%84%B8%EA%B8%B0%EB%B0%98-%ED%94%BC%EB%B3%B4%EB%82%98%EC%B9%98-%ED%99%95%EC%9E%A5%EC%9D%B4%EB%9E%80&quot; data-og-url=&quot;https://www.chartistlab.com/post/추세기반-피보나치-확장이란&quot; data-og-image=&quot;https://scrap.kakaocdn.net/dn/u6RGT/hyYPeEhvXO/jRbfdYYWiylK54mlkppVb0/img.png?width=372&amp;amp;height=389&amp;amp;face=71_120_191_251,https://scrap.kakaocdn.net/dn/bI02l0/hyYRpL28fI/EjDJZNeHKtbJpw3lWsckt0/img.png?width=372&amp;amp;height=389&amp;amp;face=71_120_191_251&quot;&gt;&lt;a href=&quot;https://www.chartistlab.com/post/%EC%B6%94%EC%84%B8%EA%B8%B0%EB%B0%98-%ED%94%BC%EB%B3%B4%EB%82%98%EC%B9%98-%ED%99%95%EC%9E%A5%EC%9D%B4%EB%9E%80&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot; data-source-url=&quot;https://www.chartistlab.com/post/%EC%B6%94%EC%84%B8%EA%B8%B0%EB%B0%98-%ED%94%BC%EB%B3%B4%EB%82%98%EC%B9%98-%ED%99%95%EC%9E%A5%EC%9D%B4%EB%9E%80&quot;&gt;
&lt;div class=&quot;og-image&quot; style=&quot;background-image: url('https://scrap.kakaocdn.net/dn/u6RGT/hyYPeEhvXO/jRbfdYYWiylK54mlkppVb0/img.png?width=372&amp;amp;height=389&amp;amp;face=71_120_191_251,https://scrap.kakaocdn.net/dn/bI02l0/hyYRpL28fI/EjDJZNeHKtbJpw3lWsckt0/img.png?width=372&amp;amp;height=389&amp;amp;face=71_120_191_251');&quot;&gt;&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;og-text&quot;&gt;
&lt;p class=&quot;og-title&quot; data-ke-size=&quot;size16&quot;&gt;추세기반 피보나치 확장이란?&lt;/p&gt;
&lt;p class=&quot;og-desc&quot; data-ke-size=&quot;size16&quot;&gt;안녕하세요. 홀더 입니다.오늘은 &quot;추세기반 피보나치 확장&quot;에 대하여 다뤄 보려고 합니다.&quot;피보나치 되돌림&quot;은 아마 차트 공부를 시작하신 10명 중 8~9명은 들어왔을 거라고 생각됩니다.하지만,&lt;/p&gt;
&lt;p class=&quot;og-host&quot; data-ke-size=&quot;size16&quot;&gt;www.chartistlab.com&lt;/p&gt;
&lt;/div&gt;
&lt;/a&gt;&lt;/figure&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;3. 다우 이론&lt;/h4&gt;
&lt;figure id=&quot;og_1746428971590&quot; contenteditable=&quot;false&quot; data-ke-type=&quot;opengraph&quot; data-ke-align=&quot;alignCenter&quot; data-og-type=&quot;article&quot; data-og-title=&quot;추세를 확인하는 이론이 있다? Dow Theory&quot; data-og-description=&quot;안녕하세요 오늘은 다우이론(Dow Theory)에 대해서 설명을 드리려고 합니다. 다우 이론은 찰스 다우라는 인물이 만들어낸 이론으로써 여러가지 원칙과 의미가 있는대요, Bull market이 오기 전의 매집&quot; data-og-host=&quot;www.chartistlab.com&quot; data-og-source-url=&quot;https://www.chartistlab.com/post/%EC%B6%94%EC%84%B8%EB%A5%BC-%ED%99%95%EC%9D%B8%ED%95%98%EB%8A%94-%EC%9D%B4%EB%A1%A0%EC%9D%B4-%EC%9E%88%EB%8B%A4-dow-theory&quot; data-og-url=&quot;https://www.chartistlab.com/post/추세를-확인하는-이론이-있다-dow-theory&quot; data-og-image=&quot;https://scrap.kakaocdn.net/dn/dyGuYI/hyYMRjcgPH/BUM1pJFZuZ5lFmkAx8CPXk/img.png?width=1000&amp;amp;height=421&amp;amp;face=0_0_1000_421,https://scrap.kakaocdn.net/dn/GHpeL/hyYRo7sP0F/XYGf7zAf9MN1MNPvCS9Hm0/img.png?width=1000&amp;amp;height=421&amp;amp;face=0_0_1000_421&quot;&gt;&lt;a href=&quot;https://www.chartistlab.com/post/%EC%B6%94%EC%84%B8%EB%A5%BC-%ED%99%95%EC%9D%B8%ED%95%98%EB%8A%94-%EC%9D%B4%EB%A1%A0%EC%9D%B4-%EC%9E%88%EB%8B%A4-dow-theory&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot; data-source-url=&quot;https://www.chartistlab.com/post/%EC%B6%94%EC%84%B8%EB%A5%BC-%ED%99%95%EC%9D%B8%ED%95%98%EB%8A%94-%EC%9D%B4%EB%A1%A0%EC%9D%B4-%EC%9E%88%EB%8B%A4-dow-theory&quot;&gt;
&lt;div class=&quot;og-image&quot; style=&quot;background-image: url('https://scrap.kakaocdn.net/dn/dyGuYI/hyYMRjcgPH/BUM1pJFZuZ5lFmkAx8CPXk/img.png?width=1000&amp;amp;height=421&amp;amp;face=0_0_1000_421,https://scrap.kakaocdn.net/dn/GHpeL/hyYRo7sP0F/XYGf7zAf9MN1MNPvCS9Hm0/img.png?width=1000&amp;amp;height=421&amp;amp;face=0_0_1000_421');&quot;&gt;&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;og-text&quot;&gt;
&lt;p class=&quot;og-title&quot; data-ke-size=&quot;size16&quot;&gt;추세를 확인하는 이론이 있다? Dow Theory&lt;/p&gt;
&lt;p class=&quot;og-desc&quot; data-ke-size=&quot;size16&quot;&gt;안녕하세요 오늘은 다우이론(Dow Theory)에 대해서 설명을 드리려고 합니다. 다우 이론은 찰스 다우라는 인물이 만들어낸 이론으로써 여러가지 원칙과 의미가 있는대요, Bull market이 오기 전의 매집&lt;/p&gt;
&lt;p class=&quot;og-host&quot; data-ke-size=&quot;size16&quot;&gt;www.chartistlab.com&lt;/p&gt;
&lt;/div&gt;
&lt;/a&gt;&lt;/figure&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style5&quot; /&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #333333; text-align: center;&quot;&gt;There's a tremendous bias against taking risks. Everyone is trying to optimize their ass-covering.&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #333333; text-align: center;&quot;&gt;-Elon Musk-&lt;/span&gt;&lt;/span&gt;&lt;/blockquote&gt;</description>
      <category>투자</category>
      <category>비트코인</category>
      <category>추세 기반 피보나치 확장</category>
      <category>코인</category>
      <category>투자</category>
      <category>피보나치 되돌림</category>
      <author>dongsunseng</author>
      <guid isPermaLink="true">https://dongsunseng.tistory.com/149</guid>
      <comments>https://dongsunseng.tistory.com/entry/%EC%BD%94%EC%9D%B8-%ED%88%AC%EC%9E%90-1-%EB%82%B4%EA%B0%80-%EB%8B%A4%EC%8B%9C-%EB%B3%B4%EB%A0%A4%EA%B3%A0-%EC%A0%80%EC%9E%A5%ED%95%B4%EB%91%90%EB%8A%94-%EC%BD%94%EC%9D%B8-%ED%88%AC%EC%9E%90-%EC%B0%B8%EA%B3%A0-%EC%9E%90%EB%A3%8C#entry149comment</comments>
      <pubDate>Mon, 5 May 2025 13:29:37 +0900</pubDate>
    </item>
    <item>
      <title>[경제] 2. 신탁</title>
      <link>https://dongsunseng.tistory.com/entry/%EA%B2%BD%EC%A0%9C-2-%EC%8B%A0%ED%83%81</link>
      <description>&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;신탁의 정의&lt;/b&gt;&lt;/h4&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;신탁(Trust)은 재산 관리를 위한 법적 제도로, 한 사람(위탁자)이 자신의 재산을 다른 사람(수탁자)에게 맡겨서 제3자(수익자)의 이익을 위해 관리하도록 하는 계약입니다.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;즉, 신탁은 간단하게 생각하면 남의 돈의 법적 소유권을 받아서 수익자의 이익을 위해 돈을 운용하는 방식의 계약입니다.&lt;/p&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;주요 특징:&lt;/b&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;위탁자(Trustor): 재산을 맡기는 사람&lt;/li&gt;
&lt;li&gt;수탁자(Trustee): 재산을 관리하는 사람이나 기관(은행, 신탁 회사 등)&lt;/li&gt;
&lt;li&gt;수익자(Beneficiary): 신탁에서 발생하는 이익을 받는 사람&lt;/li&gt;
&lt;li&gt;신탁 재산: 위탁된 자산(현금, 부동산, 주식 등)&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;금융권에서의 신탁 상품은 은행이나 증권사가 수탁자가 되어 고객(위탁자)의 자금을 운용하는 방식입니다.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;신탁은 자산 관리, 절세, 상속 계획 등 다양한 목적으로 활용됩니다.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;이렇게 목적이 굉장히 다양하기 때문에 위탁자와 수익자의 개념을 분리시켜둔 것입니다.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;위탁자와 수익자가 다른 경우 뿐만 아니라 수익자가 다수인 경우도 존재합니다.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;위탁자와 수익자가 다른 경우에는 부모가 자녀를 위해 신탁을 설정하는 경우나 기업이 직원의 퇴직금을 위한 신탁을 설정하는 경우 등이 있습니다.&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;일반적인 경우로는 위탁자와 수익자가 동일하고 그 예시는 자신의 노후를 위해 연금 신탁을 설정하거나 일반적인 자산 관리의 목적으로 본인이 수익자가 되는 신탁을 설정하는 경우 등이 있습니다.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;일반적인 금융 신탁 상품으로는 금전신탁, 부동산신탁, 증권투자신탁 등이 있습니다.&lt;/p&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;신탁과 다른 금융 상품의 차이점&lt;/b&gt;&lt;/h4&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;신탁 vs. 펀드
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;펀드는 다수 투자자의 자금을 모아 주식, 채권 등에 투자하는 방식이고, 신탁은 그 목적이 다양한만큼 개별 계약에 따라 맞춤형 자산 관리가 가능합니다.&lt;/li&gt;
&lt;li&gt;맞춤형 관리가 가능한 것이 장점이기 때문에 운용 방법을 누가 지정하는지에 따라서도 신탁의 종류가 나뉘게 됩니다.&lt;/li&gt;
&lt;li&gt;특정 금전 신탁: 위탁자가 운용방법을 지정하는 금전 신탁&lt;/li&gt;
&lt;li&gt;불특정 금전 신탁: 수탁자가 운용방법을 지정하는 금전 신탁&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;신탁 vs. 보험
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;보험은 위험 보장이 주 목적이지만 신탁은 재산 관리와 처분이 주 목적입니다.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;신탁 상품의 예시&lt;/b&gt;&lt;/h4&gt;
&lt;figure id=&quot;og_1746254124306&quot; contenteditable=&quot;false&quot; data-ke-type=&quot;opengraph&quot; data-ke-align=&quot;alignCenter&quot; data-og-type=&quot;article&quot; data-og-title=&quot;블랙록, 기관 대상 비트코인 현물 신탁 출시...&amp;quot;지속가능 위한 노력 고무적&amp;quot; By TokenPost&quot; data-og-description=&quot;블랙록, 기관 대상 비트코인 현물 신탁 출시...&amp;quot;지속가능 위한 노력 고무적&amp;quot;&quot; data-og-host=&quot;kr.investing.com&quot; data-og-source-url=&quot;https://kr.investing.com/news/cryptocurrency-news/article-823341&quot; data-og-url=&quot;https://kr.investing.com/news/cryptocurrency-news/article-823341&quot; data-og-image=&quot;https://scrap.kakaocdn.net/dn/bU4PGs/hyYMeFGO1W/KYvWqAjrMq98UyaEr89VWK/img.jpg?width=728&amp;amp;height=485&amp;amp;face=0_0_728_485,https://scrap.kakaocdn.net/dn/nEgFY/hyYMcVo4xM/FbCv5QDDJN6T9ZWNJ4yXd0/img.jpg?width=728&amp;amp;height=485&amp;amp;face=0_0_728_485,https://scrap.kakaocdn.net/dn/KCPw6/hyYPhndOnd/83AwNgmKWz7hSnghwnOvT0/img.jpg?width=559&amp;amp;height=347&amp;amp;face=0_0_559_347&quot;&gt;&lt;a href=&quot;https://kr.investing.com/news/cryptocurrency-news/article-823341&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot; data-source-url=&quot;https://kr.investing.com/news/cryptocurrency-news/article-823341&quot;&gt;
&lt;div class=&quot;og-image&quot; style=&quot;background-image: url('https://scrap.kakaocdn.net/dn/bU4PGs/hyYMeFGO1W/KYvWqAjrMq98UyaEr89VWK/img.jpg?width=728&amp;amp;height=485&amp;amp;face=0_0_728_485,https://scrap.kakaocdn.net/dn/nEgFY/hyYMcVo4xM/FbCv5QDDJN6T9ZWNJ4yXd0/img.jpg?width=728&amp;amp;height=485&amp;amp;face=0_0_728_485,https://scrap.kakaocdn.net/dn/KCPw6/hyYPhndOnd/83AwNgmKWz7hSnghwnOvT0/img.jpg?width=559&amp;amp;height=347&amp;amp;face=0_0_559_347');&quot;&gt;&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;og-text&quot;&gt;
&lt;p class=&quot;og-title&quot; data-ke-size=&quot;size16&quot;&gt;블랙록, 기관 대상 비트코인 현물 신탁 출시...&quot;지속가능 위한 노력 고무적&quot; By TokenPost&lt;/p&gt;
&lt;p class=&quot;og-desc&quot; data-ke-size=&quot;size16&quot;&gt;블랙록, 기관 대상 비트코인 현물 신탁 출시...&quot;지속가능 위한 노력 고무적&quot;&lt;/p&gt;
&lt;p class=&quot;og-host&quot; data-ke-size=&quot;size16&quot;&gt;kr.investing.com&lt;/p&gt;
&lt;/div&gt;
&lt;/a&gt;&lt;/figure&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;위의 기사를 보면 2022년 8월 블랙록(자산운용사)이 미국 기관 투자자들을 대상으로 비트코인 현물 개인 신탁을 출시한다고 발표한 내용이다.&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;자산 운용사인 블랙록은 비트코인 현물 개인 신탁을 출시하여, 기관 투자자들의 돈으로 매집을 계획중인 것으로 해석할 수 있는 것이다.&lt;/p&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style3&quot; /&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;Reference&lt;/b&gt;&lt;b&gt;&lt;/b&gt;&lt;/h4&gt;
&lt;figure id=&quot;og_1746253471989&quot; contenteditable=&quot;false&quot; data-ke-type=&quot;opengraph&quot; data-ke-align=&quot;alignCenter&quot; data-og-type=&quot;website&quot; data-og-title=&quot;미래에셋증권&quot; data-og-description=&quot;&quot; data-og-host=&quot;securities.miraeasset.com&quot; data-og-source-url=&quot;https://securities.miraeasset.com/imf/400/imf601.do&quot; data-og-url=&quot;https://securities.miraeasset.com/imf/400/imf601.do&quot; data-og-image=&quot;&quot;&gt;&lt;a href=&quot;https://securities.miraeasset.com/imf/400/imf601.do&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot; data-source-url=&quot;https://securities.miraeasset.com/imf/400/imf601.do&quot;&gt;
&lt;div class=&quot;og-image&quot; style=&quot;background-image: url();&quot;&gt;&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;og-text&quot;&gt;
&lt;p class=&quot;og-title&quot; data-ke-size=&quot;size16&quot;&gt;미래에셋증권&lt;/p&gt;
&lt;p class=&quot;og-desc&quot; data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p class=&quot;og-host&quot; data-ke-size=&quot;size16&quot;&gt;securities.miraeasset.com&lt;/p&gt;
&lt;/div&gt;
&lt;/a&gt;&lt;/figure&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style5&quot; /&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #333333; text-align: center;&quot;&gt;You should take the approach that you're wrong. Your goal is to be less wrong.&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #333333; text-align: center;&quot;&gt;- Elon Musk -&lt;/span&gt;&lt;/span&gt;&lt;/blockquote&gt;</description>
      <category>경제</category>
      <category>경제</category>
      <category>블랙록</category>
      <category>비트코인</category>
      <category>신탁</category>
      <category>투자</category>
      <author>dongsunseng</author>
      <guid isPermaLink="true">https://dongsunseng.tistory.com/148</guid>
      <comments>https://dongsunseng.tistory.com/entry/%EA%B2%BD%EC%A0%9C-2-%EC%8B%A0%ED%83%81#entry148comment</comments>
      <pubDate>Sat, 3 May 2025 15:53:00 +0900</pubDate>
    </item>
    <item>
      <title>[경제] 1. 비트코인 현물, 선물 ETF 상장이 갖는 의미</title>
      <link>https://dongsunseng.tistory.com/entry/%EA%B2%BD%EC%A0%9C-1-%EB%B9%84%ED%8A%B8%EC%BD%94%EC%9D%B8-%ED%98%84%EB%AC%BC-%EC%84%A0%EB%AC%BC-ETF-%EC%83%81%EC%9E%A5%EC%9D%B4-%EA%B0%96%EB%8A%94-%EC%9D%98%EB%AF%B8</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;미국이 전략적 비축 자산에 가상자산을 추가한다고 발표한 이후 가상자산 판에서의 입지 확장을 위한 작업을 진행중입니다.&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;해당 작업의 일환으로 미국은 2021년 10월 20일에 비트코인 선물 ETF를 상장시킵니다.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;추후에 2023년 6월에는 자산운용사인 블랙록이 현물 ETF를 신청하게 되고, 2024년 1월 11일에 마침내 미국 첫 번째로 비트코인 현물 ETF 상장에 성공합니다.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;여기서 ETF는 무엇인지, 이는 무슨 의미를 갖는지에 대해서 자세하게 알아보겠습니다.&amp;nbsp;&lt;/p&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style3&quot; /&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;ETF는 무엇인가&lt;/b&gt;&lt;/h4&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;ETF는 &quot;상장 지수 펀드&quot; 입니다.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&quot;Exchange Traded Fund의 줄임말로 특정 지수를 추종하는 인덱스 펀드를 거래소에 상장시켜 주식처럼 거래할 수 있도록 만든 펀드를 뜻합니다.&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;여기서 &lt;b&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;인덱스 펀드&lt;/span&gt;&lt;/b&gt;란 특정 주가지수와 동일하거나 유사한 수익률을 목표로 하는 펀드입니다.&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;즉, KOSPI200, KOSDAQ150, S&amp;amp;P500, NASDAQ 등의 특정 지수의 수익률을 추종하는 펀드이기 때문에 해당 지수가 상승하면 펀드의 수익률도 함께 상승하고, 하락하면 함께 하락하는 방식으로 운용됩니다.&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;ETF를 통해 투자자는 직접 매수하지 않고도 여러 자산에 베팅할 수 있게 됩니다.&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;예를 들어, 금과 은 값을 포함하여 하나의 ETF를 구성한다던가 상위권 IT 기업 및 보험회사의 주식을 혼합해 ETF를 구성하는 것도 가능합니다.&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;주식을 선택하거나 운용하는데에 드는 비용이 없기 때문에 액티브 펀드보다 낮은 운용 비용이 하나의 큰 특징이 됩니다.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;지수를 그대로 따라가기 때문에, 액티브 펀드처럼 시장 상황을 예측하거나 투자 전략을 수립할 필요가 없다는 뜻입니다.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;또한, 시장의 평균적인 수익률을 반영하기 때문에 개별 주식 선택에 따른 위험을 줄여줍니다.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;반대로, 액티브 펀드보다 높은 수익률을 달성하기는 어렵습니다.&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;하나의 아웃라이어(outlier)가 평균에 큰 영향을 주기는 힘들기 때문입니다.&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;인덱스 펀드를 거래소에 상장시켜 주식처럼 거래할 수 있도록 만든다는 의미는 일반 펀드와 달리 증권 거래소에 상장되어 있기 때문에, 주식처럼 장중에 실시간으로 매매가 가능하다는 것을 의미합니다.&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;즉, 투자자는 펀드 판매사를 통하지 않고도 증권 계좌를 통해 주식을 사고파는 것과 동일한 방식으로 ETF를 거래할 수 있게 됩니다.&amp;nbsp;&lt;/p&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;ETF vs. 일반 펀드&lt;/b&gt;&lt;/h4&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;거래방식: 일반 펀드는 하루 한 번 기준가격으로 매매가 이루어지지만, ETF는 장중에 실시간 가격으로 거래됩니다&lt;/li&gt;
&lt;li&gt;유동성: ETF는 거래소에서 즉시 매매가 가능하므로 유동성이 높습니다&lt;/li&gt;
&lt;li&gt;비용 구조: 일반적으로 ETF는 운용 보수가 낮은 편입니다. 인덱스를 단순히 추종하는 패시브(passive) 전략을 사용하기 때문입니다.
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;패시브 운용(Passive Management):&amp;nbsp;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;시장 지수의 성과를 그대로 복제하는 것이 목표임&lt;/li&gt;
&lt;li&gt;운용사의 주관적인 판단이나 시장 예측에 의존하지 않음&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;반대 개념은 액티브 운용(Active Management):
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;펀드 매니저가 적극적으로 종목을 선택하고 매매하여 시장 평균 이상의 수익을 추구하는 방식&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;투명성: ETF는 매일 포트폴리오 구성이 공개되어 투명성이 높습니다.&lt;/li&gt;
&lt;/ol&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;일반 펀드의 다른 말로는 뮤추얼 펀드(Mutual Fund - 영어 표현)가 있습니다.&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;이는 다수의 투자자로부터 자금을 모아 전문 펀드 매니저가 액티브 운용을 통해 투자하는 방식으로, 투자자들은 펀드의 지분(수익증권)을 보유하게 됩니다.&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;주식회사 형태로 운영되며, 투자자들은 주주가 되어 투자 수익을 배당금 형태로 받게 됩니다.&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;이 중에서도 종류가 나뉘게 되는데 크게는 개방형과 폐쇄형이 있습니다. 개방형은 언제든지 돈을 찾을 수 있고, 폐쇄형은 만기 전에는 돈을 찾을 수 없다는 차이가 있습니다.&amp;nbsp;&lt;/p&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;ETF의 장점&lt;/b&gt;&lt;/h4&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;분산 투자: 하나의 ETF로 여러 종목에 분산투자 효과를 얻을 수 있습니다.&lt;/li&gt;
&lt;li&gt;낮은 비용: 대부분의 ETF는 패시브 운용으로 인해 운용보수가 낮습니다.&lt;/li&gt;
&lt;li&gt;세금 효율성: 일부 국가에서는 ETF가 세금 측면에서 유리한 구조를 가집니다.&lt;/li&gt;
&lt;li&gt;접근성: 소액으로도 다양한 자산군에 투자할 수 있습니다.&lt;/li&gt;
&lt;/ol&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;ETF의 종류&lt;/b&gt;&lt;/h4&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;주식형 ETF: 주식 시장 지수를 추종(KODEX200, SPY, ...)&lt;/li&gt;
&lt;li&gt;채권형 ETF: 채권 지수를 추종&lt;/li&gt;
&lt;li&gt;원자재 ETF: 금, 은, 원유 등 원자재 가격을 추종&lt;/li&gt;
&lt;li&gt;섹터 ETF: 특정 산업 섹터에 집중(바이오, IT, 금융, ...)&lt;/li&gt;
&lt;li&gt;국가/지역 ETF: 특정 국가나 지역의 시장을 추종&lt;/li&gt;
&lt;li&gt;레버리지/인버스 ETF: 지수 수익률의 배수 또는 반대 방향으로 움직이는 ETF&lt;/li&gt;
&lt;/ol&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;비트코인 선물 ETF의 의미&lt;/b&gt;&lt;/h4&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;위에서 언급했듯이, 비트코인 선물 ETF란 비트코인 자체가 아닌 비트코인 선물 계약에 투자하는 상장 지수 펀드입니다.&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;즉, 실제 비트코인을 직접 보유하지 않고, 비트코인의 미래 가격에 대한 계약인 '선물 계약'에 투자하는 개념입니다.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;이 선물 계약들은 CME(시카고 상품거래소)와 같은 규제된 거래소에서 거래됩니다.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;투자자들이 실제 비트코인을 구매하고 저장하는 복잡한 과정 없이 기존 증권 계좌를 통해 비트코인 시장에 노출될 수 있게 되었다는 것이 비트코인 선물 ETF를 상장시킨 이유입니다.&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;코인 거래소에서 하는 선물 거래 vs. 선물 ETF&lt;/b&gt;&lt;/h4&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;여기서 바이비트와 같은 암호화폐 거래소에서 비트코인 선물 거래를 직접 하는 것과 비트코인 선물 ETF를 사고 파는 것의 차이점이 무엇인지 궁금할 것입니다.&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;주요 차이점:&lt;/b&gt;&lt;/p&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;일단 규제 환경에서의 차이가 있습니다. 바이비트와 같은 코인 거래소는 국가마다 규제 수준이 다르며, 일부 국가에서는 규제가 미흡하거나 차이가 심할 수도 있습니다. 하지만 ETF는 미국 증권거래위원회(SEC)와 같은 엄격한 금융 규제 기관의 감독을 받으며, 투자자 보호 장치가 확실합니다.&lt;/li&gt;
&lt;li&gt;접근성에서의 차이도 분명히 존재합니다. 암호화폐 거래소는 기존 주식 투자자들에게 부가적인 절차를 요구하게 됩니다. 하지만 ETF는 기존 증권 계좌를 통해 주식처럼 매매할 수 있기 때문에 진입 장벽이 낮고 기존 주식 투자자들에게 편리함을 제공합니다.&lt;/li&gt;
&lt;li&gt;위험 및 보안에서도 차이를 보입니다. 암호화폐 거래소는 해킹, 거래소 파산, 사기 등의 위험이 있지만 ETF는 규제된 금융 기관이 자산을 관리하므로 이러한 위험이 줄어듭니다.&lt;/li&gt;
&lt;li&gt;그외에도 레버리지, 상품 구조 등의 부가적인 차이도 존재합니다.&lt;/li&gt;
&lt;/ol&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;미국의 비트코인 선물 ETF 상장의 의미&lt;/b&gt;&lt;/h4&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;가상 자산이 미국의 전통적인 금융 시스템 내에서 공식적으로 인정받기 시작했다는 상징적인 의미가 있습니다.&lt;/li&gt;
&lt;li&gt;SEC(미국 증권거래위원회)의 승인을 받은 상품으로, 일정 수준의 투자자 보호와 규제 감독이 가능해졌다는 의미가 있습니다.&lt;/li&gt;
&lt;/ol&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;위와 같은 이유도 물론 중요한 의미를 가지지만&amp;nbsp;&lt;/p&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;일반 투자자들이 기존 증권 계좌를 통해 쉽게 비트코인 관련 투자를 할 수 있게 된 점&lt;/li&gt;
&lt;li&gt;법적, 규제적 장벽으로 인해 직접 비트코인 투자를 꺼렸던 &lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;기관 투자자&lt;/b&gt;&lt;/span&gt;들의 시장 참여를 용이하게 된 점&lt;/li&gt;
&lt;/ol&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;위와 같은 효과를 내면서 결국 위에서 언급했듯이 일반 투자자는 물론 그보다도 훨씬 큰 힘이 있는 기관 투자자들의 비트코인 관련 투자를 활성화시켰다는 점을 주목해야 합니다.&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;즉, 미국의 기관 투자자들을 비트코인 시장에 참여시켜서 코인 시장 내의 미국의 입지를 확대하려는 목적으로 해석할 수 있습니다.&amp;nbsp;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;비트코인 선물 ETF vs. 비트코인 현물 ETF&lt;/b&gt;&lt;/h4&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;비트코인 현물 ETF 보다 선물 ETF가 먼저 상장했습니다.&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;비트코인 선물 ETF는 실제 비트코인을 직접 보유하는 현물 ETF가 아니라 선물 계약에 투자하는 방식이라는 점에서 한계가 있습니다.&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;선물 계약에는 '만기'가 존재하고, '컨탱고'라고 불리는 롤오버 비용이 발생할 수 있기 때문에 장시적으로는 비트코인 가격 자체의 성과와 차이가 날 수 있게 됩니다.&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;즉, 현물 ETF는 자금 유입이 실제 비트코인 구매로 이어져 직접적인 수요를 창출하기 때문에 시장 가격에 더 직접적인 영향을 줄 수 있습니다.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;다시 말해서, 코인의 가격을 조종하는 Market Maker들의 입장에서는 변수를 줄일 수 있는 현물 ETF도 상장하는 편이 당연합니다.&amp;nbsp;&lt;/p&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;롤오버 비용(Rollover Cost)?&lt;/b&gt;&lt;/h4&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;롤오버 비용은 만기가 다가오는 선물 계약에서 다음 만기의 선물 계약으로 포지션을 이전(롤오버)할 때 발생하는 비용입니다.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;선물 ETF와 같은 상품들은 지속적인 노출을 제공하기 위해 이러한 롤오버 과정을 정기적으로 수행해야 합니다.&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;모든 선물 계약은 특정 만기일을 가지고 있습니다.&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;여기서 발생하는 현상 두 가지:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;컨탱고(Contango): 미래 만기의 선물 가격이 현재 만기의 선물 가격보다 높은 상황&lt;/li&gt;
&lt;li&gt;백워데이션(Backwardation): 미래 만기의 선물 가격이 현재 만기의 선물 가격보다 낮은 상황&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;비트코인 시장은 대부분의 시간동안 컨탱고 상태에 있는 경우가 많습니다.&lt;/p&gt;
&lt;div data-ke-type=&quot;moreLess&quot; data-text-more=&quot;더보기&quot; data-text-less=&quot;닫기&quot;&gt;&lt;a class=&quot;btn-toggle-moreless&quot;&gt;더보기&lt;/a&gt;
&lt;div class=&quot;moreless-content&quot;&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;비트코인 시장이 대부분의 시간동안 컨탱고(Contango) 상태에 있는 이유:&lt;/p&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;미래 가격 상승 기대감: 많은 투자자들은 비트코인의 장기적 가치 상승을 기대합니다. 이러한 낙관적 전망이 선물 가격을 현물 가격보다 높게 유지합니다.&lt;/li&gt;
&lt;li&gt;보유 비용의 부재: 비트코인은 실물 자산과 달리 보관 비용이나 감가상각이 없습니다. 물리적 상품(석유, 농산물 등)은 보관 비용 때문에 백워데이션(선물 가격 &amp;lt; 현물 가격)이 자주 발생하지만, 비트코인은 그런 제약이 없습니다.&amp;nbsp;&lt;/li&gt;
&lt;li&gt;이자율 요소: 선물 가격에는 무위험 이자율이 반영됩니다. 투자자들은 현물을 구매하는 대신 그 자금으로 무위험 수익을 얻고 나중에 선물 만기에 비트코인을 구매할 수 있으므로, 이 기회비용이 선물 가격에 반영됩니다.&lt;/li&gt;
&lt;li&gt;레버리지 수요: 많은 트레이더들이 더 큰 수익을 위해 레버리지를 사용하는데, 이는 선물 시장에서 이루어집니다. 이런 롱 포지션 수요가 선물 가격을 끌어올립니다.&lt;/li&gt;
&lt;li&gt;기관 투자자의 헤지 전략: 기관들이 현물 비트코인을 보유하면서 선물로 헤지하는 전략을 사용할 때, 이런 활동이 컨탱고 상태를 강화할 수 있습니다.&lt;/li&gt;
&lt;li&gt;수익률 파밍(Yield Farming): 투자자들이 현물 비트코인을 보유하면서 동시에 선물 시장에서 숏 포지션을 취하는 베이시스 트레이딩을 통해 무위험 수익을 추구합니다. 이러한 차익 거래 활동이 컨탱고 상태를 유지하는 데 기여합니다.&lt;/li&gt;
&lt;/ol&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;5, 6번에 대한 추가 설명:&lt;/b&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;기본적으로 5번과 6번은 같은 맥락입니다.&lt;/li&gt;
&lt;li&gt;&lt;b&gt;기본적인 헤지 포지션(Hedge Position) 구조:&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;기관 투자자들과 일부 투자자들은 장기 투자 목적으로 현물 비트코인을 구매하여 보유합니다.&lt;/li&gt;
&lt;li&gt;동시에 가격 하락 위험을 관리하기 위해 선물 시장에서 숏(매도) 포지션을 취합니다.&lt;/li&gt;
&lt;li&gt;이 전략은 현물 롱 + 선물 숏의 형태로 구성됩니다.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;컨탱고를 강화하는 과정:&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;대형 기관이 대량의 현물 비트코인을 구매하면 현물 가격에 상승 압력이 가해집니다 (비트코인의 수량은 정해져있기 때문에 수요와 공급 법칙에 따라 당연한겁니다).&lt;/li&gt;
&lt;li&gt;이후 이들이 선물 시장에서 숏 포지션을 취하면 이론적으로는 선물 가격에 하락 압력을 줄 수 있습니다.&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;그러나 대부분의 경우, 선물 시장에서의 숏 포지션보다 현물 시장에서의 매수 영향이 더 크게 작용합니다.&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;수익 창출 메커니즘:&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;기관들은 이 헤지 포지션을 통해 선물과 현물 간의 가격 차이(베이시스)에서 수익을 얻을 수 있습니다.&lt;/li&gt;
&lt;li&gt;예를 들어, 현물 비트코인이 $50,000이고 3개월 선물이 $52,500이라면, 연간 20%의 무위험 수익률을 얻을 수 있게 됩니다.&lt;/li&gt;
&lt;li&gt;컨탱고 상태에서는 선물 가격이 만기에 가까워질수록 현물 가격에 수렴합니다(베이시스가 줄어듦).&lt;/li&gt;
&lt;li&gt;이 수렴 과정에서 헤지 포지션은 안정적인 수익을 창출합니다.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;기관 투자자의 영향력:&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;대형 기관들이 이러한 전략을 대규모로 실행할 때, 이들의 거래 행위는 시장 전체 구조에 영향을 미칩니다.&lt;/li&gt;
&lt;li&gt;특히 현물 비트코인에 대한 수요가 지속적으로 유지되어 현물 가격 지지로 이어집니다.&lt;/li&gt;
&lt;li&gt;동시에 선물 시장에서의 숏 포지션이 선물 가격의 과도한 상승을 제한합니다.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;컨탱고 지속 요인:&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;이러한 헤지 포지션 구축은 현물 비트코인에 대한 지속적인 수요를 창출합니다.&lt;/li&gt;
&lt;li&gt;현물 매수 + 선물 매도 전략이 널리 채택될수록 현물과 선물 간의 가격 차이(컨탱고)가 유지됩니다.&lt;/li&gt;
&lt;li&gt;이 가격 차이는 헤지 전략의 수익성을 결정하므로, 수익을 추구하는 다른 투자자들도 유사한 전략을 채택하게 됩니다.&amp;nbsp;&lt;/li&gt;
&lt;li&gt;베이시스(현물과 선물의 가격 차이)가 좁아지면 수익성이 감소하므로 새로운 참여자들의 진입이 줄어듭니다.&lt;/li&gt;
&lt;li&gt;반대로 베이시스가 넓어지면 더 많은 투자자들이 이 전략에 참여하게 됩니다.&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;베이시스 트레이딩은 컨탱고 상태를 유지하는 자기 강화적 순환 구조를 형성합니다:&lt;/b&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;컨탱고가 발생하면 투자자들이 베이시스 트레이딩으로 무위험 수익을 추구합니다.&amp;nbsp;&lt;/li&gt;
&lt;li&gt;더 많은 투자자가 현물을 매수하고 선물을 매도합니다.&lt;/li&gt;
&lt;li&gt;따라서, 컨탱고 상태가 지속됩니다.&lt;/li&gt;
&lt;li&gt;추후에 시장 효율성으로 인해 베이시스는 특정 수준에서 안정화됩니다.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;이러한 메커니즘이 순환적으로 작용하여 비트코인 시장에서 컨탱고 상태가 지속되는 데 기여합니다.&amp;nbsp;&lt;/li&gt;
&lt;li&gt;시장 참여자들의 이런 행동은 결과적으로 시장의 유동성을 높이고 가격 안정성에도 기여할 수 있게 됩니다.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;ETF가 만료되는 선물 계약을 팔고 더 비싼 다음 달 계약을 사게 되면, 그 가격 차이가 비용으로 발생하게 됩니다.&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;예를 들어 만약 4월 만기 비트코인 선물이 $60,000에 거래되고 5월 만기 선물이 $61,000에 거래된다면, 롤오버할 때 계약당 $1,000의 비용이 발생하게 되는 것입니다.&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;이러한 롤오버 비용이 지속적으로 발생하면 ETF의 성과가 기초자산(비트코인)의 실제 성과보다 낮아지게 됩니다.&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;이를 '롤 이익'(Roll yield)의 감소 또는 '롤 손실'(Roll loss)라고 합니다.&amp;nbsp;&lt;/p&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;롤오버가 투자자에게 미치는 영향&lt;/b&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;성과 괴리: 위에서 언급했듯이 롤오버 비용으로 인해 장기적으로 비트코인 선물 ETF의 성과는 실제 비트코인 가격 움직임과 괴리가 발생할 수 있습니다.&lt;/li&gt;
&lt;li&gt;장기 투자 효율성 감소: 이러한 비용은 시간이 지남에 따라 누적되어 장기 투자 수익률에 부정적인 영향을 미칠 수 있습니다.&lt;/li&gt;
&lt;li&gt;비용 가시성: 롤오버 비용은 명시적으로 표시되지 않고 ETF 가격 성과에 내제되어 있어 투자자가 인지하기 어렵기 때문에 투자자에게 더 큰 불편함을 제공합니다.&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;이러한 이유들로 많은 투자자와 시장 참여자들은 실제 비트코인을 보유할 수 있는 현물 ETF 상장의 승인을 기다렸고 2024년 초 미국 SEC는 결국 비트코인 현물 ETF의 상장을 승인합니다.&lt;/p&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style3&quot; /&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;Reference&lt;/b&gt;&lt;/h4&gt;
&lt;figure id=&quot;og_1744809541857&quot; contenteditable=&quot;false&quot; data-ke-type=&quot;opengraph&quot; data-ke-align=&quot;alignCenter&quot; data-og-type=&quot;website&quot; data-og-title=&quot;ETF 소개 | ETF 투자기초가이드 | Kodex&quot; data-og-description=&quot;ETF 투자의 기초부터 심화까지 알아보세요.&quot; data-og-host=&quot;www.samsungfund.com&quot; data-og-source-url=&quot;https://www.samsungfund.com/etf/insight/guide/view01.do&quot; data-og-url=&quot;https://www.samsungfund.com/etf/insight/guide/view01.do&quot; data-og-image=&quot;https://scrap.kakaocdn.net/dn/sKo6p/hyYFBT9zgj/f7zK80ojNPHHwJmTzo2gH0/img.jpg?width=2400&amp;amp;height=1002&amp;amp;face=0_0_2400_1002,https://scrap.kakaocdn.net/dn/boMuDJ/hyYIjeeS6x/mkfkNkhAwiqXhzS1DjQ4fk/img.jpg?width=2400&amp;amp;height=1002&amp;amp;face=0_0_2400_1002,https://scrap.kakaocdn.net/dn/j6dik/hyYHeR7LoL/7oLjbtqalngDdG7p69NKC1/img.png?width=852&amp;amp;height=828&amp;amp;face=0_0_852_828&quot;&gt;&lt;a href=&quot;https://www.samsungfund.com/etf/insight/guide/view01.do&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot; data-source-url=&quot;https://www.samsungfund.com/etf/insight/guide/view01.do&quot;&gt;
&lt;div class=&quot;og-image&quot; style=&quot;background-image: url('https://scrap.kakaocdn.net/dn/sKo6p/hyYFBT9zgj/f7zK80ojNPHHwJmTzo2gH0/img.jpg?width=2400&amp;amp;height=1002&amp;amp;face=0_0_2400_1002,https://scrap.kakaocdn.net/dn/boMuDJ/hyYIjeeS6x/mkfkNkhAwiqXhzS1DjQ4fk/img.jpg?width=2400&amp;amp;height=1002&amp;amp;face=0_0_2400_1002,https://scrap.kakaocdn.net/dn/j6dik/hyYHeR7LoL/7oLjbtqalngDdG7p69NKC1/img.png?width=852&amp;amp;height=828&amp;amp;face=0_0_852_828');&quot;&gt;&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;og-text&quot;&gt;
&lt;p class=&quot;og-title&quot; data-ke-size=&quot;size16&quot;&gt;ETF 소개 | ETF 투자기초가이드 | Kodex&lt;/p&gt;
&lt;p class=&quot;og-desc&quot; data-ke-size=&quot;size16&quot;&gt;ETF 투자의 기초부터 심화까지 알아보세요.&lt;/p&gt;
&lt;p class=&quot;og-host&quot; data-ke-size=&quot;size16&quot;&gt;www.samsungfund.com&lt;/p&gt;
&lt;/div&gt;
&lt;/a&gt;&lt;/figure&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;figure id=&quot;og_1746430992011&quot; contenteditable=&quot;false&quot; data-ke-type=&quot;opengraph&quot; data-ke-align=&quot;alignCenter&quot; data-og-type=&quot;article&quot; data-og-title=&quot;비트코인 ETF 승인: 환호하는 암호화폐 시장&amp;hellip; 그 이유와 의미는? - BBC News 코리아&quot; data-og-description=&quot;오랫동안 기다려온 미 금융 당국의 비트코인 현물 ETF 승인 소식에 암호화폐 업계가 들썩이고 있다. 그 이유를 살펴봤다.&quot; data-og-host=&quot;www.bbc.com&quot; data-og-source-url=&quot;https://www.bbc.com/korean/articles/c2vy2zdn99vo&quot; data-og-url=&quot;https://www.bbc.com/korean/articles/c2vy2zdn99vo&quot; data-og-image=&quot;https://scrap.kakaocdn.net/dn/ccGE5z/hyYMY3GHPA/EwAoDVGgq1kVjVweINQ090/img.png?width=1024&amp;amp;height=576&amp;amp;face=0_0_1024_576&quot;&gt;&lt;a href=&quot;https://www.bbc.com/korean/articles/c2vy2zdn99vo&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot; data-source-url=&quot;https://www.bbc.com/korean/articles/c2vy2zdn99vo&quot;&gt;
&lt;div class=&quot;og-image&quot; style=&quot;background-image: url('https://scrap.kakaocdn.net/dn/ccGE5z/hyYMY3GHPA/EwAoDVGgq1kVjVweINQ090/img.png?width=1024&amp;amp;height=576&amp;amp;face=0_0_1024_576');&quot;&gt;&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;og-text&quot;&gt;
&lt;p class=&quot;og-title&quot; data-ke-size=&quot;size16&quot;&gt;비트코인 ETF 승인: 환호하는 암호화폐 시장&amp;hellip; 그 이유와 의미는? - BBC News 코리아&lt;/p&gt;
&lt;p class=&quot;og-desc&quot; data-ke-size=&quot;size16&quot;&gt;오랫동안 기다려온 미 금융 당국의 비트코인 현물 ETF 승인 소식에 암호화폐 업계가 들썩이고 있다. 그 이유를 살펴봤다.&lt;/p&gt;
&lt;p class=&quot;og-host&quot; data-ke-size=&quot;size16&quot;&gt;www.bbc.com&lt;/p&gt;
&lt;/div&gt;
&lt;/a&gt;&lt;/figure&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style5&quot; /&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;Work like hell. &lt;br /&gt;I mean you just have to put in 80 to 100 hour weeks every week.&amp;nbsp;&lt;br /&gt;- Elon Musk -&lt;/blockquote&gt;</description>
      <category>경제</category>
      <category>ETF</category>
      <category>롤오버</category>
      <category>백워데이션</category>
      <category>비트코인</category>
      <category>선물</category>
      <category>컨탱고</category>
      <category>코인 투자</category>
      <category>현물</category>
      <author>dongsunseng</author>
      <guid isPermaLink="true">https://dongsunseng.tistory.com/147</guid>
      <comments>https://dongsunseng.tistory.com/entry/%EA%B2%BD%EC%A0%9C-1-%EB%B9%84%ED%8A%B8%EC%BD%94%EC%9D%B8-%ED%98%84%EB%AC%BC-%EC%84%A0%EB%AC%BC-ETF-%EC%83%81%EC%9E%A5%EC%9D%B4-%EA%B0%96%EB%8A%94-%EC%9D%98%EB%AF%B8#entry147comment</comments>
      <pubDate>Sat, 3 May 2025 15:13:24 +0900</pubDate>
    </item>
    <item>
      <title>[매매일지] 13. 비트야 멘징 좀 하자</title>
      <link>https://dongsunseng.tistory.com/entry/%EB%A7%A4%EB%A7%A4%EC%9D%BC%EC%A7%80-13-%EB%B9%84%ED%8A%B8%EC%95%BC-%EB%A9%98%EC%A7%95-%EC%A2%80-%ED%95%98%EC%9E%90</link>
      <description>&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-03-17 오후 4.43.04.png&quot; data-origin-width=&quot;1264&quot; data-origin-height=&quot;657&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/cfRAoZ/btsMMedAB5e/6ZqbbclgvKRppkd8g3sic0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/cfRAoZ/btsMMedAB5e/6ZqbbclgvKRppkd8g3sic0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/cfRAoZ/btsMMedAB5e/6ZqbbclgvKRppkd8g3sic0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FcfRAoZ%2FbtsMMedAB5e%2F6ZqbbclgvKRppkd8g3sic0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1264&quot; height=&quot;657&quot; data-filename=&quot;스크린샷 2025-03-17 오후 4.43.04.png&quot; data-origin-width=&quot;1264&quot; data-origin-height=&quot;657&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;2025.03.17 - 1) 숏 1차 진입&lt;/b&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;1. 진입 근거:&amp;nbsp;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;순추세는 다시 숏이라고 판단함&lt;/li&gt;
&lt;li&gt;618 라인에 세 번 연속 맞고 리테스트가 일어났기 때문에 4번째 618 부근에서 숏 포지션 진입함&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;2. 포지션 셋업:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;진입(EP):&lt;span&gt; 83276&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;익절(TP):&lt;span&gt;&amp;nbsp;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;손절(SL):&lt;span&gt; 83467&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;손익비(R/R):&lt;span&gt;&amp;nbsp;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;3. 결과:&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-03-17 오후 4.53.10.png&quot; data-origin-width=&quot;129&quot; data-origin-height=&quot;136&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/Zvntm/btsMNJKgko0/Gown7YcWX2aHFnmUVqO990/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/Zvntm/btsMNJKgko0/Gown7YcWX2aHFnmUVqO990/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/Zvntm/btsMNJKgko0/Gown7YcWX2aHFnmUVqO990/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FZvntm%2FbtsMNJKgko0%2FGown7YcWX2aHFnmUVqO990%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;195&quot; height=&quot;206&quot; data-filename=&quot;스크린샷 2025-03-17 오후 4.53.10.png&quot; data-origin-width=&quot;129&quot; data-origin-height=&quot;136&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;장대 양봉 손절 엔딩&lt;/li&gt;
&lt;li&gt;더 올라갔다 내려갈거라고 판단하고 바로 손절침&lt;/li&gt;
&lt;/ul&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style3&quot; /&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-03-17 오후 10.54.40.png&quot; data-origin-width=&quot;541&quot; data-origin-height=&quot;474&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/q5cQM/btsMNBsiCoL/GsvITnSiNkvbsXXQotwbq1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/q5cQM/btsMNBsiCoL/GsvITnSiNkvbsXXQotwbq1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/q5cQM/btsMNBsiCoL/GsvITnSiNkvbsXXQotwbq1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fq5cQM%2FbtsMNBsiCoL%2FGsvITnSiNkvbsXXQotwbq1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;541&quot; height=&quot;474&quot; data-filename=&quot;스크린샷 2025-03-17 오후 10.54.40.png&quot; data-origin-width=&quot;541&quot; data-origin-height=&quot;474&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;2025.03.17 - 2) 숏 2차&lt;/b&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;1. 진입 근거:&amp;nbsp;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;거래량은 많이 두번 터지면서 장대 양봉을 쐈지만 사실 주가는 크게 못올림&lt;/li&gt;
&lt;li&gt;세번째 터치할때 숏 진입함(고배로)&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;2. 포지션 셋업:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;진입(EP):&lt;span&gt; 83396&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;익절(TP):&lt;span&gt; 82969 (매물대 상단)&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;손절(SL):&lt;span&gt; 83735 (최근 고점)&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;손익비(R/R):&lt;span&gt; 1.4&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;3. 결과:&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-03-17 오후 11.01.40.png&quot; data-origin-width=&quot;177&quot; data-origin-height=&quot;275&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/kRHHL/btsMMLvwvn5/S4vPOx07aqf1kmOHpWbMjk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/kRHHL/btsMMLvwvn5/S4vPOx07aqf1kmOHpWbMjk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/kRHHL/btsMMLvwvn5/S4vPOx07aqf1kmOHpWbMjk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FkRHHL%2FbtsMMLvwvn5%2FS4vPOx07aqf1kmOHpWbMjk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;177&quot; height=&quot;275&quot; data-filename=&quot;스크린샷 2025-03-17 오후 11.01.40.png&quot; data-origin-width=&quot;177&quot; data-origin-height=&quot;275&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;고배로 쳐서 꽤나 멘징함&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;4. 배운점, 느낀점 정리:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;마음 급하게 먹지말고 천천히 손익비 좋은 자리만 봐서 매매하면 수익을 본다&lt;/li&gt;
&lt;li&gt;다시 말해서, 미리 진입해서 물려있지 말고 진짜 좋은 자리만 들어가는게 좋다&lt;/li&gt;
&lt;/ul&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style3&quot; /&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-03-18 오전 12.09.27.png&quot; data-origin-width=&quot;546&quot; data-origin-height=&quot;383&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/Lo7hF/btsMOg84bu9/255bZUwHTqDej3zmTNQKkK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/Lo7hF/btsMOg84bu9/255bZUwHTqDej3zmTNQKkK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/Lo7hF/btsMOg84bu9/255bZUwHTqDej3zmTNQKkK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FLo7hF%2FbtsMOg84bu9%2F255bZUwHTqDej3zmTNQKkK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;546&quot; height=&quot;383&quot; data-filename=&quot;스크린샷 2025-03-18 오전 12.09.27.png&quot; data-origin-width=&quot;546&quot; data-origin-height=&quot;383&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;br /&gt;2&lt;/b&gt;&lt;b&gt;025.03.17 - 2) 숏 3차&lt;/b&gt;&lt;/p&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;1. 진입 근거:&amp;nbsp;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;아래 마지노선으로 지켜주던 매물대를 뚫는 것을 보고 진입함&lt;/li&gt;
&lt;/ul&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;2. 처음 포지션 셋업:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;진입(EP):&lt;span&gt;&lt;span&gt; 82574&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;3. 결과:&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-03-18 오전 12.25.20.png&quot; data-origin-width=&quot;467&quot; data-origin-height=&quot;325&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/QagxK/btsMNQbHIX4/BDyHckaGSpdnjVwNB3CMYK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/QagxK/btsMNQbHIX4/BDyHckaGSpdnjVwNB3CMYK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/QagxK/btsMNQbHIX4/BDyHckaGSpdnjVwNB3CMYK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FQagxK%2FbtsMNQbHIX4%2FBDyHckaGSpdnjVwNB3CMYK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;467&quot; height=&quot;325&quot; data-filename=&quot;스크린샷 2025-03-18 오전 12.25.20.png&quot; data-origin-width=&quot;467&quot; data-origin-height=&quot;325&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;진짜 신이 살렸다....&lt;/li&gt;
&lt;li&gt;일단 위에 보이는 것처럼 진입함&lt;/li&gt;
&lt;li&gt;근데 이후로 이상하게 스물스물 오름&lt;/li&gt;
&lt;li&gt;불안하긴 했지만 버텨봄&lt;/li&gt;
&lt;li&gt;내려가겠지 라는 생각으로 남은 증거금까지 끌어와서 한 번 더 쳤음&lt;/li&gt;
&lt;li&gt;저시드 고배로 치고 있었는데 남은 증거금을 끌어왔으니까 말도 안되는 고시드 고배 상황&lt;/li&gt;
&lt;li&gt;높은 펀비 때문에 조금의 수익만 보자는 생각으로 익절을 걸고 버팀&lt;/li&gt;
&lt;li&gt;딱 10틱 안쪽으로 익절 나가고 저렇게 올라버림...&lt;/li&gt;
&lt;li&gt;진짜 운이 좋았다..&lt;/li&gt;
&lt;/ul&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;4. 배운점, 느낀점 정리:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;증거금 끌어와서 평단 조정하는 짓은 진짜 그만해야겠음... 너무 쫄림&lt;/li&gt;
&lt;li&gt;수익을 봐서 다행이지 하마터면 따로 빼둔 증거금까지 손실볼 뻔 했음..&lt;/li&gt;
&lt;/ul&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;5.&lt;span&gt;&amp;nbsp;&lt;/span&gt;&lt;span&gt;반성할&lt;/span&gt;&lt;span&gt;&amp;nbsp;&lt;/span&gt;&lt;span&gt;점&lt;/span&gt;:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;제발 뚫을 것 같을 때 진입하지 말고 확인매매 하자... 완전하게 고쳐지지가 않네&lt;/li&gt;
&lt;li&gt;오늘 매매는 여기서 그만하는게 좋겠다..&lt;/li&gt;
&lt;/ul&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style5&quot; /&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;천천히 멘징해나가는중... 화이팅&lt;br /&gt;뻘짓 그만하자..&lt;/blockquote&gt;</description>
      <category>투자</category>
      <category>단타</category>
      <category>매매일지</category>
      <category>코인</category>
      <category>투자</category>
      <author>dongsunseng</author>
      <guid isPermaLink="true">https://dongsunseng.tistory.com/146</guid>
      <comments>https://dongsunseng.tistory.com/entry/%EB%A7%A4%EB%A7%A4%EC%9D%BC%EC%A7%80-13-%EB%B9%84%ED%8A%B8%EC%95%BC-%EB%A9%98%EC%A7%95-%EC%A2%80-%ED%95%98%EC%9E%90#entry146comment</comments>
      <pubDate>Tue, 18 Mar 2025 00:51:34 +0900</pubDate>
    </item>
    <item>
      <title>[매매일지] 12. 역시 롱은 역추세였다..</title>
      <link>https://dongsunseng.tistory.com/entry/%EB%A7%A4%EB%A7%A4%EC%9D%BC%EC%A7%80-12-%EC%97%AD%EC%8B%9C-%EB%A1%B1%EC%9D%80-%EC%97%AD%EC%B6%94%EC%84%B8%EC%98%80%EB%8B%A4</link>
      <description>&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-03-16 오후 8.32.51.png&quot; data-origin-width=&quot;1076&quot; data-origin-height=&quot;539&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/JWyRm/btsMMlCXLJ5/IldOaN2aP1mWQqklZkyBtK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/JWyRm/btsMMlCXLJ5/IldOaN2aP1mWQqklZkyBtK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/JWyRm/btsMMlCXLJ5/IldOaN2aP1mWQqklZkyBtK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FJWyRm%2FbtsMMlCXLJ5%2FIldOaN2aP1mWQqklZkyBtK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1076&quot; height=&quot;539&quot; data-filename=&quot;스크린샷 2025-03-16 오후 8.32.51.png&quot; data-origin-width=&quot;1076&quot; data-origin-height=&quot;539&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;2025.03.16 - 1) 롱 진입&lt;/b&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;1. 진입 근거:&amp;nbsp;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;주말 내내 83k 대의 매물대를 마지노선으로 횡보중이었기 때문에 상승을 이어나가려면 이 구간이 깨지지 않아야 한다고 판단하고 롱을 잡았음&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;2. 포지션 셋업:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;진입(EP):&lt;span&gt; 83841&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;익절(TP):&lt;span&gt; 길게 끌고갈 생각이었음&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;손절(SL):&lt;span&gt; 83600&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;손익비(R/R):&lt;span&gt; 길게 끌고갈 생각이었음&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;3. 결과:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;손절&lt;/li&gt;
&lt;li&gt;손절 라인이 짧았기 때문에 해볼만한 배팅이었다고 생각함&lt;/li&gt;
&lt;/ul&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style3&quot; /&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-03-17 오후 4.26.44.png&quot; data-origin-width=&quot;545&quot; data-origin-height=&quot;683&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/dHL4k6/btsMOsulEUr/nT1R1wmy5BJRXcof52zx91/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/dHL4k6/btsMOsulEUr/nT1R1wmy5BJRXcof52zx91/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/dHL4k6/btsMOsulEUr/nT1R1wmy5BJRXcof52zx91/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FdHL4k6%2FbtsMOsulEUr%2FnT1R1wmy5BJRXcof52zx91%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;545&quot; height=&quot;683&quot; data-filename=&quot;스크린샷 2025-03-17 오후 4.26.44.png&quot; data-origin-width=&quot;545&quot; data-origin-height=&quot;683&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;2025.03.16 - 2) 추격 숏 진입&lt;/b&gt;&lt;/p&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;1. 진입 근거:&amp;nbsp;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;빠르게 흐르는걸 보고 추격 숏 진입&lt;/li&gt;
&lt;/ul&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;2. 포지션 셋업:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;진입(EP):&lt;span&gt; 83607.0&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;익절(TP):&lt;span&gt;&amp;nbsp;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;손절(SL):&lt;span&gt; 0.4% 짧은 손절라인&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;손익비(R/R):&lt;span&gt;&amp;nbsp;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;3. 결과:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;멘징 + 수익 성공&lt;/li&gt;
&lt;/ul&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;4. 배운점, 느낀점 정리:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;솔직히 기세만 봤을 때는 훨씬 아래로 갈 줄 알았음&lt;/li&gt;
&lt;li&gt;혹시 모르니까 그어둔 노란색 하락 추세선이 깨질 떄 한번, 피보나치 0.5 구간에서 완익을 쳤음&lt;/li&gt;
&lt;li&gt;갑자기 반등을 이어나가는걸 보고 익절 하길 잘했다고 생각함&lt;/li&gt;
&lt;/ul&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style3&quot; /&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-03-17 오후 4.32.04.png&quot; data-origin-width=&quot;650&quot; data-origin-height=&quot;520&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/chPc0P/btsMMECX70T/smOkE3ul5f67nefBQcTrck/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/chPc0P/btsMMECX70T/smOkE3ul5f67nefBQcTrck/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/chPc0P/btsMMECX70T/smOkE3ul5f67nefBQcTrck/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FchPc0P%2FbtsMMECX70T%2FsmOkE3ul5f67nefBQcTrck%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;650&quot; height=&quot;520&quot; data-filename=&quot;스크린샷 2025-03-17 오후 4.32.04.png&quot; data-origin-width=&quot;650&quot; data-origin-height=&quot;520&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;2025.03.16 - 3) 숏 재진입&lt;/b&gt;&lt;/p&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;1. 진입 근거:&amp;nbsp;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;반등의 세기가 약해질 때 숏을 재진입함&lt;/li&gt;
&lt;/ul&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;2. 포지션 셋업:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;진입(EP):&lt;span&gt;&lt;span&gt; 82915&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;익절(TP):&lt;span&gt;&amp;nbsp;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;손절(SL):&lt;span&gt;&lt;span&gt;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;손익비(R/R):&lt;span&gt;&amp;nbsp;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;3. 결과:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;여기서 문제가 발생함...&lt;/li&gt;
&lt;li&gt;더 내려갈거라고 확신하고 손절 라인을 엄청 길게 잡아두고 다른 강의를 듣던 중에 엄청난 장대 양봉 발생함..&lt;/li&gt;
&lt;/ul&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;4. 배운점, 느낀점 정리:&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;5. 반성할 점&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;무조건 손절은 감당 가능할 정도로만 잡기&amp;nbsp;&lt;/li&gt;
&lt;li&gt;어떤 상황에서도...&lt;/li&gt;
&lt;li&gt;특히 요즘같은 변동성이 심한 장세에 왜 자꾸 이런 말도 안되는 손실을 보는거니..&lt;/li&gt;
&lt;/ul&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style3&quot; /&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-03-17 오후 4.35.59.png&quot; data-origin-width=&quot;649&quot; data-origin-height=&quot;410&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/oil9U/btsMMDc2luM/Q4zs0GG9xggoiCljP2KIAK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/oil9U/btsMMDc2luM/Q4zs0GG9xggoiCljP2KIAK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/oil9U/btsMMDc2luM/Q4zs0GG9xggoiCljP2KIAK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Foil9U%2FbtsMMDc2luM%2FQ4zs0GG9xggoiCljP2KIAK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;649&quot; height=&quot;410&quot; data-filename=&quot;스크린샷 2025-03-17 오후 4.35.59.png&quot; data-origin-width=&quot;649&quot; data-origin-height=&quot;410&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;br /&gt;2025.03.16 - 4) 사팔사팔 멘징&lt;/b&gt;&lt;/p&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;1. 진입 근거:&amp;nbsp;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;사팔사팔&lt;/li&gt;
&lt;/ul&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;2. 포지션 셋업:&lt;/p&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;3. 결과:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;단기봉만 보고 사고 팔며 손실의 25% 정도 멘징함&lt;/li&gt;
&lt;li&gt;진짜 정신 나갈뻔...&lt;/li&gt;
&lt;li&gt;좋은 방식인지는 모르겠지만 변동성이 심하고 휩쏘가 맞는것같다는 생각이 들면 사팔사팔하며 멘징하는건 나쁘지 않은 것 같기도 함&lt;/li&gt;
&lt;/ul&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style3&quot; /&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-03-17 오후 4.37.20.png&quot; data-origin-width=&quot;283&quot; data-origin-height=&quot;333&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bxypSk/btsMNLHYaZC/TSCEptF5xcXMrGwV7mZl2K/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bxypSk/btsMNLHYaZC/TSCEptF5xcXMrGwV7mZl2K/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bxypSk/btsMNLHYaZC/TSCEptF5xcXMrGwV7mZl2K/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbxypSk%2FbtsMNLHYaZC%2FTSCEptF5xcXMrGwV7mZl2K%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;283&quot; height=&quot;333&quot; data-filename=&quot;스크린샷 2025-03-17 오후 4.37.20.png&quot; data-origin-width=&quot;283&quot; data-origin-height=&quot;333&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;br /&gt;2&lt;/b&gt;&lt;b&gt;025.03.16 - 5) 수면 멘징&lt;/b&gt;&lt;/p&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;1. 진입 근거:&amp;nbsp;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;중요 지지라인이 깨지는 것을 보고 전고점을 로스로 잡고 숏 포지션 진입 후 잤음&lt;/li&gt;
&lt;/ul&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;2. 포지션 셋업:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;진입(EP):&lt;span&gt;&lt;span&gt; 83812&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;익절(TP):&lt;span&gt; 92800 (멘징 완료 라인)&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;손절(SL):&lt;span&gt;&lt;span&gt; 84100&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;손익비(R/R):&lt;span&gt;&amp;nbsp;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;3. 결과:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;처음 확 쏟을 때 전체 멘징 완료..&lt;/li&gt;
&lt;li&gt;91k대까지 쏟은 거 보니까 하 익절 라인 좀만 더 길게 잡을걸 이런 생각이 들긴 했지만 그냥 멘징했다는 것으로 위안 삼았음&lt;/li&gt;
&lt;/ul&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style5&quot; /&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;큰 손실이 나면 멘징하는데는 훨씬 더 큰 노력이 필요하다는 것을 제발 명심하자..&lt;/span&gt;&lt;/blockquote&gt;</description>
      <category>투자</category>
      <category>단타</category>
      <category>매매일지</category>
      <category>코인</category>
      <category>투자</category>
      <author>dongsunseng</author>
      <guid isPermaLink="true">https://dongsunseng.tistory.com/145</guid>
      <comments>https://dongsunseng.tistory.com/entry/%EB%A7%A4%EB%A7%A4%EC%9D%BC%EC%A7%80-12-%EC%97%AD%EC%8B%9C-%EB%A1%B1%EC%9D%80-%EC%97%AD%EC%B6%94%EC%84%B8%EC%98%80%EB%8B%A4#entry145comment</comments>
      <pubDate>Mon, 17 Mar 2025 16:40:42 +0900</pubDate>
    </item>
    <item>
      <title>[매매일지] 11. 롱차..?</title>
      <link>https://dongsunseng.tistory.com/entry/%EB%A7%A4%EB%A7%A4%EC%9D%BC%EC%A7%80-11-%EB%A1%B1%EC%B0%A8</link>
      <description>&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-03-15 오후 10.37.44.png&quot; data-origin-width=&quot;936&quot; data-origin-height=&quot;523&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bWuTZA/btsMLXPX4XD/7mS6P6pXTBfUaE6ynB3nf1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bWuTZA/btsMLXPX4XD/7mS6P6pXTBfUaE6ynB3nf1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bWuTZA/btsMLXPX4XD/7mS6P6pXTBfUaE6ynB3nf1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbWuTZA%2FbtsMLXPX4XD%2F7mS6P6pXTBfUaE6ynB3nf1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;936&quot; height=&quot;523&quot; data-filename=&quot;스크린샷 2025-03-15 오후 10.37.44.png&quot; data-origin-width=&quot;936&quot; data-origin-height=&quot;523&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-03-15 오후 10.41.40.png&quot; data-origin-width=&quot;657&quot; data-origin-height=&quot;351&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/KtwvK/btsMMoM6aIO/5h0DKNrnWugWOu5QY3C4r1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/KtwvK/btsMMoM6aIO/5h0DKNrnWugWOu5QY3C4r1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/KtwvK/btsMMoM6aIO/5h0DKNrnWugWOu5QY3C4r1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FKtwvK%2FbtsMMoM6aIO%2F5h0DKNrnWugWOu5QY3C4r1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;657&quot; height=&quot;351&quot; data-filename=&quot;스크린샷 2025-03-15 오후 10.41.40.png&quot; data-origin-width=&quot;657&quot; data-origin-height=&quot;351&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;2025.03.15&lt;/b&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;1. 진입 근거:&amp;nbsp;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;가파르게 올라오는 것을 보고 아 한번도 쫙 올렸다가 패닉셀을 만드려는거구나 라고 생각을 바꿈&lt;/li&gt;
&lt;li&gt;따라서 롱 자리를 보고 있다가 눌림을 보고 진입함&lt;/li&gt;
&lt;li&gt;더블바텀 넥 라인 리테스트 후 올리는 것을 충분히 보고 진입함(최근 손절이 많이 아팠어서.. ㅎ)&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;2. 포지션 셋업:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;진입(EP): 84261&lt;/li&gt;
&lt;li&gt;익절(TP): 길게 끌고갈 생각&lt;/li&gt;
&lt;li&gt;손절(SL): 83922 (제일 가까운 매물대 하단)&lt;/li&gt;
&lt;li&gt;손익비(R/R): 길게 끌고갈 생각&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;3. 결과:&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-03-16 오후 1.30.07.png&quot; data-origin-width=&quot;1431&quot; data-origin-height=&quot;571&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/s2BtN/btsMMBr13JP/ThUTnHiYYqtWHKLaetsk4k/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/s2BtN/btsMMBr13JP/ThUTnHiYYqtWHKLaetsk4k/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/s2BtN/btsMMBr13JP/ThUTnHiYYqtWHKLaetsk4k/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fs2BtN%2FbtsMMBr13JP%2FThUTnHiYYqtWHKLaetsk4k%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1431&quot; height=&quot;571&quot; data-filename=&quot;스크린샷 2025-03-16 오후 1.30.07.png&quot; data-origin-width=&quot;1431&quot; data-origin-height=&quot;571&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-03-16 오후 1.30.22.png&quot; data-origin-width=&quot;207&quot; data-origin-height=&quot;241&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/cmKD9p/btsMNh7B6c2/oKOsxzJJs9HkZ7YCVbzix0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/cmKD9p/btsMNh7B6c2/oKOsxzJJs9HkZ7YCVbzix0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/cmKD9p/btsMNh7B6c2/oKOsxzJJs9HkZ7YCVbzix0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FcmKD9p%2FbtsMNh7B6c2%2FoKOsxzJJs9HkZ7YCVbzix0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;207&quot; height=&quot;241&quot; data-filename=&quot;스크린샷 2025-03-16 오후 1.30.22.png&quot; data-origin-width=&quot;207&quot; data-origin-height=&quot;241&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;손절 나감..&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;4. 배운점, 느낀점 정리:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;요즘 도통 감을 못잡겠다...&lt;/li&gt;
&lt;li&gt;손익비는 자시고 몇연패를 하는 중인건지...&lt;/li&gt;
&lt;li&gt;좋은 자리를 기다렸다가 들어가는 것 같은데도 전에 몇번 욕심부려서 손절난게 타격이 많이 큰 것 같다..&lt;/li&gt;
&lt;/ul&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style5&quot; /&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;살려줘... 비트야&lt;/span&gt;&lt;/blockquote&gt;</description>
      <category>투자</category>
      <category>단타</category>
      <category>매매일지</category>
      <category>코인</category>
      <category>투자</category>
      <author>dongsunseng</author>
      <guid isPermaLink="true">https://dongsunseng.tistory.com/144</guid>
      <comments>https://dongsunseng.tistory.com/entry/%EB%A7%A4%EB%A7%A4%EC%9D%BC%EC%A7%80-11-%EB%A1%B1%EC%B0%A8#entry144comment</comments>
      <pubDate>Sun, 16 Mar 2025 13:32:33 +0900</pubDate>
    </item>
    <item>
      <title>[매매일지] 10. 숏차</title>
      <link>https://dongsunseng.tistory.com/entry/%EB%A7%A4%EB%A7%A4%EC%9D%BC%EC%A7%80-10-%EC%88%8F%EC%B0%A8</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;이전 포스트:&lt;/b&gt;&lt;/p&gt;
&lt;figure id=&quot;og_1741698790373&quot; contenteditable=&quot;false&quot; data-ke-type=&quot;opengraph&quot; data-ke-align=&quot;alignCenter&quot; data-og-type=&quot;article&quot; data-og-title=&quot;[매매일지] 9. 왜 굳이 역추세를 탔니..&quot; data-og-description=&quot;이전 포스트:&amp;nbsp;[매매일지] 8. 사실 재진입함..이전 포스트:&amp;nbsp;[매매일지] 7. 김비트 제발이전 포스트:&amp;nbsp;[매매일지] 6. 2연승 추가..?이전 포스트:&amp;nbsp;[매매일지] 5. 쉽지 않은 장세에 조금씩 수익 쌓아가는&quot; data-og-host=&quot;dongsunseng.com&quot; data-og-source-url=&quot;https://dongsunseng.com/entry/%EB%A7%A4%EB%A7%A4%EC%9D%BC%EC%A7%80-9-%EC%99%9C-%EA%B5%B3%EC%9D%B4-%EC%97%AD%EC%B6%94%EC%84%B8%EB%A5%BC-%ED%83%94%EB%8B%88&quot; data-og-url=&quot;https://dongsunseng.com/entry/%EB%A7%A4%EB%A7%A4%EC%9D%BC%EC%A7%80-9-%EC%99%9C-%EA%B5%B3%EC%9D%B4-%EC%97%AD%EC%B6%94%EC%84%B8%EB%A5%BC-%ED%83%94%EB%8B%88&quot; data-og-image=&quot;https://scrap.kakaocdn.net/dn/dDVoIl/hyYrQjglLT/2UNE06Vn8e6HAuov4fYfuk/img.jpg?width=640&amp;amp;height=360&amp;amp;face=0_0_640_360,https://scrap.kakaocdn.net/dn/bG8gzD/hyYmQyQI6m/zwnu2S9K1lOfNPr8bo1j3K/img.jpg?width=640&amp;amp;height=360&amp;amp;face=0_0_640_360,https://scrap.kakaocdn.net/dn/RXw25/hyYp9KuBxB/c0T0IZXrnaq0Iax5Vue3Tk/img.png?width=500&amp;amp;height=500&amp;amp;face=0_0_500_500&quot;&gt;&lt;a href=&quot;https://dongsunseng.com/entry/%EB%A7%A4%EB%A7%A4%EC%9D%BC%EC%A7%80-9-%EC%99%9C-%EA%B5%B3%EC%9D%B4-%EC%97%AD%EC%B6%94%EC%84%B8%EB%A5%BC-%ED%83%94%EB%8B%88&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot; data-source-url=&quot;https://dongsunseng.com/entry/%EB%A7%A4%EB%A7%A4%EC%9D%BC%EC%A7%80-9-%EC%99%9C-%EA%B5%B3%EC%9D%B4-%EC%97%AD%EC%B6%94%EC%84%B8%EB%A5%BC-%ED%83%94%EB%8B%88&quot;&gt;
&lt;div class=&quot;og-image&quot; style=&quot;background-image: url('https://scrap.kakaocdn.net/dn/dDVoIl/hyYrQjglLT/2UNE06Vn8e6HAuov4fYfuk/img.jpg?width=640&amp;amp;height=360&amp;amp;face=0_0_640_360,https://scrap.kakaocdn.net/dn/bG8gzD/hyYmQyQI6m/zwnu2S9K1lOfNPr8bo1j3K/img.jpg?width=640&amp;amp;height=360&amp;amp;face=0_0_640_360,https://scrap.kakaocdn.net/dn/RXw25/hyYp9KuBxB/c0T0IZXrnaq0Iax5Vue3Tk/img.png?width=500&amp;amp;height=500&amp;amp;face=0_0_500_500');&quot;&gt;&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;og-text&quot;&gt;
&lt;p class=&quot;og-title&quot; data-ke-size=&quot;size16&quot;&gt;[매매일지] 9. 왜 굳이 역추세를 탔니..&lt;/p&gt;
&lt;p class=&quot;og-desc&quot; data-ke-size=&quot;size16&quot;&gt;이전 포스트:&amp;nbsp;[매매일지] 8. 사실 재진입함..이전 포스트:&amp;nbsp;[매매일지] 7. 김비트 제발이전 포스트:&amp;nbsp;[매매일지] 6. 2연승 추가..?이전 포스트:&amp;nbsp;[매매일지] 5. 쉽지 않은 장세에 조금씩 수익 쌓아가는&lt;/p&gt;
&lt;p class=&quot;og-host&quot; data-ke-size=&quot;size16&quot;&gt;dongsunseng.com&lt;/p&gt;
&lt;/div&gt;
&lt;/a&gt;&lt;/figure&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style3&quot; /&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-03-11 오후 10.15.14.png&quot; data-origin-width=&quot;748&quot; data-origin-height=&quot;602&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/b6PhBn/btsMF6z4YMS/SceaUj9iz3fOICf4NchNkk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/b6PhBn/btsMF6z4YMS/SceaUj9iz3fOICf4NchNkk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/b6PhBn/btsMF6z4YMS/SceaUj9iz3fOICf4NchNkk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fb6PhBn%2FbtsMF6z4YMS%2FSceaUj9iz3fOICf4NchNkk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;748&quot; height=&quot;602&quot; data-filename=&quot;스크린샷 2025-03-11 오후 10.15.14.png&quot; data-origin-width=&quot;748&quot; data-origin-height=&quot;602&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-03-11 오후 10.18.41.png&quot; data-origin-width=&quot;1168&quot; data-origin-height=&quot;722&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/Lr9bG/btsMFHN35ow/wRF7oqwpUKFyDwslPs1eok/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/Lr9bG/btsMFHN35ow/wRF7oqwpUKFyDwslPs1eok/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/Lr9bG/btsMFHN35ow/wRF7oqwpUKFyDwslPs1eok/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FLr9bG%2FbtsMFHN35ow%2FwRF7oqwpUKFyDwslPs1eok%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1168&quot; height=&quot;722&quot; data-filename=&quot;스크린샷 2025-03-11 오후 10.18.41.png&quot; data-origin-width=&quot;1168&quot; data-origin-height=&quot;722&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-03-11 오후 10.34.22.png&quot; data-origin-width=&quot;538&quot; data-origin-height=&quot;468&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/XgxPD/btsMHp6GzDc/iTK5j7WkbSUg3IP0J0HDNK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/XgxPD/btsMHp6GzDc/iTK5j7WkbSUg3IP0J0HDNK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/XgxPD/btsMHp6GzDc/iTK5j7WkbSUg3IP0J0HDNK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FXgxPD%2FbtsMHp6GzDc%2FiTK5j7WkbSUg3IP0J0HDNK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;538&quot; height=&quot;468&quot; data-filename=&quot;스크린샷 2025-03-11 오후 10.34.22.png&quot; data-origin-width=&quot;538&quot; data-origin-height=&quot;468&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;2025.03.11&lt;/b&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;1. 진입 근거:&amp;nbsp;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;순추세가 하락인게 자명해지는 가운데, 상승의 1.272 부근에서 엄청난 반등이 일어남&lt;/li&gt;
&lt;li&gt;당연히 가짜 반등이라고 판단하고 숏 자리를 보고 있었음&lt;/li&gt;
&lt;li&gt;상승 추세선에서 멀어지다가 다시 붙고 있었고 단기 더블탑의 넥라인이 돌파되는것을 확인하고 진입함&lt;/li&gt;
&lt;li&gt;3번째 사진: 휩쏘에 당하고 다시 잡음(평단은 비슷)&lt;/li&gt;
&lt;li&gt;4시간봉 히든 하락 다이버 컨펌(새벽 1시까지)&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;2. 포지션 셋업:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;진입(EP): 81540&lt;/li&gt;
&lt;li&gt;익절(TP): 길게 끌고 갈 생각&lt;/li&gt;
&lt;li&gt;손절(SL): 82300(휩쏘 고점)&lt;/li&gt;
&lt;li&gt;손익비(R/R):&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;3. 결과:&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-03-15 오후 9.38.58.png&quot; data-origin-width=&quot;1141&quot; data-origin-height=&quot;423&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/VhVco/btsMLuguxik/K4wopI4eDUhY7i6LlXWqp1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/VhVco/btsMLuguxik/K4wopI4eDUhY7i6LlXWqp1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/VhVco/btsMLuguxik/K4wopI4eDUhY7i6LlXWqp1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FVhVco%2FbtsMLuguxik%2FK4wopI4eDUhY7i6LlXWqp1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1141&quot; height=&quot;423&quot; data-filename=&quot;스크린샷 2025-03-15 오후 9.38.58.png&quot; data-origin-width=&quot;1141&quot; data-origin-height=&quot;423&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;쭉쭉 잘 내려가다가 갑자기 오르더니 손절 맞음&lt;/li&gt;
&lt;li&gt;물론 분익 + 본절 해둘 수는 있었지만 상황 자체가 처음에 너무 잘 나와서 그러긴 힘들었음&lt;/li&gt;
&lt;li&gt;110k 부근에서 부터 잡은 채널의 하단을 빠르게 이탈했기 때문에 반등이 나올줄은 알았지만 채널 중단까지 다시 올라갈줄은 몰랐음&lt;/li&gt;
&lt;li&gt;장기적인 하락 추세라고 판단했기 때문에&lt;/li&gt;
&lt;li&gt;또한, 상승의 0.5 부근에서 반등한거였는데 786 부근까지는 갈거라고 생각함&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-03-15 오후 9.41.09.png&quot; data-origin-width=&quot;226&quot; data-origin-height=&quot;242&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/FAPnN/btsMMfimpoU/LX90vxMYQ5bO5FvmFCuf00/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/FAPnN/btsMMfimpoU/LX90vxMYQ5bO5FvmFCuf00/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/FAPnN/btsMMfimpoU/LX90vxMYQ5bO5FvmFCuf00/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FFAPnN%2FbtsMMfimpoU%2FLX90vxMYQ5bO5FvmFCuf00%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;226&quot; height=&quot;242&quot; data-filename=&quot;스크린샷 2025-03-15 오후 9.41.09.png&quot; data-origin-width=&quot;226&quot; data-origin-height=&quot;242&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-03-15 오후 9.41.47.png&quot; data-origin-width=&quot;813&quot; data-origin-height=&quot;594&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/ctTqeK/btsMLIZQYae/GGDz8P7QqfOifLxHq8s8k0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/ctTqeK/btsMLIZQYae/GGDz8P7QqfOifLxHq8s8k0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/ctTqeK/btsMLIZQYae/GGDz8P7QqfOifLxHq8s8k0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FctTqeK%2FbtsMLIZQYae%2FGGDz8P7QqfOifLxHq8s8k0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;813&quot; height=&quot;594&quot; data-filename=&quot;스크린샷 2025-03-15 오후 9.41.47.png&quot; data-origin-width=&quot;813&quot; data-origin-height=&quot;594&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style3&quot; /&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-03-15 오후 9.52.08.png&quot; data-origin-width=&quot;325&quot; data-origin-height=&quot;290&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/1bnQ1/btsMNjjXthb/yFPRPx5FN6w5ZCJNsbBi7k/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/1bnQ1/btsMNjjXthb/yFPRPx5FN6w5ZCJNsbBi7k/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/1bnQ1/btsMNjjXthb/yFPRPx5FN6w5ZCJNsbBi7k/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2F1bnQ1%2FbtsMNjjXthb%2FyFPRPx5FN6w5ZCJNsbBi7k%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;325&quot; height=&quot;290&quot; data-filename=&quot;스크린샷 2025-03-15 오후 9.52.08.png&quot; data-origin-width=&quot;325&quot; data-origin-height=&quot;290&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-03-15 오후 9.54.46.png&quot; data-origin-width=&quot;578&quot; data-origin-height=&quot;413&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/b0hh9Q/btsMKU7Ainw/B7KTFrMKKMeHhmbyhkUJJ0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/b0hh9Q/btsMKU7Ainw/B7KTFrMKKMeHhmbyhkUJJ0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/b0hh9Q/btsMKU7Ainw/B7KTFrMKKMeHhmbyhkUJJ0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fb0hh9Q%2FbtsMKU7Ainw%2FB7KTFrMKKMeHhmbyhkUJJ0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;578&quot; height=&quot;413&quot; data-filename=&quot;스크린샷 2025-03-15 오후 9.54.46.png&quot; data-origin-width=&quot;578&quot; data-origin-height=&quot;413&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-03-15 오후 9.56.12.png&quot; data-origin-width=&quot;702&quot; data-origin-height=&quot;564&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/5civl/btsMMAT7gD9/8EJUWo1Nqg477wGUYXqn91/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/5civl/btsMMAT7gD9/8EJUWo1Nqg477wGUYXqn91/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/5civl/btsMMAT7gD9/8EJUWo1Nqg477wGUYXqn91/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2F5civl%2FbtsMMAT7gD9%2F8EJUWo1Nqg477wGUYXqn91%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;702&quot; height=&quot;564&quot; data-filename=&quot;스크린샷 2025-03-15 오후 9.56.12.png&quot; data-origin-width=&quot;702&quot; data-origin-height=&quot;564&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;2025.03.12&lt;/b&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;1. 진입 근거:&amp;nbsp;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;한참 전에 숏 진입은 다시 해뒀고 휩쏘에 당함(약손절)&lt;/li&gt;
&lt;li&gt;휩쏘에 당하고 다시 잡음&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;2. 포지션 셋업:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;진입(EP): 82942&lt;/li&gt;
&lt;li&gt;익절(TP): 79330&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;3. 결과:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;익절&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-03-15 오후 9.56.46.png&quot; data-origin-width=&quot;410&quot; data-origin-height=&quot;393&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/IQRqy/btsMNeiH4qy/TIyThEAHyVfMm6iVr9kX7K/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/IQRqy/btsMNeiH4qy/TIyThEAHyVfMm6iVr9kX7K/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/IQRqy/btsMNeiH4qy/TIyThEAHyVfMm6iVr9kX7K/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FIQRqy%2FbtsMNeiH4qy%2FTIyThEAHyVfMm6iVr9kX7K%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;410&quot; height=&quot;393&quot; data-filename=&quot;스크린샷 2025-03-15 오후 9.56.46.png&quot; data-origin-width=&quot;410&quot; data-origin-height=&quot;393&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-03-15 오후 9.57.39.png&quot; data-origin-width=&quot;1025&quot; data-origin-height=&quot;394&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/2D9Zl/btsML2XSqE3/4aDToqIcJkQ8TCZVMSujP0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/2D9Zl/btsML2XSqE3/4aDToqIcJkQ8TCZVMSujP0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/2D9Zl/btsML2XSqE3/4aDToqIcJkQ8TCZVMSujP0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2F2D9Zl%2FbtsML2XSqE3%2F4aDToqIcJkQ8TCZVMSujP0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1025&quot; height=&quot;394&quot; data-filename=&quot;스크린샷 2025-03-15 오후 9.57.39.png&quot; data-origin-width=&quot;1025&quot; data-origin-height=&quot;394&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;2025.03.12&lt;/b&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;1. 진입 근거:&amp;nbsp;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;동일하게 숏&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;3. 결과:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;계속 휩쏘를 당하니까 본절튀 하고 롱을 잡았다가 손절 맞음&lt;/li&gt;
&lt;li&gt;이후에 뇌동매매로 숏을 잡았다가 풀고 하다가 결국 멘징은 하긴 했음&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;4. 배운점, 느낀점 정리:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;휩쏘를 당해도 홀드 하자&lt;/li&gt;
&lt;li&gt;단기 반등을 노리지 말자 (순추세 매매)&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;5. 반성할 점:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span&gt;이슈로 인한 단기 반등을 고배로 먹어보려고 한 점은 정말 반성해야할 점이다&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span&gt;손절 맞은 후에 제대로 된 기준 없는 뇌동매매&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-03-15 오후 10.15.16.png&quot; data-origin-width=&quot;1427&quot; data-origin-height=&quot;613&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/kLhRS/btsMMGzUbep/v3LximqSyj8YKdiHcbpyC1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/kLhRS/btsMMGzUbep/v3LximqSyj8YKdiHcbpyC1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/kLhRS/btsMMGzUbep/v3LximqSyj8YKdiHcbpyC1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FkLhRS%2FbtsMMGzUbep%2Fv3LximqSyj8YKdiHcbpyC1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1427&quot; height=&quot;613&quot; data-filename=&quot;스크린샷 2025-03-15 오후 10.15.16.png&quot; data-origin-width=&quot;1427&quot; data-origin-height=&quot;613&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;처음 작성한 매매에서 큰 채널을 벗어나고 나서 상승이 쭉 이어졌기 때문에 장기적으로는 하락을 보지만 지금은 롱 포지션을 잡아야할 때라고 생각이 바뀜&lt;/li&gt;
&lt;/ul&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style5&quot; /&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;최근에 멘탈 관리가 잘 안되서 크게 잃고 겨우 멘징하고를 반복하다가 다시 잃은 상황임..&lt;br /&gt;&lt;/span&gt;시드가 50% 아래로 떨어진건 처음이라 많이 아픈데 천천히 복구해보자...&lt;/blockquote&gt;</description>
      <category>투자</category>
      <category>단타</category>
      <category>매매일지</category>
      <category>코인</category>
      <category>투자</category>
      <author>dongsunseng</author>
      <guid isPermaLink="true">https://dongsunseng.tistory.com/143</guid>
      <comments>https://dongsunseng.tistory.com/entry/%EB%A7%A4%EB%A7%A4%EC%9D%BC%EC%A7%80-10-%EC%88%8F%EC%B0%A8#entry143comment</comments>
      <pubDate>Sat, 15 Mar 2025 22:25:29 +0900</pubDate>
    </item>
    <item>
      <title>[매매일지] 9. 왜 굳이 역추세를 탔니..</title>
      <link>https://dongsunseng.tistory.com/entry/%EB%A7%A4%EB%A7%A4%EC%9D%BC%EC%A7%80-9-%EC%99%9C-%EA%B5%B3%EC%9D%B4-%EC%97%AD%EC%B6%94%EC%84%B8%EB%A5%BC-%ED%83%94%EB%8B%88</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;이전 포스트:&lt;/b&gt;&lt;/p&gt;
&lt;figure id=&quot;og_1741696991191&quot; contenteditable=&quot;false&quot; data-ke-type=&quot;opengraph&quot; data-ke-align=&quot;alignCenter&quot; data-og-type=&quot;article&quot; data-og-title=&quot;[매매일지] 8. 사실 재진입함..&quot; data-og-description=&quot;이전 포스트:&amp;nbsp;[매매일지] 7. 김비트 제발이전 포스트:&amp;nbsp;[매매일지] 6. 2연승 추가..?이전 포스트:&amp;nbsp;[매매일지] 5. 쉽지 않은 장세에 조금씩 수익 쌓아가는중일단 장기적인 추세로 하락 추세를 보고 &quot; data-og-host=&quot;dongsunseng.com&quot; data-og-source-url=&quot;https://dongsunseng.com/entry/%EB%A7%A4%EB%A7%A4%EC%9D%BC%EC%A7%80-8-%EC%82%AC%EC%8B%A4-%EC%9E%AC%EC%A7%84%EC%9E%85%ED%95%A8&quot; data-og-url=&quot;https://dongsunseng.com/entry/%EB%A7%A4%EB%A7%A4%EC%9D%BC%EC%A7%80-8-%EC%82%AC%EC%8B%A4-%EC%9E%AC%EC%A7%84%EC%9E%85%ED%95%A8&quot; data-og-image=&quot;https://scrap.kakaocdn.net/dn/ddCZQ7/hyYr3bPW57/4AOGFkB69posC1CbTaee81/img.jpg?width=640&amp;amp;height=360&amp;amp;face=0_0_640_360,https://scrap.kakaocdn.net/dn/dW1UeZ/hyYm1mQrCi/zbbzmqkP5KFi7no8E15s21/img.jpg?width=640&amp;amp;height=360&amp;amp;face=0_0_640_360,https://scrap.kakaocdn.net/dn/VLDBw/hyYp9DIt4g/CXebaSMiYzt1WMwvF9Sc6K/img.png?width=639&amp;amp;height=447&amp;amp;face=0_0_639_447&quot;&gt;&lt;a href=&quot;https://dongsunseng.com/entry/%EB%A7%A4%EB%A7%A4%EC%9D%BC%EC%A7%80-8-%EC%82%AC%EC%8B%A4-%EC%9E%AC%EC%A7%84%EC%9E%85%ED%95%A8&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot; data-source-url=&quot;https://dongsunseng.com/entry/%EB%A7%A4%EB%A7%A4%EC%9D%BC%EC%A7%80-8-%EC%82%AC%EC%8B%A4-%EC%9E%AC%EC%A7%84%EC%9E%85%ED%95%A8&quot;&gt;
&lt;div class=&quot;og-image&quot; style=&quot;background-image: url('https://scrap.kakaocdn.net/dn/ddCZQ7/hyYr3bPW57/4AOGFkB69posC1CbTaee81/img.jpg?width=640&amp;amp;height=360&amp;amp;face=0_0_640_360,https://scrap.kakaocdn.net/dn/dW1UeZ/hyYm1mQrCi/zbbzmqkP5KFi7no8E15s21/img.jpg?width=640&amp;amp;height=360&amp;amp;face=0_0_640_360,https://scrap.kakaocdn.net/dn/VLDBw/hyYp9DIt4g/CXebaSMiYzt1WMwvF9Sc6K/img.png?width=639&amp;amp;height=447&amp;amp;face=0_0_639_447');&quot;&gt;&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;og-text&quot;&gt;
&lt;p class=&quot;og-title&quot; data-ke-size=&quot;size16&quot;&gt;[매매일지] 8. 사실 재진입함..&lt;/p&gt;
&lt;p class=&quot;og-desc&quot; data-ke-size=&quot;size16&quot;&gt;이전 포스트:&amp;nbsp;[매매일지] 7. 김비트 제발이전 포스트:&amp;nbsp;[매매일지] 6. 2연승 추가..?이전 포스트:&amp;nbsp;[매매일지] 5. 쉽지 않은 장세에 조금씩 수익 쌓아가는중일단 장기적인 추세로 하락 추세를 보고&lt;/p&gt;
&lt;p class=&quot;og-host&quot; data-ke-size=&quot;size16&quot;&gt;dongsunseng.com&lt;/p&gt;
&lt;/div&gt;
&lt;/a&gt;&lt;/figure&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style3&quot; /&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-03-11 오후 9.42.24.png&quot; data-origin-width=&quot;549&quot; data-origin-height=&quot;409&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/mj2i0/btsMF8kkUZI/025QBQbNqxWMNM4WjKydtK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/mj2i0/btsMF8kkUZI/025QBQbNqxWMNM4WjKydtK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/mj2i0/btsMF8kkUZI/025QBQbNqxWMNM4WjKydtK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fmj2i0%2FbtsMF8kkUZI%2F025QBQbNqxWMNM4WjKydtK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;549&quot; height=&quot;409&quot; data-filename=&quot;스크린샷 2025-03-11 오후 9.42.24.png&quot; data-origin-width=&quot;549&quot; data-origin-height=&quot;409&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;2025.03.10&lt;/b&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;1. 진입 근거:&amp;nbsp;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;피보나치 1.618 부근에서 되돌림을 먹으려고 역추세를 트라이함&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;2. 포지션 셋업:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;진입(EP): 83,631.60&lt;/li&gt;
&lt;li&gt;익절(TP): 약익절&lt;/li&gt;
&lt;li&gt;손절(SL): 약손절&lt;/li&gt;
&lt;li&gt;손익비(R/R): 초단타&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;3. 결과:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;역추세 매매는 원래 지양하는 편인데 이상하게 홀린듯이 들어가버림&lt;/li&gt;
&lt;li&gt;3번의 뇌동매매로 이어졌고 약손절 3번 후에 매매 종료해버림&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;4. 배운점, 느낀점 정리:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;역추세는 기본적으로 심법적으로 두배 더 힘든 것 같음&lt;/li&gt;
&lt;li&gt;언제 순추세의 움직임이 나온다는 불안감이 있기 때문&lt;/li&gt;
&lt;li&gt;약간의 이익을 얻으려고 초보자가 역추세 매매를 하는 것은 가성비가 안나온다는 생각을 함&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;5. 반성할 점:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span&gt;개인적으로 멘탈적으로 온전하지 못한 날이었는데 매매를 강행한게 독이 되었던 것 같음&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span&gt;이런 날에는 그냥 쉬자..&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span&gt;어차피 내일이고 모레고 좋은 자리는 오니까&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style3&quot; /&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-03-11 오후 10.00.43.png&quot; data-origin-width=&quot;345&quot; data-origin-height=&quot;452&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/cddrrG/btsMFGV0rBv/1q3qKZ9D0uKW9MjKhdj9k0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/cddrrG/btsMFGV0rBv/1q3qKZ9D0uKW9MjKhdj9k0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/cddrrG/btsMFGV0rBv/1q3qKZ9D0uKW9MjKhdj9k0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FcddrrG%2FbtsMFGV0rBv%2F1q3qKZ9D0uKW9MjKhdj9k0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;345&quot; height=&quot;452&quot; data-filename=&quot;스크린샷 2025-03-11 오후 10.00.43.png&quot; data-origin-width=&quot;345&quot; data-origin-height=&quot;452&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-03-11 오후 10.03.22.png&quot; data-origin-width=&quot;352&quot; data-origin-height=&quot;618&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/47Ck6/btsMHa2U7o2/bX4q3XYhiLKDXeOejmqmt1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/47Ck6/btsMHa2U7o2/bX4q3XYhiLKDXeOejmqmt1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/47Ck6/btsMHa2U7o2/bX4q3XYhiLKDXeOejmqmt1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2F47Ck6%2FbtsMHa2U7o2%2FbX4q3XYhiLKDXeOejmqmt1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;352&quot; height=&quot;618&quot; data-filename=&quot;스크린샷 2025-03-11 오후 10.03.22.png&quot; data-origin-width=&quot;352&quot; data-origin-height=&quot;618&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;2025.03.10&lt;/b&gt;&lt;/p&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;1. 진입 근거:&amp;nbsp;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;매물대 저항 자리라서 초단타 들어감&lt;/li&gt;
&lt;/ul&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;2. 포지션 셋업:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;진입(EP): &lt;span style=&quot;background-color: #101014; color: #ffffff; text-align: left;&quot;&gt;80,963.57&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;익절(TP):&amp;nbsp;&lt;/li&gt;
&lt;li&gt;손절(SL):&amp;nbsp;&lt;/li&gt;
&lt;li&gt;손익비(R/R):&amp;nbsp;&lt;/li&gt;
&lt;/ul&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;3. 결과:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;위에서 손절 난거 멘징함&lt;/li&gt;
&lt;/ul&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;4. 배운점, 느낀점 정리:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;순추세로 줄먹하는게 마음도 편하고 수익률도 좋은듯&lt;/li&gt;
&lt;/ul&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style5&quot; /&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;원금을 넘기고 나서는 진입하기가 무서워지는데 기계적 매매하자&lt;/blockquote&gt;</description>
      <category>투자</category>
      <category>단타</category>
      <category>매매일지</category>
      <category>비트코인</category>
      <category>코인</category>
      <category>투자</category>
      <author>dongsunseng</author>
      <guid isPermaLink="true">https://dongsunseng.tistory.com/142</guid>
      <comments>https://dongsunseng.tistory.com/entry/%EB%A7%A4%EB%A7%A4%EC%9D%BC%EC%A7%80-9-%EC%99%9C-%EA%B5%B3%EC%9D%B4-%EC%97%AD%EC%B6%94%EC%84%B8%EB%A5%BC-%ED%83%94%EB%8B%88#entry142comment</comments>
      <pubDate>Tue, 11 Mar 2025 22:12:31 +0900</pubDate>
    </item>
    <item>
      <title>[매매일지] 8. 사실 재진입함..</title>
      <link>https://dongsunseng.tistory.com/entry/%EB%A7%A4%EB%A7%A4%EC%9D%BC%EC%A7%80-8-%EC%82%AC%EC%8B%A4-%EC%9E%AC%EC%A7%84%EC%9E%85%ED%95%A8</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;이전 포스트:&lt;/b&gt;&lt;/p&gt;
&lt;figure id=&quot;og_1741436293827&quot; contenteditable=&quot;false&quot; data-ke-type=&quot;opengraph&quot; data-ke-align=&quot;alignCenter&quot; data-og-type=&quot;article&quot; data-og-title=&quot;[매매일지] 7. 김비트 제발&quot; data-og-description=&quot;이전 포스트:&amp;nbsp;[매매일지] 6. 2연승 추가..?이전 포스트:&amp;nbsp;[매매일지] 5. 쉽지 않은 장세에 조금씩 수익 쌓아가는중일단 장기적인 추세로 하락 추세를 보고 있음&amp;nbsp;따라서, 순추세를 하락으로 보고 숏&quot; data-og-host=&quot;dongsunseng.com&quot; data-og-source-url=&quot;https://dongsunseng.com/entry/%EB%A7%A4%EB%A7%A4%EC%9D%BC%EC%A7%80-7-%EA%B9%80%EB%B9%84%ED%8A%B8-%EC%A0%9C%EB%B0%9C&quot; data-og-url=&quot;https://dongsunseng.com/entry/%EB%A7%A4%EB%A7%A4%EC%9D%BC%EC%A7%80-7-%EA%B9%80%EB%B9%84%ED%8A%B8-%EC%A0%9C%EB%B0%9C&quot; data-og-image=&quot;https://scrap.kakaocdn.net/dn/bLcg3e/hyYqbumJ85/m1HLyb2urjy4LVbeo0mk41/img.jpg?width=640&amp;amp;height=360&amp;amp;face=0_0_640_360,https://scrap.kakaocdn.net/dn/PI78o/hyYqaCcGjU/iH7IY5vXAgIkDAiWk4GOm1/img.jpg?width=640&amp;amp;height=360&amp;amp;face=0_0_640_360,https://scrap.kakaocdn.net/dn/lnyqU/hyYmZPyk6I/3nzyr2IMZO9DXyKvyscui0/img.png?width=1510&amp;amp;height=693&amp;amp;face=0_0_1510_693&quot;&gt;&lt;a href=&quot;https://dongsunseng.com/entry/%EB%A7%A4%EB%A7%A4%EC%9D%BC%EC%A7%80-7-%EA%B9%80%EB%B9%84%ED%8A%B8-%EC%A0%9C%EB%B0%9C&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot; data-source-url=&quot;https://dongsunseng.com/entry/%EB%A7%A4%EB%A7%A4%EC%9D%BC%EC%A7%80-7-%EA%B9%80%EB%B9%84%ED%8A%B8-%EC%A0%9C%EB%B0%9C&quot;&gt;
&lt;div class=&quot;og-image&quot; style=&quot;background-image: url('https://scrap.kakaocdn.net/dn/bLcg3e/hyYqbumJ85/m1HLyb2urjy4LVbeo0mk41/img.jpg?width=640&amp;amp;height=360&amp;amp;face=0_0_640_360,https://scrap.kakaocdn.net/dn/PI78o/hyYqaCcGjU/iH7IY5vXAgIkDAiWk4GOm1/img.jpg?width=640&amp;amp;height=360&amp;amp;face=0_0_640_360,https://scrap.kakaocdn.net/dn/lnyqU/hyYmZPyk6I/3nzyr2IMZO9DXyKvyscui0/img.png?width=1510&amp;amp;height=693&amp;amp;face=0_0_1510_693');&quot;&gt;&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;og-text&quot;&gt;
&lt;p class=&quot;og-title&quot; data-ke-size=&quot;size16&quot;&gt;[매매일지] 7. 김비트 제발&lt;/p&gt;
&lt;p class=&quot;og-desc&quot; data-ke-size=&quot;size16&quot;&gt;이전 포스트:&amp;nbsp;[매매일지] 6. 2연승 추가..?이전 포스트:&amp;nbsp;[매매일지] 5. 쉽지 않은 장세에 조금씩 수익 쌓아가는중일단 장기적인 추세로 하락 추세를 보고 있음&amp;nbsp;따라서, 순추세를 하락으로 보고 숏&lt;/p&gt;
&lt;p class=&quot;og-host&quot; data-ke-size=&quot;size16&quot;&gt;dongsunseng.com&lt;/p&gt;
&lt;/div&gt;
&lt;/a&gt;&lt;/figure&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style3&quot; /&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-03-08 오후 6.44.59.png&quot; data-origin-width=&quot;639&quot; data-origin-height=&quot;447&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bfQcVG/btsMFLnmuGw/2b2NWTJgtgekFbrKEepApK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bfQcVG/btsMFLnmuGw/2b2NWTJgtgekFbrKEepApK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bfQcVG/btsMFLnmuGw/2b2NWTJgtgekFbrKEepApK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbfQcVG%2FbtsMFLnmuGw%2F2b2NWTJgtgekFbrKEepApK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;639&quot; height=&quot;447&quot; data-filename=&quot;스크린샷 2025-03-08 오후 6.44.59.png&quot; data-origin-width=&quot;639&quot; data-origin-height=&quot;447&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;1. 진입 근거:&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-03-08 오후 9.14.31.png&quot; data-origin-width=&quot;1270&quot; data-origin-height=&quot;796&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/GACgQ/btsMFvE7WX0/gkm94q7xYKsjTmOQnCQc9k/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/GACgQ/btsMFvE7WX0/gkm94q7xYKsjTmOQnCQc9k/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/GACgQ/btsMFvE7WX0/gkm94q7xYKsjTmOQnCQc9k/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FGACgQ%2FbtsMFvE7WX0%2Fgkm94q7xYKsjTmOQnCQc9k%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;648&quot; height=&quot;406&quot; data-filename=&quot;스크린샷 2025-03-08 오후 9.14.31.png&quot; data-origin-width=&quot;1270&quot; data-origin-height=&quot;796&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;전 포스트를 보고 오면 알겠지만 크립토써밋에 맞춰서 실망 매물로 인한 하락분을 먹기 위해 숏포지션을 계속 잡으려고 하고 있었음&lt;/li&gt;
&lt;li&gt;고점 부근에서 포지션을 잘 잡았고 하락분을 대부분 먹었음&lt;/li&gt;
&lt;li&gt;3월 4일부터 이어진 상승에서의 저점과 상승 최고점을 피보나치 되돌림으로 찍어봤을 때 0.618까지 가서 완익을 쳤음&lt;/li&gt;
&lt;li&gt;장기적인 하락을 보고 있었기 때문에 786까지도 가지 않을까 싶어서 자기전에 다시 포지션을 잡고 잤음&lt;/li&gt;
&lt;li&gt;저번 포스트를 작성했을 때는 무리라고 생각해서 다시 포지션을 안 잡고 자는게 낫다고 생각해서 욕심인 것 같다고 작성했지만 아무리 생각해도 더 큰 하락이 나왔어야 해서 손절을 해당 날에 본 수익을 뱉어낼 정도로만 잡아봄&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;2. 포지션 셋업:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;진입(EP): 87405&lt;/li&gt;
&lt;li&gt;익절(TP): 83688&lt;/li&gt;
&lt;li&gt;손절(SL): 90000&lt;/li&gt;
&lt;li&gt;손익비(R/R): 1.28&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;3. 결과:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;손절 구간을 크게 안잡았으면 자는동안 스탑이 나갈뻔할 정도의 반등이 나옴&lt;/li&gt;
&lt;li&gt;수익은 꽤 봤지만 이번에도 완익, 분익 판단이 아쉬웠음&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-03-08 오후 9.24.44.png&quot; data-origin-width=&quot;624&quot; data-origin-height=&quot;212&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bz0NFr/btsME8jhq7D/MqAvfnbVACI7dkBFTv8Zmk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bz0NFr/btsME8jhq7D/MqAvfnbVACI7dkBFTv8Zmk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bz0NFr/btsME8jhq7D/MqAvfnbVACI7dkBFTv8Zmk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fbz0NFr%2FbtsME8jhq7D%2FMqAvfnbVACI7dkBFTv8Zmk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;624&quot; height=&quot;212&quot; data-filename=&quot;스크린샷 2025-03-08 오후 9.24.44.png&quot; data-origin-width=&quot;624&quot; data-origin-height=&quot;212&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;4. 배운점, 느낀점 정리:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;스탑이 나갈뻔할정도로 쎄게 반등이 나왔고 이를 예상했어야되는데 아직 부족한 것 같음 (2번째 저항으로 닿는 시점이었기 때문)&lt;/li&gt;
&lt;li&gt;반등이 나온 후에 포지션을 잡았으면 수익을 더 볼 수 있었음&lt;/li&gt;
&lt;li&gt;또한, 이미 618까지 한번 갔기 때문에 위와 같이 2번 정도 더 618 구간에서 비빈 후에 쭉 하락할 줄 알았는데 주말이라 그런지 계속 횡보중임&lt;/li&gt;
&lt;li&gt;주말에는 보통 횡보를 하지만 최근에는 또 그렇지 않았기에 이것까지 예측하기에는 쉽지 않았을 것 같긴함&lt;/li&gt;
&lt;li&gt;고점에서 포지션을 잡고 여기까지 끌고 왔다면 평단이 좋아서 괜찮았겠지만 완익후에 다시 잡았기 때문에 적정선에서 정리했음&lt;/li&gt;
&lt;li&gt;완익 분익 판단이 아쉬웠던 이유:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;4번째 618 구간에 닿았을 때는 하락을 보여줬어야 한다고 생각했는데 반등이 나오는걸 보고 바로 정리함&lt;/li&gt;
&lt;li&gt;그냥 618 구간에 3번째 닿았을 때 반익을 치고, 본절 걸고 지켜봤다면 좀 더 나은 판단이 아니었을까 아쉬움&lt;/li&gt;
&lt;li&gt;그래도 이정도로 횡보할거라는건 예측불가의 범위였다고 생각하긴함&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;반등이 나오면 다시 잡던가 해야겠음..&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;5. 반성할 점:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span&gt;일단 반등을 생각 못하고 평단이 안좋음에도 끌고 가려고 고집을 부린 점&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span&gt;786까지 갈거라는 고집으로 분익을 칠 생각도 안한 점&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span&gt;개선:&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span&gt;일단 이런 고집을 부린 것은 전 매매들에서 반익을 치고 했어도 수익이 너무 아쉬웠어서 그냥 끌고가보자 라는 생각이 컸음&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span&gt;애초에 레버리지를 잘못 설정한듯&lt;/span&gt;&lt;span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span&gt;레버리지를 살짝 올려서 반익본절 운영을 하는 방식으로 다시 돌아가는게 나을듯&lt;/span&gt;&lt;span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span&gt;또한, 최근 인풋에 비해 아웃풋이 많은 상황이 반복되면서 매매로 인한 스트레스만 늘고있던 것 같음 (아는 것에 비해 수익을 더 보려고 하니까)&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span&gt;공부 시간을 더 늘리고 레버리지 운영을 좀 더 연구해봐야 될듯&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style5&quot; /&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;아직 많이 미숙한듯..&lt;/blockquote&gt;</description>
      <category>투자</category>
      <category>단타</category>
      <category>매매일지</category>
      <category>코인</category>
      <category>투자</category>
      <author>dongsunseng</author>
      <guid isPermaLink="true">https://dongsunseng.tistory.com/141</guid>
      <comments>https://dongsunseng.tistory.com/entry/%EB%A7%A4%EB%A7%A4%EC%9D%BC%EC%A7%80-8-%EC%82%AC%EC%8B%A4-%EC%9E%AC%EC%A7%84%EC%9E%85%ED%95%A8#entry141comment</comments>
      <pubDate>Sat, 8 Mar 2025 21:37:33 +0900</pubDate>
    </item>
    <item>
      <title>[매매일지] 7. 김비트 제발</title>
      <link>https://dongsunseng.tistory.com/entry/%EB%A7%A4%EB%A7%A4%EC%9D%BC%EC%A7%80-7-%EA%B9%80%EB%B9%84%ED%8A%B8-%EC%A0%9C%EB%B0%9C</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;이전 포스트:&lt;/b&gt;&lt;/p&gt;
&lt;figure id=&quot;og_1741356710442&quot; contenteditable=&quot;false&quot; data-ke-type=&quot;opengraph&quot; data-ke-align=&quot;alignCenter&quot; data-og-type=&quot;article&quot; data-og-title=&quot;[매매일지] 6. 2연승 추가..?&quot; data-og-description=&quot;이전 포스트:&amp;nbsp;[매매일지] 5. 쉽지 않은 장세에 조금씩 수익 쌓아가는중일단 장기적인 추세로 하락 추세를 보고 있음&amp;nbsp;따라서, 순추세를 하락으로 보고 숏 자리를 보는중임오늘은 포지션을 총 3번&quot; data-og-host=&quot;dongsunseng.com&quot; data-og-source-url=&quot;https://dongsunseng.com/entry/%EB%A7%A4%EB%A7%A4%EC%9D%BC%EC%A7%80-6-2%EC%97%B0%EC%8A%B9-%EC%B6%94%EA%B0%80&quot; data-og-url=&quot;https://dongsunseng.com/entry/%EB%A7%A4%EB%A7%A4%EC%9D%BC%EC%A7%80-6-2%EC%97%B0%EC%8A%B9-%EC%B6%94%EA%B0%80&quot; data-og-image=&quot;https://scrap.kakaocdn.net/dn/g9diY/hyYqdleCQu/vfLKEhLDKlK06kKyNZ7vI1/img.jpg?width=640&amp;amp;height=360&amp;amp;face=0_0_640_360,https://scrap.kakaocdn.net/dn/bFc5lA/hyYnd76vzQ/pk9gOJsKhfohVRBqtmpH31/img.jpg?width=640&amp;amp;height=360&amp;amp;face=0_0_640_360,https://scrap.kakaocdn.net/dn/xvUop/hyYqTmBY3x/Rj8FbNEYwvGUSFV5P8b4U0/img.png?width=991&amp;amp;height=654&amp;amp;face=0_0_991_654&quot;&gt;&lt;a href=&quot;https://dongsunseng.com/entry/%EB%A7%A4%EB%A7%A4%EC%9D%BC%EC%A7%80-6-2%EC%97%B0%EC%8A%B9-%EC%B6%94%EA%B0%80&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot; data-source-url=&quot;https://dongsunseng.com/entry/%EB%A7%A4%EB%A7%A4%EC%9D%BC%EC%A7%80-6-2%EC%97%B0%EC%8A%B9-%EC%B6%94%EA%B0%80&quot;&gt;
&lt;div class=&quot;og-image&quot; style=&quot;background-image: url('https://scrap.kakaocdn.net/dn/g9diY/hyYqdleCQu/vfLKEhLDKlK06kKyNZ7vI1/img.jpg?width=640&amp;amp;height=360&amp;amp;face=0_0_640_360,https://scrap.kakaocdn.net/dn/bFc5lA/hyYnd76vzQ/pk9gOJsKhfohVRBqtmpH31/img.jpg?width=640&amp;amp;height=360&amp;amp;face=0_0_640_360,https://scrap.kakaocdn.net/dn/xvUop/hyYqTmBY3x/Rj8FbNEYwvGUSFV5P8b4U0/img.png?width=991&amp;amp;height=654&amp;amp;face=0_0_991_654');&quot;&gt;&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;og-text&quot;&gt;
&lt;p class=&quot;og-title&quot; data-ke-size=&quot;size16&quot;&gt;[매매일지] 6. 2연승 추가..?&lt;/p&gt;
&lt;p class=&quot;og-desc&quot; data-ke-size=&quot;size16&quot;&gt;이전 포스트:&amp;nbsp;[매매일지] 5. 쉽지 않은 장세에 조금씩 수익 쌓아가는중일단 장기적인 추세로 하락 추세를 보고 있음&amp;nbsp;따라서, 순추세를 하락으로 보고 숏 자리를 보는중임오늘은 포지션을 총 3번&lt;/p&gt;
&lt;p class=&quot;og-host&quot; data-ke-size=&quot;size16&quot;&gt;dongsunseng.com&lt;/p&gt;
&lt;/div&gt;
&lt;/a&gt;&lt;/figure&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style3&quot; /&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-03-08 오전 1.09.37.png&quot; data-origin-width=&quot;1510&quot; data-origin-height=&quot;693&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bTJAun/btsMFzgeXZB/K9nHvjkLZia4g8s2nH95B0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bTJAun/btsMFzgeXZB/K9nHvjkLZia4g8s2nH95B0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bTJAun/btsMFzgeXZB/K9nHvjkLZia4g8s2nH95B0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbTJAun%2FbtsMFzgeXZB%2FK9nHvjkLZia4g8s2nH95B0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1510&quot; height=&quot;693&quot; data-filename=&quot;스크린샷 2025-03-08 오전 1.09.37.png&quot; data-origin-width=&quot;1510&quot; data-origin-height=&quot;693&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-03-08 오전 1.14.56.png&quot; data-origin-width=&quot;659&quot; data-origin-height=&quot;469&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bF8xa7/btsMEoNUtxB/TtX58AJ12tnc3WRPA8t1K0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bF8xa7/btsMEoNUtxB/TtX58AJ12tnc3WRPA8t1K0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bF8xa7/btsMEoNUtxB/TtX58AJ12tnc3WRPA8t1K0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbF8xa7%2FbtsMEoNUtxB%2FTtX58AJ12tnc3WRPA8t1K0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;659&quot; height=&quot;469&quot; data-filename=&quot;스크린샷 2025-03-08 오전 1.14.56.png&quot; data-origin-width=&quot;659&quot; data-origin-height=&quot;469&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;br /&gt;1. 진입 근거:&amp;nbsp;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;대망의 크립토 써밋&lt;/li&gt;
&lt;li&gt;기대감으로 상승을 보여주고 있긴 하지만 별 내용 없을거라고 생각함 -&amp;gt; 하락&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;2. 포지션 셋업:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;진입(EP): 90136&lt;/li&gt;
&lt;li&gt;익절(TP): 87000 (618 부근)&lt;/li&gt;
&lt;li&gt;손절(SL): 92100 (청산가 모여있는 부근 위 + 상단 매물대 윗부근)&lt;/li&gt;
&lt;li&gt;손익비(R/R): 1.58&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;3. 결과:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;88450 부근에서 반익절 하고 끌고가다가 거래량 실린 양봉보고 바로 정리함&lt;/li&gt;
&lt;li&gt;사실 끌고가도 되는 상황이었는데 최근 몇번 매매에서 반익절하고 남은 반절의 수익을 먹은적이 없어서 한번 조금이라도 더 수익을 보고 싶었음&lt;/li&gt;
&lt;li&gt;정리후에 다시 잡고싶었지만 피곤하기도 했고 반등이 계속 나오는거 보고 그냥 잤음 ㅋㅋ&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;4. 배운점, 느낀점 정리:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;변동이 심한 장세에는 타점 잘 잡는게 굉장히 중요한듯&lt;/li&gt;
&lt;li&gt;진입 후에 91000 넘어서 쭉 상승이 나왔는데 전혀 불안하지 않았음&lt;/li&gt;
&lt;li&gt;청산가가 몰려있는 구간은 반드시 터뜨리고 가는 변동성이 큰 장세이기 때문&lt;/li&gt;
&lt;li&gt;타점을 더 잘 잡으려고 욕심을 부리는 것은 별로 안좋을거라고 생각이 들어서 그냥 생각한 진입가가 나와서 바로 잡았음&lt;/li&gt;
&lt;/ul&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style5&quot; /&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;흠.. 계속 먹으니까 오히려 좀 불안하네..&lt;/blockquote&gt;</description>
      <category>투자</category>
      <category>단타</category>
      <category>매매일지</category>
      <category>코인</category>
      <category>투자</category>
      <author>dongsunseng</author>
      <guid isPermaLink="true">https://dongsunseng.tistory.com/140</guid>
      <comments>https://dongsunseng.tistory.com/entry/%EB%A7%A4%EB%A7%A4%EC%9D%BC%EC%A7%80-7-%EA%B9%80%EB%B9%84%ED%8A%B8-%EC%A0%9C%EB%B0%9C#entry140comment</comments>
      <pubDate>Sat, 8 Mar 2025 01:17:56 +0900</pubDate>
    </item>
    <item>
      <title>[매매일지] 6. 2연승 추가..?</title>
      <link>https://dongsunseng.tistory.com/entry/%EB%A7%A4%EB%A7%A4%EC%9D%BC%EC%A7%80-6-2%EC%97%B0%EC%8A%B9-%EC%B6%94%EA%B0%80</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;이전 포스트:&lt;/b&gt;&lt;/p&gt;
&lt;figure id=&quot;og_1741353178659&quot; contenteditable=&quot;false&quot; data-ke-type=&quot;opengraph&quot; data-ke-align=&quot;alignCenter&quot; data-og-type=&quot;article&quot; data-og-title=&quot;[매매일지] 5. 쉽지 않은 장세에 조금씩 수익 쌓아가는중&quot; data-og-description=&quot;일단 장기적인 추세로 하락 추세를 보고 있음&amp;nbsp;따라서, 순추세를 하락으로 보고 숏 자리를 보는중임오늘은 포지션을 총 3번 잡았음&amp;nbsp;1) 박스권 매매1. 진입 근거:&amp;nbsp;박스권을 만들었다고 생각했고 &quot; data-og-host=&quot;dongsunseng.com&quot; data-og-source-url=&quot;https://dongsunseng.com/entry/%EB%A7%A4%EB%A7%A4%EC%9D%BC%EC%A7%80-5-%EC%89%BD%EC%A7%80-%EC%95%8A%EC%9D%80-%EC%9E%A5%EC%84%B8%EC%97%90-%EC%A1%B0%EA%B8%88%EC%94%A9-%EC%88%98%EC%9D%B5-%EC%8C%93%EC%95%84%EA%B0%80%EB%8A%94%EC%A4%91&quot; data-og-url=&quot;https://dongsunseng.com/entry/%EB%A7%A4%EB%A7%A4%EC%9D%BC%EC%A7%80-5-%EC%89%BD%EC%A7%80-%EC%95%8A%EC%9D%80-%EC%9E%A5%EC%84%B8%EC%97%90-%EC%A1%B0%EA%B8%88%EC%94%A9-%EC%88%98%EC%9D%B5-%EC%8C%93%EC%95%84%EA%B0%80%EB%8A%94%EC%A4%91&quot; data-og-image=&quot;https://scrap.kakaocdn.net/dn/cj7n8Q/hyYnaKgmPr/Zqxft1vmf6U02bAEIDI8A1/img.jpg?width=640&amp;amp;height=360&amp;amp;face=0_0_640_360,https://scrap.kakaocdn.net/dn/Yo02s/hyYmPlSFRK/FNCuZvInjH7ncE2eSyUoVk/img.jpg?width=640&amp;amp;height=360&amp;amp;face=0_0_640_360,https://scrap.kakaocdn.net/dn/y0nJT/hyYmSv66fZ/RXe7WRnmulSuLLk7jIkq51/img.png?width=1095&amp;amp;height=646&amp;amp;face=0_0_1095_646&quot;&gt;&lt;a href=&quot;https://dongsunseng.com/entry/%EB%A7%A4%EB%A7%A4%EC%9D%BC%EC%A7%80-5-%EC%89%BD%EC%A7%80-%EC%95%8A%EC%9D%80-%EC%9E%A5%EC%84%B8%EC%97%90-%EC%A1%B0%EA%B8%88%EC%94%A9-%EC%88%98%EC%9D%B5-%EC%8C%93%EC%95%84%EA%B0%80%EB%8A%94%EC%A4%91&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot; data-source-url=&quot;https://dongsunseng.com/entry/%EB%A7%A4%EB%A7%A4%EC%9D%BC%EC%A7%80-5-%EC%89%BD%EC%A7%80-%EC%95%8A%EC%9D%80-%EC%9E%A5%EC%84%B8%EC%97%90-%EC%A1%B0%EA%B8%88%EC%94%A9-%EC%88%98%EC%9D%B5-%EC%8C%93%EC%95%84%EA%B0%80%EB%8A%94%EC%A4%91&quot;&gt;
&lt;div class=&quot;og-image&quot; style=&quot;background-image: url('https://scrap.kakaocdn.net/dn/cj7n8Q/hyYnaKgmPr/Zqxft1vmf6U02bAEIDI8A1/img.jpg?width=640&amp;amp;height=360&amp;amp;face=0_0_640_360,https://scrap.kakaocdn.net/dn/Yo02s/hyYmPlSFRK/FNCuZvInjH7ncE2eSyUoVk/img.jpg?width=640&amp;amp;height=360&amp;amp;face=0_0_640_360,https://scrap.kakaocdn.net/dn/y0nJT/hyYmSv66fZ/RXe7WRnmulSuLLk7jIkq51/img.png?width=1095&amp;amp;height=646&amp;amp;face=0_0_1095_646');&quot;&gt;&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;og-text&quot;&gt;
&lt;p class=&quot;og-title&quot; data-ke-size=&quot;size16&quot;&gt;[매매일지] 5. 쉽지 않은 장세에 조금씩 수익 쌓아가는중&lt;/p&gt;
&lt;p class=&quot;og-desc&quot; data-ke-size=&quot;size16&quot;&gt;일단 장기적인 추세로 하락 추세를 보고 있음&amp;nbsp;따라서, 순추세를 하락으로 보고 숏 자리를 보는중임오늘은 포지션을 총 3번 잡았음&amp;nbsp;1) 박스권 매매1. 진입 근거:&amp;nbsp;박스권을 만들었다고 생각했고&lt;/p&gt;
&lt;p class=&quot;og-host&quot; data-ke-size=&quot;size16&quot;&gt;dongsunseng.com&lt;/p&gt;
&lt;/div&gt;
&lt;/a&gt;&lt;/figure&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style3&quot; /&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-03-07 오후 10.12.24.png&quot; data-origin-width=&quot;991&quot; data-origin-height=&quot;654&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bOGX5i/btsMDQ40Kx0/365FsJVsSttOLavBjKyTkK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bOGX5i/btsMDQ40Kx0/365FsJVsSttOLavBjKyTkK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bOGX5i/btsMDQ40Kx0/365FsJVsSttOLavBjKyTkK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbOGX5i%2FbtsMDQ40Kx0%2F365FsJVsSttOLavBjKyTkK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;991&quot; height=&quot;654&quot; data-filename=&quot;스크린샷 2025-03-07 오후 10.12.24.png&quot; data-origin-width=&quot;991&quot; data-origin-height=&quot;654&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-03-07 오후 10.39.17.png&quot; data-origin-width=&quot;177&quot; data-origin-height=&quot;270&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/pz9nF/btsMESgutWT/amNkBjxxLKZGHpDohZmY9k/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/pz9nF/btsMESgutWT/amNkBjxxLKZGHpDohZmY9k/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/pz9nF/btsMESgutWT/amNkBjxxLKZGHpDohZmY9k/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fpz9nF%2FbtsMESgutWT%2FamNkBjxxLKZGHpDohZmY9k%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;234&quot; height=&quot;357&quot; data-filename=&quot;스크린샷 2025-03-07 오후 10.39.17.png&quot; data-origin-width=&quot;177&quot; data-origin-height=&quot;270&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;1. 진입 근거:&amp;nbsp;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;필자는 장기적인 하방 추세를 보고 있고(현물러분들한테는 죄송하지만), 따라서 계속 숏으로 큰 파동을 먹어서 시드를 불리려는 생각으로 매매에 임하고 있음&lt;/li&gt;
&lt;li&gt;&amp;nbsp;&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;2. 포지션 셋업:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;시점: 2025.03.05 10PM 부근&lt;/li&gt;
&lt;li&gt;진입(EP): 90442.7&lt;/li&gt;
&lt;li&gt;익절(TP):
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;최종 익절 구간은 피보나치 사용해서 1.272 구간으로 잡음&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;손절(SL): 상단 매물대 부근으로 잡음&lt;/li&gt;
&lt;li&gt;손익비(R/R): 2.28&amp;nbsp;&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;3. 결과:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc; color: #333333; text-align: left;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;부분 익절로 89594, 89492, 89171 부근에서 50%, 25%, 25% 이렇게 익절함&lt;/li&gt;
&lt;li&gt;최종 익절구간까지 끌고가지 않은 이유는 최근 장세가 변동성이 심하고 거래량이 많은 양봉이 박힌게 쎄해서 정리해버림&lt;/li&gt;
&lt;li&gt;결과적으로는 쎄함 감지를 잘한듯&lt;/li&gt;
&lt;li&gt;그냥 정리하고 다시 잡자는 생각이었음&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;4. 배운점, 느낀점 정리:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;쎄할때는 그냥 튀고 다시 잡자&lt;/li&gt;
&lt;li&gt;변동세가 요즘 많이 심하기 때문&lt;/li&gt;
&lt;/ul&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style3&quot; /&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-03-07 오후 10.46.18.png&quot; data-origin-width=&quot;526&quot; data-origin-height=&quot;625&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/p0R1g/btsMEd6Qk2s/nQpTcSDnpkrze0KdaKbDn1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/p0R1g/btsMEd6Qk2s/nQpTcSDnpkrze0KdaKbDn1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/p0R1g/btsMEd6Qk2s/nQpTcSDnpkrze0KdaKbDn1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fp0R1g%2FbtsMEd6Qk2s%2FnQpTcSDnpkrze0KdaKbDn1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;526&quot; height=&quot;625&quot; data-filename=&quot;스크린샷 2025-03-07 오후 10.46.18.png&quot; data-origin-width=&quot;526&quot; data-origin-height=&quot;625&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;1. 진입 근거:&amp;nbsp;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;위와 마찬가지&lt;/li&gt;
&lt;li&gt;+ 상승분은 전부 크립토 써밋 때문이라고 생각했고, 끝나면 하락추세를 이어갈거라고 생각함&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;2. 포지션 셋업:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;진입(EP): 91,519.80&lt;/li&gt;
&lt;li&gt;익절(TP): 1.272 부근&lt;/li&gt;
&lt;li&gt;손절(SL): 전고점(92850)&lt;/li&gt;
&lt;li&gt;손익비(R/R): &amp;gt;3&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;3. 결과:&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-03-07 오후 10.50.41.png&quot; data-origin-width=&quot;593&quot; data-origin-height=&quot;797&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/Jcz08/btsMEAmUnqX/cKfvK7lYWm75TrfOmbP7f1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/Jcz08/btsMEAmUnqX/cKfvK7lYWm75TrfOmbP7f1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/Jcz08/btsMEAmUnqX/cKfvK7lYWm75TrfOmbP7f1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FJcz08%2FbtsMEAmUnqX%2FcKfvK7lYWm75TrfOmbP7f1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;593&quot; height=&quot;797&quot; data-filename=&quot;스크린샷 2025-03-07 오후 10.50.41.png&quot; data-origin-width=&quot;593&quot; data-origin-height=&quot;797&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-03-07 오후 10.42.14.png&quot; data-origin-width=&quot;584&quot; data-origin-height=&quot;482&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/vENTf/btsMFtmRk2m/3e8qVMD9q0etZ1MG1gcB8k/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/vENTf/btsMFtmRk2m/3e8qVMD9q0etZ1MG1gcB8k/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/vENTf/btsMFtmRk2m/3e8qVMD9q0etZ1MG1gcB8k/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FvENTf%2FbtsMFtmRk2m%2F3e8qVMD9q0etZ1MG1gcB8k%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;584&quot; height=&quot;482&quot; data-filename=&quot;스크린샷 2025-03-07 오후 10.42.14.png&quot; data-origin-width=&quot;584&quot; data-origin-height=&quot;482&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;일단 포지션은 상당히 잘잡은듯&lt;/li&gt;
&lt;li&gt;버티는게 쉽진 않았지만 결국 스탑은 안터뜨리고 수익을 보게 됨&lt;/li&gt;
&lt;li&gt;분익도 잘 잡았음(위기 감지 능력이 좀 올라가는듯)&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;4. 배운점, 느낀점 정리:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;요즘 변동성이 너무 쎄서 저배 운용을 하고 있는데 안정적이고 좋은 것 같음&lt;/li&gt;
&lt;li&gt;포지션을 상당히 잘 잡았기 때문에 수익을 극대화해보려고 분익하고 물량을 추가하고 배율도 중간에 조정해봤는데 아직 원리를 정확하게 몰라서 그런지 생각만큼 잘되지 않은 것 같음 -&amp;gt; 좀 더 연구 필요할듯&lt;/li&gt;
&lt;li&gt;위에 사진에서 볼 수 있듯이 아주 아슬아슬하게 스탑을 안건드리고 하락함(자는중이었음) -&amp;gt; 스탑을 잘 생각해서 잡고 소신있게 가자&lt;/li&gt;
&lt;/ul&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style5&quot; /&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;슬슬 손절 날때 됐으니까 조심하자&lt;/blockquote&gt;</description>
      <category>투자</category>
      <category>단타</category>
      <category>매매일지</category>
      <category>코인</category>
      <category>투자</category>
      <author>dongsunseng</author>
      <guid isPermaLink="true">https://dongsunseng.tistory.com/139</guid>
      <comments>https://dongsunseng.tistory.com/entry/%EB%A7%A4%EB%A7%A4%EC%9D%BC%EC%A7%80-6-2%EC%97%B0%EC%8A%B9-%EC%B6%94%EA%B0%80#entry139comment</comments>
      <pubDate>Fri, 7 Mar 2025 23:10:00 +0900</pubDate>
    </item>
    <item>
      <title>[매매일지] 5. 쉽지 않은 장세에 조금씩 수익 쌓아가는중</title>
      <link>https://dongsunseng.tistory.com/entry/%EB%A7%A4%EB%A7%A4%EC%9D%BC%EC%A7%80-5-%EC%89%BD%EC%A7%80-%EC%95%8A%EC%9D%80-%EC%9E%A5%EC%84%B8%EC%97%90-%EC%A1%B0%EA%B8%88%EC%94%A9-%EC%88%98%EC%9D%B5-%EC%8C%93%EC%95%84%EA%B0%80%EB%8A%94%EC%A4%91</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;일단 장기적인 추세로 하락 추세를 보고 있음&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;따라서, 순추세를 하락으로 보고 숏 자리를 보는중임&lt;/p&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style3&quot; /&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-03-05 오전 2.07.52.png&quot; data-origin-width=&quot;1016&quot; data-origin-height=&quot;642&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bsbSX8/btsMCwX0wny/mPcFh7eSEc9VMqMSoZvOXk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bsbSX8/btsMCwX0wny/mPcFh7eSEc9VMqMSoZvOXk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bsbSX8/btsMCwX0wny/mPcFh7eSEc9VMqMSoZvOXk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbsbSX8%2FbtsMCwX0wny%2FmPcFh7eSEc9VMqMSoZvOXk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1016&quot; height=&quot;642&quot; data-filename=&quot;스크린샷 2025-03-05 오전 2.07.52.png&quot; data-origin-width=&quot;1016&quot; data-origin-height=&quot;642&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;오늘은 포지션을 총 3번 잡았음&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;&lt;b&gt;1) 박스권 매매&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-03-05 오전 2.10.33.png&quot; data-origin-width=&quot;430&quot; data-origin-height=&quot;338&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/10i7X/btsMz5OSXD5/HptEVvWXHIA81gSa0Xkky0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/10i7X/btsMz5OSXD5/HptEVvWXHIA81gSa0Xkky0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/10i7X/btsMz5OSXD5/HptEVvWXHIA81gSa0Xkky0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2F10i7X%2FbtsMz5OSXD5%2FHptEVvWXHIA81gSa0Xkky0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;430&quot; height=&quot;338&quot; data-filename=&quot;스크린샷 2025-03-05 오전 2.10.33.png&quot; data-origin-width=&quot;430&quot; data-origin-height=&quot;338&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;1. 진입 근거:&amp;nbsp;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;박스권을 만들었다고 생각했고 롱숏롱숏 먹으려고 하는게 정석적인 무빙&lt;/li&gt;
&lt;li&gt;차트를 안보고 있다가 마지막 자리에 진입함..&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;2. 포지션 셋업:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&amp;nbsp;진입(EP): 82800&lt;/li&gt;
&lt;li&gt;익절(TP): 83288&lt;/li&gt;
&lt;li&gt;손절(SL): 82437&lt;/li&gt;
&lt;li&gt;손익비(R/R): 1.4&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;3. 결과:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;11시반 개장과 함께 나스닥이 쭉 내리면서 비트까지 내려버림&lt;/li&gt;
&lt;li&gt;손절 엔딩&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;4. 배운점, 느낀점 정리:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;11시반 개장 타이밍은 항상 조심해야겠다 하면서도 까먹음&lt;/li&gt;
&lt;li&gt;나스닥 차트를 보는 것도 기억하자&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;2) 다시 펌핑&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-03-05 오전 2.17.17.png&quot; data-origin-width=&quot;548&quot; data-origin-height=&quot;625&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/lUInG/btsMCu6WW0f/7cbEIkszBJ5HG9FdAyKzIk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/lUInG/btsMCu6WW0f/7cbEIkszBJ5HG9FdAyKzIk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/lUInG/btsMCu6WW0f/7cbEIkszBJ5HG9FdAyKzIk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FlUInG%2FbtsMCu6WW0f%2F7cbEIkszBJ5HG9FdAyKzIk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;548&quot; height=&quot;625&quot; data-filename=&quot;스크린샷 2025-03-05 오전 2.17.17.png&quot; data-origin-width=&quot;548&quot; data-origin-height=&quot;625&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;1. 진입 근거:&amp;nbsp;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;쭉 펌핑하는 것을 보고 당연히 찐반은 아니라고 생각함&lt;/li&gt;
&lt;li&gt;고점을 찍고 내리는 것을 보고 진입&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;2. 포지션 셋업:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&amp;nbsp;진입(EP): &lt;span style=&quot;background-color: #101014; color: #ffffff; text-align: left;&quot;&gt;83,871.20&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;익절(TP): 82669 (0.786 되돌림)&lt;/li&gt;
&lt;li&gt;손절(SL): 85011 (윗 매물대)&lt;/li&gt;
&lt;li&gt;손익비(R/R): 1.06&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;3. 결과:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;0.786 부근에서 욕심 안부리고 익절했음&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;4. 배운점, 느낀점 정리:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;숏을 잡을때마다 느끼는거지만 데드캣 때문에 숏은 익절을 욕심안내고 줄먹 하는것이 중요함&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;데드캣: 쭉 내리고 갑자기 쭉 오르는 반등&lt;/li&gt;
&lt;li&gt;순추세가 하락추세인게 명확하더라도 줄먹 해야됨&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;3) 푸근한 숏 포지션 진입 시도&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-03-05 오전 2.35.20.png&quot; data-origin-width=&quot;523&quot; data-origin-height=&quot;403&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bbfBPO/btsMCxJmUj1/y1o3KR1QnLZBZkFAciDKK1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bbfBPO/btsMCxJmUj1/y1o3KR1QnLZBZkFAciDKK1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bbfBPO/btsMCxJmUj1/y1o3KR1QnLZBZkFAciDKK1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbbfBPO%2FbtsMCxJmUj1%2Fy1o3KR1QnLZBZkFAciDKK1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;523&quot; height=&quot;403&quot; data-filename=&quot;스크린샷 2025-03-05 오전 2.35.20.png&quot; data-origin-width=&quot;523&quot; data-origin-height=&quot;403&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-03-05 오전 2.43.55.png&quot; data-origin-width=&quot;1095&quot; data-origin-height=&quot;646&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/SmLuc/btsMCvdIENd/bKId1EvfkoejkDHTRHfWnK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/SmLuc/btsMCvdIENd/bKId1EvfkoejkDHTRHfWnK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/SmLuc/btsMCvdIENd/bKId1EvfkoejkDHTRHfWnK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FSmLuc%2FbtsMCvdIENd%2FbKId1EvfkoejkDHTRHfWnK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1095&quot; height=&quot;646&quot; data-filename=&quot;스크린샷 2025-03-05 오전 2.43.55.png&quot; data-origin-width=&quot;1095&quot; data-origin-height=&quot;646&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;1. 진입 근거:&amp;nbsp;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;4시간봉 채널의 하단에 회귀했다가 약한 반등이 나왔기 때문에 하방 돌파의 가능성이 크다고 생각함&lt;/li&gt;
&lt;li&gt;자주색 매물대에 저항을 지속적으로 받았기 때문에 손익비가 좋은 푸근한 숏 포지션 자리라고 생각하고 진입&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;2. 포지션 셋업:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&amp;nbsp;진입(EP): &lt;span style=&quot;background-color: #101014; color: #ffffff; text-align: left;&quot;&gt;82,602.10&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;익절(TP): 좀 길게 두고 볼 생각이었음: 큰 하방의 가능성&lt;/li&gt;
&lt;li&gt;손절(SL): 83466 (윗 매물대 저항)&lt;/li&gt;
&lt;li&gt;손익비(R/R):&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;3. 결과:&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-03-05 오전 2.46.50.png&quot; data-origin-width=&quot;686&quot; data-origin-height=&quot;644&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/4o7ie/btsMCsg9iCO/Lq2PhbXdGixDMHFiCbBBs1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/4o7ie/btsMCsg9iCO/Lq2PhbXdGixDMHFiCbBBs1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/4o7ie/btsMCsg9iCO/Lq2PhbXdGixDMHFiCbBBs1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2F4o7ie%2FbtsMCsg9iCO%2FLq2PhbXdGixDMHFiCbBBs1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;686&quot; height=&quot;644&quot; data-filename=&quot;스크린샷 2025-03-05 오전 2.46.50.png&quot; data-origin-width=&quot;686&quot; data-origin-height=&quot;644&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;쭉 올리면서 손절 엔딩&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;4. 배운점, 느낀점 정리:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;예측하지 못하겠는 변동성이 큰 장세에는 그냥 가만히 있는게 나은 것 같기도 하다 ㅋㅋ&lt;/li&gt;
&lt;li&gt;결국 오늘 수익은 20불 남짓..&lt;/li&gt;
&lt;li&gt;2번째 매매에서 꽤 짭짤한 수익을 냈지만 도로 뱉어버림&lt;/li&gt;
&lt;/ul&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style3&quot; /&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;변동성이 커서 쉽지 않은데, 욕심 내지말고 조금이라도 수익 내는것에 만족하면서 공부해야겠다&lt;/span&gt;&lt;/blockquote&gt;</description>
      <category>투자</category>
      <category>매매일지</category>
      <category>코인</category>
      <category>투자</category>
      <author>dongsunseng</author>
      <guid isPermaLink="true">https://dongsunseng.tistory.com/138</guid>
      <comments>https://dongsunseng.tistory.com/entry/%EB%A7%A4%EB%A7%A4%EC%9D%BC%EC%A7%80-5-%EC%89%BD%EC%A7%80-%EC%95%8A%EC%9D%80-%EC%9E%A5%EC%84%B8%EC%97%90-%EC%A1%B0%EA%B8%88%EC%94%A9-%EC%88%98%EC%9D%B5-%EC%8C%93%EC%95%84%EA%B0%80%EB%8A%94%EC%A4%91#entry138comment</comments>
      <pubDate>Wed, 5 Mar 2025 02:48:24 +0900</pubDate>
    </item>
    <item>
      <title>[매매일지] 4. 진짜 미친 비트..</title>
      <link>https://dongsunseng.tistory.com/entry/%EC%BD%94%EC%9D%B8-%ED%88%AC%EC%9E%90-4-%EC%A7%84%EC%A7%9C-%EB%AF%B8%EC%B9%9C-%EB%B9%84%ED%8A%B8</link>
      <description>&lt;figure id=&quot;og_1740798940159&quot; contenteditable=&quot;false&quot; data-ke-type=&quot;opengraph&quot; data-ke-align=&quot;alignCenter&quot; data-og-type=&quot;article&quot; data-og-title=&quot;[코인 투자] 매매일기 #3 - 첫 수익 + 비트 운전수 폭주&quot; data-og-description=&quot;2025.02.25 4시 숏포지션 x30배1. 진입 근거:&amp;nbsp;강한 매도세살짝만 먹고 빠질 생각으로 진입2. 포지션 셋업 :&amp;nbsp;진입(EP): 90460 -&amp;gt; 강한 매도세를 한 파동 확인하고 나서 들어감익절(TP): 89196 -&amp;gt; 매도세가 조&quot; data-og-host=&quot;dongsunseng.com&quot; data-og-source-url=&quot;https://dongsunseng.com/entry/%EC%BD%94%EC%9D%B8-%ED%88%AC%EC%9E%90-%EB%A7%A4%EB%A7%A4%EC%9D%BC%EA%B8%B0-3-%EC%B2%AB-%EC%88%98%EC%9D%B5-%EB%B9%84%ED%8A%B8-%EC%9A%B4%EC%A0%84%EC%88%98-%ED%8F%AD%EC%A3%BC&quot; data-og-url=&quot;https://dongsunseng.com/entry/%EC%BD%94%EC%9D%B8-%ED%88%AC%EC%9E%90-%EB%A7%A4%EB%A7%A4%EC%9D%BC%EA%B8%B0-3-%EC%B2%AB-%EC%88%98%EC%9D%B5-%EB%B9%84%ED%8A%B8-%EC%9A%B4%EC%A0%84%EC%88%98-%ED%8F%AD%EC%A3%BC&quot; data-og-image=&quot;https://scrap.kakaocdn.net/dn/wKV4J/hyYjBnSDiv/bIjmlZFrlB83rkkArMZp20/img.jpg?width=640&amp;amp;height=360&amp;amp;face=0_0_640_360,https://scrap.kakaocdn.net/dn/ooEnP/hyYjlZFELv/4BFDgNxnwKj6ymXhka7kH0/img.jpg?width=640&amp;amp;height=360&amp;amp;face=0_0_640_360,https://scrap.kakaocdn.net/dn/fyQBk/hyYjvVzciU/IcYhW2iHE5wRfPW8XrXQeK/img.png?width=1010&amp;amp;height=797&amp;amp;face=0_0_1010_797&quot;&gt;&lt;a href=&quot;https://dongsunseng.com/entry/%EC%BD%94%EC%9D%B8-%ED%88%AC%EC%9E%90-%EB%A7%A4%EB%A7%A4%EC%9D%BC%EA%B8%B0-3-%EC%B2%AB-%EC%88%98%EC%9D%B5-%EB%B9%84%ED%8A%B8-%EC%9A%B4%EC%A0%84%EC%88%98-%ED%8F%AD%EC%A3%BC&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot; data-source-url=&quot;https://dongsunseng.com/entry/%EC%BD%94%EC%9D%B8-%ED%88%AC%EC%9E%90-%EB%A7%A4%EB%A7%A4%EC%9D%BC%EA%B8%B0-3-%EC%B2%AB-%EC%88%98%EC%9D%B5-%EB%B9%84%ED%8A%B8-%EC%9A%B4%EC%A0%84%EC%88%98-%ED%8F%AD%EC%A3%BC&quot;&gt;
&lt;div class=&quot;og-image&quot; style=&quot;background-image: url('https://scrap.kakaocdn.net/dn/wKV4J/hyYjBnSDiv/bIjmlZFrlB83rkkArMZp20/img.jpg?width=640&amp;amp;height=360&amp;amp;face=0_0_640_360,https://scrap.kakaocdn.net/dn/ooEnP/hyYjlZFELv/4BFDgNxnwKj6ymXhka7kH0/img.jpg?width=640&amp;amp;height=360&amp;amp;face=0_0_640_360,https://scrap.kakaocdn.net/dn/fyQBk/hyYjvVzciU/IcYhW2iHE5wRfPW8XrXQeK/img.png?width=1010&amp;amp;height=797&amp;amp;face=0_0_1010_797');&quot;&gt;&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;og-text&quot;&gt;
&lt;p class=&quot;og-title&quot; data-ke-size=&quot;size16&quot;&gt;[코인 투자] 매매일기 #3 - 첫 수익 + 비트 운전수 폭주&lt;/p&gt;
&lt;p class=&quot;og-desc&quot; data-ke-size=&quot;size16&quot;&gt;2025.02.25 4시 숏포지션 x30배1. 진입 근거:&amp;nbsp;강한 매도세살짝만 먹고 빠질 생각으로 진입2. 포지션 셋업 :&amp;nbsp;진입(EP): 90460 -&amp;gt; 강한 매도세를 한 파동 확인하고 나서 들어감익절(TP): 89196 -&amp;gt; 매도세가 조&lt;/p&gt;
&lt;p class=&quot;og-host&quot; data-ke-size=&quot;size16&quot;&gt;dongsunseng.com&lt;/p&gt;
&lt;/div&gt;
&lt;/a&gt;&lt;/figure&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style3&quot; /&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-03-04 오전 12.24.46.png&quot; data-origin-width=&quot;798&quot; data-origin-height=&quot;651&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/brwqij/btsMzX3J0qt/uF4cWchzC35MdLw083z280/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/brwqij/btsMzX3J0qt/uF4cWchzC35MdLw083z280/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/brwqij/btsMzX3J0qt/uF4cWchzC35MdLw083z280/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fbrwqij%2FbtsMzX3J0qt%2FuF4cWchzC35MdLw083z280%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;798&quot; height=&quot;651&quot; data-filename=&quot;스크린샷 2025-03-04 오전 12.24.46.png&quot; data-origin-width=&quot;798&quot; data-origin-height=&quot;651&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;트럼프 말한마디에 아주 미쳐 날뛰는중...&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;현물이 없고 장기 하방 보고있던 난 트럼프가 밉다..&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;사실 저 위에 올라타려고 해봤다가 몇불 깨졌음 ㅋㅋ&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;1. 진입 근거:&amp;nbsp;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;상승이 끝나고 구라 반등이지 않을까 싶어서 숏을 쳐봄&lt;/li&gt;
&lt;li&gt;보통 이렇게 빠르게 올린 반등은 명확한 의도가 있고, 목적 달성 후에는 상승분이 빠지는 경향이 있기 때문에 숏 포지션을 잡은 것임(프렉탈적 관점)&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;2. 포지션 셋업:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;진입(EP): &lt;span style=&quot;background-color: #101014; color: #ffffff; text-align: left;&quot;&gt;93,338.00&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;익절(TP):&amp;nbsp;&lt;/li&gt;
&lt;li&gt;손절(SL): 전고점&lt;/li&gt;
&lt;li&gt;손익비(R/R):&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-03-04 오전 12.44.33.png&quot; data-origin-width=&quot;923&quot; data-origin-height=&quot;422&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/KgXnM/btsMAGUHo60/S4whlbYapKGo8zM2tRUzI0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/KgXnM/btsMAGUHo60/S4whlbYapKGo8zM2tRUzI0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/KgXnM/btsMAGUHo60/S4whlbYapKGo8zM2tRUzI0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FKgXnM%2FbtsMAGUHo60%2FS4whlbYapKGo8zM2tRUzI0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;923&quot; height=&quot;422&quot; data-filename=&quot;스크린샷 2025-03-04 오전 12.44.33.png&quot; data-origin-width=&quot;923&quot; data-origin-height=&quot;422&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;3. 결과:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;변동성이 굉장히 컸음&lt;/li&gt;
&lt;li&gt;위 사진을 보면 알겠지만 저 역헤숄 모양쯤부터 잠에서 깨서 봤는데, 계속 저점을 높이는 상승이 나와서 빠르게 정리했음&lt;/li&gt;
&lt;li&gt;결과적으로 보면 수익은 봤지만 저점에서 큰 수익을 본 것도 아니고 저점을 높이는 과정을 관망하다가 노란색 하이라이트 부근에서 뒤늦게 나왔기 때문에 엄청 아쉬운 수익만 보게 됨&lt;/li&gt;
&lt;li&gt;결과론적으로만 보면 관점이 어느정도 맞았고 큰 수익을 볼 수 있었는데 내가 기회를 걷어차버림&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;4. 배운점, 느낀점 정리:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;한번 관점을 정해서 익손절 라인을 정했으면 음전한다고 마음 조리지말고 내 분석을 어느정도 믿으면서 내 포지션에 끝까지 책임질 줄 아는 자세가 필요한 것 같다&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;5. 반성할 점:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span&gt;뇌동매매는 아니지만 내가 보지 못한 강추세가 나오는 경우 포모(FOMO)가 오고 머리가 뜨끈해질 때가 있는데 이 때는 제발 참자 &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span&gt;다음 자리 노려도 충분히 수익 볼 수 있다&amp;nbsp;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span&gt;모든 자리 다 발라먹으려고 하지마&lt;/span&gt;&lt;span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;</description>
      <category>투자</category>
      <category>매매일지</category>
      <category>코인</category>
      <category>투자</category>
      <author>dongsunseng</author>
      <guid isPermaLink="true">https://dongsunseng.tistory.com/137</guid>
      <comments>https://dongsunseng.tistory.com/entry/%EC%BD%94%EC%9D%B8-%ED%88%AC%EC%9E%90-4-%EC%A7%84%EC%A7%9C-%EB%AF%B8%EC%B9%9C-%EB%B9%84%ED%8A%B8#entry137comment</comments>
      <pubDate>Tue, 4 Mar 2025 00:51:45 +0900</pubDate>
    </item>
    <item>
      <title>[매매일지] 3. 첫 수익 + 비트 운전수 폭주</title>
      <link>https://dongsunseng.tistory.com/entry/%EC%BD%94%EC%9D%B8-%ED%88%AC%EC%9E%90-%EB%A7%A4%EB%A7%A4%EC%9D%BC%EA%B8%B0-3-%EC%B2%AB-%EC%88%98%EC%9D%B5-%EB%B9%84%ED%8A%B8-%EC%9A%B4%EC%A0%84%EC%88%98-%ED%8F%AD%EC%A3%BC</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;2025.02.25 4시 숏포지션 x30배&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-03-01 오전 10.42.31.png&quot; data-origin-width=&quot;495&quot; data-origin-height=&quot;389&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/uHzBD/btsMyUTvRoN/h4JbwkYjzjnWuNr1kJzybK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/uHzBD/btsMyUTvRoN/h4JbwkYjzjnWuNr1kJzybK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/uHzBD/btsMyUTvRoN/h4JbwkYjzjnWuNr1kJzybK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FuHzBD%2FbtsMyUTvRoN%2Fh4JbwkYjzjnWuNr1kJzybK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;495&quot; height=&quot;389&quot; data-filename=&quot;스크린샷 2025-03-01 오전 10.42.31.png&quot; data-origin-width=&quot;495&quot; data-origin-height=&quot;389&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;1. 진입 근거:&amp;nbsp;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;강한 매도세&lt;/li&gt;
&lt;li&gt;살짝만 먹고 빠질 생각으로 진입&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;2. 포지션 셋업 :&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&amp;nbsp;진입(EP): 90460 -&amp;gt; 강한 매도세를 한 파동 확인하고 나서 들어감&lt;/li&gt;
&lt;li&gt;익절(TP): 89196 -&amp;gt; 매도세가 조금 약해진다 느낄때쯤 익절&lt;/li&gt;
&lt;li&gt;손절(SL): 90700 -&amp;gt; 반등 고점&lt;/li&gt;
&lt;li&gt;손익비(R/R):&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;3. 결과:&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;Share (2).png&quot; data-origin-width=&quot;1320&quot; data-origin-height=&quot;960&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/uElHw/btsMAshvd5D/sWY4xpN7Bc6TM0QXl1nRZk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/uElHw/btsMAshvd5D/sWY4xpN7Bc6TM0QXl1nRZk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/uElHw/btsMAshvd5D/sWY4xpN7Bc6TM0QXl1nRZk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FuElHw%2FbtsMAshvd5D%2FsWY4xpN7Bc6TM0QXl1nRZk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;420&quot; height=&quot;305&quot; data-filename=&quot;Share (2).png&quot; data-origin-width=&quot;1320&quot; data-origin-height=&quot;960&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;4. 배운점, 느낀점 정리:&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;5. 반성할 점:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;사실 결과적으로 보면 수익을 봤지만 강한 매도세만을 보고 잠깐 먹고 빠지는 매매도 건강한 매매일지 모르겠음&lt;/li&gt;
&lt;li&gt;손절 라인 짧게 진입해도 강한 매도세에 수익을 볼거같은 느낌이 들어서 진입하긴함&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;2025.03.01&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-03-01 오전 11.43.23.png&quot; data-origin-width=&quot;1010&quot; data-origin-height=&quot;797&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/KAX8B/btsMALnCeUM/oiLpHjksGNXEBoOaEn0y3K/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/KAX8B/btsMALnCeUM/oiLpHjksGNXEBoOaEn0y3K/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/KAX8B/btsMALnCeUM/oiLpHjksGNXEBoOaEn0y3K/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FKAX8B%2FbtsMALnCeUM%2FoiLpHjksGNXEBoOaEn0y3K%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1010&quot; height=&quot;797&quot; data-filename=&quot;스크린샷 2025-03-01 오전 11.43.23.png&quot; data-origin-width=&quot;1010&quot; data-origin-height=&quot;797&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;1. 진입 근거:&amp;nbsp;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;장기적인 하락을 보고 있음&lt;/li&gt;
&lt;li&gt;이유:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;상승은 공격적으로 나오지만 거래량이 실리지 않음&lt;/li&gt;
&lt;li&gt;올린 가격에 비해 RSI가 비교적 너무 많이 올라옴&lt;/li&gt;
&lt;li&gt;OBV 보조지표를 보면 선물의 거래량이 현물의 거래량보다 압도적으로 많음&lt;/li&gt;
&lt;li&gt;즉, 세력(MM)들이 선물로 가격을 올리고 현물을 비싸게 팔려는 의도라고 생각됨&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;2. 포지션 셋업:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&amp;nbsp;진입(EP):&amp;nbsp;&lt;/li&gt;
&lt;li&gt;익절(TP): 일단 1차 익절 82651 (반등의 0.5 되돌림)&lt;/li&gt;
&lt;li&gt;손절(SL): 84880 (반등의 고점)&lt;/li&gt;
&lt;li&gt;손익비(R/R): 1.09&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;3. 결과:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;손절 라인 닿고 바로 하락함&lt;/li&gt;
&lt;li&gt;버텼어도 터질 손절이긴 했음&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;4. 배운점, 느낀점 정리:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;손절 라인은 10틱, 50틱이라도 손해보게 잡아야 한 번 더 버틸 기회가 생김&lt;/li&gt;
&lt;li&gt;익절 라인은 욕심을 조금이라도 덜어야 체결이 될 확률이 높아짐&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;5. 반성할 점:&lt;/p&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style5&quot; /&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;요즘 비트코인 너무 어렵다..&lt;/blockquote&gt;</description>
      <category>투자</category>
      <category>수익</category>
      <category>코인</category>
      <category>투자</category>
      <author>dongsunseng</author>
      <guid isPermaLink="true">https://dongsunseng.tistory.com/136</guid>
      <comments>https://dongsunseng.tistory.com/entry/%EC%BD%94%EC%9D%B8-%ED%88%AC%EC%9E%90-%EB%A7%A4%EB%A7%A4%EC%9D%BC%EA%B8%B0-3-%EC%B2%AB-%EC%88%98%EC%9D%B5-%EB%B9%84%ED%8A%B8-%EC%9A%B4%EC%A0%84%EC%88%98-%ED%8F%AD%EC%A3%BC#entry136comment</comments>
      <pubDate>Sat, 1 Mar 2025 12:14:54 +0900</pubDate>
    </item>
    <item>
      <title>[매매일지] 2. 도로 다 뱉어버림...</title>
      <link>https://dongsunseng.tistory.com/entry/%EC%BD%94%EC%9D%B8-%ED%88%AC%EC%9E%90-%EB%A7%A4%EB%A7%A4%EC%9D%BC%EC%A7%80-2-%EB%8F%84%EB%A1%9C-%EB%8B%A4-%EB%B1%89%EC%96%B4%EB%B2%84%EB%A6%BC</link>
      <description>&lt;figure id=&quot;og_1740207197986&quot; contenteditable=&quot;false&quot; data-ke-type=&quot;opengraph&quot; data-ke-align=&quot;alignCenter&quot; data-og-type=&quot;article&quot; data-og-title=&quot;[코인 투자] 매매일지 #1 - 시장 수업료로 뱉은거 100% 멘징 + 비트 제대로 복수 완료??&quot; data-og-description=&quot;기존 코인 투자 포스트들은 기초 내용들을 다뤘지만 이제부턴 매매일지를 꾸준히 작성해보려고 함&amp;nbsp;일단 필자는 2월 14일부터 실제 돈으로 투자하기 시작한 초보자임&amp;nbsp;전 한달 반 가량 모의투자&quot; data-og-host=&quot;dongsunseng.com&quot; data-og-source-url=&quot;https://dongsunseng.com/entry/%EC%BD%94%EC%9D%B8-%ED%88%AC%EC%9E%90-%EB%A7%A4%EB%A7%A4%EC%9D%BC%EC%A7%80-1-%EC%8B%9C%EC%9E%A5-%EC%88%98%EC%97%85%EB%A3%8C%EB%A1%9C-%EB%B1%89%EC%9D%80%EA%B1%B0-100-%EB%A9%98%EC%A7%95-%EB%B9%84%ED%8A%B8-%EC%A0%9C%EB%8C%80%EB%A1%9C-%EB%B3%B5%EC%88%98-%EC%99%84%EB%A3%8C&quot; data-og-url=&quot;https://dongsunseng.com/entry/%EC%BD%94%EC%9D%B8-%ED%88%AC%EC%9E%90-%EB%A7%A4%EB%A7%A4%EC%9D%BC%EC%A7%80-1-%EC%8B%9C%EC%9E%A5-%EC%88%98%EC%97%85%EB%A3%8C%EB%A1%9C-%EB%B1%89%EC%9D%80%EA%B1%B0-100-%EB%A9%98%EC%A7%95-%EB%B9%84%ED%8A%B8-%EC%A0%9C%EB%8C%80%EB%A1%9C-%EB%B3%B5%EC%88%98-%EC%99%84%EB%A3%8C&quot; data-og-image=&quot;https://scrap.kakaocdn.net/dn/dJj8JJ/hyYfQTbLap/2Vq865QKVEUoY0kF7PYFUK/img.jpg?width=640&amp;amp;height=360&amp;amp;face=0_0_640_360,https://scrap.kakaocdn.net/dn/cJc4YU/hyYfKMaT3l/oymzBYxSOBczrussTbXAsk/img.jpg?width=640&amp;amp;height=360&amp;amp;face=0_0_640_360,https://scrap.kakaocdn.net/dn/mIolf/hyYfQlm5Fl/TPQkAEokqAoZN8AvsDqA80/img.png?width=1299&amp;amp;height=827&amp;amp;face=0_0_1299_827&quot;&gt;&lt;a href=&quot;https://dongsunseng.com/entry/%EC%BD%94%EC%9D%B8-%ED%88%AC%EC%9E%90-%EB%A7%A4%EB%A7%A4%EC%9D%BC%EC%A7%80-1-%EC%8B%9C%EC%9E%A5-%EC%88%98%EC%97%85%EB%A3%8C%EB%A1%9C-%EB%B1%89%EC%9D%80%EA%B1%B0-100-%EB%A9%98%EC%A7%95-%EB%B9%84%ED%8A%B8-%EC%A0%9C%EB%8C%80%EB%A1%9C-%EB%B3%B5%EC%88%98-%EC%99%84%EB%A3%8C&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot; data-source-url=&quot;https://dongsunseng.com/entry/%EC%BD%94%EC%9D%B8-%ED%88%AC%EC%9E%90-%EB%A7%A4%EB%A7%A4%EC%9D%BC%EC%A7%80-1-%EC%8B%9C%EC%9E%A5-%EC%88%98%EC%97%85%EB%A3%8C%EB%A1%9C-%EB%B1%89%EC%9D%80%EA%B1%B0-100-%EB%A9%98%EC%A7%95-%EB%B9%84%ED%8A%B8-%EC%A0%9C%EB%8C%80%EB%A1%9C-%EB%B3%B5%EC%88%98-%EC%99%84%EB%A3%8C&quot;&gt;
&lt;div class=&quot;og-image&quot; style=&quot;background-image: url('https://scrap.kakaocdn.net/dn/dJj8JJ/hyYfQTbLap/2Vq865QKVEUoY0kF7PYFUK/img.jpg?width=640&amp;amp;height=360&amp;amp;face=0_0_640_360,https://scrap.kakaocdn.net/dn/cJc4YU/hyYfKMaT3l/oymzBYxSOBczrussTbXAsk/img.jpg?width=640&amp;amp;height=360&amp;amp;face=0_0_640_360,https://scrap.kakaocdn.net/dn/mIolf/hyYfQlm5Fl/TPQkAEokqAoZN8AvsDqA80/img.png?width=1299&amp;amp;height=827&amp;amp;face=0_0_1299_827');&quot;&gt;&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;og-text&quot;&gt;
&lt;p class=&quot;og-title&quot; data-ke-size=&quot;size16&quot;&gt;[코인 투자] 매매일지 #1 - 시장 수업료로 뱉은거 100% 멘징 + 비트 제대로 복수 완료??&lt;/p&gt;
&lt;p class=&quot;og-desc&quot; data-ke-size=&quot;size16&quot;&gt;기존 코인 투자 포스트들은 기초 내용들을 다뤘지만 이제부턴 매매일지를 꾸준히 작성해보려고 함&amp;nbsp;일단 필자는 2월 14일부터 실제 돈으로 투자하기 시작한 초보자임&amp;nbsp;전 한달 반 가량 모의투자&lt;/p&gt;
&lt;p class=&quot;og-host&quot; data-ke-size=&quot;size16&quot;&gt;dongsunseng.com&lt;/p&gt;
&lt;/div&gt;
&lt;/a&gt;&lt;/figure&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;위의 포스트를 보면 알겠지만 아주 아름다운 상승 추세와 함께 스윙을 치며 100%가 넘는 수익을 보고있었음&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-22 오후 6.26.05.png&quot; data-origin-width=&quot;1477&quot; data-origin-height=&quot;696&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bJb8aF/btsMt66wCw1/CmBnjnO1htIhiU7FwRXsxk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bJb8aF/btsMt66wCw1/CmBnjnO1htIhiU7FwRXsxk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bJb8aF/btsMt66wCw1/CmBnjnO1htIhiU7FwRXsxk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbJb8aF%2FbtsMt66wCw1%2FCmBnjnO1htIhiU7FwRXsxk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1477&quot; height=&quot;696&quot; data-filename=&quot;스크린샷 2025-02-22 오후 6.26.05.png&quot; data-origin-width=&quot;1477&quot; data-origin-height=&quot;696&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;하하... 이게 뭔&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;위의 포스트에서도 말했듯이 99k를 뚫고 수익금이 $1,500을 넘었지만, 바이비트 콜드월렛 해킹 이슈 + 중국 관세, 전염병 이슈가 한꺼번에 터지면서 수직 하강하게 되었음...&lt;/p&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style3&quot; /&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #99cefa;&quot;&gt;&lt;b&gt;포지션 최종 결과&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;1차로 본절이 터졌고&lt;/li&gt;
&lt;li&gt;2차로 이성을 잃고 추격매매를 하다 손절 라인이 바로 돌파되면서 -$200...&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;차트를 보며 배운/느낀 점들 정리&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;99k까지 상승하면서 추세를 깨지 않았고 98k 부근에서 다시 한번 쓰리마켓 패턴을 그리면서 고점 갱신을 예상했음&lt;/li&gt;
&lt;li&gt;쓰리마켓 패턴의 확장 부분에서 위 따고 아래까지 따고 나서 살짝의 반등과 함께 수직하강하게 되었음 (원래였다면 쭉 상승하는게 쓰리마켓패턴)&lt;/li&gt;
&lt;li&gt;여기서 배운 것은 진짜 한치 앞을 예상할 수 없다... 임&lt;/li&gt;
&lt;li&gt;어떻게 3일동안 꾸득꾸득 올려서 패턴까지 만들어놓고 수직하강을 할 수 있나...&lt;/li&gt;
&lt;li&gt;어떻게 대응했어야 되는지 솔직히 잘 모르겠음&lt;/li&gt;
&lt;li&gt;차트가 이렇게까지 이쁘게 나왔는데 걍 익절해버리는 것도 어불성설인것 같고..&amp;nbsp;&lt;/li&gt;
&lt;li&gt;일단 분할익절은 특히 아직은 초보자인만큼 꼭 해야겠다고 생각했음&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;반성할 점&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;본절이 터졌는데 바로 추격매매를 한 것은 여기에 적기도 쪽팔릴 정도로 반성할 점임&lt;/li&gt;
&lt;li&gt;포지션 초반에는 무리할 정도의 추가매매부터 추격매매까지 아주 아마추어 티를 팍팍 냈지만, 이런 충동을 잘 억제할 방법을 구축하기로 함&lt;/li&gt;
&lt;li&gt;매매를 하면서 볼 체크리스트도 만들었고: &lt;a href=&quot;https://dongsunseng.com/entry/%EC%BD%94%EC%9D%B8-%ED%88%AC%EC%9E%90-%EB%A7%A4%EB%A7%A4%ED%95%98%EA%B8%B0%EC%A0%84-%EB%AC%B4%EC%A1%B0%EA%B1%B4-%EB%B4%90%EC%95%BC-%ED%95%A0-%EC%B2%B4%ED%81%AC-%EB%A6%AC%EC%8A%A4%ED%8A%B8&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://dongsunseng.com/entry/%EC%BD%94%EC%9D%B8-%ED%88%AC%EC%9E%90-%EB%A7%A4%EB%A7%A4%ED%95%98%EA%B8%B0%EC%A0%84-%EB%AC%B4%EC%A1%B0%EA%B1%B4-%EB%B4%90%EC%95%BC-%ED%95%A0-%EC%B2%B4%ED%81%AC-%EB%A6%AC%EC%8A%A4%ED%8A%B8&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;이렇게 말도 안되는 억까를 당할 때 어떻게 극복해야될지는 루틴을 좀 만들 필요가 있다고 생각함&lt;/li&gt;
&lt;li&gt;투기가 아닌 건강하고 지속 가능한 투자 생활을 위해서..&lt;/li&gt;
&lt;li&gt;사람들이 얼마나 볼지는 모르겠지만 요즘 장 진짜 쉽지 않네요.. 다들 화이팅합시다&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote data-ke-style=&quot;style3&quot;&gt;시드 현황: $2,682&lt;/blockquote&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style5&quot; /&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;절치부심해서 다시 가보자...&lt;/span&gt;&lt;/blockquote&gt;</description>
      <category>투자</category>
      <category>매매일지</category>
      <category>코인</category>
      <category>투자</category>
      <author>dongsunseng</author>
      <guid isPermaLink="true">https://dongsunseng.tistory.com/135</guid>
      <comments>https://dongsunseng.tistory.com/entry/%EC%BD%94%EC%9D%B8-%ED%88%AC%EC%9E%90-%EB%A7%A4%EB%A7%A4%EC%9D%BC%EC%A7%80-2-%EB%8F%84%EB%A1%9C-%EB%8B%A4-%EB%B1%89%EC%96%B4%EB%B2%84%EB%A6%BC#entry135comment</comments>
      <pubDate>Sat, 22 Feb 2025 19:33:57 +0900</pubDate>
    </item>
    <item>
      <title>[코인 투자] 0. 매매하기전 무조건 봐야 할 체크 리스트</title>
      <link>https://dongsunseng.tistory.com/entry/%EC%BD%94%EC%9D%B8-%ED%88%AC%EC%9E%90-%EB%A7%A4%EB%A7%A4%ED%95%98%EA%B8%B0%EC%A0%84-%EB%AC%B4%EC%A1%B0%EA%B1%B4-%EB%B4%90%EC%95%BC-%ED%95%A0-%EC%B2%B4%ED%81%AC-%EB%A6%AC%EC%8A%A4%ED%8A%B8</link>
      <description>&lt;h4 data-ke-size=&quot;size20&quot;&gt;1. 개인은 절대 세력이 될 수 없다: &lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;&quot;절대 세력은 편하게 개미들이 수익을 보게 하지 않는다&quot;&lt;/b&gt;&lt;/span&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;물론 개인 투자를 하는 사람들 중에 내가 세력이 되겠다 라고 생각하는 무모한 사람은 보기 힘들다고 생각이 듦&lt;/li&gt;
&lt;li&gt;내가 이야기 하고 싶은 포인트는 &lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;&quot;절대 세력은 편하게 개미들이 수익을 보게 하지 않는다&quot;&lt;/b&gt;&lt;/span&gt; 라는 점임&lt;/li&gt;
&lt;li&gt;내 포지션이 내가 예상한 수익보다 큰 수익을 내고 있더라도 &lt;i&gt;&lt;b&gt;아 수익보면 이 돈으로 뭐하지 생각하면서 김치국 마시지 말고&lt;/b&gt; &lt;/i&gt;항상 어딘가 께름칙한 부분은 없는지, 내가 예상한 세력의 움직임이 타당한지, 내 포지션과 반대로 움직일 가능성은 없는지 등에 대해 확인해야 함&lt;/li&gt;
&lt;/ul&gt;
&lt;div data-ke-type=&quot;moreLess&quot; data-text-more=&quot;더보기&quot; data-text-less=&quot;닫기&quot;&gt;&lt;a class=&quot;btn-toggle-moreless&quot;&gt;더보기&lt;/a&gt;
&lt;div class=&quot;moreless-content&quot;&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;께름칙한 부분을 확인할 때 상당한 도움을 주는 것이 Liquidation Heatmap 인 것 같다:&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;a href=&quot;https://www.coinglass.com/pro/futures/LiquidationHeatMap&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://www.coinglass.com/pro/futures/LiquidationHeatMap&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;2. &lt;b&gt;절대&lt;/b&gt; 추격매매 하지 말기: &lt;b&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;조급해하지 마라&lt;/span&gt;&lt;/b&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;추격매매하려고 마음 먹을 때는 당연히 롱 포지션에서 갑자기 정치적 이슈가 터지며 고꾸라졌다거나 아쉽게 본절/손절 라인이 터졌다거나 등등 인간 심리상 짜증을 유발할 때임&lt;/li&gt;
&lt;li&gt;이럴 때가 가장 위험할 때임을 명심해야 됨&lt;/li&gt;
&lt;li&gt;당연한 이야기지만 투자는 얼마나 버는가보다 손실을 덜보는게 훨씬 중요함&lt;/li&gt;
&lt;li&gt;인터넷에 떠돌아다니는 손실에 따라 얼마나 이익을 봐야 메꿀 수 있는지 계산한 표만 봐도 알 수 있음&lt;/li&gt;
&lt;li&gt;이럴 때는 그냥 시원섭섭하지만 보내주고 다시 포지션을 잡는 것이 더 큰 손실을 막는 방법임&lt;/li&gt;
&lt;li&gt;필자 본인도 100% 이상의 수익을 보다가 바이비트 거래소 해킹 논란 + 중국 코로나 ver.2 + 중국 관세 이슈가 한꺼번에 터져서 본절이 터지고 추격매매를 한 적이 있음 -&amp;gt; 당연히 더 큰 손실로 이어졌음&amp;nbsp;&lt;/li&gt;
&lt;/ul&gt;
&lt;div data-ke-type=&quot;moreLess&quot; data-text-more=&quot;더보기&quot; data-text-less=&quot;닫기&quot;&gt;&lt;a class=&quot;btn-toggle-moreless&quot;&gt;더보기&lt;/a&gt;
&lt;div class=&quot;moreless-content&quot;&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;1절만 해도 되지만 굳이 굳이 2절까지 적자면&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;매매만을 통해서 돈을 벌고 생활하는 전문 트레이더들의 매매를 보면 자 여기서 본절 터져버려서 빡쳐서 추격매매 진행했습니다&lt;/li&gt;
&lt;li&gt;이런 경우는 본적이 없다&lt;/li&gt;
&lt;li&gt;보통 시원섭섭한 마음으로 리프레쉬를 하고 와서 다시 차트를 보는 경우가 많다&lt;/li&gt;
&lt;li&gt;한 분야를 빠르게 습득하기 위해서는 그 분야에서 뛰어난 사람이 어떻게 하는지를 분석하는 것이 가장 빠른 방법이라고 생각한다&lt;/li&gt;
&lt;li&gt;코비 브라이언트, 르브론 제임스가 마이클 조던을 분석하며 자기 스타일을 만든 것과 같은 맥락이다&lt;/li&gt;
&lt;li&gt;코비 브라이언트 같이 마이클의 스타일을 완벽하게 카피해버리든 르브론 제임스같이 그 속에서 자기 스타일을 재창조하든 그건 다른 문제고, 특히 투자와 같이 리스크가 따르는 분야에서는 겸손한 마음으로 본인이 초보자라고 생각하며 상급자를 참고하는 것이 효과적으로 작용할 때가 많은 것 같다&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;3. 손절/본절 라인 조정하지 말기: &lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;고집 부리지 마라&lt;/b&gt;&lt;/span&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;경험 상 높은 확률로 손절/본절이 터질 것 같으면 그냥 냅두는게 더 나았을 뻔한 경우가 많음&lt;/li&gt;
&lt;li&gt;롱 포지션에서는 매도세가, 숏 포지션에서는 매수세가 강하다는 것을 알면서도 내가 열심히 분석해서 잡은 포지션이 수익을 못보고 시간만 버린 것이 아쉽고 짜증나서 조정하는 경우가 많을 거라고 생각되는데 (내 이야기임) 매매를 진행하기 전 분석에 더 시간을 쏟고 이후 셋업에 대해서는 관망하는 것이 이성적으로 생각했을 때 맞다고 생각이 듦&lt;/li&gt;
&lt;li&gt;괜한 고집으로 손실만 늘리지 말고 다시 포지션을 잡는 것이 맞다&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;관점을 정했고 손절 익절 라인을 정해서 포지션을 잡았으면 그 포지션에 대해 책임감을 갖고 지켜볼 줄 아는 태도가 마지막에 승리하게 만듦&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;4. 추가매매 조심하기:&lt;span&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;한 번의 매매로 인생을 역전하려고 하지 마라 &lt;/b&gt;&lt;/span&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;이건 추격매매랑은 좀 다른 경우지만 마찬가지로 상당히 위험함&lt;/li&gt;
&lt;li&gt;상승추세 제대로 탔다고 생각이 들어서&amp;nbsp; 시드 풀매수하는 그런 경우를 말하고 싶은건데 이런 경우는 어느정도 감당할 수 있는 선에서 더 큰 이익을 위해 추가매매하는 경우는 필요하다고 생각함&lt;/li&gt;
&lt;li&gt;하지만, 평단과 예상 손실등의 리스크를 철저하게 관리하면서 진행해야 함&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;5. 시장 상황과 시드 현황을 다시 한번 생각해보고 매매하기: &lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;시드 &amp;amp; 레버리지 비율을 꼼꼼하게 설정해라&lt;/b&gt;&lt;/span&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;필자는 투자를 하며 팔랑귀처럼 아 저 유튜버는 매수 들어갔네 나도 들어가야겠다 하는 그런 성격은 아님&lt;/li&gt;
&lt;li&gt;그럼 뭘 말하고 싶은거냐?&amp;nbsp;&lt;/li&gt;
&lt;li&gt;필자는 풀매수가 나쁘다고 생각이 들진 않음&lt;/li&gt;
&lt;li&gt;손절 라인이랑 레버리지만 잘 설정하면 큰 수익을 보는 것은 자명하기 때문&lt;/li&gt;
&lt;li&gt;하지만 내가 지금 시드를 어느정도 투입해야 하는 경우인지 정확하게 판단하고 매매를 해야된다는 것을 강조하고 싶음&lt;/li&gt;
&lt;li&gt;흔히 강력한 상승추세 혹은 하락 추세를 예측하고 높은 비율의 시드/레버리지와 함께 분할 익절의 비중까지 줄이고 매매를 하는 것을 &quot;스윙&quot;이라고 함&lt;/li&gt;
&lt;li&gt;이 때 내가 스윙을 해야되는 상황인지 낮은 시드와 레버리지로 짧게 짧게 먹어야 하는지는 당연히 시장상황에 달려있음&lt;/li&gt;
&lt;li&gt;물론 자연재해과 같은 예외적인 경우는 어쩔 수 없지만 이 경우에 대해서 명확한 기준을 갖고 매매를 진행해야 더 좋은 투자가 될 것이라는 부분을 강조하는 것임&lt;/li&gt;
&lt;li&gt;또한, 내 시드 현황도 생각하고 매매해야 함: 시드 비중과 레버리지 비율에 따라 청산가, 주가 상승과 하락에 대한 수익과 손실의 비율이 달라지게 됨&lt;/li&gt;
&lt;li&gt;내 시드 현황에 맞게 감당할 수 있는 정도로 설정하는 것이 중요함&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;6. 분할 익절은 필수다: &lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;먹여줄 때 먹어라&lt;/b&gt;&lt;/span&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;아무리 상승 추세가 강하거나 하락 추세가 강하다고 해도 주가는 어떤 예외적인 상황이 나올지 모름&lt;/li&gt;
&lt;li&gt;필자의 매매일지 #1, #2를 보면 알 수 있겠지만 비트코인의 상승추세를 보고 롱 포지션을 잡으면서 수익이 예상한 것 보다 많이 발생하고 있던 상황에 바이비트 거래소 해킹 논란 + 중국 코로나 ver2 + 중국 관세 이슈가 한꺼번에 터지며 3일동안 오른 주가가 5시간 가량만에 빠지면서 본절이 터져버린 경험을 했음&lt;/li&gt;
&lt;li&gt;이처럼 어떤 상황이 나올지 모르는 상황에 변동성이 큰 코인 단타를 친다면 분할 익절은 필수라는 결론을 냈음&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;7. 내 성격을 파악해라: &lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;내 자신을 가장 견재해라&lt;/b&gt;&lt;/span&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;개인 투자는 나와 다른 투자자들과의 싸움이 아님&lt;/li&gt;
&lt;li&gt;또한, 나와 세력과의 싸움도 아님&lt;/li&gt;
&lt;li&gt;우린 그저 세력의 등에 업혀서 콩고물을 조금씩 모아간다고 생각해야 함&lt;/li&gt;
&lt;li&gt;개인 투자는 나 자신과의 싸움임&lt;/li&gt;
&lt;li&gt;손실을 볼 때는 내 충동성, 짜증과 같은 부정적인 감정을 억제하는 동시에 수익을 볼 때는 들뜨는 마음을 억제하며 의심하는 습관을 들여야 계좌는 우상향함&lt;/li&gt;
&lt;li&gt;내 성격적이 부분이 어떤 면에서 가장 취약한지를 파악하고 항상 리마인드해야함&lt;/li&gt;
&lt;li&gt;필자 본인이 가장 견제하는 부분은 손해를 보기 싫어하는 성격임&lt;/li&gt;
&lt;li&gt;매매를 짧게나마 진행하며 내 성격에 대해 파악할 수 있었는데, 추가 매매를 막 들어가거나 감당하지 못하는 레버리지를 사용하는 무모한 성격은 아니라 긍정적이지만 손해를 극도로 보기 싫어하기 때문에 추격매매를 주의해야겠다는 결론을 낼 수 있었음&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;8. 나만의 룰을 정해라: &lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;머리가 뜨끈해질 때를 조심해라&lt;/b&gt;&lt;/span&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;위에서 언급한 것 처럼 필자는 개인 투자가 나와의 싸움이라고 생각함&lt;/li&gt;
&lt;li&gt;돈과 관련된 내 감정을 컨트롤한다는 것은 정말 쉽지 않을 일임&lt;/li&gt;
&lt;li&gt;따라서, 나만의 룰을 정하는 것이 좋다고 생각이 들었음&lt;/li&gt;
&lt;li&gt;예를 들자면, 매매 횟수를 하루 2번 정도로 제한한다거나 손절을 두 번 했다면 그 날은 차트를 끄고 쉰다거나 하는 내 돈은 물론이고 내 멘탈과 건강한 투자 생활을 지킬 수 있는 방법을 구축하는 것은 필수적임&lt;/li&gt;
&lt;/ul&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style3&quot; /&gt;
&lt;blockquote data-ke-style=&quot;style3&quot;&gt;Last Updated at: 2025.02.22&lt;/blockquote&gt;</description>
      <category>투자</category>
      <category>코인</category>
      <category>투자</category>
      <author>dongsunseng</author>
      <guid isPermaLink="true">https://dongsunseng.tistory.com/134</guid>
      <comments>https://dongsunseng.tistory.com/entry/%EC%BD%94%EC%9D%B8-%ED%88%AC%EC%9E%90-%EB%A7%A4%EB%A7%A4%ED%95%98%EA%B8%B0%EC%A0%84-%EB%AC%B4%EC%A1%B0%EA%B1%B4-%EB%B4%90%EC%95%BC-%ED%95%A0-%EC%B2%B4%ED%81%AC-%EB%A6%AC%EC%8A%A4%ED%8A%B8#entry134comment</comments>
      <pubDate>Sat, 22 Feb 2025 17:04:21 +0900</pubDate>
    </item>
    <item>
      <title>[매매일지] 1. 시장 수업료로 뱉은거 100% 멘징 + 비트 제대로 복수 완료??</title>
      <link>https://dongsunseng.tistory.com/entry/%EC%BD%94%EC%9D%B8-%ED%88%AC%EC%9E%90-%EB%A7%A4%EB%A7%A4%EC%9D%BC%EC%A7%80-1-%EC%8B%9C%EC%9E%A5-%EC%88%98%EC%97%85%EB%A3%8C%EB%A1%9C-%EB%B1%89%EC%9D%80%EA%B1%B0-100-%EB%A9%98%EC%A7%95-%EB%B9%84%ED%8A%B8-%EC%A0%9C%EB%8C%80%EB%A1%9C-%EB%B3%B5%EC%88%98-%EC%99%84%EB%A3%8C</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;기존 코인 투자 포스트들은 기초 내용들을 다뤘지만 이제부턴 매매일지를 꾸준히 작성해보려고 함&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;일단 필자는 2월 14일부터 실제 돈으로 투자하기 시작한 초보자임&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;전 한달 반 가량 모의투자와 투딩 챌린지 참여, 여러 강의 및 시황 분석 리딩을 통해 공부했음&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;예금으로 묶어뒀던 3000달러를 투입했고, 당연히 처음에는 총 시드 1000 달러 정도만 사용함&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;그 결과 4일에 걸쳐서 200 달러를 잃었음(이 부분은 다다음 포스트에서 다룰 예정)&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;이후 다시 포지션을 잡았고 이번에는 느낀 점이 많아 포스트를 작성하려고 함&lt;/p&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style3&quot; /&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;먼저, 200 달러 가량을 날린 후 알트 코인은 접고 일단 비트코인 차트만 열심히 보기로 함 (당연히 알트만 했던 것은 아님)&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;일단 개인적으로는 &lt;span style=&quot;background-color: #9feec3;&quot;&gt;&lt;b&gt;일봉과 4시간봉 추세 분석과 다우 이론에 따라서 상승 추세 (근거#1)&lt;/b&gt;&lt;/span&gt;를 보고 있었음&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;롱 포지션을 잡던 중에 이번에는 올라가야되는데? 하는 상황이 몇 번 발생하였지만 내 손절가에 맞춰서 털고 살짝 반등하던 것이 반복되서 지치던 상황이었음(이런 식으로 수 차례에 걸쳐서 200달러 털림)&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;2월 18일부터 진행된 5분봉에서 &lt;b&gt;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;쓰리마켓 패턴을 발견했고 이번에는 가겠지 라는 생각으로 (근거#2)&lt;/span&gt;&lt;/b&gt;&amp;nbsp;다시 포지션을 잡아보기로 함&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;아래 차트 이미지에 보면 초록색 부분은 강한 매물대로 작용한 것을 볼 수 있음: &lt;span style=&quot;background-color: #9feec3;&quot;&gt;&lt;b&gt;5번의 지지를 받은 후에 강한 반등을 예측했음(근거 #3)&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;결과론적으로 보면 곡선 추세에 저항을 한 번 받고 다시 매물대에 저항을 한 번 더 받은 후에 상승하긴 했음&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;이 때 강의를 들었던 강사분의 시드를 빠르게 불려야 할 때의 매매법을 참고해서 400달러로 고배(30배 레버리지)를 사용해서 짧게 짧게 상승 추세를 먹어보기로 함&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-20 오후 4.08.55.png&quot; data-origin-width=&quot;1299&quot; data-origin-height=&quot;827&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bXXKJv/btsMpIyCFYm/LAXDBT1AxSspB8MkmENnk0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bXXKJv/btsMpIyCFYm/LAXDBT1AxSspB8MkmENnk0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bXXKJv/btsMpIyCFYm/LAXDBT1AxSspB8MkmENnk0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbXXKJv%2FbtsMpIyCFYm%2FLAXDBT1AxSspB8MkmENnk0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1299&quot; height=&quot;827&quot; data-filename=&quot;스크린샷 2025-02-20 오후 4.08.55.png&quot; data-origin-width=&quot;1299&quot; data-origin-height=&quot;827&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;진입 시점은 위에서 보이는대로 2/18 아침 9시반&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;쓰리마켓 패턴과 함께 두 개의 주황색 추세선, 파란색 곡선 추세선을 그렸었는데 곡선 추세선을 뚫기 전에 진입함&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;전 고점들의 유동성들을 지지 저항선으로 그려뒀었는데 98,000 라인은 넘기고 익절한다는 생각으로 2.76%를 먹는 라인을 익절 라인, 손절 라인은 전에 지지 저항이 일어났던 부분을 보수적으로 잡았음(손실을 더이상 보기 싫었음...)&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;이렇게 잡았더니 &lt;span style=&quot;background-color: #c1bef9;&quot;&gt;&lt;b&gt;손익비는 3.86 정도&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;익절 라인을 공격적으로 잡았던 이유는 이전에 3번 정도를 이번에는 무조건 상승할거라고 생각했지만 밑으로 고꾸라진 경우가 있었어서 만약 이번에 상승 추세로 돌파한다면 100,000까지도 갈 수 있지 않을까 싶었음&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=-E9kIZA1tcE&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://www.youtube.com/watch?v=-E9kIZA1tcE&lt;/a&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;위 유튜브는 내가 들었던 강의의 강사님이 운영하는 유튜브 채널임&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;같은 상승 추세를 보고있어서 참고중이었는데 이 분도 내가 진입하고 나서 상승 추세가 맞다고 생각한다는 영상을 올리셔서 첨부해봄&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;주요 내용은 이러함:&lt;/p&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;24년 7월이랑 패턴이 비슷함: 1. 청산빔 2. 강한 반등 3. 꾸득꾸득 내려버림 4. 786 반등 5. 상승 추세 전환&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;&lt;b&gt;위의 초록색 박스로 표시된 매물대에서 3번의 지지를 받은 후에 개미들이 아 여긴 강력한 지지구간이구나 라고 인식하는 차에 바로 하방 이탈해버린 후에 바로 다시 회귀함&lt;/b&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;개미들은 수차례 지지를 받던 구간이 이탈을 해버렸으니 이제는 강한 저항구간이 되겠구나 라고 생각하고 해당 매물대에서 숏 포지션을 많이 잡았을 것임&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;하지만 세력들은 뻔한 저항구간에서 개미들을 먹여주지 않기 때문에 숏 포지션들을 청산시키는 강한 상승이 나올 것이라고 예상함&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;oi: 미체결 약정&lt;/li&gt;
&lt;/ol&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;이번 롱 포지션을 잡으면서 밤낮도 아예 바꿨고 밤 새면서 차트만 봤음&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;이번에는 무조건 먹고 싶다는 오기가 발동하기도 했고, 승률보다는 손익비가 중요하다지만 너무 자주 지는 것도 결국 좋은 것은 아니라고 생각했음&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;전에 했던 모의투자와는 무게감이 다른 상태에서 밤새 차트를 보며 분석하다 보니까 실력이 빠르게 향상된다는 것을 나도 느끼고, 전에는 특정 캔들에서 위로 올라갈지 내려갈지 감이 아예 안왔지만 이제는 여기서는 반등하지 않을까? 여기서는 단기 하방을 띌거같은데? 정도의 생각은 들 정도로 발전했음&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #99cefa;&quot;&gt;&lt;b&gt;포지션을 종료하진 않았지만 중간 결과:&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;P&amp;amp;L: +1500달러를 달리는 중임(99k 정도 기준)&lt;/li&gt;
&lt;li&gt;&lt;b&gt;96.7k 부근 등에서 부분 익절을 하지 않았던 이유:&lt;/b&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;욕심(시드를 빨리 불리고 싶은 마음)&lt;/li&gt;
&lt;li&gt;곡선 추세를 이탈한 후에 처음 채널까지 이탈했을 때 몇 틱 차이로 못 찍고 다시 하락함: 이미 추세선을 기준으로 상당히 많이 비빈 구간이기 때문에 확실한 상승이 나온다면 큰 상승일거라는 생각(위의 첨부한 유튜브 차설님의 관점을 참고했음)&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;차트를 보며 배운 점들 정리:&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #c0d1e7;&quot;&gt;일단 추세, 지지, 저항, 다우 이론 이렇게 4가지가 기법들, 패턴들 보다 선행되어야 함&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;저정도 지식만 갖고도 잘 활용할 줄만 안다면 충분히 돈 벌 수 있을 거라고 생각됨&lt;/li&gt;
&lt;li&gt;따라서, 저 내용들에 대한 깊은 학습(+실전 경험)이 필요할듯&lt;/li&gt;
&lt;li&gt;전체적인 추세 및 변곡 파악이 가장 중요&lt;/li&gt;
&lt;li&gt;볼린저 밴드가 이번 포지션만 놓고 봤을 때 상당히 신뢰도가 높았음&lt;/li&gt;
&lt;li&gt;사실 아직 매매 경험이 부족하기도 하고 익절, 손절 라인 설정 같은 부분이 미숙했음&lt;/li&gt;
&lt;li&gt;큰 프레임에서 봤을 때 상승이 나올거라는 것은 자명하다고 생각이 들었기에 상승이나 하락이 나오면 손절 라인을 조정하는 일이 빈번하게 일어났음&lt;/li&gt;
&lt;li&gt;이런 경우 그냥 손절 라인을 명확하게 설정한 후 손절이 나면 털어버리고 다음 포지션을 잡는 것이 이성적으로는 맞다고 생각이 들긴 하지만, 무조건 먹을 수 있다는 확신이 상당히 강하게 든 이번 포지션같은 경우에는 이렇게 하는 것이 지금까지 결론적으로 봤을 때는 좋았음&lt;/li&gt;
&lt;li&gt;좀 더 경험을 쌓으면서 내 기준을 설정하는 것이 중요할듯&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #c0d1e7;&quot;&gt;다시 볼린저 밴드로 돌아가서, 손절 라인 및 익절 라인을 막 조정하면서 가장 도움을 받았던 것이 볼린저 밴드였음&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;5분봉 같이 단기봉으로 차트를 보면서 지금 상승이 나올지 하락이 나올지 판단을 했어야 되는 상황에 볼린저 밴드가 상단이 먼저 꺾이는지 하단이 먼저 꺾이는지를 봐야함&amp;nbsp;&lt;/li&gt;
&lt;li&gt;이후에 해당 추세가 이어지는지 전환이 이루어질지도 볼린저 밴드가 어딜 향하는지를 보는 것이 큰 도움이 됨&lt;/li&gt;
&lt;li&gt;자세한 내용은 이 포스트 참고: &lt;a href=&quot;https://dongsunseng.com/entry/%EC%BD%94%EC%9D%B8-%ED%88%AC%EC%9E%90-14-%EB%B3%BC%EB%A6%B0%EC%A0%80-%EB%B0%B4%EB%93%9C&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://dongsunseng.com/entry/%EC%BD%94%EC%9D%B8-%ED%88%AC%EC%9E%90-14-%EB%B3%BC%EB%A6%B0%EC%A0%80-%EB%B0%B4%EB%93%9C&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #c0d1e7;&quot;&gt;또한, 엘리어트 파동도 많은 도움이 되었던 것 같음&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;/b&gt;엘리어트 파동을 처음 배웠을 때는 아주 기초만 배우고 깊은 내용들은 초보자가 배우기 어렵다고 넘어갔었는데 조금은 더 공부해보고 싶은 마음이 생겼음&lt;/li&gt;
&lt;li&gt;아주 기초인 충격 파동과 조정 파동만 읽어도 갑자기 하락이 나와도 단기 하방이라고 생각이 들기 때문에 심리적 안정감 측면에서 도움이 되었던 것 같음&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #c0d1e7;&quot;&gt;마지막으로, 처음 코인 투자를 시작하면 어느 시점에서 어떤 분봉을 봐야되고 이런 부분들이 감이 아예 안잡힐 수 있음(내가 그랬음)&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;이건 실제로 신경이 쓰일 정도가 되는 금액으로 포지션을 잡아보면 감을 더 빨리 잡을 것 같음&lt;/li&gt;
&lt;li&gt;내 피같은 돈이 날아갈지 힘들게 분석한 내 노력이 성과를 볼지에 대한 문제이기 때문에 1분봉부터 시작해서 5분봉 15분봉 30분봉 1시간봉 4시간봉 12시간봉 날봉을 막 돌아가면서 보게 될것임&lt;/li&gt;
&lt;li&gt;그러다보면 어? 현 상황에선 4시간봉이 양봉 마감하는 것이 중요하겠구나 혹은 1시간봉부터 4시간봉 날봉이 모두 상승 추세를 띄니까 5분봉 15분봉 정도 체크하면서 단기 하방이 크게 나올 것만 주의하면 되겠구나 이런 부분들이 보이기 시작할 것임&lt;/li&gt;
&lt;li&gt;진짜 감이 안 잡힌다하는 경우에는 유튜버 혹은 트레이딩 강사분들의 시황 공유 텔레그램 혹은 카톡방에 들어가서 이 사람은 이렇게 생각하는구나를 참고하면서 매매하는 것도 큰 도움이 됨&lt;/li&gt;
&lt;li&gt;이 때 주의해야할 점은 그 사람들은 내가 청산을 당하던 말던 상관이 1도 없음&lt;/li&gt;
&lt;li&gt;&lt;b&gt;내 매매는 철저하게 나한테 책임이 있다는 것을 명심해야 하고 한 사람의 말에만 휩쓸리지 않기 위해서는 그런 커뮤니티를 여러 개 가입한 후 의견들을 비교해가며 내 분석과 매칭해보는 것이 합리적인 방법이라고 생각이 듦&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;차트를 보며 느꼈던 점들 정리:&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;그냥 어? 이 패턴이네 하고 포지션 잡는 것은 물론 맞을 수도 있겠지만 똑똑한 투자자가 되는 길은 아닌 듯함&lt;/li&gt;
&lt;li&gt;여기서 다른 개미들은 어떤 판단을 했을까? 얼마나 털렸을까? 이런 부분들을 함께 생각해야 함&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #c0d1e7;&quot;&gt;또한, 패턴이나 기법을 무작정 외우지말고 왜 이런 패턴에서는 큰 상승이 나오고 저런 패턴에서는 큰 하락이 나오는지 그 이유를 &quot;이해&quot;하고 매매하는 것이 중요해보임&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;기법, 패턴, 포지션 타점 모두 중요하지만 나한테 적진 않은 돈인 200 달러를 수 차례에 나눠서 (더 짜증남) 내면서 느낀 점은 리스크 관리가 제일 제일 중요함&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #c0d1e7;&quot;&gt;&lt;b&gt;과거에 이런 패턴이 있었는지를 확인하는 것도 아주 중요함&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;주의해야 하는 것은 해당 패턴이랑 비슷했던 적이 있었다고 똑같이 진행되겠지라고 생각하는 것 보단 항상 페이크 아웃이나 비슷한 패턴으로 학습된 개미들을 털어버릴려는 세력의 심산이 아닌지 경계해야 함&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;위의 주황색 추세선 두 개는 아주 강한 추세선이라고 생각하고 그은 부분임(실제로 아주 강했죠)&lt;/li&gt;
&lt;li&gt;캔들을 보면 알겠지만 아주 역겨울 정도로 평행 채널과 두 개의 강한 추세선에게 저항 및 지지를 받으며 오르락 내리락 한 것을 볼 수 있음&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #c0d1e7;&quot;&gt;&lt;b&gt;이럴 때 내가 생각한 관점이 맞다고 생각하면 관점을 고수하는 멘탈이 필요함&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;반성할 점들:&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;처음에는 400달러 고배로 짧게 짧게 먹어보자 라고 생각했던 것이 상승 추세가 쭉 이어지니까 손절 라인만 조정하며 홀딩하는 방식으로 변했음&lt;/li&gt;
&lt;li&gt;처음부터 이 점을 파악했으면 시드를 더 넣고 레버리지를 줄이는게 낫지 않았을까 생각됨&lt;/li&gt;
&lt;li&gt;사실 이 점이 가장 반성해야할 점인데, 곡선 추세가 상승 이탈하는 부분을 확인하고 1000달러 가량을 추가 매수함&lt;/li&gt;
&lt;li&gt;사실 내가 본 관점이 확실하다는 생각이 들어서 추가 매매를 한다 이것만 봤을 때는 합리적으로 보이지만 그럼 1400달러 30배 레버리지 매매를 해버린 것인데 상당히 충동적이었고, 결론은 좋았지만 23살의 나이한테는 상당히 큰 돈을 때려넣어버렸다는 것은 반성하고 조정해야할 부분임이 확실함&lt;/li&gt;
&lt;li&gt;코인 투자를 해야겠다고 마음 먹은 계기는 당연히 뭐 큰 돈 벌고 싶다 관심이 오래전부터 있었다 이런건 너무 당연한 이야기이고, 강의를 들으면서 강사님이 가장 중요하게 강조했던 심법에 대해 자신이 있었기 때문이었음&lt;/li&gt;
&lt;li&gt;&lt;b&gt;평소에도 업다운도 없고 이성적이라고 생각했던 내가 이런 행동을 했다는 것이 스스로 좀 무섭기도 한데, 앞으로는 매매 기준을 잘 정해서 내가 감당할만큼만 매매해야겠음&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style5&quot; /&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;이 바닥 심법이 전부다..&lt;/span&gt;&lt;/blockquote&gt;</description>
      <category>투자</category>
      <category>매매일지</category>
      <category>코인</category>
      <category>투자</category>
      <author>dongsunseng</author>
      <guid isPermaLink="true">https://dongsunseng.tistory.com/133</guid>
      <comments>https://dongsunseng.tistory.com/entry/%EC%BD%94%EC%9D%B8-%ED%88%AC%EC%9E%90-%EB%A7%A4%EB%A7%A4%EC%9D%BC%EC%A7%80-1-%EC%8B%9C%EC%9E%A5-%EC%88%98%EC%97%85%EB%A3%8C%EB%A1%9C-%EB%B1%89%EC%9D%80%EA%B1%B0-100-%EB%A9%98%EC%A7%95-%EB%B9%84%ED%8A%B8-%EC%A0%9C%EB%8C%80%EB%A1%9C-%EB%B3%B5%EC%88%98-%EC%99%84%EB%A3%8C#entry133comment</comments>
      <pubDate>Sat, 22 Feb 2025 15:51:34 +0900</pubDate>
    </item>
    <item>
      <title>CIBMTR - Equity in post-HCT Survival Predictions #14 Feature Engineering Ideas</title>
      <link>https://dongsunseng.tistory.com/entry/CIBMTR-Equity-in-post-HCT-Survival-Predictions-14-Feature-Engineering-Ideas</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;Annotation of a discussion post about feature engineering ideas:&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;a href=&quot;https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550863&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550863&lt;/a&gt;&lt;/p&gt;
&lt;figure id=&quot;og_1739186873963&quot; contenteditable=&quot;false&quot; data-ke-type=&quot;opengraph&quot; data-ke-align=&quot;alignCenter&quot; data-og-type=&quot;website&quot; data-og-title=&quot;CIBMTR - Equity in post-HCT Survival Predictions&quot; data-og-description=&quot;Improve prediction of transplant survival rates equitably for allogeneic HCT patients&quot; data-og-host=&quot;www.kaggle.com&quot; data-og-source-url=&quot;https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550863&quot; data-og-url=&quot;https://kaggle.com/equity-post-HCT-survival-predictions&quot; data-og-image=&quot;https://scrap.kakaocdn.net/dn/d4sX6v/hyYcbo5KUA/a17HVGVzteRNUyp2Kv1iP1/img.png?width=560&amp;amp;height=280&amp;amp;face=0_0_560_280,https://scrap.kakaocdn.net/dn/cptjge/hyYf3XcXSu/tIc4bZsLkuu8o2UjvKRn00/img.png?width=560&amp;amp;height=280&amp;amp;face=0_0_560_280&quot;&gt;&lt;a href=&quot;https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550863&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot; data-source-url=&quot;https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550863&quot;&gt;
&lt;div class=&quot;og-image&quot; style=&quot;background-image: url('https://scrap.kakaocdn.net/dn/d4sX6v/hyYcbo5KUA/a17HVGVzteRNUyp2Kv1iP1/img.png?width=560&amp;amp;height=280&amp;amp;face=0_0_560_280,https://scrap.kakaocdn.net/dn/cptjge/hyYf3XcXSu/tIc4bZsLkuu8o2UjvKRn00/img.png?width=560&amp;amp;height=280&amp;amp;face=0_0_560_280');&quot;&gt;&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;og-text&quot;&gt;
&lt;p class=&quot;og-title&quot; data-ke-size=&quot;size16&quot;&gt;CIBMTR - Equity in post-HCT Survival Predictions&lt;/p&gt;
&lt;p class=&quot;og-desc&quot; data-ke-size=&quot;size16&quot;&gt;Improve prediction of transplant survival rates equitably for allogeneic HCT patients&lt;/p&gt;
&lt;p class=&quot;og-host&quot; data-ke-size=&quot;size16&quot;&gt;www.kaggle.com&lt;/p&gt;
&lt;/div&gt;
&lt;/a&gt;&lt;/figure&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style3&quot; /&gt;
&lt;h2 style=&quot;background-color: #ffffff; color: #202124; text-align: start;&quot; data-ke-size=&quot;size26&quot;&gt;&lt;b&gt;Feature Engineering Ideas&lt;/b&gt;&lt;/h2&gt;
&lt;div style=&quot;background-color: #ffffff; color: #000000; text-align: start;&quot;&gt;
&lt;div style=&quot;background-color: #ffffff; color: #3c4043;&quot;&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Hi everyone! My current best CV score and LB score (CV 0.683 and LB 0.688) is just ensemble and/or stack various models (without feature engineering).&lt;/li&gt;
&lt;li&gt;Each model is trained with different targets and different losses.&lt;/li&gt;
&lt;li&gt;I have not performed any feature engineering or data augmentation or external data yet.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 style=&quot;color: #202124; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;Categorical vs. Numerical&lt;/b&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Let's discuss feature engineering.&lt;/li&gt;
&lt;li&gt;This dataset has 35 categorical features and 22 numerical features (for total 57 features).&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;However 55 features look like categorical with few unique values. &lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;&lt;b&gt;Only donor_age and act_at_hct look like true numerical with many continuous values.&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;It is true that 20 of the numerical features may be ordinal (which means that the order of values matters), but for my NN treating all features (except two ages) as categorical worked best. &lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;So we could treat all (except two ages) as categorical and combine them creatively.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 style=&quot;color: #202124; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;Feature Engineering&lt;/b&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Let's have a discussion about which feature engineering to try.&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;One technique is to combine columns with train[&quot;new&quot;] = train[&quot;col1&quot;].astype(&quot;str&quot;) + &quot;_&quot; + train[&quot;cols2&quot;.astype(&quot;str&quot;). &lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;Then we have a new categorical feature. &lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;We can even combine 3, 4, 5, etc columns.&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;When we do this the cardinality increases, so we can try advanced techniques like target encoding, count encoding, etc to process the new high cardinality feature&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;Another idea is to try mathematical combinations like&lt;span style=&quot;font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;train[&quot;new&quot;] = function( train[&quot;col1&quot;], train[&quot;cols2&quot;] ). &lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;Here the function could just multiply the columns or it can do more advanced techniques like taking a product of the logs OR takes the difference etc etc.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 style=&quot;color: #202124; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;Data Augmentation&lt;/b&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Another idea to boost CV and LB is data augmentation.&lt;/li&gt;
&lt;li&gt;With tabular data and GBDT, one way to perform data augmentation is to make copies of the train data.&lt;/li&gt;
&lt;li&gt;Then for each copy, we can augment (i.e. modify change) the data.&lt;/li&gt;
&lt;li&gt;Then we concatenate all the copies and train a GBDT on the new concatenated dataframe.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 style=&quot;color: #202124; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;External Data&lt;/b&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Another thing that helps is external data.&lt;/li&gt;
&lt;li&gt;Has anyone found any good external data sets?&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 style=&quot;color: #202124; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;Recursive Feature Reduction&lt;/b&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;Another idea is to remove each feature one by one and see if CV score and LB score increases. &lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;Sometimes there are features whose presence hurts CV score and LB score.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 style=&quot;color: #202124; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;Model Hyerparameter Optimization&lt;/b&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;It is true that optimizing each model's hyperparameters in our ensemble will boost our overall CV score and LB score, but at this time I am more interested in discussing feature engineering.&lt;/li&gt;
&lt;li&gt;So let's discuss feature engineering!&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 style=&quot;color: #202124; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;Let's Discuss&lt;/b&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Let's discuss ideas that we are trying to improve CV and LB score.&lt;/li&gt;
&lt;li&gt;So far the only public notebooks are using different models, but nobody has suggested or tried ways to modify, change, or increase the data.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;b&gt;Comments:&lt;/b&gt;&lt;/h3&gt;
&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-10 오후 8.42.39.png&quot; data-origin-width=&quot;1181&quot; data-origin-height=&quot;348&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/brevOS/btsMcGO7cH8/FlWwoqQSmegp1xJeYlXIZk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/brevOS/btsMcGO7cH8/FlWwoqQSmegp1xJeYlXIZk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/brevOS/btsMcGO7cH8/FlWwoqQSmegp1xJeYlXIZk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbrevOS%2FbtsMcGO7cH8%2FFlWwoqQSmegp1xJeYlXIZk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1181&quot; height=&quot;348&quot; data-filename=&quot;스크린샷 2025-02-10 오후 8.42.39.png&quot; data-origin-width=&quot;1181&quot; data-origin-height=&quot;348&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-10 오후 8.53.46.png&quot; data-origin-width=&quot;1199&quot; data-origin-height=&quot;599&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/c48vpf/btsMcd0XlOj/UeG1GZshr9cGUU4kgkNR8k/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/c48vpf/btsMcd0XlOj/UeG1GZshr9cGUU4kgkNR8k/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/c48vpf/btsMcd0XlOj/UeG1GZshr9cGUU4kgkNR8k/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fc48vpf%2FbtsMcd0XlOj%2FUeG1GZshr9cGUU4kgkNR8k%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1199&quot; height=&quot;599&quot; data-filename=&quot;스크린샷 2025-02-10 오후 8.53.46.png&quot; data-origin-width=&quot;1199&quot; data-origin-height=&quot;599&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-10 오후 8.54.39.png&quot; data-origin-width=&quot;1025&quot; data-origin-height=&quot;313&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/La7eb/btsMd42D9LB/81duDoh12UKqD3yR5O7TN1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/La7eb/btsMd42D9LB/81duDoh12UKqD3yR5O7TN1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/La7eb/btsMd42D9LB/81duDoh12UKqD3yR5O7TN1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FLa7eb%2FbtsMd42D9LB%2F81duDoh12UKqD3yR5O7TN1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1025&quot; height=&quot;313&quot; data-filename=&quot;스크린샷 2025-02-10 오후 8.54.39.png&quot; data-origin-width=&quot;1025&quot; data-origin-height=&quot;313&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-10 오후 9.05.43.png&quot; data-origin-width=&quot;1202&quot; data-origin-height=&quot;495&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/vKDW9/btsMcR30hPn/FELPS7FL3kQ7YiqjpEO370/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/vKDW9/btsMcR30hPn/FELPS7FL3kQ7YiqjpEO370/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/vKDW9/btsMcR30hPn/FELPS7FL3kQ7YiqjpEO370/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FvKDW9%2FbtsMcR30hPn%2FFELPS7FL3kQ7YiqjpEO370%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1202&quot; height=&quot;495&quot; data-filename=&quot;스크린샷 2025-02-10 오후 9.05.43.png&quot; data-origin-width=&quot;1202&quot; data-origin-height=&quot;495&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style5&quot; /&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;뛰어나고 훌륭하게 시작할 필요는 없다. 그러나 훌륭하기 위해서는 시작해야 한다.&lt;br /&gt;- 지그 지글러 -&lt;/span&gt;&lt;/blockquote&gt;</description>
      <category>대회</category>
      <category>cibmtr - equity in post-hct survival predictions</category>
      <category>Feature Engineering</category>
      <category>Kaggle</category>
      <category>캐글</category>
      <author>dongsunseng</author>
      <guid isPermaLink="true">https://dongsunseng.tistory.com/125</guid>
      <comments>https://dongsunseng.tistory.com/entry/CIBMTR-Equity-in-post-HCT-Survival-Predictions-14-Feature-Engineering-Ideas#entry125comment</comments>
      <pubDate>Mon, 10 Feb 2025 21:07:41 +0900</pubDate>
    </item>
    <item>
      <title>CIBMTR - Equity in post-HCT Survival Predictions #13 How to make sense of the race group distribution in the data?</title>
      <link>https://dongsunseng.tistory.com/entry/CIBMTR-Equity-in-post-HCT-Survival-Predictions-13-How-to-make-sense-of-the-race-group-distribution-in-the-data</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;a href=&quot;https://dongsunseng.com/entry/CIBMTR-Equity-in-post-HCT-Survival-Predictions-11-ESP-EDA-which-makes-sense-%E2%AD%90%EF%B8%8F%E2%AD%90%EF%B8%8F%E2%AD%90%EF%B8%8F%E2%AD%90%EF%B8%8F%E2%AD%90%EF%B8%8F-AFT-Loss-func-sol-1&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://dongsunseng.com/entry/CIBMTR-Equity-in-post-HCT-Survival-Predictions-11-ESP-EDA-which-makes-sense-%E2%AD%90%EF%B8%8F%E2%AD%90%EF%B8%8F%E2%AD%90%EF%B8%8F%E2%AD%90%EF%B8%8F%E2%AD%90%EF%B8%8F-AFT-Loss-func-sol-1&lt;/a&gt;&lt;/p&gt;
&lt;figure id=&quot;og_1739186027828&quot; contenteditable=&quot;false&quot; data-ke-type=&quot;opengraph&quot; data-ke-align=&quot;alignCenter&quot; data-og-type=&quot;article&quot; data-og-title=&quot;CIBMTR - Equity in post-HCT Survival Predictions #11 ESP EDA which makes sense ⭐️⭐️⭐️⭐️⭐️ (AFT Loss func sol&quot; data-og-description=&quot;Annotation post about AFT loss function solution:https://www.kaggle.com/code/ambrosm/esp-eda-which-makes-sense&amp;nbsp;ESP EDA which makes sense ⭐️⭐️⭐️⭐️⭐️Explore and run machine learning code with Kaggle Notebooks | Using data from CIBMTR - E&quot; data-og-host=&quot;dongsunseng.com&quot; data-og-source-url=&quot;https://dongsunseng.com/entry/CIBMTR-Equity-in-post-HCT-Survival-Predictions-11-ESP-EDA-which-makes-sense-%E2%AD%90%EF%B8%8F%E2%AD%90%EF%B8%8F%E2%AD%90%EF%B8%8F%E2%AD%90%EF%B8%8F%E2%AD%90%EF%B8%8F-AFT-Loss-func-sol-1&quot; data-og-url=&quot;https://dongsunseng.com/entry/CIBMTR-Equity-in-post-HCT-Survival-Predictions-11-ESP-EDA-which-makes-sense-%E2%AD%90%EF%B8%8F%E2%AD%90%EF%B8%8F%E2%AD%90%EF%B8%8F%E2%AD%90%EF%B8%8F%E2%AD%90%EF%B8%8F-AFT-Loss-func-sol-1&quot; data-og-image=&quot;https://scrap.kakaocdn.net/dn/cANkhr/hyYfSH8RfR/ybgOJkKySrwkYP539BEYP1/img.png?width=560&amp;amp;height=280&amp;amp;face=0_0_560_280,https://scrap.kakaocdn.net/dn/9umr5/hyYcawYP1h/kk9etk39rBXxhIXAXQAEzk/img.png?width=560&amp;amp;height=280&amp;amp;face=0_0_560_280,https://scrap.kakaocdn.net/dn/qgRlv/hyYckeO0uM/2GFwgPxQCrJabq9aHqTSQk/img.png?width=1596&amp;amp;height=1186&amp;amp;face=0_0_1596_1186&quot;&gt;&lt;a href=&quot;https://dongsunseng.com/entry/CIBMTR-Equity-in-post-HCT-Survival-Predictions-11-ESP-EDA-which-makes-sense-%E2%AD%90%EF%B8%8F%E2%AD%90%EF%B8%8F%E2%AD%90%EF%B8%8F%E2%AD%90%EF%B8%8F%E2%AD%90%EF%B8%8F-AFT-Loss-func-sol-1&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot; data-source-url=&quot;https://dongsunseng.com/entry/CIBMTR-Equity-in-post-HCT-Survival-Predictions-11-ESP-EDA-which-makes-sense-%E2%AD%90%EF%B8%8F%E2%AD%90%EF%B8%8F%E2%AD%90%EF%B8%8F%E2%AD%90%EF%B8%8F%E2%AD%90%EF%B8%8F-AFT-Loss-func-sol-1&quot;&gt;
&lt;div class=&quot;og-image&quot; style=&quot;background-image: url('https://scrap.kakaocdn.net/dn/cANkhr/hyYfSH8RfR/ybgOJkKySrwkYP539BEYP1/img.png?width=560&amp;amp;height=280&amp;amp;face=0_0_560_280,https://scrap.kakaocdn.net/dn/9umr5/hyYcawYP1h/kk9etk39rBXxhIXAXQAEzk/img.png?width=560&amp;amp;height=280&amp;amp;face=0_0_560_280,https://scrap.kakaocdn.net/dn/qgRlv/hyYckeO0uM/2GFwgPxQCrJabq9aHqTSQk/img.png?width=1596&amp;amp;height=1186&amp;amp;face=0_0_1596_1186');&quot;&gt;&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;og-text&quot;&gt;
&lt;p class=&quot;og-title&quot; data-ke-size=&quot;size16&quot;&gt;CIBMTR - Equity in post-HCT Survival Predictions #11 ESP EDA which makes sense ⭐️⭐️⭐️⭐️⭐️ (AFT Loss func sol&lt;/p&gt;
&lt;p class=&quot;og-desc&quot; data-ke-size=&quot;size16&quot;&gt;Annotation post about AFT loss function solution:https://www.kaggle.com/code/ambrosm/esp-eda-which-makes-sense&amp;nbsp;ESP EDA which makes sense ⭐️⭐️⭐️⭐️⭐️Explore and run machine learning code with Kaggle Notebooks | Using data from CIBMTR - E&lt;/p&gt;
&lt;p class=&quot;og-host&quot; data-ke-size=&quot;size16&quot;&gt;dongsunseng.com&lt;/p&gt;
&lt;/div&gt;
&lt;/a&gt;&lt;/figure&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;From my other blog post, we discussed about&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-10 오후 8.14.17.png&quot; data-origin-width=&quot;1852&quot; data-origin-height=&quot;1598&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/dRGruW/btsMcIF7yAD/nZoXXWhCICwkJk2IOCTGgK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/dRGruW/btsMcIF7yAD/nZoXXWhCICwkJk2IOCTGgK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/dRGruW/btsMcIF7yAD/nZoXXWhCICwkJk2IOCTGgK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FdRGruW%2FbtsMcIF7yAD%2FnZoXXWhCICwkJk2IOCTGgK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1852&quot; height=&quot;1598&quot; data-filename=&quot;스크린샷 2025-02-10 오후 8.14.17.png&quot; data-origin-width=&quot;1852&quot; data-origin-height=&quot;1598&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;This blog is about the &quot;further discussion&quot;: &lt;a href=&quot;https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550302&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550302&lt;/a&gt;&lt;/p&gt;
&lt;figure id=&quot;og_1739186093036&quot; contenteditable=&quot;false&quot; data-ke-type=&quot;opengraph&quot; data-ke-align=&quot;alignCenter&quot; data-og-type=&quot;website&quot; data-og-title=&quot;CIBMTR - Equity in post-HCT Survival Predictions&quot; data-og-description=&quot;Improve prediction of transplant survival rates equitably for allogeneic HCT patients&quot; data-og-host=&quot;www.kaggle.com&quot; data-og-source-url=&quot;https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550302&quot; data-og-url=&quot;https://kaggle.com/equity-post-HCT-survival-predictions&quot; data-og-image=&quot;https://scrap.kakaocdn.net/dn/gUi4W/hyYciuX2Vo/V0eETTk7ScucF2yGhiCwf1/img.png?width=560&amp;amp;height=280&amp;amp;face=0_0_560_280,https://scrap.kakaocdn.net/dn/N7WA2/hyYf1LRAro/A5wDWBPsbAEkHvwih2ayM1/img.png?width=560&amp;amp;height=280&amp;amp;face=0_0_560_280&quot;&gt;&lt;a href=&quot;https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550302&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot; data-source-url=&quot;https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550302&quot;&gt;
&lt;div class=&quot;og-image&quot; style=&quot;background-image: url('https://scrap.kakaocdn.net/dn/gUi4W/hyYciuX2Vo/V0eETTk7ScucF2yGhiCwf1/img.png?width=560&amp;amp;height=280&amp;amp;face=0_0_560_280,https://scrap.kakaocdn.net/dn/N7WA2/hyYf1LRAro/A5wDWBPsbAEkHvwih2ayM1/img.png?width=560&amp;amp;height=280&amp;amp;face=0_0_560_280');&quot;&gt;&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;og-text&quot;&gt;
&lt;p class=&quot;og-title&quot; data-ke-size=&quot;size16&quot;&gt;CIBMTR - Equity in post-HCT Survival Predictions&lt;/p&gt;
&lt;p class=&quot;og-desc&quot; data-ke-size=&quot;size16&quot;&gt;Improve prediction of transplant survival rates equitably for allogeneic HCT patients&lt;/p&gt;
&lt;p class=&quot;og-host&quot; data-ke-size=&quot;size16&quot;&gt;www.kaggle.com&lt;/p&gt;
&lt;/div&gt;
&lt;/a&gt;&lt;/figure&gt;
&lt;h3 style=&quot;background-color: #ffffff; color: #202124; text-align: start;&quot; data-ke-size=&quot;size23&quot;&gt;&lt;b&gt;How to make sense of the race group distribution in the data ?&lt;/b&gt;&lt;/h3&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;Counting values of race groups I get the following:&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-10 오후 8.17.25.png&quot; data-origin-width=&quot;444&quot; data-origin-height=&quot;176&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/ejmQlh/btsMcqFGahp/hG584pYsOhxN8LXoU9V2BK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/ejmQlh/btsMcqFGahp/hG584pYsOhxN8LXoU9V2BK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/ejmQlh/btsMcqFGahp/hG584pYsOhxN8LXoU9V2BK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FejmQlh%2FbtsMcqFGahp%2FhG584pYsOhxN8LXoU9V2BK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;502&quot; height=&quot;199&quot; data-filename=&quot;스크린샷 2025-02-10 오후 8.17.25.png&quot; data-origin-width=&quot;444&quot; data-origin-height=&quot;176&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;Having worked on the topic of equity for sensitive applications, I have found one of the main problem to be imbalance in data of interest. &lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;Typically some less represented races will end up with wider estimates.&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;However the data at hand seems to have been resampled (or generated as balanced).&lt;/li&gt;
&lt;li&gt;While this can be achieved on real data by downsampling the majority class, it usually kills representativeness of the population.&lt;/li&gt;
&lt;li&gt;I am concerned a model optimised with this metric on this balanced dataset would perform worse on real life 'race imbalanced' data.&lt;/li&gt;
&lt;li&gt;How does 'race-balancing' the dataset make sense in an equity competition ?&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;Comments:&lt;/b&gt;&lt;/h4&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-10 오후 8.18.57.png&quot; data-origin-width=&quot;1204&quot; data-origin-height=&quot;614&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/oc9u0/btsMchvkGC6/tXkwhA559U0s7khFSZMhSk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/oc9u0/btsMchvkGC6/tXkwhA559U0s7khFSZMhSk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/oc9u0/btsMchvkGC6/tXkwhA559U0s7khFSZMhSk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Foc9u0%2FbtsMchvkGC6%2FtXkwhA559U0s7khFSZMhSk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1204&quot; height=&quot;614&quot; data-filename=&quot;스크린샷 2025-02-10 오후 8.18.57.png&quot; data-origin-width=&quot;1204&quot; data-origin-height=&quot;614&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffc1c8; color: #3c4043; text-align: start;&quot;&gt;Maybe the idea behind balanced, synthetic data is to accentuate differences in risk prediction due only to the available features, by taking imbalance out of the problem.&lt;/span&gt;&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;By eliminating racial imbalances in the actual data, one can more clearly see differences in risk predictions that are &quot;purely attributable to available features&quot;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;This allows for more accurate evaluation of actual prediction performance differences rather than differences in population ratios&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #f6e199; color: #3c4043; text-align: start;&quot;&gt;This could suggest a need for additional predictors if certain groups are more poorly predicted.&lt;/span&gt;&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;If predictions are less accurate for certain groups, this could indicate that current features don't adequately explain those groups&lt;/li&gt;
&lt;li&gt;This could signal the need for additional predictors that better characterize these groups&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style5&quot; /&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;완벽하려고 미루는 것보다 지속적으로 고쳐나가는 것이 낫습니다.&lt;br /&gt;- 마크 트웨인 -&lt;/span&gt;&lt;/blockquote&gt;</description>
      <category>대회</category>
      <category>cibmtr - equity in post-hct survival predictions</category>
      <category>how to make sense of the race group distribution in the data ?</category>
      <category>캐글</category>
      <author>dongsunseng</author>
      <guid isPermaLink="true">https://dongsunseng.tistory.com/124</guid>
      <comments>https://dongsunseng.tistory.com/entry/CIBMTR-Equity-in-post-HCT-Survival-Predictions-13-How-to-make-sense-of-the-race-group-distribution-in-the-data#entry124comment</comments>
      <pubDate>Mon, 10 Feb 2025 20:27:18 +0900</pubDate>
    </item>
    <item>
      <title>CIBMTR - Equity in post-HCT Survival Predictions #12 Deep understanding of (C-index) evaluation measure for better model</title>
      <link>https://dongsunseng.tistory.com/entry/CIBMTR-Equity-in-post-HCT-Survival-Predictions-12-Deep-understanding-of-C-index-evaluation-measure-for-better-model</link>
      <description>&lt;p style=&quot;background-color: #ffffff; color: #202124; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;Annotation of this discussion: &lt;a href=&quot;https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550152&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550152&lt;/a&gt;&lt;/p&gt;
&lt;figure id=&quot;og_1739184885441&quot; contenteditable=&quot;false&quot; data-ke-type=&quot;opengraph&quot; data-ke-align=&quot;alignCenter&quot; data-og-type=&quot;website&quot; data-og-title=&quot;CIBMTR - Equity in post-HCT Survival Predictions&quot; data-og-description=&quot;Improve prediction of transplant survival rates equitably for allogeneic HCT patients&quot; data-og-host=&quot;www.kaggle.com&quot; data-og-source-url=&quot;https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550152&quot; data-og-url=&quot;https://kaggle.com/equity-post-HCT-survival-predictions&quot; data-og-image=&quot;https://scrap.kakaocdn.net/dn/MfU5G/hyYb80eJaJ/1VQiHTgcNrejv9uQRYekqk/img.png?width=560&amp;amp;height=280&amp;amp;face=0_0_560_280,https://scrap.kakaocdn.net/dn/zSBOG/hyYf0sD8FD/QcsRqAxsTRkvEG6j81wKPK/img.png?width=560&amp;amp;height=280&amp;amp;face=0_0_560_280&quot;&gt;&lt;a href=&quot;https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550152&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot; data-source-url=&quot;https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550152&quot;&gt;
&lt;div class=&quot;og-image&quot; style=&quot;background-image: url('https://scrap.kakaocdn.net/dn/MfU5G/hyYb80eJaJ/1VQiHTgcNrejv9uQRYekqk/img.png?width=560&amp;amp;height=280&amp;amp;face=0_0_560_280,https://scrap.kakaocdn.net/dn/zSBOG/hyYf0sD8FD/QcsRqAxsTRkvEG6j81wKPK/img.png?width=560&amp;amp;height=280&amp;amp;face=0_0_560_280');&quot;&gt;&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;og-text&quot;&gt;
&lt;p class=&quot;og-title&quot; data-ke-size=&quot;size16&quot;&gt;CIBMTR - Equity in post-HCT Survival Predictions&lt;/p&gt;
&lt;p class=&quot;og-desc&quot; data-ke-size=&quot;size16&quot;&gt;Improve prediction of transplant survival rates equitably for allogeneic HCT patients&lt;/p&gt;
&lt;p class=&quot;og-host&quot; data-ke-size=&quot;size16&quot;&gt;www.kaggle.com&lt;/p&gt;
&lt;/div&gt;
&lt;/a&gt;&lt;/figure&gt;
&lt;h3 style=&quot;background-color: #ffffff; color: #202124; text-align: start;&quot; data-ke-size=&quot;size23&quot;&gt;&lt;b&gt;Deep understanding of (C-index) evaluation measure for better model&lt;/b&gt;&lt;/h3&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;I will try to explain the C-index evaluation measure of the this competition in order to train the model well because &lt;b&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;75% of the data is not included in the test data&lt;/span&gt;&lt;/b&gt; so understanding of the measure is very important.&lt;/p&gt;
&lt;p style=&quot;color: #3c4043; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;color: #3c4043; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;Lets start with three patients groups:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc; color: #3c4043; text-align: start;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;Group A&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Group B&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Group C&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p style=&quot;color: #3c4043; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;For each patient, we will predict&lt;span&gt;&amp;nbsp;&lt;/span&gt;&lt;b&gt;risk score&lt;/b&gt;&lt;span&gt;&amp;nbsp;&lt;/span&gt;(higher score means higher risk of early event).&lt;/p&gt;
&lt;h4 style=&quot;color: #202124; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;Step 1: Understanding Concordance Index&lt;/b&gt;&lt;/h4&gt;
&lt;p style=&quot;color: #3c4043; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;The&lt;span&gt;&amp;nbsp;&lt;/span&gt;&lt;b&gt;Concordance Index (C-index)&lt;/b&gt;&lt;span&gt;&amp;nbsp;&lt;/span&gt;evaluate how well the model ranks survival times.&lt;/p&gt;
&lt;p style=&quot;color: #3c4043; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;color: #202124; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;Understand with sample data:&lt;/b&gt;&lt;/p&gt;
&lt;p style=&quot;color: #3c4043; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;Group A&lt;/b&gt;&lt;span&gt;&amp;nbsp;&lt;/span&gt;has 3 patients with actual survival times and predicted risk scores:&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-10 오후 7.59.47.png&quot; data-origin-width=&quot;896&quot; data-origin-height=&quot;468&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/YFmxL/btsMb1zuSR1/nq9f4yrkRAFIjrxcfbXGHK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/YFmxL/btsMb1zuSR1/nq9f4yrkRAFIjrxcfbXGHK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/YFmxL/btsMb1zuSR1/nq9f4yrkRAFIjrxcfbXGHK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FYFmxL%2FbtsMb1zuSR1%2Fnq9f4yrkRAFIjrxcfbXGHK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;475&quot; height=&quot;248&quot; data-filename=&quot;스크린샷 2025-02-10 오후 7.59.47.png&quot; data-origin-width=&quot;896&quot; data-origin-height=&quot;468&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;Comparable pairs&lt;/b&gt;:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc; color: #3c4043; text-align: start;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;(P1, P2): P2 has a shorter survival time and a higher risk score &amp;rarr;&lt;span&gt;&amp;nbsp;&lt;/span&gt;&lt;b&gt;Concordant&lt;/b&gt;&lt;span&gt;&amp;nbsp;&lt;/span&gt;✅&lt;/li&gt;
&lt;li&gt;(P1, P3): P3 has a longer survival time and a lower risk score &amp;rarr;&lt;span&gt;&amp;nbsp;&lt;/span&gt;&lt;b&gt;Concordant&lt;/b&gt;&lt;span&gt;&amp;nbsp;&lt;/span&gt;✅&lt;/li&gt;
&lt;li&gt;(P2, P3): P3 has a longer survival time and a lower risk score &amp;rarr;&lt;span&gt;&amp;nbsp;&lt;/span&gt;&lt;b&gt;Concordant&lt;/b&gt;&lt;span&gt;&amp;nbsp;&lt;/span&gt;✅&lt;/li&gt;
&lt;/ul&gt;
&lt;p style=&quot;color: #3c4043; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;Total pairs = 3&lt;br /&gt;Total concordant pairs = 3&lt;/p&gt;
&lt;p style=&quot;color: #3c4043; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;C-index for Group A = Concordant pairs/Total pairs= 3/3 = 1.0&lt;/p&gt;
&lt;h4 style=&quot;color: #202124; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;Step 2: Calculate C-index for All Groups&lt;/b&gt;&lt;/h4&gt;
&lt;p style=&quot;color: #3c4043; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;Repeat the process for all groups.&lt;/p&gt;
&lt;p style=&quot;color: #3c4043; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;color: #3c4043; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;For now we can assume:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc; color: #3c4043; text-align: start;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;Group A&lt;/b&gt;: C-index = 1.0&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Group B&lt;/b&gt;: C-index = 0.8&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Group C&lt;/b&gt;: C-index = 0.6&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 style=&quot;color: #202124; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;Step 3: Stratified Concordance Index&lt;/b&gt;&lt;/h4&gt;
&lt;p style=&quot;color: #3c4043; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;The&lt;span&gt;&amp;nbsp;&lt;/span&gt;&lt;b&gt;Stratified Concordance Index&lt;/b&gt;&lt;span&gt;&amp;nbsp;&lt;/span&gt;combines the C-index scores of all groups and focusing on the following:&lt;/p&gt;
&lt;ol style=&quot;list-style-type: decimal; color: #3c4043; text-align: start;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #99cefa;&quot;&gt;&lt;b&gt;Average performance across groups&lt;/b&gt;&amp;nbsp;(mean of C-indices).&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;&lt;b&gt;Consistency across groups&lt;/b&gt;&amp;nbsp;(low standard deviation of C-indices).&lt;/span&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;h4 style=&quot;color: #202124; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;Formula:&lt;/b&gt;&lt;/h4&gt;
&lt;p style=&quot;color: #3c4043; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;Stratified C-index = Mean(C-index scores) - Standard Deviation(C-index scores)&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;ol style=&quot;list-style-type: decimal; color: #3c4043; text-align: start;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;&lt;b&gt;Calculate the mean&lt;/b&gt;:&lt;br /&gt;Mean=1.0 + 0.8 + 0.6/3 = 0.8&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Calculate the standard deviation&lt;/b&gt;:&lt;br /&gt;Standard Deviation= sqrt((1.0-0.8)^2 + (0.8-0.8)^2 + (0.6-0.8)^/3) = 0.16&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Stratified C-index&lt;/b&gt;:&lt;br /&gt;Stratified C-index = 0.8 - 0.16 = 0.64&lt;/li&gt;
&lt;/ol&gt;
&lt;h4 style=&quot;color: #202124; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;Step 4: Interpret the Results&lt;/b&gt;&lt;/h4&gt;
&lt;p style=&quot;color: #3c4043; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;A&lt;span&gt;&amp;nbsp;&lt;/span&gt;&lt;b&gt;high Stratified C-index&lt;/b&gt;&lt;span&gt;&amp;nbsp;&lt;/span&gt;means:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc; color: #3c4043; text-align: start;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #99cefa;&quot;&gt;The model predicts well overall (high mean C-index).&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;The model predicts equitably across racial groups (low standard deviation).&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p style=&quot;color: #3c4043; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;Finally we can say:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc; color: #3c4043; text-align: start;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;Group A&lt;/b&gt;&lt;span&gt;&amp;nbsp;&lt;/span&gt;predictions are perfect (C-index = 1.0).&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Group B&lt;/b&gt;&lt;span&gt;&amp;nbsp;&lt;/span&gt;is decent (C-index = 0.8).&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Group C&lt;/b&gt;&lt;span&gt;&amp;nbsp;&lt;/span&gt;struggles (C-index = 0.6).&lt;/li&gt;
&lt;/ul&gt;
&lt;p style=&quot;color: #3c4043; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;The&lt;span&gt;&amp;nbsp;&lt;/span&gt;&lt;b&gt;Stratified C-index = 0.64&lt;/b&gt;&lt;span&gt;&amp;nbsp;&lt;/span&gt;showing that while predictions are good overall, the model is less consistent across groups.&lt;/p&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style5&quot; /&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;실패를 미리 두려워할 필요는 없다.&lt;br /&gt;- 버트런드 러셀 -&lt;/span&gt;&lt;/blockquote&gt;</description>
      <category>대회</category>
      <category>cibmtr - equity in post-hct survival predictions</category>
      <category>캐글</category>
      <author>dongsunseng</author>
      <guid isPermaLink="true">https://dongsunseng.tistory.com/123</guid>
      <comments>https://dongsunseng.tistory.com/entry/CIBMTR-Equity-in-post-HCT-Survival-Predictions-12-Deep-understanding-of-C-index-evaluation-measure-for-better-model#entry123comment</comments>
      <pubDate>Mon, 10 Feb 2025 20:11:35 +0900</pubDate>
    </item>
    <item>
      <title>CIBMTR - Equity in post-HCT Survival Predictions #11 ESP EDA which makes sense ⭐️⭐️⭐️⭐️⭐️ (AFT Loss func sol #1)</title>
      <link>https://dongsunseng.tistory.com/entry/CIBMTR-Equity-in-post-HCT-Survival-Predictions-11-ESP-EDA-which-makes-sense-%E2%AD%90%EF%B8%8F%E2%AD%90%EF%B8%8F%E2%AD%90%EF%B8%8F%E2%AD%90%EF%B8%8F%E2%AD%90%EF%B8%8F-AFT-Loss-func-sol-1</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;Annotation post about AFT loss function solution:&lt;/b&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;a href=&quot;https://www.kaggle.com/code/ambrosm/esp-eda-which-makes-sense&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://www.kaggle.com/code/ambrosm/esp-eda-which-makes-sense&lt;/a&gt;&lt;/p&gt;
&lt;figure id=&quot;og_1739151500155&quot; contenteditable=&quot;false&quot; data-ke-type=&quot;opengraph&quot; data-ke-align=&quot;alignCenter&quot; data-og-type=&quot;website&quot; data-og-title=&quot;ESP EDA which makes sense ⭐️⭐️⭐️⭐️⭐️&quot; data-og-description=&quot;Explore and run machine learning code with Kaggle Notebooks | Using data from CIBMTR - Equity in post-HCT Survival Predictions&quot; data-og-host=&quot;www.kaggle.com&quot; data-og-source-url=&quot;https://www.kaggle.com/code/ambrosm/esp-eda-which-makes-sense&quot; data-og-url=&quot;https://www.kaggle.com/code/ambrosm/esp-eda-which-makes-sense&quot; data-og-image=&quot;&quot;&gt;&lt;a href=&quot;https://www.kaggle.com/code/ambrosm/esp-eda-which-makes-sense&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot; data-source-url=&quot;https://www.kaggle.com/code/ambrosm/esp-eda-which-makes-sense&quot;&gt;
&lt;div class=&quot;og-image&quot; style=&quot;background-image: url();&quot;&gt;&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;og-text&quot;&gt;
&lt;p class=&quot;og-title&quot; data-ke-size=&quot;size16&quot;&gt;ESP EDA which makes sense ⭐️⭐️⭐️⭐️⭐️&lt;/p&gt;
&lt;p class=&quot;og-desc&quot; data-ke-size=&quot;size16&quot;&gt;Explore and run machine learning code with Kaggle Notebooks | Using data from CIBMTR - Equity in post-HCT Survival Predictions&lt;/p&gt;
&lt;p class=&quot;og-host&quot; data-ke-size=&quot;size16&quot;&gt;www.kaggle.com&lt;/p&gt;
&lt;/div&gt;
&lt;/a&gt;&lt;/figure&gt;
&lt;h4 style=&quot;color: #202214; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;Equity in survival predictions: EDA which makes sense&lt;/b&gt;&lt;/h4&gt;
&lt;p style=&quot;color: #3c4043; text-align: left;&quot; data-ke-size=&quot;size16&quot;&gt;This notebook shows&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc; color: #3c4043; text-align: start;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;An exploratory data analysis&lt;/li&gt;
&lt;li&gt;Survival functions and how they differ among race groups&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;Three types of models: Cox proportional hazards, accelerated failure times, and transformed target models&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;Cross-validation with metrics per race group&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p style=&quot;color: #3c4043; text-align: left;&quot; data-ke-size=&quot;size16&quot;&gt;References&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc; color: #3c4043; text-align: start;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Competition:&lt;span&gt;&amp;nbsp;&lt;/span&gt;&lt;a style=&quot;color: #008abc;&quot; href=&quot;https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions&quot;&gt;CIBMTR - Equity in post-HCT Survival Predictions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a style=&quot;color: #008abc;&quot; href=&quot;https://en.wikipedia.org/wiki/Survival_analysis&quot;&gt;Wikipedia article&lt;/a&gt;&lt;span&gt;&amp;nbsp;&lt;/span&gt;which describes censoring, survival functions, cumulative hazard etc.&lt;/li&gt;
&lt;li&gt;Libraries:&lt;span&gt;&amp;nbsp;&lt;/span&gt;&lt;a style=&quot;color: #008abc;&quot; href=&quot;https://scikit-survival.readthedocs.io/en/stable/index.html&quot;&gt;scikit-survival&lt;/a&gt;,&lt;span&gt;&amp;nbsp;&lt;/span&gt;&lt;a style=&quot;color: #008abc;&quot; href=&quot;https://lifelines.readthedocs.io/en/latest/&quot;&gt;lifelines&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style3&quot; /&gt;
&lt;pre class=&quot;python&quot; style=&quot;background-color: #f1f3f4; color: #3c4043; text-align: start;&quot; data-ke-language=&quot;python&quot;&gt;&lt;code&gt;%%time
try:
    from lifelines.utils import concordance_index
except ModuleNotFoundError:
    print('Installing lifelines...')
    !pip install -q /kaggle/input/pip-install-lifelines/autograd-1.7.0-py3-none-any.whl
    !pip install -q /kaggle/input/pip-install-lifelines/autograd-gamma-0.5.0.tar.gz
    !pip install -q /kaggle/input/pip-install-lifelines/interface_meta-1.3.0-py3-none-any.whl
    !pip install -q /kaggle/input/pip-install-lifelines/formulaic-1.0.2-py3-none-any.whl
    !pip install -q /kaggle/input/pip-install-lifelines/lifelines-0.30.0-py3-none-any.whl&lt;/code&gt;&lt;/pre&gt;
&lt;pre id=&quot;code_1739151546798&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator, FormatStrFormatter, PercentFormatter
import numpy as np
import xgboost
import catboost
import warnings
from lifelines import CoxPHFitter, KaplanMeierFitter
from lifelines.utils import concordance_index
from scipy.stats import rankdata

from sklearn.pipeline import make_pipeline
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import OneHotEncoder, quantile_transform, FunctionTransformer, PolynomialFeatures, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

all_model_scores = {}&lt;/code&gt;&lt;/pre&gt;
&lt;h4 id=&quot;Reading-the-data&quot; style=&quot;color: #202214; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;Reading the data&lt;/b&gt;&lt;/h4&gt;
&lt;p style=&quot;color: #3c4043; text-align: left;&quot; data-ke-size=&quot;size16&quot;&gt;We read the data and observe:&lt;/p&gt;
&lt;ol style=&quot;list-style-type: decimal; color: #3c4043; text-align: start;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;The training dataset has 59 columns, many of which are categorical and have missing values.&lt;/li&gt;
&lt;li&gt;Two columns are missing from the test dataset:&lt;span&gt;&amp;nbsp;&lt;/span&gt;efs&lt;span&gt;&amp;nbsp;&lt;/span&gt;and&lt;span&gt;&amp;nbsp;&lt;/span&gt;efs_time. These two columns together make up the target.&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&quot;python&quot; style=&quot;background-color: #f1f3f4; color: #3c4043; text-align: start;&quot; data-ke-language=&quot;python&quot;&gt;&lt;code&gt;train = pd.read_csv('/kaggle/input/equity-post-HCT-survival-predictions/train.csv', index_col='ID')
test = pd.read_csv('/kaggle/input/equity-post-HCT-survival-predictions/test.csv', index_col='ID')
data_dictionary = pd.read_csv('/kaggle/input/equity-post-HCT-survival-predictions/data_dictionary.csv')
train.tail()&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;python&quot; style=&quot;background-color: #f1f3f4; color: #3c4043; text-align: start;&quot; data-ke-language=&quot;python&quot;&gt;&lt;code&gt;features = [f for f in test.columns if f != 'ID']

cat_features = list(train.select_dtypes(object).columns)
train[cat_features] = train[cat_features].astype(str).astype('category')

race_groups = np.unique(train.race_group)&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;features&amp;nbsp;=&amp;nbsp;[f&amp;nbsp;for&amp;nbsp;f&amp;nbsp;in&amp;nbsp;test.columns&amp;nbsp;if&amp;nbsp;f&amp;nbsp;!=&amp;nbsp;'ID']&lt;/b&gt;&lt;/i&gt;&lt;br /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Select all columns from test dataset except 'ID'&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;Create a list of features that will be used for actual modeling&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;cat_features&amp;nbsp;=&amp;nbsp;list(train.select_dtypes(object).columns)&lt;/b&gt;&lt;/i&gt;&lt;br /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;Select columns with dtype 'object' from train data&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;This process finds categorical variables in string format&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;train[cat_features]&amp;nbsp;=&amp;nbsp;train[cat_features].astype(str).astype('category')&lt;/b&gt;&lt;/i&gt;&lt;br /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;First convert the selected categorical variables to string (str)&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;Then convert them to category type&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;This is a preprocessing step for memory efficiency and modeling&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;race_groups&amp;nbsp;=&amp;nbsp;np.unique(train.race_group)&lt;/b&gt;&lt;/i&gt;&lt;br /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;Extract unique values from the 'race_group' column in train data&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;This can be used for race-based analysis or stratification&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;div style=&quot;color: #3c4043; text-align: start;&quot;&gt;
&lt;div&gt;
&lt;div&gt;
&lt;div style=&quot;background-color: #ffffff; color: #3c4043;&quot;&gt;
&lt;h4 id=&quot;Race-group-distribution&quot; style=&quot;color: #202214;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;Race group distribution&lt;/b&gt;&lt;/h4&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;In the training data, there are six race groups with about 4800 samples each.&lt;/li&gt;
&lt;li&gt;Because in no country of the world these six race groups occur with equal frequencies, we know that some of the groups have been upsampled or downsampled in the dataset.&lt;/li&gt;
&lt;li&gt;See&amp;nbsp;&lt;a href=&quot;https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550302&quot;&gt;this post&lt;/a&gt;&amp;nbsp;for further discussion.
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Annotation post can be found from my other blog post&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;python&quot; style=&quot;background-color: #f1f3f4; color: #3c4043; text-align: start;&quot; data-ke-language=&quot;python&quot;&gt;&lt;code&gt;vc = train.race_group.value_counts()
plt.pie(vc, labels=vc.index)
plt.show()&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-10 오전 10.48.53.png&quot; data-origin-width=&quot;1578&quot; data-origin-height=&quot;780&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/CZpTH/btsMdqKRr4V/pmDOREgVrYlbDCWnojP5J1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/CZpTH/btsMdqKRr4V/pmDOREgVrYlbDCWnojP5J1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/CZpTH/btsMdqKRr4V/pmDOREgVrYlbDCWnojP5J1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FCZpTH%2FbtsMdqKRr4V%2FpmDOREgVrYlbDCWnojP5J1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1578&quot; height=&quot;780&quot; data-filename=&quot;스크린샷 2025-02-10 오전 10.48.53.png&quot; data-origin-width=&quot;1578&quot; data-origin-height=&quot;780&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;h4 id=&quot;The-weirdness-of-the-age-distribution&quot; style=&quot;color: #202214; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;The weirdness of the age distribution&lt;/b&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #c0d1e7;&quot;&gt;There are only two features with continuous data: donor age and patient age.&lt;/span&gt; &lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;The patient age histogram shows that the patient age distribution has five modes(최빈값).&lt;/span&gt; &lt;/b&gt;&lt;/li&gt;
&lt;li&gt;Such a distribution is &lt;b&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;highly unnatural&lt;/span&gt;&lt;/b&gt; &amp;mdash; &lt;b&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;it must be an artefact of the synthetic data generation.&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1739152707232&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;plt.figure(figsize=(12, 3))
plt.subplot(1, 2, 1)
plt.hist(train.donor_age, bins=50, color='skyblue')
plt.title('Donor age histogram')
plt.xlabel('donor_age')
plt.ylabel('count')
plt.subplot(1, 2, 2)
plt.title('Patient age histogram')
plt.hist(train.age_at_hct, bins=50, color='skyblue')
plt.xlabel('age_at_hct')
plt.tight_layout()
plt.savefig('a.png')
plt.show()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-10 오전 10.58.54.png&quot; data-origin-width=&quot;1614&quot; data-origin-height=&quot;410&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/cs2ETK/btsMdshAsVc/gN8Q1a2UojkQ9MwbNFCd1k/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/cs2ETK/btsMdshAsVc/gN8Q1a2UojkQ9MwbNFCd1k/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/cs2ETK/btsMdshAsVc/gN8Q1a2UojkQ9MwbNFCd1k/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fcs2ETK%2FbtsMdshAsVc%2FgN8Q1a2UojkQ9MwbNFCd1k%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1614&quot; height=&quot;410&quot; data-filename=&quot;스크린샷 2025-02-10 오전 10.58.54.png&quot; data-origin-width=&quot;1614&quot; data-origin-height=&quot;410&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;My first thought was that different race groups had different modes, but the patient age distribution has the same five modes in every race group:&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1739152752427&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;_, axs = plt.subplots(3, 2, sharex=True, sharey=True, figsize=(12, 9))
for race_group, ax in zip(race_groups, axs.ravel()):
    ax.hist(train.age_at_hct[train.race_group == race_group],
            bins=np.linspace(0, 74, 38),
            color='skyblue', alpha=0.5)
    ax.set_title(f'Patient age histogram for {race_group}')
    ax.set_xlabel('age_at_hct')
    ax.set_ylabel('count')
plt.tight_layout()
plt.savefig('b.png')
plt.show()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-10 오전 10.59.30.png&quot; data-origin-width=&quot;1596&quot; data-origin-height=&quot;1186&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/b7GdXU/btsMb1yBPf3/541klb9WTHKCkMwpaWWo80/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/b7GdXU/btsMb1yBPf3/541klb9WTHKCkMwpaWWo80/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/b7GdXU/btsMb1yBPf3/541klb9WTHKCkMwpaWWo80/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fb7GdXU%2FbtsMb1yBPf3%2F541klb9WTHKCkMwpaWWo80%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1596&quot; height=&quot;1186&quot; data-filename=&quot;스크린샷 2025-02-10 오전 10.59.30.png&quot; data-origin-width=&quot;1596&quot; data-origin-height=&quot;1186&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;Even stranger: The age of 0.044 years (i.e., 16 days) occurs 1005 times in the training dataset, whereas every other age occurs at most six times. &lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;Is hematopoietic cell transplantation a treatment which is often done to newborns? Possible.&lt;/li&gt;
&lt;li&gt;But I can't believe that these babies are all treated exactly when they are 16 days old.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1739152791247&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;train.age_at_hct.value_counts().sort_values(ascending=False).head()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-10 오전 11.00.04.png&quot; data-origin-width=&quot;744&quot; data-origin-height=&quot;406&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/beXqDb/btsMbHtJrYz/WqP7B593BKEoy0JKAKH8u0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/beXqDb/btsMbHtJrYz/WqP7B593BKEoy0JKAKH8u0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/beXqDb/btsMbHtJrYz/WqP7B593BKEoy0JKAKH8u0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbeXqDb%2FbtsMbHtJrYz%2FWqP7B593BKEoy0JKAKH8u0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;420&quot; height=&quot;229&quot; data-filename=&quot;스크린샷 2025-02-10 오전 11.00.04.png&quot; data-origin-width=&quot;744&quot; data-origin-height=&quot;406&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;h4 id=&quot;The-target&quot; style=&quot;color: #202214; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;The target&lt;/b&gt;&lt;/h4&gt;
&lt;p style=&quot;color: #3c4043; text-align: left;&quot; data-ke-size=&quot;size16&quot;&gt;The prediction target consists of two parts:&lt;/p&gt;
&lt;ol style=&quot;list-style-type: decimal; color: #3c4043; text-align: start;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;efs_time, always positive, is a time, measured in months.&lt;/li&gt;
&lt;li&gt;efs, always zero or one, indicates the presence or absence of an event:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;efs=1&lt;span&gt;&amp;nbsp;&lt;/span&gt;means &quot;patient died exactly at time&lt;span&gt;&amp;nbsp;&lt;/span&gt;efs_time.
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;actually not &quot;died&quot; but event occurred is the right expression&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;efs=0&lt;span&gt;&amp;nbsp;&lt;/span&gt;means &quot;patient still lives at time&lt;span&gt;&amp;nbsp;&lt;/span&gt;efs_time; in other words, &quot;patient dies at an unknown time strictly greater than&lt;span&gt;&amp;nbsp;&lt;/span&gt;efs_time&quot;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;This situation is called &lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;&quot;censored data&quot;&lt;/b&gt;&lt;/span&gt;: Samples of which we know the time of death are uncensored, and if we only know a lower bound for the time of death, the sample is (right-)censored.&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;Censoring is the main reason that this competition has a special metric and that we need special models.&lt;/b&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&lt;span style=&quot;background-color: #c0d1e7;&quot;&gt;The competition is a regression task, but we know y_true for only half the samples.&lt;/span&gt; &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;For the other (censored) half, all we know is lower bounds for y_true.&lt;/span&gt; &lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;One cannot compute a squared error based on&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;y_true &amp;gt; 100 and y_pred == 120. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;RMSE and similar metrics cannot deal with that.&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;By the way, the column name is misleading: If a column is called &quot;event-free survival&quot;, I'd expect that 0 means &quot;patient died&quot; and 1 means &quot;patient lives&quot;, but that's wrong.&lt;/li&gt;
&lt;li&gt;The data have been obfuscated(애매함).&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;efs_time&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;is a float with three digits after the decimal point, and I don't think that events such as the death of a patient are recorded with such an exact timestamp.&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;A histogram of the target values shows that half the patients die within 20 months after the transplantation; but the other half, who survives the first 20 months, has a high probability of living much longer.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1739156435864&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;plt.figure(figsize=(6, 3))
plt.hist(train.efs_time[train.efs == 0], bins=np.linspace(0, 160, 41), label='efs=0: patient still lives at this time', alpha=0.5)
plt.hist(train.efs_time[train.efs == 1], bins=np.linspace(0, 160, 41), label='efs=1: patient dies at this time', alpha=0.5)
plt.legend()
plt.xlabel('efs_time')
plt.ylabel('count')
plt.title('Target histogram')
plt.show()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-10 오후 12.00.45.png&quot; data-origin-width=&quot;1242&quot; data-origin-height=&quot;670&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/kI58p/btsMdnAMcMD/vuqDBh5OQvzMMiuzlvO8n1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/kI58p/btsMdnAMcMD/vuqDBh5OQvzMMiuzlvO8n1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/kI58p/btsMdnAMcMD/vuqDBh5OQvzMMiuzlvO8n1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FkI58p%2FbtsMdnAMcMD%2FvuqDBh5OQvzMMiuzlvO8n1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;582&quot; height=&quot;314&quot; data-filename=&quot;스크린샷 2025-02-10 오후 12.00.45.png&quot; data-origin-width=&quot;1242&quot; data-origin-height=&quot;670&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;h4 id=&quot;Survival-function-and-cumulative-hazard-function&quot; style=&quot;color: #202214; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;Survival function and cumulative hazard function&lt;/b&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The survival function shows how many patients survive for how long (Wikipedia:&amp;nbsp;&lt;a href=&quot;https://en.wikipedia.org/wiki/Kaplan%E2%80%93Meier_estimator&quot;&gt;Kaplan&amp;ndash;Meier estimator&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;At month 0, 100 % of the patients live.&lt;/li&gt;
&lt;li&gt;At month 20, only 40&amp;nbsp;%&amp;nbsp;&amp;ndash;&amp;nbsp;60&amp;nbsp;% remain, depending on their race group.&lt;/li&gt;
&lt;li&gt;Patients with &quot;more than one race&quot; have the highest probability of survival, whites the lowest.&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;For those who are used to working with cumulative density functions (cdf) of probability distributions, the survival function is nothing else than a top&amp;ndash;down mirrored cdf of the time-of-event probability distribution.&lt;/span&gt;&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;#&amp;nbsp;CDF&amp;nbsp;(Cumulative&amp;nbsp;Distribution&amp;nbsp;Function)&lt;/b&gt;&lt;br /&gt;-&amp;nbsp;Probability&amp;nbsp;that&amp;nbsp;an&amp;nbsp;event&amp;nbsp;occurs&amp;nbsp;by&amp;nbsp;time&amp;nbsp;t&lt;br /&gt;-&amp;nbsp;Starts&amp;nbsp;at&amp;nbsp;0&amp;nbsp;and&amp;nbsp;increases&amp;nbsp;upward&amp;nbsp;(0&amp;nbsp;&amp;rarr;&amp;nbsp;1)&lt;br /&gt;&lt;b&gt;#&amp;nbsp;Survival&amp;nbsp;Function&lt;/b&gt;&lt;br /&gt;-&amp;nbsp;Probability&amp;nbsp;of&amp;nbsp;surviving&amp;nbsp;beyond&amp;nbsp;time&amp;nbsp;t&lt;br /&gt;-&amp;nbsp;Starts&amp;nbsp;at&amp;nbsp;1&amp;nbsp;and&amp;nbsp;decreases&amp;nbsp;downward&amp;nbsp;(1&amp;nbsp;&amp;rarr;&amp;nbsp;0)&lt;br /&gt;&lt;b&gt;#&amp;nbsp;Relationship&lt;/b&gt;&lt;br /&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;S(t) = 1 - F(t) where&amp;nbsp;F(t)&amp;nbsp;is&amp;nbsp;CDF&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-10 오후 2.59.31.png&quot; data-origin-width=&quot;860&quot; data-origin-height=&quot;540&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/brh3mK/btsMcSBrFG0/s4HuQwBVUXcEbMSIP9A47K/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/brh3mK/btsMcSBrFG0/s4HuQwBVUXcEbMSIP9A47K/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/brh3mK/btsMcSBrFG0/s4HuQwBVUXcEbMSIP9A47K/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fbrh3mK%2FbtsMcSBrFG0%2Fs4HuQwBVUXcEbMSIP9A47K%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;500&quot; height=&quot;314&quot; data-filename=&quot;스크린샷 2025-02-10 오후 2.59.31.png&quot; data-origin-width=&quot;860&quot; data-origin-height=&quot;540&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The cumulative hazard is another representation of the same facts; it corresponds to the negative logarithm of the survival function (Wikipedia:&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;a style=&quot;color: #008abc;&quot; href=&quot;https://en.wikipedia.org/wiki/Nelson%E2%80%93Aalen_estimator&quot;&gt;Nelson&amp;ndash;Aalen estimator&lt;/a&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;).&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;H(t)&amp;nbsp;=&amp;nbsp;-log(S(t))&lt;br /&gt;where:&lt;br /&gt;-&amp;nbsp;H(t):&amp;nbsp;Cumulative&amp;nbsp;hazard&amp;nbsp;function&lt;br /&gt;-&amp;nbsp;S(t):&amp;nbsp;Survival&amp;nbsp;function&lt;br /&gt;-&amp;nbsp;log:&amp;nbsp;Natural&amp;nbsp;logarithm&lt;br /&gt;Characteristics:&lt;br /&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;-&amp;nbsp;H(t)&amp;nbsp;starts&amp;nbsp;at&amp;nbsp;0&amp;nbsp;and&amp;nbsp;increases&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;-&amp;nbsp;As&amp;nbsp;S(t)&amp;nbsp;decreases,&amp;nbsp;H(t)&amp;nbsp;increases&amp;nbsp;more&amp;nbsp;rapidly&lt;/b&gt;&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;#&amp;nbsp;Values&amp;nbsp;over&amp;nbsp;time&lt;br /&gt;Time(t)&amp;nbsp;|&amp;nbsp;S(t)&amp;nbsp;|&amp;nbsp;H(t)&amp;nbsp;=&amp;nbsp;-log(S(t))&lt;br /&gt;0&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; | 1.0&amp;nbsp;&amp;nbsp;| 0&lt;br /&gt;10&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;| 0.8&amp;nbsp;&amp;nbsp;| 0.223&lt;br /&gt;20&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;| 0.6&amp;nbsp;&amp;nbsp;| 0.511&lt;br /&gt;30&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;| 0.4&amp;nbsp;&amp;nbsp;| 0.916&lt;br /&gt;40&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;| 0.2&amp;nbsp;&amp;nbsp;| 1.609&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;These&amp;nbsp;two&amp;nbsp;functions&amp;nbsp;express&amp;nbsp;the&amp;nbsp;same&amp;nbsp;information&amp;nbsp;in&amp;nbsp;different&amp;nbsp;ways:&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&lt;b&gt;&lt;span style=&quot;background-color: #99cefa;&quot;&gt;Survival function: Directly shows survival probability&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;&lt;b&gt;Cumulative&amp;nbsp;hazard:&amp;nbsp;Shows&amp;nbsp;accumulated&amp;nbsp;risk&amp;nbsp;on&amp;nbsp;a&amp;nbsp;log&amp;nbsp;scale&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&amp;nbsp;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;These&amp;nbsp;different&amp;nbsp;representations&amp;nbsp;are&amp;nbsp;useful&amp;nbsp;for&amp;nbsp;emphasizing&amp;nbsp;or&amp;nbsp;analyzing&amp;nbsp;different&amp;nbsp;aspects&amp;nbsp;of&amp;nbsp;the&amp;nbsp;data.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1739156532428&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;# You can use library functions or write the few lines of code yourself
# !pip install -q scikit-survival
# from sksurv.nonparametric import kaplan_meier_estimator, nelson_aalen_estimator
# from lifelines import KaplanMeierFitter

def survival_function(df):
    survival_df = df[['efs', 'efs_time']].groupby('efs_time').agg(['size', 'sum']).droplevel(0, axis=1).astype(int)
    survival_df['n_at_risk'] = survival_df['size'].sum() - survival_df['size'].shift().fillna(0).cumsum().astype(int)
    hazard = survival_df['sum'] / survival_df['n_at_risk'] 
    survival_df['cumulative_hazard'] = np.cumsum(hazard) # nelson_aalen_estimator
    survival_df['survival_probability'] = (1 - hazard).cumprod() # kaplan_meier_estimator
    return survival_df

plt.figure(figsize=(6, 8))

plt.subplot(2, 1, 1)
survival_df = survival_function(train)
plt.step(survival_df.index, survival_df['survival_probability'], c='k', where=&quot;post&quot;, label='[Overall]')
plt.xlabel('efs_time')
for race_group in race_groups:
    subset = train.query('race_group == @race_group')
    survival_df = survival_function(subset)
    plt.step(survival_df.index, survival_df['survival_probability'], where=&quot;post&quot;, label=race_group)
plt.xlabel('efs_time')
plt.legend(loc='upper right')
plt.title('Survival function (Kaplan&amp;ndash;Meier) by race group')
plt.gca().yaxis.set_major_formatter(PercentFormatter(xmax=1, decimals=0)) # percent of xmax

plt.subplot(2, 1, 2)
survival_df = survival_function(train)
plt.step(survival_df.index, survival_df['cumulative_hazard'], c='k', where=&quot;post&quot;, label='[Overall]')
plt.xlabel('efs_time')
for race_group in race_groups:
    subset = train.query('race_group == @race_group')
    survival_df = survival_function(subset)
    plt.step(survival_df.index, survival_df['cumulative_hazard'], where=&quot;post&quot;, label=race_group)
plt.xlabel('efs_time')
plt.legend(loc='lower right')
plt.title('Cumulative hazard (Nelson&amp;ndash;Aalen) by race group')

plt.tight_layout()
plt.show()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-10 오후 12.02.22.png&quot; data-origin-width=&quot;1242&quot; data-origin-height=&quot;818&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bqcVhU/btsMcy3Pjsm/QLwWKNbEbiQJnADBVJXgPK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bqcVhU/btsMcy3Pjsm/QLwWKNbEbiQJnADBVJXgPK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bqcVhU/btsMcy3Pjsm/QLwWKNbEbiQJnADBVJXgPK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbqcVhU%2FbtsMcy3Pjsm%2FQLwWKNbEbiQJnADBVJXgPK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;556&quot; height=&quot;366&quot; data-filename=&quot;스크린샷 2025-02-10 오후 12.02.22.png&quot; data-origin-width=&quot;1242&quot; data-origin-height=&quot;818&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-10 오후 12.02.39.png&quot; data-origin-width=&quot;1242&quot; data-origin-height=&quot;790&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/caAneH/btsMc0Z0uMZ/WcA9ttkrq8AuboSOSUWFfk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/caAneH/btsMc0Z0uMZ/WcA9ttkrq8AuboSOSUWFfk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/caAneH/btsMc0Z0uMZ/WcA9ttkrq8AuboSOSUWFfk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FcaAneH%2FbtsMc0Z0uMZ%2FWcA9ttkrq8AuboSOSUWFfk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;547&quot; height=&quot;348&quot; data-filename=&quot;스크린샷 2025-02-10 오후 12.02.39.png&quot; data-origin-width=&quot;1242&quot; data-origin-height=&quot;790&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;h4 id=&quot;Cross-validation&quot; style=&quot;color: #202214; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;Cross-validation&lt;/b&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;This competition is about &lt;b&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;equity in the predictions. &lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;This means that we score the predictions per race group and then derive the final score from these six sub-scores.&lt;/span&gt; &lt;/b&gt;&lt;/li&gt;
&lt;li&gt;As the official implementation of the competition metric doesn't output the scores per race group, I've written my own implementation, which gives more transparency.&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;There are two main methods for survival analysis (the proportional hazards model and the accelerated failure time model), and both are implemented in XGBoost and in CatBoost. &lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;The calling conventions are a bit unusual.&lt;/li&gt;
&lt;li&gt;We present the cross-validation of six models:&lt;/li&gt;
&lt;/ul&gt;
&lt;ol style=&quot;list-style-type: decimal; color: #3c4043; text-align: start;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;&lt;b&gt;Proportional hazards model&lt;span&gt;&amp;nbsp;&lt;/span&gt;(Cox regression) with XGBoost&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;This model expects that the two target columns be combined into one (y = np.where(train.efs == 1, train.efs_time, -train.efs_time), negative target values are considered right censored)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Proportional hazards model with CatBoost.&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;This model expects the targets in the same format as the XGBoost Cox model.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Accelerated failure time model&lt;span&gt;&amp;nbsp;&lt;/span&gt;with XGBoost.&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;This model &lt;b&gt;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;expects the lower and upper bounds for the target in a special form in a DMatrix.&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Accelerated failure time model with CatBoost.&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;This model &lt;b&gt;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;expects the lower and upper bounds for the target in the form of a two-column array.&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Proportional hazards model with a linear implementation.&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;This model expects time and event columns in a dataframe.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;MSE regression model with three different target transformations.&lt;/b&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p style=&quot;color: #3c4043; text-align: left;&quot; data-ke-size=&quot;size16&quot;&gt;You'll find a comparison of the cv scores of these models at the end of the notebook.&lt;/p&gt;
&lt;p style=&quot;color: #3c4043; text-align: left;&quot; data-ke-size=&quot;size16&quot;&gt;Some hyperparameters have been taken from other public notebooks.&lt;/p&gt;
&lt;pre id=&quot;code_1739167690303&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;# from metric import score # This is the official metric which we don't use here

kf = StratifiedKFold(shuffle=True, random_state=1)

def evaluate_fold(y_va_pred, fold):
    &quot;&quot;&quot;Compute and print the metrics (concordance index) per race group for a single fold.

    Global variables:
    - train, X_va, idx_va
    - The metrics are saved in the global list all_scores.
    &quot;&quot;&quot;
    metric_list = []
    for race in race_groups:
        mask = X_va.race_group.values == race
        c_index_race = concordance_index(
            train.efs_time.iloc[idx_va][mask],
            - y_va_pred[mask],
            train.efs.iloc[idx_va][mask]
        )
        # print(f&quot;# {race:42} {c_index_race:.3f}&quot;)
        metric_list.append(c_index_race)
    fold_score = np.mean(metric_list) - np.sqrt(np.var(metric_list))
    print(f&quot;# Total fold {fold}:{' ':29} {fold_score:.3f} mean={np.mean(metric_list):.3f} std={np.std(metric_list):.3f}&quot;)
    all_scores.append(metric_list)

def display_overall(label):
    &quot;&quot;&quot;Compute and print the overall metrics (concordance index)&quot;&quot;&quot;
    df = pd.DataFrame(all_scores, columns=race_groups)
    df['mean'] = df[race_groups].mean(axis=1)
    df['std'] = np.std(df[race_groups], axis=1)
    df['score'] = df['mean'] - df['std']
    df = df.T
    df['Overall'] = df.mean(axis=1)
    temp = df.drop(index=['std']).values
    print(f&quot;# Overall:                                   {df.loc['score', 'Overall']:.3f} {label}&quot;)
    all_model_scores[label] = df.loc['score', 'Overall']
    display(df
            .iloc[:len(race_groups)]
            .style
            .format(precision=3)
            .background_gradient(axis=None, vmin=temp.min(), vmax=temp.max(), cmap=&quot;cool&quot;)
            .concat(df.iloc[len(race_groups):].style.format(precision=3))
           )&lt;/code&gt;&lt;/pre&gt;
&lt;pre id=&quot;code_1739167704731&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;%%time
# XGBoost Cox regression
y = np.where(train.efs == 1, train.efs_time, -train.efs_time)
all_scores = []
for fold, (idx_tr, idx_va) in enumerate(kf.split(train, train.race_group)):
    X_tr = train.iloc[idx_tr][features]
    X_va = train.iloc[idx_va][features]
    y_tr = y[idx_tr]
    
    xgb_cox_params = {'objective': 'survival:cox', 'grow_policy': 'depthwise', 
                      'n_estimators': 700, 'learning_rate': 0.0254, 'max_depth': 8, 
                      'reg_lambda': 0.116, 'reg_alpha': 0.139, 'min_child_weight': 23.8,
                      'colsample_bytree': 0.59, 'subsample': 0.7, 'tree_method': 'hist',
                      'enable_categorical': True}
    model = xgboost.XGBRegressor(**xgb_cox_params)
    model.fit(X_tr, y_tr) # negative values are considered right censored
    y_va_pred = model.predict(X_va) # predicts hazard factor
    evaluate_fold(y_va_pred, fold)
display_overall('Cox Proportional Hazards XGBoost')
# Overall:                                   0.670&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-10 오후 3.08.41.png&quot; data-origin-width=&quot;1498&quot; data-origin-height=&quot;1096&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/El0XO/btsMesIoJgx/jxnuqInSMCwYaRUrxuNxqk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/El0XO/btsMesIoJgx/jxnuqInSMCwYaRUrxuNxqk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/El0XO/btsMesIoJgx/jxnuqInSMCwYaRUrxuNxqk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FEl0XO%2FbtsMesIoJgx%2FjxnuqInSMCwYaRUrxuNxqk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1498&quot; height=&quot;1096&quot; data-filename=&quot;스크린샷 2025-02-10 오후 3.08.41.png&quot; data-origin-width=&quot;1498&quot; data-origin-height=&quot;1096&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;pre id=&quot;code_1739167740013&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;%%time
# Catboost Cox regression
y = np.where(train.efs == 1, train.efs_time, -train.efs_time)
all_scores = []
for fold, (idx_tr, idx_va) in enumerate(kf.split(train, train.race_group)):
    X_tr = train.iloc[idx_tr][features]
    X_va = train.iloc[idx_va][features]
    y_tr = y[idx_tr]
    
    cb_cox_params = {'loss_function': 'Cox', 'grow_policy': 'SymmetricTree',
                     'n_estimators': 800, 'learning_rate': 0.092, 'l2_leaf_reg': 2.5,
                     'max_depth': 7, 'colsample_bylevel': 0.84, 'subsample': 0.9, 
                     'random_strength': 0.8, 'verbose': False}
    
    model = catboost.CatBoostRegressor(**cb_cox_params, cat_features=cat_features)
    model.fit(X_tr, y_tr)
    y_va_pred = model.predict(X_va) # predicts log of hazard factor
    evaluate_fold(y_va_pred, fold)
display_overall('Cox Proportional Hazards CatBoost')
# Overall:                                   0.669&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-10 오후 3.09.11.png&quot; data-origin-width=&quot;1598&quot; data-origin-height=&quot;1098&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bGsq8j/btsMbXDKTeJ/rXEy33IsjaM3kD0FMAgZ0K/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bGsq8j/btsMbXDKTeJ/rXEy33IsjaM3kD0FMAgZ0K/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bGsq8j/btsMbXDKTeJ/rXEy33IsjaM3kD0FMAgZ0K/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbGsq8j%2FbtsMbXDKTeJ%2FrXEy33IsjaM3kD0FMAgZ0K%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1598&quot; height=&quot;1098&quot; data-filename=&quot;스크린샷 2025-02-10 오후 3.09.11.png&quot; data-origin-width=&quot;1598&quot; data-origin-height=&quot;1098&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;pre id=&quot;code_1739167782486&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;%%time
# XGBoost Accelerated failure time model
all_scores = []

# Data split and preparation
for fold, (idx_tr, idx_va) in enumerate(kf.split(train, train.race_group)):
    # K-fold cross-validation stratified by race_group
    X_tr = train.iloc[idx_tr][features]
    X_va = train.iloc[idx_va][features]
    
    # Creating xgboost data matrix
    d_tr = xgboost.DMatrix(X_tr, enable_categorical=True)
    # Setting survival time information for AFT model
    d_tr.set_float_info('label_lower_bound', train.efs_time.iloc[idx_tr])
    d_tr.set_float_info('label_upper_bound', np.where(train.efs.iloc[idx_tr] == 0, np.inf, train.efs_time.iloc[idx_tr]))
    
    d_va = xgboost.DMatrix(X_va, enable_categorical=True)
    d_va.set_float_info('label_lower_bound', train.efs_time.iloc[idx_va])
    d_va.set_float_info('label_upper_bound', np.where(train.efs.iloc[idx_va] == 0, np.inf, train.efs_time.iloc[idx_va]))
    
    # Model parameters setting
    xgboost_aft_params = {'learning_rate': 0.08, 'max_depth': 4, 'reg_lambda': 3, 'aft_loss_distribution_scale': 0.9,
                          'reg_alpha': 0.24, 'gamma': 0.033, 'min_child_weight': 82.58861553592878,
                          'colsample_bytree': 0.5662198438953138, 'max_bin': 53, 'subsample': 0.7456329821182728, 
                          'objective': 'survival:aft', 'grow_policy': 'depthwise', 'tree_method': 'hist',
                          'aft_loss_distribution': 'normal'}
    # Model training
    bst = xgboost.train(xgboost_aft_params,
                        d_tr,
                        num_boost_round=300,
                        # evals=[(d_tr, 'train'), (d_va, 'val')],
                       )
                       
    # Prediction &amp;amp; Evaluation
    y_va_pred = - bst.predict(d_va) # model predicts time of death
    # Taking negative because: converting to risk score
    # Earlier death time means higher risk
    evaluate_fold(y_va_pred, fold)
display_overall('Accelerated Failure Time XGBoost')
# Overall:                                   0.664&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;d_tr&amp;nbsp;=&amp;nbsp;xgboost.DMatrix(X_tr,&amp;nbsp;enable_categorical=True)&lt;/b&gt;&lt;/i&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Creates DMatrix, XGBoost's specialized data format&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;enable_categorical=True&lt;/b&gt;&lt;/i&gt;: Automatically handles categorical variables&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;d_tr.set_float_info('label_lower_bound',&amp;nbsp;train.efs_time.iloc[idx_tr])&lt;/b&gt;&lt;/i&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Setting survival time lower bound = label_lower_bound&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;&lt;b&gt;Sets observed time (efs_time) as lower bound for all patients&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;Means the patient survived at least until this time, regardless of whether event occurred (efs=1) or was censored (efs=0)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;d_tr.set_float_info('label_upper_bound',&amp;nbsp;np.where(train.efs.iloc[idx_tr] == 0, np.inf, train.efs_time.iloc[idx_tr]))&lt;/b&gt;&lt;/i&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Setting survival time upper bound = label_upper_bound&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Splits into 2 cases:&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #99cefa;&quot;&gt;&lt;b&gt;efs=1 (event occurred):&lt;/b&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;upper_bound&amp;nbsp;=&amp;nbsp;efs_time&lt;br /&gt;#&amp;nbsp;We&amp;nbsp;know&amp;nbsp;the&amp;nbsp;exact&amp;nbsp;time&amp;nbsp;of&amp;nbsp;death&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;&lt;b&gt;efs=0 (censored):&lt;/b&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;upper_bound&amp;nbsp;=&amp;nbsp;np.inf&amp;nbsp;(infinity)&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;#&amp;nbsp;We&amp;nbsp;don't&amp;nbsp;know&amp;nbsp;when&amp;nbsp;death&amp;nbsp;occurred&amp;nbsp;after&amp;nbsp;last&amp;nbsp;observation&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;Actually setting the upper and lower bound of survival time for validation data is unnecessary&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Example:&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;#&amp;nbsp;Patient&amp;nbsp;A:&amp;nbsp;died&amp;nbsp;on&amp;nbsp;day&amp;nbsp;100&amp;nbsp;(efs=1)&lt;br /&gt;lower_bound&amp;nbsp;=&amp;nbsp;100&lt;br /&gt;upper_bound&amp;nbsp;=&amp;nbsp;100&lt;br /&gt;#&amp;nbsp;Means&amp;nbsp;death&amp;nbsp;occurred&amp;nbsp;exactly&amp;nbsp;at&amp;nbsp;100&amp;nbsp;days&lt;br /&gt;&lt;br /&gt;#&amp;nbsp;Patient&amp;nbsp;B:&amp;nbsp;censored&amp;nbsp;on&amp;nbsp;day&amp;nbsp;80&amp;nbsp;(efs=0)&lt;br /&gt;lower_bound&amp;nbsp;=&amp;nbsp;80&lt;br /&gt;upper_bound&amp;nbsp;=&amp;nbsp;inf&lt;br /&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;#&amp;nbsp;Means&amp;nbsp;survived&amp;nbsp;at&amp;nbsp;least&amp;nbsp;80&amp;nbsp;days,&amp;nbsp;unknown&amp;nbsp;after&amp;nbsp;that&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-10 오후 3.09.50.png&quot; data-origin-width=&quot;1498&quot; data-origin-height=&quot;1096&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/dSz941/btsMdHsKBW9/q6n4tCnvqlps7nx9KnVEkk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/dSz941/btsMdHsKBW9/q6n4tCnvqlps7nx9KnVEkk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/dSz941/btsMdHsKBW9/q6n4tCnvqlps7nx9KnVEkk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FdSz941%2FbtsMdHsKBW9%2Fq6n4tCnvqlps7nx9KnVEkk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1498&quot; height=&quot;1096&quot; data-filename=&quot;스크린샷 2025-02-10 오후 3.09.50.png&quot; data-origin-width=&quot;1498&quot; data-origin-height=&quot;1096&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;pre id=&quot;code_1739167806755&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;%%time
# CatBoost Accelerated failure time model
y = np.column_stack([train.efs_time,
                     np.where(train.efs == 1, train.efs_time, -1)])
all_scores = []
for fold, (idx_tr, idx_va) in enumerate(kf.split(train, train.race_group)):
    X_tr = train.iloc[idx_tr][features]
    X_va = train.iloc[idx_va][features]
    y_tr = y[idx_tr]
    cb_aft_params = {'loss_function': 'SurvivalAft', 'grow_policy': 'SymmetricTree', 
                     'n_estimators': 800, 'learning_rate': 0.066, 'l2_leaf_reg': 4.4,
                     'max_depth': 5, 'colsample_bylevel': 0.776, 'random_strength': 0.9, 
                     'verbose': False} # 0.67551
    model = catboost.CatBoostRegressor(**cb_aft_params, cat_features=cat_features)
    model.fit(X_tr, y_tr)
    y_va_pred = - model.predict(X_va) # model predicts log of time of death
    evaluate_fold(y_va_pred, fold)
display_overall('Accelerated Failure Time CatBoost')
# Overall:                                   0.664&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-10 오후 3.10.18.png&quot; data-origin-width=&quot;1498&quot; data-origin-height=&quot;1090&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bJb7yE/btsMcR3EV08/E3869sQTIZr8frBYKBTvCk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bJb7yE/btsMcR3EV08/E3869sQTIZr8frBYKBTvCk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bJb7yE/btsMcR3EV08/E3869sQTIZr8frBYKBTvCk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbJb7yE%2FbtsMcR3EV08%2FE3869sQTIZr8frBYKBTvCk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1498&quot; height=&quot;1090&quot; data-filename=&quot;스크린샷 2025-02-10 오후 3.10.18.png&quot; data-origin-width=&quot;1498&quot; data-origin-height=&quot;1090&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;h4 id=&quot;Target-transformation-models-and-regression-with-mean-squared-error&quot; style=&quot;color: #202214; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;Target transformation models and regression with mean squared error&lt;/b&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The competition task can be interpreted as predicting the order of death of the patients.&lt;/li&gt;
&lt;li&gt;Who dies first? Who dies second? ... Who dies last, and who survives?&lt;/li&gt;
&lt;li&gt;With a suitable target transformation, we can apply the usual regression algorithms which optimize mse or similar metrics.&lt;/li&gt;
&lt;li&gt;In the public notebooks of this competition, we can find various target transformations, but they all are similar.&lt;/li&gt;
&lt;li&gt;Patients who die mostly have an&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;efs_time&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;between 0 and 15, whereas most survivors have an&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;efs_time&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;between 15 and 160. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;This distribution is an impediment for regression models. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;We want predictions to have high discriminative power for the patients who die, but we don't need to distinguish between survivors. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;We can achieve this result by stretching the range of the patients who die and compressing the range of the survivors.&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;The diagram shows how a typical target transformation stretches and compresses the ranges:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1739173318610&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;def transform_survival_probability(time, event):
    &quot;&quot;&quot;Transform the target by stretching the range of eventful efs_times and compressing the range of event_free efs_times

    From https://www.kaggle.com/code/cdeotte/gpu-lightgbm-baseline-cv-681-lb-685
    &quot;&quot;&quot;
    kmf = KaplanMeierFitter()
    kmf.fit(time, event)
    y = kmf.survival_function_at_times(time).values
    return y

y_quantile = transform_survival_probability(time=train.efs_time, event=train.efs)
survival_df = survival_function(train)

fig, axs = plt.subplots(2, 2, figsize=(10, 10), dpi=80)

axs[0, 0].hist(train.efs_time[train.efs == 0], bins=np.linspace(0, 160, 41), label='efs=0: patient still lives at this time', alpha=0.5)
axs[0, 0].hist(train.efs_time[train.efs == 1], bins=np.linspace(0, 160, 41), label='efs=1: patient dies at this time', alpha=0.5)
axs[0, 0].legend()
axs[0, 0].set_xlabel('efs_time')
axs[0, 0].set_ylabel('count')
axs[0, 0].set_title('Original target histogram')

axs[0, 1].set_axis_off()

axs[1, 0].step(survival_df.index, survival_df['survival_probability'], c='k', lw=3, where=&quot;post&quot;, label='[Overall]')
axs[1, 0].set_xlabel('efs_time')
axs[1, 0].set_ylabel(&quot;quantile&quot;)
axs[1, 0].set_title(&quot;Survival function&quot;)
axs[1, 0].yaxis.set_major_formatter(PercentFormatter(xmax=1, decimals=0))

axs[1, 1].hist(y_quantile[train.efs==0], bins=100, label=&quot;efs=0&quot;, orientation=u'horizontal', alpha=0.5)
axs[1, 1].hist(y_quantile[train.efs==1], bins=100, label=&quot;efs=1&quot;, orientation=u'horizontal', alpha=0.5)
axs[1, 1].legend()
axs[1, 1].set_ylabel(&quot;quantile&quot;)
axs[1, 1].set_xlabel(&quot;count&quot;)
axs[1, 1].set_title(&quot;Transformed target histogram (sideways)&quot;)
axs[1, 1].yaxis.set_major_formatter(PercentFormatter(xmax=1, decimals=0))

ax = plt.Axes(fig, [0., 0., 1., 1.])
ax.set_axis_off()
ax.set_xlim(0, 1)
ax.set_ylim(0, 1)
fig.add_axes(ax)
ax.arrow(0.2, 0.55, 0, -0.47, length_includes_head=True, width=0.002, color=plt.get_cmap('tab10')(0), alpha=0.5, head_width=0.02, head_length=0.02)
ax.arrow(0.2, 0.082, 0.37, 0, length_includes_head=True, width=0.002, color=plt.get_cmap('tab10')(0), alpha=0.5, head_width=0.02, head_length=0.02)
ax.arrow(0.12, 0.55, 0, -0.3, length_includes_head=True, width=0.002, color=plt.get_cmap('tab10')(1), alpha=0.5, head_width=0.02, head_length=0.02)
ax.arrow(0.12, 0.25, 0.45, 0, length_includes_head=True, width=0.002, color=plt.get_cmap('tab10')(1), alpha=0.5, head_width=0.02, head_length=0.02)

plt.suptitle('Transforming the target', y=0.99, size=20)
with warnings.catch_warnings():
    warnings.simplefilter(&quot;ignore&quot;)
    plt.tight_layout()
plt.show()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-10 오후 4.42.21.png&quot; data-origin-width=&quot;1498&quot; data-origin-height=&quot;1262&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/cnVQlH/btsMbWSsO6E/inNrnWp1nI7Q2Xg3LTw441/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/cnVQlH/btsMbWSsO6E/inNrnWp1nI7Q2Xg3LTw441/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/cnVQlH/btsMbWSsO6E/inNrnWp1nI7Q2Xg3LTw441/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FcnVQlH%2FbtsMbWSsO6E%2FinNrnWp1nI7Q2Xg3LTw441%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1498&quot; height=&quot;1262&quot; data-filename=&quot;스크린샷 2025-02-10 오후 4.42.21.png&quot; data-origin-width=&quot;1498&quot; data-origin-height=&quot;1262&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;What we already saw from my discussion annotation: &lt;a href=&quot;https://dongsunseng.com/entry/CIBMTR-Equity-in-post-HCT-Survival-Predictions-8-Finding-the-best-target-transformation&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://dongsunseng.com/entry/CIBMTR-Equity-in-post-HCT-Survival-Predictions-8-Finding-the-best-target-transformation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;We now plot the histograms of five possible transformations and then fit regression models with MSE loss to each of the transformed targets.&lt;/li&gt;
&lt;li&gt;You can of course try other loss functions and see what happens.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1739173425872&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;def transform_partial_hazard(time, event):
    &quot;&quot;&quot;Transform the target by stretching the range of eventful efs_times and compressing the range of event_free efs_times

    From https://www.kaggle.com/code/andreasbis/cibmtr-eda-ensemble-model
    &quot;&quot;&quot;
    data = pd.DataFrame({'efs_time': time, 'efs': event, 'time': time, 'event': event})
    cph = CoxPHFitter()
    with warnings.catch_warnings():
        warnings.simplefilter(&quot;ignore&quot;)
        cph.fit(data, duration_col='time', event_col='event')
    return cph.predict_partial_hazard(data)

def transform_separate(time, event):
    &quot;&quot;&quot;Transform the target by separating events from non-events
    
    From https://www.kaggle.com/code/mtinti/cibmtr-lofo-feature-importance-gpu-accelerated&quot;&quot;&quot;
    transformed = time.values.copy()
    mx = transformed[event == 1].max() # last patient who dies
    mn = transformed[event == 0].min() # first patient who survives
    transformed[event == 0] = time[event == 0] + mx - mn
    transformed = rankdata(transformed)
    transformed[event == 0] += len(transformed) // 2
    transformed = transformed / transformed.max()
    return - transformed

def transform_rank_log(time, event):
    &quot;&quot;&quot;Transform the target by stretching the range of eventful efs_times and compressing the range of event_free efs_times
    
    From https://www.kaggle.com/code/cdeotte/nn-mlp-baseline-cv-670-lb-676&quot;&quot;&quot;
    transformed = time.values.copy()
    mx = transformed[event == 1].max() # last patient who dies
    mn = transformed[event == 0].min() # first patient who survives
    transformed[event == 0] = time[event == 0] + mx - mn
    transformed = rankdata(transformed)
    transformed[event == 0] += len(transformed) * 2
    transformed = transformed / transformed.max()
    transformed = np.log(transformed)
    return - transformed

def transform_quantile(time, event):
    &quot;&quot;&quot;Transform the target by stretching the range of eventful efs_times and compressing the range of event_free efs_times
    
    From https://www.kaggle.com/code/ambrosm/esp-eda-which-makes-sense&quot;&quot;&quot;
    transformed = np.full(len(time), np.nan)
    transformed_dead = quantile_transform(- time[event == 1].values.reshape(-1, 1)).ravel()
    transformed[event == 1] = transformed_dead
    transformed[event == 0] = transformed_dead.min() - 0.3
    return transformed&lt;/code&gt;&lt;/pre&gt;
&lt;pre id=&quot;code_1739173437915&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;# XGBoost: MSE loss with five different target transformations
for transformation in [transform_survival_probability,
                       transform_partial_hazard,
                       transform_separate,
                       transform_rank_log,
                       transform_quantile,
                      ]:
    plt.figure(figsize=(6, 1.5))
    target = transformation(time=train.efs_time, event=train.efs)
    vmin, vmax = 1.09 * target.min() - 0.09 * target.max(), 1.09 * target.max() - 0.09 * target.min()
    plt.hist(target[train.efs == 0], bins=np.linspace(vmin, vmax, 31), density=True, label='efs=0: patient still lives at this time', alpha=0.5)
    plt.hist(target[train.efs == 1], bins=np.linspace(vmin, vmax, 31), density=True, label='efs=1: patient dies at this time', alpha=0.5)
    plt.xlim(vmin, vmax)
    plt.yticks([])
    plt.title('Target histogram: ' + transformation.__name__)
    plt.show()
    
    print(transformation.__name__)

    all_scores = []
    for fold, (idx_tr, idx_va) in enumerate(kf.split(train, train.race_group)):
        X_tr = train.iloc[idx_tr][features]
        X_va = train.iloc[idx_va][features]
        y_tr = transformation(time=train.iloc[idx_tr].efs_time, event=train.iloc[idx_tr].efs)
    
        # from https://www.kaggle.com/code/cdeotte/gpu-lightgbm-baseline-cv-681-lb-685
        model = xgboost.XGBRegressor(
            max_depth=3,  
            colsample_bytree=0.5,  
            subsample=0.8,  
            n_estimators=2000,  
            learning_rate=0.02,  
            enable_categorical=True,
            min_child_weight=80,
        )
        model.fit(X_tr, y_tr)
        y_va_pred = model.predict(X_va) # predicts quantile
        evaluate_fold(y_va_pred, fold)
    display_overall(f'{transformation.__name__} XGBoost (MSE)')
    print()
    
# Overall:                                   0.669 transform_survival_probability
# Overall:                                   0.668 transform_partial_hazard
# Overall:                                   0.666 transform_separate
# Overall:                                   0.672 transform_rank_log
# Overall:                                   0.674 transform_quantile&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-10 오후 4.44.24.png&quot; data-origin-width=&quot;1590&quot; data-origin-height=&quot;862&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/NbEpV/btsMcZtTJJl/sE0sKkYG642s7SLdv83TV0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/NbEpV/btsMcZtTJJl/sE0sKkYG642s7SLdv83TV0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/NbEpV/btsMcZtTJJl/sE0sKkYG642s7SLdv83TV0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FNbEpV%2FbtsMcZtTJJl%2FsE0sKkYG642s7SLdv83TV0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1590&quot; height=&quot;862&quot; data-filename=&quot;스크린샷 2025-02-10 오후 4.44.24.png&quot; data-origin-width=&quot;1590&quot; data-origin-height=&quot;862&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-10 오후 4.44.37.png&quot; data-origin-width=&quot;1320&quot; data-origin-height=&quot;1078&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/cxnwmJ/btsMebUzouG/gGc5i5YVdko8CUsEeAHcMk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/cxnwmJ/btsMebUzouG/gGc5i5YVdko8CUsEeAHcMk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/cxnwmJ/btsMebUzouG/gGc5i5YVdko8CUsEeAHcMk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FcxnwmJ%2FbtsMebUzouG%2FgGc5i5YVdko8CUsEeAHcMk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;620&quot; height=&quot;506&quot; data-filename=&quot;스크린샷 2025-02-10 오후 4.44.37.png&quot; data-origin-width=&quot;1320&quot; data-origin-height=&quot;1078&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-10 오후 4.45.00.png&quot; data-origin-width=&quot;1616&quot; data-origin-height=&quot;992&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/qbB9D/btsMb1lM6U8/brLSWhADJLcsxKqMFwqKpk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/qbB9D/btsMb1lM6U8/brLSWhADJLcsxKqMFwqKpk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/qbB9D/btsMb1lM6U8/brLSWhADJLcsxKqMFwqKpk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FqbB9D%2FbtsMb1lM6U8%2FbrLSWhADJLcsxKqMFwqKpk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1616&quot; height=&quot;992&quot; data-filename=&quot;스크린샷 2025-02-10 오후 4.45.00.png&quot; data-origin-width=&quot;1616&quot; data-origin-height=&quot;992&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-10 오후 4.45.29.png&quot; data-origin-width=&quot;1534&quot; data-origin-height=&quot;822&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/baWX1O/btsMcgpwhqC/jiuPIXZ7DVZvhhiQKCQ9TK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/baWX1O/btsMcgpwhqC/jiuPIXZ7DVZvhhiQKCQ9TK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/baWX1O/btsMcgpwhqC/jiuPIXZ7DVZvhhiQKCQ9TK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbaWX1O%2FbtsMcgpwhqC%2FjiuPIXZ7DVZvhhiQKCQ9TK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1534&quot; height=&quot;822&quot; data-filename=&quot;스크린샷 2025-02-10 오후 4.45.29.png&quot; data-origin-width=&quot;1534&quot; data-origin-height=&quot;822&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-10 오후 4.45.41.png&quot; data-origin-width=&quot;1176&quot; data-origin-height=&quot;1078&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/PWYu8/btsMdHzDwbA/rmF2ZT2o8sPORYGKQKZ8A0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/PWYu8/btsMdHzDwbA/rmF2ZT2o8sPORYGKQKZ8A0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/PWYu8/btsMdHzDwbA/rmF2ZT2o8sPORYGKQKZ8A0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FPWYu8%2FbtsMdHzDwbA%2FrmF2ZT2o8sPORYGKQKZ8A0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;553&quot; height=&quot;507&quot; data-filename=&quot;스크린샷 2025-02-10 오후 4.45.41.png&quot; data-origin-width=&quot;1176&quot; data-origin-height=&quot;1078&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-10 오후 4.45.58.png&quot; data-origin-width=&quot;1528&quot; data-origin-height=&quot;930&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/Zb4Nj/btsMcH76zNb/n5hwG8yoSi6rtt9ORWj5Ek/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/Zb4Nj/btsMcH76zNb/n5hwG8yoSi6rtt9ORWj5Ek/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/Zb4Nj/btsMcH76zNb/n5hwG8yoSi6rtt9ORWj5Ek/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FZb4Nj%2FbtsMcH76zNb%2Fn5hwG8yoSi6rtt9ORWj5Ek%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1528&quot; height=&quot;930&quot; data-filename=&quot;스크린샷 2025-02-10 오후 4.45.58.png&quot; data-origin-width=&quot;1528&quot; data-origin-height=&quot;930&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-10 오후 4.46.08.png&quot; data-origin-width=&quot;1514&quot; data-origin-height=&quot;828&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/cgqVff/btsMdpTrtp7/bmsterIShjP8UmGM19TwI1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/cgqVff/btsMdpTrtp7/bmsterIShjP8UmGM19TwI1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/cgqVff/btsMdpTrtp7/bmsterIShjP8UmGM19TwI1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FcgqVff%2FbtsMdpTrtp7%2FbmsterIShjP8UmGM19TwI1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1514&quot; height=&quot;828&quot; data-filename=&quot;스크린샷 2025-02-10 오후 4.46.08.png&quot; data-origin-width=&quot;1514&quot; data-origin-height=&quot;828&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-10 오후 4.46.18.png&quot; data-origin-width=&quot;1112&quot; data-origin-height=&quot;550&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bSekk8/btsMcJkuCRq/bV0rn1iMb0nV9wgWqt3U80/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bSekk8/btsMcJkuCRq/bV0rn1iMb0nV9wgWqt3U80/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bSekk8/btsMcJkuCRq/bV0rn1iMb0nV9wgWqt3U80/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbSekk8%2FbtsMcJkuCRq%2FbV0rn1iMb0nV9wgWqt3U80%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;672&quot; height=&quot;332&quot; data-filename=&quot;스크린샷 2025-02-10 오후 4.46.18.png&quot; data-origin-width=&quot;1112&quot; data-origin-height=&quot;550&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;h4 id=&quot;A-linear-model&quot; style=&quot;color: #202214; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;A linear model&lt;/b&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The linear model&amp;nbsp;CoxPHFitter&amp;nbsp;needs one-hot encoding and missing value imputation:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1739180918018&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;%%time
# see https://lifelines.readthedocs.io/en/latest/Survival%20Regression.html#cox-s-proportional-hazard-model

all_scores = []
for fold, (idx_tr, idx_va) in enumerate(kf.split(train, train.race_group)):
    # Creating preprocessing pipeline - one hot encoding for categorical variables
    preproc = ColumnTransformer([
    # One-hot encoding for categorical variables
    ('ohe', OneHotEncoder(
        drop='first',  # Drop first category (dummy coding)
        sparse_output=False, 
        handle_unknown='ignore'), 
        cat_features),
    ],
    # Replace missing values in numerical variables with median
    remainder=SimpleImputer(strategy='median')
).set_output(transform='pandas')
    
    # Apply data preprocessing
    X_tr = preproc.fit_transform(train.iloc[idx_tr])
    with warnings.catch_warnings():
        warnings.simplefilter(&quot;ignore&quot;)
        X_va = preproc.transform(train.iloc[idx_va])
        
    # Create and Train Cox model
    model = CoxPHFitter(penalizer=.01) # Apply L2 regularization
    feats = [f for f in X_tr.columns if f not in ['gvhd_proph_FK+- others(not MMF,MTX)']]
    model.fit(X_tr[feats], duration_col='efs_time', event_col='efs')
    # model.print_summary()
    y_va_pred = model.predict_partial_hazard(X_va[feats])
    X_va['race_group'] = train.race_group.iloc[idx_va]
    evaluate_fold(y_va_pred, fold)
display_overall('Cox Proportional Hazards Linear')
# Overall:                                   0.656&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;XGBoost Cox vs. CoxPHFitter
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;XGBoost Cox:&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Can learn nonlinear relationships&lt;/li&gt;
&lt;li&gt;Can capture complex interactions between features&lt;/li&gt;
&lt;li&gt;Automatically handles missing values as a tree-based model&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;CoxPHFitter:&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Linear model (learns only linear relationships)&lt;/li&gt;
&lt;li&gt;Features affect hazard independently&lt;/li&gt;
&lt;li&gt;Requires preprocessing (one-hot encoding, missing value handling)&lt;/li&gt;
&lt;li&gt;Basic Cox Model Equation:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;h(t|X)&amp;nbsp;=&amp;nbsp;h₀(t)&amp;nbsp;*&amp;nbsp;exp(&amp;beta;₁X₁&amp;nbsp;+&amp;nbsp;&amp;beta;₂X₂&amp;nbsp;+&amp;nbsp;...&amp;nbsp;+&amp;nbsp;&amp;beta;ₙXₙ)&lt;br /&gt;#&amp;nbsp;h₀(t):&amp;nbsp;baseline&amp;nbsp;hazard&amp;nbsp;function&lt;br /&gt;#&amp;nbsp;&amp;beta;ᵢ:&amp;nbsp;coefficient&amp;nbsp;for&amp;nbsp;each&amp;nbsp;feature&lt;br /&gt;#&amp;nbsp;Xᵢ:&amp;nbsp;feature&amp;nbsp;value&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Learns through partial likelihood func:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Details in my previous post: &lt;a href=&quot;https://dongsunseng.com/entry/CIBMTR-Equity-in-post-HCT-Survival-Predictions-2-Understanding-Survival-Analysis&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://dongsunseng.com/entry/CIBMTR-Equity-in-post-HCT-Survival-Predictions-2-Understanding-Survival-Analysis&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-10 오후 6.48.47.png&quot; data-origin-width=&quot;1510&quot; data-origin-height=&quot;1110&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/byAHDh/btsMemaIZNl/8xfWxK1lIeAJRcskTyzAPK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/byAHDh/btsMemaIZNl/8xfWxK1lIeAJRcskTyzAPK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/byAHDh/btsMemaIZNl/8xfWxK1lIeAJRcskTyzAPK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbyAHDh%2FbtsMemaIZNl%2F8xfWxK1lIeAJRcskTyzAPK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1510&quot; height=&quot;1110&quot; data-filename=&quot;스크린샷 2025-02-10 오후 6.48.47.png&quot; data-origin-width=&quot;1510&quot; data-origin-height=&quot;1110&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;div style=&quot;color: #3c4043; text-align: start;&quot;&gt;
&lt;div style=&quot;background-color: #ffffff; color: #3c4043;&quot;&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;Observation&lt;/b&gt;: With most models, the Asian predictions get the highest scores (best concordance index) and the predictions for white patients get the lowest scores (worst concordance).&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;Insight:&lt;/b&gt;&lt;span style=&quot;font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;As the competition objective (equitability across diverse patient populations) rewards models with similar concordance scores for all six race groups, a possible strategy could be that we artificially make the predictions for Asian patients worse.&lt;/span&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;#&amp;nbsp;Stratified&amp;nbsp;C-index&amp;nbsp;=&amp;nbsp;Mean(C-indices)&amp;nbsp;-&amp;nbsp;Std(C-indices)&lt;br /&gt;#&amp;nbsp;That&amp;nbsp;is,&amp;nbsp;mean&amp;nbsp;of&amp;nbsp;C-indices&amp;nbsp;for&amp;nbsp;each&amp;nbsp;racial&amp;nbsp;group&amp;nbsp;minus&amp;nbsp;their&amp;nbsp;standard&amp;nbsp;deviation&lt;br /&gt;Example:&lt;br /&gt;Race&amp;nbsp;A:&amp;nbsp;C-index&amp;nbsp;=&amp;nbsp;0.70&lt;br /&gt;Race&amp;nbsp;B:&amp;nbsp;C-index&amp;nbsp;=&amp;nbsp;0.70&lt;br /&gt;Race&amp;nbsp;C:&amp;nbsp;C-index&amp;nbsp;=&amp;nbsp;0.70&lt;br /&gt;=&amp;gt;&amp;nbsp;Mean&amp;nbsp;0.70,&amp;nbsp;Std&amp;nbsp;0&amp;nbsp;-&amp;gt;&amp;nbsp;Final&amp;nbsp;score&amp;nbsp;0.70&lt;br /&gt;Race&amp;nbsp;A:&amp;nbsp;C-index&amp;nbsp;=&amp;nbsp;0.75&lt;br /&gt;Race&amp;nbsp;B:&amp;nbsp;C-index&amp;nbsp;=&amp;nbsp;0.65&lt;br /&gt;Race&amp;nbsp;C:&amp;nbsp;C-index&amp;nbsp;=&amp;nbsp;0.70&lt;br /&gt;=&amp;gt;&amp;nbsp;Mean&amp;nbsp;0.70,&amp;nbsp;Std&amp;nbsp;0.05&amp;nbsp;-&amp;gt;&amp;nbsp;Final&amp;nbsp;score&amp;nbsp;0.65&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;What&amp;nbsp;this&amp;nbsp;insight&amp;nbsp;suggests:&lt;br /&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;If predictions for Asian patients are more accurate than other races&lt;br /&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;Deliberately&amp;nbsp;lowering&amp;nbsp;the&amp;nbsp;accuracy&amp;nbsp;for&amp;nbsp;Asian&amp;nbsp;predictions&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;To&amp;nbsp;make&amp;nbsp;performance&amp;nbsp;similar&amp;nbsp;across&amp;nbsp;all&amp;nbsp;racial&amp;nbsp;groups&lt;br /&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;Could&amp;nbsp;improve&amp;nbsp;the&amp;nbsp;overall&amp;nbsp;score&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div style=&quot;color: #3c4043; text-align: start;&quot;&gt;
&lt;div style=&quot;background-color: #ffffff; color: #3c4043;&quot;&gt;
&lt;h4 id=&quot;Final-comparison&quot; style=&quot;color: #202214;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;Final comparison&lt;/b&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;For the time being, the gradient-boosted proportional hazard models (Cox regression, blue) and the transformed-target models (pink) win. &lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;Among the target transformations,&amp;nbsp;transform_quantile&amp;nbsp;is best.&lt;/li&gt;
&lt;li&gt;The AFT models (green) perhaps need more hyperparameter tuning.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;pre id=&quot;code_1739180953284&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;result_df = pd.DataFrame(all_model_scores, index=['score']).T
result_df = result_df.sort_values('score', ascending=False)
# with pd.option_context(&quot;display.precision&quot;, 3): display(result_df)
plt.figure(figsize=(6, len(result_df) * 0.4))

color = np.where(result_df.index.str.contains('Proportional'),
                 'cyan',
                 np.where(result_df.index.str.contains('Accelerated'), 'lightgreen', 
                          'lightpink'))
bars = plt.barh(np.arange(len(result_df)), result_df.score, color=color)
plt.gca().bar_label(bars, fmt='%.3f')
plt.yticks(np.arange(len(result_df)), result_df.index)
plt.xlim(0.65, 0.68)
plt.xticks([0.65, 0.66, 0.67, 0.68])
plt.gca().invert_yaxis()
plt.xlabel('CV score (higher is better)')
plt.show()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-10 오후 6.49.24.png&quot; data-origin-width=&quot;1562&quot; data-origin-height=&quot;728&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/cAUFun/btsMcgccvD2/OgZBkUKAcLkKuCKnUnYOJK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/cAUFun/btsMcgccvD2/OgZBkUKAcLkKuCKnUnYOJK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/cAUFun/btsMcgccvD2/OgZBkUKAcLkKuCKnUnYOJK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FcAUFun%2FbtsMcgccvD2%2FOgZBkUKAcLkKuCKnUnYOJK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1562&quot; height=&quot;728&quot; data-filename=&quot;스크린샷 2025-02-10 오후 6.49.24.png&quot; data-origin-width=&quot;1562&quot; data-origin-height=&quot;728&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style5&quot; /&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;완벽해야&amp;nbsp;한다는&amp;nbsp;강박은&amp;nbsp;시작을&amp;nbsp;망친다.&lt;/span&gt;&lt;/blockquote&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;</description>
      <category>대회</category>
      <category>cibmtr - equity in post-hct survival predictions</category>
      <category>캐글</category>
      <author>dongsunseng</author>
      <guid isPermaLink="true">https://dongsunseng.tistory.com/122</guid>
      <comments>https://dongsunseng.tistory.com/entry/CIBMTR-Equity-in-post-HCT-Survival-Predictions-11-ESP-EDA-which-makes-sense-%E2%AD%90%EF%B8%8F%E2%AD%90%EF%B8%8F%E2%AD%90%EF%B8%8F%E2%AD%90%EF%B8%8F%E2%AD%90%EF%B8%8F-AFT-Loss-func-sol-1#entry122comment</comments>
      <pubDate>Mon, 10 Feb 2025 19:46:34 +0900</pubDate>
    </item>
    <item>
      <title>CIBMTR - Equity in post-HCT Survival Predictions #10 A general Understanding for AFT Loss function</title>
      <link>https://dongsunseng.tistory.com/entry/CIBMTR-Equity-in-post-HCT-Survival-Predictions-10-A-general-Understanding-for-AFT-Loss-function</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;Annotation of the discussion about AFT loss function:&lt;/b&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;a href=&quot;https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550563&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550563&lt;/a&gt;&lt;/p&gt;
&lt;figure id=&quot;og_1738826089238&quot; contenteditable=&quot;false&quot; data-ke-type=&quot;opengraph&quot; data-ke-align=&quot;alignCenter&quot; data-og-type=&quot;website&quot; data-og-title=&quot;CIBMTR - Equity in post-HCT Survival Predictions&quot; data-og-description=&quot;Improve prediction of transplant survival rates equitably for allogeneic HCT patients&quot; data-og-host=&quot;www.kaggle.com&quot; data-og-source-url=&quot;https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550563&quot; data-og-url=&quot;https://kaggle.com/equity-post-HCT-survival-predictions&quot; data-og-image=&quot;https://scrap.kakaocdn.net/dn/cADcEO/hyYcjl22D4/uCI8IvFv2Eoq2XtdGk2Ibk/img.png?width=560&amp;amp;height=280&amp;amp;face=0_0_560_280,https://scrap.kakaocdn.net/dn/JGsJn/hyX7PAa7Jr/4rhFNx6QbFaNNnhKft3Fz0/img.png?width=560&amp;amp;height=280&amp;amp;face=0_0_560_280&quot;&gt;&lt;a href=&quot;https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550563&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot; data-source-url=&quot;https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550563&quot;&gt;
&lt;div class=&quot;og-image&quot; style=&quot;background-image: url('https://scrap.kakaocdn.net/dn/cADcEO/hyYcjl22D4/uCI8IvFv2Eoq2XtdGk2Ibk/img.png?width=560&amp;amp;height=280&amp;amp;face=0_0_560_280,https://scrap.kakaocdn.net/dn/JGsJn/hyX7PAa7Jr/4rhFNx6QbFaNNnhKft3Fz0/img.png?width=560&amp;amp;height=280&amp;amp;face=0_0_560_280');&quot;&gt;&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;og-text&quot;&gt;
&lt;p class=&quot;og-title&quot; data-ke-size=&quot;size16&quot;&gt;CIBMTR - Equity in post-HCT Survival Predictions&lt;/p&gt;
&lt;p class=&quot;og-desc&quot; data-ke-size=&quot;size16&quot;&gt;Improve prediction of transplant survival rates equitably for allogeneic HCT patients&lt;/p&gt;
&lt;p class=&quot;og-host&quot; data-ke-size=&quot;size16&quot;&gt;www.kaggle.com&lt;/p&gt;
&lt;/div&gt;
&lt;/a&gt;&lt;/figure&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style3&quot; /&gt;
&lt;h3 style=&quot;background-color: #ffffff; color: #202124; text-align: start;&quot; data-ke-size=&quot;size23&quot;&gt;&lt;b&gt;A general Understanding for AFT Loss function&lt;/b&gt;&lt;/h3&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;My notebook using AFT Loss function is&amp;nbsp;&lt;a href=&quot;https://www.kaggle.com/code/horikitasaku/cv0-665-cat-xgb-with-aft-loss-function?scriptVersionId=211842969&quot;&gt;[CV0.665 LB0.666]cat+xgb with AFT loss function&lt;/a&gt;&amp;nbsp;based on Dear&amp;nbsp;&lt;a href=&quot;https://www.kaggle.com/cdeotte&quot; data-user-name=&quot;cdeotte&quot; data-id=&quot;6740d699-9a74-4524-bcdf-23601cabb26a&quot;&gt;@cdeotte&lt;/a&gt;'s code, thanks!
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;My annotation on the kernel:&amp;nbsp;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;The&amp;nbsp;Accelerated Failure Time (AFT) model&amp;nbsp;is a parametric survival analysis model that describes how covariates influence the survival time of an event.&lt;/span&gt; &lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;Unlike Proportional Hazards (PH) models, including COX ph model, which assume covariates proportionally scale the hazard function, AFT models assume that covariates accelerate or decelerate the life course of a survival process by a multiplicative factor.&lt;/span&gt;&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;Detailed explanation about Proportional Hazards model vs. Accelerated Failure Time model&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;Proportional Hazard(PH) Model:&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;#&amp;nbsp;Hazard-based&amp;nbsp;approach&lt;br /&gt;#&amp;nbsp;Example:&amp;nbsp;Comparing&amp;nbsp;two&amp;nbsp;patients&lt;br /&gt;Patient&amp;nbsp;A's&amp;nbsp;hazard&amp;nbsp;=&amp;nbsp;baseline&amp;nbsp;hazard&amp;nbsp;&amp;times;&amp;nbsp;2.0&amp;nbsp;&amp;nbsp;#&amp;nbsp;2&amp;nbsp;times&amp;nbsp;riskier&amp;nbsp;than&amp;nbsp;baseline&lt;br /&gt;Patient&amp;nbsp;B's&amp;nbsp;hazard&amp;nbsp;=&amp;nbsp;baseline&amp;nbsp;hazard&amp;nbsp;&amp;times;&amp;nbsp;0.5&amp;nbsp;&amp;nbsp;#&amp;nbsp;0.5&amp;nbsp;times&amp;nbsp;riskier&amp;nbsp;than&amp;nbsp;baseline&lt;br /&gt;#&amp;nbsp;Feature:&amp;nbsp;Hazard&amp;nbsp;changes&amp;nbsp;proportionally&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Accelerated Failure Time(AFT) Model:&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;#&amp;nbsp;Survival&amp;nbsp;time-based&amp;nbsp;approach&lt;br /&gt;#&amp;nbsp;Example:&amp;nbsp;Comparing&amp;nbsp;two&amp;nbsp;patients&lt;br /&gt;Patient&amp;nbsp;A's&amp;nbsp;survival&amp;nbsp;time&amp;nbsp;=&amp;nbsp;baseline&amp;nbsp;survival&amp;nbsp;time&amp;nbsp;&amp;times;&amp;nbsp;0.5&amp;nbsp;&amp;nbsp;#&amp;nbsp;Progresses&amp;nbsp;2x&amp;nbsp;faster&amp;nbsp;than&amp;nbsp;baseline&lt;br /&gt;Patient&amp;nbsp;B's&amp;nbsp;survival&amp;nbsp;time&amp;nbsp;=&amp;nbsp;baseline&amp;nbsp;survival&amp;nbsp;time&amp;nbsp;&amp;times;&amp;nbsp;2.0&amp;nbsp;&amp;nbsp;#&amp;nbsp;Progresses&amp;nbsp;2x&amp;nbsp;slower&amp;nbsp;than&amp;nbsp;baseline&lt;br /&gt;#&amp;nbsp;Feature:&amp;nbsp;Time&amp;nbsp;scale&amp;nbsp;is&amp;nbsp;accelerated/decelerated&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Example:&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Situation:&amp;nbsp;Effect&amp;nbsp;of&amp;nbsp;a&amp;nbsp;specific&amp;nbsp;treatment&amp;nbsp;on&amp;nbsp;disease&amp;nbsp;progression&lt;br /&gt;&lt;b&gt;PH&amp;nbsp;Model&amp;nbsp;Interpretation:&lt;/b&gt;&lt;br /&gt;-&amp;nbsp;&quot;Patients&amp;nbsp;receiving&amp;nbsp;this&amp;nbsp;treatment&amp;nbsp;have&amp;nbsp;half&amp;nbsp;the&amp;nbsp;risk&amp;nbsp;of&amp;nbsp;death&quot;&lt;br /&gt;&lt;b&gt;AFT&amp;nbsp;Model&amp;nbsp;Interpretation:&lt;/b&gt;&lt;br /&gt;-&amp;nbsp;&quot;Disease&amp;nbsp;progression&amp;nbsp;is&amp;nbsp;2x&amp;nbsp;slower&amp;nbsp;in&amp;nbsp;patients&amp;nbsp;receiving&amp;nbsp;this&amp;nbsp;treatment&quot;&lt;br /&gt;-&amp;nbsp;i.e.,&amp;nbsp;it&amp;nbsp;takes&amp;nbsp;twice&amp;nbsp;as&amp;nbsp;long&amp;nbsp;to&amp;nbsp;reach&amp;nbsp;the&amp;nbsp;same&amp;nbsp;stage&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Key Differences:&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #99cefa;&quot;&gt;&lt;b&gt;PH Model: Focuses on hazard (risk)&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;AFT Model: Focuses on actual survival time&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #99cefa;&quot;&gt;&lt;b&gt;PH models &quot;how risky&quot;&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;AFT models &quot;how fast/slow it progresses&quot;&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-06 오후 4.16.02.png&quot; data-origin-width=&quot;1876&quot; data-origin-height=&quot;742&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/Bw209/btsL9n2pvhU/uJMegY76alQB7KEZUKuWhK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/Bw209/btsL9n2pvhU/uJMegY76alQB7KEZUKuWhK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/Bw209/btsL9n2pvhU/uJMegY76alQB7KEZUKuWhK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FBw209%2FbtsL9n2pvhU%2FuJMegY76alQB7KEZUKuWhK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1876&quot; height=&quot;742&quot; data-filename=&quot;스크린샷 2025-02-06 오후 4.16.02.png&quot; data-origin-width=&quot;1876&quot; data-origin-height=&quot;742&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;Detailed Explanation:&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;Basic Model Equation:&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;log(T)&amp;nbsp;=&amp;nbsp;X&amp;beta;&amp;nbsp;+&amp;nbsp;&amp;epsilon;&lt;br /&gt;where:&lt;br /&gt;T&amp;nbsp;=&amp;nbsp;survival&amp;nbsp;time&lt;br /&gt;X&amp;nbsp;=&amp;nbsp;feature&amp;nbsp;variables&amp;nbsp;(age,&amp;nbsp;gender,&amp;nbsp;disease&amp;nbsp;status,&amp;nbsp;etc.)&lt;br /&gt;&amp;beta;&amp;nbsp;=&amp;nbsp;coefficients&amp;nbsp;for&amp;nbsp;each&amp;nbsp;feature&amp;nbsp;(impact)&lt;br /&gt;&amp;epsilon;&amp;nbsp;=&amp;nbsp;error&amp;nbsp;term&amp;nbsp;(random&amp;nbsp;variable&amp;nbsp;following&amp;nbsp;probability&amp;nbsp;distribution)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Acceleration Factor:&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;&amp;theta;&amp;nbsp;=&amp;nbsp;exp(-X&amp;beta;)&lt;/b&gt;&lt;br /&gt;#&amp;nbsp;Interpretation:&lt;br /&gt;&lt;b&gt;&lt;span style=&quot;background-color: #99cefa;&quot;&gt;&amp;theta;&amp;nbsp;&amp;gt;&amp;nbsp;1:&amp;nbsp;survival&amp;nbsp;time&amp;nbsp;decreases&amp;nbsp;(disease&amp;nbsp;progresses&amp;nbsp;faster)&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;&lt;b&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;&amp;theta;&amp;nbsp;&amp;lt;&amp;nbsp;1:&amp;nbsp;survival&amp;nbsp;time&amp;nbsp;increases&amp;nbsp;(disease&amp;nbsp;progresses&amp;nbsp;slower)&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Example:&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;#&amp;nbsp;Example:&amp;nbsp;Modeling&amp;nbsp;treatment&amp;nbsp;effect&lt;br /&gt;X&amp;nbsp;=&amp;nbsp;[treatment_dose]&lt;br /&gt;&amp;beta;&amp;nbsp;=&amp;nbsp;-0.7&amp;nbsp;&amp;nbsp;#&amp;nbsp;assumed&amp;nbsp;coefficient&lt;/li&gt;
&lt;li&gt;#&amp;nbsp;When&amp;nbsp;treatment&amp;nbsp;dose&amp;nbsp;is&amp;nbsp;1&amp;nbsp;unit&lt;br /&gt;&amp;theta;&amp;nbsp;=&amp;nbsp;exp(-1&amp;nbsp;&amp;times;&amp;nbsp;-0.7)&amp;nbsp;=&amp;nbsp;exp(0.7)&amp;nbsp;&amp;asymp;&amp;nbsp;&lt;b&gt;2.01&lt;/b&gt;&lt;br /&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;#&amp;nbsp;Interpretation:&amp;nbsp;1&amp;nbsp;unit&amp;nbsp;of&amp;nbsp;treatment&amp;nbsp;doubles&amp;nbsp;survival&amp;nbsp;time&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;#&amp;nbsp;When&amp;nbsp;treatment&amp;nbsp;dose&amp;nbsp;is&amp;nbsp;2&amp;nbsp;units&lt;br /&gt;&amp;theta;&amp;nbsp;=&amp;nbsp;exp(-2&amp;nbsp;&amp;times;&amp;nbsp;-0.7)&amp;nbsp;=&amp;nbsp;exp(1.4)&amp;nbsp;&amp;asymp;&amp;nbsp;&lt;b&gt;4.06&lt;/b&gt;&lt;br /&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;#&amp;nbsp;Interpretation:&amp;nbsp;2&amp;nbsp;units&amp;nbsp;of&amp;nbsp;treatment&amp;nbsp;quadruples&amp;nbsp;survival&amp;nbsp;time&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Key Features:&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;Reasons for modeling log(T):&lt;/span&gt;&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Survival time is always positive&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;Log transformation better satisfies normality assumption&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;Interpretation becomes easier with multiplicative effects&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-06 오후 4.16.35.png&quot; data-origin-width=&quot;1940&quot; data-origin-height=&quot;822&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/A1Xwd/btsL92DA9Tt/AIf3jtydDRr85V4HawkHGk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/A1Xwd/btsL92DA9Tt/AIf3jtydDRr85V4HawkHGk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/A1Xwd/btsL92DA9Tt/AIf3jtydDRr85V4HawkHGk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FA1Xwd%2FbtsL92DA9Tt%2FAIf3jtydDRr85V4HawkHGk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1940&quot; height=&quot;822&quot; data-filename=&quot;스크린샷 2025-02-06 오후 4.16.35.png&quot; data-origin-width=&quot;1940&quot; data-origin-height=&quot;822&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;Detailed Explanation:&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;Component Explanations:&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;tᵢ:&amp;nbsp;observed&amp;nbsp;survival&amp;nbsp;time&lt;br /&gt;&amp;delta;ᵢ:&amp;nbsp;event&amp;nbsp;occurrence&amp;nbsp;indicator&amp;nbsp;(1=occurred,&amp;nbsp;0=censored)&lt;br /&gt;&amp;mu;ᵢ&amp;nbsp;=&amp;nbsp;Xᵢ&amp;beta;:&amp;nbsp;predicted&amp;nbsp;log-survival&amp;nbsp;time&lt;br /&gt;&amp;sigma;:&amp;nbsp;scale&amp;nbsp;parameter&amp;nbsp;controlling&amp;nbsp;variance&lt;br /&gt;f(t):&amp;nbsp;probability&amp;nbsp;density&amp;nbsp;function&amp;nbsp;(PDF)&lt;br /&gt;S(t):&amp;nbsp;survival&amp;nbsp;function&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;How log function works:&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;When event occurs (&amp;delta;ᵢ = 1):&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Loss&amp;nbsp;=&amp;nbsp;-log&amp;nbsp;f(tᵢ;&amp;nbsp;&amp;mu;ᵢ,&amp;nbsp;&amp;sigma;)&lt;br /&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;#&amp;nbsp;Tries&amp;nbsp;to&amp;nbsp;maximize&amp;nbsp;PDF&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;#&amp;nbsp;Learns&amp;nbsp;to&amp;nbsp;increase&amp;nbsp;probability&amp;nbsp;density&amp;nbsp;at&amp;nbsp;actual&amp;nbsp;occurrence&amp;nbsp;time&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;&lt;b&gt;&lt;b&gt;PDF&lt;/b&gt;&lt;/b&gt;&lt;/b&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;&lt;b&gt;&lt;b&gt;&lt;span style=&quot;background-color: #c0d1e7;&quot;&gt;PDF represents the probability density of an event occurring at a specific time point&lt;/span&gt;&lt;br /&gt;&lt;/b&gt;&lt;/b&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;b&gt;&lt;b&gt;Example:&amp;nbsp;When&amp;nbsp;a&amp;nbsp;patient&amp;nbsp;dies&amp;nbsp;on&amp;nbsp;day&amp;nbsp;100&lt;br /&gt;&lt;/b&gt;&lt;/b&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;b&gt;&lt;b&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;pdf(t)&amp;nbsp;=&amp;nbsp;density&amp;nbsp;of&amp;nbsp;probability&amp;nbsp;of&amp;nbsp;death&amp;nbsp;at&amp;nbsp;a&amp;nbsp;specific&amp;nbsp;time&amp;nbsp;t&lt;/span&gt;&lt;br /&gt;&lt;/b&gt;&lt;/b&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;b&gt;&lt;b&gt;High pdf value at t=100 = high probability of death around day 100&lt;/b&gt;&lt;/b&gt;&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;b&gt;&lt;b&gt;When censored (&amp;delta;ᵢ = 0):&lt;/b&gt;&lt;/b&gt;&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Loss&amp;nbsp;=&amp;nbsp;-log&amp;nbsp;S(tᵢ;&amp;nbsp;&amp;mu;ᵢ,&amp;nbsp;&amp;sigma;)&lt;br /&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;#&amp;nbsp;Tries&amp;nbsp;to&amp;nbsp;maximize&amp;nbsp;survival&amp;nbsp;function&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;#&amp;nbsp;Learns&amp;nbsp;to&amp;nbsp;increase&amp;nbsp;probability&amp;nbsp;of&amp;nbsp;survival&amp;nbsp;beyond&amp;nbsp;observed&amp;nbsp;time&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;Survival function S&lt;/b&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;S(t)&amp;nbsp;=&amp;nbsp;P(T&amp;nbsp;&amp;gt;&amp;nbsp;t)&amp;nbsp;=&amp;nbsp;probability&amp;nbsp;of&amp;nbsp;survival&amp;nbsp;beyond&amp;nbsp;time&amp;nbsp;point&amp;nbsp;t&lt;/span&gt;&lt;br /&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Characteristics:&lt;br /&gt;&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;Decreasing function over time (monotonically decreasing)&lt;br /&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Initial&amp;nbsp;value&amp;nbsp;S(0)&amp;nbsp;=&amp;nbsp;1&amp;nbsp;(everyone&amp;nbsp;is&amp;nbsp;alive&amp;nbsp;at&amp;nbsp;time&amp;nbsp;0)&lt;br /&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;When&amp;nbsp;time&amp;nbsp;approaches&amp;nbsp;infinity,&amp;nbsp;S(&amp;infin;)&amp;nbsp;=&amp;nbsp;0&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Example:&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;#&amp;nbsp;Patient&amp;nbsp;A:&amp;nbsp;death&amp;nbsp;at&amp;nbsp;day&amp;nbsp;100&amp;nbsp;(&amp;delta;&amp;nbsp;=&amp;nbsp;1)&lt;br /&gt;Loss_A&amp;nbsp;=&amp;nbsp;-log&amp;nbsp;f(100;&amp;nbsp;&amp;mu;_A,&amp;nbsp;&amp;sigma;)&lt;br /&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;#&amp;nbsp;Learns&amp;nbsp;to&amp;nbsp;increase&amp;nbsp;probability&amp;nbsp;of&amp;nbsp;death&amp;nbsp;at&amp;nbsp;day&amp;nbsp;100&lt;/span&gt;&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;In this case, we know the exact time of death&lt;/li&gt;
&lt;li&gt;So model learns to predict high probability of death around day 100&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;Thus, &quot;to make accurate predictions at the actual occurrence time&quot;, we &quot;learn to increase probability density at actual occurrence time.&quot;&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;# Patient B: censored at day 80 (&amp;delta; = 0)&lt;br /&gt;Loss_B&amp;nbsp;=&amp;nbsp;-log&amp;nbsp;S(80;&amp;nbsp;&amp;mu;_B,&amp;nbsp;&amp;sigma;)&lt;br /&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;#&amp;nbsp;Learns&amp;nbsp;to&amp;nbsp;increase&amp;nbsp;probability&amp;nbsp;of&amp;nbsp;survival&amp;nbsp;beyond&amp;nbsp;day&amp;nbsp;80&lt;/span&gt;&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;In this case, we don't know when death occurred after day 80&lt;/li&gt;
&lt;li&gt;we only know for certain they survived until day 80&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;Thus, it is reasonable to increase probability of survival beyond day 80&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;It is c&lt;span&gt;orrect to decrease probability of death before day &lt;/span&gt;&lt;span style=&quot;color: #000000;&quot;&gt;80 and&lt;/span&gt;&lt;span&gt; increase survival probability after&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Key points:&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;This loss function properly handles censored data&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;&lt;b&gt;Considers both PDF and survival function for more accurate predictions&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;Choice of &amp;epsilon; (random term) distribution affects baseline survival time T₀&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-06 오후 4.21.53.png&quot; data-origin-width=&quot;1938&quot; data-origin-height=&quot;756&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bghn6D/btsL9La42Dm/xspibw9qHMsLMB0OSbM6OK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bghn6D/btsL9La42Dm/xspibw9qHMsLMB0OSbM6OK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bghn6D/btsL9La42Dm/xspibw9qHMsLMB0OSbM6OK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fbghn6D%2FbtsL9La42Dm%2Fxspibw9qHMsLMB0OSbM6OK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1938&quot; height=&quot;756&quot; data-filename=&quot;스크린샷 2025-02-06 오후 4.21.53.png&quot; data-origin-width=&quot;1938&quot; data-origin-height=&quot;756&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;Basic Assumption:&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&amp;epsilon; ~ N(0, &amp;sigma;&amp;sup2;)&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span style=&quot;color: #5c6370;&quot;&gt;# Error term follows normal distribution&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;b&gt;This means&lt;span style=&quot;color: #abb2bf;&quot;&gt;:&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;log&lt;span style=&quot;color: #abb2bf;&quot;&gt;(&lt;/span&gt;survival time&lt;span style=&quot;color: #abb2bf;&quot;&gt;)&lt;/span&gt; follows normal distribution&lt;/span&gt;&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;&lt;span&gt;&lt;span&gt;WHY???&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span&gt;&lt;span&gt;log(T)&amp;nbsp;=&amp;nbsp;X&amp;beta;&amp;nbsp;+&amp;nbsp;&amp;epsilon;&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;When&amp;nbsp;Y&amp;nbsp;=&amp;nbsp;a&amp;nbsp;+&amp;nbsp;bX&lt;br /&gt;-&amp;nbsp;If&amp;nbsp;X&amp;nbsp;follows&amp;nbsp;normal&amp;nbsp;distribution&amp;nbsp;N(&amp;mu;,&amp;nbsp;&amp;sigma;&amp;sup2;)&lt;br /&gt;-&amp;nbsp;Then&amp;nbsp;Y&amp;nbsp;follows&amp;nbsp;normal&amp;nbsp;distribution&amp;nbsp;N(a&amp;nbsp;+&amp;nbsp;b&amp;mu;,&amp;nbsp;b&amp;sup2;&amp;sigma;&amp;sup2;)&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;span&gt;&lt;span&gt;log(T)&amp;nbsp;=&amp;nbsp;X&amp;beta;&amp;nbsp;+&amp;nbsp;&amp;epsilon;&lt;br /&gt;#&amp;nbsp;Since&amp;nbsp;&amp;epsilon;&amp;nbsp;follows&amp;nbsp;N(0,&amp;nbsp;&amp;sigma;&amp;sup2;)&lt;br /&gt;#&amp;nbsp;log(T)&amp;nbsp;follows&amp;nbsp;N(X&amp;beta;,&amp;nbsp;&amp;sigma;&amp;sup2;)&lt;br /&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;Because:&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;-&amp;nbsp;X&amp;beta;&amp;nbsp;is&amp;nbsp;constant&amp;nbsp;term&amp;nbsp;(mean&amp;nbsp;shift)&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;-&amp;nbsp;Coefficient&amp;nbsp;of&amp;nbsp;&amp;epsilon;&amp;nbsp;is&amp;nbsp;1&amp;nbsp;(variance&amp;nbsp;remains&amp;nbsp;same)&lt;/b&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span&gt;&lt;span&gt;Actual survival time follows log&lt;/span&gt;&lt;span style=&quot;color: #61afef;&quot;&gt;-&lt;/span&gt;&lt;span&gt;symmetric distribution&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Probability Density Function (PDF):&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;f(t;&amp;nbsp;&amp;mu;,&amp;nbsp;&amp;sigma;)&amp;nbsp;=&amp;nbsp;(1/t&amp;sigma;&amp;radic;2&amp;pi;)&amp;nbsp;*&amp;nbsp;exp(-(log(t)-&amp;mu;)&amp;sup2;/2&amp;sigma;&amp;sup2;)&lt;br /&gt;Components:&lt;br /&gt;-&amp;nbsp;t:&amp;nbsp;observed&amp;nbsp;time&lt;br /&gt;-&amp;nbsp;&amp;mu;:&amp;nbsp;predicted&amp;nbsp;log-survival&amp;nbsp;time&amp;nbsp;(X&amp;beta;)&lt;br /&gt;-&amp;nbsp;&amp;sigma;:&amp;nbsp;parameter&amp;nbsp;controlling&amp;nbsp;variance&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Survival Function:&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;S(t;&amp;nbsp;&amp;mu;,&amp;nbsp;&amp;sigma;)&amp;nbsp;=&amp;nbsp;1&amp;nbsp;-&amp;nbsp;&amp;Phi;((log(t)-&amp;mu;)/&amp;sigma;)&lt;br /&gt;where:&lt;br /&gt;-&amp;nbsp;&amp;Phi;:&amp;nbsp;cumulative&amp;nbsp;distribution&amp;nbsp;function&amp;nbsp;(CDF)&amp;nbsp;of&amp;nbsp;standard&amp;nbsp;normal&amp;nbsp;distribution&lt;br /&gt;-&amp;nbsp;Represents&amp;nbsp;probability&amp;nbsp;of&amp;nbsp;survival&amp;nbsp;beyond&amp;nbsp;time&amp;nbsp;point&amp;nbsp;t&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Use Cases:&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #99cefa;&quot;&gt;#&amp;nbsp;Suitable&amp;nbsp;cases:&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;&lt;b&gt;&lt;span style=&quot;background-color: #99cefa;&quot;&gt;-&amp;nbsp;Symmetrically&amp;nbsp;distributed&amp;nbsp;survival&amp;nbsp;times&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;&lt;b&gt;&lt;span style=&quot;background-color: #99cefa;&quot;&gt;-&amp;nbsp;Constant&amp;nbsp;variability&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;Example:&amp;nbsp;Component&amp;nbsp;lifetime&amp;nbsp;in&amp;nbsp;manufacturing&lt;/li&gt;
&lt;li&gt;#&amp;nbsp;Unsuitable&amp;nbsp;cases:&lt;br /&gt;-&amp;nbsp;Distributions&amp;nbsp;with&amp;nbsp;very&amp;nbsp;long&amp;nbsp;tails&lt;br /&gt;-&amp;nbsp;Highly&amp;nbsp;asymmetric&amp;nbsp;distributions&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Advantages:&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Intuitive interpretation&lt;/li&gt;
&lt;li&gt;Relatively simple calculations&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;Good fit for symmetric survival time data&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Real Example:&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;#&amp;nbsp;Predicting&amp;nbsp;medical&amp;nbsp;device&amp;nbsp;lifetime&lt;br /&gt;survival_time&amp;nbsp;=&amp;nbsp;exp(X&amp;beta;&amp;nbsp;+&amp;nbsp;&amp;epsilon;)&lt;br /&gt;&amp;epsilon;&amp;nbsp;~&amp;nbsp;N(0,&amp;nbsp;&amp;sigma;&amp;sup2;)&lt;br /&gt;#&amp;nbsp;This&amp;nbsp;means&amp;nbsp;lifetime&amp;nbsp;follows&amp;nbsp;log-normal&amp;nbsp;distribution&lt;br /&gt;#&amp;nbsp;i.e.,&amp;nbsp;log(lifetime)&amp;nbsp;follows&amp;nbsp;normal&amp;nbsp;distribution&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-06 오후 4.22.24.png&quot; data-origin-width=&quot;1930&quot; data-origin-height=&quot;592&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/b7GArq/btsL8eyEl11/wNLMj9G1OBTg3Zkwvf4MD0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/b7GArq/btsL8eyEl11/wNLMj9G1OBTg3Zkwvf4MD0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/b7GArq/btsL8eyEl11/wNLMj9G1OBTg3Zkwvf4MD0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fb7GArq%2FbtsL8eyEl11%2FwNLMj9G1OBTg3Zkwvf4MD0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1930&quot; height=&quot;592&quot; data-filename=&quot;스크린샷 2025-02-06 오후 4.22.24.png&quot; data-origin-width=&quot;1930&quot; data-origin-height=&quot;592&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;Basic Assumption:&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&amp;epsilon;&amp;nbsp;~&amp;nbsp;Log-Normal(&amp;mu;,&amp;nbsp;&amp;sigma;&amp;sup2;)&lt;br /&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;#&amp;nbsp;Error&amp;nbsp;term&amp;nbsp;follows&amp;nbsp;log-normal&amp;nbsp;distribution&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;#&amp;nbsp;This&amp;nbsp;means&amp;nbsp;survival&amp;nbsp;time&amp;nbsp;T&amp;nbsp;directly&amp;nbsp;follows&amp;nbsp;log-normal&amp;nbsp;distribution&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Log-Normal Distribution Characteristics:&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;#&amp;nbsp;Properties:&lt;br /&gt;&lt;b&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;-&amp;nbsp;Only&amp;nbsp;takes&amp;nbsp;positive&amp;nbsp;values&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;-&amp;nbsp;Has&amp;nbsp;a&amp;nbsp;heavy&amp;nbsp;right&amp;nbsp;tail&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;-&amp;nbsp;Asymmetric&amp;nbsp;distribution&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Difference between AFT:Normal and AFT:Log:&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;AFT:Normal&lt;br /&gt;-&amp;nbsp;log(survival&amp;nbsp;time)&amp;nbsp;follows&amp;nbsp;normal&amp;nbsp;distribution&lt;br /&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;-&amp;nbsp;Survival&amp;nbsp;time&amp;nbsp;is&amp;nbsp;symmetrically&amp;nbsp;distributed&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;-&amp;nbsp;Example:&amp;nbsp;Manufacturing&amp;nbsp;component&amp;nbsp;lifetime&lt;/li&gt;
&lt;li&gt;AFT:Log&lt;br /&gt;-&amp;nbsp;Survival&amp;nbsp;time&amp;nbsp;directly&amp;nbsp;follows&amp;nbsp;log-normal&amp;nbsp;distribution&lt;br /&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;-&amp;nbsp;Survival&amp;nbsp;time&amp;nbsp;is&amp;nbsp;asymmetrically&amp;nbsp;distributed&amp;nbsp;(long&amp;nbsp;tail)&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;-&amp;nbsp;Example:&amp;nbsp;Cancer&amp;nbsp;patient&amp;nbsp;survival&amp;nbsp;period&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Use Cases:&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;#&amp;nbsp;Suitable&amp;nbsp;cases:&lt;br /&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;-&amp;nbsp;When&amp;nbsp;some&amp;nbsp;patients&amp;nbsp;survive&amp;nbsp;much&amp;nbsp;longer&amp;nbsp;than&amp;nbsp;others&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;-&amp;nbsp;Biological&amp;nbsp;processes&amp;nbsp;or&amp;nbsp;reliability&amp;nbsp;data&lt;br /&gt;-&amp;nbsp;Cancer&amp;nbsp;patient&amp;nbsp;survival&amp;nbsp;analysis&lt;br /&gt;#&amp;nbsp;Reasons:&lt;br /&gt;-&amp;nbsp;Most&amp;nbsp;show&amp;nbsp;similar&amp;nbsp;survival&amp;nbsp;periods&amp;nbsp;but&lt;br /&gt;-&amp;nbsp;Some&amp;nbsp;show&amp;nbsp;very&amp;nbsp;long&amp;nbsp;survival&amp;nbsp;periods&lt;br /&gt;-&amp;nbsp;Can&amp;nbsp;model&amp;nbsp;such&amp;nbsp;long-tail&amp;nbsp;distributions&amp;nbsp;well&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-06 오후 4.22.39.png&quot; data-origin-width=&quot;1412&quot; data-origin-height=&quot;744&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/cx1kzp/btsL8NVejM9/80enrsM9BKLXb4foSwa8Uk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/cx1kzp/btsL8NVejM9/80enrsM9BKLXb4foSwa8Uk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/cx1kzp/btsL8NVejM9/80enrsM9BKLXb4foSwa8Uk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fcx1kzp%2FbtsL8NVejM9%2F80enrsM9BKLXb4foSwa8Uk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;702&quot; height=&quot;370&quot; data-filename=&quot;스크린샷 2025-02-06 오후 4.22.39.png&quot; data-origin-width=&quot;1412&quot; data-origin-height=&quot;744&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;h4 style=&quot;color: #202124; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;In simple terms&lt;/b&gt;&lt;/h4&gt;
&lt;p style=&quot;color: #3c4043; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;In simple terms, AFT assumes that different factors (i.e., input variables or features of the model) affect the rate at which events occur by &quot;stretching&quot; or &quot;compressing&quot; the timeline.&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p style=&quot;color: #3c4043; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;br /&gt;It's like adjusting the playback speed while watching a video:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc; color: #3c4043; text-align: start;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Double speed play (fast forward) : Time speeds up, events happen faster.&lt;/li&gt;
&lt;li&gt;Slow play (slow down) : Time slows down and events occur later.&lt;/li&gt;
&lt;/ul&gt;
&lt;p style=&quot;color: #3c4043; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;Imagine you're studying the survival time of two cancer patients:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc; color: #3c4043; text-align: start;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;Patient A&lt;/b&gt;&lt;span&gt;&amp;nbsp;&lt;/span&gt;receives standard treatment.&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Patient B&lt;/b&gt;&lt;span&gt;&amp;nbsp;&lt;/span&gt;receives a new experimental treatment.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 style=&quot;color: #202124; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;Case 1:&lt;span&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;text-align: left;&quot; data-mathml=&quot;&amp;lt;math xmlns=&amp;quot;http://www.w3.org/1998/Math/MathML&amp;quot;&amp;gt;&amp;lt;mi&amp;gt;&amp;amp;#x03B8;&amp;lt;/mi&amp;gt;&amp;lt;mo&amp;gt;=&amp;lt;/mo&amp;gt;&amp;lt;mn&amp;gt;0.5&amp;lt;/mn&amp;gt;&amp;lt;/math&amp;gt;&quot;&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;0.5 &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;(Acceleration Factor &amp;lt; 1)&lt;/b&gt;&lt;/h4&gt;
&lt;p style=&quot;color: #3c4043; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;This means the timeline is stretched by 2x for Patient B compared to Patient A.&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc; color: #3c4043; text-align: start;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;If Patient A survives for&amp;nbsp;&lt;b&gt;1 year&lt;/b&gt;, Patient B is expected to survive for&amp;nbsp;&lt;b&gt;2 years&lt;/b&gt;&amp;nbsp;under the new treatment.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 style=&quot;color: #202124; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;Case 2:&lt;span&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;text-align: left;&quot; data-mathml=&quot;&amp;lt;math xmlns=&amp;quot;http://www.w3.org/1998/Math/MathML&amp;quot;&amp;gt;&amp;lt;mi&amp;gt;&amp;amp;#x03B8;&amp;lt;/mi&amp;gt;&amp;lt;mo&amp;gt;=&amp;lt;/mo&amp;gt;&amp;lt;mn&amp;gt;2&amp;lt;/mn&amp;gt;&amp;lt;/math&amp;gt;&quot;&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;2 &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;(Acceleration Factor &amp;gt; 1)&lt;/b&gt;&lt;/h4&gt;
&lt;p style=&quot;color: #3c4043; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;This means the timeline is compressed for Patient B, reducing their survival time by half.&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc; color: #3c4043; text-align: start;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;If Patient A survives for&amp;nbsp;&lt;b&gt;1 year&lt;/b&gt;, Patient B is expected to survive for&amp;nbsp;&lt;b&gt;6 months&lt;/b&gt;.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style5&quot; /&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;압력&amp;nbsp;없이는&amp;nbsp;다이아몬드가&amp;nbsp;만들어지지&amp;nbsp;않는다&lt;br /&gt;&lt;/span&gt;- 토마스 칼라일 -&lt;/blockquote&gt;</description>
      <category>대회</category>
      <category>a general understanding for aft loss function</category>
      <category>cibmtr - equity in post-hct survival predictions</category>
      <category>캐글</category>
      <author>dongsunseng</author>
      <guid isPermaLink="true">https://dongsunseng.tistory.com/120</guid>
      <comments>https://dongsunseng.tistory.com/entry/CIBMTR-Equity-in-post-HCT-Survival-Predictions-10-A-general-Understanding-for-AFT-Loss-function#entry120comment</comments>
      <pubDate>Thu, 6 Feb 2025 22:17:59 +0900</pubDate>
    </item>
    <item>
      <title>CIBMTR - Equity in post-HCT Survival Predictions #9 NN Starter Notebook</title>
      <link>https://dongsunseng.tistory.com/entry/CIBMTR-Equity-in-post-HCT-Survival-Predictions-9-NN-Starter-Notebook</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;Annotation of discussion and kernel for NN Solution from Chris Deotte&lt;/b&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;Discussion Link:&lt;/b&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;a href=&quot;https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550343&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550343&lt;/a&gt;&lt;/b&gt;&lt;/p&gt;
&lt;figure id=&quot;og_1738801248200&quot; contenteditable=&quot;false&quot; data-ke-type=&quot;opengraph&quot; data-ke-align=&quot;alignCenter&quot; data-og-type=&quot;website&quot; data-og-title=&quot;CIBMTR - Equity in post-HCT Survival Predictions&quot; data-og-description=&quot;Improve prediction of transplant survival rates equitably for allogeneic HCT patients&quot; data-og-host=&quot;www.kaggle.com&quot; data-og-source-url=&quot;https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550343&quot; data-og-url=&quot;https://kaggle.com/equity-post-HCT-survival-predictions&quot; data-og-image=&quot;https://scrap.kakaocdn.net/dn/82hMf/hyX7TbpIXx/cbR2hgxArSjZ6wT2UVw6I1/img.png?width=560&amp;amp;height=280&amp;amp;face=0_0_560_280,https://scrap.kakaocdn.net/dn/ca4Pov/hyX727grls/Sz4cht3ZDEAsjK7PnOSqIK/img.png?width=560&amp;amp;height=280&amp;amp;face=0_0_560_280&quot;&gt;&lt;a href=&quot;https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550343&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot; data-source-url=&quot;https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550343&quot;&gt;
&lt;div class=&quot;og-image&quot; style=&quot;background-image: url('https://scrap.kakaocdn.net/dn/82hMf/hyX7TbpIXx/cbR2hgxArSjZ6wT2UVw6I1/img.png?width=560&amp;amp;height=280&amp;amp;face=0_0_560_280,https://scrap.kakaocdn.net/dn/ca4Pov/hyX727grls/Sz4cht3ZDEAsjK7PnOSqIK/img.png?width=560&amp;amp;height=280&amp;amp;face=0_0_560_280');&quot;&gt;&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;og-text&quot;&gt;
&lt;p class=&quot;og-title&quot; data-ke-size=&quot;size16&quot;&gt;CIBMTR - Equity in post-HCT Survival Predictions&lt;/p&gt;
&lt;p class=&quot;og-desc&quot; data-ke-size=&quot;size16&quot;&gt;Improve prediction of transplant survival rates equitably for allogeneic HCT patients&lt;/p&gt;
&lt;p class=&quot;og-host&quot; data-ke-size=&quot;size16&quot;&gt;www.kaggle.com&lt;/p&gt;
&lt;/div&gt;
&lt;/a&gt;&lt;/figure&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;Kernel Link:&lt;/b&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;a href=&quot;https://www.kaggle.com/code/cdeotte/nn-mlp-baseline-cv-670-lb-676&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://www.kaggle.com/code/cdeotte/nn-mlp-baseline-cv-670-lb-676&lt;/a&gt;&lt;/p&gt;
&lt;figure id=&quot;og_1738801278895&quot; contenteditable=&quot;false&quot; data-ke-type=&quot;opengraph&quot; data-ke-align=&quot;alignCenter&quot; data-og-type=&quot;website&quot; data-og-title=&quot;NN (MLP) Baseline - [CV 670 LB 676]&quot; data-og-description=&quot;Explore and run machine learning code with Kaggle Notebooks | Using data from CIBMTR - Equity in post-HCT Survival Predictions&quot; data-og-host=&quot;www.kaggle.com&quot; data-og-source-url=&quot;https://www.kaggle.com/code/cdeotte/nn-mlp-baseline-cv-670-lb-676&quot; data-og-url=&quot;https://www.kaggle.com/code/cdeotte/nn-mlp-baseline-cv-670-lb-676&quot; data-og-image=&quot;&quot;&gt;&lt;a href=&quot;https://www.kaggle.com/code/cdeotte/nn-mlp-baseline-cv-670-lb-676&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot; data-source-url=&quot;https://www.kaggle.com/code/cdeotte/nn-mlp-baseline-cv-670-lb-676&quot;&gt;
&lt;div class=&quot;og-image&quot; style=&quot;background-image: url();&quot;&gt;&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;og-text&quot;&gt;
&lt;p class=&quot;og-title&quot; data-ke-size=&quot;size16&quot;&gt;NN (MLP) Baseline - [CV 670 LB 676]&lt;/p&gt;
&lt;p class=&quot;og-desc&quot; data-ke-size=&quot;size16&quot;&gt;Explore and run machine learning code with Kaggle Notebooks | Using data from CIBMTR - Equity in post-HCT Survival Predictions&lt;/p&gt;
&lt;p class=&quot;og-host&quot; data-ke-size=&quot;size16&quot;&gt;www.kaggle.com&lt;/p&gt;
&lt;/div&gt;
&lt;/a&gt;&lt;/figure&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style3&quot; /&gt;
&lt;h3 style=&quot;background-color: #ffffff; color: #202124; text-align: start;&quot; data-ke-size=&quot;size23&quot;&gt;&lt;b&gt;NN Starter Notebook CV 0.670 LB 0.676 (Discussion)&lt;/b&gt;&lt;/h3&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;I published a starter notebook NN which uses the following simple architecture. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;Consider improving architecture to boost CV and LB score!&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-06 오전 9.21.59.png&quot; data-origin-width=&quot;873&quot; data-origin-height=&quot;562&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/ek6Mtc/btsL8MgUpCa/5gzF3HUK0C0ZoC9GUBDo10/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/ek6Mtc/btsL8MgUpCa/5gzF3HUK0C0ZoC9GUBDo10/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/ek6Mtc/btsL8MgUpCa/5gzF3HUK0C0ZoC9GUBDo10/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fek6Mtc%2FbtsL8MgUpCa%2F5gzF3HUK0C0ZoC9GUBDo10%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;657&quot; height=&quot;423&quot; data-filename=&quot;스크린샷 2025-02-06 오전 9.21.59.png&quot; data-origin-width=&quot;873&quot; data-origin-height=&quot;562&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;h4 style=&quot;color: #202124; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;Preprocessing&lt;/b&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;There are 57 features with 35 categorical and 22 numerical.&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;The majority of numerical features appear to be like categorical features with their low unique value count.&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;Therefore in my NN starter, I convert 55 features into categorical leaving only&amp;nbsp;donor_age&amp;nbsp;and&amp;nbsp;act_at_hct as numerical.&lt;/span&gt; &lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #c0d1e7;&quot;&gt;&lt;b&gt;For each categorical, we label encode.&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;In the NN architecture, we use &lt;b&gt;&lt;span style=&quot;background-color: #c0d1e7;&quot;&gt;embeddings for each categorical features. &lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;For each categorical feature, the embedding input size is of course the number of unique values.&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #99cefa;&quot;&gt;&lt;b&gt;The embedding output size is&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&lt;b&gt;sqrt(number unique)+1.&lt;/b&gt; &lt;/span&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;About embedding&lt;/span&gt;&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The process of converting categorical data into meaningful continuous vectors&lt;/li&gt;
&lt;li&gt;Categories with similar meanings learn to have similar vector values&lt;/li&gt;
&lt;li&gt;Examples:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;#&amp;nbsp;Example:&amp;nbsp;Disease&amp;nbsp;Type&amp;nbsp;Categories&lt;br /&gt;Disease&amp;nbsp;A&amp;nbsp;=&amp;nbsp;[0.2,&amp;nbsp;0.8,&amp;nbsp;-0.3]&amp;nbsp;&amp;nbsp;#&amp;nbsp;Converted&amp;nbsp;to&amp;nbsp;3D&amp;nbsp;vector&lt;br /&gt;Disease&amp;nbsp;B&amp;nbsp;=&amp;nbsp;[0.3,&amp;nbsp;0.7,&amp;nbsp;-0.2]&amp;nbsp;&amp;nbsp;#&amp;nbsp;Similar&amp;nbsp;diseases&amp;nbsp;have&amp;nbsp;similar&amp;nbsp;vector&amp;nbsp;values&lt;br /&gt;Disease&amp;nbsp;C&amp;nbsp;=&amp;nbsp;[-0.8,&amp;nbsp;-0.2,&amp;nbsp;0.9]&amp;nbsp;#&amp;nbsp;Different&amp;nbsp;types&amp;nbsp;have&amp;nbsp;different&amp;nbsp;vector&amp;nbsp;values&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Limitations of one-hot encoding:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;#&amp;nbsp;One-Hot&amp;nbsp;Encoding&lt;br /&gt;Disease&amp;nbsp;A&amp;nbsp;=&amp;nbsp;[1,&amp;nbsp;0,&amp;nbsp;0,&amp;nbsp;0]&lt;br /&gt;Disease&amp;nbsp;B&amp;nbsp;=&amp;nbsp;[0,&amp;nbsp;1,&amp;nbsp;0,&amp;nbsp;0]&lt;br /&gt;Disease&amp;nbsp;C&amp;nbsp;=&amp;nbsp;[0,&amp;nbsp;0,&amp;nbsp;1,&amp;nbsp;0]&lt;br /&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;#&amp;nbsp;All&amp;nbsp;categories&amp;nbsp;are&amp;nbsp;equidistant&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;#&amp;nbsp;Cannot&amp;nbsp;express&amp;nbsp;similarity&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Advantages of Embedding:&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;#&amp;nbsp;Embedding&lt;br /&gt;Disease&amp;nbsp;A&amp;nbsp;=&amp;nbsp;[0.2,&amp;nbsp;0.8]&amp;nbsp;&amp;nbsp;#&amp;nbsp;Compressed&amp;nbsp;to&amp;nbsp;2D&lt;br /&gt;Disease&amp;nbsp;B&amp;nbsp;=&amp;nbsp;[0.3,&amp;nbsp;0.7]&amp;nbsp;&amp;nbsp;#&amp;nbsp;Similar&amp;nbsp;vector&amp;nbsp;to&amp;nbsp;A&lt;br /&gt;Disease&amp;nbsp;C&amp;nbsp;=&amp;nbsp;[-0.8,&amp;nbsp;-0.2]&amp;nbsp;&amp;nbsp;#&amp;nbsp;Different&amp;nbsp;vector&amp;nbsp;from&amp;nbsp;A,B&lt;br /&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;#&amp;nbsp;Can&amp;nbsp;express&amp;nbsp;similarity&amp;nbsp;through&amp;nbsp;vector&amp;nbsp;distances&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;The practice of setting embedding output size to sqrt(number of unique values) + 1 is a commonly used rule of thumb.&lt;/span&gt;&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;Example:&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;categorical_feature = 'disease_type'&lt;br /&gt;unique_values&amp;nbsp;=&amp;nbsp;16&amp;nbsp;&amp;nbsp;#&amp;nbsp;Assuming&amp;nbsp;there&amp;nbsp;are&amp;nbsp;16&amp;nbsp;disease&amp;nbsp;types&lt;br /&gt;embedding_output_size&amp;nbsp;=&amp;nbsp;int(np.sqrt(16))&amp;nbsp;+&amp;nbsp;1&lt;br /&gt;#&amp;nbsp;=&amp;nbsp;4&amp;nbsp;+&amp;nbsp;1&amp;nbsp;=&amp;nbsp;5&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Reasons&amp;nbsp;for&amp;nbsp;this&amp;nbsp;setting:&lt;/b&gt;&lt;br /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;Dimension Reduction:&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;One-hot encoding would require 16 dimensions&lt;/li&gt;
&lt;li&gt;Embedding&amp;nbsp;can&amp;nbsp;reduce&amp;nbsp;it&amp;nbsp;to&amp;nbsp;5&amp;nbsp;dimensions&lt;/li&gt;
&lt;li&gt;This reduces model complexity and makes learning more efficient&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;Appropriate&amp;nbsp;Expressiveness:&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #99cefa;&quot;&gt;&lt;b&gt;Too small embedding dimension: Risk of information loss&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;&lt;b&gt;Too&amp;nbsp;large&amp;nbsp;embedding&amp;nbsp;dimension:&amp;nbsp;Risk&amp;nbsp;of&amp;nbsp;overfitting&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;sqrt(n)&amp;nbsp;+&amp;nbsp;1&amp;nbsp;is&amp;nbsp;an&amp;nbsp;empirical&amp;nbsp;method&amp;nbsp;to&amp;nbsp;find&amp;nbsp;balance&amp;nbsp;between&amp;nbsp;these&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Real Example:&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;#&amp;nbsp;Embedding&amp;nbsp;dimensions&amp;nbsp;for&amp;nbsp;various&amp;nbsp;category&amp;nbsp;sizes&lt;br /&gt;4&amp;nbsp;categories&amp;nbsp;-&amp;gt;&amp;nbsp;3&amp;nbsp;embedding&amp;nbsp;dimensions&amp;nbsp;&amp;nbsp;(&amp;radic;4&amp;nbsp;+&amp;nbsp;1)&lt;br /&gt;9&amp;nbsp;categories&amp;nbsp;-&amp;gt;&amp;nbsp;4&amp;nbsp;embedding&amp;nbsp;dimensions&amp;nbsp;&amp;nbsp;(&amp;radic;9&amp;nbsp;+&amp;nbsp;1)&lt;br /&gt;16&amp;nbsp;categories&amp;nbsp;-&amp;gt;&amp;nbsp;5&amp;nbsp;embedding&amp;nbsp;dimensions&amp;nbsp;(&amp;radic;16&amp;nbsp;+&amp;nbsp;1)&lt;br /&gt;25&amp;nbsp;categories&amp;nbsp;-&amp;gt;&amp;nbsp;6&amp;nbsp;embedding&amp;nbsp;dimensions&amp;nbsp;(&amp;radic;25&amp;nbsp;+&amp;nbsp;1)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;Afterward we concatenate all the categorical embeddings together with the numerical features and continue forward with MLP.&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #c0d1e7;&quot;&gt;&lt;b&gt;For the two numericals, we standardize with&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;feature = (feature - mean)/std&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;because NN like standardized features.&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 style=&quot;color: #202124; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;Target Transformation&lt;/b&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;There are two ways to train a&amp;nbsp;Survival Model:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #99cefa;&quot;&gt;We can input both&amp;nbsp;efs&amp;nbsp;and&amp;nbsp;efs_time&amp;nbsp;and use survival loss like&amp;nbsp;Cox.&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;Transform&amp;nbsp;efs&amp;nbsp;and&amp;nbsp;efs_time&amp;nbsp;into a single target proxy for&amp;nbsp;risk score&amp;nbsp;and train with regression loss like&amp;nbsp;MSE.&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;In my NN starter, I employ option 2 above. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;&lt;b&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;I transform the original two targets into a proxy for&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;risk score&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;and train NN with MSE regression loss. &lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;Below shows the original two targets and the new transformed target. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;When training with MSE loss, the model likes the target to be like Gaussian distribution.&lt;/span&gt; &lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;This was one factor when I invented this new way to transform target:&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-06 오전 9.22.24.png&quot; data-origin-width=&quot;779&quot; data-origin-height=&quot;532&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/nzoWz/btsL8VLoVBz/0xXSnKQcBiyee687CYRf11/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/nzoWz/btsL8VLoVBz/0xXSnKQcBiyee687CYRf11/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/nzoWz/btsL8VLoVBz/0xXSnKQcBiyee687CYRf11/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FnzoWz%2FbtsL8VLoVBz%2F0xXSnKQcBiyee687CYRf11%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;625&quot; height=&quot;427&quot; data-filename=&quot;스크린샷 2025-02-06 오전 9.22.24.png&quot; data-origin-width=&quot;779&quot; data-origin-height=&quot;532&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-06 오전 9.22.38.png&quot; data-origin-width=&quot;749&quot; data-origin-height=&quot;563&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/boYLbJ/btsL9MG08ij/CiEfDqf9Y9geVdnuw0jD2K/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/boYLbJ/btsL9MG08ij/CiEfDqf9Y9geVdnuw0jD2K/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/boYLbJ/btsL9MG08ij/CiEfDqf9Y9geVdnuw0jD2K/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FboYLbJ%2FbtsL9MG08ij%2FCiEfDqf9Y9geVdnuw0jD2K%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;564&quot; height=&quot;424&quot; data-filename=&quot;스크린샷 2025-02-06 오전 9.22.38.png&quot; data-origin-width=&quot;749&quot; data-origin-height=&quot;563&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style3&quot; /&gt;
&lt;h3 style=&quot;background-color: #ffffff; color: #202124; text-align: start;&quot; data-ke-size=&quot;size23&quot;&gt;&lt;b&gt;NN (MLP) Baseline - [CV 670 LB 676] (Kernel)&lt;/b&gt;&lt;/h3&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;Intro&lt;/b&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;In this notebook, we present a Neural Network NN (MLP) baseline. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;This NN is very fast to train on GPU! We achieve CV 0.670. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;There is a discussion about this notebook&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;a style=&quot;color: #008abc;&quot; href=&quot;https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550343&quot;&gt;here&lt;/a&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Above discussion&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;We tranform the two train targets (efs&amp;nbsp;and&amp;nbsp;efs_time) into a single target (y) and then train regression NN with MSE loss.&lt;/span&gt; &lt;/b&gt;&lt;/li&gt;
&lt;li&gt;We load Kaggle's official metric code from&amp;nbsp;&lt;a href=&quot;https://www.kaggle.com/code/metric/eefs-concordance-index&quot;&gt;here&lt;/a&gt;&amp;nbsp;and evaluate the CV performance using competition metric Stratified Concordance Index.&lt;/li&gt;
&lt;li&gt;In this comp, we need to predict&amp;nbsp;risk score.&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;There are many different ways to transform the two train targets into a value that mimics&amp;nbsp;risk score&amp;nbsp;and train an NN (or any other regression model like SVR) with regression.&lt;/span&gt; &lt;/b&gt;&lt;/li&gt;
&lt;li&gt;I present one transformation in this notebook and I presented a different one in my XGBoost starter notebook&amp;nbsp;&lt;a href=&quot;https://www.kaggle.com/code/cdeotte/xgboost-catboost-baseline-cv-668-lb-668&quot;&gt;here&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Consider experimenting by creating your own target from&amp;nbsp;efs&amp;nbsp;and&amp;nbsp;efs_time.&lt;/li&gt;
&lt;li&gt;Or considering using survival loss directly which uses both&amp;nbsp;efs&amp;nbsp;and&amp;nbsp;efs_time&amp;nbsp;as explained in discussion post&amp;nbsp;&lt;a href=&quot;https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550141&quot;&gt;here&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Kaggle user MT describes another transformation&amp;nbsp;&lt;a href=&quot;https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550141#3064661&quot;&gt;here&lt;/a&gt;&amp;nbsp;called&amp;nbsp;KaplanMeierFitter&amp;nbsp;and gives an example&amp;nbsp;&lt;a href=&quot;https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550337&quot;&gt;here&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id=&quot;Pip-Install-Libraries-for-Metric&quot; style=&quot;color: #202214; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;Pip Install Libraries for Metric&lt;/b&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;Since internet must be turned off for submission, we pip install from my other notebook&lt;span&gt;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;a style=&quot;background-color: #ffffff; color: #008abc; text-align: left;&quot; href=&quot;https://www.kaggle.com/code/cdeotte/pip-install-lifelines&quot;&gt;here&lt;/a&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;&lt;span&gt;&amp;nbsp;&lt;/span&gt;where I downloaded the WHL files.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1738801854188&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;!pip install /kaggle/input/pip-install-lifelines/autograd-1.7.0-py3-none-any.whl
!pip install /kaggle/input/pip-install-lifelines/autograd-gamma-0.5.0.tar.gz
!pip install /kaggle/input/pip-install-lifelines/interface_meta-1.3.0-py3-none-any.whl
!pip install /kaggle/input/pip-install-lifelines/formulaic-1.0.2-py3-none-any.whl
!pip install /kaggle/input/pip-install-lifelines/lifelines-0.30.0-py3-none-any.whl&lt;/code&gt;&lt;/pre&gt;
&lt;h4 id=&quot;Load-Train-and-Test&quot; style=&quot;color: #202214; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;Load Train and Test&lt;/b&gt;&lt;/h4&gt;
&lt;pre id=&quot;code_1738801890281&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;import numpy as np, pandas as pd
import matplotlib.pyplot as plt
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)

test = pd.read_csv(&quot;/kaggle/input/equity-post-HCT-survival-predictions/test.csv&quot;)
print(&quot;Test shape:&quot;, test.shape )

train = pd.read_csv(&quot;/kaggle/input/equity-post-HCT-survival-predictions/train.csv&quot;)
print(&quot;Train shape:&quot;,train.shape)
train.head()&lt;/code&gt;&lt;/pre&gt;
&lt;h4 id=&quot;EDA-on-Train-Targets&quot; style=&quot;color: #202214; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;EDA on Train Targets&lt;/b&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;There are two train targets&lt;span&gt;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;efs&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;&lt;span&gt;&amp;nbsp;&lt;/span&gt;and&lt;span&gt;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;efs_time&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;When&lt;span&gt;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;efs==1&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;&lt;span&gt;&amp;nbsp;&lt;/span&gt;we know patient&lt;span&gt;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;b&gt;had an event&lt;/b&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;&lt;span&gt;&amp;nbsp;&lt;/span&gt;and we know time of event is&lt;span&gt;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;efs_time&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;. When&lt;span&gt;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;efs==0&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;&lt;span&gt;&amp;nbsp;&lt;/span&gt;we&lt;span&gt;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;b&gt;do not know&lt;/b&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;&lt;span&gt;&amp;nbsp;&lt;/span&gt;if patient had an event or not, but we do know that patient was&lt;span&gt;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;b&gt;without event for at least&lt;/b&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;&lt;span&gt;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;efs_time&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1738803849770&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;plt.hist(train.loc[train.efs==1,&quot;efs_time&quot;],bins=100,label=&quot;efs=1, Yes Event&quot;)
plt.hist(train.loc[train.efs==0,&quot;efs_time&quot;],bins=100,label=&quot;efs=0, Maybe Event&quot;)
plt.xlabel(&quot;Time of Observation, efs_time&quot;)
plt.ylabel(&quot;Density&quot;)
plt.title(&quot;Times of Observation. Either time to event, or time observed without event.&quot;)
plt.legend()
plt.show()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-06 오전 10.04.20.png&quot; data-origin-width=&quot;662&quot; data-origin-height=&quot;465&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/LJi4S/btsL80eWY2e/kAVZnEUZn9TU7h9obG0GtK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/LJi4S/btsL80eWY2e/kAVZnEUZn9TU7h9obG0GtK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/LJi4S/btsL80eWY2e/kAVZnEUZn9TU7h9obG0GtK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FLJi4S%2FbtsL80eWY2e%2FkAVZnEUZn9TU7h9obG0GtK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;662&quot; height=&quot;465&quot; data-filename=&quot;스크린샷 2025-02-06 오전 10.04.20.png&quot; data-origin-width=&quot;662&quot; data-origin-height=&quot;465&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;h4 id=&quot;Transform-Two-Train-Targets-into-One-Target!&quot; style=&quot;color: #202214; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;Transform Two Train Targets into One Target!&lt;/b&gt;&lt;/span&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Both targets&amp;nbsp;efs&amp;nbsp;and&amp;nbsp;efs_time&amp;nbsp;provide useful information.&lt;/li&gt;
&lt;li&gt;We will tranform these two targets into a single target to train our model with.&lt;/li&gt;
&lt;li&gt;In this competition we need to predict&amp;nbsp;risk score.&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #c0d1e7;&quot;&gt;So we will create a target that mimics&amp;nbsp;risk score&amp;nbsp;to train our model.&lt;/span&gt; &lt;/b&gt;&lt;/li&gt;
&lt;li&gt;(Note this is only one out of many ways to transform two targets into one target. Considering experimenting on your own).&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1738803889299&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;# 1. Set initial target value to efs_time
train[&quot;y&quot;] = train.efs_time.values
# 2. Find maximum time of event cases (efs=1) and 
# minimum time of censored cases (efs=0)
mx = train.loc[train.efs==1,&quot;efs_time&quot;].max()
mn = train.loc[train.efs==0,&quot;efs_time&quot;].min()
# 3. Adjust time values for censored cases
# Add (mx - mn) to make all censored cases larger than event cases
train.loc[train.efs==0,&quot;y&quot;] = train.loc[train.efs==0,&quot;y&quot;] + mx - mn
# 4. Rank all values (starting from 1)
train.y = train.y.rank()
# 5. Make ranks of censored cases larger
# Add 2 times the data length to clearly differentiate
train.loc[train.efs==0,&quot;y&quot;] += 2*len(train)
# 6. Normalize to values between 0~1
train.y = train.y / train.y.max()
# 7. Apply log transformation
train.y = np.log(train.y)
# 8. Center mean to 0
train.y -= train.y.mean()
# 9. Reverse sign (to interpret as risk score)
train.y *= -1.0

plt.hist(train.loc[train.efs==1,&quot;y&quot;],bins=100,label=&quot;efs=1, Yes Event&quot;)
plt.hist(train.loc[train.efs==0,&quot;y&quot;],bins=100,label=&quot;efs=0, Maybe Event&quot;)
plt.xlim((-5,5))
plt.xlabel(&quot;Transformed Target y&quot;)
plt.ylabel(&quot;Density&quot;)
plt.title(&quot;Transformed Target y using both efs and efs_time.&quot;)
plt.legend()
plt.show()&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;Purpose of this transformation:&lt;/b&gt;&lt;/p&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;Clearly differentiate between censored cases and event cases&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;Transform values into appropriate range&lt;/li&gt;
&lt;li&gt;Make it interpretable as risk scores (multiply by -1 at the end)&lt;/li&gt;
&lt;/ol&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;As a result:&lt;/b&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Event cases (efs=1) have higher risk scores&lt;/li&gt;
&lt;li&gt;Censored cases (efs=0) have lower risk scores&lt;/li&gt;
&lt;li&gt;Overall distribution becomes normalized&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;Detailed Explanation about #5 part:&lt;/b&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1738806251811&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;# 5. Make ranks of censored cases larger
# Add 2 times the data length to clearly differentiate
train.loc[train.efs==0,&quot;y&quot;] += 2*len(train)&lt;/code&gt;&lt;/pre&gt;
&lt;pre id=&quot;code_1738806260885&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;# Example data: 5 patients
# efs=1: Event occurred (death)
# efs=0: Censored (end of tracking)
# Initial data
Patient A: efs=1, efs_time=10  # Died on day 10
Patient B: efs=1, efs_time=20  # Died on day 20
Patient C: efs=0, efs_time=15  # Survival confirmed until day 15
Patient D: efs=1, efs_time=5   # Died on day 5
Patient E: efs=0, efs_time=25  # Survival confirmed until day 25
# After applying rank()
Patient D: 1  (shortest survival)
Patient A: 2
Patient C: 3
Patient B: 4
Patient E: 5  (longest survival)
# Adding 2*len(train) = 2*5 = 10 to censored cases
Patient D: 1      # efs=1, no change
Patient A: 2      # efs=1, no change
Patient C: 13     # efs=0, 3+10
Patient B: 4      # efs=1, no change
Patient E: 15     # efs=0, 5+10&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;Reasons for doing this:&lt;/b&gt;&lt;/p&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;Censored cases (efs=0) might have actually lived longer&lt;/li&gt;
&lt;li&gt;Therefore, we make their ranks definitively larger&lt;/li&gt;
&lt;li&gt;Adding twice the data length creates a large gap between efs=1 and efs=0 cases&lt;/li&gt;
&lt;li&gt;This helps the model better distinguish between the two groups&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-06 오전 10.04.58.png&quot; data-origin-width=&quot;611&quot; data-origin-height=&quot;465&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/YoqCx/btsL7ABt9bg/VmTw9GkCUY7UlF5HDiABa1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/YoqCx/btsL7ABt9bg/VmTw9GkCUY7UlF5HDiABa1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/YoqCx/btsL7ABt9bg/VmTw9GkCUY7UlF5HDiABa1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FYoqCx%2FbtsL7ABt9bg%2FVmTw9GkCUY7UlF5HDiABa1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;611&quot; height=&quot;465&quot; data-filename=&quot;스크린샷 2025-02-06 오전 10.04.58.png&quot; data-origin-width=&quot;611&quot; data-origin-height=&quot;465&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;h4 id=&quot;Features&quot; style=&quot;color: #202214; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;Features&lt;/b&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;There are a total of 57 features.&lt;/li&gt;
&lt;li&gt;From these 35 are categorical and 22 are numerical.&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;Since most of the numerical features has only a few unique values, we will treat all features except donor_age and act_at_hct as categorical for our NN. &lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;So we will feed our NN 55 categorical features and 2 numerical features.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1738806300151&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;RMV = [&quot;ID&quot;,&quot;efs&quot;,&quot;efs_time&quot;,&quot;y&quot;]
FEATURES = [c for c in train.columns if not c in RMV]
print(f&quot;There are {len(FEATURES)} FEATURES: {FEATURES}&quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-06 오전 10.45.09.png&quot; data-origin-width=&quot;833&quot; data-origin-height=&quot;273&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/1uovl/btsL8OlB0vW/mgUZtwCRhjM9ESKr9Ci13K/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/1uovl/btsL8OlB0vW/mgUZtwCRhjM9ESKr9Ci13K/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/1uovl/btsL8OlB0vW/mgUZtwCRhjM9ESKr9Ci13K/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2F1uovl%2FbtsL8OlB0vW%2FmgUZtwCRhjM9ESKr9Ci13K%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;833&quot; height=&quot;273&quot; data-filename=&quot;스크린샷 2025-02-06 오전 10.45.09.png&quot; data-origin-width=&quot;833&quot; data-origin-height=&quot;273&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;pre id=&quot;code_1738806320959&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;# Create empty list CATS - will store categorical variables
CATS = []
# Iterate through each feature (column) in FEATURES list
for c in FEATURES:
    # If the column's data type is &quot;object&quot; (strings etc.)
    if train[c].dtype==&quot;object&quot;:
        # Fill missing values with &quot;NAN&quot; in both train and test
        train[c] = train[c].fillna(&quot;NAN&quot;)
        test[c] = test[c].fillna(&quot;NAN&quot;)
        # Add this column to CATS list
        CATS.append(c)
    
    # If it's a numerical column not containing &quot;age&quot; in its name    
    elif not &quot;age&quot; in c:
        # Convert numeric values to strings in both train and test
        train[c] = train[c].astype(&quot;str&quot;)
        test[c] = test[c].astype(&quot;str&quot;)
        # Add this column to CATS list
        CATS.append(c)
# Print the number and list of features treated as categorical
print(f&quot;In these features, there are {len(CATS)} CATEGORICAL FEATURES: {CATS}&quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-06 오전 10.45.26.png&quot; data-origin-width=&quot;833&quot; data-origin-height=&quot;273&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/0fOI1/btsL6VlR7p7/qXkpkla6bB8lhf4Fc53IBk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/0fOI1/btsL6VlR7p7/qXkpkla6bB8lhf4Fc53IBk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/0fOI1/btsL6VlR7p7/qXkpkla6bB8lhf4Fc53IBk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2F0fOI1%2FbtsL6VlR7p7%2FqXkpkla6bB8lhf4Fc53IBk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;833&quot; height=&quot;273&quot; data-filename=&quot;스크린샷 2025-02-06 오전 10.45.26.png&quot; data-origin-width=&quot;833&quot; data-origin-height=&quot;273&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;pre id=&quot;code_1738806339876&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;# Create lists to store categorical variable sizes and embedding dimensions
CAT_SIZE = []  # Number of unique values for each categorical variable
CAT_EMB = []   # Embedding dimensions for each categorical variable
NUMS = []      # List of numerical variables
# Combine train and test data
combined = pd.concat([train,test],axis=0,ignore_index=True)
print(&quot;We LABEL ENCODE the CATEGORICAL FEATURES: &quot;)
# Iterate through all features
for c in FEATURES:
    # If it's a categorical variable
    if c in CATS:
        # Perform label encoding using factorize()
        combined[c],_ = combined[c].factorize()
        # Make minimum value 0
        combined[c] -= combined[c].min()
        # Convert to int32 type
        combined[c] = combined[c].astype(&quot;int32&quot;)
        
        # Calculate number of unique values and range
        n = combined[c].nunique()
        mn = combined[c].min()
        mx = combined[c].max()
        print(f'{c} has ({n}) unique values')
        
        # Store category size (max+1) and embedding dimension (sqrt(max+1))
        CAT_SIZE.append(mx+1)
        CAT_EMB.append( int(np.ceil( np.sqrt(mx+1))) )
    
    # If it's a numerical variable
    else:
        # Convert float64 to float32, int64 to int32 (memory optimization)
        if combined[c].dtype==&quot;float64&quot;:
            combined[c] = combined[c].astype(&quot;float32&quot;)
        if combined[c].dtype==&quot;int64&quot;:
            combined[c] = combined[c].astype(&quot;int32&quot;)
        
        # Perform standardization
        m = combined[c].mean()
        s = combined[c].std()
        combined[c] = (combined[c]-m)/s
        # Fill missing values with 0
        combined[c] = combined[c].fillna(0)
        
        # Add to numerical variables list
        NUMS.append(c)
# Split back into train and test
train = combined.iloc[:len(train)].copy()
test = combined.iloc[len(train):].reset_index(drop=True).copy()&lt;/code&gt;&lt;/pre&gt;
&lt;pre id=&quot;code_1738806357816&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;We LABEL ENCODE the CATEGORICAL FEATURES: 
dri_score has (12) unique values
psych_disturb has (4) unique values
cyto_score has (8) unique values
diabetes has (4) unique values
hla_match_c_high has (4) unique values
hla_high_res_8 has (8) unique values
tbi_status has (8) unique values
arrhythmia has (4) unique values
hla_low_res_6 has (6) unique values
graft_type has (2) unique values
vent_hist has (3) unique values
renal_issue has (4) unique values
pulm_severe has (4) unique values
prim_disease_hct has (18) unique values
hla_high_res_6 has (7) unique values
cmv_status has (5) unique values
hla_high_res_10 has (9) unique values
hla_match_dqb1_high has (4) unique values
tce_imm_match has (9) unique values
hla_nmdp_6 has (6) unique values
hla_match_c_low has (4) unique values
rituximab has (3) unique values
hla_match_drb1_low has (3) unique values
hla_match_dqb1_low has (4) unique values
prod_type has (2) unique values
cyto_score_detail has (6) unique values
conditioning_intensity has (7) unique values
ethnicity has (4) unique values
year_hct has (13) unique values
obesity has (4) unique values
mrd_hct has (3) unique values
in_vivo_tcd has (3) unique values
tce_match has (5) unique values
hla_match_a_high has (4) unique values
hepatic_severe has (4) unique values
prior_tumor has (4) unique values
hla_match_b_low has (4) unique values
peptic_ulcer has (4) unique values
hla_match_a_low has (4) unique values
gvhd_proph has (18) unique values
rheum_issue has (4) unique values
sex_match has (5) unique values
hla_match_b_high has (4) unique values
race_group has (6) unique values
comorbidity_score has (12) unique values
karnofsky_score has (8) unique values
hepatic_mild has (4) unique values
tce_div_match has (5) unique values
donor_related has (4) unique values
melphalan_dose has (3) unique values
hla_low_res_8 has (8) unique values
cardiac has (4) unique values
hla_match_drb1_high has (4) unique values
pulm_moderate has (4) unique values
hla_low_res_10 has (8) unique values&lt;/code&gt;&lt;/pre&gt;
&lt;h4 id=&quot;TensorFlow-NN&quot; style=&quot;color: #202214; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;TensorFlow NN&lt;/b&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;We train NN model with CV 0.670&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1738813329453&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense, Dropout, Input, Embedding
from tensorflow.keras.layers import Concatenate, BatchNormalization
import tensorflow.keras.backend as K
from sklearn.model_selection import KFold

print('TF Version',tf.__version__)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-06 오후 12.42.20.png&quot; data-origin-width=&quot;182&quot; data-origin-height=&quot;59&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/cFhep9/btsL7xZdYkX/qUgTvTki14IqbvlVu9v410/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/cFhep9/btsL7xZdYkX/qUgTvTki14IqbvlVu9v410/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/cFhep9/btsL7xZdYkX/qUgTvTki14IqbvlVu9v410/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FcFhep9%2FbtsL7xZdYkX%2FqUgTvTki14IqbvlVu9v410%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;182&quot; height=&quot;59&quot; data-filename=&quot;스크린샷 2025-02-06 오후 12.42.20.png&quot; data-origin-width=&quot;182&quot; data-origin-height=&quot;59&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;h4 id=&quot;Learning-Schedule&quot; style=&quot;color: #202214; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;Learning Schedule&lt;/b&gt;&lt;/h4&gt;
&lt;pre id=&quot;code_1738813361351&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;# Set total 4 epochs
EPOCHS = 4
# Define learning rate for each epoch
LRS = [0.01]*2 + [0.001]*1 + [0.0001]*1
# Written out: LRS = [0.01, 0.01, 0.001, 0.0001]
# Function that returns learning rate for each epoch
def lrfn(epoch):
    return LRS[epoch]
# Create list of epoch numbers (0 to 3)
rng = [i for i in range(EPOCHS)]
# Create list of learning rate values for each epoch
lr_y = [lrfn(x) for x in rng]

plt.figure(figsize=(10, 4))
plt.plot(rng, lr_y, '-o')
print(&quot;Learning rate schedule: {:.3g} to {:.3g} to {:.3g}&quot;. \
        format(lr_y[0], max(lr_y), lr_y[-1]))
plt.xlabel(&quot;Epoch&quot;)
plt.ylabel(&quot;Learning Rate&quot;)
plt.title(&quot;Learning Rate Schedule&quot;)
plt.show()

lr_callback = tf.keras.callbacks.LearningRateScheduler(lrfn, verbose = False)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-06 오후 12.42.53.png&quot; data-origin-width=&quot;841&quot; data-origin-height=&quot;445&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/d6XZdy/btsL7zWXGY3/MgXJ7Uns64bJjBE2TtN1H0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/d6XZdy/btsL7zWXGY3/MgXJ7Uns64bJjBE2TtN1H0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/d6XZdy/btsL7zWXGY3/MgXJ7Uns64bJjBE2TtN1H0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fd6XZdy%2FbtsL7zWXGY3%2FMgXJ7Uns64bJjBE2TtN1H0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;841&quot; height=&quot;445&quot; data-filename=&quot;스크린샷 2025-02-06 오후 12.42.53.png&quot; data-origin-width=&quot;841&quot; data-origin-height=&quot;445&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;Learning&amp;nbsp;Rate&amp;nbsp;Schedule:&lt;/b&gt;&lt;br /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;First 2 epochs: 0.01 (fast learning with high learning rate)&lt;/li&gt;
&lt;li&gt;3rd&amp;nbsp;epoch:&amp;nbsp;0.001&amp;nbsp;(decreased&amp;nbsp;learning&amp;nbsp;rate)&lt;/li&gt;
&lt;li&gt;4th&amp;nbsp;epoch:&amp;nbsp;0.0001&amp;nbsp;(fine-tuning&amp;nbsp;with&amp;nbsp;smaller&amp;nbsp;learning&amp;nbsp;rate)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;Reasons for gradually decreasing the learning rate:&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;Fast learning with large learning rate initially&lt;/li&gt;
&lt;li&gt;Fine-tuning with small learning rate in later stages&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;This helps the model converge more stably&lt;/span&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;div&gt;
&lt;div&gt;
&lt;div&gt;
&lt;div style=&quot;background-color: #ffffff; color: #3c4043;&quot;&gt;
&lt;h4 id=&quot;Model-Definition&quot; style=&quot;color: #202214;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;Model Definition&lt;/b&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;We use embedding layers for all label encoded categorical features.&lt;/span&gt; &lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;Then we concatenate all categorical embeddings with the numerical features.&lt;/span&gt; &lt;/b&gt;&lt;/li&gt;
&lt;li&gt;We create an MLP with &lt;b&gt;&lt;span style=&quot;background-color: #c0d1e7;&quot;&gt;two hidden layers. &lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;Our final output layer has one linear neuron and during training we use MSE loss with Adam optimizer.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1738813404208&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;def build_model():
    # 1. Handle categorical variables
    # Create input layer for categorical variables
    x_input_cats = Input(shape=(len(CATS),))
    embs = []
    
    # Create embedding layer for each categorical variable
    for j in range(len(CATS)):
        # Create embedding layer (input size: CAT_SIZE[j], output size: CAT_EMB[j])
        e = tf.keras.layers.Embedding(CAT_SIZE[j], CAT_EMB[j])
        # Apply embedding to j-th categorical variable
        x = e(x_input_cats[:,j])
        # Flatten embedding result to 1D
        x = tf.keras.layers.Flatten()(x)
        # Store embedding result
        embs.append(x)
    
    # 2. Handle numerical variables
    # Create input layer for numerical variables
    x_input_nums = Input(shape=(len(NUMS),))
    
    # 3. Combine categorical and numerical features
    # Connect all embedding results and numerical variables
    x = tf.keras.layers.Concatenate(axis=-1)(embs+[x_input_nums])
    
    # 4. Add fully connected layers (Dense)
    # Hidden layer with 256 neurons (ReLU activation)
    x = Dense(256, activation='relu')(x)
    x = Dense(256, activation='relu')(x)
    # Output layer with 1 neuron (linear activation)
    x = Dense(1, activation='linear')(x)
    
    # 5. Create model
    # Input: categorical and numerical variables
    # Output: predicted value
    model = Model(inputs=[x_input_cats,x_input_nums], outputs=x)
    
    return model&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;linear activation: f(x)&amp;nbsp;=&amp;nbsp;x&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Outputs input value as is&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;No transformation applied&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;Characteristics:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Unlimited output range (-&amp;infin; ~ +&amp;infin;)&lt;/li&gt;
&lt;li&gt;Commonly used in output layer for regression&lt;/li&gt;
&lt;li&gt;Suitable for continuous real value prediction&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;ReLU (Rectified Linear Unit) Activation: f(x)&amp;nbsp;=&amp;nbsp;max(0,&amp;nbsp;x)&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;Outputs 0 for negative values, keeps positive values as is&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;Characteristics:&lt;br /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Output range: [0, &amp;infin;)&lt;/li&gt;
&lt;li&gt;Most&amp;nbsp;commonly&amp;nbsp;used&amp;nbsp;in&amp;nbsp;hidden&amp;nbsp;layers&lt;/li&gt;
&lt;li&gt;Reduces&amp;nbsp;vanishing&amp;nbsp;gradient&amp;nbsp;problem&lt;/li&gt;
&lt;li&gt;Simple&amp;nbsp;and&amp;nbsp;fast&amp;nbsp;computation&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Linear:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;ReLU:&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;↗&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;↗&lt;br /&gt;&amp;nbsp;&amp;nbsp;↗&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;_/&lt;br /&gt;↗&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;_/&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;div style=&quot;color: #3c4043;&quot;&gt;
&lt;div&gt;
&lt;div id=&quot;sharing-control-portal-19&quot; style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;
&lt;div&gt;
&lt;div style=&quot;background-color: #ffffff; color: #202124;&quot;&gt;
&lt;h4 id=&quot;Train-K-Fold&quot; style=&quot;color: #202214; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;Train K Fold&lt;/b&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;We train our NN below.&lt;/li&gt;
&lt;li&gt;If you want to save the trained model weights, you can uncomment the commented lines below.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div style=&quot;background-color: #f1f3f4; color: #3c4043; text-align: start;&quot;&gt;
&lt;div&gt;
&lt;pre id=&quot;code_1738813425040&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;%%time

REPEATS = 3
FOLDS = 5
kf = KFold(n_splits=FOLDS, random_state=42, shuffle=True)

oof_nn = np.zeros( len(train) )
pred_nn = np.zeros( len(test) )

#directory = &quot;checkpoints&quot;
#if not os.path.exists(directory):
#    os.makedirs(directory)

for r in range(REPEATS):
    VERBOSE = r==0
    print(&quot;#&quot;*25)
    print(f&quot;### REPEAT {r+1} ###&quot;)
    print(&quot;#&quot;*25)
        
    for i, (train_index, test_index) in enumerate(kf.split(train)):
        
        X_train_cats = train.loc[train_index,CATS].values
        X_train_nums = train.loc[train_index,NUMS].values
        y_train = train.loc[train_index,&quot;y&quot;].values
        y_train2 = train.loc[train_index,&quot;efs&quot;].values
        
        X_valid_cats = train.loc[test_index,CATS].values
        X_valid_nums = train.loc[test_index,NUMS].values
        y_valid = train.loc[test_index,&quot;y&quot;].values
        y_valid2 = train.loc[test_index,&quot;efs&quot;].values
        
        X_test_cats = test[CATS].values
        X_test_nums = test[NUMS].values

        if VERBOSE:
            print(&quot; &quot;,&quot;#&quot;*25)
            print(&quot; &quot;,f&quot;### Fold {i+1} ###&quot;)
            print(&quot; &quot;,&quot;#&quot;*25)
        
        # TRAIN MODEL
        K.clear_session()
        model = build_model()
        model.compile(optimizer=tf.keras.optimizers.Adam(0.001), 
                      loss=&quot;mean_squared_error&quot;,  
                     )
        v = 2 if VERBOSE else 0
        model.fit([X_train_cats,X_train_nums], [y_train], 
                  validation_data = ([X_valid_cats,X_valid_nums], [y_valid]),
                  callbacks = [lr_callback],
                  batch_size=512, epochs=EPOCHS, verbose=v)
        #model.save_weights(f'{directory}/NN_f{i}_r{r}.weights.h5')
        
        # INFER OOF
        oof_nn[test_index] += model.predict([X_valid_cats,X_valid_nums], verbose=v, batch_size=512).flatten()
        # INFER TEST
        pred_nn += model.predict([X_test_cats,X_test_nums], verbose=v, batch_size=512).flatten()

oof_nn /= REPEATS
pred_nn /= (FOLDS*REPEATS)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;h4 id=&quot;Compute-Overall-Metric&quot; style=&quot;color: #202214; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;Compute Overall Metric&lt;/b&gt;&lt;/h4&gt;
&lt;pre id=&quot;code_1738813451619&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;from metric import score

y_true = train[[&quot;ID&quot;,&quot;efs&quot;,&quot;efs_time&quot;,&quot;race_group&quot;]].copy()
y_pred = train[[&quot;ID&quot;]].copy()
y_pred[&quot;prediction&quot;] = oof_nn
m = score(y_true.copy(), y_pred.copy(), &quot;ID&quot;)
print(f&quot;\nOverall CV for NN =&quot;,m)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-06 오후 12.44.19.png&quot; data-origin-width=&quot;348&quot; data-origin-height=&quot;51&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bCevdS/btsL7yKzjn6/TjiJ94RdptrZzpR4W5h131/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bCevdS/btsL7yKzjn6/TjiJ94RdptrZzpR4W5h131/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bCevdS/btsL7yKzjn6/TjiJ94RdptrZzpR4W5h131/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbCevdS%2FbtsL7yKzjn6%2FTjiJ94RdptrZzpR4W5h131%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;348&quot; height=&quot;51&quot; data-filename=&quot;스크린샷 2025-02-06 오후 12.44.19.png&quot; data-origin-width=&quot;348&quot; data-origin-height=&quot;51&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;h4 id=&quot;Create-Submission-CSV&quot; style=&quot;color: #202214; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;Create Submission CSV&lt;/b&gt;&lt;/h4&gt;
&lt;pre id=&quot;code_1738813476807&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;sub = pd.read_csv(&quot;/kaggle/input/equity-post-HCT-survival-predictions/sample_submission.csv&quot;)
sub.prediction = pred_nn
sub.to_csv(&quot;submission.csv&quot;,index=False)
print(&quot;Sub shape:&quot;,sub.shape)
sub.head()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-06 오후 12.44.44.png&quot; data-origin-width=&quot;324&quot; data-origin-height=&quot;208&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bJ2spL/btsL8FIXNeL/9f7pZeO4ANfTNLkoGMp4Xk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bJ2spL/btsL8FIXNeL/9f7pZeO4ANfTNLkoGMp4Xk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bJ2spL/btsL8FIXNeL/9f7pZeO4ANfTNLkoGMp4Xk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbJ2spL%2FbtsL8FIXNeL%2F9f7pZeO4ANfTNLkoGMp4Xk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;324&quot; height=&quot;208&quot; data-filename=&quot;스크린샷 2025-02-06 오후 12.44.44.png&quot; data-origin-width=&quot;324&quot; data-origin-height=&quot;208&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style5&quot; /&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;게으른&amp;nbsp;천재는&amp;nbsp;그냥&amp;nbsp;게으름뱅이일&amp;nbsp;뿐이다.&lt;/span&gt;&lt;/blockquote&gt;</description>
      <category>대회</category>
      <category>cibmtr - equity in post-hct survival predictions</category>
      <category>캐글</category>
      <author>dongsunseng</author>
      <guid isPermaLink="true">https://dongsunseng.tistory.com/119</guid>
      <comments>https://dongsunseng.tistory.com/entry/CIBMTR-Equity-in-post-HCT-Survival-Predictions-9-NN-Starter-Notebook#entry119comment</comments>
      <pubDate>Thu, 6 Feb 2025 13:03:05 +0900</pubDate>
    </item>
    <item>
      <title>CIBMTR - Equity in post-HCT Survival Predictions #8 Finding the best target transformation</title>
      <link>https://dongsunseng.tistory.com/entry/CIBMTR-Equity-in-post-HCT-Survival-Predictions-8-Finding-the-best-target-transformation</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;Annotation post on discussion about finding the best target transformation&lt;/b&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;a href=&quot;https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550835&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550835&lt;/a&gt;&lt;/p&gt;
&lt;figure id=&quot;og_1738762965645&quot; contenteditable=&quot;false&quot; data-ke-type=&quot;opengraph&quot; data-ke-align=&quot;alignCenter&quot; data-og-type=&quot;website&quot; data-og-title=&quot;CIBMTR - Equity in post-HCT Survival Predictions&quot; data-og-description=&quot;Improve prediction of transplant survival rates equitably for allogeneic HCT patients&quot; data-og-host=&quot;www.kaggle.com&quot; data-og-source-url=&quot;https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550835&quot; data-og-url=&quot;https://kaggle.com/equity-post-HCT-survival-predictions&quot; data-og-image=&quot;https://scrap.kakaocdn.net/dn/dl1DsA/hyYcjsH88T/dzhflx8AsfXo0vCPIgFTL1/img.png?width=560&amp;amp;height=280&amp;amp;face=0_0_560_280,https://scrap.kakaocdn.net/dn/oZwR1/hyYciN7mfM/K8IPKyFvFdZWTHxWyRVRjK/img.png?width=560&amp;amp;height=280&amp;amp;face=0_0_560_280&quot;&gt;&lt;a href=&quot;https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550835&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot; data-source-url=&quot;https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550835&quot;&gt;
&lt;div class=&quot;og-image&quot; style=&quot;background-image: url('https://scrap.kakaocdn.net/dn/dl1DsA/hyYcjsH88T/dzhflx8AsfXo0vCPIgFTL1/img.png?width=560&amp;amp;height=280&amp;amp;face=0_0_560_280,https://scrap.kakaocdn.net/dn/oZwR1/hyYciN7mfM/K8IPKyFvFdZWTHxWyRVRjK/img.png?width=560&amp;amp;height=280&amp;amp;face=0_0_560_280');&quot;&gt;&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;og-text&quot;&gt;
&lt;p class=&quot;og-title&quot; data-ke-size=&quot;size16&quot;&gt;CIBMTR - Equity in post-HCT Survival Predictions&lt;/p&gt;
&lt;p class=&quot;og-desc&quot; data-ke-size=&quot;size16&quot;&gt;Improve prediction of transplant survival rates equitably for allogeneic HCT patients&lt;/p&gt;
&lt;p class=&quot;og-host&quot; data-ke-size=&quot;size16&quot;&gt;www.kaggle.com&lt;/p&gt;
&lt;/div&gt;
&lt;/a&gt;&lt;/figure&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;b&gt;Finding the best target transformation&lt;/b&gt;&lt;/h3&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;The competition task can be interpreted as predicting the order of death of the patients. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;Who dies first? Who dies second? &amp;hellip; Who dies last, and who survives? &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;With a suitable target transformation, we can apply the usual regression algorithms which optimize mse or similar metrics.&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;The original target is distributed in such a way that most patients who die have an&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;efs_time&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;between 0 and 15, whereas most survivors have an&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;efs_time&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;between 15 and 160. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;This distribution is an impediment(장애) for regression models.&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;We need predictions which have high discriminative power for the patients who die, but we don't need to distinguish between survivors.&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;We can achieve this result by stretching the range of the patients who die and compressing the range of the survivors.&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;The diagram visualizes how a typical target transformation stretches and compresses the ranges:&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-05 오후 11.08.01.png&quot; data-origin-width=&quot;842&quot; data-origin-height=&quot;811&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/rVJlp/btsL8JqSllp/jJQWDmDSGFfEeVzKik883K/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/rVJlp/btsL8JqSllp/jJQWDmDSGFfEeVzKik883K/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/rVJlp/btsL8JqSllp/jJQWDmDSGFfEeVzKik883K/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FrVJlp%2FbtsL8JqSllp%2FjJQWDmDSGFfEeVzKik883K%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;748&quot; height=&quot;720&quot; data-filename=&quot;스크린샷 2025-02-05 오후 11.08.01.png&quot; data-origin-width=&quot;842&quot; data-origin-height=&quot;811&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;In the public notebooks of this competition, we can find various target transformations, and most of them are similar. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;For a comparison, I've taken three target transformations from public notebooks, added a fourth one, and given them all to XGBRegressor with an mse objective. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;&lt;span style=&quot;color: #3c4043; text-align: start;&quot;&gt;The cross-validation scores confirm that the orange part of the histogram must be stretched and the blue part must be condensed:&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-05 오후 11.08.47.png&quot; data-origin-width=&quot;623&quot; data-origin-height=&quot;608&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/rv9hT/btsL8MnDGbz/mHlwecCwziPTGAurq75Xm1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/rv9hT/btsL8MnDGbz/mHlwecCwziPTGAurq75Xm1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/rv9hT/btsL8MnDGbz/mHlwecCwziPTGAurq75Xm1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Frv9hT%2FbtsL8MnDGbz%2FmHlwecCwziPTGAurq75Xm1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;623&quot; height=&quot;608&quot; data-filename=&quot;스크린샷 2025-02-05 오후 11.08.47.png&quot; data-origin-width=&quot;623&quot; data-origin-height=&quot;608&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199; color: #3c4043; text-align: start;&quot;&gt;&lt;b&gt;A comparison with other model types shows that target-transformed mse models (pink) are competitive with Cox proportional hazards models (blue).&lt;/b&gt; &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;My AFT models (green) perhaps need more hyperparameter tuning.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-05 오후 11.09.10.png&quot; data-origin-width=&quot;848&quot; data-origin-height=&quot;347&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/Claqp/btsL7ptnR1T/Q9OzguKj1g5BoLzWcP0jRk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/Claqp/btsL7ptnR1T/Q9OzguKj1g5BoLzWcP0jRk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/Claqp/btsL7ptnR1T/Q9OzguKj1g5BoLzWcP0jRk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FClaqp%2FbtsL7ptnR1T%2FQ9OzguKj1g5BoLzWcP0jRk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;848&quot; height=&quot;347&quot; data-filename=&quot;스크린샷 2025-02-05 오후 11.09.10.png&quot; data-origin-width=&quot;848&quot; data-origin-height=&quot;347&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-05 오후 11.50.21.png&quot; data-origin-width=&quot;1196&quot; data-origin-height=&quot;733&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bDOtiX/btsL7hITp4j/K4xOTtS17neNIcoBafvn01/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bDOtiX/btsL7hITp4j/K4xOTtS17neNIcoBafvn01/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bDOtiX/btsL7hITp4j/K4xOTtS17neNIcoBafvn01/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbDOtiX%2FbtsL7hITp4j%2FK4xOTtS17neNIcoBafvn01%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1196&quot; height=&quot;733&quot; data-filename=&quot;스크린샷 2025-02-05 오후 11.50.21.png&quot; data-origin-width=&quot;1196&quot; data-origin-height=&quot;733&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;NN starter code annotation here:&amp;nbsp;&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-05 오후 11.48.47.png&quot; data-origin-width=&quot;1201&quot; data-origin-height=&quot;872&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/oOTT1/btsL8ZmGU72/PwKTUOWk2jtqXsG8kKFXA1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/oOTT1/btsL8ZmGU72/PwKTUOWk2jtqXsG8kKFXA1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/oOTT1/btsL8ZmGU72/PwKTUOWk2jtqXsG8kKFXA1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FoOTT1%2FbtsL8ZmGU72%2FPwKTUOWk2jtqXsG8kKFXA1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1201&quot; height=&quot;872&quot; data-filename=&quot;스크린샷 2025-02-05 오후 11.48.47.png&quot; data-origin-width=&quot;1201&quot; data-origin-height=&quot;872&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Maybe I should check on &lt;b&gt;Nelson-Aalen&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;Source code is in the&lt;span&gt;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;a style=&quot;background-color: #ffffff; color: #202124; text-align: start;&quot; href=&quot;https://www.kaggle.com/code/ambrosm/esp-eda-which-makes-sense&quot;&gt;EDA which makes sense&lt;/a&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;figure id=&quot;og_1738764562503&quot; contenteditable=&quot;false&quot; data-ke-type=&quot;opengraph&quot; data-ke-align=&quot;alignCenter&quot; data-og-type=&quot;website&quot; data-og-title=&quot;ESP EDA which makes sense ⭐️⭐️⭐️⭐️⭐️&quot; data-og-description=&quot;Explore and run machine learning code with Kaggle Notebooks | Using data from CIBMTR - Equity in post-HCT Survival Predictions&quot; data-og-host=&quot;www.kaggle.com&quot; data-og-source-url=&quot;https://www.kaggle.com/code/ambrosm/esp-eda-which-makes-sense&quot; data-og-url=&quot;https://www.kaggle.com/code/ambrosm/esp-eda-which-makes-sense&quot; data-og-image=&quot;&quot;&gt;&lt;a href=&quot;https://www.kaggle.com/code/ambrosm/esp-eda-which-makes-sense&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot; data-source-url=&quot;https://www.kaggle.com/code/ambrosm/esp-eda-which-makes-sense&quot;&gt;
&lt;div class=&quot;og-image&quot; style=&quot;background-image: url();&quot;&gt;&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;og-text&quot;&gt;
&lt;p class=&quot;og-title&quot; data-ke-size=&quot;size16&quot;&gt;ESP EDA which makes sense ⭐️⭐️⭐️⭐️⭐️&lt;/p&gt;
&lt;p class=&quot;og-desc&quot; data-ke-size=&quot;size16&quot;&gt;Explore and run machine learning code with Kaggle Notebooks | Using data from CIBMTR - Equity in post-HCT Survival Predictions&lt;/p&gt;
&lt;p class=&quot;og-host&quot; data-ke-size=&quot;size16&quot;&gt;www.kaggle.com&lt;/p&gt;
&lt;/div&gt;
&lt;/a&gt;&lt;/figure&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;My annotation:&lt;/p&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style5&quot; /&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;지금 당장 꽃을 피우지 못했다고 해서 좌절하지 마세요. 친구와 비교하지도 마세요. &lt;br /&gt;지금은 그저 나의 계절이 아닌 것뿐이에요.&lt;br /&gt;&amp;lt;책 '모든 꽃이 봄에 피지는 않는다'중에서&amp;gt;&lt;br /&gt;&lt;/span&gt;&lt;/blockquote&gt;</description>
      <category>대회</category>
      <category>cibmtr - equity in post-hct survival predictions</category>
      <category>target transformation</category>
      <category>캐글</category>
      <author>dongsunseng</author>
      <guid isPermaLink="true">https://dongsunseng.tistory.com/118</guid>
      <comments>https://dongsunseng.tistory.com/entry/CIBMTR-Equity-in-post-HCT-Survival-Predictions-8-Finding-the-best-target-transformation#entry118comment</comments>
      <pubDate>Wed, 5 Feb 2025 23:55:24 +0900</pubDate>
    </item>
    <item>
      <title>CIBMTR - Equity in post-HCT Survival Predictions #7 AFT model</title>
      <link>https://dongsunseng.tistory.com/entry/CIBMTR-Equity-in-post-HCT-Survival-Predictions-7-AFT-model</link>
      <description>&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;b&gt;Introduction post about AFT model&lt;/b&gt;&lt;b&gt;&lt;/b&gt;&lt;/h3&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;a href=&quot;https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550141&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550141&lt;/a&gt;&lt;/p&gt;
&lt;figure id=&quot;og_1738760536673&quot; contenteditable=&quot;false&quot; data-ke-type=&quot;opengraph&quot; data-ke-align=&quot;alignCenter&quot; data-og-type=&quot;website&quot; data-og-title=&quot;CIBMTR - Equity in post-HCT Survival Predictions&quot; data-og-description=&quot;Improve prediction of transplant survival rates equitably for allogeneic HCT patients&quot; data-og-host=&quot;www.kaggle.com&quot; data-og-source-url=&quot;https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550141&quot; data-og-url=&quot;https://kaggle.com/equity-post-HCT-survival-predictions&quot; data-og-image=&quot;https://scrap.kakaocdn.net/dn/cQ1wdP/hyX7Th0sRQ/3fHQ9v8ImZgk3i1YmD0rC0/img.png?width=560&amp;amp;height=280&amp;amp;face=0_0_560_280,https://scrap.kakaocdn.net/dn/At8yD/hyYciHfoQN/ilQ31z0DAGsInlXUCLBrVK/img.png?width=560&amp;amp;height=280&amp;amp;face=0_0_560_280&quot;&gt;&lt;a href=&quot;https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550141&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot; data-source-url=&quot;https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550141&quot;&gt;
&lt;div class=&quot;og-image&quot; style=&quot;background-image: url('https://scrap.kakaocdn.net/dn/cQ1wdP/hyX7Th0sRQ/3fHQ9v8ImZgk3i1YmD0rC0/img.png?width=560&amp;amp;height=280&amp;amp;face=0_0_560_280,https://scrap.kakaocdn.net/dn/At8yD/hyYciHfoQN/ilQ31z0DAGsInlXUCLBrVK/img.png?width=560&amp;amp;height=280&amp;amp;face=0_0_560_280');&quot;&gt;&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;og-text&quot;&gt;
&lt;p class=&quot;og-title&quot; data-ke-size=&quot;size16&quot;&gt;CIBMTR - Equity in post-HCT Survival Predictions&lt;/p&gt;
&lt;p class=&quot;og-desc&quot; data-ke-size=&quot;size16&quot;&gt;Improve prediction of transplant survival rates equitably for allogeneic HCT patients&lt;/p&gt;
&lt;p class=&quot;og-host&quot; data-ke-size=&quot;size16&quot;&gt;www.kaggle.com&lt;/p&gt;
&lt;/div&gt;
&lt;/a&gt;&lt;/figure&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-05 오후 10.02.03.png&quot; data-origin-width=&quot;2296&quot; data-origin-height=&quot;1144&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/UvMTo/btsL88jp2ZQ/MrEkCHmRUzlXSJg46T2v51/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/UvMTo/btsL88jp2ZQ/MrEkCHmRUzlXSJg46T2v51/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/UvMTo/btsL88jp2ZQ/MrEkCHmRUzlXSJg46T2v51/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FUvMTo%2FbtsL88jp2ZQ%2FMrEkCHmRUzlXSJg46T2v51%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;2296&quot; height=&quot;1144&quot; data-filename=&quot;스크린샷 2025-02-05 오후 10.02.03.png&quot; data-origin-width=&quot;2296&quot; data-origin-height=&quot;1144&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-05 오후 10.03.02.png&quot; data-origin-width=&quot;2252&quot; data-origin-height=&quot;1024&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bcYFyk/btsL80lB1lO/B5PPBnR6pk5HRMaC43dcU0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bcYFyk/btsL80lB1lO/B5PPBnR6pk5HRMaC43dcU0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bcYFyk/btsL80lB1lO/B5PPBnR6pk5HRMaC43dcU0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbcYFyk%2FbtsL80lB1lO%2FB5PPBnR6pk5HRMaC43dcU0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;2252&quot; height=&quot;1024&quot; data-filename=&quot;스크린샷 2025-02-05 오후 10.03.02.png&quot; data-origin-width=&quot;2252&quot; data-origin-height=&quot;1024&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;First saw this conversation about AFT&lt;/li&gt;
&lt;li&gt;Host paper link: &lt;a href=&quot;https://proceedings.mlr.press/v206/norcliffe23a/norcliffe23a.pdf&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://proceedings.mlr.press/v206/norcliffe23a/norcliffe23a.pdf&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Above SurvivalXGBoost model objective:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;&lt;/b&gt;Objective: Survival: AFT (Accelerated Failure Time)&lt;/li&gt;
&lt;li&gt;Evaluation Metric: AFT Negative Log Likelihood&lt;/li&gt;
&lt;li&gt;AFT Loss Distribution: Normal&lt;/li&gt;
&lt;li&gt;AFT Loss Distribution Scale: 1.0&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;This is a specialized survival analysis configuration of XGBoost that can be used in the competition.&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;The AFT (Accelerated Failure Time) model is specialized for predicting survival time, making it a suitable approach for predicting the survival rate of HCT patients, which is the objective of this competition.&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;a href=&quot;https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550141&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550141&lt;/a&gt;&lt;/p&gt;
&lt;figure id=&quot;og_1738761033083&quot; contenteditable=&quot;false&quot; data-ke-type=&quot;opengraph&quot; data-ke-align=&quot;alignCenter&quot; data-og-type=&quot;website&quot; data-og-title=&quot;CIBMTR - Equity in post-HCT Survival Predictions&quot; data-og-description=&quot;Improve prediction of transplant survival rates equitably for allogeneic HCT patients&quot; data-og-host=&quot;www.kaggle.com&quot; data-og-source-url=&quot;https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550141&quot; data-og-url=&quot;https://kaggle.com/equity-post-HCT-survival-predictions&quot; data-og-image=&quot;https://scrap.kakaocdn.net/dn/cQ1wdP/hyX7Th0sRQ/3fHQ9v8ImZgk3i1YmD0rC0/img.png?width=560&amp;amp;height=280&amp;amp;face=0_0_560_280,https://scrap.kakaocdn.net/dn/At8yD/hyYciHfoQN/ilQ31z0DAGsInlXUCLBrVK/img.png?width=560&amp;amp;height=280&amp;amp;face=0_0_560_280&quot;&gt;&lt;a href=&quot;https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550141&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot; data-source-url=&quot;https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550141&quot;&gt;
&lt;div class=&quot;og-image&quot; style=&quot;background-image: url('https://scrap.kakaocdn.net/dn/cQ1wdP/hyX7Th0sRQ/3fHQ9v8ImZgk3i1YmD0rC0/img.png?width=560&amp;amp;height=280&amp;amp;face=0_0_560_280,https://scrap.kakaocdn.net/dn/At8yD/hyYciHfoQN/ilQ31z0DAGsInlXUCLBrVK/img.png?width=560&amp;amp;height=280&amp;amp;face=0_0_560_280');&quot;&gt;&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;og-text&quot;&gt;
&lt;p class=&quot;og-title&quot; data-ke-size=&quot;size16&quot;&gt;CIBMTR - Equity in post-HCT Survival Predictions&lt;/p&gt;
&lt;p class=&quot;og-desc&quot; data-ke-size=&quot;size16&quot;&gt;Improve prediction of transplant survival rates equitably for allogeneic HCT patients&lt;/p&gt;
&lt;p class=&quot;og-host&quot; data-ke-size=&quot;size16&quot;&gt;www.kaggle.com&lt;/p&gt;
&lt;/div&gt;
&lt;/a&gt;&lt;/figure&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-05 오후 10.10.18.png&quot; data-origin-width=&quot;902&quot; data-origin-height=&quot;99&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/c5GnMX/btsL75gXcWs/8ew1BjHo90lJYQM2T4FrrK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/c5GnMX/btsL75gXcWs/8ew1BjHo90lJYQM2T4FrrK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/c5GnMX/btsL75gXcWs/8ew1BjHo90lJYQM2T4FrrK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fc5GnMX%2FbtsL75gXcWs%2F8ew1BjHo90lJYQM2T4FrrK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;902&quot; height=&quot;99&quot; data-filename=&quot;스크린샷 2025-02-05 오후 10.10.18.png&quot; data-origin-width=&quot;902&quot; data-origin-height=&quot;99&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Also metioned here&lt;/li&gt;
&lt;li&gt;My discussion annotation:&lt;/li&gt;
&lt;li&gt;Notebook example annotation:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;d&lt;/li&gt;
&lt;li&gt;d&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style5&quot; /&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;인내할&amp;nbsp;수&amp;nbsp;있는&amp;nbsp;사람은&amp;nbsp;그가&amp;nbsp;바라는&amp;nbsp;것은&amp;nbsp;무엇이든지&amp;nbsp;손에&amp;nbsp;넣을&amp;nbsp;수&amp;nbsp;있다.&lt;br /&gt;&lt;/span&gt;- 벤자민 프랭클린 -&lt;/blockquote&gt;</description>
      <category>대회</category>
      <category>aft model</category>
      <category>cibmtr - equity in post-hct survival predictions</category>
      <category>캐글</category>
      <author>dongsunseng</author>
      <guid isPermaLink="true">https://dongsunseng.tistory.com/117</guid>
      <comments>https://dongsunseng.tistory.com/entry/CIBMTR-Equity-in-post-HCT-Survival-Predictions-7-AFT-model#entry117comment</comments>
      <pubDate>Wed, 5 Feb 2025 22:12:07 +0900</pubDate>
    </item>
    <item>
      <title>CIBMTR - Equity in post-HCT Survival Predictions #6 How To Train XGBoost with Survival Loss</title>
      <link>https://dongsunseng.tistory.com/entry/CIBMTR-Equity-in-post-HCT-Survival-Predictions-6-How-To-Train-XGBoost-with-Survival-Loss</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;Annotation of Chris Deotte's discussion about &lt;b&gt;&quot;How To Train XGBoost with Survival Loss&quot;&lt;/b&gt;.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;a href=&quot;https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550141&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550141&lt;/a&gt;&lt;/p&gt;
&lt;figure id=&quot;og_1738718464455&quot; contenteditable=&quot;false&quot; data-ke-type=&quot;opengraph&quot; data-ke-align=&quot;alignCenter&quot; data-og-type=&quot;website&quot; data-og-title=&quot;CIBMTR - Equity in post-HCT Survival Predictions&quot; data-og-description=&quot;Improve prediction of transplant survival rates equitably for allogeneic HCT patients&quot; data-og-host=&quot;www.kaggle.com&quot; data-og-source-url=&quot;https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550141&quot; data-og-url=&quot;https://kaggle.com/equity-post-HCT-survival-predictions&quot; data-og-image=&quot;https://scrap.kakaocdn.net/dn/cQ1wdP/hyX7Th0sRQ/3fHQ9v8ImZgk3i1YmD0rC0/img.png?width=560&amp;amp;height=280&amp;amp;face=0_0_560_280,https://scrap.kakaocdn.net/dn/At8yD/hyYciHfoQN/ilQ31z0DAGsInlXUCLBrVK/img.png?width=560&amp;amp;height=280&amp;amp;face=0_0_560_280&quot;&gt;&lt;a href=&quot;https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550141&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot; data-source-url=&quot;https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550141&quot;&gt;
&lt;div class=&quot;og-image&quot; style=&quot;background-image: url('https://scrap.kakaocdn.net/dn/cQ1wdP/hyX7Th0sRQ/3fHQ9v8ImZgk3i1YmD0rC0/img.png?width=560&amp;amp;height=280&amp;amp;face=0_0_560_280,https://scrap.kakaocdn.net/dn/At8yD/hyYciHfoQN/ilQ31z0DAGsInlXUCLBrVK/img.png?width=560&amp;amp;height=280&amp;amp;face=0_0_560_280');&quot;&gt;&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;og-text&quot;&gt;
&lt;p class=&quot;og-title&quot; data-ke-size=&quot;size16&quot;&gt;CIBMTR - Equity in post-HCT Survival Predictions&lt;/p&gt;
&lt;p class=&quot;og-desc&quot; data-ke-size=&quot;size16&quot;&gt;Improve prediction of transplant survival rates equitably for allogeneic HCT patients&lt;/p&gt;
&lt;p class=&quot;og-host&quot; data-ke-size=&quot;size16&quot;&gt;www.kaggle.com&lt;/p&gt;
&lt;/div&gt;
&lt;/a&gt;&lt;/figure&gt;
&lt;h3 style=&quot;color: #202124;&quot; data-ke-size=&quot;size23&quot;&gt;&lt;b&gt;How To Train XGBoost with Survival Loss&lt;/b&gt;&lt;/h3&gt;
&lt;div&gt;
&lt;div style=&quot;background-color: #ffffff; color: #3c4043;&quot;&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;This competition involves training survival models.&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;We need to predict risk scores which are inversely proportional to how long a patient is&amp;nbsp;event free.&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;XGBoost can train survival models! (This discussion is a continuation of my first discussion&amp;nbsp;&lt;a href=&quot;https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550003&quot;&gt;here&lt;/a&gt;).
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Annotation for the previous discussion here: &lt;a href=&quot;https://dongsunseng.com/entry/CIBMTR-Equity-in-post-HCT-Survival-Predictions-5-How-To-Get-Started-Understanding-the-Metric&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://dongsunseng.com/entry/CIBMTR-Equity-in-post-HCT-Survival-Predictions-5-How-To-Get-Started-Understanding-the-Metric&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 style=&quot;color: #202124; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;Targets Explained&lt;/b&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;For patients with&amp;nbsp;efs=1, we observe they had an&amp;nbsp;event&amp;nbsp;and know&amp;nbsp;exactly&amp;nbsp;how long they were&amp;nbsp;event free&amp;nbsp;(namely&amp;nbsp;efs_time).&lt;/li&gt;
&lt;li&gt;For patients with&amp;nbsp;efs=0, we observe that they were&amp;nbsp;event free&amp;nbsp;for&amp;nbsp;efs_time&amp;nbsp;but do not know if eventually they will have an event or not.&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;So we only know they are&amp;nbsp;event free&amp;nbsp;for&amp;nbsp;at least efs_time.&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;Survival models are new to me so yesterday my starter notebook does not use survival models directly.&lt;/li&gt;
&lt;li&gt;Instead I studied the metric and mathematically determined how to transform the two targets&lt;span style=&quot;font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;efs&lt;/span&gt;&lt;span style=&quot;font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;and&lt;/span&gt;&lt;span style=&quot;font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;efs_time&lt;/span&gt;&lt;span style=&quot;font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;into a single target&lt;/span&gt;&lt;span style=&quot;font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;y&lt;/span&gt;&lt;span style=&quot;font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;and then trained a regression model to predict a proxy for inverse risk score. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;My starter discussion is&lt;/span&gt;&lt;span style=&quot;font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;a style=&quot;color: #202124;&quot; href=&quot;https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550003&quot;&gt;here&lt;/a&gt;&lt;span style=&quot;font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;Today I learned that XGBoost and CatBoost can train survival models directly.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 style=&quot;color: #202124; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;XGBoost Survival:Cox Model&lt;/b&gt;&lt;/h4&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;Starting from my public starter notebook, we can train XGBoost survival model as follows. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;First we make a new column called&lt;span&gt;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;efs_time2&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;&lt;span&gt;&amp;nbsp;&lt;/span&gt;which includes the information of both&lt;span&gt;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;efs&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;&lt;span&gt;&amp;nbsp;&lt;/span&gt;and&lt;span&gt;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;efs_time&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;:&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;python&quot; style=&quot;background-color: #f1f3f4; color: #3c4043; text-align: start;&quot; data-ke-language=&quot;python&quot;&gt;&lt;code&gt;train[&quot;efs_time2&quot;] = train.efs_time.copy()
train.loc[train.efs==0,&quot;efs_time2&quot;] *= -1&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;Then remove this new column from features by changing code cell #5 with:&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;python&quot; style=&quot;background-color: #f1f3f4; color: #3c4043; text-align: start;&quot; data-ke-language=&quot;python&quot;&gt;&lt;code&gt;RMV = [&quot;ID&quot;,&quot;efs&quot;,&quot;efs_time&quot;,&quot;y&quot;,&quot;efs_time2&quot;]&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;Then we train using this target:&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;python&quot; style=&quot;background-color: #f1f3f4; color: #3c4043; text-align: start;&quot; data-ke-language=&quot;python&quot;&gt;&lt;code&gt;y_train = train.loc[train_index,&quot;efs_time2&quot;]&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;And we change XGBoost parameters to:&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;python&quot; style=&quot;background-color: #f1f3f4; color: #3c4043; text-align: start;&quot; data-ke-language=&quot;python&quot;&gt;&lt;code&gt;    objective='survival:cox',
    eval_metric='cox-nloglik',&lt;/code&gt;&lt;/pre&gt;
&lt;blockquote data-ke-style=&quot;style3&quot;&gt;CV Score: 0.672&lt;/blockquote&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;Horikita Saku's comment about this part:&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;He tried:&lt;/b&gt;&lt;br /&gt;&lt;i&gt;train[&quot;efs_time2&quot;]&amp;nbsp;=&amp;nbsp;train.efs_time.copy()&lt;/i&gt;&lt;br /&gt;&lt;i&gt;train.loc[train.efs==0,&quot;efs_time2&quot;]&amp;nbsp;*=&amp;nbsp;-1&lt;/i&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;&lt;b&gt;and train by:&lt;/b&gt;&lt;br /&gt;&lt;i&gt;x_train&amp;nbsp;=&amp;nbsp;train.loc[train_index,&amp;nbsp;FEATURES].copy()&lt;/i&gt;&lt;br /&gt;&lt;i&gt;y_train&amp;nbsp;=&amp;nbsp;train.loc[train_index,&amp;nbsp;&quot;efs_time2&quot;]&lt;/i&gt;&lt;br /&gt;&lt;i&gt;x_valid&amp;nbsp;=&amp;nbsp;train.loc[test_index,&amp;nbsp;FEATURES].copy()&lt;/i&gt;&lt;br /&gt;&lt;i&gt;y_valid&amp;nbsp;=&amp;nbsp;train.loc[test_index,&amp;nbsp;&quot;efs_time2&quot;]&lt;/i&gt;&lt;br /&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;&lt;b&gt;the&amp;nbsp;params&amp;nbsp;are:&lt;/b&gt;&lt;br /&gt;&lt;i&gt;eval_metric='cox-nloglik',&lt;/i&gt;&lt;br /&gt;&lt;i&gt;objective='survival:cox',&lt;/i&gt;&lt;br /&gt;&lt;i&gt;boosting_type=&amp;nbsp;&quot;dart&quot;,&lt;/i&gt;&lt;br /&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;&lt;b&gt;ran the eval(scoring) by:&lt;/b&gt;&lt;br /&gt;&lt;i&gt;from&amp;nbsp;metric&amp;nbsp;import&amp;nbsp;score&lt;/i&gt;&lt;br /&gt;&lt;i&gt;y_true&amp;nbsp;=&amp;nbsp;train[[&quot;ID&quot;,&quot;efs&quot;,&quot;efs_time&quot;,&quot;race_group&quot;]].copy()&lt;/i&gt;&lt;br /&gt;&lt;i&gt;y_pred&amp;nbsp;=&amp;nbsp;train[[&quot;ID&quot;]].copy()&lt;/i&gt;&lt;br /&gt;&lt;i&gt;y_pred[&quot;prediction&quot;]&amp;nbsp;=&amp;nbsp;oof_xgb&lt;/i&gt;&lt;br /&gt;&lt;i&gt;m&amp;nbsp;=&amp;nbsp;score(y_true.copy(),&amp;nbsp;y_pred.copy(),&amp;nbsp;&quot;ID&quot;)&lt;/i&gt;&lt;br /&gt;&lt;i&gt;print(f&quot;\nOverall&amp;nbsp;CV&amp;nbsp;for&amp;nbsp;XGBoost&amp;nbsp;=&quot;,m)&lt;/i&gt;&lt;/span&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;However, I obtained an&lt;span&gt;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;Overall CV for XGBoost = 0.9889430880402769&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;, but the LB is 0.58, which seems to be definitely an anomaly. Do you have any ideas on what might be causing this?  &lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;&lt;span style=&quot;color: #3c4043; text-align: start;&quot;&gt;&lt;span style=&quot;color: #3c4043; text-align: start;&quot;&gt;Reply by the author:&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;&lt;b&gt;It is because the lack of this code:&lt;/b&gt;&lt;br /&gt;&lt;i&gt;&amp;nbsp;RMV&amp;nbsp;=&amp;nbsp;[&quot;ID&quot;,&quot;efs&quot;,&quot;efs_time&quot;,&quot;y&quot;,&quot;efs_time2&quot;]&lt;/i&gt;&lt;br /&gt;&lt;i&gt;&amp;nbsp;FEATURES&amp;nbsp;=&amp;nbsp;[c&amp;nbsp;for&amp;nbsp;c&amp;nbsp;in&amp;nbsp;train.columns&amp;nbsp;if&amp;nbsp;not&amp;nbsp;c&amp;nbsp;in&amp;nbsp;RMV]&lt;/i&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;&quot;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;I'm guessing that your model is using&lt;span&gt;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;efs_time2&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;&lt;span&gt;&amp;nbsp;&lt;/span&gt;as both the target and a feature. I will add this to the discussion above. Thanks for discovering this.&quot;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;&lt;span style=&quot;color: #3c4043; text-align: start;&quot;&gt;&lt;span style=&quot;color: #3c4043; text-align: start;&quot;&gt;&lt;span style=&quot;color: #3c4043; text-align: start;&quot;&gt;Details:&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;span style=&quot;color: #3c4043; text-align: start;&quot;&gt;&lt;span style=&quot;color: #3c4043; text-align: start;&quot;&gt;&lt;span style=&quot;color: #3c4043; text-align: start;&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;The core issue here is &lt;b&gt;&quot;Data leakage&quot; problem&lt;/b&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;Where the problem occurred:&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffc1c8; color: #3c4043; text-align: start;&quot;&gt;&lt;span style=&quot;color: #3c4043; text-align: start;&quot;&gt;&lt;span style=&quot;color: #3c4043; text-align: start;&quot;&gt;A new target variable efs_time2 was created, but it was accidentally used as a feature as well&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;As a result, the model received target information as a feature, leading to abnormally high cv scores(0.98)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;However, since there's no such leakage in the actual test data, the LB score was very low (0.58)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;Solution:&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;&lt;b&gt;&lt;span&gt;&lt;span&gt;RMV &lt;/span&gt;&lt;span style=&quot;color: #61afef;&quot;&gt;=&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span style=&quot;color: #abb2bf;&quot;&gt;[&lt;/span&gt;&lt;span style=&quot;color: #98c379;&quot;&gt;&quot;ID&quot;&lt;/span&gt;&lt;span style=&quot;color: #abb2bf;&quot;&gt;,&lt;/span&gt;&lt;span style=&quot;color: #98c379;&quot;&gt;&quot;efs&quot;&lt;/span&gt;&lt;span style=&quot;color: #abb2bf;&quot;&gt;,&lt;/span&gt;&lt;span style=&quot;color: #98c379;&quot;&gt;&quot;efs_time&quot;&lt;/span&gt;&lt;span style=&quot;color: #abb2bf;&quot;&gt;,&lt;/span&gt;&lt;span style=&quot;color: #98c379;&quot;&gt;&quot;y&quot;&lt;/span&gt;&lt;span style=&quot;color: #abb2bf;&quot;&gt;,&lt;/span&gt;&lt;span style=&quot;color: #98c379;&quot;&gt;&quot;efs_time2&quot;&lt;/span&gt;&lt;span style=&quot;color: #abb2bf;&quot;&gt;]&lt;/span&gt;&lt;span&gt; &lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span&gt;FEATURES &lt;/span&gt;&lt;span style=&quot;color: #61afef;&quot;&gt;=&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span style=&quot;color: #abb2bf;&quot;&gt;[&lt;/span&gt;&lt;span&gt;c &lt;/span&gt;&lt;span style=&quot;color: #c678dd;&quot;&gt;for&lt;/span&gt;&lt;span&gt; c &lt;/span&gt;&lt;span style=&quot;color: #c678dd;&quot;&gt;in&lt;/span&gt;&lt;span&gt; train&lt;/span&gt;&lt;span style=&quot;color: #abb2bf;&quot;&gt;.&lt;/span&gt;&lt;span&gt;columns &lt;/span&gt;&lt;span style=&quot;color: #c678dd;&quot;&gt;if&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span style=&quot;color: #c678dd;&quot;&gt;not&lt;/span&gt;&lt;span&gt; c &lt;/span&gt;&lt;span style=&quot;color: #c678dd;&quot;&gt;in&lt;/span&gt;&lt;span&gt; RMV&lt;/span&gt;&lt;span style=&quot;color: #abb2bf;&quot;&gt;]&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;The RMV list specifies columns that should be excluded from features&lt;/li&gt;
&lt;li&gt;FEATURES selects only columns not included in RMV&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;This explicitly prevents efs_time2 from being used as a feature&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Why this code is necessary:&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;If the target variable (efs_time2) is included in features, the model essentially &quot;cheats&quot;&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;It ends up using information during training that would never be available in real prediction scenarios&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;This makes it impossible to accurately evaluate the model's generalization performance&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 style=&quot;color: #202124; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;CatBoost Survival:Cox Model&lt;/b&gt;&lt;/h4&gt;
&lt;p style=&quot;color: #3c4043; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;For CatBoost, we use the target&lt;span&gt;&amp;nbsp;&lt;/span&gt;efs_time2&lt;span&gt;&amp;nbsp;&lt;/span&gt;and&lt;span&gt;&amp;nbsp;&lt;/span&gt;loss_function=&quot;Cox&quot;&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style3&quot;&gt;CV Score: 0.670&lt;/blockquote&gt;
&lt;h4 style=&quot;color: #202124; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;Starter Notebook&lt;/b&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;I publish a starter notebook demonstrating this code&amp;nbsp;&lt;a href=&quot;https://www.kaggle.com/code/cdeotte/gpu-lightgbm-baseline-cv-681-lb-685&quot;&gt;here&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Using these techniques I achieved&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;CV=0.681&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;and&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;LB=0.685&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;Annotation of the starter notebook here: &lt;a href=&quot;https://dongsunseng.com/entry/CIBMTR-Equity-in-post-HCT-Survival-Predictions-4-GPU-LightGBM-Baseline-CV-681-LB-685&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://dongsunseng.com/entry/CIBMTR-Equity-in-post-HCT-Survival-Predictions-4-GPU-LightGBM-Baseline-CV-681-LB-685&lt;/a&gt;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote data-ke-style=&quot;style3&quot;&gt;CV Score: 0.681&lt;/blockquote&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style3&quot; /&gt;
&lt;h4 style=&quot;color: #202124; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;UPDATE - We can use Survivial:AFT Model&lt;/b&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;We can also train XGBoost and CatBoost with&amp;nbsp;Survivial:AFT&amp;nbsp;loss.&lt;/li&gt;
&lt;li&gt;See discussions&amp;nbsp;&lt;a href=&quot;https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550563&quot;&gt;here&lt;/a&gt;&amp;nbsp;and notebook examples&amp;nbsp;&lt;a href=&quot;https://www.kaggle.com/code/ambrosm/esp-eda-which-makes-sense&quot;&gt;here&lt;/a&gt;&amp;nbsp;and&amp;nbsp;&lt;a href=&quot;https://www.kaggle.com/code/horikitasaku/cv0-665-lb0-666-cat-xgb-with-aft-loss-function&quot;&gt;here&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote data-ke-style=&quot;style2&quot;&gt;My Annotation &amp;amp; Explanation about AFT model here:&amp;nbsp;&lt;/blockquote&gt;
&lt;h4 style=&quot;color: #202124; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;UPDATE - NN Starter Notebook&lt;/b&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;I published an NN starter notebook&amp;nbsp;&lt;a href=&quot;https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550343&quot;&gt;here&lt;/a&gt;&amp;nbsp;with CV 0.670 and LB 0.676!&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote data-ke-style=&quot;style2&quot;&gt;My Annotation &amp;amp; Explanation about NN model here:&amp;nbsp;&lt;/blockquote&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style5&quot; /&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;It's nice to be important, but it's more important to be nice.&lt;br /&gt;&lt;/span&gt;- Dwayne Johnson -&lt;/blockquote&gt;</description>
      <category>대회</category>
      <category>cibmtr - equity in post-hct survival predictions</category>
      <category>캐글</category>
      <author>dongsunseng</author>
      <guid isPermaLink="true">https://dongsunseng.tistory.com/116</guid>
      <comments>https://dongsunseng.tistory.com/entry/CIBMTR-Equity-in-post-HCT-Survival-Predictions-6-How-To-Train-XGBoost-with-Survival-Loss#entry116comment</comments>
      <pubDate>Wed, 5 Feb 2025 16:34:30 +0900</pubDate>
    </item>
    <item>
      <title>CIBMTR - Equity in post-HCT Survival Predictions #5 How To Get Started - Understanding the Metric</title>
      <link>https://dongsunseng.tistory.com/entry/CIBMTR-Equity-in-post-HCT-Survival-Predictions-5-How-To-Get-Started-Understanding-the-Metric</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #333333; text-align: start;&quot;&gt;Annotation of Chris Deotte's discussion about&lt;span&gt;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;b&gt;&quot;How&amp;nbsp;To&amp;nbsp;Get&amp;nbsp;Started&amp;nbsp;-&amp;nbsp;Understanding&amp;nbsp;the&amp;nbsp;Metric&quot;&lt;/b&gt;.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;a href=&quot;https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550003&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550003&lt;/a&gt;&lt;/p&gt;
&lt;figure id=&quot;og_1738735870299&quot; contenteditable=&quot;false&quot; data-ke-type=&quot;opengraph&quot; data-ke-align=&quot;alignCenter&quot; data-og-type=&quot;website&quot; data-og-title=&quot;CIBMTR - Equity in post-HCT Survival Predictions&quot; data-og-description=&quot;Improve prediction of transplant survival rates equitably for allogeneic HCT patients&quot; data-og-host=&quot;www.kaggle.com&quot; data-og-source-url=&quot;https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550003&quot; data-og-url=&quot;https://kaggle.com/equity-post-HCT-survival-predictions&quot; data-og-image=&quot;https://scrap.kakaocdn.net/dn/kzroU/hyX72TFXl5/ubSkXAjyakI9yDW6qTQbhK/img.png?width=560&amp;amp;height=280&amp;amp;face=0_0_560_280,https://scrap.kakaocdn.net/dn/cai7mF/hyYb6GQQld/kh8iVCCvSxit4k0YzKFNfk/img.png?width=560&amp;amp;height=280&amp;amp;face=0_0_560_280&quot;&gt;&lt;a href=&quot;https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550003&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot; data-source-url=&quot;https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550003&quot;&gt;
&lt;div class=&quot;og-image&quot; style=&quot;background-image: url('https://scrap.kakaocdn.net/dn/kzroU/hyX72TFXl5/ubSkXAjyakI9yDW6qTQbhK/img.png?width=560&amp;amp;height=280&amp;amp;face=0_0_560_280,https://scrap.kakaocdn.net/dn/cai7mF/hyYb6GQQld/kh8iVCCvSxit4k0YzKFNfk/img.png?width=560&amp;amp;height=280&amp;amp;face=0_0_560_280');&quot;&gt;&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;og-text&quot;&gt;
&lt;p class=&quot;og-title&quot; data-ke-size=&quot;size16&quot;&gt;CIBMTR - Equity in post-HCT Survival Predictions&lt;/p&gt;
&lt;p class=&quot;og-desc&quot; data-ke-size=&quot;size16&quot;&gt;Improve prediction of transplant survival rates equitably for allogeneic HCT patients&lt;/p&gt;
&lt;p class=&quot;og-host&quot; data-ke-size=&quot;size16&quot;&gt;www.kaggle.com&lt;/p&gt;
&lt;/div&gt;
&lt;/a&gt;&lt;/figure&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style3&quot; /&gt;
&lt;h4 style=&quot;color: #202124; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;C-Index Explained&lt;/b&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The competition metric is&amp;nbsp;&lt;b&gt;Stratified Concordance Index&lt;/b&gt;.&lt;/li&gt;
&lt;li&gt;Let's explain how C-Index works (and let's ignore stratified for now).&lt;/li&gt;
&lt;li&gt;Here is the formula:&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-05 오후 3.11.52.png&quot; data-origin-width=&quot;531&quot; data-origin-height=&quot;151&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bjfRZT/btsL7gbYmcc/KG89quzXgJKkYieNJwBUz0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bjfRZT/btsL7gbYmcc/KG89quzXgJKkYieNJwBUz0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bjfRZT/btsL7gbYmcc/KG89quzXgJKkYieNJwBUz0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbjfRZT%2FbtsL7gbYmcc%2FKG89quzXgJKkYieNJwBUz0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;531&quot; height=&quot;151&quot; data-filename=&quot;스크린샷 2025-02-05 오후 3.11.52.png&quot; data-origin-width=&quot;531&quot; data-origin-height=&quot;151&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;h4 style=&quot;color: #202124; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;Ground Truth and Predictions&lt;/b&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Here is an image which will help us understand what this means:&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-05 오후 3.12.23.png&quot; data-origin-width=&quot;975&quot; data-origin-height=&quot;353&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/cDavae/btsL73po4QK/K30S4sq3j4WCSIb4gg8fy1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/cDavae/btsL73po4QK/K30S4sq3j4WCSIb4gg8fy1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/cDavae/btsL73po4QK/K30S4sq3j4WCSIb4gg8fy1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FcDavae%2FbtsL73po4QK%2FK30S4sq3j4WCSIb4gg8fy1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;975&quot; height=&quot;353&quot; data-filename=&quot;스크린샷 2025-02-05 오후 3.12.23.png&quot; data-origin-width=&quot;975&quot; data-origin-height=&quot;353&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;Imagine that there are only 10 rows in the&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;train.csv&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;file shown above as 10 dots. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;There are 5&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;efs=1&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;and 5&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;efs=0. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;The&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;efs_time&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;is displayed in the plot above. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;Points&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;A, B, C, D, E&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;have&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;efs=1, and points&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;F, G, H, I, J&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;have&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;efs=0. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;The point&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;A&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;has the least&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;efs_time&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;and the point&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;J&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;has the greatest&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;efs_time.&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;Each patient with&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;efs=1&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;had an event, and the time before event was&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;efs_time. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;Each patient with&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;efs=0&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;we&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;b&gt;do not know&lt;/b&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;if they had an event or did not have an event. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;All we know is that they were without event for at least&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;efs_time&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;long. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;To summarize, each&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;efs=1&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;was without event for&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;b&gt;exactly&lt;/b&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;efs_time.&lt;/span&gt; &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #c0d1e7;&quot;&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;And each&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;efs=0&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;was without event for&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;b&gt;at least&lt;/b&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;efs_time.&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 style=&quot;color: #202124; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;How To Compute C-Index&lt;span&gt;&amp;nbsp;&lt;/span&gt;Denominator&lt;/b&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The C-Index metric is a &lt;b&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;ranking metric&lt;/span&gt;&lt;/b&gt; similar to AUC.&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;The&amp;nbsp;denominator&amp;nbsp;counts all pairs of dots where we&amp;nbsp;know&amp;nbsp;ground truth&amp;nbsp;T_j &amp;lt; T_i&amp;nbsp;where&amp;nbsp;T&amp;nbsp;is the&amp;nbsp;actual time without event&amp;nbsp;(note when&amp;nbsp;efs=0&amp;nbsp;then&amp;nbsp;actual time without event &amp;gt; efs_time&amp;nbsp;and when&amp;nbsp;efs=1&amp;nbsp;then&amp;nbsp;actual time without event = efs_time). &lt;/span&gt;&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;T&lt;/b&gt;&lt;/i&gt; represents the &quot;actual time without event&quot; of a patient&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;efs=1&lt;/b&gt;&lt;/i&gt; indicates the occurrence of an event (e.g., death)&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;efs=0&lt;/b&gt;&lt;/i&gt; indicates a censored case (end of follow-up)&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Actual survival time (T) in two situations&lt;/b&gt;:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;case&amp;nbsp;1)&amp;nbsp;When&amp;nbsp;efs=1:&lt;/b&gt;&lt;br /&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;-&amp;nbsp;T&amp;nbsp;=&amp;nbsp;efs_time&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;-&amp;nbsp;We&amp;nbsp;can&amp;nbsp;know&amp;nbsp;the&amp;nbsp;exact&amp;nbsp;survival&amp;nbsp;time&lt;br /&gt;&lt;b&gt;case&amp;nbsp;2)&amp;nbsp;When&amp;nbsp;efs=0:&lt;/b&gt;&lt;br /&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;-&amp;nbsp;T&amp;nbsp;&amp;gt;&amp;nbsp;efs_time&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;-&amp;nbsp;We&amp;nbsp;only&amp;nbsp;know&amp;nbsp;that&amp;nbsp;they&amp;nbsp;survived&amp;nbsp;beyond&amp;nbsp;the&amp;nbsp;last&amp;nbsp;observation&amp;nbsp;point&amp;nbsp;(efs_time)&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Examples of when we can determine &quot;T_j &amp;lt; T_i&quot;&lt;/b&gt;:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;#&amp;nbsp;Example:&lt;br /&gt;Patient1:&amp;nbsp;efs=1,&amp;nbsp;efs_time=10&amp;nbsp;&amp;nbsp;&amp;rarr;&amp;nbsp;T1=10&lt;br /&gt;Patient2:&amp;nbsp;efs=0,&amp;nbsp;efs_time=15&amp;nbsp;&amp;nbsp;&amp;rarr;&amp;nbsp;T2&amp;gt;15&lt;br /&gt;Patient3:&amp;nbsp;efs=1,&amp;nbsp;efs_time=20&amp;nbsp;&amp;nbsp;&amp;rarr;&amp;nbsp;T3=20&lt;br /&gt;#&amp;nbsp;Comparable&amp;nbsp;pairs:&lt;br /&gt;-&amp;nbsp;T1&amp;nbsp;&amp;lt;&amp;nbsp;T3&amp;nbsp;(10&amp;nbsp;&amp;lt;&amp;nbsp;20)&lt;br /&gt;-&amp;nbsp;T1&amp;nbsp;&amp;lt;&amp;nbsp;T2&amp;nbsp;(10&amp;nbsp;&amp;lt;&amp;nbsp;T2,&amp;nbsp;where&amp;nbsp;T2&amp;nbsp;is&amp;nbsp;greater&amp;nbsp;than&amp;nbsp;15)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;Meaning&amp;nbsp;of&amp;nbsp;C-Index&amp;nbsp;denominator:&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Among&amp;nbsp;all&amp;nbsp;possible&amp;nbsp;patient&amp;nbsp;pairs&amp;nbsp;(i,j)&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;Count&amp;nbsp;the&amp;nbsp;number&amp;nbsp;of&amp;nbsp;pairs&amp;nbsp;where&amp;nbsp;we&amp;nbsp;can&amp;nbsp;definitively&amp;nbsp;determine&amp;nbsp;&quot;who&amp;nbsp;lived&amp;nbsp;longer&quot;&amp;nbsp;&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;The reason for calculating this way is that for censored cases (efs=0), we don't know the exact survival time, so we only include pairs that can be definitively compared in the evaluation.&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;The variables&amp;nbsp;i&amp;nbsp;and&amp;nbsp;j&amp;nbsp;are indices that range over every dot. In the example above, there are 32 possible pairs that we&amp;nbsp;know&amp;nbsp;T_j &amp;lt; T_i:&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-05 오후 3.30.11.png&quot; data-origin-width=&quot;996&quot; data-origin-height=&quot;144&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/cclJ73/btsL6qlTeMX/C81cK0UHBEVqk0HIEj1oj0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/cclJ73/btsL6qlTeMX/C81cK0UHBEVqk0HIEj1oj0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/cclJ73/btsL6qlTeMX/C81cK0UHBEVqk0HIEj1oj0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FcclJ73%2FbtsL6qlTeMX%2FC81cK0UHBEVqk0HIEj1oj0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;996&quot; height=&quot;144&quot; data-filename=&quot;스크린샷 2025-02-05 오후 3.30.11.png&quot; data-origin-width=&quot;996&quot; data-origin-height=&quot;144&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;Note we do not know if&amp;nbsp;D&amp;nbsp;is less than&amp;nbsp;F&amp;nbsp;because we do not know the&amp;nbsp;actual time without event&amp;nbsp;for&amp;nbsp;F, we only know that&amp;nbsp;F's time without event is&amp;nbsp;at least&amp;nbsp;what it appears in the plot above (because&amp;nbsp;F&amp;nbsp;is&amp;nbsp;efs=0). &lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Also we do not know if&amp;nbsp;G&amp;nbsp;is less than&amp;nbsp;H&amp;nbsp;because we do not know&amp;nbsp;actual time without event&amp;nbsp;for&amp;nbsp;G&amp;nbsp;nor&amp;nbsp;H&amp;nbsp;(both are&amp;nbsp;efs=0).&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 style=&quot;color: #202124; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;How To Compute C-Index&lt;span&gt;&amp;nbsp;&lt;/span&gt;Numerator&lt;/b&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;The C-Index numerator is about our predictions.&lt;/span&gt; &lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;For the 32 pairs above, we count how many of our predictions also follow these inequalities.&lt;/span&gt; &lt;/b&gt;&lt;/li&gt;
&lt;li&gt;For example, is our prediction A greater than B? Is our prediction A greater than C? etc etc.&lt;/li&gt;
&lt;li&gt;We ask 32 questions.&lt;/li&gt;
&lt;li&gt;The last is, is our prediction E greater than J.&lt;/li&gt;
&lt;li&gt;&lt;b&gt;If all 32 questions answer yes, then our metric score = 1.&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;If all questions answer no, then our metric score = 0.&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;If 22 questions answer yes, then our metric score = 22/32 = 0.6875.&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #99cefa;&quot;&gt;(Note that inequalities in denominator are&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;b&gt;less than&lt;/b&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&lt;span style=&quot;background-color: #99cefa;&quot;&gt;(and about time).&lt;/span&gt; &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;And our predictions for the same pairs are&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;b&gt;greater than&lt;/b&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;(and about risk).&lt;/span&gt; &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #99cefa;&quot;&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;This is because the denominator represents&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;b&gt;times&lt;/b&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;being&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;b&gt;less than&lt;/b&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;. &lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;And our&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;b&gt;numerator&lt;/b&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;represents&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;b&gt;risks&lt;/b&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;being&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;b&gt;greater than&lt;/b&gt;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;.&lt;/span&gt; &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;In other words a patient with a&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;b&gt;shorter&lt;/b&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;time without event has a&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;b&gt;greater&lt;/b&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;risk. &lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;And we are predicting risk factor.&lt;/span&gt; &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;(If you get this backwards, just change your predictions with&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;pred = -1 * pred).)&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 style=&quot;color: #202124; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;How To Build a Model&lt;/b&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;If we only use&amp;nbsp;efs&amp;nbsp;as&amp;nbsp;classification&amp;nbsp;0 or 1, to train our model (like current public notebooks), then our model will not be able to correctly compare&amp;nbsp;A&amp;nbsp;and&amp;nbsp;C&amp;nbsp;which both have&amp;nbsp;efs=1.&lt;/span&gt; &lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;If we use&amp;nbsp;efs_time&amp;nbsp;as&amp;nbsp;regression, then our model can be&amp;nbsp;smarter. &lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;And if we use both&amp;nbsp;efs&amp;nbsp;and&amp;nbsp;efs_time&amp;nbsp;to train our model (classification/regression), our model will be&amp;nbsp;smartest!&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 style=&quot;color: #202124; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;Starter Notebook&lt;/b&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;There are &lt;b&gt;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;two ways to approach this competition and utilize both&amp;nbsp;efs&amp;nbsp;and efs_time&lt;/span&gt;&lt;/b&gt;:
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;Combine&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;efs&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;and&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;efs_time&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;ourselves into a new single target.&lt;/span&gt; &lt;/span&gt;&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;Then train a model using either classification or regression (on the new single target). &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;This is what i do in my XGB starter notebook&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;a style=&quot;color: #202124;&quot; href=&quot;https://www.kaggle.com/code/cdeotte/xgboost-catboost-baseline-cv-668-lb-668&quot;&gt;here&lt;/a&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;and NN starter notebook&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;a style=&quot;color: #202124;&quot; href=&quot;https://www.kaggle.com/code/cdeotte/nn-mlp-baseline-cv-670-lb-676&quot;&gt;here&lt;/a&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;.&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;XGB starter notebook: &lt;a href=&quot;https://www.kaggle.com/code/cdeotte/xgboost-catboost-baseline-cv-668-lb-668&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://www.kaggle.com/code/cdeotte/xgboost-catboost-baseline-cv-668-lb-668&lt;/a&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;No annotation due to lower cv, lb score (0.668)&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;Basically &lt;a href=&quot;https://dongsunseng.com/entry/CIBMTR-Equity-in-post-HCT-Survival-Predictions-4-GPU-LightGBM-Baseline-CV-681-LB-685&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://dongsunseng.com/entry/CIBMTR-Equity-in-post-HCT-Survival-Predictions-4-GPU-LightGBM-Baseline-CV-681-LB-685&lt;/a&gt; without survival model(cox, kaplan meier)&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043;&quot;&gt;My annotation on NN starter notebook:&amp;nbsp;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px; background-color: #ffc1c8;&quot;&gt;(Note each uses a different transformed target and we can experiment making more transformed targets to find the best!)&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;Use a model that supports survival loss (i.e. Cox or AFT).&lt;/span&gt; &lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Then we leave&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;efs&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;and&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;efs_time&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;as is and &lt;b&gt;input both into the model&lt;/b&gt;. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;The model learns from both and predicts a single target for us. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;More discussion about this&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;a style=&quot;color: #202124;&quot; href=&quot;https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550141&quot;&gt;here&lt;/a&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;.&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;My annotation on the discussion: &lt;a href=&quot;https://dongsunseng.com/entry/CIBMTR-Equity-in-post-HCT-Survival-Predictions-6-How-To-Train-XGBoost-with-Survival-Loss&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://dongsunseng.com/entry/CIBMTR-Equity-in-post-HCT-Survival-Predictions-6-How-To-Train-XGBoost-with-Survival-Loss&lt;/a&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 style=&quot;color: #202124; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;How To Compute Metric in Notebook&lt;/b&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;To compute the competition metric in your notebook, attached this notebook&amp;nbsp;&lt;a href=&quot;https://www.kaggle.com/code/cdeotte/pip-install-lifelines&quot;&gt;here&lt;/a&gt;&amp;nbsp;which contains WHL files (because we need to pip install with internet off to be able to submit to comp).&lt;/li&gt;
&lt;li&gt;Also attach Kaggle's metic notebook&amp;nbsp;&lt;a href=&quot;https://www.kaggle.com/code/metric/eefs-concordance-index&quot;&gt;here&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Then add the following code in the first cell:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1738739867440&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;!pip install /kaggle/input/pip-install-lifelines/autograd-1.7.0-py3-none-any.whl
!pip install /kaggle/input/pip-install-lifelines/autograd-gamma-0.5.0.tar.gz
!pip install /kaggle/input/pip-install-lifelines/interface_meta-1.3.0-py3-none-any.whl
!pip install /kaggle/input/pip-install-lifelines/formulaic-1.0.2-py3-none-any.whl
!pip install /kaggle/input/pip-install-lifelines/lifelines-0.30.0-py3-none-any.whl&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Afterward to compute the competition metric, run this code where&amp;nbsp;preds&amp;nbsp;are your oof predictions:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;python&quot; style=&quot;background-color: #f1f3f4; color: #3c4043; text-align: start;&quot; data-ke-language=&quot;python&quot;&gt;&lt;code&gt;from metric import score
y_true = train[[&quot;ID&quot;,&quot;efs&quot;,&quot;efs_time&quot;,&quot;race_group&quot;]].copy()
y_pred = train[[&quot;ID&quot;]].copy()
y_pred[&quot;prediction&quot;] = preds
m = score(y_true.copy(), y_pred.copy(), &quot;ID&quot;)
print(f&quot;CV Score = {m}&quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;Calculating c-index(from turkenm's comment)&lt;/b&gt;&lt;/h4&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-05 오후 8.25.56.png&quot; data-origin-width=&quot;2204&quot; data-origin-height=&quot;946&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bT9V0q/btsL7Ghi1pG/EXVpYHyKDKsu3wZpm9KVi0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bT9V0q/btsL7Ghi1pG/EXVpYHyKDKsu3wZpm9KVi0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bT9V0q/btsL7Ghi1pG/EXVpYHyKDKsu3wZpm9KVi0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbT9V0q%2FbtsL7Ghi1pG%2FEXVpYHyKDKsu3wZpm9KVi0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;2204&quot; height=&quot;946&quot; data-filename=&quot;스크린샷 2025-02-05 오후 8.25.56.png&quot; data-origin-width=&quot;2204&quot; data-origin-height=&quot;946&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Misunderstanding: Thought that all efs=0 patients should have lower risk scores than efs=1 patients&lt;/li&gt;
&lt;li&gt;However, the author of the kernel corrected:&amp;nbsp;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;&quot;Not&amp;nbsp;all&amp;nbsp;efs=0&amp;nbsp;predictions&amp;nbsp;need&amp;nbsp;to&amp;nbsp;be&amp;nbsp;lower&amp;nbsp;than&amp;nbsp;efs=1.&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;Only efs=0 cases where efs_time is greater than all efs=1 cases &lt;/b&gt;&lt;/span&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;need&amp;nbsp;to&amp;nbsp;have&amp;nbsp;lower&amp;nbsp;risk&amp;nbsp;scores.&quot;&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Example:&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;#&amp;nbsp;Case&amp;nbsp;examples&lt;/b&gt;&lt;br /&gt;&lt;b&gt;Patient&amp;nbsp;A:&amp;nbsp;efs=1,&amp;nbsp;efs_time=10&amp;nbsp;&amp;nbsp;#&amp;nbsp;Event&amp;nbsp;occurred&amp;nbsp;on&amp;nbsp;day&amp;nbsp;10&lt;/b&gt;&lt;br /&gt;&lt;b&gt;Patient&amp;nbsp;B:&amp;nbsp;efs=0,&amp;nbsp;efs_time=5&amp;nbsp;&amp;nbsp;&amp;nbsp;#&amp;nbsp;Censored&amp;nbsp;on&amp;nbsp;day&amp;nbsp;5&lt;/b&gt;&lt;br /&gt;&lt;b&gt;Patient&amp;nbsp;C:&amp;nbsp;efs=0,&amp;nbsp;efs_time=15&amp;nbsp;&amp;nbsp;#&amp;nbsp;Censored&amp;nbsp;on&amp;nbsp;day&amp;nbsp;15&lt;/b&gt;&lt;br /&gt;&lt;b&gt;Patient&amp;nbsp;D:&amp;nbsp;efs=1,&amp;nbsp;efs_time=8&amp;nbsp;&amp;nbsp;&amp;nbsp;#&amp;nbsp;Event&amp;nbsp;occurred&amp;nbsp;on&amp;nbsp;day&amp;nbsp;8&lt;/b&gt;&lt;br /&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;#&amp;nbsp;Only&amp;nbsp;Patient&amp;nbsp;C&amp;nbsp;has&amp;nbsp;efs_time&amp;nbsp;greater&amp;nbsp;than&amp;nbsp;all&amp;nbsp;efs=1&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;#&amp;nbsp;Therefore,&amp;nbsp;only&amp;nbsp;Patient&amp;nbsp;C's&amp;nbsp;risk&amp;nbsp;score&amp;nbsp;needs&amp;nbsp;to&amp;nbsp;be&amp;nbsp;lower&amp;nbsp;than&amp;nbsp;efs=1&amp;nbsp;patients&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;&lt;b&gt;#&amp;nbsp;Patient&amp;nbsp;B's&amp;nbsp;risk&amp;nbsp;score&amp;nbsp;can&amp;nbsp;be&amp;nbsp;any&amp;nbsp;value&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #99cefa;&quot;&gt;&lt;b&gt;Why &lt;b&gt;Patient B's risk score can be any value&lt;/b&gt;&lt;/b&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Patient&amp;nbsp;A:&amp;nbsp;efs=1,&amp;nbsp;efs_time=10&amp;nbsp;&amp;nbsp;#&amp;nbsp;Death&amp;nbsp;confirmed&amp;nbsp;on&amp;nbsp;day&amp;nbsp;10&lt;br /&gt;Patient&amp;nbsp;B:&amp;nbsp;efs=0,&amp;nbsp;efs_time=5&amp;nbsp;&amp;nbsp;&amp;nbsp;#&amp;nbsp;Observation&amp;nbsp;stopped&amp;nbsp;after&amp;nbsp;day&amp;nbsp;5&lt;br /&gt;Patient&amp;nbsp;D:&amp;nbsp;efs=1,&amp;nbsp;efs_time=8&amp;nbsp;&amp;nbsp;&amp;nbsp;#&amp;nbsp;Death&amp;nbsp;confirmed&amp;nbsp;on&amp;nbsp;day&amp;nbsp;8&lt;/li&gt;
&lt;li&gt;From a C-index calculation perspective:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;C-Index is only included in calculations when survival times between two patients can be compared&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;Patient B has no information after day 5, making clear comparisons with other patients impossible&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;Therefore, Patient B's risk score does not affect C-Index calculation&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;b&gt;In contrast for patient C:&lt;/b&gt;&lt;/b&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;Survival is confirmed until day 15&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Comparable with Patient A (died on day 10)&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;We can definitively know that Patient C lived longer than Patient A&lt;/li&gt;
&lt;li&gt;Therefore, Patient C's risk score should be lower than Patient A's&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Also comparable with Patient D (died on day 8)&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;We can definitively know that Patient C lived longer than Patient D&lt;/li&gt;
&lt;li&gt;Therefore, Patient C's risk score should also be lower than Patient D's&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;b&gt;&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Special Case (efs=0, efs_time=0):&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;These cases are not included in C-Index calculation at all&lt;/li&gt;
&lt;li&gt;Therefore, predictions for such cases don't affect the final score&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;Not necessary to predict low risk for all censored cases (efs=0)&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;Can selectively predict low risk considering efs_time&lt;/li&gt;
&lt;li&gt;This allows the model to learn more flexibly&lt;/li&gt;
&lt;li&gt;This understanding enables creating more effective survival analysis models.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;Understanding delta_j&lt;/b&gt;&lt;/h4&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-05 오후 8.29.28.png&quot; data-origin-width=&quot;2290&quot; data-origin-height=&quot;850&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/QN05L/btsL8dyTzEw/nhUg3pOudI5mJATp6dhTL0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/QN05L/btsL8dyTzEw/nhUg3pOudI5mJATp6dhTL0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/QN05L/btsL8dyTzEw/nhUg3pOudI5mJATp6dhTL0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FQN05L%2FbtsL8dyTzEw%2FnhUg3pOudI5mJATp6dhTL0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;2290&quot; height=&quot;850&quot; data-filename=&quot;스크린샷 2025-02-05 오후 8.29.28.png&quot; data-origin-width=&quot;2290&quot; data-origin-height=&quot;850&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;Summary on Daniel's Question:&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;Checking if &lt;i&gt;&lt;b&gt;delta_j&lt;/b&gt;&lt;/i&gt; in C-Index calculation means the efs value (0 or 1)&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;Curious about how our predicted risk values are evaluated&lt;/li&gt;
&lt;li&gt;Asking if perfect prediction is possible&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-05 오후 8.30.33.png&quot; data-origin-width=&quot;2250&quot; data-origin-height=&quot;946&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bwyjDn/btsL72Eri2q/depRBM5qEWdSewzk1b0ES0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bwyjDn/btsL72Eri2q/depRBM5qEWdSewzk1b0ES0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bwyjDn/btsL72Eri2q/depRBM5qEWdSewzk1b0ES0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbwyjDn%2FbtsL72Eri2q%2FdepRBM5qEWdSewzk1b0ES0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;2250&quot; height=&quot;946&quot; data-filename=&quot;스크린샷 2025-02-05 오후 8.30.33.png&quot; data-origin-width=&quot;2250&quot; data-origin-height=&quot;946&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Summary of Chris's Answer:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;#&amp;nbsp;Elements&amp;nbsp;used&amp;nbsp;in&amp;nbsp;C-Index&amp;nbsp;calculation:&lt;br /&gt;-&amp;nbsp;T_j,&amp;nbsp;T_i:&amp;nbsp;represent&amp;nbsp;efs_time&amp;nbsp;values&lt;br /&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;-&amp;nbsp;delta_j:&amp;nbsp;represents&amp;nbsp;efs&amp;nbsp;value&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;-&amp;nbsp;N_j,&amp;nbsp;N_i:&amp;nbsp;represent&amp;nbsp;our&amp;nbsp;predictions&lt;/li&gt;
&lt;li&gt;Model should not try to directly predict efs_time
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Reason: efs_time is randomly hidden due to censoring&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;Instead, should predict 'risk score'&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;Because risk is actually related to features (X features)&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Example:&lt;/b&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;/span&gt;&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;#&amp;nbsp;Wrong&amp;nbsp;approach&lt;br /&gt;Patient&amp;nbsp;A:&amp;nbsp;survived&amp;nbsp;10&amp;nbsp;days&amp;nbsp;-&amp;gt;&amp;nbsp;model&amp;nbsp;tries&amp;nbsp;to&amp;nbsp;predict&amp;nbsp;10&lt;br /&gt;Patient&amp;nbsp;B:&amp;nbsp;censored&amp;nbsp;at&amp;nbsp;5&amp;nbsp;days&amp;nbsp;-&amp;gt;&amp;nbsp;actual&amp;nbsp;survival&amp;nbsp;unknown&lt;br /&gt;#&amp;nbsp;Correct&amp;nbsp;approach&lt;br /&gt;Patient&amp;nbsp;A:&amp;nbsp;high&amp;nbsp;risk&amp;nbsp;-&amp;gt;&amp;nbsp;expect&amp;nbsp;short&amp;nbsp;survival&lt;br /&gt;Patient&amp;nbsp;B:&amp;nbsp;low&amp;nbsp;risk&amp;nbsp;-&amp;gt;&amp;nbsp;expect&amp;nbsp;long&amp;nbsp;survival&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style5&quot; /&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;성공한&amp;nbsp;자의&amp;nbsp;과거는&amp;nbsp;비참할수록&amp;nbsp;아름답다.&lt;/span&gt;&lt;/blockquote&gt;</description>
      <category>대회</category>
      <category>cibmtr - equity in post-hct survival predictions</category>
      <category>Metrics</category>
      <category>캐글</category>
      <author>dongsunseng</author>
      <guid isPermaLink="true">https://dongsunseng.tistory.com/115</guid>
      <comments>https://dongsunseng.tistory.com/entry/CIBMTR-Equity-in-post-HCT-Survival-Predictions-5-How-To-Get-Started-Understanding-the-Metric#entry115comment</comments>
      <pubDate>Wed, 5 Feb 2025 16:34:19 +0900</pubDate>
    </item>
    <item>
      <title>CIBMTR - Equity in post-HCT Survival Predictions #4 GPU LightGBM Baseline [CV 681 LB 685)</title>
      <link>https://dongsunseng.tistory.com/entry/CIBMTR-Equity-in-post-HCT-Survival-Predictions-4-GPU-LightGBM-Baseline-CV-681-LB-685</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;This is an annotation of this kernel:&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;a href=&quot;https://www.kaggle.com/code/cdeotte/gpu-lightgbm-baseline-cv-681-lb-685&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://www.kaggle.com/code/cdeotte/gpu-lightgbm-baseline-cv-681-lb-685&lt;/a&gt;&lt;/p&gt;
&lt;figure id=&quot;og_1738490645616&quot; contenteditable=&quot;false&quot; data-ke-type=&quot;opengraph&quot; data-ke-align=&quot;alignCenter&quot; data-og-type=&quot;website&quot; data-og-title=&quot;GPU LightGBM  Baseline - [CV 681 LB 685]&quot; data-og-description=&quot;Explore and run machine learning code with Kaggle Notebooks | Using data from CIBMTR - Equity in post-HCT Survival Predictions&quot; data-og-host=&quot;www.kaggle.com&quot; data-og-source-url=&quot;https://www.kaggle.com/code/cdeotte/gpu-lightgbm-baseline-cv-681-lb-685&quot; data-og-url=&quot;https://www.kaggle.com/code/cdeotte/gpu-lightgbm-baseline-cv-681-lb-685&quot; data-og-image=&quot;&quot;&gt;&lt;a href=&quot;https://www.kaggle.com/code/cdeotte/gpu-lightgbm-baseline-cv-681-lb-685&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot; data-source-url=&quot;https://www.kaggle.com/code/cdeotte/gpu-lightgbm-baseline-cv-681-lb-685&quot;&gt;
&lt;div class=&quot;og-image&quot; style=&quot;background-image: url();&quot;&gt;&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;og-text&quot;&gt;
&lt;p class=&quot;og-title&quot; data-ke-size=&quot;size16&quot;&gt;GPU LightGBM Baseline - [CV 681 LB 685]&lt;/p&gt;
&lt;p class=&quot;og-desc&quot; data-ke-size=&quot;size16&quot;&gt;Explore and run machine learning code with Kaggle Notebooks | Using data from CIBMTR - Equity in post-HCT Survival Predictions&lt;/p&gt;
&lt;p class=&quot;og-host&quot; data-ke-size=&quot;size16&quot;&gt;www.kaggle.com&lt;/p&gt;
&lt;/div&gt;
&lt;/a&gt;&lt;/figure&gt;
&lt;h4 id=&quot;GPU-LightGBM-Baseline&quot; style=&quot;color: #202214; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;GPU LightGBM Baseline&lt;/b&gt;&lt;/h4&gt;
&lt;p style=&quot;color: #3c4043; text-align: left;&quot; data-ke-size=&quot;size16&quot;&gt;In this notebook, we present a GPU LightGBM baseline. In this notebook, compared to my previous starter notebooks we teach 5 new things:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc; color: #3c4043; text-align: start;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;How to tranform&lt;span&gt;&amp;nbsp;&lt;/span&gt;efs&lt;span&gt;&amp;nbsp;&lt;/span&gt;and&lt;span&gt;&amp;nbsp;&lt;/span&gt;efs_time&lt;span&gt;&amp;nbsp;&lt;/span&gt;into single target with&lt;span&gt;&amp;nbsp;&lt;/span&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;KaplanMeierFitter&lt;/span&gt;&lt;/b&gt;.&lt;/li&gt;
&lt;li&gt;How to train&lt;span&gt;&amp;nbsp;&lt;/span&gt;GPU LightGBM model&lt;span&gt;&amp;nbsp;&lt;/span&gt;with&lt;span&gt;&amp;nbsp;&lt;/span&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;KaplanMeierFitter&lt;/span&gt;&lt;/b&gt;&lt;span&gt;&amp;nbsp;&lt;/span&gt;target&lt;/li&gt;
&lt;li&gt;How to train&lt;span&gt;&amp;nbsp;&lt;/span&gt;XGBoost with Survivial:&lt;b&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;Cox loss&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;How to train&lt;span&gt;&amp;nbsp;&lt;/span&gt;CatBoost with Survival:&lt;b&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;Cox loss&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;How to ensemble 5 models using&amp;nbsp;scipy.stats.rankdata().&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id=&quot;Two-Competition-Approaches&quot; style=&quot;color: #202214; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;Two Competition Approaches&lt;/b&gt;&lt;/h4&gt;
&lt;p style=&quot;color: #3c4043; text-align: left;&quot; data-ke-size=&quot;size16&quot;&gt;In this competition, there are two ways to train a Survival Model:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc; color: #3c4043; text-align: start;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #99cefa;&quot;&gt;We can input both&amp;nbsp;efs&amp;nbsp;and&amp;nbsp;efs_time&amp;nbsp;and train a&amp;nbsp;&lt;b&gt;model that supports&lt;/b&gt;&amp;nbsp;survival loss like Cox.&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;Transform&amp;nbsp;efs&amp;nbsp;and&amp;nbsp;efs_time&amp;nbsp;into a single target proxy for&amp;nbsp;risk score&amp;nbsp;and train&amp;nbsp;&lt;b&gt;any model&lt;/b&gt;&amp;nbsp;with&amp;nbsp;regression loss like MSE.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;ul style=&quot;list-style-type: circle;&quot; data-ke-list-type=&quot;circle&quot;&gt;
&lt;li&gt;In this notebook, we train 5 models.&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;The first 3 models (XGBoost, CatBoost, LightGBM) use bullet point two.&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #99cefa;&quot;&gt;And the next 2 models (XGBoost Cox, CatBoost Cox) use bullet point one. Discussion about this notebook is&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;a style=&quot;font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px; color: #008abc; background-color: #99cefa;&quot; href=&quot;https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550141&quot;&gt;here&lt;/a&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;and&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;a style=&quot;font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px; color: #008abc; background-color: #99cefa;&quot; href=&quot;https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550003&quot;&gt;here&lt;/a&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;.&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;Since this competition's metric is a ranking metric, we ensemble the 5 predictions by first converting each into ranks using&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;scipy.stats.rankdata().&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;Afterward we created a weighted average from the ranks.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id=&quot;Previous-Notebooks&quot; style=&quot;color: #202214; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;Previous Notebooks&lt;/b&gt;&lt;/h4&gt;
&lt;p style=&quot;color: #3c4043; text-align: left;&quot; data-ke-size=&quot;size16&quot;&gt;My previous starter notebooks are:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc; color: #3c4043; text-align: start;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;XGBoost and CatBoost starter&lt;span&gt;&amp;nbsp;&lt;/span&gt;&lt;a style=&quot;color: #008abc;&quot; href=&quot;https://www.kaggle.com/code/cdeotte/xgboost-catboost-baseline-cv-668-lb-668&quot;&gt;here&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;NN (MLP) starter&lt;span&gt;&amp;nbsp;&lt;/span&gt;&lt;a style=&quot;color: #008abc;&quot; href=&quot;https://www.kaggle.com/code/cdeotte/nn-mlp-baseline-cv-670-lb-676&quot;&gt;here&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p style=&quot;color: #3c4043; text-align: left;&quot; data-ke-size=&quot;size16&quot;&gt;Associated discussions are&lt;span&gt;&amp;nbsp;&lt;/span&gt;&lt;a style=&quot;color: #008abc;&quot; href=&quot;https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550003&quot;&gt;here&lt;/a&gt;,&lt;span&gt;&amp;nbsp;&lt;/span&gt;&lt;a style=&quot;color: #008abc;&quot; href=&quot;https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550141&quot;&gt;here&lt;/a&gt;,&lt;span&gt;&amp;nbsp;&lt;/span&gt;&lt;a style=&quot;color: #008abc;&quot; href=&quot;https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550343&quot;&gt;here&lt;/a&gt;&lt;/p&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style3&quot; /&gt;
&lt;h4 id=&quot;Pip-Install-Libraries-for-Metric&quot; style=&quot;color: #202214; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;Pip Install Libraries for Metric&lt;/b&gt;&lt;/h4&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;Since internet must be turned off for submission, we pip install from my other notebook&lt;span&gt;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;a style=&quot;background-color: #ffffff; color: #008abc; text-align: left;&quot; href=&quot;https://www.kaggle.com/code/cdeotte/pip-install-lifelines&quot;&gt;here&lt;/a&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;&lt;span&gt;&amp;nbsp;&lt;/span&gt;where I downloaded the WHL files.&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1738499477253&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;!pip install /kaggle/input/pip-install-lifelines/autograd-1.7.0-py3-none-any.whl
!pip install /kaggle/input/pip-install-lifelines/autograd-gamma-0.5.0.tar.gz
!pip install /kaggle/input/pip-install-lifelines/interface_meta-1.3.0-py3-none-any.whl
!pip install /kaggle/input/pip-install-lifelines/formulaic-1.0.2-py3-none-any.whl
!pip install /kaggle/input/pip-install-lifelines/lifelines-0.30.0-py3-none-any.whl&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;a href=&quot;https://www.kaggle.com/code/cdeotte/pip-install-lifelines&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://www.kaggle.com/code/cdeotte/pip-install-lifelines&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;There is a discussion explaining how to use these WHL files&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;a style=&quot;color: #008abc;&quot; href=&quot;https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550003&quot;&gt;here&lt;/a&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;.&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;Annotation on the details posted on another blog of mine&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;Below is a quick summary:&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;To compute the competition metric in your notebook, attached this notebook (which you are reading) which contains WHL files (because &lt;b&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;we need to pip install with internet off to be able to submit to comp&lt;/span&gt;&lt;/b&gt;).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&amp;nbsp;Also attached Kaggle's metic notebook&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;a style=&quot;color: #008abc;&quot; href=&quot;https://www.kaggle.com/code/metric/eefs-concordance-index&quot;&gt;here&lt;/a&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;. &lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p style=&quot;color: #3c4043; text-align: left;&quot; data-ke-size=&quot;size16&quot;&gt;Afterward to compute the competition metric, run this code where preds are your oof predictions:&lt;/p&gt;
&lt;pre class=&quot;dockerfile&quot; style=&quot;background-color: #f1f3f4; color: #3c4043; text-align: start;&quot;&gt;&lt;code&gt;from metric import score
y_true = train[[&quot;ID&quot;,&quot;efs&quot;,&quot;efs_time&quot;,&quot;race_group&quot;]].copy()
y_pred = train[[&quot;ID&quot;]].copy()
y_pred[&quot;prediction&quot;] = preds
m = score(y_true.copy(), y_pred.copy(), &quot;ID&quot;)
print(f&quot;CV Score = {m}&quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h4 id=&quot;Load-Train-and-Test&quot; style=&quot;color: #202214; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;Load Train and Test&lt;/b&gt;&lt;/h4&gt;
&lt;pre id=&quot;code_1738499517530&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;import numpy as np, pandas as pd
import matplotlib.pyplot as plt
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)

test = pd.read_csv(&quot;/kaggle/input/equity-post-HCT-survival-predictions/test.csv&quot;)
print(&quot;Test shape:&quot;, test.shape )

train = pd.read_csv(&quot;/kaggle/input/equity-post-HCT-survival-predictions/train.csv&quot;)
print(&quot;Train shape:&quot;,train.shape)
train.head()&lt;/code&gt;&lt;/pre&gt;
&lt;h4 id=&quot;EDA-on-Train-Targets&quot; style=&quot;color: #202214; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;EDA on Train Targets&lt;/b&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;There are two train targets&amp;nbsp;efs&amp;nbsp;and&amp;nbsp;efs_time.&lt;/li&gt;
&lt;li&gt;When&amp;nbsp;efs==1&amp;nbsp;we know patient&amp;nbsp;had an event&amp;nbsp;and we know time of event is&amp;nbsp;efs_time.&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;When&amp;nbsp;efs==0&amp;nbsp;we&amp;nbsp;do not know&amp;nbsp;if patient had an event or not, but we do know that patient was&amp;nbsp;without event for at least&amp;nbsp;efs_time.&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1738500667604&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;plt.hist(train.loc[train.efs==1,&quot;efs_time&quot;],bins=100,label=&quot;efs=1, Yes Event&quot;)
plt.hist(train.loc[train.efs==0,&quot;efs_time&quot;],bins=100,label=&quot;efs=0, Maybe Event&quot;)
plt.xlabel(&quot;Time of Observation, efs_time&quot;)
plt.ylabel(&quot;Density&quot;)
plt.title(&quot;Times of Observation. Either time to event, or time observed without event.&quot;)
plt.legend()
plt.show()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-02 오후 9.51.15.png&quot; data-origin-width=&quot;671&quot; data-origin-height=&quot;465&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/cBlHAw/btsL5mhHs4O/y7OhiHbYtEn3YiGY0aokT1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/cBlHAw/btsL5mhHs4O/y7OhiHbYtEn3YiGY0aokT1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/cBlHAw/btsL5mhHs4O/y7OhiHbYtEn3YiGY0aokT1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FcBlHAw%2FbtsL5mhHs4O%2Fy7OhiHbYtEn3YiGY0aokT1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;561&quot; height=&quot;389&quot; data-filename=&quot;스크린샷 2025-02-02 오후 9.51.15.png&quot; data-origin-width=&quot;671&quot; data-origin-height=&quot;465&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;h4 id=&quot;Transform-Two-Targets-into-One-Target-with-KaplanMeier!&quot; style=&quot;color: #202214; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;Transform Two Targets into One Target with KaplanMeier&lt;/b&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;Both targets&lt;span&gt;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;efs&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;&lt;span&gt;&amp;nbsp;&lt;/span&gt;and&lt;span&gt;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;efs_time&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;&lt;span&gt;&amp;nbsp;&lt;/span&gt;provide useful information. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;We will tranform these two targets into a single target to train our model with. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;In this competition we need to predict&lt;span&gt;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;risk score&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;span style=&quot;color: #3c4043; text-align: left;&quot;&gt;So we will create a target that mimics&amp;nbsp;&lt;/span&gt;risk score&lt;/span&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&amp;nbsp;to train our model.&lt;/span&gt; &lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;(Note this is only one out of many ways to transform two targets into one target. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;Considering experimenting on your own).&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1738500720523&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;from lifelines import KaplanMeierFitter
def transform_survival_probability(df, time_col='efs_time', event_col='efs'):
    kmf = KaplanMeierFitter()
    kmf.fit(df[time_col], df[event_col])
    y = kmf.survival_function_at_times(df[time_col]).values
    return y
train[&quot;y&quot;] = transform_survival_probability(train, time_col='efs_time', event_col='efs')

plt.hist(train.loc[train.efs==1,&quot;y&quot;],bins=100,label=&quot;efs=1, Yes Event&quot;)
plt.hist(train.loc[train.efs==0,&quot;y&quot;],bins=100,label=&quot;efs=0, Maybe Event&quot;)
plt.xlabel(&quot;Transformed Target y&quot;)
plt.ylabel(&quot;Density&quot;)
plt.title(&quot;KaplanMeier Transformed Target y using both efs and efs_time.&quot;)
plt.legend()
plt.show()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-02 오후 9.52.07.png&quot; data-origin-width=&quot;619&quot; data-origin-height=&quot;452&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/UTDrr/btsL5ifjED5/548N3vkDpK3ZfcL24fSiy0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/UTDrr/btsL5ifjED5/548N3vkDpK3ZfcL24fSiy0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/UTDrr/btsL5ifjED5/548N3vkDpK3ZfcL24fSiy0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FUTDrr%2FbtsL5ifjED5%2F548N3vkDpK3ZfcL24fSiy0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;619&quot; height=&quot;452&quot; data-filename=&quot;스크린샷 2025-02-02 오후 9.52.07.png&quot; data-origin-width=&quot;619&quot; data-origin-height=&quot;452&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;h4 id=&quot;Features&quot; style=&quot;color: #202214; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;Features&lt;/b&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;There are a total of 57 features. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;From these 35 are categorical and 22 are numerical. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;We will label encode the categorical features. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #9feec3; color: #3c4043; text-align: left;&quot;&gt;Then our XGB and CAT model will accept these as categorical features and process them special internally. &lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffc1c8; color: #3c4043; text-align: left;&quot;&gt;We leave the numerical feature NANs as NANs because GBDT (like XGB and CAT) can handle NAN and will use this information.&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1738509463528&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;RMV = [&quot;ID&quot;,&quot;efs&quot;,&quot;efs_time&quot;,&quot;y&quot;]
FEATURES = [c for c in train.columns if not c in RMV]
print(f&quot;There are {len(FEATURES)} FEATURES: {FEATURES}&quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-03 오전 12.17.50.png&quot; data-origin-width=&quot;823&quot; data-origin-height=&quot;270&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/dfLCi2/btsL5sa2hXq/DyE3Kl8teBbzrSOH6gguB1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/dfLCi2/btsL5sa2hXq/DyE3Kl8teBbzrSOH6gguB1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/dfLCi2/btsL5sa2hXq/DyE3Kl8teBbzrSOH6gguB1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FdfLCi2%2FbtsL5sa2hXq%2FDyE3Kl8teBbzrSOH6gguB1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;823&quot; height=&quot;270&quot; data-filename=&quot;스크린샷 2025-02-03 오전 12.17.50.png&quot; data-origin-width=&quot;823&quot; data-origin-height=&quot;270&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;pre class=&quot;python&quot; style=&quot;background-color: #f1f3f4; color: #3c4043; text-align: start;&quot; data-ke-language=&quot;python&quot;&gt;&lt;code&gt;CATS = []
for c in FEATURES:
    if train[c].dtype==&quot;object&quot;:
        CATS.append(c)
        train[c] = train[c].fillna(&quot;NAN&quot;)
        test[c] = test[c].fillna(&quot;NAN&quot;)
print(f&quot;In these features, there are {len(CATS)} CATEGORICAL FEATURES: {CATS}&quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-03 오전 12.18.10.png&quot; data-origin-width=&quot;834&quot; data-origin-height=&quot;179&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bQhysJ/btsL301mFNj/t8IH5zxAe1fZMkUgBb63Ck/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bQhysJ/btsL301mFNj/t8IH5zxAe1fZMkUgBb63Ck/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bQhysJ/btsL301mFNj/t8IH5zxAe1fZMkUgBb63Ck/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbQhysJ%2FbtsL301mFNj%2Ft8IH5zxAe1fZMkUgBb63Ck%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;834&quot; height=&quot;179&quot; data-filename=&quot;스크린샷 2025-02-03 오전 12.18.10.png&quot; data-origin-width=&quot;834&quot; data-origin-height=&quot;179&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;pre id=&quot;code_1738510766671&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;combined = pd.concat([train,test],axis=0,ignore_index=True)
#print(&quot;Combined data shape:&quot;, combined.shape )

# LABEL ENCODE CATEGORICAL FEATURES
print(&quot;We LABEL ENCODE the CATEGORICAL FEATURES: &quot;,end=&quot;&quot;)
for c in FEATURES:

    # LABEL ENCODE CATEGORICAL AND CONVERT TO INT32 CATEGORY
    if c in CATS:
        print(f&quot;{c}, &quot;,end=&quot;&quot;)
        combined[c],_ = combined[c].factorize()
        combined[c] -= combined[c].min()
        combined[c] = combined[c].astype(&quot;int32&quot;)
        combined[c] = combined[c].astype(&quot;category&quot;)
        
    # REDUCE PRECISION OF NUMERICAL TO 32BIT TO SAVE MEMORY
    else:
        if combined[c].dtype==&quot;float64&quot;:
            combined[c] = combined[c].astype(&quot;float32&quot;)
        if combined[c].dtype==&quot;int64&quot;:
            combined[c] = combined[c].astype(&quot;int32&quot;)
    
train = combined.iloc[:len(train)].copy()
test = combined.iloc[len(train):].reset_index(drop=True).copy()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-03 오전 12.39.32.png&quot; data-origin-width=&quot;825&quot; data-origin-height=&quot;150&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/b5TPfH/btsL5BscjOp/cq1qK1EKGVUxaa0SMCCtRK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/b5TPfH/btsL5BscjOp/cq1qK1EKGVUxaa0SMCCtRK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/b5TPfH/btsL5BscjOp/cq1qK1EKGVUxaa0SMCCtRK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fb5TPfH%2FbtsL5BscjOp%2Fcq1qK1EKGVUxaa0SMCCtRK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;825&quot; height=&quot;150&quot; data-filename=&quot;스크린샷 2025-02-03 오전 12.39.32.png&quot; data-origin-width=&quot;825&quot; data-origin-height=&quot;150&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;Details about categorical feature label encoding part:&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;combined[c],_&amp;nbsp;=&amp;nbsp;combined[c].factorize()&lt;/b&gt;&lt;/i&gt;&lt;br /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;factorize() is a pandas function that converts categorical data like strings into numbers&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;Example:&amp;nbsp;['A',&amp;nbsp;'B',&amp;nbsp;'A',&amp;nbsp;'C']&amp;nbsp;&amp;rarr;&amp;nbsp;[0,&amp;nbsp;1,&amp;nbsp;0,&amp;nbsp;2]&lt;/li&gt;
&lt;li&gt;'_' ignores the second return value (list of unique values)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;combined[c] -= combined[c].min()&lt;/b&gt;&lt;/i&gt;&lt;br /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;Subtracts&amp;nbsp;the&amp;nbsp;minimum&amp;nbsp;value&amp;nbsp;to&amp;nbsp;make&amp;nbsp;the&amp;nbsp;sequence&amp;nbsp;start&amp;nbsp;from&amp;nbsp;0&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;Example: [1, 2, 1, 3] &amp;rarr; [0, 1, 0, 2]&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;combined[c]&amp;nbsp;=&amp;nbsp;combined[c].astype(&quot;int32&quot;)&lt;/b&gt;&lt;/i&gt;&lt;br /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Converts data type to 32-bit integer&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Used for memory efficiency&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;combined[c]&amp;nbsp;=&amp;nbsp;combined[c].astype(&quot;category&quot;)&lt;/b&gt;&lt;/i&gt;&lt;br /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Finally converts back to category type&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;For&amp;nbsp;efficient&amp;nbsp;handling&amp;nbsp;of&amp;nbsp;categorical&amp;nbsp;data&amp;nbsp;in&amp;nbsp;pandas&lt;/b&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;Memory Efficiency&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;Category type internally stores repeating values as integers and maintains only a mapping table&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;Very efficient when strings like ['High', 'Low', 'High', 'Medium'] are repeated&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Improved Operation Speed&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Faster operations on categorical data&lt;/li&gt;
&lt;li&gt;Optimized for tasks like grouping and sorting&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Metadata Preservation&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Explicitly expresses that this column is categorical&lt;/li&gt;
&lt;li&gt;Better represents the meaning of the data&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;Result: Maintains information that this column is categorical, while saving memory&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;original&amp;nbsp;=&amp;nbsp;['High',&amp;nbsp;'Low',&amp;nbsp;'High',&amp;nbsp;'Medium']&lt;/b&gt;&lt;br /&gt;&lt;b&gt;&amp;darr;&amp;nbsp;factorize()&lt;/b&gt;&lt;br /&gt;&lt;b&gt;[0,&amp;nbsp;1,&amp;nbsp;0,&amp;nbsp;2]&lt;/b&gt;&lt;br /&gt;&lt;b&gt;&amp;darr;&amp;nbsp;No&amp;nbsp;need&amp;nbsp;to&amp;nbsp;subtract&amp;nbsp;min()&amp;nbsp;(already&amp;nbsp;starts&amp;nbsp;from&amp;nbsp;0)&lt;/b&gt;&lt;br /&gt;&lt;b&gt;[0,&amp;nbsp;1,&amp;nbsp;0,&amp;nbsp;2]&lt;/b&gt;&lt;br /&gt;&lt;b&gt;&amp;darr;&amp;nbsp;Convert&amp;nbsp;to&amp;nbsp;int32&lt;/b&gt;&lt;br /&gt;&lt;b&gt;[0,&amp;nbsp;1,&amp;nbsp;0,&amp;nbsp;2]&amp;nbsp;(only&amp;nbsp;data&amp;nbsp;type&amp;nbsp;changes)&lt;/b&gt;&lt;br /&gt;&lt;b&gt;&amp;darr;&amp;nbsp;Convert&amp;nbsp;to&amp;nbsp;category&lt;/b&gt;&lt;br /&gt;&lt;b&gt;[0,&amp;nbsp;1,&amp;nbsp;0,&amp;nbsp;2]&amp;nbsp;(processed&amp;nbsp;as&amp;nbsp;categorical)&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id=&quot;XGBoost-with-KaplanMeier&quot; style=&quot;color: #202214; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;XGBoost with KaplanMeier&lt;/b&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;span style=&quot;color: #3c4043; text-align: left;&quot;&gt;Trained XGBoost model for 10 folds and achieved&amp;nbsp;&lt;/span&gt;&lt;b&gt;CV 0.674&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1738561072832&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;from sklearn.model_selection import KFold
from xgboost import XGBRegressor, XGBClassifier
import xgboost as xgb
print(&quot;Using XGBoost version&quot;,xgb.__version__)&lt;/code&gt;&lt;/pre&gt;
&lt;pre id=&quot;code_1738561107244&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;%%time
FOLDS = 10
kf = KFold(n_splits=FOLDS, shuffle=True, random_state=42) # shuffle=True for randomizing data
    
oof_xgb = np.zeros(len(train)) # out-of-fold predictions
pred_xgb = np.zeros(len(test)) # test data predictions

# k-fold cross validation loop -&amp;gt; Train on 90% and test(validation) on 10%
for i, (train_index, test_index) in enumerate(kf.split(train)):

    print(&quot;#&quot;*25)
    print(f&quot;### Fold {i+1}&quot;)
    print(&quot;#&quot;*25)
    
    x_train = train.loc[train_index,FEATURES].copy()
    y_train = train.loc[train_index,&quot;y&quot;]
    x_valid = train.loc[test_index,FEATURES].copy()
    y_valid = train.loc[test_index,&quot;y&quot;]
    x_test = test[FEATURES].copy()

    model_xgb = XGBRegressor(
        device=&quot;cuda&quot;, # Use GPU
        max_depth=3,  # Tree depth
        colsample_bytree=0.5, # details below
        subsample=0.8,  # details below
        n_estimators=2000,  # Number of trees
        learning_rate=0.02,  
        enable_categorical=True, # Handle categorical variables
        min_child_weight=80, # details below
        #early_stopping_rounds=25,
    )
    
    model_xgb.fit(
        x_train, y_train,
        eval_set=[(x_valid, y_valid)],  
        verbose=500 
    )

    # INFER OOF: predictions for validation data
    oof_xgb[test_index] = model_xgb.predict(x_valid)
    
    # INFER TEST: predictions for test data
    pred_xgb += model_xgb.predict(x_test)

# COMPUTE AVERAGE TEST PREDS: Average predictions from 10 folds
pred_xgb /= FOLDS&lt;/code&gt;&lt;/pre&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;&lt;b&gt;&lt;i&gt;colsample_bytree=0.5&lt;/i&gt;&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Meaning: &lt;b&gt;Proportion of features (columns) to use when creating each tree&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;Example:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;If there are 100 features and colsample_bytree=0.5&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;&lt;b&gt;Each tree uses only 50 randomly selected features&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Effects:
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;Prevents overfitting&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;Tries various feature combinations&lt;/li&gt;
&lt;li&gt;Reduces over-dependence on specific features&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;Lower values: More conservative model, reduced overfitting risk&lt;/li&gt;
&lt;li&gt;Higher values: Uses more features, can capture complex patterns&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;&lt;span style=&quot;font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;subsample=0.8&lt;/span&gt;&lt;/b&gt;&lt;/i&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Meaning: &lt;b&gt;Proportion of training data to use for each tree&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;Example:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;If there are 1000 data points and subsample=0.8&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;&lt;b&gt;Each tree learns from 800 randomly selected data points&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Effects:
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;Prevents overfitting&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;Improves model generalization&lt;/li&gt;
&lt;li&gt;Ensures diversity as each tree learns from slightly different data&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;Lower values: More randomness, reduced overfitting risk&lt;/li&gt;
&lt;li&gt;Higher values: Uses more data, stable learning&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;min_child_weight=80&lt;/b&gt;&lt;/i&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Meaning: &lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;Minimum sum of weights required to create a leaf node&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;Example:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;If min_child_weight=80&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;&lt;b&gt;Won't split if the resulting node's weight sum would be less than 80&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Effects:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Prevents splitting into too small groups&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;Controls overfitting&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;Improves model stability&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Lower values: Allows finer splits, can learn complex patterns&lt;/li&gt;
&lt;li&gt;Higher values: More conservative splits, reduced overfitting risk&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;Scoring the model:&lt;/b&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1738565495462&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;from metric import score

y_true = train[[&quot;ID&quot;,&quot;efs&quot;,&quot;efs_time&quot;,&quot;race_group&quot;]].copy()
y_pred = train[[&quot;ID&quot;]].copy()
y_pred[&quot;prediction&quot;] = oof_xgb
m = score(y_true.copy(), y_pred.copy(), &quot;ID&quot;)
print(f&quot;\nOverall CV for XGBoost KaplanMeier =&quot;,m)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-03 오후 3.51.49.png&quot; data-origin-width=&quot;1504&quot; data-origin-height=&quot;380&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/qvFD5/btsL4JkUq1z/Fy9TUbIBNE7e85E9qSO6Q1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/qvFD5/btsL4JkUq1z/Fy9TUbIBNE7e85E9qSO6Q1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/qvFD5/btsL4JkUq1z/Fy9TUbIBNE7e85E9qSO6Q1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FqvFD5%2FbtsL4JkUq1z%2FFy9TUbIBNE7e85E9qSO6Q1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1504&quot; height=&quot;380&quot; data-filename=&quot;스크린샷 2025-02-03 오후 3.51.49.png&quot; data-origin-width=&quot;1504&quot; data-origin-height=&quot;380&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;pre id=&quot;code_1738565536837&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;feature_importance = model_xgb.feature_importances_
importance_df = pd.DataFrame({
    &quot;Feature&quot;: FEATURES,  # Replace FEATURES with your list of feature names
    &quot;Importance&quot;: feature_importance
}).sort_values(by=&quot;Importance&quot;, ascending=False)
plt.figure(figsize=(10, 15))
plt.barh(importance_df[&quot;Feature&quot;], importance_df[&quot;Importance&quot;])
plt.xlabel(&quot;Importance&quot;)
plt.ylabel(&quot;Feature&quot;)
plt.title(&quot;XGBoost KaplanMeier Feature Importance&quot;)
plt.gca().invert_yaxis()  # Flip features for better readability
plt.show()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-03 오후 3.53.13.png&quot; data-origin-width=&quot;1186&quot; data-origin-height=&quot;1300&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/buh4TS/btsL4no2Idm/e6zZREBIHHT5o3nHZmXk91/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/buh4TS/btsL4no2Idm/e6zZREBIHHT5o3nHZmXk91/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/buh4TS/btsL4no2Idm/e6zZREBIHHT5o3nHZmXk91/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fbuh4TS%2FbtsL4no2Idm%2Fe6zZREBIHHT5o3nHZmXk91%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;652&quot; height=&quot;715&quot; data-filename=&quot;스크린샷 2025-02-03 오후 3.53.13.png&quot; data-origin-width=&quot;1186&quot; data-origin-height=&quot;1300&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;h4 id=&quot;CatBoost-with-KaplanMeier&quot; style=&quot;color: #202214; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;CatBoost with KaplanMeier&lt;/b&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Trained CatBoost model for 10 folds and achieved&amp;nbsp;CV 0.674&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1738565661214&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;from catboost import CatBoostRegressor, CatBoostClassifier
import catboost as cb
print(&quot;Using CatBoost version&quot;,cb.__version__)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-03 오후 3.54.37.png&quot; data-origin-width=&quot;662&quot; data-origin-height=&quot;96&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/vuU31/btsL4WjXgbH/TUxjzigAeg5g25TjaH9oPK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/vuU31/btsL4WjXgbH/TUxjzigAeg5g25TjaH9oPK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/vuU31/btsL4WjXgbH/TUxjzigAeg5g25TjaH9oPK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FvuU31%2FbtsL4WjXgbH%2FTUxjzigAeg5g25TjaH9oPK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;373&quot; height=&quot;54&quot; data-filename=&quot;스크린샷 2025-02-03 오후 3.54.37.png&quot; data-origin-width=&quot;662&quot; data-origin-height=&quot;96&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;pre id=&quot;code_1738565709011&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;%%time
FOLDS = 10
kf = KFold(n_splits=FOLDS, shuffle=True, random_state=42)
    
oof_cat = np.zeros(len(train))
pred_cat = np.zeros(len(test))

for i, (train_index, test_index) in enumerate(kf.split(train)):

    print(&quot;#&quot;*25)
    print(f&quot;### Fold {i+1}&quot;)
    print(&quot;#&quot;*25)
    
    x_train = train.loc[train_index,FEATURES].copy()
    y_train = train.loc[train_index,&quot;y&quot;]
    x_valid = train.loc[test_index,FEATURES].copy()
    y_valid = train.loc[test_index,&quot;y&quot;]
    x_test = test[FEATURES].copy()

    model_cat = CatBoostRegressor(
        task_type=&quot;GPU&quot;,  # Using GPU
        learning_rate=0.1,    
        grow_policy='Lossguide', # Details below
        #early_stopping_rounds=25,
    )
    model_cat.fit(x_train,y_train,
              eval_set=(x_valid, y_valid),
              cat_features=CATS,
              verbose=250)

    # INFER OOF
    oof_cat[test_index] = model_cat.predict(x_valid)
    # INFER TEST
    pred_cat += model_cat.predict(x_test)

# COMPUTE AVERAGE TEST PREDS
pred_cat /= FOLDS&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;grow_policy='Lossguide'&lt;/b&gt;&lt;/i&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Tree growth method: grow tree by selecting leaves that minimize loss&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;Scoring the model:&lt;/b&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1738567375576&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;y_true = train[[&quot;ID&quot;,&quot;efs&quot;,&quot;efs_time&quot;,&quot;race_group&quot;]].copy()
y_pred = train[[&quot;ID&quot;]].copy()
y_pred[&quot;prediction&quot;] = oof_cat
m = score(y_true.copy(), y_pred.copy(), &quot;ID&quot;)
print(f&quot;\nOverall CV for CatBoost KaplanMeier =&quot;,m)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-03 오후 4.23.07.png&quot; data-origin-width=&quot;1526&quot; data-origin-height=&quot;364&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bvlSCY/btsL6kjXjcy/ijXvuRxR9gz2tNaKohu13K/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bvlSCY/btsL6kjXjcy/ijXvuRxR9gz2tNaKohu13K/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bvlSCY/btsL6kjXjcy/ijXvuRxR9gz2tNaKohu13K/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbvlSCY%2FbtsL6kjXjcy%2FijXvuRxR9gz2tNaKohu13K%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;793&quot; height=&quot;189&quot; data-filename=&quot;스크린샷 2025-02-03 오후 4.23.07.png&quot; data-origin-width=&quot;1526&quot; data-origin-height=&quot;364&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;pre id=&quot;code_1738567406149&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;feature_importance = model_cat.get_feature_importance()
importance_df = pd.DataFrame({
    &quot;Feature&quot;: FEATURES, 
    &quot;Importance&quot;: feature_importance
}).sort_values(by=&quot;Importance&quot;, ascending=False)
plt.figure(figsize=(10, 15))
plt.barh(importance_df[&quot;Feature&quot;], importance_df[&quot;Importance&quot;])
plt.xlabel(&quot;Importance&quot;)
plt.ylabel(&quot;Feature&quot;)
plt.title(&quot;CatBoost KaplanMeier Feature Importance&quot;)
plt.gca().invert_yaxis()  # Flip features for better readability
plt.show()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-03 오후 4.23.43.png&quot; data-origin-width=&quot;1168&quot; data-origin-height=&quot;1308&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bgzKS6/btsL6qEeA46/0vbRKBQd1FRA2K0cGxs2Ek/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bgzKS6/btsL6qEeA46/0vbRKBQd1FRA2K0cGxs2Ek/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bgzKS6/btsL6qEeA46/0vbRKBQd1FRA2K0cGxs2Ek/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbgzKS6%2FbtsL6qEeA46%2F0vbRKBQd1FRA2K0cGxs2Ek%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;629&quot; height=&quot;704&quot; data-filename=&quot;스크린샷 2025-02-03 오후 4.23.43.png&quot; data-origin-width=&quot;1168&quot; data-origin-height=&quot;1308&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;h4 id=&quot;LightGBM-with-KaplanMeier&quot; style=&quot;color: #202214; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;LightGBM with KaplanMeier&lt;/b&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;Trained LightGBM model for 10 folds and achieved&lt;span&gt;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;b&gt;CV 0.6725&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1738567541345&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;from lightgbm import LGBMRegressor
import lightgbm as lgb
print(&quot;Using LightGBM version&quot;,lgb.__version__)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-03 오후 4.25.50.png&quot; data-origin-width=&quot;414&quot; data-origin-height=&quot;48&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/Of3j8/btsL4cHRZyk/Gpm9HTQXFZFdz1oRPkHoSk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/Of3j8/btsL4cHRZyk/Gpm9HTQXFZFdz1oRPkHoSk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/Of3j8/btsL4cHRZyk/Gpm9HTQXFZFdz1oRPkHoSk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FOf3j8%2FbtsL4cHRZyk%2FGpm9HTQXFZFdz1oRPkHoSk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;414&quot; height=&quot;48&quot; data-filename=&quot;스크린샷 2025-02-03 오후 4.25.50.png&quot; data-origin-width=&quot;414&quot; data-origin-height=&quot;48&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;pre id=&quot;code_1738567579861&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;FOLDS = 10
kf = KFold(n_splits=FOLDS, shuffle=True, random_state=42)
    
oof_lgb = np.zeros(len(train))
pred_lgb = np.zeros(len(test))

for i, (train_index, test_index) in enumerate(kf.split(train)):

    print(&quot;#&quot;*25)
    print(f&quot;### Fold {i+1}&quot;)
    print(&quot;#&quot;*25)
    
    x_train = train.loc[train_index,FEATURES].copy()
    y_train = train.loc[train_index,&quot;y&quot;]    
    x_valid = train.loc[test_index,FEATURES].copy()
    y_valid = train.loc[test_index,&quot;y&quot;]
    x_test = test[FEATURES].copy()

    model_lgb = LGBMRegressor(
        device=&quot;gpu&quot;, 
        max_depth=3, 
        colsample_bytree=0.4,  
        #subsample=0.9, 
        n_estimators=2500, 
        learning_rate=0.02, 
        objective=&quot;regression&quot;, # detail below
        verbose=-1, # detail below
        #early_stopping_rounds=25,
    )
    model_lgb.fit(
        x_train, y_train,
        eval_set=[(x_valid, y_valid)],
    )
    
    # INFER OOF
    oof_lgb[test_index] = model_lgb.predict(x_valid)
    # INFER TEST
    pred_lgb += model_lgb.predict(x_test)

# COMPUTE AVERAGE TEST PREDS
pred_lgb /= FOLDS&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;objective=&quot;regression&quot;&lt;/b&gt;&lt;/i&gt;&lt;br /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Parameter that specifies the learning objective (loss function)&lt;/li&gt;
&lt;li&gt;&quot;regression&quot;&amp;nbsp;is&amp;nbsp;the&amp;nbsp;default&amp;nbsp;setting&amp;nbsp;for&amp;nbsp;regression&amp;nbsp;problems&lt;/li&gt;
&lt;li&gt;Other options:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&quot;binary&quot;: binary classification&lt;/li&gt;
&lt;li&gt;&quot;multiclass&quot;:&amp;nbsp;multi-class&amp;nbsp;classification&lt;/li&gt;
&lt;li&gt;&quot;ranking&quot;:&amp;nbsp;ranking&amp;nbsp;problems&lt;/li&gt;
&lt;li&gt;&quot;poisson&quot;:&amp;nbsp;Poisson&amp;nbsp;regression&lt;/li&gt;
&lt;li&gt;&quot;quantile&quot;:&amp;nbsp;quantile&amp;nbsp;regression&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;verbose=-1&lt;/b&gt;&lt;/i&gt;&lt;br /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Specifies&amp;nbsp;the&amp;nbsp;level&amp;nbsp;of&amp;nbsp;detail&amp;nbsp;for&amp;nbsp;logs&amp;nbsp;during&amp;nbsp;training&lt;/li&gt;
&lt;li&gt;Meaning&amp;nbsp;of&amp;nbsp;values:&lt;br /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;-1: no output (completely silent)&lt;/li&gt;
&lt;li&gt;0:&amp;nbsp;only&amp;nbsp;warnings&amp;nbsp;and&amp;nbsp;errors&lt;/li&gt;
&lt;li&gt;1:&amp;nbsp;basic&amp;nbsp;information&lt;/li&gt;
&lt;li&gt;2:&amp;nbsp;detailed&amp;nbsp;information&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&amp;nbsp;Currently set to -1, so no messages will be output during training&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;Scoring the model:&lt;/b&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1738567975306&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;y_true = train[[&quot;ID&quot;,&quot;efs&quot;,&quot;efs_time&quot;,&quot;race_group&quot;]].copy()
y_pred = train[[&quot;ID&quot;]].copy()
y_pred[&quot;prediction&quot;] = oof_lgb
m = score(y_true.copy(), y_pred.copy(), &quot;ID&quot;)
print(f&quot;\nOverall CV for LightGBM KaplanMeier =&quot;,m)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-03 오후 4.33.03.png&quot; data-origin-width=&quot;1082&quot; data-origin-height=&quot;264&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/lN8w5/btsL4zXewDB/6KjPqarO05u9ckGEiOCY60/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/lN8w5/btsL4zXewDB/6KjPqarO05u9ckGEiOCY60/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/lN8w5/btsL4zXewDB/6KjPqarO05u9ckGEiOCY60/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FlN8w5%2FbtsL4zXewDB%2F6KjPqarO05u9ckGEiOCY60%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;737&quot; height=&quot;180&quot; data-filename=&quot;스크린샷 2025-02-03 오후 4.33.03.png&quot; data-origin-width=&quot;1082&quot; data-origin-height=&quot;264&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;pre class=&quot;python&quot; style=&quot;background-color: #f1f3f4; color: #3c4043; text-align: start;&quot; data-ke-language=&quot;python&quot;&gt;&lt;code&gt;feature_importance = model_lgb.feature_importances_ 
importance_df = pd.DataFrame({
    &quot;Feature&quot;: FEATURES,
    &quot;Importance&quot;: feature_importance
}).sort_values(by=&quot;Importance&quot;, ascending=False)
plt.figure(figsize=(10, 15))
plt.barh(importance_df[&quot;Feature&quot;], importance_df[&quot;Importance&quot;], color='skyblue')
plt.xlabel(&quot;Importance (Gain)&quot;)
plt.ylabel(&quot;Feature&quot;)
plt.title(&quot;LightGBM KaplanMeier Feature Importance&quot;)
plt.gca().invert_yaxis()  # Flip features for better readability
plt.show()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-03 오후 4.33.33.png&quot; data-origin-width=&quot;1144&quot; data-origin-height=&quot;1304&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/cChko9/btsL5if7bPF/8pWUSeWvgsQJeij9liG3kK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/cChko9/btsL5if7bPF/8pWUSeWvgsQJeij9liG3kK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/cChko9/btsL5if7bPF/8pWUSeWvgsQJeij9liG3kK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FcChko9%2FbtsL5if7bPF%2F8pWUSeWvgsQJeij9liG3kK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;790&quot; height=&quot;900&quot; data-filename=&quot;스크린샷 2025-02-03 오후 4.33.33.png&quot; data-origin-width=&quot;1144&quot; data-origin-height=&quot;1304&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;h4 id=&quot;XGBoost-with-Survival:Cox&quot; style=&quot;color: #202214; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;XGBoost with Survival:Cox&lt;/b&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Trained XGBoost using Survival:Cox loss for 10 folds and achieved&amp;nbsp;CV=672!&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;python&quot; style=&quot;background-color: #f1f3f4; color: #3c4043; text-align: start;&quot; data-ke-language=&quot;python&quot;&gt;&lt;code&gt;# SURVIVAL COX NEEDS THIS TARGET (TO DIGEST EFS AND EFS_TIME)
train[&quot;efs_time2&quot;] = train.efs_time.copy()
train.loc[train.efs==0,&quot;efs_time2&quot;] *= -1&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Above code prepares the target variable for Cox model
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;train[&quot;efs_time2&quot;] = train.efs_time.copy()&lt;/b&gt;&lt;/i&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Creates a new column by copying efs_time&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;train.loc[train.efs==0,&quot;efs_time2&quot;] *= -1&lt;/b&gt;&lt;/i&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;For cases where efs is 0 (no event occurred)&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Converts the time value to negative&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Reasons&amp;nbsp;for&amp;nbsp;doing&amp;nbsp;this:&lt;br /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;Cox models use this approach to represent censoring information&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Negative&amp;nbsp;time&amp;nbsp;&amp;rarr;&amp;nbsp;censored&amp;nbsp;case&amp;nbsp;(no&amp;nbsp;event&amp;nbsp;occurred)&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;Positive&amp;nbsp;time&amp;nbsp;&amp;rarr;&amp;nbsp;event&amp;nbsp;occurred&amp;nbsp;case&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Example:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Original&amp;nbsp;Data:&lt;br /&gt;efs&amp;nbsp;&amp;nbsp;|&amp;nbsp;efs_time&amp;nbsp;|&amp;nbsp;efs_time2&lt;br /&gt;1&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;|&amp;nbsp;100&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;|&amp;nbsp;100&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;(death/relapse&amp;nbsp;occurred)&lt;br /&gt;0&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;|&amp;nbsp;150&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;|&amp;nbsp;-150&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;(censored)&lt;br /&gt;1&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;|&amp;nbsp;80&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;|&amp;nbsp;80&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;(death/relapse&amp;nbsp;occurred)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1738568106650&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;FOLDS = 10
kf = KFold(n_splits=FOLDS, shuffle=True, random_state=42)
    
oof_xgb_cox = np.zeros(len(train))
pred_xgb_cox = np.zeros(len(test))

for i, (train_index, test_index) in enumerate(kf.split(train)):

    print(&quot;#&quot;*25)
    print(f&quot;### Fold {i+1}&quot;)
    print(&quot;#&quot;*25)
    
    x_train = train.loc[train_index,FEATURES].copy()
    y_train = train.loc[train_index,&quot;efs_time2&quot;]    
    x_valid = train.loc[test_index,FEATURES].copy()
    y_valid = train.loc[test_index,&quot;efs_time2&quot;]
    x_test = test[FEATURES].copy()

    # same attributes with xgb above except objective and eval-metric
    model_xgb_cox = XGBRegressor(
        device=&quot;cuda&quot;,
        max_depth=3,  
        colsample_bytree=0.5,  
        subsample=0.8,  
        n_estimators=2000,  
        learning_rate=0.02,  
        enable_categorical=True,
        min_child_weight=80,
        objective='survival:cox',
        eval_metric='cox-nloglik',
    )
    model_xgb_cox.fit(
        x_train, y_train,
        eval_set=[(x_valid, y_valid)],  
        verbose=500  
    )
    
    # INFER OOF
    oof_xgb_cox[test_index] = model_xgb_cox.predict(x_valid)
    # INFER TEST
    pred_xgb_cox += model_xgb_cox.predict(x_test)

# COMPUTE AVERAGE TEST PREDS
pred_xgb_cox /= FOLDS&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;Scoring the model:&lt;/b&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1738569377104&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;y_true = train[[&quot;ID&quot;,&quot;efs&quot;,&quot;efs_time&quot;,&quot;race_group&quot;]].copy()
y_pred = train[[&quot;ID&quot;]].copy()
y_pred[&quot;prediction&quot;] = oof_xgb_cox
m = score(y_true.copy(), y_pred.copy(), &quot;ID&quot;)
print(f&quot;\nOverall CV for XGBoost Survival:Cox =&quot;,m)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-03 오후 4.56.28.png&quot; data-origin-width=&quot;1098&quot; data-origin-height=&quot;258&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/Prrdq/btsL4zXguU6/K2b9ga6CjW76wpnUkxKJgk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/Prrdq/btsL4zXguU6/K2b9ga6CjW76wpnUkxKJgk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/Prrdq/btsL4zXguU6/K2b9ga6CjW76wpnUkxKJgk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FPrrdq%2FbtsL4zXguU6%2FK2b9ga6CjW76wpnUkxKJgk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;732&quot; height=&quot;172&quot; data-filename=&quot;스크린샷 2025-02-03 오후 4.56.28.png&quot; data-origin-width=&quot;1098&quot; data-origin-height=&quot;258&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;pre id=&quot;code_1738569404191&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;feature_importance = model_xgb_cox.feature_importances_
importance_df = pd.DataFrame({
    &quot;Feature&quot;: FEATURES,  # Replace FEATURES with your list of feature names
    &quot;Importance&quot;: feature_importance
}).sort_values(by=&quot;Importance&quot;, ascending=False)
plt.figure(figsize=(10, 15))
plt.barh(importance_df[&quot;Feature&quot;], importance_df[&quot;Importance&quot;])
plt.xlabel(&quot;Importance&quot;)
plt.ylabel(&quot;Feature&quot;)
plt.title(&quot;XGBoost Survival:Cox Feature Importance&quot;)
plt.gca().invert_yaxis()  # Flip features for better readability
plt.show()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-03 오후 4.57.00.png&quot; data-origin-width=&quot;1118&quot; data-origin-height=&quot;1298&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/S6DbW/btsL4NnpKgc/iecvKiXOoO4pQKcpg7H6s0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/S6DbW/btsL4NnpKgc/iecvKiXOoO4pQKcpg7H6s0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/S6DbW/btsL4NnpKgc/iecvKiXOoO4pQKcpg7H6s0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FS6DbW%2FbtsL4NnpKgc%2FiecvKiXOoO4pQKcpg7H6s0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1118&quot; height=&quot;1298&quot; data-filename=&quot;스크린샷 2025-02-03 오후 4.57.00.png&quot; data-origin-width=&quot;1118&quot; data-origin-height=&quot;1298&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;h4 id=&quot;CatBoost-with-Survival:Cox&quot; style=&quot;color: #202214; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;CatBoost with Survival:Cox&lt;/b&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Trained CatBoost using Survival:Cox loss for 10 folds and achieved&amp;nbsp;CV=671!&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1738569460543&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;FOLDS = 10
kf = KFold(n_splits=FOLDS, shuffle=True, random_state=42)
    
oof_cat_cox = np.zeros(len(train))
pred_cat_cox = np.zeros(len(test))

for i, (train_index, test_index) in enumerate(kf.split(train)):

    print(&quot;#&quot;*25)
    print(f&quot;### Fold {i+1}&quot;)
    print(&quot;#&quot;*25)
    
    x_train = train.loc[train_index,FEATURES].copy()
    y_train = train.loc[train_index,&quot;efs_time2&quot;]    
    x_valid = train.loc[test_index,FEATURES].copy()
    y_valid = train.loc[test_index,&quot;efs_time2&quot;]
    x_test = test[FEATURES].copy()

    model_cat_cox = CatBoostRegressor(
        loss_function=&quot;Cox&quot;,
        #task_type=&quot;GPU&quot;,   
        iterations=400,   # Total number of trees to train  
        learning_rate=0.1,  
        grow_policy='Lossguide',
        use_best_model=False, # details below
    )
    model_cat_cox.fit(x_train,y_train,
              eval_set=(x_valid, y_valid),
              cat_features=CATS,
              verbose=100)
    
    # INFER OOF
    oof_cat_cox[test_index] = model_cat_cox.predict(x_valid)
    # INFER TEST
    pred_cat_cox += model_cat_cox.predict(x_test)

# COMPUTE AVERAGE TEST PREDS
pred_cat_cox /= FOLDS&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;&lt;span&gt;use_best_model&lt;/span&gt;&lt;span style=&quot;color: #61afef;&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color: #d19a66;&quot;&gt;False&lt;/span&gt;&lt;/b&gt;&lt;/i&gt;&lt;br /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;Uses&amp;nbsp;all&amp;nbsp;iterations&amp;nbsp;(doesn't&amp;nbsp;use&amp;nbsp;early-stopped&amp;nbsp;optimal&amp;nbsp;model)&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;Conversely, when &lt;i&gt;&lt;b&gt;use_best_model=True&lt;/b&gt;&lt;/i&gt;:&lt;/span&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;Selects the model from the point showing best performance on validation data&lt;/li&gt;
&lt;li&gt;Stops training if performance decreases in subsequent iterations&lt;/li&gt;
&lt;li&gt;Acts as a form of early stopping&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;Reasons for setting it to False:&lt;/span&gt;&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;Later iterations can sometimes be important in Cox models&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;Even if validation performance temporarily worsens, it might help overall survival prediction&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;Especially with lots of censored data, using all iterations might be more stable&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Example:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;#&amp;nbsp;When&amp;nbsp;True&lt;br /&gt;iter&amp;nbsp;100:&amp;nbsp;performance&amp;nbsp;0.8&lt;br /&gt;iter&amp;nbsp;200:&amp;nbsp;performance&amp;nbsp;0.85&amp;nbsp;(best)&lt;br /&gt;iter&amp;nbsp;300:&amp;nbsp;performance&amp;nbsp;0.83&lt;br /&gt;&amp;rarr;&amp;nbsp;Uses&amp;nbsp;model&amp;nbsp;from&amp;nbsp;iter&amp;nbsp;200&lt;br /&gt;&lt;br /&gt;#&amp;nbsp;When&amp;nbsp;False&lt;br /&gt;Uses&amp;nbsp;combined&amp;nbsp;results&amp;nbsp;from&amp;nbsp;all&amp;nbsp;iterations&amp;nbsp;(400)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;Scoring the model:&lt;/b&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1738570389045&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;y_true = train[[&quot;ID&quot;,&quot;efs&quot;,&quot;efs_time&quot;,&quot;race_group&quot;]].copy()
y_pred = train[[&quot;ID&quot;]].copy()
y_pred[&quot;prediction&quot;] = oof_cat_cox
m = score(y_true.copy(), y_pred.copy(), &quot;ID&quot;)
print(f&quot;\nOverall CV for CatBoost Survival:Cox =&quot;,m)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-03 오후 5.13.45.png&quot; data-origin-width=&quot;1108&quot; data-origin-height=&quot;260&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/cfQhIU/btsL5rKPkqg/0VphqGDhBGd3ak0zHIhCGk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/cfQhIU/btsL5rKPkqg/0VphqGDhBGd3ak0zHIhCGk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/cfQhIU/btsL5rKPkqg/0VphqGDhBGd3ak0zHIhCGk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FcfQhIU%2FbtsL5rKPkqg%2F0VphqGDhBGd3ak0zHIhCGk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1108&quot; height=&quot;260&quot; data-filename=&quot;스크린샷 2025-02-03 오후 5.13.45.png&quot; data-origin-width=&quot;1108&quot; data-origin-height=&quot;260&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;pre id=&quot;code_1738570434797&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;feature_importance = model_cat_cox.get_feature_importance()
importance_df = pd.DataFrame({
    &quot;Feature&quot;: FEATURES, 
    &quot;Importance&quot;: feature_importance
}).sort_values(by=&quot;Importance&quot;, ascending=False)
plt.figure(figsize=(10, 15))
plt.barh(importance_df[&quot;Feature&quot;], importance_df[&quot;Importance&quot;])
plt.xlabel(&quot;Importance&quot;)
plt.ylabel(&quot;Feature&quot;)
plt.title(&quot;CatBoost Survival:Cox Feature Importance&quot;)
plt.gca().invert_yaxis()  # Flip features for better readability
plt.show()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-03 오후 5.13.31.png&quot; data-origin-width=&quot;1118&quot; data-origin-height=&quot;1298&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bOdklW/btsL4qTBKk1/D44DxnJ830COcN8BpkzdJ0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bOdklW/btsL4qTBKk1/D44DxnJ830COcN8BpkzdJ0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bOdklW/btsL4qTBKk1/D44DxnJ830COcN8BpkzdJ0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbOdklW%2FbtsL4qTBKk1%2FD44DxnJ830COcN8BpkzdJ0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1118&quot; height=&quot;1298&quot; data-filename=&quot;스크린샷 2025-02-03 오후 5.13.31.png&quot; data-origin-width=&quot;1118&quot; data-origin-height=&quot;1298&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;h3 id=&quot;Ensemble-CAT-and-XGB-and-LGB&quot; style=&quot;color: #202214; text-align: start;&quot; data-ke-size=&quot;size23&quot;&gt;&lt;b&gt;Ensemble CAT and XGB and LGB&lt;/b&gt;&lt;/h3&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;We ensemble our XGBoost, CatBoost, LightGBM, XGBoost Cox, and CatBoost Cox using&amp;nbsp;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;&lt;b&gt;scipy.stats.rankdata()&lt;/b&gt;&lt;/span&gt;&amp;nbsp;and achieve an amazing&amp;nbsp;CV=0.681&amp;nbsp;Wow!&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1738570474307&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;from scipy.stats import rankdata 

y_true = train[[&quot;ID&quot;,&quot;efs&quot;,&quot;efs_time&quot;,&quot;race_group&quot;]].copy()
y_pred = train[[&quot;ID&quot;]].copy()
y_pred[&quot;prediction&quot;] = rankdata(oof_xgb) + rankdata(oof_cat) + rankdata(oof_lgb)\
                     + rankdata(oof_xgb_cox) + rankdata(oof_cat_cox)
m = score(y_true.copy(), y_pred.copy(), &quot;ID&quot;)
print(f&quot;\nOverall CV for Ensemble =&quot;,m)&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;&lt;i&gt;&lt;b&gt;rankdata&lt;/b&gt;&lt;/i&gt;&lt;/span&gt;: Function that returns the rank of each prediction -&amp;gt; Sums up ranks from five models&lt;/li&gt;
&lt;li&gt;&lt;b&gt;How does the function work(Example):&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;i&gt;from&amp;nbsp;scipy.stats&amp;nbsp;import&amp;nbsp;rankdata&lt;/i&gt;&lt;/b&gt;&lt;br /&gt;&lt;b&gt;&lt;i&gt;#&amp;nbsp;Sample&amp;nbsp;data&lt;/i&gt;&lt;/b&gt;&lt;br /&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;&lt;i&gt;predictions&amp;nbsp;=&amp;nbsp;[10.5,&amp;nbsp;5.2,&amp;nbsp;15.7,&amp;nbsp;5.2,&amp;nbsp;8.1]&lt;/i&gt;&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;&lt;b&gt;&lt;i&gt;#&amp;nbsp;Apply&amp;nbsp;rankdata&lt;/i&gt;&lt;/b&gt;&lt;br /&gt;&lt;b&gt;&lt;i&gt;ranks&amp;nbsp;=&amp;nbsp;rankdata(predictions)&lt;/i&gt;&lt;/b&gt;&lt;br /&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;First, sort the values in ascending order:
&lt;ul style=&quot;list-style-type: circle;&quot; data-ke-list-type=&quot;circle&quot;&gt;
&lt;li&gt;5.2,&amp;nbsp;5.2,&amp;nbsp;8.1,&amp;nbsp;10.5,&amp;nbsp;15.7&lt;br /&gt;(1-2),&amp;nbsp;(1-2),&amp;nbsp;(3),&amp;nbsp;(4),&amp;nbsp;(5)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Assign ranks to each value:
&lt;ul style=&quot;list-style-type: circle;&quot; data-ke-list-type=&quot;circle&quot;&gt;
&lt;li&gt;Original:&amp;nbsp;[10.5,&amp;nbsp;&amp;nbsp;5.2,&amp;nbsp;&amp;nbsp;15.7,&amp;nbsp;&amp;nbsp;5.2,&amp;nbsp;&amp;nbsp;&amp;nbsp;8.1]&lt;br /&gt;Ranks:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;[4,&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;1.5,&amp;nbsp;&amp;nbsp;5,&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;1.5,&amp;nbsp;&amp;nbsp;&amp;nbsp;3]&lt;/li&gt;
&lt;li&gt;Since 5.2 appears twice, both receive 1.5 (average of 1st and 2nd place)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Sum the rank value
&lt;ul style=&quot;list-style-type: circle;&quot; data-ke-list-type=&quot;circle&quot;&gt;
&lt;li&gt;model1_ranks&amp;nbsp;=&amp;nbsp;rankdata([0.1,&amp;nbsp;0.2,&amp;nbsp;0.3])&amp;nbsp;&amp;nbsp;#&amp;nbsp;[1,&amp;nbsp;2,&amp;nbsp;3]&lt;br /&gt;model2_ranks&amp;nbsp;=&amp;nbsp;rankdata([0.3,&amp;nbsp;0.1,&amp;nbsp;0.2])&amp;nbsp;&amp;nbsp;#&amp;nbsp;[3,&amp;nbsp;1,&amp;nbsp;2]&lt;br /&gt;ensemble&amp;nbsp;=&amp;nbsp;model1_ranks&amp;nbsp;+&amp;nbsp;model2_ranks&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;#&amp;nbsp;[4,&amp;nbsp;3,&amp;nbsp;5]&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Why use rankdata? --&amp;gt; &lt;span style=&quot;background-color: #ffc9af;&quot;&gt;Suitable&amp;nbsp;for&amp;nbsp;Survival&amp;nbsp;Analysis&lt;/span&gt;&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;Well-aligned with rank-based evaluation metrics such as the Concordance index&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;The&amp;nbsp;relative&amp;nbsp;risk&amp;nbsp;ranking&amp;nbsp;is&amp;nbsp;often&amp;nbsp;more&amp;nbsp;important&amp;nbsp;than&amp;nbsp;the&amp;nbsp;actual&amp;nbsp;survival&amp;nbsp;time&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-03 오후 5.14.46.png&quot; data-origin-width=&quot;1108&quot; data-origin-height=&quot;260&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/cgBa18/btsL47shBuA/CTJqqujKlAgFJWoZxzGgEk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/cgBa18/btsL47shBuA/CTJqqujKlAgFJWoZxzGgEk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/cgBa18/btsL47shBuA/CTJqqujKlAgFJWoZxzGgEk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FcgBa18%2FbtsL47shBuA%2FCTJqqujKlAgFJWoZxzGgEk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1108&quot; height=&quot;260&quot; data-filename=&quot;스크린샷 2025-02-03 오후 5.14.46.png&quot; data-origin-width=&quot;1108&quot; data-origin-height=&quot;260&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;h3 id=&quot;Create-Submission-CSV&quot; style=&quot;color: #202214; text-align: start;&quot; data-ke-size=&quot;size23&quot;&gt;&lt;b&gt;Create Submission CSV&lt;/b&gt;&lt;/h3&gt;
&lt;pre id=&quot;code_1738570521100&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;sub = pd.read_csv(&quot;/kaggle/input/equity-post-HCT-survival-predictions/sample_submission.csv&quot;)
sub.prediction = rankdata(pred_xgb) + rankdata(pred_cat) + rankdata(pred_lgb)\
                     + rankdata(pred_xgb_cox) + rankdata(pred_cat_cox)
sub.to_csv(&quot;submission.csv&quot;,index=False)
print(&quot;Sub shape:&quot;,sub.shape)
sub.head()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-03 오후 5.15.35.png&quot; data-origin-width=&quot;392&quot; data-origin-height=&quot;270&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bwwBZo/btsL6omd1da/jHY240ZsVRT95w8XrcPEu0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bwwBZo/btsL6omd1da/jHY240ZsVRT95w8XrcPEu0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bwwBZo/btsL6omd1da/jHY240ZsVRT95w8XrcPEu0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbwwBZo%2FbtsL6omd1da%2FjHY240ZsVRT95w8XrcPEu0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;392&quot; height=&quot;270&quot; data-filename=&quot;스크린샷 2025-02-03 오후 5.15.35.png&quot; data-origin-width=&quot;392&quot; data-origin-height=&quot;270&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style5&quot; /&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Cast&amp;nbsp;all&amp;nbsp;your&amp;nbsp;anxiety&amp;nbsp;on&amp;nbsp;him&amp;nbsp;because&amp;nbsp;he&amp;nbsp;cares&amp;nbsp;for&amp;nbsp;you&lt;br /&gt;&lt;/span&gt;&amp;lt;Peter 5:7&amp;gt;&lt;/blockquote&gt;</description>
      <category>대회</category>
      <category>cibmtr - equity in post-hct survival predictions</category>
      <category>캐글</category>
      <author>dongsunseng</author>
      <guid isPermaLink="true">https://dongsunseng.tistory.com/108</guid>
      <comments>https://dongsunseng.tistory.com/entry/CIBMTR-Equity-in-post-HCT-Survival-Predictions-4-GPU-LightGBM-Baseline-CV-681-LB-685#entry108comment</comments>
      <pubDate>Mon, 3 Feb 2025 17:33:35 +0900</pubDate>
    </item>
    <item>
      <title>CIBMTR - Equity in post-HCT Survival Predictions #3 Understanding Survival Analysis - 2</title>
      <link>https://dongsunseng.tistory.com/entry/CIBMTR-Equity-in-post-HCT-Survival-Predictions-2-Understanding-Survival-Analysis-2</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;Annotation of modeling &amp;amp; &lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;SHAP &lt;/span&gt;part of this kernel:&lt;/b&gt;&lt;/p&gt;
&lt;figure id=&quot;og_1738388282232&quot; contenteditable=&quot;false&quot; data-ke-type=&quot;opengraph&quot; data-ke-align=&quot;alignCenter&quot; data-og-type=&quot;website&quot; data-og-title=&quot;Understanding Survival Analysis&quot; data-og-description=&quot;Explore and run machine learning code with Kaggle Notebooks | Using data from CIBMTR - Equity in post-HCT Survival Predictions&quot; data-og-host=&quot;www.kaggle.com&quot; data-og-source-url=&quot;https://www.kaggle.com/code/benjenkins96/understanding-survival-analysis&quot; data-og-url=&quot;https://www.kaggle.com/code/benjenkins96/understanding-survival-analysis&quot; data-og-image=&quot;&quot;&gt;&lt;a href=&quot;https://www.kaggle.com/code/benjenkins96/understanding-survival-analysis&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot; data-source-url=&quot;https://www.kaggle.com/code/benjenkins96/understanding-survival-analysis&quot;&gt;
&lt;div class=&quot;og-image&quot; style=&quot;background-image: url();&quot;&gt;&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;og-text&quot;&gt;
&lt;p class=&quot;og-title&quot; data-ke-size=&quot;size16&quot;&gt;Understanding Survival Analysis&lt;/p&gt;
&lt;p class=&quot;og-desc&quot; data-ke-size=&quot;size16&quot;&gt;Explore and run machine learning code with Kaggle Notebooks | Using data from CIBMTR - Equity in post-HCT Survival Predictions&lt;/p&gt;
&lt;p class=&quot;og-host&quot; data-ke-size=&quot;size16&quot;&gt;www.kaggle.com&lt;/p&gt;
&lt;/div&gt;
&lt;/a&gt;&lt;/figure&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;XGBoost Model for Survival&lt;/b&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;We will now use an&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;b&gt;XGBoost&lt;/b&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;model with&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;b&gt;Optuna&lt;/b&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;to find the ideal hyperparameters. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;This model will be used to submit predictions.&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;This&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;b&gt;XGBoost&lt;/b&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;model implements survival analysis using the Cox proportional hazards (CPH) loss function, a widely used approach for time-to-event modeling.&lt;/span&gt; &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;It predicts risk scores for patients undergoing hematopoietic cell transplantation (HCT), leveraging features such as patient demographics and clinical characteristics. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;The CPH model ranks patients based on their relative risk of experiencing an event, such as death or relapse.&lt;/span&gt; &lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px; background-color: #ffc9af;&quot;&gt;It evaluates performance using metrics like the concordance index (C-index), which measures the model&amp;rsquo;s ability to rank patients by their predicted risk correctly.&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1738388362176&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from xgboost import XGBRegressor
import optuna

# Load the data
train_path = &quot;/kaggle/input/equity-post-HCT-survival-predictions/train.csv&quot;
data_dict = &quot;/kaggle/input/equity-post-HCT-survival-predictions/data_dictionary.csv&quot;

train_df = pd.read_csv(train_path)
data_info_df = pd.read_csv(data_dict)

# Preprocessing
epsilon = 1e-5
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()

for index, row in data_info_df.iterrows():
    if row[&quot;type&quot;] == &quot;Categorical&quot;:
        # Encode categorical variables as numbers
        train_df[row[&quot;variable&quot;]] = label_encoder.fit_transform(train_df[row[&quot;variable&quot;]].astype(str))
    else:
        # Fill missing values in numerical variables with -1
        train_df[row[&quot;variable&quot;]] = train_df[row[&quot;variable&quot;]].fillna(-1)
        
# Define target variable
train_df[&quot;y&quot;] = train_df[&quot;efs&quot;] / (train_df[&quot;efs_time&quot;] + epsilon)

# Define features and target
X = train_df.drop(columns=[&quot;efs&quot;, &quot;efs_time&quot;, &quot;ID&quot;, &quot;y&quot;])
y = train_df[&quot;y&quot;]

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Optuna objective function
def objective(trial):
    # Hyperparameter search space
    params = {
        &quot;n_estimators&quot;: trial.suggest_int(&quot;n_estimators&quot;, 100, 2000, step=100),
        &quot;learning_rate&quot;: trial.suggest_float(&quot;learning_rate&quot;, 0.01, 0.3, log=True),
        &quot;max_depth&quot;: trial.suggest_int(&quot;max_depth&quot;, 3, 15),
        &quot;subsample&quot;: trial.suggest_float(&quot;subsample&quot;, 0.6, 1.0),
        &quot;colsample_bytree&quot;: trial.suggest_float(&quot;colsample_bytree&quot;, 0.6, 1.0),
        &quot;reg_alpha&quot;: trial.suggest_float(&quot;reg_alpha&quot;, 1e-5, 1.0, log=True),
        &quot;reg_lambda&quot;: trial.suggest_float(&quot;reg_lambda&quot;, 1e-5, 1.0, log=True),
        &quot;early_stopping_rounds&quot;: 50  # Move early stopping here
    }

    # Train the model
    model = XGBRegressor(random_state=42, **params)
    model.fit(
        X_train, y_train,
        eval_set=[(X_test, y_test)],
        verbose=False
    )
    
    # Predictions and evaluation
    y_pred = model.predict(X_test)
    rmse = mean_squared_error(y_test, y_pred, squared=False)
    return rmse

# Run Optuna
study = optuna.create_study(direction=&quot;minimize&quot;)
study.optimize(objective, n_trials=50)

# Best parameters and RMSE
print(&quot;Best parameters:&quot;, study.best_params)
print(&quot;Best RMSE:&quot;, study.best_value)

# Train the final model with the best parameters
best_params = study.best_params
final_model = XGBRegressor(random_state=42, **best_params)
final_model.fit(X_train, y_train)

# Final predictions and evaluation
y_pred = final_model.predict(X_test)
rmse = mean_squared_error(y_test, y_pred, squared=False)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f&quot;Final RMSE: {rmse:.4f}&quot;)
print(f&quot;Final MAE: {mae:.4f}&quot;)
print(f&quot;Final R&amp;sup2;: {r2:.4f}&quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;Reason why we set target variable as train_df[&quot;y&quot;] = train_df[&quot;efs&quot;] / (train_df[&quot;efs_time&quot;] + epsilon)&lt;/b&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;This calculation means:&lt;/p&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;efs (event-free survival): whether an event (death or relapse) occurred (0 or 1)&lt;/li&gt;
&lt;li&gt;efs_time: observation period&lt;/li&gt;
&lt;li&gt;epsilon (1e-5): very small number to prevent division by zero&lt;/li&gt;
&lt;/ol&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;Reasons for this division:&lt;/p&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;Time normalization&lt;/span&gt;&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;Same events might have different importance depending on when they occurred&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;Example: relapse within 1 year vs relapse after 5 years&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;Reflects hazard concept&lt;/b&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Higher value means event occurred more quickly&lt;/li&gt;
&lt;li&gt;Examples:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;case&amp;nbsp;1:&amp;nbsp;efs=1,&amp;nbsp;time=1&amp;nbsp;year&amp;nbsp;&amp;rarr;&amp;nbsp;y&amp;nbsp;&amp;asymp;&amp;nbsp;1&lt;br /&gt;case&amp;nbsp;2:&amp;nbsp;efs=1,&amp;nbsp;time=5&amp;nbsp;years&amp;nbsp;&amp;rarr;&amp;nbsp;y&amp;nbsp;&amp;asymp;&amp;nbsp;0.2&lt;br /&gt;case&amp;nbsp;3:&amp;nbsp;efs=0,&amp;nbsp;time=any&amp;nbsp;&amp;rarr;&amp;nbsp;y&amp;nbsp;=&amp;nbsp;0&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;Reflects survival analysis characteristics&lt;/b&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Censored cases (efs=0) automatically become 0&lt;/li&gt;
&lt;li&gt;Cases with events get different weights based on occurrence time&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;This target variable becomes an indicator of &quot;event occurrence risk per unit time.&quot;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;Scoring the submission&lt;/b&gt;&lt;/h4&gt;
&lt;pre id=&quot;code_1738389237685&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;from lifelines.utils import concordance_index

# Define the score function
def score(solution: pd.DataFrame, submission: pd.DataFrame, row_id_column_name: str) -&amp;gt; float:
    &quot;&quot;&quot;
    Calculate C-index for each race group and return the global score.
    &quot;&quot;&quot;
    del solution[row_id_column_name]
    del submission[row_id_column_name]
    
    event_label = 'efs'
    interval_label = 'efs_time'
    prediction_label = 'prediction'
    for col in submission.columns:
        if not pd.api.types.is_numeric_dtype(submission[col]):
            raise ValueError(f'Submission column {col} must be a number')

    # Merging solution and submission dfs on ID
    merged_df = pd.concat([solution, submission], axis=1)
    merged_df.reset_index(inplace=True)
    merged_df_race_dict = dict(merged_df.groupby(['race_group']).groups)
    metric_list = []
    for race in merged_df_race_dict.keys():
        # Retrieving values from y_test based on index
        indices = sorted(merged_df_race_dict[race])
        merged_df_race = merged_df.iloc[indices]
        # Calculate the concordance index
        c_index_race = concordance_index(
                        merged_df_race[interval_label],
                        -merged_df_race[prediction_label],
                        merged_df_race[event_label])
        metric_list.append(c_index_race)
    return float(np.mean(metric_list) - np.sqrt(np.var(metric_list)))

# Final predictions
y_pred = final_model.predict(X_test)

# Prepare DataFrames for scoring
y_true_df = train_df.iloc[X_test.index][[&quot;ID&quot;, &quot;efs&quot;, &quot;efs_time&quot;, &quot;race_group&quot;]].copy()
y_pred_df = train_df.iloc[X_test.index][[&quot;ID&quot;]].copy()
y_pred_df[&quot;prediction&quot;] = y_pred

# Calculate the stratified C-index
stratified_c_index = score(y_true_df, y_pred_df, &quot;ID&quot;)
print(f&quot;Stratified C-index: {stratified_c_index:.4f}&quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;blockquote data-ke-style=&quot;style3&quot;&gt;Stratified C-index: 0.6716&lt;/blockquote&gt;
&lt;pre id=&quot;code_1738389813999&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;import optuna.visualization as vis

# Plot optimization history (objective value per trial)
fig = vis.plot_optimization_history(study)
fig.show()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-01 오후 3.03.44.png&quot; data-origin-width=&quot;827&quot; data-origin-height=&quot;482&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/cOgZaF/btsL3hhOsbD/YvwoS3vEYXu1qnMskiI1v0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/cOgZaF/btsL3hhOsbD/YvwoS3vEYXu1qnMskiI1v0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/cOgZaF/btsL3hhOsbD/YvwoS3vEYXu1qnMskiI1v0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FcOgZaF%2FbtsL3hhOsbD%2FYvwoS3vEYXu1qnMskiI1v0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;827&quot; height=&quot;482&quot; data-filename=&quot;스크린샷 2025-02-01 오후 3.03.44.png&quot; data-origin-width=&quot;827&quot; data-origin-height=&quot;482&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;h4 id=&quot;SHAP-(SHapley-Additive-exPlanations)&quot; style=&quot;color: #202214; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;SHAP (SHapley Additive exPlanations)&lt;/b&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;SHAP (SHapley Additive exPlanations) is a unified framework for interpreting machine learning models. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;It is based on &lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;cooperative game theory and provides insights into the contribution of each feature to a model's predictions.&lt;/b&gt;&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-01 오후 3.06.48.png&quot; data-origin-width=&quot;859&quot; data-origin-height=&quot;437&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/t4jYr/btsL4yv7o6a/B0umOJWVHFb6bwqwBER270/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/t4jYr/btsL4yv7o6a/B0umOJWVHFb6bwqwBER270/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/t4jYr/btsL4yv7o6a/B0umOJWVHFb6bwqwBER270/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Ft4jYr%2FbtsL4yv7o6a%2FB0umOJWVHFb6bwqwBER270%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;859&quot; height=&quot;437&quot; data-filename=&quot;스크린샷 2025-02-01 오후 3.06.48.png&quot; data-origin-width=&quot;859&quot; data-origin-height=&quot;437&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-01 오후 3.50.27.png&quot; data-origin-width=&quot;852&quot; data-origin-height=&quot;660&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/LtIWm/btsL3nvD5Ut/SK5yef5nKnUCx3KKI9nvuk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/LtIWm/btsL3nvD5Ut/SK5yef5nKnUCx3KKI9nvuk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/LtIWm/btsL3nvD5Ut/SK5yef5nKnUCx3KKI9nvuk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FLtIWm%2FbtsL3nvD5Ut%2FSK5yef5nKnUCx3KKI9nvuk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;852&quot; height=&quot;660&quot; data-filename=&quot;스크린샷 2025-02-01 오후 3.50.27.png&quot; data-origin-width=&quot;852&quot; data-origin-height=&quot;660&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-01 오후 3.52.12.png&quot; data-origin-width=&quot;857&quot; data-origin-height=&quot;556&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/baDM6q/btsL4NGt2KS/6WONLgNWwCxVEXP9Vkr6RK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/baDM6q/btsL4NGt2KS/6WONLgNWwCxVEXP9Vkr6RK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/baDM6q/btsL4NGt2KS/6WONLgNWwCxVEXP9Vkr6RK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbaDM6q%2FbtsL4NGt2KS%2F6WONLgNWwCxVEXP9Vkr6RK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;857&quot; height=&quot;556&quot; data-filename=&quot;스크린샷 2025-02-01 오후 3.52.12.png&quot; data-origin-width=&quot;857&quot; data-origin-height=&quot;556&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-01 오후 3.52.59.png&quot; data-origin-width=&quot;849&quot; data-origin-height=&quot;786&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/ctaNJ8/btsL3nCqxLS/7nE9F24pPeZGTpOOJcjKt0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/ctaNJ8/btsL3nCqxLS/7nE9F24pPeZGTpOOJcjKt0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/ctaNJ8/btsL3nCqxLS/7nE9F24pPeZGTpOOJcjKt0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FctaNJ8%2FbtsL3nCqxLS%2F7nE9F24pPeZGTpOOJcjKt0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;849&quot; height=&quot;786&quot; data-filename=&quot;스크린샷 2025-02-01 오후 3.52.59.png&quot; data-origin-width=&quot;849&quot; data-origin-height=&quot;786&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;Formula for Shapley Value (&amp;phi;ᵢ) of a specific Feature i&lt;/b&gt;:&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;S: subset of features excluding feature i&lt;/li&gt;
&lt;li&gt;N: set of all features&lt;/li&gt;
&lt;li&gt;|S|: number of features in S&lt;/li&gt;
&lt;li&gt;|N|: total number of features&lt;/li&gt;
&lt;li&gt;f(S): model prediction using only features in S&lt;/li&gt;
&lt;li&gt;f(S&amp;cup;{i}): model prediction when feature i is added to S&lt;/li&gt;
&lt;li&gt;&lt;b&gt;(|S|!(|N|-|S|-1)!)/(|N|!):&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;formula for calculating weights, related to permutations and combinations&lt;/li&gt;
&lt;li&gt;&lt;b&gt;|S|!&lt;/b&gt; : factorial of the number of features in S&lt;/li&gt;
&lt;li&gt;&lt;b&gt;(|N|-|S|-1)!&lt;/b&gt; : factorial of (total number of features minus S's features minus 1)&lt;/li&gt;
&lt;li&gt;&lt;b&gt;|N|!&lt;/b&gt; : factorial of total number of features&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Why&amp;nbsp;this&amp;nbsp;weight&amp;nbsp;is&amp;nbsp;necessary:&lt;/b&gt;&lt;br /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Contribution can be calculated differently depending on feature order&lt;/li&gt;
&lt;li&gt;To&amp;nbsp;calculate&amp;nbsp;average&amp;nbsp;contribution&amp;nbsp;considering&amp;nbsp;all&amp;nbsp;possible&amp;nbsp;orders&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;If&amp;nbsp;total&amp;nbsp;features&amp;nbsp;are&amp;nbsp;3&amp;nbsp;(N={A,B,C}),&lt;br /&gt;feature&amp;nbsp;i&amp;nbsp;is&amp;nbsp;A,&amp;nbsp;and&lt;br /&gt;S&amp;nbsp;is&amp;nbsp;{B}:&lt;br /&gt;|S|&amp;nbsp;=&amp;nbsp;1&amp;nbsp;(just&amp;nbsp;B)&lt;br /&gt;|N|&amp;nbsp;=&amp;nbsp;3&amp;nbsp;(A,B,C&amp;nbsp;three&amp;nbsp;features)&lt;br /&gt;|N|-|S|-1&amp;nbsp;=&amp;nbsp;1&amp;nbsp;(3-1-1)&lt;br /&gt;Therefore:&lt;br /&gt;(1!&amp;nbsp;*&amp;nbsp;1!)&amp;nbsp;/&amp;nbsp;3!&amp;nbsp;=&amp;nbsp;(1&amp;nbsp;*&amp;nbsp;1)&amp;nbsp;/&amp;nbsp;6&amp;nbsp;=&amp;nbsp;1/6&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;This weight is used to calculate the average influence of feature A when considering all possible orders in which it can be combined with other features.&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;This formula calculates &quot;how much feature i contributes to predictions when combined with other features.&quot;&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;Formula for Final Model Prediction (ŷ)&lt;/b&gt;:&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&amp;phi;₀: base prediction (average of all predictions)&lt;/li&gt;
&lt;li&gt;&amp;phi;ᵢ: Shapley value of each feature&lt;/li&gt;
&lt;li&gt;ŷ: final prediction for a specific instance&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;This formula means &quot;create the final prediction by adding each feature's contribution to the base prediction.&quot;&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-01 오후 3.53.13.png&quot; data-origin-width=&quot;843&quot; data-origin-height=&quot;133&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bfiBHr/btsL4JjTgkX/dqADraUfqXRd6ih5xe4Z3k/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bfiBHr/btsL4JjTgkX/dqADraUfqXRd6ih5xe4Z3k/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bfiBHr/btsL4JjTgkX/dqADraUfqXRd6ih5xe4Z3k/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbfiBHr%2FbtsL4JjTgkX%2FdqADraUfqXRd6ih5xe4Z3k%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;843&quot; height=&quot;133&quot; data-filename=&quot;스크린샷 2025-02-01 오후 3.53.13.png&quot; data-origin-width=&quot;843&quot; data-origin-height=&quot;133&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;pre id=&quot;code_1738394530186&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;import shap
import matplotlib.pyplot as plt
from tqdm import tqdm
import numpy as np

# Use only the first 100 rows of X
X = X.iloc[:100, :]

# Clean feature names by replacing special characters
X.columns = (
    X.columns.str.replace(r&quot;\[&quot;, &quot;_&quot;, regex=True)
             .str.replace(r&quot;\]&quot;, &quot;_&quot;, regex=True)
             .str.replace(r&quot;&amp;lt;&quot;, &quot;_&quot;, regex=True)
)

# Initialize SHAP TreeExplainer
explainer = shap.TreeExplainer(final_model)  # Use TreeExplainer with the XGBoost model

# Compute SHAP values for all rows at once
shap_values = explainer.shap_values(X)

# Summary plot: Displays the importance of features
shap.summary_plot(shap_values, X, plot_type=&quot;bar&quot;)  # Bar plot of mean absolute SHAP values&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-01 오후 4.22.16.png&quot; data-origin-width=&quot;823&quot; data-origin-height=&quot;913&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/RDfBU/btsL4cGPWcu/pHeTKkcIKrE7YUumDXN490/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/RDfBU/btsL4cGPWcu/pHeTKkcIKrE7YUumDXN490/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/RDfBU/btsL4cGPWcu/pHeTKkcIKrE7YUumDXN490/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FRDfBU%2FbtsL4cGPWcu%2FpHeTKkcIKrE7YUumDXN490%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;608&quot; height=&quot;674&quot; data-filename=&quot;스크린샷 2025-02-01 오후 4.22.16.png&quot; data-origin-width=&quot;823&quot; data-origin-height=&quot;913&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;pre id=&quot;code_1738394554916&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;# Summary plot: Detailed distribution of feature impacts
shap.summary_plot(shap_values, X)  # Beeswarm plot&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-01 오후 4.22.45.png&quot; data-origin-width=&quot;821&quot; data-origin-height=&quot;919&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bAfGTN/btsL5ie4dSn/LoyxlNVehzS9PHWudqkX0K/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bAfGTN/btsL5ie4dSn/LoyxlNVehzS9PHWudqkX0K/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bAfGTN/btsL5ie4dSn/LoyxlNVehzS9PHWudqkX0K/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbAfGTN%2FbtsL5ie4dSn%2FLoyxlNVehzS9PHWudqkX0K%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;821&quot; height=&quot;919&quot; data-filename=&quot;스크린샷 2025-02-01 오후 4.22.45.png&quot; data-origin-width=&quot;821&quot; data-origin-height=&quot;919&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;Creating submission csv&lt;/b&gt;&lt;/h4&gt;
&lt;pre id=&quot;code_1738394599483&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBRegressor

# Load the test and sample submission files
test = pd.read_csv('/kaggle/input/equity-post-HCT-survival-predictions/test.csv')
sample_submission = pd.read_csv('/kaggle/input/equity-post-HCT-survival-predictions/sample_submission.csv')

# Load the training data for consistent preprocessing
train = pd.read_csv('/kaggle/input/equity-post-HCT-survival-predictions/train.csv')

# Preprocessing: Handle categorical and numerical variables consistently
label_encoder = LabelEncoder()

for column in train.columns:
    if column in [&quot;efs&quot;, &quot;efs_time&quot;]:  # Skip target variables not present in the test set
        continue
    
    if train[column].dtype == 'object':  # Handle categorical variables
        test[column] = test[column].fillna(&quot;NAN&quot;)
        train[column] = train[column].fillna(&quot;NAN&quot;)
        label_encoder.fit(pd.concat([train[column], test[column]], axis=0))
        test[column] = label_encoder.transform(test[column])
    else:  # Handle numerical variables
        test[column] = test[column].fillna(-1)  # Replace missing values with -1

# Define features to align with the training data
FEATURES = [col for col in train.columns if col not in [&quot;ID&quot;, &quot;efs&quot;, &quot;efs_time&quot;, &quot;y&quot;]]

# Ensure the test set matches the feature space of the training data
missing_cols = [col for col in FEATURES if col not in test.columns]
for col in missing_cols:
    test[col] = 0  # Add missing columns with default values

test = test[FEATURES]  # Reorder columns to match the training feature space

# Make predictions on the test set
test['predicted_risk'] = final_model.predict(test)

# Prepare the submission file
sample_submission['prediction'] = test['predicted_risk']

# Check for any missing or invalid values in the predictions
if sample_submission['prediction'].isnull().any():
    raise ValueError(&quot;The submission file contains NaN values. Please check your predictions.&quot;)

# Save the submission file in the correct format
sample_submission.to_csv('submission.csv', index=False)

# Display the first few rows of the submission file to verify
sample_submission.head()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-01 오후 4.23.29.png&quot; data-origin-width=&quot;246&quot; data-origin-height=&quot;150&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bSlxcB/btsL4pF5Jk5/kPT0ls2X13Jlkbe4YqWPYK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bSlxcB/btsL4pF5Jk5/kPT0ls2X13Jlkbe4YqWPYK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bSlxcB/btsL4pF5Jk5/kPT0ls2X13Jlkbe4YqWPYK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbSlxcB%2FbtsL4pF5Jk5%2FkPT0ls2X13Jlkbe4YqWPYK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;246&quot; height=&quot;150&quot; data-filename=&quot;스크린샷 2025-02-01 오후 4.23.29.png&quot; data-origin-width=&quot;246&quot; data-origin-height=&quot;150&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style5&quot; /&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;꿈을&amp;nbsp;크게&amp;nbsp;가져야&amp;nbsp;깨져도&amp;nbsp;그&amp;nbsp;조각이&amp;nbsp;크다.&lt;/span&gt;&lt;/blockquote&gt;</description>
      <category>대회</category>
      <category>cibmtr - equity in post-hct survival predictions</category>
      <category>shap</category>
      <category>understanding survival analysis</category>
      <category>캐글</category>
      <author>dongsunseng</author>
      <guid isPermaLink="true">https://dongsunseng.tistory.com/106</guid>
      <comments>https://dongsunseng.tistory.com/entry/CIBMTR-Equity-in-post-HCT-Survival-Predictions-2-Understanding-Survival-Analysis-2#entry106comment</comments>
      <pubDate>Sat, 1 Feb 2025 16:42:55 +0900</pubDate>
    </item>
    <item>
      <title>CIBMTR - Equity in post-HCT Survival Predictions #2 Understanding Survival Analysis - 1</title>
      <link>https://dongsunseng.tistory.com/entry/CIBMTR-Equity-in-post-HCT-Survival-Predictions-2-Understanding-Survival-Analysis</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;Annotation of this kernel: &lt;a href=&quot;https://www.kaggle.com/code/benjenkins96/understanding-survival-analysis&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://www.kaggle.com/code/benjenkins96/understanding-survival-analysis&lt;/a&gt;&lt;/p&gt;
&lt;figure id=&quot;og_1738237779967&quot; contenteditable=&quot;false&quot; data-ke-type=&quot;opengraph&quot; data-ke-align=&quot;alignCenter&quot; data-og-type=&quot;website&quot; data-og-title=&quot;Understanding Survival Analysis&quot; data-og-description=&quot;Explore and run machine learning code with Kaggle Notebooks | Using data from CIBMTR - Equity in post-HCT Survival Predictions&quot; data-og-host=&quot;www.kaggle.com&quot; data-og-source-url=&quot;https://www.kaggle.com/code/benjenkins96/understanding-survival-analysis&quot; data-og-url=&quot;https://www.kaggle.com/code/benjenkins96/understanding-survival-analysis&quot; data-og-image=&quot;&quot;&gt;&lt;a href=&quot;https://www.kaggle.com/code/benjenkins96/understanding-survival-analysis&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot; data-source-url=&quot;https://www.kaggle.com/code/benjenkins96/understanding-survival-analysis&quot;&gt;
&lt;div class=&quot;og-image&quot; style=&quot;background-image: url();&quot;&gt;&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;og-text&quot;&gt;
&lt;p class=&quot;og-title&quot; data-ke-size=&quot;size16&quot;&gt;Understanding Survival Analysis&lt;/p&gt;
&lt;p class=&quot;og-desc&quot; data-ke-size=&quot;size16&quot;&gt;Explore and run machine learning code with Kaggle Notebooks | Using data from CIBMTR - Equity in post-HCT Survival Predictions&lt;/p&gt;
&lt;p class=&quot;og-host&quot; data-ke-size=&quot;size16&quot;&gt;www.kaggle.com&lt;/p&gt;
&lt;/div&gt;
&lt;/a&gt;&lt;/figure&gt;
&lt;h4 id=&quot;Loading-Data-and-Initial-EDA&quot; style=&quot;color: #202214; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;Initial EDA&lt;/b&gt;&lt;/h4&gt;
&lt;pre id=&quot;code_1738237901075&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;# Check the distribution of the target variables
plt.figure(figsize=(10, 5))
sns.countplot(data=train, x='efs', palette='coolwarm')
plt.title('Distribution of Event-Free Survival (efs)')
plt.show()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-01-30 오후 8.52.00.png&quot; data-origin-width=&quot;1622&quot; data-origin-height=&quot;850&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bOn9Rv/btsL3QiO0Rp/42NKqxisNQlN1XkjTMN67K/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bOn9Rv/btsL3QiO0Rp/42NKqxisNQlN1XkjTMN67K/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bOn9Rv/btsL3QiO0Rp/42NKqxisNQlN1XkjTMN67K/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbOn9Rv%2FbtsL3QiO0Rp%2F42NKqxisNQlN1XkjTMN67K%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;738&quot; height=&quot;387&quot; data-filename=&quot;스크린샷 2025-01-30 오후 8.52.00.png&quot; data-origin-width=&quot;1622&quot; data-origin-height=&quot;850&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Event-free survival(efs) is an important outcome measure in medical research, particularly in transplant studies&lt;/li&gt;
&lt;li&gt;EFS refers to the period from the start of treatment(transplant) until the occurrence of an &quot;event&quot;: probably death in this competition&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;EFS differs from Overall Survival (OS):&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;OS only considers survival/death&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;EFS includes not only survival but also various important clinical events related to treatment success&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1738237946057&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;plt.figure(figsize=(10, 5))
sns.histplot(data=train, x='efs_time', bins=30, kde=True, color='blue')
plt.title('Distribution of Time to Event-Free Survival (efs_time)')
plt.show()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-01-30 오후 8.54.40.png&quot; data-origin-width=&quot;1622&quot; data-origin-height=&quot;850&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/kRujy/btsL36Mv6Zp/T9KKkgRkWIthFeCLMxlM81/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/kRujy/btsL36Mv6Zp/T9KKkgRkWIthFeCLMxlM81/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/kRujy/btsL36Mv6Zp/T9KKkgRkWIthFeCLMxlM81/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FkRujy%2FbtsL36Mv6Zp%2FT9KKkgRkWIthFeCLMxlM81%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;690&quot; height=&quot;362&quot; data-filename=&quot;스크린샷 2025-01-30 오후 8.54.40.png&quot; data-origin-width=&quot;1622&quot; data-origin-height=&quot;850&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;pre id=&quot;code_1738238284325&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;plt.figure(figsize=(6, 3))
plt.hist(train.efs_time[train.efs == 0], bins=50, label='efs=0: Patient Still Alive Or Unknown', alpha=0.5)
plt.hist(train.efs_time[train.efs == 1], bins=50, label='efs=1: Patient Dies', alpha=0.5)
plt.legend()
plt.xlabel('Event Free Survival Time')
plt.ylabel('Count')
plt.title('Histogram of Time to Event-Free Survival (efs_time)')
plt.show()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-01-30 오후 8.58.17.png&quot; data-origin-width=&quot;1196&quot; data-origin-height=&quot;686&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/lnjql/btsL38Q5XKm/hhLHjEOFwNWvIWlJXK40kK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/lnjql/btsL38Q5XKm/hhLHjEOFwNWvIWlJXK40kK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/lnjql/btsL38Q5XKm/hhLHjEOFwNWvIWlJXK40kK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Flnjql%2FbtsL38Q5XKm%2FhhLHjEOFwNWvIWlJXK40kK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;645&quot; height=&quot;370&quot; data-filename=&quot;스크린샷 2025-01-30 오후 8.58.17.png&quot; data-origin-width=&quot;1196&quot; data-origin-height=&quot;686&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;pre id=&quot;code_1738238411123&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;# Explore distribution of key demographic features
demo_features = ['race_group', 'sex_match', 'ethnicity']
for feature in demo_features:
    plt.figure(figsize=(10, 5))
    sns.countplot(data=train, x=feature, palette='viridis', order=train[feature].value_counts().index)
    plt.title(f'Distribution of {feature}')
    plt.xticks(rotation=45)
    plt.show()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-01-30 오후 9.00.29.png&quot; data-origin-width=&quot;1570&quot; data-origin-height=&quot;1174&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/Uu9iM/btsL1jAdW5Q/zvLaKeMUB0McMrQAu5dvf0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/Uu9iM/btsL1jAdW5Q/zvLaKeMUB0McMrQAu5dvf0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/Uu9iM/btsL1jAdW5Q/zvLaKeMUB0McMrQAu5dvf0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FUu9iM%2FbtsL1jAdW5Q%2FzvLaKeMUB0McMrQAu5dvf0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;729&quot; height=&quot;545&quot; data-filename=&quot;스크린샷 2025-01-30 오후 9.00.29.png&quot; data-origin-width=&quot;1570&quot; data-origin-height=&quot;1174&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-01-30 오후 9.01.54.png&quot; data-origin-width=&quot;1570&quot; data-origin-height=&quot;910&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/wEB4L/btsL357Uv0I/KwEl8mFdLsohXzvVvwrcSK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/wEB4L/btsL357Uv0I/KwEl8mFdLsohXzvVvwrcSK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/wEB4L/btsL357Uv0I/KwEl8mFdLsohXzvVvwrcSK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FwEB4L%2FbtsL357Uv0I%2FKwEl8mFdLsohXzvVvwrcSK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;694&quot; height=&quot;402&quot; data-filename=&quot;스크린샷 2025-01-30 오후 9.01.54.png&quot; data-origin-width=&quot;1570&quot; data-origin-height=&quot;910&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-01-30 오후 9.02.23.png&quot; data-origin-width=&quot;1570&quot; data-origin-height=&quot;1050&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/CmACf/btsL3jyUXbr/IyWm4pk7KPYF1gjp3UQYG0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/CmACf/btsL3jyUXbr/IyWm4pk7KPYF1gjp3UQYG0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/CmACf/btsL3jyUXbr/IyWm4pk7KPYF1gjp3UQYG0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FCmACf%2FbtsL3jyUXbr%2FIyWm4pk7KPYF1gjp3UQYG0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;651&quot; height=&quot;435&quot; data-filename=&quot;스크린샷 2025-01-30 오후 9.02.23.png&quot; data-origin-width=&quot;1570&quot; data-origin-height=&quot;1050&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;sex_match is a variable that indicates the gender match between donor and recipient in Hematopoietic Cell Transplantation (HCT). &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;It is typically categorized as follows:&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;M-M: Male donor &amp;rarr; Male recipient&lt;/li&gt;
&lt;li&gt;M-F: Male donor &amp;rarr; Female recipient&lt;/li&gt;
&lt;li&gt;F-M: Female donor &amp;rarr; Male recipient&lt;/li&gt;
&lt;li&gt;F-F: Female donor &amp;rarr; Female recipient&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;Kaplan-Meier Estimator&lt;/b&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The Kaplan-Meier Estimator is a non-parametric statistical method used in survival analysis to estimate the survival function from time-to-event data.&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;It calculates the probability that an individual will survive beyond a certain point in time, accounting for &lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;censored data (cases, where the event of interest has not occurred by the end of the study or the individual, is lost to follow-up)&lt;/b&gt;&lt;/span&gt;.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-01-31 오후 9.30.16.png&quot; data-origin-width=&quot;846&quot; data-origin-height=&quot;698&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/blTlSq/btsL4FhuNGx/2dCukHPOSHboMlq2eBbc11/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/blTlSq/btsL4FhuNGx/2dCukHPOSHboMlq2eBbc11/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/blTlSq/btsL4FhuNGx/2dCukHPOSHboMlq2eBbc11/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FblTlSq%2FbtsL4FhuNGx%2F2dCukHPOSHboMlq2eBbc11%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;846&quot; height=&quot;698&quot; data-filename=&quot;스크린샷 2025-01-31 오후 9.30.16.png&quot; data-origin-width=&quot;846&quot; data-origin-height=&quot;698&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;Key Properties:&lt;/b&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc; color: #3c4043; text-align: start;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;The Kaplan-Meier curve is a step function, with drops occurring at times when events are observed.&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #c0d1e7;&quot;&gt;&lt;b&gt;It handles censoring by only considering individuals at risk just before each event time.&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p style=&quot;color: #3c4043; text-align: left;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;Advantages:&lt;/b&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc; color: #3c4043; text-align: start;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;Non-parametric: Makes no assumptions about the distribution of survival times.&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;Handles Censoring: Incorporates censored data effectively.&lt;/li&gt;
&lt;li&gt;Easy Interpretation: Provides intuitive survival probabilities.&lt;/li&gt;
&lt;/ul&gt;
&lt;p style=&quot;color: #3c4043; text-align: left;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;Limitations:&lt;/b&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc; color: #3c4043; text-align: start;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;Assumes Independence of Censoring: Assumes that the censored individuals have the same survival prospects as those still under observation.&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;&lt;b&gt;Lack of Multivariable Adjustments: Does not account for the effects of covariates (e.g., age, race). For this, models like Cox regression are used.&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;Uncertainty at Long Times: If few individuals remain at risk at later time points, the estimates may become less reliable.&lt;/li&gt;
&lt;/ul&gt;
&lt;p style=&quot;color: #3c4043; text-align: left;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;Use Case:&lt;/b&gt;&lt;/p&gt;
&lt;p style=&quot;color: #3c4043; text-align: left;&quot; data-ke-size=&quot;size16&quot;&gt;In the context of HCT survival analysis:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc; color: #3c4043; text-align: start;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;Kaplan-Meier can estimate survival probabilities for the entire population or subgroups (e.g., race or gender).&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;It helps visualize differences in survival rates among groups, providing insights into disparities or the impact of certain factors.&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;Results:&lt;/b&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;The&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;b&gt;Kaplan-Meier&lt;/b&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;survival curve represents the probability of remaining event-free (e.g., alive or without relapse) over time, with the y-axis showing survival probability and the x-axis representing time in months.&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;In Kaplan-Meier survival curves, &quot;event-free (alive or without relapse)&quot; means satisfying both of these conditions:&lt;br /&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;alive: the patient is living&lt;br /&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;without relapse: the disease has not recurred(재발)&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;Initially, the curve starts at 1.0 (100% survival) since all individuals are event-free at time zero. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px; background-color: #ffc9af;&quot;&gt;&lt;b&gt;The steep decline in the early months indicates that a significant number of patients experience events, such as death or relapse, shortly after the transplant.&lt;/b&gt; &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;This highlights the high-risk nature of the initial post-transplant period.&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;As time progresses, the curve begins to level off, particularly after 20-30 months, suggesting that those who survive the initial phase tend to have better long-term outcomes.&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #c0d1e7;&quot;&gt;&lt;b&gt;The survival probability never reaches zero, indicating that a portion of the population remains event-free throughout the observation period.&lt;/b&gt; &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;The shaded region around the curve represents the confidence interval, which reflects the uncertainty of the survival estimates.&lt;/span&gt; &lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Early on, the confidence intervals are narrow, indicating precise estimates due to a larger sample size. &lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;&lt;b&gt;However, they widen at later time points, reflecting fewer patients being observed (due to censoring), which reduces the precision of the estimates.&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;Overall, the&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;b&gt;Kaplan-Meier&lt;/b&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&lt;b&gt;curve provides insight into the time-dependent risks of events, emphasizing the need for targeted interventions during the early post-transplant period to improve survival outcomes.&lt;/b&gt; &lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;The curve also suggests that patients who pass the high-risk early phase may achieve more favorable long-term survival. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;Further analysis, such as stratifying the data by race or comorbidity scores, could provide deeper insights into factors influencing survival and potential disparities across subgroups.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1738328113571&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;from lifelines import KaplanMeierFitter

# Instantiate the Kaplan-Meier fitter
kmf = KaplanMeierFitter()

# Kaplan-Meier fit for the entire dataset
plt.figure(figsize=(10, 6))
kmf.fit(durations=train['efs_time'], event_observed=train['efs'])
kmf.plot_survival_function()
plt.title('Kaplan-Meier Survival Curve for Entire Dataset')
plt.xlabel('Time (Months)')
plt.ylabel('Survival Probability')
plt.grid()
plt.show()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-01-31 오후 9.55.27.png&quot; data-origin-width=&quot;847&quot; data-origin-height=&quot;535&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bFbeeS/btsL4RhH1Y0/Siq1auktBzKlLdlQdeOY7k/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bFbeeS/btsL4RhH1Y0/Siq1auktBzKlLdlQdeOY7k/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bFbeeS/btsL4RhH1Y0/Siq1auktBzKlLdlQdeOY7k/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbFbeeS%2FbtsL4RhH1Y0%2FSiq1auktBzKlLdlQdeOY7k%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;687&quot; height=&quot;434&quot; data-filename=&quot;스크린샷 2025-01-31 오후 9.55.27.png&quot; data-origin-width=&quot;847&quot; data-origin-height=&quot;535&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;Kaplan-Meier Survival Curve: Stratified by &lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;Race&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p style=&quot;color: #3c4043; text-align: left;&quot; data-ke-size=&quot;size16&quot;&gt;The Kaplan-Meier survival curve below visualizes the survival probabilities for &lt;b&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;different racial groups over time&lt;/span&gt;&lt;/b&gt;. Each line represents a specific race group. The shaded areas around the curves represent confidence intervals.&lt;/p&gt;
&lt;p style=&quot;color: #3c4043; text-align: left;&quot; data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;color: #3c4043; text-align: left;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;Key Observations:&lt;/b&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc; color: #3c4043; text-align: start;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;Early Survival Decline:&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;All race groups show a steep initial decline in survival probability, indicating a high risk of adverse events shortly after transplantation.&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;The rate of decline varies among groups, suggesting potential disparities in early survival outcomes.&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Group Differences in Long-Term Survival:&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;Groups like &quot;More than one race&quot; and &quot;Asian&quot; exhibit higher long-term survival probabilities compared to &quot;White&quot; and &quot;Black or African-American&quot; groups.&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&quot;American Indian or Alaska Native&quot; and &quot;Native Hawaiian or other Pacific Islander&quot; groups show moderate survival probabilities.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Confidence Intervals:&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Confidence intervals widen over time, &lt;b&gt;reflecting reduced sample sizes&lt;/b&gt;.&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;Widening is more pronounced in smaller racial groups, indicating greater uncertainty in survival estimates.&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Potential Disparities:&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The observed differences in survival probabilities suggest disparities in post-transplant outcomes that may be influenced by various factors.&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;&quot;White&quot; and &quot;Black or African-American&quot; groups consistently have lower survival probabilities, highlighting areas for potential intervention.&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1738329420192&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;# Kaplan-Meier fit for different groups (e.g., race_group)
plt.figure(figsize=(12, 8))
for group in train['race_group'].dropna().unique():
    group_data = train[train['race_group'] == group]
    kmf.fit(durations=group_data['efs_time'], event_observed=group_data['efs'], label=group)
    kmf.plot_survival_function()

plt.title('Kaplan-Meier Survival Curve by Race Group')
plt.xlabel('Time (Months)')
plt.ylabel('Survival Probability')
plt.legend(title='Race Group')
plt.grid()
plt.show()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-01-31 오후 10.17.11.png&quot; data-origin-width=&quot;847&quot; data-origin-height=&quot;588&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/rRdkx/btsL2H2iP9n/3kvBZHX8KKndH2LCMPLA50/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/rRdkx/btsL2H2iP9n/3kvBZHX8KKndH2LCMPLA50/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/rRdkx/btsL2H2iP9n/3kvBZHX8KKndH2LCMPLA50/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FrRdkx%2FbtsL2H2iP9n%2F3kvBZHX8KKndH2LCMPLA50%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;713&quot; height=&quot;495&quot; data-filename=&quot;스크린샷 2025-01-31 오후 10.17.11.png&quot; data-origin-width=&quot;847&quot; data-origin-height=&quot;588&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;Kaplan-Meier Survival Curve: Stratified by &lt;span style=&quot;background-color: #ffc9af;&quot;&gt;Donor/Recipient Sex Match&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p style=&quot;color: #3c4043; text-align: left;&quot; data-ke-size=&quot;size16&quot;&gt;The Kaplan-Meier survival curve below visualizes the survival probabilities for different donor/recipient sex match combinations over time. Each curve represents one of the four possible combinations:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc; color: #3c4043; text-align: start;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Male-to-Female (M-F)&lt;/li&gt;
&lt;li&gt;Female-to-Female (F-F)&lt;/li&gt;
&lt;li&gt;Female-to-Male (F-M)&lt;/li&gt;
&lt;li&gt;Male-to-Male (M-M)&lt;/li&gt;
&lt;/ul&gt;
&lt;p style=&quot;color: #3c4043; text-align: left;&quot; data-ke-size=&quot;size16&quot;&gt;The shaded areas around the curves indicate confidence intervals.&lt;/p&gt;
&lt;p style=&quot;color: #3c4043; text-align: left;&quot; data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;color: #3c4043; text-align: left;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;Key Observations:&lt;/b&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc; color: #3c4043; text-align: start;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;Early Decline in Survival:&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;All groups show a steep initial decline in survival probability, reflecting the high-risk post-transplant period.&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Long-Term Survival Differences:&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;F-F and M-M show the highest long-term survival probability.&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;F-M and M-F have lower long-term survival probabilities.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Confidence Intervals:&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Confidence intervals widen over time, particularly for M-F and F-M.&lt;/li&gt;
&lt;li&gt;&lt;b&gt;F-F has relatively narrow intervals.&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p style=&quot;color: #3c4043; text-align: left;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;Sex Match Impact:&lt;/b&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc; color: #3c4043; text-align: start;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;F-F and M-M transplants tend to have better outcomes.&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;M-F and F-M groups have lower survival probabilities.&lt;/li&gt;
&lt;/ul&gt;
&lt;p style=&quot;color: #3c4043; text-align: left;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;Insights and Implications:&lt;/b&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc; color: #3c4043; text-align: start;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;Clinical Relevance:&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;The survival advantage for F-F and M-M may reflect better immunological compatibility.&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;M-F and F-M groups might benefit from additional clinical interventions.
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;Meaning that M-F, F-M groups may require additional clinical interventions(treatments)&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Biological Factors:&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Differences in survival may stem from biological factors like &lt;b&gt;immunological response or GVHD risk&lt;/b&gt;.
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;immunological response:&lt;/b&gt;&lt;br /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #c0d1e7;&quot;&gt;&lt;b&gt;Refers&amp;nbsp;to&amp;nbsp;how&amp;nbsp;our&amp;nbsp;body's&amp;nbsp;immune&amp;nbsp;system&amp;nbsp;responds&amp;nbsp;to&amp;nbsp;foreign&amp;nbsp;substances&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;In transplant situations:&lt;br /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The immune reaction that occurs when donor cells enter the recipient's body&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #c0d1e7;&quot;&gt;&lt;b&gt;If&amp;nbsp;this&amp;nbsp;response&amp;nbsp;is&amp;nbsp;too&amp;nbsp;strong&amp;nbsp;or&amp;nbsp;too&amp;nbsp;weak,&amp;nbsp;it&amp;nbsp;can&amp;nbsp;negatively&amp;nbsp;affect&amp;nbsp;transplant&amp;nbsp;outcomes&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;GVHD (Graft Versus Host Disease) risk:&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;&lt;b&gt;A condition where transplanted donor immune cells recognize the recipient's body as 'foreign' and attack it&amp;nbsp;&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;Major symptoms:&lt;br /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Skin rash&lt;/li&gt;
&lt;li&gt;Liver&amp;nbsp;damage&lt;/li&gt;
&lt;li&gt;Digestive system problems&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;A&amp;nbsp;serious&amp;nbsp;complication&amp;nbsp;that&amp;nbsp;can&amp;nbsp;be&amp;nbsp;life-threatening&amp;nbsp;in&amp;nbsp;severe&amp;nbsp;cases&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Further Analysis:&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;Additional factors should be analyzed alongside sex match.&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;Statistical tests can confirm the significance of observed differences.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1738332370897&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;# Kaplan-Meier fit for a binary feature (e.g., gender)
plt.figure(figsize=(12, 8))
for gender in train['sex_match'].dropna().unique():
    gender_data = train[train['sex_match'] == gender]
    kmf.fit(durations=gender_data['efs_time'], event_observed=gender_data['efs'], label=gender)
    kmf.plot_survival_function()

plt.title('Kaplan-Meier Survival Curve by Sex Match')
plt.xlabel('Time (Months)')
plt.ylabel('Survival Probability')
plt.legend(title='Sex Match')
plt.grid()
plt.show()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-01-31 오후 11.06.18.png&quot; data-origin-width=&quot;847&quot; data-origin-height=&quot;588&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/CMoeD/btsL4wyfi3Y/ypPNoX92tiZtkKMfFBmy31/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/CMoeD/btsL4wyfi3Y/ypPNoX92tiZtkKMfFBmy31/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/CMoeD/btsL4wyfi3Y/ypPNoX92tiZtkKMfFBmy31/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FCMoeD%2FbtsL4wyfi3Y%2FypPNoX92tiZtkKMfFBmy31%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;726&quot; height=&quot;504&quot; data-filename=&quot;스크린샷 2025-01-31 오후 11.06.18.png&quot; data-origin-width=&quot;847&quot; data-origin-height=&quot;588&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;h4 id=&quot;Cox-Proportional-Hazards-(CPH)-Model&quot; style=&quot;color: #202214; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;Cox Proportional Hazards (CPH) Model&lt;/b&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The&amp;nbsp;Cox Proportional Hazards (CPH) model&amp;nbsp;is a widely used method in survival analysis for &lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;evaluating the effect of multiple covariates on the time to a specific event, such as death or relapse&lt;/b&gt;&lt;/span&gt;.&lt;/li&gt;
&lt;li&gt;Unlike non-parametric methods like Kaplan-Meier, CPH is a &lt;b&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;semi-parametric model incorporating covariates to estimate their influence on survival &lt;span style=&quot;background-color: #f6e199;&quot;&gt;while making no assumptions about the baseline hazard function.&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;baseline hazard function: the change in basic risk rate over time&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-01-31 오후 11.46.02.png&quot; data-origin-width=&quot;878&quot; data-origin-height=&quot;866&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/cxJB4g/btsL4dFGBU6/nIKpdXEJ2QT3Ye0tQM8pFK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/cxJB4g/btsL4dFGBU6/nIKpdXEJ2QT3Ye0tQM8pFK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/cxJB4g/btsL4dFGBU6/nIKpdXEJ2QT3Ye0tQM8pFK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FcxJB4g%2FbtsL4dFGBU6%2FnIKpdXEJ2QT3Ye0tQM8pFK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;878&quot; height=&quot;866&quot; data-filename=&quot;스크린샷 2025-01-31 오후 11.46.02.png&quot; data-origin-width=&quot;878&quot; data-origin-height=&quot;866&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-01-31 오후 11.46.20.png&quot; data-origin-width=&quot;878&quot; data-origin-height=&quot;889&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bYvBZt/btsL4xKIiWn/C5E6xcSXpk11tZVXRK18B1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bYvBZt/btsL4xKIiWn/C5E6xcSXpk11tZVXRK18B1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bYvBZt/btsL4xKIiWn/C5E6xcSXpk11tZVXRK18B1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbYvBZt%2FbtsL4xKIiWn%2FC5E6xcSXpk11tZVXRK18B1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;878&quot; height=&quot;889&quot; data-filename=&quot;스크린샷 2025-01-31 오후 11.46.20.png&quot; data-origin-width=&quot;878&quot; data-origin-height=&quot;889&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-01 오전 12.34.16.png&quot; data-origin-width=&quot;840&quot; data-origin-height=&quot;375&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bxhbNv/btsL4IyyHAi/z2p6o8JZoEyv8Tck1pAUU1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bxhbNv/btsL4IyyHAi/z2p6o8JZoEyv8Tck1pAUU1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bxhbNv/btsL4IyyHAi/z2p6o8JZoEyv8Tck1pAUU1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbxhbNv%2FbtsL4IyyHAi%2Fz2p6o8JZoEyv8Tck1pAUU1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;828&quot; height=&quot;370&quot; data-filename=&quot;스크린샷 2025-02-01 오전 12.34.16.png&quot; data-origin-width=&quot;840&quot; data-origin-height=&quot;375&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;Hazard Function h(t|X) in detail:&lt;/b&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Basic&amp;nbsp;Concept:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Represents the instantaneous probability that someone who has survived until time t will experience an event right after&lt;/li&gt;
&lt;li&gt;Here, 'event' could be death, disease recurrence, etc.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Formula Components:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;h₀(t): baseline hazard function
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Basic risk rate when all covariates are 0&lt;/li&gt;
&lt;li&gt;Can change over time&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;exp(&amp;beta;₁X₁ + &amp;beta;₂X₂ + ... + &amp;beta;ₚXₚ): effect of covariates
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #99cefa;&quot;&gt;&lt;b&gt;X₁, X₂, ..., Xₚ: covariates (age, gender, etc.)&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;&lt;b&gt;&amp;beta;₁, &amp;beta;₂, ..., &amp;beta;ₚ: coefficients showing the influence of each covariate&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;Why use exp: ensures hazard rate is always positive&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Practical Meaning:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;For example, in hematopoietic cell transplant patients:&lt;br /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;h(t): risk of death/relapse at time t&lt;/li&gt;
&lt;li&gt;X₁:&amp;nbsp;patient's&amp;nbsp;age&lt;/li&gt;
&lt;li&gt;X₂:&amp;nbsp;gender&amp;nbsp;matching&amp;nbsp;status&lt;/li&gt;
&lt;li&gt;&amp;beta;₁:&amp;nbsp;impact&amp;nbsp;of&amp;nbsp;age&amp;nbsp;on&amp;nbsp;risk&lt;/li&gt;
&lt;li&gt;&amp;beta;₂:&amp;nbsp;impact&amp;nbsp;of&amp;nbsp;gender&amp;nbsp;matching&amp;nbsp;on&amp;nbsp;risk&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;Proportional Hazards Assumption in detail:&lt;/b&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;Assumes that the hazard ratio between two patients remains constant over time&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;Example:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;If a 50-year-old patient has twice the risk of a 30-year-old patient&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;This &quot;twice&quot; ratio remains constant whether it's 1 month or 1 year post-transplant&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;Therefore &quot;TIME-INDEPENDENT&quot;&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;Hazard Ratio (HR) in detail:&lt;/b&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #c0d1e7;&quot;&gt;&lt;b&gt;Calculated as HR = exp(&amp;beta;)&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;HR &amp;gt; 1: increased risk
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Example: HR = 2 means double the risk&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;HR &amp;lt; 1: decreased risk
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Example: HR = 0.5 means half the risk&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;HR = 1: no effect&lt;/li&gt;
&lt;li&gt;If&amp;nbsp;&amp;beta;&amp;nbsp;=&amp;nbsp;0.693&amp;nbsp;for&amp;nbsp;gender&amp;nbsp;matching:&lt;br /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;HR&amp;nbsp;=&amp;nbsp;exp(0.693)&amp;nbsp;=&amp;nbsp;2&lt;/li&gt;
&lt;li&gt;This means for gender mismatch:&amp;nbsp;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Risk doubles&lt;/li&gt;
&lt;li&gt;This&amp;nbsp;doubling&amp;nbsp;remains&amp;nbsp;constant&amp;nbsp;at&amp;nbsp;any&amp;nbsp;time&amp;nbsp;post-transplant&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;Censoring in detail:&lt;/b&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;What is censoring?&lt;/b&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;When the event of interest (e.g., death, relapse) doesn't occur during the study period&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;In other words, when we can't know the patient's final outcome&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;Cases of &lt;b&gt;right-censoring:&lt;/b&gt;&lt;/p&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;&lt;b&gt;No event occurs until the end of the study&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Example: Patient survives throughout a 5-year follow-up study&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Patient drops out during follow-up&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Example: Transfer to another hospital&lt;/li&gt;
&lt;li&gt;Example: Loss of contact&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Excluded from study for other reasons&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Example: Patient requests to discontinue participation&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;Handling in Cox model:&lt;/b&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Censored data is included in the analysis&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;Information up to the censoring point is used for model estimation&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;Unbiased estimates are calculated through likelihood function&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;Likelihood function in detail:&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;What is a likelihood function:&lt;/b&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;A function that &lt;b&gt;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;calculates the possibility (probability) that observed data came from a specific statistical model&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;In other words, it quantifies &quot;how likely this data would come from this model&quot;&lt;/li&gt;
&lt;li&gt;Let's&amp;nbsp;assume&amp;nbsp;we&amp;nbsp;have&amp;nbsp;patient&amp;nbsp;survival&amp;nbsp;data:&lt;br /&gt;-&amp;nbsp;Patient&amp;nbsp;A:&amp;nbsp;Died&amp;nbsp;after&amp;nbsp;2&amp;nbsp;years&lt;br /&gt;-&amp;nbsp;Patient&amp;nbsp;B:&amp;nbsp;Survived&amp;nbsp;until&amp;nbsp;3&amp;nbsp;years&amp;nbsp;(then&amp;nbsp;lost&amp;nbsp;to&amp;nbsp;follow-up)&lt;br /&gt;-&amp;nbsp;Patient&amp;nbsp;C:&amp;nbsp;Died&amp;nbsp;after&amp;nbsp;5&amp;nbsp;years&lt;br /&gt;&lt;br /&gt;The&amp;nbsp;likelihood&amp;nbsp;function:&lt;br /&gt;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;&lt;b&gt;1.&amp;nbsp;Calculates&amp;nbsp;the&amp;nbsp;probability&amp;nbsp;of&amp;nbsp;each&amp;nbsp;patient's&amp;nbsp;observed&amp;nbsp;outcome&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;&lt;b&gt;2.&amp;nbsp;Multiplies&amp;nbsp;all&amp;nbsp;these&amp;nbsp;probabilities&lt;/b&gt;&lt;br /&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;3.&amp;nbsp;The&amp;nbsp;higher&amp;nbsp;this&amp;nbsp;value,&amp;nbsp;the&amp;nbsp;better&amp;nbsp;the&amp;nbsp;model&amp;nbsp;explains&amp;nbsp;the&amp;nbsp;data&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;In Cox model:&lt;/b&gt;&lt;/p&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;Censored data (e.g., Patient B) is included in the likelihood function&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;&lt;b&gt;Uses information up to the point of censoring&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;This enables unbiased parameter estimation&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;In this way, the likelihood function allows us to effectively use incomplete data (censored data) in the analysis.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;Partial Likelihood in detail:&lt;/b&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;Considers only the order of event occurrences instead of complete time information&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;In other words, focuses more on &quot;who experienced the event first&quot; rather than &quot;exact timing&quot;&lt;/li&gt;
&lt;li&gt;Assume&amp;nbsp;we&amp;nbsp;have&amp;nbsp;three&amp;nbsp;patients:&lt;br /&gt;Patient&amp;nbsp;A:&amp;nbsp;Dies&amp;nbsp;at&amp;nbsp;2&amp;nbsp;months&lt;br /&gt;Patient&amp;nbsp;B:&amp;nbsp;Dies&amp;nbsp;at&amp;nbsp;5&amp;nbsp;months&lt;br /&gt;Patient&amp;nbsp;C:&amp;nbsp;Survives&amp;nbsp;until&amp;nbsp;7&amp;nbsp;months&amp;nbsp;(censored)&lt;br /&gt;&lt;br /&gt;Partial&amp;nbsp;likelihood&amp;nbsp;analyzes:&lt;br /&gt;-&amp;nbsp;At&amp;nbsp;2&amp;nbsp;months:&amp;nbsp;&quot;Why&amp;nbsp;did&amp;nbsp;A&amp;nbsp;die&amp;nbsp;instead&amp;nbsp;of&amp;nbsp;the&amp;nbsp;others&quot;&lt;br /&gt;-&amp;nbsp;At&amp;nbsp;5&amp;nbsp;months:&amp;nbsp;&quot;Why&amp;nbsp;did&amp;nbsp;B&amp;nbsp;die&amp;nbsp;among&amp;nbsp;remaining&amp;nbsp;patients&quot;&lt;/li&gt;
&lt;li&gt;Reasons&amp;nbsp;for&amp;nbsp;This&amp;nbsp;Approach:&lt;br /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;No need to specify baseline hazard function (h₀(t))&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;Can&amp;nbsp;estimate&amp;nbsp;covariate&amp;nbsp;effects&amp;nbsp;(&amp;beta;)&amp;nbsp;using&amp;nbsp;just&amp;nbsp;event&amp;nbsp;order&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;Simpler and more efficient computation&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Maximization&amp;nbsp;Process:&lt;br /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;Find &amp;beta; values that maximize the partial likelihood&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;These&amp;nbsp;&amp;beta;&amp;nbsp;values&amp;nbsp;are&amp;nbsp;considered&amp;nbsp;to&amp;nbsp;best&amp;nbsp;explain&amp;nbsp;each&amp;nbsp;variable's&amp;nbsp;effect&amp;nbsp;on&amp;nbsp;survival&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1738383425644&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;from lifelines import CoxPHFitter

# Preprocess data
# Select relevant columns for Cox regression
cox_features = ['efs_time', 'efs', 'age_at_hct', 'karnofsky_score', 'comorbidity_score', 'race_group']
train = train[cox_features]

# Convert categorical variables into dummy variables
train = pd.get_dummies(train, columns=['race_group'], drop_first=True)

# Drop rows with missing values (ensure clean data for Cox model)
train = train.dropna()

# Instantiate and fit the Cox Proportional Hazards model
cph = CoxPHFitter()
cph.fit(train, duration_col='efs_time', event_col='efs')

# Show summary of the model
cph.print_summary()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-01 오후 1.17.20.png&quot; data-origin-width=&quot;862&quot; data-origin-height=&quot;703&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/b9CF5J/btsL2RwMcSp/vtoynqZUZp0Uj9JJwdHHa0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/b9CF5J/btsL2RwMcSp/vtoynqZUZp0Uj9JJwdHHa0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/b9CF5J/btsL2RwMcSp/vtoynqZUZp0Uj9JJwdHHa0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fb9CF5J%2FbtsL2RwMcSp%2FvtoynqZUZp0Uj9JJwdHHa0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;862&quot; height=&quot;703&quot; data-filename=&quot;스크린샷 2025-02-01 오후 1.17.20.png&quot; data-origin-width=&quot;862&quot; data-origin-height=&quot;703&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;Resultsr&lt;/b&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;The hazard ratio (HR) plot illustrates the &lt;b&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;effects of different covariates on the hazard of the event occurring&lt;/span&gt;&lt;/b&gt;, as estimated by the Cox Proportional Hazards model. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;The x-axis represents the hazard ratio, where a value of 1.0 (marked by the dashed vertical line) indicates no effect on the hazard. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;Hazard ratios greater than 1.0 indicate an increased risk of the event, while values less than 1.0 suggest a protective effect or reduced risk. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;The 95% confidence intervals (CIs) are shown as horizontal lines around each hazard ratio, indicating the uncertainty in the estimates. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px; background-color: #f6e199;&quot;&gt;If a confidence interval crosses 1.0, the effect of the covariate is not statistically significant.&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;The analysis reveals several key findings.&lt;/li&gt;
&lt;li&gt;Among race groups,&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;b&gt;&quot;Black or African-American&quot;&lt;/b&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;and&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;b&gt;&quot;White&quot;&lt;/b&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;have hazard ratios slightly above 1.0, indicating a marginally increased risk compared to the reference group (likely another race, such as&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;b&gt;&quot;Asian&quot;&lt;/b&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;or&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;b&gt;&quot;More than one race&quot;&lt;/b&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;). &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;Conversely, the&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;b&gt;&quot;More than one race&quot;&lt;/b&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;group has an HR less than 1.0, suggesting a protective effect, while&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;b&gt;&quot;Native Hawaiian or other Pacific Islander&quot;&lt;/b&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;shows little to no impact on the hazard. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #c0d1e7;&quot;&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;The&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;b&gt;comorbidity score&lt;/b&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;has an HR slightly above 1.0, indicating that patients with more comorbidities are at greater risk of the event. &lt;/span&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #c0d1e7;&quot;&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&lt;b&gt;Comorbidity&lt;/b&gt;: A condition where (a patient) suffers from two chronic diseases simultaneously&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;Similarly,&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;b&gt;&quot;age at HCT&quot;&lt;/b&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;has a hazard ratio above 1.0, suggesting that older patients face a slightly higher risk. &lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #99cefa;&quot;&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;In contrast, the&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;b&gt;Karnofsky performance score&lt;/b&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;has a hazard ratio less than 1.0, reflecting a protective effect where higher scores (indicating better performance status) are associated with reduced risk.&lt;/span&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&lt;b&gt;The&amp;nbsp;Karnofsky&amp;nbsp;performance&amp;nbsp;score&amp;nbsp;(KPS)&amp;nbsp;or&amp;nbsp;Karnofsky&amp;nbsp;performance&amp;nbsp;status&amp;nbsp;scale&amp;nbsp;is&amp;nbsp;a&amp;nbsp;measure&amp;nbsp;to&amp;nbsp;evaluate&amp;nbsp;a&amp;nbsp;patient's&amp;nbsp;overall&amp;nbsp;functional&amp;nbsp;status.&lt;/b&gt;&lt;br /&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;Scoring&amp;nbsp;System&amp;nbsp;(0-100):&lt;br /&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;100: Normal, no symptoms or signs of disease&lt;br /&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;90:&amp;nbsp;Able&amp;nbsp;to&amp;nbsp;carry&amp;nbsp;on&amp;nbsp;normal&amp;nbsp;activity,&amp;nbsp;minor&amp;nbsp;symptoms/signs&lt;br /&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;80:&amp;nbsp;Normal&amp;nbsp;activity&amp;nbsp;with&amp;nbsp;effort,&amp;nbsp;some&amp;nbsp;symptoms/signs&lt;br /&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;70:&amp;nbsp;Cares&amp;nbsp;for&amp;nbsp;self&amp;nbsp;but&amp;nbsp;unable&amp;nbsp;to&amp;nbsp;carry&amp;nbsp;on&amp;nbsp;normal&amp;nbsp;activity&amp;nbsp;or&amp;nbsp;work&lt;br /&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;60:&amp;nbsp;Requires&amp;nbsp;occasional&amp;nbsp;assistance&amp;nbsp;but&amp;nbsp;can&amp;nbsp;meet&amp;nbsp;most&amp;nbsp;personal&amp;nbsp;needs&lt;br /&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;50:&amp;nbsp;Requires&amp;nbsp;considerable&amp;nbsp;assistance&amp;nbsp;and&amp;nbsp;frequent&amp;nbsp;medical&amp;nbsp;care&lt;br /&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;40:&amp;nbsp;Disabled,&amp;nbsp;requires&amp;nbsp;special&amp;nbsp;care&amp;nbsp;and&amp;nbsp;assistance&lt;br /&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;30:&amp;nbsp;Severely&amp;nbsp;disabled,&amp;nbsp;hospital&amp;nbsp;admission&amp;nbsp;indicated,&amp;nbsp;death&amp;nbsp;not&amp;nbsp;imminent&lt;br /&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;20:&amp;nbsp;Very&amp;nbsp;sick,&amp;nbsp;hospital&amp;nbsp;admission&amp;nbsp;necessary,&amp;nbsp;active&amp;nbsp;supportive&amp;nbsp;treatment&amp;nbsp;needed&lt;br /&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;10:&amp;nbsp;Moribund,&amp;nbsp;death&amp;nbsp;imminent&lt;br /&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;0:&amp;nbsp;Dead&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;Statistical significance can be inferred from the confidence intervals.&lt;/span&gt; &lt;/b&gt;&lt;/li&gt;
&lt;li&gt;Covariates such as&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;b&gt;comorbidity score&lt;/b&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;and&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;b&gt;Karnofsky score&lt;/b&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;likely have statistically significant effects, as their confidence intervals do not cross 1.0. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;Some race groups and&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;b&gt;&quot;age at HCT&quot;&lt;/b&gt;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;, however, may not have significant effects, as their intervals overlap with 1.0.&lt;/span&gt; &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;These findings suggest that clinical factors, particularly&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;b&gt;comorbidity score&lt;/b&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;and&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;b&gt;performance status&lt;/b&gt;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;, are key predictors of survival outcomes.&lt;/span&gt; &lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px; background-color: #f6e199;&quot;&gt;Additionally, differences in hazard ratios among race groups point to potential disparities in outcomes that warrant further investigation.&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;Efforts to reduce comorbidities, improve performance status, and explore the underlying causes of racial disparities could help optimize patient care and outcomes.&lt;/li&gt;
&lt;li&gt;This analysis highlights the importance of targeted interventions and provides a foundation for further exploration of survival determinants.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1738383562865&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;# Visualize the coefficients (hazard ratios)
cph.plot(hazard_ratios=True)
plt.title(&quot;Cox Regression - Hazard Ratios&quot;)
plt.show()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-01 오후 1.19.33.png&quot; data-origin-width=&quot;858&quot; data-origin-height=&quot;439&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/qhjRY/btsL4Nfx4AS/vXUciLerdC2q5rMZ5IYKn1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/qhjRY/btsL4Nfx4AS/vXUciLerdC2q5rMZ5IYKn1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/qhjRY/btsL4Nfx4AS/vXUciLerdC2q5rMZ5IYKn1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FqhjRY%2FbtsL4Nfx4AS%2FvXUciLerdC2q5rMZ5IYKn1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;858&quot; height=&quot;439&quot; data-filename=&quot;스크린샷 2025-02-01 오후 1.19.33.png&quot; data-origin-width=&quot;858&quot; data-origin-height=&quot;439&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;Survival Curves for Comorbidity Score&lt;/b&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;The survival curves generated by the&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;b&gt;Cox Proportional Hazards (CPH)&lt;/b&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;model illustrate the relationship between comorbidity score and survival probabilities over time. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;The x-axis represents time (e.g., in months), while the y-axis shows the probability of survival. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;Each line corresponds to a specific comorbidity score, ranging from 0 (no comorbidities) to 4 (high comorbidity burden), with &lt;b&gt;a dashed line representing the baseline survival curve&lt;/b&gt;. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px; background-color: #ffc9af;&quot;&gt;The results indicate that higher comorbidity scores are associated with lower survival probabilities, as reflected by the descending order of the survival curves. &lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;Patients with a comorbidity score of 0 exhibit the highest survival probabilities, while those with a score of 4 experience the steepest decline and the lowest overall survival.&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;All survival curves show a steep decline during the early months, reflecting a high-risk period immediately after the transplant.&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;This decline is more pronounced for patients with higher comorbidity scores, indicating that comorbidities significantly exacerbate early post-transplant risks.&lt;/span&gt; &lt;/b&gt;&lt;/li&gt;
&lt;li&gt;Beyond the initial phase, the survival curves stabilize, but patients with higher comorbidity scores continue to have significantly lower survival probabilities compared to those with lower scores.&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #c0d1e7;&quot;&gt;&lt;b&gt;The persistent gap between the survival curves suggests that comorbidities have a lasting impact on survival outcomes.&lt;/b&gt; &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;&lt;b&gt;The baseline survival curve aligns closely with a mid-range comorbidity score, representing an &quot;average&quot; patient in the population.&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;These findings highlight the clinical importance of managing comorbidities before and after transplantation.&lt;/li&gt;
&lt;li&gt;Higher comorbidity scores predict worse survival outcomes, emphasizing the need for targeted interventions and closer monitoring for high-risk patients, particularly during the early post-transplant phase.&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;Even long-term outcomes are worse for patients with higher scores, indicating the necessity of sustained care.&lt;/b&gt; &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;This analysis also underscores the potential for risk stratification, where patients can be categorized by comorbidity scores to prioritize resources and tailor interventions.&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1738385239394&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;cph.plot_partial_effects_on_outcome(covariates='comorbidity_score', values=[0, 1, 2, 3, 4], cmap='coolwarm');&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-02-01 오후 1.47.28.png&quot; data-origin-width=&quot;575&quot; data-origin-height=&quot;439&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/wdLBE/btsL3A2wwyy/K0Mi2HlHJMANdlKQEgKgMK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/wdLBE/btsL3A2wwyy/K0Mi2HlHJMANdlKQEgKgMK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/wdLBE/btsL3A2wwyy/K0Mi2HlHJMANdlKQEgKgMK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FwdLBE%2FbtsL3A2wwyy%2FK0Mi2HlHJMANdlKQEgKgMK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;575&quot; height=&quot;439&quot; data-filename=&quot;스크린샷 2025-02-01 오후 1.47.28.png&quot; data-origin-width=&quot;575&quot; data-origin-height=&quot;439&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style5&quot; /&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;탁월성은 평범함에서 나온다&lt;br /&gt;&amp;lt;GRIT&amp;gt;&lt;/span&gt;&lt;/blockquote&gt;</description>
      <category>대회</category>
      <category>cibmtr - equity in post-hct survival predictions</category>
      <category>캐글</category>
      <author>dongsunseng</author>
      <guid isPermaLink="true">https://dongsunseng.tistory.com/105</guid>
      <comments>https://dongsunseng.tistory.com/entry/CIBMTR-Equity-in-post-HCT-Survival-Predictions-2-Understanding-Survival-Analysis#entry105comment</comments>
      <pubDate>Sat, 1 Feb 2025 02:41:15 +0900</pubDate>
    </item>
    <item>
      <title>CIBMTR - Equity in post-HCT Survival Predictions #1 About the Competition</title>
      <link>https://dongsunseng.tistory.com/entry/CIBMTR-Equity-in-post-HCT-Survival-Predictions-1-About-the-Competition</link>
      <description>&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;Introduction&lt;/b&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Basically Survey Analysis Competition&lt;/li&gt;
&lt;li&gt;Predicting transplant survival rates for allogeneic HCT patients&lt;/li&gt;
&lt;li&gt;allogeneic: transplanting cells, tissues, or organs from a donor of the same species who is not genetically identical to the recipient&lt;/li&gt;
&lt;li&gt;HCT: Hematopoietic Stem Cell Transplantation is a treatment method used to fundamentally treat diseases such as &lt;b&gt;leukemia(백혈병)&lt;/b&gt; where abnormalities occur during cell differentiation(세포 분화), or conditions like &lt;b&gt;aplastic&lt;/b&gt; &lt;b&gt;anemia(재생불량성빈혈)&lt;/b&gt; where problems arise due to decreased numbers of hematopoietic stem cells.&amp;nbsp;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;Competition Description&lt;/b&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;&quot;In this competition, you&amp;rsquo;ll &lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;develop models to improve the prediction of transplant survival rates for patients undergoing allogeneic Hematopoietic Cell Transplantation (HCT)&lt;/b&gt;&lt;/span&gt; &amp;mdash; &lt;b&gt;an important step in ensuring that every patient has a fair chance at a successful outcome, regardless of their background&lt;/b&gt;.&quot;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&quot;Improving survival predictions for allogeneic HCT patients is a vital healthcare challenge. &lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;Current predictive models often fall short in addressing disparities related to socioeconomic status, race, and geography&lt;/b&gt;&lt;/span&gt;. Addressing these gaps is crucial for enhancing patient care, optimizing resource utilization, and rebuilding trust in the healthcare system.&quot;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;That's why they put &quot;Equity&quot; on the title of the competition: maybe decreasing those disparities during prediction is the key point of this competition&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&quot;This competition aims to encourage participants to advance predictive modeling by ensuring that survival predictions are &lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;both precise and fair for patients across diverse groups&lt;/b&gt;&lt;/span&gt;. By using &lt;span style=&quot;background-color: #9feec3;&quot;&gt;&lt;b&gt;synthetic data&lt;/b&gt;&lt;/span&gt;&amp;mdash;which &lt;span style=&quot;background-color: #9feec3;&quot;&gt;&lt;b&gt;mirrors real-world situations while protecting patient privacy&lt;/b&gt;&lt;/span&gt;&amp;mdash;participants can build and improve models that more effectively consider diverse backgrounds and conditions.&quot;&lt;/li&gt;
&lt;li&gt;&quot;You&amp;rsquo;re challenged to develop advanced predictive models for allogeneic HCT that enhance both accuracy and fairness in survival predictions. &lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;The goal is to address disparities by bridging diverse data sources, refining algorithms, and reducing biases to ensure equitable outcomes for patients across diverse race groups.&lt;/b&gt; &lt;/span&gt;Your work will help create a more just and effective healthcare environment, ensuring every patient receives the care they deserve.&quot;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;Evaluation Metric&lt;/b&gt;&lt;/h4&gt;
&lt;p style=&quot;color: #202124; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;Evaluation Criteria&lt;/b&gt;&lt;/p&gt;
&lt;p style=&quot;color: #3c4043; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;The evaluation of prediction accuracy in the competition will involve a specialized metric known as the &lt;span style=&quot;background-color: #99cefa;&quot;&gt;&lt;b&gt;Stratified Concordance Index (C-index)&lt;/b&gt;&lt;/span&gt;, adapted to &lt;b&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;consider different racial groups independently&lt;/span&gt;&lt;/b&gt;. &lt;b&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;This method allows us to gauge the predictive performance of models in a way that emphasizes equitability across diverse patient populations, particularly focusing on racial disparities in transplant outcomes.&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p style=&quot;color: #3c4043; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;color: #202124; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;Concordance index&lt;/b&gt;&lt;/p&gt;
&lt;p style=&quot;color: #3c4043; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;It &lt;b&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;represents the global assessment of the model discrimination power&lt;/span&gt;&lt;/b&gt;: this is the &lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;model&amp;rsquo;s ability to correctly provide a reliable ranking of the survival times based on the individual risk scores&lt;/b&gt;&lt;/span&gt;. It can be computed with the following formula:&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-01-30 오후 4.32.32.png&quot; data-origin-width=&quot;1340&quot; data-origin-height=&quot;688&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bXZrOT/btsL3PjR5XR/knRbuKfnJimBL41IXq8dm0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bXZrOT/btsL3PjR5XR/knRbuKfnJimBL41IXq8dm0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bXZrOT/btsL3PjR5XR/knRbuKfnJimBL41IXq8dm0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbXZrOT%2FbtsL3PjR5XR%2FknRbuKfnJimBL41IXq8dm0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;625&quot; height=&quot;321&quot; data-filename=&quot;스크린샷 2025-01-30 오후 4.32.32.png&quot; data-origin-width=&quot;1340&quot; data-origin-height=&quot;688&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;The concordance index is a value between 0 and 1 where:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc; color: #3c4043; text-align: start;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;0.5 is the expected result from random predictions,&lt;/li&gt;
&lt;li&gt;1.0 is a perfect concordance and,&lt;/li&gt;
&lt;li&gt;0.0 is perfect anti-concordance (multiply predictions with -1 to get 1.0)
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;If the predicted values are all perfectly opposite to the actual values, resulting in a concordance index of 0.0&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;If we multiply these predicted values by -1 (i.e., reverse the signs)&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;The predictions will perfectly match the actual values, resulting in a concordance index of 1.0&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p style=&quot;color: #202124; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;Stratified Concordance Index&lt;/b&gt;&lt;/p&gt;
&lt;p style=&quot;color: #3c4043; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;For this competition, we adjust the standard C-index to account for racial stratification, thus ensuring that each racial group's outcomes are weighed equally in the model evaluation. &lt;b&gt;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;The stratified c-index is calculated as the mean minus the standard deviation of the c-index scores calculated within the recipient race categories, &lt;span style=&quot;background-color: #ffc9af;&quot;&gt;i.e., the score will be better if the mean c-index over the different race categories is large and the standard deviation of the c-indices over the race categories is small&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;. This value will range from 0 to 1, 1 is the theoretical perfect score, but this value will practically be lower due to censored outcomes.&lt;/p&gt;
&lt;p style=&quot;color: #3c4043; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;color: #3c4043; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;The submitted risk scores will be evaluated using the&lt;span&gt;&amp;nbsp;&lt;/span&gt;score&lt;span&gt;&amp;nbsp;&lt;/span&gt;function. &lt;span style=&quot;background-color: #c1bef9;&quot;&gt;&lt;b&gt;This evaluation process involves comparing the submitted risk scores against actual observed values (i.e., survival times and event occurrences) from a test dataset.&lt;/b&gt;&lt;/span&gt; &lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;The function specifically calculates the stratified concordance index across different racial groups, ensuring that the predictions are not only accurate overall but also equitable across diverse patient demographics.&lt;/b&gt;&lt;/span&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style3&quot;&gt;&lt;i&gt;&lt;b&gt;Final score = Mean(c-index for each race) - Standard deviation(c-index for each race)&lt;/b&gt;&lt;/i&gt;&lt;/blockquote&gt;
&lt;h4 style=&quot;color: #3c4043; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;Evaluation metric implementation:&lt;/b&gt;&lt;/h4&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;a href=&quot;https://www.kaggle.com/code/metric/eefs-concordance-index&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://www.kaggle.com/code/metric/eefs-concordance-index&lt;/a&gt;&lt;/b&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1738223660022&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;&quot;&quot;&quot;
To evaluate the equitable prediction of transplant survival outcomes,
we use the concordance index (C-index) between a series of event
times and a predicted score across each race group.
 
It represents the global assessment of the model discrimination power:
this is the model&amp;rsquo;s ability to correctly provide a reliable ranking
of the survival times based on the individual risk scores.
 
The concordance index is a value between 0 and 1 where:
 
0.5 is the expected result from random predictions,
1.0 is perfect concordance (with no censoring, otherwise &amp;lt;1.0),
0.0 is perfect anti-concordance (with no censoring, otherwise &amp;gt;0.0)

&quot;&quot;&quot;

import pandas as pd
import pandas.api.types
import numpy as np
from lifelines.utils import concordance_index

class ParticipantVisibleError(Exception):
    pass


def score(solution: pd.DataFrame, submission: pd.DataFrame, row_id_column_name: str) -&amp;gt; float:
    &quot;&quot;&quot;
    &amp;gt;&amp;gt;&amp;gt; import pandas as pd
    &amp;gt;&amp;gt;&amp;gt; row_id_column_name = &quot;id&quot;
    &amp;gt;&amp;gt;&amp;gt; y_pred = {'prediction': {0: 1.0, 1: 0.0, 2: 1.0}}
    &amp;gt;&amp;gt;&amp;gt; y_pred = pd.DataFrame(y_pred)
    &amp;gt;&amp;gt;&amp;gt; y_pred.insert(0, row_id_column_name, range(len(y_pred)))
    &amp;gt;&amp;gt;&amp;gt; y_true = { 'efs': {0: 1.0, 1: 0.0, 2: 0.0}, 'efs_time': {0: 25.1234,1: 250.1234,2: 2500.1234}, 'race_group': {0: 'race_group_1', 1: 'race_group_1', 2: 'race_group_1'}}
    &amp;gt;&amp;gt;&amp;gt; y_true = pd.DataFrame(y_true)
    &amp;gt;&amp;gt;&amp;gt; y_true.insert(0, row_id_column_name, range(len(y_true)))
    &amp;gt;&amp;gt;&amp;gt; score(y_true.copy(), y_pred.copy(), row_id_column_name)
    0.75
    &quot;&quot;&quot;
    
    del solution[row_id_column_name]
    del submission[row_id_column_name]
    
    # Define key columns
    event_label = 'efs' # event occurrence
    interval_label = 'efs_time' # survival time
    prediction_label = 'prediction' # predicted value
    
    # Validate submitted predictions
    for col in submission.columns:
        if not pandas.api.types.is_numeric_dtype(submission[col]):
            raise ParticipantVisibleError(f'Submission column {col} must be a number')
    
    # Merging solution and submission dfs on ID
    merged_df = pd.concat([solution, submission], axis=1)
    merged_df.reset_index(inplace=True)
    merged_df_race_dict = dict(merged_df.groupby(['race_group']).groups)
    
    # Calculate c-index for each racial group
    metric_list = []
    for race in merged_df_race_dict.keys():
        # Retrieving values from y_test based on index
        indices = sorted(merged_df_race_dict[race])
        merged_df_race = merged_df.iloc[indices]
        # Calculate the concordance index
        c_index_race = concordance_index(
                        merged_df_race[interval_label],
                        -merged_df_race[prediction_label],
                        merged_df_race[event_label])
        metric_list.append(c_index_race)
    return float(np.mean(metric_list)-np.sqrt(np.var(metric_list)))&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;def&amp;nbsp;score(solution:&amp;nbsp;pd.DataFrame,&amp;nbsp;submission:&amp;nbsp;pd.DataFrame,&amp;nbsp;row_id_column_name:&amp;nbsp;str)&lt;/b&gt;&lt;/i&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;solution&lt;/b&gt;&lt;/i&gt;: &lt;span style=&quot;background-color: #f6e199;&quot;&gt;actual answer data including efs, efs_time, race_group columns&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;i&gt;submission&lt;/i&gt;&lt;/b&gt;: participant's submitted predictions: prediction column&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;i&gt;row_id_column_name&lt;/i&gt;&lt;/b&gt;: ID column name&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;&lt;span&gt;&lt;span style=&quot;color: #c678dd;&quot;&gt;del&lt;/span&gt;&lt;span&gt; solution&lt;/span&gt;&lt;span style=&quot;color: #abb2bf;&quot;&gt;[&lt;/span&gt;&lt;span&gt;row_id_column_name&lt;/span&gt;&lt;span style=&quot;color: #abb2bf;&quot;&gt;]&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span style=&quot;color: #c678dd;&quot;&gt;del&lt;/span&gt;&lt;span&gt; submission&lt;/span&gt;&lt;span style=&quot;color: #abb2bf;&quot;&gt;[&lt;/span&gt;&lt;span&gt;row_id_column_name&lt;/span&gt;&lt;span style=&quot;color: #abb2bf;&quot;&gt;]&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/i&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Removing id column&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;Validate submitted predictions:&lt;/b&gt;&lt;/i&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;Check if all predictions are numeric&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;Merging&amp;nbsp;solution&amp;nbsp;and&amp;nbsp;submission&amp;nbsp;dfs&amp;nbsp;on&amp;nbsp;ID&lt;/b&gt;&lt;/i&gt;&lt;br /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;First merge solution and submission&lt;/li&gt;
&lt;li&gt;Second create new index&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;Last, classify data by racial groups&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;Calculate c-index for each racial group&lt;/b&gt;&lt;/i&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Calculate concordance_index for each racial group&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;Note: prediction is multiplied by -1 (high risk score should correlate with low survival time)&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;return&amp;nbsp;float(np.mean(metric_list)-np.sqrt(np.var(metric_list)))&lt;/b&gt;&lt;/i&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #c0d1e7;&quot;&gt;Returns the mean of racial c-indices minus their standard deviation&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;This considers both overall performance (mean) and performance differences between races (standard deviation)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;More about c-index:&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;c-index = concordance-index&lt;/li&gt;
&lt;li&gt;basic concept:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;C-index measures how well a model predicts the &lt;span style=&quot;background-color: #99cefa;&quot;&gt;&lt;b&gt;&quot;relative risk ranking&quot;&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;When comparing two patients, it evaluates whether the model predicted higher risk for the patient who actually died earlier (or experienced the event)&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;concordance_index(actual_survival_time,&amp;nbsp;predicted_risk,&amp;nbsp;event_occurrence)&lt;/b&gt;&lt;/i&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Compares all possible patient pairs&lt;/li&gt;
&lt;li&gt;Concordant pair: pairs where predicted ranking matches actual ranking&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;C-index = (number of concordant pairs) / (total number of comparable pairs)&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Example:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span&gt;&lt;span&gt;Patient A: Survival time 10 days, deceased &lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span&gt;Patient B: Survival time 20 days, deceased &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span&gt;Patient C: Survival time 15 days, censored &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span&gt;Predicted risk scores: &lt;/span&gt;&lt;span&gt;A: 0.8 (high risk) &lt;/span&gt;&lt;span&gt;B: 0.3 (low risk) &lt;/span&gt;&lt;span&gt;C: 0.5 (medium risk)&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;A vs B: concordant (A died earlier + model predicted higher risk for A)&lt;/li&gt;
&lt;li&gt;A vs C: not comparable (C is censored)&lt;/li&gt;
&lt;li&gt;B vs C: not comparable (C is censored)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;Why multiply predictions by -1:&lt;/b&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;Originally, high risk score = low survival time&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;Multiplying by -1 aligns directions (high risk = low survival time prediction)&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;Submission&lt;/b&gt;&lt;/h4&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;Participants must submit their predictions for the test dataset as &lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;real-valued risk scores&lt;/b&gt;&lt;/span&gt;. These scores represent the model's assessment of each patient's risk following transplantation. &lt;span style=&quot;background-color: #99cefa;&quot;&gt;&lt;b&gt;A higher risk score typically indicates a higher likelihood of the target event occurrence.&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;color: #3c4043; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;The submission file must include a header and follow this format:&lt;/p&gt;
&lt;pre id=&quot;code_1738225167504&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;ID,prediction
28800,0.5
28801,1.2
28802,0.8
etc.&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;where:&lt;/p&gt;
&lt;p style=&quot;color: #3c4043; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;i&gt;&lt;b&gt;ID&lt;/b&gt;&lt;/i&gt;&lt;span&gt;&amp;nbsp;&lt;/span&gt;refers to the identifier for each patient in the test dataset.&lt;br /&gt;&lt;i&gt;&lt;b&gt;prediction&lt;/b&gt;&lt;/i&gt;&lt;span&gt;&amp;nbsp;&lt;/span&gt;is the corresponding risk score generated by your model.&lt;/p&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style5&quot; /&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;탁월성은 평범함에서 나온다&lt;br /&gt;&lt;/span&gt;&amp;lt;GRIT&amp;gt;&lt;/blockquote&gt;</description>
      <category>대회</category>
      <category>cibmtr - equity in post-hct survival predictions</category>
      <category>캐글</category>
      <author>dongsunseng</author>
      <guid isPermaLink="true">https://dongsunseng.tistory.com/101</guid>
      <comments>https://dongsunseng.tistory.com/entry/CIBMTR-Equity-in-post-HCT-Survival-Predictions-1-About-the-Competition#entry101comment</comments>
      <pubDate>Thu, 30 Jan 2025 20:15:04 +0900</pubDate>
    </item>
    <item>
      <title>CZII - CryoET Object Identification #4 Making synthetic data for Baseline YOLO11 Solution</title>
      <link>https://dongsunseng.tistory.com/entry/CZII-CryoET-Object-Identification-4-Making-synthetic-data-for-Baseline-YOLO11-Solution</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;This is an annotation of code that produces datasets for YOLO solution with additional data(synthetic data)&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt; Trained weights: &lt;a href=&quot;https://www.kaggle.com/datasets/sersasj/czii-yolo-l-trained-with-synthetic-data/data&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://www.kaggle.com/datasets/sersasj/czii-yolo-l-trained-with-synthetic-data/data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Code: &lt;a href=&quot;https://www.kaggle.com/code/sersasj/czii-making-datasets-for-yolo-synthetic-data#CZII:-Creating-Datasets-for-YOLO-with-Additional-Data&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://www.kaggle.com/code/sersasj/czii-making-datasets-for-yolo-synthetic-data#CZII:-Creating-Datasets-for-YOLO-with-Additional-Data&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h1 style=&quot;background-color: #ffffff; color: #202124; text-align: start;&quot;&gt;&lt;b&gt;CZII making datasets for YOLO + synthetic data&lt;/b&gt;&lt;/h1&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Modified version of &lt;a href=&quot;https://www.kaggle.com/code/itsuki9180/czii-making-datasets-for-yolo&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://www.kaggle.com/code/itsuki9180/czii-making-datasets-for-yolo&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;Basically generates datasets with additional synthetic data, denoised using Gaussian denoising&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;Quite controversy whether models trained with synthetic data perform better or not (&lt;a href=&quot;https://www.kaggle.com/competitions/czii-cryo-et-object-identification/discussion/555247&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://www.kaggle.com/competitions/czii-cryo-et-object-identification/discussion/555247&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;However, discovered true for YOLO&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&quot;If someone manages to incorporate the original denoise model or IsoNet, I&amp;rsquo;m sure that better results could be achieved.&quot;&lt;/span&gt;&lt;/i&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&lt;span style=&quot;background-color: #99cefa;&quot;&gt;&lt;b&gt;Model was trained with TS_5_4&lt;/b&gt;, &lt;/span&gt;&lt;b&gt;&lt;span style=&quot;background-color: #99cefa;&quot;&gt;TS_69_2, TS_6_4, TS_6_6 as validation&lt;/span&gt;.&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&lt;b&gt;1) Install + Import&lt;/b&gt;&lt;/span&gt;&lt;/h4&gt;
&lt;pre id=&quot;code_1737352973005&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;!pip install zarr opencv-python

import json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import zarr
from tqdm import tqdm
import glob, os
import cv2
import shutil&lt;/code&gt;&lt;/pre&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;2) Load + Organize data&lt;/b&gt;&lt;/h4&gt;
&lt;pre class=&quot;python&quot; style=&quot;background-color: #f1f3f4; color: #3c4043; text-align: start;&quot; data-ke-language=&quot;python&quot;&gt;&lt;code&gt;runs = sorted(glob.glob('/kaggle/input/czii-cryo-et-object-identification/train/overlay/ExperimentRuns/*'))
print(runs)

runs = [os.path.basename(x) for x in runs]

# Processing additional dataset
additional_runs = sorted(glob.glob('/kaggle/input/czii10441/10441/T*'))
print(additional_runs)
additional_runs = [os.path.basename(x) for x in additional_runs]
runs = runs + additional_runs

# Creating mapping dictionaries
i2r_dict = {i: r for i, r in zip(range(len(runs)), runs)}
r2t_dict = {r: i for i, r in zip(range(len(runs)), runs)}
print(&quot;Runs:&quot;, i2r_dict)&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;runs&amp;nbsp;=&amp;nbsp;sorted(glob.glob('/kaggle/input/czii-cryo-et-object-identification/train/overlay/ExperimentRuns/*'))&lt;/b&gt;&lt;/i&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Collects paths from the base training dataset&lt;/li&gt;
&lt;li&gt;Uses &lt;i&gt;&lt;b&gt;glob.glob()&lt;/b&gt;&lt;/i&gt; to get all experiment paths&lt;/li&gt;
&lt;li&gt;Uses &lt;i&gt;&lt;b&gt;sorted()&lt;/b&gt;&lt;/i&gt; to arrange the paths&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;runs&amp;nbsp;=&amp;nbsp;[os.path.basename(x)&amp;nbsp;for&amp;nbsp;x&amp;nbsp;in&amp;nbsp;runs]&lt;/b&gt;&lt;/i&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Extracting experiment names from paths&lt;/li&gt;
&lt;li&gt;Uses&lt;i&gt;&lt;b&gt; os.path.basename()&lt;/b&gt;&lt;/i&gt; to extract just the experiment names from full paths&lt;/li&gt;
&lt;li&gt;Processes all paths using list comprehension&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;Processing additional data part&lt;/b&gt;&lt;/i&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Processes paths from additional dataset (czii10441) in the same way&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;Merges additional dataset with existing experiment list&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;Creating mapping dictionaries part&lt;/b&gt;&lt;/i&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;i2r_dict&lt;/b&gt;&lt;/i&gt;: Maps indices to experiment names (index to run)&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;r2t_dict&lt;/b&gt;&lt;/i&gt;: Maps experiment names to indices (run to index)&lt;/li&gt;
&lt;li&gt;Used as lookup tables for later data processing or reference&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;3) Helper function - Normalize function&lt;/b&gt;&lt;/h4&gt;
&lt;pre id=&quot;code_1737353900065&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;# Normalize the image to a value between 0 and 255

def convert_to_8bit(x):
    # 1. Calculate percentiles for outlier removal
    lower, upper = np.percentile(x, (0.5, 99.5))
    
    # 2. Remove extreme values (clipping)
    x = np.clip(x, lower, upper)
    
    # 3. Convert to 0-255 range using Min-max normalization
    x = (x - x.min()) / (x.max() - x.min() + 1e-12) * 255
    
    # 4. Convert to 8-bit integer
    return x.round().astype(&quot;uint8&quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;&lt;b&gt;Normalizes image data to 8-bit format&lt;/b&gt; &lt;/span&gt;(0-255 range)&lt;/li&gt;
&lt;li&gt;Crucial preprocessing step in CryoET image processing&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;Clipping: &lt;/b&gt;&lt;/i&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;Reduces the impact of noise and extreme values&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;Min-max normalization:&lt;/b&gt;&lt;/i&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;1e-12&lt;/b&gt;&lt;/i&gt;: Small value added to &lt;b&gt;prevent division by zero&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;* 255&lt;/b&gt;&lt;/i&gt;: Scales 0-1 range to 0-255 range&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;4) Information about labels&lt;/b&gt;&lt;/h4&gt;
&lt;pre id=&quot;code_1737354285185&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;p2i_dict = {
    'apo-ferritin': 0,
    'beta-amylase': 1,
    'beta-galactosidase': 2,
    'ribosome': 3,
    'thyroglobulin': 4,
    'virus-like-particle': 5
}

i2p = {v: k for k, v in p2i_dict.items()}

particle_radius = {
    'apo-ferritin': 60,
    'beta-amylase': 65,
    'beta-galactosidase': 90,
    'ribosome': 150,
    'thyroglobulin': 130,
    'virus-like-particle': 135,
}

particle_names = ['apo-ferritin', 'beta-amylase', 'beta-galactosidase', 'ribosome', 'thyroglobulin', 'virus-like-particle']&lt;/code&gt;&lt;/pre&gt;
&lt;pre id=&quot;code_1737354927900&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;from scipy.ndimage import gaussian_filter, median_filter

def denoise_tomogram(tomogram, method='gaussian', **kwargs):
    &quot;&quot;&quot;
    Apply denoising to a tomogram.

    Parameters:
        tomogram (np.ndarray): The input tomogram to denoise.
        method (str): The denoising method ('gaussian' or 'median').
        kwargs: Parameters for the respective method.
    
    Returns:
        np.ndarray: The denoised tomogram.
    &quot;&quot;&quot;
    if method == 'gaussian':
        return gaussian_filter(tomogram, sigma=kwargs.get('sigma', 1))
    elif method == 'median':
        return median_filter(tomogram, size=kwargs.get('size', 3))
    else:
        raise ValueError(f&quot;Unsupported denoising method: {method}&quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;Removes noise using Gaussian or median filter&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;Filter parameters can be flexibly adjusted via kwargs&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1737355256809&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;name_map = {
    'apo-ferritin': 'ferritin_complex',
    'beta-amylase': 'beta_amylase',
    'beta-galactosidase': 'beta_galactosidase',
    'ribosome': 'cytosolic_ribosome',
    'thyroglobulin': 'thyroglobulin',
    'virus-like-particle': 'pp7_vlp',
}&lt;/code&gt;&lt;/pre&gt;
&lt;pre id=&quot;code_1737355269227&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;def ndjson_to_json(ndjson_path):
    # Check if file exists
    if not os.path.isfile(ndjson_path):
        raise FileNotFoundError(f&quot;The file {ndjson_path} does not exist.&quot;)

    data = []
    # Parse each line as JSON object
    try:
        with open(ndjson_path, 'r', encoding='utf-8') as ndjson_file:
            for line_number, line in enumerate(ndjson_file, start=1):
                stripped_line = line.strip()
                if stripped_line:  
                    try:
                        json_object = json.loads(stripped_line)
                        data.append(json_object)
                    except json.JSONDecodeError as e:
                        raise json.JSONDecodeError(
                            f&quot;Error decoding JSON on line {line_number}: {e.msg}&quot;,
                            e.doc,
                            e.pos
                        )
    except Exception as e:
        raise e

    return data&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Parses NDJSON (Newline Delimited JSON) files&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;Converts each line to individual JSON objects&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;Includes error handling and line number tracking&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1737355338197&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;import os
import glob
import json
import pandas as pd
import numpy as np
import zarr
import cv2
from tqdm import tqdm

# Takes experiment name, train/validation flag, synthetic data flag as input
def make_annotate_yolo(run_name, is_train_path=True, is_syntetic=False):
    dataset_split = 'train' if is_train_path else 'val'
    
    # Loading and preprocessing volume data
    # Setting the path to the denoised volume(data)
    if is_syntetic:
        vol_path = glob.glob(f'/kaggle/input/czii10441/10441/{run_name}/**/Tomograms/**/*.zarr', recursive=True)
        if not vol_path:
            print(f&quot;No volume found for run {run_name} in synthetic data.&quot;)
            return
        vol_path = vol_path[0]
    else:
        vol_path = f'/kaggle/input/czii-cryo-et-object-identification/train/static/ExperimentRuns/{run_name}/VoxelSpacing10.000/denoised.zarr'
    
    print(f&quot;Volume path: {vol_path}&quot;)
    if not os.path.exists(vol_path):
        print(f&quot;Volume file not found: {vol_path}&quot;)
        return

    # Read the volume
    vol = zarr.open(vol_path, mode='r') # loads volume data in zarr format
    vol = vol[0]
    if is_syntetic:
        vol = denoise_tomogram(np.array(vol)[:184], method='gaussian', sigma=1)  # Apply denoise for synthetic data
    vol2 = convert_to_8bit(vol) # into 8-bit format
    
    n_imgs = vol2.shape[0]
    print(n_imgs)
    
    # Image generation - CONVERT 3D Volume data into 2D Images that YOLO can process
    for j in range(n_imgs):
        # 1. Extract current slice
        newvol = vol2[j]
        
        # 2. Convert grayscale to RGB
        newvolf = np.stack([newvol]*3, axis=-1)
        
        # 3. Resize to YOLO input size
        newvolf = cv2.resize(newvolf, (640, 640))
        
        # 4. Save image
        image_filename = f'images/{dataset_split}/{run_name}_{j*10}.png'
        cv2.imwrite(image_filename, newvolf)
        
        # 5. Create empty label file
        label_filename = f'labels/{dataset_split}/{run_name}_{j*10}.txt'
        with open(label_filename, 'w') as f:
            pass
    
    # Process each particle type (label processing)
    for p, particle in enumerate(tqdm(particle_names, desc=f&quot;Processing particles for run {run_name}&quot;)):
        if particle == &quot;beta-amylase&quot;:
            continue
        
        # Load JSON data for each particle
        if is_syntetic:
            particle_name_in_file = name_map.get(particle)
            if not particle_name_in_file:
                print(f&quot;Particle name mapping not found for: {particle}&quot;)
                continue
            
            ndjson_each_particle = glob.glob(f'/kaggle/input/czii10441/10441/{run_name}/**/Annotations/**/*.ndjson', recursive=True)
            if not ndjson_each_particle:
                print(f&quot;No NDJSON files found for particle: {particle} in run: {run_name}&quot;)
                continue
            
            filtered_ndjson_files = [f for f in ndjson_each_particle if particle_name_in_file in f]
            if not filtered_ndjson_files:
                print(f&quot;No NDJSON files match the particle: {particle} for run: {run_name}&quot;)
                continue
            
            json_each_particle = ndjson_to_json(filtered_ndjson_files[0])
            df = pd.DataFrame(json_each_particle)
            
        # Data loading for real data
        else:
            json_each_particle = f&quot;/kaggle/input/czii-cryo-et-object-identification/train/overlay/ExperimentRuns/{run_name}/Picks/{particle}.json&quot;
            
            if not os.path.exists(json_each_particle):
                print(f&quot;JSON file not found: {json_each_particle}&quot;)
                continue
            print(f&quot;Loading JSON file: {json_each_particle}&quot;)
            try:
                df = pd.read_json(json_each_particle)
            except ValueError as e:
                print(f&quot;Error reading JSON file {json_each_particle}: {e}&quot;)
                continue
                
        # Coordinate Extraction Processing 
        if is_syntetic:
            column_name = 'location'
        else:
            column_name = 'points'

        if  column_name not in df.columns:
            print(f&quot;'{column_name}' column not found in DataFrame for particle: {particle}&quot;)
            continue
        
        if is_syntetic:
            # Flattens nested JSON data into dataframe format
            normalized_data = pd.json_normalize(df[column_name])
            # *10.012: applies pixel scaling factor(converts to actual physical size)
            df[['x', 'y', 'z']] = normalized_data * 10.012
        
        # For real data: Extracts coordinates for each axis (x, y, z)
        else:      
            for axis in [&quot;x&quot;, &quot;y&quot;, &quot;z&quot;]:
                df[axis] = df[column_name].apply(lambda x: x[&quot;location&quot;][axis] if &quot;location&quot; in x and axis in x[&quot;location&quot;] else np.nan)
                print(&quot;aquiii&quot;,df.head())

        # Missing Value Handling: Removes rows with NaN in any of x, y, z coordinates
        df.dropna(subset=[&quot;x&quot;, &quot;y&quot;, &quot;z&quot;], inplace=True)

        # Get defined radius for each particle type
        radius = particle_radius.get(particle)
        if radius is None:
            print(f&quot;Radius not defined for particle: {particle}&quot;)
            continue
        divide_by = 10.012
        
        # Convert to YOLO format
        for i, row in df.iterrows():    
            # Calculate Z-axis range (range of slices where particle is visible)
            start_z = np.round(row['z'] - radius).astype(np.int32)
            start_z = max(0, start_z//10) 
            end_z = np.round(row['z'] + radius).astype(np.int32)
            end_z = min(n_imgs, end_z//10)
            
            # Generate YOLO format labels for each slice
            for j in range(start_z, end_z):
                label_filename = f'labels/{dataset_split}/{run_name}_{j*10}.txt'
                
                # Calculate normalized coordinates
                x_center = row[&quot;x&quot;] / divide_by / vol2.shape[1]
                y_center = row[&quot;y&quot;] / divide_by / vol2.shape[2]
                box_width = (radius * 2) / divide_by / vol2.shape[1]
                box_height = (radius * 2) / divide_by / vol2.shape[2]
                
                # Save in YOLO format
                # format: class_id center_x center_y width height
                with open(label_filename, 'a') as f:
                    f.write(f'{p2i_dict.get(particle, 0)} {x_center:.6f} {y_center:.6f} {box_width:.6f} {box_height:.6f}\n')&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;Generating datasets for YOLO training&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;Overall process:&lt;/b&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;Convert 3D coordinates to &lt;b&gt;2D YOLO format&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;Generate labels for all slices within particle's Z-axis range&lt;/li&gt;
&lt;li&gt;Normalize coordinates and box sizes to 0-1 range&lt;/li&gt;
&lt;li&gt;YOLO format: class_id x_center y_center width height&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;This code plays a crucial role in converting 3D particle location information into 2D bounding box format that YOLO can understand.&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Image&amp;nbsp;generation&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;i&gt;&lt;b&gt;newvolf = np.stack([newvol]*3, axis=-1)&lt;/b&gt;&lt;/i&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;i&gt;&lt;b&gt;[newvol]*3&lt;/b&gt;&lt;/i&gt;: &lt;b&gt;Replicate the same grayscale image 3 times &lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;i&gt;&lt;b&gt;axis=-1&lt;/b&gt;&lt;/i&gt;: &lt;span style=&quot;background-color: #9feec3;&quot;&gt;&lt;b&gt;Stack along the last dimension (creating R,G,B channels) &lt;/b&gt;&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;&lt;span style=&quot;color: #000000;&quot;&gt;Result: (height, width) -&amp;gt; (height, width, 3) &lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Image resizing:&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;&lt;b&gt;640x640 is YOLOv5's default input size&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Label Processing&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;Labels here refer to annotation information used for training YOLO object detection models&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;Exclude beta-amylase (excluded from competition evaluation)&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Data&amp;nbsp;Loading&amp;nbsp;-&amp;nbsp;For&amp;nbsp;Synthetic&amp;nbsp;Data&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Synthetic data stored in NDJSON format&lt;/li&gt;
&lt;li&gt;Filter by matching particle type in filename&lt;/li&gt;
&lt;li&gt;Convert NDJSON to DataFrame&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt; Data Loading - For Real Data&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Real data stored in JSON format&lt;/li&gt;
&lt;li&gt;Direct JSON file loading&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Coordinate Extraction Processing&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;Different coordinate extraction methods for synthetic/real data&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;Normalize and scale coordinate values&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;5) Prepare folders&lt;/b&gt;&lt;/h4&gt;
&lt;pre id=&quot;code_1737362078017&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;os.makedirs(&quot;images/train&quot;, exist_ok=True)
os.makedirs(&quot;images/val&quot;, exist_ok=True)
os.makedirs(&quot;labels/train&quot;, exist_ok=True)
os.makedirs(&quot;labels/val&quot;, exist_ok=True)&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;exist_ok=True&lt;/b&gt;&lt;/i&gt;: No error if directories already exist&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;6) Create&amp;nbsp;Dataset&lt;/b&gt;&lt;/h4&gt;
&lt;pre id=&quot;code_1737362164057&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;validation_indices = [0, 1, 2, 3]  # TS_5_4, TS_69_2 TS_6_4 TS_6_6

#runs = runs[:7] 
    
for i, r in enumerate(runs):
    # Determine if training or validation
    is_train_path = i not in validation_indices
    
    # Determine if synthetic data (after index 7 is synthetic)
    is_syntetic = i &amp;gt; 7
    
    print(f&quot;Processing Run {i}: {r}, Is Train: {is_train_path}&quot;)
    
    # Call dataset generation function
    make_annotate_yolo(r, is_train_path=is_train_path, is_syntetic=is_syntetic)&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Generates the dataset by splitting it into training and validation sets&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1737362329748&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;images_train_dir = &quot;images/train&quot;
labels_train_dir = &quot;labels/train&quot;&lt;/code&gt;&lt;/pre&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;7) Organize&amp;nbsp;Dataset&amp;nbsp;Folder&amp;nbsp;Structure&lt;/b&gt;&lt;/h4&gt;
&lt;pre id=&quot;code_1737362356392&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;# Create top-level dataset directory
os.makedirs('datasets/czii_det2d', exist_ok=True)

# Move image and label files to new locations
shutil.move('images/train', 'datasets/czii_det2d/images/train')
shutil.move('images/val', 'datasets/czii_det2d/images/val')
shutil.move('labels/train', 'datasets/czii_det2d/labels/train')
shutil.move('labels/val', 'datasets/czii_det2d/labels/val')&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Reorganizes the generated training data into final directory structure expected by YOLO&lt;/li&gt;
&lt;li&gt;Final dir structure:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;datasets/&lt;br /&gt;└──&amp;nbsp;czii_det2d/&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;├──&amp;nbsp;images/&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;│&amp;nbsp;&amp;nbsp;&amp;nbsp;├──&amp;nbsp;train/&amp;nbsp;&amp;nbsp;#&amp;nbsp;Training&amp;nbsp;images&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;│&amp;nbsp;&amp;nbsp;&amp;nbsp;└──&amp;nbsp;val/&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;#&amp;nbsp;Validation&amp;nbsp;images&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;└──&amp;nbsp;labels/&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;├──&amp;nbsp;train/&amp;nbsp;&amp;nbsp;#&amp;nbsp;Training&amp;nbsp;labels&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;└──&amp;nbsp;val/&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;#&amp;nbsp;Validation&amp;nbsp;labels&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;8) Create&amp;nbsp;Configuration&amp;nbsp;File&amp;nbsp;for&amp;nbsp;YOLO&lt;/b&gt;&lt;/h4&gt;
&lt;pre id=&quot;code_1737362388916&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;config_content = &quot;&quot;&quot;
path: /kaggle/input/czii-making-datasets-for-yolo/datasets/czii_det2d  # Dataset root path
train: images/train  # Training images path (relative to path)
val: images/val      # Validation images path (relative to path)
# Classes
names:               # Class (particle type) definitions
  0: apo-ferritin
  1: beta-amylase
  2: beta-galactosidase
  3: ribosome
  4: thyroglobulin
  5: virus-like-particle
&quot;&quot;&quot;

# Create YAML file
with open(&quot;czii_conf.yaml&quot;, &quot;w&quot;) as f:
    f.write(config_content.strip())&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Generates a configuration file (YAML) for YOLO model training&lt;/li&gt;
&lt;/ul&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style5&quot; /&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;In order to make the impossible possible, you need to change the rules.&lt;br /&gt;- Elon Musk -&lt;/span&gt;&lt;/blockquote&gt;</description>
      <category>대회</category>
      <category>캐글</category>
      <author>dongsunseng</author>
      <guid isPermaLink="true">https://dongsunseng.tistory.com/100</guid>
      <comments>https://dongsunseng.tistory.com/entry/CZII-CryoET-Object-Identification-4-Making-synthetic-data-for-Baseline-YOLO11-Solution#entry100comment</comments>
      <pubDate>Tue, 28 Jan 2025 21:58:28 +0900</pubDate>
    </item>
    <item>
      <title>CZII - CryoET Object Identification #3 Baseline YOLO11 Solution</title>
      <link>https://dongsunseng.tistory.com/entry/CZII-CryoET-Object-Identification-3-Baseline-YOLO11-Solution</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;This post is an annotation of baseline YOLO11 solution kernel from @SERGIO ALVAREZ.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;a href=&quot;https://www.kaggle.com/code/sersasj/czii-yolo11-submission-baseline-with-kdtree-update&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://www.kaggle.com/code/sersasj/czii-yolo11-submission-baseline-with-kdtree-update&lt;/a&gt;&lt;/p&gt;
&lt;figure id=&quot;og_1737006339985&quot; contenteditable=&quot;false&quot; data-ke-type=&quot;opengraph&quot; data-ke-align=&quot;alignCenter&quot; data-og-type=&quot;website&quot; data-og-title=&quot;CZII YOLO11 Submission Baseline with KDTree Update&quot; data-og-description=&quot;Explore and run machine learning code with Kaggle Notebooks | Using data from multiple data sources&quot; data-og-host=&quot;www.kaggle.com&quot; data-og-source-url=&quot;https://www.kaggle.com/code/sersasj/czii-yolo11-submission-baseline-with-kdtree-update&quot; data-og-url=&quot;https://www.kaggle.com/code/sersasj/czii-yolo11-submission-baseline-with-kdtree-update&quot; data-og-image=&quot;&quot;&gt;&lt;a href=&quot;https://www.kaggle.com/code/sersasj/czii-yolo11-submission-baseline-with-kdtree-update&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot; data-source-url=&quot;https://www.kaggle.com/code/sersasj/czii-yolo11-submission-baseline-with-kdtree-update&quot;&gt;
&lt;div class=&quot;og-image&quot; style=&quot;background-image: url();&quot;&gt;&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;og-text&quot;&gt;
&lt;p class=&quot;og-title&quot; data-ke-size=&quot;size16&quot;&gt;CZII YOLO11 Submission Baseline with KDTree Update&lt;/p&gt;
&lt;p class=&quot;og-desc&quot; data-ke-size=&quot;size16&quot;&gt;Explore and run machine learning code with Kaggle Notebooks | Using data from multiple data sources&lt;/p&gt;
&lt;p class=&quot;og-host&quot; data-ke-size=&quot;size16&quot;&gt;www.kaggle.com&lt;/p&gt;
&lt;/div&gt;
&lt;/a&gt;&lt;/figure&gt;
&lt;h2 style=&quot;background-color: #ffffff; color: #202124; text-align: start;&quot; data-ke-size=&quot;size26&quot;&gt;&lt;b&gt;CZII YOLO11 Submission Baseline with KDTree Update - LB 0.682&lt;/b&gt;&lt;/h2&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Inspired by &lt;a href=&quot;https://www.kaggle.com/code/itsuki9180/czii-yolo11-submission-baseline&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://www.kaggle.com/code/itsuki9180/czii-yolo11-submission-baseline&lt;/a&gt; (LB: 0.625)&lt;/li&gt;
&lt;li&gt;Problem: &lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;&quot;It already takes 10 hours with the YOLO model - if I train a 2D UNET and aggregate the results in a similar way to YOLO, would it be possible to fit within the time limit?&quot;&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;Introduced the &lt;i&gt;&lt;b&gt;KDTree algorithm&lt;/b&gt;&lt;/i&gt; for performance improvement
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;KDTree is an efficient algorithm for &lt;b&gt;finding nearest neighbors&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Also added @min fuka's multi-processing idea
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;a href=&quot;https://www.kaggle.com/code/minfuka/czii-yolo11-submission-baseline-speed-up-ver&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://www.kaggle.com/code/minfuka/czii-yolo11-submission-baseline-speed-up-ver&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;Time with KDTree was reduced to ~6500 seconds and with multiprocessing was reduced to ~4500 seconds.&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;Used synthetic data for training&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;Data: &lt;a href=&quot;https://www.kaggle.com/datasets/sersasj/czii-yolo-l-trained-with-synthetic-data/data&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://www.kaggle.com/datasets/sersasj/czii-yolo-l-trained-with-synthetic-data/data&lt;/a&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;Code making synthetic data: &lt;a href=&quot;https://www.kaggle.com/code/sersasj/czii-making-datasets-for-yolo-synthetic-data#CZII:-Creating-Datasets-for-YOLO-with-Additional-Data&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://www.kaggle.com/code/sersasj/czii-making-datasets-for-yolo-synthetic-data#CZII:-Creating-Datasets-for-YOLO-with-Additional-Data&lt;/a&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;My annotation on this code: &lt;a href=&quot;https://dongsunseng.com/entry/CZII-CryoET-Object-Identification-4-Making-synthetic-data-for-Baseline-YOLO11-Solution#google_vignette&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://dongsunseng.com/entry/CZII-CryoET-Object-Identification-4-Making-synthetic-data-for-Baseline-YOLO11-Solution#google_vignette&lt;/a&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;Used TS_5_4, TS_69_2, TS_6_4, and TS_6_6 as validation datasets&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;Used OPTUNA to optimize the following parameters:&lt;/span&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;z_distance&lt;/li&gt;
&lt;li&gt;zy_distance&lt;/li&gt;
&lt;li&gt;first_conf&lt;/li&gt;
&lt;li&gt;conf_coef&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote data-ke-style=&quot;style3&quot;&gt;Score for TS_5_4: 0.658783812957661,{'apo-ferritin': {'total_tp': 42, 'total_fp': 20, 'total_fn': 2, 'fbeta': 0.9321148825065274}, 'beta-galactosidase': {'total_tp': 5, 'total_fp': 27, 'total_fn': 7, 'fbeta': 0.3794642857142858}, 'ribosome': {'total_tp': 20, 'total_fp': 29, 'total_fn': 10, 'fbeta': 0.6427221172022684}, 'thyroglobulin': {'total_tp': 23, 'total_fp': 104, 'total_fn': 7, 'fbeta': 0.6441515650741352}, 'virus-like-particle': {'total_tp': 11, 'total_fp': 2, 'total_fn': 0, 'fbeta': 0.9894179894179894}} &lt;br /&gt;&lt;br /&gt;Score for TS_69_2: 0.8191956150699464,{'apo-ferritin': {'total_tp': 35, 'total_fp': 25, 'total_fn': 0, 'fbeta': 0.9596774193548387}, 'beta-galactosidase': {'total_tp': 13, 'total_fp': 46, 'total_fn': 3, 'fbeta': 0.7015873015873016}, 'ribosome': {'total_tp': 35, 'total_fp': 15, 'total_fn': 2, 'fbeta': 0.926791277258567}, 'thyroglobulin': {'total_tp': 28, 'total_fp': 84, 'total_fn': 6, 'fbeta': 0.7256097560975611}, 'virus-like-particle': {'total_tp': 9, 'total_fp': 1, 'total_fn': 0, 'fbeta': 0.9935064935064936}} &lt;br /&gt;&lt;br /&gt;Score for TS_6_4: 0.685180923434018,{'apo-ferritin': {'total_tp': 45, 'total_fp': 34, 'total_fn': 12, 'fbeta': 0.7719475277497477}, 'beta-galactosidase': {'total_tp': 7, 'total_fp': 29, 'total_fn': 5, 'fbeta': 0.5219298245614036}, 'ribosome': {'total_tp': 54, 'total_fp': 59, 'total_fn': 12, 'fbeta': 0.7852865697177076}, 'thyroglobulin': {'total_tp': 24, 'total_fp': 77, 'total_fn': 6, 'fbeta': 0.70223752151463}, 'virus-like-particle': {'total_tp': 8, 'total_fp': 4, 'total_fn': 2, 'fbeta': 0.7906976744186046}}&lt;br /&gt;&lt;br /&gt;Score for TS_6_6: 0.7575532250952666,{'apo-ferritin': {'total_tp': 37, 'total_fp': 39, 'total_fn': 2, 'fbeta': 0.8985714285714286}, 'beta-galactosidase': {'total_tp': 8, 'total_fp': 43, 'total_fn': 3, 'fbeta': 0.5991189427312775}, 'ribosome': {'total_tp': 17, 'total_fp': 11, 'total_fn': 6, 'fbeta': 0.7297979797979798}, 'thyroglobulin': {'total_tp': 31, 'total_fp': 120, 'total_fn': 4, 'fbeta': 0.7412095639943742}, 'virus-like-particle': {'total_tp': 19, 'total_fp': 2, 'total_fn': 0, 'fbeta': 0.9938461538461538}}&lt;/blockquote&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;1) Ultralytics setting for offline env (External kernel linked to the main submission kernel)&lt;/b&gt;&lt;/h4&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;&lt;b&gt;Ultralytics is an open-source package for implementing and training YOLO (You Only Look Once) object detection models&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;a href=&quot;https://www.kaggle.com/code/itsuki9180/ultralytics-for-offline-install&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://www.kaggle.com/code/itsuki9180/ultralytics-for-offline-install&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1737010152499&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;!pip download -d ./packages ultralytics
!tar cfvz archive.tar.gz ./packages&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;!pip&amp;nbsp;download&amp;nbsp;-d&amp;nbsp;./packages&amp;nbsp;ultralytics&lt;/b&gt;&lt;/i&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;-d ./packages&lt;/b&gt;&lt;/i&gt;: Specifies the &lt;b&gt;download location&lt;/b&gt; as ./packages directory&lt;/li&gt;
&lt;li&gt;Downloads the package and all its dependencies&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;Only downloads wheel files without actual installation&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;!tar cfvz archive.tar.gz ./packages&lt;/b&gt;&lt;/i&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt; Compresses&lt;/b&gt; downloaded packages into a &lt;b&gt;tar&lt;/b&gt; file&lt;/li&gt;
&lt;li&gt;c: Create a new archive&lt;/li&gt;
&lt;li&gt;f: Specify filename&lt;/li&gt;
&lt;li&gt;v: Verbose (detailed output)&lt;/li&gt;
&lt;li&gt;z: Use gzip compression&lt;/li&gt;
&lt;li&gt;archive.tar.gz: Name of the compressed file to be created&lt;/li&gt;
&lt;li&gt;./packages: Directory to be compressed&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #99cefa;&quot;&gt;&lt;b&gt;This is done because internet access is restricted in the competition environment&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;&lt;b&gt;All necessary packages are downloaded and compressed in advance so they can be installed later in an offline environment&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;wheel files?&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;A binary package that bundles Python packages in an installable form&lt;/li&gt;
&lt;li&gt;Includes compiled code, metadata, and dependency information&lt;/li&gt;
&lt;li&gt;Has the .whl extension&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1737011649131&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;!tar xfvz archive.tar.gz
!pip install --no-index --find-links=./packages ultralytics
!rm -rf ./packages&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;!tar&amp;nbsp;xfvz&amp;nbsp;archive.tar.gz&lt;/b&gt;&lt;/i&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;x: Extract&lt;/li&gt;
&lt;li&gt;f: Specify filename&lt;/li&gt;
&lt;li&gt;v: Verbose (detailed output)&lt;/li&gt;
&lt;li&gt;z: Extract gzip compression&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;Extracts archive.tar.gz to create ./packages directory&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt; !pip install --no-index --find-links=./packages ultralytics&lt;/b&gt;&lt;/i&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;--no-index&lt;/b&gt;&lt;/i&gt;: Don't use PyPI (Python Package Index).
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;This means don't download packages from the internet&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;--find-links=./packages&lt;/b&gt;&lt;/i&gt;: Specify local directory to find packages&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;Install ultralytics using locally downloaded wheel files&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt; !rm -rf ./packages&lt;/b&gt;&lt;/i&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Delete the temporarily used packages directory after installation&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;-r&lt;/b&gt;&lt;/i&gt;: Delete recursively (including all files in directory)&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;i&gt;-f&lt;/i&gt;&lt;/b&gt;: Force delete (without confirmation messages)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;2) Dependencies (Back to the kernel)&lt;/b&gt;&lt;/h4&gt;
&lt;pre id=&quot;code_1737011596582&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;# Installing Ultralytics
!tar xfvz /kaggle/input/ultralytics-for-offline-install/archive.tar.gz
!pip install --no-index --find-links=./packages ultralytics
!rm -rf ./packages

# Installing Zarr package
!cp -r '/kaggle/input/hengck-czii-cryo-et-01/wheel_file' '/kaggle/working/'
!pip install /kaggle/working/wheel_file/asciitree-0.3.3/asciitree-0.3.3
!pip install --no-index --find-links=/kaggle/working/wheel_file zarr&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;Copy wheel files from another Kaggle dataset to working directory&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;First install asciitree (a dependency of zarr)&lt;/li&gt;
&lt;li&gt;Install zarr package&lt;/li&gt;
&lt;li&gt;The reasons for this approach:
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;Kaggle notebooks have &lt;b&gt;restricted internet access&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;Required packages are &lt;b&gt;pre-uploaded as datasets&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;Enables package installation in offline environments Specifically, &lt;b&gt;zarr&lt;/b&gt; is a &lt;b&gt;package used for efficient storage and processing of large array data&lt;/b&gt;, which will likely be used in this competition for handling 3D image data.&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1737011610751&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;import os
import glob
import time
import sys
import warnings
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import cv2
import torch
from tqdm import tqdm
from ultralytics import YOLO
import zarr
from scipy.spatial import cKDTree
from collections import defaultdict&lt;/code&gt;&lt;/pre&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;3) Loading model + Configuration&lt;/b&gt;&lt;/h4&gt;
&lt;pre id=&quot;code_1737013455121&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;model_path = '/kaggle/input/czii-yolo-l-trained-with-synthetic-data/best_synthetic.pt'
model = YOLO(model_path)&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Loading the 'best_synthetic.pt' file&lt;/li&gt;
&lt;li&gt;And then, &lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;Uses the YOLO class from Ultralytics to load the model&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1737013470388&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;# Processing experiment data paths
runs_path = '/kaggle/input/czii-cryo-et-object-identification/test/static/ExperimentRuns/*'
runs = sorted(glob.glob(runs_path))
runs = [os.path.basename(run) for run in runs]

# Data Splitting
sp = len(runs)//2
runs1 = runs[:sp]
runs1[:5]

#add by @minfuka
runs2 = runs[sp:]
runs2[:5]

#add by @minfuka - GPU Checking
assert torch.cuda.device_count() == 2&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;Processing experiment data paths:&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Gets experiment data paths from test dataset&lt;/li&gt;
&lt;li&gt;Uses glob.glob to get all experiment folders&lt;/li&gt;
&lt;li&gt;Uses os.path.basename to&lt;b&gt; extract only folder names from paths&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Data splitting:&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;Divides all experiments into two groups&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;Appears to be preparation for parallel processing&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;GPU Checking:&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;Verifies&amp;nbsp;that&amp;nbsp;2&amp;nbsp;GPUs&amp;nbsp;are&amp;nbsp;available&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;Uses&amp;nbsp;assert&amp;nbsp;statement&amp;nbsp;to&amp;nbsp;raise&amp;nbsp;error&amp;nbsp;if&amp;nbsp;not&amp;nbsp;2&lt;/li&gt;
&lt;li&gt;For multi-GPU processing&lt;/li&gt;
&lt;li&gt;This appears to be intended for parallelizing data processing using multiple GPUs&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;It's preparing for each GPU to process half of the data&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1737013519881&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;particle_names = [
    'apo-ferritin',
    'beta-amylase',
    'beta-galactosidase',
    'ribosome',
    'thyroglobulin',
    'virus-like-particle'
]

particle_to_index = {
    'apo-ferritin': 0,
    'beta-amylase': 1,
    'beta-galactosidase': 2,
    'ribosome': 3,
    'thyroglobulin': 4,
    'virus-like-particle': 5
}

index_to_particle = {index: name for name, index in particle_to_index.items()}

particle_radius = {
    'apo-ferritin': 60,
    'beta-amylase': 65,
    'beta-galactosidase': 90,
    'ribosome': 150,
    'thyroglobulin': 130,
    'virus-like-particle': 135,
}&lt;/code&gt;&lt;/pre&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;4) Helper functions&lt;/b&gt;&lt;/h4&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style5&quot; /&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;I think that's the single best piece of advice: constantly think about how you could be doing things better and questioning yourself.&amp;nbsp;&lt;br /&gt;- Elon Musk -&lt;/span&gt;&lt;/blockquote&gt;</description>
      <category>대회</category>
      <category>yolo11</category>
      <category>캐글</category>
      <author>dongsunseng</author>
      <guid isPermaLink="true">https://dongsunseng.tistory.com/86</guid>
      <comments>https://dongsunseng.tistory.com/entry/CZII-CryoET-Object-Identification-3-Baseline-YOLO11-Solution#entry86comment</comments>
      <pubDate>Thu, 16 Jan 2025 17:14:07 +0900</pubDate>
    </item>
    <item>
      <title>CZII - CryoET Object Identification #2 Baseline UNet Solution</title>
      <link>https://dongsunseng.tistory.com/entry/CZII-CryoET-Object-Identification-2-Baseline-UNet-Solution</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;This post is an annotation of baseline unet solution kernel from &quot;fnands&quot;.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;a href=&quot;https://www.kaggle.com/code/ldywinner/baseline-unet-train-submit/notebook#Baseline-UNet-training-+-prediction/submission&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://www.kaggle.com/code/ldywinner/baseline-unet-train-submit/notebook#Baseline-UNet-training-+-prediction/submission&lt;/a&gt;&lt;/p&gt;
&lt;h2 data-ke-size=&quot;size26&quot;&gt;&lt;b&gt;'Baseline UNet train + submit' - LB score 0.529&lt;/b&gt;&lt;/h2&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Literally a baseline solution with no high lb score&lt;/li&gt;
&lt;li&gt;Based on 3 notebooks:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;a href=&quot;https://www.kaggle.com/code/ahsuna123/3d-u-net-training-only&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://www.kaggle.com/code/ahsuna123/3d-u-net-training-only&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.kaggle.com/code/zhuowenzhao11/3d-u-net-pytorch-lightning-distributed-training&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://www.kaggle.com/code/zhuowenzhao11/3d-u-net-pytorch-lightning-distributed-training&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.kaggle.com/code/hengck23/3d-unet-using-2d-image-encoder/notebook&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://www.kaggle.com/code/hengck23/3d-unet-using-2d-image-encoder/notebook&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;Pre-computed the input data and stored them as numpy arrays so they don't have to be extracted every time the notebooks is run:&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;My annotation of that part here: &lt;a href=&quot;https://dongsunseng.com/entry/CZII-CryoET-Object-Identification-1-Training-Data&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://dongsunseng.com/entry/CZII-CryoET-Object-Identification-1-Training-Data&lt;/a&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;figure id=&quot;og_1736827482533&quot; contenteditable=&quot;false&quot; data-ke-type=&quot;opengraph&quot; data-ke-align=&quot;alignCenter&quot; data-og-type=&quot;article&quot; data-og-title=&quot;CZII - CryoET Object Identification #1 - Training Data&quot; data-og-description=&quot;This post is an annotation of training data code kernel from &amp;quot;fnands&amp;quot;.https://www.kaggle.com/code/fnands/create-numpy-dataset-exp-name&amp;nbsp;Create Numpy dataset exp nameExplore and run machine learning code with Kaggle Notebooks | Using data from CZII - CryoET&quot; data-og-host=&quot;dongsunseng.com&quot; data-og-source-url=&quot;https://dongsunseng.com/entry/CZII-CryoET-Object-Identification-1-Training-Data&quot; data-og-url=&quot;https://dongsunseng.com/entry/CZII-CryoET-Object-Identification-1-Training-Data&quot; data-og-image=&quot;https://scrap.kakaocdn.net/dn/bejtwF/hyX0tJQC1S/BqFKMuA4CXm2EjKypVKV6K/img.png?width=416&amp;amp;height=121&amp;amp;face=0_0_416_121,https://scrap.kakaocdn.net/dn/pkMMe/hyX0m42cla/wogzeeY6x8lHWXFakQKZlK/img.png?width=416&amp;amp;height=121&amp;amp;face=0_0_416_121,https://scrap.kakaocdn.net/dn/eBx8pr/hyX0tiK0d0/mpMmTH4WcnFdmCoyPYiAD1/img.png?width=500&amp;amp;height=500&amp;amp;face=0_0_500_500&quot;&gt;&lt;a href=&quot;https://dongsunseng.com/entry/CZII-CryoET-Object-Identification-1-Training-Data&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot; data-source-url=&quot;https://dongsunseng.com/entry/CZII-CryoET-Object-Identification-1-Training-Data&quot;&gt;
&lt;div class=&quot;og-image&quot; style=&quot;background-image: url('https://scrap.kakaocdn.net/dn/bejtwF/hyX0tJQC1S/BqFKMuA4CXm2EjKypVKV6K/img.png?width=416&amp;amp;height=121&amp;amp;face=0_0_416_121,https://scrap.kakaocdn.net/dn/pkMMe/hyX0m42cla/wogzeeY6x8lHWXFakQKZlK/img.png?width=416&amp;amp;height=121&amp;amp;face=0_0_416_121,https://scrap.kakaocdn.net/dn/eBx8pr/hyX0tiK0d0/mpMmTH4WcnFdmCoyPYiAD1/img.png?width=500&amp;amp;height=500&amp;amp;face=0_0_500_500');&quot;&gt;&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;og-text&quot;&gt;
&lt;p class=&quot;og-title&quot; data-ke-size=&quot;size16&quot;&gt;CZII - CryoET Object Identification #1 - Training Data&lt;/p&gt;
&lt;p class=&quot;og-desc&quot; data-ke-size=&quot;size16&quot;&gt;This post is an annotation of training data code kernel from &quot;fnands&quot;.https://www.kaggle.com/code/fnands/create-numpy-dataset-exp-name&amp;nbsp;Create Numpy dataset exp nameExplore and run machine learning code with Kaggle Notebooks | Using data from CZII - CryoET&lt;/p&gt;
&lt;p class=&quot;og-host&quot; data-ke-size=&quot;size16&quot;&gt;dongsunseng.com&lt;/p&gt;
&lt;/div&gt;
&lt;/a&gt;&lt;/figure&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;1) Installing offline deps&lt;/b&gt;&lt;/h4&gt;
&lt;pre id=&quot;code_1736827324887&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;deps_path = '/kaggle/input/czii-cryoet-dependencies'
! cp -r /kaggle/input/czii-cryoet-dependencies/asciitree-0.3.3/ asciitree-0.3.3/
! pip wheel asciitree-0.3.3/asciitree-0.3.3/
! pip install asciitree-0.3.3-py3-none-any.whl
! pip install -q --no-index --find-links {deps_path} --requirement {deps_path}/requirements.txt&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Process of installing dependency packages in an &lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;offline environment&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&quot;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;As this is a code comp, there is no internet. So we have to do some silly things to get dependencies in here. Why is asciitree such a PITA?&quot;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;In Kaggle competitions, internet access is restricted, so necessary packages must be prepared in advance&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;Kaggle competition environments block internet access for security and fairness.&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;The asciitree package, in particular, is tricky to install and requires special handling&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;All dependency packages must be prepared in advance in a locally installable format.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;2) Import deps&lt;/b&gt;&lt;/h4&gt;
&lt;pre id=&quot;code_1736827362042&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;from typing import List, Tuple, Union
import numpy as np
import torch
from monai.data import DataLoader, Dataset, CacheDataset, decollate_batch
from monai.transforms import (
    Compose, 
    EnsureChannelFirstd, 
    Orientationd,  
    AsDiscrete,  
    RandFlipd, 
    RandRotate90d, 
    NormalizeIntensityd,
    RandCropByLabelClassesd,
)&lt;/code&gt;&lt;/pre&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;3) Define&amp;nbsp;some&amp;nbsp;helper&amp;nbsp;functions&lt;/b&gt;&lt;/h4&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;Patching helper functions&lt;/b&gt;&lt;b&gt;&lt;/b&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1736827941857&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;def calculate_patch_starts(dimension_size: int, patch_size: int) -&amp;gt; List[int]:
    &quot;&quot;&quot;
    Calculate the starting positions of patches along a single dimension
    with minimal overlap to cover the entire dimension.
    
    Parameters:
    -----------
    dimension_size : int
        Size of the dimension
    patch_size : int
        Size of the patch in this dimension
        
    Returns:
    --------
    List[int]
        List of starting positions for patches
    &quot;&quot;&quot;
    if dimension_size &amp;lt;= patch_size:
        return [0]
        
    # Calculate number of patches needed
    n_patches = np.ceil(dimension_size / patch_size)
    
    if n_patches == 1:
        return [0]
    
    # Calculate overlap
    total_overlap = (n_patches * patch_size - dimension_size) / (n_patches - 1)
    
    # Generate starting positions
    positions = []
    for i in range(int(n_patches)):
        pos = int(i * (patch_size - total_overlap))
        if pos + patch_size &amp;gt; dimension_size:
            pos = dimension_size - patch_size
        if pos not in positions:  # Avoid duplicates
            positions.append(pos)
    
    return positions

def extract_3d_patches_minimal_overlap(arrays: List[np.ndarray], patch_size: int) -&amp;gt; Tuple[List[np.ndarray], List[Tuple[int, int, int]]]:
    &quot;&quot;&quot;
    Extract 3D patches from multiple arrays with minimal overlap to cover the entire array.
    
    Parameters:
    -----------
    arrays : List[np.ndarray]
        List of input arrays, each with shape (m, n, l)
    patch_size : int
        Size of cubic patches (a x a x a)
        
    Returns:
    --------
    patches : List[np.ndarray]
        List of all patches from all input arrays
    coordinates : List[Tuple[int, int, int]]
        List of starting coordinates (x, y, z) for each patch
    &quot;&quot;&quot;
    if not arrays or not isinstance(arrays, list):
        raise ValueError(&quot;Input must be a non-empty list of arrays&quot;)
    
    # Verify all arrays have the same shape
    shape = arrays[0].shape
    if not all(arr.shape == shape for arr in arrays):
        raise ValueError(&quot;All input arrays must have the same shape&quot;)
    
    if patch_size &amp;gt; min(shape):
        raise ValueError(f&quot;patch_size ({patch_size}) must be smaller than smallest dimension {min(shape)}&quot;)
    
    m, n, l = shape
    patches = []
    coordinates = []
    
    # Calculate starting positions for each dimension
    x_starts = calculate_patch_starts(m, patch_size)
    y_starts = calculate_patch_starts(n, patch_size)
    z_starts = calculate_patch_starts(l, patch_size)
    
    # Extract patches from each array
    for arr in arrays:
        for x in x_starts:
            for y in y_starts:
                for z in z_starts:
                    patch = arr[
                        x:x + patch_size,
                        y:y + patch_size,
                        z:z + patch_size
                    ]
                    patches.append(patch)
                    coordinates.append((x, y, z))
    
    return patches, coordinates

# Note: I should probably averge the overlapping areas, 
# but here they are just overwritten by the most recent one. 

def reconstruct_array(patches: List[np.ndarray], 
                     coordinates: List[Tuple[int, int, int]], 
                     original_shape: Tuple[int, int, int]) -&amp;gt; np.ndarray:
    &quot;&quot;&quot;
    Reconstruct array from patches.
    
    Parameters:
    -----------
    patches : List[np.ndarray]
        List of patches to reconstruct from
    coordinates : List[Tuple[int, int, int]]
        Starting coordinates for each patch
    original_shape : Tuple[int, int, int]
        Shape of the original array
        
    Returns:
    --------
    np.ndarray
        Reconstructed array
    &quot;&quot;&quot;
    reconstructed = np.zeros(original_shape, dtype=np.int64)  # To track overlapping regions
    
    patch_size = patches[0].shape[0]
    
    for patch, (x, y, z) in zip(patches, coordinates):
        reconstructed[
            x:x + patch_size,
            y:y + patch_size,
            z:z + patch_size
        ] = patch
        
    
    return reconstructed&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&quot;These are mostly used to split large volumes into smaller ones and stitch them back together&quot;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;This code implements functions for extracting and reconstructing patches from 3D image data&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;def&amp;nbsp;calculate_patch_starts(dimension_size:&amp;nbsp;int,&amp;nbsp;patch_size:&amp;nbsp;int)&amp;nbsp;-&amp;gt;&amp;nbsp;List[int]:&lt;/b&gt;&lt;/i&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Purpose: &lt;b&gt;Calculates the starting positions of patches in one dimension&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;Operation:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Returns [0] if dimension size is smaller than patch size&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;Calculates required number of patches: n_patches = ceil(dimension_size / patch_size)&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Calculates overlap between patches&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;Generates starting positions for each patch considering overlap&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Example: For dimension size 100 and patch size 40, returns list of positions like [0, 30, 60]&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;def&amp;nbsp;extract_3d_patches_minimal_overlap(arrays:&amp;nbsp;List[np.ndarray],&amp;nbsp;patch_size:&amp;nbsp;int)&amp;nbsp;&lt;/b&gt;&lt;/i&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Purpose: &lt;b&gt;Divides 3D arrays into smaller patches&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;Key features:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Input validation (array shape, size, etc.)&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Calculates patch starting positions for each dimension (x, y, z)&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;Extracts patches from all possible positions&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Return values:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;patches: List of all extracted patches&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;coordinates: List of starting coordinates for each patch&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;def&amp;nbsp;reconstruct_array(patches:&amp;nbsp;List[np.ndarray],&amp;nbsp;coordinates:&amp;nbsp;List[Tuple[int,&amp;nbsp;int,&amp;nbsp;int]],&amp;nbsp;original_shape:&amp;nbsp;Tuple[int,&amp;nbsp;int,&amp;nbsp;int])&lt;/b&gt;&lt;/i&gt;&lt;br /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Purpose: &lt;b&gt;Reconstructs patches back into original-sized 3D array&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;Operation:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Creates empty array of original size&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Places each patch at its corresponding position&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;Overlapping regions are overwritten by most recent patch&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Note:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;As mentioned in code comments, using average values for overlapping regions might be better&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;Submission helper functions&lt;/b&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1736830174759&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;import pandas as pd

def dict_to_df(coord_dict, experiment_name):
    &quot;&quot;&quot;
    Convert dictionary of coordinates to pandas DataFrame.
    
    Parameters:
    -----------
    coord_dict : dict
        Dictionary where keys are labels and values are Nx3 coordinate arrays
        
    Returns:
    --------
    pd.DataFrame
        DataFrame with columns ['x', 'y', 'z', 'label']
    &quot;&quot;&quot;
    # Create lists to store data
    all_coords = []
    all_labels = []
    
    # Process each label and its coordinates
    for label, coords in coord_dict.items():
        all_coords.append(coords)
        all_labels.extend([label] * len(coords))
    
    # Concatenate all coordinates
    all_coords = np.vstack(all_coords)
    
    df = pd.DataFrame({
        'experiment': experiment_name,
        'particle_type': all_labels,
        'x': all_coords[:, 0],
        'y': all_coords[:, 1],
        'z': all_coords[:, 2]
    })

    
    return df&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Purpose:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;Converts position coordinates of multiple particle types in 3D space into a &lt;span style=&quot;background-color: #f6e199;&quot;&gt;structured dataframe format&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Structures data to match the submission format for Kaggle competition&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Input&amp;nbsp;Parameters:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;coord_dict: A dictionary with particle types as keys and their coordinates (N&amp;times;3 array) as values&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Example: {'apo-ferritin': array([[x1,y1,z1], [x2,y2,z2]...]), 'ribosome': array([[x3,y3,z3]...])}&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;experiment_name: Experiment name (e.g., 'TS_5_4')&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;4) Reading in the data&lt;/b&gt;&lt;/h4&gt;
&lt;pre id=&quot;code_1736830755316&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;TRAIN_DATA_DIR = &quot;/kaggle/input/create-numpy-dataset-exp-name&quot;
TEST_DATA_DIR = &quot;/kaggle/input/czii-cryo-et-object-identification&quot;

train_names = ['TS_5_4', 'TS_69_2', 'TS_6_6', 'TS_73_6', 'TS_86_3', 'TS_99_9']
valid_names = ['TS_6_4']

train_files = []
valid_files = []

for name in train_names:
    image = np.load(f&quot;{TRAIN_DATA_DIR}/train_image_{name}.npy&quot;)
    label = np.load(f&quot;{TRAIN_DATA_DIR}/train_label_{name}.npy&quot;)

    train_files.append({&quot;image&quot;: image, &quot;label&quot;: label})
    

for name in valid_names:
    image = np.load(f&quot;{TRAIN_DATA_DIR}/train_image_{name}.npy&quot;)
    label = np.load(f&quot;{TRAIN_DATA_DIR}/train_label_{name}.npy&quot;)

    valid_files.append({&quot;image&quot;: image, &quot;label&quot;: label})&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Loading data used in training and validation&lt;/li&gt;
&lt;li&gt;For each experiment ID:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;image: 3D volume data (.npy format)&lt;/li&gt;
&lt;li&gt;label: corresponding label data&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;stored as dictionary {&quot;image&quot;: image, &quot;label&quot;: label}&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;Create&amp;nbsp;the&amp;nbsp;training&amp;nbsp;dataloader&lt;/b&gt;&lt;/p&gt;
&lt;p style=&quot;color: #3c4043; text-align: left;&quot; data-ke-size=&quot;size16&quot;&gt;&quot;I should probably find a way to create a dataloader that takes more batches.&quot;&lt;/p&gt;
&lt;pre id=&quot;code_1736832821967&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;# Non-random transforms to be cached
non_random_transforms = Compose([
    EnsureChannelFirstd(keys=[&quot;image&quot;, &quot;label&quot;], channel_dim=&quot;no_channel&quot;),
    NormalizeIntensityd(keys=&quot;image&quot;),
    Orientationd(keys=[&quot;image&quot;, &quot;label&quot;], axcodes=&quot;RAS&quot;)
])

raw_train_ds = CacheDataset(data=train_files, transform=non_random_transforms, cache_rate=1.0)


my_num_samples = 16
train_batch_size = 1

# Random transforms to be applied during training
random_transforms = Compose([
    RandCropByLabelClassesd(
        keys=[&quot;image&quot;, &quot;label&quot;],
        label_key=&quot;label&quot;,
        spatial_size=[96, 96, 96],
        num_classes=7,
        num_samples=my_num_samples
    ),
    RandRotate90d(keys=[&quot;image&quot;, &quot;label&quot;], prob=0.5, spatial_axes=[0, 2]),
    RandFlipd(keys=[&quot;image&quot;, &quot;label&quot;], prob=0.5, spatial_axis=0),    
])

# Final Dataset and DataLoader Creation:
train_ds = Dataset(data=raw_train_ds, transform=random_transforms)


# DataLoader remains the same
train_loader = DataLoader(
    train_ds,
    batch_size=train_batch_size,
    shuffle=True,
    num_workers=4,
    pin_memory=torch.cuda.is_available()
)&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;This code sets up the training data loader using the MONAI library for medical image data processing with transforms and data loaders&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;Data loader?&lt;/b&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;DataLoader is a &lt;b&gt;pipeline that supplies data to the model&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;Main features:
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;Batch Creation: Bundles multiple data samples together&lt;/li&gt;
&lt;li&gt;Shuffling: Randomly shuffles the order of data&lt;/li&gt;
&lt;li&gt;Parallel Processing: Accelerates data loading using multiple CPU cores&lt;/li&gt;
&lt;li&gt;Memory Efficiency: Loads data as needed instead of loading all at once Example:&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Non-random Transforms Setup:&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;EnsureChannelFirstd&lt;/b&gt;&lt;/i&gt;: Moves channel dimension to first position&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;NormalizeIntensityd&lt;/b&gt;&lt;/i&gt;: Normalizes image values (standardizes image values)&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;i&gt;Orientationd&lt;/i&gt;: Aligns 3D images to RAS (Right-Anterior-Superior) standard&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;raw_train_ds&amp;nbsp;=&amp;nbsp;CacheDataset(data=train_files,&amp;nbsp;transform=non_random_transforms,&amp;nbsp;cache_rate=1.0)&lt;/b&gt;&lt;/i&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Caches transformed data in memory&lt;b&gt; &lt;/b&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;for fast access&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;cache_rate=1.0: Caches all data&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt; Random&amp;nbsp;Transforms&amp;nbsp;Setup:&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;RandCropByLabelClassesd&lt;/b&gt;&lt;/i&gt;: Random cropping by label classes (96&amp;times;96&amp;times;96 size)&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;RandRotate90d&lt;/b&gt;&lt;/i&gt;:&amp;nbsp;Random&amp;nbsp;90-degree&amp;nbsp;rotation&amp;nbsp;(50%&amp;nbsp;probability)&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;RandFlipd&lt;/b&gt;&lt;/i&gt;:&amp;nbsp;Random&amp;nbsp;flipping&amp;nbsp;(50%&amp;nbsp;probability)&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;&lt;b&gt;Random Transform doesn't replace original data but applies new transformations each time data is loaded&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;my_num_samples = 16&lt;/b&gt;&lt;/i&gt;&amp;nbsp;&amp;nbsp;# Creates 16 transformed samples per image
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Creates 16 different transformations from each original image per epoch&lt;/li&gt;
&lt;li&gt;6 training images &amp;times; 16 samples = total of 96 samples used in each epoch&lt;/li&gt;
&lt;li&gt;New random transformations are applied every epoch&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Final Dataset and DataLoader Creation:&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;i&gt;batch_size=1&lt;/i&gt;: Number of samples to process at once&lt;/li&gt;
&lt;li&gt;&lt;i&gt;shuffle=True&lt;/i&gt;: Shuffle data order each epoch&lt;/li&gt;
&lt;li&gt;&lt;i&gt;num_workers=4&lt;/i&gt;: Number of workers for parallel data loading&lt;/li&gt;
&lt;li&gt;&lt;i&gt;pin_memory=True&lt;/i&gt;: Memory performance optimization for GPU usage&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;Create the validation dataloader&lt;/b&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&quot;Here I deviate a little from the source notebooks.&quot;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;&quot;In the source, the validation dataloader also used the random transformations. This is bad practice and will result in noisy validation.&quot;&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;&lt;b&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&quot;Here I split the validation dataset in (slightly) overlapping blocks of&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;(96, 96 , 96)&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;so that we can have a consistent validation set that uses all the validation data.&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1736837771323&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;val_images,val_labels = [dcts['image'] for dcts in valid_files],[dcts['label'] for dcts in valid_files]

val_image_patches, _ = extract_3d_patches_minimal_overlap(val_images, 96)
val_label_patches, _ = extract_3d_patches_minimal_overlap(val_labels, 96)

val_patched_data = [{&quot;image&quot;: img, &quot;label&quot;: lbl} for img, lbl in zip(val_image_patches, val_label_patches)]


valid_ds = CacheDataset(data=val_patched_data, transform=non_random_transforms, cache_rate=1.0)


valid_batch_size = 16
# DataLoader remains the same
valid_loader = DataLoader(
    valid_ds,
    batch_size=valid_batch_size,
    shuffle=False,
    num_workers=4,
    pin_memory=torch.cuda.is_available()
)&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;i&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;b&gt;valid_batch_size = 16&lt;/b&gt; &lt;/span&gt;&lt;/i&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000;&quot;&gt;Larger batch size than training(1)&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;i&gt;&lt;b&gt;shuffle&lt;span style=&quot;font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;False&lt;/span&gt;&lt;/b&gt;&lt;/i&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #5c6370;&quot;&gt;Maintaining order&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #5c6370;&quot;&gt;Dataloader configuration details:&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Consistency:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;Random transforms would result in unstable performance evaluation&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;Fixed patches enable consistent evaluation&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Completeness:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;Using entire data allows more accurate evaluation&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;&lt;b&gt;Slight overlap ensures good evaluation of boundary regions&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Efficiency:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Can use &lt;b&gt;larger batch size&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;Faster validation process than training&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;5) Initializing the model&lt;/b&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;This model is pretty much directly copied from &lt;a href=&quot;https://www.kaggle.com/code/zhuowenzhao11/3d-u-net-pytorch-lightning-distributed-training&quot;&gt;https://www.kaggle.com/code/zhuowenzhao11/3d-u-net-pytorch-lightning-distributed-training&lt;/a&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1736876363137&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;import lightning.pytorch as pl

from monai.networks.nets import UNet
from monai.losses import TverskyLoss
from monai.metrics import DiceMetric

class Model(pl.LightningModule):
    def __init__(
        self, 
        spatial_dims: int = 3,
        in_channels: int = 1,
        out_channels: int = 7,
        channels: Union[Tuple[int, ...], List[int]] = (48, 64, 80, 80),
        strides: Union[Tuple[int, ...], List[int]] = (2, 2, 1),
        num_res_units: int = 1,
        lr: float=1e-3):
    
        super().__init__()
        self.save_hyperparameters()
        self.model = UNet(
            spatial_dims=self.hparams.spatial_dims,
            in_channels=self.hparams.in_channels,
            out_channels=self.hparams.out_channels,
            channels=self.hparams.channels,
            strides=self.hparams.strides,
            num_res_units=self.hparams.num_res_units,
        )
        self.loss_fn = TverskyLoss(include_background=True, to_onehot_y=True, softmax=True)  # softmax=True for multiclass
        self.metric_fn = DiceMetric(include_background=False, reduction=&quot;mean&quot;, ignore_empty=True)

        self.train_loss = 0
        self.val_metric = 0
        self.num_train_batch = 0
        self.num_val_batch = 0

    def forward(self, x):
        return self.model(x)

    def training_step(self, batch, batch_idx):
        x, y = batch['image'], batch['label']
        y_hat = self(x)
        loss = self.loss_fn(y_hat, y)
        self.train_loss += loss
        self.num_train_batch += 1
        torch.cuda.empty_cache()
        return loss

    def on_train_epoch_end(self):
        loss_per_epoch = self.train_loss/self.num_train_batch
        #print(f&quot;Epoch {self.current_epoch} - Average Train Loss: {loss_per_epoch:.4f}&quot;)
        self.log('train_loss', loss_per_epoch, prog_bar=True)
        self.train_loss = 0
        self.num_train_batch = 0
    
    def validation_step(self, batch, batch_idx):
        with torch.no_grad(): # This ensures that gradients are not stored in memory
            x, y = batch['image'], batch['label'] # Extract images and labels from batch
            y_hat = self(x) # Perform model prediction
            
            # Process predictions
            metric_val_outputs = [AsDiscrete(
                argmax=True,  # Select class with highest probability
                to_onehot=self.hparams.out_channels  # Convert to one-hot encoding
            )(i) for i in decollate_batch(y_hat)]
            
            # Process labels
            metric_val_labels = [AsDiscrete(
                to_onehot=self.hparams.out_channels  # Convert labels to one-hot encoding
            )(i) for i in decollate_batch(y)]

            # compute metric for current iteration
            # Calculate Dice score for current batch
            self.metric_fn(y_pred=metric_val_outputs, y=metric_val_labels)
            # Calculate batch average metric
            metrics = self.metric_fn.aggregate(reduction=&quot;mean_batch&quot;)
            # Calculate mean across all particle types
            val_metric = torch.mean(metrics) # I used mean over all particle species as the metric. This can be explored.
            
            # Result Accumulation
            self.val_metric += val_metric 
            self.num_val_batch += 1
            
        torch.cuda.empty_cache()
        return {'val_metric': val_metric}

    def on_validation_epoch_end(self):
        metric_per_epoch = self.val_metric/self.num_val_batch
        #print(f&quot;Epoch {self.current_epoch} - Average Val Metric: {metric_per_epoch:.4f}&quot;)
        self.log('val_metric', metric_per_epoch, prog_bar=True, sync_dist=False) # sync_dist=True for distributed training
        self.val_metric = 0
        self.num_val_batch = 0
    
    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=self.hparams.lr)&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;This code implements a 3D U-Net model using PyTorch Lightning&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;def&amp;nbsp;__init__(self,&amp;nbsp;spatial_dims=3,&amp;nbsp;in_channels=1,&amp;nbsp;out_channels=7,&amp;nbsp;...):&lt;/b&gt;&lt;/i&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;UNet Model Configuration:&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;spatial_dims=3: Process 3D data&lt;/li&gt;
&lt;li&gt;in_channels=1: Grayscale image input&lt;/li&gt;
&lt;li&gt;out_channels=7: 7 class outputs (background + 6 particle types)&lt;/li&gt;
&lt;li&gt;channels=(48, 64, 80, 80): Number of channels per layer&lt;/li&gt;
&lt;li&gt;strides=(2, 2, 1): Stride for each layer&lt;/li&gt;
&lt;li&gt;num_res_units=1:  Number of residual units to include in each encoder and decoder block
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;Residual Unit structure:&lt;/b&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;Input -&amp;gt; Conv3D -&amp;gt; BatchNorm -&amp;gt; ReLU -&amp;gt; Conv3D -&amp;gt; BatchNorm -&amp;gt; Add(Input) -&amp;gt; ReLU -&amp;gt; Output&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;One &quot;&lt;b&gt;unit&lt;/b&gt;&quot; consists of:&lt;br /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;2&amp;nbsp;3D&amp;nbsp;convolution&amp;nbsp;layers&lt;/li&gt;
&lt;li&gt;2&amp;nbsp;Batch&amp;nbsp;Normalization&amp;nbsp;layers&lt;/li&gt;
&lt;li&gt;ReLU&amp;nbsp;activation&amp;nbsp;function&lt;/li&gt;
&lt;li&gt;Skip&amp;nbsp;connection&amp;nbsp;(adding&amp;nbsp;input&amp;nbsp;to&amp;nbsp;output)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;num_res_units=1&lt;/b&gt;&lt;/i&gt; means this structure is repeated once at each level. If &lt;i&gt;&lt;b&gt;num_res_units=2&lt;/b&gt;&lt;/i&gt;, this entire structure would be repeated twice in sequence&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Loss Function and Evaluation Metrics:&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;TverskyLoss&lt;/b&gt;&lt;/i&gt;: &lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;Loss function&lt;/b&gt; &lt;b&gt;robust to class imbalance&lt;/b&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Tversky Loss is a &lt;b&gt;generalized version of Dice Loss&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Effective for handling class imbalance problems&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;Parameter explanation:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;include_background=True&lt;/b&gt;&lt;/i&gt;: &lt;b&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;Include background (class 0) in loss calculation&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;to_onehot_y=True&lt;/b&gt;&lt;/i&gt;: Convert integer labels to &lt;b&gt;one-hot vectors&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;softmax=True&lt;/b&gt;&lt;/i&gt;: Apply &lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;softmax&lt;/b&gt;&lt;/span&gt; for multi-class classification&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;DiceMetric&lt;/b&gt;&lt;/i&gt;: &lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;Segmentation&lt;/b&gt; &lt;b&gt;performance metric&lt;/b&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Dice coefficient is a &lt;b&gt;standard metric for evaluating segmentation performance&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;Formula: 2|X&amp;cap;Y| / (|X|+|Y|)
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;X: Predicted region&lt;/li&gt;
&lt;li&gt;Y: Actual region&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Parameter explanation:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;include_background=False&lt;/b&gt;&lt;/i&gt;: &lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;Exclude background class from evaluation&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;reduction=&quot;mean&quot;&lt;/b&gt;&lt;/i&gt;: &lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;Average Dice scores across all classes&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;ignore_empty=True&lt;/b&gt;&lt;/i&gt;: &lt;span style=&quot;background-color: #9feec3;&quot;&gt;&lt;b&gt;Exclude cases where certain classes are absent&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Variable initialization&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;self.train_loss&amp;nbsp;=&amp;nbsp;0&lt;/b&gt;&lt;/i&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;#&amp;nbsp;Accumulate&amp;nbsp;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;training&amp;nbsp;loss&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;self.val_metric&amp;nbsp;=&amp;nbsp;0&lt;/b&gt;&lt;/i&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;#&amp;nbsp;Accumulate&amp;nbsp;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;validation&amp;nbsp;metric&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;self.num_train_batch&amp;nbsp;=&amp;nbsp;0&lt;/b&gt;&lt;/i&gt;&amp;nbsp;#&amp;nbsp;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;Count&lt;/span&gt;&amp;nbsp;processed&amp;nbsp;training&amp;nbsp;batches&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;self.num_val_batch&amp;nbsp;=&amp;nbsp;0&lt;/b&gt;&lt;/i&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;#&amp;nbsp;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;Count&lt;/span&gt;&amp;nbsp;processed&amp;nbsp;validation&amp;nbsp;batches&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;&lt;b&gt;These variables are used to calculate average performance during an epoch&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Reset to 0 at the end of each epoch&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;i&gt;&lt;b&gt; def forward(self, x):&lt;/b&gt;&lt;/i&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;Basic inference method&lt;/b&gt; for PyTorch models&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;Passes input x through the model&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;Simple but important roles:
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;&lt;b&gt;Simplifies model calls (enables self(x) instead of model(x))&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;Used for model inference in other methods&lt;/li&gt;
&lt;li&gt;Integrates with PyTorch Lightning's automated features&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;&lt;span style=&quot;color: #000000;&quot;&gt;def training_step(self, batch, batch_idx):&lt;/span&gt;&lt;/b&gt;&lt;/i&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;Process batch data&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;Perform model prediction&lt;/li&gt;
&lt;li&gt;Calculate and accumulate loss&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;&lt;span style=&quot;color: #000000;&quot;&gt;def on_train_epoch_end(self):&lt;/span&gt;&lt;/b&gt;&lt;/i&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;Calculate average loss per epoch&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;Perform logging&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;&lt;span style=&quot;color: #000000;&quot;&gt;def validation_step(self, batch, batch_idx):&lt;/span&gt;&lt;/b&gt;&lt;/i&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;OVERALL&lt;/b&gt;:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Perform validation &lt;b&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;without gradient calculation&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;Convert predictions to class labels&lt;/li&gt;
&lt;li&gt;Calculate Dice score&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;i&gt;&lt;b&gt;with torch.no_grad():&lt;/b&gt;&lt;/i&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;Turns off gradient calculation&lt;/b&gt; as backpropagation isn't needed during validation&lt;/li&gt;
&lt;li&gt;Reduces memory usage and improves computation speed&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;metric_val_outputs&lt;/b&gt;&lt;/i&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;decollate_batch(y_hat):&lt;/b&gt;&lt;/i&gt;&lt;br /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Separates batch into individual samples&lt;/li&gt;
&lt;li&gt;Example: [16 batches] &amp;rarr; [sample1, sample2, ..., sample16]&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;AsDiscrete(argmax=True):&lt;/b&gt;&lt;/i&gt;&lt;br /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Selects&amp;nbsp;class&amp;nbsp;with&amp;nbsp;highest&amp;nbsp;probability&amp;nbsp;at&amp;nbsp;each&amp;nbsp;position&lt;/li&gt;
&lt;li&gt;Example: [0.1, 0.7, 0.2] &amp;rarr; 1 (second class)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;to_onehot=7:&lt;/b&gt;&lt;/i&gt;&lt;br /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Converts selected class to one-hot vector&lt;/li&gt;
&lt;li&gt;Example:&amp;nbsp;1&amp;nbsp;&amp;rarr;&amp;nbsp;[0,&amp;nbsp;1,&amp;nbsp;0,&amp;nbsp;0,&amp;nbsp;0,&amp;nbsp;0,&amp;nbsp;0]&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;metric_val_labels&lt;/b&gt;&lt;/i&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;decollate_batch(y):&lt;/b&gt;&lt;/i&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Separates batch into individual samples&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;AsDiscrete(to_onehot=7):&lt;/b&gt;&lt;/i&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Converts class index to one-hot vector&lt;/li&gt;
&lt;li&gt;Example: 2 &amp;rarr; [0, 0, 1, 0, 0, 0, 0]&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;def&amp;nbsp;on_validation_epoch_end(self):&lt;/b&gt;&lt;/i&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Same with &lt;i&gt;&lt;b&gt;&lt;span style=&quot;color: #000000;&quot;&gt;on_train_epoch_end(self)&lt;/span&gt;&lt;/b&gt;&lt;/i&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;&lt;span style=&quot;color: #000000;&quot;&gt;def configure_optimizers(self):&lt;/span&gt;&lt;/b&gt;&lt;/i&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Use AdamW optimizer&lt;/li&gt;
&lt;li&gt;Set learning rate&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1736924418204&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;channels = (48, 64, 80, 80)
strides_pattern = (2, 2, 1)       
num_res_units = 1
learning_rate = 1e-3
num_epochs = 100

model = Model(channels=channels, strides=strides_pattern, num_res_units=num_res_units, lr=learning_rate)&lt;/code&gt;&lt;/pre&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;6) Training the model&lt;/b&gt;&lt;/h4&gt;
&lt;pre id=&quot;code_1736924450689&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;torch.set_float32_matmul_precision('medium')

# Check if CUDA is available and then count the GPUs
if torch.cuda.is_available():
    num_gpus = torch.cuda.device_count()
    print(f&quot;Number of GPUs available: {num_gpus}&quot;)
else:
    print(&quot;No GPU available. Running on CPU.&quot;)
devices = list(range(num_gpus))
print(devices)


trainer = pl.Trainer(
    max_epochs=num_epochs,        # Total number of training epochs (100)
    #strategy=&quot;ddp_notebook&quot;,     # Distributed training strategy (currently commented)
    accelerator=&quot;gpu&quot;,           # Use GPU
    devices=[0],                 # Use only first GPU
    num_nodes=1,                 # Use single node
    log_every_n_steps=10,        # Log every 10 steps
    enable_progress_bar=True,    # Enable progress bar
)

trainer.fit(model, train_loader, valid_loader)&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;torch.set_float32_matmul_precision('medium')&lt;/b&gt;&lt;/i&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Sets precision of 32-bit floating-point matrix multiplication to 'medium'&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;Balances speed and accuracy&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;GPU&amp;nbsp;Availability&amp;nbsp;Check&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Checks for CUDA (GPU) availability&lt;/li&gt;
&lt;li&gt;Counts available GPUs&lt;/li&gt;
&lt;li&gt;Creates list of GPU indices (e.g., [0,1,2] for 3 GPUs)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Trainer Setup&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;max_epochs&lt;/b&gt;&lt;/i&gt;: Total number of training iterations&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;accelerator&lt;/b&gt;&lt;/i&gt;: Hardware to use for training (GPU/CPU)&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;devices&lt;/b&gt;&lt;/i&gt;: GPU numbers to use&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;num_nodes&lt;/b&gt;&lt;/i&gt;: Number of nodes for distributed training&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;log_every_n_steps&lt;/b&gt;&lt;/i&gt;: Logging frequency&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;enable_progress_bar&lt;/b&gt;&lt;/i&gt;: Visualize training progress&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;trainer.fit(model,&amp;nbsp;train_loader,&amp;nbsp;valid_loader)&lt;/b&gt;&lt;/i&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Training Cycle:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Each epoch loads batch data from &lt;i&gt;train_loader&lt;/i&gt;&lt;/li&gt;
&lt;li&gt;Executes &lt;i&gt;training_step&lt;/i&gt; method:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Model prediction&lt;/li&gt;
&lt;li&gt;Loss calculation&lt;/li&gt;
&lt;li&gt;Backpropagation and weight updates&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Runs &lt;i&gt;on_train_epoch_end&lt;/i&gt; at epoch end&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Validation Cycle:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Loads data from &lt;i&gt;valid_loader&lt;/i&gt; after each epoch&lt;/li&gt;
&lt;li&gt;Executes &lt;i&gt;validation_step&lt;/i&gt; method:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Model prediction&lt;/li&gt;
&lt;li&gt;Dice score calculation&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Runs &lt;i&gt;on_validation_epoch_end&lt;/i&gt; at epoch end&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;7) Predicting on the test set&lt;/b&gt;&lt;/h4&gt;
&lt;pre id=&quot;code_1736925785226&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;# Model setup
model.eval();
model.to(&quot;cuda&quot;);

# Configuration File Processing
import json
copick_config_path = TRAIN_DATA_DIR + &quot;/copick.config&quot;

with open(copick_config_path) as f:
    copick_config = json.load(f)

copick_config['static_root'] = '/kaggle/input/czii-cryo-et-object-identification/test/static'

copick_test_config_path = 'copick_test.config'

with open(copick_test_config_path, 'w') as outfile:
    json.dump(copick_config, outfile)

# Copick Setup
import copick

root = copick.from_file(copick_test_config_path)

copick_user_name = &quot;copickUtils&quot;
copick_segmentation_name = &quot;paintedPicks&quot;
voxel_size = 10
tomo_type = &quot;denoised&quot;&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;Switches the trained model to evaluation mode and prepares settings for test data&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Model Setup&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;eval()&lt;/b&gt;&lt;/i&gt;: Switches dropout, batch normalization, etc. to evaluation mode&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;to(&quot;cuda&quot;)&lt;/b&gt;&lt;/i&gt;: Moves model to GPU memory&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Configuration File Processing&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Loads original configuration file&lt;/li&gt;
&lt;li&gt;Updates test data path&lt;/li&gt;
&lt;li&gt;Saves new configuration to file&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Copick Setup&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Loads configuration using copick library&lt;/li&gt;
&lt;li&gt;Sets parameters needed for testing&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1736925804926&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;# Setting up Inference Transformations:
# Non-random transforms to be cached
inference_transforms = Compose([
    EnsureChannelFirstd(keys=[&quot;image&quot;], channel_dim=&quot;no_channel&quot;),
    NormalizeIntensityd(keys=&quot;image&quot;),
    Orientationd(keys=[&quot;image&quot;], axcodes=&quot;RAS&quot;)
])

import cc3d

id_to_name = {1: &quot;apo-ferritin&quot;, 
              2: &quot;beta-amylase&quot;,
              3: &quot;beta-galactosidase&quot;, 
              4: &quot;ribosome&quot;, 
              5: &quot;thyroglobulin&quot;, 
              6: &quot;virus-like-particle&quot;}&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Setting up Inference Transformations:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;Unlike training, no random transformations (for consistent predictions)&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;Applied transformations:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;EnsureChannelFirstd&lt;/b&gt;&lt;/i&gt;: Moves channel dimension to first position&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;NormalizeIntensityd&lt;/b&gt;&lt;/i&gt;: Normalizes image values&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;Orientationd&lt;/b&gt;&lt;/i&gt;: Aligns 3D images to RAS standard&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;cc3d&lt;/b&gt;&lt;/i&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Library for &lt;b&gt;Connected Components analysis&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;Used to find and label connected regions in 3D images&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;Iterate over test set:&lt;/b&gt;&lt;/p&gt;
&lt;ol style=&quot;list-style-type: decimal; color: #3c4043; text-align: start;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;Read in a run&lt;/li&gt;
&lt;li&gt;Split it into patches of size (96, 96, 96)&lt;/li&gt;
&lt;li&gt;Create a dataset from the patches&lt;/li&gt;
&lt;li&gt;Predict the segmentation mask&lt;/li&gt;
&lt;li&gt;Glue the mask back together&lt;/li&gt;
&lt;li&gt;Find the connected components for each class&lt;/li&gt;
&lt;li&gt;Find the centroids of the connected components&lt;/li&gt;
&lt;li&gt;Add to the dataframe&lt;/li&gt;
&lt;/ol&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Then do this for all runs.&lt;/li&gt;
&lt;li&gt;&quot;This can probably be optimized quite a bit.&quot;&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1736926940129&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;BLOB_THRESHOLD = 500
CERTAINTY_THRESHOLD = 0.5

classes = [1, 2, 3, 4, 5, 6]
with torch.no_grad():
    location_df = []
    for run in root.runs:
        print(run)
		
        # &quot;Read in a run&quot;
        tomo = run.get_voxel_spacing(10)
        tomo = tomo.get_tomogram(tomo_type).numpy()

        # &quot;Split into patches&quot;
        tomo_patches, coordinates  = extract_3d_patches_minimal_overlap([tomo], 96)

        # &quot;Create a dataset&quot;
        tomo_patched_data = [{&quot;image&quot;: img} for img in tomo_patches]
        tomo_ds = CacheDataset(data=tomo_patched_data, transform=inference_transforms, cache_rate=1.0)

        # &quot;Predict the segmentation mask&quot;
        pred_masks = []

        for i in range(len(tomo_ds)):
            input_tensor = tomo_ds[i]['image'].unsqueeze(0).to(&quot;cuda&quot;)
            model_output = model(input_tensor)

            probs = torch.softmax(model_output[0], dim=0)
            thresh_probs = probs &amp;gt; CERTAINTY_THRESHOLD
            _, max_classes = thresh_probs.max(dim=0)

            pred_masks.append(max_classes.cpu().numpy())
            
        # &quot;Glue the mask back together&quot;
        reconstructed_mask = reconstruct_array(pred_masks, coordinates, tomo.shape)
        
        location = {}

        for c in classes:
            # &quot;Find the connected components&quot;
            cc = cc3d.connected_components(reconstructed_mask == c)
            stats = cc3d.statistics(cc)
            
            # &quot;Find the centroids&quot;
            zyx=stats['centroids'][1:]*10.012444 #https://www.kaggle.com/competitions/czii-cryo-et-object-identification/discussion/544895#3040071
            zyx_large = zyx[stats['voxel_counts'][1:] &amp;gt; BLOB_THRESHOLD]
            xyz =np.ascontiguousarray(zyx_large[:,::-1])

            location[id_to_name[c]] = xyz

        # &quot;Add to the dataframe&quot;
        df = dict_to_df(location, run.name)
        location_df.append(df)
    
    location_df = pd.concat(location_df)&lt;/code&gt;&lt;/pre&gt;
&lt;pre id=&quot;code_1736927454803&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;location_df.insert(
    loc=0,                              # Insert at first position
    column='id',                        # Column name is 'id'
    value=np.arange(len(location_df))   # Sequential numbers starting from 0
)
location_df.to_csv(&quot;submission.csv&quot;, index=False)&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;Adding ID Column&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;Assigns unique ID to each predicted particle&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;Meets Kaggle submission format requirements&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Saving to CSV file&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;index=False&lt;/b&gt;&lt;/i&gt;: &lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;Excludes DataFrame index from saved file&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1736927930521&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;!cp -r /kaggle/input/hengck-czii-cryo-et-01/* .

from czii_helper import *
from dataset import *
from scipy.optimize import linear_sum_assignment
import matplotlib.pyplot as plt&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;!cp ~:&lt;/b&gt;&lt;/i&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Linux command that copies required files from a Kaggle dataset to the current working directory&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;hengck-czii-cryo-et-01 includes:&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;czii_helper.py&lt;/b&gt;&lt;/i&gt;: Utility functions for evaluation metric calculations&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;dataset.py&lt;/b&gt;&lt;/i&gt;: Functions for data loading and processing&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;PARTICLE&lt;/b&gt;&lt;/i&gt;: a constant defined in the copied files, containing characteristics of each particle type (name, radius, difficulty level, etc.)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1736928045344&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;import os
if os.getenv('KAGGLE_IS_COMPETITION_RERUN'):
    MODE = 'submit'
else:
    MODE = 'local'







valid_dir ='/kaggle/input/czii-cryo-et-object-identification/train'
valid_id = ['TS_6_4', ]

def do_one_eval(truth, predict, threshold):
    P=len(predict)
    T=len(truth)

    if P==0:
        hit=[[],[]]
        miss=np.arange(T).tolist()
        fp=[]
        metric = [P,T,len(hit[0]),len(miss),len(fp)]
        return hit, fp, miss, metric

    if T==0:
        hit=[[],[]]
        fp=np.arange(P).tolist()
        miss=[]
        metric = [P,T,len(hit[0]),len(miss),len(fp)]
        return hit, fp, miss, metric

    #---
    distance = predict.reshape(P,1,3)-truth.reshape(1,T,3)
    distance = distance**2
    distance = distance.sum(axis=2)
    distance = np.sqrt(distance)
    p_index, t_index = linear_sum_assignment(distance)

    valid = distance[p_index, t_index] &amp;lt;= threshold
    p_index = p_index[valid]
    t_index = t_index[valid]
    hit = [p_index.tolist(), t_index.tolist()]
    miss = np.arange(T)
    miss = miss[~np.isin(miss,t_index)].tolist()
    fp = np.arange(P)
    fp = fp[~np.isin(fp,p_index)].tolist()

    metric = [P,T,len(hit[0]),len(miss),len(fp)] #for lb metric F-beta copmutation
    return hit, fp, miss, metric


def compute_lb(submit_df, overlay_dir):
    valid_id = list(submit_df['experiment'].unique())
    print(valid_id)

    eval_df = []
    for id in valid_id:
        truth = read_one_truth(id, overlay_dir) #=f'{valid_dir}/overlay/ExperimentRuns')
        id_df = submit_df[submit_df['experiment'] == id]
        for p in PARTICLE:
            p = dotdict(p)
            print('\r', id, p.name, end='', flush=True)
            xyz_truth = truth[p.name]
            xyz_predict = id_df[id_df['particle_type'] == p.name][['x', 'y', 'z']].values
            hit, fp, miss, metric = do_one_eval(xyz_truth, xyz_predict, p.radius* 0.5)
            eval_df.append(dotdict(
                id=id, particle_type=p.name,
                P=metric[0], T=metric[1], hit=metric[2], miss=metric[3], fp=metric[4],
            ))
    print('')
    eval_df = pd.DataFrame(eval_df)
    gb = eval_df.groupby('particle_type').agg('sum').drop(columns=['id'])
    gb.loc[:, 'precision'] = gb['hit'] / gb['P']
    gb.loc[:, 'precision'] = gb['precision'].fillna(0)
    gb.loc[:, 'recall'] = gb['hit'] / gb['T']
    gb.loc[:, 'recall'] = gb['recall'].fillna(0)
    gb.loc[:, 'f-beta4'] = 17 * gb['precision'] * gb['recall'] / (16 * gb['precision'] + gb['recall'])
    gb.loc[:, 'f-beta4'] = gb['f-beta4'].fillna(0)

    gb = gb.sort_values('particle_type').reset_index(drop=False)
    # https://www.kaggle.com/competitions/czii-cryo-et-object-identification/discussion/544895
    gb.loc[:, 'weight'] = [1, 0, 2, 1, 2, 1]
    lb_score = (gb['f-beta4'] * gb['weight']).sum() / gb['weight'].sum()
    return gb, lb_score


#debug
if 1:
    if MODE=='local':
    #if 1:
        submit_df=pd.read_csv(
           'submission.csv'
            # '/kaggle/input/hengck-czii-cryo-et-weights-01/submission.csv'
        )
        gb, lb_score = compute_lb(submit_df, f'{valid_dir}/overlay/ExperimentRuns')
        print(gb)
        print('lb_score:',lb_score)
        print('')


        #show one ----------------------------------
        fig = plt.figure(figsize=(18, 8))

        id = valid_id[0]
        truth = read_one_truth(id,overlay_dir=f'{valid_dir}/overlay/ExperimentRuns')

        submit_df = submit_df[submit_df['experiment']==id]
        for p in PARTICLE:
            p = dotdict(p)
            xyz_truth = truth[p.name]
            xyz_predict = submit_df[submit_df['particle_type']==p.name][['x','y','z']].values
            hit, fp, miss, _ = do_one_eval(xyz_truth, xyz_predict, p.radius)
            print(id, p.name)
            print('\t num truth   :',len(xyz_truth) )
            print('\t num predict :',len(xyz_predict) )
            print('\t num hit  :',len(hit[0]) )
            print('\t num fp   :',len(fp) )
            print('\t num miss :',len(miss) )

            ax = fig.add_subplot(2, 3, p.label, projection='3d')
            if hit[0]:
                pt = xyz_predict[hit[0]]
                ax.scatter(pt[:, 0], pt[:, 1], pt[:, 2], alpha=0.5, color='r')
                pt = xyz_truth[hit[1]]
                ax.scatter(pt[:,0], pt[:,1], pt[:,2], s=80, facecolors='none', edgecolors='r')
            if fp:
                pt = xyz_predict[fp]
                ax.scatter(pt[:, 0], pt[:, 1], pt[:, 2], alpha=1, color='k')
            if miss:
                pt = xyz_truth[miss]
                ax.scatter(pt[:, 0], pt[:, 1], pt[:, 2], s=160, alpha=1, facecolors='none', edgecolors='k')

            ax.set_title(f'{p.name} ({p.difficulty})')

        plt.tight_layout()
        plt.show()
        
        #--- 
        zz=0&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;Overall &lt;/b&gt;&lt;b&gt;comprehensive evaluation and visualization of model predictions&lt;/b&gt;&lt;i&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/i&gt;&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;do_one_eval:&lt;/b&gt;&lt;/i&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;Inputs&lt;/b&gt;:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;truth&lt;/b&gt;&lt;/i&gt;: actual particle positions&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;predict&lt;/b&gt;&lt;/i&gt;: predicted particle positions&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;threshold&lt;/b&gt;&lt;/i&gt;: matching distance threshold&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;&lt;b&gt;Main process:&lt;/b&gt;&lt;/span&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;Handle exceptions (P=0 or T=0 cases)&lt;/li&gt;
&lt;li&gt;Calculate distances between predictions and truth&lt;/li&gt;
&lt;li&gt;Find optimal matching (using linear_sum_assignment)&lt;/li&gt;
&lt;li&gt;Filter valid matches based on threshold&lt;/li&gt;
&lt;li&gt;Calculate hits/misses/false positives&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Returns&lt;/b&gt;:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;hit&lt;/b&gt;&lt;/i&gt;: correct prediction indices&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;fp&lt;/b&gt;&lt;/i&gt;: false prediction indices&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;miss&lt;/b&gt;&lt;/i&gt;: missed particle indices&lt;/li&gt;
&lt;li&gt;metric: [P, T, num_hits, num_misses, num_fps]&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt; compute_lb&lt;/b&gt;&lt;/i&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;Inputs&lt;/b&gt;:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;submit_df&lt;/b&gt;&lt;/i&gt;: prediction results dataframe to submit&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;overlay_dir&lt;/b&gt;&lt;/i&gt;: ground truth data path&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;Main process:&lt;/b&gt;&lt;/span&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;Evaluate predictions for each experiment ID&lt;/li&gt;
&lt;li&gt;Calculate performance per particle type&lt;/li&gt;
&lt;li&gt;Calculate precision and recall&lt;/li&gt;
&lt;li&gt;Calculate f-beta4 score (beta=4 weights recall)&lt;/li&gt;
&lt;li&gt;Apply particle type weights&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Returns&lt;/b&gt;:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;gb&lt;/b&gt;&lt;/i&gt;: performance metrics per particle type&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;lb_score&lt;/b&gt;&lt;/i&gt;: final score&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;read_one_truth&lt;/b&gt;&lt;/i&gt;: loading the ground truth data&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;&lt;b&gt;We are scoring the lb score based on the test data we configured: TS_6_4&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style5&quot; /&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;I think it's very important to have a feedback loop, where you're constantly thinking about what you've done and how you could be doing it better.&amp;nbsp;&lt;br /&gt;- Elon Musk -&lt;/span&gt;&lt;/blockquote&gt;</description>
      <category>대회</category>
      <category>3d unet</category>
      <category>UNET</category>
      <category>캐글</category>
      <author>dongsunseng</author>
      <guid isPermaLink="true">https://dongsunseng.tistory.com/85</guid>
      <comments>https://dongsunseng.tistory.com/entry/CZII-CryoET-Object-Identification-2-Baseline-UNet-Solution#entry85comment</comments>
      <pubDate>Wed, 15 Jan 2025 17:09:33 +0900</pubDate>
    </item>
    <item>
      <title>CZII - CryoET Object Identification #1 - Training Data</title>
      <link>https://dongsunseng.tistory.com/entry/CZII-CryoET-Object-Identification-1-Training-Data</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;This post is an annotation of training data code kernel from &quot;fnands&quot;.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;a href=&quot;https://www.kaggle.com/code/fnands/create-numpy-dataset-exp-name&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://www.kaggle.com/code/fnands/create-numpy-dataset-exp-name&lt;/a&gt;&lt;/p&gt;
&lt;figure id=&quot;og_1736748925180&quot; contenteditable=&quot;false&quot; data-ke-type=&quot;opengraph&quot; data-ke-align=&quot;alignCenter&quot; data-og-type=&quot;website&quot; data-og-title=&quot;Create Numpy dataset exp name&quot; data-og-description=&quot;Explore and run machine learning code with Kaggle Notebooks | Using data from CZII - CryoET Object Identification&quot; data-og-host=&quot;www.kaggle.com&quot; data-og-source-url=&quot;https://www.kaggle.com/code/fnands/create-numpy-dataset-exp-name&quot; data-og-url=&quot;https://www.kaggle.com/code/fnands/create-numpy-dataset-exp-name&quot; data-og-image=&quot;&quot;&gt;&lt;a href=&quot;https://www.kaggle.com/code/fnands/create-numpy-dataset-exp-name&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot; data-source-url=&quot;https://www.kaggle.com/code/fnands/create-numpy-dataset-exp-name&quot;&gt;
&lt;div class=&quot;og-image&quot; style=&quot;background-image: url();&quot;&gt;&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;og-text&quot;&gt;
&lt;p class=&quot;og-title&quot; data-ke-size=&quot;size16&quot;&gt;Create Numpy dataset exp name&lt;/p&gt;
&lt;p class=&quot;og-desc&quot; data-ke-size=&quot;size16&quot;&gt;Explore and run machine learning code with Kaggle Notebooks | Using data from CZII - CryoET Object Identification&lt;/p&gt;
&lt;p class=&quot;og-host&quot; data-ke-size=&quot;size16&quot;&gt;www.kaggle.com&lt;/p&gt;
&lt;/div&gt;
&lt;/a&gt;&lt;/figure&gt;
&lt;h4 style=&quot;color: #000000;&quot; data-ke-size=&quot;size20&quot;&gt;Kernel 'Create Numpy dataset exp name'&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;&lt;b&gt;Overall this kernel is about PREPARING TRAINING DATA&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1736749010641&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;!pip install git+https://github.com/copick/copick-utils.git matplotlib tqdm copick 
!pip install -q &quot;monai-weekly[mlflow]&quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;This combination of packages creates a complete environment for processing, analyzing, and applying machine learning models to Cryo-electron microscope data.&lt;/p&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;&lt;b&gt;copick-utils&lt;/b&gt; (`git+&lt;a href=&quot;https://github.com/copick/copick-utils.git&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://github.com/copick/copick-utils.git&lt;/a&gt;`)
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;A utility library for processing Cryo-EM (Cryo-electron microscope) data&lt;/li&gt;
&lt;li&gt;Installed directly from GitHub repository&lt;/li&gt;
&lt;li&gt;Provides tools for processing, analyzing, and visualizing &lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;electron microscope images&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;matplotlib&lt;br /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Python's primary visualization library&lt;/li&gt;
&lt;li&gt;Used for creating and displaying graphs, charts, and images&lt;/li&gt;
&lt;li&gt;Essential tool for visualizing data analysis results&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;tqdm&lt;br /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Library that provides&lt;b&gt; progress bars&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;Enables real-time monitoring of long-running tasks&lt;/li&gt;
&lt;li&gt;Particularly useful when processing large datasets&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;copick&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Main library for Cryo-EM data&lt;/li&gt;
&lt;li&gt;Provides functionality for image processing, data management, and analysis&lt;/li&gt;
&lt;li&gt;Serves as the basic framework for utilizing copick-utils features&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;monai-weekly[mlflow]&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;MONAI&amp;nbsp;(Medical&amp;nbsp;Open&amp;nbsp;Network&amp;nbsp;for&amp;nbsp;AI)&amp;nbsp;is&amp;nbsp;a&amp;nbsp;&lt;b&gt;deep learning framework for medical images&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;Built on &lt;b&gt;PyTorch&lt;/b&gt; and specialized for &lt;b&gt;medical image processing&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;Key features:&lt;br /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Data preprocessing and &lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;augmentation&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;Neural network models for medical images&lt;/li&gt;
&lt;li&gt;Training and evaluation tools&lt;/li&gt;
&lt;li&gt;[mlflow] is an optional dependency where:&lt;br /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;MLflow is a platform for &lt;b&gt;tracking and managing machine learning experiments&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;Records and manages experimental results, models, and parameters&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;Helps&amp;nbsp;compare&amp;nbsp;and&amp;nbsp;reproduce&amp;nbsp;model&amp;nbsp;performance&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;The '-q' option means 'quiet' mode, which minimizes installation process output.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;pre id=&quot;code_1736750461998&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;!pip install zarr
!pip install copick&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;zarr&lt;/b&gt;:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Format and library for storing and processing N-dimensional arrays&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Main Features:&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;Chunked compression storage&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;Parallel&amp;nbsp;processing&amp;nbsp;support&lt;/li&gt;
&lt;li&gt;Hierarchical&amp;nbsp;organization&amp;nbsp;capability&lt;/li&gt;
&lt;li&gt;Cloud&amp;nbsp;storage&amp;nbsp;compatibility&lt;/li&gt;
&lt;li&gt;NumPy-compatible&amp;nbsp;interface&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Purpose:&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;Processing large-scale scientific data&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;Data sharing in distributed computing environments&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;Processing datasets larger than available memory&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Advantages:&lt;/b&gt;&lt;br /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Memory&amp;nbsp;efficient:&amp;nbsp;Can&amp;nbsp;process&amp;nbsp;data&amp;nbsp;without&amp;nbsp;loading&amp;nbsp;entire&amp;nbsp;dataset&amp;nbsp;into&amp;nbsp;memory&lt;/li&gt;
&lt;li&gt;Fast&amp;nbsp;I/O&amp;nbsp;performance:&amp;nbsp;Efficient&amp;nbsp;data&amp;nbsp;access&amp;nbsp;through&amp;nbsp;chunk-based&amp;nbsp;approach&lt;/li&gt;
&lt;li&gt;Flexible&amp;nbsp;storage&amp;nbsp;format:&amp;nbsp;Supports&amp;nbsp;various&amp;nbsp;storage&amp;nbsp;options&amp;nbsp;(local&amp;nbsp;disk,&amp;nbsp;cloud,&amp;nbsp;etc.)&lt;/li&gt;
&lt;li&gt;Parallel processing: Multiple processes can access data simultaneously&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Common Use Cases:&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Large scientific datasets (e.g., meteorological data, satellite images)&lt;/li&gt;
&lt;li&gt;Machine&amp;nbsp;learning&amp;nbsp;datasets&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Biological data (e.g., cryo-electron microscopy data)&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Relationship with &lt;b&gt;asciitree&lt;/b&gt; package:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;asciitree is used to visually represent Zarr data structure&lt;/li&gt;
&lt;li&gt;Shows&amp;nbsp;hierarchical&amp;nbsp;structure&amp;nbsp;of&amp;nbsp;Zarr&amp;nbsp;arrays&amp;nbsp;in&amp;nbsp;tree&amp;nbsp;format&amp;nbsp;in&amp;nbsp;terminal&lt;/li&gt;
&lt;li&gt;While asciitree is necessary for visualizing Zarr structures, it can sometimes be challenging to install or use&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1736752987904&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;# Make a copick project
import os
import shutil

# Define configuration for protein structures and project settings
config_blob = &quot;&quot;&quot;{
   &quot;name&quot;: &quot;czii_cryoet_mlchallenge_2024&quot;,
   &quot;description&quot;: &quot;2024 CZII CryoET ML Challenge training data.&quot;,
   &quot;version&quot;: &quot;1.0.0&quot;,

   &quot;pickable_objects&quot;: [
       {
           &quot;name&quot;: &quot;apo-ferritin&quot;,
           &quot;is_particle&quot;: true,
           &quot;pdb_id&quot;: &quot;4V1W&quot;,
           &quot;label&quot;: 1,
           &quot;color&quot;: [0, 117, 220, 128],
           &quot;radius&quot;: 60,
           &quot;map_threshold&quot;: 0.0418
       },
       {
           &quot;name&quot;: &quot;beta-amylase&quot;,
           &quot;is_particle&quot;: true,
           &quot;pdb_id&quot;: &quot;1FA2&quot;, 
           &quot;label&quot;: 2,
           &quot;color&quot;: [153, 63, 0, 128],
           &quot;radius&quot;: 65,
           &quot;map_threshold&quot;: 0.035
       },
       {
           &quot;name&quot;: &quot;beta-galactosidase&quot;,
           &quot;is_particle&quot;: true,
           &quot;pdb_id&quot;: &quot;6X1Q&quot;,
           &quot;label&quot;: 3,
           &quot;color&quot;: [76, 0, 92, 128],
           &quot;radius&quot;: 90,
           &quot;map_threshold&quot;: 0.0578
       },
       {
           &quot;name&quot;: &quot;ribosome&quot;,
           &quot;is_particle&quot;: true,
           &quot;pdb_id&quot;: &quot;6EK0&quot;,
           &quot;label&quot;: 4,
           &quot;color&quot;: [0, 92, 49, 128],
           &quot;radius&quot;: 150,
           &quot;map_threshold&quot;: 0.0374
       },
       {
           &quot;name&quot;: &quot;thyroglobulin&quot;,
           &quot;is_particle&quot;: true,
           &quot;pdb_id&quot;: &quot;6SCJ&quot;,
           &quot;label&quot;: 5,
           &quot;color&quot;: [43, 206, 72, 128],
           &quot;radius&quot;: 130,
           &quot;map_threshold&quot;: 0.0278
       },
       {
           &quot;name&quot;: &quot;virus-like-particle&quot;,
           &quot;is_particle&quot;: true,
           &quot;pdb_id&quot;: &quot;6N4V&quot;,            
           &quot;label&quot;: 6,
           &quot;color&quot;: [255, 204, 153, 128],
           &quot;radius&quot;: 135,
           &quot;map_threshold&quot;: 0.201
       }
   ],

   &quot;overlay_root&quot;: &quot;/kaggle/working/overlay&quot;,
   &quot;overlay_fs_args&quot;: {
       &quot;auto_mkdir&quot;: true
   },
   &quot;static_root&quot;: &quot;/kaggle/input/czii-cryo-et-object-identification/train/static&quot;
}&quot;&quot;&quot;

# Define paths
copick_config_path = &quot;/kaggle/working/copick.config&quot;
output_overlay = &quot;/kaggle/working/overlay&quot;

# Write configuration file
with open(copick_config_path, &quot;w&quot;) as f:
   f.write(config_blob)
   
# Update the overlay
# Define source and destination directories
source_dir = '/kaggle/input/czii-cryo-et-object-identification/train/overlay'
destination_dir = '/kaggle/working/overlay'

# Walk through the source directory
for root, dirs, files in os.walk(source_dir):
   # Create corresponding subdirectories in the destination
   relative_path = os.path.relpath(root, source_dir)
   target_dir = os.path.join(destination_dir, relative_path)
   os.makedirs(target_dir, exist_ok=True)
   
   # Copy and rename each file
   for file in files:
       # Add prefix 'curation_0_' if not already present
       if file.startswith(&quot;curation_0_&quot;):
           new_filename = file
       else:
           new_filename = f&quot;curation_0_{file}&quot;
       
       # Define full paths for the source and destination files
       source_file = os.path.join(root, file)
       destination_file = os.path.join(target_dir, new_filename)
       
       # Copy the file with the new name
       shutil.copy2(source_file, destination_file)
       print(f&quot;Copied {source_file} to {destination_file}&quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;This code sets up a project for the competition:&lt;/li&gt;
&lt;li&gt;&lt;b&gt;shutil&lt;/b&gt;:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;shutil is a Python standard library - it stands for &quot;shell utility&quot;&lt;/li&gt;
&lt;li&gt;It provides high-level file operations such as copying, moving, and removing files and file collections&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&amp;lt;&amp;lt;config_blob = &quot;&quot;&quot;...&quot;&quot;&quot;&amp;gt;&amp;gt; part&lt;/span&gt;&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;Contains information about 6 protein structures:&lt;/b&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;apo-ferritin: Iron storage protein&lt;/li&gt;
&lt;li&gt;beta-amylase: Enzyme protein&lt;/li&gt;
&lt;li&gt;beta-galactosidase: Sugar breakdown enzyme&lt;/li&gt;
&lt;li&gt;ribosome: Protein synthesis structure&lt;/li&gt;
&lt;li&gt;thyroglobulin: Thyroid hormone precursor&lt;/li&gt;
&lt;li&gt;virus-like-particle: Virus-like particle&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;Attributes defined for each structure:&lt;/b&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;name: Structure name&lt;/li&gt;
&lt;li&gt;is_particle: Particle status&lt;/li&gt;
&lt;li&gt;pdb_id: Protein Data Bank ID&lt;/li&gt;
&lt;li&gt;label: Classification label (1-6)&lt;/li&gt;
&lt;li&gt;color: RGBA color value ([R,G,B,A])&lt;/li&gt;
&lt;li&gt;radius: Particle radius&lt;/li&gt;
&lt;li&gt;map_threshold: Mapping threshold&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&quot;overlay_root&quot;:&amp;nbsp;&quot;/kaggle/working/overlay&quot;&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;&lt;b&gt;Specifies the root directory where generated data (overlays) will be stored&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;Represents the working directory for use in Kaggle environment&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;&lt;b&gt;/kaggle/working/ is a writable directory in Kaggle notebooks&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&quot;overlay_fs_args&quot;:&amp;nbsp;{&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&quot;auto_mkdir&quot;:&amp;nbsp;true&lt;br /&gt;}&lt;/b&gt;&lt;br /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Sets file system related arguments&lt;/li&gt;
&lt;li&gt;&lt;b&gt;auto_mkdir: true means it will automatically create directories if they don't exist&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;Creates necessary paths automatically when saving files or data&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&quot;static_root&quot;:&amp;nbsp;&quot;/kaggle/input/czii-cryo-et-object-identification/train/static&quot;&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;&lt;b&gt;Specifies the path where original or unchanging static data is stored&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;&lt;b&gt;Path to input data for the Kaggle competition&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;&lt;b&gt;/kaggle/input/ is the read-only data directory provided by Kaggle&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;These&amp;nbsp;configurations&amp;nbsp;define&amp;nbsp;in&amp;nbsp;the&amp;nbsp;Kaggle&amp;nbsp;environment:&lt;br /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;Where to read data from&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Where&amp;nbsp;to&amp;nbsp;store&amp;nbsp;processed&amp;nbsp;results&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;How&amp;nbsp;to&amp;nbsp;manage&amp;nbsp;the&amp;nbsp;file&amp;nbsp;system&lt;/b&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #99cefa;&quot;&gt;&lt;b&gt;is_particle: (Particle status):&lt;/b&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Set to true in the data&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Indicates whether the object should be treated as an independent particle&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;true means this structure is an individually identifiable, separate particle&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;This affects how the object is handled during image processing and analysis&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #99cefa;&quot;&gt;&lt;b&gt; pdb_id: (Protein Data Bank ID)&lt;/b&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;Unique identifier&lt;/b&gt; like &quot;6N4V&quot;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;PDB (Protein Data Bank) is a global database storing 3D structural information of proteins and nucleic acids&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;This ID allows access to detailed structural information of the molecule&lt;/li&gt;
&lt;li&gt;For example, &quot;6N4V&quot; for virus-like-particle is a unique identifier storing atomic-level details of this structure&lt;/li&gt;
&lt;li&gt;Detailed information can be viewed by searching this ID on the PDB website (rcsb.org)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #99cefa;&quot;&gt;&lt;b&gt;Radius:&lt;/b&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Typically measured in Angstroms (&amp;Aring;) or nanometers (nm)&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Reflects the actual physical size of virus-like particles&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;Set based on average particle size visible in electron microscope images&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #99cefa;&quot;&gt;&lt;b&gt;map_threshold:&lt;/b&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;Threshold value for identifying particles in electron density maps&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Higher values mean stricter particle identification criteria&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;0.201 is significantly higher than other particles (e.g., apo-ferritin's 0.0418, beta-amylase's 0.035)&lt;/li&gt;
&lt;li&gt;This might be because virus-like particles &lt;b&gt;show stronger contrast in electron microscope images&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #99cefa;&quot;&gt;&amp;nbsp;&lt;b&gt;color:&lt;/b&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;RGBA: color values with transparency %&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Set for visualization purposes, doesn't affect analysis&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;Last value 128 indicates transparency (middle value in 0-255 range)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;&amp;nbsp;File System Setup:&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;i&gt;copick_config_path&amp;nbsp;=&amp;nbsp;&quot;/kaggle/working/copick.config&quot;&lt;/i&gt;&lt;br /&gt;&lt;i&gt;output_overlay&amp;nbsp;=&amp;nbsp;&quot;/kaggle/working/overlay&quot;&lt;/i&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Specifies paths for configuration file and output directory&lt;/li&gt;
&lt;li&gt;Set up for Kaggle environment&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;For loop part:&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;&lt;i&gt; for&amp;nbsp;root,&amp;nbsp;dirs,&amp;nbsp;files&amp;nbsp;in&amp;nbsp;os.walk(source_dir):&lt;/i&gt;&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Uses &lt;b&gt;os.walk&lt;/b&gt; to&lt;b&gt; traverse all files and subdirectories in source directory&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;Creates identical directory structure at destination&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;i&gt;for file in files:&lt;/i&gt;&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;if else clause:&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;Adds &quot;curation_0_&quot; prefix to all file names&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Keeps files that already have the prefix unchanged&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;shutil.copy2(source_file, destination_file):&lt;/b&gt;&lt;/i&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;Uses shutil.copy2 to copy files&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;Also copies metadata (creation time, modification time, etc.)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;&lt;b&gt;Overall prepares and structures the dataset needed for training machine learning models, specifically for identifying and classifying various protein structures captured by cryo-electron microscopy.&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1736771463940&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;import os
import numpy as np
from pathlib import Path
import torch
import torchinfo
import zarr, copick
from tqdm import tqdm
from monai.data import DataLoader, Dataset, CacheDataset, decollate_batch
from monai.transforms import (
    Compose, 
    EnsureChannelFirstd, 
    Orientationd,  
    AsDiscrete,  
    RandFlipd, 
    RandRotate90d, 
    NormalizeIntensityd,
    RandCropByLabelClassesd,
)
from monai.networks.nets import UNet
from monai.losses import DiceLoss, FocalLoss, TverskyLoss
from monai.metrics import DiceMetric, ConfusionMatrixMetric
import mlflow
import mlflow.pytorch&lt;/code&gt;&lt;/pre&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;Preparing the dataset&lt;/b&gt;&lt;/h4&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;1. Get copick root&lt;/b&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1736771544698&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;root = copick.from_file(copick_config_path)

copick_user_name = &quot;copickUtils&quot;
copick_segmentation_name = &quot;paintedPicks&quot;
voxel_size = 10
tomo_type = &quot;denoised&quot;&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;Initializing the basic configuration of the copick project&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;root&amp;nbsp;=&amp;nbsp;copick.from_file(copick_config_path)&lt;/b&gt;&lt;/i&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;Initializes a copick object&lt;/b&gt;&lt;/span&gt; by reading the configuration file from copick_config_path&lt;/li&gt;
&lt;li&gt;Loads settings including protein structure information and paths into this object&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;copick_user_name&amp;nbsp;=&amp;nbsp;&quot;copickUtils&quot;&lt;/b&gt;&lt;/i&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Sets an &lt;b&gt;identifier&lt;/b&gt; for the user/tool performing the work&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;Used to track and distinguish results&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;copick_segmentation_name&amp;nbsp;=&amp;nbsp;&quot;paintedPicks&quot;&lt;/b&gt;&lt;/i&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Specifies the name for &lt;b&gt;segmentation (image region distinction) results&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;Results will be saved and referenced using this name&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt; voxel_size&amp;nbsp;=&amp;nbsp;10&lt;/b&gt;&lt;/i&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Sets the voxel size that defines &lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;the resolution of 3D images&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;A voxel is the basic unit of 3D images, similar to pixels in 2D images&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;tomo_type&amp;nbsp;=&amp;nbsp;&quot;denoised&quot;&lt;/b&gt;&lt;/i&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;Specifies the type of tomogram (3D image) data to use&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;&quot;denoised&quot; means using processed images with noise removed&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;Noise removal improves image quality and facilitates analysis&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;2.&amp;nbsp;Generate&amp;nbsp;multi-class&amp;nbsp;segmentation&amp;nbsp;masks&amp;nbsp;from&amp;nbsp;picks,&amp;nbsp;and&amp;nbsp;saved&amp;nbsp;them&amp;nbsp;to&amp;nbsp;the&amp;nbsp;copick&amp;nbsp;overlay&amp;nbsp;directory&amp;nbsp;(one-time)&lt;/b&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1736771572833&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;# Import segmentation-related utilities
from copick_utils.segmentation import segmentation_from_picks
import copick_utils.writers.write as write
from collections import defaultdict

# Just do this once
generate_masks = True

if generate_masks:
    # Stores label and radius information for each particle in a dictionary
    # Only processes objects where is_particle is true
    target_objects = defaultdict(dict)
    for object in root.pickable_objects:
        if object.is_particle:
            target_objects[object.name]['label'] = object.label
            target_objects[object.name]['radius'] = object.radius

    # Process Tomograms and Create Masks
    for run in tqdm(root.runs):
        # Get tomogram data
        tomo = run.get_voxel_spacing(10)
        tomo = tomo.get_tomogram(tomo_type).numpy()
        
        # Create empty target array
        target = np.zeros(tomo.shape, dtype=np.uint8)
        
        # Generate Segmentation Masks
        for pickable_object in root.pickable_objects:
            pick = run.get_picks(object_name=pickable_object.name, user_id=&quot;curation&quot;)
            if len(pick):  
                target = segmentation_from_picks.from_picks(pick[0], 
                                                            target, 
                                                            target_objects[pickable_object.name]['radius'] * 0.8,
                                                            target_objects[pickable_object.name]['label']
                                                            )
        write.segmentation(run, target, copick_user_name, name=copick_segmentation_name)&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;i&gt;&lt;b&gt;from collections import defaultdict&lt;/b&gt;&lt;/i&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;i&gt;&lt;/i&gt;defaultdict &lt;b&gt;automatically handles default values for missing dictionary keys&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;i&gt;&lt;b&gt;generate_masks = True&lt;/b&gt;&lt;/i&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Flag that controls whether to generate segmentation masks or not&lt;/li&gt;
&lt;li&gt;Generating segmentation masks is a &lt;b&gt;time-consuming&lt;/b&gt; operation&lt;/li&gt;
&lt;li&gt;It only needs to be done once (this is why the comment says &quot;Just do this once&quot;)&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;&lt;b&gt;A segmentation mask is a binary or multi-class label map used to distinguish specific objects or regions in an image&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;It's used to distinguish 6 different protein structures&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;Each structure has a unique label (1-6)&lt;/li&gt;
&lt;li&gt;Background is marked as 0&lt;/li&gt;
&lt;li&gt;&lt;b&gt;The mask is in 3D form, indicating which structure each voxel belongs to&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;for&amp;nbsp;run&amp;nbsp;in&amp;nbsp;tqdm(root.runs):&lt;/b&gt;&lt;/i&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Gets tomogram data for each run&lt;/li&gt;
&lt;li&gt;Retrieves data at specified voxel size (10)&lt;/li&gt;
&lt;li&gt;Converts to numpy array for processing&lt;/li&gt;
&lt;li&gt;Creates empty array for storing segmentation masks&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;i&gt;&lt;b&gt; for pickable_object in root.pickable_objects:&lt;/b&gt;&lt;/i&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;i&gt;&lt;/i&gt;For each object:&lt;/span&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Gets pick information&lt;/li&gt;
&lt;li&gt;Creates segmentation mask if pick exists&lt;/li&gt;
&lt;li&gt;Uses 80% of radius (* 0.8) for mask creation&lt;/li&gt;
&lt;li&gt;Uses object's label&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;write.segmentation(run,&amp;nbsp;target,&amp;nbsp;copick_user_name,&amp;nbsp;name=copick_segmentation_name)&lt;/b&gt;&lt;/i&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Saves generated segmentation masks&lt;/li&gt;
&lt;li&gt;Saves&amp;nbsp;with&amp;nbsp;specified&amp;nbsp;user&amp;nbsp;name&amp;nbsp;and&amp;nbsp;segmentation&amp;nbsp;name&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;i&gt;&lt;b&gt;root.runs:&lt;/b&gt;&lt;/i&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;Represents each experimental run in the dataset&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;In this code, we can see there are 7 experimental datasets:
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;TS_86_3&lt;/li&gt;
&lt;li&gt;TS_6_6&lt;/li&gt;
&lt;li&gt;TS_6_4&lt;/li&gt;
&lt;li&gt;TS_5_4&lt;/li&gt;
&lt;li&gt;TS_73_6&lt;/li&gt;
&lt;li&gt;TS_99_9&lt;/li&gt;
&lt;li&gt;TS_69_2&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;&amp;nbsp;Each run represents one electron microscope imaging session&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;Therefore, for run in tqdm(root.runs)::
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;For each experimental session (TS_*)&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;&lt;b&gt;Retrieves the tomogram data from that session&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;&lt;b&gt;Locates each protein&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;&lt;b&gt;Generates segmentation masks&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;Each &lt;i&gt;&lt;b&gt;run&lt;/b&gt;&lt;/i&gt; contains:&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Tomogram data (run.get_tomogram())&lt;/li&gt;
&lt;li&gt;Protein location information (run.get_picks())&lt;/li&gt;
&lt;li&gt;Other metadata&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;3.&amp;nbsp;Get&amp;nbsp;tomograms&amp;nbsp;and&amp;nbsp;their&amp;nbsp;segmentaion&amp;nbsp;masks&amp;nbsp;(from&amp;nbsp;picks)&amp;nbsp;arrays&lt;/b&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1736771610077&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;data_dicts = []  # Create empty list to store data
for run in tqdm(root.runs):  # Iterate over 7 experimental datasets
    # Get tomogram data
    tomogram = run.get_voxel_spacing(voxel_size)  # Get data at resolution set to voxel_size=10
    tomogram = tomogram.get_tomogram(tomo_type)   # Get &quot;denoised&quot; type tomogram
    tomogram = tomogram.numpy()                    # Convert to numpy array

    # Get segmentation masks
    segmentation = run.get_segmentations(
        name=copick_segmentation_name,    # &quot;paintedPicks&quot;
        user_id=copick_user_name,         # &quot;copickUtils&quot;
        voxel_size=voxel_size,           # 10
        is_multilabel=True               # Mask distinguishing multiple classes (proteins)
    )[0].numpy()

    # Add to data dictionary
    data_dicts.append({
        &quot;name&quot;: run.name,        # Experiment name (e.g., &quot;TS_86_3&quot;)
        &quot;image&quot;: tomogram,       # Tomogram data
        &quot;label&quot;: segmentation    # Segmentation mask
    })

# Print label values from first data
print(np.unique(data_dicts[0]['label']))  # Outputs [0 1 2 3 4 5 6]&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;Collects tomograms and segmentation masks from each experimental data&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;Results explanation:&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;[0 1 2 3 4 5 6] are all unique values in the mask:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;0: Background&lt;/li&gt;
&lt;li&gt;1: apo-ferritin&lt;/li&gt;
&lt;li&gt;2: beta-amylase&lt;/li&gt;
&lt;li&gt;3: beta-galactosidase&lt;/li&gt;
&lt;li&gt;4: ribosome&lt;/li&gt;
&lt;li&gt;5: thyroglobulin&lt;/li&gt;
&lt;li&gt;6: virus-like-particle&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;Each dictionary created for experimental data includes:&lt;/span&gt;&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Experiment name&lt;/li&gt;
&lt;li&gt;Original image (tomogram)&lt;/li&gt;
&lt;li&gt;Segmentation mask (labels)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;This prepared data can be used later for training machine learning models.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1736787726650&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;# For each of the 7 experimental datasets
for i in range(7):
    # Save image (tomogram) data
    with open(f&quot;train_image_{data_dicts[i]['name']}.npy&quot;, 'wb') as f:
        np.save(f, data_dicts[i]['image'])
    
    # Save label (segmentation mask) data    
    with open(f&quot;train_label_{data_dicts[i]['name']}.npy&quot;, 'wb') as f:
        np.save(f, data_dicts[i]['label'])&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;saves the previously created data to files&lt;/li&gt;
&lt;li&gt;Specifically:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Two .npy files are created for each experiment:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;train_image_TS_XX_X.npy: tomogram data&lt;/li&gt;
&lt;li&gt;train_label_TS_XX_X.npy: segmentation mask&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;File format:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;.npy&lt;/b&gt;: NumPy's array storage format&lt;/li&gt;
&lt;li&gt;'wb': open file in binary write mode&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style5&quot; /&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;I&amp;nbsp;could&amp;nbsp;either&amp;nbsp;watch&amp;nbsp;it&amp;nbsp;happen&amp;nbsp;or&amp;nbsp;be&amp;nbsp;a&amp;nbsp;part&amp;nbsp;of&amp;nbsp;it.&lt;br /&gt;- Elon Musk -&lt;/span&gt;&lt;/blockquote&gt;</description>
      <category>대회</category>
      <category>czii - cryoet object identification</category>
      <category>segmentation</category>
      <category>training data</category>
      <category>캐글</category>
      <author>dongsunseng</author>
      <guid isPermaLink="true">https://dongsunseng.tistory.com/84</guid>
      <comments>https://dongsunseng.tistory.com/entry/CZII-CryoET-Object-Identification-1-Training-Data#entry84comment</comments>
      <pubDate>Tue, 14 Jan 2025 02:09:34 +0900</pubDate>
    </item>
    <item>
      <title>[LLM] 1. Prompt Engineering Basics #1</title>
      <link>https://dongsunseng.tistory.com/entry/LLM-1-Prompt-Engineering-Basics-1</link>
      <description>&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;img.jpg&quot; data-origin-width=&quot;1600&quot; data-origin-height=&quot;700&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bcvrJ2/btsLEKKkkLo/yfAujDk32Sf5WmRckkbrQk/img.jpg&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bcvrJ2/btsLEKKkkLo/yfAujDk32Sf5WmRckkbrQk/img.jpg&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bcvrJ2/btsLEKKkkLo/yfAujDk32Sf5WmRckkbrQk/img.jpg&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbcvrJ2%2FbtsLEKKkkLo%2FyfAujDk32Sf5WmRckkbrQk%2Fimg.jpg&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1600&quot; height=&quot;700&quot; data-filename=&quot;img.jpg&quot; data-origin-width=&quot;1600&quot; data-origin-height=&quot;700&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;This post heavily relies on this lecture:&lt;/b&gt;&lt;/p&gt;
&lt;figure id=&quot;og_1735989068042&quot; contenteditable=&quot;false&quot; data-ke-type=&quot;opengraph&quot; data-ke-align=&quot;alignCenter&quot; data-og-type=&quot;website&quot; data-og-title=&quot;개발자를 위한 ChatGPT 프롬프트 엔지니어링&quot; data-og-description=&quot;2시간 이내에 이 안내 프로젝트를 완료하세요. 채팅 상자를 넘어서세요. API 액세스를 사용하여 자체 애플리케이션에 LLM을 활용하고 맞춤형 챗봇을 구축하는 방법을 배워보세요. 개발자를 위한 C&quot; data-og-host=&quot;www.coursera.org&quot; data-og-source-url=&quot;https://www.coursera.org/projects/chatgpt-prompt-engineering-for-developers-project&quot; data-og-url=&quot;https://www.coursera.org/projects/chatgpt-prompt-engineering-for-developers-project&quot; data-og-image=&quot;https://scrap.kakaocdn.net/dn/cqRudn/hyXWvHltCv/ehg30cQfez3TDOQ7SbLTu0/img.jpg?width=1772&amp;amp;height=928&amp;amp;face=1168_66_1242_146,https://scrap.kakaocdn.net/dn/b7sMd2/hyXWzJImpW/kkPxTzeKJitgtj3UByTKek/img.jpg?width=1772&amp;amp;height=928&amp;amp;face=1168_66_1242_146,https://scrap.kakaocdn.net/dn/eSUTq/hyXWtJw9JN/ims9kLuLSUui0nryRITg5k/img.jpg?width=2048&amp;amp;height=808&amp;amp;face=0_0_2048_808&quot;&gt;&lt;a href=&quot;https://www.coursera.org/projects/chatgpt-prompt-engineering-for-developers-project&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot; data-source-url=&quot;https://www.coursera.org/projects/chatgpt-prompt-engineering-for-developers-project&quot;&gt;
&lt;div class=&quot;og-image&quot; style=&quot;background-image: url('https://scrap.kakaocdn.net/dn/cqRudn/hyXWvHltCv/ehg30cQfez3TDOQ7SbLTu0/img.jpg?width=1772&amp;amp;height=928&amp;amp;face=1168_66_1242_146,https://scrap.kakaocdn.net/dn/b7sMd2/hyXWzJImpW/kkPxTzeKJitgtj3UByTKek/img.jpg?width=1772&amp;amp;height=928&amp;amp;face=1168_66_1242_146,https://scrap.kakaocdn.net/dn/eSUTq/hyXWtJw9JN/ims9kLuLSUui0nryRITg5k/img.jpg?width=2048&amp;amp;height=808&amp;amp;face=0_0_2048_808');&quot;&gt;&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;og-text&quot;&gt;
&lt;p class=&quot;og-title&quot; data-ke-size=&quot;size16&quot;&gt;개발자를 위한 ChatGPT 프롬프트 엔지니어링&lt;/p&gt;
&lt;p class=&quot;og-desc&quot; data-ke-size=&quot;size16&quot;&gt;2시간 이내에 이 안내 프로젝트를 완료하세요. 채팅 상자를 넘어서세요. API 액세스를 사용하여 자체 애플리케이션에 LLM을 활용하고 맞춤형 챗봇을 구축하는 방법을 배워보세요. 개발자를 위한 C&lt;/p&gt;
&lt;p class=&quot;og-host&quot; data-ke-size=&quot;size16&quot;&gt;www.coursera.org&lt;/p&gt;
&lt;/div&gt;
&lt;/a&gt;&lt;/figure&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;Two types of LLMs&lt;/b&gt;&lt;/h4&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;&lt;b&gt;Base LLM&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Predicts next word based on text training data&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Instruction Tuned LLM&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Tries to follow instructions&lt;/li&gt;
&lt;li&gt;Fine-tune on instructions and good attempts at following those instructions&lt;/li&gt;
&lt;li&gt;Often further refined using RLHF technique
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;RLHF: Reinforcement Learning with Human&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Trained to be Helpful, Honest, and Harmless&lt;/li&gt;
&lt;li&gt;Thus, likely to be less toxic than Base LLM&lt;/li&gt;
&lt;li&gt;Recommended to be used for practical usages&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;Guidelines for Prompting&lt;/b&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;First Principle: &lt;span style=&quot;background-color: #ffc9af;&quot;&gt;Write clear and specific instructions&lt;/span&gt;&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Clear prompt doesn't mean short prompt&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Detailed tactics:&lt;/b&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;Use &lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;delimiters&lt;/b&gt;&lt;/span&gt; to clearly indicate distinct parts of the input
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Delimiters can be anything like: ```, &quot;&quot;&quot;, &amp;lt; &amp;gt;,&lt;span&gt;&amp;nbsp;&lt;/span&gt;&amp;lt;tag&amp;gt; &amp;lt;/tag&amp;gt;,&lt;span&gt; ---, etc&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span&gt;Delimiters can also avoid &lt;span style=&quot;background-color: #99cefa;&quot;&gt;&lt;b&gt;prompt injections&lt;/b&gt;&lt;/span&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span&gt;prompt injection: if a user is allowed to add some input into your prompt, they might give kind of conflicting instructions to the model that might make it follow the user's instructions rather than doing what you wanted it to do&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span&gt;In other words, model can successfully distinguish the input part and the instruction part to avoid confusions&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&amp;nbsp;Ask for &lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;structured&lt;/b&gt; &lt;b&gt;output&lt;/b&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;for example: JSON, HTML&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Ask&amp;nbsp;the&amp;nbsp;model&amp;nbsp;to&amp;nbsp;check&amp;nbsp;whether&amp;nbsp;conditions&amp;nbsp;are&amp;nbsp;satisfied&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Example: If the text does not contain a sequence of instructions, then simply write &quot;No steps provided&quot;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&quot;Few-shot&quot;&lt;/b&gt; prompting&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Second Principle: &lt;span style=&quot;background-color: #ffc9af;&quot;&gt;Give the model time to THINK&lt;/span&gt;&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;Detailed tactics:&lt;/b&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;&lt;b&gt;Specify the steps required to complete a task&lt;/b&gt;&lt;b&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Instruct&amp;nbsp;the&amp;nbsp;model&amp;nbsp;to&amp;nbsp;work&amp;nbsp;out&amp;nbsp;its&amp;nbsp;own&amp;nbsp;solution&amp;nbsp;before&amp;nbsp;rushing&amp;nbsp;to&amp;nbsp;a&amp;nbsp;conclusion&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;&lt;b&gt;When we simply provide sample answer and ask if that is correct, the model might skim read it and simply say that it is correct without fully thinking about the answer&amp;nbsp;&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;&lt;b&gt;Therefore, we should first ask to draw its own solution first and then make the model compare its own and the sample answer&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;Model limitations: Hallucinations&lt;/b&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Even though large language models are exposed to a vast amount of knowledge during its training process, it has not perfectly memorized the information it have seen and so it doesn't know the boundary of its knowledge very well.&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;Thus, those models are likely to make statements that sound plausible but are not true.&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;How to reduce hallucinations:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;First ask the model to find any relevant quotes from the text and then ask it to use those quotes to answer the question&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style5&quot; /&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;If&amp;nbsp;you&amp;nbsp;need&amp;nbsp;inspiration,&amp;nbsp;don't&amp;nbsp;do&amp;nbsp;it.&lt;br /&gt;&amp;nbsp;-Elon Musk-&lt;/span&gt;&lt;/blockquote&gt;</description>
      <category>NLP</category>
      <category>llm</category>
      <category>Prompt Engineering</category>
      <author>dongsunseng</author>
      <guid isPermaLink="true">https://dongsunseng.tistory.com/82</guid>
      <comments>https://dongsunseng.tistory.com/entry/LLM-1-Prompt-Engineering-Basics-1#entry82comment</comments>
      <pubDate>Sun, 5 Jan 2025 23:28:14 +0900</pubDate>
    </item>
    <item>
      <title>[NLP] 3. How does Transformer Work?</title>
      <link>https://dongsunseng.tistory.com/entry/NLP-3-How-does-Transformer-Work</link>
      <description>&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;Background&lt;/b&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Transformer, introduced by Google in 2017 for natural language processing, is a language model that's leading innovation in the AI field.&amp;nbsp;&lt;/li&gt;
&lt;li&gt;ChatGPT, which first enabled us to use AI through web and API interfaces, is also based on Transformer, as are the language models that companies like Google and Facebook are developing as competitors.&lt;/li&gt;
&lt;li&gt;Transformer is expected to achieve state-of-the-art performance not only in natural language processing but also in other fields like computer vision and speech recognition.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;Shift from CNN Dominance to Transformer&lt;/b&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Deep learning can be traced back to the Perceptron of the 1950s, which was inspired by human neurons.&lt;/li&gt;
&lt;li&gt;However, deep learning faced a dark age until the early 2010s due to insufficient computing power and more importantly lack of data for analysis during the 1990s-2000s.&lt;/li&gt;
&lt;li&gt;However, in the 2010s, data increased explosively through smartphones and social media.&lt;/li&gt;
&lt;li&gt;In 2012, AlexNet, using deep learning, became a breakthrough in the ImageNet Challenge (classifying 1000 images) by improving image classification accuracy by more than 10% from the previous 70-80%.&lt;/li&gt;
&lt;li&gt;AlexNet consists of 5 CNN layers and 3 FC layers.&lt;/li&gt;
&lt;li&gt;After that, computer vision field development mainly focused on CNN-based models, and ResNet, which emerged in 2015, achieved an image recognition error rate of around 3%, similar to human performance.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;Natural Language Processing's History&lt;/b&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;In contrast, for natural language processing, RNN, which is an artificial neural network for processing sequential data like text, emerged in the 1980s, and its improved version LSTM came out in 1997.&amp;nbsp;&lt;/li&gt;
&lt;li&gt;However, they couldn't solve the long-term dependencies problem for a while, where it became difficult to remember previous data as input sentences got longer.&lt;/li&gt;
&lt;li&gt;There were also attempts to analyze sentence sentiment by creating embedding vectors using CNN, which was popular at the time.&lt;/li&gt;
&lt;li&gt;The &lt;b&gt;Sequence to Sequence language model&lt;/b&gt;, introduced in 2014, is considered one of the greatest inventions in natural language processing history.&lt;/li&gt;
&lt;li&gt;It could not only convert existing sentences into numerical values but also generate new sentences using these values.&lt;/li&gt;
&lt;li&gt;Machine Translation is a typical example, such as generating English sentences from Korean input.&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #99cefa;&quot;&gt;&lt;b&gt;However, the Seq2Seq model still had RNN's chronic problem where it struggled to remember previous information as input sentences got longer, as it used RNN in both the encoder(processing input sentences) and decoder(generating new sentences).&amp;nbsp;&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;&lt;b&gt;Also, information loss occurred when trying to reconstruct target sentences using only the numerical information from the encoder's last timestep.&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;This issue was later resolved with the addition of Attention, enabling translation regardless of sentence length.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;RNN's Main Problem(Summary)&lt;/b&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;They &lt;b&gt;process&lt;/b&gt; the input data &lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;sequentially&lt;/b&gt;&lt;/span&gt;, one after the other. Such a recurrent process does not make use of modern graphics processing units (GPUs), which were designed for parallel computation and, thus, makes the training of such models quite slow.&lt;/li&gt;
&lt;li&gt;They become quite ineffective &lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;when elements are distant&lt;/b&gt;&lt;/span&gt; from one another. This is due to the fact that information is passed at each step and the longer the chain is, the more probable the information is lost along the chain.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;Attention?&lt;/b&gt;&lt;/h4&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-01-02 오후 11.46.15.png&quot; data-origin-width=&quot;649&quot; data-origin-height=&quot;440&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/ch1jB3/btsLCERDjJp/aE2CnsnOV3i4kFZaEpe7xK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/ch1jB3/btsLCERDjJp/aE2CnsnOV3i4kFZaEpe7xK/img.png&quot; data-alt=&quot;https://wikidocs.net/22893&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/ch1jB3/btsLCERDjJp/aE2CnsnOV3i4kFZaEpe7xK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fch1jB3%2FbtsLCERDjJp%2FaE2CnsnOV3i4kFZaEpe7xK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;649&quot; height=&quot;440&quot; data-filename=&quot;스크린샷 2025-01-02 오후 11.46.15.png&quot; data-origin-width=&quot;649&quot; data-origin-height=&quot;440&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;https://wikidocs.net/22893&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;The basic idea of Attention is that since the numerical information from the encoder's last timestep alone is sufficient, the decoder refers back to the entire input sentence at every timestep when predicting output words.&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;However, &lt;b&gt;it doesn't reference all input words equally - instead, it pays more attention to words most relevant to the word being predicted at that timestep.&lt;/b&gt; &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;Mathematically, this involves creating a query by multiplying weights with the decoder's current timestep output (hidden state), then taking the dot product with all encoder timestep outputs, and learning these weights through backpropagation to better reference the words that need to be predicted.&lt;/li&gt;
&lt;li&gt;Although the addition of Attention somewhat removed limitations on sentence length, RNN-based Seq2Seq models still produced lower quality translations compared to humans.&lt;/li&gt;
&lt;li&gt;However, the emergence of the &lt;b&gt;Transformer&lt;/b&gt; brought significant changes to natural language processing.&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;In 2017, Google introduced the Transformer model through their paper &quot;Attention is All You Need,&quot; implementing both encoder and decoder entirely with attention mechanisms, &lt;b&gt;rather than just using attention for corrections&lt;/b&gt;. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;span style=&quot;font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;The Transformer model became &lt;/span&gt;&lt;b&gt;not only free from sentence length constraints but also better at understanding input sentences through the encoder and previously generated words through the decoder.&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;All famous pre-trained language models (PLMs) since then have been Transformer-based.&lt;/li&gt;
&lt;li&gt;BERT consists of 12 Transformer encoders and excels at natural language understanding, while GPT-1 consists of 12 Transformer decoders and shows strength in natural language generation.&lt;/li&gt;
&lt;li&gt;Subsequent language models have evolved by increasing model size and datasets - GPT-3's largest version has 96 decoders and 175 billion parameters.&lt;/li&gt;
&lt;li&gt;ChatGPT is a model fine-tuned from GPT-3, specialized for conversation.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;Transformer excelling in image field&lt;/b&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;The Transformer model is achieving good results not only in natural language processing but also in image processing. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;Vision Transformer (ViT), announced in 2020, applies the Transformer model to the vision field. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;It divides input images into patches, feeds them into the Transformer's encoder, and can capture interdependencies between different positions of input images and global image features using attention.&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;Additionally, Transformer is used in popular text-to-image generation models like DALL-E 2 and Stable Diffusion.&lt;/li&gt;
&lt;li&gt;These models learn optimal weights for image generation by adding noise to images and restoring them, but instead of blindly restoring images, they &lt;span style=&quot;background-color: #9feec3;&quot;&gt;find directions for restoration conditioned on given text information&lt;/span&gt;.&lt;/li&gt;
&lt;li&gt;The Transformer is used not only to understand information between texts but also to model interactions between text and image representations.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;More into the Model&amp;nbsp;&lt;/b&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;Transformer uses attention in both the encoder (which understands input sentences) and the decoder (which generates target sentences).&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;There are three types of attention in the Transformer:
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;&lt;b&gt;Encoder Self-Attention&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Used within the encoder for understanding input sentences&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Decoder Self-Attention (also called Masked Attention)&lt;/b&gt;&lt;br /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Used&amp;nbsp;within&amp;nbsp;the&amp;nbsp;decoder&amp;nbsp;for&amp;nbsp;understanding&amp;nbsp;the&amp;nbsp;sentence&amp;nbsp;it's&amp;nbsp;generating&lt;/li&gt;
&lt;li&gt;Called &quot;masked attention&quot; because it masks future tokens during the word-by-word sentence generation process&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Encoder-Decoder&amp;nbsp;Attention&lt;/b&gt;&lt;br /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The original purpose of attention&lt;/li&gt;
&lt;li&gt;Used for the decoder to reference information from the encoder when generating sentences, supplementing any missing information&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2025-01-03 오전 1.05.31.png&quot; data-origin-width=&quot;430&quot; data-origin-height=&quot;305&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/eIq2xM/btsLBWx8BlV/0w1TrxBbuTEEWenHbhF1tK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/eIq2xM/btsLBWx8BlV/0w1TrxBbuTEEWenHbhF1tK/img.png&quot; data-alt=&quot;https://wikidocs.net/31379&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/eIq2xM/btsLBWx8BlV/0w1TrxBbuTEEWenHbhF1tK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FeIq2xM%2FbtsLBWx8BlV%2F0w1TrxBbuTEEWenHbhF1tK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;430&quot; height=&quot;305&quot; data-filename=&quot;스크린샷 2025-01-03 오전 1.05.31.png&quot; data-origin-width=&quot;430&quot; data-origin-height=&quot;305&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;https://wikidocs.net/31379&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;Looking at the process step by step from where words enter the encoder:&lt;/span&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;Input sentences are tokenized to create a dictionary&lt;/li&gt;
&lt;li&gt;Tokens are mapped to integers&lt;/li&gt;
&lt;li&gt;These pass through the embedding layer&lt;/li&gt;
&lt;li&gt;This creates embedding values for tokens that the model will learn&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&amp;nbsp;The Transformer maintains a consistent dimensionality of 512 for both word embedding vectors and all input/output values within the model.&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;Let&amp;nbsp;me&amp;nbsp;break&amp;nbsp;down&amp;nbsp;this&amp;nbsp;explanation&amp;nbsp;of&amp;nbsp;Transformer's&amp;nbsp;detailed&amp;nbsp;operation:&lt;/b&gt;&lt;br /&gt;&lt;b&gt;1. Multi-head Attention in First Encoding Layer&lt;/b&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;When generating contextual representations by calculating similarities between input sentence tokens&lt;/li&gt;
&lt;li&gt;Instead of calculating similarities between 512-dimensional tokens all at once&lt;/li&gt;
&lt;li&gt;Divides into n heads for learning (hence &quot;Multi-head Attention&quot;)&lt;/li&gt;
&lt;li&gt;Paper used head=8&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;2. Example Calculation&lt;/b&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;For a sentence like &quot;나는, 학교, 에, 간다&quot;:&lt;/li&gt;
&lt;li&gt;Instead of full (4, 512).T x (4, 512) matrix multiplication&lt;/li&gt;
&lt;li&gt;Changes weight vector size to 64 dimensions (512/8)&lt;/li&gt;
&lt;li&gt;Enables 8 parallel (4, 64).T x (4, 64) matrix operations&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;3. Efficient Processing&lt;/b&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Uses matrix multiplication between input values and model weights&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Processes efficiently through:&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Batch matrix operations&lt;/li&gt;
&lt;li&gt;Parallel attention processing via multi-head attention mechanism&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;4. Subsequent Encoder Blocks&lt;/b&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Perform self-attention learning with output from previous block&lt;/li&gt;
&lt;li&gt;Each encoder block has different weight parameters&lt;/li&gt;
&lt;li&gt;Model's expressiveness improves as layers stack up&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;5. Decoder Operation&lt;/b&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Performs self-attention on masked output sentence tokens&lt;/li&gt;
&lt;li&gt;Conducts encoder-decoder attention using:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Self-attention values&lt;/li&gt;
&lt;li&gt;Values passed through final encoder block&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Both&amp;nbsp;self-attention&amp;nbsp;and&amp;nbsp;encoder-decoder&amp;nbsp;attention&amp;nbsp;use&amp;nbsp;parallel-processed&amp;nbsp;multi-head&amp;nbsp;attention&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;Conclusion&lt;/b&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The&amp;nbsp;Transformer&amp;nbsp;achieved&amp;nbsp;several&amp;nbsp;key&amp;nbsp;breakthroughs:&lt;br /&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;&lt;b&gt;Overcame Sentence Length Limitations&lt;/b&gt;&lt;br /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Through&amp;nbsp;&lt;b&gt;attention&lt;/b&gt;&amp;nbsp;mechanisms&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Improved&lt;/b&gt; &lt;b&gt;understanding&lt;/b&gt; of both input and generated sentences&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Efficient Processing&lt;/b&gt;&lt;br /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Handles&amp;nbsp;massive&amp;nbsp;matrix&amp;nbsp;operations&amp;nbsp;between&amp;nbsp;input&amp;nbsp;values&amp;nbsp;and&amp;nbsp;weights&lt;/li&gt;
&lt;li&gt;Achieves efficiency through parallel processing of all operations&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Foundation for Large Language Models&lt;/b&gt;&lt;br /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Enabled&amp;nbsp;development&amp;nbsp;of&amp;nbsp;&lt;b&gt;large-scale&amp;nbsp;language&amp;nbsp;models&lt;/b&gt;&amp;nbsp;like&amp;nbsp;GPT&amp;nbsp;(Generative&amp;nbsp;Pre-trained&amp;nbsp;Transformer)&lt;/li&gt;
&lt;li&gt;Made&amp;nbsp;it&amp;nbsp;possible&amp;nbsp;to&amp;nbsp;&lt;b&gt;pre-train&lt;/b&gt;&amp;nbsp;on&amp;nbsp;massive&amp;nbsp;datasets&lt;/li&gt;
&lt;li&gt;Achieved&amp;nbsp;superior&amp;nbsp;performance&amp;nbsp;through&amp;nbsp;this&amp;nbsp;architecture&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&amp;nbsp;This&amp;nbsp;architecture&amp;nbsp;laid&amp;nbsp;the&amp;nbsp;groundwork&amp;nbsp;for&amp;nbsp;modern&amp;nbsp;large&amp;nbsp;language&amp;nbsp;models&amp;nbsp;and&amp;nbsp;continues&amp;nbsp;to&amp;nbsp;drive&amp;nbsp;innovation&amp;nbsp;in&amp;nbsp;AI.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;Reference&lt;/b&gt;&lt;/h4&gt;
&lt;figure id=&quot;og_1735884780238&quot; contenteditable=&quot;false&quot; data-ke-type=&quot;opengraph&quot; data-ke-align=&quot;alignCenter&quot; data-og-type=&quot;article&quot; data-og-title=&quot;Transformer 모델이란? : AI 혁신을 주도하는 트랜스포머 알고리즘&quot; data-og-description=&quot;트랜스포머(Transformer)는 구글이 자연어처리를 위해 2017년 발표한 모델로 현재 AI 분야의 혁신을 이끌고 있는 언어모델이다. 우리가 웹이나 API를 통해 AI를 처음 활용하게 된 계기가 된 ChatGPT 역시 &quot; data-og-host=&quot;blog-ko.superb-ai.com&quot; data-og-source-url=&quot;https://blog-ko.superb-ai.com/what-is-the-transformer-model/&quot; data-og-url=&quot;https://blog-ko.superb-ai.com/what-is-the-transformer-model/&quot; data-og-image=&quot;https://scrap.kakaocdn.net/dn/SPFGe/hyXWtoXIFi/QXvJOhd7MyNpIpbvcQQ3P0/img.png?width=601&amp;amp;height=336&amp;amp;face=0_0_601_336,https://scrap.kakaocdn.net/dn/BtynV/hyXWzvTWmD/8kWHnkaDCAGKb28B0VXIFk/img.png?width=601&amp;amp;height=336&amp;amp;face=0_0_601_336&quot;&gt;&lt;a href=&quot;https://blog-ko.superb-ai.com/what-is-the-transformer-model/&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot; data-source-url=&quot;https://blog-ko.superb-ai.com/what-is-the-transformer-model/&quot;&gt;
&lt;div class=&quot;og-image&quot; style=&quot;background-image: url('https://scrap.kakaocdn.net/dn/SPFGe/hyXWtoXIFi/QXvJOhd7MyNpIpbvcQQ3P0/img.png?width=601&amp;amp;height=336&amp;amp;face=0_0_601_336,https://scrap.kakaocdn.net/dn/BtynV/hyXWzvTWmD/8kWHnkaDCAGKb28B0VXIFk/img.png?width=601&amp;amp;height=336&amp;amp;face=0_0_601_336');&quot;&gt;&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;og-text&quot;&gt;
&lt;p class=&quot;og-title&quot; data-ke-size=&quot;size16&quot;&gt;Transformer 모델이란? : AI 혁신을 주도하는 트랜스포머 알고리즘&lt;/p&gt;
&lt;p class=&quot;og-desc&quot; data-ke-size=&quot;size16&quot;&gt;트랜스포머(Transformer)는 구글이 자연어처리를 위해 2017년 발표한 모델로 현재 AI 분야의 혁신을 이끌고 있는 언어모델이다. 우리가 웹이나 API를 통해 AI를 처음 활용하게 된 계기가 된 ChatGPT 역시&lt;/p&gt;
&lt;p class=&quot;og-host&quot; data-ke-size=&quot;size16&quot;&gt;blog-ko.superb-ai.com&lt;/p&gt;
&lt;/div&gt;
&lt;/a&gt;&lt;/figure&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;You can find detailed steps of how transformer gets the sense of the data and generate new data from this blog:&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;a href=&quot;https://www.datacamp.com/tutorial/how-transformers-work&quot;&gt;https://www.datacamp.com/tutorial/how-transformers-work&lt;/a&gt;&lt;/p&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style5&quot; /&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Failure&amp;nbsp;is&amp;nbsp;an&amp;nbsp;option&amp;nbsp;here.&amp;nbsp;If&amp;nbsp;things&amp;nbsp;are&amp;nbsp;not&amp;nbsp;failing,&amp;nbsp;you&amp;nbsp;are&amp;nbsp;not&amp;nbsp;innovating&amp;nbsp;enough.&lt;br /&gt;- Elon Musk -&lt;/span&gt;&lt;/blockquote&gt;</description>
      <category>NLP</category>
      <category>attention</category>
      <category>nlp</category>
      <category>Transformer</category>
      <author>dongsunseng</author>
      <guid isPermaLink="true">https://dongsunseng.tistory.com/81</guid>
      <comments>https://dongsunseng.tistory.com/entry/NLP-3-How-does-Transformer-Work#entry81comment</comments>
      <pubDate>Fri, 3 Jan 2025 15:52:40 +0900</pubDate>
    </item>
    <item>
      <title>[Prompt Engineering] 2. Zer0-shot Prompting</title>
      <link>https://dongsunseng.tistory.com/entry/Prompt-Engineering-2-Zer0-shot-Prompting</link>
      <description>&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;img.jpg&quot; data-origin-width=&quot;1600&quot; data-origin-height=&quot;700&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/nMpPw/btsLBjmUkl2/20iikD7O1Mmng8l9Sc5Tc0/img.jpg&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/nMpPw/btsLBjmUkl2/20iikD7O1Mmng8l9Sc5Tc0/img.jpg&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/nMpPw/btsLBjmUkl2/20iikD7O1Mmng8l9Sc5Tc0/img.jpg&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FnMpPw%2FbtsLBjmUkl2%2F20iikD7O1Mmng8l9Sc5Tc0%2Fimg.jpg&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1600&quot; height=&quot;700&quot; data-filename=&quot;img.jpg&quot; data-origin-width=&quot;1600&quot; data-origin-height=&quot;700&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;What is Zero-shot prompting?&lt;/b&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;In artificial intelligence, a &quot;shot&quot; refers to an example.&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;Therefore, zero-shot means the AI processing a new task without examples, in other words, handling tasks it hasn't specifically learned.&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #99cefa;&quot;&gt;&lt;b&gt;Zero-shot prompting refers to how AI like ChatGPT processes responses to prompts without specifically trained data or example answers.&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;Simply put, AI performs the requested task &lt;b&gt;without seeing any examples&lt;/b&gt;&lt;span style=&quot;font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px; background-color: #9feec3;&quot;&gt;It processes responses to prompts using only the knowledge learned during model training.&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;Examples&lt;/span&gt;&lt;/b&gt;&lt;/h4&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;1. Text Classification&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1735806093261&quot; class=&quot;html xml&quot; data-ke-language=&quot;html&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;Classify the following text as either &quot;Business&quot;, &quot;Technology&quot;, or &quot;Health&quot;:
&quot;New research shows that regular meditation can reduce stress levels and improve sleep quality.&quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;2. Sentiment Analysis&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1735806080536&quot; class=&quot;html xml&quot; data-ke-language=&quot;html&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;Determine if the following customer review expresses a positive, negative, or neutral sentiment:
&quot;After waiting for 45 minutes, the food arrived cold and the waiter was nowhere to be found.&quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;3. Language Translation&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1735806068629&quot; class=&quot;html xml&quot; data-ke-language=&quot;html&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;Translate the following English text to French, maintaining a formal tone:
&quot;We look forward to meeting you at the conference next week.&quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;4. Question Answering&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1735806044245&quot; class=&quot;html xml&quot; data-ke-language=&quot;html&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;Based on the following text, answer the question below:

Text: The Industrial Revolution began in Britain in the late 18th century and spread to other parts of Europe and North America during the 19th century. It marked a major turning point in human history.

Question: Where and when did the Industrial Revolution begin?&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;5. Summarization&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1735806022058&quot; class=&quot;html xml&quot; data-ke-language=&quot;html&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;Provide a brief summary of the following paragraph in no more than two sentences:

The Great Barrier Reef is the world's largest coral reef system, stretching over 2,300 kilometers along the northeast coast of Australia. It consists of nearly 3,000 individual reefs and 900 islands, supporting an incredibly diverse ecosystem of marine life including over 1,500 species of fish, 400 species of hard coral, and 4,000 types of mollusks.&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;6. Intent Classification&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1735805993988&quot; class=&quot;html xml&quot; data-ke-language=&quot;html&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;Identify the user's intent in the following customer service query as either &quot;Request Information&quot;, &quot;Technical Support&quot;, &quot;Complaint&quot;, or &quot;Account Management&quot;:

&quot;I've been trying to log into my account for the past hour but keep getting an error message.&quot;&lt;/code&gt;&lt;/pre&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;Furthermore&lt;/b&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;This is called zero-shot prompting, where AI interprets prompts and generates results using only pre-trained data without being given example answers.&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;Most prompts you use daily without examples are classified as zero-shot prompts.&lt;/li&gt;
&lt;li&gt;So what about prompting with examples?&lt;/li&gt;
&lt;li&gt;The method of performing tasks using a few examples is called &lt;b&gt;Few-shot prompting&lt;/b&gt;&lt;span style=&quot;font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;.&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;The &lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;advantages&lt;/b&gt;&lt;/span&gt; of zero-shot prompting are:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;Minimizes&lt;/b&gt; &lt;b&gt;time&lt;/b&gt; spent preparing prompts&lt;/li&gt;
&lt;li&gt;Can &lt;b&gt;quickly utilize existing models without separate training for new fields or tasks&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;&lt;b&gt;While zero-shot prompting specializes in quick and flexible interaction with AI, it may have lower performance or accuracy compared to models pre-trained for specific tasks.&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;A representative prompting engineering technique to compensate for this is the &lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;Few-shot prompting&lt;/b&gt;&lt;/span&gt; mentioned in the previous post of mine:&lt;/li&gt;
&lt;/ul&gt;
&lt;figure id=&quot;og_1735806237783&quot; contenteditable=&quot;false&quot; data-ke-type=&quot;opengraph&quot; data-ke-align=&quot;alignCenter&quot; data-og-type=&quot;article&quot; data-og-title=&quot;[Prompt Engineering] 1. Few-shot Prompting&quot; data-og-description=&quot;What is Few-shot Prompting?In artifical intelligence, a &amp;quot;shot&amp;quot; refers to an exampleTherefore, Few-shot means a few examples.Few-shot prompting is a method that helps AI models better understand and perform new tasks by providing a small number of examples &quot; data-og-host=&quot;dongsunseng.com&quot; data-og-source-url=&quot;https://dongsunseng.com/entry/Prompt-Engineering-1-Few-shot-Prompting&quot; data-og-url=&quot;https://dongsunseng.com/entry/Prompt-Engineering-1-Few-shot-Prompting&quot; data-og-image=&quot;https://scrap.kakaocdn.net/dn/bq4mHA/hyXWAOY8Qo/IX4A0K4NlpHRLZqFIvCgMk/img.jpg?width=800&amp;amp;height=350&amp;amp;face=0_0_800_350,https://scrap.kakaocdn.net/dn/bLO7Be/hyXSuwbagS/c1tB5pIKm2tIcZDZzAXmkK/img.jpg?width=800&amp;amp;height=350&amp;amp;face=0_0_800_350,https://scrap.kakaocdn.net/dn/boNGvZ/hyXWnoz6C5/SkOeKpjjdlb0hOJ0F9Mhn0/img.jpg?width=1600&amp;amp;height=700&amp;amp;face=0_0_1600_700&quot;&gt;&lt;a href=&quot;https://dongsunseng.com/entry/Prompt-Engineering-1-Few-shot-Prompting&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot; data-source-url=&quot;https://dongsunseng.com/entry/Prompt-Engineering-1-Few-shot-Prompting&quot;&gt;
&lt;div class=&quot;og-image&quot; style=&quot;background-image: url('https://scrap.kakaocdn.net/dn/bq4mHA/hyXWAOY8Qo/IX4A0K4NlpHRLZqFIvCgMk/img.jpg?width=800&amp;amp;height=350&amp;amp;face=0_0_800_350,https://scrap.kakaocdn.net/dn/bLO7Be/hyXSuwbagS/c1tB5pIKm2tIcZDZzAXmkK/img.jpg?width=800&amp;amp;height=350&amp;amp;face=0_0_800_350,https://scrap.kakaocdn.net/dn/boNGvZ/hyXWnoz6C5/SkOeKpjjdlb0hOJ0F9Mhn0/img.jpg?width=1600&amp;amp;height=700&amp;amp;face=0_0_1600_700');&quot;&gt;&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;og-text&quot;&gt;
&lt;p class=&quot;og-title&quot; data-ke-size=&quot;size16&quot;&gt;[Prompt Engineering] 1. Few-shot Prompting&lt;/p&gt;
&lt;p class=&quot;og-desc&quot; data-ke-size=&quot;size16&quot;&gt;What is Few-shot Prompting?In artifical intelligence, a &quot;shot&quot; refers to an exampleTherefore, Few-shot means a few examples.Few-shot prompting is a method that helps AI models better understand and perform new tasks by providing a small number of examples&lt;/p&gt;
&lt;p class=&quot;og-host&quot; data-ke-size=&quot;size16&quot;&gt;dongsunseng.com&lt;/p&gt;
&lt;/div&gt;
&lt;/a&gt;&lt;/figure&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style5&quot; /&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Great&amp;nbsp;companies&amp;nbsp;are&amp;nbsp;built&amp;nbsp;on&amp;nbsp;great&amp;nbsp;products.&lt;br /&gt;- Elon Musk -&lt;/span&gt;&lt;/blockquote&gt;</description>
      <category>NLP</category>
      <category>llm</category>
      <category>nlp</category>
      <category>Prompt Engineering</category>
      <category>zero-shot</category>
      <author>dongsunseng</author>
      <guid isPermaLink="true">https://dongsunseng.tistory.com/80</guid>
      <comments>https://dongsunseng.tistory.com/entry/Prompt-Engineering-2-Zer0-shot-Prompting#entry80comment</comments>
      <pubDate>Thu, 2 Jan 2025 17:36:24 +0900</pubDate>
    </item>
    <item>
      <title>[Prompt Engineering] 1. Few-shot Prompting</title>
      <link>https://dongsunseng.tistory.com/entry/Prompt-Engineering-1-Few-shot-Prompting</link>
      <description>&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;img.jpg&quot; data-origin-width=&quot;1600&quot; data-origin-height=&quot;700&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/brOReA/btsLBxFf1bo/HqFvMJsiZ7Y5GDksdMcUXK/img.jpg&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/brOReA/btsLBxFf1bo/HqFvMJsiZ7Y5GDksdMcUXK/img.jpg&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/brOReA/btsLBxFf1bo/HqFvMJsiZ7Y5GDksdMcUXK/img.jpg&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbrOReA%2FbtsLBxFf1bo%2FHqFvMJsiZ7Y5GDksdMcUXK%2Fimg.jpg&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1600&quot; height=&quot;700&quot; data-filename=&quot;img.jpg&quot; data-origin-width=&quot;1600&quot; data-origin-height=&quot;700&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;What is Few-shot Prompting?&lt;/b&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;In artifical intelligence, a &quot;shot&quot; refers to an example&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;Therefore, Few-shot means a few examples.&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;Few-shot prompting is a method that helps AI models better understand and perform new tasks by providing a small number of examples when the model needs to perform a new task.&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;Few-shot prompting is broadly divided into:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;Instructions&lt;/b&gt;: Description of the task the model needs to perform&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Examples&lt;/b&gt;: Examples for the model to reference when generating responses&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Input&lt;/b&gt; &lt;b&gt;data&lt;/b&gt;: Optional use depending on whether there is data to analyze&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;&lt;b&gt;It is common to use 2-5 examples for few-shot prompting&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;Examples of few-shot prompting&lt;/b&gt;&lt;/h4&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;1. Sentiment Analysis&lt;/b&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1735803008162&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;Input: &quot;The food was amazing!&quot; 
Output: Positive

Input: &quot;Terrible service, would not recommend.&quot; 
Output: Negative

Input: &quot;It was an okay experience.&quot;
Output: Neutral

Input: &quot;The concert exceeded all my expectations!&quot;
Output: [The model should predict: Positive]&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;2. Text Classification&lt;/b&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1735803024956&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;Input: &quot;How do I reset my password?&quot;
Category: Technical Support

Input: &quot;I'd like to return my recent purchase&quot;
Category: Customer Service

Input: &quot;What are your business hours?&quot;
Category: General Inquiry

Input: &quot;My account is locked, please help&quot;
Category: [The model should predict: Technical Support]&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;3. Language Translation (Informal -&amp;gt; Formal)&lt;/b&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1735803039352&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;Informal: &quot;Hey, what's up?&quot;
Formal: &quot;Hello, how are you?&quot;

Informal: &quot;Gimme a sec&quot;
Formal: &quot;Please give me a moment&quot;

Informal: &quot;That's awesome!&quot;
Formal: &quot;That is excellent&quot;

Informal: &quot;Can't wait to see ya&quot;
Formal: [The model should predict: &quot;I look forward to seeing you&quot;]&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;4. Entity Extraction&lt;/b&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1735803046614&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;Text: &quot;John Smith lives in New York&quot;
Person: John Smith
Location: New York

Text: &quot;Apple Inc. is headquartered in Cupertino&quot;
Company: Apple Inc.
Location: Cupertino

Text: &quot;Microsoft CEO Satya Nadella announced&quot;
Person: Satya Nadella
Company: Microsoft

Text: &quot;Tesla opened a new factory in Berlin&quot;
Company: [The model should predict: Tesla]
Location: [The model should predict: Berlin]&lt;/code&gt;&lt;/pre&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;Advantages&lt;/b&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Few-shot prompting enables AI models to &lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;better understand and perform tasks with just a small amount of data&lt;/b&gt;&lt;/span&gt;.&lt;/li&gt;
&lt;li&gt;While it takes longer to write prompts compared to zero-shot prompting, it allows for more &lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;precise control of responses&lt;/b&gt;&lt;/span&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;Limitations&lt;/b&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Since few-shot prompting only provides a small number of examples to the AI, &lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;if the quality of the given examples is low&lt;/b&gt;&lt;/span&gt;, there's a &lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;higher probability that the AI will produce incorrect results&lt;/b&gt;&lt;/span&gt;.&lt;/li&gt;
&lt;li&gt;Therefore, when using few-shot prompting, it's crucial to carefully check the &lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;consistency and quality of the examples&lt;/b&gt;&lt;/span&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style5&quot; /&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;It&amp;nbsp;is&amp;nbsp;a&amp;nbsp;mistake&amp;nbsp;to&amp;nbsp;underestimate&amp;nbsp;the&amp;nbsp;power&amp;nbsp;of&amp;nbsp;a&amp;nbsp;single&amp;nbsp;individual&amp;nbsp;to&amp;nbsp;change&amp;nbsp;the&amp;nbsp;world.&lt;br /&gt;- Elon Musk -&lt;/span&gt;&lt;/blockquote&gt;</description>
      <category>NLP</category>
      <category>Few-shot</category>
      <category>llm</category>
      <category>nlp</category>
      <category>Prompt Engineering</category>
      <author>dongsunseng</author>
      <guid isPermaLink="true">https://dongsunseng.tistory.com/79</guid>
      <comments>https://dongsunseng.tistory.com/entry/Prompt-Engineering-1-Few-shot-Prompting#entry79comment</comments>
      <pubDate>Thu, 2 Jan 2025 16:51:37 +0900</pubDate>
    </item>
    <item>
      <title>Child Mind Institute &amp;mdash; Problematic Internet Use: The Greatest Shake-Up?</title>
      <link>https://dongsunseng.tistory.com/entry/Child-Mind-Institute-%E2%80%94-Problematic-Internet-Use-The-Greatest-Shake-Up</link>
      <description>&lt;h4 data-ke-size=&quot;size20&quot;&gt;&amp;nbsp;&lt;/h4&gt;
&lt;figure id=&quot;og_1734931292680&quot; contenteditable=&quot;false&quot; data-ke-type=&quot;opengraph&quot; data-ke-align=&quot;alignCenter&quot; data-og-type=&quot;website&quot; data-og-title=&quot;Child Mind Institute &amp;mdash; Problematic Internet Use&quot; data-og-description=&quot;Relating Physical Activity to Problematic Internet Use&quot; data-og-host=&quot;www.kaggle.com&quot; data-og-source-url=&quot;https://www.kaggle.com/competitions/child-mind-institute-problematic-internet-use/overview&quot; data-og-url=&quot;https://kaggle.com/child-mind-institute-problematic-internet-use&quot; data-og-image=&quot;https://scrap.kakaocdn.net/dn/vNkqq/hyXOnKlre8/T02BabpHnqJ4KsEscMFj50/img.png?width=560&amp;amp;height=280&amp;amp;face=0_0_560_280,https://scrap.kakaocdn.net/dn/bkTNsA/hyXOp2twZ8/VdsuFMGhPkOIlcKSJZoc5K/img.png?width=560&amp;amp;height=280&amp;amp;face=0_0_560_280&quot;&gt;&lt;a href=&quot;https://www.kaggle.com/competitions/child-mind-institute-problematic-internet-use/overview&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot; data-source-url=&quot;https://www.kaggle.com/competitions/child-mind-institute-problematic-internet-use/overview&quot;&gt;
&lt;div class=&quot;og-image&quot; style=&quot;background-image: url('https://scrap.kakaocdn.net/dn/vNkqq/hyXOnKlre8/T02BabpHnqJ4KsEscMFj50/img.png?width=560&amp;amp;height=280&amp;amp;face=0_0_560_280,https://scrap.kakaocdn.net/dn/bkTNsA/hyXOp2twZ8/VdsuFMGhPkOIlcKSJZoc5K/img.png?width=560&amp;amp;height=280&amp;amp;face=0_0_560_280');&quot;&gt;&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;og-text&quot;&gt;
&lt;p class=&quot;og-title&quot; data-ke-size=&quot;size16&quot;&gt;Child Mind Institute &amp;mdash; Problematic Internet Use&lt;/p&gt;
&lt;p class=&quot;og-desc&quot; data-ke-size=&quot;size16&quot;&gt;Relating Physical Activity to Problematic Internet Use&lt;/p&gt;
&lt;p class=&quot;og-host&quot; data-ke-size=&quot;size16&quot;&gt;www.kaggle.com&lt;/p&gt;
&lt;/div&gt;
&lt;/a&gt;&lt;/figure&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2024-12-23 오후 2.32.23.png&quot; data-origin-width=&quot;2328&quot; data-origin-height=&quot;1194&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/cpVzIX/btsLssRywb7/haTaDCukQsWTwKwCkwmnD1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/cpVzIX/btsLssRywb7/haTaDCukQsWTwKwCkwmnD1/img.png&quot; data-alt=&quot;CRAZY SHAKE-UP HERE&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/cpVzIX/btsLssRywb7/haTaDCukQsWTwKwCkwmnD1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FcpVzIX%2FbtsLssRywb7%2FhaTaDCukQsWTwKwCkwmnD1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;2328&quot; height=&quot;1194&quot; data-filename=&quot;스크린샷 2024-12-23 오후 2.32.23.png&quot; data-origin-width=&quot;2328&quot; data-origin-height=&quot;1194&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;CRAZY SHAKE-UP HERE&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;About the Competition&lt;/b&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The aim of this competition is to develop a model that predicts problematic internet usage levels based on physical activity and health data from children and adolescents.&lt;/li&gt;
&lt;li&gt;Since the current method of measuring problematic internet use requires complex expert evaluation, the goal is to identify it through easily obtainable physical activity indicators instead.&lt;/li&gt;
&lt;li&gt;This competition is hosted by the Child Mind Institute and sponsored by Dell Technologies and NVIDIA, with a total prize pool of $60,000. The evaluation metric used is quadratic weighted kappa.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;Shake-Up?&lt;/b&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&amp;nbsp;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;What competition organizer says:&lt;/b&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Many participants adjusted model hyperparameters, thresholds, and random seeds to improve public leaderboard scores&lt;/li&gt;
&lt;li&gt;Submission analysis:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Top 10 teams on public: Average 212 submissions (median 199)&lt;/li&gt;
&lt;li&gt;Top 10 teams on private: Average 64 submissions (median 25)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;This suggests attempts to artificially inflate scores on the public leaderboard&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;Overfitting almost certainly occurred&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;Striking differences between missing values proportion in train vs test data&lt;/span&gt;&lt;/b&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Actual discussion here: &lt;a href=&quot;https://www.kaggle.com/competitions/child-mind-institute-problematic-internet-use/discussion/552488#3076452&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;Link&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;The missing percentage of series parquet in test: 80-85%(~60% in train)&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;The missing percentage of&lt;span&gt;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;FGC&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;&lt;span&gt;&amp;nbsp;&lt;/span&gt;features in test: 70-75%(29.8% in train)&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;The missing percentage of&lt;span&gt;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;BIA&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;&lt;span&gt;&amp;nbsp;&lt;/span&gt;features in test: 60-65%(33.7% in train)&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;The missing percentage of&lt;span&gt;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;PreInt_EduHx-computerinternet_hoursday&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;&lt;span&gt;&amp;nbsp;&lt;/span&gt;in test: 40-50%(3% in train)&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;&lt;span style=&quot;color: #3c4043; text-align: start;&quot;&gt;&lt;span style=&quot;color: #3c4043; text-align: start;&quot;&gt;Might be the reason why KNNImputer works so well on the private test set despite decreasing the CV on the train set   - both with leakage and without leakage (much worse)&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 style=&quot;background-color: #ffffff; color: #202124; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;a href=&quot;https://www.kaggle.com/competitions/child-mind-institute-problematic-internet-use/discussion/552890&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;9th Place Sol: This is not a lottery compettions (LB Rank 9. Best notebook Private score:0.493 )&lt;/b&gt;&lt;/a&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;&lt;/span&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;Why did we observe such dramatic drops in the Leaderboard rankings?&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;Data leakage from the hidden dataset&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;During the competition, he mentioned about &quot;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;Private dataset leakage.. Cv and PL dont have any corelation.&quot;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;&quot;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;in my experience, the score above 0.47 only data leakage.&quot;&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;KNNImputer process in the shared notebooks contained a bug&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;I believe this is why everyone who made their final submission based on the shared notebook fell significantly in the rankings.&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;What didn't work:&lt;/span&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;Treating the SII as a classification problem instead of a regression problem.&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;Predicting PCIA-Total.&lt;/li&gt;
&lt;li&gt;Improving the SII value through post-processing.&lt;/li&gt;
&lt;li&gt;Testing various methods and optimizing threshold values for the Kappa metric.&lt;/li&gt;
&lt;li&gt;Calculating the SII value separately for datasets with and without accelerometer data.&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;What worked:&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;I followed a phased approach for missing data:&lt;/span&gt;&lt;br /&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;a. Used all data (including missing SII's).&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;b. Combined train and test datasets to train the model.&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;&lt;span style=&quot;color: #3c4043; text-align: start;&quot;&gt;c. Predicted missing values separately for train and test datasets.&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;d. Dropped rows with missing data for SII or PCIA1&amp;ndash;PCIA19.&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;e. Removed columns with a very high proportion of missing values.&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;f. For feature engineering, I computed mean, standard deviation, kurtosis, and skew values using a&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;windowing method for both accelerometer data and columns within the same category.&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;g. Conducted feature selection for an LGBM model, reducing the features from 200&amp;ndash;300 to 50&amp;ndash;60.&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;h- Used voting and stacking ensemble techniques with LGBM (GBDT, GOSS, and DART) and CatBoost&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;models.&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;i- For the final results, I selected the most common label.&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;Using the Windowing method:&lt;br /&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;Data is divided into fixed-size intervals (windows)&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;Statistical values were calculated for each interval&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;Calculated statistical values:&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;Mean: Central tendency of the data&lt;br /&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;Standard deviation: Measure of data dispersion&lt;br /&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;Kurtosis: Measure of how peaked/flat the data distribution is&lt;br /&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;Skew: Measure of data distribution asymmetry&lt;br /&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;This&amp;nbsp;calculation&amp;nbsp;was&amp;nbsp;applied&amp;nbsp;to&amp;nbsp;two&amp;nbsp;types&amp;nbsp;of&amp;nbsp;data:&lt;br /&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;accelerometer data&lt;br /&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;columns within the same category&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Highest Private Leaderboard Score: 0.493&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Unlike my selected solution, I used the PCIA7 value from the train dataset as it was.&lt;/li&gt;
&lt;li&gt;For the test set, I used the predictions made by the model.&lt;/li&gt;
&lt;li&gt;Why did I do this?
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Because after predicting all PCIA values, I observed that the Kappa score for PCIA7 was&lt;br /&gt;significantly higher than for the others.&lt;/li&gt;
&lt;li&gt;For this reason, I decided to proceed with this approach.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;However, I didn&amp;rsquo;t include this in my final submission because I noticed this three days before the deadline. -&lt;/li&gt;
&lt;li&gt;The two notebooks I tested showed little difference, with private leaderboard scores of 0.482 and 0.485.&lt;/li&gt;
&lt;li&gt;I had set a CV threshold of 0.450, so I chose not to submit these.&lt;/li&gt;
&lt;li&gt;Among the three notebooks I didn&amp;rsquo;t submit, one had a CV score of 0.451 and a private leaderboard score of 0.493.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&amp;nbsp;Conclusion: This competition was absolutely not a matter of luck for me.&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 style=&quot;background-color: #ffffff; color: #202124; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;a href=&quot;https://www.kaggle.com/competitions/child-mind-institute-problematic-internet-use/discussion/552569&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;16th Place Solution&lt;/b&gt;&lt;/a&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Code: &lt;a href=&quot;https://www.kaggle.com/code/rsakata/cmi-piu-16th-place-solution&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;Link&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Main points:&amp;nbsp;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;imputation of missing values with IterativeImputer&lt;/li&gt;
&lt;li&gt;feature engineering from parquet files&lt;/li&gt;
&lt;li&gt;LightGBM training with custom QWK objective and metric&lt;/li&gt;
&lt;li&gt;performing &lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;10 x 10 nested cross-validation&lt;/b&gt;&lt;/span&gt; to get reliable validation scores and stable test predictions&lt;/li&gt;
&lt;li&gt;performing threshold optimization only once using the overall predictions from the nested cross-validation. (GRID SEARCH)&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2024-12-23 오후 4.49.08.png&quot; data-origin-width=&quot;666&quot; data-origin-height=&quot;308&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bBBuH3/btsLup0AhuW/81eotQLflOxMV1DbNYqdQK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bBBuH3/btsLup0AhuW/81eotQLflOxMV1DbNYqdQK/img.png&quot; data-alt=&quot;feature engineering from parquet files&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bBBuH3/btsLup0AhuW/81eotQLflOxMV1DbNYqdQK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbBBuH3%2FbtsLup0AhuW%2F81eotQLflOxMV1DbNYqdQK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;666&quot; height=&quot;308&quot; data-filename=&quot;스크린샷 2024-12-23 오후 4.49.08.png&quot; data-origin-width=&quot;666&quot; data-origin-height=&quot;308&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;feature engineering from parquet files&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Discussion question #1 about imputation: What kind of reasoning led to filling in the missing values? Some may argue that the fact that the data is missing itself is valuable information and should not be filled in. Especially since LightGBM can train without handling missing values.
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;Author's answer:&lt;/b&gt; Indeed, as you mentioned, I don't have a clear perspective on the reason of the effectiveness of missing value imputation either. However, I think that when the missing feature is strongly correlated with the target (in this competition, for instance, PreInt_EduHx-computerinternet_hoursday), it might have been better to impute the missing values rather than indirectly predicting the target from other features.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #333333; text-align: left;&quot;&gt;Discussion question #2 about&lt;span&gt;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;number of folds: I think large numbers of folds may lead overfit to validation data especially in small data, but does the nested CV prevent this ? Why do you choose 10folds?
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;Author's answer:&lt;/b&gt;&lt;span style=&quot;color: #333333; text-align: left;&quot;&gt;&lt;span&gt;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;Yes. &lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;Since no optimization was performed on the test data for each fold, I believe there is no risk of overfitting by increasing the number of folds.&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;a href=&quot;https://www.kaggle.com/competitions/child-mind-institute-problematic-internet-use/discussion/552940&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;10th Solution&lt;/a&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Used only 5 features with hierarchical bayes model&lt;/li&gt;
&lt;li&gt;code: &lt;a href=&quot;https://www.kaggle.com/code/junpeimorioka/10th-place-5-features-hierarchical-bayes&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://www.kaggle.com/code/junpeimorioka/10th-place-5-features-hierarchical-bayes&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;Final Conclusion&lt;/b&gt;&lt;/span&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;So, most of the rankers realized that the CV-LB relationship was weak in this competition.&amp;nbsp;&lt;/li&gt;
&lt;li&gt;Indicating that we should stick on the CV score.&amp;nbsp;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;1. threshold optimization leading to unstable results&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;2. Using simpler model due to unstable lb-cv score&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;3. Data leakage problem&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;4. SII vs. PCIAT total as target label&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;5. Using mean, std, ... values instead of autoencoder(led to lower cv score)&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;6. various imputations&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;7. more common to get rid of optimizations for simpler model&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;My Solution&lt;/b&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;It was my first competition but I knew the problem of the CV-LB score in this comp.&amp;nbsp;&lt;/li&gt;
&lt;li&gt;However, I wasn't able to establish stable standard of CV just as the high solutions did.&lt;/li&gt;
&lt;li&gt;I also extensively took the idea of high LB score solutions into account: for example, autoencoder and tabnet.
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;However I found out simpler model with both of them can score up to 0.439 among my solutions which is a bronze medal score.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style5&quot; /&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Persistence&amp;nbsp;is&amp;nbsp;very&amp;nbsp;important.&amp;nbsp;You&amp;nbsp;should&amp;nbsp;not&amp;nbsp;give&amp;nbsp;up&amp;nbsp;unless&amp;nbsp;you&amp;nbsp;are&amp;nbsp;forced&amp;nbsp;to&amp;nbsp;give&amp;nbsp;up.&lt;br /&gt;- Elon Musk -&lt;/span&gt;&lt;/blockquote&gt;</description>
      <category>대회</category>
      <category>child mind institute &amp;mdash; problematic internet us</category>
      <category>shake-up</category>
      <author>dongsunseng</author>
      <guid isPermaLink="true">https://dongsunseng.tistory.com/78</guid>
      <comments>https://dongsunseng.tistory.com/entry/Child-Mind-Institute-%E2%80%94-Problematic-Internet-Use-The-Greatest-Shake-Up#entry78comment</comments>
      <pubDate>Mon, 23 Dec 2024 17:01:32 +0900</pubDate>
    </item>
    <item>
      <title>[Kaggle Study] #13 Mercari Price Suggestion Challenge</title>
      <link>https://dongsunseng.tistory.com/entry/Kaggle-Study-13-Mercari-Price-Suggestion-Challenge</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;Twelveth competition following Youhan Lee's curriculum.&lt;b&gt;&lt;span&gt;&lt;span&gt; Natural Language Processing competition&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;.&lt;/p&gt;
&lt;h4 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;a href=&quot;https://www.kaggle.com/code/thykhuely/mercari-interactive-eda-topic-modelling&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;First Kernel:&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt; Mercari&amp;nbsp;Interactive&amp;nbsp;EDA&amp;nbsp;+&amp;nbsp;Topic&amp;nbsp;Modelling&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/a&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;EDA kernel with matplotlib.&lt;/li&gt;
&lt;/ul&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;Insight / Summary:&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;1.&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt; Log transformation on target var&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;Our response or target variables is the&lt;span&gt;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;price&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;&lt;span&gt;&amp;nbsp;&lt;/span&gt;we are suggesting to the Mercari's marketplace sellers. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;The median price of all the items in the training is about&lt;span&gt;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;$&lt;/span&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;267 but given the existence of some extreme values of over&lt;span&gt;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;$&lt;/span&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;100 and the maximum at&lt;span&gt;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;$&lt;/span&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;2,009, the distribution of the variables is heavily skewed to the left. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;So let's make log-transformation on the price (we added +1 to the value before the transformation to avoid zero and negative values).&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1733376305871&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;plt.subplot(1, 2, 1)
(train['price']).plot.hist(bins=50, figsize=(20,10), edgecolor='white',range=[0,250])
plt.xlabel('price+', fontsize=17)
plt.ylabel('frequency', fontsize=17)
plt.tick_params(labelsize=15)
plt.title('Price Distribution - Training Set', fontsize=17)

plt.subplot(1, 2, 2)
np.log(train['price']+1).plot.hist(bins=50, figsize=(20,10), edgecolor='white')
plt.xlabel('log(price+1)', fontsize=17)
plt.ylabel('frequency', fontsize=17)
plt.tick_params(labelsize=15)
plt.title('Log(Price) Distribution - Training Set', fontsize=17)
plt.show()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2024-12-05 오후 2.25.20.png&quot; data-origin-width=&quot;1434&quot; data-origin-height=&quot;740&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/ccDcwR/btsK7L4C2rf/gfPQkzwJueKO8CH53K6740/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/ccDcwR/btsK7L4C2rf/gfPQkzwJueKO8CH53K6740/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/ccDcwR/btsK7L4C2rf/gfPQkzwJueKO8CH53K6740/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FccDcwR%2FbtsK7L4C2rf%2FgfPQkzwJueKO8CH53K6740%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;732&quot; height=&quot;378&quot; data-filename=&quot;스크린샷 2024-12-05 오후 2.25.20.png&quot; data-origin-width=&quot;1434&quot; data-origin-height=&quot;740&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;2. Dealing with Item Description feature&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;It&amp;nbsp;will&amp;nbsp;be&amp;nbsp;more&amp;nbsp;challenging&amp;nbsp;to&amp;nbsp;parse&amp;nbsp;through&amp;nbsp;this&amp;nbsp;particular&amp;nbsp;item&amp;nbsp;since&amp;nbsp;it's&amp;nbsp;unstructured&amp;nbsp;data.&amp;nbsp;&lt;/li&gt;
&lt;li&gt;Does&amp;nbsp;it&amp;nbsp;mean&amp;nbsp;a&amp;nbsp;more&amp;nbsp;detailed&amp;nbsp;and&amp;nbsp;lengthy&amp;nbsp;description&amp;nbsp;will&amp;nbsp;result&amp;nbsp;in&amp;nbsp;a&amp;nbsp;higher&amp;nbsp;bidding&amp;nbsp;price?&amp;nbsp;&lt;/li&gt;
&lt;li&gt;We&amp;nbsp;will&amp;nbsp;strip&amp;nbsp;out&amp;nbsp;all&amp;nbsp;punctuations,&amp;nbsp;remove&amp;nbsp;some&amp;nbsp;english&amp;nbsp;stop&amp;nbsp;words&amp;nbsp;(i.e.&amp;nbsp;redundant&amp;nbsp;words&amp;nbsp;such&amp;nbsp;as&amp;nbsp;&quot;a&quot;,&amp;nbsp;&quot;the&quot;,&amp;nbsp;etc.)&amp;nbsp;and&amp;nbsp;any&amp;nbsp;other&amp;nbsp;words&amp;nbsp;with&amp;nbsp;a&amp;nbsp;length&amp;nbsp;less&amp;nbsp;than&amp;nbsp;3:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1733378536824&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;def wordCount(text):
    # convert to lower case and strip regex
    try:
         # convert to lower case and strip regex
        text = text.lower()
        regex = re.compile('[' +re.escape(string.punctuation) + '0-9\\r\\t\\n]')
        txt = regex.sub(&quot; &quot;, text)
        # tokenize
        # words = nltk.word_tokenize(clean_txt)
        # remove words in stop words
        words = [w for w in txt.split(&quot; &quot;) \
                 if not w in stop_words.ENGLISH_STOP_WORDS and len(w)&amp;gt;3]
        return len(words)
    except: 
        return 0&lt;/code&gt;&lt;/pre&gt;
&lt;pre id=&quot;code_1733378550550&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;# add a column of word counts to both the training and test set
train['desc_len'] = train['item_description'].apply(lambda x: wordCount(x))
test['desc_len'] = test['item_description'].apply(lambda x: wordCount(x))&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;We also need to check if there are any missing values in the item description (4 observations don't have a description) andl remove those observations from our training set.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1733378617822&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;train.item_description.isnull().sum()
# result: 4

# remove missing values in item description
train = train[pd.notnull(train['item_description'])]&lt;/code&gt;&lt;/pre&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;3. Pre-processing: tokenization&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;b&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Most&amp;nbsp;of&amp;nbsp;the&amp;nbsp;time,&amp;nbsp;the&amp;nbsp;first&amp;nbsp;steps&amp;nbsp;of&amp;nbsp;an&amp;nbsp;NLP&amp;nbsp;project&amp;nbsp;is&amp;nbsp;to&amp;nbsp;&quot;tokenize&quot;&amp;nbsp;your&amp;nbsp;documents,&amp;nbsp;which&amp;nbsp;main&amp;nbsp;purpose&amp;nbsp;is&amp;nbsp;to&amp;nbsp;normalize&amp;nbsp;our&amp;nbsp;texts.&amp;nbsp;&lt;/li&gt;
&lt;li&gt;The&amp;nbsp;three&amp;nbsp;fundamental&amp;nbsp;stages&amp;nbsp;will&amp;nbsp;usually&amp;nbsp;include:&lt;br /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;break the descriptions into sentences and then break the sentences into tokens&lt;/li&gt;
&lt;li&gt;remove&amp;nbsp;punctuation&amp;nbsp;and&amp;nbsp;stop&amp;nbsp;words&lt;/li&gt;
&lt;li&gt;lowercase&amp;nbsp;the&amp;nbsp;tokens&lt;/li&gt;
&lt;li&gt;herein,&amp;nbsp;I&amp;nbsp;will&amp;nbsp;also&amp;nbsp;only&amp;nbsp;consider&amp;nbsp;words&amp;nbsp;that&amp;nbsp;have&amp;nbsp;length&amp;nbsp;equal&amp;nbsp;to&amp;nbsp;or&amp;nbsp;greater&amp;nbsp;than&amp;nbsp;3&amp;nbsp;characters&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1733379484263&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;stop = set(stopwords.words('english'))
def tokenize(text):
    &quot;&quot;&quot;
    sent_tokenize(): segment text into sentences
    word_tokenize(): break sentences into words
    &quot;&quot;&quot;
    try: 
        regex = re.compile('[' +re.escape(string.punctuation) + '0-9\\r\\t\\n]')
        text = regex.sub(&quot; &quot;, text) # remove punctuation
        
        tokens_ = [word_tokenize(s) for s in sent_tokenize(text)]
        tokens = []
        for token_by_sent in tokens_:
            tokens += token_by_sent
        tokens = list(filter(lambda t: t.lower() not in stop, tokens))
        filtered_tokens = [w for w in tokens if re.search('[a-zA-Z]', w)]
        filtered_tokens = [w.lower() for w in filtered_tokens if len(w)&amp;gt;=3]
        
        return filtered_tokens
            
    except TypeError as e: print(text,e)&lt;/code&gt;&lt;/pre&gt;
&lt;pre id=&quot;code_1733379504234&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;# apply the tokenizer into the item descriptipn column
train['tokens'] = train['item_description'].map(tokenize)
test['tokens'] = test['item_description'].map(tokenize)

train.reset_index(drop=True, inplace=True)
test.reset_index(drop=True, inplace=True)&lt;/code&gt;&lt;/pre&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;4. WordCloud package&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;We could aso use the package&lt;span&gt;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;WordCloud&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;&lt;span&gt;&amp;nbsp;&lt;/span&gt;to easily visualize which words has the highest frequencies within each category:&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1733379578057&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;# build dictionary with key=category and values as all the descriptions related.
cat_desc = dict()
for cat in general_cats: 
    text = &quot; &quot;.join(train.loc[train['general_cat']==cat, 'item_description'].values)
    cat_desc[cat] = tokenize(text)


# find the most common words for the top 4 categories
women100 = Counter(cat_desc['Women']).most_common(100)
beauty100 = Counter(cat_desc['Beauty']).most_common(100)
kids100 = Counter(cat_desc['Kids']).most_common(100)
electronics100 = Counter(cat_desc['Electronics']).most_common(100)

def generate_wordcloud(tup):
    wordcloud = WordCloud(background_color='white',
                          max_words=50, max_font_size=40,
                          random_state=42
                         ).generate(str(tup))
    return wordcloud&lt;/code&gt;&lt;/pre&gt;
&lt;pre id=&quot;code_1733379612463&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;fig,axes = plt.subplots(2, 2, figsize=(30, 15))

ax = axes[0, 0]
ax.imshow(generate_wordcloud(women100), interpolation=&quot;bilinear&quot;)
ax.axis('off')
ax.set_title(&quot;Women Top 100&quot;, fontsize=30)

ax = axes[0, 1]
ax.imshow(generate_wordcloud(beauty100))
ax.axis('off')
ax.set_title(&quot;Beauty Top 100&quot;, fontsize=30)

ax = axes[1, 0]
ax.imshow(generate_wordcloud(kids100))
ax.axis('off')
ax.set_title(&quot;Kids Top 100&quot;, fontsize=30)

ax = axes[1, 1]
ax.imshow(generate_wordcloud(electronics100))
ax.axis('off')
ax.set_title(&quot;Electronic Top 100&quot;, fontsize=30)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2024-12-05 오후 3.20.24.png&quot; data-origin-width=&quot;2068&quot; data-origin-height=&quot;1066&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/c2K1ZB/btsK7c2RzaP/KcM48JDkSaKVNBWvVWtmtk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/c2K1ZB/btsK7c2RzaP/KcM48JDkSaKVNBWvVWtmtk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/c2K1ZB/btsK7c2RzaP/KcM48JDkSaKVNBWvVWtmtk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fc2K1ZB%2FbtsK7c2RzaP%2FKcM48JDkSaKVNBWvVWtmtk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;2068&quot; height=&quot;1066&quot; data-filename=&quot;스크린샷 2024-12-05 오후 3.20.24.png&quot; data-origin-width=&quot;2068&quot; data-origin-height=&quot;1066&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;5. Pre-processing: tf-idf&lt;br /&gt;&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;tf-idf is the acronym for Term Frequency-Inverse Document Frequency.&lt;/li&gt;
&lt;li&gt;It quantifies the importance of a particular word in relative to the vocabulary of a collection of documents or corpus.&lt;/li&gt;
&lt;li&gt;The metric depends on two factors:&lt;br /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;Term Frequency&lt;/b&gt;: the occurences of a word in a given document (i.e. bag of words)&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Inverse&amp;nbsp;Document&amp;nbsp;Frequency&lt;/b&gt;:&amp;nbsp;the&amp;nbsp;reciprocal&amp;nbsp;number&amp;nbsp;of&amp;nbsp;times&amp;nbsp;a&amp;nbsp;word&amp;nbsp;occurs&amp;nbsp;in&amp;nbsp;a&amp;nbsp;corpus&amp;nbsp;of&amp;nbsp;documents&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Think&amp;nbsp;about&amp;nbsp;of&amp;nbsp;it&amp;nbsp;this&amp;nbsp;way:&amp;nbsp;If&amp;nbsp;the&amp;nbsp;word&amp;nbsp;is&amp;nbsp;used&amp;nbsp;extensively&amp;nbsp;in&amp;nbsp;all&amp;nbsp;documents,&amp;nbsp;its&amp;nbsp;existence&amp;nbsp;within&amp;nbsp;a&amp;nbsp;specific&amp;nbsp;document&amp;nbsp;will&amp;nbsp;not&amp;nbsp;be&amp;nbsp;able&amp;nbsp;to&amp;nbsp;provide&amp;nbsp;us&amp;nbsp;much&amp;nbsp;specific&amp;nbsp;information&amp;nbsp;about&amp;nbsp;the&amp;nbsp;document&amp;nbsp;itself.&amp;nbsp;&lt;/li&gt;
&lt;li&gt;So the second term could be seen as a penalty term that penalizes common words such as &quot;a&quot;, &quot;the&quot;, &quot;and&quot;, etc. tf-idf can therefore, be seen as a weighting scheme for words relevancy in a specific document.&amp;nbsp;&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1733388739929&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(min_df=10,
                             max_features=180000,
                             tokenizer=tokenize,
                             ngram_range=(1, 2))
                             
all_desc = np.append(train['item_description'].values, test['item_description'].values)
vz = vectorizer.fit_transform(list(all_desc))&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;vz is a tfidf matrix where:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;the number of &lt;b&gt;rows&lt;/b&gt; is the total number of descriptions&lt;/li&gt;
&lt;li&gt;the number of &lt;b&gt;columns&lt;/b&gt; is the total number of unique tokens across the descriptions&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;Given the high dimension of our tfidf matrix, we need to reduce their dimension using the Singular Value Decomposition (SVD) technique. &lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;And to visualize our vocabulary, we could next use t-SNE to reduce the dimension from 50 to 2. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;t-SNE is more suitable for dimensionality reduction to 2 or 3.&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;SVD (Singular Value Decomposition):&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Linear dimensionality reduction&lt;/li&gt;
&lt;li&gt;Preserves major patterns in data&lt;/li&gt;
&lt;li&gt;Relatively fast computation&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;Suitable for reduction to larger dimensions (e.g., 50 dimensions)&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;t-SNE (t-Distributed Stochastic Neighbor Embedding):&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Non-linear dimensionality reduction&lt;/li&gt;
&lt;li&gt;Preserves similarity relationships between data points&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;Optimized for visualization (2-3 dimensions)&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;Effectively reveals cluster structures&lt;/li&gt;
&lt;/ul&gt;
&lt;p id=&quot;t-Distributed-Stochastic-Neighbor-Embedding-(t-SNE)&quot; style=&quot;color: #202214; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;6. t-Distributed Stochastic Neighbor Embedding (t-SNE)&lt;/b&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;t-SNE is a technique for dimensionality reduction that is particularly well suited for the visualization of high-dimensional datasets.&lt;/li&gt;
&lt;li&gt;The goal is to take a set of points in a high-dimensional space and find a representation of those points in a lower-dimensional space, typically the 2D plane.&lt;/li&gt;
&lt;li&gt;It is based on probability distributions with random walk on neighborhood graphs to find the structure within the data.&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;But since t-SNE complexity is significantly high, usually we'd use other high-dimension reduction techniques before applying t-SNE.&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;First, let's take a sample from the both training and testing item's description since t-SNE can take a very long time to execute.&lt;/li&gt;
&lt;li&gt;We can then reduce the dimension of each vector from to n_components (50) using SVD.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1733389653820&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;trn = train.copy()
tst = test.copy()
trn['is_train'] = 1
tst['is_train'] = 0

sample_sz = 15000

combined_df = pd.concat([trn, tst])
combined_sample = combined_df.sample(n=sample_sz)
vz_sample = vectorizer.fit_transform(list(combined_sample['item_description']))

from sklearn.decomposition import TruncatedSVD

n_comp=30
svd = TruncatedSVD(n_components=n_comp, random_state=42)
svd_tfidf = svd.fit_transform(vz_sample)&lt;/code&gt;&lt;/pre&gt;
&lt;pre id=&quot;code_1733389824322&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;# Dimension from 50 to 2
from sklearn.manifold import TSNE
tsne_model = TSNE(n_components=2, verbose=1, random_state=42, n_iter=500)

tsne_tfidf = tsne_model.fit_transform(svd_tfidf)&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;7. tf-idf clustering of the item description&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;pre class=&quot;python&quot; style=&quot;background-color: #f1f3f4; color: #3c4043; text-align: start;&quot; data-ke-language=&quot;python&quot;&gt;&lt;code&gt;plot_tfidf.scatter(x='x', y='y', source=tfidf_df, alpha=0.7)
hover = plot_tfidf.select(dict(type=HoverTool))
hover.tooltips={&quot;description&quot;: &quot;@description&quot;, &quot;tokens&quot;: &quot;@tokens&quot;, &quot;category&quot;:&quot;@category&quot;}
show(plot_tfidf)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2024-12-05 오후 6.13.49.png&quot; data-origin-width=&quot;1358&quot; data-origin-height=&quot;1164&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bFKrRv/btsK82q8WOu/CLOMUmfjTSPKcm2151hZNk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bFKrRv/btsK82q8WOu/CLOMUmfjTSPKcm2151hZNk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bFKrRv/btsK82q8WOu/CLOMUmfjTSPKcm2151hZNk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbFKrRv%2FbtsK82q8WOu%2FCLOMUmfjTSPKcm2151hZNk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;509&quot; height=&quot;436&quot; data-filename=&quot;스크린샷 2024-12-05 오후 6.13.49.png&quot; data-origin-width=&quot;1358&quot; data-origin-height=&quot;1164&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;8. K-Means Clustering&lt;br /&gt;&lt;/b&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;K-means clustering objective is &lt;span style=&quot;background-color: #9feec3;&quot;&gt;&lt;b&gt;to minimize the average squared Euclidean distance of the document / description from their cluster centroids.&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1733390172365&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;from sklearn.cluster import MiniBatchKMeans

num_clusters = 30 # need to be selected wisely
kmeans_model = MiniBatchKMeans(n_clusters=num_clusters,
                               init='k-means++',
                               n_init=1,
                               init_size=1000, batch_size=1000, verbose=0, max_iter=1000)

kmeans = kmeans_model.fit(vz)
kmeans_clusters = kmeans.predict(vz)
kmeans_distances = kmeans.transform(vz)

sorted_centroids = kmeans.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()

for i in range(num_clusters):
    print(&quot;Cluster %d:&quot; % i)
    aux = ''
    for j in sorted_centroids[i, :10]:
        aux += terms[j] + ' | '
    print(aux)
    print()&lt;/code&gt;&lt;/pre&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;In order to plot these clusters, first we will need to reduce the dimension of the distances to 2 using tsne:&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1733390376272&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;# repeat the same steps for the sample
kmeans = kmeans_model.fit(vz_sample)
kmeans_clusters = kmeans.predict(vz_sample)
kmeans_distances = kmeans.transform(vz_sample)
# reduce dimension to 2 using tsne
tsne_kmeans = tsne_model.fit_transform(kmeans_distances)&lt;/code&gt;&lt;/pre&gt;
&lt;pre id=&quot;code_1733390398978&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;#combined_sample.reset_index(drop=True, inplace=True)
kmeans_df = pd.DataFrame(tsne_kmeans, columns=['x', 'y'])
kmeans_df['cluster'] = kmeans_clusters
kmeans_df['description'] = combined_sample['item_description']
kmeans_df['category'] = combined_sample['general_cat']
#kmeans_df['cluster']=kmeans_df.cluster.astype(str).astype('category')

plot_kmeans = bp.figure(plot_width=700, plot_height=600,
                        title=&quot;KMeans clustering of the description&quot;,
    tools=&quot;pan,wheel_zoom,box_zoom,reset,hover,previewsave&quot;,
    x_axis_type=None, y_axis_type=None, min_border=1)

source = ColumnDataSource(data=dict(x=kmeans_df['x'], y=kmeans_df['y'],
                                    color=colormap[kmeans_clusters],
                                    description=kmeans_df['description'],
                                    category=kmeans_df['category'],
                                    cluster=kmeans_df['cluster']))

plot_kmeans.scatter(x='x', y='y', color='color', source=source)
hover = plot_kmeans.select(dict(type=HoverTool))
hover.tooltips={&quot;description&quot;: &quot;@description&quot;, &quot;category&quot;: &quot;@category&quot;, &quot;cluster&quot;:&quot;@cluster&quot; }
show(plot_kmeans)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2024-12-05 오후 6.20.11.png&quot; data-origin-width=&quot;1356&quot; data-origin-height=&quot;1204&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/SSxEY/btsK8TnuVWG/bFw2Yo3TpcuUFTkSxP70Kk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/SSxEY/btsK8TnuVWG/bFw2Yo3TpcuUFTkSxP70Kk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/SSxEY/btsK8TnuVWG/bFw2Yo3TpcuUFTkSxP70Kk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FSSxEY%2FbtsK8TnuVWG%2FbFw2Yo3TpcuUFTkSxP70Kk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;554&quot; height=&quot;492&quot; data-filename=&quot;스크린샷 2024-12-05 오후 6.20.11.png&quot; data-origin-width=&quot;1356&quot; data-origin-height=&quot;1204&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;9. Latent&amp;nbsp;Dirichlet&amp;nbsp;Allocation&lt;br /&gt;&lt;/b&gt;&lt;b&gt;&lt;/b&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Latent&amp;nbsp;Dirichlet&amp;nbsp;Allocation&amp;nbsp;(LDA)&amp;nbsp;is&amp;nbsp;an&amp;nbsp;algorithms&amp;nbsp;used&amp;nbsp;to&amp;nbsp;discover&amp;nbsp;the&amp;nbsp;topics&amp;nbsp;that&amp;nbsp;are&amp;nbsp;present&amp;nbsp;in&amp;nbsp;a&amp;nbsp;corpus.&lt;/li&gt;
&lt;li&gt;LDA starts from a fixed number of topics.&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;Each topic is represented as a distribution over words, and each document is then represented as a distribution over topics. &lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;Although the tokens themselves are meaningless, the probability distributions over words provided by the topics provide a sense of the different ideas contained in the documents.&lt;/li&gt;
&lt;li&gt;Its input is a bag of words, i.e. each document represented as a row, with each columns containing the count of words in the corpus.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1733390582867&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;cvectorizer = CountVectorizer(min_df=4,
                              max_features=180000,
                              tokenizer=tokenize,
                              ngram_range=(1,2))
                              
cvz = cvectorizer.fit_transform(combined_sample['item_description'])

lda_model = LatentDirichletAllocation(n_components=20,
                                      learning_method='online',
                                      max_iter=20,
                                      random_state=42)
                                      
X_topics = lda_model.fit_transform(cvz)

n_top_words = 10
topic_summaries = []

topic_word = lda_model.components_  # get the topic words
vocab = cvectorizer.get_feature_names()

for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
    topic_summaries.append(' '.join(topic_words))
    print('Topic {}: {}'.format(i, ' | '.join(topic_words)))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2024-12-05 오후 6.23.34.png&quot; data-origin-width=&quot;1984&quot; data-origin-height=&quot;966&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/wMo7F/btsK8tQjroj/wsyorKMws7Gi8OFBSAKH30/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/wMo7F/btsK8tQjroj/wsyorKMws7Gi8OFBSAKH30/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/wMo7F/btsK8tQjroj/wsyorKMws7Gi8OFBSAKH30/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FwMo7F%2FbtsK8tQjroj%2FwsyorKMws7Gi8OFBSAKH30%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1984&quot; height=&quot;966&quot; data-filename=&quot;스크린샷 2024-12-05 오후 6.23.34.png&quot; data-origin-width=&quot;1984&quot; data-origin-height=&quot;966&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;pre id=&quot;code_1733391181098&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;# reduce dimension to 2 using tsne
tsne_lda = tsne_model.fit_transform(X_topics)&lt;/code&gt;&lt;/pre&gt;
&lt;pre id=&quot;code_1733391206881&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;unnormalized = np.matrix(X_topics)
doc_topic = unnormalized/unnormalized.sum(axis=1)

lda_keys = []
for i, tweet in enumerate(combined_sample['item_description']):
    lda_keys += [doc_topic[i].argmax()]

lda_df = pd.DataFrame(tsne_lda, columns=['x','y'])
lda_df['description'] = combined_sample['item_description']
lda_df['category'] = combined_sample['general_cat']
lda_df['topic'] = lda_keys
lda_df['topic'] = lda_df['topic'].map(int)&lt;/code&gt;&lt;/pre&gt;
&lt;pre id=&quot;code_1733391218504&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;plot_lda = bp.figure(plot_width=700,
                     plot_height=600,
                     title=&quot;LDA topic visualization&quot;,
    tools=&quot;pan,wheel_zoom,box_zoom,reset,hover,previewsave&quot;,
    x_axis_type=None, y_axis_type=None, min_border=1)

source = ColumnDataSource(data=dict(x=lda_df['x'], y=lda_df['y'],
                                    color=colormap[lda_keys],
                                    description=lda_df['description'],
                                    topic=lda_df['topic'],
                                    category=lda_df['category']))

plot_lda.scatter(source=source, x='x', y='y', color='color')
hover = plot_kmeans.select(dict(type=HoverTool))
hover = plot_lda.select(dict(type=HoverTool))
hover.tooltips={&quot;description&quot;:&quot;@description&quot;,
                &quot;topic&quot;:&quot;@topic&quot;, &quot;category&quot;:&quot;@category&quot;}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2024-12-05 오후 6.33.52.png&quot; data-origin-width=&quot;1358&quot; data-origin-height=&quot;1206&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/dpxj7o/btsK9DqRnK0/K2x9JO58aTnITX7LakTVKK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/dpxj7o/btsK9DqRnK0/K2x9JO58aTnITX7LakTVKK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/dpxj7o/btsK9DqRnK0/K2x9JO58aTnITX7LakTVKK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fdpxj7o%2FbtsK9DqRnK0%2FK2x9JO58aTnITX7LakTVKK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;539&quot; height=&quot;479&quot; data-filename=&quot;스크린샷 2024-12-05 오후 6.33.52.png&quot; data-origin-width=&quot;1358&quot; data-origin-height=&quot;1206&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;pyLDAvis &lt;/b&gt;is a powerful tool&lt;span&gt;&amp;nbsp;&lt;/span&gt;that gives us an interactive visualization for LDA.&lt;/li&gt;
&lt;li&gt;It's a shame that by putting the HTML of the visualization using pyLDAvis, it will distort the layout of the kernel, I won't upload in here.&lt;/li&gt;
&lt;li&gt;But if you follow the below code, there should be an HTML file generated with very interesting interactive bubble chart that visualizes the space of your topic clusters and the term components within each topic.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1733391458821&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;def prepareLDAData():
    data = {
        'vocab': vocab,
        'doc_topic_dists': doc_topic,
        'doc_lengths': list(lda_df['len_docs']),
        'term_frequency':cvectorizer.vocabulary_,
        'topic_term_dists': lda_model.components_
    } 
    return data
    
import pyLDAvis

lda_df['len_docs'] = combined_sample['tokens'].map(len)
ldadata = prepareLDAData()
pyLDAvis.enable_notebook()
prepared_data = pyLDAvis.prepare(**ldadata)&lt;/code&gt;&lt;/pre&gt;
&lt;h4 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;a href=&quot;https://www.kaggle.com/code/knowledgegrappler/a-simple-nn-solution-with-keras-0-48611-pl&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;Second Kernel:&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt; A&amp;nbsp;simple&amp;nbsp;nn&amp;nbsp;solution&amp;nbsp;with&amp;nbsp;Keras&amp;nbsp;(~0.48611&amp;nbsp;PL)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/a&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Kernel using neural network for modeling.&lt;/li&gt;
&lt;/ul&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;Insight / Summary:&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;1.&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt; Metric&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;&lt;b&gt;&lt;span style=&quot;color: #3c4043; text-align: left;&quot;&gt;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1733391508988&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;def rmsle(y, y_pred):
    assert len(y) == len(y_pred)
    to_sum = [(math.log(y_pred[i] + 1) - math.log(y[i] + 1)) ** 2.0 for i,pred in enumerate(y_pred)]
    return (sum(to_sum) * (1.0/len(y))) ** 0.5
#Source: https://www.kaggle.com/marknagelberg/rmsle-function&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;2. Missing value&lt;/b&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1733391617373&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;#HANDLE MISSING VALUES
print(&quot;Handling missing values...&quot;)
def handle_missing(dataset):
    dataset.category_name.fillna(value=&quot;missing&quot;, inplace=True)
    dataset.brand_name.fillna(value=&quot;missing&quot;, inplace=True)
    dataset.item_description.fillna(value=&quot;missing&quot;, inplace=True)
    return (dataset)

train = handle_missing(train)
test = handle_missing(test)&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;3. Categorical data - label encoding&lt;/b&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1733391711342&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;#PROCESS CATEGORICAL DATA
print(&quot;Handling categorical variables...&quot;)
le = LabelEncoder()

le.fit(np.hstack([train.category_name, test.category_name]))
train.category_name = le.transform(train.category_name)
test.category_name = le.transform(test.category_name)

le.fit(np.hstack([train.brand_name, test.brand_name]))
train.brand_name = le.transform(train.brand_name)
test.brand_name = le.transform(test.brand_name)
del le

train.head(3)&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;4. raw text - tokenization&lt;/b&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1733391747167&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;#PROCESS TEXT: RAW
print(&quot;Text to seq process...&quot;)
from keras.preprocessing.text import Tokenizer
raw_text = np.hstack([train.item_description.str.lower(), train.name.str.lower()])

print(&quot;   Fitting tokenizer...&quot;)
tok_raw = Tokenizer()
tok_raw.fit_on_texts(raw_text)
print(&quot;   Transforming text to seq...&quot;)

train[&quot;seq_item_description&quot;] = tok_raw.texts_to_sequences(train.item_description.str.lower())
test[&quot;seq_item_description&quot;] = tok_raw.texts_to_sequences(test.item_description.str.lower())
train[&quot;seq_name&quot;] = tok_raw.texts_to_sequences(train.name.str.lower())
test[&quot;seq_name&quot;] = tok_raw.texts_to_sequences(test.name.str.lower())

train.head(3)&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;5. Scaling target variable&lt;/b&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1733392339858&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;#SCALE target variable
train[&quot;target&quot;] = np.log(train.price+1)
target_scaler = MinMaxScaler(feature_range=(-1, 1))
train[&quot;target&quot;] = target_scaler.fit_transform(train.target.reshape(-1,1))
pd.DataFrame(train.target).hist()&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;6. Modeling GRU NN&lt;/b&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;1) Finding max values for NN&lt;/b&gt;&lt;/p&gt;
&lt;pre class=&quot;python&quot; style=&quot;background-color: #f1f3f4; color: #3c4043; text-align: start;&quot; data-ke-language=&quot;python&quot;&gt;&lt;code&gt;#EMBEDDINGS MAX VALUE
#Base on the histograms, we select the next lengths
MAX_NAME_SEQ = 10
MAX_ITEM_DESC_SEQ = 75
MAX_TEXT = np.max([np.max(train.seq_name.max())
                   , np.max(test.seq_name.max())
                  , np.max(train.seq_item_description.max())
                  , np.max(test.seq_item_description.max())])+2
MAX_CATEGORY = np.max([train.category_name.max(), test.category_name.max()])+1
MAX_BRAND = np.max([train.brand_name.max(), test.brand_name.max()])+1
MAX_CONDITION = np.max([train.item_condition_id.max(), test.item_condition_id.max()])+1&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;2) Actual modeling&lt;/b&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1733392484623&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;#KERAS MODEL DEFINITION
from keras.layers import Input, Dropout, Dense, BatchNormalization, Activation, concatenate, GRU, Embedding, Flatten, BatchNormalization
from keras.models import Model
from keras.callbacks import ModelCheckpoint, Callback, EarlyStopping
from keras import backend as K

def get_callbacks(filepath, patience=2):
    es = EarlyStopping('val_loss', patience=patience, mode=&quot;min&quot;)
    msave = ModelCheckpoint(filepath, save_best_only=True)
    return [es, msave]

def rmsle_cust(y_true, y_pred):
    first_log = K.log(K.clip(y_pred, K.epsilon(), None) + 1.)
    second_log = K.log(K.clip(y_true, K.epsilon(), None) + 1.)
    return K.sqrt(K.mean(K.square(first_log - second_log), axis=-1))

def get_model():
    #params
    dr_r = 0.1
    
    #Inputs
    name = Input(shape=[X_train[&quot;name&quot;].shape[1]], name=&quot;name&quot;)
    item_desc = Input(shape=[X_train[&quot;item_desc&quot;].shape[1]], name=&quot;item_desc&quot;)
    brand_name = Input(shape=[1], name=&quot;brand_name&quot;)
    category_name = Input(shape=[1], name=&quot;category_name&quot;)
    item_condition = Input(shape=[1], name=&quot;item_condition&quot;)
    num_vars = Input(shape=[X_train[&quot;num_vars&quot;].shape[1]], name=&quot;num_vars&quot;)
    
    #Embeddings layers
    emb_name = Embedding(MAX_TEXT, 50)(name)
    emb_item_desc = Embedding(MAX_TEXT, 50)(item_desc)
    emb_brand_name = Embedding(MAX_BRAND, 10)(brand_name)
    emb_category_name = Embedding(MAX_CATEGORY, 10)(category_name)
    emb_item_condition = Embedding(MAX_CONDITION, 5)(item_condition)
    
    #rnn layer
    rnn_layer1 = GRU(16) (emb_item_desc)
    rnn_layer2 = GRU(8) (emb_name)
    
    #main layer
    main_l = concatenate([
        Flatten() (emb_brand_name)
        , Flatten() (emb_category_name)
        , Flatten() (emb_item_condition)
        , rnn_layer1
        , rnn_layer2
        , num_vars
    ])
    main_l = Dropout(dr_r) (Dense(128) (main_l))
    main_l = Dropout(dr_r) (Dense(64) (main_l))
    
    #output
    output = Dense(1, activation=&quot;linear&quot;) (main_l)
    
    #model
    model = Model([name, item_desc, brand_name
                   , category_name, item_condition, num_vars], output)
    model.compile(loss=&quot;mse&quot;, optimizer=&quot;adam&quot;, metrics=[&quot;mae&quot;, rmsle_cust])
    
    return model

    
model = get_model()
model.summary()&lt;/code&gt;&lt;/pre&gt;
&lt;h4 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;a href=&quot;https://www.kaggle.com/code/rumbok/ridge-lb-0-41944&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;Third Kernel: Ridge&amp;nbsp;(LB&amp;nbsp;0.41943)&lt;/b&gt;&lt;/a&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Mainly using Ridge model kernel.&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;Insight / Summary:&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;1.&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt; Overall Summary&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;This code implements a machine learning pipeline for product price prediction.&lt;/li&gt;
&lt;li&gt;Here are the &lt;b&gt;main steps&lt;/b&gt;:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;Data Preprocessing:&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Remove data with zero prices&lt;/li&gt;
&lt;li&gt;Clean text data including categories, brand names, product names, and descriptions&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;Fill missing brand names using the SymSpell algorithm based on similarity&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;Split&amp;nbsp;categories&amp;nbsp;into&amp;nbsp;major/medium/minor&amp;nbsp;classifications&lt;/li&gt;
&lt;li&gt;Combine text data to create rich features&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Feature Engineering:&lt;/b&gt;&lt;br /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Vectorize&amp;nbsp;text&amp;nbsp;data&amp;nbsp;using&amp;nbsp;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;HashingVectorizer&lt;/b&gt;&lt;/span&gt; and&amp;nbsp;CountVectorizer&lt;/li&gt;
&lt;li&gt;Encode categorical variables using OneHotEncoder&lt;/li&gt;
&lt;li&gt;Apply&amp;nbsp;TF-IDF&amp;nbsp;transformation&amp;nbsp;to&amp;nbsp;reflect&amp;nbsp;text&amp;nbsp;feature&amp;nbsp;importance&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;Select only features common to both training and test sets&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Modeling:&lt;/b&gt;&lt;br /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Use Ridge regression for price prediction&lt;/li&gt;
&lt;li&gt;Use&amp;nbsp;log-transformed&amp;nbsp;prices&amp;nbsp;as&amp;nbsp;targets&lt;/li&gt;
&lt;li&gt;Generate final prices through exponential transformation of predictions&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&amp;nbsp;The main characteristics of this implementation are:&lt;br /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Focus&amp;nbsp;on&amp;nbsp;Text&amp;nbsp;Data&amp;nbsp;Processing:&lt;br /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;Use HashingVectorizer for memory-efficient handling of large vocabularies&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;Extract context information through n-gram based features&lt;/li&gt;
&lt;li&gt;Reflect word importance through TF-IDF transformation&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Efficient&amp;nbsp;Memory&amp;nbsp;Management:&lt;br /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Immediately release unnecessary data from memory&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;Optimize memory usage with HashingVectorizer&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;Maintain only features common to training/test sets&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Robust&amp;nbsp;Preprocessing:&lt;br /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Intelligent filling of missing brand names&lt;/li&gt;
&lt;li&gt;Hierarchical use of category information&lt;/li&gt;
&lt;li&gt;Text data normalization and combination&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Scalable&amp;nbsp;Structure:&lt;br /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;Modularization using Pipeline and FeatureUnion&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;Flexibility through custom transformer classes&lt;/li&gt;
&lt;li&gt;Support for multiprocessing&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;2. Code Details&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1733410795110&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;# Import essential libraries
import multiprocessing as mp  # Library for parallel processing
import pandas as pd  # pandas for data processing 
from time import time  # For measuring execution time
from scipy.sparse import csr_matrix  # For sparse matrix operations
import os  # For OS related functionality
from sklearn.linear_model import Ridge  # Ridge regression model
from sklearn.pipeline import FeatureUnion, Pipeline  # For feature processing pipeline
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer, TfidfTransformer  # Text processing
from sklearn.metrics import mean_squared_log_error  # Evaluation metric
from sklearn.preprocessing import OneHotEncoder  # For encoding categorical variables
import numpy as np  # For numerical operations
import gc  # Garbage collection
from sklearn.base import BaseEstimator, TransformerMixin  # For creating custom transformers
import re  # Regular expressions
from pandas.api.types import is_numeric_dtype, is_categorical_dtype  # For checking data types

# Multithreading configuration
os.environ['MKL_NUM_THREADS'] = '4'  # Limit Intel Math Kernel Library threads
os.environ['OMP_NUM_THREADS'] = '4'  # Limit OpenMP threads
os.environ['JOBLIB_START_METHOD'] = 'forkserver'  # Set joblib parallel processing method

# Set input data path
INPUT_PATH = r'../input'

# Function to calculate Damerau-Levenshtein distance
def dameraulevenshtein(seq1, seq2):
   &quot;&quot;&quot;Calculate the Damerau-Levenshtein distance between sequences.

    This method has not been modified from the original.
    Source: http://mwh.geek.nz/2009/04/26/python-damerau-levenshtein-distance/

    This distance is the number of additions, deletions, substitutions,
    and transpositions needed to transform the first sequence into the
    second. Although generally used with strings, any sequences of
    comparable objects will work.

    Transpositions are exchanges of *consecutive* characters; all other
    operations are self-explanatory.

    This implementation is O(N*M) time and O(M) space, for N and M the
    lengths of the two sequences.

    &amp;gt;&amp;gt;&amp;gt; dameraulevenshtein('ba', 'abc')
    2
    &amp;gt;&amp;gt;&amp;gt; dameraulevenshtein('fee', 'deed')
    2

    It works with arbitrary sequences too:
    &amp;gt;&amp;gt;&amp;gt; dameraulevenshtein('abcd', ['b', 'a', 'c', 'd', 'e'])
    2
    &quot;&quot;&quot;
    # codesnippet:D0DE4716-B6E6-4161-9219-2903BF8F547F
    # Conceptually, this is based on a len(seq1) + 1 * len(seq2) + 1 matrix.
    # However, only the current and two previous rows are needed at once,
    # so we only store those.
   # Implementation maintained as original using dynamic programming
   # Stores only current row and previous two rows for memory efficiency
   oneago = None
   thisrow = list(range(1, len(seq2) + 1)) + [0]
   for x in range(len(seq1)):
       twoago, oneago, thisrow = (oneago, thisrow, [0] * len(seq2) + [x + 1])
       for y in range(len(seq2)):
           delcost = oneago[y] + 1  # Deletion cost
           addcost = thisrow[y - 1] + 1  # Addition cost
           subcost = oneago[y - 1] + (seq1[x] != seq2[y])  # Substitution cost
           thisrow[y] = min(delcost, addcost, subcost)
           # Handle transpositions
           if (x &amp;gt; 0 and y &amp;gt; 0 and seq1[x] == seq2[y - 1]
                   and seq1[x - 1] == seq2[y] and seq1[x] != seq2[y]):
               thisrow[y] = min(thisrow[y], twoago[y - 2] + 1)
   return thisrow[len(seq2) - 1]
   
class SymSpell:
   &quot;&quot;&quot;
   A class implementing the SymSpell algorithm.
   This algorithm provides an efficient method for spell correction.
   &quot;&quot;&quot;
   
   def __init__(self, max_edit_distance=3, verbose=0):
       &quot;&quot;&quot;
       Parameters:
       max_edit_distance: Maximum edit distance (how many edits to allow)
       verbose: Verbosity level (0: top suggestion only, 1: all suggestions with minimal distance, 2: all possible suggestions)
       &quot;&quot;&quot;
       self.max_edit_distance = max_edit_distance
       self.verbose = verbose
       self.dictionary = {}  # Word dictionary
       self.longest_word_length = 0  # Length of longest word

   def get_deletes_list(self, w):
       &quot;&quot;&quot;
       Generates all possible combinations of the word with characters deleted up to max_edit_distance.
       Example: &quot;word&quot; -&amp;gt; [&quot;ord&quot;, &quot;wrd&quot;, &quot;wod&quot;, &quot;wor&quot;]
       &quot;&quot;&quot;
       deletes = []
       queue = [w]
       for d in range(self.max_edit_distance):
           temp_queue = []
           for word in queue:
               if len(word) &amp;gt; 1:
                   for c in range(len(word)):
                       word_minus_c = word[:c] + word[c + 1:]
                       if word_minus_c not in deletes:
                           deletes.append(word_minus_c)
                       if word_minus_c not in temp_queue:
                           temp_queue.append(word_minus_c)
           queue = temp_queue
       return deletes

   def create_dictionary_entry(self, w):
       &quot;&quot;&quot;
       Adds a word and its derived deletion variants to the dictionary.
       Returns:
       bool: Whether a new real word was added
       &quot;&quot;&quot;
       new_real_word_added = False
       if w in self.dictionary:
           # If word exists, increase frequency
           self.dictionary[w] = (self.dictionary[w][0], self.dictionary[w][1] + 1)
       else:
           # Add new word
           self.dictionary[w] = ([], 1)
           self.longest_word_length = max(self.longest_word_length, len(w))
           
       if self.dictionary[w][1] == 1:
           # If this is the first occurrence of the word in the corpus
           new_real_word_added = True
           
       deletes = self.get_deletes_list(w)
       for item in deletes:
           if item in self.dictionary:
               # Add original word to the deletion's entry
               self.dictionary[item][0].append(w)
           else:
               # Add new deletion form
               self.dictionary[item] = ([w], 0)
               
       return new_real_word_added
       
def create_dictionary_from_arr(self, arr, token_pattern=r'[a-z]+'):
   &quot;&quot;&quot;
   Creates a word dictionary from an array.
   Parameters:
   arr: Array containing words
   token_pattern: Regular expression pattern for extracting words
   Returns:
   dictionary: Generated word dictionary
   &quot;&quot;&quot;
   total_word_count = 0  # Total words processed
   unique_word_count = 0  # Number of unique words
   
   for line in arr:
       # Split words by non-alphabetic characters
       words = re.findall(token_pattern, line.lower())
       for word in words:
           total_word_count += 1
           if self.create_dictionary_entry(word):
               unique_word_count += 1
   
   # Print processing results
   print(&quot;total words processed: %i&quot; % total_word_count)
   print(&quot;total unique words in corpus: %i&quot; % unique_word_count)
   print(&quot;total items in dictionary (corpus words and deletions): %i&quot; % len(self.dictionary))
   print(&quot; edit distance for deletions: %i&quot; % self.max_edit_distance)
   print(&quot; length of longest word in corpus: %i&quot; % self.longest_word_length)
   
   return self.dictionary

def create_dictionary(self, fname):
   &quot;&quot;&quot;
   Creates a word dictionary from a file.
   
   Parameters:
   fname: Path to the file to read
   
   Returns:
   dictionary: Generated word dictionary
   
   How it works:
   1. Reads the file line by line.
   2. Extracts words containing only alphabetic characters from each line.
   3. Converts each word to lowercase and adds it to the dictionary.
   4. Prints processing results.
   &quot;&quot;&quot;
   total_word_count = 0      # Total number of words processed
   unique_word_count = 0     # Number of unique words
   with open(fname) as file:  # Open file with context manager
       for line in file:
           # Split words by non-alphabetic characters
           # [a-z]+ pattern finds one or more consecutive lowercase letters
           words = re.findall('[a-z]+', line.lower())
           
           for word in words:
               total_word_count += 1  # Increase total word count
               # If create_dictionary_entry returns True (new word added)
               # Increase unique word count
               if self.create_dictionary_entry(word):
                   unique_word_count += 1
                   
   # Print processing results
   print(&quot;total words processed: %i&quot; % total_word_count)           # Total words processed
   print(&quot;total unique words in corpus: %i&quot; % unique_word_count)   # Unique words
   print(&quot;total items in dictionary (corpus words and deletions): %i&quot; % len(self.dictionary))  # Dictionary size
   print(&quot;  edit distance for deletions: %i&quot; % self.max_edit_distance)  # Maximum edit distance
   print(&quot;  length of longest word in corpus: %i&quot; % self.longest_word_length)  # Length of longest word
   
   return self.dictionary  # Return generated dictionary

def get_suggestions(self, string, silent=False):
        &quot;&quot;&quot;return list of suggested corrections for potentially incorrectly
           spelled word&quot;&quot;&quot;
        if (len(string) - self.longest_word_length) &amp;gt; self.max_edit_distance:
            if not silent:
                print(&quot;no items in dictionary within maximum edit distance&quot;)
            return []

        suggest_dict = {}
        min_suggest_len = float('inf')

        queue = [string]
        q_dictionary = {}  # items other than string that we've checked

        while len(queue) &amp;gt; 0:
            q_item = queue[0]  # pop
            queue = queue[1:]

            # early exit
            if ((self.verbose &amp;lt; 2) and (len(suggest_dict) &amp;gt; 0) and
                    ((len(string) - len(q_item)) &amp;gt; min_suggest_len)):
                break

            # process queue item
            if (q_item in self.dictionary) and (q_item not in suggest_dict):
                if self.dictionary[q_item][1] &amp;gt; 0:
                    # word is in dictionary, and is a word from the corpus, and
                    # not already in suggestion list so add to suggestion
                    # dictionary, indexed by the word with value (frequency in
                    # corpus, edit distance)
                    # note q_items that are not the input string are shorter
                    # than input string since only deletes are added (unless
                    # manual dictionary corrections are added)
                    assert len(string) &amp;gt;= len(q_item)
                    suggest_dict[q_item] = (self.dictionary[q_item][1],
                                            len(string) - len(q_item))
                    # early exit
                    if (self.verbose &amp;lt; 2) and (len(string) == len(q_item)):
                        break
                    elif (len(string) - len(q_item)) &amp;lt; min_suggest_len:
                        min_suggest_len = len(string) - len(q_item)

                # the suggested corrections for q_item as stored in
                # dictionary (whether or not q_item itself is a valid word
                # or merely a delete) can be valid corrections
                for sc_item in self.dictionary[q_item][0]:
                    if sc_item not in suggest_dict:

                        # compute edit distance
                        # suggested items should always be longer
                        # (unless manual corrections are added)
                        assert len(sc_item) &amp;gt; len(q_item)

                        # q_items that are not input should be shorter
                        # than original string
                        # (unless manual corrections added)
                        assert len(q_item) &amp;lt;= len(string)

                        if len(q_item) == len(string):
                            assert q_item == string
                            item_dist = len(sc_item) - len(q_item)

                        # item in suggestions list should not be the same as
                        # the string itself
                        assert sc_item != string

                        # calculate edit distance using, for example,
                        # Damerau-Levenshtein distance
                        item_dist = dameraulevenshtein(sc_item, string)

                        # do not add words with greater edit distance if
                        # verbose setting not on
                        if (self.verbose &amp;lt; 2) and (item_dist &amp;gt; min_suggest_len):
                            pass
                        elif item_dist &amp;lt;= self.max_edit_distance:
                            assert sc_item in self.dictionary  # should already be in dictionary if in suggestion list
                            suggest_dict[sc_item] = (self.dictionary[sc_item][1], item_dist)
                            if item_dist &amp;lt; min_suggest_len:
                                min_suggest_len = item_dist

                        # depending on order words are processed, some words
                        # with different edit distances may be entered into
                        # suggestions; trim suggestion dictionary if verbose
                        # setting not on
                        if self.verbose &amp;lt; 2:
                            suggest_dict = {k: v for k, v in suggest_dict.items() if v[1] &amp;lt;= min_suggest_len}

            # now generate deletes (e.g. a substring of string or of a delete)
            # from the queue item
            # as additional items to check -- add to end of queue
            assert len(string) &amp;gt;= len(q_item)

            # do not add words with greater edit distance if verbose setting
            # is not on
            if (self.verbose &amp;lt; 2) and ((len(string) - len(q_item)) &amp;gt; min_suggest_len):
                pass
            elif (len(string) - len(q_item)) &amp;lt; self.max_edit_distance and len(q_item) &amp;gt; 1:
                for c in range(len(q_item)):  # character index
                    word_minus_c = q_item[:c] + q_item[c + 1:]
                    if word_minus_c not in q_dictionary:
                        queue.append(word_minus_c)
                        q_dictionary[word_minus_c] = None  # arbitrary value, just to identify we checked this

        # queue is now empty: convert suggestions in dictionary to
        # list for output
        if not silent and self.verbose != 0:
            print(&quot;number of possible corrections: %i&quot; % len(suggest_dict))
            print(&quot;  edit distance for deletions: %i&quot; % self.max_edit_distance)

        # output option 1
        # sort results by ascending order of edit distance and descending
        # order of frequency
        #     and return list of suggested word corrections only:
        # return sorted(suggest_dict, key = lambda x:
        #               (suggest_dict[x][1], -suggest_dict[x][0]))

        # output option 2
        # return list of suggestions with (correction,
        #                                  (frequency in corpus, edit distance)):
        as_list = suggest_dict.items()
        # outlist = sorted(as_list, key=lambda (term, (freq, dist)): (dist, -freq))
        outlist = sorted(as_list, key=lambda x: (x[1][1], -x[1][0]))

        if self.verbose == 0:
            return outlist[0]
        else:
            return outlist

        '''
        Option 1:
        ['file', 'five', 'fire', 'fine', ...]

        Option 2:
        [('file', (5, 0)),
         ('five', (67, 1)),
         ('fire', (54, 1)),
         ('fine', (17, 1))...]  
        '''

def best_word(self, s, silent=False):
   &quot;&quot;&quot;
   Returns the best correction for a given word.
   Parameters:
   s: Word to check
   silent: If True, don't print progress
   Returns:
   tuple or None: (corrected word, (frequency, edit distance)) or None if failed
   &quot;&quot;&quot;
   try:
       return self.get_suggestions(s, silent)[0]
   except:
       return None
 
class ItemSelector(BaseEstimator, TransformerMixin):
   &quot;&quot;&quot;
   A transformer for selecting specific fields from a pandas DataFrame and converting them to appropriate format.
   This is a custom transformer for use in scikit-learn Pipelines.
   &quot;&quot;&quot;
   def __init__(self, field, start_time=time()):
       self.field = field  # Column name to select from DataFrame
       self.start_time = start_time  # Start time for processing time measurement

   def fit(self, x, y=None):
       return self

   def transform(self, dataframe):
       &quot;&quot;&quot;
       Selects and transforms specific fields from the DataFrame.
       - Categorical data is converted to codes
       - Numeric data is kept as is 
       - Other data is treated as text
       &quot;&quot;&quot;
       print(f'[{time()-self.start_time}] select {self.field}')
       dt = dataframe[self.field].dtype
       if is_categorical_dtype(dt):
           return dataframe[self.field].cat.codes[:, None]
       elif is_numeric_dtype(dt):
           return dataframe[self.field][:, None]
       else:
           return dataframe[self.field]

class DropColumnsByDf(BaseEstimator, TransformerMixin):
   &quot;&quot;&quot;
   A transformer that filters features (columns) based on document frequency
   &quot;&quot;&quot;
   def __init__(self, min_df=1, max_df=1.0):
       &quot;&quot;&quot;
       Parameters:
       min_df: Minimum document frequency (features below this are removed)
       max_df: Maximum document frequency ratio (features above this are removed)
       &quot;&quot;&quot;
       self.min_df = min_df
       self.max_df = max_df

   def fit(self, X, y=None):
       &quot;&quot;&quot;
       Calculates document frequency for given data and determines which columns to filter.
       &quot;&quot;&quot;
       # Convert to CSC (Compressed Sparse Column) format
       m = X.tocsc()
   
       # Process minimum document frequency (min_df) condition
       # (m != 0).sum(axis=0): Calculate number of non-zero values in each column 
   	   # &amp;gt;= self.min_df: Check if it's greater than minimum document frequency
       # .A1: Flatten array to 1 dimension
       self.nnz_cols = ((m != 0).sum(axis=0) &amp;gt;= self.min_df).A1
   
       # Process maximum document frequency (max_df) condition
       if self.max_df &amp;lt; 1.0:
           # Calculate maximum allowed number of documents
           max_df = m.shape[0] * self.max_df
           # AND operation with maximum document frequency condition
           self.nnz_cols = self.nnz_cols &amp;amp; ((m != 0).sum(axis=0) &amp;lt;= max_df).A1
       
   	   return self

   def transform(self, X, y=None):
       &quot;&quot;&quot;
       Selects features according to the determined filtering criteria.
       &quot;&quot;&quot;
       m = X.tocsc()
       # Select columns according to conditions determined in fit (self.nnz_cols)
       return m[:, self.nnz_cols]
 
def get_rmsle(y_true, y_pred):
    return np.sqrt(mean_squared_log_error(np.expm1(y_true), np.expm1(y_pred)))


def split_cat(text):
    try:
        cats = text.split(&quot;/&quot;)
        return cats[0], cats[1], cats[2], cats[0] + '/' + cats[1]
    except:
        print(&quot;no category&quot;)
        return 'other', 'other', 'other', 'other/other'

# Function to fill in missing brand names
# Uses the SymSpell algorithm to find and fill brand names from product names and descriptions
# Processes single-word and multi-word brand names separately
def brands_filling(dataset):
    vc = dataset['brand_name'].value_counts()
    brands = vc[vc &amp;gt; 0].index
    brand_word = r&quot;[a-z0-9*/+\-'&amp;rsquo;?!.,|&amp;amp;%&amp;reg;&amp;trade;&amp;ocirc;&amp;egrave;&amp;eacute;&amp;uuml;]+&quot;

    many_w_brands = brands[brands.str.contains(' ')]
    one_w_brands = brands[~brands.str.contains(' ')]

    ss2 = SymSpell(max_edit_distance=0)
    ss2.create_dictionary_from_arr(many_w_brands, token_pattern=r'.+')

    ss1 = SymSpell(max_edit_distance=0)
    ss1.create_dictionary_from_arr(one_w_brands, token_pattern=r'.+')

    two_words_re = re.compile(r&quot;(?=(\s[a-z0-9*/+\-'&amp;rsquo;?!.,|&amp;amp;%&amp;reg;&amp;trade;&amp;ocirc;&amp;egrave;&amp;eacute;&amp;uuml;]+\s[a-z0-9*/+\-'&amp;rsquo;?!.,|&amp;amp;%&amp;reg;&amp;trade;&amp;ocirc;&amp;egrave;&amp;eacute;&amp;uuml;]+))&quot;)

    def find_in_str_ss2(row):
        for doc_word in two_words_re.finditer(row):
            print(doc_word)
            suggestion = ss2.best_word(doc_word.group(1), silent=True)
            if suggestion is not None:
                return doc_word.group(1)
        return ''

    def find_in_list_ss1(list):
        for doc_word in list:
            suggestion = ss1.best_word(doc_word, silent=True)
            if suggestion is not None:
                return doc_word
        return ''

    def find_in_list_ss2(list):
        for doc_word in list:
            suggestion = ss2.best_word(doc_word, silent=True)
            if suggestion is not None:
                return doc_word
        return ''

    print(f&quot;Before empty brand_name: {len(dataset[dataset['brand_name'] == ''].index)}&quot;)

    n_name = dataset[dataset['brand_name'] == '']['name'].str.findall(
        pat=r&quot;^[a-z0-9*/+\-'&amp;rsquo;?!.,|&amp;amp;%&amp;reg;&amp;trade;&amp;ocirc;&amp;egrave;&amp;eacute;&amp;uuml;]+\s[a-z0-9*/+\-'&amp;rsquo;?!.,|&amp;amp;%&amp;reg;&amp;trade;&amp;ocirc;&amp;egrave;&amp;eacute;&amp;uuml;]+&quot;)
    dataset.loc[dataset['brand_name'] == '', 'brand_name'] = [find_in_list_ss2(row) for row in n_name]

    n_desc = dataset[dataset['brand_name'] == '']['item_description'].str.findall(
        pat=r&quot;^[a-z0-9*/+\-'&amp;rsquo;?!.,|&amp;amp;%&amp;reg;&amp;trade;&amp;ocirc;&amp;egrave;&amp;eacute;&amp;uuml;]+\s[a-z0-9*/+\-'&amp;rsquo;?!.,|&amp;amp;%&amp;reg;&amp;trade;&amp;ocirc;&amp;egrave;&amp;eacute;&amp;uuml;]+&quot;)
    dataset.loc[dataset['brand_name'] == '', 'brand_name'] = [find_in_list_ss2(row) for row in n_desc]

    n_name = dataset[dataset['brand_name'] == '']['name'].str.findall(pat=brand_word)
    dataset.loc[dataset['brand_name'] == '', 'brand_name'] = [find_in_list_ss1(row) for row in n_name]

    desc_lower = dataset[dataset['brand_name'] == '']['item_description'].str.findall(pat=brand_word)
    dataset.loc[dataset['brand_name'] == '', 'brand_name'] = [find_in_list_ss1(row) for row in desc_lower]

    print(f&quot;After empty brand_name: {len(dataset[dataset['brand_name'] == ''].index)}&quot;)

    del ss1, ss2
    gc.collect()


def preprocess_regex(dataset, start_time=time()):
    karats_regex = r'(\d)([\s-]?)(karat|karats|carat|carats|kt)([^\w])'
    karats_repl = r'\1k\4'

    unit_regex = r'(\d+)[\s-]([a-z]{2})(\s)'
    unit_repl = r'\1\2\3'

    dataset['name'] = dataset['name'].str.replace(karats_regex, karats_repl)
    dataset['item_description'] = dataset['item_description'].str.replace(karats_regex, karats_repl)
    print(f'[{time() - start_time}] Karats normalized.')

    dataset['name'] = dataset['name'].str.replace(unit_regex, unit_repl)
    dataset['item_description'] = dataset['item_description'].str.replace(unit_regex, unit_repl)
    print(f'[{time() - start_time}] Units glued.')


def preprocess_pandas(train, test, start_time=time()):
    train = train[train.price &amp;gt; 0.0].reset_index(drop=True)
    print('Train shape without zero price: ', train.shape)

    nrow_train = train.shape[0]
    y_train = np.log1p(train[&quot;price&quot;])
    merge: pd.DataFrame = pd.concat([train, test])

    del train
    del test
    gc.collect()

    merge['has_category'] = (merge['category_name'].notnull()).astype('category')
    print(f'[{time() - start_time}] Has_category filled.')

    merge['category_name'] = merge['category_name'] \
        .fillna('other/other/other') \
        .str.lower() \
        .astype(str)
    merge['general_cat'], merge['subcat_1'], merge['subcat_2'], merge['gen_subcat1'] = \
        zip(*merge['category_name'].apply(lambda x: split_cat(x)))
    print(f'[{time() - start_time}] Split categories completed.')

    merge['has_brand'] = (merge['brand_name'].notnull()).astype('category')
    print(f'[{time() - start_time}] Has_brand filled.')

    merge['gencat_cond'] = merge['general_cat'].map(str) + '_' + merge['item_condition_id'].astype(str)
    merge['subcat_1_cond'] = merge['subcat_1'].map(str) + '_' + merge['item_condition_id'].astype(str)
    merge['subcat_2_cond'] = merge['subcat_2'].map(str) + '_' + merge['item_condition_id'].astype(str)
    print(f'[{time() - start_time}] Categories and item_condition_id concancenated.')

    merge['name'] = merge['name'] \
        .fillna('') \
        .str.lower() \
        .astype(str)
    merge['brand_name'] = merge['brand_name'] \
        .fillna('') \
        .str.lower() \
        .astype(str)
    merge['item_description'] = merge['item_description'] \
        .fillna('') \
        .str.lower() \
        .replace(to_replace='No description yet', value='')
    print(f'[{time() - start_time}] Missing filled.')

    preprocess_regex(merge, start_time)

    brands_filling(merge)
    print(f'[{time() - start_time}] Brand name filled.')

    merge['name'] = merge['name'] + ' ' + merge['brand_name']
    print(f'[{time() - start_time}] Name concancenated.')

    merge['item_description'] = merge['item_description'] \
                                + ' ' + merge['name'] \
                                + ' ' + merge['subcat_1'] \
                                + ' ' + merge['subcat_2'] \
                                + ' ' + merge['general_cat'] \
                                + ' ' + merge['brand_name']
    print(f'[{time() - start_time}] Item description concatenated.')

    merge.drop(['price', 'test_id', 'train_id'], axis=1, inplace=True)

    return merge, y_train, nrow_train


def intersect_drop_columns(train: csr_matrix, valid: csr_matrix, min_df=0):
    t = train.tocsc()
    v = valid.tocsc()
    nnz_train = ((t != 0).sum(axis=0) &amp;gt;= min_df).A1
    nnz_valid = ((v != 0).sum(axis=0) &amp;gt;= min_df).A1
    nnz_cols = nnz_train &amp;amp; nnz_valid
    res = t[:, nnz_cols], v[:, nnz_cols]
    return res


if __name__ == '__main__':
    mp.set_start_method('forkserver', True)

    start_time = time()

    train = pd.read_table(os.path.join(INPUT_PATH, 'train.tsv'),
                          engine='c',
                          dtype={'item_condition_id': 'category',
                                 'shipping': 'category'}
                          )
    test = pd.read_table(os.path.join(INPUT_PATH, 'test.tsv'),
                         engine='c',
                         dtype={'item_condition_id': 'category',
                                'shipping': 'category'}
                         )
    print(f'[{time() - start_time}] Finished to load data')
    print('Train shape: ', train.shape)
    print('Test shape: ', test.shape)

    submission: pd.DataFrame = test[['test_id']]

    merge, y_train, nrow_train = preprocess_pandas(train, test, start_time)

    meta_params = {'name_ngram': (1, 2),
                   'name_max_f': 75000,
                   'name_min_df': 10,

                   'category_ngram': (2, 3),
                   'category_token': '.+',
                   'category_min_df': 10,

                   'brand_min_df': 10,

                   'desc_ngram': (1, 3),
                   'desc_max_f': 150000,
                   'desc_max_df': 0.5,
                   'desc_min_df': 10}

    stopwords = frozenset(['the', 'a', 'an', 'is', 'it', 'this', ])
    # 'i', 'so', 'its', 'am', 'are'])

    vectorizer = FeatureUnion([
        ('name', Pipeline([
            ('select', ItemSelector('name', start_time=start_time)),
            ('transform', HashingVectorizer(
                ngram_range=(1, 2),
                n_features=2 ** 27,
                norm='l2',
                lowercase=False,
                stop_words=stopwords
            )),
            ('drop_cols', DropColumnsByDf(min_df=2))
        ])),
        ('category_name', Pipeline([
            ('select', ItemSelector('category_name', start_time=start_time)),
            ('transform', HashingVectorizer(
                ngram_range=(1, 1),
                token_pattern='.+',
                tokenizer=split_cat,
                n_features=2 ** 27,
                norm='l2',
                lowercase=False
            )),
            ('drop_cols', DropColumnsByDf(min_df=2))
        ])),
        ('brand_name', Pipeline([
            ('select', ItemSelector('brand_name', start_time=start_time)),
            ('transform', CountVectorizer(
                token_pattern='.+',
                min_df=2,
                lowercase=False
            )),
        ])),
        ('gencat_cond', Pipeline([
            ('select', ItemSelector('gencat_cond', start_time=start_time)),
            ('transform', CountVectorizer(
                token_pattern='.+',
                min_df=2,
                lowercase=False
            )),
        ])),
        ('subcat_1_cond', Pipeline([
            ('select', ItemSelector('subcat_1_cond', start_time=start_time)),
            ('transform', CountVectorizer(
                token_pattern='.+',
                min_df=2,
                lowercase=False
            )),
        ])),
        ('subcat_2_cond', Pipeline([
            ('select', ItemSelector('subcat_2_cond', start_time=start_time)),
            ('transform', CountVectorizer(
                token_pattern='.+',
                min_df=2,
                lowercase=False
            )),
        ])),
        ('has_brand', Pipeline([
            ('select', ItemSelector('has_brand', start_time=start_time)),
            ('ohe', OneHotEncoder())
        ])),
        ('shipping', Pipeline([
            ('select', ItemSelector('shipping', start_time=start_time)),
            ('ohe', OneHotEncoder())
        ])),
        ('item_condition_id', Pipeline([
            ('select', ItemSelector('item_condition_id', start_time=start_time)),
            ('ohe', OneHotEncoder())
        ])),
        ('item_description', Pipeline([
            ('select', ItemSelector('item_description', start_time=start_time)),
            ('hash', HashingVectorizer(
                ngram_range=(1, 3),
                n_features=2 ** 27,
                dtype=np.float32,
                norm='l2',
                lowercase=False,
                stop_words=stopwords
            )),
            ('drop_cols', DropColumnsByDf(min_df=2)),
        ]))
    ], n_jobs=1)

    sparse_merge = vectorizer.fit_transform(merge)
    print(f'[{time() - start_time}] Merge vectorized')
    print(sparse_merge.shape)

    tfidf_transformer = TfidfTransformer()

    X = tfidf_transformer.fit_transform(sparse_merge)
    print(f'[{time() - start_time}] TF/IDF completed')

    X_train = X[:nrow_train]
    print(X_train.shape)

    X_test = X[nrow_train:]
    del merge
    del sparse_merge
    del vectorizer
    del tfidf_transformer
    gc.collect()

    X_train, X_test = intersect_drop_columns(X_train, X_test, min_df=1)
    print(f'[{time() - start_time}] Drop only in train or test cols: {X_train.shape[1]}')
    gc.collect()

    ridge = Ridge(solver='auto', fit_intercept=True, alpha=0.4, max_iter=200, normalize=False, tol=0.01)
    ridge.fit(X_train, y_train)
    print(f'[{time() - start_time}] Train Ridge completed. Iterations: {ridge.n_iter_}')

    predsR = ridge.predict(X_test)
    print(f'[{time() - start_time}] Predict Ridge completed.')

    submission.loc[:, 'price'] = np.expm1(predsR)
    submission.loc[submission['price'] &amp;lt; 0.0, 'price'] = 0.0
    submission.to_csv(&quot;submission_ridge.csv&quot;, index=False)&lt;/code&gt;&lt;/pre&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;3. Damerau-Levenshtein&amp;nbsp;distance&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Basic Concept:&lt;br /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;Calculates the minimum number of edits needed to transform one string into another&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;It's an extension of the Levenshtein distance that adds the operation of &quot;transposing adjacent characters&quot;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Allowed Edit Operations:&lt;br /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Character &lt;b&gt;insertion&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;Example: &quot;cat&quot; &amp;rarr; &quot;cart&quot; (insert r)&lt;/li&gt;
&lt;li&gt;Character &lt;b&gt;deletion&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;Example: &quot;cart&quot; &amp;rarr; &quot;cat&quot; (delete r)&lt;/li&gt;
&lt;li&gt;Character &lt;b&gt;substitution&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;Example: &quot;cat&quot; &amp;rarr; &quot;cut&quot; (substitute a with u)&lt;/li&gt;
&lt;li&gt;Adjacent character &lt;b&gt;transposition&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;Example: &quot;cloud&quot; &amp;rarr; &quot;could&quot; (swap u and l)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Use Cases:&lt;br /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;Spell checking&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;Similar word search&lt;/li&gt;
&lt;li&gt;Measuring similarity between brand names or product names&lt;/li&gt;
&lt;li&gt;Text matching in natural language processing&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Example:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1733411907714&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;# Converting &quot;kitten&quot; to &quot;sitting&quot;:
# 1. k &amp;rarr; s (substitution)
# 2. e &amp;rarr; i (substitution)
# 3. n &amp;rarr; ng (insertion)
# Total distance: 3&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;In this code, it's used to find missing brand names by calculating similarity with existing brand names. &lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;For example, it can help correct a misspelled brand name like &quot;Nkie&quot; to &quot;Nike&quot;.&amp;nbsp;&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;4. Getting suggestion code detail&lt;/b&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;get_suggestions(self, string, silent=False) function:&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Purpose: Generates a list of correction suggestions for potentially misspelled words&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Main features:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Finds similar words to the input word in the dictionary&lt;/li&gt;
&lt;li&gt;Calculates edit distance (character deletion/addition/change)&lt;/li&gt;
&lt;li&gt;Finds and sorts all possible correction words&lt;/li&gt;
&lt;li&gt;Provides (frequency, edit distance) information for each suggested word&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Return values:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;When verbose=0: Returns only the best suggestion&lt;/li&gt;
&lt;li&gt;When verbose&amp;gt;0: Returns all possible suggestions in the form (word, (frequency, edit distance))&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&amp;nbsp;best_word(self, s, silent=False) function:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Purpose: Finds the optimal correction word for a given word&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Main features:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Calls get_suggestions function to get suggestion list&lt;/li&gt;
&lt;li&gt;Selects the most appropriate word (smallest edit distance and highest frequency) from the suggestion list&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Return values:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;On success: (corrected word, (frequency, edit distance))&lt;/li&gt;
&lt;li&gt;On failure: None&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;5. More about Ridge Regression&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Basic Concept:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;It's an advanced form of linear regression&lt;/li&gt;
&lt;li&gt;Created to prevent overfitting&lt;/li&gt;
&lt;li&gt;A regression model that uses &lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;L2 regularization&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;How it works:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Basic linear regression equation: y = w₁x₁ + w₂x₂ + ... + wₙxₙ + b&lt;/li&gt;
&lt;li&gt;Ridge regression adds a penalty term&lt;/li&gt;
&lt;li&gt;Penalty term: &amp;alpha;(w₁&amp;sup2; + w₂&amp;sup2; + ... + wₙ&amp;sup2;)&lt;/li&gt;
&lt;li&gt;&amp;alpha; is a hyperparameter that controls regularization strength&lt;/li&gt;
&lt;li&gt;Applies penalty to the sum of squared weights&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Advantages:&lt;br /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;Can solve multicollinearity problems&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;Example: Handles strongly correlated features like 'height' and 'weight' well&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #99cefa;&quot;&gt;&lt;b&gt;Prevents overfitting and improves model generalization&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;Maintains all features while adjusting their influence&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Difference&amp;nbsp;from&amp;nbsp;Regular&amp;nbsp;Linear&amp;nbsp;Regression:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1733409175940&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;# Cost function for regular linear regression
Cost = &amp;Sigma;(y - ŷ)&amp;sup2;
# Cost function for Ridge regression
Cost = &amp;Sigma;(y - ŷ)&amp;sup2; + &amp;alpha;(w₁&amp;sup2; + w₂&amp;sup2; + ... + wₙ&amp;sup2;)&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;When&amp;nbsp;to&amp;nbsp;use:&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;Data with many features&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;When features have strong correlations&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;When overfitting is a concern&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;For example, in the given code, Ridge regression is used because &lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;text vectorization creates many features that might be correlated:&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1733409240256&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;ridge = Ridge(
    solver='auto',           # Automatically choose best solver
    fit_intercept=True,      # Use intercept term
    alpha=0.4,              # Regularization strength (&amp;alpha; value)
    max_iter=200,           # Maximum iterations
    normalize=False,         # Whether to normalize
    tol=0.01                # Convergence tolerance
)&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;This configured Ridge regression model helps predict prices while appropriately adjusting the influence of many features.&amp;nbsp;&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;a href=&quot;https://www.kaggle.com/code/peterhurford/lgb-and-fm-18th-place-0-40604&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt; Fourth Kernel:&lt;span&gt; LGB&amp;nbsp;and&amp;nbsp;FM&amp;nbsp;[18th&amp;nbsp;Place&amp;nbsp;-&amp;nbsp;0.40604]&lt;/span&gt;&lt;/b&gt;&lt;/a&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;18th place solution using lightgbm and FM_FTRL.&lt;/li&gt;
&lt;li&gt;0.33 * FM_FTRL + 0.67 * LGB&lt;/li&gt;
&lt;/ul&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;Insight / Summary:&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;1.&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt; Overall Summary&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;WordBatch&lt;/b&gt; is a Python package designed for fast processing of large-scale text data.&lt;/li&gt;
&lt;li&gt;&lt;b&gt;FM_FTRL&lt;/b&gt; is a model that combines &lt;span style=&quot;background-color: #99cefa;&quot;&gt;&lt;b&gt;Factorization Machines&lt;/b&gt;&lt;/span&gt; with the &lt;span style=&quot;background-color: #c1bef9;&quot;&gt;&lt;b&gt;Follow-The-Regularized-Leader&lt;/b&gt; &lt;/span&gt;algorithm.
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Looking at each component:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #99cefa;&quot;&gt;&lt;b&gt;FM (Factorization Machines)&lt;/b&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;A method for &lt;span style=&quot;background-color: #9feec3;&quot;&gt;&lt;b&gt;modeling interactions between features in high-dimensional sparse data&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;Particularly effective in tasks like recommendation systems and click-through rate (CTR) prediction&lt;/li&gt;
&lt;li&gt;Represents potential interactions between features as low-dimensional vectors&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;&lt;b&gt;FTRL (Follow-The-Regularized-Leader)&lt;/b&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;A type of online learning algorithm&lt;/li&gt;
&lt;li&gt;Effectively handles L1 and L2 regularization&lt;/li&gt;
&lt;li&gt;Particularly effective in learning sparse models
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;A sparse model refers to a model where many parameters (weights) are zero&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Memory efficient and capable of incremental learning&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Main &lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;advantages&lt;/b&gt;&lt;/span&gt; of this combination:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;Suitable for large-scale sparse data processing&lt;/b&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;Data is very sparse due to numerous text and categorical variables&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;FM effectively learns feature interactions in sparse data&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Automatically learns&lt;b&gt; feature interactions&lt;/b&gt;&lt;br /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;FM automatically learns second-order feature interactions&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;Example: Captures the impact of brand and category combinations on price&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Enables online learning for real-time updates
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;FTRL is an online learning algorithm that is &lt;b&gt;memory efficient&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;Can effectively train on &lt;b&gt;large-scale datasets&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;High prediction performance&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Commonly used in tasks such as advertising click-through rate prediction, recommendation systems, and tasks requiring real-time prediction.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Feature Engineering
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;&lt;b&gt;Extraction of various statistical features from text data MANUALLY&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;Text vectorization using WordBatch and TF-IDF&lt;/li&gt;
&lt;li&gt;Label encoding for categorical variables&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Model Ensemble
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;FM_FTRL: Factorization Machine effective for sparse data&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;Ridge Regression: Basic regression for text features&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;LightGBM: High-performance gradient boosting model&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Ridge model?&lt;/b&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The Ridge model is used to perform basic regression analysis on text data (product names and descriptions)&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;&lt;b&gt;Basic linear regression model with L2 regularization applied.&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Memory Optimization
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Use of sparse matrices&lt;/li&gt;
&lt;li&gt;Periodic garbage collection&lt;/li&gt;
&lt;li&gt;Removal of unnecessary variables&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Final&amp;nbsp;Prediction&lt;br /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Weighted average of FM_FTRL and LightGBM predictions&lt;/li&gt;
&lt;li&gt;Save results after reversing log transformation&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;2. Code Details&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1733394029441&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;# Record start time for execution time measurement
import time
start_time = time.time()

# Set submission mode (True: train on full data, False: use validation split)
SUBMIT_MODE = True

# Import required libraries
import pandas as pd
import numpy as np
import time
import gc  # For garbage collection
import string
import re

# Use NLTK stopwords
from nltk.corpus import stopwords

# Import scipy for sparse matrix handling
from scipy.sparse import csr_matrix, hstack
# Import sklearn for text vectorization
from sklearn.feature_extraction.text import TfidfVectorizer
# Import sklearn for feature selection
from sklearn.feature_selection.univariate_selection import SelectKBest, f_regression
# Import sklearn for label encoding
from sklearn.preprocessing import LabelBinarizer

# Import WordBatch related modules (for text processing)
import wordbatch
from wordbatch.extractors import WordBag
from wordbatch.models import FM_FTRL

# Import sklearn model related modules
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.naive_bayes import MultinomialNB
import lightgbm as lgb

# Define RMSE calculation function
def rmse(predicted, actual):
    &quot;&quot;&quot;
    Calculate Root Mean Squared Error between predicted and actual values
    Args:
        predicted: array of predicted values
        actual: array of actual values
    Returns:
        RMSE value
    &quot;&quot;&quot;
    return np.sqrt(((predicted - actual) ** 2).mean())

# Category splitting function
def split_cat(text):
    &quot;&quot;&quot;
    Split category string into subcategories
    Args:
        text: Category string with '/' delimiter
    Returns:
        Tuple of 3 subcategories (returns 'No Label' if missing)
    &quot;&quot;&quot;
    try:
        return text.split(&quot;/&quot;)
    except:
        return (&quot;No Label&quot;, &quot;No Label&quot;, &quot;No Label&quot;)

# Define Target Encoding class
class TargetEncoder:
    &quot;&quot;&quot;
    Class for performing target encoding on categorical variables
    Numerically encodes categories based on mean target values
    &quot;&quot;&quot;
    def __repr__(self):
        return 'TargetEncoder'

    def __init__(self, cols, smoothing=1, min_samples_leaf=1, noise_level=0, keep_original=False):
        &quot;&quot;&quot;
        Args:
            cols: List of columns to encode
            smoothing: Smoothing parameter
            min_samples_leaf: Minimum number of samples
            noise_level: Level of noise to add
            keep_original: Whether to keep original columns
        &quot;&quot;&quot;
        self.cols = cols
        self.smoothing = smoothing
        self.min_samples_leaf = min_samples_leaf
        self.noise_level = noise_level
        self.keep_original = keep_original

    @staticmethod
    def add_noise(series, noise_level):
        &quot;&quot;&quot;
        Add noise to prevent overfitting
        &quot;&quot;&quot;
        return series * (1 + noise_level * np.random.randn(len(series)))

    def encode(self, train, test, target):
        &quot;&quot;&quot;
        Perform target encoding on categorical columns in given dataframe
        &quot;&quot;&quot;
        for col in self.cols:
            if self.keep_original:
                train[col + '_te'], test[col + '_te'] = self.encode_column(train[col], test[col], target)
            else:
                train[col], test[col] = self.encode_column(train[col], test[col], target)
        return train, test

    def encode_column(self, trn_series, tst_series, target):
        &quot;&quot;&quot;
        Perform target encoding on a single column
        &quot;&quot;&quot;
        temp = pd.concat([trn_series, target], axis=1)
        # Calculate target means
        averages = temp.groupby(by=trn_series.name)[target.name].agg([&quot;mean&quot;, &quot;count&quot;])
        # Calculate smoothing
        smoothing = 1 / (1 + np.exp(-(averages[&quot;count&quot;] - self.min_samples_leaf) / self.smoothing))
        # Calculate overall mean
        prior = target.mean()
        # Calculate smoothed means
        averages[target.name] = prior * (1 - smoothing) + averages[&quot;mean&quot;] * smoothing
        averages.drop(['mean', 'count'], axis=1, inplace=True)
        
        # Apply encoding to train/test data
        ft_trn_series = pd.merge(
            trn_series.to_frame(trn_series.name),
            averages.reset_index().rename(columns={'index': target.name, target.name: 'average'}),
            on=trn_series.name,
            how='left')['average'].rename(trn_series.name + '_mean').fillna(prior)
        ft_trn_series.index = trn_series.index
        
        ft_tst_series = pd.merge(
            tst_series.to_frame(tst_series.name),
            averages.reset_index().rename(columns={'index': target.name, target.name: 'average'}),
            on=tst_series.name,
            how='left')['average'].rename(trn_series.name + '_mean').fillna(prior)
        ft_tst_series.index = tst_series.index
        
        return self.add_noise(ft_trn_series, self.noise_level), self.add_noise(ft_tst_series, self.noise_level)

# Numeric value processing functions
def to_number(x):
    &quot;&quot;&quot;
    Convert string to number (limit to 100 if greater than 100)
    &quot;&quot;&quot;
    try:
        if not x.isdigit():
            return 0
        x = int(x)
        if x &amp;gt; 100:
            return 100
        else:
            return x
    except:
        return 0

def sum_numbers(desc):
    &quot;&quot;&quot;
    Calculate sum of numbers in description text
    &quot;&quot;&quot;
    if not isinstance(desc, str):
        return 0
    try:
        return sum([to_number(s) for s in desc.split()])
    except:
        return 0

# Set regex and stopwords for text preprocessing
stopwords = {x: 1 for x in stopwords.words('english')}
non_alphanums = re.compile(u'[^A-Za-z0-9]+')
non_alphanumpunct = re.compile(u'[^A-Za-z0-9\.?!,; \(\)\[\]\'\&quot;\$]+')
RE_PUNCTUATION = '|'.join([re.escape(x) for x in string.punctuation])

def normalize_text(text):
    &quot;&quot;&quot;
    Text normalization:
    - Convert to lowercase
    - Remove special characters
    - Remove stopwords
    - Remove short words
    &quot;&quot;&quot;
    return u&quot; &quot;.join(
        [x for x in [y for y in non_alphanums.sub(' ', text).lower().strip().split(&quot; &quot;)] \
         if len(x) &amp;gt; 1 and x not in stopwords])

def clean_name(x):
    &quot;&quot;&quot;
    Extract first word from name
    &quot;&quot;&quot;
    if len(x):
        x = non_alphanums.sub(' ', x).split()
        if len(x):
            return x[0].lower()
    return ''

# Load data
print('[{}] Finished defining stuff'.format(time.time() - start_time))

# Load training data
train = pd.read_table('../input/train.tsv', engine='c', 
                      dtype={'item_condition_id': 'category',
                             'shipping': 'category',
                            }, 
                     converters={'category_name': split_cat})
# Load test data
test = pd.read_table('../input/test.tsv', engine='c', 
                      dtype={'item_condition_id': 'category',
                             'shipping': 'category',
                            },
                    converters={'category_name': split_cat})
print('[{}] Finished load data'.format(time.time() - start_time))

# Add flag for train/test data distinction
train['is_train'] = 1
test['is_train'] = 0
print('[{}] Compiled train / test'.format(time.time() - start_time))
print('Train shape: ', train.shape)
print('Test shape: ', test.shape)

# Remove data with price 0
train = train[train.price != 0].reset_index(drop=True)
print('[{}] Removed zero price'.format(time.time() - start_time))
print('Train shape: ', train.shape)
print('Test shape: ', test.shape)

# Log transform target variable (price)
y = np.log1p(train['price'])
nrow_train = train.shape[0]

# Merge train/test data
merge = pd.concat([train, test])
submission = test[['test_id']]
print('[{}] Compiled merge'.format(time.time() - start_time))
print('Merge shape: ', merge.shape)

# Remove unnecessary columns and clear memory
del train
del test
merge.drop(['train_id', 'test_id', 'price'], axis=1, inplace=True)
gc.collect()
print('[{}] Garbage collection'.format(time.time() - start_time))

# Split and process categories
merge['gencat_name'] = merge['category_name'].str.get(0).replace('', 'missing').astype('category')
merge['subcat1_name'] = merge['category_name'].str.get(1).fillna('missing').astype('category')
merge['subcat2_name'] = merge['category_name'].str.get(2).fillna('missing').astype('category')
merge.drop('category_name', axis=1, inplace=True)
print('[{}] Split categories completed.'.format(time.time() - start_time))

# Handle missing values
merge['item_condition_id'] = merge['item_condition_id'].cat.add_categories(['missing']).fillna('missing')
merge['shipping'] = merge['shipping'].cat.add_categories(['missing']).fillna('missing')
merge['item_description'].fillna('missing', inplace=True)
merge['brand_name'] = merge['brand_name'].fillna('missing').astype('category')
print('[{}] Handle missing completed.'.format(time.time() - start_time))

# Start feature engineering
# Name-related features
merge['name_first'] = merge['name'].apply(clean_name)
print('[{}] FE 1/37'.format(time.time() - start_time))
merge['name_first_count'] = merge.groupby('name_first')['name_first'].transform('count')
print('[{}] FE 2/37'.format(time.time() - start_time))

# Category-related features
merge['gencat_name_count'] = merge.groupby('gencat_name')['gencat_name'].transform('count')
print('[{}] FE 3/37'.format(time.time() - start_time))
merge['subcat1_name_count'] = merge.groupby('subcat1_name')['subcat1_name'].transform('count')
print('[{}] FE 4/37'.format(time.time() - start_time))
merge['subcat2_name_count'] = merge.groupby('subcat2_name')['subcat2_name'].transform('count')
print('[{}] FE 5/37'.format(time.time() - start_time))
merge['brand_name_count'] = merge.groupby('brand_name')['brand_name'].transform('count')
print('[{}] FE 6/37'.format(time.time() - start_time))

# Text-related features
merge['NameLower'] = merge.name.str.count('[a-z]')
print('[{}] FE 7/37'.format(time.time() - start_time))
merge['DescriptionLower'] = merge.item_description.str.count('[a-z]')
print('[{}] FE 8/37'.format(time.time() - start_time))
merge['NameUpper'] = merge.name.str.count('[A-Z]')
print('[{}] FE 9/37'.format(time.time() - start_time))
merge['DescriptionUpper'] = merge.item_description.str.count('[A-Z]')
print('[{}] FE 10/37'.format(time.time() - start_time))

# Length-related features
merge['name_len'] = merge['name'].apply(lambda x: len(x))
print('[{}] FE 11/37'.format(time.time() - start_time))
merge['des_len'] = merge['item_description'].apply(lambda x: len(x))
print('[{}] FE 12/37'.format(time.time() - start_time))
merge['name_desc_len_ratio'] = merge['name_len']/merge['des_len']
print('[{}] FE 13/37'.format(time.time() - start_time))

# Word count related features
merge['desc_word_count'] = merge['item_description'].apply(lambda x: len(x.split()))
print('[{}] FE 14/37'.format(time.time() - start_time))
merge['mean_des'] = merge['item_description'].apply(lambda x: 0 if len(x) == 0 else float(len(x.split())) / len(x)) * 10
print('[{}] FE 15/37'.format(time.time() - start_time))
merge['name_word_count'] = merge['name'].apply(lambda x: len(x.split()))
print('[{}] FE 16/37'.format(time.time() - start_time))
merge['mean_name'] = merge['name'].apply(lambda x: 0 if len(x) == 0 else float(len(x.split())) / len(x)) * 10
print('[{}] FE 17/37'.format(time.time() - start_time))

# Characters per word features
merge['desc_letters_per_word'] = merge['des_len'] / merge['desc_word_count']
print('[{}] FE 18/37'.format(time.time() - start_time))
merge['name_letters_per_word'] = merge['name_len'] / merge['name_word_count']
print('[{}] FE 19/37'.format(time.time() - start_time))

# Upper/lowercase ratio features
merge['NameLowerRatio'] = merge['NameLower'] / merge['name_len']
print('[{}] FE 20/37'.format(time.time() - start_time))
merge['DescriptionLowerRatio'] = merge['DescriptionLower'] / merge['des_len']
print('[{}] FE 21/37'.format(time.time() - start_time))
merge['NameUpperRatio'] = merge['NameUpper'] / merge['name_len']
print('[{}] FE 22/37'.format(time.time() - start_time))
merge['DescriptionUpperRatio'] = merge['DescriptionUpper'] / merge['des_len']
print('[{}] FE 23/37'.format(time.time() - start_time))

# Punctuation related features
merge['NamePunctCount'] = merge.name.str.count(RE_PUNCTUATION)
print('[{}] FE 24/37'.format(time.time() - start_time))
merge['DescriptionPunctCount'] = merge.item_description.str.count(RE_PUNCTUATION)
print('[{}] FE 25/37'.format(time.time() - start_time))
merge['NamePunctCountRatio'] = merge['NamePunctCount'] / merge['name_word_count']
print('[{}] FE 26/37'.format(time.time() - start_time))
merge['DescriptionPunctCountRatio'] = merge['DescriptionPunctCount'] / merge['desc_word_count']
print('[{}] FE 27/37'.format(time.time() - start_time))

# Number related features
merge['NameDigitCount'] = merge.name.str.count('[0-9]')
print('[{}] FE 28/37'.format(time.time() - start_time))
merge['DescriptionDigitCount'] = merge.item_description.str.count('[0-9]')
print('[{}] FE 29/37'.format(time.time() - start_time))
merge['NameDigitCountRatio'] = merge['NameDigitCount'] / merge['name_word_count']
print('[{}] FE 30/37'.format(time.time() - start_time))
merge['DescriptionDigitCountRatio'] = merge['DescriptionDigitCount']/merge['desc_word_count']
print('[{}] FE 31/37'.format(time.time() - start_time))

# Stopword and special character related features
merge['stopword_ratio_desc'] = merge['item_description'].apply(lambda x: len([w for w in x.split() if w in stopwords])) / merge['desc_word_count']
print('[{}] FE 32/37'.format(time.time() - start_time))
merge['num_sum'] = merge['item_description'].apply(sum_numbers)  # Sum of numbers in description
print('[{}] FE 33/37'.format(time.time() - start_time))
merge['weird_characters_desc'] = merge['item_description'].str.count(non_alphanumpunct)  # Count of special characters
print('[{}] FE 34/37'.format(time.time() - start_time))
merge['weird_characters_name'] = merge['name'].str.count(non_alphanumpunct)
print('[{}] FE 35/37'.format(time.time() - start_time))

# Price related keyword features
merge['prices_count'] = merge['item_description'].str.count('[rm]')  # Count of price indicator characters (rm)
print('[{}] FE 36/37'.format(time.time() - start_time))
merge['price_in_name'] = merge['item_description'].str.contains('[rm]', regex=False).astype('int')  # Price indicator presence
print('[{}] FE 37/37'.format(time.time() - start_time))

# Feature normalization
cols = set(merge.columns.values)
basic_cols = {'name', 'item_condition_id', 'brand_name',
 'shipping', 'item_description', 'gencat_name',
 'subcat1_name', 'subcat2_name', 'name_first', 'is_train'}

# Separate columns to normalize and keep 
cols_to_normalize = cols - basic_cols - {'price_in_name'}
other_cols = basic_cols | {'price_in_name'}

# Perform Min-Max normalization
merge_to_normalize = merge[list(cols_to_normalize)]
merge_to_normalize = (merge_to_normalize - merge_to_normalize.mean()) / (merge_to_normalize.max() - merge_to_normalize.min())
print('[{}] FE Normalized'.format(time.time() - start_time))

# Merge normalized features and basic features
merge = merge[list(other_cols)]
merge = pd.concat([merge, merge_to_normalize], axis=1)
print('[{}] FE Merged'.format(time.time() - start_time))

# Memory cleanup
del(merge_to_normalize)
gc.collect()
print('[{}] Garbage collection'.format(time.time() - start_time))

# Split train/test data
df_test = merge.loc[merge['is_train'] == 0]
df_train = merge.loc[merge['is_train'] == 1]
del merge
gc.collect()
df_test = df_test.drop(['is_train'], axis=1)
df_train = df_train.drop(['is_train'], axis=1)

# Split validation data (if not in submit mode)
if SUBMIT_MODE:
   y_train = y
   del y
   gc.collect()
else:
   df_train, df_test, y_train, y_test = train_test_split(df_train, y, test_size=0.2, random_state=144)

print('[{}] Splitting completed.'.format(time.time() - start_time))

# Process name text using WordBatch
wb = wordbatch.WordBatch(normalize_text, extractor=(WordBag, {
   &quot;hash_ngrams&quot;: 2,  # Use up to 2-grams
   &quot;hash_ngrams_weights&quot;: [1.5, 1.0],  # Weights for unigram and bigram
   &quot;hash_size&quot;: 2 ** 29,  # Hash size
   &quot;norm&quot;: None,  # No normalization
   &quot;tf&quot;: 'binary',  # Use binary TF
   &quot;idf&quot;: None,  # Don't use IDF
}), procs=8)
wb.dictionary_freeze = True
X_name_train = wb.fit_transform(df_train['name'])
X_name_test = wb.transform(df_test['name'])
del(wb)

# Remove low frequency features
mask = np.where(X_name_train.getnnz(axis=0) &amp;gt; 3)[0]
X_name_train = X_name_train[:, mask]
X_name_test = X_name_test[:, mask]
print('[{}] Vectorize `name` completed.'.format(time.time() - start_time))

# Process item description text using WordBatch
wb = wordbatch.WordBatch(normalize_text, extractor=(WordBag, {
   &quot;hash_ngrams&quot;: 2,
   &quot;hash_ngrams_weights&quot;: [1.0, 1.0],
   &quot;hash_size&quot;: 2 ** 28,
   &quot;norm&quot;: &quot;l2&quot;,  # Use L2 normalization
   &quot;tf&quot;: 1.0,  # Use actual frequency
   &quot;idf&quot;: None
}), procs=8)
wb.dictionary_freeze = True
X_description_train = wb.fit_transform(df_train['item_description'])
X_description_test = wb.transform(df_test['item_description'])
del(wb)

# Remove low frequency features
mask = np.where(X_description_train.getnnz(axis=0) &amp;gt; 3)[0]
X_description_train = X_description_train[:, mask]
X_description_test = X_description_test[:, mask]
print('[{}] Vectorize `item_description` completed.'.format(time.time() - start_time))

# Ridge regression modeling for description text
# Split data in half for cross validation
X_train_1, X_train_2, y_train_1, y_train_2 = train_test_split(X_description_train, y_train,
                                                             test_size=0.5,
                                                             shuffle=False)
print('[{}] Finished splitting'.format(time.time() - start_time))

# First Ridge model
model = Ridge(solver=&quot;sag&quot;, fit_intercept=True, random_state=205, alpha=3.3)
model.fit(X_train_1, y_train_1)
print('[{}] Finished to train desc ridge (1)'.format(time.time() - start_time))
desc_ridge_preds1 = model.predict(X_train_2)
desc_ridge_preds1f = model.predict(X_description_test)
print('[{}] Finished to predict desc ridge (1)'.format(time.time() - start_time))

# Second Ridge model
model = Ridge(solver=&quot;sag&quot;, fit_intercept=True, random_state=205, alpha=3.3)
model.fit(X_train_2, y_train_2)
print('[{}] Finished to train desc ridge (2)'.format(time.time() - start_time))
desc_ridge_preds2 = model.predict(X_train_1)
desc_ridge_preds2f = model.predict(X_description_test)
print('[{}] Finished to predict desc ridge (2)'.format(time.time() - start_time))

# Combine Ridge predictions
desc_ridge_preds_oof = np.concatenate((desc_ridge_preds2, desc_ridge_preds1), axis=0)
desc_ridge_preds_test = (desc_ridge_preds1f + desc_ridge_preds2f) / 2.0
print('RMSLE OOF: {}'.format(rmse(desc_ridge_preds_oof, y_train)))
if not SUBMIT_MODE:
   print('RMSLE TEST: {}'.format(rmse(desc_ridge_preds_test, y_test)))

# Ridge regression modeling for name text (same process as above)
X_train_1, X_train_2, y_train_1, y_train_2 = train_test_split(X_name_train, y_train,
                                                             test_size=0.5,
                                                             shuffle=False)
print('[{}] Finished splitting'.format(time.time() - start_time))

model = Ridge(solver=&quot;sag&quot;, fit_intercept=True, random_state=205, alpha=3.3)
model.fit(X_train_1, y_train_1)
print('[{}] Finished to train name ridge (1)'.format(time.time() - start_time))
name_ridge_preds1 = model.predict(X_train_2)
name_ridge_preds1f = model.predict(X_name_test)
print('[{}] Finished to predict name ridge (1)'.format(time.time() - start_time))

model = Ridge(solver=&quot;sag&quot;, fit_intercept=True, random_state=205, alpha=3.3)
model.fit(X_train_2, y_train_2)
print('[{}] Finished to train name ridge (2)'.format(time.time() - start_time))
name_ridge_preds2 = model.predict(X_train_1)
name_ridge_preds2f = model.predict(X_name_test)
print('[{}] Finished to predict name ridge (2)'.format(time.time() - start_time))

name_ridge_preds_oof = np.concatenate((name_ridge_preds2, name_ridge_preds1), axis=0)
name_ridge_preds_test = (name_ridge_preds1f + name_ridge_preds2f) / 2.0
print('RMSLE OOF: {}'.format(rmse(name_ridge_preds_oof, y_train)))
if not SUBMIT_MODE:
   print('RMSLE TEST: {}'.format(rmse(name_ridge_preds_test, y_test)))

# Memory cleanup
del X_train_1
del X_train_2
del y_train_1
del y_train_2
del name_ridge_preds1
del name_ridge_preds1f
del name_ridge_preds2
del name_ridge_preds2f
del desc_ridge_preds1
del desc_ridge_preds1f
del desc_ridge_preds2
del desc_ridge_preds2f
gc.collect()
print('[{}] Finished garbage collection'.format(time.time() - start_time))

# Process categorical variables
# Label encode brand names
lb = LabelBinarizer(sparse_output=True)
X_brand_train = lb.fit_transform(df_train['brand_name'])
X_brand_test = lb.transform(df_test['brand_name'])
print('[{}] Finished label binarize `brand_name`'.format(time.time() - start_time))

# Label encode categories
X_cat_train = lb.fit_transform(df_train['gencat_name'])
X_cat_test = lb.transform(df_test['gencat_name'])
X_cat1_train = lb.fit_transform(df_train['subcat1_name'])
X_cat1_test = lb.transform(df_test['subcat1_name'])
X_cat2_train = lb.fit_transform(df_train['subcat2_name'])
X_cat2_test = lb.transform(df_test['subcat2_name'])
print('[{}] Finished label binarize categories'.format(time.time() - start_time))

# Create dummy variables for numeric features
X_dummies_train = csr_matrix(
   pd.get_dummies(df_train[list(cols - (basic_cols - {'item_condition_id', 'shipping'}))],
                  sparse=True).values)
print('[{}] Create dummies completed - train'.format(time.time() - start_time))

X_dummies_test = csr_matrix(
   pd.get_dummies(df_test[list(cols - (basic_cols - {'item_condition_id', 'shipping'}))],
                  sparse=True).values)
print('[{}] Create dummies completed - test'.format(time.time() - start_time))

# Combine all feature matrices
sparse_merge_train = hstack((X_dummies_train, X_description_train, X_brand_train, X_cat_train,
                            X_cat1_train, X_cat2_train, X_name_train)).tocsr()
del X_description_train, lb, X_name_train, X_dummies_train
gc.collect()
print('[{}] Create sparse merge train completed'.format(time.time() - start_time))

sparse_merge_test = hstack((X_dummies_test, X_description_test, X_brand_test, X_cat_test,
                            X_cat1_test, X_cat2_test, X_name_test)).tocsr()
del X_description_test, X_name_test, X_dummies_test
gc.collect()
print('[{}] Create sparse merge test completed'.format(time.time() - start_time))

# Set number of iterations for FM_FTRL model training
if SUBMIT_MODE:
   iters = 3
else:
   iters = 1
   rounds = 3

# Define FM_FTRL model
model = FM_FTRL(alpha=0.035, beta=0.001, L1=0.00001, L2=0.15, D=sparse_merge_train.shape[1],
               alpha_fm=0.05, L2_fm=0.0, init_fm=0.01,
               D_fm=100, e_noise=0, iters=iters, inv_link=&quot;identity&quot;, threads=4)

# Train and predict with FM_FTRL model
if SUBMIT_MODE:
   model.fit(sparse_merge_train, y_train)
   print('[{}] Train FM completed'.format(time.time() - start_time))
   predsFM = model.predict(sparse_merge_test)
   print('[{}] Predict FM completed'.format(time.time() - start_time))
else:
   # In validation mode, repeat multiple times to check performance
   for i in range(rounds):
       model.fit(sparse_merge_train, y_train)
       predsFM = model.predict(sparse_merge_test)
       print('[{}] Iteration {}/{} -- RMSLE: {}'.format(time.time() - start_time, i + 1, rounds, rmse(predsFM, y_test)))

del model
gc.collect()
if not SUBMIT_MODE:
   print(&quot;FM_FTRL dev RMSLE:&quot;, rmse(predsFM, y_test))

# Feature selection (SelectKBest)
fselect = SelectKBest(f_regression, k=48000)
train_features = fselect.fit_transform(sparse_merge_train, y_train)
test_features = fselect.transform(sparse_merge_test)
print('[{}] Select best completed'.format(time.time() - start_time))

# Memory cleanup
del sparse_merge_train
del sparse_merge_test
gc.collect()
print('[{}] Garbage collection'.format(time.time() - start_time))

# TF-IDF vectorization (name)
tv = TfidfVectorizer(max_features=250000,
                    ngram_range=(1, 3),
                    stop_words=None)
X_name_train = tv.fit_transform(df_train['name'])
print('[{}] Finished TFIDF vectorize `name` (1/2)'.format(time.time() - start_time))
X_name_test = tv.transform(df_test['name'])
print('[{}] Finished TFIDF vectorize `name` (2/2)'.format(time.time() - start_time))

# TF-IDF vectorization (item description)
tv = TfidfVectorizer(max_features=500000,
                    ngram_range=(1, 3),
                    stop_words=None)
X_description_train = tv.fit_transform(df_train['item_description'])
print('[{}] Finished TFIDF vectorize `item_description` (1/2)'.format(time.time() - start_time))
X_description_test = tv.transform(df_test['item_description'])
print('[{}] Finished TFIDF vectorize `item_description` (2/2)'.format(time.time() - start_time))

# Prepare dataset for LightGBM model
d_train = lgb.Dataset(train_features, label=y_train)
del train_features; gc.collect()
if SUBMIT_MODE:
   watchlist = [d_train]
else:
   d_valid = lgb.Dataset(test_features, label=y_test)
   watchlist = [d_train, d_valid]

# Set LightGBM parameters
params = {
   'learning_rate': 0.15,
   'application': 'regression',
   'max_depth': 13,
   'num_leaves': 400,
   'verbosity': -1,  # Don't print training progress
   'metric': 'RMSE',
   'data_random_seed': 1,
   'bagging_fraction': 0.8,  # Data sampling ratio for bagging
   'feature_fraction': 0.6,  # Feature ratio to use in each tree
   'nthread': 4,  # Number of CPU threads to use
   'lambda_l1': 10,  # L1 regularization
   'lambda_l2': 10   # L2 regularization
}
print('[{}] Finished compiling LGB'.format(time.time() - start_time))

# Train LightGBM model
modelL = lgb.train(params,
                 train_set=d_train,
                 num_boost_round=1350,  # Number of boosting iterations
                 valid_sets=watchlist,
                 verbose_eval=50)  # Print evaluation results every 50 iterations

# LightGBM prediction
predsL = modelL.predict(test_features)
predsL[predsL &amp;lt; 0] = 0  # Adjust negative predictions to 0

if not SUBMIT_MODE:
   print(&quot;LGB RMSLE:&quot;, rmse(predsL, y_test))

# Memory cleanup
del d_train
del modelL
if not SUBMIT_MODE:
   del d_valid
gc.collect()

# Combine FM_FTRL and LightGBM predictions (weighted average)
preds_final = predsFM * 0.33 + predsL * 0.67
if not SUBMIT_MODE:
   print('Final RMSE: ', rmse(preds_final, y_test))

# Save final prediction results
if SUBMIT_MODE:
   preds_final = np.expm1(preds_final)  # Reverse log transformation
   submission['price'] = preds_final
   submission.to_csv('lgb_and_fm_separate_train_test.csv', index=False)
   print('[{}] Writing submission done'.format(time.time() - start_time))&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;3.&amp;nbsp;Combining Ridge predictions&lt;/b&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1733400759369&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;desc_ridge_preds_oof = np.concatenate((desc_ridge_preds2, desc_ridge_preds1), axis=0)
desc_ridge_preds_test = (desc_ridge_preds1f + desc_ridge_preds2f) / 2.0&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;OOF Predictions:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Each data point is predicted by a model that didn't use it for training&lt;/li&gt;
&lt;li&gt;Can obtain predictions for the entire training data without overfitting&lt;/li&gt;
&lt;li&gt;These predictions can be used as features for the next level models&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Test Predictions Average:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;More stable predictions by averaging predictions from two models&lt;/li&gt;
&lt;li&gt;Offsets errors from individual models&lt;/li&gt;
&lt;li&gt;Applies the basic principle of ensemble learning&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;However, this part in this kernel is only used for printing sample prediction result with simple ridge model in the middle of the process. NOT FOR FINAL PREDICTION!!!&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-style=&quot;style5&quot; data-ke-type=&quot;horizontalRule&quot; /&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Stay&amp;nbsp;focused&amp;nbsp;on&amp;nbsp;your&amp;nbsp;goals&amp;nbsp;and&amp;nbsp;don't&amp;nbsp;let&amp;nbsp;distractions&amp;nbsp;derail&amp;nbsp;you&amp;nbsp;from&amp;nbsp;your&amp;nbsp;path.&lt;br /&gt;- Max Holloway -&lt;/span&gt;&lt;/blockquote&gt;</description>
      <category>캐글</category>
      <category>Kaggle</category>
      <category>mercari price suggestion challenge</category>
      <author>dongsunseng</author>
      <guid isPermaLink="true">https://dongsunseng.tistory.com/77</guid>
      <comments>https://dongsunseng.tistory.com/entry/Kaggle-Study-13-Mercari-Price-Suggestion-Challenge#entry77comment</comments>
      <pubDate>Fri, 6 Dec 2024 01:07:52 +0900</pubDate>
    </item>
    <item>
      <title>[Kaggle Study] #15 2017 Kaggle Machine Learning &amp;amp; Data Science Survey</title>
      <link>https://dongsunseng.tistory.com/entry/Kaggle-Study-15-2017-Kaggle-Machine-Learning-Data-Science-Survey</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;Fourteenth(Last) course following Youhan Lee's curriculum.&lt;b&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&amp;nbsp;&lt;/span&gt;Not competition&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;.&lt;/p&gt;
&lt;h4 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;a href=&quot;https://www.kaggle.com/code/ash316/novice-to-grandmaster&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;First Kernel:&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt; Novice to Grandmaster&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/a&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The&amp;nbsp;biggest&amp;nbsp;problem&amp;nbsp;that&amp;nbsp;we&amp;nbsp;might&amp;nbsp;face&amp;nbsp;is&amp;nbsp;fake&amp;nbsp;and&amp;nbsp;bogus&amp;nbsp;responses.&amp;nbsp;&lt;/li&gt;
&lt;li&gt;As it is a survey, not everyone will answer with proper credentials, and thus I assume that there will be a lot many outlier.&amp;nbsp;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;a href=&quot;https://www.kaggle.com/code/mhajabri/what-do-kagglers-say-about-data-science&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;Second Kernel:&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt; What&amp;nbsp;do&amp;nbsp;Kagglers&amp;nbsp;say&amp;nbsp;about&amp;nbsp;Data&amp;nbsp;Science&amp;nbsp;?&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/a&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;EDA Kernel with trying some prediction with modeling techniques.&lt;/li&gt;
&lt;/ul&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;Insight / Summary:&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;1.&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt; Dimensionality reduction and 2D-plotting&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The&amp;nbsp;most&amp;nbsp;known&amp;nbsp;/&amp;nbsp;used&amp;nbsp;dimensionality&amp;nbsp;reduction&amp;nbsp;technique&amp;nbsp;has&amp;nbsp;to&amp;nbsp;be&amp;nbsp;PCA.&amp;nbsp;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;The&amp;nbsp;problem&amp;nbsp;with&amp;nbsp;PCA&amp;nbsp;is&amp;nbsp;that&amp;nbsp;it&amp;nbsp;works&amp;nbsp;best&amp;nbsp;for&amp;nbsp;numerical&amp;nbsp;/&amp;nbsp;continuous&amp;nbsp;variables&amp;nbsp;which&amp;nbsp;is&amp;nbsp;not&amp;nbsp;the&amp;nbsp;case&amp;nbsp;here.&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;&lt;b&gt;A similar technique, Multi Correspondence Analysis (MCA), is used to achieve dimensionality reduction for categorical data.&lt;/b&gt; &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #99cefa;&quot;&gt;&lt;b&gt;Simply put, It's a technique that use chi-2 independence tests to create a distance between row points that will be further contained in a matrix. &lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;Each of the eigenvalues of this matrix has an inertia (similar to expressed variance for PCA) and the process to obtain the 2D visualization is the same.&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;clean&quot; style=&quot;background-color: #f1f3f4; color: #3c4043; text-align: start;&quot;&gt;&lt;code&gt;### NOT WORKING ON KAGGLE SERVERS (no module prince)####
#import prince
#np.random.seed(42)
#mca = prince.MCA(data_viz, n_components=2,use_benzecri_rates=True)
#mca.plot_rows(show_points=True, show_labels=False, color_by='CompensationAmount', ellipse_fill=True)&lt;/code&gt;&lt;/pre&gt;
&lt;h4 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;a href=&quot;https://www.kaggle.com/code/hakkisimsek/plotly-tutorial-1&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;Third Kernel: PLOTLY&amp;nbsp;TUTORIAL&amp;nbsp;-&amp;nbsp;1&lt;/b&gt;&lt;/a&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Literally plotting plots analyzing response data using PLOTLY.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-style=&quot;style5&quot; data-ke-type=&quot;horizontalRule&quot; /&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;The&amp;nbsp;first&amp;nbsp;step&amp;nbsp;is&amp;nbsp;to&amp;nbsp;establish&amp;nbsp;that&amp;nbsp;something&amp;nbsp;is&amp;nbsp;possible;&amp;nbsp;then&amp;nbsp;probability&amp;nbsp;will&amp;nbsp;occur.&lt;br /&gt;- Elon Musk -&lt;/span&gt;&lt;/blockquote&gt;</description>
      <category>캐글</category>
      <category>2017 kaggle machine learning &amp;amp; data science survey</category>
      <category>Kaggle</category>
      <author>dongsunseng</author>
      <guid isPermaLink="true">https://dongsunseng.tistory.com/76</guid>
      <comments>https://dongsunseng.tistory.com/entry/Kaggle-Study-15-2017-Kaggle-Machine-Learning-Data-Science-Survey#entry76comment</comments>
      <pubDate>Thu, 5 Dec 2024 00:57:58 +0900</pubDate>
    </item>
    <item>
      <title>[Kaggle Study] #14 Toxic Comment Classification Challenge</title>
      <link>https://dongsunseng.tistory.com/entry/Kaggle-Study-14-Toxic-Comment-Classification-Challenge</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;Thirteenth competition following Youhan Lee's curriculum.&lt;b&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&amp;nbsp;&lt;/span&gt;Natural Language Processing competition&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;.&lt;/p&gt;
&lt;h4 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;a href=&quot;https://www.kaggle.com/code/sbongo/for-beginners-tackling-toxic-using-keras&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;First Kernel:&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt; [For&amp;nbsp;Beginners]&amp;nbsp;Tackling&amp;nbsp;Toxic&amp;nbsp;Using&amp;nbsp;Keras&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/a&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Kernel using keras LSTM.&lt;/li&gt;
&lt;/ul&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;Insight / Summary:&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;1.&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt; Checking null values&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;pre class=&quot;python&quot; style=&quot;background-color: #f1f3f4; color: #3c4043; text-align: start;&quot; data-ke-language=&quot;python&quot;&gt;&lt;code&gt;train.isnull().any(),test.isnull().any()&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;2. Tokenization&lt;/b&gt;&lt;/p&gt;
&lt;pre class=&quot;python&quot; style=&quot;background-color: #f1f3f4; color: #3c4043; text-align: start;&quot; data-ke-language=&quot;python&quot;&gt;&lt;code&gt;list_tokenized_train = tokenizer.texts_to_sequences(list_sentences_train)
list_tokenized_test = tokenizer.texts_to_sequences(list_sentences_test)&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;Keras has turned our words into index representation for us:&lt;/span&gt;&lt;/p&gt;
&lt;pre class=&quot;python&quot; style=&quot;color: #3c4043; text-align: left;&quot; data-ke-language=&quot;python&quot;&gt;&lt;code&gt;[[688,
  75,
  1,
  126,
  130,
  177,
  29,
  672,
  4511,
  12052,
  1116,
  ...
  ]]&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;3. W&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;e have to feed a stream of data that has a consistent length(fixed number of features) -&amp;gt; Padding&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;We could make the shorter sentences as long as the others by filling the shortfall by zeros.&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;But on the other hand, we also have to trim the longer ones to the same length(maxlen) as the short ones. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;In this case, we have set the max length to be 200.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;python&quot; style=&quot;background-color: #f1f3f4; color: #3c4043; text-align: start;&quot; data-ke-language=&quot;python&quot;&gt;&lt;code&gt;maxlen = 200
X_t = pad_sequences(list_tokenized_train, maxlen=maxlen)
X_te = pad_sequences(list_tokenized_test, maxlen=maxlen)&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;How&amp;nbsp;do&amp;nbsp;you&amp;nbsp;know&amp;nbsp;what&amp;nbsp;is&amp;nbsp;the&amp;nbsp;best&amp;nbsp;&quot;maxlen&quot;&amp;nbsp;to&amp;nbsp;set?&amp;nbsp;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #99cefa;&quot;&gt;&lt;b&gt;If you put it too short, you might lose some useful feature that could cost you some accuracy points down the path.&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;&lt;b&gt;If&amp;nbsp;you&amp;nbsp;put&amp;nbsp;it&amp;nbsp;too&amp;nbsp;long,&amp;nbsp;your&amp;nbsp;LSTM&amp;nbsp;cell&amp;nbsp;will&amp;nbsp;have&amp;nbsp;to&amp;nbsp;be&amp;nbsp;larger&amp;nbsp;to&amp;nbsp;store&amp;nbsp;the&amp;nbsp;possible&amp;nbsp;values&amp;nbsp;or&amp;nbsp;states.&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;One of the ways to go about it is to see the distribution of the number of words in sentences.&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;python&quot; style=&quot;background-color: #f1f3f4; color: #3c4043; text-align: start;&quot; data-ke-language=&quot;python&quot;&gt;&lt;code&gt;totalNumWords = [len(one_comment) for one_comment in list_tokenized_train]&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;python&quot; style=&quot;background-color: #f1f3f4; color: #3c4043; text-align: start;&quot; data-ke-language=&quot;python&quot;&gt;&lt;code&gt;plt.hist(totalNumWords,bins = np.arange(0,410,10))#[0,50,100,150,200,250,300,350,400])#,450,500,550,600,650,700,750,800,850,900])
plt.show()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2024-12-04 오후 2.39.33.png&quot; data-origin-width=&quot;812&quot; data-origin-height=&quot;528&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bPpd6R/btsK7fcXyZH/2vU9KV7mMHEhrG7HZfMVWK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bPpd6R/btsK7fcXyZH/2vU9KV7mMHEhrG7HZfMVWK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bPpd6R/btsK7fcXyZH/2vU9KV7mMHEhrG7HZfMVWK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbPpd6R%2FbtsK7fcXyZH%2F2vU9KV7mMHEhrG7HZfMVWK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;468&quot; height=&quot;304&quot; data-filename=&quot;스크린샷 2024-12-04 오후 2.39.33.png&quot; data-origin-width=&quot;812&quot; data-origin-height=&quot;528&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;As we can see, most of the sentence length is about 30+. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;We could set the &quot;maxlen&quot; to about 50, but I'm being paranoid so I have set to 200. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;Then again, it sounds like something you could experiment and see what is the magic number.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;4. LSTM Modeling Details&lt;/b&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2024-12-04 오후 2.40.35.png&quot; data-origin-width=&quot;1566&quot; data-origin-height=&quot;622&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/QizUK/btsK6gRcRIQ/V2KUyHUDqyv4vEvhqAvBW0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/QizUK/btsK6gRcRIQ/V2KUyHUDqyv4vEvhqAvBW0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/QizUK/btsK6gRcRIQ/V2KUyHUDqyv4vEvhqAvBW0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FQizUK%2FbtsK6gRcRIQ%2FV2KUyHUDqyv4vEvhqAvBW0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1566&quot; height=&quot;622&quot; data-filename=&quot;스크린샷 2024-12-04 오후 2.40.35.png&quot; data-origin-width=&quot;1566&quot; data-origin-height=&quot;622&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;Before we could pass the output to a normal layer, we need to reshape the 3D tensor into a 2D one. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;We reshape carefully to avoid throwing away data that is important to us, and ideally we want the resulting data to be a good representative of the original data.&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;&lt;b&gt;Therefore, we use a Global Max Pooling layer which is traditionally used in CNN problems to reduce the dimensionality of image data. In simple terms, we go through each patch of data, and we take the maximum values of each patch.&lt;/b&gt; &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;These collection of maximum values will be a new set of down-sized data we can use.&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;5.&lt;/b&gt; &lt;b&gt;Additional tips and tricks&lt;/b&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;1) &lt;span style=&quot;color: #3c4043; text-align: left;&quot;&gt;If you have hit some roadblocks, especially when it starts returning dimension related errors, a good idea is to run &quot;model.summary()&quot; because it lists out all your layer outputs, which is pretty useful for diagnosis.&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;pre class=&quot;python&quot; style=&quot;background-color: #f1f3f4; color: #3c4043; text-align: start;&quot; data-ke-language=&quot;python&quot;&gt;&lt;code&gt;model.summary()&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #c0d1e7;&quot;&gt;&lt;b&gt;&lt;span style=&quot;color: #3c4043; text-align: left;&quot;&gt;2) While adding more layers, and doing more fancy transformations, it's a good idea to check if the outputs are performing as you have expected. You can reveal the output of a particular layer by:&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;pre class=&quot;python&quot; style=&quot;background-color: #f1f3f4; color: #3c4043; text-align: start;&quot; data-ke-language=&quot;python&quot;&gt;&lt;code&gt;from keras import backend as K

# with a Sequential model
get_3rd_layer_output = K.function([model.layers[0].input],
                                  [model.layers[2].output])
layer_output = get_3rd_layer_output([X_t[:1]])[0]
layer_output.shape
# print layer_output to see the actual data

# result: (1, 200, 60)&lt;/code&gt;&lt;/pre&gt;
&lt;h4 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;a href=&quot;https://www.kaggle.com/code/jagangupta/stop-the-s-toxic-comments-eda&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;Second Kernel:&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt; Stop&amp;nbsp;the&amp;nbsp;S@#$&amp;nbsp;-&amp;nbsp;Toxic&amp;nbsp;Comments&amp;nbsp;EDA&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/a&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;EDA Kernel.&lt;/li&gt;
&lt;/ul&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;Insight / Summary:&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;1.&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt; Multi-tagging&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc; color: #3c4043; text-align: start;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;There are ~95k comments in the training dataset and there are ~21 k tags and ~86k clean comments&lt;/li&gt;
&lt;li&gt;This is only possible when multiple tags are associated with each comment (eg) a comment can be classified as both toxic and obscene.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;routeros&quot; style=&quot;background-color: #f1f3f4; color: #3c4043; text-align: start;&quot;&gt;&lt;code&gt;x=rowsums.value_counts()

#plot
plt.figure(figsize=(8,4))
ax = sns.barplot(x.index, x.values, alpha=0.8,color=color[2])
plt.title(&quot;Multiple tags per comment&quot;)
plt.ylabel('# of Occurrences', fontsize=12)
plt.xlabel('# of tags ', fontsize=12)

#adding the text labels
rects = ax.patches
labels = x.values
for rect, label in zip(rects, labels):
    height = rect.get_height()
    ax.text(rect.get_x() + rect.get_width()/2, height + 5, label, ha='center', va='bottom')

plt.show()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2024-12-04 오후 4.12.50.png&quot; data-origin-width=&quot;1088&quot; data-origin-height=&quot;574&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/dlorF2/btsK6L4p7BS/Invq3kJE7XBRYSfjYzIXi1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/dlorF2/btsK6L4p7BS/Invq3kJE7XBRYSfjYzIXi1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/dlorF2/btsK6L4p7BS/Invq3kJE7XBRYSfjYzIXi1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FdlorF2%2FbtsK6L4p7BS%2FInvq3kJE7XBRYSfjYzIXi1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;553&quot; height=&quot;292&quot; data-filename=&quot;스크린샷 2024-12-04 오후 4.12.50.png&quot; data-origin-width=&quot;1088&quot; data-origin-height=&quot;574&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;2. Feature Engineering&lt;/b&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;1) Direct features: &lt;/b&gt;Features which are a directly due to words/content.We would be exploring the following techniques&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Word&amp;nbsp;frequency&amp;nbsp;features&lt;br /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Count&amp;nbsp;features&lt;/li&gt;
&lt;li&gt;Bigrams&lt;/li&gt;
&lt;li&gt;Trigrams&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Vector&amp;nbsp;distance&amp;nbsp;mapping&amp;nbsp;of&amp;nbsp;words&amp;nbsp;(Eg:&amp;nbsp;Word2Vec)&lt;/li&gt;
&lt;li&gt;Sentiment&amp;nbsp;scores&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;2) Indirect features: &lt;/b&gt;Some&amp;nbsp;more&amp;nbsp;experimental&amp;nbsp;features.&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;count of sentences&lt;/li&gt;
&lt;li&gt;count&amp;nbsp;of&amp;nbsp;words&lt;/li&gt;
&lt;li&gt;count&amp;nbsp;of&amp;nbsp;unique&amp;nbsp;words&lt;/li&gt;
&lt;li&gt;count&amp;nbsp;of&amp;nbsp;letters&lt;/li&gt;
&lt;li&gt;count&amp;nbsp;of&amp;nbsp;punctuations&lt;/li&gt;
&lt;li&gt;count&amp;nbsp;of&amp;nbsp;uppercase&amp;nbsp;words/letters&lt;/li&gt;
&lt;li&gt;count&amp;nbsp;of&amp;nbsp;stop&amp;nbsp;words&lt;/li&gt;
&lt;li&gt;Avg&amp;nbsp;length&amp;nbsp;of&amp;nbsp;each&amp;nbsp;word&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;3) Leaky features: &lt;/b&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;From the example, we know that the comments contain identifier information (eg: IP, username,etc.). We can create features out of them but, it will certainly lead to overfitting to this specific Wikipedia use-case.&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;toxic IP scores&lt;/li&gt;
&lt;li&gt;toxic users&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;Note: Creating the indirect and leaky features first. There are two reasons for this:&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Count features(Direct features) are useful only if they are created from a clean corpus&lt;/li&gt;
&lt;li&gt;Also the indirect features help compensate for the loss of information when cleaning the dataset&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;3. Indirect features&lt;/b&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1733297525158&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;## Indirect features

#Sentense count in each comment:
    #  '\n' can be used to count the number of sentences in each comment
df['count_sent']=df[&quot;comment_text&quot;].apply(lambda x: len(re.findall(&quot;\n&quot;,str(x)))+1)
#Word count in each comment:
df['count_word']=df[&quot;comment_text&quot;].apply(lambda x: len(str(x).split()))
#Unique word count
df['count_unique_word']=df[&quot;comment_text&quot;].apply(lambda x: len(set(str(x).split())))
#Letter count
df['count_letters']=df[&quot;comment_text&quot;].apply(lambda x: len(str(x)))
#punctuation count
df[&quot;count_punctuations&quot;] =df[&quot;comment_text&quot;].apply(lambda x: len([c for c in str(x) if c in string.punctuation]))
#upper case words count
df[&quot;count_words_upper&quot;] = df[&quot;comment_text&quot;].apply(lambda x: len([w for w in str(x).split() if w.isupper()]))
#title case words count
df[&quot;count_words_title&quot;] = df[&quot;comment_text&quot;].apply(lambda x: len([w for w in str(x).split() if w.istitle()]))
#Number of stopwords
df[&quot;count_stopwords&quot;] = df[&quot;comment_text&quot;].apply(lambda x: len([w for w in str(x).lower().split() if w in eng_stopwords]))
#Average length of the words
df[&quot;mean_word_len&quot;] = df[&quot;comment_text&quot;].apply(lambda x: np.mean([len(w) for w in str(x).split()]))&lt;/code&gt;&lt;/pre&gt;
&lt;pre id=&quot;code_1733297579533&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;#derived features
#Word count percent in each comment:
df['word_unique_percent']=df['count_unique_word']*100/df['count_word']
#derived features
#Punct percent in each comment:
df['punct_percent']=df['count_punctuations']*100/df['count_word']&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;4. Leaky&amp;nbsp;features&lt;/b&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Caution: Even though including these features might help us perform better in this particular scenario, it will not make sence to add them in the final model/general purpose model.&lt;/li&gt;
&lt;li&gt;Here we are creating our own custom count vectorizer to create count variables that match our regex condition.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1733298058946&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;#Leaky features
df['ip']=df[&quot;comment_text&quot;].apply(lambda x: re.findall(&quot;\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}&quot;,str(x)))
#count of ip addresses
df['count_ip']=df[&quot;ip&quot;].apply(lambda x: len(x))

#links
df['link']=df[&quot;comment_text&quot;].apply(lambda x: re.findall(&quot;http://.*com&quot;,str(x)))
#count of links
df['count_links']=df[&quot;link&quot;].apply(lambda x: len(x))

#article ids
df['article_id']=df[&quot;comment_text&quot;].apply(lambda x: re.findall(&quot;\d:\d\d\s{0,5}$&quot;,str(x)))
df['article_id_flag']=df.article_id.apply(lambda x: len(x))

#username
##              regex for     Match anything with [[User: ---------- ]]
# regexp = re.compile(&quot;\[\[User:(.*)\|&quot;)
df['username']=df[&quot;comment_text&quot;].apply(lambda x: re.findall(&quot;\[\[User(.*)\|&quot;,str(x)))
#count of username mentions
df['count_usernames']=df[&quot;username&quot;].apply(lambda x: len(x))
#check if features are created
#df.username[df.count_usernames&amp;gt;0]

# Leaky Ip
cv = CountVectorizer()
count_feats_ip = cv.fit_transform(df[&quot;ip&quot;].apply(lambda x : str(x)))


# Leaky usernames

cv = CountVectorizer()
count_feats_user = cv.fit_transform(df[&quot;username&quot;].apply(lambda x : str(x)))&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;5. Direct Features&lt;/b&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;1) Count based features(for unigrams):&lt;/b&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Lets&amp;nbsp;create&amp;nbsp;some&amp;nbsp;features&amp;nbsp;based&amp;nbsp;on&amp;nbsp;frequency&amp;nbsp;distribution&amp;nbsp;of&amp;nbsp;the&amp;nbsp;words.&amp;nbsp;&lt;/li&gt;
&lt;li&gt;Initially&amp;nbsp;lets&amp;nbsp;consider&amp;nbsp;taking&amp;nbsp;words&amp;nbsp;one&amp;nbsp;at&amp;nbsp;a&amp;nbsp;time&amp;nbsp;(ie)&amp;nbsp;Unigrams&lt;/li&gt;
&lt;li&gt;Python's SKlearn provides 3 ways of creating count features.&lt;/li&gt;
&lt;li&gt;All three of them first create a vocabulary(dictionary) of words and then create a sparse matrix of word counts for the words in the sentence that are present in the dictionary.&lt;/li&gt;
&lt;li&gt;A brief description of them:&lt;br /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;CountVectorizer&lt;/b&gt;&lt;br /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Creates&amp;nbsp;a&amp;nbsp;matrix&amp;nbsp;with&amp;nbsp;frequency&amp;nbsp;counts&amp;nbsp;of&amp;nbsp;each&amp;nbsp;word&amp;nbsp;in&amp;nbsp;the&amp;nbsp;text&amp;nbsp;corpus&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;TF-IDF&amp;nbsp;Vectorizer&lt;/b&gt;&lt;br /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;TF&amp;nbsp;-&amp;nbsp;Term&amp;nbsp;Frequency&amp;nbsp;--&amp;nbsp;Count&amp;nbsp;of&amp;nbsp;the&amp;nbsp;words(Terms)&amp;nbsp;in&amp;nbsp;the&amp;nbsp;text&amp;nbsp;corpus&amp;nbsp;(same&amp;nbsp;of&amp;nbsp;Count&amp;nbsp;Vect)&lt;/li&gt;
&lt;li&gt;IDF&amp;nbsp;-&amp;nbsp;Inverse&amp;nbsp;Document&amp;nbsp;Frequency&amp;nbsp;--&amp;nbsp;Penalizes&amp;nbsp;words&amp;nbsp;that&amp;nbsp;are&amp;nbsp;too&amp;nbsp;frequent.&amp;nbsp;We&amp;nbsp;can&amp;nbsp;think&amp;nbsp;of&amp;nbsp;this&amp;nbsp;as&amp;nbsp;regularization&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;HashingVectorizer&lt;/b&gt;&lt;br /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Creates a hashmap(word to number mapping based on hashing technique) instead of a dictionary for vocabulary&amp;nbsp;&lt;/li&gt;
&lt;li&gt;This&amp;nbsp;enables&amp;nbsp;it&amp;nbsp;to&amp;nbsp;be&amp;nbsp;more&amp;nbsp;scalable&amp;nbsp;and&amp;nbsp;faster&amp;nbsp;for&amp;nbsp;larger&amp;nbsp;text&amp;nbsp;coprus&lt;/li&gt;
&lt;li&gt;Can&amp;nbsp;be&amp;nbsp;parallelized&amp;nbsp;across&amp;nbsp;multiple&amp;nbsp;threads&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&amp;nbsp;Using TF-IDF here.&lt;/li&gt;
&lt;li&gt;Note: Using the concatenated dataframe &quot;merge&quot; which contains both text from train and test dataset to ensure that the vocabulary that we create does not missout on the words that are unique to testset.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1733298524352&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;### Unigrams -- TF-IDF 
# using settings recommended here for TF-IDF -- https://www.kaggle.com/abhishek/approaching-almost-any-nlp-problem-on-kaggle

#some detailed description of the parameters
# min_df=10 --- ignore terms that appear lesser than 10 times 
# max_features=None  --- Create as many words as present in the text corpus
    # changing max_features to 10k for memmory issues
# analyzer='word'  --- Create features from words (alternatively char can also be used)
# ngram_range=(1,1)  --- Use only one word at a time (unigrams)
# strip_accents='unicode' -- removes accents
# use_idf=1,smooth_idf=1 --- enable IDF
# sublinear_tf=1   --- Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf)


#temp settings to min=200 to facilitate top features section to run in kernals
#change back to min=10 to get better results
start_unigrams=time.time()
tfv = TfidfVectorizer(min_df=200,  max_features=10000, 
            strip_accents='unicode', analyzer='word',ngram_range=(1,1),
            use_idf=1,smooth_idf=1,sublinear_tf=1,
            stop_words = 'english')
tfv.fit(clean_corpus)
features = np.array(tfv.get_feature_names())

train_unigrams =  tfv.transform(clean_corpus.iloc[:train.shape[0]])
test_unigrams = tfv.transform(clean_corpus.iloc[train.shape[0]:])&lt;/code&gt;&lt;/pre&gt;
&lt;pre id=&quot;code_1733298544228&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;#https://buhrmann.github.io/tfidf-analysis.html
def top_tfidf_feats(row, features, top_n=25):
    ''' Get top n tfidf values in row and return them with their corresponding feature names.'''
    topn_ids = np.argsort(row)[::-1][:top_n]
    top_feats = [(features[i], row[i]) for i in topn_ids]
    df = pd.DataFrame(top_feats)
    df.columns = ['feature', 'tfidf']
    return df

def top_feats_in_doc(Xtr, features, row_id, top_n=25):
    ''' Top tfidf features in specific document (matrix row) '''
    row = np.squeeze(Xtr[row_id].toarray())
    return top_tfidf_feats(row, features, top_n)

def top_mean_feats(Xtr, features, grp_ids, min_tfidf=0.1, top_n=25):
    ''' Return the top n features that on average are most important amongst documents in rows
        indentified by indices in grp_ids. '''
    
    D = Xtr[grp_ids].toarray()

    D[D &amp;lt; min_tfidf] = 0
    tfidf_means = np.mean(D, axis=0)
    return top_tfidf_feats(tfidf_means, features, top_n)
    
# modified for multilabel milticlass
def top_feats_by_class(Xtr, features, min_tfidf=0.1, top_n=20):
    ''' Return a list of dfs, where each df holds top_n features and their mean tfidf value
        calculated across documents with the same class label. '''
    dfs = []
    cols=train_tags.columns
    for col in cols:
        ids = train_tags.index[train_tags[col]==1]
        feats_df = top_mean_feats(Xtr, features, ids, min_tfidf=min_tfidf, top_n=top_n)
        feats_df.label = label
        dfs.append(feats_df)
    return dfs&lt;/code&gt;&lt;/pre&gt;
&lt;pre id=&quot;code_1733298564404&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;#get top n for unigrams
tfidf_top_n_per_lass=top_feats_by_class(train_unigrams,features)

end_unigrams=time.time()

print(&quot;total time in unigrams&quot;,end_unigrams-start_unigrams)
print(&quot;total time till unigrams&quot;,end_unigrams-start_time)

# result: total time in unigrams 85.26099634170532
#         total time till unigrams 366.4286904335022&lt;/code&gt;&lt;/pre&gt;
&lt;h4 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;a href=&quot;https://www.kaggle.com/code/tunguz/logistic-regression-with-words-and-char-n-grams&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;Third Kernel: Logistic&amp;nbsp;regression&amp;nbsp;with&amp;nbsp;words&amp;nbsp;and&amp;nbsp;char&amp;nbsp;n-grams&lt;/b&gt;&lt;/a&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Literally kernel using logistic regression for modeling with both words features and char features.&lt;/li&gt;
&lt;/ul&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;Insight / Summary:&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;1.&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt; Summary&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;This&amp;nbsp;code&amp;nbsp;implements&amp;nbsp;a&amp;nbsp;machine&amp;nbsp;learning&amp;nbsp;model&amp;nbsp;for&amp;nbsp;classifying&amp;nbsp;toxic&amp;nbsp;comments.&amp;nbsp;&lt;/li&gt;
&lt;li&gt;Here&amp;nbsp;are&amp;nbsp;its&amp;nbsp;key&amp;nbsp;features&amp;nbsp;and&amp;nbsp;operational&amp;nbsp;methods:&lt;br /&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;Data Processing Approach:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;&lt;b&gt;Analyzes comment text at two levels (word and character)&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;Word-level&amp;nbsp;analysis&amp;nbsp;captures&amp;nbsp;individual&amp;nbsp;word&amp;nbsp;meanings&lt;/li&gt;
&lt;li&gt;Character-level analysis can capture typos and special expressions&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Feature&amp;nbsp;Extraction&amp;nbsp;Method:&lt;br /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Uses TF-IDF (Term Frequency-Inverse Document Frequency) vectorization&lt;/li&gt;
&lt;li&gt;Word&amp;nbsp;features:&amp;nbsp;Extracts&amp;nbsp;up&amp;nbsp;to&amp;nbsp;10,000&amp;nbsp;unigram&amp;nbsp;features&lt;/li&gt;
&lt;li&gt;Character&amp;nbsp;features:&amp;nbsp;Extracts&amp;nbsp;up&amp;nbsp;to&amp;nbsp;50,000&amp;nbsp;features&amp;nbsp;from&amp;nbsp;2-6&amp;nbsp;character&amp;nbsp;sequences&lt;/li&gt;
&lt;li&gt;Combines both features to create rich text representation&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Modeling&amp;nbsp;Approach:&lt;br /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Creates separate binary classification models for each of 6 toxic categories&lt;/li&gt;
&lt;li&gt;Uses&amp;nbsp;logistic&amp;nbsp;regression&amp;nbsp;to&amp;nbsp;predict&amp;nbsp;probabilities&amp;nbsp;for&amp;nbsp;each&amp;nbsp;category&lt;/li&gt;
&lt;li&gt;Evaluates&amp;nbsp;model&amp;nbsp;performance&amp;nbsp;using&amp;nbsp;3-fold&amp;nbsp;cross-validation&lt;/li&gt;
&lt;li&gt;Measures performance using ROC-AUC score&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Optimization&amp;nbsp;Considerations:&lt;br /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Sets sublinear_tf=True to reduce impact of extreme frequency values&lt;/li&gt;
&lt;li&gt;Sets&amp;nbsp;stop_words='english'&amp;nbsp;to&amp;nbsp;remove&amp;nbsp;stop&amp;nbsp;words&lt;/li&gt;
&lt;li&gt;Prevents&amp;nbsp;overfitting&amp;nbsp;through&amp;nbsp;L2&amp;nbsp;regularization&amp;nbsp;(C=0.1)&lt;/li&gt;
&lt;li&gt;Optimizes&amp;nbsp;large-scale&amp;nbsp;data&amp;nbsp;processing&amp;nbsp;using&amp;nbsp;SAG&amp;nbsp;(Stochastic&amp;nbsp;Average&amp;nbsp;Gradient)&amp;nbsp;optimizer&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&amp;nbsp;This implementation demonstrates a practical approach to text classification, particularly effective for analyzing toxicity in comments from multiple angles.&lt;/li&gt;
&lt;/ul&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;2. Code Analysis&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1733292030368&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;# Import required libraries
import numpy as np
import pandas as pd
# scikit-learn libraries for text processing and modeling
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from scipy.sparse import hstack
# Define toxic comment categories for classification
class_names = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
# Load data and handle missing values with empty spaces
train = pd.read_csv('../input/train.csv').fillna(' ')
test = pd.read_csv('../input/test.csv').fillna(' ')
# Extract comment text from training and test data
train_text = train['comment_text']
test_text = test['comment_text']
all_text = pd.concat([train_text, test_text])
# Word-level TF-IDF vectorization settings
word_vectorizer = TfidfVectorizer(
    sublinear_tf=True,      # Apply log scale to TF values
    strip_accents='unicode', # Remove accents
    analyzer='word',         # Word-level analysis
    token_pattern=r'\w{1,}', # Recognize one or more word characters as tokens
    stop_words='english',    # Remove English stop words
    ngram_range=(1, 1),     # Use single words only (unigram)
    max_features=10000)      # Use maximum of 10000 features
word_vectorizer.fit(all_text)
train_word_features = word_vectorizer.transform(train_text)
test_word_features = word_vectorizer.transform(test_text)
# Character-level TF-IDF vectorization settings
char_vectorizer = TfidfVectorizer(
    sublinear_tf=True,      # Apply log scale to TF values
    strip_accents='unicode', # Remove accents
    analyzer='char',         # Character-level analysis
    stop_words='english',    # Remove English stop words
    ngram_range=(2, 6),     # Use 2-6 character sequences
    max_features=50000)      # Use maximum of 50000 features
char_vectorizer.fit(all_text)
train_char_features = char_vectorizer.transform(train_text)
test_char_features = char_vectorizer.transform(test_text)
# Horizontally combine word and character features
train_features = hstack([train_char_features, train_word_features])
test_features = hstack([test_char_features, test_word_features])
# Train models and make predictions for each toxic category
scores = []
submission = pd.DataFrame.from_dict({'id': test['id']})
for class_name in class_names:
    # Extract target data for current category
    train_target = train[class_name]
    # Initialize logistic regression model (L2 regularization, SAG optimizer)
    classifier = LogisticRegression(C=0.1, solver='sag')
    # Calculate ROC-AUC score using 3-fold cross-validation
    cv_score = np.mean(cross_val_score(classifier, train_features, train_target, cv=3, scoring='roc_auc'))
    scores.append(cv_score)
    print('CV score for class {} is {}'.format(class_name, cv_score))
    # Train model on full training data and predict test data
    classifier.fit(train_features, train_target)
    submission[class_name] = classifier.predict_proba(test_features)[:, 1]
# Calculate average score across all categories
print('Total CV score is {}'.format(np.mean(scores)))
# Save predictions to CSV file
submission.to_csv('submission.csv', index=False)&lt;/code&gt;&lt;/pre&gt;
&lt;h4 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;a href=&quot;https://www.kaggle.com/code/rhodiumbeng/classifying-multi-label-comments-0-9741-lb&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;Fourth Kernel: Classifying&amp;nbsp;multi-label&amp;nbsp;comments&amp;nbsp;(0.9741&amp;nbsp;lb)&lt;/b&gt;&lt;/a&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Also using logistic regression.&lt;/li&gt;
&lt;/ul&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;Insight / Summary:&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;1.&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt; Unlabelled data&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;As the mean values are very small (some way below 0.05), there would be many not labelled as positive in the six categories. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;From this I guess that there would be many comments which are not labelled in any of the six categories.&amp;nbsp;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1733293932259&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;unlabelled_in_all = train_df[(train_df['toxic']!=1) &amp;amp; (train_df['severe_toxic']!=1) &amp;amp; (train_df['obscene']!=1) &amp;amp; 
                            (train_df['threat']!=1) &amp;amp; (train_df['insult']!=1) &amp;amp; (train_df['identity_hate']!=1)]
print('Percentage of unlabelled comments is ', len(unlabelled_in_all)/len(train_df)*100)

# result: Percentage of unlabelled comments is  89.83211235124176&lt;/code&gt;&lt;/pre&gt;
&lt;pre id=&quot;code_1733293987345&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;# Let's look at the character length for the rows in the training data and record these
train_df['char_length'] = train_df['comment_text'].apply(lambda x: len(str(x)))&lt;/code&gt;&lt;/pre&gt;
&lt;pre id=&quot;code_1733293994963&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;# look at the histogram plot for text length
sns.set()
train_df['char_length'].hist()
plt.show()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;edited_스크린샷 2024-12-04 오후 3.33.24.png&quot; data-origin-width=&quot;826&quot; data-origin-height=&quot;536&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/ciDSUW/btsK5NuSXRU/092jCopClp5rtJ8cPGNVTK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/ciDSUW/btsK5NuSXRU/092jCopClp5rtJ8cPGNVTK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/ciDSUW/btsK5NuSXRU/092jCopClp5rtJ8cPGNVTK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FciDSUW%2FbtsK5NuSXRU%2F092jCopClp5rtJ8cPGNVTK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;408&quot; height=&quot;307&quot; data-filename=&quot;edited_스크린샷 2024-12-04 오후 3.33.24.png&quot; data-origin-width=&quot;826&quot; data-origin-height=&quot;536&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;Most of the text length are within 500 characters, with some up to 5,000 characters long.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;2. Manually cleaning comment text&lt;/b&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1733294393414&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;def clean_text(text):
    text = text.lower()
    text = re.sub(r&quot;what's&quot;, &quot;what is &quot;, text)
    text = re.sub(r&quot;\'s&quot;, &quot; &quot;, text)
    text = re.sub(r&quot;\'ve&quot;, &quot; have &quot;, text)
    text = re.sub(r&quot;can't&quot;, &quot;cannot &quot;, text)
    text = re.sub(r&quot;n't&quot;, &quot; not &quot;, text)
    text = re.sub(r&quot;i'm&quot;, &quot;i am &quot;, text)
    text = re.sub(r&quot;\'re&quot;, &quot; are &quot;, text)
    text = re.sub(r&quot;\'d&quot;, &quot; would &quot;, text)
    text = re.sub(r&quot;\'ll&quot;, &quot; will &quot;, text)
    text = re.sub(r&quot;\'scuse&quot;, &quot; excuse &quot;, text)
    text = re.sub('\W', ' ', text)
    text = re.sub('\s+', ' ', text)
    text = text.strip(' ')
    return text
    
# clean the comment_text in train_df [Thanks to Pulkit Jha for the useful pointer.]
train_df['comment_text'] = train_df['comment_text'].map(lambda com : clean_text(com))

# clean the comment_text in test_df [Thanks, Pulkit Jha.]
test_df['comment_text'] = test_df['comment_text'].map(lambda com : clean_text(com))&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;3. Problem Transformation&lt;/b&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;One way to approach a multi-label classification problem is to transform the problem into separate single-class classifier problems. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;This is known as 'problem transformation'. There are three methods:&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;Binary Relevance.&lt;/b&gt;&lt;span&gt;&amp;nbsp;&lt;/span&gt;This is probably the simplest which treats each label as a separate single classification problems. The key assumption here though, is that there are no correlation among the various labels.&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Classifier Chains.&lt;/b&gt;&lt;span&gt;&amp;nbsp;&lt;/span&gt;In this method, the first classifier is trained on the input X. Then the subsequent classifiers are trained on the input X and all previous classifiers' predictions in the chain. This method attempts to draw the signals from the correlation among preceding target variables.&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Label Powerset.&lt;/b&gt;&lt;span&gt;&amp;nbsp;&lt;/span&gt;This method transforms the problem into a multi-class problem where the multi-class labels are essentially all the unique label combinations. In our case here, where there are six labels, Label Powerset would in effect turn this into a 2^6 or 64-class problem. {Thanks Joshua for pointing out.}&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;1) Binary Relevance&lt;/b&gt;&lt;/p&gt;
&lt;pre class=&quot;python&quot; style=&quot;background-color: #f1f3f4; color: #3c4043; text-align: start;&quot; data-ke-language=&quot;python&quot;&gt;&lt;code&gt;# import and instantiate the Logistic Regression model
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
logreg = LogisticRegression(C=12.0)

# create submission file
submission_binary = pd.read_csv('../input/sample_submission.csv')

for label in cols_target:
    print('... Processing {}'.format(label))
    y = train_df[label]
    # train the model using X_dtm &amp;amp; y
    logreg.fit(X_dtm, y)
    # compute the training accuracy
    y_pred_X = logreg.predict(X_dtm)
    print('Training accuracy is {}'.format(accuracy_score(y, y_pred_X)))
    # compute the predicted probabilities for X_test_dtm
    test_y_prob = logreg.predict_proba(test_X_dtm)[:,1]
    submission_binary[label] = test_y_prob&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;2) Classifier Chains&lt;/b&gt;&lt;/p&gt;
&lt;pre class=&quot;python&quot; style=&quot;background-color: #f1f3f4; color: #3c4043; text-align: start;&quot;&gt;&lt;code&gt;# create submission file
submission_chains = pd.read_csv('../input/sample_submission.csv')

# create a function to add features
def add_feature(X, feature_to_add):
    '''
    Returns sparse feature matrix with added feature.
    feature_to_add can also be a list of features.
    '''
    from scipy.sparse import csr_matrix, hstack
    return hstack([X, csr_matrix(feature_to_add).T], 'csr')&lt;/code&gt;&lt;/pre&gt;
&lt;div&gt;
&lt;div&gt;
&lt;div style=&quot;background-color: #f1f3f4;&quot;&gt;
&lt;div&gt;
&lt;pre class=&quot;dockerfile&quot; style=&quot;color: #3c4043;&quot;&gt;&lt;code&gt;for label in cols_target:
    print('... Processing {}'.format(label))
    y = train_df[label]
    # train the model using X_dtm &amp;amp; y
    logreg.fit(X_dtm,y)
    # compute the training accuracy
    y_pred_X = logreg.predict(X_dtm)
    print('Training Accuracy is {}'.format(accuracy_score(y,y_pred_X)))
    # make predictions from test_X
    test_y = logreg.predict(test_X_dtm)
    test_y_prob = logreg.predict_proba(test_X_dtm)[:,1]
    submission_chains[label] = test_y_prob
    # chain current label to X_dtm
    X_dtm = add_feature(X_dtm, y)
    print('Shape of X_dtm is now {}'.format(X_dtm.shape))
    # chain current label predictions to test_X_dtm
    test_X_dtm = add_feature(test_X_dtm, test_y)
    print('Shape of test_X_dtm is now {}'.format(test_X_dtm.shape))
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style5&quot; /&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;The only way to fail is to never try. Take risks, learn from your mistakes, and keep pushing forward.&amp;nbsp;&lt;br /&gt;- Max Holloway -&lt;/span&gt;&lt;/blockquote&gt;</description>
      <category>캐글</category>
      <category>Kaggle</category>
      <category>toxic comment classification challenge</category>
      <author>dongsunseng</author>
      <guid isPermaLink="true">https://dongsunseng.tistory.com/75</guid>
      <comments>https://dongsunseng.tistory.com/entry/Kaggle-Study-14-Toxic-Comment-Classification-Challenge#entry75comment</comments>
      <pubDate>Wed, 4 Dec 2024 16:53:14 +0900</pubDate>
    </item>
    <item>
      <title>[Kaggle Study] #12 Spooky Author Identification</title>
      <link>https://dongsunseng.tistory.com/entry/Kaggle-Study-12-Spooky-Author-Identification</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;Eleventh competition following Youhan Lee's curriculum. &lt;b&gt;&lt;span&gt;&lt;span&gt;Natural language processing &lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;span&gt;&lt;span&gt;competition&lt;/span&gt;&lt;/span&gt;&lt;b&gt;&lt;span&gt;&lt;span&gt;.&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;figure id=&quot;og_1733162128140&quot; contenteditable=&quot;false&quot; data-ke-type=&quot;opengraph&quot; data-ke-align=&quot;alignCenter&quot; data-og-type=&quot;website&quot; data-og-title=&quot;Spooky Author Identification&quot; data-og-description=&quot;Share code and discuss insights to identify horror authors from their writings&quot; data-og-host=&quot;www.kaggle.com&quot; data-og-source-url=&quot;https://www.kaggle.com/c/spooky-author-identification&quot; data-og-url=&quot;https://kaggle.com/spooky-author-identification&quot; data-og-image=&quot;https://scrap.kakaocdn.net/dn/bKHFMy/hyXGKx5pj4/WfmEt56n05uWH65i4f6cNK/img.jpg?width=1900&amp;amp;height=400&amp;amp;face=0_0_1900_400,https://scrap.kakaocdn.net/dn/GGIOP/hyXGI1j4BL/961TzD8PYIgpL4RWKAYPx1/img.jpg?width=1900&amp;amp;height=400&amp;amp;face=0_0_1900_400&quot;&gt;&lt;a href=&quot;https://www.kaggle.com/c/spooky-author-identification&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot; data-source-url=&quot;https://www.kaggle.com/c/spooky-author-identification&quot;&gt;
&lt;div class=&quot;og-image&quot; style=&quot;background-image: url('https://scrap.kakaocdn.net/dn/bKHFMy/hyXGKx5pj4/WfmEt56n05uWH65i4f6cNK/img.jpg?width=1900&amp;amp;height=400&amp;amp;face=0_0_1900_400,https://scrap.kakaocdn.net/dn/GGIOP/hyXGI1j4BL/961TzD8PYIgpL4RWKAYPx1/img.jpg?width=1900&amp;amp;height=400&amp;amp;face=0_0_1900_400');&quot;&gt;&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;og-text&quot;&gt;
&lt;p class=&quot;og-title&quot; data-ke-size=&quot;size16&quot;&gt;Spooky Author Identification&lt;/p&gt;
&lt;p class=&quot;og-desc&quot; data-ke-size=&quot;size16&quot;&gt;Share code and discuss insights to identify horror authors from their writings&lt;/p&gt;
&lt;p class=&quot;og-host&quot; data-ke-size=&quot;size16&quot;&gt;www.kaggle.com&lt;/p&gt;
&lt;/div&gt;
&lt;/a&gt;&lt;/figure&gt;
&lt;h4 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;a href=&quot;https://www.kaggle.com/code/arthurtok/spooky-nlp-and-topic-modelling-tutorial&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;First Kernel:&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt; Spooky NLP and Topic Modelling tutorial&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/a&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Topic modeling:&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;&amp;nbsp;the process in which we try uncover abstract themes or &quot;topics&quot; based on the underlying documents and words in a corpus of text&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;Two standard topic modeling techniques:&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;Latent Dirichlet Allocation (LDA)&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;Non-negative Matrix Factorization (NMF)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;Insight / Summary:&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;1.&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt; Top 50 (Uncleaned) Word Frequency in Training set&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1733203299996&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;all_words = train['text'].str.split(expand=True).unstack().value_counts()
data = [go.Bar(
            x = all_words.index.values[2:50],
            y = all_words.values[2:50],
            marker= dict(colorscale='Jet',
                         color = all_words.values[2:100]
                        ),
            text='Word counts'
    )]

layout = go.Layout(
    title='Top 50 (Uncleaned) Word frequencies in the training dataset'
)

fig = go.Figure(data=data, layout=layout)

py.iplot(fig, filename='basic-bar')&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2024-12-03 오후 2.21.51.png&quot; data-origin-width=&quot;1346&quot; data-origin-height=&quot;776&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bcHkEf/btsK5DkU3IR/ikKRbZun2cFINfcOgMJivk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bcHkEf/btsK5DkU3IR/ikKRbZun2cFINfcOgMJivk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bcHkEf/btsK5DkU3IR/ikKRbZun2cFINfcOgMJivk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbcHkEf%2FbtsK5DkU3IR%2FikKRbZun2cFINfcOgMJivk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;703&quot; height=&quot;405&quot; data-filename=&quot;스크린샷 2024-12-03 오후 2.21.51.png&quot; data-origin-width=&quot;1346&quot; data-origin-height=&quot;776&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;These words are all so commonly occuring words which you could find just anywhere else. Not just in spooky stories and novels by our three authors but also in newspapers, kid book, religious texts - really almost every other english text. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;Therefore we must find some way to preprocess our dataset first to strip out all these commonly occurring words which do not bring much to the table.&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;2. WordClouds to visualise each author's work&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;One&amp;nbsp;very&amp;nbsp;handy&amp;nbsp;visualization&amp;nbsp;tool&amp;nbsp;for&amp;nbsp;a&amp;nbsp;data&amp;nbsp;scientist&amp;nbsp;when&amp;nbsp;it&amp;nbsp;comes&amp;nbsp;to&amp;nbsp;any&amp;nbsp;sort&amp;nbsp;of&amp;nbsp;natural&amp;nbsp;language&amp;nbsp;processing&amp;nbsp;is&amp;nbsp;plotting&amp;nbsp;&quot;&lt;span style=&quot;background-color: #f89009;&quot;&gt;&lt;b&gt;Word&amp;nbsp;Cloud&lt;/b&gt;&lt;/span&gt;&quot;.&amp;nbsp;&lt;/li&gt;
&lt;li&gt;A&amp;nbsp;word&amp;nbsp;cloud&amp;nbsp;(as&amp;nbsp;the&amp;nbsp;name&amp;nbsp;suggests)&amp;nbsp;is&amp;nbsp;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;an&amp;nbsp;image&amp;nbsp;that&amp;nbsp;is&amp;nbsp;made&amp;nbsp;up&amp;nbsp;of&amp;nbsp;a&amp;nbsp;mixture&amp;nbsp;of&amp;nbsp;distinct&amp;nbsp;words&amp;nbsp;which&amp;nbsp;may&amp;nbsp;make&amp;nbsp;up&amp;nbsp;a&amp;nbsp;text&amp;nbsp;or&amp;nbsp;book&amp;nbsp;and&amp;nbsp;where&amp;nbsp;the&amp;nbsp;size&amp;nbsp;of&amp;nbsp;each&amp;nbsp;word&amp;nbsp;is&amp;nbsp;proportional&amp;nbsp;to&amp;nbsp;its&amp;nbsp;word&amp;nbsp;frequency&amp;nbsp;in&amp;nbsp;that&amp;nbsp;text&amp;nbsp;(number&amp;nbsp;of&amp;nbsp;times&amp;nbsp;the&amp;nbsp;word&amp;nbsp;appears)&lt;/b&gt;&lt;/span&gt;.&amp;nbsp;&lt;/li&gt;
&lt;li&gt;Here&amp;nbsp;instead&amp;nbsp;of&amp;nbsp;dealing&amp;nbsp;with&amp;nbsp;an&amp;nbsp;actual&amp;nbsp;book&amp;nbsp;or&amp;nbsp;text,&amp;nbsp;our&amp;nbsp;words&amp;nbsp;can&amp;nbsp;simply&amp;nbsp;be&amp;nbsp;taken&amp;nbsp;from&amp;nbsp;the&amp;nbsp;column&amp;nbsp;&quot;text&quot;&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;1) Store the text of each author in a Python list&lt;/b&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1733203682337&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;eap = train[train.author==&quot;EAP&quot;][&quot;text&quot;].values
hpl = train[train.author==&quot;HPL&quot;][&quot;text&quot;].values
mws = train[train.author==&quot;MWS&quot;][&quot;text&quot;].values&lt;/code&gt;&lt;/pre&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;2) Encoding image and imported&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1733203766076&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;from wordcloud import WordCloud, STOPWORDS&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Generating a normal wordcloud is rather boring so I would like to introduce to you a technique of importing pictures (something relevant) and using the outline of that picture as a mask for our wordclouds.&lt;/li&gt;
&lt;li&gt;Therefore the pictures that I have chosen are the ones I feel most representative for their authors:&lt;br /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The Raven for Edgar Allen Poe&lt;/li&gt;
&lt;li&gt;Octopus Cthulu-thingy for HP Lovecraft&lt;/li&gt;
&lt;li&gt;Frankenstein for Mary Shelly&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;The way I am loading in the pictures on Kaggle is a sort of a feature hack although readers familiar to my work know this trick.&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;I first derive the Base64 encoding of whatever images I want to use and then use that particular encoding and re-convert the picture back on the notebook.&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;3) Decoding image using &lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;codecs module&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1733204082690&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;import codecs
# Generate the Mask for EAP
f1 = open(&quot;eap.png&quot;, &quot;wb&quot;)
f1.write(codecs.decode(eap_64,'base64'))
f1.close()
img1 = imread(&quot;eap.png&quot;)
# img = img.resize((980,1080))
hcmask = img1

f2 = open(&quot;mws.png&quot;, &quot;wb&quot;)
f2.write(codecs.decode(mws_64,'base64'))
f2.close()
img2 = imread(&quot;mws.png&quot;)
hcmask2 = img2

f3 = open(&quot;hpl.png&quot;, &quot;wb&quot;)
f3.write(codecs.decode(hpl_64,'base64'))
f3.close()
img3 = imread(&quot;hpl.png&quot;)
hcmask3 = img3;&lt;/code&gt;&lt;/pre&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;4) Finally wordcloud&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;pre class=&quot;routeros&quot; style=&quot;background-color: #f1f3f4; color: #3c4043; text-align: start;&quot;&gt;&lt;code&gt;# The wordcloud of Cthulhu/squidy thing for HP Lovecraft
plt.figure(figsize=(16,13))
wc = WordCloud(background_color=&quot;black&quot;, max_words=10000, 
               mask=hcmask3, stopwords=STOPWORDS, max_font_size= 40)
wc.generate(&quot; &quot;.join(hpl))
plt.title(&quot;HP Lovecraft (Cthulhu-Squidy)&quot;, fontsize=20)
# plt.imshow(wc.recolor( colormap= 'Pastel1_r' , random_state=17), alpha=0.98)
plt.imshow(wc.recolor( colormap= 'Pastel2' , random_state=17), alpha=0.98)
plt.axis('off')&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2024-12-03 오후 2.35.39.png&quot; data-origin-width=&quot;1346&quot; data-origin-height=&quot;1212&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/y2yhX/btsK5MhHIXL/rV6ahYUMuxGJel171LyR9K/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/y2yhX/btsK5MhHIXL/rV6ahYUMuxGJel171LyR9K/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/y2yhX/btsK5MhHIXL/rV6ahYUMuxGJel171LyR9K/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fy2yhX%2FbtsK5MhHIXL%2FrV6ahYUMuxGJel171LyR9K%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;632&quot; height=&quot;569&quot; data-filename=&quot;스크린샷 2024-12-03 오후 2.35.39.png&quot; data-origin-width=&quot;1346&quot; data-origin-height=&quot;1212&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;3. Text preprocessing&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;ol style=&quot;list-style-type: decimal; color: #3c4043; text-align: start;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;&lt;b&gt;Tokenization&lt;/b&gt;&lt;span&gt;&amp;nbsp;&lt;/span&gt;- Segregation of the text into its individual constitutent words.&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Stopwords&lt;/b&gt;&lt;span&gt;&amp;nbsp;&lt;/span&gt;- Throw away any words that occur too frequently as its frequency of occurrence will not be useful in helping detecting relevant texts. (as an aside also consider throwing away words that occur very infrequently).&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Stemming&lt;/b&gt;&lt;span&gt;&amp;nbsp;&lt;/span&gt;- combine variants of words into a single parent word that still conveys the same meaning&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Vectorization&lt;/b&gt;&lt;span&gt;&amp;nbsp;&lt;/span&gt;- Converting text into vector format. One of the simplest is the famous bag-of-words approach, where you create a matrix (for each document or text in the corpus). In the simplest form, this matrix stores word frequencies (word counts) and is often referred to as vectorization of the raw text.&lt;/li&gt;
&lt;/ol&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;4. Tokenization using &lt;b&gt;NLTK module&lt;/b&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;The concept of tokenization is the act of taking a sequence of characters (think of Python strings) in a given document and dicing it up into its individual constituent pieces, which are the eponymous &quot;tokens&quot; of this method. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;One could loosely think of them as singular words in a sentence. One could naively implement the &quot;split( )&quot; method on a string which separates it into a python list based on the identifier in the argument. It is actually not that trivial to.&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;Here we split the first sentence of the text in the training data just on a space as follows:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1733204778555&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;# Storing the first text element as a string
first_text = train.text.values[0]
print(first_text)
print(&quot;=&quot;*90)
print(first_text.split(&quot; &quot;))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2024-12-03 오후 2.58.07.png&quot; data-origin-width=&quot;1544&quot; data-origin-height=&quot;492&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/cjtNKl/btsK6Agypv7/6FZ44CuAUrjk7PlCouPKs0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/cjtNKl/btsK6Agypv7/6FZ44CuAUrjk7PlCouPKs0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/cjtNKl/btsK6Agypv7/6FZ44CuAUrjk7PlCouPKs0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FcjtNKl%2FbtsK6Agypv7%2F6FZ44CuAUrjk7PlCouPKs0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1544&quot; height=&quot;492&quot; data-filename=&quot;스크린샷 2024-12-03 오후 2.58.07.png&quot; data-origin-width=&quot;1544&quot; data-origin-height=&quot;492&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;However as you can see from this first attempt at tokenization, the segregation(분리) of the sentence into its individual elements (or terms) is not entirely accurate. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;As an example, look at the second element of the list which contains the term &quot;process,&quot;. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;The punctuation mark (comma) has also been included and is being treated along with the word &quot;process&quot; as a term in itself. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;Ideally we would like the comma and the word to be in two different and separate elements of the list. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;Trying to do this with pure python list operations will be quite complex so this is where the NLTK library comes into play. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;There is a convenient method &quot;word_tokenize( )&quot; (TreebankWord tokenizer) which strips out singular words as well as punctuations into separate elements automatically as follows:&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1733205500096&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;first_text_list = nltk.word_tokenize(first_text)
print(first_text_list)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2024-12-03 오후 3.06.10.png&quot; data-origin-width=&quot;1544&quot; data-origin-height=&quot;264&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/lhGR3/btsK4K6jo9l/9tWrGSJYtjDbIrQveAsdU0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/lhGR3/btsK4K6jo9l/9tWrGSJYtjDbIrQveAsdU0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/lhGR3/btsK4K6jo9l/9tWrGSJYtjDbIrQveAsdU0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FlhGR3%2FbtsK4K6jo9l%2F9tWrGSJYtjDbIrQveAsdU0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1544&quot; height=&quot;264&quot; data-filename=&quot;스크린샷 2024-12-03 오후 3.06.10.png&quot; data-origin-width=&quot;1544&quot; data-origin-height=&quot;264&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;5. Stopword Removal&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;As alluded to above stopwords are generally &lt;b&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;words that appear so commonly and at such a high frequency in the corpus that they don't actually contribute much to the learning or predictive process as a learning model would fail to distinguish it from other texts&lt;/span&gt;&lt;/b&gt;. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;Stopwords include terms such as &quot;to&quot; or &quot;the&quot; and therefore, it would be to our benefit to remove them during the pre-processing phase. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;Conveniently, NLTK comes with a predefined list of 153 english stopwords.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1733206106587&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;stopwords = nltk.corpus.stopwords.words('english')
len(stopwords)

# result: 179&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2024-12-03 오후 3.09.02.png&quot; data-origin-width=&quot;1772&quot; data-origin-height=&quot;1004&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/Tweyv/btsK4gLaOQL/13NqolVjS7VbQlFRaoREV1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/Tweyv/btsK4gLaOQL/13NqolVjS7VbQlFRaoREV1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/Tweyv/btsK4gLaOQL/13NqolVjS7VbQlFRaoREV1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FTweyv%2FbtsK4gLaOQL%2F13NqolVjS7VbQlFRaoREV1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1772&quot; height=&quot;1004&quot; data-filename=&quot;스크린샷 2024-12-03 오후 3.09.02.png&quot; data-origin-width=&quot;1772&quot; data-origin-height=&quot;1004&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;Filtering out stopwords from &lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;our tokenized list of words:&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1733206190125&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;first_text_list_cleaned = [word for word in first_text_list if word.lower() not in stopwords]
print(first_text_list_cleaned)
print(&quot;=&quot;*90)
print(&quot;Length of original list: {0} words\n&quot;
      &quot;Length of list after stopwords removal: {1} words&quot;
      .format(len(first_text_list), len(first_text_list_cleaned)))&lt;/code&gt;&lt;/pre&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;6. Stemming&amp;nbsp;and&amp;nbsp;Lemmatization&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;The work at this stage attempts to &lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;reduce as many different variations of similar words into a single term ( different branches all reduced to single word&lt;/b&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;stem)&lt;/b&gt;&lt;/span&gt;.&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;Therefore if we have &quot;running&quot;, &quot;runs&quot; and &quot;run&quot;, you would really want these three distinct words to collapse into just the word &quot;run&quot;. (However of course you lose granularity of the past, present or future tense).&amp;nbsp;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;We can turn to NLTK again which provides various stemmers which include variants such as the Porter stemming algorithm, the lancaster stemmer and the Snowball stemmer.&lt;/li&gt;
&lt;li&gt;In the following example, I will create a porter stemmer instance as follows:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1733206242905&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;stemmer = nltk.stem.PorterStemmer()&lt;/code&gt;&lt;/pre&gt;
&lt;pre id=&quot;code_1733206394762&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;print(&quot;The stemmed form of running is: {}&quot;.format(stemmer.stem(&quot;running&quot;)))
print(&quot;The stemmed form of runs is: {}&quot;.format(stemmer.stem(&quot;runs&quot;)))
print(&quot;The stemmed form of run is: {}&quot;.format(stemmer.stem(&quot;run&quot;)))

# The stemmed form of running is: run
# The stemmed form of runs is: run
# The stemmed form of run is: run&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;As we can see, the stemmer has successfully reduced the given words above into a base form and this will be most in helping us reduce the size of our dataset of words when we come to learning and classification tasks.&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;However there is one flaw with stemming and that is the fact that the process involves quite a crude heuristic in chopping off the ends of words in the hope of reducing a particular word into a human recognizable base form. &lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;Therefore this process does not take into account vocabulary or word forms when collapsing words as this example will illustrate:&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1733206612937&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;print(&quot;The stemmed form of leaves is: {}&quot;.format(stemmer.stem(&quot;leaves&quot;)))

# result: The stemmed form of leaves is: leav&lt;/code&gt;&lt;/pre&gt;
&lt;p style=&quot;color: #202214; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p id=&quot;Lemmatization-to-the-rescue&quot; style=&quot;color: #202214; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;Lemmatization&lt;/b&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Therefore we turn to another that we could use in lieu of stemming.&lt;/li&gt;
&lt;li&gt;This method is called lemmatization which aims to achieve the same effect as the former method.&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;&lt;b&gt;However unlike a stemmer, lemmatizing the dataset aims to reduce words based on an actual dictionary or vocabulary (the Lemma) and therefore will not chop off words into stemmed forms that do not carry any lexical meaning.&lt;/b&gt; &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;Here we can utilize NLTK once again to initialize a lemmatizer (WordNet variant) and inspect how it collapses words as follows:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1733206666867&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;from nltk.stem import WordNetLemmatizer
lemm = WordNetLemmatizer()
print(&quot;The lemmatized form of leaves is: {}&quot;.format(lemm.lemmatize(&quot;leaves&quot;)))

# result: The lemmatized form of leaves is: leaf&lt;/code&gt;&lt;/pre&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;7. Vectorizing&amp;nbsp;Raw&amp;nbsp;Text&lt;/b&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;In the vast collection of NLP literature, there are many different purposes for analyzing raw text, where in some cases you would like to compare the similarity of one body of text to another (Clustering techniques/Distance measurements), text classification (the purpose of this competition) as well as uncovering the topics that comprise a body of text (the aim of this notebook). &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;With the purpose of uncovering topics at the back of our minds we must now think of how to feed the raw text into a machine learning model. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;Having already discussed tokenization, stopword removals and stemming (or maybe lemmatizing) we have now arrived at a reasonably cleaner text dataset then we started out with. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;However at this juncture, our raw text though human readable is still unfortunately not yet machine readable. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;A machine can read in bits and numbers and therefore we will first need to convert our text into numbers for which we utilise a very common approach known as the Bag-of-Words&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p style=&quot;color: #3c4043; text-align: left;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;The Bag of Words approach&lt;/b&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;This approach uses the counts of words as a starting block and records the occurrence of each word (from the entire text) in a vector specific to that particular word.&lt;/li&gt;
&lt;li&gt;For example given these two sentences &quot;I love to eat Burgers&quot;, &quot;I love to eat Fries&quot;, we first tokenize to obtain our vocabulary of 6 words from which we can get the word counts for - [I, love, to, eat, Burgers, Fries].&lt;/li&gt;
&lt;li&gt;Vectorizing the text via the Bag of Words approach, we get six distinct vectors one for each word.&lt;/li&gt;
&lt;li&gt;So you ask since we now have rows consisting of numbers (instead of text) what forms the columns (or features)?&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Well each word now becomes an individual feature/column in this new transformed dataset.&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;To illustrate this point, I shall utilize the Scikit-learn library to implement a vectorizer that generates a vector of word counts (term frequencies) - via the CountVectorizer method as follows.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1733207190490&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;# Defining our sentence
sentence = [&quot;I love to eat Burgers&quot;, 
            &quot;I love to eat Fries&quot;]
vectorizer = CountVectorizer(min_df=0)
sentence_transform = vectorizer.fit_transform(sentence)&lt;/code&gt;&lt;/pre&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;Fitting&amp;nbsp;the&amp;nbsp;vectorizer&amp;nbsp;to&amp;nbsp;the&amp;nbsp;dataset&lt;/b&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Here we initialize and create a simple term frequency object via the CountVectorizer function simply called &quot;vectorizer&quot;.&lt;/li&gt;
&lt;li&gt;The parameters that I have provided explicitly (the rest are left as default) are the bare minimum.&lt;/li&gt;
&lt;li&gt;Here &quot;min_df&quot; in the parameter refers to the minimum document frequency and the vectorizer will simply drop all words that occur less than that value set (either integer or in fraction form).&lt;/li&gt;
&lt;li&gt;Finally we apply the fit_transform method is actually comprised of two steps.&lt;/li&gt;
&lt;li&gt;The first step is the fit method where the vectorizer is mapped to the dataset that you provide.&lt;/li&gt;
&lt;li&gt;Once this is done, the actual vectorizing operation is performed via the transform method where the raw text is turned into its vector form as shown below:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1733208027834&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;print(&quot;The features are:\n {}&quot;.format(vectorizer.get_feature_names()))
print(&quot;\nThe vectorized array looks like:\n {}&quot;.format(sentence_transform.toarray()))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2024-12-03 오후 3.40.26.png&quot; data-origin-width=&quot;784&quot; data-origin-height=&quot;334&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/Q4fYe/btsK58SrDZM/rypEC13NAg4PK9qYxRjYxK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/Q4fYe/btsK58SrDZM/rypEC13NAg4PK9qYxRjYxK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/Q4fYe/btsK58SrDZM/rypEC13NAg4PK9qYxRjYxK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FQ4fYe%2FbtsK58SrDZM%2FrypEC13NAg4PK9qYxRjYxK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;495&quot; height=&quot;211&quot; data-filename=&quot;스크린샷 2024-12-03 오후 3.40.26.png&quot; data-origin-width=&quot;784&quot; data-origin-height=&quot;334&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2024-12-03 오후 3.41.28.png&quot; data-origin-width=&quot;1252&quot; data-origin-height=&quot;334&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/qlDL3/btsK5ZBdsKF/Ud1EX2JUVDfkKarGFaB3kk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/qlDL3/btsK5ZBdsKF/Ud1EX2JUVDfkKarGFaB3kk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/qlDL3/btsK5ZBdsKF/Ud1EX2JUVDfkKarGFaB3kk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FqlDL3%2FbtsK5ZBdsKF%2FUd1EX2JUVDfkKarGFaB3kk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1252&quot; height=&quot;334&quot; data-filename=&quot;스크린샷 2024-12-03 오후 3.41.28.png&quot; data-origin-width=&quot;1252&quot; data-origin-height=&quot;334&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;Sparse matrix vector ouptuts&lt;/b&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;From the output of the vectorized text, we can see that the features consist of the words in the corpus of text that we fed into the vectorizer (here the corpus being the two sentences we defined earlier). &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;Simply call the get_feature_names attribute from the vectorizer to inspect it.&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;With regards to the transformed text, one would be tempted to inspect the values by simplying calling it.&lt;/li&gt;
&lt;li&gt;However when you try to call it you really just get a message which states &quot;sparse matrix of type class 'numpy.int64' with 8 stored elements in Compressed Sparse Row format&quot;.&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;&lt;b&gt;Therefore this means that the vectorizer returns the transformed raw text as a matrix where most of its values are zero or almost negligible, hence the term sparse. &lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;Thinking about this, it does make sense that our returned matrices contain quite a high degree of sparsity due to the fact that most words in a language appear relatively infrequently in any given text.&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;8. Topic modeling&lt;/b&gt;&lt;/p&gt;
&lt;ol style=&quot;list-style-type: decimal; color: #3c4043; text-align: start;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;&lt;b&gt;Latent Dirichlet Allocation&lt;/b&gt;&lt;span&gt;&amp;nbsp;&lt;/span&gt;- Probabilistic, generative model which uncovers the topics latent to a dataset by assigning weights to words in a corpus, where each topic will assign different probability weights to each word.&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Non-negative Matrix Factorization&lt;/b&gt;&lt;span&gt;&amp;nbsp;&lt;/span&gt;- Approximation method that takes an input matrix and approximates the factorization of this matrix into two other matrices, with the caveat that the values in the matrix be non-negative.&lt;/li&gt;
&lt;/ol&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;&lt;span style=&quot;color: #3c4043; text-align: left;&quot;&gt;When you vectorize the raw text with CountVectorizer, the dual stages of tokenizing and stopwords filtering are automatically included as a high-level component.&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;Here unlike the NLTK tokenizer that you were introduced to in the Section 2a earlier, Sklearn's tokenizer discards all single character terms like ('a', 'w' etc) and also lower cases all terms by default. &lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;Filtering out stopwords in Sklearn is as convenient as passing the value 'english' into the argument &quot;stop_words&quot; where a built-in English stopword list is automatically used.&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199; color: #3c4043; text-align: left;&quot;&gt;&lt;span style=&quot;color: #3c4043; text-align: left;&quot;&gt;&lt;span style=&quot;color: #3c4043; text-align: left;&quot;&gt;&lt;b&gt;Unfortunately, there is no built-in lemmatizer in the vectorizer so we are left with a couple of options.&lt;/b&gt; &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #9feec3; color: #3c4043; text-align: left;&quot;&gt;&lt;span style=&quot;color: #3c4043; text-align: left;&quot;&gt;&lt;span style=&quot;color: #3c4043; text-align: left;&quot;&gt;&lt;b&gt;Either implementing it separately everytime before feeding the data for vectorizing or somehow extend the sklearn implementation to include this functionality.&lt;/b&gt; &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;Luckily for us, we have the latter option where we can extend the CountVectorizer class by overwriting the &quot;build_analyzer&quot; method as follows:&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1733208947056&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;lemm = WordNetLemmatizer()
class LemmaCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        analyzer = super(LemmaCountVectorizer, self).build_analyzer()
        return lambda doc: (lemm.lemmatize(w) for w in analyzer(doc))&lt;/code&gt;&lt;/pre&gt;
&lt;pre id=&quot;code_1733208978744&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;# Storing the entire training text in a list
text = list(train.text.values)
# Calling our overwritten Count vectorizer
tf_vectorizer = LemmaCountVectorizer(max_df=0.95, 
                                     min_df=2,
                                     stop_words='english',
                                     decode_error='ignore')
tf = tf_vectorizer.fit_transform(text)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2024-12-03 오후 3.56.58.png&quot; data-origin-width=&quot;1370&quot; data-origin-height=&quot;970&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/eSkQ1f/btsK6v7sUfQ/KKMQpfqddZGQtiAfQDtkp0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/eSkQ1f/btsK6v7sUfQ/KKMQpfqddZGQtiAfQDtkp0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/eSkQ1f/btsK6v7sUfQ/KKMQpfqddZGQtiAfQDtkp0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FeSkQ1f%2FbtsK6v7sUfQ%2FKKMQpfqddZGQtiAfQDtkp0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;628&quot; height=&quot;445&quot; data-filename=&quot;스크린샷 2024-12-03 오후 3.56.58.png&quot; data-origin-width=&quot;1370&quot; data-origin-height=&quot;970&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;Latent Dirichlet Allocation&lt;/b&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;There are a couple of different implements of this LDA algorithm but in this notebook, I will be using Sklearn's implementation. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;Another very well-known LDA implementation is Radim Rehurek's&lt;span&gt;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;a style=&quot;background-color: #ffffff; color: #008abc; text-align: left;&quot; href=&quot;https://radimrehurek.com/gensim/&quot;&gt;gensim&lt;/a&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;, so check it out as well.&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;&lt;span style=&quot;color: #3c4043; text-align: left;&quot;&gt;&lt;span style=&quot;color: #3c4043; text-align: left;&quot;&gt;In LDA, the modelling process revolves around three things: the text corpus, its collection of documents, D and the words W in the documents. &lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;&lt;b&gt;&lt;span style=&quot;color: #3c4043; text-align: left;&quot;&gt;&lt;span style=&quot;color: #3c4043; text-align: left;&quot;&gt;Therefore the algorithm attempts to uncover K topics from this corpus via the following way (illustrated by the diagram).&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2024-12-03 오후 4.00.38.png&quot; data-origin-width=&quot;1500&quot; data-origin-height=&quot;418&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/LpCHL/btsK6w6nEnv/Q3VO7AoXkJfrtpdi3xSR31/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/LpCHL/btsK6w6nEnv/Q3VO7AoXkJfrtpdi3xSR31/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/LpCHL/btsK6w6nEnv/Q3VO7AoXkJfrtpdi3xSR31/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FLpCHL%2FbtsK6w6nEnv%2FQ3VO7AoXkJfrtpdi3xSR31%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1500&quot; height=&quot;418&quot; data-filename=&quot;스크린샷 2024-12-03 오후 4.00.38.png&quot; data-origin-width=&quot;1500&quot; data-origin-height=&quot;418&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;Model each topic, $\kappa$ via a Dirichlet prior distribution given by $\beta_{k}$:&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;Model each document d by another Dirichlet distribution parameterized by $\alpha$:&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;&lt;b&gt;&lt;span style=&quot;color: #3c4043; text-align: left;&quot;&gt;&lt;span style=&quot;color: #3c4043; text-align: left;&quot;&gt;Subsequently for document d, we generate a topic via a multinomial distribution which we then backtrack and use to generate the correspondings words related to that topic via another multinomial distribution:&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;The LDA algorithm first models documents via a mixture model of topics. &lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;From these topics, words are then assigned weights based on the probability distribution of these topics. &lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;It is this probabilistic assignment over words that allow a user of LDA to say how likely a particular word falls into a topic. &lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;Subsequently from the collection of words assigned to a particular topic, are we thus able to gain an insight as to what that topic may actually represent from a lexical point of view.&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;From a standard LDA model, there are really a few key parameters that we have to keep in mind and consider programmatically tuning before we invoke the model:&lt;/span&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;n_components: The number of topics that you specify to the model&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;$\alpha$ parameter: This is the dirichlet parameter that can be linked to the document topic prior&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;$\beta$ parameter: This is the dirichlet parameter linked to the topic word prior&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;To invoke the algorithm, we simply create an LDA instance through the Sklearn's LatentDirichletAllocation function.&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;The&amp;nbsp;various&amp;nbsp;parameters&amp;nbsp;would&amp;nbsp;ideally&amp;nbsp;have&amp;nbsp;been&amp;nbsp;obtained&amp;nbsp;through&amp;nbsp;some&amp;nbsp;sort&amp;nbsp;of&amp;nbsp;validation&amp;nbsp;scheme.&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;In this instance, the optimal value of n_components (or topic number) was found by conducting a KMeans + Latent Semantic Analysis(LSA) Scheme (as shown in this paper here) whereby the number of Kmeans clusters and number of LSA dimensions were iterated through and the best silhouette mean score.&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1733210145441&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;lda = LatentDirichletAllocation(n_components=11, max_iter=5,
                                learning_method = 'online',
                                learning_offset = 50.,
                                random_state = 0)
lda.fit(tf)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2024-12-03 오후 4.15.59.png&quot; data-origin-width=&quot;712&quot; data-origin-height=&quot;192&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/4HBR4/btsK4ZoIyEs/PK2zkYYokU5tU5wkd5CRi1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/4HBR4/btsK4ZoIyEs/PK2zkYYokU5tU5wkd5CRi1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/4HBR4/btsK4ZoIyEs/PK2zkYYokU5tU5wkd5CRi1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2F4HBR4%2FbtsK4ZoIyEs%2FPK2zkYYokU5tU5wkd5CRi1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;712&quot; height=&quot;192&quot; data-filename=&quot;스크린샷 2024-12-03 오후 4.15.59.png&quot; data-origin-width=&quot;712&quot; data-origin-height=&quot;192&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;h4 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;a href=&quot;https://www.kaggle.com/code/abhishek/approaching-almost-any-nlp-problem-on-kaggle&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;Second Kernel:&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt; Approaching&amp;nbsp;(Almost)&amp;nbsp;Any&amp;nbsp;NLP&amp;nbsp;Problem&amp;nbsp;on&amp;nbsp;Kaggle&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/a&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Trying out various modeling techniques:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;tfidf&lt;/li&gt;
&lt;li&gt;count features&lt;/li&gt;
&lt;li&gt;logistic regression&lt;/li&gt;
&lt;li&gt;naive bayes&lt;/li&gt;
&lt;li&gt;svm&lt;/li&gt;
&lt;li&gt;xgboost&lt;/li&gt;
&lt;li&gt;grid search&lt;/li&gt;
&lt;li&gt;word vectors&lt;/li&gt;
&lt;li&gt;LSTM&lt;/li&gt;
&lt;li&gt;GRU&lt;/li&gt;
&lt;li&gt;Ensembling&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;Insight / Summary:&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;1. Metric&lt;/b&gt;&lt;/p&gt;
&lt;pre class=&quot;python&quot; style=&quot;background-color: #f1f3f4; color: #3c4043; text-align: start;&quot;&gt;&lt;code&gt;def multiclass_logloss(actual, predicted, eps=1e-15):
    &quot;&quot;&quot;Multi class version of Logarithmic Loss metric.
    :param actual: Array containing the actual target classes
    :param predicted: Matrix with class predictions, one probability per class
    &quot;&quot;&quot;
    # Convert 'actual' to a binary array if it's not already:
    if len(actual.shape) == 1:
        actual2 = np.zeros((actual.shape[0], predicted.shape[1]))
        for i, val in enumerate(actual):
            actual2[i, val] = 1
        actual = actual2

    clip = np.clip(predicted, eps, 1 - eps)
    rows = actual.shape[0]
    vsota = np.sum(actual * np.log(clip))
    return -1.0 / rows * vsota&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;For this particular problem, Kaggle has specified multi-class log-loss as evaluation metric.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;2. TF-IDF + Logistic Regression&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1733232226731&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;# Always start with these features. They work (almost) everytime!
tfv = TfidfVectorizer(min_df=3,  max_features=None, 
            strip_accents='unicode', analyzer='word',token_pattern=r'\w{1,}',
            ngram_range=(1, 3), use_idf=1,smooth_idf=1,sublinear_tf=1,
            stop_words = 'english')

# Fitting TF-IDF to both training and test sets (semi-supervised learning)
tfv.fit(list(xtrain) + list(xvalid))
xtrain_tfv =  tfv.transform(xtrain) 
xvalid_tfv = tfv.transform(xvalid)&lt;/code&gt;&lt;/pre&gt;
&lt;pre id=&quot;code_1733232244369&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;# Fitting a simple Logistic Regression on TFIDF
clf = LogisticRegression(C=1.0)
clf.fit(xtrain_tfv, ytrain)
predictions = clf.predict_proba(xvalid_tfv)

print (&quot;logloss: %0.3f &quot; % multiclass_logloss(yvalid, predictions))

# result: logloss: 0.626&lt;/code&gt;&lt;/pre&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;3. Word Count as feature + Logistic Regression&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;pre class=&quot;awk&quot; style=&quot;background-color: #f1f3f4; color: #3c4043; text-align: start;&quot;&gt;&lt;code&gt;ctv = CountVectorizer(analyzer='word',token_pattern=r'\w{1,}',
            ngram_range=(1, 3), stop_words = 'english')

# Fitting Count Vectorizer to both training and test sets (semi-supervised learning)
ctv.fit(list(xtrain) + list(xvalid))
xtrain_ctv =  ctv.transform(xtrain) 
xvalid_ctv = ctv.transform(xvalid)&lt;/code&gt;&lt;/pre&gt;
&lt;pre id=&quot;code_1733232360344&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;# Fitting a simple Logistic Regression on Counts
clf = LogisticRegression(C=1.0)
clf.fit(xtrain_ctv, ytrain)
predictions = clf.predict_proba(xvalid_ctv)

print (&quot;logloss: %0.3f &quot; % multiclass_logloss(yvalid, predictions))

# result: logloss: 0.528&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;Instead of using TF-IDF, we can also use word counts as features. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;This can be done easily using CountVectorizer from scikit-learn.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;4. Naive Bayes + TF-IDF&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;pre class=&quot;python&quot; style=&quot;background-color: #f1f3f4; color: #3c4043; text-align: start;&quot; data-ke-language=&quot;python&quot;&gt;&lt;code&gt;# Fitting a simple Naive Bayes on TFIDF
clf = MultinomialNB()
clf.fit(xtrain_tfv, ytrain)
predictions = clf.predict_proba(xvalid_tfv)

print (&quot;logloss: %0.3f &quot; % multiclass_logloss(yvalid, predictions))

# result: logloss: 0.578&lt;/code&gt;&lt;/pre&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;5. Naive Bayes + Word Count&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;pre class=&quot;python&quot; style=&quot;background-color: #f1f3f4; color: #3c4043; text-align: start;&quot; data-ke-language=&quot;python&quot;&gt;&lt;code&gt;# Fitting a simple Naive Bayes on Counts
clf = MultinomialNB()
clf.fit(xtrain_ctv, ytrain)
predictions = clf.predict_proba(xvalid_ctv)

print (&quot;logloss: %0.3f &quot; % multiclass_logloss(yvalid, predictions))

# result: logloss: 0.485&lt;/code&gt;&lt;/pre&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;6. SVM + TF-IDF&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;Since SVMs take a lot of time, we will reduce the number of features from the TF-IDF using Singular Value Decomposition before applying SVM.&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;Also, note that before applying SVMs, we&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;must&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;standardize the data.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1733232796750&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;# Apply SVD, I chose 120 components. 120-200 components are good enough for SVM model.
svd = decomposition.TruncatedSVD(n_components=120)
svd.fit(xtrain_tfv)
xtrain_svd = svd.transform(xtrain_tfv)
xvalid_svd = svd.transform(xvalid_tfv)

# Scale the data obtained from SVD. Renaming variable to reuse without scaling.
scl = preprocessing.StandardScaler()
scl.fit(xtrain_svd)
xtrain_svd_scl = scl.transform(xtrain_svd)
xvalid_svd_scl = scl.transform(xvalid_svd)&lt;/code&gt;&lt;/pre&gt;
&lt;pre id=&quot;code_1733232832331&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;# Fitting a simple SVM
clf = SVC(C=1.0, probability=True) # since we need probabilities
clf.fit(xtrain_svd_scl, ytrain)
predictions = clf.predict_proba(xvalid_svd_scl)

print (&quot;logloss: %0.3f &quot; % multiclass_logloss(yvalid, predictions))

# result: logloss: 0.741&lt;/code&gt;&lt;/pre&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;7. XGBoost + TF-IDF&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1733234382134&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;# Fitting a simple xgboost on tf-idf
clf = xgb.XGBClassifier(max_depth=7, n_estimators=200, colsample_bytree=0.8, 
                        subsample=0.8, nthread=10, learning_rate=0.1)
clf.fit(xtrain_tfv.tocsc(), ytrain)
predictions = clf.predict_proba(xvalid_tfv.tocsc())

print (&quot;logloss: %0.3f &quot; % multiclass_logloss(yvalid, predictions))

# result: logloss: 0.782&lt;/code&gt;&lt;/pre&gt;
&lt;pre id=&quot;code_1733234401423&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;# Fitting a simple xgboost on tf-idf
clf = xgb.XGBClassifier(max_depth=7, n_estimators=200, colsample_bytree=0.8, 
                        subsample=0.8, nthread=10, learning_rate=0.1)
clf.fit(xtrain_ctv.tocsc(), ytrain)
predictions = clf.predict_proba(xvalid_ctv.tocsc())

print (&quot;logloss: %0.3f &quot; % multiclass_logloss(yvalid, predictions))

# result: logloss: 0.772&lt;/code&gt;&lt;/pre&gt;
&lt;pre id=&quot;code_1733234420523&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;# Fitting a simple xgboost on tf-idf svd features
clf = xgb.XGBClassifier(max_depth=7, n_estimators=200, colsample_bytree=0.8, 
                        subsample=0.8, nthread=10, learning_rate=0.1)
clf.fit(xtrain_svd, ytrain)
predictions = clf.predict_proba(xvalid_svd)

print (&quot;logloss: %0.3f &quot; % multiclass_logloss(yvalid, predictions))

# result: logloss: 0.768&lt;/code&gt;&lt;/pre&gt;
&lt;pre id=&quot;code_1733234435108&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;# Fitting a simple xgboost on tf-idf svd features
clf = xgb.XGBClassifier(nthread=10)
clf.fit(xtrain_svd, ytrain)
predictions = clf.predict_proba(xvalid_svd)

print (&quot;logloss: %0.3f &quot; % multiclass_logloss(yvalid, predictions))

# result: logloss: 0.816&lt;/code&gt;&lt;/pre&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;8. Word Embedding - Using GloVe&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;pre class=&quot;livecodeserver&quot; style=&quot;background-color: #f1f3f4; color: #3c4043; text-align: start;&quot;&gt;&lt;code&gt;# this function creates a normalized vector for the whole sentence
def sent2vec(s):
    words = str(s).lower().decode('utf-8')
    words = word_tokenize(words)
    words = [w for w in words if not w in stop_words]
    words = [w for w in words if w.isalpha()]
    M = []
    for w in words:
        try:
            M.append(embeddings_index[w])
        except:
            continue
    M = np.array(M)
    v = M.sum(axis=0)
    if type(v) != np.ndarray:
        return np.zeros(300)
    return v / np.sqrt((v ** 2).sum())&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;python&quot; style=&quot;background-color: #f1f3f4; color: #3c4043; text-align: start;&quot; data-ke-language=&quot;python&quot;&gt;&lt;code&gt;# create sentence vectors using the above function for training and validation set
xtrain_glove = [sent2vec(x) for x in tqdm(xtrain)]
xvalid_glove = [sent2vec(x) for x in tqdm(xvalid)]

xtrain_glove = np.array(xtrain_glove)
xvalid_glove = np.array(xvalid_glove)&lt;/code&gt;&lt;/pre&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;9. Using Neural Network&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1733235229085&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;# scale the data before any neural net:
scl = preprocessing.StandardScaler()
xtrain_glove_scl = scl.fit_transform(xtrain_glove)
xvalid_glove_scl = scl.transform(xvalid_glove)

# we need to binarize the labels for the neural net
ytrain_enc = np_utils.to_categorical(ytrain)
yvalid_enc = np_utils.to_categorical(yvalid)

# create a simple 3 layer sequential neural net
model = Sequential()

model.add(Dense(300, input_dim=300, activation='relu'))
model.add(Dropout(0.2))
model.add(BatchNormalization())

model.add(Dense(300, activation='relu'))
model.add(Dropout(0.3))
model.add(BatchNormalization())

model.add(Dense(3))
model.add(Activation('softmax'))

# compile the model
model.compile(loss='categorical_crossentropy', optimizer='adam')

model.fit(xtrain_glove_scl, y=ytrain_enc, batch_size=64, 
          epochs=5, verbose=1, 
          validation_data=(xvalid_glove_scl, yvalid_enc))&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;10. LSTM&lt;/b&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;W&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;ith LSTMs we need to tokenize the text data&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1733235658598&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;# using keras tokenizer here
token = text.Tokenizer(num_words=None)
max_len = 70

token.fit_on_texts(list(xtrain) + list(xvalid))
xtrain_seq = token.texts_to_sequences(xtrain)
xvalid_seq = token.texts_to_sequences(xvalid)

# zero pad the sequences
xtrain_pad = sequence.pad_sequences(xtrain_seq, maxlen=max_len)
xvalid_pad = sequence.pad_sequences(xvalid_seq, maxlen=max_len)

word_index = token.word_index&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;livecodeserver&quot; style=&quot;background-color: #f1f3f4; color: #3c4043; text-align: start;&quot;&gt;&lt;code&gt;# create an embedding matrix for the words we have in the dataset
embedding_matrix = np.zeros((len(word_index) + 1, 300))
for word, i in tqdm(word_index.items()):
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;routeros&quot; style=&quot;background-color: #f1f3f4; color: #3c4043; text-align: start;&quot;&gt;&lt;code&gt;# A simple LSTM with glove embeddings and two dense layers
model = Sequential()
model.add(Embedding(len(word_index) + 1,
                     300,
                     weights=[embedding_matrix],
                     input_length=max_len,
                     trainable=False))
model.add(SpatialDropout1D(0.3))
model.add(LSTM(100, dropout=0.3, recurrent_dropout=0.3))

model.add(Dense(1024, activation='relu'))
model.add(Dropout(0.8))

model.add(Dense(1024, activation='relu'))
model.add(Dropout(0.8))

model.add(Dense(3))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;routeros&quot; style=&quot;background-color: #f1f3f4; color: #3c4043; text-align: start;&quot;&gt;&lt;code&gt;model.fit(xtrain_pad, y=ytrain_enc, batch_size=512, epochs=100, verbose=1, validation_data=(xvalid_pad, yvalid_enc))&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;11. Version with early stopping&lt;/b&gt;&lt;/p&gt;
&lt;pre class=&quot;routeros&quot; style=&quot;background-color: #f1f3f4; color: #3c4043; text-align: start;&quot;&gt;&lt;code&gt;# A simple LSTM with glove embeddings and two dense layers
model = Sequential()
model.add(Embedding(len(word_index) + 1,
                     300,
                     weights=[embedding_matrix],
                     input_length=max_len,
                     trainable=False))
model.add(SpatialDropout1D(0.3))
model.add(LSTM(300, dropout=0.3, recurrent_dropout=0.3))

model.add(Dense(1024, activation='relu'))
model.add(Dropout(0.8))

model.add(Dense(1024, activation='relu'))
model.add(Dropout(0.8))

model.add(Dense(3))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')

# Fit the model with early stopping callback
earlystop = EarlyStopping(monitor='val_loss', min_delta=0, patience=3, verbose=0, mode='auto')
model.fit(xtrain_pad, y=ytrain_enc, batch_size=512, epochs=100, 
          verbose=1, validation_data=(xvalid_pad, yvalid_enc), callbacks=[earlystop])&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;12. &lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;Bi-directional LSTM&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;div&gt;
&lt;div&gt;
&lt;div style=&quot;background-color: #f1f3f4;&quot;&gt;
&lt;div&gt;
&lt;pre class=&quot;routeros&quot; style=&quot;color: #3c4043;&quot;&gt;&lt;code&gt;# A simple bidirectional LSTM with glove embeddings and two dense layers
model = Sequential()
model.add(Embedding(len(word_index) + 1,
                     300,
                     weights=[embedding_matrix],
                     input_length=max_len,
                     trainable=False))
model.add(SpatialDropout1D(0.3))
model.add(Bidirectional(LSTM(300, dropout=0.3, recurrent_dropout=0.3)))

model.add(Dense(1024, activation='relu'))
model.add(Dropout(0.8))

model.add(Dense(1024, activation='relu'))
model.add(Dropout(0.8))

model.add(Dense(3))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')

# Fit the model with early stopping callback
earlystop = EarlyStopping(monitor='val_loss', min_delta=0, patience=3, verbose=0, mode='auto')
model.fit(xtrain_pad, y=ytrain_enc, batch_size=512, epochs=100, 
          verbose=1, validation_data=(xvalid_pad, yvalid_enc), callbacks=[earlystop])
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;div style=&quot;background-color: #ffffff; color: #000000; text-align: left;&quot;&gt;&lt;b&gt;13. GRU&lt;/b&gt;&lt;/div&gt;
&lt;div style=&quot;background-color: #ffffff; color: #000000; text-align: left;&quot;&gt;
&lt;pre class=&quot;routeros&quot; style=&quot;background-color: #f1f3f4; color: #3c4043; text-align: start;&quot;&gt;&lt;code&gt;# GRU with glove embeddings and two dense layers
model = Sequential()
model.add(Embedding(len(word_index) + 1,
                     300,
                     weights=[embedding_matrix],
                     input_length=max_len,
                     trainable=False))
model.add(SpatialDropout1D(0.3))
model.add(GRU(300, dropout=0.3, recurrent_dropout=0.3, return_sequences=True))
model.add(GRU(300, dropout=0.3, recurrent_dropout=0.3))

model.add(Dense(1024, activation='relu'))
model.add(Dropout(0.8))

model.add(Dense(1024, activation='relu'))
model.add(Dropout(0.8))

model.add(Dense(3))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')

# Fit the model with early stopping callback
earlystop = EarlyStopping(monitor='val_loss', min_delta=0, patience=3, verbose=0, mode='auto')
model.fit(xtrain_pad, y=ytrain_enc, batch_size=512, epochs=100, 
          verbose=1, validation_data=(xvalid_pad, yvalid_enc), callbacks=[earlystop])&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;14. Ensemble&lt;/b&gt;&lt;/p&gt;
&lt;pre class=&quot;ruby&quot; style=&quot;background-color: #f1f3f4; color: #3c4043; text-align: start;&quot;&gt;&lt;code&gt;# this is the main ensembling class. how to use it is in the next cell!
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold, KFold
import pandas as pd
import os
import sys
import logging

logging.basicConfig(
    level=logging.DEBUG,
    format=&quot;[%(asctime)s] %(levelname)s %(message)s&quot;,
    datefmt=&quot;%H:%M:%S&quot;, stream=sys.stdout)
logger = logging.getLogger(__name__)


class Ensembler(object):
    def __init__(self, model_dict, num_folds=3, task_type='classification', optimize=roc_auc_score,
                 lower_is_better=False, save_path=None):
        &quot;&quot;&quot;
        Ensembler init function
        :param model_dict: model dictionary, see README for its format
        :param num_folds: the number of folds for ensembling
        :param task_type: classification or regression
        :param optimize: the function to optimize for, e.g. AUC, logloss, etc. Must have two arguments y_test and y_pred
        :param lower_is_better: is lower value of optimization function better or higher
        :param save_path: path to which model pickles will be dumped to along with generated predictions, or None
        &quot;&quot;&quot;

        self.model_dict = model_dict
        self.levels = len(self.model_dict)
        self.num_folds = num_folds
        self.task_type = task_type
        self.optimize = optimize
        self.lower_is_better = lower_is_better
        self.save_path = save_path

        self.training_data = None
        self.test_data = None
        self.y = None
        self.lbl_enc = None
        self.y_enc = None
        self.train_prediction_dict = None
        self.test_prediction_dict = None
        self.num_classes = None

    def fit(self, training_data, y, lentrain):
        &quot;&quot;&quot;
        :param training_data: training data in tabular format
        :param y: binary, multi-class or regression
        :return: chain of models to be used in prediction
        &quot;&quot;&quot;

        self.training_data = training_data
        self.y = y

        if self.task_type == 'classification':
            self.num_classes = len(np.unique(self.y))
            logger.info(&quot;Found %d classes&quot;, self.num_classes)
            self.lbl_enc = LabelEncoder()
            self.y_enc = self.lbl_enc.fit_transform(self.y)
            kf = StratifiedKFold(n_splits=self.num_folds)
            train_prediction_shape = (lentrain, self.num_classes)
        else:
            self.num_classes = -1
            self.y_enc = self.y
            kf = KFold(n_splits=self.num_folds)
            train_prediction_shape = (lentrain, 1)

        self.train_prediction_dict = {}
        for level in range(self.levels):
            self.train_prediction_dict[level] = np.zeros((train_prediction_shape[0],
                                                          train_prediction_shape[1] * len(self.model_dict[level])))

        for level in range(self.levels):

            if level == 0:
                temp_train = self.training_data
            else:
                temp_train = self.train_prediction_dict[level - 1]

            for model_num, model in enumerate(self.model_dict[level]):
                validation_scores = []
                foldnum = 1
                for train_index, valid_index in kf.split(self.train_prediction_dict[0], self.y_enc):
                    logger.info(&quot;Training Level %d Fold # %d. Model # %d&quot;, level, foldnum, model_num)

                    if level != 0:
                        l_training_data = temp_train[train_index]
                        l_validation_data = temp_train[valid_index]
                        model.fit(l_training_data, self.y_enc[train_index])
                    else:
                        l0_training_data = temp_train[0][model_num]
                        if type(l0_training_data) == list:
                            l_training_data = [x[train_index] for x in l0_training_data]
                            l_validation_data = [x[valid_index] for x in l0_training_data]
                        else:
                            l_training_data = l0_training_data[train_index]
                            l_validation_data = l0_training_data[valid_index]
                        model.fit(l_training_data, self.y_enc[train_index])

                    logger.info(&quot;Predicting Level %d. Fold # %d. Model # %d&quot;, level, foldnum, model_num)

                    if self.task_type == 'classification':
                        temp_train_predictions = model.predict_proba(l_validation_data)
                        self.train_prediction_dict[level][valid_index,
                        (model_num * self.num_classes):(model_num * self.num_classes) +
                                                       self.num_classes] = temp_train_predictions

                    else:
                        temp_train_predictions = model.predict(l_validation_data)
                        self.train_prediction_dict[level][valid_index, model_num] = temp_train_predictions
                    validation_score = self.optimize(self.y_enc[valid_index], temp_train_predictions)
                    validation_scores.append(validation_score)
                    logger.info(&quot;Level %d. Fold # %d. Model # %d. Validation Score = %f&quot;, level, foldnum, model_num,
                                validation_score)
                    foldnum += 1
                avg_score = np.mean(validation_scores)
                std_score = np.std(validation_scores)
                logger.info(&quot;Level %d. Model # %d. Mean Score = %f. Std Dev = %f&quot;, level, model_num,
                            avg_score, std_score)

            logger.info(&quot;Saving predictions for level # %d&quot;, level)
            train_predictions_df = pd.DataFrame(self.train_prediction_dict[level])
            train_predictions_df.to_csv(os.path.join(self.save_path, &quot;train_predictions_level_&quot; + str(level) + &quot;.csv&quot;),
                                        index=False, header=None)

        return self.train_prediction_dict

    def predict(self, test_data, lentest):
        self.test_data = test_data
        if self.task_type == 'classification':
            test_prediction_shape = (lentest, self.num_classes)
        else:
            test_prediction_shape = (lentest, 1)

        self.test_prediction_dict = {}
        for level in range(self.levels):
            self.test_prediction_dict[level] = np.zeros((test_prediction_shape[0],
                                                         test_prediction_shape[1] * len(self.model_dict[level])))
        self.test_data = test_data
        for level in range(self.levels):
            if level == 0:
                temp_train = self.training_data
                temp_test = self.test_data
            else:
                temp_train = self.train_prediction_dict[level - 1]
                temp_test = self.test_prediction_dict[level - 1]

            for model_num, model in enumerate(self.model_dict[level]):

                logger.info(&quot;Training Fulldata Level %d. Model # %d&quot;, level, model_num)
                if level == 0:
                    model.fit(temp_train[0][model_num], self.y_enc)
                else:
                    model.fit(temp_train, self.y_enc)

                logger.info(&quot;Predicting Test Level %d. Model # %d&quot;, level, model_num)

                if self.task_type == 'classification':
                    if level == 0:
                        temp_test_predictions = model.predict_proba(temp_test[0][model_num])
                    else:
                        temp_test_predictions = model.predict_proba(temp_test)
                    self.test_prediction_dict[level][:, (model_num * self.num_classes): (model_num * self.num_classes) +
                                                                                        self.num_classes] = temp_test_predictions

                else:
                    if level == 0:
                        temp_test_predictions = model.predict(temp_test[0][model_num])
                    else:
                        temp_test_predictions = model.predict(temp_test)
                    self.test_prediction_dict[level][:, model_num] = temp_test_predictions

            test_predictions_df = pd.DataFrame(self.test_prediction_dict[level])
            test_predictions_df.to_csv(os.path.join(self.save_path, &quot;test_predictions_level_&quot; + str(level) + &quot;.csv&quot;),
                                       index=False, header=None)

        return self.test_prediction_dict&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;routeros&quot; style=&quot;background-color: #f1f3f4; color: #3c4043; text-align: start;&quot;&gt;&lt;code&gt;# specify the data to be used for every level of ensembling:
train_data_dict = {0: [xtrain_tfv, xtrain_ctv, xtrain_tfv, xtrain_ctv], 1: [xtrain_glove]}
test_data_dict = {0: [xvalid_tfv, xvalid_ctv, xvalid_tfv, xvalid_ctv], 1: [xvalid_glove]}

model_dict = {0: [LogisticRegression(), LogisticRegression(), MultinomialNB(alpha=0.1), MultinomialNB()],

              1: [xgb.XGBClassifier(silent=True, n_estimators=120, max_depth=7)]}

ens = Ensembler(model_dict=model_dict, num_folds=3, task_type='classification',
                optimize=multiclass_logloss, lower_is_better=True, save_path='')

ens.fit(train_data_dict, ytrain, lentrain=xtrain_glove.shape[0])
preds = ens.predict(test_data_dict, lentest=xvalid_glove.shape[0])&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;h4 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;a href=&quot;https://www.kaggle.com/code/sudalairajkumar/simple-feature-engg-notebook-spooky-author&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;Third Kernel: Simple&amp;nbsp;Feature&amp;nbsp;Engg&amp;nbsp;Notebook&amp;nbsp;-&amp;nbsp;Spooky&amp;nbsp;Author&lt;/b&gt;&lt;/a&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;Create different features that will help us in identifying the spooky authors.&lt;/span&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;Meta features - features that are extracted from the text like number of words, number of stop words, number of punctuations etc&lt;/li&gt;
&lt;li&gt;Text based features - features directly based on the text / words like frequency, svd, word2vec etc.&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;Insight / Summary:&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;1.&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt; &lt;b&gt;Meta Features&lt;/b&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;ol style=&quot;list-style-type: decimal; color: #3c4043; text-align: start;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;Number of words in the text&lt;/li&gt;
&lt;li&gt;Number of unique words in the text&lt;/li&gt;
&lt;li&gt;Number of characters in the text&lt;/li&gt;
&lt;li&gt;Number of stopwords&lt;/li&gt;
&lt;li&gt;Number of punctuations&lt;/li&gt;
&lt;li&gt;Number of upper case words&lt;/li&gt;
&lt;li&gt;Number of title case words&lt;/li&gt;
&lt;li&gt;Average length of the words&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&quot;python&quot; style=&quot;background-color: #f1f3f4; color: #3c4043; text-align: start;&quot; data-ke-language=&quot;python&quot;&gt;&lt;code&gt;## Number of words in the text ##
train_df[&quot;num_words&quot;] = train_df[&quot;text&quot;].apply(lambda x: len(str(x).split()))
test_df[&quot;num_words&quot;] = test_df[&quot;text&quot;].apply(lambda x: len(str(x).split()))

## Number of unique words in the text ##
train_df[&quot;num_unique_words&quot;] = train_df[&quot;text&quot;].apply(lambda x: len(set(str(x).split())))
test_df[&quot;num_unique_words&quot;] = test_df[&quot;text&quot;].apply(lambda x: len(set(str(x).split())))

## Number of characters in the text ##
train_df[&quot;num_chars&quot;] = train_df[&quot;text&quot;].apply(lambda x: len(str(x)))
test_df[&quot;num_chars&quot;] = test_df[&quot;text&quot;].apply(lambda x: len(str(x)))

## Number of stopwords in the text ##
train_df[&quot;num_stopwords&quot;] = train_df[&quot;text&quot;].apply(lambda x: len([w for w in str(x).lower().split() if w in eng_stopwords]))
test_df[&quot;num_stopwords&quot;] = test_df[&quot;text&quot;].apply(lambda x: len([w for w in str(x).lower().split() if w in eng_stopwords]))

## Number of punctuations in the text ##
train_df[&quot;num_punctuations&quot;] =train_df['text'].apply(lambda x: len([c for c in str(x) if c in string.punctuation]) )
test_df[&quot;num_punctuations&quot;] =test_df['text'].apply(lambda x: len([c for c in str(x) if c in string.punctuation]) )

## Number of title case words in the text ##
train_df[&quot;num_words_upper&quot;] = train_df[&quot;text&quot;].apply(lambda x: len([w for w in str(x).split() if w.isupper()]))
test_df[&quot;num_words_upper&quot;] = test_df[&quot;text&quot;].apply(lambda x: len([w for w in str(x).split() if w.isupper()]))

## Number of title case words in the text ##
train_df[&quot;num_words_title&quot;] = train_df[&quot;text&quot;].apply(lambda x: len([w for w in str(x).split() if w.istitle()]))
test_df[&quot;num_words_title&quot;] = test_df[&quot;text&quot;].apply(lambda x: len([w for w in str(x).split() if w.istitle()]))

## Average length of the words in the text ##
train_df[&quot;mean_word_len&quot;] = train_df[&quot;text&quot;].apply(lambda x: np.mean([len(w) for w in str(x).split()]))
test_df[&quot;mean_word_len&quot;] = test_df[&quot;text&quot;].apply(lambda x: np.mean([len(w) for w in str(x).split()]))&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;2. Text-based features&lt;/b&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;1) &lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;tf-idf values of the words present in the text&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;pre class=&quot;python&quot; style=&quot;background-color: #f1f3f4; color: #3c4043; text-align: start;&quot; data-ke-language=&quot;python&quot;&gt;&lt;code&gt;### Fit transform the tfidf vectorizer ###
tfidf_vec = TfidfVectorizer(stop_words='english', ngram_range=(1,3))
full_tfidf = tfidf_vec.fit_transform(train_df['text'].values.tolist() + test_df['text'].values.tolist())
train_tfidf = tfidf_vec.transform(train_df['text'].values.tolist())
test_tfidf = tfidf_vec.transform(test_df['text'].values.tolist())&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;The tfidf output is a sparse matrix and so if we have to use it with other dense features, we have couple of choices.&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;ol style=&quot;list-style-type: decimal; color: #3c4043; text-align: start;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;We can choose to get the top 'n' features (depending on the system config) from the tfidf vectorizer, convert it into dense format and concat with other features.&lt;/li&gt;
&lt;li&gt;Build a model using just the sparse features and then use the predictions as one of the features along with other dense features.&lt;/li&gt;
&lt;/ol&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Based on the dataset, one might perform better than the other. Here we can use the second approach since there are some very good scoring kernels using all the features of tfidf.&lt;/li&gt;
&lt;li&gt;Also&amp;nbsp;it&amp;nbsp;seems&amp;nbsp;that,&amp;nbsp;Naive&amp;nbsp;Bayes&amp;nbsp;is&amp;nbsp;performing&amp;nbsp;better&amp;nbsp;in&amp;nbsp;this&amp;nbsp;dataset.&amp;nbsp;So&amp;nbsp;we&amp;nbsp;could&amp;nbsp;build&amp;nbsp;a&amp;nbsp;naive&amp;nbsp;bayes&amp;nbsp;model&amp;nbsp;using&amp;nbsp;tfidf&amp;nbsp;features&amp;nbsp;as&amp;nbsp;it&amp;nbsp;is&amp;nbsp;faster&amp;nbsp;to&amp;nbsp;train.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;python&quot; style=&quot;background-color: #f1f3f4; color: #3c4043; text-align: start;&quot; data-ke-language=&quot;python&quot;&gt;&lt;code&gt;def runMNB(train_X, train_y, test_X, test_y, test_X2):
    model = naive_bayes.MultinomialNB()
    model.fit(train_X, train_y)
    pred_test_y = model.predict_proba(test_X)
    pred_test_y2 = model.predict_proba(test_X2)
    return pred_test_y, pred_test_y2, model&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;python&quot; style=&quot;background-color: #f1f3f4; color: #3c4043; text-align: start;&quot; data-ke-language=&quot;python&quot;&gt;&lt;code&gt;cv_scores = []
pred_full_test = 0
pred_train = np.zeros([train_df.shape[0], 3])
kf = model_selection.KFold(n_splits=5, shuffle=True, random_state=2017)
for dev_index, val_index in kf.split(train_X):
    dev_X, val_X = train_tfidf[dev_index], train_tfidf[val_index]
    dev_y, val_y = train_y[dev_index], train_y[val_index]
    pred_val_y, pred_test_y, model = runMNB(dev_X, dev_y, val_X, val_y, test_tfidf)
    pred_full_test = pred_full_test + pred_test_y
    pred_train[val_index,:] = pred_val_y
    cv_scores.append(metrics.log_loss(val_y, pred_val_y))
print(&quot;Mean cv score : &quot;, np.mean(cv_scores))
pred_full_test = pred_full_test / 5.&lt;/code&gt;&lt;/pre&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-style=&quot;style5&quot; data-ke-type=&quot;horizontalRule&quot; /&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;Success is not guaranteed, but it is worth fighting for. &lt;br /&gt;- Max Holloway -&lt;/blockquote&gt;</description>
      <category>캐글</category>
      <category>Kaggle</category>
      <category>spooky author identification</category>
      <author>dongsunseng</author>
      <guid isPermaLink="true">https://dongsunseng.tistory.com/74</guid>
      <comments>https://dongsunseng.tistory.com/entry/Kaggle-Study-12-Spooky-Author-Identification#entry74comment</comments>
      <pubDate>Wed, 4 Dec 2024 00:49:37 +0900</pubDate>
    </item>
    <item>
      <title>[Kaggle Study] #11 Credit Card Fraud Detection</title>
      <link>https://dongsunseng.tistory.com/entry/Kaggle-Study-11-Credit-Card-Fraud-Detection</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;Tenth competition following Youhan Lee's curriculum.&lt;b&gt;&lt;span&gt;&lt;span&gt; Anomaly detection &lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;span&gt;&lt;span&gt;competition using&lt;/span&gt;&lt;/span&gt;&lt;b&gt;&lt;span&gt;&lt;span&gt; tabular data&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;.&lt;/p&gt;
&lt;figure id=&quot;og_1732954882496&quot; contenteditable=&quot;false&quot; data-ke-type=&quot;opengraph&quot; data-ke-align=&quot;alignCenter&quot; data-og-type=&quot;website&quot; data-og-title=&quot;Credit Card Fraud Detection&quot; data-og-description=&quot;Anonymized credit card transactions labeled as fraudulent or genuine&quot; data-og-host=&quot;www.kaggle.com&quot; data-og-source-url=&quot;https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud&quot; data-og-url=&quot;https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud&quot; data-og-image=&quot;https://scrap.kakaocdn.net/dn/bL1eeZ/hyXGDrCGv3/yfiJnJKMSmqiCPgTz7dMn1/img.jpg?width=600&amp;amp;height=600&amp;amp;face=0_0_600_600&quot;&gt;&lt;a href=&quot;https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot; data-source-url=&quot;https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud&quot;&gt;
&lt;div class=&quot;og-image&quot; style=&quot;background-image: url('https://scrap.kakaocdn.net/dn/bL1eeZ/hyXGDrCGv3/yfiJnJKMSmqiCPgTz7dMn1/img.jpg?width=600&amp;amp;height=600&amp;amp;face=0_0_600_600');&quot;&gt;&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;og-text&quot;&gt;
&lt;p class=&quot;og-title&quot; data-ke-size=&quot;size16&quot;&gt;Credit Card Fraud Detection&lt;/p&gt;
&lt;p class=&quot;og-desc&quot; data-ke-size=&quot;size16&quot;&gt;Anonymized credit card transactions labeled as fraudulent or genuine&lt;/p&gt;
&lt;p class=&quot;og-host&quot; data-ke-size=&quot;size16&quot;&gt;www.kaggle.com&lt;/p&gt;
&lt;/div&gt;
&lt;/a&gt;&lt;/figure&gt;
&lt;h4 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;a href=&quot;https://www.kaggle.com/code/joparga3/in-depth-skewed-data-classif-93-recall-acc-now&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;First Kernel:&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt; In depth skewed data classif. (93% recall acc now)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/a&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Testing different methods on skewed data.&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;i&gt;The idea is to compare if preprocessing techniques work better when there is an overwhelming majority class that can disrupt the efficiency of our predictive model.&lt;/i&gt;&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;Insight / Summary:&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;1. Methodologies for dealing with unbalanced data&lt;/b&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2024-11-30 오후 5.32.02.png&quot; data-origin-width=&quot;1178&quot; data-origin-height=&quot;826&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bDKJJ3/btsK2mjK9O9/HoKkI7R2Bx4B7L0NkoHenk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bDKJJ3/btsK2mjK9O9/HoKkI7R2Bx4B7L0NkoHenk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bDKJJ3/btsK2mjK9O9/HoKkI7R2Bx4B7L0NkoHenk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbDKJJ3%2FbtsK2mjK9O9%2FHoKkI7R2Bx4B7L0NkoHenk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;532&quot; height=&quot;373&quot; data-filename=&quot;스크린샷 2024-11-30 오후 5.32.02.png&quot; data-origin-width=&quot;1178&quot; data-origin-height=&quot;826&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #202214; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;There are several ways to approach this classification problem taking into consideration this unbalance.&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Collect more data? Nice strategy but not applicable in this case&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;Changing the performance metric:&lt;/b&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Use the confusion matrix to calculate Precision, Recall&lt;/li&gt;
&lt;li&gt;F1score (weighted average of precision recall)&lt;/li&gt;
&lt;li&gt;Use Kappa - which is a classification accuracy normalized by the imbalance of the classes in the data&lt;/li&gt;
&lt;li&gt;ROC curves - calculates sensitivity/specificity ratio.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;Resampling the dataset&lt;/b&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Essentially this is a method that will process the data to have an approximate 50-50 ratio.&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;One way to achieve this is by OVER-sampling, which is adding copies of the under-represented class (better when you have little data)&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;Another is UNDER-sampling, which deletes instances from the over-represented class (better when he have lot's of data)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;2. Nice approach that can be applied to other anomaly detection problems as well&lt;/b&gt;&lt;/p&gt;
&lt;p id=&quot;Approach&quot; style=&quot;color: #202214; text-align: start;&quot; data-ke-size=&quot;size18&quot;&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;Approach&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;ol style=&quot;list-style-type: decimal; color: #3c4043; text-align: start;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;We are not going to perform feature engineering in first instance. The dataset has been downgraded in order to contain 30 features (28 &lt;span style=&quot;background-color: #9feec3;&quot;&gt;&lt;b&gt;anonymized&lt;/b&gt;&lt;/span&gt; + time + amount).&lt;br /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;This&amp;nbsp;means&amp;nbsp;that&amp;nbsp;the&amp;nbsp;28&amp;nbsp;features&amp;nbsp;in&amp;nbsp;the&amp;nbsp;dataset&amp;nbsp;have&amp;nbsp;been&amp;nbsp;anonymized&amp;nbsp;so&amp;nbsp;that&amp;nbsp;their&amp;nbsp;actual&amp;nbsp;names&amp;nbsp;and&amp;nbsp;meanings&amp;nbsp;cannot&amp;nbsp;be&amp;nbsp;known.&amp;nbsp;&lt;/li&gt;
&lt;li&gt;For&amp;nbsp;example,&amp;nbsp;if&amp;nbsp;the&amp;nbsp;original&amp;nbsp;feature&amp;nbsp;names&amp;nbsp;were&amp;nbsp;&quot;age,&quot;&amp;nbsp;&quot;gender,&quot;&amp;nbsp;&quot;occupation,&quot;&amp;nbsp;etc.,&amp;nbsp;they&amp;nbsp;have&amp;nbsp;been&amp;nbsp;changed&amp;nbsp;to&amp;nbsp;neutral&amp;nbsp;names&amp;nbsp;like&amp;nbsp;&quot;V1,&quot;&amp;nbsp;&quot;V2,&quot;&amp;nbsp;&quot;V3,&quot;&amp;nbsp;and&amp;nbsp;so&amp;nbsp;on.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;We will then compare what happens when using resampling and when not using it. We will test this approach using a simple logistic regression classifier.&lt;/b&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #c0d1e7;&quot;&gt;&lt;b&gt;When the result is happy with the resampling dataset, we will then apply the same hyperparameter to the whole dataset.&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;We will evaluate the models by using some of the performance metrics mentioned above.&lt;/li&gt;
&lt;li&gt;We will repeat the best resampling/not resampling method, by tuning the parameters in the logistic regression classifier.&lt;/li&gt;
&lt;li&gt;We will finally perform classifications model using other classification algorithms.&lt;b&gt;(actually not in this kernel)&lt;/b&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;3. Resampling process&lt;/b&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;As&amp;nbsp;we&amp;nbsp;mentioned&amp;nbsp;earlier,&amp;nbsp;there&amp;nbsp;are&amp;nbsp;several&amp;nbsp;ways&amp;nbsp;to&amp;nbsp;resample&amp;nbsp;skewed&amp;nbsp;data.&amp;nbsp;&lt;/li&gt;
&lt;li&gt;Apart&amp;nbsp;from&amp;nbsp;under&amp;nbsp;and&amp;nbsp;over&amp;nbsp;sampling,&amp;nbsp;there&amp;nbsp;is&amp;nbsp;a&amp;nbsp;very&amp;nbsp;popular&amp;nbsp;approach&amp;nbsp;called&amp;nbsp;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;&lt;b&gt;SMOTE (Synthetic Minority Over-Sampling Technique), which is a combination of oversampling and undersampling, but the oversampling approach is not by replicating minority class but constructing new minority class data instance via an algorithm.&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;In this notebook, we will use traditional UNDER-sampling.&lt;/li&gt;
&lt;li&gt;The way we will under sample the dataset will be by creating a 50/50 ratio.&lt;/li&gt;
&lt;li&gt;This will be done by &lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;randomly selecting&lt;/b&gt;&lt;/span&gt; &quot;x&quot; amount of sample from the majority class, being &quot;x&quot; the total number of records with the minority class.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1732956144431&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;X = data.ix[:, data.columns != 'Class']
y = data.ix[:, data.columns == 'Class']&lt;/code&gt;&lt;/pre&gt;
&lt;pre id=&quot;code_1732956200288&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;# Number of data points in the minority class
number_records_fraud = len(data[data.Class == 1])
fraud_indices = np.array(data[data.Class == 1].index)

# Picking the indices of the normal classes
normal_indices = data[data.Class == 0].index

# Out of the indices we picked, randomly select &quot;x&quot; number (number_records_fraud)
random_normal_indices = np.random.choice(normal_indices, number_records_fraud, replace = False)
random_normal_indices = np.array(random_normal_indices)

# Appending the 2 indices
under_sample_indices = np.concatenate([fraud_indices,random_normal_indices])

# Under sample dataset
under_sample_data = data.iloc[under_sample_indices,:]

X_undersample = under_sample_data.ix[:, under_sample_data.columns != 'Class']
y_undersample = under_sample_data.ix[:, under_sample_data.columns == 'Class']

# Showing ratio
print(&quot;Percentage of normal transactions: &quot;, len(under_sample_data[under_sample_data.Class == 0])/len(under_sample_data))
print(&quot;Percentage of fraud transactions: &quot;, len(under_sample_data[under_sample_data.Class == 1])/len(under_sample_data))
print(&quot;Total number of transactions in resampled data: &quot;, len(under_sample_data))&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;Result:&lt;/p&gt;
&lt;pre class=&quot;yaml&quot; style=&quot;color: #3c4043; text-align: left;&quot;&gt;&lt;code&gt;Percentage of normal transactions:  0.5
Percentage of fraud transactions:  0.5
Total number of transactions in resampled data:  984&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;4. Recall Metric&lt;/b&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;We are very interested in the &lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;recall&lt;/b&gt;&lt;/span&gt; score, because that is &lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;the metric that will help us try to capture the most fraudulent transactions&lt;/b&gt;&lt;/span&gt;.&lt;/li&gt;
&lt;li&gt;If you think how Accuracy, Precision and Recall work for a confusion matrix, recall would be the most interesting:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Accuracy&amp;nbsp;=&amp;nbsp;(TP+TN)/total&lt;/li&gt;
&lt;li&gt;Precision&amp;nbsp;=&amp;nbsp;TP/(TP+FP)&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;Recall&amp;nbsp;=&amp;nbsp;TP/(TP+FN)&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;&lt;b&gt;As we know, due to the imbalacing of the data, many observations could be predicted as False Negatives, being, that we predict a normal transaction, but it is in fact a fraudulent one. Recall captures this.&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;Obviously, trying to increase recall, tends to come with a decrease of precision.&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;However, in our case, if we predict that a transaction is fraudulent and turns out not to be, is not a massive problem compared to the opposite.&lt;/b&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;Misclassifying a fraudulent transaction as legitimate (False Negative) is a bigger problem&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Than misclassifying a legitimate transaction as fraudulent (False Positive)&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;We could even apply a cost function when having FN and FP with different weights for each type of error, but let's leave that aside for now.&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;5. Result checking process&lt;/b&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2024-11-30 오후 10.29.48.png&quot; data-origin-width=&quot;507&quot; data-origin-height=&quot;413&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bJ6NgI/btsK25VWJ0k/6OuhR69fOvqYkCIOxk5eoK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bJ6NgI/btsK25VWJ0k/6OuhR69fOvqYkCIOxk5eoK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bJ6NgI/btsK25VWJ0k/6OuhR69fOvqYkCIOxk5eoK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbJ6NgI%2FbtsK25VWJ0k%2F6OuhR69fOvqYkCIOxk5eoK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;507&quot; height=&quot;413&quot; data-filename=&quot;스크린샷 2024-11-30 오후 10.29.48.png&quot; data-origin-width=&quot;507&quot; data-origin-height=&quot;413&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The model is offering an 93.2% recall accuracy on the generalised unseen data (test set).&lt;/li&gt;
&lt;li&gt;Not a bad percentage to be the first try.&lt;/li&gt;
&lt;li&gt;However, recall this is a 93.2% recall accuracy measure on the undersampled test set.&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Being&amp;nbsp;happy&amp;nbsp;with&amp;nbsp;this&amp;nbsp;result,&amp;nbsp;let's&amp;nbsp;apply&amp;nbsp;the&amp;nbsp;model&amp;nbsp;we&amp;nbsp;fitted&amp;nbsp;and&amp;nbsp;test&amp;nbsp;it&amp;nbsp;on&amp;nbsp;the&amp;nbsp;whole&amp;nbsp;data.&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2024-11-30 오후 10.30.42.png&quot; data-origin-width=&quot;507&quot; data-origin-height=&quot;413&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/lrgXb/btsK2BntVgb/nFaAVaob20EUrTe9nLyfpK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/lrgXb/btsK2BntVgb/nFaAVaob20EUrTe9nLyfpK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/lrgXb/btsK2BntVgb/nFaAVaob20EUrTe9nLyfpK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FlrgXb%2FbtsK2BntVgb%2FnFaAVaob20EUrTe9nLyfpK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;507&quot; height=&quot;413&quot; data-filename=&quot;스크린샷 2024-11-30 오후 10.30.42.png&quot; data-origin-width=&quot;507&quot; data-origin-height=&quot;413&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Still a very decent recall accuracy when applying it to a much larger and skewed dataset.&lt;/li&gt;
&lt;li&gt;&lt;b&gt;So, we now move on to checking various metrics to evaluate the performance.&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;6. Plotting&amp;nbsp;ROC&amp;nbsp;curve&amp;nbsp;and&amp;nbsp;Precision-Recall&amp;nbsp;curve&lt;/b&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #99cefa;&quot;&gt;&lt;b&gt;Found precision-recall curve much more convenient in this case as our problems relies on the &quot;positive&quot; class being more interesting than the negative class, but as we have calculated the recall precision, I am not going to plot the precision recall curves yet.&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;AUC and ROC curve are also interesting to check if the model is also predicting as a whole correctly and not making many errors&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1732973545595&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;# ROC CURVE
lr = LogisticRegression(C = best_c, penalty = 'l1')
y_pred_undersample_score = lr.fit(X_train_undersample,y_train_undersample.values.ravel()).decision_function(X_test_undersample.values)

fpr, tpr, thresholds = roc_curve(y_test_undersample.values.ravel(),y_pred_undersample_score)
roc_auc = auc(fpr,tpr)

# Plot ROC
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b',label='AUC = %0.2f'% roc_auc)
plt.legend(loc='lower right')
plt.plot([0,1],[0,1],'r--')
plt.xlim([-0.1,1.0])
plt.ylim([-0.1,1.01])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2024-11-30 오후 10.32.37.png&quot; data-origin-width=&quot;555&quot; data-origin-height=&quot;413&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/baJZ1I/btsK2BAWwWS/EMrTCYuCJPftaMjC13JEQ0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/baJZ1I/btsK2BAWwWS/EMrTCYuCJPftaMjC13JEQ0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/baJZ1I/btsK2BAWwWS/EMrTCYuCJPftaMjC13JEQ0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbaJZ1I%2FbtsK2BAWwWS%2FEMrTCYuCJPftaMjC13JEQ0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;555&quot; height=&quot;413&quot; data-filename=&quot;스크린샷 2024-11-30 오후 10.32.37.png&quot; data-origin-width=&quot;555&quot; data-origin-height=&quot;413&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;An additional comment that would be interesting to do is to initialize multiple undersampled datasets and repeat the process in loop.&lt;/li&gt;
&lt;li&gt;Remember that, to create an undersample data, we randomly got records from the majority class.&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;Even though this is a valid technique, is doesn't represent the real population, so it would be interesting to repeat the process with different undersample configurations and check if the previous chosen parameters are still the most effective.&lt;/b&gt; &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;In the end, the idea is to use a wider random representation of the whole dataset and rely on the averaged best parameters.&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;7. Now testing on skewed data after resampled data&lt;/b&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Having&amp;nbsp;tested&amp;nbsp;our&amp;nbsp;previous&amp;nbsp;approach,&amp;nbsp;I&amp;nbsp;find&amp;nbsp;really&amp;nbsp;interesting&amp;nbsp;to&amp;nbsp;test&amp;nbsp;the&amp;nbsp;same&amp;nbsp;process&amp;nbsp;on&amp;nbsp;the&amp;nbsp;skewed&amp;nbsp;data.&amp;nbsp;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;Our intuition is that skewness will introduce issues difficult to capture, and therefore, provide a less effective algorithm.&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;&lt;b&gt;To be fair, taking into account the fact that the train and test datasets are substantially bigger than the undersampled ones, I believe a K-fold cross validation is necessary. &lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;I guess that by splitting the data with 60% in training set, 20% cross validation and 20% test should be enough... but let's take the same approach as before (no harm on this, it's just that K-fold is computationally more expensive)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2024-11-30 오후 10.56.03.png&quot; data-origin-width=&quot;555&quot; data-origin-height=&quot;489&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bURnlu/btsK16IqNUM/ytiCrB6WHOMJH0CaxkGCxK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bURnlu/btsK16IqNUM/ytiCrB6WHOMJH0CaxkGCxK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bURnlu/btsK16IqNUM/ytiCrB6WHOMJH0CaxkGCxK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbURnlu%2FbtsK16IqNUM%2FytiCrB6WHOMJH0CaxkGCxK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;555&quot; height=&quot;489&quot; data-filename=&quot;스크린샷 2024-11-30 오후 10.56.03.png&quot; data-origin-width=&quot;555&quot; data-origin-height=&quot;489&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Therefore by undersampling the data, our algorithm does a much better job at detecting fraud.&amp;nbsp;&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;8. Threshold Tuning&lt;/b&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;I wanted also to show how can we tweak our final classification by changing the thresold.&lt;/li&gt;
&lt;li&gt;Initially,&amp;nbsp;you&amp;nbsp;build&amp;nbsp;the&amp;nbsp;classification&amp;nbsp;model&amp;nbsp;and&amp;nbsp;then&amp;nbsp;you&amp;nbsp;predict&amp;nbsp;unseen&amp;nbsp;data&amp;nbsp;using&amp;nbsp;it.&lt;/li&gt;
&lt;li&gt;We&amp;nbsp;previously&amp;nbsp;used&amp;nbsp;the&amp;nbsp;&quot;predict()&quot;&amp;nbsp;method&amp;nbsp;to&amp;nbsp;decided&amp;nbsp;whether&amp;nbsp;a&amp;nbsp;record&amp;nbsp;should&amp;nbsp;belong&amp;nbsp;to&amp;nbsp;&quot;1&quot;&amp;nbsp;or&amp;nbsp;&quot;0&quot;.&lt;/li&gt;
&lt;li&gt;There&amp;nbsp;is&amp;nbsp;another&amp;nbsp;method&amp;nbsp;&quot;predict_proba()&quot;.&lt;/li&gt;
&lt;li&gt;This&amp;nbsp;method&amp;nbsp;returns&amp;nbsp;the&amp;nbsp;probabilities&amp;nbsp;for&amp;nbsp;each&amp;nbsp;class.&amp;nbsp;&lt;/li&gt;
&lt;li&gt;The&amp;nbsp;idea&amp;nbsp;is&amp;nbsp;that&amp;nbsp;by&amp;nbsp;changing&amp;nbsp;the&amp;nbsp;threshold&amp;nbsp;to&amp;nbsp;assign&amp;nbsp;a&amp;nbsp;record&amp;nbsp;to&amp;nbsp;class&amp;nbsp;1,&amp;nbsp;we&amp;nbsp;can&amp;nbsp;control&amp;nbsp;precision&amp;nbsp;and&amp;nbsp;recall.&lt;/li&gt;
&lt;li&gt;Let's check this using the undersampled data (best C_param = 0.01)&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1732975085090&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;lr = LogisticRegression(C = 0.01, penalty = 'l1')
lr.fit(X_train_undersample,y_train_undersample.values.ravel())
y_pred_undersample_proba = lr.predict_proba(X_test_undersample.values)

thresholds = [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]

plt.figure(figsize=(10,10))

j = 1
for i in thresholds:
    y_test_predictions_high_recall = y_pred_undersample_proba[:,1] &amp;gt; i
    
    plt.subplot(3,3,j)
    j += 1
    
    # Compute confusion matrix
    cnf_matrix = confusion_matrix(y_test_undersample,y_test_predictions_high_recall)
    np.set_printoptions(precision=2)

    print(&quot;Recall metric in the testing dataset: &quot;, cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

    # Plot non-normalized confusion matrix
    class_names = [0,1]
    plot_confusion_matrix(cnf_matrix
                          , classes=class_names
                          , title='Threshold &amp;gt;= %s'%i)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2024-11-30 오후 10.58.23.png&quot; data-origin-width=&quot;825&quot; data-origin-height=&quot;782&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bQ2nvb/btsK3v7MIdb/W21QBJNZhg1aqgXwYcjXJ1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bQ2nvb/btsK3v7MIdb/W21QBJNZhg1aqgXwYcjXJ1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bQ2nvb/btsK3v7MIdb/W21QBJNZhg1aqgXwYcjXJ1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbQ2nvb%2FbtsK3v7MIdb%2FW21QBJNZhg1aqgXwYcjXJ1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;825&quot; height=&quot;782&quot; data-filename=&quot;스크린샷 2024-11-30 오후 10.58.23.png&quot; data-origin-width=&quot;825&quot; data-origin-height=&quot;782&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;The pattern is very clear: the more you lower the required probability to put a certain in the class &quot;1&quot; category, more records will be put in that bucket.&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;This implies an increase in recall (we want all the &quot;1&quot;s), but at the same time, a decrease in precision (we misclassify many of the other class).&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Therefore,&amp;nbsp;even&amp;nbsp;though&amp;nbsp;recall&amp;nbsp;is&amp;nbsp;our&amp;nbsp;goal&amp;nbsp;metric&amp;nbsp;(do&amp;nbsp;not&amp;nbsp;miss&amp;nbsp;a&amp;nbsp;fraud&amp;nbsp;transaction),&amp;nbsp;we&amp;nbsp;also&amp;nbsp;want&amp;nbsp;to&amp;nbsp;keep&amp;nbsp;the&amp;nbsp;model&amp;nbsp;being&amp;nbsp;accurate&amp;nbsp;as&amp;nbsp;a&amp;nbsp;whole.&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;There&amp;nbsp;is&amp;nbsp;an&amp;nbsp;option&amp;nbsp;I&amp;nbsp;think&amp;nbsp;could&amp;nbsp;be&amp;nbsp;quite&amp;nbsp;interesting&amp;nbsp;to&amp;nbsp;tackle&amp;nbsp;this.&amp;nbsp;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;&lt;b&gt;We could assign cost to misclassifications, but being interested in classifying &quot;1s&quot; correctly, the cost for misclassifying &quot;1s&quot; should be bigger than &quot;0&quot; misclassifications.&lt;/b&gt; &lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;Incorrectly classifying an actual fraudulent transaction (1) as legitimate (0) (False Negative)&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;This case should have a higher cost&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Incorrectly classifying a legitimate transaction (0) as fraudulent (1) (False Positive)&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;This case should have a relatively lower cost&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;After that, the algorithm would select the threshold which minimises the total cost.&lt;/li&gt;
&lt;li&gt;A drawback I see is that we have to manually select the weight of each cost... therefore, I will leave this know as a thought.&lt;/li&gt;
&lt;li&gt;Going back to the threshold changing, there is an option which is the Precision-Recall curve.&lt;/li&gt;
&lt;li&gt;By visually seeing the performance of the model depending on the threshold we choose, we can investigate a sweet spot where recall is high enough whilst keeping a high precision value.&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;9. Investigate Precision-Recall curve and area under this curve&lt;/b&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1732975778348&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;from itertools import cycle

lr = LogisticRegression(C = 0.01, penalty = 'l1')
lr.fit(X_train_undersample,y_train_undersample.values.ravel())
y_pred_undersample_proba = lr.predict_proba(X_test_undersample.values)

thresholds = [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
colors = cycle(['navy', 'turquoise', 'darkorange', 'cornflowerblue', 'teal', 'red', 'yellow', 'green', 'blue','black'])

plt.figure(figsize=(5,5))

j = 1
for i,color in zip(thresholds,colors):
    y_test_predictions_prob = y_pred_undersample_proba[:,1] &amp;gt; i
    
    precision, recall, thresholds = precision_recall_curve(y_test_undersample,y_test_predictions_prob)
    
    # Plot Precision-Recall curve
    plt.plot(recall, precision, color=color,
                 label='Threshold: %s'%i)
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.ylim([0.0, 1.05])
    plt.xlim([0.0, 1.0])
    plt.title('Precision-Recall example')
    plt.legend(loc=&quot;lower left&quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2024-11-30 오후 11.09.52.png&quot; data-origin-width=&quot;474&quot; data-origin-height=&quot;468&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bicjic/btsK2d1ANMP/rGJ3ZEhYkeOtDtJ94f8pC1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bicjic/btsK2d1ANMP/rGJ3ZEhYkeOtDtJ94f8pC1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bicjic/btsK2d1ANMP/rGJ3ZEhYkeOtDtJ94f8pC1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fbicjic%2FbtsK2d1ANMP%2FrGJ3ZEhYkeOtDtJ94f8pC1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;474&quot; height=&quot;468&quot; data-filename=&quot;스크린샷 2024-11-30 오후 11.09.52.png&quot; data-origin-width=&quot;474&quot; data-origin-height=&quot;468&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;h4 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;a href=&quot;https://www.kaggle.com/code/pavansanagapati/anomaly-detection-credit-card-fraud-analysis&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;Second Kernel&lt;/b&gt;&lt;/a&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Disappeared.... :(&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;a href=&quot;https://www.kaggle.com/code/matheusfacure/semi-supervised-anomaly-detection-survey&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;Third Kernel:&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt; Semi-Supervised&amp;nbsp;Anomaly&amp;nbsp;Detection&amp;nbsp;Survey&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/a&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Explore some anomaly detection techniques.&lt;/li&gt;
&lt;/ul&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;Insight / Summary:&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;1. Three types of anomalies&lt;/b&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2024-12-02 오후 10.11.12.png&quot; data-origin-width=&quot;380&quot; data-origin-height=&quot;322&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/q9yOH/btsK24c2wHJ/1zlBRwl3B1ccWTyogzfWF1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/q9yOH/btsK24c2wHJ/1zlBRwl3B1ccWTyogzfWF1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/q9yOH/btsK24c2wHJ/1zlBRwl3B1ccWTyogzfWF1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fq9yOH%2FbtsK24c2wHJ%2F1zlBRwl3B1ccWTyogzfWF1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;380&quot; height=&quot;322&quot; data-filename=&quot;스크린샷 2024-12-02 오후 10.11.12.png&quot; data-origin-width=&quot;380&quot; data-origin-height=&quot;322&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;1) Point Anomaly&lt;/b&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;&quot;an individual data instance can be considered as anomalous with respect to the rest of data&quot;&lt;span&gt;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;&lt;span&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;In the image above, instance ( o_1 ) and ( o_2 ) and all instances in ( O_3 ) are point anomalies since they lie outside the normal regions. &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;&lt;span&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;As another example, consider credit card transaction data, with information only about amount spent. &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;&lt;span&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;Then, a high transaction compared to the rest for a particular individual is an anomaly.&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;&lt;span&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;2) Contextual Anomaly&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;In this case, the data must have features regarding some contextual attribute&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;(e.g. time, space)&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;and some features regarding behavioral attributes. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;The anomaly is then determined within a given context. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;As an example, consider again credit card transactions, but now we have both information about the amount spend and day of the year. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;Now, a high amount transaction might be considered normal if it occurred in the week before Christmas, but the same amount transaction in July might be suspicious. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;We could also have information about the location the client is when performing transactions, and then expect high amounts if we detect he/she is somewhere far from home, as in a vacation.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;3) Collective Anomaly&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;In this case, some related data instances are anomalous with respect to the entire data set, but each individual instances may not be considered anomalous. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;As an example, consider the stock of a retailer. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;We expect to see its volume fluctuating in time, with low values followed by high values. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;However, a low stock for a long period of time is a anomaly. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;Note that the low volume per se is not an anomaly, but it persistence is.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;2. Summary&lt;/b&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;Note that the last two types assume some relation among data instances, that is, they are not independent identically distributed&lt;span&gt;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;(&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;i.i.d&lt;/span&gt;)&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;In the present work, we have credit card transaction information and time is one of the features, so we could treat this problem as contextual anomaly detection. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;However, we only have two days of data, making it almost impossible to determine a useful temporal context.&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;Hence, we will only consider point anomalies techniques to avoid the burden in the extra work of defining a context. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;Nonetheless, we will keep time as a feature, so in some sense the contextual information will be considered, although no directly modeled.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;3. Challenges&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;One straightforward approach to anomaly detection would be to simply define a region where the normal data lies and classify anything out of that region as an anomaly.&lt;/li&gt;
&lt;li&gt;This is most easily said than done and there are some major challenges that often arise in anomaly detection problem:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Modeling a normal region that captures all normal behavior is extremely difficult and the boundary between normal an abnormal is often blurred.&lt;/li&gt;
&lt;li&gt;Anomalies might be the result of malicious actions. Then, the malicious adversaries are always trying to adapt to make anomalous observations seem normal.&lt;/li&gt;
&lt;li&gt;The normal behavior can change, and then a current notion of normal might not be valid in the future.&lt;/li&gt;
&lt;li&gt;As we've seen, the notion of an anomaly varies for different application domains, and there is no algorithm that can handle all of them equally well.&lt;/li&gt;
&lt;li&gt;Labeled data for training/validation of models used by anomaly detection techniques is usually a major issue, being either extremely scarce or non existent.&lt;/li&gt;
&lt;li&gt;If the data contains a lot of noise, it is difficult to distinguish noisy instances from anomalies.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;4. Metric&lt;/b&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;A system with high recall but low precision returns many results, but most of its predicted labels are incorrect when compared to the training labels. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;A system with high precision but low recall is just the opposite, returning very few results, but most of its predicted labels are correct when compared to the training labels. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;An ideal system with high precision and high recall will return many results, with all results labeled correctly.&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;Since we are in a scenario of credit card fraud detection, failing to detect a fraud has a higher cost than assigning as fraudulent a normal transaction. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;Hence, we are more concerned with a high recall metric, as this shows that our system can consistently detect frauds, even if this means getting a few false positives. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;Nonetheless, we don't want to have a lot of false positives, since there is also a cost in verifying to much transactions assigned as frauds.&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;So we can summarize our model's performance in a single metric, we will use the&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;(( F_2 ))&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color: #3c4043; font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;score, which places more importance in recall than precision. Formally, it is defined as:&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2024-12-03 오전 2.09.17.png&quot; data-origin-width=&quot;374&quot; data-origin-height=&quot;112&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/b91Xse/btsK3vVE2Aq/griSjFCKiQPS0SjAHVRAo1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/b91Xse/btsK3vVE2Aq/griSjFCKiQPS0SjAHVRAo1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/b91Xse/btsK3vVE2Aq/griSjFCKiQPS0SjAHVRAo1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fb91Xse%2FbtsK3vVE2Aq%2FgriSjFCKiQPS0SjAHVRAo1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;374&quot; height=&quot;112&quot; data-filename=&quot;스크린샷 2024-12-03 오전 2.09.17.png&quot; data-origin-width=&quot;374&quot; data-origin-height=&quot;112&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;5. Statistical&amp;nbsp;Anomaly&amp;nbsp;Detection&amp;nbsp;Techniques&lt;/b&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffc9af;&quot;&gt;&lt;b&gt;&lt;span style=&quot;color: #3c4043; text-align: left;&quot;&gt;&lt;span style=&quot;color: #3c4043; text-align: left;&quot;&gt;We assume that&amp;nbsp;&lt;/span&gt;Normal data instances occur in high probability regions of a stochastic model, while anomalies occur in the low probability regions of the stochastic model.&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;In the statistical model techniques we fit a statistical model and perform statistical inference to decide if an unseen observation comes from the model distribution or not. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;One advantage of this methods is that we can associate a confidence interval to each prediction, which can help when deciding on a course of action to deal with the anomalies. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;&lt;span style=&quot;color: #3c4043; text-align: left;&quot;&gt;Another advantages is that if the model is robust to anomalies, it can be used in an unsupervised fashion, without needing any labeled data.&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;color: #3c4043;&quot;&gt;&lt;span style=&quot;background-color: #ffffff;&quot;&gt;1) Gaussian Model Based&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1733160054932&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;from scipy.stats import multivariate_normal

mu = train.drop('Class', axis=1).mean(axis=0).values
sigma = train.drop('Class', axis=1).cov().values
model = multivariate_normal(cov=sigma, mean=mu, allow_singular=True)

print(np.median(model.logpdf(valid[valid['Class'] == 0].drop('Class', axis=1).values))) 
print(np.median(model.logpdf(valid[valid['Class'] == 1].drop('Class', axis=1).values)))&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;color: #3c4043;&quot;&gt;&lt;span style=&quot;background-color: #ffffff;&quot;&gt;2) Histogram&amp;nbsp;Based&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1733161061244&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;class hist_model(object):
    
    def __init__(self, bins=50):
        self.bins = bins
        
    def fit(self, X):
        
        bin_hight, bin_edge = [], []
        
        for var in X.T:
            # get bins hight and interval
            bh, bedge = np.histogram(var, bins=self.bins)
            bin_hight.append(bh)
            bin_edge.append(bedge)
        
        self.bin_hight = np.array(bin_hight)
        self.bin_edge = np.array(bin_edge)
   

    def predict(self, X):
        
        scores = []
        for obs in X:
            obs_score = []
            for i, var in enumerate(obs):
                # find wich bin obs is in
                bin_num = (var &amp;gt; self.bin_edge[i]).argmin()-1
                obs_score.append(self.bin_hight[i, bin_num]) # find bin hitght
            
            scores.append(np.mean(obs_score))
        
        return np.array(scores)
                

        
model = hist_model()
model.fit(train.drop('Class', axis=1).values)
print(np.median(model.predict(valid[valid['Class'] == 0].drop('Class', axis=1).values))) 
print(np.median(model.predict(valid[valid['Class'] == 1].drop('Class', axis=1).values)))&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;6. Cluster based Technique&lt;/b&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1733161204825&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=3, n_init=4, random_state=42)
gmm.fit(train.drop('Class', axis=1).values)
print(gmm.score(valid[valid['Class'] == 0].drop('Class', axis=1).values))
print(gmm.score(valid[valid['Class'] == 1].drop('Class', axis=1).values))&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;7. SVM based Technique&lt;/b&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1733161678843&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;from sklearn.svm import OneClassSVM
np.random.seed(42)

model = OneClassSVM(gamma=0.000562, nu=.95, kernel='rbf')
model.fit(train.drop('Class', axis=1).values)
print(model.decision_function(valid[valid['Class'] == 0].drop('Class', axis=1).values).mean())
print(model.decision_function(valid[valid['Class'] == 1].drop('Class', axis=1).values).mean())&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;8. Tree based Technique&lt;/b&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1733161712663&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;from sklearn.ensemble import IsolationForest
np.random.seed(42)

model = IsolationForest(random_state=42, n_jobs=4, max_samples=train.shape[0], bootstrap=True, n_estimators=50)
model.fit(train.drop('Class', axis=1).values)
print(model.decision_function(valid[valid['Class'] == 0].drop('Class', axis=1).values).mean())
print(model.decision_function(valid[valid['Class'] == 1].drop('Class', axis=1).values).mean())&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;9. Neural based Technique: AutoEncoder&lt;/b&gt;&lt;/p&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style5&quot; /&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Embrace&amp;nbsp;challenges&amp;nbsp;as&amp;nbsp;opportunities&amp;nbsp;for&amp;nbsp;growth&amp;nbsp;and&amp;nbsp;transformation.&lt;br /&gt;- Max Holloway -&lt;/span&gt;&lt;/blockquote&gt;</description>
      <category>캐글</category>
      <category>Credit Card Fraud Detection</category>
      <category>Kaggle</category>
      <author>dongsunseng</author>
      <guid isPermaLink="true">https://dongsunseng.tistory.com/73</guid>
      <comments>https://dongsunseng.tistory.com/entry/Kaggle-Study-11-Credit-Card-Fraud-Detection#entry73comment</comments>
      <pubDate>Tue, 3 Dec 2024 02:52:01 +0900</pubDate>
    </item>
    <item>
      <title>[Kaggle Study] #10 Zillow Prize: Zillow&amp;rsquo;s Home Value Prediction (Zestimate)</title>
      <link>https://dongsunseng.tistory.com/entry/Kaggle-Study-10-Zillow-Prize-Zillow%E2%80%99s-Home-Value-Prediction-Zestimate</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;Nineth competition following Youhan Lee's curriculum.&lt;b&gt;&lt;span&gt;&lt;span&gt;&amp;nbsp;&lt;/span&gt;Regression&lt;/span&gt;&lt;/b&gt;&lt;span&gt;&amp;nbsp;&lt;/span&gt;competition using&lt;span&gt;&amp;nbsp;&lt;/span&gt;&lt;b&gt;tabular data&lt;/b&gt;.&lt;/p&gt;
&lt;figure id=&quot;og_1732862934956&quot; contenteditable=&quot;false&quot; data-ke-type=&quot;opengraph&quot; data-ke-align=&quot;alignCenter&quot; data-og-type=&quot;website&quot; data-og-title=&quot;Zillow Prize: Zillow&amp;rsquo;s Home Value Prediction (Zestimate)&quot; data-og-description=&quot;Can you improve the algorithm that changed the world of real estate?&quot; data-og-host=&quot;www.kaggle.com&quot; data-og-source-url=&quot;https://www.kaggle.com/c/zillow-prize-1&quot; data-og-url=&quot;https://kaggle.com/zillow-prize-1&quot; data-og-image=&quot;https://scrap.kakaocdn.net/dn/b1MvQM/hyXGFpi8Pt/Aknh5RrzQq1Gkw59IELffk/img.jpg?width=1900&amp;amp;height=400&amp;amp;face=0_0_1900_400,https://scrap.kakaocdn.net/dn/Y1SVr/hyXDgxXxrn/Cj5cvKK1sSxoWb7dXvHvC1/img.jpg?width=1900&amp;amp;height=400&amp;amp;face=0_0_1900_400&quot;&gt;&lt;a href=&quot;https://www.kaggle.com/c/zillow-prize-1&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot; data-source-url=&quot;https://www.kaggle.com/c/zillow-prize-1&quot;&gt;
&lt;div class=&quot;og-image&quot; style=&quot;background-image: url('https://scrap.kakaocdn.net/dn/b1MvQM/hyXGFpi8Pt/Aknh5RrzQq1Gkw59IELffk/img.jpg?width=1900&amp;amp;height=400&amp;amp;face=0_0_1900_400,https://scrap.kakaocdn.net/dn/Y1SVr/hyXDgxXxrn/Cj5cvKK1sSxoWb7dXvHvC1/img.jpg?width=1900&amp;amp;height=400&amp;amp;face=0_0_1900_400');&quot;&gt;&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;og-text&quot;&gt;
&lt;p class=&quot;og-title&quot; data-ke-size=&quot;size16&quot;&gt;Zillow Prize: Zillow&amp;rsquo;s Home Value Prediction (Zestimate)&lt;/p&gt;
&lt;p class=&quot;og-desc&quot; data-ke-size=&quot;size16&quot;&gt;Can you improve the algorithm that changed the world of real estate?&lt;/p&gt;
&lt;p class=&quot;og-host&quot; data-ke-size=&quot;size16&quot;&gt;www.kaggle.com&lt;/p&gt;
&lt;/div&gt;
&lt;/a&gt;&lt;/figure&gt;
&lt;h4 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;a href=&quot;https://www.kaggle.com/code/sudalairajkumar/simple-exploration-notebook-zillow-prize&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;First Kernel:&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt; Simple&amp;nbsp;Exploration&amp;nbsp;Notebook&amp;nbsp;-&amp;nbsp;Zillow&amp;nbsp;Prize&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/a&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;EDA kernel focused on &lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;univariate correlation analysis&lt;/span&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;Insight / Summary:&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;1. Removing outliers&lt;/b&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1732865408983&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;ulimit = np.percentile(train_df.logerror.values, 99)
llimit = np.percentile(train_df.logerror.values, 1)
train_df['logerror'].ix[train_df['logerror']&amp;gt;ulimit] = ulimit
train_df['logerror'].ix[train_df['logerror']&amp;lt;llimit] = llimit&lt;/code&gt;&lt;/pre&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h4 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;a href=&quot;https://www.kaggle.com/code/anokas/simple-xgboost-starter-0-0655&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;Second Kernel:&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt; Simple&amp;nbsp;XGBoost&amp;nbsp;Starter&amp;nbsp;(~0.0655)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/a&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Literally simple baseline kernel using xgboost.&lt;/li&gt;
&lt;/ul&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;Insight / Summary:&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;1.&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt; XGBoost Dmatrix&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1732872088044&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;d_train = xgb.DMatrix(x_train, label=y_train)
d_valid = xgb.DMatrix(x_valid, label=y_valid)&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;DMatrix&amp;nbsp;is&amp;nbsp;a&amp;nbsp;special&amp;nbsp;data&amp;nbsp;structure&amp;nbsp;used&amp;nbsp;in&amp;nbsp;XGBoost.&amp;nbsp;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;&lt;b&gt;It's an object that converts regular numpy arrays or pandas DataFrames into a format that XGBoost can process efficiently.&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;The main reasons for using DMatrix are:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Memory efficiency: Stores data in an optimized format to save memory&lt;/li&gt;
&lt;li&gt;Training&amp;nbsp;speed:&amp;nbsp;Prepares&amp;nbsp;data&amp;nbsp;in&amp;nbsp;an&amp;nbsp;optimized&amp;nbsp;format&amp;nbsp;so&amp;nbsp;XGBoost&amp;nbsp;can&amp;nbsp;train&amp;nbsp;quickly&lt;/li&gt;
&lt;li&gt;Sparse&amp;nbsp;matrix&amp;nbsp;support:&amp;nbsp;Can&amp;nbsp;efficiently&amp;nbsp;handle&amp;nbsp;data&amp;nbsp;when&amp;nbsp;it's&amp;nbsp;sparse&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;a href=&quot;https://www.kaggle.com/code/viveksrinivasan/zillow-eda-on-missing-values-multicollinearity&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;Third Kernel:&lt;span&gt; Zillow&amp;nbsp;EDA&amp;nbsp;On&amp;nbsp;Missing&amp;nbsp;Values&amp;nbsp;&amp;amp;&amp;nbsp;Multicollinearity&lt;/span&gt;&lt;/b&gt;&lt;/a&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;EDA focused on missing values and multicollinearity.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;asciidoc&quot; style=&quot;background-color: #f1f3f4; color: #3c4043;&quot;&gt;&lt;code&gt;- Missing Value Analysis
- Correlation Analysis
- Top Contributing Features (Through XGBoost)
- Correlation Analysis 
- Multicollinearity Analysis
- Univariate Analysis 
- Bivariate Analysis&lt;/code&gt;&lt;/pre&gt;
&lt;div style=&quot;background-color: #000000;&quot;&gt;
&lt;div&gt;
&lt;div style=&quot;background-color: #ffffff; color: #3c4043;&quot;&gt;&amp;nbsp;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div style=&quot;background-color: #000000;&quot;&gt;
&lt;div&gt;
&lt;div id=&quot;sharing-control-portal-2&quot; style=&quot;background-color: #ffffff; color: #3c4043; text-align: start;&quot;&gt;&amp;nbsp;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;Insight / Summary:&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;1.&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt; Multicollinearity Analysis&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1732873740143&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;# Import function for calculating VIF (Variance Inflation Factor)
from statsmodels.stats.outliers_influence import variance_inflation_factor  
# Hide warning messages
import warnings
warnings.filterwarnings(&quot;ignore&quot;)
# Define function for calculating VIF
def calculate_vif_(X):
    variables = list(X.columns)
    # Calculate VIF scores for each variable and return as dictionary
    vif = {variable:variance_inflation_factor(exog=X.values, exog_idx=ix) 
           for ix,variable in enumerate(list(X.columns))}
    return vif
# Select numerical columns only
numericalCol = []
for f in merged.columns:
    # Select columns that are not object type and exclude specific columns (parcelid, transactiondate, logerror)
    if merged[f].dtype!='object' and f not in [&quot;parcelid&quot;, &quot;transactiondate&quot;, &quot;logerror&quot;]:
        numericalCol.append(f)
# Create dataframe with missing values filled with -999
mergedFilterd = merged[numericalCol].fillna(-999)
# Calculate VIF scores
vifDict = calculate_vif_(mergedFilterd)
# Convert VIF results to dataframe
vifDf = pd.DataFrame()
vifDf['variables'] = vifDict.keys()
vifDf['vifScore'] = vifDict.values()
# Sort by VIF score in descending order
vifDf.sort_values(by=['vifScore'],ascending=False,inplace=True)
# Variables with VIF score &amp;le; 5 (no multicollinearity)
validVariables = vifDf[vifDf[&quot;vifScore&quot;]&amp;lt;=5]
# Variables with VIF score &amp;gt; 5 (with multicollinearity)
variablesWithMC  = vifDf[vifDf[&quot;vifScore&quot;]&amp;gt;5]
# Create subplots for visualization
fig,(ax1,ax2) = plt.subplots(ncols=2)
fig.set_size_inches(20,8)
# Visualize VIF scores for variables without multicollinearity
sn.barplot(data=validVariables,x=&quot;vifScore&quot;,y=&quot;variables&quot;,ax=ax1,orient=&quot;h&quot;,color=&quot;#34495e&quot;)
# Visualize VIF scores for top 5 variables with multicollinearity
sn.barplot(data=variablesWithMC.head(5),x=&quot;vifScore&quot;,y=&quot;variables&quot;,ax=ax2,orient=&quot;h&quot;,color=&quot;#34495e&quot;)
# Set graph titles and labels
ax1.set(xlabel='VIF Scores', ylabel='Features',title=&quot;Valid Variables Without Multicollinearity&quot;)
ax2.set(xlabel='VIF Scores', ylabel='Features',title=&quot;Variables Which Exhibit Multicollinearity&quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2024-11-29 오후 6.49.15.png&quot; data-origin-width=&quot;1600&quot; data-origin-height=&quot;674&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/by2Wzq/btsK3hOU4pm/BD05oCQOMXBb1RQTZKgook/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/by2Wzq/btsK3hOU4pm/BD05oCQOMXBb1RQTZKgook/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/by2Wzq/btsK3hOU4pm/BD05oCQOMXBb1RQTZKgook/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fby2Wzq%2FbtsK3hOU4pm%2FBD05oCQOMXBb1RQTZKgook%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1600&quot; height=&quot;674&quot; data-filename=&quot;스크린샷 2024-11-29 오후 6.49.15.png&quot; data-origin-width=&quot;1600&quot; data-origin-height=&quot;674&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;Overall explanation: This code demonstrates the process of analyzing multicollinearity between features in the dataset. Multicollinearity refers to strong correlations between independent variables, which can degrade model performance.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;Main steps:&lt;/p&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;Selects only numerical variables for analysis&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;Calculates VIF (Variance Inflation Factor) scores, which measure multicollinearity&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;Generally, VIF scores above 5 or 10 indicate multicollinearity; this code uses 5 as the threshold&lt;/li&gt;
&lt;li&gt;Visualizes results in two graphs:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Left graph: Variables without multicollinearity (VIF &amp;le; 5)&lt;/li&gt;
&lt;li&gt;Right graph: Top 5 variables with multicollinearity (VIF &amp;gt; 5)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;This analysis helps identify which variables have strong correlations with each other, which is valuable information for preprocessing steps like feature selection or dimensionality reduction.&lt;/p&gt;
&lt;h4 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;a href=&quot;https://www.kaggle.com/code/aharless/xgboost-lightgbm-and-ols-and-nn&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;Fourth Kernel:&lt;span&gt; XGBoost,&amp;nbsp;LightGBM,&amp;nbsp;and&amp;nbsp;OLS&amp;nbsp;and&amp;nbsp;NN&lt;/span&gt;&lt;/b&gt;&lt;/a&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Kernel ensembling various prediction methods: XGBoost, LightGBM, OLS(Linear Regression), and NN.&lt;/li&gt;
&lt;/ul&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;Insight / Summary:&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;1.&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt; Summary&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;LightGBM Model&lt;/b&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;Data preprocessing&lt;/li&gt;
&lt;li&gt;Set LightGBM parameters&lt;/li&gt;
&lt;li&gt;Train model and make predictions&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;XGBoost Model&lt;/b&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;Data reprocessing (different from LightGBM)&lt;/li&gt;
&lt;li&gt;Remove outliers&lt;/li&gt;
&lt;li&gt;Train two different XGBoost models&lt;/li&gt;
&lt;li&gt;Combine predictions from both models&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;Neural Network Model&lt;/b&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;Data preprocessing (standardization, handling missing values)&lt;/li&gt;
&lt;li&gt;Network structure:
&lt;ul style=&quot;list-style-type: circle;&quot; data-ke-list-type=&quot;circle&quot;&gt;
&lt;li&gt;4 hidden layers (400 &amp;rarr; 160 &amp;rarr; 64 &amp;rarr; 26 units)&lt;/li&gt;
&lt;li&gt;PReLU activation function&lt;/li&gt;
&lt;li&gt;Use Dropout and BatchNormalization&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Train model and make predictions&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;OLS Model: OLS(Ordinary Least Squares) is the basic form of linear regression&lt;/b&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;Feature engineering&lt;/li&gt;
&lt;li&gt;Train LinearRegression&lt;/li&gt;
&lt;li&gt;Make predictions for multiple dates&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Final Prediction Combination&lt;/b&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;Combine predictions from each model using weights&lt;/li&gt;
&lt;li&gt;Apply &lt;b&gt;FUDGE_FACTOR&lt;/b&gt; for final adjustment&lt;/li&gt;
&lt;li&gt;Save results to CSV file&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;2. FUDGE_FACTOR&lt;/b&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1732873178245&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;pred = FUDGE_FACTOR * (OLS_WEIGHT*reg.predict(get_features(test)) + (1-OLS_WEIGHT)*pred0)&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;This coefficient is applied after combining predictions from all models and &lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;increases the final prediction value by 12% (1.12 times)&lt;/b&gt;&lt;/span&gt;.&lt;/li&gt;
&lt;li&gt;It is used for the following purposes:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;Prediction Bias Correction&lt;/b&gt;: To correct when models tend to systematically underpredict&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Calibration&lt;/b&gt;: To adjust predictions based on validation set or previous submission results&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Systematic Error Correction&lt;/b&gt;: To correct systematic errors due to data characteristics or model limitations&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;This is an empirically determined value, and the optimal value was likely found through validation dataset or leaderboard performance.&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-style=&quot;style5&quot; data-ke-type=&quot;horizontalRule&quot; /&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #666666;&quot;&gt;Success&amp;nbsp;is&amp;nbsp;not&amp;nbsp;about&amp;nbsp;luck,&amp;nbsp;but&amp;nbsp;about&amp;nbsp;hard&amp;nbsp;work,&amp;nbsp;dedication,&amp;nbsp;and&amp;nbsp;sacrifice.&lt;br /&gt;- Max Holloway -&lt;/span&gt;&lt;/blockquote&gt;</description>
      <category>캐글</category>
      <category>Kaggle</category>
      <category>zillow prize: zillow&amp;rsquo;s home value prediction (zestimate)</category>
      <author>dongsunseng</author>
      <guid isPermaLink="true">https://dongsunseng.tistory.com/72</guid>
      <comments>https://dongsunseng.tistory.com/entry/Kaggle-Study-10-Zillow-Prize-Zillow%E2%80%99s-Home-Value-Prediction-Zestimate#entry72comment</comments>
      <pubDate>Fri, 29 Nov 2024 18:59:27 +0900</pubDate>
    </item>
    <item>
      <title>[Kaggle Study] #9 New York City Taxi Trip Duration</title>
      <link>https://dongsunseng.tistory.com/entry/Kaggle-Study-9-New-York-City-Taxi-Trip-Duration</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;Eighth competition following Youhan Lee's curriculum.&lt;b&gt;&lt;span&gt; Regression&lt;/span&gt;&lt;/b&gt;&lt;span&gt;&amp;nbsp;&lt;/span&gt;competition using &lt;b&gt;tabular data&lt;/b&gt;.&lt;/p&gt;
&lt;figure id=&quot;og_1732862913556&quot; contenteditable=&quot;false&quot; data-ke-type=&quot;opengraph&quot; data-ke-align=&quot;alignCenter&quot; data-og-type=&quot;website&quot; data-og-title=&quot;New York City Taxi Trip Duration&quot; data-og-description=&quot;Share code and data to improve ride time predictions&quot; data-og-host=&quot;www.kaggle.com&quot; data-og-source-url=&quot;https://www.kaggle.com/c/nyc-taxi-trip-duration&quot; data-og-url=&quot;https://kaggle.com/nyc-taxi-trip-duration&quot; data-og-image=&quot;https://scrap.kakaocdn.net/dn/IrYJs/hyXGCTEwZO/OuP84AHYgKUm3lCCYEUm80/img.jpg?width=1900&amp;amp;height=400&amp;amp;face=0_0_1900_400,https://scrap.kakaocdn.net/dn/x5eEa/hyXDdnJU3k/KX46OV6Vsw5iaWetkWjr50/img.jpg?width=1900&amp;amp;height=400&amp;amp;face=0_0_1900_400&quot;&gt;&lt;a href=&quot;https://www.kaggle.com/c/nyc-taxi-trip-duration&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot; data-source-url=&quot;https://www.kaggle.com/c/nyc-taxi-trip-duration&quot;&gt;
&lt;div class=&quot;og-image&quot; style=&quot;background-image: url('https://scrap.kakaocdn.net/dn/IrYJs/hyXGCTEwZO/OuP84AHYgKUm3lCCYEUm80/img.jpg?width=1900&amp;amp;height=400&amp;amp;face=0_0_1900_400,https://scrap.kakaocdn.net/dn/x5eEa/hyXDdnJU3k/KX46OV6Vsw5iaWetkWjr50/img.jpg?width=1900&amp;amp;height=400&amp;amp;face=0_0_1900_400');&quot;&gt;&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;og-text&quot;&gt;
&lt;p class=&quot;og-title&quot; data-ke-size=&quot;size16&quot;&gt;New York City Taxi Trip Duration&lt;/p&gt;
&lt;p class=&quot;og-desc&quot; data-ke-size=&quot;size16&quot;&gt;Share code and data to improve ride time predictions&lt;/p&gt;
&lt;p class=&quot;og-host&quot; data-ke-size=&quot;size16&quot;&gt;www.kaggle.com&lt;/p&gt;
&lt;/div&gt;
&lt;/a&gt;&lt;/figure&gt;
&lt;h4 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;a href=&quot;https://www.kaggle.com/code/drgilermo/dynamics-of-new-york-city-animation/notebook&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;First Kernel:&lt;span&gt;&lt;span&gt;&lt;span&gt; Dynamics&amp;nbsp;of&amp;nbsp;New&amp;nbsp;York&amp;nbsp;city&amp;nbsp;-&amp;nbsp;Animation&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/a&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;Use K-means clustering to cluster New York into different groups based on location, and analyze the traffic into and out of every cluster as a function of the time along the day&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;Insight / Summary:&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;1.&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt; Clustering Code Example: &lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;cluster New York City based on the pick-up and drop-off points of each taxi ride&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1732809853753&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;kmeans = KMeans(n_clusters=15, random_state=2, n_init = 10).fit(loc_df)
loc_df['label'] = kmeans.labels_

loc_df = loc_df.sample(200000)
plt.figure(figsize = (10,10))
for label in loc_df.label.unique():
    plt.plot(loc_df.longitude[loc_df.label == label],loc_df.latitude[loc_df.label == label],'.', alpha = 0.3, markersize = 0.3)

plt.title('Clusters of New York')
plt.show()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2024-11-29 오전 1.04.27.png&quot; data-origin-width=&quot;625&quot; data-origin-height=&quot;592&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/zUv7D/btsKZP7LEIV/KpPXAagYEKJmGmGRdZf2l0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/zUv7D/btsKZP7LEIV/KpPXAagYEKJmGmGRdZf2l0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/zUv7D/btsKZP7LEIV/KpPXAagYEKJmGmGRdZf2l0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FzUv7D%2FbtsKZP7LEIV%2FKpPXAagYEKJmGmGRdZf2l0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;625&quot; height=&quot;592&quot; data-filename=&quot;스크린샷 2024-11-29 오전 1.04.27.png&quot; data-origin-width=&quot;625&quot; data-origin-height=&quot;592&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;2. Plotting cluster center&lt;/b&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1732810052832&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;fig,ax = plt.subplots(figsize = (10,10))
for label in loc_df.label.unique():
    ax.plot(loc_df.longitude[loc_df.label == label],loc_df.latitude[loc_df.label == label],'.', alpha = 0.4, markersize = 0.1, color = 'gray')
    ax.plot(kmeans.cluster_centers_[label,0],kmeans.cluster_centers_[label,1],'o', color = 'r')
    ax.annotate(label, (kmeans.cluster_centers_[label,0],kmeans.cluster_centers_[label,1]), color = 'b', fontsize = 20)
ax.set_title('Cluster Centers')
plt.show()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2024-11-29 오전 1.07.45.png&quot; data-origin-width=&quot;625&quot; data-origin-height=&quot;592&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bdSATu/btsKZKyuPpy/DrKgseTx8YyHMTkx35wVd0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bdSATu/btsKZKyuPpy/DrKgseTx8YyHMTkx35wVd0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bdSATu/btsKZKyuPpy/DrKgseTx8YyHMTkx35wVd0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbdSATu%2FbtsKZKyuPpy%2FDrKgseTx8YyHMTkx35wVd0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;625&quot; height=&quot;592&quot; data-filename=&quot;스크린샷 2024-11-29 오전 1.07.45.png&quot; data-origin-width=&quot;625&quot; data-origin-height=&quot;592&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;3. Plotting taxi rides from one cluster to another&lt;/b&gt;&lt;/p&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;Absolute traffic:&lt;/p&gt;
&lt;pre id=&quot;code_1732810154845&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;fig, ax = plt.subplots(1, 1, figsize = (10,10))

def animate(hour):
    ax.clear()
    ax.set_title('Absolute Traffic - Hour ' + str(int(hour)) + ':00')    
    plt.figure(figsize = (10,10));
    for label in loc_df.label.unique():
        ax.plot(loc_df.longitude[loc_df.label == label],loc_df.latitude[loc_df.label == label],'.', alpha = 1, markersize = 2, color = 'gray');
        ax.plot(kmeans.cluster_centers_[label,0],kmeans.cluster_centers_[label,1],'o', color = 'r');


    for label in clusters.label:
        for dest_label in clusters.label:
            num_of_rides = len(df[(df.pickup_cluster == label) &amp;amp; (df.dropoff_cluster == dest_label) &amp;amp; (df.pickup_hour == hour)])
            dist_x = clusters.x[clusters.label == label].values[0] - clusters.x[clusters.label == dest_label].values[0]
            dist_y = clusters.y[clusters.label == label].values[0] - clusters.y[clusters.label == dest_label].values[0]
            pct = np.true_divide(num_of_rides,len(df))
            arr = Arrow(clusters.x[clusters.label == label].values, clusters.y[clusters.label == label].values, -dist_x, -dist_y, edgecolor='white', width = 15*pct)
            ax.add_patch(arr)
            arr.set_facecolor('g')


ani = animation.FuncAnimation(fig,animate,sorted(df.pickup_hour.unique()), interval = 1000)
plt.close()
ani.save('animation.gif', writer='imagemagick', fps=2)
filename = 'animation.gif'
video = io.open(filename, 'r+b').read()
encoded = base64.b64encode(video)
HTML(data='''&amp;lt;img src=&quot;data:image/gif;base64,{0}&quot; type=&quot;gif&quot; /&amp;gt;'''.format(encoded.decode('ascii')))&lt;/code&gt;&lt;/pre&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;Relative traffic:&lt;/p&gt;
&lt;pre id=&quot;code_1732810172706&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;fig, ax = plt.subplots(1, 1, figsize = (10,10))

def animate(hour):
    ax.clear()
    ax.set_title('Relative Traffic - Hour ' + str(int(hour)) + ':00')    
    plt.figure(figsize = (10,10))
    for label in loc_df.label.unique():
        ax.plot(loc_df.longitude[loc_df.label == label],loc_df.latitude[loc_df.label == label],'.', alpha = 1, markersize = 2, color = 'gray')
        ax.plot(kmeans.cluster_centers_[label,0],kmeans.cluster_centers_[label,1],'o', color = 'r')


    for label in clusters.label:
        for dest_label in clusters.label:
            num_of_rides = len(df[(df.pickup_cluster == label) &amp;amp; (df.dropoff_cluster == dest_label) &amp;amp; (df.pickup_hour == hour)])
            dist_x = clusters.x[clusters.label == label].values[0] - clusters.x[clusters.label == dest_label].values[0]
            dist_y = clusters.y[clusters.label == label].values[0] - clusters.y[clusters.label == dest_label].values[0]
            pct = np.true_divide(num_of_rides,len(df[df.pickup_hour == hour]))
            arr = Arrow(clusters.x[clusters.label == label].values, clusters.y[clusters.label == label].values, -dist_x, -dist_y, edgecolor='white', width = pct)
            ax.add_patch(arr)
            arr.set_facecolor('g')


ani = animation.FuncAnimation(fig,animate,sorted(df.pickup_hour.unique()), interval = 1000)
plt.close()
ani.save('animation.gif', writer='imagemagick', fps=2)
filename = 'animation.gif'
video = io.open(filename, 'r+b').read()
encoded = base64.b64encode(video)
HTML(data='''&amp;lt;img src=&quot;data:image/gif;base64,{0}&quot; type=&quot;gif&quot; /&amp;gt;'''.format(encoded.decode('ascii')))&lt;/code&gt;&lt;/pre&gt;
&lt;h4 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;a href=&quot;https://www.kaggle.com/code/aiswaryaramachandran/eda-baseline-model-0-40-rmse&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;Second Kernel:&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt; EDA&amp;nbsp;+&amp;nbsp;Baseline&amp;nbsp;Model(0.40&amp;nbsp;RMSE)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/a&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Literally EDA + making baseline model with decent LB.&lt;/li&gt;
&lt;/ul&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;Insight / Summary:&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;1.&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt; Calculating &lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;Haversine Distance using latitude, longitude&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1732848536928&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;def calculateDistance(row):
    R=6373.0 # approximate radius of earth in km
    pickup_lat=radians(row['pickup_latitude'])
    pickup_lon=radians(row['pickup_longitude'])
    dropoff_lat=radians(row['dropoff_latitude'])
    dropoff_lon=radians(row['dropoff_longitude'])
    dlon = dropoff_lon - pickup_lon
    dlat = dropoff_lat - pickup_lat
    a = sin(dlat / 2)**2 + cos(pickup_lat) * cos(dropoff_lat) * sin(dlon / 2)**2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))
    distance = R * c
    return distance&lt;/code&gt;&lt;/pre&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span style=&quot;background-color: #ffffff; color: #3c4043; text-align: left;&quot;&gt;2. Bearing&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: -apple-system, BlinkMacSystemFont, 'Helvetica Neue', 'Apple SD Gothic Neo', Arial, sans-serif; letter-spacing: 0px;&quot;&gt;Bearing (also called azimuth) is the angle between the direction of travel and true north, measured clockwise from north. In other words, it tells you which direction you're heading:&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;0&amp;deg; (or 360&amp;deg;) = North&lt;/li&gt;
&lt;li&gt;90&amp;deg; = East&lt;/li&gt;
&lt;li&gt;180&amp;deg; = South&lt;/li&gt;
&lt;li&gt;270&amp;deg; = West&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color: #9feec3;&quot;&gt;&lt;b&gt;&lt;span style=&quot;color: #3c4043; text-align: left;&quot;&gt;The formula is: &amp;theta; = atan2( sin &amp;Delta;&amp;lambda; &amp;sdot; cos &amp;phi;2 , cos &amp;phi;1 &amp;sdot; sin &amp;phi;2 &amp;minus; sin &amp;phi;1 &amp;sdot; cos &amp;phi;2 &amp;sdot; cos &amp;Delta;&amp;lambda; ) &amp;lambda; is the longitude&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1732852060219&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;def calculateBearing(lat1,lng1,lat2,lng2):
    R = 6371 
    lng_delta_rad = np.radians(lng2 - lng1)
    lat1, lng1, lat2, lng2 = map(np.radians, (lat1, lng1, lat2, lng2))
    y = np.sin(lng_delta_rad) * np.cos(lat2)
    x = np.cos(lat1) * np.sin(lat2) - np.sin(lat1) * np.cos(lat2) * np.cos(lng_delta_rad)
    return np.degrees(np.arctan2(y, x))&lt;/code&gt;&lt;/pre&gt;
&lt;h4 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;a href=&quot;https://www.kaggle.com/code/danijelk/beat-the-benchmark&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;Third Kernel: Beat&amp;nbsp;the&amp;nbsp;benchmark!&lt;/b&gt;&lt;/a&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Similar kernel but XGBoost used.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style5&quot; /&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;span style=&quot;color: #666666;&quot;&gt;Believe&amp;nbsp;in&amp;nbsp;your&amp;nbsp;abilities,&amp;nbsp;even&amp;nbsp;when&amp;nbsp;others&amp;nbsp;doubt&amp;nbsp;you.&amp;nbsp;Your&amp;nbsp;belief&amp;nbsp;will&amp;nbsp;carry&amp;nbsp;you&amp;nbsp;through.&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;color: #666666;&quot;&gt;- Max Holloway -&lt;/span&gt;&lt;/blockquote&gt;</description>
      <category>캐글</category>
      <category>Kaggle</category>
      <category>new york city taxi trip duration</category>
      <author>dongsunseng</author>
      <guid isPermaLink="true">https://dongsunseng.tistory.com/71</guid>
      <comments>https://dongsunseng.tistory.com/entry/Kaggle-Study-9-New-York-City-Taxi-Trip-Duration#entry71comment</comments>
      <pubDate>Fri, 29 Nov 2024 15:43:25 +0900</pubDate>
    </item>
  </channel>
</rss>