게놈
Oct 16, 2023
Nature 616권, 543~552페이지(2023)이 기사 인용
25,000회 액세스
4 인용
89 알트메트릭
측정항목 세부정보
종양 내 이질성(ITH)은 폐암의 진화를 촉진하여 면역 회피와 치료에 대한 저항성을 유발합니다1. 여기서는 쌍을 이루는 전체 엑솜과 RNA 시퀀싱 데이터를 사용하여 TRACERx 연구에 전향적으로 모집된 최초 421명의 환자 중 347명의 비소세포폐암 종양 354개에서 종양 내 전사체 다양성을 조사합니다2,3. 96개의 종양 인접 정상 조직 샘플과 함께 원발성 및 전이성 질환을 모두 나타내는 947개 종양 영역을 분석한 결과 전사체가 표현형 변이의 주요 원인임을 암시합니다. 유전자 발현 수준과 ITH는 종양 진화 과정에서 양성 및 음성 선택 패턴과 관련이 있습니다. 우리는 후생유전학적 기능 장애와 관련된 빈번한 카피 수 독립적 대립유전자 특이적 발현을 관찰합니다. 대립유전자 특이적 발현은 또한 암 유전자 파괴에 수렴하는 게놈-전사체 평행 진화를 초래할 수 있습니다. 우리는 RNA 단일 염기 치환의 시그니처를 추출하고 그 병인을 RNA 편집 효소인 ADAR 및 APOBEC3A의 활성과 연결함으로써 종양에서 다른 방법으로는 검출되지 않은 진행 중인 APOBEC 활성을 드러냅니다. 원발성-전이성 종양 쌍의 전사체를 특성화하면서 우리는 게놈 및 전사체 변수를 활용하여 전이 파종 가능성을 돌연변이의 진화적 맥락과 원발성 종양 영역 내 증식 증가와 연결하는 여러 기계 학습 접근법을 결합합니다. 이러한 결과는 ITH, 폐암 진화 및 전이에 영향을 미치는 게놈과 전사체 간의 상호 작용을 강조합니다.
종양의 진화를 이해하려면 암세포 간 변이의 원인을 이해하는 것이 필수적입니다. 최근 연구에서는 이러한 변이의 대부분이 전사체이며 게놈 변이와 관련되거나 독립적인 다양한 메커니즘에서 발생한다는 점을 강조했습니다4. 비소세포폐암(NSCLC)의 마우스 모델에서 전사체 가소성은 ITH5를 뒷받침하는 것으로 나타났습니다. 게놈 변이는 종양의 진화 과정에서 획득된 과거 신체 사건의 유물을 반영하는 반면, 전사체 변이는 샘플링 당시 종양의 표현형 상태에 대한 정확한 근사치를 제공할 수 있습니다1. 현재까지 인간의 종양 진화에 대한 대부분의 연구는 게놈 변화가 암에 미치는 영향에 초점을 맞춰 왔습니다. 대량 종양 RNA 시퀀싱(RNA-seq) 데이터를 활용하는 전사체 연구는 단일 시점에서 수행된 단일 생검에서 유전자 발현의 진폭에 초점을 맞추는 경향이 있습니다. 이 접근법은 암 진화에 중요한 영향을 미칠 수 있는 대립유전자 특이적 발현(ASE) 및 RNA 편집을 포함하여 제대로 이해되지 않은 전사체 과정을 포착하지 못할 수 있습니다1,4.
여기서 우리는 TRACERx 연구2에 모집된 환자의 다중 영역 시퀀싱 데이터를 활용하여 다양한 전사체 특징의 영향과 NSCLC 진화의 게놈 및 표현형 다양성과의 상호 작용을 다양한 공간 및 시간 규모에서 더 잘 이해합니다.
우리는 전향적 연구 TRACERx(TRACERx 421 코호트)에 모집된 347명의 환자로부터 일치하는 RNA-seq 및 전체 엑솜 시퀀싱 데이터를 분석했습니다. 코호트의 샘플은 354개의 NSCLC 종양(6명의 환자가 진단 시 다중 원발성을 보유함)의 947개 종양 영역과 96개의 종양 인접 정상 폐 조직 영역으로 구성되었습니다(보충 정보의 통합 보고 시험 표준(CONSORT) 다이어그램 참조)6 ,7. 이들 환자 중 344명은 원발 종양 영역이 886개였고, 21명은 원발 종양의 수술적 절제 시 샘플링된 전이성 림프절(LN) 영역이 29개였으며, 24명의 환자는 재발 또는 진행 시 샘플링된 전이성 종양 영역이 30개였습니다. 이 코호트에 포함된 64명의 환자로부터 총 168개의 원발 종양 영역과 4개의 LN 영역이 이전에 TRACERx 100 코호트에서 설명되었습니다8. 여기에서 분석된(동반 논문6에 보고된) 한 쌍의 원발-전이 영역 코호트는 수술 시 절제된 LN 영역 및 폐내 전이(이후 원발 LN/위성 병변으로 지칭함)와 재발 또는 진행 시 LN 및 전이 영역을 포함하는 61개의 전이 영역으로 구성됩니다.
1) was most readily observed within truncating mutations in genes in the highest expression tertile. Notably, within non-cancer genes, signals of negative selection (dN/dS ± 95% confidence intervals of <1) were identified within truncating mutations in genes within the highest expression tertile only (242 truncating mutations, relative to 3,932 observed truncating mutations, were estimated to have been lost through negative selection in these genes). Similar patterns were observed when dividing the data by different expression quantiles (Extended Data Fig. 1i)./p>8 reads (Methods). It was possible to evaluate ASE in a total of 16,378 different genes across all samples within the cohort at an average of 3,809 (s.d. ± 885) and 4,064 (s.d. ± 485) genes per tumour and normal tissue sample, respectively./p>G substitutions, in keeping with ADAR-linked RNA editing, which deaminates adenosine to inosine, a nucleotide that is then read as guanosine by the translation machinery26 and sequencing platforms. Of these substitutions, 65% were present in the REDIportal database27 of known A>G editing events in human tissues. C>T substitutions28 represented 11.8% of the total substitutions detected. Of all the RNA substitutions detected, 67% were tumour specific (not present within a TRACERx panel of samples of normal tissue), and of these, 29.4% were shared between two or more tumours./p>G transitions, whereas RNA-SBS2 consisted mainly of C>T transitions. RNA-SBS3 consisted mainly of A>G and T>C transitions, RNA-SBS4 of G>A transitions and RNA-SBS5 of G>T transversions. RNA-SBS1 and RNA-SBS3 were identified in most tumours (RNA-SBS1 in 98% and RNA-SBS3 in 85%). RNA-SBS1 exhibited the lowest ITH and was detected within all regions of 87.4% of multiregion tumours./p>G sites from REDIportal was highly similar to RNA-SBS1 (cosine similarity = 0.97), consistent with the A>G activity of ADAR underpinning RNA-SBS1./p>T transitions at TpC sites (67%), a motif consistent with the RNA editing activity of APOBEC3A (ref. 30). In keeping with this, an unbiased analysis showed that RNA-SBS2 correlated more strongly with APOBEC3A expression than with any other gene in the transcriptome (Pearson's r = 0.73, FDR = 4.7 × 10−108; Fig. 3d). A multiple linear regression considering all APOBEC enzymes revealed that the expression of APOBEC3A was the strongest independent predictor of RNA-SBS2 activity, although APOBEC3F was also significant (P = 2.6 × 10−57 and P = 0.01 for APOBEC3A and APOBEC3F, respectively, linear mixed-effects model). Investigating the link between RNA-SBS2 and C>T enrichment at APOBEC3A-specific motifs30,31 further confirmed that RNA-SBS2 was strongly influenced by APOBEC3A expression (Extended Data Fig. 3c,d). Associations between gene expression or genomic features and the activity of the three remaining RNA-SBS signatures did not produce any obvious explanations for their aetiology./p>40% of all genes with zero counts (estimated using the QoRTS output Genes_WithZeroCounts) were excluded. Additionally, samples with <20% of reads mapping to a genomic area covered by exactly one gene in a coding sequence genomic region (estimated using the QoRTS output ReadPairs_UniqueGene_CDS) were excluded. Next, RNA coverage was calculated for single nucleotide variants (SNVs) detected in matched whole-exome sequencing data per tumour region using SAMtools (v.1.9)61 mpileup. Mutation expression was used to further quality check the mapping of RNA reads. The expression of SNVs exclusive to a given tumour region was used to detect potential instances of within-patient mislabelling of RNA–DNA matched tumour regions as well as to exclude normal adjacent lung tissue regions that expressed mutations present in paired tumour regions. A similar approach was applied to germline SNPs to further assess potential sample swaps based on patterns of CN variation from matched DNA per tumour region. Tumour regions in which fewer than 10 mutations, or fewer than 25% of the total mutation count, had evidence of expression, and/or less than 10% of SNPs had evidence of biallelic expression, were excluded. Finally, tumour regions clustering with tumour-adjacent normal tissue regions (see the section ‘UMAP clustering’) and tumour regions with a low purity were also excluded from further analyses. To ensure the reproducibility and portability of the above pipeline, all steps described were implemented through the Nextflow (v.20.07.1)62 pipeline manager./p>0) were evaluated for an enrichment in driver mutations more commonly associated with LUADs./p> 0.5 as not significantly ASE. In the case of CN-dependent ASE, genes were required to show no significant ASE, irrespective of CN, to be categorized as not significantly ASE. Genes with no phasing information were not tested for ASE./p> 0.2). For each of these, we computed the number of CpGs that were significantly hypomethylated and hypermethylated in tumour samples compared to the normal samples, taking only loci that had coverage in all samples (minnormal = 10, mintumour = 3). We then calculated the fraction of differentially methylated positions that were hypomethylated. Using a linear mixed effects model, with tumour identity as random effect, we then compared this metric to the percentage of genes showing evidence of CN-independent ASE per sample (separately for LUAD and LUSC)./p>T events at known RNA-editing APOBEC motifs. APOBEC enzymes typically edit C>T variants at the fourth position of 4-nucleotide-long RNA hairpin loops. In particular, APOBEC3A favours the CAT[C>T] motif30,31./p>T variant site, a Fisher's test was performed to test whether C>T changes within 20 upstream or downstream nucleotides occurred more than expected by chance at specific motifs (CAT[C>T]) in either strand./p>0.2 CCF were considered as seeding for this analysis. In total, 516 primary tumour regions from 206 tumours for which seeding status could be established and for which all metrics tested could be measured (307 non-seeding regions, 209 seeding) were analysed. The following features were also considered for the classifier:/p> 0.75, n = 11). We one-hot-encoded categorical features using get_dummies from Pandas (v1.3.3)106 and then split the data into training and test datasets (75/25 split). After encoding, we had a total of 60 features. We scaled the continuous features using MinMaxScaler from sklearn.preprocessing (v.0.0)107 and used SMOTENC from imblearn.over_sampling (v.0.8.0)105 to improve the balance of the dataset in terms of numbers of seeding and non-seeding regions. Finally, we used the sklearn (v.0.0)105 framework to perform additional variable selection before training using a LinearSVC model (penalty = "l1"), keeping those features with importance ≥0.015. This threshold removed 15 out of 60 features. Following this initial pre-processing, we generated different subsets of the dataset depending on the source of the input features, thus downstream processes within the pipeline operated on three datasets: (1) genomic only features, (2) transcriptomic only features, and (3) all features./p>