Reference genome and transcriptome informed by the sex chromosome complement of the sample increases ability to detect sex differences in gene expression from RNA-Seq data

Background Human X and Y chromosomes share an evolutionary origin and, as a consequence, sequence similarity. We investigated whether sequence homology between the X and Y chromosomes affects alignment of RNA-Seq reads and estimates of differential expression. We tested the effects of using reference genomes and reference transcriptomes informed by the sex chromosome complement of the sample’s genome on measurements of RNA-Seq abundance and sex differences in expression. Results The default genome includes the entire human reference genome (GRCh38), including the entire sequence of the X and Y chromosomes. We created two sex chromosome complement informed reference genomes. One sex chromosome complement informed reference genome was used for samples that lacked a Y chromosome; for this reference genome version, we hard-masked the entire Y chromosome. For the other sex chromosome complement informed reference genome, to be used for samples with a Y chromosome, we hard-masked only the pseudoautosomal regions of the Y chromosome, because these regions are duplicated identically in the reference genome on the X chromosome. We analyzed transcript abundance in the whole blood, brain cortex, breast, liver, and thyroid tissues from 20 genetic female (46, XX) and 20 genetic male (46, XY) samples. Each sample was aligned twice; once to the default reference genome and then independently aligned to a reference genome informed by the sex chromosome complement of the sample, repeated using two different read aligners, HISAT and STAR. We then quantified sex differences in gene expression using featureCounts to get the raw count estimates followed by Limma/Voom for normalization and differential expression. We additionally created sex chromosome complement informed transcriptome references for use in pseudo-alignment using Salmon. Transcript abundance was quantified twice for each sample; once to the default target transcripts and then independently to target transcripts informed by the sex chromosome complement of the sample. Conclusions We show that regardless of the choice of read aligner, using an alignment protocol informed by the sex chromosome complement of the sample results in higher expression estimates on the pseudoautosomal regions of the X chromosome in both genetic male and genetic female samples, as well as an increased number of unique genes being called as differentially expressed between the sexes. We additionally show that using a pseudo-alignment approach informed on the sex chromosome complement of the sample eliminates Y-linked expression in female XX samples. Author summary The human X and Y chromosomes share an evolutionary origin and sequence homology, including regions of 100% identity; this sequence homology can result in reads misaligning between the sex chromosomes, X and Y. We hypothesized that misalignment of reads on the sex chromosomes would confound estimates of transcript abundance if the sex chromosome complement of the sample is not accounted for during the alignment step. For example, because of shared sequence similarity, X-linked reads could misalign to the Y chromosome. This is expected to result in reduced expression for regions between X and Y that share high levels of homology. For this reason, we tested the effect of using a default reference genome versus a reference genome informed by the sex chromosome complement of the sample on estimates of transcript abundance in human RNA-Seq samples from whole blood, brain cortex, breast, liver, and thyroid tissues of 20 genetic female (46, XX) and 20 genetic male (46, XY) samples. We found that using a reference genome with the sex chromosome complement of the sample resulted in higher measurements of X-linked gene transcription for both male and female samples and more differentially expressed genes on the X and Y chromosomes. We additionally investigated the use of a sex chromosome complement informed transcriptome reference index for alignment free quantification protocols. We observed no Y-linked expression in female XX samples only when the transcript quantification was performed using a transcriptome reference index informed on the sex chromosome complement of the sample. We recommend that future studies requiring aligning RNA-Seq reads to a reference genome or pseudo-alignment with a transcriptome reference should consider the sex chromosome complement of their samples prior to running default pipelines.

3 on the X chromosome and a total of 56,571 genes across the entire genome for GRCh38 version 284 of the human reference genome (Aken et al., 2017). Only primary alignments were counted and 285 specified using the --primary option in featureCounts. 286 287

RNA-seq quantification for transcriptome index 288
Transcript quantification for trimmed paired RNA-seq brain cortex samples were estimated twice, 289 once to a default decoy-aware reference transcriptome index and once to a sex chromosome 290 complement informed decoy-aware reference transcriptome index using Salmon with the -291 validateMappings flag. Salmon's -validateMappings adopts a scheme for finding protentional 292 mapping loci of a read using a chain algorithm introduced in minimap2 (Li, 2018). Transcript 293 quantification for female (46, XX) samples was estimated using a Y-masked reference 294 transcriptome index and male (46, XY) transcript quantification was estimated using a Y PAR 295 masked reference transcriptome index when the Y PAR sequence information was available for 296 the transcriptome build. This was repeated for both the Ensembl and the gencode cDNA 297 transcriptome builds, keeping all parameters the same, only changing the reference transcriptome 298 index used, as described above. 299 300 DGEList object 301 Differential expression analysis was performed using the limma/voom pipeline (Law et al., 2014) 302 which has been shown to be a robust differential expression software package (Costa-Silva et al., 303 2017; Seyednasrollah et al., 2015) for both reference-based and pseudo-alignment quantification. 304 Quantified read counts from each sample for the reference-based quantification were generated 305 from featureCounts were combined into a count matrix, each row representing a unique gene ID 306 15 and each column representing the gene counts for each unique sample. This was repeated for each 307 tissue type and read into R using the DGEList function in the R limma package (Love et al., 2014). 308 A sample-level information file related to the genetic sex of the sample, male or female, and the 309 reference genome used for alignment, default or sex chromosome complement informed, was 310 created and corresponds to the columns of the count matrix described above. 311 Pseudo-aligned transcript read counts from each brain cortex sample quantified using 312 Salmon were combined into a count matrix using tximport (Soneson et al., 2015) with each row 313 representing a unique transcript ID and each column representing the transcript counts for each 314 unique sample. To create length scaled transcripts per million (TPM) values to pass into limma, 315 tximport function lengthScaledTPM was employed (Soneson et al., 2015). The reference assembly 316 annotation file was read into R using tximport function makeTxDbFromGFF. Following this, a 317 key of the transcript ID corresponding to the gene ID was created was created using the keys 318 function (Soneson et al., 2015). Gene level TPM values were then generated using the tx2gene 319 function. This was repeated for the Ensembl and the gencode default and sex chromosome 320 complement informed transcriptome quantification estimates. 321 322 Multidimensional Scaling 323 Multidimensional Scaling (MDS) was performed using the DGEList-object containing gene 324 expression count information for each sample. MDS plots were generated using the plotMDS 325 function in in the R limma package (Law et al., 2014). The distance between each pair of samples 326 is shown as the log2 fold change between the samples. The analysis was done for each tissue 327 separately using all shared common variable genes for dimensions (dim) 1 & 2 and dim 2 & 3. 328 Samples that did not cluster with reported sex or clustered in unexpected ways in either dim1, 2 or 329 16 3 were removed from all downstream analysis (Additional file 5). MDS plots for each tissue 330 containing the samples that were used for quality control are located in Additional file 6. Briefly, 331 one male XY whole blood did not cluster with any of the other samples and was removed. One 332 female XX breast sample clustered with the opposite sex and was thus removed. In brain cortex, 333 three male XY brain cortex samples didn't cluster neatly with the other male XY samples in dim 334 1 & 2 were thus removed. Another male brain cortex sample, although clustered with other male 335 samples, had the lowest number of sequencing remaining after trimming for quality, 23.9M, and 336 thus was also removed. To keep the number of samples in each sex roughly equal, four female XX 337 brain cortex samples were randomly selected for removal. For liver and thyroid tissue, no samples 338 appeared to cluster in any unexpected ways and thus no liver or thyroid tissue samples were 339 removed. For all aligners the first component of variation in the MDS plot is explained by the sex 340 of the sample (Figure 3). 341 342 Differential expression 343 Using edgeR (Robinson et al., 2010), raw counts were normalized to adjust for compositional 344 differences between the RNA-Seq libraries using the voom normalize quantile function, which 345 normalizes the reads by the method of trimmed mean of values (TMM) (Law et al., 2014). Counts 346 were then transformed to log2(CPM+0.25/L), where CPM is counts per million, L is library size, 347 and 0.25 is a prior count to avoid taking the log of zero (Robinson et al., 2010). For this dataset, 348 the average library size is about 79.76 million, therefore L is 79.76. Thus, the minimum 349 log2(CPM+0.25/L) value for each sample, representing zero transcripts, is log2(0+0.25/15) = -8.32. 350 A mean minimum of 1 CPM, or the equivalent of 0 in log2(CPM+2/L), in at least one sex 351 per tissue comparison was required for the gene to be kept for downstream analysis. A CPM value 352 of 1 was used in our analysis to separate expressed genes from unexpressed genes, meaning that 353 in a library size of ~79.76 million reads, there are at least an average of 79 counts in at least one 354 sex. After filtering for a minimum CPM, 53,804 out of the 56,571 quantified genes were retained 355 for the whole blood samples, 53,822 for brain cortex, 54,184 for breast, 53,830 for liver, and 356 53,848 for thyroid. A linear model was fitted to the DGEList-object, which contains the filtered 357 and normalized gene counts for each sample, using the limma lmfit function which will fit a 358 separate model to the expression values for each gene (Law et al., 2014). 359 For differential expression analysis a design matrix containing the genetic sex of the sample 360 (male or female) and which reference genome the sample was aligned to (default or sex 361 chromosome complement informed) was created for each tissue type for contrasts of pairwise 362 comparisons between the sexes. Pairwise contrasts were generated using limma makecontrasts 363 function (Law et al., 2014). We identified genes that exhibited significant expression differences 364 defined using an Benjamini-Hochberg adjusted p-value cutoff that is less than 0.01 (1%) to account 365 for multiple testing in pairwise comparisons between conditions using limma/voom decideTests 366 vebayesfit (Law et al., 2014). A conservative adjusted p-value cutoff of less than 0.01 was chosen 367 to be highly confident in the genes that were called as differentially expressed when comparing 368 between reference genomes used for alignment.

GO analysis 372
We examined differences and similarities in gene enrichment terms between the differentially 373 expressed genes obtained from the differential expression analyses of the samples aligned to the 374 default and sex chromosome complement informed reference genomes, to investigate if the 375 18 biological interpretation would change depending on the reference genome the samples were 376 aligned to. We investigated gene ontology enrichment for lists of genes that were identified as 377 showing overexpression in one sex versus the other sex for whole blood, brain cortex, breast, liver, 378 and thyroid samples (adjusted p-value < 0.01). We used the GOrilla webtool, which utilizes a 379 hypergeometric distribution to identify enriched GO terms (Eden et al., 2009). A modified Fisher 380 exact p-value cutoff < 0.001 was used to select significantly enriched terms (Eden et al., 2009). 381 382

RNA-Seq reads aligned to autosomes do not vary much between reference genomes 384
We compared total mapped reads when reads were aligned to a default reference genome and to a 385 reference genome informed on the sex chromosome complement. Reads mapped across the whole 386 genome, including the sex chromosomes, decreased when samples were aligned to a reference 387 genome informed on the sex chromosome complement, paired t-test p-value < 0.05 (Additional 388 files 7 -9). This was true regardless of the read aligner used, HISAT or STAR, or of the sex of the 389 sample, XY or XX. To test the effects of realignment on an autosome, we selected chromosome 390 8, because of its similar size to chromosome X. Overall, there is a slight mean increase in reads 391 mapping to chromosome 8 when samples are aligned to a sex chromosome complement informed 392 reference genome compared to aligning to a default reference genome (Additional file 9). For 393 female XX samples, the mean increase in reads mapping for chromosome 8 was 42.2 reads for 394 whole blood, 50.25 for brain cortex, 109.9 for breast, 68.5 for liver, and 98.2 for thyroid 395 (Additional file 9), which was significant using a paired t-test, p-value < 0.05 in all tissues 396 (Additional file 9). Male XY samples also showed a mean increase in reads mapping for 397 chromosome 8. The mean increase in reads mapping to chromosome 8 for male whole blood 398 19 samples was 0.84, 2.38 for brain cortex, 5.58 for breast, 3.2 for liver, and 5 for thyroid (Additional 399 file 9). There was a significant increase, p-value < 0.05 paired t-test, for reads mapping to 400 chromosome 8 for male brain cortex, breast, liver, and thyroid samples. There was no significant 401 increase in reads mapping for male whole blood for chromosome 8 (Additional file 9). 402 403 Reads aligned to the X chromosome increase in both XX and XY samples when using a sex 404 chromosome complement informed reference genome 405 We found that when reads were aligned to a reference genome informed by the sex chromosome 406 complement for both male XY and female XX tissue samples, reads on the X chromosome 407 increased by ~0.12% when aligned using HISAT. For all tissues and both sexes we observe an 408 average increase of 1,991 reads on chromosome X. We observe an increase in reads mapping to 409 the X chromosome for all tissues and for each sex, which was significant using a paired t-test, p- PCDH11X for female samples when aligned using HISAT was 0.4, 0.28, 0.33, 0.16, and 0.16 for 446 whole blood, brain cortex, breast, liver, and thyroid, respectively. Other X and Y homologous 447 genes sometimes increased in expression depending on the tissue and sometimes there was no 448 change in expression (Additional file 13). Next to PCDH11X, the most increase in expression in 449 an X and Y homologous genes was VCX3B, NLGN4X, and VCX3A. NLGN4X in whole blood 450 showed a 0.14 log2 fold increase when aligned using HISAT. VCX3B showed a 0.2 log2 fold 451 increase in brain, NLGN4X showed a 0.04 log2 fold increase in breast, VCX3A showed a 0.07 log2 452 fold increase in liver, and VCX3B showed a 0.04 log2 fold increase in thyroid, when aligned using 453 HISAT (Additional file 13). 454 455 A sex chromosome complement informed reference genome increases the ability to detect sex 456 differences in gene expression 457 We next investigated how this would affect gene differential expression between the sexes. 458 Generally, we find that more genes are differentially expressed on the sex chromosomes between 459 the sexes when the sex chromosome complements are taken into account. The number of 460 differentially expressed genes on the autosomes remained the same or increased. At a conservative 461 Benjamini-Hochberg adjusted p-value of < 0.01 and aligning with HISAT, we find 4 new genes 462 (3 Y-linked and 1 X-linked) that are only called as differentially expressed between the sexes in 463 the brain cortex when aligned to reference genomes informed on the sex chromosome complement 464 ( Figure 5; Additional file 14). We observed similar trends in changes for differential expression 465 between male XY and female XX for whole blood, breast, liver, and thyroid samples using either 466 HISAT or STAR as the aligner (Additional file 14). For example, in whole blood, 3 additional 467 genes are called as being differentially expressed between the sexes using HISAT, while 1 468 additional gene is called differentially expressed when aligned using STAR. Additionally, when 469 taking sex chromosome complement into account, the number of genes called as differentially 470 expressed between the sexes for the breast samples increased by 13 genes (8 autosomal, 3 X-linked 471 and 2 Y-linked) using HISAT and by 8 genes using STAR (6 autosomal and 2 X-linked) 472 The sex of each sample used in this analysis was provided in the GTEx manifest. We investigated 506 the expression of genes that could be used to infer the sex of the sample. We studied X and Y 507 Measurements of X chromosome expression increase for both male XY and female XX 566 whole blood, brain cortex, breast, liver, and thyroid samples when aligned to a sex chromosome 567 complement informed reference genome versus aligning to a default reference genome (Figure 4). 568 While we see increases in measured expression for PAR1 and PAR2 genes in both males and 569 females, we only observe a difference in measured XTR expression in females. This is because 570 while the PARs are 100% identical between the X and Y and so one copy (here we mask the Y-571 linked copy) should be masked, the XTR is not hard-masked in the YPARs-masked reference 572 genome. The XTR is not identical between the X and Y; it shares 98.78% homology between X 573 and Y but no longer recombines between X and Y (Veerappa et al., 2013) ( Figure 1A) and because 574 of this divergence, is therefore not hard-masked when aligning male XY samples. Tukiainen  The choice of read aligner has long been known to give slightly differing results of 594 differential expression due to the differences in the alignment algorithms (Conesa et al., 2016;595 Costa-Silva et al., 2017). Differences between HISAT and STAR could be contributed to 596 differences in default parameters for handling multi-aligning reads (Kim et al., 2015). We show 597 that regardless of choice of read aligner, HISAT or STAR, we observe similar results. Sample size 598 has also long been known to alter differential expression analysis (Ching et  Here we show that aligning RNA-Seq reads to a sex chromosome complement informed reference 652 genome will change the results of the analysis compared to aligning reads to a default reference 653 genome. We previously observed that a sex chromosome complement informed alignment is 654 important for DNA as well (Webster et al., 2019). A sex chromosome complement informed 655 approach is needed for a sensitive and specific analysis of gene expression on the sex chromosomes 656 (Khramtsova et al., 2018). A sex chromosome complement informed reference alignment resulted 657 in increased expression of the PARs of the X chromosome for both male XY and female XX 658 samples. We further found different genes called as differentially expressed between the sexes and 659 identified sex differences in gene pathways that were missed when samples were aligned to a 660 default reference genome. 661

Perspectives and Significance 662
The accurate alignment and pseudo-alignment of the short RNA-Seq reads to the reference genome 663 or reference transcriptome is essential for drawing reliable conclusions from differential 664 expression data analysis on the sex chromosomes. We strongly urge studies using RNA-Seq to 665 carefully consider the genetic sex of the sample when quantifying reads, and provide a framework 666 for doing so in the future (https://github.com/SexChrLab/XY_RNAseq). 667