A computational pipeline to identify gene regulatory networks for multiple cell types from single cell RNA sequencing data to understand sex-specific differences in gene expression
This study set out to investigate heterogeneity in the transcriptional regulation of sex-specific differences in tissues from the Tabula Muris dataset. A computational pipeline was developed to process the single cell RNA sequencing (scRNA-seq) data and identify gene regulatory networks that were tissue type and sex-specific (Fig. 1). Our study was built from ten cell types from the brain and heart samples in Tabula Muris [14] and restrictions to these ten cell types was motivated by the availability of adequate sample size of the data. Imputation of the data did not introduce spurious sex dimorphism as the library sizes after imputation increased proportionately for both female and male cells (Supplementary Figure 2). On the other hand, the overall data quality significantly improved as the imputed data better reflected characteristics of the negative binomial distribution (Supplementary Figure 3). While validating our imputed and normalized data, we found that only 0.12% of the brain cells and 0.57% of the heart cells (Supplementary Figure 1 and Supplementary Table 1) showed inconsistent grouping with the original cell type that they were annotated to. These cells were removed from subsequent analyses. The dominant source of variability in gene expression was driven by cell type differences and, overall, the cells did not show clear separation with respect to sex (Supplementary Figure 1). No evidence of mouse sample-driven gene expression pattern was observed (Supplementary Figure 1).
Identifying genes with changes in differential distribution versus differential expression between male and female mice to characterize sex dimorphism
We investigated sex-specific differential distribution of gene expression separately in each cell type using scDD [13]. For each gene, scDD first models the log-transformed nonzero expression values using a conjugate Dirichlet process mixture of normal distributions, which partitions the expression values and identifies the modality of a gene’s expression distribution. scDD categorizes differential distributions occurring between two groups as one of the following categories DE, DM, DP, DB, and NC (the “Methods” section and Supplementary Figure 4). In addition, logistic regression models can be used to assess whether genes showed a differential proportion of zeros between female and male groups after adjusting for the proportion of actively expressed genes in each cell.
In this study, significant DE genes classified by scDD were viewed as a separate category, since these genes represent differential regulation of gene expression in a standard sense. In contrast, the genes classified into the DM, DP, DB, DZ, and NC categories were the main focus of this study because of their altered non-standard distributions in the scRNA-seq profiles between males and females. We designated the union of genes classified as DM, DP, DB, DZ, and NC as DD genes and provide summary statistics for all genes in Supplementary Table 2. No significant sex-dimorphic expression of housekeeping genes, such as eukaryotic translation initiation factors and proteasome subunits (Supplementary Table 2), was observed.
Differentially expressed genes exhibit high cell type specificity
Significant DE genes between male and female mice were detected in all four brain cell types and all six heart cell types. These genes had unimodal distributions with different mean expression levels between sexes and were associated with biological processes that relate to tissue-specific regulation (Fig. 2a, b; Supplementary Table 3).
It is interesting to note that all cell types exhibited sex-specific differential expression in ribosome subunits. This result may serve as the foundation for a wide variety of differences in intracellular biochemical reactions by influencing the overall regulation of translation. Thirty-two genes coding for large ribosomal subunits and 31 genes coding for small ribosomal subunits had sex-biased expression in at least one cell type in this study, among which Rps25 was more highly expressed in female astrocytes, brain endothelial cells, microglia, oligodendrocytes, cardiac fibroblasts, cardiac smooth muscle cells, and cardiac leukocytes. Similarly, Rpl6, was more highly expressed in female brain endothelial cells, microglia, endocardial cells, cardiac leukocytes.
While most genes that were differentially expressed between the sexes were not shared between cell types some notable exceptions were observed (Fig. 2c, d). For example, the mitochondrial leucyl-tRNA synthetase, Lars2, was differentially expressed in all four brain cell types and four other heart cell types where males had significantly higher Lars2 expression, which could contribute to sex-dimorphic mitochondrial functions in multiple cell types. Long non-coding RNA Malat1 was more highly expressed in male endocardial cells, fibroblasts, heart leukocytes, and heart smooth muscle cells, suggesting widespread sex-dimorphic post-transcriptional regulation [25].
In oligodendrocytes, genes related to the myelin sheath and paranodal junction assembly exhibited sex-dimorphic gene expression levels, and various genes involved in the formation of these two pivotal structures of oligodendrocytes were differentially expressed. For example, neurofascin gene (Nfasc) was significantly upregulated in males compared to females [26]. Myelin sheath was also affected in microglial cells, which control remyelination in the central nervous system (CNS) around the axons [27]. In heart endothelial cells, DE genes were widely involved in structural formation of the cardiopulmonary system. Three genes, Ctnnb1, Notch1, and Nrp1 that regulate coronary artery morphogenesis were significantly upregulated in males. The, genes Actb, Atp5b, Gnai2, Gnas, Kcna5, Kdr, and Myadm, that are associated with the membrane raft were also upregulated in male heart endothelial cells [28, 29]. Given that the membrane raft of endothelial cells is essential for signal transduction through key signaling molecules and ion channels [30, 31], this may be a new source of additional regulation for sex dimorphism that requires further investigation.
In astrocytes, genes related to the tetraspanin-enriched microdomain, Tspan9 and Pdpn (cancer-related gene directly interacting with tetraspanin CD9 [32]), had significantly higher expression levels in male cells; hyaluronic acid binding-related genes also showed significant sex-biased expression, including Bcan which was more expressed in males and Ncan which was more expressed in females. Differential expression of these genes suggests potential sex dimorphism in astrocyte activation and that could translate to differences in response to injury.
In heart-infiltrating leukocytes, in addition to DE genes annotated to immune-related terms such as MHC class I protein complex and T cell receptor binding, the GO term, Schaffer-collateral-CA1 synapse, was also shown to be influenced by differential gene expression, suggesting an interplay between heart, neurons in the CNS, and sex differentiation. Examples of these DE genes with increased expression were Capzb and Srgn in females, as well as Actb and Ptpra in males. However, though leukocytes infiltration is associated with neural development in CNS [33], the potential association between cardiac leukocytes and cardiac nervous system has not been fully elucidated. In cardiac smooth muscle cells, myosin II complex and actomyosin structure organization were influenced by sex-biased gene expression, which may directly contribute to sex dimorphism in cardiac function. The heavy-chain coding gene Myh9 was upregulated in males while the light-chain coding gene Myl9 was upregulated in females. Trpm7, a cation channel coding gene has been reported to be responsible for magnesium homeostasis in vascular smooth muscle [34], was more highly expressed in males. In addition, the cardiac troponin I coding gene Tnni3 was more highly expressed in females, and this may reflect differential sensitivity to intracellular calcium flow during muscular force production [35].
It is possible that the unequal sample size in sex groups may influence the number of DE genes detected, and hence the thresholds used for statistical significance in our computational pipeline were set to be more stringent. Additionally, we noticed that when comparing the direction of the gene expression differences between male and female cells, one sex did not consistently dominate the other, across all cell types (Supplementary Figure 5). For example, brain endothelial cells had more DE genes that were upregulated in females (Supplementary Figure 5B), whereas in heart endothelial cells and fibroblasts (Fig. 2b and Supplementary Figure 5F), more DE genes were upregulated in males.
Scalable variability in distribution indicates widespread sex-dimorphic regulation of gene expression.
The set of DD genes identified in each cell type varied tremendously and there were very few DD genes that were common across different types of cells (Supplementary Figure 6 and Supplementary Table 4). Many conserved DD genes were related to carcinogenesis in brain cells which may contribute to the sex-specific predisposition to various types of brain tumor, including those based on oncogenes Nras [36] and Alkbh5 [37] and genes associated with tumor cell proliferation and/or metastasis in the brain such as Spry2 [38] and Cryab [39, 40]. Additionally, Tsix, antagonist and repressor of Xist, was also a conserved DD gene and identified across four brain cell types.
Results of GSVA indicate that DD genes play an important role in sex dimorphism that complement the pathways enriched for DE genes, though the underlying regulatory mechanisms are less clear and interpretable. Based on word cloud analysis of four cell types in which the number of significantly differentially regulated gene sets exceeded 100, we found that many gene sets were associated with inter-cellular heterogeneity and hence could explain the variability observed in gene expression, such as differentiation, development, and morphogenesis (Supplementary Figure 7A-D). There were many cell type-specific functions that were also represented.
Many of the DD genes in fibroblasts were related to transport, such as the transmembrane transport of fibroblast growth factor-related hormones (e.g., catecholamine, dopamine), cholesterol and various ions (e.g., calcium, chloride, potassium, sodium). This result is potentially relevant to sex dimorphism with respect to the regulation of fibroblast growth, fibroblast cholesterol biosynthesis, and regulation and membrane potential induction or maintenance. In contrast, in both brain and heart endothelial cells, many differentially represented gene sets were involved in signaling pathways and metabolic pathways. For example, heart endothelial cells had sex-dimorphic representation of collagen metabolism, which could be related to the stimulating effect of endothelium on collagen synthesis of cardiac fibroblasts [41].
GSVA also revealed findings about transcriptional regulation underlying sex specificity and inter-cellular heterogeneity from a new aspect. For example, using the most significantly sex-dimorphic gene sets in fibroblast, we observed that cells in male mice generally showed stronger activities in pathways related to keratinization, epidermis development (Supplementary Figure 7E), etc. Differentially distributed expression of cytokeratin (Krt17) [42] and stratifin Sfn [43] were responsible for the differential representation of these two pathways, both of which had significantly different proportions of zeros and have been shown to play an essential role in the differentiation of keratinocytes and fibroblast-keratinocyte interaction.
It is noteworthy that a considerable proportion of cells collected from female mice had these pathways even more strongly represented than in male cells, though they did not necessarily resemble their male counterparts in the expression pattern of every pathway. We posited that for certain biological processes (e.g., keratinization) that are critical for fibroblasts in both females and males, some female cells might upregulate the expression of specific genes so that the sex-dimorphic effect can be mitigated. In female cells, activities of many pathways seemed to be stronger than in male cells. The pathway for “hormone activity” is one example, where the differential proportion of zeros might be responsible for sex-dimorphic fibroblast growth (Supplementary Figure 7E). For example, peptide YY (coded by Pyy) is a gene that had differential proportion of zeros; it belongs to the gene set “hormone activity” and this gene is known to stimulate proliferation and collagen production of cardiac fibroblasts [44].
Deconvolution of differential distribution in gene expression to further explain cell-cell heterogeneity
We posited that DD genes could be used to identify new sub-cell types within cell populations. Previous cluster analysis and identification of cell types did not decompose inter-cell expression variance into variance caused by sex specificity and variance caused by cell type specificity. Intuitively, major cell types have distinct gene expression signatures that make them easily recognizable by canonical algorithms because the variance is mainly due to cell type specificity. However, when sex specificity is the predominant cause of the variance, identifying subpopulations of cell types is more challenging. We presume that DD genes would be able to account for the sex-specific variance without introducing new covariates. Then, further classification can focus on sub-cell type specificity and can be a promising approach to uncovering biologically meaningful cell groups.
We sub-classified fibroblast cells using DD genes as classifiers (the “Methods” section) and identified five cell clusters (Fig. 3a). Two clusters (0 and 1) were predominantly made up of female cells (97.0% of cluster 0 and 97.2% of cluster 1) whereas two clusters (2 and 3) were predominantly consisting of male cells (96.2% of cluster 2 and 97.8% of cluster 3; Fig. 3b). We found clusters 0 and 2 may have similar biological identities because the highly expressed unique marker genes of cluster 0 were moderately expressed in cluster 2 cells but not in other clusters, and vice versa (Fig. 3c). Likewise, clusters 1 and 3 also exhibited high similarity as determined by similar expression profiles of marker genes (Fig. 3c). Notably, one mixture cluster (cluster 4) of both female and male cells of comparable proportion exhibited different expression profiles compared to the other clusters. No evidence for differential expression of housekeeping genes and cell division or mitotic cell cycle-related genes was observed between female and male cells in clusters 0 and 2, clusters 1 and 3 and within cluster 4 (Supplementary Figure 8).
We then identified 238 marker genes that were differentially expressed between clusters 0 and 1, as well as 291 marker genes that were differentially expressed between clusters 2 and 3. Among the two lists of marker genes, 134 overlapped. These common marker genes were strongly associated with cellular components of extracellular space and basement membrane, and plasma membrane binding of crucial molecules, such as heparin, integrin, and calcium ion (Supplementary Table 5), which are essential for proliferation and cell-matrix interaction of fibroblast cells. Notably, marker genes that were most differentially expressed, as determined by FDRs, overlapped to a large extent (Fig. 3d), which suggests that the similar transcriptomic reshaping is associated with the diversification of cells in both females and males. In particular, Clec3b and Pi16, two genes known to be significantly upregulated in Col14a1 matrix fibroblasts uniquely [45], had both higher level of expression and larger proportion of actively expressing cells in cluster 0 and 2, suggesting similarity of these two clusters to Col14a1 matrix fibroblasts. In contrast, Id3 [46], Tmem176b [47], and Igfbp3 [48], three genes widely involved in cell proliferation and differentiation as well as cellular aging in fibroblasts, were significantly upregulated in cluster 1 and 3, implying that cells in these two clusters might be involved in a dynamic process of reproduction and differentiation.
Interestingly, though the marker genes for the distinct mixed-sex cluster 4 had exclusive and unique expression profiles, they were also associated with extracellular matrix/cell surface-related components and functions as mentioned above (Supplementary Table 6). This emphasizes the important role of signaling transduction, adhesion and cell-cell interaction in establishing cell heterogeneity of fibroblast. Cells in this cluster, with actively transcribed Kcna6 and Kcna1 (Supplementary Figure 9), coding for potassium voltage-gated channels Kv1.1 and Kv1.6, respectively, are highly likely to be responsible for the generation and regulation of voltage-dependent potassium currents in heart fibroblasts [49, 50]. Meanwhile, these cells expressed various proteins that are commonly found in neural/glial cells, including the artemin receptor component GFRA3, the neural cell adhesion molecule L1-like protein CHL1, and the solute carrier SLC35F1, suggesting a close relationship with the nervous system.
Gene regulatory networks identify hidden differential regulatory mechanisms and cell heterogeneity for sex-based differences
To identify potential drivers of sex-dimorphic transcriptional regulation, we augmented a Jack-knife-based PANDA framework to generate 100 female-specific networks and 100 male-specific networks in each of the cell types analyzed in this study (the “Methods” section). For each of the four ensembles of brain-specific cell type networks, interactions between 318 TFs and 9673 genes were quantified as edge weights which were subjected to parametric statistical tests. In the six ensembles of heart-specific cell type networks, interactions between 317 TFs and 9722 genes were measured and examined in the same manner. We subsequently summarized differential TF-regulation from four aspects: (1) genes on sex-specific edges, (2) TFs on sex-specific edges, (3) TFs that show differential levels of activity overall, and (4) genes that are differentially targeted by the overall TF activity.
While the number of network statistics, 100, used in each pair-wise Welch’s t test for differential edges was consistent, we observed dramatic changes in the number of significantly differentially represented edges across 10 cell types (Fig. 4a, b; and Supplementary Figure 10). Naturally, the number of genes and TFs involved in these differential edges also varied substantially. In most cases, genes on sex-specific edges did not show any overlap between females and males, namely most genes appeared exclusively in female-specific or male-specific edges, except for cardiac muscle cells where 487 genes were under both female-specific and male-specific regulatory control (Supplementary Table 7). The TFs responsible for sex-specific edges, on the other hand, showed strong overlap which seems plausible as a large proportion of TFs could be in control of both female-specific and male-specific edges simultaneously (Supplementary Table 7). Functional annotations of these genes that were differentially targeted by specific TFs provided further information underlying the regulation of tissue and cell type specificity (Supplementary Table 8 and Supplementary Table 9). Particularly, for cardiac muscle cells and endocardial cells where most genes were not differentially expressed, the genes on sex-specific edges were involved in many processes associated with specialized cellular functions. For example, in male cardiac muscle cells, genes related to positive regulation of tumor necrosis factor secretion were more strongly targeted by TFs than in female cardiac muscle cells (Fig. 4a). Sex-specific TF-gene binding pairs involved in this process included SP1-Fzd5, SP1-Cyp2j6, SP1-C1qtnf4, KLF4-Tlr2, NR5A2-Cd84, and ZFP281-Arid5a. In female endocardial cells, aortic valve morphogenesis is one of the processes that had stronger TF-targeting (Fig. 4b). Examples of sex-specific TF-gene binding pairs were SMAD3-Efna1, SP1-Emilin1, ZFP281-Emilin1, ASCL2-Rb1, SRF-Rb1, SP1-Slit3, SP1-Smad6, and ZFP281-Smad6.
While genes on differential edges exhibit high cell type specificity, TFs in control of the differential regulation through motif recognition and binding are highly conserved across cell types. From the illustration of sex-specific edges present in cardiac leukocytes (Fig. 4c), we observed several core nodes which are TFs with dense sex-specific edges, such as SP1, SP4, KLF4, ASCL2, and ZFP281. We further identified TFs that contribute the most to the differential edges by gathering information of all 10 types of cells under investigation (Fig. 5). These core TF families are ASCL2, EGR family (EGR1 and EGR2), GABPA, SP/KLF family (SP1, SP4, KLF4 and KLF7), RXRα, and members of the Zinc Finger family (ZFP281 and ZFP410). Despite the obvious variation in the number of differential edges and TF on differential edges, these TFs almost always initiate differential edges most frequently in the 10 cell types.
In addition, we identified TFs with differential overall activity which were quantified by summing the weights of edges extending from the same TF and illustrated these differentially targeting TFs using network ensembles of astrocytes and cardiac muscle cells (Fig. 6 and Supplementary Table 10). All TFs of previously identified TF families that accounted for the most differential edges also had differential overall activity in either of the two cell types shown, reinforcing their special role in relevant sex-dimorphic regulatory mechanisms. TFs belonging to the same TF family tended to have similar patterns in the intensity of activated regulation, as evidenced by the closeness of the unsupervised clusters. TFs are known to be universally responsible for sex determination and differentiation, and some TFs exhibited differential overall activity, including AR [51, 52], RAR [53], IRF [54], STAT [55, 56], and GATA [57]. Some TFs that have been known to be responsible for development in particular tissues were shown to have sex-dimorphic intensity of activity in the corresponding type of cell, such as MYC in cardiac muscle cells, which is able to regulate glucose metabolism and mitochondrial biogenesis [58, 59]. Meanwhile, TFs may also have opposite sex-specific activity in different cell types. For example, RARA had stronger overall activity in female astrocytes while also in male cardiac muscle cells.
We also identified genes that were overall differentially targeted by summing weights of edges pointing towards the same gene. Due to the stringency of our threshold, only in astrocytes did we identify a considerable number of significantly differentially targeted genes (Supplementary Table 11). These genes were significantly enriched for KEGG pathways that stood out in analyses above such as ribosome, axon, and oxidative phosphorylation, as well as terms pertaining to neuronal activities in GO term enrichment analysis (Supplementary Table 12), but also thermogenesis and processes associated with Huntington’s disease which were the top two most enriched pathways (Fig. 7b, c). A broad range of key receptors and enzymes related to thermogenesis were under differential regulation, including exogenous hypothalamic AMPK, cell membrane enzymes adenylate cyclase AC and monoglyceride lipase MGL, and mitochondrial enzyme acetyl-CoA synthetase ACS and the electron transport chain complexes, as well as the nuclear enzyme JMJD1A, which is a H3 lysine 9 demethylase promoting expression of thermogenic genes during acute cold stress [60]. Dynactin, which is core to dynein/dynactin-mediated vesicle transport, was more strongly targeted in females, while Clathrin which is responsible for Clathrin-mediated endocytosis was more strongly targeted in males (Fig. 7c), together suggesting differential regulation of transport. Moreover, in both cardiac muscle cells and heart smooth muscle cells, the chemokine receptor Ccr5 which has been shown to be a key factor to atherogenesis in vascular smooth muscle cells [61], was more strongly targeted in females than in males (Supplementary Table 11). In both cardiac muscle cells and endocardial cells, I7Rn6 which is responsible for maintaining normal functioning of clara cells in lung development [62], was more strongly targeted in females than males as well. The cell adhesion molecule, NrCAM, though mainly expressed in brain and endocrine tissues, had the coding gene differentially targeted in cardiac leukocytes. These differentially targeted genes might imply connection in developmental process between tissues (e.g., blood vessels and cardiac cells).
Instinctively, we would expect genes on sex-specific edges and/or genes that are overall differentially targeted to have a significant overlap with DD genes since ideally TFs should change the expression pattern of their targeted genes. In this study, genes on sex-specific edges are more likely to be simultaneously differentially distributed for brain endothelial cells, cardiac muscle cells, endocardial cells, fibroblast, and leukocytes (Supplementary Table 13). However, being differentially distributed is independent of being overall differentially targeted in astrocytes (Supplementary Table 14).