Abstract
Recent studies have begun to uncover the genetic architecture of educational attainment. We build on this work using genome-wide data from siblings in the National Longitudinal Study of Adolescent to Adult Health (Add Health). We measure the genetic predisposition of siblings to educational attainment using polygenic scores. We then test how polygenic scores are related to social environments and educational outcomes. In Add Health, genetic predisposition to educational attainment is patterned across the social environment. Participants with higher polygenic scores were more likely to grow up in socially advantaged families. Even so, the previously published genetic associations appear to be causal. Among pairs of siblings, the sibling with the higher polygenic score typically went on to complete more years of schooling as compared to their lower-scored co-sibling. We found subtle differences between sibling fixed-effect estimates of the genetic effect versus those based on unrelated individuals.
- polygenic score
- educational attainment
- Add Health
- genetic risk score
Genetics and the social sciences have endured a long and troubled partnership. At the beginning of the 20th century, eugenicists (including the father of modern quantitative genetics, R. A. Fisher) used their science to promote politics of racism, classism, and xenophobia (Tabery, 2008). By the end of the 20th century, things were not much better. Publication in 1994 of The Bell Curve was followed by contentious debate over the existence of and biological basis for a racial gradient in intelligence (Devlin, 1997; Neisser et al., 1996). The 21st century is off to a better start in the form of international collaboration among academic social scientists and geneticists, best embodied by the Social Science Genetic Association Consortium (SSGAC). The first large-scale endeavor of this group was to apply state-of-the-art methods typically used to hunt for genetic causes of common diseases to investigate the genetics of educational attainment (Rietveld et al., 2013). The members pooled data on more than 100,000 individuals from 42 different studies. To the surprise of many in the scientific community, they actually found something. Not only were they able to identify genetic variants that exhibited robust and replicable associations with educational attainment, they were able to construct a genome-wide “polygenic score” for educational attainment that predicted, albeit very weakly, how far an individual was likely to progress in his or her educational career (i.e., total years of schooling and/or whether he or she completed college).
This breakthrough finding raises an important question for social scientists who study educational attainment: What does a measure of genetic proclivity toward higher levels of educational attainment actually capture? Can one say with confidence that the genetics of educational attainment uncovered in Rietveld et al. (2013) operate independently of the social circumstances into which a child is born? And, if so, what are the mechanisms? That is, what are the personal attributes (e.g., endophenotypes) that develop from a “high education” genotype that in turn enable their holders to go farther in their educational careers?
To help address these questions, we conducted a sibling fixed-effects analysis among respondents in the National Longitudinal Study of Adolescent to Adult Health (Add Health) sibling pairs study. Differences in siblings’ genotypes arise from a random process similar to a lottery (variation in recombination and segregation of alleles during the meiosis that produces gametes). Our analysis tested whether the “winners” of within-family genetic lotteries completed more years of schooling as compared to their siblings. The use of an independent sample of sibling pairs for this type of inquiry provides three important contributions to the existing work in this area. First, we find strong evidence that recent discoveries made in genetic studies of educational attainment are nonspurious (i.e., not the result of environmental confounding) and represent more than the genetic signature of a privileged social group or groups. Second, features of children’s environments that promote educational attainment are correlated with their genetic endowments; such correlations may bias between-family estimates of genetic effects. Third, estimates of genetic influence on educational attainment from comparisons of siblings may differ in important respects from estimates based on individuals who do not share the same household. We also examined the potential bias that could arise if socioeconomic correlates of a person’s genetic inheritance are ignored, a question critical to any future translation of genetic discoveries into education research. Finally, we examined a putative mechanism or pathway by which this genotype–education relationship may hold: verbal intelligence as measured by a receptive vocabulary test.
The remainder of this introduction is split into four sections. We begin by introducing genome-wide data analysis and its application to the study of educational attainment. We then discuss polygenic scoring as an approach to translating results from genome-wide analysis into a tool for social science. In particular, we highlight vulnerabilities in polygenic scoring methods and ways of addressing them. Finally, we discuss population and social stratification that may confound inference and how the sibling difference may be used to bypass these confounding dynamics.
Genome-Wide Data Analysis and Its Application to the Study of Educational Attainment
Completions of the Human Genome Project and the International HapMap Project have given scientists the necessary tools to directly investigate human DNA and its relation to various traits and diseases. The current approach favored by geneticists for identifying DNA sequence variation associated with complex human traits is the genome-wide association study (GWAS). A GWAS is an inductive data-mining approach in which an outcome of interest (known as a phenotype) is analyzed for association with each of a large number of genetic variants selected to survey variation throughout the entire genome, most commonly, single-nucleotide polymorphisms (SNPs).1 To date, thousands of genome-wide analyses have been conducted on hundreds of traits and diseases, and many discoveries have been made (Welter et al., 2014). Most GWAS research falls within the biomedical domain, but the SSGAC was formed to apply the methods of a GWAS to the study of social phenomena. Their first large-scale project was a genome-wide association study of educational attainment (Rietveld et al., 2013). That GWAS, which analyzed data from more than 100,000 individuals, identified several SNPs that were associated with educational attainment even after strict adjustments for multiple hypothesis testing. Subsequent analysis has replicated these discoveries (Rietveld et al., 2014). The individual genetic variants discovered exhibited only very small effects on educational attainment, consistent with findings from GWASs of other complex traits ranging from body mass index to schizophrenia. But the results of the GWAS are not limited to the handful of SNPs identified. It is possible to combine information from all of the SNPs analyzed in the GWAS to calculate a “polygenic score” that summarizes genome-wide genetic predisposition to educational attainment.
Polygenic Scores as a Tool to Integrate GWAS Results Into Social Science Research
Polygenic scores (also known as genetic risk scores) summarize an individual’s cumulative genetic predisposition to a particular disease or trait. Scores aggregate information across a panel of SNPs according to associations identified in independent GWASs. Each SNP is scored by counting the number of disease-/trait-associated alleles and then multiplying that sum by a weight. The same weight may be used for all SNPs or some other value may be used, such as the coefficient estimated for the association between the SNP and the disease/trait in a GWAS. Then, the weighted allele counts are summed across the SNP panel. Polygenic scores can include all SNPs measured in a GWAS or some subset, typically defined by a p value threshold for the GWAS results (for a detailed discussion of polygenic scoring methods, see Dudbridge, 2013; Purcell et al., 2009; Wray, Goddard, & Visscher, 2007). As the number of SNPs included in a polygenic score increases, the score’s distribution rapidly approaches normality (Plomin, Haworth, & Davis, 2009). The capacity to integrate information from across the genome into a single index and the statistical properties of that index (i.e., continuous and normally distributed) have made polygenic scores an appealing tool for the integration of genetics in both biomedical and social sciences. For example, previous work has used polygenic scores to study the development of obesity, smoking, and asthma (Belsky et al., 2012; Belsky, Moffitt, Baker, et al., 2013; Belsky, Sears, et al., 2013; Domingue et al., 2014). The majority of polygenic scores can predict only a small percentage of the variance in traits of interest. However, it is thought that as GWAS samples increase in size along with the density of SNPs genotyped, so too will the predictive power of polygenic scores based on GWAS results (Conley, in press). 
In the case of human height, a trait measured with high precision, a GWAS of nearly one quarter million individuals recently generated a polygenic score predicting nearly 30% of population variance (Wood et al., 2014). Even with the small level of predictive power they do offer, polygenic scores provide a tool for beginning to pose and answer questions about the complex relationships that exist between genetics, environments, and the traits and behaviors of interest to the social sciences (Belsky & Israel, 2014; Conley, Domingue, Cesarini, Dawes, & Boardman, 2015).
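The scoring rule described in this section can be written compactly. The notation is assumed here for illustration and is not taken from the cited papers: x_ij is the count (0, 1, or 2) of trait-associated alleles that individual i carries at SNP j, w_j is the weight for that SNP, and J is the set of SNPs retained (all SNPs, or those below a chosen GWAS p value threshold).

```latex
S_i = \sum_{j \in J} w_j \, x_{ij}
```

Setting w_j = 1 for every SNP gives an unweighted allele count; setting w_j to the GWAS coefficient for SNP j gives the weighted score.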
Population Stratification and Ethnic Confounding of Genome-Wide Analysis
Substantively, GWASs test for covariance between allele frequencies and a trait of interest. When an association is detected, the inference is that the SNP (or, more likely, some other DNA sequence variant that lies nearby and is highly correlated with the SNP) causes a biological effect that in turn causes variation in the trait of interest. But there are other sources of covariance between allele frequencies and traits that can confound associations detected in a GWAS. A particularly pervasive source of confounding in a GWAS is “population stratification.” Population stratification is the nonrandom patterning of allele frequencies across global populations (Cardon & Palmer, 2003). These patterns may arise for any number of reasons, including major events, such as the departure of a select group from the African continent, and minor events of social construction, such as the erection of national boundaries that restrict contact between groups. The main consequence of population stratification for our purposes is that alleles whose frequencies differ between populations will be associated with any trait that varies systematically between those populations, even though the genetic variation may have nothing to do with the underlying reasons (which may be environmental) why the trait varies between them. To guard against confounding due to population stratification, GWASs typically use samples in which the respondents all report the same self-identified racial background (Cardon & Palmer, 2003).
The challenge of population stratification raises two important considerations for the integration of genome-wide data into social science research. First, it highlights the potential racial specificity of GWAS findings because the particular SNPs identified in a GWAS may be differently associated with the true causal loci due to differences in “linkage disequilibria” (e.g., Reich et al., 2001). This implies that a particular SNP measured in a GWAS may be highly correlated with an unmeasured causal variant in one population but not in another. An important first step for social scientists wishing to incorporate GWAS-derived genetic measurements into their own research designs is to evaluate cross-race replication of associations (Belsky, Moffitt, & Caspi, 2013; Belsky, Moffitt, Sugden, et al., 2013; Domingue et al., 2014). This is an especially important point because the SSGAC GWAS of educational attainment was conducted only in a European-descent sample.
The second consideration raised by the challenge of population stratification is that residual confounding may be present even within samples designed to be racially homogenous. Subtle, genome-wide allele frequency differences exist within even relatively narrowly defined European-descent populations (Nelis et al., 2009). Thus, at a minimum, statistical controls for population stratification are needed. The usual approach in the context of a GWAS is to estimate principal components from genome-wide SNP data and then use these as control variables in regression analysis (Price et al., 2006). Such principal components are only estimates, though. Therefore, an ideal control for population stratification is to conduct analyses that compare individuals who share the same ancestry, that is, family-based genetic analysis (Laird & Lange, 2006).
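The principal-components control described above can be sketched in a few lines. This is a minimal illustration on simulated allele counts, not the pipeline used for Add Health or in Price et al. (2006); the function name, the simulated allele frequencies, and the two-population setup are all assumptions made for the example.

```python
import numpy as np

def genotype_principal_components(genotypes, n_components=10):
    """Top principal-component scores of an (individuals x SNPs) allele-count matrix."""
    g = np.asarray(genotypes, dtype=float)
    # Center each SNP and scale by its standard deviation, dropping monomorphic SNPs.
    centered = g - g.mean(axis=0)
    sd = centered.std(axis=0)
    keep = sd > 0
    z = centered[:, keep] / sd[keep]
    # SVD of the standardized matrix: columns of u * s are the component scores.
    u, s, _ = np.linalg.svd(z, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

# Simulated example: two groups whose allele frequencies differ slightly at every SNP.
rng = np.random.default_rng(0)
n_per_group, n_snps = 100, 500
freq_a = rng.uniform(0.1, 0.9, n_snps)
freq_b = np.clip(freq_a + rng.normal(0, 0.15, n_snps), 0.05, 0.95)
geno = np.vstack([
    rng.binomial(2, freq_a, size=(n_per_group, n_snps)),
    rng.binomial(2, freq_b, size=(n_per_group, n_snps)),
])
group = np.repeat([0, 1], n_per_group)

pcs = genotype_principal_components(geno, n_components=10)
# The first component tracks group membership, which is why it can serve as a control.
pc1_group_corr = abs(np.corrcoef(pcs[:, 0], group)[0, 1])
```

In a regression, the columns of `pcs` would be entered alongside the polygenic score as covariates. Because the components are estimated from the data, they absorb ancestry differences only approximately, which is the limitation noted in the text.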
Social Stratification and Environmental Confounding of Genome-Wide Analysis
To the extent that GWASs are able to uncover molecular roots of behavioral phenomena, there are important challenges to address in establishing the magnitudes of the effects of genetic influences. A primary challenge is that polygenic influences will be correlated among family members; any genetic predisposition to social attainment will be shared between parents and children. Thus a child’s genetic and social inheritances will be correlated (e.g., Boardman, Domingue, & Fletcher, 2012). Attempts to quantify genetic effects must therefore account for social differences between children. One method is to measure and control for features of children’s environments, such as characteristics of their families and neighborhoods. But in parallel to the limitations with using principal components to control population stratification, such methods depend on the quality and completeness of the measurements of children’s environments. An alternative is to conduct within-family analysis via sibling fixed effects. Full siblings in a family share—to a large degree—parents, housing, neighborhoods, schools, and so on. And as discussed above, their genetic differences are essentially randomly assigned. Siblings thus provide ideal controls for establishing magnitudes of genetic effects on social attainments.
Here, we test the effects of a polygenic score related to educational attainment derived from a GWAS in a nationally representative sample of siblings. We then evaluate correlations between genetic and social determinants of educational attainment. We next estimate genetic effects after controlling for select measured features of children’s social environments. Finally, we submit genetic effect estimates to the acid test of a sibling comparison. We evaluate whether genetic effects on educational attainment operate in a similar manner within families and across children in the population. We also test whether genetic effects are accounted for by a common measure of academic aptitude, verbal intelligence.
Materials
Sample
Add Health is a nationally representative cohort drawn from a probability sample of 80 U.S. high schools and 52 U.S. middle schools, representative of U.S. schools in 1994–1995 with respect to region, urban setting, school size, school type, and race or ethnic background (n = 20,745, ages 12–20 years at Wave 1 in 1994–1995). The Waves 3 (2001–2002) and 4 (2008–2009) data collections included n = 15,197 individuals (then ages 18–26 years, mean age 22.3 years) and n = 15,701 individuals (then ages 24–32 years, mean age 28.9 years), respectively. The Add Health study includes an oversample of siblings (Harris, Halpern, Haberstick, & Smolen, 2013). The sibling pairs sample was genotyped (via Oragene saliva collection) with the Illumina Human Omni Quad chip at Wave 4 of the study (see McQueen et al., 2014, for details). We use this genome-wide data to construct polygenic scores for study participants.
Patterns of linkage disequilibrium (LD) vary considerably across socially defined racial and ethnic groups, and this is particularly evident when comparing the correlated genotype structures of individuals of European ancestry to those of individuals of African ancestry (Price, Zaitlen, Reich, & Patterson, 2010). Specifically, there is more genetic variation among those of African ancestry (Li et al., 2008; Rosenberg et al., 2002), which reduces LD (i.e., the correlation between neighboring SNPs) and thus creates problems for comparing the effects of SNPs across groups, a problem compounded when creating genome-wide polygenic scores. We therefore analyzed genetic associations separately for European and African Americans.
The 917 European Americans (EAs) in our analytic sample are in 386 sibling pairs and 12 sibling trios, with an additional 109 singletons. The 677 African Americans (AAs) are in 100 sibling pairs and four trios, with an additional 465 singletons. Table 1 shows characteristics of the EA and AA sibling pairs study participants who provided genetic data and constitute our analytic sample. The table also shows characteristics of the full Add Health EA and AA samples for comparison. The EAs in our analytic sample are largely comparable to the full population of EA respondents in the Add Health study. The AAs in our sample are less educated, have less educated parents, and score lower on the verbal intelligence measure as compared to all AA Add Health participants. The bulk of our analysis is focused on the EA sample because the original Rietveld et al. (2013) GWAS was conducted on European-descent individuals. Replication of polygenic scores discovered in EA samples among AA samples may be compromised because LD differences in the groups lead to less precision among AA samples. Accordingly, large-scale GWASs of educational attainment in African Americans will be needed to better quantify genetic influences on attainment in this population. Nevertheless, in the interest of testing the extent to which findings made in European-descent individuals replicate in a different population, we conduct several analyses of the AA sample. Due to the small number of AA sibling pairs in the data, sibling analyses are conducted only in EAs.
Table 1. Key Descriptive Statistics Comparing the Full Add Health Cohort, the European American (EA) and African American (AA) Subsamples of That Cohort, and the Genotyped EA and AA Siblings (Sibs) That Are the Focus of This Analysis
Measures
Educational attainment
We measured educational attainment as the highest degree completed by the time of interview at Wave 4 when respondents were asked, “What is the highest level of education that you have achieved to date?” Response options and their numeric values (in parentheses) were eighth grade or less (8), some high school (10), high school graduate (12), some vocational/technical training (13), completed vocational/technical training (14), some college (14), completed college (16), some graduate school (17), completed a master’s degree (18), some graduate training beyond a master’s degree (19), completed a doctoral degree (20), some post-baccalaureate professional education (18), and completed post-baccalaureate professional education (19). EA respondents in our genetic sample completed 14.2 years of schooling on average (SD = 2.2) by Wave 4. Of the sibling pairs, 64% varied in their educational attainment (mean difference = 1.7 years). AA respondents in our genetic sample completed 13.5 years of schooling on average (SD = 2.2).
Parental education
At the first wave of data collection, parents of respondents (over 90% were females) responded to a question asking, “How far did you go in school?” Potential responses and their numeric codes (in parentheses) included eighth grade or less (8), more than eighth grade but did not graduate from high school (10), went to vocational school in place of high school (10), high school graduate (12), GED (12), vocational school after high school (13), attended college (14), graduated college (16), and training beyond college (18). EA parents of participants in our genetic sample reported completing 13.5 years of schooling on average (SD = 2.1). AA parents completed 12.6 years of schooling on average (SD = 2.2). Participants with more educated parents went on to complete more years of schooling (r = .42 in the EA sample; r = .32 in the AA sample; see Table 2).
Table 2. Correlations Between Educational Attainment, Polygenic Score, and Other Key Variables in EA and AA Samples
Neighborhood disadvantage
The Add Health Study used respondents’ residential addresses at the time of Wave 1 data collection to link individuals with data describing the U.S. Census block group where they lived. We used contextual variables from this data set to measure the socioeconomic and sociodemographic characteristics of the neighborhoods in which Add Health respondents were living at the time of the baseline interview in adolescence (see online supplement). By design, measured neighborhood disadvantage was associated with educational attainment (r = −.35 for EA respondents), although this association was weaker for AA respondents (r = −.14).
Verbal intelligence
Verbal intelligence was measured at Wave 1 (when Add Health participants were 12–20 years old) via a modified version of the Peabody Picture Vocabulary Test (Dunn & Dunn, 1981, 1997), a test of receptive vocabulary (M = 103.9, SD = 11.1, for EA; M = 91.6, SD = 13.8, for AA). Respondents who scored higher on the vocabulary test went on to complete more years of schooling (r = .36 in both EA and AA samples).
Educational attainment polygenic score
After quality controls (see online supplement), the genetic database included 1,886 individuals with valid data on 940,862 SNPs. Polygenic scores for educational attainment were calculated for each sibling pairs participant using the results of the SSGAC meta-analysis of GWASs of educational attainment (Rietveld et al., 2013). Briefly, SNPs in the Add Health sibling pairs genetic database were matched to SNPs with reported results in the GWAS. For each of these SNPs, a loading was calculated as the number of educational attainment–associated alleles multiplied by the effect size estimated in the original GWAS. Loadings were then summed across the SNP set to calculate the polygenic score. Additional details on the construction of this variable, as well as a sensitivity analysis, are included in the online supplement. We standardized the polygenic score to have M = 0, SD = 1, separately within the EA and AA samples. Scores were normally distributed (Figure S1). The mean sibling difference in polygenic scores in the EA sample was 0.8.
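The construction steps in this paragraph (match SNPs, compute loadings, sum, standardize within sample) can be sketched as follows. This is an illustrative outline under stated assumptions, not the code used for the study: the function names, SNP identifiers, and effect sizes are hypothetical, and quality control and allele alignment are omitted.

```python
import numpy as np

def polygenic_score(allele_counts, snp_ids, gwas_effects):
    """Sum of (associated-allele count x GWAS effect size) over matched SNPs.

    allele_counts: (individuals x SNPs) array of 0/1/2 counts, columns ordered as snp_ids
    gwas_effects:  dict mapping SNP id -> effect size from the discovery GWAS
    """
    counts = np.asarray(allele_counts, dtype=float)
    # Keep only SNPs that also have reported GWAS results.
    matched = [j for j, snp in enumerate(snp_ids) if snp in gwas_effects]
    weights = np.array([gwas_effects[snp_ids[j]] for j in matched])
    # Loading = allele count x effect size; the score is the sum of loadings.
    return counts[:, matched] @ weights

def standardize_within(scores, groups):
    """Rescale scores to mean 0, SD 1 separately within each group label."""
    scores = np.asarray(scores, dtype=float)
    groups = np.asarray(groups)
    out = np.empty_like(scores)
    for g in np.unique(groups):
        mask = groups == g
        out[mask] = (scores[mask] - scores[mask].mean()) / scores[mask].std(ddof=1)
    return out

# Toy data: four people, three genotyped SNPs, two of which have GWAS results.
counts = np.array([[0, 1, 2], [2, 1, 0], [1, 1, 1], [2, 2, 0]])
snps = ["rsA", "rsB", "rsC"]            # hypothetical SNP ids
effects = {"rsA": 0.02, "rsC": 0.01}    # hypothetical effect sizes; rsB is unmatched
raw = polygenic_score(counts, snps, effects)
z = standardize_within(raw, ["EA", "EA", "AA", "AA"])
```

Standardizing within each sample means a one-unit change in the score is one within-sample standard deviation, so regression coefficients on the score are comparable in scale, though not in interpretation, across the EA and AA analyses.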
Analysis
Our analysis used three models to test associations between Add Health participants’ polygenic scores and their educational attainments. The youngest participants were age 24 at the time of the most recent data collection, and some may not have completed their education (Figure S1 contains a comparison of birth year and educational attainment). All models were adjusted for year of birth to account for any differences in educational attainment due to age at the time of follow-up. Models 1 and 2 are also adjusted for the first 10 principal components estimated from the genome-wide SNP data to account for any population stratification in our analytic sample (McQueen et al., 2014).
The first model estimated the association between polygenic score and educational attainment in the pooled sample of sibling pairs. Model 1 takes the form
The estimate of the genetic effect is denoted βU, where the subscript emphasizes the fact that the estimate comes from an approach in which the respondents are treated as unrelated individuals. The sibling structure of the data was accounted for by clustering standard errors within families (Zeileis, 2004), but this does not affect point estimates. Model 1 approximates the approach being used by many social scientists seeking to integrate genetic information into analyses of educational attainment (e.g., de Zeeuw et al., 2014; Ward et al., 2014).
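The point that clustering changes standard errors but not point estimates can be checked with a short sketch. This is a plain sandwich estimator written out in NumPy on simulated data, without small-sample corrections; it is not the implementation cited above, and the function name, the simulated slope of 0.4, and the data-generating setup are assumptions for illustration.

```python
import numpy as np

def ols_cluster_se(X, y, clusters):
    """OLS coefficients with conventional and cluster-robust (sandwich) standard errors."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    clusters = np.asarray(clusters)
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y              # point estimates: unaffected by clustering
    resid = y - X @ beta
    n, k = X.shape
    # Conventional variance, assuming independent errors.
    se_iid = np.sqrt(np.diag(xtx_inv) * (resid @ resid) / (n - k))
    # "Meat" of the sandwich: sum over families of (X_g' u_g)(X_g' u_g)'.
    meat = np.zeros((k, k))
    for g in np.unique(clusters):
        rows = clusters == g
        score = X[rows].T @ resid[rows]
        meat += np.outer(score, score)
    se_cluster = np.sqrt(np.diag(xtx_inv @ meat @ xtx_inv))
    return beta, se_iid, se_cluster

# Simulated sibling pairs: predictor and error both share a family component.
rng = np.random.default_rng(0)
n_fam = 500
family = np.repeat(np.arange(n_fam), 2)   # two siblings per family
x = rng.normal(size=n_fam)[family] + 0.5 * rng.normal(size=2 * n_fam)
u = rng.normal(size=n_fam)[family] + 0.5 * rng.normal(size=2 * n_fam)
y = 0.4 * x + u                           # 0.4 is an arbitrary illustrative slope
X = np.column_stack([np.ones_like(x), x])

beta, se_iid, se_cluster = ols_cluster_se(X, y, family)
```

When siblings resemble each other on both the predictor and the unexplained part of the outcome, the clustered standard error is larger than the conventional one, while `beta` is identical to ordinary least squares.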
A limitation to Model 1 is that βU may be biased away from zero due to confounders that covary with the genetic score across families (environmental stratification, as discussed in the introduction). For example, children share half of their DNA with each parent. Thus, a child’s polygenic score will be positively correlated with their parents’ scores. If the polygenic score is causally related to educational attainment, then children with high scores will tend to have better-educated parents as compared to children with low scores. As a consequence, they are likely to grow up in quite different environments. βU may therefore capture not just a genetic effect but also the effects of environmental advantages that are associated with the child’s genotype (i.e., parents with more education and the economic and social resources that come with it). The geocoded Add Health contextual data allow us to test this hypothesis by fitting a second model that statistically controls for differences in adolescents’ environments that may be correlated with their polygenic scores. Model 2 takes the form
where ν and ω adjust for differences between adolescents’ parental and neighborhood characteristics. We also consider models where ν and ω are independently constrained to be zero (Models 2A and 2B, respectively).
A limitation of Model 2 is that it cannot account for unmeasured features of families and neighborhoods that are correlated with children’s genotypes. Therefore, we fit a third model that utilized the family structure of the data to generate a sibling fixed-effect estimate that fully controls for parental genotype and attainments and also for any neighborhood or environmental characteristics that may vary across families. Model 3 takes the form
where Ik(i) is 1 if individual i is in family k and 0 otherwise (and one family, k = 1, is excluded as the reference). This sibling comparison model leverages the genetic lotteries that occur within families. Estimates of βW represent the educational advantage enjoyed by the sibling who “wins” a hypothetical family’s genetic lottery. Because the estimate is based on comparing siblings, any parental, neighborhood, or school factors that are shared by siblings in a family are controlled by the design of the model.
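Written out, the three specifications described in this section might look as follows. This is a sketch consistent with the verbal description rather than a definitive statement of the fitted models: only βU, ν, ω, Ik(i), and βW are symbols used in the text. The outcome E (years of schooling), PGS (standardized polygenic score), BY (birth year), PC (the 10 principal components), P (parental education), N (neighborhood disadvantage), the remaining coefficient names, and the reuse of βU in Model 2 are all assumed notation.

```latex
\begin{align}
\text{Model 1:}\quad E_i &= \alpha + \beta_U\,\mathrm{PGS}_i + \gamma\,\mathrm{BY}_i
  + \sum_{j=1}^{10}\delta_j\,\mathrm{PC}_{ij} + \varepsilon_i \\
\text{Model 2:}\quad E_i &= \alpha + \beta_U\,\mathrm{PGS}_i + \nu\,P_i + \omega\,N_i
  + \gamma\,\mathrm{BY}_i + \sum_{j=1}^{10}\delta_j\,\mathrm{PC}_{ij} + \varepsilon_i \\
\text{Model 3:}\quad E_i &= \alpha + \beta_W\,\mathrm{PGS}_i + \gamma\,\mathrm{BY}_i
  + \sum_{k=2}^{K}\theta_k\,I_k(i) + \varepsilon_i
\end{align}
```

Models 2A and 2B correspond to setting ν = 0 and ω = 0, respectively. In Model 3 the family indicators absorb everything siblings share, so βW is identified only from differences between siblings.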
Results
Did Adolescents With Higher Polygenic Scores Complete More Years of Schooling?
Adolescents with higher polygenic scores went on to complete more years of schooling as of the most recent follow-up, when they were in their 20s and 30s. The genetic effect in our U.S. sample of EA respondents was small in magnitude (r = .18; see Table 2), consistent with published estimates from samples in the United Kingdom and the Netherlands (de Zeeuw et al., 2014; Ward et al., 2014). In years of educational attainment, this correlation is equivalent to a predicted increase of 0.41 years for an increase of one standard deviation in the polygenic score. In our European-descent sample, we detected little evidence that population stratification confounds genetic effects as estimated effect sizes for the polygenic score were similar when models were fitted without adjustment for population structure: Our base Model 1 estimated that each standard deviation increase in an adolescent’s polygenic score forecast his or her completion of over one third of 1 year of additional schooling (βU = 0.37, SE = 0.08, p < .001; see Table 3). In comparison, having a mother who graduated college was associated with an additional 1.7 years of schooling.
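The conversion from the correlation to years of schooling follows from the usual relation between a correlation and a bivariate regression slope. As a rough check using the rounded values reported in this article (r = .18, SD of schooling = 2.2 years in the EA sample, polygenic score standardized to SD = 1):

```latex
b = r \cdot \frac{SD_{E}}{SD_{\mathrm{PGS}}} = 0.18 \times \frac{2.2}{1} \approx 0.40 \ \text{years per SD of polygenic score}
```

This is close to the 0.41 years reported; a gap of this size is what one would expect from rounding in r and SD.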
Table 3. Model Estimates of Polygenic Score on Educational Attainment
We repeated this analysis in the AA sibling pairs. The genetic effect was smaller in AAs but remained statistically significant (r = .11, p < .01). In real terms, after controlling for population structure, Model 1 suggests that each standard deviation increase in polygenic score forecast their completion of about one fifth of 1 year of additional schooling (βU = 0.20, SE = 0.09, p = .02).
Were Adolescents’ Social Environments Related to Their Genetic Inheritance?
We next tested the potential for environmental confounding of genetic associations. In the EA sample, we did not detect a statistically significant relationship between participants’ polygenic scores and their mothers’ educational attainments. In contrast, in the AA sample, participants with higher polygenic scores tended to have better-educated mothers (r = .12, p < .01). This pattern of findings was reversed when we analyzed genetic associations with neighborhood disadvantage. EA participants with higher polygenic scores tended to live in more socially advantaged neighborhoods (r = −.13, p < .001), whereas AA participants’ polygenic scores were not related to the social circumstances of their neighborhoods. These findings show that genetic predisposition to educational attainment was socially stratified in both EA and AA respondents, although they suggest differences in the nature of that social stratification.
We next tested whether genetic associations with educational attainment could be accounted for by measured social environmental differences. We repeated our genetic analysis of educational attainment, this time adding statistical adjustments to account for maternal education and neighborhood disadvantage. For the EA respondents, adding controls for parental education and neighborhood disadvantage one at a time attenuated genetic effect estimates by roughly 20% (for a model controlling neighborhood disadvantage, Model 2A, estimate = 0.30, SE = 0.07, p < .001; for a model controlling maternal education, Model 2B, estimate = 0.29, SE = 0.08, p < .001). When both maternal education and neighborhood disadvantage were included in the model together, the genetic effect was reduced by roughly 30% (estimate = 0.26, SE = 0.07, p < .001). We repeated this analysis in the AA sample. Because neighborhood disadvantage showed no distinguishable association with the polygenic score, we focus on Model 2B, which adjusts the effect of the polygenic score for parental education. After we included controls for maternal education in Model 2B, the estimated coefficient for the polygenic score was not statistically significant (estimate = 0.14, SE = 0.09, p = .12).
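The attenuation percentages quoted for the EA sample follow directly from the reported coefficients. A small check, using only the estimates given in the text (the helper name is ours):

```python
def percent_attenuation(base, adjusted):
    """Percentage by which a coefficient shrinks relative to its unadjusted value."""
    return 100.0 * (base - adjusted) / base

base = 0.37  # Model 1 estimate, EA sample
reported = [("Model 2A", 0.30), ("Model 2B", 0.29), ("Model 2", 0.26)]
attenuation = {label: percent_attenuation(base, est) for label, est in reported}

for label, pct in attenuation.items():
    print(f"{label}: {pct:.0f}% smaller than Model 1")
# Model 2A: 19% smaller than Model 1
# Model 2B: 22% smaller than Model 1
# Model 2: 30% smaller than Model 1
```

These figures match the "roughly 20%" and "roughly 30%" reductions described above. They describe changes in point estimates only and say nothing about whether the differences between models are statistically distinguishable.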
Differences between adolescents’ polygenic scores also reflect genetic differences between their families. Correlations of polygenic scores between parents and children have been estimated as high as r = .60 (Conley et al., 2015). In our sample, the correlation between EA siblings’ polygenic scores is r = .53. Families with higher polygenic scores could achieve higher degrees and acquire the resources to move into more advantaged neighborhoods on the strength of their genetic endowments. As a result, interpretation of the attenuation of genetic effects from Model 1 to Model 2 is not straightforward. We therefore moved to the sibling comparison model, in which adolescents’ social environments are equal by design and genetic differences between individuals are randomly assigned by the “lottery” of meiosis.
Within a Family, Did the Sibling With the Higher Polygenic Score Achieve Higher Educational Attainment?
We expected that our Model 3 sibling fixed-effect estimate would be similar to our Model 2 estimates. Surprisingly, the sibling-difference genetic effect was of nearly the same magnitude as the base model estimate (βW = 0.35, SE = 0.11, p < .01). This result suggests two things. First, genetic associations with educational attainment are nonspurious, that is, not confounded by social environmental differences that correlate with adolescents’ polygenic scores. Second, sibling-based analyses may be subtly different from analysis of unrelated samples. We discuss the substance and implications of these differences below.
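Why the sibling comparison is informative can be illustrated with simulated data (not Add Health data). In the sketch below, a family-level environment is made to track family genetics, so the pooled slope is inflated while the within-family slope recovers the effect built into the simulation. All names, sample sizes, and parameter values, including the 0.35 "true" effect, are assumptions chosen for the example.

```python
import numpy as np

def pooled_slope(score, outcome):
    """Slope from regressing outcome on score, treating everyone as unrelated."""
    s = score - score.mean()
    return (s @ (outcome - outcome.mean())) / (s @ s)

def within_family_slope(score, outcome, family):
    """Slope after subtracting family means (equivalent to family fixed effects)."""
    s_dev = np.empty_like(score)
    y_dev = np.empty_like(outcome)
    for k in np.unique(family):
        rows = family == k
        s_dev[rows] = score[rows] - score[rows].mean()
        y_dev[rows] = outcome[rows] - outcome[rows].mean()
    return (s_dev @ y_dev) / (s_dev @ s_dev)

rng = np.random.default_rng(1)
n_fam = 2000
family = np.repeat(np.arange(n_fam), 2)                  # two siblings per family
fam_genetic = rng.normal(0, np.sqrt(0.5), n_fam)         # component shared by siblings
score = fam_genetic[family] + rng.normal(0, np.sqrt(0.5), 2 * n_fam)  # sibling r near .5
fam_env = 0.5 * fam_genetic + rng.normal(0, 1, n_fam)    # environment tracks family genetics
true_effect = 0.35                                       # assumed for the simulation
outcome = true_effect * score + fam_env[family] + rng.normal(0, 1.5, 2 * n_fam)

pooled = pooled_slope(score, outcome)                    # inflated by the family environment
within = within_family_slope(score, outcome, family)     # close to true_effect
```

In this simulated setting the pooled slope overstates the effect because it also picks up the shared environment. A pooled estimate that is close to the within-family estimate, as reported above, is what one would expect if confounding of this kind were small.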
Do Genetic Effects Operate via Influence on Verbal Intelligence?
Published analyses suggest that genetic influence on educational attainment may be mediated by higher intellectual functioning; that is, children with higher polygenic scores complete more schooling because they are cognitively more able (e.g., Rietveld et al., 2013). We found evidence to support this hypothesis in our models analyzing unrelated adolescents. Our analysis here focused on the subset of 877 EA respondents with data on the modified Peabody Picture Vocabulary Test of verbal intelligence in Add Health Wave 1. Adolescents with higher polygenic scores did better on the verbal intelligence test (r = .14, p < .001; see Table 2). In turn, adolescents with higher verbal scores went on to complete more schooling (r = .36, p < .001). When we repeated the Model 1 analysis of the association between an adolescent’s polygenic score and his or her educational attainment, this time adding the verbal intelligence score as a covariate, the genetic effect was attenuated (b = 0.25, p < .001, compared to b = 0.36, p < .001). This result suggests that about one third of the genetic association with educational attainment is attributable to genetic influence on the development of verbal intelligence. However, the statistical test for the difference in coefficients fails to reach conventional significance levels.
We next subjected the mediation hypothesis to the rigorous test of the sibling comparison model. There was a relatively weak association between the difference in sibling polygenic scores and the difference in sibling verbal intelligence (r = .07, p = .18). However, the difference in sibling verbal intelligence was correlated with differences in attainment (r = .22, p < .001). When we repeated our analysis of the within-sibling association between polygenic score and educational attainment, this time adding Peabody score as a covariate, the coefficient was only modestly (and insignificantly) attenuated (b = 0.31, SE = 0.12, p < .01, compared to b = 0.35, SE = 0.12, p < .01). This result suggests that very little of the genetic effect on sibling differences in educational attainment is attributable to sibling differences in verbal intelligence.
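The size of the attenuation in each analysis can be checked with simple arithmetic on the reported coefficients. The sketch below uses a difference-in-coefficients summary, which is a descriptive shortcut and an assumption here, not necessarily the formal test the analysis relied on. The function name is hypothetical.

```python
def proportion_attenuated(b_unadjusted, b_adjusted):
    """Share of the unadjusted coefficient that disappears once the
    candidate mediator is added as a covariate."""
    if b_unadjusted == 0:
        raise ValueError("unadjusted coefficient must be nonzero")
    return (b_unadjusted - b_adjusted) / b_unadjusted


# Coefficients as reported in the text above.
between_family = proportion_attenuated(0.36, 0.25)  # about 0.31
within_family = proportion_attenuated(0.35, 0.31)   # about 0.11

if __name__ == "__main__":
    print(f"between-family attenuation: {between_family:.2f}")
    print(f"within-family attenuation:  {within_family:.2f}")
```

On these figures, adding verbal intelligence removes roughly three times as much of the genetic coefficient between families as it does within families.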
We discuss several plausible explanations for these divergent results based on between- and within-family analyses. First, it could be that intelligence-score differences between siblings contain relatively less information than score differences between unrelated individuals. This could occur if there were less true score variance within sibships. If this were true and variance due to random measurement error remained constant across the two types of comparisons, then there may be a reduced reliability of the sibling difference score—that is, the ratio of signal to noise would be lower for the sibling analysis. It could also be the case that sibling analysis captures nonrandom measurement error, that is, mean-regressive error, which may occur if siblings deemphasized their verbal differences (consciously or unconsciously) when tested. This would not change the reliability of the family average (thus the point estimate for the between-family analysis would be unaffected); however, it would lead to attenuation bias in the within-family analysis. A final potential explanation is that the mechanisms linking genes to educational attainment could be different for unrelated individuals compared to siblings. Twin studies suggest that traits other than intelligence (e.g., personality) may mediate genetic influences on educational attainment (Krapohl et al., 2014), and these traits may play a larger role in producing differences between siblings. We return to this divergence in results in the Discussion.
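The first explanation, reduced reliability of sibling difference scores, follows from classical test theory. The sketch below assumes that both siblings' test scores have the same variance and the same reliability and that measurement errors are uncorrelated across siblings. The input values are hypothetical and are not estimates from Add Health.

```python
def difference_score_reliability(reliability, sibling_correlation):
    """Reliability of the difference between two siblings' scores.

    Classical test theory result under equal variances, equal
    reliabilities, and uncorrelated errors:
        (reliability - r) / (1 - r)
    where r is the correlation between siblings' observed scores.
    """
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie in [0, 1]")
    if not -1.0 < sibling_correlation < 1.0:
        raise ValueError("sibling_correlation must lie in (-1, 1)")
    if sibling_correlation > reliability:
        raise ValueError("observed correlation cannot exceed reliability "
                         "under these assumptions")
    return (reliability - sibling_correlation) / (1.0 - sibling_correlation)


if __name__ == "__main__":
    # Hypothetical values: a test with reliability 0.8 and siblings whose
    # observed scores correlate at 0.4.
    print(round(difference_score_reliability(0.8, 0.4), 3))
```

Under these assumptions, the more similar siblings' scores are, the less reliable their difference becomes, even though the reliability of each individual score is unchanged.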
Sensitivity Analyses
The strength of the sibling analyses is that factors that do not vary across siblings are eliminated as potential confounders. One clear difference between siblings, which previous studies have related to attainment, is their birth order (Black, Devereux, & Salvanes, 2006; Conley & Glauber, 2006; Kantarevic & Mechoulan, 2006; cf. Hauser & Sewell, 1985). If birth order were also related to a person’s polygenic score, it would represent a plausible confounder. We therefore tested this association. A sibling’s birth order was not related to his or her polygenic score (r = .01). When we include a dummy variable for birth order in Model 3, we estimate b at Wave 4 to be 0.30 (p = .01), essentially unchanged from the original estimate of 0.29.
Previous research suggests the possibility that genetic influences on a child’s educational attainment may be modified by features of the child’s environment, such as his or her family’s socioeconomic status (SES; Turkheimer, Haley, Waldron, D’Onofrio, & Gottesman, 2003). A previous test of this hypothesis in older cohorts using a similar polygenic score found no evidence that genetic effects varied by family SES (Conley et al., 2015). As an exploratory analysis, we evaluated the hypothesis in our data by testing for an interaction between the polygenic score and maternal educational attainment in a modified version of Model 2B. The main effect of the polygenic score was similar to what was reported in Table 3 (b = 0.30, p < .001). We estimated an interaction between maternal education and the polygenic score of −0.06 (SE = 0.03, p = .04). The coefficient being negative suggests that a child’s polygenic score is less predictive of his or her own educational attainment when his or her mother holds a higher degree. Notably, this finding is opposite the prediction that would be made based on the original Turkheimer et al. (2003) observation, in which genetic factors explained more variance in higher-SES children. We view this as a preliminary result, which will need to be verified in the full Add Health cohort once it has been genotyped. A comparable model estimated in the AA sample yielded a main effect nearly identical to Model 2B (b = 0.14, p = .12) and an interaction of 0.01 (SE = 0.04, p = .79).
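To see what the EA interaction estimate implies, the conditional effect of the polygenic score can be computed at different levels of maternal education. The sketch below assumes that maternal education enters the model as a centered variable measured in unit steps. That coding is an assumption made for illustration, since the scale is not stated here, and the function name is hypothetical.

```python
def conditional_genetic_effect(main_effect, interaction, maternal_education):
    """Implied effect of a 1-SD higher polygenic score at a given level of
    maternal education, from a model with a score-by-education interaction."""
    return main_effect + interaction * maternal_education


if __name__ == "__main__":
    # Main effect and interaction as reported for the EA sample;
    # the education levels (-2, 0, 2) are hypothetical centered values.
    for level in (-2, 0, 2):
        effect = conditional_genetic_effect(0.30, -0.06, level)
        print(f"maternal education {level:+d}: effect = {effect:.2f}")
```

Under this assumed coding, the implied genetic effect shrinks as maternal education rises, which is the direction described in the text. The main effect itself applies only where the education variable equals zero.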
Given the limited sample size, statistical power is a concern. On the basis of published associations between the polygenic score and educational attainment, we expected an effect size of at least r = .1. We have better than 80% power to detect such an effect in the EA sample. Power for the sibling comparison analyses is somewhat lower (about 60%; additional details available in the online supplement). Therefore, our results should be interpreted as contributing to the evidence base on the nature of genetic associations with educational attainment but needing replication in additional samples.
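A rough check of the power statement can be made with the Fisher z approximation for a correlation. This is a sketch under stated assumptions, not the calculation in the online supplement: it treats observations as independent (ignoring sibling clustering) and uses n = 877, the number of EA respondents with verbal test data mentioned earlier, purely as an illustrative sample size.

```python
from math import atanh, sqrt
from statistics import NormalDist


def power_for_correlation(r, n, alpha=0.05):
    """Approximate two-sided power to detect a correlation r with n
    independent observations, using the Fisher z transformation."""
    if n <= 3:
        raise ValueError("n must be greater than 3")
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = atanh(r) * sqrt(n - 3)  # expected test statistic under r
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


if __name__ == "__main__":
    # Illustrative sample size only; see the assumptions stated above.
    print(round(power_for_correlation(0.1, 877), 2))
```

With these inputs the approximation gives power somewhat above 80%, in line with the statement for the EA sample. Smaller effective sample sizes, as in the sibling comparison, give lower power.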
Discussion
We investigated a recently published genetic algorithm to predict educational attainment using genome-wide genetic data from the Add Health sibling pair files (McQueen et al., 2014). We found that a polygenic score produced with this algorithm was predictive of educational outcomes in our sample of U.S. adolescents born during the 1970s and 1980s and followed up through the first decade of the 21st century. Add Health respondents with higher polygenic scores completed more years of schooling as compared to peers with lower scores. Each standard deviation difference in polygenic score predicted roughly one third of 1 year’s difference in completed schooling by the end of follow-up (i.e., a moderate effect size). This estimate may be a lower bound of how much variation in educational attainment can be predicted with a polygenic score. Twin studies estimate that approximately 40% of the variation in educational attainment is attributable to genetic factors (e.g., Branigan, McCallum, & Freese, 2013). The SSGAC estimates that the variance in educational attainment explained by the associated polygenic score will grow as GWAS sample size increases; Rietveld et al. (2013) estimate that 15% of the variance in attainment might be predicted with a polygenic score derived from a GWAS on 1 million respondents.
Our sibling comparison analysis extends prior work (Conley et al., 2015; Rietveld et al., 2014) to a contemporary, nationally representative U.S. sample. We further show, for the first time, clear evidence for sociogeographic patterning of polygenic scores in the contemporary United States. It is not entirely surprising that the genetic similarities of parents and children are reflected in their respective educational attainments (Conley et al., 2015; Krapohl & Plomin, 2015). But our data also show that patterning of polygenic scores extends to the neighborhoods in which children live. Neighborhoods can be important facilitators of or impediments to children’s social attainments (e.g., Chetty & Hendren, 2015; Chetty, Hendren, & Katz, 2015). Authors of future research should investigate neighborhoods and other macrosocial factors as potential pathways through which familial genetic endowments influence children’s outcomes. Ultimately, the substance of genetic differences between neighborhoods implied by our analysis remains uncertain. Our observations here represent only a first illustration of how novel genome science methods can begin to integrate biological science with research on social attainment and mobility.
A further contribution of our study is to identify an important difference in estimates of genetic effects obtained from between-family analysis and within-family analysis. In our between-family analysis, genetic effects were substantially attenuated when we included controls for family and neighborhood social advantage. This result suggests that for educational attainment, social advantages are correlated with genetic advantages. This complicates the causal models social scientists use when they study socioeconomic gradients in education, particularly in light of evidence that childhood social advantage and educational attainment share genetic roots (Krapohl & Plomin, 2015). In any event, the within-family analysis does not have this problem due to the shared sibling environment. In the within-family analysis that also controlled for socioeconomic differences between individuals, genetic effects were nearly identical to unadjusted estimates from between-family analysis. We also see discrepancies in the mediation analyses: Verbal intelligence appears to mediate about one third of the genetic association with educational attainment in analyses of unrelated individuals but is a weaker mediator of genetic effects identified in the within-family analysis. So why do the two approaches yield such different results?
The explanation we favor is that families constitute heavily controlled laboratories for testing genetic effects. Out in the “wild” of between-family analyses, variance in educational attainments is mostly accounted for by structural features of the social environments children grow up in: their parents’ education, the kinds of neighborhoods in which they live, and the schools they attend. These powerful social forces are silenced within families. This is generally regarded as a strength of within-family analysis. But in our case, it may require a subtle reinterpretation of results. Because so much is similar for siblings, small differences in their genetic makeup have the opportunity to stand out. We know that medical treatments sometimes show large effects in carefully controlled trials but prove less effective when implemented in field settings where there is more variation in treatment context (Rothwell, 2005). In the same way, a genetic difference measured by polygenic score could have larger consequences for a pair of siblings, who share most other determinants of educational outcomes, than for a pair of unrelated individuals. Some have referred to this pattern as a “social distinction” process in which particular social environments, specifically those in which background social noise is minimized, enable us to distinguish the signals from small genetic associations (Boardman, Daw, & Freese, 2013). It may also be the case that family environments function to magnify differences between siblings. Parents respond to observed differences in their children by making different investments in them (Conley, 2004), potentially magnifying a genetic difference of modest consequence. Siblings, seeking to differentiate themselves from one another, may form identities that track them toward more or less educationally enriching activities and associations, again, with the consequence of magnifying a genetic difference of initially modest consequence.
We acknowledge limitations. First, our data are right censored. Some Add Health participants may not have completed their educational careers by the time of the most recent Wave 4 interview. Continued follow-up of the cohort is needed. Second, our data are left censored. Add Health began when participants were well along their adolescent educational careers. We were therefore unable to observe preschooling characteristics and also unable to observe all possible educational transitions (i.e., we have both left and right censoring). Third, cognitive assessment in Add Health at baseline was limited to the modified Peabody Picture Vocabulary Test. It is possible that the genetic influence measured in the polygenic score affects other facets of general cognitive ability not measured in this test of verbal intelligence. Finally, the Add Health study used school-based cluster sampling, providing a highly attractive setting for investigating the role of schools in modifying/contextualizing genetic influence on educational outcomes (e.g., through use of school-level fixed effects). The sibling pairs sample is not large enough to take advantage of this design, and therefore schools are omitted from our analysis. We do analyze characteristics of children’s families and neighborhoods. Analysis of schools will be a priority when the genetic data on the full Add Health sample become available.
Conclusion
Twin studies have been the traditional approach for understanding the connection between genes and outcomes, such as education, but they do not tell us about the biological underpinnings of this connection. Although we must emphasize that this age of integrative genetic research is only just entering its second decade, study of molecular genetic data has begun to offer evidence providing information about why certain types of genetic variation lead to variation in mental ability. At this point, we attempt to answer a key question: What is the relevance of such genetics research to education research? At the present time, the predictive power of the polygenic score is clearly too weak to have “clinical” value, and we are skeptical that even increased predictive power would make the score useful as the basis for intervention. But we do think this line of inquiry offers opportunities for study of (a) how the genetic predisposition toward attainment comes to fruition and (b) how environments, often in the role of policies, combine with biology to influence outcomes. We discuss these two opportunities in turn.
There are numerous reasons that certain individuals experience educational success. Some individuals have more raw ability in the various cognitive domains required to continue in education. Some individuals have psychological characteristics that contribute, while others have social skills that lead to increased educational attainment. Genes are linked to all of these personal attributes. Here, we have tested one natural pathway (verbal intelligence) through which the genetic predisposition toward educational attainment may act, but we are limited in our ability to test other pathways. The full Add Health sample is currently being genotyped. When this process is complete, we hope to test additional pathways. Alongside the study of these mediating pathways, incorporating genetics into education research also provides an additional point of leverage for studying the translational pathways through which increased educational attainment may translate into more distal life course outcomes, such as improved health and labor force participation.
One important pathway through which a genotype may translate into increased attainment involves the possibility that one’s genotype evokes a particular environment (i.e., evocative gene–environment correlation). This perspective suggests that genotype is associated with observable traits that may, for example, affect a counselor’s decision about class scheduling, a teacher’s perception of student ability or effort, or even the likelihood that a particular student will befriend certain people (Boardman et al., 2012). All of these factors may then have influences on the years of educational attainment. If this is the case, it does not change the fact that genotype is related to educational outcomes, but it suggests that the cause has more to do with the environment in which one resides than the production of specific proteins that directly enhance one’s ability to succeed in school.
A second area in which genetic inquiry is relevant to educational research is an increased understanding of how environments shape outcomes. As an example of how this might work, consider smoking. There is evidence to suggest that genes became a more important determinant of smoking behavior after the 1964 publication of the Surgeon General’s warning (Boardman et al., 2011; Boardman, Blalock, & Pampel, 2010). If those who still smoke have a different biological relationship with tobacco (as indicated by genetics) than smokers from previous generations, then this suggests that modern cessation efforts might need a new focus as compared to previous efforts aimed at those with a weaker genetic inclination toward smoking.
One might similarly consider the composition of those who enter college in 2015 compared to the composition of those who entered college in the mid-1960s. It is increasingly normative for nearly all students in the United States to consider college attendance, with 68% of high school graduates attending college in 2013 compared to only 45% in 1965 (U.S. Bureau of Labor Statistics, 2015). As such, it is possible that the relative contribution of genetics to educational attainment may have changed. This increased access to education may increase or decrease the relative contribution of genes to educational differences in the population. For example, 100 years ago, a remarkably select group of adults was able to attend and graduate from college. Thus, social factors related to family resources and institutional connections placed great limits on who was able to obtain higher education. As such, small genetic associations may not have differentiated between individuals in this context. As social controls were removed, it is possible that the selection into college was not random but initiated primarily among those with higher polygenic scores, which would enable genetic variation in the population to contribute to phenotypic variation (e.g., education).
Of course, there are almost assuredly scenarios that would decrease the relevance of the polygenic score. The introduction of compulsory schooling, universal preschool, and the GI Bill are all interventions that have possibly changed the association between the polygenic score and attainment. Whether the genetic association with attainment is increasing or decreasing, the larger point is that a consideration of genetics can help us understand the role of environment, including policy interventions. In particular, a consideration of genetics may allow for understanding of response heterogeneity and, more broadly, could help us to understand why policies may (or may not) be generating the desired policy objective. Although such research is just beginning, Fletcher (2012) provides a useful example in which he demonstrates that the smoking behavior of certain individuals may be less sensitive to changes in the tax rate as a function of genotype.
In closing, this article adds to the ample evidence to suggest that children’s educational attainments are influenced by their genes (e.g., Branigan et al., 2013). However, it is becoming increasingly clear that just as biology plays a role in shaping social outcomes, such as education, the social environments in which humans are placed play a role in shaping their biology. For example, recent research suggests that chronic poverty plays a role in shaping brain structure (Noble et al., 2015). Children’s educational environments are among the most important social exposures that modern humans experience. Thus, we believe that just as genetics can offer new tools to education researchers, education researchers have important expertise to bring to genetic studies. Specifically, there is a need to identify which aspects of the educational environment matter, when in development they matter most, and whether there are specific children who may be more or less sensitive to these environments.
Acknowledgments
This research uses data from Add Health, a program project directed by Kathleen Mullan Harris and designed by J. Richard Udry, Peter S. Bearman, and Kathleen Mullan Harris at the University of North Carolina at Chapel Hill and funded by Grant P01-HD31921 from the Eunice Kennedy Shriver National Institute of Child Health and Human Development, with cooperative funding from 23 other federal agencies and foundations. Opinions reflect those of the authors and do not necessarily reflect those of the granting agencies. We also received support from NIH/NICHD R01 HD060726.
Notes
1. Single-nucleotide polymorphisms (SNPs) are single-letter changes in the human DNA sequence that are present in >1% of the population. An individual’s genotype for an SNP includes two alleles, one inherited from each parent. Most SNPs involve the substitution of one letter of the A-C-T-G alphabet of human DNA for another. So an SNP might be described as A/G if some individuals in the population carried a G where most others carried an A. An individual could carry one A and one G or two As or two Gs. In some cases, a change in allele results in a functional change in the genome. For example, in the case of the SNP rs6265 in the BDNF gene, the substitution of an A allele for the more common G allele results in an amino acid substitution from valine to methionine, in turn resulting in altered production of the BDNF peptide (Egan et al., 2003). However, most SNPs do not have a known biological function, and the biological significance of associations detected in a genome-wide association study is usually uncertain.
- © The Author(s) 2015
This article is distributed under the terms of the Creative Commons Attribution-NonCommercial 3.0 License (http://www.creativecommons.org/licenses/by-nc/3.0/) which permits non-commercial use, reproduction and distribution of the work without further permission provided the original work is attributed as specified on the SAGE and Open Access page (https://us.sagepub.com/en-us/nam/open-access-at-sage).
Authors
BEN DOMINGUE is an assistant professor at the Stanford Graduate School of Education. He is interested in test scores and their uses as well as the integration of genetic data into social science research.
DAN BELSKY is Assistant Professor of Medicine in the Division of Geriatrics at Duke University in the School of Medicine. Dan studies life course development at the intersection of genetics, the social and behavioral sciences, and public health.
DALTON CONLEY is University Professor at New York University. Conley’s research focuses on the biological and social determinants of economic opportunity within and across generations.
KATHLEEN MULLAN HARRIS is the James E. Haar Distinguished Professor of Sociology and Faculty Fellow at the Carolina Population Center at the University of North Carolina at Chapel Hill. Her research focuses on social inequality and health with particular interests in family demography, the transition to adulthood, health disparities and family formation.
JASON D. BOARDMAN is a Professor in the Department of Sociology and Director of the Health & Society Program in the Institute of Behavioral Science at the University of Colorado at Boulder. His research focuses on the social determinants of health with an emphasis on the gene-environment interactions related to health behaviors.