A year of storytelling
Another new year is already here. Here I am writing to you exactly one year since I sent you my last Substack newsletter. For some reason, I couldn’t get out of the comforts of writing Twitter posts. It might have something to do with my lack of discipline and excessive procrastination when it comes to tasks that take days instead of an hour or two. I am hoping to write to you more in the new year. Let’s see how that resolution turns out. Anyway, I’ve been writing consistently on Twitter though, as many of you may have noticed (if you were still sticking around). I am happy that I was able to do that despite my busy day job.
So, with the arrival of the new year, I am completing yet another year of storytelling. As always, I spend some time at the end of the year to browse through all my past Twitter posts and appreciate how much I have learned and shared over the year. And I enjoy compiling the most interesting stories into a long year-end post.
The goal of this post is not to provide a comprehensive overview of all the human genetics advancements that happened in 2023. I am neither qualified nor have the bandwidth to do so. My selections of human genetics papers are from the pool of papers that I have read and shared on Twitter.
As usual, I have organized the stories into various topics and whenever possible I have provided hyperlinks to either my Twitter posts or directly to the papers. I hope you’ll enjoy this post.
Grab a cup of tea or your favorite beverage, find a comfortable place, and settle down. It’s gonna take a while to finish reading this post. Here we go!
Human genetics and drug development
As a scientist working in drug target discovery, the topic closest to my heart is the role of human genetics in drug development. Under this theme, I’d like to highlight three genes: BCL11A, APOL1, and APOE. We have known these genes and their associations with human diseases and traits for a long time. While BCL11A’s association with fetal hemoglobin levels (a major determinant of disease severity in sickle cell patients) and APOL1’s association with kidney disease are some of the major early GWAS discoveries (we often call them “low-hanging fruits”), our knowledge of APOE’s association with Alzheimer’s disease predates the GWAS era. Multiple decades have passed since their discoveries, yet their stories are still evolving. At least one of these fruits has now ripened into a life-saving medicine, marking 2023 as a landmark year in the history of human genetics.
BCL11A
The most exciting news of 2023 in the field of human genetics is undoubtedly the FDA approval of Casgevy, the first-ever CRISPR-based medicine, developed for the treatment of two major blood disorders (sickle cell disease and beta-thalassemia) that affect hundreds of thousands of individuals of African, African American, and South Asian ancestries. The treatment involves taking the blood stem cells out of the patient’s body, making a double-stranded cut at an enhancer site of BCL11A using CRISPR, and putting the CRISPR-edited stem cells back into the patient. Disrupting the enhancer reduces the expression of BCL11A, a transcription factor involved in fetal-to-adult hemoglobin switching, and this reprograms the hemoglobin machinery of the new red blood cells (that the CRISPR-edited stem cells will differentiate into) to produce more fetal hemoglobin instead of the disease-causing, defective adult hemoglobin.
The Casgevy approval marks the success of two technologies—CRISPR and GWAS. Of course, the Nobel-winning CRISPR technology plays a major role here by offering scientists a magical tool to manipulate the human genome precisely but being a human geneticist, I am excited more for the human genetics aspect of this medicine.
The link between BCL11A and fetal hemoglobin was discovered in 2007 through GWAS of fetal hemoglobin levels, reported by at least two independent research groups. BCL11A marks the first successful translation of a GWAS discovery into an FDA-approved medicine and will serve as the best proof of one of the foundational concepts of the GWAS field: to understand the molecular basis of a human disease or a trait, one needs to study its genetic basis in a hypothesis-free manner.
APOL1
The story of APOL1 kidney disease in African populations is my all-time favorite. Refer to this Twitter thread to catch up on the full story of the discovery of APOL1 as a major genetic risk factor of chronic kidney disease in Africans and African Americans. I am fascinated more by the APOL1 story than the sickle cell disease (both of which are by-products of the evolutionary arms race between humans and pathogens) for the fact that APOL1 discovery was born out of GWAS. Since its discovery, there has been so much interest from the scientific community and drug industries that APOL1 kidney disease has evolved into a scientific field of its own. Those decades of scientific effort might soon bear the first fruit in the form of a life-saving medicine within the next few years. At least 10 companies are developing drugs that are in various stages of clinical development.
One exciting advancement in the APOL1 story this year is the discovery of a protective disease-modifier variant, reported by two independent research groups. Although it was always believed that APOL1 risk variants contribute to disease development through a gain-of-function mechanism and that the therapeutic approach would be to inhibit APOL1, there was no human genetic finding to support that idea, that is, no demonstration of a protective association between APOL1 loss-of-function variants and kidney disease, because APOL1 loss-of-function variants are extremely rare. Against this background, the newly identified modifier missense variant (which turned out to be functionally a loss-of-function variant), showing a protective association with kidney disease in the background of a specific APOL1 risk haplotype, is a major advancement in the APOL1 field, as it strongly favors APOL1 inhibition as the therapeutic mechanism to treat (or prevent) chronic kidney disease in Africans and African Americans.
APOE
If there is one disease area where the drug developers need to up their game (or perhaps down their routine game, take a break, step out of the pot they’ve been riding horses around in circles, and think differently), it is Alzheimer’s disease. So many billion dollars have been spent in the making of drugs for Alzheimer’s, and yet, we don’t have even one decent medicine for this devastating disease. The discovery of APOE as the major (and almost, a monogenic) risk factor for Alzheimer’s dates back to the early 1990s, a breakthrough that arose at the culmination of molecular, epidemiological, and genetic studies. As I was drafting this section, I read a beautiful news article from 1993 by John Travis (who is currently the managing editor at Science) that accompanied the landmark Science paper and enjoyed every sentence of it. I highly recommend you read that piece and travel back in time to the moment the field celebrated a breakthrough discovery of a new genetic risk factor, APOE4, for Alzheimer’s.
But unfortunately, even after 30 years since its discovery, I think, our understanding of the molecular mechanism of APOE4 risk effect on Alzheimer’s is still blurry and this has hindered drug development. As always, human genetics will be the savior, as it’s often said, the most useful animal model to study human diseases (especially, neurodegenerative diseases) is none other than the humans themselves. And that is why I am excited about a new development that happened this year in the field of Alzheimer’s genetics.
A preprint published this year describes a handful of extraordinarily rare humans who happened to have been born with an APOE4 risk variant and an APOE loss-of-function variant, both on the same haplotype. As a consequence, in these individuals, the bad APOE4 missense variant has gotten butchered off by the co-occurring protein-truncating loss-of-function variant. So, the natural question here would be: what happened to these individuals when they became old? The authors found that these individuals did not develop Alzheimer’s or have biomarkers suggestive of Alzheimer’s until a late age, and the ones who died of other causes did not show any evidence of Alzheimer’s histopathology in the postmortem brain examinations. There are good reasons to speculate that the loss-of-function variants protected these individuals from the harmful effect of the E4 allele, which the authors do, but given that the E4 heterozygous state is not that highly penetrant, we’ll have to wait for a future study with a sufficient sample size to prove these speculations.
The idea of targeting the APOE gene to treat Alzheimer’s has been floating around for years, and I believe many companies are already knee-deep in such drug programs. This new case report seems to support that therapeutic mechanism, which is promising!
Rare disease and drug development
Rare disease informing drug development for common diseases is one of my newly found interests, and I think this topic is often overlooked. In the drug discovery workflow, when we identify a gene as a potential drug target, one of the questions we ask is what happens at the extremes of this gene’s disruption, that is, what happens when this gene is over-activated or fully absent. Answers to such questions often come from literature on Mendelian diseases, and that is why, one of the places we immediately visit as soon as we get our hands on a new target is the Online Mendelian Inheritance in Man (OMIM) database.
Rare diseases can inform many aspects of drug development, for example, the therapeutic efficacy of targeting a gene. A beautiful example lies within a topic that we just discussed: BCL11A and sickle cell disease. Just one year following the GWAS discovery of the link between BCL11A and fetal hemoglobin levels, Vijay Sankaran, Stu Orkin, and colleagues (who had been working on hemoglobin switching long before the GWAS era) from Boston Children's Hospital published a landmark paper in Science, experimentally confirming that BCL11A is indeed the causal gene at the GWAS locus and that it is an erythrocyte-specific transcription factor responsible for fetal-to-adult hemoglobin switching after birth. Efforts to translate this finding into sickle cell disease treatment were swiftly made. The therapeutic design would be to switch off the adult hemoglobin production and switch on the fetal hemoglobin production by inhibiting BCL11A expression. But one key question remained to be answered. Eliminating BCL11A from the blood cells is problematic, as that seems to cause some unwanted side effects like B cell toxicity. So, one should dial down BCL11A’s expression within a safety margin. But to what level should one dial down the BCL11A expression to achieve a clinically meaningful rise in the HbF in the blood (which is around 20% according to natural history studies of sickle cell patients)? The answer came from patients with a rare Mendelian neurodevelopmental disease caused by BCL11A haploinsufficiency. It turned out the HbF levels in the blood of BCL11A haploinsufficient children were around 30%, well above the clinically desired threshold of 20%. This critical piece of data informed the drug developers that one needs to dial down BCL11A only by 50% to achieve therapeutic efficacy.¹
Another important insight that rare disease brings to drug development is about the adverse effects of targeting a gene. The discovery of the genetic cause of an extremely rare Mendelian muscle disease this year falls right under this topic. I wrote a Twitter thread on this, highlighting work from two independent research groups (one from Israel and the other from the USA) who discovered that recessive deleterious mutations in HMGCR (the genetic target of statins, the world’s most popular class of medicines) cause a severe Mendelian type of limb-girdle disease. The clinical profile of these rare disease patients gave a human validation of a long-debated adverse effect of statins: muscle pain and weakness. Not only that, an n=1 clinical trial performed by the Israeli scientists hinted at a possible treatment for statin-induced myopathy: mevalonate supplementation. The Twitter thread inspired a reporter, Sarah Zhang, to cover this story in The Atlantic.
Decoding GWAS loci
If you’ve been following the social media GWAS discussions, you might have come across the debate on the value of studying rare vs common variants for drug target discovery. Hearing about the successful translation of the BCL11A GWAS locus might trick a naive reader into thinking that it is easy to spot the causal gene once the GWAS locus is discovered. That is not true! Loci such as BCL11A are outliers in the sea of GWAS loci waiting to be decoded. If you look back at the history of many famous GWAS loci, you’ll be surprised to find how often scientists get the causal gene wrong the first time. A good example is the APOL1 locus we discussed before. When the GWAS locus near APOL1 was first mapped in the admixed African American population (using a technique called admixture mapping), the scientists believed that the causal gene was MYH9. They were so confident that they even boldly put the gene name in the titles of the two GWAS papers (by two independent research groups) that stand as milestones in the APOL1 story. Only later did the scientists realize that they had gotten the wrong gene the first time and that the true causal gene at the locus was APOL1.
The difficulty of identifying the causal gene at the GWAS locus is a major bottleneck in using common variants for drug target discovery. To illustrate this, there can be no other example more perfect than FTO, the first GWAS locus identified for obesity. The FTO locus might be the most studied in the GWAS history with at least a dozen influential papers published in top journals like Nature, Science, and NEJM. Despite the decades of hammering, the FTO locus hasn’t spilled all of its secrets yet, which you’ll readily appreciate from a paper published in Nature Metabolism this year. My Twitter post summarizing this paper, to my pleasant surprise, generated a lot of excitement and interest in the human genetics field. Refer to the post to catch up on the background story and how the ball of causal gene was bouncing back and forth between two genes (FTO and IRX3/5) over the years. With the recent publication, the ball now lies in the FTO court. Let’s wait and watch if the ball stays or bounces back to the IRX3/5 court.
Genetic drift
FinnGen
The idea of studying founder populations to discover disease-causing genes has existed for ages. In fact, the idea is a direct derivative of the concept of genetic linkage studies done using family pedigrees. Members of the same family have similar genomes, and any disease seen in multiple family members tends to be caused by the same genetic mutation. As the genomes of the affected and unaffected family members will be highly similar except for the regions harboring the disease-causing mutation, it is relatively easy to map the gene compared with doing the same in unrelated individuals. The same principle, to some extent, applies to isolated founder populations, which are in a sense large family pedigrees where all living members are descendants of a small number of founders. This is one of the premises under which deCODE genetics was founded by the Icelandic visionary, Kari Stefansson. Refer to deCODE’s very first press release in 1997 announcing the discovery of a genetic locus for familial essential tremor.
The most important value of studying isolated populations is the opportunity to identify a large number of carriers of certain rare gene-disrupting variants (due to genetic drift) that are rare or often absent elsewhere in the world. Many such discoveries were made by deCODE, for example, the discovery of the first protective variant for Alzheimer’s that offered support to the amyloid-beta hypothesis and accelerated the development of BACE inhibitors for the treatment of Alzheimer’s.
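To make the drift argument a bit more tangible, here is a minimal sketch of a Wright-Fisher style simulation. It is only an illustration under simplifying assumptions of my own (a constant population size, random mating, no selection, no new mutations), and the function name, the starting frequency, and the founder count are all hypothetical choices rather than anything taken from the studies discussed here.

```python
import random


def simulate_drift(start_freq, founders, generations, seed=0):
    """Track one allele's frequency in a small, constant-size population.

    Illustrative Wright-Fisher sketch: each generation, every one of the
    2 * founders chromosomes is drawn at random from the previous
    generation's allele frequency (no selection, no mutation).
    """
    rng = random.Random(seed)
    n_chromosomes = 2 * founders
    freq = start_freq
    trajectory = [freq]
    for _ in range(generations):
        # Count how many of the new chromosomes carry the allele.
        copies = sum(rng.random() < freq for _ in range(n_chromosomes))
        freq = copies / n_chromosomes
        trajectory.append(freq)
    return trajectory


if __name__ == "__main__":
    # Hypothetical numbers: an allele at 1% in a group of 50 founders.
    for seed in range(5):
        final = simulate_drift(0.01, founders=50, generations=20, seed=seed)[-1]
        print(f"run {seed}: frequency after 20 generations = {final:.3f}")
```

Running it with different seeds shows the two typical fates of a rare allele in a small population: most of the time it is lost, and occasionally it drifts up to a frequency far higher than where it started. The second outcome is the one that makes founder populations so useful.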
Finland, similar to Iceland, is a founder population with a history of multiple severe bottleneck events. Although many small-scale genetic studies based on the Finnish population have been published, painting a pretty good picture of the higher prevalence of certain rare variants and recessive diseases, Finland lacked a large-scale genetic database for a long time. Only a few years ago did Finland enter the big league of massive biobanks with the launch of FinnGen, an ambitious research project to collect genomes and health data of 500,000 Finnish individuals through national health registries. Two flagship papers describing the genetic analyses of the first 220,000 Finns were published this year in Nature. One of the papers focuses on recessive genetic associations, while the other focuses on genetic associations unique to Finland driven by drifted variants. Both have tons of interesting findings and represent big advancements in the human genetics field.
Puerto Ricans
Speaking of founder populations, many factors can cause bottleneck events, for example, famine, flood, migration, etc., drastically reducing a population’s size to a small number which then expands into a bigger population. One major factor that pushed certain populations close to extinction causing a severe bottleneck effect is colonization. The indigenous populations of the Caribbean islands fall under that category and are a great resource for human genetic discoveries, yet they are largely under-explored. Except for some of the highly prevalent conditions, for example, Hermansky-Pudlak syndrome causing oculocutaneous albinism in Puerto Ricans, much of the genetic treasures of these isolated populations are yet to be uncovered.
Thanks to massive direct-to-consumer (DTC) genetics databases like 23andMe, genetic discoveries that could once be made only by traveling to remote parts of the world can now be made by scanning through the millions of diverse participants of DTC databases, which are enriched with individuals from many understudied populations. Along those lines, a genetic study of Puerto Ricans by the 23andMe researchers deserves special mention. By studying ~45,000 individuals of Puerto Rican ancestry enrolled in the 23andMe research database, the authors identified a novel risk gene, ITGA6, associated with adult-onset cataract. Heterozygous carriers of loss-of-function variants in ITGA6 are at a 12-fold higher risk of developing cataract and do so on average 13.7 years earlier compared to non-carriers.
Endogamy and consanguinity
South Asian stroke genetics
The world’s largest genetic database of South Asians to date comes from Pakistan. The Pakistan Genomic Resource (PGR), founded by Danish Saleheen, a physician-scientist, is a large South Asian cohort comprising >150,000 Pakistanis enriched for consanguineous families. Many breakthrough works have come out of this unique genomic resource, including a landmark paper on human knockouts published in 2015 in Nature. In fact, it was only after this paper that the phrase “human knockout” became popular among the genomics community.
Regeneron Genetics Center has been a long-term collaborator of PGR, and this year, a major discovery my colleagues made using this resource was preprinted. Studying the genetics of stroke in Pakistanis using an exome-wide association study of ~30,000 individuals, the authors discovered a novel genetic risk factor for stroke in Pakistanis (and probably also in other South Asians). A missense variant in NOTCH3 (a known Mendelian gene for stroke), which is seen in almost 1% of South Asians but is rare elsewhere in the world, increases the stroke risk two- to three-fold (the largest effect size of a common risk variant for stroke discovered to date). The large effect size and the high prevalence of this variant make it one of the most important genetic risk factors of stroke among South Asians, explaining more than 5% of hemorrhagic strokes (and 1% of all strokes) in the population. This discovery has a major therapeutic implication, similar to the APOL1 discovery in African Americans. Such an important risk factor was never found in the past GWASs of stroke that were based predominantly on Europeans, even in a sample size surpassing 1 million individuals. Can there be any more perfect example than this to highlight the importance of increasing the representation of non-European populations in human genetic studies?
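For readers who like to sanity-check such numbers, here is a rough back-of-the-envelope sketch. It uses Levin's textbook formula for the population attributable fraction, which is my own choice for illustration and not necessarily how the authors arrived at their estimates. The function name is hypothetical, and the inputs (1% carrier frequency, two- to three-fold risk) are simply the rounded figures quoted above, treated as relative risks.

```python
def population_attributable_fraction(carrier_freq, relative_risk):
    """Levin's formula: share of cases that would not occur without the exposure.

    carrier_freq  -- fraction of the population carrying the risk variant
    relative_risk -- risk in carriers divided by risk in non-carriers
    """
    excess = carrier_freq * (relative_risk - 1.0)
    return excess / (1.0 + excess)


if __name__ == "__main__":
    # Assumed inputs: ~1% of the population are carriers, 2- to 3-fold risk.
    for rr in (2.0, 3.0):
        paf = population_attributable_fraction(0.01, rr)
        print(f"relative risk {rr:.0f}: attributable fraction = {paf:.4f}")
```

With these assumed inputs the formula gives roughly 1 to 2 percent of cases, which is in the same ballpark as the "1% of all strokes" figure. A larger share for a specific subtype, such as hemorrhagic stroke, would imply a larger effect size for that subtype.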
South Asian whole genomes
A 2017 Nature Genetics paper by a team of scientists led by David Reich and Kumarasamy Thangaraj (pioneers of ancient DNA research) represents one of the most important papers on South Asian genetics. The first line of the abstract is a perfect summary of the genetic makeup of South Asian populations:
“The more than 1.5 billion people who live in South Asia are correctly viewed not as a single large population but as many small endogamous groups.”
The data presented by Nakatsuka et al. was the first systematic and comprehensive genome-wide analysis of the South Asian populations (over 2800 individuals representing 260 distinct groups), which opened the window offering first views into the extreme founder events in certain South Asian endogamous communities, which put the well-known founder populations like Ashkenazi Jews and Finns to shame. Although the authors theoretically predicted based on the common variants that the South Asian population is likely a gold mine of partial and complete human knockouts, they didn’t have rare variants data to empirically demonstrate their predictions.
A landmark paper published this year by multiple academic and industry research groups from within and outside South Asia reports a comprehensive analysis of 4800 whole genome sequences of South Asians recruited across India, Pakistan, and Bangladesh (the largest to date), providing estimates of the impressive genome-wide enrichment of heterozygous and homozygous loss of function variants in the South Asian populations. This paper represents a big step forward in our understanding of South Asian genetics.
Consanguinity and risk of common diseases
Historically, the word “recessive” in the human genetics literature has almost always been discussed in the context of rare Mendelian diseases. There has been interest from the GWAS community in studying the recessive genetic architecture of common, complex diseases such as type 2 diabetes, schizophrenia, etc. Occasionally, researchers have performed recessive GWASs for complex diseases, but mostly such efforts did not yield that many signals, which is not surprising because most of those studies were based on outbred European populations. Thanks to emerging large-scale genetic databases of founder populations like FinnGen, scientists are starting to discover recessive genetic associations for complex traits.
If you think about it, the most ideal populations to study recessive genetics would be highly endogamous and consanguineous populations like South Asians. But most studies that looked into the health effects of consanguineous marriages focussed only on rare Mendelian diseases. For that reason, a beautiful paper published this year by Hilary Martin, Daniel Malawsky, and colleagues on the effect of consanguinity on the risk of common diseases in British Pakistanis represents a major advancement in human genetics. Using an ingenious study design that is free of environmental confounders, the authors show that the load of recessive mutations in the genomes of individuals born to closely related parents is strongly associated with an increased risk of many common diseases such as type 2 diabetes, post-traumatic stress disorder, etc. Especially for type 2 diabetes, which is much more common in South Asians, the authors estimate that as much as 10% of the disease risk can be explained by recessive mutation load, beautifully demonstrating that South Asians might be the perfect population to study the recessive genetic architecture of common, complex diseases.
Low-hanging fruits
The publication of the haplotype map of the human genome in 2005 by the International HapMap Consortium officially marks the beginning of the GWAS era. Researchers picked their favorite diseases and traits (type 2 diabetes, coronary artery disease, Crohn’s disease, etc.), joined teams, gathered samples, and performed the first GWASs. Many of those GWASs yielded the initial genetic associations driven by variants that are extremely common and have a moderate to large effect size, and so could be discovered in a smaller sample size (a few hundred to a few thousand). Some popular examples include the TCF7L2 locus associated with type 2 diabetes, FTO associated with obesity, the CDKN2A/B locus associated with coronary artery disease, NOD2 associated with Crohn’s disease, the CFH locus associated with age-related macular degeneration, PNPLA3 associated with fatty liver disease, and CHRNA5 associated with smoking, and the list goes on. These early GWAS discoveries are classically described as “low-hanging fruits” (as I mentioned earlier in the post). Most of the GWASs performed early in the GWAS era (2005 to 2010) focussed on a set of diseases and traits that the scientists considered important and were familiar with, as they had been studying them during the early days of their careers using family-based linkage analysis or epidemiological analysis.
Looking back at the early GWAS literature and learning how the scientists competed with one another to get their hands on those low-hanging fruits will make one wonder how many such fruits remain untouched in the GWAS grove just because their associated traits haven’t been GWASed yet. Every year one or two such studies come out where researchers perform a GWAS of a phenotype for the first time and uncover a strong genetic signal. In that category, we have this year a beautiful revelation of an impressively strong GWAS locus near ADRA2A (which encodes an adrenergic receptor) associated with Raynaud’s phenomenon, an interesting clinical condition characterized by sympathetic overactivation on exposure to cold or stress resulting in peripheral vasospasm, mainly in the fingers. The ADRA2A locus beautifully recapitulates the well-established role of the sympathetic nervous system, placing adrenergic receptors at the center of Raynaud’s pathophysiology. The most striking thing about this new GWAS discovery, reported by two independent research teams, is that the locus was genome-wide significant in each of the four independent samples that were studied. It’s fascinating to see such a strong locus getting discovered only now because no one bothered to study this phenotype early in the GWAS era. Such studies will continue to surface every year, and when they do, I’ll make sure to highlight them.
Rare coding variants
One of the most satisfying aspects of working in human genetics is discovering extremely rare mutations with mind-blowing phenotypic consequences. Before the era of big genetic data, such discoveries happened mostly in clinics, where doctors encountered families with features indicative of a genetic origin (for example, morbid obesity) and sequenced a particular genomic region (informed by genetic linkage analysis) or a suspect gene (informed by prior knowledge of the gene’s link with the phenotype in lab animals). However, the emergence of big human genetic databases such as the UK Biobank completely changed the landscape of rare variant discoveries.
The completion of whole exome sequencing of half a million UK Biobank participants last year led to the discovery of many rare coding variant associations with huge effect sizes. A few notable examples: GPR75 associated with obesity, GIGYF1 & type 2 diabetes, MC3R & age at menarche, CHEK2 & age at menopause, CHRNB2 & heavy smoking and ANKRD12 & cognitive function.
Although many research teams have now analyzed the UK Biobank exome data many times for their favorite phenotypes, the rare coding variants discoveries haven’t reached saturation yet, even for some of the highly studied phenotypes like body mass index (BMI). Analyzing those UK Biobank exomes to capture those rare variants with extraordinarily large effect sizes sometimes feels like hunting for magical beasts in an enchanted forest. What is visible for some eyes isn’t for others.
This year a new obesity gene, BSN (which encodes a synaptic protein called Bassoon), has surfaced, joining the league of Mendelian genes causing morbid obesity. The discovery was made by at least three independent research groups. Carriers of heterozygous loss-of-function variants in BSN appear to be morbidly obese, even more obese than carriers of loss-of-function variants in the well-known Mendelian obesity gene MC4R. Though the mechanism through which BSN mutations lead to obesity is yet to be understood, the fact that it is also a neurodegenerative gene, with reports of association with neurodegenerative disease including one from this year in a genetic study of early-onset Parkinson’s in the Indian population, makes BSN much more interesting. I’ll be closely following further developments on this gene, and hopefully, more exciting functional studies will follow in the coming years.
Rare non-coding variants
Yes, human geneticists love rare coding variants with large phenotypic effects. But do you know what they love even more? Rare noncoding variants with large phenotypic effects. The release of the whole genome sequences of half a million UK Biobank participants this year has raised the hopes of many people about new noncoding genetic discoveries. However, I am not sure if their dreams will come true. Of course, there will be a few interesting discoveries, some of which deCODE has already reported in their initial analysis of the first 150,000 UK Biobank whole genomes, for example, a 5’ UTR variant in TAC3 (which encodes tachykinin 3) that delays menarche by almost a year and a promoter variant in GHRH (which encodes growth hormone releasing hormone) that reduces height by ~3 cm. But it is not clear how many more such discoveries will be made in the future. A preprint reporting a first-pass analysis of the full UK Biobank whole genomes has already come out this year, and there weren’t that many noncoding revelations.
The major problem with studying noncoding variants is that we cannot confidently predict their consequences at a single base pair level the way we can for coding variants, for example, frameshift, stop gain, stop lost, missense, etc. But the scenario will soon change, as there is a whole separate field devoted to predicting noncoding variant effects using machine learning combined with in vitro/in vivo experiments. Until then, our knowledge of the phenotypic consequences of noncoding variants will be based on isolated work from clinical geneticists studying extreme Mendelian phenotypes. Under that category, we saw two mind-blowing stories in last year's roundup. In one, a research team from the University of Exeter discovered how rare intronic variants awakened the sleeping HK1 in the pancreatic beta cells, leading to congenital hyperinsulinism. In the other, a research team from Germany discovered that a rare structural inversion in the promoter region of ASIP (which encodes agouti-signaling protein) switched the gene’s tissue-restricted expression to expression throughout the body, including in hypothalamic neurons, causing extreme obesity.
FOXA2
Under the noncoding category, a preprint from the same Exeter research team that published the HK1 story last year caught my attention this year. The authors found rare noncoding regulatory deletions as the cause of congenital hyperinsulinism in three patients, demonstrating that certain non-coding deletions can be as catastrophic as whole gene deletions.
While on the topic of non-coding variants, I’d also like to highlight a couple more papers where the authors beautifully demonstrate using extensive functional experiments the crucial developmental role of noncoding variants.
GATA2
The first paper, by a research team from Boston Children's Hospital, decodes a genomic locus, 3q21.2–22, that was linked to congenital facial paresis more than 25 years ago in a family linkage study. The authors identify a heterogeneous set of noncoding variants (SNVs and duplications) clustering within specific regulatory regions of the transcription factor GATA2 as the causal variants and beautifully demonstrate how these mutations scramble the trajectories of neuronal development, resulting in facial paresis. Impressively, the authors also reproduced the phenotype in a mouse model.
ZNF808
The second paper, by research teams from Exeter, Cambridge, and Finland, doesn’t strictly fall under the noncoding category in terms of its discovery: the authors identify loss-of-function variants in a primate-specific transcription factor, ZNF808, as the cause of pancreatic agenesis. However, the paper fits well in the noncoding category in terms of the functional consequences of losing ZNF808, which results in aberrant activation of specific classes of transposable elements that turned out to be major regulators of pancreatic and liver development trajectories.
Repeat genome
When talking about the noncoding genome, people generally talk about introns, promoters, enhancers, etc. But there is one big elephant in the room that often gets ignored: the repeat genome, the darkest part of the human genome. Almost half of the human genome is made of repetitive sequences (a mind-blowing fact that not many people appreciate), and most of it (~90%) is transposons. These are classically referred to as graveyards of dead DNA sequences, as they were once believed to serve no function and to be removable from the human genome, like an appendix, without any consequences (hence the name “junk DNA”). We know today that this is not true.
The papers that I’d like to highlight here fall under a class that occupies only 1-2% of the genome but likely plays a much bigger role in human disease: short sequence repeats, characterized by tandem repetition of a short DNA sequence that can be anywhere between 2 and 100 base pairs long.
Minisatellite
These are short sequence repeats of length between 10 and 100 base pairs and are classically referred to as variable number tandem repeats (VNTRs). This class of variation was one of the most frequently studied in the candidate gene association era that preceded the GWAS era. VNTRs are difficult to sequence using traditional sequencing methods, hence we know little about their role in human diseases even today. Thanks to emerging long-read sequencing technology, we are starting to learn more about them. In this category, I’d like to highlight great work by Po-Ru Loh (like many others in the human genetics field, I am a big fan of this genius scientist) and his team from Harvard Medical School. The authors created a haplotype reference panel of VNTRs using publicly available long-read sequencing data and used it to genotype VNTRs in the full UK Biobank. Among the many interesting discoveries, the authors show that VNTRs are the causal variants at two well-known GWAS loci discovered early in the GWAS era: the TMCO1 locus associated with glaucoma and the EIF3H locus associated with colon cancer. Many of the thousands of unsolved GWAS loci sitting in the GWAS catalog are likely driven by VNTRs, which will come to light with the rise of long-read sequencing databases.
Microsatellite
Microsatellites are short sequence repeats of length between 2 and 10 base pairs. This specific class played a major role in the early days of human genetics, the era of genetic linkage analysis. Microsatellites are highly polymorphic and create unique patterns when digested with restriction enzymes, which makes them good markers for linkage analysis to triangulate disease loci in affected families. Although we know microsatellite regions are some of the most mutable parts of the human genome, there have been no previous empirical estimates of their mutation rate. For that reason, a paper from deCODE on the de novo mutation rate of microsatellites published this year deserves special mention. Using high-coverage whole genome sequencing data from 6084 Icelandic parent-offspring trios, the researchers estimate that on average around 64 de novo microsatellite mutations arise per person per generation. Considering that microsatellites occupy <3% of the genome, that number is impressively large compared to the mutation rates of any other known class of variation in the human genome.
Looking into the future
That’s a wrap on my 2023 storytelling. Of course, there are a lot more stories to tell. Only time and space (and your attention) are the limits. Before I conclude, here are some thoughts on things I am looking forward to in the future. I am excited about three things in particular.
Firstly, as a drug discovery scientist, I am excited about all the new genetic discoveries that will unfold and inspire a new generation of medicines called genetic medicines. To be honest, I am more excited about stories of successful translation of old genetic discoveries than about new discoveries, as you may have sensed from my excitement around the BCL11A and APOL1 stories. This is where new technologies like CRISPR and innovations in drug delivery to challenging destinations in the human body, like the brain, will play a major role. I’ll be closely watching for such stories and will share them as they unfold.
Secondly, speaking of technologies, one that I am particularly excited about is long-read sequencing. This year I shared a Twitter post on how researchers from Utah solved a 25-year-old genetic puzzle by using long-read sequencing to identify the causal mutation responsible for a subtype of spinocerebellar ataxia. Genetic databases based on long-read sequencing are growing rapidly. This year at ASHG I heard about a new resource of 1000 African long-read whole genomes being generated as part of the All of Us biobank. Such databases will drive more success stories like the one from Utah in the coming years.
Thirdly, I am looking forward to innovations in phenotyping. Lack of good phenotypes is the bottleneck of drug development in fields like psychiatric and neurodegenerative genetics. Technologies like high-throughput differentiation of induced pluripotent stem cells (iPSCs), single-cell sequencing, and automated high-throughput phenotyping from cellular microscopy images will enable scientists to study the consequences of disease-associated mutations in more sophisticated ways. Early this year I shared an amazing work from Soumya Raychaudhuri and colleagues from Harvard Medical School, where the authors derived cellular-level traits of iPSCs using advanced microscopy, correlated them with common and rare variants, and discovered fascinating associations. One of the talks at this year’s ASHG, on Huntington’s disease, from Bob Handsaker of Steve McCarroll’s group at Harvard Medical School, stunned many in the audience. Using single-cell sequencing technology, the authors made great advancements in understanding the pathogenesis of HTT repeat expansions in the striatal neurons of Huntington’s patients. I am excited about more such advancements in the upcoming years.
If you enjoyed this post, you might also like the podcast version (part 1 & part 2) of this content available from The Genetics Podcast hosted by Patrick Short, the CEO and founder of Sano Genetics. In the nearly 2.45-hour-long year-end episode, I discuss the papers that we covered in this post in a little more detail.
As you can guess, it takes a lot of time and effort to curate such a resource. If you feel that you’ve learned something new from reading this, please share it with your friends and leave feedback below, which will motivate me to write more in this new year.
Wish you all a very happy and successful new year 2024! Thanks for reading :)
—Veera
Wow, that's a long and pretty comprehensive summary. Thanks for highlighting the BCL11A story, I had no idea that we are already at the level of having FDA-approved CRISPR treatments. Exciting times!
PS. "It might have something to do with my lack of discipline and excessive procrastination..." I think you know deep down it was not the primary reason for not writing more :)
Nice! Love to see posts from you on substack, feels less fleeting than Twitter