- Wong, Karen HY;
- Ma, Walfred;
- Wei, Chun-Yu;
- Yeh, Erh-Chan;
- Lin, Wan-Jia;
- Wang, Elin HF;
- Su, Jen-Ping;
- Hsieh, Feng-Jen;
- Kao, Hsiao-Jung;
- Chen, Hsiao-Huei;
- Chow, Stephen K;
- Young, Eleanor;
- Chu, Catherine;
- Poon, Annie;
- Yang, Chi-Fan;
- Lin, Dar-Shong;
- Hu, Yu-Feng;
- Wu, Jer-Yuarn;
- Lee, Ni-Chung;
- Hwu, Wuh-Liang;
- Boffelli, Dario;
- Martin, David;
- Xiao, Ming;
- Kwok, Pui-Yan
The current human reference genome is predominantly derived from a single individual and does not adequately reflect human genetic diversity. Here, we analyze 338 high-quality human genome assemblies from genetically divergent populations to identify sequences missing from the human reference genome at breakpoint resolution. We identify 127,727 recurrent non-reference unique insertions spanning 18,048,877 bp, some of which disrupt exons and known regulatory elements. To improve genome annotations, we linearly integrate these sequences into the chromosomal assemblies and construct a Human Diversity Reference. Using this reference, an average of 402,573 previously unmapped reads can be recovered per genome sequenced to ~40X coverage. Transcriptomic diversity among these non-reference sequences can also be assessed directly: we map tens of thousands of previously discarded RNA-Seq reads to this reference and identify transcription evidence in 4,781 gene loci, underlining the importance of these non-reference sequences in functional genomics. These datasets are an important advance toward a comprehensive reference representation of global human genetic diversity.
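
Linear integration of insertions into a chromosomal sequence shifts every downstream coordinate, so annotations on the original reference need a coordinate lift. The following is a minimal sketch of that idea on a toy sequence, not the authors' pipeline: the function names `integrate_insertions` and `lift`, the 0-based breakpoint convention, and the example data are all assumptions made for illustration.

```python
from bisect import bisect_right
from typing import List, Tuple

# Hypothetical helpers for illustration only; not taken from the paper.

def integrate_insertions(
    reference: str, insertions: List[Tuple[int, str]]
) -> Tuple[str, List[Tuple[int, int]]]:
    """Place each inserted sequence immediately before the 0-based
    reference position given as its breakpoint.

    Returns the augmented sequence and a list of
    (original_breakpoint, cumulative_shift) pairs sorted by breakpoint.
    """
    pieces: List[str] = []
    shifts: List[Tuple[int, int]] = []
    prev = 0
    shift = 0
    for pos, seq in sorted(insertions):
        if not 0 <= pos <= len(reference):
            raise ValueError(f"breakpoint {pos} is outside the reference")
        pieces.append(reference[prev:pos])  # untouched reference segment
        pieces.append(seq)                  # non-reference insertion
        shift += len(seq)
        shifts.append((pos, shift))
        prev = pos
    pieces.append(reference[prev:])
    return "".join(pieces), shifts


def lift(pos: int, shifts: List[Tuple[int, int]]) -> int:
    """Map a 0-based coordinate on the original reference to the
    corresponding coordinate on the augmented sequence."""
    breakpoints = [p for p, _ in shifts]
    i = bisect_right(breakpoints, pos)  # insertions at or before pos
    return pos + (shifts[i - 1][1] if i else 0)


# Toy example (made-up sequences).
reference = "ACGTACGT"
augmented, shifts = integrate_insertions(reference, [(6, "GG"), (2, "TTT")])
print(augmented)
print([lift(p, shifts) for p in range(len(reference))])
```

Every original base keeps its identity under the lift, that is, `augmented[lift(p, shifts)] == reference[p]` for each position `p`, which is the property an annotation transfer onto a linearly augmented reference would rely on.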