Presented by Dr Erin Hahn & Dr Clare Holleley

CSIRO National Research Collections Australia

Recorded Wednesday 28 June 2023


Before I get started, I would like to acknowledge that Clare and I are joining you from Ngunawal country today, and we would like to pay our respects to Elders past, present and emerging. We're really excited about this seminar because it's not often that we get to present to a museum crowd. I often have to bang on about how important it is to look at museum specimens in order to understand how species have changed over time, but that's all old hat for you, so I'm going to dig right into the really cool science part of it.

So as Mark mentioned, we are joining you from the Australian National Wildlife Collection at CSIRO. The wildlife collection is one of seven collections spread across CSIRO. In case you're unfamiliar with us, the specimens housed in our collection are largely terrestrial vertebrates, but across CSIRO we also have fish and insect collections, herbaria including the tropical herbarium, the Australian Tree Seed Centre and the algae collection, which are our living collections. So we're quite a broad set of collections.

In Canberra we house our terrestrial vertebrates, largely mammals and birds, along with some amphibians, preserved in a whole bunch of different ways. And it's hardly going to be news to any of you that we're having a bit of a genomics revolution in the museum space. For the last 30 years or so, we've been able to get DNA out of museum specimens and use it for all sorts of cool projects studying conservation, ecology and evolution. But in the last few years, we've really been opening up the vaults to genomic analysis: using our fresh tissue collections to reconstruct whole genomes, and more and more getting into our differently preserved specimens to do genome-wide analyses. However, there are whole wings of our collections that have so far been completely inaccessible to genetic analysis, never mind genomic analysis, and those are the specimens in our spirit vaults that have been preserved with, or come into contact with, formalin. It's a real shame that these specimens are inaccessible, because anyone studying reptiles, amphibians or fish more or less can't get historical genetic information. And it's doubly unfortunate because these are largely the only specimens in our collections that are preserved whole, with all of their organs intact.

Going beyond genetic information, we would also like to access the additional layer that sits on top of the genome: the epigenome. These are all of the marks upon the genome that determine how genes are expressed, and therefore the different shapes and sizes a specimen comes in; in other words, they describe its phenotype. These marks sit on top of the DNA in the form of DNA methylation, and of chemical modifications to histones, the proteins that form the nucleosomes the DNA is wrapped around; you can infer what genes are doing based on how nucleosomes are positioned across the DNA. You can also pull RNA and proteomic information out of individual tissues if you are able to access them. However, that sort of work has largely used fresh tissues, and we haven't been able to get at that tissue-specific information from historical specimens. That's very unfortunate, because gene expression is what tells us how species have responded to the threats they've faced over a period in which our Earth has been changing rapidly. If we could get at this information, we'd have species response data for changes like climate change and the introduction of novel pests and pathogens, and we'd be able to track interactions between our native species and invasive species. So as you can imagine, there would be a wealth of information sitting at our fingertips.

And so this is the project that we've tackled over the last seven years: how can we revisit the vaults and open them up to genomic analysis, and potentially even epigenetic analysis? We're looking to fix formalin. Putting epigenetic information aside for a moment, we first had to determine whether you could get DNA information out of formalin-preserved specimens at all. A number of years of work resulted in a series of processes that we put specimens through: first vetting them to see whether they're going to be suitable for sequencing, then tissue preparation, an appropriate DNA extraction, very deep whole genome sequencing, and finally some fancy bioinformatics. What we found is that the most important step in determining whether you're going to get usable genomic information from formalin-preserved specimens is that initial specimen vetting. If you can find suitable specimens, a wealth of information is stored within them; it's just a matter of using appropriate techniques to get it out. And the most critical component of specimen vetting is what we call the ICK test: if you open up a specimen, look at its internal organs, and can differentiate the organs from one another, it's most likely going to be quite suitable for sequencing.

We published this method in Molecular Ecology Resources last year, detailing the specific approach we use for specimen vetting, sequencing and bioinformatics. Using this method on a bunch of other specimens in the collection, we've been able to recover 11X whole genome coverage from a specimen that was collected in 1905, so a specimen over 100 years old yielding quite high quality whole genome information, as well as 25X genomic coverage from teeny, tiny tuna larvae preserved in formalin. So you can see that this is a viable way to open up our vaults for genomic analysis. Within that paper, we've also included a decision-making tree, which is really useful for curators when they receive requests for destructive sampling of formalin-preserved specimens and need to decide whether, and how, to allow sampling. The first decision point is whether the viscera are present: if they're absent or decomposed, don't bother sampling. You can then step through a few more decision points about whether sampling is going to be appropriate, both for the specimen and for what the researcher is asking for. We'll point you in the direction of this paper; if you get any requests for destructive sampling, we hope you find it really useful.
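For readers who like to see logic as code, here is a minimal sketch of how a vetting decision tree of this kind could be expressed. The decision points below are paraphrased from the talk and the wording is illustrative; this is not the published tree from the Molecular Ecology Resources paper.

```python
def vet_specimen(viscera_present: bool,
                 viscera_decomposed: bool,
                 organs_distinguishable: bool,
                 request_suits_short_fragments: bool) -> str:
    """Illustrative vetting logic for destructive-sampling requests on
    formalin-preserved specimens (a sketch, not the published tree)."""
    if not viscera_present or viscera_decomposed:
        return "decline: don't bother sampling"
    if not organs_distinguishable:
        # Fails the 'ICK test': internal organs can't be told apart.
        return "decline: unlikely to yield usable DNA"
    if not request_suits_short_fragments:
        # Formalin-fixed DNA is highly fragmented, so the request must
        # be answerable with short-fragment data.
        return "decline: request mismatched to preservation type"
    return "approve: proceed to tissue prep, extraction and sequencing"

print(vet_specimen(True, False, True, True))  # -> approve: ...
```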

So we've now essentially fixed formalin in terms of being able to access these specimens for genomic analyses. But now we want to flip it on its head and think about formalin in a completely different way. First, though, let's revisit the way that researchers and curators typically think of formalin-preserved specimens in museums. The mechanism by which formalin preserves our specimens so beautifully is that it cross-links the DNA to the associated proteins in the cell. When you go to extract the DNA, that's why it comes out degraded: getting it out is like chiselling a thread of silk out of a big piece of concrete, and it's going to be damaged one way or another. So we think of formalin as destructive. However, there are whole bodies of research in the medical community, and among people who study model species, that use formalin to infer information about the structure of the genome. There's a whole suite of assays that actually use formalin in their processes to lock everything in place, so that you can take a snapshot of the genome and get information about what the genes and the genome were doing. With this in mind, given that we have a whole bunch of formalin-preserved specimens and there are assays that use formalin to infer gene expression, we decided to optimize those assays for use in museum specimens. I'm going to focus on one of those assays in particular: micrococcal nuclease sequencing. Essentially, it uses a little enzyme that goes in and chops the DNA right next to the nucleosomes, which are the packaging material of the genome.

But rather than try to optimize this on our really precious museum samples, we designed an experimental specimen set that we could play with without having to worry about damaging anything precious. This set included a whole bunch of inbred lab mice as well as some wild-caught Australian mice. We chose mouse because it has a really well annotated genome and a wealth of existing gene expression information, so we'd have a lot to compare against. We accessioned these specimens just like anything else that comes into the collection: took a fresh tissue sample, preserved the specimen in formalin, and then let it sit on the shelf next to the rest of our collection. By the time we got around to using them, the specimens were between two and five years old, which makes them truly bona fide formalin-preserved specimens. We then put the fresh tissue through RNA-seq analysis to measure gene expression and through a traditional micrococcal nuclease assay, replicating what's done with fresh tissue, and put our formalin-preserved tissues through a modified version of the micrococcal nuclease treatment.

This is an overview of the micrococcal nuclease method. Don't worry about the details; they'll be included in our upcoming publication. Suffice it to say, we do a whole bunch of things to soften the tissue and make it accessible to the micrococcal nuclease. We then put it through a modified fragment isolation protocol, which we find really useful for getting the most out of these highly fragmented DNA isolates. Then we do, again, really heavy sequencing, followed by some fancy bioinformatics that makes the most of the fragmented DNA we get, and an enrichment analysis using a program called DANPOS.

These are the sorts of data that come out of treating archival tissues with micrococcal nuclease, and I'll just draw your attention to a couple of elements. The top track is straight whole genome sequencing of fresh versus archival tissue from the same sample, and you can see that the coverage across this gene is very similar between the two, again confirming that we're able to get whole genome coverage out of archival specimens. The next track shows the coverage after treating the sample with micrococcal nuclease, where we go in and chop alongside nucleosomes to enrich for the fractions that are differently packaged in the chromatin. In this track you can see that micrococcal nuclease treatment induces differences between the samples: they're different from the whole genome sequencing, and they're also different from each other. We then used that program, DANPOS, to compare the micrococcal nuclease treated sample to its input control, that is, what it would look like if it were just whole genome sequencing, and that comparison is what we define as the micrococcal nuclease signal. Upstream of a gene like this one, which is very highly expressed in liver, the fresh tissue sample shows a positive signal, a peak gain over just having sequenced the entire genome. In the archival sample, however, we see the mirror image: rather than enrichment of a signal, we see depletion, so we see peak loss. And that's what we see across the entire genome.
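To make the peak gain/loss idea concrete, here is a minimal numpy sketch of the core of that comparison, assuming you already have binned read coverage for the micrococcal nuclease library and its whole genome sequencing input control. DANPOS does the real normalisation and statistics; this only shows the underlying log-ratio logic, with toy numbers.

```python
import numpy as np

def mnase_signal(mnase_cov: np.ndarray, input_cov: np.ndarray,
                 pseudo: float = 1.0) -> np.ndarray:
    """Per-bin log2 enrichment of MNase-treated coverage over the
    whole genome sequencing input control (conceptual only)."""
    return np.log2((mnase_cov + pseudo) / (input_cov + pseudo))

# Toy coverage over 10 genomic bins upstream of a gene.
mnase = np.array([5, 6, 30, 42, 35, 6, 5, 4, 6, 5], dtype=float)
wgs = np.array([10, 11, 10, 9, 10, 11, 10, 10, 9, 10], dtype=float)

signal = mnase_signal(mnase, wgs)
gains = signal > 1.0    # peak gain: the fresh-tissue pattern
losses = signal < -1.0  # peak loss: the inverted archival pattern
print(signal.round(2), gains.sum(), losses.sum())
```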

These plots can be read as a gene expression signal from the tissue: having treated it with micrococcal nuclease, you see this kind of tornado effect around expressed genes. In fresh tissue, looking at peak gains, you see the tornado effect, but you don't see it in the archival sample. It's only when you flip it on its head and look for peak losses that you see the same tornado signal in archival samples. So the signal isn't lost because the tissue has been treated with formalin very heavily, as we do in museums. It's just inverted, and you have to think about it differently in order to infer the same information from the tissue samples.
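As a sketch of how that inversion plays out in the tornado plots: stack the signal in a window around each transcription start site (TSS) and, for archival samples, flip the sign before looking for the same pattern. The function and the data below are hypothetical stand-ins, not our pipeline.

```python
import numpy as np

def tss_matrix(signal: np.ndarray, tss_bins: list[int],
               flank: int = 20) -> np.ndarray:
    """Stack the per-bin signal in a window around each TSS; plotted
    as a heatmap, one row per gene, this gives the 'tornado'."""
    rows = [signal[t - flank:t + flank] for t in tss_bins
            if flank <= t < len(signal) - flank]
    return np.vstack(rows)

rng = np.random.default_rng(0)
signal = rng.normal(0.0, 0.2, 10_000)  # stand-in genome-wide signal
tss = [500, 2_000, 7_500]              # stand-in TSS coordinates (bins)

fresh_view = tss_matrix(signal, tss)
# In archival tissue the expression signature appears as peak losses,
# so invert the sign to recover the same tornado pattern.
archival_view = tss_matrix(-signal, tss)
```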

So having worked this out in our mocked-up mouse specimens, we then wanted to put it to work on real collection samples. For this project component, we teamed up with researchers at the University of Queensland, Celine Frere and Nicola Jackson. They study the genomics of adaptation to urbanization in their chosen model, the very stately eastern water dragon. Celine and Nicola have collected a really vast modern data set describing how this species has responded to the gradient of urbanization across Brisbane, including behaviour, morphology, the prevalence of a fungal disease that is having negative effects on the population, and genetic differentiation associated with urbanization. So that describes the gradient across Brisbane as it is now. But Brisbane is an expanding city, and if they could access historical data, they would also be able to track the species' response through time to the growing urbanization of the area.

But to do that, they'd have to go to museum specimens, and of course all the museum specimens they'd be looking at would have been at least exposed to formalin. So we sent Nicola into the Queensland Museum with our specimen vetting system in hand, and she selected five specimens for a pilot study, sending me tissue from specimens collected between 1905 and 2001. I then processed them with our archival micrococcal nuclease assay, as well as our archival whole genome DNA extraction for genome recovery. These are those familiar tornado plots that we were generating from the mouse data, and in our five eastern water dragon samples we can see tornado signals that, albeit somewhat weaker, are readily visible in all five specimens. Even in our 1905 specimen you see that same effect. So that's expression signals across an entire 100-year period. Recapitulating what we saw when just extracting genomic DNA, age doesn't matter in terms of whether these sorts of specimens are going to be accessible for studying chromatin. I also calculated the total genomic coverage we were able to get from these five specimens, and it ranged between 7X and about 14X. Even from our specimen that is now 118 years old, we got genomic coverage that would be sufficient for calling SNPs and then potentially inferring genetic differentiation. So from the same assay, we're getting historical gene expression and chromatin information as well as whole genome sequencing data.
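As a back-of-envelope check on those coverage figures: mean fold coverage is just total sequenced bases divided by genome size. The numbers below are illustrative, not the actual run statistics for the water dragon samples.

```python
def fold_coverage(n_reads: float, read_len_bp: float, genome_bp: float) -> float:
    """Mean fold coverage = total sequenced bases / genome size."""
    return n_reads * read_len_bp / genome_bp

# Illustrative only: with 150 bp reads and a roughly 2 Gb lizard-sized
# genome, ~190 million mapped reads would give about 14X coverage.
print(round(fold_coverage(190e6, 150, 2e9), 1))  # -> 14.2
```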

So that's all well and good, but can we infer some kind of biological relevance from processing these samples in this way? We only have five eastern water dragon samples at this point, but even from these five we start to see some interesting patterns associated with sex and with the habitat from which the specimen was collected. Here I've taken the micrococcal nuclease signal from across the entire genome and I'm looking at the differences between individuals in a PCA plot. In this first plot, we can see a cluster of three males, all collected in city habitat, that cluster together even though they were collected over a period of 100 years. Then we have one sole female, off in her own little world, and another sole male that was collected from natural habitat. So we can already see very strong differentiation by sex and some differentiation by habitat. If I then focus on PCs 2 and 3, eliminating what we think is a very strong effect of sex, the samples separate quite strictly by where they were collected. So even with only five samples, we're starting to see signatures of biological information that we can infer from micrococcal nuclease sequencing, which suggests that if we sampled a larger population, we could infer really cool things about all sorts of environmental change. I'm going to hand over to Clare now, who's going to tell you a little more about where we're going.
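For the curious, the analysis behind those plots is an ordinary principal component analysis on a specimens-by-bins matrix of micrococcal nuclease signal. Here is a minimal sketch with stand-in random data; the real matrix has far more bins, and the metadata labels simply follow the talk's five specimens.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 1_000))  # rows: specimens, cols: signal bins
meta = [("M", "city"), ("M", "city"), ("M", "city"),
        ("F", "unknown"), ("M", "natural")]

pcs = PCA(n_components=3).fit_transform(X)
# In the real data, PC1 separated sex; PCs 2 and 3 separated habitat.
for (sex, habitat), (pc1, pc2, pc3) in zip(meta, pcs):
    print(f"{sex} {habitat:7s} PC1={pc1:+.2f} PC2={pc2:+.2f} PC3={pc3:+.2f}")
```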

Thanks very much, Erin. That was an absolutely excellent introduction to this new capability that we're developing at CSIRO. What I'm going to talk about briefly now is where to next, and what the future of this potential new analysis for museums looks like. I'd like to introduce a new genomic consortium that I'm trying to spin up at the moment, called the Temporal Epigenomics Atlas. This consortium aims to build essentially the first historical data set capturing the actual dynamics of gene expression over the last century. It will characterise historical epigenetic change, and it will allow us to investigate mechanisms of response to disease, how organisms develop and change over time, and also, importantly, in the current context of accelerating climate change, what the plastic responses to environmental change are and whether they're accelerating: whether organisms can cope, or whether they're going to succumb to these environmental challenges.

So what we need to do to achieve this goal is to develop a foundational data set. We need to take what we've learned in our initial project, funded through the Future Science Platform at CSIRO, and scale up. We've initially decided to focus on the top 10 most-represented species within collections. We're going to focus primarily on terrestrial vertebrates, really because it's my area of expertise and it's at my fingertips, and you've got to start somewhere. We're going to have a cross-representation of vertebrates, starting with three mammals, three reptiles and three amphibians. We have not yet decided which species these are going to be; that's a process to work through with our stakeholders and the people interested in this technique. But we've also recognised that, since humans are vertebrates too, we'd like to work with some biomedical counterparts to develop a temporal cohort of human samples to compare these specimens with.

So what does this actually mean, and how much work is involved? It sounds good, but when you actually start doing the math, it adds up very, very quickly. Say we want a 100-year time series for 10 species, remembering of course that epigenetic information is tissue specific, so if you're asking a specific biological question you have to match that question to the tissue; things start to add up quite a bit. It also depends on the genome size of the organism, so we've got two examples: a reptile with a regular reptile-sized genome at about 1.5 Gb, and frogs, which can be quite problematic because their genomes are very large. If you think about how much sequencing this involves, it's somewhere between 120 and 320 billion reads to get a temporal understanding of gene expression over time. That's why this is going to be a very large consortium-based investment in both time and money; it's just not something we can do on our own.
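To show how those read counts add up, here is the back-of-envelope arithmetic under assumed parameters: 150 bp reads, one sample per species per year, and target coverages of 10X and 30X. None of these values come from the talk, which didn't spell out its own assumptions.

```python
def reads_per_sample(genome_bp: float, coverage: float, read_bp: int = 150) -> float:
    """Reads needed for one sample = genome size x coverage / read length."""
    return genome_bp * coverage / read_bp

samples = 10 * 100  # 10 species x 100 yearly time points, one tissue each
for coverage in (10, 30):
    total = samples * reads_per_sample(1.5e9, coverage)
    print(f"{coverage}X on a 1.5 Gb genome: {total / 1e9:.0f} billion reads")
# -> 100 and 300 billion reads, the same ballpark as the talk's
#    120-320 billion, before even considering larger frog genomes.
```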

What we're hoping to do is use this opportunity to start the new consortium, and we're timing it so that the consortium launches in the next 6 to 12 months, to coincide with the opening of our new state-of-the-art collections building, which is currently under construction at Black Mountain. I reckon we're probably about halfway there at the moment. So we're really taking this opportunity to start reaching out to the community, to find partners who have an interest in this type of question, and to work out the most impactful way to develop that foundational data set.

And I guess that's where you come in. If the Temporal Epigenomics Atlas is aiming to build what is essentially an epigenetic time machine, we want to build the best time machine we can, and so we're going to need people with domain expertise: taxonomic experts with detailed knowledge of the biology of the specific organisms we're going to be sequencing. The involvement, collaboration and integration of museums is going to be absolutely essential for success. We're going to prioritise those top 10 species based on the questions that people can ask and the likely impact of answering those questions. We are, of course, also going to be looking for co-investment and any in-kind contributions that collaborators might be able to bring. So really, what we're hoping is that this talk brings you something you never thought possible: that you could go back in time and understand how organisms were responding to the environment, and what was going on at the gene expression level. If you come up with interesting questions, we want to hear from you. So please, please do reach out.

I'm going to finish up the talk today with a little bit of a plug. I have a textbook going to press on Friday, and it's currently accepting pre-orders. If you're interested in how epigenomics generally can be applied to ecological, evolutionary and wildlife management questions, it has a whole chapter dedicated to ecological epigenomics, so it might be a good place to start if you're new to this area. And with that, I'll put up the acknowledgement slide and say that this work has been an absolutely integral partnership with Erin. We've been a bit of a dynamic duo and driven this project beyond what I even thought possible at the beginning, and everyone thought I was crazy for thinking it was possible in the first place. I hope it's going to give museums a completely new capability, and really show the world why we're important, why we're current, and what we can contribute to science and the understanding of the natural world. So thanks very much, everybody.


Gene expression plasticity enables organisms to rapidly respond to changes in their environment. The degree to which such plasticity contributes to species' resilience or vulnerability to change is currently unknown, and this knowledge gap is hindering uptake of gene expression data in ecosystem monitoring. Historical gene expression data spanning the last century would accelerate study of how gene expression shifts might be involved in responses to environmental factors. Museum specimens represent phenotypic snapshots spanning the last 150 years – a period of rapid environmental change. Unfortunately, most specimens, especially those older than 20-30 years, were not preserved with RNA retention in mind. We must therefore seek alternative measures of historical gene expression to generate long-term time series data.



Drs Hahn and Holleley will share new capabilities they have developed from an unlikely source – formalin-preserved museum specimens. While formalin-fixation is generally viewed as a hindrance to analysing historical genomes, they have developed two complementary methods which harness historical fixation to characterise preserved epigenomes through measuring chromatin accessibility. These methods provide new capabilities to track the trajectory of temporal gene expression trends and increase the utility of formalin-preserved specimens that were previously considered sub-optimal genomic resources.

Pre-print of the study, Century-old chromatin architecture preserved with formaldehyde, is now available here: https://www.biorxiv.org/content/10.1101/2023.07.26.550239v1