27/10/2021 - AI3SD Autumn Seminar III: Data Science 4 Chemistry : AI 4 Scientific Discovery

This event was the third in the AI3SD Autumn Seminar Series, which ran from October 2021 to December 2021. The seminar was hosted online as a Zoom webinar on the theme of Data Science 4 Chemistry, and consisted of two talks on the subject. Below are the videos of the talks and the speaker biographies. The full playlist of this seminar can be found here.

Statistics Are a Girl’s Best Friend: Expanding the Mechanistic Study Toolbox with Data Science – Dr Anat Milo

Anat Milo received her BSc/BA in Chemistry and Humanities from the Hebrew University of Jerusalem in 2001, her MSc from UPMC Paris in 2004 with Bernold Hasenknopf, and her PhD from the Weizmann Institute of Science in 2011 with Ronny Neumann. Her postdoctoral studies at the University of Utah with Matthew Sigman focused on developing physical organic descriptors and data analysis approaches for chemical reactions. At the end of 2015 she returned to Israel to join the Department of Chemistry at Ben-Gurion University of the Negev, where her research group develops experimental, statistical, and computational strategies for identifying molecular design principles in catalysis, with a particular focus on stabilizing and intercepting reactive intermediates by second-sphere interactions.

Q & A

So how little data do you think we can get away with, on the small-data side?

It depends on a few things. As a rule of thumb, I like to go with 15 data points, because one of the rules of thumb is five data points per parameter. But again, that’s not great either because it’s a lot of parameters per data set. But if you have a simple question where your mechanism is fairly simple and it’s based on the components of the reaction, then you don’t need many data points to look at it. I think that one of the things that is at the core of this is that we design our datasets in advance to answer some questions. So, if we take aldehydes, we’d put something electron donating and withdrawing at the two position, then the same at the three position, and then at the four position, and then we put something big at the two, three, and four positions, and so we’re kind of probing things. If the mechanism is fairly complex, we’d need more data points to get a reliable correlation that validates, so it really depends on the system. You can go very small if you’re describing things that aren’t very complex, but the less data you have, the fewer parameters you’re allowed to use statistically.
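The rule of thumb mentioned in this answer can be written down as simple arithmetic. The sketch below is only an illustration of that heuristic; the function names are hypothetical and the default of five points per parameter is taken from the answer, not from any formal statistical criterion.

```python
def max_parameters(n_points: int, points_per_parameter: int = 5) -> int:
    """Largest number of model parameters the rule of thumb allows."""
    if n_points < 0 or points_per_parameter <= 0:
        raise ValueError("n_points must be >= 0 and points_per_parameter > 0")
    # Integer division: a partial group of data points does not buy a parameter.
    return n_points // points_per_parameter


def min_points(n_parameters: int, points_per_parameter: int = 5) -> int:
    """Smallest dataset the rule of thumb asks for with this many parameters."""
    if n_parameters < 0 or points_per_parameter <= 0:
        raise ValueError("n_parameters must be >= 0 and points_per_parameter > 0")
    return n_parameters * points_per_parameter


if __name__ == "__main__":
    for n in (10, 15, 30):
        print(f"{n} data points -> at most {max_parameters(n)} parameters")
```

With 15 data points this gives at most three parameters, which matches the figure quoted in the answer.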

Looking at your lab, it looks to me like you’ve got some flow control systems and so on there to make sure that the experiments are probably more reproducible than doing them all individually by hand. What is the interplay there between actually doing experiments and repeating them, and the sort of variability that clearly always crept in when I was trying to do experiments in the lab?

I think one of the key points is that when you’re looking at data that you want to analyse statistically, you want to know your error. So, it’s really important to be able to reproduce the experiment and then to look at the error, because it could be that you’re producing a model whose error is small compared to the experimental error, or vice versa. We always want to know that the experimental error is smaller than the error you get with the model. Every reaction that we run is at least in duplicate, if not more. I have here a picture; this is our bigger lab room. We have a smaller one which is temperature and humidity controlled. I had some talks with the people who are in charge of the infrastructure at the university, so we are now in new labs and we designed them. But before it was pretty difficult, so I used to have our temperature and humidity controller in my last slide, because that is critical. Again, that’s why high throughput or automation are really important, because if you can take out that variable, our human contribution becomes more interesting when we’re not just trying to get things to work the same way.
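The comparison described in this answer, experimental error from duplicate runs against the error of the model, could be checked along the following lines. This is a minimal sketch, not the group's actual workflow: the function names are hypothetical, the numbers are made up for illustration, and the pooled standard deviation of duplicates and the root-mean-square error are just one reasonable choice of error measures.

```python
import math


def duplicate_error(pairs):
    """Pooled standard deviation from duplicate runs: sqrt(sum(d^2) / (2n))."""
    if not pairs:
        raise ValueError("need at least one duplicate pair")
    return math.sqrt(sum((a - b) ** 2 for a, b in pairs) / (2 * len(pairs)))


def model_rmse(observed, predicted):
    """Root-mean-square error between observed and model-predicted values."""
    if not observed or len(observed) != len(predicted):
        raise ValueError("observed and predicted must be non-empty and equal length")
    squared = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(squared / len(observed))


# Illustrative values only (arbitrary units), not data from the talk.
duplicates = [(1.20, 1.26), (0.85, 0.81), (1.62, 1.58)]
observed = [(a + b) / 2 for a, b in duplicates]  # mean of each duplicate pair
predicted = [1.10, 0.95, 1.50]                   # hypothetical model output

exp_err = duplicate_error(duplicates)
mod_err = model_rmse(observed, predicted)

if __name__ == "__main__":
    print(f"experimental error: {exp_err:.3f}")
    print(f"model error:        {mod_err:.3f}")
    if mod_err < exp_err:
        print("Model error is below the experimental error: possible overfitting.")
```

A model whose error is smaller than the scatter between duplicates is fitting noise, which is why running every reaction at least in duplicate matters.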

When you’re only doing a relatively small number of reactions it’s at least conceivable to do them, even though it’s a huge amount of effort. But being able to do things high throughput on a small scale: with some of these reactions I’m never quite sure how small a scale one could realistically do them on and still have them be representative of doing them on a moderate scale. Or maybe they don’t need to be, but the less you have to use, the more things you can make, I guess, is the way I look at it. Is that realistic or does one really need to still handle this on a normal lab scale?

It really depends. I know that people have done things on a nanomole scale, and you can do this manually. You can take a 96-well plate or an even bigger one and load it yourself and you can get results. I think that normally they at least correspond with what you’d get on a larger scale. They’re telling you something: when you move to the larger scale, you probably have to tweak a little bit to optimise it, but it will definitely tell you something about the components and how they work together. But in our lab I think that the smallest scales we do are in GC vials; we haven’t gone smaller than that, but that’s pretty small.

It’s really nice to see a group where the experiments and the modelling are being done by the same group of people, so that the modellers really have a feel for the experiments and the experimentalists for the models (maybe it’s the same people doing both). It’s really important to have a feel for both sides, in my view, so that you know what to expect and what you need to do, and I think it makes a huge difference to the reliability of these things. Thank you.

I truly believe that, and I keep telling the students that they have to go into the lab and do the models as well. The best thing is knowing all the aspects of what you’re doing.

Data management: at the root of high-throughput experimentation – Dr Nessa Carson

Nessa Carson was born in Warrington, England. She received her MChem degree from Oxford University, before completing postgraduate studies in catalysis and organic methodology at the University of Illinois at Urbana-Champaign. She started in industry as a synthetic chemist for AMRI, then moved within the company to run the high-throughput automation facility on behalf of Eli Lilly in Windlesham, working across both the discovery and process chemistry arenas. She then worked in process development using automation at Pfizer. Nessa started at Syngenta in 2020, working in automation, reaction optimization, and data management. She maintains a website of useful chemistry resources, https://supersciencegrl.co.uk.

Q & A

Q1: Are all your synthesis reactions done at room temperature and atmospheric pressure? If so, does this limit what you can produce?

They are absolutely not, and certainly not at room temperature. I tend to run my reactions in aluminium blocks rather than the plastic plates that you’ll see biologists using, so that really helps. Cooling is actually surprisingly difficult sometimes, but you certainly can cool aluminium blocks as well as heat them. As for pressure, you have to have specialist kit to run high throughput at high pressure; it’s doable but I think there are still challenges, because most of the time it’s literally a box – a box of gas that goes around your plates. If you’re running truly high throughput, like 96-well, you do have that extra concern of potential solvent overspill between different reactions. Both of these things are very, very doable.

Q2: What do you use to destroy the chemicals after you are done with an experiment?

When it comes to quenching reactions, honestly, I don’t really do workups like aqueous workups unless I have to. Most of the time I will add a solvent to every vial in a plate, sample from that, and inject directly into the LC-MS just because it’s so much faster. I will only not do that if there’s a very good reason not to. The advantage of very small scale is that hazards are often mitigated, so you pretty much can quench things in the same way as you would in a lab, but quickly with a multi-channel pipette. If you need to add bleach or water or whatever it is, you have that advantage of being able to quench very quickly simply because of the small scale.

Q3: In each pie chart on Slide 12, what do the four items (colours) in the pie chart indicate and what does the size of the circle indicate?

Each colour refers to a different component in the reaction mixture. So, green is good essentially: here, green is the desired product. In general, I like to stay consistent so our colleagues who look at this would immediately understand that a row full of green is good because I always make green good. These charts show relative amounts in the LC-MS analysis, in this case, the UV peak area at a certain wavelength. The size of the circle is pretty much a reaction profile, so if there are a lot of impurities not accounted for, the circle is small. So, a small pie chart is a messy reaction; a large pie chart is a clean reaction, basically.
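One way to read the encoding described in this answer is sketched below. It is an assumption about how such a chart might be computed, not the speaker's actual plotting code: the function name `pie_summary` is hypothetical, slices are taken as each assigned component's share of the assigned UV peak area, and the circle size is taken as the fraction of total peak area that is accounted for.

```python
def pie_summary(peak_areas, total_area):
    """Return (slice fractions, relative circle size) for one well.

    peak_areas: dict mapping each assigned component to its UV peak area.
    total_area: summed area of all peaks, including unassigned impurities.
    """
    accounted = sum(peak_areas.values())
    if total_area <= 0 or accounted <= 0:
        raise ValueError("peak areas must be positive")
    if accounted > total_area:
        raise ValueError("assigned area cannot exceed total area")
    # Slice = share of the assigned area belonging to each component.
    slices = {name: area / accounted for name, area in peak_areas.items()}
    # Size = how much of the chromatogram is accounted for (1.0 = clean).
    size = accounted / total_area
    return slices, size


if __name__ == "__main__":
    # Illustrative numbers only.
    well = {"product": 60.0, "starting material": 20.0}
    slices, size = pie_summary(well, total_area=100.0)
    print(slices, size)
```

Under these assumptions a well with many unassigned impurities gets a small circle, and a well dominated by product gets a large, mostly green one.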

Q4: What is a “self-optimising reaction”?

Having been asked that question, I now realise maybe I should call it a “self-optimising experiment”; maybe “self-optimising reaction” is wrong. Self-optimising experiments – I will make an effort to say that from now on. I suppose I mean something that automation can optimise essentially by itself or partly by itself. So, you would probably feed it a parameter space to start with. In fact, in all cases at the moment in the literature, even if they don’t explicitly state this, as with Bayesian optimisation, they will always provide a relatively small parameter space to start with. But it is something that automation can essentially just run: it might start off exploring chemical space broadly, and then the software would generate a statistical model with an objective to maximise what appears as green in these pie charts and keep sampling, working towards that, probably with some kind of machine learning.
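The loop described in this answer (start from a small parameter space, explore broadly, then keep sampling towards a better outcome) can be sketched in a few lines. Everything here is a stand-in: `run_experiment` is a hypothetical simulated reaction rather than real automation, the parameter names and bounds are invented, and the refinement step simply perturbs the best conditions found so far in place of the statistical model or Bayesian optimisation mentioned above.

```python
import random

# Hypothetical parameter space supplied up front, as the answer describes.
BOUNDS = {"temperature_c": (20.0, 120.0), "equivalents": (1.0, 3.0)}


def run_experiment(temperature_c, equivalents):
    """Simulated product fraction (the 'green' share); peaks at 80 C, 1.5 equiv."""
    score = 1.0 - ((temperature_c - 80.0) / 60.0) ** 2 - ((equivalents - 1.5) / 1.5) ** 2
    return max(0.0, score)


def clamp(value, low, high):
    return max(low, min(high, value))


def self_optimise(n_explore=8, n_refine=20, seed=0):
    rng = random.Random(seed)
    history = []  # list of (result, conditions)
    # Phase 1: explore the supplied parameter space broadly at random.
    for _ in range(n_explore):
        cond = {k: rng.uniform(lo, hi) for k, (lo, hi) in BOUNDS.items()}
        history.append((run_experiment(**cond), cond))
    # Phase 2: keep sampling near the best conditions seen so far.
    for _ in range(n_refine):
        _, best = max(history, key=lambda record: record[0])
        cond = {
            k: clamp(best[k] + rng.gauss(0.0, 0.1 * (hi - lo)), lo, hi)
            for k, (lo, hi) in BOUNDS.items()
        }
        history.append((run_experiment(**cond), cond))
    return history


if __name__ == "__main__":
    runs = self_optimise()
    best_result, best_cond = max(runs, key=lambda record: record[0])
    print(f"best simulated product fraction: {best_result:.2f} at {best_cond}")
```

A real system would replace the perturbation step with a fitted model that proposes the next conditions, but the overall shape (bounded space, exploration, then guided sampling) is the same.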

Q5: Is all of your HTE data stored in a database and is that data FAIR?

I would like it to be FAIR. I push very hard for things to be as FAIR as possible when it comes to data. I go on about it a lot and I think people are probably sick of me going on about it. It’s incredibly important that we have that FAIR storage model, at least within the company. And it’s definitely getting a lot better; I always try to ensure that at least my own data are FAIR. As for storing in a database, this is evolving, for me at least. It’s definitely getting better and better over time. Many people are defining our understanding of what kind of data standards we want and how they will work, not just for chemists but for everybody from IT to biology to other people who might have need for this later on: formulation, etc. So, yes, it is stored somewhat sensibly right now, but I think we should always be making improvements on this.

JF: I really like the way you’ve integrated the data management, the actual running of the experiments, the people, and the issues around it for a real working lab. Lots of questions I’d like to ask, but let’s just focus on that. Your high throughput essentially is parallel because you’ve got the well plates and then you repeat it. Obviously, there’s a serial element to this as well, especially when you’re running, say, different conditions. Have you looked at any of the designs, like the sequential updating of the Design of Experiments, so that as you’ve got some data coming in from some parts of your matrix, you decide which things to do next?

That would be nice, but basically the answer is not right now for me. You see so many impressive things in the literature, particularly around Bayesian optimisation and that kind of thing. I think there will be a lot more space in industry for this kind of computer-guided self-optimisation in the future, although I also like generating a large amount of data very quickly, so I believe 96-well plates are here to stay too.

Q6: When you put a seed in a small well with dirt, after you are done, how do you destroy the chemicals?

Good question – I honestly don’t know! It’s of course important not to let chemicals that are not fully tested for environmental safety be released, so I would guess treated plants are dealt with similarly to toxic chemicals.