So, this lecture is going to be about the challenges of scientific reproducibility, which are the motivation for the development of the Galaxy platform. In the last decade, biology has rapidly become a highly data-intensive science, dependent on complex computational and statistical methods. And so a major question for infrastructure providers is: how can we make these methods as accessible as possible for researchers, while ensuring that scientific results remain reproducible?

In genomic research in particular, this question of reproducibility is something of a crisis. There have been many studies that have looked at this question, but one that summarizes it very well is the Reproducibility Project for Cancer Biology. This is a project that is attempting to independently replicate 50 high-impact cancer studies that were published in the years 2010 to 2012, and you can find out more about the project at that URL. However, what I want to draw your attention to is another study, which looked at these same papers but asked the question: how identifiable are the research resources used in each of these papers? Here we have a graph, available at the URL shown there, where they've looked at different types of research resources and asked whether they could be identified, meaning, could you actually find the exact resource that was used in the paper? If you look over on the right, you'll see software. Software is the main thing we're going to be talking about throughout this course, and the fraction that's identifiable here is very low: only 32 out of 127 tools, less than 30%. Even worse, in only 6 out of 41 papers could all of the software be identified. Now, identifiable means more than just knowing the name of the software; you have to know the particular version of the software that was used.

So why is this important? Because methods details really matter in computational analysis. This is a graph looking at different versions of a tool called BWA, which is a tool for aligning short sequencing reads back to a reference genome. We've looked at a particular variable site and at the frequency of different variants at that site. Along the x-axis you have the different versions of the tool that we ran, and the lines are different parameter settings. What you can see is that both the version and the parameters used for this tool can substantially affect the outcome that you get. So it's absolutely crucial for scientific research that these details, particularly the versions of software and the parameters used, are recorded.

So what is reproducibility, generally? Provenance alone is not reproducibility: provenance provides documentation of what was done, but it doesn't mean you can actually go back, recover the particular tools that were used, and reproduce the analysis. Conversely, reproducibility does not necessarily mean reusability; it doesn't mean that the analysis generalizes to the analysis of different data. And certainly, reproducibility does not mean correctness: just because you can reproduce an analysis doesn't mean that it's the correct analysis.
But what it does mean is that the analysis is described in sufficient detail that it can be precisely reproduced by another person and, presumably, in another environment. This is really a minimum standard; we'd like to go beyond it. But being able to just reproduce an analysis means that for papers and research that have been done, you can actually go back, inspect them, understand how they were done, and get a deep understanding of the methods that were used.

Unfortunately, most published analyses are not reproducible. I showed you one example; however, there have been many studies of this, and it continues to be the case that it's difficult to reproduce analyses. This is because either the software is not available, or the versions of the software that were used, the parameters, or even the data used in the study are not available.

So we put together some recommendations for how to make analyses more reproducible, and this is an abridged version; if you're interested you can read the paper, where we have a more extended set of recommendations. The crucial thing is that you first accept that computation is now an integral component of biomedical research: for any kind of research that involves a significant amount of data, you have to think of computation as being extremely important, and you have to put in the effort to accurately document these details. You should always provide access to the raw primary data. There may be cases where the data needs to be protected, but there are data stores available that allow you to make protected data accessible. You need to record the versions of all datasets that you use, or archive all the data used in your analysis. You must record the exact versions of all software used and, ideally, archive the software itself; depending on a secondary archive for the software is problematic, because you can't guarantee that it will still be available down the road. And finally, record all the parameters, even if you are just using default values, because defaults can change over time, and it makes sense to make sure that you've recorded as much as possible.

So finally, is this achievable? There are, in fact, a spectrum of solutions, and it requires effort on your part. There are different approaches you can take to reproducibility. If you're working at the command line, there are best practices you can use, including things like version control and reproducible builds, that allow you to document an analysis. There are analysis environments where, if you perform your analysis within that environment, reproducibility is automatically handled for you. There are workflow systems that allow you to capture an analysis. There are notebook-style systems, like the IPython notebook, where the analysis is recorded automatically as you do it. Similarly, there are literate programming systems, popular ones being Sweave and knitR, that, much like notebooks, allow you to document an analysis inline with the data. And there are systems-level approaches to this: particularly popular these days are ways of capturing your complete analysis environment through virtual machines, and things called containers, which allow you to capture all of the software and all of the versions associated with an analysis that you've done. Galaxy is one solution to this problem, and that's the one that we're going to talk about here.
It's an analysis environment in which you perform your analysis, and the versions, the provenance, and all of the details of your analysis are automatically captured for you.
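To make the earlier recommendation about recording software versions and parameters a bit more concrete, here is a minimal sketch, in Python, of how a single command-line step could be wrapped so that the tool version, the exact command line, and checksums of the input files are written to a small provenance file alongside the result. The `run_and_record` helper, the `--version` probe, and the commented-out BWA invocation at the end are illustrative assumptions rather than a prescribed method; environments like Galaxy do this kind of bookkeeping for you automatically.

```python
import hashlib
import json
import subprocess
from datetime import datetime, timezone


def sha256sum(path):
    """Return the SHA-256 checksum of a file, so an input dataset can be identified exactly."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def run_and_record(command, inputs, record_path):
    """Run one command-line analysis step and write a small provenance record for it.

    `command` is the exact argument list, including every parameter you set
    (spell out defaults too, since defaults can change between versions), and
    `inputs` lists the data files the step consumed.
    """
    # Ask the tool for a version string. The `--version` flag is an assumption:
    # tools differ, and some print version information elsewhere, so adapt this
    # to whatever the tool you are wrapping actually supports.
    probe = subprocess.run([command[0], "--version"], capture_output=True, text=True)
    version = (probe.stdout or probe.stderr).strip()

    # Run the actual analysis step; raise if it fails.
    subprocess.run(command, check=True)

    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "tool": command[0],
        "tool_version": version,
        "command_line": command,
        "input_checksums": {path: sha256sum(path) for path in inputs},
    }
    with open(record_path, "w") as handle:
        json.dump(record, handle, indent=2)


# Hypothetical usage: align reads with BWA-MEM and keep the provenance record.
# run_and_record(
#     ["bwa", "mem", "-t", "4", "-o", "aligned.sam", "reference.fa", "reads.fastq"],
#     inputs=["reference.fa", "reads.fastq"],
#     record_path="aligned.provenance.json",
# )
```

Even a small record like this captures the details that the BWA example showed can change the outcome of an analysis: the version that was run, and every parameter that was passed.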