My goal now is to show you a couple of very simple ways to optimize your code so it runs considerably faster, for single-threaded code. Code that just uses default scheduling and that sort of thing, just focused on making code run faster. The reason for this is that on occasion the worst-case execution time is just too long, and it's actually the C_i that's the problem for meeting deadlines, or feasibility, or safety margin, so sometimes we just need to make code run faster. Let's take a look at that. The first thing I have is a prime number [inaudible]. It's a very crude algorithm, the sieve of Eratosthenes. There are better algorithms, but the whole point is just how fast we can make this run, no matter how good the algorithm is. If we look at our Makefile: very often, when you're developing and you haven't yet fully debugged your code, it's best to compile with -O0 (no optimization), the -g debug flag, -Wall, and so forth, and we've built this code that way. I'm just going to run this single-threaded version, which is called erastsimp. I'm going to look for primes between zero and 100 million, I believe. It'll keep that CPU busy, and I have a timestamp for when the test started. Go look at htop. Because it's single-threaded and best-effort, we're just going to see it at 100 percent on whatever CPU core the OS gives the program. It's done already. It just ran on one core and it found 5,761,455 primes. You can go and check that. That took about 11 seconds; I'm using CLOCK_MONOTONIC to get that. We can take a quick look at the code too, if you're curious. It tests just around 8.9 million numbers per second. But the point here is: once you are sure your code is working correctly, turn on optimization. You can go through the levels O1, O2, O3. I'm going to jump straight to O3. I'm not going to turn on any vector instructions yet because I don't have any reason to believe they're going to help.
We can always test that hypothesis if we're not sure. I've got to do a make clean and a make, and I'm going to run it again. Again, if I go over to htop, I should see it there on a core. If it runs long enough, it could eventually get migrated by the OS. It's on core 1 now, but it's already done. It ran reasonably faster: 8.5 seconds compared to 11.27, almost 12 million numbers per second. Let's see if vector instructions buy us anything at all. I'm going to say no, because this is a very iterative algorithm that just uses simple multiplication and some bitwise operations, as I recall. But we can try it. There you can see the NEON instructions were added in. We'll run it again and see if it improves things at all; if it doesn't, I'm going to show you an algorithm where it does. It's hard to argue that it really improved, because 11.67 versus 11.77 is probably in the noise, and likewise 8.56 versus 8.49. Remember, these are best-effort threads, so there could be interference from other background threads. We could make this SCHED_FIFO and give it a higher priority if we really wanted it to behave more like a real-time thread, but here I was just trying to show the optimization. Now I'm going to go to a different example. This is a single-threaded PSF, point spread function, convolution. You've seen it before if you've been in my Course 1 and Course 2. I've improved it: I've made it load files a lot faster, because it's loading a 12 megapixel file and the code I had originally was really slow, so I made the IO much faster so that we can actually process a really large frame. Then we still have a threaded version of it as well. I'm just going to show you the Makefile really quick. Normally, when I'm developing something for the first time, I'll do -O0 -g because I have to debug, and that's going to make it a lot easier to debug.
I'm going to go ahead and run this on two files: the input file and the sharpened version that it's going to create. By the way, before I optimized the read, the IO took 20 seconds, and now it's just a little longer than a second. It's getting a frame done about every ten seconds with no optimization, and those are 12 megapixel frames. That's a large frame: it takes a good chunk of memory and takes quite a bit of work to plow through. I could certainly work on my programming to try to improve it too, but I might as well see what I can get out of the compiler before I work too much at the C code level or something like that, because maybe I can get enough out of the compiler that I don't have to go in and tweak the code. I'm going to go O3 next. You might go through the levels; sometimes your code can break, so you might have to settle for O2 or something like that. I'm going to do make clean and make. We're at O3 now; let's see how much speed we're going to get. Look at that: the IO is a lot faster, like three times as fast, maybe four times as fast, and the frames are faster too. 3.3 seconds as opposed to about 10. We got done in 9.19 seconds for three frames. That's still three seconds per frame, but these are 12 megapixel frames; these are not small frames. Let's go for the vector instructions. Now, this might be a more subtle difference. Remember, 9.2 total and about 3.3 per frame. Now we see the NEON instructions going in there, and let's run it again. Already, even the file IO was faster, so vector instructions somehow benefited that. It looks appreciably faster: that's 7.4 seconds total, with frames getting down to about 2.6. Vector instructions really do pay off here; that's fairly apparent. We could repeat it tens of times or hundreds of times to make sure, but it's pretty clear that the vector instructions do matter. A few little tricks: I'm going to go back to the simpler code.
Just when you're doing this work, you can use the time utility: time ./erastsimp. If you do that, you get measurements from the time utility that's available in Linux and basically all Unix systems. You can see my timing said it took 8.5 seconds; they measured 8.765 seconds of real time. That's supposed to be wall-clock time, like my timing. But they also say the user time, the time spent in user space, was 8.03 seconds, and it spent 0.8 seconds in kernel space. That's one thing that makes their time interesting: the Linux, or Unix POSIX, time shows you kernel time and user time. We can try that over here too, just for grins, and see how it compares on our optimized one. We don't have to wait around too long; let's see how it compares. My timing is just the user-space time, so 7.45, and they get 7.61; theirs always seems to be a little longer. Then the kernel time is 0.3. Most of it is spent in user space, and that gives you an introduction to basic optimization. I also have a number of Linux how-tos and Makefile tips and so forth that I will post in resources for all of you, to help with simple things like code optimization, what compiler flags you want, and so forth. That should get you going, and thank you very much.