2. Replicate the host computation for the small example on the Nvidia device using CUDA code. Time it. It is likely that your device computation will be much slower on the device, especially if data movement from host to device and back is taken into account.
3. Extend the example to any input array size while comparing that host and device produce the same result. Keep comparing execution time.
4. Apply the optimisations found on the research paper work indicated above and replicate their experiments. Organize your experiments in tables. Produce graphs similar to Figure shown below.
5. Try to apply your own ideas for further optimisation. Include them (and briefly explain) even if they fail to give you an advantage.
Write a report on: Kernel Density Estimation on the GPU Using the CUDA Framework. The report should include (Methodology and implementation) on the following:
1. Sequential Algorithms (univariate)
2. Naive CUDA algorithms (univariate)
3. CUDA optimisations algorithms (univariate)
4. Discuss and analyze your results and conclude quantitatively (speedup, bandwidth, etc).
5. Please include all source code (c++ and CUDA source code with comments) So that it can be easily replicated. Use scripts in order to show graphs/tables included in the report can be generated with little effort.