In this project, you will develop a complete CUDA program to compute the histogram of an input array. You will implement the histogram on the GPU device. After the device histogram kernel completes, your program will also compute the histogram sequentially on the CPU and compare that result with the device-computed result. If the two match, it will print "Test PASSED" to the screen before exiting. Assume the histogram has 256 bins, i.e., bin 0, bin 1, …, bin 255. Input value i is mapped to bin i.
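The sequential CPU reference and the comparison step described above might be sketched in plain C as follows (function names such as cpu_histogram and check are illustrative, not prescribed):

```c
#include <stdio.h>

/* Sequential reference histogram: 256 bins, input values 0..255. */
void cpu_histogram(const int *A, int N, unsigned int hist[256]) {
    for (int b = 0; b < 256; b++)
        hist[b] = 0;
    for (int i = 0; i < N; i++)
        hist[A[i]]++;  /* value i goes to bin i */
}

/* Compare the device result against the host result; 1 means match. */
int check(const unsigned int *gpu, const unsigned int *cpu) {
    for (int b = 0; b < 256; b++)
        if (gpu[b] != cpu[b])
            return 0;
    return 1;
}
```

After copying the device histogram back to the host, the program can print the required message with, e.g., `if (check(h_hist_gpu, h_hist_cpu)) printf("Test PASSED\n");`.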
Use the following C code for array initialization:

    int *A = (int *)malloc(sizeof(int) * N);  /* N is the array size */
    int init = 1325;
    for (int i = 0; i < N; i++) {
        init = 3125 * init % 65537;
        A[i] = init % 256;
    }
Task 1 - Basic CUDA Program using global memory
Develop a CUDA program in which the GPU threads collectively perform the histogram calculation. Use an atomic instruction so that only one thread at a time updates each location in the global histogram array.
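A minimal sketch of such a kernel, assuming a zero-initialized 256-bin histogram in global memory (the kernel and parameter names are illustrative, not prescribed):

```cuda
// Task 1 sketch: every update goes straight to the global histogram.
__global__ void histogram_global(const int *A, unsigned int *hist, int N) {
    // Grid-stride loop so any grid size covers the whole input.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < N;
         i += gridDim.x * blockDim.x) {
        // atomicAdd serializes concurrent updates to the same bin.
        atomicAdd(&hist[A[i]], 1u);
    }
}
```

A launch such as `histogram_global<<<(N + 255) / 256, 256>>>(d_A, d_hist, N);` would then cover the input with 256-thread blocks.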
Task 2 – CUDA program that takes advantage of shared memory
In Task 1, you will find that your GPU program's speedup over the CPU version is very limited due to the atomic accesses to the global histogram array. Modify the Task 1 code to improve the speedup by using GPU shared memory and registers.
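One common approach, sketched here with illustrative names, is to privatize the histogram: each block accumulates into a shared-memory copy, where atomics are much cheaper, and merges it into the global array once at the end:

```cuda
#define NUM_BINS 256

// Task 2 sketch: per-block shared-memory histogram, merged once per block.
__global__ void histogram_shared(const int *A, unsigned int *hist, int N) {
    __shared__ unsigned int local[NUM_BINS];

    // Zero the private histogram cooperatively.
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        local[b] = 0;
    __syncthreads();

    // Shared-memory atomics contend only within this block.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < N;
         i += gridDim.x * blockDim.x)
        atomicAdd(&local[A[i]], 1u);
    __syncthreads();

    // One global atomic per bin per block, instead of one per element.
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        atomicAdd(&hist[b], local[b]);
}
```

The merge step reduces global-memory atomics from one per input element to at most 256 per block, which is where most of the speedup comes from.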
Record your runtimes for Tasks 1 and 2 with respect to the different input array sizes shown in the following table, and compute the speedup from the GPU computation time and the CPU computation time. The thread block size is not specified; you can explore different thread block sizes to find the best one for each input array size. A thread block size of 256 is the most obvious choice.
Optional: You can also include the memory transfer time between the CPU and GPU in the GPU computation time (in that case, it would be fair to also include the array initialization time in the CPU computation time), and re-compute the speedup.
    Time                     | 131072 (128*1024) | 1048576 (1024*1024)
    -------------------------|-------------------|--------------------
    CPU computation time     |                   |
    GPU computation time     |                   |
    GPU memory transfer time |                   |
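One way to fill in the GPU rows of the table is with CUDA events; the sketch below assumes device buffers d_A and d_hist and a kernel named histogram_global already exist (all of these names are assumptions, not part of the assignment):

```cuda
cudaEvent_t start, stop;
float kernel_ms = 0.0f;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
histogram_global<<<blocks, threads>>>(d_A, d_hist, N);  // assumed kernel
cudaEventRecord(stop);
cudaEventSynchronize(stop);                 // wait for the kernel to finish
cudaEventElapsedTime(&kernel_ms, start, stop);  // elapsed time in ms
```

The memory transfer row can be measured the same way by recording events around the cudaMemcpy calls; the CPU row can use any host timer (e.g. clock() or gettimeofday()).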
Note that the compile command for a CUDA program using atomic instructions should add the -arch compiler option. The following command can be used to compile the source CUDA program with file name histogram.cu:

    nvcc histogram.cu -o histogram -arch=sm_30
I have read your description and I am very interested in your project.
I am confident I can finish it on time.
I am an experienced and skillful CUDA/OpenMP/MPI programmer.
I have more than 5 years of experience in software development.
I have finished many projects like this one.
I guarantee high quality and will keep your deadline.
Please contact me so we can discuss the details.
Working with me, you will have a good experience and save time and money.
Best regards!
Hello, I am a CUDA expert with experience in algorithm design. I have developed many algorithms using CUDA and I would like to implement the histogram algorithm using CUDA. Please contact me to discuss the details and the timeline.
Hi there. I have plenty of experience in C++ and CUDA, and I have done a similar project before. Please start a chat about the project; I would be glad to work on it.
No problem!
I have read your description carefully and am very interested in your project.
I have been building desktop apps with C/C++, C#, Python, and Java for 7 years.
I think I can do it perfectly.
If you hire me, you will get great results.
I can work full-time in your time zone.
Best regards
Hi,
I have about seven years of experience in C and CUDA and have developed similar algorithms in CUDA. I have two GPU cards, a Tesla and a Pascal. I will be able to complete your project as per your requirements and well within time.
Thanks,
Ajay