I want to understand how these systems run, and why they run so fast in some settings (like CPU vs GPU).
What is actually happening under the hood? Ideally I want to get to a place where I can make my own computation faster.
Start with →
N×N matrix multiplication ⇒
on CPU vs GPU
On a CPU, the naive algorithm takes O(n^3) time: n^2 output entries, each an O(n) dot product computed one after another. How long does it take on a GPU?
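Each thread computes one entry of the output, so there are n^2 threads in total, and each thread does O(n) work (a dot product of length n).
The total work is still O(n^3); the GPU does not change that, it only spreads it out.
If all n^2 threads really ran at the same time, the wall-clock time would be O(n). A real GPU has a fixed number of cores, say P, so once n^2 > P the time is closer to O(n^3 / P).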
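
To make the "one thread per output entry" idea concrete, here is a rough sketch in plain Python. It is not real GPU code: `matmul_cpu`, `thread_body`, and `matmul_gpu_style` are names made up for these notes, and the "threads" run one after another. The point is only to show that the work splits into n^2 independent pieces, each costing O(n).

```python
def matmul_cpu(A, B):
    """Naive sequential multiply: three nested loops, O(n^3) steps."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i in range(n):          # row of the output
        for j in range(n):      # column of the output
            for k in range(n):  # dot product of row i of A and column j of B
                C[i][j] += A[i][k] * B[k][j]
    return C


def thread_body(A, B, i, j):
    """Work of the one hypothetical thread that owns output entry (i, j).

    It only reads A and B and produces a single value, so it does not
    depend on any other thread. Cost: O(n).
    """
    n = len(A)
    return sum(A[i][k] * B[k][j] for k in range(n))


def matmul_gpu_style(A, B):
    """Same result, organised as n^2 independent thread_body calls.

    On a GPU these calls would be scheduled in parallel across the cores;
    here they simply run sequentially (an assumption of this sketch).
    """
    n = len(A)
    return [[thread_body(A, B, i, j) for j in range(n)] for i in range(n)]


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul_cpu(A, B))        # [[19, 22], [43, 50]]
print(matmul_gpu_style(A, B))  # [[19, 22], [43, 50]]
```

Both versions do the same n^3 multiply-adds. The only difference is that the second one makes the independence of the n^2 entries explicit, which is what lets a GPU run them side by side.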