CUDA, memory, low latency and faster computation

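I want to know how these systems run, and how they can run so fast in different cases (like CPU vs GPU). What is actually happening under the hood? Ideally I want to move towards a place where I can do faster computation myself. Start with N*N matrix multiplication on CPU vs GPU. On a CPU the naive algorithm takes O(n^3) time: there are n^2 output entries, and each one is a dot product of length n. What time will it take on a GPU? Each thread computes one entry of the output, so there are n^2 threads in total, and each thread does O(n) work. The total work is still O(n^3); what changes is how much of it happens at the same time. If all n^2 threads could really run simultaneously, the wall-clock time would be O(n). A real GPU can only run some number p of threads at once, so the time is closer to O(n^3 / p), and in practice memory access speed matters as much as the arithmetic.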
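To make the "one thread per output entry" idea concrete, here is a minimal sketch in plain C++ that emulates it on the CPU. It is not real CUDA code: the function names `output_entry` and `matmul_per_entry` are made up for illustration, and the matrices are assumed to be square and stored row-major in flat arrays. In an actual CUDA kernel the two outer loops would disappear, and each thread would work out its own `row` and `col` from `blockIdx`, `blockDim` and `threadIdx`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Work done by ONE hypothetical GPU thread: the dot product of row `row`
// of A with column `col` of B. This single loop is the O(n) part.
float output_entry(const std::vector<float>& A, const std::vector<float>& B,
                   std::size_t n, std::size_t row, std::size_t col) {
    float sum = 0.0f;
    for (std::size_t k = 0; k < n; ++k) {
        sum += A[row * n + k] * B[k * n + col];
    }
    return sum;
}

// CPU emulation of launching n*n threads. The two loops below visit every
// (row, col) pair one after another; on a GPU many of these pairs would be
// computed at the same time by different threads.
std::vector<float> matmul_per_entry(const std::vector<float>& A,
                                    const std::vector<float>& B,
                                    std::size_t n) {
    assert(A.size() == n * n && B.size() == n * n);
    std::vector<float> C(n * n, 0.0f);
    for (std::size_t row = 0; row < n; ++row) {
        for (std::size_t col = 0; col < n; ++col) {
            C[row * n + col] = output_entry(A, B, n, row, col);
        }
    }
    return C;
}
```

Calling `matmul_per_entry(A, B, n)` with two flat n*n vectors returns the flat n*n product. The point of splitting the code this way is that `output_entry` is independent for every (row, col): no entry needs the result of any other, which is why the n^2 entries can be handed to n^2 threads.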