Improve multiblock efficiency
If right now we run a 3\times 3 block-diagonalization in which one block is completely decoupled, the number of matrix products is larger than in the 2-block case. This is because we use \mathcal{W} = -\mathcal{U}'^\dagger\mathcal{U}'/2 in that case and do not exploit the fact that \mathcal{W} and \mathcal{V} commute.
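To make the block structure behind this explicit, here is a minimal NumPy sketch. It uses single matrices rather than the perturbative series (where every product would be a Cauchy product), and it assumes the 2-block setup in which \mathcal{W} is Hermitian and block-diagonal while \mathcal{V} is anti-Hermitian and block-off-diagonal. The block sizes and all variable names are made up for illustration. It only checks the algebraic identity \mathcal{U}'^\dagger\mathcal{U}' = \mathcal{W}^2 - \mathcal{V}^2 + [\mathcal{W}, \mathcal{V}] and where each term lives.

```python
import numpy as np

rng = np.random.default_rng(0)
n_a, n_b = 2, 3  # hypothetical block sizes, chosen only for illustration
n = n_a + n_b


def random_complex(rows, cols):
    """Random complex matrix used to fill the blocks."""
    return rng.standard_normal((rows, cols)) + 1j * rng.standard_normal((rows, cols))


# W: Hermitian and block-diagonal (assumption of the 2-block setup).
W = np.zeros((n, n), dtype=complex)
a = random_complex(n_a, n_a)
b = random_complex(n_b, n_b)
W[:n_a, :n_a] = a + a.conj().T
W[n_a:, n_a:] = b + b.conj().T

# V: anti-Hermitian and block-off-diagonal (assumption of the 2-block setup).
V = np.zeros((n, n), dtype=complex)
c = random_complex(n_a, n_b)
V[:n_a, n_a:] = c
V[n_a:, :n_a] = -c.conj().T

U_prime = W + V
full_product = U_prime.conj().T @ U_prime  # equals (W - V)(W + V)
squares = W @ W - V @ V
commutator = W @ V - V @ W

# The full product splits into the squares plus the commutator.
identity_holds = np.allclose(full_product, squares + commutator)

# The squares only populate the diagonal blocks ...
squares_are_block_diagonal = (
    np.allclose(squares[:n_a, n_a:], 0) and np.allclose(squares[n_a:, :n_a], 0)
)
# ... while the commutator only populates the off-diagonal blocks.
commutator_is_block_offdiagonal = (
    np.allclose(commutator[:n_a, :n_a], 0) and np.allclose(commutator[n_a:, n_a:], 0)
)

print(identity_holds, squares_are_block_diagonal, commutator_is_block_offdiagonal)
```

With this structure the commutator is the only contribution to the off-diagonal blocks of \mathcal{U}'^\dagger\mathcal{U}', so once it is known to vanish those blocks do not need to be computed. With 3 or more blocks this clean separation is not guaranteed, which is why a more general criterion is needed.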
If we're solving the dense 3\times 3 block-diagonalization, however, computing \mathcal{W}^2 separately from \mathcal{V}^2 is slower: \mathcal{W}^2 requires 6 Cauchy products and \mathcal{V}^2 requires 3 more, compared to just the 6 needed for \mathcal{U}'^\dagger\mathcal{U}'.
Right now, whether the off-diagonal terms are computed is controlled by the two_block_optimized variable. There is likely a better criterion for that, one more general than checking whether the matrices have 2 blocks.
This should be done after #128 (closed) (don't optimize before knowing that it helps).