Here's a good article by Ann Steffora Mutschler that describes some of the issues involved in implementing hardware cache coherency using cache-coherent interconnects. Neil Parris from ARM is quoted, as am I.
Here's the link to the full article: http://semiengineering.com/coherency-cache-and-configurability/
Ann captured well my main point about the emerging need for heterogeneous cache coherency:
Arteris’ Shuler concluded that while there are only a few companies that can justify going down to 10nm and 7nm — at least right away — they are still going to need to sell a product that is as low power, and as high performing as the devices at 10nm and 7nm. “How are you going to do it? You’re going to have to be more efficient in the processing of your hardware; you’re going to have to have more efficient hardware. That means you’re not going to be able to run all that software just on CPU cores. You’re going to have to do a lot more offloading and that’s where the heterogeneous stuff is really going to take off.”