There are two main factors involved. First, the Xbox design was done very quickly. Once the go-ahead was given, Microsoft knew they wanted to be on retail shelves ASAP or the competition would be much too deeply entrenched to make it worth entering this generation of machines. It was a given that the console would be based around DirectX development tools, ergo the Xbox name, so that gave them huge time savings in going with an Intel CPU and a PC-ish structure optimized for video throughput and reduced cost. Not as optimized as it could have been given more time, but delivering substantially better performance than a comparable PC running full Windows and supporting all the things a PC must do as well as play games.
Nvidia was the natural choice for a chipset partner since they had long focused their designs on DirectX and had the then most advanced products in the consumer sector. These chips were rapidly catching up with CPUs in the heat department as their transistor counts grew by leaps and bounds, with Nvidia making major additions to video chip functionality beyond just pushing polygons. The XGPU ended up being a somewhat more advanced version of the GeForce 3, which PC developers were just learning to exploit; it is sometimes referred to as a GeForce 3.5. Nvidia also had plans underway for getting into the motherboard chipset market, and their highly advanced audio subsystem was a perfect match for giving the Xbox a feature beyond that of any previous console.
So everything was coming together, but time was still very short compared to the multi-year stretch an existing console firm would typically have invested in their next generation. There wasn't nearly enough time for heavy-duty thermal design testing, and there was also the issue of how much cost could be allowed for higher-end cooling methods that would permit a more compact design. The heat pipe used in Sega's tiny but powerful Dreamcast had added considerably to Sega's difficulties in making a profit on that system. The Xbox was going to have to accept a bit of bulk to provide enough internal airflow for a lower cost cooling solution.
What about later? Why didn't they do a redesign when they had all the time in the world, like Sony did? The first Japanese PS2 units had chips made on a .25 micron process that had only been intended for engineering and development samples. The delay in getting .18 micron production up and running cost Sony tens of millions in low yields and expensive chips, but Sony bore the pain and the PS2 chipset was done in .18 micron in time for the US launch.
When you move a chip design to a manufacturing process with a smaller feature size, a die shrink, and you aren't seeking greater performance, the most immediate benefit is reduced cost, since you get more chips per wafer. The primary operational benefit is that those chips generate far less heat and can fit in a tighter space with lesser cooling needs.
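The "more chips per wafer" effect can be sketched with a little back-of-the-envelope arithmetic: a linear shrink scales die area by the square of the feature-size ratio, and a common approximation for dies per wafer divides wafer area by die area with a correction for partial dies lost at the edge. The 200 mm wafer size and 240 mm² die area below are purely illustrative assumptions, not actual PS2 chip figures.

```python
import math

def dies_per_wafer(wafer_diameter_mm, die_area_mm2):
    """Rough die count: wafer area / die area, minus a simple
    correction for partial dies lost around the wafer's edge."""
    wafer_area = math.pi * (wafer_diameter_mm / 2) ** 2
    edge_loss = math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2)
    return int(wafer_area / die_area_mm2 - edge_loss)

# Hypothetical 240 mm^2 die on a .25 micron process, 200 mm wafer.
old_die = 240.0
# A .25 -> .18 micron linear shrink scales area by (0.18/0.25)^2, about 0.52.
shrunk_die = old_die * (0.18 / 0.25) ** 2

print(dies_per_wafer(200, old_die))     # dies per wafer at .25 micron
print(dies_per_wafer(200, shrunk_die))  # roughly double at .18 micron
```

Under these assumed numbers the shrink roughly doubles the die count per wafer before even considering yield, which is why a die shrink cuts per-chip cost so sharply.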
To create the PStwo, Sony took the main chips, the Emotion Engine and Graphics Synthesizer, and remade them first on a .13 micron process, then on a .09 micron (90 nanometer) process. They also combined them into a single package, which further reduced their size and simplified the board layout. On top of this, eliminating support for an internal hard drive in the new PStwo design got rid of another big heat source.
So why couldn't Microsoft do this? The hard drive couldn't be eliminated, but a die shrink would still allow for a much smaller Xbox. This brings up the second issue, which goes hand in hand with the rushed design period: part of moving fast meant using what Nvidia had to offer rather than something designed from scratch. Microsoft doesn't own the bulk of the innards of the Xbox video and audio chips. They cannot create smaller versions of those chips without Nvidia's participation, and that is unlikely to happen at a price Microsoft can accept. The Xbox didn't move in the kind of numbers Microsoft had hoped for, and they consequently put pressure on Nvidia to reduce chip prices so the Xbox retail price could be lowered for a sales boost.
Continued...