Unless you're using GlobalDeviceArrays, the device mesh passed to `pjit`
must be composed of one contiguous submesh per process (i.e. each
process's local devices must all be adjacent in the full mesh and form
a rectangular submesh). This change teaches
`create_device_mesh` how to output meshes that satisfy this
constraint in some common cases.

This isn't the default behavior because the resulting meshes are a
little awkward and magical, and eventually we'd like using
GlobalDeviceArrays to be the common use case.
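
For illustration, the contiguity constraint described above can be
sketched as a small NumPy check. This is not code from this change:
the helper name `process_submeshes_are_contiguous` and the
`process_ids` array (the owning process index at each mesh position)
are hypothetical, and the sketch assumes "next to each other" means a
filled, axis-aligned block of mesh indices.

```python
import numpy as np


def process_submeshes_are_contiguous(process_ids: np.ndarray) -> bool:
    """Return True if every process owns one filled rectangular block.

    `process_ids` is a hypothetical stand-in for a device mesh: the entry
    at each mesh position is the index of the process owning that device.
    """
    for pid in np.unique(process_ids):
        # Mesh coordinates of all devices owned by this process.
        coords = np.argwhere(process_ids == pid)
        lo = coords.min(axis=0)
        hi = coords.max(axis=0) + 1
        # The devices fill their bounding box exactly when the box volume
        # equals the number of devices the process owns.
        if np.prod(hi - lo) != len(coords):
            return False
    return True


# Two processes, each owning a 2x2 block: satisfies the constraint.
ok_mesh = np.array([[0, 0, 1, 1],
                    [0, 0, 1, 1]])

# Two processes with interleaved columns: violates the constraint.
bad_mesh = np.array([[0, 1, 0, 1],
                     [0, 1, 0, 1]])

ok_result = process_submeshes_are_contiguous(ok_mesh)    # True
bad_result = process_submeshes_are_contiguous(bad_mesh)  # False
```

With a real mesh, an array like `process_ids` could be built by reading
each device's `process_index` attribute at every mesh position.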