Hacking and Other Thoughts

Wed, 04 Jul 2007


LDOM Developments, July 4, 2007

Fabio and I have been busy playing whack-a-mole with the lingering buglets and inconveniences in the current LDOM Linux work.

I also took some time to hike around Mt. Rainier. Last Sunday the hike was from the White River campground up to the Inter Glacier area. This is the primary north side climbing route. Interestingly, this past week Bill Painter, a well-known old local climber, summited Rainier again. He's 84 and continues to hold the record as the oldest person to summit the mountain. This guy is incredibly fit for his age: he bikes 100+ miles every week and is constantly training on a mountain closer to his home elsewhere in Washington.

Enough Rainier, how do LDOMs work? Let's start at the lowest level.

Each running node on a Niagara system is provided with a machine description. A simple hypervisor call copies this mdesc into a buffer. It's a densely encoded graph of all the resources the node has access to. It's very similar to the compressed device tree that powerpc systems use under Linux.

So when you first set up an LDOM system, you boot into the "factory-default" machine description, which basically gives all of the hardware resources to the control node. Then from the control node you take things away and set up the virtualization server services so that guests can be created. You update the machine description for the control node, giving it a new name, then reboot into it. Now you're ready to create guests. The copies of these machine descriptions sit on the System Controller of the machine.

Along with the machine description facility are Logical Domain Channels (LDCs). The virtualized services between guests, control nodes, service nodes, and the system controller all communicate over point-to-point links. Each link is configured within the hypervisor, with a transmit queue and a receive queue at each end of the channel. Each entry in a queue holds a fixed-size 64-byte LDC packet. You can size your queues however you like, with some minor restrictions.
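To make the queue picture concrete, here is a minimal sketch of one direction of a channel as a ring of 64-byte packets. Only the 64-byte packet size comes from the description above. The struct names, the head/tail indices, the 8-entry size, and the power-of-two assumption are all made up for illustration and are not the hypervisor's actual interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LDC_PACKET_SIZE 64u   /* fixed packet size, per the text */
#define QUEUE_ENTRIES   8u    /* hypothetical; assumed to be a power of two */

struct ldc_packet {
    uint8_t bytes[LDC_PACKET_SIZE];
};
static_assert(sizeof(struct ldc_packet) == LDC_PACKET_SIZE,
              "an LDC packet is exactly 64 bytes");

/* One direction of a channel: a ring of fixed-size packets. */
struct ldc_queue {
    struct ldc_packet entries[QUEUE_ENTRIES];
    size_t head;   /* next entry to consume */
    size_t tail;   /* next entry to fill */
};

static size_t queue_next(size_t idx)
{
    return (idx + 1) & (QUEUE_ENTRIES - 1);
}

/* Returns 0 on success, -1 if the queue is full. */
static int queue_put(struct ldc_queue *q, const struct ldc_packet *p)
{
    size_t next = queue_next(q->tail);

    if (next == q->head)
        return -1;             /* one slot is kept empty to detect "full" */
    q->entries[q->tail] = *p;
    q->tail = next;
    return 0;
}

/* Returns 0 on success, -1 if the queue is empty. */
static int queue_get(struct ldc_queue *q, struct ldc_packet *out)
{
    if (q->head == q->tail)
        return -1;
    *out = q->entries[q->head];
    q->head = queue_next(q->head);
    return 0;
}
```

With this layout an 8-entry queue holds at most 7 packets in flight, since one slot stays empty so that "full" and "empty" can be told apart.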

The LDC link layer defines a handshake, plus reliable, unreliable, and raw modes of operation. The handshake is used to negotiate an LDC protocol version that both sides can understand. It is also used to get the sequence numbers initialized so real work can be done on the link. Raw mode elides the handshake entirely, has no packet headers, and just sends raw 64-byte packets over the link.

The hypervisor also provides memory sharing facilities for the LDC channels. There is a page table where exported pages are defined, and exported memory is expressed to the remote consumer using "cookies" which essentially define which export page table entry holds the translation, the offset into that page, the page size of the translation, and the size of the area being described. Essentially these cookies are DMA descriptors.
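As a rough illustration of what a cookie carries, the sketch below packs the four pieces of information listed above into two 64-bit words. The bit layout is an assumption made for this example (a page-size code in the top four bits, then the table index shifted by the page size, then the offset), as are the names and the mapping of code 0 to 8K pages and code 1 to 64K pages. Treat it as one plausible encoding, not the documented format.

```c
#include <assert.h>
#include <stdint.h>

#define PGSZ_CODE_SHIFT 60u   /* assumed position of the page-size code */

struct ldc_cookie {
    uint64_t addr;   /* page-size code | export table index | page offset */
    uint64_t size;   /* length in bytes of the area being described */
};
static_assert(sizeof(struct ldc_cookie) == 16, "two 64-bit words");

/* Assumed page sizes: code 0 = 8K, 1 = 64K, 2 = 512K, 3 = 4M. */
static unsigned page_shift(unsigned pgsz_code)
{
    return 13u + 3u * pgsz_code;
}

static struct ldc_cookie cookie_make(unsigned pgsz_code, uint64_t index,
                                     uint64_t offset, uint64_t size)
{
    struct ldc_cookie c;

    c.addr = ((uint64_t)pgsz_code << PGSZ_CODE_SHIFT) |
             (index << page_shift(pgsz_code)) | offset;
    c.size = size;
    return c;
}

static unsigned cookie_pgsz_code(struct ldc_cookie c)
{
    return (unsigned)(c.addr >> PGSZ_CODE_SHIFT);
}

/* Which export page table entry holds the translation. */
static uint64_t cookie_index(struct ldc_cookie c)
{
    uint64_t body = c.addr & ((UINT64_C(1) << PGSZ_CODE_SHIFT) - 1);

    return body >> page_shift(cookie_pgsz_code(c));
}

/* Byte offset into the exported page. */
static uint64_t cookie_offset(struct ldc_cookie c)
{
    return c.addr & ((UINT64_C(1) << page_shift(cookie_pgsz_code(c))) - 1);
}
```

The consumer never sees a real address here, only a table index and an offset, which is what lets the exporter revoke the entry later without the importer's cooperation.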

To access imported memory you can ask the hypervisor to copy between cookie-defined memory and local node memory. You can also map these things into your address space or program the imported cookies into an IOMMU, but that breeds enormous levels of complex issues to resolve.

Part of that complexity is that you have to be able to handle faults. Each and every exported piece of memory can be revoked at any time by the exporter. If this happens, an access will fault. For CPU mapped imported memory the accesses to such mappings would need to be annotated much like Linux currently annotates user space accesses in the kernel using exception tables. This is so that the fault handler will know what to do with these kernel faults.

Even more complicated is the IOMMU use of imported memory. If the memory is revoked, the IOMMU is going to fault, and this will make the PCI controller send a master abort to the requesting PCI device. The response to master aborts amongst devices varies, but universally the device is likely to need to be completely reset when this occurs.

So the safest thing to do, and what every existing use of LDC channels does, is use the hypervisor copy operation to access imported memory. In this case you only need to handle error return values from the LDC hypervisor call, rather than complicated faults all over the place, when revoked memory is accessed.

On top of the LDC protocol sits the VIO layer, which has its own handshake mechanism. It handles versioning and sequence number initialization just like the LDC handshake does, but it also handles the transfer of device-specific attributes such as exported disk size, network device MTU, etc.

The VIO handshake also handles the registration of descriptor rings. These rings are how VIO devices set up I/O operations. Each ring entry starts with a generic VIO tag, containing an entry state value and an ACK field. The ACK field says whether the receiver should ACK the ring entry as soon as it is processed or defer the ACK until its current run over the ring is complete. After the tag is the device-type-specific area, where virtual disk devices describe the block I/O and virtual network devices describe the size of the packet, etc. Finally, there is an array of cookie entries to describe the I/O buffer.
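The three-part layout of a ring entry can be sketched as a C struct. The ordering (tag, then device-specific area, then cookie array) follows the description above. The field widths, the padding, the state names other than READY, the two-cookie limit, and the helper function are assumptions for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Entry states; only READY is named in the text, the rest are assumed. */
enum desc_state {
    DESC_FREE     = 1,
    DESC_READY    = 2,
    DESC_ACCEPTED = 3,
    DESC_DONE     = 4
};

/* Generic VIO tag at the start of every ring entry. */
struct dring_tag {
    uint8_t state;    /* one of enum desc_state */
    uint8_t ack;      /* nonzero: ACK this entry once it is processed */
    uint8_t pad[6];
};
static_assert(sizeof(struct dring_tag) == 8, "tag is assumed to be 8 bytes");

struct cookie {
    uint64_t addr;
    uint64_t size;
};

#define MAX_COOKIES 2u   /* hypothetical limit for this sketch */

/* A network descriptor: tag, device-specific area, then cookies. */
struct net_desc {
    struct dring_tag tag;
    uint32_t size;                       /* packet length in bytes */
    uint32_t ncookies;                   /* valid entries in cookies[] */
    struct cookie cookies[MAX_COOKIES];  /* describes the I/O buffer */
};
static_assert(offsetof(struct net_desc, tag) == 0, "tag comes first");

/* Total bytes of buffer described by the cookie array. */
static uint64_t desc_buffer_bytes(const struct net_desc *d)
{
    uint64_t total = 0;

    for (uint32_t i = 0; i < d->ncookies && i < MAX_COOKIES; i++)
        total += d->cookies[i].size;
    return total;
}
```

A virtual disk descriptor would keep the same tag and cookie array and swap the middle section for its block I/O fields.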

For the network device there is a single TX descriptor ring created at each end; these are populated locally with transmit packets to be received at the other end. They are imported into the peer using the hypervisor export mechanism described above.

Descriptor ring entries at the importer side are accessed with the aforementioned LDC copy mechanism.

I/O is triggered using DRING_DATA packets over the LDC channel, which tell the receiver which entries in the descriptor ring to process. Writes into the local peer's descriptor entries just use local CPU loads and stores, so ordering is important.

The DRING_DATA packets give a start and end descriptor index for the peer to process. The end index can be specified as "-1" which means to just keep processing until you see a descriptor which is not in READY state.
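The start/end convention can be sketched as a small processing loop. This is a simplification: it walks the ring directly in local memory, whereas a real importer would fetch each descriptor with the LDC copy operation. The names, the 8-entry ring, the wrap-around behaviour, and the state values are assumptions for this example.

```c
#include <assert.h>
#include <stdint.h>

#define RING_ENTRIES 8u   /* hypothetical ring size */

enum { DESC_FREE = 1, DESC_READY = 2, DESC_DONE = 4 };   /* assumed values */

struct desc {
    uint8_t  state;
    uint32_t len;
};

/*
 * Process descriptors from 'start' through 'end' inclusive. An 'end' of
 * -1 means "keep going until a descriptor is not READY". Returns the
 * number of descriptors processed.
 */
static unsigned process_range(struct desc *ring, uint32_t start, int64_t end)
{
    unsigned done = 0;
    uint32_t idx = start;

    assert(start < RING_ENTRIES);

    while (done < RING_ENTRIES) {
        struct desc *d = &ring[idx];

        if (d->state != DESC_READY)
            break;                  /* nothing more to do from here on */
        d->state = DESC_DONE;       /* stand-in for the real I/O work */
        done++;
        if (end != -1 && idx == (uint32_t)end)
            break;                  /* reached the explicit end index */
        idx = (idx + 1) % RING_ENTRIES;
    }
    return done;
}
```

The open-ended form lets a sender queue up more descriptors while the receiver is still running, without sending another DRING_DATA packet for each one.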

Thus it is important for the sending peer to update the state field as the last possible operation, with a memory barrier, such that the receiver does not accidentally see a half-initialized descriptor in READY state.
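The publish ordering looks roughly like this in portable C11. The release store stands in for the write barrier plus state update that kernel code would use, and the struct layout, names, and state values are assumptions. Whether a C11 release store is sufficient across a hypervisor copy is not something this sketch settles; it only shows the order of operations.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

enum { DESC_FREE = 1, DESC_READY = 2 };   /* assumed state values */

struct tx_desc {
    _Atomic uint8_t state;
    uint32_t len;
    uint64_t cookie_addr;
    uint64_t cookie_size;
};

/*
 * Fill in a free descriptor, then mark it READY as the very last step.
 * Returns 0 on success, -1 if the descriptor is still in use.
 */
static int publish(struct tx_desc *d, uint64_t cookie_addr, uint32_t len)
{
    assert(d != NULL);

    if (atomic_load_explicit(&d->state, memory_order_acquire) != DESC_FREE)
        return -1;

    /* Plain stores: the payload fields can be written in any order. */
    d->len = len;
    d->cookie_addr = cookie_addr;
    d->cookie_size = len;

    /*
     * Release store: every write above is ordered before the state
     * change, so a reader that observes READY sees complete fields.
     */
    atomic_store_explicit(&d->state, DESC_READY, memory_order_release);
    return 0;
}
```

Setting the state first and filling in the fields afterwards would open exactly the window described above, where the peer can pick up a READY descriptor with stale length or cookie values.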

More on LDOMs in a future blog entry. We've made some progress with the network hang bug and other issues, so things are looking up.