From: Jonathan Cameron
Subject: Re: [RFC] Set addresses for memory devices [CXL]
Date: Thu, 28 Jan 2021 10:51:14 +0000

On Wed, 27 Jan 2021 21:20:21 -0800
Dan Williams <dan.j.williams@intel.com> wrote:

> On Wed, Jan 27, 2021 at 7:52 PM Ben Widawsky <ben@bwidawsk.net> wrote:
> >
> > Hi list, Igor.
> >
> > I wanted to get some ideas on how to better handle this. Per the
> > recent discussion [1], it's become clear that there needs to be more
> > thought put into how to manage the address space for CXL memory
> > devices. If you look at the discussion on interleave [2], there's a
> > decent diagram for the problem statement.
> >
> > A CXL topology looks just like a PCIe topology. A CXL memory device
> > is a memory expander: a byte-addressable address range with a
> > combination of persistent and volatile memory. In a CXL-capable
> > system, you can effectively think of these things as more
> > configurable NVDIMMs. The memory devices have an interface, called an
> > HDM (Host-managed Device Memory) decoder, that allows the OS to
> > program the base physical address range the device claims. A larger
> > address range is claimed by a host bridge (or a combination of host
> > bridges in the interleaved case), and that larger range is
> > platform-specific.
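(To make the above concrete: roughly the state an HDM decoder exposes to
the OS, as a sketch in C. The struct and field names are purely
illustrative for this mail - this is not the CXL register layout.)

#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative only: the programmable state behind a single HDM
 * (Host-managed Device Memory) decoder, from the OS point of view.
 * Field names are invented for this sketch, not taken from the spec.
 */
struct hdm_decoder {
    uint64_t base;                   /* host PA the device claims   */
    uint64_t size;                   /* length of the claimed range */
    unsigned int interleave_ways;    /* 1, 2, 4, ... targets        */
    uint64_t interleave_granularity; /* bytes per way, e.g. 512     */
    bool committed;                  /* locked once programmed      */
};

/* The OS carves a chunk out of the platform window and commits it. */
static void hdm_decoder_commit(struct hdm_decoder *hdm, uint64_t base,
                               uint64_t size, unsigned int ways,
                               uint64_t granularity)
{
    hdm->base = base;
    hdm->size = size;
    hdm->interleave_ways = ways;
    hdm->interleave_granularity = granularity;
    hdm->committed = true;
}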
> >
> > Originally, my plan was to create a single memory backend for a
> > "window" and make the devices subregions within it. So for example,
> > if you had two devices under a host bridge, each 256M in size, the
> > window would be a memory backend of 512M+ size at some fixed GPA, and
> > those memory devices would be subregions of the host bridge's window.
> > I thought this was working in my patch series, but as it turns out,
> > it doesn't actually work as I intended: `info mtree` looks good, but
> > `info memory-devices` doesn't.
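(For reference, the layout I read that as - a minimal QEMU-flavoured
sketch of the "window + subregions" idea, not Ben's actual patches. The
hb object and region names are made up, the fixed GPA of 128G is
arbitrary, and the sizes are just the 2 x 256M example from above.
Needs "exec/memory.h", "qemu/units.h" and "qapi/error.h".)

MemoryRegion cxl_window;
MemoryRegion cxl_dev0, cxl_dev1;

/* A 512M container mapped at a fixed GPA, owned by the host bridge. */
memory_region_init(&cxl_window, OBJECT(hb), "cxl-window", 512 * MiB);
memory_region_add_subregion(get_system_memory(), 0x2000000000ULL,
                            &cxl_window);

/* Each 256M device becomes a subregion of the host bridge's window. */
memory_region_init_ram(&cxl_dev0, OBJECT(hb), "cxl-dev0", 256 * MiB,
                       &error_fatal);
memory_region_init_ram(&cxl_dev1, OBJECT(hb), "cxl-dev1", 256 * MiB,
                       &error_fatal);
memory_region_add_subregion(&cxl_window, 0, &cxl_dev0);
memory_region_add_subregion(&cxl_window, 256 * MiB, &cxl_dev1);

Something like that gives a sensible `info mtree`, but nothing here is
registered with the memory-device machinery, so `info memory-devices`
has nothing to show - which would match the symptom described above.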
> >  
> 
> A couple clarifying questions...
> 
> > So let me list the requirements and hopefully get some feedback on
> > the best way to handle it.
> > 1. A PCIe-like device has a persistent memory region (I don't care
> >    about volatile at the moment).
> 
> What do you mean by "PCIe" like? If it is PCI enumerable by the guest
> it has no business being treated as proper memory because the OS
> rightly assumes that PCIe address space is not I/O coherent to other
> initiators.
> 
> > 2. The physical address base for the memory region is programmable.
> > 3. Memory accesses will support interleaving across multiple host bridges.  
> 
> So, per 1. it would look like a PCIe address space inside QEMU but
> advertised as an I/O coherent platform resource in the guest?

Personally I find it easier to think of these devices as containing:

1) A PCI-based configuration interface (in config + BAR space).

2) Memory accessed via an entirely separate memory bus -
   the PA translations for which (system address map, etc.) happen to
   be controllable via the PCI path.

The memory traffic goes over the PCI wires, but doesn't otherwise obey
any of the rules of PCI, so it has separate decode etc., allowing for
interleaving. From an emulation point of view it might as well be an
entirely different bus (with a similar tree).

The host allocates certain windows of PA space for which it routes PA
reads/writes to particular physical ports; beyond that, all the PA
routing to particular memory devices can be programmed at runtime.

Interleave makes this more 'interesting' :)

The host can set certain PA regions to interleave across multiple CXL
root ports.
So, for example, with base PA = 128G, a 512-byte interleave granularity,
and 2-way interleave:

Read to 128G + 0 bytes    -> port 0
Read to 128G + 512 bytes  -> port 1
Read to 128G + 1024 bytes -> port 0, etc.
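
In other words, for a single level of interleave the routing is just
arithmetic on the offset into the region - something like this
(hypothetical helper, names made up):

#include <stdint.h>

/*
 * Which root port does a host PA decode to?  base, granularity and
 * ways as in the example above (128G base, 512-byte granularity,
 * 2-way).  Single level of interleave only.
 */
static unsigned int cxl_target_port(uint64_t pa, uint64_t base,
                                    uint64_t granularity,
                                    unsigned int ways)
{
    return (unsigned int)(((pa - base) / granularity) % ways);
}

/*
 * cxl_target_port(128G + 0,    128G, 512, 2) == 0
 * cxl_target_port(128G + 512,  128G, 512, 2) == 1
 * cxl_target_port(128G + 1024, 128G, 512, 2) == 0
 */

(Real hardware selects address bits rather than dividing, since the
granularity and number of ways are powers of two, but the resulting
mapping is the same.)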

The OS can then put two devices into such a PA region and let them know
about the interleave (via that PCI-based config interface). If there are
switches below those ports, further levels of interleave can occur as
well. It's very flexible.

Of course, others may prefer a different mental model!

Jonathan