qemu-discuss

Aaron's rationale for NAII requested FPGA capabilities.


From: alarson
Subject: Aaron's rationale for NAII requested FPGA capabilities.
Date: Fri, 23 Dec 2022 10:11:42 -0700

At the December 19th coordination meeting, Archit asked me to provide
a brief (sorry, I failed on that) rationale for the capabilities I
requested of the FPGA provided by NAI for sync.  To be clear, this is
not intended to be a specification or even a suggestion for the
behavior requested by Korry.  This is based on what I wished we had on
numerous other systems in the past that needed to implement sync.

Sync is rarely easy, and often has systemic effects and dependencies,
so this does not say how sync should be done.  Rather, it lists issues
to consider and how the NAI capabilities can be used to facilitate
reasonably simple solutions to some common problems.

The first issue for sync is that any solution that involves event
latency (e.g., responding to an interrupt) is going to be complex and
fragile due to inherent variability in response times.  The NAI FPGA
capabilities help substantially here, so I won't describe the
latency-associated problems the FPGA avoids.

The NAI FPGA provides a sequence of pulses, where a pulse contains a
"high" and "low" portion.  The pulse duration is the sum of those
times.  Since the FPGA is timestamping the rising and falling edges
separately it is possible for a receiver to know the duration of each
part of the pulse from each sender.  That allows the sender to provide
not only the timing of the pulse start, but also to encode some
information based on the relative size of the high/low times.  For
example, a 10ms pulse could be composed of 1ms high + 9ms low, or
2ms high + 8ms low, etc.
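
As a rough illustration of the decoding a receiver could do, here is a
minimal Python sketch.  It assumes the receiver already has the rising
edge, falling edge, and following rising edge timestamps in
microseconds.  The Pulse type, the mode names, the tolerance, and the
mapping of 1ms/2ms high times to modes are all hypothetical and are not
part of the NAI interface.

```python
from dataclasses import dataclass

# Hypothetical encoding: high time in ms -> mode name.  The 1 ms and 2 ms
# values mirror the examples in the text; the names are assumptions.
MODE_BY_HIGH_MS = {1: "not_master", 2: "master"}


@dataclass
class Pulse:
    rise_us: int       # timestamp of the rising edge
    fall_us: int       # timestamp of the falling edge
    next_rise_us: int  # timestamp of the following rising edge


def decode_pulse(p, tolerance_us=200):
    """Return (high_us, low_us, period_us, mode) for one received pulse."""
    high_us = p.fall_us - p.rise_us
    low_us = p.next_rise_us - p.fall_us
    period_us = high_us + low_us  # pulse duration is the sum of both parts
    mode = None  # stays None if the high time matches no known encoding
    for high_ms, name in MODE_BY_HIGH_MS.items():
        if abs(high_us - high_ms * 1000) <= tolerance_us:
            mode = name
    return high_us, low_us, period_us, mode
```

The tolerance is there only because measured edge times will not be
exact; the right value depends on the platform.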

The FPGA auto-reloads the next pulse based on current (at the time of
the pulse start) high/low register values.  There is no need for the
PAL to do anything unless an adjustment is required.  Since the FPGA
samples the registers at the pulse start, the PAL can adjust the
"next" pulse durations at any time during the pulse without needing to
worry about race conditions (see the WAT vs pulse offset discussion
below).
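
To make the "sampled at pulse start" behavior concrete, here is a toy
Python model.  It is not the real FPGA register interface; the class
and method names are invented for illustration.  It only shows that a
register write made mid-pulse affects the next pulse and not the one in
progress.

```python
class PulseGeneratorModel:
    """Toy model of the auto-reload behavior (hypothetical, not the FPGA API)."""

    def __init__(self, high_us, low_us):
        self.high_reg = high_us
        self.low_reg = low_us
        self.active = None  # (high, low) latched for the pulse in progress

    def write_registers(self, high_us, low_us):
        # The PAL may call this at any point during a pulse.
        self.high_reg = high_us
        self.low_reg = low_us

    def start_pulse(self):
        # The registers are sampled only here, at the pulse start.
        self.active = (self.high_reg, self.low_reg)
        return self.active
```

If the PAL never writes the registers, every call to start_pulse()
yields the same durations, which matches the "no need for the PAL to do
anything" case.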

Sync is inherently modal, and I'm going to assume for the purpose of
this discussion that there is one chosen "master" and that all other
channels slave to the master.  This is most definitely an area where
KISS is important.  When a channel starts (e.g., from power on, or
from a health monitor initiated reset), the started channel doesn't
know the state of the other channels, e.g., is there a master already?
By encoding the local channel's mode in the signal it transmits to the
other channels, the local channel can help all the channels do a
selection of the master.  When a channel starts, let's assume it starts
in state "I don't think I'm the master".  The channel can immediately
start transmitting a pulse containing its state (e.g., 1ms high + 9ms
low), then the other channels can make group decisions relatively
easily.  If there is already a master, the master ignores the state of
others and the newly started channel syncs to it.  If there is no
master then the group assumes (as one arbitrary choice) that the
lowest numbered active channel will become master and all will sync
to that.
Once a channel believes it is (should be) the master it encodes that
state in its pulse signal (e.g., 2ms high + 8ms low).  Note there are
race conditions that need to be resolved so it's not quite that easy,
but nothing a little more waiting can't resolve.
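
The selection rule described above might look something like the
following Python sketch.  It assumes each channel has already decoded
the mode of every active channel (itself included) into a dictionary.
The function name and mode strings are hypothetical, the tie-break for
two channels both claiming master is my assumption, and the sketch
deliberately ignores the race conditions just mentioned.

```python
def select_master(observed_modes):
    """Pick the channel to sync to.

    observed_modes maps channel number -> "master" or "not_master" for
    every channel currently seen as active, including the local one.
    Returns the chosen channel number, or None if nothing is active.
    """
    claimed = sorted(ch for ch, mode in observed_modes.items()
                     if mode == "master")
    if claimed:
        # An existing master wins.  If several claim it, take the lowest
        # numbered one (assumption; the text does not cover this case).
        return claimed[0]
    if not observed_modes:
        return None
    # No master yet: the lowest numbered active channel becomes master.
    return min(observed_modes)
```

A channel for which select_master() returns its own number would then
switch its pulse to the "master" encoding.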

A channel that is starting can wait until it has identified a master
and then (if it is not the master) set its pulse time to approximate
the master's and then startup normally.  Note that in this scenario
the master might not be consistently selected.  If that is important,
more stateful logic is necessary.

To simplify the discussion I'm going to assume Deos' "sense of time",
i.e., the WAT start, is driven by a clock and not by the FPGA's pulse
train, that there is no backup clock source, and that all channels
have consistent WAT durations.

Note that in this configuration another synchronization is required,
namely between the timer driving the WAT and the FPGA's generated
pulse.  I was hoping that a common (e.g., the GIC) timer
could be used by the FPGA but that was not possible.  This is where
the ability to synchronously read the FPGA's timestamp is likely to be
needed.  I do not know what requirements there are for the
synchronization of the channel WAT starts, so I'll leave this topic
for now and focus on FPGA pulse synchronization.

During runtime, if the pulse start is chosen to be offset from the
OS's WAT start by a time larger than the anticipated worst case
channel skew (e.g., 1ms), then when the PAL is activated by the WAT
timer the timestamps for the most recent pulse from every channel
will be available.  The offset between the local channel's and the
master's pulse start can then be computed with simple arithmetic.  If
there is an offset the duration of the pulse (and the WAT clock) can
be adjusted (shortened or lengthened) to get the current channel's
pulse start to migrate toward matching the master.  Normal ramping and
stability considerations apply when selecting the adjustments.  The
fine-grained FPGA clock frequency helps.
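
A minimal Python sketch of that arithmetic, assuming both pulse start
timestamps are in the same FPGA timestamp domain (microseconds here)
and that both channels nominally use the same pulse period.  The
function names and the single max_step_us ramp limit are hypothetical;
a real design would pick the ramp from its own stability analysis.

```python
def pulse_offset_us(local_rise_us, master_rise_us, period_us):
    """Signed offset of the local pulse start relative to the master's.

    Positive means the local pulse started late.  The result is wrapped
    into the range (-period/2, period/2].
    """
    diff = (local_rise_us - master_rise_us) % period_us
    if diff > period_us // 2:
        diff -= period_us
    return diff


def next_period_us(nominal_us, offset_us, max_step_us):
    """Period to load for the next pulse, limited to max_step_us per pulse."""
    # Clamp the correction so the pulse start ramps toward the master
    # instead of jumping.
    step = max(-max_step_us, min(max_step_us, offset_us))
    # A late local pulse (positive offset) needs a shorter next period.
    return nominal_us - step
```

For example, a local pulse that is 300us late with a 50us ramp limit
would run 9950us periods until the remaining offset drops below the
limit.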

Nominally the PAL would provide an interface to adjust the WAT timer
duration and an application or PAL extension would be responsible for
computing the adjustment, calling the PAL interface with the
adjustment and updating the FPGA pulse duration.  There is substantial
variability in this from platform to platform so the exact allocation
of responsibilities will need to be worked out.

Note that since Deos is a real-time OS, the WAT duration can't be
adjusted arbitrarily, otherwise Deos couldn't provide time
partitioning.  In Deos there is an overhead that can be specified
which permits the WAT duration to be shortened or lengthened a small
amount.  Any shortening has to be within that amount.  Sometimes
lengthening can be less constrained.  The selection of the amount
affects total system processor availability and is a platform
integrator responsibility.
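
The constraint amounts to clamping any requested WAT adjustment to the
configured limits, with possibly different limits for shortening and
lengthening.  A small Python sketch, where the function name, the sign
convention, and both limit parameters are hypothetical and do not
correspond to any actual Deos or PAL interface:

```python
def clamp_wat_adjustment(requested_us, max_shorten_us, max_lengthen_us):
    """Limit a WAT duration adjustment to what the platform allows.

    Negative values shorten the WAT, positive values lengthen it.
    max_shorten_us stands in for the overhead the integrator configured;
    max_lengthen_us may be larger since lengthening can be less
    constrained.
    """
    return max(-max_shorten_us, min(max_lengthen_us, requested_us))
```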

The above completely ignores detecting, diagnosing, and responding to
various faults that can occur.  Most notably what to do about a
misbehaving master.  In situations where the multi-channel capability
is providing increased availability (rather than integrity) such
conditions can complicate the logic considerably.  I am by no means an
expert in such systems but I believe the signalling capability
embodied by the selectable high/low pulse width becomes even more
important in those situations.  It is also why I suggested having
three inputs so that if a system safety analysis indicated there was a
need for a wrap around monitor, the FPGA would already support it.
The approach proposed at the Dec 19th interchange meeting seems like a
better solution since it simplifies the interconnects.

