Theory of Operation for ADS Ptolemy Simulation

ADS Ptolemy provides signal processing simulation for ADS's specialized design environments. Each of these design environments capture a model of computation, called a domain, that has been optimized to simulate a subset of the communication signal path. ADS domains that are part of ADS Ptolemy, or can cosimulate with ADS Ptolemy are:

Domain Simulation Technology Controller Application Area
Synchronous Dataflow (SDF) Numeric dataflow Data Flow Synchronous multirate signal processing simulation
Timed Synchronous Dataflow (TSDF) Timed dataflow Data Flow Baseband and RF functional simulation (e.g., antenna and propagation models, timed sources)
Circuit Envelope Time- and frequency-domain analog Envelope Complex RF simulation
Transient Time-domain analog Transient Baseband analog simulation

In ADS Ptolemy, a complex system is specified as a hierarchical composition (nested tree structure) of simpler circuits. Each subnetwork is modeled by a domain. A subnetwork can internally use a different domain than that of its parent. In mixing domains, the key is to ensure that at the interface, the child subnetwork obeys the semantics of the parent domain.

Thus, the key concept in ADS Ptolemy is to mix models of computation, implementation languages, and design styles, rather than trying to develop one, all-encompassing technique. The rationale is that specialized design techniques are more useful to the system-level designer, and more amenable to a high-quality, high-level synthesis of hardware and software. Synchronous dataflow (SDF) and timed synchronous dataflow (TSDF) are described in the sections that follow.

For general documentation on the Circuit Envelope and Transient simulators see Circuit Envelope Simulation and Transient and Convolution Simulation. For information on cosimulation with ADS Ptolemy and these circuit simulators, refer to Cosimulation with Analog-RF Systems.

Synchronous Dataflow

Synchronous dataflow is a special case of the dataflow model of computation, which was developed by Dennis [1]. The specialization of the model of computation is to those dataflow graphs where the flow of control is completely predictable at compile time. It is a good match for synchronous signal processing systems, those with sample rates that are rational multiples of one another.

The SDF domain is suitable for fixed and adaptive digital filtering, in the time or frequency domains. It naturally supports multirate applications, and its rich component library includes polyphase FIR filters.

The ADS examples directories contain application examples that rely on SDF semantics. To view these examples, chose File > Example Project; select the DSP/dsp_demos_prj directory for one group of SDF examples.

SDF is a data-driven, statically scheduled domain in ADS Ptolemy. It is a direct implementation of the techniques given by Lee [2] [3]. Data-driven means that the availability of data at the inputs of a component enables it; components without any inputs are always enabled. Statically scheduled means that the firing order of the components is periodic and determined once during the start-up phase. It is a simulation domain, but the model of computation is the same as that used for bit-true simulation of synthesizable hardware. A number of different schedulers have been developed for this model of computation.

Basic Dataflow Terminology

The SDF model is equivalent to the computation graph model of Karp and Miller [4]. In the terminology of dataflow literature, components are called actors ; an invocation of a component is called a firing. The signal carried along the arc connecting the blocks are made of individual packets of data called tokens. In a digital signal processing system, a sequence of tokens might represent a sequence of samples of a speech signal or a sequence of frames in a video sequence.

Note
Some ADS Ptolemy terminology differs from UCB Ptolemy terminology. For example, a component in ADS is called a star in UCB Ptolemy and an arc is a wire. Refer to the Terminology for more information.

When an actor fires, it consumes some number of tokens from its input arcs, and produces some number of output tokens. In synchronous dataflow, these numbers remain constant throughout the execution of the system. It is for this reason that this model of computation is suitable for synchronous signal processing systems, but not for asynchronous systems. The fact that the firing pattern is determined statically is both a strength and a weakness of this domain. It means that long runs can be very efficient, a fact that is heavily exploited in ADS Ptolemy. But it also means that data-dependent flow of control is not allowed; this would require dynamically changing firing patterns.

Balancing Production and Consumption of Tokens

Each port of each SDF component has an attribute that specifies the number of tokens consumed (for inputs) or the number of tokens produced (for outputs). When you connect an output to an input with an arc, the number of tokens produced on the arc by the source component may not be the same as the number of tokens consumed from that arc by the destination component. To maintain a balanced system, the scheduler must fire the source and destination components with different frequency.

Consider a simple connection between three components, as shown in the following figure.


Simple Connection of SDF Components Illustrates Balance Equations Constructing a Schedule

The symbols adjacent to the ports, such as NA1, represent the number of tokens consumed or produced by that port when the component fires. For many signal processing components, these numbers are simply one, indicating that only a single token is consumed or produced when the component fires. But there are three basic circumstances in which these numbers differ from one:

To be able to handle these circumstances, the scheduler first associates a simple balance equation with each connection in the graph. For the graph in the previous figure, the balance equations are

This is a set of three simultaneous equations in three unknowns. The unknowns rA, rB, and rC are the repetitions of each actor that are required to maintain balance on each arc. The first task of the scheduler is to find the smallest non-zero integer solution for these repetitions. It is proven in Lee [1] that such a solution exists and is unique for every SDF graph that is consistent , as defined next.

How Schedulers Work

SDF is a restricted version of dataflow in which the number of data produced or consumed by a component per invocation is known at compile time. As the name implies, SDF can be used to model synchronous signal processing algorithms. In these algorithms, all of the sampling rates are rationally related to one another.

An SDF scheduler takes a simulation design and determines the sequence or order of invocation of each component. The simulator will simulate the design according to the schedule generated by the scheduler. ADS Ptolemy provides user-selectable schedulers for any given simulation:

The trade-offs between them are typically the time needed to generate the schedule, the memory usage of storing the schedule data structure, and the memory usage of buffers for data collection.

Classical Scheduler

This is the classic scheduler from UC Berkeley Ptolemy, which tries to minimize the buffer sizes. It delays the firing of any block as long as possible. A component firing is deferred until none of the components that feeds data to it can be fired. It usually takes longer to generate the schedule than other schedulers, and has a large schedule size, but it uses less memory buffers across components. This scheduler is good for uni-rate designs.

Cluster Loop Scheduler

This scheduler generates single-appearance schedules that take less time than the classical scheduler. The advantage of a single-appearance schedule is that it significantly reduces schedule size. Each component appears once in the schedule and possibly with a loop factor. However, the buffer memory usage will be increased by the loop factor. Most of the increased buffer memory usage can be reduced by clustering components and limiting the buffer increases to only between clusters. This scheduler is good for multirate designs such as wireless 2.5/3G.

Hierarchical Scheduler

This scheduler generates single-appearance schedules based on cluster loop scheduler. A hierarchical structure is used for storing different clusters. Disconnected graphs are controlled separately to avoid unnecessary memory usage of multirate designs. Simulation time is also reduced when any graph finishes earlier than the other.

Multithreaded Scheduler

This scheduler is based on the cluster loop scheduler that uses multithreading techniques to achieve speedup on multiprocessor machines. It extracts parallelism from synchronous dataflow graphs and distributes the load among multiple threads. The actual speedup depends on the complexity of designs and the number of processors used. However, speedup does not scale linearly to the number of processors. For more information refer to Multithreaded Synchronous Dataflow.

Multirate Scheduler

This scheduler strategically integrates several techniques for graph decomposition and SDF scheduling. Integrating techniques in this way provides effective, joint minimization of time and memory requirements for simulating large-scale and heavily multi-rate SDF graphs. In practical communication and signal processing systems, single-rate subsystems (subsystems in which all components and interconnections operate at the same average rate) arise commonly, even within designs that are heavily multirate at the global level. The multirate scheduler decomposes a complex multirate graph into a reduced multirate version along with several single-rate subgraphs. These single-rate subgraphs are scheduled in a simple and efficient manner, while the multi-ate graphs are scheduled using sophisticated graph decomposition and scheduling techniques towards reducing run-time and memory usage. The multirate scheduler has shown significant speed-up with heavily multirate graphs, but is seen to perform at-par with the cluster loop scheduler for most other graphs.

Iterations in SDF

At each SDF iteration, each component is fired the minimum number of times to satisfy the balance equations.

For example, suppose that component B in the figure above, Simple Connection of SDF Components Illustrates Balance Equations Constructing a Schedule, is an FFT_Cx component with its parameters set so that it will consume 128 samples and produce 128 samples. And, suppose that component A produces exactly one sample on each output, and component C consumes one sample from each input. In summary,

The balance equations become

The smallest integer solution is

Hence, each iteration of the system includes one firing of the FFT_Cx component and 128 firings each of components A and B.

It is not always possible to solve the balance equations. Suppose that in the figure above, Simple Connection of SDF Components Illustrates Balance Equations Constructing a Schedule, we have

In this case, the balance equations have no non-zero solution. The problem with this system is that there is no sequence of firings that can be repeated indefinitely with bounded memory. If we fire A,B,C in sequence, a single token will be left over on the arc between B and C. If we repeat this sequence, two tokens will be left over. Such a system is said to be inconsistent, and is flagged as an error. The SDF scheduler will refuse to run it.

Deadlocks

While scheduling a system, it is possible that all firings specified in the repetition vector will not be completed because none of the remaining components are enabled. Such a system is said to be a deadlock state. The following figure illustrates a system in such a state.


Deadlocked SDF System

The repetitions vector for this system is:

Thus this system is consistent; however, neither A nor B are enabled because each is waiting for one token from the other. The way to resolve this deadlock is to introduce an initial token on one of the two arcs in the figure. This initial token is known as a delay.

The Delay component (symbolized by a component with a diamond that is connected to an arc) has a single parameter for the number of samples of delay to be introduced. In the SDF domain, a delay with parameter equal to one is simply an initial token on an arc. This initial token may enable a component, assuming that the destination component for the delay arc requires one token in order to fire. To avoid deadlock, all feedback loops must have delays. The SDF scheduler will flag an error if it finds a loop with no delays. For most token types, the initial value of a delay will be zero.

By default, a delay has a zero value. To specify a specific value for the initial token, use the InitDelay token.

There are a number of specialized components in ADS Ptolemy that add a delay on an arc, such as DelayRF, VcDelayRF, ShiftRegPPSyn, ShiftRegPSSyn, ShiftRegSPSyn, and CounterSyn. Delay_M is used for matrices.

Deadlock Resolution

In ADS Ptolemy, an optional deadlock resolution algorithm can determine where Delay components need to be inserted. If preferred, the delay components can be automatically spliced in.

For the DF (data flow) controller, an Options item DeadlockManager has 3 options:

Timed Synchronous Dataflow

Timed synchronous dataflow is an extension of SDF. TSDF adds a Timed data type (described in Using Data Types). For each token of the Timed type, both a time step and a carrier frequency must be resolved.

ADS examples directories contain numerous application examples that rely on TSDF semantics. To view these examples, chose File > Example Project, a dialog box appears. Select the DSP/ModemTimed_prj directory for one group of TSDF examples.

Note that the ADS Ptolemy (Pt) simulation time domain signals are different from those used for Circuit Envelope (CE) and Transient (T) simulation.

Time Step Resolution

In TSDF, each Timed arc has an associated time step. This time step specifies the time between each sample. Thus the sampling frequency for the envelope of a Timed arc is 1/time step.

The sampling frequency is propagated over the entire graph, including both Timed and numeric arcs. To calculate a time step, the SDF input and output numbers of tokens consumed/produced are used.

For any given SDF or TSDF component, the sampling frequency of the component is defined as the sampling frequency on any input (or output) divided by the consumption (or production) SDF parameter on that port. After a sampling frequency is derived for a given component, it is propagated to every port by multiplying the component's rate with the SDF parameter of the port. A sample rate inconsistency error message is returned if inconsistent sample rates are derived.

Carrier Frequency Resolution

Each Timed arc in a timed dataflow system has an associated carrier frequency (F c). These F c values are used when a conversion occurs between Timed and other data types, as well as by the Timed components.

The F c has either a numerical value, which is greater than or equal to zero (F c ≥ 0.0), or is undefined (F c = UNDEFINED). All Timed ports have an associated F c ≥ 0.0. Non-timed ports have an UNDEFINED F c.

During simulation, all F c values associated with all Timed ports are resolved by the simulator. The resolution algorithm begins by propagating the F c specified by the user in the Timed sources parameter Fcarrier until all ports have their associated F c. At times, the user may have specified incompatible carrier frequencies, and ADS Ptolemy will return an error message.

In the feedforward designs, the algorithm will converge quickly to a unique solution. In the designs with feedback, the algorithm takes additional steps to resolve the carrier frequency at all pins.

For feedback paths, a default F c is assigned by the simulator. This default F c is then propagated until the F c converges on the feedback path. This F c is occasionally non-unique. To specify a unique value, use the SetFc Timed component.

Input/Output Resistance

Resistors can be used with timed components. Resistors provide a means to support analog/RF component signal processing. They provide definition of analog/RF input and output resistance, additive resistive Gaussian thermal (Johnson) noise, and power-level definition for time-domain signals.

Though resistors are circuit components, they are used in the data flow graph by defining their inputs from the outputs of connected TSDF components and their outputs at connected TSDF component inputs.

The following figure shows two TSDF components, T1 and T2, with an interconnected series resistor, R1, at the output of T1 and a shunt resistor, R2, at the input of T2. Such interconnected resistors are collected and replaced with the appropriate signal transformation model that includes time-domain signal scaling and additive thermal noise.


TSDF Components with Interconnected Output (R1) and Input (R2) Resistors

Resistors contribute additive thermal noise (kTB) to signals when the specified resistance temperature (RTemp) is greater than absolute zero (-273.16°C) where:

k = Boltzmann's constant
T = temperature in Kelvin
B = simulation frequency bandwidth
  1/2 / TStep if signal is a timed baseband signal
  1 / TStep if signal is a timed complex envelope signal

Multithreaded Synchronous Dataflow

The multithreaded scheduler, targeted for multiprocessor workstations, can significantly reduce the run time of multi-rate SDF simulations. The basic idea behind the scheduler is to realize the potential parallelism that exists in an SDF graph and transform the efficient uniprocessor hierarchical cluster looped schedule into a multithreaded parallel schedule for the simulation environment.

Parallelism with Clustering

The main idea behind cluster loop scheduling is to limit the explosion of nodes in an SDF schedule while minimizing the buffer memory requirement. The following figure shows a chain-structured multi-rate SDF graph. Using the classical scheduling to minimize buffer usage, a component will fire when their input requirement is fulfilled. The schedule will use the minimum amount of buffer 4+1+4=9 but the schedule will consist of 41 nodes, ABCABCDABC..., before it repeats again.


Using looped scheduling, the schedule becomes (9A)(12B)(12C)(8D) which greatly reduces the nodes into four where each has a loop factor. Unfortunately, the buffer memory requirement has gone up to 36+12+24=72. With the help of clustering, the schedule can be transformed into (3[(3A)(4B)])(4[(3C)(2D)]), grouping A and B into cluster α, C and D into cluster β. The resultant memory requirement is 12+12+6=30, not as good as the minimal but less than half of looped scheduling.


Now if we apply α and β using pipelining on dual processors machine, two threads can run independently, one thread fires 21 components and the other 20 components before they synchronize at the end of the time slot. Let us assume that each component firing consumes 2 cycles, and thread synchronization takes 10 cycles. A single processor will finish the schedule in 82 cycles. In the 2 processor scenario, it will take max(42,40)+10=52 cycles to finish the schedule. Furthermore, one can do schedule unfolding to increase the workload before the synchronization of each thread to generate higher throughput. The drawback with schedule unfolding is the proportional increase in memory buffer requirement.

Note
Data dependency issues and exact algorithms to increase parallelism are beyond the scope of this document and will not be discussed.

Memory Overhead

The classic trade-off between memory and speed happens in Multithreaded SDF scheduling. When loop unfolding is in effect (applicable to small designs only), the buffer size increase is proportional to the number unfolding. This is in contrast to pipelining of clusters, where buffer size is doubled. And, only the buffers that bridge across clusters are doubled; buffers connecting components within a single cluster remain the same size.

Sample Results

A subset of WLAN and 3GPP designs results are listed here. All designs are shipped as example designs. Results are gathered by running the designs on a dual-processor PC (specifics of PC) with the cluster looped and multithreaded scheduler:

WLAN 3GPP Simulation Time (seconds) Speedup
Cluster Looped Multi- Threaded
WLAN_80211a_RxNonAdjCh_12Mbps.dsn   5090 3512 1.45
  UE_Rx_In_Band_Blocking.dsn 2701 1406 1.92
WLAN_80211a_RxAdjCh_9Mbps.dsn   2634 2058 1.28
WLAN_80211a_RxNonAdjCh_48Mbps.dsn   2230 1006 2.22
WLAN_80211a_RxSensitivity_6Mbps.dsn   1762 1949 0.90
WLAN_80211a_ChannelCoding.dsn   1567 1229 1.28
WLAN_80211a_RxAdjCh_18Mbps.dsn   1485 755 1.97
WLAN_80211a_GI.dsn   1261 1138 1.11
  UE_Rx_ACS.dsn 1245 500 2.49
WLAN_80211a_RxAdjCh_36Mbps.dsn   976 517 1.89
  UE_Rx_MaxLevel.dsn 823 329 2.50
  UE_Rx_MaxLevel.dsn 815 333 2.45
WLAN_80211a_RxSensitivity_24Mbps.dsn   626 429 1.46
WLAN_80211a_TxEVM_Turbo.dsn   617 360 1.71

Simulation controls associated with the above designs are modified such that the simulation time is normalized to fall between 10 and 90 minutes. The average speedup for a dual-processor machine is approximately 1.76x.

The next list of designs indicate a set which takes a much shorter time (3 - 10 minutes). The idea was to observe the speedup among designs that take less time to simulate. As seen here, shorter simulation time gives a slightly smaller speedup due to the overhead of the multithreaded scheduler. Again, speedup depends on the parallelism of a design. The average speedup here is 1.68x.

WLAN 3GPP Simulation Time (seconds) Speedup
Cluster Looped Multi- Threaded
WLAN_80211a_TxEVM.dsn   578 437 1.324
  UpLk_1DPCCH_5DPDCH.dsn 519 298 1.741
  UpLk_1DPCCH_6DPDCH.dsn 510 294 1.734
  DnLk_DPCH.dsn 495 243 2.047
  UpLk_1DPCCH_3DPDCH.dsn 489 289 1.692
  DnLk_TestModel.dsn 486 250 1.944
  UpLk_1DPCCH.dsn 484 244 1.987
  DnLk_PCCPCH_SCH.dsn 482 241 2
  UpLk_1DPCCH_4DPDCH.dsn 481 287 1.688
  UpLk_1DPCCH_1DPDCH.dsn 473 273 1.731
WLAN_80211a_RxSensitivity_54Mbps.dsn   402 284 1.423
  UpLk_1DPCCH_2DPDCH.dsn 349 214 1.631
WLAN_80211a_TxSpectrum.dsn   301 197 1.539
  3GPPFDD_UE_Tx_12_2.dsn 274 246 1.111
  UE_Rx_RefLevel_PhyCHBER.dsn 249 149 1.671

Reentrancy

Due to reentrancy issues, some simulations may not demonstrate a desired speedup improvement. For example, Analog/RF cosimulation limitations allow only one processor to perform execution in Analog/RF portion. Multithreading will only be on for the Ptolemy portion.

Thread-Safe Programming

In order to satisfy reentrancy, some components have serial regions to ensure proper multithreading simulations. Custom models can usually be made thread-safe by not relying on static data and the communication between models with global data.

It is not always possible to avoid using this type data structure when designing a model. When it is used, you will be responsible for any data synchronization that is necessary.

The Netscape Portable Runtime Library, http://www.mozilla.org/projects/nspr/, is a free public library for multithread programming that is shipped with ADS. The following is an example on how to guard against a static variable.

Defstar {
    Name { myStar }
    Domain { SDF }
hinclude { "prlock.h" }
protected {
static int sharedData;
static PRLock* sampleLock;
}
code {
    PRLock* SDFmyStar::sampleLock = PR_NewLock();
}
go {
    PR_Lock(sampleLock);
        sharedData++;
    PR_Unlock(sampleLock);
}
}

First, prlock.h is included so we can use the Lock functions provided by NSPR. Assume that we have a variable, which is shared by multiple instances of SDFmyStar, called sharedData. To make the model thread safe, we declare another static variable sampleLock for guarding the access of sharedData. In the go method, we create a critical section to guarantee that only one thread at a time will go into the critical section, accessing the variable sharedData.

Note
The current version of Tcl/Tk library shipped with ADS is not thread-safe. Therefore, designs with Tcl/Tk components are not allowed to be used with the Multithreaded Scheduler.

References

  1. J. B. Dennis, First Version Data Flow Procedure Language, Technical Memo MAC TM61, May 1975, MIT Laboratory for Computer Science.
  2. E.A. Lee and D. G. Messerschmitt, "Static Scheduling of Synchronous Data Flow Programs for Digital Signal Processing," IEEE Trans. on Computers, vol. 36, no. 1, pp. 24-35, January 1987.
  3. E.A. Lee and D. G. Messerschmitt, "Synchronous Data Flow," Proc. of the IEEE, vol. 75, no. 9, pp. 1235-1245, September 1987.
  4. R.M. Karp and R. E. Miller, "Properties of a Model for Parallel Computations: Determinancy, Termination, Queueing," SIAM Journal, vol. 14, pp. 1390-1411, November 1966.
 

Privacy Statement  | Terms of Use  | Legal | Contact Us  | © Agilent 2000-2008 

Contents
Additional Resources