Summary of Current Research

My current research is being done by my students in the Reconfigurable Computing Lab. Specific areas we are currently investigating include:

Run-time-configurable, on-chip, profiling hardware for multicore processors to improve program analysis and scheduling;
FPGA CAD and architecture;
Architecture and O/S support for (Dynamic) Reconfigurable Computing;
Networks-on-Chip (NoCs) on FPGAs;
Facilitating SoC and multi-FPGA designs using the "Systems Integrating Modules with Predefined Physical Links" (SIMPPL) framework;
Real-time and embedded system design using FPGAs (particularly application acceleration for imaging and biomedical applications); and
On-chip design infrastructure for reducing design time and improving runtime performance.

For a more detailed description of current and past research projects, go to RCL Research Projects.

Summary of PhD Research

My doctoral research focused on reducing the design time of embedded computing systems. I created an on-chip design infrastructure, which includes tools that obtain profiling data and facilitate on-chip verification. They quickly provide designers with accurate information about system performance and operation to help reduce the portion of design time spent simulating system functionality. This is particularly appropriate for embedded computing systems designed on SRAM-based FPGAs, which can be reprogrammed at no cost to the designer.

More recently, I have explored how computing system architecture can be used to facilitate design reuse, thus reducing design time. The Systems Integrating Modules with Predefined Physical Links (SIMPPL) model represents computing systems as a network of Computing Elements (CEs) interconnected with asynchronous FIFOs, reminiscent of Kahn and Data Process Networks. However, the strength of the model is the CE abstraction, which allows designers to decouple Intellectual Property functionality from system-level communication and control using a programmable controller.
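As a rough illustration of the CE abstraction, the following Python sketch chains two Computing Elements through FIFOs. It is a software analogy only: the class name, the step method and the two example functions are hypothetical, and the real SIMPPL controller and its protocol are not modelled here.

```python
from collections import deque
from typing import Callable, Deque, List, Optional


class ComputingElement:
    """Hypothetical CE: a small controller wrapped around a block of IP."""

    def __init__(self, ip_function: Callable[[int], int],
                 inbox: Deque[int], outbox: Optional[Deque[int]] = None):
        self.ip_function = ip_function  # the IP: computation only
        self.inbox = inbox              # FIFO feeding this CE
        self.outbox = outbox            # FIFO to the next CE (None for a sink)
        self.results: List[int] = []    # filled only when there is no outbox

    def step(self) -> bool:
        """Controller: move one item through the IP if one is waiting."""
        if not self.inbox:
            return False
        value = self.ip_function(self.inbox.popleft())
        if self.outbox is not None:
            self.outbox.append(value)
        else:
            self.results.append(value)
        return True


def run_pipeline(inputs: List[int]) -> List[int]:
    """Two CEs chained by FIFOs: double each value, then add one."""
    fifo_in: Deque[int] = deque(inputs)
    fifo_mid: Deque[int] = deque()
    scale = ComputingElement(lambda x: 2 * x, fifo_in, fifo_mid)
    offset = ComputingElement(lambda x: x + 1, fifo_mid)
    progressed = True
    while progressed:
        # Each CE fires whenever its input FIFO holds data.
        first = scale.step()
        second = offset.step()
        progressed = first or second
    return offset.results
```

Swapping either lambda for a different function leaves the communication code untouched, which is the decoupling the CE abstraction is meant to provide.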

Previous Research

Previously I developed a benchmark description format for a suite of benchmarks that can be used by researchers to measure the performance of their reconfigurable processing architectures. The desire is that these benchmarks be usable by anyone, independent of the programming language accepted by the processor.

A New Benchmark Suite: A New Paradigm for Testing

Traditionally, processor performance is measured by compiling C source code for a set of accepted benchmarks to the new architecture. The executable is then run on the architecture and different measurements are taken to determine the overall performance of the architecture. Presently, no suite of testbenches exists that is consistently used by designers of reconfigurable computing architectures because there is no common language for these processors.

However, to be able to judge the performance of one architecture with respect to another, some basis of commonality is needed. To accommodate the lack of source code standards for reconfigurable microprocessors, we must move to a higher level of abstraction. The first step is to describe an application in terms of an algorithmic representation, instead of providing the necessary source code for a testbench.

The Design Issues link summarizes the issues that we hope our new reconfigurable benchmark suite addresses. The other link is to the actual benchmark suite itself:

Design Issues
The Reconfigurable Architecture TEsting Suite

Design Issues

A Standard for Algorithm Description
There is no standard input language for reconfigurable processors. This is because designers have chosen different computational models for their processors. By abstracting each testbench to an algorithm specification, as opposed to a source code description, the computational model dependency is removed.

Data Sets
Since there is no standard architecture for this style of processor, no single data set size guarantees the same behaviour on every processor: the same data may fit entirely in internal storage on one architecture and require memory loads on another. This can greatly affect the execution time of the program on different architectures.

Data Intensive versus Control Intensive Applications
The majority of reconfigurable computing research focuses on improving the performance of data intensive applications. For instance, many architectures include an FIR filter in their suite of tests along with other DSP applications. Therefore, the testbench suite should consist predominantly of data intensive applications. However, there should also be some control intensive applications, if only to illustrate that the architecture is capable of executing these types of programs even if they perform poorly on it.


RATES- A Reconfigurable Architecture TEsting Suite

We propose the Reconfigurable Architecture TEsting Suite (RATES) as the new standard benchmarking suite for measuring the performance of these processors. We present the guidelines for introducing a testbench into the suite, the general specification format for a benchmark, and the kernels and applications that are presently included in the suite. The RATES white paper describes the work completed thus far.

          Guidelines for Testbench
          Benchmark Specification Format
          Kernels
          Applications
          RATES White Paper


RATES- Guidelines for Benchmark Specification

The following guidelines define the necessary characteristics for specifying benchmarks for RATES. It should be noted that this initial set of rules is an attempt to structure the benchmark format and it may be necessary to append, amend or expel specific rules as the reconfigurable processing field matures. Finally, we hope that some forum for discussion will be convened to ensure that these guidelines fairly represent the needs of the reconfigurable community.

To ensure that the following discussion uses consistent terminology, the pertinent definitions are now given. Each benchmark contains application input and output data sets as well as a Standard Algorithm describing what functionality is to be implemented. The Standard Algorithm may be further broken down into the individual tasks that must be performed, and each task is then described in terms of a Task Algorithm. This hierarchy is illustrated pictorially in Figure 1.

Figure 1: A pictorial hierarchy illustrating the suite's taxonomy.

Rule 1: A sequential algorithm must be given as the Standard Algorithm for each benchmark, and C source code provided as an implementation example. Source code for the reconfigurable processor must be generated based on the Standard Algorithm and not the example source code. The designer's source code implements the same Standard Algorithm demonstrated in the example source code; however, it can be parallelized and pipelined so as to obtain the best possible performance on their architecture. The example code is provided as a reference point so that the actual functionality of the benchmark can be verified. A detailed description of the Standard Algorithm specification format is given here.

Rule 2: The specification of the Standard Algorithm must be at a functional level where the inputs and outputs to a task are detailed but the internal implementation is a black box. The objective is to describe to the user the basic components, or tasks, that should be used to implement the algorithm as illustrated in Figure 1. The tasks are described at a higher level to allow for flexibility of implementation, but a strict definition of the control flow between tasks is required to ensure that designers are executing the same Standard Algorithm.

This is sufficient information for a consistent algorithmic description. If the designer feels that the Standard Algorithm is too restrictive or does not fully utilize the processor's capabilities, the algorithm can be customized and the improvement in performance measured with respect to the standard.

Rule 3: The Standard Algorithm must always be implemented. Any modifications to this will be considered a separate Custom Algorithm. To make a reasonable comparison of performance between architectures, a common baseline is necessary. Not only must the problem be the same, but also the methodology used for its solution. When multiple different algorithms exist for solving the same problem, only one is chosen as the Standard Algorithm. Implementations of other algorithms are considered to be Custom Algorithms.

Rule 4: The input and output operands of a task must be specified in terms of type and bit width. To further restrict the interface between tasks, all input and output variable fields must be strictly specified, but the designer may choose the storage structures used to implement these variables. This ensures that everyone is working at the same minimal level of computational precision and that the application output results are consistent among processors, while not restricting the actual implementation. It should also be noted that internal temporary variables remain unspecified so as to allow even greater flexibility when mapping each task of execution to the processor.
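To make Rules 2 and 4 concrete, the following Python sketch records a Standard Algorithm as an ordered list of black-box tasks whose operands carry a type and a bit width. All class, field and function names are hypothetical, and the check shown (every task input must match an external input or an earlier task's output exactly) is only one possible reading of a strictly specified interface, not the suite's actual format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Operand:
    name: str
    kind: str       # e.g. "signed" or "unsigned"; these labels are assumptions
    bit_width: int


@dataclass
class Task:
    name: str
    inputs: List[Operand]
    outputs: List[Operand]
    # No field for internal temporaries: Rule 4 leaves them unspecified.


@dataclass
class StandardAlgorithm:
    benchmark: str
    tasks: List[Task] = field(default_factory=list)  # in control-flow order


def unconnected_inputs(algorithm: StandardAlgorithm,
                       external: List[Operand]) -> List[str]:
    """Name each task input that nothing upstream supplies with the
    same name, type and bit width."""
    available = set(external)
    missing = []
    for task in algorithm.tasks:
        for operand in task.inputs:
            if operand not in available:
                missing.append(f"{task.name}.{operand.name}")
        available.update(task.outputs)
    return missing
```

For example, if a hypothetical task_a produces a 16-bit signed operand named partial and task_b declares its input partial as 12-bit signed, the function reports task_b.partial, flagging the mismatch in the specification.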

Rule 5: Application Input and Output data sets must be supplied for reproducibility. All architectures should measure performance based on the same set of input data. This is also needed to provide a common basis for comparison. The output data sets allow for checking the correctness of the implementation.

Obviously, there is no expectation that the numerical values will be exactly the same for computationally intense applications, but the precision of the result is a measure of the correctness of the outputs. Therefore, the designer should stipulate the precision of the outputs along with the execution time. For example, if the output results from processor A are obtained in time t with 6 bits of precision whereas the results from processor B take time 2t to be generated but have 14 bits of precision, processor B could be considered a better architecture depending on the requirements.
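The guidelines do not fix how the precision of the outputs is to be computed. One possible measure, shown in the Python sketch below, counts the number of bits by which the worst-case error against the supplied output data set falls below the full-scale range of the output. The function name, the full_scale parameter and the max_bits cap are assumptions made for illustration.

```python
import math
from typing import Sequence


def bits_of_precision(reference: Sequence[float], observed: Sequence[float],
                      full_scale: float, max_bits: int = 64) -> int:
    """Bits of agreement between observed outputs and the reference set.

    Assumed definition: floor(log2(full_scale / worst absolute error)),
    clamped to the range 0..max_bits.
    """
    if len(reference) == 0 or len(reference) != len(observed):
        raise ValueError("need two non-empty data sets of equal length")
    if full_scale <= 0:
        raise ValueError("full_scale must be positive")
    worst = max(abs(r - o) for r, o in zip(reference, observed))
    if worst == 0:
        return max_bits  # exact match: report the cap
    bits = math.floor(math.log2(full_scale / worst))
    return max(0, min(max_bits, bits))
```

With a full-scale range of 1.0, an output whose worst error is 1/64 is reported as having 6 bits of precision, which corresponds to processor A in the example above.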

Rule 6: The performance of the architecture as the size of the problem and/or the size of the reconfigurable fabric scales should be discussed. The intent of this rule is to provide some understanding of the behaviour of the architecture when the problem does not "fit" entirely on the fabric. While it is not a required measure, the results would significantly enhance the understanding of a processor's behaviour in these circumstances. All results must still show the performance of the Standard Algorithm.

Rule 7: All architectures must be evaluated with the same metrics. The process for obtaining a metric must also be clearly defined. Current metrics include:

   1. Execution Time. Execution time and the output data-bit precision are required values for each benchmark. The definition of this metric is dependent on the type of benchmark. For streaming data applications, the execution time may be measured as the throughput time for the data being processed. For applications with a definitive start and end point, the execution time is defined to be the time elapsed from when the program begins its load onto the processor until the final results are obtained.
   2. Configuration Time. Another essential metric for a reconfigurable processor is the configuration time, which is to be reported as two components. The first is the time required for initialization of the processor fabric and the second is the overhead incurred when processor execution stalls so that the fabric can be reconfigured. For reconfigurable architectures, configuration time is an important consideration as the overhead can be quite significant if not done in parallel with program execution.
   3. Power Dissipation. Power is a significant concern in reconfigurable architectures as there may be significant dynamic power dissipated when the processor is reconfigured. However, measuring power dissipation is a difficult exercise, so this metric is being discussed assuming that some researchers will focus on power issues in their architectures. It is not assumed that every designer will attempt to quantify the power dissipation of their architecture. However, for the measure to be meaningful when reported, any power dissipated during the initial configuring of the processor should also be included in the measurement.
   4. Area. The silicon area required to implement the processor may be another potential metric for use in some research. A primary consideration is that the size of a reconfigurable fabric will greatly affect the ability to parallelize and pipeline application execution. This metric should be reported as an estimate of the die area required to implement the architecture in a technology specified by the designer.
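The four metrics above could be collected in a single record per benchmark, as in the following Python sketch. The field names and units (seconds, watts, square millimetres) are assumptions; the guidelines themselves only state which values are required, that configuration time has two components, and that an area estimate must name its technology.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RatesResult:
    """One reported result; field names and units are illustrative."""
    benchmark: str
    execution_time_s: float                 # required for every benchmark
    output_precision_bits: int              # required with execution time
    initial_config_time_s: float            # component 1: fabric initialization
    reconfig_stall_time_s: float            # component 2: stalls to reconfigure
    power_w: Optional[float] = None         # optional metric
    area_mm2: Optional[float] = None        # optional die-area estimate
    area_technology: Optional[str] = None   # technology assumed for the area

    def problems(self) -> List[str]:
        """List the ways this record falls short of the metric definitions."""
        found = []
        if self.execution_time_s <= 0:
            found.append("execution time must be positive")
        if self.output_precision_bits <= 0:
            found.append("output precision must be at least one bit")
        if self.initial_config_time_s < 0 or self.reconfig_stall_time_s < 0:
            found.append("configuration times cannot be negative")
        if self.area_mm2 is not None and not self.area_technology:
            found.append("an area estimate needs a named technology")
        return found
```

Keeping the two configuration components as separate fields, rather than a single total, preserves the distinction between start-up cost and the stalls incurred during execution.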

Rule 8: A well-written implementation of the Standard Algorithm for execution on general purpose processors must be provided. It must be used for all performance comparisons between the reconfigurable architectures being studied and general purpose processors. Use of the Standard Algorithm again provides a common basis for comparison between the implementation on the reconfigurable processor and the general purpose processor. Requiring all performance measurements on general purpose processors to use the same code for the Standard Algorithm also provides a means for calibration between projects as it is unlikely that different projects will measure performance using the exact same general purpose processor configurations.
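One simple way to use the common general purpose implementation for calibration is to report each reconfigurable result as a speedup over the baseline measured within the same project. The Python sketch below assumes this reading of Rule 8; the function name and the timings in the comments are hypothetical.

```python
def speedup_over_baseline(baseline_time_s: float,
                          reconfigurable_time_s: float) -> float:
    """Speedup of a reconfigurable implementation over the project's own
    general purpose run of the same Standard Algorithm code."""
    if baseline_time_s <= 0 or reconfigurable_time_s <= 0:
        raise ValueError("times must be positive")
    return baseline_time_s / reconfigurable_time_s


# Hypothetical timings from two projects with different host processors.
project_a = speedup_over_baseline(8.0, 2.0)   # 4.0
project_b = speedup_over_baseline(6.0, 2.0)   # 3.0
```

Both projects report the same raw time of 2.0 s on their reconfigurable processors, but relative to their own baselines project A shows the larger gain. The ratio of the two baseline times (8.0 to 6.0) also indicates how far apart the two host configurations are.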


RATES- Generic Format for Benchmark Specification

The following illustrates the generic format for Standard Algorithm Specification:


RATES- Kernels

To date, the only chosen kernel is the Discrete Cosine Transform (DCT). The DCT II algorithm was chosen because it is the most commonly used of the four algorithms, being found in MPEG and JPEG encoders.

Please note that this is only a preliminary version of the benchmark, containing an example C code program demonstrating the algorithm. An optimized version of the DCT is not yet available as an example C program.
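For readers unfamiliar with the transform, the following Python sketch computes a direct, unoptimized DCT II in one dimension and applies it separably to rows and then columns for the 2D case. It is not the suite's example C program. The unnormalized form X[k] = sum over n of x[n] cos(pi/N (n + 1/2) k) is assumed here, as the normalization convention varies between sources, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def dct2_1d(x: Sequence[float]) -> List[float]:
    """Unnormalized DCT II: X[k] = sum_n x[n] * cos(pi/N * (n + 0.5) * k)."""
    n = len(x)
    return [
        sum(x[i] * math.cos(math.pi / n * (i + 0.5) * k) for i in range(n))
        for k in range(n)
    ]


def dct2_2d(block: Sequence[Sequence[float]]) -> List[List[float]]:
    """Separable 2D DCT II: transform every row, then every column."""
    rows = [dct2_1d(row) for row in block]
    cols = [dct2_1d(col) for col in zip(*rows)]   # columns of the row pass
    return [list(row) for row in zip(*cols)]      # transpose back
```

A constant input places all of its energy in the first coefficient, which gives a quick sanity check for any implementation of the kernel.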

2D-DCT


RATES- Applications

To date, the only application chosen for the suite is John Conway's "Game of Life". It is a control intensive application that requires little coding time. Furthermore, the size of the grid is easily scalable and the designer can choose to implement different computational models.
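The Python sketch below computes a single generation of the Game of Life using the standard rules: a live cell survives with two or three live neighbours, and a dead cell becomes live with exactly three. It is not the benchmark specification. Treating cells outside the grid as dead is an assumption, and the function name is hypothetical.

```python
from typing import List


def life_step(grid: List[List[int]]) -> List[List[int]]:
    """Return the next generation; cells beyond the edge count as dead."""
    if not grid:
        return []
    rows, cols = len(grid), len(grid[0])

    def live_neighbours(r: int, c: int) -> int:
        total = 0
        for rr in range(max(r - 1, 0), min(r + 2, rows)):
            for cc in range(max(c - 1, 0), min(c + 2, cols)):
                if (rr, cc) != (r, c):
                    total += grid[rr][cc]
        return total

    nxt = []
    for r in range(rows):
        row = []
        for c in range(cols):
            n = live_neighbours(r, c)
            alive = n == 3 or (grid[r][c] == 1 and n == 2)
            row.append(1 if alive else 0)
        nxt.append(row)
    return nxt
```

Each cell depends only on its eight neighbours in the previous generation, so all cells of a generation can be updated in parallel. The grid dimensions are the only parameters that change when the problem is scaled.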

I have a complete benchmark specification for this application that I am going to make available for downloading by January 26th. I apologize for the delay.

No other applications have been chosen yet as we feel it is important to get input from the reconfigurable community before defining these tests. An important design consideration is that no source code can be provided for the applications, so they should be modular and repetitive to reduce design time.

Conway's Game of Life
