2011 VII Designer Forum (DF)


2011 VII Designer Forum (DF)

Preface
Table of Contents
Executive Committee
Forum Committee
Sponsors

Editors: Jorge M. Finochietto, Gustavo D. Sutter, Orlando Micolini, Pablo Recabarren


Proceedings of the 2011 VII Designer Forum
Córdoba, Argentina, April 13-15, 2011
Organized by the Digital Communications Research Lab, School of Exact, Physical and Natural Sciences, National University of Córdoba


Proceedings of the 2011 VII Designer Forum
Editors: Jorge M. Finochietto, Gustavo Sutter, Orlando Micolini, Pablo Recabarren
ISBN:

Preface

These Proceedings contain the technical papers presented at the VII 2011 Designer Forum, organized within the 2011 VII Southern Conference on Programmable Logic (SPL), held in Córdoba, Argentina, from April 13th to 15th, 2011. The SPL Conference is the Southern Hemisphere's largest and most comprehensive conference focused on reconfigurable technology (i.e., FPGA) and its applications.

The history of SPL started with The Joint Latin American FPGA Laboratories Project (SURLAB), financed by Banco Santander Central Hispano of Spain. Its aim was to create a network of Latin American laboratories to spread FPGA as a key technology for industry, updating university curricula to include related subjects. The original partners were the Universidad Autónoma de Madrid, the Instituto Tecnológico de Monterrey, the University of Lima in Peru, and the Argentinean universities of Mar del Plata, Salta, Tandil, and CAECE. Starting in March 2005, the first SPL Conference was attended by more than 60 people from Argentina, Brazil, Costa Rica, and Peru. This 5-day workshop, in the unique atmosphere of the one-hundred-year-old CAECE University building, introduced students, professors, and engineers to the FPGA state of the art. In 2006, more than 80 engineers attended the 2nd SPL, and more than 50 papers from Argentina, Brazil, Costa Rica, Peru, Spain, the United Kingdom, Uruguay, and the USA were selected. In 2007, the 3rd SPL Conference was sponsored by IEEE for the first time, receiving more than 90 papers from 24 countries: Argentina, Australia, Bangladesh, Belgium, Brazil, Colombia, Costa Rica, Czech Republic, France, Germany, Greece, Hong Kong, India, Italy, Mexico, Netherlands, Paraguay, Peru, Portugal, Singapore, Spain, Taiwan, UK, and USA. In 2008, the 4th SPL Conference moved from Mar del Plata to San Carlos de Bariloche, situated on the Andes foothills. A total of 29 full papers, 23 short papers, and 20 Designer Forum papers were selected from around one hundred submissions, including authors from the following countries: Argentina, Australia, Brazil, China, Canada, Colombia, France, Germany, Hong Kong, Mexico, Peru, Portugal, Romania, Spain, United Kingdom, and USA. In 2009, the 5th SPL Conference, sponsored again by IEEE, moved out of Argentina to São Carlos, Brazil. 90 papers were submitted from many countries; 26 were accepted as full papers, 12 as short papers, and 8 as Designer Forum papers. In 2010, the 6th SPL Conference, sponsored by IEEE, moved to the northeastern coast of Brazil, to the well-known Porto de Galinhas Beach near the city of Recife. This central location with a relaxed atmosphere, combined with the fast-paced economic growth in this part of Brazil, made it a great site to discuss advanced technology. SPL2010 received submissions from Argentina, Brazil, Canada, China, France, Iran, Italy, Mexico, Netherlands, Pakistan, Peru, Poland, Portugal, Spain, United Kingdom, and United States. A total of 53 papers were selected: 22 full papers, 13 short papers, and 18 Designer Forum papers. In 2011, the 7th SPL Conference, sponsored as usual by IEEE, has moved to Córdoba, the second-largest city in Argentina, and is hosted at the National University of Córdoba, one of the oldest universities in the Americas. Paper submissions were received from the following countries: Argentina, Belgium, Brazil, Colombia, Finland, France, Germany, Greece, India, Mexico, Portugal, Spain, Sweden, United Kingdom, United States of America, and Uruguay. From 99 submissions, a total of 50 regular papers were selected: 24 for oral presentation and 21 for poster presentation. A total of 25 papers were selected for inclusion in the Proceedings of the Designer Forum, which

demonstrates the increasing relevance of this forum within the SPL conference.

The goal of the Designer Forum is to give exposure to ongoing research, academic experiences, and industrial designs in order to get feedback from experienced researchers and industrial partners. The Designer Forum was born with the Southern Conference on Programmable Logic (SPL) in 2005 and has become an important part of it. It promotes the participation of new researchers and advanced students from the conference region. Due to the regional scope of the Designer Forum, its papers may also be written in Spanish and Portuguese. This year, 2 one-week intensive courses were held to encourage digital hardware design skills among advanced students and professionals, thus maintaining the spirit of spreading FPGA technology knowledge in the Southern Hemisphere. In addition, 4 tutorials, lectured by both industry and academic experts, have been organized for conference attendees. This year over 150 participants are expected from more than 40 universities, technological institutions, and companies from all around the world. The topics in this year's program include: Embedded Processors and IP Cores, System-on-Chip, Computer Arithmetic, Image Processing and Vision, FPGA Architectures for Specific Applications, Fault Tolerance, and Test & Verification. SPL has a remarkable track record and is becoming an important forum for discussion on FPGA technology and its applications.

We would like to express our gratitude to the many people who have contributed to the high quality of the technical program. Special thanks to those who chaired or were members of the various committees, particularly the Program Committee, whose careful reviews have helped to maintain the high quality of SPL. Finally, we would like to thank our sponsors: Altera, ClariPhy Argentina, Fundación Tarpuy, the National Agency for Scientific and Technological Promotion (Agencia), the National Scientific and Technical Research Council (CONICET), and Synopsys. A special thanks to the School of Exact, Physical and Natural Sciences (National University of Córdoba) and the Universidad Autónoma de Madrid for their support.

The Editors
Córdoba, Argentina, April 2011


Table of Contents

Executive Committee
Forum Committee

Poster Session 1
IP core MAC Ethernet (Rodrigo Melo, Salvador Tropea)
Autonomous Intelligent Wireless Network accessible via IP (María Isabel Schiavon, Daniel Crepaldo)
Multi-Level Synthesis on the Example of a Particle Filter (Jan Langer, Daniel Frob, Enrico Billich, Marko Robler, Ulrich Heinkel)
Layered testbench for assertion based verification (Jose Mosquera, Sol Pedre, Patricia Borensztejn)
Development and Implementation of an Adaptive Narrowband Active Noise Controller (Fernando González, Roberto Rossi, German Rodrigo Molina, Gustavo Parlanti)
Bio-inspired hardware system based in animals of cold and hot blood (Pablo Salvadeo, Rafael Castro López, Ángel Veca, Elvo Morales)
Análise Comparativa e Qualitativa de Ferramentas de Desenvolvimento de FPGA (Gabriel da Silva, Maximiliam Luppe)
Generación automática de VHDL a partir de una Red de Petri. Análisis comparativo de los resultados de síntesis (Roberto Martinez, Javier Belmonte, Rosa Corti, Estela D Agostino, Enrique Giandoménico)
Using a Wii remote and a FPGA to drive a mechanical arm to aid physically challenged people (Emerson Pedrino, Valentin Roda, Bruno Martins)
Systolic Matrix-Vector Multiplier for a High-Throughput N-Continuous OFDM Transmitter (Enrique Lizarraga, Victor Sauchelli)
Synthesis of the Hartley Transform with a Hadamard-based matrix architecture (Edval JP Santos, Gilson Alves)
Implementación de MODBUS en FPGA mediante VHDL - Capa de Enlace (Luis Guanuco, Jonatan Panozzo Zenere, Sergio Olmedo, Agustin Rubio)

Poster Session 2
Music sequencer on a FPGA board (Matías López-Rosenfeld, Francisco Laborda, Patricia Borensztejn)
Flexible Platform for Real-time Video and Image Processing (Paulo Da Cunha Possa, Zied El Hadhri, Laurent Jojczyk, Carlos Valderrama)
SoPC platform for real-time DVB-T modulator debugging (Armando Astarloa, Jesus Lázaro, Unai Bidarte, Aitzol Zuloaga, Mikel Idirin)
High reliability capture core for data acquisition in System on Programmable Chips (Jesus Lázaro, Armando Astarloa, Aitzol Zuloaga, Jaime Jimenez, Unai Bidarte, Jose Martín)
Desarrollo de una plataforma genérica para sistemas de visión basada en arquitectura CoreConnect (Luis Pantaleone, Lucas Leiva, Martín Vazquez)
Prototipado rápido de un IP para aplicar la transformada Wavelet en imágenes (Hugo Melo, Alejandro Perez, Guillermo Gutierrez, Rodolfo Cavallero)
Cortex-M0 implementation on a Xilinx FPGA (Pedro Martos, Fabricio Baglivo)

Digitally Configurable Platform for Power Quality Analysis (Bruno Falduto, Ricardo Cayssials, Edgardo Ferro)
Solar Tracker for Compact Linear Fresnel reflector using PicoBlaze (Maiver Villena, Daniel Hoyos, Carlos Cadena, Victor Serrano, Telmo Moya, Marcelo Gea)
Toolbox NURBS and Visualization System Via FPGA (Luiz Marcelo Silva, Maria Paiva)
Una Metodología para el Desarrollo de Sistemas en Chip de Alta Performance (Marcos Oviedo, Pablo Ferreyra)
High Throughput 4x4 and 8x8 SATD Similarity Criteria Architectures for Video Coding Applications (Luciano Agostini, Julio Saracol Domigues, Dieison Soares Silveira, Leomar Soares da Rosa, Vinicius Possani)
Adquisición de Vídeo Bajo Estándar ITU-R BT Mediante Lógica Programable (Juan Carlos Contreras, Guillermo Gutierrez, Emilio Kowalski, Rodolfo Cavallero)

Executive Committee

General Chairs:
Jorge M. Finochietto, Universidad Nacional de Córdoba - CONICET, Argentina
Gustavo Sutter, Universidad Autónoma de Madrid, Spain

Forum Chairs:
Orlando Micolini, Universidad Nacional de Córdoba, Argentina
Pablo Recabarren, Universidad Nacional de Córdoba - CONICET, Argentina

Tutorial Chair:
Graciela Corral-Briones, Universidad Nacional de Córdoba, Argentina

Local Chair:
Carmen Rodríguez, Universidad Nacional de Córdoba, Argentina

Financial Chair:
Ramiro Calderón, Fundación Tarpuy, Argentina

Executive Secretary:
María José Agazzi, Universidad Nacional de Córdoba, Argentina

Publicity Chairs:
Eduardo Boemo, Universidad Autónoma de Madrid, Spain
Edval Santos, Universidade Federal de Pernambuco, Brazil
Valentin Obac Roda, Universidade de Sao Paulo, Brazil
Elias Todorovich, Universidad Nacional del Centro, Argentina
Luciano Agostini, Universidade Federal de Pelotas, Brazil


Forum Committee

Carlos Valderrama, Université de Mons - Polytech Mons, Belgium
Luciano Agostini, Universidade Federal de Pelotas, Brazil
Ali Akoglu, University of Arizona, USA
Fadi Aloul, American University of Sharjah, UAE
Cristiano Araujo, UFPE, Brazil
Edna Barros, Centro de Informatica - UFPE, Brazil
Gabriel Caffarena, Universidad San Pablo-CEU, Spain
João Cardoso, University of Porto, Portugal
Hugo Carrer, Universidad Nacional de Cordoba, Argentina
Jorge Castiñeira, Universidad Nacional de Mar del Plata, Argentina
Ricardo Cayssials, Universidad Nacional del Sur, Argentina
Scott Chin, University of British Columbia, Canada
Juan Cousseau, Universidad Nacional del Sur, Argentina
Angel de Castro, Universidad Autonoma de Madrid, Spain
Helio de Oliveira, Federal University of Pernambuco, Brazil
Debatosh Debnath, Oakland University, USA
Jean-Pierre Deschamps, Universidad Rovira i Virgili, Spain
Yongfeng Gu, The Mathworks, USA
Eduardo Romero, Universidad Tecnológica Nacional, Argentina
Guillermo Guichal, Universidad Tecnologica Nacional, Argentina
Reiner Hartenstein, TU Kaiserslautern, Germany
Juan P. Olivier, Universidad de la República, Uruguay
Valentin Obac Roda, Universidade de Sao Paulo, Brazil
Victor Grimblatt, Synopsys, Chile
Damián Morero, Universidad Nacional de Cordoba, Argentina
Carol Marsh, Selex Galileo, UK
Wolfgang Klingauf, Xilinx, USA
Gustavo Parlanti, Motorola, Argentina
René Cumplido, INAOE, Mexico
Esam El-Araby, The Catholic University of America, USA
Altamiro Susin, UFRGS, Brazil
Gabriela Peretti, Universidad Tecnológica Nacional, Argentina
Martín del Barco, ClariPhy, Argentina
J. Ignacio Alvarez-Hamelin, ITBA-UBA, Argentina
Neil Bergmann, University of Queensland, Australia
Philip Leong, The University of Sydney, Australia
Sergio Lopez-Buedo, Universidad Autonoma de Madrid, Spain
Norian Marranghello, Sao Paulo State University - Unesp, Brazil
Seda Memik, Northwestern University, USA
Ruben Milocco, Universidad Nacional del Comahue, Argentina
Rolf Molz, UNISC - Universidade de Santa Cruz do Sul, Brazil
Carlos Muravchik, Universidad Nacional de La Plata, Argentina
Horacio Neto, INESC-ID, Portugal

Felix Palumbo, CONICET - CNEA, Argentina
Michele Petracca, Columbia University, USA
Sébastien Pillement, IRISA, France
Salvatore Pontarelli, University of Rome Tor Vergata, Italy
Jose Saito, Universidade Federal de São Carlos, Brazil
Kentaro Sano, Tohoku University, Japan
Marco Domenico Santambrogio, MIT, USA
Edval JP Santos, Universidade Federal de Pernambuco, Brazil
Pete Sedcole, Viotech Communications, France
Cristian Sisterna, Universidad Nacional de San Juan, Argentina
Julio Pérez Acle, Universidad de la República, Uruguay
Jose Soares Augusto, Universidade de Lisboa, Portugal
Guillermo Jaquenod, JaqTek, Argentina
Dominique Lavenier, IRISA, France
Alfonso Chacon Rodriguez, Instituto Tecnologico, Costa Rica
Maria Jose Moure, Universidad de Vigo, Spain
Diego Crivelli, ClariPhy, Argentina
Pablo Ferreyra, Universidad Nacional de Cordoba, Argentina
Raoul Velazco, TIMA, France
Samir Belkacemi, General Electric, USA
Paulo Flores, INESC-ID, Portugal
Yana Krasteva, Universidad Politecnica de Valencia, Spain
Victoria Rodellar, Universidad Politecnica de Madrid, Spain
María Liz Crespo, ICTP, Italy

IP CORE MAC ETHERNET

Ing. Rodrigo A. Melo, Ing. Salvador E. Tropea
Instituto Nacional de Tecnología Industrial
Centro de Electrónica e Informática
Laboratorio de Desarrollo Electrónico con Software Libre
{rmelo,salvador}@inti.gob.ar

ABSTRACT

Ethernet technology provides communication between PCs and devices that operate autonomously, in local environments or across the Internet. In this work we present a core that implements the Ethernet MAC layer. It is simple to use, offers several configurations, and occupies few FPGA resources. The design was simulated with Free Software tools and verified in hardware on a Virtex 4 FPGA.

1. INTRODUCTION

Our team develops embedded systems that in most cases need to communicate with a PC. Although we have developed cores to cover this need, such as the USB core [1], nowadays that kind of connection is no longer sufficient for the countless applications that require autonomous operation beyond a local environment. Ethernet technology, present in its several variants in most devices equipped with a LAN (Local Area Network) connection, combined with the use of the Internet, provides the most convenient solution to this problem. We searched for available Ethernet cores that were free to use and described in VHDL, since these conditions are part of our laboratory's line of work. Few results were found, the most notable being the GReth core [2], part of GRLib [3]. However, its FPGA area usage, its complex mode of use, and the fact that it can only be used through an AMBA bus [4] exceeded the desired characteristics. In this work we present an Ethernet MAC (Media Access Controller) core that arose from what we learned studying the GReth core. It is compact, easy to use, and can be used on FPGAs from any vendor.

2. THE GRETH CORE

GRLib is a library of IP cores distributed under a dual-licensing scheme: commercial and GPL [5]. GReth provides an interface between an AMBA bus and an Ethernet network (10/100 Mb/s, full- and half-duplex). It implements the standard, without support for the optional control layer.

2.1. Architecture

The block diagram of GReth is shown in Fig. 1.

Fig. 1. Block diagram of GReth.

The AMBA buses used are the APB (Advanced Peripheral Bus), for handling the configuration and control registers, and the AHB (Advanced High-performance Bus), for the data flow, which is carried over DMA (Direct Memory Access) channels for transmission and reception. The core connects to an external PHY through the MII (Media Independent Interface) or RMII (Reduced MII) interfaces for data exchange, and through MDIO (Management Data Input/Output) to access its configuration and status. The EDCL (Ethernet Debug Communication Link) interface provides read/write access to the AHB bus over Ethernet. The core has three clock domains: the transmit and receive clocks, provided by the external PHY, and the clock of the remaining components and the AMBA buses.

2.2. Hardware description

GRLib is described using the so-called two-process method [6]: with two processes per entity, one containing all the combinational logic and the other all the sequential logic, the complete algorithm can be coded in the combinational process, while the sequential process only contains register assignments. This method abstracts the hardware description, making it resemble software development.

2.3. Mode of use

The core is controlled through the APB with 32-bit registers:

Registers 0 and 1: control/status.
Registers 2 and 3: MAC address.
Register 4: control/status of the MDIO interface.
Registers 5 and 6: memory addresses of the transmit and receive descriptor tables.

Descriptors are 32-bit words transferred over the AHB. Both transmission and reception use two contiguous descriptors:

Descriptor 0: made up of control and status bits. It uses 11 bits to specify the number of bytes to transfer.
Descriptor 1: a 30-bit pointer to the memory area where the data is stored/retrieved.

Transmission

The data is placed over the AHB starting at the address pointed to by descriptor 1. It must include the destination and source MAC addresses and the type/length field. The 4-byte CRC (Cyclic Redundancy Check) is appended automatically. Next, the address of descriptor 0 is written to register 5. GReth starts the transmission when instructed to do so in register 0. When the transmission finishes, GReth writes status information to register 1 and to descriptor 0. Finally, it points to the next descriptor pair and is ready for the next operation.

Reception

The address of descriptor 0 is written to register 6. GReth reads the descriptors when instructed in register 0 and waits for an incoming packet. The packet is accepted when its destination MAC address matches the one specified in registers 2 and 3 or the broadcast address, or when the core has promiscuous mode enabled. In any other case it is discarded. When reception finishes, status information is written to register 1 and to descriptor 0, and the received data is accessible starting at the address pointed to by descriptor 1.

MDIO

This interface provides access to between 1 and 32 PHYs, each containing between 1 and 32 16-bit registers. Its control and status are accessible through register 4.
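The two-process method described above can be sketched with a minimal, hypothetical example: an 8-bit counter with enable. The entity and all names here are ours, not taken from GRLib; this is only an illustration of the coding style, not code from either core.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity counter8 is
   port(
      clk    : in  std_logic;
      rst    : in  std_logic;
      enable : in  std_logic;
      count  : out std_logic_vector(7 downto 0));
end entity counter8;

architecture two_proc of counter8 is
   -- All state is grouped in a single record, as the method suggests.
   type reg_type is record
      cnt : unsigned(7 downto 0);
   end record;
   signal r, rin : reg_type;
begin
   -- Combinational process: the whole algorithm is coded here.
   comb : process(r, enable)
      variable v : reg_type;
   begin
      v := r;                 -- default: keep the current state
      if enable = '1' then
         v.cnt := r.cnt + 1;  -- the actual "algorithm"
      end if;
      rin <= v;               -- next-state value
      count <= std_logic_vector(r.cnt);
   end process;

   -- Sequential process: only register assignment.
   seq : process(clk)
   begin
      if rising_edge(clk) then
         if rst = '1' then
            r.cnt <= (others => '0');
         else
            r <= rin;
         end if;
      end if;
   end process;
end architecture two_proc;
```

The appeal of the style is that all algorithmic changes are confined to the combinational process, while the sequential process never grows beyond the register update.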
A write is started by specifying the data, the PHY number, and the register number, and then setting the write bit to 1; a read requires only the PHY and register numbers, and starts when the read bit is set to 1.

3. TESTING GRETH

In order to detect any error introduced while simplifying the core, a testbench was designed for GReth. This gave us a better understanding of its operation, particularly considering the use of the two-process method in GReth. The test consisted of instantiating GReth together with a description called FakePHY, which pretends to be a PHY, and, from the AMBA interfaces, performing MDIO writes and reads as well as transmissions and receptions through MII, verifying that the data sent and received matched, or aborting otherwise.

Fig. 2. Instantiation diagrams of GReth (left) and the MAC (right).

To handle AMBA, a simulation library called AMBA Handler was developed. It implements eight procedures that represent the combinations of write or read, to a master or a slave, over APB or AHB.

4. THE DEVELOPED CORE: MAC Ethernet

4.1. Introduction

Fig. 2 shows two simplified diagrams of the component instantiation of the core we started from (left) and of the core we obtained (right). The top level of GReth instantiates the transmit and receive FIFOs and the ethc0 component, and implements the descriptor-based control of the core, the MDIO communication and part of the AMBA communication, the EDCL interface, and the synchronization between the different clock domains. The ethc0 component instantiates the components that handle transmission and reception through MII/RMII, and a component that handles the other part of the AMBA communication. The developed MAC core has a purely structural top level that only instantiates the so-called transmit and receive channels and, optionally, the MDIO interface. These channels in turn instantiate dual-port RAM memories, the components that handle transmission and reception through MII, and components for the synchronization between the different clock domains.

4.2. Implementation

The developed core was written in standard VHDL 93. The tools and guidelines recommended by the FPGALibre project [7] were used for its development. With respect to GReth, certain features were removed, some descriptions were replaced, and others were partially or completely modified. The following features were removed:

Use of AMBA buses.
Descriptor-based control.
The EDCL interface.
RMII support.

The generic FIFOs used in GReth were replaced by our laboratory's own FIFOs, implemented with dual-port RAM. In addition, they are now instantiated inside the new transmit and receive channels, which implement the communication between the MAC and a higher-level application in a much simpler way. The MDIO functionality was extracted from the complex description in which it was embedded and became an independent component. The components that handle transmission and reception through MII are, together with the MDIO, the only ones that keep part of the original description and the use of the two-process method. They underwent changes such as: removal of RMII support; removal or simplification of states of their FSMs (Finite State Machines); removal of or changes to control and status signals; removal of a component that filtered possible glitches on the reset signal; etc. The synchronization between the different clock domains used to take place between the FIFOs and the transmit and receive components; it now takes place between the write and read ports of the dual-port RAMs. Moreover, it used to be a functionality scattered across several areas of the description, whereas it now uses a new component developed for that purpose.

4.3. Architecture

Fig. 3 shows a block diagram of the core, where the three clock domains the system works with can be seen.

Fig. 3. Block diagram of the MAC core.

Transmission consists of an FSM that, driven by its input signals, writes data to a FIFO implemented with a dual-port RAM. When the data transfer to the FIFO finishes, the wr_end signal is generated; after being synchronized, it is detected by the FSM that reads the data from the FIFO and transmits it through MII. Once all the data has been read, the rd_end signal makes the write FSM return to its initial state. Reception is similar to transmission, with the difference that the data written to the FIFO comes from MII and the data read from the FIFO is made available to the application. To avoid losing packets when the application has not finished retrieving previously received data, a multiple-FIFO scheme was implemented. The number of FIFOs is configurable, and they are managed exclusively by the core.

4.4. Mode of use

The core offers several configurations through generics, among which the following stand out:

TXFIFOSIZE and RXFIFOSIZE: specify the storage capacity of the FIFOs in bytes.
RX_CHANNELS: number of receive channels to use. Each channel implies the use of one FIFO.
ENABLE_MDIO: indicates whether the MDIO module is used.

It also has control lines to:

Enable or disable the transmit and receive channels.
Enable or disable interrupt signals.
Select half or full duplex.
Specify the MAC address.
Enable promiscuous mode.

Transmission: the start and end are indicated with dedicated signals. Data is confirmed through a write signal. The channel provides a busy indication, as well as error information for memory overrun and for reaching the transmission retry limit on the bus.

Reception: available data is signaled by driving a signal high, which is held until all the data has been read. Reads are confirmed through the read signal, or aborted if the packet is to be discarded. The errors signaled are: data memory overrun; received packet shorter/longer than the minimum/maximum supported by Ethernet; alignment or CRC error; and amount of received data not matching the value specified in the length field of the received packet.

MDIO: offers features similar to GReth's, but through a new interface. It has signals to specify the PHY number and the register number, and separate data input and output ports. Individual signals indicate whether the operation is a write or a read. Finally, it has a busy signal and a communication-failure signal.

5. VALIDATION OF THE DEVELOPED CORE

5.1. Simulation

GHDL [8] was used for the simulation.
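The wr_end and rd_end flags described above must cross between clock domains, and the text mentions a dedicated synchronization component for this. A classic two-flip-flop synchronizer is the usual way to move such single-bit flags safely; the sketch below is a generic stand-in with our own names, not necessarily the component used in the core.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Generic two-flip-flop synchronizer for a single-bit flag such as
-- wr_end crossing from the write-clock domain to the read-clock domain.
entity sync_2ff is
   port(
      clk_dst : in  std_logic;  -- destination clock domain
      d       : in  std_logic;  -- flag from the other clock domain
      q       : out std_logic); -- synchronized output
end entity sync_2ff;

architecture rtl of sync_2ff is
   signal ff1, ff2 : std_logic := '0';
begin
   process(clk_dst)
   begin
      if rising_edge(clk_dst) then
         ff1 <= d;    -- first stage may go metastable
         ff2 <= ff1;  -- second stage filters the metastability out
      end if;
   end process;
   q <= ff2;
end architecture rtl;
```

Note that this scheme is only safe for level-type flags that change slowly relative to the destination clock, which matches the end-of-transfer handshake described in the text; multi-bit values would need a different technique.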
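As an illustration of the generics discussed in the mode-of-use section, a hypothetical top level might instantiate the core as follows. The entity name, port names, and generic types are our assumptions; the paper names the generics but does not give the exact interface.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical wrapper showing how the documented generics could be set.
entity mac_example_top is
   port(
      sys_clk, sys_rst       : in  std_logic;
      phy_tx_clk, phy_rx_clk : in  std_logic;                     -- from the external PHY
      phy_rxd                : in  std_logic_vector(3 downto 0);  -- MII nibble in
      phy_rx_dv              : in  std_logic;
      phy_txd                : out std_logic_vector(3 downto 0);  -- MII nibble out
      phy_tx_en              : out std_logic);
end entity mac_example_top;

architecture sketch of mac_example_top is
begin
   u_mac : entity work.mac_ethernet   -- assumed entity name
      generic map(
         TXFIFOSIZE  => 2048,   -- TX FIFO capacity in bytes
         RXFIFOSIZE  => 2048,   -- RX FIFO capacity in bytes
         RX_CHANNELS => 2,      -- two receive FIFOs, the common case
         ENABLE_MDIO => false)  -- leave the MDIO module out
      port map(
         clk    => sys_clk,     -- application-side clock domain
         rst    => sys_rst,
         tx_clk => phy_tx_clk,  -- MII clocks provided by the PHY
         rx_clk => phy_rx_clk,
         txd    => phy_txd,
         tx_en  => phy_tx_en,
         rxd    => phy_rxd,
         rx_dv  => phy_rx_dv);
end architecture sketch;
```

With RX_CHANNELS set to 2 the core keeps a spare receive FIFO, matching the multiple-FIFO scheme the text describes for avoiding packet loss while the application drains a previous frame.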

Table 1. Synthesis results

GReth configuration      LUTs  FFs  Slices  BRAMs
Without MDIO
With MDIO

MAC core configuration   LUTs  FFs  Slices  BRAMs
1 RX, without MDIO
2 RX, without MDIO
2 RX, with MDIO

A testbench was built in which the FakePHY core is instantiated again, this time together with our MAC. Unlike the GReth test, this one is more rigorous, including features such as:

It implements separate processes for transmission and reception, instead of a single sequential one.
It verifies the behavior of the error indications.
The three clocks it uses are not exact multiples of one another, which allows a better simulation of the synchronization between signals.

In addition, a core called Replies was developed, which answers ARP (Address Resolution Protocol) and ICMP (Internet Control Message Protocol) requests. It should be noted that the mechanisms it uses for this purpose do not follow what is specified for these two protocols; they are tricks intended for testing. This core was used in a testbench together with real Ethernet frames captured with the Wireshark software [9], to recreate the execution of the ping command and to visualize the waveforms and the data packets exchanged.

5.2. Hardware validation

Hardware validation was carried out using a Xilinx Virtex 4 FPGA and the ISE WebPack software. The host was a personal computer running the Debian [10] GNU [11]/Linux operating system. The Replies core, which is synthesizable, was used as the application. Once the core passed the testbench without reporting any error, multiple tests were run using the ping command, lasting from hours to more than a week of execution, in all cases with zero packets lost. Wireshark was used again, in this case to verify the correct formation of the received packets. The external PHY used was the DP83847 from National Semiconductor. The tests were performed over a 100 Mb/s full-duplex link.

6. RESULTS

Table 1 shows the synthesis results of the GReth and MAC cores for a Virtex 4. For GReth, the most common configurations were synthesized, with and without the MDIO interface, in both cases with the EDCL interface disabled. For the MAC, the same options were synthesized, with two receive channels being the most common use case, plus the case of a single receive channel, which can be sufficient in many applications that do not require a continuous data flow.

7. CONCLUSIONS

Comparing the synthesis results shows that the implementation obtained is more compact than the one we started from. For equivalent configurations, our core uses less than 50% of the FPGA area used by GReth. It must also be considered that the GReth core requires memory accessible through AMBA, plus all the descriptor-handling support, whereas our core includes everything needed to be used directly. Regarding the mode of use, the developed core is simpler and does not depend on a particular bus, although it can easily be adapted to whichever one is needed, be it AMBA, WISHBONE [12], or another. The simplification of the mode of use and the change of architecture are the main reasons for the lower FPGA resource usage. The use of standard VHDL 93 allows the core to be synthesized on FPGAs from any vendor. The tools proposed by the FPGALibre project proved adequate for a project of these characteristics. Future work could address both lower layers, such as the implementation of an Ethernet PHY, and higher-level applications, such as support for the IP (Internet Protocol).

8. REFERENCES

[1] S. E. Tropea and R. A. Melo, "USB framework - IP core and related software," in XV Workshop Iberchip, vol. 1, Buenos Aires, 2009.
[2] GRLIB IP Core User's Manual, Gaisler Research, 2008.
[3] J. Gaisler, "An open-source VHDL IP library with plug&play configuration," in IFIP Congress Topical Sessions, R. Jacquart, Ed. Kluwer, 2004.
[4] ARM. (2010, Jun.) AMBA - Advanced Microcontroller Bus Architecture. [Online]. Available: system-ip/amba/amba-open-specifications.php
[5] Free Software Foundation, Inc., GNU General Public License.
[6] J. Gaisler, "A structured VHDL design method," com/doc/vhdl2proc.pdf.
[7] S. E. Tropea, D. J. Brengi, and J. P. D. Borgna, "FPGAlibre: Herramientas de software libre para diseño con FPGAs," in FPGA Based Systems. Mar del Plata: Surlabs Project, II SPL, 2006.
[8] T. Gingold. (2010, Jun.) A complete VHDL simulator. [Online].
[9] G. Combs and contributors. (2010, Jun.) Network protocol analyzer. [Online].
[10] I. Murdock et al. (2010, Jun.) Debian GNU/Linux operating system. [Online].
[11] R. M. Stallman et al. (2010, Jun.) The GNU project. [Online].
[12] Silicore and OpenCores.Org. (2010, Jun.) WISHBONE System-on-Chip (SoC) interconnection architecture for portable IP cores. [Online].

19 IP CORE MAC ETHERNET Ing. Rodrigo A. Melo, Ing. Salvador E. Tropea Instituto Nacional de Tecnología Industrial Centro de Electrónica e Informática Laboratorio de Desarrollo Electrónico con Software Libre {rmelo,salvador}@inti.gob.ar ABSTRACT La tecnología Ethernet provee comunicación entre PCs y dispositivos que funcionen en forma autónoma, en ámbitos locales o a través de Internet. En este trabajo presentamos un core que implementa la capa MAC Ethernet, de uso sencillo, con diversas configuraciones, que ocupa pocos recursos de una FPGA. El diseño fue simulado con herramientas de Software Libre y verificado en hardware utilizando una FPGA Virtex INTRODUCCIÓN Nuestro equipo de trabajo desarrolla sistemas embebidos que enlamayoríadeloscasosprecisanestarcomunicados conunapc. Si bien hemos desarrollado cores que cubran esta necesidad, como el core USB [1], en la actualidad, esta conexión deja de ser suficiente para incontables aplicaciones que precisan de un funcionamiento autónomo, que vaya más allá de un ámbito local. La tecnología Ethernet, presente en sus diversas variantes en la mayoría de los dispositivos dotados de conexión a una LAN (Local Area Network), sumado al uso de Internet, provee la solución más conveniente a este problema. Se realizó una búsqueda de cores Ethernet disponibles, de uso libre y descriptos en VHDL, ya que estas condiciones forman parte de la línea de trabajo de nuestro laboratorio. Los resultados fueron pocos, siendo el más destacable el core GReth[2], perteneciente a la GRLib [3]. Sin embargo, el área ocupada de la FPGA, el complejomododeusoylaúnicaopcióndeutilizaciónmedianteunbus AMBA[4], excedían las características deseadas. En este trabajo presentamos un core MAC(Media Access Controller) Ethernet que surgió de lo aprendido en base al estudio del coregreth.escompacto,defácilutilizaciónycapazdeserusado en FPGAs de cualquier fabricante Introducción 2. 
2. THE GRETH CORE

GRLib is a library of IP cores, distributed under a dual-licensing scheme: commercial and GPL [5]. GReth provides an interface between an AMBA bus and an Ethernet network (10/100 Mb/s, full and half duplex). It implements the IEEE 802.3 standard, without support for the optional control layer.

Architecture

The GReth block diagram is shown in Fig. 1.

Fig. 1. GReth block diagram.

The AMBA buses used are APB (Advanced Peripheral Bus) for handling the configuration and control registers, and AHB (Advanced High-performance Bus) for the data flow, carried through DMA (Direct Memory Access) channels for transmission and reception. The core connects to an external PHY through the MII (Media Independent Interface) or RMII (Reduced MII) interfaces for data exchange, and through MDIO (Management Data Input/Output) to access configuration and status. The EDCL (Ethernet Debug Communication Link) interface provides read/write access to the AHB bus over Ethernet. The core has three clock domains: the transmission and reception clocks, provided by the external PHY, and the clock of the remaining components and AMBA buses.

Hardware description

GRLib is described using the so-called two-process method [6]: using two processes per entity, one containing all the combinational logic and the other all the sequential logic, the complete algorithm can be coded in the combinational process, while the sequential process only contains register assignments. This method abstracts the hardware description, making it resemble software development.

Mode of use

The core is controlled through APB with 32-bit registers:

Registers 0 and 1: control/status.

Registers 2 and 3: MAC address.
Register 4: control/status of the MDIO interface.
Registers 5 and 6: memory addresses of the transmission and reception descriptor tables.

The descriptors are 32-bit data words transferred through AHB. Both for transmission and for reception there are two contiguous descriptors:

Descriptor 0: made up of control and status bits. It uses 11 bits to specify the number of bytes to transfer.
Descriptor 1: a 30-bit pointer to the memory area where the data are stored/retrieved.

Transmission

The data are placed through the AHB starting at the address pointed to by descriptor 1. The data must include the destination and source MAC addresses and the type/length field. The 4-byte CRC (Cyclic Redundancy Check) is appended automatically. Next, the address of descriptor 0 is written to register 5. GReth starts the transmission when instructed through register 0. When the transmission finishes, GReth writes status information to register 1 and descriptor 0. Finally, it points to the next descriptor pair and is ready for the next operation.

Reception

The address of descriptor 0 is written to register 6. GReth reads the descriptors when instructed through register 0 and waits for an incoming packet. A packet is accepted when its destination MAC address matches the one set in registers 2 and 3 or the broadcast address, or when the core has promiscuous mode enabled. In any other case it is discarded. On completion, status information is written to register 1 and descriptor 0, and the received data are accessible starting at the address pointed to by descriptor 1.

MDIO

This interface gives access to 1 to 32 PHYs, each containing 1 to 32 16-bit registers. Its control and status are accessible through register 4.
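As a behavioral illustration of the MDIO register space just described (up to 32 PHYs, each with up to 32 registers of 16 bits), the following sketch models the accesses in software. The class and method names are illustrative assumptions, not the core's actual port names.

```python
# Behavioral model of the MDIO register space described above:
# up to 32 PHYs, each with up to 32 registers of 16 bits, accessed
# by giving a PHY number, a register number and, for a write, the
# data word. Names are illustrative, not the core's ports.

class Mdio:
    def __init__(self):
        self.phys = [[0] * 32 for _ in range(32)]  # 32 PHYs x 32 regs

    def write(self, phy, reg, data):
        assert 0 <= phy < 32 and 0 <= reg < 32
        self.phys[phy][reg] = data & 0xFFFF        # registers are 16 bits

    def read(self, phy, reg):
        assert 0 <= phy < 32 and 0 <= reg < 32
        return self.phys[phy][reg]

mdio = Mdio()
mdio.write(phy=3, reg=0, data=0x2100)  # e.g. force 100 Mb/s full duplex
assert mdio.read(phy=3, reg=0) == 0x2100
```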
A write is started by specifying the data, the PHY number and the register number, and setting the write bit to 1, while a read requires the PHY and register numbers and starts when the read bit is set to 1.

3. TESTING GRETH

In order to detect any error introduced while simplifying the core, a testbench was designed for GReth. This gave us a better understanding of its operation, particularly considering the use of the two-process method in GReth. The test consisted of instantiating GReth together with a description called FakePHY, which pretends to be a PHY, and, from the AMBA interfaces, performing MDIO writes and reads as well as transmissions and receptions through MII, verifying that the data sent and received matched, or aborting otherwise.

Fig. 2. Instantiation scheme of GReth (left) and MAC (right).

To handle AMBA, a library called AMBA Handler was developed for simulation purposes. It implements eight procedures representing the combinations of write or read, to a master or a slave, over APB or AHB.

4. THE DEVELOPED CORE: MAC Ethernet

4.1. Introduction

Fig. 2 shows two summarized schemes of the component instantiation of the core we started from (left) and of the core that was obtained (right). The top level of GReth instantiates the transmission and reception FIFOs and the ethc0 component, and implements the descriptor-based management of the core, the MDIO communication and part of the AMBA communication, the EDCL interface, and the synchronization between the different clock domains. The ethc0 component instantiates the components that handle transmission and reception through MII/RMII and a component that handles the remaining part of the AMBA communication.

The developed MAC core has a purely structural top level, which only instantiates the so-called transmission and reception channels and, optionally, the MDIO interface.
These channels internally instantiate dual-port RAM memories, the components that handle transmission and reception through MII, and components for the synchronization between the different clock domains.

4.2. Implementation

The developed core was written in standard VHDL 93. For its development, the tools and guidelines recommended by the FPGALibre project [7] were used. With respect to GReth, certain features were removed, some descriptions were replaced, and others were partially or totally modified. The following features were removed:

Use of AMBA buses.
Descriptor-based management.
EDCL interface.
RMII support.

The generic FIFOs used in GReth were replaced by our laboratory's own, implemented with dual-port RAM. In addition, these FIFOs are now instantiated

inside the new transmission and reception channels, which implement the communication between the MAC and a higher-level application in a much simpler way. The MDIO functionality was extracted from the complex description in which it was embedded and became an independent component.

The components that handled transmission and reception through MII are, together with the MDIO, the only ones that keep part of the original description and the use of the two-process method. They underwent changes such as: removal of RMII support; removal or simplification of states in their FSMs (Finite State Machines); removal of, or changes to, control and status signals; removal of a component that filtered possible glitches on the reset signal; etc.

The synchronization between the different clock domains previously took place between the FIFOs and the transmission and reception components, and now takes place between the write and read ports of the dual-port RAMs. Moreover, it used to be functionality scattered over several areas of the description, whereas now it uses a new component developed for that purpose.

4.3. Architecture

Fig. 3 shows a block diagram of the core, where the three clock domains the system works with can be seen.

Fig. 3. Block diagram of the MAC core.

Transmission consists of an FSM that, driven by input signals, writes data to a FIFO implemented with a dual-port RAM. When the data transfer to the FIFO finishes, the wr_end signal is generated; after being synchronized, it is detected by the FSM that reads the data from the FIFO and transmits them through MII. Once all the data have been read, the rd_end signal makes the write FSM return to its initial state.

Reception is similar to transmission, with the difference that the data written to the FIFO are those obtained from MII, and the data read from the FIFO become available for use. To avoid packet loss when the application has not finished retrieving the received data, a multiple-FIFO scheme was implemented. The number of FIFOs is configurable and their management depends exclusively on the core.

4.4. Mode of use

The core offers several configurations through generics, among which the following stand out:

TXFIFOSIZE and RXFIFOSIZE: used to specify the storage capacity of the FIFOs in bytes.
RX_CHANNELS: number of reception channels to use. Each channel implies the use of one FIFO.
ENABLE_MDIO: indicates whether the MDIO module is used or not.

In addition, it has control lines to:

Enable or disable the transmission and reception channels.
Enable or disable interrupt signals.
Select half or full duplex.
Specify the MAC address.
Enable promiscuous mode.

Transmission

Start and end are indicated with independent signals dedicated to that purpose. Data are validated through a write signal. The channel provides a busy indication and reports errors: memory overrun, or the transmission retry limit on the bus being reached.

Reception

Available data are signaled by driving a signal high, which is held until all the data have been read. These reads are validated through the read signal, or aborted if the packet is to be discarded. The errors it signals are: data memory overrun; received packet shorter/longer than the minimum/maximum supported by Ethernet; wrong alignment or CRC; amount of received data not matching the length field of the received packet.

MDIO

This module has characteristics similar to GReth's, but a new interface. It has signals to specify the PHY number, the register number, and separate input and output data. Individual signals indicate whether the operation is a write or a read. Finally, it has a busy signal and a communication-failure signal.

5. VALIDATION OF THE DEVELOPED CORE

5.1. Simulation

GHDL [8] was used for the simulation.
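The multiple-FIFO reception scheme and the RX_CHANNELS generic described above can be sketched behaviorally: with N independent FIFOs, a newly received packet is not lost while the application is still draining an earlier one. The class below is an illustration under these assumptions, not the core's actual interface; only the generic names come from the text.

```python
from collections import deque

# Behavioral sketch of the multiple-FIFO reception scheme described
# above: RX_CHANNELS independent FIFOs so that a newly received
# packet is not dropped while the application drains an earlier one.
# This is an illustration, not the core's actual interface.

class RxChannels:
    def __init__(self, rx_channels=2, rxfifosize=2048):
        self.fifos = [deque() for _ in range(rx_channels)]
        self.size = rxfifosize          # capacity in bytes, per FIFO
        self.wr = 0                     # channel the MAC fills next
        self.rd = 0                     # channel the application drains

    def receive(self, packet):
        """MAC side: store a packet, or drop it (overrun) when the
        next FIFO is still held by the application."""
        fifo = self.fifos[self.wr]
        if fifo or len(packet) > self.size:
            return False                # overrun: all buffering in use
        fifo.extend(packet)
        self.wr = (self.wr + 1) % len(self.fifos)
        return True

    def read_packet(self):
        """Application side: drain the oldest FIFO."""
        fifo = self.fifos[self.rd]
        data = bytes(fifo)
        fifo.clear()
        self.rd = (self.rd + 1) % len(self.fifos)
        return data

rx = RxChannels(rx_channels=2)
rx.receive(b"pkt-1")
rx.receive(b"pkt-2")                    # accepted: second FIFO is free
assert rx.receive(b"pkt-3") is False    # both FIFOs busy, packet dropped
assert rx.read_packet() == b"pkt-1"
```

With RX_CHANNELS = 1 the third packet would already have been lost after the first one, which is why a single channel only suits applications without a continuous data flow.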

Table 1. Synthesis results of the GReth and MAC cores.

GReth configuration (LUTs, FFs, Slices, BRAMs): without MDIO; with MDIO.
MAC core configuration (LUTs, FFs, Slices, BRAMs): 1 RX without MDIO; 2 RX without MDIO; 2 RX with MDIO.

A testbench was built that again instantiates the FakePHY core, this time together with our MAC. Unlike the GReth test, this one is more rigorous and includes features such as:

It implements separate processes for transmission and reception, instead of using a single sequential one.
It verifies the operation of the error indications.
The three clocks it uses are not exact multiples of each other, which allows a better simulation of the synchronization between signals.

In addition, a core called Replies was developed, which answers ARP (Address Resolution Protocol) and ICMP (Internet Control Message Protocol) requests. It should be noted that the mechanisms it uses for this purpose do not follow what is specified for these two protocols; they are contrivances built for testing. This core was used in a testbench together with real Ethernet frames captured with the wireshark software [9], to recreate the execution of the ping command and to visualize the waveforms and the data packets exchanged.

5.2. Hardware validation

Validation was carried out using a Xilinx Virtex 4 FPGA and the ISE WebPack software. The host was a personal computer running the Debian [10] GNU [11]/Linux operating system. The Replies core, which is synthesizable, was used as the application. Once the core passed the testbench without reporting any error, multiple tests were run using the ping command, ranging from hours to more than a week of execution, in all cases with zero packets lost. The wireshark software was used again, in this case to verify the correct formation of the received packets. The external PHY used was the DP83847 from National Semiconductor.
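Conceptually, an echo-style responder such as the Replies core sends an answer back to whoever sent the request, which at the frame level means swapping the destination and source MAC addresses. The sketch below only illustrates that idea; as the text notes, the actual core uses test contrivances rather than full ARP/ICMP handling, and the frame contents here are invented.

```python
# Illustrative sketch of an echo-style reply: swap the destination
# and source MAC addresses of an incoming Ethernet frame and send
# the rest back. This only illustrates the idea behind the Replies
# test core; it is not its actual implementation.

def make_reply(frame: bytes) -> bytes:
    dst, src, rest = frame[0:6], frame[6:12], frame[12:]
    return src + dst + rest   # the reply goes back to the sender

# Invented example frame: broadcast destination, arbitrary source.
frame = bytes.fromhex("ffffffffffff") + bytes.fromhex("020000000001") + b"\x08\x06payload"
reply = make_reply(frame)
assert reply[0:6] == frame[6:12]   # reply addressed to the original sender
```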
The tests were performed using 100 Mb/s full-duplex communication.

6. RESULTS

Table 1 shows the synthesis results of the GReth and MAC cores for a Virtex 4. For GReth, the most common configurations were synthesized, with and without the MDIO interface, in both cases with the EDCL interface disabled. For the MAC, the same options were synthesized, two reception channels being the most common use case, plus the case of a single reception channel, which can be enough for many applications that do not require a continuous data flow.

7. CONCLUSIONS

From the comparison of the synthesis results, it can be seen that a more compact implementation than the starting one was obtained. For equivalent configurations, our core uses less than 50% of the FPGA area used by GReth. It must also be considered that the GReth core requires memory accessible through AMBA, plus all the support for descriptor management, whereas our core includes everything needed to be used directly.

Regarding the mode of use, the developed core is simpler and does not depend on a particular bus, although it can easily be adapted to whichever is needed, be it AMBA, WISHBONE [12] or another. The simplification of the mode of use and the change of architecture are the main reasons for the lower FPGA resource usage. The use of standard VHDL 93 allows the core to be synthesized on an FPGA from any vendor. The tools proposed by the FPGALibre project proved to be adequate for a project of these characteristics.

Future work could involve both lower layers, such as the implementation of an Ethernet PHY, and higher-level applications, such as support for the IP (Internet Protocol) protocol.

8. REFERENCES

[1] S. E. Tropea and R. A.
Melo, "USB framework - IP core and related software," in XV Workshop Iberchip, vol. 1, Buenos Aires, 2009.
[2] GRLIB IP Core User's Manual, Gaisler Research, 2008.
[3] J. Gaisler, "An open-source VHDL IP library with plug&play configuration," in IFIP Congress Topical Sessions, R. Jacquart, Ed. Kluwer, 2004.
[4] ARM. (2010, Jun.) AMBA - Advanced Microcontroller Bus Architecture. [Online]. Available: system-ip/amba/amba-open-specifications.php
[5] Free Software Foundation, Inc., GNU General Public License.
[6] J. Gaisler, "A structured VHDL design method," com/doc/vhdl2proc.pdf, Jun.
[7] S. E. Tropea, D. J. Brengi, and J. P. D. Borgna, "FPGAlibre: Herramientas de software libre para diseño con FPGAs," in FPGA Based Systems. Mar del Plata: Surlabs Project, II SPL, 2006.
[8] T. Gingold. (2010, Jun.) A complete VHDL simulator. [Online].
[9] G. Combs and contributors. (2010, Jun.) Network protocol analyzer. [Online].
[10] I. Murdock et al. (2010, Jun.) Debian GNU/Linux operating system. [Online].
[11] R. M. Stallman et al. (2010, Jun.) The GNU project. [Online].
[12] Silicore and OpenCores.Org. (2010, Jun.) WISHBONE System-on-Chip (SoC) interconnection architecture for portable IP cores. [Online].

AUTONOMOUS WIRELESS INTELLIGENT NETWORK ACCESSIBLE VIA IP

María Isabel Schiavon, Daniel Alberto Crepaldo
Laboratorio de Microelectrónica
Universidad Nacional de Rosario, Argentina

ABSTRACT

An autonomous wireless intelligent network is presented. The intended function is to sense meteorological data in the field. A minimum, dedicated set of Internet Protocol rules was selected for communications, so that the network can be accessed remotely from a wireless Ethernet local area network. The internal intelligence of the network is centered on dynamic topology reconfiguration according to the physical location of the nodes. The Border Gateway Protocol (BGP) was adapted to allow dynamic reconfiguration.

1. INTRODUCTION

An autonomous wireless intelligent network (AWIN) is presented. It is defined as a wireless Ethernet local area network. All communications, internal and external, are made via the Internet Protocol (IP). Remote access to the stations via wireless Ethernet is enabled for the reset process or for data gathering. The protocol for wireless Ethernet networks is defined in the IEEE 802.11 standard rules [1] [2]. The rules are independent of technology and internal structure. The minimum necessary subset of these standard rules was selected to implement the node communication module. The network has one IP address; all nodes share this IP address and have their own physical (MAC) address. The internal network intelligence is centered on dynamic architecture reconfiguration according to the physical location of the nodes. The Border Gateway Protocol (BGP) was adapted to allow dynamic reconfiguration. BGP was developed to allow an effective all-to-all interconnection between autonomous systems via IP [3]. As the capabilities of BGP exceed the needs of an autonomous network, only those needed for this specific application were selected.
Making dynamic reconfiguration simple, that is, adding or removing nodes and changing the communication path without affecting network performance, posed an interesting problem to solve. The commitment was high performance, low cost and minimum power consumption. Figure 1 shows a fourteen-node network before the communication architecture has been built.

Figure 1. Fourteen-node network with unbuilt architecture

The network is identified by one IP address, so all the nodes share it and have the same structure and capabilities, but each of them is identified by a different MAC address. The network builds its communication architecture autonomously. As each wireless network node can communicate only with those nodes that are within transmitter range, communication inside the network must proceed neighbor node to neighbor node, by word of mouth. Once the communication path is defined, as shown in figure 2, the network is ready and the programmed process starts. Node deployment is not fixed and may change over time. Nodes are battery powered, so the transmitter range will be affected by the state of battery charge. This or other causes of failure, such as environmental or electronic hazards or involuntary destruction, can put some nodes out of service. If one or several nodes stop working, the network must reconfigure itself to keep network communication alive, as shown in figure 3. Periodically, an architecture check is done, and when necessary a communication-path reconfiguration is made. When external access is required, the request can be received by many nodes; the first node that answers assumes the role of hub node. The hub node is responsible for the wireless communication with the external Ethernet network, and all the others must report to it using intermediate nodes as repeaters.
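The neighbor-to-neighbor relaying described above can be sketched as a graph problem: two nodes are neighbors when within transmitter range, and every node needs a multi-hop route toward the hub. The sketch below is illustrative only; node positions, the range value and the node names are invented for the example.

```python
from collections import deque

# Illustrative sketch of neighbor-to-neighbor relaying: two nodes
# are neighbors when within transmitter range, and each node keeps
# the next hop of a route toward the hub node. Positions, range
# and node names are invented for this example.

def neighbors(positions, radio_range):
    """Map each node to the nodes it can reach directly."""
    def in_range(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b)) <= radio_range ** 2
    return {n: [m for m in positions
                if m != n and in_range(positions[n], positions[m])]
            for n in positions}

def routes_to_hub(adj, hub):
    """BFS from the hub: for every reachable node, the next hop
    on its path back to the hub."""
    next_hop, queue = {hub: hub}, deque([hub])
    while queue:
        n = queue.popleft()
        for m in adj[n]:
            if m not in next_hop:
                next_hop[m] = n      # report to the hub through n
                queue.append(m)
    return next_hop

pos = {"A": (0, 0), "C": (1, 0), "E": (2, 0), "J": (3, 0)}
adj = neighbors(pos, radio_range=1.5)
hop = routes_to_hub(adj, hub="A")
assert hop["J"] == "E" and hop["E"] == "C"   # J reports via E, C to A
```

Recomputing the routes on a graph with a failed node removed models the periodic architecture check and reconfiguration.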

Figure 2. Fourteen-node network communication path

Figure 3. Fourteen-node network communication path with node C out of service

2. NODE DESCRIPTION

The block diagram of a typical network node is shown in figure 4. Two subsystems can be distinguished, one for communication and the other to manage sensor activity and configuration. The communication subsystem has three blocks. The first block is a wireless Ethernet compatible transmitter/receiver. The second (PROTOCOL CODE/DECO) is a dedicated communication module that is responsible, in a reception process, for interpreting the message according to the IP protocol, storing in memory the fields it needs to keep and transmitting data to the sensor subsystem, or, in transmission, for shaping the frame according to the Ethernet protocol, retrieving from memory the fields needed to build the outgoing message. The last is a memory block (COMMUNICATION MEMORY). The sensor subsystem is composed of three blocks: one to manage all subsystem activities (SENSOR SUBSYSTEM CONTROL), a memory block to store data and configuration parameters (SENSOR MEMORY), and the sensor itself (SENSOR). The transmitter/receiver to be used in this application will be a wireless Ethernet IEEE 802.11 compatible transmitter/receiver, and its description is out of the scope of this paper.

Figure 4. Network node block diagram

The dedicated communication module block (PROTOCOL CODE/DECO) was designed on the basis of earlier works [4] [5]. The system internal working frequency was defined at 100 MHz, and part of the Ethernet manager works at 50 MHz. It is a bidirectional block that manages data transmission and reception. As a receiver, it recognizes, decodes and processes the incoming frame according to Ethernet rules. In data transmission, the reverse process is managed.
It selects between a transmission and a reception process. In the transmission process, the output frame is shaped by assembling the incoming sensor subsystem data with the destination/origin MAC and IP addresses and control bits. Before starting transmission, channel occupancy is detected; when the channel is free, transmission is enabled. In the reception process, when a valid data frame is detected, reception starts. The incoming frame is processed according to the protocol and a match with the network's destination IP address is verified; otherwise the frame is discarded. If the origin MAC address matches one of the network nodes' MAC addresses, an internal network message is identified; otherwise an external communication is detected. In both cases, the decoding process is accomplished and redundancies are checked through a feedback shift register as proposed in the XILINX application notes [6]. Origin and destination MAC and IP addresses are extracted and stored in the COMMUNICATION MEMORY to be used in the construction of the message answer, and data are submitted to the sensor subsystem with a special bit code identifying the communication as external or internal. The COMMUNICATION MEMORY was implemented as a memory with two read/write ports.
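The redundancy check mentioned above is the Ethernet CRC-32, which hardware typically computes with a linear feedback shift register [6]. A bit-serial software model of that computation is sketched below, using the standard reflected Ethernet polynomial 0xEDB88320; hardware shifts one bit per clock in the same way.

```python
import binascii

# Bit-serial model of the Ethernet CRC-32 as computed by a feedback
# shift register, one input bit per step (reflected polynomial
# 0xEDB88320, initial value and final XOR of 0xFFFFFFFF).

def crc32_bitwise(data: bytes) -> int:
    crc = 0xFFFFFFFF
    for byte in data:
        for i in range(8):                  # LSB of each byte first
            bit = (byte >> i) & 1
            feedback = (crc ^ bit) & 1
            crc >>= 1
            if feedback:
                crc ^= 0xEDB88320           # feedback taps
    return crc ^ 0xFFFFFFFF

frame = b"123456789"
assert crc32_bitwise(frame) == binascii.crc32(frame)  # 0xCBF43926
```

The application note [6] unrolls this one-bit-per-cycle register into wider parallel implementations for FPGA use.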

The sensor subsystem has three blocks: the SENSOR SUBSYSTEM CONTROL (SSC), a memory block to store sensor data, addresses and configuration parameters (SENSOR MEMORY), and the sensor itself. The SSC is responsible for managing all sensor subsystem activities.

3. NETWORK OPERATION

Network operations fall into five categories. Three of them are defined for external communication (shown in figure 1) and are identified as Network Set Up, Network Programming and Data Gathering. The fourth category corresponds to an internal communication process of the network and is defined as Network Configuration; the last, identified as Data Recollecting, is defined for storing the data collected by the sensor in the sensor memory.

Network Set Up (NSU) is the starting process. Assuming the network has a predefined number of nodes, each identified with a different MAC address, and each node has stored the addresses of all the others, an external NSU message is required to start network operation. When the NSU message is received, the node that receives it and answers first assumes the role of hub node, and the Network Configuration process (NCP) is started (figure 5). A dedicated protocol based on BGP was developed for the NCP. Devices that can communicate directly are defined as neighbors, and the first step is to detect the neighborhood. The hub node sends a START message to all the others; the nodes that answer the message are taken as neighbors and their MAC addresses are stored as neighbor addresses. After a prefixed time without receiving answer messages, the hub node assumes its neighbor table is complete, sends an OPEN message to each of its neighboring nodes, and waits for a KEEPALIVE message that includes only the BGP header. Each of the nodes carries out the same procedure to identify its neighbors.
Once the KEEPALIVE message is received, the hub node emits an UPDATE message notifying its neighbors' MAC addresses. The neighbor nodes receive the message and emit an UPDATE message announcing their own neighbor addresses and the route to reach the hub node. Every node that receives the message repeats the operation, announcing its neighbors' MAC addresses and the route to reach the hub node, and the information spreads through the network. When all nodes have been reached and the communication-path information has been stored in all of them, the network architecture is completely configured and the sensors start the DR process. KEEPALIVE messages are periodically exchanged to ensure that the relationship remains established. If some node goes out of service, a communication break is reported, and the routes including this node are reconfigured by generating UPDATE messages.

The Network Programming (NP) and Data Gathering (DG) processes start with the corresponding external messages. When an NP or DG external message is received, all the nodes are enabled to receive it; the one that answers first assumes the role of hub node to receive and retransmit information. NP is the process to program sensor parameters. The information spreads through the network, and all the sensors are reprogrammed once it is stored in the sensor memory of each node. DG is the process that allows the transfer of the data stored in the sensors outside the network. When the hub node sends a data-request message, data travel node to node to reach the hub node and are transmitted to the external network. Data Recollecting (DR) is an internal node process whose periodicity is programmed during the NP process.
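The NCP route spreading described above (START, OPEN, KEEPALIVE, then flooding UPDATE messages outward from the hub) can be sketched as a toy simulation. Message handling is simplified and the topology is invented; this is an illustration of the idea, not the paper's implementation.

```python
# Toy simulation of the NCP route spreading described above: UPDATE
# messages flood outward from the hub, and every node stores a
# route (list of hops) back to the hub. Simplified illustration,
# not the paper's implementation.

def spread_updates(adj, hub):
    routes = {hub: [hub]}                 # the hub's route to itself
    frontier = [hub]
    while frontier:
        nxt = []
        for node in frontier:
            for peer in adj[node]:        # UPDATE sent to each neighbor
                if peer not in routes:
                    routes[peer] = [peer] + routes[node]
                    nxt.append(peer)
        frontier = nxt
    return routes

adj = {"H": ["C"], "C": ["H", "E"], "E": ["C", "J"], "J": ["E"]}
routes = spread_updates(adj, hub="H")
assert routes["J"] == ["J", "E", "C", "H"]

# If node E fails, a new UPDATE round rebuilds routes without it,
# and J becomes unreachable in this invented topology:
adj_failed = {"H": ["C"], "C": ["H"], "J": []}
assert "J" not in spread_updates(adj_failed, hub="H")
```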
The structure of the nodes is the same for all of them. All nodes have the same capabilities, share the same IP address and have different MAC addresses. The minimum necessary subset of the IEEE standard rules was selected to implement the node communication module. The internal network intelligence is centered on dynamic topology reconfiguration according to the physical location of the nodes. The Border Gateway Protocol (BGP) was adapted to allow dynamic reconfiguration.

Figure 5. NSU message reception

Two prototype nodes were implemented on SPARTAN III XILINX field programmable logic devices, available on Digilent S3 SKB development boards [7]. The design was validated with successful communication tests made in the laboratory. For the tests, the connection between nodes was implemented as a wired 10BASE-T connection synchronized at 10 Mb/s. Current work is the analysis and selection of an RF transmitter to implement the wireless communication.

5. REFERENCES

[1] IEEE, IEEE Std 802.11, Revision of IEEE Std 802.11, June.
[2] J. Postel, Request for Comments 791: Internet Protocol, 1981.
[3] Y. Rekhter, T. Li, and S. Hares, Request for Comments 4271: A Border Gateway Protocol 4 (BGP-4).
[4] M. I. Schiavon, D. Crepaldo, R. L. Martín, and C. Varela, "Dedicated system configurable via Internet embedded communication manager module," V Southern Conference on Programmable Logic, São Carlos, Brazil, 2009.
[5] M. I. Schiavon, D. Crepaldo, and R. L. Martín, "Wireless Internet configurable network module," VI Southern Conference on Programmable Logic, Porto de Galinhas, Brazil, 2010.
[6] C. Borrelli, "IEEE 802.3 Cyclic Redundancy Check," XILINX, App. Note XAPP209, March.
[7] Digilent S3 SKB development boards, SPARTAN 3 FPGA, and ISE platform.

Multi-Level Synthesis on the Example of a Particle Filter

Jan Langer, Daniel Froß, Enrico Billich, Marko Rößler, Ulrich Heinkel
Chemnitz University of Technology
Chemnitz, Germany
{laja,daf,ebi,marr,heinkel}@hrz.tu-chemnitz.de

Abstract

In this paper we compare two high level synthesis approaches on the example of a particle filter design. First, C synthesis is used to transform C code into RT level VHDL. The second method employs the tool vhisyn to compile a set of operation properties written in ITL into RTL code. A particle filter component has been implemented using both methods, and the resulting designs were synthesized and run on an FPGA board. The corresponding synthesis results have been compared to a hand-coded design. This work focuses on the comparison of two high level design methods starting from different levels of abstraction and hand-coded VHDL. As a result, the resource utilization and timing of the high level designs are not prohibitively high. In particular, it is interesting to classify operation properties as an efficient prototyping and design method in certain application areas. In general, high-level design methods are applied when a more abstract, concise and maintainable system description is required and only a short design time is allowed. Operation properties represent a compromise between abstract C based methods and classical RT design.

I. INTRODUCTION

High level synthesis (HLS) raises the level of designing a system from the traditional register transfer (RT) level up to higher levels of abstraction. This step helps to improve both design productivity and achieved verification quality. In this paper, two very different approaches to HLS and a hand-coded design at RT level are evaluated by means of a case study in performance and efficiency. A particle filter algorithm is used as an application example. The particle filter is an estimation technique for Bayesian models that is primarily well suited for localization purposes.
Furthermore, the particle filter is a good example to illustrate certain aspects of the different design approaches of this work. The first HLS approach is the generation of RT hardware based on a system description written in an augmented C language that is translated into synthesizable VHDL. The resulting hardware implementation exploits coarse-grained parallelism at the process level and low level parallelism at the instruction level.

This research work was supported in part by the German Federal Ministry of Education and Research (BMBF) in the project HERKULES under the contract number 01 M 3082 and the project InnoProfile under contract number 03 IP 505.

A fundamentally different approach is to utilize the InTerval Language (ITL), which has originally been used as a formal verification technique. A system description is created as a set of Operation Properties that split the system's behavior into operations of fixed length, which are connected by a property graph. Using ITL has been proposed as an intermediate HLS methodology that compensates specific drawbacks of the previous approach.

This paper is structured as follows. First, an overview of previous work in the field of HLS is given. The second section describes the specification of the particle filter design. In sections IV and V, we provide some details about the high level design methodologies we have used. The paper concludes with a presentation of the design results and the respective performance of the two implementations compared to a hand-coded VHDL design.

II. PREVIOUS WORK

High-level synthesis raises the design level with the objective to improve verification and system design productivity. Related work dates back 30 years, starting from the algorithmic level [1] and moving up to the system level.
ANSI C/C++ and derivatives of them like SystemC, Single Assignment-C (SA-C) [2] and Handel-C [3] provide functionality similar to languages like Verilog and VHDL and aim at a unified hardware-software representation. Commercial and academic C-to-VHDL compilers like CatapultC, C-to-Silicon [4], Cyber [5] and others generate intermediate RT level code, which can be processed by logic synthesis tools afterwards [6]. C2H [7], Streams-C [8] and CoDeveloper [9] combine HLS and hardware-software codesign. Tools for compiling other languages like Java [10] or Matlab to hardware have appeared recently.

In general, it is a well understood process to generate executable and even synthesizable models from single temporal properties or sets of properties. Those models can either be used as monitors in system simulation and emulation, or they form abstractions for early prototypes in system verification. Synthesizing temporal properties has mostly focused on Linear Time Logic (LTL) as implemented in PSL or SVA [11]-[14]. However, all those methods can only handle a subset of the operators of the property language, or they can only process problems of very small complexity. Another problem

is ambiguity. In most cases, a property or a set of properties is satisfied by more than one exact behaviour. Thus, the synthesis method can either create a general solution that contains all consistent behaviours, or an arbitrarily chosen specific solution. In contrast to PSL or SVA, the synthesis of models from complete sets of ITL properties can profit from additional constraints that are not present in pure LTL properties. For one, the property graph connecting the operations imposes structural information that is used during synthesis. Furthermore, the special syntax of ITL (in many aspects more restricted than general LTL) and the assertions obtained during the check for completeness simplify the synthesis process and allow a much higher complexity to be handled. Thus, in [15] a tool called vhisyn has been proposed to translate ITL descriptions into VHDL. This work uses the tool to generate the operation-property-based design to be compared to the other two design approaches. Similar to this paper, [16] also uses ITL properties to generate executable models, called Cando objects. That algorithm does not employ the property graph structure; on the one hand it is more general than our approach, but on the other hand it is less able to handle complex property descriptions. Case studies of HLS tools are available (e.g. in [6], [17]-[19]), but limit the comparison exclusively to either programming-language-based HLS approaches or to RT-level designs. To the best of our knowledge, there is presently no comprehensive case study available that comparatively qualifies the results of synthesizing a complete design of a complex algorithm at these levels of abstraction.

III. PARTICLE FILTER

This section presents a particle filter for localization estimation as a possible specification for a hardware implementation. The filter estimates an object's three-dimensional position by incorporating distance measurements to reference points of known position.
The localization problem is similar to that of the global positioning system (GPS). The particle filter has been chosen as an example for this comparative work because it can be described as a short, well-understood piece of C code that will be used as a starting point for C-based synthesis. Furthermore, the particle filter's behavior can be split into meaningful operation properties, making it a feasible target for property-based synthesis. However, despite these characteristics, a specific hardware implementation of this design on register-transfer level requires a lot of work. Considering these facts, the particle filter appears as an ideal candidate for a study that compares the design approach using operation properties with both a higher-level method based on C and a lower-level manual implementation. A particle filter is a nonparametric implementation of the Bayes filter algorithm, where the posterior distribution is approximated by a set of random state samples (particles). The likelihood of the true system's state is proportional to the density with which a region of the state space is populated by particles. See [20] for a comprehensive introduction to particle filters. For reasons of approximation accuracy, the number of particles has to be large, depending on the problem to be estimated. As a consequence, a software implementation on an embedded microprocessor platform is infeasible due to low update rates. This has made a hardware implementation necessary. In our case, the state to be estimated is the unknown position (x, y, z) of the object. Thus, every particle represents one possible position hypothesis

  p^[m] = (x^[m], y^[m], z^[m]),   (1)

where m is the running index in the particle set. A filter update at time t consists of the following steps:

1) Prediction. A hypothetical position p_t^[m] for each particle is predicted at the current time step t based on its former position p_{t-1}^[m].
Therefore, every new particle has to be sampled from a proposal distribution that is based on a given state transition or motion model. In our case, the mobile node is assumed to move without any favored direction. Hence, this distribution is modeled symmetrically around p_{t-1}^[m] as a three-dimensional normal distribution with identical variances for x, y and z. Because positional uncertainty increases with time, the variance values are scaled with the time difference Δt between the current time and the time of the last filter update.

2) Weight Calculation. The next step consists of calculating a weight w_t^[m] for each particle p_t^[m] by incorporating a distance measurement d_t between the object and an anchor position p_a. This weight is the probability of the distance measurement under the particle p_t^[m]. In our case the weight is given by

  w_t^[m] = k / (k + Δd),  k > 0   (2)
  Δd = | ||p_t^[m] - p_a|| - d_t |   (3)

where Δd is the difference between the expected distance (Euclidean distance between particle and anchor position) and the measured distance d_t. The scaling constant k characterizes the quality of the distance information. If predicted and measured distance match exactly, the weight reaches its maximum of one. With increasing difference, the weight decreases asymptotically to zero according to the value of k.

3) Resampling. The final particle set is generated through a resampling procedure of the hypothetical set from step 1). The probability of drawing each particle from the set is given by its weight. The resulting particle set possesses duplicates of particles with large weights, while particles of lower weight have been replaced. Thus, the resulting particle set focuses on regions with high posterior probability. In our implementation, a so-called low variance sampler from [20] is deployed.

Fig. 1. Low variance resampling procedure.

Fig. 2. Block diagram of the particle filter design (Resampling, Prediction, Weight Calculation and Statistics blocks connected through Weight and Position FIFOs; the mean/covariance results are read by a PowerPC).

In a first step, a single random number r in the interval [0; w̄_t) is chosen, where w̄_t is the arithmetic mean of all particle weights. In the following steps the algorithm selects particles by repeatedly adding w̄_t to r and by choosing the particle that corresponds to the resulting value. Figure 1 illustrates this resampling method.

4) Density Extraction. Finally, based on the discrete particle set maintained by the filter, a continuous density is estimated. We compute the mean and the covariances over all particles, assuming them to be normally distributed. The probability density at any position can then be calculated by a normal distribution using the obtained mean vector and covariance matrix.

To compare both high-level design approaches to a hand-coded VHDL design, the particle filter has been implemented using all three methods. All designs are structured similarly, as shown in Fig. 2. The three blocks prediction, weight calculation and resampling correspond to the update rules 1) to 3) above. The resampling block will not start operating until the cumulative sum of all particle weights is available. Therefore, the weights and positions of one complete set (M = 8192) of particles need to be stored in a FIFO that is located at the input of the resampling block. As soon as the resampled particles drop out of the resampler, they are processed by the prediction and weight calculation blocks and again pushed into the FIFOs. The statistics block corresponds to update step 4), calculating mean and covariance parameters over all particles.

IV.
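Update steps 2) and 3) can be sketched in plain C as follows. This is an illustrative sketch only; the function and variable names are assumptions and are not taken from any of the three implementations compared in this paper.

```c
#include <math.h>

/* Illustrative sketch of the weight formula (2)-(3) and the low
   variance sampler; names and types are assumptions, not the paper's
   implementation. */

typedef struct { double x, y, z; } vec3;

/* Eq. (2)-(3): weight from the mismatch between the measured distance
   d_meas and the particle's expected distance to the anchor. */
double particle_weight(double k, double d_meas, double d_expected) {
    return k / (k + fabs(d_meas - d_expected));   /* 1.0 on exact match */
}

/* Low variance resampling: a single random offset r in [0, wbar) is
   drawn once; the cumulative weight axis is then sampled at steps of
   the mean weight wbar, copying the particle that covers each point. */
void resample_low_variance(const double *w, int M, double r,
                           const vec3 *in, vec3 *out) {
    double wsum = 0.0;
    for (int m = 0; m < M; m++) wsum += w[m];
    double wbar = wsum / M;          /* arithmetic mean of all weights */
    double c = w[0];                 /* running cumulative weight */
    int i = 0;
    double U = r;                    /* current sampling point */
    for (int j = 0; j < M; j++) {
        while (U >= c && i + 1 < M) c += w[++i];
        out[j] = in[i];              /* heavy particles are duplicated */
        U += wbar;
    }
}
```

With uniform weights this sampler returns each particle exactly once; particles with above-average weight are duplicated at the expense of low-weight ones, matching the behavior described in step 3).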
C-BASED SYNTHESIS

To synthesize hardware from derivatives of the sequential software programming language C, several problems have to be considered. The programming model of pure C does not define certain aspects of the concurrency model, data types, timing specifications, memories, communication patterns and other constraints [21]. However, these aspects are crucial to synthesize the corresponding hardware structures. Handling of these issues differs between the available C synthesis tools, and there appears to be no clear winning solution. Nevertheless, all tools share a more or less semi-automated way to handle the various levels of parallelism to generate hardware with reasonable performance. For the work in this paper, the tool CoDeveloper by Impulse Accelerated Technologies has been used. It is the commercial successor of the Streams-C compiler. In general, the principles described in this paper also apply to other synthesis tools based on C that do not depend on explicit annotation of concurrency on a fine-grained level. On the lowest level, blocks of C code, bounded by control statements (e.g. case, if, for, ...), are automatically processed to exploit parallelism. Data dependencies between instructions are analyzed to extract implicit concurrency. Simple operations (e.g. addition of fixed point values) are directly mapped to the corresponding HDL statement, whereas more complex instructions are mapped to specific components from a library. The following allocation step decides how many operators will be instantiated and how memory accesses and data operations are scheduled into fixed time slices according to their estimated execution time. The automatic transformation of loops and control structures generally results in state machines. Loops are either unrolled, with each step executed concurrently to minimize computational delay, or the steps are pipelined for area efficiency.
Unrolling and pipelining span a rather large design space, bounded by the required speed (frequency) and size (area) of the chip. A constraint-driven synthesis process explores solutions to meet the restrictions defined by the designer. The original resampling algorithm of the particle filter is shown in the left part of Fig. 3. The resulting scheduling is annotated in the right part. The initialization phase takes two cycles due to a memory read. Loop conditions consume one cycle and the loop bodies two cycles each, due to data dependencies and memory accesses.

C code                        Cycle   Block
U = rand() % step;            0       Block1
i = 0;                        0
j = 0;                        0
c = M*weight[i];              0-1
for (j=0; j<M; j++)           2       Loop1
  while (U >= c)              3       Loop2
    i++;                      4
    c += M*weight[i];         4-5
  state2[j] = state1[i];      6-7     Block2
  U += step;                  6

Fig. 3. C code of the low variance resampling algorithm.

Pure C language is not especially well-suited to specify hardware. Therefore, a designer is forced to guide the synthesis

C code                        Cycle   Block
U = rand() % step;            0       Block1
i = 0;                        0
j = 0;                        0
c = M*weight[i];              0-1
while (j < M)                 1       Loop1
  #pragma CO PIPELINE
  if (U >= c)                 1
    c += M*weight[i++];       2-3     Block2
  else {
    U += step;                4
    state2[++j] = state1[i];  4-5     Block3
  }

Fig. 4. Optimized C code of the resampling algorithm.

process in order to achieve the best possible performance. Guidelines by tool vendors and the research community include combining loops, combining or splitting memories, and marking loops for pipelining or unrolling. In general, it is necessary to review the synthesis results in order to optimize critical code sections. The resulting C code might be less efficient when run as software, but more suitable for hardware synthesis. Fig. 4 shows an optimized version of the resampling algorithm of the particle filter. Rewriting the algorithm and requesting loop pipelining reduced the latency in each path to two cycles.

All C-synthesis tools require a manual definition of parallelism on the coarse-grained level. This is often achieved by processes or threads. In particular, the fundamental unit of concurrently executed computation in CoDeveloper is called a process. Streams, signals, registers and shared memories are provided to synchronize processes and to extract the global data path. The implementation of the particle filter in Fig. 2 uses processes for weight calculation, prediction and resampling. Global arrays are used to buffer the particles between the processes, whereas all remaining communication utilizes streams.

V. OPERATION PROPERTIES

The commercial tool 360MV by OneSpin Solutions [22] introduces a gap-free verification methodology based on operation properties. It provides a special property syntax known as the InTerval Language (ITL). A set of additional rules helps to write a complete set of properties that explicitly covers the design intent for every valid sequence of input values.
The tool employs a powerful engine to prove the completeness of the property set as well as the correctness of each individual property with respect to the design. A property set is complete if the conjunction of the properties alone is able to map every valid sequence of input data to exactly one corresponding sequence of output data [23]. The completeness of a property set can be proven without the need of an actual design. To illustrate the property-based design, we want to show one property of the resampling component. The resampler's behavior is first split into distinct operations based on the specification. It turns out that exactly five operations (reset, start, read, write and idle) are needed, as shown in Fig. 5.

Fig. 5. Property graph of the resampler.

property read is
assume:
  at t             : U >= c;
  at t             : i < M;
prove:
  at t+2           : i = prev(i,2) + 1;
  at t+2           : c = prev(c + weight,2);
  at t+2           : U = prev(U,2);
  during[t+1,t+2]  : wr_en = 0;
  during[t+1,t+2]  : state2 = 0;
  at t+1           : rd_en = 1;
  at t+2           : rd_en = 0;
end property;

Fig. 6. ITL code and timing diagram of the read property (signals i, weight, c and U between t and t+2).

The reset property sets the component's state variables to defined values after a system reset has occurred. Furthermore, it defines the values of all output signals in this phase. The idle property is activated in the time between subsequent update cycles of the filter and sets the output values to zero.
In case a new update cycle is started, the start property applies and prepares the internal variables for the following resampling process. The two properties read and write alternate according to the received particle weights. As soon as all particles have been read and as many particles have been written, the idle property is activated again. The resampling component's read operation picks the state and weight of the next particle from the FIFO and does not write a new particle to the output. This operation is shown in Fig. 6. The corresponding timing diagram visualizes the behavior. The expressions in the assume part of the operation form the antecedent of the property and indicate the activation conditions. In this case, the read property is executed as long as variable U is greater than or equal to variable c and the particle read count i is smaller than the total number of particles M. The prove part forms the consequent of the operation and sets output and internal signals to their new values. In contrast to high-level synthesis approaches based on algorithmic descriptions like the C language, the properties contain no loops. The user has to encode loop-like behavior implicitly in the sequence of the operations allowed by the property graph. Furthermore, the properties are designed such that they can partly overlap and therefore exploit a pipelining behavior in the resulting design. The length of the read operation is two cycles. So, during read's third cycle at t + 2, the following property can be activated and the two properties overlap for one cycle. In general, the use of operations is more beneficial for properties

TABLE I. DESIGN DESCRIPTION AND SYNTHESIS RESULTS

                           Hand coded    vhisyn       CoDeveloper
lines of code              2138 (VHDL)   1243 (vhi)   447 (C)
estimated design effort    1-2 weeks     3 days       2 days
slices                     3855 (28%)    6011 (43%)   4603 (33%)
slice FF                   5924 (21%)    5120 (18%)   5286 (19%)
4 input LUT                3552 (12%)    8930 (32%)   6387 (23%)
BRAM                       70 (51%)      69 (50%)     82 (60%)
MULT18x18                  18 (13%)      23 (16%)     29 (21%)
max. freq. (in MHz)
avg. cycles per particle

Fig. 7. A random walk of an object in the playground and the corresponding estimated position of the particle filter.

of lengths greater than one, since there is no need for the designer to define an FSM for the substates of the operation. The process of designing the properties is supported by the 360MV verification tool. Its completeness check guarantees that for each specific input sequence there is exactly one unique sequence of properties defining a unique and deterministic output sequence. When the completeness check is employed during the design process, fewer errors remain undetected and the design quality improves. In [15], it is argued that operation properties are an alternative, and in some application areas more convenient, design approach. That means operation properties are not suitable for all kinds of designs. However, in cases that are dominated by sequences of behavior that are activated under certain conditions, operation properties are a very natural design method. For an experienced designer of the future, it is most useful to know a couple of different design paradigms from which to choose the most suitable for the design task at hand. We propose operation properties as one of those paradigms. The tool vhisyn derives a cycle-accurate register transfer model from a given specification based on a set of operation properties. It processes a set of ITL properties and outputs a VHDL model.
The model is synthesizable and exactly implements the properties' behavior as long as they satisfy the completeness check of the 360MV tool.

VI. RESULTS

Fig. 7 is a plot of a random movement of an object whose position is estimated by the particle filter. Although the design estimates 3D positions, the plot displays only two dimensions. It can be seen that the estimated position follows the real position very closely, except near the border of the playground. The ellipse that is plotted at equidistant intervals of the estimation path represents the covariance matrix of the particle set. It indicates the certainty of the estimation. A sparse distribution of particles results in a large ellipse, whereas a dense particle cloud results in a small ellipse, i.e. the filter more strongly believes in its estimation. One interesting aspect of comparing the three approaches is the effort to describe the design. As shown in Table I, it took about 3 days to write the properties of the property-based ITL design. Most of this time has been spent on implementing arithmetic operations, such as a square root algorithm, which could be reused as library functions in later designs. Furthermore, the ITL code contains not only the functional design description, but also the code necessary to conduct a complete formal verification. In contrast to that, the hand-coded VHDL design took between one and two weeks of pure coding on RT level. Since most of the implementation-level decisions had already been explored and specified during the property-based design, the VHDL design could follow a pretty straightforward path. This also illustrates the use of operation-property-based design as a prototyping method that allows design decisions to be explored quickly, while still resulting in an actual design that is ready for simulation or logic synthesis. The ImpulseC design methodology using CoDeveloper can be divided into two phases.
First, the ImpulseC code has been implemented based on the pseudo-code reference implementation given in [20]. This step has been completed within just a few hours. However, the synthesis result was some orders of magnitude slower than the results obtained with the other methods. Consequently, it has been necessary to optimize critical code sections and to partition the design functionality into smaller blocks to improve the coarse-grain parallelism. Table I also shows the synthesis results of all three designs. It can be seen that the design sizes are of the same order of magnitude. As expected, the hand-coded design is very small, whereas the vhisyn approach uses nearly twice the resources. Furthermore, the timing of the designs differs considerably. There are several reasons for this: 1) First, in contrast to the hand-coded design, the two generated designs do not use the highly optimized and efficiently pipelined components generated by the

Xilinx Coregen Tool. This applies for example to the various arithmetic operators with large bit widths such as division, square root and multipliers. 2) The design generated by CoDeveloper is of moderate size and reasonably fast, but it needs about 66 cycles to process one particle. In particular, CoDeveloper fails to implement a pipelined division and employs a sequential component that needs 64 cycles for one operation.

The runtime of the synthesis tools themselves has been negligible. The vhisyn tool runs for about 16 seconds to generate the particle filter design. The time scales linearly with the amount of hardware generated. It has been intended as a prototyping platform and offers a lot of room for speed improvements. In general, when developing vhisyn, it has been a major point not to include algorithms that do not scale well with big, industrial-strength design blocks. By far the largest runtime is consumed by the tools that process the generated VHDL code and generate a bitstream file for the FPGA.

VII. CONCLUSION

In this paper, we used two high-level design approaches to implement a particle filter design. We compared the two generated designs to a hand-coded VHDL design of the same functionality. As expected, the hand-coded design leads in terms of resource utilization and maximum frequency. However, when considering the improved ease of use and much lower code maintenance costs of both the property and the C code approach, the higher resource requirements and lower maximum frequency seem to be acceptable. Furthermore, it can be seen that the C-based methodology is more abstract than the property-based method, which results in a very low implementation effort but reduced control over the cycle-accurate behaviour. One of the most important aspects of the property-based design effort has been the constant use of formal verification, which provides the designer with information about the design quality.
Such measures are the determination of all output signals at each time step, the absence of deadlocks in the control flow automaton and the unambiguous design behavior for every possible sequence of valid input data. The paper classifies operation properties as an intermediate level of description for hardware blocks that offers a valuable design approach for certain applications. In a future development environment, or even a single hardware description language, algorithmic descriptions, operations and traditional RT-level design will coexist, and the developer will choose the most appropriate design method for each individual block. In certain cases, even a mixture of different methods might be applied.

REFERENCES

[1] M. C. McFarland, A. C. Parker, and R. Camposano, "Tutorial on high-level synthesis," in Design Automation Conference (DAC). Los Alamitos, CA, USA: IEEE Computer Society Press, 1988.
[2] W. A. Najjar, W. Böhm, B. A. Draper, J. Hammes, R. Rinker, J. R. Beveridge, M. Chawathe, and C. Ross, "High-level language abstraction for reconfigurable computing," Computer, vol. 36, no. 8, Aug.
[3] Celoxica Limited, Handel-C Language Reference Manual. [Online].
[4] Cadence Design Systems Inc., "Cadence C-to-Silicon Compiler Delivers On The Promise Of High-level Synthesis."
[5] K. Wakabayashi, "C-based synthesis experiences with a behavior synthesizer, Cyber," in Design, Automation, and Test in Europe (DATE). Munich: IEEE Computer Society, 1999.
[6] O. Hammami, Z. Wang, V. Fresse, and D. Houzet, "A Case Study: Quantitative Evaluation of C-Based High-Level Synthesis Systems," EURASIP Journal on Embedded Systems, vol. 2008.
[7] Altera Corporation, Nios II C2H Compiler User Guide. [Online].
[8] M. B. Gokhale, J. M. Stone, J. Arnold, and M. Kalinowski, "Stream-Oriented FPGA Computing in the Streams-C High Level Language," in IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM). Washington, DC, USA: IEEE Computer Society Press, 2000, p. 49.
[9] M. Rößler, H. Wang, N. Engin, W. Drescher, and U. Heinkel, "Rapid Prototyping of a DVB-SH Turbo Decoder Using High-Level-Synthesis," in Forum on Specification & Design Languages (FDL), Sophia Antipolis, France, Sep.
[10] S. S. Huang, A. Hormati, D. F. Bacon, and R. Rabbah, "Liquid Metal: Object-Oriented Programming Across the Hardware/Software Boundary," in Object-Oriented Programming (ECOOP). Springer, 2008.
[11] Y. Abarbanel, I. Beer, L. Gluhovsky, S. Keidar, and Y. Wolfsthal, "FoCs - Automatic Generation of Simulation Checkers from Formal Specifications," in Computer Aided Verification. Berlin/Heidelberg: Springer, 2000.
[12] M. Boule and Z. Zilic, "Efficient Automata-Based Assertion-Checker Synthesis of SEREs for Hardware Emulation," in Asia South Pacific Design Automation Conference (ASP-DAC). IEEE, 2007.
[13] R. Bloem, S. Galler, B. Jobstmann, N. Piterman, A. Pnueli, and M. Weiglhofer, "Specify, Compile, Run: Hardware from PSL," Electronic Notes in Theoretical Computer Science, vol. 190, no. 4, pp. 3-16.
[14] K. Morin-Allory and D. Borrione, "Proven correct monitors from PSL specifications," in Design, Automation, and Test in Europe (DATE), 2006.
[15] J. Langer and U. Heinkel, "High Level Synthesis Using Operational Properties," in Forum on Specification & Design Languages (FDL), Sep. 2009.
[16] M. Schickel, "Applications of Property-Based Synthesis in Formal Verification," Ph.D. thesis, Technische Universität Darmstadt.
[17] E. El-Araby, M. Taher, M. Abouellail, T. El-Ghazawi, and G. B. Newby, "Comparative Analysis of High Level Programming for Reconfigurable Computers: Methodology and Empirical Study," in Southern Conference on Programmable Logic (SPL). Mar del Plata: IEEE, Feb. 2007.
[18] S. Ahuja, S. T. Gurumani, C. Spackman, and S. K. Shukla, "Hardware Coprocessor Synthesis from an ANSI C Specification," IEEE Design & Test of Computers, vol. 26, no. 4, Jul.
[19] L. Piga and S. Rigo, "Comparing RTL and high-level synthesis methodologies in the design of a theora video decoder IP core," in Southern Conference on Programmable Logic (SPL). Sao Carlos: IEEE, Apr. 2009.
[20] S. Thrun, W. Burgard, and D. Fox, "The Particle Filter," in Probabilistic Robotics. MIT Press, 2005, ch. 4.3.
[21] S. A. Edwards, "The Challenges of Synthesizing Hardware from C-Like Languages," IEEE Design & Test of Computers, vol. 23, no. 5.
[22] (2010) OneSpin Solutions. [Online]. Available: onespin-solutions.com
[23] J. Bormann, "Vollständige funktionale Verifikation" (Complete Functional Verification), Ph.D. thesis, Universität Kaiserslautern.

LAYERED TESTBENCH FOR ASSERTION BASED VERIFICATION

José Mosquera, Sol Pedre and Patricia Borensztejn
Departamento de Computación
Facultad de Ciencias Exactas y Naturales
Universidad de Buenos Aires
jose.mosquera@oracle.com, {spedre, patricia}@dc.uba.ar

ABSTRACT

In this paper we present the use of an assertion-based verification technique in combination with our own layered testbench environment to dynamically verify a design implemented on an FPGA. We use PSL (Property Specification Language) to build a set of assertions to ensure the consistency between requirements and implementation. The layered simulation environment provides a higher level of abstraction, making the verification process easier and more robust than a monolithic testbench approach.

1. INTRODUCTION

In this paper we present a functional verification through simulation following the layered testbench methodology [1]. Monolithic testbenches are written to verify a specific functionality of a Device Under Verification (DUV), where the testbench designer has to take care of all aspects of the simulation at the same time, including low-level details such as signal values and transitions. If another functionality needs to be verified, a new testbench has to be written. In contrast, with layered testbenches it is possible to split the test environment into smaller pieces or components, allowing the designer to concentrate on a specific part of the environment. Each component has its own aim, such as stimulus generation, stimulus injection, response capture and automated response checking. In an automated verification environment there is a need for a tool to measure the correctness of the behavior of the DUV. The Assertion Based Verification (ABV) [2] technique permits the construction of such a metric. ABV decomposes the intent of the design into properties that must not be violated during the simulation.
We have chosen PSL [3] in its Verilog flavor to write a set of assertions and coverage points that should be satisfied during the simulation. Despite the availability of verification frameworks like OVM/UVM [4], we have decided to make our own implementation for two main reasons: budget constraints and our own goal of capturing the benefits of a layered testbench environment with ABV. This article is organized in the following way: section 2 describes the DUV, section 3 lists the chosen testcases, section 4 explains the design of the layered testbench, and section 5 introduces the assertion-based verification and the coverage points used as a metric of verification progress. The last section presents the conclusions and future work.

2. DEVICE UNDER VERIFICATION (DUV)

We have chosen as a DUV the Verilog module AC97 controller described in a previous work [5], which implements a subset of the AC97 protocol [6] to gather microphone samples and load them into an asynchronous FIFO for a later transmission over an Ethernet interface. The audio samples come serially into the DUV through the signal AC97_SDATA_IN in the form of 256-bit frames. The start of each frame is signaled by AC97_SYNC. The AC97 controller takes the 20 bits of time slot 3, converts them into bytes and sends them out through the DATA_OUT signal. In case of a FIFO full event, the AC97 controller flushes the FIFO, asserting the RESTART signal during 6 AC97 bit clocks.

3. FUNCTIONAL VERIFICATION OF DUV

The subset of the AC97 protocol implemented by the DUV includes the reception of AC97 frames considering time slot 0 (control information) and time slot 3 (microphone samples). All other information of the frame is discarded. On the output side, the DUV fills a FIFO with microphone samples, so the DUV must monitor the state of that FIFO before writing to it. The functional verification test cases selected are the following: Valid Frame, Valid Time Slot, FIFO full and the combinations of them.
These cases are the golden reference with which we have defined what is success or failure. During simulation, assertion execution is monitored inside the DUV and on its external interfaces. Whenever an assertion is violated, it is reported. Internal observability for bug detection, isolation and notification is thereby improved. Each assertion specifies an illegal behavior of a circuit structure inside the design.
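As a software sketch of the sample extraction the DUV performs (section 2), the 20-bit slot-3 value can be pulled out of a 256-bit frame as follows. The helper names and the slot layout (a 16-bit tag slot followed by twelve 20-bit data slots, bits arriving MSB first) are assumptions based on the AC97 specification, not code from [5].

```c
#include <stdint.h>

/* Illustrative sketch, not the DUV's RTL: extract the 20-bit time slot 3
   sample from a 256-bit AC97 frame, assuming the standard frame layout
   (16-bit tag slot, then twelve 20-bit data slots), MSB first. */

static int frame_bit(const uint8_t frame[32], int pos) {
    return (frame[pos / 8] >> (7 - pos % 8)) & 1;  /* MSB-first bit order */
}

uint32_t slot3_sample(const uint8_t frame[32]) {
    int start = 16 + 2 * 20;       /* skip tag (slot 0) and slots 1-2 */
    uint32_t sample = 0;
    for (int b = 0; b < 20; b++)
        sample = (sample << 1) | (uint32_t)frame_bit(frame, start + b);
    return sample;                 /* 20-bit microphone sample */
}
```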

tested functionality during the simulation. Section 5 describes assertions and coverage points in more detail.

Fig. 1. Design Under Verification: AC97 controller.

We have also introduced properties to verify that the design does everything it is supposed to do. The collection of these additional verification properties represents the functional coverage model of the DUV. The properties covered during the simulation provide a metric of verification progress.

4. LAYERED TESTBENCH

Based on the layered testbench methodology, and having previously defined the functional verification cases (Section 3) of the DUV (Section 2), we built a test environment divided into smaller components. The layered testbench was designed following a bottom-up approach, i.e., from the DUV up to the test cases, splitting the simulation environment as represented in Fig. 2. Hence we defined four layers: signal, command, functional and scenario. The signal layer contains the DUV described in Section 2. The scenario layer contains the test cases described in Section 3. The functional and command layers provide the interface between the highest and lowest layers. The functional layer translates each test case provided by the scenario layer into a transaction to be checked; a transaction at this layer is an AC97 frame. The command layer takes the transaction and drives the proper signals to the DUV (signal layer).

4.1. Signal Layer

The signal layer contains the DUV, i.e., the AC97 controller module and the signals that connect it to the simulated AC97 audio codec and asynchronous FIFO modules. The RTL code has the PSL assertions and coverage points embedded. Both are depicted as small circles in the DUV module of Fig. 2, reporting their success or failure to the environment.

4.2. Command Layer

The command layer has two components, Driver and Monitor.

4.2.1. Driver

The driver translates the commands received from the Agent into the proper stimulus, and notifies back the execution of each command based on the AC97_SYNC signal. The driver was divided internally into two sub-drivers, the ac97_driver and the fifo_driver. The latter sets/resets the FIFO full signal, and the former serially injects the 256-bit audio frame into the DUV.

4.2.2. Monitor

The monitor takes the 8-bit DATA_OUT signal each time the LOAD signal is asserted, and reports back to the Checker the 20 bits of the obtained sample data.

4.3. Functional Layer

4.3.1. Agent

The Agent translates the transaction received from the scenario layer into the right commands for the driver. The possible transactions are combinations of Valid Frame, Valid Time Slot, FIFO full and a set of generated microphone samples. The end of the transaction is parameterized at the Agent in order to send a desired number of microphone samples through the DUV.

4.3.2. Scoreboard

Based on the same command sent to the Driver, the Scoreboard generates the golden reference of the DUV's expected response. In our case, the scoreboard notifies the checker when the monitor response should be checked against the output generated by the sub-agent mic sample_gen.

4.3.3. Checker

Based on the current testcase scenario, the Checker compares the monitored response of the DUV with the expected result generated by the Scoreboard; e.g., in the scenario of Valid Frame and Valid Time Slot, the 20-bit microphone sample generated by the Agent is compared against the 20 bits sampled by the Monitor.

4.4. Scenario Layer

4.4.1. Generator

We implemented the generator as a state machine which sends, in sequence, the proper transactions to the agent to simulate each testcase scenario defined in Section 3.
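The scoreboard/checker handshake described above can be sketched in a few lines. This is an illustrative model, not the paper's Verilog code; all function and variable names are invented:

```python
# Illustrative sketch (not the paper's code) of the golden-reference check:
# the Scoreboard predicts the DUV output for a command, and the Checker
# compares the prediction against what the Monitor actually sampled.
def scoreboard(sent_sample, valid_frame, valid_slot):
    """Golden reference: the sample should reach the output only when
    both the frame and the time slot are valid."""
    return sent_sample if (valid_frame and valid_slot) else None

def checker(expected, monitored):
    """Return True when the monitored response matches the prediction."""
    if expected is None:            # nothing should have come out
        return monitored is None
    return expected == monitored

# a 20-bit microphone sample travelling through a valid frame and slot
result_ok = checker(scoreboard(0xABCDE, True, True), 0xABCDE)
# an invalid time slot: the sample must not appear at the output
result_bad = checker(scoreboard(0xABCDE, True, False), 0xABCDE)
```

The same prediction/comparison split is what lets each testcase scenario reuse the checking logic unchanged.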

Fig. 2. Internal modules of the layered testbench.

5. ASSERTIONS AND COVERAGE

The PSL specification defines four layers: the Boolean layer, which contains HDL boolean expressions; the temporal layer, the core of PSL, which provides temporal relationships between boolean expressions; the verification layer, which directs the use of properties for coverage or assertions; and the modeling layer, which has statements to model the environment. We wrote PSL embedded in Verilog comments as the method to introduce assertions into the simulation environment. We introduced assertions at the AC97 interface based on the AC97 protocol specification; hence we added properties to verify the duration of a frame (Example 1) and the duration of the AC97_SYNC signal in the case of Valid Frame and Valid Time Slot.

Example 1:
// psl property frame_len = always {rose(ac97_sync)} |-> {1'b1[*256]; rose(ac97_sync)};
// psl assert frame_len;

At the FIFO interface we added properties to verify that no new data is loaded while the FIFO is full (Example 2), and to ensure the right duration of the restart signal.

Example 2:
// psl property fifo_full_load = always {full==1'b1} |=> {!rose(load)};
// psl assert fifo_full_load;

We introduced coverage points to verify that the design does everything it is supposed to do. Based on the DUV's specification and the list of directed testcases, we created a set of properties that report the tested functionality. The AC97 controller module is based on the FSMD methodology, i.e., it consists of a datapath controlled by an FSM. We therefore added coverage for each of the possible states (Example 3 shows some covered states) to ensure that the simulation covers all of them. We also added assertions to verify that the control signals are valid at the right moments.
Example 3:
// psl sequence fsm_idle = {state_reg==idle_state};
// psl cover fsm_idle;
// psl sequence fsm_sync = {state_reg==sync_state};
// psl cover fsm_sync;
// psl sequence fsm_valid_frame = {state_reg==validframe_state};
// psl cover fsm_valid_frame;

6. CONCLUSION

This paper goes in the direction of adopting innovative tools and methodologies applied to testing and verification. We found that the time spent implementing the layered testbench environment is of the same order as that of a single testcase in the monolithic approach. Hence, the automation of directed testcases is reflected as a productivity increase in the verification process. Assertions and coverage properties offer a higher level of abstraction because they are closer to the specification than traditional testbenches. They bring not only a productivity increase, but also improved robustness of verification. Having implemented our own testbench framework entirely in Verilog, as a next step we are going further in the adoption of innovative tools and methodologies, such as SystemVerilog and constrained-random testcases, with the intention of adopting an OVM/UVM framework in the future.

7. REFERENCES

[1] C. Spear, SystemVerilog for Verification: A Guide to Learning the Testbench Language Features, 2nd ed., Springer, 2008.
[2] H. D. Foster, A. C. Krolnik, and D. J. Lacey, Assertion-Based Design, 2nd ed., Springer, 2004.
[3] Property Specification Language (PSL), Accellera.
[4] Open Verification Methodology.
[5] J. Mosquera, A. Stoliar, S. Pedre, M. Sacco, and P. Borensztejn, "Audio sobre Ethernet: Implementación utilizando FPGA," Designer Forum, Proceedings of the SPL Southern Programmable Logic Conference, Rima Editora.
[6] Audio Codec '97, Revision 2.3 (Rev 1.0), Intel, Apr. 2002.

DEVELOPMENT AND IMPLEMENTATION OF AN ADAPTIVE NARROWBAND ACTIVE NOISE CONTROLLER

Fernando A. González, Roberto R. Rossi
Digital Signal Processing Laboratory
Universidad Nacional de Córdoba
Córdoba, Argentina

Germán R. Molina, Gustavo F. Parlanti
Digital Signal Processing Laboratory
Universidad Nacional de Córdoba
Córdoba, Argentina

ABSTRACT

This paper presents the development and implementation of an adaptive feedback Active Noise Control (ANC) system based on a commercial Digital Signal Processor (DSP). The system aims to cancel the low-frequency narrowband noise remaining inside a headset shell. This low-frequency noise is particularly difficult to cancel by passive acoustic means. Using active techniques such as the one presented here, appropriate levels of noise attenuation are achieved efficiently, in a way suitable for commercial use.

1. INTRODUCTION

In industrial environments, in cabins near engines (as in a car or an airplane), or in any noisy environment in general, the headsets used for passive noise cancellation are useful only at frequencies above 500 Hz. As a complementary solution to passive headsets, Active Noise Control (ANC) [1], [2] systems aim to cancel the remaining low-frequency noise components. ANC's main objective is to produce, inside the headset, a signal of equal amplitude but opposite phase to the remaining noise. This signal is often called anti-noise. The noise inside the headset may vary over time because the external noise has changed, or because the headset has moved, changing its acoustic transfer function. The ANC system must be able to adapt to these changes, modifying the produced anti-noise signal accordingly, thus learning from its errors [3]. In environments with engines, turbines, air conditioning, sirens, etc., the noise to be cancelled is mainly narrowband. This means that the noise spectrum is concentrated at well-defined frequencies, and the noise signal is periodic in time.
In this case the so-called feedback ANC implementations can be used without causality constraints. ANC adaptive filters are often built as a Finite Impulse Response (FIR) filter with varying coefficients. With the availability of fast, high-processing-power Digital Signal Processors (DSPs), both the coefficient update and the processing of the ANC input signal can be done in real time. In headset feedback ANC, the FIR cannot drive the anti-noise acoustic signal directly, but needs to act through physical elements such as a Digital-to-Analog Converter (DAC) and a loudspeaker inside the headset. Similarly, the acoustic error has to be captured by microphones, converted to the electrical domain, and then digitized by an Analog-to-Digital Converter (ADC) to become the digital error signal. Besides these elements, amplifiers and attenuators are often needed. All these additional elements are usually represented by a single transfer function called the secondary path, S(z), in series with the adaptive FIR, W(z). In order to compensate for the effect of S(z), the input to the adaptive algorithm has to be affected (filtered) in the same way as the controller's output. The modified algorithm is then called the Filtered-x Least Mean Squares (FxLMS) algorithm [4]. As S(z) is unknown and may vary in time, it is often identified by a second adaptive system, and a copy of its transfer function, S^(z), is introduced to filter the input to the adaptive algorithm and to produce, from the controller's output, a replica of the noise to cancel. S^(z) can be obtained prior to the W(z) adaption process ("off-line" estimation) or at the same time ("on-line" estimation). This paper presents the implementation details of an ANC system built with a commercial DSP, which actively cancels the remaining narrowband low-frequency noise inside a commercial headset, using off-line secondary path estimation. Fig. 1 shows the general architecture of the system.
The main equations governing the adaption process [5], including a normalized version of FxLMS, are summarized as follows:

µ(n) = α / (L · Px'(n))   (1)

w_k(n+1) = w_k(n) + µ(n) · e(n) · x'(n−k)   (2)

where n is the iteration number, µ(n) is the variable step size (adaption speed), L is the number of coefficients in the FIR filter, α is a constant, Px'(n) is the power of x'(n), w_k(n) are the FIR coefficients, and k = 0, 1, 2, ..., L−1 is the coefficient index.
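As a numerical illustration of the normalized update in (1) and (2), the following sketch runs a normalized LMS predictor on a synthetic 200 Hz tone sampled at 8 kHz. It idealizes the secondary path (S(z) = 1, so x'(n) is the noise itself) and uses an invented filter length and α; it is a sketch of the update equations, not the paper's DSP implementation:

```python
import numpy as np

# Normalized update (1)-(2) on a synthetic 200 Hz tone at fs = 8 kHz.
# Assumptions: ideal secondary path (x' equals the noise), L = 32, alpha = 0.5.
fs, f0, L, alpha = 8000, 200.0, 32, 0.5
n = np.arange(8000)
d = np.sin(2 * np.pi * f0 * n / fs)       # narrowband noise to cancel

w = np.zeros(L)                           # W(z) coefficients, w_k(n)
xbuf = np.zeros(L)                        # x'(n-1), ..., x'(n-L)
e = np.zeros(len(d))
for i in range(len(d)):
    y = w @ xbuf                          # anti-noise predicted from the past
    e[i] = d[i] - y                       # residual error e(n)
    Px = (xbuf @ xbuf) / L                # power of x'(n)
    mu = alpha / (L * max(Px, 1e-3))      # (1), floored to keep mu bounded
    w = w + mu * e[i] * xbuf              # (2), vectorized over k
    xbuf = np.roll(xbuf, 1)
    xbuf[0] = d[i]

early = np.mean(np.abs(e[:400]))          # error while still adapting
late = np.mean(np.abs(e[-400:]))          # error after convergence
```

Because the noise is periodic, predicting it from past samples is enough for the residual to shrink by orders of magnitude, which is the mechanism the feedback controller exploits.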

Fig. 1. Adaptive feedback ANC system with FxLMS.

Fig. 2. Block diagram of the experimental model.

2. SYSTEM IMPLEMENTATION

The implementation of ANC applied to a headset was done using a high-performance StarCore MSC7116 DSP from Freescale Semiconductor Inc. The StarCore MSC7116 is a low-cost, 16-bit word-length, fixed-point DSP with four Arithmetic and Logic Units (ALUs). It can produce 1000 MMACS at 266 MHz. Due to its high processing power, calculations as complex as those required by the adaptive filters of Fig. 1 can be completed for both audio channels within one sampling period ("single-sample real-time processing"). The DSP program runs on SmartDSP, the Real Time Operating System (RTOS) designed specifically for the StarCore family. The SmartDSP Application Program Interface (API) [6], made up of functions written in the C language, allows easy configuration and use of the DSP peripherals. The API has a driver for every peripheral type, allowing the application program to communicate with it. The Time Division Multiplexing (TDM) peripheral driver was used for the input and output of both DSP audio channels. Data come in and out at the sampling rate, which was selected to be 8 kHz. This sampling rate was considered enough for this application, which aims to produce ANC at frequencies below 500 Hz. The application program was fully developed in the C language, using so-called intrinsic functions to optimize the adaptive filter routines. These functions are also written in C and belong to the compiler's domain [7] rather than to the RTOS's API. They are designed to perform fractional operations and take advantage of the DSP's parallel processing capabilities. The intrinsic functions are inserted directly within the C language code, allowing the programmer to closely match the efficiency of the DSP's assembly language. The data precision was defined to be a 16-bit word length for most of the data, using the fixed-point format.
Most data multiplications were then 16 by 16 bits, which is optimized in the DSP architecture. The only exception was the filter adaption process, whose precision was improved by using 32 bits for the result of µ(n) times e(n), and then performing a 32 by 16 bit multiplication for the W(z) coefficient update in (2). This prevented the adaptation process from stalling due to lack of precision, resulting in a performance improvement. The block diagram of one ANC audio channel is shown in Fig. 2.

2.1. DSP evaluation board

The MSC711xEVM kit [8] is an evaluation board for applications using StarCore MSC711x DSPs. It was used to load and evaluate the program on the DSP from a PC. The board also integrates the stereo 16-bit CODEC AK4554 from AKM Semiconductor Inc., which was used to handle the input and output analog signals of the electro-acoustic transducers. Besides the ADC and DAC for both channels, the CODEC provides the required antialiasing and reconstruction low-pass filters. The TDM peripheral inside the DSP communicates with the CODEC to produce the input and output of both data channels.

2.2. The acoustic system

The headset used was the circumaural stereo Philips SHP1900. In circumaural headsets, the user's ears are covered by the ear-cups, leaving a small acoustic cavity near each ear. The electret-type omnidirectional microphone ECM-30 was used as the error sensor microphone. It has sensitivity, bandwidth, signal-to-noise ratio and physical size appropriate for ANC applications. The error microphone position within the headset determines the quiet zone. The selected position follows the ideal placement suggested by the authors of [9] and [10]: as near as possible to the user's ear canal, which produces the flattest possible frequency response of the secondary path.
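The benefit of keeping the 32-bit product of µ(n)·e(n) described in Section 2 can be seen with a small Q15 fixed-point sketch. The operand values below are invented for illustration and are not the paper's:

```python
# Q15 fixed-point sketch of the 32x16-bit coefficient update. The values
# of mu, e and x' are invented; only the truncation effect matters here.
Q15 = 15

def to_q15(x):
    return int(round(x * (1 << Q15)))

mu_q = to_q15(0.05)       # step size, Q15
e_q = to_q15(0.00076)     # small error sample, Q15
x_q = to_q15(0.9)         # filtered reference sample, Q15

# Double truncation: rounding mu*e back to 16 bits first loses the update.
mue_q15 = (mu_q * e_q) >> Q15        # Q30 product squeezed into Q15
delta_16 = (mue_q15 * x_q) >> Q15    # 16x16 update: quantized to zero

# Keeping the 32-bit Q30 product and doing one 32x16 multiply instead:
mue_q30 = mu_q * e_q                 # full Q30 product, held in 32 bits
delta_32 = (mue_q30 * x_q) >> 30     # update LSB survives
```

With the truncated path the coefficient increment rounds to zero and adaptation stalls, while the extended-precision path still delivers a nonzero update; this is the effect the 32 by 16 bit multiplication avoids.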

2.3. The amplifier board

The amplifier board houses two preamplifiers for the microphones and two power amplifiers for the loudspeakers. The preamplifiers are transistor-based and were designed to bring the low-level microphone signals up to the A/D input level requirements. The power amplifiers are based on LM386 chips and deliver the power required by the loudspeakers.

3. RESULTS

The resulting data were exported from the DSP to MATLAB, where the different metrics can be easily analyzed and plotted.

Fig. 3. S^(z) impulse response.

3.1. Secondary path S(z) identification

To identify S(z) off-line, S^(z) was built as an adaptive FIR structure with length Ls = 128 coefficients. This length was enough to reproduce most of the S^(z) impulse response. The S^(z) coefficients were updated using the original LMS algorithm. White noise with zero mean and 0.03 variance was generated inside the DSP and used as the learning signal. An LMS adaption step µ = 0.01 was used. Fig. 3 shows the resulting impulse response after S^(z) convergence. It shows a delay of approximately 40 coefficients, or 5 ms given the 8 kHz sampling rate. The CODEC's datasheet reports a fixed delay of 36 coefficients, which explains almost all of the S^(z) delay. If needed, a smaller delay can be achieved simply by raising the sampling rate. The first 2 seconds (16000 samples) of the S^(z) learning process are shown in Fig. 4. There, the white noise learning signal filtered by S(z) is shown in blue, the white noise filtered by S^(z) in green, and the identification error in red.

3.2. The controller's performance

Two different testing noises were generated in MATLAB to evaluate the performance of the controller W(z). The testing noises were then reproduced by a portable PC loudspeaker. A user wearing the headset was seated in front of the PC loudspeaker to get uniform sound at both ears. The S^(z) previously obtained off-line was used in the tests.
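The off-line identification of S(z) described above can be sketched as follows. The white-noise variance (0.03) and step size (µ = 0.01) match the text, but the 8-tap "true" secondary path and the shorter estimate length are invented stand-ins (the paper uses 128 taps):

```python
import numpy as np

# Off-line LMS identification sketch: a short unknown FIR stands in for
# S(z). Noise variance 0.03 and mu = 0.01 follow the paper; the 8-tap
# "true" path and Ls = 16 are invented for a fast illustration.
rng = np.random.default_rng(0)
s_true = np.array([0.0, 0.1, 0.5, 0.3, -0.2, 0.1, 0.05, -0.02])  # unknown S(z)
Ls = 16
s_hat = np.zeros(Ls)                         # adaptive estimate S^(z)
mu = 0.01

x = rng.normal(0.0, np.sqrt(0.03), 20000)    # zero-mean white learning noise
d = np.convolve(x, s_true)[:len(x)]          # noise as measured after S(z)

xbuf = np.zeros(Ls)                          # x(n), x(n-1), ...
for i in range(len(x)):
    xbuf = np.roll(xbuf, 1)
    xbuf[0] = x[i]
    err = d[i] - s_hat @ xbuf                # identification error (red in Fig. 4)
    s_hat += mu * err * xbuf                 # plain LMS update

ident_err = np.max(np.abs(s_hat[:len(s_true)] - s_true))
```

After convergence the leading taps of `s_hat` match the unknown path, which is the same criterion used to judge the 128-tap estimate in Fig. 3.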
The normalized version of the FxLMS algorithm was used to update the W(z) coefficients. An adaptive FIR structure with length Lw = 160 coefficients was used to implement W(z). In (1), an α = was used. In order to calculate an approximation of Px' efficiently, we used an exponential window of the form [5]:

P^x'(n) = (1 − β) P^x'(n−1) + β x'^2(n)   (3)

Fig. 4. Temporal evolution of the S(z) identification.

In (3), only one value, P^x'(n−1), needs to be kept between iterations. The factor β determines how fast changes in x'(n) are reflected in P^x'(n). A β = was used as a compromise between closely following changes in x'(n) and the stability of P^x'(n). As values of x'(n) near zero would produce a very large µ in (1), a minimum limit P^x'MIN(n) = 0.01 was enforced by software. The first testing noise generated with MATLAB was a single 200 Hz tone. A large amplitude was selected in order to produce distortion at the output of the portable PC loudspeakers, thus generating harmonic components. Fig. 5 shows in blue the generated noise spectrum, represented by signal d(n) in Fig. 1, and in green the attenuated noise, signal e(n) in Fig. 1. This result was obtained 3 seconds (24000 samples) after starting the W(z) learning process. In Fig. 5 we can see the 200 Hz tone with its harmonics, and a single 50 Hz tone also present in the circuit. The attenuation achieved was 33.5 dB at 50 Hz, 51 dB at 200 Hz, 44 dB at 400 Hz and 35.5 dB at 600 Hz. The second testing noise was a typical engine sound [11], also generated in MATLAB and reproduced without distortion by a portable PC loudspeaker.
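The exponential window of (3) and the software floor on the power estimate can be sketched as follows. The paper's β value is not preserved in this text, so the β = 0.01 below (and the test signal) is illustrative; only the floor value 0.01 comes from the paper:

```python
# Sketch of the exponential power estimate (3) with the software floor.
# beta = 0.01 is an illustrative assumption; the floor 0.01 follows the text.
def px_update(px_prev, x, beta=0.01, px_min=0.01):
    px = (1.0 - beta) * px_prev + beta * x * x   # recursion (3)
    return max(px, px_min)                        # floor keeps mu in (1) bounded

# constant input of amplitude 2 -> the estimate approaches x'^2 = 4
px = 0.0
for _ in range(2000):
    px = px_update(px, 2.0)

# near-zero input -> the floor, not the raw estimate, is returned,
# so mu = alpha / (L * Px) in (1) cannot blow up
px_small = px_update(0.0, 1e-4)
```

Only `px_prev` needs to survive between iterations, which is exactly the single-value storage the text points out for (3).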

This noise is a combination of 12 components at multiples of 60 Hz, with different amplitudes. Most of the power is concentrated around the 240, 300, 360, 420 and 480 Hz components. Data were also collected 3 seconds (24000 samples) after the start of the controller W(z) learning process. Fig. 6 shows the noise power spectrum and the error signal. The noise attenuation for the main engine noise components was: 31.3 dB at 240 Hz, 46.7 dB at 300 Hz, 35.2 dB at 360 Hz, 38.7 dB at 420 Hz and 25.6 dB at 480 Hz.

Fig. 5. Noise power spectrum of a distorted 200 Hz tone (blue line) and error signal (green line).

Fig. 6. Engine noise power spectrum (blue line) and error signal (green line).

4. CONCLUSION AND FUTURE DIRECTIONS

From the analysis and the results presented in this paper, the following conclusions can be summarized. The designed and implemented ANC system attenuates periodic (narrowband) noises in the frequencies of interest to an acceptable level. The precision and computational power of the DSP used are enough to process both independent audio channels simultaneously in real time. For the different signals tested, the user reports comfortable levels of remaining noise inside the headset shell cavity, these being significantly lower than those without the ANC. Future work will focus on improving the feedback ANC performance, and on broadband noise ANC within a headset. Different learning algorithms will also be investigated, analyzed and implemented under real conditions with commercially available components.

5. ACKNOWLEDGMENTS

The authors want to acknowledge the Córdoba National University Science and Technology Office (Secretaría de Ciencia y Tecnología de la Universidad Nacional de Córdoba), and Freescale Semiconductor, Inc.

6. REFERENCES

[1] S. M. Kuo and D. R. Morgan, "Active noise control: a tutorial review," Proceedings of the IEEE, vol. 87, no. 6, Jun. 1999.
[2] S. J. Elliott and P. A. Nelson, "Active Noise Control," IEEE Signal Processing Magazine, vol. 10, no. 4, Oct. 1993.
[3] B. Widrow et al., "Adaptive Noise Cancelling," Proceedings of the IEEE, vol. 63, no. 12, Dec. 1975.
[4] B. Widrow, D. Shur and S. Shaffer, "On adaptive inverse control," in Proc. 15th Asilomar Conf., 1981.
[5] S. M. Kuo and D. R. Morgan, Active Noise Control Systems: Algorithms and DSP Implementations, New York: Wiley, 1996.
[6] SmartDSP OS Reference Manual, Rev. 1.42, Metrowerks, Austin, TX, Sept.
[7] CodeWarrior Development Studio for StarCore DSP Architectures: C Compiler User Guide, Freescale, Austin, TX, Aug.
[8] MSC711XEVM User's Guide, Rev. 0, Freescale, Austin, TX, Apr.
[9] W. S. Gan and S. M. Kuo, "Adaptive Feedback Active Noise Control Headset: Implementation, Evaluation and Its Extensions," IEEE Transactions on Consumer Electronics, vol. 51, no. 3, Aug. 2005.
[10] S. M. Kuo and W. S. Gan, "Active Noise Control System for Headphone Applications," IEEE Transactions on Control Systems Technology, vol. 14, no. 2, Mar. 2006.
[11] MathWorks, Filter Design Toolbox, "Active Noise Control Using a Filtered-X LMS FIR Adaptive Filter." [Online]. Available: l?file=/products/demos/shipping/filterdesign/adaptfxlmsdemo.html

BIO-INSPIRED HARDWARE SYSTEM BASED ON ANIMALS OF COLD AND HOT BLOOD

Pablo A. Salvadeo
Laboratorio de Computación Reconfigurable FRM UTN
Rodriguez 273, CP M5502AJE, Mendoza, Arg.

Elvo H. Morales
INDEA FRM UTN
Rodriguez 273, CP M5502AJE, Mendoza, Arg.
hugom@frm.utn.edu.ar

Rafael Castro López
Instituto de Microelectrónica de Sevilla CNM CSIC
Avda. Américo Vespucio S/N, CP 41092, Sevilla, Esp.

Ángel C. Veca
Instituto de Automática FI UNSJ
Av. San Martin Oeste 1112, CP J5400ARL, San Juan, Arg.

ABSTRACT

This document discusses a way to create a bio-inspired hardware system sensitive to temperature, using a hardware description language and digital reconfigurable devices. In addition, the system is self-contained in the device employed. An FPGA (Field Programmable Gate Array) is used for the implementation, together with VHDL (VHSIC Hardware Description Language) for the description. Moreover, two systems whose biological inspiration is based on the animals near the extremes of the spectrum of their kingdom, the cold- and the hot-blooded, are described.

1. INTRODUCTION

Animals are biological systems self-contained in their own bodies. Through this, an animal is able to detect changes in the environment and take them into account to modify its behavior. A bio-inspired electronic system should likewise be self-contained: sensors should be intrinsic to the system, and it should not use elements external or extrinsic to it. The objective of this work is to obtain a system capable of perceiving a physical magnitude and of modifying its behavior as a function of it, in a way similar to what happens in Nature. In this study, temperature is chosen as the magnitude to follow, and the problem consists of doing this job from within the hardware device. To solve this problem, a digital circuit whose dynamics depends on temperature is used.
Digital design techniques attempt to abstract the designer away from the analog behavior of the devices, behavior which would otherwise be useful when interacting with physical magnitudes. This is precisely the approach followed when a hardware description language is used. Fig. 1 shows the proposed system, which is composed of two main parts: a sensor circuit and a block sensitive to it, henceforth named the sub-system (SS). Both parts are designed using the same techniques, which are applicable within a range of temperatures where the abstractions they are based on remain valid. Nevertheless, some digital circuits exist whose nature makes them dependent upon physical magnitudes even though the mentioned rules are applied in their realization. One of these cases is therefore chosen to perform the function of the system's input sensor.

Sensor(T_E) -> Sub-System   (FPGA)

Fig. 1. The self-contained system.

Such a transducer should have temperature as its input variable, and its output should be a digital signal with parameters proportional to the input value. If the output signal is periodic, it will be characterized by two

parameters: phase and frequency. As the signal is digital, the amplitude is considered to take only two values, 0 and 1, and therefore offers no useful information. In this case, the frequency of the signal is used, so that it varies with the chosen physical magnitude. A circuit that satisfies these specifications is the so-called Ring Oscillator (RO).

2. RING OSCILLATOR

The RO, in its simplest form, is a combinational digital circuit consisting of a NOT gate with a feedback loop closed between its input and its output. After power-up it begins to oscillate, delivering a signal whose frequency depends on the delay time of the gate, and this delay varies as a function of temperature. If this circuit is described and implemented on a reconfigurable hardware device, the obtained behavior, according to our experience, will not be the expected one. Therefore, another description of the RO, with a dual-input NAND gate, is used. One input is used as the enable line and the other is connected to the output. In addition, this version has the advantage of being controllable by the sub-system via the enable. To describe this circuit in VHDL it is necessary to define an entity with an input port and an output port. The output port must be of the buffer type, because it is both read and written, the architecture being as simple as: output <= nenable nand output;. If the FPGA architecture [1] is taken as a design reference, where each island is a LAB (Logic Array Block) formed by a set of ALMs (Adaptive Logic Modules), the implementation of an RO occupies only one of these basic blocks. The feedback is built by employing one of the LCs (Local Connections) of the LAB, as shown in Fig. 2.

Fig. 2. Implementation of the RO. In blue, the LCs used for the feedback. In green, the ALM used by the logic.
All outputs used to implement the RO are combinational, so the oscillation frequency will be higher than the highest one specified by the FPGA manufacturer. That specification states the maximum frequency allowed for a clock signal; its calculation takes into account the transit time through the sequential output, including the delay introduced by the flip-flop of the block. Hence, the output can be used directly if the sub-system is combinational. However, if it is sequential, the frequency should be decreased to the value recommended in the device datasheet. Obviously, one or more counters could be used to divide the signal and thus obtain the desired value. Furthermore, if the transducer is built with an odd number of linked gates, a lower frequency is obtained due to the increased delay inserted by the additional gates. Finally, as shown in [2], the relationship between frequency and temperature is linear, close to a straight line with negative slope within the working range specified by the manufacturer. In [3] it is observed that when the number of gates in the ring is increased, the switching rate of the output diminishes along with the frequency.

3. BEHAVIOR

Following the proposal above, a bio-inspired hardware system sensitive to temperature and self-contained in a device, its body, is built. This section describes the observable behavior that appears with changes of temperature in the animal kingdom, and the way to emulate such behavior with the proposed system. In Nature, animals tend to keep the temperature of their own bodies constant, and in the same way, a reconfigurable device can operate in a closed temperature interval depending on its fabrication techniques. It therefore seems reasonable to explore the mechanisms used by animals for thermal adaptation [4].
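Given the linear, negative-slope frequency-temperature relationship reported in [2], a two-point calibration is enough to turn a counter reading of the divided RO output into a temperature estimate. The gate time, frequencies and calibration points in this sketch are invented for illustration:

```python
# Two-point calibration sketch for an RO-based temperature sensor,
# assuming the linear negative-slope model of [2]. All numbers invented.
def counts(freq_hz, gate_s=0.01):
    """Counter value accumulated over a fixed gate interval."""
    return int(freq_hz * gate_s)

# assumed calibration points: (25 C, 200 MHz) and (85 C, 170 MHz)
c1, t1 = counts(200e6), 25.0
c2, t2 = counts(170e6), 85.0
slope = (t2 - t1) / (c2 - c1)        # degrees per count (counts fall as T rises)

def temperature(c):
    """Map a raw counter reading back to an estimated temperature."""
    return t1 + slope * (c - c1)

t_mid = temperature(counts(185e6))   # halfway frequency -> halfway temperature
```

The counter here plays the role of the frequency dividers mentioned above: the sub-system never sees the raw RO frequency, only a count it can compare against thresholds.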
In the animal kingdom, diverse behaviors related to temperature are found, scattered over a wide spectrum whose extremes are denominated cold and hot blood. The first alludes to the lack of internal mechanisms to stabilize the body temperature. The second refers to the capability of keeping it constant by using such mechanisms. Thus, animals closer to one extreme or the other will behave differently in the face of changes in the environment. Near the cold-blood side, a significant part of an animal's time is invested in searching for different places in its habitat where it remains at particular hours of the day, and thus holds its temperature regulated. Instead, in the neighborhood of the other extreme, time is spent on such activities only occasionally. Further, the former do not use their metabolism to cool down or heat up, while the latter effectively do. Hence, for animals of identical body weight but at opposite extremes of the spectrum, the cold-blooded need less energy than the warm-blooded, due to lower energy consumption in

the first ones. Close to the cold extreme, the metabolism is composed of various reactions that activate themselves at different temperature thresholds. On the other hand, near the hot side, only one reaction, or a few, is necessary to compose it. Thereby, hot-blooded animals stabilize their temperature to optimize their simple metabolism. The cold-blooded ones possess a complex metabolism composed of several reactions that are optimal at different temperatures. Thus, metabolic complexity is exchanged for energy consumption, a trade-off that confers advantages and disadvantages on different animals in specific situations. With these observations, two plausible systems will be considered: one near the cold end and the other close to the hot extreme.

3.1. Cold Blood System

If the system has no internal mechanism to stabilize its body at the optimum working temperature, the device will be composed of several sub-systems, each exhibiting maximum performance over a limited portion of the operational range. This can be implemented in two different ways, depending on the area occupied by each sub-system. If each one occupies a small portion of the total, then they can all coexist, see Fig. 3.1a. In this case, depending on the transducer's frequency, each one is activated while the rest remain inactive. However, if the SS occupies a large part of the chip, the device can be reconfigured as the period changes, see Fig. 3.1b. To do that, the configurations of the sub-systems for each temperature should be stored and used in accordance with the current temperature.
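The small-sub-system variant of Fig. 3.1a amounts to a band selector: only the SS whose temperature band matches the sensed value is enabled, while the rest stay inactive. A sketch with invented band edges (the T0...T3 thresholds are not given in the text):

```python
# Sketch of the "coexisting sub-systems" selection: each SS owns one
# temperature band and only the matching one is enabled. Band edges are
# invented stand-ins for the T0-T1, T1-T2, T2-T3 intervals of Fig. 3.1a.
BANDS = [(0.0, 20.0), (20.0, 40.0), (40.0, 60.0)]

def active_subsystem(temperature_c):
    """Return the index of the sub-system enabled for this temperature."""
    for i, (lo, hi) in enumerate(BANDS):
        if lo <= temperature_c < hi:
            return i
    return None                      # outside the device's operating range

enables = [active_subsystem(t) for t in (10.0, 25.0, 55.0)]
```

In the large-sub-system variant (Fig. 3.1b) the same selection would instead pick which stored configuration to load into the device.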
(a) Sensor(T_E) -> Sub-System(T0-T1), Sub-System(T1-T2), Sub-System(T2-T3)   (FPGA)
(b) Sensor(T_E = T01) -> Sub-System(T0-T1); Configurations(T01, T12, T23)   (FPGA)

Fig. 3.1. The cold blood system: (a) small and (b) large SS.

3.2. Hot Blood System

If, on the contrary, the system has mechanisms to heat up or cool down, it will consist of three parts: the transducer, the SS, and a circuit for varying the temperature of the device. Its diagram is shown in Fig. 3.2. The behavior is as follows: changes in the frequency of the transducer's output signal are due to temperature changes. Such changes drive the thermal control circuit, modifying its set-point in order to cause an effect contrary to the one initiated by the environment. In this way, the working conditions of the sub-system are kept constant, ensuring maximum performance.

Sensor(T_E) -> Sub-System; Thermic Controller(T_E)   (FPGA)

Fig. 3.2. The hot blood system.

The thermal circuit must be able to change the system temperature, offsetting the changes of the environment. To do this, energy consumption is used to vary the system temperature, in the same way as done by hot-blooded animals. As an option, the controller can be composed of an oscillator whose frequency increases in proportion to the decrease in temperature. In this scheme, the increased number of commutations per unit of time raises the rate of transformation of electric energy into heat, enlarging the consumption. If the heat generation is larger than its dissipation, the temperature will increase. In this case, as in Nature, it is simpler to hold the device's temperature above that of the environment. When the environmental temperature rises, the circuit generates less heat by consuming less energy. But if the temperature falls, the controller holds the performance of the sub-system at the expense of a larger consumption.
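The heater-oscillator idea can be sketched as a bang-bang loop: switching activity rises when the sensed temperature falls below the set-point, and the die settles near the set-point as long as heating can outpace dissipation. All thermal constants below are invented for illustration:

```python
# Bang-bang sketch of the hot-blood thermal controller: the heater
# oscillator runs only while the sensed temperature is below the
# set-point, turning switching activity into heat. Constants invented.
def settle(t_env, t_set=50.0, heat_gain=0.5, k_diss=0.01, steps=500):
    t = t_env
    for _ in range(steps):
        heating = heat_gain if t < t_set else 0.0  # oscillator on/off
        t += heating - k_diss * (t - t_env)        # heat in vs leakage to env
    return t

t_cold_env = settle(t_env=20.0)   # large deficit: heater works hard
t_hot_env = settle(t_env=49.0)    # small deficit: little heating needed
```

Both runs settle near the set-point, but the cold-environment run keeps the heater on most of the time; as in Nature, holding the temperature against a cold environment costs more energy.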
Finally, two improvements are proposed in order to obtain more uniform heating. The first is to distribute the signal of the thermal controller through the channels around the islands used by the SS. The second is to place slave switches in some islands that follow the thermal rhythm led by the controller.

4. CONCLUSION

A technique that makes no use of elements exogenous to the device in order to obtain a bio-inspired hardware system sensitive to temperature was shown. For this system, a Ring Oscillator was used, a combinational circuit that has demonstrated attributes to be used as a sensor of temperature

44 in other applications, was used. A way to lead the biological inspiration toward the emulation of the behavior of the animals nearby of the spectrum extremes of its kingdom, the cold and the hot, was also shown. For this reason, is believed that in this journey they have shown concrete options that can be useful when implementing bio-inspired hardware systems. 5. ACKNOWLEDGMENTS This work was made during 2010 with the partial support of BINID UTN. Thanks: to Ángel C. Veca for to invite me to participate of research and development, and to Eduardo Zavalla, INAUT FI UNSJ, for to collaborate in the grammatical revision of this paper. 6. REFERENCES [1] Altera Corp., Stratix II Device Handbook, vol. 1, sec. 2, pp , May [2] S. Lopez-Buedo, J. Garrido, and E. I. Boemo, Dynamically Inserting, Operating, and Eliminating Thermal Sensors of FPGA-Based Systems, IEEE Trans. Components and Packaging Technologies, vol. 25, no. 4, pp , Dec [3] S. K. Yoo, D. Karakoyunlu, B. Birand, and B. Sunar, Improving the Robustness of Ring Oscillator TRNGs, ACM Trans. Reconfigurable Technology and Systems, vol. 3, no. 2, art. 9, pp. 1 30, May [4] M. S. Blumberg, Body Heat: Temperature and Life on Earth, Cambridge, MA: Harvard University Press, pp. 1 69,

COMPARATIVE AND QUALITATIVE ANALYSIS OF FPGA DEVELOPMENT TOOLS Gabriel Santos da Silva / Maximiliam Luppe Departamento de Engenharia Elétrica / Escola de Engenharia de São Carlos Universidade de São Paulo Av. Trabalhador São Carlense, 400 São Carlos SP Brasil gabriel.santos.silva@usp.br; maxluppe@sc.usp.br

ABSTRACT This work surveys the development tools of the main FPGA vendors currently on the market, in order to carry out a comparative and qualitative analysis of them. The study is based on an undergraduate research project implemented on FPGA that exercised synthesis, simulation and IP-core generation tools.

1. INTRODUCTION In recent years, the growth of reconfigurable devices and of their development tools — in both diversity and density — has favored the implementation of complete, complex systems in integrated programmable logic (SoC, System on Chip). Altera [1], Lattice [2] and Xilinx [3] are examples of companies that build solutions in the field of reconfigurable digital systems, each with its own development tool: Quartus II, Diamond and ISE, respectively. The advantages of working with FPGAs [4] lie in the possibility of developing soft-cores [5], which are reusable (the same soft-core can be used in several projects with no extra cost or design time) and portable (they can be adapted to several reconfigurable-device development platforms). This makes the study of hardware description languages and of the platforms available on the market extremely important. The choice of the Verilog hardware description language [6] for implementing the undergraduate research project is due to its gentler learning curve compared to VHDL, since Verilog closely resembles the widely known C language, while the platforms were chosen from the companies that currently stand out in the field.

2. LOGIC ANALYZER AND FIFO MEMORY The aforementioned undergraduate research project comprised the development of a logic analyzer [7] for on-chip analysis of digital systems implemented on FPGA. It consists of an open-source device for on-chip analysis of digital signals, aiming at an unrestricted number of input channels, so that it can work with more complex circuits, and at not being tied to the FPGAs of the analyzers' own manufacturers. Fig. 1 shows the block diagram of a logic analyzer, representing its main functions. Fig. 1. Block diagram of the logic analyzer. The Time Base block defines whether data acquisition and storage are driven by a clock signal coming from the device under analysis or by an external one. The Trigger Stage block starts the data-capture process and has two options: internal trigger, in which the acquired data are compared against a previously defined data word, and external trigger, in which capture starts after a pulse is recognized on a specific external input. The Memory block is a FIFO (First In, First Out) memory responsible for storing the acquired data. The last block, Interface, responsible for how the data are presented to the user, was not covered by this project.

3. DEVELOPMENT TOOLS To develop the logic analyzer one must, besides implementing the necessary soft-cores, also synthesize and simulate them.
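The capture path of Section 2 — a trigger stage feeding a FIFO — can be modeled in software before committing it to HDL. This is a hypothetical sketch of the internal-trigger mode; `capture`, `trigger_word` and `depth` are illustrative names, not identifiers from the project:

```python
# Software model (illustrative, not the project's Verilog) of the analyzer's
# internal-trigger mode: compare each sample against a trigger word, then
# store samples into a bounded FIFO, oldest first, until the memory fills.
def capture(samples, trigger_word, depth=8):
    captured, triggered = [], False
    for s in samples:
        if not triggered:
            triggered = (s == trigger_word)   # internal trigger: match the word
        if triggered:
            captured.append(s)                # FIFO keeps arrival order
            if len(captured) == depth:        # memory full: acquisition stops
                break
    return captured

print(capture([0, 5, 0xA5, 1, 2, 3], trigger_word=0xA5, depth=4))  # -> [165, 1, 2, 3]
```

Nothing is stored before the trigger word appears, which is exactly why the FIFO depth, and hence the memory implementation, dominates the resource results discussed later.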

To guarantee their correct operation, an IDE (Integrated Development Environment) is used: a development tool containing the applications responsible for the desired processes (design, synthesis, place-and-route and verification), as illustrated in Fig. 2. Fig. 2. IDEs and their processes. Since this project aims at generalizing the logic analyzer modules, they must be debugged and simulated in the different IDEs (Quartus II, ISE Design Suite and Lattice Diamond), checking whether the responses obtained are the same in every case. In this way, the information needed for an analysis of these tools can be gathered.

Quartus II 9.0 The Quartus II 9.0 IDE belongs to Altera, the company that launched the first complex reprogrammable logic device (CPLD) in 1984 and holds second place in the reconfigurable logic device market. This IDE has an integrated synthesis tool, so no external tool is needed for this process, although that option exists. The tool supports both of the most widely used HDLs, VHDL and Verilog, as well as AHDL (Altera's own hardware description language). Quartus II 9.0 also has an integrated simulator (supporting Vector Waveform files), but it can work with other tools such as ModelSim as well. As of its latest version, 10.0, the integrated simulator has been removed. Through its MegaWizard Plug-In Manager, Altera provides parameterizable IP-cores, or megafunctions, optimized for the architectures of its own devices. These functions offer more efficient logic synthesis and can cut design time spent on coding; the tool lets the user configure several parameter options of these functions. A device of the Cyclone II family was adopted for this undergraduate research project. The architecture of this family contains a two-dimensional array of LABs (Logic Array Blocks), each with 16 logic elements (LEs), small logic units responsible for implementing the user's logic functions, each having a four-input LUT (Look-Up Table), a programmable register, etc. This architecture also includes memory blocks called M4K, capable of implementing several kinds of memory (single-port RAM, ROM, FIFO), and multiplier blocks optimized for digital signal processing (DSP).

ISE Design Suite 11.1 The ISE Design Suite 11.1 IDE belongs to Xilinx, the largest manufacturer of reprogrammable logic devices, leader of this market since the 1990s and inventor of the FPGA. This software can synthesize an HDL project using Synthesize-XST (Xilinx Synthesis Technology), Synplify/Synplify Pro or Precision; the first is the default option for synthesis, as it is Xilinx's own. Unlike Quartus II, ISE has no integrated simulator: the IDE generates a file to be used in another application, for example ISim, Xilinx's own simulator, installed automatically with the software. To simulate a project, this IDE uses a testbench file, responsible for generating the stimulus signals and the initial values of some input vectors. With such a test file and a simulation tool, the output waveforms can be inspected to check whether the device under test (DUT) behaves as intended. Xilinx also has an IP-core generator tool, the CORE Generator & Architecture Wizard. This tool is an interactive graphical design aid that allows the creation of high-level modules such as memory elements, mathematical and communication functions, and I/O interface cores.
These modules can be customized and optimized by means of pre-modules, in order to take advantage of the inherent technical characteristics of Xilinx FPGA architectures. The Spartan-3 architecture, the generation of the device adopted for the project, consists of five fundamental programmable elements: CLBs (Configurable Logic Blocks), formed by slices containing LUTs, which can implement logic and store data; input/output blocks, which control the data flow between the I/O pins and the device's internal logic; RAM blocks, which store data as 18-Kbit blocks; multiplier blocks; and DCMs (Digital Clock Managers). This generation has a rich network of traces that interconnects these functional elements and carries signals between them. Each of these elements has an associated switch matrix that allows multiple routing connections.

Lattice Diamond 1.0 Lattice Diamond 1.0 belongs to Lattice Semiconductor, pioneer of the ISP (in-system programming) scheme and one of the three largest manufacturers of reconfigurable ICs in the international market. This IDE includes Synopsys Synplify Pro as its integrated synthesis tool, which, unlike in the other IDEs, is an application from another company, Synopsys [8]. Its advantage is support for synthesizing mixed Verilog and VHDL designs. For simulation, this IDE uses an external tool that requires its own project, Active-HDL Lattice WebEdition 8.2, an application that, like the synthesis tool, belongs to another company, Aldec [9]. It also stands out for simulating mixed VHDL and Verilog code, besides advanced verification and many debugging features. Like the other IDEs, this one has its own IP-core generator, IPexpress. This application gathers several functional modules that help generate VHDL or Verilog code, which can be reused as the user needs, speeding up the design and obtaining the best results from it. The modules provide I/O, arithmetic and memory functions, among others. Each device in the LatticeXP2 family, the family representing Lattice in this project, has an array of logic blocks surrounded by PICs (Programmable I/O Cells). Between the rows of logic blocks lie rows of EBRs (Embedded Block RAM), 18-Kbit memory blocks (RAM, ROM or FIFO), and a row of DSP (Digital Signal Processing) blocks.
There are two types of logic block: the PFU (Programmable Functional Unit), responsible for logic, arithmetic, RAM and ROM functions, and the PFF (Programmable Functional Unit without RAM), responsible for logic, arithmetic and ROM functions; both contain four interconnected slices (four-input LUTs and two registers, or LUTs only).

4. RESULTS The synthesis process, even across the different IDEs, is responsible for checking the code syntax, compiling it (translating and optimizing it into a set of recognizable components) and mapping it (converting the components of the compilation phase into primitive components of the target technology). When synthesizing the soft-core implemented during the undergraduate research project, one notices that, if the hand-implemented memory is chosen, the logic analyzer module ends up limited to small input parameters, data widths and word counts. This does not happen with the memory obtained from an IP-core, because synthesis then uses memory blocks instead of logic elements. To compare the three synthesis processes, the reports they produce were examined, using both the implemented soft-core and the generated IP-core. This analysis is quite difficult because of the different architectures adopted by each device. From the duly analyzed reports, Table 1 was built, presenting comparative data about the two kinds of memory synthesized by the three tools covered by this project. For a better understanding of this table, some remarks about the reports provided and the items presented follow.
The report provided by Altera's IDE, the Analysis & Synthesis Summary Report, contains, among other information, the total number of logic elements (including total combinational functions and dedicated logic registers), the total number of registers, and the total number of memory bits used and available. Table 1 lists its data for logic elements, registers and memory bits. The report provided by the ISE IDE, the Synthesis Report, takes a different approach, analyzing cell usage in the synthesis process and dividing cells into BELs (basic logic elements such as inverters, LUTs and muxes), flip-flops/latches and buffers. On examining the whole document, flip-flops/latches are taken as registers, while LUTs are taken as logic elements. To analyze memory-block usage it is also important to examine the reports generated by the Map process. Table 1 lists its data for LUTs, registers and memory blocks; this tool's reports define a memory block as a set of 18 Kbits of memory. The Diamond IDE provides the Resource Usage Report document, which likewise reports the items to be compared in terms of LUTs and register bits. As with ISE, the reports generated in the Map process were used for the memory-block analysis. Table 1 lists its data only for the LUTs that can be used as RAM, for register bits, and for memory blocks; this tool's reports also define a memory block as a set of 18 Kbits of memory. Mainly because it has these two types of logic block, with or without RAM, the synthesis tool achieves very good results, optimizing the use of registers and other elements.
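A quick sanity check explains why the hand-implemented memory exhausts logic resources while the IP-core version barely registers: 8 bits by 1024 words is 8192 storage bits — the memory-bit figure reported for Quartus II in Table 1 — and that fits in a single 18-Kbit block, the block unit used by the ISE and Diamond reports. The arithmetic below is illustrative only:

```python
# Rough numbers behind Table 1 (illustrative arithmetic, not vendor data):
# a FIFO inferred into plain logic costs roughly one register per stored
# bit, while dedicated 18-Kbit block RAM absorbs the whole array at once.
width, words = 8, 1024
storage_bits = width * words                   # registers needed if built in logic
blocks_18k = -(-storage_bits // (18 * 1024))   # 18-Kbit blocks needed (ceiling)
print(storage_bits, blocks_18k)                # -> 8192 1
```

Thousands of register bits versus one block RAM is the gap visible in the "Implemented memory" versus "Generated memory (IP-core)" halves of Table 1.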

Table 1. Comparative table (parameters: 8-bit data width, 1024 data words in total)

                        Quartus II           ISE                  Diamond
Family                  Cyclone II           Spartan3A(N)         LatticeXP2
Device                  EP2C20F484C7         XC3S50A-5TQ144       LFXP2-5E-6TN144C
Max. frequency (MHz)    127.03               128.…                ….2

Implemented memory
  Logic elements        23078/18752 (123%)   15157/1584 (1076%)   960/810 (119%)
  Registers             8251/18752 (44%)     8264/1408 (586%)     163/4752 (3%)

Generated memory (IP-core)
  Logic elements        79/18752 (<1%)       58/1408 (4%)         6/810 (1%)
  Registers             53/18752 (<1%)       67/1408 (4%)         33/4752 (1%)
  Memory blocks/bits    8192/… (3%)          1/3 (33%)            1/9 (11%)

5. ACKNOWLEDGMENTS The authors Gabriel Santos da Silva and Maximiliam Luppe acknowledge the support granted by FAPESP 2009/

6. REFERENCES
[1] Altera,
[2] Lattice Semiconductor,
[3] Xilinx,
[4] FPGA,
[5] Soft-Core and IP-Core, roperty_core
[6] HDLs Verilog and VHDL, ge; S. Palnitkar, Verilog HDL: A Guide to Digital Design and Synthesis, Sunsoft Press; L. Strozek, Verilog Tutorial, edited for CS141, October 8; D. K. Tala, Verilog Tutorial, October 25, 2003.
[7] Logic Analyzer, Introduction to the Logic Analyzer. Logic analyzer.
[8] Synopsys,
[9] Aldec,

AUTOMATIC GENERATION OF VHDL FROM A PETRI NET: COMPARATIVE ANALYSIS OF THE SYNTHESIS RESULTS Roberto Martínez, Javier Belmonte, Rosa Corti, Estela D'Agostino, Enrique Giandoménico Facultad de Ciencias Exactas, Ingeniería y Agrimensura Universidad Nacional de Rosario (FCEIA/UNR) Avenida Pellegrini 250, (2000) Rosario, Argentina romamar, belmonte, rcorti, estelad, giandome@fceia.unr.edu.ar

ABSTRACT Petri nets (PN) provide a formalism for modeling systems characterized by parallelism and by collaboration in the use of resources. This work presents a software module that automates the direct translation of a PN into synthesizable VHDL code. A comparative analysis of resource usage and operating speed is also carried out, comparing the synthesis of the solutions obtained with the proposed methodology against those obtained with finite state machines. The study indicates that PN-based models generally require more resources than those based on finite state machines. However, the proposed methodology greatly reduces development time and prevents coding errors, making it very convenient when the resource requirements are not critical.

1. INTRODUCTION In industrial systems it is common to find several processes evolving in parallel that often need to synchronize with each other, communicate and/or share some resource. Petri nets (PN) provide a formalism for modeling systems characterized by parallelism and by collaboration in the use of resources. In addition, this formalism offers the further advantage of easily representing loosely specified systems [1].
Modeling at a high level of abstraction and using formal System Level Design (SLD) description techniques [2] enables rapid prototyping methods based on the creation of libraries and the reuse of hardware/software components [3]. HDLs (Hardware Description Languages) allow digital circuits to be designed with a high level of abstraction. These languages, initially aimed at hardware description and simulation, are nowadays used for the automatic synthesis of circuits on reconfigurable logic devices. EDA (Electronic Design Automation) environments, which integrate description, synthesis, simulation and implementation tools for digital systems in the same framework, include synthesis tools that recognize logic structures, among them those associated with finite state machines (FSM). The formats proposed for coding these machines optimize their synthesis, either for occupied area or for speed. This work presents a software module, MakeVHDL, that automates the method described in [4] for directly translating a PN into synthesizable VHDL code. A comparative analysis of the results, in terms of resources and speed, is also performed between the synthesis of code generated by MakeVHDL and the coding obtained with the FSM format of the Xilinx synthesis tool XST (Xilinx Synthesis Tool). The rest of the paper is organized as follows: Section 2 reviews related work; Section 3 describes the MakeVHDL software module; Section 4 presents the comparative analysis of the synthesis results; and Section 5 gives the conclusions.

2. RELATED WORK Several approaches have been proposed for translating a model represented as a Petri net into a hardware description language. [5] reports a (closed-source) software tool that translates a model in PNML (Petri Net Markup Language, an international standard defining a transfer syntax for different versions of Petri nets) into C and VHDL code. The implementation strategy analyzes each node and establishes a place-to-register and transition-to-combinational-logic correspondence. Other authors [6] report the development of an application, also closed-source, called HILECOP, used in the medical domain to generate VHDL code from an interpreted Petri net. An interesting contribution of

this latter work is the possibility of applying activity control to the VHDL components, to save energy consumed by the device, making use of the activity-propagation principle. Fig. 1. PIPE screen with the MakeVHDL module. In [7], the authors decompose the model into basic structural blocks of a PN, composed of one place and one transition, each of which is then implemented in a configurable logic block (CLB) of an FPGA. The authors of [8] report the development of Animator4FPGA, a closed-source tool that allows controllers to be described by means of PN and then generates the corresponding VHDL. In [9] it is proposed that PN be used as a specification language in the hardware/software co-design of embedded systems, under certain conditions, among them that code for different platforms can be generated from this specification and used for simulation, verification and implementation. The work described in [10] reports a comparative study of the resources used in FSM synthesis for different description styles and state-encoding methods. A methodology based on analyzing the synthesis reports is proposed, counting slices, flip-flops and LUTs; the maximum theoretical clock frequency estimated in the reports is also analyzed. The work presented here is based on a matrix description of the PN model and, unlike those described above, is built on a freely available open-source tool.

3. AUTOMATIC VHDL GENERATION Modeling a system with a PN, whatever the nature of the system, consists of drawing graphs or diagrams of different styles, according to the type of PN used for the modeling. It is also possible to use a PN directly as a way of specifying the system.
In any case, underlying the diagram are the mathematical rules that support the description and also allow simulations to be run to verify its behavior. One of the many graphical tools for building and simulating PN is PIPE (Platform Independent Petri net Editor) [11], an open-source tool developed in Java. PIPE is structured so that specific features can be added through modules that plug into its interface. For VHDL generation, we implemented a module (MakeVHDL) that directly translates the PN represented in PIPE into VHDL code following the method described in [4]. It performs the translation from a global perspective of the system, starting from the matrix representation of the associated PN, which bounds the complexity of the resulting VHDL description. The implementation of the net's architecture consists of three blocks that communicate through signals: the first determines which transitions are enabled to fire, the second defines the new marking, and the third assigns the outputs. Fig. 1 shows a PIPE screen with the MakeVHDL module added. The module includes facilities for identifying the system's inputs and outputs. It also allows logic conditions to be attached to transitions and conditioned outputs to be defined. The translation methodology proposed in [4] was extended with these elements, thus achieving complete generation of the VHDL code from the PN, including the creation of entities, architectures, signals, input and output ports and the other elements needed to obtain a synthesizable VHDL description. The VHDL code generated by MakeVHDL can be saved as a file or copied and pasted into the chosen design environment.
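The three-block structure (enabled transitions, new marking, output assignment) can be prototyped in software directly from the same matrix representation MakeVHDL starts from. The sketch below is a simplified model under stated assumptions — the names are invented, and firing all enabled transitions at once ignores the conflict resolution a real net needs:

```python
# Software sketch of the three-stage structure described in the text
# (enabled transitions -> new marking -> outputs), using the matrix form
# of a Petri net. Illustrative only; not MakeVHDL's exact semantics.
def pn_step(marking, pre, post, conditions):
    """pre[p][t]/post[p][t]: tokens transition t consumes/produces at place p.
    conditions[t]: external logic condition attached to transition t."""
    places, transitions = range(len(pre)), range(len(pre[0]))
    # Stage 1: which transitions are enabled to fire.
    enabled = [conditions[t] and all(marking[p] >= pre[p][t] for p in places)
               for t in transitions]
    # Stage 2: new marking (all enabled transitions fire simultaneously;
    # conflicts over shared tokens are not resolved in this sketch).
    return [marking[p] + sum(post[p][t] - pre[p][t]
                             for t in transitions if enabled[t])
            for p in places]

# Two places, one transition moving a token from p0 to p1.
pre, post = [[1], [0]], [[0], [1]]
print(pn_step([1, 0], pre, post, conditions=[True]))   # -> [0, 1]
```

Stage 3, the output assignment, would simply be a function of the marking (and possibly the inputs), which is why the hardware version can keep it in a separate combinational block.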
In our case, to verify the code obtained and to analyze the synthesis results, we worked with Xilinx ISE 8.2i.

Fig. 2. (a) Concurrent processes. (b) Shared resources. Fig. 3. Resources reported for parallel processes. Fig. 4. Maximum frequency for parallel processes.

4. COMPARATIVE ANALYSIS OF THE SYNTHESIS RESULTS Two case studies were analyzed in which modeling with PN is more advantageous than with FSM. As a counterpart, the XST synthesis tool optimizes the implementation of designs when they are described using the format recommended for FSM. Our goal, when comparing the synthesis results from the VHDL code obtained with both models, was to measure the impact of using PN on the operating frequency and on resource usage. The analysis was based on the synthesis reports, since they are a key indicator of how the tool interprets the code. The VHDL code was obtained with MakeVHDL when working with PN, while for FSM the code was written following the two-process format proposed by XST.

4.1. Concurrent processes A PN is advantageous for modeling a system of parallel evolution composed of several processes that cooperate to achieve a common goal. Fig. 2 (a) shows the Petri diagram of two parallel processes that restart when both have finished their execution. The FSM-based model of the same problem was modularized, using one machine per process. The problem was solved for two, three, six, eight and ten processes. Fig. 3 shows the number of flip-flops (FF) and slices used in synthesis for both representation models, with the area-optimization option enabled. Fig. 4 shows the corresponding maximum operating frequency values.
In the worst case, the difference between the maximum frequency values reaches approximately 25%, which is not significant for industrial systems. Regarding resource usage, the synthesis of the Petri model uses three times as many FFs as the FSM, while the number of slices used is similar for both.

4.2. Shared resources Fig. 5. Resources reported for shared resources. Systems in which several processes share one or more resources can be represented with a PN

as shown in Fig. 2 (b), which depicts two processes A and B that share the resource R. Fig. 6. Maximum frequency for shared resources. This type of system was modeled for two, three and six processes sharing a single resource. Fig. 5 compares the number of FFs and slices inferred by XST in the synthesis process for the two representation models, while Fig. 6 shows the maximum frequency values. When shared resources are introduced, the FSM uses 50% fewer slices. As for FF usage, the comparison between the two models shows that the FSM adds one more FF than Petri for each process added.

5. CONCLUSIONS The analysis shows that using PN to model the proposed systems has a synthesis cost, in resources used and operating speed, that is in general higher than modeling with FSM. However, the methodology proposed in this work, outlined in Fig. 7, performs an automatic translation from the graphical Petri model to synthesizable code in all cases and eliminates any possibility of coding errors. Moreover, when the physical system changes, updating its PN description is notably simpler than updating its FSM one. The developed software module is based on an open-source tool and is therefore freely available. Finally, it can be concluded that, if the design requirements on chip resources are not critical, the proposed method is very convenient. Fig. 7. Automatic generation of VHDL code.

6. REFERENCES
[1] M. Uzam and A.H. Jones, "Design of a Discrete Event Control System for a Manufacturing System Using Token Passing Ladder Logic," Proc. of the CESA'96 IMACS Multiconference, Symposium on Discrete Events and Manufacturing Systems, July 1996.
[2] I. Viskic, D. Rainer, "A Flexible, Syntax Independent Representation (SIR) for System Level Design Models," 9th EUROMICRO Conference on Digital System Design (DSD'06), 2006.
[3] K. Keutzer, S. Malik, R. Newton, J. Rabaey and A. Sangiovanni-Vincentelli, "System level design: Orthogonalization of concerns and platform-based design," IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, 19 (12), Dec.
[4] R. Martínez, J. Belmonte, R. Corti, E. D'Agostino, E. Giandoménico, "Descripción en VHDL de un sistema digital a partir de su modelización por medio de una red de Petri," in Proc. V Southern Conference on Programmable Logic, Apr. 2009.
[5] L. Gomes, A. Costa, J.P. Barros, P. Lima, "From Petri net models to VHDL implementation of digital controllers," The 33rd Annual Conference of the IEEE Industrial Electronics Society (IECON), pp. 94-99, Taiwan, Nov.
[6] D. Andreu, G. Souquet, T. Gil, "Petri Net Based Rapid Prototyping of Digital Complex System," 2008 IEEE Computer Society Annual Symposium on VLSI.
[7] E. Soto, M. Pereira, "Implementing a Petri net specification in a FPGA using VHDL," Int. Workshop on Discrete-Event System Design, Przytok, Poland, June 27-29.
[8] F. Moutinho, L. Gomes, "From Models to Controllers Integrating Graphical Animation in FPGA through Automatic Code Generation," IEEE International Symposium on Industrial Electronics (ISIE).
[9] L. Gomes, J.P. Barros, A. Costa, R. Pais, F. Moutinho, "Towards Usage of Formal Methods within Embedded Systems Co-design," Proc. of the 2005 IEEE Conference on Emerging Technologies and Factory Automation, Vol. 2.
[10] N. I. Rafla, B. LaVoy Davis, "A Study of Finite State Machine Coding Styles for Implementation in FPGAs," 2006 IEEE International Midwest Symposium on Circuits and Systems.
[11] P. Bonet, C.M. Llado, R. Puijaner and W.J. Knottenbelt, "Platform Independent Petri net Editor 2," (accessed 10/10/10).

USING A WII REMOTE AND A FPGA TO DRIVE A MECHANICAL ARM TO AID PHYSICALLY CHALLENGED PEOPLE Bruno Seiva Martins, Emerson Carlos Pedrino Computing Department Federal University of São Carlos Rod. Washington Luiz, km 235, São Carlos, SP, Brazil CEP: , Caixa Postal: bruno_martins@comp.ufscar.br, emerson@dc.ufscar.br Valentin Obac Roda Department of Electrical Engineering Federal University of Rio Grande do Norte Caixa Postal Campus Universitário Lagoa Nova CEP Natal/RN - Brazil valentinobac@gmail.com

ABSTRACT The Nintendo Wii Remote videogame controller brought an innovative way of playing videogames, using simple hand movements along with simple button presses as input commands. This type of controller can be used as an input device for several applications, such as robotic control. This paper presents a way to interface the controller with Altera's DE2 development board, which contains several devices that can be controlled by the Cyclone II FPGA chip present on the board. The DE2 board is configured with the Nios II soft-core processor, and the uClinux operating system is installed on it. Communication between the board and the controller is then carried out over the Bluetooth protocol. Thus, we propose a system that can take and interpret input from the Wiimote controller to control a mechanical arm. The system is designed to be operated by physically challenged people, in order to make their lives easier.

1. INTRODUCTION The main goal of the proposed system is to take input from a controller and interpret it to drive a mechanical arm accordingly. The controller used is the Nintendo Wii Remote (or Wiimote, for short), which is connected to an interpreter system. This interpreter system is built on Altera's DE2 development board, which has several ready-to-use peripherals and is well suited for prototyping. Once interpreted, the gesture made with the Wiimote is reproduced by the mechanical arm. The mechanical arm is a Lynxmotion model AL5A.
An overview of this system is given in Fig. 1.

Fig. 1. Overview of the proposed system.

In Fig. 1, the Wiimote (1) communicates with the DE2 board (3) through the Bluetooth protocol (2); this is the system presented in this article. The board (3) then translates the commands and transmits them to the AL5A mechanical arm (5) through a serial communication link (4). The Wiimote controller has twelve buttons, an infrared camera and three accelerometers (one for each Cartesian axis). The controller also carries a low-quality speaker, and is powered by two AA batteries. The accelerometers measure in a free-fall frame of reference [1]. Fig. 2 illustrates the orientation of the axes and of the angles measured between those axes.

Fig. 2. Orientation of the three angles and axes.

The set of data produced by the controller is sent through the Bluetooth protocol to a previously paired device, which here is the system built on the DE2 board.
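For reference, the tilt angles of Fig. 2 can be recovered from the raw three-axis accelerometer readings when the controller is held still. The Python sketch below is ours, not the authors' C code; the axis convention and gravity-normalized units are assumptions.

```python
import math

def roll_pitch_from_accel(ax, ay, az):
    """Estimate roll and pitch (degrees) from a 3-axis accelerometer.

    Valid only when the controller is quasi-static, so the measured
    acceleration is dominated by gravity (the Wiimote reports values
    in a free-fall frame of reference). Axis convention is assumed:
    z up when the controller lies flat and face up.
    """
    roll = math.degrees(math.atan2(ay, az))
    pitch = math.degrees(math.atan2(-ax, math.hypot(ay, az)))
    return roll, pitch

# Controller lying flat and face up: gravity entirely on the z axis,
# so both angles are (numerically) zero.
print(roll_pitch_from_accel(0.0, 0.0, 1.0))
```

Under this convention, tilting the controller about its long axis changes the roll angle, which is the quantity used later to drive the LED demonstration.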

The DE2 board is configured with the Nios II softcore general-purpose processor, along with the needed peripherals, such as memory chips and communication port controllers. Given the overall complexity of the system, it was more efficient to build it in layers, configuring the board with an operating system (uclinux) and running the programs that deal with communication and data treatment on top of it. The mechanical arm used is a simple model manufactured by Lynxmotion, which has four degrees of freedom and takes commands over a serial link.

We did a broad search through several paper databases and found plenty of works using the Wiimote as a motion capture device for a variety of purposes. However, none of these works uses an FPGA system to gather and interpret the data. Thus, we believe that the research done to accomplish this kind of system, integrating an FPGA system with a Bluetooth device, is new.

2. PROPOSED SYSTEM

The proposed system was built using the DE2 board. Fig. 3 depicts an overview of the board with the peripherals attached and the hardware and software configured. Note that the USB HUB, Bluetooth USB adapter and USB flash drive boxes represent physical devices, while the C software, uclinux OS and Nios II processor boxes represent logical layers, as they are either stored in the SDRAM memory chip (the OS and the C programs) or configured in the FPGA chip (the processor).

Fig. 3. Overview of the proposed system.

A USB HUB was attached to the USB port to allow the use of multiple USB devices. One of these devices is the Bluetooth USB dongle, responsible for giving Bluetooth capabilities to the board. The other is a USB flash drive, which was used to store the many versions of the test program. The SDRAM memory was loaded with the uclinux operating system, so the system could run C-coded programs benefiting from an operating-system environment.

The Cyclone II FPGA chip was configured with the Nios II processor, and all other board devices needed were attached to the internal bus. This process was done within the Quartus II software provided by Altera. The focus of the proposed system was to establish communication with the Wiimote controller, therefore only a few devices were required. The process of choosing and connecting modules using the SOPC Builder tool inside Quartus II is described in [2]. The main devices used by the system were: the FPGA chip, to host the Nios II processor; the SDRAM memory chip, which is loaded with the uclinux operating system; the USB controller, to attach the Bluetooth adapter and the flash drive; and the serial UART controller, to drive the mechanical arm. The Nios II processor version used was the fast (/f) core, which provides the best performance but costs more FPGA resources [3].

After setting up the system hardware, the uclinux distribution was installed. A Linux system was chosen because it is open source, highly configurable and actively maintained by its developers; it also allows one to read the full source code and make suitable modifications. The version of uclinux-dist used in the system is from July 30th, 2009, and is hosted at the Nios II Community's ftp site [4]. This distribution was made by the community of Nios II users and targets Altera boards (including the DE2 board). The whole set of tools and source code is called uclinux-dist.

The uclinux compilation parameters are set using the make menuconfig command, which opens a screen listing all the programs, libraries and options available to compile with or change. All of the settings are divided in two categories: Kernel Settings and Application/Library Settings. Once the configuration is finished, a simple make command will start compiling the source code into an image file. Fig. 4 shows a typical configuration menu screen.

Fig. 4. Typical uclinux-dist configuration screen. Here it is possible to enable Bluetooth support in the kernel.

With the board attached to the computer, the image is uploaded through the Nios II Embedded Design Suite (EDS) software, but only after the .sof file is configured. The process is illustrated in Fig. 5.

Fig. 5. Upload flow. Both files are uploaded using Nios II EDS, first the .sof and then the zImage.

Using the same program (Nios II EDS), it is possible to see the output from the board and give input to it, after it is properly configured. Fig. 6 shows the initial screen once the OS is booted.

Fig. 6. Screen after boot.

After testing the default configuration, several settings were modified in order to support the devices of the board and the software to be run on the operating system. We enabled kernel support for the specific USB devices (USB flash drives and the USB Bluetooth adapter), the Bluetooth protocol and FAT filesystems. Other settings enabled by default were disabled, like network communication via the TCP/IP protocol, given that it would have no use for us. Several modules were also disabled, such as the ifconfig and dhcp configuration tools, among others, to save some space. We had to add some configuration tools to deal with Bluetooth communication and Bluetooth code compilation, namely the BlueZ library tools [5], and that came to be one of our great difficulties. The BlueZ library has three main versions, and only the oldest one would run on our system. Some source code analysis and rewriting were needed to get the tools compiled and running successfully.

Once Bluetooth was up and running, we searched for Wiimote libraries that we could use, and found one called Wiiuse [6], which, like BlueZ, is open source. This library is well documented and well written, allowing painless modifications to be made. With Wiiuse, we were able to run the sample program included with it and modify it to perform some tests.

3. RESULTS

Several results were produced by the proposed system. First, we were able to read the data of the Wiimote on the screen, in real time, showing, for instance, the readings of the accelerometers (Fig. 7) and the buttons pressed (Fig. 8).

Fig. 7. Real-time data from the Wiimote, showing the current accelerometer data.

Fig. 8. Real-time data from the Wiimote, showing the buttons currently pressed.

Then, we tested the integration between the software and the board hardware by lighting up LEDs according to the rotation of the controller. We wrote a program that retrieved the accelerometer data of the X axis and displayed it using eight red LEDs. The result is shown in Figs. 9 through 11.
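The essence of this demonstration is a mapping from the measured tilt to one of the eight red LEDs. A minimal Python sketch of such a mapping follows; the paper does not give the exact thresholds used, so the linear binning below is an assumption of ours, chosen only to match the endpoints reported in Figs. 9 and 11 (90 degrees left lights the 8th LED, 90 degrees right the 1st).

```python
def led_for_tilt(angle_deg, n_leds=8):
    """Map a tilt angle in degrees (-90 = turned left, +90 = turned right)
    to the index of a single LED to light, 1 = rightmost .. n_leds = leftmost.
    Linear binning is an assumption, not the authors' exact scheme."""
    a = max(-90.0, min(90.0, angle_deg))          # clamp to usable range
    bin_ = int((a + 90.0) / 180.0 * (n_leds - 1) + 0.5)  # 0 .. n_leds-1
    return n_leds - bin_                          # invert: left -> high index

print(led_for_tilt(-90))  # -> 8 (leftmost LED)
print(led_for_tilt(90))   # -> 1 (rightmost LED)
```

On the board, the returned index would be converted into a one-hot bit pattern written to the red-LED output register.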

Fig. 9. With the controller turned 90º to the left, the leftmost LED turns on (8th LED).

Fig. 10. With the controller just past the center position (when it is facing up), the corresponding LED turns on (3rd LED).

Fig. 11. As expected, when the controller is turned 90º to the right, the rightmost LED turns on (1st LED).

The proposed system is currently under development. The mechanical arm that will be driven by the system is under study, and the software to control it is being developed. Our goal is to allow physically challenged people to control a robotic arm with ease, in order to make the arm perform simple tasks, like pushing a heavy object around or reaching normally out-of-reach objects. Although this idea is not new [7], the use of an FPGA to gather the data produced by the Wiimote and to drive the mechanical arm is original. The use of embedded devices for such a task instead of personal computers represents a new branch of research, allowing real-time responsiveness and portability for the system.

4. CONCLUSIONS

The DE2 evaluation board is very versatile, as it allows virtually any kind of system to be implemented. One of its strengths is the variety of devices, already wired and ready to be used, which gives great prototyping power to the developer using the board. This power was used in our system, which interprets commands from the Nintendo Wiimote videogame controller and is able to drive a mechanical arm attached to the board according to the commands received. The integration accomplished by our work can be used as a base for the development of a number of systems that can benefit from Wiimote input. The use of embedded systems for this job may be more efficient, given the real-time capabilities of such systems compared to general-purpose personal computers. Our work's ultimate goal is to enable this system to be used smoothly by physically challenged people in order to make their lives easier, so that they can control a mechanical arm to do everyday tasks, such as moving relatively heavy objects or reaching objects stored too high.

5. ACKNOWLEDGEMENTS

We wish to thank FAPESP (Fundação de Amparo à Pesquisa do Estado de São Paulo) for its financial and institutional support to this research, registered under process number 2010/ . Emerson C. Pedrino is also grateful to FAPESP for process 2009/ .

6. REFERENCES

[1] Wiimote, Wiibrew. Retrieved on October 14th.
[2] J. O. Hamblen, T. S. Hall, M. D. Furman, "Tutorial IV: Nios II Processor Hardware Design," in Rapid Prototyping of Digital Systems, SOPC Edition, Springer, 2008.
[3] Altera's Embedded Processors. Accessed Oct 14. [Online]. Available: index.html
[4] Nios II Community FTP. Accessed Oct 14. [Online].
[5] BlueZ. Accessed Oct 15. [Online].
[6] wiiuse, The Wiimote C Library. Accessed Oct 15. [Online].
[7] C. Smith, H. I. Christensen, "Wiimote Robot Control Using Human Motion Models," The 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems, St. Louis, USA, 2009.

SYSTOLIC MATRIX-VECTOR MULTIPLIER FOR A HIGH-THROUGHPUT N-CONTINUOUS OFDM TRANSMITTER

Enrique Mariano Lizárraga, CONICET - F.T.yCs.Ap., Universidad Nacional de Catamarca, 4700 Catamarca, Argentina. emlizarraga@conicet.gov.ar
Victor Hugo Sauchelli, F.C.E.F.yN., Universidad Nacional de Cordoba, 5000 Cordoba, Argentina. vsauch@com.uncor.edu

ABSTRACT

Digital systems frequently face high-speed operation requirements, and their processing is often based on arithmetic operations. In this work, we consider a strongly resource-demanding application, matrix-vector multiplication, in the context of N-Continuous OFDM signal generation. We propose a systolic architecture which performs operations in parallel and reduces the processing time according to a design parameter. Our results show the benefits derived from careful critical-path treatment, and an attractive simplicity owing to circular data shifting in the systolic approach.

1. INTRODUCTION

High-speed operation is a necessary feature in current VLSI designs; the bottleneck in supporting certain applications is frequently the timing performance. Such applications may be found in devices ranging from smartphones to complex computers. We therefore focus on an efficient architecture for digital signal processing. This area has enabled applications such as high-data-rate wireless communications [1], image recognition [2], and biomedical processing [3]. In these designs, matrix operations are frequently required, but they imply a large number of elemental (scalar) multiplications, and even simple scalar multipliers are resource demanding [4]. Therefore, area, power and speed costs increase sharply for the global system if the design is not optimized. In this work, we consider a matrix-vector multiplication where the matrix elements are constant, so they can be stored in a memory beforehand.
This case may be found in several applications such as generalized DFTs, coordinate rotations, coding, etc. We base the design on a multipliers bank that enables parallel processing; in this way, the operation time is reduced by a design parameter. We present particular consequences of the parallelism concept and a detailed description of the architecture. The proposed architecture is tested for a wireless communication transmitter, showing good performance and achieving the required bandwidth.

The rest of the paper is organized as follows. In Section 2 we present the fundamentals of the matrix-vector multiplication algorithm and discuss alternatives; in Section 3 the architecture is explained and delay considerations are given; Section 4 presents results obtained from simulation and synthesis for FPGA. Conclusions are given in Section 5.

2. ALGORITHM BASICS

2.1. Matrix-Vector Multiplication Fundamentals

Let M be an N x N matrix, and v an N x 1 column vector. Then, the matrix-vector multiplication result is allocated in the column vector r with dimension N x 1, as defined in

r_i = \sum_{j=0}^{N-1} M_{ij} v_j,   i = 0, ..., N-1,   (1)

where M_{ij} and v_j are the elements of M and v, respectively. From this expression we derive the requirement of solving N^2 individual multiplications to complete the operation. Although in many cases these may be complex multiplications, the results obtained in this work remain valid.

Focusing on the computation of the result, on one hand, a simple approach is to use combinatorial logic, but the area requirement is high and a slow clock is necessary. On the other hand, an alternative is to solve every multiplication sequentially; however, a large number of clock cycles will then be required to complete the processing. In addition, many architectures for real- or complex-number multipliers present latency, which may reduce the global performance. Pipelining may be included to mitigate the delay drawback.
Even so, the global processing time is still governed by N^2. The benefit of this approach, however, is the use of only one multiplier. The processing time is then given by

T_{seq} = \rho N^2 / f,   (2)

where \rho represents the processing time (in cycles) of the elemental multiplier, which may be fixed at one if pipelining is applied, and f represents the system clock rate.

The N-Continuous technique has been proposed for out-of-band power reduction in OFDM systems [5]; it is based on a correction vector obtained by means of a matrix-vector multiplication. We analyze the sequential operation in the correction calculation. In this case, N is defined by the number of subcarriers of the OFDM system. Based on the 3GPP E-UTRA/LTE specification for wireless communications, N = 300 is chosen [6]. A typical clock rate for wireless communication architectures implemented in FPGA is considered, f = 50 MHz. According to a typical complex multiplier scheme, which operates in four cycles, straightforward pipelining is considered, \rho = 1. So, the global processing time is T_{seq} = 1.8 ms. If a Guard Interval (GI) is applied with fraction GI = 22/300 [6], the transmitter can achieve a final bandwidth of at most

B = (1 + GI) N / T_{seq} = 179 kHz.   (3)

Unfortunately, this bandwidth is less than the one specified in [6]. Also, a practical OFDM transmitter includes other operations that can further reduce the achievable speed.

2.2. Parallel Operation Performance

As in the application considered, the timing requirement may not be achievable using the single-multiplier scheme discussed above. An alternative is to use K multipliers and synchronize them for simultaneous operation. Based on this approach, (1) can be turned into an L-element addition

r_i = \sum_{k=0}^{L-1} r_{i,k},   i = 0, ..., N-1,   (4)

where L = N/K, and r_{i,k} represents the k-th partial addition

r_{i,k} = \sum_{j=0}^{K-1} M_{i,j+kK} \, v_{j+kK},   i = 0, ..., N-1.   (5)

The matrix element selection in column-sense is obtained from j + kK, where k = 0, ..., L-1. The model in (4) thus indicates L steps and K elemental multipliers for the proposed architecture.
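The decomposition in (4)-(5) is easy to sanity-check in software. Below is a small Python sketch (ours, not the paper's VHDL) that accumulates L partial sums of K products each and compares the result against the direct definition in (1); contiguous length-K blocks (offset kK) are assumed for the fraction indexing.

```python
def matvec_blocked(M, v, K):
    """Compute r = M.v by accumulating L = N/K partial sums of K products,
    mirroring eqs. (4)-(5): in the hardware the K multiplications of each
    partial sum run in parallel, so only L steps are needed per output."""
    N = len(v)
    assert N % K == 0, "sketch assumes K divides N"
    L = N // K
    r = []
    for i in range(N):
        r_i = 0
        for k in range(L):                          # L sequential steps
            r_i += sum(M[i][j + k * K] * v[j + k * K]  # K parallel multipliers
                       for j in range(K))
        r.append(r_i)
    return r

# Small check against the direct definition r_i = sum_j M_ij * v_j
M = [[i + j for j in range(6)] for i in range(6)]
v = [1, -2, 3, 0, 2, 1]
direct = [sum(M[i][j] * v[j] for j in range(6)) for i in range(6)]
assert matvec_blocked(M, v, K=2) == direct
```

The inner generator is what the K-multiplier bank computes in one clock cycle; the outer loop over k is the L-cycle sequential part.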
The processing time for the complete calculation is reduced according to the parameter K, following the expression N^2/K as

T_{par} = \rho N^2 / (K f).   (6)

For the N-Continuous OFDM transmitter described above, we can express the bandwidth as

B = (1 + GI) N / T_{par}.   (7)

By selecting 32 parallel multipliers, K = 32, the obtained bandwidth is 5.72 MHz, according to the resulting processing time of 56.25 µs for the matrix-vector multiplier. These values achieve the bandwidth required in [6]. Also, the processing period for the matrix-vector multiplier does not cover the whole OFDM symbol duration; hence, a fraction of the symbol transmission period may be used for other OFDM-specific processing.

3. ARCHITECTURE DESIGN

A processing unit fed by the vector v is considered. We suppose that the elements of the matrix M have been previously stored in an internal memory. The objective, then, is to present the result of the calculation as fast as possible. According to (5), we can build a parallel multipliers bank composed of K elemental multipliers. Since each one has two inputs, a and b, we can join all the multiplier inputs to form two buses, A and B, which are fed by v and M, respectively. This scheme is depicted in Fig. 1.

Fig. 1. Functional Diagram.

According to the parallel concept, bus A is fed by the k-th fraction of the vector v in each cycle; the complete load of v thus requires L cycles. As stated in (5), this fraction of v is multiplied by the k-th fraction of the i-th row of M. Note that the multipliers bank output represents the elements to be summed in (4); then r_{i,k} is computed. Since the values of r_{i,k} are generated sequentially, it is necessary to store each one of the set k = 0, ..., L-1. The value of r_i may be computed by means of a new addition once the L partial additions are obtained.
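The timing and bandwidth figures quoted above follow directly from (2)-(3) and (6)-(7); a few lines of Python reproduce them, with all constants taken from the paper (N = 300, f = 50 MHz, GI = 22/300, rho = 1, K = 32).

```python
# Reported figures for the N-Continuous OFDM case study (eqs. (2)-(3), (6)-(7)).
N, f, rho = 300, 50e6, 1          # subcarriers, clock rate [Hz], cycles/multiply
GI = 22 / 300                     # guard-interval fraction from the LTE profile

T_seq = rho * N**2 / f            # one multiplier: 1.8 ms
B_seq = (1 + GI) * N / T_seq      # ~179 kHz, below the LTE requirement

K = 32                            # parallel multipliers
T_par = rho * N**2 / (K * f)      # 56.25 us
B_par = (1 + GI) * N / T_par      # ~5.72 MHz, meeting the requirement

print(T_seq, B_seq, T_par, B_par)
```

Note how the bandwidth scales linearly with K, which is exactly the trade-off the architecture exploits.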
A special consideration is the need to feed the two K-length vectors represented by {M_{i,j+kK}} and {v_{j+kK}}, for j = 0, 1, ..., K-1, in a simultaneous way. This requirement allows the calculation of every element of r in only L clock cycles. However, this configuration implies special memory units to accomplish the described behavior for v on the input bus. It is also observed that after L cycles, each fraction of the vector v, i.e. {v_{j+kK}} for j = 0, 1, ..., K-1, is required again for calculation. We can therefore feed it back by means of a simple circular buffer connected to A. This is a consequence of the systolic approach in the proposed system and allows an important simplification of the design.

If we select R bits of resolution, then the buses A and B must be sized as R x K bits. In turn, bus B needs to be connected to a memory where N^2/K words of R x K bits are allocated to completely represent M. This analysis remains valid whether fixed-point or floating-point number representation is used. For complex number representation, the storage units and the adder stages may process real and imaginary parts independently.

3.1. Data Propagation Optimization

While the former section presented system-level requirements, this section discusses the datapath of the design. If we consider the r_{i,k} calculation, the implementation may be based on tree adders, as shown in Fig. 2: we can use K-1 two-input adders and finally obtain r_{i,k} through a cascaded connection. Unfortunately, it was shown that this scheme may strongly affect the global timing performance because of the critical path extension. This is a consequence of the extensive combinatorial logic inferred by the adders, which defines a long propagation path. According to [4], the critical path in a VLSI circuit is established by the latch interconnections. Then, as K increases, more combinatorial adders are inserted into a two-latch path and the performance becomes poor. Although the technique in the previous section was to increase K to improve the speed performance, it is possible to obtain the opposite effect because of the critical path extension. An appropriate parameter selection criterion may be stated: on one hand, K may be chosen as large as possible, limited by the area resources.
On the other hand, this selection affects the speed performance negatively if the critical path extension becomes too high. This behavior appeared in the wireless communication application considered, for K = 32. Nevertheless, other (N, K) settings may be located at a beneficial point of the parameter space. Based on [4], we include flip-flops which interrupt the propagation paths and shorten them, producing a pipelined architecture for our design. Although delay cycles are introduced into the system as a result of this technique, the global performance is improved, since K is sufficiently large and the operation of the transmitter is periodic.

3.2. Final Settings

The complete design is depicted in Fig. 3, where a pipelined architecture is used for Adder 1. This way, the speed bottleneck imposed by high values of K, desired for high-parallelism operation, is removed and the required operation frequency is achieved.

Fig. 2. Cascaded addition for Adder 1.

In this scheme, K-2 delay units are inserted into the connections between the multipliers bank output and the elemental adders of the tree. The delay value is fixed at one for the output at position K-3 and is increased by one as the position decreases down to position 0 in Fig. 3. This tree in Adder 1 represents the simplest approach and achieves good performance in the mentioned application; however, it may be improved further by defining a symmetric tree adder [4]. In our design, r_{i,k} is available K-2 cycles after the multipliers bank produces its output. In turn, once the first r_{i,k} element is calculated, it must be stored until the complete set for k = 0, ..., L-1 is available. As K increases, L becomes lower. Then, if the value of L is small, we can synchronize the r_{i,k} contributions to r_i by means of a new set of delay units without affecting the area performance; otherwise, a memory-based subsystem may replace them, and addressing logic needs to be appended.
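This register-balancing idea can be modeled in software. The cycle-accurate Python sketch below is ours (not the paper's VHDL); it assumes one register after every adder of a K-operand chain, with per-input delay lines keeping operands aligned, so the sum emerges K-2 cycles later while a new input vector is still accepted every cycle.

```python
def pipelined_chain_adder(vectors, K):
    """Cycle-accurate model of a K-operand chain adder with one register
    after every adder (Adder 1 style). Input position p sits behind a
    delay line of K-2-p registers so operands stay aligned; the sum of
    the vector accepted at cycle t emerges K-2 cycles later, while the
    throughput stays at one vector per cycle."""
    assert K >= 3 and all(len(v) == K for v in vectors)
    T = len(vectors)
    vs = list(vectors) + [[0] * K for _ in range(K)]   # flush the pipeline
    stage = [0] * (K - 1)        # pipeline registers, one after each adder
    outputs = []
    for t in range(len(vs)):
        nxt = [0] * (K - 1)
        # first adder: positions K-1 and K-2 of the vector fed this cycle
        nxt[0] = vs[t][K - 1] + vs[t][K - 2]
        # adder s (s >= 2): previous stage register plus input position
        # K-1-s, seen through a delay line of s-1 cycles
        for s in range(2, K):
            tv = t - (s - 1)
            x = vs[tv][K - 1 - s] if tv >= 0 else 0
            nxt[s - 1] = stage[s - 2] + x
        outputs.append(nxt[-1])  # register after the last adder
        stage = nxt
    return outputs[K - 2 : K - 2 + T]

# One result per cycle after the K-2-cycle latency:
print(pipelined_chain_adder([[1, 2, 3, 4], [10, 20, 30, 40]], K=4))  # -> [10, 100]
```

The critical path in this model is a single two-input addition regardless of K, which is precisely the point of breaking the cascade with flip-flops.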
In the case of delay units, their stored values can be represented as

\begin{bmatrix}
r_{i,0} & r_{i,1} & \cdots & r_{i,L-2} & r_{i,L-1} \\
X & r_{i,0} & \cdots & r_{i,L-3} & r_{i,L-2} \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
X & X & \cdots & r_{i,0} & r_{i,1} \\
X & X & \cdots & X & r_{i,0}
\end{bmatrix}   (8)

where each column represents a different clock cycle t_0, t_1, ..., t_{L-1}, from left to right; so it is a classical S/P (serial-to-parallel) unit. After this operation, we use a new tree adder fed by the entire set r_{i,k} in parallel. In this case, the propagation path extension does not significantly affect the performance because of the small value of L. Nevertheless, a more sophisticated critical-path treatment is still possible: depending on whether a specific system achieves its timing constraints or not, a pipelined tree adder similar to Adder 1 may replace Adder 2.

Fig. 3. Complete matrix-vector multiplier architecture.

According to the error computation for the analyzed OFDM transmitter, where fixed-point number representation is chosen, 2R bits are used for real and imaginary parts independently at the adder inputs; thus, truncation is not applied at the multipliers output. Based on numerical simulation, the adder outputs are defined as 2R-bit. A truncation unit is placed in the last stage, and the output bus represents the results in R bits for real and imaginary parts, independently.

4. SIMULATION RESULTS

The proposed architecture has been tested on an Altera EP2C70F672C6 device, for which a VHDL specification was developed. Debugging was performed by means of a fixed-point simulator built in Matlab, complemented by a special unit for connecting the test board to a PC through an Ethernet port. The final performance is summarized in Table 1.

Table 1. Synthesis Results (Resource / Utilization %): LEs, LABs, Registers, Memory Bits, Hardware Multipliers.

These values were obtained with standard synthesis effort in the Quartus II software, and the maximum operation frequency is found to be 50.1 MHz.

5. CONCLUSION

Motivated by the requirement of high-speed processing in arithmetic calculation units, we focused on the matrix-vector multiplication problem. Several implementation considerations were analyzed for the case of parallel elemental multipliers. Although these multipliers may be real or complex, and may accept fixed-point or floating-point number representations, the presented architecture remains valid. The proposed scheme allows reducing the processing time from N^2 to N^2/K clock cycles. The design was tested for N-Continuous OFDM signal generation, and the performance achieved is sufficient for implementing an N-Continuous OFDM transmitter following the LTE standard.

6. REFERENCES

[1] T. Onizawa, A. Ohta, and Y. Asai, "Experiments on FPGA-implemented eigenbeam MIMO-OFDM with transmit antenna selection," IEEE Transactions on Vehicular Technology, vol. 58, no. 3, March.
[2] P.-Y. Chen, C.-Y. Lien, and C.-P. Lu, "VLSI implementation of an edge-oriented image scaling processor," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 17, no. 9, Sept.
[3] L. Androuchko and I. Nakajima, "Developing countries and e-health services," in Enterprise Networking and Computing in Healthcare Industry (HEALTHCOM), Proc. 6th International Workshop on.
[4] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation. Wiley.
[5] J. van de Beek and F. Berggren, "N-continuous OFDM," IEEE Communications Letters, vol. 13, no. 1, pp. 1-3.
[6] Physical Channels and Modulation (Release 8), 3GPP Std. TSG RAN TS, v8.4.0.

SYNTHESIS OF THE HARTLEY TRANSFORM WITH A HADAMARD-BASED MATRIX ARCHITECTURE

Gilson J. Alves, Member, IEEE, and Edval J. P. Santos, Senior Member, IEEE
Laboratory for Devices and Nanostructures, Electronics and Systems Department, Universidade Federal de Pernambuco. Rua Academico Helio Ramos, s/n, Varzea, Recife, PE, Brasil.

ABSTRACT

Hadamard matrices are used to synthesize the Hartley transform. This approach allows the implementation of the Hartley transform in a scalable format. For comparison, the transform has also been implemented via the matrix definition. Tests were carried out with a vector simulating the input signal, and the outputs of both implementations are compared. The FPGA device used is a Xilinx Spartan 3E, XC3S500e.

1. INTRODUCTION

Integral transforms, especially the Fourier transform, play an important role in several engineering fields, with special emphasis on Digital Signal Processing (DSP) in optics, voice and image recognition, and telecommunications [1]. Application examples are image compression [2], content-based image retrieval [3], ADSL modems [4] and multiple-access communication systems (CDMA) [5]. However, they have the disadvantage of requiring a large hardware area to be implemented. The Hartley transform is an integral transform closely related to the Fourier transform, with the advantage that its result for a real input signal contains no complex numbers, which implies simpler arithmetic [6, 7, 8] that can be implemented in smaller areas.

There are several algorithms which can be used to implement integral transforms. The Least Mean Squares (LMS) algorithm has been widely used in DSP [9]; however, it suffers from high computational complexity, and many techniques have been proposed to reduce it. Merched et al. [10] introduced an implementation of the Hartley transform using LMS (HBNLMS).
The HBNLMS involves extending the data matrices to circulant symmetric matrices and then using the Hartley transform to diagonalize them, with the purpose of finding the commonality between adaptive filters using multidelay concepts and those using filterbanks. Efficient implementations of the Discrete Hartley Transform of length N (N-DHT) have been proposed, such as the PFA (Prime Factor Algorithm), where N is decomposable into prime factors, presented in [11, 12], and the WFTA (Winograd Fourier Transform Algorithm), presented in [13]. Marchesi [14] describes the N-DHT using CORDIC processors and systolic shuffle units, when N is a power of 2, but this implementation uses a large area and is slow. More recently, H. M. de Oliveira and Renato S. Cintra have proposed the use of a Hadamard-matrix-based architecture to implement the Hartley transform in a more scalable format. This is the approach this paper selects to synthesize. (The authors thank professor H. M. de Oliveira for suggesting the implementation of this algorithm.) The implementation was analyzed with the Simulink MatLab tool, and synthesized with Xilinx ISE.

The paper is divided into five sections: the first is this introduction; the second presents a brief overview of the Hartley transform; the third presents the Hadamard-based matrix architecture; the fourth describes the methodology for implementation, with specification, HDL description, simulations and synthesis. Another implementation of the 16-DHT is presented, via the Hartley matrix definition; tests are carried out with both implementations, the Hadamard-based one and the matrix definition, and the results are compared. The last section is the conclusion.

2. A BRIEF HARTLEY TRANSFORM OVERVIEW

The Hartley transform is an integral transformation that maps a real-valued temporal or spatial function into a real-valued frequency function via the kernel cas(\omega x) = cos(\omega x) + sin(\omega x).
This symmetrical formulation of the traditional Fourier transform, attributed to Ralph Vinton Lyon Hartley in 1942, leads to a parallelism between the function of the original variable and that of its transform [7]. The transform remained in a quiescent state for over 40 years, until it was rediscovered by Bracewell [7].

2.1. Definition

The Hartley transform of a function f(x) can be expressed as the pair

$H(\omega) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} f(x)\,\mathrm{cas}(\omega x)\,dx$  (1)

$f(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} H(\omega)\,\mathrm{cas}(\omega x)\,d\omega$  (2)

where the angular frequency variable $\omega$ is related to the frequency variable f by $\omega = 2\pi f$, and the integral kernel is the cas function, defined as $\mathrm{cas}(t) = \cos(t) + \sin(t)$.

Fig. 1. The self-inverse transform.

2.2. Hartley and Fourier Transforms

The Hartley transform is closely related to the Fourier transform, and this relationship can be expressed as follows. The Fourier transform is

$F(\omega) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} f(x)\,e^{-j\omega x}\,dx$  (3)

From equation (3), it is easily shown that the Hartley transform of equation (1) can be rewritten as

$H(\omega) = \mathrm{Re}\{F(\omega)\} - \mathrm{Im}\{F(\omega)\}$  (4)

The Hartley transform (HT) has structural advantages over the Fourier transform (FT): it is a real transform, and it has the self-inverse property, as can be seen in Fig. 1. So, when resources are scarce, the HT can be a way to solve the problem: a single algorithm suffices to compute both the direct and the inverse Hartley transform.

2.3. The Discrete Hartley Transform

The discrete version of the Hartley transform for discrete signals of length N is the Discrete Hartley Transform of length N (N-DHT), defined by [6] as the pair

$H_k = \sum_{n=0}^{N-1} h_n\,\mathrm{cas}\!\left(\frac{2\pi kn}{N}\right), \quad k = 0, 1, \ldots, N-1$  (5)

$h_n = \frac{1}{N} \sum_{k=0}^{N-1} H_k\,\mathrm{cas}\!\left(\frac{2\pi kn}{N}\right), \quad n = 0, 1, \ldots, N-1$  (6)

with $\mathrm{cas}(t) = \cos(t) + \sin(t)$. The existence of fast algorithms for computing the discrete transforms (FTAs) is one of the main reasons for their application [15]. Fast Hartley transforms are central to N-DHT applications [16], and so the N-DHT has been an efficient tool. The transform of equation (5) can be expressed with a matrix linear operator as

$H = \mathbf{H}_N \cdot h$  (7)

where $\mathbf{H}_N$ is the Hartley matrix of length N, whose elements are $h_{kn} = \mathrm{cas}(2\pi kn/N)$. For this work, N = 16, and equation (7) becomes

$[H_0, H_1, H_2, \ldots, H_{15}]^T = \mathbf{H}_{16} \cdot [h_0, h_1, h_2, \ldots, h_{15}]^T$  (8)

3. THE HADAMARD-BASED MATRIX ARCHITECTURE

If equation (8) is used for a direct implementation of the DHT, the area problem cited in the introduction remains. Thus, it was used in the second approach only for comparison with the first one, the Hadamard-based matrix architecture. For better results, fast transform algorithms (FTAs) are commonly used, because FTAs approach minimal multiplicative complexity; among other measures, multiplicative complexity is one of the main criteria used to analyze the efficiency of an algorithm. From a mathematical point of view, there are basically three methods for reaching better transform algorithms: index rebuilding, matrix operations, or the use of the convolution theorem [17]. This article uses the second technique. The DHT and the DFT of a real discrete signal $s_i$ ($i = 0, 1, \ldots, N-1$), where s = f with respect to Fourier and s = h with respect to Hartley, can be written as DFT: $f_i \rightarrow F_k$ and DHT: $h_i \rightarrow H_k$. A relationship between the DHT and the DFT is expressed by

$F_k = \frac{1}{2}\left[(H_k + H_{N-k}) - j(H_k - H_{N-k})\right]$  (9)
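The identities above can be checked numerically. The sketch below (NumPy; an illustrative check, not the paper's Matlab/VHDL implementation) computes the DHT directly from equation (5), then verifies the discrete counterpart of equation (4), the DFT relationship (9), the self-inverse property, and the half-period cas symmetry that the layer decomposition of section 3 relies on:

```python
import numpy as np

def dht(x):
    """Discrete Hartley Transform, straight from equation (5)."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    n = np.arange(N)
    k = n.reshape(-1, 1)
    cas = np.cos(2 * np.pi * k * n / N) + np.sin(2 * np.pi * k * n / N)
    return cas @ x

x = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0])
N = len(x)
H = dht(x)
F = np.fft.fft(x)

# Discrete counterpart of equation (4): H_k = Re{F_k} - Im{F_k}
assert np.allclose(H, F.real - F.imag)

# Equation (9): F_k = (1/2)[(H_k + H_{N-k}) - j(H_k - H_{N-k})], indices mod N
Hrev = H[(-np.arange(N)) % N]
assert np.allclose(F, 0.5 * ((H + Hrev) - 1j * (H - Hrev)))

# Self-inverse property: applying the same kernel twice recovers x (up to 1/N)
assert np.allclose(dht(H) / N, x)

# Half-period symmetry of the cas kernel: shifting n by N/2 flips sign for odd k
cas_ = lambda t: np.cos(t) + np.sin(t)
n = np.arange(N)
for k in range(N):
    assert np.allclose(cas_(2 * np.pi * k * (n + N // 2) / N),
                       (-1) ** k * cas_(2 * np.pi * k * n / N))
```

Because one routine computes both the direct and the inverse transform, a single hardware block suffices, which is exactly the resource argument made above.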

The Turbo Hartley Transform (THT) for short block lengths, presented by De Oliveira, Cintra and Campello [18], was used for this approach. In this method, the technique of decomposition into matrix layers is used [17]. In equation (7), note that

$\mathrm{cas}\!\left(\frac{2\pi k(n + N/2)}{N}\right) = \mathrm{cas}\!\left(\frac{2\pi kn}{N} + \pi k\right) = (-1)^k\,\mathrm{cas}\!\left(\frac{2\pi kn}{N}\right)$  (10)

Applying this to the Hartley matrix $\mathbf{H}_N$ with N = 16, according to the method shown by De Oliveira et al. [18] and Cintra [17], the 16-DHT can be computed via four intermediate matrix layers. Each layer is derived from the matrix representation of the previous layer, and all the arithmetic operations already executed in the previous layer can be reused in the next one. Thus, every operation is computed only once and reused whenever it is needed. If the N-DHT scheme is known, it can be used for computing the (N+1)-DHT; that is, the (N+1)-DHT scheme encapsulates the N-DHT scheme, as can be seen in Fig. 4. The result is that the effort for computing a DHT is reduced to a much more concise block of operations.

4. DESIGN METHODOLOGY

The design was developed in the following steps: specification, HDL description, behavioral simulation, and hardware implementation. The next subsections refer to the first approach, the Hadamard-based matrix architecture.

4.1. Specification

The project aims to implement the Discrete Hartley Transform of length 16 (16-DHT) in a low-cost FPGA module, in accordance with Fig. 4 and Fig. 3. Fig. 2 summarizes the design conception. The discrete Signal-In is a vector of sixteen 14-bit samples. In serial mode, the samples are stored in an entry memory. The memory vector is then multiplied by the elements of the Hartley transformation matrix, in a 4-layer operation according to the scheme of Fig. 4 and the explanation in the previous section (3).
The 16-DHT response is the discrete Signal-Out, a vector of length 16, where each component is represented as a 12-bit word.

4.2. HDL Description

The Matlab Simulink software can be used to simulate a Hartley transform execution system [19]. In this design, Matlab was used to implement the length-16 Hartley matrix with a Hadamard-based matrix architecture, which was then converted into Simulink blocks, as shown in Fig. 4. The computation of the Hartley transform is executed according to the description in section 3, summarized in Fig. 4, adapted from [17]; Fig. 2 shows the design conception. The HDL code was generated in VHDL to realize the 16-DHT design.

Fig. 2. Design conception.

4.3. Behavioral Simulation

The simulation was run to check the system response. The simulation environment was ModelSim-XE together with the Xilinx ISE tool. Tests were carried out with a vector simulating the input signal, in two situations. In the first, the input signal simulates a rectified sine wave; the response is shown in Fig. 6 and Fig. 7. The Signal-In vector (entrada th) is an integer approximation of a sine wave with amplitude 10 and positive rectification, and the response, Signal-Out (saida th), is presented as integer numbers due to tool limitations. In the second situation, the input signal is a gate function; the response is shown in Fig. 5. The time to compute the 16-DHT of a signal is 3 us, which makes it feasible for a range of applications, such as audio and image processing [20].

4.4. Hardware Implementation

As the simulation results were as expected, synthesis was executed on a Xilinx Spartan-3E, XC3S500E-4fg320, via Xilinx ISE 11.1, after a previous RTL (Register Transfer Level) generation. Due to the characteristics of the test platform used, the auxiliary clock frequency was set to 6.25 MHz, but it can be set up to 50 MHz, depending on the FPGA platform. For comparison purposes, another synthesis of the 16-DHT was carried out via its matrix-definition algorithm, according to equation (7). In this situation, although the response is the same, the hardware consumption is a strong disadvantage, as can be seen in Table 1, where the hardware characteristics are summarized.

Fig. 3. 16-DHT block conception.

Table 1. 16-DHT: device utilization summary.

                              Matrix definition        Hadamard-based
  Logic          Available    Used    Utilization      Used    Utilization
  Slices            4656       -          -              -         -
  Slice FF          9312       -          -             174        1%
  4-input LUT       9312       -          -              -         -
  Bonded IOB         232       -          -              17        7%
  MULT18X18SIO        20       -          -              13       65%
  GCLK                24       -          -               2        8%

The synthesis response for the 16-DHT of a rectified sine wave computed in MatLab (Fig. 7) is the discrete signal presented in Fig. 8.

5. CONCLUSION

The Hadamard-based matrix implementation is useful for a wide range of applications. This paper presented a methodology for fast implementation of the discrete Hartley transform with a Hadamard-based matrix architecture. The design was implemented in a low-cost FPGA module, the Spartan-3E XC3S500E-4fg320. With a low auxiliary clock of 6.25 MHz, the time required to compute the 16-DHT is 3 us. Area consumption with the Hadamard-based architecture is better, using, for example, about 85% fewer slices than the matrix-definition architecture, as can be seen in Table 1. The results point to feasibility for applications such as audio and image processing for teleconferencing, distance learning, and medical investigation. Improvements that include an analog interface for response visualization are currently in progress.

Fig. 4. 16-DHT algorithm conception.

Fig. 5. 16-DHT of a gate function (synthesized).

Fig. 6. 16-DHT of the rectified sine wave.

Fig. 7. MatLab 16-DHT of a rectified sine wave.

Fig. 8. Synthesis of the 16-DHT of a rectified sine wave.

6. REFERENCES

[1] K. J. Olejniczak and G. T. Heydt, "Special section on the Hartley transform," Proceedings of the IEEE, vol. 82, Mar. 1994.

[2] P. Meher, T. Srikanthan, J. Gupta, and H. K. Agarwal, "Near lossless image compression using lossless Hartley like transform," ICICS-PCM, vol. 19, Dec.

[3] P. Rajavel, "Directional Hartley transform and content based image retrieval," Signal Processing (Elsevier), vol. 90, Nov. 2010.

[4] J. I. Guo, "An efficient design for one-dimensional discrete Hartley transform using parallel additions," IEEE Transactions on Signal Processing, vol. 48, 2000.

[5] H. Bogucka, "Effective implementation of the OFDM/CDMA base station transmitter using joint FHT and IFFT," Proc. IEEE Workshop Signal Process. Adv. Wireless Commun.

[6] R. V. L. Hartley, "A more symmetrical Fourier analysis applied to transmission problems," Proc. IRE, vol. 30, pp. 144-150, Mar. 1942.

[7] R. N. Bracewell, "Discrete Hartley transform," J. Opt. Soc. Amer., vol. 73, Dec. 1983.

[8] R. N. Bracewell, "The fast Hartley transform," Proc. IEEE, vol. 72, Aug. 1984.

[9] R. Vasanthan, K. Prabhu, and P. Sommen, "An analysis of real-Fourier domain-based adaptive algorithms implemented with the Hartley transform using cosine-sine symmetries," IEEE Transactions on Signal Processing, vol. 53, no. 2, Feb. 2005.

[10] R. Merched and A. H. Sayed, "An embedding approach to frequency-domain and subband adaptive filtering," IEEE Transactions on Signal Processing, vol. 48, no. 9, Sep. 2000.

[11] C. Chakrabarti and J. JaJa, "Systolic architectures for the computation of the discrete Hartley and the discrete cosine transforms based on prime factor decomposition," IEEE Transactions on Computers, vol. 39, no. 11, Nov. 1990.

[12] D. Yang, "Prime factor fast Hartley transform," Electronics Letters, vol. 26, no. 2, Jan. 1990.

[13] S. Winograd, "On computing the discrete Fourier transform," Math. Comp., vol. 32, 1978.

[14] M. Marchesi, G. Orlandi, and F. Piazza, "A systolic circuit for fast Hartley transform," Proceedings of ISCAS '88, 1988.

[15] R. Blahut, Fast Algorithms for Digital Signal Processing. Addison-Wesley, 1985.

[16] G. Bi and Y. Chen, "Fast DHT algorithms for length N = q*2^m," IEEE Transactions on Signal Processing, vol. 47, no. 3, Mar. 1999.

[17] R. Cintra, "Transformada rapida de Hartley: novas fatoracoes e um algoritmo aritmetico," M.Sc. dissertation, UFPE-CTG.

[18] H. de Oliveira, R. Cintra, and R. Campello, "Multilayer Hadamard decomposition of discrete Hartley transforms," XVIII Simposio Brasileiro de Telecomunicacoes (SBrT), Sep.

[19] R. C. de Oliveira, H. M. de Oliveira, R. Campello, and E. Santos, "A flexible implementation of a matrix Laurent series-based 16-point fast Fourier and Hartley transforms," Proceedings of the VI Southern Programmable Logic Conference (SPL), Mar. 2010.

[20] P. Ranganathan, S. Adve, and N. P. Jouppi, "Performance of image and video processing with general-purpose processors and media ISA extensions," Proceedings of the IEEE, Aug.

IMPLEMENTATION OF MODBUS ON AN FPGA USING VHDL - DATA-LINK LAYER -

Guanuco Luis, Panozzo Zenere Jonatan, Olmedo Sergio, Rubio Agustin*
Centro Universitario de Desarrollo en Automación y Robótica CUDAR, FRC / UTN, Córdoba, Argentina
lguanuco@electronica.frc.utn.edu.ar, 49190@electronica.frc.utn.edu.ar, solmedo@scdt.frc.utn.edu.ar, 49286@electronica.frc.utn.edu.ar

ABSTRACT

Hardware description in VHDL (VHSIC Hardware Description Language) allows wide flexibility in digital circuit design. This article presents a description of the communication between Programmable Logic Devices (PLDs) according to the MODBUS protocol. This widely accepted communication standard defines protocols for the Application, Data-link, and Physical layers. This document explains the development of said standard at the Data-link layer and offers a summary of the way this layer interacts with the other two. To that end, the descriptions of the main blocks, synthesis, simulation and, finally, the implementation on FPGA (Field-Programmable Gate Array) devices are included.

1. INTRODUCTION

For the development of any communication protocol, levels of abstraction must be considered for the handling of the information, as well as different forms of implementation, both hardware and software. To define these design guidelines, the OSI model is considered. The OSI (Open Systems Interconnection) model is a reference framework for the definition of interconnection architectures for communication systems, developed by the International Organization for Standardization [1]. It allows the developer to follow a defined structure for handling the information on the network, Fig. 1. Each level of this model is governed by the specifications of the protocol. The model imposes a level of abstraction in which communication takes place between layers of the same level on two or more devices.
However, data actually flows only between adjacent layers of the same device, one device connecting to another only through the physical layers.

1.1. Data-Link Layer

MODBUS [2] [3] defines a protocol at this layer for serial communication between a single Master device and from one Slave (point-to-point connection) up to 247 Slaves (multipoint connection). A communication is always initiated by the Master, so a Slave only transmits information after a request; it follows that direct communication between Slaves is not possible. Each of these devices has a specific address that distinguishes it. The Master device can transmit data in two different modes: unicast or broadcast. The first consists of a request from the Master to a specific Slave, always followed by that Slave's response. The second is a transmission from the Master to all Slaves at the same time, with no response from any of them.

MODBUS allows the information on the network to be encoded in two different ways, RTU and ASCII [3]. RTU (Remote Terminal Unit) is a synchronous encoding: the data are presented as consecutive bits forming frames whose start and end are indicated by time intervals. ASCII (American Standard Code for Information Interchange) is characterized by being asynchronous: the information is encoded as ASCII characters, and the frame begins and ends with defined characters. The transmission and reception times of a frame differ greatly between these two encoding modes. In ASCII mode the data must be converted into their corresponding characters, in addition to being weighted in hexadecimal format. In RTU mode, by contrast, the information is sent as consecutive bits, allowing, for the same time span, a greater information flow through the network than in ASCII mode.

Fig. 1. OSI model with its different levels.

1.2. ASCII Encoding

ASCII mode is chosen for the present development because of the better readability of the information. In this mode, the frame circulating on the bus can be inspected by connecting to it any device capable of interpreting ASCII characters. This is a fundamental feature when analysis is required at any point of a network where MODBUS is deployed. ASCII-mode encoding uses a frame delimited by a start character ':' and two end characters, CR (Carriage Return) and LF (Line Feed). The message lies between these characters, laid out as shown in Fig. 2. The four fields that form the message are: Address, of the slave device taking part in the communication; Function Code, codes predefined by MODBUS that establish the operations the slave must carry out; Data, the information itself; and CRC/LRC, a field used for the detection (not correction) of errors.

Fig. 2. MODBUS frame in ASCII mode [3].

2. DESIGN

The implementation of a MODBUS protocol on FPGAs requires a design in some hardware description language, based largely on the development of finite state machines. The generation of a frame begins with sending the character that marks its start. The address, function, data, and LRC error-check fields are then transmitted consecutively, ending with the end-of-frame characters. Frame reception is handled analogously. The encoding/decoding states of the frame are defined in general form in Fig. 3. The most relevant specific blocks in the design are the Receiver and Transmitter, which are defined as state machines, at the component level, within the main VHDL description. State machines are classified into two types: Moore and Mealy [4].
They differ in whether or not the outputs depend on the state of the inputs. Given the requirements of the MODBUS Data-link layer, Mealy-type state machines were chosen, Fig. 4.

Fig. 3. Transmit and receive state diagram in ASCII mode [3].

2.1. RAM Block

The information contained in the MODBUS frame must be stored in registers for its processing at the different levels. In this sense, this block acts as a bridge between the Data-link and Application layers. The former stores in the RAM the data received in the message, preparing the service of the application layer; the latter takes the data from the RAM, processes them, and writes back into it the information to be transmitted. It is possible to use either a RAM block already embedded in the logic device (FPGA) or a described (inferred) RAM block. RAM blocks embedded in FPGAs, also called primitive RAM blocks, are physically present on the chip [6]; they comprise inputs/outputs, an address bus, and control signals. The limitation of using them is that no descriptive model of them is available, which restricts the design: the use of physical resources cannot be reduced, and the design becomes dependent on the target hardware. With a described RAM block, timing analysis can be carried out and the number of logic blocks reduced according to the needs of the implementation. For this reason, and for research purposes, this type of memory block is adopted in the present work. The overall design does, however, occupy more resources, since the primitive RAMs remain present and available on the chip regardless.

Fig. 4. Mealy state machine diagram [5].
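To make the ASCII framing of section 1.2 concrete, the sketch below (Python, illustrative only; it is not part of the VHDL design, and the request bytes are a hypothetical example) computes the LRC and assembles a frame between the ':' and CR LF delimiters:

```python
def lrc(data: bytes) -> int:
    """MODBUS ASCII LRC: two's complement of the 8-bit sum of the raw bytes."""
    return (-sum(data)) & 0xFF

def ascii_frame(address: int, function: int, payload: bytes) -> str:
    """Frame = ':' + hex-encoded (address, function, data, LRC) + CR LF."""
    raw = bytes([address, function]) + payload
    return ":" + (raw + bytes([lrc(raw)])).hex().upper() + "\r\n"

# Hypothetical request: slave 0x11, function 0x03, data 0x00 0x6B 0x00 0x03
print(ascii_frame(0x11, 0x03, bytes([0x00, 0x6B, 0x00, 0x03])))
# prints :1103006B00037E followed by CR LF
```

The receiver recomputes the LRC over the decoded bytes and compares it with the received field, which gives error detection (not correction), matching the description of the CRC/LRC field above.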

2.2. Transmitter and Receiver

Functionally, the Transmitter must generate the frame to be sent, both in the Master and in the Slaves. A state machine is designed that waits on the write process of the RAM block, carried out by the application layer. This state machine performs successive reads from the RAM block and sends the characters one by one, respecting the start- and end-of-frame markers. For reception, just as for transmission, a state machine is again used, which must comply with the specifications of the encoding mode. In this case the information is available serially, received from the Physical layer. The data are stored in the RAM block, and during that time the reception block has exclusive write control over the memory. For these reasons, an access-control mechanism for the RAM block is necessary, since several components need to write to and/or read from it.

2.3. UART

For layers 1 and 2 of the OSI model, MODBUS defines the MODBUS Serial Line Protocol [3]. This implies the use of a UART (Universal Asynchronous Receiver Transmitter) to transmit and receive the data serially. The UART therefore constitutes the connection between the Data-link layer and the Physical layer. The latter can be any serial communication standard, such as RS232 or RS485, the one adopted in the present development. This block, like the others, is written descriptively in VHDL; in general terms, it presents the data received serially as a parallel output. Analogously, it receives the data to be transmitted in parallel and sends the information bits serially, observing the chosen speed settings and the conditions established by the MODBUS protocol for the composition of the word to be sent: start bits, data, parity, and stop [3].
3. SYNTHESIS AND IMPLEMENTATION

The implementation is carried out on a Xilinx Spartan-2E XC2S200E FPGA [6]. The synthesizer is XST (Xilinx Synthesis Technology) [7], a tool that is part of the Xilinx ISE WebPack package [8] available at the research center where the development is carried out. The FPGA provides a large amount of physical resources, Fig. 5. The main ones are: input/output blocks; configurable logic blocks; RAM blocks; clock distribution, DLLs (Delay-Locked Loops); and boundary scan.

Fig. 5. Block diagram of the Spartan-IIE FPGA family [6].

With the design guidelines already presented, and the different blocks that make up our description identified, the synthesis result is presented in Table 1.

Table 1. Resource utilization summary.
FPGA device: 2S200EPQ208-6Q

  Resource          Used    Available    Percentage
  Slices             -        2352          -
  Flip Flops         -        4704          -
  Logic LUTs         -        4704          -
  RAM                -          -           -
  Inputs/Outputs     37         -           -
  Bonded IOBs        -          -           -
  GCLKs              -          4           -

Table 1 shows how few resources are used, since the device has a large number of CLBs (Configurable Logic Blocks). Even so, simulation, verification, and subsequent simplification of the description are of great importance to make better use of resources with a view to implementation on different logic devices. A more detailed analysis highlights the importance of not instantiating primitive elements in the present project. In the case of the described RAM block, only a small, well-defined number of elements is needed, which allows it to be instantiated even on smaller logic devices, for example CPLDs. If a larger RAM block is needed, the use of primitive RAM blocks should be considered, obviously after a prior study of the target device. In the same way, all the resources the project requires must be weighed against those available in the target hardware.

Using a single clock for the synchronization of the CLBs turns out to be more flexible in the design than having several external clocks connected to the FPGA. It must be kept in mind, however, that this comes with a corresponding consumption of physical resources, since a clock divider implemented with logic blocks is synthesized as a counter. RTL (Register Transfer Level) allows a graphical representation of the design described in VHDL, in which the final components can be visualized, Fig. 6.

Fig. 6. Final RTL representation.

4. SIMULATION

Simulation is fundamental in the synthesis and implementation process. Following the specifications of the MODBUS Data-link layer, the possible communication cases are exercised, both for transmission and for reception. The structure of the simulation process follows the scheme of Fig. 7. In this way, a loop is created that leads to the correct operation of the digital system.

Fig. 7. Validation process of a digital electronic design in VHDL [9].

The simulation not only offers useful information for correcting problems in the synthesis; it also allows the frame to be validated, as can be seen in Fig. 8.

Fig. 8. Simulation of a transmitted and received frame in the MODBUS Data-link layer.

5. CONCLUSION

Based on the specifications of the MODBUS protocol, a fully descriptive development in VHDL has been achieved. This provides flexibility in the design of digital systems, as well as portability of the implementation across PLDs. Research on the implementation of MODBUS in embedded systems shows a preference for the use of microcontrollers, a tendency that technological progress, the evolution of new architectures, and software tools have reinforced. Even so, the abstraction of the VHDL language has made it possible to satisfy the specifications of MODBUS embedded in an FPGA.

6.
REFERENCES

[1] MODBUS-IDA.ORG, OSI model.

[2] MODBUS-IDA.ORG, MODBUS application protocol specification, V1.1b.

[3] MODBUS-IDA.ORG, MODBUS over serial line specification and implementation guide.

[4] K. Kuusilinna, V. Lahtinen, T. Hämäläinen, and J. Saarinen, "Finite state machine encoding for VHDL synthesis," IEE Proc.-Comput. Digit. Tech., vol. 148, no. 1, Jan. 2001.

[5] A. Iborra and J. Suardiaz, "Diseño de Sistemas Electrónicos, DB4: Diseño Basado en Máquinas de Estado Finitas," May.

[6] Xilinx Inc., "Spartan-IIE 1.8V FPGA Family: Functional Description," v2.1, Product Specification, Jul.

[7] Xilinx Inc., "ISE 10.1 Quick Start Tutorial," Aug.

[8] Xilinx Inc., "ISE WebPACK Design Software," Aug.

[9] J. Jiménez, E. Fernández, J. Martin, U. Bidarte, and A. Zuloaga, "Simulation environment to verify industrial communication circuits," University of the Basque Country, Department of Electronics and Telecommunications.

MUSIC SEQUENCER ON A FPGA BOARD

Matías López-Rosenfeld, Patricia Borensztejn
Departamento de Computación, Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires
{mlopez,

Francisco Laborda
Instituto de Ciencias, Universidad Nacional de General Sarmiento
flaborda@ungs.edu.ar

ABSTRACT

In this paper we present an application implemented on a Xilinx FPGA. The application plays polyphonic musical pieces by synthesizing digital audio signals. It was designed to run on a Digilent Spartan-3E Starter Kit. It is built upon the ideas of the MIDI protocol as input, and its main output is a PWM (pulse-width modulation) bit, which is amplified using another Digilent module (PMOD-AMP1).

1. INTRODUCTION

In the late 1960s, digital synthesizers became popular in popular music. In 1983, the MIDI (Musical Instrument Digital Interface) protocol [1] standardized communication between synthesizers of different brands and made it possible to program them. This revolutionized the way music was made, since anyone could program even without being a good performer. It should be noted that MIDI does not transmit audio signals, but event data and controller messages that are interpreted according to the programming of the device that receives them. That is, MIDI is a kind of score containing instructions as numeric values (0-127) about when to generate each note of sound and the characteristics it should have.
The device to which that score is sent transforms it into fully audible music. In this work we create an application that serves as a starting point for the development of a polyphonic sequencer based on the principles of the MIDI protocol. This article is organized as follows: section 2 describes the project, section 3 explains the system design, section 4 describes each module individually, section 5 discusses the verification of the project, and section 6 shows the synthesis information. Finally, section 7 presents the conclusions.

2. PROJECT DESCRIPTION

The project consists of the implementation in Verilog of a polyphonic music sequencer on an FPGA (field-programmable gate array). Given a sequence of NoteOn and NoteOff events (as in the MIDI protocol), it plays a musical piece with sounds synthesized on the same board. The synthesized audio waves are sawtooth waves of discrete values. Polyphony was solved using time multiplexing of the different voices. For the implementation of the project, the Starter Kit from Digilent Inc. was used, donated by Xilinx through the Xilinx University Program [2]. The kit contains a Xilinx Spartan-3E XC3S500E FPGA [3]. In addition, a Digilent Inc. PMOD-AMP1 [4] was used as a complement to transform the modulated digital signal into an analog signal reproducible by a speaker. A similar and complete implementation of a monophonic synthesizer on an FPGA can be found in project [5].

2.1. Input

Just as a musician reads a score, our project does so too. It has an input from which it can read when a note has to start sounding and when it has to stop sounding.
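For illustration, a NoteOn/NoteOff stream like the one just described can be modeled as follows (Python sketch; this field layout is hypothetical, the actual bit packing is given in section 4.3):

```python
# Illustrative event stream: (timestamp, MIDI note number, NoteOn?, voice).
events = [
    (0, 69, True,  0),   # t=0: NoteOn  A4 on voice 0
    (0, 60, True,  1),   # t=0: NoteOn  C4 on voice 1 (polyphony)
    (8, 69, False, 0),   # t=8: NoteOff A4
    (8, 60, False, 1),   # t=8: NoteOff C4
]

def active_notes(events, t):
    """Replay events up to time t and return the sounding note per voice."""
    voices = {}
    for ts, note, on, voice in sorted(events):
        if ts > t:
            break
        voices[voice] = note if on else None
    return {v: n for v, n in voices.items() if n is not None}

print(active_notes(events, 4))   # voices 0 and 1 both sounding at t=4
```

This is exactly the information the Score Table of section 4.3 holds: which note starts or stops, on which voice, and at which time unit.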

Fig. 1. Data path: Score Table indexed by the PC register and driven by the Metronome; four Tone Table / NCO pairs (voices 1-4, 22 bits each); Mixer with a 1-bit PWM output (50 MHz clock).

In our implementation we did not focus on the way the performance data are entered. We store in a table the data needed to play a musical piece, and this table acts as the actual input. This leaves the way open to entering data in other forms (in real time via serial port from a PC or a MIDI controller, by means of a PS/2 keyboard, etc.). We refer to this table that acts as the input as the Score Table.

2.2. Audio Output

There are two audio outputs, and they contain the digital audio that results from playing the score taken at the input. One of these outputs is a discrete digital sawtooth signal on a 22-bit bus that varies in cycles at the frequency associated with the tone to be played. In our project this output is not used, but it remains available for any other digital-to-analog conversion that may be desired. The other output is the pulse-width modulated (PWM) version of the previous one. It is a single bit that alternates between 0 and 1 in cycles at the frequency associated with the tone to be played. This output is used: it is received by the PMOD-AMP1 and transformed into audio that can be reproduced by any speaker.

3. DESIGN

The data path of the project (see Fig. 1) is composed of the following modules: the Metronome, a divider of the board's clock frequency; the PC register, which indexes the Score Table; the Score Table, which contains the performance data of the musical piece; the Tone Table, which translates each musical note to be played into the value the Oscillator needs to generate the corresponding signal; the Oscillator (NCO), which generates a sawtooth signal whose frequency is numerically controlled;
and the Mixer, which combines the signals of the different oscillators into a single one.

4. IMPLEMENTATION

4.1. Metronome Module

This module is simply a divider of the board's clock frequency. Its output goes from 0 to 1 to indicate that one time unit of the musical interpretation has elapsed. In future implementations, it will be possible to modify the tempo of the musical piece during playback just by changing the value by which the frequency is divided in this module.

4.2. PC Register

This register works as a counter of the number of pulses emitted by the Metronome. This value is used to index the Score Table.
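The Metronome's divider arithmetic can be sketched as follows (Python; the clock value matches the board's 50 MHz oscillator, but the tempo figure is an assumed example, not taken from the paper):

```python
# Sketch of the Metronome divider: a 50 MHz board clock divided down to
# musical time units. The tempo below is illustrative only.
CLOCK_HZ = 50_000_000

def divider_for(ticks_per_second: float) -> int:
    """Counter limit so the divider overflows ticks_per_second times per second."""
    return round(CLOCK_HZ / ticks_per_second)

# e.g. 8 time units per second (a sixteenth note at 120 BPM):
print(divider_for(8))  # -> 6250000
```

Changing the tempo at runtime, as the text anticipates, amounts to loading a different counter limit.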

Fig. 2. Layout of a row of the Score Table: timestamp, note number, NoteOn/NoteOff, voice number.

4.3. Score Table

This table simulates an input proper. As mentioned in 2.1, in future versions it could be replaced by an input in another format. Each of its rows (see Fig. 2) represents a musical-performance event for a given moment, which we call its timestamp; this comes from the value stored in the register of 4.2. An event is composed of: the note involved; a voice (through which the sound will be generated); and a binary value that indicates whether it represents the beginning or the end of the sound of that note on that voice (NoteOn/NoteOff). In this first stage, the application is only capable of sounding up to 4 notes simultaneously. The notes start and stop sounding at the timestamp indicated in the score.

4.4. Tone Table

This table stores the precomputed values that serve as limits for the oscillators' counters, so as to achieve the desired frequencies. Since these are discrete counters, the generated frequencies can have an error, but it is negligible to the human ear. The operation is simple to explain: given that the internal clock runs at 50 MHz, we must ask how many ticks we should count to slow this frequency down to that of the desired note; we then count from 0 up to that number, over and over, to produce a signal at that note's frequency.

Duration of one tick of the FPGA clock: 50 MHz means 50M cycles per second, so one tick lasts 1/50M s.  (1)

Duration of one cycle of the middle A: 440 Hz means 440 cycles per second, so one cycle lasts 1/440 s.  (2)

Fig. 3. Tick counter for obtaining a middle A (counting up to 0x1BBE4).

Fig. 4. Tick counter centered around the middle of the 22-bit range (counting from 0x1F220D to 0x20DDF2) for obtaining a middle A.
FPGA clock tics needed to obtain an A:

    x = (1/440 s) / (1/50M s) = 50,000,000 / 440 ≈ 113636 = 0x1BBE4 tics_FPGA    (3)

In this way we store the precalculated values for the 128 notes of the musical range covered by the MIDI protocol.

4.5. Oscillator Module (NCO)

This module is in charge of generating the signal that represents a given frequency. The output is a sawtooth signal oscillating at a certain rate, which is determined by the input. To generate the sawtooth signal, the module counts tics of the FPGA's internal clock. The input of this module is therefore the number corresponding to the desired note according to 4.4, and the output is the counter value (see Fig. 3). When we say that a voice is activated on a note, we mean that one of the four Oscillators starts generating at its output a sawtooth signal that oscillates digitally at the frequency associated with that note (for middle A, for example, 440 Hz). When we say that a voice is deactivated, the Oscillator associated with that voice holds a constant zero at its output. This module also centres the signal. By centring we mean that the midpoint of our 22-bit representation is always reached at the middle of the range to be traversed. Thus, instead of counting from 0 to n we count from p to q, with p < q, where p is the bitwise negation of q and q - p = n (see Fig. 4).
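The Tone Table entry for middle A and the centred counter endpoints of Fig. 3 and Fig. 4 can be reproduced with a short sketch; Python is used here purely as a calculator, and the choice of q is our reconstruction, checked against the hexadecimal values shown in the figures.

```python
# Reproduce the Tone Table value for middle A (Fig. 3) and the centred
# counter endpoints (Fig. 4) for the 22-bit representation.

CLOCK_HZ = 50_000_000
WIDTH = 22
MASK = (1 << WIDTH) - 1  # 0x3FFFFF

def tone_ticks(freq_hz):
    """FPGA clock tics per period of the desired note (Section 4.4)."""
    return round(CLOCK_HZ / freq_hz)

n = tone_ticks(440)
print(hex(n))  # 0x1bbe4, as in Fig. 3

# Centred counter: the smallest q such that p = ~q (within 22 bits)
# and q - p >= n. These values reproduce the endpoints of Fig. 4.
q = (MASK + n + 1) // 2
p = q ^ MASK
print(hex(p), hex(q))  # 0x1f220d 0x20ddf2
```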

Table 1. Synthesis results.

    Component          Used      Percentage
    Slices             150/          %
    Slice Flip Flops   107/          %
    4-input LUTs       264/          %
    IOs                -          21 %
    Bonded IOBs        17/232      7 %

4.6. Mixer Module

This module time-multiplexes its four inputs, coming from the oscillators, at 100 kHz. It has two outputs, one of 1 bit and one of 22 bits. We chose to use the one-bit output and send a PWM signal over it, since we had the PMOD-AMP1 available for our work. To understand what this module does, consider the data at the input and what must be done to obtain the desired output. The oscillator of 4.5 delivers a centred signal, a property we exploit to build our output bit: we send a 1 when the input signal is above the midpoint value and a 0 when it is below. This datum is given by the most significant bit of the input signal, so we only have to route that bit to the output; the PMOD-AMP1 can then reconstruct the signal from it and play the expected audio. To mix the 4 voices into a single signal, we simply alternate between them. Technically, at any given instant at most one voice is sounding; by switching between voices so quickly, the human ear cannot distinguish the changes, producing the sensation of hearing a chord. The second output of this module is a full 22-bit signal which, although not currently used in our implementation, could be routed to a DAC (digital-to-analogue converter) so that the signal could be sent to a loudspeaker and heard. Note that this signal is mixed in the same way as the one-bit output, but over 22 bits.

5. VERIFICATION

Given the subjective nature of music and the difficulty of designing testbenches for oscillator outputs, we chose to verify the software experimentally, directly on the board.
Although the sounds obtained are pure tones (or combinations of them), and therefore not particularly pleasant to the ear, the musical pieces tested were successfully recognized by a variety of listeners.

6. SYNTHESIS

Table 1 shows the synthesis results of our project, with no score loaded, for the Spartan-3E board (XC3S500E), synthesizing our project with XST. We present the synthesis without the score because the idea is for the score to move off the board and become an input of another kind, as explained above.

7. CONCLUSION

We implemented a music sequencer capable of playing musical pieces with at most four voices sounding simultaneously. A noteworthy point is that it was implemented from scratch, without using pre-existing code, in an effort to understand the basic mechanisms involved in the reproduction and generation of sound, and in this case music. A valuable aspect of the project is that it lays a foundation and a starting point for several different developments. This project could evolve, along different paths, into a programmable or live-playable synthesizer, a MIDI sequencer, a sample-based sequencer, or even an effects synthesizer or a drum machine, among other things.

8. PROJECT REALIZATION

This work was carried out in the context of the course Diseño de Sistemas con FPGA, in the Departamento de Computación of the Facultad de Ciencias Exactas, Universidad de Buenos Aires, taught by Dra. Patricia Borensztejn, during the first cuatrimestre.

REFERENCES

[1] MIDI Manufacturers Association Incorporated, The complete MIDI 1.0 detailed specification, midi.org/techspecs/midispec.php.
[2] Xilinx, Spartan-3E FPGA starter kit board user guide, boards and kits/ug230.pdf.
[3] Xilinx, Xilinx university program, com/university/index.htm.
[4] Digilent Inc., PmodAMP1(TM) speaker/headphone amplifier reference manual, Data/Products/PMOD-AMP1/PmodAMP1 rm RevB.pdf.
[5] S. Gravenhorst, GateMan I, net/pmwiki/pmwiki.php?n=fpgasynth.gatemani.

FLEXIBLE PLATFORM FOR REAL-TIME VIDEO AND IMAGE PROCESSING

Paulo da Cunha Possa, Zied El Hadhri, Laurent Jojczyk and Carlos Valderrama
Department of Electronics and Microelectronics, University of Mons
Boulevard Dolez, 31, 7000 Mons, Belgium
{paulo.possa, zied.elhadhri, laurent.jojczyk,

ABSTRACT

This work presents a platform for real-time image and video processing that enables the exploration and evaluation of different processing techniques. The goal of our approach is to provide a flexible environment for prototyping different processing techniques on Field Programmable Gate Arrays (FPGAs), easily customizable to specific target applications and suitable for educational purposes. In this paper we give an overview of different requirements and techniques of video processing featuring FPGAs. Three real-time video processing algorithms were combined to show the advantages and characteristics of our approach. Within this system, the modules running in parallel can be easily selected at run time according to the application's needs.

Index Terms: Video signal processing, field programmable gate array, tracking, object detection, embedded system.

1. INTRODUCTION

The performance requirements of image and video processing applications have driven an increase in the computing power of implementation platforms, especially when real-time constraints need to be met [1]. Traditional implementations of image and video processing designs are based on Digital Signal Processors (DSPs) or Application Specific Integrated Circuits (ASICs). However, FPGAs have shown very high performance in many applications in this field [2][3]. FPGAs hold a clear advantage over conventional DSPs for digital signal processing: their scalability (the capacity to replicate functions as required) and their inherent parallelism. Moreover, current FPGA devices provide several attractive features for implementing DSP algorithms, e.g.
high-performance input and output pins, large memory blocks, and embedded multipliers and microprocessors [4]. Many works on video and image processing on FPGAs can be found, with the most diverse applications. Within those design environments, we are interested in how the different processing algorithms can be combined in a flexible way to satisfy specific application requirements. One example is the work in [5] on a multipurpose reconfigurable platform, where more than one process can be applied simultaneously to the incoming video signal. In that approach, a user-specific functional module implements most of the required functionalities; this functional block can be extended depending on different application scenarios. Nowadays, students should evolve from software development to architecture design in order to satisfy such requirements. They must master algorithm selection and processing-power requirements, working in a design framework built around the latest video/image standards. By adopting a video processing platform based on FPGAs, we can provide real-time exploration in a flexible and evolving environment, populated by a growing set of technology bricks. In this context, the objective of this work is to provide a flexible environment for prototyping different processing techniques on FPGAs, easily customizable to specific target applications and suitable for educational purposes. In the following section, we detail the characteristics of our video processing platform. We also describe the processing algorithms implemented in our system in order to evaluate it in terms of performance and effectiveness. Section 3 presents the results obtained in the evaluation. Finally, in Section 4, we conclude with an analysis of the results and provide future directions for our work.

2. THE VIDEO PROCESSING PLATFORM

The board chosen to host our system is the Altera DE2 Development and Education Board (Fig. 1).
The motivation for this choice was the educational purpose of the DE2 board, with accessible components for debugging (e.g. toggle switches, debounced pushbutton switches, and LEDs) and a complete set of peripherals, including a 24-bit audio codec, a USB host/slave controller, a 10/100 Ethernet controller, 8 MB of SDRAM, a TV decoder, and a VGA 10-bit DAC. The FPGA hosted on the DE2 board is Altera's low-cost EP2C35 device from the Cyclone II family. The EP2C35 contains 33,216 Logic Elements, 105 M4K RAM blocks, 35 embedded multipliers, and 4 PLLs. The proposed video processing platform takes advantage of the already available TV decoder (ADV7181B) and video DAC (ADV7123) to create a low-cost environment for video processing. Fig. 2 shows a

simplified diagram of the video framework architecture.

Fig. 1. Altera DE2 Development and Education Board.

Fig. 2. Simplified diagram of the video platform architecture.

Customized video processing modules can easily be placed between these two modules (Video Input and Video Output). A basic scalable architecture was utilized to create a complete video application. Fig. 4 shows a diagram of the video processing module created to evaluate our platform.

Fig. 3. Diagram of the Video Input module.

The Video Input and Output modules (Fig. 2) are based on the DE2 TV box demonstration supplied with the DE2 board by Altera. Customized video processing operators can be placed between these two modules. In the Video Input module (Fig. 3), the ITU-R 656 Decoder block extracts YUV 4:2:2 video signals from the ITU-R 656 data stream. Since the input video signal is interlaced, a Deinterlacer block is employed to convert the input video stream into a progressive format. The Deinterlacer has an interface to the on-board SDRAM, where the two fields (F0 and F1) of the interlaced frame are stored. After that, the chroma components of the video stream are up-sampled by the YUV 4:2:2 to 4:4:4 block. Finally, the video stream is converted from the YUV colour format to RGB, which is the input format of the next module (Video Processing). The Video Output module is basically an interface between the Video Processing module and the video DAC; it also generates two signals corresponding to the pixel coordinates.
This block is responsible for generating all the synchronization signals needed by the video DAC and the VGA output, e.g. the vertical and horizontal synchronization signals.

Fig. 4. Diagram of the Video Processing Module.

The main components of the Video Processing Module are the processing algorithm blocks, the multiplexers, and the Control Bus. The multiplexers can either bypass the processing algorithm blocks or not; the selection between their inputs can be done at run time through the Control Bus. The Control Bus can also share control data among the blocks and a microcontroller. In this test we did not add a microcontroller; instead, we used the available on-board switches to control the multiplexers. Below, we describe the three processing algorithms implemented in the Video Processing Module.

2.1. Mirroring

The Mirroring algorithm is based on three line buffers using the concept of a Last In First Out (LIFO) structure. Each line buffer stores one of the three input colours. They are based on a dual-port RAM block embedded in the

FPGA device. The dual-port RAM allows storing data at one address and reading data from another address at the same time. Using the LIFO structure, the Mirroring block creates a mirror effect in the output video.

2.2. Background Subtraction

The Background Subtraction module extracts, in real time, the background of a frame, highlighting new objects in the frame. Background subtraction is a commonly used class of techniques for segmenting out objects of interest in a scene, for applications such as surveillance [6]. In our approach, we store a specific part of a frame in an external SRAM. After that, we compare each pixel of the following frames with the buffered pixels. As a result of the comparison algorithm (1), each pixel is classified as a background pixel or a foreground pixel. In the output, background pixels appear black and foreground pixels appear as in the input, i.e. in the output we see only what is new in the frame.

    IF ( P > (P_buffer + τ) OR P < (P_buffer - τ) )
        P_class <= foreground;                        (1)
    ELSE
        P_class <= background;
    END

In (1), P is the current pixel, P_buffer is the corresponding buffered pixel, and τ is the threshold level. In order to save memory resources, we buffer only part of the frame ( pixels), which results in a partial background subtraction. Also, for the same reason, we work with only one colour channel, carrying the greyscale component.

2.3. Tracking

The Tracking algorithm analyses the addresses of the pixels classified as foreground by the Background Subtraction block in each frame. The goal of the address analysis is to locate the limits of the foreground object. With this information, this block draws a green square enclosing the object. The centre position of the square and the area occupied by the object are also made available on the Control Bus. For the evaluation, we used an extra module to show the position and area information on the on-board seven-segment displays.
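As a rough software sketch (not the hardware implementation), the comparison (1) and the Tracking block's address analysis can be expressed as follows; the frame layout and helper names are our own illustration.

```python
# Sketch of the Background Subtraction test (1) and the Tracking block's
# address analysis: foreground pixels are found by thresholding against the
# buffered background, and the bounding box of their addresses is located.

def is_foreground(p, p_buf, tau):
    # comparison (1): outside [p_buf - tau, p_buf + tau] => foreground
    return p > p_buf + tau or p < p_buf - tau

def track(frame, background, tau, width):
    """Return the bounding box (x0, y0, x1, y1) of the foreground pixels,
    or None when the frame matches the background everywhere."""
    xs, ys = [], []
    for addr, (p, b) in enumerate(zip(frame, background)):
        if is_foreground(p, b, tau):
            xs.append(addr % width)    # pixel coordinates recovered
            ys.append(addr // width)   # from the linear address
    if not xs:
        return None
    return min(xs), min(ys), max(xs), max(ys)

bg = [100] * 16                   # 4x4 flat greyscale background
frame = list(bg)
frame[5], frame[10] = 200, 250    # a bright object spanning two pixels
print(track(frame, bg, tau=10, width=4))  # (1, 1, 2, 2)
```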
3. RESULTS

The first aspect to examine is the resource utilization of our design. All internal modules, including the decoders, the controllers, and the customized processing blocks, utilize 7% of the logic elements available in the EP2C35 Cyclone II. This shows that even a relatively small FPGA device can support a large number of real-time video applications at standard VGA resolution ( fps). In terms of internal memory resources, our design reaches 15% utilization. Table 1 summarizes the FPGA resource usage of our system, and Fig. 5 shows the Cyclone II EP2C35 floorplan after the fitting process, with the main blocks located.

Table 1. FPGA resource usage by the Video Processing Platform.

    Modules             Logic Elements   Memory Bits   Embedded Multipliers   PLLs
    Video In
    Processing Module
    Video Out
    Total               2307/                /             /35               1/4
    Percentage          7%               15%           26%                    25%

Fig. 5. Cyclone II 2C35 floorplan (Video Input, Processing Module, Video Output).

Regarding system performance, our system can operate at a maximum clock frequency of MHz, as reported by Altera's timing analyser tool. As our architecture can process 1 pixel per clock cycle, it can achieve a maximum performance of 143 fps (2):

    Frame fmax = Pixel fmax / Frame resolution = MHz / ( ) = 143 fps    (2)

The maximum data rate achieved is Mbit/s per colour channel (3), or 1.32 Gbit/s for RGB (4):

    Data Rate max = Pixel resolution x Pixel fmax = 10 bits x MHz = Mbit/s    (3)
    Data Rate max (RGB) = 30 bits x MHz = 1.32 Gbit/s                          (4)
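The arithmetic of (2) to (4) can be checked with a few lines. The pixel-clock value below is an assumption made for illustration: 44 MHz is consistent with both stated results (143 fps at 640x480 VGA, and 1.32 Gbit/s over three 10-bit channels).

```python
# Reproducing equations (2)-(4). PIXEL_FMAX_HZ is an assumed value chosen
# to be consistent with the paper's stated 143 fps and 1.32 Gbit/s figures.

PIXEL_FMAX_HZ = 44_000_000     # assumption, see note above
WIDTH, HEIGHT = 640, 480       # standard VGA resolution
BITS_PER_CHANNEL = 10

frame_fmax = PIXEL_FMAX_HZ // (WIDTH * HEIGHT)       # (2), frames per second
rate_per_channel = BITS_PER_CHANNEL * PIXEL_FMAX_HZ  # (3), bit/s per channel
rate_rgb = 3 * rate_per_channel                      # (4), bit/s for RGB

print(frame_fmax)       # 143
print(rate_rgb / 1e9)   # 1.32
```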

Fig. 6. Video Processing Module results: (a) bypassing the background subtraction/tracking block; (b) bypassing the mirroring block, without foreground objects; (c) bypassing the mirroring block, with a foreground object.

In relation to resource usage and system performance, Altera's PowerPlay tool estimated a power dissipation of mW for the FPGA device in our system. The experimental results demonstrate the effectiveness of our platform. Fig. 6 illustrates the platform's video output with different multiplexer settings in the Video Processing Module. Fig. 6a shows the mirroring block result without background subtraction/tracking. In Fig. 6b and 6c, only the mirroring block is bypassed and we can see the result of the background subtraction/tracking blocks. In Fig. 6b, the output shows an empty space in the centre of the image; this space is where the background subtraction/tracking block is active. In the next image (Fig. 6c), an object has been added to the environment. We can see the new object without the background information, together with a square enclosing it. At the same time, the on-board seven-segment displays show the centre position of the square and the object's area in the image.

4. CONCLUSIONS

In this paper, we present a platform for real-time image and video processing applications. The objective of this framework is to allow engineering students to design, explore and evaluate different image and video processing modules. As mentioned before, we used Altera's low-cost EP2C35 FPGA device from the Cyclone II family. This device is embedded in a likewise low-cost development board, the DE2. The DE2 has the advantage of containing a video input and output based on a TV input decoder (ADV7181B) and a video DAC output (ADV7123). Also, the DE2 was developed specifically for educational purposes, which is our focus.
We implemented three basic processing algorithms in our system in order to validate the entire system and test its flexibility and performance. The results showed that even a relatively small FPGA device can support a large number of real-time video applications at standard VGA resolution. In future work, we intend to add extra memory to the DE2 board through daughter boards connected to its expansion connector; this will allow us to implement the multiple frame buffers required by more complex algorithms. We also want to use a digital video source (for example the Terasic D5M digital camera) instead of the analog one we used, which will simplify the Video Input module and save FPGA resources. Moreover, we will migrate our platform to a more powerful development board, targeting applications at Full HD resolution.

ACKNOWLEDGEMENT

This work is supported by the French Community of Belgium under the Research Action ARC-OLIMP (Optimization for Live Interactive Multimedia Processing). We would also like to thank the Altera University Program for providing the development boards.

REFERENCES

[1] M. Akil, Special issue on reconfigurable architecture for real-time image processing, Journal of Real-Time Image Processing, volume 3(3), pages ,
[2] J.A. Kalomiros, J. Lygouras, Design and evaluation of a hardware/software FPGA-based system for fast image processing, Microprocessors & Microsystems, volume 32(2), pages ,
[3] S. Asano, T. Maruyama, Y. Yamaguchi, Performance comparison of FPGA, GPU and CPU in image processing, International Conference on Field Programmable Logic and Applications, pages ,
[4] N. Lawal, B. Thornberg, M. O'Nils, Power-aware automatic constraint generation for FPGA based real-time video processing systems, Norchip,
[5] J. Li, H. He, H. Man, S. Desai, A general-purpose FPGA-based reconfigurable platform for video and image processing, International Symposium on Neural Networks, pages ,
[6] A.M. McIvor, Background subtraction techniques, In Proc. of Image and Vision Computing,


SOPC PLATFORM FOR REAL-TIME DVB-T MODULATOR DEBUGGING

Armando Astarloa, Jesús Lázaro, Unai Bidarte, Aitzol Zuloaga
Department of Electronics and Telecommunications, University of the Basque Country
Alameda Urquijo s/n Bilbao - Spain
armando.astarloa@ehu.es

Mikel Idirin
System-on-Chip engineering S.L.
Zitek Bilbao - ETSI Bilbao
mikel.idirin@soc-e.com

ABSTRACT

Debugging DVB-T FPGA-based systems is not a trivial task. The large bandwidth requirements, combined with the massive storage needed for further analysis of the video frames, require an ad hoc solution. This article presents a SoPC architecture specifically designed to capture frames of a Digital Television modulator IP core in real time. All the required processing (video, communications, TCP-IP encapsulation, etc.) is managed by the FPGA, and frames can be captured between any two stages of the DVB-T modulator IP core's hardware processing pipeline. As a result, a powerful tool for Digital Television hardware debugging is obtained.

1. INTRODUCTION

Recent years have witnessed the development of technology in several digital areas. This evolution has led to the need to replace existing technology in the field of broadcasting, which until recently was mostly analog. It concerns not only TV and radio end users but also the RF links between intermediate equipment; an example is the communication between a camera and the production center in the context of broadcasting a sporting event. To overcome the shortcomings of the previous analog systems, a digital broadcasting service for TV and radio has emerged, and, to organize this evolution, a European standard for digital television [1] has been established. The basis of the new digital technology is digital compression of the image.
Digital sound was addressed early on, but real-time moving images present many more problems, which held back the digitization of video systems. Transmitting digitized images without compression at the rate required by television demands too much bandwidth, something intolerable given the congested spectrum. It was therefore necessary to compress the digital signal, sending no more than what is needed to reconstruct the image at the receiver. This compression technique was developed by MPEG (Moving Picture Experts Group); accordingly, the MPEG2 image compression system is used as the reference for the European Digital TV standard [2].

This work has been partially supported by the research program DIPE-BEAZ 2009 (DIPE09/02). This work has been partially supported by the Government of the Basque Country within the research program NETS (project IN-2010/ ).

The flexibility and computing power required for Digital Television hardware processing are optimally addressed using reconfigurable logic. In fact, the state of the art in hardware processing modules for DTT (DVB-T modulator/demodulator cores) shows that many companies offer specialized IP cores for integration into FPGAs. Additionally, the latest platform FPGAs have enabled the integration of whole digital systems in a single device [3]: hardware cores, microprocessors, on-chip buses, etc. G. Martin, in the chapter "The History of the SoC Revolution" (2003) [4], emphasized how core-based design with commercial reconfigurable FPGA platforms was a strong reality in System-on-Chip (SoC) [5] design and would continue to be so in the future. This prediction has been fulfilled, and nowadays SoCs are widely extended, especially those implemented in reconfigurable logic: the SoPCs. Regarding methods and tools for debugging high-performance systems, much work has been done in recent years.
FPGAs have become popular as a valuable resource for the debug and verification of such complex high-performance embedded systems. With current FPGA technology, it has become possible to control and manage several different real-time and high-bandwidth interfaces simultaneously. In this way, [6] uses an FPGA to build a general-purpose, full-observability cosimulation platform. As another example, in [7] a JTAG-compatible logic analyzer core is presented, which

makes real-time debugging of FPGAs easier. [8] and [9] show two other examples of different FPGA-based architectures used to facilitate the debugging of highly demanding systems: a CCD image processing system and a wireless network node, respectively.

The remainder of this article is organized into four sections. Section 2 presents the system architecture of the SoPC debug platform and summarizes the architecture of the DVB-T modulator. Section 3 summarizes the implementation results of one configuration of the IP core and the SoPC debug platform. Section 4 concludes this paper and presents future work in this field.

2. SYSTEM ARCHITECTURE

The computation scheme of a DVB-T modulator fits a pipelined bus architecture. Figure 1 shows the architecture of the DVB-T modulator IP core under verification. Video frames are processed sequentially, starting at the transport stream interface core (transport_stream_if) and ending at the core that prepares the I+Q output for the DAC (dac_core). Table 1 summarizes the modules that compose the DVB-T modulator IP; the name of each identifies the computation it performs.

Table 1. Input and output data bus width of the DVB-T IP internal modules.

    Module name            Input bus data width   Output bus data width
    transport_stream_if    8                      9
    randomized             9                      9
    reed_salomon           9                      9
    external_interleaver   9                      8
    viterbi_puncture       8                      2
    internal_interleaver   2                      33
    pilot_and_tps
    ifft_ig
    dac_core (DDR)

It is worth mentioning that, as the input and output data bus widths of the modules differ, the speed of the data transfer on each point-to-point link differs as well. While the system runs, the dataflow inside the modulator core cannot be stopped, because video Transport Stream frames enter the modulator at the transport_stream_if module at a given sampling rate. Thus, the proposed debug solution must be able to extract, and transmit to the host, the data communicated between any two cores.
Given the huge data volumes involved, the debug task in the development of FPGA-based DVB-T modulator systems requires the storage of large volumes of real-time video frames. To deal with this, appropriate SoPC architectures and technologies must be designed. This paper presents a solution based on a SoPC system that can extract real-time information to a host via a 1 Gbps Ethernet TCP-IP connection. The useful information throughput will be above 200 Mbps (payload). To meet this challenge, the key technological elements in the system are:

- A high-end Virtex-5 FPGA (XC5VFX70TFF-1136).
- A hard-core PowerPC 440 processor, integrated into the FPGA silicon.
- A hard-core, high-performance Gigabit Ethernet controller integrated into the FPGA silicon.
- A hard-core DDR2 controller embedded inside the PowerPC processor.

The debug set-up establishes a point-to-point link between the capture system and a PC. The data throughput that the application demands is very high, and therefore the transmission through the Gigabit Ethernet channel must be optimized. Apart from the FPGA side (hard Gigabit Ethernet controller), a high-performance PC must be selected in order to avoid creating bottlenecks on the host side, so the quality of its network card, CPU, RAM and hard-disk recording speed must be taken into account. From the standpoint of architectural design, to fulfill these demands, the following elements are considered:

- Optimized FIFO with a dedicated on-chip bus: data capture is done through a FIFO, which is written with the information to be analyzed. A simple FSM is in charge of copying the data from the point-to-point link to this memory. The PowerPC processor reads the FIFO through a PLB-to-FSL bridge; the FSL is a bus optimized for direct FIFO connections.
- DMA transfers between memory and the Gigabit Ethernet controller: once the PowerPC processor has acquired data from the FIFO, it prepares TCP/IP packets in the system's dynamic RAM. These packets are transferred to the Gigabit Ethernet controller using DMA. This solution optimizes PLB on-chip bus transfers and substantially improves the performance of the TCP-IP transmission (see Section 3).

- Use of the hard checksum functions provided by the TMAC: the TMAC Gigabit Ethernet controller provides functions for checksum generation and verification, and this implementation takes advantage of those features. This releases the lwIP stack, in charge of TCP-IP packet

composition, from the software execution of these tasks, and allows a substantial improvement in the final performance of the communication.

- TCP-IP and lwIP parameter optimization: substantial performance improvements in communication can be achieved by modifying some parameters of the TCP-IP stack, in combination with size optimizations of the transmission and reception FIFOs. The most significant parameters are the following: Maximum Segment Size (TCP MSS): bytes; TCP Transmission Buffer (TCP SND BUF): bytes; TCP Window (TCP WND): bytes; TMAC transmission and reception FIFO: bytes.

Fig. 1. DVB-T modulator IP core block diagram.

Figure 2 shows the block diagram of the proposed SoPC for real-time debug. It has been implemented on a Virtex-5 FPGA. In addition to the critical modules mentioned above, the following additional cores are present in the system:

- 16 Kbytes of internal RAM memory built using dedicated block RAM modules.
- An SRAM controller for external memory.
- An interrupt controller.
- A UART for debug purposes.
- Auxiliary modules for clock, reset and JTAG management.

3. IMPLEMENTATION RESULTS

In order to obtain the debug system as fast as possible, both the IP core and the SoPC have been implemented on an ML507 Xilinx Virtex-5 evaluation board. This board is populated with a XC5VFX70T-FFG1136 device and has all the means needed for real-time operation: DDR2 external memory, SRAM memory and a Gigabit Ethernet physical link.
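The capture path described in Section 2 (an FSM copies the tapped point-to-point link into a FIFO, and the processor drains the FIFO into packets handed to the Ethernet controller by DMA) can be illustrated with a toy software model; the packet size here is an arbitrary stand-in, not the actual TCP MSS.

```python
# Toy model of the capture path: the FSM side fills the FIFO from the
# tapped link, and the processor side drains it into fixed-size packets.
# PACKET_WORDS is a hypothetical payload size, not a value from the paper.

from collections import deque

PACKET_WORDS = 4

def capture(link_samples, fifo):
    """FSM side: copy every word seen on the link into the FIFO."""
    for word in link_samples:
        fifo.append(word)

def drain(fifo):
    """Processor side: pack FIFO contents into packets for transmission."""
    packets = []
    while fifo:
        packet = [fifo.popleft() for _ in range(min(PACKET_WORDS, len(fifo)))]
        packets.append(packet)
    return packets

fifo = deque()
capture(range(10), fifo)
print(drain(fifo))  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```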
Figure 3 shows the block diagram of the whole system. Inside the FPGA, both the IP core and the SoPC have been implemented. In this set-up, the SoPC captures the data between the output of the FFT and the input of the DAC module. FSM Ctrl. is the finite state machine that controls the data transfer between the DVB-T modulator IP core and the FIFO stored in the CAPTURE_FSL_MASTER_OUT IP. Table 2 summarizes the implementation results. The first column describes the FPGA resource type; columns 2 and 3, respectively, summarize the FPGA occupation for the SoPC alone and in combination with the IP core under test (in this case, the DVB-T modulator). It is worth noting that a huge FPGA like the one used for this implementation allows easy fast-prototyping of complex debug systems: only 33% of the general-purpose resources of the FPGA are used, and all timing constraints are easily met.

Fig. 2. Block diagram of the SoPC real-time debug.

Fig. 4. Communication performance between the fast-prototyping board (ML507) and the PC host; data provided by the Iperf tool (per-interval transfers of 29.9 to 37.3 MBytes, i.e. 126 to 156 Mbits/sec).

Table 2. Implementation results of the SoPC designed for real-time debug of a DVB-T transmitter IP core (data for a Virtex-5 XC5VFX70T-FFG1136 FPGA).

    FPGA resource type        SoPC system    IP core under analysis and SoPC system
    4-input LUTs              (10%)          (12%)
    Slice Flip-Flops          (11%)          (15%)
    Virtex-5 Slices           (26%)          (33%)
    36K BlockRAM              17 (11%)       23 (15%)
    Hard PowerPC processor    1 (100%)       1 (100%)
    TMAC Gigabit Ethernet     1 (50%)        1 (50%)
Figure 4 shows a screenshot of the real-time communication between the ML507 evaluation board used to implement the presented platform and a PC host, over a point-to-point Gigabit Ethernet link. An Iperf server running on the PC measures the actual data rate of the transfer. Wireshark is used to capture the TCP/IP packets and is in charge of saving the reconstructed frames to the PC hard disk for further analysis. Those frames are captured and stored in real time; however, they are analyzed off-line, when they are compared with the ones generated by the DVB-T modulator reference model (implemented in C). As can be noticed, for the chosen communication parameters the obtained throughput is around 140 Mbps. This transfer bandwidth is enough to acquire data for debugging purposes at any interface of the DVB-T modulator flow chain.

Fig. 3. Block diagram of the SoPC in combination with the DVB-T transmitter IP core for real-time debug. (The modulator chain — transport stream interface, randomizer, Reed-Solomon coder, external interleaver, Viterbi puncturer, internal interleaver, pilot and TPS insertion, IFFT with guard interval and DAC core — is connected stage by stage through FIFO interfaces and controlled through slave bus interfaces by a CTRL block; the capture point feeds the debug SoPC of Fig. 2.)

4. CONCLUSIONS

The main contribution of this work is the report of a fast-prototyping SoPC debug system for real-time video applications. The debug of these systems sometimes requires saving a huge amount of real-time data for further analysis, and it is normally not easy to find commercial instrumentation that fits these needs. As has been proven in the presented case, this becomes possible thanks to the portability of the RTL code, which allows the migration of an IP core targeted at a low-cost FPGA to a high-end one, in combination with a SoPC that enables high-bandwidth communication with a remote host. Future work in this field includes better automation of all the processes involved in the experimental set-up and partial reconfiguration support allowing dynamic interchange of the IP core under test.

5. REFERENCES

[1] European Broadcasting Union, "Digital Video Broadcasting (DVB). ETS. Framing structure, channel coding and modulation for digital terrestrial television."
[2] European Broadcasting Union, "Digital Video Broadcasting (DVB). ETR290. Framing structure, channel coding and modulation for digital terrestrial television."
[3] Xilinx Corp., "Xilinx Platform Studio and EDK," Xilinx Documentation.
[4] G. Martin and H. C. (Eds.), Winning the SoC Revolution: Experiences in Real Design. Massachusetts, USA: Kluwer Academic Publishers.
[5] R. A. Bergamaschi, S. Bhattacharya, R. Wagner, C. Fellenz, and M. Muhlada, "Automating the Design of SOCs Using Cores," IEEE Design & Test of Computers, vol. 18, no. 5.
[6] X. Cheng, A. W. Ruan, Y. B. Liao, P. Li, and H. C. Huang, "A run-time RTL debugging methodology for FPGA-based co-simulation," in 2010 International Conference on Communications, Circuits and Systems (ICCCAS), 2010.
[7] Z. K. Baker and J. S. Monson, "In-situ FPGA debug driven by on-board microcontroller."
[8] F. Zhang, Q.-Z. Wu, and G.-Q. Ren, "A real-time capture and transport system for high-resolution measure image," vol. 1. Los Alamitos, CA, USA: IEEE Computer Society, 2010.
[9] Q. Wang, L. Wang, and J. He, "A New Simulation Scheme for Testing and Debugging Wireless Sensor Networks," in 5th International Conference on Wireless Communications, Networking and Mobile Computing (WiCom09), 2009.

HIGH RELIABILITY CAPTURE CORE FOR DATA ACQUISITION IN SYSTEM ON PROGRAMMABLE CHIPS

Jesús Lázaro, Armando Astarloa, Aitzol Zuloaga, Jaime Jimenez, Unai Bidarte, José Luis Martín
Department of Electronics and Telecommunications, University of the Basque Country
Alameda Urquijo s/n Bilbao - Spain
jesus.lazaro@ehu.es

ABSTRACT

The present paper presents both a standalone capture core and a SoPC system, together with a simulation framework and a practical implementation of a high reliability filter. The implementation uses FIR filters, although it can be extended to IIR filters or any other kind of mathematical circuit. The circuit makes use of triple redundancy and voter circuits to obtain a correct filter output in the presence of a failure either in the conversion circuit or in the FPGA. The system presented in this paper is not a substitute for a traditional triple redundancy circuit but an addition to it. The SoPC includes the standalone core and adds memory and communication cores to process the data and transfer it.

1. INTRODUCTION

Data acquisition is a key component in modern control systems. A data acquisition system is in charge of taking an analog signal and passing it to a digital processing system. In this process, the analog signal must be filtered, digitized and digitally filtered before transferring the data to the processing unit. Traditionally the processing unit has been built around a DSP (digital signal processor). In recent years there has been a shift from the DSP to the FPGA (field programmable gate array), the main reasons being:

FPGAs are more affordable every day.
FPGAs have increased their signal processing capabilities.
The parallel processing capabilities of FPGAs surpass the sequential processing power of DSPs.

Data acquisition is used in many critical systems [1, 2, 3].

This work has been partially supported by the Government of the Basque Country within the research program SAIOTEK (project SAI09/17).
These systems require a high degree of reliability, because both human lives and great economic losses can be at risk. One of the traditional ways of dealing with reliability is redundancy: tripling the system assures greater reliability and protects against a failure in one of the copies. FPGAs are not immune to failure. Like any electronic circuit, they suffer failures due to temperature, age, humidity, shock, etc. One difference between conventional circuits and FPGAs is their reconfiguration capability. SRAM (static random access memory) based FPGAs can suffer an upset both in the operational circuit and in their configuration memory [4, 5, 6]. This leads to both temporal errors (an upset in the circuit) and permanent ones (a failure in the configuration memory will remain until the device is configured again). This article presents a redundancy scheme for filters inside the FPGA. The article focuses on FIR (finite impulse response) filters because they suit the internal structure of FPGAs better than IIR (infinite impulse response) filters. Specifically, the fixed point implementation of FIR filters requires much less hardware than the floating point implementation of IIR filters. IIR filters are not normally implemented using fixed point, to avoid stability problems [7, 8]. The redundancy scheme is prepared to detect errors not only in the FIR circuit but in the conversion circuitry as well. The design presented in the paper is the first stage of a more complex system, in charge of combining the outputs of different conversion circuits into a single value. After the proposed system, a more conventional triple redundancy [9] signal processing circuit is likely to appear before passing the data to a processing unit. In fact, the whole vote and mean circuit should be tripled in a high reliability scheme. The article is divided as follows. First the filter structure is presented, including the simulation framework.
Secondly, the simulation results and the hardware resource utilization are presented, ending with some final conclusions and future work.
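The triple-redundancy principle referenced in this introduction can be illustrated in software with a bitwise majority function. This is a generic sketch for illustration only, not part of the paper's hardware:

```python
def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three redundant words: each output bit
    takes the value held by at least two of the three inputs."""
    return (a & b) | (a & c) | (b & c)


# A single corrupted copy is outvoted by the two healthy ones.
good = 0b1011_0010
corrupted = good ^ 0b0100_0000   # one flipped bit
assert tmr_vote(good, good, corrupted) == good
```

The same AND-OR structure is what a hardware majority voter computes per bit; the paper's scheme differs in that it votes on filter outputs and accumulates disagreements over time rather than correcting each sample directly.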

2. OVERALL STRUCTURE

The capture core is in charge of receiving data from the ADC, deciding the correct value and making it available through a PLB-compatible core. The SoPC is composed of several cores (one of them being the capture core) and a microprocessor that uses the captured data.

Fig. 1. Overall circuit and interconnection. (An external sensor feeds redundant ADCs; inside the FPGA, the capture core connects to the PowerPC through the PLB.)

The system is built using XSG (Xilinx System Generator) [10] and XPS [11]. XSG is a Simulink toolbox [12] capable of linking FPGA descriptions to Matlab sinks and sources. It is also capable of creating a custom core compatible with the PLB bus standard [13] so that it can be used inside XPS. The overall project can be divided into two main parts:

Capture core.
Complete SoPC.

3. CAPTURE CORE STRUCTURE

3.1. Simulation framework

The overall simulation framework (see figure 2) is composed of the following elements:

Input signal. It is composed of two different sinusoidal signals, one of which should be filtered out.
Error generation block. A white noise signal can be added to the input signal, simulating an error.
Hardware FIR filter. Three equal filters designed using FDATool (filter generation wizard).
Voter and mean. This block is in charge of detecting any non-working filter+ADC pair and of giving the correct output signal.
Reference filter. An FDATool filter to be used as baseline.
Spectrum analyzer. This block is used to compare the different outputs: ideal, system output and single hardware filter output.

3.2. Structure

The block in charge of combining the filter outputs in order to give the correct answer is built around the following blocks:

Voter. This block is in charge of deciding which filter, if any, is giving a corrupted output.
Fault counter. This block counts how many errors are found in each of the filters in a given time.
Disabling circuit.
Knowing the number of errors of each filter, this circuit disables the one with more errors (if all have the same number of errors, filter C is disabled).
Gate circuit. According to the disabling signal, it outputs either the filter value or 0 towards the mean calculator.
Mean calculator. This circuit adds all three gate outputs and divides the sum by two, a simple shift.

3.3. Voter

The voter is in charge of deciding which filter, if any, is giving a corrupted output. Table 1 shows the basic behaviour of a majority voter [14]. This circuit is capable of finding which one of the three inputs is giving a value different from the other two. If all three inputs are the same, the circuit outputs 0.

Table 1. Truth table of the voter circuit (inputs A, B, C; output Error). For every input combination the circuit outputs where the error is; a 0 output means that no error was found.

In our circuit, the voter tests the most significant bit of the output of each filter. This has been done because the system is tuned to use the full dynamic range of input values. If such a thing is not possible, bits with lower binary weight should be used. Contrary to a conventional voting circuit, the current output of the voter is not used directly; the average number of errors is used instead. This way the voting circuit does not need to be perfectly tuned, since there is margin for spurious outputs.

Fig. 2. Overall system, depicting inputs, filters and voting circuitry.

Fig. 3. Voter and mean calculator. A majority voter decides which filter output is probably failing. Several counters count how many failures happen in a given time. A third block disables a core if an anomalous condition is found. The fourth block forces the output of the failing filter to 0, while the last block calculates the mean of the outputs of the filters.

3.4. Error counter

In order to deal with spurious outputs from the voter circuit, an error counter is added. This block counts the number of errors from each filter in a given time. The time reference is given by the terminal count of a counter. The size of the counter is a parameter that can be changed to suit the application. The smaller the counter, the quicker it will react to changes in the inputs, but it will also react to short-term spurious errors. The bigger the counter, the slower it will react to errors in the inputs, but it will filter short-term spurious signals from the voter. It must be noted that any change of the active filters will produce a small transient in the output, so bigger counters can be desirable. In our test case, an 8-bit counter has been used.

3.5. Disabling circuitry

The truth table of this simple circuit can be found in table 2. This circuit decides which filter output not to use in the mean calculation. To do so, three comparators compare the error counts and, by means of a truth table, it is decided which filter output should not be used.

Table 2. Truth table of the disabling circuitry. Depending on the number of errors, filter A, B or C is disabled. X marks an impossible situation.

  A > B   A > C   B > C   disable
  0       0       0       C
  0       0       1       B
  0       1       0       X
  0       1       1       B
  1       0       0       C
  1       0       1       X
  1       1       0       A
  1       1       1       A

After deciding which output not to use, this output is converted to zero so it does not interfere with the addition.

3.6. Mean calculator

The mean calculator is composed of two adders and a divide-by-two circuit. Since division by an arbitrary number is an expensive operation and division by a power of two has zero hardware cost, the circuit has been designed to add the three outputs from the filters and divide them by two.
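The vote-and-mean chain described in sections 3.3 to 3.6 (voter on the MSBs, windowed error counters, the disabling truth table, the gate circuit and the divide-by-two mean) can be modelled in software. The following Python sketch is a behavioural illustration only; the function names, the sample tuple format and the initially disabled channel are assumptions of this sketch, not the paper's RTL:

```python
def voter(a: int, b: int, c: int):
    """Majority voter on single bits (the circuit tests the filters'
    MSBs): returns 'A', 'B' or 'C' for the odd one out, or None."""
    if a == b == c:
        return None
    if a == b:
        return "C"
    if a == c:
        return "B"
    return "A"


def disable_choice(e):
    """Mirror of the disabling truth table: the three pairwise
    comparisons of the error counts select the filter to disable
    (filter C on a full tie); impossible combinations are omitted."""
    key = (e["A"] > e["B"], e["A"] > e["C"], e["B"] > e["C"])
    table = {
        (False, False, False): "C", (False, False, True): "B",
        (False, True, True): "B", (True, False, False): "C",
        (True, True, False): "A", (True, True, True): "A",
    }
    return table[key]


def vote_and_mean(samples, window=256):
    """samples: iterable of (a, b, c, msb_a, msb_b, msb_c) tuples.
    Counts voter disagreements over a window, disables the worst
    filter at the counter's terminal count, gates its output to zero
    and averages the two remaining outputs (sum of the three gated
    outputs divided by two)."""
    errors = {"A": 0, "B": 0, "C": 0}
    disabled = "C"                       # default: tie disables C
    out = []
    for n, (a, b, c, ma, mb, mc) in enumerate(samples):
        bad = voter(ma, mb, mc)
        if bad:
            errors[bad] += 1
        if (n + 1) % window == 0:        # terminal count of the counter
            disabled = disable_choice(errors)
            errors = {"A": 0, "B": 0, "C": 0}
        gated = {"A": a, "B": b, "C": c}
        gated[disabled] = 0              # gate circuit
        out.append(sum(gated.values()) // 2)
    return out
```

With three healthy filters producing the value 10, the output is 10 on every sample; injecting a persistent fault on one channel makes that channel the disabled one after the first counter window, after which the output recovers.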
Fig. 4. Mean calculator. Dividing by two is very hardware efficient.

In case of no failure in any filter, one filter is still disabled, so the mean is always calculated the same way. The operation performed can be seen in equation (1); it must be taken into account that one of the outputs will always be zero, making a simpler circuit than a divide-by-three circuit.

  F = (A·d_A + B·d_B + C·d_C) / 2,  defaulting to (A + B) / 2    (1)

3.7. Simulation results

Figure 5 depicts the simulation results. Channel 1 is the output of the ideal filter designed with FDATool. Channel 2 is the output of the real system while one of the filters is not working properly. Channel 3 is the output of the faulty filter. The fault is injected using a white noise generator simulating a faulty ADC. As can be seen, the erroneous filter does not interfere with the correct functioning of the overall system.

3.8. Hardware results

Table 3 shows the resources needed for the system. The target FPGA has been a Xilinx Spartan 3A-DSP [15]. This FPGA is well suited for signal processing since it has plenty of DSP48A resources as well as internal RAM and logic. The overall system uses a mere 3% of the FPGA when implementing a 50th-order lowpass equiripple FIR filter. The input data rate has been set to 1 MHz and 16 bits. The system runs 64 times faster, at 64 MHz. This increased speed allows 64 operations to be performed in each DSP before a new datum arrives. This means that each filter only requires a single DSP (in fact, a 63rd-order filter could be implemented with minor extra hardware requirements). Since the filter is tripled, 3 DSP48 blocks [16] are required, as well as a little bit of extra logic for the comparators, adders and counters. This means that,

in a worst case scenario, the hardware overhead is limited to 2 DSP48A blocks and less than 2% of the FPGA. This implementation is useful for a standalone version of the core. The resources used in the final FPGA are slightly bigger, since the interconnection logic has to be added. It may seem that it requires fewer slices, but a Virtex5 slice is twice the size of a Spartan 3 slice.

Fig. 5. Spectrum result of the filters (magnitude-squared in dB versus frequency in kHz). Channel 1 depicts the ideal floating point filter, channel 2 the output of the real system, and channel 3 the output of the non-working filter.

Table 3. Resource summary report for a Spartan 3A-DSP. Timing constraints set to 64 MHz, allowing 1 MHz input sampling time.

  Resource   Quantity   % of FPGA
  DSP48As    3          3%
  Slices     648        3%

Table 4. Resource summary report for a Virtex5. Look-up tables in Virtex5 devices are bigger than in Spartan 3 and the number of flip-flops is also bigger.

  Resource   Quantity   % of FPGA
  DSP48E     3          2%
  Slices     431        3%

4. SOPC STRUCTURE

The capture core seen in the previous section can be used standalone but, using the export Pcore feature, it can also be used inside a SoPC. In this kind of system, all the elements of the circuit are integrated inside an FPGA. The capture core has to be slightly modified; specifically, the interconnection interface with the PLB bus has to be defined. In our case, this connection is done through a FIFO-style shared memory. This memory is written by the core and read through the PLB.

4.1. Hardware structure

The main elements inside the FPGA are:

Capture core.
PowerPC hard processor.
TEMAC: hard Ethernet MAC.
Memory cores: DDR2 interface core, Flash interface core.

The capture core has been explained in previous sections, so we will focus on the rest of the cores.

4.2. PowerPC hard processor

The IBM PowerPC 440 core is a hard 32-bit RISC CPU block designed into the fabric of select Virtex series FPGAs to implement high performance embedded applications.
The combination of hard cores with integrated co-processing capability enables a wide range of performance optimization options. The PowerPC 440 processor is supported by Virtex-5 FXT FPGAs together with a sophisticated CPU/APU controller and a high-bandwidth crossbar switch. The crossbar switch enables high-throughput 128-bit interfaces and point-to-point connectivity. Integrated DMA channels, a dedicated memory interface, and Processor Local Bus (PLB) interfaces minimize logic utilization, reduce system latency and optimize performance. Simultaneous I/O and memory access maximizes data transfer rates.

4.3. TEMAC: Hard Ethernet MAC

TEMAC is an acronym for Tri-Mode Ethernet Media Access Controller and is a reference to the three-speed (10, 100, and 1000 Mb/s) capable Ethernet MAC function available in this core. This core is based on the Xilinx hard silicon Ethernet MAC in the Virtex-5 FXT. It provides some very advanced capabilities:

DMA transfers between memory and the Gigabit Ethernet controller.
Hard checksum functions. These release the IP stack, in charge of packet composition, from the software execution of these tasks and allow a substantial improvement in the final performance of the communication.
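As an illustration of the software work that the TEMAC's hard checksum offload removes from the IP stack, the standard Internet checksum (RFC 1071 style: 16-bit one's-complement sum, complemented) can be computed as follows. This is a generic sketch, unrelated to the actual Xilinx implementation:

```python
def inet_checksum(data: bytes) -> int:
    """RFC 1071 Internet checksum: sum the data as big-endian 16-bit
    words in one's-complement arithmetic, then complement the result."""
    if len(data) % 2:
        data += b"\x00"                  # pad odd-length input
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)   # fold the carry back in
    return ~total & 0xFFFF
```

For the classic RFC 1071 example bytes 00 01 f2 03 f4 f5 f6 f7, this returns 0x220d. In software this loop touches every byte of every packet, which is precisely the per-packet cost a hard checksum engine eliminates.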

Table 5. Resource summary report for a Virtex5 70fxt. Timing constraints set for 100 MHz bus speed to allow high speed communications.

  Resource   Quantity   % of FPGA
  PPC        1          100%
  TEMAC      1          50%
  BRAM       15         10%
  DSP48As    3          2%

4.4. Memory cores

The system has two memory interfaces, one for DDR2 and another for Flash. The combination of these memories allows the use of complex software schemes such as operating systems, IP stacks, etc., allowing the system to transfer any data using standard protocols.

4.5. Hardware results

Table 5 presents a summary of the required resources. The system is built around the high performance Virtex5 70fxt. The PowerPC runs at 400 MHz to provide maximum performance. The presented system has only a single capture core, but there is plenty of room both to add more capture cores and to build a more complex SoC.

5. CONCLUSIONS AND FUTURE WORK

The present paper presents both a simulation framework and a practical implementation of a high reliability filter. The implementation uses FIR filters, although it can be extended to IIR filters or any other kind of mathematical circuit. In systems where FPGA failure is of concern, the vote and mean circuitry should also be tripled, as well as any subsequent signal processing circuitry. The system can be upgraded to detect an error both in the input (analog to digital converter) and in the output (result of the filtering), so that action can be taken to try to solve the problem. If the error is in the input, not much can be done but, if the error is inside the FPGA, some action can be taken, ranging from resetting the offending circuit to full FPGA reconfiguration, with partial reconfiguration as the middle point.

6. REFERENCES

[1] B. McHarg, "Control, data acquisition, and remote participation for fusion research," Fusion Engineering and Design, vol. 71, no. 1-4, pp. 1-3, 2004, 4th IAEA Technical Meeting on Control, Data Acquisition, and Remote Participation for Fusion Research. [Online]. Available: sciencedirect.com/science/article/b6v3c-4cgnsf2-1/2/bfd1aabcaa30ed b4742affb1a
[2] K. Nurdan, H. Besch, B. Freisleben, T. Conka-Nurdan, N. Pavel, and A. Walenta, "Development of a Compton Camera Data Acquisition System Using FPGAs," in Proceedings of the 2003 International Signal Processing Conference.
[3] H. I. Schlaberg, D. Li, Y. Wu, and M. Wang, "FPGA Based Data Acquisition and Processing for Gamma Ray Tomography," AIP Conference Proceedings, vol. 914, no. 1. [Online]. Available: http://link.aip.org/link/?apc/914/831/1
[4] P. Adell and G. Allen, "Assessing and mitigating radiation effects in Xilinx FPGAs," JPL, Tech. Rep. [Online]. Available:
[5] R. Baumann, "Soft errors in advanced semiconductor devices - part I: the three radiation sources," IEEE Transactions on Device and Materials Reliability, vol. 1, no. 1, Mar.
[6] R. Baumann and E. Smith, "Neutron-induced boron fission as a major source of soft errors in deep submicron SRAM devices," 2000.
[7] M. Bellanger, Digital Processing of Signals: Theory and Practice. John Wiley & Sons Ltd.
[8] A. V. Oppenheim and R. W. Schafer, Discrete-Time Signal Processing, 3rd ed. Prentice Hall.
[9] Xilinx, "TMRTool Product Brief," publications/prod mktg/xtmrtool ssht.pdf.
[10] Xilinx, "Xilinx System Generator for DSP," com/tools/sysgen.htm.
[11] Xilinx, "Xilinx Platform Studio," xps.htm.
[12] The MathWorks, "Simulink - Simulation and Model-Based Design."
[13] Xilinx, "Processor Local Bus (PLB) v4.6," com/support/documentation/ip documentation/ds531.pdf.
[14] R. Perez, "Methods for Spacecraft Avionics Protection Against Space Radiation in the Form of Single-Event Transients," IEEE Transactions on Electromagnetic Compatibility, vol. 50, no. 3, Aug.
[15] Xilinx, "Spartan-3A DSP FPGA Family: Complete Data Sheet," sheets/ds610.pdf.
[16] Xilinx, "XtremeDSP DSP48A for Spartan-3A DSP FPGAs User Guide," user guides/ug431.pdf.

DEVELOPMENT OF A GENERIC PLATFORM FOR VISION SYSTEMS BASED ON THE CORECONNECT ARCHITECTURE

Pantaleone Luis M., Leiva Lucas E., Vazquez Martín
INCA/INTIA, Universidad Nacional del Centro de la pcia. de Bs. As.
Paraje Arroyo Seco, Tandil, pcia. de Bs. As., Argentina

ABSTRACT

This work presents the implementation and analysis of a generic platform for image acquisition, processing, visualization and transmission. The platform is based on SoCs (System on a Chip) implemented on Xilinx FPGAs, using the CoreConnect architecture. The advantage of this platform is the ease of adding different cores to it, as well as image processing algorithms running on the microprocessor. A core in charge of controlling a camera and acquiring images from it has been developed, writing the images into an external memory, together with the software for processing and visualizing them. This software runs on the embedded microprocessor of the system.

1. INTRODUCTION

Systems dedicated to video processing for specific purposes are more common every day. Among them we can highlight uses related to measurement, inspection, recognition and orientation, or specific systems for industrial applications [1]. Vision systems can be implemented with different technologies, from simple computers to the most complex ASIC systems, passing through FPGAs, microcontrollers, DSPs, etc. A SoC is an integrated circuit that includes a processor, a bus and other elements on a single chip. SoC developments can be implemented with different technologies: ASIC, FPGA, microcontrollers, DSP, etc. When developing a SoC on FPGAs several architectures exist, among which CoreConnect [2] stands out; it is oriented towards the interconnection of cores through buses.

A core is a device or peripheral connected to the bus. Cores can be of two types, softcore and hardcore. Hardcores are physical resources of the FPGA; examples are embedded microprocessors, block RAMs, multipliers, etc. A softcore, on the other hand, is a synthesizable resource; examples are microprocessors such as MicroBlaze, OpenSparc or Nios II, processing filters, etc. Cores can be connected to the bus as masters or as slaves. To communicate with the bus they must implement the corresponding communication protocol. A core is a master when it initiates a request towards another core, while a core is a slave when it only receives requests. A core can be master and slave at the same time, as long as it has enough logic to fulfil both functions.

Fig. 1. Generic platform of the vision system. (Camera core, IP cores, VGA, PowerPC and memory connected through the PLB bus inside the FPGA.)

2. PLATFORM DESIGN

The goal is to design a system (fig. 1) that stores in RAM the images received from a camera for later processing, visualization and transmission. The system must be easily scalable and portable. To this end, the CoreConnect architecture provided by Xilinx has been chosen; it was developed by IBM to be used together with the PowerPC microprocessor. This architecture is based on the use of cores and buses. The architecture version used is 4.6 [3], which comes integrated in EDK 10.1 [4].

The system has a core in charge of controlling the camera, receiving its data and writing it into memory. The sensor is a 5 MP CMOS sensor manufactured by Micron, with serial name MT9P001 [5]. This sensor is mounted on a development board (headboard) carrying a Navitar lens that controls both the aperture and the focal distance. Its resolution is 2592 (horizontal) x 1944 (vertical) pixels. Each pixel has a depth of 12 bits. The sensor works with the Bayer pattern [6]. Image processing is carried out on the embedded PowerPC (or MicroBlaze) microprocessor, through the execution of a program coded in C, which is also in charge of transmitting the images both to a monitor and to the serial interface. The processed images are stored in the external memory, in three different areas: one memory area belongs to the video core, a second area is where the camera core writes the images, and a third one is where the processed image is stored to be later visualized or transmitted.

Fig. 2. System architecture.

3. SOC ARCHITECTURE

Figure 2 shows the architecture with the components involved in the system. The IP cores employed are mainly block RAM controllers (bram block and xps bram if cntrl), a RAM memory controller (mpmc), a UART controller (xps uartlite), a video driver (xps tft), and the PowerPC controller (ppc405). The bus used for communication is the PLB in its version 4.6. The peripherals in charge of the UART, external memory, RAM blocks and external switches are connected to one given bus. The video controller IP core is both master and slave.
It connects to two different buses: the master interface connects to its own bus to communicate with the memory, and the slave interface to the bus where the rest of the peripherals are connected. The reason it connects to the memory controller through a dedicated bus is that this peripheral needs a high bandwidth; in this way it avoids sharing the bus with other peripherals and always has access to it. The IP core controlling the PowerPC has four ports, two dedicated to data and two to instructions, called D0 and D1 for the data and I0 and I1 for the instructions. Ports I0 and D0 are connected to the same bus as the rest of the peripheral cores in order to interact with them, while D1 and I1 are connected to the memory through a dedicated bus to obtain fast access to it without having to compete with other cores for bus access.

3.1. Camera core

As mentioned before, the core (fig. 3) was developed following the layered architecture for core development [7]. In its lowest layer (called IPIF), this architecture communicates with the PLB bus and provides a simplified interface called IPIC to the upper layer, called User Logic, where the core logic is placed. The core was developed in VHDL. It is portable to other systems, as long as they use version 4.6 of the PLB bus. It mainly consists of two modules: one in charge of receiving the data and configuring the camera (the driver), and another in charge of sending the data through the bus towards the memory for later processing. The driver takes care of the camera configuration and of obtaining the intensity value of the pixels with a resolution of 8 bits. The camera configuration is done through the I2C protocol. The User Logic implements the controller logic.

This entity communicates with the camera driver and sends the data to the memory through the bus. It is in charge of passing the data from the driver to the IPIF by setting the signals and addresses for the data transfer. Since the driver works with 8 bits of resolution per pixel and the bus width is 32 bits, the data is stored in a buffer and sent in 32-bit packets (4 pixels). The start address of the write area is configurable via core parameters.
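The packing of four 8-bit pixels into one 32-bit bus word performed by the User Logic can be sketched in software. This is an illustrative Python model; the byte ordering shown (first pixel in the most significant byte) is an assumption of the sketch, since the real core fixes it in VHDL:

```python
def pack_pixels(pixels):
    """Pack 8-bit pixel values into 32-bit bus words, four pixels per
    word, first pixel in the most significant byte (assumed order)."""
    assert len(pixels) % 4 == 0, "pad the stream to a multiple of 4"
    words = []
    for i in range(0, len(pixels), 4):
        w = 0
        for p in pixels[i:i + 4]:
            w = (w << 8) | (p & 0xFF)   # shift previous bytes up, append pixel
        words.append(w)
    return words


# Eight pixels become two 32-bit words.
assert pack_pixels([0x11, 0x22, 0x33, 0x44, 0xAA, 0xBB, 0xCC, 0xDD]) \
    == [0x11223344, 0xAABBCCDD]
```

Packing quarters the number of bus transactions per frame, which is what lets the core keep up with the pixel stream despite the roughly 11-cycle cost of each transfer.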

Fig. 3. Architecture of the camera core.

The camera works at a frequency of 24 MHz and the core at a frequency of 100 MHz; besides storing the data, the buffer synchronizes it so that both frequencies can work together. Data is written into the buffer when the driver indicates it. The buffer is written at 24 MHz, while it is read at 100 MHz. On average, the data transfer protocol takes approximately 11 cycles of the system clock. In this way the system is able to send the data as it arrives. For the implementation of this buffer, the srl fifo 16 component obtained from OpenCores was used, extended with additional logic so that it can fulfil the functions mentioned. The buffer implementation does not use block RAM but LUT registers. In this way the block RAM can be reserved for other uses, allowing better portability by not consuming resources that vary significantly from one FPGA family to another.

4. EXPERIMENTAL RESULTS

For the implementation of the system the Xilinx University Program (XUP) development kit was used, which carries a Virtex 2 Pro (XC2VP30 [8]). This kit is known as XUPV2P [9] and is manufactured by Digilent Inc. The FPGA has two embedded PowerPC microprocessors in their 405 version [10]; the softcore MicroBlaze processor can be used as an alternative to them. This version of the PowerPC has some differences with respect to the 440 version [11] used in some Virtex 5 families. The implementation of the platform used a total of 6609 LUTs (24%). The camera core needed 1108 LUTs (4%). The maximum frequency at which the platform can run turned out to be 106 MHz, while the camera core can run at 163 MHz.

Table 1. Execution times.

  Algorithm                 Without cache   With cache   Speed-up
  Simple interpolation      251 ms          225 ms       11.6 %
  Bilinear interpolation    276 ms          270 ms       2.2 %
  Gradient interpolation    647 ms          602 ms       7.5 %
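As a software reference for the interpolation algorithms timed in Table 1, the green-channel step of the simple and bilinear Bayer decoders can be sketched as follows. This is illustrative Python for interior pixels of the mosaic only; the platform's actual implementation is a C program on the embedded PowerPC, and the neighbour choices here are a simplification:

```python
def simple_green(raw, y, x):
    """'Simple' interpolation at a red/blue site: copy one
    neighbouring green sample (the left one, chosen arbitrarily)."""
    return raw[y][x - 1]


def bilinear_green(raw, y, x):
    """Bilinear interpolation at a red/blue site: average the four
    direct neighbours, which are all green in the Bayer pattern."""
    return (raw[y - 1][x] + raw[y + 1][x]
            + raw[y][x - 1] + raw[y][x + 1]) // 4
```

The gradient-based variant of [12] additionally compares horizontal and vertical differences and averages only along the direction of smaller gradient, which explains its extra memory accesses and the longer times in Table 1.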
El sistema se ejecuta a 100MHz (el microprocesador PowerPC inclusive). Se implementaron diferentes algoritmos de decodificación de Pattern Bayer, los cuales convierten las imágenes de entrada en imágenes de color verdadero. Se probaron los algoritmos de decodificación de interpolación simple, bilineal e interpolación basada en gradiente[12][6]. Las métricas se realizaron sobre estos algoritmos de decodificación realizados por el microprocesador. Los tiempos empleados en los diversos algoritmos incluyen los tiempos de copia de la memoria de la imágen hacia una memoria temporal y el tiempo de copia hacia la memoria del core de la salida de video VGA. En una segunda instancia se introdujo una memoria block ram al bus DSOCM[13] conectado al microprocesador, para que actúen como caches de datos. En la tabla 1 se observan los tiempos de ejecución de los diferentes algoritmos implementados con y sin utilización de cache. La diferencia de tiempos entre los tres algoritmos se debe principalmente a la diferencia en la cantidad de accesos a memoria. Por cada lectura de un determinado píxel de la imágen se está haciendo un acceso a memoria. 5. CONCLUSIONES Se desarrolló una plataforma genérica para sistemas de visión basados en FPGAs, utilizando la arquitectura Core- Connect. La ventaja de esta plataforma es la facilidad de agregar algoritmos de procesamientos de imágenes, ya sean 81

implemented in the microprocessor or as a core connected to the bus. Another advantage is the ease of adding further cores to the platform. It is possible to work with more than one camera, adding one camera core per connected camera and configuring them to write to different memory areas. A core was developed to control the camera and receive its data. It is portable, provided that version 4.6 of the PLB bus is used, and its write start address is parameterisable. Metrics were taken on the algorithms executed on the microprocessor. The time difference between the various algorithms was found to be due to the different number of memory accesses performed per pixel. A reduction in execution time was achieved by adding a cache to the microprocessor.

6. FUTURE WORK

As future work, we propose implementing the platform on a Virtex-5 FX FPGA, which carries an embedded PowerPC 440 microprocessor that communicates directly with the memory controller. As a complement to the serial transmission, routines could be added to the program running on the microprocessor to send the processed images over Ethernet, using a network protocol such as TCP/IP or PPPoE (Point-to-Point Protocol over Ethernet); additionally, a core could be developed to send images over a high-speed serial interface. To increase processing capacity, several microprocessors could be used to execute the different processing tasks: either running different algorithms on different microprocessors, or having the microprocessors work cooperatively (running the same algorithm on different areas of the image), with one of them acting as master and the other as slave.
On the Virtex-II Pro, its two embedded PowerPC microprocessors would be used; on the Virtex-5, which has only one, it would be combined with a soft-core microprocessor (such as the MicroBlaze).

REFERENCES

[1] EMVA, An Introduction to Machine Vision.
[2] IBM, CoreConnect Bus Architecture.
[3] Xilinx, Processor Local Bus (PLB) v4.6 (v1.04a).
[4] Xilinx, EDK Concepts, Tools, and Techniques.
[5] Micron, 1/2.5-Inch 5-Megapixel CMOS Digital Image Sensor.
[6] S. Imaging, RGB Bayer Color and MicroLenses.
[7] Xilinx, PLB IPIF (v1.00f).
[8] Xilinx, Virtex-II Pro and Virtex-II Pro X Platform FPGAs: Complete Data Sheet.
[9] Xilinx, Xilinx University Program Virtex-II Pro Development System - Hardware Reference Manual.
[10] Xilinx, PowerPC 405 Processor Block Reference Guide.
[11] Xilinx, Embedded Processor Block in Virtex-5 FPGAs.
[12] R. Ramanath, W. E. Snyder, G. L. Bilbro, and W. A. S. III, "Demosaicking methods for Bayer color arrays," Journal of Electronic Imaging, vol. 11.
[13] Xilinx, Data Side OCM Bus v1.0.

RAPID PROTOTYPING OF AN IP CORE FOR APPLYING THE WAVELET TRANSFORM TO IMAGES

MELO Hugo Maximiliano, PEREZ Alejandro, GUTIÉRREZ Francisco, CAVALLERO Rodolfo
Centro Universitario de Desarrollo en Automación y Robótica (CUDAR)
Universidad Tecnológica Nacional Facultad Regional Córdoba
M.M. Lopez esq. Cruz Roja Argentina, Ciudad Universitaria - Córdoba

ABSTRACT

A methodology for implementing functional prototypes of embedded systems based on FPGA technology is presented. The high-level tools used are described, and a filter bank for the 2-D Wavelet transform is developed and then incorporated into the embedded system.

1. INTRODUCTION

One of the projects under development at the CUDAR of the Universidad Tecnológica Nacional Facultad Regional Córdoba is Video Compression with Wavelets in Programmable Logic. The project uses the Xilinx Virtex-II Pro FPGA, which carries two PowerPC processors embedded in silicon. The system uses one of these processors and incorporates various peripherals in the form of IP cores that configure the FPGA logic. The Xilinx XPS was used as the development platform. Video compression involves mathematical operations and algorithms, several of which have been implemented as hardware functions. During the development process, different methods for rapid prototyping of functions were investigated and tested, so that algorithms could be evaluated and verified without the time required by conventional description methods.
The method described in this work uses one of the most widespread programs in the mathematical field: MatLab, by The MathWorks, and its associated module SimuLink, a versatile platform for the design and simulation of dynamic systems. This allowed project members with little experience in programmable-logic development methodology to try out ideas and run simulations that can be brought to hardware with little effort. Xilinx has developed the System Generator, which includes a set of blocks for use with SimuLink. The rapid-prototyping method employs the concepts of black boxes and hardware abstraction to ease the conceptual handling of the modules. Once the prototype is implemented and simulated, the system must be downloaded to the hardware platform. As its high-level tool for embedded-system development, Xilinx offers the EDK (Embedded Development Kit). This software platform comprises the XPS (Xilinx Platform Studio) for hardware development and the SDK (Software Development Kit) for software development; both tools are intended to shorten development time. The XPS allows the system architecture to be designed graphically, handling functional blocks that represent hardware devices distributed as IP (Intellectual Property) cores, which use the FPGA cells to generate the physical device. The auxiliary MatLab code will be replaced by program code running on the PowerPCs available on the platform.

2. WAVELETS

Wavelets are families of functions, localised in space, used for analysis; they examine the signal of interest to extract particular characteristics of space, size, and direction.
The family is defined by:

h_{a,b}(x) = \frac{1}{\sqrt{a}}\, h\!\left(\frac{x-b}{a}\right), \qquad a, b \in \mathbb{R},\; a \neq 0

The Wavelet family is generated from a mother function h(x), which is modified through the variables a and b to obtain translations and temporal scaling. In this way the

best concentration of time and frequency information is achieved [1]. Wavelet transforms are classified into Discrete Wavelet Transforms (DWT) and Continuous Wavelet Transforms (CWT).

2.1. 2-D Discrete Wavelet Transform

Discrete Wavelet Transform (DWT) analysis can be implemented with banks of low-pass and high-pass filters followed by down-sampling stages; synthesis likewise uses filter banks together with up-sampling of the signal. Fig. 1 is a diagram of the analysis process. Down-sampling and up-sampling denote, respectively, a decrease or increase in the number of samples, achieved by discarding a sample or interleaving a zero between samples [2]. The Wavelet energy feature {En_i}, n = 1...d, i = H, V, D, reflects the distribution of energy along the frequency axis at a given scale and orientation. The energy of images is concentrated in the low frequencies: an image has a spectrum that decreases as frequency increases. These properties are reflected in the Discrete Wavelet Transform of the image [3]. In compression, and in some other applications of the transform, a multilevel technique is required. It is obtained by successively applying the transform to the approximation part of the previous stage. Fig. 3 shows a classical representation of the result of the multilevel Wavelet transform, in which the matrix has the same dimensions as the original image. The nomenclature is read as follows: the first letter indicates the direction of the detail or the approximation (V = Vertical, D = Diagonal, H = Horizontal, A = Approximation); the number indicates the transform level to which it corresponds.

Figure 1. Simple Decomposition

An image is a data matrix in which each element represents a pixel; a colour image can be represented by its RGB or YCrCb components.
To apply the 2-D Wavelet transform using the separable-filter method, the matrix must be traversed in two ways: first by rows and then by columns, as shown in Fig. 2.

Figure 3. 3 Levels of 2-D Wavelet

3. IMPLEMENTATION

For the implementation of the 2-D Wavelet transform, the separable-filter method was chosen. As a proof of concept, the different stages of the transform were implemented modularly. Each stage is an independent module, and the connection between them is made with MatLab code that will later be replaced in the FPGA by data-manipulation routines executed on the embedded processors. The method employed and the results obtained are described below.

3.1. Characteristics

Figure 2. 2-D Wavelet

The normalised energy of a sub-image formed by N Wavelet coefficients is defined as:

En_i = \frac{1}{N} \sum_{j,k} \left[ D_{n_i}(b_j, b_k) \right]^2

The FIR filters to be implemented are those that realise the Wavelet transform by the filter-bank method. Their coefficients are obtained from the MatLab command window with the following expression:

[LO_D,HI_D,LO_R,HI_R] = WFILTERS('db3');

'db3' tells the MatLab function that the mother Wavelet is a Daubechies 3. The resulting filters for this Wavelet are of order 5, with a total of 6 coefficients. The schematic of Fig. 4 is then built in the SimuLink environment, using native SimuLink and System Generator blocks.

Figure 4. First Simple Decomposition

The red-chrominance component (Cr) of a 100 x 100 pixel image is taken as input. After the decimation of the first simple decomposition, the filters yield N + 2 coefficients with numerical value, where N is the number expected after one decimation. These two extra values can be explained by:

N + 2 = \left\lfloor \frac{\text{input data} + \text{filter order}}{2} \right\rfloor

3.2. Proposed implementation

This work proposes to rapidly implement in hardware a functional prototype of the system that allows a conceptual and functional evaluation of the design, without yet attacking its optimisation. To perform the required sweeps of the data matrix, direct MatLab programming is used, and the signal is fed to SimuLink directly as a vector, which makes the matrix traversal transparent to the latter. MatLab is designed to work with matrices, so operations on this kind of data arrangement are extremely simple to perform; most of them reduce to operators, such as the one that returns a vector by traversing the matrix column by column. The horizontal and vertical sweeps use the same function, applied to the original matrix or to its transpose. To use this method, the matrix must be square and must also respect the appearance of the original image. As a test, the following proposition was examined: the output of the first decomposition yields 5002 data, which is not compatible with a rectangular matrix.
To obtain a rectangular matrix, the following solutions were tried: a) 98 zeros were added to make the number of data compatible with the rectangular matrix required by the next stage; this distorted the reconstructed image, since the zeros became embedded in the analysis. b) The number of data was truncated, taking 5000 data as valid. During the trials it was found that not just any data can be discarded, since this affects the results of the subsequent reconstruction. For each simple decomposition, the first N data were taken as valid, discarding M data of the decomposition, where M is obtained by truncating to the integer part:

M = \left\lfloor \frac{\text{filter order}}{2} \right\rfloor

The number N of useful data is computed as:

N = \frac{\text{input data}}{2}

In this way the data needed to form the rectangular matrix required by the following steps are obtained. This method was used both in the decomposition stage and in the reconstruction stage. Note that, for a correct reconstruction with this method, the data trimming must be applied in the reverse way: if the first N data were used in the decomposition stage, then the last N data must be used in the reconstruction stages,

N' = \frac{\text{input data}}{2}

again leaving out a quantity M of the output coefficients.

3.3. Distortion at the borders of an image

The filter-bank theory used to implement the Wavelet transform is stated for, and works properly with, infinite signals, but distortions appear at the limits of finite signals, as is the case with an image [4]. Several methods have been proposed to solve this problem; all of them extend the signal in some way.
The literature consulted proposes, among others, circular convolution and symmetric reflection, obtained by reflecting and symmetrically repeating the samples at the boundary. MatLab also offers zero padding. These extensions are not arbitrary and depend exclusively on the filter order. Despite the extension, the output still shows distortion at the borders; however, the symmetric reflection is also quite easy to see at the filter outputs. Removing this symmetric distortion yields a perfectly recovered output, which can be verified in MatLab with:

[Ap De] = dwt(entrada, 'db3');

where entrada is a vector of finite values, db3 is the type of wavelet used to compute the coefficients, and the outputs Ap and De are the Approximation and Detail, respectively.

3.4. Implementation results

When comparing the reconstructed image with the original using the implemented method, errors were found in the upper-left margin of the image, more specifically in an n x n sub-matrix, where n is the number of coefficients of the filter implemented for Daubechies 3. To solve this problem, frames were added to the original image. In this rapid-prototyping experience a frame of numerical value one was used, which made it possible to see where the image began once the frame ended; although no boundary extension was employed, a perfect reconstruction was achieved. This is because the errors from the information-truncation processes used to form the auxiliary traversal matrices, described above, fall within the perimeter of the frame (Fig. 5), which is later removed.

Figure 5. Frames added to the original image

3.5. Module implementation

The system generated in VHDL by SimuLink has as ports two 8-bit input/output registers, an input for the system clock, and a chip-enable signal, which in this application is tied to logic 1. To embed it, the Create or Import Peripheral wizard is run. The steps to implement an embedded system with the XPS environment are:

- Graphically generate the base platform: microprocessor, bus, memory controllers, generic I/O peripherals, etc.
- Generate a new IP core to incorporate the top entity of the Wavelet blocks. This must be taken into account among the functions selected in the wizard that generates the IP interface (IPIF).
- Provide the two software-accessible registers, which are the same ones required by the module generated by System Generator.
- Instantiate the sources of the files generated by System Generator in the user:logic.hdl and top_entidad.hdl files.

Once the incorporation is verified by synthesising the sources, the IP is added from the repository in the EDK environment, connected to the corresponding bus, and the system memory map is generated.

Figure 5. EDK

With the filter incorporated into the system, a routine is created in SDK that writes into the filter's input register data sent over the serial port from a PC connected to the system. The resulting data in the output register are sent back to the PC to be checked against the results of the simulations performed in SimuLink.

4. CONCLUSIONS AND FUTURE WORK

The tests performed showed the methodology to be effective for implementing IP modules, and the interaction between the high-level tools proved robust and reliable. The possibility of using the power of MatLab for algorithm development and verification opens the door to implementing hypotheses quickly for validation or refutation, shortening research and development times. The evolution of embedded-system development tools on FPGA platforms shows important and sustained growth of this field in this kind of technology.

5. REFERENCES

[1] P. Jalali, Wavelets and Application. Energy Technology Department, Lappeenranta University of Technology.
[2] S. Burrus, R. Gopinath, and H. Guo, Introduction to Wavelets and Wavelet Transforms. Electrical and Computer Engineering Department, Rice University, Houston, Texas.
[3] B. J. García Menéndez, E. Mancilla Ambrona, and R. Montes Fraile, Optimización de la transformada Wavelet Discreta. Universidad Complutense de Madrid, Facultad de Informática.
[4] G. Strang and T. Nguyen,
Wavelets and Filter Banks.

Cortex-M0 Implementation on a Xilinx FPGA

Pedro Ignacio Martos and Fabricio Baglivo
Laboratorio de Sistemas Embebidos, Facultad de Ingeniería - UBA, Buenos Aires, Argentina
pmartos@fi.uba.ar / baglivofabricio@gmail.com

ABSTRACT

This paper presents an implementation of the recently available ARM Cortex-M0 soft-core processor on a small Xilinx FPGA. These kinds of processors are oriented to mixed-signal and microcontroller devices with low cost and low power consumption. At present, Xilinx does not offer a Cortex-M0 soft-core solution; by contrast, Actel and Altera offer Cortex-M1 cores in FPGA solutions. The aim of this work is to connect the ARM Cortex-M0 soft core to a synthesized code memory using the Advanced Microcontroller Bus Architecture (AMBA) AHB-Lite interface in a Xilinx FPGA, to evaluate how many fabric resources are needed to implement this soft-core processor. For this purpose, we have designed a small system that receives the data request from the processor, sends it to the synthesized memory, and returns the data obtained to the processor. In this implementation, the proper working of the system is monitored with Chipscope Pro, a Xilinx in-system debugging tool.

1. INTRODUCTION

Within the 32-bit ARM processor family, shown in Figure 1, Cortex-A processors support high-performance applications with embedded operating systems and real-time applications. Cortex-R processors are oriented toward low-cost controllers with deterministic, fixed-latency interrupt handling; these processors are also intended for high-performance, real-time applications. Cortex-M is the lowest member of the Cortex family, an option for simple, low-cost, low-power and low-performance designs. The Cortex-M0 is nowadays an alternative to 8-bit microcontrollers, with the advantage of high processing capacity. It is built on a high-performance processor core, with a 3-stage pipelined von Neumann architecture.
This processor implements the ARMv6-M architecture, which supports the ARMv6-M Thumb instruction set, including Thumb-2 technology. This provides the exceptional performance expected of a modern 32-bit architecture, with a higher code density than other 8-bit and 16-bit microcontrollers, and is capable of achieving performance figures around 0.9 DMIPS/MHz.

Figure 1. ARM Cortex Family

1.1. Cortex-M0 DesignStart

The Cortex-M0 DesignStart (M0DS) processor, shown in Figure 2, is a fixed configuration of the Cortex-M0 (M0) processor. It is delivered by ARM as a pre-configured and obfuscated, but synthesizable, Verilog netlist of the full Cortex-M0 processor. The main differences between them are: a) the M0 has a full AMBA interface that provides MASTER and SLAVE support, while the M0DS provides an AMBA-Lite interface with only MASTER support; b) the M0 can be implemented with either a high-speed multiplier (1 clock cycle) or a slow-speed multiplier (32 clock cycles), while the M0DS only allows the slow-speed multiplier; c) the M0 can handle 32 interrupts while the M0DS can handle 16 interrupts; and d) the M0 includes an optional wake-up interrupt controller, architectural clock gating, and hardware debugging; the M0DS does not include these options. The M0DS processor is distributed as a single zipped tar file bundle, containing the release notes, synthesized

Verilog, and test-bench code. The test-bench instantiates the Cortex-M0 DS module and connects it in a minimal way to a memory model and clock and reset generators. It also provides a means of outputting information from the processor to the Verilog simulator's console output. The aim of this work is to synthesize the Cortex-M0_DS in a real FPGA, connect it to a memory holding a small program using the AMBA-Lite interface, evaluate how many FPGA fabric resources are needed to do it, and see its applicability in small-footprint systems.

1.2. AHB AMBA-Lite overview

This protocol defines the data and address buses, and all the control signals, for high-performance synthesizable designs requiring high bandwidth. First, the address bus, HADDR[31:0], is a MASTER output to a SLAVE device. Data transfers are performed using two buses: a read bus, which is a SLAVE output to a MASTER input, called HRDATA[31:0], and a write bus, which is a MASTER output to a SLAVE, called HWDATA[31:0]. In this work we use only HRDATA. Finally, the protocol specifies control signals. HBURST[2:0] indicates whether the transfer is a single transfer or forms part of a burst. HMASTLOCK indicates whether the current transfer is part of a locked sequence. HPROT[3:0] provides additional information about a bus access and is primarily intended for use by any module that wants to implement some level of protection. HSIZE[2:0] indicates the size of the transfer, e.g., byte, half-word, or word. HTRANS[1:0] indicates the transfer type of the current transfer (IDLE, BUSY, NON-SEQUENTIAL, SEQUENTIAL). The HWRITE signal indicates the transfer direction: when HIGH it indicates a write transfer, and when LOW, a read transfer.

2. THE IMPLEMENTATION

2.1. The board

For this paper we used a Digilent Spartan 3E Starter FPGA board. This digital system design platform is built around a Xilinx Spartan 3E FPGA. It has 16 megabytes of fast SDRAM and 16 megabytes of flash ROM. It also has a 50MHz oscillator, plus a socket for a second oscillator. The board contains a USB2 port that provides board power, programming, and data transfers. Some peripherals are also included on the board, such as an LCD display, a set of LEDs, switches, etc. In particular, we use the LEDs as event indicators. One of the major advantages of this board is that it is possible to use the Xilinx software tools (Impact, Chipscope Pro, xmd, etc.), downloading the software (Adept) from the Digilent web page. This is very important for our purpose because it makes it possible to program the board and observe its state easily. The Xilinx S3E500-4 is the FPGA included on the board. It has 500K gates, 10,500 logic cells, 20 hardware multipliers, 360 Kbits of dedicated RAM, 73 Kbits of distributed RAM, 4 clock handlers, and a maximum clock frequency of 300MHz.

Figure 2. Cortex-M0 implementation

2.2. Software tools

We used the Xilinx Integrated Software Environment (ISE) as our design suite. The ISE project navigator allowed us to manage the project and synthesize. Core Generator is the tool we used to generate the ROM memory and the reset generator. ISIM was used for simulation. We also used Chipscope Pro to perform online debugging; this program lets us see the state of the system. For C code compiling, we used the ARM Microcontroller Development Kit by Keil. The ARM deliverables package contains a logical folder with synthesizable code and test-bench code. The test-bench project has a Verilog implementation of the processor, Cortexm0ds_tb.v, prepared for simulation. It also includes a HelloWorld.c program with a makefile containing the compilation parameters. As a result of this compilation a .bin file is obtained. This file is a memory

Figure 4. FPGA use with the project implementation
Figure 3. Cortex-M0 DesignStart test-bench schematic

image that is loaded by the Verilog test-bench at the beginning of the simulation.

2.3. System Implementation

The first step of the implementation process was the replication of the Hello World project. We realized that the makefile did not work correctly with Keil or IAR, so we decided to begin a new project with Keil. We ran the compilation process and compared the .bin file obtained with the one ARM provides in the test-bench package. After that, we began the simulation with ISIM, in which we saw that the data bus did not contain valid values: the .bin file was not correctly loaded by the Verilog code, because it was assumed that the Verilog instruction $fread() made double-word accesses, when it actually made byte accesses. After this modification the test-bench worked as expected. The next step was the implementation of the test-bench code as synthesized VHDL code (see Figure 3). An external 50MHz oscillator was used as the external clock. We used a synthesized DCM to generate the 10MHz system clock. The reset generator was implemented using the Xilinx architecture wizard; a two-clock-cycle reset pulse is needed to initialize the processor. The pre-initialized 1K x 32-bit RAM was created using the Core Generator tool. The processor makes 32-bit data accesses, so we had to shift the RAM address bus by 2 positions, so that processor addr[2] is connected to RAM addr[0]. Some LEDs were connected to internal signals, namely LOCKED and SLEEPING. The address and data buses of the AMBA-Lite interface and HWRITE were connected directly to the memory. HREADY was fixed to 1 and HRESP was fixed to 0. Other signals of the interface were connected to internal signals for debugging purposes. A bus-signal detector was developed to compare the HWDATA bus information with two patterns, one to turn the LED on and the other one to turn it off.
The user constraint file (.ucf) was defined using the PlanAhead tool. A Toggle LED project was created; its aim was to turn one of the board's LEDs on and off with the Cortex processor. The .bin file was generated using the Keil tools. The Core Generator tool allowed us to fill the memory with an image; the file format required was .coe, so a script was written to transform the .bin file into a .coe. Once the project was synthesized, IMPACT was used to program the board.

3. RESULTS

As mentioned before, the Chipscope Pro package was used to see the transitions on the AMBA AHB-Lite interface signals and for debugging. With this tool, we could verify that the interface worked as expected. All memory accesses were correctly synchronized and we saw that the LED on the board blinks, so we conclude that the implementation is correct and functional. In Figure 4 we show some results using the FPGA with the project implementation. Figure 5 shows the simulation of the memory accesses in the Keil software. Figure 6 shows the Chipscope Pro capture of the AMBA bus. Timing reports (post place & route) gave a maximum clock speed of 40MHz. This value could be improved by implementing timing and placement constraints.

Figure 5. Simulation in Keil

Figure 6. Capture of the AMBA-Lite bus

4. CONCLUSIONS

The most remarkable conclusion is that it is possible to implement the M0DS in a low-range FPGA. With this result, Xilinx, Actel and Altera (the three most important FPGA manufacturers) can all support this core, making it a considerable alternative when portability between these three FPGA types is a requirement for a design. As an improvement, it would be useful to have a complete test bench that allows the .bin file to be generated from the source code; that is not possible right now, and it would accelerate development. In future work, other peripherals will be connected to the AMBA bus in order to increase the processor's capacity. Going further, a Linux operating system could be investigated on this processor, obtaining a Linux implementation in a small-footprint design.

REFERENCES

[1] ARM Ltd, ARM DDI 0419C ARMv6-M Architecture Reference Manual, September.
[2] ARM Ltd, ARM IHI 0033A AMBA 3 AHB-Lite Protocol V.1 Specification, June.
[3] ARM Ltd, AT510-DC r0p0-00-rel0 ARM Cortex-M0 DesignStart Release Note, August.
[4] ARM Ltd, ARM DDI 0432C Cortex-M0 Revision r0p0 Technical Reference Manual, November.
[5] ARM Ltd, ARM DUI 0497A Cortex-M0 Devices Generic User Guide, October.
[6] Xilinx, DS312 Spartan-3E FPGA Family: Datasheet, August.
[7] Digilent, Digilent Spartan 3E Starter Kit Reference Manual, June.

ACKNOWLEDGMENTS

To William Hohl, Joe Bungo, Fiona Cole, the people at the ARM University Program and the people at the Xilinx University Program (XUP) for their support and cooperation.

TRADEMARKS AND COPYRIGHTS

The information about ARM processor families was mainly extracted from the ARM Ltd web site, as published in October. ARM, CORTEX, CORTEX-M, AMBA, AMBA-LITE, and other designated brands included herein are trademarks of ARM Limited. Xilinx, Spartan, ISE, and other designated brands included herein are trademarks of Xilinx Inc.
Digilent, Spartan3E Starter Kit, Adept, and other designated brands included herein are trademarks of Digilent Inc. All other trademarks are the property of their respective owners. Figures 1 to 3 are copyright ARM Ltd, reproduced with permission. References [1] to [5] are copyright ARM Ltd. Reference [6] is copyright Xilinx Ltd. Reference [7] is copyright Digilent Inc.

DIGITALLY CONFIGURABLE PLATFORM FOR POWER QUALITY ANALYSIS

B. Falduto, E. Ferro, R. Cayssials 1
Universidad Nacional del Sur - CONICET
1 Department of Electrical Engineering, Bahía Blanca, Argentina
ricardo.cayssials@uns.edu.ar

ABSTRACT

Nowadays, power quality requires sophisticated approaches to achieve an efficient utilization of both the electrical energy and the electrical installations and equipment. Most power quality devices are based on analogue circuits to synchronise the processing stage with both the voltage and the current of the power line. Because of the analogue components used, these approaches must be precisely tuned and calibrated. On the other hand, power quality analysis is covered by several standards, which define the parameters and measures that have to be satisfied in order to guarantee an adequate power quality. As new parameters or perturbations appear, new standards or modifications can be proposed to adapt them to new necessities. The digital architecture proposed here aims to integrate the synchronization stage and the processing unit in a peripheral core compatible with a processor bus. This core can easily be modified to include future power quality standards.

1. INTRODUCTION

In recent years, power quality (PQ) has become a significant issue for both power suppliers and customers, and there have been important changes in relation to it. First of all, the characteristics of loads have become so complex that the voltage and current of the power line connected to these loads are easily distorted. Lately, for example, non-linear loads with power-electronic interfaces that generate large harmonic currents have greatly increased in power systems ([1]).

1 This work was supported by the Technological Modernization Program under Grant BID1728/OC-AR-PICT2005 Number and the project Digital Processing Platform for Active Power Line Filters granted by Fundación Hermanos Agustín y Enrique Rocca.
Power quality disturbances can range from impulses with rise times in the microsecond range to long-duration variations in voltage magnitude. To solve the various quality problems, PQ should be measured accurately according to the exact definition of each power quality category, and then evaluated and diagnosed in versatile ways. Power quality monitoring systems (PQMS), which characterize power quality events and variations, have experienced rapid progress using high-tech detection functions. PQMS vary in their structure, but the recent trend is the permanent type, installed at one site to assess the power quality 24 hours a day without a break. Moreover, in many cases, to manage local power systems that have more than one PQMS efficiently, the analyzed data of each PQMS are sent to supervision controllers using network connections [2]. On the other hand, the electricity market defines a set of different economic conditions that have to be taken into account in order to achieve a cost-effective utilization of the electrical energy. As power quality becomes an economic, ecological and electrical-efficiency concern, several standards have been proposed [5, 6, 7, 8]. Embedded systems are used to implement most power quality equipment and instrumentation. These systems have to be designed to meet the power quality standards, as well as be compatible with other instruments. Compatibility may become a concern, since most power quality devices must work synchronised to meet the requirements defined by the standards. The measurement of harmonics is already defined in [6] and requires processing the voltage and current information. In this paper, we propose a digital architecture to process the power quality parameters according to the IEC standard. The only analogue components required are the attenuators and the A/D converters.
All the synchronization and processing are performed digitally, and consequently there is no need for calibration. This paper is organized as follows: Section 2 describes the main concepts in PQ. Section 3 explains the need for a variable sampling frequency. Section 4 presents the FFT analysis of harmonic and interharmonic waves. In Section 5, we describe the proposed platform. In Section 6, the architecture is analysed. Conclusions are drawn in Section 7.

2. POWER QUALITY CONCEPTS

Power quality is determined by a set of different measurements performed over the voltage and current waveforms. The main purpose of these measurements is to determine (1) how efficiently the electrical energy is utilized and (2) how good the energy provision is. Former electrical applications consisted of linear and balanced loads. For such loads, power quality analysis was confined to determining the phase angle between the voltage and current waveforms. The cosine of this phase angle, denoted cos(φ), gives the relationship between the electrical energy effectively utilized (active energy) and the electrical energy supplied (apparent energy). Nowadays, the characteristics of the loads are different from the former ones. Most electrical loads use semiconductor devices that produce non-linear behaviour, and consequently introduce perturbations into the power line that worsen the PQ of the system. The perturbations defined in [3], and supported by most modern PQMS, are:

- Swell: an increase in the AC voltage, with a duration that may range from a half cycle to a few seconds.
- Sag: as a swell, but with a reduction of the voltage.
- Flicker: a momentary interruption of the electrical energy.
- Undervoltage: a reduction of the voltage lasting more than 1 minute.
- Overvoltage: an increase of the voltage lasting more than 1 minute.
- Interruption: a reduction of the voltage below 10%.
- Harmonics: voltage and current components with a frequency different from the frequency of the power line.
- Frequency deviation: the difference between the measured frequency and the theoretical frequency of the power line.
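As a rough illustration of how event definitions of this kind can be applied, the sketch below classifies an RMS voltage measurement (in per-unit) together with an event duration. The 0.9/1.1 p.u. magnitude thresholds are illustrative assumptions of ours, not values taken from this paper or from a specific standard; only the 10% interruption level and the 1-minute boundary come from the definitions above.

```python
def classify_event(rms_pu, duration_s):
    """Classify a voltage-magnitude event from its RMS value (per-unit
    of nominal) and its duration in seconds. The 0.9 / 1.1 p.u.
    thresholds are illustrative only."""
    if rms_pu < 0.10:                    # voltage below 10% of nominal
        return "interruption"
    if duration_s > 60.0:                # sustained events (> 1 minute)
        if rms_pu < 0.9:
            return "undervoltage"
        if rms_pu > 1.1:
            return "overvoltage"
        return "normal"
    if rms_pu > 1.1:                     # short-duration increase
        return "swell"
    if rms_pu < 0.9:                     # short-duration reduction
        return "sag"
    return "normal"
```

A real analyzer would of course derive `rms_pu` and `duration_s` from cycle-by-cycle RMS tracking; this sketch only shows how the categories partition the magnitude/duration plane.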
Similarly, and according to the IEC standard, a power quality analyzer should analyze and evaluate these quantities: power frequency, magnitude of the supply voltage, flicker, harmonics and interharmonics, supply voltage unbalance, rapid voltage changes, and voltage dips, swells and interruptions. The standard suggests monitoring and analysis of the current as well, and specifies the measurement uncertainty to be better than 0.1% for the complete instrument including sensors (e.g., current clamps). An adequate measurement is the basis for any other power quality device. Modern PQ monitoring systems range from traditional watt-hour meters or digital protection relays into which PQ analysis algorithms are inserted, to complex devices that deal with PQ parameters and events. Most of these devices have to be configured in order to implement adequate strategies to guarantee an acceptable power quality. PQ strategies determine the actions to take for the different PQ events. Possible actions include modifying the regime of electrical loads, connecting compensators, switching off secondary components, etc.

3. VARIABLE SIGNAL SAMPLING

An important parameter for power quality is the harmonic content of the power line supply and load. The specification of the measurement and analysis is well defined in [5] and [8]. In [8], it is specified that the measurement interval shall be 10 or 12 cycles for 50 Hz or 60 Hz, respectively. The standard defines the time period to be measured and how the measured values are aggregated. The interval is not fixed in time but varies as the fundamental frequency of the power line changes. This kind of measurement requires synchronization with the power line in order to adapt the sampling interval accordingly. The easiest way to achieve an adaptive sampling frequency is using a Phase-Locked Loop (PLL).
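As a minimal numeric sketch of this adaptive interval (the function name is ours, and the 2048-sample window length is the value quoted in Section 4 of this paper), the synchronous sampling rate follows directly from the 10/12-cycle measurement window:

```python
def sampling_frequency(f_fundamental, samples_per_window=2048):
    """Sampling rate (Hz) so that one 10-cycle (near 50 Hz) or
    12-cycle (near 60 Hz) measurement window holds exactly
    `samples_per_window` samples."""
    cycles = 10 if abs(f_fundamental - 50.0) < abs(f_fundamental - 60.0) else 12
    window_s = cycles / f_fundamental       # nominally a 200 ms window
    return samples_per_window / window_s
```

At exactly 50 Hz this gives 2048 samples / 0.2 s = 10240 Hz, inside the 9-13 kHz range quoted in Section 4; as the fundamental drifts, the PLL-derived sampling clock shifts proportionally.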
A PLL is an electronic feedback system that generates a signal whose phase is locked to the phase of an input or reference signal. This is accomplished in a common negative-feedback configuration by comparing the output of a voltage-controlled oscillator to the input reference signal using a phase detector. Analog PLLs are generally built of a phase detector, a low-pass filter and a voltage-controlled oscillator (VCO) placed in the forward path of a negative-feedback closed-loop configuration. Figure 1 shows a block diagram of a basic PLL structure.

Figure 1: Block diagram of a basic PLL structure (phase detector, low-pass filter and VCO in a closed feedback loop driven by the reference input).

Analogue PLL circuits must be calibrated in order to achieve adequate response times when the input frequency changes. Moreover, including additional analogue components in a system requires a more careful design to avoid interference between the analogue and digital circuits. On the other hand, software PLL circuits require strict temporal requirements to be met. The processing time required by this kind of algorithm is very demanding and hard to meet on most embedded processors without exclusive or prioritised utilisation. When these temporal constraints are not met, harmonic components are introduced into the frequency-domain analysis, and consequently measurement errors appear. In this paper, we use the digital PLL circuit for power line applications proposed in [9]. This PLL is implemented as a digital circuit that produces a fast response time when the power frequency changes. This PLL circuit does not require processing time from the system processor. The PLL synchronises with the power frequency and generates the adequate sampling frequency to meet the harmonic analysis specified in [5] and [8]. The goal of adequate synchronization for the harmonic analysis is to reduce the spectral leakage effect. Besides, the PLL used computes the sine and cosine of the power line frequency, which are used to easily detect voltage and current perturbations as well as to determine reactive, active and apparent powers and energies.

4. FFT FOR HARMONICS AND INTERHARMONICS ANALYSIS

The power frequency is called the fundamental frequency. A sinusoidal wave with a frequency k times the fundamental (k an integer) is called a harmonic wave, or harmonic for short. A sinusoidal wave whose frequency cannot be expressed as an integer multiple of the fundamental is called an interharmonic wave, or interharmonic for short.
In [6], the principle of harmonic and interharmonic measurement is specified: a 200 ms window (10 periods of a 50 Hz signal or 12 periods of a 60 Hz signal) is used in the DFT calculation, resulting in a 5 Hz increment in the frequency spectrum. For power quality measurements, the analysis of harmonics is usually limited to the 50th harmonic (i.e., 2500 Hz for a 50 Hz signal). The DFT (Discrete Fourier Transform) transforms a time-sampled signal into its frequency spectrum; the FFT (Fast Fourier Transform) is the efficient algorithm used to compute it in discrete-time applications. With these specifications, the sampling frequency must provide at least 100 samples per period for the 50th harmonic to be the highest frequency that can be measured. Besides, the 10/12-cycle window determines at least 1200 samples per window. Because of the radix-2 factor of the FFT, a length of 2048 was chosen, with a sampling frequency varying from 9 kHz to 13 kHz.

5. PLATFORM PROPOSED

Power quality measurement and analysis requires strict temporal constraints to be met. On the other hand, processing and storing a large amount of information is also required. We can define two kinds of functions that a power quality device has to carry out: (1) power quality synchronization, measurement, monitoring and analysis, and (2) processing, storage and communication of the data and information of the system. When the first kind of function is implemented in software, perturbations are introduced when the processor's time is shared among all the functions of the system. These perturbations are produced because the real-time features of a Real-Time Operating System are not well matched with the temporal constraints that power quality analysis requires. We propose a platform based on an FPGA device that implements two main units: (1) the Power Quality Unit (PQU) and (2) the SoPC unit with Linux.
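The 5 Hz binning of the harmonic analysis in Section 4 can be made concrete with a small sketch (the test signal, rate and harmonic content below are illustrative assumptions, not the paper's measurement data): with 2048 samples taken synchronously at 10240 Hz, the bin spacing is exactly 5 Hz, so the k-th harmonic of a 50 Hz fundamental lands in bin 10k with no spectral leakage.

```python
import numpy as np

fs = 10240.0                 # synchronous rate: 2048 samples per 200 ms window
n = 2048
t = np.arange(n) / fs
# Illustrative signal: 50 Hz fundamental plus a 5th harmonic (250 Hz) at 10%
x = np.sin(2 * np.pi * 50 * t) + 0.1 * np.sin(2 * np.pi * 250 * t)

spectrum = np.abs(np.fft.rfft(x)) * 2.0 / n   # single-sided amplitude spectrum
bin_hz = fs / n                               # 5 Hz per bin
k_fund = int(50 / bin_hz)                     # fundamental lands in bin 10
k_5th = int(250 / bin_hz)                     # 5th harmonic lands in bin 50
```

Because the window holds an integer number of cycles (10 of the fundamental, 50 of the harmonic), `spectrum[k_fund]` recovers the 1.0 amplitude and `spectrum[k_5th]` the 0.1 amplitude essentially exactly; this is precisely why the PLL-synchronised sampling of Section 3 matters.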
The two units communicate through a bus that maps the PQU into the memory map of the SoPC unit. The PQU contains the acquisition, synchronisation and DFT stages. Voltage and current signals are attenuated, isolated, filtered, converted from analogue to digital, and transformed. Figure 2 shows a scheme of the PQU.

Figure 2: PQU architecture (analogue stage with isolation, attenuation, low-pass filter and LM12L458 ADC for the V1-V3 and I1-I3 inputs; PLL with sampling generator; DFT; dual-port memory and bus interface).

Whilst the power quality measurement needs dedicated hardware to meet the strict timing constraints, the rest of the functions, required to communicate, store or process a great deal of information, may be implemented on a processor. For this reason, an FPGA-based soft-processor system was found to be a suitable alternative to implement them. The System on Programmable Chip gives great flexibility as well as a friendly environment in which to build the embedded system architecture. We use the NIOS II soft-processor from Altera, implemented on a Cyclone III FPGA device. The system executes the µClinux operating system to give support to the software applications. Figure 3 shows the architecture of the FPGA-Linux board. µClinux was chosen because of its wide support for communication and storage. The embedded system offers native server communications through Ethernet and serial interfaces.

Figure 3: Architecture of the FPGA-Linux board (NIOS processor running µClinux with device drivers; data exchange between a perturbation-to-waveform conversion application and an event-processing application; Modbus RTU over the serial interface as a secondary communication link; Modbus TCP over the Ethernet interface to a PC with Matlab/Simulink; the PQU acquires the voltages and currents).

The interface between the NIOS processor and the PQ unit is the Avalon interface. A µClinux device driver was programmed in order to provide easy access to the PQ unit from software applications. The FPGA-Linux board was implemented on an Altera DE2 board with a Cyclone II 2C35 FPGA. The board includes the Ethernet and serial ports used to communicate with a supervisor PC. The drivers and protocols for these communication links are easily implemented as Linux applications.

6. RESULT ANALYSIS

Power quality standards do not prescribe protocols or experiments that have to be carried out to meet different Class requirements. Instead, they define the measurements and parameters for power quality analysis and monitoring. This makes it difficult to ensure that a certain instrument, device or platform meets the power quality specifications of the standards.
Several simulations have been carried out considering different perturbation scenarios, finding processing errors within the boundaries of the standard. However, we cannot assure that this performance is achieved for all the possible perturbations considered in power quality analysis. We need to define a testing protocol to compare the architecture with a Class A certified analyser. However, we can assure that the proposed digital architecture of the PQ unit allows us to configure the synchronisation, transform and analysis parameters to optimise the performance of the unit for different power line perturbations. This feature improves the flexibility of the architecture. The utilization of an FPGA device running µClinux reduces the design complexity. Linux reduces the complexity of implementing the data processing and data communication functions, since they are programmed as applications that use already tested drivers. The processing and communication speed reached is adequate to measure and analyse the power quality parameters with harmonics up to the 60th order.

7. CONCLUSIONS

Modern loads use semiconductor devices whose non-linear behaviour introduces perturbations into the power line. Such perturbations can reduce the efficiency of energy utilization as well as cause damage to the equipment connected to the power line. Several standards have been published that define the parameters to take into account to assure a good quality of service. Power quality analysis requires processing the voltage and current of the power line. Analogue and software approaches have been proposed for this purpose. Whilst the analogue ones require precise tuning and calibration for each device, the software ones require a great deal of processing time from the system processor. We have proposed a power quality platform implemented on an FPGA. The power quality measurement, synchronization and analysis are performed by the Power Quality Unit.
This unit may be changed and modified in order to incorporate new power quality specifications. On the other hand, the processing, storage and communication are implemented on a NIOS II soft-processor executing a Linux version for software support. In this way, the platform is highly flexible in both the power quality unit and the SoPC unit. Changes in one unit do not affect the other, making design and adaptation easy.

8. REFERENCES

[1] B. H. Chowdhury, "Power Quality," IEEE Potentials, vol. 20, pp. 5-11.
[2] D.-J. Won, I.-Y. Chung, J.-M. Kim, S.-J. Ahn, S.-I. Moon, J.-C. Seo, and J.-W. Choe, "Development of Power Quality Diagnosis System for Power Quality Improvement," presented at the Power Engineering Society General Meeting.
[3] IEEE Std, "IEEE Recommended Practice for Powering and Grounding Sensitive Electronic Equipment" (IEEE Emerald Book).
[4] D.-J. Won, I.-Y. Chung, J.-M. Kim, S.-J. Ahn, S.-I. Moon, J.-C. Seo, and J.-W. Choe, "Power Quality Monitoring System with a New Distributed Monitoring Structure," KIEE International Transactions on PE, vol. 4A.
[5] EN 50160: Voltage characteristics of electricity supplied by public distribution systems.
[6] IEC, Amend. 1 to Ed. 2: Electromagnetic compatibility (EMC) - Testing and measurement techniques - General guide on harmonics and interharmonics measurements and instrumentation, for power supply systems and equipment connected thereto.
[7] IEC: Electromagnetic compatibility (EMC) - Testing and measurement techniques - Flickermeter - Functional and design specifications.
[8] IEC: Electromagnetic compatibility (EMC) - Testing and measurement techniques - Power quality measurement methods.
[9] R. Cayssials, O. Alimenti, E. Ferro, "A Digital PLL Circuit for AC Power Lines with Instantaneous Sine and Cosine Computation," IV IEEE Southern Conference on Programmable Logic, San Carlos de Bariloche, Argentina, March 2008.


SOLAR TRACKER FOR COMPACT LINEAR FRESNEL REFLECTOR USING PICOBLAZE

Daniel Hoyos, Maiver Villena, Carlos Cadena
INENCO - Universidad Nacional de Salta
Av. Bolivia 5150, Salta (Argentina)
{hoyosd, maiver, cadena}

Victor Serrano, Telmo Moya, Marcelo Gea
Departamento de Física, Universidad Nacional de Salta
Av. Bolivia 5150, Salta (Argentina)
{serranovh, tmoya, geam}

ABSTRACT

This paper describes a distributed control system for a Compact Linear Fresnel Reflector using a combination of chronological and light-sensing tracking techniques. The system uses LabVIEW at the controller stage, ZigBee for wireless communications and Spartan-3 FPGAs at the input/output stages.

1. INTRODUCTION

One of the most interesting options for electric energy generation using renewable energy is to heat water using Fresnel solar concentrators: mirror arrays that send sunbeams to an absorber elevated over them. The reflectors are low-curvature parabolic cylinders. They are installed at floor level, tracking the apparent solar path by rotating over horizontal axes. The reflectors concentrate the direct solar radiation on an absorber fixed some meters above the floor. The absorber is a linear tower with a cavity in its lower face. This kind of concentrator must orient the mirrors to reflect the sun's rays onto a concentrator located at a height of 10 meters. The mirrors have to turn, following the sun's path, to hold the reflected rays on the absorber. Analyzing the mirror located directly under the absorber: it must be at 45 degrees East at sunrise and 45 degrees West at sunset, so it has to sweep 90 degrees during daytime. In order to establish the mirror's speed we must compute the sunrise and sunset times for each day of the year. Sunrise time and daytime duration also vary with the seasons, so to track the sun the mirror must always start its movement at 45 degrees, but at a different clock time every day, and its speed will also differ depending on the day of the year.
For the other mirrors, located at other positions, the movement must start at the same time but from different angles. The concentrator is located at a height of 10 m and its width is 0.4 meters, implying that the sunbeams must concentrate within two degrees.¹

Fig. 1. General scheme.

This precision of two degrees at the concentrator implies one degree of precision for the mirrors' movement. The mirrors must be rotated, following the sun's path, to keep the reflected rays on the absorber; the mirrors should start at 45 degrees east at sunrise and finish at 45 degrees west at sunset. This means a 90-degree sweep over the day.

2. DESCRIPTION OF MOTION

This device consists of a set of mirrors and an absorber located 10 meters above them, both North-South oriented. The mirrors should concentrate sunlight onto the absorber. It is assumed that the absorber is exactly above the mirrors. The rays of the rising sun are parallel to the surface of the earth, so the mirrors should be at 45 degrees; at solar noon they should be vertical; and in the evening they should again be at an angle of 45 degrees, because at sunset the sunbeams are again parallel to the horizontal. The mirrors must track the sun from sunrise to sunset; these values depend on the position of the sun for each day of the year [1]. Solar declination is given by (1).

¹ INENCO: Instituto de Investigación en Energías no Convencionales, Departamento de Física, Universidad Nacional de Salta.

Fig. 2. LabVIEW communications subroutine.

Table 1. Protocol codes.

  Code             Operation
  Start            Start daily controller routine
  Time update      Set controller time
  Time check       Check controller's time
  Position check   Check controller's position
  Position change  Order position change
  Save             Save
  Relocate         Relocate
  Echo order       Echo request
  Status order     Status request
  Time blw steps   Set time between steps
  Id request       Request controller identification

δ = 23.45 sin(360 (284 + n) / 365)    (1)

Solar time does not coincide with local clock time. Converting standard time to solar time takes two corrections. First, there is a constant correction for the difference between the observer's longitude and the standard longitude of the country. The second correction comes from the equation of time, which takes into account the perturbations of the rotation of the earth, and is shown in (2):

Solar time − Standard time = 4 (L_st − L_loc) + E    (2)

where E is given by (3) and B is given by (4):

E = 229.2 (0.000075 + 0.001868 cos B − 0.032077 sin B − 0.014615 cos 2B − 0.04089 sin 2B)    (3)

B = (n − 1) 360 / 365    (4)

where n is the day number, L_st is the standard longitude of the country, and L_loc is the longitude of the place in question [2]. To protect the mirrors at night they must be placed facing down, so the device must rotate through 135 degrees. To go back to the start position at sunrise it must step back the same angle. The speed of this movement is limited by the motor's maximum possible speed and the system's inertia. It was experimentally determined that 100 Hz pulses give fault-free motor operation, so the time needed to put the system into rest mode is one minute. Afterwards, at sunrise, the system is relocated in two minutes.

3. SYSTEM CONTROL

The system has a central control, a communications network and one controller for each mirror. The central control performs the more complex calculations, like sunrise time, sunset time and day duration, and sends them to the controller set. It also verifies controller operation, updates the system time and orders system protection mode in bad weather.
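The chronological component of the tracking can be sketched numerically. The function names below are ours; the coefficients of the equation of time follow the standard Duffie-Beckman formulation cited as [2], with longitudes in degrees and E in minutes, so treat the sketch as an illustration of Eqs. (1)-(4) rather than the firmware's arithmetic.

```python
import math

def declination(n):
    """Solar declination in degrees for day-of-year n, per Eq. (1)."""
    return 23.45 * math.sin(math.radians(360.0 * (284 + n) / 365.0))

def equation_of_time(n):
    """Equation of time E in minutes, per Eqs. (3)-(4)."""
    B = math.radians((n - 1) * 360.0 / 365.0)
    return 229.2 * (0.000075 + 0.001868 * math.cos(B)
                    - 0.032077 * math.sin(B)
                    - 0.014615 * math.cos(2 * B)
                    - 0.04089 * math.sin(2 * B))

def solar_minus_standard(n, L_st, L_loc):
    """Solar time minus standard time, in minutes, per Eq. (2)."""
    return 4.0 * (L_st - L_loc) + equation_of_time(n)
```

For example, near the June solstice (n around 172) the declination approaches +23.45 degrees, and on the standard meridian (L_st = L_loc) the offset reduces to the equation of time alone.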
The central control was implemented with LabVIEW running on an embedded PC (PXI8155), sending data through the serial port [3]. The communications subroutine is shown in Fig. 2. A simple three-byte protocol was defined for the control orders, containing the instruction in the first byte and data in the others. The instruction byte is composed of 0xF in the high nibble and the proper instruction code in the low nibble. The instruction set is shown in Table 1. The controller located at each mirror drives its movement as a function of the orders received from the central control and the position sensor data. The controllers are independent of one another. ZigBee modules are used for the RF communications, working in the 2.4 GHz band [4]. A module configured as Coordinator is connected to the PC serial port, and one configured as End Device is used for each controller.

Mirror Control

The controller stage uses a Xilinx FPGA with PicoBlaze, an embedded soft processor that performs the overall control. The tasks required by the controller are implemented on the FPGA, including: motor control, real-time clock, analog-to-digital converters for the sensors, and a UART to drive the ZigBee module. As those devices are connected to PicoBlaze's input/output ports, it can access them using configuration registers. The motor control is implemented as a state machine that compares the position register data with the internal current position and sets the movement direction and number of steps.

Fig. 3. Controller system.

This control has two control registers, CE to enable/disable and Reset to clear; one input register, Position, to set the desired position; and one output register with the current position. The RTC module presents three input registers to set the system's time, and a control register with CE to enable/disable, Reset to clear the RTC, and Act to update the time. Three output registers show the current time. The UART and ADC used are the ones proposed by the manufacturer and implement all the proposed registers. These registers are connected to the PicoBlaze processor by two multiplexers on its input and output ports [5].

Engine Control

Engine control is implemented as a state machine that compares the position register and the actual position register. If the result is zero, it performs no action. If the result is greater than zero, it increments the current position register and the position register connected to the output decoder that generates the control signals for the stepper motor. When the difference is less than zero, the movement must be in the opposite direction and the current position register is decremented.

Sensors

To verify proper system motion, two LDR sensors are placed looking at the edge of the mirrors. When sun rays do not impinge on the absorber, one of the sensors is illuminated, indicating that the system is out of focus and in which direction the blur lies. The sensor circuit consists of a resistor in series with the LDR, and this signal is connected to the ADC input.

PicoBlaze Software

Fig. 4. Engine control.

The PicoBlaze processor receives orders from the central control using the implemented UART and enters a menu that determines the actions to follow [7]. The processor uses the time between engine steps to increment the stepper motor's position register, waits a second, and checks whether the position sensors are illuminated.
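The three-byte frame format described in Section 3 can be sketched as below. The specific code and data values in the example are hypothetical, since Table 1 lists the operations but the numeric code assignments are not reproduced here.

```python
def make_frame(code, data=(0x00, 0x00)):
    """Build a 3-byte control frame: the instruction byte has 0xF in the
    high nibble and a 4-bit operation code in the low nibble; the two
    remaining bytes carry the data."""
    if not (0 <= code <= 0xF) or len(data) != 2:
        raise ValueError("code must fit in 4 bits, data must be 2 bytes")
    return bytes([0xF0 | code, data[0], data[1]])

def parse_frame(frame):
    """Split a received 3-byte frame back into (code, data)."""
    if len(frame) != 3 or (frame[0] & 0xF0) != 0xF0:
        raise ValueError("not a valid frame")
    return frame[0] & 0x0F, (frame[1], frame[2])

# Round trip with a hypothetical code 0x2 and two hypothetical data bytes
frame = make_frame(0x2, (0x10, 0x3B))
```

The 0xF marker in the high nibble lets a controller resynchronise on the instruction byte if bytes are dropped on the serial or RF link.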
If this occurs, the sensors indicate in which direction the engine should move, so the processor moves the stepper motor, waits 0.1 seconds, and repeats this routine up to ten times. This program follows the sun's path: on clear days the position is corrected with the sensors, while on cloudy days it roughly assumes that the sun moves at constant speed.

ZigBee Considerations

On a first approach, broadcast ZigBee transmissions were used in order to simplify frame composition [8]. The results were satisfactory for early tests, but for a complete system the broadcast delays are unacceptable and unicast ZigBee transmissions are needed. That implies composing the frames with the individual End Devices' ZigBee addresses at the Coordinator level. As the central control is implemented in LabVIEW, the corresponding libraries were designed to allow unicast management and obtain the desired performance.
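The engine-control comparison described above (desired-position register versus current-position register) can be modelled in software. This is an illustrative Python sketch of the behaviour, not the actual FPGA state machine.

```python
def engine_step(current, target):
    """One update of the engine-control state machine: compare the
    desired-position register with the current-position register and
    step the motor toward the target. Returns the new current position
    and the step direction issued (None when no action is taken)."""
    if target == current:
        return current, None          # registers equal: no action
    if target > current:
        return current + 1, "+"       # increment: step in one direction
    return current - 1, "-"           # decrement: step in the other

# Drive toward a desired position, one motor step per update
position = 10
while position != 13:
    position, _ = engine_step(position, 13)
```

Issuing exactly one step per update is what lets the hardware bound the step rate (the 100 Hz pulse limit mentioned in Section 2) independently of how far the registers disagree.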

4. RESULTS

The tested system consists of a PC with a PXI8255 embedded data acquisition board, a USB-RS232 module, two XBee modules and a Spartan-3A FPGA, all of them connected to a power interface and a stepper motor. A real-scale mirror was built, as shown in Fig. 5, and different motors and gearboxes were tested to optimize the system.

Fig. 5. Mirror with controller.

The described system went through various stages. At the first stage the embedded PC algorithms were developed and tested; then the stepper motor control was tried, then the PicoBlaze control on the driver, and finally the I2C and ZigBee networks. Different control strategies were also tested. A purely sensor-based control strategy experienced problems in cloudy conditions and was slow to refocus. Using only the sun's motion equations, the system was out of focus at noon (maximum radiation), although it remained focused for the rest of the day. Using the combined control strategy described in this paper, the system experiences problems neither at noon nor on a cloudy day.

5. CONCLUSION

The use of an FPGA allows quick control-system reconfiguration using a minimum of discrete components, and PicoBlaze simplifies the control program. The developed system works acceptably; further development will share information between the mirrors to reduce the temperature of the absorber if necessary, for example by blurring a mirror.

6. REFERENCES

[1] Y. J. Huang, B. C. Wu, C. Y. Chen, C. H. Chang, and T. C. Kuo, "Solar Tracking Fuzzy Control System Design using FPGA," IAENG Proceedings of the World Congress on Engineering 2009, WCE London, vol. I, Jul. 2009.
[2] J. A. Duffie and W. A. Beckman, Solar Engineering of Thermal Processes, Second Edition, John Wiley & Sons, Inc.
[3] National Instruments, LabVIEW Data Acquisition Basics Manual, NI.com, 2000.
[4] C. Evans-Pughe, "ZigBee wireless standard," IEEE Review, vol. 49, iss. 3, Mar.
[5] Xilinx, PicoBlaze 8-bit Embedded Microcontroller User Guide for Spartan-3, Spartan-6, Virtex-5, and Virtex-6 FPGAs, xilinx.com, Jan.
[6] J. Logue, "Virtex Analog to Digital Converter," Xilinx XAPP 155, Sep.
[7] D. Antonio-Torres, D. Villanueva-Perez, E. Sanchez-Canepa, "PicoBlaze-Based Embedded System for Monitoring Applications," CONIELECOMP 2009, Feb. 2009.
[8] S. Farahani, ZigBee Wireless Networks and Transceivers, Elsevier, pp. 47-78, 2008.
[9] J. A. Beltran, J. L. S. Gonzalez Rubio, C. D. Garcia-Beltran, "Solar Tracker (seguidor solar de lazo abierto con look-up table de datos precalculados): Design, Manufacturing and Performance Test of a Solar Tracker Made by an Embedded Control," CERMA, Sep.

TOOLBOX NURBS AND VISUALIZATION SYSTEM VIA FPGA

Luiz Marcelo Chiesse da Silva*
Electrical Engineering Department
Federal University of Technology - Paraná
Cornélio Procópio - Paraná - Brazil

Maria Stela Veludo de Paiva
Electrical Engineering Department
USP - University of São Paulo
São Carlos - São Paulo - Brazil

ABSTRACT

NURBS curves and surfaces are widely used in CAD/CAM, reverse engineering and rapid prototyping systems to represent almost any shape adequately and in a compact way. For this reason, NURBS is included in standards, being implemented in CAD/CAM systems and graphics processing units (GPUs). Data acquisition and manufacturing systems make use of NURBS implemented in embedded systems, and works using FPGAs are incipient, as an alternative to other technologies based on microcontrollers or dedicated integrated circuits. This work proposes the implementation of a NURBS interpolation and visualization SoC (System on a Chip) using an FPGA, aiming to implement an embedded system for manufacturing and CAM systems for tool positioning and toolpath simulation.

1. INTRODUCTION

Curve and surface interpolation is a fundamental task in graphics systems such as CAD/CAM, where, for example, the resolution between the design and the manufacturing systems must be adjusted [1]. When a set of points is given, or received from a graphical unit, and this set must be fitted with a curve or straight lines coincident with the given points, interpolation is performed (if the fitted curve is not coincident with the points, the fitting is an approximation) [2]. There are several methods for the interpolation of a set of points, ranging from simple and efficient triangulations [3] to modified methods using RBFs [4] and others [5]. Among these methods, NURBS (Non-Uniform Rational B-Splines) are adopted in graphics standards like IGES [6], STEP [7] and OpenGL [8] (PHIGS) for curve and surface representation between graphical systems.
The main advantages of rational B-splines (affine transformations, for example) make them the most suitable choice for standardization; despite the lack of compression in the representation of conic sections [9], they are also widely used in generic mathematical applications. The use of piecewise polynomials requires a minimal number of procedures, namely ordered parameterization, linear-system solution and the curve/surface fitting. Manufacturing systems, like CNC, and 3D data acquisition systems make use of NURBS to provide greater efficiency, being implemented in embedded systems [10] based on microcontrollers, DSPs or application-specific integrated circuits. FPGA technology provides an all-in-one-chip solution for data pre-processing and control in embedded systems, an architecture specified by the system designer, and reconfigurable logic capable of implementing a custom processor [11] for a specific task. This work proposes an SoC in an FPGA system, with NURBS local curve and surface interpolation cores based on the fast Cox-de Boor implementation [12], and a basic graphics pipeline [13] for visualization purposes. Optionally, two cores for the generation of straight lines and circles [14] are included. To connect and synchronize the cores, a Wishbone-based bus [15] is used, following its conventions; it is an open-source logic bus. The use of FPGAs in the area of video and image processing is consolidated [16]; despite new technologies in the application area, such as ad hoc integrated circuits like the Cell processor [17-18], there is a gap in graphics processing that leads to applications making use of mixed technologies, like GPGPU and FPGAs.

* Sponsored by the Teacher Qualification Institutional Program, Coordination for the Improvement of Higher Level Personnel (CAPES).

2. NURBS

A NURBS curve of degree p is a piecewise polynomial curve defined as:

C(u) = sum_{i=0}^{n} w_i P_i N_{i,p}(u)    (1)

where u is a parameter value, the P_i form the so-called control polygon points, weighted by the w_i, and N_{i,p}(u), i = 0, ..., n, are the B-spline basis functions defined over a knot vector:

U = {u_0, ..., u_m},  u_i <= u_{i+1},  i = 0, ..., m-1    (2a)

N_{i,0}(u) = 1 if u_i <= u < u_{i+1}, 0 otherwise    (2b)

N_{i,p}(u) = \frac{u - u_i}{u_{i+p} - u_i} N_{i,p-1}(u) + \frac{u_{i+p+1} - u}{u_{i+p+1} - u_{i+1}} N_{i+1,p-1}(u)    (2c)

We assume throughout this paper that the knot vector has the following form:

U = {a, ..., a, u_{p+1}, ..., u_{m-p-1}, b, ..., b}    (3)

with a and b each repeated p+1 times, where, in most practical applications, a = 0 and b = 1. A NURBS surface of degree (p, q) is defined similarly as:

S(u,v) = \sum_{i=0}^{n} \sum_{j=0}^{m} w_{i,j} P_{i,j} N_{i,p}(u) N_{j,q}(v)    (4)

where u and v are the parameter values in the longitudinal and isoparametric directions of surface construction, the P_{i,j}, i = 0, ..., n; j = 0, ..., m, form the so-called control net, defined by a set of points weighted by the w_{i,j}, and the basis functions N_{i,p}(u), i = 0, ..., n, and N_{j,q}(v), j = 0, ..., m, are defined as above (the construction of N_{j,q}(v) is analogous), over the knot vectors:

U = {u_0, ..., u_r},  u_i <= u_{i+1},  i = 0, ..., r-1    (5a)
V = {v_0, ..., v_s},  v_j <= v_{j+1},  j = 0, ..., s-1    (5b)

2.1. NURBS INTERPOLATION

NURBS interpolation can be divided into local and global interpolation methods. The first constructs a curve from rational segments (rational polynomials), or rational patches in the case of surfaces, such that the endpoints of each segment are the given data points. Neighboring segments are joined with some continuity level at the junctions, with curve construction proceeding segment-wise. Global interpolation builds the curve as a whole, using all the given points in a matrix computation; the control points are obtained by inverting the matrix form of the NURBS equation (1). Both methods require the points to be interpolated (data points), the number of control points (segments), the knot vector and the parameter values. In figure 1, the D_i are the interpolated data points, and the knots are calculated from the chord length between each pair of data points, given by the modulus of the vectors q_i.
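As a sanity check on the recursion in (2b)-(2c), the basis functions can be sketched directly in software (a minimal illustration of the recurrence only, not the paper's FPGA cores; the function name is ours):

```python
def basis(i, p, u, U):
    """B-spline basis function N_{i,p}(u) over knot vector U (eqs. 2b-2c).

    Terms with a zero denominator, arising from repeated knots, are
    defined as zero, as is conventional.
    """
    if p == 0:
        return 1.0 if U[i] <= u < U[i + 1] else 0.0
    left = right = 0.0
    if U[i + p] > U[i]:
        left = (u - U[i]) / (U[i + p] - U[i]) * basis(i, p - 1, u, U)
    if U[i + p + 1] > U[i + 1]:
        right = (U[i + p + 1] - u) / (U[i + p + 1] - U[i + 1]) * basis(i + 1, p - 1, u, U)
    return left + right
```

On the clamped knot vector U = {0, 0, 0, 1, 1, 1} of form (3) with p = 2, the three basis functions reduce to the Bernstein polynomials and sum to 1 for any u in [0, 1).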
Another interpolation method is to satisfy conditions given by the curve tangent vectors T_i at each data point D_i. For a given set of data points, the best method to set up the knots is to calculate initial parameter values by the chord-length method:

t_0 = 0    (6a)
t_k = \frac{1}{L} \sum_{i=1}^{k} |D_i - D_{i-1}|    (6b)
t_n = 1    (6c)

where t is the parameter value, L the total chord length between the data points, |D_i - D_{i-1}| the chord length between two adjacent data points, k the parameter index and n the number of data points.

Fig. 1. Data points, junctions (knots), distance and tangent vectors in a NURBS curve.

For the knot vector, the technique of averaging is recommended:

u_0 = ... = u_p = 0;  u_{m-p} = ... = u_m = 1    (7a)
u_{j+p} = \frac{1}{p} \sum_{i=j}^{j+p-1} t_i,  j = 1, ..., n-p    (7b)

where t is the parameter from equation (6b). The values u_0 to u_p and u_{m-p} to u_m reflect the knot multiplicities required by the spline beginning and end conditions. The local control property allows local interpolation, given by the support region of the basis functions, which restricts the influence of each basis function to a limited number of piecewise polynomials. Thus, many polynomial segments can be computed concurrently to generate the final curve, given the desired continuity at the polynomial junctions (knots), a feature exploited by the Cox-de Boor algorithm given in equation (8), where t is the parameter, the u are the knots, P the control points of each layer (data points in the first layer), i the control point index in the layer, k the order of the polynomial segments and j the layer number:

P_i^j(t) = \frac{t - u_i}{u_{i+k-j} - u_i} P_i^{j-1} + \frac{u_{i+k-j} - t}{u_{i+k-j} - u_i} P_{i-1}^{j-1}    (8)
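Equations (6a)-(6c) and (7a)-(7b) translate directly into a short routine (an illustrative sketch under the paper's conventions, not the hardware cores; note the Euclidean chord length is used here, whereas the FPGA implementation uses the city-block distance):

```python
import math

def chord_length_params(D):
    """Parameter values t_0..t_n by the chord-length method (eqs. 6a-6c)."""
    chords = [math.dist(D[k], D[k - 1]) for k in range(1, len(D))]
    L = sum(chords)                      # total chord length
    t = [0.0]                            # eq. (6a)
    for c in chords:
        t.append(t[-1] + c / L)          # cumulative form of eq. (6b)
    t[-1] = 1.0                          # eq. (6c), guards against rounding
    return t

def averaged_knots(t, p):
    """Clamped knot vector by knot averaging (eqs. 7a-7b) for degree p."""
    n = len(t) - 1                       # n+1 data points
    U = [0.0] * (p + 1)                  # u_0 = ... = u_p = 0
    for j in range(1, n - p + 1):
        U.append(sum(t[j:j + p]) / p)    # u_{j+p} = mean of t_j .. t_{j+p-1}
    U += [1.0] * (p + 1)                 # u_{m-p} = ... = u_m = 1
    return U                             # m+1 = n+p+2 knots in total
```

For four data points on a unit square and p = 2, this yields t = (0, 1/3, 2/3, 1) and a single interior knot at 0.5.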

Fig. 2. Cox-de Boor algorithm control point layers (layer 0: P_0^0, P_1^0, P_2^0; layer 1: P_0^1, P_1^1; layer 2: P_0^2 = C(u)).

Fig. 3. Basic 3D visualization pipeline: vertex reader, viewport transformation, primitive drawer, video memory writer.

A control point in a B-spline curve is the convex combination of two control points in the previous layer, as illustrated in figure 2, and its influence on the curvature is given by the terms in brackets in equation (8). If a control point and the respective knot are repeated, the curve tends towards its position until it passes through the point, resulting in the interpolation.

3. GRAPHIC PIPELINE

A basic visualization pipeline is used. Owing to its sequential nature, the process is divided into serial stages, whose number may vary between implementations but which follow the general arrangement given in figure 3. In computer graphics systems, the last three stages are managed by an API (Application Programming Interface). The vertex reader reads 3D data obtained from a "cloud of points", as triples giving the Euclidean coordinates of the points. These data are stored in RAM and, according to the primitive drawer cores, the required points are buffered in the FPGA. In the viewport transformation stage, the computation of each transformation matrix can be parallelized internally, but the matrices themselves are applied serially. The primitive drawer includes the NURBS, straight line and circle generation cores. The video memory is implemented in built-in RAM, in order to support different display resolutions and faster data transfer, and performs the scan-out to a D/A video converter independently of the rest of the system. The vertex reader is the data input of the pipeline, obtaining coordinates in Euclidean space from another built-in memory of the FPGA. The data can be inserted during the synthesis process, mapped from a graphical user interface to a memory initialization file.
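The stage ordering of figure 3 can be mimicked by a small software model (purely illustrative: the stage names follow the paper, but the simple orthographic mapping and the fixed XGA resolution are our simplifying assumptions, standing in for the 4x4 matrix transforms of the real core):

```python
# Software model of the Fig. 3 pipeline (illustrative only).
WIDTH, HEIGHT = 1024, 768  # XGA video memory, as used by the paper

def vertex_reader(cloud):
    """Stage 1: stream (x, y, z) triples from the point-cloud memory."""
    yield from cloud

def viewport_transform(points):
    """Stage 2: map normalized [-1, 1] coordinates to screen pixels
    (a toy orthographic mapping; the real stage applies matrices)."""
    for x, y, _z in points:
        sx = int((x + 1) / 2 * (WIDTH - 1))
        sy = int((1 - y) / 2 * (HEIGHT - 1))
        yield sx, sy

def primitive_drawer(pixels):
    """Stage 3: pass-through here; the FPGA cores generate NURBS
    curves, straight lines and circles at this stage."""
    yield from pixels

def video_memory_writer(pixels, framebuffer):
    """Stage 4: set pixels in the framebuffer (video RAM model)."""
    for sx, sy in pixels:
        framebuffer[sy * WIDTH + sx] = 1

fb = bytearray(WIDTH * HEIGHT)
video_memory_writer(
    primitive_drawer(viewport_transform(vertex_reader([(0.0, 0.0, 0.0)]))), fb)
```

The generator chaining mirrors the serial stages: each stage consumes the previous one's output as soon as it is produced, like the buffered data flow between the cores.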
The primitive drawer is responsible for the actual data processing that originates the graphics; the viewport transformation maps the data to fit the video memory. The video memory writer sends the data with a 10-bit size for the XGA video resolution (1024 columns by 768 lines), generating the timing synchronization signals for the video DAC. This pipeline can be used for 2D or 3D views by modifying the viewport transformation stage. Figures 6 and 7 show examples of a NURBS curve and surface obtained from the video memory system.

4. FPGA IMPLEMENTATION

Given the set of data points to be interpolated, the initial parameters are calculated by equations (6a-6c) using the city-block distance, since the chord length is relative between adjacent points, and the knot vector is calculated by equations (7a-7b). For each parameter t there is a core that recursively calculates the point C on the curve by the Cox-de Boor algorithm, with the recurrence performing the interpolation of the data points. Figure 4 shows a simplified diagram with the cores for the chord-length and averaging techniques for knot vector generation, excluding the enabling gates from the bus signals, which add two clock cycles to each core's processing. Figure 5 shows an example of the generation of one point by the Cox-de Boor algorithm for a degree 3 NURBS, resulting in 3 core layers. The cores between dashed lines are parallelized, showing that the degree determines the number of layers and the number of clock cycles needed to generate the point.

5. CUDA GPU SYSTEM

Graphics Processing Units were created with multicore processing capability, but the first ones were made for computer graphics applications only.
Today the GPU has an architecture made for multi-data processing, open to programmers and suitable for high-performance computing, such that manufacturers now offer GPGPUs with more than 500 cores devoted to general-purpose applications. The use of the GPU here is based on a CUDA implementation of NURBS, consisting of multithreaded processing of the Cox-de Boor and knot addition algorithms for the generation of each parameter on the curve/surface, rather than the dedicated circuits on the board. Its performance is compared with the FPGA implementation with up to 16 cores, and the implementation is written in the C language. The CUDA implementation follows these guidelines:

Fig. 4. Chord length and averaging cores for the knot vector generation.

Fig. 5. Core for the generation of one point of a degree 3 NURBS curve, with p = 3 layers.

- dividing the task into blocks, each consisting of 8 to 32 threads;
- passing the serial processing to the host processor;
- 16-bit data size (compatible with the FPGA system).

The GPU used has 16 multiprocessors with 4 processors each (enabling at most 64 processors). NURBS is implemented by multithreading each layer of the Cox-de Boor algorithm, just like the FPGA implementation, within the hardware limitations.

6. RESULTS

The NURBS local interpolation algorithm and the visualization system on the FPGA are compared with a single GPU in terms of the number of clock cycles. The GPU still processes the data interpolation more efficiently, being a dedicated circuit designed to perform graphics functions optimally. The cores are synthesized to perform functionality similar to the GPU's, with 16 simultaneous threads and an independent clock counter synchronized with the beginning and end of the process.

7. CONCLUSIONS

The use of FPGAs in computer graphics is still incipient, a fact confirmed despite the existence of circuits dedicated to this aim, leading to the following observations:

1. while the GPU is a highly specialized processor that can achieve great performance (for a specific subset of problems), most GPUs are currently not suitable for embedded applications, compared to FPGAs, because of their power dissipation, sometimes requiring more cooling than computer processors;

Table 1. Total clock cycles for 32 interpolation points and 100 parameters (p is the NURBS degree).

                           CPU (vectorized)   GPU (CUDA)   FPGA
  NURBS interpolation      47p(p+1)           16p          8p
  Visualization pipeline   12p+2              4p           10p+20
  Knot addition (16 knots) 10p                16p

Table 2. Cores and respective number of logic elements (LEs).

  Core                                 LEs
  Vertex reader                        57
  Viewport transformation              110
  Primitive drawer: NURBS interp.      148
  Primitive drawer: straight line      24
  Primitive drawer: circle             48
  Video memory writer                  62
  Wishbone                             54

Fig. 6. NURBS curve (100 points, degree 3) produced by the visualization system, with control points (square dots) and control polygon: (a) 16-bit integer, (b) single precision.

Fig. 7. NURBS cube surface with 160x110 points, generated by the system from 16x11 control points.

2. the GPU is limited by its built-in hardware and firmware, despite its multiprocessing power;

3. processors based on the traditional computer architecture are restricted by the demand for ever higher clock frequencies, which gave rise to multicore processors. Programmable logic technologies have not reached such high frequencies, but have been developed with ever higher densities. Currently, general-purpose GPUs (GPGPUs) provide more flexibility to the system designer, though still tied to a fixed hardware architecture, while some operations, such as fixed-point operations, are done efficiently in FPGAs. Future work is aimed at matching reconfigurable systems with GPGPUs.

8. REFERENCES

[1] M. C. Tsai, C. W. Cheng, M. Y. Cheng, "A real-time NURBS surface interpolator for precision three-axis CNC machining", International Journal of Machine Tools & Manufacture, vol. 43, no. 12.
[2] L. Piegl, W. Tiller, The NURBS Book, 2nd ed., Springer, New York.
[3] S. Mann, M. Lounsbery, C. Loop, D. Meyers, J. Painter, T. DeRose, K. Sloan, "A Survey of Parametric Scattered Data Fitting Using Triangular Interpolants", in Curve and Surface Design, chapter 8, H. Hagen (ed.), SIAM.
[4] J. C. Carr, R. K. Beatson, J. B. Cherrie, T. J. Mitchell, W. R. Fright, B. C. McCallum, T. R. Evans, "Reconstruction and Representation of 3D Objects with Radial Basis Functions", Proc. 28th Annual Conference on Computer Graphics and Interactive Techniques.
[5] S. K. Lodha, R. Franke, "Scattered Data Techniques for Surfaces", Proc. Conference on Scientific Visualization, p. 181.
[6] IGES/PDES Organization, Initial Graphics Exchange Specification - IGES 5.3, ANSI 1996, U.S. Product Data Association, Sep.
[7] ISO 10303, Industrial automation systems and integration - Product data representation and exchange, multipart standard, International Organization for Standardization (ISO).
[8] OpenGL Architecture Review Board, D. Shreiner, M. Woo, J. Neider, T. Davis, The OpenGL Programming Guide: The Official Guide to Learning OpenGL, Version 2.1, 6th ed., Addison-Wesley Professional, Boston, Massachusetts.
[9] L. Piegl, "On NURBS: A Survey", IEEE Computer Graphics & Applications, Jan.
[10] P. Orenstein, "High-Speed CAM of 3-D Sculpted Surfaces", Time Compression Magazine, Mar. 2002.
[11] H. Styles, W. Luk, "Customising graphics applications: techniques and programming interface", IEEE Symposium on Field-Programmable Custom Computing Machines, Apr.
[12] H. T. Yau, M. T. Lin, M. S. Tsai, "Real-Time NURBS interpolation using FPGA for high speed motion control", Computer-Aided Design, no. 38.
[13] N. Knutsson, "An FPGA-based 3D Graphics System", Master's thesis in Electronics Systems, Linköping Institute of Technology.

[14] J. E. Bresenham, "Algorithm for Computer Control of a Digital Plotter", IBM Systems Journal, vol. 4, no. 1.
[15] OpenCores Organization, WISHBONE System-on-Chip (SoC) Interconnection Architecture for Portable IP Cores, Revision B.3.
[16] B. Cope, P. Y. K. Cheung, W. Luk, S. Witt, "Have GPUs made FPGAs redundant in the field of Video Processing?", Proc. IEEE International Conference on Field-Programmable Technology, Dec.
[17] M. L. Stokes, "A Brief Look at FPGAs, GPUs and Cell Processors", ITEA Journal, Jun./Jul.
[18] L. W. Howes, P. Price, O. Mencer, O. Beckmann, "FPGAs, GPUs and the PS2 - A Single Programming Methodology", 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, Apr.

A METHODOLOGY FOR THE DEVELOPMENT OF HIGH-PERFORMANCE SYSTEMS-ON-CHIP

Marcos J. Oviedo - Facultad de Ingeniería, Instituto Universitario Aeronáutico, Córdoba, AR

Pablo A. Ferreyra - Facultad de Ciencias Exactas, Físicas y Naturales, Universidad Nacional de Córdoba; Posgrado en Sistemas Embebidos, Instituto Universitario Aeronáutico, Córdoba, AR - ferreyra@famaf.unc.edu.ar

ABSTRACT

High-performance data processing has become a challenge for current embedded-systems technology. This paper describes a methodology for designing high-performance embedded systems through the use of hardware accelerators implemented in programmable logic. As a proof of concept, the TripleDES symmetric encryption algorithm is implemented on a high-performance embedded system.

1. INTRODUCTION

Advances in the semiconductor industry have made it possible to implement complex digital systems using programmable logic components. A programmable system-on-chip (SoC) is a digital system that implements, on a single silicon die, an embedded system, application-specific devices, and application and control software. One of the most powerful concepts behind SoC design is that system functionality can be specified and assigned not only to the software running on the processor, but also to the hardware components that constitute the system. This makes it possible, for processing acceleration, to implement high-performance processing units in charge of performing certain types of computational operations optimally, and therefore faster than the embedded system's processor, thus helping to increase overall system performance.

Due to the growing technological demands of modern society, the embedded systems inside today's electronic equipment must be able to evolve constantly in order to support the growing processing loads to which they are subjected. To address this, there are alternatives to traditional embedded systems, such as the development of embedded systems on a high-performance SoC (HPSoC). In an HPSoC, the hardware architecture is optimized so that its computing platform works cooperatively with hardware accelerators. During the development of an HPSoC, its functionality is first defined in software, and part of it is then moved to hardware. Implementation in hardware through programmable logic is a valid alternative for achieving substantial performance gains, given the high level of parallelism that a programmable logic platform can provide. This paper presents a methodology for developing an HPSoC and concludes with a performance comparison between two implementations of the TripleDES symmetric encryption algorithm on an HPSoC. The implementation was carried out on a programmable logic platform containing a Xilinx Virtex-4 FPGA.

This paper is organized as follows. Section 2 presents background on the limitations of processor-based systems that motivate this work. Section 3 presents the proposed methodology for developing this type of digital system. Section 4 discusses, at several levels, the techniques and mechanisms available to increase the performance of an HPSoC. Section 5 presents two HPSoC implementations providing cryptographic accelerators developed following the proposed methodology. Section 6 shows and compares the results obtained. Finally, Section 7 presents the conclusions of this work.

2. LIMITATIONS OF PROCESSOR-BASED SYSTEMS

Processors were conceived to perform general-purpose computation. This design decision means that processors are not efficient at specific computing tasks and therefore cannot deliver the processing performance demanded by some current embedded systems. Pursuing Moore's law, alternatives have been sought over the years to improve the performance of processor-based computing platforms. However, these alternatives have not been efficient or applicable in many scenarios where performance was a requirement. As stated in [1], this is mainly because processors have inherent physical limitations which, in many cases and with current technology, prevent these alternatives from being applied arbitrarily:

- Increasing the number of transistors and the frequency at which they operate introduces serious heat dissipation problems (the power wall).
- The frequency cannot be increased arbitrarily, not only because of the power wall but also because of an inherent physical limitation in the switching times of the transistors used in the microprocessor design (the frequency wall).
- In a current computing system, the microprocessor bandwidth is generally about 70 times higher than that of the external memory, turning memory access into a bottleneck. The use of complex hierarchies of memories local to the microprocessor (caches) considerably reduces data access time, but since cache size cannot be increased arbitrarily with current technology, memory access remains a real problem (the memory wall).
- Finally, processors themselves have a fundamental limitation: a design based on serial execution, which makes it extremely difficult to extract parallelism from an instruction execution flow. As mentioned in [2], current processor architectures include complex designs and techniques that attempt to extract instruction-level parallelism and mitigate this limitation.

The use of programmable logic and the realization of an HPSoC is a valid alternative for achieving substantial performance gains in systems where performance is the main requirement. This work demonstrates the implementation of an HPSoC that provides an application-oriented, specific computing platform, so as to optimize the data execution path (by extracting and implementing parallelism), optimize memory usage (increasing locality and access), reduce power dissipation (specific hardware requires fewer transistors) and reduce the operating frequency (possible because multiple operations are performed in each clock cycle).

3. HPSOC DEVELOPMENT METHODOLOGY

In the proposed methodology, the design of an HPSoC consists of two separate but interacting areas. One is the creation of the support needed to implement a microprocessor-based embedded system on the FPGA; the other is the optimization of the application in view of a later implementation based on hardware-software co-design. In this co-design, the hardware component is implemented as an accelerator that communicates with the software component through the embedded design.
First, embedded system development is carried out using EDA tools that allow a microprocessor (which can be a softcore, or a hardcore as mentioned in [3]) to be interconnected, through a hierarchy of interconnection buses, with a set of devices that turn the embedded system into a functional computing platform. Embedded system development is not standardized and varies depending on the FPGA vendor. In this work, Xilinx FPGAs were used, so the embedded system was implemented with the Xilinx development ecosystem: the EDK and ISE tools and the XilinxProcessorIPLib hardware component libraries. Second, the application to be optimized must be worked on. For this, a software prototype of the application or algorithm to be implemented on the HPSoC must be built. This prototype is then characterized and evaluated with tools such as profilers and code analyzers, in order to detect the segments or areas where most of the processing takes place (performance-critical sections). With this information, and through a top-down approach, the algorithm defining the application is studied in order to refactor it so that the critical sections can be optimized and isolated for implementation in hardware.

Implementing the application's critical sections in hardware allows the computational operations to be expressed in hardware description languages and, through a level-by-level optimization strategy, to implement hardware accelerator components, that is, specific processing hardware that performs highly efficient, high-performance computation. The accelerator component can be developed in a hardware description language such as VHDL, or with an ESL tool such as ImpulseC [4]. The accelerator component must also be integrated into the HPSoC's embedded design, so a high-speed hardware-software communication channel must also be developed. On the other hand, the software component of the HW/SW co-design can run directly on the processor or under the control and support of an operating system (as one more user-space application). Given the benefits an operating system provides, our methodology provides operating system support for the software component. Once these two parts of the HPSoC are complete, the hardware design has to be translated and mapped onto the fabric of an FPGA, and the corresponding software binary images have to be stored in the corresponding memories for later evaluation.

4. PERFORMANCE OPTIMIZATION IN AN HPSOC DESIGN

There are several factors that can be modified and techniques that can be applied in the architecture of an HPSoC to increase its overall performance. These factors can be grouped into three areas, which we call optimization levels.

4.1. System-level optimizations

We describe as the system the physical platform on which the application is implemented. System-level optimizations are tied to how applications can be implemented on this platform, and to the modifications that can be made to it so that they execute faster and with higher throughput.
The trivial optimization is to modify the hardware design of the computing platform so that its components run at the maximum admissible speed. It is also advisable to establish high-speed communication channels between the components the processor uses frequently, for example the memory banks or the communication with the FPGA fabric. The use of memory caches (preferably block RAM) can increase data locality and thus improve performance. Likewise, the synthesis tools that synthesize the hardware design allow timing constraints to be configured and the effort level to be increased, improving performance by reducing the data propagation time through the hardware. Moreover, at the system level, data processing can be parallelized at the accelerator-component level. Provided the algorithm allows it, that is, it works on data sets that are independent of one another, and provided resources are available in the FPGA used, more than one accelerator component can be implemented so that several data sets are processed in parallel.

4.2. Application-level optimizations

We describe as the application the computational algorithm that fulfills a certain number of requirements in order to implement, in software or hardware, the main functionality of the HPSoC. The goal of application-level optimizations is to study the algorithm that defines the performance-critical operations of the application, so as to detect its inherent parallelism and optimize the processing. Note that these optimizations are made on the high-level details of the algorithm, not on the low-level details that define its implementation.
Then, if possible parallelizations are detected, and always complying with the initial functional requirements, the necessary modifications are made to the algorithm's code so that it abandons its serial execution flow and adopts a parallel operating model. Besides detecting parallelism and optimizing the algorithm's execution flow, another interesting technique for optimizing performance at the application level is data precomputation. This basically consists of narrowing the algorithm's range of action by making assumptions about its working space, so as to precompute and simplify its operations and thereby speed up its execution.

4.3. Micro-architecture-level optimizations

We describe as the micro-architecture the programmable logic components that implement the low-level details of the algorithm defining the application that will run on the HPSoC. Some techniques for improving the performance of the accelerator component's micro-architecture are the following:

1) Replicating the arrays or memory banks that hold the data. One of the most important advantages offered by hardware programming is the possibility of accessing multiple memory banks in a single clock cycle. Unlike a software implementation, in which a CPU is always connected to one or more physical memory devices through a single bus, a hardware implementation allows the flexibility of an arbitrary connection topology, in which a set of operations, when executed, can access data distributed across several memory banks in a single clock cycle. For this reason, an important factor to keep in mind is that, to achieve optimal results, the data set should be replicated across different memory banks. This yields separate memory banks, each with its own read/write port, which can be accessed in parallel for subsequent operation/processing.

2) Operations on loops. In an algorithm, loops are one of the constructs with a high degree of inherent parallelism, and therefore one of the constructs targeted for optimization. Loops generally perform repetitive operations on a data set. If no operation of the loop depends on data computed in previous iterations, that is, if each iteration can operate on an independent data set, the achievable degree of parallelism is high. There are two techniques for optimizing operations on loops: loop unrolling and the generation of pipelines. Loop unrolling consists of expanding the set of iterations considered by the loop and rearranging the algorithm so that they can be performed in parallel, in a single iteration of the loop.
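The effect of unrolling can be sketched in software (an illustrative model only: the unroll factor of 4 and the accumulate operation are our arbitrary choices, standing in for replicated hardware datapaths):

```python
# Loop unrolling sketch: each pass of the outer loop performs `factor`
# independent operations, mimicking `factor` parallel hardware datapaths.
def accumulate_unrolled(data, factor=4):
    acc = [0] * factor                    # one accumulator per "datapath"
    n = len(data) - len(data) % factor    # largest multiple of the factor
    for i in range(0, n, factor):         # one pass = `factor` independent ops
        for lane in range(factor):        # in hardware these run concurrently
            acc[lane] += data[i + lane]
    return sum(acc) + sum(data[n:])       # combine lanes + leftover elements
```

In software the inner loop still runs serially; the point is that its iterations carry no dependence on each other, which is exactly the property that lets the hardware version execute them in the same clock cycle.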
Pipelining consists of dividing the work into subtasks, so that as data to be processed arrive, each subtask can concurrently process a different data set. Thus, if each loop iteration requires executing N subtasks, in an implementation without pipelining the loop performs (N * number_of_data_elements) iterations to complete its work. In a pipelined implementation, instead, the whole data set is processed in (N + number_of_data_elements - 1) iterations. The theory of pipelining and loop unrolling has been extensively developed in [5] and [6].

5. PROOF OF CONCEPT: IMPLEMENTATION OF A CRYPTOGRAPHIC HPSOC

In order to evaluate the improvements obtained through a hardware-accelerated HPSoC, a prototype SoC without acceleration (software-only implementation) and two versions of a hardware-accelerated HPSoC were developed. In the latter two cases, the accelerator component was developed in VHDL and with the ImpulseC high-level synthesis tool, respectively. The cryptographic application consisted of taking a data set from memory and encrypting it with the TripleDES symmetric cipher. TripleDES was used in ECB mode, following the guidelines in [7] and [8]. Following the methodology proposed in this work, after designing and implementing the embedded system on the SoC, a non-optimized software prototype of the algorithm was developed. This prototype served to study and characterize the algorithm. With the data obtained, and evaluating the optimization techniques enumerated in [9], the accelerator components were developed and the optimization levels described above were applied.
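The structure being accelerated, TripleDES in EDE form over independent 8-byte ECB blocks, can be sketched as follows. Note that the block cipher below is a deliberately trivial stand-in, NOT DES; only the E-D-E composition and the ECB block handling reflect the algorithm's structure, and the block independence is precisely what allows several accelerator instances to work in parallel:

```python
# TOY sketch of TripleDES structure in ECB mode. toy_encrypt/toy_decrypt
# are a trivial keyed byte shift, NOT the DES round function.
BLOCK = 8  # DES block size in bytes

def toy_encrypt(block, key):
    return bytes((b + key) % 256 for b in block)

def toy_decrypt(block, key):
    return bytes((b - key) % 256 for b in block)

def tdes_ecb_encrypt(data, k1, k2, k3):
    assert len(data) % BLOCK == 0, "ECB needs whole blocks (pad first)"
    out = b""
    for i in range(0, len(data), BLOCK):   # ECB: blocks are independent
        blk = data[i:i + BLOCK]
        out += toy_encrypt(toy_decrypt(toy_encrypt(blk, k1), k2), k3)  # E-D-E
    return out

def tdes_ecb_decrypt(data, k1, k2, k3):
    out = b""
    for i in range(0, len(data), BLOCK):
        blk = data[i:i + BLOCK]
        out += toy_decrypt(toy_encrypt(toy_decrypt(blk, k3), k2), k1)  # D-E-D
    return out
```

Because no state is carried between blocks in ECB mode, the per-block E-D-E chain is the unit of work that can be replicated across accelerator components, as discussed in the system-level optimizations.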
It is worth noting that the prototypes were implemented on the FX12 MiniModule development kit from Avnet, which features a Virtex-4 FX12 FPGA and several external components described on the manufacturer's page. The block diagram of the developed HPSoC is shown in Figure 1.

Fig. 1. Block diagram of the developed HPSoC architecture.

Fig. 2. Query flow during operating-system boot.

Development of the embedded system of the SoC

During the development of the embedded design, support was provided for all the physical hardware devices of the development kit, using the hardware IP cores required for the embedded system to operate. The embedded design was implemented with the Xilinx EDK tool described in [10]. The processor chosen for the embedded design was a hardware resource of the selected FPGA, namely the PowerPC 405 (PPC) hard core. With this tool, an embedded system was developed that allows the PPC processor to communicate with the external devices of the development kit, such as the RAM, the FLASH memory, the UART port, and the Gigabit Ethernet PHY, and that also implements the components needed to turn the embedded system and its processor into a functional computing platform. The tool also made it possible to develop a high-speed communication channel between the accelerator component implemented in the programmable logic and the microprocessor of the embedded system. Support was generated so that the PPC can communicate with the various devices through the PLB, OPB, FCB, OCM, and DCR buses. These buses belong to the CoreConnect bus family and are described in [10]. Note that this processor only supports direct connection to the PLB, OCM, DCR, and FCB buses, so the devices behind the OPB bus are reached through a PLB2OPB bridge. Once the bus architecture, bus widths, and operating frequencies were defined, together with the additional elements needed for correct operation, the development-kit devices to be supported were selected, along with how they would be connected to the bus hierarchy so as to be visible to the PPC processor. Their configuration is specific to each case and depends on the adopted design, although it generally includes aspects such as the type of DMA used, the operating speed, the clock the device is connected to, the pins and nets it is connected to, the number and type of interrupts it generates, and the areas of memory reserved in the system memory map for its registers. Each IP core in the XilinxProcessorIPLib hardware library has a datasheet detailing its possible configuration parameters. The high-speed communication channel was implemented by means of the APU controller, a feature of the PPC processor described in [11]. The APU controller provides a flexible, high-speed communication interface between the FPGA fabric and the PPC processor. This interface directly connects the instruction pipeline of the PPC to one or more hardware accelerator components.

Development of the HPSoC control software support

The software component of the HPSoC application runs with the support of an operating system; Linux was chosen for this role. To this end, Linux was prepared to control the various hardware components of the embedded system.
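The reserved register areas mentioned above are how the software on the PPC ultimately talks to each IP core: a peripheral's control and status registers appear at a fixed range of the system memory map. The sketch below shows the usual volatile-pointer access pattern; the register layout, offsets, and semantics are hypothetical, invented for illustration (a real design would take the base address from the EDK-generated address map, e.g. an XPAR_*_BASEADDR constant), and for testability an ordinary in-memory buffer stands in for the device.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical register block of an accelerator IP core as it would
 * appear in the system memory map. volatile prevents the compiler
 * from caching or reordering the register accesses. */
typedef struct {
    volatile uint32_t control;   /* offset 0x00: bit 0 = start  */
    volatile uint32_t status;    /* offset 0x04: bit 0 = done   */
    volatile uint32_t data_in;   /* offset 0x08: input word     */
    volatile uint32_t data_out;  /* offset 0x0C: result word    */
} accel_regs;

/* On real hardware this buffer would instead be the EDK-assigned
 * base address of the core; a buffer lets the sketch run anywhere. */
static uint32_t fake_device[4];

/* Load an operand into the core and set its start bit. */
static void accel_submit(accel_regs *r, uint32_t word) {
    r->data_in = word;           /* write the operand register  */
    r->control |= UINT32_C(1);   /* set the start bit           */
}
```

Device drivers for the Linux port discussed below follow the same pattern, with the physical register range remapped into kernel virtual address space first.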
A root filesystem was also generated with the applications needed to make the system functional. In addition, a mechanism was developed so that the kernel could be loaded into execution memory using XMD, a debugger provided with EDK that can download and run ELF binaries compiled for the PPC processor over JTAG. This allowed the kernel to run on the embedded system, initialize itself, detect the hardware it runs on, configure the network interfaces, auto-configure its network address via DHCP, and boot from a remote root filesystem over NFS. Figure 2 shows the interaction of protocols during kernel boot. The Linux version used was the …. Extensive modifications were made to the kernel code, combining code provided by the vendor with ad-hoc changes. Furthermore, to support the APU communication channel, bit 6 of the MSR (Machine State Register) of the PPC processor had to be enabled. This register defines the operating state of the processor and must be configured at operating-system initialization time. In this operating mode, the processor can use vector instructions to transmit data over the high-performance communication channel between hardware and software.

Development of the HPSoC application

As mentioned, three versions of the application were developed. One, written entirely in software, served as the starting point for studying the encryption algorithm. From this prototype, two accelerator-component versions were developed: one using the ImpulseC tool and another written in VHDL. The same optimizations, described below, were applied to both versions: 1) System-level optimizations: The system-level optimizations performed on the platform are listed next.
a) The clock speed of the embedded processor was increased to 200 MHz.
b) The speeds of the PLB and OPB buses were increased.
c) The speed of the accelerator component was increased.
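The MSR manipulation mentioned in the previous subsection is easy to get wrong, because IBM's PowerPC documentation numbers bits from the most significant end: bit 0 is the MSB of the 32-bit register. The sketch below builds the mask for "bit 6" under that convention and sets it in a software copy of the register. The helper names are our own, and on real hardware the MSR is read and written with the mfmsr/mtmsr instructions from kernel initialization code, not through a C variable.

```c
#include <assert.h>
#include <stdint.h>

/* Mask for PowerPC bit n of a 32-bit register, where bit 0 is the
 * MOST significant bit (IBM big-endian bit numbering). */
static uint32_t ppc_bit32(unsigned n) {
    return UINT32_C(1) << (31u - n);
}

/* Enable "bit 6" in a software copy of the MSR, as the OS would do
 * at initialization time before using the APU channel. */
static uint32_t msr_enable_bit6(uint32_t msr) {
    return msr | ppc_bit32(6);
}
```

Under this numbering, bit 6 corresponds to the mask 0x02000000, not 0x40; using the little-endian-style mask would silently enable the wrong MSR field.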

Más detalles

LAC-2009-09 Modificación 2.3.3.3. DIRECT ALLOCATIONS TO ISPs DISTRIBUCIONES INICIALES A ISPs

LAC-2009-09 Modificación 2.3.3.3. DIRECT ALLOCATIONS TO ISPs DISTRIBUCIONES INICIALES A ISPs LAC-2009-09 Modificación 2.3.3.3 DIRECT ALLOCATIONS TO ISPs DISTRIBUCIONES INICIALES A ISPs Current Policy 2.3.3.3. Direct Allocations to Internet Service Providers LACNIC may grant this type of allocation

Más detalles

Introducción al enrutamiento y envío de paquetes

Introducción al enrutamiento y envío de paquetes Introducción al enrutamiento y envío de paquetes Conceptos y protocolos de enrutamiento. Capítulo 1 Ing. Aníbal Coto 1 Objetivos Identificar un router como una computadora con SO y hardware diseñados para

Más detalles

una industria o en lugares remotos, y envía esos datos a una unidad central que realiza el procesamiento de los datos. En la actualidad los

una industria o en lugares remotos, y envía esos datos a una unidad central que realiza el procesamiento de los datos. En la actualidad los SCADA Supervisory Control And Data Acquisition iii Es un sistema ste que colecta datos provenientes e de diferentes e sensores so es en una industria o en lugares remotos, y envía esos datos a una unidad

Más detalles

TP 0 - Implementación de codificador y estructura básica. BitsTranslation. 1.0

TP 0 - Implementación de codificador y estructura básica. BitsTranslation. 1.0 TP 0 - Implementación de codificador y estructura básica. BitsTranslation. 1.0 Gabriel Ostrowsky, Padrón Nro. 90.762 gaby.ostro@gmail.com Juan Manuel Gonzalez Durand, Padrón Nro. 91.187 juanma.durand@gmail.com

Más detalles

Tema 4. Gestión de entrada/salida

Tema 4. Gestión de entrada/salida Tema 4. Gestión de entrada/salida 1. Principios de la gestión de E/S. 1.Problemática de los dispositivos de E/S. 2.Objetivos generales del software de E/S. 3.Principios hardware de E/S. 1. E/S controlada

Más detalles

Tutorial BMS Server Studio UDP

Tutorial BMS Server Studio UDP Tutorial BMS Server Studio UDP ÍNDICE Página 0. Introducción...3 1. Configuración del puerto UDP...4 2. Ejemplos...6 2.1 Configuración manual...6 2.1.1 Configuración SocketTest...6 2.1.2 Configuración

Más detalles

Reproducción de una Imagen en un Monitor VGA Utilizando un FPGA

Reproducción de una Imagen en un Monitor VGA Utilizando un FPGA 7 Reproducción de una Imagen en un Monitor VGA Utilizando un FPGA Michael Alejandro Diaz Illa, Alfredo Granados Ly Facultad de Ingeniería Electrónica y Eléctrica, Universidad Nacional Mayor de San Marcos,

Más detalles

4. Programación Paralela

4. Programación Paralela 4. Programación Paralela La necesidad que surge para resolver problemas que requieren tiempo elevado de cómputo origina lo que hoy se conoce como computación paralela. Mediante el uso concurrente de varios

Más detalles

Redes (IS20) Ingeniería Técnica en Informática de Sistemas. http://www.icc.uji.es. CAPÍTULO 8: El nivel de transporte en Internet

Redes (IS20) Ingeniería Técnica en Informática de Sistemas. http://www.icc.uji.es. CAPÍTULO 8: El nivel de transporte en Internet Redes (IS20) Ingeniería Técnica en Informática de Sistemas http://www.icc.uji.es CAPÍTULO 8: El nivel de transporte en Internet ÍNDICE 1. Introducción Curso 2002-2003 - Redes (IS20) -Capítulo 8 1 1. Introducción

Más detalles

Práctica de laboratorio 1.2.8 Configuración de DHCP Relay

Práctica de laboratorio 1.2.8 Configuración de DHCP Relay Práctica de laboratorio 1.2.8 Configuración de DHCP Relay Objetivo Se configurará un router para el Protocolo de Configuración Dinámica del Host (DHCP) Se agregará la capacidad para que las estaciones

Más detalles

Efectos de los dispositivos de Capa 2 sobre el flujo de datos 7.5.1 Segmentación de la LAN Ethernet

Efectos de los dispositivos de Capa 2 sobre el flujo de datos 7.5.1 Segmentación de la LAN Ethernet 7.5 Efectos de los dispositivos de Capa 2 sobre el flujo de datos 7.5.1 Segmentación de la LAN Ethernet 1 2 3 3 4 Hay dos motivos fundamentales para dividir una LAN en segmentos. El primer motivo es aislar

Más detalles

MANUAL EASYCHAIR. https://www.easychair.org/account/signin.cgi?conf=cnicc2013

MANUAL EASYCHAIR. https://www.easychair.org/account/signin.cgi?conf=cnicc2013 MANUAL EASYCHAIR La URL para enviar su artículo al congreso es: https://www.easychair.org/account/signin.cgi?conf=cnicc2013 Donde aparece la siguiente pantalla: En este punto hay dos opciones: A) Teclear

Más detalles

Fiery Network Controller para la DocuColor 250/240 SERVER & CONTROLLER SOLUTIONS. Ejemplos de flujos de trabajo

Fiery Network Controller para la DocuColor 250/240 SERVER & CONTROLLER SOLUTIONS. Ejemplos de flujos de trabajo Fiery Network Controller para la DocuColor 250/240 SERVER & CONTROLLER SOLUTIONS Ejemplos de flujos de trabajo 2005 Electronics for Imaging, Inc. La información de esta publicación está cubierta por los

Más detalles

18. Camino de datos y unidad de control

18. Camino de datos y unidad de control Oliverio J. Santana Jaria Sistemas Digitales Ingeniería Técnica en Informática de Sistemas Curso 2006 2007 18. Camino de datos y unidad de control Un La versatilidad una característica deseable los Los

Más detalles

NOTA DE APLICACIÓN AN-P002. Programando Wiring con NXProg

NOTA DE APLICACIÓN AN-P002. Programando Wiring con NXProg NOTA DE APLICACIÓN AN-P002 Programando Wiring con NXProg Este documento se encuentra protegido por una licencia Creative Commons Creative Commons: Atribution, Non-commercial, Share Alike Atribución: Puede

Más detalles

Encriptación en Redes

Encriptación en Redes Encriptación en Redes Integrantes: Patricio Rodríguez. Javier Vergara. Sergio Vergara. Profesor: Agustín González. Fecha: 28 de Julio de 2014. Resumen Un tema importante actualmente en la redes de computadores,

Más detalles

Novedades en Q-flow 3.02

Novedades en Q-flow 3.02 Novedades en Q-flow 3.02 Introducción Uno de los objetivos principales de Q-flow 3.02 es adecuarse a las necesidades de grandes organizaciones. Por eso Q-flow 3.02 tiene una versión Enterprise que incluye

Más detalles

Infraestructura Tecnológica. Sesión 5: Arquitectura cliente-servidor

Infraestructura Tecnológica. Sesión 5: Arquitectura cliente-servidor Infraestructura Tecnológica Sesión 5: Arquitectura cliente-servidor Contextualización Dentro de los sistemas de comunicación que funcionan por medio de Internet podemos contemplar la arquitectura cliente-servidor.

Más detalles

Práctica de laboratorio 5.2.2 Configuración del PVC de Frame Relay

Práctica de laboratorio 5.2.2 Configuración del PVC de Frame Relay Práctica de laboratorio 5.2.2 Configuración del PVC de Frame Relay Objetivo Configurar dos routers uno tras otro como un circuito virtual permanente (PVC) de Frame Relay. Esto se hará manualmente, sin

Más detalles

www.sociedadelainformacion.com

www.sociedadelainformacion.com Cambio de paradigma en el marco de trabajo conceptual en las organizaciones. Maribel Sánchez Sánchez Jefe de proyectos de cómputo Universidad Iberoamericana Maribel Sánchez Sánchez, Lic. maryssan_81@hotmail.com

Más detalles