NEURAL CATEGORIZERS IN VLSI

Dissertation submitted by
TERESA SERRANO GOTARREDONA
in candidacy for the degree of Doctor in Physical Sciences

Seville, July 1996

ADVISOR:

Bernabé Linares Barranco

Departamento de Electrónica y Electromagnetismo
Universidad de Sevilla

To Bernabé, to my parents Joaquín and Caridad, and to my brothers and sisters, for their patience and affection.
Contents

Nomenclature
Neural Categorizers in VLSI
   1. Categorizing Systems
   2. The ART 1 Architecture
   3. Circuit Implementation of the ART 1 Algorithm
      A. Implementation of the “Winner-Take-All” Circuit
      B. Implementation of the Current Elements
      C. Mismatch Characterization of a CMOS Technology
      D. New ART 1 Prototype
   4. Multichip Implementations
      A. Modular System Expansion
      B. Implementation of an ARTMAP System
   5. Conclusions
   6. References

Appendix 1: A VLSI-Friendly ART 1 Algorithm
   1.1. The Adaptive Resonance Theory
   1.2. The ART 1 Mathematical Model
      A. STM Equations
      B. The LTM Equations
      C. The Reset Subsystem
   1.3. The Fast Learning ART 1 Algorithm
   1.4. The Modified ART 1 Algorithm
   1.5. Computational Equivalence of the Original and the Modified Models
      A. Direct Access to Subset and Superset Patterns
      B. Direct Access by Perfectly Learned Patterns (Theorem 1 of original ART 1)
      C. Stable Choices in STM (Theorem 2 of original ART 1)
      D. Initial Filter Values determine Search Order (Theorem 3 of original ART 1)
      E. Learning on a Single Trial (Theorem 4 of original ART 1)
      F. Stable Category Learning (Theorem 5 of original ART 1)
      G. Direct Access after Learning Self-Stabilizes (Theorem 6 of original ART 1)
      H. Search Order (Theorem 7 of original ART 1)
      I. Biasing the Network towards Uncommitted Nodes
      J. Remarks
   1.6. Functional Differences between Original and Modified Model
   1.7. Extending the ART 1m Model to Type-2 and Type-1 Descriptions
      A. A Type-2 ART 1m Implementation
      B. A Type-1 ART 1m Implementation
   1.8. Alternative ART 1 Modifications
   1.9. References

Appendix 2: A Real-Time Clustering Microchip Neural Engine
   2.1. Hardware Oriented Attractive Properties of the ART 1 Algorithm
   2.2. Circuit Description
      A. Synaptic Circuit and Controlled Current Sources
      B. Winner-Take-All (WTA) Circuit
      C. Current Comparators
      D. Current Mirrors
      E. Synaptic Current Sources
      F. Weights Read Out
      G. Modular System Expansibility
   2.3. Experimental Results
      A. System Precision Characterizations
      B. Throughput Time Measurements
      C. System Level Performance
      D. Yield and Fault Tolerance
   2.4. Further Enhancements
   2.5. References

Appendix 3: A High-Precision Current-Mode WTA-MAX Circuit with Multi-Chip Capability
   3.1. Introduction
   3.2. Operation Principle
   3.3. Circuit Implementation
   3.4. System Stability Coarse Analysis
   3.5. System Stability Fine Analysis
   3.6. Experimental Results
      A. Operation Precision
      B. Operation Speed
   3.7. References

Appendix 4: Systematic CMOS Transistor Mismatch Characterization
   4.1. Introduction
   4.2. Pelgrom’s Model of Transistor Mismatch
   4.3. Mismatch Characterization Chip
   4.4. Transistor Measurement
   4.5. Statistical Data Processing
   4.6. (H)Spice Simulations
   4.7. References

Appendix 5: Multichip Realizations with ART 1 Modules
   5.1. A Compact ART 1 Design
   5.2. Experimental Results of this ART 1 Prototype
      A. Single ART 1 Chip Operation
      B. Multichip ART 1 Operation
   5.3. ARTMAP Architectures
      A. The ARTMAP Algorithm: Supervised Learning
      B. ARTMAP Circuit Implementation
      C. Experimental Results
   5.4. References

Nomenclature

ART Adaptive Resonance Theory
ART 1 First algorithm of the Adaptive Resonance Theory
ARTMAP Adaptive resonance algorithm with supervised learning

\( F_1 \) Input layer of an ART architecture
\( F_2 \) Category layer of an ART architecture
\( N \) Number of nodes in layer \( F_1 \)
\( M \) Number of nodes in layer \( F_2 \)
\( v_i \) Node of layer \( F_1 \)
\( u_j \) Node of layer \( F_2 \)
\( Z^{bu} \) Matrix of “bottom-up” weights, which connect layer \( F_1 \) to layer \( F_2 \)
\( z_{j}^{bu} \) Vector of weights connecting the nodes of layer \( F_1 \) to node \( u_j \) of \( F_2 \)
\( z_{ij}^{bu} \) Weight connecting node \( v_i \) of \( F_1 \) to node \( u_j \) of \( F_2 \)
\( Z^{td} \) Matrix of “top-down” weights, which connect layer \( F_2 \) to layer \( F_1 \)
\( z_{j}^{td} \) Vector of weights connecting node \( u_j \) of \( F_2 \) to the nodes of layer \( F_1 \)
\( z_{ji}^{td} \) Weight connecting node \( u_j \) of \( F_2 \) to node \( v_i \) of \( F_1 \)
\( \rho \) Vigilance parameter
\( I \) Input pattern
\( I_i \) Component of the input pattern
\( X \) Activation pattern of layer \( F_1 \)
\( x_i \) State of node \( v_i \)
\( Y \) Activation pattern of layer \( F_2 \)
\( y_j \) State of node \( u_j \)
\( S \) Pattern of postsynaptic signals of layer \( F_1 \)
\( h(x_i) \) Postsynaptic signal of node \( v_i \) of \( F_1 \)
\( T \) Input pattern to layer \( F_2 \)
\( T_j \) Input signal to node \( u_j \)
\( U \) Pattern of postsynaptic signals of layer \( F_2 \)
\( f(y_j) \) Postsynaptic signal of node \( u_j \) of \( F_2 \)
\( V \) “Top-down” input pattern to layer \( F_1 \)
\( V_i \) Component of \( V \) entering node \( v_i \)
\( L \) Parameter of the original ART 1 algorithm
ART 1m Modified ART 1 algorithm

\( L_A, L_B, L_M \) Parameters of the modified ART 1 algorithm
\( \alpha \) Parameter of the modified ART 1 algorithm such that \( \alpha = \frac{L_A}{L_B} \)
\( \epsilon \) Parameter of the modified ART 1 algorithm related to \( \alpha \) by \( \epsilon = \frac{1}{1-\alpha} \)
\( O_j \) Initial value of the signal \( T_j \), which determines the search order of category \( u_j \)
\( J_{i}^{+} \) Sum of all excitatory signals arriving at node \( v_i \)
\( J_{i}^{-} \) Sum of all inhibitory inputs acting on node \( v_i \)
\( J_j^{+} \) Sum of all excitatory signals arriving at node \( u_j \)
$J_j^{-}$ Sum of all inhibitory inputs to node $u_j$
$s_{ij}$ Synapse in column $i$ and row $j$ of the circuit that implements the modified ART 1 algorithm
$N_j$ Node where the current sum $L_A \sum_i I_i z_{ij} - L_B \sum_i z_{ij}$ of row $j$ is formed
$N_{j'}$ Node where the sum $L_A \sum_i I_i z_{ij}$ corresponding to row $j$ is computed
$N_{j''}$ Node where the current sum $L_A \sum_i I_i$ is formed
$T_{Aj}$ Value of the term $\sum_i I_i z_{ij}$ of row $j$
$T_{Bj}$ Value of the term $\sum_i z_{ij}$ of row $j$
$I_p$ Global output current of the current-mode Winner-Take-All circuit
$I_{oij}$ Output current of cell $j$ of the current-mode WTA
$\alpha_j$ Parameter representing the maximum contribution of cell $j$ to the global output in a current-mode WTA
$C_C$ Equivalent capacitance at the input of the current comparator of a WTA cell
$G_C$ Equivalent conductance at the input of the current comparator of a WTA cell
$A$ Voltage gain of the current comparator of a WTA cell
$v_{x_{ij}}$ Voltage at the input of the current comparator of a WTA cell
$v_{M}$ Threshold voltage of the current comparator
$C_A$ Compensation capacitance introduced in each WTA cell
$C_p$ Equivalent capacitance at the gate of the PMOS mirror of the current-mode WTA
$C_g$ Parasitic gate-drain capacitance of the transistor that acts as a switch in each WTA cell
$g_{mn}$ Transconductance of the transistor that acts as a switch in a WTA cell
$g_{mp}$ Transconductance of the input transistor of the PMOS mirror of the WTA
$g_n$ Output conductance of the output transistors of the NMOS mirror of each WTA cell
$P$ Electrical parameter of a transistor
$\Delta P$ Deviation of parameter $P$ between two transistors
$P(x, y)$ Density function of the electrical parameter $P$
$G(x, y)$ Geometry function of a transistor configuration
$D_w$ Wafer diameter
$\Omega$ Wafer area
$A_p$ Statistical parameter that models the contribution of random error to the deviation of parameter $P$
$S_p$ Statistical parameter that models the contribution of systematic error to the deviation of the electrical parameter $P$
$r(P_1, P_2)$ Correlation between parameters $P_1$ and $P_2$
$\sigma (P_1, P_2)$ Standard deviation of the values of $r(P_1, P_2)$ measured across the different chips
$m_{max}$ Number of times a cell is repeated in the $x$ direction on the mismatch characterization chip
$n_{max}$ Number of times a cell is repeated in the $y$ direction on the mismatch characterization chip
$I_o^s (x, y)$ Surface defined by the simulated output currents of a multiple-output mirror
$I^o(x, y)$ Surface defined by the best-fit plane of the output currents of a multiple-output mirror

$I^m(x, y)$ Surface defined by the measured output currents of a multiple-output mirror
$I_{maxplane}$ Maximum of the best-fit plane of the output currents of a multiple-output mirror
$I_{minplane}$ Minimum of the best-fit plane of the output currents of a multiple-output mirror
Inter-ART Module that interconnects the ART 1 modules in an ARTMAP architecture
$ART~1^a$ ART 1 module whose vigilance parameter is controlled by the inter-ART module in an ARTMAP architecture
$ART~1^b$ ART 1 module of an ARTMAP architecture whose vigilance parameter is set externally
$a$ Input vector to module $ART~1^a$
$b$ Input vector to module $ART~1^b$
$F^{ab}$ Layer of the inter-ART module
$u_k^{ab}$ Node $k$ of layer $F^{ab}$
$y_k^{ab}$ Activity of node $u_k^{ab}$
$w_{jk}$ Weight connecting node $u_k^{ab}$ of layer $F^{ab}$ to node $u_j^a$ of layer $F_2^a$

Neural Categorizers in VLSI

In recent years the field of so-called “Neural Networks” has developed very rapidly. This growth is due to the fact that systems based on their principles can perform tasks such as speech or image recognition, pattern classification, associative memory, etc. [1], tasks that human beings carry out naturally and with ease but that are tremendously complicated for conventional computers. Neural networks are already being used for tasks that traditionally would have been considered unthinkable for machines [2].

The literature contains a variety of neural network proposals, with very diverse learning and training mechanisms [3]. All of them have been realized as software programs that simulate, on a conventional computer, the equations defining the behavior of the networks. This technique, although sufficient to validate the theoretical models, turns out to be excessively slow in practice because of the large number of computations required, a consequence of the massive parallelism of neural networks. Practical applications require specialized hardware able to respond within adequate time margins.

Hardware realizations of neural networks can be divided into two types:

- Hardware realizations of “general-purpose” systems. These systems are designed to allow great flexibility in their topology and in the neural network operations they can perform. They are very useful to designers of neural network algorithms, since they execute the operations of the algorithm much faster than conventional computers.

- Systems designed for specific applications. This requires first selecting the neural network algorithm that best suits the purpose of the application and allows the most efficient hardware realization.

This doctoral thesis is aimed at implementing a real-time categorizer of binary patterns. For our application we have chosen the ART 1 algorithm for categorizing binary patterns, because of its attractive hardware-oriented properties and its computational properties. To obtain a more efficient circuit realization we have slightly modified the mathematical ART 1 algorithm. Our modification allows a hardware realization with simpler circuit blocks while preserving all the computational properties of the original ART 1 algorithm.

The dissertation is structured as follows. It consists of a summary of the whole work, written in Spanish in the original, which refers to five Appendices, written in English, that develop the topics covered in more detail.

Section 1 of this summary explains the operation of a categorizing system and the main characteristics of the ART algorithms. Appendix 1, whose content is summarized in Section 2 of this summary, describes the modifications of the mathematical ART 1 algorithm that allow a more efficient hardware realization. Section 3 introduces the circuit that realizes the modified version of the ART 1 algorithm. The implementation details and the experimental results are included in Appendix 2. Appendix 3 explains the realization of a current-mode “Winner-Take-All” circuit that implements the category selection process of the ART 1 algorithm. Appendix 4 explains a procedure for characterizing the mismatch of a CMOS process and includes the characterization results for two technologies: ES2-1.0μm and CNM-2.5μm. Based on the characterization results of Appendix 4 we have designed a new prototype of the ART 1 chip that maintains the operating precision with a smaller area. The experimental results for this new prototype are detailed in Appendix 5. Appendix 5, which expands on Section 4 of this chapter, also includes experimental results for multichip systems built with this new prototype. Finally, Section 5 summarizes the contributions of this thesis.

1. Categorizing Systems

A categorizing system is one that receives a series of input patterns and groups them according to some similarity criterion.

The operation of a categorizing system is illustrated in Fig. 1. The system receives as external stimuli a sequence of input patterns \( I, J, K \ldots \) and groups them by similarity into categories 1, 2, ... The categories do not exist a priori; they are formed, and their characteristics defined, as the external patterns or stimuli are received. This contrasts with so-called “classifying systems”, in which the categories exist a priori (their number and characteristics are predefined) and whose operation reduces to deciding which category is most appropriate for each incoming external pattern. The categorizer of Fig. 1 considers patterns \( I \) and \( J \) similar and classifies them in category 1, whereas it considers pattern \( K \) different and classifies it in another category.

Some categorizing systems must be trained “off-line” to build the categories from a set of input patterns [4]-[9]. In these systems, the number and characteristics of the categories are computed by some algorithm before the system operates. Once training is complete, the system is ready to operate on the set of patterns for which it was trained. From that moment on it acts as a “classifying system”. However, if a new input pattern is to be incorporated, the system must be retrained (erasing the existing knowledge) with the new set of input patterns, formed by the existing patterns plus the new ones.

![Fig. 1: Diagram of the Operation of a Categorizing System](image-url)

In contrast, a categorizing system is said to operate in “real time” when it can form and adapt its internal categories as the input patterns are presented. Each time an input pattern is presented, the system searches for the category most similar to it. The system then updates the definition of the selected category to incorporate the relevant features of the new pattern. If none of the existing categories is sufficiently similar to the presented input pattern, a new category is formed to classify it.

Architectures based on the Adaptive Resonance Theory (ART) are real-time categorizing systems [10]-[15]. These architectures include:

- ART 1 [10]: a real-time categorizer for patterns made up of binary “pixels”.
- ART 3 [13]: classifies in real time sequences of analog, asynchronous input patterns (in the other architectures the presentation of input patterns must be synchronized).

ART architectures have a number of interesting properties that other categorizing algorithms lack. The most notable are:

- **Variable vigilance**: an adjustable parameter $\rho$, known as the vigilance parameter, sets the degree of similarity that input patterns must have to be classified in the same category. A vigilance parameter $\rho$ close to ‘0’ makes the vigilance criterion loose, so the system classifies fairly different input patterns into the same category. Conversely, a vigilance $\rho$ close to ‘1’ produces finer categories, and the number of categories formed is larger.
- **Self-adjusting search order**: the search order is not predetermined; it changes as the knowledge of the system grows.
- **Direct access to familiar input patterns**: the system always accesses directly the category corresponding to an input pattern it has perfectly learned, regardless of the number and complexity of the categories stored in the system.
- **Self-stabilization**: in response to an arbitrary sequence of input patterns, learning always stabilizes after a finite number of presentations of the input patterns.

2. The ART 1 Architecture

The most basic architecture based on the Adaptive Resonance Theory is ART 1, shown in Fig. 2. The ART 1 system comprises two subsystems: the attentional subsystem and the orienting subsystem. The attentional subsystem consists of two layers: layer $F_1$, or input layer, and layer $F_2$, or category layer. Layer $F_1$ consists of $N$ cells $(v_1, v_2, ..., v_N)$ whose activities are described by the vector $(x_1, x_2, ..., x_N)$. The category layer $F_2$ consists of $M$ cells $(u_1, u_2, ..., u_M)$, and its activation vector is $(y_1, y_2, ..., y_M)$. Each node $v_i$ of layer $F_1$ is connected to each node $u_j$ of layer $F_2$ through a weight $z_{ij}^{bu}$. Likewise, each node $u_j$ of $F_2$ is connected to each node $v_i$ of layer $F_1$ through a weight $z_{ji}^{td}$. The nodes of layer $F_2$ are in turn interconnected among themselves: each node of $F_2$ has positive feedback onto itself and acts negatively on the remaining nodes of $F_2$.

The evolution of an ART 1 system is described by two sets of differential equations: the STM or “short-term memory” equations, and the LTM or “long-term memory” equations. The STM equations describe the evolution of the activities of the nodes of $F_1$ and $F_2$ as a function of the interactions among them. The LTM equations describe the evolution of the interconnection weights. The STM equations evolve with a much smaller time constant than the LTM equations. It is therefore possible to assume that the STM equations reach their steady state instantaneously and to consider only the dynamics of the LTM equations. Carpenter and Grossberg also introduced the “fast learning” mode [10]. In this mode, the LTM equations are assumed to reach their steady state as well on each presentation of an input pattern.

Depending on the approximation made, three levels of implementation of the ART 1 algorithm can be distinguished:

Type-1  Full Model Implementation: the STM and LTM differential equations are implemented directly.

Type-2  STM Steady-State Implementation: the STM equations are assumed to reach the steady state while the input pattern \( I \) remains constant. The STM differential equations are therefore replaced by the algebraic equations of the steady state, and only the LTM differential equations are implemented directly.

Type-3  Fast-Learning Implementation: both the STM and the LTM differential equations are replaced by the algebraic equations of the steady state. In this case, the sequence of STM and LTM states has to be implemented artificially.

![Diagram of an ART 1 Architecture](image)

*Fig. 2: Diagram of an ART 1 Architecture*

The complete description of the Type-1 ART 1 algorithm is developed in Appendix 1. Here we describe only the steady-state equations of the ART 1 algorithm, that is, the Type-3 implementation. The corresponding flow diagram is shown in Fig. 3(a).

All weights \( z_{ji}^{td} \) are initialized to ‘1’, while the weights \( z_{ij}^{bu} \) initially take the value \( \frac{L}{L - 1 + N} \).

As Fig. 2 shows, each node \( v_i \) of layer \( F_1 \) receives one component \( I_i \) of the input vector. The input \( T_j \) received by each node \( u_j \) of layer \( F_2 \) is the inner product of the input vector \( I \) with the weights \( z_{ij}^{bu} \) that connect the nodes \( v_i \) of \( F_1 \) to node \( u_j \) of \( F_2 \). That is,

\[
T_j = \sum_{i=1}^{N} z_{ij}^{bu} I_i
\] (1)

Layer \( F_2 \) acts as a “Winner-Take-All” circuit: the positive feedback at each node and the lateral inhibition among nodes make the activation of each node tend to reinforce itself and to inhibit the activation of the other nodes. In the steady state, only the node \( u_j \) that receives the maximum input \( T_j \) remains active,

\[
y_j = \begin{cases} 
1 & \text{if } T_j = \max_k \{ T_k \} \\
0 & \text{otherwise}
\end{cases}
\] (2)

Let \( u_j \) be the category of layer \( F_2 \) that has received the maximum input \( T_j \). Once node \( u_j \) of \( F_2 \) has become active, the weight pattern \( z_{ji}^{td} \) is activated, modifying the activation \( X \) of the nodes of layer \( F_1 \). The activation \( x_i \) of each node \( v_i \) is given by

\[
x_i = I_i \sum_{j=1}^{M} z_{ji}^{td} y_j = I_i z_{ji}^{td},
\] (3)

or, in vector notation,

\[ X = I \cap z_{j}^{td}, \] (4)

where \( z_{j}^{td} = (z_{j1}^{td}, z_{j2}^{td}, ..., z_{jN}^{td}) \), \( X = (x_1, x_2, ..., x_N) \) and \( I = (I_1, I_2, ..., I_N) \).

The orienting subsystem compares the activation pattern \( X \) of the nodes of layer \( F_1 \) with the original input pattern \( I \), according to a vigilance criterion. The vigilance criterion is controlled by the vigilance parameter \( \rho \). Two cases can occur after the comparison:

a) If \( \rho |I| > |I \cap z_{j}^{td}| \), category \( u_j \) is not valid for the vigilance level imposed by \( \rho \). In this case \( u_j \) is deactivated by making \( T_j = 0 \), so that another category \( u_j \) becomes active through the action of the "Winner-Take-All" circuit.¹

b) If \( \rho |I| \leq |I \cap z_{j}^{td}| \), category \( u_j \) is accepted and its weights are adapted to incorporate the new knowledge.

The weight update law is described by the following algebraic equations:

\[ z_{ij}^{bu} (new) = \frac{L x_i}{L-1 + |X|} = \frac{L I_i z_{ji}^{td} (old)}{L-1 + |I \cap z_{j}^{td} (old)|} \] (5)

\[ z_{ji}^{td} (new) = x_i = I_i z_{ji}^{td} (old) \] (6)

or, in vector notation,

\[ z_{j}^{bu} (new) = \frac{L \, (I \cap z_{j}^{td} (old))}{L-1 + |I \cap z_{j}^{td} (old)|} \] (7)

\[ z_{j}^{td} (new) = I \cap z_{j}^{td} (old). \] (8)

Fig. 3(a) shows the complete algorithm for the operation of the Type-3 ART 1 architecture.
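
The Type-3 steps described by equations (1) to (8) can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in the thesis: the function name `art1_fast_learning`, the use of NumPy, the default `L = 2`, and the tie-breaking rule (the lowest index wins) are assumptions made here, and the reset of a rejected category is modeled with a boolean mask rather than by literally forcing \( T_j = 0 \).

```python
import numpy as np

def art1_fast_learning(patterns, M, rho, L=2.0):
    """Illustrative Type-3 (fast-learning) ART 1 loop; names are hypothetical."""
    patterns = np.asarray(patterns, dtype=float)
    N = patterns.shape[1]
    z_td = np.ones((M, N))                      # top-down weights start at 1
    z_bu = np.full((M, N), L / (L - 1.0 + N))   # bottom-up initial value
    labels = []
    for I in patterns:
        T = z_bu @ I                            # equation (1)
        active = np.ones(M, dtype=bool)         # categories not yet reset
        winner = -1                             # -1 means "no category accepted"
        while active.any():
            # Winner-Take-All over the categories still active, equation (2)
            j = int(np.argmax(np.where(active, T, -np.inf)))
            X = I * z_td[j]                     # equations (3)-(4): X = I ∩ z_j^td
            if rho * I.sum() <= X.sum():        # vigilance test, case b)
                z_bu[j] = L * X / (L - 1.0 + X.sum())   # equations (5)/(7)
                z_td[j] = X                              # equations (6)/(8)
                winner = j
                break
            active[j] = False                   # case a): reset and search again
        labels.append(winner)
    return labels, z_td, z_bu

# Two identical patterns and a disjoint one, with three available categories.
patterns = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1]]
labels, z_td, z_bu = art1_fast_learning(patterns, M=3, rho=0.7)
print(labels)  # [0, 0, 1]
```

In this small run the first two patterns share category 0 and the third one, which has no overlap with the learned template, falls into the next uncommitted category.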

Note from equations (5) and (6) that the weights \( z_{ji}^{td} \) always take the binary values '1' or '0', whereas the weights \( z_{ij}^{bu} \) take analog values. Their minimum value is 0 (for \( x_i = 0 \)) and their maximum is 1 (for \( x_i = 1 \) and \( |X| = 1 \)). Two difficulties arise when implementing the algorithm of Fig. 3(a) as a circuit. The first is the need to implement two sets of weights: the "bottom-up" weights \( z_{ij}^{bu} \), which can take any value in the interval \([0, 1]\), and the "top-down" weights \( z_{ji}^{td} \), which are binary since they only take the values '0' or '1'. However, equations (7) and (8) show that the set of weights \( \{z_{ij}^{bu}\} \) is nothing more than a normalized version of the set \( \{z_{ji}^{td}\} \). A single set of binary weights \( \{z_{ij}\} \) can therefore be implemented, with the normalization carried out while computing the terms \( T_j \). In that case, equation (1) has to be modified so that the computation of the terms \( T_j \) takes the normalization into account:

\[ T_j = \frac{L T_{A_j}}{L - 1 + T_{B_j}} = \frac{L \sum_{i=1}^{N} z_{ij} I_i}{L - 1 + \sum_{i=1}^{N} z_{ij}} = \frac{L |z_j \cap I|}{L - 1 + |z_j|}. \] (9)

---

¹ If \( a \) is a vector with components \((a_1, a_2, ..., a_q)\), the notation \( |a| \) denotes its L₁ norm: \( |a| = \sum_{i=1}^{q} |a_i| \)

The result of this small modification is the algorithm shown in Fig. 3(b). At system level, the algorithm of Fig. 3(a) operates identically to that of Fig. 3(b). The latter, however, does not require analog weights to be physically implemented.
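
A small numerical check may help to see why the two formulations give the same choice values. The sketch below assumes, as equations (7) and (8) state, that the bottom-up weights are the normalized version of the binary weights; the function names, the example weight matrix and the value `L = 2` are hypothetical and chosen only for illustration.

```python
import numpy as np

L = 2.0  # hypothetical value of the ART 1 parameter L

def choice_with_analog_weights(I, z_bu):
    """Equation (1): T_j computed from stored analog bottom-up weights."""
    return z_bu @ I

def choice_with_binary_weights(I, z, L):
    """Equation (9): T_j computed from binary weights, normalizing on the fly."""
    T_A = z @ I              # |z_j ∩ I| for every category j
    T_B = z.sum(axis=1)      # |z_j| for every category j
    return L * T_A / (L - 1.0 + T_B)

# One row of binary weights per category (illustrative values).
z = np.array([[1, 1, 0, 0],
              [1, 1, 1, 1],
              [0, 0, 1, 0]], dtype=float)
I = np.array([1, 0, 1, 0], dtype=float)

# Analog bottom-up weights obtained by normalizing the binary ones, equation (7).
z_bu = L * z / (L - 1.0 + z.sum(axis=1, keepdims=True))

T_analog = choice_with_analog_weights(I, z_bu)
T_binary = choice_with_binary_weights(I, z, L)
print(np.allclose(T_analog, T_binary))  # True
```

Note that an uncommitted row (all weights equal to 1) reproduces the initial value \( L / (L - 1 + N) \) of the bottom-up weights, so the equivalence also covers the initial state.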

Fig. 3: Type-3 Implementation of the Algorithm of the ART 1 Architecture: (a) original ART 1, (b) ART 1 with a single matrix of binary weights, and (c) ART 1 modified for VLSI implementation
Fig. 4: Block Diagram of the Circuit that Implements the Modified ART 1 Algorithm

Unfortunately, a new difficulty now arises: implementing the division needed to compute the terms \( T_j \). Division is very costly to realize with circuits, whether analog or digital. This led us to replace the division in the computation of the terms \( T_j \) by a subtraction. Subtraction is easily realized with circuits by simply applying Kirchhoff's current law at a circuit node. The computation of the terms \( T_j \) then reduces to

\[ T_j = L_A T_{A_j} - L_B T_{B_j} + L_M \] (10)

where \( L_A \) and \( L_B \) are always-positive parameters that play the role of the parameters \( L \) and \( L - 1 \) of the original algorithm. \( L_M \) is also a positive parameter, introduced to ensure that all the terms \( T_j \) are always positive.

The algorithm resulting from this further modification is shown in Fig. 3(c). Appendix 1 compares the system-level operation of the original algorithm with that of the modified one, and shows that the modified algorithm preserves all the properties of the original. The modified algorithm, however, has much more attractive properties from the point of view of circuit implementation.
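
To make the difference between the divisive choice function of equation (9) and the subtractive one of equation (10) concrete, the following sketch evaluates both on a toy example. The function names and the numerical values of `L`, `L_A`, `L_B` and `L_M` are assumptions made for illustration. In particular, `L_M = L_B * N + 1` is just one sufficient choice for keeping every \( T_j \) positive, since the most negative contribution in equation (10) is \( -L_B |z_j| \ge -L_B N \); it is not taken from the thesis.

```python
import numpy as np

def choice_original(I, z, L):
    """Equation (9): divisive choice function of ART 1 with binary weights."""
    return L * (z @ I) / (L - 1.0 + z.sum(axis=1))

def choice_modified(I, z, L_A, L_B, L_M):
    """Equation (10): subtractive choice function of the modified algorithm."""
    T_A = z @ I             # |I ∩ z_j|
    T_B = z.sum(axis=1)     # |z_j|
    return L_A * T_A - L_B * T_B + L_M

# Category 0 has learned the pattern exactly; category 1 is still uncommitted.
z = np.array([[1, 1, 0, 0],
              [1, 1, 1, 1]])
I = np.array([1, 1, 0, 0])
N = I.size

L_A, L_B = 2.0, 1.0         # hypothetical values, L_A > L_B > 0
L_M = L_B * N + 1.0         # assumption: offset large enough to keep T_j > 0

T_div = choice_original(I, z, L=2.0)
T_sub = choice_modified(I, z, L_A, L_B, L_M)
print(T_div, int(np.argmax(T_div)))
print(T_sub, int(np.argmax(T_sub)))
```

In this example both choice functions select category 0, the one that has learned the pattern exactly. The sketch does not show that the two always agree; the comparison of the two models is the subject of Appendix 1.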

3. Circuit Implementation of the ART 1 Algorithm

Fig. 4 shows a block diagram of a circuit that implements the algorithm of Fig. 3(c). The circuit consists of an array of \( 18 \times 100 \) synapses \( S_{11}, ..., S_{18,100} \), a vector of 100 input cells \( C_1, ..., C_{100} \), two \( 1 \times 18 \) arrays of unity-gain current mirrors \( CMA_1, ..., CMA_{18}, CMB_1, ..., CMB_{18} \), a \( 1 \times 18 \) array of current comparators \( CC_1, ..., CC_{18} \), an 18-input WTA circuit, two unity-gain current mirrors with 18 outputs each, \( CMM \) and \( CMC \), and a current mirror of adjustable gain \( \rho \).

The diagrams of each synapse \( S_{ij} \) and of the input cells \( C_i \) are detailed in Fig. 5(a) and Fig. 5(b), respectively. Each synapse \( S_{ij} \) contains a memory cell that stores the value of the weight \( z_{ij} \), two current sources \( L_A \), one current source \( L_B \), and additional circuitry to control the output currents and the learning of the weights. The "RESET" and "LEARN" signals are control signals shared by all the synapses of the circuit. Each synapse generates two output currents:

- A current of value \( L_A I_i z_{ij} - L_B z_{ij} \), which enters mirror \( CMA_j \).
- A current \( L_A I_i z_{ij} \), which enters mirror \( CMB_j \).

Mirror $CMA_j$ receives all the current elements $L_A I_i z_{ij} - L_B z_{ij}$ from the synapses of row $j$, plus a current $L_M$ replicated by mirror $CMM$. The input current that $CMA_j$ delivers to the WTA is therefore

$$T_j = L_A \sum_i I_i z_{ij} - L_B \sum_i z_{ij} + L_M = L_A|I \cap z_j| - L_B|z_j| + L_M$$

Mirror $CMB_j$ receives all the current elements $L_A I_i z_{ij}$ from the synapses of row $j$,

$$L_A \sum_i I_i z_{ij} = L_A|I \cap z_j|.$$

Each input cell $C_i$ generates a current $L_A I_i$. All the currents generated by the cells $C_i$ are added at the input of the adjustable-gain mirror $\rho$. The current delivered by mirror $CMC$ to the input of the current comparators is therefore

$$\rho L_A \sum_i I_i = \rho L_A|I|.$$

Each comparator $CC_j$ receives a total input current $L_A|I \cap z_j| - \rho L_A|I|$ and compares it against '0'. If this current is positive, the WTA control signal $c_j$ is high. When $c_j$ is high, the current $T_j$ takes part in the WTA competition. Conversely, if the input current of comparator $CC_j$ is negative, the signal $c_j$ is low, which prevents the current $T_j$ from taking part in the WTA competition.

The circuit of Fig. 4 follows this operating sequence:

1.- The "RESET" signal is activated. All weights $z_{ij}$ are initialized to '1'.
2.- An input pattern $I$ is presented.
3.- The 18 rows of synapses generate the input currents to the 18 WTA cells,

$$T_j = L_A|I \cap z_j| - L_B|z_j| + L_M$$

and the 18 input currents to the comparators, $L_A|I \cap z_j|$.
4.- The row of input cells $C_i$ generates the current $L_A|I|$, which is multiplied by $\rho$ and distributed by mirror $CMC$ to the 18 comparators.

5.- Each comparator compares the current $L_A|I \cap z_j| - \rho L_A|I|$ against '0':
   - If this current is negative, the current $T_j$ is diverted and does not enter the WTA competition.
   - If the comparator input current is positive, the current $T_j$ enters the WTA competition.

6.- The WTA selects a winner, for which $y_j = 0$ is set.

7.- Once a single winner has been chosen, the "LEARN" signal is activated (LEARN=0). The weights $z_{ij}$ of the selected category are updated so that they become '0' in those synapses for which $I_i = 0$.

Appendix 2 describes the details of a prototype fabricated in the ES2-1.5µm technology and includes experimental results.
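The operating sequence above can be sketched behaviourally in a few lines of Python. This is only an illustration, not a model of the fabricated chip: the function name `art1_present`, the numerical values chosen for `L_A`, `L_B`, `L_M` and `rho`, and the tie-breaking rule (lowest index wins) are assumptions made for the example.

```python
def art1_present(I, z, L_A=2.0, L_B=1.0, L_M=10.0, rho=0.7):
    """Present binary pattern I to an array with binary weights z[j][i].

    Returns the index of the winning category (or None if no row passes
    the comparator test) and updates the winner's weights in place.
    All current values are in arbitrary, hypothetical units.
    """
    norm_I = sum(I)                                     # |I|
    best_j, best_T = None, None
    for j, row in enumerate(z):
        overlap = sum(a & b for a, b in zip(I, row))    # |I ∩ z_j|
        T_j = L_A * overlap - L_B * sum(row) + L_M      # input current to the WTA
        # Comparator CC_j: T_j competes only if L_A|I ∩ z_j| - rho*L_A|I| > 0
        if L_A * overlap - rho * L_A * norm_I > 0:
            # Ties are resolved here by keeping the lowest index (assumption)
            if best_T is None or T_j > best_T:
                best_j, best_T = j, T_j
    if best_j is not None:
        # LEARN: the winner's weights become 0 wherever I_i = 0
        z[best_j] = [a & b for a, b in zip(I, z[best_j])]
    return best_j


# RESET: three categories, four inputs, all weights at '1'
z = [[1] * 4 for _ in range(3)]
first = art1_present([1, 1, 0, 0], z)    # an uncommitted row learns the pattern
second = art1_present([0, 0, 1, 1], z)   # a different row is recruited
again = art1_present([1, 1, 0, 0], z)    # the first pattern returns to its row
```

In the sketch, a committed row that matches the input exactly wins over an uncommitted row because the term $L_B|z_j|$ penalizes rows with many weights still at '1'.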

A. Implementation of the "Winner-Take-All" Circuit

The category selection operation is performed by the circuit shown in Fig. 6, which is based on the WTA reported by Lazzaro [16]. The circuit contains an array of MB transistors, through which the input currents $T_j$ flow, and an array of MA transistors, which share the bias current $I_{BIAS}$. All MB transistors have their gates connected to node $V_{COMMON}$ and their sources grounded. The precision of this circuit depends on the matching among all MA transistors and among all MB transistors. This precision will therefore be seriously degraded if we try to distribute the circuit among different chips. One solution to this problem is to use a WTA whose operation relies only on mirroring and comparing currents. Appendix 3 describes a WTA circuit topology that operates entirely in current mode. The operation of this circuit can be distributed among different chips without significant loss of precision, provided the currents are properly mirrored from one chip to another.

Fig. 7 shows the diagram of the current-mode WTA circuit. The circuit consists of \( M \) cells interconnected through a PMOS mirror with \( M \) outputs. Each cell consists of a two-output NMOS mirror, an NMOS transistor, and a current comparator. Each cell receives two input currents, \( T_j \) and \( I_o \). \( T_j \) is the externally applied input current, which the NMOS mirror replicates twice. \( I_o \) is a current supplied by one of the outputs of the PMOS mirror. Each cell delivers an output current \( I_{oj} \). All output currents \( I_{oj} \) are summed at the input node of the PMOS mirror, and this sum \( I_o = \sum I_{oj} \) is distributed by the PMOS mirror to the input of every cell. The current comparator of each cell compares the difference \( I_o - T_j \) against '0'. If the difference is positive, the comparator output is low and the output current is \( I_{oj} = 0 \). If instead the difference \( I_o - T_j \) is negative, the comparator output is high and the NMOS transistor conducts the output current \( I_{oj} = T_j \).

The output of cell \( j \) of the WTA can therefore be expressed as

\[
I_{oj} = \begin{cases} 
T_j & \text{if } T_j \geq I_o \\
0 & \text{if } T_j < I_o 
\end{cases} \quad (14)
\]

or, more compactly,

\[
I_{oj} = T_j H(T_j - I_o) \quad (15)
\]

where \( H(x) \) is the step function, defined as

\[ H(x) = \begin{cases} 
1 & \text{if } x \geq 0 \\
0 & \text{if } x < 0 
\end{cases} \quad (16) \]

Fig. 8: Graphical Representation of the Solution of Equation (18)

On the other hand, the current \( I_o \) can be expressed as

\[ I_o = \sum_{j=1}^{M} I_{oj} \quad (17) \]

From equations (15) and (17) it follows that

\[ I_o = \sum_{j=1}^{M} T_j H(T_j - I_o) \quad (18) \]

Fig. 8 plots the functions \( f_1(I_o) = \sum_{j=1}^{M} T_j H(T_j - I_o) \) and \( f_2(I_o) = I_o \). The intersection point of the two functions is the solution of the circuit equation (18). There is a single equilibrium point, and the value of \( I_o \) at that point is

\[ I_o \big|_{eq} = \max \{ T_j \} \quad (19) \]

At the equilibrium point, the only cell that delivers a nonzero output current \( I_{oj} = T_j \) is the winning cell, whose input current \( T_j \) is the maximum. The circuit therefore performs the WTA operation.
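The fixed-point argument of equations (18) and (19) can be checked numerically. The sketch below is an idealized illustration (no mirror mismatch, no dynamics); the function name `wta_equilibria` and the sample currents are hypothetical. Because \( f_1 \) is piecewise constant and only changes at the values \( T_j \), for positive currents any solution of (18) must coincide with one of the \( T_j \), so it is enough to test those candidates.

```python
def wta_equilibria(T):
    """Return every I_o that satisfies I_o = sum_j T_j * H(T_j - I_o), eq. (18)."""
    def f1(I_o):
        # H(x) = 1 for x >= 0, so only cells with T_j >= I_o contribute
        return sum(t for t in T if t >= I_o)

    # Candidate equilibria are the distinct input currents themselves
    return [t for t in sorted(set(T)) if f1(t) == t]


currents = [3.0, 7.5, 5.0]           # hypothetical input currents T_j
solutions = wta_equilibria(currents)  # expected to contain only max(T_j)
```

With distinct inputs the only solution is the largest current, in agreement with equation (19). In this idealized model an exact tie between the two largest inputs leaves equation (18) without a solution, which is one reason a behavioural sketch like this should not be read as a description of the real circuit.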

The precision of the circuit is determined basically by the precision with which the currents are mirrored, since the offset of the current comparator is negligible. Therefore, if the currents are mirrored accurately from one chip to another, the operation can be distributed among several chips without a significant loss of precision.

Fig. 9 shows a diagram of the strategy for assembling several WTAs. Chips can be assembled simply by adding one more mirroring step in a current mirror. Note that the summing node of the currents $I_{oj}$ is shared by all the chips. The additional NMOS mirror distributes the global current $I_o$ among the inputs of the PMOS mirrors of the different chips.

The current-mode WTA has been fabricated and tested in two different technologies: the ES2-1.0µm single-poly, double-metal CMOS technology and the MIETEC-2.4µm double-metal, double-poly CMOS technology. Appendix 3 discusses the implementation details of the circuit and includes experimental results for the individual chips and for a system formed by interconnecting two chips, both when the chips are of the same technology and when each one is of a different technology. The results show that the circuits operate correctly and that precision is not significantly degraded when the system is distributed among different chips, even if they are of different technologies.

**B. Implementation of the Current Sources**

Implementing the current sources $L_A$ and $L_B$ of the synapses $S_{ij}$ of the circuit in Fig. 4, and of the input cells $C_i$, requires replicating the current $L_A$ a total of $2 \times 1800 + 100 = 3700$ times and the current $L_B$ 1800 times. Moreover, these current sources are spread over an area of approximately $1cm^2$. To maintain a precision on the order of 1%, we replicated the currents using a tree of current mirrors, as shown in Fig. 10 for the $L_B$ currents. Each stage is a multiple-output current mirror. The transistors of each mirror stage have a geometry of $W = L = 10\mu m$ and are laid out using common-centroid techniques to improve the matching between transistors.

Fig. 10: Cascade of Current Mirrors to Reduce the Mismatch Error

The measured standard deviation of the $L_A$ output currents of several synapses is below 1% for operating currents above 5μA.

To keep an adequate level of precision in the design of the current mirrors without wasting area for fear of mismatching, the mismatching of the fabrication process in use must be characterized. To simplify the design of the current sources $L_A$ and $L_B$ of the synapses, we have statistically characterized the deviations of the transistor parameters of the ES2-1.0μm CMOS process. The characterization method is valid for the statistical characterization of the mismatching of any CMOS process.

C. Characterization of the "Mismatching" of a CMOS Technology

According to Pelgrom's model [17], the variance of the difference in an electrical parameter $P$ between two transistors is given by an equation of the form

$$\sigma^2(\Delta P) = \frac{A_p^2}{WL} + S_p^2 D^2 \quad (20)$$

where $WL$ is the area of the transistors, $D$ is the distance between the two transistors, and $A_p$ and $S_p$ are characteristic parameters of the fabrication process.

Equation (20) contains two components that contribute to the deviation of a parameter: a random or noise component, which depends on the transistor area and is characterized by the parameter $A_p$, and a gradient component, which depends on the distance between the transistors and is characterized by the parameter $S_p$.
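As a quick numerical illustration of the area dependence in equation (20), the following sketch evaluates the standard deviation for two transistor sizes. It assumes the usual Pelgrom form in which the distance term is $(S_p D)^2$; the function name `pelgrom_sigma` and the values of $A_p$, $W$ and $L$ are hypothetical and are not measured data from this work.

```python
import math


def pelgrom_sigma(A_p, S_p, W, L, D):
    """Standard deviation of the parameter difference between two transistors.

    Evaluates sqrt(A_p^2 / (W * L) + (S_p * D)^2). Units must be consistent,
    for example A_p in mV*um, W and L in um, S_p in mV/um and D in um.
    """
    return math.sqrt(A_p ** 2 / (W * L) + (S_p * D) ** 2)


# Hypothetical process value A_p = 20 mV*um, distance term neglected (D = 0)
sigma_10 = pelgrom_sigma(20.0, 0.0, 10.0, 10.0, 0.0)   # W = L = 10 um
sigma_20 = pelgrom_sigma(20.0, 0.0, 20.0, 20.0, 0.0)   # W = L = 20 um
```

With the distance term neglected, quadrupling the transistor area halves the standard deviation, which is the trade-off between area and precision that motivates the characterization.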

The parameter $A_p$ is very stable from chip to chip, from run to run, from wafer to wafer, and even between different foundries that use similar fabrication processes. It can therefore be characterized with very few samples. The parameter $S_p$, in contrast, requires many samples for its characterization. Moreover, the distance-dependent error component can be eliminated with layout techniques [18], so its characterization is of less interest to the designer. We have characterized only the $A_p$ parameters.

To characterize a CMOS process we use a special-purpose chip fabricated in that process. The circuit consists of a matrix of cells. Each cell contains pairs of NMOS and PMOS transistors of different sizes. The chip also contains decoding circuitry to select a single pair of transistors at a time. Fig. 11 shows a simplified diagram of the chip and of the measurement setup. In the chip, all NMOS transistors share their drains at pin DN, all PMOS transistors share their drains at pin DP, and all NMOS and PMOS transistors share their sources at pin S. At any time, only the NMOS and PMOS transistors of the selected pair have their gates connected to pin G. The remaining transistors of the chip have their gates shorted to their source terminals. If pin DP is left unconnected and terminals DN, G, and S are accessed, the NMOS transistor of the selected pair can be characterized. Likewise, leaving pin DN unconnected and accessing terminals DP, G, and S makes it possible to characterize the PMOS transistor of the selected pair.

The active transistor pair is selected through a digital bus controlled by an external computer, which also controls a DC curve tracer (HP4145) that measures the curves used to characterize the transistors of the selected pair.

To characterize each transistor, two curves are measured in the ohmic region:

- **Curve 1:** \( V_{DS} = 0.1V \), \( V_{SB} = 0V \), \( V_{GS} = 1.5V-5.0V \)

\[ I_{DS} = \beta \frac{\left( V_{GS} - V_{T0} - \frac{1}{2} 0.1V \right) 0.1V}{1 + \theta \left( V_{GS} - V_{T0} \right)} \]

- **Curve 2:** \( V_{DS} = 0.1V \), \( V_{GS} = 3.0V \), \( V_{SB} = 0.0V-2.0V \)

\[ I_{DS} = \beta \frac{\left( 3.0V - V_{T}(V_{SB}) - \frac{1}{2} 0.1V \right) 0.1V}{1 + \theta \left( 3.0V - V_{T}(V_{SB}) \right)} \]

Fig. 11: Experimental Setup for the Automatic Characterization of the "Mismatching" between MOS Transistors

These curves are fitted, using nonlinear fitting techniques, to the level 1 HSPICE model equations of a transistor in the ohmic region,

\[ I_{DS} = \beta \frac{\left( V_{GS} - V_{T}(V_{SB}) - \frac{1}{2} V_{DS} \right) V_{DS}}{1 + \theta \left( V_{GS} - V_{T}(V_{SB}) \right)} \quad V_{DS} \leq V_{GS} - V_{T} \quad (21) \]

\[ V_{T}(V_{SB}) = V_{T0} + (\eta - 1) V_{SB} + \gamma \left( \sqrt{\phi + V_{SB}} - \sqrt{\phi} \right) \quad (22) \]

In the fitting process the parameters \( \eta \) and \( \theta \) are considered constant, so they are fitted only for the first transistor. The parameters \( \beta \), \( V_{T0} \) and \( \gamma \) are extracted for all the transistors of the chip.

For each parameter \( P \), each transistor geometry \( S \), each transistor type \( K \) (NMOS or PMOS) and each chip, we compute the standard deviation of the differences between the parameters of transistors placed consecutively along the \( x \) direction and along the \( y \) direction,

\[ \Delta x P_{SK}(n, m) = P_{SK}((n + 1) \Delta x, m \Delta y) - P_{SK}(n \Delta x, m \Delta y) \quad (23) \]

\[ \Delta y P_{SK}(n, m) = P_{SK}(n \Delta x, (m + 1) \Delta y) - P_{SK}(n \Delta x, m \Delta y) \quad (24) \]

Next, the relative standard deviation of the parameter differences is computed,

\[ \sigma_{P,SK} = \frac{\sigma^2 (\Delta x P_{SK}) + \sigma^2 (\Delta y P_{SK})}{2} \quad (25) \]

Fitting, for each chip, the deviations obtained for the different geometries to a curve of the form

\[ \sigma_{P,SK} = \frac{A_p}{\sqrt{W_{eff} L_{eff}}} \quad \left\{ \begin{array}{l} W_{eff} = W - WD \\ L_{eff} = L - LD \end{array} \right\} \quad (26) \]

yields the value of the parameter $A_p$ for each chip. Averaging over all the chips gives the value of $A_p$ for the technology. Appendix 4 shows the characterization results for the ES2-1.0μm technology and for the CNM-2.5μm technology.
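The fit to equation (26) can be illustrated with a one-parameter least-squares estimate. This is a minimal sketch under stated assumptions, not the fitting procedure actually used in this work: it treats $WD$ and $LD$ as known constants (here zero), fits a straight line through the origin in the variable $1/\sqrt{W_{eff} L_{eff}}$, and uses made-up deviations. The name `fit_A_p` is hypothetical.

```python
import math


def fit_A_p(sizes, sigmas, WD=0.0, LD=0.0):
    """Least-squares estimate of A_p in sigma = A_p / sqrt(W_eff * L_eff).

    sizes  : list of (W, L) drawn dimensions, one entry per geometry
    sigmas : deviation obtained for each geometry
    WD, LD : width and length reductions, assumed known in this sketch
    """
    # Regressor x = 1 / sqrt(W_eff * L_eff); the model is sigma = A_p * x
    x = [1.0 / math.sqrt((W - WD) * (L - LD)) for W, L in sizes]
    return sum(xi * si for xi, si in zip(x, sigmas)) / sum(xi * xi for xi in x)


# Synthetic example: deviations that follow A_p = 20 exactly
sizes = [(10.0, 10.0), (20.0, 20.0), (40.0, 40.0)]
sigmas = [2.0, 1.0, 0.5]
A_p_chip = fit_A_p(sizes, sigmas)
```

Repeating the estimate for every chip and averaging the results corresponds to the last step described above.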

Using this characterization technique we obtained the parameters $\beta(x,y)$, $V_{TO}(x,y)$ and $\gamma(x,y)$ for a matrix of $6 \times 6$ cells fabricated in the ES2-1.0μm technology, which occupies a total area of $5.5cm^2$. For each transistor position we can compute the deviation of each parameter from its mean value,

\[
\frac{\Delta \beta(x,y)}{\bar{\beta}} = \frac{\beta(x,y) - \bar{\beta}}{\bar{\beta}} \tag{27}
\]

\[
\Delta V_{TO}(x,y) = V_{TO}(x,y) - \bar{V}_{TO} \tag{28}
\]

\[
\Delta \gamma(x,y) = \gamma(x,y) - \bar{\gamma}. \tag{29}
\]

Using these deviations we can build an HSPICE input file in which the transistors whose parameters we have extracted are the output transistors of a multiple-output current mirror. In this way we obtain the surface of simulated output currents $I_o^s(x,y)$. For each of these surfaces we compute the plane $I_o^p(x,y) = Ax + By + C$ that best fits the simulated surface $I_o^s(x,y)$.

The random component of the simulated output currents $I_o^s(x,y)$ is computed as the standard deviation of the difference $\Delta I_o^s(x,y) = I_o^s(x,y) - I_o^p(x,y)$. That is,

\[
\sigma(\Delta I_o) = \sqrt{\frac{\sum_{n=1}^{n_{\text{max}}} \sum_{m=1}^{m_{\text{max}}} \left( I_o^s(n\Delta x, m\Delta y) - I_o^p(n\Delta x, m\Delta y) - \bar{I}_o \right)^2}{n_{\text{max}} \times m_{\text{max}}}} \tag{30}
\]

where $n_{\text{max}}$ and $m_{\text{max}}$ are the number of times each transistor is repeated in the $x$ and $y$ directions, and $\Delta x$ and $\Delta y$ are the distances between two transistors of consecutive cells in the $x$ and $y$ directions.

The maximum systematic deviation of the output currents is computed as

\[
\Delta I_o^p = I_{\text{maxplane}} - I_{\text{minplane}} \tag{31}
\]

where $I_{\text{maxplane}}$ and $I_{\text{minplane}}$ are the maximum and the minimum of the fitted plane.

Taking into account that about 99.7% of normally distributed random values lie within an interval of $\pm 3\sigma(\Delta I_o)$, the ratio between the random error component and the systematic error component is given by

\[
p_{\text{ran/sys}} = \frac{6 \times \sigma(\Delta I_o)}{\Delta I_o^p} \tag{32}
\]
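The separation into random and systematic components in equations (30) to (32) can be sketched as follows. The code assumes a complete rectangular grid with at least two positions in each direction, for which the least-squares slopes along $x$ and $y$ decouple once the coordinates are centred. The function name `random_vs_systematic` and the synthetic current surface are hypothetical; no data from this work are used.

```python
import math


def random_vs_systematic(I, dx=1.0, dy=1.0):
    """Split a current surface I[n][m] into a plane plus a random residue.

    Returns (sigma, delta_plane, ratio), following eqs. (30), (31) and (32).
    """
    N, M = len(I), len(I[0])
    xs = [n * dx for n in range(N)]
    ys = [m * dy for m in range(M)]
    x_mean, y_mean = sum(xs) / N, sum(ys) / M
    I_mean = sum(sum(row) for row in I) / (N * M)

    # Least-squares plane A*x + B*y + C on a full grid (slopes decouple)
    A = sum((xs[n] - x_mean) * (I[n][m] - I_mean)
            for n in range(N) for m in range(M)) / (M * sum((x - x_mean) ** 2 for x in xs))
    B = sum((ys[m] - y_mean) * (I[n][m] - I_mean)
            for n in range(N) for m in range(M)) / (N * sum((y - y_mean) ** 2 for y in ys))
    C = I_mean - A * x_mean - B * y_mean

    plane = [[A * xs[n] + B * ys[m] + C for m in range(M)] for n in range(N)]
    resid = [I[n][m] - plane[n][m] for n in range(N) for m in range(M)]
    r_mean = sum(resid) / len(resid)

    sigma = math.sqrt(sum((r - r_mean) ** 2 for r in resid) / len(resid))  # eq. (30)
    flat = [v for row in plane for v in row]
    delta_plane = max(flat) - min(flat)                                    # eq. (31)
    ratio = 6 * sigma / delta_plane if delta_plane > 0 else math.inf       # eq. (32)
    return sigma, delta_plane, ratio


# Synthetic 6 x 6 surface: a tilted plane plus a +/-0.05 checkerboard residue
surface = [[10 + 0.1 * n + 0.2 * m + (0.05 if (n + m) % 2 else -0.05)
            for m in range(6)] for n in range(6)]
sigma, delta_plane, ratio = random_vs_systematic(surface)
```

For this synthetic surface the fitted plane recovers the tilt and the checkerboard pattern is left entirely in the residue, so the ratio compares a known random amplitude with a known gradient.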

Running the simulations for transistors of geometry $W = L = 10\mu m$, measured in the array of $6 \times 6$ cells, at current levels of 10μA, we found that the systematic error component was of the same order as (and generally smaller than) the random component. Appendix 5 contains a summary of the results obtained in these simulations.

D. New ART 1 Prototype

Based on these results, we designed another ART 1 prototype containing a matrix of $50 \times 10$ synapses. The synapse matrix occupies $2.1 \text{mm}^2$. The fundamental difference with respect to the first prototype is that the current sources $L_A$ and $L_B$ of the synapses are not implemented with a tree of current mirrors. Instead, the current sources $L_A$ and $L_B$ of the different synapses are implemented directly with a multiple-output mirror. This modification reduces the area by a factor of 15 with respect to the first prototype. The experimental results for the output currents measured in all the synapses are contained in Appendix 5. For a current level of $10\mu\text{A}$, the total deviation of the output currents (due to the random and systematic components) is always below 1%. Appendix 5 also contains results on the system-level behaviour of the new prototype.

4. Multichip Implementations

Appendix 5 shows experimental results for two types of multichip implementations that use the ART 1 prototype of $50 \times 10$ synapses:

- An implementation in which two chips have been interconnected horizontally to increase the number of inputs.
- An implementation of an ARTMAP supervised-learning system that uses two ART 1 modules interconnected through an inter-ART module.

A. Modular Expansion of the System

The system can be expanded modularly both horizontally and vertically. Connecting $N$ chips horizontally increases the number of input pixels to $N \times 50$. Connecting $M$ chips vertically increases the number of categories of the system to $M \times 10$. Fig. 12 shows the diagram for the interconnection of several chips.

Vertical interconnection requires interconnecting the WTA circuits of the different chips so that a single winner is chosen.

Horizontal interconnection requires sharing the nodes $N_j$ and $N_j'$, where the currents of the synapses of row $j$ are summed. Node $N''$ must also be shared among the chips in order to sum the output currents of the cells $C_i$. When several chips are interconnected horizontally, only one of the WTAs remains active. In the remaining chips, the current-summing nodes $N_j, N_j'$ and $N''$ are disconnected from the inputs of mirrors $CMA_1, ..., CMA_{10}, CMB_1, ..., CMB_{10}$ and $CMC$. The outputs $y_j$ of the active WTA must be shared by all the synapses of row $j$ in the different chips in order to implement the learning rule.

Appendix 5 details experimental results for the horizontal interconnection of two chips.

**B. Implementation of an ARTMAP System**

An ARTMAP architecture is a system that can be trained to learn, in a supervised way, the correspondence between binary input patterns.

Fig. 13 shows a block diagram of an ARTMAP architecture. The system is composed of two ART 1 modules interconnected through an inter-ART module. The superscript $a$ denotes the elements of the ART $1^a$ system and the superscript $b$ denotes the elements of the ART $1^b$ system.

The $M_b$ nodes of the inter-ART module are connected one to one with the $M_b$ nodes of layer $F_2^b$ through a vector of bidirectional weights that have a constant value of '1'. The $M_a$ nodes of layer \( F_2^a \) are fully connected with the \( M_b \) nodes of the inter-ART module through a matrix of binary weights that are initialized to '1'.

**Fig. 13: ARTMAP Architecture**

The system can operate in two modes:

- **Training mode**: The system is presented with sequences of pairs of input patterns \( a \) and \( b \) whose correspondence it must learn.

  The ART \( 1^a \) module starts a search process in which it classifies pattern \( a \) into a category \( u_j^a \) that satisfies the vigilance criterion set by \( \rho_a \). In the same way, the ART \( 1^b \) module classifies pattern \( b \) into a category \( u_k^b \) that satisfies the vigilance criterion set by \( \rho_b \).

  If the weight \( w_{jk} \) that connects category \( u_j^a \) with node \( u_k^b \) of the inter-ART module has the value '1', the system learns the correspondence between the categories activated in \( F_2^a \) and \( F_2^b \). The weight \( w_{jk} \) corresponding to the two activated categories \( u_j^a \) and \( u_k^b \) stays at '1'. All the weights \( w_{jk} \) that link the activated category \( u_j^a \) in \( F_2^a \) with the non-activated categories of \( F_2^b \) store the value '0'.

  If, on the contrary, the weight \( w_{jk} \) that corresponds to the category \( u_j^a \) activated in \( F_2^a \) and the category \( u_k^b \) activated in \( F_2^b \) is '0', a prediction error has occurred: category \( u_j^a \) had previously learned to predict a category of \( F_2^b \) other than \( u_k^b \). In this case, the vigilance parameter of the ART \( 1^a \) system is raised to the minimum value needed to reset category \( u_j^a \). The process continues until a category of \( F_2^a \) is activated that either predicts the category \( u_k^b \) activated in \( F_2^b \) or has not previously learned to predict any category of \( F_2^b \). In either case a category \( u_j^a \) is found such that \( w_{jk} = 1 \).

  The operating sequence of the ARTMAP algorithm during training is represented in Fig. 14. There, the operation of the ART 1 modules is assumed to be described by the modified ART 1 algorithm.

- **Prediction mode**: The system is presented with a single input pattern \( a \).

  The ART \( 1^a \) module starts a search after which it classifies the input pattern into the category \( u_j^a \) whose input \( T_j^a \) is maximum and which satisfies the vigilance criterion set by \( \rho_a \).

  The category \( u_k^b \) whose corresponding weight \( w_{jk} \) is equal to '1' is the category of \( F_2^b \) predicted by the system.

  Fig. 15 shows the flow diagram of the prediction mode of the ARTMAP algorithm.

To implement the ARTMAP algorithm in circuit form, two ART 1 chips have been interconnected through an inter-ART module, as shown in Fig. 16. Fig. 17 shows the schematic of the inter-ART module. It consists of a matrix of \( M_a \times M_b \) cells \( c_{jk} \). Each cell \( c_{jk} \) receives two input signals \( y_j^a \) and \( y_k^b \), and stores the value of the corresponding weight \( w_{jk} \). The "RESET" and "LEARN" signals are common to all the cells of the circuit. The chip also includes additional circuitry to read the weight \( w_{jk} \) that corresponds to the categories \( u_j^a \) and \( u_k^b \) whose output signals \( y_j^a \) and \( y_k^b \) are active.

Activating the "RESET" (system initialization) signal makes all the weights \( w_{jk} \) store the value '1'.

Fig. 14: Flow Diagram of the ARTMAP Operation during the Training Phase

Activating the "LEARN" signal sets to '0' all the weights $w_{jk}$ of the activated category $u_j^a$, except the one for which the output $y_k^b$ of category $u_k^b$ is '1'.
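The behaviour of the inter-ART weight matrix under "RESET" and "LEARN" can be sketched behaviourally as follows. The class `InterART` and its method names are hypothetical and only mirror the description above; the sketch ignores the circuit implementation and assumes that `learn` is called only when the weight read for the active pair is '1', since a '0' indicates a prediction error and triggers a new search in ART $1^a$ instead.

```python
class InterART:
    """Binary weights w[j][k] linking category j of F2^a to category k of F2^b."""

    def __init__(self, M_a, M_b):
        # RESET: every weight stores '1'
        self.w = [[1] * M_b for _ in range(M_a)]

    def read(self, j, k):
        """Weight of the pair whose outputs y_j^a and y_k^b are active."""
        return self.w[j][k]

    def learn(self, j, k):
        """LEARN: row j keeps a '1' only in the column of the active category k."""
        self.w[j] = [1 if col == k else 0 for col in range(len(self.w[j]))]

    def predict(self, j):
        """Categories of F2^b still associated with category j of F2^a."""
        return [k for k, weight in enumerate(self.w[j]) if weight == 1]


module = InterART(M_a=3, M_b=2)
ok_before = module.read(0, 1)     # '1' after RESET, so the pair can be learned
module.learn(0, 1)                # category 0 of F2^a now predicts category 1 of F2^b
mismatch = module.read(0, 0)      # '0': this pairing would be a prediction error
```

After learning, a committed row predicts exactly one category of $F_2^b$, while rows that have never learned remain compatible with every category.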

Appendix 5 contains the experimental results on the system-level behaviour of the ARTMAP circuit, together with a more detailed explanation of its operation.

5. Conclusions

A circuit has been designed and fabricated that classifies binary input patterns in real time and in an unsupervised way. The circuit implements a modified version of the ART 1 algorithm. This new version of the ART 1 algorithm is much more efficient for hardware implementation, yet it preserves all the computational properties of the original algorithm.

Fig. 15: Flow Diagram of the Prediction Operation of the ARTMAP Algorithm

Fig. 16: Interconnection Diagram of the ARTMAP System

Fig. 17: Schematic of the Inter-ART Module

Two prototypes of the categorizer system have been fabricated, both in the ES2-1.0μm technology. The first prototype classifies patterns of 100 binary inputs into up to 18 different categories and occupies an area of 1cm². The second prototype classifies patterns of 50 binary inputs into 10 categories and occupies an area 15 times smaller than the first prototype.

Both prototypes are fully modular, that is, they can be assembled into arrays of N×M chips to form larger systems with more inputs and/or more categories. The behaviour of a system formed by interconnecting two chips of the second prototype has been tested. The chips were interconnected horizontally to increase the number of inputs.

A system based on the ARTMAP algorithm has been built that learns, in a supervised way, correspondences between pairs of binary input patterns. This system is formed by interconnecting two ART 1 modules through a small inter-ART module. The behaviour of this system has been extensively tested.

To implement the "Winner-Take-All" operation that characterizes the category selection process of the ART 1 algorithm, a WTA circuit that operates in current mode has been designed. If the currents are properly mirrored between chips, the operation of the system can be distributed among different chips without significant loss of precision.

A method has been developed for the statistical characterization of the deviations of the electrical parameters of any CMOS process. The method requires a special-purpose chip fabricated in the technology to be characterized. The chip has been designed and the measurements carried out for the characterization of two CMOS technologies: the ES2-1.0μm technology and the CNM-2.5μm technology.

6. References


Appendix 1: A VLSI-Friendly ART 1 Algorithm

1.1. The Adaptive Resonance Theory

Since 1987, Carpenter and Grossberg have published a series of papers on Adaptive Resonance Theory (ART) architectures [1.2], [1.3], [1.4], [1.5], [1.6], [1.7]. These architectures are:

- The ART 1 [1.2] architecture, which is capable of generating, in an unsupervised way, stable recognition codes in response to a series of arbitrarily ordered, arbitrarily many, and arbitrarily complex binary input patterns.
- The ART 2 [1.3] and Fuzzy-ART [1.5] architectures, which do the same but for analog input patterns.
- The ART 3 [1.4] architecture, which copes with sequences of asynchronous analog input patterns in real time.
- The ARTMAP [1.6] and Fuzzy-ARTMAP [1.7] architectures, which can be trained to learn in a supervised way correspondences between pairs of binary and analog input patterns, respectively.

The most basic module, which is the ART 1 architecture, is depicted in Fig. 1.1. It is composed of two subsystems: the “orienting subsystem” and the “attentional subsystem”. The attentional subsystem consists of two layers: the \( F_1 \) or input layer and the \( F_2 \) or category layer. The \( F_1 \) layer has \( N \) cells \( (v_1,v_2,\ldots,v_N) \) and \( F_2 \) has \( M \) cells \( (u_1,u_2,\ldots,u_M) \). Each \( F_1 \) cell \( v_i \) connects to each \( F_2 \) cell \( u_j \) through bottom-up connections \( z_{ij}^{bu} \). Each \( F_2 \) cell \( u_j \) connects to each \( F_1 \) cell through top-down connections \( z_{ji}^{td} \). Cells in the \( F_2 \) layer have interconnections among themselves, while cells in the $F_1$ layer are not interconnected among themselves. The activation (or state) of an $F_1$ cell $v_i$ is called $x_i$. The activation of layer $F_1$ is represented by the vector $\mathbf{X} = (x_1, x_2, ... , x_N)$. The activation of an $F_2$ cell $u_j$ is called $y_j$. The activation of layer $F_2$ is represented by the vector $\mathbf{Y} = (y_1, y_2, ... , y_M)$. The output of an $F_1$ cell $v_i$ is a non-linear function of its activation $h(x_i)$. The output of layer $F_1$ is represented by the vector $\mathbf{S} = (h(x_1), h(x_2), ... , h(x_N))$. The output of an $F_2$ cell $u_j$ is also a non-linear function of its activation $f(y_j)$, and the output of the $F_2$ layer is represented by the vector $\mathbf{U} = (f(y_1), f(y_2), ... , f(y_M))$.

---

This Appendix is an amplified version of paper [1.1].

The ART 1 module performs the following sequence of operations:

- An input pattern $\mathbf{I} = (I_1, I_2, ... , I_N)$ is presented to the $F_1$ layer. Each pixel $I_i$ is either 0 or 1. A pattern of activation $\mathbf{X}$ is produced across the $F_1$ nodes.
- The activation pattern $\mathbf{X}$ causes the so-called “postsynaptic pattern of gated signals” $\mathbf{S}$ to appear at the output of the $F_1$ layer.
- These postsynaptic signals are multiplied by the weight matrix $\mathbf{Z}^{bu}$ generating a pattern of input signals $\mathbf{T}$ at the input of the $F_2$ layer:

$$T_j = \sum_{i=1}^{N} z_{ij}^{bu} S_i = \sum_{i=1}^{N} z_{ij}^{bu} h(x_i). \tag{1.1}$$

- The $F_2$ layer nodes interact among themselves with lateral inhibition causing a pattern of activation $\mathbf{Y}$ which is the result of contrast-enhancing the $F_2$ input pattern $\mathbf{T}$.
- This pattern $\mathbf{Y}$ is gated, generating the $F_2$ output pattern $\mathbf{U}$, which is then multiplied by the weight matrix $\mathbf{Z}^{td}$ to produce another input to the $F_1$ layer, $\mathbf{V}$, called the “top-down template” or “learned expectations”,

$$V_i = \sum_{j=1}^{M} z_{ji}^{td} U_j = \sum_{j=1}^{M} z_{ji}^{td} f(y_j). \tag{1.2}$$

- This “top-down template” $\mathbf{V}$ affects the input layer $F_1$, causing a modification of the activation pattern across $F_1$. Let us call $\mathbf{X}^*$ the new activation pattern. The amount of modification that the original activation pattern $\mathbf{X}$ undergoes depends on the matching between the input pattern $\mathbf{I}$ and the top-down expectations $\mathbf{V}$.
- A poor matching between $\mathbf{I}$ and $\mathbf{V}$ can cause the orienting subsystem to send a reset signal to the $F_2$ layer. The matching criterion used by the orienting subsystem to decide whether or not to send a reset signal is controlled by the vigilance parameter $\rho$. Usually, $\rho$ ranges from 0 to 1. For values of $\rho$ close to 1, a good matching is demanded, while for $\rho$ values close to 0, poor matchings are acceptable. This reset signal permanently inhibits all the activated nodes across $F_2$, so that if an $F_2$ node $u_j$ is activated the reset signal removes all its activity $y_j$ as long as the same input pattern $\mathbf{I}$ is present at the system input.
- When the activated pattern $\mathbf{Y}$ is removed, the top-down template is also removed, so the original $F_1$ activation pattern $\mathbf{X}$ and the input pattern $\mathbf{T}$ are re-established. Due to the permanent $F_2$ inhibition, another activation pattern $\mathbf{Y}^*$ appears across the $F_2$ layer, and another top-down template $\mathbf{V}^*$ is read out at the $F_1$ input.

- If a poor matching between $I$ and $V^*$ is again observed, a new reset process and a new selection process take place. This search process ends when a top-down template $V$ is found which matches the input pattern $I$ to the degree of accuracy required by the vigilance parameter $\rho$.
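The search-and-reset cycle described in the list above can be sketched as a simple loop. This is a strongly simplified illustration and not the set of equations developed later in this Appendix: it assumes binary templates, a choice function of the form $T_j = L|I \cap V_j| / (L - 1 + |V_j|)$ with a hypothetical constant $L = 2$, and a match criterion $|I \cap V_j| / |I| \geq \rho$. The function name `art1_search` is likewise hypothetical.

```python
def art1_search(I, templates, rho, L=2.0):
    """Search the categories in order of decreasing T_j until one resonates.

    templates[j] is the binary top-down template of category j.
    Returns (winner, order): the accepted category (or None) and the list
    of categories visited during the search.
    """
    norm_I = sum(I)
    inhibited = set()      # categories reset while this input is present
    order = []

    def overlap(j):
        return sum(a & b for a, b in zip(I, templates[j]))

    def choice(j):
        # Assumed bottom-up input; it favours small templates contained in I
        return L * overlap(j) / (L - 1 + sum(templates[j]))

    while len(inhibited) < len(templates):
        active = [j for j in range(len(templates)) if j not in inhibited]
        j = max(active, key=choice)          # contrast enhancement picks one node
        order.append(j)
        if norm_I > 0 and overlap(j) / norm_I >= rho:
            return j, order                  # match is good enough: resonance
        inhibited.add(j)                     # poor match: reset and keep searching
    return None, order


I = [1, 1, 1, 1, 0, 0]
templates = [[1, 1, 0, 0, 0, 0],     # small subset template
             [1, 1, 1, 1, 1, 1]]     # superset template
strict = art1_search(I, templates, rho=0.8)   # first choice is reset
loose = art1_search(I, templates, rho=0.4)    # first choice is accepted
```

In the example the subset template is chosen first in both cases; with the higher vigilance its match is insufficient, the node is reset, and the search moves on to the second category.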

The modifications of the activation patterns or “short-term-memories (STM)” $X$ and $Y$ take place much faster than the updating of the weight matrixes or “long-term-memories (LTM)” $z^{bu}$ and $z^{td}$. Consequently, it can be stated that the learning process takes place when the system has established its STM state. This is often referred to as the STM being in a resonant state.

The ART 1 architecture has a collection of interesting computational properties, rarely present in other neural network systems:

- **Self-Scaling:** Features of an input pattern that are considered as irrelevant noise when embedded in a complex input pattern can be considered as critical when present in a simpler input pattern.

- **Vigilance or Variable Coarseness:** The matching criterion between the input pattern and the top-down template learned by a chosen category is adjustable, and it is determined by the vigilance parameter $\rho$. If the vigilance parameter is high, more attention will be paid to distinguish very similar input patterns, and classify them into different categories. However, if the vigilance parameter is low, there must be a significant difference between two input patterns to separate them into different categories.

- **Subset and Superset Direct Access:** The system is able to classify a new input pattern as belonging to either a subset or a superset category, depending on global similarity criteria. No restrictions on input orthogonality or linear predictability are needed.

- **Direct Access to Familiar Input Patterns:** No matter how many and how complex the learned patterns may be, the system always accesses directly the recognition code of a familiar previously learned input pattern.

- **Self-Stabilization:** In response to an arbitrary list of binary input patterns, all interconnection weights subject to learning approach limits after a finite number of learning trials. Learning is guaranteed to stabilize, and it does so for a small number of training pattern presentations.

- **Biasing the Network to form New Categories:** There is a parameter that can bias the tendency of the system to code unfamiliar patterns into new categories, independent of the vigilance parameter.

- **Self-Adjustment of the Search Order:** The search order is not predetermined. The whole system adjusts the search order as the self-organizing process evolves.

- **On-Line Learning:** The ART 1 algorithm learns as it performs, as opposed to other algorithms, which must first be trained before they can be used in an application. The ART 1 algorithm can incorporate new knowledge as it is being used. This property makes ART 1 an excellent candidate for real-time clustering.

- **Capturing Rare Events:** ART 1 is able to identify and build clusters of events that appear with a very low frequency. Even if an event corresponding to a clearly distinct cluster appears only once, ART 1 is able to detect it while building and preserving the corresponding cluster or category.

### 1.2. The ART 1 Mathematical Model

The ART 1 algorithm is a self-organizing system capable of learning stable recognition codes in response to a set of arbitrarily many and complex binary input patterns.

The model is fully described by two sets of STM and LTM differential equations and the operation of the vigilance subsystem. The STM equations describe the evolution of the activities, $x_i$ and $y_j$, in the nodes of layers $F_1$ and $F_2$. The LTM equations describe the evolution of the weights, $z_{ij}^{bu}$ and $z_{ji}^{td}$, or learning rules.

As mentioned in the previous section, the STM equations evolve with a much faster time constant than that of the LTM equations. This allows us to distinguish three different levels of ART 1 implementations:

**Type-1**  
**Full Model Implementation:** Both sets of STM and LTM time-domain differential equations are directly implemented. This implementation is the most expensive (both in hardware and in software) requiring a large amount of computational power.

**Type-2**  
**STM Steady-State Implementation:** If an input pattern $\mathbf{I}$ is held at the $F_1$ input layer until the STM equations settle to their steady state, the resulting steady state depends only on the interconnection weights, $z_{ij}^{bu}$ and $z_{ji}^{td}$, and the input pattern $\mathbf{I}$. Consequently, it is not necessary to solve the STM differential equations. The steady state can be computed by solving the corresponding algebraic equations. In this case, a proper sequencing of STM events has to be introduced artificially and only the LTM differential equations are implemented.

**Type-3**  
**Fast Learning Implementation:** If an input pattern $\mathbf{I}$ is held at the $F_1$ input layer until both STM and LTM differential equations settle to their steady state, the resulting STM and LTM steady states can be computed directly without solving any differential equation. In this case, a proper sequencing of STM and LTM events has to be done.

The full Type-1 mathematical model is described next.

### A. STM Equations

The STM equations which describe the evolution of activity $x_i$ in an $F_1$ node $v_i$ have the form,

$$\varepsilon \frac{dx_i}{dt} = -x_i + (1 - A_1 x_i) J_i^+ - (B_1 + C_1 x_i) J_i^- \tag{1.3}$$

where $A_1$, $B_1$ and $C_1$ are positive constants which assure that the $x_i$ activity always remains limited to the interval $\left[ -\frac{B_1}{C_1}, \frac{1}{A_1} \right]$. $J_i^+$ is the total excitatory input to $F_1$ node $v_i$, and $J_i^-$ is the total inhibitory input to $F_1$ node $v_i$.

Fig. 1.2 shows a detailed diagram of the interactions among the processing units in the system. The total excitatory input to an $F_1$ node $v_i$ is the sum of the component of the input pattern $I_i$ and the component $V_i$ of the top-down template $V$,

$$J_i^+ = I_i + V_i = I_i + \sum_{j=1}^{M} f(y_j) z_{ji}^{td}. \tag{1.4}$$

The total inhibitory input to this $F_1$ node is the “attentional signal”

$$J_i^- = \sum_{j=1}^{M} f(y_j) \tag{1.5}$$

where $f(y_j)$ is the output signal of the $j$-th $F_2$ node $u_j$, generated by its activity $y_j$. This attentional signal is intended to allow the system to distinguish between an input pattern $I$ and a top-down template $V$ generated by the presence of a certain amount of activation across the layer $F_2$. The attentional signal $\sum_{j=1}^{M} f(y_j)$ is greater than zero whenever an activation pattern $Y$ appears across layer $F_2$. So, if a top-down template $V$ is generated before applying any input pattern $I$, the attentional signal becomes active. This inhibition signal prevents the $F_1$ layer from becoming active without the presence of an input pattern $I$.

**Fig. 1.2**: Diagram of the interactions between the nodes in an ART 1 architecture

The same STM equation is valid to describe the evolution of the activity $y_j$ in an $F_2$ node $u_j$,

$$\varepsilon \frac{dy_j}{dt} = -y_j + (1 - A_2 y_j) J_j^+ - (B_2 + C_2 y_j) J_j^- \tag{1.6}$$

where the amount of excitatory input equals

$$J_j^+ = T_j + f(y_j) = \sum_{i=1}^{N} z_{ij}^{bu} h(x_i) + f(y_j) \tag{1.7}$$

where $h(x_i)$ is the output signal generated by activity $x_i$ of $F_1$ node $v_i$.

The inhibitory input is the sum of the lateral inhibitory actions from the remaining $F_2$ nodes, that is,

$$J_j^- = \sum_{k \neq j} f(y_k) \tag{1.8}$$

Each $F_2$ node generates an excitatory input to itself and inhibitory inputs to the remaining $F_2$ nodes. This lateral inhibition contrast-enhances the pattern $T$ present at the $F_2$ layer input. If the $F_2$ parameters are properly chosen, we can force $F_2$ to maintain activation only in the node $u_j$ which receives the largest input $T_j$. We call this case the "forced choice" situation or "Winner-Take-All" action. In particular, if $\varepsilon$ is small, after a short time,

$$f(y_j) = \begin{cases} 1 & \text{if } T_j = \max_k \{T_k\} \\ 0 & \text{otherwise} \end{cases} \tag{1.9}$$

If we denote by $u_J$ the node receiving the largest input $T_J$, the top-down expectation vector has the components

$$V_i = z_{Ji}^{td} \tag{1.10}$$

hence equalling the learned top-down template of category $u_J$.

In this case, the total excitatory input to an $F_1$ node is,

$$J_i^+ = I_i + z_{Ji}^{td} \tag{1.11}$$

and the inhibitory input $J_i^-$ always equals ‘1’ after a node $u_J$ has been activated and its top-down template has been read out.

The $F_1$ parameters are chosen so that only the nodes receiving more excitatory than inhibitory input become active, so that,

\[ x_i = \begin{cases} 
1 & \text{if both } I_i \text{ and } z_{Ji}^{td} \text{ equal 1} \\
0 & \text{otherwise}
\end{cases}, \quad (1.12) \]

which can be expressed,

\[ x_i = I_i z_{Ji}^{td}, \quad (1.13) \]

or in vector notation,

\[ X = I \cap z_J^{td}. \quad (1.14) \]

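The steady-state relations (1.9) to (1.14) can be sketched in a few lines of Python. This is only an illustrative sketch, not part of the original model description: the function names `winner_take_all` and `f1_steady_state` are hypothetical, patterns are assumed to be plain lists of 0/1 integers, and ties in the maximum are assumed to be broken in favor of the lowest index, a detail the equations do not specify.

```python
def winner_take_all(T):
    """Return f(y_j) per eq. (1.9): 1 for the node with the largest T_j, 0 elsewhere.

    Ties are broken by the lowest index (an assumption made for this sketch).
    """
    J = max(range(len(T)), key=lambda j: T[j])
    return [1 if j == J else 0 for j in range(len(T))]


def f1_steady_state(I, z_td_J):
    """Return X per eqs. (1.12)-(1.14): x_i = I_i * z_Ji^td, i.e. X = I AND z_J^td."""
    return [i_bit & z_bit for i_bit, z_bit in zip(I, z_td_J)]


T = [0.4, 0.9, 0.7]            # hypothetical F2 inputs
f_y = winner_take_all(T)       # only the node with the largest T_j stays active
I = [1, 1, 0, 1]               # binary input pattern
z_td = [1, 0, 0, 1]            # top-down template of the winning node
X = f1_steady_state(I, z_td)   # F1 activity after the template is read out
print(f_y, X)
```
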
### B. The LTM equations

The LTM equations define the learning rules describing the evolution of the top-down and bottom-up weight templates. The equations have the form,

\[ \frac{d z_{ij}^{bu}}{dt} = f(y_j) \left[ -\left( (L-1) + \sum_k h(x_k) \right) z_{ij}^{bu} + L h(x_i) \right] \quad (1.15) \]

\[ \frac{d z_{ji}^{td}}{dt} = f(y_j) \left[ -z_{ji}^{td} + h(x_i) \right] \quad (1.16) \]

From these equations, it can be seen that only the weight vectors \( z_j^{bu} \) and \( z_j^{td} \) corresponding to the activated \( F_2 \) node \( u_j \) evolve towards their new stationary values, which, in both cases, are a certain scaled version of the activation pattern across \( F_1 \),

\[ z_{ij}^{bu} \rightarrow \frac{L h(x_i)}{L - 1 + \sum_k h(x_k)} = \frac{L I_i z_{ji}^{td}}{L - 1 + |I \cap z_j^{td}|} \quad (1.17) \]

\[ z_{ji}^{td} \rightarrow h(x_i) = I_i z_{ji}^{td} \quad (1.18) \]

Before any learning has occurred in the system, the weights are initialized to positive values such that

\[ 0 < z_{ij}^{bu}(0) \leq \frac{L}{L - 1 + N} \quad (1.19) \]

\[ z_{ji}^{td}(0) = 1 \quad (1.20) \]
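
For the fast-learning case, the stationary values (1.17) and (1.18) can be evaluated directly. The following Python sketch assumes binary patterns stored as lists of 0/1 integers and takes $h(x_i) = x_i = I_i z_{ji}^{td}$ as in eq. (1.13). The name `ltm_steady_state` and the value $L = 2$ are choices made for illustration only.

```python
def ltm_steady_state(I, z_td_old, L):
    """Stationary weights of the active F2 node per eqs. (1.17) and (1.18)."""
    z_td_new = [i & z for i, z in zip(I, z_td_old)]   # eq. (1.18): I AND z^td
    norm = L - 1 + sum(z_td_new)                      # L - 1 + |I AND z^td|
    z_bu_new = [L * x / norm for x in z_td_new]       # eq. (1.17)
    return z_bu_new, z_td_new


I = [1, 1, 0, 1]
z_td_initial = [1, 1, 1, 1]    # uncommitted node, eq. (1.20)
z_bu, z_td = ltm_steady_state(I, z_td_initial, L=2.0)
print(z_bu, z_td)
```

The bottom-up vector is a scaled copy of the binary top-down vector, which is the observation later used to keep a single binary template.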

### C. The Reset Subsystem

The reset subsystem operates on the nodes of the category layer. Whenever an unacceptable mismatch between the binary input pattern \( I \) and the activation pattern \( X \) across \( F_1 \) occurs, a reset signal is generated. This signal permanently inhibits the activated \( F_2 \) node \( u_j \). The reset signal is activated each time the following condition is verified,\(^1\)

\[
|X| < \rho |I| \quad (1.21)
\]

The adjustable vigilance parameter $\rho$ controls the degree of matching required between both templates.

---

1. If $\mathbf{a}$ is a vector of components $(a_1, a_2, ..., a_q)$, the notation $|\mathbf{a}|$ means its $L_1$ norm: $|\mathbf{a}| = \sum_{i=1}^{q} |a_i|$

### 1.3. The Fast Learning ART 1 Algorithm

In the fast learning algorithm or Type-3 implementation of the ART 1 model, the assumption is made that the weights reach their stationary values during each presentation of an input pattern. The flow diagram of the Type-3 ART 1 algorithm is depicted in Fig. 1.3(a).

**Fig. 1.3: Type-3 implementation algorithm of the ART 1 architecture: (a) original ART 1, (b) ART 1 with a single binary valued weight template, and (c) VLSI-friendly ART 1$_m$**

The sequence of operation is as follows:

- Before the presentation of any input pattern, the weights are initialized to the values

\[
z_{ij}^{bu}(0) = \frac{L}{L - 1 + N}
\]
(1.22)

\[
z_{ji}^{td}(0) = 1
\]
(1.23)

- After an input pattern \( \mathbf{I} \) is read, the \( T_j \) terms are computed

\[
    T_j = \sum_{i=1}^{N} z_{ij}^{bu} I_i
\]
(1.24)

- The \( F_2 \) category \( u_J \) with the largest \( T_j \) input is selected and becomes active \( (y_J = 1) \), while the other \( F_2 \) nodes remain inactive \( (y_{j \neq J} = 0) \).

- The vigilance criterion is checked for the winning category. If \( |I \cap z_J^{td}| < \rho |I| \), the \( u_J \) category is reset and another winning category is chosen. Otherwise, the category \( u_J \) is accepted and its weights are updated to their new values given by equations (1.17) and (1.18).

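The sequence of operations listed above can be sketched as a short Python routine. It is a minimal sketch under stated assumptions rather than a reference implementation: the name `art1_present`, the list-of-lists weight layout, the lowest-index tie-breaking rule and the parameter values $L = 2$, $\rho = 0.7$ are all choices made here for illustration.

```python
def art1_present(I, z_bu, z_td, rho, L):
    """Present binary pattern I once and return the index of the accepted node.

    z_bu[j] and z_td[j] hold the weights of F2 node j and are updated in place.
    Returns None if every F2 node is reset.
    """
    norm_I = sum(I)
    # Bottom-up inputs, eq. (1.24)
    T = [sum(w * x for w, x in zip(z_bu[j], I)) for j in range(len(z_bu))]
    reset = set()
    while len(reset) < len(T):
        # Winner-Take-All among the nodes that have not been reset
        J = max((j for j in range(len(T)) if j not in reset), key=lambda j: T[j])
        X = [x & z for x, z in zip(I, z_td[J])]
        if sum(X) < rho * norm_I:            # vigilance test, eq. (1.21)
            reset.add(J)
            continue
        z_td[J] = X                          # eq. (1.18)
        norm = L - 1 + sum(X)
        z_bu[J] = [L * x / norm for x in X]  # eq. (1.17)
        return J
    return None


N, M, L, rho = 4, 3, 2.0, 0.7
z_bu = [[L / (L - 1 + N)] * N for _ in range(M)]   # eq. (1.22)
z_td = [[1] * N for _ in range(M)]                 # eq. (1.23)
patterns = [[1, 1, 0, 0], [0, 0, 1, 1], [1, 1, 0, 0], [1, 1, 1, 0]]
chosen = [art1_present(I, z_bu, z_td, rho, L) for I in patterns]
print(chosen)
```

In this run the last pattern first selects the node that learned `[1, 1, 0, 0]`, fails the vigilance test, and is then coded by an uncommitted node, which illustrates the reset and search step.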
### 1.4. The Modified ART 1 Algorithm

From a hardware implementation point of view, one of the first issues that comes into consideration is that there are two templates of weights to be built. The set of bottom-up weights \( z_{ij}^{bu} \), each of which must store a real value belonging to the interval \([0,1]\), and the set of top-down weights \( z_{ji}^{td} \), each of which stores either the value ‘0’ or ‘1’. The physical implementation of the bottom-up template memory presents the first hardware difficulty because the weights need either an analog or a digital memory with sufficient bits per weight so that the digital discretization does not affect the system performance. However, it can be seen from eqs.(1.17) and (1.18) that the bottom-up set \( \{z_{ij}^{bu}\} \) and the top-down set \( \{z_{ji}^{td}\} \) contain the same information: each of these sets can be fully computed by knowing the other set. The bottom-up set \( \{z_{ij}^{bu}\} \) is a normalized version of the top-down set \( \{z_{ji}^{td}\} \). Therefore, from a hardware implementation point of view, it would be desirable to implement physically only a binary valued set (one bit per weight) and introduce the normalization of the bottom-up weights during the computation of \( \{T_j\} \). This way, the two sets \( \{z_{ij}^{bu}\} \) and \( \{z_{ji}^{td}\} \) can be substituted by a single binary valued set \( \{z_{ij}\} \), and eq. (1.24) modified to take into account the normalization effect of the original bottom-up weights,²

\[
    T_j = \frac{L | I \cap z_j |}{L - 1 + |z_j|} = \frac{L \sum_{i=1}^{N} z_{ij} I_i}{L - 1 + \sum_{i=1}^{N} z_{ij}}
\]
(1.25)
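
That eq. (1.25) reproduces eq. (1.24) can be checked numerically, under the assumption that the bottom-up weights already hold the normalized values $z_{ij}^{bu} = L z_{ij} / (L - 1 + |z_j|)$ implied by eqs. (1.17) and (1.22). The function names and the value $L = 2$ in this Python sketch are hypothetical.

```python
def T_two_templates(I, z_bu_j):
    """T_j computed from real-valued bottom-up weights, eq. (1.24)."""
    return sum(w * x for w, x in zip(z_bu_j, I))


def T_single_template(I, z_j, L):
    """T_j computed from the binary template only, eq. (1.25)."""
    overlap = sum(x & z for x, z in zip(I, z_j))
    return L * overlap / (L - 1 + sum(z_j))


L = 2.0
I = [1, 0, 1, 1]
z_j = [1, 1, 0, 1]                                  # binary template
z_bu_j = [L * z / (L - 1 + sum(z_j)) for z in z_j]  # its normalized version
difference = abs(T_two_templates(I, z_bu_j) - T_single_template(I, z_j, L))
print(difference)
```
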

Considering this minor "implementation" modification, the algorithm of Fig. 1.3(a) would be transformed into that depicted in Fig. 1.3(b). The system level performance of the algorithms described by Fig. 1.3(a) and (b) is identical. There is no difference in the behavior between the two diagrams, and the one in Fig. 1.3(b) offers more attractive features from a hardware (as well as software) implementation point of view.

Fig. 1.4: Illustration of simplification process of the division operation: (a) original division operation, (b) piece-wise linear approximation, (c) linear approximation

---

² This type of modification is employed in the Fuzzy-ART model [1.5], which operates with analog patterns, instead of binary ones. Making Fuzzy-ART work with binary patterns results in ART 1 behavior, but using only one set of weights, similar to the system described in this Appendix.

However, in Fig. 1.3(b), an extra division operation, \( T_j = (L|I \cap z_j|) / (L - 1 + |z_j|) \), needs to be performed for each node in the \( F_2 \) layer. This is an expensive hardware operation and would probably constitute a performance bottleneck in the overall system for both analog and digital circuit implementations. If possible, it would be very desirable to avoid this division operation. That can be done by substituting this division operation by another, less expensive one, and, although this results in a system with a slightly different behavior, we will show that it preserves all the computational properties of the original ART 1 algorithm.

Fig. 1.4(a) shows the curves that represent the division operation of eq. (1.25). A first simplification could be to substitute these curves by a piece-wise linear approximation as shown in Fig. 1.4(b). Such an approximation still presents some hardware difficulties and could also limit the performance of the overall system. A more drastic simplification would be to substitute the original operation by the operation represented by the set of curves of Fig. 1.4(c). Mathematically, the division operation has been substituted by a subtraction operation

\[
T_j = L_A|I \cap z_j| - L_B|z_j| + L_M,
\]
(1.26)

where \( L_A \) and \( L_B \) are positive parameters that play the role of the original \( L \) (and \( L-1 \)) parameter. As we will see in the next Section, the condition \( L_A > L_B \) must be imposed for proper system operation. \( L_M > 0 \) is a constant parameter needed\(^4\) to ensure that \( T_j \geq 0 \) for all possible values of \(|I \cap z_j|\) and \(|z_j|\).
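
The subtraction-based choice function can be sketched in Python as follows. The name `T_art1m` and the integer parameter values are hypothetical. The choice $L_M = L_B N$ is our own conservative assumption for keeping $T_j \geq 0$ (it covers the extreme case $|I \cap z_j| = 0$, $|z_j| = N$) and is not a value given in the text.

```python
def T_art1m(I, z_j, L_A, L_B, L_M):
    """Choice function of the modified algorithm: L_A|I AND z_j| - L_B|z_j| + L_M."""
    overlap = sum(x & z for x, z in zip(I, z_j))
    return L_A * overlap - L_B * sum(z_j) + L_M


N = 4
L_A, L_B = 2, 1          # L_A > L_B, as required for proper operation
L_M = L_B * N            # assumed offset, large enough to keep T_j >= 0
I = [1, 0, 1, 1]
T_committed = T_art1m(I, [1, 1, 0, 1], L_A, L_B, L_M)
T_uncommitted = T_art1m(I, [1, 1, 1, 1], L_A, L_B, L_M)
print(T_committed, T_uncommitted)
```

Only integer multiplications by constants, one addition and one subtraction are involved, which is the simplification the text argues for.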

Replacing a division operation with a subtraction one is a very important hardware simplification with significant performance improvement potential. Fig. 1.3(c) shows the final Type-3 modified ART 1 algorithm, which we will call from now on the ART 1$_m$ algorithm. In the next sections, we will try to show that the price paid for this drastic simplification, although it yields a system with slightly different input-output behavior, is insignificant, since all the computational properties of the original ART 1 architecture are preserved.

---

3. Similar \( T_j \) functions (also called distances or choice functions) have been proposed by other authors for Fuzzy-ART. Since ART 1 can be considered a particular case of Fuzzy-ART when the input patterns are binary, Fuzzy-ART choice functions can also be used for ART 1. In section 1.8 we show how these other choice functions also yield ART 1 architectures that preserve all the original computational properties. However, the choice function presented here is computationally less expensive and is easier to implement in hardware.

4. In reality, parameter \( L_M \) has been introduced for hardware reasons [1.13], [1.14]. In a software implementation, parameter \( L_M \) can be ignored.

It is worth mentioning here that substituting a division operation by a subtraction one means a significant performance boost from a hardware implementation point of view. Physically implementing division operators in hardware significantly constrains the whole system design and imposes limitations on the overall system performance.

In the case of digital hardware, a division circuit can be built using either sequential techniques or large size higher speed special purpose circuits [1.10]. Sequential techniques use simpler hardware but are slower, while a dedicated circuit is very large compared to the former and requires much more power consumption. As an example, and for a sequential type division circuit, in order to realize the following division

$$T_j = \frac{L|I \cap z_j|}{L-1 + |z_j|},$$

(1.27)

$q$ addition/subtraction operations would be needed, where $q$ is the number of bits needed for the result of the division. If, for example, there are $N = 1000$ nodes in the $F_1$ layer, numerator and denominator in eq. (1.27) should be represented by 10-bit words. If, for a given input $I$, we want to differentiate between two terms $T_{j_1}$ and $T_{j_2}$ whose respective templates $z_{j_1}$ and $z_{j_2}$ differ in one bit, the $F_2$ layer (WTA) would need to resolve

$$|\Delta T_{j_2,j_1}|_{min} = \left| \frac{L|I \cap z_{j_1}|}{L-1 + |z_{j_1}|} - \frac{L|I \cap z_{j_2}|}{L-1 + |z_{j_2}|} \right|_{min}. \tag{1.28}$$

The worst case occurs when $|z_{j_1}| = |I \cap z_{j_1}| = N$, $|z_{j_2}| = |I \cap z_{j_2}| = N - 1$. In this case

$$|\Delta T_{j_2,j_1}|_{min} = \frac{L (L-1)}{(L-1+N) (L-2+N)}. \tag{1.29}$$

A reasonable minimum value for $L$ is 1.01. Therefore, if $N = 1000$ then $|\Delta T_{j_2,j_1}|_{min} \approx 10^{-8}$. On the other hand, it is easy to see that $|\Delta T_{j_2,j_1}|_{max}$ is close to but less than one. Consequently, for each $T_j$ a dynamic range of

$$\frac{T_{j,\text{max}}}{T_{j,\text{min}}} = 10^8 \tag{1.30}$$

is needed. Such a dynamic range requires a $q=27$ bit representation. Thus, for each division operation we need to realize 27 10-bit addition/subtractions. Furthermore, the WTA in the $F_2$ layer would need to choose the maximum among $M$ 27-bit words. On the other hand, if the ART 1$_m$ algorithm is used, instead of the $M \times 27$ 10-bit addition/subtractions, we need only to realize $M$ 11-bit subtractions, and the WTA has to choose the maximum among $M$ 11-bit words.

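The arithmetic behind eqs. (1.29) and (1.30) can be reproduced with a few lines of Python. This is a sketch for checking the figures quoted above. The function name `delta_T_min` is hypothetical, and the bit count is taken here as the smallest integer $q$ with $2^q$ at least equal to the required dynamic range, which is our reading of how the word length follows from that range.

```python
import math


def delta_T_min(L, N):
    """Smallest difference between two T_j terms to be resolved, eq. (1.29)."""
    return L * (L - 1) / ((L - 1 + N) * (L - 2 + N))


d = delta_T_min(L=1.01, N=1000)        # roughly 1e-8
dynamic_range = 1 / d                  # T_max is close to one, eq. (1.30)
bits = math.ceil(math.log2(dynamic_range))
print(d, bits)
```
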
In the case of analog hardware, there are ways to implement the division operation with compact dedicated circuits [1.9], [1.12], [1.11], [1.15], but they usually suffer from low signal-to-noise ratios, limited signal range, noticeable distortion, or require bipolar devices, which are available only in more expensive VLSI technologies. In any case, the performance of the overall ART system would be limited by the lower performance of the division operators. If the division operators are eliminated, the performance of the system would be limited by other operators which, for the same VLSI technology, render considerably better performance figures. Furthermore, in the case of analog current mode signal processing [1.14], the addition and subtraction of currents does not need any physical components. Consequently, by eliminating the need for signal division, the circuitry is dramatically simplified and its performance drastically improved.

### 1.5. Computational Equivalence of the Original and the Modified Models

Throughout the original ART 1 paper [1.2], Carpenter and Grossberg provide rigorous demonstrations of the computational properties of the ART 1 architecture. Some of these properties are concerned with Type-1 and Type-2 operations of the architecture, but most refer to the Type-3 model operation. From a functional point of view, i.e., when looking at the ART 1 system as a black box regardless of the details of its internal operations, the system level computational properties of ART 1 are fully contained in its Fast-Learning or Type-3 model. The theorems and demonstrations given by Carpenter and Grossberg [1.2] relating to Type-1 and Type-2 models of the system only ensure proper Type-3 behavior. The purpose of this Section is to demonstrate that the modified Type-3 model developed during the previous Section preserves all the Type-3 computational properties of the original ART 1 architecture. The only functional difference between ART 1 and ART 1$_m$ is the way the terms \( T_j \) are computed before competing in the Winner-Take-All block. Therefore, the original properties and demonstrations that are not affected by the terms \( T_j \) will be automatically preserved. Such properties are, for example, the Self-Scaling property and the Variable Coarseness property tuned by the Vigilance Parameter. But there are other properties which are directly affected by the way the terms \( T_j \) are computed. In the remainder of this Section we will show that these properties remain in the ART 1$_m$ architecture.

Let us define a few concepts before demonstrating that the original computational properties are preserved.

a) **Direct Access:** an input pattern \( \mathbf{I} \) is said to have Direct Access to a learned category \( u_j \) if this category is the first one selected by the Winner-Take-All \( F_2 \) layer and is accepted by the vigilance subsystem, so that no reset occurs.

b) **Subset Template:** an input pattern \( \mathbf{I} \) is said to be a Subset Template of a learned category \( z_j \equiv (z_{1j}, z_{2j}, \ldots, z_{Nj}) \) if \( \mathbf{I} \subset z_j \). Formally,

\[
\begin{aligned}
    z_{ij} = 0 &\Rightarrow I_i = 0 \quad \forall i = 1, \ldots, N, \\
    I_i = 1 &\Rightarrow z_{ij} = 1 \quad \forall i = 1, \ldots, N,
\end{aligned}
\]

and there are some values of \( i \) such that \( I_i = 0 \) and \( z_{ij} = 1 \). (1.31)

c) **Superset Template:** an input pattern \( \mathbf{I} \) is said to be a Superset Template of a learned category \( u_j \) if \( z_j \subset \mathbf{I} \).

d) **Mixed Template:** $z_j$ and $I$ are said to be mixed templates if neither $I \subset z_j$ nor $z_j \subset I$ is satisfied, and $I \neq z_j$.

e) **Uncommitted node:** an $F_2$ node $u_j$ is said to be uncommitted if all its weights $z_{ij}$ ($i = 1, \ldots, N$) preserve their initial value ($z_{ij} = 1$), i.e., node $u_j$ has not yet been selected to represent any learned category.

### A. Direct Access to Subset and Superset Patterns

Suppose that a learning process has produced a set of categories in the $F_2$ layer. Each category $u_j$ is characterized by the set of weights that connect node $u_j$ in the $F_2$ layer to all nodes in the $F_1$ layer, i.e., $z_j = (z_{1j}, z_{2j}, \ldots, z_{Nj})$. Suppose that two of these categories, $u_{j_1}$ and $u_{j_2}$, are such that $z_{j_1} \subset z_{j_2}$ ($z_{j_1}$ is a subset template of $z_{j_2}$). Now consider two input patterns $I^{(1)}$ and $I^{(2)}$ such that,

$$
\begin{aligned}
I^{(1)} &= z_{j_1} = (z_{1j_1}, z_{2j_1}, \ldots, z_{Nj_1}), \\
I^{(2)} &= z_{j_2} = (z_{1j_2}, z_{2j_2}, \ldots, z_{Nj_2}).
\end{aligned}
$$

The **Direct Access to Subset and Superset** property assures that input $I^{(1)}$ will have Direct Access to category $u_{j_1}$ and that input $I^{(2)}$ will have Direct Access to category $u_{j_2}$.

A.1: **Original ART 1:**

Let us compute the values of $T_{j_1}$ and $T_{j_2}$ when the input patterns $I^{(1)}$ and $I^{(2)}$ are presented at the input of the system. For pattern $I^{(1)}$ we will have,

$$
T_{j_1} = \frac{L \sum_{i=1}^{N} I_{i}^{(1)} z_{ij_1}}{L - 1 + |z_{j_1}|} = \frac{L|I^{(1)}|}{L - 1 + |I^{(1)}|}
$$

$$
T_{j_2} = \frac{L \sum_{i=1}^{N} I_{i}^{(1)} z_{ij_2}}{L - 1 + |z_{j_2}|} = \frac{L|I^{(1)}|}{L - 1 + |I^{(2)}|}
$$

Since $|I^{(1)}| < |I^{(2)}|$, it is obvious that $T_{j_1} > T_{j_2}$ (remember that $L > 1$) and therefore category $u_{j_1}$ will become the active one. On the other hand, if input pattern $I^{(2)}$ is presented at the input,

$$
T_{j_1} = \frac{L \sum_{i=1}^{N} I_{i}^{(2)} z_{ij_1}}{L - 1 + |z_{j_1}|} = \frac{L|I^{(1)}|}{L - 1 + |I^{(1)}|}
$$

$$
T_{j_2} = \frac{L \sum_{i=1}^{N} I_{i}^{(2)} z_{ij_2}}{L - 1 + |z_{j_2}|} = \frac{L|I^{(2)}|}{L - 1 + |I^{(2)}|}
$$

Since the function $Lx/(L-1+x)$ is an increasing function of $x$, it results that now $T_{j_2} > T_{j_1}$ and category $u_{j_2}$ will be chosen as the winner.

A.2: Modified ART 1:

If pattern $I^{(1)}$ is given as the input pattern we will have

$$T_{j_1} = L_A \sum_{i=1}^{N} I^{(1)}_i z_{ij_1} - L_B |z_{j_1}| + L_M = L_A |I^{(1)}| - L_B |I^{(1)}| + L_M$$

$$T_{j_2} = L_A \sum_{i=1}^{N} I^{(1)}_i z_{ij_2} - L_B |z_{j_2}| + L_M = L_A |I^{(1)}| - L_B |I^{(2)}| + L_M$$

Since $|I^{(1)}| < |I^{(2)}|$, it follows that $T_{j_1} > T_{j_2}$ (remember that $L_B > 0$). If pattern $I^{(2)}$ is presented at the input of the network, we have

$$T_{j_1} = L_A \sum_{i=1}^{N} I^{(2)}_i z_{ij_1} - L_B |z_{j_1}| + L_M = L_A |I^{(1)}| - L_B |I^{(1)}| + L_M$$

$$T_{j_2} = L_A \sum_{i=1}^{N} I^{(2)}_i z_{ij_2} - L_B |z_{j_2}| + L_M = L_A |I^{(2)}| - L_B |I^{(2)}| + L_M$$

In order to guarantee that $T_{j_2} > T_{j_1}$ the condition

$$L_A > L_B$$

has to be assured.
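
The subset and superset argument can be checked on a small example. The Python sketch below is illustrative only: the function names, the two 4-bit templates and the parameter values are hypothetical choices. The last case uses $L_A = L_B$ to show why the condition $L_A > L_B$ is needed.

```python
def T_art1(I, z, L):
    """Division-based choice function, eq. (1.25)."""
    overlap = sum(x & w for x, w in zip(I, z))
    return L * overlap / (L - 1 + sum(z))


def T_art1m(I, z, L_A, L_B, L_M):
    """Subtraction-based choice function of the modified algorithm."""
    overlap = sum(x & w for x, w in zip(I, z))
    return L_A * overlap - L_B * sum(z) + L_M


z1 = [1, 1, 0, 0]   # subset template
z2 = [1, 1, 1, 0]   # superset template


def direct_access(choice):
    """Return (I = z1 selects z1 first, I = z2 selects z2 first)."""
    subset_ok = choice(z1, z1) > choice(z1, z2)
    superset_ok = choice(z2, z2) > choice(z2, z1)
    return subset_ok, superset_ok


original = direct_access(lambda I, z: T_art1(I, z, L=2.0))
modified = direct_access(lambda I, z: T_art1m(I, z, L_A=2, L_B=1, L_M=4))
violating = direct_access(lambda I, z: T_art1m(I, z, L_A=1, L_B=1, L_M=4))
print(original, modified, violating)
```

With $L_A = L_B$ the two terms tie when the superset pattern is presented, so direct access to the superset category is no longer guaranteed.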

### B. Direct Access by Perfectly Learned Patterns (Theorem 1 of original ART 1):

This theorem, adapted to a Type-3 implementation, states the following:

An input pattern $I$ has direct access to a node $u_J$ which has perfectly learned the input pattern $I$.

B.1: Original ART 1:

In order to prove that $I$ has direct access to $u_J$, we need to demonstrate that the following properties hold:
(i) $u_J$ is the first node to be chosen, (ii) $u_J$ is accepted by the vigilance subsystem and (iii) $u_J$ remains active as learning takes place.

To prove property (i) we have to show that, at the start of each trial, $T_J > T_j \ \forall j \neq J$. Since $I = z_J$,

$$T_J = \frac{L |I|}{L - 1 + |I|}$$

and

\[ T_j = \frac{L|I \cap z_j|}{L-1 + |z_j|}. \]  

(1.39)

Since \( \frac{Lw}{L-1 + w} \) is an increasing function of \( w \) (because \( L > 1 \)) and \( |I| > |I \cap z_j| \), we can state,

\[ T_J = \frac{L|I|}{L-1 + |I|} > \frac{L|I \cap z_j|}{L-1 + |I \cap z_j|} > \frac{L|I \cap z_j|}{L-1 + |z_j|} = T_j. \]  

(1.40)

So, property (i) is always fulfilled.

Property (ii) is directly verified since \( |I \cap z_J| = |I| \geq \rho |I| \ \forall \rho \in [0, 1] \).

Property (iii) is always verified because after node \( u_J \) is selected as the winning category, its weight template \( z_J \) will remain unchanged (because \( z_J|_{\text{new}} = I \cap z_J|_{\text{old}} = I = z_J|_{\text{old}} \)), and consequently the inputs \( T_j \) to the \( F_2 \) layer will remain unchanged.

B.2: Modified ART 1:

In order to demonstrate that \( I \) has direct access to \( u_J \), we only have to prove that property (i) is verified for the modified algorithm, as the proof of properties (ii) and (iii) is identical to the case of the original algorithm.

To prove property (i), we have to demonstrate that

\[ T_J = L_A|I| - L_B|I| + L_M > L_A|I \cap z_j| - L_B|z_j| + L_M = T_j. \]  

(1.41)

Since \( L_Aw - L_Bw + L_M \) is an increasing function of \( w \) (because \( L_A > L_B \)), and \( |z_j| > |I \cap z_j| \),

\[ T_J = L_A|I| - L_B|I| + L_M > L_A|I \cap z_j| - L_B|I \cap z_j| + L_M > L_A|I \cap z_j| - L_B|z_j| + L_M = T_j \]  

(1.42)

### C. Stable Choices in STM (Theorem 2 of original ART 1):

Whenever an input pattern \( I \) is presented for the first time to the ART 1 system, a set of \( \{T_j\} \) values is formed that compete in the Winner-Take-All \( F_2 \) layer. The winner may be reset by the vigilance subsystem, and a new winner appears that may also be reset, and so on until a final winner is accepted. During this search process, the \( T_j \) values that led to earlier winners are set to zero. Let us call \( O_j \) the values of \( T_j \) at the beginning of the search process, i.e., before any of them is set to zero by the vigilance subsystem. Theorem 2 of the original ART 1 architecture states:

Suppose that an \( F_2 \) node \( u_J \) is chosen for STM storage instead of another node \( u_j \) because \( O_J > O_j \). Then read-out of the top-down template preserves the inequality \( T_J > T_j \) and thus confirms the choice of \( u_J \) by the bottom-up filter.

This theorem only makes sense for a Type-1 implementation, because there, as a node in the \( F_2 \) layer activates, the initial values of \( T_j \) (immediately after presenting an input pattern \( I \)) may be altered through the top-down "feed-back" connections. In a Type-3 description (see Fig. 1.3) the initial terms \( T_j \) remain unchanged, independently of what happens in the \( F_2 \) layer. Therefore, this theorem is implicitly satisfied.

### D. Initial Filter Values determine Search Order (Theorem 3 of original ART 1):

Theorem 3 of the original ART 1 architecture states that (page 92 of [1.2]):

The Order Function \( (O_{j_1} > O_{j_2} > O_{j_3} > ...) \) determines the order of search no matter how many times \( F_2 \) is reset during a trial.

The proof is the same for the original ART 1 and the modified ART 1 (both Type-3) implementations\(^5\). If \( T_{j_1} \) is reset by the vigilance subsystem, the values of \( T_{j_2}, T_{j_3}, \ldots \) will not change. Therefore, the new order sequence is \( O_{j_2} > O_{j_3} > ... \) and the original second largest value \( O_{j_2} \) will be selected as the winner. If \( T_{j_2} \) is now set to zero, \( O_{j_3} \) is the next winner, and so on.

This Theorem, although trivial in a Type-3 implementation, has more importance in a Type-1 description where the process of selecting and shutting down a winner has the consequence of altering all \( T_j \) values.
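
In a Type-3 description the search order is simply the descending order of the initial values \( O_j \). The Python sketch below illustrates this, and also the remark that the resulting ordering may differ between the original and the modified choice functions. The function names, the two templates and the parameter values ($L = 2$, $L_A = 20$, $L_B = 19$, $L_M = 80$) are hypothetical and were picked only to make the two orderings differ.

```python
def T_art1(I, z, L):
    """Division-based choice function, eq. (1.25)."""
    overlap = sum(x & w for x, w in zip(I, z))
    return L * overlap / (L - 1 + sum(z))


def T_art1m(I, z, L_A, L_B, L_M):
    """Subtraction-based choice function of the modified algorithm."""
    overlap = sum(x & w for x, w in zip(I, z))
    return L_A * overlap - L_B * sum(z) + L_M


def search_order(O):
    """Indices of the F2 nodes in the order they are tried (largest O_j first)."""
    return sorted(range(len(O)), key=lambda j: -O[j])


templates = [[1, 0, 0, 0], [1, 1, 1, 1]]
I = [1, 1, 1, 0]
order_original = search_order([T_art1(I, z, L=2.0) for z in templates])
order_modified = search_order([T_art1m(I, z, 20, 19, 80) for z in templates])
print(order_original, order_modified)
```

Resetting the first node of either list leaves the remaining values untouched, so the next node in the list is the next winner.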

### E. Learning on a Single Trial (Theorem 4 of original ART 1):

This theorem (page 93 of [1.2]) states the following, assuming a Type-3 implementation is being considered\(^6\):

Suppose that an \( F_2 \) winning node \( u_J \) is accepted by the vigilance subsystem. Then the LTM traces \( z_J \) change in such a way that \( T_J \) increases and all other \( T_j \) remain constant, thereby confirming the choice of \( u_J \). In addition, the set \( I \cap z_J \) remains constant during learning, so that learning does not trigger reset of \( u_J \) by the vigilance subsystem.

**E.1: Original ART 1:**

According to eq. (1.25), if \( u_J \) is the winning category accepted by the vigilance subsystem, we have that

\[
T_J = \frac{L T_{A_J}}{L - 1 + T_{B_J}} = \frac{L|I \cap z_J|}{L - 1 + |z_J|}. \quad (1.43)
\]

This is the \( T_J \) value before learning takes place. After updating the weights (see Fig. 1.3(a)),

\[
z_J(\text{new}) = I \cap z_J(\text{old}) \quad (1.44)
\]

and the new \( T_J \) value is given by,

\[ T_J(\text{new}) = \frac{L|I \cap z_J(\text{new})|}{L - 1 + |z_J(\text{new})|} = \frac{L|I \cap I \cap z_J(\text{old})|}{L - 1 + |I \cap z_J(\text{old})|} \geq \frac{L|I \cap z_J(\text{old})|}{L - 1 + |z_J(\text{old})|} = T_J(\text{old}) \]  

(1.45)

Note that by eq. (1.44),

\[ I \cap z_J(\text{new}) = I \cap I \cap z_J(\text{old}) = I \cap z_J(\text{old}) \]  

(1.46)

Since the only weights that are updated are those connected to the winning (and accepted) node \( u_J \), all other values \( T_j \) (\( j \neq J \)) remain unchanged. Therefore, it can be concluded, by eq. (1.45), that learning confirms the choice of \( u_J \) and that, by eq. (1.46), the set \( I \cap z_J \) remains constant.

---

5. However, note that the resulting ordering \( \{j_1, j_2, j_3, \ldots \} \) can be different for the original and for the modified architecture.

6. In the original ART 1 paper [1.2] a more sophisticated demonstration for this theorem is provided. The reason is that there the demonstration is performed for a Type-1 description of ART 1.

**E.2: Modified ART 1:**

In this case, if \( u_j \) is the winning category accepted by the vigilance subsystem, by eq. (1.26) we have that
\[ T_j = L_A T_{A_j} - L_B T_{B_j} + L_M = L_A |I \cap z_j| - L_B |z_j| + L_M \]  

(1.47)

The update rule is the same as before (see Fig. 1.3(b)), therefore
\[ z_j(\text{new}) = I \cap z_j(\text{old}) \]  

(1.48)

and the new \( T_j \) value is given now by,
\[ T_j(\text{new}) = L_A |I \cap z_j(\text{old})| - L_B |I \cap z_j(\text{old})| + L_M \geq L_A |I \cap z_j(\text{old})| - L_B |z_j(\text{old})| + L_M = T_j(\text{old}) \]  

(1.49)

As before, learning confirms the choice of \( u_j \), and by eq. (1.18) the set \( I \cap z_j \) remains constant as well.
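The single-trial argument can be checked numerically. The following is a minimal sketch, not the thesis implementation: the function names and the parameter values (`L=5`, `LA=2`, `LB=1`, `LM=100`) are assumptions chosen only for illustration. It applies the update rule of eqs. (1.44) and (1.48) to one winning template and compares the choice value before and after learning for both models.

```python
def overlap(I, z):
    """|I ∩ z| for binary lists I and z."""
    return sum(a & b for a, b in zip(I, z))

def t_original(I, z, L=5.0):
    """Original ART 1 choice function, eq. (1.43). L is an assumed value."""
    return L * overlap(I, z) / (L - 1 + sum(z))

def t_modified(I, z, LA=2.0, LB=1.0, LM=100.0):
    """Modified ART 1 choice function, eq. (1.47). LA, LB, LM are assumed values."""
    return LA * overlap(I, z) - LB * sum(z) + LM

def learn(I, z):
    """Fast-learning update z(new) = I ∩ z(old), eqs. (1.44) and (1.48)."""
    return [a & b for a, b in zip(I, z)]

I = [1, 1, 0, 1, 0, 0]
z_old = [1, 0, 1, 1, 1, 0]
z_new = learn(I, z_old)

# Learning never lowers the winner's choice value in either model.
print(t_original(I, z_old), "->", t_original(I, z_new))
print(t_modified(I, z_old), "->", t_modified(I, z_new))
```

Because `learn` only touches the winning template, the choice values of all other nodes are unaffected, which is the second half of the theorem.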

**F. Stable Category Learning (Theorem 5 of original ART 1):**

Suppose an arbitrary list (finite or infinite) of binary input patterns is presented to an ART 1 system. Each template \( z_j = (z_{1j}, z_{2j}, \ldots, z_{Nj}) \) is updated every time category \( u_j \) is selected by the Winner-Take-All \( F_2 \) layer and accepted by the vigilance subsystem. On some of these occasions template \( z_j \) changes, and on others it stays unchanged. Let us call the times at which \( z_j \) changes \( t_1^{(j)} < t_2^{(j)} < \ldots < t_{r_j}^{(j)} \). Since vector (or template) \( z_j \) has \( N \) components (initially set to '1'), and by eqs. (1.44) and (1.18), each component can only change from '1' to '0' but not from '0' to '1', it follows that template \( z_j \) can undergo at most \( N-1 \) changes\(^7\),

\[ r_j \leq N - 1 \]

(1.50)

Since template \( z_j \) will remain unchanged after time \( t_{r_j}^{(j)} \), it is concluded that the complete LTM memory will undergo no change after time

\[ t_{\text{learn}} = \max_j \{ t_{r_j}^{(j)} \} \]

(1.51)

---

\(^7\) Here we are assuming that the empty template \( (z_{ij} = 0 \;\; \forall i) \) is not a valid one. The only way to store this template is by using the empty input pattern \( (I_i = 0 \;\; \forall i) \), which we assume has no significance, and hence will never be presented.

If there is a finite number of nodes in the $F_2$ layer, $t_{\text{learn}}$ has a finite value, and thus learning completes after a finite number of time steps.
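The stabilization argument can be illustrated with a small simulation of the modified algorithm in its fast-learning (Type-3) form. This is a sketch under stated assumptions: the helper names `present` and `train`, the parameter values, and the use of a shrinking candidate set to emulate the reset \( T_J = 0 \) are illustrative choices, and the constant \( L_M \) is omitted because it is common to all nodes and does not affect the ranking.

```python
import random

def overlap(I, z):
    """|I ∩ z| for binary lists."""
    return sum(a & b for a, b in zip(I, z))

def present(I, Z, rho, LA=2.0, LB=1.0):
    """Present pattern I once; update Z in place. Returns True if a template changed."""
    T = {j: LA * overlap(I, z) - LB * sum(z) for j, z in enumerate(Z)}
    while T:
        J = max(T, key=T.get)                   # Winner-Take-All
        if overlap(I, Z[J]) >= rho * sum(I):    # vigilance criterion met
            new = [a & b for a, b in zip(I, Z[J])]
            changed = new != Z[J]
            Z[J] = new                          # z(new) = I ∩ z(old)
            return changed
        del T[J]                                # reset: drop J from this search
    return False                                # no node could code I

def train(patterns, n_nodes, rho):
    """Repeat training trials until one full trial changes no template."""
    N = len(patterns[0])
    Z = [[1] * N for _ in range(n_nodes)]       # uncommitted templates
    trials = 0
    while True:
        trials += 1
        changed = [present(I, Z, rho) for I in patterns]
        if not any(changed):
            return Z, trials

rng = random.Random(0)
patterns = [[rng.randint(0, 1) for _ in range(25)] for _ in range(10)]
patterns = [p for p in patterns if any(p)]      # the empty pattern is never presented
Z, trials = train(patterns, n_nodes=40, rho=0.4)
print("stable after", trials, "trials (including the final confirming trial)")
```

The loop in `train` is guaranteed to terminate because every change removes at least one '1' from some template, and the total number of '1's is finite.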

All this is true for both the original and the modified ART 1 architectures, and therefore the following theorem (page 95 of [1.2]) is valid for the two algorithms:

In response to an arbitrary list of binary input patterns, all LTM traces $z_{ij}(t)$ approach limits after a finite number of learning trials. Each template set $z_j$ remains constant except for at most $N-1$ times $t_1^{(j)} < t_2^{(j)} < \ldots < t_{r_j}^{(j)}$ at which it progressively loses elements, leading to the

\[
\text{Subset Recoding Property: } z_j(t_1^{(j)}) \supset z_j(t_2^{(j)}) \supset \ldots \supset z_j(t_{r_j}^{(j)}). \tag{1.52}
\]

The LTM traces $z_{ij}(t)$ such that $i \notin z_j(t_1^{(j)})$ decrease to zero. The LTM traces $z_{ij}(t)$ such that $i \in z_j(t_{r_j}^{(j)})$ remain always at ‘1’. The LTM traces such that $i \in z_j(t_k^{(j)})$ but $i \notin z_j(t_{k+1}^{(j)})$ stay at ‘1’ for times $t \leq t_k^{(j)}$ but will change to and stay at ‘0’ for times $t \geq t_{k+1}^{(j)}$.

**G. Direct Access after Learning Self-Stabilizes (Theorem 6 of original ART 1):**

Assuming $F_2$ has a finite number of nodes, the present theorem (page 98 of [1.2]) states the following:

After recognition learning has self-stabilized in response to an arbitrary list of binary input patterns, each input pattern $I$ either has direct access to the node $u_j$ which possesses the largest subset template with respect to $I$, or $I$ cannot be coded by any node of $F_2$. In the latter case, $F_2$ contains no uncommitted nodes.

Since learning has already stabilized, $I$ can be coded only by a node $u_j$ whose template $z_j$ is a subset template with respect to $I$. Otherwise, after $u_j$ becomes active, the set $z_j$ would contract to $z_j \cap I$, thereby contradicting the hypothesis that learning has already stabilized. Thus if $I$ activates any node other than one with a subset template, that node must be reset by the vigilance subsystem. For the remainder of the proof, let $u_J$ be the first $F_2$ node activated by $I$. We need to show that if $z_J$ is a subset template, then it is the subset template with the largest $O_J$; and if it is not a subset template, then all subset templates activated on that trial will be reset by the vigilance subsystem. To prove these two steps we need to differentiate between the original ART 1 and the modified one.

**G.1: Original ART 1:**

If $u_j$ and $u_J$ are nodes with subset templates with respect to $I$, then

\[
O_j = \frac{L|z_j|}{L - 1 + |z_j|} < O_J = \frac{L|z_J|}{L - 1 + |z_J|} \tag{1.53}
\]

Since $L|z_j|/(L - 1 + |z_j|)$ is an increasing function of $|z_j|$,

\[
|z_j| < |z_J| \tag{1.54}
\]

and,

$$R_j = \frac{|I \cap z_j|}{|I|} = \frac{|z_j|}{|I|} < R_J = \frac{|I \cap z_J|}{|I|} = \frac{|z_J|}{|I|} \quad (1.55)$$

Once activated, a node $u_k$ will be reset if $R_k < \rho$. Therefore, if $u_J$ is reset ($R_J < \rho$), then all other nodes with subset templates will be reset as well ($R_j < \rho$).

Now suppose that $u_J$, the first activated node, does not have a subset template with respect to $I$ ($|I \cap z_J| < |z_J|$), but that another node $u_j$ with a subset template is activated in the course of search. We need to show that $|I \cap z_j| = |z_j| < \rho|I|$, so that $u_j$ is reset. We know that,

$$O_j = \frac{L|z_j|}{L - 1 + |z_j|} < O_J = \frac{L|I \cap z_J|}{L - 1 + |z_J|} < \frac{L|z_J|}{L - 1 + |z_J|} \quad (1.56)$$

which implies that $|z_j| < |z_J|$. Since $u_J$ cannot be chosen, it has to be reset by the vigilance subsystem, which means that $|I \cap z_J| < \rho|I|$. Therefore,

$$\frac{|z_j|}{L - 1 + |z_j|} < \frac{|I \cap z_J|}{L - 1 + |z_J|} < \frac{\rho|I|}{L - 1 + |z_J|} \quad (1.57)$$

which, together with $|z_j| < |z_J|$, implies that,

$$|I \cap z_j| = |z_j| < \rho|I| \quad (1.58)$$

**G.2: Modified ART 1:**

If $u_j$ and $u_J$ are nodes with subset templates with respect to $I$, then

$$O_j = L_A|z_j| - L_B|z_j| + L_M < O_J = L_A|z_J| - L_B|z_J| + L_M \quad (1.59)$$

Since $(L_A - L_B)|z_j|$ is an increasing function of $|z_j|$,

$$|z_j| < |z_J| \quad (1.60)$$

and,

$$R_j = \frac{|I \cap z_j|}{|I|} = \frac{|z_j|}{|I|} < R_J = \frac{|I \cap z_J|}{|I|} = \frac{|z_J|}{|I|} \quad (1.61)$$

Therefore, if $u_J$ is reset ($R_J < \rho$), then all other nodes with subset templates will be reset as well ($R_j < \rho$).

Now suppose that $u_J$, the first activated node, does not have a subset template with respect to $I$ ($|I \cap z_J| < |z_J|$), but that another node $u_j$ with a subset template is activated in the course of search. We need to show that $|I \cap z_j| = |z_j| < \rho|I|$, so that $u_j$ is reset. We know that,

$$O_j = (L_A - L_B)|z_j| + L_M < O_J = L_A|I \cap z_J| - L_B|z_J| + L_M < (L_A - L_B)|z_J| + L_M \quad (1.62)$$

which implies that $|z_j| < |z_J|$. Since $u_J$ cannot be chosen, it has to be reset by the vigilance subsystem, which means that $|I \cap z_J| < \rho|I|$. Therefore,

$$L_A |z_j| - L_B |z_j| < L_A |I \cap z_J| - L_B |z_J| < L_A \rho |I| - L_B |z_J| < L_A \rho |I| - L_B |z_j| \quad (1.63)$$

which implies that,

$$|I \cap z_j| = |z_j| < \rho |I| \quad (1.64)$$

**H. Search Order (Theorem 7 of original ART 1):**

The original Theorem 7 (page 100 of [1.2]) states the following:

Suppose that input pattern \( I \) satisfies

\[ L - 1 \leq \frac{1}{|I|} \]

(1.65)

and

\[ |I| \leq N - 1 \]

(1.66)

Then \( F_2 \) nodes are searched in the following order, if they are searched at all.

Subset templates with respect to \( I \) are searched first, in order of decreasing size. If the largest subset template is reset, then all subset templates are reset. If all subset templates have been reset and if no other learned templates exist, then the first uncommitted node to be activated will code \( I \). If all subset templates are searched and if there exist learned superset templates but no mixed templates, then the node with the smallest superset template will be activated next and will code \( I \). If all subset templates are searched and if both superset templates \( z_J \) and mixed templates \( z_j \) exist, then \( u_j \) will be searched before \( u_J \) if and only if

\[ |z_j| < |z_J| \quad \text{and} \quad \frac{|I|}{|z_J|} < \frac{|I \cap z_j|}{|z_j|} \]

(1.67)

If all subset templates are searched and if there exist mixed templates but no superset templates, then a node \( u_j \) with a mixed template will be searched before an uncommitted node \( u_J \) if and only if

\[ \frac{L |I \cap z_j|}{L - 1 + |z_j|} > T_J(I, t=0), \]

(1.68)

where \( T_J(I, t=0) = \left(L \sum_i I_i z_{iJ}(0)\right) / \left(L - 1 + \sum_i z_{iJ}(0)\right) \). The conditions expressed in eqs. (1.65)-(1.68) have to be changed in order to adapt this theorem to the modified ART 1 architecture. The original proof will not be reproduced here, because it differs drastically from the one we will provide for the modified theorem. The modified theorem is identical to the original one, except for eqs. (1.65)-(1.68). It states the following:

Suppose that the parameters satisfy

\[ \frac{L_A}{L_B} < \frac{N}{N-1} \]

(1.69)

and that input pattern \( I \) satisfies

\[ |I| \leq N - 1 \]

(1.70)

Then \( F_2 \) nodes are searched in the following order, if they are searched at all.

Subset templates with respect to \( I \) are searched first, in order of decreasing size. If the largest subset template is reset, then all subset templates are reset. If all subset templates have been reset and if no other learned templates exist, then the first uncommitted node to be activated will code \( I \). If all subset templates are searched and if there exist learned superset templates but no mixed templates, then the node with the smallest superset template will be activated next and will code \( I \). If all subset templates are searched and if both superset templates \( z_J \) and mixed templates \( z_j \) exist, then \( u_j \) will be searched before \( u_J \) if and only if

\[
|z_j| < |z_J| \quad \text{and} \quad \frac{|I| - |I \cap z_j|}{|z_J| - |z_j|} < \frac{L_B}{L_A}.
\]

(1.71)

If all subset templates are searched and if there exist mixed templates but no superset templates, then a node \( u_j \) with a mixed template will be searched before an uncommitted node \( u_J \) if and only if

\[
L_A|I \cap z_j| - L_B|z_j| + L_M > T_J(I, t=0),
\]

(1.72)

where \( T_J(I, t=0) = L_A \sum_i I_i z_{iJ}(0) - L_B \sum_i z_{iJ}(0) + L_M \). The proof has several parts:

a) First we show that a node \( u_J \) with a subset template \(( I \cap z_J = z_J )\) is searched before any node \( u_j \) with a non subset template. In this case,

\[
O_j = L_A|I \cap z_j| - L_B|z_j| + L_M = |I \cap z_j|\left(L_A - L_B\frac{|z_j|}{|I \cap z_j|}\right) + L_M
\]

(1.73)

Now, note that

\[
\frac{|z_j|}{|I \cap z_j|} > \frac{N}{N-1}
\]

(1.74)

because\(^8\)

\[
\left( \frac{|z_j|}{|I \cap z_j|} \right)_{\min} = \left( \frac{|z_j|}{|z_j| - 1} \right)_{\min} = \frac{N - 1}{N - 2} > \frac{N}{N - 1}
\]

(1.75)

From eqs. (1.26), (1.69) and (1.74), it follows that

\[
O_j < |I \cap z_j| L_B\left(\frac{L_A}{L_B} - \frac{N}{N-1}\right) + L_M < L_M
\]

(1.76)

On the other hand,

\[
O_J = (L_A - L_B)|z_J| + L_M > L_M
\]

(1.77)

Therefore,

\[
O_J > O_j
\]

(1.78)

b) Subset templates are searched in order of decreasing size:

Suppose two subset templates of \( I \), \( z_J \) and \( z_j \), such that \( |z_J| > |z_j| \). Then

\[ O_J = (L_A - L_B) |z_J| + L_M > (L_A - L_B) |z_j| + L_M = O_j \quad (1.79) \]

Therefore node \( u_J \) will be searched before node \( u_j \). By eq. (1.61), if the largest subset template is reset, then all other subset templates are reset as well.

\(^8\) We are assuming that \( u_j \) is not an uncommitted node \((|z_j| < N)\).

c) Nodes \( u_J \) with subset templates are searched before an uncommitted node \( u_j \):

\[ O_j = L_A |I| - L_B N + L_M \leq L_A (N - 1) - L_B N + L_M = L_B \left( \frac{L_A}{L_B} (N - 1) - N \right) + L_M < \]
\[ L_B \left( \frac{N}{N - 1} (N - 1) - N \right) + L_M = L_M < (L_A - L_B) |z_J| + L_M = O_J \quad (1.80) \]

Therefore, if all subset templates are searched and if no other learned template exists, then an uncommitted node will be activated and will code the input pattern.

d) If all subset templates have been searched and there exist learned superset templates but no mixed templates, the node \( u_J \) with the smallest superset template will be activated (and not an uncommitted node \( u_j \)) and will code \( I \):

\[ O_J = L_A |I| - L_B |z_J| + L_M > L_A |I| - L_B N + L_M = O_j \quad (1.81) \]

If there is more than one superset template, the one with the smallest \( |z_J| \) will be activated. Since \( |I \cap z_J| = |I| \) for a superset template, the vigilance criterion is met, there is no reset, and \( I \) will be coded.

e) If all subset templates have been searched and there exist a superset template \( z_J \) and a mixed template \( z_j \), then \( O_j > O_J \) if and only if eq. (1.71) holds:

\[ O_j - O_J = L_A (|I \cap z_j| - |I|) + L_B (|z_J| - |z_j|) \quad (1.82) \]

e.1) if eq. (1.71) holds:

\[ O_j - O_J = L_A \left( \frac{L_B}{L_A} - \frac{|I| - |I \cap z_j|}{|z_J| - |z_j|} \right) (|z_J| - |z_j|) > 0 \quad (1.83) \]

e.2) if \( O_j > O_J \):

Assume first that \( |z_J| - |z_j| < 0 \). Then, by eq. (1.83), it has to be

\[ \frac{L_B}{L_A} < \frac{|I| - |I \cap z_j|}{|z_J| - |z_j|} \quad (1.84) \]

Since \( L_A > L_B > 0 \), it would have to be \( |I| - |I \cap z_j| < 0 \), which is false. Therefore, it must be \( |z_J| - |z_j| > 0 \) and

\[ \frac{L_B}{L_A} > \frac{|I| - |I \cap z_j|}{|z_J| - |z_j|} \quad (1.85) \]

f) If all subset templates are searched and if there exist mixed templates but no superset templates, then a node \( u_j \) with a mixed template (\( O_j = L_A |I \cap z_j| - L_B |z_j| + L_M \)) will be searched before an uncommitted node \( u_J \) (\( O_J = L_A |I| - L_B N + L_M \)) if and only if eq. (1.72) holds:

\[ O_j - O_J = L_A (|I \cap z_j| - |I|) - L_B (|z_j| - N) > 0 \Leftrightarrow \]
\[ \Leftrightarrow L_A |I \cap z_j| - L_B |z_j| + L_M > L_A |I| - L_B N + L_M = T_J (I, t=0) \]

(1.86)

This completes the proof of the modified Theorem 7 for the modified ART 1 architecture.
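As a worked illustration of the modified theorem, the sketch below ranks one template of each kind for a single input. All numerical values (\( N = 8 \), \( L_A = 1.1 \), \( L_B = 1 \), \( L_M = 10 \)), the template contents and the function names are assumptions made for this example; they are chosen so that \( L_A/L_B < N/(N-1) \) and \( |I| \leq N-1 \) hold.

```python
def overlap(I, z):
    """|I ∩ z| for binary lists."""
    return sum(a & b for a, b in zip(I, z))

def choice(I, z, LA=1.1, LB=1.0, LM=10.0):
    """Modified ART 1 choice function L_A|I ∩ z| - L_B|z| + L_M (assumed parameters)."""
    return LA * overlap(I, z) - LB * sum(z) + LM

def mixed_before_superset(I, z_mixed, z_super, LA=1.1, LB=1.0):
    """Second condition of the modified theorem: mixed template searched before superset."""
    dz = sum(z_super) - sum(z_mixed)
    return dz > 0 and (sum(I) - overlap(I, z_mixed)) / dz < LB / LA

# N = 8, so L_A/L_B = 1.1 < 8/7, and |I| = 4 <= N - 1.
I = [1, 1, 1, 1, 0, 0, 0, 0]
templates = {
    "subset_3":    [1, 1, 1, 0, 0, 0, 0, 0],
    "subset_1":    [1, 0, 0, 0, 0, 0, 0, 0],
    "superset_5":  [1, 1, 1, 1, 1, 0, 0, 0],
    "mixed_3":     [1, 1, 0, 0, 0, 1, 0, 0],
    "uncommitted": [1] * 8,
}
search_order = sorted(templates, key=lambda k: choice(I, templates[k]), reverse=True)
print(search_order)
```

In this example the mixed template is searched after the superset template, in agreement with the ratio test, since \( (|I| - |I \cap z_j|)/(|z_J| - |z_j|) = 1 \) is not smaller than \( L_B/L_A \).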

**I. Biasing the Network towards Uncommitted Nodes:**

In the original ART 1 architecture, choosing \( L \) large increases the tendency of the network to choose uncommitted nodes in response to unfamiliar input patterns \( I \). In the modified ART 1 architecture, the same effect is observed when choosing \( L_A/L_B \) large. This can be understood through the following reasoning.

When an input pattern \( I \) is presented, an uncommitted node is chosen before a coded node \( u_j \) if

\[ L_A |I \cap z_j| - L_B |z_j| < L_A |I| - L_B N \]  
(1.87)

This inequality is equivalent to

\[ \frac{L_A}{L_B} > \frac{N - |z_j|}{|I| - |I \cap z_j|} \]  
(1.88)

As the ratio \( L_A/L_B \) increases it is more likely that eq. (1.88) is satisfied, and hence that uncommitted nodes are chosen before coded nodes, regardless of the vigilance parameter value \( \rho \).
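The equivalence between the two inequalities can be checked with a few counts. In this sketch the function names and the example values (\( N = 25 \), \( |I| = 6 \), \( |z_j| = 10 \), \( |I \cap z_j| = 4 \)) are hypothetical; the ratio form is only evaluated when \( |I| > |I \cap z_j| \).

```python
def uncommitted_first(n_I, n_z, n_overlap, N, LA, LB):
    """Direct comparison of the two choice values (L_M cancels on both sides)."""
    return LA * n_overlap - LB * n_z < LA * n_I - LB * N

def uncommitted_first_ratio(n_I, n_z, n_overlap, N, alpha):
    """Equivalent ratio test on alpha = L_A/L_B, valid when n_I > n_overlap."""
    return alpha > (N - n_z) / (n_I - n_overlap)

# Threshold for this example: (25 - 10) / (6 - 4) = 7.5
for alpha in (2.0, 8.0):
    direct = uncommitted_first(6, 10, 4, 25, LA=alpha, LB=1.0)
    ratio = uncommitted_first_ratio(6, 10, 4, 25, alpha)
    print(alpha, direct, ratio)
```

With these counts the uncommitted node is preferred only once \( L_A/L_B \) exceeds 7.5, which shows how a larger ratio biases the search towards uncommitted nodes.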

**J. Remarks:**

Even though in this Section we have shown that the computational properties of the original ART 1 system are preserved in the modified ART 1 system, the response of both systems to an arbitrary list of training patterns will not be exactly the same. The main underlying reason for this difference in behavior is that the initial ordering

\[ O_{j_1} > O_{j_2} > O_{j_3} > ... \]  
(1.89)

is not always exactly the same for both architectures. In the next Section we will try to study the differences between the two ART 1 systems. As we will see, for most cases the behavior is identical, although in a few cases a different behavior results.

### 1.6. Functional Differences between Original and Modified Model

As stated previously, the difference in behavior between the ART 1 and ART \( 1_m \) models is caused by the different orderings of the terms of eq. (1.89). Assuming that both models, at a certain time, have identical weight templates \( \{ z_j \} \), and the same input pattern \( I \) is given, eq. (1.89) has the following two formulations:

Original ART 1:

\[
\frac{|I \cap z_{j_1}|}{L - 1 + |z_{j_1}|} > \frac{|I \cap z_{j_2}|}{L - 1 + |z_{j_2}|} > \frac{|I \cap z_{j_3}|}{L - 1 + |z_{j_3}|} > \ldots
\]

Modified ART 1:

\[
\frac{L_A}{L_B} |I \cap z_{l_1}| - |z_{l_1}| > \frac{L_A}{L_B} |I \cap z_{l_2}| - |z_{l_2}| > \ldots
\]

(1.90)

where \(j_k\) might be different from \(l_k\). The ordering resulting for the original ART 1 description is modulated by parameter \(L > 1\). For example, if \(L\) is very large compared to all \(|z_j|\) terms, then the ordering depends exclusively on the values of \(|I \cap z_j|\),

\[
|I \cap z_{j_1}| > |I \cap z_{j_2}| > |I \cap z_{j_3}| > \ldots
\]

(1.91)

If \(L\) is very close to 1, then the ordering depends on the ratios,

\[
\frac{|I \cap z_{j_1}|}{|z_{j_1}|} > \frac{|I \cap z_{j_2}|}{|z_{j_2}|} > \frac{|I \cap z_{j_3}|}{|z_{j_3}|} > \ldots
\]

(1.92)

Likewise, for the ART \( 1_m \) description, the ordering is modulated by a single parameter \(\alpha = L_A/L_B > 1\). If \(\alpha\) is extremely large, the situation in eq. (1.91) results. However, for \(\alpha\) very close to 1, the ordering depends on the differences,

\[
|I \cap z_{l_1}| - |z_{l_1}| > |I \cap z_{l_2}| - |z_{l_2}| > |I \cap z_{l_3}| - |z_{l_3}| > \ldots
\]

(1.93)

Obviously, the behavior of the two ART 1 descriptions will be identical for large values of \(L\) and \(\alpha\). However, moderate values of \(L\) and \(\alpha\) are desired in practical ART 1 applications. On the other hand, it can be expected that the behavior will also tend to be similar for very high values of \(\rho\): if \(\rho\) is very close to 1, each training pattern will form an independent category. However, different training patterns will cluster into a shared category for smaller values of \(\rho\). Therefore, a very similar behavior between ART 1 and ART 1_m will be expected for high values of \(\rho\), while more differences in behavior might be apparent for smaller values of \(\rho\).
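The limiting behavior of the two orderings can be reproduced with a toy example. In the sketch below each node is described only by the pair \( (|I \cap z_j|, |z_j|) \); the three pairs, the parameter values and the function names are assumptions chosen for illustration, and the common factors \( L \) and \( L_B \) are dropped because they do not affect the ranking.

```python
def rank_original(stats, L):
    """Order nodes by |I ∩ z| / (L - 1 + |z|), largest first."""
    return sorted(stats, key=lambda k: stats[k][0] / (L - 1 + stats[k][1]), reverse=True)

def rank_modified(stats, alpha):
    """Order nodes by alpha |I ∩ z| - |z|, largest first (alpha = L_A/L_B)."""
    return sorted(stats, key=lambda k: alpha * stats[k][0] - stats[k][1], reverse=True)

# node name -> (|I ∩ z_j|, |z_j|), hypothetical values
stats = {"A": (4, 10), "B": (3, 4), "C": (2, 2)}

print(rank_original(stats, L=1000), rank_modified(stats, alpha=100))   # overlap-driven
print(rank_original(stats, L=1.01), rank_modified(stats, alpha=1.01))  # ratio / difference
print(rank_original(stats, L=5), rank_modified(stats, alpha=1.5))      # moderate values
```

For very large or very small parameters both models rank these three nodes identically, while for the moderate values shown the two orderings differ, which is the source of the behavioral differences discussed in this Section.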

In order to compare the two algorithms' behavior, we have performed extensive simulations using randomly generated training pattern sets\(^9\). As an illustration of a typical case where the two algorithms produce different learned templates, Fig. 1.5 shows the evolution of the memory templates, for both the ART 1 and the ART \( 1_m \) algorithms, using a randomly generated training set of 10 patterns with 25 pixels each. Weight templates for original ART 1 are named \(z_j\), while for ART \( 1_m \) they are named \(z'_{j}\). The vigilance parameter was set to \(\rho = 0.4\), with \(L = 5\) for the original ART 1 and \(\alpha = 2\) for ART \( 1_m \). In Fig. 1.5, boxed category templates are those that met the vigilance criterion and had the maximum \(T_j\) value. If the box is drawn with a continuous line, the corresponding \(z_j\) template was modified by learning. If the box is drawn with a dashed line, learning did not alter the corresponding \(z_j\) template. Both algorithms stabilized their weights in 2 training trials. Looking at the learned templates we can see that input patterns 4 and 5 clustered in the same category for both algorithms (\(z_4\) for original ART 1 and \(z'_{3}\) for ART \( 1_m \)). This also occurred for patterns 6 and 8 (\(z_3\) and \(z'_{2}\)) and for patterns 3, 9 and 10 (\(z_5\) and \(z'_{5}\)). However, patterns 1, 2, and 7 did not cluster in the same way in the two cases. In the original ART 1 algorithm patterns 1 and 7 clustered into category \( z_1 \), while pattern 2 remained independent in category \( z_2 \). In the ART \( 1_m \) algorithm patterns 1 and 2 clustered together into category \( z'_1 \), while pattern 7 remained independent in category \( z'_4 \).

Fig. 1.5: Comparative Learning Example ($\rho=0.4$, $L=5$, $\alpha=2$). Columns: Inputs (patterns 1 to 10), Original ART 1 templates, ART \( 1_m \) templates.

---

9. For all simulations in this Appendix, randomly generated training pattern sets were obtained with a 50% probability for a pixel to be either '1' or '0'.

To measure a distance between the two templates \( z_j \) and \( z'_j \), let us use the Hamming distance between two binary patterns \( a = (a_1, a_2, \ldots, a_N) \) and \( b = (b_1, b_2, \ldots, b_N) \),

\[
d(a, b) = \sum_{i=1}^{N} f_d(a_i, b_i),
\]

(1.94)

where

\[
f_d(a_i, b_i) = \begin{cases} 
0 & \text{if } a_i = b_i, \\
1 & \text{if } a_i \neq b_i.
\end{cases}
\]

(1.95)

We can use this metric to define the distance between two sets of patterns \( \{z_j\}_{j=1}^{Q} \) and \( \{z'_j\}_{j=1}^{Q} \) as the minimum, over all orderings \( (l_1, l_2, \ldots, l_Q) \) of the second set, of

\[
\sum_{i=1}^{Q} d(z_i, z'_{l_i}).
\]

(1.96)

For this purpose, the optimal ordering of indexes \( (l_1, l_2, \ldots, l_Q) \) must be found. In the case of Fig. 1.5 (where \( Q = 5 \)), the distance \( D \) between the two learned patterns sets is given by,

\[
D = d(z_1, z'_4) + d(z_2, z'_1) + d(z_3, z'_2) + d(z_4, z'_3) + d(z_5, z'_5) = 7.
\]

(1.97)

In general, we can define the distance between two patterns sets \( A = \{a_j\}_{j=1}^{Q} \) and \( B = \{b_j\}_{j=1}^{Q} \) as,

\[
D(A, B) = \min_{\{l_1, l_2, \ldots, l_Q\}} \left[ \sum_{i=1}^{Q} d(a_i, b_{l_i}) \right].
\]

(1.98)

In the case of Fig. 1.5, both algorithms produced the same number of learned categories. This does not always occur. For the case where a different number of categories results, we measured the distance between the two learned sets by adding as many uncommitted \( F_2 \) nodes to the set with fewer categories as necessary to equalize the number of categories. An uncommitted category has all its pixels set to ‘1’. Thus, having a different number of committed nodes drastically increases the resulting distance, and is consequently a strong penalty.
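The distance of eq. (1.98), including the padding with uncommitted templates, can be sketched as follows. The function names are hypothetical, both sets are assumed non-empty, and the minimum is found by brute force over all \( Q! \) orderings, which is practical only for small \( Q \) such as in these simulations; an assignment solver such as the Hungarian algorithm would be the usual substitute for larger sets.

```python
from itertools import permutations

def hamming(a, b):
    """Hamming distance between two binary patterns of equal length."""
    return sum(x != y for x, y in zip(a, b))

def set_distance(A, B):
    """Minimum total Hamming distance over all pairings of templates in A and B."""
    A = [list(a) for a in A]
    B = [list(b) for b in B]
    n = len((A or B)[0])
    # Pad the smaller set with uncommitted templates (all pixels '1').
    while len(A) < len(B):
        A.append([1] * n)
    while len(B) < len(A):
        B.append([1] * n)
    return min(
        sum(hamming(a, B[l]) for a, l in zip(A, perm))
        for perm in permutations(range(len(B)))
    )

A = [[1, 0, 0], [0, 1, 1]]
B = [[0, 1, 1], [1, 0, 1]]
print(set_distance(A, B))
```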

We have repeated the simulation of Fig. 1.5 many times for different sets of randomly generated training patterns and sweeping the values of \( \rho, L, \) and \( \alpha \). For each combination of \( \rho, L, \) and \( \alpha \) values, we repeated the simulation 100 times for different training pattern sets, and computed the average number of learned categories, learning trials, and distance between learned categories, as well as their corresponding standard deviations. Fig. 1.6 and Fig. 1.7 present the results of these simulations. Fig. 1.6(a) shows how the average number of learned categories changes with \( L \) (from 1.01 to 40) for different values of \( \rho \), for the original ART 1. As \( \rho \) decreases, parameter \( L \) has more control over the average number of learned categories. Fig. 1.6(b) shows the standard deviation for the number of learned categories of Fig. 1.6(a). As the number of learned categories approaches the number of training patterns (10 in this case), the standard deviation decreases. This happens for large values of \( L \) (independently of \( \rho \)) and for large values of \( \rho \) (independently of \( L \)). Fig. 1.6(c) and Fig. 1.6(d) show the same as Fig. 1.6(a) and Fig. 1.6(b) respectively, for the ART \( 1_m \) algorithm. As we can see, parameter \( \alpha \) (swept from 1.01 to 5.0) of ART \( 1_m \) has more tuning power than parameter \( L \) of the original ART 1. On the other hand, ART \( 1_m \) presents a slightly higher standard deviation than the original ART 1. Nevertheless, the qualitative behavior of both algorithms is similar. Fig. 1.6(e) and Fig. 1.6(f) show the average number of learning trials and their corresponding deviations, needed by the original ART 1 algorithm to stabilize its learned weights. Fig. 1.6(g) and Fig. 1.6(h) show the same for the ART \( 1_m \) algorithm. As we can see, the ART \( 1_m \) algorithm needs a slightly higher average number of learning trials to stabilize. Also, the standard deviation observed for the ART \( 1_m \) algorithm is slightly higher. Finally, Fig. 1.7 shows the resulting average distances (as defined by eq. (1.98)) between learned categories of the ART 1 and the ART \( 1_m \) algorithms. For \( \rho \) changing from 0.0 to 0.7 in steps of 0.1, each sub-figure in Fig. 1.7 depicts the resulting average distance for different values of \( L \) while sweeping \( \alpha \) between 1.01 and 5.0.

Fig. 1.6: Simulated Results Comparing Behavior between ART 1 and ART \( 1_m \)

Fig. 1.7: Learned Categories Average Distances

Fig. 1.8: Optimal parameters fit between ART 1 and ART \( 1_m \)

It seems natural to expect that, for a given value of \( \rho \) and a given value of the original ART 1 parameter \( L \), there is an optimal value for the ART \( 1_m \) parameter \( \alpha \) that will minimize the difference in behavior between the two algorithms. To find this relation between \( L \) and \( \alpha \) for each \( \rho \), we computed (for a given \( \rho \) and \( L \)) the value of \( \alpha \) that minimizes the average distance between the learned pattern sets generated by the two algorithms. The results of these computations are shown in Fig. 1.8\(^{10}\). Fig. 1.8(a) shows a family of curves (one for each value of \( \rho \)) giving the optimal value of \( \alpha \) as a function of \( L \). Fig. 1.8(b) shows the resulting minimum average distance between learned sets for the same family of curves. As shown in Fig. 1.8(a), the optimum fit between parameters \( \alpha \) and \( L \) depends only weakly on the value of \( \rho \).

As can be concluded from Fig. 1.6, Fig. 1.7, Fig. 1.8, and the discussion in this Section, the behavior of the two algorithms is qualitatively the same although some slight quantitative differences can be observed. ART \( 1_m \) parameter \( \alpha \) has a wider tuning range than original ART 1 parameter \( L \). On the other hand, ART \( 1_m \) needs a slightly higher number of learning trials than the original ART 1. Also, there is an optimal adjustment between parameters \( \alpha \) and \( L \) that minimizes the difference in behavior between the two algorithms, and this adjustment appears approximately independent of \( \rho \).

---

10. Note that high values of \( \rho \) and \( L \) were omitted in this analysis, since in these cases the behavior of the two algorithms tends to be similar, regardless of the fit between parameters \( L \) and \( \alpha \).

### 1.7. Extending the ART \( 1_m \) Model to Type-2 and Type-1 Descriptions

The great advantage of the ART \( 1_m \) algorithm is that it leads to a very simple Type-3 hardware implementation, requiring only a binary valued memory template and only addition, subtraction and comparison operations, as well as a Winner-Take-All competition. Although Type-2 and Type-1 descriptions can be found that lead to the Type-3 behavior of the ART \( 1_m \) algorithm described in this paper, these descriptions do not possess the hardware-attractive features of the Type-3 implementation. Nevertheless, brief Type-2 and Type-1 descriptions for this ART \( 1_m \) algorithm are presented in this Section.

#### A. A Type-2 ART \( 1_m \) Implementation

The change in weights must be smooth in a Type-2 description. Every time an input pattern \( I \) is presented and an \( F_2 \) category node is selected for LTM storage, only a partial change in LTM traces is allowed. In this case, it is obvious that we can no longer use a binary valued weight template.

As seen in Section 1.4, Fig. 1.3(c) shows the flow diagram of a Type-3 implementation of the ART \( 1_m \) algorithm. Extending this diagram to a Type-2 description is straightforward. The only box that needs to be changed is that corresponding to the update of weights. Instead of using the algebraic formula \( z_{j(new)} = I \cap z_{j(old)} \) we have to use a time domain differential equation that would lead to the same steady state. The following set of differential equations fulfills this requirement,

\[
\dot{z}_{ij} = Kf(y_j) \left[ -z_{ij} + h(x_i) \right], \tag{1.99}
\]

where \( K \) is a positive constant, \( h(\cdot) \) a sigmoidal function, and \( x_i \) is the STM activity of node \( v_i \) in the \( F_1 \) layer, given by,

\[
x_i = I_i \sum_j y_j z_{ij} = I_i z_{ij}. \tag{1.100}
\]

Note that eq. (1.99) has the same form as eq. (1.16), which describes the evolution of the top-down weights of the Type-2 implementation of the original algorithm. Similarly, eq. (1.100) is the result of substituting in eq. (1.13) the top-down weights of the original algorithm by the set of weights \( \{ z_{ij} \} \) of the modified algorithm.

If \( T_\omega \) is the time required for the LTM eqs. (1.99) to settle to their steady state, the update of weights (i.e., the simulation of eqs. (1.99)) would be allowed only for a time interval \( \tau \ll T_\omega \) for each input pattern \( I \) presentation. As \( \tau \) approaches \( T_\omega \), application of eqs. (1.99) or the update weights equation of Fig. 1.3(c) would become equivalent. Fig. 1.9 shows the flow diagram corresponding to this Type-2 implementation of the ART \( 1_m \) algorithm.
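The effect of the partial update can be sketched by integrating the LTM equation of the winning node with an explicit Euler scheme. The following is only an illustration under simplifying assumptions that are not taken from the text: \( f(y_J) = 1 \) for the winner, \( h(\cdot) \) is replaced by the identity on \( [0, 1] \) instead of a sigmoid, and the values of `K`, `tau` and `dt` as well as the function name are hypothetical.

```python
def ltm_update(z, I, K=1.0, tau=0.1, dt=0.001):
    """Euler integration of dz_i/dt = K (-z_i + h(x_i)) over an interval tau.

    Assumptions: f(y_J) = 1 for the winning node, h is the identity,
    and x_i = I_i * z_i as in the F_1 activity equation.
    """
    z = list(z)
    for _ in range(round(tau / dt)):
        z = [zi + dt * K * (-zi + Ii * zi) for zi, Ii in zip(z, I)]
    return z

I = [1, 0, 1, 0]
z0 = [1.0, 1.0, 0.0, 0.0]

partial = ltm_update(z0, I, tau=0.1)    # short interval: weights move only slightly
settled = ltm_update(z0, I, tau=20.0)   # long interval: approaches I ∩ z0
print(partial)
print(settled)
```

With a short interval the weight at a position where \( I_i = 0 \) decays only partially, while for a long interval the result approaches the fast-learning template \( I \cap z_J \), in line with the equivalence described above.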
#### B. A Type-1 ART \( 1_m \) Implementation

For a Type-1 implementation, an appropriate set of STM equations must be found that leads to the flow diagram of Fig. 1.9 when the STM time constants are very small compared to the LTM ones. The following time domain STM differential equations would serve our purpose,

\[
\begin{aligned}
F_1: \quad \varepsilon \dot{x}_i &= -x_i + (1 - A_1 x_i) J_i^+ - (B_1 + C_1 x_i) J_i^- \\
F_2: \quad \varepsilon \dot{y}_j &= -y_j + (1 - A_2 y_j) J_j^+ - (B_2 + C_2 y_j) J_j^-
\end{aligned}
\]
(1.101)

where,

\[
\begin{aligned}
J_i^+ &= I_i + D_1 \sum_j f(y_j) z_{ij} , \\
J_i^- &= \sum_j f(y_j) , \\
J_j^+ &= g(y_j) + T_j , \\
J_j^- &= \sum_{k \neq j} g(y_k) .
\end{aligned}
\]
(1.102)

Parameters \( \varepsilon, A_1, B_1, C_1, A_2, B_2, C_2 \), and \( D_1 \) are positive and constant. Functions \( f(\cdot) \) and \( g(\cdot) \) are sigmoidal. Functions \( g(\cdot) \) will be responsible for the resulting Winner-Take-All action of the \( F_2 \) layer. These STM equations are identical to those of the original ART 1 algorithm [1.2], except that we use one weight

| Initialize weights: | \( z_{ij} = 1 \) |

| Read input pattern: | \( I = (I_1, I_2, \ldots, I_M) \) |

| \( T_j = L_A[I \cap z_j] - L_B[z_j] + L_M \) |

| Winner-Take-All: | \( y_j = 1 \) if \( T_j = \max_j \{T_j\} \) \( y_j = 0 \) if \( j \neq j \) |

| YES | \( p[I] > |I \cap z_j| \) |

| Update weights: | Apply LTM differential equations during a time interval \( \tau \) |

| NO | \( T_j = 0 \) |

Fig. 1.9: ART 1m algorithm Type-2 implementation
template instead of two. However, the main difference lies in the way the terms $T_j$ are computed. In this case $T_j$ will be given by the following equation,

$$T_j = D_2 \left[ L_A \sum_i h(x_i) z_{ij} - L_B \sum_i z_{ij} + L_M \right].$$  \hspace{1cm} (1.103)

where $D_2$ is constant and positive. Using eqs. (1.101)-(1.103) together with an STM Reset System will assure that if the STM time constants are very small compared to the LTM ones, the Type-2 description of Fig. 1.9 results. The Reset System can be identical to that used in the original ART 1 system: each active input ($I_i = 1$) sends an excitatory signal of size $P$ to an orienting subsystem $A$. Each $F_i$ node $x_i$ which exceeds zero generates an inhibitory signal of size $Q$ and sends it to $A$. The orienting subsystem $A$ generates a nonspecific reset wave to $F_2$ whenever

$$\frac{|X|}{|I|} < \rho = \frac{P}{Q},$$

$$\hspace{1cm} (1.104)$$

where $I$ is the input pattern and $|X|$ is the number of $F_1$ nodes such that $x_i > 0$. The nonspecific reset wave shuts off active $F_2$ nodes until the input pattern $I$ shuts off.
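
The reset condition of eq. (1.104) can be illustrated with a minimal sketch. The function name `reset_needed` and the list-based representation of the $F_1$ activities are illustrative assumptions, not part of the original model description; the comparison is written without division so that integer signal sizes $P$ and $Q$ can be used exactly.

```python
def reset_needed(x, inputs, P, Q):
    """Return True when the orienting subsystem A would emit a reset wave.

    x      : F1 activities x_i (hypothetical list representation)
    inputs : binary input pattern I
    P, Q   : excitatory and inhibitory signal sizes, with rho = P / Q
    """
    size_I = sum(1 for v in inputs if v == 1)   # |I|, number of active inputs
    size_X = sum(1 for v in x if v > 0)         # |X|, number of F1 nodes with x_i > 0
    if size_I == 0:
        return False                            # no input, no excitation reaches A
    # |X| / |I| < P / Q  is equivalent to  Q * |X| < P * |I|
    return Q * size_X < P * size_I


pattern = [1, 1, 1, 1, 0]
activity = [0.5, 0.2, 0.0, 0.0, 0.0]            # |X| = 2, |I| = 4, ratio 0.5
high_vigilance = reset_needed(activity, pattern, P=7, Q=10)   # rho = 0.7
low_vigilance = reset_needed(activity, pattern, P=4, Q=10)    # rho = 0.4
```

With the higher vigilance the match ratio 0.5 falls below $\rho$ and a reset is issued; with the lower vigilance it does not.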

### 1.8. Alternative ART 1 Modifications

Other alternatives to the computation of the terms $T_j$ of eq. (1.25) have been proposed [1.8] for a Fuzzy-ART architecture. Since ART 1 reduces to a particular case of Fuzzy-ART when the input pattern $I$ is binary valued, any valid way of computing $T_j$ in Fuzzy-ART should, in principle, be valid for ART 1 as well. The different $T_j$ functions (also called ‘distances’ or ‘choice functions’) proposed in [1.8] when particularized for ART 1 result in the following formulations:

Function 1: \[ |I \cap z_j| - |z_j| + \varepsilon \left( |z_j| - |I \cup z_j| \right) \]

Function 2: \[ |I \cap z_j| - |z_j| + \varepsilon \left( |z_j| - |I| \right) \]

$$\hspace{1cm} (1.105)$$

Note that these functions are also based on the subtraction operation, as in ART 1, but are computationally more expensive since either $|I \cup z_j|$ or $|I|$ has to be computed as well. The choice function that we have used in this paper would be equivalent to the following,

$$T_j = |I \cap z_j| - |z_j| + \varepsilon |z_j| = |I \cap z_j| - (1 - \varepsilon) |z_j|,$$

$$\hspace{1cm} (1.106)$$

and parameter $\alpha = L_A/L_B > 1$ would have been equivalent to

$$\alpha = \frac{1}{1 - \varepsilon}.$$  \hspace{1cm} (1.107)

If all the original ART 1 properties are to be preserved, we know now that $\alpha$ has to be greater than one. This implies,

$$\alpha > 1 \iff 1 > \varepsilon > 0.$$  \hspace{1cm} (1.108)

With respect to the choice functions in eq. (1.105), Function 2 is equivalent to eq. (1.106) as far as the outcome of the competition is concerned, because the only difference between the two is the term $-\epsilon|I|$. Since the input is common to all of the category nodes and does not change during a single presentation, this term acts as a uniform negative bias on all of the category nodes, regardless of the pattern coded in their templates, and therefore does not alter which node wins. Eq. (1.106) is thus more efficient, because the input size computation is unnecessary.
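
As a small numerical check of this argument, the sketch below evaluates eq. (1.106) and Function 2 of eq. (1.105) on a few hand-picked binary templates. The helper names and the example patterns are assumptions chosen for illustration only; ties are broken in favour of the lowest index, which is likewise an assumption.

```python
def size(v):
    """|v| for a binary vector: the number of ones."""
    return sum(v)


def overlap(a, b):
    """|a ∩ b| for two binary vectors of equal length."""
    return sum(x & y for x, y in zip(a, b))


def choice_eq_1_106(I, z, eps):
    """T_j = |I ∩ z_j| - (1 - eps) |z_j|  (eq. 1.106)."""
    return overlap(I, z) - (1 - eps) * size(z)


def choice_function_2(I, z, eps):
    """T_j = |I ∩ z_j| - |z_j| + eps (|z_j| - |I|)  (Function 2 of eq. 1.105)."""
    return overlap(I, z) - size(z) + eps * (size(z) - size(I))


def winner(values):
    """Index of the largest value; the first one wins in case of a tie."""
    return max(range(len(values)), key=lambda j: values[j])


I = [1, 1, 0, 1, 0, 0]
templates = [
    [1, 1, 1, 1, 1, 1],   # uncommitted node
    [1, 1, 0, 0, 0, 0],
    [1, 0, 0, 1, 1, 0],
]
eps = 0.5

t_a = [choice_eq_1_106(I, z, eps) for z in templates]
t_b = [choice_function_2(I, z, eps) for z in templates]
# Every node is shifted by the same amount, -eps * |I|.
offsets = [b - a for a, b in zip(t_a, t_b)]
```

In this example both functions select the same node, and the per-node difference equals $-\epsilon|I|$ for every template.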

Function 1 of eq. (1.105) is another valid choice function, but it is also computationally more expensive than eq. (1.106). It can be shown that the original ART 1 computational properties are preserved when this function is used (provided $\epsilon > 0$). To see this, replace the equations of Section 1.5 whose numbers appear in the first column of Table 1.1 with the equations in the second column, and note that

$$
\begin{aligned}
|I \cup z_j| & \geq |z_j|, |I| \\
|I \cap z_j| & \leq |z_j|, |I| \\
|I \cup z_j| &= |I| + |z_j| - |I \cap z_j|
\end{aligned}
$$

(1.109)

are always satisfied (if we know that $I \neq z_j$ then the ‘$\geq$’ and ‘$\leq$’ signs in eq. (1.109) can be substituted by ‘$>$’ and ‘$<$’, respectively). Table 1.1 only provides the demonstrations for properties A, B, E, G and I of Section 1.5. Properties C, D and F are automatically satisfied since they do not depend on the explicit formulation of $T_j$. With respect to property H (Search Order) it can be shown that all of them are fulfilled if eqs. (1.69), (1.71), and (1.72) are changed to

$$
1 - \frac{1}{1 - \epsilon} \leq \frac{N}{N - 1},
$$

(1.110)

$$
|z_j| < |z_j| \quad \text{and} \quad \frac{|I \cup z_j| - |z_j|}{|z_j| - |z_j|} < \epsilon, \quad \text{and}
$$

(1.111)

$$
|I \cap z_j| - \epsilon|I \cup z_j| - (1 - \epsilon)|z_j| > T_j (I, t=0),
$$

(1.112)

respectively.

<table>
<thead>
<tr>
<th>original equation</th>
<th>new equation</th>
</tr>
</thead>
<tbody>
<tr>
<td>(1.35)</td>
<td>$T_{j_{1}} =</td>
</tr>
<tr>
<td></td>
<td>$T_{j_{2}} =</td>
</tr>
<tr>
<td>(1.36)</td>
<td>$T_{j_{1}} =</td>
</tr>
<tr>
<td></td>
<td>$T_{j_{2}} =</td>
</tr>
<tr>
<td>(1.41)(1.42)</td>
<td>$T_{j} =</td>
</tr>
<tr>
<td></td>
<td>$T_{j} =</td>
</tr>
<tr>
<td>(1.49)</td>
<td>$T_{j}(new) =</td>
</tr>
<tr>
<td>(1.59)</td>
<td>$O_{j} = \epsilon</td>
</tr>
<tr>
<td>(1.62)</td>
<td>$O_{j} = \epsilon</td>
</tr>
<tr>
<td>(1.63)</td>
<td>$O_{j} =</td>
</tr>
<tr>
<td>(1.87)</td>
<td>$</td>
</tr>
<tr>
<td>(1.88)</td>
<td>$\epsilon &gt; \frac{</td>
</tr>
</tbody>
</table>

Table 1.1

### 1.9. References

Neural Networks (WCNN’94), vol. I, pp. 713-722.


Appendix 2: A Real-Time Clustering Microchip Neural Engine

### 2.1. Hardware Oriented Attractive Properties of the ART 1 Algorithm

Two types of neural hardware engineers can be distinguished. The first designs "general purpose" hardware accelerators or systems that speed up neural algorithms running on conventional computers [2.5]-[2.13]. This kind of hardware allows considerable flexibility in the topology and operations of the neural systems. In this way algorithm researchers have a powerful tool to further develop neural algorithms, and industry engineers have some attractive chips that significantly speed up their commercial neural products. The second type of hardware engineer designs a real-time system for a specific application, selecting the best-suited algorithm and mapping it into hardware. This yields close-to-optimum hardware efficiency for a limited range of applications. The work described in this appendix falls into this second category of hardware engineering. The specific application is real-time clustering of binary input patterns.

A clustering device is a device able to build categories from a collection of patterns. A real-time clustering device has to be able to do this at the speed of arrival of the patterns. There are some clustering algorithms [2.14]-[2.19] that need to be trained off-line to build the categories. For a real-time clustering device, however, it would be desirable to use an algorithm that can be trained on-line: if a new pattern arrives, the algorithm updates its internal knowledge (instead of erasing all the accumulated knowledge and retraining with the old and new collection of patterns).

For the second type of neural hardware engineers, efficiently implementing a real-size neural network in hardware is not a trivial task. Many neural network algorithms are available in the literature which have been developed, studied, and optimized for applications through computer and/or software based systems. Consequently, when designing a hardware realization, engineers face many problems like excessive interconnectivity, high resolution of weights, high precision of operations, complicated operator requirements (e.g., integrals and derivatives), high number of neurons required for a real-world application, etc. Often some of these requirements can be relaxed, the topology modified, or the operations simplified, with no significant deterioration of the global operation of the neural system but with a considerable boost in hardware performance. Modifying neural algorithms to make them more VLSI-friendly and produce more efficient hardware should be a common practice among neural hardware engineers of the second type [2.20]-[2.23]. After selecting an appropriate neural algorithm, the next step consists of studying how far the algorithm can be simplified without performance degradation. The simplifications have to be hardware-oriented, so that the final combination of "theoretical algorithm" + "hardware circuit technique" results in a high performance real-time system. The success of the hardware system depends on the selection of the algorithm, the selection of a powerful circuit design technique, and how the algorithm is modified to efficiently "marry" the circuit technique so that the final system achieves optimum performance.

In the case of our application, real-time clustering of binary patterns, we chose the ART 1 algorithm mainly because of its attractive hardware-oriented properties (which will be highlighted below), as well as its theoretical computational properties (see Appendix 1) [2.3]. We also chose to slightly modify the mathematical ART 1 algorithm to obtain more efficient hardware. This modification (described in the previous Appendix) allows the use of simpler operations while preserving all the computational properties of the original ART 1 architecture [2.3], [2.4]. As an extra bonus, the hardware circuit introduces a significant speed improvement as it automatically parallelizes the sequential ART search process [2.2] inherent in the mathematical neural algorithm.

---

This Appendix is an extended version of the paper [2.1].

In performance comparison of hardware implementations, a common figure of merit is the number of interconnections per second. More refined figures have been proposed that include resolution and precision [2.24]. However, these figures would be reasonably fair criteria for the first type of hardware engineering mentioned above, the general-purpose one. In order to compare hardware systems of the second type, the specific-application neural hardware, some global figure must be used that evaluates the overall system performance. Usually this figure will be application dependent. In our case, since we are concerned with a real-time clustering application of binary input patterns, an appropriate figure of merit might be

$$ppc/s = \frac{\text{number of patterns processed}}{\text{seconds}} \times \text{pixels} \times \text{categories}$$

(2.1)

where,

- **number of patterns processed/second** is the speed at which patterns are classified and learned (including the number of learning trials required). This speed generally depends on the patterns themselves, and on the knowledge already stored in the system. Therefore, this speed can be given as an average or as the slowest case measured.
- **pixels** is the maximum number of pixels of the input patterns.
- **categories** is the maximum number of categories the system is able to form.

As we will see later in the Section on experimental results, the chip described here is able to cluster up to 18 different categories of binary patterns with 100 pixels, while classifying and learning each pattern in less than 1.8μs. Since ART 1 learns on-line, 1 iteration of input patterns presentations provides the system with sufficient knowledge to perform properly\(^1\). This results in a ppc/s of

$$ppc/s = \frac{n \text{ patterns}}{1 \text{ iteration} \times n \text{ patterns} \times 1.8\mu s} \times 100 \text{ pixels} \times 18 \text{ categories} = 1.0 \times 10^9 \text{ ppc/s}$$

(2.2)

If we wanted to obtain the same performance using Backpropagation-based hardware, and assuming the network would learn with 10,000 iterations of pattern presentations, a time of 180ps would be needed for each pattern classification and corresponding weight update. Assuming this task could be performed with a Backpropagation network with 100 input neurons, 5 hidden-layer neurons, and 5 output neurons\(^2\) (which means a total of $100 \times 5 + 5 \times 5 = 525$ interconnections), and that the speed of feedforward classification is the same as for feedback learning, hardware able to perform

\[
\frac{2 \times 525 \text{ connections}}{180\text{ps}} = 5.83 \times 10^{12} \text{ connections/s plus connection-updates/s} \tag{2.3}
\]

would be needed. For the chip described here, since it is based on the powerful ART 1 algorithm, the above performance can be achieved with hardware of only \(4.4 \times 10^9 \text{ connections/s plus connection-updates/s}\), as discussed in Subsection B of Section 2.3.

---

1. The input pattern set can be iterated several times to stabilize the internal weights, but this is not necessary for the system to start working.
2. Optimistically, a Backpropagation net with 5 output nodes might be able to code up to $2^5$ categories.
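
The arithmetic behind eqs. (2.1)-(2.3) can be reproduced with a few lines. This is only a sketch of the calculation using the figures quoted in the text; the function and variable names are illustrative assumptions.

```python
def ppc_per_second(time_per_pattern_s, pixels, categories, iterations=1):
    """Figure of merit of eq. (2.1): patterns/second x pixels x categories.

    `iterations` is the number of passes over the pattern set needed
    before the system has learned (1 for the on-line ART 1 case).
    """
    patterns_per_second = 1.0 / (iterations * time_per_pattern_s)
    return patterns_per_second * pixels * categories


# ART 1 chip: 100 pixels, 18 categories, 1.8 us per pattern, 1 iteration.
art1_ppc = ppc_per_second(1.8e-6, pixels=100, categories=18)

# Backpropagation comparison: 10,000 iterations must fit in the same 1.8 us.
bp_time_per_pattern = 1.8e-6 / 10_000            # seconds per presentation
bp_connections = 100 * 5 + 5 * 5                 # 525 interconnections
# Factor 2: one feedforward pass plus one weight update per connection.
bp_rate = 2 * bp_connections / bp_time_per_pattern

print(f"ART 1 figure of merit: {art1_ppc:.2e} ppc/s")
print(f"Backpropagation time per pattern: {bp_time_per_pattern * 1e12:.0f} ps")
print(f"Backpropagation rate required: {bp_rate:.2e} connections/s")
```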

Note that the Backpropagation algorithm is not appropriate for clustering applications, and comparing it against ART 1 is slightly unfair. There are other algorithms available in the literature that have been developed especially for clustering applications [2.14]-[2.19]. However, they usually do not provide all the computational properties mentioned in Appendix 1, especially the "On-Line Learning" property, which is crucial for real-time clustering, or they present serious difficulties when mapped into hardware.

Another attractive hardware feature that an ART 1 based implementation offers with respect to others is that the interconnection weights do not have to be analog, as shown in the previous Appendix. Most of the neural algorithms reported in the literature require a real-valued set of weights defined within a certain interval. These weights can be discretized in a number of digital steps, but the granularity required for proper operation of the system is usually very fine (around 16 bits for the Backpropagation algorithm [2.25]). Even worse, in some cases the granularity requirements become more severe as the size of the system increases. For example, in a BAM system [2.26] of \(N \times M\) neurons, storage capacity has been heuristically estimated to be around \(n_p = (N \times M)^{1/4}\) [2.27], where \(n_p\) is the average maximum number of patterns that can be stored. The resolution required by the interconnection weights in this case is at least \(n_p + 1\). In the chip described in this Appendix, since it is based on the ART 1 algorithm and requires only binary-valued weights, the resolution of the weights is affected by neither the size nor the storage capacity of the system. This, together with the fact that analog weights are not needed, is one of the most attractive hardware features of the ART 1 algorithm.

Another consideration to take into account during the design of a hardware system is how it scales up with size and performance. We have already mentioned that some neural systems need to increase their weight resolution as they scale up. Another feature is how their size and interconnectivity scale up with pattern size or storage capacity. For an ART 1 based system, the number of neurons \(N\) in the bottom layer is the number of pixels of the patterns, the number of neurons \(M\) in the top layer is the maximum number of categories, and \(N \times M\) is the number of synapses. This system scales up linearly with storage capacity \((M)\) and input pixels \((N)\). For a BAM system, for example, the size scales quadratically with the storage capacity and the number of pixels.

Section 2.4 will present other scaling considerations, more directly related to the hardware technique selected. In analog hardware, random and systematic errors due to fabrication process variations will appear. A neural network can usually cope very well with random errors, even if the size of the system increases. However, systematic errors may accumulate as the system grows and may render the complete network useless as it scales up. The chosen circuit technique must be either insensitive to the accumulation of systematic errors, or allow for some kind of calibration technique to overcome them.

Regarding hardware implementations of the ART 1 architecture, several attempts have been reported in the literature. Ho et al. suggested a Type-1 implementation\(^3\) [2.28]. Tsay and Newcomb proposed a CMOS circuit technique that would realize a partial Type-2 implementation [2.29]; Wunsch et al. [2.30] have built optical-based Type-3 implementations; this work presents a CMOS VLSI Type-3 circuit.

The next Section describes the circuit implementation of the modified ART 1 (see Appendix 1) algorithm using analog current-mode circuit design techniques.

### 2.2. Circuit Description

Fig. 2.1 shows the VLSI-friendly Type-3 ART 1 algorithm as described in the previous Appendix, which has been mapped into hardware.

The operations in Fig. 2.1 that need to be implemented are the following:

- Generation of the terms $T_j$ or “choice functions”. Since $z_{ij}$ and $I_i$ are binary valued (0 or 1), “binary multiplication” and addition/subtraction operations are required.
- Winner-Take-All (WTA) operation to select the maximum $T_j$ term.
- Comparison of the term $\rho|I|$ with $|I \cap z_j|$.
- Deselection of the term $T_j$ if $\rho|I| > |I \cap z_j|$.
- Update of weights.
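
The operations listed above can be put together in a short behavioural sketch of one pattern presentation. It is a software model under stated assumptions, not the circuit itself: the function name `present_pattern`, the list-of-lists weight storage, and the tie-breaking rule (lowest index wins) are all hypothetical, and the weight update uses the fast-learning rule in which the winning template keeps only the pixels it shares with the input.

```python
def present_pattern(I, Z, L_A, L_B, L_M, rho):
    """Present binary pattern I to templates Z and learn it.

    Z is a list of binary templates z_j (modified in place).
    Returns the index of the winning category, or None if every
    template fails the vigilance test.
    """
    size_I = sum(I)
    best_j, best_T = None, None
    for j, z in enumerate(Z):
        match = sum(a & b for a, b in zip(I, z))        # |I ∩ z_j|
        if rho * size_I > match:
            continue                                    # T_j is deselected
        T = L_A * match - L_B * sum(z) + L_M            # choice function T_j
        if best_T is None or T > best_T:                # Winner-Take-All
            best_j, best_T = j, T
    if best_j is not None:
        # Weight update: z_iJ changes from 1 to 0 wherever I_i = 0.
        Z[best_j] = [a & b for a, b in zip(I, Z[best_j])]
    return best_j


# Three uncommitted categories, six pixels, alpha = L_A / L_B = 2.
Z = [[1] * 6 for _ in range(3)]
first = present_pattern([1, 1, 0, 0, 0, 0], Z, L_A=2, L_B=1, L_M=10, rho=0.5)
second = present_pattern([0, 0, 0, 1, 1, 1], Z, L_A=2, L_B=1, L_M=10, rho=0.5)
repeat = present_pattern([1, 1, 0, 0, 0, 0], Z, L_A=2, L_B=1, L_M=10, rho=0.5)
```

In this run the first pattern is coded by category 0, the second one fails the vigilance test for category 0 and is coded by category 1, and a repeated presentation of the first pattern goes directly to category 0.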

The first three operations require a certain amount of precision, while the last two do not. We aimed at a precision between 1 and 2% (equivalent to 6 bits) for our circuit, while handling input patterns of up to 100 binary pixels. Fig. 2.2 shows a possible hardware block diagram that would physically implement the algorithm of Fig. 2.1. The circuit consists of an 18×100 array of synapses $S_{11}, S_{12}, \ldots, S_{18,100}$, a 1×100 array of controlled current sources $C_1, C_2, \ldots, C_{100}$, two 1×18 arrays of unity-gain current mirrors $CMA_1, \ldots, CMA_{18}, CMB_1, \ldots, CMB_{18}$, a 1×18 array of current comparators $CC_1, \ldots, CC_{18}$, an 18-input WTA circuit, two 18-output unity-gain current mirrors $CMM$ and $CMC$, and an adjustable-gain ($0 < \rho \leq 1$) current mirror. Registers $R1, \ldots, R18$ and the NOR gate are optional, and their function is explained later.

Fig. 2.1: Type-3 implementation algorithm of the modified VLSI-friendly ART 1 architecture

---

3. For an explanation of the terminology Type-1, Type-2, and Type-3 implementation refer to Appendix 1, Section 1.2.

Each synapse receives two input signals $y_j$ and $I_i$, has two global control signals RESET and LEARN, stores the value of $z_{ij}$, and generates two output currents:

- the first goes to the input of current mirror $CMA_j$ and is $L_A z_{ij} I_i - L_B z_{ij}$.
- the second goes to the input of current mirror $CMB_j$ and is $L_A z_{ij} I_i$.

All synapses in the same row $j$ ($S_{j1}, S_{j2}, \ldots, S_{j100}$) share the two nodes ($N_j$ and $N'_j$) into which the currents they generate are injected. Therefore, the input of current mirror $CMA_j$ receives the current

$$T_j = L_A \sum_{i=1}^{100} z_{ij} I_i - L_B \sum_{i=1}^{100} z_{ij} + L_M = L_A |I \cap z_j| - L_B |z_j| + L_M$$ (2.4)

while the input of current mirror $CMB_j$ receives the current

$$L_A \sum_{i=1}^{100} z_{ij} I_i = L_A |I \cap z_j|$$ (2.5)

Current $L_M$, which is replicated 18 times by current mirror $CMM$, has an arbitrary value as long as it ensures that the terms $T_j$ are positive.

Each element of the array of controlled current sources $C_i$ has one input signal $I_i$ and generates the current $L_A I_i$. All elements $C_i$ share their output node, so that the total current they generate is $L_A |I|$. This current reaches the input of the adjustable-gain \( \rho \) current mirror, and is later replicated 18 times by current mirror CMC.

Each of the 18 current comparators \( CC_j \) receives the current \( L_A|\mathbf{I} \cap \mathbf{z}_j| - L_A\rho|\mathbf{I}| \) and compares it against zero. If this current is positive, the output of the current comparator falls, but if the current is negative the output rises. Each current comparator \( CC_j \) output controls input \( c_j \) of the WTA. If \( c_j \) is high the current sunk by the WTA input \( i_j \) (which is \( T_j \)) will not compete for the winning node. On the contrary, if \( c_j \) is low, input current \( T_j \) will enter the WTA competition. The outputs of the WTA \( \bar{y}_j \) are all high, except for that which receives the largest \( \bar{c}_j T_j \); such output, denoted \( \bar{y}_J \), will fall.

Now we can describe the operation of the circuit in Fig. 2.2. All synaptic memory values \( z_{ij} \) are initially set to ‘1’ by the RESET signal. Once the input vector \( \mathbf{I} \) is activated, the 18 rows of synapses generate the currents \( L_A|\mathbf{I} \cap \mathbf{z}_j| - L_B|\mathbf{z}_j| \) and \( L_A|\mathbf{I} \cap \mathbf{z}_j| \), and the row of controlled current sources \( C_1, \ldots, C_{100} \) generates the current \( L_A|\mathbf{I}| \). Each current comparator \( CC_j \) will prevent current \( T_j = L_A|\mathbf{I} \cap \mathbf{z}_j| - L_B|\mathbf{z}_j| + L_M \) from competing in the WTA if \( \rho|\mathbf{I}| > |\mathbf{I} \cap \mathbf{z}_j| \). Therefore, the effective WTA inputs are \( \{ \bar{c}_j T_j \} \), from which the WTA chooses the maximum, making the corresponding output \( \bar{y}_J \) fall. Once \( \bar{y}_J \) falls, and assuming the synaptic control signal \( \text{LEARN} \) is low, all \( z_{iJ} \) values for which \( I_i = 0 \) will change from ‘1’ to ‘0’.

Note that initially (when all \( z_{ij} = 1 \)),
\[
\bar{c}_j T_j = L_A|\mathbf{I}| - L_B N + L_M \quad (N=100) \quad \forall j \tag{2.6}
\]
This means that the winner will be chosen among 18 equal competing inputs, basing the election on mismatches due to random process parameter variations of the transistors. Even after some categories are learned, there will be a number of uncommitted rows (\( z_{1j} = \ldots = z_{100j} = 1 \)) that generate the same competing current of eq. (2.6). The operation of a WTA circuit in which more than one input is equal and winning becomes more difficult and, in the best case, results in slower operation. To avoid these problems 18 D-registers, \( R1, \ldots, R18 \), might be added. Initially these registers are set to ‘1’ so that the WTA inputs \( s_2, \ldots, s_{18} \) are high. Inputs \( s_1, \ldots, s_{18} \) have the same effect as inputs \( c_1, \ldots, c_{18} \): if \( s_j \) is high \( T_j \) does not compete for the winner, but if \( s_j \) is low \( T_j \) enters the WTA competition. Therefore, initially only \( \bar{c}_1 T_1 \) competes for the winner. As soon as \( \bar{y}_1 \) rises once, the input of register \( R1 \) (which is ‘0’) is transmitted to its output making \( s_2 = 0 \). Now both \( \bar{c}_1 T_1 \) and \( \bar{c}_2 T_2 \) will compete for the winner. As soon as \( \bar{c}_2 T_2 \) wins once, the input of register \( R2 \) is transmitted to its output making \( s_3 = 0 \). Now \( \bar{c}_1 T_1, \bar{c}_2 T_2, \) and \( \bar{c}_3 T_3 \) will compete, and so on. Once all available \( F_2 \) nodes (\( y_1, \ldots, y_{18} \)) have won once, the “FULL” signal rises, indicating that all \( F_2 \) nodes are storing a category. The WTA control signal “ER” enables operation of the registers.
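
The sequential enabling of uncommitted nodes described above can be modelled behaviourally as follows. This is a sketch under simplifying assumptions: the class `EnableRegisters` and its method names are hypothetical, nodes are indexed from 0 instead of 1, and signal polarities and clock edges are abstracted into booleans.

```python
class EnableRegisters:
    """Behavioural model of the optional registers R1..R18.

    Node 0 may always compete; node j (j > 0) may compete only after
    node j - 1 has won at least once.
    """

    def __init__(self, n_nodes=18):
        self.has_won = [False] * n_nodes

    def competes(self, j):
        """True if T_j is allowed to enter the WTA competition (s_j low)."""
        return j == 0 or self.has_won[j - 1]

    def record_winner(self, j):
        """Register that node j has won a competition."""
        self.has_won[j] = True

    @property
    def full(self):
        """Models the FULL signal: every node has won at least once."""
        return all(self.has_won)


regs = EnableRegisters(n_nodes=3)
enabled_at_start = [regs.competes(j) for j in range(3)]
regs.record_winner(0)
enabled_after_first_win = [regs.competes(j) for j in range(3)]
regs.record_winner(1)
regs.record_winner(2)
```

In this way at most one uncommitted row takes part in each competition, which avoids ties between identical inputs.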

A. Synaptic Circuit and Controlled Current Sources:

The details of a synapse \( S_{ij} \) are shown in Fig. 2.3(a). It consists of three current sources (two of value \( L_A \) and one of value \( L_B \)), a two-inverter loop (acting as a Flip-Flop), and nine MOS transistors working as switches. As can be seen in Fig. 2.3(a), each synapse generates the currents \( L_A z_{ij} I_i - L_B z_{ij} \) and \( L_A z_{ij} I_i \). The RESET control signal sets $z_{ij}$ to '1'. Learning is performed by making $z_{ij}$ change from '1' to '0' whenever $\text{LEARN} = 0$, $\bar{y}_j = 0$, and $I_i = 0$.

Fig. 2.3(b) shows the details of each controlled current source $C_i$. If $I_i = 0$ no current is generated, while if $I_i = 1$, the current $L_A$ is provided.

Fig. 2.3: (a) Details of Synapse Circuit $S_{ij}$, (b) Details of Controlled Current Source Circuit $C_i$

Fig. 2.4: Circuit Schematic of Winner-Take-All (WTA) Circuit

Fig. 2.5: (a) Circuit Schematic of Current Comparator. (b) Circuit Schematic of Active-Input Regulated-Cascode Current Mirror. (c) Circuit Schematic for Adjustable Gain \( \rho \) Current Mirror

B. Winner-Take-All (WTA) Circuit:

Fig. 2.4 shows the details of the WTA circuit. It is based on Lazzaro's WTA [2.31], which consists of the array of transistors \( MA \) and \( MB \), and the current source \( I_{BIAS} \). Transistor \( MC \) has been added to introduce a cascode effect and increase the gain of each cell. Transistors \( MX, MY, \) and \( MZ \) transform the output current into a voltage, which is then inverted to generate \( \bar{y}_j \). Transistor \( MT \) disables the cell if \( c_j \) is high, so that the input current \( T_j \) will not compete for the winner. Transistors \( MS \) and \( ME \) have the same effect as transistor \( MT \): if signals \( ER \) and \( s_j \) are high, \( T_j \) will not compete.

C. Current Comparators:

The circuit used for the current comparators is shown in Fig. 2.5(a). Such a comparator forces an input voltage approximately equal to the inverter's trip voltage, has extremely high resolution (less than 1pA), and can be extremely fast (on the order of 10-20ns for inputs around 10\( \mu \)A) [2.32].

D. Current Mirrors:

Current Mirrors \( CMA1, ..., CMA18, CMB1, ..., CMB18, CMM, CMC \), and the \( \rho \)-gain mirror have been laid out using common centroid layout techniques to minimize matching errors and keep the 6-bit precision of the overall system. For current mirrors \( CMA1, ..., CMA18 \) and \( CMB1, ..., CMB18 \) a special topology has been used, shown in Fig. 2.5(b) [2.33]. This topology forces a constant voltage \( V_D \) at its input node, thus producing a virtual ground in the output nodes of all synapses, which reduces channel length modulation distortion and improves matching between the currents generated by all synapses. In addition, the topology of Fig. 2.5(b) presents a very wide current range with small matching errors [2.33].

The adjustable gain \( \rho \) current mirror also uses this topology, as shown in Fig. 2.5(c). Transistor \( M0 \) has a geometry factor \( (W/L) \) 10 times larger than transistors \( M1, ..., M10 \). Transistors \( MR1, ..., MR10 \) act as switches (controlled by signals \( r_1, ..., r_{10} \)), so that the gain of the current mirror can be adjusted from \( \rho = 0.0 \) to \( \rho = 1.0 \) in steps of 0.1, while maintaining \( r_0 = 0 \). By making \( r_0 \) higher than 0 Volts, \( \rho \) can be fine tuned.
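
The coarse gain selection can be illustrated with an idealized model in which each connected output transistor contributes one tenth of the input current. The function names are hypothetical, perfect matching is assumed, the fine-tuning input \( r_0 \) is ignored, and a boolean `True` simply stands for "output transistor connected" regardless of the actual polarity of the control signals.

```python
def mirror_gain(connected):
    """Ideal gain of the adjustable mirror for ten switch states.

    connected[k] is True when output transistor M(k+1) contributes
    to the output; each contributes 1/10 of the input current.
    """
    if len(connected) != 10:
        raise ValueError("exactly ten switch states are expected")
    return sum(1 for state in connected if state) / 10


def switches_for(rho):
    """Switch states giving the multiple of 0.1 closest to rho."""
    n_on = round(rho * 10)
    if not 0 <= n_on <= 10:
        raise ValueError("rho must lie between 0.0 and 1.0")
    return [k < n_on for k in range(10)]


setting = switches_for(0.7)
gain = mirror_gain(setting)
```

For a requested vigilance of 0.7, seven of the ten output transistors are connected in this model.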

E. Synaptic Current Sources:

The current sources $L_A$ and $L_B$ inside each synapse $S_{ij}$ and the controlled current sources $C_i$ have to match within approximately 1% to maintain the system's 6-bit precision. There is a total of $100 \times 18 \times 2 + 100 = 3700$ $L_A$ current sources and $100 \times 18 = 1800$ $L_B$ current sources spread over a die area of 1cm$^2$ which have to match within 1%. For such distances, number of current sources, and reasonable current values, a spread of 10% in the currents would be an optimistic estimate. However, a single current mirror with a reduced number of outputs (like 10), a reasonable transistor size (like 40$\mu$m $\times$ 40$\mu$m), a moderate current (around 10$\mu$A), and common centroid layout techniques can be expected to have a mismatch error standard deviation $\sigma_q$ of less than 1% [2.34]. By cascading several of these current mirrors in a tree-like fashion, as shown in Fig. 2.6 (for current sources $L_B$), a high number of current sources (copied from a single common reference) can be generated with a total mismatch given by

$$\sigma_{Total}^2 = \sigma_1^2 + \sigma_2^2 + \ldots + \sigma_q^2$$  \hspace{1cm} (2.7)

Each current mirror stage introduces an error $\sigma_k$. This error can be reduced by increasing the transistor areas of the current mirrors. Since the last stage $q$ has a higher number of current mirrors, it is important to keep their area low. For previous stages the transistors can be made larger to contribute a smaller $\sigma_k$, because they are fewer in number and will not contribute significantly to the total transistor area. For current sources $L_A$, a circuit similar to that shown in Fig. 2.6 is used. Current $L_B$ in Fig. 2.6 (and similarly current $L_A$) is injected externally into the chip so that parameter $\alpha = L_A/L_B$ can be controlled.

F. Weights Read Out:

The switches $sw_1$ to $sw_{100}$ of Fig. 2.6 were added to enable reading out the internally learned synaptic weights $z_{ij}$, and to test the progress of the learning algorithm. These switches are all ON during normal operation of the system. However, for weights read-out, all except one will be OFF. The switch that is ON is selected by a decoder inside the chip, so that only column $i$ of the synaptic array of Fig. 2.2 injects the current \( z_{ij}L_B \) to nodes \( N_j \). All nodes \( N_j \) can be isolated from current mirrors \( CMA_j \), and connected to output pads to sense the currents \( z_{ij}L_B \), thus measuring the values of \( z_{ij} \).

G. Modular System Expansibility:

The circuit of Fig. 2.2 can be expanded both horizontally, increasing the number of input pixels from 100 to \( 100 \times N \), and vertically, increasing the number of possible categories from 18 to \( 18 \times M \). Fig. 2.7 shows schematically the interconnectivity between chips in the case of a \( 2 \times 2 \) array.

Vertical expansion of the system is possible by making several chips share the input vector terminals \( I_1, \ldots, I_{100} \), and node \( V_{\text{COMMON}} \) of the WTA (see Fig. 2.4). Thus, the only requirement is that \( V_{\text{COMMON}} \) be externally accessible. Horizontal expansion is directly possible by making all chips in the same row share their \( N_j \), \( N'_j \), and \( N''_j \) nodes, and isolating all except one of them from the current mirrors \( CMA1, \ldots, CMA18, CMB1, \ldots, CMB18 \), and the adjustable gain \( \rho \)-mirror. Also, all synapse inputs \( \bar{y}_j \) must be shared.

Both vertical and horizontal expansion degrade the system performance. Vertical expansion causes degradation because the WTA becomes distributed among several chips. For the WTA of Fig. 2.4, all MA and MB transistors must match well, which is very unlikely if they are in different chips. A solution for this problem is to use a WTA topology based on current processing and replication, insensitive to inter-chip transistor mismatches [2.35], [2.36].

Horizontal expansion degrades the performance because current levels have to be changed:

- Either currents \( L_A \) and \( L_B \) are kept the same, which forces the current mirrors \( CMA_j \), \( CMB_j \), \( CMM \), \( 1{:}\rho \), \( CMC \), the current comparators \( CC_j \), and the WTA to handle higher currents. This may cause malfunctioning due to eventual saturation in some of the blocks.
- Or currents $L_A$ and $L_B$ are scaled down so that the current mirrors $CMA_j$, $CMB_j$, $CMM$, $1{:}\rho$, $CMC$, the current comparators $CC_j$, and the WTA handle the same current level. However, this produces an increase in mismatch between the current sources $L_A$ and $L_B$.

### 2.3. Experimental Results

A prototype chip implementing the real-time clustering engine described above has been fabricated in a standard double-poly double-metal 1.6μm CMOS digital process (Eurochip ES2). The die area is 1cm$^2$ and the chip has been mounted in a 120-pin PGA package. This chip implements an ART 1 system with 100 nodes in the $F_1$ layer and 18 nodes in the $F_2$ layer. Most of the pins are intended for test and characterization purposes. All the subcircuits in the chip can be isolated from the rest and conveniently characterized. The $F_1$ input vector $I$, which has 100 components, has to be loaded serially through one of the pins into a shift register. The time delay measurements reported here do not include the time for loading the shift register.

The experimental measurements provided in this Section have been divided into four parts. The first describes DC characterization results of the elements that contribute critically to the overall system precision. These elements are the WTA circuit and the synaptic current sources. The second describes time delay measurements that contribute to the global throughput time of the system. The third presents system level experimental behaviors obtained with digital test equipment (HP82000). Finally, the fourth focuses on yield and fault tolerance characterizations.

A. System Precision Characterizations:

The ART 1 chip was intended to achieve an equivalent 6-bit (~1.5% error) precision. The part of the system that is responsible for the overall precision is formed by the components that perform analog computations. These components are (see Fig. 2.2) all current sources $L_A$ and $L_B$, all current mirrors $CMA_j$, $CMB_j$, $CMM$, $CMC$, and the $\rho$-mirror, the current comparators $CC_j$, and the WTA circuit. The most critical of these components (in precision) is the WTA circuit. Current sources and current mirrors can be made to have mismatch errors below 0.2% [2.34], [2.37]-[2.39], at the expense of increasing transistor area and current, decreasing distances between matched devices, and using common centroid layout techniques [2.40]. This is feasible for current mirrors $CMA_j$, $CMB_j$, $CMM$, $CMC$, and the $\rho$-mirror, which appear in small numbers. However, the area and current level are limited for the synaptic current sources $L_A$ and $L_B$, since there are many of them. Therefore, the WTA and the current sources $L_A$ and $L_B$ are the elements that limit the precision of the overall system, and their characterization results will be described next.

<table>
<thead>
<tr>
<th>$T_j$</th>
<th>10μA</th>
<th>100μA</th>
<th>1mA</th>
</tr>
</thead>
<tbody>
<tr>
<td>$\sigma (T_j)$</td>
<td>1.73%</td>
<td>0.86%</td>
<td>0.99%</td>
</tr>
</tbody>
</table>

Table 2.1. Precision of the WTA
Appendix 2: A Real-Time Clustering Microchip Neural Engine. Page: 76

Fig. 2.8: Measured Mismatch error (in %) between 18 arbitrary $L_A$ current sources

A.1: WTA Precision Measurements:

$L_A$ and $L_B$ will have current values of 10μA or less. The maximum current a WTA input branch can receive is (see eq. (2.4)),

$$T_j^{max} = L_M + \left[ \sum_{i=1}^{100} z_{ij} (L_A I_i - L_B) \right]_{max} = L_M + 100 (L_A - L_B) \quad (2.8)$$

which corresponds to the case where all $z_{ij}$ and $I_i$ values are equal to ‘1’ (remember that $L_A > L_B > 0$). In our circuit the WTA was designed to handle input currents of up to 1.5mA for each input branch. To measure the precision of the WTA, all input currents except two were set to zero. Of these two inputs, one was set to 100μA and the other was swept between 98μA and 102μA. This causes their corresponding output voltages $\bar{y}_j$ to indicate an interchange of winners. The transitions do not occur exactly at 100μA, and the transition point varies from one input branch to another. The standard deviation of these transitions was measured as $\sigma=0.86μA$ (or 0.86%). Table 2.1 shows the standard deviation (in %) measured when the constant current is set to 10μA, 100μA, and 1mA.
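As a hedged numerical illustration of eq. (2.8), the following sketch evaluates the branch current for binary vectors and checks that the all-ones case gives the largest value. The function name `branch_current` is hypothetical, and the bias values ($L_A$ = 3.2μA, $L_B$ = 3.0μA, $L_M$ = 400μA) are borrowed from the system-level experiments of Fig. 2.11 purely as an example operating point.

```python
import random

def branch_current(I, z, LA=3.2, LB=3.0, LM=400.0):
    """T_j = L_M + sum_i z_ij * (L_A * I_i - L_B), all currents in microamps."""
    return LM + sum(zi * (LA * Ii - LB) for Ii, zi in zip(I, z))

N = 100  # number of F1 nodes on the chip
ones = [1] * N
T_max = branch_current(ones, ones)  # eq. (2.8): L_M + 100 * (L_A - L_B)

# Random binary patterns never exceed the all-ones case, because each
# term of the sum contributes at most L_A - L_B > 0.
random.seed(0)
samples = [
    branch_current([random.randint(0, 1) for _ in range(N)],
                   [random.randint(0, 1) for _ in range(N)])
    for _ in range(1000)
]
print(f"T_max = {T_max:.1f} uA, largest random sample = {max(samples):.1f} uA")
```

With these example values the maximum is about 420μA, inside the 1.5mA range that the WTA inputs were designed to handle.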

A.2: Synaptic Current Sources Precision Measurements:

The second critical precision error source of the system is the mismatch between synaptic current sources. In our chip each of the 3700 $L_A$ current sources and each of the 1800 $L_B$ current sources could be isolated and independently characterized. Fig. 2.8 shows the measured mismatch error (in %) for 18 arbitrary $L_A$ current sources when sweeping $L_A$ between 0.1μA and 10μA. As can be seen in Fig. 2.8, for currents higher than 5μA the standard deviation of the mismatch error is close to 1%. The same result is obtained for the $L_B$ current sources.
Table 2.2. Delay times of the WTA

<table>
<thead>
<tr>
<th>$T_1^a$</th>
<th>$T_1^b$</th>
<th>$T_2$</th>
<th>$T_3, \ldots T_{18}$</th>
<th>$t_{d1}$</th>
<th>$t_{d2}$</th>
</tr>
</thead>
<tbody>
<tr>
<td>0µA</td>
<td>200µA</td>
<td>100µA</td>
<td>0</td>
<td>550ns</td>
<td>570ns</td>
</tr>
<tr>
<td>0mA</td>
<td>1mA</td>
<td>500µA</td>
<td>0</td>
<td>210ns</td>
<td>460ns</td>
</tr>
<tr>
<td>100µA</td>
<td>150µA</td>
<td>125µA</td>
<td>100µA</td>
<td>660ns</td>
<td>470ns</td>
</tr>
<tr>
<td>400µA</td>
<td>600µA</td>
<td>500µA</td>
<td>400µA</td>
<td>440ns</td>
<td>400ns</td>
</tr>
<tr>
<td>500µA</td>
<td>1.50mA</td>
<td>1.00mA</td>
<td>500µA</td>
<td>230ns</td>
<td>320ns</td>
</tr>
<tr>
<td>90µA</td>
<td>110µA</td>
<td>100µA</td>
<td>0</td>
<td>1.12µs</td>
<td>1.11µs</td>
</tr>
<tr>
<td>490µA</td>
<td>510µA</td>
<td>500µA</td>
<td>0</td>
<td>1.19µs</td>
<td>1.06µs</td>
</tr>
<tr>
<td>990µA</td>
<td>1.01mA</td>
<td>1.00mA</td>
<td>0</td>
<td>380ns</td>
<td>920ns</td>
</tr>
</tbody>
</table>

B. Throughput Time Measurements:

For a real-time clustering device the throughput time is defined as the time needed for each input pattern to be processed. During this time the input pattern has to be classified into one of the pre-existing categories or assigned to a new one, and the pre-existing knowledge of the system has to be updated to incorporate the new information the input pattern carries. From a circuit point of view, this translates into the measurement of two delay times:

- The time needed by the WTA to select the maximum among all \( \{ \tilde{c}_j T_j \} \).
- The time needed by the synaptic cells to change \( z_{ij} \) from its old value to \( I_i z_{ij} \).

B.1: WTA Delay Measurements:

The delay introduced by the WTA depends on the current level present in the competing input branches. This current level will depend on the values chosen for \( L_A, L_B, \) and \( L_M \), as well as on the input pattern \( I \) and all internal weights \( z_j \). To keep the presentation simple, delay times will be given as a function of \( T_j \) values directly. Table 2.2 shows the measured delay times when \( T_1 \) changes from \( T_1^a \) to \( T_1^b \), and \( T_2 \) to \( T_{18} \) have the values given in the table. \( t_{d1} \) is the time needed by category \( y_1 \) to win when \( T_1 \) switches from \( T_1^a \) to \( T_1^b \), and \( t_{d2} \) is the time spent by category \( y_2 \) in winning when \( T_1 \) decreases from \( T_1^b \) to \( T_1^a \). As can be seen, this delay is always below 1.2µs.

For the cases when the vigilance criterion is not directly satisfied and hence comparators \( CC_j \) cut some of the \( T_j \) currents, an additional delay is observed. This extra delay has been measured to be less than 400ns for the worst cases. Therefore, the time needed until the WTA selects the maximum among all \( \{ \tilde{c}_j T_j \} \) is less than \( 1.2\mu s + 0.4\mu s = 1.6\mu s \).
B.2: Learning Time:

After a delay of 1.6μs (so that the WTA can settle), the learn signal \( \text{LEARN} \) (see Fig. 2.2) is enabled during a time \( t_{\text{LEARN}} \). To measure the minimum \( t_{\text{LEARN}} \) time required, this time was set to a specific value during a training/learning trial, and it was checked that the weights had been updated properly. By progressively decreasing \( t_{\text{LEARN}} \) until some of the weights did not update correctly, it was found that the minimum \( t_{\text{LEARN}} \) time for proper operation was 190ns. By setting \( t_{\text{LEARN}} \) to 200ns and allowing the WTA a delay of 1.6μs, the total throughput time of the ART 1 chip is established as 1.8μs.
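The throughput budget above can be retraced with a few lines of arithmetic. This is only a bookkeeping sketch: the delay lists are transcribed from Table 2.2, and the variable names are illustrative.

```python
# Measured WTA delays from Table 2.2, in nanoseconds.
t_d1 = [550, 210, 660, 440, 230, 1120, 1190, 380]
t_d2 = [570, 460, 470, 400, 320, 1110, 1060, 920]

worst_measured = max(t_d1 + t_d2)  # slowest transition in the table
wta_bound = 1200                   # bound quoted in the text (ns)
reset_extra = 400                  # extra delay when comparators CC_j cut currents (ns)
t_learn = 200                      # chosen LEARN pulse width (ns); measured minimum is 190 ns

# The quoted bound must cover every measured transition.
assert worst_measured <= wta_bound

throughput_ns = wta_bound + reset_extra + t_learn
print(worst_measured, throughput_ns)
```

The slowest measured transition is 1.19µs, so the 1.2µs bound leaves only a small margin, and the three contributions add up to the 1.8µs throughput time.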

B.3: Comparison with Digital Neural Processors:

A digital chip with a feedforward speed of \( a \) connections per second, a learning speed of \( b \) connection updates per second, and a WTA section with a delay of \( c \) seconds must satisfy the following equation to achieve a throughput time of 1.8μs when emulating the ART 1 algorithm of Fig. 2.1(c):

$$\frac{3700}{a} + \frac{100}{b} + c = 1.8\,\mu s \quad (2.9)$$

Note that there are 100 synapse weights \( z_{ij} \) to update for each pattern presentation, and 3700 feed-forward connections: 1800 connections to generate all \( T_j = L_A |I \cap z_j| - L_B |z_j| + L_M \), 1800 connections to generate all \( L_A |I \cap z_j| \), and 100 connections to generate \( L_A |I| \).

Assuming \( c = 100ns \) and \( a = b \), eq. (2.9) results in a processing speed of \( a = b = 2.2 \times 10^9 \) connections/s and connection-updates/s. A digital neural processor would require such figures of merit to equal the processing time of the analog ART 1 chip presented in this work. This “approximate reasoning” therefore leads us to conclude that our chip has an equivalent computing power of \( a + b = 4.4 \times 10^9 \) connections/s plus connection-updates/s.
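The estimate can be reproduced by solving eq. (2.9) for \( a \) under the stated assumptions \( a = b \) and \( c = 100ns \). The helper name `equivalent_speed` below is hypothetical and the sketch is only a check of the arithmetic.

```python
def equivalent_speed(n_ff=3700, n_upd=100, c=100e-9, t_total=1.8e-6):
    """Solve n_ff/a + n_upd/b + c = t_total for a, assuming a == b."""
    return (n_ff + n_upd) / (t_total - c)

a = equivalent_speed()
print(f"a = b = {a:.2e} per second, a + b = {2 * a:.2e} per second")
```

The result is roughly \( 2.2 \times 10^9 \), in line with the rounded figure quoted in the text.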

C. System Level Performance:

Although the internal processing of the chip is analog in nature, its input \( (I_i) \) and output \( (\bar{y}_j) \) are binary valued. Therefore, the system level behavior of the chip can be tested using conventional digital test equipment. In our case we used the HP82000 IC Evaluation System.

An arbitrary set of 100-bit input patterns \( \{I^k\} \), shown in Fig. 2.9, was chosen. A typical clustering sequence is shown in Fig. 2.10, for \( \rho = 0.7 \) and \( \alpha = L_A/L_B = 1.05 \). The first column indicates the input pattern \( I^k \) that is fed to the \( F_1 \) layer. The other 18 squares (10×10 pixels) in each row represent each of the internal \( z_j \) vectors after learning is finished. The vertical bars to the right of some \( z_j \) squares indicate that these categories won the WTA competition while satisfying the vigilance criterion. These winning categories are the only ones that are updated for that presentation of input pattern \( I^k \). The figure shows only two iterations of input pattern presentations, because no changes in the weights were observed after these. The last row of weights \( z_j \) indicates the resulting categorization of the input patterns. The numbers below each category indicate the input patterns that have been clustered into this category. In the following figures we will show only this last row of learned patterns together with the pattern numbers that have been clustered into each category.

Fig. 2.10: Clustering Sequence for $\rho=0.7$ and $\alpha=L_A/L_B=1.05$

Fig. 2.11: Categorization of the input patterns for $L_A=3.2\mu A$, $L_B=3.0\mu A$, $L_M=400\mu A$, and different values of $\rho$

Fig. 2.11 shows the categorizations that result when tuning the vigilance parameter $\rho$ to different values while the currents were set to $L_A = 3.2\mu A$, $L_B = 3.0\mu A$, and $L_M = 400\mu A$ ($\alpha = L_A/L_B = 1.07$). Note that below some categories there is no number. This is a known ART 1 behavior: during the clustering process some categories might be created that will not represent any of the training patterns. In Fig. 2.12 the vigilance parameter is maintained constant at $\rho = 0$, while $\alpha$ changes from 1.07 to 50. For a more detailed explanation on how and why the clustering behavior depends on $\rho$ and $\alpha$ see references [2.3] and [2.4], or other ART 1 theoretical papers [2.2], [2.41].
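To make the role of $\rho$ and of the currents $L_A$, $L_B$, $L_M$ concrete, the following is a minimal software sketch of one pattern presentation. It is not the chip's circuit behavior: it assumes the choice function $T_j = L_A|I \cap z_j| - L_B|z_j| + L_M$, the standard ART 1 vigilance test $|I \cap z_j| \geq \rho |I|$, weights initialised to all ones, a sequential search that stands in for the comparators and the WTA, and ties resolved in favor of the lowest index. The name `art1_present` and the 4-bit toy patterns are hypothetical.

```python
def art1_present(I, Z, rho, LA=3.2, LB=3.0, LM=400.0):
    """Present one binary pattern I to the categories Z (updated in place).

    Returns the index of the category that learned I, or None if every
    category fails the vigilance test.
    """
    norm_I = sum(I)
    overlap = [sum(a & b for a, b in zip(I, z)) for z in Z]  # |I ∩ z_j|
    # Choice function T_j = L_A|I ∩ z_j| - L_B|z_j| + L_M (currents in uA).
    T = [LA * ov - LB * sum(z) + LM for ov, z in zip(overlap, Z)]
    # Visit categories from largest to smallest T_j; ties go to the lowest index.
    for j in sorted(range(len(Z)), key=lambda k: -T[k]):
        if overlap[j] >= rho * norm_I:               # vigilance criterion
            Z[j] = [a & b for a, b in zip(I, Z[j])]  # fast learning: z_j <- I ∩ z_j
            return j
    return None

# Toy run: 4-bit patterns, 3 categories, all weights initialised to 1.
Z = [[1, 1, 1, 1] for _ in range(3)]
patterns = [[1, 1, 0, 0], [0, 0, 1, 1], [1, 1, 0, 0]]
assignments = [art1_present(I, Z, rho=0.7) for I in patterns]
print(assignments, Z)
```

In this toy run the first and third patterns fall into the same category and the second pattern opens a new one, while the third category stays uncommitted.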

D. Yield and Fault Tolerance:

A total of 30 chips (numbered 1 through 30 in Table 2.3 and Fig. 2.13) were fabricated. For each chip, every subcircuit was independently tested and its proper operation verified; 14 different faults were identified. Table 2.3 indicates the faults detected for each of the 30 chips. The faults are denoted $F1$ to $F14$ and are separated into two groups:

- **Catastrophic Faults (digital sense)** are those clearly caused by a short-circuit or open-circuit failure. These faults are $F1$, ..., $F8$. This kind of fault would produce a failure in a digital circuit.
- **Non-Catastrophic Faults (digital sense)** are those that produce a large deviation from the nominal behavior, too large to be explained by random process parameter variations. These faults are $F9$, ..., $F14$. This kind of fault would probably not produce a catastrophic failure in a digital circuit, but would be responsible for significant delay-time degradation.
<table>
<thead>
<tr>
<th>chip #</th>
<th>F1</th>
<th>F2</th>
<th>F3</th>
<th>F4</th>
<th>F5</th>
<th>F6</th>
<th>F7</th>
<th>F8</th>
<th>F9</th>
<th>F10</th>
<th>F11</th>
<th>F12</th>
<th>F13</th>
<th>F14</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td></td>
<td></td>
<td></td>
<td>X</td>
<td></td>
<td></td>
<td>X</td>
<td></td>
<td>X</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>2</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td></td>
</tr>
<tr>
<td>3</td>
<td></td>
<td></td>
<td>X</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>X</td>
<td>X</td>
<td></td>
<td></td>
<td>X</td>
</tr>
<tr>
<td>4</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>X</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>X</td>
<td>X</td>
<td></td>
<td></td>
</tr>
<tr>
<td>5</td>
<td>X</td>
<td>X</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>X</td>
<td></td>
<td></td>
</tr>
<tr>
<td>6</td>
<td></td>
<td>X</td>
<td>X</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>X</td>
<td></td>
</tr>
<tr>
<td>7</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>X</td>
</tr>
<tr>
<td>8</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>9</td>
<td></td>
<td>X</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>X</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>10</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>X</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>11</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>12</td>
<td></td>
<td></td>
<td></td>
<td>X</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>13</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>14</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>15</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>X</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>16</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>X</td>
<td></td>
</tr>
<tr>
<td>17</td>
<td>X</td>
<td></td>
<td>X</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>18</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>X</td>
<td>X</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>19</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>20</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>X</td>
<td>X</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>21</td>
<td>X</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>22</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>X</td>
<td></td>
</tr>
<tr>
<td>23</td>
<td>X</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>24</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>25</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>26</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>27</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>28</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>29</td>
<td>X</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>30</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

Table 2.3. Fault Characterizations of the 30 ART 1 Chip Samples. Dark Shades: Sample with Catastrophic Fault; Light Shade: Sample with no Catastrophic Fault but with non Catastrophic Fault; no Shade: Sample with no Fault.

Table 2.4 describes the subcircuits where the faults of Table 2.3 were found. Note that the most frequent faults are F3/F9 and F4/F10, which are failures in some current sources $L_A$ or $L_B$, and these current sources occupy a significant percentage of the total die area. Fault F1 is a fault in the shift register that loads the input vector $I^k$. Fault F2 is a fault in the WTA circuit. Therefore, chips with an F1 or F2 fault could not be tested for system-level operation. Faults F3 and F9 are faults detected in the same subcircuits of the chip, with F3 being catastrophic and F9 non-catastrophic. The same holds for F4 and F10, F5 and F11, and so on up to F8 and F14.

Note that only 2 of the 30 chips (6.7%) are completely fault-free. According to the simplified expression for the yield performance as a function of die area $\Omega$ and process defects density $\rho_D$ [2.42],

$$\text{yield} (\%) = 100e^{-\rho_D \Omega} \quad (2.10)$$
Fig. 2.12: Categorization of the input patterns for $\rho=0$ and different values of $\alpha$ (rows: $\alpha$ = 1.07, 1.5, 2, 3, 4, 5, 7, 20, 50; columns: learned categories $z_1$ to $z_{18}$)

Table 2.4. Description of Faults

<table>
<thead>
<tr>
<th>Faults</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>F1</td>
<td>non-operative shift register for loading $I^k$</td>
</tr>
<tr>
<td>F2</td>
<td>non-operative WTA circuit</td>
</tr>
<tr>
<td>F3/F9</td>
<td>fault in a current source $L_A$</td>
</tr>
<tr>
<td>F4/F10</td>
<td>fault in a current source $L_B$</td>
</tr>
<tr>
<td>F5/F11</td>
<td>fault in vigilance parameter $\rho$ current mirror</td>
</tr>
<tr>
<td>F6/F12</td>
<td>fault in current mirror CMM</td>
</tr>
<tr>
<td>F7/F13</td>
<td>fault in current mirrors $CMA_j$ or $CMB_j$</td>
</tr>
<tr>
<td>F8/F14</td>
<td>fault in current mirror CMC</td>
</tr>
</tbody>
</table>

This requires a process defect density of $\rho_D = 3.2\,\text{cm}^{-2}$. On the other hand, if the non-catastrophic faults are ignored, 9 of the 30 chips (30%) count as fault-free. According to eq. (2.10), such a yield would be predicted for a process defect density of $\rho_D' = 1.4\,\text{cm}^{-2}$.
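Both defect densities follow from inverting eq. (2.10). The sketch below retraces that calculation with the effective die area of footnote 4; the function name `defect_density` is hypothetical.

```python
import math

def defect_density(good, total, area_cm2):
    """Invert eq. (2.10), yield = exp(-rho_D * Omega), for rho_D."""
    return -math.log(good / total) / area_cm2

omega = 0.92 ** 2  # effective die area in cm^2 (footnote 4)
rho_all = defect_density(2, 30, omega)  # counting every fault
rho_cat = defect_density(9, 30, omega)  # counting catastrophic faults only
print(f"{rho_all:.1f} {rho_cat:.1f}")
```

Rounded to one decimal place, the two values reproduce the figures quoted in the text.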

Even though the yield is quite low, many of the faulty samples were still operative. This is due to the fault-tolerant nature of neural algorithms in general [2.43]-[2.46], and of the ART 1 algorithm in particular. Table 2.3 shows that 16 chips have an operative shift register and WTA circuit. We performed system-level operation tests on these chips to check whether they could form clusters of the input data, and verified that 12 of these 16 chips were able to do so. Moreover, 6 of them (among which were the two completely fault-free chips) behaved identically. The resulting clustering behavior of these 12 chips is depicted in Fig. 2.13 for $\rho = 0.5$ and $\alpha = 1.07$.

---

4. The effective die area is $\Omega = (0.92 \text{cm})^2$, which accounts for a 400$\mu$m wide pad ring.

Fig. 2.13: Categorization of the input patterns performed by operative samples (rows: chip number; columns: learned categories 1 to 18)

### 2.4. Further Enhancements

The chip described here is the first prototype designed by the authors for real-time clustering. As such, the design focused on testability and full characterization possibilities, instead of maximizing speed and yield, for example.

To make this chip an industry-ready prototype, several trivial modifications should be introduced:

- Replace the serially loaded shift register that holds the input pattern $I^k$ with some kind of parallel loading mechanism (using either electrical or optical data acquisition techniques).
- Use some simple yield enhancement technique. Table 2.3 and Table 2.4 show that most of the failures are due to faults in the synaptic current sources. A simple yield enhancement technique would be to add a number of spare columns of synapses, some of which would replace faulty columns.
- Add a handshaking mechanism that would allow the chip to communicate with the outside circuitry. Thus, when the WTA produces a fast response (which is most of the time), the outside circuitry need not wait for the worst-case WTA delay.

Other, less trivial, enhancements that should be addressed relate to the high area and current consumption of the synaptic current sources $L_A$ and $L_B$. One possibility would be to use UV-activated floating-gate-calibrated [2.47]-[2.50] current sources, instead of the tree-like structure of Fig. 2.6. In principle, it should be possible to use one single calibrated MOS transistor per synaptic current source. This transistor, which can be close to minimum size, does not have to drive a large current either. Calibration errors of 0.2% have been reported for currents of 200nA [2.50]. Using a scheme like this significantly reduces the current and silicon area consumption per synapse, allowing a much higher number of synapses per chip and thus boosting the performance of the chip significantly.

Other considerations relate to the question of how this chip would scale up with size. What would be the practical limitations? Usually a strong limitation when scaling up analog neural hardware is how systematic offsets accumulate. A common circuit technique for analog neural VLSI is the use of transconductors [2.23]. Connecting many of them in parallel results in the addition of their systematic offset components. If the size of the system is sufficiently large, this total offset can drive the system out of its working range. For our circuit the accumulation of systematic offsets of the synaptic current sources is not a problem. Note that the total currents $T_j$ (which certainly include a common systematic offset) compete in a WTA circuit, and the maximum among all $\{T_j\}$ is the same regardless of whether a common offset component is present.

A real scaling limitation for the circuit technique used in our chip is the following. The smallest current per synaptic current source is limited by the precision we want to achieve (even when using UV-activated floating-gate calibration techniques). Therefore, the maximum number of synapses that can be put into the same chip will be limited by the maximum power dissipation allowed by the package for a given precision. This implies a trade-off between precision and size.

Another problem that might arise when the number of nodes in the $F_2$ layer (maximum number of categories) becomes significantly large, is that the WTA circuit might not be able to detect the maximum among a large number of close-to-maximum inputs. At that point, one might reconsider if it is necessary to have an $F_2$ layer that provides one (and only one) winner, instead of an $F_2$ layer that provides a “bubble” of winners [2.52], [2.53].

A different way of system growth is to assemble different ART 1 subsystems to perform supervised clustering tasks [2.54], or to combine ART cells hierarchically for higher level knowledge processing [2.55], [2.56].

### 2.5. References


---

5. In this case a global offset calibration technique can be used to overcome this problem.


Appendix 3: A High-Precision Current-Mode WTA-MAX Circuit with Multi-Chip Capability

### 3.1. Introduction

Winner-Take-All (or Loser-Take-All) and MAX (or MIN) circuits are often fundamental building blocks in neural and/or fuzzy hardware systems [3.3]-[3.5]. Given a set of \( M \) external inputs \( (T_1, T_2, \ldots, T_j, \ldots, T_M) \), their operation consists of determining which input \( j \) presents the largest (or smallest) value, or what this maximum (or minimum) value is, respectively. If a Winner-Take-All (WTA) or MAX circuit is available, a Loser-Take-All (LTA) or MIN circuit is obtained by simply inverting the inputs \( (-T_1, -T_2, \ldots, -T_j, \ldots, -T_M) \)$^1$. Hence, this Appendix concentrates only on WTA and MAX circuits.

In the literature, the physical implementation of these systems has been tackled through two main approaches:

a) Systems of \( O(M^2) \) complexity: their connectivity increases quadratically with the number of inputs [3.6]-[3.10].

b) Systems of \( O(M) \) complexity: their connectivity increases linearly with the number of inputs [3.11], [3.12].

In a system of \( O(M^2) \) complexity, as shown in Fig. 3.1(a), there is one cell per input; each cell has an inhibitory connection (black triangle) to the rest of the cells and an excitatory connection (white triangle) to itself. Therefore, the system has \( M^2 \) connections. Each cell \( j \) receives an external input \( T_j \). The cell that receives the maximum input will turn all other cells OFF and will remain ON. If the system is a Winner-Take-All (WTA) circuit, each cell has a binary output that indicates whether the cell is ON or OFF. In a MAX circuit the winning cell will copy its input to a common output.


Fig. 3.1: WTA topologies. (a) WTA of \( O(M^2) \) complexity, (b) transformation to \( O(M) \) complexity, (c) typical topology of \( O(M) \) WTA hardware implementation.

1. Optionally, a common offset term may be added.

This Appendix is a merged and extended version of papers [3.1] and [3.2].

Under some circumstances$^2$ it is possible to convert the $O(M^2)$ topology of Fig. 3.1(a) into an $O(M)$ one, as shown in Fig. 3.1(b). In these cases, a global inhibition term is computed. Each cell contributes to this global inhibition, and each cell receives the same global inhibition. Note that each cell now also contributes to inhibiting itself. Consequently, the excitatory connection that each cell has to itself must be strengthened to compensate for this.

Typical $O(M)$ WTA circuits reported in the literature [3.11], [3.12] correspond to the topology shown in Fig. 3.1(c). In such circuits there are also $M$ cells, each receiving an external input $T_j$. Each cell connects to a common node, through which a global property (for example, a current) is shared among all cells. The amount of that global property taken by each cell depends (nonlinearly) on how much its input $T_j$ deviates from an “average” of all inputs. Usually this “average” is not an exact linear average, but depends nonlinearly on all inputs. The cell with the maximum input $T_j$ takes most (or all) of the common global property, leaving the rest with little or nothing. Because of the way this global property is shared and how the “average” is computed, the operation of these circuits relies on the matching of the threshold voltages of an array of transistors [3.11], or of other transistor parameters (as in [3.12], where the operation also relies on the matching of parameter $\lambda$ of the transistor array). The number of transistors in the array equals, at least, the number of inputs $M$ of the system. If the WTA or MAX circuit has so many inputs that it must be distributed among different chips, the matching of threshold voltages (or other transistor parameters) will degrade significantly, and the overall system will lose precision in its operation.

This Appendix describes an $O(M)$ complexity circuit technique (which can be represented by the topology in Fig. 3.1(b)) for implementing WTA and/or MAX circuits, based on current-mode principles. The resulting circuit does not rely on the matching of an $M$-size transistor array. The precision of the overall system relies on precise current replication, which can be achieved locally without matching $M$ transistors. Sometimes, when assembling large neural and/or fuzzy systems, a WTA/MAX circuit must be distributed among several chips [3.13]. The circuit described here can be distributed among several chips with no influence on its precision, as shown in the Section on experimental results.

In Section 3.2 a mathematical model that performs WTA/MAX operation is described. This operation principle will be used in Section 3.3 to develop a current-mode processing circuit. Sections 3.4 and 3.5 deal with stability considerations of the circuit presented in Section 3.3. Finally, Section 3.6 provides experimental measurement results obtained from prototypes fabricated in two different CMOS technologies, and from multi-chip systems formed by chips of the same or different technologies.

### 3.2. Operation Principle

The operation principle given in this Section can be used for simultaneous implementation of a WTA and MAX circuit. The system has $M$ cells. Each cell $j$ produces an output

$$I_{oj} = \alpha_j H(T_j - I_o), \quad j = 1, \ldots, M \quad (3.1)$$

where

---

2. If the inhibition that goes from cell $i$ to cell $j$ does not depend on $j$.

---
Fig. 3.2: Graphic Representation of the Solution of eq. (3.4)

$$I_o = \sum_{j=1}^{M} I_{oj} \quad (3.2)$$

\( H(\cdot) \) is the step function defined as

$$H(x) = \begin{cases} 1, & x \geq 0 \\ 0, & x < 0 \end{cases} \quad (3.3)$$

and \( T_j \) is the external input to the \( j \)-th cell. Substituting eq. (3.1) into eq. (3.2) yields

$$I_o = \sum_{j=1}^{M} \alpha_j H(T_j - I_o) \quad (3.4)$$

Fig. 3.2 shows a graphic representation of the functions \( f_1(I_o) = \sum \alpha_j H(T_j - I_o) \) and \( f_2(I_o) = I_o \). The intersection of \( f_1(I_o) \) and \( f_2(I_o) \) provides the solution to eq. (3.4). Note that if \( \alpha_j > 0 \quad \forall j \), eq. (3.4) has a unique equilibrium point \( S \), as deduced from Fig. 3.2. Furthermore, if

\[ \alpha_j \geq T_j \quad , \quad \forall j \]

(3.5)

the value of \( I_o \) at the equilibrium point \( S \) is

\[ I_{o|S} = \max \{ T_j \} \]

(3.6)

and the cell that drives a nonzero output \( I_{oj} \neq 0 \) is the winner. Consequently, a circuit that implements eq. (3.4) can be used to realize either a WTA or a MAX circuit.

In the case of an LTA or a MIN circuit, the same mathematical model of Fig. 3.2 applies if each input equals

\[ I_L - T_j, \]

(3.7)

where \( I_L \) is an upper bound for all inputs,

\[ 0 \leq T_j \leq I_L \hspace{0.5cm}, \hspace{0.5cm} \forall j. \]

(3.8)

Fig. 3.3: WTA unit cell: (a) circuit diagram, (b) transfer curve

Fig. 3.4: Diagram of the WTA circuit
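The operation principle of eqs. (3.1)-(3.8) can be explored numerically. The following Python sketch is illustrative only: the function names `total_output`, `solve_equilibrium`, and `lta_min`, the bisection search, and the sample inputs are assumptions that do not appear in the original. By default it takes $\alpha_j = T_j$, which satisfies eq. (3.5).

```python
def total_output(i_o, T, alpha):
    """Right-hand side of eq. (3.4): sum of alpha_j * H(T_j - I_o), with H(0) = 1."""
    return sum(a for t, a in zip(T, alpha) if t - i_o >= 0)


def solve_equilibrium(T, alpha=None, iterations=200):
    """Locate the equilibrium point S of eq. (3.4) by bisection.

    f1(I_o) - I_o is strictly decreasing, so the largest I_o for which
    f1(I_o) >= I_o is the single crossing sketched in Fig. 3.2.
    Assumes non-negative inputs T_j and alpha_j >= T_j.
    """
    if alpha is None:
        alpha = list(T)                      # alpha_j = T_j
    lo, hi = 0.0, sum(alpha) + 1.0           # f1(lo) >= lo and f1(hi) < hi
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if total_output(mid, T, alpha) >= mid:
            lo = mid
        else:
            hi = mid
    winners = [j for j, t in enumerate(T) if t >= lo]
    return lo, winners


def lta_min(T, i_l):
    """LTA/MIN operation through the input transformation I_L - T_j of eq. (3.7)."""
    mirrored = [i_l - t for t in T]
    i_o, winners = solve_equilibrium(mirrored)
    return i_l - i_o, winners


T = [3.0, 7.5, 5.2]                          # hypothetical input currents
i_o, winners = solve_equilibrium(T)          # MAX value and WTA winner index
t_min, losers = lta_min(T, i_l=10.0)         # MIN value and LTA winner index
```

Under these assumptions `i_o` approaches the largest input and `winners` lists the cell (or cells, in case of a tie) that would drive a nonzero output.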

### 3.3. Circuit Implementation

This Section shows how to realize a circuit that implements eq. (3.4) using currents to represent the mathematical variables \( T_j \) and \( I_o \). The circuit for each cell \( j \) is shown in Fig. 3.3. It consists of a 2-output current mirror, a MOS transistor, and a digital inverter. Each cell \( j \) receives two input currents, \( T_j \) and \( I_o \), and delivers one output current \( I_{oj} \). The inverter acts as a current comparator. If \( I_o > T_j \), the inverter output \( y_j \) is low, the MOS transistor is OFF, and \( I_{oj} \) is zero. If \( I_o < T_j \), the inverter output \( y_j \) is high, the MOS transistor is ON, and \( I_{oj} = T_j \). Consequently, the circuit of Fig. 3.3 implements a cell with \( \alpha_j = T_j \).

Fig. 3.4 shows the complete WTA or MAX circuit. It consists of \( M \) cells shown in Fig. 3.3 and an additional \( M \)-output current mirror. Note that the responsibility of the \( M \)-output current mirror is to deliver the sum of currents \( I_o = \Sigma_j I_{oj} \) to each of the \( M \) cells. Replication and transportation of current \( I_o \) must be very precise. If the number of cells \( M \) is too large, or if the circuit has to be distributed among several chips, high
precision in $I_o$ replication cannot be guaranteed by a single current mirror with $M$ outputs. In this case, replication of current $I_o$ must rely on several mirrors with a smaller number of outputs but with guaranteed precise replication. Fig. 3.5 shows an arrangement to distribute the circuit of Fig. 3.4 among several chips. The fact that current $I_o$ can be replicated many times without relying on the matching of a large array of transistors is the advantage of this WTA and MAX (or LTA and MIN) circuit technique over other implementations.

The precision of the overall WTA current mode circuit is determined by the kind of current mirrors used. Since the current comparator has virtually no offset, the current error at the input of each current comparator is determined by mirror mismatches. The error at the positive current $I_o$ available at the input of each current comparator results from one p-mirror reflection, preceded by an n-mirror reflection of the winning cell,

$$\sigma^2 (I_o) = \sigma^2_N + \sigma^2_P$$  \hspace{1cm} (3.9)

while the error of the negative current $T_j$ at the current comparator inputs results from a single n-mirror reflection,

$$\sigma^2 (T_j) = \sigma^2_N.$$  \hspace{1cm} (3.10)

The total current error at the input of each current comparator is therefore given by,

$$\sigma^2_{Total} = \sigma^2 (I_o) + \sigma^2 (T_j) = 2\sigma^2_N + \sigma^2_P.$$  \hspace{1cm} (3.11)
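The error budget of eqs. (3.9)-(3.11) can be written as a short Python sketch. The function name and the numerical mismatch values are hypothetical, and the quadratic sum assumes, as the equations above do, that the individual mirror errors are statistically independent.

```python
import math


def comparator_input_error(sigma_n, sigma_p):
    """Standard deviations corresponding to eqs. (3.9)-(3.11)."""
    var_io = sigma_n ** 2 + sigma_p ** 2     # eq. (3.9): one n-mirror and one p-mirror reflection
    var_tj = sigma_n ** 2                    # eq. (3.10): a single n-mirror reflection
    var_total = var_io + var_tj              # eq. (3.11)
    return math.sqrt(var_io), math.sqrt(var_tj), math.sqrt(var_total)


# Hypothetical relative mismatches, expressed as fractions of the working current.
s_io, s_tj, s_total = comparator_input_error(sigma_n=0.005, sigma_p=0.007)
```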

However, the error introduced by a current mirror comprises not only the random mismatch contribution considered in eqs. (3.9)-(3.11), but also a systematic error contribution, which results from different drain-to-source voltages at the reflecting transistors, poor impedance coupling, and inherent nonlinear MOS transistor operation. In our implementation we have chosen an “active-input regulated-cascode” current mirror (see Fig. 3.6(d)) [3.2]. As shown in Fig. 3.6(d), this topology maintains the same drain-to-source voltage at the input and output reflecting transistors, thus avoiding this source of systematic error.

Fig. 3.6: Enhanced current mirror topologies. (a) active, (b) cascode, (c) regulated cascode output, and (d) active regulated cascode

The active current mirror (see Fig. 3.6(a)) [3.14] was introduced to maintain a constant voltage at the input of a current mirror and thereby avoid current subtraction errors in previous stages. This technique makes the $V_{GS}$ voltage of the current mirror input transistor $M1$ independent of its $V_{DS}$ voltage, which is kept constant, thus lowering the mirror input impedance and minimizing loading effects on previous stages. The current mirror is operative as long as transistors $M1$ and $M2$ are kept in saturation. The higher the reference voltage $V_D$, the more current is allowed through the mirror with $M1$ operating in saturation. The drain-to-source voltage of the output transistor depends on the load of the mirror and therefore on the mirror current. If the load impedance is not sufficiently low, $M2$ will suffer from large drain-to-source voltage variations which, through the channel-length modulation effect, cause a systematic mismatch error between the input and output currents of the mirror. This can be avoided by using the cascode current mirror of Fig. 3.6(b). However, this mirror has a smaller output voltage swing and requires a high input voltage drop for high currents (which are needed for maximum accuracy) [3.16]. The regulated-cascode output stage (see Fig. 3.6(c)) [3.17] keeps the $V_{DS}$ of transistor $M2$ constant and significantly increases the output impedance of the mirror, while not sacrificing voltage range at the input of the current mirror. This current mirror is operative as long as transistor $M2$ remains in saturation, which depends on $V_D$ and the input current, as well as on the load impedance at the output of the mirror.

Combining the active current mirror input of Fig. 3.6(a) with the regulated-cascode output of Fig. 3.6(c) results in the active-input regulated-cascode current mirror of Fig. 3.6(d). This current mirror has a very low input impedance and a very high output impedance, and is operative if transistors $M1$ and $M2$ are in either the saturation or the ohmic region (because their $V_{DS}$ voltages are always equal). Therefore, the gate voltage of transistors $M1$ and $M2$ can change from rail to rail. However, this current mirror will fail to operate with high accuracy if the output voltage approaches $V_D$. On the other hand, $V_D$ can be made smaller than for Fig. 3.6(a) because $M1$ and $M2$ can now operate in the ohmic regime.

When cascading current mirrors, the regulated-cascode output is not needed and the configuration of Fig. 3.6(a) can be used. The virtual ground effect at the drain of transistor $M2$ is produced by the active input of the next current mirror. However, in this case $V_D$ has to be set equal for all PMOS and NMOS active current mirrors. Care must also be taken in choosing the value of $V_D$, in order to respect the output voltage range of the circuit at the input of the first mirror and the input voltage range of the circuit at the output of the last current mirror.

As previously explained, the accuracy limitation of current mirrors has two main sources: systematic errors and random errors. Systematic errors are caused by different $V_{DS}$ voltages at transistors $M1$ and $M2$ due to poor input/output impedance coupling between subsequent stages and high-order nonlinear effects. Random errors are fundamentally caused by differences in the electrical parameters between transistors $M1$ and $M2$, due to random process parameter variations. While random mismatch error contributions are practically independent of the circuit topology, systematic errors change considerably from one topology to another. We will consider that the total precision of a current mirror is given by

$$\Delta I_{Total} = \Delta I_{sys} + \sigma_I$$

(3.12)

where $\Delta I_{sys}$ is the systematic error contribution (evaluated through a single nominal Hspice simulation) and $\sigma_I$ is the standard deviation of the output current (evaluated through 30 Hspice Monte Carlo simulations). The statistical significance of $\sigma_I$ is that 68% of the samples have an output current error within the range $(\Delta I_{sys} - \sigma_I, \Delta I_{sys} + \sigma_I)$. In the Hspice simulations of random mismatch errors, the only sources of random mismatch considered are the differences in threshold voltage ($V_T$) and current factor ($\beta = \mu C_{ox} W / L$), whose standard deviations are given by [3.16]

3. Although we are showing a differential input voltage amplifier, in the original paper [3.17] a single input amplifier was used. In this case the voltage $V_D$ would be set through process and circuit parameters.
Fig. 3.7: Resolution (in bits) as a function of working current for different current mirror topologies

\[
\begin{aligned}
\sigma^2 (V_T) &= \frac{A_{V_T}^2}{WL} + S_{V_T}^2 D^2 \\
\frac{\sigma^2 (\beta)}{\beta^2} &= \frac{A_{\beta}^2}{WL} + S_{\beta}^2 D^2
\end{aligned}
\]

(3.13)

where \( W \) and \( L \) are the sizes of the transistors, \( D \) (\( \sim W \)) is their separation, and \( A_{V_T} = 15\,\text{mV}\,\mu\text{m} \), \( S_{V_T} = 2\,\mu\text{V}/\mu\text{m} \), \( A_\beta = 2.3\%\,\mu\text{m} \), \( S_\beta = 2\times10^{-6}/\mu\text{m} \) (parameters given in [3.16] for a 1.6μm N-well process with 25nm gate oxide and direct wafer writing).
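As a worked example of eq. (3.13), the Python sketch below evaluates both standard deviations for one transistor pair. It is not part of the original: the function name is hypothetical, the units attached to the four parameters (mV·μm, μV/μm, %·μm, 1/μm) are an assumption made for dimensional consistency, and the separation is taken as $D = W$.

```python
import math

# Parameter values quoted in the text; the units are an assumption of this sketch.
A_VT = 15.0        # mV * um
S_VT = 2.0e-3      # mV / um  (i.e. 2 uV/um)
A_BETA = 0.023     # um       (i.e. 2.3 % * um)
S_BETA = 2.0e-6    # 1 / um


def mismatch_sigmas(w_um, l_um, d_um):
    """Return (sigma(V_T) in mV, sigma(beta)/beta as a fraction) from eq. (3.13)."""
    area = w_um * l_um
    var_vt = A_VT ** 2 / area + (S_VT * d_um) ** 2
    var_beta = A_BETA ** 2 / area + (S_BETA * d_um) ** 2
    return math.sqrt(var_vt), math.sqrt(var_beta)


# W = 100 um and L = 20 um are the sizes used for M1 and M2; D = W is assumed.
sigma_vt_mv, sigma_beta_rel = mismatch_sigmas(w_um=100.0, l_um=20.0, d_um=100.0)
```

With these assumed units the threshold-voltage mismatch comes out at a fraction of a millivolt and the relative current-factor mismatch well below 0.1%.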

In the following simulations we use \( W=100\mu m \) and \( L=20\mu m \) for transistors \( M1 \) and \( M2 \) of all current mirror topologies, with \( V_D=1.5V \) (power supply is 5V). Fig. 3.7 represents the total accuracy of eq. (3.12) as a function of operating current for different current mirror topologies. The first two curves correspond to the active-input regulated-cascode current mirror with amplifier gain values \( A=1000 \) and \( A=100 \). For \( A=1000 \) the resolution is above 8 bits for current values between 40μA and 750μA, while for \( A=100 \) this holds only between 40μA and 330μA. In both cases the decrease in precision above ~300μA occurs because transistors \( M1 \) and \( M2 \) enter their ohmic region of operation. A simple current mirror has a resolution below 8 bits over the complete current range. This is mainly due to poor impedance coupling, which can be avoided with regulated-cascode outputs, achieving 8-bit resolution between 45μA and 150μA. The cascode mirror suffers from large voltage drops at its input, thus offering 8-bit resolution only between 40μA and 70μA. The active-input mirror (Fig. 3.6(a)) has low resolution because its output voltage was set to 2.5V, which introduces an important systematic error contribution. Also shown in Fig. 3.7 is the random mismatch error contribution (\( \sigma_I \) in eq. (3.12)), which is approximately the same for all the topologies. Note that all topologies lose precision at high currents, because the increasing voltage drops at the different nodes approach the limit of the available voltage range.
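Fig. 3.7 expresses accuracy as a resolution in bits. The text does not state the conversion used; one common convention, assumed in the sketch below, is $n = -\log_2(\Delta I_{Total}/I)$, under which 8 bits corresponds to a relative error of about 0.39%. The function name and the sample values are hypothetical.

```python
import math


def resolution_bits(delta_i_total, i_work):
    """Equivalent resolution in bits for a total error at a given working current.

    Assumed convention (not stated in the original): n = -log2(delta_i_total / i_work).
    """
    return -math.log2(delta_i_total / i_work)


# Hypothetical example: a 0.39 uA total error at a 100 uA working current.
bits = resolution_bits(delta_i_total=0.39e-6, i_work=100e-6)
```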

The use of the active input requires stability compensation [3.14]. Compensating an active-input current mirror for an operating current range of several decades, while achieving speed figures better than those of standard current mirror topologies, is not a simple task. For our application, where we needed to achieve 8-bit accuracy over a one-decade operating current range with speeds faster than 100ns (for most of the range), we used an automatic transistor sizing optimization tool [3.15] for the design of the voltage amplifiers. The required voltage amplifier gain was lower than 1000, so a simple topology could be used ([3.14] or [3.18]). The regulated-cascode output amplifier does not require compensation.

Table 3.1 shows the transistor-level simulated transient times of our design when using a regular OTA [3.18] as the active device. These transients correspond to the time the mirror takes to settle to within 1% of the final current value, when the input is driven by an ideal current step changing between the levels given in the first row of the table. Table 3.1 also shows the delays for other topologies that have the same transistor sizes. Note that speed is improved with respect to mirrors that do not use active devices. The reason is that the active device introduces more design variables, thus allowing a better-optimized final result.

<table>
<thead>
<tr>
<th></th>
<th>10\mu A to 15\mu A</th>
<th>50\mu A to 75\mu A</th>
<th>100\mu A to 150\mu A</th>
<th>250\mu A to 375\mu A</th>
<th>400\mu A to 600\mu A</th>
<th>530\mu A to 800\mu A</th>
</tr>
</thead>
<tbody>
<tr>
<td>act. reg. casc.</td>
<td>65ns</td>
<td>70ns</td>
<td>45ns</td>
<td>45ns</td>
<td>70ns</td>
<td>460ns</td>
</tr>
<tr>
<td>simple</td>
<td>160ns</td>
<td>80ns</td>
<td>65ns</td>
<td>40ns</td>
<td>80ns</td>
<td>35ns</td>
</tr>
<tr>
<td>cascode</td>
<td>180ns</td>
<td>85ns</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td>active input</td>
<td>65ns</td>
<td>70ns</td>
<td>45ns</td>
<td>35ns</td>
<td>45ns</td>
<td>-</td>
</tr>
<tr>
<td>reg. cas. out.</td>
<td>165ns</td>
<td>80ns</td>
<td>65ns</td>
<td>50ns</td>
<td>-</td>
<td>-</td>
</tr>
</tbody>
</table>

Table 3.1. Simulated Transient Times

As we will see in Section 3.6, we built a CMOS prototype of the circuit in Fig. 3.4 (and Fig. 3.5) using simple current mirrors for the NMOS mirrors, and active-input mirrors for the PMOS and the assembling current mirrors. As will be discussed in Section 3.6, this is sufficient to guarantee multichip WTA operation. For precise MAX operation, however, more elaborate mirrors would be needed for the NMOS mirrors.

### 3.4. System Stability Coarse Analysis

Let us assume that the dynamics of each cell (see Fig. 3.3) can be modelled by the following first-order nonlinear differential equation

\[
C_c \dot{v}_{xj} (t) + G_c (v_{xj} (t) - v_M) + T_j = I_o (t)
\]

(3.14)

where \( C_c \) is the total capacitance available at node \( v_{xj} \), \( G_c \) is the total conductance at this node, and \( v_M \) is the inverter trip voltage. Let us also assume that the output current of a cell is given by

\[
I_{oj}(t) = T_j U( v_M - v_{xj}(t) )
\]  
(3.15)

where \( U(\cdot) \) is a continuous and differentiable approximation to the step function of eq. (3.3). For example, we can define \( U(\cdot) \) as the following sigmoidal function,

\[
U(x) = \frac{1}{1 + e^{-x/\epsilon}}
\]  
(3.16)

where \( \epsilon \) is positive and non-zero but close to zero. Now consider eq. (3.14) for two nodes, \( j \) and \( w \). Let \( w \) be the node that eventually should become the winner. If we subtract eq. (3.14) for the two nodes \( j \) and \( w \), then

\[
C_c [\dot{v}_{xj}(t) - \dot{v}_{xw}(t)] + G_c [v_{xj}(t) - v_{xw}(t)] = T_w - T_j
\]  
(3.17)

Eq. (3.17) has the following solution

\[
v_{xj}(t) - v_{xw}(t) = \frac{T_w - T_j}{G_c} + \left[ v_{xj}(0) - v_{xw}(0) - \frac{T_w - T_j}{G_c} \right] e^{-t/\tau_c}, \quad \tau_c = \frac{C_c}{G_c}.
\]  
(3.18)

After a few time constants \( \tau_c \), the difference between the two node voltages remains constant and equal to their difference at the equilibrium point. Therefore, once the expression for \( v_{xw}(t) \) is known, eq. (3.18) gives \( v_{xj}(t) \) for the rest of the nodes.

Consider now eq. (3.14) for node \( w \), and substitute eqs. (3.2) and (3.15) into it,

\[
C_c \dot{v}_{xw}(t) + G_c (v_{xw}(t) - v_M) + T_w = \sum_j T_j U( v_M - v_{xj}(t) ) .
\]  
(3.19)

Since \( v_{xj}(t) \) is given by eq. (3.18), after a few time constants \( \tau_c \), eq. (3.19) becomes

\[
C_c \dot{v}_{xw}(t) = G_c (v_M - v_{xw}(t)) - T_w + \sum_j T_j U\left( v_M - v_{xw}(t) - \frac{T_w - T_j}{G_c} \right).
\]  
(3.20)

This first order differential equation has stable equilibrium points if

\[
\frac{d\dot{v}_{xw}}{dv_{xw}} \bigg|_{equilibrium \ point} < 0 .
\]  
(3.21)

Differentiating eq. (3.20) with respect to \( v_{xw} \) gives

\[
C_c \frac{d\dot{v}_{xw}}{dv_{xw}} = -G_c - \sum_j T_j U'(\cdot).
\]  
(3.22)

Since \( G_c \), \( T_j \), and \( U'(\cdot) \) are always positive, the right-hand side of eq. (3.22) is negative for all possible values of \( v_{xw} \) (including the equilibrium point). Consequently, eq. (3.20) represents the dynamics of a stable system.
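The first-order model of eqs. (3.14)-(3.16) can also be integrated numerically to observe the convergence argued above. The following Python sketch is only an illustration under stated assumptions: forward-Euler integration, normalized parameter values ($C_c = G_c = 1$, $v_M = 0$, $\epsilon = 0.005$), identical initial voltages, and hypothetical inputs; none of these choices come from the original.

```python
import math


def sigmoid(x, eps):
    """Smooth step U(x) of eq. (3.16), written so that exp() cannot overflow."""
    z = x / eps
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def simulate_wta(T, c_c=1.0, g_c=1.0, v_m=0.0, eps=0.005, dt=0.002, t_end=20.0):
    """Forward-Euler integration of eq. (3.14) with I_o(t) = sum_j I_oj(t), eq. (3.15)."""
    v = [v_m + 0.5] * len(T)                     # every cell starts switched off
    for _ in range(int(t_end / dt)):
        i_o = sum(t * sigmoid(v_m - vj, eps) for t, vj in zip(T, v))
        v = [vj + dt * (i_o - t - g_c * (vj - v_m)) / c_c for t, vj in zip(T, v)]
    i_oj = [t * sigmoid(v_m - vj, eps) for t, vj in zip(T, v)]
    return v, i_oj


T = [1.0, 0.6, 0.8]                              # hypothetical inputs, normalized units
v_final, i_out = simulate_wta(T)
winner = max(range(len(T)), key=lambda j: i_out[j])
```

In this sketch only the cell with the largest input ends with a significant output current, and the sum of the outputs settles close to the largest input; how close depends on the assumed product $G_c \epsilon$.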
Fig. 3.8: Small Signal Modelling of Delay of the N-output Current Mirror

The discussion so far in this Section assumes that the M-output current mirror presents no delay, which is not very realistic. Let us now assume that the M-output current mirror of Fig. 3.4 has the first-order dynamics defined by the small signal equivalent circuit depicted in Fig. 3.8. Current \( I_o(t) \) represents each of the M outputs of this current mirror, and \( \sum_j I_{oj} \) its input. The dynamics of this current mirror in time-domain are given by

\[
I_o(t) + \tau_p \dot{I}_o(t) = \sum_j I_{oj}(t), \quad \tau_p = \frac{C_p}{g_{mp}} \tag{3.23}
\]

Eqs. (3.14)-(3.16) are still valid, but eqs. (3.17)-(3.19) now have higher-order dynamics. Substituting eq. (3.14) and its derivative into eq. (3.23) for the winning node \( w \) results in

\[
\tau_p C_c \ddot{v}_{xw}(t) + \left( C_c + \tau_p G_c \right) \dot{v}_{xw}(t) + G_c v_{xw}(t) = (G_c v_{M} - T_w) + \sum_{j=1}^{M} T_j U(v_M - v_{xj}(t)). \tag{3.24}
\]

Writing the same equation for a node \( j \) and subtracting the one for node \( w \) yields

\[
\tau_p C_c [\ddot{v}_{xj}(t) - \ddot{v}_{xw}(t)] + (C_c + \tau_p G_c) [\dot{v}_{xj}(t) - \dot{v}_{xw}(t)] + G_c [v_{xj}(t) - v_{xw}(t)] = T_w - T_j. \tag{3.25}
\]

The solution to this differential equation is

\[
v_{xj}(t) - v_{xw}(t) = \frac{T_w - T_j}{G_c} + K_1 e^{-\frac{t}{\tau_c}} + K_2 e^{-\frac{t}{\tau_p}} \quad , \quad \tau_c = \frac{C_c}{G_c} \tag{3.26}
\]

where \( K_1 \) and \( K_2 \) are determined by initial conditions. Consequently, after a few time constants \( \tau_c \) and \( \tau_p \), eq. (3.24) becomes

\[
\tau_p C_c \ddot{v}_{xw}(t) + \left( C_c + \tau_p G_c \right) \dot{v}_{xw}(t) + G_c v_{xw}(t) = (G_c v_{M} - T_w) + \sum_{j=1}^{M} T_j U\left(v_M - v_{xw}(t) - \frac{T_w - T_j}{G_c}\right) \tag{3.27}
\]

If \( v_{xw}(t) \) is well above or below \( v_{M} - (T_w - T_j)/G_c \), then the corresponding \( j \)-th cell function \( T_j U(\cdot) \) will equal either 0 or \( T_j \), respectively. In these cases the term \( T_j U(\cdot) \) contributes to the constant (time-independent) term of eq. (3.27). On the other hand, if \( v_{xw}(t) \) is close to \( v_{M} - (T_w - T_j)/G_c \), the term \( T_j U(\cdot) \) depends almost linearly on \( v_{xw}(t) \). In this case, its first-order Taylor series expansion is

\[
T_j U(\cdot) \approx \frac{1}{2} T_j + T_j U'(0) \left( v_M - \frac{T_w - T_j}{G_c} - v_{xw}(t) \right). \tag{3.28}
\]

This term contributes to the constant term and to the \( v_{xw}(t) \) term of eq. (3.27). Summing over all cells gives

\[ \sum_{j=1}^{M} T_j U(\cdot) = -Sv_{xw}(t) + K, \tag{3.29} \]

where \( K \) and \( S \) are constants and \( S > 0 \) (because \( T_j, U'(0) > 0 \)). Therefore, the poles of eq. (3.27) are the roots of

\[ \tau_p C_c s^2 + (C_c + \tau_p G_c) s + (G_c + S) = 0 \tag{3.30} \]

which always have a negative real part. Consequently, eq. (3.27) always converges to its unique equilibrium point.
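The pole locations of the second-order characteristic polynomial above can be checked numerically. This Python sketch is an illustration with hypothetical, normalized parameter values; the function name and the use of `cmath` are not part of the original.

```python
import cmath


def mirror_delay_poles(tau_p, c_c, g_c, s_lin):
    """Roots of tau_p*C_c*s^2 + (C_c + tau_p*G_c)*s + (G_c + S) = 0."""
    a = tau_p * c_c
    b = c_c + tau_p * g_c
    c = g_c + s_lin
    disc = cmath.sqrt(b * b - 4.0 * a * c)
    return (-b + disc) / (2.0 * a), (-b - disc) / (2.0 * a)


# Hypothetical normalized values. S = 0 models the case where every sigmoid is
# saturated; larger S models cells inside their transition region.
all_stable = True
for s_lin in (0.0, 1.0, 100.0):
    p1, p2 = mirror_delay_poles(tau_p=0.2, c_c=1.0, g_c=1.0, s_lin=s_lin)
    all_stable = all_stable and p1.real < 0 and p2.real < 0
```

For $S = 0$ the two poles reduce to $-1/\tau_c$ and $-1/\tau_p$; as $S$ grows they become complex conjugate but keep a negative real part, since all three coefficients remain positive.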

### 3.5. System Stability Fine Analysis

Electrical simulations of the circuit in Section 3.3 verify that the analysis in Section 3.4 is a good approximation as long as the equilibrium point does not lie in the transition region of any of the \( M \) sigmoidal functions \( U(\cdot) \). This can only be guaranteed if \( \alpha_j = T_j \) and the two largest inputs \( T_j \) and \( T_w \) are sufficiently different. If \( \alpha_j > T_j \) or (with \( \alpha_j = T_j \)) if two or more inputs \( T_j \) are maximum and very similar, the equilibrium point of the system (see Fig. 3.2) will be in the transition region of some sigmoid \( U(\cdot) \). In these cases, transistor parasitic elements that have been neglected in the analysis of Section 3.4 may cause unstable behavior. Consequently, some kind of compensation is necessary.

Under unstable conditions the system exhibits the following characteristics (observed through electrical simulations with Hspice):

- Only the cells \( j \) whose sigmoid functions \( U(\cdot) \) must be in their transition region at the equilibrium point are unstable. The rest of the cells behave as if the system had reached its equilibrium point.
- The unstable cells present oscillations (presence of complex conjugate poles).
- In the case of \( \alpha_j = T_j \) and with two or more equal maximum inputs, the steady-state oscillating waveforms at these cells become the same, regardless of their initial conditions.

This last observation suggests that a stability analysis could be performed by simply considering one cell in the system, which represents the parallel connection of all unstable cells, as shown in Fig. 3.9(a). On the other hand, since the unstable cells have the equilibrium point in the transition region of their sigmoid \( U(\cdot) \), we can linearize these sigmoids for the stability analysis. Therefore, let us consider the small signal equivalent circuit shown in Fig. 3.9(b), where the circuitry enclosed by dashed lines represents the parallel connection of all cells with equal and maximum input. The rest of the circuitry models the \( M \)-output current mirror (or set of current mirrors) responsible for distributing the global current \( I_o \) among the \( M \) cells. The minimum set of dynamic elements needed for the system to present unstable oscillating behavior are the parasitic capacitors \( C_c, C_p, \) and \( C_g \) (observed through electrical simulation).

The frequency-domain KCL equations of the linear circuit of Fig. 3.9(b) are
Fig. 3.9: (a) Parallel connection of unstable cells, (b) uncompensated small signal equivalent circuit, (c) compensated small signal equivalent circuit.

\[
\begin{aligned}
I_o &= V_{xj} (G_c + sC_c) \\
g_{mn} (y_j - V_{cj}) &= g_n V_{cj} - sC_g (y_j - V_{cj}) \\
M_m g_{mn} (y_j - V_{cj}) &= \left(1 + s \frac{C_p}{g_{mp}} \right) I_o \\
y_j &= -AV_{xj}
\end{aligned}
\tag{3.31}
\]

Routine analysis yields the following third-order polynomial

\[
as^3 + bs^2 + cs + d = 0
\]

\[
\begin{aligned}
a &= \frac{C_p C_c C_g}{g_{mp}} \\
b &= \frac{C_p C_g G_c}{g_{mp}} + \frac{C_p C_c g_{mn}}{g_{mp}} + C_g C_c \\
c &= C_g G_c + C_p G_c \frac{g_{mn}}{g_{mp}} + C_c g_{mn} \\
d &= g_{mn} G_c + AM_m g_n g_{mn}
\end{aligned}
\tag{3.32}
\]

Since all coefficients \(a, b, c,\) and \(d\) are positive, the roots of this polynomial have negative real parts if \(bc - ad > 0\) (Routh-Hurwitz criterion). Considering that parasitic capacitances \(C_p, C_c,\) and \(C_g\) are approximately of the same order of magnitude and that \(g_{mp} > g_{mn} \gg g_n = G_c\), the stability condition simplifies to

\[
AM_m < \frac{C_c}{g_n} \left( \frac{g_{mp}}{C_p} + \frac{g_{mn}}{C_g} \right)
\]

(3.33)

where \(M_m\) is the number of cells with equal and maximum input. This condition is not easy to satisfy since \(A\) must be large for proper operation, \(M_m\) may become large, and it is not trivial to make the right hand side of eq. (3.33) very large.
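The dependence of stability on the product $A M_m$ can be illustrated with a Routh-Hurwitz check of the cubic. In the Python sketch below the cubic is obtained by multiplying out $(1 + s\tau_p)(G_c + sC_c)(g_n + g_{mn} + sC_g) + A M_m g_{mn} g_n$, which is what the KCL equations of Fig. 3.9(b) give when solved for $V_{xj}$; treat this expansion, the function names, and all numerical values as assumptions of the sketch rather than data from the original.

```python
def poly_mul(p, q):
    """Multiply two polynomials given as coefficient lists, lowest order first."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out


def uncompensated_is_stable(A, M_m, g_mn, g_mp, g_n, G_c, C_c, C_p, C_g):
    """Routh-Hurwitz test for a cubic: all coefficients > 0 and b*c - a*d > 0."""
    tau_p = C_p / g_mp
    cubic = poly_mul(poly_mul([1.0, tau_p], [G_c, C_c]), [g_n + g_mn, C_g])
    d, c, b, a = cubic
    d += A * M_m * g_mn * g_n                # loop-gain term of the constant coefficient
    return min(a, b, c, d) > 0 and b * c - a * d > 0


# Hypothetical small-signal values (siemens and farads), not taken from the thesis.
params = dict(g_mn=5e-5, g_mp=1e-4, g_n=1e-6, G_c=1e-6,
              C_c=1e-13, C_p=1e-13, C_g=1e-13)
low_gain = uncompensated_is_stable(A=10, M_m=1, **params)
high_gain = uncompensated_is_stable(A=1000, M_m=10, **params)
```

With these illustrative numbers the test passes only for the small value of $A M_m$, which is consistent with the observation that a large comparator gain or many tied inputs make the uncompensated circuit oscillate.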

Stability compensation can be achieved by introducing capacitor \(C_A\), as shown in Fig. 3.9(c).

The frequency domain KCL equations of the linear circuit of Fig. 3.9(c) are
\[
\begin{aligned}
I_o &= V_{xj} (G_c + sC_c) + sC_A (V_{xj} - V_{cj}) \\
g_{mn}(y_j - V_{cj}) &= g_n V_{cj} - sC_g (y_j - V_{cj}) - sC_A (V_{xj} - V_{cj}) \\
M_m g_{mn}(y_j - V_{cj}) &= \left(1 + s \frac{C_p}{g_{mp}} \right) I_o \\
y_j &= -A V_{xj}
\end{aligned}
\tag{3.34}
\]

Routine analysis yields the following third-order polynomial

\[ a s^3 + b s^2 + c s + d = 0 \]

\[
\begin{aligned}
a &= \tau_p [C_c C_g + C_c C_A + (A + 1) C_g C_A] \\
b &= \tau_p C_c (g_{mn} + g_n) + \tau_p G_c (C_g + C_A) + \tau_p g_n C_A + C_g C_c + C_c C_A + \tau_p (A + 1) g_{mn} C_A + (A + 1) C_A C_g \\
c &= (\tau_p G_c + C_c) (g_{mn} + g_n) + C_A (g_{mn} + g_n + A M_m g_{mn} + A g_{mn} + M_m g_{mn}) \\
d &= A M_m g_{mn} g_n
\end{aligned}
\tag{3.35}
\]

Assuming \( A \gg 1 \) and \( g_{mp} > g_{mn} \gg g_n, G_c \), eqs. (3.35) can be simplified to

\[ a = A \frac{C_g C_A C_p}{g_{mp}} \]
\[ b = AC_A \left( \frac{g_{mn}}{g_{mp}} C_p + C_g \right) \]
\[ c = C_A (A M_m + A + M_m) g_{mn} \]
\[ d = A M_m g_{mn} g_n \]

Since all coefficients \( a, b, c, \) and \( d \) are positive, the roots of this polynomial have negative real parts if \( bc - ad > 0 \), which yields the following stability condition,

\[ \left(1 + \frac{1}{M_m} \right) C_A > \frac{g_n}{g_{mp}/C_p + g_{mn}/C_g} \]  

(3.37)

The worst case occurs for very large values of \( M_m \), for which eq. (3.37) reduces to

\[ C_A > \frac{g_n}{g_{mp}/C_p + g_{mn}/C_g} \]  

(3.38)
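The compensation bound of eq. (3.38) and the Routh-Hurwitz condition on the simplified coefficients can be evaluated with a few lines of Python. The function names and all numerical values below are hypothetical and serve only to illustrate how the bound might be used when sizing $C_A$.

```python
def min_compensation_cap(g_n, g_mp, g_mn, C_p, C_g):
    """Lower bound on C_A from eq. (3.38), the large-M_m worst case."""
    return g_n / (g_mp / C_p + g_mn / C_g)


def compensated_is_stable(C_A, A, M_m, g_n, g_mp, g_mn, C_p, C_g):
    """Routh-Hurwitz test b*c - a*d > 0 on the simplified coefficients a, b, c, d."""
    a = A * C_g * C_A * C_p / g_mp
    b = A * C_A * (g_mn / g_mp * C_p + C_g)
    c = C_A * (A * M_m + A + M_m) * g_mn
    d = A * M_m * g_mn * g_n
    return b * c - a * d > 0


# Hypothetical small-signal values (siemens and farads), not taken from the thesis.
params = dict(g_n=1e-6, g_mp=1e-4, g_mn=5e-5, C_p=1e-13, C_g=1e-13)
c_a_min = min_compensation_cap(**params)
well_compensated = compensated_is_stable(2.0 * c_a_min, A=1000, M_m=10, **params)
under_compensated = compensated_is_stable(0.5 * c_a_min, A=1000, M_m=10, **params)
```

For these assumed values the bound is below one femtofarad, which illustrates why the compensated condition is much easier to satisfy than the uncompensated one.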

Note that now the stability condition does not depend on gain \( A \), and is easier to fulfill. However, capacitor \( C_A \) degrades the settling speed of the system, because it acts as a Miller capacitance. Since the DC gain from node \( v_{xj} \) to node \( v_{cj} \) is approximately \( -A \) (i.e. the negative of the slope of \( U(\cdot) \)), there will be an effective Miller capacitance of value \( (A + 1) C_A \) in parallel with the original \( C_c \) capacitor. If the sigmoid is not in its transition region, \( A = 0 \); if it is in its transition region, \( A \) can be very large. Therefore, for compensated cells eq. (3.20) must be changed to

\[ [C_c + C_A + U' (v_M - v_{xw}) C_A] \dot{v}_{xw} = G_c (v_M - v_{xw}) - T_w + \sum_j T_j U \left( v_M - v_{xw} - \frac{T_w - T_j}{G_c} \right) \]  

(3.39)
If the winning cell is in its transition region \( U'(v_M - v_{xw}) \neq 0 \) and a large capacitance \( C_c + (A + 1) C_A \) is present at node \( v_{xw} \). Otherwise, \( U'(v_M - v_{xw}) = 0 \) and the effective capacitance is only \( C_c + C_A \).

### 3.6. Experimental Results

A WTA-MAX system with \( M=10 \) competing cells has been designed and fabricated in two different technologies. The first prototype has been integrated in a double-metal single-poly 1.0\( \mu \)m CMOS technology (ES2), and the other in a double-metal double-poly 2.4\( \mu \)m CMOS process (MIETEC). Both technologies were available through the European silicon foundry service, EUROCHIP.

If the circuit is going to be used as a MAX circuit, all current mirrors must provide good replication precision. They need to have small systematic errors and small random deviations [3.16], so that the resulting value of current \( I_o \) resembles the maximum among all inputs as closely as possible. However, if the circuit is going to be used as a WTA circuit, the requirements are less severe. Within a single chip, a WTA performs equally well even if the current mirrors have appreciable systematic errors: since systematic errors are common to all inputs, the system can still determine which input is maximum. On the other hand, random mismatch errors in the current mirrors must be kept small because these errors change randomly from one input to another. Reducing random errors implies using larger transistor sizes. Reducing systematic errors implies using more elaborate current mirror topologies that either reduce their output conductance (using cascode [3.16], regulated cascode [3.17], or gain-boosting [3.20] techniques), decrease their input impedance [3.21], or both [3.2].

For our application it was not critical that the final value of \( I_o \) be an exact replica of the maximum of the input. Therefore, we used a simple 3-transistor current mirror (without any output conductance or input impedance decreasing technique) for the 2-output NMOS current mirror of each cell. However, we used active input current mirrors [3.21] for the \( M \)-output PMOS current mirror and for the extra NMOS assembling current mirror (see Fig. 3.5). These current mirrors assure fixed voltages at their input nodes. This was necessary because if the system is distributed among several chips, the presence of the assembling current mirror would break the symmetry between some of the inputs, making systematic errors affect these inputs differently.

The following demonstrates proper operation of a WTA circuit in one single chip, in two chips of the same technology, and in two chips each of a different technology. As will be shown, the DC behavior of the system is not degraded when the operation is distributed among several chips. In the remainder of this Section we detail experimental measurements of the precision of a WTA and of its speed response.

**A. Operation Precision**

The DC transfer curves of the system have been measured for different input current levels and for different system configurations. Fig. 3.10 shows thirty transfer curves when the competing cells are inside the same chip. Each curve is obtained by randomly selecting a pair of input cells, \( i \) and \( j \), applying a constant input current \( T_i = T_P \) to the first, and sweeping the input current of the second, \( T_j \), from \( 0.9 \times T_P \) to \( 1.1 \times T_P \). The figure represents the two inverter output voltages, $y_i$ and $y_j$, versus the current $T_j$. For each pair of cells, $i$ and $j$, we measure the value of $T_j$ at the point where $y_i = y_j$. Let us call this value $T_M$. Thirty curves were measured for each value of $T_P$, resulting in thirty values of $T_M$. The difference between the mean of these thirty $T_M$ values and $T_P$ is a measure of the systematic error of $T_M$. Let us call it $\varepsilon(T_P)$. The standard deviation of the thirty $T_M$ values represents the random error of $T_M$. Let us call it $\sigma(T_P)$. In the case of Fig. 3.10, corresponding to a WTA inside one single chip fabricated in the ES2 1.0μm CMOS technology with $T_P = 100\mu A$, we measured a random deviation of $\sigma(T_P) = 1.04\%$ and a systematic error of $\varepsilon(T_P) = 0.03\%$.

Fig. 3.10: Transfer curves of the WTA implemented in an ES2 1.0μm chip for an input current level of 100μA

Fig. 3.11: Transfer curves when two ES2 1.0μm chips are assembled and for an input current level of 10μA
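The precision figures described above can be computed from a set of measured crossing points as sketched below in Python. The function name and the sample values are hypothetical (they are not measured data), the systematic error is taken here in absolute value, and the random error uses the sample standard deviation; these choices are assumptions of the sketch.

```python
import statistics


def wta_precision(t_m_values, t_p):
    """Systematic error, random error, and their sum, each as a percentage of T_P."""
    systematic = abs(statistics.mean(t_m_values) - t_p) / t_p * 100.0
    random_err = statistics.stdev(t_m_values) / t_p * 100.0
    return systematic, random_err, systematic + random_err


# Hypothetical crossing points T_M (in uA) for T_P = 100 uA.
eps_pct, sigma_pct, total_pct = wta_precision([99.0, 101.0, 100.0, 100.0], t_p=100.0)
```

The third returned value is the sum $\sigma(T_P) + \varepsilon(T_P)$, which is the total-error figure reported in the precision tables.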

Fig. 3.11 again shows thirty DC transfer curves, where $T_P = 10\mu A$ and the system is built by assembling two chips of the same technology using the set-up illustrated in Fig. 3.5. To obtain these curves, cell $i$ was always chosen among the cells in the first chip, and cell $j$ was always selected from the second chip. A random deviation of $\sigma(T_P) = 2.30\%$ was measured, while the systematic deviation was $\varepsilon(T_P) = 0.05\%$. Fig. 3.12 shows thirty DC transfer curves for the case $T_P = 500\mu A$, when two chips of different technologies are used. In this case $\sigma(T_P) = 2.18\%$ and $\varepsilon(T_P) = 0.06\%$. Note that the voltage ranges of $y_1$ and $y_2$ differ for the two chips.

Table 3.2 contains the measured total error (defined as $\sigma(T_P) + \varepsilon(T_P)$) for three decades of change in $T_P$. The table shows results for WTAs inside one chip, assembled using two chips of the same technology, and assembled using two chips of different technologies. Note that the precision degradation is very small when the system is distributed among two chips, regardless of whether the chips are of the same technology or not. This is the main advantage of this WTA-MAX circuit with respect to others reported in the literature. For comparison, Table 3.3 gives the simulation results of another WTA [3.12]. The input signals for this WTA are voltages that range from 1.5V to 4.5V. As can be seen, there is a significant precision degradation when the WTA is distributed among two chips of different technologies, caused by a large increase in the systematic error component [3.22].

**B. Operation Speed**

Delay measurements were performed as follows. Only two input signals were made non-zero. Let us call them $T_1$ and $T_2$. Current $T_1$ was made constant and equal to $T_{IN}$, while current $T_2$ was pulsed between the values $T_{IN} - 0.5\Delta T_{IN}$ and $T_{IN} + 0.5\Delta T_{IN}$, as shown in Fig. 3.13(a). The pulse starts at time $t_{o1}$ and ends at time $t_{o2}$. Waveforms $y_1$ and $y_2$ have the shape depicted in Fig. 3.13(b). Four different delay times were measured. For the system response caused by a rising edge in $T_2$, time $t_{d1}$ is the delay between time $t_{o1}$ and the instant at which voltage $y_2$ crosses the 50% value of its range. Delay $t_{d2}$ is the same for output voltage $y_1$. For the system response caused by a falling edge in $T_2$, time $t_{d3}$ is the delay between time $t_{o2}$...
### Table 3.2. Current Mode WTA Precision Measurements

<table>
<thead>
<tr>
<th>Technology</th>
<th>number of used chips</th>
<th>$T_p$</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td>10μA</td>
</tr>
<tr>
<td>ES2_1.0μm</td>
<td>1</td>
<td>2.00%</td>
</tr>
<tr>
<td>ES2_1.0μm</td>
<td>2</td>
<td>2.35%</td>
</tr>
<tr>
<td>MIETEC_2.4μm</td>
<td>1</td>
<td>1.94%</td>
</tr>
<tr>
<td>MIETEC_2.4μm</td>
<td>2</td>
<td>2.15%</td>
</tr>
<tr>
<td>MIETEC_2.4μm, ES2_1.0μm</td>
<td>2</td>
<td>2.24%</td>
</tr>
</tbody>
</table>

### Table 3.3. WTA Precision Computations (obtained through Hspice simulations) for the Circuit reported in [3.12]

<table>
<thead>
<tr>
<th>Technology</th>
<th>number of used chips</th>
<th>$v_p$</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td>1.5V</td>
</tr>
<tr>
<td>ES2_1.0μm</td>
<td>1</td>
<td>0.39%</td>
</tr>
<tr>
<td>MIETEC_2.4μm, ES2_1.0μm</td>
<td>2</td>
<td>2.62%</td>
</tr>
</tbody>
</table>

---

**Fig. 3.13:** (a) Input Signals, (b) Output Waveforms

Table 3.4. Measured delay times for one-chip WTAs

<table>
<thead>
<tr>
<th>$T_{IN}$</th>
<th>$\Delta T_{IN}$</th>
<th>ES2_1.0µm</th>
<th>MIETEC_2.4µm</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td>$t_{d1}$</td>
<td>$t_{d2}$</td>
</tr>
<tr>
<td>10µA</td>
<td>2µA</td>
<td>6.132µs</td>
<td>4.824µs</td>
</tr>
<tr>
<td>10µA</td>
<td>10µA</td>
<td>1.574µs</td>
<td>1.369µs</td>
</tr>
<tr>
<td>50µA</td>
<td>10µA</td>
<td>1.289µs</td>
<td>1.023µs</td>
</tr>
<tr>
<td>50µA</td>
<td>50µA</td>
<td>364ns</td>
<td>319ns</td>
</tr>
<tr>
<td>100µA</td>
<td>20µA</td>
<td>943ns</td>
<td>801ns</td>
</tr>
<tr>
<td>100µA</td>
<td>100µA</td>
<td>191ns</td>
<td>167ns</td>
</tr>
<tr>
<td>500µA</td>
<td>100µA</td>
<td>161ns</td>
<td>147ns</td>
</tr>
<tr>
<td>500µA</td>
<td>200µA</td>
<td>59ns</td>
<td>68ns</td>
</tr>
</tbody>
</table>

Table 3.5. Measured delay times for a two-chip WTA

<table>
<thead>
<tr>
<th>$T_{IN}$</th>
<th>$\Delta T_{IN}$</th>
<th>ES2_1.0µm</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td>$t_{d1}$</td>
</tr>
<tr>
<td>10µA</td>
<td>2µA</td>
<td>16.8µs</td>
</tr>
<tr>
<td>10µA</td>
<td>10µA</td>
<td>3.2µs</td>
</tr>
<tr>
<td>100µA</td>
<td>20µA</td>
<td>470ns</td>
</tr>
<tr>
<td>100µA</td>
<td>100µA</td>
<td>235ns</td>
</tr>
<tr>
<td>500µA</td>
<td>100µA</td>
<td>154ns</td>
</tr>
<tr>
<td>500µA</td>
<td>200µA</td>
<td>150ns</td>
</tr>
</tbody>
</table>

and the instant at which voltage $y_1$ crosses the 50% value of its range. Delay $t_{d4}$ is the same for output voltage $y_2$.

Measurements were performed for $T_{IN}$ values of 10µA, 50µA, 100µA, and 500µA, and for $\Delta T_{IN}$ equal to 0.2$T_{IN}$ and $T_{IN}$. Table 3.4 shows the measured delay times for those cases where the system is inside one single chip. Table 3.5 shows the delay times measured when a WTA is assembled using 2 chips of the ES2 1.0µm process.
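The 50% crossing delay defined above can be extracted from a sampled waveform along the following lines. This is only a sketch: the function name `crossing_delay`, the sample lists and the linear interpolation between samples are assumptions made for illustration, not the measurement procedure actually used.

```python
def crossing_delay(times, values, t_edge, v_low, v_high):
    """Delay between t_edge and the first 50% crossing of a sampled waveform.

    times, values: sampled output waveform (times strictly increasing).
    t_edge: instant of the input edge (t_o1 or t_o2 in the text).
    v_low, v_high: extremes of the output voltage range.
    Returns None if the waveform never crosses the 50% level after t_edge.
    """
    v_mid = 0.5 * (v_low + v_high)
    for k in range(1, len(times)):
        t0, t1 = times[k - 1], times[k]
        if t1 <= t_edge:
            continue  # segment lies entirely before the input edge
        a = values[k - 1] - v_mid
        b = values[k] - v_mid
        if a * b <= 0.0 and a != b:
            # linear interpolation of the crossing instant inside the segment
            t_cross = t0 + (t1 - t0) * (-a) / (b - a)
            if t_cross >= t_edge:
                return t_cross - t_edge
    return None


# Hypothetical waveform: the output switches between t = 2 and t = 3
delay = crossing_delay([0, 1, 2, 3, 4], [0.0, 0.0, 0.0, 1.0, 1.0],
                       t_edge=1.0, v_low=0.0, v_high=1.0)
```

The same function serves for all four delays by choosing the edge instant and the output waveform accordingly.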

3.7. References


Appendix 4: Systematic CMOS Transistor Mismatch Characterization

4.1. Introduction

Mismatch is a limiting factor in circuit design, especially as transistor geometries are reduced. Mismatch characterization is therefore becoming an important task for circuit designers who want to maintain proper circuit performance without wasting circuit area.

The electrical parameters of transistors fabricated in the same die suffer from two kinds of mismatches: a systematic gradient-based deviation and a random deviation with respect to their nominal value. These deviations depend on their position in the wafer, their sizes, and the separation between them.

Suppose we have a die with a two-dimensional array of identical MOS transistors and we measure a certain electrical parameter (for example, the threshold voltage $V_T$) for each transistor. Fig. 4.1 represents a typical measurement result: horizontal axes $x$ and $y$ represent transistor position in the die, and vertical axis $z$ shows the electrical parameter value. In a typical case we would see a surface that fits the measured values, plus noisy deviations for each point in the surface. The surface represents the systematic gradient-based deviations with respect to the nominal value (or mean value) for all transistors in the same wafer, while the noisy deviations at each point reflect the random deviations.
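The split between a systematic surface and random deviations can be illustrated with a small sketch. It assumes, purely for illustration, that the surface is a plane fitted by least squares over a full regular grid; the function name and the grid layout `z[i][j]` (row `i` along $y$, column `j` along $x$) are hypothetical.

```python
def split_gradient_and_noise(z):
    """Fit z[i][j] ~ c + a*(x - x_mean) + b*(y - y_mean) on a regular grid.

    Returns (a, b, c, residuals). On a full rectangular grid the centred
    x and y coordinates are orthogonal, so each coefficient can be
    computed independently and the result is the exact least-squares plane.
    Requires at least two rows and two columns.
    """
    ny, nx = len(z), len(z[0])
    x_mean, y_mean = (nx - 1) / 2.0, (ny - 1) / 2.0
    c = sum(sum(row) for row in z) / (nx * ny)  # mean value of the parameter
    sxx = ny * sum((j - x_mean) ** 2 for j in range(nx))
    syy = nx * sum((i - y_mean) ** 2 for i in range(ny))
    a = sum(z[i][j] * (j - x_mean) for i in range(ny) for j in range(nx)) / sxx
    b = sum(z[i][j] * (i - y_mean) for i in range(ny) for j in range(nx)) / syy
    # whatever the plane does not explain is treated as the random component
    residuals = [[z[i][j] - (c + a * (j - x_mean) + b * (i - y_mean))
                  for j in range(nx)] for i in range(ny)]
    return a, b, c, residuals


# Hypothetical noiseless map with a pure gradient
z = [[1.0 + 0.5 * j - 0.25 * i for j in range(4)] for i in range(3)]
a, b, c, residuals = split_gradient_and_noise(z)
```

For a noiseless gradient the residuals vanish; with measured data they would carry the random deviations discussed above.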

In this Appendix we will first give the theoretical model of Pelgrom for transistor mismatch properties [4.1], then we will present a chip intended for systematic measurements of transistor mismatch, and finally provide measurement and characterization results.

![Diagram](image.png)

Fig. 4.1: Typical Transistor Property Measurement Result along a Die

4.2. Pelgrom’s Model of Transistor Mismatch

If we assume that the electrical property $P$ of a transistor is the result of averaging a certain continuous density function $P(x, y)$ over the transistor area [4.1], the value of parameter $P$ of the transistor of size $W \times L$, with its center point located at position $(x_1, y_1)$, is

$$P_1(x_1, y_1) = \frac{1}{WL} \int_{area(x_1, y_1)} P(x', y') \, dx' \, dy'$$  \hspace{1cm} (4.1)

The density function $P(x', y')$ contains the variation due to wafer gradients as well as the variation due to the noisy deviations around the interpolated surface.

Under these assumptions, the mismatch in property $P$ between a pair of transistors, sized $W \times L$, located at positions $(x_1, y_1)$ and $(x_2, y_2)$, respectively (as shown in Fig. 4.2), is given by

$$\Delta P(x_{12}, y_{12}) = P_1(x_1, y_1) - P_2(x_2, y_2) =$$

$$= \frac{1}{WL} \int_{area(x_1, y_1)} P(x', y') \, dx' \, dy' - \frac{1}{WL} \int_{area(x_2, y_2)} P(x', y') \, dx' \, dy' =$$

$$= \frac{1}{WL} \int_{\mathbb{R}^2} G(x_{12} - x', y_{12} - y') \, P(x', y') \, dx' \, dy'$$  \hspace{1cm} (4.2)

where we have denoted as $x_{12}$ and $y_{12}$ the $x$ and $y$ coordinates of the middle point between both transistor centers, that is,

\[ x_{12} = \frac{x_1 + x_2}{2}, \qquad y_{12} = \frac{y_1 + y_2}{2} \]

(4.3)

![Fig. 4.2: Position and Coordinates of two Transistors](image-url)

and \( G(x, y) \) is a function defined as

\[
G(x, y) = \begin{cases} 
-1 & \text{if } \left(-\frac{D_x + L}{2} \leq x \leq -\frac{D_x - L}{2}\right) \text{ and } \left(-\frac{D_y + W}{2} \leq y \leq -\frac{D_y - W}{2}\right) \\
1 & \text{if } \left(\frac{D_x - L}{2} \leq x \leq \frac{D_x + L}{2}\right) \text{ and } \left(\frac{D_y - W}{2} \leq y \leq \frac{D_y + W}{2}\right) \\
0 & \text{otherwise}
\end{cases}
\]  

(4.4)

\( G(x, y) \) is a geometry function: it depends only on the geometries and locations of the transistors and not on the parameter density function \( P(x, y) \), so it can be computed for each transistor arrangement independently of the electrical property.

Taking the Fourier Transform in eq. (4.2) yields,

\[
\Delta P(\omega_x, \omega_y) = \frac{1}{WL} \mathcal{G}(\omega_x, \omega_y) \mathcal{P}(\omega_x, \omega_y)
\]

(4.5)

where \( \Delta P(\omega_x, \omega_y) \) is the Fourier Transform of \( \Delta P(x_{12}, y_{12}) \), \( \mathcal{G}(\omega_x, \omega_y) \) is that of \( G(x, y) \), and \( \mathcal{P}(\omega_x, \omega_y) \) is that of \( P(x, y) \).

For the layout of Fig. 4.2 it can be shown that

\[
\mathcal{G}(\omega_x, \omega_y) = \frac{\sin\left(\frac{\omega_x L}{2}\right) \sin\left(\frac{\omega_y W}{2}\right)}{\frac{\omega_x}{2} \frac{\omega_y}{2}} \left(-2j\right) \sin\left(\frac{\omega_x D_x + \omega_y D_y}{2}\right)
\]

(4.6)

If \( D_y = 0 \), it follows that

\[
\left|\frac{1}{WL} \mathcal{G}(\omega_x, \omega_y)\right| = \frac{\sin\left(\frac{\omega_x L}{2}\right) \sin\left(\frac{\omega_y W}{2}\right)}{\frac{\omega_x L}{2} \frac{\omega_y W}{2}} \{2 \sin\left(\frac{\omega_x D_x}{2}\right)\}
\]

(4.7)
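To get a feeling for eq. (4.7), the sketch below evaluates it numerically. The function names are hypothetical, and lengths are assumed to be in µm with spatial frequencies in rad/µm. For spatial frequencies far below $1/D_x$ the expression reduces to $\omega_x D_x$, so at the long wavelengths typical of wafer gradients the pair responds in proportion to its separation.

```python
import math


def sinc(u):
    """sin(u)/u with the removable singularity at u = 0 filled in."""
    return 1.0 if u == 0.0 else math.sin(u) / u


def geometry_gain(wx, wy, w, l, dx):
    """Magnitude of G(wx, wy)/(W*L) for a pair with Dy = 0, as in eq. (4.7)."""
    return abs(sinc(wx * l / 2.0) * sinc(wy * w / 2.0)
               * 2.0 * math.sin(wx * dx / 2.0))


# Hypothetical pair: W = L = 10 um, separated by Dx = 100 um
low_freq_gain = geometry_gain(1e-4, 0.0, 10.0, 10.0, 100.0)
```

With these illustrative numbers the gain is within a small fraction of a percent of $\omega_x D_x = 0.01$.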

For a pair of transistors in a common centroid configuration, as shown in Fig. 4.3, it would be

\[
\left|\frac{1}{2WL} \mathcal{G}(\omega_x, \omega_y)\right| = \frac{\sin\left(\frac{\omega_x L}{2}\right) \sin\left(\frac{\omega_y W}{2}\right)}{\frac{\omega_x L}{2} \frac{\omega_y W}{2}} \{2 \sin\left(\frac{\omega_x D_x}{2}\right) \sin\left(\frac{\omega_y D_y}{2}\right)\}
\]

(4.8)

Fig. 4.3: Layout Configuration for a Transistor Pair using Common Centroid

Fig. 4.4: Wafer Gradients

**Fig. 4.5: Approximate Shape of Frequency Domain Function \( W() \)** (figure labels: \( W(\omega_x, \omega_y) \), \( P_1 \), \( D_w \) = Wafer Diameter)

Fig. 4.4 shows a wafer in which typical contour lines of constant property \( P \) have been drawn. In the wafer, at coordinate \((x_{12}, y_{12})\), a pair of transistors is drawn. Assuming that when averaging \( \Delta P(x_{12}, y_{12}) \) all over the wafer we have

\[
\overline{\Delta P} \bigg|_{\text{wafer}} = 0
\]

(4.9)

we can write that

\[
\sigma^2 (\Delta P) = \frac{1}{\Omega} \int \int_{\Omega} \Delta P^2 (x_{12}, y_{12}) \, dx_{12} \, dy_{12}
\]

(4.10)

where \( \Omega \) is the area of the wafer. Applying Parseval’s Theorem to eq. (4.10) results in

\[
\sigma^2 (\Delta P) = \frac{1}{4\pi^2 \Omega} \int_{-\infty}^{\infty} \, d\omega_x \int_{-\infty}^{\infty} \, d\omega_y \left| \frac{1}{W L} \mathcal{G}(\omega_x, \omega_y) \mathcal{P}(\omega_x, \omega_y) \right|^2
\]

(4.11)

Let us now make the following assumption for function \( \mathcal{P}(\omega_x, \omega_y) \):

\[
\mathcal{P}(\omega_x, \omega_y) = P_o + W(\omega_x, \omega_y)
\]

(4.12)

where \( P_o \) is a constant (frequency independent) term representing white noise and \( W(\omega_x, \omega_y) \) is a wafer map component responsible for long distance gradients along the wafer. The spatial frequency content of \( W(\omega_x, \omega_y) \) is concentrated at frequencies of the order of \( D_w^{-1} \), where \( D_w \) is the wafer diameter. Therefore, \( W(\omega_x, \omega_y) \) can be assumed to have a shape of the type depicted in Fig. 4.5, and consequently we can assume that

\[
W(\omega_x, \omega_y) = \begin{cases} 
P_1 & \text{if } \left( -\frac{1}{D_w} \leq \omega_x \leq \frac{1}{D_w} \right) \text{ and } \left( -\frac{1}{D_w} \leq \omega_y \leq \frac{1}{D_w} \right) \\
0 & \text{otherwise}
\end{cases}
\]

(4.13)

Therefore, eq. (4.11) can be written as

\[
\sigma^2 (\Delta P) = \frac{1}{4\pi^2 \Omega W^2 L^2} \int d\omega_x \int d\omega_y \, |\mathcal{G}|^2 |P_o + \mathcal{W}|^2 = \frac{1}{4\pi^2 \Omega W^2 L^2} \{|P_o|^2 Y_1 + Y_2\}
\]  

(4.14)

Assuming a transistor pair as in Fig. 4.2, it would be

\[
Y_1 = \int d\omega_x \int d\omega_y \, |\mathcal{G}|^2 = 8\pi^2 WL
\]

(4.15)

and

\[
Y_2 = \int d\omega_x \int d\omega_y \, |\mathcal{G}|^2 \left[ P_o^* \mathcal{W} + P_o \mathcal{W}^* + |\mathcal{W}|^2 \right]
\]

(4.16)

Since \( D_w \gg D_x, W, L \), within the band where \( \mathcal{W} \) is non-zero we have \( |\mathcal{G}(\omega_x, \omega_y)| \approx \omega_x D_x L W \). Thus,

\[
Y_2 \approx \int_{-\frac{1}{D_w}}^{\frac{1}{D_w}} d\omega_x \int_{-\frac{1}{D_w}}^{\frac{1}{D_w}} d\omega_y \, \omega_x^2 D_x^2 L^2 W^2 \left[ P_o^* \mathcal{W} + P_o \mathcal{W}^* + |\mathcal{W}|^2 \right] = D_x^2 L^2 W^2 k_o \int_{-\frac{1}{D_w}}^{\frac{1}{D_w}} d\omega_x \int_{-\frac{1}{D_w}}^{\frac{1}{D_w}} d\omega_y \, \omega_x^2
\]

(4.17)

where \( k_o = P_o^* \mathcal{W} + P_o \mathcal{W}^* + |\mathcal{W}|^2 \approx P_o^* P_1 + P_o P_1^* + |P_1|^2 \), and

\[
Y_2 = \frac{4k_o D_x^2 L^2 W^2}{3D_w^4}
\]

(4.18)

This results in

\[
\sigma^2 (\Delta P) = \frac{2|P_o|^2}{\Omega WL} + \frac{k_o D_x^2}{3\pi^2 \Omega D_w^4} = \frac{A_p^2}{WL} + S_p D_x^2
\]

(4.19)

with

\[
A_p^2 = \frac{2|P_o|^2}{\Omega}, \quad S_p = \frac{k_o}{3\pi^2 \Omega D_w^4}
\]

For a common centroid configuration it can be shown that the result is

\[
\sigma^2 (\Delta P) = \frac{|P_o|^2}{\Omega WL} + \frac{k_o D_x^2 D_y^2}{36\pi^2 \Omega D_w^4} = \frac{A_p^2}{2WL} + S_p D_x^2 D_y^2
\]

(4.20)

As deduced from Pelgrom’s Model, the standard deviation \(\sigma\) which characterizes the mismatch of parameter \(P\) between two transistors of area \(W \times L\) separated by a distance \(D\) is given by

\[
\sigma^2 (\Delta P) = \frac{A_p^2}{WL} + S_p D^2
\]

(4.21)

where \(A_p\) and \(S_p\) are process dependent parameters that need to be characterized.

In eq. (4.21) we can distinguish two components of the mismatch of electrical parameters between two transistors: an area dependent component (caused by the random deviations of Fig. 4.1) and a distance dependent component (caused by the “randomness” of the location of the transistor pair in the systematic surface of Fig. 4.1).
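As a minimal sketch of how eq. (4.21) might be used at design time, the helper below evaluates the mismatch standard deviation for a given geometry. The function name and the numerical values are hypothetical, and the area term is taken as $A_p^2/(WL)$, as in eq. (4.19).

```python
import math


def mismatch_sigma(a_p, s_p, w, l, d):
    """sigma(Delta P) with sigma^2 = A_p^2/(W*L) + S_p*D^2.

    a_p: area coefficient (units of P times um)
    s_p: distance coefficient (units of P squared per um^2)
    w, l, d: transistor width, length and pair separation in um
    """
    return math.sqrt(a_p ** 2 / (w * l) + s_p * d ** 2)


# Hypothetical numbers: A_p = 2 %*um, distance term suppressed by layout
sigma_10x10 = mismatch_sigma(2.0, 0.0, 10.0, 10.0, 0.0)
sigma_20x20 = mismatch_sigma(2.0, 0.0, 20.0, 20.0, 0.0)
```

With the distance term removed, quadrupling the transistor area halves the standard deviation, which is the trade-off between matching and area that the designer has to weigh.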

Parameter \(A_p\) stays quite stable from die to die, wafer to wafer, run to run, and even (in first approximation) from foundry to foundry. It can be characterized with a few dies. On the other hand, parameter \(S_p\) is more characteristic of each foundry, and for a good characterization many dies per wafer and run are needed. However, parameter \(S_p\) is not a critical parameter for circuit designers, since the surface component of Fig. 4.1 can be drastically reduced using layout techniques, such as common centroid (Fig. 7.2-5 of [4.2]) as shown in eq. (4.20), for those transistors whose matching is critical.

In this Appendix we provide characterization results for the parameters \(A_p\). These parameters cannot be compensated with layout techniques, and therefore need to be well characterized so that circuit designers can take their effects into account during the circuit design process.

### 4.3. Mismatch Characterization Chip

For the mismatch characterization of each CMOS process we propose to fabricate a special purpose chip for each technology. In this Appendix we will show the characterization results of parameter \(A_p\) of eq. (4.21) for two CMOS processes: the ES2\(^1\) 1.0\(\mu\)m CMOS process and the CNM\(^2\) 2.5\(\mu\)m CMOS process. For each process a chip was fabricated that contains a matrix of cells. Each cell contains a number of NMOS and PMOS transistors of different sizes. Additional decoding-selection circuitry is added to each cell and to the chip, so that only one transistor at a time is selected and connected to the outside pins for characterization. In this way, we avoid having to access each transistor with special probes, which significantly reduces the measurement time and cost.

Fig. 4.6: Experimental setup for the Automatic Characterization of the “Mismatching” between MOS Transistors


Fig. 4.6 shows a simplified diagram of the chip and the external control and measurement equipment. In the chip, transistors are grouped in pairs: one NMOS and one PMOS. All NMOS transistors in the chip share their Drains in a common node connected to pin DN. All PMOS transistors share their Drains at pin DP. All NMOS and PMOS transistors share their Sources at pin S. All the transistors have their Gates short-circuited to their Sources, except for one transistor pair: the one that receives a high "select" signal. This pair has its Gates connected to the external pin G. If pin DP is left unconnected and the current between pins S and DN is measured, then the selected NMOS transistor is being measured. If pin DN is left unconnected and the current between pins S and DP is measured, then the selected PMOS transistor is being measured. Fig. 4.7 shows the schematic of the circuitry in a chip containing a matrix of $8 \times 8$ cells\(^3\) with 8 pairs of NMOS and PMOS transistors of different sizes inside each cell. For this chip we can select separately each of the 512 transistor pairs using decoding circuitry with 8 control bits. One group of bits (from $b_0$ to $b_1$ in Fig. 4.7) selects the active matrix row through the row decoder. Another group of bits (from $b_2$ to $b_4$ in Fig. 4.7) selects the active matrix column through the column decoder. Finally, a decoder that selects the size of the active transistor pair inside the selected cell is added to each cell and is controlled by another group of bits (from $b_5$ to $b_7$ in Fig. 4.7).

The active pair of transistors is selected by the digital decoding circuitry, which is controlled through an external bus by a host computer. The host computer also controls a DC curve tracer (HP4145) which, through chip pins DN, DP, S and G, measures the curves of the active NMOS and PMOS transistors. In this way, it is possible to characterize a large number of transistors per chip without automatic probe positioning machines.
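On the host computer side, the selection word might be assembled as sketched below. This is a hypothetical illustration: the function names are invented, the row field is placed in the lowest bits followed by the column and size fields (the order in which the bit groups are listed above), and the field widths are left as parameters because they depend on the array size. The three-bit widths in the usage lines are assumptions chosen to address 8 rows, 8 columns and 8 sizes.

```python
def pack_address(row, col, size, row_bits, col_bits, size_bits):
    """Pack (row, column, size) selectors into a single control word.

    The row field occupies the lowest bits, then the column field,
    then the size field.
    """
    for value, bits in ((row, row_bits), (col, col_bits), (size, size_bits)):
        if not 0 <= value < (1 << bits):
            raise ValueError("selector does not fit in its bit field")
    return row | (col << row_bits) | (size << (row_bits + col_bits))


def all_addresses(row_bits, col_bits, size_bits):
    """Enumerate every control word, e.g. to scan the whole array."""
    return [pack_address(r, c, s, row_bits, col_bits, size_bits)
            for s in range(1 << size_bits)
            for c in range(1 << col_bits)
            for r in range(1 << row_bits)]


# Illustrative widths: three bits per field
word = pack_address(row=5, col=2, size=7, row_bits=3, col_bits=3, size_bits=3)
scan = all_addresses(3, 3, 3)
```

A measurement script would then loop over `scan`, write each word to the control bus and trigger the curve tracer for the selected pair.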

### 4.4. Transistor Measurement

The most critical electrical parameters responsible for current mismatches between transistors are:

- Beta: $\beta = \frac{W}{L} \mu C_{ox}$
- Threshold Voltage: $V_{T0}$
- Gamma: $\gamma$

---

1. ES2: European Silicon Structures, available through the EUROPRACTICE services.
2. CNM: National Microelectronics Center Silicon Foundry at Barcelona, Spain.
3. For CNM-2.5\(\mu\)m technology the chip contains a $7 \times 8$ cell array, and for the ES2-1.0\(\mu\)m technology the chip contains an $8 \times 8$ array of cells.

Fig. 4.7: Schematic of the internal decoding circuitry in the chip

In a Level 1 (H)Spice model of the MOS transistor, the Drain-to-Source current is given by

\[
I_{DS} = \frac{\beta}{2} \, \frac{\left( V_{GS} - V_T(V_{SB}) \right)^2}{1 + \theta (V_{GS} - V_T(V_{SB}))} \quad \text{for } V_{DS} \geq V_{GS} - V_T
\]
\[
I_{DS} = \beta \, \frac{\left( V_{GS} - V_T(V_{SB}) - \frac{1}{2} V_{DS} \right) V_{DS}}{1 + \theta (V_{GS} - V_T(V_{SB}))} \quad \text{for } V_{DS} \leq V_{GS} - V_T
\]  \hspace{1cm} (4.22)

where \( \theta \) is responsible for the mobility degradation effect. The threshold voltage, which depends on the Source to Bulk voltage, is given by,

\[
V_T(V_{SB}) = V_{T0} + (\eta - 1) V_{SB} + \gamma \left( \sqrt{\phi + V_{SB}} - \sqrt{\phi} \right)
\]  \hspace{1cm} (4.23)

where \( \phi = 0.6 \), \( \eta \) is an extra (fitting) parameter, and \( V_{T0} = V_T(V_{SB}=0) \).

For each transistor two curves are measured:

- **Curve 1:** \( V_{DS} = 0.1V \), \( V_{SB} = 0V \), \( V_{GS} \) swept from \( 1.5V \) to \( 5.0V \)
  \[
  I_{DS} = \beta \left( \frac{V_{GS} - V_{T0} - \frac{1}{2} \times 0.1V}{1 + \theta (V_{GS} - V_{T0})} \right) 0.1V
  \]

- **Curve 2:** \( V_{DS} = 0.1V \), \( V_{GS} = 3.0V \), \( V_{SB} \) swept from \( 0.0V \) to \( 2.0V \)
  \[
  I_{DS} = \beta \frac{\left(3.0V - V_T(V_{SB}) - \frac{1}{2} \times 0.1V\right) 0.1V}{1 + \theta \left(3.0V - V_T(V_{SB})\right)}
  \]

For Curve 1 parameters \( \beta, V_{T0}, \) and \( \theta \) are fitted using Nonlinear Curve Fitting [4.3]. Random deviations in parameters \( \theta \) and \( \eta \) are assumed to have negligible contribution to current mismatches, so parameter \( \theta \) is only fitted for the first transistor of each size and that value is taken for the other transistors of the same size in the same chip. This way, the current mismatch is assumed to be due to mismatches in parameters \( \beta \) and \( V_{T0} \) only. For Curve 2, parameters \( \beta, V_{T0}, \) and \( \theta \) are taken from the fitted values from Curve 1, and the measured curve

\[ V_\gamma = V_T(V_{SB}) - V_{T0} = V_{GS} - V_{T0} - \frac{I_{DS} + \frac{1}{2}\beta V_{DS}^2}{\beta V_{DS} - \theta I_{DS}} \quad (V_{DS} = 0.1V, \ V_{GS} = 3.0V) \]  \hspace{1cm} (4.24)

is fitted to the curve

\[ V_\gamma = (\eta - 1) V_{SB} + \gamma \left(\sqrt{\phi + V_{SB}} - \sqrt{\phi}\right). \]  \hspace{1cm} (4.25)

Parameter \( \eta \) is fitted only for the first transistor of each size and assumed to be the same for the other transistors of the same size. This assumption leaves a random nature only for parameter \( \gamma \).
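The Curve 2 processing can be sketched as follows. The code inverts the ohmic-region current expression for $V_T$, forms $V_\gamma$, and then fits $\gamma$ by linear least squares with $\eta$ held fixed, using the body-effect term $\sqrt{\phi + V_{SB}} - \sqrt{\phi}$. The function names and all numerical values are hypothetical, and the closed-form one-parameter fit is an illustrative simplification rather than the nonlinear fitting routine of [4.3].

```python
import math

PHI = 0.6  # value of phi quoted in the text


def ohmic_current(beta, theta, v_gs, v_t, v_ds):
    """Ohmic-region drain current of the Level 1 model with mobility degradation."""
    return beta * (v_gs - v_t - 0.5 * v_ds) * v_ds / (1.0 + theta * (v_gs - v_t))


def threshold_from_current(i_ds, beta, theta, v_gs, v_ds):
    """Solve the ohmic-region equation for V_T given a measured current."""
    overdrive = (i_ds + 0.5 * beta * v_ds ** 2) / (beta * v_ds - theta * i_ds)
    return v_gs - overdrive


def fit_gamma(v_sb, v_gamma, eta):
    """Least-squares gamma for V_gamma = (eta-1)*V_SB + gamma*g(V_SB), eta fixed."""
    g = [math.sqrt(PHI + v) - math.sqrt(PHI) for v in v_sb]
    y = [vg - (eta - 1.0) * v for vg, v in zip(v_gamma, v_sb)]
    return sum(gi * yi for gi, yi in zip(g, y)) / sum(gi * gi for gi in g)


# Synthetic round trip with hypothetical device parameters
beta, theta, v_t0, eta, gamma = 1e-4, 0.05, 0.8, 1.02, 0.5
v_sb = [0.0, 0.5, 1.0, 1.5, 2.0]
v_t = [v_t0 + (eta - 1.0) * v + gamma * (math.sqrt(PHI + v) - math.sqrt(PHI))
       for v in v_sb]
i_ds = [ohmic_current(beta, theta, 3.0, vt, 0.1) for vt in v_t]
v_gamma = [threshold_from_current(i, beta, theta, 3.0, 0.1) - v_t0 for i in i_ds]
gamma_fit = fit_gamma(v_sb, v_gamma, eta)
```

On synthetic data generated from the model itself, the procedure returns the value of $\gamma$ that was put in, which is a useful sanity check before applying it to measured curves.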

In this way, for all the transistors of the same size, parameters \( \beta, V_{T0}, \) and \( \gamma \) are extracted. Using this extraction procedure, we can represent the parameter \( P_{W,L} \) measured for a transistor of size \( W \times L \) as a function of the position of the transistor in the chip, thus generating the surface \( P_{W,L}(x, y) \).

Repeating this procedure for the different sizes and types of transistors we obtain the family of surfaces \( \beta_{W,L}(x, y), V_{T0-W,L}(x, y) \) and \( \gamma_{W,L}(x, y) \) for the NMOS and PMOS transistors. For technology CNM-2.5 \( \mu m \) 56 identical transistors were available in each chip, and 4 different transistor geometries were available in each cell, for both NMOS and PMOS transistors. For technology ES2-1.0\( \mu m \) there were 196 transistors of each of the 8 different geometries and for NMOS and PMOS types. To illustrate the surfaces we obtained for each parameter and for each set of equal sized transistors in each chip, Fig. 4.8 shows the surfaces \( \beta_{W,L}(x, y) \) for the NMOS transistors of the 4 different sizes for a chip of the CNM2.5\( \mu m \) process, and Fig. 4.9 depicts the surfaces \( V_{T0-W,L}(x, y) \) for the PMOS transistors of the 8 different sizes for a chip of the ES2 1.0\( \mu m \) process. Each surface represents the measured transistor parameter as a function of the position of the transistor in the chip. Below each surface a contour diagram has been included.

4.5. Statistical Data Processing

Let us name each of the surfaces obtained after applying the extraction procedure in the following way:

\[ P_{SK} = P_{SK}(n\Delta x, m\Delta y) \]  \hspace{1cm} (4.26)

where $P$ is the extracted parameter, which can be $\beta$, $V_{T0}$, or $\gamma$; $S$ denotes the transistor size; $K$ the type of transistor ($K=N$ for NMOS, $K=P$ for PMOS); and $n$ and $m$ specify the position of the transistor in the array. $\Delta x$ and $\Delta y$ are the separations between two adjacent cells of the array in the $x$ and $y$ directions.

Fig. 4.8: Surfaces $\beta(x,y)$ for NMOS transistor with size (W=L=40$\mu$m), (W=L=10$\mu$m), (W=L=5$\mu$m), (W=2$\mu$m, L=3$\mu$m) for a chip of the CNM 2.5$\mu$m process

According to this notation, the mean value of an electrical parameter $P_{SK}$ in each chip is computed as follows:

$$\bar{P}_{SK} = \frac{1}{(n_{\text{max}} + 1)(m_{\text{max}} + 1)} \sum_{n=0}^{n_{\text{max}}} \sum_{m=0}^{m_{\text{max}}} P_{SK}(n\Delta x, m\Delta y)$$  \hspace{1cm} (4.27)

where $n_{\text{max}}$ and $m_{\text{max}}$ are the largest cell indices in the $x$ and $y$ directions, so that each cell is repeated $n_{\text{max}} + 1$ and $m_{\text{max}} + 1$ times, respectively.

After computing the mean of each parameter, the following differences are computed at each point,

$$\Delta_x P_{SK}(n, m) = P_{SK}((n+1)\Delta x, m\Delta y) - P_{SK}(n\Delta x, m\Delta y) \quad \begin{cases} n = 0, \ldots, n_{\text{max}} - 1 \\ m = 0, \ldots, m_{\text{max}} \end{cases}$$  \hspace{1cm} (4.28)

$$\Delta_y P_{SK}(n, m) = P_{SK}(n\Delta x, (m+1)\Delta y) - P_{SK}(n\Delta x, m\Delta y) \quad \begin{cases} n = 0, \ldots, n_{\text{max}} \\ m = 0, \ldots, m_{\text{max}} - 1 \end{cases}$$  \hspace{1cm} (4.29)

Fig. 4.9: Surfaces $V_{TH}(x,y)$ for PMOS transistor sized (W=L=40μm), (W=L=20μm), (W=L=10μm), (W=L=4μm), (W=2.5μm, L=2μm), (W=2.5μm, L=1μm), (W=1.25μm, L=2μm), (W=1.25μm, L=1μm) for a chip of the ES2 1.0μm process

Afterwards, the standard deviation of the difference between parameters along the X-direction is computed:

\[
\sigma_{P_X, SK}^2 = \sigma^2 (\Delta_x P_{SK}) = \frac{\sum_{n=0}^{n_{max} - 1} \sum_{m=0}^{m_{max}} \left(\Delta_x P_{SK}(n, m) - \overline{\Delta_x P_{SK}}\right)^2}{n_{max}(m_{max} + 1)}
\]  

(4.30)

In the same way, the standard deviation along the Y-direction is computed:

\[
\sigma_{P_Y, SK}^2 = \sigma^2 (\Delta_y P_{SK}) = \frac{\sum_{n=0}^{n_{max}} \sum_{m=0}^{m_{max} - 1} \left(\Delta_y P_{SK}(n, m) - \overline{\Delta_y P_{SK}}\right)^2}{(n_{max} + 1) m_{max}}
\]  

(4.31)

Next, the average of these two quantities is computed, and its square root, expressed in %, is given as the relative standard deviation:

\[
\sigma_{P, SK} = \sqrt{\frac{\sigma_{P_X, SK}^2 + \sigma_{P_Y, SK}^2}{2}} \times 100
\]  

(4.32)

Both \(\sigma_{P, SK}\) and \(\bar{P}_{SK}\) are computed for each die and for each transistor size.
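The neighbour-difference statistics described in this section can be sketched as follows. The function name and the data layout (`p[m][n]`, with `m` the cell index along $y$ and `n` along $x$) are hypothetical, and each variance is normalized by the number of available differences, which is an assumption. The result is returned in the units of the parameter; scaling to a percentage is left to the caller.

```python
import math


def difference_sigma(p):
    """Combined deviation of nearest-neighbour differences of a parameter map."""
    rows, cols = len(p), len(p[0])
    # differences between horizontally and vertically adjacent cells
    dx = [p[m][n + 1] - p[m][n] for m in range(rows) for n in range(cols - 1)]
    dy = [p[m + 1][n] - p[m][n] for m in range(rows - 1) for n in range(cols)]

    def variance(d):
        mean = sum(d) / len(d)
        return sum((v - mean) ** 2 for v in d) / len(d)

    # root of the average of the two directional variances
    return math.sqrt(0.5 * (variance(dx) + variance(dy)))


# A pure gradient gives identical neighbour differences, hence zero deviation
plane = [[3.0 + 0.1 * n + 0.2 * m for n in range(5)] for m in range(4)]
sigma_plane = difference_sigma(plane)

# A checkerboard pattern is the opposite extreme
sigma_checker = difference_sigma([[0.0, 1.0], [1.0, 0.0]])
```

Subtracting the mean difference removes a constant gradient across the die, so only the cell-to-cell random component contributes to the result.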

Assuming these data fit curves of the type

\[
\sigma_{P, SK} = \frac{A_p}{\sqrt{W_{eff}L_{eff}}} \quad \begin{cases} W_{eff} = W - WD \\ L_{eff} = L - LD \end{cases}
\]  

(4.33)

the value of \(A_p\) was computed for each parameter \(P\) and for each die. Fig. 4.10 depicts the dependence on \(1/\sqrt{WL}\) of the measured values of \(\sigma (\Delta \beta)\), \(\sigma (\Delta V_{T0})\) and \(\sigma (\Delta \gamma)\) of the NMOS transistors of one chip of the CNM 2.5\(\mu m\) technology.
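Since eq. (4.33) is a straight line through the origin when plotted against $1/\sqrt{W_{eff}L_{eff}}$, the slope $A_p$ can be obtained in closed form. The sketch below assumes an unweighted least-squares fit; the function name and the sample data are hypothetical.

```python
import math


def fit_area_coefficient(sizes, sigmas, wd=0.0, ld=0.0):
    """Least-squares slope A_p of sigma = A_p / sqrt(W_eff * L_eff).

    sizes: list of (W, L) pairs in um
    sigmas: measured deviation for each size
    wd, ld: width and length reductions defining W_eff and L_eff
    """
    x = [1.0 / math.sqrt((w - wd) * (l - ld)) for w, l in sizes]
    return sum(xi * si for xi, si in zip(x, sigmas)) / sum(xi * xi for xi in x)


# Synthetic data generated with A_p = 2 (same units as sigma, times um)
sizes = [(40.0, 40.0), (10.0, 10.0), (5.0, 5.0)]
sigmas = [2.0 / 40.0, 2.0 / 10.0, 2.0 / 5.0]
a_p = fit_area_coefficient(sizes, sigmas)
```

Because the line is forced through the origin, the small transistors, which have the largest $1/\sqrt{WL}$, dominate the fitted slope.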

Finally, the values of \(\sigma_{P, SK}\) are averaged over all dies. Table 4.1 shows the computed averaged standard deviations for the chips of the CNM 2.5\(\mu m\) technology\(^4\) and Table 4.2 shows the averaged standard deviations for the chips of the ES2-1.0\(\mu m\) technology. These averaged values are also assumed to fit a curve of the type given by eq. (4.33). The fitted slopes of these curves are the statistical parameters \(A_p\) which characterize the technological process. Table 4.3 contains the final values of parameters \(A_\beta\), \(A_{V_{T0}}\) and \(A_\gamma\) obtained by fitting the curves over the averaged deviations for the NMOS and PMOS transistors of the CNM-2.5\(\mu m\) and the ES2-1.0\(\mu m\) CMOS processes.

In order to evaluate the confidence of these measurements, for one of the dies one transistor of each size was measured repeatedly (56 times) and the standard deviations were computed as if a 56-element array had been measured. Table 4.4 gives the values computed for the CNM 2.5\(\mu m\) process. These deviations represent the error within which the values of Table 4.1 have been measured and computed.

\(^4\) The results were measured for two types of substrates: (a) substrate of type p, and (b) substrate of type p with epitaxial p+ layer (p/p+).

Fig. 4.10: Dependence of functions $\sigma(\Delta \beta)$, $\sigma(\Delta V_{T0})$ and $\sigma(\Delta \gamma)$ (in %) versus $1/\sqrt{WL}$ for NMOS transistors of the CNM 2.5μm technology

<table>
<thead>
<tr>
<th></th>
<th>$W = 40\mu m$</th>
<th>$W = 20\mu m$</th>
<th>$W = 10\mu m$</th>
<th>$W = 5\mu m$</th>
<th>$W = 2.5\mu m$</th>
<th>$W = 2.5\mu m$</th>
<th>$W = 1.25\mu m$</th>
<th>$W = 1.25\mu m$</th>
</tr>
</thead>
<tbody>
<tr>
<td>NMOS</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>$\sigma(\beta)$ (%)</td>
<td>0.080</td>
<td>0.131</td>
<td>0.310</td>
<td>0.371</td>
<td>0.951</td>
<td>0.910</td>
<td>0.930</td>
<td>1.282</td>
</tr>
<tr>
<td>$\sigma(V_{T0})$ (mV)</td>
<td>0.652</td>
<td>1.339</td>
<td>1.362</td>
<td>2.623</td>
<td>5.218</td>
<td>9.940</td>
<td>6.210</td>
<td>13.122</td>
</tr>
<tr>
<td>$\sigma(\gamma)$ (mV$^{1/2}$)</td>
<td>0.396</td>
<td>0.763</td>
<td>0.487</td>
<td>1.415</td>
<td>3.421</td>
<td>5.698</td>
<td>3.037</td>
<td>9.652</td>
</tr>
<tr>
<td>PMOS</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>$\sigma(\beta)$ (%)</td>
<td>0.093</td>
<td>0.120</td>
<td>0.297</td>
<td>0.427</td>
<td>0.711</td>
<td>1.676</td>
<td>0.681</td>
<td>1.574</td>
</tr>
<tr>
<td>$\sigma(V_{T0})$ (mV)</td>
<td>1.201</td>
<td>1.535</td>
<td>1.492</td>
<td>2.787</td>
<td>5.188</td>
<td>11.868</td>
<td>7.132</td>
<td>14.147</td>
</tr>
<tr>
<td>$\sigma(\gamma)$ (mV$^{1/2}$)</td>
<td>0.522</td>
<td>0.646</td>
<td>0.697</td>
<td>1.612</td>
<td>4.148</td>
<td>7.011</td>
<td>4.051</td>
<td>7.945</td>
</tr>
</tbody>
</table>

Table 4.2: Standard Deviations Averaged over all Dies for the ES2-1.0μm process

Table 4.1: Standard Deviations Averaged over all Dies for the CNM 2.5μm process

<table>
<thead>
<tr>
<th>NMOS</th>
<th></th>
<th>$W = 40\mu m$</th>
<th>$W = 10\mu m$</th>
<th>$W = 5\mu m$</th>
<th>$W = 2\mu m$</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td>$L = 40\mu m$</td>
<td>$L = 10\mu m$</td>
<td>$L = 5\mu m$</td>
<td>$L = 3\mu m$</td>
</tr>
<tr>
<td>$p$</td>
<td>$\sigma(\beta)$</td>
<td>0.2200704</td>
<td>0.2612075</td>
<td>0.3846001</td>
</tr>
<tr>
<td></td>
<td>$\sigma(V_{TH})$</td>
<td>0.1236119</td>
<td>0.2850701</td>
<td>0.6125308</td>
</tr>
<tr>
<td></td>
<td>$\sigma(\gamma)$</td>
<td>0.0657665</td>
<td>0.1154645</td>
<td>0.2588394</td>
</tr>
<tr>
<td>$p/p+$</td>
<td>$\sigma(\beta)$</td>
<td>0.3447584</td>
<td>0.3985946</td>
<td>0.5431468</td>
</tr>
<tr>
<td></td>
<td>$\sigma(V_{TH})$</td>
<td>0.2419148</td>
<td>0.4164946</td>
<td>0.9343108</td>
</tr>
<tr>
<td></td>
<td>$\sigma(\gamma)$</td>
<td>0.1229629</td>
<td>0.2150784</td>
<td>0.4434690</td>
</tr>
</tbody>
</table>

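<table>
<thead>
<tr>
<th>PMOS</th>
<th></th>
<th>$W = 40\mu m$</th>
<th>$W = 10\mu m$</th>
<th>$W = 5\mu m$</th>
<th>$W = 2\mu m$</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td>$L = 40\mu m$</td>
<td>$L = 10\mu m$</td>
<td>$L = 5\mu m$</td>
<td>$L = 3\mu m$</td>
</tr>
<tr>
<td>$p$</td>
<td>$\sigma(\beta)$</td>
<td>0.2194395</td>
<td>0.2921146</td>
<td>0.4232929</td>
<td>1.061984</td>
</tr>
<tr>
<td></td>
<td>$\sigma(V_{TH})$</td>
<td>0.1935054</td>
<td>0.3779017</td>
<td>0.7190964</td>
<td>1.892969</td>
</tr>
<tr>
<td></td>
<td>$\sigma(\gamma)$</td>
<td>0.0707893</td>
<td>0.1269131</td>
<td>0.2446530</td>
<td>0.5855953</td>
</tr>
<tr>
<td>$p/p+$</td>
<td>$\sigma(\beta)$</td>
<td>0.2477367</td>
<td>0.3357617</td>
<td>0.5007437</td>
<td>0.8304728</td>
</tr>
<tr>
<td></td>
<td>$\sigma(V_{TH})$</td>
<td>0.3599902</td>
<td>0.6305608</td>
<td>0.9422545</td>
<td>1.590794</td>
</tr>
<tr>
<td></td>
<td>$\sigma(\gamma)$</td>
<td>0.1210937</td>
<td>0.3411530</td>
<td>0.6457047</td>
<td>0.8907923</td>
</tr>
</tbody>
</table>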

Table 4.3. Extracted $A_P$ parameters expressed in %μm

Another set of statistical data that may be of great help when designing a circuit is the correlations that may appear between the different electrical parameters $\beta_n$, $V_{T0_n}$, $\gamma_n$, $\beta_p$, $V_{T0_p}$, and $\gamma_p$, for each transistor size and for each technology. As previously said, we have 56 identical transistors in each CNM-2.5μm chip, and 64 in the ES2-1.0μm chips. Consequently, for each transistor size we can compute the statistical correlations between these 6 extracted parameters. The correlation between parameters $P_1$ and $P_2$ for transistors of size $W/L$ of a given chip is

$$r_{W/L}(P_1, P_2) = \frac{\sum_{n=1}^{n_{\text{max}}} \left( P_1(n) - \overline{P_1} \right) \left( P_2(n) - \overline{P_2} \right)}{\sigma(P_1) \sigma(P_2)}.$$

These values are computed for each chip and then averaged over all available chips. The standard deviation $s_{W/L}(P_1, P_2)$ of the values that the correlation $r_{W/L}(P_1, P_2)$ takes over the different chips is also computed. If a small value of $s_{W/L}(P_1, P_2)$ results, the corresponding correlation has a stable value over all measured chips. Otherwise, the correlation fluctuates strongly (and randomly) from chip to chip. Table 4.5, Table 4.6 and Table 4.7 show the correlations for the CNM-2.5µm process and Table 4.8, Table 4.9 and Table 4.10 show the correlations for the ES2-1.0µm process. The tables show the “mean ± sigma” of the correlations computed over all available chips. For CNM-2.5µm there were 11 operative chips, and for ES2-1.0µm there were 8.
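The per-chip correlation and its "mean ± sigma" over chips could be computed as in the sketch below. The function names and sample numbers are hypothetical, and the correlation is written in the usual normalized (Pearson) form so that it lies between -1 and 1, which is an assumption about the normalization intended in the formula above.

```python
import math


def correlation(p1, p2):
    """Normalized correlation between two parameter lists of equal length."""
    n = len(p1)
    m1, m2 = sum(p1) / n, sum(p2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(p1, p2))
    s1 = math.sqrt(sum((a - m1) ** 2 for a in p1))
    s2 = math.sqrt(sum((b - m2) ** 2 for b in p2))
    return cov / (s1 * s2)


def mean_and_sigma(values):
    """Mean and standard deviation of a correlation over all chips."""
    mean = sum(values) / len(values)
    sigma = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return mean, sigma


# Hypothetical per-chip correlations of one parameter pair
r_mean, r_sigma = mean_and_sigma([0.2, 0.4])
```

A small `r_sigma` relative to `r_mean` indicates a correlation that is stable from chip to chip.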
### Table 4.8: ES2 1.0μm correlations (mean±sigma)

<table>
<thead>
<tr>
<th>ES2</th>
<th>$\beta_n - V_{TO_n}$</th>
<th>$\beta_n - \gamma_n$</th>
<th>$V_{TO_n} - \gamma_n$</th>
<th>$\beta_p - V_{TO_p}$</th>
<th>$\beta_p - \gamma_p$</th>
<th>$V_{TO_p} - \gamma_p$</th>
</tr>
</thead>
<tbody>
<tr>
<td>40/40</td>
<td>0.34±0.10</td>
<td>0.12±0.26</td>
<td>0.81±0.12</td>
<td>0.06±0.30</td>
<td>-0.22±0.18</td>
<td>-0.29±0.29</td>
</tr>
<tr>
<td>20/10</td>
<td>0.05±0.20</td>
<td>-0.08±0.22</td>
<td>0.62±0.10</td>
<td>-0.37±0.16</td>
<td>+0.14±0.19</td>
<td>-0.28±0.12</td>
</tr>
<tr>
<td>10/10</td>
<td>-0.19±0.24</td>
<td>-0.15±0.20</td>
<td>0.54±0.18</td>
<td>-0.18±0.16</td>
<td>-0.03±0.11</td>
<td>-0.19±0.30</td>
</tr>
<tr>
<td>5/4</td>
<td>-0.27±0.11</td>
<td>-0.25±0.10</td>
<td>0.43±0.16</td>
<td>-0.30±0.12</td>
<td>-0.04±0.11</td>
<td>-0.27±0.16</td>
</tr>
<tr>
<td>2.5/2</td>
<td>-0.33±0.10</td>
<td>-0.46±0.09</td>
<td>0.41±0.11</td>
<td>-0.40±0.15</td>
<td>-0.19±0.12</td>
<td>-0.19±0.13</td>
</tr>
<tr>
<td>1.25/2</td>
<td>-0.26±0.90</td>
<td>-0.39±0.11</td>
<td>0.44±0.13</td>
<td>-0.29±0.07</td>
<td>-0.30±0.09</td>
<td>-0.07±0.11</td>
</tr>
<tr>
<td>2.5/1</td>
<td>-0.61±0.11</td>
<td>-0.75±0.04</td>
<td>0.66±0.14</td>
<td>-0.59±0.10</td>
<td>-0.68±0.08</td>
<td>0.30±0.13</td>
</tr>
<tr>
<td>1.25/1</td>
<td>-0.41±0.08</td>
<td>-0.57±0.08</td>
<td>0.55±0.09</td>
<td>-0.57±0.05</td>
<td>-0.64±0.05</td>
<td>0.27±0.11</td>
</tr>
</tbody>
</table>

### Table 4.9: ES2 1.0μm correlations (mean±sigma)

<table>
<thead>
<tr>
<th>ES2</th>
<th>$\beta_n - V_{TO_n}$</th>
<th>$\beta_p - \beta_p$</th>
<th>$V_{TO_n} - \beta_p$</th>
<th>$\gamma_n - \gamma_p$</th>
<th>$\gamma_n - V_{TO_n}$</th>
</tr>
</thead>
<tbody>
<tr>
<td>40/40</td>
<td>0.09±0.36</td>
<td>-0.24±0.37</td>
<td>0.42±0.16</td>
<td>-0.19±0.027</td>
<td>0.31±0.27</td>
</tr>
<tr>
<td>20/10</td>
<td>-0.01±0.26</td>
<td>-0.11±0.19</td>
<td>0.13±0.30</td>
<td>-0.01±0.22</td>
<td>0.16±0.26</td>
</tr>
<tr>
<td>10/10</td>
<td>-0.03±0.17</td>
<td>-0.12±0.11</td>
<td>-0.04±0.18</td>
<td>-0.19±0.22</td>
<td>-0.01±0.16</td>
</tr>
<tr>
<td>5/4</td>
<td>0.01±0.12</td>
<td>-0.08±0.14</td>
<td>0.02±0.15</td>
<td>-0.05±0.13</td>
<td>0.11±0.13</td>
</tr>
<tr>
<td>2.5/2</td>
<td>-0.05±0.09</td>
<td>-0.08±0.12</td>
<td>-0.06±0.10</td>
<td>-0.06±0.13</td>
<td>0.03±0.10</td>
</tr>
<tr>
<td>1.25/2</td>
<td>-0.05±0.10</td>
<td>-0.11±0.10</td>
<td>-0.06±0.08</td>
<td>0.06±0.11</td>
<td>-0.06±0.08</td>
</tr>
<tr>
<td>2.5/1</td>
<td>-0.25±0.10</td>
<td>-0.32±0.11</td>
<td>-0.32±0.17</td>
<td>-0.32±0.13</td>
<td>-0.40±0.10</td>
</tr>
<tr>
<td>1.25/1</td>
<td>-0.01±0.14</td>
<td>-0.09±0.13</td>
<td>-0.09±0.13</td>
<td>0.13±0.16</td>
<td>-0.10±0.14</td>
</tr>
</tbody>
</table>

Table 4.10: ES2 1.0μm correlations (mean±sigma)

### 4.6. (H)Spice Simulations

These results can now be used for circuit design. By including them in the simulation files, a circuit designer can estimate the deviation of the different circuit parameters. This allows transistors to be sized properly, so that design parameters stay within acceptable limits without wasting area. Fig. 4.11 shows an input file for the (H)Spice simulator in which these results are used through Monte Carlo simulations.
As a verification exercise, the simulator was used to obtain the same curves that were measured experimentally. The electrical parameters of the transistors were then extracted from the simulated curves using the same non-linear curve-fitting procedure that was applied to the measured curves. The computations explained in Section 4.5 were applied to the extracted parameters to obtain the statistical parameters $A_p$. Fig. 4.12 shows the mismatch characteristics predicted by the simulator. As can be seen, they agree very well with the measured data.
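
The idea behind such a Monte Carlo run can be sketched outside the simulator. The sketch below assumes the simple area law $\sigma(\Delta\beta/\beta) = A_\beta/\sqrt{WL}$ and considers only $\beta$ mismatch in a 1:1 current mirror in strong inversion; both are simplifying assumptions, the function names are hypothetical, and the value of `A_BETA` is illustrative only (chosen to be of the order of the `Ab` entries in the listing, with metres assumed as the unit).

```python
import math
import random
from statistics import stdev

A_BETA = 2.7e-8  # illustrative value in metres (assumption)

def sigma_rel(a_param, width, length):
    """Relative standard deviation A / sqrt(W * L), with W and L in metres."""
    return a_param / math.sqrt(width * length)

def mirror_error_sigma(a_beta, width, length, runs=2000, seed=0):
    """Monte Carlo estimate of the sigma of the relative output-current error
    of a 1:1 mirror when only beta mismatch is present (I_out/I_in = beta_out/beta_in)."""
    rng = random.Random(seed)
    s = sigma_rel(a_beta, width, length)
    errors = []
    for _ in range(runs):
        beta_in = rng.gauss(1.0, s)    # normalized beta of the input transistor
        beta_out = rng.gauss(1.0, s)   # normalized beta of the output transistor
        errors.append(beta_out / beta_in - 1.0)
    return stdev(errors)

est_10 = mirror_error_sigma(A_BETA, 10e-6, 10e-6)
est_40 = mirror_error_sigma(A_BETA, 40e-6, 40e-6)
```

Under these assumptions the estimate approaches $\sqrt{2}\,A_\beta/\sqrt{WL}$, so quadrupling both $W$ and $L$ reduces the spread by a factor of four at the cost of sixteen times the area.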

4.7. References



.param kp_n_global=58e-6
.param kp_p_global=16.7e-6
.param vto_n_global=1.00
.param vto_p_global=1.20
.param gamma_n_global=1.26
.param gamma_p_global=0.70

.subckt modn Drain Gate Source Bulk widthn lengthn
  m_modn Drain Gate Source Bulk widthn lengthn
  + ad=-4u' as=-4u' pds=2w+8u' ps=2w+8u'
  .MODEL modn NMOS
  + LEVEL = 2 VTO = vto_n KP = kp_n GAMMA = gamma_n
  + PHI = 0.72 LAMBDA = 0.009 MOB = 7 THETA = 0.154
  + NUB = 2E16 NJ = 1E-6 JS = 0.73E-3 RSH = 25
  + TOX = 3.68E-8 LD = 0.75U WD = 0U PB = 0.83
  + CJ = 5.63E-4 CSJM = 2.7E-9 MJ = 0.43 NJSN = 0.3
  + AP = 1 KP = 2.3E-27
  .param Ab=2.741175e-8
  .param Ap=3.806265e-8
  .param Aq=1.612826e-8
  .param sigma_kp_n='kp_n_global*Ab*sqrt(1/w1)'
  .param sigma_vto_n='vto_n_global*sqrt(1/w1)'
  .param sigma_gamma_n='gamma_n_global*Ab*sqrt(1/w1)'
  .param kp_n=gauss(kp_n_global,sigma_kp_n)
  .param vto_n=gauss(vto_n_global,sigma_vto_n)
  .param gamma_n=gauss(gamma_n_global,sigma_gamma_n)
.ends

.subckt modp Drain Gate Source Bulk widthp lengthp
  m_modp Drain Gate Source Bulk widthp lengthp
  + ad=-4u' as=-4u' pds=2w+8u' ps=2w+8u'
  .MODEL modp PMOS
  + LEVEL = 2 VTO = vto_p KP = kp_p GAMMA = gamma_p
  + PHI = 0.72 LAMBDA = 0.011 MOB = 7 THETA = 0.154
  + NUB = 2E16 NJ = 1E-6 JS = 4.5E-3 RSH = 115
  + TOX = 3.68E-8 LD = 0.75U WD = 0U PB = 0.56
  + CJ = 3.47E-4 CSJM = 1.73E-9 MJ = 0.36 NJSN = 0.39
  + AP = 1 KP = 2.3E-27
  .param Ab=2.994805e-8
  .param Ap=5.161805e-8
  .param Aq=1.630307e-8
  .param sigma_kp_p='kp_p_global*Ab*sqrt(1/w1)'
  .param sigma_vto_p='vto_p_global*sqrt(1/w1)'
  .param sigma_gamma_p='gamma_p_global*Ab*sqrt(1/w1)'
  .param kp_p=gauss(kp_p_global,sigma_kp_p)
  .param vto_p=gauss(vto_p_global,sigma_vto_p)
  .param gamma_p=gauss(gamma_p_global,sigma_gamma_p)
.ends

*xx11 vxx1 vgl 0 0 mmod w=40u l=40u
*xx11 0 vdd1 -0.1
*xx12 vdd2 vgl 0 0 mmod w=10u l=10u
*xx12 0 vdd2 -0.1
*xx13 vdd3 vgl 0 0 mmod w=5u l=5u
*xx13 0 vdd3 -0.1
*xx14 vdd4 vgl 0 0 mmod w=2u l=2u
*xx14 0 vdd4 -0.1
*vgl vgl 0
xx11 vdd1 vgl vgl 0 0 mmod w=40u l=40u
xx12 vdd2 vdd1 -0.1
xx12 vdd2 vgl vgl 0 0 mmod w=10u l=10u
xx12 vdd2 vdd2 -0.1
xx12 vdd3 vgl vgl 0 0 mmod w=5u l=5u
xx12 vdd3 vdd3 -0.1
xx14 vdd4 vgl vgl 0 0 mmod w=2u l=2u
xx14 vdd4 vdd4 -0.1
vgl vdd2 3.0
vdd2 vdd2 0
.options numopt=6
*.dc vgl 1.5 5.0 0.035 sweep monte=56
*.dc vgl 2 0.2 5.0 sweep monte=56
*.print dc i(vdd1) i(vdd2) i(vgl) i(vdd4)
*.print dc i(vdd1) i(vdd2) i(vgl) i(vdd4)
.end
Appendix 5: Multichip Realizations with ART 1 Modules

5.1. A Compact ART 1 Design

Saving chip area is crucial to improving the yield of integrated circuits. Since yield decreases exponentially with chip area, the search for better yield has driven us to design a more compact prototype of the ART 1 chip.

A new prototype which implements the ART 1m algorithm has been designed. Fig. 5.1 depicts the block diagram of the chip. As can be seen, the block diagram of this new prototype is the same as the one reported in Appendix 2, except for its reduced dimensions. It has an input layer of 50 input pixels, instead of 100, and it clusters the input patterns into up to 10 categories, instead of 18. However, the area reduction achieved with this prototype is greater than the factor of \( 100/50 \times 18/10 = 3.6 \) that the smaller dimensions alone would give. The area of the actual circuitry is \( 6.40mm^2 \), which is a reduction of about 15 times with respect to the previous prototype.

This area reduction was possible thanks to the elimination of the current mirror trees which were used in the first prototype to replicate currents \( L_A \) and \( L_B \) all over the chip. These current mirror trees were used to eliminate the systematic error component in the mirror output currents [5.2]. The systematic error is due to gradients that appear in the \( \beta(x, y) \), \( V_{TO}(x, y) \), and \( \gamma(x, y) \) surfaces (see Appendix 4). Measurements of the \( \beta \), \( V_{TO} \) and \( \gamma \) parameters of a \( 6 \times 6 \) transistor matrix, which occupies an area of \( 2.5mm \times 2.2mm = 5.5mm^2 \), showed that for chip dimensions of this order the systematic error component in the transistor currents is of the same order as the random error component. The matrix of synapses in this prototype occupies an area of \( 2.6mm \times 0.8mm = 2.1mm^2 \). Consequently, for our chip dimensions, currents \( L_A \) and \( L_B \) can be replicated directly using simple current mirrors whose output transistors are distributed over the synapse matrix, without severely degrading the precision of the output currents.

In Appendix 4, a special-purpose chip was developed to fully characterize the parameters of a matrix of transistors. For the ES2-1.0\( \mu \)m technology, a chip with a \( 6 \times 6 \) cell matrix was designed, containing NMOS and PMOS transistors of different sizes. Characteristic curves of these transistors were measured and their electrical parameters extracted. As a result, we know the parameters $\beta$, $V_{TO}$ and $\gamma$ of several arrays of transistors that have been fabricated in the ES2-1.0μm technology. If these transistors were the output transistors of a multiple-output current mirror, we could predict the output currents as a function of position $I_o(x, y)$, and we could separate the contributions of the gradient and the random components.

**Fig. 5.1: Block diagram of new ART 1 chip prototype**

Let us use the extracted parameters $\beta(x, y)$, $V_{TO}(x, y)$ and $\gamma(x, y)$ of a typical chip characterized in Appendix 4. For each transistor position we can compute how much each extracted parameter deviates from the mean,

$$\frac{\Delta\beta(x, y)}{\bar{\beta}} = \frac{\beta(x, y) - \bar{\beta}}{\bar{\beta}}$$

(5.1)

$$\Delta V_{TO}(x, y) = V_{TO}(x, y) - \bar{V}_{TO}$$

(5.2)

$$\Delta \gamma(x, y) = \gamma(x, y) - \bar{\gamma}$$

(5.3)

Using these deviations we can build a Spice file with a multiple-output current mirror and obtain the surface of simulated output currents $I_o^d(x, y)$. This is shown in Fig. 5.2(a) for an NMOS multiple-output current mirror whose input current is $I_{in} = 10\mu A$ and with transistor sizes $W = L = 10\mu m$. Fig. 5.2(b) depicts the same for a PMOS multiple-output current mirror.

For each of the surfaces formed by the simulated output currents $I_o^d(x, y)$, we calculate the parameters of the plane $I_o^p(x, y) = Ax + By + C$ that best fits the points of the simulated surface $I_o^d(x, y)$. Parameters $A$ and $B$ represent the gradient components of the surface.

After obtaining the optimum plane $I_o^p(x, y)$ that fits the surface $I_o^d(x, y)$, we estimate the random error component present in $I_o^d(x, y)$ by computing the standard deviation of the difference $\Delta I_o = I_o^d(x, y) - I_o^p(x, y)$. That is,
\[ \sigma(\Delta I_o) = \sqrt{\frac{1}{n_{\max} \cdot m_{\max}} \sum_{n=1}^{n_{\max}} \sum_{m=1}^{m_{\max}} \left( I_o^d(n\Delta x, m\Delta y) - I_o^p(n\Delta x, m\Delta y) - \overline{\Delta I_o} \right)^2} \]

(5.4)

where \( n_{\max} \) and \( m_{\max} \) are the number of times each transistor is repeated in the x and y directions, \( \Delta x \) and \( \Delta y \) are the distances between two adjacent transistors in the x and y directions, respectively, and \( \overline{\Delta I_o} \) is the mean value of the difference \( \Delta I_o \).

The total standard deviation (random+systematic) can also be computed as,

\[ \sigma_T(I_o) = \sqrt{\frac{1}{n_{\max} \cdot m_{\max}} \sum_{n=1}^{n_{\max}} \sum_{m=1}^{m_{\max}} \left( I_o^d(n\Delta x, m\Delta y) - \bar{I}_o \right)^2} \]

(5.5)
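
The separation of the gradient and random components described above can be sketched as follows. The sketch assumes a full rectangular grid with positions measured in units of the transistor pitch ($x = n$, $y = m$) and a $1/(n_{\max} m_{\max})$ normalization of the standard deviations; the function names are hypothetical and the test surface is synthetic.

```python
import math

def fit_plane(surface):
    """Least-squares plane A*x + B*y + C for surface[n-1][m-1] on a full grid.
    On a full rectangular grid the centred x and y coordinates are orthogonal,
    so the two slopes can be computed independently."""
    n_max, m_max = len(surface), len(surface[0])
    xs = range(1, n_max + 1)
    ys = range(1, m_max + 1)
    x_bar = sum(xs) / n_max
    y_bar = sum(ys) / m_max
    i_bar = sum(sum(row) for row in surface) / (n_max * m_max)
    sxx = m_max * sum((x - x_bar) ** 2 for x in xs)
    syy = n_max * sum((y - y_bar) ** 2 for y in ys)
    a = sum((x - x_bar) * surface[x - 1][y - 1] for x in xs for y in ys) / sxx
    b = sum((y - y_bar) * surface[x - 1][y - 1] for x in xs for y in ys) / syy
    c = i_bar - a * x_bar - b * y_bar
    return a, b, c

def error_components(surface):
    """Returns (A, B, sigma_random, sigma_total) for a surface of currents."""
    n_max, m_max = len(surface), len(surface[0])
    count = n_max * m_max
    a, b, c = fit_plane(surface)
    residuals = [surface[n - 1][m - 1] - (a * n + b * m + c)
                 for n in range(1, n_max + 1) for m in range(1, m_max + 1)]
    mean_res = sum(residuals) / count
    sigma_random = math.sqrt(sum((r - mean_res) ** 2 for r in residuals) / count)
    i_bar = sum(sum(row) for row in surface) / count
    sigma_total = math.sqrt(sum((i - i_bar) ** 2 for row in surface for i in row) / count)
    return a, b, sigma_random, sigma_total

# Synthetic, purely planar 6 x 6 surface (not measured data).
surface = [[10.0 + 0.02 * n - 0.01 * m for m in range(1, 7)] for n in range(1, 7)]
a, b, sigma_random, sigma_total = error_components(surface)
```

For a purely planar surface the random component vanishes and the total deviation is entirely systematic, which is the limiting case that the plane fit is meant to isolate.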

Simulations were performed using the extracted parameters of the \( 6 \times 6 \) matrices of NMOS transistors of size \( W = L = 10 \mu m \) located in 8 different chips. The input current level was set to 10\( \mu A \).

Let us call the maximum value of the interpolated plane \( I_{\text{maxplane}} \), that is,

\[ I_{\text{maxplane}} = \max \{ I_o^p(x, y) \} \]

(5.6)

and \( I_{\text{minplane}} \) the minimum value

\[ I_{\text{minplane}} = \min \{ I_o^p(x, y) \} \]

(5.7)

Let us compute the maximum systematic deviation in the output currents as

\[ \Delta I_o^p = I_{\text{maxplane}} - I_{\text{minplane}} \]  

(5.8)

Since, for a Gaussian distribution, about 99.7% of the random values remain within the interval \( \pm 3\sigma(\Delta I_o) \), let us measure the relationship between the random error component and the systematic error component by the following ratio

\[ r_{\text{ran/sys}} = \frac{6 \times \sigma(\Delta I_o)}{\Delta I_o^p} \]  

(5.9)
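
As a numerical check of Eq. (5.9), the ratio can be evaluated with the values listed for chip 1 in Table 5.1. The helper names below are hypothetical. The second helper reflects an observation rather than a statement of the text: the tabulated systematic deviations appear to be consistent with $n_{\max}|A| + m_{\max}|B|$ for $n_{\max} = m_{\max} = 6$.

```python
def ran_sys_ratio(sigma_random, delta_plane):
    """Eq. (5.9): full +/-3 sigma width of the random component divided by
    the span of the fitted plane."""
    return 6.0 * sigma_random / delta_plane

def plane_span(a, b, n_max, m_max):
    """Span of the plane A*x + B*y + C over n_max by m_max pitches
    (assumed form, see the lead-in)."""
    return n_max * abs(a) + m_max * abs(b)

# Chip 1 of Table 5.1, all values in microamperes.
r_chip1 = ran_sys_ratio(0.057368, 0.129800)
span_chip1 = plane_span(0.000990, -0.020644, 6, 6)
```

A ratio above one, as obtained here, means that the random spread is wider than the span of the gradient across the array.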

Table 5.1 contains the values of the random deviation component \( \sigma(\Delta I_o) \), the gradient components \( A \) and \( B \), the systematic deviation \( \Delta I_o^p \), the ratio \( r_{\text{ran/sys}} \), and the total standard deviation \( \sigma_T(I_o) \) obtained for the 8 different chips. These results are for chips of a size similar to \( 2.5mm \times 2.2mm = 5.5mm^2 \). As can be observed in Table 5.1, for chips of this size the systematic error component is of the same order as (and usually smaller than) the random deviation component.

The random error component is quite stable from chip to chip and its mean value averaged over the 8 chips is \( \bar{\sigma}(\Delta I_o) = 0.059 \mu A \), which means a relative random error in the output currents of approximately 0.59%. The total error component averaged over all the chips is \( \bar{\sigma}_T(I_o) = 0.076 \mu A \), which is a total relative error of 0.76%.
Similar simulations were performed for $6 \times 6$ matrices of PMOS transistors of size $W = L = 10\mu m$ for the 8 different chips. The input current was again set to $10\mu A$. Table 5.2 contains the values of $\sigma(\Delta I_o)$, $A$, $B$, $\Delta I_o^p$, $r_{ran/sys}$, and $\sigma_T(I_o)$ obtained for each chip.

The random error component is again quite stable from chip to chip, with a mean value averaged over the 8 chips of $\overline{\sigma}(\Delta I_o) = 0.046\mu A$, which is equivalent to a relative random current error of 0.46%. The systematic error component again varies widely from chip to chip but remains of the same order of magnitude as the random error component. The total standard deviation averaged over all the chips is $\overline{\sigma_T(I_o)} = 0.057\mu A$, which is a total relative error of 0.57%.

Based on these results we designed another ART 1 prototype that contains a matrix of $50 \times 10$ synapses. The synapse matrix occupies an area of $2.6mm \times 0.8mm = 2.1mm^2$. Each synapse contains two $L_A$ current sources and one $L_B$ current source. Fig. 5.3 contains a detailed schematic of each synapse circuit. In this prototype, each $L_A$ current source is an output transistor of a simple NMOS current mirror of size $W = L = 10\mu m$. Similarly, each $L_B$ current source is an output transistor of a simple PMOS current mirror of size $W = L = 10\mu m$.

<table>
<thead>
<tr>
<th>chip</th>
<th>$\sigma(\Delta I_o)$ ($\mu A$)</th>
<th>$A$ ($\mu A$)</th>
<th>$B$ ($\mu A$)</th>
<th>$\Delta I_o^p$ ($\mu A$)</th>
<th>$r_{ran/sys}$</th>
<th>$\sigma_T(I_o)$ ($\mu A$)</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>0.057368</td>
<td>0.000990</td>
<td>-0.020644</td>
<td>0.129800</td>
<td>2.652</td>
<td>0.067357</td>
</tr>
<tr>
<td>2</td>
<td>0.061811</td>
<td>-0.032629</td>
<td>0.000362</td>
<td>0.197943</td>
<td>1.874</td>
<td>0.083224</td>
</tr>
<tr>
<td>3</td>
<td>0.047458</td>
<td>0.029508</td>
<td>0.022027</td>
<td>0.309206</td>
<td>0.921</td>
<td>0.078784</td>
</tr>
<tr>
<td>4</td>
<td>0.051876</td>
<td>-0.011395</td>
<td>-0.003614</td>
<td>0.090057</td>
<td>3.456</td>
<td>0.055749</td>
</tr>
<tr>
<td>5</td>
<td>0.053839</td>
<td>-0.009419</td>
<td>-0.018067</td>
<td>0.164915</td>
<td>1.959</td>
<td>0.064105</td>
</tr>
<tr>
<td>6</td>
<td>0.058233</td>
<td>0.035253</td>
<td>0.014964</td>
<td>0.301303</td>
<td>1.160</td>
<td>0.087573</td>
</tr>
<tr>
<td>7</td>
<td>0.065038</td>
<td>-0.003324</td>
<td>-0.029267</td>
<td>0.195543</td>
<td>1.996</td>
<td>0.082222</td>
</tr>
<tr>
<td>8</td>
<td>0.072544</td>
<td>-0.031138</td>
<td>0.004652</td>
<td>0.214743</td>
<td>2.027</td>
<td>0.090298</td>
</tr>
</tbody>
</table>

Table 5.1: Output current error in a $6 \times 6$ NMOS current mirror

<table>
<thead>
<tr>
<th>chip</th>
<th>$\sigma(\Delta I_o)$ ($\mu A$)</th>
<th>$A$ ($\mu A$)</th>
<th>$B$ ($\mu A$)</th>
<th>$\Delta I_o^p$ ($\mu A$)</th>
<th>$r_{ran/sys}$</th>
<th>$\sigma_T(I_o)$ ($\mu A$)</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>0.058207</td>
<td>-0.017879</td>
<td>-0.007673</td>
<td>0.153314</td>
<td>2.278</td>
<td>0.067023</td>
</tr>
<tr>
<td>2</td>
<td>0.047461</td>
<td>-0.010970</td>
<td>-0.001420</td>
<td>0.074343</td>
<td>3.830</td>
<td>0.051083</td>
</tr>
<tr>
<td>3</td>
<td>0.048464</td>
<td>-0.006271</td>
<td>0.007500</td>
<td>0.082628</td>
<td>3.519</td>
<td>0.051260</td>
</tr>
<tr>
<td>4</td>
<td>0.039949</td>
<td>0.026079</td>
<td>0.010238</td>
<td>0.217897</td>
<td>1.100</td>
<td>0.063231</td>
</tr>
<tr>
<td>5</td>
<td>0.046463</td>
<td>-0.008931</td>
<td>0.001028</td>
<td>0.059749</td>
<td>4.666</td>
<td>0.048933</td>
</tr>
<tr>
<td>6</td>
<td>0.045003</td>
<td>0.032343</td>
<td>-0.004070</td>
<td>0.218475</td>
<td>1.236</td>
<td>0.071586</td>
</tr>
<tr>
<td>7</td>
<td>0.044067</td>
<td>0.013397</td>
<td>-0.000502</td>
<td>0.083394</td>
<td>3.171</td>
<td>0.049660</td>
</tr>
<tr>
<td>8</td>
<td>0.041021</td>
<td>-0.015243</td>
<td>-0.006057</td>
<td>0.127800</td>
<td>1.926</td>
<td>0.049673</td>
</tr>
</tbody>
</table>

Table 5.2: Output current error in $6 \times 6$ PMOS current mirrors
There are two analog switches controlled by signal $s_i$ that disconnect the output currents of the synapse from nodes $N_j$ and $N'_j$. These signals $s_i$, which are common to all the synapses in the same column, are generated by a column selection decoder. For each combination of the decoder input signals, only one column of synapses is injecting its output currents into nodes $N_j$ and $N'_j$. During the normal circuit operation, an enable input signal to the decoder is activated that allows all the signals $s_i$ to be high at the same time. Thus, all the synapses inject their output currents into nodes $N_j$ and $N'_j$.

Using a computer that controlled both the column selection decoder and a DC curve tracer, we were able to measure separately the output current flowing through each of the current sources $L_{A1}^{ij}$, $L_{A2}^{ij}$ and $L_B^{ij}$ in each synapse. Before these measurements, all the weights $z_{ij}$ were reset to their high state and all the input vector components $I_j$ were loaded with ‘1’ through the shift register. Currents $L_{A1}^{ij}$ and $L_{A2}^{ij}$ of each synapse were measured with the input current $L_B$ set to 0: using the selection circuitry, we connected one column at a time to nodes $N_j$ and $N'_j$ and measured the current flowing into these nodes. Afterwards, current $L_A$ was set to 0 and a current was injected through the input of the $L_B$ current mirror. The output current $L_B^{ij}$ of each synapse flowing into nodes $N_j$ was then measured.

Fig. 5.4(a) shows the measured output currents of the synapse current sources $L_{A1}^{ij}$ in one chip. The input current $L_A$ was set to 10μA. To obtain the surface shown in Fig. 5.4, we have represented the $L_{A1}^{ij}$ current as a function of the coordinates $(x, y)$ of the synapse where the source is located. Fig. 5.4(b) depicts the output currents $L_{A2}^{ij}$ of the same chip for the same input current level $L_A = 10μA$. Fig. 5.4(c) shows the surface obtained when we represent the synapse output currents $L_B^{ij}$ of a chip when the input current $L_B$ is set to 10μA.

To compute the random error component and the systematic deviation component of these surfaces we follow the same procedure used with the simulated surfaces. For each of the $L_{A1}$, $L_{A2}$ and $L_B$ surfaces, we calculate the $A$, $B$ and $C$ parameters of the plane $L_K^p(x, y) = Ax + By + C$ that best fits the measured surface $L_K^m(x, y)$. Afterwards, we compute the standard deviation of the difference $\Delta L_K(x, y) = L_K^m(x, y) - L_K^p(x, y)$, that is

\[ \sigma (\Delta L_K) = \sqrt{\frac{1}{n_{\max} \cdot m_{\max}} \sum_{n=1}^{n_{\max}} \sum_{m=1}^{m_{\max}} \left( L_K^m (n \Delta x, m \Delta y) - L_K^p (n \Delta x, m \Delta y) - \overline{\Delta L_K} \right)^2} \]  \hspace{1cm} (5.10)

which is a measurement of the random error component in the output current \( L_K \).

**Fig. 5.3: Diagram of a synapse in the ART 1 chip** (signals labeled in the diagram: $L_B^{ij}$, $L_{A1}^{ij}$, $L_{A2}^{ij}$, $\bar{y}_j$, $\bar{z}_{ij}$, $s_i$, $N_j$, $N'_j$, LEARN, RESET)

Fig. 5.4: Representation of the synapse output current as a function of the synapse position in the chip, (a) $L_{A1}$ output currents, (b) $L_{A2}$ output currents and, (c) $L_B$ output currents

We also find the maximum value of the output current obtained for a chip \( L_{K_{\text{maxplane}}} \), that is,

\[ L_{K_{\text{maxplane}}} = \max \{ L_K^p (x, y) \} \]  \hspace{1cm} (5.11)

the minimum output current \( L_{K_{\text{minplane}}} \) for a chip

\[ L_{K_{\text{minplane}}} = \min \{ L_K^p (x, y) \} \]  \hspace{1cm} (5.12)

the maximum systematic deviation in the output currents

\[ \Delta L_K^p = L_{K_{\text{maxplane}}} - L_{K_{\text{minplane}}} \]  \hspace{1cm} (5.13)

and the relation between the random error component and the systematic error component

\[ r_{K}^{\text{ran/sys}} = \frac{6 \times \sigma (\Delta L_K)}{\Delta L_K^p} \]  \hspace{1cm} (5.14)

Table 5.3 contains, for each chip, the values computed for the random error component \( \sigma (\Delta L_{A1}) \), the gradient components \( A \) and \( B \), the systematic deviation component \( \Delta L_{A1}^p \), the relation between the random and systematic deviation \( r_{A1}^{\text{ran/sys}} \), and the total standard deviation in the output currents \( \sigma_T (L_{A1}) \).

The mean value of the random error component averaged over all the chips, \( \bar{\sigma} (\Delta L_{A1}) \), is 0.065\( \mu A \), which is equivalent to a relative error of 0.65%. The mean value of the total standard deviation, \( \bar{\sigma}_T (L_{A1}) \), is 0.072\( \mu A \), which is a relative error in the output currents of 0.72%. Consequently, for these dimensions the gradient component does not cause a serious degradation in the current deviations.

<table>
<thead>
<tr>
<th>chip</th>
<th>$\sigma(\Delta L_{A1})$ ($\mu A$)</th>
<th>$A$ (nA)</th>
<th>$B$ (nA)</th>
<th>$\Delta L_{A1}^p$ ($\mu A$)</th>
<th>$r_{A1}^{\text{ran/sys}}$</th>
<th>$\sigma_T(L_{A1})$ ($\mu A$)</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>0.062479</td>
<td>0.711</td>
<td>10.361</td>
<td>0.139149</td>
<td>2.694</td>
<td>0.070785</td>
</tr>
<tr>
<td>2</td>
<td>0.050822</td>
<td>0.233</td>
<td>5.588</td>
<td>0.067536</td>
<td>5.311</td>
<td>0.062175</td>
</tr>
<tr>
<td>3</td>
<td>0.067802</td>
<td>2.660</td>
<td>-5.821</td>
<td>0.191207</td>
<td>2.128</td>
<td>0.080585</td>
</tr>
<tr>
<td>4</td>
<td>0.059039</td>
<td>0.544</td>
<td>0.089</td>
<td>0.028099</td>
<td>12.607</td>
<td>0.061145</td>
</tr>
<tr>
<td>5</td>
<td>0.063774</td>
<td>-1.179</td>
<td>6.046</td>
<td>0.126529</td>
<td>3.204</td>
<td>0.069474</td>
</tr>
<tr>
<td>6</td>
<td>0.064865</td>
<td>-1.981</td>
<td>2.947</td>
<td>0.128540</td>
<td>3.028</td>
<td>0.071346</td>
</tr>
<tr>
<td>7</td>
<td>0.065738</td>
<td>-0.125</td>
<td>-3.512</td>
<td>0.041365</td>
<td>9.535</td>
<td>0.068456</td>
</tr>
<tr>
<td>8</td>
<td>0.064164</td>
<td>-0.773</td>
<td>5.358</td>
<td>0.092227</td>
<td>4.174</td>
<td>0.066945</td>
</tr>
<tr>
<td>9</td>
<td>0.079046</td>
<td>2.258</td>
<td>10.701</td>
<td>0.219908</td>
<td>2.157</td>
<td>0.090626</td>
</tr>
<tr>
<td>10</td>
<td>0.074479</td>
<td>-0.529</td>
<td>1.666</td>
<td>0.043103</td>
<td>10.368</td>
<td>0.075308</td>
</tr>
</tbody>
</table>

Table 5.3: Measured output current error in the \( L_{A1} \) NMOS simple current mirror

Table 5.4 contains the same information as Table 5.3 but for the $L_{A2}$ output currents. The random standard deviation averaged over all the chips is 0.073\mu A, that is, a relative error of 0.73%. The mean value of the total standard deviation is 0.078\mu A, a 0.78% relative error.

Table 5.5 gives the measured deviation in the synapse current sources $L_B$. The mean of the random relative error is 0.62% and the mean of the total relative error is 0.70%.

### 5.2. Experimental Results of this ART 1 Prototype

As mentioned in the previous section, we have designed and fabricated a new prototype with an $F_1$ layer of $N = 50$ input pixels and a category layer with $M = 10$ categories. The chip has been designed and fabricated in the ES2-1.0μm double-metal single-poly CMOS technology, and has been mounted in an 84-pin PGA package. The circuit area is 6.40mm², but for our test prototype the total area is 18.82mm², the difference being due to the pads. An area reduction of around 15 times has been achieved in comparison with the prototype described in Appendix 2.

<table>
<thead>
<tr>
<th>chip</th>
<th>$\sigma(\Delta L_{A2})$ ($\mu A$)</th>
<th>$A$ (nA)</th>
<th>$B$ (nA)</th>
<th>$\Delta L_{A2}^p$ ($\mu A$)</th>
<th>$r_{L_{A2}}^{\text{ran/sys}}$</th>
<th>$\sigma_T(L_{A2})$ ($\mu A$)</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>0.069025</td>
<td>0.796</td>
<td>9.600</td>
<td>0.135811</td>
<td>3.057</td>
<td>0.075334</td>
</tr>
<tr>
<td>2</td>
<td>0.063631</td>
<td>-0.198</td>
<td>5.028</td>
<td>0.059737</td>
<td>6.391</td>
<td>0.064838</td>
</tr>
<tr>
<td>3</td>
<td>0.067669</td>
<td>2.419</td>
<td>-7.264</td>
<td>0.193584</td>
<td>2.097</td>
<td>0.080359</td>
</tr>
<tr>
<td>4</td>
<td>0.067981</td>
<td>0.207</td>
<td>-0.475</td>
<td>0.015092</td>
<td>27.027</td>
<td>0.070033</td>
</tr>
<tr>
<td>5</td>
<td>0.064327</td>
<td>-1.589</td>
<td>4.707</td>
<td>0.126529</td>
<td>3.050</td>
<td>0.070788</td>
</tr>
<tr>
<td>6</td>
<td>0.071239</td>
<td>-2.190</td>
<td>1.679</td>
<td>0.126294</td>
<td>3.384</td>
<td>0.078401</td>
</tr>
<tr>
<td>7</td>
<td>0.064670</td>
<td>-0.469</td>
<td>-1.609</td>
<td>0.039520</td>
<td>9.818</td>
<td>0.066243</td>
</tr>
<tr>
<td>8</td>
<td>0.063205</td>
<td>-1.179</td>
<td>6.548</td>
<td>0.124456</td>
<td>3.047</td>
<td>0.069372</td>
</tr>
<tr>
<td>9</td>
<td>0.107896</td>
<td>1.157</td>
<td>7.259</td>
<td>0.130448</td>
<td>4.962</td>
<td>0.112106</td>
</tr>
<tr>
<td>10</td>
<td>0.088077</td>
<td>-0.802</td>
<td>-1.394</td>
<td>0.054058</td>
<td>9.776</td>
<td>0.090044</td>
</tr>
</tbody>
</table>

**Table 5.4: Measured output current error in the $L_{A2}$ NMOS simple current mirror**

<table>
<thead>
<tr>
<th>chip</th>
<th>$\sigma(\Delta L_B)$ ($\mu A$)</th>
<th>$A$ (nA)</th>
<th>$B$ (nA)</th>
<th>$\Delta L_B^p$ ($\mu A$)</th>
<th>$r_{L_B}^{\text{ran/sys}}$</th>
<th>$\sigma_T(L_B)$ ($\mu A$)</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>0.062361</td>
<td>0.914</td>
<td>-1.588</td>
<td>0.061606</td>
<td>6.076</td>
<td>0.063976</td>
</tr>
<tr>
<td>2</td>
<td>0.059196</td>
<td>-0.126</td>
<td>-1.524</td>
<td>0.021530</td>
<td>16.497</td>
<td>0.060367</td>
</tr>
<tr>
<td>3</td>
<td>0.056222</td>
<td>3.203</td>
<td>-17.212</td>
<td>0.332290</td>
<td>1.015</td>
<td>0.089202</td>
</tr>
<tr>
<td>4</td>
<td>0.062793</td>
<td>-1.043</td>
<td>-3.764</td>
<td>0.089795</td>
<td>4.196</td>
<td>0.064319</td>
</tr>
<tr>
<td>5</td>
<td>0.064517</td>
<td>-2.690</td>
<td>-4.824</td>
<td>0.182743</td>
<td>2.118</td>
<td>0.075611</td>
</tr>
<tr>
<td>6</td>
<td>0.063882</td>
<td>-2.642</td>
<td>-1.728</td>
<td>0.149403</td>
<td>2.565</td>
<td>0.073131</td>
</tr>
<tr>
<td>7</td>
<td>0.059516</td>
<td>-1.182</td>
<td>-9.926</td>
<td>0.158375</td>
<td>2.255</td>
<td>0.066926</td>
</tr>
<tr>
<td>8</td>
<td>0.062204</td>
<td>-2.424</td>
<td>-2.666</td>
<td>0.147884</td>
<td>2.524</td>
<td>0.070645</td>
</tr>
<tr>
<td>9</td>
<td>0.062816</td>
<td>-0.059</td>
<td>-3.444</td>
<td>0.037389</td>
<td>10.080</td>
<td>0.063074</td>
</tr>
<tr>
<td>10</td>
<td>0.056737</td>
<td>-1.315</td>
<td>-15.073</td>
<td>0.216458</td>
<td>1.573</td>
<td>0.072737</td>
</tr>
</tbody>
</table>

**Table 5.5: Measured output current error in the $L_B$ PMOS current mirror**

This area reduction resulted in a great yield improvement: we obtained a 100% yield, compared with the poor 6.7% yield of the first prototype. In Appendix 2, we used the yield of our first prototype and assumed a dependence between the yield, the die area $\Omega$, and the process defect density $\rho_0$ given by the expression

\[
\text{yield} \, (\%) = 100 e^{-\rho_0 \Omega}
\]

(5.15)

to estimate a process defect density $\rho_0$ of 3.2cm⁻². If we use this value of $\rho_0$ to estimate the yield of a chip of area $\Omega$ = 0.064cm², a yield of 98% results. Therefore, the yield improvement achieved now is fully explained by the area reduction.
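
Eq. (5.15) and its inverse, which is what is needed to estimate a defect density from a measured yield, can be written as the following sketch. The function names are hypothetical; the only requirement is that the defect density and the area use consistent units (for instance cm⁻² and cm²).

```python
import math

def yield_percent(rho0, area):
    """Eq. (5.15): yield in percent for defect density rho0 and die area `area`."""
    return 100.0 * math.exp(-rho0 * area)

def defect_density(yield_pct, area):
    """Inverse of Eq. (5.15): defect density implied by a measured yield."""
    return -math.log(yield_pct / 100.0) / area
```

Because the dependence is exponential, halving the area takes the square root of the yield fraction, which is why a compact design pays off so strongly.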

The system level operation of this prototype has been tested not only for a single chip, but also when several chips were assembled horizontally to increase the size of the input patterns. In the next subsections, we show the test results for a single chip, and for a system formed by two horizontally assembled chips.

A. Single ART 1 Chip Operation

The operation of a single chip has been tested using the HP82000 digital test equipment. This equipment automatically applied a sequence of binary input patterns, and read the winning category and the stored weights after each input pattern had been classified and learned.

To test the system we trained it with a set of ten input patterns of 7 × 7 = 49 pixels. Each pattern represents one of the ten digits from ‘0’ to ‘9’. The last input pixel was always set to zero and is not shown in the figures. The classification of the set of input patterns was repeated for different values of the vigilance parameter ρ and several values of the parameter α = L_A/L_B.

Fig. 5.5 shows the training sequence for a vigilance ρ = 0.3 and α = 1.1. The first column represents the input pattern applied to the system. The remaining ten columns correspond to the weights stored in each category when the input pattern has been classified and learned. The underlined category is the winning category after the Winner-Take-All competition. In this case, learning self-stabilizes after two presentations of the input pattern set. That is, no modification of the winning category or the stored weights takes place in subsequent presentations of the input patterns. As shown in Fig. 5.5, the system has classified the ten input patterns into four categories.
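
The kind of training sequence shown in these figures can be imitated in software. The sketch below is not the chip's circuit behaviour; it assumes the fast-learning rules of the modified algorithm of Appendix 1 in the form: choice function $T_j = L_A |I \cap z_j| - L_B |z_j| + L_M$, vigilance test $\rho |I| \le |I \cap z_j|$, and update $z_j \leftarrow I \cap z_j$, with all weights initially high. The function name, the value of `l_m` and the toy patterns are hypothetical.

```python
def art1m_train(patterns, n_categories, rho, l_a, l_b, l_m, max_epochs=20):
    """Presents the pattern set repeatedly until neither the winners nor the
    weights change. Returns (winners of the last presentation, weights)."""
    n_inputs = len(patterns[0])
    z = [[1] * n_inputs for _ in range(n_categories)]  # all weights start high
    winners = []
    for _ in range(max_epochs):
        previous = (list(winners), [row[:] for row in z])
        winners = []
        for pattern in patterns:
            overlaps = [sum(p & w for p, w in zip(pattern, zj)) for zj in z]
            choice = [l_a * ov - l_b * sum(zj) + l_m for ov, zj in zip(overlaps, z)]
            # Search order: largest choice value first (ties go to the lowest index).
            order = sorted(range(n_categories), key=lambda j: -choice[j])
            # First category in the search order that passes the vigilance test.
            winner = next((j for j in order if overlaps[j] >= rho * sum(pattern)), None)
            if winner is not None:
                z[winner] = [p & w for p, w in zip(pattern, z[winner])]
            winners.append(winner)
        if (winners, z) == previous:
            break  # learning has self-stabilized
    return winners, z

# Toy 4-pixel patterns (not the digit patterns of the figures).
patterns = [[1, 1, 0, 0], [0, 0, 1, 1], [1, 1, 0, 0]]
winners, weights = art1m_train(patterns, n_categories=3, rho=0.5,
                               l_a=1.1, l_b=1.0, l_m=4.0)
```

In this toy run the first and third patterns share a category and the third category stays uncommitted, with all its weights still high.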

Fig. 5.6 shows the complete training sequence obtained for the same value of α = 1.1 but for a higher vigilance parameter, ρ = 0.5. Now, the system needs only one iteration of the input pattern set until learning self-stabilizes. Due to the higher vigilance parameter, the system forms more categories to classify the same input patterns. The system classifies the ten input patterns into six different categories. A similar effect occurs, as explained in Appendix 1, when the vigilance parameter remains constant but we increase the current ratio \( \alpha \). Fig. 5.7 shows the complete training sequence performed by the chip when the vigilance parameter is \( \rho = 0.3 \) but the current ratio has been increased to \( \alpha = 3.11 \). After three iterations the system forms six categories to classify the ten input patterns, as occurred when \( \rho = 0.5 \) and \( \alpha = 1.1 \).

Fig. 5.5: Classification performed by a chip for $\rho = 0.3$ and $\alpha = 1.1$

**B. Multichip ART 1 Operation**

A system composed of two horizontally connected chips was assembled. This system can cope with input patterns of \( 2 \times N \) binary pixels. Fig. 5.8 shows a diagram of two horizontally interconnected chips. To expand the system horizontally, nodes \( N_j, N_j' \), and \( N'' \) of the different chips have to be interconnected and, in all chips except one, isolated from the \( CMA \) and \( CMB \) current mirrors and from the adjustable gain \( \rho \)-mirror. The outputs \( y_j \) of the active WTA also have to be shared among all the chips to control the weight updating in all synapses.

**Fig. 5.6:** Classification performed by a single chip with a vigilance parameter \( \rho = 0.5 \), and \( \alpha = 1.1 \)

**Fig. 5.7:** Training sequence of a chip for \( \rho = 0.3 \) and \( \alpha = 3.11 \)

**Fig. 5.8:** Interconnection of two chips for horizontal system expansion

The system level performance of the two assembled chips has been tested. In this case, the input patterns had \( 10 \times 10 = 100 \) binary pixels. Fig. 5.9 depicts a training sequence performed on the system. The system classifies the 10 input patterns into 8 categories after a single presentation of the set of input patterns. The sequence of Fig. 5.9 was obtained with the vigilance parameter set to \( \rho = 0.5 \), and the current levels in the synapses \( L_A = 10 \mu A \) and \( L_B = 5 \mu A \), that is, \( \alpha = 2 \).

Fig. 5.9: Training sequence of a system formed by two horizontally assembled chips. The vigilance parameter is \( \rho = 0.5 \) and \( \alpha = 2 \)

5.3. ARTMAP Architectures

An ARTMAP architecture is a system which can be trained in a supervised way to learn the correspondence between pairs of binary input patterns [5.3].

A. The ARTMAP Algorithm: Supervised Learning

Fig. 5.10 shows a block diagram of the ARTMAP architecture. It consists of two ART 1 modules (ART $1^a$ and ART $1^b$) and an inter-ART module. We denote as $a = (a_1, a_2, ..., a_{N_a})$ the $N_a$-dimensional input vector to module ART $1^a$, and $b = (b_1, b_2, ..., b_{N_b})$ the $N_b$-dimensional input vector to ART $1^b$. Whenever a pair of input vectors, $a$ and $b$, is presented, self-organization processes evolve in the ART $1^a$ and ART $1^b$ modules. Then, the inter-ART module learns to recognize the correspondence between the categories activated in ART $1^a$ and ART $1^b$. The “match-tracking” signal is intended to control the self-organization process when a prediction error takes place. If the activated category in ART $1^a$ has previously learned to predict an ART $1^b$ category which is different from the activated one, the inter-ART module sends a match-tracking signal to the ART $1^a$ module. This match-tracking signal influences the ART $1^a$ vigilance parameter $\rho_a$: it increases this vigilance parameter by the minimum amount necessary to force the ART $1^a$ system to reset the currently activated category. This process of ART $1^a$ search and match-tracking reset continues until the ART $1^a$ system chooses an $F_2^a$ category which predicts the activated category in $F_2^b$ or has not previously learned any other $F_2^b$ prediction.
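
The search with match tracking described above can be sketched as follows. This is an illustration under stated assumptions, not the circuit's implementation: the vigilance test is taken as $\rho_a |a| \le |a \cap z_j|$, the "minimum amount" is modelled by a small hypothetical constant `epsilon`, and the ART $1^a$ search order is passed in as a list instead of being produced by a Winner-Take-All competition.

```python
def artmap_search(a, weights, predicts, target, rho_a, order, epsilon=1e-6):
    """a: binary input vector of ART 1a.
    weights[j]: binary weight vector of F2a category j.
    predicts[j]: F2b category already learned by category j, or None.
    target: F2b category activated by input b.
    order: ART 1a search order (best choice first).
    Returns (selected F2a category or None, final vigilance)."""
    norm_a = sum(a)
    for j in order:
        overlap = sum(x & w for x, w in zip(a, weights[j]))
        if overlap < rho_a * norm_a:
            continue  # ordinary vigilance reset: try the next category
        if predicts[j] is None or predicts[j] == target:
            return j, rho_a  # correct or still unlearned prediction
        # Wrong prediction: raise the vigilance just enough to reset category j.
        rho_a = overlap / norm_a + epsilon
    return None, rho_a

# Hypothetical example: category 0 predicts the wrong F2b node.
a = [1, 1, 1, 0]
weights = [[1, 1, 0, 0], [1, 1, 1, 0], [1, 1, 1, 1]]
predicts = [0, 1, None]
chosen, rho_final = artmap_search(a, weights, predicts, target=1,
                                  rho_a=0.5, order=[0, 1, 2])
```

Here category 0 passes the initial vigilance but predicts the wrong class, so the vigilance is raised just above 2/3 and the search settles on category 1.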

Fig. 5.10: ARTMAP Architecture

Fig. 5.11 depicts a more detailed diagram of the ARTMAP architecture. In this figure, letter $a$ denotes the elements of the ART $1^a$ module, and letter $b$ denotes the elements associated with ART $1^b$. As Fig. 5.11 shows, there is a one-to-one correspondence between the nodes of the ART $1^b$ category layer and the nodes in the “mapfield” module $F^{ab}$. Each $F_2^b$ node is connected to an $F^{ab}$ node bidirectionally through a vector of non-adaptive weights which always equal '1'. However, the nodes of the ART $1^a$ category layer are connected to the map-field nodes by a matrix of $M_a \times M_b$ adaptive binary weights $\{w_{jk}\}$. We denote as $y^{ab}$ the activation pattern across the nodes of $F^{ab}$. As shown in Fig. 5.11, each node $u_k^{ab}$ in the map-field layer receives inputs from three sources of activation: the control signal $G$, the $F^b_2$ output signal $y_k^b$, and the gated $F^a_2$ output signals $\sum_j w_{jk} y_j^a$. The activation of a $u_k^{ab}$ node obeys the 2/3 rule: signal $y_k^{ab}$ is activated if at least 2 of the 3 node inputs are high. Mathematically,

$$y_k^{ab} = \begin{cases} 1 & \text{if } y_k^b + G + \sum_{j=1}^{M_a} y_j^a w_{jk} \geq 2 \\ 0 & \text{otherwise} \end{cases} \quad (5.16)$$

Fig. 5.11: Diagram of the interactions in the ARTMAP architecture
where $G$ is a control signal which is always active except when the $F^a_2$ and $F^b_2$ layers are simultaneously active. That is,

$$G = \begin{cases} 
0 & \text{if } F^a_2 \text{ and } F^b_2 \text{ are active} \\
1 & \text{otherwise} 
\end{cases} \quad (5.17)$$

The function of signal $G$ is to allow the system to distinguish between two operating modes: the training mode and the prediction mode.

In the training mode, two input vectors, $a$ and $b$, are applied to the system, which must learn the correspondence between the activated $F^a_2$ and $F^b_2$ categories. In the prediction mode, only one input pattern $a$ is presented to the ART $1^a$ module, and the system must predict the corresponding $F^b_2$ category.
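The map-field activation of (5.16) and the control signal of (5.17) can be illustrated with a minimal Python sketch. The function names, the list-of-rows weight layout and the small 2 x 3 example are assumptions made for illustration only; they are not part of the circuit described in this appendix.

```python
def control_signal(y_a, y_b):
    """Control signal G of (5.17): 0 only when F2a and F2b are both active."""
    return 0 if (any(y_a) and any(y_b)) else 1


def mapfield_activation(y_a, y_b, w):
    """Map-field output y_ab of (5.16), following the 2/3 rule.

    y_a: binary F2a outputs (length M_a)
    y_b: binary F2b outputs (length M_b)
    w:   M_a x M_b binary weights, stored as a list of rows w[j][k]
    """
    g = control_signal(y_a, y_b)
    y_ab = []
    for k in range(len(y_b)):
        # Gated F2a signal reaching map-field node k.
        gated = sum(y_a[j] * w[j][k] for j in range(len(y_a)))
        y_ab.append(1 if y_b[k] + g + gated >= 2 else 0)
    return y_ab


# Hypothetical weights: row 0 has learned nothing, row 1 predicts category 1.
w = [[1, 1, 1],
     [0, 1, 0]]

train_match = mapfield_activation([1, 0], [0, 0, 1], w)     # training, G = 0
train_mismatch = mapfield_activation([0, 1], [0, 0, 1], w)  # training, G = 0
predict = mapfield_activation([0, 1], [0, 0, 0], w)         # prediction, G = 1
```

In this example `train_match` has a single active node (the one selected by ART $1^b$), `train_mismatch` has no active node, which is the condition that triggers match tracking, and `predict` reproduces the stored row of weights.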

The adaptive weights $w_{jk}$ are binary valued. They are initially set to ‘1’ and, during the system training, follow the learning rule

$$w_{jk}^{new} = \begin{cases} 
y^{ab}_k & \text{if } y^a_j = 1 \\
w_{jk}^{old} & \text{if } y^a_j = 0
\end{cases} \quad (5.18)$$

According to this law, the weight vector $w_J$ corresponding to the active $F^a_2$ node $u^a_J$ evolves towards the activation pattern across $F^{ab}$, $y^{ab}$. The weight vectors of the remaining $F^a_2$ nodes, $w_j$ with $j \neq J$, remain constant.
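As a small illustration of learning rule (5.18), the following Python sketch copies the map-field activation pattern into the weight row of the active $F^a_2$ node and leaves every other row untouched. The function name and the list-of-rows representation are hypothetical choices for this sketch.

```python
def update_mapfield_weights(w, y_a, y_ab):
    """Apply learning rule (5.18) and return a new weight matrix.

    w:    M_a x M_b binary weights, stored as a list of rows w[j][k]
    y_a:  binary F2a outputs (one entry per row of w)
    y_ab: binary map-field activation pattern (one entry per column of w)
    """
    return [list(y_ab) if y_a[j] == 1 else list(w[j]) for j in range(len(w))]


# Hypothetical example: two F2a nodes, three F2b nodes, all weights at '1'.
w_old = [[1, 1, 1],
         [1, 1, 1]]
w_new = update_mapfield_weights(w_old, y_a=[1, 0], y_ab=[0, 0, 1])
```

Only the row of the active node changes, so a category that has never been selected keeps all its weights at ‘1’.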

As mentioned above, the system can operate in two different modes:

- **Training mode**: Two input patterns, $a$ and $b$, are applied to the ART $1^a$ and ART $1^b$ modules, respectively. The modules classify the input patterns into their corresponding categories, $u^a_J$ and $u^b_K$. The simultaneous activation of layers $F^a_2$ and $F^b_2$ deactivates the control signal ($G = 0$). The activation vector across layer $F^{ab}$ then becomes

$$y^{ab}_k = \begin{cases} 
1 & \text{if } k = K \text{ and } w_{JK} = 1 \\
0 & \text{otherwise} 
\end{cases} \quad (5.19)$$

which can be expressed as,

$$y^{ab}_k = y^b_k w_{Jk} \quad (5.20)$$

or in vector notation,

$$y^{ab} = y^b \odot w_J. \quad (5.21)$$

This means that all nodes $u^{ab}_k$ with $k \neq K$ will be deactivated. Node $u^{ab}_K$ will be active only if node $u^a_J$ has previously learned to predict category $u^b_K$, or if node $u^a_J$ has not previously learned to predict any ART $1^b$ category.

If node $u_J^a$ has previously learned to predict an $F^b_2$ category different from the activated one, $u_K^b$, no node in $F^{ab}$ will be active. In this case, the match-tracking signal $R$ becomes active. Signal $R$ increases the ART $1^a$ vigilance parameter $\rho_a$ by the minimum amount necessary to deactivate the node $u_J^a$ which has caused the prediction error. Therefore, the new vigilance parameter $\overline{\rho}_a$ will satisfy the following condition,

$$\overline{\rho}_a > \frac{|a \cap z_J^a|}{|a|}. \quad (5.22)$$

The search process in ART $1^a$ continues, increasing the vigilance parameter $\rho_a$ each time, until a category $u_J^a$ is found which has previously learned to predict category $u_K^b$, or which has not predicted any ART $1^b$ category yet.

After the search process is completed, the weights are updated. Weights $z^a_{ij}$ and $z^b_{ik}$ follow the ART 1 algorithm learning rules, while weights $w_{jk}$ follow the learning rule given by (5.18).

- **Prediction mode:** In this operation mode only an input pattern $a$ is presented to the ART $1^a$ module and the system must predict the corresponding ART $1^b$ category.

Pattern $a$ will activate a category $u_J^a$ in $F^a_2$, while $F^b_2$ remains inactive. In this case, the control signal $G$ will be active. The activation pattern across the inter-ART module $F^{ab}$ will become,

$$y^{ab}_k = \begin{cases} 1 & \text{if } w_{Jk} = 1 \\ 0 & \text{otherwise} \end{cases}. \quad (5.23)$$

Category $u_J^a$ activates all the nodes in $F^{ab}$ if $u_J^a$ has not previously learned any prediction. If a correspondence between $u_J^a$ and a node $u_K^b$ has been previously learned, only the corresponding node $u_K^{ab}$ will be active, because $w_{JK}$ is the only weight in $w_J$ that will have a ‘1’ value.
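The match-tracking condition (5.22) of the training mode can be made concrete with a short Python sketch. It assumes the usual ART 1 vigilance test, in which a category is accepted when $|a \cap z_J^a| / |a| \geq \rho_a$, and it assumes that the vigilance is raised in fixed steps; the default step of 1/32 is taken from the experimental results reported later in this appendix. The function names and the example vectors are hypothetical.

```python
from fractions import Fraction


def match_ratio(a, z):
    """Return |a AND z| / |a| for two binary lists of equal length."""
    overlap = sum(ai & zi for ai, zi in zip(a, z))
    return Fraction(overlap, sum(a))


def track_vigilance(a, z_J, rho_a, step=Fraction(1, 32)):
    """Raise rho_a in fixed steps until it exceeds the match ratio of the
    category that caused the prediction error, as required by (5.22)."""
    rho = Fraction(rho_a)
    target = match_ratio(a, z_J)
    while rho <= target:
        rho += step
    return rho


# Hypothetical 8-pixel input and stored template of the offending category.
a = [1, 1, 1, 1, 0, 0, 0, 0]
z_J = [1, 1, 1, 0, 0, 0, 0, 0]
new_rho = track_vigilance(a, z_J, rho_a=0)
```

Here the match ratio is 3/4, so the smallest multiple of 1/32 that rejects the category is 25/32. Exact fractions are used to avoid rounding effects when the vigilance lands exactly on the match ratio.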

Fig. 5.12 depicts the flow diagram of the ARTMAP operation in the training mode case and Fig. 5.13 shows the ARTMAP flow diagram in the prediction mode operation. In both diagrams, the operation of the ART $1^a$ and ART $1^b$ modules is considered to be that of the ART $1_m$ algorithm reported in Appendix 1.

**B. ARTMAP Circuit Implementation**

A circuit that implements the ARTMAP algorithm has been realized. To construct the ARTMAP system, we have interconnected two chips which implement the ART $1_m$ algorithm through an inter-ART module. Fig. 5.14 shows the interconnection diagram of the ARTMAP system and Fig. 5.15 shows the detailed schematic of the inter-ART circuit.

The inter-ART circuit has been designed and fabricated using the same technology as the ART 1 chip: the ES2-1.0μm double-metal single-poly CMOS technology. It consists of a matrix of $M_a \times M_b$ cells (in our case, a $10 \times 10$ matrix) that occupies an area of $0.459\text{mm} \times 0.478\text{mm} = 0.219\text{mm}^2$, and it is mounted in a 28-pin DIL package.

Each cell \( c_{jk} \) contains a memory element which stores the value of the weight \( w_{jk} \), plus some circuitry to reset and update the weight. All the cells in the same column \( j \) share the input \( y_j^a \), which is the output from node \( u_j^a \) in \( F_2^a \). In the same way, the outputs \( y_k^b \) of layer \( F_2^b \) are shared as inputs by all the inter-ART cells in the same row \( k \). The “RESET” and “LEARN” signals are shared by all the cells \( c_{jk} \) in the inter-ART chip, and they are also common to the three chips in the system. There is also some circuitry in each cell to read out the value of the weight \( w_{JK} \) corresponding to the activated categories \( u_J^a \) and \( u_K^b \).

The system performs the operation sequence depicted in Fig. 5.12 and Fig. 5.13:

**Fig. 5.12: Flow diagram of the ARTMAP operation during the training phase**

Fig. 5.13: Flow diagram of the prediction ARTMAP operation

Fig. 5.14: Interconnection Diagram of the ARTMAP System

Fig. 5.15: Schematic of the Inter-ART Module
• The weights are initialized. Before applying the training sequence, a pulse is applied to the “RESET” signal, which sets all the weights \( w_{jk} \), \( z_{ij}^a \) and \( z_{ik}^b \) to their high values.

• During the training mode:

1.- Two input patterns \( a \) and \( b \) are presented to the ART 1 modules, which select categories \( u_J^a \) and \( u_K^b \). These categories must satisfy the vigilance criterion imposed by the vigilance parameters \( \rho_a \) and \( \rho_b \) set in the ART 1 modules.

2.- After the activation of categories \( u_J^a \) and \( u_K^b \), the value of \( w_{JK} \) is read out. Two alternatives may occur:

2.1.- If \( w_{JK} = 0 \), a mismatch has occurred between the prediction of category \( u_J^a \) and category \( u_K^b \).

The ART \( 1^a \) vigilance parameter \( \rho_a \) is increased to \( \rho_a' = \rho_a + \Delta \rho_a \) (where \( \Delta \rho_a \) is the minimum \( \rho \) variation step in the ART 1 systems) and a new search process takes place in ART \( 1^a \). This process is repeated until an ART \( 1^a \) category is found that makes \( w_{JK} = 1 \).

2.2.- If \( w_{JK} = 1 \), a pulse is applied to the "LEARN" signal and the weights \( z_{ij}^a \), \( z_{ik}^b \) and \( w_J \) are updated. The weights \( w_J \) corresponding to the active \( u_J^a \) category are updated to their new values given by,

\[
 w_{Jk}^{new} = \begin{cases} 
 w_{Jk}^{old} & \text{if } y_k^b = 1 \\
 0 & \text{if } y_k^b = 0
\end{cases} \quad (5.24)
\]

that is,

\[
 w_{Jk}^{new} = w_{Jk}^{old} \cap y_k^b \quad (5.25)
\]

or in vector notation,

\[
 w_{J}^{new} = w_{J}^{old} \cap y^b \quad (5.26)
\]

Since only one \( y_k^b \) is equal to ‘1’,

\[
 y_k^b = \begin{cases} 
 1 & \text{if } k = K \\
 0 & \text{otherwise}
\end{cases} \quad (5.27)
\]

and \( w_{JK}^{old} = 1 \) in this case, it holds that,

\[
 w_{Jk}^{new} = \begin{cases} 
 1 & \text{if } k = K \\
 0 & \text{otherwise}
\end{cases} \quad (5.28)
\]

3.- The ART \( 1^a \) vigilance parameter \( \rho_a' \) is restored to its original value \( \rho_a \) when the input patterns \( a \) and \( b \) are replaced by a new pair of input patterns \( a' \) and \( b' \). Then, a new selection process begins in ART \( 1^a \) and ART \( 1^b \).

• During the prediction mode:

1.- An input pattern \( a \) is applied to the ART \( 1^a \) module and a category \( u_J^a \) which satisfies the ART \( 1^a \) vigilance criterion is chosen.

2.- The output of the inter-ART module is given by,

\[
y_k^{ab} = \sum_j w_{jk} y_j^a = w_{Jk} \quad (5.29)
\]

Once \( u_J^a \) has learned a prediction, there is only one index \( k \) such that \( w_{Jk} = 1 \); let us call it \( K \). Therefore, category \( u_K^b \) is considered to be the prediction made by the ARTMAP system.
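The operation sequence above can be summarized with a behavioural Python sketch of the inter-ART weight matrix. This is only an illustrative software model under stated assumptions, not a description of the fabricated circuit: the class and function names are hypothetical, and the two ART 1 chips are abstracted away, so the ART \( 1^b \) winner `K` and the ordered list of ART \( 1^a \) winners visited as \( \rho_a \) is raised are supplied by the caller.

```python
class InterArt:
    """Behavioural model of the M_a x M_b inter-ART weight matrix."""

    def __init__(self, m_a=10, m_b=10):
        self.m_a, self.m_b = m_a, m_b
        self.reset()

    def reset(self):
        # RESET pulse: every weight w[j][k] is set to its high value.
        self.w = [[1] * self.m_b for _ in range(self.m_a)]

    def read(self, J, K):
        # Read-out of w_JK for the active categories (step 2).
        return self.w[J][K]

    def learn(self, J, K):
        # LEARN pulse, (5.24)-(5.28): row J is ANDed with the one-hot y_b.
        self.w[J] = [self.w[J][k] & (1 if k == K else 0)
                     for k in range(self.m_b)]

    def predict(self, J):
        # Prediction mode, (5.29): the map-field output equals row J.
        return list(self.w[J])


def train_pair(chip, candidates_a, K):
    """Try the ART 1a winners in search order until one gives w_JK = 1."""
    for J in candidates_a:
        if chip.read(J, K) == 1:   # step 2.2: no mismatch, so learn
            chip.learn(J, K)
            return J
        # step 2.1: mismatch; rho_a is raised and the next winner is tried
    raise RuntimeError("no compatible ART 1a category was found")


chip = InterArt(m_a=3, m_b=3)
first = train_pair(chip, candidates_a=[0, 1], K=2)   # category 0 is free
second = train_pair(chip, candidates_a=[0, 1], K=1)  # category 0 mismatches
```

In this run the first pair commits ART \( 1^a \) category 0 to ART \( 1^b \) category 2. The second pair finds \( w_{JK} = 0 \) for category 0, so the search moves on to category 1, which then learns to predict category 1.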

### C. Experimental Results

The system level operation of the ARTMAP system has also been tested using the HP82000 digital test equipment. Fig. 5.16 and Fig. 5.17 show a test sequence performed on the system.

Fig. 5.16 shows a system training sequence. The first column, named “pattern_a”, represents the input patterns applied to the ART \( 1^a \) chip. Each input pattern is a different representation of one of the five vowels. The column named “pattern_b” represents the input patterns applied to the ART \( 1^b \) system. The ART \( 1^b \) input patterns are the first five digits. The columns named “weights_a” and “weights_b” represent the stored weights in the ART \( 1^a \) and ART \( 1^b \) categories after the classification and learning of each input pattern pair. The underlined categories are the ones that remain active after the search process has finished, and these are the only ones that are updated with learning. Below each ART \( 1^a \) category we indicate the minimum value of the vigilance parameter \( \rho_a \) needed in the search process to choose this category as the winning one, thus avoiding a prediction error (the vigilance parameter \( \rho_a \) was increased in steps of \( \Delta \rho_a = 1/32 \)). The last column shows the stored weights in the inter-ART module, which represent the learned correspondence between the ART \( 1^a \) and ART \( 1^b \) categories. During the training sequence, the system learns to identify each set of different representations of a vowel with a different digit. For the ART \( 1^a \) system, the vigilance parameter \( \rho_a \) is set to ‘0’ and the current ratio parameter to \( \alpha = 2 \) (\( L_A^a = 10\mu A \) and \( L_B^a = 5\mu A \)). For the ART \( 1^b \) system, we set the same current levels, \( L_A^b = 10\mu A \) and \( L_B^b = 5\mu A \), and a vigilance parameter \( \rho_b = 0.75 \). For this vigilance parameter, the ART \( 1^b \) chip forms a different category for each input pattern (it perfectly learns the set of input patterns), so we can identify each category with the corresponding input pattern.

During the prediction sequence depicted in Fig. 5.17, only an input pattern \( a \) is applied to the system. The first column shows the input patterns applied to the system. These input patterns are the result of adding random noise to the training patterns. The underlined ART \( 1^a \) category is the one chosen by this chip after the search process, and the underlined ART \( 1^b \) category is the one that the ART \( 1^a \) category has learned to predict. During this sequence the current levels are \( L_A^a = 10\mu A \) and \( L_B^a = 5\mu A \), and the vigilance parameter is \( \rho_a = 0 \). Only one prediction error occurs despite the amount of noise present in the input patterns.

### 5.4. References


Fig. 5.16: Complete training sequence of the ARTMAP system for vigilance parameters $\rho_a = 0$ and $\rho_b = 0.75$

**Fig. 5.17:** Recognition sequence performed on a previously trained ARTMAP system. The applied input patterns are noisy versions of the $a$ vectors of the training set. The vigilance parameter $\rho_a$ is set to '0'.

TERESA SERRANO GOTARREDONA
CATEGORIZADORES NEURONALES EN VLSI

Grade: APTO CUM LAUDE (pass with distinction)

20 December 1996