# Implementing P Systems Parallelism by Means of GPUs

Jose M. Cecilia<sup>2</sup>, José M. García<sup>2</sup>, Ginés D. Guerrero<sup>2</sup>, Miguel A. Martínez–del–Amor<sup>1</sup>, Ignacio Pérez–Hurtado<sup>1</sup>, and Mario J. Pérez–Jiménez<sup>1</sup>

 <sup>1</sup> Research Group on Natural Computing Department of Computer Science and Artificial Intelligence University of Sevilla Avda. Reina Mercedes s/n, 41012 Sevilla, Spain {mdelamor,perezh,marper}@us.es
<sup>2</sup> Grupo de Arquitectura y Computación Paralela Dpto. Ingeniería y Tecnología de Computadores Universidad de Murcia Campus de Espinardo, 30100 Murcia, Spain {chema,jmgarcia,gines.guerrero}@ditec.um.es

Abstract. Software development for Membrane Computing is growing up yielding new applications. Nowadays, the efficiency of P systems simulators have become a critical point when working with instances of large size. The newest generation of GPUs (Graphics Processing Units) provide a massively parallel framework to compute general purpose computations. We present GPUs as an alternative to obtain better performance in the simulation of P systems and we illustrate it by giving a solution to the N-Queens problem as an example.

# 1 Introduction

Membrane Computing is an emerging branch within Natural Computing that was introduced by Gh. Păun [24]. The main idea is to consider biochemical processes taking place inside living cells from a computational point of view, in a way that gives us a new nondeterministic model of computation by using cellular machines.

Up to now, it has not been possible to have implementations neither *in vivo* nor *in vitro* of P systems, so the computation and analysis of these devices are performed by simulators. Therefore, P systems simulators are tools that help the researchers to extract results from models. Since P systems was presented, many software applications have been produced [11]. These simulators have to be as much efficient as possible when handling large problem sizes. Thus, the massively parallel nature of P systems computations points out to look for a massively parallel technology where the simulator can run efficiently.

Parallel computation on clusters is the traditional environment to speedup parallel applications. Particularly, many simulators of P systems have been designed for clusters of computers [4]. However, this computation is relatively expensive and it is available for organizations that have enough resources to buy and maintain those clusters. Nowadays, there are other cheaper solutions in the computer market that also provides parallel environments. Among these solutions, the newest generation of graphics processor units (GPUs) are massively parallel processors which allow to develop a wide range of parallel applications. We also recall that other parallel computing platforms for P systems simulators are being investigated, such as special hardware circuits [6] and FPGAs [20].

GPUs can support several thousand of concurrent threads providing a massively parallel environment where parallel applications can obtain huge performance [14][17][29]. Current Nvidia's GPUs, for example, contain up to 240 scalar processing elements per chip [16], they are programmed using C and CUDA [32][21], and they have low cost compared with a cluster of computers.

In this paper, we use CUDA as parallel programming environment for P systems simulator in order to speedup the simulation. The input of the simulator is a P system which is defined by using the P-Lingua [5] programming language, and the output is a detailed list of information of every configuration of the computation. The simulation is divided in two main stages: *selection stage* and *execution stage*. At this stage of development, the simulator simulates recognizer P systems with active membranes, the *selection stage* is executed on the GPU and the *execution stage* is executed on the CPU.

The rest of the paper is structured as follows. In Section 2 several definitions and concepts are given for a correct understanding of the paper. Section 3 introduces the Compute Unified Device Architecture (CUDA) and some concepts of programming on GPUs are specified. In Section 4 we explain the design of the simulator. In Section 5 we implement a solution to the N-Queens problem using the simulator and P-Lingua. Finally, in Section 6 we show some results and compare them with the sequential version of the simulator. The paper ends with some conclusions and ideas for future work in Section 7.

## 2 Preliminaries

Polynomial time solutions to **NP**-complete problems in Membrane Computing are achieved by trading time for space. This is inspired by the capability of cells to produce an exponential number of new membranes in polynomial time. There are many ways a living cell can produce new membranes: *mitosis* (cell division), *autopoiesis* (membrane creation), *gemmation*, etc. Following these inspirations a number of different models of P systems has arisen, and many of them proved to be computational completeness (they are equivalent in power to Turing machines).

In this paper we focus on the model of P systems with active membranes. It is one of the most studied models in Membrane Computing and one of the first models presented by Gh. Păun [25]. P systems with active membranes is formed by a membrane structure, where a label and a polarization is associated to each membrane. In this model, every elementary membrane is able to divide itself by reproducing its content into a new membrane. Here we provide a short recall of its features (see [25] for details). The model of P system with active membranes is a construct of the form  $\Pi = (O, H, \mu, \omega_1, \ldots, \omega_m, R)$ , where  $m \ge 1$  is the initial degree of the system; Ois the alphabet of *objects*, H is a finite set of *labels* for membranes;  $\mu$  is a membrane structure (a rooted tree), consisting of m membranes injectively labelled with elements of  $H, \omega_1, \ldots, \omega_m$  are strings over O, describing the *multisets of objects* placed in the m regions of  $\mu$ ; and R is a finite set of *rules*, where each rule is of one of the following forms:

- (a)  $[a \to v]_h^{\alpha}$  where  $h \in H$ ,  $\alpha \in \{+, -, 0\}$  (electrical charges),  $a \in O$  and v is a string over O describing a multiset of objects associated with membranes and depending on the label and the charge of the membranes (*evolution rules*).
- (b)  $a[]_{h}^{\alpha} \to [b]_{h}^{\beta}$  where  $h \in H$ ,  $\alpha, \beta \in \{+, -, 0\}$ ,  $a, b \in O$  (send-in communication rules). An object is introduced in the membrane, possibly modified, and the initial charge  $\alpha$  is changed to  $\beta$ .
- (c)  $[a]_{h}^{\alpha} \to []_{h}^{\beta} b$  where  $h \in H$ ,  $\alpha, \beta \in \{+, -, 0\}$ ,  $a, b \in O$  (send-out communication rules). An object is sent out of the membrane, possibly modified, and the initial charge  $\alpha$  is changed to  $\beta$ .
- (d)  $[a]_{h}^{\alpha} \to b$  where  $h \in H$ ,  $\alpha \in \{+, -, 0\}$ ,  $a, b \in O$  (dissolution rules). A membrane with a specific charge is dissolved in reaction with a (possibly modified) object.
- (e)  $[a]_{h}^{\alpha} \to [b]_{h}^{\beta} [c]_{h}^{\gamma}$  where  $h \in H, \alpha, \beta, \gamma \in \{+, -, 0\}, a, b, c \in O$  (division rules). A membrane is divided into two membranes. The objects inside the membrane are replicated, except for a, that may be modified in each membrane.

Rules are applied according to the following principles:

- All the elements which are not involved in any of the operations to be applied remain unchanged.
- Rules associated with label h are used for all membranes with this label, no matter whether the membrane is an initial one or whether it was generated by division during the computation.
- Rules from (a) to (e) are used as usual in the framework of membrane computing, i.e., in a maximal parallel way. In one step, each object in a membrane can only be used by at most one rule (non-deterministically chosen), but any object which can evolve by a rule must do it (with the restrictions indicated below).
- Rules (b) to (e) cannot be applied simultaneously in a membrane in one computation step.
- An object a in a membrane labelled with h and with charge  $\alpha$  can trigger a division, yielding two membranes with label h, one of them having charge  $\beta$  and the other one having charge  $\gamma$ . Note that all the contents present before the division, except for object a, can be the subject of rules in parallel with the division. In this case we consider that in a single step two processes take place: "first" the contents are affected by the rules applied to them, and "after that" the results are replicated into the two new membranes.

 If a membrane is dissolved, its content (multiset and interior membranes) becomes part of the immediately external one. The skin is never dissolved neither divided.

Note that P systems can be seen as devices with two levels of parallelism: among membranes (every membrane works independently, with the exception of when there are communication across them) and among objects inside a membrane (the rules are applied to the existing multiset of objects in a maximal parallel way).

Recognizer P systems were introduced in [26], and constitute the natural framework to study the solvability of decision problems. The data representing an instance of the problem has to be provided to the P system to compute the appropriate answer. This is done by codifying each instance as a multiset placed in an *input membrane*. The output of the computation, *yes* or *no*, is sent to the environment in every halting configuration.

Furthermore, the act of simulating something generally entails representing certain key characteristics or behaviours of some physical, or abstract, system. However, an emulation tool duplicates the functions of one system by using a different system, so that the second system behaves like (and appears to be) the first system. With the current technology, we can not emulate the functionality of a cellular machine by using a conventional computer to solve **NP**-complete problems in polynomial time, but we can simulate these cellular machines, not necessarily in polynomial time, in order to aid researchers. However, depending on the underlying technology where the simulator is executed, the simulations can take too much time.

The technology used for this work is called CUDA (Compute Unified Device Architecture). CUDA is a co-designed hardware and software solution to make easier developing general-purpose applications on the Graphics Processor Unit (GPU) [34]. GPUs, that are one of the main components of traditional computers, originally were specialized for math-intensive, highly parallel computation which is the nature of graphics applications. These characteristics of the GPU were very attractive to accelerate scientific applications which have massively parallel computations. However, the problem was the way to program general purpose applications on the GPU. This way involved to deal with GPUs designed for video games, so they have had to tune their applications using programming idioms tied to computer graphics, programing environment tightly constrained, etc [17] [14]. The CUDA extensions developed by Nvidia provides an easier environment to program general-purpose applications onto the GPU, because it is based on ANSI C, supported by several keywords and constructs. ANSI C is the standard published by the American National Standards Institute (ANSI) for the C programming language, which is one of the most used.

P systems devices are massively parallel, what fits, in a similar way, into massively parallel nature of the GPUs with thousands of threads running in parallel. These threads are units of execution which execute the same code concurrently on different pieces of data.

# 3 Graphics Processing Unit

Driven by the video games market, programmable GPUs (Graphics Processing Units) have evolved into a highly parallel, multithreaded, manycore processor. They were designed to accelerate graphics applications, which transform threedimensional data (coordinates of triangle vertices) into pixels that are displayed on a screen, using for this task programming interfaces such as OpenGL and DirectX. The massively parallel nature of graphics applications and its arithmetic intensity leads the researches to explore more general non-graphics applications onto the GPU, creating a new programming field called GPGPU (General-Purpose on GPUs).

GPUs have become an inexpensive and readily available single-chip massively parallel system. However, GPGPU programmers had to deal with the limitations and difficulties of constrained graphics primitives to compute their non-graphics computations. The emergence of Compute Unified Device Architecture (CUDA) [34] programming model, proposed by Nvidia Corporation in 2007, has helped to develop highly-parallel applications onto the GPU easier than it was before. CUDA allows GPGPU programmers to develop their applications in a more familiar environment by using C/C++ programming language, with some extensions to manipulate special aspects of the GPU. Moreover, Nvidia consolidated this trend launching a line of GPUs optimized for general purpose computations called TESLA [16].

In this work we use a Tesla C1060 graphics processor unit (GPU) from Nvidia as hardware target for its study. This section introduces the Tesla C1060 computing architecture. In addition, it analyses the threading model of Tesla architectures, and also the most important issues in the CUDA programming environment.

#### 3.1 Tesla C1060 Base Microarchitecture

The Tesla C1060 [16] is based on a scalable processor array which has 240 streaming-processor (SP) cores organised as 30 streaming multiprocessor (SM). The applications start at the host side (the CPU) which communicates with the device side (the GPU) through a PCI-Express x16 bus (see the top of figure 1).

The SM is the processing unit, and it is unified graphics and computing multiprocessor. Every SM contains eight SPs arithmetic cores, one double precision unit, 16-Kbyte read/write shared memory, a set of 16384 registers, and access to the off-chip memory (global/local memory). The access to shared memory is very cheap, however, the access to the off-chip memory has low performance because it is out of the chip, as it is shown on figure 1. In addition, table 1 shows all memories available on the GPU and also the cost to access them.



**Fig. 1.** Tesla C1060 GPU with 240 SPs: Streamming Processors, organised in 30 SMs: Streamming Multiprocessors

| Memory        | Location | Size                                 | Latency            | Access |
|---------------|----------|--------------------------------------|--------------------|--------|
| Registers     | On-Chip  | 16384<br>32-bits Registers per SM $$ | $\simeq 0$ cycles  | R/W    |
| Shared Memory | On-Chip  | 16  KB per SM                        | $\simeq registers$ | R/W    |
| Constant      | On-Chip  | 64  KB                               | $\simeq registers$ | R      |
| Texture       | On-Chip  | Up to Global                         | > 100 cycles       | R      |
| Local         | Off-Chip | 4  GB                                | 400-600 cycles     | R/W    |
| Global        | Off-Chip | 4  GB                                | 400-600 cycles     | R/W    |

Table 1. Memory System on the Tesla C1060

### 3.2 Parallel Computing with CUDA

The GPU is seen as a cooprocessor that executes data-parallel *kernel* functions. The user creates a program encompassing CPU code (Host code) and GPU code (Kernel code). They are separated and compiled by *nvcc* (Nvidia's compiler for CUDA code) as shown in figure 2.

Firstly, the host code is responsible for transfering data from the main memory (RAM or host memory) to the GPU memory (device memory), using CUDA instructions, such as *cudamemcpy*. Moreover, the host code has to state the number of threads executing the kernel function and the organization of them. Threads execute the kernel code, and they are organized into a three-level hierarchy as it is shown in figure 3. At the highest level, each kernel creates a single grid that consists of many thread blocks. Each thread block can contain up to 512



Fig. 2. Nvcc compilation process

threads, which can share data through Shared Memory and can perform barrier synchronization by invoking the *-syncthreads* primitive [31]. Besides, thread blocks can not perform synchronization. The synchronization across blocks can only be obtained by terminating the kernel.

Furthermore, the host code calls the kernel function like a C function by passing parameters if it is needed, and also by specifying the number of threads per block and the number of blocks making up the grid. Each block within the grid has their own identifier [22]. This identifier can be one, two or three



Fig. 3. Thread organization in CUDA programming model

dimensions depending on how the programmer has declared the grid, accessed via .x, .y, and .z index fields. Each thread within the block have their own identifier which can be one, two or three dimensions as well. Combining thread and block identifiers, the threads can access to different data address, and also select the work that they have to do.

The kernel code is specified through the key word <u>\_global\_</u> and the syntax is: <u>\_\_global\_\_</u> kernelName <<< dimGrid, dimBlock >>> (...parameter list...) where dimGrid and dimBlock are three-elements vectors that specify the dimensions of the grid in blocks and the dimensions of the blocks in threads, respectively [21].

#### 3.3 Threading Model

A SM is a hardware device specifically designed with multithreaded capabilities. Each SM manages and executes up to 1024 threads in hardware with zero scheduling overhead. Each thread has its own thread execution state and can execute an independent code path. The SMs execute threads in a Single-Instruction Multiple-Thread (SIMT) fashion [16]. Basically, in the SIMT model all the threads execute the same instruction on different piece of data. The SMs create, manage, schedule and execute threads in groups of 32 threads. This set of 32 threads is called *Warp*. Each SM can handle up to 32 Warps (1024 threads in total, see table 2). Individual threads of the same Warp must be of the same type and start together at the same program address, but they are free to branch and execute independently.

| <b>Configuration Parameters</b> | Limitation |
|---------------------------------|------------|
| Threads/SM                      | 1024       |
| Thread Blocks/SM                | 8          |
| 32-bit Registers/SM             | 16384      |
| Shared Memory/SM                | 16KB       |
| Threads/Block                   | 512        |
| Threads/Warp                    | 32         |
| Warps/SM                        | 32         |

Table 2. Major Hardware and Software Limitations programing on CUDA

The execution flow begins with a set of Warps ready to be selected. The instruction unit selects one of them, which is ready for issue and executing instructions. The SM maps all the threads in an active Warp per SP core, and each thread executes independently with its own instructions and register state. Some threads of the active Warp can be inactive due to branching or predication, and it is also another critical point in the optimisation process. The maximum performance is achieved when all the threads in an active Warp takes the same path (the same execution flow). If the threads of a Warp diverge, the Warp serially executes each branch path taken, disabling threads that are not on that path, and when all the paths complete, the threads reconverge to the original execution path.

## 4 Design of the Simulator for Recognizer P Systems

In this section we briefly describe the simulator of recognizer P systems with active membranes, elementary division and polarization. Firstly, we explain the previous work that we have done in order to prepare the development of the parallel simulator on the GPU. Then, we introduce the algorithm design in the CUDA programming language, and finally, we finish with our simulator's design.

#### 4.1 Design of the Baseline Simulator

As previously mentioned, CUDA programming model is based on C/C++ language. Therefore, the first recommended step when developing applications in CUDA is to start from a baseline algorithm written in C++, where some parts can be susceptible to be parallelized on the GPU.

In this work, we have based on the simulator for P systems with active membranes developed in PLinguaCore [5]. This sequential (or single-threaded) simulator is programmed in JAVA, so the first step was to translate the code to C++.

The simulator is executed into two main stages: *selection stage* and *execution stage*. The *selection stage* consists of the search for the rules to be executed in each membrane. Once the rules have been selected, the *execution stage* consists of the execution of these rules.

The input data for the selection stage consists of the description of the membranes with their multisets (strings over the working alphabet O, labels associated with the membrane in H, etc.), and the set of rules R to be selected. The output data of this stage is the set of selected rules. Only the execution stage changes the information of the configuration. It is the reason because execution stage needs synchronization when accessing to the membrane structure and the multisets. At this point of implementation, we have parallelized the selection stage on the GPU, and the execution stage is still executed on the CPU because of the synchronization problem.

We also have developed an adapted sequential simulator for the CPU (called *fast sequential* simulator), which has the same constraints as the CUDA simulator explained in the next subsections to make a fair comparison among them. This simulator achieves much better performance than the original sequential simulator.

#### 4.2 Algorithm Design in CUDA

Whenever we design algorithms in the CUDA programming model, our main effort is dividing the required work into processing pieces, which have to be processed by TB thread blocks of T threads each. Using a thread block size of T=256, it is empirically determined to obtain the overall best performance on the Tesla C1060 [28]. Each thread block access to one different set of input data, and assigns a single or small constant number of input elements to each thread.

Each thread block can be considered independent to the other, and it is at this level at which internal communication (among threads) is cheap using explicit barriers to synchronize, and external communication (among blocks) becomes expensive, since global synchronization only can be achieved by the barrier implicit between successive kernel calls. The need of global synchronization in our designs requires successive kernel calls even to the same kernel.

### 4.3 Design of the Parallel Simulator

In our design, we identify each membrane as a thread block where each thread represents at least an element of the alphabet O. Each thread block runs in parallel looking for the set of rules that has to select for its membrane, and each individual thread is responsible for selecting the rules associated with the object that it represents (each thread selects the rules that need to be executed by using the represented object).

As result of the *execution stage*, the membranes can vary including news elements, dissolving membranes, dividing membranes, etc. Therefore, we have to modify the input data for the *selection stage* with the newest structure of membranes, and then call the selection again. It is an iterative process until a halting configuration is reached.

Finally, our simulator presents some limitations, constrained by some peculiarities in the CUDA programming model. The main limitations are showed in table 3, and the following stand out among them: it can handle only two levels of membrane hierarchy for simplicity in synchronization (the skin and the rest of elementary membranes), which is enough for solving lots of **NP**-complete problems; and the number of objects in the alphabet must be divisible by a number smaller than 512 (the maximum thread block size), in order to distribute the objects among the threads equally.

| Parameter                               | Limitation                             |
|-----------------------------------------|----------------------------------------|
| Levels of membrane hierarchy            | 2                                      |
| Maximum alphabet size                   | 65535                                  |
| Maximum label set size                  | 65535                                  |
| Maximum multiplicity of an object in an | 65535                                  |
| elementary membrane                     |                                        |
| Alphabet size                           | Divisible by a number smaller than 512 |

Table 3. Main limitations in the parallel simulator

# 5 A Case Study: Implementing a Solution to the N-Queens Problem

In this section, we briefly present a solution to the **N-Queens** problem by means of P systems, given by Miguel A. Gutiérrez–Naranjo et al in [10]. This family of P systems is our case study for the performance analysis of our simulator.

#### 5.1 A Family of P Systems for Solving the N-Queens Problem

The **N-Queens** problem can be expressed as a formula in conjunctive normal form, in such way that one truth assignment of the formula is considered as a solution of the puzzle. A family of recognizer P systems for the SAT problem [27] can state whether exists a solution to the formula or not sending *yes* or *no* to the environment.

However, the *yes* ot *no* answer from the recognizer P system is not enough because it is also important to know the solutions. Besides, the system needs to give us the way to encode the state of the **N-Queens** problem.

The P system designed for solving the **N-Queens** problem is a modification of the P system for the SAT problem. It is an uniform family of deterministic recognizer P system which solves SAT as a decision problem (i.e., the P system sends *yes* or *no* to the environment in the last computation step), but also stores the truth assignments that makes true the formula encoded in the elementary membranes of the halting configuration.

#### 5.2 Implementation

P-Lingua 1.0 [5] is a programming language useful for defining P system models with active membranes. We use P-Lingua to encode a solution to the **N-Queens** problem, and also to generate a file that our simulator can use as input. Figure 4 shows the P-Lingua process to generate the input for our simulator.



Fig. 4. Generation of the simulator's input

P-Lingua 2.0 [7] is able to translates a P system written in P-Lingua language into a binary file. A binary file is a file whose information is encoded in Bytes and bits (not understandable by humans like plain text), which is suitable for trying to compress the data. This binary file contains all the information of the P system (Alphabet, Labels, Rules, ...) which is the input of our simulator.

In our tests, we use the P system for solving the 3-Queens and 4-Queens problems. The former creates 512 membranes and up to 1883 different objects. The latter creates 65536 membranes and up to 8120 different objects, and now the simulator can handle it because we have decreased the memory requirement

by the simulator in [18]. On one hand, the P system for 5-Queens needs to generate 33554432 membranes and 25574 objects, what leads in a memory space limitation (requires up to 1.5TB). On the other hand, we point out that 2-Queens is a system with only 4 membranes, what are not enough for exploiting the parallelism in P systems.

## 6 Performance Analysis

We now examine the experimental performance of our simulator. Our performance test are based on the solutions to 3-Queens and 4-Queens problems previously explained in 5.2. We report the *selection stage* time which is executed on the GPU, and compare it with the *selection stage* for the fast sequential code. We do not include the cost of transferring input (and output) data from (and to) host CPU memory across the PCI-Express bus to the GPU's on board memory, which negatively affects to the overall simulation time. Selection is one building block of a larger-scale computation. Our aim is to get a full implementation of the simulator on the GPU. In such case, the transfers across PCI-Express bus will be close to zero.

We have used the Nvidia GPU Tesla C1060 which has 240 execution cores and 4GB of device memory, plugged in a computer server with a Intel Core2 Quad CPU and 8GB of RAM, using the 32bits ubuntu server as Operating System.

The selection stage on the GPU takes about 171 msec for the 3-Queens. So it is 2.7 times faster than the selection stage on the CPU which takes 465 msec. For the 4-Queens problem our simulator is 2 times faster than the fast sequential version, taking 315291 and 629849 msec in selection respectively.

Our experimental results demonstrate the results we expect to see: a massively parallel problem such as selection of the rules in a P system with active membranes achieves faster running times on a massively parallel architecture such as GPU.

# 7 Conclusions and Future Work

In this paper, we have presented a simulator for P systems using CUDA. P system computations have a double parallel nature. The first level of parallelism is presented by the objects inside the membranes, and the second one is presented between membranes. Hence, we have simulated these P systems in a platform which provides those levels of parallelism. This platform is the GPU, with parallelism between thread blocks and threads. Besides, we have used a programming language called P-Lingua to encode P systems as input for our simulator. This tool helped us to encode the P system for solving the N-Queens problem in order to test our simulator.

Using the power and parallelism that provides the GPU to simulate P systems with active membranes is a new concept in the development of applications for membrane computing. Even the GPU is not a cellular machine, its features help the researches to accelerate their simulations allowing the consolidation of the cellular machines as alternative to traditional machines.

The first version of the simulator is presented for recognizer P systems with active membranes, elementary division and polarization, specifically, we have developed the *selection stage* of the simulator on the GPU. In forthcoming versions, we will include the execution version on the GPU. This issue allows a completely parallel execution on the GPU, avoiding CPU-GPU transfers in every step, which degrades system performance.

Moreover, we are working to obtain fully simulation of P systems with active membranes, deleting the limitations showed in table 3. Besides, we will include new funcionality in the simulator like not elementary division. Our aim is to develop a framework of P systems simulators running on the GPU, so we will study the simulation of other P systems models by using this parallel architecture.

It is also important to point out that this simulator is limited by the resources available on the GPU as well as the CPU (RAM, Device Memory, CPU, GPU). They limit the size of the instances of **NP**-complete problems whose solutions can be successfully simulated. Although developing general purpose programs on the GPU is easier than several years ago with tools such as CUDA, to extract the maximum performance of the GPU is still hard, so we need to make a deep analysis to obtain the maximum performance available for our simulator. For instance, in the following versions of the simulator we will reduce the memory requirements in order to simulate bigger instances of **NP**-complete problems and avoid idle threads, by deleting objects with zero multiplicity. For this task we can use spare matrix in our simulator's design.

The massively parallel environment that provides the GPUs is good enough for the simulator, however, we need to go beyond. The newest cluster of GPUs provides a higher massively parallel environment, so we will attempt to scale to those systems to obtain better performance in our simulated codes.

Finally, we will study new simulation algorithms based on algebra, and the adaptation of the design of P systems to the constraints of the GPU to make faster simulations. Furthermore, it would be interesting to avoid the brute force algorithms in P system computations, and start to design heuristics in the design of membrane solutions (i.e., avoiding membrane division as possible).

## Acknowledgement

The first three authors acknowledge the support of the project from the Fundación Séneca (Agencia Regional de Ciencia y Tecnología, Región de Murcia) under grant 00001/CS/2007, and also by the Spanish MEC and European Commission FEDER under grant CSD2006-00046. The last three authors acknowledge the support of the project TIN2006–13425 of the Ministerio de Educación y Ciencia of Spain, cofinanced by FEDER funds, and the support of the "Proyecto de Excelencia con Investigador de Reconocida Valía" of the Junta de Andalucía under grant P08-TIC04200.

# References

- Alhazov, A., Pérez–Jiménez, M.J.: Uniform solution of QSAT using polarizationless active membranes. In: Durand-Lose, J., Margenstern, M. (eds.) MCU 2007. LNCS, vol. 4664, pp. 122–133. Springer, Heidelberg (2007)
- Buck, I., Foley, T., Horn, D., Sugerman, J., Fatahalian, K., Houston, M., Hanrahan, P.: Brook for GPUs: stream computing on graphics hardware. In: SIGGRAPH 2004, pp. 777–786. ACM Press, New York (2004)
- Ciobanu, G., Pérez–Jiménez, M.J., Păun, G. (eds.): Applications of membrane computing. Springer, Heidelberg (2006)
- Ciobanu, G., Wenyuan, G.: P systems running on a cluster of computers. LNCS, vol. 2993, pp. 123–139. Springer, Heidelberg (2004)
- Díaz-Pernil, D., Pérez-Hurtado, I., Pérez-Jiménez, M.J., Riscos-Núñez, A.: A P-Lingua programming environment for Membrane Computing. In: Corne, D.W., Frisco, P., Paun, G., Rozenberg, G., Salomaa, A. (eds.) WMC 2008. LNCS, vol. 5391, pp. 187–203. Springer, Heidelberg (2009)
- Fernández, L., Martínez, V.J., Arroyo, F., Mingo, L.F.: A hardware circuit for selecting active rules in transition P systems. In: Proceedings of the Seventh International Symposium on Symbolic and Numeric Algorithms for Scientific Computing, p. 415 (2005)
- García–Quismondo, M., Gutiérrez–Escudero, R., Martínez–del–Amor, M.A., Orejuela, E., Pérez–Hurtado, I.: P–Lingua 2.0. A software framework for cell-like P systems. Intern. J. Computers, Communications and Control IV(3), 234–243 (2009)
- Garland, M., Grand, S.L., Nickolls, J., Anderson, J., Hardwick, J., Morton, S., Phillips, E., Zhang, Y., Volkov, V.: Parallel computing experiences with CUDA. IEEE Micro 28(4), 13–27 (2008)
- Govindaraju, N.K., Manocha, D.: Cache–efficient numerical algorithms using graphics hardware. Parallel Computing 33(10-11), 663–684 (2007)
- Gutiérrez-Naranjo, M.A., Martínez-del-Amor, M.A., Pérez-Hurtado, I., Pérez-Jiménez, M.J.: Solving the N-queens puzzle with P systems. In: Proc. 7th Brainstorming Week on Membrane Computing, vol. I, pp. 199–210 (2009)
- Gutiérrez-Naranjo, M.A., Pérez-Jiménez, M.J., Riscos-Núñez, A.: Available membrane computing software. In: Applications of Membrane Computing, ch. 15, pp. 411–436. Springer, Heidelberg (2006)
- Gutiérrez-Naranjo, M.A., Pérez-Jiménez, M.J., Riscos-Núñez, A.: Towards a programming language in cellular computing. Electronic Notes in Theoretical Computer Science 123, 93–110 (2005)
- Harris, M., Sengupta, S., Owens, J.D.: Parallel prefix sum (Scan) with CUDA. GPU Gems 3 (2007)
- Hartley, T.D., Catalyurek, U., Ruiz, A., Igual, F., Mayo, R., Ujaldon, M.: Biomedical image analysis on a cooperative cluster of GPUs and multicores. In: ICS 2008: Proce. 22nd annual international conference on Supercomputing, pp. 15–25. ACM, New York (2008)
- Lam, M.D., Rothberg, E.E., Wolf, M.E.: The cache performance and optimizations of blocked algorithms. In: ASPLOS-IV: Proceedings of the fourth international conference on Architectural support for programming languages and operating systems, pp. 63–74. ACM, New York (1991)
- Lindholm, E., Nickolls, J., Oberman, S., Montrym, J.: Nvidia Tesla. A unified graphics and computing architecture. IEEE Micro 28(2), 39–55 (2008)

- Mark, W.R., Glanville, R.S., Akeley, K., Kilgard, M.J.: Cg a system for programming graphics hardware in a C–like language. In: SIGGRAPH 2003, pp. 896–907. ACM, New York (2003)
- Martínez-del-Amor, M.A., Pérez-Hurtado, I., Pérez-Jiménez, M.J., Cecilia, J.M., Guerrero, G.D., García, J.M.: Simulation of Recognizer P Systems by using Manycore GPUs. In: Proc. 7th Brainstorming Week on Membrane Computing, vol. II, pp. 45–58 (2009)
- Michalakes, J., Vachharajani, M.: GPU acceleration of numerical weather prediction. In: IPDPS, pp. 1–7 (2008)
- Nguyen, V., Kearney, D., Gioiosa, G.: An algorithm for non-deterministic object distribution in P systems and its implementation in hardware. In: Corne, D.W., Frisco, P., Paun, G., Rozenberg, G., Salomaa, A. (eds.) WMC 2008. LNCS, vol. 5391, pp. 325–354. Springer, Heidelberg (2009)
- Nickolls, J., Buck, I., Garland, M., Skadron, K.: Scalable parallel programming with CUDA. Queue 6(2), 40–53 (2008)
- Owens, J.D., Houston, M., Luebke, D., Green, S., Stone, J.E., Phillips, J.C.: Gpu computing. Proceedings of the IEEE 96(5), 879–899 (2008)
- Owens, J.D., Luebke, D., Govindaraju, N., Harris, M., Krger, J., Lefohn, A.E., Purcell, T.J.: A survey of general-purpose computation on graphics hardware. Computer Graphics Forum 26(1), 80–113 (2007)
- Păun, G.: Computing with membranes. Journal of Computer and System Sciences 61(1), 108–143 (2000); Turku Center for Computer Science-TUCS Report No 208
- 25. Păun, G.: Membrane Computing, An introduction. Springer, Berlín (2002)
- Pérez–Jiménez, M.J., Romero–Jiménez, A., Sancho–Caparrini, F.: Complexity classes in models of cellular computing with membranes. Natural Computing 2(3), 265–285 (2003)
- Pérez–Jiménez, M.J., Romero–Jiménez, A., Sancho–Caparrini, F.: A polynomial complexity class in P systems using membrane division. Journal of Automata, Languages and Combinatorics 11(4), 423–434 (2006)
- Satish, N., Harris, M., Garland, M.: Designing efficient sorting algorithms for manycore GPUs. To Appear in Proc. 23rd IEEE International Parallel and Distributed Processing Symposium (2009)
- Ruiz, A., Ujaldon, M., Andrades, J.A., Becerra, J., Huang, K., Pan, T., Saltz, J.H.: The GPU on biomedical image processing for color and phenotype analysis. In: BIBE, pp. 1124–1128 (2007)
- 30. Ryoo, S., Rodrigues, C., Baghsorkhi, S., Stone, S., Kirk, D., Mei Hwu, W.: Optimization principles and application performance evaluation of a multithreaded GPU using CUDA. In: Proc. 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 73–82 (2008)
- Ryoo, S., Rodrigues, C.I., Stone, S.S., Stratton, J.A., Ueng, S.-Z., Baghsorkhi, S.S., Hwu, W.W.: Program optimization carving for GPU computing. J. Parallel Distrib. Comput. 68(10), 1389–1401 (2008)
- 32. Nvidia CUDA Programming Guide 2.0. (2008), http://developer.download.nvidia.com/compute/cuda/2\_0/docs/ NVIDIA\_CUDA\_Programming\_Guide\_2.0.pdf
- 33. GPGPU organization. World Wide Web electronic publication, http://www.gpgpu.org
- Nvidia CUDA. World Wide Web electronic publication, http://www.nvidia.com/cuda