14 GPU versions of QW-Simulator

Despite rapid progress in electromagnetic simulation software and hardware, simulation speed remains a crucial problem. The demand is never-ending: faster simulations encourage engineers and researchers to try more complicated scenarios, which in turn puts more pressure on speed requirements. The history of computer hardware over the last decade shows that the possibilities for speeding up computer clocks are limited. Instead, computer systems are evolving towards more complicated architectures and, in particular, towards multi-processor and multi-core solutions.

Massively parallel computers were originally developed for especially demanding scientific tasks. They were very expensive and thus not widely accessible to engineers. Subsequently, market forces helped find a more economical solution for massively parallel architectures. The demand for the very fast processing required in video games led to the development of graphic cards (GPUs) with a massively parallel architecture. That multi-billion-dollar market drove progress so fast that engineers started to envy their kids. They found that GPU processors could, in many cases, serve them better than even the most modern CPUs. This led to spontaneous efforts by many companies to develop their own simulation codes targeted at specific graphic cards. On the other hand, there has also been constant progress in multi-processor CPUs. Multi-processor CPUs have entered the personal computer market, and fast progress is expected in this field as well.

It is hard to predict whether CPU or GPU solutions will win the race. This uncertainty has stimulated the quest for a standard language for cross-platform programming, suitable for massively parallel processing units in general, including both massively parallel CPUs and massively parallel GPUs. The language is called Open Computing Language (OpenCL, http://www.khronos.org/opencl/). OpenCL is designed to greatly improve the speed and responsiveness of computer applications. It is not tied to specific computer hardware or specific market solutions. From this point of view, OpenCL code is prepared for present as well as future computer hardware.

QW-Simulator GPU is a version of QW-Simulator designed for massively parallel computing hardware. It incorporates parts of the QW-Simulator code rewritten by QWED in OpenCL.

QW-Simulator MultiGPU is a version of QW-Simulator designed for systems with multiple massively parallel computing devices. It allows a single simulation to run on multiple GPU devices.

In 2018 the most efficient OpenCL platforms are GPU accelerators with a processing power of about 15 TFLOPS and a memory bandwidth of about 900 GB/s. The number of stream processors in a single GPU exceeds 5000.
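
As a rough illustration of how the two hardware figures above relate, the following Python sketch compares a compute limit with a memory limit for an FDTD update. The 15 TFLOPS and 900 GB/s values are taken from the text. The per-cell cost figures (FLOPS_PER_CELL, BYTES_PER_CELL) and all function names are assumptions chosen only for illustration; they are not QW-Simulator figures.

```python
# Rough roofline-style estimate for an FDTD update on a GPU.
# Hardware figures come from the text; per-cell costs are ASSUMPTIONS.

PEAK_FLOPS = 15e12       # about 15 TFLOPS (from the text)
MEM_BANDWIDTH = 900e9    # about 900 GB/s (from the text)

# ASSUMPTIONS (illustrative only, not QW-Simulator figures):
FLOPS_PER_CELL = 36      # arithmetic operations per cell per time step
BYTES_PER_CELL = 72      # 18 single-precision values read or written per cell


def machine_balance(peak_flops, bandwidth):
    """FLOPs the device can execute in the time it takes to move one byte."""
    return peak_flops / bandwidth


def attainable_cell_rate(peak_flops, bandwidth, flops_per_cell, bytes_per_cell):
    """Upper bound on cells updated per second: the lower of the two limits."""
    compute_limit = peak_flops / flops_per_cell
    memory_limit = bandwidth / bytes_per_cell
    return min(compute_limit, memory_limit)


def is_memory_bound(peak_flops, bandwidth, flops_per_cell, bytes_per_cell):
    """True when memory traffic, not arithmetic, sets the attainable rate."""
    return bandwidth / bytes_per_cell < peak_flops / flops_per_cell
```

Under these assumed per-cell costs the update performs 0.5 FLOP per byte moved, while the device can execute roughly 17 FLOPs per byte, so the estimate is limited by memory bandwidth rather than by arithmetic throughput. Different assumptions about the per-cell cost would shift the numbers.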

From the application point of view, a GPU can be treated as a computational coprocessor. Running QW-GPUSim on contemporary GPUs provides two major advantages. The first is massive parallelisation, in the sense that different FDTD cells are calculated by different threads, in different processing units, at the same time. The second is very fast access to local memory, resulting in a very large memory bandwidth.
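
The reason different FDTD cells can be handled by different threads is that, within one half-step, the update of each cell reads only values that no other cell is writing. The minimal one-dimensional sketch below illustrates this in plain Python. It is not QW-Simulator code: the function names, the grid size and the coefficient c are hypothetical, and the per-cell functions merely stand in for what a GPU kernel would do per thread.

```python
def update_h_cell(i, e, h, c):
    """Per-cell 'kernel': update H at cell i using only E values and H[i]."""
    h[i] += c * (e[i + 1] - e[i])


def update_e_cell(i, e, h, c):
    """Per-cell 'kernel': update E at cell i using only H values and E[i]."""
    e[i] += c * (h[i] - h[i - 1])


def step(e, h, c, order):
    """One time step; 'order' decides in which sequence the cells are visited."""
    n = len(e)
    for i in order(range(n - 1)):      # all H cells are mutually independent
        update_h_cell(i, e, h, c)
    for i in order(range(1, n - 1)):   # all interior E cells are independent
        update_e_cell(i, e, h, c)


def run(order, n=64, steps=50, c=0.5):
    """Propagate an initial pulse; E is held at zero on both boundaries."""
    e = [0.0] * n
    h = [0.0] * (n - 1)
    e[n // 2] = 1.0                    # hypothetical initial excitation
    for _ in range(steps):
        step(e, h, c, order)
    return e, h


# Visiting the cells forwards or backwards gives identical fields, which is
# what allows the cells to be distributed across many threads.
forward = run(iter)
backward = run(reversed)
```

Because the visiting order does not affect the result, the loop over cells can be replaced by one thread per cell, with a synchronisation point between the H and E half-steps.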

From the user's point of view, operating a software package programmed in OpenCL is practically the same as operating a classically programmed code. However, the user should be aware that not all functions have been ported to OpenCL. Porting all of them would not be desirable for at least two reasons:

Not all the software operations would gain in speed when re-programmed for massive parallel processing.
Graphic cards have limited internal memory and saturating it with data needed for auxiliary operations would be counterproductive to overall performance of the software.

QW-Simulator GPU and QW-Simulator MultiGPU are currently optimised for use on modern PC graphic cards of a massively parallel architecture. For the above reasons, only the parts of the code judged crucial to the speed of processing have been ported to OpenCL. Other parts are still executed on the CPU. Such a scheme of operation may sometimes require the transfer of large amounts of data between the CPU and graphic card memories, which takes additional time. Thus the gain in speed with respect to a regular CPU version depends on the scenario actually simulated and on the postprocessing required.
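
The dependence of the overall gain on the unported parts and on data transfers can be sketched with a simple Amdahl-style model. This is only an illustrative estimate under stated assumptions: the function name, its parameters and the example numbers in the usage note are hypothetical and do not describe measured QW-Simulator performance.

```python
def effective_speedup(t_ported, t_unported, kernel_speedup,
                      transfer_bytes, link_bandwidth):
    """Estimate overall speedup of a partly ported code (hypothetical model).

    t_ported       -- CPU time (s) of the parts that were ported to the GPU
    t_unported     -- CPU time (s) of the parts that still run on the CPU
    kernel_speedup -- assumed speedup of the ported parts on the GPU
    transfer_bytes -- data moved between CPU and GPU memories (bytes)
    link_bandwidth -- assumed CPU-GPU transfer rate (bytes per second)
    """
    cpu_only_time = t_ported + t_unported
    transfer_time = transfer_bytes / link_bandwidth
    gpu_version_time = t_ported / kernel_speedup + t_unported + transfer_time
    return cpu_only_time / gpu_version_time
```

For example, with 90 s of ported work, 10 s of CPU-side postprocessing and an assumed 30-fold kernel speedup, the model gives an overall speedup of about 7.7 when no data is transferred. Adding a hypothetical 160 GB of transfers over a 16 GB/s link lowers it to about 4.3, which shows how postprocessing and transfers can dominate the result.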

QWED has made an extensive effort to deliver code that provides the best possible speedups for most practical scenarios. We cannot exclude that in some specific scenarios the speedup may be smaller than in the typical ones exemplified in Section GL 3. We encourage QW-3D users to report such cases to QWED. We will take those observations into account when preparing subsequent QW-Simulator GPU and QW-Simulator MultiGPU versions.