Software Transactional Memory for GPU Architectures

Their validation technique requires that locks are sorted whenever a transactional read takes place, to avoid the possibility of livelock. The static priority rule used by PR-STM, on the other hand, avoids the need to sort locks: our threads are effectively pre-sorted by their priorities instead. In particular, Fung et al. have proposed a hardware transactional memory design for GPU architectures.

Nasre et al. have also studied synchronization for irregular GPU workloads. More recently, Holey et al. proposed several STM designs for the GPU, among them ESTM and ISTM. ESTM updates shared memory during transaction execution while maintaining an undo log to roll back those updates upon an abort. Like our approach, ISTM uses invisible reads to reduce conflicts during a transaction.

In this section we present results from a series of benchmarks to demonstrate the performance of our system. All shared data and the global lock table are allocated in global memory, while all local metadata is stored in local memory.

Accesses to the global lock table are cached only in the L2 cache, while local memory accesses are cached in both the L1 and L2 caches. We used the Windows 7 operating system. The experiments use a benchmark called bank, which accompanies TinySTM. A configurable array of bank accounts represents the shared data from which transactions withdraw and deposit funds. We allocated 10 MB of memory to the account array, creating on the order of two million accounts.

We required many accounts to accommodate the presence of many more threads on the GPU. We found that this number of accounts allowed us to observe the effects of both low and high contention as we varied scenario parameters. We also added several adaptations to the base scenario, most notably the ability to vary the amount of shared data accessed within a transaction, i.e. the transaction size. This allowed us to vary the likelihood of contention caused by longer transactions. We also changed the hashing function used in all three STM systems so that we could control the amount of shared data covered by a single lock, letting us experiment with the degree of false sharing, as sketched below.
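As an illustration, the following minimal sketch shows one way such a hash function can map an account's address to a lock-table entry; the function name, table size, and `coverage` parameter are our own illustrative choices, not the paper's code.

```cuda
// Hypothetical sketch: mapping a shared-data address to a lock-table
// index. `coverage` is the number of accounts guarded by one lock;
// raising it increases the chance of false conflicts.
#define LOCK_TABLE_SIZE (1u << 20)

__device__ unsigned int lockIndex(const int *addr, const int *base,
                                  unsigned int coverage)
{
    // Consecutive accounts share a lock in groups of `coverage`.
    unsigned int offset = (unsigned int)(addr - base);
    return (offset / coverage) % LOCK_TABLE_SIZE;
}
```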

Finally, we included results where the number of threads is increased, to observe the contention caused by the high thread counts featured in GPU applications.
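To fix ideas before the results, the following hypothetical sketch shows what the bank benchmark's transaction body might look like against a PR-STM-style interface. The `TxStart`/`TxRead`/`TxWrite`/`TxCommit` names and signatures, and the retry-on-commit-failure structure, are assumptions for illustration, not the paper's actual API.

```cuda
struct Tx;                                   // per-thread transaction descriptor (hypothetical)
__device__ void TxStart(Tx *tx);             // reset read/write/lock sets
__device__ int  TxRead(Tx *tx, int *addr);   // logged in the read set
__device__ void TxWrite(Tx *tx, int *addr, int value); // buffered in the write set
__device__ bool TxCommit(Tx *tx);            // lock, validate, write back; false on abort

// Move `amount` between two accounts atomically, retrying on abort.
__device__ void transfer(Tx *tx, int *accounts, int src, int dst, int amount)
{
    for (;;) {
        TxStart(tx);
        int a = TxRead(tx, &accounts[src]);
        int b = TxRead(tx, &accounts[dst]);
        TxWrite(tx, &accounts[src], a - amount);
        TxWrite(tx, &accounts[dst], b + amount);
        if (TxCommit(tx))
            break;                           // committed; aborted transactions retry
    }
}
```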

Figure 2 shows the transaction throughput as the number of accounts accessed per transaction increases. The thread counts used were those that provided the best performance in each system. As expected, when the transaction size increases, the throughput of all three STMs drops because inter-transaction conflicts become more likely. The sharpest drop in performance is witnessed in GPU-STM, as the higher thread numbers exacerbate the degree of conflicts.

Figure 3 shows the effect of varying the lock coverage. The hash function determines the number of accounts covered by a single lock: the lower the hash value, the less chance that threads will contend for the same lock when reading or writing different shared data. Once again the Y-axes show the throughput in transactions per second, and the X-axes show the hash function value, i.e. the number of accounts covered by a single lock. Where increasing coverage improves throughput, this is because fewer locks are acquired and so fewer conflicts occur.

As the hash value increases, the performance of TinySTM deteriorates due to the increased likelihood of false conflicts. In the final experiment, transaction throughput rises until enough threads are in use that inter-thread conflicts begin to occur at a substantial rate. Below this thread count, the possibility of conflict is negligible because the high number of accounts reduces the probability that threads will access the same account. As the number of threads increases further, however, so too does the rate of conflict, and the throughput therefore decreases markedly.

The benefit of the work produced by extra threads is cancelled out by the overhead caused by inter-transactional contention. All other factors being equal, improvements targeting read-only transactions have little effect on the GPU. We believe there is much scope for expanding our approach. In the short term we would like to enhance our contention management policy to accommodate dynamic priorities and application semantics, as this has been shown to provide substantial performance improvements [17, 18].

In the long term we would like to experiment with combining the GPU and the CPU within a heterogeneous transaction manager. The results suggest that the GPU is particularly effective at processing large numbers of short transactions, while the presence of read-only transactions provides only a small improvement to GPU performance.

Further testing will allow us to formulate transaction allocation strategies, assigning work to either the CPU or the GPU based on the effectiveness of each processing element at executing that work.

At the time of writing, implementing an efficient TM technique for the GPU remains an area with much potential for development.

The work in this paper aims to contribute to that development by providing the following: an STM algorithm for the GPU based on a simple yet effective static priority rule.

This allows a thread with priority n to steal a lock which is currently owned by any thread with a priority lower than n (see Fig. 1).

As every thread has a unique priority, this addresses the possibility of deadlock, because any thread can always determine its next action when encountering locked data.

Fig. 1. Livelock and contention management in GPU transaction execution.

PR-STM uses two types of metadata: global metadata, which is shared among all threads, and local metadata, which is private to a single thread:

- Global Lock Table: each entry in the global lock table is an unsigned integer composed of a version (11 bits), an owner (19 bits), a locked flag (1 bit), and a pre-locked flag (1 bit);
- Local Read Set: a set of read entries, each composed of the memory location, version, and value read by the current thread;
- Local Write Set: a set of write entries recording the memory location and value written by the current thread;
- Local Lock Set: a set of lock indices and lock versions written by the current thread.
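As a minimal sketch, the 32-bit lock-table entry could be packed and unpacked as follows; the paper states only the field widths, so the bit ordering here is an assumption.

```cuda
typedef unsigned int lock_t;

#define VERSION_BITS 11u
#define OWNER_BITS   19u
#define VERSION_MASK ((1u << VERSION_BITS) - 1u)
#define OWNER_MASK   ((1u << OWNER_BITS) - 1u)

// Pack the four fields into one 32-bit word: version in the low 11
// bits, owner in the next 19, then the locked and pre-locked flags.
__host__ __device__ lock_t makeLock(unsigned int version, unsigned int owner,
                                    unsigned int locked, unsigned int preLocked)
{
    return (version & VERSION_MASK)
         | ((owner & OWNER_MASK) << VERSION_BITS)
         | (locked << 30) | (preLocked << 31);
}

__host__ __device__ unsigned int lockVersion(lock_t l) { return l & VERSION_MASK; }
__host__ __device__ unsigned int lockOwner(lock_t l)   { return (l >> VERSION_BITS) & OWNER_MASK; }
__host__ __device__ unsigned int isLocked(lock_t l)    { return (l >> 30) & 1u; }
__host__ __device__ unsigned int isPreLocked(lock_t l) { return l >> 31; }
```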

CUDA provides a thread fence function, __threadfence(), to ensure that memory values modified before the fence can be seen by all other threads. The thread fence ensures that modifications to shared data are visible to all threads before any locks are released. The thread then updates the version field in the global lock table for each lock in its lock set. The version field is either incremented (line 30) or reset (line 31) if the version value has reached its maximum.
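A hedged sketch of this commit step, reusing the lock-word helpers from the previous listing, might look as follows; the set layouts and function name are illustrative assumptions.

```cuda
__device__ void commitAndRelease(lock_t *lockTable,
                                 const unsigned int *lockSet,      // indices of locks we hold
                                 const unsigned int *lockVersions, // versions observed at locking
                                 int nLocks)
{
    // The write-set values are assumed to have been copied to shared
    // data already. Make those writes visible to every other thread
    // before any lock is released.
    __threadfence();

    for (int i = 0; i < nLocks; ++i) {
        // Increment the version, or reset it once it reaches the
        // maximum, and clear the owner, locked, and pre-locked fields
        // in a single store, releasing the lock.
        unsigned int v = lockVersions[i];
        unsigned int newVersion = (v == VERSION_MASK) ? 0u : v + 1u;
        lockTable[lockSet[i]] = makeLock(newVersion, 0u, 0u, 0u);
    }
}
```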

We use locks both for protecting shared data and for implementing our priority rule policy. The various bits of each lock represent the following: the first 11 bits of a lock represent the current version of that lock; the next 19 bits identify the owning thread; and the final two bits are the locked and pre-locked flags. A sketch of the priority-based stealing check follows.
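The stealing check reduces to a comparison against the owner field followed by an atomic compare-and-swap. The sketch below assumes that each thread's unique id doubles as its priority and is what the owner field stores, and that stealing sets the pre-locked flag; it illustrates the rule rather than reproducing the paper's implementation.

```cuda
// Try to pre-lock entry `idx`, stealing it if the current holder has
// a lower priority than ours (per the static priority rule).
__device__ bool tryPreLock(lock_t *lockTable, unsigned int idx,
                           unsigned int myPriority)
{
    lock_t old = lockTable[idx];

    // Back off only if the lock is held by a thread of higher (or
    // equal) priority; otherwise we may take it.
    if ((isLocked(old) || isPreLocked(old)) && lockOwner(old) >= myPriority)
        return false;

    // Attempt the take atomically; failure means another thread
    // changed the entry first, and the caller simply retries later.
    lock_t desired = makeLock(lockVersion(old), myPriority, 0u, 1u);
    return atomicCAS(&lockTable[idx], old, desired) == old;
}
```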

In the following graphs we present results where: (i) all threads perform update transactions, i.e. transactions that write to shared data; and (ii) a proportion of transactions are read-only. The latter case was included to observe the impact of invisible reads on the scenario.

Each test lasted for 5 s and was executed 10 times, with the average results presented.

Fig. 2. Average throughput with increasing transaction size.

The scale of the graph helps to differentiate the performance once the transaction size increases beyond 16 accounts, where the values are too close to compare in absolute terms.

One possible reason for this is that our algorithm does not have to sort the local lock array at every read or write step, as GPU-STM does, while the higher number of threads enjoyed by PR-STM remains a benefit to performance rather than a hindrance.

Fig. 3. Average throughput with increasing lock coverage and increased threads.
