Coarse grain task parallel processing with cache optimization on shared memory multiprocessor

Kazuhisa Ishizaka, Motoki Obata, Hironori Kasahara

Research output: Contribution to journal › Article

2 Citations (Scopus)

Abstract

In multiprocessor systems, the gap between peak and effective performance has been growing. To cope with this performance gap, it is important to exploit multigrain parallelism in addition to ordinary loop-level parallelism. Effective use of the memory hierarchy is also important for improving the performance of multiprocessor systems, because the speed gap between processors and memories keeps widening. This paper describes coarse grain task parallel processing that exploits parallelism among macro-tasks, such as loops and subroutines, combined with cache optimization based on a data localization scheme. The proposed scheme is implemented in the OSCAR automatic multigrain parallelizing compiler, which generates an OpenMP Fortran program realizing the scheme from a sequential FORTRAN77 program. Its performance is evaluated on an IBM RS6000 SP 604e High Node, an 8-processor SMP machine, using the SPEC95fp benchmarks tomcatv, swim, and mgrid. In the evaluation, the proposed coarse grain task parallel processing scheme with cache optimization gives up to 1.3 times speedup on 1 PE, 4.7 times speedup on 4 PEs, and 8.8 times speedup on 8 PEs compared with the sequential processing time.
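
To make the approach in the abstract concrete, the fragment below is a minimal illustrative sketch, in OpenMP Fortran, of coarse grain task parallel processing with data localization. It is not taken from the paper and is not OSCAR's actual output; the program name, loop bodies, and array sizes are hypothetical. Two independent macro-tasks are executed as OpenMP sections, and the loop that consumes array B runs on the same thread as the loop that produces it, so B can stay cache-resident between the two macro-tasks.

      PROGRAM CGTASK
      IMPLICIT NONE
      INTEGER N
      PARAMETER (N = 1024)
      DOUBLE PRECISION A(N), B(N), C(N)
      INTEGER I
C     Sequential initialization of the data.
      DO I = 1, N
         A(I) = DBLE(I)
         C(I) = 0.0D0
      END DO
C     Two independent macro-tasks run as OpenMP sections.
C     Macro-task 1: a loop producing B and, on the same thread,
C     a loop consuming B, so B can stay in that processor's
C     cache (data localization between the two loops).
C     Macro-task 2: an independent computation on C.
!$OMP PARALLEL SECTIONS PRIVATE(I)
!$OMP SECTION
      DO I = 1, N
         B(I) = 2.0D0 * A(I)
      END DO
      DO I = 1, N
         A(I) = A(I) + B(I)
      END DO
!$OMP SECTION
      DO I = 1, N
         C(I) = C(I) + 1.0D0
      END DO
!$OMP END PARALLEL SECTIONS
      WRITE (*,*) 'A(N) =', A(N), '  C(N) =', C(N)
      END

Built with a compiler's OpenMP option, the two sections can run on different processors while each section's data reuse stays local to one cache, which is the intent of combining coarse grain task parallelism with data localization.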

Original language: English
Pages (from-to): 352-365
Number of pages: 14
Journal: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 2624
Publication status: Published - 2003
Externally published: Yes

ASJC Scopus subject areas

  • Computer Science (all)
  • Biochemistry, Genetics and Molecular Biology (all)
  • Theoretical Computer Science

Cite this

@article{e0e067a067594370b3988842beba1ba6,
title = "Coarse grain task parallel processing with cache optimization on shared memory multiprocessor",
author = "Kazuhisa Ishizaka and Motoki Obata and Hironori Kasahara",
year = "2003",
language = "English",
volume = "2624",
pages = "352--365",
journal = "Lecture Notes in Computer Science",
issn = "0302-9743",
publisher = "Springer Verlag",

}
