Cache optimization for coarse grain task parallel processing using inter-array padding

Kazuhisa Ishizaka, Motoki Obata, Hironori Kasahara

    Research output: Contribution to journalArticle

    4 Citations (Scopus)

    Abstract

    The wide use of multiprocessor system has been making automatic parallelizing compilers more important. To improve the performance of multiprocessor system more by compiler, multigrain parallelization is important. In multigrain parallelization, coarse grain task parallelism among loops and subroutines and near fine grain parallelism among statements are used in addition to the traditional loop parallelism. In addition, locality optimization to use cache effectively is also important for the performance improvement. This paper describes inter-array padding to minimize cache conflict misses among macro-tasks with data localization scheme which decomposes loops sharing the same arrays to fit cache size and executes the decomposed loops consecutively on the same processor. In the performance evaluation on Sun Ultra 80(4pe), OSCAR compiler on which the proposed scheme is implemented gave us 2.5 times speedup against the maximum performance of Sun Forte compiler automatic loop parallelization at the average of SPEC CFP95 tomcatv, swim hydro2d and turb3d programs. Also, OSCAR compiler showed 2.1 times speedup on IBM RS/6000 44p-270(4pe) against XLF compiler.

    Original languageEnglish
    Pages (from-to)64-76
    Number of pages13
    JournalLecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
    Volume2958
    Publication statusPublished - 2004

    Fingerprint

    Solar System
    Parallel Processing
    Compiler
    Sun
    Cache
    Optimization
    Parallelization
    Subroutines
    Parallelism
    Processing
    Macros
    Multiprocessor Systems
    Speedup
    Parallelizing Compilers
    Locality
    Performance Evaluation
    Sharing
    Minimise
    Decompose

    ASJC Scopus subject areas

    • Computer Science(all)
    • Biochemistry, Genetics and Molecular Biology(all)
    • Theoretical Computer Science

    Cite this

    @article{7c39c807270d4fb284f90f4546946cb0,
    title = "Cache optimization for coarse grain task parallel processing using inter-array padding",
    abstract = "The wide use of multiprocessor system has been making automatic parallelizing compilers more important. To improve the performance of multiprocessor system more by compiler, multigrain parallelization is important. In multigrain parallelization, coarse grain task parallelism among loops and subroutines and near fine grain parallelism among statements are used in addition to the traditional loop parallelism. In addition, locality optimization to use cache effectively is also important for the performance improvement. This paper describes inter-array padding to minimize cache conflict misses among macro-tasks with data localization scheme which decomposes loops sharing the same arrays to fit cache size and executes the decomposed loops consecutively on the same processor. In the performance evaluation on Sun Ultra 80(4pe), OSCAR compiler on which the proposed scheme is implemented gave us 2.5 times speedup against the maximum performance of Sun Forte compiler automatic loop parallelization at the average of SPEC CFP95 tomcatv, swim hydro2d and turb3d programs. Also, OSCAR compiler showed 2.1 times speedup on IBM RS/6000 44p-270(4pe) against XLF compiler.",
    author = "Kazuhisa Ishizaka and Motoki Obata and Hironori Kasahara",
    year = "2004",
    language = "English",
    volume = "2958",
    pages = "64--76",
    journal = "Lecture Notes in Computer Science",
    issn = "0302-9743",
    publisher = "Springer Verlag",

    }

    TY - JOUR

    T1 - Cache optimization for coarse grain task parallel processing using inter-array padding

    AU - Ishizaka, Kazuhisa

    AU - Obata, Motoki

    AU - Kasahara, Hironori

    PY - 2004

    Y1 - 2004

    N2 - The wide use of multiprocessor system has been making automatic parallelizing compilers more important. To improve the performance of multiprocessor system more by compiler, multigrain parallelization is important. In multigrain parallelization, coarse grain task parallelism among loops and subroutines and near fine grain parallelism among statements are used in addition to the traditional loop parallelism. In addition, locality optimization to use cache effectively is also important for the performance improvement. This paper describes inter-array padding to minimize cache conflict misses among macro-tasks with data localization scheme which decomposes loops sharing the same arrays to fit cache size and executes the decomposed loops consecutively on the same processor. In the performance evaluation on Sun Ultra 80(4pe), OSCAR compiler on which the proposed scheme is implemented gave us 2.5 times speedup against the maximum performance of Sun Forte compiler automatic loop parallelization at the average of SPEC CFP95 tomcatv, swim hydro2d and turb3d programs. Also, OSCAR compiler showed 2.1 times speedup on IBM RS/6000 44p-270(4pe) against XLF compiler.

    AB - The wide use of multiprocessor system has been making automatic parallelizing compilers more important. To improve the performance of multiprocessor system more by compiler, multigrain parallelization is important. In multigrain parallelization, coarse grain task parallelism among loops and subroutines and near fine grain parallelism among statements are used in addition to the traditional loop parallelism. In addition, locality optimization to use cache effectively is also important for the performance improvement. This paper describes inter-array padding to minimize cache conflict misses among macro-tasks with data localization scheme which decomposes loops sharing the same arrays to fit cache size and executes the decomposed loops consecutively on the same processor. In the performance evaluation on Sun Ultra 80(4pe), OSCAR compiler on which the proposed scheme is implemented gave us 2.5 times speedup against the maximum performance of Sun Forte compiler automatic loop parallelization at the average of SPEC CFP95 tomcatv, swim hydro2d and turb3d programs. Also, OSCAR compiler showed 2.1 times speedup on IBM RS/6000 44p-270(4pe) against XLF compiler.

    UR - http://www.scopus.com/inward/record.url?scp=35048829771&partnerID=8YFLogxK

    UR - http://www.scopus.com/inward/citedby.url?scp=35048829771&partnerID=8YFLogxK

    M3 - Article

    AN - SCOPUS:35048829771

    VL - 2958

    SP - 64

    EP - 76

    JO - Lecture Notes in Computer Science

    JF - Lecture Notes in Computer Science

    SN - 0302-9743

    ER -