Human Action Recognition Based on Integrating Body Pose, Part Shape, and Motion

Hany El-Ghaish, Mohamed E. Hussein, Amin Shoukry, Rikio Onai

    Research output: Contribution to journal › Article

    Abstract

    Human action recognition is a challenging problem, especially in the presence of multiple actors in the scene and/or viewpoint variations. In this paper, three modalities, namely, 3-D skeletons, body-part images, and the motion history image (MHI), are integrated into a hybrid deep learning architecture for human action recognition. The three modalities capture the main aspects of an action: body pose, part shape, and body motion. Although the 3-D skeleton modality captures the actor's pose, it lacks information about the shape of the body parts as well as the shape of manipulated objects; this is the reason for including both the body-part images and the MHI as additional modalities. The deployed architecture combines convolutional neural networks (CNNs), long short-term memory (LSTM), and a fine-tuned pre-trained network into a hybrid model. It is called MCLP: multi-modal CNN + LSTM + VGG16 pre-trained on ImageNet. The MCLP consists of three sub-models: CL1D (CNN1D + LSTM), CL2D (CNN2D + LSTM), and CMHI (CNN2D for MHI), which simultaneously extract the spatial and temporal patterns in the three modalities. The decisions of these three sub-models are fused by a late multiply-fusion module, which proved to yield better accuracy than averaging or maximizing fusion methods. The proposed combined model and its sub-models have been evaluated, both individually and collectively, on four public data sets: UTKinect-Action3D, SBU Interaction, Florence 3D Actions, and NTU RGB+D. Our recognition rates outperform the state-of-the-art rates on all the evaluated data sets.
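    The abstract's late multiply fusion can be illustrated with a minimal sketch: each sub-model emits a per-class probability vector, the vectors are combined by an element-wise product and renormalized, and the fused scores determine the predicted action. This is a generic implementation of multiply fusion, not the authors' code; the function name and the example probability values are hypothetical.

    ```python
    import numpy as np

    def multiply_fusion(prob_list):
        """Late multiply fusion (illustrative): combine per-class probability
        vectors from several sub-models by element-wise product, then
        renormalize so the fused scores sum to 1."""
        fused = np.prod(np.stack(prob_list), axis=0)
        return fused / fused.sum()

    # Hypothetical softmax outputs of the three sub-models over 3 action classes
    p_cl1d = np.array([0.6, 0.3, 0.1])   # skeleton stream (CNN1D + LSTM)
    p_cl2d = np.array([0.5, 0.4, 0.1])   # body-part image stream (CNN2D + LSTM)
    p_cmhi = np.array([0.7, 0.2, 0.1])   # MHI stream (CNN2D)

    fused = multiply_fusion([p_cl1d, p_cl2d, p_cmhi])
    predicted_class = int(np.argmax(fused))
    ```

    Compared with averaging, the product sharply penalizes any class that even one stream considers unlikely, which is one intuition for why multiply fusion can outperform averaging or maximum fusion when the streams are complementary.
    
    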

    Original language: English
    Article number: 8453782
    Pages (from-to): 49040-49055
    Number of pages: 16
    Journal: IEEE Access
    Volume: 6
    DOI: 10.1109/ACCESS.2018.2868319
    Publication status: Published - 31 Aug 2018


    Keywords

    • CNN-LSTM
    • convolution neural networks (CNN)
    • Human action recognition
    • long short-term memory (LSTM)
    • motion history images (MHI)
    • spatial and temporal features

    ASJC Scopus subject areas

    • Computer Science (all)
    • Materials Science (all)
    • Engineering (all)

    Cite this

    El-Ghaish, Hany; Hussein, Mohamed E.; Shoukry, Amin; Onai, Rikio. Human Action Recognition Based on Integrating Body Pose, Part Shape, and Motion. In: IEEE Access, Vol. 6, Article 8453782, 31.08.2018, pp. 49040-49055.
    @article{75cce49cbb2a4dd18a27d8b71e44c69d,
    title = "Human Action Recognition Based on Integrating Body Pose, Part Shape, and Motion",
    author = "Hany El-Ghaish and Hussein, {Mohamed E.} and Amin Shoukry and Rikio Onai",
    year = "2018",
    month = "8",
    day = "31",
    doi = "10.1109/ACCESS.2018.2868319",
    language = "English",
    volume = "6",
    pages = "49040--49055",
    journal = "IEEE Access",
    issn = "2169-3536",
    publisher = "Institute of Electrical and Electronics Engineers Inc.",
    }
