Leveraging Swin Transformer for Local-to-Global Weakly Supervised Semantic Segmentation

Document Type : Research Article

Authors

1 M.Sc. in Computer Engineering, Department of Computer Engineering, Sharif University of Technology, Tehran, Iran

2 Professor of Artificial Intelligence, Department of Computer Engineering, Sharif University of Technology, Tehran, Iran

Abstract

Recent advances in Weakly Supervised Semantic Segmentation (WSSS) have highlighted the use of image-level class labels as a form of supervision. To compensate for the limited spatial information in class labels, many methods derive pseudo-labels from class activation maps (CAMs). However, CAMs generated by Convolutional Neural Networks (CNNs) tend to focus on the most discriminative features, making it difficult to separate foreground objects from their backgrounds. While recent studies show that features from Vision Transformers (ViTs) capture scene layout more effectively than those from CNNs, the use of hierarchical ViTs has not been widely explored in WSSS. This work introduces "SWTformer" and examines how the Swin Transformer's local-to-global view improves the accuracy of initial seed CAMs. SWTformer-V1 generates CAMs using only patch tokens as input features. SWTformer-V2 extends this process with a multi-scale feature fusion mechanism and a background-aware mechanism that refines the localization maps, yielding better differentiation between objects. Experiments on the Pascal VOC 2012 dataset demonstrate that, compared to state-of-the-art models, SWTformer-V1 achieves 0.98% higher localization mAP and produces initial localization maps that are 0.82% higher in mIoU while relying solely on the classification network. SWTformer-V2 further improves the accuracy of the seed CAMs by 5.32% mIoU. Code is available at: https://github.com/RozhanAhmadi/SWTformer
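
To illustrate the idea of producing CAMs directly from patch tokens, as SWTformer-V1 does, the following is a minimal sketch, not the authors' implementation: it assumes patch tokens from the last stage of a hierarchical backbone such as Swin are already available as a tensor, and the class names, shapes, and head design (a 1x1 convolution followed by global average pooling) are illustrative assumptions.

```python
# Hypothetical sketch: class activation maps from transformer patch tokens.
# Assumes tokens of shape (B, N, C) from the final stage of a Swin-like backbone.
import torch
import torch.nn as nn


class PatchTokenCAMHead(nn.Module):
    """A 1x1 conv over reshaped patch tokens yields per-class activation maps;
    global average pooling over those maps yields image-level logits."""

    def __init__(self, embed_dim: int, num_classes: int):
        super().__init__()
        self.classifier = nn.Conv2d(embed_dim, num_classes, kernel_size=1, bias=False)

    def forward(self, tokens: torch.Tensor, h: int, w: int):
        b, n, c = tokens.shape
        assert n == h * w, "token count must match the spatial grid"
        feats = tokens.transpose(1, 2).reshape(b, c, h, w)  # (B, C, H, W)
        cams = self.classifier(feats)                       # (B, K, H, W) raw CAMs
        logits = cams.flatten(2).mean(dim=2)                # GAP -> image-level logits
        return logits, torch.relu(cams)


# Example: a Swin-Tiny final stage (768 channels) on a 224x224 input gives a
# 7x7 grid of patch tokens; 20 classes corresponds to Pascal VOC foreground.
head = PatchTokenCAMHead(embed_dim=768, num_classes=20)
tokens = torch.randn(2, 49, 768)
logits, cams = head(tokens, h=7, w=7)
```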
