[1] J. Dai, K. He, and J. Sun, BoxSup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2015, pp. 1635–1643.
[2] D. Lin, J. Dai, J. Jia, K. He, and J. Sun, ScribbleSup: Scribble-supervised convolutional networks for semantic segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2016, pp. 3159–3167.
[3] L. Wu, Z. Zhong, L. Fang, X. He, Q. Liu, J. Ma, and H. Chen, Sparsely annotated semantic segmentation with adaptive Gaussian mixtures, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 15454–15464.
[4] A. Bearman, O. Russakovsky, V. Ferrari, and L. Fei-Fei, What's the point: Semantic segmentation with point supervision, in: Proceedings of the European Conference on Computer Vision, 2016, pp. 549–565.
[5] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, Learning deep features for discriminative localization, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2016, pp. 2921–2929.
[6] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, Swin Transformer: Hierarchical vision transformer using shifted windows, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10012–10022.
[7] S. Abnar and W. Zuidema, Quantifying attention flow in transformers, in: arXiv preprint arXiv:2005.00928, 2020.
[8] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, An image is worth 16x16 words: Transformers for image recognition at scale, in: International Conference on Learning Representations, 2021.
[9] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou, Training data-efficient image transformers & distillation through attention, in: International Conference on Machine Learning, 2021, pp. 10347–10357.
[10] Z. Peng, W. Huang, S. Gu, L. Xie, Y. Wang, J. Jiao, and Q. Ye, Conformer: Local features coupling global representations for visual recognition, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 367–376.
[11] L. Yuan, Y. Chen, T. Wang, W. Yu, Y. Shi, Z.-H. Jiang, F. E. Tay, J. Feng, and S. Yan, Tokens-to-Token ViT: Training vision transformers from scratch on ImageNet, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 558–567.
[12] W. Wang, E. Xie, X. Li, D.-P. Fan, K. Song, D. Liang, T. Lu, P. Luo, and L. Shao, Pyramid vision transformer: A versatile backbone for dense prediction without convolutions, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 568–578.
[13] J. Ahn and S. Kwak, Learning pixel-level semantic affinity with image-level supervision for weakly supervised semantic segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 4981–4990.
[14] J. Ahn, S. Cho, and S. Kwak, Weakly supervised learning of instance segmentation with inter-pixel relations, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 2209–2218.
[15] Y. Wei, J. Feng, X. Liang, M.-M. Cheng, Y. Zhao, and S. Yan, Object region mining with adversarial erasing: A simple classification to semantic segmentation approach, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2017, pp. 1568–1576.
[16] J. Lee, E. Kim, S. Lee, J. Lee, and S. Yoon, FickleNet: Weakly and semi-supervised semantic image segmentation using stochastic inference, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 5267–5276.
[17] T. Chen, X. Jiang, G. Pei, Z. Sun, Y. Wang, and Y. Yao, Knowledge transfer with simulated inter-image erasing for weakly supervised semantic segmentation, in: Proceedings of the European Conference on Computer Vision, Springer, 2025, pp. 441–458.
[18] T. Zhou, M. Zhang, F. Zhao, and J. Li, Regional semantic contrast and aggregation for weakly supervised semantic segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 4299–4309.
[19] L. Wu, Z. Zhong, J. Ma, Y. Wei, H. Chen, L. Fang, and S. Li, Modeling the label distributions for weakly-supervised semantic segmentation, in: arXiv preprint arXiv:2403.13225, 2024.
[20] J. Fan, Z. Zhang, T. Tan, C. Song, and J. Xiao, CIAN: Cross-image affinity net for weakly supervised semantic segmentation, in: Proceedings of the AAAI Conference on Artificial Intelligence, 2020, pp. 10762–10769.
[21] G. Sun, W. Wang, J. Dai, and L. Van Gool, Mining cross-image semantics for weakly supervised semantic segmentation, in: Proceedings of the European Conference on Computer Vision, Springer, 2020, pp. 347–365.
[22] Y. Du, Z. Fu, Q. Liu, and Y. Wang, Weakly supervised semantic segmentation by pixel-to-prototype contrast, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 4320–4329.
[23] Q. Chen, L. Yang, J.-H. Lai, and X. Xie, Self-supervised image-specific prototype exploration for weakly supervised semantic segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 4288–4298.
[24] F. Tang, Z. Xu, Z. Qu, W. Feng, X. Jiang, and Z. Ge, Hunting attributes: Context prototype-aware learning for weakly supervised semantic segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 3324–3334.
[25] T. Wu, J. Huang, G. Gao, X. Wei, X. Wei, X. Luo, and C. H. Liu, Embedded discriminative attention mechanism for weakly supervised semantic segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 16765–16774.
[26] Y. Wang, J. Zhang, M. Kan, S. Shan, and X. Chen, Self-supervised equivariant attention mechanism for weakly supervised semantic segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 12275–12284.
[27] Y.-T. Chang, Q. Wang, W.-C. Hung, R. Piramuthu, Y.-H. Tsai, and M.-H. Yang, Weakly-supervised semantic segmentation via sub-category exploration, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 8991–9000.
[28] Y. Lin, M. Chen, W. Wang, B. Wu, K. Li, B. Lin, H. Liu, and X. He, CLIP is also an efficient segmenter: A text-driven approach for weakly supervised semantic segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 15305–15314.
[29] S. Deng, W. Zhuo, J. Xie, and L. Shen, QA-CLIMS: Question-answer cross language image matching for weakly supervised semantic segmentation, in: Proceedings of the 31st ACM International Conference on Multimedia, 2023, pp. 5572–5583.
[30] B. Murugesan, R. Hussain, R. Bhattacharya, I. Ben Ayed, and J. Dolz, Prompting classes: Exploring the power of prompt class learning in weakly supervised semantic segmentation, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024, pp. 291–302.
[31] B. Zhang, S. Yu, Y. Wei, Y. Zhao, and J. Xiao, Frozen CLIP: A strong backbone for weakly supervised semantic segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 3796–3806.
[32] L. Xu, W. Ouyang, M. Bennamoun, F. Boussaid, and D. Xu, Learning multi-modal class-specific tokens for weakly supervised dense object localization, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 19596–19605.
[33] L. Zhu, X. Wang, J. Feng, T. Cheng, Y. Li, B. Jiang, D. Zhang, and J. Han, WeakCLIP: Adapting CLIP for weakly-supervised semantic segmentation, in: Springer International Journal of Computer Vision, 2024, pp. 1–21.
[34] J. Xie, X. Hou, K. Ye, and L. Shen, CLIMS: Cross language image matching for weakly supervised semantic segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 4483–4492.
[35] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., Learning transferable visual models from natural language supervision, in: Proceedings of the 38th International Conference on Machine Learning (PMLR), 2021, pp. 8748–8763.
[36] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, et al., Segment anything, in: arXiv preprint arXiv:2304.02643, 2023, pp. 1–30.
[37] W. Sun, Z. Liu, Y. Zhang, Y. Zhong, and N. Barnes, An alternative to WSSS? An empirical study of the Segment Anything Model (SAM) on weakly-supervised semantic segmentation problems, in: arXiv preprint arXiv:2305.01586, 2023.
[38] P.-T. Jiang and Y. Yang, Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation, in: arXiv preprint arXiv:2305.01275, 2023.
[39] T. Chen, Z. Mai, R. Li, and W.-L. Chao, Segment Anything Model (SAM) enhanced pseudo labels for weakly supervised semantic segmentation, in: arXiv preprint arXiv:2305.05803, 2023.
[40] H. Kweon and K.-J. Yoon, From SAM to CAMs: Exploring Segment Anything Model for weakly supervised semantic segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 19499–19509.
[41] Z. Yang, K. Fu, M. Duan, L. Qu, S. Wang, and Z. Song, Separate and conquer: Decoupling co-occurrence via decomposition and representation for weakly supervised semantic segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 3606–3615.
[42] L. Ru, Y. Zhan, B. Yu, and B. Du, Learning affinity from attention: End-to-end weakly-supervised semantic segmentation with transformers, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16846–16855.
[43] L. Xu, W. Ouyang, M. Bennamoun, F. Boussaid, and D. Xu, Multi-class token transformer for weakly supervised semantic segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 4310–4319.
[44] S. Rossetti, D. Zappia, M. Sanzari, M. Schaerf, and F. Pirri, Max pooling with vision transformers reconciles class and shape in weakly supervised semantic segmentation, in: Proceedings of the European Conference on Computer Vision, 2022, pp. 446–463.
[45] L. Ru, H. Zheng, Y. Zhan, and B. Du, Token contrast for weakly-supervised semantic segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 3093–3102.
[46] R. Li, Z. Mai, Z. Zhang, J. Jang, and S. Sanner, TransCAM: Transformer attention-based CAM refinement for weakly supervised semantic segmentation, in: Elsevier Journal of Visual Communication and Image Representation, 2023, pp. 103800.
[47] S.-H. Yoon, H. Kwon, H. Kim, and K.-J. Yoon, Class tokens infusion for weakly supervised semantic segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 3595–3605.
[48] Y. Wu, X. Ye, K. Yang, J. Li, and X. Li, DuPL: Dual student with trustworthy progressive learning for robust weakly supervised semantic segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 3534–3543.
[49] M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin, Emerging properties in self-supervised vision transformers, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 9650–9660.
[50] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, Feature pyramid networks for object detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2017, pp. 2117–2125.
[51] T. Chen and L. Mo, Swin-Fusion: Swin-Transformer with feature fusion for human action recognition, in: Springer Neural Processing Letters, 2023, pp. 11109–11130.
[52] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, The PASCAL Visual Object Classes (VOC) challenge, in: Springer International Journal of Computer Vision, 2010, pp. 303–338.
[53] B. Hariharan, P. Arbeláez, L. Bourdev, S. Maji, and J. Malik, Semantic contours from inverse detectors, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2011, pp. 991–998.
[54] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, ImageNet: A large-scale hierarchical image database, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2009, pp. 248–255.
[55] J. Lee, E. Kim, and S. Yoon, Anti-adversarially manipulated attributions for weakly and semi-supervised semantic segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 4071–4080.