Publications

You can also find my articles on my Google Scholar profile.

Journal Articles


Using segmentation with multi-scale selective kernel for visual object tracking

Published in IEEE Signal Processing Letters, 2022

Generic visual object tracking is challenging due to various difficulties, e.g., scale variations and deformations. To solve these problems, we propose a novel multi-scale selective kernel module for tracking, which contains small-scale and large-scale branches to model the target at different scales and an attention mechanism to capture more effective appearance information of the target. In our module, we cascade multiple small-scale convolutional blocks as an equivalent large-scale branch to extract large-scale features of the target effectively. Besides, we present a hybrid strategy for feature selection to extract significant information from features of different scales. Building on a state-of-the-art segmentation-based tracking framework, we propose a novel tracking network that applies our module at multiple places in the up-sampling phase to construct a more accurate and robust appearance model. Extensive experimental results show that our tracker outperforms other state-of-the-art trackers on multiple challenging benchmarks, including VOT2018, TrackingNet, DAVIS-2017, and YouTube-VOS-2018, while achieving real-time tracking.
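The cascading idea in the abstract, stacking small-scale convolutional blocks to stand in for a large-scale branch, rests on standard receptive-field arithmetic; the sketch below is illustrative only and not code from the paper:

```python
def receptive_field(kernel_sizes, strides=None):
    """Effective receptive field of a stack of conv layers.

    With stride-1 convolutions, each k x k layer grows the receptive
    field by (k - 1), so a cascade of small kernels can cover the same
    region as one large kernel while using fewer parameters.
    """
    if strides is None:
        strides = [1] * len(kernel_sizes)
    rf, jump = 1, 1
    for k, s in zip(kernel_sizes, strides):
        rf += (k - 1) * jump  # growth is scaled by the accumulated stride
        jump *= s
    return rf

# Two stacked 3x3 convs see the same 5x5 region as a single 5x5 kernel,
# with an extra nonlinearity in between.
print(receptive_field([3, 3]))     # 5
print(receptive_field([5]))        # 5
print(receptive_field([3, 3, 3]))  # 7
```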

Recommended citation: Bao F, Cao Y, Zhang S, et al. Using segmentation with multi-scale selective kernel for visual object tracking[J]. IEEE Signal Processing Letters, 2022, 29: 553-557.
Download Paper

Conference Papers


Global-Guided Weighted Enhancement for Salient Object Detection

Published in Artificial Neural Networks and Machine Learning – ICANN 2024, 2024

Salient Object Detection (SOD) benefits from the guidance of global context to further enhance performance. However, most works treat top-layer features, after simple compression and nonlinear processing, as the global context, which inevitably lacks the integrity of the object. Moreover, directly integrating multi-level features with the global context is ineffective for solving semantic dilution. Although the global context is considered to enhance the relationship among salient regions to reduce feature redundancy, equating high-level features with global context often results in suboptimal performance. To address these issues, we redefine the role of global context within the network and propose a new method called Global-Guided Weighted Enhancement Network (GWENet). We first design a Deep Semantic Feature Extractor (DSFE) to enlarge the receptive field of the network, laying the foundation for global context extraction. Secondly, we construct a Global Perception Module (GPM) for global context modeling through pixel-level correspondence, which employs a global sliding weighted technique to provide the network with rich semantics and acts on each layer to enhance SOD performance via Global Guidance Flows (GGFs). Lastly, to effectively merge multi-level features with the global context, we introduce a Comprehensive Feature Enhancement Module (CFEM) that integrates all features within the module through 3D convolution, producing more robust feature maps. Extensive experiments on five challenging benchmark datasets demonstrate that GWENet achieves state-of-the-art results.

Recommended citation: Yu J, Liu Y, Wei H, et al. Global-Guided Weighted Enhancement for Salient Object Detection[C]//International Conference on Artificial Neural Networks. Cham: Springer Nature Switzerland, 2024: 137-152.
Download Paper

Learning Modality-Complementary and Eliminating-Redundancy Representations with Multi-Task Learning for Multimodal Sentiment Analysis

Published in 2024 International Joint Conference on Neural Networks (IJCNN), 2024

A crucial issue in multimodal language processing is representation learning. Previous works jointly train multimodal and unimodal tasks to learn the consistency and difference of modality representations. However, due to the lack of cross-modal interaction, the extraction of complementary features between modalities is insufficient. Moreover, during multimodal fusion, the generated multimodal embeddings may be redundant, and unimodal representations also contain noise information, which negatively influences the final sentiment prediction. To this end, we construct a Modality-Complementary and Eliminating-Redundancy multi-task learning model (MCER), adding a cross-modal task that learns complementary features between two modal pairs through a gated transformer. Two label generation modules then learn modality-specific and modality-complementary representations. Additionally, we introduce the multimodal information bottleneck (MIB) in both multimodal and unimodal tasks to filter out noise information in unimodal representations and to learn powerful and sufficient multimodal embeddings that are free of redundancy. Lastly, we conduct extensive experiments on two popular sentiment analysis benchmarks, MOSI and MOSEI. Experimental results demonstrate that our model significantly outperforms the current strong baselines.

Recommended citation: Zhao X, Miao X, Xu X, et al. Learning Modality-Complementary and Eliminating-Redundancy Representations with Multi-Task Learning for Multimodal Sentiment Analysis[C]//2024 International Joint Conference on Neural Networks (IJCNN). IEEE, 2024: 1-8.
Download Paper

Towards Highly Effective Moving Tiny Ball Tracking via Vision Transformer

Published in 20th International Conference, ICIC 2024, 2024

Recent tiny ball tracking methods based on deep neural networks have made significant progress. However, since moving balls in video are always blurred, most existing methods cannot achieve accurate tracking due to limited receptive fields and sampling depth. Furthermore, as high-resolution competition videos become increasingly common, existing methods perform poorly on high-resolution images. To this end, we provide a strong baseline for tracking tiny balls, called TrackFormer. Firstly, we use a Vision Transformer to build the whole network architecture and enhance tiny ball localization through its powerful spatial mining ability. Secondly, we develop a Global Context Sampling Module (GCSM) to capture more powerful global features, thereby increasing the accuracy of tiny ball identification. Finally, we design a Context Enhancement Module (CEM) to enhance tiny ball semantics and achieve robust tracking performance. To promote research and development of tiny ball tracking, we establish a large-scale tiny ball tracking dataset called LaTBT. Specifically, LaTBT covers three types of tiny balls (badminton, tennis, and squash), offering more than 300 video sequences and over 223K annotations from 19 types of professional matches to address various tracking challenges in diverse and complex backgrounds. To our knowledge, LaTBT is the first large-scale dataset for tiny ball tracking. Experiments demonstrate that our baseline achieves state-of-the-art performance on our proposed benchmark dataset. The dataset and the algorithm code are available at https://github.com/Gi-gigi/TrackFormer.

Recommended citation: Yu J, Liu Y, Wei H, et al. Towards Highly Effective Moving Tiny Ball Tracking via Vision Transformer[C]//International Conference on Intelligent Computing. Singapore: Springer Nature Singapore, 2024: 368-379.
Download Paper

Mutual Information as Intrinsic Reward of Reinforcement Learning Agents for On-demand Ride Pooling

Published in The 23rd International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS 2024), 2024

The emergence of on-demand ride pooling services allows each vehicle to serve multiple passengers at a time, thus increasing drivers’ income and enabling passengers to travel at lower prices than taxi/car on-demand services (in which only one passenger can be assigned to a car at a time, as in UberX and Lyft). Although on-demand ride pooling brings many benefits, it requires a well-defined matching strategy to maximize the benefits for all parties (passengers, drivers, aggregation companies, and the environment), and the regional dispatching of vehicles has a significant impact on matching and revenue. Existing algorithms often consider only revenue maximization, which makes it difficult for requests with unusual distributions to get a ride. How to increase revenue while ensuring a reasonable assignment of requests poses a challenge to ride pooling service companies (aggregation companies). In this paper, we propose a framework for vehicle dispatching in ride pooling tasks, which splits the city into discrete dispatching regions and uses a reinforcement learning (RL) algorithm to dispatch vehicles in these regions. We also use the mutual information (MI) between the vehicle and order distributions as the intrinsic reward of the RL algorithm to improve the correlation between these distributions, thus ensuring that unusually distributed requests still have a chance of getting a ride. Experimental results on a real-world taxi dataset demonstrate that our framework can significantly increase revenue.
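The MI intrinsic reward rests on the standard discrete mutual-information formula; the sketch below is a hypothetical illustration, where the region discretization and the joint table of (vehicle region, order region) probabilities are assumptions rather than the paper's implementation:

```python
import numpy as np

def mutual_information(joint):
    """I(X; Y) in nats from a joint probability (or count) table p(x, y).

    Here X could index the region a vehicle is dispatched to and Y the
    region an order originates from; a higher MI means vehicle placement
    tracks the order distribution more closely.
    """
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()               # normalize counts to probabilities
    px = joint.sum(axis=1, keepdims=True)     # marginal p(x), column vector
    py = joint.sum(axis=0, keepdims=True)     # marginal p(y), row vector
    mask = joint > 0                          # skip zero cells to avoid log(0)
    return float((joint[mask] * np.log(joint[mask] / (px @ py)[mask])).sum())

# Independent distributions carry no information about each other ...
print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))  # 0.0
# ... while a deterministic pairing attains I = H(X) = log 2 ≈ 0.693.
print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))
```

Used as an intrinsic reward, this term is added to the extrinsic (revenue) reward, so the dispatch policy is pushed toward vehicle distributions that stay correlated with where orders actually appear.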

Recommended citation: Zhang X, Sun J, Gong C, et al. Mutual Information as Intrinsic Reward of Reinforcement Learning Agents for On-demand Ride Pooling[J]. arXiv preprint arXiv:2312.15195, 2023.
Download Paper