This post collects papers that appear interesting but that I have not yet read carefully.

2023-12-12

Joint Audio and Speech Understanding
github_link

@inproceedings{gong_ltuas,
  title={Joint Audio and Speech Understanding},
  author={Gong, Yuan and Liu, Alexander H and Luo, Hongyin and Karlinsky, Leonid and Glass, James},
  year={2023},
  booktitle={2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)},
}
Details Authors: Yuan Gong, Alexander H. Liu, Hongyin Luo, Leonid Karlinsky, James Glass
Date: 25 Sep 2023
#Citations:
The figure is from https://github.com/YuanGongND/ltu. All rights reserved to its owner.

Efficiently Modeling Long Sequences with Structured State Spaces

@article{gu2021efficiently,
  title={Efficiently modeling long sequences with structured state spaces},
  author={Gu, Albert and Goel, Karan and R{\'e}, Christopher},
  journal={arXiv preprint arXiv:2111.00396},
  year={2021}
}
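
The core object in this paper is a linear state-space model run over a long sequence. As a reminder of the mechanism (not the paper's actual contribution, which is the structured HiPPO-based parameterization of `A` and computing the scan as a convolution), here is a minimal sketch of the discretized SSM recurrence; all names and values are illustrative:

```python
import numpy as np

def ssm_scan(A_bar, B_bar, C, u):
    """Sequential scan of a discretized linear state-space model:
        x_k = A_bar @ x_{k-1} + B_bar * u_k
        y_k = C @ x_k
    S4 evaluates this same map efficiently as a long convolution."""
    x = np.zeros(A_bar.shape[0])
    ys = []
    for u_k in u:
        x = A_bar @ x + B_bar * u_k
        ys.append(C @ x)
    return np.array(ys)

# Toy example: state size N=2, scalar input and output.
A_bar = np.array([[0.9, 0.1], [0.0, 0.8]])
B_bar = np.array([1.0, 0.5])
C = np.array([0.3, 0.7])
u = np.sin(np.arange(16) * 0.5)
y = ssm_scan(A_bar, B_bar, C, u)   # one output per input step
```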

2024-01-22

Multi-scale Transformers with Adaptive Pathways for Time Series Forecasting
github_link

Category: transformer variation, multi-scale, time series

@inproceedings{chen2024pathformer,
title = "Multi-scale Transformers with Adaptive Pathways for Time Series Forecasting",
abstract = "Transformer-based models have achieved significant success in time series forecasting. Existing methods mainly model time series from limited or fixed scales, making it challenging to capture different characteristics spanning various scales. In this paper, we propose multi-scale transformers with adaptive pathways (Pathformer). The proposed Transformer integrates both temporal resolution and temporal distance for multi-scale modeling. Multi-scale division divides the time series into different temporal resolutions using patches of various sizes. Based on the division of each scale, dual attention is performed over these patches to capture global correlations and local details as temporal dependencies. We further enrich the multi-scale transformer with adaptive pathways, which adaptively adjust the multi-scale modeling process based on the varying temporal dynamics in the input time series, improving the prediction accuracy and generalization of Pathformer. Extensive experiments on nine real-world datasets demonstrate that Pathformer not only achieves state-of-the-art performance by surpassing all current models but also exhibits stronger generalization abilities under various transfer scenarios.",
author = "Peng Chen and Yingying Zhang and Yunyao Cheng and Yang Shu and Yihang Wang and Qingsong Wen and Bin Yang and Chenjuan Guo",
year = "2024",
month = jan,
day = "16",
language = "English",
booktitle = "International Conference on Learning Representations",
}
Details Authors: Peng Chen, Yingying Zhang, Yunyao Cheng, Yang Shu, Yihang Wang, Qingsong Wen, Bin Yang, Chenjuan Guo
Date: ICLR 2024
#Citations:
The figure is from https://openreview.net/pdf?id=lJkOCMP2aW. All rights reserved to its owner.
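
The abstract's "multi-scale division" step is just partitioning the same series into non-overlapping patches at several temporal resolutions, with dual attention then applied per scale. A minimal sketch of the division alone (the helper name and the remainder-dropping policy are my assumptions, not the paper's):

```python
import numpy as np

def multiscale_patches(series, patch_sizes):
    """Divide a 1-D series into non-overlapping patches at several
    temporal resolutions. Any trailing remainder that does not fill a
    whole patch is dropped for simplicity."""
    scales = {}
    for p in patch_sizes:
        n = len(series) // p                     # patches at this scale
        scales[p] = series[: n * p].reshape(n, p)
    return scales

series = np.arange(24, dtype=float)
scales = multiscale_patches(series, patch_sizes=[4, 8, 12])
# scales[4] holds six patches of length 4; scales[8] three of length 8.
```

In the paper, the adaptive-pathways router then weights these scales per input, rather than using a fixed set.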



Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model
github_link

Category: mamba for vision, representation learning, efficient model

@article{zhu2024vision,
  title={Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model},
  abstract = "Recently the state space models (SSMs) with efficient hardware-aware designs, i.e., Mamba, have shown great potential for long sequence modeling. Building efficient and generic vision backbones purely upon SSMs is an appealing direction. However, representing visual data is challenging for SSMs due to the position-sensitivity of visual data and the requirement of global context for visual understanding. In this paper, we show that the reliance of visual representation learning on self-attention is not necessary and propose a new generic vision backbone with bidirectional Mamba blocks (Vim), which marks the image sequences with position embeddings and compresses the visual representation with bidirectional state space models. On ImageNet classification, COCO object detection, and ADE20k semantic segmentation tasks, Vim achieves higher performance compared to well-established vision transformers like DeiT, while also demonstrating significantly improved computation & memory efficiency. For example, Vim is 2.8× faster than DeiT and saves 86.8% GPU memory when performing batch inference to extract features on images with a resolution of 1248×1248. The results demonstrate that Vim is capable of overcoming the computation & memory constraints on performing Transformer-style understanding for high-resolution images and it has great potential to become the next-generation backbone for vision foundation models. ",
  author={Zhu, Lianghui and Liao, Bencheng and Zhang, Qian and Wang, Xinlong and Liu, Wenyu and Wang, Xinggang},
  journal={arXiv preprint arXiv:2401.09417},
  year={2024}
}
Details Authors: Lianghui Zhu, Bencheng Liao, Qian Zhang, Xinlong Wang, Wenyu Liu, Xinggang Wang
Date: Jan 2024
#Citations:
The figure is from the paper. All rights reserved to its owner.
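
The bidirectional idea in the abstract is that a causal SSM scan only sees past tokens, so Vim runs one scan forward and one backward over the patch sequence and combines them to recover global context. A toy sketch of that wrapper, with an arbitrary leaky-integration update standing in for the Mamba block (all names here are illustrative):

```python
import numpy as np

def bidirectional_scan(tokens, step):
    """Run a causal per-token update `step(hidden, token)` over the
    sequence in both directions and sum the aligned hidden states,
    mirroring the bidirectional-SSM structure in Vim."""
    fwd, h = [], np.zeros_like(tokens[0])
    for t in tokens:                 # left-to-right pass
        h = step(h, t)
        fwd.append(h)
    bwd, h = [], np.zeros_like(tokens[0])
    for t in reversed(tokens):       # right-to-left pass
        h = step(h, t)
        bwd.append(h)
    bwd.reverse()                    # re-align with forward order
    return [f + b for f, b in zip(fwd, bwd)]

step = lambda h, t: 0.9 * h + 0.1 * t      # stand-in for a Mamba block
tokens = [np.ones(3) * i for i in range(5)]  # 5 patch tokens, dim 3
out = bidirectional_scan(tokens, step)
```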



VMamba: Visual State Space Model
github_link

Category: mamba for vision, representation learning, efficient model

@article{liu2024vmamba,
  title={VMamba: Visual State Space Model},
  author={Liu, Yue and Tian, Yunjie and Zhao, Yuzhong and Yu, Hongtian and Xie, Lingxi and Wang, Yaowei and Ye, Qixiang and Liu, Yunfan},
  journal={arXiv preprint arXiv:2401.10166},
  year={2024}
}
Details Authors: Yue Liu, Yunjie Tian, Yuzhong Zhao, Hongtian Yu, Lingxi Xie, Yaowei Wang, Qixiang Ye, Yunfan Liu
Date: Jan 2024
#Citations:
The figure is from the paper. All rights reserved to its owner.



BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
github_link

Category: vision language model

@article{li2023blip,
  title={{BLIP-2}: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models},
  author={Li, Junnan and Li, Dongxu and Savarese, Silvio and Hoi, Steven},
  journal={arXiv preprint arXiv:2301.12597},
  year={2023}
}
Details Authors: Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi
Date: Jun 2023
#Citations:
The figure is from the paper. All rights reserved to its owner.


2024-02-04

OLMo: Accelerating the Science of Language Models
github_link

Category: Large language model, open-source

Technical report
Details Authors:
Date: Feb 2024
#Citations:
The figure is from https://github.com/allenai/OLMo. All rights reserved to its owner.