
MD Khalequzzaman Chowdhury Sayem

Researcher at UNIST Vision & Learning Lab
3D Vision · Vision-Language Models · Geometry-Grounded Multimodal Reasoning

Homepage · Google Scholar · LinkedIn · Email

CVPR 2026 · AAAI 2025 · 3D Vision · Vision-Language Models · Hand-Object Interaction

About Me

I am a researcher at the Vision & Learning Lab at the Ulsan National Institute of Science & Technology (UNIST), South Korea, supervised by Prof. Seungryul Baek and Prof. Binod Bhattarai.

My work focuses on multimodal learning, vision-language models, and geometry-grounded reasoning in visually complex environments, especially for articulated hands and hand-object interaction.

I am interested in building multimodal systems that reason more reliably about 3D structure, spatial relationships, and fine-grained interactions, with longer-term goals in grounded world models and embodied multimodal intelligence.

Research Snapshot

Current Directions

  • Reliable multimodal reasoning with explicit geometric supervision
  • Fine-grained understanding of hands and hand-object interactions
  • Scalable benchmarks for spatial reasoning in VLMs
  • Interpretable and grounded multimodal foundation models

Research Areas

  • 3D Vision
  • Vision-Language Models
  • Multimodal Learning
  • Hand Pose and Hand-Object Interaction
  • Geometry-Grounded Reasoning
  • Embodied AI and World Models

Featured Publications

HandVQA

CVPR 2026 · 1.6M+ VQA Pairs · Spatial Reasoning

HandVQA: Diagnosing and Improving Fine-Grained Spatial Reasoning about Hands in Vision-Language Models

  • Large-scale benchmark grounded in 3D hand geometry
  • Covers joint angles, distances, and relative spatial relations
  • Shows explicit 3D supervision improves reliability and cross-task generalization

Project Page

QORT-Former

AAAI 2025 · 53.5 FPS · Real-Time 3D Pose

QORT-Former: Query-Optimized Real-Time Transformer for Understanding Two Hands Manipulating Objects

  • Real-time Transformer for two-hand and object 3D pose estimation
  • Balances efficiency and accuracy for practical deployment
  • Outperforms prior methods on H2O and FPHA while running in real time

Project Page · Paper · Code

Selected Repositories

| Repository | Description |
| --- | --- |
| HandVQA | Fine-grained spatial reasoning about hands in vision-language models |
| QORT-Former | Real-time Transformer for understanding two hands manipulating objects |
| 4d-editing | 4D Instruct-GS2GS for extending semantic editing to dynamic 3D scenes |
| Parallel-bandit | Parallelized contextual bandit algorithms for news recommendation |

Connect

I am always open to research discussions, collaborations, and ideas around 3D vision, multimodal learning, and vision-language reasoning.

Pinned

  1. handvqa (Public)

    [CVPR 2026] HandVQA: Diagnosing and Improving Fine-Grained Spatial Reasoning about Hands in Vision-Language Models

    Python · 5 stars · 1 fork

  2. QORT-Former (Public)

    [AAAI 2025] Official implementation of QORT-Former: Query-optimized Real-time Transformer for Understanding Two Hands Manipulating Objects

    JavaScript · 12 stars

  3. Parallel-bandit (Public)

    Analyzes the performance trade-offs of parallelized contextual bandit algorithms (LinUCB, Thompson Sampling) in news recommendation, using the Yahoo! R6A dataset.

    Jupyter Notebook · 1 star

  4. eldor-fozilov/4d-editing (Public)

    4D Instruct-GS2GS: Extending Semantic Editing to Dynamic 3D Scenes

    Jupyter Notebook