Dr. Junxuan Bai

白隽瑄

Ph.D. in Computer Science

Research Affiliate
Institute of Artificial Intelligence in Sports (IAIS)
Capital University of Physical Education and Sports (CUPES)

Social
Email
GitHub
DBLP

ORCID: https://orcid.org/0000-0002-7941-0584



Journal Papers

DGFormer: Dynamic graph transformer for 3D human pose estimation
Zhangmeng Chen, Ju Dai, Junxuan Bai, Junjun Pan
Pattern Recognition, 152, 2024.
[Info][code]

Despite significant progress in monocular 3D human pose estimation, the task still faces challenges due to self-occlusions and depth ambiguities. To tackle these issues, we propose a novel Dynamic Graph Transformer (DGFormer) that exploits local and global relationships between skeleton joints for pose estimation. Specifically, the proposed DGFormer consists of three core modules: a Transformer Encoder (TE), an immobile Graph Convolutional Network (GCN), and a dynamic GCN. The TE module leverages the self-attention mechanism to learn the complex global relationships among skeleton joints. The immobile GCN is responsible for capturing the local physical connections between human joints, while the dynamic GCN concentrates on learning sparse dynamic K-nearest-neighbor interactions according to different action poses. By adequately modeling the global long-range, local physical, and sparse dynamic dependencies of human joints, our method predicts 3D poses with lower errors on the Human3.6M and MPI-INF-3DHP datasets, outperforming recent state-of-the-art image-based methods. Furthermore, experiments on in-the-wild videos demonstrate the impressive generalization ability of our method. Code will be available at: https://github.com/czmmmm/DGFormer.
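For illustration only (a minimal PyTorch sketch with assumed tensor shapes and names, not the authors' implementation), a sparse dynamic K-nearest-neighbor joint graph of the kind described above could be built from per-joint features along these lines:

import torch

def dynamic_knn_adjacency(joint_feats: torch.Tensor, k: int = 3) -> torch.Tensor:
    # joint_feats: (B, J, C) feature vector per skeleton joint.
    # Returns a (B, J, J) binary adjacency where each joint is connected to
    # its k nearest joints in feature space, recomputed for every pose.
    dist = torch.cdist(joint_feats, joint_feats)                    # pairwise distances (B, J, J)
    knn = dist.topk(k + 1, dim=-1, largest=False).indices[..., 1:]  # drop the self-match (distance 0)
    adj = torch.zeros_like(dist)
    adj.scatter_(-1, knn, 1.0)                                      # mark the k neighbors per joint
    return adj

# Example: a batch of 2 poses with 17 joints and 64-dim features per joint
adjacency = dynamic_knn_adjacency(torch.randn(2, 17, 64), k=3)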

Foot-constrained spatial-temporal transformer for keyframe-based complex motion synthesis
Hao Li, Ju Dai, Rui Zeng, Junxuan Bai, Zhangmeng Chen, Junjun Pan
Computer Animation and Virtual Worlds, e2217, 2023.
[Info]

Keyframe-based motion synthesis plays a significant role in games and movies. Existing methods for complex motion synthesis often require secondary post-processing to eliminate foot sliding and yield satisfactory motions. In this paper, we analyze the cause of the sliding issue and attribute it to the mismatch between the root trajectory and the motion postures. To address the problem, we propose a novel end-to-end Spatial-Temporal transformer network conditioned on foot contact information for high-quality keyframe-based motion synthesis. Specifically, our model mainly comprises a spatial-temporal transformer encoder and two decoders to learn motion sequence features and predict motion postures and foot contact states. A novel constrained embedding, which consists of keyframes and foot contact constraints, is incorporated into the model to facilitate network learning from diversified control knowledge. To generate a root trajectory that matches the motion postures, we design a differentiable root trajectory reconstruction algorithm that constructs the root trajectory from the decoder outputs. Qualitative and quantitative experiments on the public LaFAN1, Dance, and Martial Arts datasets demonstrate the superiority of our method in generating high-quality complex motions compared with state-of-the-art methods.
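As a rough sketch of the general idea behind differentiable root trajectory reconstruction (assumed tensor shapes, and omitting the paper's use of predicted foot contact states), the root positions can be obtained by cumulatively integrating per-frame root velocities so that trajectory errors back-propagate into the decoders:

import torch

def reconstruct_root_trajectory(root_velocity: torch.Tensor,
                                initial_root: torch.Tensor) -> torch.Tensor:
    # root_velocity: (B, T, 3) predicted root displacement per frame
    # initial_root:  (B, 3) root position at the starting keyframe
    # torch.cumsum keeps the whole operation differentiable, so a loss on the
    # reconstructed trajectory can flow back into the posture decoder.
    return initial_root.unsqueeze(1) + torch.cumsum(root_velocity, dim=1)

trajectory = reconstruct_root_trajectory(torch.randn(1, 60, 3), torch.zeros(1, 3))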

KD-Former: Kinematic and dynamic coupled transformer network for 3D human motion prediction
Ju Dai, Hao Li, Rui Zeng, Junxuan Bai, Feng Zhou, Junjun Pan
Pattern Recognition, 143, 2023.
[Info][code]

Recent studies have made remarkable progress on 3D human motion prediction by describing motion with kinematic knowledge. However, kinematics only considers the 3D positions or rotations of human skeletons and fails to reveal the physical characteristics of human motion. Motion dynamics reflects the forces between joints and explicitly encodes the skeleton topology, yet it is rarely exploited in motion prediction. In this paper, we propose the Kinematic and Dynamic coupled transFormer (KD-Former), which incorporates dynamics with kinematics to learn powerful features for high-fidelity motion prediction. Specifically, we first formulate a reduced-order dynamic model of the human body to calculate the forces at all joints. We then construct a non-autoregressive encoder-decoder framework based on the transformer structure. The encoder involves a kinematic encoder and a dynamic encoder, which are respectively responsible for extracting kinematic and dynamic features from the given history sequences via a spatial transformer and a temporal transformer. Future query sequences are decoded in parallel in the decoder by leveraging the encoded kinematic and dynamic information of the history sequences. Experiments on the Human3.6M and CMU MoCap benchmarks verify the effectiveness and superiority of our method. Code will be available at: https://github.com/wslh852/KD-Former.git.

Diverse Dance Synthesis via Keyframes with Transformer Controllers
Junjun Pan, Siyuan Wang, Junxuan Bai, Ju Dai
Computer Graphics Forum, 40(7): 71-83, 2021.
[pdf][slides][presentation][code]

Existing keyframe-based motion synthesis mainly focuses on the generation of cyclic actions or short-term motion, such as walking, running, and transitions between close postures. However, these methods significantly degrade the naturalness and diversity of the synthesized motion when dealing with complex and impromptu movements, e.g., dance performance and martial arts. In addition, current research lacks fine-grained control over the generated motion, which is essential for intelligent human-computer interaction and animation creation. In this paper, we propose a novel keyframe-based motion generation network based on multiple constraints, which can achieve diverse dance synthesis via learned knowledge. Specifically, the algorithm is formulated on top of the recurrent neural network (RNN) and the Transformer architecture. The backbone of our network is a hierarchical RNN module composed of two long short-term memory (LSTM) units: the first LSTM embeds the posture information of the historical frames into a latent space, and the second predicts the human posture of the next frame. Moreover, our framework contains two Transformer-based controllers, which model the constraints of the root trajectory and the velocity factor respectively, so as to better utilize the temporal context of the frames and achieve fine-grained motion control. We verify the proposed approach on a dataset containing a wide range of contemporary dance. The results of three quantitative analyses validate the superiority of our algorithm. The video and qualitative experimental results demonstrate that the complex motion sequences generated by our algorithm achieve diverse and smooth motion transitions between keyframes, even for long-term synthesis.

EmoDescriptor: A hybrid feature for emotional classification in dance movements
Junxuan Bai, Rong Dai, Ju Dai, and Junjun Pan
Computer Animation and Virtual Worlds, 32(6), 2021.
[pdf] [demo]

Similar to language and music, dance performances provide an effective way to express human emotions. With the abundance of motion capture data, content-based motion retrieval and classification have been extensively investigated. Although researchers have attempted to interpret body language in terms of human emotions, progress is limited by the scarcity of 3D motion databases annotated with emotion labels. This article proposes a hybrid feature for emotional classification in dance performances. The hybrid feature is composed of an explicit feature and a deep feature. The explicit feature is calculated based on Laban movement analysis, which considers the body, effort, shape, and space properties. The deep feature is obtained from the latent representation of a 1D convolutional autoencoder. Finally, we present an elaborate feature fusion network to attain a hybrid feature that is almost linearly separable. Extensive experiments demonstrate that our hybrid feature is superior to the separate features for emotional classification in dance performances.
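For readers unfamiliar with the deep-feature half, a minimal 1D convolutional autoencoder over a pose-feature sequence might look as follows (an illustrative sketch with assumed channel counts, e.g. 21 joints x 3 coordinates = 63 channels, not the paper's exact network):

import torch
import torch.nn as nn

class Conv1DAutoencoder(nn.Module):
    # Toy 1D convolutional autoencoder over a motion sequence shaped (B, C, T).
    # The bottleneck activation serves as the learned "deep feature".
    def __init__(self, in_channels: int = 63, latent_channels: int = 16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(in_channels, 32, kernel_size=5, stride=2, padding=2),
            nn.ReLU(),
            nn.Conv1d(32, latent_channels, kernel_size=5, stride=2, padding=2),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(latent_channels, 32, kernel_size=5, stride=2,
                               padding=2, output_padding=1),
            nn.ReLU(),
            nn.ConvTranspose1d(32, in_channels, kernel_size=5, stride=2,
                               padding=2, output_padding=1),
        )

    def forward(self, x):
        z = self.encoder(x)              # deep feature: (B, latent_channels, T/4)
        return self.decoder(z), z

# Training minimizes reconstruction error so the latent code summarizes the motion
model = Conv1DAutoencoder()
x = torch.randn(4, 63, 128)
x_hat, deep_feature = model(x)
loss = nn.functional.mse_loss(x_hat, x)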

Interactive animation generation of virtual characters using single RGB-D camera
Ning Kang, Junxuan Bai (co-first author), Junjun Pan, and Hong Qin
The Visual Computer, 35(6-8): 849-860, 2019.
[pdf] [presentation]

The rapid creation of 3D character animation with commodity devices plays an important role in enriching visual content in virtual reality. This paper concentrates on addressing the challenges of current motion imitation for the human body. We develop an interactive framework for stable motion capture and animation generation based on a single Kinect device. In particular, we focus our research efforts on two cases: (1) the participant is facing the camera; or (2) the participant is turning around or side-facing the camera. With existing methods, the camera obtains only a profile view of the body in the latter case, which frequently leads to unsatisfactory results or even failure due to occlusion. To reduce the artifacts that appear in the side view, we design a mechanism that refines the movement of the human body by integrating an adaptive filter. After specifying the corresponding joints between the participant and the virtual character, the captured motion can be retargeted in a quaternion-based manner. To further improve the animation quality, inverse kinematics is brought into our framework to constrain the target positions. A large variety of motions and characters have been tested to validate the performance of our framework. Experiments show that our method can be applied to real-time applications, such as physical therapy and fitness training.
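As an illustrative sketch only (hypothetical joint names, and a rest-pose offset composition that is an assumption rather than the paper's formulation), quaternion-based retargeting amounts to transferring each captured local joint rotation to its mapped character joint:

import numpy as np

def quat_multiply(q1, q2):
    # Hamilton product of two (w, x, y, z) quaternions.
    w1, x1, y1, z1 = q1
    w2, x2, y2, z2 = q2
    return np.array([
        w1*w2 - x1*x2 - y1*y2 - z1*z2,
        w1*x2 + x1*w2 + y1*z2 - z1*y2,
        w1*y2 - x1*z2 + y1*w2 + z1*x2,
        w1*z2 + x1*y2 - y1*x2 + z1*w2,
    ])

def retarget_joint(q_source, q_rest_offset):
    # Apply the captured local rotation on top of the target character's
    # rest-pose offset for the mapped joint, then re-normalize.
    q = quat_multiply(q_rest_offset, q_source)
    return q / np.linalg.norm(q)

# Hypothetical mapping: Kinect "ShoulderLeft" drives the rig joint "upperarm_l"
q_captured = np.array([0.92, 0.0, 0.39, 0.0])
q_offset = np.array([1.0, 0.0, 0.0, 0.0])   # identity offset, for illustration only
print(retarget_joint(q_captured, q_offset))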

Novel metaballs-driven approach with dynamic constraints for character articulation
Junxuan Bai, Junjun Pan, Yuhan Yang, and Hong Qin
SCIENCE CHINA Information Sciences, 61(9): 094101:1-094101:3, 2018.
[pdf] [slides] [demo]

Skinning techniques are essential for character articulation in 3D computer animation. Currently, skeleton-based methods are widely used in the animation industry for their simplicity and efficiency, especially linear blend skinning (LBS) and dual quaternion skinning (DQS). However, owing to the lack of an interior volumetric representation, they suffer from joint collapse, candy-wrapper, and bulging artifacts.
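For context, a minimal NumPy sketch of the classical LBS baseline mentioned above (each deformed vertex is the weighted blend of its bone-transformed positions); this illustrates the standard formulation, not the proposed metaballs-driven method:

import numpy as np

def linear_blend_skinning(rest_verts, weights, joint_transforms):
    # rest_verts:       (V, 3) vertex positions in the rest pose
    # weights:          (V, J) skinning weights, each row summing to 1
    # joint_transforms: (J, 4, 4) bone transforms from rest pose to current pose
    # Returns (V, 3) deformed vertices: v'_i = sum_j w_ij * T_j * v_i
    V = rest_verts.shape[0]
    homo = np.concatenate([rest_verts, np.ones((V, 1))], axis=1)   # (V, 4)
    per_joint = np.einsum('jab,vb->vja', joint_transforms, homo)   # (V, J, 4)
    blended = np.einsum('vj,vja->va', weights, per_joint)          # (V, 4)
    return blended[:, :3]

# Identity bone transforms leave the mesh in its rest pose
verts = np.random.rand(100, 3)
w = np.full((100, 2), 0.5)
T = np.stack([np.eye(4), np.eye(4)])
assert np.allclose(linear_blend_skinning(verts, w, T), verts)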

Essential techniques for laparoscopic surgery simulation
Kun Qian, Junxuan Bai, Xiaosong Yang, Junjun Pan, Jian-Jun Zhang
Computer Animation and Virtual Worlds, 28(2), 2017.

Laparoscopic surgery is a complex minimally invasive operation with a long learning curve before new trainees gain adequate experience to become qualified surgeons. With the development of virtual reality technology, virtual reality-based surgery simulation is playing an increasingly important role in surgical training. The simulation of laparoscopic surgery is challenging because it involves large non-linear soft tissue deformation, frequent surgical tool interactions, and a complex anatomical environment. Current research mostly focuses on very specific topics (such as deformation and collision detection) rather than a consistent and efficient framework. The direct use of existing methods cannot achieve high visual/haptic quality and a satisfactory refresh rate at the same time, especially for complex surgery simulation. In this paper, we propose a set of tailored key technologies for laparoscopic surgery simulation, ranging from the simulation of soft tissues with different properties, to the interactions between surgical tools and soft tissues, to the rendering of a complex anatomical environment. Compared with current methods, our tailored algorithms aim to improve performance in terms of accuracy, stability, and efficiency. We also abstract and design a set of intuitive parameters that provide developers with high flexibility to develop their own simulators.

Real-time haptic manipulation and cutting of hybrid soft tissue models by extended position-based dynamics
Junjun Pan, Junxuan Bai, Xin Zhao, Aimin Hao, and Hong Qin
Computer Animation and Virtual Worlds, 26(3-4): 321-335, 2015.
[pdf]

This paper systematically describes an interactive dissection approach for hybrid soft tissue models governed by extended position-based dynamics. Our framework makes use of a hybrid geometric model comprising both surface and volumetric meshes. A fine surface triangular mesh with high-precision geometric structure and detailed texture is employed to represent the exterior structure of the soft tissue models. Meanwhile, the interior structure of the soft tissues is constructed with a coarser tetrahedral mesh, which is also employed as the physical model participating in the dynamic simulation. The reduced detail of the interior structure effectively lowers the computational cost during simulation. For physical deformation, we design and implement an extended position-based dynamics approach that supports topology modification and material heterogeneity of soft tissue. Besides stretching and volume conservation constraints, it enforces energy-preserving constraints, which take the different spring stiffnesses of the material into account and improve the visual quality of the soft tissue deformation. Furthermore, we develop a mechanical model of dissection behavior and analyze the system stability. The experimental results show that our approach affords real-time and robust cutting without sacrificing realistic visual performance. Our novel dissection technique has already been integrated into a virtual reality-based laparoscopic surgery simulator.
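For context, the stretching constraint that position-based dynamics iterates over each mesh edge can be sketched as below (the classical PBD projection step; the paper's extended energy-preserving and volume-conservation constraints are not shown):

import numpy as np

def project_stretch_constraint(p1, p2, w1, w2, rest_length, stiffness=1.0):
    # p1, p2: particle positions (3,); w1, w2: inverse masses
    # Moves both particles along the edge so its length approaches rest_length,
    # weighted by inverse mass, as in standard position-based dynamics.
    delta = p1 - p2
    dist = np.linalg.norm(delta)
    if dist < 1e-9 or (w1 + w2) == 0.0:
        return p1, p2
    correction = stiffness * (dist - rest_length) / (dist * (w1 + w2)) * delta
    return p1 - w1 * correction, p2 + w2 * correction

# One Gauss-Seidel pass would apply this to every edge, several times per step
p1, p2 = np.array([0.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.5])
p1, p2 = project_stretch_constraint(p1, p2, 1.0, 1.0, rest_length=1.0)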


Conference Papers

Attribute-Decomposable Motion Compression Network for 3D MoCap Data
Zengming Chen, Junxuan Bai (co-first author), Ju Dai
Data Compression Conference (DCC) 2022, Full paper, 2022
[slides]

Motion capture (MoCap) data is a fundamental asset for digital entertainment. The growing number of 3D applications makes MoCap data compression unprecedentedly important. In this paper, we propose an end-to-end attribute-decomposable motion compression network based on an autoencoder architecture. Specifically, the algorithm consists of an LSTM-based encoder-decoder for compression and decompression. The encoder module decomposes human motion into multiple uncorrelated semantic attributes, including action content, arm space, and motion mirror. The decoder module is responsible for reconstructing vivid motion from the decomposed high-level characteristics. Our method is computationally efficient and has powerful compression ability, outperforming state-of-the-art methods in terms of compression rate and compression error. Furthermore, our model can generate new motion data from a combination of different motion attributes, a capability that existing methods lack.
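As a toy illustration of attribute decomposition and recombination (assumed dimensions and a much-simplified codec, not the paper's network), the latent code can be split into per-attribute sub-codes that are mixed across clips at decode time:

import torch
import torch.nn as nn

class AttributeDecomposedCodec(nn.Module):
    # Toy LSTM codec whose latent code is split into attribute sub-codes,
    # so codes from different clips can be recombined when decoding.
    def __init__(self, pose_dim=63, code_dim=48, n_attributes=3):
        super().__init__()
        assert code_dim % n_attributes == 0
        self.encoder = nn.LSTM(pose_dim, code_dim, batch_first=True)
        self.decoder = nn.LSTM(code_dim, pose_dim, batch_first=True)
        self.n_attributes = n_attributes

    def encode(self, motion):                              # motion: (B, T, pose_dim)
        _, (h, _) = self.encoder(motion)
        return h[-1].chunk(self.n_attributes, dim=-1)      # tuple of attribute codes

    def decode(self, attribute_codes, n_frames):
        code = torch.cat(attribute_codes, dim=-1)          # (B, code_dim)
        return self.decoder(code.unsqueeze(1).repeat(1, n_frames, 1))[0]

codec = AttributeDecomposedCodec()
a = codec.encode(torch.randn(1, 120, 63))                  # codes from clip A
b = codec.encode(torch.randn(1, 120, 63))                  # codes from clip B
mixed = codec.decode((a[0], b[1], a[2]), n_frames=120)     # recombine attributes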

3D-CariNet: End-to-end 3D caricature generation from natural face images with differentiable renderer
Meijia Huang, Ju Dai, Junjun Pan, Junxuan Bai, Hong Qin
Pacific Graphics 2021, Short paper, 2021
[pdf]

Caricatures are an artistic representation of human faces used to express satire and humor. Caricature generation for human faces is a research hotspot in computer graphics. Previous work mainly focuses on 2D caricature generation from face photos or 3D caricature reconstruction from caricature images. In this paper, we propose a novel end-to-end method that directly generates personalized 3D caricatures from a single natural face image. It can create not only exaggerated geometric shapes but also heterogeneous texture styles. First, we construct a synthetic dataset containing matched data pairs composed of face photos, caricature images, and 3D caricatures. Then, we design a graph convolutional autoencoder that builds a non-linear colored mesh model to learn the shape and texture of 3D caricatures. To make the network end-to-end trainable, we incorporate a differentiable renderer to render the 3D caricatures back into caricature images. Experiments demonstrate that our method can generate 3D caricatures with various texture styles from face images while maintaining personality characteristics.

Human motion synthesis and control via contextual manifold embedding
Rui Zeng, Ju Dai, Junxuan Bai, Junjun Pan, Hong Qin
Pacific Graphics 2021, Short paper, 2021
[pdf]

Modeling motion dynamics for precise and rapid control with deterministic data-driven models is challenging due to the natural randomness of human motion. To address this, we propose a novel framework for continuous motion control based on probabilistic latent variable models. Control is implemented by recurrently querying between historical and target motion states rather than exact motion data. Our model takes a conditional encoder-decoder form in two stages. First, we utilize the Gaussian Process Latent Variable Model (GPLVM) to project motion poses onto a compact latent manifold. Motion states, such as walking phase and forward velocity, can be clearly recognized by analysis on the manifold. Second, taking the manifold as a prior, a recurrent neural network (RNN) encoder makes temporal latent predictions from the previous and control states. An attention module then morphs the prediction by measuring latent similarities to the control states and predicted states, thus dynamically preserving contextual consistency. Finally, the GP decoder reconstructs the motion states back into motion frames. Experiments on walking datasets show that our model is able to maintain motion states autoregressively while performing rapid and smooth transitions under control.

Flower factory: a component-based approach for rapid flower modeling
Siyuan Wang, Junjun Pan, Junxuan Bai, Jinglei Wang
ISMAR 2020: 12-23, Conference paper, 2020
[pdf] [presentation]

The rapid modeling of 3D objects provides an effective way to enrich digital content and is one of the essential tasks in VR/AR research. Flowers are frequently used in real-time applications, such as video games and VR/AR scenes. However, creating realistic flowers with existing 3D modeling software is complicated and time-consuming for designers. Moreover, it is difficult to create imaginary and surreal flowers, which might be more interesting and attractive for artists and game players. In this paper, we propose a component-based framework for rapid flower modeling, called Flower Factory. Flowers are assembled from different components, e.g., petals, stamens, receptacles, and leaves. The shapes of these components are created using simple primitives such as points and splines. After the shapes of the models are determined, textures are synthesized automatically based on a predefined mask, following a number of rules derived from real flowers. The whole modeling process is controlled by several parameters that describe the physical attributes of the flowers. Our technique is capable of producing a variety of flowers rapidly; even novices without any modeling skills are able to control and model 3D flowers. Furthermore, owing to its low computational cost, the developed system can be integrated into a lightweight smartphone application.

Real-time animation and motion retargeting of virtual characters based on single RGB-D camera
Ning Kang, Junxuan Bai, Junjun Pan, Hong Qin
VR 2019: 1006-1007, Poster, 2019

The rapid generation and flexible reuse of character animation with commodity devices are of significant importance to rich digital content production in virtual reality. This paper aims to handle the challenges of current motion imitation for the human body in indoor scenes (e.g., fitness training). We develop a real-time system based on a single Kinect device that captures stable human motion and retargets it to virtual characters. A large variety of motions and characters are tested to validate the efficiency and effectiveness of our system.

Virtual reality based laparoscopic surgery simulation
Kun Qian, Junxuan Bai, Xiaosong Yang, Junjun Pan, Jian-Jun Zhang
VRST 2015: 321-335, Conference paper, 2015

With the development of computer graphics and haptic devices, training surgeons with virtual reality technology has proven to be very effective in surgery simulation. Many successful simulators have been deployed for training medical students. However, due to various unsolved technical issues, laparoscopic surgery simulation has not been widely used. Such issues include the modeling of complex anatomical structures, large soft tissue deformation, frequent surgical tool interactions, and the rendering of complex materials under headlight illumination. A successful laparoscopic surgery simulator should integrate all these required components in a balanced and efficient manner to achieve both visual/haptic quality and a satisfactory refresh rate. In this paper, we propose an efficient framework integrating a set of specially tailored and designed techniques, ranging from deformation simulation and collision detection to soft tissue dissection and rendering. We optimize all components based on the actual requirements of laparoscopic surgery in order to achieve an improved overall performance in fidelity and response speed.

Dissection of hybrid soft tissue models using position-based dynamics
Junjun Pan, Junxuan Bai, Xin Zhao, Aimin Hao, and Hong Qin
VRST 2014: 219-220, Poster, 2014

This paper describes an interactive dissection approach for hybrid soft tissue models governed by position-based dynamics. Our framework makes use of a hybrid geometric model comprising both surface and volumetric meshes. A fine surface triangular mesh is used to represent the exterior structure of the soft tissue models. Meanwhile, the interior structure of the soft tissues is constructed with coarser tetrahedral meshes, which are also employed as physical models participating in the dynamic simulation. The reduced detail of the interior structure effectively lowers the computational cost of deformation and geometric subdivision during dissection. For physical deformation, we design and implement a position-based dynamics approach that supports topology modification and enforces a volume-preserving constraint. Experimental results show that this hybrid dissection method affords real-time and robust cutting simulation without sacrificing realistic visual performance.


Engineering

Displaying platform for crude oil logging software
Zhen Wang, Junxuan Bai, Jingxue Li
China Oilfield Services Limited (COSL), 2012-2013

The goal of this project was to visualize logging data on a tablet computer.