Projects

Development and Deployment of Audit Large Language Model

Dec 2024 ~ Apr 2025
AITA and Henan Provincial Audit Department, Henan, China

The primary objective of this project is to design, fine-tune, and deploy a secure, internal-facing Audit Large Language Model (ALLM) to support auditors at the Henan Provincial Audit Department. Built on the Qwen foundation model, the ALLM is intended to deliver intelligent support for audit Q&A, regulatory compliance, and table-data analysis, improving audit efficiency and accuracy. In this project, I worked with Lutong Zhang and Haonan Zhang on the following tasks:

  • Data Collection: Aggregating large volumes of audit documentation, historical audit reports, government financial regulations (national and provincial level), accounting standards (e.g., Chinese GAAP), policy directives, compliance manuals, and best practice guidelines.
  • Data Processing: Implementing data processing pipelines for heterogeneous audit sources, including text extraction (PaddleOCR with regex-based digit correction for scanned documents, pdfplumber for layout-aware PDF parsing, and python-docx for DOCX files); spaCy + FinBERT for entity normalization of regulations and financial terms; DeepSeek-V2 for document categorization and a UIE model for sensitivity tagging; and LayoutLMv3 for context-aware semantic segmentation that preserves audit logic.
  • Knowledge Vectorization: Using BGE-M3 to embed textual knowledge as high-dimensional vectors, indexed in Milvus with metadata filters, enabling efficient semantic search and retrieval.
  • Synthetic QA Generation: Using Qwen-72B-Chat to produce question-answer pairs reflecting common audit scenarios, enriching our fine-tuning dataset.
  • Expert Annotation: Collaborating with senior auditors to annotate real audit cases, identifying key entities, risks, compliance issues, and generating relevant queries/responses.
  • Instruction Tuning: Formulating diverse instructions covering typical audit tasks (e.g., "Identify potential fraud indicators in this transaction log," "Explain the relevant regulation for expense X," "Summarize the key findings from report Y").
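
As an illustration of the regex-based digit correction mentioned under Data Processing, a minimal sketch follows. The confusion table, the token pattern, and the function name `correct_digits` are assumptions made for this example, not the project's actual rules; the idea is only to rewrite letters that OCR commonly confuses with digits when they appear inside a token that already looks numeric.

```python
import re

# Hypothetical confusion table: letters that OCR output often contains
# in place of digits (assumed for illustration, not the project's table).
_CONFUSIONS = str.maketrans(
    {"O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"}
)

# A token counts as numeric if it stands alone (no adjacent letters or digits),
# consists only of digits, confusable letters, commas and periods,
# and contains at least one real digit.
_NUMERIC_TOKEN = re.compile(
    r"(?<![A-Za-z0-9])"          # not glued to a preceding word
    r"(?=[0-9OolISB,.]*[0-9])"   # at least one genuine digit in the token
    r"[0-9OolISB][0-9OolISB,.]*"
    r"(?![A-Za-z0-9])"           # not glued to a following word
)


def correct_digits(text: str) -> str:
    """Replace digit-like letters inside numeric-looking tokens."""
    return _NUMERIC_TOKEN.sub(
        lambda m: m.group(0).translate(_CONFUSIONS), text
    )


if __name__ == "__main__":
    print(correct_digits("Total: 1O,5l2.S0 yuan"))
```

Requiring at least one genuine digit keeps ordinary words and acronyms untouched, at the cost of missing tokens in which every digit was misread.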

Investigation of Key Regulatory Factors and Molecular Mechanisms of Peanut Stomatal Phenotypes

Apr 2024 ~ Nov 2024
AITA and State Key Laboratory of Crop Stress Adaptation and Improvement, Henan, China

This project aims to collect and quantify stomatal phenotypes from live, field-grown peanut plants using an intelligent recognition system. By integrating phenotypic, genotypic, and environmental data, we can perform a Genome-Wide Association Study (GWAS) and Genotype-by-Environment Interaction (G×E) analysis to identify key, highly stable regulatory genes. The ultimate goal is to discover superior genetic resources for peanut breeding under variable environments and to enhance the crop's productivity and adaptability. In this project, I worked with Quanling Zhao and Prof. Xiaohui Yang from AITA, as well as Dr. Chenyang Du and Prof. Chen Miao from the State Key Laboratory of Crop Stress Adaptation and Improvement, on the following work:

  • Development of RGxEStat. RGxEStat is a user-friendly R GUI package for statistical analysis of genotype-by-environment interactions. It integrates: (1) significance analysis based on a mixed-effects model to determine whether genes or G×E interactions significantly affect phenotypic traits; (2) single-gene and multi-gene stability analysis based on singular value effect decomposition, which examines the interactions between genes and environments, as well as the relative superiority or inferiority of genotypes across environments. This tool helps breeders identify environment-specific alleles associated with stomatal function, providing actionable insights for climate-resilient peanut breeding.
  • Development of StomaD². StomaD² is an end-to-end stomatal phenotyping system built on a diffusion-based restoration-detection network. It supports real-time imaging and quantification of stomatal phenotypes (e.g., density, conductance) under field conditions, is compatible with both monocot and dicot crops, and accepts both destructive and non-destructive microscopy images. Experimental results show that StomaD² achieves expert-level accuracy and generalizes well across species, highlighting its potential for large-scale phenotyping, plant physiology research, and precision agriculture.
  • Proposal of HDIoU. HDIoU is a triphasic Hellinger Distance-based Intersection over Union for oriented bounding box (OBB) regression, which we integrate into the training of YOLOv8-OBB on non-destructive leaf epidermal stomatal images. HDIoU uses a dynamic indicator to determine the current optimization focus during training, allowing the loss function to transition continuously across three objective phases within a single unified learning process. This triphasic design enables more accurate and efficient localization of target contours, yielding precise phenotypic information for stomatal studies. Furthermore, HDIoU is inherently scale-invariant, making it particularly effective for detecting small objects such as stomata in high-resolution biological imagery.
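
To give a sense of the quantity that HDIoU builds on, the sketch below computes the standard closed-form Hellinger distance between two 2D Gaussians derived from oriented boxes. It is not the HDIoU loss itself: the triphasic weighting and dynamic indicator are not reproduced, and the box-to-Gaussian mapping (covariance R·diag(w²/4, h²/4)·Rᵀ) as well as the function names are assumptions made for this illustration.

```python
import math


def obb_to_gaussian(cx, cy, w, h, theta):
    """Map an oriented box to (mean, covariance) of a 2D Gaussian.

    Assumed mapping: Sigma = R diag((w/2)^2, (h/2)^2) R^T, theta in radians.
    The symmetric covariance is returned as (sxx, sxy, syy).
    """
    c, s = math.cos(theta), math.sin(theta)
    a, b = (w / 2) ** 2, (h / 2) ** 2
    sxx = a * c * c + b * s * s
    sxy = (a - b) * c * s
    syy = a * s * s + b * c * c
    return (cx, cy), (sxx, sxy, syy)


def hellinger_distance(box1, box2):
    """Hellinger distance in [0, 1] between the Gaussians of two boxes."""
    m1, (a1, b1, d1) = obb_to_gaussian(*box1)
    m2, (a2, b2, d2) = obb_to_gaussian(*box2)

    # Averaged covariance Sigma = (Sigma1 + Sigma2) / 2.
    a, b, d = (a1 + a2) / 2, (b1 + b2) / 2, (d1 + d2) / 2
    det1 = a1 * d1 - b1 * b1
    det2 = a2 * d2 - b2 * b2
    det = a * d - b * b

    # Quadratic form (m1 - m2)^T Sigma^{-1} (m1 - m2) for a 2x2 matrix.
    dx, dy = m1[0] - m2[0], m1[1] - m2[1]
    maha = (d * dx * dx - 2 * b * dx * dy + a * dy * dy) / det

    # Bhattacharyya coefficient between the two Gaussians.
    bc = (det1 * det2) ** 0.25 / math.sqrt(det) * math.exp(-maha / 8)
    return math.sqrt(max(0.0, 1.0 - bc))


if __name__ == "__main__":
    small = hellinger_distance((0, 0, 4, 2, 0.3), (1, 0.5, 3, 2, 0.1))
    large = hellinger_distance((0, 0, 40, 20, 0.3), (10, 5, 30, 20, 0.1))
    print(small, large)  # the two values agree up to rounding
```

In this sketch, scaling both boxes by the same factor leaves the distance unchanged, which is the kind of scale invariance that matters for small objects such as stomata.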

Intelligent Modeling of Organic Chemical Synthesis Based on Topological Data Analysis and Machine Learning

Nov 2023 ~ Mar 2024
Henan Engineering Research Center for Artificial Intelligence Theory and Algorithms, Henan, China

This project marked the beginning of my research journey. I was fortunate to collaborate with Yanhui Guo and Prof. Xiaohui Yang on modeling organic chemical synthesis reactions with statistical and machine learning tools. In the two papers I contributed to, we used topological data analysis, tree-based ensemble learning, convolutional neural networks, and multi-scale attention to build an intelligent system for yield analysis and prediction in organic synthesis. Our work aims to give researchers comprehensive, multi-perspective information for decision-making.