Development and Deployment of Audit Large Language Model
Dec 2024 ~ Apr 2025
AITA and Henan Provincial Audit Department, Henan, China
The primary objective of this project is to design, fine-tune, and deploy a secure, internal-facing Audit Large Language Model (ALLM) to empower auditors at the Henan Provincial Audit Department. Built atop the Qwen foundation model, the ALLM is designed to deliver intelligent support for audit Q&A, regulatory compliance, and table-data analysis, improving audit efficiency and accuracy. In this project, I worked with Lutong Zhang and Haonan Zhang on the following tasks:
- Data Collection: Aggregating a large body of audit documentation, historical audit reports, government financial regulations (national and provincial level), accounting standards (e.g., Chinese GAAP), policy directives, compliance manuals, and best-practice guidelines.
- Data Processing: Implementing processing pipelines for heterogeneous audit sources: text extraction with PaddleOCR plus regex-based digit correction for scanned documents, pdfplumber for layout-aware PDF parsing, and python-docx for DOCX files; entity normalization of regulations and financial terms with spaCy and FinBERT; document categorization with DeepSeek-V2 and sensitivity tagging with a UIE model; and context-aware semantic segmentation with LayoutLMv3 to preserve audit logic.
- Knowledge Vectorization: Using BGE-M3 to embed textual knowledge as high-dimensional vectors, indexed in Milvus with metadata filters to enable efficient semantic search and retrieval.
- Synthetic QA Generation: Using Qwen-72B-Chat to produce question-answer pairs reflecting common audit scenarios, enriching our fine-tuning dataset.
- Expert Annotation: Collaborating with senior auditors to annotate real audit cases, identifying key entities, risks, compliance issues, and generating relevant queries/responses.
- Instruction Tuning: Formulating diverse instructions covering typical audit tasks (e.g., "Identify potential fraud indicators in this transaction log," "Explain the relevant regulation for expense X," "Summarize the key findings from report Y").
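
As an illustration of the regex-based digit correction mentioned under Data Processing, the following is a minimal sketch and not the project's actual implementation. The confusion map (`OCR_DIGIT_MAP`), the pattern (`NUMERIC_TOKEN`), and the function name `correct_digits` are all hypothetical; the idea is only to rewrite letters that OCR commonly confuses with digits when they appear inside a token that already contains at least one real digit.

```python
import re

# Hypothetical map of letters that OCR often confuses with digits (an assumption,
# not taken from the project).
OCR_DIGIT_MAP = str.maketrans(
    {"O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"}
)

# A standalone token made only of digits, confusable letters, and the separators
# "," and ".", containing at least one real digit. It must not touch other
# letters or digits on either side, so ordinary words are left alone.
NUMERIC_TOKEN = re.compile(
    r"(?<![A-Za-z0-9])"          # not preceded by a letter or digit
    r"(?=[0-9OolISB,.]*\d)"      # at least one genuine digit in the run
    r"[0-9OolISB][0-9OolISB,.]*[0-9OolISB]"
    r"(?![A-Za-z0-9])"           # not followed by a letter or digit
)


def correct_digits(text: str) -> str:
    """Replace confusable letters with digits inside numeric-looking tokens."""
    return NUMERIC_TOKEN.sub(lambda m: m.group(0).translate(OCR_DIGIT_MAP), text)


if __name__ == "__main__":
    print(correct_digits("Total: 1O,5l2.S0 yuan"))
```

A rule like this can over-correct alphanumeric identifiers (for example, invoice codes that legitimately mix letters and digits), so in practice it would likely be restricted to fields known to be numeric, such as amount columns.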