This document describes the data collection and curation pipeline for fine-tuning the language model that powers Soft Axiom. The goal is to move beyond generic API-backed inference and train a domain-adapted model that understands the specific retrieval and knowledge-management tasks Soft Axiom handles, improving answer quality and reducing dependence on third-party providers.
Motivation
Soft Axiom currently routes queries through external LLM providers. That works for prototyping, but it limits control over output quality, latency, cost, and privacy. Fine-tuning a model on task-specific data lets the platform own its inference stack end to end, from the data that shapes the model to the serving infrastructure that runs it. The first step is building the dataset.
Data Collection
The pipeline gathers training examples from multiple sources: synthetic question-answer pairs generated from ingested documents, real user interactions with the RAG pipeline (anonymized and filtered), and curated public datasets relevant to document comprehension and knowledge retrieval. Each example is structured as an instruction-input-output triple to match the fine-tuning format.
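To make the triple format concrete, here is a minimal sketch of one training example serialized as JSONL; the field names and content are illustrative, not Soft Axiom's actual schema:

```python
import json

# One training example in instruction-input-output form.
# Field names and content are hypothetical, for illustration only.
example = {
    "instruction": "Answer the question using only the provided document excerpt.",
    "input": (
        "Excerpt: The retention policy archives documents after 90 days.\n"
        "Question: When are documents archived?"
    ),
    "output": "Documents are archived after 90 days, per the retention policy.",
}

# Examples are commonly stored one JSON object per line (JSONL).
line = json.dumps(example)
print(line)
```

Keeping instruction, input, and output as separate fields lets the same pipeline later render them into whatever prompt template the chosen base model expects.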
Data Quality and Processing
Raw data goes through deduplication, length filtering, and quality scoring before entering the training set. Embedding-based similarity checks remove near-duplicates, and heuristic filters flag low-quality or off-topic examples for manual review. The pipeline tracks provenance so every training example can be traced back to its source.
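The embedding-based near-duplicate step can be sketched as a greedy cosine-similarity filter; this assumes embeddings have already been computed by some encoder, and the threshold value is a placeholder to be tuned:

```python
import numpy as np

def deduplicate(embeddings: np.ndarray, threshold: float = 0.95) -> list[int]:
    """Greedy near-duplicate filter: keep an example only if its cosine
    similarity to every already-kept example is below the threshold.
    O(n^2) pairwise comparison; at scale an ANN index would replace it."""
    # Normalize rows so dot products equal cosine similarities.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept: list[int] = []
    for i in range(len(normed)):
        if all(normed[i] @ normed[j] < threshold for j in kept):
            kept.append(i)
    return kept

# Toy 2-D vectors standing in for real sentence embeddings.
vecs = np.array([
    [1.0, 0.0],    # kept
    [0.99, 0.05],  # near-duplicate of the first -> dropped
    [0.0, 1.0],    # distinct -> kept
])
print(deduplicate(vecs))  # -> [0, 2]
```

Returning indices rather than filtered rows keeps the provenance link intact: the surviving positions can be mapped back to source records.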
Fine-tuning Pipeline
The fine-tuning workflow uses Hugging Face Transformers with PEFT/LoRA for parameter-efficient adaptation. This keeps GPU memory requirements manageable and allows rapid iteration on different base models and hyperparameter configurations without retraining from scratch.
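The memory savings come from LoRA's low-rank structure: the frozen weight W (d x k) is corrected by a trainable product B @ A with rank r much smaller than d and k. A miniature NumPy sketch of the idea, with illustrative dimensions rather than any particular base model's:

```python
import numpy as np

# LoRA in miniature: freeze W (d x k), train only the low-rank pair
# B (d x r) and A (r x k), with r << min(d, k).
d, k, r = 768, 768, 8
rng = np.random.default_rng(0)

W = rng.normal(size=(d, k))          # frozen base weight
A = rng.normal(size=(r, k)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                 # trainable up-projection; zero init
                                     # means training starts at the base model

def forward(x: np.ndarray) -> np.ndarray:
    # Effective weight is W + B @ A, but the product is never materialized:
    # the adapter path is applied as two small matmuls.
    return x @ W.T + x @ A.T @ B.T

full = W.size            # parameters a full fine-tune would update
lora = A.size + B.size   # parameters LoRA updates
print(f"trainable fraction: {lora / full:.3%}")  # about 2% at these dims
```

In practice the PEFT library wires adapters like this into selected layers of a Transformers model; swapping base models or ranks only changes the small A/B matrices, which is what makes rapid iteration cheap.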
What's Next
- Scale synthetic data generation from the existing Soft Axiom document corpus
- Establish evaluation benchmarks for retrieval-grounded question answering
- Run first LoRA fine-tuning experiments and compare against baseline providers
- Integrate the fine-tuned model into the Soft Axiom serving pipeline
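For the evaluation-benchmark item, a common starting point is SQuAD-style token-overlap F1 between a model answer and a reference answer; this is a generic metric sketch, not Soft Axiom's chosen benchmark:

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a model answer and a reference answer,
    a standard baseline metric for extractive / grounded QA."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(token_f1("archived after 90 days",
               "documents are archived after 90 days"))  # -> 0.8
```

Token F1 rewards partial matches, which suits grounded answers that paraphrase the source; stricter exact-match or LLM-judged scoring could be layered on top later.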