AI Systems

MiniEmbed Research

Type
Case Study
Date
Nov 2024
Stack
PyTorch, Rust, ONNX, Transformers

Overview

MiniEmbed is our flagship research effort in compact, high-performance embedding models built for resource-constrained environments. Traditional LLM-based embeddings are typically too resource-intensive for real-time edge applications.

Our approach combines knowledge distillation with quantization-aware training, retaining over 95% of the teacher model's benchmark accuracy while reducing model size by 80%.
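The exact MiniEmbed training objective is not published here, but an embedding-distillation loss of this kind can be sketched as follows. The specific terms (cosine alignment plus an in-batch similarity KL term) and the `temperature` parameter are illustrative assumptions, not the production recipe:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_emb, teacher_emb, temperature=2.0):
    # Hypothetical embedding-distillation objective (illustrative only).
    # Term 1: pull each student embedding toward its teacher counterpart.
    align = 1.0 - F.cosine_similarity(student_emb, teacher_emb, dim=-1).mean()
    # Term 2: match the in-batch similarity structure of the teacher,
    # softened by a temperature before the KL divergence.
    s_sim = student_emb @ student_emb.T / temperature
    t_sim = teacher_emb @ teacher_emb.T / temperature
    kl = F.kl_div(
        F.log_softmax(s_sim, dim=-1),
        F.softmax(t_sim, dim=-1),
        reduction="batchmean",
    )
    return align + kl
```

The relational (similarity-matrix) term lets a much smaller student preserve the teacher's neighborhood structure, which matters more for retrieval quality than matching raw vector values.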

Technical Implementation

We utilized a hybrid architecture combining transformer blocks with specialized attention mechanisms optimized for low-power ARM architectures.

import math
import torch
import torch.nn.functional as F

def optimize_attention(query, key, value):
    # Specialized low-latency scaled dot-product attention
    d_k = query.size(-1)  # derive head dimension from the query tensor
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    return torch.matmul(F.softmax(scores, dim=-1), value)

Results

  1. Latency: 12ms average inference time on mobile chipsets.
  2. Size: 45MB total footprint (quantized to INT8).
  3. Accuracy: Retained MTEB benchmark parity with significantly larger models.
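The INT8 footprint reduction reported above came from quantization-aware training; as a simpler illustration of the same effect, PyTorch's post-training dynamic quantization shows how converting `Linear` weights to INT8 shrinks a model's serialized size (the layer dimensions here are arbitrary, not MiniEmbed's):

```python
import io
import torch
import torch.nn as nn

# Toy stand-in model; MiniEmbed's actual architecture is not reproduced here.
model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 64))

# Post-training dynamic quantization: Linear weights stored as INT8,
# activations quantized on the fly at inference time.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

def size_mb(m):
    # Serialize the state dict to an in-memory buffer and measure it.
    buf = io.BytesIO()
    torch.save(m.state_dict(), buf)
    return buf.getbuffer().nbytes / 1e6

print(f"fp32: {size_mb(model):.2f} MB, int8: {size_mb(quantized):.2f} MB")
```

Dynamic quantization trades a small accuracy hit for a roughly 4x reduction in weight storage; quantization-aware training recovers most of that accuracy by simulating INT8 rounding during training.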
"Aquilonis AI's MiniEmbed has fundamentally changed how we deploy semantic search on our industrial sensor arrays." — Lead Engineer, TechGlobal