Overview
MiniEmbed represents our flagship research into high-performance, compact embedding models designed specifically for constrained environments. Traditional LLM-based embeddings are often too resource-intensive for real-time edge applications.
Our approach combines knowledge distillation with quantization-aware training, retaining over 95% of the teacher model's accuracy while cutting model size by 80%.
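As a rough, self-contained sketch of the distillation half of this recipe (the model shapes, loss, and stand-in encoders below are illustrative assumptions, not the production MiniEmbed setup), a compact student can be trained to reproduce a frozen teacher's embeddings:

import torch
import torch.nn.functional as F

def distillation_loss(student_emb, teacher_emb):
    # Align student embeddings with the teacher's: 1 - cosine similarity
    # pushes each student vector toward the teacher's direction.
    return (1 - F.cosine_similarity(student_emb, teacher_emb, dim=-1)).mean()

teacher = torch.nn.Linear(768, 256)   # placeholder for the frozen teacher encoder
student = torch.nn.Linear(768, 256)   # placeholder for the compact student encoder
teacher.requires_grad_(False)         # teacher weights stay fixed

inputs = torch.randn(32, 768)         # a batch of input features
loss = distillation_loss(student(inputs), teacher(inputs))
loss.backward()                       # only the student receives gradients

In practice, an objective like this is paired with the quantization-aware training pass so the student learns weights that survive INT8 conversion.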
Technical Implementation
We utilized a hybrid architecture combining transformer blocks with specialized attention mechanisms optimized for low-power ARM architectures.
import math
import torch
import torch.nn.functional as F

def optimize_attention(query, key, value):
    # Specialized low-latency attention: scaled dot-product attention
    # with the scaling factor derived from the key dimension.
    d_k = key.size(-1)
    scaled_dot_product = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    return torch.matmul(F.softmax(scaled_dot_product, dim=-1), value)
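A quick illustrative call with random tensors (the shapes are arbitrary) confirms the routine behaves like standard scaled dot-product attention:

# Illustrative check: batch=2, heads=4, sequence length=16, d_k=32.
q = torch.randn(2, 4, 16, 32)
k = torch.randn(2, 4, 16, 32)
v = torch.randn(2, 4, 16, 32)
out = optimize_attention(q, k, v)
print(out.shape)  # torch.Size([2, 4, 16, 32])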
Results
- Latency: 12ms average inference time on mobile chipsets.
- Size: 45MB total footprint (quantized to INT8; see the sketch after this list).
- Accuracy: retained parity with significantly larger models on the MTEB benchmark.
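For readers curious how an INT8 footprint like this is produced, here is a minimal sketch using PyTorch's stock dynamic quantization on a toy encoder; MiniEmbed itself relies on quantization-aware training, so treat this only as an illustration of the conversion step:

import torch

# Toy encoder standing in for a real embedding model.
model = torch.nn.Sequential(
    torch.nn.Linear(768, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 256),
)
# Convert Linear weights to INT8 after training; this shrinks the on-disk
# footprint roughly 4x relative to FP32.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)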
"Aquilonis AI's MiniEmbed has fundamentally changed how we deploy semantic search on our industrial sensor arrays." — Lead Engineer, TechGlobal