Your LLM, anywhere
We provide professional services to port and optimize Gen AI models for mobile, IoT, and server environments.


Qwen
TinyLlama
Llama
Gemma
LLaVA
Intern-VL
Taking Foundation Models to the Edge
You can now deploy any LLM or VLM, even those that were previously too large or complex.
Mobile
Smart home
Robotics
Automotive
Server
Industrial edge box
Optimized LLMs, Engineered with Hardware Insight
Specialized techniques meet real hardware knowledge to push Gen AI to its limits.
Bridging AI and Hardware
We understand both model internals and hardware constraints, with extensive experience deploying models across real-world devices.

Hardware-Aware Quantization
We apply device-specific quantization strategies to maximize performance while preserving accuracy.

Research-Driven Optimization
We implement the latest methods from cutting-edge papers, including mixed-precision quantization, operator fusion, and graph pruning; a simplified quantization sketch follows below.
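To make "device-specific quantization" concrete, here is a minimal NumPy sketch of one common building block, symmetric per-channel INT8 weight quantization. It is an illustration under simplifying assumptions, not NetsPresso's production pipeline; the function names are ours, and real deployments choose bit-widths and granularity per layer to match the target hardware's kernels.

```python
import numpy as np

def quantize_per_channel_int8(weights: np.ndarray):
    """Symmetric per-channel INT8 weight quantization (illustrative only)."""
    # One scale per output channel (row), chosen so the largest
    # absolute weight in that row maps to the INT8 limit of 127.
    max_abs = np.abs(weights).max(axis=1, keepdims=True)
    scales = np.maximum(max_abs, 1e-8) / 127.0   # avoid divide-by-zero rows
    q = np.clip(np.round(weights / scales), -127, 127).astype(np.int8)
    return q, scales

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    # Recover a float approximation of the original weights.
    return q.astype(np.float32) * scales

# Toy check: round-trip a random 4x8 weight matrix and measure the error.
w = np.random.randn(4, 8).astype(np.float32)
q, s = quantize_per_channel_int8(w)
print("max abs error:", float(np.abs(w - dequantize(q, s)).max()))
```

Per-channel scaling is a typical accuracy-preserving choice because it adapts to each channel's dynamic range; hardware-aware pipelines additionally pick formats the device's NPU executes natively.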

Real Results.
Proven Performance.
The NetsPresso team enhanced LLM performance on real devices by combining quantization with hardware-aware techniques, leading to measurable gains in memory efficiency, speed, and output quality.
Quantization
Optimized quantization to meet both efficiency and accuracy

Category                  Metric                       Nota-optimized vs. original model
Memory Efficiency         Memory (GB)                  Reduced by 6.9%
Inference Speed           Prefill speed (token/s)      1.27x faster
Inference Speed           Decoding speed (token/s)     1.15x faster
Text Generation Quality   Instruction understanding    Improved by 6.2%
Text Generation Quality   Text summarization           Improved by 11.7%

· Llama 3 8B benchmark before and after optimization (under the same conditions).
· Measurements taken on a Qualcomm DragonWing IQ-9075.
Llama, Qualcomm, and DragonWing IQ-9075 are used for illustrative purposes only and do not imply any affiliation with, endorsement by, or sponsorship from the respective third parties.
Decoding
Performance enhancement through advanced decoding techniques

Metric                     SS Decoding Off   SS Decoding On
Decoding speed (token/s)   10.9              19.4 (1.78x faster)
TTFT (ms)                  791               804 (almost the same)

· Self-speculative decoding applied to LLaVA-1.5-7B.
· Measurements taken on a Qualcomm Snapdragon 8 Gen 3.
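For context on the "SS Decoding" rows above: in self-speculative decoding, a shallower copy of the same model (for example, with some layers skipped) drafts several tokens ahead, and the full model verifies them, so output quality is preserved while throughput improves. The sketch below shows the core draft-and-verify loop under greedy decoding; the model callables are toy stand-ins, not a real inference API.

```python
def speculative_decode(full_model, draft_model, prompt, k=4, max_new=8):
    """Toy draft-and-verify loop (greedy decoding, illustrative only)."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new:
        # 1) The cheap draft model proposes k tokens, one at a time.
        draft = []
        for _ in range(k):
            draft.append(draft_model(tokens + draft))
        # 2) The full model scores every draft position. A real system
        #    does this in a single batched forward pass; the loop here
        #    is only for clarity.
        full_next = [full_model(tokens + draft[:i]) for i in range(k + 1)]
        # 3) Keep the longest agreeing prefix, then take one token from
        #    the full model so progress is made even if all drafts fail.
        n = 0
        while n < k and draft[n] == full_next[n]:
            n += 1
        tokens += draft[:n] + [full_next[n]]
    return tokens

# Deterministic toy "models" over integer tokens; the draft sometimes
# disagrees with the full model, exercising the rejection path.
full = lambda seq: (31 * sum(seq) + len(seq)) % 97
draft = lambda seq: full(seq) if sum(seq) % 5 else (full(seq) + 1) % 97
print(speculative_decode(full, draft, [1, 2, 3]))
```

Because the full model checks every accepted token, greedy output matches what the full model alone would produce; the speedup comes from accepting several draft tokens per full-model pass. This also explains the near-identical TTFT above: speculation accelerates decoding, not prefill.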
Deep Optimization Backed by Research
Explore Our Research
LLMs that run anywhere and know everything with RAG
Whether it's a standalone LLM or RAG-enhanced, we support both.
Only LLM
Standalone language model without external data access

Question → LLM (Generate) → Answer

· Answers are based only on pre-trained knowledge
· Fast inference with lower system complexity
With RAG
LLM connected to external knowledge sources

Question → Retrieve (Docs) → LLM (Generate) → Answer

· Provides up-to-date and accurate information
· Reduces hallucination by grounding in real data
RAG on Edge: Live Demo
Watch how our RAG-enabled LLM runs on a Qualcomm QCS6490 device — answering questions based on documents it hasn’t seen before.
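To make the retrieve-then-generate flow above concrete, here is a minimal, dependency-free Python sketch. The word-overlap retriever and the generate() stub are toy stand-ins (our names, not a real API); a production edge deployment would use embedding-based search and an on-device LLM runtime.

```python
docs = [
    "NetsPresso optimizes LLMs for edge devices.",
    "Self-speculative decoding speeds up token generation.",
]

def retrieve(question, docs):
    # Toy relevance score: number of overlapping lowercase words.
    q_words = set(question.lower().split())
    return max(docs, key=lambda d: len(q_words & set(d.lower().split())))

def generate(prompt):
    # Stand-in for an on-device model call.
    return "(model output grounded in: " + prompt.splitlines()[1] + ")"

def answer(question):
    context = retrieve(question, docs)           # Retrieve (Docs)
    prompt = (
        "Answer using only this context:\n"
        + context + "\n"
        + "Question: " + question + "\nAnswer:"
    )
    return generate(prompt)                      # LLM (Generate)

print(answer("What does NetsPresso optimize?"))
```

Grounding the prompt in retrieved documents is what lets the demo answer questions about content the model never saw during training.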
Too Big, Too Slow, Too Costly?
Not Anymore.
The model is too large to deploy.
It's slow, and sometimes it generates really strange outputs.
High-end GPUs are too expensive.
We don't know when development will finish.
We're considering a product with generative AI.
It's all good: the NetsPresso team will optimize it for smoother performance and a better user experience.
Ask me anything
Have a device in mind?
We’ll make your generative AI fit perfectly.
Talk to Our Experts