Your LLM, anywhere
We provide professional services to port and optimize Gen AI models for mobile, IoT, and server environments.


Qwen
TinyLlama
Llama
Gemma
LLaVA
Intern-VL
Taking Foundation Models to the Edge
You can now deploy any LLM or VLM, even those that were previously too large or complex.
Mobile
Smart home
Robotics
Automotive
Server
Industrial edge box
Optimized LLMs, Engineered with Hardware Insight
Specialized techniques meet real hardware knowledge to push Gen AI to its limits.
Bridging AI and Hardware
We understand both model internals and hardware constraints, with extensive experience deploying models across real-world devices.

Hardware-Aware Quantization
We apply device-specific quantization strategies to maximize performance while preserving accuracy.

Research-Driven Optimization
We implement the latest methods from cutting-edge papers, including mixed-precision quantization, operator fusion, and graph pruning; a simplified quantization sketch follows below.
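To make "device-specific quantization" concrete, here is a minimal NumPy sketch of one common building block, symmetric per-channel INT8 weight quantization. It is an illustration under simplifying assumptions, not NetsPresso's production pipeline; the function names are ours, and real deployments choose bit-widths and granularity per layer to match the target hardware's kernels.

```python
import numpy as np

def quantize_per_channel_int8(weights: np.ndarray):
    """Symmetric per-channel INT8 weight quantization (illustrative only)."""
    # One scale per output channel (row), chosen so the largest
    # absolute weight in that row maps to the INT8 limit of 127.
    max_abs = np.abs(weights).max(axis=1, keepdims=True)
    scales = np.maximum(max_abs, 1e-8) / 127.0   # avoid divide-by-zero rows
    q = np.clip(np.round(weights / scales), -127, 127).astype(np.int8)
    return q, scales

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    # Recover a float approximation of the original weights.
    return q.astype(np.float32) * scales

# Toy check: round-trip a random 4x8 weight matrix and measure the error.
w = np.random.randn(4, 8).astype(np.float32)
q, s = quantize_per_channel_int8(w)
print("max abs error:", float(np.abs(w - dequantize(q, s)).max()))
```

Per-channel scaling is a typical accuracy-preserving choice because it adapts to each channel's dynamic range; hardware-aware pipelines additionally pick formats the device's NPU executes natively.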

Real Results.
Proven Performance.
The NetsPresso team enhanced LLM performance on real devices by combining quantization with hardware-aware techniques, leading to measurable gains in memory efficiency, speed, and output quality.
Quantization
Optimized quantization to meet both efficiency and accuracy

Category                  Metric                       Nota-optimized vs. original model
Memory Efficiency         Memory (GB)                  Reduced by 6.9%
Inference Speed           Prefill speed (token/s)      1.27x faster
Inference Speed           Decoding speed (token/s)     1.15x faster
Text Generation Quality   Instruction understanding    Improved by 6.2%
Text Generation Quality   Text summarization           Improved by 11.7%

· Llama 3 8B benchmark before and after optimization (under the same conditions).
· Measurements taken on a Qualcomm DragonWing IQ-9075.
Llama, Qualcomm, and DragonWing IQ-9075 are used for illustrative purposes only and do not imply any affiliation with, endorsement by, or sponsorship from the respective third parties.
Decoding
Performance enhancement through advanced decoding techniques

Metric                     SS Decoding Off   SS Decoding On
Decoding speed (token/s)   10.9              19.4 (1.78x faster)
TTFT (ms)                  791               804 (almost the same)

· Self-speculative decoding applied to LLaVA-1.5-7B.
· Measurements taken on a Qualcomm Snapdragon 8 Gen 3.
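For context on the "SS Decoding" rows above: in self-speculative decoding, a shallower copy of the same model (for example, with some layers skipped) drafts several tokens ahead, and the full model verifies them, so output quality is preserved while throughput improves. The sketch below shows the core draft-and-verify loop under greedy decoding; the model callables are toy stand-ins, not a real inference API.

```python
def speculative_decode(full_model, draft_model, prompt, k=4, max_new=8):
    """Toy draft-and-verify loop (greedy decoding, illustrative only)."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new:
        # 1) The cheap draft model proposes k tokens, one at a time.
        draft = []
        for _ in range(k):
            draft.append(draft_model(tokens + draft))
        # 2) The full model scores every draft position. A real system
        #    does this in a single batched forward pass; the loop here
        #    is only for clarity.
        full_next = [full_model(tokens + draft[:i]) for i in range(k + 1)]
        # 3) Keep the longest agreeing prefix, then take one token from
        #    the full model so progress is made even if all drafts fail.
        n = 0
        while n < k and draft[n] == full_next[n]:
            n += 1
        tokens += draft[:n] + [full_next[n]]
    return tokens

# Deterministic toy "models" over integer tokens; the draft sometimes
# disagrees with the full model, exercising the rejection path.
full = lambda seq: (31 * sum(seq) + len(seq)) % 97
draft = lambda seq: full(seq) if sum(seq) % 5 else (full(seq) + 1) % 97
print(speculative_decode(full, draft, [1, 2, 3]))
```

Because the full model checks every accepted token, greedy output matches what the full model alone would produce; the speedup comes from accepting several draft tokens per full-model pass. This also explains the near-identical TTFT above: speculation accelerates decoding, not prefill.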
Deep Optimization Backed by Research
Explore Our Research
LLMs that run anywhere and know everything with RAG
Whether it's a standalone LLM or RAG-enhanced, we support both.
Only LLM
Standalone language model without external data access

Question → LLM (Generate) → Answer

· Answers are based only on pre-trained knowledge
· Fast inference with lower system complexity
With RAG
LLM connected to external knowledge sources

Question → Retrieve (Docs) → LLM (Generate) → Answer

· Provides up-to-date and accurate information
· Reduces hallucination by grounding in real data
RAG on Edge: Live Demo
Watch how our RAG-enabled LLM runs on a Qualcomm QCS6490 device — answering questions based on documents it hasn’t seen before.
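To make the retrieve-then-generate flow above concrete, here is a minimal, dependency-free Python sketch. The word-overlap retriever and the generate() stub are toy stand-ins (our names, not a real API); a production edge deployment would use embedding-based search and an on-device LLM runtime.

```python
docs = [
    "NetsPresso optimizes LLMs for edge devices.",
    "Self-speculative decoding speeds up token generation.",
]

def retrieve(question, docs):
    # Toy relevance score: number of overlapping lowercase words.
    q_words = set(question.lower().split())
    return max(docs, key=lambda d: len(q_words & set(d.lower().split())))

def generate(prompt):
    # Stand-in for an on-device model call.
    return "(model output grounded in: " + prompt.splitlines()[1] + ")"

def answer(question):
    context = retrieve(question, docs)           # Retrieve (Docs)
    prompt = (
        "Answer using only this context:\n"
        + context + "\n"
        + "Question: " + question + "\nAnswer:"
    )
    return generate(prompt)                      # LLM (Generate)

print(answer("What does NetsPresso optimize?"))
```

Grounding the prompt in retrieved documents is what lets the demo answer questions about content the model never saw during training.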
Too Big, Too Slow, Too Costly?
Not Anymore.
The model is too large to deploy.
It's slow, and sometimes it generates really strange outputs.
High-end GPUs are too expensive.
We don't know when development will finish.
We're considering a product with generative AI.
It's all good: the NetsPresso team will optimize it for smoother performance and a better user experience.
Ask me anything
Have a device in mind?
We’ll make your generative AI fit perfectly.
Talk to Our Experts