

Why does Apple use "small models"?

Xin Zhi Yuan · Jul 2 19:49

Source: Xin Zhi Yuan

At WWDC 2024, Apple presented its own take on AI: Apple Intelligence.

This is a personal intelligence system deeply integrated into iOS 18, iPadOS 18, and macOS Sequoia.

Unlike other technology giants, Apple's AI does not follow the principle that "bigger is better."

On the contrary, Apple's attitude is more pragmatic: it prioritizes user experience and emphasizes tailoring AI models to specific tasks.

Seamlessly integrating generative AI into the operating system - in a sense, this is a very 'Apple' approach.

Apple Intelligence consists of multiple powerful generative models that are specifically designed for users' daily tasks and can adapt to their current activities in real time.

The foundation models built into Apple Intelligence are fine-tuned for everyday user tasks: composing and refining text, summarizing, prioritizing notifications, creating playful images for conversations, and simplifying interactions across apps.

Apple prefers to handle these tasks with small on-device models. Users can also opt into third-party services such as ChatGPT, but data sent to those services is no longer Apple's responsibility.

Apple highlighted two of the models: an on-device language model with about 3 billion parameters, and a larger server-based language model that runs on Apple servers via Private Cloud Compute.

Keep Small

Apple's foundation models are trained on the AXLearn framework.

AXLearn is an open-source project Apple released in 2023. Built on JAX and XLA, it lets Apple train models efficiently and scalably across a range of training hardware and cloud platforms, including TPUs and both cloud and on-premises GPUs.

Apple combines data parallelism, tensor parallelism, sequence parallelism, and fully sharded data parallelism (FSDP) to scale training along multiple dimensions such as data, model size, and sequence length.
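The simplest of these strategies can be illustrated with a toy data-parallel step. This is a minimal numpy sketch, not AXLearn code: each "worker" computes gradients on its shard of the batch, the gradients are averaged (the all-reduce), and the result matches a single-device update on the full batch.

```python
import numpy as np

def local_grad(w, x, y):
    # Gradient of mean squared error for a linear model y_hat = x @ w.
    pred = x @ w
    return 2 * x.T @ (pred - y) / len(x)

def data_parallel_step(w, xs, ys, lr=0.1):
    # xs/ys: per-worker shards of the global batch.
    grads = [local_grad(w, x, y) for x, y in zip(xs, ys)]
    g = np.mean(grads, axis=0)        # stands in for the all-reduce average
    return w - lr * g

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = x @ w_true

w = np.zeros(3)
# Shard the batch across two "workers"; because the loss is a mean over
# examples, the averaged update equals the full-batch update.
w_dp = data_parallel_step(w, np.split(x, 2), np.split(y, 2))
w_single = data_parallel_step(w, [x], [y])
```

Tensor, sequence, and FSDP parallelism extend the same idea to sharding the model weights and the sequence dimension, not just the batch.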

Apple collects publicly available data with its web crawler, AppleBot. Web publishers who do not want their content used to train Apple Intelligence can opt out through controls of varying granularity.

Apple states that it never uses users' private personal data or user interactions when training its foundation models, and it applies filters to remove personally identifiable information, such as social security and credit card numbers, that circulates publicly on the Internet.

Beyond filtering, Apple also applies data extraction, deduplication, and model-based classifiers to identify high-quality documents.
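A pipeline of that shape might look like the following hypothetical sketch: exact deduplication via content hashing, followed by a quality gate. The `quality_score` heuristic here (a trivial word-count score) is only a stand-in for a real model-based classifier.

```python
import hashlib

def dedupe(docs):
    # Exact deduplication: keep the first copy of each distinct document.
    seen, out = set(), []
    for d in docs:
        h = hashlib.sha256(d.encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            out.append(d)
    return out

def quality_score(doc):
    # Placeholder for a model-based quality classifier.
    return min(len(doc.split()) / 10.0, 1.0)

def filter_corpus(docs, threshold=0.5):
    return [d for d in dedupe(docs) if quality_score(d) >= threshold]

corpus = [
    "short",
    "a reasonably long and informative document about training data",
    "short",  # exact duplicate, removed by dedupe
]
kept = filter_corpus(corpus)
```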

Post-training

In the training pipeline, Apple uses a hybrid data strategy combining human annotation and synthetic data, with thorough data curation and filtering procedures.

Apple developed two novel algorithms for this stage:

1. A rejection sampling fine-tuning algorithm;

2. A reinforcement learning from human feedback (RLHF) algorithm using mirror descent policy optimization and a leave-one-out advantage estimator.

Both algorithms significantly improve the quality of the model's instruction following.
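The rejection-sampling idea can be sketched as follows. This is a hedged illustration, not Apple's implementation: for each prompt, draw several candidate responses, score them with a reward model, and keep only the best as fine-tuning data. Both `generate` and `reward` are toy stand-ins.

```python
import random

def generate(prompt, rng):
    # Stand-in for sampling a response from the model.
    return f"{prompt} -> draft {rng.randint(0, 9)}"

def reward(response):
    # Stand-in for a reward model; here it just prefers higher draft numbers.
    return int(response[-1])

def rejection_sample(prompts, n=4, seed=0):
    rng = random.Random(seed)
    dataset = []
    for p in prompts:
        candidates = [generate(p, rng) for _ in range(n)]
        dataset.append(max(candidates, key=reward))  # reject all but the best
    return dataset

data = rejection_sample(["summarize", "rewrite"])
```

Fine-tuning on only the top-scoring candidates is what nudges the model toward responses the reward model prefers.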

Beyond making the resulting models capable, Apple applies a series of techniques to optimize them on devices and in its private cloud, improving speed and efficiency.

Both the on-device and server models use grouped-query attention to optimize inference performance.
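In grouped-query attention (GQA), several query heads share each key/value head, which shrinks the KV cache relative to full multi-head attention. A minimal numpy sketch, with illustrative head counts and dimensions (not the real models' shapes):

```python
import numpy as np

def gqa(q, k, v):
    # q: (n_q_heads, seq, d); k, v: (n_kv_heads, seq, d) with n_kv < n_q.
    n_q, n_kv = q.shape[0], k.shape[0]
    group = n_q // n_kv                        # query heads per KV head
    outs = []
    for h in range(n_q):
        kh, vh = k[h // group], v[h // group]  # shared key/value head
        scores = q[h] @ kh.T / np.sqrt(q.shape[-1])
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)     # softmax over keys
        outs.append(w @ vh)
    return np.stack(outs)

rng = np.random.default_rng(1)
q = rng.normal(size=(8, 4, 16))   # 8 query heads
k = rng.normal(size=(2, 4, 16))   # only 2 KV heads need to be cached
v = rng.normal(size=(2, 4, 16))
out = gqa(q, k, v)
```

Here only 2 of 8 heads' keys and values are stored, a 4x reduction in KV-cache memory for this toy configuration.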

Apple uses a shared input and output vocabulary embedding table to reduce memory requirements and inference cost, ensuring the shared embedding tensor is not stored twice.

The device-side model uses a vocabulary size of 49K, while the server-side model uses a vocabulary size of 100K.
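Weight tying of this kind is straightforward to sketch: one embedding matrix serves both as the input lookup table and, transposed, as the output projection. The vocabulary size below echoes the reported on-device 49K figure; the model dimension is a toy assumption.

```python
import numpy as np

vocab, d_model = 49_000, 64     # 49K on-device vocabulary, toy d_model
embedding = np.zeros((vocab, d_model), dtype=np.float16)

def embed(token_ids):
    return embedding[token_ids]       # input side: table lookup

def logits(hidden):
    return hidden @ embedding.T       # output side: same weights, transposed

# Memory saved: the tied table is stored once instead of twice.
tied_mb = embedding.nbytes / 2**20
untied_mb = 2 * tied_mb
```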

For on-device inference, Apple uses low-bit palettization to meet the necessary memory, power, and performance requirements.

To maintain model quality, Apple developed a new framework that uses LoRA adapters together with a mixed 2-bit and 4-bit configuration strategy (averaging 3.5 bits per weight) to achieve the same accuracy as the uncompressed models.
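Palettization replaces each weight with an index into a small per-tensor palette of representative values. The sketch below is a simplification: it picks the palette by quantiles rather than by the learned clustering a production system would use, and it only shows why 4-bit (16-entry) palettes reconstruct weights more faithfully than 2-bit (4-entry) ones.

```python
import numpy as np

def palettize(w, bits):
    # Palette of 2**bits representative values, chosen here by quantiles.
    palette = np.quantile(w, np.linspace(0, 1, 2 ** bits))
    # Each weight is stored as the index of its nearest palette entry.
    idx = np.abs(w[:, None] - palette[None, :]).argmin(axis=1)
    return palette, idx.astype(np.uint8)

def depalettize(palette, idx):
    return palette[idx]

rng = np.random.default_rng(2)
w = rng.normal(size=1000)

p4, i4 = palettize(w, 4)   # 16-entry palette -> 4 bits per weight
p2, i2 = palettize(w, 2)   # 4-entry palette  -> 2 bits per weight
err4 = np.abs(depalettize(p4, i4) - w).mean()
err2 = np.abs(depalettize(p2, i2) - w).mean()
```

Mixing 2-bit and 4-bit tensors is what lets the average land between the two, trading reconstruction error in less sensitive layers for memory.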

In addition, Apple uses Talaria, an interactive model latency and power analysis tool, to better guide the bit-rate selection for each operation.

Activation quantization and embedding quantization enable efficient key-value cache (KV cache) updates on Apple's Neural Engine.
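The KV cache itself works like this toy sketch (shapes and dtypes are illustrative, not Apple's): at each decoding step only the new token's key/value vectors are appended, so earlier entries are never recomputed, and quantizing the cached tensors shrinks the memory they occupy.

```python
import numpy as np

class KVCache:
    def __init__(self, max_len, n_kv_heads, head_dim):
        # Preallocated low-precision buffers for keys and values.
        self.k = np.zeros((n_kv_heads, max_len, head_dim), dtype=np.float16)
        self.v = np.zeros_like(self.k)
        self.len = 0

    def append(self, k_new, v_new):
        # k_new, v_new: (n_kv_heads, head_dim) for the current token only.
        self.k[:, self.len] = k_new
        self.v[:, self.len] = v_new
        self.len += 1
        # Attention at this step reads every cached position so far.
        return self.k[:, :self.len], self.v[:, :self.len]

cache = KVCache(max_len=8, n_kv_heads=2, head_dim=4)
for step in range(3):
    k_t = np.full((2, 4), step, dtype=np.float16)
    ks, vs = cache.append(k_t, k_t)
```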

Through these optimizations, the iPhone 15 Pro achieves a latency of about 0.6 milliseconds per prompt token and a generation rate of 30 tokens per second.

Adapters

Apple's foundation models are fine-tuned for users' everyday activities and can dynamically specialize for the task at hand.

The approach is to insert small neural network modules, called adapters, into various layers of the pre-trained model to fine-tune it for specific tasks.

Specifically, Apple adapts the attention matrices, the attention projection matrices, and the fully connected layers in the feed-forward networks of the Transformer architecture's decoding layers.

By fine-tuning only the adapter layers, the original parameters of the pre-trained foundation model remain unchanged, preserving the model's general knowledge while supporting specific tasks.
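A LoRA-style adapter of this kind is small enough to sketch directly. In this minimal numpy illustration (dimensions and rank are assumptions, not the real models'), the frozen base weight `W` is augmented with a low-rank update `B @ A`, and only `A` and `B` would be trained; zero-initializing `B` makes the adapter a no-op before training.

```python
import numpy as np

d_in, d_out, rank = 64, 64, 16
rng = np.random.default_rng(3)

W = rng.normal(size=(d_out, d_in))         # frozen pre-trained weight
A = rng.normal(size=(rank, d_in)) * 0.01   # trainable low-rank factor
B = np.zeros((d_out, rank))                # trainable; zero init => no-op

def adapted_forward(x):
    # Base path plus the low-rank adapter path.
    return W @ x + B @ (A @ x)

x = rng.normal(size=d_in)
base = W @ x
out = adapted_forward(x)

# The adapter adds rank*(d_in + d_out) parameters versus d_in*d_out
# for the full matrix: 2048 versus 4096 in this toy configuration.
adapter_params = A.size + B.size
```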

Apple Intelligence includes a wide range of adapters, an effective way to extend the foundation model's functionality.

Apple represents adapter parameter values with 16 bits; for the on-device model with about 3 billion parameters, the 16-bit adapter parameters typically require tens of megabytes.

Adapter models can be dynamically loaded, temporarily cached in memory, and swapped, keeping the operating system responsive.
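Dynamic loading with a small resident cache could look like this illustrative sketch (not Apple's implementation): a least-recently-used cache keeps only the adapters for the current tasks in memory and evicts the rest.

```python
from collections import OrderedDict

class AdapterCache:
    def __init__(self, capacity=2):
        self.capacity = capacity
        self.cache = OrderedDict()   # name -> adapter weights, LRU order

    def load(self, name):
        if name in self.cache:
            self.cache.move_to_end(name)          # mark as recently used
        else:
            if len(self.cache) >= self.capacity:
                self.cache.popitem(last=False)    # evict least recently used
            self.cache[name] = f"weights:{name}"  # stand-in for a disk load

        return self.cache[name]

cache = AdapterCache(capacity=2)
cache.load("summarize")
cache.load("rewrite")
cache.load("proofread")            # evicts "summarize"
resident = list(cache.cache)
```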

Because user experience is the highest priority, Apple emphasizes human evaluation when benchmarking its models.

Summarization

The training data for summarization is based on synthetic summaries generated by the larger server model and filtered through a rejection sampling strategy that keeps only high-quality summaries.

To evaluate product-specific summarization, Apple carefully sampled a set of 750 responses for each use case.

The evaluation dataset covers a variety of inputs that Apple's product features may encounter in production, including hierarchical combinations of single documents and stacked documents of different content types and lengths.

Evaluating the summarization feature also accounts for inherent risks, such as the model occasionally omitting important details.

Based on graders' ratings across five dimensions, each summary is classified as good, medium, or poor.

Experimental results show that the model with the adapter generates better summaries than comparable models.

Moreover, on over 99% of targeted adversarial examples, the summarization adapter does not amplify sensitive content.

General capabilities

To assess the general capabilities of the on-device and server models, Apple evaluates them with a comprehensive set of real-world prompts.

These prompts span a range of difficulty levels and cover major categories such as brainstorming, classification, closed-ended question answering, coding, extraction, mathematical reasoning, open-ended question answering, rewriting, safety, summarization, and writing.

Apple compared its models with open-source models (Phi-3, Gemma, Mistral, DBRX) and with commercial models of comparable size (GPT-3.5-Turbo, GPT-4-Turbo).

The experiments show that human graders favor Apple's models over most competitors.

Apple's 3B on-device model outperforms larger models such as Phi-3-mini, Mistral-7B, and Gemma-7B, and Apple's server model is superior to DBRX-Instruct, Mixtral-8x22B, and GPT-3.5-Turbo while being more efficient.

Safety

Apple uses a separate set of adversarial prompts to test the models' handling of harmful content, sensitive topics, and factuality.

Each model's violation rate on these prompts is likewise measured by human assessment.

In side-by-side comparisons with competitors on safety prompts, human graders found Apple's responses to be safer and more helpful.

Instruction following

To further evaluate the models, Apple uses the instruction-following evaluation (IFEval) benchmark to compare them with models of similar size.

The results show that both Apple's on-device and server models follow detailed instructions better than open-source and commercial models of comparable size.

Finally, Apple evaluated writing ability on internal summarization and composition benchmarks covering a variety of writing instructions; these results do not involve the adapters used for specific features.

Editor / jayden


