
MLP-Mixer (Theory)

Published on 2024-11-08

TL;DR - This is the first article reporting on my journey studying the MLP-Mixer architecture. It covers the basics up to an intermediate level; the goal is not to reach an advanced level.

Reference: https://arxiv.org/abs/2105.01601

Introduction

The original paper states:

We propose the MLP-Mixer architecture (or “Mixer” for short), a competitive but conceptually and technically simple alternative, that does not use convolutions or self-attention. Instead, Mixer’s architecture is based entirely on multi-layer perceptrons (MLPs) that are repeatedly applied across either spatial locations or feature channels. Mixer relies only on basic matrix multiplication routines, changes to data layout (reshapes and transpositions), and scalar nonlinearities.

At first, this explanation wasn’t very intuitive to me. With that in mind, I decided to investigate other resources that could provide an easier explanation. In any case, let’s keep in mind that the proposed architecture is shown in the image below.

[Figure: MLP-Mixer architecture]


MLP-Mixer consists of per-patch linear embeddings, Mixer layers, and a classifier head. Mixer layers contain one token-mixing MLP and one channel-mixing MLP, each consisting of two fully-connected layers and a GELU nonlinearity. Other components include: skip-connections, dropout, and layer norm on the channels.


Unlike CNNs, which focus on local image regions through convolutions, and transformers, which use attention to capture relationships between image patches, the MLP-Mixer uses only two types of operations:

  • Patch Mixing (Spatial Mixing)

  • Channel Mixing

Image Patches

In general, from the figure below, we can see that this architecture is a "fancy" classification model, where the output of the fully connected layer is a class. The input for this architecture is an image, which is divided into patches.

In case you're not 100% sure what a patch is: when an image is divided into 'patches,' it means the image is split into smaller, equally sized square (or rectangular) sections, known as patches. Each patch represents a small, localized region of the original image. Instead of processing the entire image as a whole, these smaller patches are analyzed independently or in groups, often for tasks like object detection, classification, or feature extraction. Each patch is then fed to a "Per-patch Fully-connected" layer and subsequently to the "$N \times$ (Mixer Layer)" stack.

[Figure: MLP-Mixer]

How the MLP-Mixer Processes Patches

For a concrete example, let's take a 224x224 pixel image divided into 16x16 pixel patches. Dividing each side of the image gives

$\frac{224}{16} = 14$

So the 224x224 pixel image will be divided into a grid of 14 patches along the width and 14 patches along the height. This results in a 14x14 grid of patches. The total number of patches is then $14 \times 14 = 196$. After dividing the image into patches, each patch will contain $16 \times 16 = 256$ pixels. These pixel values can be flattened into a 1D vector of length 256.

It's crucial to note that if the image has multiple channels (like RGB with 3 channels), each patch will actually have $3 \times 16 \times 16 = 768$ values, because each pixel in an RGB image has three color channels.

Thus, for an RGB image:

  • Each patch can be represented as a vector of 768 values

  • We then have 196 patches, each represented by a 768-dimensional vector
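To make these shapes concrete, here is a minimal sketch of the patch split, assuming PyTorch; the dummy image and the 16-pixel patch size are just the numbers from the example above:

```python
import torch

img = torch.randn(3, 224, 224)                 # a dummy RGB image (C, H, W)
p = 16                                         # patch size from the example

# carve H and W into a 14x14 grid of 16x16 windows
patches = img.unfold(1, p, p).unfold(2, p, p)  # (3, 14, 14, 16, 16)
patches = patches.permute(1, 2, 0, 3, 4)       # (14, 14, 3, 16, 16)
patches = patches.reshape(14 * 14, 3 * p * p)  # flatten each patch to 768 values
print(patches.shape)                           # torch.Size([196, 768])
```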

Each patch (a 768-dimensional vector in our example) is projected into an embedding space of fixed dimension using a linear layer (MLP). This essentially gives each patch an embedding that is used as the input to the MLP-Mixer. More specifically, after dividing the image into patches, each patch is processed by the Per-patch Fully-connected layer (depicted in the second row of the figure).

Per-patch Fully Connected Layer

Each patch is essentially treated as a vector of values (as explained before, a 768-dimensional vector in our example). These values are passed through a fully connected linear layer, which transforms them into another vector. This is done for each patch independently, with the same weights shared across all patches. The role of this fully connected layer is to map each patch to a common embedding space, similar to how token embeddings are used in transformers.
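As a minimal sketch, assuming PyTorch: the layer is a single `nn.Linear` applied to every patch; the hidden size of 512 is an illustrative choice (roughly the scale of the paper's small models), not a fixed part of the architecture.

```python
import torch
from torch import nn

patches = torch.randn(196, 768)   # flattened patches from the previous step
embed = nn.Linear(768, 512)       # 512 is an illustrative hidden size
tokens = embed(patches)           # the same weights are applied to every patch
print(tokens.shape)               # torch.Size([196, 512])
```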

N x Mixer Layers

This part shows a set of stacked mixer layers that alternate between two different types of MLP-based operations:

1. Patch Mixing: In this layer, the model mixes information between patches. This means it looks at the relationships between patches in the image (i.e., across different spatial locations). It's achieved through an MLP applied along the patch dimension, which lets the model compute interactions between different spatial locations.

2. Channel Mixing: After patch mixing, the channel mixing layer processes the internal information of each patch independently. It looks at the relationships between different pixel values (or channels) within each patch by applying another MLP.

These mixer layers alternate between patch mixing and channel mixing. They are applied N times, where N is a hyperparameter controlling how many times the layer is repeated (the depth of the network). The goal of these layers is to mix both spatial (patch-wise) and channel-wise information across the entire image.

Global Average Pooling

After passing through several mixer layers, a global average pooling layer is applied. It averages the activations across all patches, summarizing the information from all patches into a single vector. This reduces the overall dimensionality and prepares the data for the final classification step, aggregating the learned features from the entire image in a compact way.
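In code, assuming PyTorch and the shapes of our running example (batch dimension omitted for brevity), the pooling is just a mean over the patch dimension:

```python
import torch

x = torch.randn(196, 512)   # output of the last mixer layer: (patches, channels)
pooled = x.mean(dim=0)      # average over all 196 patches
print(pooled.shape)         # torch.Size([512])
```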

Fully Connected Layer for Classification

This layer is responsible for taking the output of the global average pooling and mapping it to a classification label. The fully connected layer takes the averaged features from the global pooling layer and uses them to make a prediction about the class of the image. The number of output units in this layer corresponds to the number of classes in the classification task.
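Continuing the sketch (still assuming PyTorch), the head is a single linear layer; the choice of 1000 classes below is illustrative, e.g. an ImageNet-style task:

```python
import torch
from torch import nn

pooled = torch.randn(512)      # pooled features from the previous step
head = nn.Linear(512, 1000)    # one output unit per class (1000 is illustrative)
logits = head(pooled)
print(logits.shape)            # torch.Size([1000])
```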

Quick Recap Until Now

  1. Input Image: the image is divided into small patches

  2. Per-patch Fully Connected Layer: Each patch is processed independently by a fully connected layer to create patch embeddings

  3. Mixer Layers: The patches are then passed through a series of mixer layers, where information is mixed spatially (between patches) and channel-wise (within patches)

  4. Global Average Pooling: The features from all patches are averaged to summarize the information

  5. Fully Connected Layer: Finally, the averaged features are used to predict the class label of the image

Overview of the Mixer Layer Architecture

As stated before, the Mixer Layer architecture consists of two alternating operations:

  1. Patch Mixing

  2. Channel Mixing

Each mixing stage is handled by an MLP, and there are also skip-connections and layer normalization steps included. Let's go step by step through how an image is processed, using the image shown below as a reference:

[Figure: Mixer Layer]

Note that the image itself is not the input to this diagram. The image has already been divided into patches, as explained before, and each patch has been processed by the per-patch fully connected layer (linear embedding). The output of that stage is what enters the Mixer Layer. Here is how it goes:

  • Before reaching this layer, the image has already been divided into patches. These patches are flattened and embedded as vectors (after the per-patch fully connected layer)

  • The input to this diagram consists of tokens (one for each patch) and channels (the feature dimensions for each patch)

  • So the input is structured as a 2D tensor with: Patches (one patch per token in the sequence) and Channels (features within each patch)

  • In simple terms, as shown in the image we can think of the input as a matrix where: Each row represents a patch, and each column represents a channel (feature dimension)

Processing Inside the Mixer Layer

Now, let's break down the key operations in this Mixer Layer, which processes the patches that come from the previous stage:

1. Layer Normalization

  • The first step is the normalization of the input data to improve training stability. This happens before any mixing occurs.

2. Patch Mixing (First MLP Block)

  • After normalization, the first MLP block is applied. In this operation, the patches are mixed together. The idea here is to capture the relationships between different patches.

  • This operation transposes the input so that it focuses on the patches dimension

  • Then, an MLP is applied along the patch dimension. This is done to allow the model to exchange information between patches and learn how patches relate spatially.

  • Once this mixing is done, the output is transposed back to its original format, where the channels are the focus again.

3. Skip Connection

  • The architecture uses a skip-connection (a residual path that lets the input bypass the block and flow directly to its output) that adds the original input (before patch mixing) back to the output of the patch-mixing block. This helps avoid degradation of information during training.

4. Layer Normalization

  • Another layer normalization step is applied to the output of the patch (token) mixing operation before the next operation (channel mixing) is applied

5. Channel Mixing (Second MLP Block)

  • The second MLP block performs channel mixing. Here, each patch is processed independently by a second MLP (distinct from the token-mixing MLP, but with weights shared across all patches). The goal is to model relationships between the different channels (features) within each patch.

  • The input is processed along the channels dimension (without mixing information between different patches). The MLP learns to capture dependencies between the various features within each patch

6. Skip Connection (for Channel Mixing)

  • Similar to patch mixing, there's a skip-connection in the channel mixing block as well. This allows the model to retain the original input after the channel mixing operation and helps in stabilizing the learning process.
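Putting the six steps together: below is a minimal sketch of one Mixer layer, assuming PyTorch. The token and channel hidden sizes (256 and 2048) are illustrative, roughly matching the paper's Mixer-S/16 configuration, and the `mlp` helper is the two-layer block described in the next section.

```python
import torch
from torch import nn

def mlp(dim, hidden):
    # two fully connected layers with a GELU in between (detailed in the next section)
    return nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

class MixerLayer(nn.Module):
    def __init__(self, tokens=196, channels=512, token_hidden=256, channel_hidden=2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(channels)          # step 1: layer norm on the channels
        self.token_mix = mlp(tokens, token_hidden)   # step 2: patch (token) mixing MLP
        self.norm2 = nn.LayerNorm(channels)          # step 4: second layer norm
        self.channel_mix = mlp(channels, channel_hidden)  # step 5: channel mixing MLP

    def forward(self, x):                            # x: (batch, tokens, channels)
        y = self.norm1(x).transpose(1, 2)            # focus on the patch dimension
        y = self.token_mix(y).transpose(1, 2)        # mix across patches, transpose back
        x = x + y                                    # step 3: skip connection
        x = x + self.channel_mix(self.norm2(x))      # steps 4-6: norm, channel mix, skip
        return x

x = torch.randn(2, 196, 512)                         # a dummy batch of patch embeddings
print(MixerLayer()(x).shape)                         # torch.Size([2, 196, 512])
```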

With the step-by-step explanation above, we can summarize the key points:

  • Patch Mixing - this part processes the relationships between different patches (spatial mixing), allowing the model to understand global spatial patterns across the image

  • Channel Mixing - this part processes the relationships within each patch (channel-wise mixing), learning to capture dependencies between the features (such as pixel intensities or feature maps) within each patch

  • Skip Connections - The skip connections help the network retain the original input and prevent vanishing gradient problems, especially in deep networks

Inside the MLP Block

The MLP block used in the MLP-Mixer architecture can be seen in the image below:

[Figure: MLP Architecture]

1. Fully Connected Layer

  • The input (whether it's patch embeddings or channels, depending on the context) first passes through a fully connected layer. This layer performs a linear transformation of the input, meaning it multiplies the input by a weight matrix and adds a bias term.

2. GELU (Gaussian Error Linear Unit) Activation

  • After the first fully connected layer, a GELU activation function is applied. GELU is a non-linear activation function that is smoother than ReLU. It allows for a more fine-grained activation, as it approximates the behavior of a normal distribution. The formula for GELU is given by

$\mathrm{GELU}(x) = x \cdot \Phi(x)$

where $\Phi(x)$ is the cumulative distribution function of a standard Gaussian distribution. The choice of GELU in MLP-Mixer is intended to improve the model's ability to handle non-linearities in the data, which helps the model learn more complex patterns.

3. Second Fully Connected Layer

  • The output of the GELU activation passes through a second fully connected layer, which projects it back to the input dimension so that the skip connection can be applied. This FC, GELU, FC structure is exactly the "two fully-connected layers and a GELU nonlinearity" described earlier.
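To check the GELU formula numerically, here is a tiny sketch, assuming PyTorch; it compares $x \cdot \Phi(x)$ against the library's built-in exact GELU:

```python
import torch
from torch.distributions import Normal

x = torch.linspace(-3.0, 3.0, steps=7)
phi = Normal(0.0, 1.0).cdf(x)                      # Φ(x): standard Gaussian CDF
manual = x * phi                                   # GELU(x) = x · Φ(x)
builtin = torch.nn.functional.gelu(x)              # PyTorch's exact GELU
print(torch.allclose(manual, builtin, atol=1e-6))  # True
```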

How it works in context:

  • In the Token Mixing MLP, this block is applied across the patch (token) dimension, mixing information across different patches

  • In the Channel Mixing MLP, the same structure is applied across the channels dimension, mixing the information within each patch independently.

Intuition - The MLP block acts as the core computational unit within the MLP-Mixer. By using two fully connected layers with an activation function in between, it learns to capture both linear and non-linear relationships within the data, depending on where it is applied (either for mixing patches or channels).
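To tie the whole walkthrough together, here is a minimal end-to-end sketch, assuming PyTorch. All sizes (depth 8, width 512, 1000 classes) are illustrative, and the stride-16 convolution is just a convenient way to apply the same linear embedding to every 16x16 patch.

```python
import torch
from torch import nn

def mlp(dim, hidden):
    return nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

class MixerLayer(nn.Module):
    def __init__(self, tokens, channels, token_hidden, channel_hidden):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(channels), nn.LayerNorm(channels)
        self.token_mix = mlp(tokens, token_hidden)
        self.channel_mix = mlp(channels, channel_hidden)

    def forward(self, x):                                # x: (batch, tokens, channels)
        y = self.norm1(x).transpose(1, 2)
        x = x + self.token_mix(y).transpose(1, 2)        # patch mixing + skip
        return x + self.channel_mix(self.norm2(x))       # channel mixing + skip

class MLPMixer(nn.Module):
    def __init__(self, image=224, patch=16, channels=512, depth=8, classes=1000):
        super().__init__()
        tokens = (image // patch) ** 2                   # 196 patches in our example
        # a stride-`patch` convolution implements the per-patch linear embedding
        self.embed = nn.Conv2d(3, channels, kernel_size=patch, stride=patch)
        self.layers = nn.Sequential(
            *[MixerLayer(tokens, channels, 256, 2048) for _ in range(depth)])
        self.norm = nn.LayerNorm(channels)
        self.head = nn.Linear(channels, classes)         # classification head

    def forward(self, img):                              # img: (batch, 3, H, W)
        x = self.embed(img).flatten(2).transpose(1, 2)   # (batch, tokens, channels)
        x = self.norm(self.layers(x))
        return self.head(x.mean(dim=1))                  # global average pooling

print(MLPMixer()(torch.randn(1, 3, 224, 224)).shape)     # torch.Size([1, 1000])
```

Note that everything in this sketch is a linear layer, a transpose, a layer norm, or a GELU, which matches the paper's claim that Mixer relies only on basic matrix multiplications, changes to data layout, and scalar nonlinearities.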

Conclusion

The MLP-Mixer offers a unique approach to image classification by leveraging simple multilayer perceptrons (MLPs) for both spatial and channel-wise interactions. By dividing an image into patches and alternating between patch mixing and channel mixing layers, the MLP-Mixer efficiently captures global spatial dependencies and local pixel relationships without relying on traditional convolutional operations. This architecture provides a streamlined yet powerful method for extracting meaningful patterns from images, demonstrating that impressive performance can be achieved in deep learning tasks even with minimal reliance on complex operations. It is important to note, however, that this architecture can also be applied to other domains, such as time series forecasting, even though it was initially proposed for vision tasks.


If you've made it this far, I want to express my gratitude. I hope this has been helpful to someone other than just me!

✧⁺⸜(^-^)⸝⁺✧
