Computer Vision

These are mainly personal notes: I write things down as I read about them and test them.

Call for Partner or POC (Proof of Concept), Contact: TonTon ( at ) TWMAN.ORG
https://deep-learning-101.github.io/ | DEMO | https://huggingface.co/DeepLearning101

Optical Character Recognition (OCR)

Fine-tuning on medical diagnosis certificates and receipts with PaddleOCR's PPOCRLabel

Document Layout Analysis

Zejiang Shen, Ruochen Zhang, Melissa Dell, Benjamin Charles Germain Lee, Jacob Carlson and Weining Li, "LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis", arXiv preprint, arXiv:2103.15348, 2021.

https://layout-parser.github.io

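The goal is to simplify document layout analysis (document image analysis): documents in many different formats can be handled with only a few lines of code. In total the project collects 9 models and 5 datasets!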

Layout Analysis – in 4 Lines of Code
Transform document image analysis pipelines with the full power of Deep Learning.

Abstract

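Recent advances in document image analysis (DIA) have been primarily driven by the application of neural networks. Ideally, research outcomes could be easily deployed in production and extended for further investigation. However, various factors such as loosely organized codebases and sophisticated model configurations complicate the easy reuse of important innovations by a wide audience. Although there have been ongoing efforts to improve reusability and simplify deep learning (DL) model development in disciplines like natural language processing and computer vision, none of them are optimized for the challenges of the DIA domain. This represents a major gap in the existing toolkit, as DIA is central to academic research across a wide range of disciplines in the social sciences and humanities. This paper introduces layoutparser, an open-source library for streamlining the use of DL in DIA research and applications. The core layoutparser library comes with a set of simple and intuitive interfaces for applying and customizing DL models for layout detection, character recognition, and many other document processing tasks. To promote extensibility, layoutparser also incorporates a community platform for sharing both pre-trained models and full document digitization pipelines. The authors demonstrate that layoutparser is helpful for both lightweight and large-scale digitization pipelines in real-world use cases.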

Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li and Furu Wei, "TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models", arXiv preprint, arXiv:2109.10282, 2021.

PaperWithCode | HuggingFace (you can try it out in Colab first) | GitHub

Abstract

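The abstract first notes that existing approaches to text recognition / Document Layout Analysis most commonly use a CNN for image understanding and an RNN for character-level text generation (note that char-level and word-level mean different things in English and Chinese). The paper's main contribution is TrOCR, an end-to-end text recognition model built on a pre-trained image Transformer and a pre-trained text Transformer (in Chinese, "Transformer" can probably only be rendered as the name of the robot franchise?), which covers both image understanding and character-level text generation. The model can be pre-trained on large-scale synthetic data and then fine-tuned on human-labeled datasets. Beyond that, the paper claims state-of-the-art results. More importantly, the paper targets the text recognition task on document images only; text detection is a separate problem. In other words, you must already know where the text is XD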

Introduction

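The introduction covers a lot of background, including image Transformers, but the key point is in the following sentences. Read together with the architecture figure (TrOCR architecture. Taken from the original paper.), the description is fairly easy to follow: the input text image is first resized to 384×384 and then split into a sequence of 16×16 patches, which serve as the input to the image Transformer.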
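To make the numbers concrete, here is a small sketch of the arithmetic behind that resizing and splitting step. The function name `patch_grid` and the assumption of 3 colour channels are mine, not taken from the paper.

```python
def patch_grid(height, width, patch_size, channels=3):
    """Return (rows, cols, total_patches, values_per_patch) for an image.

    `channels=3` is an assumption (an RGB input); it is not stated in the notes.
    """
    if height % patch_size or width % patch_size:
        raise ValueError("height and width must be divisible by patch_size")
    rows = height // patch_size   # patches along the height
    cols = width // patch_size    # patches along the width
    values_per_patch = patch_size * patch_size * channels
    return rows, cols, rows * cols, values_per_patch


rows, cols, total, values = patch_grid(384, 384, 16)
print(rows, cols, total, values)  # 24 24 576 768
```

So a 384×384 image yields a sequence of 576 patches, which is the sequence length the encoder has to process.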

The encoder is initialized from a pre-trained ViT-style model (see these papers):

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR.

Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. 2021a. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, pages 10347–10357. PMLR.

Hangbo Bao, Li Dong, and Furu Wei. 2021. Beit: Bert pre-training of image transformers.

The decoder is initialized from a pre-trained BERT-style model (see these papers):

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach.

Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation.

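TrOCR uses a pre-trained image Transformer and a pre-trained text Transformer, which take advantage of large-scale unlabeled data for image understanding and language modeling, so no external language model is needed.

TrOCR does not need any complex convolutional network as its backbone, which makes the model easy to implement and maintain.

On OCR benchmark datasets, TrOCR handles both printed and handwritten text without any complex pre- or post-processing steps.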

TrOCR

Encoder

the encoder decomposes the input image into a batch of N = HW/P^2 foursquare patches with a fixed size of (P, P), while the width W and the height H of the resized image are guaranteed to be divisible by the patch size P. After that, the patches are flattened into vectors and linearly projected to D-dimension vectors, which are the patch embeddings and D is the hidden size of the Transformer through all of its layers.

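This step exists because the encoder cannot process a raw image directly. The image is first resized so that its height (H) and width (W) are both divisible by P, then cut into square P×P patches; each patch is flattened into a vector and linearly projected to D dimensions. Hmm, why? Answering that requires a more careful look at the Transformer itself (see the article "Transformer 架构逐层功能介绍和详细解释", a layer-by-layer walkthrough of the Transformer architecture).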
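The patch embedding described above can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: the image is a single-channel list of rows, the projection weights are random rather than learned, and there is no bias term or position embedding. The helper names (`split_into_patches`, `project`, `random_weights`) are hypothetical.

```python
import random


def split_into_patches(image, patch_size):
    """Split an H x W image (list of rows) into flattened P x P patches.

    Patches come back in row-major order; each is a flat list of P * P values.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("H and W must be divisible by the patch size P")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patches.append([
                image[y][x]
                for y in range(top, top + patch_size)
                for x in range(left, left + patch_size)
            ])
    return patches


def project(patches, weights):
    """Linearly project each flattened patch to D dimensions.

    `weights` holds D vectors, each as long as a flattened patch.
    """
    return [
        [sum(v * w for v, w in zip(patch, column)) for column in weights]
        for patch in patches
    ]


def random_weights(input_size, hidden_size, seed=0):
    """Stand-in for learned projection weights (assumption: random values)."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(input_size)]
            for _ in range(hidden_size)]


image = [[row * 4 + col for col in range(4)] for row in range(4)]  # toy 4 x 4 image
patches = split_into_patches(image, patch_size=2)                  # N = 4*4 / 2^2 = 4
embeddings = project(patches, random_weights(input_size=4, hidden_size=3))
print(len(patches), patches[0], len(embeddings[0]))  # 4 [0, 1, 4, 5] 3
```

Each of the N patches ends up as one D-dimensional vector, so the image becomes a sequence of tokens that a Transformer can treat the same way it treats word embeddings.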

Decoder

Model Initialization

Encoder Initialization

Decoder Initialization

Task Pipeline

Pre-training

Fine-tuning

This part briefly explains that the model is fine-tuned on text recognition tasks for printed and handwritten text.

Data Augmentation

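Finally, data augmentation follows the usual recipe for images: the input image is augmented by random rotation (-10 to 10 degrees), Gaussian blur, image dilation, image erosion, downscaling, or underlining, or it is kept as the original image.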
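As a rough illustration of how such an augmentation step might be wired up, the sketch below implements only a subset of the listed options (dilation, erosion, underlining, keeping the original) on a grayscale image stored as a list of rows. It assumes one option is picked uniformly at random per sample; rotation, Gaussian blur and downscaling are left out, and all function names are hypothetical. A real pipeline would more likely use an image library.

```python
import random


def _filter3x3(image, reducer):
    """Apply `reducer` (min or max) over each pixel's 3 x 3 neighbourhood."""
    height, width = len(image), len(image[0])
    return [
        [
            reducer(
                image[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
            )
            for x in range(width)
        ]
        for y in range(height)
    ]


def dilate(image):
    """Morphological dilation: each pixel becomes the max of its neighbourhood."""
    return _filter3x3(image, max)


def erode(image):
    """Morphological erosion: each pixel becomes the min of its neighbourhood."""
    return _filter3x3(image, min)


def underline(image, value=0):
    """Overwrite the bottom row with a horizontal line of `value`."""
    return [list(row) for row in image[:-1]] + [[value] * len(image[0])]


def keep_original(image):
    """Return an unchanged copy of the image."""
    return [list(row) for row in image]


AUGMENTATIONS = [dilate, erode, underline, keep_original]


def augment(image, rng=random):
    """Pick one augmentation uniformly at random and apply it."""
    return rng.choice(AUGMENTATIONS)(image)


image = [[255, 255, 255], [255, 0, 255], [255, 255, 255]]  # one dark pixel
print(erode(image))      # [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
print(underline(image))  # [[255, 255, 255], [255, 0, 255], [0, 0, 0]]
```

Passing a seeded `random.Random` instance as `rng` keeps the choice reproducible, which is handy when debugging a training run.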

Experiments