Personal Library - gordon123/learn2ComfyUI GitHub Wiki

A small personal library of interesting AI research.

https://paperswithcode.com/ https://aiforthai.in.th/service_bn.php https://course.fast.ai/

🔗 References for Text-to-Image Generative AI

Image generation arena: https://huggingface.co/spaces/ArtificialAnalysis/Text-to-Image-Leaderboard

| Topic | Short description | Link |
| --- | --- | --- |
| High-Resolution Image Synthesis with Latent Diffusion Models | The origin of Stable Diffusion | https://arxiv.org/abs/2112.10752 |
| Reproducible scaling laws for contrastive language-image learning | Scaling-law research on CLIP, the contrastive image-text model introduced by OpenAI | https://arxiv.org/abs/2212.07143 (see also https://github.com/mlfoundations/open_clip) |
| Adding Conditional Control to Text-to-Image Diffusion Models | How ControlNet works | ControlNet paper |

🔗 References for Sound/Voice/Music Generative AI

Audio datasets and the sound/voice-cloning arena: https://huggingface.co/spaces/TTS-AGI/TTS-Arena https://airesearch.in.th/releases/speech-emotion-dataset/ https://sites.ualberta.ca/~aacl2009/PDFs/WeinbergerKunath2009AACL.pdf https://inference.readthedocs.io/en/latest/models/model_abilities/audio.html#audio https://keithito.com/LJ-Speech-Dataset/ https://huggingface.co/datasets/CMKL/Porjai-Thai-voice-dataset-central

| Topic | Short description | Link |
| --- | --- | --- |
| PyThaiNLP: Thai Natural Language Processing in Python | TBA | https://arxiv.org/pdf/2312.04649 |
| PyThaiNLP open source | TBA | https://pythainlp.org/thai-tutorials/index.html |
| StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion | TBA | https://arxiv.org/abs/2306.07691 |
| MegaTTS 3: Zero-Shot Speech Synthesis | Rated 5/5: great quality and lightweight | https://github.com/bytedance/MegaTTS3 |
| AudioX: Diffusion Transformer for Anything-to-Audio Generation | Adds sound effects to video | Audio-X project page |

🔗 References for Video Generative AI

The arena: https://huggingface.co/spaces/ArtificialAnalysis/Text-to-Image-Leaderboard

| Topic | Short description | Link |
| --- | --- | --- |
| TBA | TBA | TBA |

🔗 Summary: Research Papers on Image-to-Text Models

| Topic | Paper Title | Authors / Source | Year | Link | Summary |
| --- | --- | --- | --- | --- | --- |
| Vision Transformer (ViT) | An Image is Worth 16x16 Words | Dosovitskiy et al. | 2020 | arXiv | Proposes using a Transformer for image classification without convolutional layers. |
| Convolutional Neural Networks (CNNs) | Gradient-Based Learning Applied to Document Recognition | LeCun et al. | 1998 | Paper | Explains the foundations of CNNs and their use in document recognition. |
| Recurrent Neural Networks (RNNs) | Long Short-Term Memory | Hochreiter & Schmidhuber | 1997 | Paper | Introduces LSTM, which helps RNNs handle long sequences better. |
| CLIP | Learning Transferable Visual Models From Natural Language Supervision | Radford et al. | 2021 | arXiv | A model that matches images and text effectively using contrastive learning. |
| BLIP | BLIP: Bootstrapping Language-Image Pre-training | Li et al. | 2022 | arXiv | A model that learns jointly from images and text for vision-language understanding and image captioning. |
| GPT-4 Vision (GPT-4V) | GPT-4 Technical Report | OpenAI | 2023 | Paper | Technical report on GPT-4, including its image-understanding capabilities. |
| SimVLM | Simple Visual Language Model Pretraining with Weak Supervision | Wang et al. | 2021 | arXiv | A model that combines language and vision learning, trained with weak supervision. |
| LLaVA (Large Language and Vision Assistant) | Visual Instruction Tuning | Liu et al. | 2023 | arXiv | A model that combines an LLM with image processing to answer questions about images. |
| Flamingo | A Visual Language Model for Few-Shot Learning | Alayrac et al. | 2022 | arXiv | A model that learns from few examples and works with both text and images. |
| Kosmos-2 | Grounding Multimodal Large Language Models to the World | Peng et al. | 2023 | arXiv | A multimodal LLM that grounds the text it generates in image regions. |
| GIT | Generative Image-to-Text Transformer | Wang et al. | 2022 | arXiv | A Transformer-based model that generates accurate image captions. |
| Show and Tell | A Neural Image Caption Generator | Vinyals et al. | 2015 | arXiv | One of the first models to use CNN + LSTM for automatic image captioning. |
| OCR (Optical Character Recognition) | What Is Wrong With Scene Text Recognition Model Comparisons? | Baek et al. | 2019 | arXiv | Analyzes problems in how OCR models for text in images are compared, and how to compare them. |
| Word2Vec | Efficient Estimation of Word Representations in Vector Space | Mikolov et al. | 2013 | arXiv | Introduces Word2Vec for converting words into vectors that capture semantic relationships. |
| GloVe | Global Vectors for Word Representation | Pennington et al. | 2014 | Stanford NLP | A technique that computes word vectors from word co-occurrence statistics. |
| BERT | Pre-training of Deep Bidirectional Transformers | Devlin et al. | 2019 | arXiv | A Transformer model that learns word context from both directions for NLP tasks. |
| Universal Sentence Encoder (USE) | Universal Sentence Encoder | Cer et al. | 2018 | arXiv | A model that converts sentences into vectors usable across a range of tasks. |
| OpenAI Embeddings | OpenAI Embeddings: Text Representations for Semantic Search | OpenAI | 2022 | Docs | A model that provides embeddings for search and semantic analysis of text. |
| Florence | Florence: A New Foundation Model for Computer Vision | arXiv:2111.11432 | 2021 | | Focuses on computer vision. |
| CLIP | Multimodal Foundation Models | Now Publishers | 2024 | | Matches images with text; used for retrieving information from images. |
| BLIP | Benchmark Evaluations of Large Vision-Language Models | arXiv:2501.02189 | 2025 | | Generates image captions and answers questions about images. |
| GPT-4 Vision | Unified Approaches for Vision-Language | ProQuest | 2024 | | Analyzes images and gives reasoned answers about their content. |
| SimVLM | Simple Visual Language Model | arXiv:2108.10904 | 2021 | | Uses weak supervision to produce effective image captions. |
| LLaVA | Multimodal Large Language Models | ACL Anthology | 2024 | | A model combining language and vision; answers questions about images using an LLM. |
| Flamingo | Few-Shot Learning for Vision-Language | DeepMind | 2022 | | Uses few-shot learning for tasks that require joint understanding of images and text. |
| Kosmos-2 | Grounded Multimodal Generation | Microsoft | 2023 | | Generates text grounded in images; used to create content tied to visual data. |
| GIT | Generative Image-to-Text Transformer | arXiv:2205.14100 | 2022 | | A highly accurate image-to-text generation model. |
| Show and Tell | A Neural Image Caption Generator | arXiv:1411.4555 | 2015 | | One of the first models to use CNN + LSTM for automatic image captioning. |
| OCR (Tesseract, TrOCR) | OCR for Vision-Language Tasks | arXiv:2401.02276 | 2024 | | Detects and extracts text from images (OCR), such as documents, symbols, and road signs. |
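
Several entries in this table (CLIP, Word2Vec, GloVe, Universal Sentence Encoder, OpenAI Embeddings) come down to comparing vectors. As a rough sketch, not taken from any of the listed papers, the snippet below shows how cosine similarity could rank candidate captions against an image embedding. The vectors are made-up toy values, and the names `cosine_similarity` and `best_match` are hypothetical helpers introduced only for illustration; a real pipeline would get its embeddings from a trained model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def best_match(query, candidates):
    """Return the key in `candidates` whose vector is most similar to `query`."""
    return max(candidates, key=lambda name: cosine_similarity(query, candidates[name]))


# Toy 3-dimensional embeddings, invented for illustration only (assumption).
image_embedding = [0.9, 0.1, 0.0]
caption_embeddings = {
    "a photo of a cat": [0.8, 0.2, 0.1],
    "a photo of a car": [0.1, 0.9, 0.2],
    "a diagram of a circuit": [0.0, 0.2, 0.9],
}

print(best_match(image_embedding, caption_embeddings))
```

With these toy values the first caption scores highest because its vector points in nearly the same direction as the image vector. Cosine similarity ignores vector length, so only direction matters when comparing embeddings this way.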