📚 Weekly AI Paper Digest

Period: 2026-03-23 ~ 2026-03-28 | Selection: the five most-noticed papers of the week


🏆 This Week's Top 5

| Rank | Paper | ⬆️ | Deep Dive |
|---|---|---|---|
| 🥇 | MinerU-Diffusion: Rethinking Document OC… | 125 | DD-051 |
| 🥈 | Omni-WorldBench: Towards a Comprehensive… | 122 | DD-052 |
| 🥉 | Speed by Simplicity: A Single-Stream Arc… | 115 | DD-053 |
| 4 | PixelSmile: Toward Fine-Grained Facial E… | 105 | DD-054 |
| 5 | Astrolabe: Steering Forward-Process Rein… | 105 | DD-055 |

🔍 This Week's Trends

Key Keywords

  • Efficient Generative Architectures: attempts to improve inference speed and quality at the same time by replacing autoregressive decoding, with its sequential latency, with diffusion decoding or single-stream designs.
  • Multimodal Generation: more mature techniques for handling different modalities together, such as joint audio and video generation or recovering both the layout and the text of a document.
  • 4D World Modeling: growing emphasis on four-dimensional world models that capture the passage of time and physical interaction, beyond plain 3D reconstruction or static image generation.
  • Fine-grained Control & Alignment: stronger control over generated output and stronger alignment techniques, such as subtle facial expression editing and video generation that matches human visual preferences.

Common Themes

This week's papers focus on the structural complexity and efficiency problems of existing generative models. A shared direction is to move away from autoregressive decoding and elaborate cross-attention structures, and to raise throughput and reduce errors with simplified architectures or new diffusion-based decoding schemes. The papers also show that fine-grained control, covering temporal dynamics (4D), interaction, and human preference, is becoming a central goal of AI research rather than generation alone.

Highlights

In OCR, one paper drops the conventional approach of generating text sequentially and proposes a new paradigm: treat the document as an inverse rendering problem and decode it with a diffusion model. In addition, a single-stream Transformer architecture that merges audio, video, and text into one token sequence is reported to make multimodal generation highly efficient without a complex model structure.
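To make the single-stream idea concrete, here is a minimal sketch of merging per-modality tokens into one tagged sequence. It is an illustration only, not the paper's implementation: the `Token` class, the `to_single_stream` function, the modality labels, and the text, audio, video ordering are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Token:
    modality: str  # hypothetical label: "text", "audio", or "video"
    value: int     # token id within that modality's own vocabulary


def to_single_stream(text: List[int], audio: List[int], video: List[int]) -> List[Token]:
    """Concatenate per-modality token ids into one tagged sequence.

    A single Transformer could then attend over the whole list, instead of
    running one stream per modality and bridging them with cross-attention.
    """
    stream: List[Token] = []
    for modality, ids in (("text", text), ("audio", audio), ("video", video)):
        stream.extend(Token(modality, i) for i in ids)
    return stream


stream = to_single_stream(text=[5, 9], audio=[101, 102, 103], video=[7])
print(len(stream))                    # 6
print([t.modality for t in stream])   # text, text, audio, audio, audio, video
```

The modality tag stands in for whatever a real model would use to tell the streams apart, for example a learned modality embedding added to each token.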

Practical Implications

When designing multimodal generative models, developers and researchers should consider whether unified token processing or non-autoregressive decoding can cut inference cost and improve performance, rather than combining complex modules. When building world models or video generation models, it is also important to go beyond simple visual-quality comparisons and validate practical usefulness with **recent benchmarks that evaluate dynamic interaction or human preference (such as Omni-WorldBench)**.


📑 Paper Summaries

🥇 1. MinerU-Diffusion: Rethinking Document OCR as Inverse Rendering via Diffusion Decoding

arXiv: 2603.22458 | ⬆️ 125 | → View Deep Dive | Tags: ocr diffusion-model inverse-rendering document-ai parallel-decoding vlm computer-vision

This paper matters because it replaces sequential OCR (Optical Character Recognition), which was slow and error-prone, with diffusion-based parallel decoding, presenting a new paradigm that recovers the spatial structure of a document far more efficiently and accurately.
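The speed argument for parallel decoding can be illustrated by counting model passes. The toy sketch below is not the MinerU-Diffusion algorithm: `parallel_unmask`, its `fraction` parameter, and the use of `target` as a stand-in for model predictions are assumptions made purely for illustration. A real diffusion decoder would typically decide which positions to commit based on model confidence.

```python
import math

MASK = object()  # sentinel for a position that has not been decoded yet


def autoregressive_passes(num_tokens: int) -> int:
    """Left-to-right decoding needs one model pass per output token."""
    return num_tokens


def parallel_unmask(target, fraction=0.5):
    """Toy parallel decoder.

    Starts fully masked and, on every pass, fills a fixed fraction of the
    still-masked positions at once. `target` stands in for the model's
    predictions, so this only counts passes; it does not model accuracy.
    """
    seq = [MASK] * len(target)
    passes = 0
    while any(tok is MASK for tok in seq):
        masked = [i for i, tok in enumerate(seq) if tok is MASK]
        k = max(1, math.ceil(len(masked) * fraction))
        for i in masked[:k]:
            seq[i] = target[i]
        passes += 1
    return seq, passes


text = list("0123456789abcdef")           # 16 output tokens
decoded, passes = parallel_unmask(text)
print(autoregressive_passes(len(text)))   # 16
print(passes)                             # 5
```

With 16 tokens and half of the remaining positions filled per pass, the toy decoder finishes in 5 passes (8, 4, 2, 1, 1) against 16 for left-to-right decoding. The gap widens as the output grows, which is the intuition behind the efficiency claim.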

📖 Detailed analysis: see → View Deep Dive for the in-depth analysis.


🥈 2. Omni-WorldBench: Towards a Comprehensive Interaction-Centric Evaluation for World Models

arXiv: 2603.22212 | ⬆️ 122 | → View Deep Dive | Tags: world-models benchmark video-generation embodied-ai evaluation-metrics causal-reasoning computer-vision

This paper is highly significant because it proposes a comprehensive benchmark that, for the first time, systematically evaluates the core capability of world models that existing evaluation methods have missed: the causal response to interaction.

📖 Detailed analysis: see → View Deep Dive for the in-depth analysis.


🥉 3. Speed by Simplicity: A Single-Stream Architecture for Fast Audio-Video Generative Foundation Model

arXiv: 2603.21986 | ⬆️ 115 | → View Deep Dive | Tags: audio-video-generation single-stream transformer multimodal generative-model human-centric open-source efficiency

This paper matters because it presents an open-source model that generates high-quality, synchronized audio and video with a single Transformer architecture and no complex multi-stream structure, securing both research extensibility and inference efficiency.

📖 Detailed analysis: see → View Deep Dive for the in-depth analysis.


4. PixelSmile: Toward Fine-Grained Facial Expression Editing

arXiv: 2603.25728 | ⬆️ 105 | → View Deep Dive | Tags: ai-paper ml

This paper matters because it addresses semantic overlap between expressions and presents a new framework and dataset that enable finely disentangled facial expression editing with continuous intensity control.

📖 Detailed analysis: see → View Deep Dive for the in-depth analysis.


5. Astrolabe: Steering Forward-Process Reinforcement Learning for Distilled Autoregressive Video Models

arXiv: 2603.17051 | ⬆️ 105 | → View Deep Dive | Tags: video-generation autoregressive-model rlhf diffusion-distillation real-time-inference online-rl computer-vision

This paper matters because it combines the efficiency of offline distillation with the human-preference optimization of online reinforcement learning (online RL), presenting a new paradigm for generating long, high-quality video in real time.

📖 Detailed analysis: see → View Deep Dive for the in-depth analysis.


📅 Generated: 2026-03-29 | 🤖 GLM-4.7 Weekly Digest