How to extract text in natural reading order (top to bottom, left to right) - pymupdf/PyMuPDF GitHub Wiki
Easiest way
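First of all, use SortedCollection, a container class that keeps its items ordered by a key function. It is not part of PyMuPDF or the Python standard library, so it has to be made available separately before running the code.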
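If no SortedCollection implementation is at hand, a small stand-in is enough for this page. The sketch below is an assumption, not the original recipe: it supports only what the example relies on, namely a `key` argument, `insert()`, and iteration. The word tuples in the usage part are made-up sample data in the `(x0, y0, x1, y1, text)` layout.

```python
from bisect import bisect_right
from operator import itemgetter


class SortedCollection:
    """Hypothetical minimal stand-in: keeps items ordered by key(item)."""

    def __init__(self, iterable=(), key=None):
        self._key = key if key is not None else (lambda item: item)
        self._items = sorted(iterable, key=self._key)
        self._keys = [self._key(item) for item in self._items]

    def insert(self, item):
        # Find the slot by key; items with equal keys keep insertion order.
        k = self._key(item)
        i = bisect_right(self._keys, k)
        self._keys.insert(i, k)
        self._items.insert(i, item)

    def __iter__(self):
        return iter(self._items)

    def __len__(self):
        return len(self._items)


# Made-up sample words: (x0, y0, x1, y1, text)
sample_words = [
    (50, 10, 80, 20, "world"),
    (10, 10, 40, 20, "Hello"),
    (10, 30, 40, 40, "again"),
]
collection = SortedCollection(key=itemgetter(3, 0))
for w in sample_words:
    collection.insert(w)
ordered_text = [w[4] for w in collection]
```

Keeping a parallel list of keys lets `bisect_right` locate the insertion point without calling the key function on every stored item.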
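from operator import itemgetter
from itertools import groupby
import fitz

# NOTE: SortedCollection is not provided by fitz or the standard library;
# it must be imported or defined separately before this code runs.

doc = fitz.open('mydocument.pdf')
for page in doc:
    # Each word is a tuple: (x0, y0, x1, y1, text, block_no, line_no, word_no)
    text_words = page.get_text_words()
    # The words should be ordered by y1 (bottom edge), then x0 (left edge)
    sorted_words = SortedCollection(key=itemgetter(3, 0))
    for word in text_words:
        sorted_words.insert(word)
    # At this point you already have an ordered list. If you need to
    # group the content by lines, use groupby with y1 as a key.
    # groupby is lazy: consume each group before moving on to the next.
    lines = groupby(sorted_words, key=itemgetter(3))
    # Enjoy!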
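Grouping on the exact `y1` value only merges words whose bottom edges are identical floats, so words on the same visual line with slightly different `y1` end up in separate groups. The following is a minimal sketch of a tolerant variant using only the built-in `sorted`. The function name `words_to_lines` and the `tolerance` default are assumptions chosen for illustration, and the sample tuples are made up, not taken from a real PDF.

```python
from operator import itemgetter


def words_to_lines(words, tolerance=3.0):
    """Group word tuples (x0, y0, x1, y1, text, ...) into lines of text.

    A word starts a new line when its y1 is more than `tolerance`
    below the y1 of the first word of the current line.
    """
    lines = []          # each entry is the list of word tuples on one line
    current_y1 = None   # y1 of the first word of the line being built
    for word in sorted(words, key=itemgetter(3, 0)):
        if current_y1 is None or word[3] - current_y1 > tolerance:
            lines.append([])
            current_y1 = word[3]
        lines[-1].append(word)
    # Sorting by (y1, x0) can interleave words whose y1 differs slightly,
    # so each line is re-sorted left to right before joining.
    return [
        " ".join(w[4] for w in sorted(line, key=itemgetter(0)))
        for line in lines
    ]


# Made-up sample words with slightly uneven bottom edges on the first line.
sample = [
    (50, 10, 80, 20.4, "world"),
    (10, 10, 40, 20.0, "Hello"),
    (10, 30, 40, 40.0, "again"),
    (90, 10, 120, 19.8, "!"),
]
result = words_to_lines(sample)
```

With a real page, the input would be the list returned by `page.get_text_words()`. A suitable tolerance depends on the font sizes in the document, so the default here is only a starting point.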