Tips - rootiest/obsidian-ai-image-ocr GitHub Wiki
# 💡 Tips & Notes
## Token Limits
- Large batches of images may exceed the token limits of your selected AI model.
- If you encounter errors, try processing fewer images or wait for automatic batch splitting in a future release.
## Gemini Models
Gemini models offer significantly higher token limits than most others.
For example, Gemini 2.5 Flash supports input sizes up to 1,048,576
tokens. Additionally, image token costs are much lower in Gemini compared
to other models — you can typically attach around 1,000 images to a
single request (assuming an average resolution of 1536×864 pixels).
Gemini enforces a hard limit of 3,000 images per request, regardless of the token total.
Image token costs in Gemini are calculated using tiles, where each tile is a 768×768 square. Larger images are priced based on the number of tiles needed to cover them.
- Each tile costs exactly 258 tokens, even if only partially filled.
The table below shows approximate image counts per request for various average sizes:
Tiles/Image | Max Images | Image Dimensions |
---|---|---|
1 | 3000 | 768 × 432 |
2 | 2031 | 1365 × 768 |
3 | 1354 | 1536 × 864 |
4 | 1016 | 1536 × 864 (2×2 tiles) |
6 | 677 | 2304 × 1296 |
9 | 451 | 2304 × 1296 (3×3 tiles) |
16 | 253 | 3072 × 1728 |
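The tile arithmetic above can be sketched as a simple ceiling-based estimate. The function and constant names here are illustrative, not part of the plugin's API, and Gemini's actual cropping/scaling logic may differ slightly for images that are far from square:

```typescript
// Tile-based token estimate for Gemini image inputs.
// Each 768×768 tile costs 258 tokens, even if only partially filled.
const TILE_SIZE = 768;
const TOKENS_PER_TILE = 258;
const MAX_IMAGES_PER_REQUEST = 3000; // Gemini's hard per-request cap

function tilesFor(width: number, height: number): number {
  // Number of 768×768 tiles needed to cover the image.
  return Math.ceil(width / TILE_SIZE) * Math.ceil(height / TILE_SIZE);
}

function imageTokenCost(width: number, height: number): number {
  return tilesFor(width, height) * TOKENS_PER_TILE;
}

function maxImages(width: number, height: number, tokenBudget: number): number {
  // How many same-sized images fit in the token budget,
  // capped by the 3,000-image hard limit.
  const byTokens = Math.floor(tokenBudget / imageTokenCost(width, height));
  return Math.min(byTokens, MAX_IMAGES_PER_REQUEST);
}
```

With a 1,048,576-token budget, a 1536 × 864 image covers 2 × 2 = 4 tiles (1,032 tokens), giving roughly 1,016 images per request, while a 768 × 432 image costs a single tile and hits the 3,000-image cap before the token limit.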
## OpenAI Models
OpenAI does not publicly disclose its image token pricing.
Estimates are based on experimental data and are subject to change.
The GPT-4o model supports up to 128,000 tokens of input.
Here are rough estimates of token costs for various image sizes:
Image Size | Estimated Token Cost |
---|---|
512 × 512 | ~750 tokens |
640 × 640 | ~900–1100 tokens |
1024 × 1024 | ~1500–2000 tokens |
These values are approximations. Actual token usage may vary.
## Automation
Available starting in version 1.0.0
The plugin will include new automation features to help manage large image batches more intelligently:
- Estimate token usage before sending a request
- Automatically detect token limits for the selected model
- Split large batches into smaller requests that stay within the model's limits
- Use one request per image for non-Gemini models, where batching is typically not viable
> [!NOTE]
> Token estimation is only available for select providers:
> - Gemini models support accurate token counts via a built-in API function.
>   → Support for Gemini token estimation is planned for v1.0.0
> - OpenAI offers a local estimation library, but the API does not expose token counts.
>   → OpenAI support may be added in a future release
> - Other providers (including local models and OpenAI-compatible APIs) may not support token estimation.
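The planned batch splitting could work roughly like the greedy sketch below. All names here are hypothetical, and `estimateTokens` stands in for whatever per-provider estimator the plugin ends up using (e.g. Gemini's token-counting API):

```typescript
// Greedy batch splitting: pack images into requests whose estimated
// token totals stay within the model's limit. An image whose cost
// alone exceeds the limit still gets its own single-image request.
interface ImageRef {
  path: string;
  width: number;
  height: number;
}

function splitIntoBatches(
  images: ImageRef[],
  estimateTokens: (img: ImageRef) => number,
  tokenLimit: number,
): ImageRef[][] {
  const batches: ImageRef[][] = [];
  let current: ImageRef[] = [];
  let used = 0;
  for (const img of images) {
    const cost = estimateTokens(img);
    if (current.length > 0 && used + cost > tokenLimit) {
      // Current batch is full; start a new one.
      batches.push(current);
      current = [];
      used = 0;
    }
    current.push(img);
    used += cost;
  }
  if (current.length > 0) batches.push(current);
  return batches;
}
```

For non-Gemini models, the same function with a token limit below twice the per-image cost degenerates into one request per image, matching the behavior described above.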
## Prompt Customization
You can enter a custom prompt that guides the model to extract specific information.
- There are no restrictions on the prompt text, but keep the following tips in mind:
  - Instruct the model to extract text from the image.
  - Instruct the model to format the output (Markdown, plain text, etc.).
  - Instruct the model to respond only with the extracted text (no commentary, explanations, etc.).
- Leave the prompt blank to use the plugin's default behavior.
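As an illustration, a custom prompt that follows these tips might read as follows. This exact wording is not built into the plugin; adapt it to your own notes:

```text
Extract all text from the image. Preserve the reading order and format
the output as Markdown, using headings and lists where the original
layout suggests them. Respond only with the extracted text, with no
commentary or explanations.
```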
### Batched Images Prompt
When processing batched images, the plugin attempts to use a single API request to handle as many images as possible at once in order to reduce API costs.
To correctly extract and separate the text for each image from the combined response, the language model must be instructed to wrap each image’s output in predefined markers.
To ensure consistency, this instruction is hard-coded and automatically appended to both the default and any custom prompt. The appended text is:
    For each image, wrap the response using the following format:

    --- BEGIN IMAGE: ---
    <insert OCR text>
    --- END IMAGE ---

    Repeat this for each image.
This text cannot be customized. It will always be appended to any custom prompt you provide.
If the language model fails to include these markers—either due to an oversight or because your custom prompt overrides the instruction—the batch will be treated as a single image, and the output will be processed accordingly.
> [!NOTE]
> If your image contains the literal strings `--- BEGIN IMAGE: ---` or `--- END IMAGE ---`, they may be misinterpreted as delimiters by the parser. This is unlikely with most handwritten or printed text, but be aware that the plugin cannot distinguish model-inserted delimiters from identical extracted text.
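The delimiter-based splitting described above can be sketched as follows. This is an illustrative parser, not the plugin's actual implementation; note how a response without markers falls back to being treated as a single image:

```typescript
// Split a batched OCR response into per-image results using the
// hard-coded delimiters described above.
const BEGIN = "--- BEGIN IMAGE: ---";
const END = "--- END IMAGE ---";

function splitBatchResponse(response: string): string[] {
  const results: string[] = [];
  let pos = 0;
  while (true) {
    const start = response.indexOf(BEGIN, pos);
    if (start === -1) break;
    const end = response.indexOf(END, start + BEGIN.length);
    if (end === -1) break; // unterminated block: stop parsing
    results.push(response.slice(start + BEGIN.length, end).trim());
    pos = end + END.length;
  }
  // No markers found: treat the whole response as one image.
  return results.length > 0 ? results : [response.trim()];
}
```

A response containing the delimiter strings inside the OCR text itself would be split at those points, which is the ambiguity the note above warns about.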
## File and Folder Naming
Please be mindful of OS/filesystem restrictions on filenames and folder names.
Some filesystems, such as those typically used on Linux, place little to no restriction on naming, so the plugin does not enforce any restrictions of its own.
If your system does not allow a filename that was generated by the plugin, an error will occur and the extraction will fail to complete.
Additionally, if you use Obsidian Sync or other sync services, you may run into issues with filenames that are too long or contain characters that are not allowed on some systems.
For reference, some commonly disallowed characters include:
- `\` (backslash)
- `/` (forward slash)
- `:` (colon)
- `*` (asterisk)
- `?` (question mark)
- `"` (double quote)
- `<` (less than)
- `>` (greater than)
- `|` (vertical bar / pipe)
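If you want to pre-sanitize generated names yourself, a minimal helper might replace the characters listed above and truncate to a typical per-component length limit. This is a sketch, not something the plugin performs for you:

```typescript
// Replace characters commonly disallowed on Windows/macOS filesystems
// and truncate to a typical 255-character per-component limit.
const DISALLOWED = /[\\/:*?"<>|]/g;
const MAX_NAME_LENGTH = 255;

function sanitizeFilename(name: string): string {
  const cleaned = name.replace(DISALLOWED, "_").trim();
  return cleaned.slice(0, MAX_NAME_LENGTH);
}
```

For example, `sanitizeFilename('notes: a/b?')` yields `notes_ a_b_`, which is safe on all common filesystems and sync services.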