Making Docling UI into Hugging Face Space - cereal-d3v/Docling GitHub Wiki

How to make a Hugging Face Space

1. Set up your Hugging Face Space (if you haven't already):

  • Go to [Hugging Face Spaces](https://huggingface.co/spaces) and click "Create new Space."
  • Give it a name (e.g., Docling-Ui), choose an SDK (Gradio, Streamlit, Docker, etc., based on what docling-serve is designed for), and set its visibility.

2. Add your Hugging Face Space as a remote to your docling-serve GitHub repository:

  • Navigate to your local clone of the docling-serve GitHub repository.

    cd path/to/your/docling-serve-repo
    
  • Add the Hugging Face Space as a new remote. You'll use the URL of your Hugging Face Space, which follows the format https://huggingface.co/spaces/YOUR_USERNAME/YOUR_SPACE_NAME.git.

    git remote add space https://huggingface.co/spaces/CerealDev/Docling-Ui.git
    

    (Replace CerealDev/Docling-Ui with your actual Space URL.)
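    The URL format above can be wrapped in a small helper so the remote is added once and simply updated on later runs. This is a minimal sketch: the function names space_url and add_space_remote are illustrative assumptions, not part of git or any Hugging Face tooling.

```shell
# Build the Git URL of a Space from a username ($1) and a Space name ($2).
space_url() {
  printf 'https://huggingface.co/spaces/%s/%s.git\n' "$1" "$2"
}

# Add the remote "space" if it is missing; otherwise point it at the new URL.
# Must be run inside the local clone of the repository.
add_space_remote() {
  _url="$(space_url "$1" "$2")"
  if git remote get-url space >/dev/null 2>&1; then
    git remote set-url space "$_url"
  else
    git remote add space "$_url"
  fi
}
```

    For example, add_space_remote CerealDev Docling-Ui followed by git remote -v should list the space remote alongside origin.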

3. Push your docling-serve repository to the Hugging Face Space:

  • Initial Push (Force Push Recommended for First Sync): For the very first push to an empty or intended-to-be-overwritten Hugging Face Space, it's common to force push to ensure everything syncs correctly and overwrites any placeholder content.

    git push --force space main
    

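    (Or master if your docling-serve repo's main branch is master).

    You will be prompted for your Hugging Face username and password. Enter a Hugging Face access token with write permission as the password; you can create one at https://huggingface.co/settings/tokens.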

4. Set up GitHub Actions for automatic synchronization (Highly Recommended):

This is the best way to keep your Hugging Face Space in sync with your GitHub repository. Every time you push to your GitHub repo, the GitHub Action will automatically push those changes to your Hugging Face Space.

  • Create a GitHub Actions workflow file: In your docling-serve GitHub repository, create a directory .github/workflows/ if it doesn't exist. Inside it, create a new file (e.g., sync_to_hf_space.yml):

    name: Sync to Hugging Face Hub
    
    on:
      push:
        branches: [main] # or master, depending on your main branch name
      workflow_dispatch: # Allows manual triggering from GitHub Actions tab
    
    jobs:
      sync-to-hub:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
            with:
              fetch-depth: 0
              lfs: true # Essential if you have large files tracked with Git LFS
    
          - name: Push to Hugging Face Space
            env:
              HF_TOKEN: ${{ secrets.HF_TOKEN }} # This will be a GitHub Secret
            run: |
              git config --global user.email "YOUR_EMAIL@example.com" # Replace with your email address
              git config --global user.name "GitHub Action"
              git push https://CerealDev:$HF_TOKEN@huggingface.co/spaces/CerealDev/Docling-Ui.git main # Replace with your actual username and Space name
    
  • Add your Hugging Face Token as a GitHub Secret:

    1. Go to your docling-serve GitHub repository.
    2. Click the "Settings" tab in the repository's top navigation bar.
    3. Go to "Secrets and variables" > "Actions" > "Repository secrets."
    4. Click "New repository secret."
    5. Name it HF_TOKEN (must match the name in the workflow file).
    6. Paste your Hugging Face access token (the one you entered when prompted in step 3; it needs write permission) into the "Secret value" field.
    7. Click "Add secret."
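    If the secret is missing or misnamed, GitHub Actions passes an empty string to the workflow and the push fails with an authentication error that is easy to misread. A guard like the following could run at the top of the run: block to fail early with a clearer message. It is only a sketch; the function name require_hf_token is a hypothetical helper, not something provided by GitHub or Hugging Face.

```shell
# Stop with a readable message when the HF_TOKEN secret is missing or empty.
require_hf_token() {
  if [ -z "${HF_TOKEN:-}" ]; then
    echo "HF_TOKEN is not set; add it under Settings > Secrets and variables > Actions" >&2
    return 1
  fi
}
```

    In the workflow, this would be called as require_hf_token || exit 1 before the git push line.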

Now, whenever you push changes to the main branch of your docling-serve GitHub repository, the GitHub Action will automatically push those changes to your Hugging Face Space, and it will handle Git LFS files correctly.

This setup provides a robust and automated way to manage your Hugging Face Space content from your dedicated GitHub repository.

If the push is rejected, check the error message. Hugging Face, like many Git hosting services, uses pre-receive hooks to limit repository size and content. One rejection is for "files larger than 10 MiB"; another is explicitly for "binary files" (here, .png and .pdf files). The recommended fix is the same in both cases: Git Large File Storage (Git LFS).

This applies even though docling-serve lives in a separate GitHub repo: if it contains such binary files and Git LFS is not configured for them, the push to Hugging Face fails with this error. Hugging Face's infrastructure is designed to work with Git LFS for assets like images, PDFs, models, and datasets.

Here's a detailed guide to fix this using Git LFS:

Understanding Git LFS

Git LFS replaces large files in your Git repository with small pointer files. When you clone or pull, Git LFS downloads the actual large files from a separate LFS server. Hugging Face's infrastructure supports this seamlessly.
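To make the idea of a pointer file concrete, the sketch below prints the three lines that a Git LFS (spec v1) pointer contains for a given file: the spec version, the SHA-256 of the content, and the size in bytes. The function name lfs_pointer is an illustrative assumption; in a real repository Git LFS generates these pointers itself, and git lfs pointer --file=<path> should show the equivalent output.

```shell
# Print a Git LFS (spec v1) style pointer for the file given as $1.
lfs_pointer() {
  if command -v sha256sum >/dev/null 2>&1; then
    _oid="$(sha256sum "$1" | cut -d ' ' -f 1)"
  else
    _oid="$(shasum -a 256 "$1" | cut -d ' ' -f 1)"   # fallback, e.g. on macOS
  fi
  _size="$(wc -c < "$1" | tr -d ' ')"                # size in bytes
  printf 'version https://git-lfs.github.com/spec/v1\noid sha256:%s\nsize %s\n' "$_oid" "$_size"
}
```

Only this small text file is stored in the Git history; the actual content is uploaded to LFS storage under the oid shown in the pointer.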

Steps to Implement Git LFS and Push Successfully

You'll need to do this from the local clone of the repository that contains these offending files (likely your docling-serve repository, or whichever local repo you're currently trying to push from).

  1. Install Git LFS (if you haven't already): If you don't have Git LFS installed on your system, you need to do that first.

    • macOS (using Homebrew):
      brew install git-lfs
      
    • Windows (using Chocolatey):
      choco install git-lfs
      
    • Linux (using apt):
      curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
      sudo apt-get install git-lfs
      
    • Other (or manual): Visit https://git-lfs.github.com/ for download links and instructions.

    After installation, initialize Git LFS for your user:

    git lfs install
    
  2. Navigate to your Repository: Make sure you are in the root directory of the docling-serve (or relevant) Git repository that contains the img/ and tests/ folders.

    cd path/to/your/docling-serve-repo
    
  3. Track the Binary Files with Git LFS: You need to tell Git LFS which file types to track. The git lfs track command creates or updates a .gitattributes file in your repository, which Git uses to know which files are handled by LFS.

    For the rejection described above, you need to track .png and .pdf files.

    git lfs track "*.png"
    git lfs track "*.pdf"
    

    This command adds entries like *.png filter=lfs diff=lfs merge=lfs -text to your .gitattributes file.
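    To confirm that a given path is actually routed through LFS, you can ask Git which filter applies to it. The helper below is a sketch built on git check-attr; the name is_lfs_tracked is an assumption made for illustration. It must be run inside the repository, and the file itself does not need to exist.

```shell
# Succeed if .gitattributes assigns the "lfs" filter to the path in $1.
is_lfs_tracked() {
  [ "$(git check-attr filter -- "$1" | sed 's/.*: filter: //')" = "lfs" ]
}
```

    For example, is_lfs_tracked img/ui-output.png && echo "handled by LFS" should print the message once the *.png entry is present in .gitattributes.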

  4. Stage .gitattributes: The .gitattributes file is crucial because it tells Git (and Hugging Face) how to handle these files. Stage it now; it must be part of the commit in step 6.

    git add .gitattributes
    
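  5. Re-add the Offending Files (Crucial Step for Existing Files): If the large files were already committed to Git before you configured LFS, the index still holds them as regular blobs. Re-add them so Git LFS converts them to pointers in the next commit. Note that this only affects new commits: earlier commits still contain the raw binary data, so if the push is still rejected you may need to rewrite history with git lfs migrate import (see its --include and --everything options).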

    git add img/ui-output.png
    git add tests/2206.01062v1.pdf
    git add tests/2408.09869v5.pdf
    # Add any other new or existing large files that were rejected
    

    Tip: If you have many such files, or if they are scattered, re-stage them in bulk once LFS tracking is set up: git add . stages all changes, and a path such as git add img/ re-stages everything in one directory. If Git reports nothing to commit, git add --renormalize . forces tracked files through the newly configured filter.

  6. Commit Your Changes: Commit the .gitattributes file and the re-added binary files.

    git commit -m "Configure Git LFS for binary files and re-track existing ones"
    
  7. Push to Hugging Face Space: Now, when you push, Git LFS will intercept the large files and upload them to Hugging Face's LFS storage, while Git commits only the small pointer files.

    git push space main
    

    (space is the Hugging Face remote added earlier. Use origin instead only if the Space is the origin remote of this clone.)
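Before pushing, it can help to list any files above the 10 MiB limit mentioned in the rejection message, so you can check that each one is covered by an LFS pattern. The following is a minimal sketch using find; the function name find_large_files is hypothetical, and the threshold is taken from the error text rather than from any documented guarantee.

```shell
# List regular files larger than 10 MiB (10485760 bytes) under $1 (default: .),
# skipping Git's own .git directory.
find_large_files() {
  find "${1:-.}" -path '*/.git' -prune -o -type f -size +10485760c -print
}
```

Running find_large_files from the repository root and comparing the result with git lfs ls-files (which lists files stored through LFS) should show whether anything large is still tracked as a regular blob.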

Why This Happens and Why LFS is the Solution:

  • Git's Design: Git is optimized for tracking changes in text-based code. Large binary files, when committed directly, bloat the repository history, making cloning, pushing, and pulling slow and inefficient.
  • Hugging Face's Restrictions: Hugging Face Spaces have limits on direct binary file size in the main Git repository to maintain performance and manage storage efficiently. They strongly encourage LFS for such assets.
  • The Error "binary files": This is a specific check on the type of file, not just its size, indicating that Hugging Face wants these non-text files handled by LFS. The previous "larger than 10 MiB" was a more general size check.

By following these steps, you should successfully push your docling-serve repository, including the binary assets, to your Hugging Face Space.