Loading articles from the PID Provider source - scieloorg/core GitHub Wiki

Procedure for Loading Articles from the PID Provider

Prerequisites

  • PidProviderXML records already created with proc_status = TODO
  • XML files stored and accessible via current_version.file.path
  • A valid user, so that the run can be audited

Running the Load

1. Dispatch the main task:

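# Assumes task_load_articles is already imported and that `user`
# is the user the run will be attributed to.
task_load_articles.apply_async(kwargs={
    'user_id': user.id,
    'username': user.username
})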

2. The task then automatically:

  • Fetches every PidProviderXML record with status TODO
  • Calls xmlsps.load_article() for each record
  • Validates the article that was created
  • Updates the record status to DONE if the article is valid
  • Dispatches task_mark_articles_as_deleted_without_pp_xml
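
The steps above can be sketched as a plain Python loop. This is only an illustration of the control flow, not the project's implementation: the `Article` and `PidProviderXML` dataclasses, `fake_load_article`, and the `on_error` and `cleanup` callbacks are hypothetical stand-ins for the Django models, `xmlsps.load_article()`, `UnexpectedEvent.create`, and the cleanup task. It also assumes that a record whose article is invalid, or whose load raises, simply keeps its TODO status.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Article:
    """Hypothetical stand-in for the Article model."""
    pid_v3: str
    valid: bool = False


@dataclass
class PidProviderXML:
    """Hypothetical stand-in for a PidProviderXML record."""
    pid_v3: str
    xml_path: str
    proc_status: str = "TODO"


def load_articles(
    records: Iterable[PidProviderXML],
    load_article: Callable[[PidProviderXML], Article],
    on_error: Callable[[PidProviderXML, Exception], None],
    cleanup: Callable[[], None],
) -> List[Article]:
    """Process every TODO record, then run the cleanup step once."""
    loaded = []
    for record in records:
        if record.proc_status != "TODO":
            continue  # only pending records are processed
        try:
            article = load_article(record)
        except Exception as exc:
            on_error(record, exc)  # log the failure and move on
            continue
        loaded.append(article)
        if article.valid:
            record.proc_status = "DONE"  # assumption: invalid stays TODO
    cleanup()  # stands in for the automatically dispatched cleanup task
    return loaded


def fake_load_article(record: PidProviderXML) -> Article:
    """Hypothetical loader: raises on a missing path, flags 'bad' XML as invalid."""
    if not record.xml_path:
        raise ValueError("XML file not accessible")
    return Article(pid_v3=record.pid_v3, valid="bad" not in record.xml_path)


records = [
    PidProviderXML("pid-1", "/xml/good.xml"),
    PidProviderXML("pid-2", "/xml/bad.xml"),
    PidProviderXML("pid-3", ""),
]
errors = []
cleanup_calls = []
articles = load_articles(
    records,
    fake_load_article,
    on_error=lambda rec, exc: errors.append((rec.pid_v3, str(exc))),
    cleanup=lambda: cleanup_calls.append(True),
)
print([rec.proc_status for rec in records])  # ['DONE', 'TODO', 'TODO']
```

Only the first record reaches DONE: the second produces an invalid article and the third fails before an article exists, so both remain candidates for a later run.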

3. Complementary processing (optional):

# Generate additional formats (PDF, HTML)
task_convert_xml_to_other_formats_for_articles.apply_async(kwargs={
    'user_id': user.id
})

# Fill in missing metadata
task_articles_complete_data.apply_async(kwargs={
    'user_id': user.id
})
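
As the flowchart at the end of this page describes it, the format conversion works as a coordinator that selects articles with a non-null sps_pkg_name and hands each one to an individual conversion step, which generates a format only when it is missing or when force_update is set. The sketch below illustrates that pattern with in-memory objects. The `Article` dataclass, `TARGET_FORMATS`, `convert_one`, and `convert_all` are hypothetical names, and the coordinator calls the individual step directly instead of dispatching a Celery task.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional, Set


@dataclass
class Article:
    """Hypothetical stand-in for the Article model."""
    pid_v3: str
    sps_pkg_name: Optional[str] = None
    # Stands in for the ArticleFormat rows linked to the article.
    formats: Set[str] = field(default_factory=set)


# Illustrative only; the real set of generated formats may differ.
TARGET_FORMATS = ("pdf", "html")


def convert_one(article: Article, force_update: bool = False) -> List[str]:
    """Generate the formats that are missing, or all of them when forced."""
    generated = []
    for fmt in TARGET_FORMATS:
        if force_update or fmt not in article.formats:
            article.formats.add(fmt)
            generated.append(fmt)
    return generated


def convert_all(
    articles: Iterable[Article], force_update: bool = False
) -> Dict[str, List[str]]:
    """Coordinator: pick articles with a package name and convert each one."""
    return {
        article.pid_v3: convert_one(article, force_update)
        for article in articles
        if article.sps_pkg_name
    }


articles = [
    Article("pid-1", "pkg-1"),
    Article("pid-2", None),            # skipped: no sps_pkg_name yet
    Article("pid-3", "pkg-3", {"pdf"}),  # already has a PDF
]
first_run = convert_all(articles)
second_run = convert_all(articles)
forced_run = convert_all(articles, force_update=True)
print(first_run)  # {'pid-1': ['pdf', 'html'], 'pid-3': ['html']}
```

A second run without force_update generates nothing, which is why re-dispatching the conversion task should be safe under these assumptions. Articles skipped for lacking sps_pkg_name are the ones the metadata completion task is meant to fix.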

Monitoring

  • Follow the logs to identify failures recorded via UnexpectedEvent
  • Check for articles with valid=False, which need reprocessing
  • Confirm that processed records changed status from TODO to DONE
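
The checks above amount to counting records per status and listing the articles that failed validation. A minimal sketch, assuming the data has already been exported as simple `(pid, status)` and `(pid, valid)` pairs; the `summarize` helper and the sample values are hypothetical and not part of the project.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple, Union


def summarize(
    records: Iterable[Tuple[str, str]],
    articles: Iterable[Tuple[str, bool]],
) -> Dict[str, Union[Dict[str, int], List[str]]]:
    """Count records per proc_status and list the pids with valid=False."""
    status_counts = Counter(status for _, status in records)
    to_reprocess = sorted(pid for pid, valid in articles if not valid)
    return {
        "status_counts": dict(status_counts),
        "to_reprocess": to_reprocess,
    }


# Sample data for illustration only.
records = [("pid-1", "DONE"), ("pid-2", "TODO"), ("pid-3", "DONE")]
articles = [("pid-1", True), ("pid-2", False), ("pid-3", True)]

report = summarize(records, articles)
print(report["status_counts"])  # {'DONE': 2, 'TODO': 1}
print(report["to_reprocess"])   # ['pid-2']
```

In a Django shell the same numbers would presumably come from queryset filters on proc_status and valid, but the exact model fields should be checked against the codebase.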

Result

Structured articles are created in the Article table, ready for publication, and orphaned records are cleaned up automatically.

graph TD
    %% Starting Point - PidProviderXML Records
    START[🚀 Starting Point PidProviderXML created proc_status = TODO]
    
    %% Main Article Loading Task
    LOAD_TASK[📊 task_load_articles • Query: proc_status=TODO • Iterate through records • Process each XML]
    
    %% Core XML Processing
    XML_LOAD[⚙️ xmlsps.load_article • Parse XML content • Extract metadata • Create Article object • Validate data]
    
    %% Decision Point
    VALIDATION{🔍 Article.valid?}
    
    %% Success Path
    ARTICLE_CREATED[✅ Article Created • Structured data • All fields populated • valid = True]
    
    UPDATE_STATUS[📝 Update PidProviderXML proc_status = DONE]
    
    %% Failure Path
    ARTICLE_INVALID[❌ Article Invalid • Parsing errors • Missing data • valid = False]
    
    ERROR_LOG[📋 Log Exception UnexpectedEvent.create]
    
    %% Automatic Cascade - Cleanup Task
    CLEANUP_TASK[🗑️ task_mark_articles_as_deleted • Find articles with pp_xml=None • Mark as DATA_STATUS_DELETED • Clean orphaned records]
    
    %% Parallel Processing Tasks
    FORMAT_COORDINATOR[📄 task_convert_xml_to_other_formats • Query: sps_pkg_name not null • Dispatch individual tasks]
    
    FORMAT_INDIVIDUAL[🔧 convert_xml_to_other_formats • Check existing ArticleFormat • Generate if missing/force_update • Create multiple formats]
    
    DATA_COORDINATOR[📝 task_articles_complete_data • Iterate all Articles • Dispatch completion tasks]
    
    DATA_INDIVIDUAL[🔍 article_complete_data • Check missing sps_pkg_name • Generate from pid_v3 • Update record]
    
    %% Generated Outputs
    ARTICLE_FORMAT[📚 ArticleFormat • PDF version • HTML version • Other formats]
    
    COMPLETED_ARTICLE[📋 Enhanced Article • Complete metadata • All formats available • Ready for publication]
    
    %% Database States
    PID_TODO[(🗄️ PidProviderXML proc_status = TODO)]
    PID_DONE[(🗄️ PidProviderXML proc_status = DONE)]
    ARTICLE_DB[(🗄️ Article Structured data)]
    
    %% Flow Connections
    START --> PID_TODO
    PID_TODO --> LOAD_TASK
    LOAD_TASK --> XML_LOAD
    XML_LOAD --> VALIDATION
    
    %% Success Flow
    VALIDATION -->|Yes| ARTICLE_CREATED
    ARTICLE_CREATED --> ARTICLE_DB
    ARTICLE_CREATED --> UPDATE_STATUS
    UPDATE_STATUS --> PID_DONE
    
    %% Failure Flow
    VALIDATION -->|No| ARTICLE_INVALID
    ARTICLE_INVALID --> ERROR_LOG
    ARTICLE_INVALID --> ARTICLE_DB
    
    %% Automatic Cascade
    LOAD_TASK --> CLEANUP_TASK
    
    %% Parallel Processing Branches
    ARTICLE_DB --> FORMAT_COORDINATOR
    FORMAT_COORDINATOR --> FORMAT_INDIVIDUAL
    FORMAT_INDIVIDUAL --> ARTICLE_FORMAT
    
    ARTICLE_DB --> DATA_COORDINATOR
    DATA_COORDINATOR --> DATA_INDIVIDUAL
    DATA_INDIVIDUAL --> COMPLETED_ARTICLE
    
    %% Final Integration
    ARTICLE_FORMAT --> COMPLETED_ARTICLE
    
    %% Error Recovery Loops
    ERROR_LOG -.-> |Retry Logic| LOAD_TASK
    PID_TODO -.-> |items_to_load_article_with_valid_false| LOAD_TASK
    
    %% Status Conditions
    PID_TODO -.-> |proc_status=TODO| LOAD_TASK
    ARTICLE_DB -.-> |sps_pkg_name≠null| FORMAT_COORDINATOR
    ARTICLE_DB -.-> |pid_v3 exists & sps_pkg_name=null| DATA_COORDINATOR
    ARTICLE_DB -.-> |pp_xml=None| CLEANUP_TASK
    
    %% Styling
    classDef startPoint fill:#e3f2fd,stroke:#0277bd,stroke-width:3px
    classDef mainTask fill:#e8f5e8,stroke:#2e7d32,stroke-width:2px
    classDef processing fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    classDef decision fill:#fce4ec,stroke:#c2185b,stroke-width:2px
    classDef success fill:#e8f5e8,stroke:#388e3c,stroke-width:2px
    classDef failure fill:#ffebee,stroke:#d32f2f,stroke-width:2px
    classDef database fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
    classDef parallel fill:#e0f2f1,stroke:#00695c,stroke-width:2px
    classDef output fill:#e1f5fe,stroke:#0288d1,stroke-width:2px
    
    class START startPoint
    class LOAD_TASK,CLEANUP_TASK mainTask
    class XML_LOAD processing
    class VALIDATION decision
    class ARTICLE_CREATED,UPDATE_STATUS success
    class ARTICLE_INVALID,ERROR_LOG failure
    class PID_TODO,PID_DONE,ARTICLE_DB database
    class FORMAT_COORDINATOR,FORMAT_INDIVIDUAL,DATA_COORDINATOR,DATA_INDIVIDUAL parallel
    class ARTICLE_FORMAT,COMPLETED_ARTICLE output