Loading articles from the PID Provider source - scieloorg/core GitHub Wiki
Procedure for Loading Articles from the PID Provider
Prerequisites
- PidProviderXML records already created with proc_status = TODO
- XMLs stored and accessible via current_version.file.path
- A valid user for auditing
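The prerequisites above can be checked before anything is dispatched. The sketch below is a framework-free illustration only: PendingRecord and find_unready are hypothetical names, not part of scieloorg/core. In the real project the records would be PidProviderXML rows and the path would come from current_version.file.path.

```python
import os
import tempfile
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PendingRecord:
    """Hypothetical stand-in for a PidProviderXML row."""
    pid_v3: str
    proc_status: str
    xml_path: Optional[str]  # stands in for current_version.file.path


def find_unready(records: List[PendingRecord]) -> List[PendingRecord]:
    """Return TODO records whose XML file is missing or unreadable."""
    unready = []
    for record in records:
        if record.proc_status != "TODO":
            continue  # only pending records are candidates for the load
        path = record.xml_path
        if not path or not os.path.isfile(path) or not os.access(path, os.R_OK):
            unready.append(record)
    return unready


# Usage: one readable file, one missing file, one record already processed.
with tempfile.NamedTemporaryFile(suffix=".xml", delete=False) as handle:
    handle.write(b"<article/>")
    existing_path = handle.name

records = [
    PendingRecord("pid-ok", "TODO", existing_path),
    PendingRecord("pid-missing", "TODO", "/nonexistent/file.xml"),
    PendingRecord("pid-done", "DONE", None),
]
unready_pids = [r.pid_v3 for r in find_unready(records)]
os.remove(existing_path)
```

Records returned by a check like this are worth fixing first, since a missing file would presumably surface later as a load failure.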
Running the Load
1. Trigger the main task:
task_load_articles.apply_async(kwargs={
'user_id': user.id,
'username': user.username
})
2. The task automatically:
- Fetches all PidProviderXML records with status TODO
- For each record, calls xmlsps.load_article()
- Validates the created article
- Updates the status to DONE if the article is valid
- Triggers task_mark_articles_as_deleted_without_pp_xml
3. Complementary processing (optional):
# Generate additional formats (PDF, HTML)
task_convert_xml_to_other_formats_for_articles.apply_async(kwargs={
'user_id': user.id
})
# Fill in missing metadata
task_articles_complete_data.apply_async(kwargs={
'user_id': user.id
})
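The per-record behaviour described in step 2 can be sketched without Django or Celery. Everything below is illustrative: XmlRecord, LoadedArticle, load_pending and fake_loader are hypothetical names, and the loader is injected as a callable where the real task calls xmlsps.load_article(). Catching exceptions per record is an assumption based on the failure path in the diagram, not a confirmed detail of the implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class XmlRecord:
    """Hypothetical stand-in for a PidProviderXML row."""
    pid_v3: str
    proc_status: str = "TODO"


@dataclass
class LoadedArticle:
    """Hypothetical stand-in for an Article row."""
    pid_v3: str
    valid: bool


def load_pending(
    records: List[XmlRecord],
    load_article: Callable[[XmlRecord], LoadedArticle],
    errors: List[Dict[str, str]],
) -> List[LoadedArticle]:
    """Load every TODO record; mark it DONE only if the article is valid."""
    articles = []
    for record in records:
        if record.proc_status != "TODO":
            continue
        try:
            article = load_article(record)
        except Exception as exc:  # assumption: one bad XML does not stop the batch
            errors.append({"pid_v3": record.pid_v3, "error": str(exc)})
            continue
        articles.append(article)
        if article.valid:
            record.proc_status = "DONE"
    return articles


def fake_loader(record: XmlRecord) -> LoadedArticle:
    """Toy loader: raises for 'broken', returns an invalid article for 'partial'."""
    if record.pid_v3 == "broken":
        raise ValueError("unparseable XML")
    return LoadedArticle(record.pid_v3, valid=record.pid_v3 != "partial")


records = [XmlRecord("good"), XmlRecord("partial"), XmlRecord("broken")]
errors: List[Dict[str, str]] = []
articles = load_pending(records, fake_loader, errors)
statuses = {r.pid_v3: r.proc_status for r in records}
```

In this sketch, records that fail or produce an invalid article stay in TODO, so a later run picks them up again.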
Monitoring
- Watch the logs to identify failures recorded via UnexpectedEvent
- Check for articles with valid=False that need reprocessing
- Confirm the status change TODO → DONE on the processed records
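The checks above amount to a small summary over record statuses and article validity. The following is a plain-Python sketch; summarize is a hypothetical helper, and in a Django shell the same numbers would more likely come from queries on proc_status and valid, whose exact form this page does not show.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def summarize(
    record_statuses: Iterable[str],
    article_validity: Iterable[Tuple[str, bool]],
) -> Dict[str, object]:
    """Count TODO/DONE records and list the articles that failed validation."""
    status_counts = Counter(record_statuses)
    invalid = [pid for pid, valid in article_validity if not valid]
    return {
        "todo": status_counts.get("TODO", 0),
        "done": status_counts.get("DONE", 0),
        "invalid_pids": invalid,
    }


# Usage with illustrative data: two records processed, one still pending.
report = summarize(
    ["DONE", "DONE", "TODO"],
    [("a", True), ("b", True), ("c", False)],
)
```

A non-empty invalid_pids list or a TODO count that does not shrink between runs points to records that need a closer look.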
Result
Structured articles are created in the Article table, ready for publication, and orphaned records are cleaned up automatically.
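The orphan cleanup can be pictured as a single pass over the articles. This sketch follows the diagram's description (articles with pp_xml=None are marked as DATA_STATUS_DELETED). ArticleRow and mark_orphans_deleted are hypothetical names, and the string values of both status constants are placeholders, since only the name DATA_STATUS_DELETED appears in the source.

```python
from dataclasses import dataclass
from typing import List, Optional

DATA_STATUS_DELETED = "DELETED"  # placeholder value; only the name is from the source
DATA_STATUS_ACTIVE = "ACTIVE"    # hypothetical default, for illustration only


@dataclass
class ArticleRow:
    """Hypothetical stand-in for an Article row."""
    pid_v3: str
    pp_xml: Optional[str]  # reference to the originating PidProviderXML, if any
    data_status: str = DATA_STATUS_ACTIVE


def mark_orphans_deleted(articles: List[ArticleRow]) -> int:
    """Mark articles that have no pp_xml as deleted; return how many changed."""
    changed = 0
    for article in articles:
        if article.pp_xml is None and article.data_status != DATA_STATUS_DELETED:
            article.data_status = DATA_STATUS_DELETED
            changed += 1
    return changed


articles = [
    ArticleRow("linked", pp_xml="xml-1"),
    ArticleRow("orphan", pp_xml=None),
]
changed = mark_orphans_deleted(articles)
final_statuses = [a.data_status for a in articles]
```

Marking rather than removing rows keeps the orphaned articles available for inspection, which matches the "mark as deleted" wording of the task name.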
graph TD
%% Starting Point - PidProviderXML Records
START[🚀 Starting Point PidProviderXML created proc_status = TODO]
%% Main Article Loading Task
LOAD_TASK[📊 task_load_articles • Query: proc_status=TODO • Iterate through records • Process each XML]
%% Core XML Processing
XML_LOAD[⚙️ xmlsps.load_article • Parse XML content • Extract metadata • Create Article object • Validate data]
%% Decision Point
VALIDATION{🔍 Article.valid?}
%% Success Path
ARTICLE_CREATED[✅ Article Created • Structured data • All fields populated • valid = True]
UPDATE_STATUS[📝 Update PidProviderXML proc_status = DONE]
%% Failure Path
ARTICLE_INVALID[❌ Article Invalid • Parsing errors • Missing data • valid = False]
ERROR_LOG[📋 Log Exception UnexpectedEvent.create]
%% Automatic Cascade - Cleanup Task
CLEANUP_TASK[🗑️ task_mark_articles_as_deleted • Find articles with pp_xml=None • Mark as DATA_STATUS_DELETED • Clean orphaned records]
%% Parallel Processing Tasks
FORMAT_COORDINATOR[📄 task_convert_xml_to_other_formats • Query: sps_pkg_name not null • Dispatch individual tasks]
FORMAT_INDIVIDUAL[🔧 convert_xml_to_other_formats • Check existing ArticleFormat • Generate if missing/force_update • Create multiple formats]
DATA_COORDINATOR[📝 task_articles_complete_data • Iterate all Articles • Dispatch completion tasks]
DATA_INDIVIDUAL[🔍 article_complete_data • Check missing sps_pkg_name • Generate from pid_v3 • Update record]
%% Generated Outputs
ARTICLE_FORMAT[📚 ArticleFormat • PDF version • HTML version • Other formats]
COMPLETED_ARTICLE[📋 Enhanced Article • Complete metadata • All formats available • Ready for publication]
%% Database States
PID_TODO[(🗄️ PidProviderXML proc_status = TODO)]
PID_DONE[(🗄️ PidProviderXML proc_status = DONE)]
ARTICLE_DB[(🗄️ Article Structured data)]
%% Flow Connections
START --> PID_TODO
PID_TODO --> LOAD_TASK
LOAD_TASK --> XML_LOAD
XML_LOAD --> VALIDATION
%% Success Flow
VALIDATION -->|Yes| ARTICLE_CREATED
ARTICLE_CREATED --> ARTICLE_DB
ARTICLE_CREATED --> UPDATE_STATUS
UPDATE_STATUS --> PID_DONE
%% Failure Flow
VALIDATION -->|No| ARTICLE_INVALID
ARTICLE_INVALID --> ERROR_LOG
ARTICLE_INVALID --> ARTICLE_DB
%% Automatic Cascade
LOAD_TASK --> CLEANUP_TASK
%% Parallel Processing Branches
ARTICLE_DB --> FORMAT_COORDINATOR
FORMAT_COORDINATOR --> FORMAT_INDIVIDUAL
FORMAT_INDIVIDUAL --> ARTICLE_FORMAT
ARTICLE_DB --> DATA_COORDINATOR
DATA_COORDINATOR --> DATA_INDIVIDUAL
DATA_INDIVIDUAL --> COMPLETED_ARTICLE
%% Final Integration
ARTICLE_FORMAT --> COMPLETED_ARTICLE
%% Error Recovery Loops
ERROR_LOG -.-> |Retry Logic| LOAD_TASK
PID_TODO -.-> |items_to_load_article_with_valid_false| LOAD_TASK
%% Status Conditions
PID_TODO -.-> |proc_status=TODO| LOAD_TASK
ARTICLE_DB -.-> |sps_pkg_name≠null| FORMAT_COORDINATOR
ARTICLE_DB -.-> |pid_v3 exists & sps_pkg_name=null| DATA_COORDINATOR
ARTICLE_DB -.-> |pp_xml=None| CLEANUP_TASK
%% Styling
classDef startPoint fill:#e3f2fd,stroke:#0277bd,stroke-width:3px
classDef mainTask fill:#e8f5e8,stroke:#2e7d32,stroke-width:2px
classDef processing fill:#fff3e0,stroke:#f57c00,stroke-width:2px
classDef decision fill:#fce4ec,stroke:#c2185b,stroke-width:2px
classDef success fill:#e8f5e8,stroke:#388e3c,stroke-width:2px
classDef failure fill:#ffebee,stroke:#d32f2f,stroke-width:2px
classDef database fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
classDef parallel fill:#e0f2f1,stroke:#00695c,stroke-width:2px
classDef output fill:#e1f5fe,stroke:#0288d1,stroke-width:2px
class START startPoint
class LOAD_TASK,CLEANUP_TASK mainTask
class XML_LOAD processing
class VALIDATION decision
class ARTICLE_CREATED,UPDATE_STATUS success
class ARTICLE_INVALID,ERROR_LOG failure
class PID_TODO,PID_DONE,ARTICLE_DB database
class FORMAT_COORDINATOR,FORMAT_INDIVIDUAL,DATA_COORDINATOR,DATA_INDIVIDUAL parallel
class ARTICLE_FORMAT,COMPLETED_ARTICLE output