Report1 - datasets-br/public-person GitHub Wiki

Hi, this is a first draft; I will publish it on git next week.

TSE-candidatos data analysis

There are 3.5 million (3520846) CSV rows to be checked for uniqueness of person; records are expected to be repeated for each election year.

  • All (100%) have a valid name (full name) property.
  • 59% (2081277 rows) have a valid birthDate property.
  • 43% (1523871 rows) have a valid vatID (Brazilian CPF) property.
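
As a minimal sketch of what a "valid" vatID could mean, the standard CPF check-digit rule (two modulo-11 digits computed over the first nine) can be written as below. The function name `cpf_checksum_ok` is hypothetical, and this is not necessarily the exact test used to produce the counts above. Note that a checksum-only test accepts repeated-digit values such as 99999999999, so they need a separate rule if they are to be rejected.

```python
import re


def cpf_checksum_ok(cpf: str) -> bool:
    """Return True when both CPF check digits match the first nine digits."""
    digits = [int(c) for c in re.sub(r"\D", "", cpf)]  # drop dots and dashes
    if len(digits) != 11:
        return False
    for n in (9, 10):
        # Weights run from n+1 down to 2 over the first n digits.
        total = sum(d * w for d, w in zip(digits[:n], range(n + 1, 1, -1)))
        check = (total * 10) % 11 % 10  # a remainder of 10 maps to 0
        if digits[n] != check:
            return False
    return True


print(cpf_checksum_ok("45258007472"))  # CPF quoted in this report
```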

This is the profile of the CSV files:

  • There are 310 source files consulta_cand*.txt (by state and year), and their fields are not uniform.

  • Supposing uniformity, the relevant fields are at positions 3, 6, 11, 14 and 27 (counting from 1), with secondary fields at positions 15, 28 and 31. These are assumed to correspond, respectively, to the fields (of the LEIAME.PDF documentation) ANO_ELEICAO, SIGLA_UF, NOME_CANDIDATO, CPF_CANDIDATO, DATA_NASCIMENTO, NOME_URNA_CANDIDATO, NUM_TITULO_ELEITORAL_CANDIDATO, SEXO.

  • Supposing "newer is better": when a record is repeated, we can select only the latest version.
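
The profile above can be sketched as a small extraction step followed by a "keep the newest" pass. This is only an illustration under stated assumptions: the `;` delimiter, the helper names (`read_rows`, `extract`, `keep_newest`) and the choice of (name, birth date, CPF) as the identity key are all hypothetical, and rows shorter than 31 fields are simply skipped rather than handled.

```python
import csv
import io

# 1-based column positions from the text, mapped to the LEIAME.PDF field names.
POSITIONS = {
    "ANO_ELEICAO": 3, "SIGLA_UF": 6, "NOME_CANDIDATO": 11,
    "CPF_CANDIDATO": 14, "DATA_NASCIMENTO": 27,
    "NOME_URNA_CANDIDATO": 15, "NUM_TITULO_ELEITORAL_CANDIDATO": 28, "SEXO": 31,
}


def read_rows(text, delimiter=";"):
    """Parse CSV text; the delimiter is an assumption about the source files."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return [row for row in reader if len(row) >= 31]


def extract(row):
    """Pick the relevant fields from one CSV row (a list of strings)."""
    return {name: row[pos - 1].strip() for name, pos in POSITIONS.items()}


def keep_newest(records):
    """Keep one record per (name, birth date, CPF), preferring the latest year."""
    newest = {}
    for rec in records:
        key = (rec["NOME_CANDIDATO"], rec["DATA_NASCIMENTO"], rec["CPF_CANDIDATO"])
        old = newest.get(key)
        if old is None or int(rec["ANO_ELEICAO"]) >= int(old["ANO_ELEICAO"]):
            newest[key] = rec
    return list(newest.values())
```

With this key, the primary fields are identical across repeated rows, so "newer is better" only decides which secondary fields (ballot name, voter registration number, sex) survive.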

Filtering the CSV lines for a valid name, valid birth date and valid CPF checksum, then removing duplicates, leaves 1538241 records:

  • 89% (1375305 rows) have distinct names;
  • ... There are a few names with problems, such as "0DÁRICO", "0SEAS", and "ABRRAÃO".
  • 89% (1375973 rows) have distinct CPFs (the remaining 11%, 135485 rows, are repeated CPFs for distinct name-birth pairs);
  • ~1% (14270 rows) have empty CPF;
  • ~100 rows have two or more distinct CPFs for the same name and same birth date.
  • ... Grouping by CPF: there are 21 cases of "CPF 99999999999" (invalid, but a false value accepted by TSE), some CPFs with more than 5 repetitions (e.g. CPFs 02990529829 and 32844085253), 859 cases with more than 3 repetitions, and 11419 cases with more than 2 repetitions.
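
The "grouping by CPF" counts can be sketched as follows. Here a "repetition" is taken to mean a distinct (name, birth date) pair sharing the same CPF, which is an assumption about how the figures above were obtained. The helper names `cpf_repetitions` and `count_over` are hypothetical.

```python
from collections import defaultdict


def cpf_repetitions(records):
    """Map each CPF to the number of distinct (name, birth date) pairs using it."""
    pairs = defaultdict(set)
    for name, birth, cpf in records:
        if cpf:  # rows with an empty CPF are skipped here
            pairs[cpf].add((name, birth))
    return {cpf: len(found) for cpf, found in pairs.items()}


def count_over(reps, threshold):
    """Number of CPFs shared by more than `threshold` distinct name-birth pairs."""
    return sum(1 for n in reps.values() if n > threshold)
```

For example, `count_over(cpf_repetitions(records), 2)` would give the number of CPFs with more than 2 repetitions under this reading.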

Common examples of the same person with two names:

  • Orthographic error: Marli and Marly, both with birth date 1965-12-27 and CPF 45258007472. DELIZOM and DELIZON (CPF 21963541987), JOSE and JOSÉ (CPF 71404449949), etc.

  • Real change of name (e.g. after marriage): "MARIA DO CARMO PRATES COSTA" and "MARIA DO CARMO PRATES MARQUES", both 1965-03-09 and CPF 37921800572.
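
Cases like these can be surfaced by grouping on (CPF, birth date) and listing the keys that carry more than one distinct name. The sketch below also flags the subset that differs only by accents (the JOSE/JOSÉ kind). It does not try to separate spelling errors from real name changes, and the function names are hypothetical.

```python
import unicodedata
from collections import defaultdict


def strip_accents(s):
    """Remove combining marks, so that 'JOSÉ' compares equal to 'JOSE'."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def name_variants(records):
    """Return {(cpf, birth): sorted names} for keys seen with 2+ distinct names."""
    names = defaultdict(set)
    for name, birth, cpf in records:
        names[(cpf, birth)].add(name)
    return {key: sorted(found) for key, found in names.items() if len(found) > 1}


def accent_only(names):
    """True when all names in the group are equal once accents are removed."""
    return len({strip_accents(n) for n in names}) == 1
```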


Queries:


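-- Count the distinct CPFs among persons holding exactly one CPF.
-- A JSON array with a single 11-digit CPF, e.g. ["12345678901"], is 15 characters long as text.
SELECT count(DISTINCT (info->'vatid_cpfs')->>0) as n_unique_cpfs
FROM   pubperson.person
WHERE  length((info->>'vatid_cpfs')::text)=15
;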