Report1 - datasets-br/public-person GitHub Wiki
Hi, this is a first draft; I will publish it on git next week.
TSE-candidatos data analysis
There are 3.5 million (3520846) CSV rows to be checked for uniqueness of person; records are expected to repeat for each election year.
- all (100%) have a valid name (full name) property.
- 59% (2081277 rows) have a valid birthDate property.
- 43% (1523871 rows) have a valid vatID (Brazilian CPF) property.
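As a minimal sketch of what a "valid" CPF could mean here, the function below applies the standard mod-11 check-digit rule to an 11-digit CPF. The function name `cpf_checksum_ok` and the `reject_repeated` flag are hypothetical, introduced only for illustration; the report does not say which validator was actually used.

```python
def cpf_checksum_ok(cpf: str, reject_repeated: bool = False) -> bool:
    """Return True if `cpf` has 11 digits and both check digits match."""
    # Accept the usual punctuation (e.g. 123.456.789-09) and strip it.
    raw = cpf.replace(".", "").replace("-", "").strip()
    if len(raw) != 11 or not (raw.isascii() and raw.isdigit()):
        return False
    digits = [int(c) for c in raw]

    # Optional extra rule (an assumption, not from the report): sequences
    # such as 99999999999 satisfy the checksum but are not real CPFs.
    if reject_repeated and len(set(digits)) == 1:
        return False

    def check_digit(prefix):
        # Weights run from len(prefix) + 1 down to 2.
        weights = range(len(prefix) + 1, 1, -1)
        total = sum(d * w for d, w in zip(prefix, weights))
        return (total * 10) % 11 % 10

    return (digits[9] == check_digit(digits[:9])
            and digits[10] == check_digit(digits[:10]))
```

Note that a checksum-only test accepts repeated-digit values such as 99999999999, so a stricter filter would need the extra rule.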
This is the profile of the CSV files:
- There are 310 source files `consulta_cand*.txt` (by state and year), and their fields are not uniform.
- Supposing uniformity, the relevant fields are at positions 3, 6, 11, 14 and 27 (counting from 1), with secondary positions 15, 28 and 31. Supposedly these correspond, respectively, to the fields (of the LEIAME.PDF documentation) `ANO_ELEICAO`, `SIGLA_UF`, `NOME_CANDIDATO`, `CPF_CANDIDATO`, `DATA_NASCIMENTO`, `NOME_URNA_CANDIDATO`, `NUM_TITULO_ELEITORAL_CANDIDATO`, `SEXO`.
- Supposing "newer is better": when a record is repeated, we can select only the latest version.
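The two suppositions above (fixed positions, "newer is better") could be sketched as follows. This is only an illustration: the function names, the choice of (name, birth date, CPF) as the repetition key, and the decision to skip short or malformed rows are assumptions, not the procedure actually used for this report.

```python
# 1-based column positions taken from the text above; they only hold
# if the file follows the assumed uniform layout.
POSITIONS = {
    "ANO_ELEICAO": 3,
    "SIGLA_UF": 6,
    "NOME_CANDIDATO": 11,
    "CPF_CANDIDATO": 14,
    "DATA_NASCIMENTO": 27,
}
MIN_COLUMNS = max(POSITIONS.values())


def pick_fields(row):
    """Map an already-split CSV row (list of strings) to named fields."""
    return {name: row[pos - 1].strip() for name, pos in POSITIONS.items()}


def keep_newest(rows):
    """Keep, for each repeated person record, only the latest election year."""
    newest = {}
    for row in rows:
        if len(row) < MIN_COLUMNS:
            continue  # non-uniform row: skipped here, should be inspected
        rec = pick_fields(row)
        if not rec["ANO_ELEICAO"].isdigit():
            continue  # year is not numeric, layout probably differs
        # Hypothetical repetition key: same name, birth date and CPF.
        key = (rec["NOME_CANDIDATO"], rec["DATA_NASCIMENTO"], rec["CPF_CANDIDATO"])
        seen = newest.get(key)
        if seen is None or int(rec["ANO_ELEICAO"]) > int(seen["ANO_ELEICAO"]):
            newest[key] = rec
    return list(newest.values())
```

The rows are expected to be split beforehand (for example with Python's `csv` module); the delimiter and encoding of the source files are not stated here, so they are left out of the sketch.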
Filtering the CSV lines for a valid name, valid birth date and valid CPF checksum, and removing duplicates, 1538241 records remain:
- 89% (1375305 rows) have distinct names;
- ... A few names have problems, such as "0DÁRICO", "0SEAS", and "ABRRAÃO".
- 89% (1375973 rows) have distinct CPFs (the remaining 11%, 135485 rows, are repeated CPFs for distinct name and birth date pairs);
- ~1% (14270 rows) have empty CPF;
- ~100 rows have two or more distinct CPFs for the same name and same birth date.
- ... Grouping by CPF: there are 21 cases of "CPF 99999999999" (invalid, but a falsehood accepted by TSE); some CPFs have more than 5 repetitions (e.g. CPFs 02990529829 and 32844085253); 859 cases have more than 3 repetitions and 11419 cases have more than 2 repetitions.
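Counts of this kind can be obtained by grouping the filtered records in both directions: by CPF, and by name plus birth date. The sketch below assumes the records are available as `(name, birth_date, cpf)` tuples; the function name and the keys of the returned dictionary are hypothetical.

```python
from collections import Counter, defaultdict


def cpf_repetition_profile(records):
    """Summarize CPF repetitions in (name, birth_date, cpf) tuples."""
    records = list(records)  # iterated more than once below

    # How many rows share each non-empty CPF.
    rows_per_cpf = Counter(cpf for _, _, cpf in records if cpf)

    # Which distinct CPFs appear for each (name, birth date) pair.
    cpfs_per_person = defaultdict(set)
    for name, birth, cpf in records:
        if cpf:
            cpfs_per_person[(name, birth)].add(cpf)

    return {
        "empty_cpf": sum(1 for _, _, cpf in records if not cpf),
        "distinct_cpfs": len(rows_per_cpf),
        "cpfs_with_more_than_2_rows": sum(1 for n in rows_per_cpf.values() if n > 2),
        "cpfs_with_more_than_3_rows": sum(1 for n in rows_per_cpf.values() if n > 3),
        "persons_with_2plus_cpfs": sum(1 for s in cpfs_per_person.values() if len(s) >= 2),
    }
```

`rows_per_cpf.most_common(10)` would be one way to surface the most repeated CPFs for manual inspection.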
Common examples of the same person with two names:
- Orthographic error: Marli and Marly, both with birth date 1965-12-27 and CPF 45258007472; DELIZOM and DELIZON (CPF 21963541987); JOSE and JOSÉ (CPF 71404449949); etc.
- Real change of name (e.g. after marriage): "MARIA DO CARMO PRATES COSTA" and "MARIA DO CARMO PRATES MARQUES", both 1965-03-09 and CPF 37921800572.
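Cases like these can be listed by grouping names under each CPF and checking whether the variants differ only by accents. The sketch below is one possible triage, not the method used for this report; the function names and the two labels ("accent-only", "other") are hypothetical, and "other" still mixes spelling errors with real name changes.

```python
import unicodedata
from collections import defaultdict


def strip_accents(text):
    """Remove combining marks, so that JOSÉ and JOSE compare equal."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def name_variants_by_cpf(records):
    """records: iterable of (name, cpf). Return CPFs with 2+ distinct names."""
    names = defaultdict(set)
    for name, cpf in records:
        if cpf:
            names[cpf].add(name.strip().upper())

    result = {}
    for cpf, variants in names.items():
        if len(variants) < 2:
            continue
        folded = {strip_accents(v) for v in variants}
        # One folded form means the variants differ only by accents.
        kind = "accent-only" if len(folded) == 1 else "other"
        result[cpf] = (sorted(variants), kind)
    return result
```

With the examples above, JOSE and JOSÉ would be labeled "accent-only", while Marli and Marly, or a change of surname, would fall under "other" and need a closer look (for instance an edit-distance check).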
Queries:
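-- Counts distinct CPFs among rows whose vatid_cpfs array text has length 15,
-- presumably a single 11-digit CPF such as ["99999999999"].
SELECT count(DISTINCT (info->'vatid_cpfs')->>0) AS n_unique_cpfs
FROM pubperson.person
WHERE length(info->>'vatid_cpfs') = 15
;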