Record Linkage Module
GUI Refactoring Roadmap
- First Tabbed Pane
- Load data files
- Define data types by field (eg, string, numeric, others?)
- Name each field
- Allow re-arrangement of field order
- Second Tabbed Pane
- Implement exact-match comparison
- Define blocking runs and blocking variables
- Generate XML configuration
- Implement string comparators
- Implement freq-based weight scaling
- Rename the "configuration" label to "Blocking Run".
- Third Tabbed Pane
- Generate process output and progress, eg:
- Sorting data files
- Form pairs
- Analyze files
- EM output
- random sampling
- freq counting
- etc
- Score Pairs
- Output results
- Option to sort by score (??)
- Option to select XML output, and define output file name
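As a rough illustration of the Tab 2 and Tab 3 steps listed above (blocking, exact-match comparison, scoring pairs, sorting by score), here is a minimal Python sketch. It assumes Fellegi-Sunter-style log weights built from m- and u-values, which this roadmap does not spell out. The function names, the record-as-dict representation, and the m/u numbers are all hypothetical, and Python is used only for illustration.

```python
import math
from collections import defaultdict

# Hypothetical (m, u) values per field; real values would come from EM or sampling.
M_U = {"last": (0.95, 0.01), "gender": (0.98, 0.50)}


def form_pairs(file_a, file_b, blocking_field):
    """Blocking: pair only records that share the same blocking-field value."""
    blocks = defaultdict(list)
    for rec_b in file_b:
        blocks[rec_b[blocking_field]].append(rec_b)
    for rec_a in file_a:
        for rec_b in blocks.get(rec_a[blocking_field], []):
            yield rec_a, rec_b


def match_vector(rec_a, rec_b, fields):
    """Exact-match comparison: 1 if the two field values are equal, else 0."""
    return [int(rec_a[f] == rec_b[f]) for f in fields]


def score(vector, fields, m_u=M_U):
    """Sum log2(m/u) for agreeing fields and log2((1-m)/(1-u)) for disagreeing ones."""
    total = 0.0
    for agrees, field in zip(vector, fields):
        m, u = m_u[field]
        total += math.log2(m / u) if agrees else math.log2((1 - m) / (1 - u))
    return total


def score_pairs(file_a, file_b, blocking_field, fields):
    """Return (score, match_vector, rec_a, rec_b) tuples, highest score first."""
    results = []
    for rec_a, rec_b in form_pairs(file_a, file_b, blocking_field):
        vector = match_vector(rec_a, rec_b, fields)
        results.append((score(vector, fields), vector, rec_a, rec_b))
    results.sort(key=lambda r: r[0], reverse=True)
    return results
```

A string comparator would replace the equality test in match_vector, and frequency-based scaling would adjust the weights in score; neither is shown here.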
Additional To-Do Items (in no particular order)
- Modify scored record output to XML format (rather than pipe-delimited). An XML layout can then be formatted for multiple uses, such as human review. Here are two candidate XML layouts:
Layout 1 (one element per record):
<pair>
  <metaData>
    <score/>
    <sensitivity/>
    <specificity/>
    <comparator/>
    <blockScheme/>
    <matchVector/>
  </metaData>
  <RecordA>
    <field label="last">
      SMITH
    </field>
    <field label="gender">
      M
    </field>
  </RecordA>
  <RecordB>
    <field label="last">
      SMYTHE
    </field>
    <field label="gender">
      M
    </field>
  </RecordB>
</pair>
Layout 2 (one element per field, with both values side by side):
<pair>
  <metaData>
    <score/>
    <sensitivity/>
    <specificity/>
    <comparator/>
    <blockScheme/>
    <matchVector/>
  </metaData>
  <fields>
    <field label="last">
      <valueA>
        SMITH
      </valueA>
      <valueB>
        SMYTHE
      </valueB>
    </field>
    <field label="sex">
      <valueA>
        M
      </valueA>
      <valueB>
        M
      </valueB>
    </field>
  </fields>
</pair>
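To make the per-field layout (valueA and valueB under each field) concrete, here is a small Python sketch that builds one pair element with the standard xml.etree.ElementTree module. The function name pair_to_xml, the dict-based records, and the way the match vector is serialized (a string of 0s and 1s) are assumptions for illustration; the remaining metaData elements are omitted for brevity.

```python
import xml.etree.ElementTree as ET


def pair_to_xml(record_a, record_b, score, match_vector):
    """Build a <pair> element with one <field> per label and both values inside it."""
    pair = ET.Element("pair")
    meta = ET.SubElement(pair, "metaData")
    ET.SubElement(meta, "score").text = str(score)
    # Assumed serialization: one character per field, 1 = agree, 0 = disagree.
    ET.SubElement(meta, "matchVector").text = "".join(str(bit) for bit in match_vector)
    fields = ET.SubElement(pair, "fields")
    for label in record_a:
        field = ET.SubElement(fields, "field", label=label)
        ET.SubElement(field, "valueA").text = record_a[label]
        ET.SubElement(field, "valueB").text = record_b.get(label, "")
    return pair


pair = pair_to_xml(
    {"last": "SMITH", "sex": "M"},
    {"last": "SMYTHE", "sex": "M"},
    score=4.2,            # hypothetical score
    match_vector=[0, 1],  # "last" disagrees, "sex" agrees
)
xml_text = ET.tostring(pair, encoding="unicode")
```

Keeping both values under one field element makes side-by-side human review straightforward, at the cost of a layout that no longer mirrors the two source records.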
- Random sampling for u-value estimation -- an analytic function (an option on Tab 2):
- Run EM for m-vals (u-vals are also generated)
- Sample for u-vals, update config file with new stats
- Option to configure # of samples (eg, % of largest file, or absolute number)
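The sampling idea above can be sketched as follows. Randomly drawn pairs are almost always non-matches, so the share of sampled pairs that agree on a field approximates that field's u-value. This is a minimal Python sketch under that assumption; the function names, the percent and absolute parameters, and the fixed seed are hypothetical, and the EM step for m-values is not shown.

```python
import random


def sample_size(len_a, len_b, percent=None, absolute=None):
    """Number of random pairs: an absolute count, or a percentage of the largest file."""
    if absolute is not None:
        return absolute
    if percent is None:
        raise ValueError("give either percent or absolute")
    return max(1, round(max(len_a, len_b) * percent / 100))


def estimate_u_values(file_a, file_b, fields, n_samples, seed=0):
    """Estimate u per field as the agreement rate among randomly drawn pairs."""
    rng = random.Random(seed)  # seeded so that repeated runs give the same estimate
    agree = {f: 0 for f in fields}
    for _ in range(n_samples):
        rec_a = rng.choice(file_a)
        rec_b = rng.choice(file_b)
        for f in fields:
            if rec_a[f] == rec_b[f]:
                agree[f] += 1
    return {f: agree[f] / n_samples for f in fields}
```

The returned dictionary is what would be written back to the configuration file as the new u-values.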
- Implement potential pair post-processing heuristics to accept/reject potential matches (in Tabbed-Pane 4?):
- Define Rules (rules evaluate to True or False):
Rule  Concept1  Criteria  Concept2
----  --------  --------  --------
A.    LN.A      ==        "SMITH"   <-- Absolute Value
B.    LN.A      !=        LN.B      <-- Value Referenced by _field_
C.    YB.A      >=        "2006"    <-- Absolute Value
- Apply Rules (and combinations of rules) to each pair and assign an action (eg, reject or accept):
ID   RuleCombo  Action
---  ---------  --------
1.   A or B     Reject
2.   C          Accept
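One possible way to represent these pair rules and rule combinations in code is sketched below in Python. The Ref tuple, the pair layout ({"A": record, "B": record}), and the first-match-wins policy are assumptions, not decisions recorded in these notes. Because "2006" is quoted in rule C, years are compared as strings here, which orders correctly only for four-digit years.

```python
import operator
from collections import namedtuple

# A reference to a field of record A or B, as opposed to an absolute value.
Ref = namedtuple("Ref", ["field", "side"])

OPS = {"==": operator.eq, "!=": operator.ne, ">=": operator.ge, "<=": operator.le}

# Rule -> (Concept1, Criteria, Concept2), mirroring the rule table.
RULES = {
    "A": (Ref("LN", "A"), "==", "SMITH"),
    "B": (Ref("LN", "A"), "!=", Ref("LN", "B")),
    "C": (Ref("YB", "A"), ">=", "2006"),
}

# Rule combinations and their actions; the first combination that holds wins (assumed).
COMBOS = [
    (lambda r: r["A"] or r["B"], "Reject"),
    (lambda r: r["C"], "Accept"),
]


def resolve(term, pair):
    """Turn a Ref into the referenced field value; pass absolute values through."""
    if isinstance(term, Ref):
        return pair[term.side][term.field]
    return term


def evaluate_rules(pair, rules=RULES):
    """Evaluate every rule to True or False for one pair."""
    return {
        name: OPS[op](resolve(c1, pair), resolve(c2, pair))
        for name, (c1, op, c2) in rules.items()
    }


def decide(pair, rules=RULES, combos=COMBOS):
    """Return the action of the first matching combination, or None if none applies."""
    results = evaluate_rules(pair, rules)
    for combo, action in combos:
        if combo(results):
            return action
    return None
```

Pairs for which decide returns None would keep whatever status their score gave them.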
- Pre-process records (not pairs) to accommodate known, invalid field data. Pre-processing should occur before forming pairs or analyzing data:
- Define Rules (which evaluate to true or false)
Rule  Concept1  Criteria  Concept2
----  --------  --------  --------
A.    FN        ==        "INFANT"
B.    FN        ==        "BABY"
C.    SEX       !=        "M"
D.    SEX       !=        "F"
E.    YB        <=        1900
F.    DB        ==        1
- Implement rules and actions. The two most obvious actions seem to be either 1) reject the record, or 2) modify the field value:
ID   RuleCombo  Action
---  ---------  -------------
1.   A or B     FN = ""
2.   C and D    SEX = ""
3.   E and F    Reject record
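The record-level rules and actions could be sketched along the following lines in Python. The dict-based record, integer YB and DB values, and the ("set", field, value) / ("reject",) action tuples are assumptions made for illustration. Combination 2 is written as C and D (SEX is neither "M" nor "F"), since C or D holds for every record and would blank every SEX value.

```python
# Each rule evaluates to True or False for a single record.
RULES = {
    "A": lambda r: r["FN"] == "INFANT",
    "B": lambda r: r["FN"] == "BABY",
    "C": lambda r: r["SEX"] != "M",
    "D": lambda r: r["SEX"] != "F",
    "E": lambda r: r["YB"] <= 1900,
    "F": lambda r: r["DB"] == 1,
}

# Rule combinations and the action each one triggers.
ACTIONS = [
    (lambda t: t["A"] or t["B"], ("set", "FN", "")),
    (lambda t: t["C"] and t["D"], ("set", "SEX", "")),
    (lambda t: t["E"] and t["F"], ("reject",)),
]


def preprocess(record, rules=RULES, actions=ACTIONS):
    """Return a cleaned copy of the record, or None if the record is rejected."""
    truth = {name: rule(record) for name, rule in rules.items()}
    cleaned = dict(record)  # leave the caller's record untouched
    for combo, action in actions:
        if not combo(truth):
            continue
        if action[0] == "reject":
            return None
        _, field, value = action
        cleaned[field] = value
    return cleaned
```

Running preprocess over every record before pairs are formed keeps placeholder values such as "BABY" from agreeing with each other during comparison.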
Projects
- B-F Linkage
- NBS -> INPC (freq scaling)
- INPC -> INPC
- Modeling Errors