Record Linkage Module

GUI Refactoring Roadmap

  • First Tabbed Pane
    • Load data files
    • Define data types by field (eg, string, numeric, others?)
    • Name each field
    • Allow re-arrangement of field order
  • Second Tabbed Pane
    • Implement exact-match comparison
    • Define blocking runs and blocking variables
    • Generate XML configuration
    • Implement string comparators
      • JWC
      • LEV
      • LCS
      • RMS
    • Implement freq-based weight scaling
    • Relabel the configuration name field as "Blocking Run"
  • Third Tabbed Pane
    • Generate process output and progress, eg:
      • Sort data files
      • Form pairs
      • Analyze files
        • EM output
        • random sampling
        • freq counting
        • etc
      • Score Pairs
      • Output results
        • Option to sort by score (??)
        • Option to select XML output, and define output file name
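
The blocking and pair-forming steps in the roadmap above could look roughly like the following. This is a minimal Python sketch, not the module's implementation: it assumes records are dicts keyed by field label, and `form_pairs` is a hypothetical name. The roadmap mentions sorting the data files; for brevity this sketch groups records with an in-memory index instead.

```python
from collections import defaultdict


def form_pairs(file_a, file_b, blocking_vars):
    """Yield (record_a, record_b) pairs that agree exactly on every blocking variable."""
    def block_key(rec):
        return tuple(rec[v] for v in blocking_vars)

    # Index file B by blocking key so that only records in the same block are paired.
    blocks_b = defaultdict(list)
    for rec_b in file_b:
        blocks_b[block_key(rec_b)].append(rec_b)

    for rec_a in file_a:
        for rec_b in blocks_b.get(block_key(rec_a), []):
            yield rec_a, rec_b


# One blocking run that blocks on year of birth (field label YB is illustrative).
file_a = [{"LN": "SMITH", "YB": "1970"}, {"LN": "JONES", "YB": "1980"}]
file_b = [{"LN": "SMYTHE", "YB": "1970"}, {"LN": "SMITH", "YB": "1981"}]
pairs = list(form_pairs(file_a, file_b, ["YB"]))
```

Each blocking run would call this with its own list of blocking variables; the resulting pairs are what the later scoring step consumes.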

Additional To-Do Items (in no particular order)

  • Modify scored record output to XML format (rather than pipe-delimited). An XML layout can then be formatted for multiple uses, eg, human review. Two candidate XML layouts are shown side by side:

<pair>                           <pair>          
  <metaData>                       <metaData>
    <score/>                         <score/> 
    <sensitivity/>                   <sensitivity/>
    <specificity/>                   <specificity/>
    <comparator/>                    <comparator/>
    <blockScheme/>                   <blockScheme/>
    <matchVector/>                   <matchVector/>
  </metaData>                      </metaData>
  <RecordA>                        <fields>
    <field label="last">             <field label="last">
      SMITH                            <valueA>
    </field>                             SMITH
    <field label="gender">             </valueA>
      M                                <valueB>
    </field>                             SMYTHE
  </RecordA>                           </valueB>
  <RecordB>                          </field>
    <field label="last">             <field label="gender">
      SMYTHE                           <valueA>
    </field>                             M
    <field label="gender">             </valueA>
      M                                <valueB>
    </field>                             M
  </RecordB>                           </valueB>
</pair>                              </field>
                                   </fields>
                                 </pair>
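
As an illustration of how the first (left-hand) layout could be produced, here is a minimal Python sketch using the standard library's `xml.etree.ElementTree`. The function name `pair_to_xml`, the dict-based record representation, and the choice to store metadata values as element text are assumptions; the layouts above show the metadata elements empty and do not say where their values go.

```python
import xml.etree.ElementTree as ET

META_FIELDS = ("score", "sensitivity", "specificity",
               "comparator", "blockScheme", "matchVector")


def pair_to_xml(meta, record_a, record_b):
    """Build a <pair> element following the first candidate layout."""
    pair = ET.Element("pair")

    # Assumption: each metadata value is written as the element's text.
    meta_el = ET.SubElement(pair, "metaData")
    for name in META_FIELDS:
        ET.SubElement(meta_el, name).text = str(meta.get(name, ""))

    # One <field label="..."> child per field, grouped by record.
    for tag, record in (("RecordA", record_a), ("RecordB", record_b)):
        rec_el = ET.SubElement(pair, tag)
        for label, value in record.items():
            ET.SubElement(rec_el, "field", label=label).text = value
    return pair


# The score value here is made up for illustration.
pair = pair_to_xml({"score": 12.5},
                   {"last": "SMITH", "gender": "M"},
                   {"last": "SMYTHE", "gender": "M"})
xml_text = ET.tostring(pair, encoding="unicode")
```

The second layout would differ only in the inner loop: iterate over field labels once and emit `<valueA>` and `<valueB>` children under each `<field>`.
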
  • Random sampling for u-value estimation -- an analytic function (an option on Tab 2):
    • Run EM for m-vals (u-vals are also generated)
    • Sample for u-vals, update config file with new stats
    • Option to configure # of samples (eg, % of largest file, or absolute number)
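
The sampling step could be sketched as follows. This is a hedged Python example, not the module's method: it assumes that randomly drawn cross-file pairs are overwhelmingly non-matches, so the fraction of sampled pairs agreeing on a field approximates that field's u-value, and it uses exact agreement only. The names `sample_size` and `estimate_u_values` are hypothetical.

```python
import random


def sample_size(file_a, file_b, percent=None, absolute=None):
    """Number of random pairs to draw: an absolute count, or a % of the largest file."""
    if absolute is not None:
        return absolute
    if percent is None:
        raise ValueError("give either percent or absolute")
    return max(1, round(max(len(file_a), len(file_b)) * percent / 100))


def estimate_u_values(file_a, file_b, fields, n_samples, seed=None):
    """Estimate u-values as per-field agreement rates among randomly drawn pairs."""
    rng = random.Random(seed)
    agree = {f: 0 for f in fields}
    for _ in range(n_samples):
        rec_a = rng.choice(file_a)
        rec_b = rng.choice(file_b)
        for f in fields:
            if rec_a[f] == rec_b[f]:  # exact agreement only (assumption)
                agree[f] += 1
    return {f: agree[f] / n_samples for f in fields}


# Tiny made-up files; real input would be the loaded data files.
file_a = [{"SEX": "M"}, {"SEX": "F"}, {"SEX": "M"}, {"SEX": "F"}]
file_b = [{"SEX": "F"}, {"SEX": "M"}]
n = sample_size(file_a, file_b, percent=50)
u_vals = estimate_u_values(file_a, file_b, ["SEX"], n, seed=0)
```

The returned dict is what would be written back to the config file in place of the EM-generated u-values.
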
  • Implement potential pair post-processing heuristics to accept/reject potential matches (in Tabbed-Pane 4?):
    • Define Rules (rules evaluate to True or False):
      Rule  Concept1  Criteria  Concept2
      ----  --------  --------  --------
      A.    LN.A      ==         "SMITH"  <-- Absolute Value
      B.    LN.A      !=         LN.B     <-- Value Referenced by _field_
      C.    YB.A      >=         "2006"   <-- Absolute Value 
      
    • Apply Rules (and combinations of rules) to each pair and assign an action (eg, reject or accept):
      ID   RuleCombo  Action
      ---  ---------  --------
      1.   A or B     Reject   
      2.   C          Accept   
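
One way the rule and action tables above might be represented and evaluated is sketched below in Python. Everything named here (`Ref`, `RULES`, `ACTIONS`, `decide`) is hypothetical. The sketch also assumes that the first rule combination that evaluates to True decides the action, and that pairs matching no combination are left undecided; the notes above do not specify either point.

```python
import operator
from collections import namedtuple

# Ref("LN", "A") stands for LN.A, a value referenced by field; anything else is an absolute value.
Ref = namedtuple("Ref", "field side")

OPS = {"==": operator.eq, "!=": operator.ne, ">=": operator.ge, "<=": operator.le}

RULES = {
    "A": (Ref("LN", "A"), "==", "SMITH"),
    "B": (Ref("LN", "A"), "!=", Ref("LN", "B")),
    "C": (Ref("YB", "A"), ">=", "2006"),  # string comparison; fine for 4-digit years
}

# Rule combinations in table order (assumption: first True combination wins).
ACTIONS = [
    (lambda r: r["A"] or r["B"], "Reject"),
    (lambda r: r["C"], "Accept"),
]


def value_of(term, pair):
    """Look up a field reference in the pair, or return an absolute value unchanged."""
    return pair[term.side][term.field] if isinstance(term, Ref) else term


def decide(pair, default=None):
    """Evaluate every rule for the pair, then return the first matching action."""
    results = {name: OPS[op](value_of(left, pair), value_of(right, pair))
               for name, (left, op, right) in RULES.items()}
    for combo, action in ACTIONS:
        if combo(results):
            return action
    return default


pair = {"A": {"LN": "SMITH", "YB": "1970"}, "B": {"LN": "SMYTHE", "YB": "1970"}}
action = decide(pair)  # "Reject": rule A holds because LN.A is SMITH
```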
      
  • Pre-process records (not pairs) to accommodate known, invalid field data. Pre-processing should occur before forming pairs or analyzing data:
    • Define Rules (which evaluate to true or false)
      Rule  Concept1  Criteria  Concept2
      ----  --------  --------  --------
      A.    FN        ==        "INFANT"
      B.    FN        ==        "BABY"
      C.    SEX       !=        "M"
      D.    SEX       !=        "F"
      E.    YB        <=        1900
      F.    DB        ==        1
      
    • Implement rules and actions. The two most obvious actions seem to be either 1) reject the record, or 2) modify the field value:
      ID   RuleCombo    Action
      ---  ---------    --------
      1.   A or B       FN = ""   
      2.   C and D      SEX = ""
      3.   E and F      Reject record
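
A minimal Python sketch of record pre-processing with these rules follows. The names (`RULES`, `preprocess`) are hypothetical, YB and DB are assumed to be stored as integers, and all rules are assumed to be evaluated against the original record before any action is applied.

```python
import operator

OPS = {"==": operator.eq, "!=": operator.ne, "<=": operator.le}

RULES = {
    "A": ("FN", "==", "INFANT"),
    "B": ("FN", "==", "BABY"),
    "C": ("SEX", "!=", "M"),
    "D": ("SEX", "!=", "F"),
    "E": ("YB", "<=", 1900),
    "F": ("DB", "==", 1),
}


def preprocess(record):
    """Return a cleaned copy of the record, or None if the record is rejected."""
    r = {name: OPS[op](record[field], value)
         for name, (field, op, value) in RULES.items()}

    if r["E"] and r["F"]:
        return None  # action 3: reject the record

    cleaned = dict(record)
    if r["A"] or r["B"]:
        cleaned["FN"] = ""  # action 1: blank placeholder first names
    # Action 2 fires when SEX is neither "M" nor "F", i.e. C and D both hold;
    # "C or D" would hold for every possible value.
    if r["C"] and r["D"]:
        cleaned["SEX"] = ""
    return cleaned


# Made-up record for illustration.
cleaned = preprocess({"FN": "BABY", "SEX": "U", "YB": 1985, "DB": 12})
```

Because this runs per record, it can be applied while loading each data file, before pairs are formed or frequencies are counted.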
      

Projects

  • B-F Linkage
  • NBS -> INPC (freq scaling)
  • INPC -> INPC
  • Modeling Errors