CiviCRM Mass Dedupe Workflow Improvements

20 April 2015

We at Veda Consulting been doing quite a lot of improvements in civicrm dedupe management and workflow. Thanks to Leukaemia & Lymphoma Research. And just before CiviCon Denver, seems like a good timing to share it with community when we have most of it working. Most of the code has been tested and implemented in v4.5, and available on github. Here goes the list:

1. Improve dedupe response using custom queries written via hooks

Finding duplicates for a large database, even with a simple rule (first + last + email) with inclusive threshold, could be dead slow to respond. Reason for this is because CiviCRM dedupe still has to run 3 queries, out of which queries based on just "first" and "last" names are potentially going to find a huge number of duplicates, even before finding an intersection of "first AND last AND email".

However if we run a single custom query with inner join of "first-name AND last-name AND email", is very more likely to give quick results. Same rule applies to all other rules.
Using civicrm_dedupeQuery hook (also submitted to core), we came out with some custom rules that are quick to render, unlike previous default rules that would make us wait endlessly. Some of these custom rules are:

Custom: First Name AND Last Name
Custom: First Name AND Last Name AND Email
Custom: First Name AND Last Name AND Postal Code
Custom: Initial AND Last Name AND Email
and so on..

Code sample:

function dedupe_civicrm_dupeQuery($form, $type, &$data) {
  if ($type == 'supportedFields' && !empty($data)) {
    $data['Individual']['civicrm_contact']['custom_first_last_email']  = "Custom: First Name AND Last Name AND Email";
  }
  if ($table == 'civicrm_contact' && $col == 'custom_first_last_email') {
      $data[$key] = "
    SELECT t1.id id1, t2.id id2, $wt weight 
      FROM civicrm_contact t1
INNER JOIN civicrm_email em1 ON t1.id = em1.contact_id 
      JOIN ( SELECT cc.id, cc.first_name, cc.last_name, cc.contact_type, em2.email 
               FROM civicrm_contact cc
         INNER JOIN civicrm_email em2 ON cc.id = em2.contact_id ) t2 ON t1.first_name = t2.first_name AND 
                                                                        t1.last_name = t2.last_name AND 
                                                                        em1.email = t2.email
     WHERE t1.contact_type = 'Individual' AND 
           t2.contact_type = 'Individual' AND 
           t1.id < t2.id AND 
           t1.first_name IS NOT NULL AND 
           t1.last_name  IS NOT NULL AND 
           em1.email IS NOT NULL AND em1.email <> ''";
  }

Full code: https://github.com/veda-consulting/uk.co.vedaconsulting.module.dedupe/bl...

2. Display more information on the screen listing duplicates

CiviCRM dedupe listing screen at the moment only list contact names. For a large duplicate list its quite important to have most of the data listed on the same screen.
Civi uses civicrm_prevnext_cache tables for storing duplicate pairs. We used SQL joins with same table to retrieve all other extra information.
Code: https://github.com/civicrm/civicrm-core/blob/7db993ab2b635d48f6778af8e07...

2A. Columns like email, street address, post code - in addition to existing contact names

Initially contact and email columns are only rendered. Show / hide columns above the grid could reveal other columns.

2B. hover information - a pop up displaying more information about contacts listed in duplicate list

At this point for pop-up we using the same profile and similar ajax calls like used for advanced search. However profile could be changed to another if needed.

3. Allow searching and locating specific duplicate records.

Currently in CiviCRM as you can only do full batch merges, there are no options to drill down specific pairs.

Filter section that we added, allows to filter duplicate pairs using columns present in the datatable. We using contact, email, street address and postcode filters, same as the columns in the datatable. Every filter fires an ajax call and updates the datatable with new result set immediately.
Relevant Code: https://github.com/civicrm/civicrm-core/blob/7db993ab2b635d48f6778af8e07...

4. User be able to selectively specify the duplicate pair to run batch merge on

In CiviCRM you could "batch merge all duplicates", but no option to run batch merge on a subset of result set.
We now render a checkbox against every duplicate pair which allows user to specify the selection. And added "Batch Merge Selected Duplicates" button action to allow merges on selected duplicate pairs.

Here we re-use the is_selected column of civicrm_prevnext_cache table to record the requirement. Each selection fires an ajax call to update is_selected column of the table. In header section there 'll be a checkbox to toggle the section for entire visible pairs (not shown in the image yet). The batch merge now also uses this column to determine whether to do merges for all records or the records with selections.

5. Batch merges to use civicrm queue mechanism to show a progress bar on UI

Batch merge in civi is a time taking process and is therefore mostly relied on crons to execute. While managing dedupes in UI, there is no way to know the status of large merges other than staring at white screen to respond.

We have added a progress bar to the merges. All merges now happen through a new journey that uses CiviCRM's queue system to show the progress on merges.

6. Any conflicts during merges to be stored and reported in UI as a new column

CiviCRM currently for safety reasons doesn't merge records with conflicts. For e.g if both main-contact and duplicate-contact has birth-date information present, they aren't going to be merged as its not clear which data to retain. There could be many such fields, but the problem is user doesn't get to know the cause of conflicts.

We made these conflict reasons, store in the database (civicrm_prevnext_cache table) and rendered in the datatable as a new column. User is presented with this column and duplicate pairs after first round of batch merge. The idea is to give information on records which could not be merged and why. User then can make any changes based on the conflict reason and retry the merges.

7. Allow user to run "aggressive merges" on records in conflicts

As mentioned - CiviCRM currently doesn't merge records with conflicts, there could be situation where user is fine to go ahead with merges even if that means overwriting the column in conflict. Note there are hooks where we can set the biasing of columns i.e which column to retain in case of conflicts.

We have provided the facility to run aggressive merges from UI as second step i.e once a safe merge is run and there are duplicate pairs that couldn't be merged due to conflicts, user can decide to run aggressive merge on them.

8. Code on github

A. Extension for creating custom rules and deciding the biasing in case of conflicts: https://github.com/veda-consulting/uk.co.vedaconsulting.module.dedupe
B. Code for all other improvements in dedupe system: https://github.com/veda-consulting/civicrm-core/compare/4.5...veda-consu...

Keep in touch, we value relations!

Connect with us