Note that there are some explanatory texts on larger screens.

plurals
  1. POHow to compare differently formatted addresses (php / mysql)
    primarykey
    data
    text
    <p>I am working on an established php mysql web application with some fairly big data.</p> <p>As part of the data management routines new data on addresses are added to the database (import) from two different sources.</p> <p>All of the addresses in this data are from the UK.</p> <p>The application has used <code>addressRecord.address1 = address1 AND addressRecord.postcode = postcode</code> as a collision detection before inserting a new address, or associating a record with an existing address.</p> <p>The trouble is the collision test is not conclusive. As the two different sources of data have provided somewhat different address formats.</p> <p><strong>source 1</strong></p> <pre><code> address1 = 'FLAT N, RICHMOND HILL GATE, 1' address2 = 'RICHMOND HILL DRIVE' address3 = 'BOURNEMOUTH' postcode = 'BH2 6LT' </code></pre> <p><strong>source 2</strong></p> <pre><code> address1 = 'Flat N' address2 = 'Richmond Hill Gate' address3 = '1 Richmond Hill Drive' postcode = 'BH2 6LT' </code></pre> <p>Because this is an established application this duplication already exists in the address table, which I must deal with, but also new data is continually imported and must be related to an address record.</p> <p>So I am looking for a conclusive (almost conclusive might do) way to compare addresses with slightly different formatting, that is also performant (10's millions of rows of data).</p> <p>I have so far considered calculating an identification field, which can be augmented to the existing data, and calculated for the new imports, that may for example be the 3 address fields concatenated, with all punctuation removed, or perhaps just the numbers. or ... Any ideas gratefully received.</p>
    singulars
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    plurals
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. COFWIW, this is a very common issue. There are companies that offer mailing list de-duping as a service, it is a significant undertaking. How to tell that "123 Richmond Way" is the same address as "123 Richmond" and "123 SE Richmond Wy.", and that "123 SE Richmond Wy. Ste 123" is the same and the preferred, being more complete? You're going to have to dip into a lot of buckets to solve this. I've used Google maps to geolocate the addresses, the US postal service offers a change-of-address verification, and there are other half-measures, but any big database full of people struggles with this.
      singulars
    2. COAnother note: I've also used a strategy where I parse the addresses and format them consistently in one pass (normalize the data), THEN do a matching string comparison as a second pass. I used this function to parse the addresses -- it works for US addresses with some success: https://gist.github.com/GRMule/7775384 -- so I looped every record, normalized the address, wrote the record, etc. I have not tried it on UK mailing addresses, but you can see what is involved there!
      singulars
    3. CO@Chris Thanks for the useful comments and especially the code sample which has given me some ideas. The normalisation idea is exactly what I am looking for. Better formatted UK address data that I have seen splits bulilding name or number, secondary element such as flat , and street name into 3 exact fields . I wonder if this was automated but I suspect manual input or verification. Thanks.
      singulars
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload