Note that there are some explanatory texts on larger screens.

plurals
  1. PO
    text
    copied!<p>Interesting question. You're looking to parse a database schema and data to determine which tables are relevant or should be related to each other, without any strict definition of the relationship. In effect, you're trying to infer a relationship.</p> <p>I see two ways that you can infer such a relationship. First let me say that your approach might vary depending on the databases you're working with. A number of questions spring to mind (I don't want answers, but they are worth reflecting on)</p> <ul> <li>are these in-house enterprise systems that follow some consistent naming convention or pattern?</li> <li>or are they 'in-the-wild' databases that you come across anywhere, at any time? </li> <li>what sort of assumptions are you prepared to make? </li> <li>would you prefer to get more false positives or false negatives in your result?</li> </ul> <p>Note that this type of inference will almost certainly give false results, and is built on a lot of assumptions.</p> <p>So I offer two approachs that I'd use in concert.</p> <p><strong>Inferring a relationship through structure / naming (symbolic analysis)</strong></p> <p>Common database design is to name a PK column after the table name (e.g. <code>CustomerId</code> on table <code>Customer</code>), or alternatively name the PK column simply <code>Id</code>.</p> <p>A table with a FK relationship to another often names its related column the same as the related table. In the <code>Order</code> table I'd expect a <code>CustomerId</code> column which refers to the <code>CustomerId</code> / <code>Id</code> column in the <code>Customer</code> table.</p> <p>This type of analysis would include </p> <ul> <li>inspecting columns across tables for similar phrases / words</li> <li>looking for columns names that are similar to the names of other tables</li> <li>checking for column names that contain the name of other column (e.g. <code>FirstCustomerId</code> &amp; <code>SecondCustomerId</code> both refer to the <code>CustomerId</code> column in the <code>Customer</code> table)</li> </ul> <p><strong>Inferring a relationship through data (statistical analysis)</strong></p> <p>Looking at data, as you suggest you have done in your comments, will allow you to determine 'possible' references. If the <code>CustomerId</code> column in the <code>Order</code> table contains values which don't exist in the <code>Id</code> column of the <code>Customer</code> table then it's reasonable to question that this is a valid relationship (although you never know!)</p> <p>A simple form of data analysis is using dates and times. Rows that were created with close proximity to one another are more likely to be related to one another. If, for every <code>Order</code> row that was created, there also exist between 2 and 5 <code>Item</code> rows created within a few seconds, then a relationship between the two is likely.</p> <p>A more detailed analysis might look at the range and distribution of used values.</p> <p>For example, if your <code>Order</code> table has a <code>St_Id</code> column - you might infer using symbolic analysis that the column is likely to relate to either a <code>State</code> table or a <code>Status</code> table. The <code>St_Id</code> column has 6 discrete values, and 90% of the records are covered by 2 values. The <code>State</code> table has 200 rows, and the <code>Status</code> table has 9 rows. You could quite reasonably infer that the <code>St_Id</code> column relates to the <code>Status</code> table - it gives a more greater coverage of the rows of the table (2/3 of the rows are 'used', whereas only 3% of the rows in the <code>State</code> table would be used).</p> <p>If you perform data analysis on existing databases to gather 'real life data', I'd expect some patterns that could be used as guides to structure inference. When a table with a large number of records has a column with a small number of values repeated many times (not necessarily in order), it's more likely to this column relates to a table with a correspondingly small number of rows.</p> <p><strong>In summary</strong> </p> <p>Best of luck. It's an interested problem, I've just thrown some ideas out there but this is very much a trial &amp; error, data gathering and performance tuning situation.</p>
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload