Note that there are some explanatory texts on larger screens.

plurals
  1. PO
    primarykey
    data
    text
    <p>The only solution I found was in in the MYSQL Cookbook <a href="http://www.kitebird.com/mysql-cookbook/" rel="nofollow">http://www.kitebird.com/mysql-cookbook/</a></p> <p>Code samples at: <a href="http://www.kitebird.com/mysql-cookbook/downloads-2ed.php" rel="nofollow">http://www.kitebird.com/mysql-cookbook/downloads-2ed.php</a></p> <p>This does the first piece...analyse TSV data and generate a "CREATE TABLE" appropriate for it. The second piece, uploading the TSV into the table structure, is simple and common.</p> <p>guess_table.pl</p> <pre><code>#!/usr/bin/perl # guess_table.pl - characterize the contents of a data file and use the # information to guess a CREATE TABLE statement for the file # Usage: guess_table.pl table_name data_file # To do: # - Use value range information for something. It's collected but not yet # used. For example, suggest better INT types. # - Get rid of nonnegative attribute; it can be assessed now from the range. # Load a data file and read column names and data values. # Guess the declaration for each of the columns based on what the data # values look like, and then generate an SQL CREATE TABLE statement for the # table. Because the column declarations are just guesses, you'll likely # want to edit the output, for example, to change a data type or # length. You may also want to add indexes. Nevertheless, using this # script can be easier than writing the CREATE TABLE statement by hand. # Some assumptions: # - Lines are tab-delimited, linefeed-terminated # - Dates consist of 3 numeric parts, separated by - or /, in y/m/d order # Here are some ways that guess_table.pl could be improved. Each of # them would make it smarter, albeit at the cost of increased processing # requirements. Some of the suggestions are likely impractical for really # huge files. # - For numeric columns, use min/max values to better guess the type. # - Keep track of the number of unique values in a column. If there # aren't many, the column might be a good candidate for being an ENUM. # Testing should not be case sensitive, because ENUM columns are not # case sensitive. # - Make the date guessing code smarter. Have it recognize non-ISO format # and attempt to make suggestions that a column needs to be reformatted. # (This actually needs to see entire column, because that would help # it distinguish U.S. from British formats WRT order of month and day.) # This would need to track min/max for each of the three date parts. # - If all values in a column are unique, suggest that it should be a PRIMARY # KEY or a UNIQUE index. # - For DATETIME columns, allow some times to be missing without flagging # column as neither DATE nor TIME. # Paul DuBois # paul@kitebird.com # 2002-01-31 # 2002-01-31 # - Created. # 2002-02-19 # - Add code to track ranges for numeric columns and for the three date # subparts of columns that look like they contain dates. # 2002-02-20 # - Added --lower and --upper options to force column labels to lowercase # or uppercase. # 2002-03-01 # - For character columns longer than 255 characters, choose TEXT type based # on maximum length. # 2002-04-04 # - Add --quote-names option to quote table and column names `like this`. # The resulting statement requires MySQL 3.23.6 or higher. # 2002-07-16 # - Fix "uninitialized value" warnings resulting from missing columns in # data lines. # - Don't attempt to assess date characteristics for columns that are always # empty. # 2005-12-28 # - Make --quote-names the default, add --skip-quote-names option so that # identifier quoting can be turned off. # - Default data type now is VARCHAR, not CHAR. # 2006-06-10 # - Emit UNSIGNED for double/decimal columns if they're unsigned. use strict; use warnings; use Getopt::Long; $Getopt::Long::ignorecase = 0; # options are case sensitive $Getopt::Long::bundling = 1; # allow short options to be bundled # ---------------------------------------------------------------------- # Create information structures to use for characterizing each column in # in the data file. We need to know whether any nonnumeric values are # found, whether numeric values are always integers, and the maximum length # of column values. # Argument is the array of column labels. # Creates an array of hash references and returns a reference to that array. sub init_col_info { my @labels = @_; my @col_info; for my $i (0 .. @labels - 1) { my $info = { }; $info-&gt;{label} = $labels[$i]; $info-&gt;{max_length} = 0; # these can be tested directly, so they're set false until found # to be true $info-&gt;{hasempty} = 0; # has empty values $info-&gt;{hasnonempty} = 0; # has nonempty values # these can be assessed only by seeing all the values in the # column, so they're set true until discovered by counterexample # to be false $info-&gt;{numeric} = 1; # used to detect general numeric types $info-&gt;{integer} = 1; # used to detect INT $info-&gt;{nonnegative} = 1; # used to detect UNSIGNED $info-&gt;{temporal} = 1; # used to detect general temporal types $info-&gt;{date} = 1; # used to detect DATE $info-&gt;{datetime} = 1; # used to detect DATETIME $info-&gt;{time} = 1; # used to detect TIME # track min/max value for numeric columns $info-&gt;{min_val} = undef; $info-&gt;{max_val} = undef; # track min/max for each of three date parts $info-&gt;{date_range} = [ undef, undef, undef]; push (@col_info, $info); } return (\@col_info); } sub print_create_table { my ($tbl_name, $col_info_list, $quote) = @_; my $ncols = @{$col_info_list}; my $s; my $extra = ""; $quote = ($quote ? "`" : ""); # quote names? for my $i (0 .. $ncols - 1) { my $info = $col_info_list-&gt;[$i]; $s .= ",\n" if $i &gt; 0; $s .= $extra if $extra ne ""; $extra = ""; $s .= " $quote$info-&gt;{label}$quote "; if (!$info-&gt;{hasnonempty}) # column is always empty, make wild guess { $s .= "CHAR(10) /* NOTE: column is always empty */"; next; } # if the column has nonempty values but one of # these hasn't been ruled out, that's a problem if ($info-&gt;{numeric} &amp;&amp; $info-&gt;{temporal}) { die "Logic error: $info-&gt;{label} was characterized as both" . " numeric and temporal\n"; } if ($info-&gt;{numeric}) { if ($info-&gt;{integer}) { $s .= "INT"; ## TO DO: use range to make guess about type # Print "might be YEAR" if in range...(0, 1901-2155) } else { $s .= "DOUBLE"; } $s .= " UNSIGNED" if $info-&gt;{nonnegative}; } elsif ($info-&gt;{temporal}) { # if a date column looks more like a U.S. or British # date, add some comments to that effect if (exists ($info-&gt;{date_type})) { my $ref = $info-&gt;{date_type}; $extra .= " # $info-&gt;{label} might be a U.S. date\n" if $ref-&gt;{us}; $extra .= " # $info-&gt;{label} might be a British date\n" if $ref-&gt;{br}; } if ($info-&gt;{date}) { $s .= "DATE"; } elsif ($info-&gt;{datetime}) { $s .= "DATETIME"; } elsif ($info-&gt;{time}) { $s .= "TIME"; } else { die "Logic error: $info-&gt;{label} is flagged as temporal, but" . " not as any of the temporal types\n"; } } else { if ($info-&gt;{max_length} &lt; 256) { $s .= "VARCHAR($info-&gt;{max_length})"; } elsif ($info-&gt;{max_length} &lt; 65536) { $s .= "TEXT"; } elsif ($info-&gt;{max_length} &lt; 16777216) { $s .= "MEDIUMTEXT"; } else { $s .= "LONGTEXT"; } } # if a column doesn't have empty values, guess that it cannot be NULL $s .= " " . ($info-&gt;{hasempty} ? "NULL" : "NOT NULL"); } $s = "CREATE TABLE $quote$tbl_name$quote\n(\n$s\n);\n"; print $s; } sub print_report { my $col_info_list = shift; my $ncols = @{$col_info_list}; my $s; for my $i (0 .. $ncols - 1) { my $info = $col_info_list-&gt;[$i]; printf "Column %d: %s\n", $i+1, $info-&gt;{label}; if (!$info-&gt;{hasnonempty}) # column is always empty { print " column is always empty\n"; next; } # if the column has nonempty values but one of # these hasn't been ruled out, that's a problem if ($info-&gt;{numeric} &amp;&amp; $info-&gt;{temporal}) { die "Logic error: $info-&gt;{label} was characterized as both" . " numeric and temporal\n"; } print " column has empty values: " . ($info-&gt;{hasempty} ? "yes" : "no") . "\n"; printf " column value maximum length = %d\n", $info-&gt;{max_length}; if ($info-&gt;{numeric}) { printf " column is numeric (range: %g - %g)\n", $info-&gt;{min_val}, $info-&gt;{max_val}; if ($info-&gt;{integer}) { print " column is integer\n"; if ($info-&gt;{nonnegative}) { print " column is nonnegative\n"; } } } elsif ($info-&gt;{temporal}) { if ($info-&gt;{date}) { my $ref = $info-&gt;{date_range}; print " column contains date values"; printf " (part ranges: %d - %d, %d - %d, %d - %d)\n", $ref-&gt;[0]-&gt;{min}, $ref-&gt;[0]-&gt;{max}, $ref-&gt;[1]-&gt;{min}, $ref-&gt;[1]-&gt;{max}, $ref-&gt;[2]-&gt;{min}, $ref-&gt;[2]-&gt;{max}; $ref = $info-&gt;{date_type}; printf " most likely date types: ISO: %s; U.S.: %s; British: %s\n", ($ref-&gt;{iso} ? "yes" : "no"), ($ref-&gt;{us} ? "yes" : "no"), ($ref-&gt;{br} ? "yes" : "no"); } elsif ($info-&gt;{datetime}) { my $ref = $info-&gt;{date_range}; print " column contains date+time values"; printf " (part ranges: %d - %d, %d - %d, %d - %d)\n", $ref-&gt;[0]-&gt;{min}, $ref-&gt;[0]-&gt;{max}, $ref-&gt;[1]-&gt;{min}, $ref-&gt;[1]-&gt;{max}, $ref-&gt;[2]-&gt;{min}, $ref-&gt;[2]-&gt;{max}; $ref = $info-&gt;{date_type}; printf " most likely date types: ISO: %s; U.S.: %s; British: %s\n", ($ref-&gt;{iso} ? "yes" : "no"), ($ref-&gt;{us} ? "yes" : "no"), ($ref-&gt;{br} ? "yes" : "no"); } elsif ($info-&gt;{time}) { print " column contains time values\n"; } else { die "Logic error: $info-&gt;{label} is flagged as temporal, but" . " not as any of the temporal types\n"; } } else { print " column appears to be a string" . " (cannot further narrow the type)\n"; } } } # ---------------------------------------------------------------------- my $prog = "guess_table.pl"; my $usage = &lt;&lt;EOF; Usage: $prog [options] [data_file] Options: --help Print this message --labels, -l Interpret first input line as row of table column labels (default = c1, c2, ...) --lower, --upper Force column labels to be in lowercase or uppercase --quote-names, --skip-quote-names Quote or do not quote table and column identifiers with `` characters in case they are reserved words (default = quote identifiers) --report , -r Report mode; print findings rather than generating a CREATE TABLE statement --table=tbl_name, -t tbl_name Specify table name (default = t) EOF my $help; my $labels; # expect a line of column labels? my $tbl_name = "t"; # table name (default: t) my $report; my $lower; my $upper; my $quote_names = 1; my $skip_quote_names; GetOptions ( # =s means a string value is required after the option "help" =&gt; \$help, # print help message "labels|l" =&gt; \$labels, # expect row of column labels "table|t=s" =&gt; \$tbl_name, # table name "report|r" =&gt; \$report, # report mode "lower" =&gt; \$lower, # lowercase labels "upper" =&gt; \$upper, # uppercase labels "quote-names" =&gt; \$quote_names, # quote identifiers "skip-quote-names" =&gt; \$skip_quote_names # don't quote identifiers ) or die "$usage\n"; die "$usage\n" if defined $help; $report = defined ($report); # convert defined/undefined to boolean $lower = defined ($lower); $upper = defined ($upper); $quote_names = defined ($quote_names); $quote_names = 0 if defined ($skip_quote_names); die "--lower and --upper were both specified; that makes no sense\n" if $lower &amp;&amp; $upper; my $line; my $line_count = 0; my @labels; # column labels my $ncols; # number of columns my $col_info_list; # If labels are expected, read the first line to get them if ($labels) { defined ($line = &lt;&gt;) or die; chomp ($line); @labels = split (/\t/, $line); } # Arrays to hold line numbers of lines with too many/too few fields. # The first line in the file is assumed to be representative. The # number of fields it contains becomes the norm against which any following # lines are assessed. my @excess_fields; my @too_few_fields; while (&lt;&gt;) { chomp ($line = $_); ++$line_count; if (!defined ($ncols)) # don't know this until first data line read { # determine number of columns (assume no more than 10,000) my @val = split (/\t/, $line, 10000); $ncols = @val; if (@labels) # label count must match data column count { die "Label count doesn't match data column count\n" if $ncols != @labels; } else # if there were no labels, create them { @labels = map { "c" . $_ } 1 .. $ncols; } $col_info_list = init_col_info (@labels); } my @val = split (/\t/, $line, 10000); push (@excess_fields, $line_count) if @val &gt; $ncols; push (@too_few_fields, $line_count) if @val &lt; $ncols; for my $i (0 .. $ncols - 1) { my $val = ($i &lt; @val ? $val[$i] : ""); # use "" if field is missing my $info = $col_info_list-&gt;[$i]; $info-&gt;{max_length} = length ($val) if $info-&gt;{max_length} &lt; length ($val); if ($val eq "") { # column does have empty values $info-&gt;{hasempty} = 1; next; # no other tests apply } $info-&gt;{hasnonempty} = 1; # perform numeric tests if no nonnumeric values have yet been seen if ($info-&gt;{numeric}) { # numeric test (doesn't recognize scientific notation) if ($val =~ /^[-+]?(\d+(\.\d*)?|\.\d+)$/) { # not int if contains decimal point $info-&gt;{integer} = 0 if $val =~ /\./; # not unsigned if begins with minus sign $info-&gt;{nonnegative} = 0 if $val =~ /^-/; # track min/max value $info-&gt;{min_val} = $val if !defined ($info-&gt;{min_val}) || $info-&gt;{min_val} &gt; $val; $info-&gt;{max_val} = $val if !defined ($info-&gt;{max_val}) || $info-&gt;{max_val} &lt; $val; } else { # column contains nonnumeric information $info-&gt;{numeric} = 0; $info-&gt;{integer} = 0; } } # perform temporal tests if no nontemporal values have yet been seen if ($info-&gt;{temporal}) { # date/datetime test # allow date, date hour:min, date hour:min:sec if (($info-&gt;{date} || $info-&gt;{datetime}) &amp;&amp; $val =~ /^(\d+)[-\/](\d+)[-\/](\d+)\s*(\d+:\d+(:\d+)?)?$/) { # it's not a time $info-&gt;{time} = 0; # not a date if time part was present; not a # datetime if no time part was present $info-&gt;{ defined ($4) ? "date" : "datetime" } = 0; # use the first three parts to track range of date parts my @val = ($1, $2, $3); my $ref = $info-&gt;{date_range}; foreach my $i (0 .. 2) { # if this is the first value we've checked, create the # structure to hold the min and max; otherwise compare # the stored min/max to the current value if (!defined ($ref-&gt;[$i])) { $ref-&gt;[$i]-&gt;{min} = $val[$i]; $ref-&gt;[$i]-&gt;{max} = $val[$i]; next; } $ref-&gt;[$i]-&gt;{min} = $val[$i] if $ref-&gt;[$i]-&gt;{min} &gt; $val[$i]; $ref-&gt;[$i]-&gt;{max} = $val[$i] if $ref-&gt;[$i]-&gt;{max} &lt; $val[$i]; } } # time test # allow hour:min, hour:min:sec elsif ($info-&gt;{time} &amp;&amp; $val =~ /^\d+:\d+(:\d+)?$/) { # it's not a date or datetime $info-&gt;{date} = 0; $info-&gt;{datetime} = 0; } else { # column contains nontemporal information $info-&gt;{temporal} = 0; } } } } die "Input contained no data lines\n" if $line_count == 0; die "Input lines all were empty\n" if $ncols == 0; # Look at columns that look like DATE or DATETIME columns and attempt # to determine whether they appear to be in ISO, U.S., or British format. # (Skip columns that are always empty, because these assessments cannot # be made for such columns.) for my $i (0 .. $ncols - 1) { my $info = $col_info_list-&gt;[$i]; next unless $info-&gt;{hasnonempty}; next unless $info-&gt;{temporal} &amp;&amp; ($info-&gt;{date} || $info-&gt;{datetime}); my $ref = $info-&gt;{date_range}; # assume that the column is valid as each of the types until ruled out my $valid_as_iso = 1; # [CC]YY-MM-DD my $valid_as_us = 1; # MM-DD-[CC]YY my $valid_as_br = 1; # DD-MM-[CC]YY # first segment is U.S. month, British day my $min = $ref-&gt;[0]-&gt;{min}; my $max = $ref-&gt;[0]-&gt;{max}; $valid_as_us = 0 if $min &lt; 0 || $max &gt; 12; $valid_as_br = 0 if $min &lt; 0 || $max &gt; 31; # second segment is U.S. day, British month, ISO month $min = $ref-&gt;[1]-&gt;{min}; $max = $ref-&gt;[1]-&gt;{max}; $valid_as_us = 0 if $min &lt; 0 || $max &gt; 31; $valid_as_br = 0 if $min &lt; 0 || $max &gt; 12; $valid_as_iso = 0 if $min &lt; 0 || $max &gt; 12; # third segment is ISO day $min = $ref-&gt;[2]-&gt;{min}; $max = $ref-&gt;[2]-&gt;{max}; $valid_as_iso = 0 if $min &lt; 0 || $max &gt; 31; if (!$valid_as_iso &amp;&amp; !$valid_as_us &amp;&amp; !$valid_as_br) { $info-&gt;{temporal} = 0; # huh! guess it's not a date after all } else # save date type results for later { $info-&gt;{date_type}-&gt;{iso} = $valid_as_iso; $info-&gt;{date_type}-&gt;{us} = $valid_as_us; $info-&gt;{date_type}-&gt;{br} = $valid_as_br; } } warn "# Number of lines = $line_count, columns = $ncols\n"; warn "# Number of lines with too few fields: " . scalar (@too_few_fields) . "\n" if @too_few_fields; warn "# Number of lines with excess fields: " . scalar (@excess_fields) . "\n" if @excess_fields; if ($report) { print_report ($col_info_list); } else { for my $i (0 .. $ncols - 1) { my $info = $col_info_list-&gt;[$i]; $info-&gt;{label} = lc ($info-&gt;{label}) if $lower; $info-&gt;{label} = uc ($info-&gt;{label}) if $upper; } print_create_table ($tbl_name, $col_info_list, $quote_names); } </code></pre>
    singulars
    1. This table or related slice is empty.
    plurals
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload