Apache Sqoop/Pig field escaping
<p>We are exporting some data from MySQL using Sqoop, processing it with Apache Pig, and then attempting to export the result from HDFS back into a MySQL database. However, the export fails with:</p>
<pre><code>java.io.IOException: Can't export data, please check task tracker logs
    at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:112)
    at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:39)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
    at org.apache.sqoop.mapreduce.AutoProgressMapper.run(AutoProgressMapper.java:64)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: java.lang.NumberFormatException: For input string: ".proseries.com"
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
    at java.lang.Integer.parseInt(Integer.java:449)
    at java.lang.Integer.valueOf(Integer.java:554)
    at mdm_urls.__loadFromFields(mdm_urls.java:419)
</code></pre>
<p>The HDFS data looks like this (tab separated):</p>
<pre><code>id:int url:text tld:text port:int
</code></pre>
<p>Somehow the <code>tld</code> field is being imported into the <code>port</code> column <em>for some rows</em>. Out of ~250M rows, this happens for fewer than 10. My initial assumption was that the <code>url</code> field must have a tab in it. However, we have stripped all tabs in our Pig script:</p>
<pre><code>REGISTER target/mystuff.jar;

legacy_urls = LOAD 'url' USING PigStorage(',') AS (id, sha1, url_text);

legacy_urls_norm = FOREACH legacy_urls GENERATE
    id AS id,
    sha1 AS sha1,
    REPLACE(REPLACE(url_text, '\n', ''), '\t', '') AS url_text;

urls = FOREACH legacy_urls_norm GENERATE
    id,
    url_text,
    mystuff.RootDomain(url_text),
    mystuff.Protocol(url_text),
    mystuff.Host(url_text),
    mystuff.Path(url_text),
    mystuff.EffectiveTld(url_text),
    mystuff.Port(url_text),
    sha1;

STORE urls INTO 'mdm_urls';
</code></pre>
<p>Here is my Sqoop export command:</p>
<pre><code>sqoop export --connect jdbc:mysql://hostname/db_name --input-fields-terminated-by "\t" --table test --export-dir my_urls
</code></pre>
<p>I am having a difficult time debugging this because the Sqoop errors give no indication of which row was being processed (so I cannot confirm whether a tab character is still present, etc.). My first question is: how might I better troubleshoot this issue? My second question is: how are people escaping bad input data with Pig?</p>
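<p>Since the Sqoop error does not name the failing row, one way to narrow it down is to scan the exported text for rows that would not parse. The following is only a minimal sketch in Python, not part of the original pipeline: the function name <code>find_bad_rows</code>, the four-column layout (taken from the schema shown above), and the choice to treat empty fields as acceptable are all assumptions for illustration.</p>

```python
import re
from typing import Iterable, Iterator, Sequence, Tuple

# Plain ASCII integers only; hypothetical stand-in for what an int column accepts.
_INT_RE = re.compile(r"-?[0-9]+")


def find_bad_rows(
    lines: Iterable[str],
    expected_fields: int,
    int_columns: Sequence[int],
    delimiter: str = "\t",
    allow_empty: bool = True,
) -> Iterator[Tuple[int, str, str]]:
    """Yield (line_number, reason, row) for rows that look malformed.

    A row is reported when its field count differs from expected_fields,
    or when a column listed in int_columns does not hold an integer.
    """
    for line_number, raw in enumerate(lines, start=1):
        row = raw.rstrip("\n")
        fields = row.split(delimiter)
        if len(fields) != expected_fields:
            reason = f"expected {expected_fields} fields, got {len(fields)}"
            yield line_number, reason, row
            continue
        for index in int_columns:
            value = fields[index]
            if value == "" and allow_empty:
                # Assumption: empty values are handled elsewhere (e.g. as NULL).
                continue
            if _INT_RE.fullmatch(value) is None:
                yield line_number, f"column {index} is not an integer: {value!r}", row
                break


# Small demonstration on made-up rows using the id/url/tld/port layout.
sample = [
    "1\thttp://a.example\t.example\t80\n",
    "2\thttp://b.example\t.proseries.com\n",       # a field is missing
    "3\thttp://c.example\t.example\tnot-a-port\n",  # port is not numeric
]
for line_number, reason, row in find_bad_rows(sample, 4, (0, 3)):
    print(line_number, reason, repr(row))
```

<p>To check the real export, the same function could be given <code>sys.stdin</code> while piping the output of <code>hadoop fs -cat</code> on the part files into the script. Printing each row with <code>repr</code> makes otherwise invisible characters, such as a stray carriage return, visible.</p>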
 
