<p><strong>How to save data from Hadoop to a database using Python</strong></p>
<p>I am using Hadoop to process an XML file, so I wrote a mapper file and a reducer file in Python.</p>

<p>Suppose the input to process is <strong>test.xml</strong>:</p>

<pre><code>&lt;report&gt;
 &lt;report-name name="ALL_TIME_KEYWORDS_PERFORMANCE_REPORT"/&gt;
 &lt;date-range date="All Time"/&gt;
 &lt;table&gt;
  &lt;columns&gt;
   &lt;column name="campaignID" display="Campaign ID"/&gt;
   &lt;column name="adGroupID" display="Ad group ID"/&gt;
  &lt;/columns&gt;
  &lt;row campaignID="79057390" adGroupID="3451305670"/&gt;
  &lt;row campaignID="79057390" adGroupID="3451305670"/&gt;
 &lt;/table&gt;
&lt;/report&gt;
</code></pre>

<p><strong>mapper.py</strong> file:</p>

<pre><code>import sys
import cStringIO
import xml.etree.ElementTree as xml

if __name__ == '__main__':
    buff = None
    intext = False
    for line in sys.stdin:
        line = line.strip()
        if line.find("&lt;row") != -1:
            .............
            .............
            .............
            print '%s\t%s' % (campaignID, adGroupID)
</code></pre>

<p><strong>reducer.py</strong> file:</p>

<pre><code>import sys

if __name__ == '__main__':
    for line in sys.stdin:
        print line.strip()
</code></pre>

<p>I ran Hadoop with the following command:</p>

<pre><code>bin/hadoop jar contrib/streaming/hadoop-streaming-1.0.4.jar \
    -file /path/to/mapper.py -mapper /path/to/mapper.py \
    -file /path/to/reducer.py -reducer /path/to/reducer.py \
    -input /path/to/input_file/test.xml \
    -output /path/to/output_folder/to/store/file
</code></pre>

<p>When I run the above command, Hadoop correctly creates an output file at the output path, in the format specified in the <code>reducer.py</code> file, with the required data.</p>

<p>Now, what I am trying to do is this: I do not want to store the output data in the text file that Hadoop creates by default when I run the above command; instead, I want to save the data into a <code>MySQL</code> database.</p>

<p>So I wrote some Python code in the <code>reducer.py</code> file that writes the data directly to the <code>MySQL</code> database, and tried to run the above command with the output path removed, as below:</p>

<pre><code>bin/hadoop jar contrib/streaming/hadoop-streaming-1.0.4.jar \
    -file /path/to/mapper.py -mapper /path/to/mapper.py \
    -file /path/to/reducer.py -reducer /path/to/reducer.py \
    -input /path/to/input_file/test.xml
</code></pre>

<p>And I am getting an error like the one below:</p>

<pre><code>12/11/08 15:20:49 ERROR streaming.StreamJob: Missing required option: output
Usage: $HADOOP_HOME/bin/hadoop jar \
          $HADOOP_HOME/hadoop-streaming.jar [options]
Options:
  -input    &lt;path&gt;                   DFS input file(s) for the Map step
  -output   &lt;path&gt;                   DFS output directory for the Reduce step
  -mapper   &lt;cmd|JavaClassName&gt;      The streaming command to run
  -combiner &lt;cmd|JavaClassName&gt;      The streaming command to run
  -reducer  &lt;cmd|JavaClassName&gt;      The streaming command to run
  -file     &lt;file&gt;                   File/dir to be shipped in the Job jar file
  -inputformat TextInputFormat(default)|SequenceFileAsTextInputFormat|JavaClassName Optional.
  -outputformat TextOutputFormat(default)|JavaClassName Optional.
  .........................
  .........................
</code></pre>

<p>After all this, my doubts are:</p>

<ol>
<li>How do I save the data into the <code>database</code> after processing the files?</li>
<li>In which file (mapper.py or reducer.py?) can we write the code that writes the data into the database?</li>
<li>Which command is used to run Hadoop so that the data is saved into the database? As shown above, when I removed the output folder path from the Hadoop command, it produced an error.</li>
</ol>

<p>Can anyone please help me solve the above problem?</p>

<p><strong>Edited</strong></p>

<p><strong>Process followed</strong></p>

<ol>
<li><p>Created the <code>mapper</code> and <code>reducer</code> files as above, which read the XML file and create a text file in a folder via the <code>hadoop command</code>.</p>

<p>Ex: the folder containing the text file (the result of processing the XML file with the Hadoop command) is below:</p>

<p>/home/local/user/Hadoop/xml_processing/xml_output/part-00000</p></li>
</ol>

<p>Here the XML file size is <code>1.3 GB</code>, and after processing with Hadoop the size of the <code>text file</code> created is <code>345 MB</code>.</p>

<p>Now all I want to do is <code>read the text file at the above path and save the data to the MySQL database</code> as fast as possible.</p>

<p>I have tried this with basic Python, but it takes about <code>350 sec</code> to process the text file and save it to the MySQL database.</p>

<ol>
<li><p>Now, as indicated by nichole, I downloaded Sqoop and unzipped it at a path like the one below:</p>

<p>/home/local/user/sqoop-1.4.2.bin__hadoop-0.20</p></li>
</ol>

<p>Then I entered the <code>bin</code> folder, typed <code>./sqoop</code>, and received the output below:</p>

<pre><code>sh-4.2$ ./sqoop
Warning: /usr/lib/hbase does not exist! HBase imports will fail.
Please set $HBASE_HOME to the root of your HBase installation.
Warning: $HADOOP_HOME is deprecated.
Try 'sqoop help' for usage.
</code></pre>

<p>I have also tried the following:</p>

<pre><code>./sqoop export --connect jdbc:mysql://localhost/Xml_Data --username root \
    --table PerformaceReport \
    --export-dir /home/local/user/Hadoop/xml_processing/xml_output/part-00000 \
    --input-fields-terminated-by '\t'
</code></pre>

<p><strong>Result</strong></p>

<pre><code>Warning: /usr/lib/hbase does not exist! HBase imports will fail.
Please set $HBASE_HOME to the root of your HBase installation.
Warning: $HADOOP_HOME is deprecated.
12/11/27 11:54:57 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
12/11/27 11:54:57 INFO tool.CodeGenTool: Beginning code generation
12/11/27 11:54:57 ERROR sqoop.Sqoop: Got exception running Sqoop: java.lang.RuntimeException: Could not load db driver class: com.mysql.jdbc.Driver
java.lang.RuntimeException: Could not load db driver class: com.mysql.jdbc.Driver
    at org.apache.sqoop.manager.SqlManager.makeConnection(SqlManager.java:636)
    at org.apache.sqoop.manager.GenericJdbcManager.getConnection(GenericJdbcManager.java:52)
    at org.apache.sqoop.manager.SqlManager.execute(SqlManager.java:525)
    at org.apache.sqoop.manager.SqlManager.execute(SqlManager.java:548)
    at org.apache.sqoop.manager.SqlManager.getColumnTypesForRawQuery(SqlManager.java:191)
    at org.apache.sqoop.manager.SqlManager.getColumnTypes(SqlManager.java:175)
    at org.apache.sqoop.manager.ConnManager.getColumnTypes(ConnManager.java:262)
    at org.apache.sqoop.orm.ClassWriter.getColumnTypes(ClassWriter.java:1235)
    at org.apache.sqoop.orm.ClassWriter.generate(ClassWriter.java:1060)
    at org.apache.sqoop.tool.CodeGenTool.generateORM(CodeGenTool.java:82)
    at org.apache.sqoop.tool.ExportTool.exportTable(ExportTool.java:64)
    at org.apache.sqoop.tool.ExportTool.run(ExportTool.java:97)
    at org.apache.sqoop.Sqoop.run(Sqoop.java:145)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:181)
    at org.apache.sqoop.Sqoop.runTool(Sqoop.java:220)
    at org.apache.sqoop.Sqoop.runTool(Sqoop.java:229)
    at org.apache.sqoop.Sqoop.main(Sqoop.java:238)
    at com.cloudera.sqoop.Sqoop.main(Sqoop.java:57)
</code></pre>

<p>Is the above Sqoop command useful for reading the text file and saving it into the database? We have to process from the text file and insert into the database.</p>
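<p>For context, the elided block in <code>mapper.py</code> above is not shown; one possible shape for it, assuming every self-closing <code>&lt;row .../&gt;</code> element sits on its own line (as in <strong>test.xml</strong>), is the sketch below. The helper name <code>map_line</code> is made up for illustration, and the code avoids the Python 2 print statement so it runs under either interpreter:</p>

```python
import sys
import xml.etree.ElementTree as ET

def map_line(line):
    """Return "campaignID<TAB>adGroupID" for a <row .../> line, else None."""
    line = line.strip()
    if line.find("<row") == -1:
        return None
    # Each self-closing <row .../> is a complete XML fragment on its own,
    # so it can be parsed directly without buffering the whole file.
    row = ET.fromstring(line)
    return "%s\t%s" % (row.get("campaignID"), row.get("adGroupID"))

if __name__ == '__main__':
    for line in sys.stdin:
        record = map_line(line)
        if record is not None:
            sys.stdout.write(record + "\n")
```

<p>This only works because the <code>&lt;row&gt;</code> elements here are self-contained; rows spanning multiple lines would need the buffering that <code>buff</code>/<code>intext</code> in the original hint at.</p>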
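<p>On questions 1 and 2: with Hadoop Streaming, the natural place for a direct database write is the reducer, since it sees the final records. The sketch below (an untested assumption, not a drop-in answer) keeps <code>reducer.py</code> reading from stdin but inserts rows in batches through <code>executemany</code> rather than row by row; the <code>MySQLdb</code> package, the column names <code>campaignID</code>/<code>adGroupID</code>, and the connection parameters are all placeholders to adapt:</p>

```python
import sys

def parse_record(line):
    """Split one mapper output line "campaignID\tadGroupID" into a tuple,
    or return None for malformed lines."""
    fields = line.rstrip("\n").split("\t")
    return tuple(fields) if len(fields) == 2 else None

def batches(records, size=1000):
    """Group records into lists of at most `size` items for executemany()."""
    batch = []
    for rec in records:
        batch.append(rec)
        if len(batch) >= size:
            yield batch
            batch = []
    if batch:
        yield batch

if __name__ == '__main__':
    import MySQLdb  # assumption: the mysql-python package is installed
    conn = MySQLdb.connect(host="localhost", user="root",
                           passwd="...", db="Xml_Data")  # placeholder credentials
    cur = conn.cursor()
    records = (r for r in (parse_record(l) for l in sys.stdin) if r)
    for batch in batches(records):
        # One round trip per batch instead of one per row.
        cur.executemany(
            "INSERT INTO PerformaceReport (campaignID, adGroupID) VALUES (%s, %s)",
            batch)
    conn.commit()
    conn.close()
```

<p>The same batching idea applies to the 350-second local load of <code>part-00000</code>: per-row inserts pay a network round trip each, while <code>executemany</code> amortizes it over the batch.</p>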
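<p>On the Sqoop error: <code>Could not load db driver class: com.mysql.jdbc.Driver</code> normally means Sqoop cannot find the MySQL Connector/J jar, which it looks for in its <code>lib</code> directory. A likely fix, with jar version and source path illustrative only:</p>

```shell
# Sqoop loads JDBC drivers from $SQOOP_HOME/lib; copy the MySQL
# Connector/J jar there before re-running "sqoop export".
cp /path/to/mysql-connector-java-5.1.22-bin.jar \
   /home/local/user/sqoop-1.4.2.bin__hadoop-0.20/lib/
```

<p>Once the driver loads, <code>sqoop export</code> with <code>--export-dir</code> and <code>--input-fields-terminated-by '\t'</code> is designed for exactly this step: reading a tab-separated HDFS file such as <code>part-00000</code> and inserting its rows into a database table.</p>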