<p><strong>How to save data from Hadoop to a database using Python</strong></p>
<p>I am using Hadoop to process an XML file, so I wrote a mapper file and a reducer file in Python.</p>

<p>Suppose the input to process is <strong>test.xml</strong>:</p>

<pre><code>&lt;report&gt;
 &lt;report-name name="ALL_TIME_KEYWORDS_PERFORMANCE_REPORT"/&gt;
 &lt;date-range date="All Time"/&gt;
 &lt;table&gt;
  &lt;columns&gt;
   &lt;column name="campaignID" display="Campaign ID"/&gt;
   &lt;column name="adGroupID" display="Ad group ID"/&gt;
  &lt;/columns&gt;
  &lt;row campaignID="79057390" adGroupID="3451305670"/&gt;
  &lt;row campaignID="79057390" adGroupID="3451305670"/&gt;
 &lt;/table&gt;
&lt;/report&gt;
</code></pre>

<p><strong>mapper.py</strong> file:</p>

<pre><code>import sys
import cStringIO
import xml.etree.ElementTree as xml

if __name__ == '__main__':
    buff = None
    intext = False
    for line in sys.stdin:
        line = line.strip()
        if line.find("&lt;row") != -1:
            .............
            .............
            .............
            print '%s\t%s' % (campaignID, adGroupID)
</code></pre>

<p><strong>reducer.py</strong> file:</p>

<pre><code>import sys

if __name__ == '__main__':
    for line in sys.stdin:
        print line.strip()
</code></pre>

<p>I ran Hadoop with the following command:</p>

<pre><code>bin/hadoop jar contrib/streaming/hadoop-streaming-1.0.4.jar \
    -file /path/to/mapper.py -mapper /path/to/mapper.py \
    -file /path/to/reducer.py -reducer /path/to/reducer.py \
    -input /path/to/input_file/test.xml \
    -output /path/to/output_folder/to/store/file
</code></pre>

<p>When I run the above command, Hadoop correctly creates an output file at the output path, in the format specified in the <code>reducer.py</code> file, with the required data.</p>

<p>Now, what I am trying to do is this: I do not want to store the output data in the text file that Hadoop creates by default when I run the above command; instead, I want to save the data into a <code>MySQL</code> database.</p>

<p>So I wrote some Python code in the <code>reducer.py</code> file that writes the data directly to the <code>MySQL</code> database, and tried to run the above command with the output path removed, as below:</p>

<pre><code>bin/hadoop jar contrib/streaming/hadoop-streaming-1.0.4.jar \
    -file /path/to/mapper.py -mapper /path/to/mapper.py \
    -file /path/to/reducer.py -reducer /path/to/reducer.py \
    -input /path/to/input_file/test.xml
</code></pre>

<p>And I am getting an error like the one below:</p>

<pre><code>12/11/08 15:20:49 ERROR streaming.StreamJob: Missing required option: output
Usage: $HADOOP_HOME/bin/hadoop jar \
          $HADOOP_HOME/hadoop-streaming.jar [options]
Options:
  -input    &lt;path&gt;                   DFS input file(s) for the Map step
  -output   &lt;path&gt;                   DFS output directory for the Reduce step
  -mapper   &lt;cmd|JavaClassName&gt;      The streaming command to run
  -combiner &lt;cmd|JavaClassName&gt;      The streaming command to run
  -reducer  &lt;cmd|JavaClassName&gt;      The streaming command to run
  -file     &lt;file&gt;                   File/dir to be shipped in the Job jar file
  -inputformat TextInputFormat(default)|SequenceFileAsTextInputFormat|JavaClassName Optional.
  -outputformat TextOutputFormat(default)|JavaClassName Optional.
  .........................
  .........................
</code></pre>

<p>After all this, my doubts are:</p>

<ol>
<li>How do I save the data into the <code>database</code> after processing the files?</li>
<li>In which file (mapper.py or reducer.py?) can we write the code that writes the data into the database?</li>
<li>Which command is used to run Hadoop so that the data is saved into the database? As shown above, when I removed the output folder path from the Hadoop command, it produced an error.</li>
</ol>

<p>Can anyone please help me solve the above problem?</p>

<p><strong>Edited</strong></p>

<p><strong>Process followed</strong></p>

<ol>
<li><p>Created the <code>mapper</code> and <code>reducer</code> files as above, which read the XML file and create a text file in a folder via the <code>hadoop command</code>.</p>

<p>Ex: the folder containing the text file (the result of processing the XML file with the Hadoop command) is below:</p>

<p>/home/local/user/Hadoop/xml_processing/xml_output/part-00000</p></li>
</ol>

<p>Here the XML file size is <code>1.3 GB</code>, and after processing with Hadoop the size of the <code>text file</code> created is <code>345 MB</code>.</p>

<p>Now all I want to do is <code>read the text file at the above path and save the data to the MySQL database</code> as fast as possible.</p>

<p>I have tried this with basic Python, but it takes about <code>350 sec</code> to process the text file and save it to the MySQL database.</p>

<ol>
<li><p>Now, as indicated by nichole, I downloaded Sqoop and unzipped it at a path like the one below:</p>

<p>/home/local/user/sqoop-1.4.2.bin__hadoop-0.20</p></li>
</ol>

<p>Then I entered the <code>bin</code> folder, typed <code>./sqoop</code>, and received the output below:</p>

<pre><code>sh-4.2$ ./sqoop
Warning: /usr/lib/hbase does not exist! HBase imports will fail.
Please set $HBASE_HOME to the root of your HBase installation.
Warning: $HADOOP_HOME is deprecated.
Try 'sqoop help' for usage.
</code></pre>

<p>I have also tried the following:</p>

<pre><code>./sqoop export --connect jdbc:mysql://localhost/Xml_Data --username root \
    --table PerformaceReport \
    --export-dir /home/local/user/Hadoop/xml_processing/xml_output/part-00000 \
    --input-fields-terminated-by '\t'
</code></pre>

<p><strong>Result</strong></p>

<pre><code>Warning: /usr/lib/hbase does not exist! HBase imports will fail.
Please set $HBASE_HOME to the root of your HBase installation.
Warning: $HADOOP_HOME is deprecated.
12/11/27 11:54:57 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
12/11/27 11:54:57 INFO tool.CodeGenTool: Beginning code generation
12/11/27 11:54:57 ERROR sqoop.Sqoop: Got exception running Sqoop: java.lang.RuntimeException: Could not load db driver class: com.mysql.jdbc.Driver
java.lang.RuntimeException: Could not load db driver class: com.mysql.jdbc.Driver
    at org.apache.sqoop.manager.SqlManager.makeConnection(SqlManager.java:636)
    at org.apache.sqoop.manager.GenericJdbcManager.getConnection(GenericJdbcManager.java:52)
    at org.apache.sqoop.manager.SqlManager.execute(SqlManager.java:525)
    at org.apache.sqoop.manager.SqlManager.execute(SqlManager.java:548)
    at org.apache.sqoop.manager.SqlManager.getColumnTypesForRawQuery(SqlManager.java:191)
    at org.apache.sqoop.manager.SqlManager.getColumnTypes(SqlManager.java:175)
    at org.apache.sqoop.manager.ConnManager.getColumnTypes(ConnManager.java:262)
    at org.apache.sqoop.orm.ClassWriter.getColumnTypes(ClassWriter.java:1235)
    at org.apache.sqoop.orm.ClassWriter.generate(ClassWriter.java:1060)
    at org.apache.sqoop.tool.CodeGenTool.generateORM(CodeGenTool.java:82)
    at org.apache.sqoop.tool.ExportTool.exportTable(ExportTool.java:64)
    at org.apache.sqoop.tool.ExportTool.run(ExportTool.java:97)
    at org.apache.sqoop.Sqoop.run(Sqoop.java:145)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:181)
    at org.apache.sqoop.Sqoop.runTool(Sqoop.java:220)
    at org.apache.sqoop.Sqoop.runTool(Sqoop.java:229)
    at org.apache.sqoop.Sqoop.main(Sqoop.java:238)
    at com.cloudera.sqoop.Sqoop.main(Sqoop.java:57)
</code></pre>

<p>Is the above Sqoop command useful for reading the text file and saving it into the database? We have to process from the text file and insert into the database.</p>
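<p>For context, the elided block in <code>mapper.py</code> above is not shown; one possible shape for it, assuming every self-closing <code>&lt;row .../&gt;</code> element sits on its own line (as in <strong>test.xml</strong>), is the sketch below. The helper name <code>map_line</code> is made up for illustration, and the code avoids the Python 2 print statement so it runs under either interpreter:</p>

```python
import sys
import xml.etree.ElementTree as ET

def map_line(line):
    """Return "campaignID<TAB>adGroupID" for a <row .../> line, else None."""
    line = line.strip()
    if line.find("<row") == -1:
        return None
    # Each self-closing <row .../> is a complete XML fragment on its own,
    # so it can be parsed directly without buffering the whole file.
    row = ET.fromstring(line)
    return "%s\t%s" % (row.get("campaignID"), row.get("adGroupID"))

if __name__ == '__main__':
    for line in sys.stdin:
        record = map_line(line)
        if record is not None:
            sys.stdout.write(record + "\n")
```

<p>This only works because the <code>&lt;row&gt;</code> elements here are self-contained; rows spanning multiple lines would need the buffering that <code>buff</code>/<code>intext</code> in the original hint at.</p>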
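<p>On questions 1 and 2: with Hadoop Streaming, the natural place for a direct database write is the reducer, since it sees the final records. The sketch below (an untested assumption, not a drop-in answer) keeps <code>reducer.py</code> reading from stdin but inserts rows in batches through <code>executemany</code> rather than row by row; the <code>MySQLdb</code> package, the column names <code>campaignID</code>/<code>adGroupID</code>, and the connection parameters are all placeholders to adapt:</p>

```python
import sys

def parse_record(line):
    """Split one mapper output line "campaignID\tadGroupID" into a tuple,
    or return None for malformed lines."""
    fields = line.rstrip("\n").split("\t")
    return tuple(fields) if len(fields) == 2 else None

def batches(records, size=1000):
    """Group records into lists of at most `size` items for executemany()."""
    batch = []
    for rec in records:
        batch.append(rec)
        if len(batch) >= size:
            yield batch
            batch = []
    if batch:
        yield batch

if __name__ == '__main__':
    import MySQLdb  # assumption: the mysql-python package is installed
    conn = MySQLdb.connect(host="localhost", user="root",
                           passwd="...", db="Xml_Data")  # placeholder credentials
    cur = conn.cursor()
    records = (r for r in (parse_record(l) for l in sys.stdin) if r)
    for batch in batches(records):
        # One round trip per batch instead of one per row.
        cur.executemany(
            "INSERT INTO PerformaceReport (campaignID, adGroupID) VALUES (%s, %s)",
            batch)
    conn.commit()
    conn.close()
```

<p>The same batching idea applies to the 350-second local load of <code>part-00000</code>: per-row inserts pay a network round trip each, while <code>executemany</code> amortizes it over the batch.</p>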
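<p>On the Sqoop error: <code>Could not load db driver class: com.mysql.jdbc.Driver</code> normally means Sqoop cannot find the MySQL Connector/J jar, which it looks for in its <code>lib</code> directory. A likely fix, with jar version and source path illustrative only:</p>

```shell
# Sqoop loads JDBC drivers from $SQOOP_HOME/lib; copy the MySQL
# Connector/J jar there before re-running "sqoop export".
cp /path/to/mysql-connector-java-5.1.22-bin.jar \
   /home/local/user/sqoop-1.4.2.bin__hadoop-0.20/lib/
```

<p>Once the driver loads, <code>sqoop export</code> with <code>--export-dir</code> and <code>--input-fields-terminated-by '\t'</code> is designed for exactly this step: reading a tab-separated HDFS file such as <code>part-00000</code> and inserting its rows into a database table.</p>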