By: Dallas Snider | Updated: 2016-09-20
Problem
I need to load the results from a SQL Server T-SQL query into the Hadoop Distributed File System (HDFS). How can I do this? How can I validate that the data was loaded correctly?
Solution
Apache Sqoop has a query argument that allows you to pass in a SQL Server T-SQL query, so you can select specific columns and rows to load into the Hadoop Distributed File System (HDFS). You can also join tables and use T-SQL functions for formatting. In this example, we will join the Sales.SalesOrderHeader and Sales.SalesOrderDetail tables in the SQL Server 2014 AdventureWorks2014 database to return the SalesOrderDetailID column, the OrderDate column formatted as YYYYMMDD, and the OrderQty and LineTotal columns for rows where the LineTotal is 100.00 or greater. As with previous tips in this series, this tip was written using SQL Server 2014 and a Cloudera virtual machine running on a laptop.
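For reference, here is the query by itself as it could be run in SQL Server Management Studio to preview the result set before the import. It is the same query that will be embedded in the Sqoop command below.

select d.SalesOrderDetailID,
    CONVERT(nvarchar(8), h.OrderDate, 112) as OrderDate,
    d.OrderQty,
    d.LineTotal
from Sales.SalesOrderDetail as d
inner join Sales.SalesOrderHeader as h
    on d.SalesOrderID = h.SalesOrderID
where d.LineTotal >= 100.00;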
The Sqoop command we will examine is listed below, and we will break down each argument in the following bullet points. Please note that the command must be entered on one complete line or with a backslash (the Linux command line continuation character) at the end of each line except the last.
sqoop import --connect 'jdbc:sqlserver://aaa.bbb.ccc.ddd:pppp;databasename=AdventureWorks2014' \
--username 'sqoopLogin' -P \
--target-dir 'Sales100OrGreater' \
--split-by 'SalesOrderDetailID' \
--query "select d.SalesOrderDetailID,
CONVERT(nvarchar(8), h.OrderDate, 112) as OrderDate,
d.OrderQty, d.LineTotal
from Sales.SalesOrderDetail as d
inner join Sales.SalesOrderHeader as h
on d.SalesOrderID = h.SalesOrderID
where d.LineTotal>=100.00 and \$CONDITIONS"
- sqoop import - The executable is named sqoop, and we are instructing it to import data from a database to HDFS; in this tip the source is a query rather than a single table or view.
- --connect - With the --connect argument, we are passing in the jdbc connect string for SQL Server. In this case, we use the IP address, port number, and database name.
- --username - In this example, the user name is a SQL Server login, not a Windows login. Our database is set up to authenticate in mixed mode. We have a server login named sqoopLogin, and we have a database user named sqoopUser which is a member of the db_datareader role and has a default schema of dbo.
- -P - This argument will prompt the command line user for the password. If Sqoop is executed rarely, this might be a good option. There are multiple other ways to automatically pass the password to the command, but we are trying to keep it simple for this tip; one alternative is sketched after this list.
- --target-dir - The target directory tells sqoop in which directory on the HDFS to store the selected data. This argument is required by sqoop when using a free-form query.
- --split-by - Even though we are selecting the primary key of the SalesOrderDetail table, Sqoop cannot infer a key from a free-form query, so we must still name a column with unique values that it can use to divide the rows among the map tasks.
- --query - This is the argument in which we supply our T-SQL query. The query above is enclosed in double quotes. Also notice there is no backslash (the line continuation character) at the end of the multiple lines containing the query. Finally, notice the "and \$CONDITIONS" at the end of the WHERE clause. This token is required because Sqoop automatically replaces $CONDITIONS with a unique filtering expression for each of the queries it issues (the substituted conditions are visible in the output below). Please do not define $CONDITIONS as an environment variable.
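As noted in the -P bullet above, there are other ways to supply the password when a job needs to run unattended. One option available in this version of Sqoop is the --password-file argument, which reads the password from a file (typically on HDFS) instead of prompting for it. The following is only a sketch; the password value, file name, and location are illustrative and not part of the original tip.

# Store the password in a file on HDFS and lock down its permissions (illustrative path and value).
echo -n 'sqoopLoginPassword' > sqoop.pwd
hdfs dfs -put sqoop.pwd /user/hdfs/sqoop.pwd
hdfs dfs -chmod 400 /user/hdfs/sqoop.pwd
rm sqoop.pwd
# Then replace -P in the sqoop import command above with:
#   --password-file /user/hdfs/sqoop.pwd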
The output from the sqoop command, executed here from a shell script named sqoopCommand.sh, is shown below.
[hdfs@localhost ~]$ ./sqoopCommand.sh
16/08/23 08:07:49 INFO sqoop.Sqoop: Running Sqoop version: 1.4.5-cdh5.2.0
Enter password:
16/08/23 08:07:53 INFO manager.SqlManager: Using default fetchSize of 1000
16/08/23 08:07:53 INFO tool.CodeGenTool: Beginning code generation
16/08/23 08:07:54 INFO manager.SqlManager: Executing SQL statement: select d.SalesOrderDetailID, CONVERT(nvarchar(8), h.OrderDate, 112) as OrderDate, d.OrderQty, d.LineTotal from Sales.SalesOrderDetail as d inner join Sales.SalesOrderHeader as h on d.SalesOrderID = h.SalesOrderID where d.LineTotal>=100.00 and (1 = 0)
16/08/23 08:07:55 INFO manager.SqlManager: Executing SQL statement: select d.SalesOrderDetailID, CONVERT(nvarchar(8), h.OrderDate, 112) as OrderDate, d.OrderQty, d.LineTotal from Sales.SalesOrderDetail as d inner join Sales.SalesOrderHeader as h on d.SalesOrderID = h.SalesOrderID where d.LineTotal>=100.00 and (1 = 0)
16/08/23 08:07:55 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-0.20-mapreduce
Note: /tmp/sqoop-training/compile/822e63bc3c3869e0c10ccacb69821dc5/QueryResult.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
16/08/23 08:07:59 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-training/compile/822e63bc3c3869e0c10ccacb69821dc5/QueryResult.jar
16/08/23 08:07:59 INFO mapreduce.ImportJobBase: Beginning query import.
16/08/23 08:08:02 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
16/08/23 08:08:08 INFO db.DBInputFormat: Using read commited transaction isolation
16/08/23 08:08:08 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(SalesOrderDetailID), MAX(SalesOrderDetailID) FROM (select d.SalesOrderDetailID, CONVERT(nvarchar(8), h.OrderDate, 112) as OrderDate, d.OrderQty, d.LineTotal from Sales.SalesOrderDetail as d inner join Sales.SalesOrderHeader as h on d.SalesOrderID = h.SalesOrderID where d.LineTotal>=100.00 and (1 = 1) ) AS t1
16/08/23 08:08:09 INFO mapred.JobClient: Running job: job_201608230640_0002
16/08/23 08:08:10 INFO mapred.JobClient:  map 0% reduce 0%
16/08/23 08:08:37 INFO mapred.JobClient:  map 50% reduce 0%
16/08/23 08:08:51 INFO mapred.JobClient:  map 75% reduce 0%
16/08/23 08:08:52 INFO mapred.JobClient:  map 100% reduce 0%
16/08/23 08:08:55 INFO mapred.JobClient: Job complete: job_201608230640_0002
16/08/23 08:08:55 INFO mapred.JobClient: Counters: 23
16/08/23 08:08:55 INFO mapred.JobClient:   File System Counters
16/08/23 08:08:55 INFO mapred.JobClient:     FILE: Number of bytes read=0
16/08/23 08:08:55 INFO mapred.JobClient:     FILE: Number of bytes written=1190836
16/08/23 08:08:55 INFO mapred.JobClient:     FILE: Number of read operations=0
16/08/23 08:08:55 INFO mapred.JobClient:     FILE: Number of large read operations=0
16/08/23 08:08:55 INFO mapred.JobClient:     FILE: Number of write operations=0
16/08/23 08:08:55 INFO mapred.JobClient:     HDFS: Number of bytes read=534
16/08/23 08:08:55 INFO mapred.JobClient:     HDFS: Number of bytes written=1814075
16/08/23 08:08:55 INFO mapred.JobClient:     HDFS: Number of read operations=12
16/08/23 08:08:55 INFO mapred.JobClient:     HDFS: Number of large read operations=0
16/08/23 08:08:55 INFO mapred.JobClient:     HDFS: Number of write operations=6
16/08/23 08:08:55 INFO mapred.JobClient:   Job Counters
16/08/23 08:08:55 INFO mapred.JobClient:     Launched map tasks=4
16/08/23 08:08:55 INFO mapred.JobClient:     Total time spent by all maps in occupied slots (ms)=69593
16/08/23 08:08:55 INFO mapred.JobClient:     Total time spent by all reduces in occupied slots (ms)=0
16/08/23 08:08:55 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
16/08/23 08:08:55 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
16/08/23 08:08:55 INFO mapred.JobClient:   Map-Reduce Framework
16/08/23 08:08:55 INFO mapred.JobClient:     Map input records=63605
16/08/23 08:08:55 INFO mapred.JobClient:     Map output records=63605
16/08/23 08:08:55 INFO mapred.JobClient:     Input split bytes=534
16/08/23 08:08:55 INFO mapred.JobClient:     Spilled Records=0
16/08/23 08:08:55 INFO mapred.JobClient:     CPU time spent (ms)=11250
16/08/23 08:08:55 INFO mapred.JobClient:     Physical memory (bytes) snapshot=451989504
16/08/23 08:08:55 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=2954272768
16/08/23 08:08:55 INFO mapred.JobClient:     Total committed heap usage (bytes)=127401984
16/08/23 08:08:55 INFO mapreduce.ImportJobBase: Transferred 1.73 MB in 55.1263 seconds (32.1363 KB/sec)
16/08/23 08:08:55 INFO mapreduce.ImportJobBase: Retrieved 63605 records.
[hdfs@localhost ~]$
Notice that the last line of output above shows 63,605 records were retrieved. This matches the 63,605 records returned by the T-SQL query, as shown in the image below, which also displays the first 5 records returned by the query.
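If you do not have the query results handy, a quick way to confirm the count on the SQL Server side is a COUNT(*) over the same join and WHERE clause. This query is a suggestion rather than part of the original tip; it should return 63605.

-- Count the rows that match the Sqoop import query; expected result: 63605
select count(*) as MatchingRows
from Sales.SalesOrderDetail as d
inner join Sales.SalesOrderHeader as h
    on d.SalesOrderID = h.SalesOrderID
where d.LineTotal >= 100.00;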
After executing the sqoop command, we can use the hdfs dfs -ls command below to see the directory that was created on the HDFS with the name we specified in the --target-dir argument. We also see the DimCustomer directory from a previous tip.
[hdfs@localhost ~]$ hdfs dfs -ls
Found 2 items
drwxrwxrwx   - training supergroup          0 2016-08-16 21:55 DimCustomer
drwxrwxrwx   - training supergroup          0 2016-08-23 08:08 Sales100OrGreater
[hdfs@localhost ~]$
We can use the hdfs dfs -ls command below to list the contents of the Sales100OrGreater directory. Notice the four part files, one for each of the four map tasks Sqoop launched by default.
[hdfs@localhost ~]$ hdfs dfs -ls Sales100OrGreater
Found 6 items
-rw-rw-rw-   1 training supergroup          0 2016-08-23 08:08 Sales100OrGreater/_SUCCESS
drwxrwxrwx   - training supergroup          0 2016-08-23 08:08 Sales100OrGreater/_logs
-rw-rw-rw-   1 training supergroup     717914 2016-08-23 08:08 Sales100OrGreater/part-m-00000
-rw-rw-rw-   1 training supergroup     447516 2016-08-23 08:08 Sales100OrGreater/part-m-00001
-rw-rw-rw-   1 training supergroup     353060 2016-08-23 08:08 Sales100OrGreater/part-m-00002
-rw-rw-rw-   1 training supergroup     295585 2016-08-23 08:08 Sales100OrGreater/part-m-00003
[hdfs@localhost ~]$
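As an additional sanity check, the sizes of the four part files (717,914 + 447,516 + 353,060 + 295,585 bytes) sum to 1,814,075 bytes, which matches the "HDFS: Number of bytes written" counter in the job output above. The hdfs dfs -du command below reports the same per-file sizes directly; it is shown here as a suggestion, and its output was not captured for this tip.

[hdfs@localhost ~]$ hdfs dfs -du Sales100OrGreater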
The hdfs dfs -cat command below will display the first 10 records in the first part file on the HDFS. We can use this command to verify the correct records were loaded.
[hdfs@localhost ~]$ hdfs dfs -cat Sales100OrGreater/part-m-00000|head
1,20110531,1,2024.994000
2,20110531,3,6074.982000
3,20110531,1,2024.994000
4,20110531,1,2039.994000
5,20110531,1,2039.994000
6,20110531,2,4079.988000
7,20110531,1,2039.994000
13,20110531,1,419.458900
14,20110531,1,874.794000
15,20110531,1,809.760000
[hdfs@localhost ~]$
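To verify the record count on the Hadoop side, we can count the lines across all four part files and compare the total to the 63,605 records reported by Sqoop and by the T-SQL query. The command below is a suggestion along those lines; its output was not captured for this tip, but it should return 63605 if the load is complete.

[hdfs@localhost ~]$ hdfs dfs -cat Sales100OrGreater/part-m-*|wc -l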
Next Steps
- Please experiment with different formatting functions, joins, and WHERE clauses; a small example follows after this list.
- Finally, please check out the other tips and tutorials on T-SQL on MSSQLTips.com.
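As an example of the experimentation suggested above, the following variation (illustrative only, not part of the original tip) restricts the result to orders placed in 2013 and formats the date as YYYY-MM-DD using CONVERT style 120. If you embed it in the --query argument, remember to keep the "and \$CONDITIONS" token at the end of the WHERE clause.

select d.SalesOrderDetailID,
    CONVERT(nvarchar(10), h.OrderDate, 120) as OrderDate,
    d.OrderQty,
    d.LineTotal
from Sales.SalesOrderDetail as d
inner join Sales.SalesOrderHeader as h
    on d.SalesOrderID = h.SalesOrderID
where d.LineTotal >= 100.00
    and YEAR(h.OrderDate) = 2013;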