By: Tim Smith | Updated: 2015-05-13
Problem
The following tip addresses some of the questions I am asked about using bulk insert as an ETL tool. Microsoft ships bulk insert with SQL Server, and it is one of the most reliable ETL tools I've used. Combined with PowerShell, it can significantly reduce ETL development time and, unlike some alternatives, it lends itself to automation, such as building an application that reads a file's format, then creates a bulk insert based on that structure by detecting new rows, delimiters, problematic lines, and so on. At the time of this tip, bulk insert is available in SQL Server 2005, 2008, 2008 R2, 2012, and 2014, and there is no indication that Microsoft intends to deprecate it (a question I'm asked often).
Solution
One popular misconception about bulk insert involves inserting an entire document, such as a full text file. People often specify a new line character as the row separator, which is wrong if you want the whole document in one row, or if the file uses a different separator. For instance, see the below code for an example inserting an HTML document as a single row:
CREATE TABLE tb_HTMLDocuments (HTMLData VARCHAR(MAX))

BULK INSERT tb_HTMLDocuments
FROM 'C:\files\htmlfile.html'
WITH (
    ROWTERMINATOR = '<!--END-->'
)

SELECT * FROM tb_HTMLDocuments

DROP TABLE tb_HTMLDocuments
If I had specified a new line character as the row separator, I would have seen more than one row. Now suppose that I have several HTML documents in one text file, each ending with the same terminator, and I want one row per document:
CREATE TABLE tb_HTMLDocuments (HTMLData VARCHAR(MAX))

BULK INSERT tb_HTMLDocuments
FROM 'C:\files\htmlfile.html'
WITH (
    ROWTERMINATOR = '<!--END-->'
)

SELECT * FROM tb_HTMLDocuments

DROP TABLE tb_HTMLDocuments
A VARCHAR(MAX) column can hold up to 2GB, so a large document fits into one column provided it doesn't exceed that size. This leads to a related error I'm asked about frequently: Bulk load data conversion error (truncation) for row [number], column [number]. Let's generate this error by shrinking the column to VARCHAR(1):
CREATE TABLE tb_HTMLDocuments (HTMLData VARCHAR(1))

BULK INSERT tb_HTMLDocuments
FROM 'C:\files\htmlfile.html'
WITH (
    ROWTERMINATOR = '<!--END-->'
)

SELECT * FROM tb_HTMLDocuments

DROP TABLE tb_HTMLDocuments
The column size must be large enough to hold the values being inserted. The error message identifies the specific problem: the row and the column. For a file with thousands of rows and hundreds of columns, that lets you pinpoint what is causing trouble (more often than not, the file has an extra delimiter throwing everything out of place). Some data sources will require a line-by-line solution, such as removing invalid lines before the load; it depends on the data, and bulk insert makes it easy to determine where the error is.
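One way to inspect the offending lines without scanning the source file is bulk insert's ERRORFILE option, which writes rejected rows to a separate file. A sketch, reusing the table and file from above (the error file path is illustrative):

```sql
-- Illustrative: rows that fail conversion are written to errorrows.txt;
-- SQL Server also creates a companion .Error.Txt file recording the row
-- number and the reason each row was rejected.
CREATE TABLE tb_HTMLDocuments (HTMLData VARCHAR(1))

BULK INSERT tb_HTMLDocuments
FROM 'C:\files\htmlfile.html'
WITH (
    ROWTERMINATOR = '<!--END-->'
    ,ERRORFILE = 'C:\files\errorrows.txt'
)
```

Note that the error file must not already exist when the load runs, or bulk insert will fail.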
This also highlights another tool that developers have available. Suppose we only want bulk insert to hit one error then quit. We can tell it to stop after a certain number of errors:
CREATE TABLE tb_HTMLDocuments (HTMLData VARCHAR(100))

BULK INSERT tb_HTMLDocuments
FROM 'C:\files\htmlfile.html'
WITH (
    ROWTERMINATOR = '<!--END-->'
    ,MAXERRORS = 0
)

SELECT * FROM tb_HTMLDocuments

DROP TABLE tb_HTMLDocuments
Since the error count (1) exceeds our maximum (0), bulk insert inserts nothing. The load continues as long as it stays under the error threshold, so be careful when raising the maximum; in most cases we want the load to throw an error and stop.
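For comparison, a nonzero threshold tolerates a few bad rows while still loading the rest. A sketch using the same table and file as above (the tolerance of 10 is a hypothetical value):

```sql
-- Illustrative: up to 10 rows may fail conversion before the load
-- aborts; rows that convert cleanly are still inserted.
BULK INSERT tb_HTMLDocuments
FROM 'C:\files\htmlfile.html'
WITH (
    ROWTERMINATOR = '<!--END-->'
    ,MAXERRORS = 10
)
```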
We can also skip the first and last rows. I've run across various (highly accurate) data providers who love to throw in extra lines of meaningless data that don't match the format, and bulk insert allows us to skip those rows. The below examples show this:
CREATE TABLE tb_HTMLDocuments (HTMLData VARCHAR(MAX))

BULK INSERT tb_HTMLDocuments
FROM 'C:\files\htmlfile.html'
WITH (
    ROWTERMINATOR = '<!--END-->'
    ,FIRSTROW = 2
)

SELECT * FROM tb_HTMLDocuments

DROP TABLE tb_HTMLDocuments
CREATE TABLE tb_HTMLDocuments (HTMLData VARCHAR(MAX))

BULK INSERT tb_HTMLDocuments
FROM 'C:\files\htmlfile.html'
WITH (
    ROWTERMINATOR = '<!--END-->'
    ,LASTROW = 2
)

SELECT * FROM tb_HTMLDocuments

DROP TABLE tb_HTMLDocuments
Developers should also consider the option WITH TABLOCK, which locks the table during the bulk insert load. This is most useful on a first load; it may matter less on later loads if we want to load multiple files into the same table. For some file layouts, we can use the FORMATFILE option to point bulk insert at a format file, loading data that matches the format file's pattern into the table. In all cases, for performance, I'd highly suggest removing indexes before a load and adding them back afterward, unless the situation requires the indexes during the load. And with PowerShell, the more data scrubbing you do up front before loading, the more steps you can skip, such as loading data into a staging or temporary table first.
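That first-load pattern can be sketched as follows; the table, file, and index names here are hypothetical, not from the examples above:

```sql
-- Illustrative pattern: drop the nonclustered index, bulk load under
-- a table lock, then rebuild the index once the data is in place.
DROP INDEX IX_StageData_Value ON tb_StageData

BULK INSERT tb_StageData
FROM 'C:\files\datafile.txt'
WITH (
    FIELDTERMINATOR = ','
    ,ROWTERMINATOR = '\n'
    ,TABLOCK
)

CREATE NONCLUSTERED INDEX IX_StageData_Value ON tb_StageData (DataValue)
```

Rebuilding the index once after the load is generally cheaper than maintaining it row by row during the load.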
Next Steps
- For even more possible commands, read the documentation on bulk insert provided by Microsoft.
- Test generating errors from common issues and see how you can write code to automate handling them; usually, they're all derivatives of the same problem.