I recently started a new project and came upon this same issue with my source data, sort of. My files come from external sources, and we can't change what they send or how they send it, so I had to find a solution, but I'd like to improve upon that solution, so ideas are welcome. Using this idea for how to change the delimiters, what can a person do when there are varying delimiters?
For example:
1. xxx,yyy,zzz,"abc,def,ghi",aaa,bbb -- multiple commas within a set of "
2. "abc,xxx","abcabc",bbb,ccc,ddd,zzz -- a different column with a comma within the set of "
3. aaa,bbb,c,dd5655d,eeeeeeeae,ffff -- no " at all and column widths can vary for all data
From the initial import of the file saved as CSV, I started by importing into a temp table with ONE column defined as varchar(max), plus an IDENTITY column so that the data is (hopefully) numbered in the order it came in.
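For reference, the staging table I'm describing looks roughly like this (a sketch; the table name matches what I use below, the column names are just what I picked):

```sql
-- Raw staging table: one IDENTITY column to preserve arrival order,
-- one varchar(max) column holding the entire unparsed row.
CREATE TABLE dbo.temphold_SourceData1_varcharMaxRAW (
    ID        int IDENTITY(1,1) NOT NULL,
    ColumnRAW varchar(max)      NULL
);
```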
Then I split that table into 2 new temp tables, a DQ (double-quote) and a non-DQ table, using WHERE col LIKE '%"%' -- i.e., whether the row contained a " anywhere.
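The split itself is just two SELECT ... INTO statements (sketch; the DQ table name is the one my script below uses, the non-DQ name is made up here):

```sql
-- Rows that contain at least one double quote go to the DQ table...
SELECT ID, ColumnRAW
INTO   dbo.temphold_SourceData1_DQData
FROM   dbo.temphold_SourceData1_varcharMaxRAW
WHERE  ColumnRAW LIKE '%"%';

-- ...and everything else goes to the non-DQ table (hypothetical name).
SELECT ID, ColumnRAW
INTO   dbo.temphold_SourceData1_NonDQData
FROM   dbo.temphold_SourceData1_varcharMaxRAW
WHERE  ColumnRAW NOT LIKE '%"%';
```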
Then I wrote some messy SQL (it's below if you REALLY want to see it) to find those quoted "columns" and replace the embedded commas with spaces, or even pipes if they need to be converted back to commas later. I think I should change it to a loop in case I end up with more than 2 sets of ", but I am pressed for time because I have 20 more of these lovely files to figure out how to handle.
Then I copied the modified table and the unmodified one into another temp table (yes, lots of temp tables, but I needed to be able to trace each step to make sure it worked).
Ideally, I was hoping for a way to take that final modified table and ETL it into another formatted/columnar table, 55 columns for this particular file, and have SSIS (I'm on 2012) do the dirty work like it does when the source is csv/txt/excel, but I couldn't find anything that could handle that.
Does anyone know how to take a SQL table (NOT an external file) as the source, one that contains comma-delimited data, and send it to a destination without using script logic?
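One set-based option I've seen for splitting a delimited column on SQL Server 2012, without a Script Component, is the XML trick (a sketch against my cleaned table; it assumes the data contains no characters that need XML escaping, like & or <, and that embedded commas have already been blanked out):

```sql
-- Turn each row into <r>val</r><r>val</r>... XML, then pull columns
-- out by position with value(). Repeat the value() lines up to col55.
SELECT ID,
       x.value('(/r)[1]', 'varchar(255)') AS col1,
       x.value('(/r)[2]', 'varchar(255)') AS col2,
       x.value('(/r)[3]', 'varchar(255)') AS col3
       -- ...up to (/r)[55]...
FROM (
    SELECT ID,
           CAST('<r>' + REPLACE(ColumnRAW, ',', '</r><r>') + '</r>' AS xml) AS x
    FROM dbo.temphold_SourceData1_varcharMaxRAW
) t;
```

Wrapped in a view, SSIS could then read this as an ordinary OLE DB source with real columns.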
Due to my time crunch, what I ended up doing was taking my newly created table, cleaned of the commas between " (aka the 'pre-final' CSV SQL table), exporting it to a CSV destination file, and then setting up a new Connection Manager to read that new CSV as a source and ETL it into the final 55-column table.
Any ideas?
Brace yourself, it's rather crazy SQL down here...
I'd LOVE to avoid the manual coding because I'm going to be in a pickle if subsequent monthly files end up with more than 2 sets of " :(
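For what it's worth, the 2-sets-of-quotes limitation could be lifted by walking each row character by character instead of doing positional REPLACEs. An untested sketch (function name is made up):

```sql
-- Walk the row once, tracking whether we are currently inside quotes,
-- and blank out any comma that falls inside. Handles any number of " pairs.
CREATE FUNCTION dbo.fn_BlankQuotedCommas (@row varchar(max))
RETURNS varchar(max)
AS
BEGIN
    DECLARE @out varchar(max) = '',
            @i   int = 1,
            @c   char(1),
            @inQ bit = 0;  -- are we inside a quoted section?

    WHILE @i <= LEN(@row)
    BEGIN
        SET @c = SUBSTRING(@row, @i, 1);
        IF @c = '"'
            SET @inQ = CASE @inQ WHEN 1 THEN 0 ELSE 1 END;  -- toggle state
        ELSE IF @c = ',' AND @inQ = 1
            SET @c = ' ';  -- embedded comma becomes a space
        SET @out = @out + @c;
        SET @i = @i + 1;
    END
    RETURN @out;
END
```

Then the whole DQ table could be fixed in one UPDATE, e.g. SET ColumnRAW = dbo.fn_BlankQuotedCommas(ColumnRAW), instead of a REPLACE per quote pair.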
-- converts rows that contain embedded commas within (for now) 2 sets of " to blanks
-- the output from here gets joined up with the rows without " and then sent back to csv, to reimport as corrected csv
-- yes, a roundabout way, but the best I could think of right now
truncate table [homeServer].[dbo].[temphold_SourceData1_varcharMaxRAW]
--created this to keep the original identity ID created from the initial import
INSERT INTO [homeServer].[dbo].[temphold_SourceData1_varcharMaxRAW]
--replace the 2nd set with values from p5 as the source and there should be no more commas
SELECT cast(p6.ID as int) as OrigID
      ,cast(REPLACE(p6.newColumnRAW1, p6.set2, p6.NEWset2) as varchar(max)) as newColumnRAW
FROM (
    -- replace the first set with values from the p5 set
    SELECT p5.ID
          ,REPLACE(p5.ColumnRAW, p5.set1, p5.NEWset1) as newColumnRAW1
          ,p5.set2
          ,p5.NEWset2
    FROM (
        -- chg comma to space if between the " in the orig data (p4)
        -- and join to data that contains 4 positionals containing " (p3)
        SELECT p4.ID
              ,p4.ColumnRAW
              ,p3.pos1, p3.pos2, p3.pos3, p3.pos4
              ,substring(p4.columnraw, p3.pos1, ((p3.pos2 - p3.pos1) + 1)) as set1
              ,substring(p4.columnraw, p3.pos3, ((p3.pos4 - p3.pos3) + 1)) as set2
              ,REPLACE(substring(p4.columnraw, p3.pos1, ((p3.pos2 - p3.pos1) + 1)), ',', ' ') as NEWset1 -- changed here to space
              ,REPLACE(substring(p4.columnraw, p3.pos3, ((p3.pos4 - p3.pos3) + 1)), ',', ' ') as NEWset2 -- changed here to space
        FROM temphold_SourceData1_DQData p4
        -- join from orig by ID created by other DTSx RAW import to those with dq positions
        INNER JOIN (
            SELECT pos1
                  ,pos2
                  ,case when xpos3 < pos2 then 0 else xpos3 end as pos3
                  ,case when xpos3 < pos2 then 0 else charindex('"', ColumnRAW, (xpos3 + 1)) end as pos4
                  ,columnRAW
                  ,ID
            FROM (
                -- pass2 of data - get position3 having "
                SELECT pos1
                      ,pos2
                      ,charindex('"', ColumnRAW, (pos2 + 1)) as xpos3
                      ,columnRAW
                      ,ID
                FROM (
                    -- pass1 of data - get positions 1 and 2 of those containing " (dbl quotes = dq)
                    SELECT ID
                          ,cast(charindex('"', ColumnRAW) as int) as pos1
                          ,cast(charindex('"', ColumnRAW, (charindex('"', ColumnRAW) + 1)) as int) as pos2
                          ,columnraw
                    FROM temphold_SourceData1_DQData
                    -- testing here: where id IN (5,1537)
                ) p1
            ) p2
        ) p3 ON p4.ID = p3.ID
    ) p5
) p6