By: Matteo Lorini | Updated: 2020-12-14 | Comments | Related: More > R Language
Problem
In this article we cover how to import data from a PDF file into a SQL Server table with R. We will use an example of past lottery winning numbers to see how you could use R to load the data and possibly predict the next set of winning numbers.
Solution
Lottery winning numbers can be manually downloaded in a PDF file one at a time. In this article we will import the Mega Millions winning numbers from a PDF file into a SQL table. This is just a simple exercise that shows how powerful the combination of R and SQL in SQL Server can be.
Import PDF into SQL Server from Georgia Lottery
The first step is to download the winning numbers from the official Georgia lottery website.
From the website, we select the Mega Millions lottery game and click on download to get the numbers.
Please notice that if the winning numbers are on multiple pages, we will have to download each page.
Once we have the winning numbers in PDF files, we can use R to extract information like Date, Winning Numbers, and Megaball, and import them into a SQL Server table for further analysis. The image below shows the contents of the lottery pdf file.
Read PDF File and Extract Information with R
Let see how we can read the pdf file and extract the information it contains. First, we need to install the R pdftools package in order to use the pdf_text R function. As usual, we will be using RStudio to execute our R scripts. The code below reads the PDF file and splits each line according to the "\n" character (Line Feed).
install.packages("pdftools") library(pdftools) pdf_text("GA_Lottery_WinningNumbers MegaMillions.PDF") %>% strsplit(split = "\n")
At first glance, we see that the pdf_text function can correctly read the pdf file.
Examine Data Type Once Import into R Data Frame
Let us examine our data type once imported into an R data frame. First, we import our pdf file into a data frame (gal) and split it by "\n" (Line Feed Character). When we issue the str() R command, we see that the type of data returned is an R list data type.
gal <- pdf_text("GA_Lottery_WinningNumbers MegaMillions.PDF") gal1 <- strsplit(gal, "\n") str(gal1)
Check the Data Structure Result in R
In order to work with the data frame data, unlist() the gal1 data frame and check the final data structure result.
gal2 <- unlist(gal1) str(gal2)
Print Data Frame in R
Manually extract the information from our data frame gal2.
print(gal2)
We can access our data one line at a time:
print(gal2[3]) print(gal2[4])
We can use R function substr() to access its information like date, Megaball number, and so on.
print(gal2[4]) # Get 4th element date print(substr(gal2[4], 1,10)) # Get 4th element Megaball Number print(substr(gal2[4], 40,42))
Finally, we can count the positions of all the data we need to insert into our SQL table.
print(gal2[4]) # Get 4th element date print(substr(gal2[4], 1,10)) # Get Winning Numbers print(substr(gal2[4], 12,13)) print(substr(gal2[4], 15,16)) print(substr(gal2[4], 18,19)) print(substr(gal2[4], 21,22)) print(substr(gal2[4], 24,25)) # Get 4th element Megaball Number print(substr(gal2[4], 40,42))
R Loop to Extract Position of Numbers
Now that we know the positions of our numbers, we can write a loop to extract them all in one execution.
seq <- c(3:21) for( i in seq) { # Get 4th element date print(substr(gal2[i], 1,10)) # Get Winning Numbers print(substr(gal2[i], 12,13)) print(substr(gal2[i], 15,16)) print(substr(gal2[i], 18,19)) print(substr(gal2[i], 21,22)) print(substr(gal2[i], 24,25)) # Get 4th element Megaball Number print(substr(gal2[i], 40,42)) }
Insert Data into SQL Server with R
Finally, we can write the final R code to insert our data into SQL. Create the table as:
CREATE TABLE GALottery ( LotteryDate date, Num1 int, Num2 int, Num3 int, Num4 int, Num5 int, MageMillion int )
We now need to create a connection to our SQL database and the code to loop through the pdf files values. We then insert them into a SQL table.
sqlcnt <- odbcDriverConnect('driver={SQL Server Native Client 11.0};uid=UsreName;pwd=UserPWD;server=SQLServer;database=YourDatabase;rows_at_time=1024') seq <- c(3:21) for( i in seq) { # Get 4th element date date <- substr(gal2[i], 1,10) # Get Winning Numbers Num1 <- substr(gal2[i], 12,13) Num2 <- substr(gal2[i], 15,16) Num3 <- substr(gal2[i], 18,19) Num4 <- substr(gal2[i], 21,22) Num5 <- substr(gal2[i], 24,25) # Get 4th element Megaball Number MgMil <- substr(gal2[i], 40,42) # Prepare our T-SQL statement rgQuery <- sprintf("insert dbo.GALottery values('%s','%s','%s','%s','%s','%s','%s')",date,Num1,Num2,Num3,Num4,Num5,MgMil) gmt <- sqlQuery(sqlcnt,rgQuery, stringsAsFactors = F,as.is = T) # Print our T-SQL - Verification only print(rgQuery) }
We check our target SQL table to verify that our numbers have been entered correctly. Execute the query below in SQL Server Management Studio.
select * from dbo.GALottery
Conclusion
In this tip we have learned how to read data from a pdf file and insert the results into a SQL table. We used the following R functions:
- pdf_text to read a pdf file into a data frame.
- strsplit( x, "\n") to split the values inside our data frame by "\n" (Line Feed) character.
- substr( x, start, end) to extract substring value; our numbers.
- Unlist() to transform the list into a vector.
- Sprintf() to create our T-SQL syntax. Please note that the sprintf syntax is the same as the C function sprintf
Next Steps
- The reader will need to install RStudio in order to test this tip.
- R Tutorial
- Check out these tips
About the author
This author pledges the content of this article is based on professional experience and not AI generated.
View all my tips
Article Last Updated: 2020-12-14