Import PDF into SQL Server

By:   |   Updated: 2024-12-04   |   Comments   |   Related: More > Import and Export


Problem

PDF forms are widely used for data collection, document processing, and digital workflows due to their versatility and consistency across different platforms and devices. They are essential in various industries, including healthcare, education, finance, government, and business. How do you retrieve data from PDF forms and insert into a SQL Server database table?

Solution

Using Visual Studio, let's create a new project using a Visual Basic Windows Forms App template.

Click Next, and give a name for your solution. The name I used is PdfFormsReader. Choose a location and click Next. Choose the adequate framework and click Create.

In the Solution Explorer panel, click on the My Project folder. Click Settings, and then Create or the open application settings hyperlink.

Visual Studio settings

Create a new variable called LastVisitedFolder as string.

Creating the Last Visited Folder

I renamed the Form1.vb to FromPdfReader.vb, then added a textbox named TxbFolderPath and a button named BtnPdfFolder with text property "Browse". Next, I added a RichTextBox named RtbPdfFormsContent and another button named BtnPdfFormRead with text property "Read".

PDF Reader Application

Right-click over your project name and choose Manage Nuget Packages and add the itext7 and System.Data.SqlClient packages:

Nuget Packages to Add

Now include the following code in the FormPdfReader.vb file:

Imports System.Data.SqlClient
Imports System.IO
Imports iText.Kernel.Pdf
 
Public Class FormPdfReader
 
   Private Sub BtnPdfFolder_Click(sender As Object, e As EventArgs) Handles BtnPdfFolder.Click
      Using folderBrowserDialog1 As New FolderBrowserDialog()
         If folderBrowserDialog1.ShowDialog() = DialogResult.OK Then
            My.Settings.LastVisitedFolder = folderBrowserDialog1.SelectedPath
            My.Settings.Save()
            TxbFolderPath.Text = folderBrowserDialog1.SelectedPath
         End If
      End Using
   End Sub
 
   Private Sub BtnPdfFormRead_Click(sender As Object, e As EventArgs) Handles BtnPdfFormRead.Click
      RtbPdfFormsContent.Text = ""
      Dim folderPath As String = TxbFolderPath.Text
      If Directory.Exists(folderPath) Then
         Dim files As String() = Directory.GetFiles(folderPath)
         For Each file As String In files
            If IsNothing(file) Then
               Continue For
            End If
            RtbPdfFormsContent.Text += New String("-"c, 100) & vbCrLf
            RtbPdfFormsContent.Text += file & vbCrLf
            RtbPdfFormsContent.Text += New String("-"c, 100) & vbCrLf
 
            Dim pdfDoc As New PdfDocument(New PdfReader(file))
            Dim root As PdfDictionary = pdfDoc.GetCatalog().GetPdfObject().GetAsDictionary(PdfName.AcroForm)
            If root IsNot Nothing Then
               Dim fields As PdfArray = root.GetAsArray(PdfName.Fields)
               If fields IsNot Nothing Then
                  For I As Integer = 0 To fields.Size - 1
                     Dim field As PdfDictionary = fields.GetAsDictionary(i)
                     Dim nameObject As PdfObject = field.Get(PdfName.T)
                     Dim fieldname As String
                     If nameObject IsNot Nothing AndAlso nameObject.IsName() Then
                        fieldname = DirectCast(nameObject, PdfName).GetValue()
                     Else
                        fieldname = nameObject.ToString()
                     End If
                     Dim fieldValue As String = ""
                     If field.Get(PdfName.V) IsNot Nothing Then
                        fieldValue = field.Get(PdfName.V).ToString()
                     End If
                     If fieldValue.Length > 0 Then
                        Dim connectionString As String = "Server=DAIO;Database=MSSQLTips;Integrated Security=True;Connect Timeout=30;Encrypt=False;TrustServerCertificate=True;"
                        Using connection As New SqlConnection(connectionString)
                           Try
                              connection.Open()
                              Dim commando As String = $"EXECUTE dbo.uspPdfForms '{Path.GetFileName(file)}','{fieldName}','{fieldValue}';"
                              Using cmd As New SqlCommand(comando, connection)
                                 Dim unused = cmd.ExecuteNonQuery()
                              End Using
                              RtbPdfFormsContent.Text += fieldname & ": " & fieldValue & vbCrLf
                           Catch ex As Exception
                              MsgBox($"Falha para: {file} [{fieldName}] --> {fieldValue} " & ex.Message)
                           End Try
                        End Using
                     End If
                  Next
               Else
                  MsgBox("No form fields found in the PDF.", MsgBoxStyle.Information, "Diretório Vazio")
               End If
            End If
            pdfDoc.Close()
         Next
      Else
         MsgBox("Directory not found.", MsgBoxStyle.Critical, "Erro Encontrado")
      End If
      Beep()
   End Sub
 
   Private Sub FormPdfReader_Load(sender As Object, e As EventArgs) Handles MyBase.Load
      TxbFolderPath.Text = My.Settings.LastVisitedFolder
   End Sub
 
End Class

Data Compilation – SQL Server Side

The first step is to create a table to hold the data. This table only takes into consideration the file name, its fields names, and respective values, so it will work with any PDF form.

CREATE TABLE [dbo].[PdfForms](
   [ItemId] [int] IDENTITY(1,1) NOT NULL,
   [FormName] [nvarchar](50) NULL,
   [FieldName] [nvarchar](150) NULL,
   [FieldValue] [nvarchar](250) NULL,
   [ModifiedDate] [datetime] DEFAULT GETDATE(),
   [RecordChanged] [bit] DEFAULT 0,
 CONSTRAINT [PK_PdfForms] PRIMARY KEY CLUSTERED 
(
   [ItemId] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, OPTIMIZE_FOR_SEQUENTIAL_KEY = OFF) ON [PRIMARY]
) ON [PRIMARY]
GO

Add the store procedure (below) that will receive and process the data from our Visual Studio solution:

-- =============================================
-- Author:      SCP - MSSQLTips
-- Create date: 20240910
-- Description: PDF Forms Compilation
-- =============================================
ALTER PROCEDURE [dbo].[uspPdfForms] 
            (@FormName nvarchar(50)
            ,@FieldName nvarchar(150)
            ,@FieldValue nvarchar(250))
AS
BEGIN
   SET NOCOUNT ON;
   BEGIN TRY
      BEGIN TRANSACTION;
 
         DECLARE @ModifiedDate datetime = GETDATE();
         DECLARE @ItemId int = 
                  (SELECT [ItemId]
                     FROM [dbo].[PdfForms]
                     WHERE [FormName] = @FormName AND
                           [FieldName] = @FieldName);
                           
         IF @ItemId IS NULL
            SET @ItemId = 0;
 
         -- UPDATE
         IF @ItemId > 0
            BEGIN
               UPDATE   [dbo].[PdfForms]
                  SET   [FormName] = @FormName
                        ,[FieldName] = @FieldName
                        ,[FieldValue] = @FieldValue
                        ,[ModifiedDate] = @ModifiedDate
                        ,[RecordChanged] = CASE WHEN [FieldValue] <> @FieldValue THEN 1 ELSE 0 END
                  WHERE     [ItemId] = @ItemId; 
            END
 
         -- INSERT
         IF @ItemId = 0
            BEGIN
               INSERT INTO [dbo].[PdfForms]
                        ([FormName]
                        ,[FieldName]
                        ,[FieldValue]
                        ,[ModifiedDate]) 
                  VALUES (@FormName
                        ,@FieldName
                        ,@FieldValue
                        ,@ModifiedDate);
 
               SET @ItemId = SCOPE_IDENTITY();
            END
 
         -- DELETE
         IF @ItemId < 0
            BEGIN
               DELETE FROM [dbo].[PdfForms]
                  WHERE [ItemId] = ABS(@ItemId);
            END
 
      COMMIT TRANSACTION;
      RETURN @ItemId;
   END TRY
   BEGIN CATCH
      IF @@TRANCOUNT > 0
         BEGIN
            ROLLBACK TRANSACTION;
         END
 
         -- Print error information. 
         PRINT   'Error: ' + CONVERT(varchar(50), ERROR_NUMBER()) +
            ', Severity: ' + CONVERT(varchar(5), ERROR_SEVERITY()) +
            ', State: ' + CONVERT(varchar(5), ERROR_STATE()) + 
            ', Procedure: ' + ISNULL(ERROR_PROCEDURE(), '-') + 
            ', Line: ' + CONVERT(varchar(5), ERROR_LINE());
 
         PRINT ERROR_MESSAGE();
   END CATCH;
END

Add the following store procedure to display the result of the form read action:

-- =============================================
-- Author:      SCP - MSSQLTips
-- Create date: 20240910
-- Description: PDF Forms Compiled
-- =============================================
ALTER PROCEDURE [dbo].[uspPdfFormsChk] 
AS
BEGIN
   SET NOCOUNT ON;
 
   DECLARE  @Columns nvarchar(MAX)
         ,@SQL NVARCHAR(MAX);
 
   SELECT     @Columns = STRING_AGG(QUOTENAME([FormName]), ', ')
      FROM (SELECT DISTINCT [FormName] 
               FROM     [dbo].[PdfForms]) AS FormNames;
 
   SET  @SQL = N'
      SELECT     FieldName, ' + @Columns + N'
         FROM  (SELECT  [FormName]
                        ,[FieldName]
                        ,[FieldValue] + CASE WHEN RecordChanged = 1 THEN '' <---'' ELSE '''' END AS [FieldValue]
                  FROM  [dbo].[PdfForms]) AS SourceTable
         PIVOT (MAX(FieldValue)
         FOR   [FormName] IN (' + @Columns + N')) AS PivotTable;
         ';
 
   EXEC  sp_executesql @SQL;
 
END

Testing the Solution

For this example, I will use the form below with the fields NAME as a textbox, CHOICE as a combo box, OPTION as a radio button, and COLOR as a list box. I will include only two forms.

read pdf test

Place your filled PDF forms in a folder and execute the Visual Studio, first browsing to this folder and then reading the files inside it. The results will be displayed in the Rich Text box.

PDF Reader application after READ

When the store procedure uspPdfFormsChk is executed, the result is:

uspPdfFormsChk Result

Now I will add another form it the forms directory, and change the option in the List box to Orange for the form 1 and execute the READ button in my application:

PDF Reader application after READ (2)

Now after executing the store procedure uspPdfFormsChk, the result will be:

uspPdfFormsChk Result

Observe that for the ListBox in form 1, the system detected that it was changed since the last run.

This covers the basics of retrieving data from PDF forms using Visual Studio and saving the data to SQL Server.

Next Steps


sql server categories

sql server webinars

subscribe to mssqltips

sql server tutorials

sql server white papers

next tip



About the author
MSSQLTips author Sebastiao Pereira Sebastiao Pereira has over 38 years of healthcare experience, including software development expertise with databases.

This author pledges the content of this article is based on professional experience and not AI generated.

View all my tips


Article Last Updated: 2024-12-04

Comments For This Article

















get free sql tips
agree to terms