By: Sebastiao Pereira | Updated: 2024-12-04 | Comments | Related: More > Import and Export
Problem
PDF forms are widely used for data collection, document processing, and digital workflows due to their versatility and consistency across different platforms and devices. They are essential in various industries, including healthcare, education, finance, government, and business. How do you retrieve data from PDF forms and insert into a SQL Server database table?
Solution
Using Visual Studio, let's create a new project using a Visual Basic Windows Forms App template.
Click Next, and give a name for your solution. The name I used is PdfFormsReader. Choose a location and click Next. Choose the adequate framework and click Create.
In the Solution Explorer panel, click on the My Project folder. Click Settings, and then Create or the open application settings hyperlink.
Create a new variable called LastVisitedFolder as string.
I renamed the Form1.vb to FromPdfReader.vb, then added a textbox named TxbFolderPath and a button named BtnPdfFolder with text property "Browse". Next, I added a RichTextBox named RtbPdfFormsContent and another button named BtnPdfFormRead with text property "Read".
Right-click over your project name and choose Manage Nuget Packages and add the itext7 and System.Data.SqlClient packages:
Now include the following code in the FormPdfReader.vb file:
Imports System.Data.SqlClient Imports System.IO Imports iText.Kernel.Pdf Public Class FormPdfReader Private Sub BtnPdfFolder_Click(sender As Object, e As EventArgs) Handles BtnPdfFolder.Click Using folderBrowserDialog1 As New FolderBrowserDialog() If folderBrowserDialog1.ShowDialog() = DialogResult.OK Then My.Settings.LastVisitedFolder = folderBrowserDialog1.SelectedPath My.Settings.Save() TxbFolderPath.Text = folderBrowserDialog1.SelectedPath End If End Using End Sub Private Sub BtnPdfFormRead_Click(sender As Object, e As EventArgs) Handles BtnPdfFormRead.Click RtbPdfFormsContent.Text = "" Dim folderPath As String = TxbFolderPath.Text If Directory.Exists(folderPath) Then Dim files As String() = Directory.GetFiles(folderPath) For Each file As String In files If IsNothing(file) Then Continue For End If RtbPdfFormsContent.Text += New String("-"c, 100) & vbCrLf RtbPdfFormsContent.Text += file & vbCrLf RtbPdfFormsContent.Text += New String("-"c, 100) & vbCrLf Dim pdfDoc As New PdfDocument(New PdfReader(file)) Dim root As PdfDictionary = pdfDoc.GetCatalog().GetPdfObject().GetAsDictionary(PdfName.AcroForm) If root IsNot Nothing Then Dim fields As PdfArray = root.GetAsArray(PdfName.Fields) If fields IsNot Nothing Then For I As Integer = 0 To fields.Size - 1 Dim field As PdfDictionary = fields.GetAsDictionary(i) Dim nameObject As PdfObject = field.Get(PdfName.T) Dim fieldname As String If nameObject IsNot Nothing AndAlso nameObject.IsName() Then fieldname = DirectCast(nameObject, PdfName).GetValue() Else fieldname = nameObject.ToString() End If Dim fieldValue As String = "" If field.Get(PdfName.V) IsNot Nothing Then fieldValue = field.Get(PdfName.V).ToString() End If If fieldValue.Length > 0 Then Dim connectionString As String = "Server=DAIO;Database=MSSQLTips;Integrated Security=True;Connect Timeout=30;Encrypt=False;TrustServerCertificate=True;" Using connection As New SqlConnection(connectionString) Try connection.Open() Dim commando As String = $"EXECUTE dbo.uspPdfForms '{Path.GetFileName(file)}','{fieldName}','{fieldValue}';" Using cmd As New SqlCommand(comando, connection) Dim unused = cmd.ExecuteNonQuery() End Using RtbPdfFormsContent.Text += fieldname & ": " & fieldValue & vbCrLf Catch ex As Exception MsgBox($"Falha para: {file} [{fieldName}] --> {fieldValue} " & ex.Message) End Try End Using End If Next Else MsgBox("No form fields found in the PDF.", MsgBoxStyle.Information, "Diretório Vazio") End If End If pdfDoc.Close() Next Else MsgBox("Directory not found.", MsgBoxStyle.Critical, "Erro Encontrado") End If Beep() End Sub Private Sub FormPdfReader_Load(sender As Object, e As EventArgs) Handles MyBase.Load TxbFolderPath.Text = My.Settings.LastVisitedFolder End Sub End Class
Data Compilation – SQL Server Side
The first step is to create a table to hold the data. This table only takes into consideration the file name, its fields names, and respective values, so it will work with any PDF form.
CREATE TABLE [dbo].[PdfForms]( [ItemId] [int] IDENTITY(1,1) NOT NULL, [FormName] [nvarchar](50) NULL, [FieldName] [nvarchar](150) NULL, [FieldValue] [nvarchar](250) NULL, [ModifiedDate] [datetime] DEFAULT GETDATE(), [RecordChanged] [bit] DEFAULT 0, CONSTRAINT [PK_PdfForms] PRIMARY KEY CLUSTERED ( [ItemId] ASC )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, OPTIMIZE_FOR_SEQUENTIAL_KEY = OFF) ON [PRIMARY] ) ON [PRIMARY] GO
Add the store procedure (below) that will receive and process the data from our Visual Studio solution:
-- ============================================= -- Author: SCP - MSSQLTips -- Create date: 20240910 -- Description: PDF Forms Compilation -- ============================================= ALTER PROCEDURE [dbo].[uspPdfForms] (@FormName nvarchar(50) ,@FieldName nvarchar(150) ,@FieldValue nvarchar(250)) AS BEGIN SET NOCOUNT ON; BEGIN TRY BEGIN TRANSACTION; DECLARE @ModifiedDate datetime = GETDATE(); DECLARE @ItemId int = (SELECT [ItemId] FROM [dbo].[PdfForms] WHERE [FormName] = @FormName AND [FieldName] = @FieldName); IF @ItemId IS NULL SET @ItemId = 0; -- UPDATE IF @ItemId > 0 BEGIN UPDATE [dbo].[PdfForms] SET [FormName] = @FormName ,[FieldName] = @FieldName ,[FieldValue] = @FieldValue ,[ModifiedDate] = @ModifiedDate ,[RecordChanged] = CASE WHEN [FieldValue] <> @FieldValue THEN 1 ELSE 0 END WHERE [ItemId] = @ItemId; END -- INSERT IF @ItemId = 0 BEGIN INSERT INTO [dbo].[PdfForms] ([FormName] ,[FieldName] ,[FieldValue] ,[ModifiedDate]) VALUES (@FormName ,@FieldName ,@FieldValue ,@ModifiedDate); SET @ItemId = SCOPE_IDENTITY(); END -- DELETE IF @ItemId < 0 BEGIN DELETE FROM [dbo].[PdfForms] WHERE [ItemId] = ABS(@ItemId); END COMMIT TRANSACTION; RETURN @ItemId; END TRY BEGIN CATCH IF @@TRANCOUNT > 0 BEGIN ROLLBACK TRANSACTION; END -- Print error information. PRINT 'Error: ' + CONVERT(varchar(50), ERROR_NUMBER()) + ', Severity: ' + CONVERT(varchar(5), ERROR_SEVERITY()) + ', State: ' + CONVERT(varchar(5), ERROR_STATE()) + ', Procedure: ' + ISNULL(ERROR_PROCEDURE(), '-') + ', Line: ' + CONVERT(varchar(5), ERROR_LINE()); PRINT ERROR_MESSAGE(); END CATCH; END
Add the following store procedure to display the result of the form read action:
-- ============================================= -- Author: SCP - MSSQLTips -- Create date: 20240910 -- Description: PDF Forms Compiled -- ============================================= ALTER PROCEDURE [dbo].[uspPdfFormsChk] AS BEGIN SET NOCOUNT ON; DECLARE @Columns nvarchar(MAX) ,@SQL NVARCHAR(MAX); SELECT @Columns = STRING_AGG(QUOTENAME([FormName]), ', ') FROM (SELECT DISTINCT [FormName] FROM [dbo].[PdfForms]) AS FormNames; SET @SQL = N' SELECT FieldName, ' + @Columns + N' FROM (SELECT [FormName] ,[FieldName] ,[FieldValue] + CASE WHEN RecordChanged = 1 THEN '' <---'' ELSE '''' END AS [FieldValue] FROM [dbo].[PdfForms]) AS SourceTable PIVOT (MAX(FieldValue) FOR [FormName] IN (' + @Columns + N')) AS PivotTable; '; EXEC sp_executesql @SQL; END
Testing the Solution
For this example, I will use the form below with the fields NAME as a textbox, CHOICE as a combo box, OPTION as a radio button, and COLOR as a list box. I will include only two forms.
Place your filled PDF forms in a folder and execute the Visual Studio, first browsing to this folder and then reading the files inside it. The results will be displayed in the Rich Text box.
When the store procedure uspPdfFormsChk is executed, the result is:
Now I will add another form it the forms directory, and change the option in the List box to Orange for the form 1 and execute the READ button in my application:
Now after executing the store procedure uspPdfFormsChk, the result will be:
Observe that for the ListBox in form 1, the system detected that it was changed since the last run.
This covers the basics of retrieving data from PDF forms using Visual Studio and saving the data to SQL Server.
Next Steps
- You can improve the application in Visual Studio by retrieving the data from the store procedure uspPdfFormsChk to a data grid view.
- In this example, I did not select the most appropriate field name descriptions. Instead of using descriptive names like "Name," I labeled the fields based on their type, such as "TextBox," to demonstrate the item type used.
- Here are some related articles:
About the author
This author pledges the content of this article is based on professional experience and not AI generated.
View all my tips
Article Last Updated: 2024-12-04