Friday, March 16, 2012

Make a small edit to a PDF file

I don't feel like I need the overhead of any of the PDF editors for this task, but I am having real trouble getting this to work. All I need to do is the following:

Open some.pdf
replace "Replace this!" with "Replaced"
Save some_edited.pdf

I get corrupted files in every encoding method I try. Is there something simple I am missing?

I can do this in Notepad and it works just fine. Now that I have been experimenting, I believe I am supposed to be using a BinaryReader and a BinaryWriter, but so far I am having a hard time replacing text inside the binary data. I can get a Char() array back from the reader, which ends up holding each individual character. There is also a function to get a String back, but I am not sure how to put it back into the writer correctly.

Anyone have any experience working with files like PDF?
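You will probably need a third-party assembly or utility to edit a PDF file. It isn't as easy as editing the raw bytes, because PDF content streams are usually compressed (typically with Flate, the same deflate algorithm GZIP uses), so the text often isn't stored in a form you can edit directly.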
That is what I figured before trying to do this because I knew PDF files could have compression. What I found though was that these files have the string I want to replace in plain text. If I open it in notepad, make my edit and save the file it works great.

I just need to replicate that seemingly simple process in code without breaking the formatting of the document or the encoding or anything else.
In that case use a StreamWriter and a StreamReader.
Here is how I am trying to write the files. This produces a file that Acrobat will not open. In Notepad I notice that some of the odd characters at the beginning of the file are stripped out. As far as I can tell, it must be happening in FileStream.ReadToEnd.

Public Function ByteTest()
    Dim PDFFile As String
    Dim PDFFolder As IO.Directory

    Response.Write("Start Byte:" & DateTime.Now.ToLongTimeString & ":" & Now.Millisecond & "<br>")

    For Each PDFFile In PDFFolder.GetFiles(Server.MapPath("PDF"))
        'Open the file
        Dim FileStream As IO.StreamReader
        FileStream = IO.File.OpenText(PDFFile)

        'Load the file in to a string
        Dim Contents As String = FileStream.ReadToEnd

        'Replace text in string
        Contents = Contents.Replace("ABC1234567890", "ABC1111111111")

        'Close stream
        FileStream.Close()

        'Create byte based output file
        Dim OutputFileName As String = Server.MapPath("PDFOutput\" & DateTime.Now.ToFileTimeUtc.ToString & "BYTE.pdf")
        Dim fs As FileStream = File.Create(OutputFileName)
        fs.Close()

        'Convert the string to bytes
        Dim info As Byte() = New System.Text.UTF8Encoding(True).GetBytes(Contents)

        'Write string as bytes to output file
        fs = File.OpenWrite(OutputFileName)
        fs.Write(info, 0, info.Length)
        fs.Close()
    Next

    Response.Write("Stop Byte:" & DateTime.Now.ToLongTimeString & ":" & Now.Millisecond & "<br>")
End Function

I also wrote a test that skips the byte conversion and tries several encodings. None of the resulting files will open in Acrobat.
Public Function StringTest()
    Dim PDFFile As String
    Dim PDFFolder As IO.Directory

    Response.Write("Start String:" & DateTime.Now.ToLongTimeString & ":" & Now.Millisecond & "<br>")

    For Each PDFFile In PDFFolder.GetFiles(Server.MapPath("PDF"))
        'Open the file
        Dim FileStream As IO.StreamReader
        FileStream = IO.File.OpenText(PDFFile)

        'Load the file in to a string
        Dim Contents As String = FileStream.ReadToEnd

        'Replace text in string
        Contents = Contents.Replace("ABC1234567890", "ABC1111111111")

        'Close stream
        FileStream.Close()

        'Create ASCII output file
        Dim OutputFileName As String = Server.MapPath("PDFOutput\" & DateTime.Now.ToFileTimeUtc.ToString & "STRING-ASCII.pdf")
        Dim fs As FileStream = File.Create(OutputFileName)
        Dim PDFStream As StreamWriter = New StreamWriter(fs, System.Text.Encoding.ASCII)
        PDFStream.Write(Contents)
        PDFStream.Close()
        fs.Close()

        'Create BigEndianUnicode output file
        OutputFileName = Server.MapPath("PDFOutput\" & DateTime.Now.ToFileTimeUtc.ToString & "STRING-BigEndianUnicode.pdf")
        fs = File.Create(OutputFileName)
        PDFStream = New StreamWriter(fs, System.Text.Encoding.BigEndianUnicode)
        PDFStream.Write(Contents)
        PDFStream.Close()
        fs.Close()

        'Create default formatted output file
        OutputFileName = Server.MapPath("PDFOutput\" & DateTime.Now.ToFileTimeUtc.ToString & "STRING-Default.pdf")
        fs = File.Create(OutputFileName)
        PDFStream = New StreamWriter(fs, System.Text.Encoding.Default)
        PDFStream.Write(Contents)
        PDFStream.Close()
        fs.Close()

        'Create Unicode output file
        OutputFileName = Server.MapPath("PDFOutput\" & DateTime.Now.ToFileTimeUtc.ToString & "STRING-Unicode.pdf")
        fs = File.Create(OutputFileName)
        PDFStream = New StreamWriter(fs, System.Text.Encoding.Unicode)
        PDFStream.Write(Contents)
        PDFStream.Close()
        fs.Close()

        'Create UTF7 output file
        OutputFileName = Server.MapPath("PDFOutput\" & DateTime.Now.ToFileTimeUtc.ToString & "STRING-UTF7.pdf")
        fs = File.Create(OutputFileName)
        PDFStream = New StreamWriter(fs, System.Text.Encoding.UTF7)
        PDFStream.Write(Contents)
        PDFStream.Close()
        fs.Close()

        'Create UTF8 output file
        OutputFileName = Server.MapPath("PDFOutput\" & DateTime.Now.ToFileTimeUtc.ToString & "STRING-UTF8.pdf")
        fs = File.Create(OutputFileName)
        PDFStream = New StreamWriter(fs, System.Text.Encoding.UTF8)
        PDFStream.Write(Contents)
        PDFStream.Close()
        fs.Close()
    Next

    Response.Write("Stop String:" & DateTime.Now.ToLongTimeString & ":" & Now.Millisecond & "<br>")
End Function
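
A note on why every variant above produces a damaged file: IO.File.OpenText reads the file as UTF-8 text, and a PDF is not text. Any byte above 0x7F that does not form a valid UTF-8 sequence is swapped for a replacement character while reading, so the original bytes are already lost before any writer encoding is chosen. The following is only a small sketch to illustrate that. The module name, the RoundTrips helper, and the sample header bytes are made up for the demonstration and are not taken from the files discussed here.

```vbnet
Imports System
Imports System.Linq
Imports System.Text

Module RoundTripDemo
    ' Hypothetical helper: decode the bytes to a String with the given
    ' encoding, encode the String back, and report whether the bytes survived.
    Function RoundTrips(data As Byte(), enc As Encoding) As Boolean
        Dim text As String = enc.GetString(data)
        Dim back As Byte() = enc.GetBytes(text)
        Return data.SequenceEqual(back)
    End Function

    Sub Main()
        ' Sample data (an assumption): "%PDF-1.4", a line feed, then a
        ' comment line of high bytes like the one many PDF writers emit
        ' near the top of the file.
        Dim header As Byte() = {&H25, &H50, &H44, &H46, &H2D, &H31, &H2E, &H34, &HA,
                                &H25, &HE2, &HE3, &HCF, &HD3, &HA}

        ' UTF-8 and ASCII cannot represent the high bytes as-is.
        Console.WriteLine("UTF-8 round trip:   {0}", RoundTrips(header, Encoding.UTF8))
        Console.WriteLine("ASCII round trip:   {0}", RoundTrips(header, Encoding.ASCII))

        ' ISO-8859-1 (code page 28591) maps every byte to exactly one character.
        Console.WriteLine("Latin-1 round trip: {0}", RoundTrips(header, Encoding.GetEncoding(28591)))
    End Sub
End Module
```

With this sample, the UTF-8 and ASCII checks should report False and the Latin-1 check True. That matches the "odd characters stripped out" symptom seen in Notepad: only an encoding that maps each byte to one character and back can pass binary data through a String unchanged.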


I found my answer on a newsgroup posting I made. This code generates working PDF files for me. Thanks to everyone for taking a look at the problem.
Sub ANSITest()
    Dim PDFFile As String
    Dim PDFFolder As IO.Directory
    Dim Encoding As System.Text.Encoding = Encoding.GetEncoding(1252)

    For Each PDFFile In PDFFolder.GetFiles(Server.MapPath("PDF"))
        'Open the file
        Dim FileStream As New IO.StreamReader(PDFFile, Encoding)

        'Load the file in to a string
        Dim Contents As String = FileStream.ReadToEnd

        'Replace text in string
        Contents = Contents.Replace("ABC1234567890", "ABC1111111111")

        'Close stream
        FileStream.Close()

        'Write string as bytes to output file
        Dim OutputFileName As String = Server.MapPath("PDFOutput\" & DateTime.Now.ToFileTimeUtc.ToString & "ANSI.pdf")
        Dim sw As New IO.StreamWriter(OutputFileName, False, Encoding)
        sw.Write(Contents)
        sw.Close()
    Next
End Sub
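
For anyone who would rather not pass the file through a String at all, the same replacement can be sketched directly on the bytes. This is an alternative, not the code from the newsgroup answer. It assumes the search text is stored uncompressed as plain ASCII in the file, as it was in this thread, and it only allows a replacement of the same length, because a PDF's cross-reference table stores byte offsets and a length change shifts them (some viewers repair this, but that is not guaranteed). The module and function names are made up for the example, and the file names are the ones from the original question.

```vbnet
Imports System
Imports System.IO
Imports System.Text

Module PdfBytePatch
    ' Returns the index of the first occurrence of pattern in data at or
    ' after start, or -1 if there is none.
    Function IndexOfBytes(data As Byte(), pattern As Byte(), start As Integer) As Integer
        For i As Integer = start To data.Length - pattern.Length
            Dim matched As Boolean = True
            For j As Integer = 0 To pattern.Length - 1
                If data(i + j) <> pattern(j) Then
                    matched = False
                    Exit For
                End If
            Next
            If matched Then Return i
        Next
        Return -1
    End Function

    ' Overwrites every occurrence of oldText with newText in place and
    ' returns the number of replacements. Both must have the same length.
    Function ReplaceSameLength(data As Byte(), oldText As String, newText As String) As Integer
        Dim oldBytes As Byte() = Encoding.ASCII.GetBytes(oldText)
        Dim newBytes As Byte() = Encoding.ASCII.GetBytes(newText)
        If oldBytes.Length = 0 OrElse oldBytes.Length <> newBytes.Length Then
            Throw New ArgumentException("Search and replacement text must be non-empty and equal in length.")
        End If

        Dim count As Integer = 0
        Dim pos As Integer = IndexOfBytes(data, oldBytes, 0)
        While pos >= 0
            Array.Copy(newBytes, 0, data, pos, newBytes.Length)
            count += 1
            pos = IndexOfBytes(data, oldBytes, pos + oldBytes.Length)
        End While
        Return count
    End Function

    Sub Main()
        ' Read, patch, and write the raw bytes; no text encoding is involved
        ' for anything except the search and replacement strings.
        Dim data As Byte() = File.ReadAllBytes("some.pdf")
        Dim hits As Integer = ReplaceSameLength(data, "ABC1234567890", "ABC1111111111")
        File.WriteAllBytes("some_edited.pdf", data)
        Console.WriteLine("Replaced {0} occurrence(s).", hits)
    End Sub
End Module
```

If the replacement is shorter than the original, padding it with spaces to the same length is one way to keep the offsets intact, though whether the padding is acceptable depends on where the text appears in the document.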
