Python Remove Active Content From Pdf

7/30/2019

I am using the terrific Python Requests library. I notice that the fine documentation has many examples of how to do something without explaining the why. For instance, both r.text and r.content are shown as examples of how to get the server response. But where is it explained what these properties do? For instance, when would I choose one over the other? I see that r.text returns a unicode object sometimes, and I suppose that there would be a difference for a non-text response. But where is all this documented? Note that the linked document does state:


You can also access the response body as bytes, for non-text requests:

But then it goes on to show an example of a text response! I can only suppose that the quote above means to say non-text responses instead of non-text requests, as a non-text request does not make sense in HTTP.

In short, where is the proper documentation of the library, as opposed to the (excellent) tutorial on the Python Requests site?


dotancohen


2 Answers

The developer interface has more details:

r.text is the content of the response in Unicode, and r.content is the content of the response in bytes.
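As a rough illustration of that distinction, here is a minimal sketch. The helper names `decode_body` and `show_difference` are hypothetical, not part of Requests. The first function approximates what r.text does with r.content (decode the bytes using the response's encoding, replacing undecodable bytes); the second assumes `requests` is installed and the network is reachable.

```python
def decode_body(raw: bytes, encoding: str = "utf-8") -> str:
    """Approximate r.text: decode the raw body bytes into a str.

    Undecodable bytes become U+FFFD instead of raising an error.
    """
    return raw.decode(encoding, errors="replace")


def show_difference(url: str) -> None:
    """Fetch a URL and print both views of the body (hypothetical demo helper)."""
    import requests  # third-party; imported here so decode_body works without it

    r = requests.get(url, timeout=10)
    print(type(r.content))  # the body as bytes, exactly as received
    print(type(r.text))     # the body as str, decoded using r.encoding
    print(r.encoding)       # the encoding Requests chose for decoding
```

In practice, r.content suits images, PDFs, and other binary payloads, while r.text suits HTML, JSON, and other textual bodies.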

Fozoro


Gary Kerr


The documentation seems clear about what r.content is.

If you read further down the page, it addresses, for example, an image file.

PyNEwbie


Using PowerShell to Strip Content from PDF While Keeping PDF Format

My Task: I have been attempting to perform what would be a simple task if the documents were not in PDF format. I have a bunch of PDFs that have unwanted data before the bulk of usable data starts; this is anything that comes before ‘%PDF’ in the documents. I needed a script that pulls all the desired data and exports it to a new file. That part was super easy.

The Problem: The data that is exported appears to be formatted correctly, except it doesn’t open as a PDF anymore. I can open it in Notepad++ and it looks identical to one that was cleaned manually and works. Examining the raw contents of the PowerShell-altered PDF, it appears that the ‘lines’ are much shorter than they should be.

I understand the PDF format doesn't really use lines, so that might be where the problem is being created. The PDF format is probably being broken either when the data is initially put into an array or when it’s written back out. Is there a way to retain the format of the PDF while it is modified and then saved? It’s probably the case that I’m missing something simple.
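One plausible cause is that reading the file as text lines and writing them back changes line endings or re-encodes bytes, which would corrupt the binary streams inside the PDF. The sketch below shows the byte-oriented alternative in Python rather than PowerShell. The names `strip_before_header` and `clean_file` are hypothetical, and the code assumes the first occurrence of `%PDF` is the real header.

```python
from pathlib import Path

PDF_MAGIC = b"%PDF"


def strip_before_header(data: bytes) -> bytes:
    """Return the data starting at the first %PDF marker.

    Works on raw bytes, so nothing is split into lines or re-encoded.
    """
    start = data.find(PDF_MAGIC)
    if start == -1:
        raise ValueError("no %PDF header found")
    return data[start:]


def clean_file(src: str, dst: str) -> None:
    """Read src as bytes, drop everything before the header, write dst as bytes."""
    Path(dst).write_bytes(strip_before_header(Path(src).read_bytes()))
```

Because the file is never interpreted as text, every byte from the header onward is copied unchanged.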

KVB

1 Answer

So I was about to start looking at iTextSharp and decided to give an older language a try first, Winbatch. (bleh!) I almost made a screen scraper to do the work but the shame of taking that route got the better of me. So, the function library was the next stop.

This is just a little blurb I spit out with no error checking or logging going on at this point. All that will be added in, along with file searches, later. All in all, it manages to clear all the unwanted extras in the PDF while keeping the exact format that PDFs require.

Now that I have an idea how this works, making a tool to do this in PS sounds more doable. There's a PS function out there in the wild called Get-HexDump that might be a good base to educate myself on bits and hex in PS. Since this works in Winbatch I assume there is some sort of equivalent in AutoIt and it could be reproduced in most basic languages.

There appear to be a lot of people out there trying to clear crud from before the header and after the end of their PDF docs. Hopefully this helps; I've got a half mill to hit with whatever script I morph this into. I might update with a PS version if I decide to go that route again, and if I remember.
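For the "after the end" half of that cleanup, a minimal Python sketch (not the WinBatch script described above) could cut everything following the last `%%EOF` marker. The name `strip_after_eof` is hypothetical. The code assumes the trailing junk does not itself contain `%%EOF`, and it searches from the end because a PDF with incremental updates can legitimately contain several such markers.

```python
EOF_MARKER = b"%%EOF"


def strip_after_eof(data: bytes) -> bytes:
    """Return the data up to and including the last %%EOF marker."""
    end = data.rfind(EOF_MARKER)
    if end == -1:
        raise ValueError("no %%EOF marker found")
    # Keep the marker itself; drop whatever follows it.
    return data[: end + len(EOF_MARKER)]
```

As with the header, the input should be read and written in binary mode so the bytes in between are left untouched.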

KVB


I have a PDF full of quotes:

I can extract the text in Python using the following code:

This returns all the quotes as one paragraph. Is it possible to split the PDF at the horizontal separators and get the individual quotes that way?

user7692855

2 Answers

If you just want to extract the quotes from the PDF text, you can use a regex to find all of them.
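A minimal sketch of that idea follows. It rests on an assumption the thread does not confirm: that each quote in the extracted text is wrapped in straight or curly double quotation marks. The pattern and the name `extract_quotes` are hypothetical.

```python
import re

# Assumed format: each quote is enclosed in "..." or “...”.
QUOTE_RE = re.compile(r'["“](.+?)["”]', re.DOTALL)


def extract_quotes(text: str) -> list:
    """Return each quoted passage, with internal whitespace collapsed."""
    return [" ".join(match.split()) for match in QUOTE_RE.findall(text)]
```

If the quotes in the PDF are not delimited this way, the pattern would need to be adapted to whatever marker the extracted text does contain.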

or just

bhansa


I could not find a way to split it by the horizontal separator, but I managed to do it another way:

Liam
