Simple Tricks to Reduce PDF Size
Let’s talk about PDF for a moment. PDF is designed to be a flexible file format. It is also designed to be resolution independent. A page can be a collection of vector art, text, and images, but there is no inherent resolution except what is dictated by the content (and even that doesn’t really count).
Let me clarify through example. Given a PDF, a PDF viewer should be able to display that on any output device (printer, LCD screen, phototypesetter, etc). It is the output device that determines the resolution and the quality. The PDF viewer is responsible for keeping the promise that the image will look as good as possible and as close to the theoretical ideal.
With vector art (lines, curves, etc.), this is a fairly easy promise to keep. With text defined by outline fonts (TrueType, OpenType, PostScript Type 1), this is also an easy promise to keep. With images, this gets harder. In the PDF sense, images are a collection of row-ordered samples and have no resolution unto themselves. Resolution only comes into play when that image is placed on a page and is displayed or printed by a viewer. Images can be placed on a page any number of times at any orientation (represented by an affine transform). Once the viewer displays that image on a device, the resolution is fixed. A viewer may do a number of things to keep the promise of quality. For example, image samples may be interpolated in some way (bilinear, bicubic, etc.), or, if the output device isn’t color, the image will probably be represented by some halftoning method.
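To make "no resolution until placed" concrete, here is a minimal sketch that is independent of dotImage. It assumes the usual PDF convention that an image is painted into a unit square and mapped onto the page by a matrix [a b c d e f], with page units of 1/72 inch. The class and method names are made up for illustration.

```csharp
using System;

public static class PlacedImageResolution
{
    // The image's unit square is mapped onto the page by [a b c d e f].
    // The width edge lands on the vector (a, b) and the height edge on (c, d),
    // both measured in points (1/72 inch). The translation (e, f) does not matter.
    public static (double X, double Y) EffectiveDpi(
        int pixelWidth, int pixelHeight, double a, double b, double c, double d)
    {
        double widthInches = Math.Sqrt(a * a + b * b) / 72.0;
        double heightInches = Math.Sqrt(c * c + d * d) / 72.0;
        return (pixelWidth / widthInches, pixelHeight / heightInches);
    }

    public static void Main()
    {
        // 1700 x 2200 samples stretched over a full US Letter page (612 x 792 points)
        var full = EffectiveDpi(1700, 2200, 612, 0, 0, 792);
        Console.WriteLine($"{full.X:F0} x {full.Y:F0} dpi");   // 200 x 200 dpi

        // the same samples placed at half size on the page
        var half = EffectiveDpi(1700, 2200, 306, 0, 0, 396);
        Console.WriteLine($"{half.X:F0} x {half.Y:F0} dpi");   // 400 x 400 dpi
    }
}
```

The same samples give a different effective resolution for every placement, which is why the number only means something once the image is on a page.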
In dotImage, the “device” is a bitmapped image. The resolution is set by the PDF decoder object. The bit depth of the image is going to be either 24-bit color or 8-bit gray depending on the actual content: if the page has no color, the output image shouldn’t be in color.
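Bit depth matters because it sets the raw size of the page before any compression is applied. The sketch below is plain arithmetic, not dotImage code. It assumes rows are rounded up to whole bytes with no other padding, and the names are hypothetical.

```csharp
using System;

public static class RawImageSize
{
    // Uncompressed size: each row is rounded up to a whole number of bytes.
    public static long Bytes(int width, int height, int bitsPerPixel)
    {
        long rowBytes = ((long)width * bitsPerPixel + 7) / 8;
        return rowBytes * height;
    }

    public static void Main()
    {
        // US Letter (8.5 x 11 in) rasterized at 200 dpi is 1700 x 2200 pixels
        int width = 1700, height = 2200;
        foreach (int bpp in new[] { 24, 8, 1 })
            Console.WriteLine($"{bpp,2} bpp: {Bytes(width, height, bpp)} bytes");
        // 24 bpp: 11220000 bytes
        //  8 bpp: 3740000 bytes
        //  1 bpp: 468600 bytes
    }
}
```

Compression changes the final numbers, but the starting point shrinks by a factor of 3 going from color to gray and by roughly a factor of 8 again going from gray to 1 bit.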
Here’s a trick you can use on image-only PDFs to possibly reduce their size.
We have a class called PdfImageSource which loops through the pages in a PDF. If a page is image-only, it extracts that image. If the page is NOT image-only, the page is rasterized at the resolution given by the PdfImageSource’s Resolution property and the result is returned.
I’ve created a subclass of PdfImageSource which applies some heuristics to possibly reduce the bit depth of the image returned. It does this by building on PdfImageSource and running a state machine that tries to progressively cut the bit depth. Here is that class:
using System;
using System.IO;
using Atalasoft.Imaging;
using Atalasoft.Imaging.ImageSources;
using Atalasoft.Imaging.ImageProcessing;
using Atalasoft.Imaging.ImageProcessing.Document;
using Atalasoft.Imaging.ImageProcessing.Effects;
using System.Drawing;
namespace PdfAggressiveColorReduction
{
    // in this class, we are subclassing PdfImageSource and adding some post processing to reduce the
    // bit depth. Note that you should consider setting the resolution in the parent class to
    // no less than 200 dpi. 200 dpi is a sweet spot for most OCR engines - less and they do poorly,
    // more and the image gets big.
    //
    public class PdfColorReductionImageSource : PdfImageSource
    {
        public PdfColorReductionImageSource(Stream stm) : base(stm) { }
        public PdfColorReductionImageSource(Stream stm, string password) : base(stm, password) { }

        // this method overrides the LowLevelAcquire method in PdfImageSource.
        // It lets PdfImageSource get the image (possibly extracting it directly if it's a
        // single-image page), then we try to cut the image bit depth down as much as possible
        protected override ImageSourceNode LowLevelAcquire(int index)
        {
            ImageSourceNode node = base.LowLevelAcquire(index);
            if (node == null) return null;
            // try to reduce the bit depth of the image in as non-destructive a way as possible
            AtalaImage image = AggressivelyReduceBitDepth(node.Image);
            // couldn't reduce
            if (image == null) return node;
            // don't need original image anymore
            node.Image.Dispose();
            return new ImageSourceNode(image, new FileReloader(image));
        }
        private AtalaImage AggressivelyReduceBitDepth(AtalaImage sourceImage)
        {
            AtalaImage startImage = sourceImage;
            AtalaImage finalImage = null;
            while (true)
            {
                finalImage = null;
                switch (startImage.PixelFormat)
                {
                    // if color, try to reduce to gray or 8 bit paletted
                    case PixelFormat.Pixel24bppBgr:
                        finalImage = ReduceColorToGray(startImage) ?? ReduceColorToPaletted(startImage) ?? startImage;
                        break;
                    case PixelFormat.Pixel8bppIndexed:
                        finalImage = ReducePalettedToGray(startImage) ?? startImage;
                        break;
                    // if gray, try to reduce to 1 bit
                    case PixelFormat.Pixel8bppGrayscale:
                        finalImage = ReduceGrayTo1Bit(startImage) ?? startImage;
                        break;
                    // punt
                    default:
                        finalImage = startImage;
                        break;
                }
                if (finalImage != startImage)
                {
                    // dispose startImage if we don't need it anymore
                    if (startImage != sourceImage)
                        startImage.Dispose();
                    startImage = finalImage;
                }
                else
                {
                    break;
                }
            }
            // return null if there was no change
            return finalImage == sourceImage ? null : finalImage;
        }
        private AtalaImage ReduceColorToGray(AtalaImage source)
        {
            // you can try other approaches to this, but we find that this works very well at detecting images
            // that are "close" to being black and white
            ColorExtractionCommand command = new ColorExtractionCommand();
            ColorExtractionResults results = command.Apply(source) as ColorExtractionResults;
            // the results, if there was color, is an image with the color lifted out of the original
            // we're not using that, so we'll dispose it
            if (results.Image != null && results.Image != source) results.Image.Dispose();
            if (!results.HasColor)
            {
                return source.GetChangedPixelFormat(PixelFormat.Pixel8bppGrayscale);
            }
            return null;
        }

        private AtalaImage ReducePalettedToGray(AtalaImage source)
        {
            if (PaletteIsGray(source.Palette))
                return source.GetChangedPixelFormat(PixelFormat.Pixel8bppGrayscale);
            return null;
        }

        // a palette counts as gray if every entry has nearly equal R, G, and B
        private bool PaletteIsGray(Palette p)
        {
            for (int i = 0; i < p.Colors; i++)
            {
                Color c = p.GetEntry(i);
                int d1 = Math.Abs(c.R - c.G);
                int d2 = Math.Abs(c.B - c.G);
                if (d1 + d2 > 4) return false;
            }
            return true;
        }
        private AtalaImage ReduceColorToPaletted(AtalaImage source)
        {
            // find out if it's reasonable to make the image 8 bit paletted
            long colors = source.CountColors();
            if (colors <= 256)
            {
                // every color fits in an 8 bit palette, so nothing is lost
                ReduceColorsCommand command = new ReduceColorsCommand((int)colors, DitheringMode.None, 0);
                return command.Apply(source).Image;
            }
            else if (colors < 1024)
            {
                // you might not want to do this else clause - squeezing the image into 256 palette
                // entries will damage the source image, maybe not noticeably, but it will damage it.
                ReduceColorsCommand command = new ReduceColorsCommand(256);
                return command.Apply(source).Image;
            }
            return null;
        }
        private AtalaImage ReduceGrayTo1Bit(AtalaImage source)
        {
            // you could also get a Histogram of the image and find out how many colors exist above a significant
            // portion of the total image and consider that the number of colors in the image rather than going
            // blindly.
            long colors = source.CountColors();
            if (colors < 64) // arbitrary - could be anywhere from 1 -> 64
            {
                // you could also eliminate the CountColors and just do a DynamicThreshold, which will
                // always go to 1bit, but this will lose some image quality.
                DynamicThresholdCommand command = new DynamicThresholdCommand();
                return command.Apply(source).Image;
            }
            return null;
        }
    }
}
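The comment in ReduceGrayTo1Bit suggests counting only the gray levels that cover a significant share of the image, rather than every level that appears at least once. Here is a minimal sketch of that idea on a plain byte array. It does not use the dotImage histogram classes, and the names and the 1% cutoff are assumptions chosen for illustration.

```csharp
using System;

public static class SignificantLevels
{
    // Counts the gray levels that account for at least minFraction of all samples.
    public static int Count(byte[] graySamples, double minFraction)
    {
        if (graySamples.Length == 0) return 0;

        long[] histogram = new long[256];
        foreach (byte s in graySamples) histogram[s]++;

        double threshold = minFraction * graySamples.Length;
        int significant = 0;
        foreach (long n in histogram)
            if (n > 0 && n >= threshold) significant++;
        return significant;
    }

    public static void Main()
    {
        // 1000 samples: 900 white, 90 black, and 10 stray mid-gray values (one each)
        byte[] samples = new byte[1000];
        for (int i = 0; i < samples.Length; i++)
            samples[i] = i < 900 ? (byte)255 : (i < 990 ? (byte)0 : (byte)(100 + i % 10));

        // 12 distinct levels are present, but only 2 cover at least 1% of the image
        Console.WriteLine(Count(samples, 0.01));   // 2
    }
}
```

A count like this could stand in for CountColors when deciding whether a gray page is really a black and white page with a little scanner noise.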
In usage, you can process a PDF with a very small amount of code:
// instm is an open, readable stream on the input PDF
PdfColorReductionImageSource source = new PdfColorReductionImageSource(instm);
source.Resolution = 200;
PdfEncoder encoder = new PdfEncoder();
using (Stream outstm = new FileStream(@"someoutputfile.pdf", FileMode.Create))
{
    encoder.Save(outstm, source, null);
}
The problem with using this class blindly is that if a page has vector art or text, that information will get lost and the page will probably get larger.
We can “fix” that too. The trick is to avoid the pages that aren’t image only. That could be done like this:
List<int> FindImageOnlyPages(Stream stm)
{
    // collect the zero-based indices of pages that consist of a single image
    List<int> imagePages = new List<int>();
    Document doc = new Document(stm);
    int currPage = 0;
    foreach (Page p in doc.Pages) {
        if (p.SingleImageOnly) imagePages.Add(currPage);
        currPage++;
    }
    return imagePages;
}
Now given this list, we can create a new PDF with only those pages in it, using the PdfDocument object, which is made for manipulation, not rasterization, of PDF documents:
PdfDocument origDoc = new PdfDocument(origStm);
origStm.Seek(0, SeekOrigin.Begin);
List<int> pagesToExtract = FindImageOnlyPages(origStm);
origStm.Seek(0, SeekOrigin.Begin);
PdfDocument newDoc = new PdfDocument();
foreach (int pageNo in pagesToExtract) {
    newDoc.Add(origDoc.Pages[pageNo]);
}
Stream extractedStream = GetTempStream();
newDoc.Save(extractedStream);
Now extractedStream contains a guaranteed image-only PDF. We can run it through the code above to try to reduce the bit depth of each page and finally copy those pages back into the original document:
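// reducedStream is assumed to hold the PDF produced by running extractedStream
// through PdfColorReductionImageSource and PdfEncoder as shown earlier
PdfDocument reducedDoc = new PdfDocument(reducedStream);
int currPageIndex = 0;
foreach (PdfPage p in reducedDoc.Pages) {
    // page i of the reduced document replaces the original page it was extracted from
    origDoc.Pages[pagesToExtract[currPageIndex]] = p;
    currPageIndex++;
}
origDoc.Save(finalOutputStream);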
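The bookkeeping in these last steps is easy to get wrong, so here is the same extract, process, and put-back pattern on an ordinary list, with no dotImage types involved. It relies on the batch step returning the same number of items in the same order, which is what we assume of the reduced PDF. The class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class ExtractAndReinsert
{
    // Pulls out the items at the selected indices, transforms them as a batch,
    // and writes each result back to the slot it came from.
    public static void Apply<T>(IList<T> pages, IList<int> selected,
                                Func<List<T>, List<T>> processBatch)
    {
        var extracted = new List<T>();
        foreach (int index in selected) extracted.Add(pages[index]);

        List<T> processed = processBatch(extracted);
        if (processed.Count != selected.Count)
            throw new InvalidOperationException("batch must preserve page count and order");

        for (int i = 0; i < selected.Count; i++)
            pages[selected[i]] = processed[i];
    }

    public static void Main()
    {
        var pages = new List<string> { "text", "scan", "vector", "scan" };
        var imageOnly = new List<int> { 1, 3 };
        Apply(pages, imageOnly, batch => batch.ConvertAll(p => p + "-reduced"));
        Console.WriteLine(string.Join(", ", pages));
        // text, scan-reduced, vector, scan-reduced
    }
}
```

The count check is the important part: if the processing step ever drops or reorders a page, the results would silently land on the wrong pages of the original document.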
While we touch on a number of sections of dotImage (PDF rasterizing, PDF manipulation, Advanced Document Cleanup (ADC), image processing, ImageSource), we see that stitching those pieces together into powerful tools is straightforward, if not trivial. Further, if your application needs further customization, dotImage gives you the right components to get your task done.