Filter modules for Squid 3.0

Version 0.2, October 2008
  1. Purpose
  2. Prerequisites
  3. Installation
  4. Configuration
  5. Available filters
  6. Using
  7. Internals
  8. Migration from 2.x
  9. Related projects
  10. Bugs
  11. Getting this package

This is a project to build filtering capabilities comparable to those of Muffin into Squid. It consists of a filtering framework and a set of filter modules. The currently available filters are described under "Available filters" below.

Special features:

Purpose

A filtering proxy allows users to remove unwanted content from Web pages as they browse them. What counts as "unwanted" obviously depends on the individual user, but things commonly regarded as annoyances include banner ads, JavaScript, ActiveX objects, animated GIFs and "Web bugs". Some of those things can be avoided by filtering URIs, which Squid can already do via an external redirect program. Others require a content filter.

Usually, a filtering proxy runs standalone and does nothing but filtering. Users have to configure this proxy in their browsers, and if they use a caching proxy too, chain them after the filter. In situations where the user runs Squid anyway (mostly because of caching for different browsers or a small LAN), it is convenient to build this capability into Squid.

Prerequisites

This patch is for Squid 3.0.STABLE9. It was developed and tested under Linux 2.6 with glibc 2.2.5 through 2.4.1 and gcc 3.3 through 4.1, but should not be system-specific.

You need the Squid sources, everything for compiling them, GNU patch, autoconf 2.50 and automake 1.6.

Installation

  1. Apply the patch (in the Squid source directory):
    gzip -cd squid-3.0stable9-filter-0.2.patch.gz | patch -p1
    
  2. Run configure:
    sh bootstrap.sh
    sh configure (options...) --enable-filters
    
  3. Compile and install Squid as usual.
It is possible to include externally written filter modules with the configure argument --with-morefilters="/path/to/file.cc /path/to/other.cc.."

Configuration

Defining filters

There is a new squid.conf directive:
filter_module name [ arguments... ] [ * {allow|deny} acls... ]
It tells Squid to define a filter of the given type. Filter modules can take arguments, as documented for the individual modules. Arguments are separated by whitespace, with the same quoting mechanisms as used elsewhere in squid.conf. A filter type can be specified in more than one filter_module line; in that case, several filter instances with different parameters are created. See below on chaining filters.

Each filter line can optionally take an ACL list. This must start with an asterisk (surrounded by whitespace), followed by either the keyword allow or deny, followed by one or more ACLs defined before the filter line.

A filter with no ACL specification is applied to every request. A filter with an ACL specification is applied to each request which is denied by the ACL. In other words: a request that the ACL allows bypasses the filter.
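
The directive syntax and the ACL rule above can be sketched in Python. This is an illustration only, not how Squid parses its configuration: shlex stands in for squid.conf quoting (which may differ in detail), the function names are hypothetical, and the handling of a deny line (filter applied only when the ACLs match) is an assumption based on the usual behaviour of Squid access lists.

```python
import shlex

def parse_filter_module(line):
    """Split a filter_module line into name, arguments and ACL part."""
    tokens = shlex.split(line)  # assumption: shell-like quoting
    if len(tokens) < 2 or tokens[0] != "filter_module":
        raise ValueError("not a filter_module line: %r" % line)
    name, rest = tokens[1], tokens[2:]
    action, acls = None, []
    if "*" in rest:
        # The ACL list starts with a lone asterisk, then allow or deny.
        star = rest.index("*")
        rest, acl_part = rest[:star], rest[star + 1:]
        if len(acl_part) < 2 or acl_part[0] not in ("allow", "deny"):
            raise ValueError("expected allow|deny and at least one ACL")
        action, acls = acl_part[0], acl_part[1:]
    return {"name": name, "args": rest, "action": action, "acls": acls}

def filter_applies(action, acls_match):
    """Decide whether a filter runs for a request (hypothetical helper)."""
    if action is None:
        return True             # no ACL specification: always filter
    if action == "allow":
        return not acls_match   # allowed requests bypass the filter
    return acls_match           # assumption: deny + match means "filter"
```

For example, parse_filter_module("filter_module activex * allow allow_activex") yields the name activex, no arguments, the action allow and the ACL list ["allow_activex"].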

There is a new option for the http_port directive: the flag nofilter specifies that requests arriving on this port will not be filtered. Effectively, this runs a filtering and a non-filtering proxy at once, on different ports.

Pattern files

Pattern files contain lists of regular expressions (POSIX extended, i.e. grep -E syntax), one pattern per line, against which the URI is matched. Blank lines and lines starting with a number sign (#) are ignored. Whenever a pattern file is changed, it is reloaded automatically at the next request; no reconfigure is needed. A pattern is marked as case-insensitive by prepending a dash. (To place a literal dash at the start of a pattern, use a character class such as [-].) Patterns may not contain literal TABs; use \t instead.
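
The pattern file rules can be sketched in Python as follows. This is only an illustration and not the PatFile implementation: Python's re module stands in for POSIX extended regular expressions, automatic reloading is left out, and the names parse_pattern_file and matches_any are hypothetical.

```python
import re

def parse_pattern_file(text):
    """Compile the patterns of a simple list, one per line."""
    patterns = []
    for line in text.splitlines():
        # Blank lines and lines starting with a number sign are ignored.
        if not line.strip() or line.startswith("#"):
            continue
        flags = 0
        if line.startswith("-"):
            # A leading dash marks the pattern as case-insensitive.
            flags = re.IGNORECASE
            line = line[1:]
        patterns.append(re.compile(line, flags))
    return patterns

def matches_any(patterns, uri):
    """Return True if any pattern matches somewhere in the URI."""
    return any(p.search(uri) for p in patterns)

# Sample file: a comment, a blank line, a case-insensitive pattern and
# a pattern that starts with a literal dash written as a class.
SAMPLE = "# ad servers\n\n-^http://ads\\.\n[-]banner\\.gif$\n"
patterns = parse_pattern_file(SAMPLE)
```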

There are two types of pattern files: simple lists and replacement lists.

Simple lists

These are lists of patterns against which some piece of data is matched. In older releases this always meant the request URI; now it depends on the individual filter (so far, only the activex filter uses this feature). The old allow lists are no longer used; they have been superseded by ACLs on filters.

Replacement lists

A replacement list allows URIs to be replaced by other URIs, in a sed s///-like fashion. This type of pattern file is used by the redirection filter. Each line in the file consists of two elements separated by (at least) one TAB character. The first is a pattern, the second a replacement. The replacement may contain \1, \2... \9 references to parenthesized subpatterns; \0 means the whole match and \* means the complete original URI. The replacement may also contain \_0, \_1..., \_* references which copy the same subpatterns in modified base64 encoding (see below).

A special replacement can be given as a shortcut for patterns which have no explicit replacement. This default is specified as replacement for the pattern consisting of a single exclamation mark, which should be the first line in the file. Negative match does not work in a replacement list.
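
A minimal Python sketch of how a replacement list could be read and applied. It is not the redirect module's code and rests on several assumptions: the expanded replacement becomes the new URI as a whole, a pattern line without a TAB uses the default replacement, a reference to a subpattern that does not exist expands to nothing, and the \_ (base64) references and case-insensitive patterns are left out. All function names are hypothetical.

```python
import re

def parse_replacement_list(text):
    """Return (rules, default) from lines of 'pattern<TAB>replacement'."""
    rules, default = [], None
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        # The two elements are separated by one or more TAB characters.
        parts = re.split(r"\t+", line, maxsplit=1)
        replacement = parts[1] if len(parts) > 1 else None
        if parts[0] == "!":
            # A single exclamation mark defines the default replacement.
            default = replacement
        else:
            rules.append((re.compile(parts[0]), replacement))
    return rules, default

def expand(replacement, match, uri):
    """Expand backslash references (digits 0-9 and *) in a replacement."""
    def reference(m):
        token = m.group(1)
        if token == "*":
            return uri  # the complete original URI
        index = int(token)
        if index > match.re.groups:
            return ""   # assumption: unknown subpattern expands to nothing
        return match.group(index) or ""  # index 0 is the whole match
    return re.sub(r"\\([0-9*])", reference, replacement)

def redirect(rules, default, uri):
    """Apply the first matching rule; later rules are not tried."""
    for pattern, replacement in rules:
        match = pattern.search(uri)
        if match:
            if replacement is None:
                replacement = default
            if replacement is None:
                return uri
            return expand(replacement, match, uri)
    return uri

SAMPLE = (
    "!\thttp://localhost/blocked.html\n"
    "^http://ads\\.example\\.com/\n"
    "^http://old\\.example\\.com/(.*)$\thttp://new.example.com/\\1\n"
)
rules, default = parse_replacement_list(SAMPLE)
```

With this sample, a request for http://old.example.com/a/b is rewritten to http://new.example.com/a/b, and any URI on ads.example.com is rewritten to the default replacement.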

Modified base64 encoding
This encoding is base64 with the characters + / = (plus, slash, equals) replaced by - _ . (dash, underscore, dot) respectively. This yields a URL-safe encoding of request URIs or parts thereof (which may be useful for script-based postprocessing of redirect results).
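
The character substitution can be sketched in Python with the standard base64 module. The function names are hypothetical, and the decoder is included only as an example of what a postprocessing script might do.

```python
import base64

# Translation tables for the three substituted characters.
TO_MODIFIED = str.maketrans("+/=", "-_.")
FROM_MODIFIED = str.maketrans("-_.", "+/=")

def modified_b64encode(data: bytes) -> str:
    """Encode bytes as base64, then swap + / = for - _ . respectively."""
    return base64.b64encode(data).decode("ascii").translate(TO_MODIFIED)

def modified_b64decode(text: str) -> bytes:
    """Reverse the substitution and decode ordinary base64."""
    return base64.b64decode(text.translate(FROM_MODIFIED))
```

For instance, the two bytes 0xFB 0xFF are "+/8=" in plain base64 and "-_8." in the modified form.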

Other configuration dependencies

When content filters (see next section) are in use, an appropriate request_header_replace clause must be set up to filter out the Accept-Encoding and Accept-Ranges request headers.
Use this:
request_header_replace Accept-Encoding identity
request_header_replace Accept-Ranges none
The reason is explained in the Content-Encoding section below.

Available filters

Currently there are the following filters: redirect, script, activex, gifanim and bugfinder.

Filters

Filters fall into one of the following categories: filters which alter the request (the redirect module) and content filters (all other modules described here). Filters of the same category operate either independently or as a chain. Chaining is described below. In any case, all applicable filters are called in exactly the order in which they are specified in the config file.

redirect

Replaces Squid's external redirect program. Takes one argument, the name of a replacement list file. Performs pattern substitution on the requested URI. As soon as a matching pattern is found, the search stops, i.e. redirections are not chained within one redirection filter. However, if the module is specified several times (probably with different replacement list files), all of them are called in order, with a later filter operating on the results of an earlier one. If an external redirector is in use, it is called first, before the filters. NOFILTER does apply to this filter but not to external redirectors.

script

Removes JavaScript (SCRIPT tags, on... handlers and browser-specific ways of inserting JavaScript into tag attributes) from HTML pages. (To also block JavaScript files, use an ACL against the "application/x-javascript" file type.)

activex

Removes ActiveX OBJECT tags from HTML pages. The tags are preserved, only the classid parameter is replaced by a dummy, so the page will still be processed correctly (as if by a non-ActiveX browser). This filter takes a pattern file as an optional argument. This file contains a list of CLSIDs which are allowed through.

gifanim

Breaks animated GIF pictures to remove the annoying blinking. Takes as argument the allowed number of cycles. If the value is zero, there is no animation (only the first picture is shown). If it is negative, animations are not loaded at all (the client shows a broken picture). The default is one, meaning the whole content is shown but does not blink.

bugfinder

Identifies GIF and PNG images not bigger than n by n pixels. The n value is given as an argument (defaults to 2). Since these tiny images are often used as "Web bugs", it may be desirable to block them with a redirector. The filter can only log them to cache.log; to effectively block bugs it is necessary to filter the requests for these URIs, i.e. manual processing of the log file is needed.
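
The size check can be illustrated with a short Python sketch that reads the dimensions from the image header. It is not the bugfinder module's code: using the GIF logical screen size and the PNG IHDR chunk is an assumption about which dimensions matter, and the function names are hypothetical.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def image_size(data: bytes):
    """Return (width, height) of a GIF or PNG, or None if not recognised."""
    if data[:6] in (b"GIF87a", b"GIF89a") and len(data) >= 10:
        # GIF logical screen size: two little-endian 16-bit values.
        return struct.unpack("<HH", data[6:10])
    if (data[:8] == PNG_SIGNATURE and data[12:16] == b"IHDR"
            and len(data) >= 24):
        # PNG IHDR chunk: width and height as big-endian 32-bit values.
        return struct.unpack(">II", data[16:24])
    return None

def is_web_bug(data: bytes, n: int = 2) -> bool:
    """True if the image is not bigger than n by n pixels."""
    size = image_size(data)
    return size is not None and size[0] <= n and size[1] <= n
```

Only the first 24 bytes of an object are needed for this decision, so a check of this kind does not have to buffer the whole image.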

Each content filter specifies the MIME content type(s) to which it applies (like image/gif for the gifanim module) and ignores all other types.

Content filters can be chained. When more than one filter applies to a given MIME content type, every filter operates on the results of its predecessor.
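
Chaining by content type can be pictured with the following Python sketch. The two toy filters and the FILTERS table are hypothetical and exist only to show that each filter sees the output of its predecessor and that the configured order matters; the real filter framework is not structured this way.

```python
def apply_chain(filters, content_type, body):
    """Run every filter registered for content_type, in config order."""
    for mime_types, func in filters:
        if content_type in mime_types:
            body = func(body)  # each filter sees its predecessor's output
    return body

def foo_to_bar(body):
    """Toy filter: replace every 'foo' with 'bar'."""
    return body.replace(b"foo", b"bar")

def bar_to_baz(body):
    """Toy filter: replace every 'bar' with 'baz'."""
    return body.replace(b"bar", b"baz")

# Filters in the order they would appear in the config file.
FILTERS = [
    ({"text/html"}, foo_to_bar),
    ({"text/html"}, bar_to_baz),
    ({"image/gif"}, lambda body: b""),
]
```

With this order, an HTML body containing "foo" ends up as "baz"; with the two HTML filters swapped it would end up as "bar". Content of any other type passes through untouched.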

Using

On the client side, no additional configuration is necessary. Simply set the patched Squid as your proxy.

The NOFILTER feature

Users can request that all filters (including the redirection filter, but not the external redirector) are bypassed for a single request. This is done by appending .X.nofilter to the host name in the URL, where X is replaced by Squid's visible host name. Example: to get http://www.example.com/foo/bar unfiltered from a Squid called squid.cache, use the URI http://www.example.com.squid.cache.nofilter/foo/bar.

The NOFILTER tag as part of the hostname in the URL implies that correctly written relative links, including images, linked scripts etc. on the same server, will also be unfiltered. Apply the necessary caution.

The reason for including Squid's host name is to prevent web servers from adding the NOFILTER tag to their junk banner links themselves. This works best when visible_hostname, unique_hostname and the canonical (DNS) host name of the proxy are all different and not too related, because the origin server sees the latter two but not the visible_hostname.

Since ".nofilter" is not a valid top level domain, it can't clash with real host names.
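
The host name convention can be sketched in Python as follows. The function names are hypothetical, and the case-insensitive comparison is an assumption based on host names generally being case-insensitive; the patch itself may behave differently.

```python
def add_nofilter(host, visible_hostname):
    """Build the host name that requests an unfiltered fetch."""
    return f"{host}.{visible_hostname}.nofilter"

def split_nofilter(host, visible_hostname):
    """Return (real_host, bypass) for the host name of a request."""
    suffix = f".{visible_hostname}.nofilter"
    # Only a tag carrying this proxy's visible host name counts.
    if host.lower().endswith(suffix.lower()) and len(host) > len(suffix):
        return host[:-len(suffix)], True
    return host, False
```

With a visible host name of squid.cache, www.example.com.squid.cache.nofilter is split into www.example.com with the bypass flag set, while a tag naming some other proxy leaves the host name unchanged and filtering in place.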

Another possible way to bypass filters is to use a non-filtering port, as described above. Requests arriving on that port will always bypass all filters.

Internals

Object structure

to be written...

A class diagram (created with ArgoUML) for the filter classes is here: http://sites.inka.de/bigred/devel/filter-patch.zargo.

Library modules

PatFile provides the pattern file facility described above. It is included in the Squid core and described in PatFile.h.

Debugging options

The following debugging sections and levels (see the debug_options directive) are used:
Section 92  Filter framework
Section 93  Filter modules
Section 94  Library modules (PatFile etc.)
Level 1     Error messages
Level 3     "Filter caught something" messages
Level 4     Initialization/finalization messages
Level 5     Initialization/finalization trace
Level 8     Minor trace
Level 9     Full trace (big!)

Content-Encoding

Content filters get the data as delivered by the server. With a non-identity Content-Encoding, the filter would operate on the encoded data, which it generally cannot process correctly. (Experience has confirmed that HTML filters like script, when applied to a file with compression encoding, can silently deliver corrupted files, but mostly this is caught by the HTML parser not accepting null characters.)

For this reason, the Accept-Encoding headers should always be filtered out with an appropriate header_replace clause. The origin server gets forced to always send unencoded data with Accept-Encoding: identity. Another header_replace which sets the Accept-Ranges header to none causes the client to never try Range requests, which obviously are unfilterable too.

Filters in the data path

(TODO: is this still correct?)

The cache always stores unfiltered objects. Content filtering happens in the data path from cache or memory to the client. The filter object is expected to copy the data into a new buffer, so it can do anything with it, including insertions and deletions.

The only exception to the rule that filtering happens only in the path to the client are those filters which alter the request. This applies to the redirect module.

In a cache hierarchy, a filtering cache should only be placed at the bottom, i.e. where only clients directly access it. If another cache sits between the filter and client, that one will cache filtered pages and break the NOFILTER feature.

Migration from 2.x

To upgrade a configuration from Squid 2.4 or 2.5 plus filter patch, note the following:
  1. Filters are no longer loadable modules, instead they are compiled in. A special "htmlfilter" module is no longer needed.
  2. The load_module directive has been replaced by filter_module with slightly different syntax.
  3. The nofilter_port directive has been replaced by the nofilter option in http_port.
  4. The allow lists of the individual filters have been replaced by ACLs applied to the filter. Note that you can get the same effect as with the old allow pattern file like this:
    acl allow_activex url_regex "/usr/local/squid/etc/allowlist_activex"
    filter_module activex * allow allow_activex
    
    The double quotes around the path tell the ACL to read its patterns from a file. The syntax of this file should be compatible with the old allow lists. You have to reconfigure when this file is changed, however.
  5. The header filters (cookies) have been obsoleted by header_access clauses (use Cookie and Set-Cookie with ACLs for allow lists).
  6. The content type filters (allowtype, rejecttype) have been obsoleted by rep_mime_type ACLs.

Related projects

This project was mostly inspired by Muffin, a modular filtering proxy written in Java and distributed under the GPL. So far, that is the most powerful filter I know of.

The Junkbusters web page has one of the oldest and best known web filters as well as a very comprehensive resources list covering most issues from "What is this all about?" to a list of filtering software (by now most of them are either for Windows or for pay or both, which indicates there is a real demand for filtering).

Bugs

As with any pre-release, this surely contains bugs. In particular I'm not sure if I really avoided memory leaks. If someone finds problems, please tell me.

Known issues

Getting this package

An up-to-date version of this page can be found at http://sites.inka.de/bigred/devel/squid-filter.html.

The latest release is filter 0.2 for Squid 3.0.STABLE9. Download at http://sites.inka.de/bigred/devel/squid-3.0stable9-filter-0.2.patch.gz.

For use and distribution of this package, the same terms and conditions as for the Squid package itself (i.e. the GNU General Public License) apply. Note, however, that using a version or installation setup which has the NOFILTER feature removed or restricted in any way is in gross contradiction to the author's intentions, and people who do so should feel guilty of abuse.

Acknowledgements

Development of this version was funded by credativ GmbH.