Smart Cache for PHP
When presenting dynamic content on your web site, you can choose among several content presentation models. The two fundamental models are "push" and "pull", but both present difficulties.
In the "push" model, an application saves dynamically generated content in static form for the web server to deliver to end users. While this results in the fastest delivery, it frequently does not meet the demands of realistic applications. Since static .html files are saved, this model does not work for dynamic requests (those with a query string, like ?page=2). It also requires a "publishing" process, run either by whoever generates the content or on a schedule, to control the flow of content to the web site. This adds workflow that would otherwise be unnecessary, and may require a lot of human action and knowledge to direct content to the right places at the right time. Lastly, this model may not account for changes made to web pages via external tools (such as Dreamweaver).
In the "pull" model, data is provided upon request. This presents serious problems: if the data sources are not scalable, your web site will not be able to perform under heavy traffic. On a busy site, the data source will almost certainly become a bottleneck to content delivery. There are more serious risks as well. If a data source is completely unavailable, your web pages will not be able to display dynamic data until it comes back online. Similarly, if there is a problem with the application or script requesting the data from the data source, your web pages may fail to display altogether, rather than merely lacking their dynamic elements.
A proper caching system can be the key to solving these problems. A common approach taken throughout the PHP community is to use a very small, very fast PHP script as a bridge between the web page and the application/data sources that provide content. This script implements some basic logic to decide whether it should generate the page and save it in the cache, or read a previously generated page from the cache. This is a worthwhile approach given that a properly functioning server should never "lose" PHP the way it could lose database connectivity, for example. (If it does, something more serious is wrong with your hosting provider.) Sites like Yahoo! and Facebook use solutions like this. However, many freely available solutions of this kind fail to deliver good performance, and fail to resolve some of the broader issues that the push and pull models face.
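The decision logic described above can be sketched in a few lines. This is a minimal illustration, not the Smart Cache source: the function name sketch_serve(), its parameters, and the use of the file's modification time as the freshness marker are all assumptions made for the example. It also shows the fallback idea in its simplest form, catching an exception thrown by the generator; a real PHP fatal error cannot be caught this way and would need separate handling.

```php
<?php
// Hypothetical helper (not part of Smart Cache): serve a fresh cache entry,
// otherwise regenerate it, falling back on the stale copy if generation fails.
function sketch_serve(string $cacheFile, int $ttl, callable $generate): string
{
    // One read attempt; a hit is the common case, so no file_exists() first.
    $cached = @file_get_contents($cacheFile);

    if ($cached !== false) {
        $mtime = @filemtime($cacheFile);
        if ($mtime !== false && (time() - $mtime) < $ttl) {
            return $cached; // entry is still fresh
        }
    }

    try {
        $fresh = $generate(); // run the application to build the page
    } catch (Throwable $e) {
        if ($cached !== false) {
            return $cached; // stale content is better than a broken page
        }
        throw $e; // nothing to fall back on
    }

    // LOCK_EX takes an exclusive lock for the duration of the write.
    file_put_contents($cacheFile, $fresh, LOCK_EX);
    return $fresh;
}
```

A caller would pass the path of the cache entry, the refresh interval in seconds, and a callable that produces the page body.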
The "Smart Cache" project is an attempt to provide a solution to the PHP community that better serves the needs of its developers. It is designed for high scalability, high reliability, and minimal configuration, and it aims to solve the issues outlined above. The script is available for download, and may be implemented by anyone, anywhere, provided that the header comments remain intact. The project goals below outline the features of this cache layer in detail.
Project Goals
- Cache entries must redetermine freshness according to an interval. In other words, there is a "refresh rate" according to which all cached files are checked for stale entries.
- This system should not rely on a cron script or any other mechanism to explicitly push cached content, nor should it need to know whether or where cacheable scripts exist.
- Cached files should be stored in a centralized, writable location, not relative to the actual script whose output was cached. This is because the entire web site should not be writable by the web server, nor should FTP have to be used to save the cache files.
- A separate cache entry should be saved for each distinct GET request to a script (i.e. foo.php?page=1 and foo.php?page=2 create two different cached files).
- Cached GET variants of a page (?page=2, etc.) should be pruned after reaching a limit, but the master request (no query string) should never be pruned. This prevents an exploit where an attacker makes incremental requests to a single URL and fills up the allotted file system space.
- Cached files, aside from pruned GET requests, should never be cleaned up unless the parent web page is removed from the file system. This allows there to always be a cache entry to fall back on.
- Only GET requests should be cached, not POST. POST actions will differ from user to user and therefore should be excluded from the cache.
- Caching headers must be sent for any valid return of content, whether from cache or following regeneration.
- A "Cache-Control: public" header should be sent for valid content.
- A "Last-Modified" header should be sent for valid content. This should represent the last time the requested content was cached (or the current time, if not yet cached).
- An "Expires" header should be sent for valid content. This should represent a date in the future equal to "now plus the refresh rate interval". This will prevent the browser from re-requesting content until its locally cached copy expires at the specified time.
- An "ETag" header should be sent for valid content. This will prevent the server from sending any data over the network if it discovers that no content differs from the copy held by the browser making the request.
- Requests for which there is no cache should result in a no-cache header.
- There must be a way to exclude certain scripts from the caching mechanism. Some pages must bypass the cache layer, and therefore should simply not include the cache script at the top of their source. An example of a dynamic page that would do this is one which presents user-specific content. Note: in some situations, the use of cookies and/or AJAX may allow the base page to be cached, and allow all user-specific content to be attached after the fact.
- The cache system should perform as few operations as possible in order to serve a cached file. This includes not doing redundant checks for file_exists() followed by file_get_contents() when a single @file_get_contents() can be performed and a positive result is the common case.
- The file containing the caching code should only contain the caching code. A web application should never be called through include() or require() unless the cache layer first determines that a script's output must be regenerated. Therefore, the caching mechanism should never be part of the web application itself. This can be achieved by keeping the cache script, as presented here, independent from any other code, and using the $smart_cache['include'] variable as a way to attach a subsequent PHP script.
- There must be an optional way to bypass the cache layer with a cookie (unless an uncached response cannot be produced). This is useful if you have an admin login which should always see uncached content on the site they are managing, while still guarding against application/script/resource failures.
- There must be an optional way to completely disable the cache layer for a request, with a cookie. This is useful for debugging, as it will disable the cache layer completely and allow errors to occur.
- If the cache goes stale and needs to be refreshed, the cache entry should still be sent if subsequent PHP code results in a fatal error. This is critical, because broken scripts and applications would otherwise bring down your site.
- Similarly, if the cache goes stale and needs to be refreshed, the cache entry should still be sent if subsequent PHP code reveals that a required data source is unavailable. This can be achieved by setting the global variable $GLOBALS['smart_cache_data_source_failed']=1 in your application if a data source is unavailable. By setting this variable, you are telling the cache layer to cancel its request for live content and fall back on cached content for this request. For example, if your database connection fails, you would set this flag.
- The cache directory should be divided into subdirectories for maximum file system performance. This is done by dividing cache entries up by the first character of their hash key.
- The cache directory must be configurable, so as to allow a location shared by multiple servers, or to allow storage on a RAM disk.
- Cache writes should use file locking to prevent race conditions. This prevents cached content from becoming corrupt or partially served when multiple requests to recache content occur at once.
- Cache entry hashes should be optimized. This is done by using hash('md5',$str) rather than the slower md5($str).
- Object-oriented programming should not be used to implement the caching mechanism, as this would add unnecessary overhead.
- The cache system should have low additional overhead to cache a script.
- The cache system must be FAST and extremely easy to set up. It should be performance tested with ApacheBench (ab) and Xdebug + KCachegrind.
- A custom X-* header should be sent to the browser, declaring how the cache system responded.
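Two of the goals above, the cache file layout and the response headers, lend themselves to a short sketch. The helper names sketch_cache_path() and sketch_cache_headers() are hypothetical, and hashing the request URI (for the file name) and the body (for the ETag) are assumptions; the actual smart_cache.php may derive its keys differently.

```php
<?php
// Hypothetical helper: map a request URI to a cache file path, with one
// subdirectory per first character of the hash key.
function sketch_cache_path(string $cacheDir, string $requestUri): string
{
    $key = hash('md5', $requestUri); // hash('md5', ...) as named in the goals
    return rtrim($cacheDir, '/') . '/' . $key[0] . '/' . $key;
}

// Hypothetical helper: build the header lines for a valid response.
// $cachedAt is when the content was cached, $now is the current time.
function sketch_cache_headers(string $body, int $cachedAt, int $ttl, int $now): array
{
    $httpDate = 'D, d M Y H:i:s'; // HTTP dates are always expressed in GMT
    return [
        'Cache-Control: public',
        'Last-Modified: ' . gmdate($httpDate, $cachedAt) . ' GMT',
        'Expires: ' . gmdate($httpDate, $now + $ttl) . ' GMT', // now + refresh rate
        'ETag: "' . hash('md5', $body) . '"',
    ];
}
```

Each returned line could be passed to header() before the body is sent. Because the query string is part of the hashed URI in this sketch, foo.php?page=1 and foo.php?page=2 land in different files.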
The Code
File 1: script.php (the file whose output is to be cached)
<?php
require '/path/to/smart_cache.php'; // require the cache layer
?>
<html>
<body>
Proceed with your script as you see fit.
</body>
</html>
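As an illustration of how application code in a cached script might report an unavailable data source, the sketch below sets the $GLOBALS['smart_cache_data_source_failed'] flag named in the project goals. The function sketch_load_data() and its $connect callable are hypothetical stand-ins for your own database or API code, and how the cache layer reads the flag is not shown here.

```php
<?php
// Hypothetical application helper: $connect is any callable that returns
// data (or a connection) on success and throws on failure.
function sketch_load_data(callable $connect)
{
    try {
        return $connect();
    } catch (Throwable $e) {
        // Tell the cache layer to fall back on cached content for this request.
        $GLOBALS['smart_cache_data_source_failed'] = 1;
        return null;
    }
}
```

The application would call this where it normally opens its database connection, and skip rendering when null comes back.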
File 2: smart_cache.php (the cache layer)
View source | Download smart_cache.tar.gz
Configuration
The following variables can be configured for this script to operate in your own environment.
- Required: Set $smart_cache['cache_dir'] to the path to your cache directory, writable by the web server.
- Required: Set $smart_cache['ttl'] to the number of seconds to elapse before cache entries will be checked for freshness. (recommended: 300 -- i.e. 5 minutes)
- Required: Set $smart_cache['script_limit'] to the maximum number of GET variants of a single page that can be stored. (recommended: 30)
- Optional: Set $smart_cache['include'] to the path to a PHP script to be included only if a fresh cache entry is being generated.
- Optional: Set $smart_cache['cookie_bypass'] to the name of a cookie that, if set, will bypass the cache layer (unless a valid uncached copy is not available).
- Optional: Set $smart_cache['cookie_disable'] to the name of a cookie that, if set, will disable the cache layer altogether.
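Put together, a configuration using the settings above might look like the following. Only the array keys and the recommended ttl and script_limit values come from the list; the directory, include path, and cookie names are placeholder assumptions to replace with your own.

```php
<?php
// Example values only; paths and cookie names are assumptions.
$smart_cache = [
    'cache_dir'      => '/var/cache/smart_cache', // must be writable by the web server
    'ttl'            => 300,                      // seconds before freshness is rechecked
    'script_limit'   => 30,                       // max cached GET variants per page
    'include'        => '/path/to/application.php', // optional: run only when regenerating
    'cookie_bypass'  => 'sc_bypass',              // optional: cookie name to bypass the cache
    'cookie_disable' => 'sc_disable',             // optional: cookie name to disable the cache
];
```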
"Smart Cache" for PHP is written by Alexander Romanovich, 2007-2024.