ADDED pages/affcase1.in Index: pages/affcase1.in ================================================================== --- /dev/null +++ pages/affcase1.in @@ -0,0 +1,567 @@ +hd_keywords {What If OpenOffice Used SQLite} \ +{OpenOffice case study} +What If OpenDocument Used SQLite? + +

+What If OpenDocument Used SQLite?

+ +

Introduction

+ +

Suppose the +[http://en.wikipedia.org/wiki/OpenDocument|OpenDocument] file format, +and specifically the "ODP" OpenDocument Presentation format, where +built around SQLite. Benefits would include: +

+ +

+Note that this is only a thought experiment. +We are not suggesting that OpenDocument be changed. +Nor is this article a criticism of the current OpenDocument +design. The point of this essay is to suggest ways to improve +future file format designs. + +

About OpenDocument And OpenDocument Presentation

+ +

+The OpenDocument file format is used for office applications: +word processors, spreadsheets, and presentations. It was originally +designed for the OpenOffice suite but has since been incorporated into +other desktop application suites. The OpenOffice application has been +forked and renamed a few times. This author's primary use for OpenDocument is +building slide presentations with either +[https://www.neooffice.org/neojava/en/index.php|NeoOffice] on Mac, or +[http://www.libreoffice.org/|LibreOffice] on Linux and Windows. + +

+An OpenDocument Presentation or "ODP" file is a +[http://en.wikipedia.org/wiki/Zip_%28file_format%29|ZIP archive] containing +XML files describing presentation slides and separate image files for the +various images that are included as part of the presentation. +(OpenDocument word processor and spreedsheet files are similarly +structured but are not considered by this article.) The reader can +easily see the content of an ODP file by using the "zip -l" command. +For example, the following is the "zip -l" output from a 49-slide presentation +about SQLite from the 2014 +SouthEast LinuxFest +conference: + +

+Archive:  self2014.odp
+  Length      Date    Time    Name
+---------  ---------- -----   ----
+       47  2014-06-21 12:34   mimetype
+        0  2014-06-21 12:34   Configurations2/statusbar/
+        0  2014-06-21 12:34   Configurations2/accelerator/current.xml
+        0  2014-06-21 12:34   Configurations2/floater/
+        0  2014-06-21 12:34   Configurations2/popupmenu/
+        0  2014-06-21 12:34   Configurations2/progressbar/
+        0  2014-06-21 12:34   Configurations2/menubar/
+        0  2014-06-21 12:34   Configurations2/toolbar/
+        0  2014-06-21 12:34   Configurations2/images/Bitmaps/
+    54702  2014-06-21 12:34   Pictures/10000000000001F40000018C595A5A3D.png
+    46269  2014-06-21 12:34   Pictures/100000000000012C000000A8ED96BFD9.png
+... 58 other pictures omitted...
+    13013  2014-06-21 12:34   Pictures/10000000000000EE0000004765E03BA8.png
+  1005059  2014-06-21 12:34   Pictures/10000000000004760000034223EACEFD.png
+   211831  2014-06-21 12:34   content.xml
+    46169  2014-06-21 12:34   styles.xml
+     1001  2014-06-21 12:34   meta.xml
+     9291  2014-06-21 12:34   Thumbnails/thumbnail.png
+    38705  2014-06-21 12:34   Thumbnails/thumbnail.pdf
+     9664  2014-06-21 12:34   settings.xml
+     9704  2014-06-21 12:34   META-INF/manifest.xml
+---------                     -------
+ 10961006                     78 files
+
+ +

+The ODP ZIP archive contains four different XML files: +content.xml, styles.xml, meta.xml, and settings.xml. Those four files +define the slide layout, text content, and styling. This particular +presentation contains 62 images, ranging from full-screen pictures to +tiny icons, each stored as a separate file in the Pictures +folder. The "mimetype" file contains a single line of text that says: + +

+application/vnd.oasis.opendocument.presentation
+
+ +

The purpose of the other files and folders is presently +unknown to the author but is probably not difficult to figure out. + +

Limitations Of The OpenDocument Presentation Format

+ +

+The use of a ZIP archive to encapsulate XML files plus resources is an +elegant approach to an application file format. +It is clearly superior to a custom binary file format. +But using an SQLite database as the +container, instead of ZIP, would be more elegant still. + +

A ZIP archive is basically a key/value database, optimized for +the case of write-once/read-many and for a relatively small number +of distinct keys (a few hundred to a few thousand) each with a large BLOB +as its value. A ZIP archive can be viewed as a "pile-of-files" +database. This works, but it has some shortcomings relative to an +SQLite database, as follows: + +

    +
  1. Incremental update is hard. +

    +It is difficult to update individual entries in a ZIP archive. +It is especially difficult to update individual entries in a ZIP +archive in a way that does not destroy +the entire document if the computer loses power and/or crashes +in the middle of the update. It is not impossible to do this, but +it is sufficiently difficult that nobody actually does it. Instead, whenever +the user selects "File/Save", the entire ZIP archive is rewritten. +Hence, "File/Save" takes longer than it ought, especially on +older hardware. Newer machines are faster, but it is still bothersome +that changing a single character in a 50 megabyte presentation causes one +to burn through 50 megabytes of the finite write life on the SSD. + +

  2. Startup is slow. +

    +In keeping with the pile-of-files theme, OpenDocument stores all slide +content in a single big XML file named "content.xml". +LibreOffice reads and parses this entire file just to display +the first slide. +LibreOffice also seems to +read all images into memory as well, which makes sense seeing as when +the user does "File/Save" it is going to have to write them all back out +again, even though none of them changed. The net effect is that +start-up is slow. Double-clicking an OpenDocument file brings up a +progress bar rather than the first slide. +This results in a bad user experience. +The situation grows ever more annoying as +the document size increases. + +

  3. More memory is required. +

    +Because ZIP archives are optimized for storing big chunks of content, they +encourage a style of programming where the entire document is read into +memory at startup, all editing occurs in memory, then the entire document +is written to disk during "File/Save". OpenOffice and its descendents +embrace that pattern. + +

    +One might argue that it is ok, in this era of multi-gigabyte desktops, to +read the entire document into memory. +But it is not ok. +For one, the amount of memory used far exceeds the (compressed) file size +on disk. So a 50MB presentation might take 200MB or more RAM. +That still is not a problem if one only edits a single document at a time. +But when working on a talk, this author will typically have 10 or 15 different +presentations up all at the same +time (to facilitate copy/paste of slides from past presentation) and so +gigabytes of memory are required. +Add in an open web browser or two and a few other +desktop apps, and suddenly the disk is whirling and the machine is swapping. +And even having just a single document is a problem when working +on an inexpensive Chromebook retrofitted with Ubuntu. +Using less memory is always better. +

    + +
  4. Crash recovery is difficult. +

    +The descendents of OpenOffice tend to segfault more often than commercial +competitors. Perhaps for this reason, the OpenOffice forks make +periodic backups of their in-memory documents so that users do not lose +all pending edits when the inevitable application crash does occur. +This causes frustrating pauses in the application for the few seconds +while each backup is being made. +After restarting from a crash, the user is presented with a dialog box +that walks them through the recovery process. Managing the crash +recovery this way involves lots of extra application logic and is +generally an annoyance to the user. + +

  5. Content is inaccessible. +

    +One cannot easily view, change, or extract the content of an +OpenDocument presentation using generic tools. +The only reasonable way to view or edit an OpenDocument document is to open +it up using an application that is specifically designed to read or write +OpenDocument (read: LibreOffice or one of its cousins). The situation +could be worse. One can extract and view individual images (say) from +a presentation using just the "zip" archiver tool. But it is not reasonable +try to extract the text from a slide. Remember that all content is stored +in a single "context.xml" file. That file is XML, so it is a text file. +But it is not a text file that can be managed with an ordinary text +editor. For the example presentation above, the content.xml file +consist of exactly two lines. The first line of the file is just: + +

    +<?xml version="1.0" encoding="UTF-8"?>
    +
    + +

    The second line of the file contains 211792 characters of +impenetrable XML. Yes, 211792 characters all on one line. +This file is a good stress-test for a text editor. +Thankfully, the file is not some obscure +binary format, but in terms of accessibility, it might as well be +written in Sanskrit. +

+ +

First Improvement: Replace ZIP with SQLite

+ +

+Let us suppose that instead of using a ZIP archive to store its files, +OpenDocument used a very simple SQLite database with the following +single-table schema: + +

+CREATE TABLE OpenDocTree(
+  filename TEXT PRIMARY KEY,  -- Name of file
+  filesize BIGINT,            -- Size of file after decompression
+  content BLOB                -- Compressed file content
+);
+
+ +

+For this first experiment, nothing else about the file format is changed. +The OpenDocument is still a pile-of-files, only now each file is a row +in an SQLite database rather than an entry in a ZIP archive. +This simple change does not use the power of a relational +database. Even so, this simple change shows some improvements. + +hd_fragment smaller {SQLAR smaller than ZIP} +

+Surprisingly, using SQLite in place of ZIP makes the presentation +file smaller. Really. One would think that a relational database file +would be larger than a ZIP archive, but at least in the case of NeoOffice +that is not so. The following is an actual screen-scrape showing +the sizes of the same NeoOffice presentation, both in its original +ZIP archive format as generated by NeoOffice (self2014.odp), and +as repacked as an SQLite database using the +[http://www.sqlite.org/sqlar/doc/trunk/README.md|SQLAR] utility: + +

+-rw-r--r--  1 drh  staff  10514994 Jun  8 14:32 self2014.odp
+-rw-r--r--  1 drh  staff  10464256 Jun  8 14:37 self2014.sqlar
+-rw-r--r--  1 drh  staff  10416644 Jun  8 14:40 zip.odp
+
+ +

+The SQLite database file ("self2014.sqlar") is about a +half percent smaller than the equivalent ODP file! How can this be? +Apparently the ZIP archive generator logic in NeoOffice +is not as efficient as it could be, because when the same pile-of-files +is recompressed using the command-line "zip" utility, one gets a file +("zip.odp") that is smaller still, by another half percent, as seen +in the third line above. So, a well-written ZIP archive +can be slightly smaller than the equivalent SQLite database, as one would +expect. But the difference is slight. The key take-away is that an +SQLite database is size-competitive with a ZIP archive. + +

+The other advantage to using SQLite in place of +ZIP is that the document can now be updated incrementally, without risk +of corrupting the document if a power loss or other crash occurs in the +middle of the update. (Remember that writes to +[atomic commit|SQLite databases are atomic].) True, all the +content is still kept in a single big XML file ("content.xml") which must +be completely rewritten if so much as a single character changes. But +with SQLite, only that one file needs to change. The other 77 files in the +repository can remain unaltered. They do not all have to be rewritten, +which in turn makes "File/Save" run much faster and saves wear on SSDs. + +

Second Improvement: Split content into smaller pieces

+ +

+A pile-of-files encourages content to be stored in a few large chunks. +In the case of ODP, there are just four XML files that define the layout +off all slides in a presentation. An SQLite database allows storing +information in a few large chunks, but SQLite is also adept and efficient +at storing information in numerous smaller pieces. + +

+So then, instead of storing all content for all slides in a single +oversized XML file ("content.xml"), suppose there was a separate table +for storing the content of each slide separately. The table schema +might look something like this: + +

+CREATE TABLE slide(
+  pageNumber INTEGER,   -- The slide page number
+  slideContent TEXT     -- Slide content as XML or JSON
+);
+CREATE INDEX slide_pgnum ON slide(pageNumber); -- Optional
+
+ +

The content of each slide could still be stored as compressed XML. +But now each page is stored separately. So when opening a new document, +the application could simply run: + +

+SELECT slideContent FROM slide WHERE pageNumber=1;
+
+ +

This query will quickly and efficiently return the content of the first +slide, which could then be speedily parsed and displayed to the user. +Only one page needs to be read and parsed in order render the first screen, +which means that the first screen appears much faster and +there is no longer a need for an annoying progress bar. + +

If the application wanted +to keep all content in memory, it could continue reading and parsing the +other pages using a background thread after drawing the first page. Or, +since reading from SQLite is so efficient, the application might +instead choose to reduce its memory footprint and only keep a single +slide in memory at a time. Or maybe it keeps the current slide and the +next slide in memory, to facility rapid transitions to the next slide. + +

+Notice that dividing up the content into smaller pieces using an SQLite +table gives flexibility to the implementation. The application can choose +to read all content into memory at startup. Or it can read just a +few pages into memory and keep the rest on disk. Or it can read just +single page into memory at a time. And different versions of the application +can make different choices without having to make any changes to the +file format. Such options are not available when all content is in +a single big XML file in a ZIP archive. + +

+Splitting content into smaller pieces also helps File/Save operations +to go faster. Instead of having to write back the content of all pages +when doing a File/Save, the application only has to write back those +pages that have actually changed. + +

+One minor downside of splitting content into smaller pieces is that +compression does not work as well on shorter texts and so the size of +the document might increase. But as the bulk of the document space +is used to store images, a small reduction in the compression efficiency +of the text content will hardly be noticeable, and is a small price +to pay for an improved user experience. + +

Third Improvement: Versioning

+ +

+Once one is comfortable with the concept of storing each slide separately, +it is a small step to support versioning of the presentation. Consider +the following schema: + +

+CREATE TABLE slide(
+  slideId INTEGER PRIMARY KEY,
+  derivedFrom INTEGER REFERENCES slide,
+  content TEXT     -- XML or JSON or whatever
+);
+CREATE TABLE version(
+  versionId INTEGER PRIMARY KEY,
+  priorVersion INTEGER REFERENCES version,
+  checkinTime DATETIME,   -- When this version was saved
+  comment TEXT,           -- Description of this version
+  manifest TEXT           -- List of integer slideIds
+);
+
+ +

+In this schema, instead of each slide having a page number that determines +its order within the presentation, each slide has a unique +integer identifier that is unrelated to where it occurs in sequence. +The order of slides in the presentation is determined by a list of +slideIds, stored as a text string in the MANIFEST column of the VERSION +table. +Since multiple entries are allowed in the VERSION table, that means that +multiple presentations can be stored in the same document. + +

+On startup, the application first decides which version it +wants to display. Since the versionId will naturally increase in time +and one would normally want to see the latest version, an appropriate +query might be: + +

+SELECT manifest, versionId FROM version ORDER BY versionId DESC LIMIT 1;
+
+ +

+Or perhaps the application would rather use the +most recent checkinTime: + +

+SELECT manifest, versionId, max(checkinTime) FROM version;
+
+ +

+Using a single query such as the above, the application obtains a list +of the slideIds for all slides in the presentation. The application then +queries for the content of the first slide, and parses and displays that +content, as before. + +

(Aside: Yes, that second query above that uses "max(checkinTime)" +really does work and really does return a well-defined answer in SQLite. +Such a query either returns an undefined answer or generates an error +in many other SQL database engines, but in SQLite it does what you would +expect: it returns the manifest and versionId of the entry that has the +maximum checkinTime.) + +

When the user does a "File/Save", instead of overwriting the modified +slides, the application can now make new entries in the SLIDE table for +just those slides that have been added or altered. Then it creates a +new entry in the VERSION table containing the revised manifest. + +

The VERSION table shown above has columns to record a check-in comment +(presumably supplied by the user) and the time and date at which the File/Save +action occurred. It also records the parent version to record the history +of changes. Perhaps the manifest could be stored as a delta from the +parent version, though typically the manifest will be small enough that +storing a delta might be more trouble than it is worth. The SLIDE table +also contains a derivedFrom column which could be used for delta encoding +it it is determined that saving the slide content as a delta from its +previous version is a worthwhile optimization. + +

So with this simple change, the ODP file now stores not just the most +recent edit to the presentation, but a history of all historic edits. The +user would normally want to see just the most recent edition of the +presentation, but if desired, the user can now go backwards in time to +see historical versions of the same presentation. + +

Or, multiple presentations could be stored within the same document. + +

With such a schema, the application would no longer need to make +periodic backups of the unsaved changes to a separate file to avoid lost +work in the event of a crash. Instead, a special "pending" version could +be allocated and unsaved changes could be written into the pending version. +Because only changes would need to be written, not the entire document, +saving the pending changes would only involve writing a few kilobytes of +content, not multiple megabytes, and would take milliseconds instead of +seconds, and so it could be done frequently and silently in the background. +Then when a crash occurs and the user reboots, all (or almost all) +of their work is retained. If the user decides to discard unsaved changes, +they simply go back to the previous version. + +

+There are details to fill in here. +Perhaps a screen can be provided that displays a history changes +(perhaps with a graph) allowing the user to select which version they +want to view or edit. Perhaps some facility can be provided to merge +forks that might occur in the version history. And perhaps the +application should provide a means to purge old and unwanted versions. +The key point is that using an SQLite database to store the content, +rather than a ZIP archive, makes all of these features much, much easier +to implement, which increases the possibility that they will eventually +get implemented. + +

And So Forth...

+ +

+In the previous sections, we have seen how moving from a key/value +store implemented as a ZIP archive to a simple SQLite database +with just three tables can add significant capabilities to an application +file format. +We could continue to enhance the schema with new tables, with indexes +added for performance, with triggers and views for programming convenience, +and constraints to enforce consistency of content even in the face of +programming errors. Further enhancement ideas include: +

+ +

+An SQLite database has a lot of capability, which +this essay has only begun to touch upon. But hopefully this quick glimpse +has convinced some readers that using an SQL database as an application +file format is worth a second look. + +

+Some readers might resist using SQLite as an application +file format due to prior exposure to enterprise SQL databases and +the caveats and limitations of those other systems. +For example, many enterprise database +engines advise against storing large strings or BLOBs in the database +and instead suggest that large strings and BLOBs be stored as separate +files and the filename stored in the database. But SQLite +is not like that. Any column of an SQLite database can hold +a string or BLOB up to about a gigabyte in size. And for strings and +BLOBs of 100 kilobytes or less, +[BLOB I/O performance|I/O performance is better] than using separate +files. + +

+Some readers might be reluctant to consider SQLite as an application +file format because they have been inculcated with the idea that all +SQL database schemas must be factored into third normal form and store +only small primitive data types such as strings and integers. Certainly +relational theory is important and designers should strive to understand +it. But, as demonstrated above, it is often quite acceptable to store +complex information as XML or JSON in text fields of a database. +Do what works, not what your database professor said you ought to do. + +

Review Of The Benefits Of Using SQLite

+ +

+In summary, +the claim of this essay is that using SQLite as a container for an application +file format like OpenDocument +and storing lots of smaller objects in that container +works out much better than using a ZIP archive holding a few larger objects. +To wit: + +

    +
  1. +An SQLite database file is approximately the same size, and in some cases +smaller, than a ZIP archive holding the same information. + +

  2. +The [atomic commit|atomic update capabilities] +of SQLite allow small incremental changes +to be safely written into the document. This reduces total disk I/O +and improves File/Save performance, enhancing the user experience. + +

  3. +Startup time is reduced by allowing the application to read in only the +content shown for the initial screen. This largely eliminates the +need to show a progress bar when opening a new document. The document +just pops up immediately, further enhancing the user experience. + +

  4. +The memory footprint of the application can be dramatically reduced by +only loading content that is relevant to the current display and keeping +the bulk of the content on disk. The fast query capability of SQLite +make this a viable alternative to keeping all content in memory at all times. +And when applications use less memory, it makes the entire computer more +responsive, further enhancing the user experience. + +

  5. +The schema of an SQL database is able to represent information more directly +and succinctly than a key/value database such as a ZIP archive. This makes +the document content more accessible to third-party applications and scripts +and facilitates advanced features such as built-in document versioning, and +incremental saving of work in progress for recovery after a crash. +

+ +

+These are just a few of the benefits of using SQLite as an application file +format — the benefits that seem most likely to improve the user +experience for applications like OpenOffice. Other applications might +benefit from SQLite in different ways. See the [Application File Format] +document for additional ideas. + +

+Finally, let us reiterate that this essay is a thought experiment. +The OpenDocument format is well-established and already well-designed. +Nobody really believes that OpenDocument should be changed to use SQLite +as its container instead of ZIP. Nor is this article a criticism of +OpenDocument for not choosing SQLite as its container since OpenDocument +predates SQLite. Rather, the point of this article is to use OpenDocument +as a concrete example of how SQLite can be used to build better +application file formats for future projects. Index: pages/appfileformat.in ================================================================== --- pages/appfileformat.in +++ pages/appfileformat.in @@ -144,11 +144,11 @@ like this:

 CREATE TABLE files(filename TEXT PRIMARY KEY, content BLOB);
 
If the content is compressed, then such an SQLite database is only -slightly larger (less than 2% larger) +slightly larger, and [SQLAR smaller than ZIP|sometimes smaller] than an equivalent ZIP archive, and it has the advantage of being able to update individual "files" without rewrite the entire document.

@@ -164,11 +164,12 @@

In other words, an SQLite database can do everything that a pile-of-files or wrapped pile-of-files format can do, plus much more, and with greater lucidity. An SQLite database is a more versatile container than key/value -filesystem or a ZIP archive. +filesystem or a ZIP archive. (For a detailed example, see the +[OpenOffice case study] essay.)

The power of an SQLite database could, in theory, be achieved using a custom file format. But any custom file format that is as expressive as a relational database would likely require an enormous design specification and many tens or hundreds of thousands of lines of code to Index: pages/compile.in ================================================================== --- pages/compile.in +++ pages/compile.in @@ -404,11 +404,11 @@ COMPILE_OPTION {SQLITE_ALLOW_URI_AUTHORITY} { [URI filenames] normally throw an error is the authority section is not either empty or "localhost". However, if SQLite is compiled with the SQLITE_ALLOW_URI_AUTHORITY compile-time option, then the URI is - is converted into a Uniform Naming Convention (UNC) filename and passed + converted into a Uniform Naming Convention (UNC) filename and passed down to the underlying operating system that way.

Some future versions of SQLite may change to enable this feature by default. } Index: pages/download.in ================================================================== --- pages/download.in +++ pages/download.in @@ -170,11 +170,11 @@ is longer than 32767 lines of code. } Product {YEAR/sqlite-autoconf-VVV.tar.gz} { A tarball containing the [amalgamation] - for SQLite VERSION together with an configure script and makefile + for SQLite VERSION together with a configure script and makefile for building it. This tarball also contains in the "tea" subdirectory a separate configure script and makefile compatible with the Tcl Extension Architecture (TEA) for building the TCL bindings to SQLite. } {amalgtarball {amalgamation tarball}} Index: pages/lang.in ================================================================== --- pages/lang.in +++ pages/lang.in @@ -900,16 +900,16 @@

The DEFAULT clause specifies a default value to use for the column if no value is explicitly provided by the user when doing an [INSERT]. ^If there is no explicit DEFAULT clause attached to a column definition, then the default value of the column is NULL. ^(An explicit DEFAULT clause may specify that the default value is NULL, a string constant, a blob constant, a -signed-number, or any constant expression enclosed in parentheses. An explicit +signed-number, or any constant expression enclosed in parentheses. A default value may also be one of the special case-independent keywords CURRENT_TIME, CURRENT_DATE or CURRENT_TIMESTAMP.)^ ^For the purposes of the -DEFAULT clause, an expression is considered constant provided that it does -not contain any sub-queries, column or table references, or string literals -enclosed in double-quotes instead of single-quotes. +DEFAULT clause, an expression is considered constant if it does +contains no sub-queries, column or table references, [bound parameters], +or string literals enclosed in double-quotes instead of single-quotes.

^(Each time a row is inserted into the table by an INSERT statement that does not provide explicit values for all table columns the values stored in the new row are determined by their default values)^, as follows: