Reimplementing apt-file, badly

I recently had an idea: write an apt-file (or yum whatprovides) equivalent that works across multiple distros. If you’ve never used a utility like this, it tells you which package provides a given file. That would help when writing README files, for working out which dependencies are needed on each distro.

After a couple of hours of fiddling around (read: procrastinating), I had a working import of the repository Contents.gz file that’s on every Debian mirror. The thing that struck me was the file sizes:

File                                                          Size
Contents-amd64.gz                                        26,721 KB
Contents-amd64                                          378,971 KB
Contents-amd64.sqlite (naive import, lots of duplication)  452,146 KB
Contents-amd64.sqlite (normalised)                      330,274 KB
Contents-amd64.sqlite.gz*                                55,858 KB

Then I got sidetracked…

Contents-amd64.gz is what’s downloaded when you run apt-file update. First off, I basically looped over each line in the file, split it into a file name and a package name, and inserted the pair straight into SQLite.
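In sketch form, that naive import looks something like this – the table name and layout here are my own choices for illustration, not necessarily what the actual code does:

import gzip
import sqlite3

conn = sqlite3.connect("contents.sqlite")
conn.execute("CREATE TABLE IF NOT EXISTS contents (path TEXT, package TEXT)")

def rows(contents_gz):
    with gzip.open(contents_gz, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            # Each data line is "<path>   <section/package>[,<section/package>...]".
            # Paths can contain spaces, so split from the right. (Real Contents
            # files also open with a free-text preamble; skipping it is left out
            # here for brevity.)
            parts = line.rsplit(maxsplit=1)
            if len(parts) != 2:
                continue
            for location in parts[1].split(","):
                yield parts[0], location

with conn:  # one transaction for the whole load
    conn.executemany("INSERT INTO contents VALUES (?, ?)", rows("Contents-amd64.gz"))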

I then remembered some introductory database theory and normalised the data – each package name becomes an integer ID, and each row stores only a reference to that ID.

The downside is that this needs two import passes: one to get the list of packages, and a second to link each file to its package. In SQLite, this meant creating copies of tables, dropping the duplicates and then running VACUUM to reclaim the space. The result is that I (barely) beat the original uncompressed file for storage efficiency.
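For illustration, here’s a simplified two-pass version. The schema and names are again my own, and it sidesteps the table-copying and VACUUM dance by building the packages table up front:

import gzip
import sqlite3

conn = sqlite3.connect("contents.sqlite")
conn.executescript("""
    CREATE TABLE IF NOT EXISTS packages (
        id   INTEGER PRIMARY KEY,
        name TEXT UNIQUE NOT NULL
    );
    CREATE TABLE IF NOT EXISTS files (
        path       TEXT NOT NULL,
        package_id INTEGER NOT NULL REFERENCES packages(id)
    );
""")

def entries(contents_gz):
    with gzip.open(contents_gz, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            parts = line.rsplit(maxsplit=1)
            if len(parts) != 2:
                continue
            for location in parts[1].split(","):
                yield parts[0], location

with conn:
    # Pass 1: collect the distinct package names and give them integer ids.
    for _, package in entries("Contents-amd64.gz"):
        conn.execute("INSERT OR IGNORE INTO packages (name) VALUES (?)", (package,))
    ids = dict(conn.execute("SELECT name, id FROM packages"))
    # Pass 2: store each file path against its package's integer id.
    conn.executemany(
        "INSERT INTO files VALUES (?, ?)",
        ((path, ids[package]) for path, package in entries("Contents-amd64.gz")),
    )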

Next, what about search? Could I really beat grep?

I think caching and load affected some of the results – a couple of runs a few hours apart produced wildly different numbers. All of my tests were run on a 512 MB DigitalOcean instance.

$ time zgrep bash /var/cache/apt/apt-file/mirrors.digitalocean.com_debian_dists_jessie_main_Contents-amd64.gz | head
bin/bash                                                shells/bash
bin/bash-static                                         shells/bash-static
bin/rbash                                               shells/bash
etc/apparmor.d/abstractions/bash                        admin/apparmor
etc/bash.bashrc                                         shells/bash
etc/bash_completion                                     shells/bash-completion
etc/bash_completion.d/R                                 gnu-r/r-base-core
etc/bash_completion.d/_publican                         perl/publican
etc/bash_completion.d/aapt                              devel/aapt
etc/bash_completion.d/adb                               devel/android-tools-adb

real    0m0.013s
user    0m0.000s
sys     0m0.000s

$ time apt-file search bash | head
0install-core: /usr/share/bash-completion/completions/0install
0install-core: /usr/share/bash-completion/completions/0launch
aapt: /etc/bash_completion.d/aapt
acheck: /usr/share/doc/acheck/bash_completion
acl2-books: /usr/lib/acl2-6.5/books/misc/bash-bsd.o
acl2-books: /usr/lib/acl2-6.5/books/misc/bash.o
acl2-books: /usr/share/acl2-6.5/books/misc/bash-bsd.o
acl2-books: /usr/share/acl2-6.5/books/misc/bash.o
acl2-books-certs: /usr/share/acl2-6.5/books/misc/bash-bsd.cert
acl2-books-certs: /usr/share/acl2-6.5/books/misc/bash.cert

real    0m3.631s
user    0m3.268s
sys     0m0.280s
$ time python search.py bash | head
shells/bash: bin/rbash
admin/apparmor: etc/apparmor.d/abstractions/bash
shells/bash: etc/bash.bashrc
shells/bash-completion: etc/bash_completion
gnu-r/r-base-core: etc/bash_completion.d/R
perl/publican: etc/bash_completion.d/_publican
devel/aapt: etc/bash_completion.d/aapt
devel/android-tools-adb: etc/bash_completion.d/adb
science/libadios-bin: etc/bash_completion.d/adios
utils/silversearcher-ag: etc/bash_completion.d/ag

real    0m2.375s
user    0m1.732s
sys     0m0.540s
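For the curious, the search itself boils down to a single join with a LIKE filter. Here’s a minimal sketch against the normalised schema from earlier (the real search.py is in the repository and may differ):

import sys
import sqlite3

conn = sqlite3.connect("contents.sqlite")
query = """
    SELECT packages.name, files.path
    FROM files JOIN packages ON packages.id = files.package_id
    WHERE files.path LIKE ?
"""
# Substring match over paths, like `apt-file search`.
for package, path in conn.execute(query, ("%" + sys.argv[1] + "%",)):
    print("{}: {}".format(package, path))

A LIKE pattern with a leading wildcard can’t use an ordinary index, so this is a full table scan – which goes some way to explaining why it’s merely competitive rather than miles ahead.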

Not too bad – I’m competitive with apt-file. (The zgrep time above is flattered by head, which exits after ten matches and takes zgrep down with it, so grep never scans the whole file.) Of course I’m missing some features, but it’s a good result for a little experiment.

The code is on my GitHub, if you’re interested.

I’ll keep working on this, add support for yum/dnf’s formats, and probably migrate to a real database backend.

Sidenotes

  • The code initially worked, but would then crash when piped to head. The answer? You have to handle SIGPIPE on UNIX-like systems – see this Stack Overflow post for more info, and the sketch after this list.

  • Gzipping the database on an i5-based laptop took nearly 20 minutes (hence the asterisk in the table above). A saner compression level is a lot quicker, but the compressed size is then about 64 megabytes.

  • SQLite can read gzipped files with a proprietary extension. It’s a pity the sqlite3 Python module doesn’t accept file handles for database files, otherwise I could just wrap the database in a gzip stream and I’d be good to go.

  • Someone with better SQL skills could probably make the database import a lot faster – on my DigitalOcean instance, the initial import takes a few minutes. A couple of standard tricks are sketched below.
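On the SIGPIPE point, the usual fix in Python is a single line at startup:

import signal

# Python installs a handler that turns SIGPIPE into an exception, so a script
# whose stdout is closed early (e.g. piped into `head`) dies with a traceback.
# Restoring the OS default handler makes it exit quietly instead.
signal.signal(signal.SIGPIPE, signal.SIG_DFL)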
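And on import speed, a couple of standard SQLite bulk-loading tricks. This is a generic sketch against a stand-in table, not the project’s actual schema:

import sqlite3

conn = sqlite3.connect("contents.sqlite")
conn.execute("CREATE TABLE IF NOT EXISTS files (path TEXT, package_id INTEGER)")

# Trade durability for speed while loading: fine for a throwaway import that
# can simply be re-run from the source file if it fails part-way.
conn.execute("PRAGMA journal_mode = OFF")
conn.execute("PRAGMA synchronous = OFF")

rows = [("bin/bash", 1), ("bin/rbash", 1)]  # stand-in for the parsed data

# One big transaction plus executemany avoids a disk sync per row.
with conn:
    conn.executemany("INSERT INTO files VALUES (?, ?)", rows)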

