I recently had an idea to write an `apt-file`-style utility (think `yum whatprovides`) that works across multiple distros. If you've never used a utility like this, it tells you which package provides a certain file. This would help when writing README files, to work out which dependencies are needed on each distro.
After a couple of hours of fiddling around (read: procrastinating), I had a working import from the repository Contents.gz file that’s on every Debian mirror. The thing that struck me was the file sizes:
| File | Size |
| --- | --- |
| Contents-amd64.sqlite (naive import with lots of duplication) | 452,146 KB |
Then I got sidetracked…
`Contents-amd64.gz` is what's downloaded when you run `apt-file update`. First off, I basically ran a `for each line in file` loop: split each line into package name and file name, then insert it straight into SQLite.
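Concretely, the naive version looks something like this (a sketch rather than the actual script: the `contents.db` filename is made up, and I'm glossing over the Contents file's preamble):

```python
import gzip
import sqlite3

conn = sqlite3.connect("contents.db")
conn.execute("CREATE TABLE contents (package TEXT, file TEXT)")

with gzip.open("Contents-amd64.gz", "rt") as f:
    for line in f:
        # Each Contents line is "<file path>   <section/package>";
        # split on the last run of whitespace.
        parts = line.rsplit(None, 1)
        if len(parts) != 2:
            continue  # skip blank or preamble lines
        path, package = parts
        conn.execute("INSERT INTO contents VALUES (?, ?)", (package, path))

conn.commit()
conn.close()
```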
I then remembered some introductory database theory and normalised the data: each package name becomes an integer, and each row stores only a reference to that integer.
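Schema-wise, that's just a lookup table plus an integer foreign key (names here are placeholders, not necessarily what the real script uses):

```python
import sqlite3

conn = sqlite3.connect("contents.db")
conn.executescript("""
    -- one row per unique package name
    CREATE TABLE packages (
        id   INTEGER PRIMARY KEY,
        name TEXT UNIQUE
    );
    -- each file row now stores a small integer, not a repeated string
    CREATE TABLE files (
        package_id INTEGER REFERENCES packages(id),
        path       TEXT
    );
""")
conn.close()
```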
The downside to this is that I needed to run two imports: once to get the list of packages and files, and then again to link the packages to the list of files. In SQLite, this means creating copies of tables, dropping the duplicate ones and then running `VACUUM` to reclaim the space. The result is that I (barely) beat the original file for storage efficiency.
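Assuming the schema above and the raw `contents` table from the first pass, the second pass boils down to a few statements (again a sketch, not the exact SQL in the repo):

```python
import sqlite3

conn = sqlite3.connect("contents.db")
conn.executescript("""
    -- pass two: build the package lookup table...
    INSERT INTO packages (name)
        SELECT DISTINCT package FROM contents;

    -- ...rewrite every row with its integer package id...
    INSERT INTO files (package_id, path)
        SELECT packages.id, contents.file
        FROM contents
        JOIN packages ON packages.name = contents.package;

    -- ...and drop the bulky raw table.
    DROP TABLE contents;
""")

# Dropping a table only marks its pages as free; VACUUM rewrites the
# database file so the space is actually reclaimed on disk.
conn.execute("VACUUM")
conn.close()
```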
Next, what about search? Could I really beat `apt-file`?
I think caching and load affected some of the results: a couple of runs a few hours apart produced wildly different numbers. All of my tests were run on a 512 MB DigitalOcean instance.
```
$ time zgrep bash /var/cache/apt/apt-file/mirrors.digitalocean.com_debian_dists_jessie_main_Contents-amd64.gz | head
bin/bash                                    shells/bash
bin/bash-static                             shells/bash-static
bin/rbash                                   shells/bash
etc/apparmor.d/abstractions/bash            admin/apparmor
etc/bash.bashrc                             shells/bash
etc/bash_completion                         shells/bash-completion
etc/bash_completion.d/R                     gnu-r/r-base-core
etc/bash_completion.d/_publican             perl/publican
etc/bash_completion.d/aapt                  devel/aapt
etc/bash_completion.d/adb                   devel/android-tools-adb

real    0m0.013s
user    0m0.000s
sys     0m0.000s
```
```
$ time apt-file search bash | head
0install-core: /usr/share/bash-completion/completions/0install
0install-core: /usr/share/bash-completion/completions/0launch
aapt: /etc/bash_completion.d/aapt
acheck: /usr/share/doc/acheck/bash_completion
acl2-books: /usr/lib/acl2-6.5/books/misc/bash-bsd.o
acl2-books: /usr/lib/acl2-6.5/books/misc/bash.o
acl2-books: /usr/share/acl2-6.5/books/misc/bash-bsd.o
acl2-books: /usr/share/acl2-6.5/books/misc/bash.o
acl2-books-certs: /usr/share/acl2-6.5/books/misc/bash-bsd.cert
acl2-books-certs: /usr/share/acl2-6.5/books/misc/bash.cert

real    0m3.631s
user    0m3.268s
sys     0m0.280s
```
```
$ time python search.py bash | head
shells/bash: bin/rbash
admin/apparmor: etc/apparmor.d/abstractions/bash
shells/bash: etc/bash.bashrc
shells/bash-completion: etc/bash_completion
gnu-r/r-base-core: etc/bash_completion.d/R
perl/publican: etc/bash_completion.d/_publican
devel/aapt: etc/bash_completion.d/aapt
devel/android-tools-adb: etc/bash_completion.d/adb
science/libadios-bin: etc/bash_completion.d/adios
utils/silversearcher-ag: etc/bash_completion.d/ag

real    0m2.375s
user    0m1.732s
sys     0m0.540s
```
Not too bad: I'm competitive with `apt-file`. Of course I'm missing some features, but it's a good result for a little experiment.
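A minimal version of that search looks something like this (a sketch against the schema above, not necessarily the exact query in search.py):

```python
import sqlite3
import sys

term = sys.argv[1]
conn = sqlite3.connect("contents.db")

# Join files back to their package names and do a substring match.
rows = conn.execute("""
    SELECT packages.name, files.path
    FROM files
    JOIN packages ON packages.id = files.package_id
    WHERE files.path LIKE ?
""", ("%" + term + "%",))

for name, path in rows:
    print("%s: %s" % (name, path))
```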
The code is on my GitHub, if you’re interested.
I’ll keep working on this and add support for yum/dnf’s formats and probably migrate to a real database backend.
- The code initially worked, but then would crash when piped to `head`. The answer? You have to handle SIGPIPE on UNIX-like systems; see this StackOverflow post for more info, and the sketch after this list.
- Gzipping the database on an i5-based laptop took nearly 20 minutes. A saner level of compression is a lot quicker, but the compressed size is about 64 megabytes.
- SQLite can read gzipped files with a proprietary extension. It's a pity the `sqlite3` Python module doesn't accept file handles to database files, otherwise I could just wrap one in a gzip stream and I'd be good to go.
- Someone with better SQL skills could probably make the database import a lot faster. On my DigitalOcean instance, the initial import takes a few minutes.
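For reference, the SIGPIPE fix mentioned above is usually this one-liner near the top of the script (the standard recipe from that StackOverflow answer, not necessarily verbatim what my code does):

```python
import signal

# Restore the default SIGPIPE disposition so the process exits quietly
# when the reading end of a pipe (e.g. `head`) goes away, instead of
# Python turning the broken pipe into an IOError on the next write.
signal.signal(signal.SIGPIPE, signal.SIG_DFL)
```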