A Fairly Interesting Bug
Here's the story of a recent troubleshooting session that had me questioning everything I knew about computers. And yet it turned out to be a one-line fix.
Computers, right?
As a software developer with.. some years of experience.. I have seen a lot of things. I can't say this is the weirdest or most difficult bug, but it sure was a head-scratcher. If you're inclined to get nerd-sniped into wanting to see if you can find the problem, you may (or may not) want to look at this repository before reading on.
https://github.com/masto/vips-issue
There you will find a working example of the problem I was having, trimmed down to under 100 lines. I think if you get lucky and start off on a certain path, you'll probably find it right away, but set out at a slightly different angle and you might get lost in the woods forever. I'm not sure exactly how much time I spent on this, because I kept stopping to show my friends and argue with them about their theories, and then I had a big meeting in the middle. Eventually I started showing the person I was meeting with what I'd been working on and we got stuck on this together and ultimately puzzled it out.
The requisite background
I've been working on a web application that does some stuff with photos. The
user uploads a photo from their device and I need to show it to them. I'm using
Next.js, which has a neat
image optimization feature,
so I thought I'd just use their <Image>
component, but I quickly discovered
that it doesn't do all the things I need. In particular it
doesn't support HEIC.
HEIC will be a recurring theme here: it's the format Apple is moving to for
iPhone photos, and that's important to my app. Side note: you might see "HEIF"
and "HEVC", which are all related -
HEIC is a HEIF file with HEVC encoding.
This alphabet soup is not relevant to the discussion here, but different names
pop up in debug logs, so the easiest thing to do is just consider them the same
thing.
Back to the problem at hand, I decided that a workaround for HEIC support would
be to convert everything to JPEG (at least for thumbnails). I looked at a few
options, and I settled on vips
. It says on the web site
that "Compared to similar libraries, libvips runs quickly and uses little
memory", and that's good enough for me. I've used ImageMagick in the past, and
it's a bit of a jumble. This seemed like maybe a clean, modern replacement. Once
I worked out that I could run vips thumbnail input.heic output.jpg <size>
from
the command line, I was good to go, and I added this step to the processing
pipeline that my app already runs after receiving a new photo.
The bug
The day after pushing, and testing, this fix for the HEIC issue, I was playing with the site and noticed it was broken again. I just fixed that, what the heck? I quickly narrowed the symptoms down this very strange behavior:
- When the server is first started, it always successfully processes the first photo uploaded.
- JPEG files are always successfully processed.
- Any HEIC file after the first upload always fails.
- The error message from vips is:
VipsForeignLoad: "<filename>" is not a known file format
Here's what it looks like in the demo project:
I uploaded the same file twice, so there should be two identical thumbnails, but the second is broken.
As I went deeper into debugging, it got weirder. I am running this app in a
Docker container, so I used the docker exec
command to start a shell in the
same container as the server, and I ran the exact same vips command on
exactly the same file that it said in the logs was not a known file format,
and it worked fine. I tested it repeatedly, it was not intermittent. But the
failure was 100% guaranteed to happen on the server every time. If I restarted
it, I could process one HEIC file and then they would all fail again.
The hair-pulling
I must necessarily leave a lot out in order to condense hours of troubleshooting
into this post. I tested a lot of ideas that went nowhere, like maybe the file
somehow wasn't fully written when vips was run on it, but I tried running
md5sum
before vips, and confirmed the bits were identical in the successful
and unsuccessful executions.
The first real clue came from running vips under strace. I figured there has to be something different between the good and bad runs, and I hoped it would show up in their system calls. I matched up the output side by side. Lo and behold, there was a very notable difference:
On the left, the first, "good" vips execution. On the right, the second, "bad" one. Where they start to diverge, the good one says:
openat(AT_FDCWD, "/usr/lib/vips-modules-8.15", O_RDONLY|O_LARGEFILE|O_CLOEXEC|O_DIRECTORY) = 3
fcntl(3, F_SETFD, FD_CLOEXEC) = 0
getdents64(3, 0xffff9000b158 /* 3 entries */, 2048) = 80
newfstatat(AT_FDCWD, "/usr/lib/vips-modules-8.15/vips-heif.so", {st_mode=S_IFREG|0755, st_size=67440, ...}, 0) = 0
openat(AT_FDCWD, "/usr/lib/vips-modules-8.15/vips-heif.so", O_RDONLY|O_LARGEFILE|O_CLOEXEC) = 4
To translate into semi-English, it looks in a directory
/usr/lib/vips-modules-8.15
, finds a file called vips-heif.so
, and opens it.
It goes on to do shared-library-ish things, like reading and mmapping it.
Compare with the bad one:
openat(AT_FDCWD, "/target/lib/vips-modules-8.15", O_RDONLY|O_LARGEFILE|O_CLOEXEC|O_DIRECTORY) = -1 ENOENT (No such file or directory)
Here we see an attempt to access /target/lib/vips-modules-8.15
, which doesn't
exist. Indeed, there's no /target
directory in the container. And as we can
now deduce, vips-heif.so
must be a module that allows vips to read HEIC files.
Trying to load that module from the wrong place explains why it doesn't think
the file is valid. But this just changes the mystery. How is it possible that
executing the same exact command twice in a row, it would behave in a different
and bizarre way? What is /target
, where did it come from, and what's causing
it to suddenly pop up on the second run of vips?
I looked at the obvious things, yes. My code never mentions /target
. You can
look at it too,
search the repository
if you like. That's not it. It started to feel like somehow the first execution
was doing something to influence the second one, but what can a child process
possibly communicate from beyond the grave? At this point, I and everyone I know
start to mumble a lot of things that end with "..no, that's impossible, unix
doesn't work that way". A bunch of dead ends later, and I decided to look at the
source code of vips. How exactly does it decide where to look for modules?
Spelunking
We start out in init.c, where we find:
vips_load_plugins("%s/vips-modules-%d.%d",
libdir, VIPS_MAJOR_VERSION, VIPS_MINOR_VERSION);
Good start. Whence comes libdir
? Not too far away,
we find
if ((prefix = g_getenv("VIPSHOME")))
g_info("VIPSHOME = %s", prefix);
if (!(prefix = vips_guess_prefix(argv0, "VIPSHOME")) ||
!(libdir = vips_guess_libdir(argv0, "VIPSHOME")))
return -1;
Now that's interesting. So I added a call to env
before running vips, in order
to dump out all of the environment variables. That can't be it, but we might as
well look. Around this point I also discovered that I can set
G_MESSAGES_DEBUG=VIPS
to get vips to output its own debug messages, so I
turned that on as well. And here's what we see on the bad run (truncated to the
end):
__NEXT_PRIVATE_ORIGIN=http://localhost:3000
NEXT_DEPLOYMENT_ID=
__NEXT_PRIVATE_RUNTIME_TYPE=
G_MESSAGES_DEBUG=VIPS
VIPSHOME=/target
VIPS-INFO: 03:45:39.584: VIPSHOME = /target
VIPS-INFO: 03:45:39.584: VIPS_PREFIX = /usr
VIPS-INFO: 03:45:39.584: VIPS_LIBDIR = /usr/lib
VIPS-INFO: 03:45:39.584: prefix = /target
VIPS-INFO: 03:45:39.584: libdir = /target/lib
VIPS-INFO: 03:45:39.584: searching "/target/lib/vips-modules-8.15"
VIPS-INFO: 03:45:39.584: searching "/target/lib/vips-plugins-8.15"
VIPS-INFO: 03:45:39.584: im_load_plugins: searching "/target/lib/vips-8.15"
VIPS-INFO: 03:45:39.584: im_load_plugins: searching "/target/lib"
VipsForeignLoad: "/tmp/83c790a7eeac9310" is not a known file format
What the??! Seriously, how is that possible? I never set VIPSHOME. I never even knew about VIPSHOME. Why does it pop up out of nowhere, and why only the second time it's run? This is definitely what's causing the failure, but it still makes no sense!
I go
searching through the code
for every mention of VIPSHOME, but the only place it's mentioned outside of
comments is in where we saw above, and it's in a getenv
call, so it can't even
be vips that's responsible for setting it. Even if it somehow was, environment
variables don't carry over from a child process after it exits. This really
threw me down a rabbit hole of questioning assumptions, and I even spent a bunch
of time in the Node.js source code trying to make sure spawn
wasn't doing
something exceedingly weird. It's Node, so of course it is, but not in the way I
was worried about.
Would you like me to finally get to the answer?
I discovered two things at approximately the same time. First, I need to correct
something I said in the previous paragraph which wasn't entirely accurate. I
said there are no other mentions of VIPSHOME, which is correct, but that
doesn't mean it never sets it. See that call to vips_guess_prefix
? Let's take
a look at this interesting little function:
const char *
vips_guess_prefix(const char *argv0, const char *env_name)
{
const char *prefix;
/* Already set?
*/
if ((prefix = g_getenv(env_name)))
return prefix;
#ifdef G_OS_WIN32
prefix = vips__windows_prefix();
#else /*!G_OS_WIN32*/
{
char *basename;
basename = g_path_get_basename(argv0);
prefix = guess_prefix(argv0, basename);
g_free(basename);
}
#endif /*G_OS_WIN32*/
g_setenv(env_name, prefix, TRUE);
return prefix;
}
There, in the second to last line. g_setenv(env_name, prefix, TRUE)
. You
monster. Yes, this function doesn't just return the "guessed" prefix, it "as a
side effect, sets the environment variable" (per a comment above this function).
Digging a little deeper reveals that it sets it to a compiled-in value. This
seems like a massive clue, but the missing link is still how it's possible for
vips modifying its own environment to affect the future execution of a different
process.
Perhaps you've realized that I'm engaging in some slight misdirection, because I wanted you to be stuck on the same faulty thought process I was stuck on. It's not possible to modify the environment of a future sibling process. Only parent environment variables are inherited. I hadn't forgotten how computers work. The problem is that sometimes it turns out the impossible thing is actually what's happening, so when that has bitten you a few times, you don't want to be too quick to discard those ideas. In this case, though, nothing impossible is happening.
Ok already, the answer
I was missing something in the logs by focusing on the bad runs. I glanced at
the env
output for a good run, noted that VIPSHOME
was not set, and turned
my attention elsewhere. That turned out to be a mistake. Let's take a closer
look at what shows up in the logs of a "good" run when I turn on
G_MESSAGES_DEBUG=VIPS
. This snippet contains the final clue I needed to jump
up and do the "oh. no. oh geez. oh my god." thing when I actually noticed it.
VIPS-INFO: 03:45:18.438: found /usr/lib/vips-modules-8.15
VIPS-INFO: 03:45:18.438: using configure-time prefix
VIPS-INFO: 03:45:18.438: VIPS_PREFIX = /usr
VIPS-INFO: 03:45:18.438: VIPS_LIBDIR = /usr/lib
VIPS-INFO: 03:45:18.438: prefix = /usr
VIPS-INFO: 03:45:18.438: libdir = /usr/lib
VIPS-INFO: 03:45:18.438: searching "/usr/lib/vips-modules-8.15"
VIPS-INFO: 03:45:18.438: loading "/usr/lib/vips-modules-8.15/vips-heif.so"
VIPS-INFO: 03:45:18.442: searching "/usr/lib/vips-plugins-8.15"
VIPS-INFO: 03:45:18.442: im_load_plugins: searching "/usr/lib/vips-8.15"
VIPS-INFO: 03:45:18.442: im_load_plugins: searching "/usr/lib"
VIPS-INFO: 03:45:18.453: heifload: ignoring nclx profile
VIPS-INFO: 03:45:18.455: heifload: ignoring nclx profile
VIPS-INFO: 03:45:18.750: threadpool completed with 4 workers
VIPS-INFO: 03:45:18.821: set stack size to 2048k
VIPS-INFO: 03:45:18.821: g_getenv("PATH") == "/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
VIPS-INFO: 03:45:18.821: looking in "/usr/local/sbin" for "sharp"
VIPS-INFO: 03:45:18.821: looking in "/usr/local/bin" for "sharp"
VIPS-INFO: 03:45:18.821: looking in "/usr/sbin" for "sharp"
VIPS-INFO: 03:45:18.821: looking in "/usr/bin" for "sharp"
VIPS-INFO: 03:45:18.821: looking in "/sbin" for "sharp"
VIPS-INFO: 03:45:18.821: looking in "/bin" for "sharp"
VIPS-INFO: 03:45:18.821: trying for dir = "/app/sharp", name = "sharp"
VIPS-INFO: 03:45:18.821: canonicalised path = "/app"
VIPS-INFO: 03:45:18.821: VIPS_PREFIX = /target
VIPS-INFO: 03:45:18.821: VIPS_LIBDIR = /target/lib
VIPS-INFO: 03:45:18.822: prefix = /target
VIPS-INFO: 03:45:18.822: libdir = /target/lib
VIPS-INFO: 03:45:18.831: residual reducev by 0.448179
VIPS-INFO: 03:45:18.831: reducev: 15 point mask
VIPS-INFO: 03:45:18.831: reducev: using vector path
VIPS-INFO: 03:45:18.831: reducev sequential line cache
VIPS-INFO: 03:45:18.831: residual reduceh by 0.448179
VIPS-INFO: 03:45:18.831: reduceh: 15 point mask
VIPS-INFO: 03:45:18.858: threadpool completed with 4 workers
You have to know one more thing for this to come together. Remember when I said
back at the beginning that I'm using that Next.js <Image>
component? Remember
how it does "optimization"? Well, it uses a package called
sharp to do the heavy lifting. And just
guess what sharp uses under its hood? That's right: libvips. This the
point in the show where Lieutenant Columbo turns around to deploy the "just one
more thing" speech.
Here's how it happened: the user uploaded a file. My code shelled out to run
vips thumbnail
on it, which worked as expected. Then the upload handler
triggered a re-render of the page, to show a thumbnail of the uploaded file
using the Next.js <Image>
component. This rendering all happens in the
Node.js-based Next.js web server. Because this component, through its use of
sharp
and sharp's use of libvips
, runs in the web server process, it
modifies the web server's environment when it hits that vips_guess_prefix
code. Apparently it was built with VIPS_PREFIX=/target
. So it was actually
displaying the thumbnail which caused VIPSHOME
to be set to /target
in the
Node process. Then the next time it spawned a vips command, it inherited that
variable and caused vips to be unable to find the library it needed to read HEIC
files, causing the murdfailure.
I don't even know who to blame here. This is one of those things that only
results from a bizarre interaction between many different things. Next.js's use
(indirectly) of vips, plus the way it's compiled, plus a strange thing vips does
with its environment, plus the way I'm using it, plus the way it's set up in
this Docker image... oh yeah, I didn't even mention that. I couldn't reproduce
it in my dev environment, because it turns out that I have a version of vips
that's statically linked with libheif, so it doesn't rely on an external
vips-heif.so
. But the production Docker image uses Alpine and their packages
separate out
vips-heif. So many variables that lined up to break this.
The fix
To unbreak it, I added delete process.env.VIPSHOME
before spawning vips. The
end.
I hope you enjoyed this adventure. I don't have any way to get feedback on my blog, but you can find me in various places through https://masto.me/ if you'd like to follow or yell at me.