The PDF search engine

4/11/2023

The goal was a search engine that would:

Be general purpose and work well with any kind of PDF data (emphasis on “work well”: just because it returns results doesn’t mean it’s good, it needs to return quality results).
Search cross-modally, so you could use image or text as both input and output.

I used a few PDFs I downloaded from Wikipedia as an example dataset and went from there.

But I forgot the cardinal rule of data science: kicking your data into shape for your use case is like 90% of the work. And since every potential use case has wildly different data, it doesn’t make much sense to spend hours kicking Wikipedia PDFs into shape. Especially since I got bad results when it came to the search.

For example, when searching “rabbit ears”:

I would get short text snippets that were bare URLs (which I assume were picked up from endnotes in the article).
Or I would get strings just a few words long (which I assume were titles).
Most of the results were just about rabbits, with no mention of ears, despite sentences about rabbits’ ears being indexed.
The most relevant matches got worse scores than some less relevant ones (in cosine scores, a lower score means higher relevance).
No matter what I typed, I never got any images returned (despite typing descriptions of those images, which were definitely indexed).

On the plus side, it did bring up stuff about rabbits instead of chocolate, but that’s all that can be said for it.

Perhaps the encoder or model itself isn’t great (I’m not convinced of this - CLIP isn’t ideal for text, but it’s serviceable at least, as can be seen in our fashion search).
Perhaps the dataset itself is just too small (I don’t believe this. Sure, it’s just a few PDFs, but it’s broken down into several thousand chunks, many of which are full sentences or images).
Perhaps the index is so full of junk (endnotes, text fragments, references to book pages) rather than “real content” (sentences about rabbit ears, etc.) that the encoder just throws up its hands in despair because it can’t make any sense of most of it (I have a hunch this may be the main issue).
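If the hunch about a junk-filled index is right, one cheap thing to try is dropping obviously useless text chunks before they ever reach the encoder. What follows is only a rough sketch, not what the search engine actually does: the helper names (is_junk_chunk, filter_chunks), the URL pattern and the four-word threshold are all assumptions made up for illustration, and it only looks at text chunks, not images.

```python
import re

# Hypothetical pattern and threshold, chosen for illustration only.
# Matches a chunk that consists of nothing but a single URL.
URL_ONLY = re.compile(r"^(https?://|www\.)\S+$", re.IGNORECASE)


def is_junk_chunk(text: str, min_words: int = 4) -> bool:
    """Return True for text chunks that are unlikely to be real content."""
    stripped = text.strip()
    if not stripped:
        return True  # empty or whitespace-only chunk
    if URL_ONLY.match(stripped):
        return True  # bare URL, e.g. picked up from endnotes
    if len(stripped.split()) < min_words:
        return True  # very short string, probably a title or fragment
    return False


def filter_chunks(chunks: list[str], min_words: int = 4) -> list[str]:
    """Keep only the chunks that look worth indexing."""
    return [c for c in chunks if not is_junk_chunk(c, min_words)]


# Made-up sample chunks, not taken from the real dataset.
chunks = [
    "https://en.wikipedia.org/wiki/Rabbit",
    "External links",
    "Rabbit ears help regulate body temperature and detect predators.",
]
print(filter_chunks(chunks))
```

A word-count cutoff like this is blunt: it will also throw away short but meaningful captions, so the threshold would need tuning against whatever PDFs are actually being indexed.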
