Tag Archives: information-retrieval


How to get Google search results for academic research

A few years ago, before I was a Googler, I was a grad student doing research on information retrieval. I wanted to compare the results of Google and other search engines with folksonomies from social bookmarking sites. It sounds pretty simple – Google does lots of internal search quality studies, so it’s not too surprising that outside researchers would want to execute lots of queries and use the results in their data.

The way I did it was… not optimal, to say the least. I wrote a bunch of PHP code, spaced out participant sessions, etc. to make sure I could get results back. Google tries to make sure that spammers aren’t scraping search results to generate webspam, so any kind of scraping with cURL, Beautiful Soup, etc. can result in a big pile of failure.

The way I did it wasn’t the right way or the easy way, so when I got the job I made a mental note to ask around for the best way to get search results. Then I forgot all about it until an email exchange with Gary Warner of CyberCrime & Doing Time fame.

It turns out Google has a great University research program and API. You have to apply for registration and let us know who you are, what school you’re affiliated with, and what you plan to study. Assuming everything checks out, you’ll get access to a pretty nice API. There’s some example Python code, but you could just as easily use PHP, Java, or whatever to consume the XML responses.
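To give a feel for what consuming XML responses looks like, here is a minimal Python sketch. Fair warning: the response layout (the `results`, `result`, `title`, and `url` elements and the `rank` attribute) is something I made up for illustration, not the actual format of the research API, and `parse_results` is a hypothetical helper. It parses a canned string rather than calling any real endpoint.

```python
import xml.etree.ElementTree as ET

# Hypothetical response layout, invented for illustration only.
# The real API's element and attribute names may be entirely different.
SAMPLE_RESPONSE = """
<results query="folksonomy">
  <result rank="1">
    <title>Folksonomy basics</title>
    <url>http://example.com/folksonomy</url>
  </result>
  <result rank="2">
    <title>Social bookmarking overview</title>
    <url>http://example.org/bookmarking</url>
  </result>
</results>
"""


def parse_results(xml_text):
    """Turn an XML response into a list of dicts, one per search result."""
    root = ET.fromstring(xml_text.strip())
    parsed = []
    for node in root.findall("result"):
        parsed.append({
            "rank": int(node.get("rank")),
            "title": node.findtext("title"),
            "url": node.findtext("url"),
        })
    return parsed


if __name__ == "__main__":
    for item in parse_results(SAMPLE_RESPONSE):
        print(item["rank"], item["title"], item["url"])
```

Once the results are plain dicts, it is easy to dump them to CSV or compare them against tags pulled from a bookmarking site, which is roughly what my old PHP code was trying to do the hard way.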

And that research I was doing? I recently noticed that my paper has been cited 7 or 8 times, according to Google Scholar. I used to joke that I had written the least influential paper in the history of academic publishing, but I guess I can’t claim the title anymore. Scopus only shows 4 citations so I will remain humble anyway.

Doing my small part to preserve digital history

[Image: High cirrus clouds and low fog over the Pacific Ocean]

Years ago, in an undergrad course, one of the school’s librarians gave a talk about the big risk of the move to digital publishing – historical preservation.  We know what the ancient Greeks thought in part because their words were carved into stone – would we be so lucky if they had used floppy disks?

I wasn’t completely convinced that the situation was so dire then, and I’m still not really worried.  The production and storage of information continues to grow exponentially, and I think the real problem for future archeologists will be dealing with information overload rather than some hypothetical gap in the written record.  But I have been thinking a lot about my own digital history lately so I spent part of this weekend looking at old papers from college and publishing them on my site.

I don’t think my meager efforts will be much help to future historians (much less reverse the entropy of the universe), but I did find some interesting stuff that I probably should have posted for the world to see a long time ago.

For example:

The more I dig up and paste into my WordPress archives the more I realize a few things.  First, a distinct lack of content between undergrad and grad school – I’m doing a much better job of writing without assignments now than I did then.  Second, a hard drive crash in 2003 resulted in a gap in my saved emails – this hurts more now that I’m looking back through things.  Finally, I need to make a point, for the rest of my life, to just put things out there. It seems like such a shame that I put work into these docs just to have them rot on my hard drive.

I know some of my co-workers, Reid and Wysz, have gone through the process of resurrecting old content on their current websites.  Anyone else thinking about doing something similar?  What prompted you to do so?  Or, what prevented you?

Obsolescence and obscurity in digital cameras

[Image: University Hall Tower at OWU]

I’m planning on buying a new DSLR, and as I looked through old photos from college today I started to think about my first digital camera, a Philips ESP50.  Here’s a page with some specs, translated from German.

I remember buying the camera, logged in to eBay from my parents’ house late at night the day after Christmas.  I think I ended up paying something like $250 for it.

This was before the megapixel war, when 640 by 480 was considered a viable resolution.  This camera applied torturous levels of JPG compression to fit images on the 4MB disk.  At the time, though, it seemed like a good deal.  Film cost money, and developing film cost money, and most of the year I was a ramen-noodle-eating college student.  Probably the biggest reason to go digital was the tiny little screen on the back – you could actually tell if you got the shot, instead of waiting to get back a bunch of blurry prints.

The camera is painfully obsolete now, and even then it was somewhat obscure.  The thing is, the Web was a pretty amazing place even back in 1998 – there were lots of web pages about this camera.  I remember reading at least a couple reviews, and searches for it on WebCrawler or Alta Vista or whatever I used back then came up with retailers, other auction sites, etc.  Look for information about this camera now, and it seems that it has been largely forgotten:

And that’s about it.

I wonder, is this the destiny of all cameras?  Will I do a search for my Nikon Coolpix 5700 in 2014 and come up with just as little, or has the Web expanded so quickly that the copious product reviews, blog posts, and technical discussions on photography forums outweigh the force of entropy?  I wonder if the Internet has gained any stability as it has matured – do pages tend to stick around longer, or is linkrot a constant of the universe?

Future generations will hardly feel deprived if they miss out on information about some crappy old digicam.  Still, you never know what kind of information will end up being useful to someone at some point, and this same problem extends to all the information on the Web – from reviews of obsolete products to the human genome.  If a website goes under and deletes a thousand blogs, it won’t exactly make the news.  But our great-grandchildren might look at that stuff the way we look at letters from the Civil War.

The only solutions I have are more effort behind projects like archive.org, increased data portability, and rational intellectual property laws that don’t make saving 70-year-old content from deletion into a federal crime.

For discussion, how do you deal with ancient equipment, keeping around old web content, or even archiving old email?