Philbin Search Tech Report
Downloading 1 million Flickr images
The Flickr hacking page has some info on how to download Flickr images (I expanded it quite a bit, Welinder 19:11, 20 Dec 2008 (PST)).
Issues
- Should we restrict ourselves to Creative Commons (CC) licensed images? Flickr provides some information about how many of its photos are available under each license, with varying degrees of freedom.
- The least restrictive ("softest") license is the "Attribution License". There are 11 million such images available.
- All of the other CC options also provide quite a lot of freedom, so restricting ourselves to the "Attribution License" only may be unnecessary.
- The advantage of downloading CC images is that we can do whatever we want with them, even redistribute them on our website without any worries at all.
- A disadvantage may be that we get different image statistics than we would get if we just ignored the license. However, I have no clue whether this is actually the case.
- I would say we start by downloading only CC images. At a later stage we could download 100K non-CC images to see if we can find differences (Welinder 19:21, 20 Dec 2008 (PST)).
- Philbin's 100K dataset was created by downloading images from the 145 most popular tags (see list below). For the 1M dataset, however, he used the 450 most popular tags. I haven't figured out how to get the 450-tag list.
- Should we restrict ourselves to the 145-tag list for building our dataset?
- Should we also try adding pictures using search terms instead of tags, or just come up with tags ourselves?
- We could also download images from groups (e.g. bird/animal/travel groups).
- I will start with the 145-tag list, and maybe move on to some other tags of my own choosing as well as some interesting groups (Welinder 19:21, 20 Dec 2008 (PST)).
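As a rough illustration of the tag and license choices above, the following sketch builds a flickr.photos.search request for one tag, restricted to a set of CC licenses. It uses only the Ruby standard library, not the rflickr gem. The method and parameter names are the ones I understand Flickr's public REST API to document; the license id 4 for the "Attribution License", the helper name search_url, and the placeholder API key are assumptions, and the license id in particular should be checked against flickr.photos.licenses.getInfo before use.

```ruby
require "uri"

# Flickr REST endpoint (verify against the current API documentation).
FLICKR_REST = "https://api.flickr.com/services/rest/"

# Assumption: id 4 is the "Attribution License" in the table returned by
# flickr.photos.licenses.getInfo. Check this before relying on it.
CC_ATTRIBUTION = 4

# Hypothetical helper: build a flickr.photos.search URL for one tag,
# one result page, and a list of license ids.
def search_url(api_key:, tag:, licenses: [CC_ATTRIBUTION], page: 1, per_page: 500)
  params = {
    "method"         => "flickr.photos.search",
    "api_key"        => api_key,
    "tags"           => tag,
    "license"        => licenses.join(","), # comma-separated license ids
    "per_page"       => per_page,
    "page"           => page,
    "format"         => "json",
    "nojsoncallback" => 1
  }
  "#{FLICKR_REST}?#{URI.encode_www_form(params)}"
end

# Print the request URL for the last tag in the 145-tag list.
puts search_url(api_key: "YOUR_API_KEY", tag: "zoo")
```

Passing several ids in licenses (for example the other CC options) widens the search without changing the rest of the request. Searching by free text or by group would, as far as I can tell, mean swapping the tags parameter for text or group_id, which this sketch does not cover.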
145 Most Popular Flickr Tags
List taken from Flickr's all-time most popular tags. There are 145 tags in total.
- africa, animals, architecture, art, australia, autumn, baby, band, barcelona, beach, berlin, bird, birthday, black, blackandwhite, blue, boston, bw, california, cameraphone, camping, canada, canon, car, cat, chicago, china, christmas, church, city, clouds, color, concert, cute, dance, day, de, dog, england, europe, fall, family, festival, film, florida, flower, flowers, food, football, france, friends, fun, garden, geotagged, germany, girl, girls, graffiti, green, halloween, hawaii, hiking, holiday, home, house, india, ireland, island, italia, italy, japan, july, kids, la, lake, landscape, light, live, london, macro, may, me, mexico, mountain, mountains, museum, music, nature, new, newyork, newyorkcity, night, nikon, nyc, ocean, old, paris, park, party, people, photo, photography, photos, portrait, red, river, rock, rome, san, sanfrancisco, scotland, sea, seattle, show, sky, snow, spain, spring, street, summer, sun, sunset, taiwan, texas, thailand, tokyo, toronto, tour, travel, tree, trees, trip, uk, urban, usa, vacation, vancouver, washington, water, wedding, white, winter, yellow, york, zoo
Progress
21-29 Dec. 2008
- Coded a Flickr API interface that retrieves up to 4500 images per tag-category (limit set by the API).
- Started downloading images. So far I've got 50K images, and I am now downloading about 125K images per day.
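The numbers on this page can be put together into a quick budget. The sketch below takes the 145 tags, the 4500-image limit per tag, the 125K images/day rate, the 1M target and the 640GB drive from this page. The page size of 500 results per request, the decimal reading of "640GB", and the helper name download_budget are assumptions. The upper bound also assumes no photo shows up under more than one tag, so the real yield will be lower.

```ruby
# Hypothetical helper: back-of-the-envelope budget for the download.
# Defaults are the figures quoted on this page, except per_page (assumed)
# and disk_bytes (640 GB read as decimal gigabytes, also an assumption).
def download_budget(tags: 145, images_per_tag: 4_500, rate_per_day: 125_000,
                    target: 1_000_000, disk_bytes: 640 * 10**9, per_page: 500)
  upper_bound = tags * images_per_tag # best case, no duplicates across tags
  {
    upper_bound:       upper_bound,
    shortfall:         [target - upper_bound, 0].max,
    pages_per_tag:     (images_per_tag.to_f / per_page).ceil,
    days_for_tag_list: (upper_bound.to_f / rate_per_day).ceil,
    days_for_target:   (target.to_f / rate_per_day).ceil,
    bytes_per_image:   disk_bytes / target
  }
end

download_budget.each { |name, value| puts "#{name}: #{value}" }
```

With these defaults the 145-tag list yields at most 145 × 4500 = 652,500 images, so it cannot reach 1M on its own, and more tags or groups are needed. At the current rate, 1M images take about 8 days, and the drive leaves roughly 640 KB per image.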
20 Dec. 2008
- I found the CMU code pretty restrictive for querying, since it only supports text queries, not tag or group search.
- Created some Ruby code for building image lists using the rflickr gem. This works much better than the CMU code.
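The Ruby code mentioned above is not reproduced on this page. As a stand-in, here is a minimal sketch of how an image list could be built from search responses using only the standard library, not rflickr. It assumes JSON responses shaped like those of flickr.photos.search (a "photos" object containing a "photo" array with id, server and secret fields) and static image URLs of the form https://live.staticflickr.com/{server}/{id}_{secret}.jpg. The helper names and the sample data are made up for illustration.

```ruby
require "json"
require "set"

# Assumption: static image URLs have this form; check Flickr's photo URL docs.
def photo_url(photo)
  "https://live.staticflickr.com/#{photo.fetch('server')}/" \
    "#{photo.fetch('id')}_#{photo.fetch('secret')}.jpg"
end

# Hypothetical helper: merge per-tag JSON responses into one image list,
# skipping photos that were already listed under an earlier tag.
def build_image_list(responses_by_tag)
  seen = Set.new
  list = []
  responses_by_tag.each do |tag, body|
    JSON.parse(body).fetch("photos").fetch("photo").each do |photo|
      next unless seen.add?(photo.fetch("id")) # add? returns nil for duplicates
      list << { id: photo["id"], tag: tag, url: photo_url(photo) }
    end
  end
  list
end

# Made-up sample responses; ids, servers and secrets are placeholders.
SAMPLE_RESPONSES = {
  "zoo"     => '{"photos":{"photo":[{"id":"1","server":"10","secret":"aa"},' \
               '{"id":"2","server":"10","secret":"bb"}]},"stat":"ok"}',
  "animals" => '{"photos":{"photo":[{"id":"2","server":"10","secret":"bb"},' \
               '{"id":"3","server":"11","secret":"cc"}]},"stat":"ok"}'
}

build_image_list(SAMPLE_RESPONSES).each do |entry|
  puts [entry[:id], entry[:tag], entry[:url]].join("\t")
end
```

Deduplicating by photo id matters here because many of the 145 tags overlap (for example flower and flowers, or nyc and newyork), so the same photo can come back under several tags.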
19 Dec. 2008
- Purchased a 640GB hard drive. This will last until they get the new storage system up and running.
- Cannot mount it due to permission issues. Have contacted Gary and John.