I quite often have to add new films to the database of my website, british-film-locations.com. As well as adding the film, I have to add data about its director and actors. This is quite time-consuming, so I created a system to simplify the task.
The PHP script uses the film’s ID number (taken from its IMDb URL) to scrape the IMDb site for the data it needs, such as the film title, year of release, the director’s name and ID, the actors’ names and IDs, etc. It then populates a form with all this data. All I have to do is add any extra info, type my password and click ‘Go’, and it adds the film, actors, and director to the database for me automatically. Brilliant!
Recently I decided to start an online database for all my ZX Spectrum games. I thought I could do something similar using data from the World Of Spectrum website’s games database.
Now, World Of Spectrum used to have an API that let you get data in XML or JSON format, but it hasn’t worked for a while. So I wrote something like the system above, which sends the ID from the URL of the required game to a PHP script that scrapes the data directly from the site. Unfortunately this didn’t work. The World Of Spectrum website obviously has a system in place to stop scripts scraping it.
That left scraping the page from inside my own browser, using a JavaScript bookmarklet. So that’s what I did!
The site doesn’t really give you much to go on in terms of finding the required data in the document (here’s an example), so I basically just cycle through every tag on the page looking for recognisable text and then get the contents of the tag which is a particular offset away. A bit crude, perhaps, but it works.
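As a rough illustration of that label-and-offset idea, here is a minimal sketch. The function names, the labels and the offsets are all made up for the example; the real ones depend on how the page is laid out and have to be worked out by inspecting it.

```javascript
// Hypothetical helper: given the text of every tag in document order,
// find the first one whose text equals `label` and return the text of
// the tag `offset` positions away (or null if there is no such tag).
function findByLabel(texts, label, offset) {
  const i = texts.indexOf(label);
  const j = i + offset;
  if (i === -1 || j < 0 || j >= texts.length) return null;
  return texts[j];
}

// In the browser, collect the trimmed text of every element on the page.
// Guarded so the helper above can also be tried outside a browser.
function pageTexts() {
  if (typeof document === 'undefined') return [];
  return Array.from(document.getElementsByTagName('*'),
                    (el) => el.textContent.trim());
}

// Made-up sample data standing in for the tags of a game's details page.
const sample = ['Full title', 'Manic Miner', 'Year of release', '1983'];
const title = findByLabel(sample, 'Full title', 1);
const year = findByLabel(sample, 'Year of release', 1);
```

On a real page you would pass `pageTexts()` instead of `sample`, and the offset would rarely be as tidy as 1.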
The bookmarklet then stuffs the data onto the end of the PHP script’s URL as a query string, and opens it in a new window.
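That step might look something like the sketch below. The script address and the field names are placeholders I've invented for the example, so treat it as an outline of the technique, not the actual code.

```javascript
// Hypothetical endpoint; substitute the address of your own PHP script.
const SCRIPT_URL = 'https://example.com/add-game.php';

// Turn an object of scraped fields into a URL with an encoded query string.
// encodeURIComponent keeps spaces, ampersands etc. from breaking the URL.
function buildUrl(base, fields) {
  const query = Object.keys(fields)
    .map((key) => encodeURIComponent(key) + '=' + encodeURIComponent(fields[key]))
    .join('&');
  return base + '?' + query;
}

// Made-up sample data in place of the scraped values.
const url = buildUrl(SCRIPT_URL, { title: 'Jet Set Willy', year: '1984' });

// In the browser, open the PHP script in a new window (or tab).
// Guarded so the URL building can also be tried outside a browser.
if (typeof window !== 'undefined') {
  window.open(url, '_blank');
}
```

The PHP script can then read the values from the query string (via `$_GET`) and use them to populate its form.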
The code could be optimised for size, but all modern browsers allow bookmarklets of 2000+ characters, so there’s no real point.
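For anyone who hasn't built one, a bookmarklet is just a bookmark whose address is a `javascript:` URL. Here's a minimal sketch of packaging a script that way and checking its length. The `toBookmarklet` name and the one-line sample script are placeholders, not the real scraper.

```javascript
// Wrap source code in an immediately-invoked function, URL-encode it,
// and prefix it with javascript: so it can be saved as a bookmark's address.
function toBookmarklet(source) {
  const wrapped = '(function(){' + source + '})();';
  return 'javascript:' + encodeURIComponent(wrapped);
}

// Placeholder script; the real scraper code would go here instead.
const bookmarklet = toBookmarklet('alert(document.title);');

// Report the size so it can be compared against browser limits.
console.log(bookmarklet.length + ' characters');
```

Paste the resulting string into the address field of a new bookmark, and clicking the bookmark runs the script against whatever page is open.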
An extra advantage of doing it this way is that you can scrape dynamically generated content as well. As long as the data you want is in the DOM when you click on the bookmarklet, it can be scraped.
Something tells me that there’s a really cool application for this. I just need to think of one!