Improved(?) Scraper Script


This topic contains 4 replies, has 4 voices, and was last updated by  alexbleks 2 years, 7 months ago.

  • #83282

    rev138
    Participant

    I find the existing scraper leaves something to be desired. My main gripes are that it’s slow as death, and if you want the info to be accurate you need to run it in manual mode and babysit it. So, I decided to write my own.

    The script uses the VG Archive API, which lets you search by the ROM’s MD5 digest, which is WAY more sane than searching by title, like TheGamesDB forces you to. The main problem I found with VG Archive, however, is that it’s missing a lot of images, or has broken image links. If this script fails to download a box cover from the URLs the API returns, it queries Google for the first image result for “<game title> <system> box art”, which in my random sampling always returned an appropriate image. Finally, the downloaded images are scaled to 350px wide and saved in JPEG format at 50% quality (though I’m thinking of making this customizable via an argument), which seems to yield smaller file sizes than the default scraper.
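
    A minimal sketch of how the image width and JPEG quality could be exposed as arguments, using Getopt::Long the same way the script already does. The --width and --quality option names and the parse_image_opts helper are hypothetical and not part of es_scraper.pl; the defaults simply mirror the 350px and 50% values described above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Getopt::Long 'GetOptionsFromArray';

# Hypothetical helper: --width and --quality are NOT options of es_scraper.pl
sub parse_image_opts {
        my @args = @_;

        # defaults match the values currently hard-coded in the script
        my $opts = { 'width' => 350, 'quality' => 50 };

        GetOptionsFromArray( \@args, $opts, 'width|w=i', 'quality|q=i' )
                or die "Invalid options\n";

        # reject values that Image::Magick could not sensibly use
        die "--width must be positive\n" if $opts->{'width'} < 1;
        die "--quality must be between 1 and 100\n"
                if $opts->{'quality'} < 1 or $opts->{'quality'} > 100;

        return $opts;
}

my $opts = parse_image_opts( @ARGV );
print "Scaling to $opts->{'width'}px wide at $opts->{'quality'}% JPEG quality\n";
```

    The two values could then be passed to AdaptiveResize and Write in place of the literal '350x' and 50.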

    Anyway, I figured as long as I had a functional script, I might as well share. It’s written in Perl, and you’ll need to install a few extra packages on top of what ships with the RetroPie image:
    sudo apt-get install libwww-perl libxml-simple-perl perlmagick libjson-perl

    Here’s an example invocation for NES, assuming you have the script downloaded in /home/pi:
    ./es_scraper.pl --in-file .emulationstation/gamelists/nes/gamelist.xml --out-file .emulationstation/gamelists/nes/gamelist.xml --downloads .emulationstation/downloaded_images/nes "RetroPie/roms/nes/*"

    I hope someone finds it useful. Please let me know if you run into any bugs so I can fix them.

    #83283

    rev138
    Participant

    Looks like that failed to attach, here it is:

    #!/usr/bin/perl
    
    ###
    ### Point this script at a ROM/directory full of ROMs and it will generate a 
    ### gamelist.xml file for EmulationStation and download the box cover art.
    ###
    ### 20141202 patternspider@gmail.com
    ###
    
    use strict;
    use warnings;
    use LWP::Simple 'getstore';
    use LWP::UserAgent;
    use XML::Simple;
    use JSON;
    use Digest::MD5;
    use Cwd qw( getcwd abs_path );
    use Image::Magick;
    use File::Path 'make_path';
    use Getopt::Long;
    use IO::File;
    
    my $opts = {
            'api-key'       => '7TTRM4MNTIKR2NNAGASURHJOZJ3QXQC5',  # RetroPie's API key
            'api-url'       => 'http://api.archive.vg/2.0',         # VG Archive API
            'downloads'     => getcwd . '/downloaded_images',       # Folder for downloaded box cover art
            'in-file'       => getcwd . '/gamelist.xml',
            'out-file'      => getcwd . '/gamelist.xml',
    };
    
    GetOptions(
            $opts,
            'api-key|k=s',
            'api-url|u=s',
            'downloads|d=s',
            'in-file|i=s',
            'out-file|o=s',
            'help|h'        => sub{ &help },
            'no-images|n',
            'stdout|s',
    );
    
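    # show usage and exit if no ROM path was given
    help() unless @ARGV;
    
    # expand the glob; if it matches a directory, scan the files inside it
    my @files = map { -d $_ ? glob( "$_/*" ) : $_ } glob( $ARGV[0] );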
    my $game_list = {};
    my $xs = XML::Simple->new;
    
    # read in the existing gamelist if there is one
    if( -r $opts->{'in-file'} ){
            my $in_file = IO::File->new( $opts->{'in-file'} ) or die $!;
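            # ForceArray keeps a lone <game> entry as a one-element list; KeyAttr => [] disables name folding
            my $existing = $xs->XMLin( $in_file, SuppressEmpty => 1, ForceArray => [ 'game' ], KeyAttr => [] );
            my $games = ( ref $existing eq 'HASH' and ref $existing->{'game'} eq 'ARRAY' ) ? $existing->{'game'} : [];
            foreach my $game ( @$games ){
                    $game_list->{$game->{'path'}} = $game if defined $game->{'path'} and -e $game->{'path'};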
            }
    }
    
    my $ua = LWP::UserAgent->new;
    
    # ;)
    $ua->agent('RetroPie Scraper Browser');
    
    # ensure file paths are absolute and skip anything that isn't a regular file
    @files = map { abs_path( $_ ) } grep { -f } @files;
    
    foreach my $filename ( @files ){
            # get the MD5 digest for the ROM
            my $md5 = get_md5( $filename );
            # look up the ROM by its digest
            my $response = $ua->get( $opts->{'api-url'} . '/Game.getInfoByMD5/xml/' . $opts->{'api-key'} . "/$md5" );
    
            if( $response->is_success ){
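                    # an empty or malformed response makes XMLin die, so skip this ROM instead
                    my $data = eval { XMLin( $response->decoded_content ) };
                    next unless ref $data eq 'HASH';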
    
                    # make sure the API returned data in the format we expect
                    if( defined $data->{'games'} and defined $data->{'games'}->{'game'} and ref $data->{'games'}->{'game'} eq 'HASH' ){
                            my $game_data = $data->{'games'}->{'game'};
                            my $rating = 0;
                            my $image_file;
    
                            print "Found $game_data->{'title'}\n" unless $opts->{'stdout'};
    
                            # accept multi-character ratings such as 4.5 (the pattern needs a + quantifier)
                            $rating = $game_data->{'rating'} if defined $game_data->{'rating'} and $game_data->{'rating'} =~ /^[0-9.]+$/;
    
                            # get the box cover if any
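                            if( not $opts->{'no-images'} and ( ( defined $game_data->{'box_front'} and ref $game_data->{'box_front'} ne 'HASH' ) or ( defined $game_data->{'box_front_small'} and ref $game_data->{'box_front_small'} ne 'HASH' ) ) ){
                                    # parse the filename out of the first usable URL
                                    my $basename;
                                    foreach my $key ( 'box_front', 'box_front_small' ){
                                            next unless defined $game_data->{$key} and ref $game_data->{$key} ne 'HASH';
                                            ( $basename ) = $game_data->{$key} =~ /\/([^\/]+)$/;
                                            last if defined $basename;
                                    }
    
                                    # set a temporary download location, falling back to the MD5 digest as the name
                                    my $temp_file = '/tmp/' . ( defined $basename ? $basename : "$md5.img" );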
    
                                    # download the box cover
                                    my $response_code = '';
                                    $response_code =  getstore( $game_data->{'box_front'}, $temp_file ) if defined $game_data->{'box_front'} and ref $game_data->{'box_front'} ne 'HASH';
    
                                    # if that didn't work, try to get the small version
                                    if( $response_code !~ /^(2|3)[0-9]{2}$/ ){
                                            $response_code = getstore( $game_data->{'box_front_small'}, $temp_file ) if defined $game_data->{'box_front_small'} and ref $game_data->{'box_front_small'} ne 'HASH';
                                    }
    
                                    # if that didn't work, try google
                                    if( $response_code !~ /^(2|3)[0-9]{2}$/ ){
                                            my $google_result = google_image_search( $ua, $game_data->{'title'} . ' ' . $game_data->{'system_title'} . ' box art' );
                                            $response_code = getstore( $google_result, $temp_file ) if defined $google_result;
                                    }
    
                                    # how about now?
                                    if( $response_code =~ /^(2|3)[0-9]{2}$/ ){
                                            # set the post-processed file location
                                            $image_file = $opts->{'downloads'} . "/$md5.jpg";
    
                                            my $im = Image::Magick->new;
                                            # Read returns an error message on failure, not the image
                                            my $error = $im->Read( $temp_file );
                                            warn "$error\n" if "$error";
    
                                            # scale to 350px width
                                            $im->AdaptiveResize( geometry => '350x' );
                                            # write out the scaled image in JPEG format at 50% quality
                                            make_path( $opts->{'downloads'} );
                                            $im->Write( filename => $image_file, compression => 'JPEG', quality => 50 ) ;
                                            # remove the temp file
                                            unlink $temp_file;
                                    }
                            }
    
                            # set/overwrite the attributes of the current rom
                            $game_list->{$filename}->{'name'} = $game_data->{'title'};
                            $game_list->{$filename}->{'path'} = $filename;
                            $game_list->{$filename}->{'image'} = $image_file if defined $image_file;
                            # EmulationStation reads the description from the <desc> tag
                            $game_list->{$filename}->{'desc'} = $game_data->{'description'};
                            $game_list->{$filename}->{'developer'} = $game_data->{'developer'};
                            $game_list->{$filename}->{'publisher'} = $game_data->{'publisher'};
                            $game_list->{$filename}->{'genre'} = $game_data->{'genre'};
                            $game_list->{$filename}->{'rating'} = $rating;
                    }
            }
            else {
                    die $response->code . ' ' . $response->message . "\n";
            }
    }
    
    # manually printing this because getting XML::Simple to reproduce the same formatting is baffling
    open( STDOUT, '>', $opts->{'out-file'} ) or die "Can't write to $opts->{'out-file'}: $!" unless $opts->{'stdout'};
    print "<gameList>\n";
    
    foreach my $file ( sort { $game_list->{$a}->{'name'} cmp $game_list->{$b}->{'name'} } keys %$game_list ){
            print "\t<game>\n";
    
            foreach my $key ( sort keys %{$game_list->{$file}} ){
                    print "\t\t" . $xs->XMLout( { $key => $game_list->{$file}->{$key} }, NoAttr => 1, KeepRoot => 1 );
            }
    
            print "\t</game>\n";
    }
    
    print "</gameList>\n";
    
    ###
    
    sub get_md5 {
            my ( $filename ) = @_;
            my $ctx = Digest::MD5->new;
    
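            # use a lexical filehandle in binary mode and fail loudly if the ROM can't be opened
            open( my $fh, '<', $filename ) or die "Can't read $filename: $!";
            binmode( $fh );
            $ctx->addfile( $fh );
            close( $fh );
    
            return $ctx->hexdigest;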
    }
    
    sub google_image_search {
            my ( $ua, $search_string ) = @_;
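            # let URI build the query string so spaces and punctuation in the title are escaped
            require URI;
            my $uri = URI->new( 'https://ajax.googleapis.com/ajax/services/search/images' );
            $uri->query_form( 'v' => '1.0', 'rsz' => 1, 'q' => $search_string );
            my $response = $ua->get( $uri );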
    
            if( $response->is_success ){
                    my $data = from_json( $response->decoded_content );
                    if( defined $data->{'responseData'} and @{$data->{'responseData'}->{'results'}} ){
                            return $data->{'responseData'}->{'results'}->[0]->{'url'};
                    }
            }
    
            # no usable result: return undef so the caller's defined() check works
            return;
    }
    
    sub help {
            print "usage: es_scraper.pl [OPTIONS] /path/to/roms\n";
            print "options:\n";
            print "\t--api-key\tVG Archive API key\n";
            print "\t--api-url\tVG Archive API URL\n";
            print "\t--downloads\tBox cover art download folder\n";
            print "\t--in-file\tgamelist XML file to read in\n";
            print "\t--no-images\tSkip downloading box covers\n";
            print "\t--out-file\tgamelist XML file to write out\n";
            print "\t--stdout\tWrite to stdout instead of --out-file\n";
            print "\n";
            print "All options have sane defaults\n";
            exit;
    }
    #84670

    reakhavok
    Participant

    Hi Rev138,

    Thanks for sharing the script. I tried running it for my MAME collection and I get the following error:

    File does not exist: at ./es_scraper.pl line 62.

    Is there a log file that’s generated? Any suggestions?

    *Update –
    Originally I executed the script via this command:
    ./es_scraper.pl --in-file .emulationstation/gamelists/mame/gamelist.xml --out-file .emulationstation/gamelists/mame/gamelist.xml --downloads .emulationstation/downloaded_images/nes "RetroPie/roms/mame/*"

    After looking at your code, it looks like I need to point to the ROMs directory, so I edited it to look like this:

    ./es_scraper.pl /home/pi/RetroPie/roms/mame/ --in-file .emulationstation/gamelists/mame/gamelist.xml --out-file .emulationstation/gamelists/mame/gamelist.xml --downloads .emulationstation/downloaded_images/nes "RetroPie/roms/mame/*"

    When doing this I get this error:
    Reading from filehandle failed at ./es_scraper.pl line 162

    Thanks

    #85560

    brakanje
    Participant
    
    pi@raspberrypi ~ $ ./es_scraper.pl --in-file .emulationstation/gamelists/nes/gamelist.xml --out-file .emulationstation/gamelists/nes/gamelist.xml --downloads .emulationstation/downloaded_images/nes "RetroPie/roms/nes/*"
    Not an ARRAY reference at ./es_scraper.pl line 49.
    

    I got this error when trying to run it, after I figured out how to remove the redundant line feeds from Windows. I am going to research Perl and see if I can fix it, but as a stay-at-home mum I may not get around to it, so I figured I’d show you what I got.
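
    For anyone else who saved the script on Windows, here is a minimal sketch of stripping the carriage returns in Perl itself. The strip_carriage_returns and fix_file helpers are hypothetical names, and the sketch assumes the file is small enough to read into memory in one go.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Turn Windows CRLF line endings into plain LF so the "#!/usr/bin/perl" line is valid
sub strip_carriage_returns {
        my ( $text ) = @_;
        $text =~ s/\r\n/\n/g;
        return $text;
}

# Rewrite a file in place; raw mode stops Perl from translating line endings itself
sub fix_file {
        my ( $path ) = @_;

        open( my $in, '<:raw', $path ) or die "Can't read $path: $!";
        my $text = do { local $/; <$in> };      # slurp the whole file
        close( $in );

        open( my $out, '>:raw', $path ) or die "Can't write $path: $!";
        print {$out} strip_carriage_returns( $text );
        close( $out ) or die "Can't close $path: $!";
}

# fix every file named on the command line
fix_file( $_ ) foreach @ARGV;
```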

    #94205

    alexbleks
    Participant

    I have made an HTTP interface which uses my own database with images and MD5 hashes.

    Upload the ROMs and build the gamelist with one button.

    https://github.com/alexbleks/retrobox-rom-manager

    It’s only for NES, SNES, GB and GBC so far.
