Improved(?) Scraper Script


RetroPie has a new website and forum. Please visit https://retropie.org.uk/ for the new site. The new forum is located at https://retropie.org.uk/forum/. This forum is left here as a read-only archive.

This topic contains 4 replies, has 4 voices, and was last updated by alexbleks 2 years ago.

Viewing 5 posts - 1 through 5 (of 5 total)
    #83282
    rev138
    Participant

    I find the existing scraper leaves something to be desired. My main gripes are that it’s slow as death, and if you want the info to be accurate you need to run it in manual mode and babysit it. So, I decided to write my own.

    The script uses the VG Archive API, which lets you search by the ROM’s MD5 digest, which is WAY more sane than searching by title, like TheGamesDB forces you to. The main problem I found with VG Archive, however, is that it’s missing a lot of images, or has broken image links. If this script fails to get a valid box cover from the API, it queries Google for the first image result for “<game title> <system> box art”, which in my random sampling always returned an appropriate image. Finally, the downloaded images are scaled to 350px wide and saved in JPEG format at 50% quality (though I’m thinking of making this customizable via an argument), which seems to yield smaller file sizes than the default scraper.

    Anyway, I figured as long as I had a functional script, I might as well share. It’s written in Perl, and you’ll need to install a few extra packages on top of what ships with the RetroPie image:
    sudo apt-get install libwww-perl libxml-simple-perl perlmagick libjson-perl

    Here’s an example invocation for NES, assuming you have the script downloaded in /home/pi:
    ./es_scraper.pl --in-file .emulationstation/gamelists/nes/gamelist.xml --out-file .emulationstation/gamelists/nes/gamelist.xml --downloads .emulationstation/downloaded_images/nes "RetroPie/roms/nes/*"

    I hope someone finds it useful. Please let me know if you run into any bugs so I can fix them.
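
A note on the quoted "RetroPie/roms/nes/*" argument in the invocation above: the script expands the pattern itself with Perl's glob and only looks at its first non-option argument, so the quotes keep the shell from expanding it first (unquoted, only the first matching ROM would be processed). A minimal sketch of that expansion, where expand_pattern is a hypothetical helper name and not part of the script:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Expand one quoted shell-style pattern into the list of matching paths,
# the same way the scraper treats its first non-option argument.
sub expand_pattern {
    my ( $pattern ) = @_;
    return glob( $pattern );
}

# Print what a pattern would match before running the real scraper.
if( @ARGV ){
    print "$_\n" foreach expand_pattern( $ARGV[0] );
}
```

Run with the same quoted pattern to preview which files would be scraped. Note that Perl's core glob splits its pattern on whitespace, so ROM directories with spaces in their path would need different handling.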

    #83283
    rev138
    Participant

    Looks like that failed to attach, here it is:

    #!/usr/bin/perl
    
    ###
    ### Point this script at a ROM/directory full of ROMs and it will generate a 
    ### gamelist.xml file for EmulationStation and download the box cover art.
    ###
    ### 20141202 patternspider@gmail.com
    ###
    
    use strict;
    use warnings;
    use LWP::Simple 'getstore';
    use LWP::UserAgent;
    use XML::Simple;
    use JSON;
    use Digest::MD5;
    use Cwd qw( getcwd abs_path );
    use Image::Magick;
    use File::Path 'make_path';
    use Getopt::Long;
    
    my $opts = {
            'api-key'       => '7TTRM4MNTIKR2NNAGASURHJOZJ3QXQC5',  # RetroPie's API key
            'api-url'       => 'http://api.archive.vg/2.0',         # VG Archive API
            'downloads'     => getcwd . '/downloaded_images',       # Folder for downloaded box cover art
            'in-file'       => getcwd . '/gamelist.xml',
            'out-file'      => getcwd . '/gamelist.xml',
    };
    
    GetOptions(
            $opts,
            'api-key|k=s',
            'api-url|u=s',
            'downloads|d=s',
            'in-file|i=s',
            'out-file|o=s',
            'help|h'        => sub{ &help },
            'no-images|n',
            'stdout|s',
    );
    
    # the ROM path/pattern is required; show the usage text if it is missing
    help() unless @ARGV;
    my @files = glob( $ARGV[0] );
    my $game_list = {};
    my $xs = XML::Simple->new;
    
    # read in the existing gamelist if there is one
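    if( -r $opts->{'in-file'} ){
            # ForceArray keeps a gamelist with a single <game> entry as a list;
            # KeyAttr => [] stops XML::Simple folding entries on their <name>
            my $in_data = $xs->XMLin( $opts->{'in-file'}, SuppressEmpty => 1, KeyAttr => [], ForceArray => [ 'game' ] );
            my $games = ( ref $in_data eq 'HASH' and ref $in_data->{'game'} eq 'ARRAY' ) ? $in_data->{'game'} : [];
            foreach my $game ( @$games ){
                    $game_list->{$game->{'path'}} = $game if defined $game->{'path'} and -e $game->{'path'};
            }
    }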
    
    my $ua = LWP::UserAgent->new;
    
    # ;)
    $ua->agent('RetroPie Scraper Browser');
    
    # keep only regular files (skip directories) and make the paths absolute
    @files = map { abs_path( $_ ) } grep { -f $_ } @files;
    
    foreach my $filename ( @files ){
            # get the MD5 digest for the ROM
            my $md5 = get_md5( $filename );
            # look up the ROM by its digest
            my $response = $ua->get( $opts->{'api-url'} . '/Game.getInfoByMD5/xml/' . $opts->{'api-key'} . "/$md5" );
    
            if( $response->is_success ){
                    my $data = XMLin( $response->decoded_content );
    
                    # make sure the API returned data in the format we expect
                    if( defined $data->{'games'} and defined $data->{'games'}->{'game'} and ref $data->{'games'}->{'game'} eq 'HASH' ){
                            my $game_data = $data->{'games'}->{'game'};
                            my $rating = 0;
                            my $image_file;
    
                            print "Found $game_data->{'title'}\n" unless $opts->{'stdout'};
    
                            # accept only numeric ratings (one or more digits/dots)
                            $rating = $game_data->{'rating'} if defined $game_data->{'rating'} and $game_data->{'rating'} =~ /^[0-9.]+$/;
    
                            # only treat the cover URLs as usable if they are plain strings
                            my $box_front = ( defined $game_data->{'box_front'} and ref $game_data->{'box_front'} ne 'HASH' ) ? $game_data->{'box_front'} : undef;
                            my $box_front_small = ( defined $game_data->{'box_front_small'} and ref $game_data->{'box_front_small'} ne 'HASH' ) ? $game_data->{'box_front_small'} : undef;
    
                            # get the box cover if any, unless --no-images was given
                            if( not $opts->{'no-images'} and ( defined $box_front or defined $box_front_small ) ){
                                    # parse out the filename
                                    my ( $basename ) = ( defined $box_front ? $box_front : $box_front_small ) =~ /\/([^\/]+)$/;
    
                                    # set a temporary download location, falling back to the digest as a name
                                    my $temp_file = '/tmp/' . ( defined $basename ? $basename : $md5 );
    
                                    # download the box cover
                                    my $response_code = '';
                                    $response_code = getstore( $box_front, $temp_file ) if defined $box_front;
    
                                    # if that didn't work, try to get the small version
                                    if( $response_code !~ /^(2|3)[0-9]{2}$/ ){
                                            $response_code = getstore( $game_data->{'box_front_small'}, $temp_file ) if defined $game_data->{'box_front_small'} and ref $game_data->{'box_front_small'} ne 'HASH';
                                    }
    
                                    # if that didn't work, try google
                                    if( $response_code !~ /^(2|3)[0-9]{2}$/ ){
                                            my $google_result = google_image_search( $ua, $game_data->{'title'} . ' ' . $game_data->{'system_title'} . ' box art' );
                                            $response_code = getstore( $google_result, $temp_file ) if defined $google_result;
                                    }
    
                                    # how about now?
                                    if( $response_code =~ /^(2|3)[0-9]{2}$/ ){
                                            # set the post-processed file location
                                            $image_file = $opts->{'downloads'} . "/$md5.jpg";
    
                                            my $im = Image::Magick->new;
                                            my $image = $im->Read( $temp_file );
    
                                            # scale to 350px width
                                            $im->AdaptiveResize( geometry => '350x' );
                                            # write out the scaled image in JPEG format at 50% quality
                                            make_path( $opts->{'downloads'} );
                                            $im->Write( filename => $image_file, compression => 'JPEG', quality => 50 ) ;
                                            # remove the temp file
                                            unlink $temp_file;
                                    }
                            }
    
                            # set/overwrite the attributes of the current rom
                            $game_list->{$filename}->{'name'} = $game_data->{'title'};
                            $game_list->{$filename}->{'path'} = $filename;
                            $game_list->{$filename}->{'image'} = $image_file if defined $image_file;
                            $game_list->{$filename}->{'desc'} = $game_data->{'description'};  # EmulationStation reads <desc>
                            $game_list->{$filename}->{'developer'} = $game_data->{'developer'};
                            $game_list->{$filename}->{'publisher'} = $game_data->{'publisher'};
                            $game_list->{$filename}->{'genre'} = $game_data->{'genre'};
                            $game_list->{$filename}->{'rating'} = $rating;
                    }
            }
            else {
                    die $response->code . ' ' . $response->message . "\n";
            }
    }
    
    # manually printing this because getting XML::Simple to reproduce the same formatting is baffling
    open( STDOUT, '>', $opts->{'out-file'} ) or die "Can't write to $opts->{'out-file'}: $!\n" unless $opts->{'stdout'};
    print "<gameList>\n";
    
    foreach my $file ( sort { $game_list->{$a}->{'name'} cmp $game_list->{$b}->{'name'} } keys %$game_list ){
            print "\t<game>\n";
    
            foreach my $key ( sort keys %{$game_list->{$file}} ){
                    print "\t\t" . $xs->XMLout( { $key => $game_list->{$file}->{$key} }, NoAttr => 1, KeepRoot => 1 );
            }
    
            print "\t</game>\n";
    }
    
    print "</gameList>\n";
    
    ###
    
    sub get_md5 {
            my ( $filename ) = @_;
            my $ctx = Digest::MD5->new;
    
            # read the ROM in binary mode so the digest matches the raw bytes
            open( my $fh, '<', $filename ) or die "Can't read $filename: $!\n";
            binmode( $fh );
            $ctx->addfile( $fh );
            close( $fh );
    
            return $ctx->hexdigest;
    }
    
    sub google_image_search {
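            my ( $ua, $search_string ) = @_;
            # URL-encode the query so spaces and punctuation in titles survive
            require URI::Escape;
            my $response = $ua->get( 'https://ajax.googleapis.com/ajax/services/search/images?v=1.0&rsz=1&q=' . URI::Escape::uri_escape_utf8( $search_string ) );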
    
            if( $response->is_success ){
                    my $data = from_json( $response->decoded_content );
                    if( defined $data->{'responseData'} and @{$data->{'responseData'}->{'results'}} ){
                            return $data->{'responseData'}->{'results'}->[0]->{'url'};
                    }
            }
    }
    
    sub help {
            print "usage: es_scraper.pl [OPTIONS] \"/path/to/roms/*\"\n";
            print "options:\n";
            print "\t--api-key\tVG Archive API key\n";
            print "\t--api-url\tVG Archive API URL\n";
            print "\t--downloads\tBox cover art download folder\n";
            print "\t--in-file\tgamelist XML file to read in\n";
            print "\t--no-images\tSkip downloading box covers\n";
            print "\t--out-file\tgamelist XML file to write out\n";
            print "\t--stdout\tWrite to stdout instead of --out-file\n";
            print "\n";
            print "All options have sane defaults\n";
            exit;
    }
    #84670
    reakhavok
    Participant

    Hi Rev138,

    Thanks for sharing the script. I tried running it for my MAME collection and I get the following error:

    File does not exist: at ./es_scraper.pl line 62.

    Is there a log file that’s generated? Any suggestions?

    *Update:
    Originally I executed the script via this command:
    ./es_scraper.pl --in-file .emulationstation/gamelists/mame/gamelist.xml --out-file .emulationstation/gamelists/mame/gamelist.xml --downloads .emulationstation/downloaded_images/nes "RetroPie/roms/mame/*"

    After looking at your code it looks like I need to point to the ROMs directory, so I edited it to look like this:

    ./es_scraper.pl /home/pi/RetroPie/roms/mame/ --in-file .emulationstation/gamelists/mame/gamelist.xml --out-file .emulationstation/gamelists/mame/gamelist.xml --downloads .emulationstation/downloaded_images/nes "RetroPie/roms/mame/*"

    When doing this I get the error:
    Reading from filehandle failed at ./es_scraper.pl line 162

    Thanks

    • This reply was modified 2 years, 3 months ago by reakhavok.
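
"Reading from filehandle failed" is the message Digest::MD5's addfile gives when a read fails, which is consistent with a directory (here /home/pi/RetroPie/roms/mame/) ending up as the first argument instead of a ROM file. One way to make that case harmless is sketched below, under the assumption that a bare directory should mean "every regular file directly inside it". rom_files is a hypothetical helper, not part of the posted script:

```perl
use strict;
use warnings;
use Cwd 'abs_path';

# Turn a ROM argument (a directory or a glob pattern) into a list of
# absolute paths to regular files, skipping subdirectories.
sub rom_files {
    my ( $arg ) = @_;

    # assumption: treat a bare directory as the pattern "directory/*"
    if( -d $arg ){
        ( my $dir = $arg ) =~ s{/+$}{};
        $arg = "$dir/*";
    }

    # core glob splits on whitespace, so paths with spaces are not handled here
    return map { abs_path( $_ ) } grep { -f $_ } glob( $arg );
}
```

With a helper like this, both "RetroPie/roms/mame/*" and /home/pi/RetroPie/roms/mame/ would yield the same list of files to hash.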
    #85560
    brakanje
    Participant
    
    pi@raspberrypi ~ $ ./es_scraper.pl --in-file .emulationstation/gamelists/nes/gamelist.xml --out-file .emulationstation/gamelists/nes/gamelist.xml --downloads .emulationstation/downloaded_images/nes "RetroPie/roms/nes/*"
    Not an ARRAY reference at ./es_scraper.pl line 49.
    

    I got this error when trying to run it, after I figured out how to remove the redundant line feeds from Windows. I am going to research Perl and see if I can fix it, but as a stay-at-home mum I may not get around to it, so I figured I’d show you what I got.
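
A likely explanation for "Not an ARRAY reference": by default XML::Simple returns an element that occurs only once as a hash reference rather than a one-element array, so a gamelist.xml holding a single <game> entry cannot be looped over with @{...}. XML::Simple's ForceArray option addresses this at parse time. The sketch below shows the defensive alternative of normalising the value before looping, where as_list is a hypothetical helper name:

```perl
use strict;
use warnings;

# Normalise a value that may be undef, a single item, or an array
# reference into a plain list, so it is always safe to loop over.
sub as_list {
    my ( $value ) = @_;
    return () unless defined $value;
    return ref $value eq 'ARRAY' ? @$value : ( $value );
}
```

A loop such as foreach my $game ( as_list( $data->{'game'} ) ) then behaves the same whether the gamelist has zero, one, or many entries.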

    #94205
    alexbleks
    Participant

    I have made an HTTP interface which uses my own database with images and MD5 hashes.

    Upload the ROMs and build the gamelist with one button.

    https://github.com/alexbleks/retrobox-rom-manager

    It’s only for NES, SNES, GB and GBC so far.


