Auto Scraper v0.6


This topic contains 117 replies, has 20 voices, and was last updated by rafaelr 1 year, 5 months ago.

Viewing 35 posts - 1 through 35 (of 118 total)
    #81716
    sselph
    Participant

    This is an auto-scraper that runs from the command line and supports the following ROM types:
    NES, SNES, N64, GB, GBC, GBA, MD, SMS, 32X, GG, PCE, A2600, LNX, MAME (see below)

    It works by crawling a directory of ROM files looking for known extensions. When it finds a file, it hashes the ROM data minus any headers or special file formatting, so that only the data pulled from the original game is hashed. It then looks the hash up in a DB, downloads the matching metadata, and builds the gamelist.xml file.
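As a rough illustration of the crawl-and-hash step described above, here is a minimal Python sketch. It is not the scraper's actual code (the scraper is written in Go). The extension set, the function names, and the choice of SHA-1 are assumptions made for the example; the only header handled is the 16-byte iNES header that .nes files commonly carry.

```python
import hashlib
from pathlib import Path

# Hypothetical extension set for this sketch; the real scraper's list differs.
KNOWN_EXTENSIONS = {".nes", ".gb", ".gbc", ".gba"}
INES_MAGIC = b"NES\x1a"   # first four bytes of an iNES header
INES_HEADER_SIZE = 16     # the iNES header is 16 bytes long


def rom_payload(path: Path) -> bytes:
    """Return the ROM bytes with a recognised container header removed."""
    data = path.read_bytes()
    if path.suffix.lower() == ".nes" and data.startswith(INES_MAGIC):
        return data[INES_HEADER_SIZE:]
    return data


def hash_roms(rom_dir: Path) -> dict:
    """Map each ROM file name in rom_dir to the SHA-1 of its payload."""
    hashes = {}
    for path in sorted(rom_dir.iterdir()):
        # Files with unknown extensions (for example .zip) are skipped.
        if path.is_file() and path.suffix.lower() in KNOWN_EXTENSIONS:
            hashes[path.name] = hashlib.sha1(rom_payload(path)).hexdigest()
    return hashes
```

Because the header is dropped before hashing, a headered and an unheadered copy of the same game data produce the same digest, which is the point of hashing "minus any headers".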

    You can find the source on github:
    https://github.com/sselph/scraper

    And for trusting people who don’t want to bother compiling, I cross-compiled it for several platforms, even the RPi:
    https://github.com/sselph/scraper/releases

    Basic Instructions:
    Download/Build the scraper executable for the system of choice.
    Copy it into your ROM folder.
    Make sure it is executable on Linux/Mac (chmod +x scraper).
    Run it (./scraper -thumb_only).
    You will see it processing the ROMs, and it will write a gamelist.xml file in the ROM folder.
    If you are not running this on the RPi, copy everything (ROMs, image folder, gamelist.xml) to the RetroPie.

    If you want to run directly on a rpi or rpi2 you can follow these instructions:
    https://github.com/sselph/scraper#install-from-my-binaries

    MAME works slightly differently: matching is based on names. You need to start the script with the -mame flag:
    ./scraper -mame

    If these instructions are unclear you can check out Floob’s videos.

    Auto-scraper: https://github.com/sselph/scraper

    #81730
    Floob
    Member

    Hi,

    I just gave that a go in this dir:
    /home/pi/RetroPie/roms/gba

    I ran it with this (using the rpi scraper version here: https://github.com/sselph/scraper/releases):
    ./scraper -thumb_only

    It created a gamelist.xml with this content only:
    <gameList></gameList>

    I imagine I’m doing something wrong, can you see what?

    RetroPie help guides --> https://goo.gl/Yfy8kj
    Please read this before asking for help --> http://goo.gl/eLErnl

    #81735
    sselph
    Participant

    The script is looking at the file extensions and doesn’t know what to do with zip files. From what I understood emulationstation couldn’t find roms inside zip files so I didn’t set up code to look inside them. If that has changed I can make a few modifications. For now if you unzip them it should see the .gba files.

    You can issue something like the following to unzip all your files. The quotes around $f matter because ROM names often contain spaces:
    for f in *.zip; do unzip "$f"; done

    #81758
    Floob
    Member

    Thanks, I should have realised I shouldn’t have them zipped.

    I now have this error appearing:

    2014/10/14 14:10:08 INFO: Starting: 0035 - Namco Museum (U).gba
    2014/10/14 14:10:24 ERR: error processing 0035 - Namco Museum (U).gba: XML syntax error on line 22: element <meta> closed by </head>
    2014/10/14 14:10:24 INFO: Starting: 0035 - Namco Museum (U).gba
    2014/10/14 14:10:39 ERR: error processing 0035 - Namco Museum (U).gba: XML syntax error on line 22: element <meta> closed by </head>
    2014/10/14 14:10:39 INFO: Starting: 0083 - Final Fight One (E).gba
    2014/10/14 14:10:55 ERR: error processing 0083 - Final Fight One (E).gba: XML syntax error on line 22: element <meta> closed by </head>
    2014/10/14 14:10:55 INFO: Starting: 0083 - Final Fight One (E).gba
    2014/10/14 14:11:11 ERR: error processing 0083 - Final Fight One (E).gba: XML syntax error on line 22: element <meta> closed by </head>
    2014/10/14 14:11:11 INFO: Starting: 0083 - Final Fight One (E).gba
    2014/10/14 14:11:27 ERR: error processing 0083 - Final Fight One (E).gba: XML syntax error on line 22: element <meta> closed by </head>
    2014/10/14 14:11:27 INFO: Starting: 0070 - Kaze no Klonoa - Yumemiru Teikoku (J).gba
    2014/10/14 14:11:43 ERR: error processing 0070 - Kaze no Klonoa - Yumemiru Teikoku (J).gba: XML syntax error on line 22: element <meta> closed by </head>
    2014/10/14 14:11:43 INFO: Starting: 0070 - Kaze no Klonoa - Yumemiru Teikoku (J).gba
    2014/10/14 14:11:59 ERR: error processing 0070 - Kaze no Klonoa - Yumemiru Teikoku (J).gba: XML syntax error on line 22: element <meta> closed by </head>

    #81761
    sselph
    Participant

    Seems like thegamesdb.net is having issues and my script doesn’t present a nice error for that. It should hopefully work if you try again in a few minutes.

    #81807
    sselph
    Participant

    Seems like thegamesdb.net is back up now.

    #81874
    Floob
    Member

    Looks like a great script, thanks very much.
    I put a basic video together for it here. Let me know if I should add anything to it.

    #81876
    sselph
    Participant

    Very nice video, and thanks for pointing out the bug about players. I didn’t notice it in the gamelist.xml spec. I’ll get that added to the output. If you have any other feedback on issues, improvements, or platforms to add, let me know.

    #81879
    Floob
    Member

    Would be great if it could support the Megadrive as well.

    Maybe it could start by reporting whether it can connect to thegamesdb, as earlier it confused me when it couldn’t get the data.

    #81964
    sselph
    Participant

    Released a new version of the script that adds the players field, a check that thegamesdb is up, and Megadrive support.

    A note on the MD support:
    There seem to be four accepted extensions (bin, md, smd, zip). BIN is trivial and should work without issue. MD and SMD are interleaved bin files, and I have to deinterleave them before computing the hash. Since I don’t have any of these files, I had to work from documentation I found and from files I made that hopefully conform to the format. Let me know if it doesn’t work. The SMD format also seems to support splitting files; I didn’t add any support for that. For ZIP I wasn’t sure how they are handled and need to do some hands-on testing. Does ES treat it as a single file, with the emulator just choosing the largest or first valid file from the zip, or is it treated as a directory (i.e. file.zip/rom.bin, file.zip/rom2.bin)?
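To make the deinterleaving step concrete, here is a hedged Python sketch. It is not taken from the scraper. It assumes the SMD layout as it is commonly documented: a 512-byte header followed by 16 KB blocks, where the first 8 KB of each block holds the bytes for odd offsets and the second 8 KB holds the bytes for even offsets. The function names are made up for the example, and split SMD files are not handled.

```python
SMD_HEADER_SIZE = 512          # assumed size of the copier header
SMD_BLOCK_SIZE = 16 * 1024     # assumed interleave block size
HALF = SMD_BLOCK_SIZE // 2


def smd_to_bin(smd: bytes) -> bytes:
    """Deinterleave an SMD image into raw binary, dropping the header."""
    body = smd[SMD_HEADER_SIZE:]
    if len(body) % SMD_BLOCK_SIZE:
        raise ValueError("SMD body is not a whole number of 16 KB blocks")
    out = bytearray(len(body))
    for start in range(0, len(body), SMD_BLOCK_SIZE):
        block = body[start:start + SMD_BLOCK_SIZE]
        out[start + 1:start + SMD_BLOCK_SIZE:2] = block[:HALF]  # odd offsets
        out[start:start + SMD_BLOCK_SIZE:2] = block[HALF:]      # even offsets
    return bytes(out)


def bin_to_smd(raw: bytes) -> bytes:
    """Inverse of smd_to_bin with a zero-filled header, for making test files."""
    out = bytearray(SMD_HEADER_SIZE)
    for start in range(0, len(raw), SMD_BLOCK_SIZE):
        block = raw[start:start + SMD_BLOCK_SIZE]
        out += block[1::2] + block[0::2]   # odd bytes first, then even bytes
    return bytes(out)
```

The inverse function is there for the situation described above: with no real .smd files at hand, a known-good .bin can be interleaved and fed back through the decoder, and the round trip should reproduce the original bytes (and so the original hash). This only checks the two functions against each other; it does not confirm that the assumed layout matches real SMD files.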

    #81966
    Floob
    Member

    Thanks very much for this.
    I’d really like to try it, but all my Megadrive roms are .gen.

    Do you think this is a .bin?
    http://www.openthefile.net/extension/gen/2908
    http://yoyofr.proboards.com/thread/731

    For me the emulationstation config looks for these extensions

    <extension>.smd .SMD .bin .BIN .gen .GEN .md .MD .zip .ZIP</extension>

    #81967
    sselph
    Participant

    Ah I must have an older version or something. You could try renaming a few of them to .bin, .smd, or .md to see if the script can hash them to a known hash. I might just need to add an alias for that format to one of the other formats. I would assume it should be similar to bin since .bin is a very generic extension for a binary file so someone probably created .gen to try and make them easier to sort.

    #81971
    Floob
    Member

    I gave it a go and am getting this at the moment

    pi@raspberrypi ~/RetroPie/roms/megadrive $ ./scraper -thumb_only
    It appears that thegamesdb.net isn't up. If you are sure it is use -skip_check to bypass this error.

    thegamesdb.net also seems down when I check it manually.

    #81973
    sselph
    Participant

    If you pass -skip_check you can bypass that error. When it processes a ROM you’ll then see either a hash-not-found message or the XML syntax error. If you see the syntax error, it means it found the hash but couldn’t download the data.

    #81975
    Floob
    Member

    XML syntax error on the files I renamed .bin – so looking good 🙂

    #81984
    sselph
    Participant

    I added the .gen support to mimic .bin. Hopefully that works.

    #81997
    ceuse
    Participant

    Great tool, I ran into a problem though.

    I have subfolders in my ROM directory (translation, europe, Japan, us). I ran the script in every folder separately, but the Pi does not recognise the gamelist file.

    Is there a way that you can implement subfolder scraping, or at least tell me how I get EmulationStation to recognise my subfolders with gamelist.xml files?

    Thanks in advance

    #82011
    sselph
    Participant

    Sure, I’ll take a look this weekend. It should be possible to recursively crawl the subdirectories and generate a single gamelist.xml.
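A recursive crawl that feeds one gamelist could look roughly like the Python sketch below. It is an illustration only, not the scraper's implementation: the extension set and function name are hypothetical, and just the <path> and <name> elements are written, with the name taken from the file name rather than from a database.

```python
import os
import xml.etree.ElementTree as ET

ROM_EXTENSIONS = {".gba", ".gb", ".gbc"}  # hypothetical subset for the sketch


def build_gamelist(rom_root: str) -> ET.Element:
    """Walk rom_root recursively; return a <gameList> with one <game> per ROM."""
    game_list = ET.Element("gameList")
    for dirpath, dirnames, filenames in os.walk(rom_root):
        dirnames.sort()  # make the traversal order deterministic
        for name in sorted(filenames):
            stem, ext = os.path.splitext(name)
            if ext.lower() not in ROM_EXTENSIONS:
                continue
            rel = os.path.relpath(os.path.join(dirpath, name), rom_root)
            game = ET.SubElement(game_list, "game")
            # Paths are stored relative to the ROM root with forward slashes.
            ET.SubElement(game, "path").text = "./" + rel.replace(os.sep, "/")
            ET.SubElement(game, "name").text = stem
    return game_list
```

Writing the result out is then one call, for example ET.ElementTree(build_gamelist(".")).write("gamelist.xml", encoding="utf-8", xml_declaration=True), which leaves a single file in the ROM root covering every subfolder.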

    #82014
    Floob
    Member

    Are there any other data sources that you could use besides thegamesdb.net ?
    Seems down so often at the moment.

    Is this one possible?
    archive.vg

    #82019
    sselph
    Participant

    I’ll look into adding more sources. I have to be careful since the way I’m matching is by taking the hash of the rom data(minus headers, etc) and matching that to a thegamesdb gameID. I do this with a csv file I manually create. I don’t want to be manually creating a second set of IDs since the process is time consuming.

    archive.vg has api calls to accept hashes of the rom files. This might work well. If I can figure out how to get an API key, I’ll see about adding it.

    There is also https://github.com/OpenVGDB/OpenVGDB/releases which is mapping the rom hash to a name, image link, and a description. The image CDN appears to be down so if that comes back up I can look into adding it as well.

    #82022
    Floob
    Member

    Ah, I see. I imagine the single source will normally be fine. Not sure if this helps: http://api.archive.vg/2.0/

    I don’t know how difficult it would be, but a lot of people would love MAME support, as that’s obviously a key system for emulation. If that got added at some point it would be great.

    Separately, a check to see if the ‘image’ directory exists before running it would help forgetful people. Like me……

    #82031
    Floob
    Member

    Do you know why the <releasedate> node looks odd in the gamelist.xml but displays ok in Emulation Station?

    <releasedate>19921220T000000</releasedate>

    #82033
    exonerated
    Participant

    First off the scraper works great! Pulled about 90% of my titles with no problem and I have a ton! Thanks for your work!

    The only issue I am having is with unzipped .md roms. It seems to flag an error because of the file extension. Is there any workaround for this?

    #82039
    sselph
    Participant

    > Do you know why the <releasedate> node looks odd in the gamelist.xml but displays ok in Emulation Station?
    > <releasedate>19921220T000000</releasedate>

    This is the way EmulationStation chose to encode a datetime. https://github.com/Aloshi/EmulationStation/blob/unstable/GAMELISTS.md

    The format is YYYYMMDDTHHMMSS. Since no release has an exact time, the second half is always T000000, giving YYYYMMDDT000000.
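For anyone post-processing a gamelist.xml by hand, the date encoding above maps directly onto standard strftime/strptime patterns. A small Python sketch (the function names are made up for the example):

```python
from datetime import date, datetime


def es_release_date(day: date) -> str:
    """Format a date as YYYYMMDDTHHMMSS with the time part zeroed."""
    return day.strftime("%Y%m%d") + "T000000"


def parse_es_release_date(text: str) -> date:
    """Parse a YYYYMMDDTHHMMSS string and keep only the date part."""
    return datetime.strptime(text, "%Y%m%dT%H%M%S").date()
```

With these, date(1992, 12, 20) becomes the 19921220T000000 value seen in the gamelist, and parsing that string gives the same date back.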

    > The only issue I am having is with unzipped .md roms. It seems to flag an error because of the file extension. Is there any workaround for this?

    What error are you seeing exactly? Is it just not matching hashes for any MD files, or is it throwing some other error? I suspect there are issues in the way I’m converting these back to bin files for hashing. I just found an issue with my smd code. I’ll write some code to convert the bin file I have to md and smd and see if the emulator plays it; then I’ll know that the code is working.

    #82054
    ceuse
    Participant

    Thanks for adding my subfolder scraping 🙂

    Info for everybody: you need to create an image folder in the root with a folder for each subfolder you’re scraping, so basically images\europe, Images\Usa, images\Japan etc.

    Edit: I think you have an error in there though (at least in the Windows version).

    My XML shows .\images/Europe\ … I checked the original XMLs and there it is always /, so perhaps it’s just the Windows version with this problem. Anyway, I just replaced every \ with / in Notepad++ and it works fine now. Thanks for the great tool 🙂 Now just add more and more systems *ggg*

    #82062
    sselph
    Participant

    Thanks for the Windows test. I used the golang functions to join paths, but they are OS-dependent and Windows uses \. I shouldn’t do that for the gamelist.xml portions, since the gamelist.xml will always be read on Linux. I saw some issues with RetroPie displaying the data for ROMs inside folders, but I’m on an older version so that might be working.
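The separator problem is not specific to Go. The Python sketch below reproduces it with the standard ntpath and posixpath modules: the first joins the way a Windows host would, the second always joins with "/". The two helper names are invented for the example.

```python
import ntpath      # Windows-style path rules, usable on any OS
import posixpath   # POSIX-style path rules, usable on any OS


def gamelist_image_path(*parts: str) -> str:
    """Join path components with '/' regardless of the host OS."""
    return posixpath.join(".", *parts)


def normalise_separators(path: str) -> str:
    """Convert backslashes in an already-written path to forward slashes."""
    return path.replace("\\", "/")


# What an OS-dependent join yields under Windows rules:
windows_style = ntpath.join(".", "images", "Europe", "game.png")
# What a gamelist.xml read on Linux needs:
portable = gamelist_image_path("images", "Europe", "game.png")
```

Here windows_style comes out with backslashes and portable with forward slashes. normalise_separators does the same job as the find-and-replace in Notepad++, with the caveat that it would also rewrite a legitimate backslash in a Linux file name.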

    I didn’t intend for the images to require sub folders but it might be a good thing. I added support to create the single images directory to make things easier but I’ll clean up the code so you don’t have to create a bunch of extra folders either by having it create them for you or by flattening the structure.

    I also researched more about megadrive roms. The documentation I saw had .md as a Multi Game Doctor file but other documentation has this as .mgd. And looking at the emulator in emulationstation they don’t support the Multi Game Doctor format only raw binary and the smd format so I will assume .md is actually the raw binary like .bin and .gen. I also fixed the smd block size and it appears to be working.

    The next things I’ll work on are fixing the issues that ceuse has found then add .zip support.

    #82083
    ramchip
    Participant

    This looks so awesome! I cannot wait for GamesDB to get back up so I can test this!! Is it possible to add Master System, Mame And FBA? I am extremely excited to use this as scraping has been the weak point of emulationstation! I should have a perfect build with XBMC, OwnCloud, PS3 Controllers and my favorite emulators/games after this!

    #82084
    sselph
    Participant

    Console games are much easier since I have a DB of hash values mapped to names from no-intro. I’ve also only ever worked with console roms. For MAME I’ll have to hunt down a list, ask for help creating one, or find a DB that already has them mapped by hash. After I get zip support added and add support for at least one extra data source I’ll take a look at adding more systems.

    #82094
    ramchip
    Participant

    I got to try this finally and I LOVE IT!! Here are my findings – NES 90%, SNES 80%, GB 90%, GBA 85% and GBC 15%. For some reason it barely found any of my GBC roms but those go through the scraper really well anyways! Thanks for your hard work on this, it has massive potential and should be included with RetroPie in my opinion!

    #82099
    ceuse
    Participant

    > I got to try this finally and I LOVE IT!! Here are my findings – NES 90%, SNES 80%, GB 90%, GBA 85% and GBC 15%. For some reason it barely found any of my GBC roms but those go through the scraper really well anyways! Thanks for your hard work on this, it has massive potential and should be included with RetroPie in my opinion!

    Just wanted to report the same thing… A lot of GBC ROMs won’t scrape, even though I randomly checked a few and they definitely are in thegamesdb.net. Is there a way to check whether the hashes of my/our ROMs are different, or whether it’s a problem with the code?

    #82101
    sselph
    Participant

    Sure, Game Boy Color ROMs are a simple raw binary format, so you can run shasum *.gbc and get a list of hashes and file names. Feel free to send me the list in a file. If you want to troubleshoot, you can look at the csv here:
    https://stevenselph.appspot.com/csv/hash.csv
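The check can also be scripted. The Python sketch below hashes every .gbc file with SHA-1 and lists the ones whose digest is missing from a locally saved copy of the csv. The column layout of hash.csv is not described in this thread, so the sketch makes no assumption about it and treats every field as a candidate hash; the function names and the local file name are hypothetical.

```python
import csv
import hashlib
from pathlib import Path


def load_known_hashes(csv_path: Path) -> set:
    """Collect every CSV field, lowercased, so the hash column need not be known."""
    with csv_path.open(newline="") as handle:
        return {field.strip().lower()
                for row in csv.reader(handle) for field in row}


def unmatched_roms(rom_dir: Path, known: set, pattern: str = "*.gbc") -> dict:
    """Return {file name: SHA-1} for ROMs whose hash is absent from known."""
    missing = {}
    for path in sorted(rom_dir.glob(pattern)):
        digest = hashlib.sha1(path.read_bytes()).hexdigest()
        if digest not in known:
            missing[path.name] = digest
    return missing
```

A call such as unmatched_roms(Path("."), load_known_hashes(Path("hash.csv"))) then gives the list of files to report. Digests are compared in lowercase because some GUI checksum tools print them in uppercase.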

    I can think of one issue. If these games were clones of a normal Game Boy game just with added color, they could be listed in thegamesdb as Game Boy and not Game Boy Color. I might just need to expand my search.

    #82102
    ceuse
    Participant

    Generated with a tool (all hail the GUI!) from one GBC file which the scraper doesn’t find:

    
    MD5 Checksum: BA85A2AE8AA5829C440EEF2D5549506C
    SHA-1 Checksum: 4E6F676EC15E0E6238CB81853B5A74BBB20657A1
    SHA-256 Checksum: 8EB56E0D55A04AA3FCF940F172757F4F60BAA6C53C82707DEF8AE4E78844B1DA
    SHA-512 Checksum: BB5B8C43865D38B3609EA8D1E818A6F2019D9AFFD8538F0D0F05A84F56A55ABFF8334F8B3A276467B54F9B60A6A2E6616E3AB1356E5F54A03F9A2E049577FE55
    Generated by MD5 & SHA Checksum Utility @ http://raylin.wordpress.com/downloads/md5-sha-1-checksum-utility

    thegamesdb.net link : https://thegamesdb.net/game/21997/

    I can’t find the SHA-1 in the csv. Is the ROM broken or the list somewhat off? As the previous poster said, the success rates are quite low compared to GB and all the other supported platforms. At least something seems off.

    #82104
    sselph
    Participant

    The ROM seems fine and I see the exact hash in the no-intro set. I must have made some mistake generating my csv. I’ll go back over the GBC dataset to figure it out. Thanks for finding the issue.

    #82129
    ramchip
    Participant

    This is incredible! Thanks man! Between your tool and the MAME/FBA scraper I have 100% images and info!!

    #82130
    Floob
    Member

    Yes, it really is a very effective tool that is pretty easy to use.
    I’m sure it will help a lot of people make their RetroPie experience even better.

    It would be interesting if EmulationStation could hook it into their GUI.


Forums are currently read only - please visit the new RetroPie forums at https://retropie.org.uk/forums/
