Topic: Hyperlabs released all their PDFs

Offline mathiasxx94

  • Larvae
  • *
  • Posts: 8
Re: Hyperlabs released all their PDFs
« Reply #20 on: June 27, 2019, 01:38:29 PM »
Does anyone know how to get rid of the numbers in front of each paper's title, please? The numbers make it impossible for me to archive them, as my archive is in alphabetical order.

This sounds like an easy task for a few lines of Python or similar code. I think you should be able to find an existing solution; if not, I can probably write one later today.

Offline aes256

  • Larvae
  • *
  • Posts: 35
Re: Hyperlabs released all their PDFs
« Reply #21 on: June 27, 2019, 05:07:50 PM »
Does anyone know how to get rid of the numbers in front of each paper's title, please? The numbers make it impossible for me to archive them, as my archive is in alphabetical order.

Paste here (or PM me) about a dozen filenames that are representative of the naming scheme and I'll whip up some Python code to rename them for you. If the naming convention is consistent throughout, then stripping the numbers is trivial :)
Quote from: Eleusis
However, I had serious misgivings about sharing because my quest was one for knowledge and experience while, I knew, for most others it would be for purely economic reasons.

Offline Wizard X

  • Lord of the Realms
  • Founding Wasp
  • *****
  • Posts: 1,363
  • The X Realm
    • Collective Members Edition
Re: Hyperlabs released all their PDFs
« Reply #22 on: June 28, 2019, 12:34:03 AM »
Does anyone know how to get rid of the numbers in front of each paper's title, please? The numbers make it impossible for me to archive them, as my archive is in alphabetical order.

Paste here (or PM me) about a dozen filenames that are representative of the naming scheme and I'll whip up some Python code to rename them for you. If the naming convention is consistent throughout, then stripping the numbers is trivial :)


Look down this post for "See attachment for all files in the archive."

Download: https://www.thevespiary.org/talk/index.php?action=dlattach;topic=15803.0;attach=10658

Albert Einstein - "Great ideas often receive violent opposition from mediocre minds."

Offline mathiasxx94

  • Larvae
  • *
  • Posts: 8
Re: Hyperlabs released all their PDFs
« Reply #23 on: June 28, 2019, 12:55:18 AM »
Does anyone know how to get rid of the numbers in front of each paper's title, please? The numbers make it impossible for me to archive them, as my archive is in alphabetical order.

Not the cleanest code, but it works okay. I haven't tried it on the other folders yet, but I suppose they are similar. The code chokes on some of the files, either because of strange characters or because removing the leading numbers would produce a duplicate file name. The files it can't rename are printed out, though, so you can change them manually since there are so few. It's Python 2.7 since I'm a degenerate weeb, but it should probably work with Python 3 too.

Code: [Select]
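from __future__ import print_function  # keeps print() working on Python 2.7 and 3
import os

# A unicode path makes os.walk return unicode filenames on Python 2 as well.
path = u"D:\\hyperlab_archive\\PDF.part4\\PDF"  # Just change the path to yours

for root, dirs, files in os.walk(path):
    for filename in files:
        if filename[0].isdigit():
            firstunderscore = filename.find("_")
            newfilename = filename[firstunderscore + 1:]
            # Join with root (not path) so files in subfolders are found too.
            old_file = os.path.join(root, filename)
            new_file = os.path.join(root, newfilename)
            if firstunderscore < 0 or os.path.exists(new_file):
                # Nothing to strip, or the new name is already taken: fix by hand.
                print(repr(filename))
                continue
            try:
                os.rename(old_file, new_file)
            except OSError:
                # Print the files that could not be renamed.
                print(repr(filename))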
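If you'd rather see what would happen before touching anything, here is a rough dry-run sketch. It only works on a list of names and never renames a file. The function name `plan_renames`, the `PREFIX` pattern and the sample filenames are my own assumptions for illustration, not something taken from the archive.

```python
import re

# Assumption: a "numbered" file is digits, an underscore, then a real name
# (the next character must not be a dot, so "679_.PDF" is left alone).
PREFIX = re.compile(r'^\d+_(?=[^.])')

def plan_renames(filenames):
    """Split names from ONE folder into safe renames and names to fix by hand."""
    existing = set(filenames)
    candidates = {}
    for name in filenames:
        stripped = PREFIX.sub('', name)
        if stripped != name:
            candidates.setdefault(stripped, []).append(name)

    renames, manual = {}, []
    for new_name, sources in candidates.items():
        # Safe only if exactly one file wants this name and nothing owns it yet.
        if len(sources) == 1 and new_name not in existing:
            renames[sources[0]] = new_name
        else:
            manual.extend(sources)
    return renames, sorted(manual)

# Hypothetical filenames, just to show the shape of the result.
renames, manual = plan_renames(
    ["12_a.pdf", "13_a.pdf", "7_b.pdf", "c.pdf", "679_.PDF"])
print(renames)
print(manual)
```

The point of planning first is that the duplicate cases show up as a list you can sort out by hand, instead of depending on which file the loop happens to reach first.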
       

Offline aes256

  • Larvae
  • *
  • Posts: 35
Re: Hyperlabs released all their PDFs
« Reply #24 on: June 28, 2019, 07:55:02 AM »
This function should clean up the filenames pretty well and skip the ones that are too irregular to rename safely:
Code: [Select]
import re

def rename(filename):
    """Strip leading numbers from filenames."""

    # The regex below is used to match and rename filenames like this:
    #   515_55.pdf                          55.pdf
    #   2_Busc_Ber_3_269_269_190_.pdf       Busc_Ber_3_269_269_190_.pdf
    #   1129_00722a060.pdf                  00722a060.pdf
    #   1221_1.pdf                          1.pdf
    #   2220_16_673.pdf                     16_673.pdf
    #   4186_9781593855864.pdf              9781593855864.pdf
    #   4193_51710234_S20Manual.pdf         51710234_S20Manual.pdf
    #   
    # But avoid renaming filenames like these:
    #   679_.PDF
    #   ???????_1983_212-215.pdf
    #   4-Ethoxy-3,5-dimethoxybenzaldehyd- ??????? -- JACS 76, p5555, 1954.pdf
    #   3,4,5-???? -- JACS 74, p4263, 1952.pdf
    REGEX = r'(^[\d]+_)(.+)(\..*)'
    match = re.search(REGEX, filename.strip())
   
    if filename[0].isalpha():
        return
       
    if match and len(match.groups()) >= 3:
        new_filename = ''.join(match.groups()[1:])
        return new_filename
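A quick way to sanity-check the regex is to run it over the sample names from the comments. This is only a sketch: `strip_prefix` is a hypothetical stand-in that uses the same pattern as above, and the names are copied from the comment block.

```python
import re

# Same pattern as in the function above.
REGEX = r'(^[\d]+_)(.+)(\..*)'

def strip_prefix(filename):
    """Return the name without its numeric prefix, or None if it doesn't match."""
    match = re.search(REGEX, filename.strip())
    if match:
        return ''.join(match.groups()[1:])
    return None

samples = [
    '515_55.pdf',
    '2_Busc_Ber_3_269_269_190_.pdf',
    '4193_51710234_S20Manual.pdf',
    '679_.PDF',
    '???????_1983_212-215.pdf',
    '3,4,5-???? -- JACS 74, p4263, 1952.pdf',
]

for name in samples:
    # None means the name is left alone.
    print('%-45s %s' % (name, strip_prefix(name)))
```

Note that a name like `2_Busc_Ber_3_269_269_190_.pdf` does start with digits and an underscore, so the pattern strips the `2_` from it as well.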

And this is mathiasxx94's code refactored (BUT UNTESTED) to:
  • remove leading numbers from filenames
  • rename files across the entire archive
  • avoid file rename collisions within the same directory.


I can't be fucked spinning up a Windows VM right now to test it so Your Mileage May Vary  ;)
Code: [Select]
#!/usr/bin/env python2
import os
import re

# Raw string, so backslashes in your path are never read as escape sequences.
HYPERLAB_DIRECTORY = r"D:\hyperlab_archive" # Just change the path to yours

def rename(filename):
    """Strip leading numbers from filenames."""

    # The regex below is used to match and rename filenames like this:
    #   515_55.pdf                          55.pdf
    #   2_Busc_Ber_3_269_269_190_.pdf       Busc_Ber_3_269_269_190_.pdf
    #   1129_00722a060.pdf                  00722a060.pdf
    #   1221_1.pdf                          1.pdf
    #   2220_16_673.pdf                     16_673.pdf
    #   4186_9781593855864.pdf              9781593855864.pdf
    #   4193_51710234_S20Manual.pdf         51710234_S20Manual.pdf
    #
    # But avoid renaming filenames like these:
    #   679_.PDF
    #   ???????_1983_212-215.pdf
    #   4-Ethoxy-3,5-dimethoxybenzaldehyd- ??????? -- JACS 76, p5555, 1954.pdf
    #   3,4,5-???? -- JACS 74, p4263, 1952.pdf
    REGEX = r'(^[\d]+_)(.+)(\..*)'
    match = re.search(REGEX, filename.strip())

    if filename[0].isalpha():
        return

    if match and len(match.groups()) >= 3:
        new_filename = ''.join(match.groups()[1:])
        return new_filename

for root, dirs, files in os.walk(HYPERLAB_DIRECTORY):
    for filename in files:
        filename = filename.decode('latin-1')   # This assumes the majority of
                                                # people running this script are Windows users
        original_filepath = os.path.join(root, filename)

        new_filename = rename(filename)
        if new_filename:
            new_filepath = os.path.join(root, new_filename)
            if os.path.exists(new_filepath):
                print('CANNOT RENAME (FILE ALREADY EXISTS): %s' % new_filename)
                continue

            os.rename(
                original_filepath,
                new_filepath
            )
        else:
            print(original_filepath)
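Since the script above is untested, one option is to try the same walk-and-rename idea in a throwaway temp directory before pointing it at the real archive. The sketch below is Python 3 and is not the script above: `strip_numbers`, the `PREFIX` pattern and the fake files are assumptions made up for the test, and it returns the skipped paths instead of printing them.

```python
import os
import re
import shutil
import tempfile

# Assumption: digits + underscore, followed by a name that still has an extension.
PREFIX = re.compile(r'^\d+_(.+\..*)$')

def strip_numbers(directory):
    """Rename numbered files under directory; return the paths that were skipped."""
    skipped = []
    for root, dirs, files in os.walk(directory):
        for filename in files:
            match = PREFIX.match(filename)
            if not match:
                continue  # no usable numeric prefix: leave the file alone
            source = os.path.join(root, filename)
            target = os.path.join(root, match.group(1))
            if os.path.exists(target):
                skipped.append(source)  # would collide with another file
                continue
            os.rename(source, target)
    return skipped

sandbox = tempfile.mkdtemp()
try:
    os.mkdir(os.path.join(sandbox, 'sub'))
    # Two of these collide on purpose: both want to become "1.pdf".
    for name in ('515_55.pdf', '1221_1.pdf', '7_1.pdf',
                 os.path.join('sub', '2220_16_673.pdf')):
        open(os.path.join(sandbox, name), 'w').close()

    skipped = strip_numbers(sandbox)
    top_level = sorted(os.listdir(sandbox))
    in_sub = os.listdir(os.path.join(sandbox, 'sub'))
    print(top_level, in_sub, skipped)
finally:
    shutil.rmtree(sandbox)  # clean up the fake archive
```

Which of the two colliding files ends up as `1.pdf` depends on the order the OS lists them in, so treat the skipped list as the thing to check by hand.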
« Last Edit: June 28, 2019, 07:59:06 AM by aes256 »