Author Topic: Hyperlabs released all their PDFs  (Read 658190 times)

Offline mathiasxx94

  • Larvae
  • *
  • Posts: 8
Re: Hyperlabs released all their PDFs
« Reply #20 on: June 27, 2019, 01:38:29 PM »
Does anyone know how to get rid of the numbers in front of each paper's title, please? The numbers make it impossible for me to archive them, as my archive is in alphabetical order.

This sounds like an easy task with a few lines of Python or similar code. I think you should be able to find a solution yourself; if not, I can probably write it later today.

Offline aes256

  • Larvae
  • *
  • Posts: 34
Re: Hyperlabs released all their PDFs
« Reply #21 on: June 27, 2019, 05:07:50 PM »
Does anyone know how to get rid of the numbers in front of each paper's title, please? The numbers make it impossible for me to archive them, as my archive is in alphabetical order.

Paste here (or PM me) about a dozen lines of filenames that are representative of the naming scheme and I'll whip up some Python code to rename them for you. If the naming convention is consistent throughout, then stripping the numbers is trivial :)
Quote from: Eleusis
However, I had serious misgivings about sharing because my quest was one for knowledge and experience while, I knew, for most others it would be for purely economic reasons.

Offline Wizard X

  • Lord of the Realms
  • Founding Wasp
  • *****
  • Posts: 1,375
  • The X Realm
    • Collective Members Edition
Re: Hyperlabs released all their PDFs
« Reply #22 on: June 28, 2019, 12:34:03 AM »
Does anyone know how to get rid of the numbers in front of each paper's title, please? The numbers make it impossible for me to archive them, as my archive is in alphabetical order.

Paste here (or PM me) about a dozen lines of filenames that are representative of the naming scheme and I'll whip up some Python code to rename them for you. If the naming convention is consistent throughout, then stripping the numbers is trivial :)


Look down this post for "See attachment for all files in the archive."

Download: https://www.thevespiary.org/talk/index.php?action=dlattach;topic=15803.0;attach=10658

Albert Einstein - "Great ideas often receive violent opposition from mediocre minds."

Offline mathiasxx94

  • Larvae
  • *
  • Posts: 8
Re: Hyperlabs released all their PDFs
« Reply #23 on: June 28, 2019, 12:55:18 AM »
Does anyone know how to get rid of the numbers in front of each paper's title, please? The numbers make it impossible for me to archive them, as my archive is in alphabetical order.

Not the cleanest code, but it works okay. I haven't tried it on the other folders yet, but I suppose they are similar. The code fucks up on some of the files due to strange characters or, in some cases, duplicate file names that appear once the leading numbers are removed. The files it can't handle are printed out, though, so you can change them manually since there are so few. It's Python 2.7 since I'm a degenerate weeb, but should probably work with Python 3 too.

Code:
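import os

path = r"D:\hyperlab_archive\PDF.part4\PDF"  # Just change the path to yours

for root, dirs, files in os.walk(path):
    for filename in files:
        if not filename[0].isdigit():
            continue
        # Everything after the first underscore becomes the new name.
        prefix, sep, newfilename = filename.partition("_")
        # Join with root (not path) so files in subfolders are found too.
        old_file = os.path.join(root, filename)
        new_file = os.path.join(root, newfilename)
        # Print and skip names with nothing after the underscore (or no
        # underscore at all) and names whose target already exists.
        if not newfilename or os.path.exists(new_file):
            print(filename)
            continue
        try:
            os.rename(old_file, new_file)
        except OSError:
            # Strange characters and similar problems end up here.
            print(filename)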
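
A dry run makes the duplicate-name problem described above visible before anything on disk changes. This is only a minimal sketch, assuming Python 3; `plan_renames` and `PREFIX` are hypothetical names that do not come from the thread, and the pattern (leading digits plus one underscore, followed by something other than a dot) is an assumption about the naming scheme.

```python
import re

# Hypothetical helper, not from the thread: it only plans renames for the
# names of one directory and never touches the disk.
# Assumed pattern: leading digits, one underscore, then a non-dot character.
PREFIX = re.compile(r'^\d+_(?=[^.])')


def plan_renames(filenames):
    """Return (renames, skipped) for the file names of a single directory."""
    taken = set(filenames)   # names that exist now or are already planned
    renames = {}             # old name -> new name
    skipped = []             # names that would collide; handle these by hand
    for name in sorted(filenames):
        new_name = PREFIX.sub('', name)
        if new_name == name:
            continue         # no numeric prefix, nothing to do
        if new_name in taken:
            skipped.append(name)
            continue
        renames[name] = new_name
        taken.add(new_name)
    return renames, skipped


if __name__ == '__main__':
    planned, collisions = plan_renames(['515_55.pdf', '12_55.pdf', 'notes.pdf'])
    print(planned)
    print(collisions)
```

The plan is deliberately conservative: a name counts as taken even if the file holding it is itself about to be renamed, so nothing is ever overwritten.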
       

Offline aes256

  • Larvae
  • *
  • Posts: 34
Re: Hyperlabs released all their PDFs
« Reply #24 on: June 28, 2019, 07:55:02 AM »
This function should clean up the filenames pretty well and avoid instances that are too hard:
Code:
import re


def rename(filename):
    """Strip leading numbers from filenames."""

    # The regex below is used to match and rename filenames like this:
    #   515_55.pdf                          55.pdf
    #   2_Busc_Ber_3_269_269_190_.pdf       Busc_Ber_3_269_269_190_.pdf
    #   1129_00722a060.pdf                  00722a060.pdf
    #   1221_1.pdf                          1.pdf
    #   2220_16_673.pdf                     16_673.pdf
    #   4186_9781593855864.pdf              9781593855864.pdf
    #   4193_51710234_S20Manual.pdf         51710234_S20Manual.pdf
    #   
    # But avoid renaming filenames like these:
    #   679_.PDF
    #   ???????_1983_212-215.pdf
    #   4-Ethoxy-3,5-dimethoxybenzaldehyd- ??????? -- JACS 76, p5555, 1954.pdf
    #   3,4,5-???? -- JACS 74, p4263, 1952.pdf
    REGEX = r'(^[\d]+_)(.+)(\..*)'
    match = re.search(REGEX, filename.strip())
   
    if match:
        # Drop group 1 (the numeric prefix); keep the stem and the extension.
        return ''.join(match.groups()[1:])

    # Anything the regex does not match is left alone.
    return None
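
A quick way to check the comment table against what the regex really does is to run the listed names through it. The sketch below assumes Python 3 and uses a hypothetical stand-alone copy of the function called `strip_prefix`, with the group check written as a plain `if match`; the sample names are the ones from the comments.

```python
import re

REGEX = r'(^[\d]+_)(.+)(\..*)'


def strip_prefix(filename):
    """Hypothetical copy of rename(): returns None when nothing should change."""
    match = re.search(REGEX, filename.strip())
    if match:
        # Group 1 is the numeric prefix; groups 2 and 3 are stem and extension.
        return ''.join(match.groups()[1:])
    return None


SAMPLES = [
    '515_55.pdf',
    '2_Busc_Ber_3_269_269_190_.pdf',
    '1129_00722a060.pdf',
    '4193_51710234_S20Manual.pdf',
    '679_.PDF',
    '3,4,5-???? -- JACS 74, p4263, 1952.pdf',
]

if __name__ == '__main__':
    for name in SAMPLES:
        print('%-45s %s' % (name, strip_prefix(name)))
```

Note that a name such as `2_Busc_Ber_3_269_269_190_.pdf` does match the pattern, so its leading `2_` is stripped as well.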

And this is mathiasxx94's code refactored (BUT UNTESTED) to:
  • remove leading numbers from filenames
  • rename files across the entire archive
  • avoid file rename collisions within the same directory.


I can't be fucked spinning up a Windows VM right now to test it so Your Mileage May Vary  ;)
Code:
#!/usr/bin/env python2
import os
import re

HYPERLAB_DIRECTORY = r"D:\hyperlab_archive"  # Just change the path to yours

def rename(filename):
    """Strip leading numbers from filenames."""

    # The regex below is used to match and rename filenames like this:
    #   515_55.pdf                          55.pdf
    #   2_Busc_Ber_3_269_269_190_.pdf       Busc_Ber_3_269_269_190_.pdf
    #   1129_00722a060.pdf                  00722a060.pdf
    #   1221_1.pdf                          1.pdf
    #   2220_16_673.pdf                     16_673.pdf
    #   4186_9781593855864.pdf              9781593855864.pdf
    #   4193_51710234_S20Manual.pdf         51710234_S20Manual.pdf
    #
    # But avoid renaming filenames like these:
    #   679_.PDF
    #   ???????_1983_212-215.pdf
    #   4-Ethoxy-3,5-dimethoxybenzaldehyd- ??????? -- JACS 76, p5555, 1954.pdf
    #   3,4,5-???? -- JACS 74, p4263, 1952.pdf
    REGEX = r'(^[\d]+_)(.+)(\..*)'
    match = re.search(REGEX, filename.strip())

    if match:
        # Drop group 1 (the numeric prefix); keep the stem and the extension.
        return ''.join(match.groups()[1:])

    # Anything the regex does not match is left alone.
    return None

for root, dirs, files in os.walk(HYPERLAB_DIRECTORY):
    for filename in files:
        filename = filename.decode('latin-1')   # This assumes the majority of
                                                # people running this script are Windows users
        original_filepath = os.path.join(root, filename)

        new_filename = rename(filename)
        if new_filename:
            new_filepath = os.path.join(root, new_filename)
            if os.path.exists(new_filepath):
                print('CANNOT RENAME (FILE ALREADY EXISTS): %s' % new_filename)
                continue

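            try:
                os.rename(original_filepath, new_filepath)
            except OSError as error:
                # Report the failure and carry on with the remaining files.
                print('CANNOT RENAME (%s): %s' % (error, original_filepath))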
        else:
            print(original_filepath)
« Last Edit: June 28, 2019, 07:59:06 AM by aes256 »
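
Since the script above is posted untested, one low-risk way to try the idea is against a throwaway directory instead of the real archive. The following is a minimal sketch, assuming Python 3; `strip_prefix`, `rename_tree` and `demo` are hypothetical names, and the three sample files are made up to exercise a normal rename, a collision and a name the regex skips.

```python
import os
import re
import tempfile

REGEX = r'(^[\d]+_)(.+)(\..*)'


def strip_prefix(filename):
    """Hypothetical stand-in for rename(): None means leave the file alone."""
    match = re.search(REGEX, filename.strip())
    return ''.join(match.groups()[1:]) if match else None


def rename_tree(top):
    """Rename files under top; return the paths that were left untouched."""
    untouched = []
    for root, dirs, files in os.walk(top):
        for filename in sorted(files):   # sorted so runs are repeatable
            src = os.path.join(root, filename)
            new_filename = strip_prefix(filename)
            if not new_filename:
                untouched.append(src)
                continue
            dst = os.path.join(root, new_filename)
            if os.path.exists(dst):      # never overwrite an existing file
                untouched.append(src)
                continue
            os.rename(src, dst)
    return untouched


def demo():
    """Build a scratch tree, run the rename, and report what is left."""
    with tempfile.TemporaryDirectory() as top:
        sub = os.path.join(top, 'PDF.part1')
        os.mkdir(sub)
        for name in ('515_55.pdf', '12_55.pdf', '679_.PDF'):
            open(os.path.join(sub, name), 'w').close()
        untouched = rename_tree(top)
        listing = sorted(os.listdir(sub))
        return listing, sorted(os.path.basename(p) for p in untouched)


if __name__ == '__main__':
    print(demo())
```

The scratch directory is deleted when `demo()` returns, so nothing outside it is modified.
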

Offline The Lone Stranger

  • Dominant Wasp
  • ****
  • Posts: 355
  • THE SITE HERETIC and DEVILS ADVOCATE
    • YouTopia
Re: Hyperlabs released all their PDFs
« Reply #25 on: October 01, 2019, 01:26:50 PM »
mathiasxx94 and aes256 ....... Thank you very much. I never imagined that I would one day use Python and am quite excited to try it ...... but not at the moment, as I have other priorities. When I do try it I'll let you know what happens.
“Great spirits have always encountered opposition from mediocre minds. The mediocre mind is incapable of understanding the man who refuses to bow blindly to conventional prejudices and chooses instead to express his opinions courageously and honestly.” Albert Einstein

Offline BakingBrad

  • Pupae
  • **
  • Posts: 60
Re: Hyperlabs released all their PDFs
« Reply #26 on: December 19, 2019, 02:11:49 PM »
I missed out, can anyone repost the archive?

Offline Hooloovoo

  • Slayer of Poppies and
  • Dominant Wasp
  • ****
  • Posts: 409
  • It's not *my* fault
Re: Hyperlabs released all their PDFs
« Reply #27 on: December 19, 2019, 02:49:00 PM »
I missed out, can anyone repost the archive?

You tried the Pirate Bay/torrents for seeds?

Although it'd be great to have it uploaded here - how large is it, I wonder?

Offline Osho

  • Subordinate Wasp
  • ***
  • Posts: 111
Re: Hyperlabs released all their PDFs
« Reply #28 on: December 19, 2019, 02:53:15 PM »
A couple of gigs. I can go in a bit. I think one of the parts is corrupted, though.

Offline Hooloovoo

  • Slayer of Poppies and
  • Dominant Wasp
  • ****
  • Posts: 409
  • It's not *my* fault
Re: Hyperlabs released all their PDFs
« Reply #29 on: December 19, 2019, 03:51:52 PM »
A couple of gigs. I can go in a bit. I think one of the parts is corrupted, though.

Weren't they stored in a sane archive format with error correction features, so that if/when this kind of thing happens, hundreds of megabytes of data don't suddenly become useless because a few bits got flipped?

Damn, *smh*. :o

Offline Corrosive Joeseph

  • Global Moderator
  • Founding Wasp
  • *****
  • Posts: 849
Re: Hyperlabs released all their PDFs
« Reply #30 on: December 19, 2019, 04:21:18 PM »
I missed out, can anyone repost the archive?

What's wrong with the link from dedihetz......? I just clicked on it..... Boom! Sez it's 2.6gigs and will take 17 hours to DL.

Okay, it's time to play devil's advocate.

Honestly, it's real cool and all that, but sifting through 5000 pdfs with numbers for titles is not my idea of a fun time.....
I hate to say it, but I downloaded the whole archive a while back and I have found it completely useless.
It is WAAY easier to just sign up on the HyperLab, pick a route or a synthesis and then search the few relevant threads for the aforementioned pdfs, naming them properly while saving them in some semblance of order in a dedicated folder.

The threads also contain much related information and associated content, it's much easier on the brain to operate this way. And if truth bee told, we actually have very many of those 5000 pdfs uploaded here already.
I don't know how many exactly, but definitely more than half, possibly even three quarters or more.

/CJ
Being well adjusted to a sick society is no measure of one's mental health

Offline carl

  • Global Moderator
  • Founding Wasp
  • *****
  • Posts: 2,571
  • So long, and thanks for all the fish!
Re: Hyperlabs released all their PDFs
« Reply #31 on: December 19, 2019, 05:06:08 PM »
Honestly, it's real cool and all that, but sifting through 5000 pdfs with numbers for titles is not my idea of a fun time.....
I hate to say it, but I downloaded the whole archive a while back and I have found it completely useless.
It is WAAY easier to just sign up on the HyperLab, pick a route or a synthesis and then search the few relevant threads for the aforementioned pdfs, naming them properly while saving them in some semblance of order in a dedicated folder.

The threads also contain much related information and associated content, it's much easier on the brain to operate this way. And if truth bee told, we actually have very many of those 5000 pdfs uploaded here already.
I don't know how many exactly, but definitely more than half, possibly even three quarters or more.

I wholeheartedly agree, I'm signed up on there and gained access to all of their content, and THIS is really useful, while this archive... not so much, actually not at all in my opinion.
I would suggest that you guys share information like it was the last day on Earth.  This information slowdown is all because of all that dumb unwillingness to share.  That is where the DEA is winning.  There goal is you not talking to each other.  Let the information flow.  I  promise we will always be 2 steps ahead of DEA chemists if we just keep sharing information
Quote
Real bees just hear the buzzing and it doesn´t ever stop. Ever.

Offline Hooloovoo

  • Slayer of Poppies and
  • Dominant Wasp
  • ****
  • Posts: 409
  • It's not *my* fault
Re: Hyperlabs released all their PDFs
« Reply #32 on: December 19, 2019, 05:23:52 PM »
Someone should spider the website as a true mirror while it still exists in a format that allows it to be browsed as if logged in, online using wget or whatever.

You'd have a more up to date version of the forum archive in a better format, and you'd archive/compress it for distribution/seeding in a format with error correction like RAR/PAR2 files or whatever.
« Last Edit: December 19, 2019, 05:27:21 PM by Hooloovoo »
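
On the corrupted-part complaint: a checksum manifest is a lightweight companion to whatever archive format is used. The sketch below (Python 3, standard library only) records a SHA-256 digest per file and later reports which files no longer match. It only detects damage; repairing it is what recovery data such as PAR2 is for. The function names are hypothetical.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archive parts need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, 'rb') as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(top):
    """Map each file's path (relative to top) to its SHA-256 hex digest."""
    manifest = {}
    for root, dirs, files in os.walk(top):
        for filename in files:
            path = os.path.join(root, filename)
            manifest[os.path.relpath(path, top)] = sha256_of(path)
    return manifest


def find_damaged(top, manifest):
    """Return the relative paths that are missing or whose digest changed."""
    damaged = []
    for rel_path, expected in sorted(manifest.items()):
        path = os.path.join(top, rel_path)
        if not os.path.isfile(path) or sha256_of(path) != expected:
            damaged.append(rel_path)
    return damaged
```

Whoever seeds the archive would publish the manifest next to it; anyone downloading can then run `find_damaged` on their copy and re-fetch only the parts it lists.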