
View Full Version : Naming Patent Files for the FTP


nbk2000
February 22nd, 2005, 02:59 PM
While sorting through my offline copy of the FTP, specifically the Patents folder, I discovered a large number of duplicate patents had been uploaded, each with different names and in different formats.

This is not only wasteful of bandwidth and storage, but also makes searching through them in a useful manner nearly impossible.

The Problem

For instance:

Firearm.pdf

is totally useless for sorting, as we don't even know if it belongs in the Patents folder!

Patent.pdf

doesn't tell us shit-all about what it's about.

Firearm Patent

at least tells us it IS a patent, but the category of 'Firearm' is so all-encompassing as to make it moot for searches.

788,866 Firearm.pdf

is only slightly better, as we can now assume (perhaps incorrectly, since it might be missing a digit) that it's an OLD patent. But unless it's already in a patent folder, we don't know that it IS a patent, as the word 'Patent' isn't included. Unless retrosynthetic opens it and sees that it IS a patent, or someone informs him that it's not...who knows?

788,866 Firearm Patent.pdf

is getting better, but the numeric prefixing would result in a patent from the early 20th century being placed after patents from the 21st century, as 7 comes after 6, even if the 7 is part of a six digit patent number (early 20th century), and the 6 part of a seven digit number (early 21st century). :rolleyes:

Also, the , (comma) symbol makes it impossible to search for a unique patent number, as it's not a searchable character, so ANY patent containing 788 or 866 would bring up a hit, which if you have hundreds of patents, needlessly complicates searches. So REMOVE any spaces, commas, or other symbols from the numbers of patents.

Patent 0788866 Firearm.pdf

is getting better, as a search for 'patent' will pull it up, and it'll be sorted in proper order, and a search for the specific patent number will result in a unique hit, but the title still doesn't tell what KIND of firearm it is.
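The sorting problem and the fix can be sketched in a few lines of Python. This is only an illustration: the function name normalize_patent_number and the sample numbers are my own picks (the numbers are borrowed from filenames elsewhere in this post), not anything the FTP actually runs.

```python
import re

def normalize_patent_number(raw: str, width: int = 7) -> str:
    """Strip everything that is not a digit, then left-pad with zeros to `width`."""
    digits = re.sub(r"\D", "", raw)
    return digits.zfill(width)

raw_numbers = ["788,866", "6,502,657", "1,933,754"]

# A plain text sort compares character by character, so the old
# six-digit patent lands AFTER the seven-digit ones.
print(sorted(raw_numbers))

# Once the commas are gone and the number is padded to seven digits,
# text order and numeric order agree.
print(sorted(normalize_patent_number(n) for n in raw_numbers))
```

The padded form is also a single unbroken token, so a search for 0788866 only hits that one patent.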

Patent 0788866 Pen Gun.pdf

is better still, as we now know that it is an old patent about a pen-gun. But is it an old American patent, an old British patent, or a recent WO patent?

Country of Origin Prefixes

prevent this confusion.

US Patent 0788866 Pen Gun.pdf

tells us that this is a United States or US patent.

If it had been a British Pen Gun patent, then the prefix GB (Great Britain) would be used, as in:

GB Patent 0788866 Pen Gun.pdf

German patents are prefixed with DE, as in:

DE Patent 0788866 Pen Gun.pdf

and a World Patent, WO, as in:

WO Patent 0788866 Pen Gun.pdf

You can, if you want, skip using the symbols between the country prefix and the word 'Patent', as in:

USPatent-0788866-Pen_Gun.pdf
GBPatent-0788866-Pen_Gun.pdf
WOPatent-0788866-Pen_Gun.pdf
DEPatent-0788866-Pen_Gun.pdf

but then a person would have to remember to use a wildcard prefix before 'Patent', such as:

*patent

when doing a search on their computer for patents, as 'USPatent' is NOT a 'whole' word as far as a search for 'patent' is concerned.

Sometimes patents are downloaded from the ETO site with multiple zeros in the numeric prefix, as in:

WO00788866

Remove any zeros, from left to right, until you have a seven digit numeric prefix.
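As a rough sketch of that trimming rule in Python (trim_to_seven is a made-up name, and the assumption that the code is always two letters followed by digits is mine):

```python
import re

def trim_to_seven(code: str) -> tuple:
    """Split a code like 'WO00788866' into (country, seven-digit number)."""
    match = re.fullmatch(r"([A-Za-z]{2})(\d+)", code)
    if match is None:
        raise ValueError(f"not a two-letter prefix followed by digits: {code!r}")
    country, digits = match.groups()
    # Drop zeros from the left, but only while more than seven digits remain.
    while len(digits) > 7 and digits[0] == "0":
        digits = digits[1:]
    # Short numbers are padded back up to seven digits.
    return country.upper(), digits.zfill(7)

print(trim_to_seven("WO00788866"))
```

A number that is still longer than seven digits after the zeros run out is left alone here, since the rule above doesn't say what to do with it.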

UPLOADING and DOWNLOADING from the FTP

But now, when it's uploaded to the FTP, it'll be saved as this:

US%20Patent%200788866%20Pen%20Gun.pdf

because the spaces are replaced by %20 on FTPs.

Use of symbols to replace the spaces, as in:

US_Patent-0788866-Pen_Gun.pdf

prior to uploading it to the FTP, keeps it readable once downloaded, and a macro can easily replace the _ and - symbols with spaces once downloaded, if you so choose.
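Here is a small Python sketch of both halves of that: what the %20 encoding does to a name with spaces, and the kind of macro that turns the symbols back into spaces. The helper name to_readable is hypothetical, and quote() from the standard library is only standing in for whatever encoding your client or browser applies.

```python
import os
from urllib.parse import quote

def to_readable(filename: str) -> str:
    """Turn the _ and - separators back into spaces, leaving the extension alone."""
    stem, ext = os.path.splitext(filename)
    return stem.replace("_", " ").replace("-", " ") + ext

spaced = "US Patent 0788866 Pen Gun.pdf"
safe = "US_Patent-0788866-Pen_Gun.pdf"

print(quote(spaced))      # every space becomes %20
print(quote(safe))        # nothing needs encoding, name is unchanged
print(to_readable(safe))  # back to the spaced form after download
```

Note that this simple version also turns a hyphen INSIDE a title into a space, so only run it if you don't mind that.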

Relevant Titles

There are more than a dozen 'Pen Gun' patents on the FTP, but only a couple were actually called what they were. They were called 'Gas Projector', 'Signal Device', and other verbose but inaccurate titles. The exact projectile which the device fires is irrelevant if it is being fired from a 'Pen Gun'.

If it's capable of shooting a bullet, and is in the shape of a pen, then it's a 'Pen Gun'. If it's shaped like a pen, but is only capable of launching signal flares, then it is NOT a 'Pen Gun', but a 'Pen Flare Launcher', as it's not a gun in the sense of firing a lethal projectile from a barrel.

Just as there were numerous pen gun patents under various names, so too were there numerous patents related to the launching of a projectile from the muzzle of a shotgun, projectiles that did not originate from inside of a shotgun shell.

These are NOT duck-decoy launchers, as the FTP is not related to hunting (well, maybe humans...;)), so the purpose of the patent is NOT as a duck-decoy launcher, but as a 'Shotgun Spigot Grenade Launcher' or a 'Muzzle Mounted Shotgun Cup Grenade Launcher', as the relevant titles are now.

Production of Pentaerythritolpentanitrate
Nitration of Pentaerythritol
Nitration Process (this one was real specific! :rolleyes: )

and all the rest equate to PETN. Not 'making PETN' or 'Preparation of PETN' or 'Production of PETN', but just simply PETN, as that is the end result, regardless of the steps leading up to it.

If the patent isn't specifically about the making of PETN, but some variant or purification, then the names should reflect this:

US_Patent-1933754-Purification_of_PETN.pdf
US_Patent-2204059-Crystallizing_PETN.pdf
US_Patent-3408383-PETN-Trinitrate_Salt.pdf
US_Patent-3520744-Free_Flowing_PETN.pdf

If there's a choice between using HMX or RDX, then use RDX, as that's the more common of the two acronyms.

Always use the most common acronym because, while calling a substance 1,3,5-trinitrohexahydro-s-triazine may be technically accurate, it doesn't help anyone looking for RDX, as well as being a pain in the ass to type out.

US6502657-Transformable Vehicle.pdf was a mystery till I opened it and recognized it as what it is...a 'Throwbot' developed by MIT for use as a recon robot to be used by soldiers in MOUT war.

Hence US6502657-Remote Control Robot Grenade-'Throwbot'.pdf

because it is remotely controlled, it is a 'robot', it is shaped and thrown like a 'grenade', and it is known in the trade as a 'throwbot'.

If you don't feel up to creating a relevant title, then as a MINIMUM, use the title of the patent and some common-sense description of what it's about, and let someone more capable do the job for you.

HTML

Personally, this is my preferred way of saving a patent, as it's compact, easily searched, editable, and can be easily converted to other formats.

When saving the text of a patent as an .HTML file, please don't use the .MHT or similar 'all in one' format to save it. When saving an .HTML file from the www.uspto.gov site as an .MHT file, you not only save the useful text, but also all the useless navigation buttons that add nothing but bandwidth and storage overhead to the FTP.

Save the .HTML as an 'HTML Only' file.

Do NOT save it as a 'Text Only' file, as this results in a jumbled mess of no use to anybody.

Once saved, don't archive it, as the file size is minimal anyway, and most FTP clients and servers automatically compress such files prior to transmission.

An exception to the use of HTML is when there are data tables. Unfortunately, the vast majority of these get horribly mangled by the patent servers, rendering them useless, in which case a PDF or image would be more appropriate to preserve the table formatting.

Images

Patents earlier than 1971 are downloadable only as .TIF format graphic files from the www.uspto.gov site, so these should be properly named as previously described, with the addition of a single digit numeric page suffix, if there are FEWER than ten page images, as in:

US_Patent-0788866-Pen_Gun1.tif
...
US_Patent-0788866-Pen_Gun9.tif

If there are MORE than nine page images, then use a two digit suffix, as in:

US_Patent-0788866-Pen_Gun01.tif
...
US_Patent-0788866-Pen_Gun09.tif
US_Patent-0788866-Pen_Gun10.tif
US_Patent-0788866-Pen_Gun11.tif

If you don't, then the pages get sorted like this:

US_Patent-0788866-Pen_Gun1.tif
US_Patent-0788866-Pen_Gun10.tif
US_Patent-0788866-Pen_Gun11.tif
US_Patent-0788866-Pen_Gun2.tif
US_Patent-0788866-Pen_Gun3.tif
...

And this is NOT very readable. :p
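For anyone renaming pages in bulk, the suffix rule boils down to "pad every page number to the width of the highest page number". A minimal Python sketch, where page_names is a hypothetical helper and extending the rule to three digits for 100+ pages is my own assumption:

```python
def page_names(base: str, page_count: int, ext: str = ".tif") -> list:
    """Return page filenames whose numeric suffix is padded to a common width."""
    # One digit for up to 9 pages, two for 10 to 99, and so on.
    width = len(str(page_count))
    return [f"{base}{page:0{width}d}{ext}" for page in range(1, page_count + 1)]

names = page_names("US_Patent-0788866-Pen_Gun", 11)
print(names[0], names[-1])

# With padded suffixes, a plain text sort keeps the pages in reading order.
print(sorted(names) == names)
```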

Archiving

Once the page images are properly named, compress them into a single archive file, such as .ZIP or .RAR, with the archive file being the properly formatted name, with the suffix -Images appended so that we know that it contains images, and not just a compressed .PDF or .HTML file.

A .ZIP file containing page images of US_Patent-0788866-Pen_Gun01.tif through US_Patent-0788866-Pen_Gun12.tif would be called US_Patent-0788866-Pen_Gun-Images.zip
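If you'd rather script it, something like the following Python sketch would do the job with the standard zipfile module. The function name archive_pages is made up, and the demo uses empty placeholder files in a temporary folder rather than real page images.

```python
import tempfile
import zipfile
from pathlib import Path

def archive_pages(page_files, base_name: str, out_dir) -> Path:
    """Zip the page images into <base_name>-Images.zip inside out_dir."""
    archive_path = Path(out_dir) / f"{base_name}-Images.zip"
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for page in sorted(page_files):
            # Store only the filename, not the folder path it came from.
            zf.write(page, arcname=Path(page).name)
    return archive_path

# Demo: twelve empty stand-in "pages" in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    base = "US_Patent-0788866-Pen_Gun"
    pages = []
    for n in range(1, 13):
        page = Path(tmp) / f"{base}{n:02d}.tif"
        page.touch()
        pages.append(page)
    archive = archive_pages(pages, base, tmp)
    archive_name = archive.name
    with zipfile.ZipFile(archive) as zf:
        members = zf.namelist()

print(archive_name)
print(len(members), "pages archived")
```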

Spelling

As always, proper spelling is of vital importance, as transposition of just two letters can cause a search for RDX to miss the file named RXD, which may have had the very thing they were looking for.

OCR

PDF image files are not text searchable unless they are first OCR'd. You could do this yourself, and it would be appreciated, BUT unless you're going to do a perfect job of it (meaning proofreading literally EVERY word and correcting EVERY error), then please don't do it. A sloppy OCR job encourages lazy errors in the reader, who'll just copy the incorrect text without verifying that it is correct, like they'd have to do if they hand copied it from an image.

Let the user OCR it themselves if they want a searchable copy.

Though all this can be obviated by using an HTML version of the patent if such a copy is available from the originating patent server, like the VAST majority of the patent .PDF files on the FTP are, HTML being much more compact as well as easily searchable.

SUMMARY

So, in closing, name patents as follows:

[country of origin code: US, GB, DE, WO]_Patent-[Seven Digit patent number, with six digit patents being prefixed by 0]-[Title of Patent, or clarified version, with underscore _ between each word].[file extension: .HTM, .DJVU, .PDF, .ZIP]
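Put together as code, the template might look like the Python sketch below. Everything here is illustrative: build_patent_filename and COUNTRY_CODES are names I made up, the function only knows the four country codes listed above, and lowercasing the extension is my assumption based on the example filenames in this post.

```python
import re

COUNTRY_CODES = {"US", "GB", "DE", "WO"}

def build_patent_filename(country: str, number, title: str, ext: str) -> str:
    """Assemble [country]_Patent-[7 digits]-[Title_With_Underscores].[ext]."""
    country = country.upper()
    if country not in COUNTRY_CODES:
        raise ValueError(f"unknown country code: {country!r}")
    # Remove commas, spaces and other symbols, then pad to seven digits.
    digits = re.sub(r"\D", "", str(number))
    if not digits or len(digits) > 7:
        raise ValueError(f"expected a patent number of 1 to 7 digits: {number!r}")
    words = title.split()
    if not words:
        raise ValueError("title must not be empty")
    extension = ext.lstrip(".").lower()
    return f"{country}_Patent-{digits.zfill(7)}-{'_'.join(words)}.{extension}"

print(build_patent_filename("us", "788,866", "Pen Gun", ".PDF"))
print(build_patent_filename("US", 1933754, "Purification of PETN", "pdf"))
```

It refuses numbers longer than seven digits instead of guessing which zeros to drop, so trim those by hand first.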

Peeves

The DjVu (.DJVU) format is useless for patent archiving, as there's no way to search a PICTURE for TEXT, and OCR is likely impossible too. So, instead of being able to do a keyword search through the files, you have to manually open EACH and EVERY one that MIGHT have what you are looking for. Even if the filenames are accurate, this still makes extraction of the content for use difficult, as it has to be manually transcribed.

What an incredibly productive use of our time this is. :mad:

While on the subject of unreadable, who uploaded "High-Impact Terrorism - Priapo"?

Almost 300 pages in an archive, each page as an individual .PDF, and numbered in the poor 1, 10, 2, 20, etc. format.

While there may be some good information in there, and I'll eventually get around to getting it PROPERLY sorted...and compiled into a SINGLE .PDF...and OCR'D...it's unreadable 'till then. Thank you. :rolleyes:

tmp
September 28th, 2005, 01:29 PM
An excellent piece of advice! Documents in this format should move a lot faster, and more importantly, cleaner, across the net. I'll advise the members on my FTP to do the same. Thank you! :D

sprocket
May 25th, 2006, 11:03 PM
This is a very good thing. I was thinking there should be a similar naming convention for journal articles. However an informative name also becomes a very long name.

Let's say you wanted to upload "Investigation of an N-Butyl-N-(2-Nitroxyethyl)Nitramine (BuNENA) Process: Identification of Process Intermediates, By-Products and Reaction Pathways" published in the 2nd issue of Propellants, Explosives, Pyrotechnics in 2006. The title alone is 148 characters long, which wouldn't make a good filename.

So you need a shorter title. Let 10 people come up with a shorter version of it and you'll probably have 10 different titles. A unique code for each article is needed, like patent numbers.

This is where Digital Object Identifiers (DOI) come into the picture. To quote doi.org "The Digital Object Identifier (DOI) is a system for identifying content objects in the digital environment. DOIs are names assigned to any entity for use on digital networks. They are used to provide current information, including where they (or information about them) can be found on the Internet. Information about a digital object may change over time, including where to find it, but its DOI will not change."

The DOI code for the aforementioned article is 10.1002/prep.200600018. One problem is that the DOI code contains illegal characters for filenames, in particular the slash. However, the first part of it represents the journal and the second part the article. A simple solution to the problem is to put the first part in the folder name, like "10.1002-(Propellants,Explosives,Pyrotechnics)", with the folder then containing the articles named like "prep.200600018-BuNENA_Process".
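A quick Python sketch of that split, just to show the idea. The function name doi_to_path is hypothetical, and replacing any extra slashes in the second part with underscores is my own assumption, since the example DOI only has one.

```python
def doi_to_path(doi: str, journal: str, short_title: str) -> tuple:
    """Split a DOI at its first slash into (folder name, file name)."""
    prefix, slash, suffix = doi.partition("/")
    if not slash or not prefix or not suffix:
        raise ValueError(f"not a DOI of the form prefix/suffix: {doi!r}")
    # Folder: DOI prefix plus the journal name with the spaces squeezed out.
    folder = f"{prefix}-({''.join(journal.split())})"
    # File: DOI suffix plus a short title with underscores between words.
    # Assumption: any further slashes in the suffix are swapped for underscores.
    filename = f"{suffix.replace('/', '_')}-{'_'.join(short_title.split())}"
    return folder, filename

folder, filename = doi_to_path(
    "10.1002/prep.200600018",
    "Propellants, Explosives, Pyrotechnics",
    "BuNENA Process",
)
print(folder)
print(filename)
```

The short title still has to be picked by a human, but the DOI part keeps the name unique even if ten people pick ten different titles.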

This is just one thought.