I’m going to let you in on a little secret: there’s a reason why we are the market leader in patent sequence search. It has surprisingly little to do with our user-friendly search interface, our stellar customer support, or our good looks. While (at least some of) these things certainly help, it is the content that can only be found in our GQ-Pat database of patent sequences that makes the real difference. Think all patent sequence databases are the same? Let me explain what I mean in some more detail.

When, for example, a life science patent application is filed at the USPTO, the office asks that the inventor put all sequences into a nicely formatted list. This so-called “ST.25 listing” helps the examiners with their workflow and makes it straightforward to collect all sequences submitted to the office over time. In an ideal world, every inventor and every patent office would list sequences like this and that would be the end of it.

Unfortunately, very few patent offices in the world actually have official sequence-filing rules, and even when they do have them, they’re frequently ignored. As a result, sequences can be found anywhere in a patent: inside the text, in tables, or even as part of the figures. If your search only spans the ST.25 sequence listings, you’re going to miss out on a lot of them!

You might be wondering, “so how big is this problem, really? How many sequences can be found outside of the official ST.25 listings?”

I’m afraid that the answer is “a whole lot”. If you don’t know what you’re doing, you can easily miss more than 38% of US, WO, EP, JP, and KR patent documents with sequences in them. How sure are we that this is the right number? It is based on a massive amount of internal data, which we confirmed by comparing our number of PNs with the number found in products based on ST.25 listings, like US Gene, WO Gene, and PatSeq.

Sure, anyone can download ST.25 sequence listings and index them for BLAST search. But this approach will cause you to miss out on real, critical-to-your-company sequences. Why? Because all those sequences that are located in text, tables, and figures aren’t indexed!

Still not convinced that you’re missing documents that are relevant to you? Here’s a list with some of the largest patent assignees in our patent sequence database, and the percentage of patent applications they filed with sequences hidden in the text, tables, and figures. These are all documents that would never be found anywhere except in the GQ-Pat database.

Patent Assignee (normalized)    Patents filed with sequences hidden in text, tables, and figures
ABBOTT                          41.33%
SANOFI                          39.32%
PFIZER                          27.83%
BAYER                           20.42%
TAKEDA                          19.27%
ROCHE                           12.85%
BASF                            10.31%
AMGEN                           8.53%

How does GQ Life Sciences make sure these documents are found? Over the last five years, we have invested millions of dollars and countless hours to find every last sequence that is out there. We use proprietary algorithms to flag documents with even a minuscule chance of containing sequences. That set is then manually curated to capture all of the sequences in them. Our human curators also verify additional information like the SEQ ID NO and whether a sequence is mentioned in the claims or not. We have examined the entire backfile of over 100 million historical patents for sequences.
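To give a feel for what "flagging" a document might involve, here is a minimal sketch in Python. It is not our proprietary algorithm; it is a toy heuristic that looks for long uninterrupted runs of nucleotide or amino acid letters in free text. The function name and the length thresholds are assumptions made up for this illustration, and a real pipeline would also have to cope with tables, figures, spacing, numbering, and three-letter amino acid codes.

```python
import re

# Hypothetical thresholds, chosen only for this illustration.
MIN_NUCLEOTIDE_RUN = 15
MIN_PROTEIN_RUN = 20

# Runs of nucleotide letters (either case) and of the 20 standard
# one-letter amino acid codes (upper case only, to avoid ordinary words).
NUCLEOTIDE_RUN = re.compile(r"[ACGTUN]{%d,}" % MIN_NUCLEOTIDE_RUN, re.IGNORECASE)
PROTEIN_RUN = re.compile(r"[ACDEFGHIKLMNPQRSTVWY]{%d,}" % MIN_PROTEIN_RUN)


def find_candidate_sequences(text):
    """Return (kind, start, sequence) tuples for sequence-like runs in text."""
    candidates = []
    for match in NUCLEOTIDE_RUN.finditer(text):
        candidates.append(("nucleotide", match.start(), match.group().upper()))
    for match in PROTEIN_RUN.finditer(text):
        # A, C, G, and T are also amino acid codes, so skip runs that
        # consist entirely of nucleotide letters; those are already reported.
        if NUCLEOTIDE_RUN.fullmatch(match.group()):
            continue
        candidates.append(("protein", match.start(), match.group()))
    # Report candidates in the order they appear in the text.
    return sorted(candidates, key=lambda candidate: candidate[1])


example = (
    "The forward primer 5'-ATGCGTACCGGTTAGCATGCA-3' binds upstream; "
    "the peptide MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ was expressed."
)
for kind, start, sequence in find_candidate_sequences(example):
    print(kind, start, sequence)
```

A heuristic like this only produces candidates. Deciding whether a run really is a sequence, and which SEQ ID NO it belongs to, is exactly the part that still needs human curation.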

Of course, we haven’t stopped there – we continuously process the new patents that are being published to ensure our database contains the most up-to-date sequence information available. In addition to 38% more US, WO, EP, JP, KR patents, we have also indexed 153,000 documents from authorities in China, Canada, India, and Brazil that you will not easily find anywhere else.

So while it’s the content of our database that makes us the Patent Sequence Search leader, it’s our people that make sure customers get the most out of it. If you’ve never worked with us and think it’s time to give us a try, sign up for a free trial account today and you’ll get access to a team of application scientists that can show you how to get better results in less time.

After 15 years in this field, and after writing about this many times, I’m still shocked when I see IP professionals who use BLAST for their sequence searches. BLAST is a crude and unreliable way to align sequences, and under normal circumstances it shouldn’t be used for anything patent related. There, I said it.

Sequences play a central role in the business models of many life science companies, so I just don’t get why people spend good money on a commercial database like STN, GeneSeq, or SequenceBase and are still stuck with BLAST, or something similarly flawed such as Smith & Waterman or FASTA, as their only real search option. So why is BLAST such a problem in patent searching? It comes down to two major issues.

BLAST Asks the Wrong Question

BLAST was written to compute the evolutionary distance between two sequences, an important biological question but hardly ever relevant to patents. You don’t want to know which other sequences also have a kinase domain in them. You want to know which sequences in the database have 70% or more nucleotides in common with your entire query sequence. No amount of clever result filtering on the percentage identity and alignment length is going to compensate for BLAST’s local alignment behavior (see below). Alignments will come out wrong or will simply be missed. Smith & Waterman and FASTA do this in the exact same way and are just as unsuitable as BLAST for patent searching.
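A small worked example makes the difference concrete. The sketch below, in Python, compares the percentage identity a local alignment reports (matches divided by the length of the aligned region) with the identity measured against the entire query. The function names and the numbers are hypothetical, chosen only for illustration.

```python
def local_identity(matches, alignment_length):
    """Percent identity as a local alignment reports it: matches / aligned region."""
    return 100.0 * matches / alignment_length


def query_identity(matches, query_length):
    """Percent identity measured against the entire query sequence."""
    return 100.0 * matches / query_length


# Hypothetical hit: a 100-nucleotide query where the local alignment
# covers only 40 positions, 38 of which match.
matches, alignment_length, query_length = 38, 40, 100

print(local_identity(matches, alignment_length))  # 95.0
print(query_identity(matches, query_length))      # 38.0
```

The hit looks like a 95% match, but only 38% of the query is accounted for, far from a 70% threshold over the whole query. Note that the second number is only a lower bound: the local alignment simply never aligned the other 60 positions, so matches in those flanks are unknown. That is why recalculating percentages after the fact cannot fully repair a local alignment result.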

BLAST is Not Reproducible

To go faster, BLAST uses lots of statistical tricks and heuristic shortcuts. This means that algorithm parameters have to be carefully tweaked by an expert for each type of search, and that results can suddenly disappear the next time the same search is done. Especially when searching with short sequences like primers, probes, and CDRs, it’s extremely common to miss real, relevant-to-your-business hits. It goes without saying that in IP you need accurate, complete, and reproducible search results, and BLAST is just not the tool for that.

We recognized and solved this problem back in 2002 by implementing the GenePast algorithm and publishing it in Nature Biotechnology [1]. GenePast was specifically designed for patent sequence searching and has none of the issues that BLAST, Smith & Waterman, and FASTA have. It aligns the entire query sequence with the smallest possible number of differences, like mismatches and indels. It has no heuristic shortcuts and makes no decisions about the relevance of results on its own. Therefore it produces fully accurate, complete, and reproducible answers to the questions patent searchers need to ask, even when the query sequences are really short.

“So it’s a lot like Needleman & Wunsch?”, I hear you say. No, GenePast isn’t a global alignment algorithm; it is a best-fit algorithm. It doesn’t try to align the whole query sequence to the whole database sequence. Instead it finds the best possible way to fit the query sequence into the database sequence, or the other way around if the database sequence is shorter.

Local Alignment

Part of the Query matches part of the Subject. BLAST, FASTA, and Smith & Waterman.

Global Alignment

All of the Query matches all of the Subject. Needleman & Wunsch and algorithms like it.

Best Fit Alignment

All of the Query is fitted into the Subject. GenePast. Ideal for patent sequence searching.
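For readers who like to see the idea in code, here is a minimal Python sketch of a best-fit alignment. It is not the GenePast implementation; it is the textbook "fitting" (semi-global) dynamic program, which counts the fewest mismatches and indels needed to fit the shorter sequence, in full, somewhere inside the longer one. The function name and the unit cost per difference are assumptions made for this illustration.

```python
def best_fit_differences(query, subject):
    """Fewest mismatches and indels needed to fit the shorter sequence,
    in its entirety, somewhere inside the longer one."""
    # Fit the shorter sequence into the longer one, whichever is which.
    if len(query) > len(subject):
        query, subject = subject, query

    # previous[j]: fewest differences for the query letters seen so far,
    # with the fit ending at subject position j. Row 0 is all zeros
    # because the fit may start anywhere in the subject at no cost.
    previous = [0] * (len(subject) + 1)
    for i in range(1, len(query) + 1):
        # Aligning i query letters against nothing costs i indels.
        current = [i]
        for j in range(1, len(subject) + 1):
            mismatch = previous[j - 1] + (query[i - 1] != subject[j - 1])
            skip_query_letter = previous[j] + 1
            skip_subject_letter = current[j - 1] + 1
            current.append(min(mismatch, skip_query_letter, skip_subject_letter))
        previous = current

    # The fit may also end anywhere in the subject, so take the best end point.
    return min(previous)


print(best_fit_differences("ACGT", "TTACGTTT"))  # exact fit
print(best_fit_differences("ACGT", "TTACCTTT"))  # one mismatch
print(best_fit_differences("ACGT", "TTAGTTT"))   # one indel
```

The three calls print 0, 1, and 1. Two things are worth noticing. Every letter of the shorter sequence is always accounted for, unlike in a local alignment, and the unmatched flanks of the longer sequence cost nothing, unlike in a global alignment, where the four extra letters of TTACGTTT would count as four differences. And because every cell of the table is computed with no shortcuts, the same inputs always give the same answer.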

You don’t have to take my word for it. 18 of the top 20 pharma companies, all of the top five agrochemical companies, and a long list of biotech firms and law firms around the world use GenePast every day and have done so for over a decade. Many major patent authorities, like the EPO, also use GenePast for patent examination. Accounting for over 75% of searches, GenePast is by far the most used algorithm in the GenomeQuest search interface and one of the main reasons why we’re the market leader in patent sequence searching, the other being our patent sequence content, of course.

If you’re into patent sequence searching and want to know more about GenePast, please click here for a free trial account. We’d be happy to let you have a go with it.

[1] Dufresne G., Takács L., Heus H.C., Codani J.J., Duval M. Patent searches for genetic sequences: how to retrieve relevant records from patented sequence databases. Nature Biotechnol. 2002 Dec;20(12):1269-71.

We are again excited to welcome thought leader Dr. Stephen Tedeschi as a guest blogger. Dr. Tedeschi is a Partner at PatentVantage, a leading patent research and strategy firm. He is also an Adjunct Faculty member at the National Institutes of Health’s Foundation for Advanced Education in the Sciences (FAES). For more on Dr. Tedeschi, please see below.

Please be sure to check out Part 1 of this blog series.

Why Searching Patents for Non-Text Information is Crucial

Mechanical, chemical, and life sciences patents, in particular, pose their own challenges to searching. In all of these technologies, inventions can be described in ways that are not searchable through text-based interfaces. For example, these technologies may be disclosed as images, Markush structures, or gene sequences. None of this information is effectively text searchable, if at all. The ability to search them is created by database providers through the addition of indexing and other value-added cross-referencing. This type of manual curation takes time and often means paying to access the curated data, but effective searching is nearly impossible without it.


We are excited to welcome thought leader Dr. Stephen Tedeschi as a guest blogger. Dr. Tedeschi is a Partner at PatentVantage, a leading patent research and strategy firm. He is also an Adjunct Faculty member at the National Institutes of Health’s Foundation for Advanced Education in the Sciences (FAES). For more on Dr. Tedeschi, please see below.

In every stage of innovation, knowledge of current and previous research is critical to developing a clear direction to progress forward. From inception, to basic research, to development, to sales, to product protection, reviewing both patent and technical literature published globally will greatly influence your ability to make technical, legal, and business decisions. It is particularly important at the beginning of the innovation lifecycle, when scientists and engineers are planning a research project. Missing critical prior research can result in repeating others’ research, including unsuccessful projects, not knowing solutions to common challenges already encountered, and losing ownership of any products or licenses resulting from the research already patented. All too often, I’ve seen scenarios where the missing information was readily available in a patent. Here are just a few examples.


When you work with DNA or protein sequences, inevitably, you’re going to run into the challenge of finding similar biological sequences that have been listed in patents. In most cases, you’re likely to know specific mutations at specific positions that you want to search for. The challenge is, how do you define a query that delivers a manageable set of results?


After the United States Supreme Court ruling in Association for Molecular Pathology v. Myriad Genetics in June of 2013, the industry scurried. The Court ruled that naturally occurring DNA is not patent eligible, even if isolated, but that cDNA, or “complementary DNA”, is patent eligible because it is not naturally occurring but rather a product of the laboratory scientist, even though it carries exactly the same nucleic acid information.


Update: GQ-Pat now has over 334 million sequences

Back in July we reported that there were 300 million sequences in GQ-Pat, including 256 million nucleotide sequences and over 45 million protein sequences. And these protein sequences aren’t just automated translations of nucleotides like TrEMBL. All of these sequences are in fact found in patents and patent applications from patent authorities around the world.
