The UniProt Archive (UniParc) is a comprehensive, non-redundant database that contains most of the publicly available protein sequences in the world. UniParc merges many databases into one at the sequence level, so searching UniParc is equivalent to searching all of those databases simultaneously. UniParc is the resource to use when a comprehensive repository of protein sequences is needed, because it stores only the sequences. Any other information about a protein must be obtained from the source databases by following the database cross-references.
Each unique sequence in UniParc is stored only once and receives a stable identifier consisting of the prefix UPI followed by ten hexadecimal digits, e.g. UPI000000000A.
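The identifier format can be checked mechanically. The following is a minimal sketch in Python, not part of UniParc itself; the function names are hypothetical, and accepting only uppercase hexadecimal digits is an assumption based on the example identifier.

```python
import re

# Pattern inferred from the description: "UPI" followed by ten hexadecimal digits.
# Restricting the digits to uppercase A-F is an assumption taken from the example.
UPI_PATTERN = re.compile(r"UPI[0-9A-F]{10}")


def is_valid_upi(identifier: str) -> bool:
    """Return True if the whole string matches the UPI format."""
    return UPI_PATTERN.fullmatch(identifier) is not None


def upi_to_int(identifier: str) -> int:
    """Interpret the ten hexadecimal digits after the prefix as an integer."""
    if not is_valid_upi(identifier):
        raise ValueError(f"not a UPI-style identifier: {identifier!r}")
    return int(identifier[3:], 16)
```

For example, is_valid_upi("UPI000000000A") is true, while an identifier with only nine digits after the prefix is rejected.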
Sequence similarity searches against UniParc can be run with FASTA or BLAST.
The same protein may exist in several source databases, and in multiple copies within a single database. UniParc avoids this redundancy by storing each unique sequence only once and giving it a stable and unique identifier (UPI), which makes it possible to identify the same protein across different source databases. A UPI is never removed, changed or reassigned. UniParc contains only protein sequences. All other information about the protein must be retrieved from the source databases using the database cross-references.
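The idea of storing each unique sequence once can be illustrated with a small in-memory sketch. This is not how UniParc is implemented; the class name, the sequential numbering of identifiers and the case normalisation are all assumptions made for illustration.

```python
from typing import Dict


class SequenceArchive:
    """Toy archive in which every distinct sequence gets exactly one identifier."""

    def __init__(self) -> None:
        self._upi_by_sequence: Dict[str, str] = {}
        self._next_number = 1  # hypothetical counter; the real allocation scheme is not described

    def register(self, sequence: str) -> str:
        """Return the identifier for a sequence, creating one only if it is unseen."""
        key = sequence.strip().upper()  # assumption: comparison ignores case and outer whitespace
        existing = self._upi_by_sequence.get(key)
        if existing is not None:
            return existing  # same sequence from another source: reuse the identifier
        upi = f"UPI{self._next_number:010X}"
        self._next_number += 1
        self._upi_by_sequence[key] = upi  # identifiers are never removed or reassigned
        return upi
```

Registering the same sequence from two different sources returns the same identifier both times, which is what allows records from different databases to be recognised as the same protein.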
The basic information stored within each UniParc entry is the identifier, the sequence, a cyclic redundancy check (CRC) checksum, the source database(s) with accession and version numbers, and a time stamp. If a UniParc entry does not have a cross-reference to a UniProtKB entry, the reason for the rejection of that sequence from UniProtKB is provided. In addition, each source database accession number is tagged with its status in that database, showing whether the sequence still exists there or has been deleted, and with cross-references to the NCBI GI number and the taxonomic identifier (TaxID) where appropriate.
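The fields listed above map naturally onto a small record type. The sketch below is illustrative only: the class and field names are hypothetical, and because the text does not say which CRC variant is used, Python's zlib.crc32 serves as a stand-in.

```python
import zlib
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class SourceRecord:
    """One source-database accession attached to an entry (hypothetical layout)."""
    database: str
    accession: str
    version: Optional[int] = None   # not every source database supplies a version
    active: bool = True             # False once the source has deleted the sequence
    tax_id: Optional[int] = None    # taxonomic identifier, where appropriate


@dataclass
class ArchiveEntry:
    """Identifier, sequence, checksum, sources and time stamp for one sequence."""
    upi: str
    sequence: str
    sources: List[SourceRecord] = field(default_factory=list)
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    @property
    def checksum(self) -> str:
        # Stand-in checksum: CRC-32 rendered as eight hexadecimal digits.
        return f"{zlib.crc32(self.sequence.encode('ascii')):08X}"
```

Two entries holding the same sequence always produce the same checksum, which gives a cheap first check before full sequences are compared.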
Each sequence entering UniParc is compared with all existing sequences. Sequences that are no longer part of any source database are excluded from sequence-based searches, but they remain available for text-based searches to support the curation of incoming proteins.
Sequence changes in the source databases can be tracked by sequence versions. Some source databases provide their own version numbers, while others do not. UniParc therefore maintains its own internal versioning, which is updated over time. Each time a sequence is changed, the internal version is incremented by one, making it possible to track sequence changes in all source databases.
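The internal versioning rule can be sketched as a counter kept per source accession. The class and method names are hypothetical, and starting the count at 1 is an assumption; the text only states that the version increases by one on each change.

```python
from typing import Dict, Tuple


class VersionTracker:
    """Keeps an internal version number for every (database, accession) pair."""

    def __init__(self) -> None:
        # (database, accession) -> (latest sequence seen, internal version)
        self._state: Dict[Tuple[str, str], Tuple[str, int]] = {}

    def observe(self, database: str, accession: str, sequence: str) -> int:
        """Record the sequence currently held by a source and return its internal version."""
        key = (database, accession)
        previous = self._state.get(key)
        if previous is None:
            version = 1                   # first sighting (assumed starting value)
        elif previous[0] == sequence:
            version = previous[1]         # unchanged sequence keeps its version
        else:
            version = previous[1] + 1     # changed sequence: increment by one
        self._state[key] = (sequence, version)
        return version
```

Because the counter does not depend on the source database supplying its own version, the same mechanism works for every source.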
A database cross-reference links a protein sequence to its database of origin. It contains a source database identifier, the sequence identifier from the source database, the sequence version from the source database (if any), the UniParc sequence version, and a flag showing whether or not the cross-reference is still active. A new cross-reference is created when a new or updated protein enters UniParc. When a sequence is changed in or deleted from a source database, its corresponding cross-reference is marked as deleted.
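The life cycle of a cross-reference described above can be sketched as follows. This is an illustration under stated assumptions rather than the actual UniParc data model: all class, method and field names are hypothetical, and the database name used in the usage note is a placeholder.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class CrossReference:
    database: str                   # source database identifier
    accession: str                  # sequence identifier in the source database
    upi: str                        # archive identifier of the linked sequence
    source_version: Optional[int]   # version from the source database, if any
    internal_version: int           # archive's own sequence version
    active: bool = True             # False once marked as deleted


class CrossReferenceTable:
    """Keeps every cross-reference ever created; old ones are flagged, not erased."""

    def __init__(self) -> None:
        self.rows: List[CrossReference] = []
        self._active: Dict[Tuple[str, str], CrossReference] = {}

    def load(self, database: str, accession: str, upi: str,
             source_version: Optional[int] = None) -> CrossReference:
        """Register a new or updated protein arriving from a source database."""
        key = (database, accession)
        current = self._active.get(key)
        if current is not None and current.upi == upi:
            return current                    # sequence unchanged: nothing to do
        internal_version = 1
        if current is not None:
            current.active = False            # sequence changed: retire the old link
            internal_version = current.internal_version + 1
        xref = CrossReference(database, accession, upi, source_version, internal_version)
        self.rows.append(xref)
        self._active[key] = xref
        return xref

    def delete(self, database: str, accession: str) -> None:
        """Mark the cross-reference as deleted when the source drops the record."""
        current = self._active.pop((database, accession), None)
        if current is not None:
            current.active = False
```

Loading the same accession twice with different archive identifiers leaves two rows in the table: the first marked inactive and the second active with internal version 2. Keeping retired rows is what preserves the history of a source record.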
Currently UniParc contains protein sequences from the following publicly available databases:
EMBL-Bank/DDBJ/GenBank nucleotide sequence databases
Ensembl
EnsemblGenomes
European Patent Office
FlyBase
International Protein Index
Japan Patent Office
Korean Intellectual Property Office
Pathosystems Resource Integration Center
Protein Data Bank (PDB), etc.