Look it up! Codes Associated with Public Data Sets

Did you know?

  • A US-administered territory (state code FM) became an independent country in 1986: Micronesia.
  • The island nation of Tuvalu has just 11K people, but it has the .TV domain.
  • A jiffy is a real unit of measure, commonly 1/100th of a second in computing.
  • If your doctor treats you for V95.43XS, it’s for the aftereffects of your spacecraft collision.
  • Sports centers are code 713940, except bowling alleys, which are 713950.
  • Of 840 occupations, you could be a Mine Shuttle Car Operator.


“There are two kinds of people in this world ...”

An introductory statistics class will divide data elements into two classes: those that describe what is being measured, and the measurements themselves. Math addresses the measurements. The descriptive elements, though, are a muddle of terms and methods. Data developers and analysts use these names for them (almost) interchangeably:

        ◦  Reference data
        ◦  Lookup data
        ◦  Foreign Key fields
        ◦  Domain values
        ◦  Qualitative data
        ◦  Non Ordered Discrete
        ◦  Nominal
        ◦  Categorical
        ◦  Flags or Sets

This discussion will call it all reference data, and focus on the large share of this information that is public and commonly shared. The article concludes with a list of sources that should be valuable to anyone using public data sets. A following article will discuss the uses of these elements for design, analysis, quality and control.
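To make the two classes concrete, here is a minimal Python sketch of reference data decoding a table of measurements. The two industry codes are the ones quoted at the top of this article; the visit counts and the names `NAICS_TITLES`, `visits` and `describe` are invented for illustration.

```python
# Reference data: a lookup table from code to description.
NAICS_TITLES = {
    "713940": "Fitness and Recreational Sports Centers",
    "713950": "Bowling Centers",
}

# Measurements: hypothetical visit counts, each carrying a code
# that acts as a foreign key into the lookup table.
visits = [
    {"naics": "713940", "visits": 1200},
    {"naics": "713950", "visits": 450},
    {"naics": "999999", "visits": 10},  # a code missing from the reference set
]


def describe(rows, lookup):
    """Attach the description for each code and flag codes with no match."""
    return [
        {**row, "title": lookup.get(row["naics"], "UNKNOWN CODE")}
        for row in rows
    ]


for row in describe(visits, NAICS_TITLES):
    print(row["naics"], row["title"], row["visits"])
```

The unmatched code is kept and flagged rather than dropped, since a measurement whose code is missing from the lookup usually points to a quality problem worth seeing.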

“Everybody wants to rule the world”

The more companies and nations interchange data, the more they need standards.  So, it’s not surprising that major reference data management gravitates to industry associations and public bodies.  But, time is always a problem.  The “diffusion of innovation” with its innovators, early adopters and laggards equally applies to companies and countries. 

  • The nuclear arms race innovated computing in the 1950s
  • The space race innovated telecommunications in the 1960s
  • The internet itself evolved from a defense research project in the 1970s
  • The financial markets were transformed by automation in the 1980s

The authorities and standards of reference data reflect their historical roots as well as these economic tensions. The design and use of reference data deserve careful consideration of public standards, both their origins and their future integration. Today, these organizations are excellent starting points for public reference data:

  • In 1968, the Standard Reference Data Act gave the National Bureau of Standards, now the National Institute of Standards and Technology (NIST), responsibility for domestic sources. Today, NIST focuses on U.S. computing, telecommunications, mathematics and natural sciences.  http://www.nist.gov/srd
  • Many countries have a federal agency dedicated to statistics.  In the US, this responsibility is coordinated primarily between the Office of Management and Budget (OMB) (https://www.whitehouse.gov/omb) and the Census Bureau http://www.census.gov
  • The United Nations compiles global statistics and some global reference data http://unstats.un.org as does the US Central Intelligence Agency in its World Factbook www.ciaworldfactbook.us.  These provide definitions of global measurements and standard ethnicities, languages and semantics.
  • The American National Standards Institute (ANSI) http://www.ansi.org is familiar for its character sets, which assign characters to the 256 values of a computer byte.  ANSI acts as a publisher for more than one hundred industry associations.
  • When reference data is globally standardized, the International Organization for Standardization (ISO) www.iso.org becomes a custodian and provides common maintenance processes.


“These are a few of my favorite things”


Countries

  • The US Federal Information Processing Standards (FIPS) defined country codes in a publication called FIPS 10-4, maintained through the National Institute of Standards & Technology (NIST).  The two-letter FIPS codes were dominant during the early decades of computing, so they can be found in systems built prior to 2000. https://en.wikipedia.org/wiki/List_of_FIPS_country_codes
  • The US FIPS codes were replaced by the Geopolitical Entities, Names, and Codes (GENC) Standard, which implements ISO 3166 for country codes.  This set combines two-letter, three-letter and three-digit identifiers. http://www.iso.org/iso/home/standards/country_codes.htm
  • GS1 standardizes products across countries for supply chain management.  In addition to defining product codes (below), they code countries as company prefixes http://www.gs1.org/company-prefix
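Because the older FIPS codes and the ISO 3166 codes overlap without agreeing, it helps to record which scheme a code belongs to. The sketch below is one way to do that in Python, assuming a small hand-built table. The four rows are real code assignments, but the `Country` layout and the `find_country` helper are hypothetical names chosen for this example.

```python
from typing import NamedTuple, Optional


class Country(NamedTuple):
    name: str
    alpha2: str   # ISO 3166-1 two-letter code
    alpha3: str   # ISO 3166-1 three-letter code
    numeric: str  # ISO 3166-1 three-digit code, kept as text
    fips: str     # legacy FIPS 10-4 two-letter code


# A tiny illustrative subset, not a complete list.
COUNTRIES = [
    Country("United States", "US", "USA", "840", "US"),
    Country("Germany", "DE", "DEU", "276", "GM"),
    Country("Tuvalu", "TV", "TUV", "798", "TV"),
    Country("Micronesia", "FM", "FSM", "583", "FM"),
]

SCHEMES = ("alpha2", "alpha3", "numeric", "fips")


def find_country(code: str, scheme: str) -> Optional[Country]:
    """Look up a code within one named scheme; return None if absent."""
    if scheme not in SCHEMES:
        raise ValueError(f"unknown scheme: {scheme}")
    for country in COUNTRIES:
        if getattr(country, scheme) == code.upper():
            return country
    return None


print(find_country("GM", "fips"))    # Germany under FIPS 10-4
print(find_country("GM", "alpha2"))  # None here; in ISO 3166, GM is Gambia
```

The same two letters can name different countries under the two standards, which is why the lookup asks for the scheme instead of guessing it from the length of the code.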


Languages

  • ISO 639 defines identifiers for languages.  Like country codes, these are common in global or localized web applications.  The earliest codes were 2 characters and biased toward Latin-script, western languages.  The more recent ISO 639-3 list uses 3 characters and includes macrolanguages and many more individual languages.  http://www.iso.org/iso/home/standards/language_codes.htm


Units of Measure

  • The international metric system is called the SI.  It is maintained by the International Bureau of Weights & Measures (BIPM) http://www.bipm.org
  • The US recognizes the metric system but still uses the “customary system,” a close relative of the British “imperial system” that reflects its history in the British Empire. NIST publishes a conversion table of customary to metric measures. http://www.nist.gov/pml/wmd/metric/common-conversion-b.cfm
  • Fans may also refer to the facetious system called the “FFF” for Furlong/Firkin/Fortnight.
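A conversion table is itself reference data: a fixed factor per unit. As a light-hearted sketch, the Python below converts a speed in furlongs per fortnight to SI units. It assumes the usual definitions of a furlong (660 feet, at 0.3048 m per foot) and a fortnight (14 days); the function names are made up for this example, and the firkin is left out.

```python
# Conversion factors, treated as a tiny reference table.
FURLONG_M = 660 * 0.3048         # one furlong in metres
FORTNIGHT_S = 14 * 24 * 60 * 60  # one fortnight in seconds


def furlongs_per_fortnight_to_mps(speed: float) -> float:
    """Convert a speed from furlongs per fortnight to metres per second."""
    return speed * FURLONG_M / FORTNIGHT_S


def mps_to_furlongs_per_fortnight(speed: float) -> float:
    """Convert a speed from metres per second to furlongs per fortnight."""
    return speed * FORTNIGHT_S / FURLONG_M


# One furlong per fortnight is roughly a sixth of a millimetre per second.
print(furlongs_per_fortnight_to_mps(1.0))
```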


Industries

  • Standard Industrial Classification (SIC) codes were introduced by the US federal government during the Great Depression and are maintained by the Department of Labor (OSHA).  https://www.osha.gov/oshstats/index.html The SICs are 4-digit codes that nest major group, industry group and industry level assignments.  SICs are common in data that relies on decades of historical reference, spans the US and Europe, or serves older, established industries and many US federal activities.
  • The North American Industry Classification System (NAICS) was introduced in 1997 and reflects technology and service advancements in world economies.  Since then, the NAICS has largely replaced the SIC in US data publications.  The NAICS classifies business establishments with a hierarchical code of up to six digits.  It is administered by the OMB and Census Bureau http://www.census.gov/eos/www/naics
  • The United Nations maintains its International Standard Industrial Classification (ISIC).  http://unstats.un.org/unsd/publication/seriesM/seriesm_4rev4e.pdf
  • The UK maintains industry codes also abbreviated as SIC, but they are not equivalent to the US SIC codes. https://www.gov.uk/government/publications/standard-industrial-clas...
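Both US systems are hierarchical, so a single code can be rolled up to broader groupings by taking prefixes. The Python sketch below assumes the commonly documented layouts: a six-digit NAICS code read as sector, subsector, industry group, NAICS industry and national industry, and a four-digit US SIC code read as major group, industry group and industry. The function names are hypothetical.

```python
def _require_digits(code: str, length: int) -> None:
    """Raise ValueError unless code is exactly `length` ASCII digits."""
    if len(code) != length or not (code.isascii() and code.isdigit()):
        raise ValueError(f"expected {length} digits, got {code!r}")


def naics_levels(code: str) -> dict:
    """Split a six-digit NAICS code into its nested prefixes."""
    _require_digits(code, 6)
    # Caveat: a few sectors are published as ranges (for example 31-33),
    # so the two-digit prefix alone does not always name the sector.
    return {
        "sector": code[:2],
        "subsector": code[:3],
        "industry_group": code[:4],
        "naics_industry": code[:5],
        "national_industry": code,
    }


def sic_levels(code: str) -> dict:
    """Split a four-digit US SIC code into its nested prefixes."""
    _require_digits(code, 4)
    return {
        "major_group": code[:2],
        "industry_group": code[:3],
        "industry": code,
    }


sports = naics_levels("713940")
bowling = naics_levels("713950")
# The two codes from the top of the article share an industry group.
print(sports["industry_group"] == bowling["industry_group"])
```

Keeping the codes as text rather than integers avoids losing leading zeros and makes the prefix slicing straightforward.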

Race & Ethnicity



Occupations

  • U.S. data uses Standard Occupational Classification (SOC) codes to integrate public data.  The Bureau of Labor Statistics maintains the master SOC System, but you can download the reference data from many dependent agencies including the Census Bureau, Office of Management and Budget and the IRS.   http://www.bls.gov/soc
  • The International Standard Classification of Occupations (ISCO) is similarly maintained by the International Labour Organization (ILO) http://www.ilo.org/public/english/bureau/stat/isco.  This standard is related to the UN’s ISIC and includes both economic and social roles.  Because it spans countries, this standard is more aggregated than the US SOC coding.

Financial Instruments

  • Stock symbols, or ticker symbols, are issued by the market where the securities are traded.  Market data integrators like Marketwatch, Morningstar or Yahoo can provide a combined list.
  • The Committee on Uniform Securities Identification Procedures (CUSIP) issues 9 character identifiers for US and Canadian securities. https://www.cusip.com/cusip/index.htm
  • The UK and Ireland use the Stock Exchange Daily Official List (SEDOL) codes, which play a role similar to the CUSIP.  It can be found here http://www.londonstockexchange.com/products-and-services/reference-...
  • Internationally, the ISO standard 6166 specifies a method for identifying securities specifically for trading and settlement.  The International Securities Identification Number (ISIN) has its own organization issuing a 12-character alphanumeric code for millions of instruments.  The first two characters are the issuing country.  The 3rd through 11th characters are called the NSIN, and the 12th is a check digit. For US securities, the NSIN is the same as the CUSIP above. http://isin.net
  • In addition, Reuters and Bloomberg maintain their own codes, relating stock activity to proprietary news and analysis databases.
  • The financial markets, exchanges and trading platforms have an identifier based on ISO 10383 called the Market Identifier Code (MIC). The MIC is managed by ISO under http://www.iso10383.org
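The ISIN layout described above lends itself to a short worked example. The Python sketch below splits an ISIN into country, NSIN and final character, and checks that final character with the Luhn method commonly used for ISINs (letters are first expanded to numbers, A=10 through Z=35). It also computes a CUSIP check digit with the usual "double every second character" rule. The function names are invented for this example, and the CUSIP helper ignores the rarely used special characters *, @ and #.

```python
def _expand(text: str) -> str:
    """Replace letters with numbers (A=10 ... Z=35); digits stay as they are."""
    return "".join(str(int(ch, 36)) for ch in text)


def luhn_ok(digits: str) -> bool:
    """Luhn test: double every second digit from the right, sum, check mod 10."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def parse_isin(isin: str) -> dict:
    """Split a 12-character ISIN into country, NSIN and check digit."""
    if len(isin) != 12 or not (isin.isascii() and isin.isalnum()):
        raise ValueError(f"not a 12-character alphanumeric code: {isin!r}")
    return {"country": isin[:2], "nsin": isin[2:11], "check_digit": isin[11]}


def isin_is_valid(isin: str) -> bool:
    """Check the structure and the final check digit of an ISIN."""
    parse_isin(isin)  # raises ValueError on a malformed code
    return luhn_ok(_expand(isin))


def cusip_check_digit(first_eight: str) -> int:
    """Compute the ninth CUSIP character from the first eight (no *, @, #)."""
    total = 0
    for position, ch in enumerate(first_eight, start=1):
        value = int(ch, 36)
        if position % 2 == 0:
            value *= 2
        total += value // 10 + value % 10
    return (10 - total % 10) % 10


parts = parse_isin("US0378331005")   # a widely published US ISIN
print(parts, isin_is_valid("US0378331005"))
print(cusip_check_digit(parts["nsin"][:8]))  # should match the NSIN's last digit
```

For a US security the NSIN is the CUSIP, so the last line recomputes the ninth CUSIP character from the first eight and can be compared with the one embedded in the ISIN.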

Product Codes

  • For items in a retail store, look for Universal Product Codes.  The UPC was developed with the automation of store checkouts in the 1970s so it’s naturally associated with bar codes.  See GS1 for global management of UPC, EAN and GTIN codes.  http://www.gs1.org
  • A product may be tracked by a Stock Keeping Unit, or SKU.  These are not standardized across industries, but may have common practice within an industry.  The SKU usually combines subcodes for manufacturer, product, color and size or other detail.
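Since SKUs are not standardized, any parser has to start from a stated convention. The Python sketch below assumes a purely hypothetical layout of four dash-separated subcodes (manufacturer, product, color, size), matching the subcodes mentioned above. The `Sku` class, the `parse_sku` function and the sample value are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sku:
    """A SKU under an assumed MANUFACTURER-PRODUCT-COLOR-SIZE convention."""
    manufacturer: str
    product: str
    color: str
    size: str

    def __str__(self) -> str:
        return "-".join((self.manufacturer, self.product, self.color, self.size))


def parse_sku(text: str) -> Sku:
    """Split a dash-separated SKU into its four subcodes."""
    parts = text.strip().upper().split("-")
    if len(parts) != 4 or not all(parts):
        raise ValueError(f"expected 4 non-empty subcodes, got {text!r}")
    return Sku(*parts)


sku = parse_sku("acme-tshirt-blu-m")  # hypothetical SKU
print(sku.color, sku.size)
print(str(sku))
```

A real retailer's convention will differ, so the field list and separator here should be replaced with whatever the source system documents.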


Feel free to comment and share your favorite public reference data.
