
Alternative Sector Classification Methods




During the early 1900s, various departments of the US government began studying different industries and their functions. Because there were no set standards, each department ended up using its own methodology, and consolidating information across multiple sources became a challenge. The Standard Industrial Classification (SIC) was therefore proposed as a uniform classification system, designed to represent major industries, their sub-classes, and specific functions or products. It was formally adopted in 1937.

However, SIC struggled to keep up with changes in the economic environment. In August 1999, the Global Industry Classification Standard (GICS) was launched by Standard & Poor's (S&P) and MSCI (Morgan Stanley Capital International). The standard provides a comprehensive, globally consistent definition of economic sectors and industries for the global financial industry. As an industry classification model, GICS has gained worldwide recognition, both for providing a solid basis for creating easily replicated, tailored portfolios and for making the study of economic sectors and industries more comparable on a global scale. Since then, GICS has been widely accepted as the default method for classifying companies.

Although it addresses some of the disadvantages of SIC, GICS still has drawbacks. For example, the FAANG stocks (Facebook, Apple, Amazon, Netflix, Google) are generally regarded as tech stocks, yet, ironically, some of them are not placed in the tech sector. Facebook and Google have been put into the Communication Services sector, and Amazon has been put into the Consumer Discretionary sector.

This paper offers two alternative sector classification methods in order to classify companies more accurately.

Method 1: Grouping stocks using price data

To identify stocks that might move together in the future, a natural approach is to find those that have moved together in the recent past. To do this, statistical clustering methods can be applied to the covariance matrix of returns. Here we use Hierarchical Risk Parity (HRP).

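The first step is to take price changes over some period and measure their correlation. The correlations are then converted into a distance measure, so that stocks whose prices move together lie close to one another.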
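As a minimal sketch of this first step, the snippet below computes log returns from a price table, measures their correlation, and converts it into a distance. The formula d = sqrt(0.5 * (1 - rho)) is the one commonly used with HRP; it is an assumption here, since the post does not state which distance was used. The function name and the simulated prices are hypothetical.

```python
import numpy as np


def correlation_distance(prices):
    """Return (corr, dist) for a price array of shape (n_days, n_stocks)."""
    returns = np.diff(np.log(prices), axis=0)   # log price changes, one column per stock
    corr = np.corrcoef(returns, rowvar=False)   # pairwise correlation between stocks
    # Assumed distance: 0 for perfectly correlated stocks, 1 for perfectly anti-correlated.
    dist = np.sqrt(np.clip(0.5 * (1.0 - corr), 0.0, None))
    return corr, dist


# Simulated (not real) prices: stocks 0 and 1 share a common driver, stock 2 does not.
rng = np.random.default_rng(0)
common = rng.normal(0.0, 0.01, size=250)
noise = rng.normal(0.0, 0.002, size=(250, 3))
sim_returns = np.column_stack(
    [common + noise[:, 0], common + noise[:, 1], 5.0 * noise[:, 2]]
)
prices = 100.0 * np.exp(np.cumsum(sim_returns, axis=0))

corr, dist = correlation_distance(prices)
print(np.round(dist, 3))
```

In this toy data, the two stocks that share a driver end up much closer to each other than either is to the third.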

A group of stocks that all lie close to one another can be considered a cluster, and there are many ways to identify clusters automatically once we have a distance measure. Clusters can themselves be grouped together into larger clusters, forming a hierarchy of nested groups. Eventually, the clusters become large enough that there are roughly as many clusters as there are GICS sectors; we can then call these our covariance-based sectors.
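The nesting described above can be sketched with SciPy's agglomerative clustering: build the hierarchy from a distance matrix, then cut it at a chosen number of clusters. The distance matrix below is made up for illustration, and single linkage is an assumption (it is the linkage used in the original HRP formulation); the post does not say which linkage was applied.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def cluster_stocks(dist, n_clusters):
    """Cut a hierarchical clustering of `dist` into at most `n_clusters` groups."""
    condensed = squareform(dist, checks=False)   # SciPy expects the condensed form
    tree = linkage(condensed, method="single")   # nested hierarchy of clusters
    return fcluster(tree, t=n_clusters, criterion="maxclust")


# Hypothetical distances between four stocks: two tight pairs, far from each other.
dist = np.array(
    [
        [0.00, 0.10, 0.80, 0.90],
        [0.10, 0.00, 0.85, 0.80],
        [0.80, 0.85, 0.00, 0.15],
        [0.90, 0.80, 0.15, 0.00],
    ]
)

labels = cluster_stocks(dist, n_clusters=2)
print(labels)
```

To mimic the sector comparison in the text, `n_clusters` would be set to the number of GICS sectors and each resulting label treated as a covariance-based sector.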

The following graph shows how the world’s 13 largest companies by market cap are grouped together.

However, as more companies are added to the watchlist, clustering accuracy drops. For example, some companies with no tech business are grouped into the tech cluster. Although Random Sample Consensus (RANSAC) can be run to de-noise the data, some companies are still clustered wrongly.

Method 2: Grouping stocks using NLP

The second method applies Natural Language Processing (NLP) to Item 1 ("Business") of the 10-K annual filing required of US companies, which must include “a description of the company’s business, including its main products and services, what subsidiaries it owns, and what markets it operates in”.

In analysing these filings, there are two sorts of objects we want to assign to groups: words and documents. First, we associate the words in the documents with a limited set of dynamically learned topics, based on how often they occur together in a 10-K filing. To employ the distance metaphor again, words that often appear together in a document are considered close and are likely to be clustered into the same topic. Second, each document is described in terms of those topics, which means that a particular 10-K, and therefore a particular company, is associated with a mixture of different topics with different weights.
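The post does not name the topic model it used. As one possible sketch, Latent Dirichlet Allocation (LDA) from scikit-learn learns topics from word co-occurrence and returns a topic mixture for each document. The four short "filings" below are invented stand-ins for Item 1 text, and the choice of two topics is arbitrary.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Invented stand-ins for the business-description text of four companies.
docs = [
    "drug clinical trial patient treatment fda approval pharmaceutical",
    "hospital patient care medical treatment healthcare service",
    "software cloud hardware data platform developer service",
    "retail store online customer merchandise fulfilment shipping",
]

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)          # document-by-word count matrix

lda = LatentDirichletAllocation(n_components=2, random_state=0)
weights = lda.fit_transform(counts)              # one row of topic weights per document

# Words ordered by column index, so they line up with lda.components_.
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
for k, topic in enumerate(lda.components_):
    top_words = [vocab[i] for i in topic.argsort()[::-1][:5]]
    print(f"topic {k}: {top_words}")

for doc_id, row in enumerate(weights):
    print(doc_id, row.round(2))
```

Each row of `weights` sums to one, which is the "mixture of different topics with different weights" described above. With a corpus this small the learned topics are not meaningful; real filings would be needed for that.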

For example, if words such as "health, medical, care, healthcare, patient, hospital, service, medicare, product, FDA, drug, development, approval, treatment, clinical, agreement, clinical trial, safety, disease, cell, pharmaceutical, technology" appear with high frequency, there is a high probability that the company will be grouped into the healthcare sector.
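A far cruder illustration of the same idea than a learned topic model is to measure what share of a filing's words belongs to a fixed healthcare word list. The word list below is a subset of the one quoted above; the function name and the two sample texts are hypothetical.

```python
import re
from collections import Counter

HEALTHCARE_WORDS = {
    "health", "medical", "care", "healthcare", "patient", "hospital",
    "medicare", "fda", "drug", "treatment", "clinical", "disease",
    "pharmaceutical",
}


def topic_share(text, topic_words):
    """Fraction of the tokens in `text` that appear in `topic_words`."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    hits = sum(counts[word] for word in topic_words)
    return hits / len(tokens)


# Invented sample descriptions, not taken from any real filing.
pharma_text = (
    "We develop drug candidates and run clinical trials to obtain "
    "FDA approval for patient treatment."
)
retail_text = "We sell consumer products online and ship them from fulfilment centers."

pharma_share = topic_share(pharma_text, HEALTHCARE_WORDS)
retail_share = topic_share(retail_text, HEALTHCARE_WORDS)
print(pharma_share, retail_share)
```

A higher share suggests a healthcare-heavy description. Unlike the learned topics described in the text, this fixed list cannot adapt to new vocabulary or weigh words against each other.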

However, there are some exceptions. Take Amazon. At the start of the 2007 to 2016 period, it described itself predominantly as an IT company, associated with the software and hardware topics, with less emphasis on retail. In more recent years, it has clearly described itself as being in the retail business, with IT as a secondary topic and media making a small but increasing contribution.


We have shown different classification schemes, but with any of them we should be careful about noise in the data. Noise may throw up spurious relationships between stocks or produce counterintuitive groupings that are unlikely to persist.

