A little-known component that should be part of most data science algorithms

This component is often missing, yet it is valuable for most systems, algorithms and architectures that deal with online or mobile data (digital data): transaction scoring, fraud detection, online marketing, marketing mix and advertising optimization, online search, plagiarism and spam detection, etc.

I will call it an Internet Topology Mapping. It might not be stored as a traditional database (it could be a graph database, a file system, or a set of look-up tables). It must be pre-built (e.g. as look-up tables, with regular updates) to be efficiently used.

So what is the Internet Topology Mapping?

Essentially, it is a system that matches an IP address (Internet or mobile) with a domain name (ISP). When you process a transaction in real time in production mode (e.g. an online credit card transaction, to decide whether to accept or decline it), your system has only a few milliseconds to call and run a scoring algorithm (sub-process), on the fly, separately for each transaction. If the algorithm involves matching the IP address with an ISP domain name (a reverse DNS lookup, performed here with nslookup), it won't work: direct nslookups take from a few hundred to a few thousand milliseconds, and they will grind the system to a halt.

Because of that, Internet Topology Mappings are missing in most systems. Yet there is a very simple workaround to build it:

  1. Look at all the IP addresses in your database. Chances are, even if you are Google, 99.9% of your traffic is generated by fewer than 100,000,000 IP addresses. Indeed, the whole universe of IP addresses contains at most 256^4 = 4,294,967,296 addresses. That's about 4 billion, not that big a number in the grand scheme of big data. Also, many IP addresses are clustered: 120.176.231.xxx are likely to be part of the same domain, for xxx in the range (say) 124-180. In short, you need to store a lookup table possibly as small as 20,000,000 records (IP range / domain mapping) to solve the nslookup issue for 99.9% of your transactions. For the remaining 0.1%, you can either assign 'Unknown Domain' (not recommended, since quite a few IP addresses genuinely have an unknown domain), or 'missing' (better), or perform the cave-man, very slow nslookup on the fly.
  2. Create the look-up table that maps IP ranges to domain names, for 99.9% of your traffic.
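
The collapsing of clustered IP addresses into ranges can be sketched as follows. This is a minimal Python illustration, not the author's implementation: the function name collapse_to_ranges and the example domains are hypothetical, and it only merges strictly consecutive addresses that share a domain (it does not guess across gaps).

```python
import ipaddress

def collapse_to_ranges(ip_domain_pairs):
    """Merge consecutive IPv4 addresses sharing a domain into (start, end, domain)."""
    # Convert dotted strings to integers so that adjacency is a simple +1 check.
    rows = sorted(
        (int(ipaddress.IPv4Address(ip)), domain) for ip, domain in ip_domain_pairs
    )
    ranges = []  # each entry is [start_int, end_int, domain]
    for ip_int, domain in rows:
        last = ranges[-1] if ranges else None
        if last and last[2] == domain and ip_int <= last[1] + 1:
            last[1] = max(last[1], ip_int)  # extend the current range
        else:
            ranges.append([ip_int, ip_int, domain])
    return [
        (str(ipaddress.IPv4Address(s)), str(ipaddress.IPv4Address(e)), d)
        for s, e, d in ranges
    ]

# Hypothetical input: three adjacent addresses in one domain, one isolated address.
pairs = [
    ("120.176.231.125", "isp-a.example"),
    ("120.176.231.124", "isp-a.example"),
    ("120.176.231.126", "isp-a.example"),
    ("10.0.0.1", "corp.example"),
]
for row in collapse_to_ranges(pairs):
    print(row)
```

Four observed addresses collapse into two records here; on real traffic the reduction depends on how densely each range is observed.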

When processing a transaction, access this look-up table (stored in memory, or at least with some caching available in memory) to detect the domain name. Now you can use a rule system that incorporates domain names.
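
One possible shape for the in-memory look-up is a sorted list of range starts searched by bisection, so each transaction costs a binary search rather than a network call. The sketch below is an assumption about how such a table could be built in Python; the class name IpRangeTable, the example ranges and the 'missing' default are illustrative, and ranges are assumed not to overlap.

```python
import bisect
import ipaddress

class IpRangeTable:
    """In-memory map from non-overlapping IPv4 ranges to domain names."""

    def __init__(self, ranges):
        # ranges: iterable of (start_ip, end_ip, domain), both ends inclusive.
        rows = sorted(
            (int(ipaddress.IPv4Address(s)), int(ipaddress.IPv4Address(e)), d)
            for s, e, d in ranges
        )
        self._starts = [r[0] for r in rows]
        self._ends = [r[1] for r in rows]
        self._domains = [r[2] for r in rows]

    def lookup(self, ip, default="missing"):
        ip_int = int(ipaddress.IPv4Address(ip))
        # Find the last range whose start is <= ip, then check its end.
        i = bisect.bisect_right(self._starts, ip_int) - 1
        if i >= 0 and ip_int <= self._ends[i]:
            return self._domains[i]
        return default

# Hypothetical table with two ranges.
table = IpRangeTable([
    ("120.176.231.124", "120.176.231.180", "isp-a.example"),
    ("10.0.0.0", "10.0.0.255", "corp.example"),
])
print(table.lookup("120.176.231.150"))
print(table.lookup("120.176.231.181"))
```

With 20,000,000 ranges, a binary search needs about 25 comparisons per lookup, which fits comfortably in a millisecond budget.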

Examples of rules and metrics based on domain names are:

  • domain extension (.com, .cc etc.)
  • length of domain name
  • domain name flagged as bad, or white-listed
  • patterns found in domain name (e.g. includes digits, a date, flagged keywords, special characters such as dash)
  • specific keywords found in domain name
  • owner of domain name (does the owner own other domains, in particular bad domains?)
  • date of creation of domain name
  • time needed to do the nslookup on IP address in question (0.5 second or 4 seconds?)
  • were multiple nslookups needed to find the domain attached to this IP address, when building the IP address / domain name table?
  • no domain found
  • is this a subdomain?
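
Several of the metrics listed above can be computed from the domain string alone. The following Python sketch shows one way to do it; the function name domain_features, the flagged keyword list and the example domain are hypothetical, and the subdomain test is a naive label count that ignores multi-part extensions such as .co.uk. Ownership and creation date would need an external data source and are not covered.

```python
import re

def domain_features(domain, flagged_keywords=("casino", "pharma")):
    """Compute simple string-based features for a domain name."""
    name = domain.strip().lower().rstrip(".")
    labels = name.split(".") if name else []
    return {
        "no_domain": not labels,
        "extension": "." + labels[-1] if labels else None,
        "length": len(name),
        "has_digits": any(ch.isdigit() for ch in name),
        "has_dash": "-" in name,
        # A four-digit number starting with 19 or 20 is treated as a possible year.
        "has_year_like_number": bool(re.search(r"(19|20)\d{2}", name)),
        "flagged_keywords": [k for k in flagged_keywords if k in name],
        # Naive: more than two labels is counted as a subdomain.
        "is_subdomain": len(labels) > 2,
    }

features = domain_features("mail-01.cheap-pharma2013.cc")
for key, value in features.items():
    print(key, value)
```

The resulting dictionary can feed a rule system directly or be used as input columns for a scoring model.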

This is the first component of the Internet Topology Mapping. The second component is a clustering structure: a structure (a text file is OK) where a cluster is assigned to each IP address or IP range. Examples of clusters include:

  • IP address corresponds to an anonymous proxy
  • .edu proxy
  • Corporate proxy (large with many IPs vs. small with a few, heavily used IPs; use a list of top 50,000 companies / websites to help detect corporate domain names)
  • Government proxy
  • Static, personal IP
  • Volatile, recycled IP (e.g. AOL, Comcast, Mobile IP addresses)
  • Isolated IP address vs. IP address belonging to a cluster or multi-clusters (either small or large) of IP addresses
  • Well known ISP: Comcast, etc.
  • mobile IP address
  • IP address is AWS (Amazon Web Services, cloud services) or similar
  • IP address is a mail / web / ftp or API server

Armed with these components (IP address / domain mapping + IP address cluster structure, aka Internet Topology), you can now develop far better algorithms: real time or back-end, end-of-day algorithms. You need the IP address / domain mapping to build the cluster structure.
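
To illustrate how the domain mapping can feed the cluster structure, here is a small rule-based sketch in Python. The rules, labels and function name cluster_from_domain are illustrative assumptions rather than a complete taxonomy; a production version would combine many more signals (traffic volume, IP volatility, company lists) than the domain string.

```python
def cluster_from_domain(domain):
    """Assign a coarse cluster label from a domain name (illustrative rules only)."""
    name = (domain or "").strip().lower().rstrip(".")
    if not name or name in ("n/a", "missing"):
        return "no domain found"
    if name.endswith(".edu"):
        return ".edu"
    if name.endswith(".gov"):
        return "government"
    if name.endswith("amazonaws.com"):
        return "cloud services"
    # Host names starting with a service label often identify servers.
    first_label = name.split(".")[0]
    if first_label in ("mail", "smtp", "www", "ftp", "api"):
        return "mail / web / ftp / API server"
    return "unclassified"

# Hypothetical domain names, as they might come out of the IP / domain table.
for d in ["ec2-1-2-3-4.compute-1.amazonaws.com", "dorm12.example.edu", "n/a"]:
    print(d, "->", cluster_from_domain(d))
```

The output of such rules can be stored next to each IP range, in the same table or in a separate text file.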

If you have a data scientist on board, it should be easy for her to create this Internet Topology Mapping, and identify the great benefits of using it. I might do it for $10,000 minimum (and $25,000 maximum) or for free if I find the time. Indeed, I've been thinking about creating this tool and selling it as data (selling this app mostly consists of selling data). It could be an idea for a small start-up, selling it based on a licensing model.

The only real difficulty in creating this product (assuming it will contain 20,000,000 IP address ranges and be updated regularly) is the large amount of time spent doing millions of very slow (0.5 second each) cave-man nslookups. There are well-known ranges reserved for AOL and other big ISPs, so you will probably end up doing just 10,000,000 nslookups. Given that 15% of them will fail (timing out after 2 seconds) and that you will have to run nslookup twice on some IP addresses, let's say that you are going to run 10,000,000 nslookups, each taking 1 second on average. That's about 2,778 hours, or roughly 116 days.
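
The time estimate is plain arithmetic and can be checked in a few lines of Python. The function name estimate_duration is hypothetical; the inputs are the figures from the paragraph above (10,000,000 lookups at 1 second each, run one after another).

```python
def estimate_duration(lookups, seconds_per_lookup):
    """Return (hours, days) needed to run the lookups sequentially."""
    total_seconds = lookups * seconds_per_lookup
    hours = total_seconds / 3600
    days = hours / 24
    return hours, days

hours, days = estimate_duration(10_000_000, 1.0)
print(f"{hours:,.0f} hours, {days:.1f} days")  # prints "2,778 hours, 115.7 days"
```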

You can use a Map Reduce environment to easily reduce the time by a factor of 20, by leveraging the distributed architecture. Even on a single machine, if you run 25 copies of your nslookup script in parallel, you should be able to make it run 4 times faster, that is, it would complete in less than a month. That's why I claim that a little guy alone in his garage could create the Internet Topology Mapping in a few weeks or less. The input data set (say 100,000,000 IP addresses) would require less than 20 gigabytes of storage, even less when compressed. Pretty small.
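
As a rough single-machine illustration of the parallel approach, the sketch below uses a Python thread pool in place of 25 separate script copies. This is a substitute technique, not the Map Reduce setup mentioned above. The names reverse_lookup and lookup_many are hypothetical; socket.gethostbyaddr is the standard-library reverse lookup and has no per-call timeout here, so slow lookups simply occupy a worker until the system resolver gives up.

```python
import socket
from concurrent.futures import ThreadPoolExecutor

def reverse_lookup(ip):
    """Return the host name for ip, or 'n/a' when the reverse lookup fails."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:  # covers socket.herror and socket.gaierror
        return "n/a"

def lookup_many(ips, resolver=reverse_lookup, workers=25):
    """Resolve many IPs concurrently; returns (ip, domain) pairs in input order."""
    ips = list(ips)
    # Lookups are I/O-bound, so threads overlap the waiting time.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(zip(ips, pool.map(resolver, ips)))

# Offline demonstration with a stand-in resolver (no network access needed).
def fake_resolver(ip):
    return "host-" + ip.replace(".", "-") + ".example"

for ip, domain in lookup_many(["192.0.2.1", "192.0.2.2"], resolver=fake_resolver):
    print(ip, domain)
```

Passing the resolver as a parameter keeps the pool logic testable without an Internet connection; calling lookup_many(ips) with the default resolver performs real lookups. The actual speed-up depends on the DNS servers and on how many lookups time out.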

Finally, here's a Perl script that automatically performs nslookups on an input file ips.txt of IP addresses, and stores the results in outip.txt. It works on my Windows laptop. You need an Internet connection to run it, and you should add error handling so that it recovers gracefully if you lose power or your Internet connection.

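use strict;
use warnings;

# Read one IP address per line from ips.txt and write "IP<TAB>domain" to outip.txt.
open(my $in,  '<', 'ips.txt')   or die "Cannot open ips.txt: $!";
open(my $out, '>', 'outip.txt') or die "Cannot open outip.txt: $!";

my $n = 0;
while (my $ip = <$in>) {
  $n++;
  $ip =~ s/\s+//g;                      # drop the newline and any stray whitespace
  $ip = "na" if $ip eq "";

  my $domainName = "n/a";
  # Only dotted-quad input is passed to the shell; anything else keeps "n/a".
  if ($ip =~ /^\d{1,3}(?:\.\d{1,3}){3}$/) {
    # Keep the last "Name:" line of the nslookup output (no temporary file needed).
    for my $line (`nslookup $ip`) {
      $domainName = $1 if $line =~ /Name:\s*(\S+)/;
    }
  }
  print $out "$ip\t$domainName\n";
  print "$n> $ip | $domainName\n";
}
close($out);
close($in);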

Comment

Comment by David Klemitz on April 25, 2015 at 3:59am

@ Vincent

Like any other map, The Internet map is a scheme displaying objects’ relative position; but unlike real maps (e.g. the map of the Earth) or virtual maps (e.g. the map of Mordor), the objects shown on it are not aligned on a surface. Mathematically speaking, The Internet map is a bi-dimensional presentation of links between websites on the Internet. Every site is a circle on the map, and its size is determined by website traffic, the larger the amount of traffic, the bigger the circle. Users’ switching between websites forms links, and the stronger the link, the closer the websites tend to arrange themselves to each other.

http://internet-map.net/about

Comment by Vincent Granville on October 17, 2013 at 3:33pm

I found the following image on internet-map.net. Does anyone know what it is supposed to represent visually? I know each dot is a website, but what do distances between dots represent?

Comment by Gregory Piatetsky on October 17, 2013 at 9:23am

Very nice article! There are also very inexpensive geolocation databases and APIs from MaxMind - I have used those for a while. MaxMind also maintains the data, since IP mappings tend to change over time, by a few percent per year.

Comment by Vincent Granville on October 17, 2013 at 5:31am

@Steve: sounds like a great tool. Note that here I am interested in finding the domain name attached to an IP address, not the other way around. Digital Envoy used to provide this service for a licensing fee, not sure if they still do and how expensive they are. They used to be very expensive, thousands of dollars per month, and their database contained many millions of records. It was accessible via an API in bulk mode.

Comment by Steve Karam on October 17, 2013 at 3:54am

Some public APIs do a good job of domain name lookups when gathering data from the domain side. I've been playing with RESTful Whois (free unlimited use API) as its JSON response includes lat/long data and other enriched info about the domain. I've used it for bulk lookups of domains referenced in customer profiles and it's fairly speedy.

http://www.restfulwhois.com/v1/datasciencecentral.com
