
A little-known component that should be part of most data science algorithms

This component is often missing, yet valuable for most systems, algorithms and architectures dealing with online or mobile (digital) data: transaction scoring, fraud detection, online marketing, marketing mix and advertising optimization, online search, plagiarism and spam detection, and so on.

I will call it an Internet Topology Mapping. It might not be stored as a traditional database (it could be a graph database, a file system, or a set of look-up tables). It must be pre-built (e.g. as look-up tables, with regular updates) to be efficiently used.

So what is the Internet Topology Mapping?

Essentially, it is a system that matches an IP address (Internet or mobile) with a domain name (ISP). When you process a transaction in real time in production mode (e.g. an online credit card transaction, to decide whether to accept or decline it), your system only has a few milliseconds to score the transaction and make the decision. In short, you only have a few milliseconds to call and run an algorithm (sub-process), on the fly, separately for each credit card transaction. If that algorithm involves matching the IP address with an ISP domain name (a reverse DNS lookup, typically done with nslookup), it won't work: direct nslookups take from a few hundred to a few thousand milliseconds, and they will grind the system to a halt.

Because of that, Internet Topology Mappings are missing in most systems. Yet there is a very simple workaround to build one:

  1. Look at all the IP addresses in your database. Chances are, even if you are Google, 99.9% of your traffic is generated by fewer than 100,000,000 IP addresses. Indeed, the whole universe of IPv4 addresses consists of 256^4 = 4,294,967,296 addresses. That's about 4 billion, not that big a number in the grand scheme of big data. Also, many IP addresses are clustered: 120.176.231.xxx addresses are likely to be part of the same domain, for xxx in the range (say) 124-180. In short, you need to store a lookup table with possibly as few as 20,000,000 records (IP range / domain mappings) to solve the nslookup issue for 99.9% of your transactions. For the remaining 0.1%, you can either assign 'Unknown Domain' (not recommended, since quite a few IP addresses genuinely have an unknown domain), assign 'missing' (better), or perform the cave-man, very slow nslookup on the fly.
  2. Create the look-up table that maps IP ranges to domain names, covering 99.9% of your traffic.

When processing a transaction, access this look-up table (stored in memory, or at least with some caching in memory) to retrieve the domain name. Now you can use a rule system that incorporates domain names. A minimal sketch of such an in-memory range lookup follows.
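As a rough illustration (not part of the original article), here is a minimal Perl sketch of how such a pre-built table could be held in memory and queried. The file name ranges.txt and its tab-separated layout (start IP, end IP, domain) are assumptions for the example.

#!/usr/bin/perl
# Minimal sketch: load a pre-built table of IP ranges -> domain names
# (hypothetical tab-separated file: start_ip<TAB>end_ip<TAB>domain, sorted by start_ip)
# and resolve an IP with a binary search instead of a live nslookup.
use strict;
use warnings;

# Convert dotted-quad notation to a 32-bit integer for range comparisons.
sub ip_to_int {
    my ($ip) = @_;
    my @o = split /\./, $ip;
    return ($o[0] << 24) | ($o[1] << 16) | ($o[2] << 8) | $o[3];
}

# Load the table into memory; each entry is [start_int, end_int, domain].
my @ranges;
open(my $fh, "<", "ranges.txt") or die "Cannot open ranges.txt: $!";
while (my $line = <$fh>) {
    chomp $line;
    my ($start, $end, $domain) = split /\t/, $line;
    push @ranges, [ ip_to_int($start), ip_to_int($end), $domain ];
}
close($fh);

# Binary search: return the domain of the range containing the IP, or 'missing'.
sub lookup_domain {
    my ($ip) = @_;
    my $x = ip_to_int($ip);
    my ($lo, $hi) = (0, $#ranges);
    while ($lo <= $hi) {
        my $mid = int(($lo + $hi) / 2);
        my ($s, $e, $d) = @{ $ranges[$mid] };
        return $d if $x >= $s && $x <= $e;
        if ($x < $s) { $hi = $mid - 1 } else { $lo = $mid + 1 }
    }
    return "missing";
}

print lookup_domain("120.176.231.140"), "\n";

With roughly 20,000,000 sorted ranges, each lookup costs about 25 comparisons, i.e. microseconds in memory, versus the hundreds of milliseconds of a live nslookup.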

Examples of rules and metrics based on domain names include (a feature-extraction sketch follows the list):

  • domain extension (.com, .cc, etc.)
  • length of the domain name
  • domain name flagged as bad, or white-listed
  • patterns found in the domain name (e.g. it includes digits, a date, flagged keywords, or special characters such as a dash)
  • specific keywords found in the domain name
  • owner of the domain name (does the owner own other domains, in particular bad ones?)
  • creation date of the domain name
  • time needed to perform the nslookup on the IP address in question (0.5 second or 4 seconds?)
  • were multiple nslookups needed to find the domain name attached to this IP address, when building the IP address / domain name table?
  • no domain found
  • is this a subdomain?
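To make these metrics concrete, here is a hedged Perl sketch of a feature-extraction routine; the feature names, the sample blacklist entries and the flagged keywords are illustrative placeholders, not values from the article.

# Illustrative sketch: derive a few of the domain-name features listed above.
# The blacklist entries and flagged keywords are hypothetical placeholders.
use strict;
use warnings;

my %blacklist = map { $_ => 1 } qw(badsite.cc spam-domain.info);   # example entries

sub domain_features {
    my ($domain) = @_;
    my %f;
    ($f{extension})  = $domain =~ /(\.[a-z]+)$/i;              # .com, .cc, ...
    $f{extension}    = "" unless defined $f{extension};
    $f{length}       = length($domain);
    $f{blacklisted}  = $blacklist{$domain} ? 1 : 0;
    $f{has_digit}    = ($domain =~ /\d/) ? 1 : 0;
    $f{has_dash}     = ($domain =~ /-/)  ? 1 : 0;
    $f{flagged_word} = ($domain =~ /(casino|pharma|free)/i) ? 1 : 0;
    $f{is_subdomain} = (($domain =~ tr/.//) > 1) ? 1 : 0;      # crude heuristic
    $f{no_domain}    = ($domain eq "missing") ? 1 : 0;
    return \%f;
}

my $f = domain_features("mail.some-shop123.cc");
print "$_ = $f->{$_}\n" for sort keys %$f;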

This is the first component of the Internet Topology Mapping. The second component is a clustering structure: in short, a structure (a text file is fine) where a cluster is assigned to each IP address or IP range. Examples of clusters include (a small labeling sketch follows the list):

  • IP address corresponds to an anonymous proxy
  • .edu proxy
  • corporate proxy (large with many IP's vs. small with a few, heavily used IP's; use a list of the top 50,000 companies / websites to help detect corporate domain names)
  • government proxy
  • static, personal IP
  • volatile, recycled IP (e.g. AOL, Comcast, mobile IP addresses)
  • isolated IP address vs. IP address belonging to a cluster or multiple clusters (small or large) of IP addresses
  • well-known ISP: Comcast, etc.
  • mobile IP address
  • IP address belongs to AWS (Amazon Web Services, cloud services) or a similar provider
  • IP address is a mail / web / FTP or API server
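Below is a small, hedged Perl sketch of how cluster labels could be assigned from the resolved domain name. The label names, the sample ISP list and the heuristics are hypothetical examples of the clusters above, not a definitive taxonomy.

# Illustrative sketch: assign a coarse cluster label to an IP address, given
# its resolved domain name (from the range lookup sketched earlier).
use strict;
use warnings;

my %known_isps = map { $_ => 1 } qw(comcast.net aol.com verizon.net);  # sample list

sub cluster_label {
    my ($ip, $domain) = @_;
    return "no_domain"        if $domain eq "missing";
    return "edu_proxy"        if $domain =~ /\.edu$/i;
    return "government_proxy" if $domain =~ /\.gov$/i;
    return "cloud_aws"        if $domain =~ /amazonaws\.com$/i;

    # Reduce the hostname to its last two labels to match the ISP list.
    my @parts  = split /\./, $domain;
    my $suffix = @parts >= 2 ? join(".", @parts[-2, -1]) : $domain;
    return "volatile_isp_ip"  if exists $known_isps{$suffix};

    return "mail_server"      if $domain =~ /^mail\./i;
    return "web_or_corporate";    # default bucket for everything else
}

# Example call; the hostname below is made up for illustration.
print cluster_label("54.239.28.85", "server-54-239-28-85.amazonaws.com"), "\n";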

Armed with these two components (the IP address / domain mapping plus the IP address cluster structure, aka the Internet Topology), you can now develop far better algorithms, both real-time and back-end (end-of-day). Note that you need the IP address / domain mapping in order to build the cluster structure.

If you have a data scientist on board, it should be easy for her to create this Internet Topology Mapping and to identify the great benefits of using it. I might do it for $10,000 minimum (and $25,000 maximum), or for free if I find the time. Indeed, I've been thinking of creating this tool and selling it as data (selling this app mostly consists of selling data). It could be an idea for a small start-up, selling it under a licensing model.

The only issue with creating this product (assuming it will contain 20,000,000 IP address ranges and get updated regularly) is by far the large amount of time spent doing millions of very slow (0.5 second each) cave-man nslookups. There are well-known ranges reserved for AOL and other big ISP's, so you will probably end up doing just 10,000,000 nslookups. Given that 15% of them will fail (timing out after 2 seconds) and that you will have to run nslookup twice on some IP addresses, let's say that, in short, you are going to run 10,000,000 nslookups, each taking on average 1 second. That's about 2,777 hours, or 115 days.

You can use a Map Reduce environment to easily reduce the time by a factor of 20 by leveraging a distributed architecture. Even on a single machine, if you run 25 instances of your nslookup script in parallel, you should be able to make it run 4 times faster, that is, it would complete in less than a month. That's why I claim that a little guy alone in his garage could create the Internet Topology Mapping in a few weeks or less. The input data set (say 100,000,000 IP addresses) would require less than 20 gigabytes of storage, even less when compressed. Pretty small. A sketch of the single-machine parallelization is shown below.
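As a hedged sketch of that single-machine option (the file names, the output naming scheme and the worker count of 25 are assumptions for the example), the input file can be split into chunks, each handled by a forked child process. Note that fork is emulated with threads on Windows Perl.

# Hedged sketch: split ips.txt into chunks and run the slow lookups in
# parallel child processes with fork(). Chunk count and file names are illustrative.
use strict;
use warnings;

my $workers = 25;                      # number of parallel lookup processes

# Read all IPs and deal them round-robin into $workers chunks.
open(my $in, "<", "ips.txt") or die "Cannot open ips.txt: $!";
chomp(my @ips = <$in>);
close($in);

my @chunks;
push @{ $chunks[ $_ % $workers ] }, $ips[$_] for 0 .. $#ips;

for my $w (0 .. $workers - 1) {
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    next if $pid;                      # parent: spawn the next worker

    # Child: resolve its own chunk and write to its own output file.
    open(my $out, ">", "outip_part$w.txt") or die "Cannot open output: $!";
    for my $ip (@{ $chunks[$w] || [] }) {
        my @lines  = `nslookup $ip`;   # same slow lookup as the serial script
        my ($name) = grep { /Name:/ } @lines;
        my $domain = "n/a";
        if (defined $name && $name =~ /Name:\s*(\S+)/) { $domain = $1; }
        print $out "$ip\t$domain\n";
    }
    close($out);
    exit 0;                            # child is done
}
wait() for 1 .. $workers;              # parent waits for all children
# Concatenate outip_part*.txt afterwards to obtain the full IP / domain table.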

Finally, here's a Perl script that automatically performs nslookups on an input file ips.txt of IP addresses and stores the results in outip.txt. It works on my Windows laptop. You need an Internet connection to run it, and you should add error handling so that it recovers gracefully if you lose power or your Internet connection.

#!/usr/bin/perl
# Reads IP addresses from ips.txt (one per line), performs a reverse lookup
# with nslookup, and writes "IP <tab> domain" lines to outip.txt.
use strict;
use warnings;

open(my $in,  "<", "ips.txt")   or die "Cannot open ips.txt: $!";
open(my $out, ">", "outip.txt") or die "Cannot open outip.txt: $!";

my $n = 0;
while (my $ip = <$in>) {
    $ip =~ s/\s+//g;                   # strip newline and stray whitespace
    $ip = "na" if $ip eq "";
    $n++;

    # Capture nslookup output directly; no temporary file or external grep needed.
    my @lines = `nslookup $ip`;

    my $domainName = "n/a";
    for my $line (@lines) {
        if ($line =~ /Name:\s*(\S+)/) { $domainName = $1; }   # keep the last match
    }

    print $out "$ip\t$domainName\n";
    print "$n> $ip | $domainName\n";   # progress on screen
}

close($out);
close($in);



Comments


Comment by John McCormac on June 11, 2018 at 9:19pm

It is an interesting topic, but nslookup, as used here, basically provides a kind of reverse lookup to map the IP address to a hostname more so than a domain name. A lot of IP addresses have no hostname data or out-of-date hostname data. The tool used to match IPs with ISPs/Hosting Service Providers (HSPs), etc. is whois. This provides the ownership details for the IP address and will also provide the range of IPs owned by the registrant. The problem is in grouping the 2^32 IPv4 IP addresses rather than checking each IP address individually. In terms of the COM/NET/ORG/BIZ/INFO gTLDs, their websites were, when I ran a website-to-IP-address survey a few years ago, hosted on approximately 7.5 million IP addresses.

Using nslookup like this creates more problems than it solves when it comes to HSPs, in that many websites will have their own hostname/IP pairs that may be different to the HSP's domain name. Thus it might be webserver01.exampleA.tld/aaa.bbb.ccc.ddd when the HSP's domain name could be exampleB.tld. This is why the whois tool is more useful in this application. And there are some whois records that contain spoofed or incorrect data, but that is not a serious problem.

The Internet, at an IP level, already has a topology and IP addresses are assigned via the various regional internet registries. The ranges of assigned IPs, along with their associated country are published daily. These lists generally contain the ranges of IPs with more than 256 IP addresses in the range. Thus for a simple example a range of 256 or a /24 might require one whois lookup to determine the owner. This is somewhat more efficient than doing 256 nslookups for each address in the range or network. In telephone terms, it is like identifying the telephone exchange rather than having to do a lookup on each telephone number handled by the exchange.

The ISP is not necessarily the domain name associated with an IP address. Due to the way that the Internet has developed in various countries and the repeated takeovers of ISPs by larger ones, the domain name as detected by the nslookup approach may be that of an old ISP. Even with .COM, the number of new domain names registered in May 2018 was approximately 3.125 million and the number deleted in the same period was 2.475 million. Associating IP addresses with ISPs or providers, rather than domain names, is a better approach for the purposes of the Internet Topology product.

The second component also has some issues that may be problematic. The whole IP proxy issue is a lot more complex than it first seems. Many IP privacy services use ranges of IP addresses in various HSPs. This means that they will not be hitting a website from a range of known ISP addresses but will be coming from data centres. There are also privacy service providers who have ranges of IP addresses that are assigned by regional internet registries like that of Afrinic (the African RIR) but have whois records that indicate that they are US owned and hosted on US HSPs. These are relatively stable and long-lived IP ranges. The next set, the TOR IPs (The Onion Router) that allow people to browse anonymously, are dynamic, short-lived and will often be associated with countries other than that of the user.

The US EDU ranges tend to be well identified in that they are typically large ranges of IP addresses. There will be a mix of web proxies and individual IP addresses. Corporate proxies may not necessarily use the same domain name as their website. Government proxies, with US organisations, may be easier to identify due to the widespread use of the .GOV TLD.

Static IPs and dynamic IPs can create confusion in that some IP addresses may appear to be static for a few days, depending on how the ISP assigns them, or may be static for the duration of the subscription to the ISP. Grouping by ISP is more reliable in this respect. And ISPs also have their own topologies due to covering large geographical areas.

Dealing with AWS IPs is actually a lot easier than dealing with smaller ranges due to the massive sizes of the AWS IP ranges. But there are also other large Cloud hosters. Microsoft's Cloud operation ran out of US IP addresses a few years ago and obtained a lot of Brazilian IP addresses. Some German and European HSPs also obtained South American IP addresses.

The associated function of an IP address is a bit more complex. It is possible to do lookups on domain names to establish their mailserver and then check the IP address. The same can be done for web servers. That's if one has the list of domain names. It is possible to apply for access to the zone files (the official list of domain names and their nameservers) for most of the gTLDs, but the country code TLDs (ccTLDs) don't generally provide access. Many of the IP addresses for mailservers and webservers will be the same, due to most of the web being hosted on shared hosting and due to large mail services such as Google and spam-filtering operations providing e-mail services. And due to the volatility of domain names, as pointed out above, this would have to be continually updated.

Building such a topology as a service is not so much a Data Science problem as one that combines Network Engineering with Data Science. It would have to be designed to be continually updated because the Internet continually changes. Of the .COM domain names registered this month last year, approximately 56% will be renewed. The other 44% or so will be deleted. Such a service would not be a one-off survey and would have to be continually updated.

Comment by David Klemitz on April 25, 2015 at 3:59am

@ Vincent

Like any other map, The Internet map is a scheme displaying objects’ relative position; but unlike real maps (e.g. the map of the Earth) or virtual maps (e.g. the map of Mordor), the objects shown on it are not aligned on a surface. Mathematically speaking, The Internet map is a bi-dimensional presentation of links between websites on the Internet. Every site is a circle on the map, and its size is determined by website traffic, the larger the amount of traffic, the bigger the circle. Users’ switching between websites forms links, and the stronger the link, the closer the websites tend to arrange themselves to each other.

http://internet-map.net/about

Comment by Vincent Granville on October 17, 2013 at 3:33pm

I found the following image on internet-map.net. Does anyone know what it is supposed to visually represent? I know each dot is a website, but what do the distances between dots represent?

Comment by Vincent Granville on October 17, 2013 at 5:31am

@Steve: sounds like a great tool. Note that here I am interested in finding the domain name attached to an IP address, not the other way around. Digital Envoy used to provide this service for a licensing fee; not sure if they still do, or how expensive they are. They used to be very expensive, thousands of dollars per month, and their database contained many millions of records. It was accessible via an API in bulk mode.

Comment by Steve Karam on October 17, 2013 at 3:54am

Some public APIs do a good job of domain name lookups when gathering data from the domain side. I've been playing with RESTful Whois (free unlimited-use API), as its JSON response includes lat/long data and other enriched info about the domain. I've used it for bulk lookups of domains referenced in customer profiles, and it's fairly speedy.

http://www.restfulwhois.com/v1/datasciencecentral.com
