Thai Database Leaks 8.3 Billion Internet Records
|May 7, 2020||Open ElasticSearch database discovered.|
|May 13, 2020||Contacting presumed owner of the database.|
|May 13-21, 2020||Multiple attempts to make contact again.|
|May 21, 2020||Escalating issue to ThaiCERT.|
|May 22, 2020||Database secured.|
Updated May 28, 2020 with a public statement from AIS.
I recently discovered an exposed ElasticSearch database while browsing BinaryEdge and Shodan. This database appears to be controlled by a subsidiary of a major Thailand-based mobile network operator named Advanced Info Service (AIS). According to Wikipedia, AIS is "Thailand's largest GSM mobile phone operator with 39.87 million customers" as of 2016. The database was likely controlled by AIS subsidiary Advanced Wireless Network (AWN). It contained a combination of DNS query logs and NetFlow logs for what appear to be AWN customers. Using this data, it is quite simple to paint a picture of what a person does on the Internet. I made multiple unsuccessful attempts to contact AIS to get the database secured. At that point I contacted Zack Whittaker, a journalist from TechCrunch, for assistance. We were still unable to make contact with AIS. I then contacted the Thailand National CERT team (ThaiCERT). ThaiCERT was able to make contact with AIS, and we were successful in getting the database secured.
Public Statement from AIS
AIS cares deeply about protecting our customers’ personal information. We are continually reviewing our security procedures to ensure global best-practice. However, on this occasion we acknowledge that our procedures fell short. We would like to thank you for your diligence in addressing the matter and, in particular, your sincere attempts to contact us before publishing your observations. For the benefit of your readers please feel free to publish this email address: [email protected]
To repeat the statement we sent to media yesterday, we can confirm that a small amount of non-personal, non-critical sampling data was exposed for a limited period in May during a scheduled test. As you correctly stated in your article, the data related to Internet usage patterns and did not contain Personal Identification Information (PII) that could be used to identify any customer. We are also pleased that the incident was contained as soon as we became aware of the matter and no customer was adversely impacted.
We appreciate your interest and specialist expertise in this complex area and look forward to working more effectively with you in future.
Who is Advanced Wireless Network (AWN)?
According to Bloomberg, AWN provides "wired and wireless network services, telecommunications network, and computer systems." Their website indicates the company was founded in 2005. They specifically call out that AWN “is a subsidiary company of Advanced Info Service Public Limited Company or AIS.”
AWN's network (AS131445) connects directly with AIS (AS45430); in fact, AIS is its only upstream peer. This relationship is clearly visible in the data published by Hurricane Electric.
It is important to note that ThaiCERT contacted AIS about the exposed database, and then the database was offline shortly after. It's possible that AIS promptly notified AWN, or they may have simply blocked access to the exposed database to quickly address the issue for their subsidiary company.
When did the leak begin?
Based on data available in BinaryEdge this database was first observed as exposed and publicly accessible on May 1, 2020. I discovered this database roughly 6 days later on May 7, 2020.
This was not a single server left exposed to the Internet without authentication. The main database I located was part of a cluster of three ElasticSearch nodes. I also located a fourth ElasticSearch database containing similar data. AIS has been notified of all of these exposed databases.
How much data?
Over the roughly 3 weeks the database was exposed, the volume of data grew significantly. The database was adding approximately 200M new rows of data every 24 hours.
To be precise, as of May 21st, 2020: 8,336,189,132 documents were stored in the database. This data was a combination of NetFlow data and DNS query logs.
It appears DNS query traffic was only logged for roughly 8 days (2020-04-30 20:00 UTC - 2020-05-07 07:00 UTC). This captured 3,376,062,859 DNS query logs. It's unclear why they stopped logging DNS queries after this brief period of time. Perhaps it was significantly more data than they intended to capture. They were logging roughly 2,538 DNS events per second for that period of time.
|Key data point||Count|
|DNS queries logged||3,376,062,859|
|DNS query logging rate||2,538 per second|
|Unique source IPs making DNS queries over 48 hours||11,482,414|
|Unique count of rrnames (DNS query values) over 48 hours||2,216,07|
A single DNS query log line looked like this:
This doesn't look like much, but when aggregated by a single source IP address (the person/device/house) that made the DNS request you can quickly paint a picture about that person. More on that later.
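To make the aggregation idea concrete, here is a minimal sketch of grouping DNS log records by source IP. The field names `src_ip` and `rrname` and the sample records are assumptions for illustration (the addresses are from documentation ranges); they are not taken from the exposed database.

```python
from collections import defaultdict

# Hypothetical records: the field names "src_ip" and "rrname" are assumptions,
# and the addresses come from documentation ranges, not the exposed data.
dns_logs = [
    {"src_ip": "192.0.2.10", "rrname": "www.facebook.com"},
    {"src_ip": "192.0.2.10", "rrname": "www.youtube.com"},
    {"src_ip": "192.0.2.10", "rrname": "www.facebook.com"},
    {"src_ip": "198.51.100.7", "rrname": "www.google.com"},
]


def domains_by_source(records):
    """Group the unique queried names by the source IP that asked for them."""
    profile = defaultdict(set)
    for record in records:
        # Normalize case and any trailing dot so duplicates collapse.
        profile[record["src_ip"]].add(record["rrname"].lower().rstrip("."))
    return {ip: sorted(names) for ip, names in profile.items()}


profiles = domains_by_source(dns_logs)
for ip, names in profiles.items():
    print(ip, names)
```

Each source IP ends up with a deduplicated list of the names it looked up, which is the starting point for the kind of profile described later in this post.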
"NetFlow is a network protocol developed by Cisco for collecting IP traffic information and monitoring network traffic. By analyzing flow data, a picture of network traffic flow and volume can be built."
The NetFlow data was being logged at a rate of roughly 3,200 events per second. A single NetFlow log line looked like this:
This high-level information records which source IP sent different types of traffic to a particular destination IP, and how much data was transferred. In the example screenshot this was an HTTPS (TCP port 443) request to a destination IP address. It would be possible to do a reverse DNS lookup on the destination IP to quickly identify which website this person was visiting over HTTPS.
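A reverse DNS lookup of this kind can be sketched with Python's standard `socket.gethostbyaddr`. The `resolver` parameter is an addition for this example so the function can be exercised without network access; it is not part of any tooling described here.

```python
import socket


def reverse_lookup(ip, resolver=socket.gethostbyaddr):
    """Return the PTR hostname for an IP address, or None if none is found."""
    try:
        hostname, _aliases, _addresses = resolver(ip)
    except OSError:
        # socket.herror and socket.gaierror are both subclasses of OSError.
        return None
    return hostname
```

Calling `reverse_lookup("192.0.2.1")` (a documentation address used here as a placeholder) would be replaced in practice by a destination IP from a flow record. Note that for sites hosted behind a CDN or shared hosting, the PTR record often names the hosting provider rather than the website itself.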
|Key data point||Count|
|NetFlow log lines||roughly 5 billion|
|NetFlow events logged||roughly 3,200 events per second|
|Unique IPs over a 24 hour period of time||1,036,576|
|Unique ASNs connecting to AIS's network over 24 hours||6,234|
|Unique countries connecting to AIS's network over 24 hours||173|
What does this data tell us?
AWN was using a tool called ElastiFlow that simplifies the process of getting NetFlow or sFlow data into Elasticsearch, where it can be quickly visualized using Kibana.
The prebuilt "Geo IP" dashboard summarizes the geographic spread of the traffic being captured. Unsurprisingly, the majority of the traffic was from Thailand, although a decent amount of traffic was logged from surrounding countries as well.
AWN had built a dashboard for reviewing the DNS traffic that was logged. This breaks down the ASNs, top domains, and top IPs by query frequency. I am not going to speculate why AWN was logging their customers' DNS queries.
Interestingly enough AWN had this DNS dashboard saved with a filter specifically looking at Facebook traffic. It's unclear why they would be particularly interested in who was going to Facebook.
What can you do with this data?
To prove the point that DNS query logs should be treated as sensitive information, I picked a single source IP address with low to moderate traffic from this database. I didn't want to pick one of the highest-traffic IPs, as I assumed that would more likely be a NATed IP with multiple machines behind it. With a single source IP address it's possible to quickly determine the type of devices on their network and the social networks they frequent: Google, YouTube, Facebook, Soichat.com, TikTok, and Line (a chat application), among many other domains.
For the same source IP address I then queried for all NetFlow logs broken down by application and type.
For this single IP address the database contained 668 NetFlow records. The database also had a detailed breakdown of the types of traffic from this IP: how much DNS traffic, HTTP, HTTPS, SMTP, etc. This information is creepy in its detail, but not overly personally identifiable.
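A per-IP traffic breakdown like this can be approximated from raw flow records by mapping destination ports to services. The sketch below assumes hypothetical field names (`src_ip`, `dst_port`, `bytes`) and made-up sample values; only the port numbers are the standard well-known assignments.

```python
from collections import Counter

# Well-known ports for the services named above.
PORT_TO_SERVICE = {25: "SMTP", 53: "DNS", 80: "HTTP", 443: "HTTPS"}

# Hypothetical flow records; field names and values are assumptions.
flows = [
    {"src_ip": "192.0.2.10", "dst_port": 443, "bytes": 5200},
    {"src_ip": "192.0.2.10", "dst_port": 443, "bytes": 1800},
    {"src_ip": "192.0.2.10", "dst_port": 53, "bytes": 120},
    {"src_ip": "192.0.2.10", "dst_port": 80, "bytes": 900},
    {"src_ip": "198.51.100.7", "dst_port": 25, "bytes": 4000},
]


def traffic_breakdown(records, src_ip):
    """Sum the bytes sent per service for one source IP."""
    totals = Counter()
    for record in records:
        if record["src_ip"] != src_ip:
            continue
        service = PORT_TO_SERVICE.get(record["dst_port"], "other")
        totals[service] += record["bytes"]
    return dict(totals)


breakdown = traffic_breakdown(flows, "192.0.2.10")
print(breakdown)
```

Port-based classification is only a rough guide, since any service can run on any port, but it is enough to reproduce the kind of summary a flow dashboard shows.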
The DNS queries for this single IP address, though, get a bit more personal. The query returned 429 rows of data for this IP address.
Based on just these DNS queries, it is possible to identify the following details about this person/household:
- They have at least 1 Android device
- They either have a Samsung Android device, or some other Samsung device such as an Internet-connected TV
- They have at least 1 Windows device
- They have at least 1 Apple device
- They use Google Chrome as a browser
- They use Microsoft Office
- They use ESET antivirus software
- They visit the following social media sites: Facebook, Google, YouTube, TikTok, WeChat
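The inferences in the list above come down to matching queried names against domains associated with a vendor or product. Here is a minimal sketch of that matching. The suffix table and the sample queries are illustrative assumptions, not the domains actually observed in the exposed database.

```python
# Illustrative suffix-to-hint table. These entries are assumptions chosen for
# the example; they are not the domains observed in the exposed database.
HINTS = {
    "connectivitycheck.gstatic.com": "Android device",
    "captive.apple.com": "Apple device",
    "msftconnecttest.com": "Windows device",
    "eset.com": "ESET antivirus",
    "tiktok.com": "TikTok",
}


def infer_profile(rrnames):
    """Return the set of hints whose domain suffix matches any queried name."""
    found = set()
    for name in rrnames:
        name = name.lower().rstrip(".")
        for suffix, hint in HINTS.items():
            # Match the domain itself or any subdomain of it.
            if name == suffix or name.endswith("." + suffix):
                found.add(hint)
    return found


# Made-up sample queries for a single source IP.
queries = ["captive.apple.com", "www.msftconnecttest.com", "update.eset.com", "example.org"]
profile = infer_profile(queries)
print(sorted(profile))
```

A real analysis would need a much larger and carefully verified table, but even a handful of suffixes is enough to reveal which operating systems and products sit behind an IP address.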
Keep in mind the database was only logging DNS traffic for 8 days, and yet that still amounts to 3.3B+ DNS queries. This DNS query log data can tell many other interesting and creepy stories like the above data.
How do we prevent this in the future?
Secure/Sane defaults for ElasticSearch and Kibana
ElasticSearch databases being exposed on the public Internet without any form of authentication is clearly a recurring problem. I don't get much time to do these kinds of investigations, but I've found and written about 9 such leaks over the past year.
ElasticSearch and Kibana (made by the same company) need more secure and sane defaults. Obviously if the person setting up these tools is determined to put them on the public Internet there's no way to prevent that. That being said, these tools (Kibana in particular) could display a gigantic warning banner that alerts users if the tool is detected to be publicly accessible without authentication and make the user acknowledge they understand the risk and implications of that.
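As a rough illustration of the kind of check such a warning could be built on, the sketch below sends an unauthenticated request to an endpoint and flags it if the request succeeds. This is not how Kibana or ElasticSearch work today; the function names are hypothetical, and the `fetch` parameter exists only so the logic can be tried without a live server.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=5):
    """Return the HTTP status of an unauthenticated GET, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 401 when authentication is required
    except urllib.error.URLError:
        return None  # connection refused, timeout, DNS failure, etc.


def exposure_warning(url, fetch=fetch_status):
    """Return a warning string if the endpoint answers without credentials."""
    if fetch(url) == 200:
        return f"WARNING: {url} is readable without authentication"
    return None
```

An operator could run `exposure_warning("http://<public-address>:9200/")` from a machine outside their own network (9200 is ElasticSearch's default HTTP port). A result of `None` means the request was rejected or the host was unreachable from that vantage point; it does not prove the service is safe.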
DNS over HTTPS (DoH) / DNS over TLS (DoT) to stop DNS-based spying
There's no hiding from NetFlow or sFlow data collection from your ISP. If you're on their network they can (and will) track where connections are originating and what the destination is for that traffic.
Regarding the DNS query logs, though: that is easy to solve. Use DoH or DoT to secure your DNS communications in transit so that your ISP can't see, log, spy on, and sometimes sell your DNS query traffic. I know the various arguments against DoH and DoT, and the vast majority of them are unfounded scaremongering tactics (and FUD) from organizations and companies with a financial interest in you not securing your DNS query traffic.
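For readers curious what a DoH request looks like, here is a minimal sketch of the GET form described in RFC 8484: the DNS query is built in ordinary wire format, base64url-encoded without padding, and placed in a `dns` URL parameter that travels inside HTTPS. The resolver URL is a placeholder, and the helper names are made up for this example.

```python
import base64
import struct


def build_dns_query(name, qtype=1):
    """Build a wire-format DNS query for `name`; qtype 1 is an A record."""
    # Header: ID 0 (RFC 8484 suggests 0 so responses cache well),
    # flags 0x0100 (recursion desired), one question, no other records.
    header = struct.pack("!HHHHHH", 0, 0x0100, 1, 0, 0, 0)
    labels = name.rstrip(".").split(".")
    qname = b"".join(bytes([len(label)]) + label.encode("ascii") for label in labels) + b"\x00"
    question = qname + struct.pack("!HH", qtype, 1)  # QTYPE, QCLASS = IN
    return header + question


def doh_get_url(resolver_url, name):
    """Return an RFC 8484 GET URL carrying the query as unpadded base64url."""
    encoded = base64.urlsafe_b64encode(build_dns_query(name)).rstrip(b"=").decode("ascii")
    return f"{resolver_url}?dns={encoded}"


# "https://dns.example/dns-query" is a placeholder resolver, not a real service.
url = doh_get_url("https://dns.example/dns-query", "www.example.com")
print(url)
```

A client would fetch this URL over HTTPS with an `Accept: application/dns-message` header. Because the query name only appears inside the encrypted request, an on-path observer such as an ISP sees a connection to the resolver but not the name being looked up.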
DoH and DoT are the future. Mozilla Firefox supports it, Google Chrome supports it, Microsoft Edge supports it, Android supports it, and even Microsoft Windows 10 will be adding it soon. Here's a helpful guide for enabling it on the major browsers: "Here's how to enable DoH in each browser, ISPs be damned"
To be clear: DoH and/or DoT would have stopped the gathering of DNS query data in this case. It's simple to set up, and it's just a smart thing to do for anyone concerned about their privacy.