In this analysis, we will break down the NUFORC UFO sighting data to see if there are any patterns in UFO sightings through the years. You can access the data here: nuforc

My UFO Sighting in San Diego, California

This topic has interested me ever since I saw a UFO in the middle of broad daylight in the 3rd grade. It was just another Sunday afternoon at the local park in La Mesa, CA. My friend and I watched as a shiny, silver disk floated effortlessly and silently in the blue sky and shot across the lengths of entire clouds instantaneously. I distinctly remember the way in which the sun glared off the metallic exterior of the craft–it was just like the UFOs in the movies. Shocked, we ran and told our parents, who then came and observed the flying saucer for another 10 minutes or so. Eventually, the UFO shot up into the atmosphere at a speed that an only be described as humanly impossible.

Ever since then I’ve tried to figure out what exactly UFOs might be. Could a government be so stupid as to test new technology in the middle of the day in San Diego? Not likely. I thus have two alternative explanations for UFOs: 1) They are indeed alien spacecraft; and 2) they are human spacecraft from the future. Time travellers have come back to our time via these fast-as-light spacecraft–and that’s why they don’t care about appearing in the middle of the day.

So, now that I’m sure you think I’m crazy (even my Mom, who wasn’t there that day, to this day doesn’t believe me), let’s see if we can find anything that might help us to decide whether UFOs are indeed human or alien. By the way, I tried to find a report near the same time as my sighting, and I couldn’t find anything.

So, with that little anecdote behind us, let’s see if we can find anything in these 80,000+ UFO sighting reports that might help us to decide whether UFOs are indeed human or alien. At the very least, if we want to maximize our chances of seeing a UFO, we might learn where and when to look for one.

In the course of this investigation, we’ll be using a host of analytical techniques, from text mining, visualization, and even a couple statistical tests. Let’s begin.

Clean up and start aggregating and counting

library(tidyverse)
library(ggmap)
library(stringr)
df <- read.csv('ufo.csv', stringsAsFactors = FALSE)

#delete any cities with punctuation. focus on US
#if you put df[which...] then you get a data frame back. we just want vec of index
bad <- which(str_detect(df$city, '[[:punct:]]'))

df <- df[-bad,]

df %>%
  count(city, state, shape)%>%
  arrange(desc(n))%>%
  head()
## # A tibble: 6 x 4
##          city state shape     n
##         <chr> <chr> <chr> <int>
## 1     seattle    wa light   113
## 2     phoenix    az light    90
## 3   san diego    ca light    78
## 4    portland    or light    77
## 5   las vegas    nv light    68
## 6 los angeles    ca light    63
library(tidytext)

Now we can use the great tidytext package to dive into the reportings and see what we can find in the text descriptions.

big <- df %>%
  unnest_tokens(word, comments, token='ngrams', n=2)%>%
  anti_join(stop_words)%>%
  count(word, sort=TRUE)
#solve problem of bigrams with stop words
bigrams_separated <- big %>%
  separate(word, c("word1", "word2"), sep = " ")

#keep only alphabetical words and longer than 2 letters
bigrams_filtered <- bigrams_separated %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)%>%
  filter(str_detect(word1, '[[:alpha:]]'))%>%
  filter(str_detect(word2, '[[:alpha:]]'))%>%
  filter(nchar(word1) > 2)%>%
  filter(nchar(word2) > 2)%>%
  filter(word1 != 'ufo')%>%
  filter(word2 != 'ufo')

#most common types of lights seen
lights <- bigrams_filtered %>%
  filter(word2 == 'light' | word2 == 'lights')%>%
  unite('bigram', -n, sep=' ')

#What type of shapes?
shapes <- bigrams_filtered %>%
  filter(word2 == 'shape' | word2 == 'shaped')%>%
  unite('bigram', -n, sep=' ')

#movement
mvt <- bigrams_filtered %>%
  filter(word2 =='movement' | word2 == 'movements')%>%
  unite('bigram', -n, sep=' ')

speed <- bigrams_filtered %>%
  filter(word2 == 'speed' | word2 == 'speeds')%>%
  unite('bigram', -n, sep=' ')

Visualizing UFO Characteristics by Treemap

Treemaps work by making the area of the rectangles proportional to some variable in our dataframe. In this case, we have counts of the bigrams, so we will set the size of the rectangles to reflect the count of each pair. Hopefully this will tell us something interesting about the most common characteristics of UFOs.

library(treemap)
# treemap. the size of the box is proportional to its count among all lights
treemap(speed,
            index="bigram",
            vSize="n",
            type="index",
            fontsize.labels = 6,
            title= 'UFO Speed Words'
            )

As I expected, most people say that UFOs move at incredibly fast speeds. I like the description of “warp speed.” I’d say the UFO I saw also moved at a laws-of-physics-defyingly fast speed.

treemap(mvt,
            index="bigram",
            vSize="n",
            type="index",
            fontsize.labels = 6,
            title= 'UFO Movement Words'
            )

It seems like there’s a definite pattern of UFOs having erratic and unsual kinds of movements.

treemap(shapes,
            index="bigram",
            vSize="n",
            type="index",
            fontsize.labels = 6,
            title= 'UFO shape Words'
            )

Interestingly lots of cigar-shaped and triangle shaped UFOs. I’ve never heard of Chevron/boomerang shaped UFOs before.

treemap(lights,
            index="bigram",
            vSize="n",
            type="index",
            fontsize.labels = 6,
            title= 'UFO Lights Words'
            )

It looks like most of the lights seen are either white, orange, red, or blue lights.

Visualizing by Lollipop chart

Another way to represent the same information is a lollipop chart. These are essentially slicker-looking versions of bar charts, where the length of the bar is proportional to some column in our dataframe. Remember that in ggplot2, we use the aes() argument to map variables (i.e., columns) onto our plots. This is an extremely useful aspect of ggplot.

Lollipop chart of states

#city state counts of sightings
state_counts <- df %>%
  filter(state != '')%>%
  count(city, state, sort=TRUE)%>%
  unite('location', -n, sep=',')

#visualize with lollipop chart
state_counts %>%
  filter(n > 90)%>%
  mutate(x = factor(location))%>%
  ggplot(aes(x, n))+
  geom_segment(aes(x=reorder(x,n), xend=x, y=0, yend=n), size=1)+
  geom_point(color='red', size=2, alpha=.6)+
  coord_flip()+
  theme_minimal() + 
  theme(axis.text.y = element_text(size = 7))+
  labs(title='Which cities have had the most sightings?')

#count cities and shape
#seattle fireballs sounds like a basketball team
#get rid of others because they don't contain useful info
df %>%
  filter(!shape %in% c('', 'unknown', 'other', 'light'))%>%
  count(city, shape, sort=TRUE)%>%
  filter(n > 20)%>%
  unite('type', -n, sep=' ')%>%
  mutate(type = factor(type))%>%
  ggplot(aes(type, n))+
  geom_segment(aes(x=reorder(type,n), xend=type, y=0, yend=n), size=1)+
  geom_point(color='red', size=2, alpha=.6)+
  coord_flip()+
  labs(title='What shapes are most commonly seen in each location?',
       x = 'City and Type of Shape', y='# of sightings')+
  theme_minimal()

Using lubridate to find patterns over time

The lubridate package is great. It makes working with dates and times as easy as could be imagined–if you’re familiar with xts or ts objects, you’ll know that working with dates and times in R can be frustrating and cumbersome.

One really handy function is floor_date(), which allows you to essentially aggregate observations into buckets of time. In this sense it is similar to how you might choose your bin size in a histogram, but here we are choosing what interval of time to consider a ‘bin’ or ‘bucket.’ As you change the interval (i.e., unit of time), you’ll see different patterns emerge as the zoom level of the data increases and decreases. You’ll need to use your judgment to figure out which level of aggregation best fits your needs.

library(lubridate)

#biggest single events reported one time tinley park 17 counts
df$datetime <- mdy_hm(df$datetime)

#1 year intervals
df %>%
  mutate(datetime = floor_date(datetime, unit='1 year'))%>%
  group_by(datetime)%>%
  filter(datetime > '1939-01-01')%>%
  summarize(cts = n())%>%
  ggplot(aes(datetime, cts))+
  geom_line()+
  scale_x_datetime(date_breaks = '5 years', date_labels = '%Y') + theme(plot.subtitle = element_text(vjust = 1), 
    plot.caption = element_text(vjust = 1), 
    panel.background = element_rect(fill = NA)) +labs(title = "UFO Sightings Since 1939", 
    x = "Date", y = "Sightings")

#which states have the most sightings? 
#ca, wa, fl, tx, ny
df %>%
  count(state, sort=T)%>%
  slice(1:10)

    ## # A tibble: 10 x 2
    ##    state     n
    ##    <chr> <int>
    ##  1    ca  8998
    ##  2    wa  3949
    ##  3    fl  3792
    ##  4    tx  3445
    ##  5    ny  2635
    ##  6    il  2460
    ##  7    az  2454
    ##  8    pa  2406
    ##  9    oh  2301
    ## 10    mi  1930

California, Washington and Florida see the most UFOs.

#top 10 states by sightings
df %>%
  mutate(datetime = floor_date(datetime, unit='1 year'))%>%
  group_by(datetime, state)%>%
  filter(datetime > '1950-01-01')%>%
  filter(state %in% c('ca', 'wa', 'fl', 'tx','ny', 'il','az','pa','oh','mi'))%>%
  summarize(cts = n())%>%
  ggplot(aes(datetime, cts, color=state), alpha=.5)+
  geom_line(size=1)+
  scale_x_datetime(date_breaks = '5 years', date_labels = '%Y') + 
  theme(plot.subtitle = element_text(vjust = 1), 
    plot.caption = element_text(vjust = 1), 
    panel.background = element_rect(fill = NA)) +labs(title = "UFO Sightings Since 1950", 
    x = "Date", y = "Sightings")

California has dominated with sightings, though Washington took over in the mid 1990s for just a year or so.

What about which cities have seen the most UFOs?

#which cities?
cities <- df %>%
  count(city, sort=T)%>%
  slice(1:10)

cities <- cities[,1]
cities

    ## # A tibble: 10 x 1
    ##           city
    ##          <chr>
    ##  1     seattle
    ##  2     phoenix
    ##  3    portland
    ##  4   las vegas
    ##  5 los angeles
    ##  6   san diego
    ##  7     houston
    ##  8     chicago
    ##  9      tucson
    ## 10       miami

How have UFO sightings changed over time in the most popular UFO locations?

#tracking 10 cities
df %>%
  mutate(datetime = floor_date(datetime, unit='5 years'))%>%
  group_by(datetime, city)%>%
  filter(datetime > '1970-01-01')%>%
  filter(city %in% c('seattle', 'phoenix','portland', 'san diego', 'los angeles', 'houston', 'las vegas', 'chicago', 'tucson', 'miami'))%>%
  summarize(cts = n())%>%
  ggplot(aes(datetime, cts, color=city), alpha=.8)+
  geom_line(size=1)+
  scale_x_datetime(date_breaks = '5 years', date_labels = '%Y') + 
  theme(plot.subtitle = element_text(vjust = 1), 
    plot.caption = element_text(vjust = 1), 
    panel.background = element_rect(fill = NA)) +labs(title = "5 year aggregate UFO Sightings Since 1970", 
    x = "Date", y = "Sightings")

We can see that all of the top cities have experienced roughly the same increase in sightings over time. Seattle and Phoenix stand out as having the most in a single year. Strangely sightings seem to have decreased since 2008.

How have UFO shapes changed over time? Are cigar-shaped UFOs as common as disk-shaped ones?

#how have shapes changed over time? 1 year aggregrates
df %>%
  mutate(datetime = floor_date(datetime, unit='1 year'))%>%
  filter(datetime > '1950-01-01')%>%
  group_by(datetime, shape)%>%
  filter(!shape %in% c('', 'changed', 'changing', 'other', 'unknown'))%>%
  summarize(cts = n())%>%
  ggplot(aes(datetime, cts, color=shape))+
  geom_line(size=1, alpha=.6)+
  scale_x_datetime(date_breaks = '5 years', date_labels = '%Y')+
  theme_minimal()

Looks like lights are the most common now, though it wasn’t always that way. Disks dominated back in the day.

Instead of counts, let’s look at the proportion of each shape sighted

#5 year intervals proportion of shapes sighted
df %>%
  mutate(datetime = floor_date(datetime, unit='5 years'))%>%
  filter(datetime > '1950-01-01')%>%
  group_by(datetime, shape)%>%
  filter(!shape %in% c('', 'changed', 'changing', 'other', 'unknown'))%>%
  summarize(cts = n())%>%
  mutate(freq = cts/sum(cts))%>%
  filter(freq > .05)%>%
  ggplot(aes(datetime, freq, color=shape))+
  geom_line(size=1)+
  scale_x_datetime(date_breaks = '5 years', date_labels = '%Y')+
  theme_minimal()+
  labs(title='Freq of shape in 5 year aggregate sightings since 1950')

Since the 1950s disks have gone down massively in popularity, while triangles rose to the top in the late 1980s, then were finally overtaken by lights. Spheres and circles have remained stable at around 10% of all sightings.

How have shapes changed in California over the past 50 years?

df %>%
  mutate(datetime = floor_date(datetime, unit='10 years'))%>%
  filter(datetime > '1950-01-01')%>%
  group_by(datetime, shape, state)%>%
  filter(state == 'ca')%>%
  filter(!shape %in% c('', 'changed', 'changing', 'other', 'unknown'))%>%
  summarize(cts = n())%>%
  ungroup()%>%
  group_by(datetime, add=TRUE)%>%
  mutate(freq = cts/sum(cts))%>%
  filter(freq > .05)%>%
  ggplot(aes(datetime, freq, color=shape))+
  geom_line(size=1, alpha=.6)+
  scale_x_datetime(date_breaks = '20 years', date_labels = '%Y')+
  facet_wrap(~ state)+
  theme_minimal()+
  labs(title='Freq of shape in 10 year aggregate sightings since 1950')

What can we learn from this?

  1. Disks peaked in the 50s and hardly ever appear now.
  2. Lights were rare until the late 70s and then exploded.
  3. Triangles seem to have peaked in the early 80s.
  4. Fireballs started in the late 80s and quickly
  5. Cigars died out in the 1960s. Aliens’ leases expired and they traded up for lights?

Are UFO sightings seasonal? Are there daily patterns to sightings? What time of day is best to maximize one’s chances of seeing a UFO?

To do this we use the super useful functions from lubridate that extract the different pieces of time out the datetime objects we created. Once we mutate the new columns it’s super easy to groupby on these times and make nice visuals.

df <- df %>%
  mutate(day = day(datetime),
         month = month(datetime),
         year = year(datetime),
         hour = hour(datetime))

Now we are ready to start grouping and visualizing

#I'm only showing 24 years' of data here due to size limits. But the same pattern persists up till 2014.
df %>%
mutate(month = factor(month), 
day = factor(day))%>%
filter(between(year, 1950, 1974))%>%
group_by(year, month)%>%
summarize(cts = n())%>%
ggplot(aes(month, cts, group=year))+
geom_line()+
facet_wrap(~ year, ncol = 5, scales = 'free_y')+
labs(title='Are UFO sightings seasonal? Apparently yes')

Wow, this was really unexpected. UFOs are extremely seasonal! Nearly every year shows the same distribution of sightings through the months with a big peak between June and July in almost every year. Now what could possibly explain that?

What about time of day? Do more sightings happen in the morning or evening?

df %>%
group_by(day, hour)%>%
summarize(cts = n())%>%
ggplot(aes(hour, cts, group=day))+
geom_line()+
facet_wrap(~ day)+
labs(title='Does time of UFO sighting depend on the day of the month?')

Again, I’m surprised by the fact the vast majority of sightings are happening around 9-11 p.m. Of course at night our vision isn’t good and there are so many other things in the sky competing for our attention. I realize now that my daytime sighting was a very rare event. Also, I want to draw your attention to two peaks visible in this chart: first, the big peak on the 4th and the big peak around midnight on the 1st. We’ll explain these shortly.

How rare? Let’s estimate the probability of seeing a UFO at 3 p.m.

df %>%
count(hour)%>%
mutate(freq = n/sum(n),
cum_prob = cumsum(freq))%>%
slice(15:17)
## # A tibble: 3 x 4
##    hour     n       freq  cum_prob
##   <int> <int>      <dbl>     <dbl>
## 1    14  1042 0.01588608 0.3131784
## 2    15  1147 0.01748689 0.3306653
## 3    16  1282 0.01954507 0.3502104
#plot the estimated cumulative probability of sightings by hour in the day
df %>%
count(hour)%>%
mutate(freq = n/sum(n),
cum_prob = cumsum(freq))%>%
ggplot(aes(hour, cum_prob))+
geom_area(alpha = .5)+
geom_vline(xintercept = 15, color='red')+
geom_hline(yintercept = .33, color='red')+
labs(title='Estimated cumulative probability of UFO sightings by hour in the day')

Apparently only about 1.7% of all sightings occur between 3-4 p.m. We can also see that only about 33% of sightings occur between midnight and 3 p.m. The vast majority are happening between 8-10 p.m. More specifically, nearly 50% of sightings occur between 8 p.m. and midnight.

What about the day of the week? Do more sightings happen on the weekends? After all, I saw mine on a Sunday afternoon. We’ll use Lubridate’s useful wday() to get the name of the day.

library(stringr)
df %>%
mutate(dow = wday(datetime, label=T))%>%
count(month, dow)%>%
ggplot(aes(dow, n, fill=dow))+
geom_col()+
guides(fill=FALSE)

Yes, there is a big uptick in sightings on Saturdays and Sundays. But does this always hold in all months? If you’re familiar with Simpson’s Paradox, then you’d know that sometimes aggregate results can be completely at odds with results broken down into groups.

df %>%
mutate(dow = wday(datetime, label=T))%>%
count(month, dow)%>%
ggplot(aes(dow, n, fill=dow))+
geom_col()+
facet_wrap(~ month)+
guides(fill=FALSE)

As suspected, there are considerably more sightings on the weekends. Again, there is more time to stare into the sky and drink alcohol, all of which would explain the increase in sightings. This gradual ramping up of sightings through the week is similar to what you would see in retail sales: Monday is slow then shopping picks up through the week, with the biggest bumps on the weekends. This suggests to me that the sightings are more of a function of people’s weekly schedules/habits than any real alien phenomenon.

Just to be sure this weekend bump is real, let’s conduct a Chi-Squared significance test of the day of the week.

cn <- df %>%
mutate(dow = wday(datetime, label=T))%>%
count(dow)%>%
ungroup()
cn

    ## # A tibble: 7 x 2
    ##     dow     n
    ##   <ord> <int>
    ## 1   Sun  9533
    ## 2   Mon  8191
    ## 3   Tue  8856
    ## 4   Wed  9053
    ## 5   Thu  8989
    ## 6   Fri  9477
    ## 7   Sat 11493

	#Goodness of fit test: 
#Null hypothesis: Each day is equally likely for a UFO sighting
#What is the probability of getting the observed DOW counts we did under this condition?
chisq.test(cn$n)

    ## 
    ##  Chi-squared test for given probabilities
    ## 
    ## data:  cn$n
    ## X-squared = 687.82, df = 6, p-value < 2.2e-16

Yup, it’s almost certain that weekend sighting rates are substantially higher than weekday. Given this p-value, there’s basically 0 chance that UFO sightings are equally probable on any day of the week.

Are there certain months where UFO sightings occur at different times?

df %>%
group_by(state)%>%
filter(state %in% c('ca', 'wa', 'tx', 'fl', 'ny'))%>%
count(hour,month)%>%
ggplot(aes(hour, n, color=state))+
geom_line()+
facet_wrap(~ month)+
theme_minimal()+
labs(title='Why does July have a big spike?')

This graphic shows us that for the top 5 states, they all pretty much have similar UFO sighting times throughout the day. But yet in July we notice that spike is coming from more sightings past 8 p.m. Let’s investigate further… could it have something to do with the Fourth of July holiday? More people are outside and looking at the sky for fireworks, maybe?

df %>%
group_by(state)%>%
filter(state %in% c('ca', 'wa', 'tx', 'fl', 'ny'))%>%
count(hour,month,day)%>%
ggplot(aes(day, n, color=state))+
geom_line(size=1.5)+
facet_wrap(~ month)+
labs(title='People apparently can\'t tell fireworks and UFOs apart')+
theme_minimal()

Yup, sure enough the two days with the biggest spikes in sightings are indeed the Fourth of July and New Year’s Eve, two days where tons of people shoot fireworks into the night sky. Sadly, I was hoping there would be a much cooler reason for the mysterious spike in sightings during the middle of the year–but it turns out it’s just people mistaking fireworks for alien spacecraft.

The only cyclical pattern in UFO sightings is that of Americans’ weekly routines. There doesn’t seem to be any evidence that people see UFOs beyond what they would be expected to given their increased leisure time and consumption of alcohol on weekends and holidays. With this section concluded, let’s go back to the textual descriptions and see if we can find anything of interest there.

Can we visualize the descriptions as a network of nodes and edges to find patterns in the UFO descriptions?

The igraph and ggraph packages will help us here. And I highly recommend the free online book https://www.tidytextmining.com/ngrams.html, by David Robinson and Julia Silge.

#need 3 things after have igraph object: node edges and labels
library(igraph)
library(ggraph)
big_graph <- bigrams_filtered %>%
  filter(n > 200)%>%
  graph_from_data_frame()

#for some reason the edge_alpha argument isn't changing. It should make alpha levels proportional to the number of occurrences in the text.
a <- grid::arrow(type = "open", length = unit(.05, "inches"))

ggraph(big_graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
                 arrow = a, end_cap = circle(.07, 'inches')) +
  geom_node_point(color = "red", size=2) +
  geom_node_text(aes(label = name), vjust = 2, hjust = 1) +
  theme_void()

This graph gives a nice overview of the major descriptions of the UFOs. The direction of the arrows indicate whether the word was the 1st or 2nd word in the bigram pair. I’m definitely going to need to look more deeply into this package as there are a lot of setting to tweak. For now, the aes(edge_alpha)=n does not appear to be working…

To conclude our analysis, let’s focus on geographic locations to see if we can find any “hotspots” or patterns in sightings.

Geographic visualization of UFO hotspots

In order to plot using latitude and longitude, we’ll follow the same strategy I outlined in my ggmaps tutorial: groupby location then average the lat and lon of each city. If you want to see how that was done, go here: https://greenet09.github.io/datasophy/2018/06/29/san-diego-airbnb.html

us_map <- get_map('USA', zoom = 3,maptype = 'satellite')%>%
  ggmap()
us_map

#cities with over 50 sightings
over_50 <- df %>%
  count(city,state)%>%
  filter(n > 1)%>%
  unite('location', -n, sep=',')
head(over_50)
## # A tibble: 6 x 2
##       location     n
##          <chr> <int>
## 1  29 palms,ca     2
## 2 abbeville,la     4
## 3 abbeville,sc     2
## 4  aberdeen,md     5
## 5  aberdeen,nj     2
## 6  aberdeen,sd     2
#Now it's in the shape we need to plot. Not gunna lie: this took me like 10 mins to get working
lat_long <- df %>%
  unite('location', c('city','state'), sep=',')%>%
  filter(location %in% over_50$location)%>%
  mutate(latitude = as.numeric(latitude),
         longitude = as.numeric(longitude))%>%
  group_by(location)%>%
  summarize(lon = mean(longitude, na.rm=TRUE),
            lat = mean(latitude, na.rm=TRUE),
            cts = n())%>%
  ungroup()
head(lat_long)
## # A tibble: 6 x 4
##       location        lon      lat   cts
##          <chr>      <dbl>    <dbl> <int>
## 1  29 palms,ca -116.05417 34.13556     2
## 2 abbeville,la  -92.13417 29.97444     4
## 3 abbeville,sc  -82.37917 34.17806     2
## 4  aberdeen,md  -76.16444 39.50944     5
## 5  aberdeen,nj  -74.22212 40.41710     2
## 6  aberdeen,sd  -98.48611 45.46472     2

Now we are ready to visualize by overlaying our dataframe on top of our country map.

us_map +
  geom_point(data=lat_long, aes(lon, lat, color=cts, size=cts, alpha=cts))+
  scale_color_gradient(low='blue', high='red')+
  guides(color=FALSE)+
  labs(title='US UFO sightings since 1944')

Do you see any patterns here? Besides most sightings happening in big cities on the coasts (where most of the people live), it is interesting to see a big vertical gap in the midwest where almost no sightings occur. I wonder if perhaps the geographical features (mountains/rivers, for example) are playing a role. There’s almost a vertical line splitting the US into two halves. Where are the UFOs in Kansas, Nebraska, South Dakota, and North Dakota? There also appears to be a gap in sightings in the region that appears to be Northern Nevada and Idaho.

Can we zoom in to better see?

us_idaho <- get_map('Idaho, USA', zoom=5, maptype = 'satellite')%>%
  ggmap()

us_idaho +
  geom_point(data=lat_long, aes(lon, lat, color=cts, size=cts, alpha=cts))+
  scale_color_gradient(low='blue', high='red')+
  guides(color=FALSE)+
  labs(title='Why no UFOs in Idaho?')

Clearly the bigger the population the more opportunity for sightings. It could just be the relatively sparsely populated areas that account for the lack of sightings. Or it could be that UFOs have an aversion to potatoes. I’m not sure we’ll ever know. For now, I’m sticking with the potato hypothesis.

Lastly, can we visualize key words in different regions?

In order to do this we will look at single words that are most common in each region.

library(ggrepel)
words_map <- df %>%
  unite('location', c('city','state'), sep=',')%>%
  unnest_tokens(words, comments)%>%
  filter(!words %in% stop_words$word) %>%
  filter(!words %in% c('las', 'los', 'san', 'quot', 'diego', 'sky', 'angeles', 'object',
                       'light', 'lights', 'beach', 'jose', 'francisco', 'antonio',
                       'tinley', 'myrtle', 'salt', 'texas', 'bright',
                       'moving', 'monica', 'colorado', 'city', 'barbara','flying',
                       'shaped', 'shape', 'santa', 'object', 'objects', 'craft',
                       'moved', 'alaska', 'downtown', 'north', 'south', 'east', 'west',
                       'rapids','sighting', 'cajon', 'simi', 'boca', 'paso',
                       'lauderdale', 'grand', 'puget', 'nuforc', '39s', 
                       'looked', 'nyc', 'obj', 'cruz', 'missouri','springs', 'note',
                       'appeared', 'hotel', 'night', 'park', 'red', 'palm',
                       'des', 'moines'))%>%
  filter(str_detect(words, '[[:alpha:]]'))%>%
  filter(nchar(words) > 2)%>%
  filter(words != 'ufo')%>%
  filter(!words %in% df$city)%>%
  count(location, words, sort=T)%>%
  filter(n > 12)%>%
  inner_join(lat_long, by='location')
head(words_map)
## # A tibble: 6 x 6
##       location     words     n       lon      lat   cts
##          <chr>     <chr> <int>     <dbl>    <dbl> <int>
## 1   seattle,wa     sound    37 -122.3308 47.60639   524
## 2   seattle,wa     green    35 -122.3308 47.60639   524
## 3   seattle,wa  hovering    27 -122.3308 47.60639   524
## 4   phoenix,az formation    24 -112.0733 33.44833   450
## 5   seattle,wa      fast    24 -122.3308 47.60639   524
## 6 las vegas,nv     green    23 -115.1364 36.17500   363
us_map +
  geom_point(data=words_map, aes(lon, lat, alpha=n, size=n), color='red')+
  geom_text_repel(data = words_map, aes(label = words, size=n), color='green')+
  labs(title='Key words from UFO sightings around the US')

OK, I’m going to end it here. Partly because figuring out how to get the map above to contain just the right amount of words took more time than I’d like to admit, and because it’s time to move on to another analysis.

What can we learn from the above graphic? Well, first you should note that I removed the word “red” and “night” to make room for other, possibly more meaningful words. So if you see some red stuff at night, there’s a good chance it’s a UFO.

Second, we can see that fast glowing fireballs seem more common around Seattle. Though in Southern Texas there also seems to be many green fireballs. By the way, the Seattle Fireballs would make an awesome NBA team name. Just saying.

It is also interesting to see which areas witness UFO formations. It looks like the areas near Lake Havasu, CA and Phoenix, AZ tend to see more UFO formations.

Final theory to explain the lack of UFOs near Idaho…

Is it just a coincidence that the areas without many UFOs seem to be located in the general area where Area 51 is claimed to exist? Probably not. Most of the terrain is just sand and mountains with very few people around to spot a UFO. I would be very surprised if we found a location with a very small population but a lot of UFO sightings. That would indicate something abnormal is happening there.

m <- get_map('Groom Lake, Nevada', zoom = 6, maptype='terrain')%>%
  ggmap()

m + geom_point(data=lat_long, aes(lon, lat, color=cts, size=cts))+
  scale_color_gradient(low='blue', high='red')+
  guides(color=FALSE)+
  labs(title='Sightings surrounding Area 51')

Based on this graphic there don’t seem to be any unpopulated areas with high amounts of UFOs: this leads us to the conclusion that UFO reports are basically a function of having more people around to make UFO reports–not any underlying increase in unidentified flying objects.

Conclusion

Based on these data, I would say that probably a very high percentage of these reports are total BS. Once we figured out that 4th of July and New Year’s Eve fireworks were responsible for the big spikes in reports, I lost a lot of faith in the quality of the reports. The fact that many more are reported on weekends is also dubious: most of the reports are probably just a result of people with more free time on their hands. Also, given the huge numbers of mere “lights” that people see, there are just so many possible things they could be. It’s also troubling that there are almost no more sightings of “cigar-shaped” or ‘disk-shaped’ flying saucers anymore. For my money, I would only trust sightings that occur in daylight and under good, clear weather conditions. The truth, apparently, is still out there.