word frequency study

This page was created to satisfy a Twitter app requirement, but I’m also using it to share what I’m learning about building a web crawler for research purposes. So far I’ve followed two tutorials: one extracts the URLs from a given website, and the other collects the words on a given page and returns their frequencies. So we should be able to combine the two to crawl target websites and collect word frequencies. Twitter feeds are a much bigger challenge because the data is stored in JavaScript (I’m told?), so it’s not as straightforward, and Twitter seems to have restrictions on usage. It wasn’t too hard for me to download all my tweets, but the same approach won’t work for any account that isn’t mine.

(But I’m also going to use this to share what I learn and my progress.)
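Here is a rough sketch of what combining the two tutorials might look like. It is only an illustration under a few assumptions: it is written for Python 3, it uses the standard library's html.parser instead of BeautifulSoup so it runs without installing anything, and the names PageParser, crawl_word_counts and FAKE_SITE are made up for this example. The "site" is a dict standing in for the network so the sketch can be run offline.

```python
from collections import Counter, deque
from html.parser import HTMLParser
from string import punctuation
from urllib.parse import urljoin, urlparse


class PageParser(HTMLParser):
    """Collects href values from <a> tags and the words inside <p> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.words = []
        self._p_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self.links.append(href)
        elif tag == 'p':
            self._p_depth += 1

    def handle_endtag(self, tag):
        if tag == 'p' and self._p_depth > 0:
            self._p_depth -= 1

    def handle_data(self, data):
        if self._p_depth > 0:
            self.words.extend(data.split())


def crawl_word_counts(start_url, fetch, max_pages=10):
    """Breadth-first crawl that stays on one host and counts words."""
    site = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    counts = Counter()
    pages = 0
    while queue and pages < max_pages:
        url = queue.popleft()
        html = fetch(url)            # fetch returns the page HTML, or None
        if html is None:
            continue
        pages += 1
        parser = PageParser()
        parser.feed(html)
        for word in parser.words:
            word = word.strip(punctuation).lower()
            if word:
                counts[word] += 1
        for href in parser.links:
            absolute = urljoin(url, href).split('#')[0]
            if urlparse(absolute).netloc == site and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return counts


# Hypothetical two-page site held in a dict instead of on the network.
FAKE_SITE = {
    'http://example.com/': '<p>Hello, world!</p><a href="/about/">About</a>',
    'http://example.com/about/': '<p>Hello again.</p><a href="http://other.example/">x</a>',
}

counts = crawl_word_counts('http://example.com/', FAKE_SITE.get)
print(counts.most_common(1))
```

To point it at a real site, the fetch argument could be a small function that calls requests.get(url) and returns the response text. The max_pages limit is there so a crawl cannot run away.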

A brief overview of what is going on below. There are two ways to execute (or run) python code: it can be executed as the code is entered line by line in an interpreter, or the code can be typed up a 

the program used to extract urls

from bs4 import BeautifulSoup
import requests

url = raw_input('Enter a website to extract the URLs from: ')   # get user input (Python 2; use input() in Python 3)
r = requests.get('http://' + url)                               # prepend the scheme to make it a valid URL
data = r.text                                                   # get the response body as text
soup = BeautifulSoup(data, 'html.parser')                       # parse the HTML into a searchable tree

for link in soup.find_all('a'):                                 # find every <a> tag
    print(link.get('href'))                                     # print its href (None if the tag has none)

output of extracting the urls from my website

$ python webscraper.py
Enter a website to extract the URLs from: crumpledcortex.com
http://crumpledcortex.com/
http://crumpledcortex.com/software/
http://crumpledcortex.com/trippygram/
http://crumpledcortex.com/trippygram/gallery/
http://crumpledcortex.com/trippygram/tips-tricks/
http://crumpledcortex.com/zeeify/
http://crumpledcortex.com/trippygram/privacypolicy/
http://crumpledcortex.com/trippygram/support/
http://crumpledcortex.com/instructions/
http://crumpledcortex.com/
http://crumpledcortex.com/trippygram/gallery/
http://crumpledcortex.com/portfolio/three-dimensional-designs/
http://crumpledcortex.com/portfolio/curved-crease-orgami-gallery/
http://crumpledcortex.com/portfolio/snowflakes/
http://crumpledcortex.com/portfolio/panoramics/
http://crumpledcortex.com/develop/
http://crumpledcortex.com/purchases/
http://crumpledcortex.com/autobiographical-blurb/
#
http://crumpledcortex.com/2014/10/26/russells-teapot-necklace/img_0880/
http://crumpledcortex.com/2014/10/26/russells-teapot-necklace/img_0873-2/
http://crumpledcortex.com/2014/10/26/coleoidea-hypoxyphilia/img_0914/
http://crumpledcortex.com/2014/10/26/coleoidea-hypoxyphilia/img_0917/
http://crumpledcortex.com/raspberry-pi-case-three-quarters-view/
http://crumpledcortex.com/raspberry-pi-case-taken-apart/
http://crumpledcortex.com/alternate-raspberry-pi-case-taken-apart/
http://crumpledcortex.com/img_0921/
http://crumpledcortex.com/img_0902/
http://crumpledcortex.com/img_0885/
http://crumpledcortex.com/img_0888/
http://crumpledcortex.com/img_0894/
http://crumpledcortex.com/img_0905/
http://crumpledcortex.com/img_0908/
http://crumpledcortex.com/img_0910/
http://crumpledcortex.com/ovals-in-pink-construction-paper/
http://crumpledcortex.com/concentric-circles-in-cardboard/
http://crumpledcortex.com/screen-shot-2014-06-03-at-9%c2%b728%c2%b755-a/
http://crumpledcortex.com/screen-shot-2014-06-11-at-8%c2%b705%c2%b727-p/
http://crumpledcortex.com/trippygram/tips-tricks/img_5169/
http://crumpledcortex.com/trippygram/tips-tricks/img_5178/
http://crumpledcortex.com/trippygram/tips-tricks/img_5172/
http://crumpledcortex.com/img_3151/
http://crumpledcortex.com/img_9170/
http://crumpledcortex.com/img_3735/
http://crumpledcortex.com/trippygram/tips-tricks/img_3655/
http://crumpledcortex.com/trippygram/tips-tricks/img_8909/
http://crumpledcortex.com/img_2545/
http://crumpledcortex.com/img_2629/
http://crumpledcortex.com/img_2656/
None
/#respond
http://instagram.com/nescientswot
https://www.facebook.com/yrast/photos_albums
http://yrast.tumblr.com
/support
http://siteorigin.com
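The output above has a few entries a crawler could not follow as they are: a bare #, a None, relative paths like /support, and repeats. One way to tidy the list, sketched here for Python 3 with the standard library's urllib.parse, is to resolve every href against the page URL, drop the fragment, and keep only the first copy of each result. The function name clean_links is made up for this example, and the input list is just a handful of the hrefs shown above.

```python
from urllib.parse import urldefrag, urljoin


def clean_links(base_url, hrefs):
    """Return absolute, de-duplicated URLs in first-seen order."""
    seen = set()
    cleaned = []
    for href in hrefs:
        if not href:                 # skips None and empty strings
            continue
        # Resolve relative paths, then drop any #fragment part.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if absolute not in seen:
            seen.add(absolute)
            cleaned.append(absolute)
    return cleaned


raw = [
    'http://crumpledcortex.com/',
    'http://crumpledcortex.com/software/',
    '#',
    None,
    '/#respond',
    '/support',
    'http://crumpledcortex.com/',
]
links = clean_links('http://crumpledcortex.com/', raw)
print(links)
```

Here the #, /#respond and repeated home page entries all collapse into the single home page URL, and /support becomes a full address.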

the program used to get words and print their frequencies

from bs4 import BeautifulSoup
import requests

from collections import Counter
from string import punctuation

r = requests.get('http://crumpledcortex.com/develop/')
soup = BeautifulSoup(r.content, 'html.parser')
# join the text nodes inside each <p> tag (a generator, one string per paragraph)
text = (''.join(s.find_all(text=True)) for s in soup.find_all('p'))

# split each paragraph on whitespace, strip trailing punctuation, lowercase, and count
c = Counter((x.rstrip(punctuation).lower() for y in text for x in y.split()))

# print(c.most_common())                                 # prints every word with its count, most common first

print([x.encode('utf-8') for x in c if c.get(x) > 5])   # prints only the words that appear more than 5 times

output of all the words that appear on my /develop/ page more than 5 times

['', 'all', 'go', 'little', 'did', 'try', "i'll", 'waiting', "i'm", 'transplant', 'from', 'would', 'to', 'two', 'few', 'more', 'started', 'me', 'this', 'can', 'my', 'process', 'want', 'how', 'leukemia', 'blood', 'marrow', 'a', 'maybe', 'so', "that's", "don't", 'brain', 'still', 'might', 'then', 'good', 'they', 'not', 'now', 'day', 'bone', 'out', 'since', 'guess', 'could', 'days', 'think', 'one', 'another', 'too', 'that', 'likely', 'and', 'have', 'also', 'which', 'sure', 'most', 'should', 'going', 'pretty', 'get', 'biopsy', 'next', 'see', 'are', 'it', 'probably', 'we', 'last', 'seems', "it's", 'been', 'bit', 'drugs', 'while', 'is', 'them', 'in', 'if', 'doctor', 'make', 'week', 'i', 'well', 'the', 'things', 'just', 'when', '\xe2\x80\x94', 'had', 'around', 'know', 'like', 'because', 'some', 'back', 'home', 'for', 'though', 'be', 'on', 'about', 'of', 'or', 'right', 'there', 'lot', 'was', "i've", 'but', 'with', 'cells', 'up', 'an', 'as', 'at', 'again', 'really', 'monday', 'time']
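Two entries in that list are not really words: the empty string (a token that was nothing but punctuation) and '\xe2\x80\x94' (a dash, which string.punctuation does not cover because it only lists ASCII characters). A possible cleanup, sketched for Python 3, is to strip punctuation from both ends and skip any token with no letters or digits left. The function name count_words, the min_count parameter and the sample sentence are all invented for this example, and note that it keeps counts of at least min_count where the program above uses "more than 5".

```python
from collections import Counter
from string import punctuation


def count_words(text, min_count=1):
    """Count lowercase words, ignoring tokens with no letters or digits."""
    counts = Counter()
    for token in text.split():
        word = token.strip(punctuation).lower()
        if any(ch.isalnum() for ch in word):
            counts[word] += 1
    # Keep only the words seen at least min_count times.
    return {w: n for w, n in counts.items() if n >= min_count}


# \u2014 is the same dash that shows up as '\xe2\x80\x94' in the output above.
sample = '"Well," I said, "well \u2014 it\'s fine." Well... fine!'
result = count_words(sample, min_count=2)
print(result)
```

Apostrophes inside a word (like it's) survive because strip only touches the ends of each token.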

what I learned 4/29 (in the morning)

# u prefixes on strings, as in the following list, indicate unicode strings (this is Python 2)
[u'all', u'go', u'little', u'did']

# they can be converted to plain byte strings by specifying an encoding, like this (for a string x):
x.encode('utf-8')
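For comparison, here is a small sketch of the same idea in Python 3, which is an assumption on my part since the code on this page is Python 2. In Python 3 every str is already unicode, so the u prefix is not needed, and encode() returns a separate bytes type. The variable names are only for illustration.

```python
text = 'caf\u00e9 \u2014 ok'          # a str containing two non-ASCII characters

encoded = text.encode('utf-8')        # str -> bytes
decoded = encoded.decode('utf-8')     # bytes -> str, the round trip

dash_bytes = '\u2014'.encode('utf-8')

print(type(text).__name__, type(encoded).__name__)
print(dash_bytes)
```

The dash encodes to the three bytes \xe2\x80\x94, which is the odd-looking entry in the word list above.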



# generators are like functions that step through their code one iteration at a time, pausing each time they reach a yield
# this generator produces the next prime number every time its next() method is called:
import itertools

def primes():
    yield 2
    n = 3
    p = []                              # odd primes found so far
    while True:
        # n is prime if no known prime f with f*f <= n divides it
        if not any(n % f == 0 for f in itertools.takewhile(lambda f: f*f <= n, p)):
            yield n
            p.append(n)
        n += 2

pnums = primes()
pnums.next()
2
pnums.next()
3
pnums.next()
5
pnums.next()
7
pnums.next()
11



import itertools

# this one generates the Fibonacci sequence:
def fib():
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a + b
 
print(list(itertools.islice(fib(), 10)))
output: [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
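To actually see a generator pausing at each yield, here is a small sketch that records when its body runs. It assumes Python 3, where the built-in next(gen) is used in place of the gen.next() method, and the countdown generator and log list are invented for this example.

```python
log = []


def countdown(start):
    """Yield start, start - 1, ..., 1 and record when the body runs."""
    log.append('body started')
    while start > 0:
        yield start
        log.append('resumed after yielding %d' % start)
        start -= 1
    log.append('body finished')


gen = countdown(2)              # creating the generator runs none of the body
ran_on_creation = list(log)     # snapshot: still empty at this point

first = next(gen)               # runs up to the first yield
second = next(gen)              # resumes, loops once, pauses at yield again
leftover = list(gen)            # resumes and runs to the end

print(first, second, leftover)
print(log)
```

Nothing is logged until the first next() call, and each later call picks up on the line right after the yield where the generator last stopped.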

Also, I just learned that WordPress converts some characters in pasted code into HTML entities, which is super frustrating (a greater-than sign becomes &gt;, and less-than signs and ampersands get the same treatment).
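The same kind of conversion can be reproduced, and undone, with Python's standard html module. This is only a sketch of the idea in Python 3, not what WordPress itself runs, and the sample line of code is made up.

```python
import html

source = 'if a < b and b > c: print("a & c")'

escaped = html.escape(source, quote=False)   # <, > and & become entities
restored = html.unescape(escaped)            # entities back to characters

print(escaped)
print(restored == source)
```

So if a snippet copied from a page has entities in it, html.unescape is one way to get the original characters back.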

SuperPAC Resources

List of existing SuperPACs, their political positions, expenditures, viewpoints, and total money raised.
LegalZoom for legal guidance.
How to start a SuperPAC — What to Do With Your Political Expenditure Only Committee
LegalMatch for finding a lawyer.