hacker by moonlight

Latest Posts

  • Aug 18, 2015

    Basic Web Scraping with Python (and BeautifulSoup)

    The internet is full of data. Settling an argument with your friend over the height of the world’s tallest mountain is as easy as a simple search query and voilà! Instant information, directly to your brain. But what if we want that information in a more easily usable and malleable format? Sure, there are APIs for some databases, but maybe we just want to go directly to the source ourselves.

    Enter the web scraper: a script that reads raw HTML from a webpage and extracts the data we need, converting it into an easy to use format such as raw text or JSON. This sounds like a long, complicated process, and the thought of writing a web scraper intimidated me before I actually tried it. Let’s code one!

    Our goal: write a simple web scraper that loads the HTML from japaneseemoticons.me, extracts the unicode emoticons, and writes them to a raw text file. We’re going to do this with the help of python’s urllib module as well as BeautifulSoup4, a super useful module for traversing the HTML data of a webpage.

    Setup

    This tutorial assumes you already have python installed and some experience with the language. (The examples use python 2 syntax, such as the print statement.) The first thing we’ll do is install BeautifulSoup4, which is easiest if you already have pip:

    pip install beautifulsoup4
    

    For other install options, check out the link above (which also has a very useful guide on beautifulsoup’s features).

    Now make a new directory and open up a fresh new .py file. We’ll be importing BeautifulSoup (obviously) as well as urllib, which is python’s built-in url reader.

    from urllib import urlopen
    from bs4 import BeautifulSoup
    

    Getting Raw HTML

    Before we write more of our scraper, it’s important to have an idea about the form of the data we’re working with and what elements on the page contain the relevant information we need. In this case we need to know what elements on the page contain the emoticons, and how to best filter these out. An easy way to play around with the data is running code through the python interpreter. Note that we’re looking specifically at the ‘excited’ emoji page.

    url = 'http://japaneseemoticons.me/excited-emoticons/'
    html = urlopen(url).read()
    print html
    

    Here’s a snippet of what you might see:

    <table style="width: 100%;">
    <tbody>
    <tr>
    <td style="text-align: center;">☆*・゜゚・*(^O^)/*・゜゚・*☆</td>
    </tr>
    <tr>
    <td style="text-align: center;">☆*:.。. o(≧▽≦)o .。.:*☆</td>
    </tr>
    <tr>
    <td style="text-align: center;">*✲゚*。✧٩(・ิᴗ・ิ๑)۶*✲゚*。✧</td>
    </tr>
    <tr>
    <td style="text-align: center;">。.゚+:((ヾ(。・ω・)シ)).:゚+。</td>
    </tr>
    <tr>
    <td style="text-align: center;">\( ● ⌒ ∇ ⌒ ● )/</td>
    </tr>
    ...
    

    From here we can see that all the emoticons are contained in <td> elements on the page. We’ll want to cycle through each <td> element and extract the inner text.

    Parsing HTML

    Enter BeautifulSoup! Let’s take a look at what we get when we ‘soupify’ our raw HTML:

    soup = BeautifulSoup(html, 'html.parser')
    print soup
    

    Printing soup gives us the same output as printing the raw html; however, when we check type(soup) we’ll see that we have a class ‘bs4.BeautifulSoup’ object. This gives us access to some useful functions, such as find_all(), which we’ll use to get a list of the <td> elements on the page.

    tdElems = soup.find_all('td')
    print tdElems[0]
    

    You’ll see the first <td> element on the page:

    <td style="width: 33%; text-align: center;">(((o(*゚▽゚*)o)))</td>
    

    Now all we have to do to extract the emoticon from our <td> element is use the get_text() function:

    emoji = tdElems[0].get_text()
    print emoji
    

    Ta-da! We have our data! Now we can manipulate it however we want.

    Writing to a file

    There are multiple ways we could use our data: we can further manipulate it in our script, save it as a json object, or save it as a raw text file. We’ll be doing the last of these, using python’s file reading and writing capabilities. Let’s start by creating a new file object to save our emoticons to:

    f = open('emoji.txt', 'w')
    

    Our first argument is the path and name of our file. In this example we’re creating our file in whatever directory we run our script from. The second argument specifies the mode the file is opened in; 'w' opens it for writing. In this mode, python creates the file if it doesn’t exist, and if it does exist its previous contents are erased. (Note: we have not opened the file for reading, so if we tried to run f.read() we would get an error).

    Say we just want to save the single emoticon from above into our file:

    f.write(emoji.encode('utf-8'))
    f.close()
    

    Don’t forget the encoding, or you’ll get an error! If we find and open our file, we’ll see that our emoticon has been saved into it. We can now put all these parts together to create a working web scraper that gets all the emojis from the page:

    from urllib import urlopen
    from bs4 import BeautifulSoup
    
    url = 'http://japaneseemoticons.me/excited-emoticons/'
    html = urlopen(url).read()
    
    soup = BeautifulSoup(html, 'html.parser')
    emojis = ''
    f = open('emojis.txt', 'w')
    
    for td in soup.find_all('td'):
      emoji = td.get_text()
      emojis += emoji + '\n'
    
    f.write(emojis.encode('utf8'))
    f.close()
    

    That’s it! This is an extremely simple example, and we were lucky that our data was in such an easy-to-filter format. There are multiple ways we could add to this script: for example, scraping multiple URLs and saving to multiple files, or saving to a JSON object as opposed to raw text. However, knowing the fundamentals of a web scraper is the first step to tapping into the massive amounts of data available on the net.

  • May 3, 2015

    Understanding the Angular Factory

    Angular is an extremely popular MVC framework, and after my brief exposure to it I’ve happily jumped onto the bandwagon. Its intuitive interface and workflow make app design and coding an extremely fluid process, and its wide userbase means you have a huge well of knowledge to draw from when you’re learning and improving.

    This post isn’t about convincing you to use angular, although I do recommend it. Instead I want to explore one of the aspects of the framework that had me confused when I got started. That aspect is the factory.

    Factory functions can be difficult to grasp for a number of reasons, the first being that we can easily build an angular application without factories (though hopefully after reading this you’ll see the value in using them). Additionally, using a factory requires an understanding of angular’s dependency injection strategy.

    But wait, what’s dependency injection anyway? Dependency injection is a software design pattern, related to inversion of control, in which a piece of code declares the things it depends on and has them handed in from outside rather than creating or looking them up itself. The effect is similar to a require statement (in node) or a script tag (in the browser), but more targeted: in angular, dependency injection delivers the functionality of a library or module to the specific parts of the app that ask for it, such as a single controller, so each dependency shows up only in the exact places it’s needed.

    In angular, dependency injection looks something like this:

    app.controller('MainCtrl', function($scope, $dependency1, dependency2) {
      //controller code here
    });
    

    Here we’re creating a controller MainCtrl on app, and the list of arguments in our function (excluding the $scope variable) are our dependencies. The convention in angular is that anything preceded by a ‘$’ is a dependency built directly into angular. But what about that second dependency? Where did it come from?

    There are two possibilities: it’s either an external library (probably included in our bower components) or it’s a factory that we’ve written. That is what a factory is: a set of functions that can be injected into other parts of our angular app. In fact, to write a factory we barely have to change any of the above code:

    app.factory('FactName', function($dependency, dependency2) {
    });
    

    And we can even inject dependencies into our factories, which are then injected into other modules, and so on.
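
    To make that chain of injections a little less magical, here’s a rough sketch in plain javascript of how an injector might match function arguments to registered names. This is a hypothetical mini-injector written for this post, not angular’s actual implementation: the registry, paramNames, invoke and factory names are all made up, and unlike angular it builds each factory as soon as it’s registered.

```javascript
// Hypothetical mini-injector for illustration only; not Angular's real code.
var registry = {};

// Read the parameter names out of a function's source text.
function paramNames(fn) {
  var match = fn.toString().match(/^function\s*[^(]*\(([^)]*)\)/);
  if (!match) return [];
  return match[1].split(',')
    .map(function(name) { return name.trim(); })
    .filter(function(name) { return name.length > 0; });
}

// Call fn, passing in the registered value for each parameter name.
function invoke(fn) {
  var args = paramNames(fn).map(function(name) {
    if (!(name in registry)) throw new Error('Unknown dependency: ' + name);
    return registry[name];
  });
  return fn.apply(null, args);
}

// Register a factory; its own dependencies are resolved first.
// (Simplification: the factory runs immediately at registration.)
function factory(name, fn) {
  registry[name] = invoke(fn);
}

factory('Greeting', function() {
  return { word: 'Hello' };
});

// Greeter depends on Greeting, which gets injected by name.
factory('Greeter', function(Greeting) {
  return {
    greet: function(who) { return Greeting.word + ', ' + who + '!'; }
  };
});

// A stand-in for a controller that asks for Greeter.
var message = invoke(function(Greeter) {
  return Greeter.greet('world');
});
console.log(message); // Hello, world!
```

    The point is only that the names in the argument list are what tie a controller or factory to its dependencies.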

    Why use a factory, as opposed to just writing this functionality into a controller? In short, factory functions allow us to keep our code more DRY and more modular. Factories are a fantastic place to write helper functions that you’ll need to use throughout your application. Take the following code, for example:

    app.controller('UserCtrl', function($scope, $http) {
      $scope.getUsers = function() {
        $http.get('/url')
          .success(function(data) { /* ... */ })
          .error(function(err) { /* ... */ });
      };
    });
    

    Our controller here is making a simple http GET request to our server. Everything looks fine, and this code will work (assuming your routes are set up correctly). But we can easily imagine situations where it would be convenient to have this getUsers function on other controllers. This presents a fantastic opportunity to use factories, and in doing so make our code cleaner and more flexible.

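    //note: $scope belongs to controllers and can't be injected into a factory
    app.factory('Users', function($http) {
      return {
        getUsers: function() {
          //return the promise so the calling controller can use the response
          return $http.get('/url')
            .success(function(data) { /* ... */ })
            .error(function(err) { /* ... */ });
        }
    
        //we might also include some other useful functions
      };
    });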
    

    Now we can inject this factory into some controllers:

    app.controller('MainCtrl', function($scope, Users) {
      $scope.func = function() {
        Users.getUsers();
      };
    });
    
    app.controller('OtherCtrl', function($scope, Users) {
      $scope.func = function() {
        ...
        Users.getUsers();
      };
    });
    

    Ta-da! We have a function that we can easily use in multiple controllers.

    Another important note on factories is that they are instantiated only once in the application. Thus, any variables contained in a factory are shared by all the controllers it’s injected into: every controller sees the same values. This is fantastic for sharing data across multiple controllers. For example:

    app.factory('Counter', function() {
      return {
        count: 0,
        
        inc: function() {
          this.count++;
        }
      }
    });
    
    app.controller('MainCtrl', function($scope, Counter) {
      $scope.count = function() {
        console.log(Counter.count);
      };
    });
    
    app.controller('IncCtrl', function($scope, Counter) {
      $scope.inc = function() {
        Counter.inc();
      };
    });
    

    Both MainCtrl and IncCtrl have access to the Counter factory, and there is only one instance of Counter. Thus, if we run inc() in IncCtrl, the change in Counter.count will be reflected in MainCtrl as well.

    //in MainCtrl:
    $scope.count(); //prints 0
    
    //in IncCtrl:
    $scope.inc();
    
    //in MainCtrl:
    $scope.count(); //prints 1
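
    If the “only one instance” part feels abstract, the idea can be sketched without angular at all. The once helper below is hypothetical (it isn’t part of angular, which handles this for us); it just runs a factory function the first time it’s asked and hands back the same object every time after that.

```javascript
// Hypothetical helper, not part of Angular: run a factory function
// once and return the cached result on every later call.
function once(factoryFn) {
  var instance;
  var created = false;
  return function() {
    if (!created) {
      instance = factoryFn();
      created = true;
    }
    return instance;
  };
}

var getCounter = once(function() {
  return {
    count: 0,
    inc: function() { this.count++; }
  };
});

// Two stand-ins for controllers that each ask for the Counter.
var mainCtrlCounter = getCounter();
var incCtrlCounter = getCounter();

incCtrlCounter.inc();
console.log(mainCtrlCounter.count);              // 1
console.log(mainCtrlCounter === incCtrlCounter); // true
```

    Both variables point at the same object, which is why a change made through one shows up through the other.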
    

    To recap: factories are sets of functions that can be injected as dependencies into controllers, allowing for DRYer, more modular code. Use factories for any functionality or data you expect to be shared across controllers in your angular application.

  • Apr 29, 2015

    Asynchronous Functions and You

    If you’ve been coding in javascript for a while, odds are you’ve heard of callback hell. If you haven’t, you probably will soon (perhaps while learning the ever-popular node), and you’ll probably hear varying levels of anxiety around it. Regardless of your exposure to asynchronous functions, I’m here to tell you that they’re really nothing to be afraid of. Async is your friend!

    What the heck does asynchronous even mean, though?

    As its name would suggest, asynchronous code runs outside the normal flow of a program (AKA all the ‘synchronous’ functions). Generally an async function is one that may take a while to complete, such as reading an external file or making a request to a database/server. Because these processes are time-consuming, synchronous code doesn’t wait for asynchronous code to finish before executing. Take the following snippet of code as an example:

    setTimeout(function() { console.log('Hello!') }, 2000);
    console.log('Goodbye!');
    

    The setTimeout function waits two seconds before running the function we gave it, so one might expect the following output:

    //two seconds pass
    Hello!
    Goodbye!
    

    But this is not so! The actual output is something more like this:

    Goodbye!
    //approximately two seconds later
    Hello!
    

    Seems a bit backwards, right? But this is simply something we have to accept, because this is how javascript handles asynchronous functions like setTimeout. And in the long run it’s more efficient; a program would run much slower if it waited for each asynchronous function to finish.

    While this handling solves the problem of long execution times, it creates another problem: what do we do when we want to return the result of an asynchronous call? The short story is we don’t; instead, we pass a callback into our asynchronous function.

    Callbacks sound mysterious and weird but in actuality a callback is simply a function, which will be executed on the result of an asynchronous process. In the above example, the anonymous function wrapping console.log is the callback passed into setTimeout, and it gets executed once the interpreter recognizes that two seconds have passed. (The exact system behind how this happens is somewhat complex, but for our purposes we don’t need to dive into that. For those interested, here is a great post on the javascript event loop).
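
    To see the pattern from the other side, here’s a small sketch of writing our own function that takes a callback. The findUser function and its fake user object are made up for illustration; setTimeout stands in for a slow database lookup. It follows the node convention of passing an error first and the result second.

```javascript
// Hypothetical async function: "looks up" a user after a short delay.
function findUser(id, callback) {
  setTimeout(function() {
    if (typeof id !== 'number') {
      callback(new Error('id must be a number'));
      return;
    }
    callback(null, { id: id, name: 'user' + id });
  }, 100);
}

var events = [];

findUser(7, function(err, user) {
  if (err) {
    events.push('failed: ' + err.message);
  } else {
    events.push('found ' + user.name);
  }
  console.log(events.join(', ')); // asked for user, found user7
});

// This line runs first, before the callback above.
events.push('asked for user');
```

    Notice that findUser never returns the user; the only way to get at the result is inside the callback.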

    Let’s take a look at a more practical example of asynchronous code: reading and serving a file on node. File-reading can be a very long process: the computer has to resolve the file path, find that location on disk, and hand the data in that file back to our program. This can take a while, especially for large files. So node lets this process run in the background, without delaying the rest of a program. Here’s how we might read a text file:

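    var fs = require('fs');
    // 'utf8' makes data arrive as a string instead of a raw Buffer
    fs.readFile('example.txt', 'utf8', function(err, data) {
      if (err) throw err;
      console.log(data);
    });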
    

    These few lines of code read a file and then log it to the console, and they do it asynchronously; the function where we call console.log is a callback! The readFile function goes through all the steps to find and read a file. Rather than returning the contents, it passes them into our user-specified callback as data (and reports any problem as err).

    We can have more fun with this, though, by appending some text to our file…

    fs.readFile('example.txt', function(err, data) {
      data += 'some more text';
      
      fs.writeFile('example.txt', data, function(err) {
        console.log('we saved it!');
      });
    });
    

    Inside our readFile function we called writeFile, another asynchronous function, and thus passed another callback into it. We’ve created a chain of asynchronous processes, triggered by the first call to readFile: a file is read, edited, and then saved. What if we were to add a couple more console.logs into this code?

    fs.readFile('example.txt', function(err, data) {
      console.log('read file.');
      data += 'some more text';
      console.log('edited file.');
      
      fs.writeFile('example.txt', data, function(err) {
        console.log('we saved it!');
      });
    });
    
    console.log('When will this log???');
    

    Take a second to make a hypothesis about the output of this code, then check your expectations down at the bottom of this post.
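
    One thing the snippets above gloss over is the err argument. Here’s a sketch of the same read-then-write chain with the errors actually checked. The appendText helper and the temp file name are hypothetical, chosen just for this example, and the file is written to the system temp directory so the sketch can run anywhere.

```javascript
var fs = require('fs');
var os = require('os');
var path = require('path');

// Hypothetical helper: append text to a file, then call done(err).
// done receives null when everything worked.
function appendText(file, text, done) {
  fs.readFile(file, 'utf8', function(err, data) {
    if (err) return done(err);            // reading failed, stop here
    fs.writeFile(file, data + text, function(err) {
      done(err || null);                  // writing either failed or worked
    });
  });
}

// Set up a small file to work with (synchronously, for brevity).
var file = path.join(os.tmpdir(), 'async-example.txt');
fs.writeFileSync(file, 'some text');

appendText(file, ' and some more text', function(err) {
  if (err) throw err;
  console.log(fs.readFileSync(file, 'utf8')); // some text and some more text
});
```

    Checking err at every step is what keeps a failed read from quietly turning into a bad write.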

    Even More Fun with Async Functions! (+The PYRAMID OF DOOM)

    The above example is a simple, yet very useful usage of asynchronous functions. As you write more complex code, you may find yourself writing longer chains of callbacks. And your code might start looking a bit like this:

    asyncFunc1(stuff, function(err, data) {
      ...
      asyncFunc2(moreStuff, function(err, data) {
        ...
        asyncFunc3(evenMoreStuff, function(err, data) {
          ...
          asyncFunc4(almostDone, function(err, data) {
            ...
          });
        });
      });
    });
    

    And it can get longer! That’s a lot of indentation, and over time your code becomes less and less pretty/easy to read. As programmers we have a name for this phenomenon: the pyramid of doom. It’s unavoidable! Or is it?

    In this kind of situation, we can use Promises, a mechanism designed to simplify asynchronous code. I won’t be going into great detail on Promises in this post, but if you’re curious you can dive into a great Promise library like Bluebird or Q.
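
    Just to give a taste, here’s a sketch of how a four-step chain can be flattened. It uses the Promise object built into modern javascript rather than Bluebird or Q, and asyncStep is a made-up stand-in for any asynchronous function (it simply adds a label to an array after a short delay).

```javascript
// Hypothetical async step: resolves with the input plus a label
// after a short delay.
function asyncStep(label, input) {
  return new Promise(function(resolve) {
    setTimeout(function() {
      resolve(input.concat(label));
    }, 10);
  });
}

// Each .then waits for the step before it, but nothing nests.
var finished = asyncStep('one', [])
  .then(function(data) { return asyncStep('two', data); })
  .then(function(data) { return asyncStep('three', data); })
  .then(function(data) { return asyncStep('four', data); })
  .then(function(data) {
    console.log(data.join(' -> ')); // one -> two -> three -> four
    return data;
  })
  .catch(function(err) {
    // a single handler covers a failure in any step above
    console.error('a step failed:', err);
  });
```

    The steps still run one after another, but the code reads top to bottom instead of drifting off to the right.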

    Hopefully you now have a better idea of asynchronous functions and how to use them. Go forth and code!

    (Oh yeah, and here’s the output of that code from earlier:)

    When will this log???
    read file.
    edited file.
    we saved it!
    
  • Apr 14, 2015

    [HR] Week 2/3: Diving In

    It’s been a hectic, jam-packed, exhilarating two weeks at Hack Reactor. As our third week comes to an end it’s time for me to take a moment to lean back and meditate on the plethora of things I’ve learned.

    Hack Reactor’s rapid iteration teaching style takes some getting used to, and a side-effect of that is a warped sense of time. The level I was at on Monday is so drastically different from the level I’m at in this moment that it may as well have been a month ago. This seems to be a common feeling amongst my cohort.

    Last week I touched lightly on sprints, which are the focal point of HR’s initial six weeks: two-day coding projects designed to immerse you in a particular coding concept. In week one our sprints were interspersed with lectures but we’ve shifted away from this, allowing us much more time at the pairing stations exploring the code. Our sprint topics were also more challenging than last week, focusing on inheritance, algorithms and the D3 library in week 2 and plunging into servers and Backbone in week 3. Backbone was infamously difficult for many in the cohort to grasp, but by the end of the week I feel I have a much better understanding of the MVC model.

    The end of the week is relieving not only because we get a day off but also because we generally feel significantly more confident with coding than we did six days prior.

    My sprint rhythm thus far has generally been to move full-speed into the project in the first day, then finish basic requirements by morning of the second. This leaves me with plenty of time to review code or explore the extra credit sections (which are always included, and sometimes even have a ‘nightmare mode’). It’s a good system that allows for differing skill levels; some HR students already have a fair amount of CS experience while others began coding only half a year ago. I generally fall toward the middle of this spectrum.

    In week 2 we also began one of my favorite segments of the day, toy problems. Every morning is kicked off by a short problem similar to what we may be asked during a job interview. As a lover of puzzles I enjoy this single hour dedicated to a) discussing solutions and algorithms related to the previous day’s problem and b) hacking away at a brand new snippet of code. Of note was our latest problem, where we had to set every word in a paragraph to change to a random color every second. Maybe not the most practical exercise, but good practice for working with jQuery and DOM-related events.

    The atmosphere and sense of community at Hack Reactor is extremely welcoming, and likely contributes to everyone’s success and ability to absorb the material. Everybody is here for (more or less) the same reason, which is to become amazing javascript software engineers. Staff helps especially to foster positivity, and class shepherds are often walking around to check in with pairs. I’ve never once felt uncomfortable with the people here, which may or may not be raising my standards for future jobs.

    There are many aspects of student life to talk about and it’s difficult to list them all here, but just know that I have loved my experience and development as a programmer so far. Look out for my next check-in, where I stray away from the usual and tackle my first technical blog post!

  • Mar 30, 2015

    [HR] Week 1: First Impressions

    This post is a bit late; I meant to write it yesterday but I’ve been on a bit of a tight schedule this past week. The fact that I’m writing this on my train ride home is a good indicator of how much free time I’ve had.

    That said, my first week at Hack Reactor was amazing. I’m a Bay Area native but the magic of San Francisco isn’t lost on me. It’s one of my favorite cities and I love having the excuse to spend the majority of my time at the HR HQ on Market Street.

    Speaking of magic, one of my classmates described the Hack Reactor Week 1 experience perfectly: in her words, it feels “a lot like how Harry Potter must have felt when he first began attending Hogwarts.” I couldn’t have said it better! With how positive the atmosphere is I feel perfectly confident that I’ll be a coding wizard in a matter of weeks.

    Since it’s my first week I’ll try to detail what my daily routine has been like, and how I’ve been adjusting to the rigorous schedule here.

    I’m up bright and early around 6:30 to prepare for my hour-long commute into the city. I grew up in Fremont and take the BART into SF from there. Many of my classmates found housing in the city, but just as many are taking routes similar to mine. After all, rent in the city is notoriously high. So far, the commute hasn’t been much of a problem, and I actually really enjoy having an hour to mentally prepare/unwind (/write blog posts I should have already written).

    The first week is about 50% orientation lectures, getting to know each other and reviewing the Hack Reactor precourse material, so it isn’t quite representative of the rest of the program. Or so I hear from my seniors, AKA the cohort that started a mere six weeks before me. We share a floor with them, and in my first week I’ve had the opportunity to talk to many of them and absorb their wisdom about the program and coding in general. It’s encouraging to see how much they’ve learned in such a short period of time, and gives me a little preview of where I’m headed.

    The other half of our week was spent coding, as one would expect. While there are some instances where I’m coding on my own, the majority of the time I’m paired up with one of my classmates working on two-day long ‘sprints’. This is the real meat of the program, and as we move into week 2 we’re spending significantly more time working at our pairing stations. More on that in a later post.

    Our week 1 sprints consisted of reviewing precourse material (Monday/Tuesday) and data structures (Wednesday-Saturday). I’ve had prior experience working with basic data structures (albeit in Java rather than JS), but there was still a good deal for me to learn, in both technical and soft skills. Our class leads and lecturers have done a fantastic job of easing us into the basic structure of the course.

    Hack Reactor ends at 8 officially, though many stay after to either work on code or attend one of the many guest speaker events. I generally leave no later than 9 in order to get home in time to get some sleep. Lather, rinse, repeat. As the seniors say, I’m still in the honeymoon phase, but I’m really looking forward to exploring the curriculum and facing all the challenges along the way.

  • Mar 22, 2015

    The Journey Begins

    So tomorrow is a very big day for me. It marks my first day at Hack Reactor, a program for aspiring coders (both beginners and the more experienced) to develop the skills and knowledge to make it in the ever-growing tech field. It’s been a long few months since I was accepted back in November but the time crept up on me! I almost can’t believe I’ll be making the train ride into San Francisco tomorrow. Yikes!

    I set this blog up about a month ago but am only now making my first post. There are two big reasons behind this, the first being detailed above: I’m documenting my experiences and struggles at HR, which can’t really happen until there are experiences to document. Secondly, I spent a lot of time trying to think of the perfect first blog post. I’m new to blogging in addition to coding, but I figured out that there’s no such thing as a perfect first post. You just gotta do it and get it out. So here we are!

    Not much more to say for now. If you’re interested a bit more in my own personal story and how I got involved with all this, go ahead and check out the about section. I’ll be checking in soon!

Moriah Kreeger | Full-Stack Software Engineer | moar.riah@gmail.com