Thursday, February 9, 2012

Learning Python

Last September I participated in a one-day basic Introduction to Python for Women workshop sponsored by PyStarPhilly. It was good exposure, but not much of it stuck.

Fortunately, they decided to hold the workshop again (details here), and this time made it even longer: it ran Friday night and all day Saturday last weekend.

Friday night started with making sure the Python install I did on my machine last September was still OK, and that Notepad++ was up to date. It also meant making friends again with my cmd window, which I also use to run the Perl scripts at work. Whee, my cmd skills are rusty. But between work and this workshop, it's all surfacing from the deep recesses of my brain.

The rest of Friday can be summed up thusly: practice practice practice. The PyStar ladies have posted a series of exercises to practice the basics of Python on CodingBat. These exercises reinforce the concepts covered and force you to think through things.

Saturday morning began with completing Friday's CodingBat exercises and working through some Saturday morning review exercises. Practice practice practice!

A few things I learned while working on the exercises:

  • spaces within a line don't matter so much, although including spaces makes the code more readable
  • punctuation matters!
  • capitalization is noticed, so A =/= a, in other words, machines are very literal!

It is definitely sticking better this time. Between more coding stuff at work, plus Code Year, it's all starting to come together in my brain. I got farther with the exercises this time, and was able to reason through (or talk myself through) the spots where I got stuck.
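
The three observations in the list above can be seen in a few lines of Python. This is only a small sketch; the variable names are made up for illustration and are not from the workshop exercises.

```python
# Spaces inside a line are optional; both lines compute the same value.
# (Spaces at the START of a line are indentation, which Python does care about.)
total_tight = 2+3*4
total_spaced = 2 + 3 * 4

# Capitalization matters: A and a are two separate names.
A = "uppercase"
a = "lowercase"

# Punctuation matters: the quotes are what make this text rather than a name.
as_text = "a"   # the one-letter string
as_name = a     # the value stored in the variable a

print(total_tight == total_spaced)  # True
print(A == a)                       # False
print(as_text == as_name)           # False
```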

After a bit of warmup practice, we had a morning lecture. The goal of the lecture was for us to be able to write simple scripts on our own by the end of it. We reviewed the material in the morning exercises (equations, functions, etc.) and covered loops, while loops, input statements, dictionaries, more detailed functions, modules and packages. Also a brief description of libraries.

And then it was back to practice practice practice.

Another challenge I discovered: indentation. Where your indentation falls affects what your code returns, so you have to be careful about what you nest under other things versus what you want to happen at the end. Often I wrote the code correctly but indented it too far, tying it to something it shouldn't be tied to, which produced no result, not even an error message. At one point I nearly created a loop that would never end by indenting too far. Oops.
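
Here is a small sketch of that kind of indentation slip. The two functions are hypothetical examples, not the actual workshop exercises; they differ only in how far the return line is indented.

```python
def count_evens_intended(numbers):
    count = 0
    for n in numbers:
        if n % 2 == 0:
            count += 1
    return count  # runs once, after the loop has seen every number


def count_evens_overindented(numbers):
    count = 0
    for n in numbers:
        if n % 2 == 0:
            count += 1
        return count  # one level too far: runs during the first pass of the loop


print(count_evens_intended([1, 2, 4, 6]))      # 3
print(count_evens_overindented([1, 2, 4, 6]))  # 0
```

Python accepts both versions without complaint. The second one simply stops after looking at the first number, so the answer is wrong and there is no error message to point at the problem.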

After lunch we had a few lightning talks, where people shared various code they've written for projects. It's encouraging to see what people have built; even more so that I can read and understand the code they wrote! The talks involved editing images (adding text), predictive puzzle solving (the Tower of Hanoi puzzle), and a very cool project working with 7- to 13-year-old girls and the Tropo simple API. We also looked at a very neat page of data visualization and graphical plotting options using Python at matplotlib.sourceforge.net.

The talks also included some more learn-to-code resources. I'll be posting the new ones I learned about to the catcode wiki resources page.

After the lightning talks, they gave us a project to work on (in addition to the CodingBat exercises) that used Twitter's API to call recent tweets and trending topics. Challenging but so satisfying when you get it!


Monday, February 6, 2012

Why am I learning to code?

Why am I learning to code? I get asked this a fair amount these days. Why would a librarian want to learn to code? What good is it for my job?

I alluded to it in my first post on the catcode phenomenon: "Catalogers work with massive amounts of curated bibliographic data, and being able to manipulate it in new and different ways and in ever increasing amounts is key as we move forward into the bibliographic future and the world of linked data and the semantic web."

But that's a pretty generic statement. And doesn't explain much. I usually end up giving some examples of how coding would be (or already is) used in my current day to day work:

1. Editing of record sets prior to loading - beyond MarcEdit

All of our batch loads of MARC records for various collections go through a basic editing process. We have a standardized set of edits that all of our records go through to normalize certain elements and insert local-specific notes about access restrictions and our licenses. These edits are all combined in a script to facilitate the process. All sets are quickly run through the script as a first step of the loading process. Currently when something changes (like a change to MARC coding or the need to add something or add a conditional for another format such as images), I have to ask our systems folks to edit the script, I review it, we test it, and then it's put into regular use. If I could make the edits and test it myself the process would be much more efficient.

Often I have very large sets and potentially complicated edits that need to be made. MarcEdit has a script editor that allows me to make a series of basic conditional edits, but for more complicated things I'm still limited to asking our systems staff to write a script.

For example, our records from ProQuest for the online version of our dissertations come in with subject codes that are proprietary to ProQuest and their indexes. For these to be useful for our users for searching, they need to be mapped to valid Library of Congress Subject Headings. I can build a cross-walk in a spreadsheet for the mapping, but I need a script to run the records through a process to actually match on the codes in the records in the file and insert the LCSH heading.
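
A minimal sketch of that crosswalk step might look like the following. Everything in it is an assumption made for illustration: the codes and headings are invented (they are not real ProQuest-to-LCSH mappings), the crosswalk is assumed to be a spreadsheet saved as CSV with "code" and "heading" columns, and each record is a plain dictionary rather than a real MARC record.

```python
import csv
import io

# Stand-in for the crosswalk spreadsheet saved as CSV.
# The codes and headings are invented, not real ProQuest-to-LCSH mappings.
CROSSWALK_CSV = """code,heading
X001,Example heading one
X002,Example heading two
"""


def load_crosswalk(csv_text):
    """Return a dict mapping each vendor code to its subject heading."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["code"]: row["heading"] for row in reader}


def add_headings(record, crosswalk):
    """Return (copy of record with mapped headings added, codes with no match)."""
    headings = list(record.get("headings", []))
    unmapped = []
    for code in record["subject_codes"]:
        heading = crosswalk.get(code)
        if heading is None:
            unmapped.append(code)       # keep track of codes to review by hand
        elif heading not in headings:
            headings.append(heading)    # avoid inserting the same heading twice
    return dict(record, headings=headings), unmapped


crosswalk = load_crosswalk(CROSSWALK_CSV)
record = {"title": "A sample dissertation", "subject_codes": ["X001", "X999"]}
updated, unmapped = add_headings(record, crosswalk)
print(updated["headings"])  # ['Example heading one']
print(unmapped)             # ['X999']
```

Collecting the unmatched codes separately gives a list to review by hand, so gaps in the crosswalk are not silently skipped. A real script would also need to read and write MARC records rather than dictionaries.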

2. Reports from our system (or from any large file of records)

The only mechanism for getting a report from our current ILS (Voyager) is via script. There's no reporting interface. As with script editing, I have to pester systems folks to help me. And it can be a tedious process. I tell them what I need (being as clear and specific as possible within the bounds of how confusing MARC is and the myriad of exceptions/variations that exist). They then take my request, turn the query into a script (interpreting things in the process), run it, and send me the results for review. If we're lucky, we get it right the first time. More often than not, there are errors and exceptions, and quite a bit of back and forth before we get it right.

If I could gain some experience in coding, I could hopefully reduce the back and forth and write the exceptions and variations into my query request, essentially cutting out the middle man interpretation. Having a better understanding of what is possible to write into a query based on the data and how "clean" it is (or isn't) is invaluable.
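
As a rough sketch of what writing the exception into the request could look like, the snippet below lists records that lack a subject field (MARC tag 650) while skipping suppressed records. The record layout, the "suppressed" flag, and the function name are all hypothetical; this does not reflect how Voyager stores or reports on data.

```python
def records_missing_field(records, tag, skip_if=None):
    """Return the IDs of records lacking `tag`, ignoring any record skip_if flags."""
    missing = []
    for record in records:
        if skip_if is not None and skip_if(record):
            continue  # the exception: leave these records out of the report
        if not record["fields"].get(tag):
            missing.append(record["id"])  # tag absent, or present but empty
    return missing


# Hypothetical records: each is a dict of MARC tag -> list of field values.
records = [
    {"id": "r1", "suppressed": False, "fields": {"245": ["Title one"], "650": ["Subject"]}},
    {"id": "r2", "suppressed": False, "fields": {"245": ["Title two"]}},
    {"id": "r3", "suppressed": True, "fields": {"245": ["Title three"]}},
    {"id": "r4", "suppressed": False, "fields": {"245": ["Title four"], "650": []}},
]

report = records_missing_field(records, "650", skip_if=lambda r: r["suppressed"])
print(report)  # ['r2', 'r4']
```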

3. Batch editing of records already in the system

Just because something is already cataloged and loaded doesn't mean you can ignore it. Records require maintenance. Rules change, headings change, tags change, new fields and subfields are added, etc. etc. MARC as a standard is changing regularly. Names are updated in authority files, subjects are created, collapsed, divided, etc. And we have to keep our data up to date so it's actually useful to our users and so relationships are maintained. All of this updating requires editing a very large batch of data. It starts with getting an effective/accurate report of things that need updating, and then telling the system what needs to change. A big thing that would be useful to articulate more clearly is conditional edits...add this field but only if these things are true. But if your instructions to the system aren't clear you end up with a big mess that you have to undo and try again, which is time consuming and most likely also problematic for user access until you clean things up (or undo). Once you have a better understanding of how to write the instructions, the number of messes drops sharply. And less mess is always a good thing when dealing with data that potentially impacts user access.
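
A conditional edit of the "add this field but only if these things are true" kind could be sketched as below. The record layout, the condition functions, and the note text are assumptions for illustration only; a real batch edit would run against MARC records in the ILS. The sketch adds an access restriction note (tag 506) only to records that have an electronic location (tag 856) and no 506 yet.

```python
import copy


def add_field_if(record, tag, value, conditions):
    """Return a copy of record with value added under tag, if every condition holds."""
    if not all(condition(record) for condition in conditions):
        return record  # conditions not met: leave the record as it is
    updated = copy.deepcopy(record)  # never modify the original in place
    updated["fields"].setdefault(tag, []).append(value)
    return updated


def is_online(record):
    return "856" in record["fields"]


def lacks_note(record):
    return "506" not in record["fields"]


# Hypothetical records: each is a dict of MARC tag -> list of field values.
records = [
    {"id": "r1", "fields": {"245": ["Print book"]}},
    {"id": "r2", "fields": {"245": ["E-book"], "856": ["https://example.org/ebook"]}},
    {"id": "r3", "fields": {"245": ["E-journal"], "856": ["https://example.org/ejournal"],
                            "506": ["Existing note"]}},
]

note = "Access restricted to licensed users."
edited = [add_field_if(r, "506", note, [is_online, lacks_note]) for r in records]

# Dry run: list which records would change before committing anything.
changed = [new["id"] for old, new in zip(records, edited) if new != old]
print(changed)  # ['r2']
```

Because the edit works on copies, the list of changed records can be reviewed first, which is one way to catch unclear instructions before they turn into a mess that has to be undone.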

4. Loading of records in bulk

There is a set of profiles used for loading records into our ILS. Distilled down, they are a series of scripts dictating the modification and creation of bibliographic, holdings, and item records. The specifics of each load determine which profile is most appropriate. I don't have a good understanding of these profiles because my understanding of coding is still limited. I'm hoping learning to code will give me a better understanding so I can better advise which profile is most appropriate.

Here's a good example of why this would be useful. A few months ago we tried to load a set of records into our system. My colleague in systems and I were both unfamiliar with the profiles, so we picked the one we thought was correct. Well, given her unfamiliarity with MARC and library records, and my unfamiliarity with coding and with reading the profiles, we blundered and used the wrong one. We made a bit of a mess of things in our production catalog (hello error message displayed to the public) and had to fix things after the load, once we figured out what we did wrong, of course. Our blunder created quite a bit of extra work for both of us.

5. System design / user interface design / etc.


We're currently in the process of redesigning the public interface to our catalog (our OPAC). This means dealing with indexes and the underlying data in different ways. Having a good solid understanding of the data means I can explain what data is useful and what our data can and cannot do. It also means I can write functional specifications for the use/manipulation of the data so the coders can go to work. Having a solid understanding of how systems talk to each other and where data lives helps immensely. Learning to code helps me do all of that.



I think an overarching theme of all the above examples is communication. To be able to explain things to coders, and to understand the questions from coders, and explain things to catalogers, and non-catalogers, and effectively talk/explain my needs to the systems themselves, I need to have a better understanding of how it all works. Learning coding helps with that. It's like learning another language. If I can be even semi-fluent in coding, things will be much clearer for everyone involved in the conversation. Even if I don't become a full-fledged coder, the exposure I'm gaining from participating in Code Year and workshops on Python and attending project nights and talks about various coding languages and systems is already proving to be invaluable.