
To-do List

Current priorities


    • Get CIA running on the latest libraries. The Twisted code has already been updated to run on a recent release (Twisted 11?), but the Django front end needs a serious overhaul and update.
  • Usability improvements

    • Now that it's finally easy to get started with CIA and easy to administer your projects and bots, the biggest user-visible problem is the decrepit state of the actual stats pages. For a great example of how much the current UI sucks, see the KDE or GNOME pages.

      These pages are hard to navigate due to the clumsy 'catalog' with no search or pagination. The page layout does not put the most important information first. Top priorities should be:

      • Port the stats target page to the new site. This can probably be done in stages, starting with the stats pages themselves and moving the RSS feeds and per-message pages later. URLs must remain backward-compatible, though the new pages should make an effort to strip out unfriendly characters.

        • This has already started, at /stats-experimental
      • Search-based interface for locating stats targets, both on a site-wide level and within children of the current target.

        • Unified search. A single real-time search widget should be able to find child targets and messages.
      • Expand/collapse with automatic loading and purging

        • The stats page itself, on the client side, should be seen as a cache for server-side data. It is statically populated with a small default dataset, which also serves as static content for non-javascript browsers. The client needs to be able to load new data for date ranges that have just been expanded. It should also be able to purge data for date ranges that have not been opened recently.
      • Real-time updates. I was planning to write a lengthy section on how this should be implemented. It turns out that this has not only already been invented, it has already been given a corny name. See cometd and mod_mailbox.

      • Extended details about a message should be visible in-line. By default, the log message should be truncated and the file list summarized. An expander can reveal the full log message and file list.

      • Next-generation stats relations

        • Each visible message has an icon or text advertising each other stats target that received the same message. This allows each message to act as a link between projects, VCSes, authors...

          These links/icons can show up in the right margin of the stats page.

          This usage model supports the idea of adding an SQL index alongside the current message archives: each row would include a message ID (archive # and message offset), a timestamp, and a target ID.

          • Test the efficiency of such an SQL table
          • Implement BSAX in C, in order to quickly index existing messages.
        • In addition to messages, collapsed date headings should also show links to other stats targets. This will make it natural to see who was working on a particular project last month, for example.

          For efficiency, this will require a separate database table relating multi-resolution time periods (years, months, days) to stats targets. Each relation should include strength/freshness.
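
The message index proposed above (one row per delivery, holding the archive number, message offset, timestamp, and target ID) could be prototyped roughly as follows. This is only a sketch for testing the idea with SQLite: the table name, column names, and helper functions are all hypothetical and do not exist in CIA.

```python
import sqlite3


def create_index(conn):
    """Create the hypothetical message index table and its lookup index."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS message_index (
            archive_id INTEGER NOT NULL,  -- which message archive
            msg_offset INTEGER NOT NULL,  -- offset of the message in that archive
            timestamp  INTEGER NOT NULL,  -- delivery time, seconds since the epoch
            target_id  INTEGER NOT NULL,  -- stats target that received the message
            PRIMARY KEY (archive_id, msg_offset, target_id)
        )
        """
    )
    # Supports "recent messages for this target" queries.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_target_time "
        "ON message_index (target_id, timestamp)"
    )


def add_delivery(conn, archive_id, msg_offset, timestamp, target_id):
    """Record that one message was delivered to one stats target."""
    conn.execute(
        "INSERT OR IGNORE INTO message_index VALUES (?, ?, ?, ?)",
        (archive_id, msg_offset, timestamp, target_id),
    )


def related_targets(conn, archive_id, msg_offset, target_id):
    """Return the other targets that received the same message."""
    rows = conn.execute(
        "SELECT target_id FROM message_index "
        "WHERE archive_id = ? AND msg_offset = ? AND target_id != ? "
        "ORDER BY target_id",
        (archive_id, msg_offset, target_id),
    )
    return [row[0] for row in rows]
```

With a table like this, the icons next to a message come from a single primary-key lookup. Whether it stays fast at the size of the real archives is exactly what the efficiency test above would need to show.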

    • Sparklines and interactive graphs, produced with fidtool. Prerequisites:

      • We need a place to store the fidtool files. This probably means cleaning up the filesystem-based Stats storage a little.
      • Fidtool itself is fairly complete, but needs much testing.
      • Implement a frontend for fidtool. At first this can be in the form of static sparklines and interval queries; however, a draggable AJAX graph would be nice.
      • Use interval queries to build indexes of 'most active' and 'most recent' items. We can probably use the database to keep an upper bound on activity which is incremented on every commit. These upper bounds, after sorting, would be lowered by performing fidtool interval queries.
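
The upper-bound idea in the last item can be sketched as a lazy top-k search. Candidates are visited in order of their upper bound, and an exact count is only computed when a candidate reaches the front of the queue. In this sketch `exact_count` is a hypothetical callable standing in for a fidtool interval query, and the function name is made up. It assumes every exact count is less than or equal to its stored upper bound.

```python
import heapq


def most_active(upper_bounds, exact_count, k):
    """Return the k most active targets as (target, count) pairs.

    upper_bounds: dict mapping target name -> upper bound on its activity.
    exact_count:  callable(target) -> exact activity (hypothetical stand-in
                  for a fidtool interval query). Must never exceed the bound.
    """
    # Max-heap via negated values. The flag records whether the value is
    # still an upper bound (False) or an exact count (True).
    heap = [(-bound, target, False) for target, bound in upper_bounds.items()]
    heapq.heapify(heap)

    result = []
    while heap and len(result) < k:
        neg_value, target, is_exact = heapq.heappop(heap)
        if is_exact:
            # Nothing left in the heap can beat an exact value at the front.
            result.append((target, -neg_value))
        else:
            # Lower the bound to the exact count and put it back in line.
            heapq.heappush(heap, (-exact_count(target), target, True))
    return result
```

Targets whose upper bound is below the k-th exact count are never queried, which is the point of keeping the cheap bound in the database.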

New Web Interface

There is a new web interface under development, using Django. It will eventually replace the old web interface entirely, but it's currently only used for managing user accounts and assets (such as bots, projects, and authors).

  • Abuse prevention

    • The ability to revert changes from the change history page

    • More tools to easily close accounts, lock assets, and revert all of a user's changes

    • IP- and cookie-based bans. This will be a quick fix to silence the less persistent troublemakers.

    • IP- and login-based sandboxing. Let the malicious users log in and change settings, but keep those changes in a private sandbox.

    • Fix the IRC channel redirect bugs

    • Enforce network uniqueness. Currently, users may create multiple IRC "networks" which actually refer to the same physical network. This can happen by accident, or it might be abused maliciously to cause multiple CIA bots to join a single channel or to introduce bots into channels where they aren't wanted.

      This is a difficult problem to solve. Two servers may refer to the same network because they are multiple DNS names for the same IP, multiple IPs on the same server, or multiple servers linked together on a single IRC network.

      Various methods could be used to determine server uniqueness at network creation time:

      • Use the NETWORK= portion of the 005 (server capabilities) message. This is simple and straightforward, and it's supported by most networks I've tested. The biggest exception is Freenode.

        Networks that don't send a NETWORK= field will still work with CIA; they'll just have to be installed by an administrator first. The "add bot" page will conduct a network identification test when the user submits the page with a network of "Other...".

        Note that this means we'll no longer ask the user for a network description. Either it comes from the NETWORK= field, or we need to involve an admin.

        Some test code for this method is in tools/irc-tester.py

      • Create a 'fuzzy' hash of the network's identity by examining the names and topics of the most popular IRC channels on a server. This wouldn't be guaranteed against false positives, but false negatives (which can be abused) would never occur. This approach has the advantage of requiring no extra IRC connections aside from a single connect while adding new networks.

      • Look for other CIA bots, and verify the identity of those bots. This would be very reliable, but it requires changes to the bot daemon and it requires that we maintain at least one bot on all networks in order to perform these checks.

      Verifying networks at creation-time solves most problems, but it isn't quite a complete solution. Servers may at some later time switch networks. The result is that we may have to merge our own concept of a "network". This will be harder to solve, and will definitely require support from the bot daemon.
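
As an illustration of the first method, the NETWORK= token can be pulled out of the 005 lines a server sends after registration. This is a minimal sketch that works on already-received lines and is not the code in tools/irc-tester.py. The function name is hypothetical.

```python
def network_name(lines):
    """Return the NETWORK= value from a server's 005 replies, or None.

    lines: raw IRC lines received after connecting, for example
           ":irc.example.net 005 bot NETWORK=ExampleNet :are supported ..."
    """
    for line in lines:
        parts = line.strip().split(" ")
        # Drop the optional ":servername" prefix.
        if parts and parts[0].startswith(":"):
            parts = parts[1:]
        if not parts or parts[0] != "005":
            continue
        # parts[1] is our own nickname; capability tokens follow it and
        # end at the trailing ":are supported by this server" text.
        for token in parts[2:]:
            if token.startswith(":"):
                break
            key, _, value = token.partition("=")
            if key == "NETWORK" and value:
                return value
    return None
```

A return value of None corresponds to the case described above where an administrator has to install the network by hand.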

    • Require the consent of IRC channel operators in order to install new IRC bots. In theory this is a fairly simple thing to do, but it requires changes in the bot daemon, which makes it fairly difficult at the moment. See below.

  • Ruleset editor improvements

    • Automatically size the ruleset editor textarea to fit the screen?
    • Ruleset indentation (automatic and manual, including support for the Tab key)
    • Ruleset error/syntax highlighting using CodePress (which apparently does indeed work on Safari)
  • Image uploader improvements

    • Support more image formats. In particular, the current uploader doesn't handle interlaced PNG.
    • Optimize the resulting PNG files. Look into using optipng or pngcrush
    • Ditch PIL's buggy thumbnailing code (and all my workarounds)
  • IRC operator administration panel to allow IRC operators to control how the bots connect to their network

    • Operators should be verified using SearchIRC-style oper verification. The user changes to a nickname given by the bot core, and the core waits 5 minutes for the user to get around to renaming. The core then dispatches a "verification" bot, which only connects and sends a WHOIS query for that nickname. The WHOIS reply should include the 313 numeric (RPL_WHOISOPERATOR) for access to be allowed.
    • Verified operators can take over any bot for any channel connected to their network. This means that if a bot is spamming a channel, operators can remove the bot or even prevent it from rejoining the channel until the dispute has been resolved.
    • Operators can also prevent bots from connecting to the listed network entirely (instead of G-Lining them). This allows the bot server to state something like "We're sorry but this network has been closed for connections from CIA, please contact a network administrator".
    • Possible server linking? Make the bot core support many common IRCd protocols (such as InspIRCd and Hybrid) to allow CIA to 'link' as a virtual service, eliminating hundreds of connections and work-arounds for the bots to connect on both the CIA side and the network side. This would require a large rewrite of most of the bot core (which needs one anyway) or a separate bot daemon entirely.
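
The final step of the oper verification described above, checking the WHOIS reply for the 313 numeric, could look roughly like this. It is a sketch under simplifying assumptions: the function name is hypothetical, it works on lines the verification bot has already received, it does not model the 5-minute wait, and it compares nicknames with plain lowercasing rather than the server's advertised case mapping.

```python
def is_verified_oper(lines, nickname):
    """Return True if the WHOIS reply for nickname contains numeric 313.

    lines: raw IRC lines received after sending "WHOIS <nickname>".
    Scanning stops at 318 (end of WHOIS) for that nickname.
    """
    wanted = nickname.lower()
    for line in lines:
        parts = line.strip().split(" ")
        # Drop the optional ":servername" prefix.
        if parts and parts[0].startswith(":"):
            parts = parts[1:]
        # Expected shape: <numeric> <our nick> <queried nick> ...
        if len(parts) < 3:
            continue
        numeric, target = parts[0], parts[2]
        if target.lower() != wanted:
            continue
        if numeric == "313":  # RPL_WHOISOPERATOR
            return True
        if numeric == "318":  # RPL_ENDOFWHOIS: no 313 was seen
            return False
    return False
```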

New Syndication Methods

  • XMPP publish/subscribe
  • Instant Messaging
  • cometd

Deep Changes

  • Separate the HTTP frontend from the message delivery engine. The message delivery engine itself should be a separate daemon, preferably something that can be scaled well. This means more efficient algorithms for ruleset matching, and seamless interaction between multiple message delivery hosts. This daemon has been codenamed Esquilax.
  • The IRC bot server is another scalability and reliability bottleneck. I'd like the entire bot server to be a "disposable" daemon, which keeps all its connection state and the sockets itself in a separate simple process. The codename for this socket broker daemon is Caterpillar.

The wiki pages linked above are really just my own notes on the subject; they are a little disorganized and subject to change. If you'd like to contribute to any of these projects or you're curious how they work, please contact Micah directly.


  • Clean up the versioning in messages so the semantics are clear. Arch revisions like rhythmbox-devel@gnome.org--2004/rhythmbox--main--0.7--patch-17 and subversion revisions like r3962 could be represented using <version style="arch"> and <version style="svn"> tags, respectively. CVS could include a tag like <version style="cvs">1.1.23</version> inside each <file>. BitKeeper would probably use something similar to svn's revisions.
  • Make it possible for modular formatters to treat the files prefix and the rest of the file summary differently
  • Add message delivery information to the XML-RPC hub.deliver() frontend. It should add a header in <generator> indicating the IP, hostname, and/or user agent of the sender. This can be used by rulesets like stats://host.
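
To make the proposed versioning tags concrete, here is a small sketch that builds and reads `<version style="...">` elements with Python's standard ElementTree module. Only the tag and attribute names come from the proposal above. The helper names are hypothetical, and nothing here reflects the current message format.

```python
import xml.etree.ElementTree as ET


def version_element(style, text):
    """Build a <version style="..."> element as proposed above."""
    element = ET.Element("version", {"style": style})
    element.text = text
    return element


def describe_version(element):
    """Return a (style, version string) pair for a <version> element."""
    return element.get("style"), element.text


svn = version_element("svn", "r3962")
print(ET.tostring(svn, encoding="unicode"))

# For CVS, the proposal puts one <version> inside each <file>.
cvs_file = ET.Element("file")
cvs_file.append(version_element("cvs", "1.1.23"))
print(describe_version(cvs_file.find("version")))
```

With an explicit style attribute, a formatter can decide how to display each revision without guessing from the shape of the string.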


  • Add some more info on the XML-RPC interface. There's a little API Documentation but it's incomplete.
  • Document the XML message format, for client script maintainers.


  • Bots currently rejoin when kicked from a channel. Instead, they need to remove or disable their irc:// ruleset.

    This will become important once the web interface for IRC ruleset editing is written: if someone invites a CIA bot into a channel where it isn't wanted, the channel's ops need to be able to ask it to leave. As it is now, the botnet tries a little too hard and the bot will just immediately rejoin.

  • Track bandwidth averages per-request and per-bot. This could be used to load balance the bots by rearranging requests among them, and it would make a nice indicator for the IRC status page.

  • Support SSL connections to IRC servers. Some (mostly-private) networks only allow SSL connections.
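
For the bandwidth tracking item above, one simple option is an exponentially weighted moving average per bot, with new requests going to the bot that has the lowest average. This is only a sketch of that approach and not existing CIA code. The class name, method names, and the smoothing factor `alpha` are all assumptions.

```python
class BandwidthTracker:
    """Keep a moving average of bandwidth per bot (hypothetical helper)."""

    def __init__(self, alpha=0.2):
        # alpha near 1 reacts quickly to new samples; near 0 smooths more.
        self.alpha = alpha
        self.averages = {}

    def record(self, bot, bytes_per_second):
        """Fold one bandwidth sample into the bot's running average."""
        previous = self.averages.get(bot)
        if previous is None:
            self.averages[bot] = float(bytes_per_second)
        else:
            self.averages[bot] = (
                self.alpha * bytes_per_second + (1 - self.alpha) * previous
            )

    def least_loaded(self):
        """Return the bot with the lowest average, for placing a new request."""
        return min(self.averages, key=self.averages.get)
```

The same class could be keyed by request instead of by bot to get the per-request averages.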